
streaming_perf #52

Merged
merged 100 commits into master from wip/tests/tpch on Dec 11, 2015

Conversation

ymahajan
Contributor

No description provided.

ymahajan and others added 30 commits October 26, 2015 17:53
Adding StreamingSnappyContext, SchemaDStream, window clause in CQs
Adding DDLs to create stream tables for sockets, file and kafka sources
…commons into wip/streaming

Conflicts:
	snappy-core/build.gradle
	snappy-core/src/main/scala/org/apache/spark/sql/snappyParsers.scala
	snappy-core/src/main/scala/org/apache/spark/sql/streaming/SchemaDStream.scala
	snappy-core/src/main/scala/org/apache/spark/sql/streaming/StreamingSnappyContext.scala
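
To make the features in the commits above concrete, here is a minimal sketch of how the pieces might fit together. StreamingSnappyContext, SchemaDStream, stream-table DDLs and the window clause are all named in this PR; the factory call, the exact DDL syntax, the registerCQ method and the option names below are assumptions, not the reviewed code:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Boilerplate streaming context; any local two-core master works for a sketch.
    val conf = new SparkConf().setAppName("cq-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    // StreamingSnappyContext comes from this PR; the factory call is assumed.
    val ssnc = StreamingSnappyContext(ssc)

    // Declare a stream table over a socket source (DDL syntax and options assumed):
    ssnc.sql(
      """CREATE STREAM TABLE tweets (id LONG, text STRING)
        |USING socket_stream
        |OPTIONS (hostname 'localhost', port '9999')""".stripMargin)

    // A continuous query using a window clause over the stream table,
    // returning a SchemaDStream (registerCQ is assumed):
    val cq = ssnc.registerCQ(
      "SELECT text FROM tweets WINDOW (DURATION '5' SECONDS, SLIDE '5' SECONDS)")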
Throw exception when truncate/drop stream tables
… and unit tests

Enhanced Twitter quickstart to use twitter4j.StatusJSONImpl class
Added a JSON parser to parse the tweet JSON
Cleaned up code and added docs to key classes
Conflicts:
	snappy-core/src/main/scala/org/apache/spark/sql/SnappyContext.scala
a) A new join operator, LocalJoin, is added to perform a replicated table join with either a replicated or a partitioned table.
This join mimics broadcast join: instead of taking the build side from a broadcast relation, we iterate over the single partition of the replicated relation.
A relation can declare itself replicated by implementing PartitionedDataSourceScan and setting numPartitions to 1.
A new RDD, NarrowPartitionsRDD, is used to execute both the build-side and stream-side RDDs. The stream side is iterated for all of its partitions, while
 the build side, which has a single partition, is iterated once for each stream-side partition.
NarrowPartitionsRDD takes care of preferred locations based on the common node.

b) For partition-to-partition joins Spark always shuffles if the relations come from DataSources, as PhysicalRDD does not have a partitioner.
   We added a new physical plan, PartitionedPhysicalRDD, which has a partitioner based on the partitioning column.
   If the join is on the same columns as the partition columns of the underlying store, we can avoid the shuffle and do a
   partition-to-partition join. Thankfully ZipPartitionRDD, which is used by both merge join and shuffled join, takes care of the preferred locations.

 c) I have tested this for equijoins, but not LeftSemiJoin.
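
As a sketch of the contract described in (a): a relation opts into LocalJoin by reporting a single partition. PartitionedDataSourceScan and numPartitions are named above; the exact method signatures here are assumptions:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Row

    // Assumed shape of the trait named above: a data-source scan that exposes
    // its partitioning so the planner can choose LocalJoin.
    trait PartitionedDataSourceScan {
      def numPartitions: Int            // 1 => replicated; LocalJoin applies
      def partitionColumns: Seq[String] // used for partition-to-partition joins
      def buildScan(): RDD[Row]
    }

    // A replicated relation reports one partition; the planner then iterates
    // that single partition once per stream-side partition instead of
    // broadcasting it.
    class ReplicatedRelation(data: RDD[Row]) extends PartitionedDataSourceScan {
      override def numPartitions: Int = 1
      override def partitionColumns: Seq[String] = Nil
      override def buildScan(): RDD[Row] = data.coalesce(1)
    }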
Joining stream to static tables
Added sql method to StreamingSnappyContext to route SQL queries
Conflicts:
	snappy-core/src/main/scala/org/apache/spark/sql/sources/errorEstimates.scala
Store optimization is now the default. Column store integration can only be done after Suranjan's checkin
Conflicts:
	snappy-core/src/main/scala/org/apache/spark/sql/SnappyContext.scala
@@ -65,6 +65,7 @@ object StreamingInputWithLoadData extends Serializable {
val ingestionStream = stream.window(Seconds(5), Seconds(5))


import org.apache.spark.sql.streaming.snappy._
Contributor

Remove it.

@hbhanawat
Contributor

First round of review done. I have only reviewed the streaming changes in this PR, no other changes. I have also not reviewed tests. There are a few things that we need immediately for quickstart testing:

  1. A foreachDataframe function.
  2. A SQL way to insert into gemxd, as mentioned in the spec.
  3. Clean up the API to insert into gemxd.

@ymahajan
Contributor Author

ymahajan commented Dec 8, 2015

We are using the following in the foreachRDD code instead of foreachDataFrame:
val df = ssnc.createDataFrame(rdd, tableStream.schema)
That should be sufficient for now.

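
For reference, a minimal sketch of that workaround, assuming ssnc is the StreamingSnappyContext and tableStream is a SchemaDStream exposing its schema; the final write call is an assumption:

    // Until foreachDataFrame exists, build a DataFrame from each micro-batch
    // RDD inside foreachRDD; createDataFrame(rdd, schema) is standard
    // SQLContext API.
    tableStream.foreachRDD { rdd =>
      val df = ssnc.createDataFrame(rdd, tableStream.schema)
      // Use the DataFrame as usual, e.g. persist it (table name hypothetical):
      df.write.insertInto("rawTweets")
    }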

@hbhanawat
Contributor

We are using following in foreachRDD code instead for foreachDataFrame val df = snc.createDataFrame(rdd, tableStream.schema) That should be sufficient for now.

Ok. But since this was part of the spec, we need to get it reviewed/accepted by Jags, Sumedh and team.

rmishra and others added 3 commits December 10, 2015 13:52
2) Modified stream-related tests to honour the cleanup
Fixed small issue of case in ExternalShellDunit
DirectKafka is a separate DDL/Relation/Source now.
Cleaned up tests; combined all streaming tests in StreamingSuite
@ymahajan
Contributor Author

foreachDataFrame is not yet implemented; will file a JIRA to track this story.
Currently we are using the following in foreachRDD as a workaround:
val df = ssnc.createDataFrame(rdd, tableStream.schema)

On Fri, Dec 4, 2015 at 3:59 PM, hbhanawat notifications@github.com wrote:

General comments:foreachDataFrame which was mentioned in Spec is not
implemented?



@ymahajan
Contributor Author

buildScan is not implemented yet; will track it as a story. We are using the
df.save APIs.

On Fri, Dec 4, 2015 at 3:59 PM, hbhanawat notifications@github.com wrote:

General comments:We have not implemented a buildscan on our stream?



@ymahajan
Contributor Author

We need to see how this will be supported from the CLI; it works with a Scala
program though.

On Fri, Dec 4, 2015 at 4:00 PM, hbhanawat notifications@github.com wrote:

General comments:Once a stream table is created using sql, how do I insert
my stream table into gem xd using a sql command?



@hbhanawat
Contributor

Remove dat and log files in the checkin.

@hbhanawat hbhanawat closed this Dec 10, 2015
@hbhanawat
Contributor

Sorry I closed this by mistake. Reopening it.

@hbhanawat hbhanawat reopened this Dec 10, 2015
ymahajan and others added 12 commits December 10, 2015 21:10
+ adding provision to pass a closure for SparkConf additions
+ StreamToRow now returns Seq[InternalRow]
+ minor refactoring
Conflicts:
	snappy-core/src/main/scala/org/apache/spark/sql/SnappyContext.scala
	snappy-core/src/main/scala/org/apache/spark/sql/columnar/CacheBatchHolder.scala
	snappy-core/src/main/scala/org/apache/spark/sql/columnar/ExternalStoreUtils.scala
	snappy-core/src/main/scala/org/apache/spark/sql/columnar/JDBCAppendableRelation.scala
	snappy-core/src/main/scala/org/apache/spark/sql/row/JDBCMutableRelation.scala
	snappy-core/src/main/scala/org/apache/spark/sql/snappyParsers.scala
	snappy-core/src/main/scala/org/apache/spark/sql/store/ExternalStore.scala
	snappy-core/src/main/scala/org/apache/spark/sql/store/JDBCSourceAsStore.scala
	snappy-core/src/test/scala/io/snappydata/SnappyFunSuite.scala
	snappy-core/src/test/scala/io/snappydata/core/LocalTestData.scala
	snappy-spark
	snappy-tools/src/main/scala/org/apache/spark/sql/columntable/ColumnFormatRelation.scala
	snappy-tools/src/main/scala/org/apache/spark/sql/store/StoreInitRDD.scala
	snappy-tools/src/main/scala/org/apache/spark/sql/store/StoreUtils.scala
	snappy-tools/src/main/scala/org/apache/spark/sql/store/impl/JDBCSourceAsColumnarStore.scala
	snappy-tools/src/test/scala/org/apache/spark/sql/store/ColumnTableTest.scala
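
A short sketch of what the StreamToRow change in the bullets above implies; the trait name comes from the commit message, while the method signature is an assumption:

    import org.apache.spark.sql.catalyst.InternalRow

    // One raw stream message can now expand into several rows, hence the
    // return type Seq[InternalRow] rather than a single InternalRow.
    trait StreamToRow extends Serializable {
      def toRows(message: Any): Seq[InternalRow]
    }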
…-commons into wip/tests/tpch

Conflicts:
	snappy-core/src/test/scala/io/snappydata/SnappyFunSuite.scala
…ninng needs to be worked

to make it enabled again.
Conflicts:
	snappy-spark
rishitesh added a commit that referenced this pull request Dec 11, 2015
@rishitesh rishitesh merged commit f175da2 into master Dec 11, 2015
@sumwale sumwale deleted the wip/tests/tpch branch January 19, 2016 19:49