streaming_perf #52
Conversation
Adding StreamingSnappyContext, SchemaDStream, and a window clause in CQs. Adding DDLs to create stream tables for socket, file, and Kafka sources.
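As a hedged illustration of the DDL this commit describes (the `USING` provider name and `OPTIONS` keys below are assumptions, not necessarily the exact syntax in this PR), a socket stream table might be created like this:

```scala
// Illustrative only: create a stream table over a socket source via DDL.
// snsc is the StreamingSnappyContext this PR adds, assumed in scope;
// the provider name and option keys are guesses from the commit message.
snsc.sql(
  """CREATE STREAM TABLE socketTweets (id long, text string)
    |USING socket_stream
    |OPTIONS (hostname 'localhost',
    |         port '9999',
    |         rowConverter 'io.snappydata.app.TweetToRowsConverter')
  """.stripMargin)
```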
…commons into wip/streaming

Conflicts:
snappy-core/build.gradle
snappy-core/src/main/scala/org/apache/spark/sql/snappyParsers.scala
snappy-core/src/main/scala/org/apache/spark/sql/streaming/SchemaDStream.scala
snappy-core/src/main/scala/org/apache/spark/sql/streaming/StreamingSnappyContext.scala
Cleaned up Kafka configuration APIs
…ith latest upstream master
Throw an exception when truncating/dropping stream tables
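A minimal sketch of such a guard, with hypothetical class and method names (not the PR's actual code):

```scala
// Hypothetical guard: stream tables refuse TRUNCATE and DROP requests
// coming through the regular table code path.
abstract class StreamTableRelation(table: String) {
  def truncate(): Unit = throw new UnsupportedOperationException(
    s"TRUNCATE is not supported for stream table $table")

  def destroy(): Unit = throw new UnsupportedOperationException(
    s"DROP is not supported for stream table $table")
}
```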
… and unit tests. Enhanced the Twitter quickstart to use the twitter4j.StatusJSONImpl class. Added a JSON parser to parse the tweet JSON. Cleaned up code and added docs to key classes.
Conflicts: snappy-core/src/main/scala/org/apache/spark/sql/SnappyContext.scala
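For the quickstart change above, parsing raw tweet JSON into a twitter4j Status (backed by StatusJSONImpl) could look roughly like this; TwitterObjectFactory is the public twitter4j 4.x entry point, and the field selection is illustrative:

```scala
import twitter4j.TwitterObjectFactory

// Parse raw tweet JSON into a twitter4j Status (a StatusJSONImpl under
// the hood) and pull out a couple of fields. Illustrative sketch only.
def tweetToFields(json: String): (Long, String) = {
  val status = TwitterObjectFactory.createStatus(json)
  (status.getId, status.getText)
}
```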
a) A new join operator, LocalJoin, is added to perform a replicated table join with either a replicated or a partitioned table. This join mimics a broadcast join: instead of taking the build side from a Broadcast relation, we iterate over the single partition of the replicated relation. A relation can declare itself replicated by implementing PartitionedDataSourceScan and defining numPartitions as 1. A new RDD, NarrowPartitionsRDD, is used to execute both the build-side and stream-side RDDs. The stream side is iterated for all partitions, while the build side, which has a single partition, is iterated for each stream-side partition. NarrowPartitionsRDD takes care of preferred locations based on the common node.

b) For a partition-to-partition join, Spark always shuffles if the relations are from DataSources, as PhysicalRDD does not have a partitioner. We added a new physical plan, PartitionedPhysicalRDD, which has a partitioner based on the partitioning column. If the join is on the same columns as the partitioning columns of the underlying store, we can avoid the shuffle and do a partition-to-partition join. Thankfully ZipPartitionRDD, which is used by both merge join and shuffled join, takes care of the preferred locations.

c) I have tested this for equijoins but not LeftSemiJoin.
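A sketch of how a relation declares itself replicated under (a), assuming a trait shaped roughly like the PartitionedDataSourceScan described above (only numPartitions is confirmed by the text; the other member is a guess):

```scala
// Assumed shape of the trait described in (a); only numPartitions is
// confirmed above, partitionColumns is illustrative.
trait PartitionedDataSourceScan {
  def numPartitions: Int
  def partitionColumns: Seq[String]
}

// A relation marks itself replicated by reporting a single partition,
// letting LocalJoin iterate that one partition as the build side
// instead of broadcasting it.
class ReplicatedTableScan extends PartitionedDataSourceScan {
  override def numPartitions: Int = 1
  override def partitionColumns: Seq[String] = Nil
}
```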
Joining streams to static tables. Added an sql method to StreamingSnappyContext to route SQL queries.
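A hedged example of the routing this enables (the construction call, table names, and the window-clause syntax are assumptions based on this PR's description):

```scala
// Illustrative only: route a continuous query through the sql() method
// on StreamingSnappyContext, joining a stream table against a static
// reference table. Names and window syntax are assumptions.
val snsc = StreamingSnappyContext(streamingContext)
val joined = snsc.sql(
  """SELECT s.symbol, s.price, r.companyName
    |FROM stockStream WINDOW (DURATION '5' SECONDS, SLIDE '5' SECONDS) s
    |JOIN companyTable r ON s.symbol = r.symbol
  """.stripMargin)
```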
Conflicts: snappy-core/src/main/scala/org/apache/spark/sql/sources/errorEstimates.scala
Store optimization is now the default. Column store integration can only be done after Suranjan's check-in.
Conflicts: snappy-core/src/main/scala/org/apache/spark/sql/SnappyContext.scala
@@ -65,6 +65,7 @@ object StreamingInputWithLoadData extends Serializable {
val ingestionStream = stream.window(Seconds(5), Seconds(5))
import org.apache.spark.sql.streaming.snappy._
Remove it.
First round of review done. I have only reviewed the streaming changes in this PR, no other changes, and I have not reviewed tests. There are a few things that we need immediately for quickstart testing:
We are using the following foreachRDD code instead of foreachDataFrame.
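The snippet from the original email reply isn't shown here, so this is a hedged reconstruction of the foreachRDD pattern being described; the stream, context, schema, and table names are invented:

```scala
// Illustrative foreachRDD workaround until foreachDataFrame exists:
// convert each micro-batch RDD to a DataFrame by hand and ingest it.
// ingestionStream: DStream[Row], snappyContext and tweetSchema assumed
// to be in scope.
ingestionStream.foreachRDD { rdd =>
  val df = snappyContext.createDataFrame(rdd, tweetSchema)
  df.write.insertInto("tweetsTable")
}
```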
Ok. But since this was part of the spec, we need to get it reviewed/accepted by Jags, Sumedh, and the team.
2) Modified stream-related tests to honour the cleanup
Fixed a small casing issue in ExternalShellDunit
DirectKafka is a separate DDL/Relation/Source now. Cleaned up tests; combined all streaming tests into StreamingSuite.
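A hedged sketch of what a DDL for the separated DirectKafka source might look like (the `USING` name and option keys are assumptions, not confirmed by this PR):

```scala
// Illustrative only: the provider name and option keys below are guesses.
// snsc is the StreamingSnappyContext, assumed in scope.
snsc.sql(
  """CREATE STREAM TABLE kafkaTweets (id long, text string)
    |USING directkafka_stream
    |OPTIONS (topics 'tweets',
    |         kafkaParams 'metadata.broker.list->localhost:9092',
    |         rowConverter 'io.snappydata.app.TweetToRowsConverter')
  """.stripMargin)
```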
foreachDataFrame is not yet implemented; will file a JIRA to track this.
buildScan is not implemented yet; will track it as a story. We are using …
We need to see how this will be supported from the CLI; it works with Scala.
Remove the dat and log files from the check-in.
Sorry, I closed this by mistake. Reopening it.
+ adding a provision to pass a closure for SparkConf additions
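A small sketch of the kind of provision described; the harness function and its usage here are hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical harness: callers pass a closure to add SparkConf settings
// without the harness hard-coding them.
def startSpark(confAdditions: SparkConf => SparkConf): SparkContext = {
  val conf = confAdditions(new SparkConf().setAppName("snappy-test"))
  new SparkContext(conf)
}

// Usage: tweak the conf at the call site via the closure.
val sc = startSpark(_.set("spark.ui.enabled", "false"))
```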
+ StreamToRow now returns Seq[InternalRow] + minor refactoring
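The shape implied by that change, sketched with assumed trait and method names (only the Seq[InternalRow] return type comes from the commit message):

```scala
import org.apache.spark.sql.catalyst.InternalRow

// Assumed converter shape: one incoming stream message may now expand
// into zero or more internal rows instead of exactly one.
trait StreamToRow[T] {
  def toRows(message: T): Seq[InternalRow]
}
```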
…-commons into wip/tests/tpch
Conflicts:
snappy-core/src/main/scala/org/apache/spark/sql/SnappyContext.scala
snappy-core/src/main/scala/org/apache/spark/sql/columnar/CacheBatchHolder.scala
snappy-core/src/main/scala/org/apache/spark/sql/columnar/ExternalStoreUtils.scala
snappy-core/src/main/scala/org/apache/spark/sql/columnar/JDBCAppendableRelation.scala
snappy-core/src/main/scala/org/apache/spark/sql/row/JDBCMutableRelation.scala
snappy-core/src/main/scala/org/apache/spark/sql/snappyParsers.scala
snappy-core/src/main/scala/org/apache/spark/sql/store/ExternalStore.scala
snappy-core/src/main/scala/org/apache/spark/sql/store/JDBCSourceAsStore.scala
snappy-core/src/test/scala/io/snappydata/SnappyFunSuite.scala
snappy-core/src/test/scala/io/snappydata/core/LocalTestData.scala
snappy-spark
snappy-tools/src/main/scala/org/apache/spark/sql/columntable/ColumnFormatRelation.scala
snappy-tools/src/main/scala/org/apache/spark/sql/store/StoreInitRDD.scala
snappy-tools/src/main/scala/org/apache/spark/sql/store/StoreUtils.scala
snappy-tools/src/main/scala/org/apache/spark/sql/store/impl/JDBCSourceAsColumnarStore.scala
snappy-tools/src/test/scala/org/apache/spark/sql/store/ColumnTableTest.scala
…-commons into wip/tests/tpch

Conflicts:
snappy-core/src/test/scala/io/snappydata/SnappyFunSuite.scala
…ninng needs to be worked on to enable it again.
Conflicts: snappy-spark
No description provided.