Replace Cassandra sink with Kafka sink #54
Conversation
jcjimenez
left a comment
LGTM with minor question.
      insertion_time: Long
    )

    case class Stream(
Don't we still need this Stream class (and maybe TrustedSource below) to set up our pipeline? Or maybe a dupe of this class exists outside the com.microsoft.partnercatalyst.fortis.spark.sinks.cassandra package? (If not, I would propose moving it to someplace like com.microsoft.partnercatalyst.fortis.spark.CassandraSchema.)
This wasn't referenced anywhere, and with the schema changes I'm not sure how useful it would still be. If @kevinhartman's work needs this, it'll come out in the merge when the classes can be added back.
@jcjimenez I'll add them elsewhere as you've proposed once they're needed. Thanks for pointing this out.
kevinhartman
left a comment
LGTM.
It makes sense to create separate Spark job(s) to process results from our ingestion/analysis, so that we (and others) can build downstream components such as ML predictors without coupling that logic to ingestion.
| "type": "string" | ||
| } | ||
| }, | ||
| "sentimens": { |
"sentiments"
Thanks, see #56
    # project-fortis-ingestion

    - A repository for all spark jobs running on fortis
    + A repository for Project Fortis' data ingestion Spark jobs.
nit: Project Fortis's
Doh, I always get this one wrong :( See #56
@c-w A question related to this PR came up during today's standup, which can wait until tomorrow for an answer... What are the drawbacks of converting the …
@erikschlegel From my understanding, DataFrames and Structured Streaming are orthogonal concerns. The former is a way to represent data inside Spark; the latter is a way to process data by event time as opposed to batch time. It's merely coincidental that they both expose a SQL API. There's a section in the Structured Streaming docs that talks a bit about the difference and about the currently available ways to create a structured stream (which for now is just Kafka or the file system).
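For reference, creating a structured stream from a Kafka source looks roughly like this (a sketch using the spark-sql-kafka connector; the broker address and topic name are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("FortisAggregation").getOrCreate()

// Read a Kafka topic as an unbounded DataFrame; each row carries
// key, value, topic, partition, offset, and timestamp columns.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // hypothetical broker
  .option("subscribe", "fortis-events")             // hypothetical topic
  .load()
  .selectExpr("CAST(value AS STRING) AS json")
```

The resulting `events` DataFrame can then be queried with the same SQL-style API as a batch DataFrame, which is exactly the coincidence described above.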
@c-w I appreciate you putting this together, but I'm still not convinced that we need the dev-ops overhead of introducing Kafka into the pipeline. We can transform the RDD collection of …
@erikschlegel We talked about this in detail last week. We need Structured Streaming to deal with stragglers when aggregating the streams into time windows: events can arrive after the time window they belong to has already been aggregated. Spark DataFrames won't help us with this issue (sample reference 1, sample reference 2). The only way to solve the straggler problem in traditional Spark is to make the batch sizes big enough that you hope to capture all the variance in processing time within them, which is not a reliable approach and should not be used if we want even semi-reliable aggregation results. DataFrames and Structured Streaming are really unrelated; they just share the same API. DataFrames provide a SQL-like query interface over a set of Spark RDDs, while Structured Streaming is a way to aggregate data by event time and automatically re-run the aggregation when stragglers arrive.
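To make the straggler handling concrete, a Structured Streaming aggregation can declare a watermark so that late events are still folded into the correct event-time window (a sketch; the column names and durations are illustrative, not our actual schema):

```scala
import org.apache.spark.sql.functions.window

// Assuming `events` is a streaming DataFrame with an event-time
// column `eventTime` and a `keyword` column.
import spark.implicits._

val counts = events
  .withWatermark("eventTime", "10 minutes")  // accept events up to 10 minutes late
  .groupBy(window($"eventTime", "5 minutes"), $"keyword")
  .count()
// When a straggler arrives within the watermark, Spark updates the
// affected window's count instead of dropping or mis-bucketing the event.
```

This is the piece that batch-style DataFrame processing cannot give us: the re-run of the aggregation happens automatically per window, rather than depending on batch sizes being "big enough".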
Agreed with you on using Structured Streaming. We can use Structured Streaming through Event Hub (which is a managed Azure service) or by just converting each RDD into a DataFrame. My earlier point was to better understand the benefits of Kafka as opposed to the other two approaches mentioned above.
I didn't know that Event Hub offers Structured Streaming; thanks for bringing that to my attention. Reading their docs, it looks as though the functionality is pretty beta-quality, so I'm not sure how comfortable I'd be building on it, especially given that their planned future improvements still include some pretty basic functionality and there isn't even sample code. Given that Structured Streaming itself is pretty new, I'd simply go for the path of least resistance and use the Spark-recommended way to integrate with Structured Streaming, which is Kafka. Additionally, Kafka is part of essentially every data pipeline out there, so if we can make it easier to stand up on Azure via the Fortis work, that'll be a win in general as it's certainly reusable functionality :)
We publish a simple flat/primitive structure to Kafka so that [1] the event is easy to process by any downstream consumer (e.g. Spark Structured Streaming or DB ingestion) and [2] we decouple the Spark Streaming data representation from the data storage representation in Kafka.
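As a sketch of what such a flat event might look like (the field names here are illustrative, not the actual schema in this PR):

```scala
// A flat, primitive-only event: every field is a string or number,
// so it serializes trivially to JSON/Avro and can be consumed by any
// downstream system without Spark-specific types leaking into Kafka.
case class FortisEvent(
  id: String,
  source: String,
  language: String,
  sentiment: Double,
  insertionTime: Long
)
```

Because the Kafka message is just this flat record, a downstream Structured Streaming job or a DB ingester can each define its own richer domain model without depending on the ingestion job's internal classes.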