This repository has been archived by the owner. It is now read-only.

Publish AnalyzedItems to Cassandra #28

Merged
merged 19 commits into master from publish-to-cassandra on Jun 23, 2017

Conversation

@c-w
Member

commented Jun 23, 2017

For now, using a test keyspace and table on the cluster that Erik set up:

CREATE KEYSPACE fortistest WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '3'} AND durable_writes = true;

CREATE TABLE fortistest.events (
    created_at timestamp,
    pipeline text,
    PRIMARY KEY (created_at, pipeline)
);

There will be some follow-up work required to adapt the CassandraSchema class once our events schema is finalized.
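For reference, here's a minimal sketch of how a stream could be written to this test table via the spark-cassandra-connector; the TestEvent case class is an assumption for illustration only, not the final AnalyzedItem schema:

```scala
import java.util.Date

import com.datastax.spark.connector.streaming._
import org.apache.spark.streaming.dstream.DStream

// Hypothetical row shape matching the fortistest.events test table above.
case class TestEvent(created_at: Date, pipeline: String)

object CassandraSink {
  // Writes every micro-batch to fortistest.events; assumes
  // spark.cassandra.connection.host is already set on the SparkConf.
  def save(events: DStream[TestEvent]): Unit = {
    events.saveToCassandra("fortistest", "events")
  }
}
```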

Resolves #14

@c-w c-w added the in progress label Jun 23, 2017

@Smarker
Contributor

left a comment

Everything else looks good.


object Utils {
  def mean(items: List[Double]): Double = {
    items.sum / items.length

@Smarker (Contributor) commented Jun 23, 2017

What if the length of items is 0? Wouldn't this cause an error?

@c-w (Author, Member) commented Jun 23, 2017

Improved error handling in 37c1983.
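
For what it's worth, an empty List[Double] makes this expression return NaN (0.0 / 0) rather than throw. A guarded sketch, not necessarily the exact fix that landed in 37c1983:

```scala
object Utils {
  // Returns None for an empty list; on Doubles, sum / length would
  // otherwise silently produce NaN instead of raising an exception.
  def mean(items: List[Double]): Option[Double] =
    if (items.isEmpty) None else Some(items.sum / items.length)
}
```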

  def rescale(items: List[Double], min_new: Double, max_new: Double): List[Double] = {
    val min_old = items.min
    val max_old = items.max
    val coef = (max_new - min_new) / (max_old - min_old)

@Smarker (Contributor) commented Jun 23, 2017

If max_old == min_old, the denominator would be 0. Maybe put a check for this?

@jcjimenez (Contributor) commented Jun 23, 2017

+1

@c-w (Author, Member) commented Jun 23, 2017

Added error handling in 37c1983.
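
One way to guard both degenerate cases (empty input and max_old == min_old) is sketched below; this is an illustration only, and the committed handling in 37c1983 may differ:

```scala
  def rescale(items: List[Double], min_new: Double, max_new: Double): Option[List[Double]] = {
    if (items.isEmpty) return None
    val min_old = items.min
    val max_old = items.max
    // All values equal: the scale factor would divide by zero, so bail out.
    if (max_old == min_old) return None
    val coef = (max_new - min_new) / (max_old - min_old)
    Some(items.map(item => coef * (item - min_old) + min_new))
  }
```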

@c-w c-w force-pushed the publish-to-cassandra branch from e13eef1 to f34f55a Jun 23, 2017

@@ -0,0 +1,6 @@
package com.microsoft.partnercatalyst.fortis.spark.transforms.gender

object GenderDetector {

@jcjimenez (Contributor) commented Jun 23, 2017

May I suggest adding extends Enumeration here?

@c-w (Author, Member) commented Jun 23, 2017

Sweet. Done in 8162fe3.
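
For illustration, an Enumeration-based shape might look like the following; the specific values listed here are placeholders, not necessarily what 8162fe3 contains:

```scala
package com.microsoft.partnercatalyst.fortis.spark.transforms.gender

object GenderDetector extends Enumeration {
  type Gender = Value
  // Placeholder members for illustration; the committed set may differ.
  val Male, Female, Unknown = Value
}
```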

@jcjimenez
Contributor

left a comment

LGTM with the division-by-zero check.

@jcjimenez jcjimenez merged commit 257576d into master Jun 23, 2017

2 checks passed

continuous-integration/travis-ci/pr: The Travis CI build passed
continuous-integration/travis-ci/push: The Travis CI build passed

@jcjimenez jcjimenez removed the in progress label Jun 23, 2017

@c-w c-w deleted the publish-to-cassandra branch Jun 23, 2017

@erikschlegel
Contributor

left a comment

Looks great. Nice work. Left some comments around offloading the uuid() call to Cassandra.

Starting next week, I'll add the pieces to aggregate the results by place, tile, and topic and make the saveToCassandra call for each.
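
A rough sketch of that follow-up, assuming a hypothetical EventAggregate case class and an event_aggregates table keyed by place, tile, and topic (neither exists yet):

```scala
import com.datastax.spark.connector.streaming._
import org.apache.spark.streaming.dstream.DStream

// Hypothetical aggregate row; the real events schema is still being finalized.
case class EventAggregate(place: String, tile: String, topic: String, count: Long)

object AggregateSink {
  def save(aggregates: DStream[EventAggregate]): Unit = {
    aggregates
      .map(agg => ((agg.place, agg.tile, agg.topic), agg.count))
      .reduceByKey(_ + _)
      .map { case ((place, tile, topic), count) => EventAggregate(place, tile, topic, count) }
      .saveToCassandra("fortistest", "event_aggregates")
  }
}
```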

@@ -19,6 +20,7 @@ object TadawebPipeline extends Pipeline {
import transformContext._

stream.map(tada => AnalyzedItem(
id = randomUUID(),

@erikschlegel (Contributor) commented Jun 23, 2017

This should be called via the uuid() function in Cassandra.

@c-w (Author, Member) commented Jun 24, 2017

The advantage of creating the id early is that we have a way to track every event through the pipeline (e.g. useful when logging). Is this benefit worth explicitly creating the UUID?
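
To make the trade-off concrete, here's a small illustration of the traceability that an early, pipeline-side id enables; the logger and message are hypothetical:

```scala
import java.util.UUID.randomUUID

import org.slf4j.LoggerFactory

object PipelineTracing {
  private val log = LoggerFactory.getLogger(getClass)

  // Because the id is assigned in the pipeline rather than by Cassandra's
  // uuid() function at insert time, the same id can be logged at every
  // stage long before the row is written.
  def traceExample(pipeline: String): Unit = {
    val id = randomUUID()
    log.info(s"Created AnalyzedItem $id in pipeline $pipeline")
  }
}
```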

@@ -17,6 +18,7 @@ object RadioPipeline extends Pipeline {

private def convertToSchema(stream: DStream[RadioTranscription], transformContext: TransformContext): DStream[AnalyzedItem] = {
stream.map(transcription => AnalyzedItem(
id = randomUUID(),

@erikschlegel (Contributor) commented Jun 23, 2017

This should be called via the uuid() function in Cassandra.

@@ -18,6 +19,7 @@ object InstagramPipeline extends Pipeline {
// do computer vision analysis
val analysis = imageAnalyzer.analyze(instagram.images.standard_resolution.url)
AnalyzedItem(
id = randomUUID(),

@erikschlegel (Contributor) commented Jun 23, 2017

This should be called via the uuid() function in Cassandra.

import org.scalatest.{BeforeAndAfter, FlatSpec}

import scala.collection.mutable

-class StreamProviderSpec extends FlatSpec with BeforeAndAfter {
+class SparkSpec extends FlatSpec with BeforeAndAfter {

@kevinhartman (Contributor) commented Jun 24, 2017

I would ideally like to keep the StreamProviderSpec separate since the StreamProvider package is an isolated component that could be moved out to a library at any time (and it'd be nice to take the test spec with it).

@c-w (Author, Member) commented Jun 24, 2017

As per the comment on 629e5d3, we can only have a single SparkContext running per JVM, so I merged all the tests for now (they were previously split, but it's non-trivial to get that to work). If this becomes an issue, we can spend the time to figure out how to split the tests.

@c-w (Author, Member) commented Jun 24, 2017

I briefly looked into using the spark-testing-base package, but it doesn't have anything built-in for Spark Streaming. I started working on a streaming extension but got some odd errors, so I preferred to just push this for now. I'll look more into this when I get a free minute.
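
A common pattern for the single-context constraint (a sketch, not what was committed here) is to share one lazily created SparkContext across suites via a mixin:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.Suite

// Sketch: a single SparkContext per JVM, created on first use and
// shared by every suite that mixes this trait in.
trait SharedSparkContext { self: Suite =>
  protected def sc: SparkContext = SharedSparkContext.context
}

object SharedSparkContext {
  lazy val context: SparkContext = new SparkContext(
    new SparkConf().setMaster("local[*]").setAppName("shared-test-context"))
}
```

With something like this, suites such as StreamProviderSpec and SparkSpec could stay separate while still reusing the same context.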
