In [None]:
%matplotlib inline
import matplotlib
import seaborn as sns
matplotlib.rcParams['savefig.dpi'] = 2 * matplotlib.rcParams['savefig.dpi']

# Streaming Technologies 
Streaming is useful for realtime analytics, online machine learning, continuous computation, and more.

### Apache Kafka
*A distributed, partitioned, replicated commit log service.*     

Data streams are partitioned and spread over a cluster of machines. This allows streams larger than any single machine could handle, and it lets clusters of coordinated consumers share the work of reading them.
![kafka](images/kafka_producer_consumer.png)

- Kafka maintains feeds of messages in categories called topics.
- Processes that publish messages to a Kafka topic are called *producers*.
- Processes that subscribe to topics and process the feed of published messages are called *consumers*.
- Kafka runs as a cluster of one or more servers, each of which is called a *broker*.
- A Java client for communication between clients and servers is provided natively; clients for other languages are available.
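
To make the "partitioned commit log" idea concrete, here is a minimal in-memory sketch in plain Python. It is not the Kafka API: the `TopicLog` class, its method names, and the choice of a CRC32 hash of the key to pick a partition are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict


class TopicLog:
    """Toy in-memory stand-in for one topic: a list of append-only partitions."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # Deterministic key hash, so one key always maps to one partition.
        p = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def read(self, partition, offset):
        # Reading never deletes; a consumer just remembers its own offset.
        return self.partitions[partition][offset:]


log = TopicLog(num_partitions=3)
for i in range(6):
    log.append("user-%d" % (i % 2), "click-%d" % i)

# Each key's messages stay in order inside a single partition.
by_key = defaultdict(list)
for part in log.partitions:
    for key, value in part:
        by_key[key].append(value)
print(dict(by_key))
```

Because every message for a given key lands in the same partition, per-key ordering is preserved even though the topic as a whole is spread across several logs.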

#### Useful for:
- Website activity tracking
- Operational monitoring metrics, i.e. aggregating statistics from distributed applications
- Log aggregation: collecting physical log files off servers and putting them in a central place (e.g. HDFS) for processing
- Stream processing

### Apache Storm
*"A free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing."*

- Fast: a benchmark cited by the Storm project clocked it at over a million tuples processed per second per node
- Scalable and fault-tolerant
- Easy to set up and operate

#### Abstractions 
- **Core abstraction:** The *stream*, an unbounded sequence of tuples. Storm provides the primitives for transforming a stream into a new stream in a distributed and reliable way. 
- **Spout:** a source of streams, e.g. a connection to the Twitter API
- **Bolt:** consumes any number of input streams and runs processing logic; possibly emits new streams. 
    - E.g. run functions, filter, aggregate streams, join streams, talk to databases, etc.
- Networks of spouts and bolts are packaged into a **topology**, which is the top level abstraction that you submit to Storm clusters for execution. 
![storm](images/storm_topology.png)

- Each node in the topology executes in parallel, and you can specify the level of parallelism for each node.
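
The spout, bolt, and topology abstractions can be sketched with plain Python generators. This is only an analogy, not Storm's API: the function names are made up for this example, and the chain runs in a single process with no parallelism or fault tolerance.

```python
def sentence_spout():
    """Spout: a source of tuples (a fixed list stands in for a live feed)."""
    for sentence in ["storm processes streams", "streams of tuples"]:
        yield (sentence,)


def split_bolt(stream):
    """Bolt: consumes one stream and emits a new stream of (word,) tuples."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def count_bolt(stream):
    """Bolt: keeps running state and emits an updated (word, count) per input."""
    counts = {}
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])


# The "topology" is just the wiring: spout -> split bolt -> count bolt.
topology = count_bolt(split_bolt(sentence_spout()))
results = list(topology)
print(results)
```

In a real topology each stage would run as many parallel tasks across the cluster; here the wiring is the only part that carries over.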

## Spark Streaming

**Core abstraction:** The *DStream* (discretized stream), a sequence of data arriving over time.
- Internally represented as a sequence of RDDs, one per time step (batch interval)
- Can be created from various input sources, e.g. Kafka, HDFS
- Two types of operations: 
    - *Transformations* yield a new DStream. Many of the operations available on RDDs are supported, along with time-based operations such as sliding windows
    - *Output operations* write data to an external system 
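
As a rough illustration of discretization and windowing, the following plain-Python sketch groups timestamped events into fixed-length batches and then applies a sliding window over those batches. It does not use Spark; `discretize`, `windowed`, and the sample events are hypothetical names and data chosen for this example.

```python
from collections import deque


def discretize(events, batch_seconds):
    """Group (timestamp, value) events into consecutive fixed-length batches."""
    batches = {}
    for ts, value in events:
        batches.setdefault(int(ts // batch_seconds), []).append(value)
    if not batches:
        return []
    # Emit every interval in order, including empty ones.
    return [batches.get(i, []) for i in range(max(batches) + 1)]


def windowed(batches, window_len):
    """Transformation: each output batch is the union of the last window_len batches."""
    recent = deque(maxlen=window_len)
    out = []
    for batch in batches:
        recent.append(batch)
        out.append([v for b in recent for v in b])
    return out


events = [(0.5, "a"), (1.2, "b"), (1.9, "c"), (4.1, "d")]
batches = discretize(events, batch_seconds=2)
print(batches)
print(windowed(batches, window_len=2))
```

Each inner list plays the role of one RDD in the sequence, and the windowed result is itself a new sequence of batches, which mirrors how a transformation yields a new DStream.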

### To use Spark Streaming in an application
#### Include as imports:
```scala
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.Duration
import org.apache.spark.streaming.Seconds 
```

#### Include in your `build.sbt`:
```scala
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.0" % "provided"
```

#### Building
`sbt clean assembly` or `sbt assembly`

### Streaming Tweets Demo

The project lives here: `datacourse/projects/streaming_twitter_cl`

#### Collecting tweets with built-in Twitter utilities
```scala
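// Assumed context: ssc is a StreamingContext, gson is a com.google.gson.Gson
// instance, and Utils.getAuth is the project's helper for Twitter credentials.
// Serialize each incoming status to a JSON string.
val tweetStream = TwitterUtils.createStream(ssc, Utils.getAuth)
  .map(gson.toJson(_))

tweetStream.foreachRDD((rdd, time) => {
  val count = rdd.count()
  // Skip empty batches so that no empty output directories are written.
  if (count > 0) {
    // Write one output directory per batch, named by the batch timestamp.
    val outputRDD = rdd.repartition(partitionsEachInterval)
    outputRDD.saveAsTextFile(
      outputDirectory + "/tweets_" + time.milliseconds.toString)
    numTweetsCollected += count
    // Stop the application once enough tweets have been saved.
    if (numTweetsCollected > numTweetsToCollect) {
      System.exit(0)
    }
  }
})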
```

#### Examining the tweets and training a K-means clustering model

The example uses Spark SQL to examine the collected tweets. Spark SQL can load JSON files and infer the schema from the data. Queries passed to Spark SQL follow fairly standard SQL syntax, and their results can be printed to stdout:

```scala
sqlContext.sql(<command>).collect().foreach(println)
```

The clustering aims to group tweets written in the same language. Each tweet is split into overlapping character bigrams, which are hashed into a fixed-length vector so that character sequences common to a language show up as shared features. Here `tf` is a `HashingTF` from `org.apache.spark.mllib.feature`.

```scala
  def featurize(s: String): Vector = {
    tf.transform(s.sliding(2).toSeq)
  }
```
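
To see what the featurizer computes without a Spark installation, here is a plain-Python sketch of character-bigram hashing. The vector size of 1000 and the use of CRC32 as the hash are assumptions for illustration; `HashingTF` uses its own hash function and returns a sparse vector.

```python
import zlib

NUM_FEATURES = 1000  # assumed vector size, chosen only for this sketch


def char_bigrams(s):
    """Overlapping two-character windows, like s.sliding(2) for len(s) >= 2."""
    return [s[i:i + 2] for i in range(len(s) - 1)]


def featurize(s, num_features=NUM_FEATURES):
    """Hashing trick: each bigram increments the bucket its hash falls into."""
    vec = [0.0] * num_features
    for gram in char_bigrams(s):
        vec[zlib.crc32(gram.encode("utf-8")) % num_features] += 1.0
    return vec


print(char_bigrams("hola"))
v = featurize("hola hola")
print(sum(v))
```

Two different bigrams can collide in the same bucket; a larger vector size trades memory for fewer collisions.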

And here we train a KMeans model from MLlib:
```scala
val vectors = texts.map(Utils.featurize).cache()
val model = KMeans.train(vectors, numClusters, numIterations)
sc.makeRDD(model.clusterCenters, numClusters).
    saveAsObjectFile(outputModelDir)
```
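
For intuition about what training and prediction do, here is a minimal single-machine sketch of the standard k-means iteration (assign each point to its nearest center, then move each center to the mean of its points). It is not MLlib's implementation: the initial centers are passed in by hand, and MLlib's initialization strategy and distributed execution are not modeled.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict(centers, point):
    """Index of the nearest center."""
    return min(range(len(centers)), key=lambda i: sq_dist(centers[i], point))


def train_kmeans(points, initial_centers, num_iterations):
    centers = [list(c) for c in initial_centers]
    for _ in range(num_iterations):
        # Assignment step: group points by nearest center.
        groups = [[] for _ in centers]
        for p in points:
            groups[predict(centers, p)].append(p)
        # Update step: move each center to the mean of its group.
        for i, group in enumerate(groups):
            if group:  # leave a center in place if it attracted no points
                centers[i] = [sum(dim) / len(group) for dim in zip(*group)]
    return centers


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centers = train_kmeans(points, initial_centers=[(0.0, 0.0), (10.0, 10.0)],
                       num_iterations=5)
print(centers)
print(predict(centers, (9.0, 9.0)))
```

The trained model is fully described by its list of centers, which is why saving `model.clusterCenters` is enough to rebuild it later.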

#### Realtime classification

Finally, let's apply the model we've created in realtime! We'll:
1. Create a Spark `StreamingContext`.
1. Create a Twitter DStream and keep just the `text` field of each tweet.
1. Load the trained KMeans model.
1. Choose the id of a language cluster we're interested in, apply the model to all tweets, and keep only the tweets assigned to that cluster to check whether the printed output is what we expect.

```scala
println("Initializing Streaming Spark Context...")
val conf = new SparkConf().setAppName(this.getClass.getSimpleName)
val ssc = new StreamingContext(conf, Seconds(5))

println("Initializing Twitter stream...")
val tweets = TwitterUtils.createStream(ssc, Utils.getAuth)
val statuses = tweets.map(_.getText)

println("Initializing the KMeans model...")
val model = new KMeansModel(ssc.sparkContext.objectFile[Vector](
    modelFile.toString).collect())

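// Keep only tweets assigned to the chosen cluster; print() shows the first
// few elements of each batch.
val filteredTweets = statuses
  .filter(t => model.predict(Utils.featurize(t)) == clusterNumber)
filteredTweets.print()

// Nothing is processed until the streaming context is started.
ssc.start()
ssc.awaitTermination()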
```

### An exercise for the reader

The first streaming model to make it into MLlib is streaming k-means, in which the cluster estimates are updated dynamically as new data arrive. The algorithm uses a generalization of the mini-batch k-means update rule: for each batch of data, it assigns all points to their nearest cluster, computes new cluster centers from the batch, and then updates each cluster by combining its previous center with the new one.
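
The per-cluster update can be sketched in a few lines of plain Python. This assumes the decayed weighted-average form given in the MLlib documentation, c_new = (c * n * a + x * m) / (n * a + m), where c is the old center, n its point count, x the mean of the batch points assigned to it, m their number, and a a decay factor. The function name, the parameter names, and the way the running weight is carried forward are choices made for this sketch, not Spark's API.

```python
def update_center(center, weight, batch_points, decay=1.0):
    """One streaming update of a single cluster.

    center, weight: the current center and its (possibly discounted) point count
    batch_points:   points from the new batch assigned to this cluster
    decay:          value in [0, 1]; 1.0 keeps all history, 0.0 forgets it
    """
    m = len(batch_points)
    old = weight * decay
    if m == 0:
        # No new points: the center stays put, only its weight is discounted.
        return list(center), old
    batch_mean = [sum(dim) / m for dim in zip(*batch_points)]
    new_center = [(c * old + x * m) / (old + m)
                  for c, x in zip(center, batch_mean)]
    return new_center, old + m


center, weight = update_center([0.0, 0.0], 2, [(3.0, 3.0), (5.0, 5.0)])
print(center, weight)
```

With a decay of 1.0 the center is the running mean of everything seen so far; with a decay of 0.0 it jumps to the mean of the latest batch.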

In this example we trained our model "in batch" on a static store of data. Try writing another class in the `streaming_twitter_cl` project that instead initializes a new `StreamingKMeans` model with random centers, uses the number of languages you want to classify as its number of clusters, and updates dynamically on a fresh stream of Twitter data.

*Copyright &copy; 2015 The Data Incubator.  All rights reserved.*