# Streaming Twitter Sentiment Prediction

## Part 1: Training a NaiveBayes Model of Twitter Sentiment

Our approach to Twitter sentiment prediction is to first train a Naive Bayes model on a labeled dataset from Kaggle.

We then save the trained model so that it can be loaded and used later in a streaming application (in a different notebook).

### Step 1: Download and Explore data

In [0]:
!wget http://idsdl.csom.umn.edu/c/share/msba6330/twitter1.6m.zip

--2024-01-23 19:54:50--  http://idsdl.csom.umn.edu/c/share/msba6330/twitter1.6m.zip
Resolving idsdl.csom.umn.edu (idsdl.csom.umn.edu)... 134.84.138.46, 2607:ea00:101:480a:250:56ff:febb:e76b
Connecting to idsdl.csom.umn.edu (idsdl.csom.umn.edu)|134.84.138.46|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84855679 (81M) [application/zip]
Saving to: ‘twitter1.6m.zip’


2024-01-23 19:54:53 (33.9 MB/s) - ‘twitter1.6m.zip’ saved [84855679/84855679]



In [0]:
!unzip twitter1.6m.zip

Archive:  twitter1.6m.zip
  inflating: training.1600000.processed.noemoticon.csv  


In [0]:
!head training.1600000.processed.noemoticon.csv

head: cannot open 'training.1600000.processed.noemoticon.csv' for reading: No such file or directory


Read the data into a DataFrame `data` using schema string: `target integer, id long, date string, flag string, user string, text string`

In [0]:
schema = "target integer, id long, date string, flag string, user string, text string"
data = spark.read.csv("file:/databricks/driver/training.1600000.processed.noemoticon.csv", schema=schema)
data.limit(5).display()

target,id,date,flag,user,text
0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!
0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds
0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there."


In [0]:
data.printSchema()

root
 |-- target: integer (nullable = true)
 |-- id: long (nullable = true)
 |-- date: string (nullable = true)
 |-- flag: string (nullable = true)
 |-- user: string (nullable = true)
 |-- text: string (nullable = true)



### Step 2: Do Some Data Cleaning and Target Variable Transformation

Please note that our training data has a different format from our testing data, which comes from a Twitter stream. Our testing data will have these fields:

- `text`: tweet text
- `time`: timestamp

In the following, we first want to:

- transform the `date` column into a timestamp column `time`
- transform the `target` variable into a binary column (tip: use `Binarizer`, but cast the column to double first, as `Binarizer` only works with double/float columns)
- drop the irrelevant columns.

The resulting DataFrame, called `data_train` in the code below (`data_clean` is the intermediate DataFrame that still carries `target`), will have these columns:
- `label`: a 0-1 label column derived from `target`
- `text`
- `time`
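
To show what these two transformations do to a single row, here is a plain-Python sketch (not the Spark code used in this notebook). The helper names `strip_weekday` and `binarize` are made up for this example, and the sketch assumes that positive tweets carry a `target` value above the 2.0 threshold (for instance 4).

```python
def strip_weekday(date_str):
    """Mimic Spark's substring(date, 5, 24): 1-based start 5, length 24."""
    return date_str[4:4 + 24]


def binarize(target, threshold=2.0):
    """Mimic Binarizer: values strictly above the threshold become 1.0."""
    return 1.0 if float(target) > threshold else 0.0


raw_date = "Mon Apr 06 22:19:45 PDT 2009"
trimmed = strip_weekday(raw_date)

print(trimmed)        # Apr 06 22:19:45 PDT 2009
print(binarize(0))    # 0.0
print(binarize(4))    # 1.0
```

Dropping the leading weekday leaves a string that matches the `MMM dd HH:mm:ss zzz yyyy` pattern passed to `to_timestamp`.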

In [0]:
from pyspark.sql.functions import col, to_timestamp, substring
from pyspark.ml.feature import *
from pyspark.ml import Pipeline

data_clean = data.drop("id", "flag", "user").withColumn("time", to_timestamp(substring(col("date"),5,24),"MMM dd HH:mm:ss zzz yyyy")).drop("date").cache()

# The query casts the "target" column to a double type and renames it to "target_double". The * selects all other columns in addition to this newly created column
st_cast = SQLTransformer(statement = "select cast(target as double) as target_double, * from __THIS__")

# convert continuous features to binary (0/1) values based on a threshold. 
# If "target_double" is greater than 2.0, it will be converted to 1, otherwise 0.
binarizer = Binarizer(threshold = 2.0, inputCol="target_double", outputCol="label")
target_pipeline = Pipeline(stages=[st_cast,binarizer])
data_train = target_pipeline.fit(data_clean).transform(data_clean).drop("target_double","target")

data_train.limit(5).display()

text,time,label
"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D",2009-04-07T05:19:45.000+0000,0.0
is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!,2009-04-07T05:19:49.000+0000,0.0
@Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds,2009-04-07T05:19:53.000+0000,0.0
my whole body feels itchy and like its on fire,2009-04-07T05:19:57.000+0000,0.0
"@nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there.",2009-04-07T05:19:57.000+0000,0.0


In [0]:
data_train.printSchema()

root
 |-- text: string (nullable = true)
 |-- time: timestamp (nullable = true)
 |-- label: double (nullable = true)



### Step 3: Save `target_pipeline`, which includes the SQLTransformer and Binarizer, on DBFS at `/FileStore/fitted_target_pipeline/`

In [0]:
fitted_target_pipeline = target_pipeline.fit(data_clean)
fitted_target_pipeline.write().overwrite().save("/FileStore/fitted_target_pipeline/")

In [0]:
%fs ls /FileStore/fitted_target_pipeline/

path,name,size,modificationTime
dbfs:/FileStore/fitted_target_pipeline/metadata/,metadata/,0,0
dbfs:/FileStore/fitted_target_pipeline/stages/,stages/,0,0


### Step 4: Define an ML pipeline containing data preprocessing and model training

Specifically, we need to remove some unwanted string components (such as URLs), remove stop words, and vectorize the words. We will try to do all of this with PySpark (a combination of Spark SQL and Spark MLlib).

- Eliminate URLs, @user mentions, and `#` (remove just the symbol but keep the hashtag text). We will do this with SQLTransformer.
  - Please use `regexp_replace` from Spark SQL
  - `'http\\\S+'` --> `''`: remove URLs (note: there are three backslashes because the pattern passes through three interpreters; the final form should be `http\S+`)
  - `'@\\\w+'` --> `''`: remove @user
  - `'#'` --> `''`: remove hashtag symbols
- Tokenize the tweet, using a `RegexTokenizer` with `\\W+` as the token pattern.
- Remove stop words, using `StopWordsRemover` (note that you need to load the stop words first).
- Turn words into numerical features using `CountVectorizer`, requiring a minimum document frequency of 20 (`minDF=20`) so that rare words are dropped.
- Use `NaiveBayes` to predict the sentiment, with a `smoothing` coefficient of `1.0` and a `modelType` of `multinomial`.
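
The cleaning, tokenizing, and stop-word steps listed above can be tried out on a single tweet with Python's standard `re` module. This is only a sketch of the same regular expressions outside Spark: the function names are hypothetical, and `STOP_WORDS` is a tiny stand-in for Spark's much longer default English list.

```python
import re

# Tiny stand-in stop-word list (assumption); Spark's default list is longer.
STOP_WORDS = {"a", "the", "not", "at", "is", "it", "to"}


def clean(text):
    text = text.lower()
    text = re.sub(r"http\S+", "", text)   # drop URLs
    text = re.sub(r"@\w+", "", text)      # drop @user mentions
    return text.replace("#", "")          # drop '#', keep the hashtag word


def tokenize(text):
    # Split on runs of non-word characters and discard empty strings.
    return [tok for tok in re.split(r"\W+", text) if tok]


def remove_stop_words(tokens):
    return [tok for tok in tokens if tok not in STOP_WORDS]


tweet = "@Kwesidei not the whole crew"
cleaned = clean(tweet)
words = tokenize(cleaned)
words_filtered = remove_stop_words(words)

print(words)            # ['not', 'the', 'whole', 'crew']
print(words_filtered)   # ['whole', 'crew']
```

In plain Python a raw string needs only one backslash (`r"http\S+"`), because only the regex engine interprets the pattern here.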

Please also define:

- an evaluator `e` for the accuracy metric.
- a pipeline `pipeline` with these stages: SQL transformer (for removing unwanted patterns), tokenizer, stop-word remover, count vectorizer, Naive Bayes.
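
To make the effect of the count vectorizer's minimum document frequency concrete, here is a small pure-Python sketch. It is an illustration under simplifying assumptions, not Spark's implementation: `fit_vocabulary` and `vectorize` are hypothetical names, the threshold is lowered to 2 so that a three-document toy corpus is enough, and the index order chosen here may differ from the one Spark assigns.

```python
from collections import Counter


def fit_vocabulary(docs, min_df=2):
    """Keep only terms that appear in at least `min_df` documents."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))      # count each term once per document
    kept = [term for term, df in doc_freq.items() if df >= min_df]
    # Deterministic order for this sketch: frequent terms first, then A-Z.
    kept.sort(key=lambda term: (-doc_freq[term], term))
    return {term: index for index, term in enumerate(kept)}


def vectorize(tokens, vocab):
    """Return a sparse {index: count} mapping for one document."""
    counts = Counter(tok for tok in tokens if tok in vocab)
    return {vocab[tok]: float(n) for tok, n in counts.items()}


docs = [["bit", "bit", "lol"], ["lol", "fine"], ["bit", "rains"]]
vocab = fit_vocabulary(docs, min_df=2)
vector = vectorize(["bit", "bit", "lol", "rains"], vocab)

print(vocab)    # {'bit': 0, 'lol': 1}
print(vector)   # {0: 2.0, 1: 1.0}
```

Terms below the threshold ("fine" and "rains" here) never enter the vocabulary, so they contribute nothing to the feature vector.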

In [0]:
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

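# Note: 'http:\\\S+' only strips URLs that start with 'http:'; the 'http\\\S+'
# pattern described above would also strip 'https:' links.
st = SQLTransformer(statement = "select *, regexp_replace(regexp_replace(regexp_replace(lower(text), 'http:\\\S+',''), '@\\\w+', ''),'#','') as text_cleaned from __THIS__")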

tokenizer = RegexTokenizer(inputCol = "text_cleaned", outputCol='words', pattern = "\\W+")
stopwords = StopWordsRemover.loadDefaultStopWords("english")
swr = StopWordsRemover(inputCol = 'words', outputCol='words_filtered', stopWords=stopwords)
cv = CountVectorizer(inputCol='words_filtered', outputCol='features', minDF=20)
# ['hello' 'world'] -->
# hello    world
#  1        1
nb = NaiveBayes(smoothing=1.0, modelType='multinomial')
e = MulticlassClassificationEvaluator(metricName = 'accuracy')

pipeline = Pipeline(stages=[st, tokenizer, swr, cv, nb])


### Step 5: Train the pipeline

- save the resulting model as `pipelineModel`
- use the model to transform the training data `data_train`
- display sample results.

> Note: the training may take several minutes

In [0]:
pipelineModel = pipeline.fit(data_train)

In [0]:
results = pipelineModel.transform(data_train)
results.limit(10).display()

text,time,label,text_cleaned,words,words_filtered,features,rawPrediction,probability,prediction
"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D",2009-04-07T05:19:45.000+0000,0.0,"- awww, that's a bummer. you shoulda got david carr of third day to do it. ;d","List(awww, that, s, a, bummer, you, shoulda, got, david, carr, of, third, day, to, do, it, d)","List(awww, bummer, shoulda, got, david, carr, third, day, d)","Map(vectorType -> sparse, length -> 22150, indices -> List(2, 11, 72, 349, 737, 1074, 1787, 3377, 9539), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-73.10331329092534, -75.76040428183782))","Map(vectorType -> dense, length -> 2, values -> List(0.9344466971628318, 0.06555330283716827))",0.0
is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!,2009-04-07T05:19:49.000+0000,0.0,is upset that he can't update his facebook by texting it... and might cry as a result school today also. blah!,"List(is, upset, that, he, can, t, update, his, facebook, by, texting, it, and, might, cry, as, a, result, school, today, also, blah)","List(upset, update, facebook, texting, might, cry, result, school, today, also, blah)","Map(vectorType -> sparse, length -> 22150, indices -> List(7, 70, 174, 197, 425, 429, 440, 682, 1018, 1919, 2240), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-84.59414501008106, -89.98197577429755))","Map(vectorType -> dense, length -> 2, values -> List(0.9954489268877933, 0.004551073112206787))",0.0
@Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds,2009-04-07T05:19:53.000+0000,0.0,i dived many times for the ball. managed to save 50% the rest go out of bounds,"List(i, dived, many, times, for, the, ball, managed, to, save, 50, the, rest, go, out, of, bounds)","List(dived, many, times, ball, managed, save, 50, rest, go, bounds)","Map(vectorType -> sparse, length -> 22150, indices -> List(5, 216, 256, 370, 800, 981, 1171, 1578), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-63.04542955499372, -63.78005141899811))","Map(vectorType -> dense, length -> 2, values -> List(0.67581868825521, 0.32418131174479003))",0.0
my whole body feels itchy and like its on fire,2009-04-07T05:19:57.000+0000,0.0,my whole body feels itchy and like its on fire,"List(my, whole, body, feels, itchy, and, like, its, on, fire)","List(whole, body, feels, itchy, like, fire)","Map(vectorType -> sparse, length -> 22150, indices -> List(4, 331, 381, 705, 1043, 2814), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-47.0100517499005, -50.54793735282175))","Map(vectorType -> dense, length -> 2, values -> List(0.9717467190554288, 0.02825328094457132))",0.0
"@nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there.",2009-04-07T05:19:57.000+0000,0.0,"no, it's not behaving at all. i'm mad. why am i here? because i can't see you all over there.","List(no, it, s, not, behaving, at, all, i, m, mad, why, am, i, here, because, i, can, t, see, you, all, over, there)","List(behaving, m, mad, see)","Map(vectorType -> sparse, length -> 22150, indices -> List(0, 21, 493, 10190), values -> List(1.0, 1.0, 1.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-30.130673621754106, -31.075616317275248))","Map(vectorType -> dense, length -> 2, values -> List(0.720096976808941, 0.27990302319105903))",0.0
@Kwesidei not the whole crew,2009-04-07T05:20:00.000+0000,0.0,not the whole crew,"List(not, the, whole, crew)","List(whole, crew)","Map(vectorType -> sparse, length -> 22150, indices -> List(331, 2083), values -> List(1.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-17.983945382297364, -17.91495730669582))","Map(vectorType -> dense, length -> 2, values -> List(0.48275981823545605, 0.517240181764544))",1.0
Need a hug,2009-04-07T05:20:03.000+0000,0.0,need a hug,"List(need, a, hug)","List(need, hug)","Map(vectorType -> sparse, length -> 22150, indices -> List(35, 815), values -> List(1.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-14.601759941781173, -15.43562112134421))","Map(vectorType -> dense, length -> 2, values -> List(0.6971707364117425, 0.30282926358825757))",0.0
"@LOLTrish hey long time no see! Yes.. Rains a bit ,only a bit LOL , I'm fine thanks , how's you ?",2009-04-07T05:20:03.000+0000,0.0,"hey long time no see! yes.. rains a bit ,only a bit lol , i'm fine thanks , how's you ?","List(hey, long, time, no, see, yes, rains, a, bit, only, a, bit, lol, i, m, fine, thanks, how, s, you)","List(hey, long, time, see, yes, rains, bit, bit, lol, m, fine, thanks)","Map(vectorType -> sparse, length -> 22150, indices -> List(0, 12, 13, 21, 31, 76, 78, 88, 162, 423, 2560), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-80.42899760997152, -76.07727296338678))","Map(vectorType -> dense, length -> 2, values -> List(0.012720671663382397, 0.9872793283366176))",1.0
@Tatiana_K nope they didn't have it,2009-04-07T05:20:05.000+0000,0.0,nope they didn't have it,"List(nope, they, didn, t, have, it)","List(nope, didn)","Map(vectorType -> sparse, length -> 22150, indices -> List(69, 691), values -> List(1.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-14.79513117152748, -16.11864752311122))","Map(vectorType -> dense, length -> 2, values -> List(0.7897661408990221, 0.21023385910097786))",0.0
@twittera que me muera ?,2009-04-07T05:20:09.000+0000,0.0,que me muera ?,"List(que, me, muera)","List(que, muera)","Map(vectorType -> sparse, length -> 22150, indices -> List(2372), values -> List(1.0))","Map(vectorType -> dense, length -> 2, values -> List(-10.504218124777754, -10.58930814153326))","Map(vectorType -> dense, length -> 2, values -> List(0.5212596785128979, 0.4787403214871022))",0.0
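
The numbers in the `rawPrediction` and `probability` columns above are consistent with a softmax over the per-class log scores, with `prediction` being the index of the larger probability. The sketch below (a hypothetical helper, not a Spark API) reproduces the first row, giving roughly 0.934 for class 0 and 0.066 for class 1.

```python
import math


def normalize_log_scores(raw):
    """Turn per-class log scores into probabilities (a softmax)."""
    top = max(raw)                                  # subtract max for stability
    weights = [math.exp(score - top) for score in raw]
    total = sum(weights)
    return [w / total for w in weights]


# rawPrediction of the first sample row shown above
raw_prediction = [-73.10331329092534, -75.76040428183782]
probability = normalize_log_scores(raw_prediction)
prediction = float(probability.index(max(probability)))

print(probability, prediction)
```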


### Step 6: Obtain the accuracy of the model using our predefined evaluator

In [0]:
e.evaluate(results)

Out[23]: 0.77388625
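
The accuracy metric is simply the fraction of rows whose `prediction` equals `label`. Note that `results` was produced from `data_train`, the same data the model was fitted on, so the figure above is probably an optimistic estimate. As a minimal sketch (the `accuracy` function is a hypothetical helper, not the Spark evaluator), here is the calculation applied to the ten sample rows displayed in Step 5:

```python
def accuracy(labels, predictions):
    """Fraction of positions where the prediction matches the label."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("labels and predictions must be non-empty and equal in length")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# label and prediction columns of the ten sample rows shown in Step 5
labels = [0.0] * 10
predictions = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0]

sample_accuracy = accuracy(labels, predictions)
print(sample_accuracy)   # 0.8
```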

### Step 7: Save the model on DBFS at `/FileStore/twitter_nbpipeline/`

In [0]:
pipelineModel.write().overwrite().save("/FileStore/twitter_nbpipeline/")

In [0]:
%fs ls /FileStore/twitter_nbpipeline/

path,name,size,modificationTime
dbfs:/FileStore/twitter_nbpipeline/metadata/,metadata/,0,0
dbfs:/FileStore/twitter_nbpipeline/stages/,stages/,0,0


In [0]:
!ls /databricks/driver/tmp/

review1.txt   review14.txt  review19.txt  review23.txt	review5.txt
review10.txt  review15.txt  review2.txt   review24.txt	review6.txt
review11.txt  review16.txt  review20.txt  review25.txt	review7.txt
review12.txt  review17.txt  review21.txt  review3.txt	review8.txt
review13.txt  review18.txt  review22.txt  review4.txt	review9.txt


In [0]:
# with open('/databricks/driver/tmp/review1.txt', 'r') as f1:
#   line = f1.readline()
#   print(line)

In [0]:
schema = "target integer, id long, date string, flag string, user string, text string"
lines = spark.read.csv("file:/databricks/driver/tmp/", sep=",", schema=schema)
# lines.display()
data_clean = lines.drop("id", "flag", "user")\
                 .withColumn("time", to_timestamp(substring(col("date"),5,24),"MMM dd HH:mm:ss zzz yyyy"))\
                 .drop("date").cache()
#data_clean.display()

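# PipelineModel is needed to load a fitted pipeline; it was not imported earlier
from pyspark.ml import PipelineModel

fitted_target_pipeline = PipelineModel.load("/FileStore/fitted_target_pipeline/")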
transformed_stream = fitted_target_pipeline.transform(data_clean).drop("target_double","target")

transformed_stream.display()


text,time,label
"Just checked my user timeline on my blackberry, it looks like the twanking is still happening Are ppl still having probs w/ BGs and UIDs?",2009-04-07T05:22:06.000+0000,0.0
"I have a sad feeling that Dallas is not going to show up I gotta say though, you'd think more shows would use music from the game. mmm",2009-04-07T05:22:47.000+0000,0.0
ok I'm sick and spent an hour sitting in the shower cause I was too sick to stand and held back the puke like a champ. BED now,2009-04-07T05:21:20.000+0000,0.0
"@andywana Not sure what they are, only that they are PoS! As much as I want to, I dont think can trade away company assets sorry andy!",2009-04-07T05:22:37.000+0000,0.0
thought sleeping in was an option tomorrow but realizing that it now is not. evaluations in the morning and work in the afternoon!,2009-04-07T05:21:09.000+0000,0.0
"@localtweeps Wow, tons of replies from you, may have to unfollow so I can see my friends' tweets, you're scrolling the feed a lot.",2009-04-07T05:22:24.000+0000,0.0
@alielayus I want to go to promote GEAR AND GROOVE but unfornately no ride there I may b going to the one in Anaheim in May though,2009-04-07T05:21:07.000+0000,0.0
Broadband plan 'a massive broken promise' http://tinyurl.com/dcuc33 via www.diigo.com/~tautao Still waiting for broadband we are,2009-04-07T05:22:23.000+0000,0.0
"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D",2009-04-07T05:19:45.000+0000,0.0
"@smarrison i would've been the first, but i didn't have a gun. not really though, zac snyder's just a doucheclown.",2009-04-07T05:20:20.000+0000,0.0


In [0]:
# Read streaming data with the full schema
schema = "target integer, id long, date string, flag string, user string, text string"
lines = spark.readStream.option("maxFilesPerTrigger",1).csv("file:/databricks/driver/tmp/", sep=",", schema=schema)\
             .drop("id", "flag", "user")\
             .withColumn("time", to_timestamp(substring(col("date"),5,24),"MMM dd HH:mm:ss zzz yyyy"))\
             .drop("date")



fitted_target_pipeline = PipelineModel.load("/FileStore/fitted_target_pipeline/")
transformed_stream = fitted_target_pipeline.transform(lines).drop("target_double","target")
transformed_stream.display()
# data_clean = lines.drop("id", "flag", "user")\
#                  .withColumn("time", to_timestamp(substring(col("date"),5,24),"MMM dd HH:mm:ss zzz yyyy"))\
#                  .drop("date").cache()





text,time,label
"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D",2009-04-07T05:19:45.000+0000,0.0
is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!,2009-04-07T05:19:49.000+0000,0.0
@Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds,2009-04-07T05:19:53.000+0000,0.0
my whole body feels itchy and like its on fire,2009-04-07T05:19:57.000+0000,0.0
"@nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there.",2009-04-07T05:19:57.000+0000,0.0
@Kwesidei not the whole crew,2009-04-07T05:20:00.000+0000,0.0
Need a hug,2009-04-07T05:20:03.000+0000,0.0
"@LOLTrish hey long time no see! Yes.. Rains a bit ,only a bit LOL , I'm fine thanks , how's you ?",2009-04-07T05:20:03.000+0000,0.0
@Tatiana_K nope they didn't have it,2009-04-07T05:20:05.000+0000,0.0
@twittera que me muera ?,2009-04-07T05:20:09.000+0000,0.0


In [0]:
modelPath = "/FileStore/twitter_nbpipeline"

pipelineModel = PipelineModel.load(modelPath)

scored_tweets = pipelineModel.transform(transformed_stream)

query = scored_tweets.drop("rawPrediction", "probability", "features", "words", "words_filtered")\
    .writeStream\
    .format("memory")\
    .queryName("scored_tweets")\
    .outputMode("append")\
    .start()

In [0]:
%sql
select * from scored_tweets;

text,time,label,text_cleaned,words,words_filtered,features,rawPrediction,probability,prediction
"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D",2009-04-07T05:19:45.000+0000,0.0,"- awww, that's a bummer. you shoulda got david carr of third day to do it. ;d","List(awww, that, s, a, bummer, you, shoulda, got, david, carr, of, third, day, to, do, it, d)","List(awww, bummer, shoulda, got, david, carr, third, day, d)","Map(vectorType -> sparse, length -> 22150, indices -> List(2, 11, 72, 349, 737, 1074, 1787, 3377, 9539), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-73.10331329092534, -75.76040428183782))","Map(vectorType -> dense, length -> 2, values -> List(0.9344466971628318, 0.06555330283716827))",0.0
is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!,2009-04-07T05:19:49.000+0000,0.0,is upset that he can't update his facebook by texting it... and might cry as a result school today also. blah!,"List(is, upset, that, he, can, t, update, his, facebook, by, texting, it, and, might, cry, as, a, result, school, today, also, blah)","List(upset, update, facebook, texting, might, cry, result, school, today, also, blah)","Map(vectorType -> sparse, length -> 22150, indices -> List(7, 70, 174, 197, 425, 429, 440, 682, 1018, 1919, 2240), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-84.59414501008106, -89.98197577429755))","Map(vectorType -> dense, length -> 2, values -> List(0.9954489268877933, 0.004551073112206787))",0.0
@Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds,2009-04-07T05:19:53.000+0000,0.0,i dived many times for the ball. managed to save 50% the rest go out of bounds,"List(i, dived, many, times, for, the, ball, managed, to, save, 50, the, rest, go, out, of, bounds)","List(dived, many, times, ball, managed, save, 50, rest, go, bounds)","Map(vectorType -> sparse, length -> 22150, indices -> List(5, 216, 256, 370, 800, 981, 1171, 1578), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-63.04542955499372, -63.78005141899811))","Map(vectorType -> dense, length -> 2, values -> List(0.67581868825521, 0.32418131174479003))",0.0
my whole body feels itchy and like its on fire,2009-04-07T05:19:57.000+0000,0.0,my whole body feels itchy and like its on fire,"List(my, whole, body, feels, itchy, and, like, its, on, fire)","List(whole, body, feels, itchy, like, fire)","Map(vectorType -> sparse, length -> 22150, indices -> List(4, 331, 381, 705, 1043, 2814), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-47.0100517499005, -50.54793735282175))","Map(vectorType -> dense, length -> 2, values -> List(0.9717467190554288, 0.02825328094457132))",0.0
"@nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there.",2009-04-07T05:19:57.000+0000,0.0,"no, it's not behaving at all. i'm mad. why am i here? because i can't see you all over there.","List(no, it, s, not, behaving, at, all, i, m, mad, why, am, i, here, because, i, can, t, see, you, all, over, there)","List(behaving, m, mad, see)","Map(vectorType -> sparse, length -> 22150, indices -> List(0, 21, 493, 10190), values -> List(1.0, 1.0, 1.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-30.130673621754106, -31.075616317275248))","Map(vectorType -> dense, length -> 2, values -> List(0.720096976808941, 0.27990302319105903))",0.0
@Kwesidei not the whole crew,2009-04-07T05:20:00.000+0000,0.0,not the whole crew,"List(not, the, whole, crew)","List(whole, crew)","Map(vectorType -> sparse, length -> 22150, indices -> List(331, 2083), values -> List(1.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-17.983945382297364, -17.91495730669582))","Map(vectorType -> dense, length -> 2, values -> List(0.48275981823545605, 0.517240181764544))",1.0
Need a hug,2009-04-07T05:20:03.000+0000,0.0,need a hug,"List(need, a, hug)","List(need, hug)","Map(vectorType -> sparse, length -> 22150, indices -> List(35, 815), values -> List(1.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-14.601759941781173, -15.43562112134421))","Map(vectorType -> dense, length -> 2, values -> List(0.6971707364117425, 0.30282926358825757))",0.0
"@LOLTrish hey long time no see! Yes.. Rains a bit ,only a bit LOL , I'm fine thanks , how's you ?",2009-04-07T05:20:03.000+0000,0.0,"hey long time no see! yes.. rains a bit ,only a bit lol , i'm fine thanks , how's you ?","List(hey, long, time, no, see, yes, rains, a, bit, only, a, bit, lol, i, m, fine, thanks, how, s, you)","List(hey, long, time, see, yes, rains, bit, bit, lol, m, fine, thanks)","Map(vectorType -> sparse, length -> 22150, indices -> List(0, 12, 13, 21, 31, 76, 78, 88, 162, 423, 2560), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-80.42899760997152, -76.07727296338678))","Map(vectorType -> dense, length -> 2, values -> List(0.012720671663382397, 0.9872793283366176))",1.0
@Tatiana_K nope they didn't have it,2009-04-07T05:20:05.000+0000,0.0,nope they didn't have it,"List(nope, they, didn, t, have, it)","List(nope, didn)","Map(vectorType -> sparse, length -> 22150, indices -> List(69, 691), values -> List(1.0, 1.0))","Map(vectorType -> dense, length -> 2, values -> List(-14.79513117152748, -16.11864752311122))","Map(vectorType -> dense, length -> 2, values -> List(0.7897661408990221, 0.21023385910097786))",0.0
@twittera que me muera ?,2009-04-07T05:20:09.000+0000,0.0,que me muera ?,"List(que, me, muera)","List(que, muera)","Map(vectorType -> sparse, length -> 22150, indices -> List(2372), values -> List(1.0))","Map(vectorType -> dense, length -> 2, values -> List(-10.504218124777754, -10.58930814153326))","Map(vectorType -> dense, length -> 2, values -> List(0.5212596785128979, 0.4787403214871022))",0.0


In [0]:
%sql

select window(time,"30 seconds"), sum(if(prediction=1,1,0)) as positive, sum(if(prediction=0,1,0)) as negative from scored_tweets
where time > (select max(time) from scored_tweets) - INTERVAL 1 minutes
group by window(time,"30 seconds")

window,positive,negative
"List(2009-04-07T05:22:30.000+0000, 2009-04-07T05:23:00.000+0000)",1,7
"List(2009-04-07T05:23:00.000+0000, 2009-04-07T05:23:30.000+0000)",2,5
"List(2009-04-07T05:23:30.000+0000, 2009-04-07T05:24:00.000+0000)",1,0

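
The windowed query above keeps only tweets from the last minute (relative to the newest timestamp) and counts positive and negative predictions per 30-second tumbling window. A pure-Python sketch of the same logic follows. It runs on made-up sample rows, the function names are hypothetical, and it assumes windows are aligned to the Unix epoch.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(seconds=30)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def window_start(ts):
    """Round a timestamp down to the start of its 30-second window."""
    return ts - ((ts - EPOCH) % WINDOW)


def count_by_window(rows, lookback=timedelta(minutes=1)):
    """rows: iterable of (timestamp, prediction) pairs."""
    rows = list(rows)
    cutoff = max(ts for ts, _ in rows) - lookback
    counts = defaultdict(lambda: {"positive": 0, "negative": 0})
    for ts, prediction in rows:
        if ts <= cutoff:                  # mirrors: time > max(time) - 1 minute
            continue
        if prediction == 1.0:
            counts[window_start(ts)]["positive"] += 1
        elif prediction == 0.0:
            counts[window_start(ts)]["negative"] += 1
    return dict(counts)


def at(h, m, s):
    return datetime(2009, 4, 7, h, m, s, tzinfo=timezone.utc)


# Made-up rows for illustration only
rows = [(at(5, 21, 0), 0.0), (at(5, 22, 40), 0.0), (at(5, 22, 55), 1.0),
        (at(5, 23, 10), 0.0), (at(5, 23, 35), 1.0)]
summary = count_by_window(rows)
for start in sorted(summary):
    print(start.time(), summary[start])
```

The row at 05:21:00 is more than a minute older than the newest row, so it is excluded before the counts are taken.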