# Twitter Sentiment Analysis with PySpark
# Model Training

The first step in any Apache Spark program is to create a SparkContext. The SparkContext is the connection to a Spark cluster: it tells Spark how and where to access the cluster and is needed for every operation we want to execute on it. In the cell below it is obtained from a SparkSession.

In [40]:
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql import SQLContext, SparkSession
import warnings

SCC_CHECKPOINT_PATH = "/Users/anujchaudhari/Desktop/256/project/samples/twitter_streaming/checkpoint"
STREAMING_SOCKET_IP = "192.168.0.100"
STREAMING_SOCKET_PORT = 5555
STREAMING_TIME_INTERVAL = 2

try:
    # getOrCreate() returns the active session or starts a new one. The master
    # printed below is local[*], i.e. all local cores (4 CPUs on my laptop).
    spark = SparkSession.builder.appName("twitter").getOrCreate()
    sc = spark.sparkContext
    sqlContext = SQLContext(sc)

    print("Just created a SparkContext")
    
except ValueError:
    warnings.warn("SparkContext already exists in this scope")
    
    

# Create Spark Streaming Context

ssc = StreamingContext(sc, STREAMING_TIME_INTERVAL)
ssc.checkpoint(SCC_CHECKPOINT_PATH)
socket_stream = ssc.socketTextStream(STREAMING_SOCKET_IP, STREAMING_SOCKET_PORT)
lines = socket_stream.window(STREAMING_TIME_INTERVAL)

print("SparkContext Master: " + sc.master)

Just created a SparkContext
SparkContext Master: local[*]


### Load Data From 'clean_tweet.csv'

In [41]:
# Read the CSV with Spark's built-in reader, taking column names from the header row
df = spark.read.csv('clean_tweet.csv', header=True, inferSchema=True)

In [42]:
df.show(5)

+---+--------------------+------+
|_c0|                text|target|
+---+--------------------+------+
|  0|awww that s a bum...|     0|
|  1|is upset that he ...|     0|
|  2|i dived many time...|     0|
|  3|my whole body fee...|     0|
|  4|no it s not behav...|     0|
+---+--------------------+------+
only showing top 5 rows



In [43]:
df = df.dropna()

In [44]:
df.count()

1596753

After loading the data as a Spark DataFrame, we can take a peek at it by calling .show(), which plays the same role as pandas' .head(). After dropping rows with missing values, a bit fewer than 1.6 million tweets remain (1,596,753). I will split them into three parts: training, validation, and test. With this many entries, 1% each for the validation and test sets is enough to evaluate the models.
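As a quick sanity check on the split, the expected partition sizes follow directly from the row count above. This is a plain-Python sketch, not part of the original notebook. The variable names are illustrative, and the real sizes returned by `randomSplit` will differ slightly because each row is assigned to a partition at random.

```python
# Expected partition sizes for a 98% / 1% / 1% split of the row count
# reported by df.count() above. Names here are illustrative only.
total_rows = 1596753
weights = {"train": 0.98, "validation": 0.01, "test": 0.01}

expected_sizes = {name: int(round(total_rows * w)) for name, w in weights.items()}

for name in ("train", "validation", "test"):
    print("{0}: about {1} rows".format(name, expected_sizes[name]))
```

So each of the validation and test sets should hold roughly 16,000 tweets.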

In [45]:
(train_set, val_set, test_set) = df.randomSplit([0.98, 0.01, 0.01], seed = 2000)

In [46]:
test_set.show(5)

+---+--------------------+------+
|_c0|                text|target|
+---+--------------------+------+
|212|but this is canad...|     0|
|474|cant see the flow...|     0|
|488|have watched that...|     0|
|544|i know exactly ho...|     0|
|752|still procrastina...|     0|
+---+--------------------+------+
only showing top 5 rows



## N-gram Implementation

I had to use VectorAssembler in the pipeline to combine the features obtained from each n-gram size (unigrams, bigrams, and trigrams) into a single feature vector.
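To make the idea concrete, here is a minimal plain-Python sketch (no Spark required) of what the n-gram step produces for one tokenized tweet, and how per-n-gram feature vectors end up side by side in one vector. The helper `ngrams`, the example sentence, and the toy vectors are made up for illustration; Spark's `NGram` and `VectorAssembler` do the equivalent work on DataFrame columns.

```python
def ngrams(tokens, n):
    """Return the space-joined n-grams of consecutive tokens (empty if too few tokens)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Made-up example tweet, already lower-cased and split on whitespace
tokens = "the movie was really good".split()

# One list of n-grams per n, mirroring the 1_grams, 2_grams, 3_grams columns
grams_by_n = {n: ngrams(tokens, n) for n in range(1, 4)}
for n in sorted(grams_by_n):
    print(n, grams_by_n[n])

# Toy per-n-gram feature vectors (values are arbitrary placeholders)
unigram_vec = [1.0, 0.0, 2.0]
bigram_vec = [0.0, 1.0]
trigram_vec = [1.0]

# Assembling amounts to concatenating the vectors in a fixed order
features = unigram_vec + bigram_vec + trigram_vec
print(features)
```

In the pipeline below, each of the three `CountVectorizer` stages is capped at `vocabSize=7260`, so the assembled feature vector has at most 3 × 7260 = 21,780 entries.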

In [47]:
from pyspark.ml.feature import NGram, VectorAssembler
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, CountVectorizer
from pyspark.ml.feature import StringIndexer
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

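def build_ngrams(inputCol=["text","target"], n=3):
    # Pipeline: tokenize -> 1..n-grams -> term counts -> TF-IDF -> assemble -> index label -> LR.
    # Note: inputCol is not used; the column names "text" and "target" are hard-coded below.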
    tokenizer = [Tokenizer(inputCol="text", outputCol="words")]
    
    ngrams = [
        NGram(n=i, inputCol="words", outputCol="{0}_grams".format(i))
        for i in range(1, n + 1)
    ]

    cv = [
        CountVectorizer(vocabSize=7260, inputCol="{0}_grams".format(i),
                        outputCol="{0}_tf".format(i))
        for i in range(1, n + 1)
    ]
    
    idf = [IDF(inputCol="{0}_tf".format(i), outputCol="{0}_tfidf".format(i), minDocFreq=5) 
           for i in range(1, n + 1)]

    assembler = [VectorAssembler(
        inputCols=["{0}_tfidf".format(i) for i in range(1, n + 1)],
        outputCol="features"
    )]
    
    label_stringIdx = [StringIndexer(inputCol = "target", outputCol = "label")]
    
    lr = [LogisticRegression(maxIter=100)]
    
    return Pipeline(stages = tokenizer + ngrams + cv + idf + assembler + label_stringIdx + lr)
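
The effect of `minDocFreq=5` in the IDF stages can be illustrated with a small sketch. It assumes the smoothed formula given in the Spark ML documentation, idf = log((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents containing the term, with terms that appear in fewer than `minDocFreq` documents receiving a weight of 0. The function `idf_weight` and the corpus size used here are hypothetical.

```python
import math

def idf_weight(num_docs, doc_freq, min_doc_freq=0):
    """Smoothed IDF: log((m + 1) / (df + 1)), or 0.0 when df is below min_doc_freq."""
    if doc_freq < min_doc_freq:
        return 0.0
    return math.log((num_docs + 1.0) / (doc_freq + 1.0))

num_docs = 1000  # hypothetical corpus size, for illustration only

# A term seen in only 3 documents is zeroed out by min_doc_freq=5
print(idf_weight(num_docs, 3, min_doc_freq=5))

# Rarer terms (above the cutoff) get larger weights than common ones
print(idf_weight(num_docs, 10, min_doc_freq=5))
print(idf_weight(num_docs, 500, min_doc_freq=5))
```

Very rare n-grams therefore contribute nothing to the features, while frequent ones are down-weighted relative to moderately rare ones.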


### Model Training and Validation Set Prediction Accuracy 

In [48]:

pipeline = build_ngrams()

trigram_pipelineFit = pipeline.fit(train_set)

predictions = trigram_pipelineFit.transform(val_set)

accuracy = predictions.filter(predictions.label == predictions.prediction).count() / float(val_set.count())

print("Accuracy Score: {0:.4f}".format(accuracy))


Accuracy Score: 0.8136
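
The accuracy above is the fraction of validation rows whose predicted label equals the true label. A toy sketch with made-up (label, prediction) pairs shows the same computation in plain Python; the data and variable names are illustrative only.

```python
# Made-up (label, prediction) pairs standing in for the predictions DataFrame
toy_rows = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 0.0)]

# Count the rows where the prediction matches the label, then divide by the total
correct = sum(1 for label, prediction in toy_rows if label == prediction)
toy_accuracy = correct / float(len(toy_rows))

print("Accuracy Score: {0:.4f}".format(toy_accuracy))
```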


### Save Trained Model for Future Use

In [36]:
trigram_pipelineFit.save("Model_Twitter_Sentiment")

### Load Trained Model for Evaluation

In [49]:
from pyspark.ml import PipelineModel

pipeline = PipelineModel.load("Model_Twitter_Sentiment")

### Test Set Prediction 

In [50]:
test_predictions = pipeline.transform(test_set)
test_accuracy = test_predictions.filter(test_predictions.label == test_predictions.prediction).count() / float(test_set.count())

# print accuracy
print("Accuracy Score: {0:.4f}".format(test_accuracy))

Accuracy Score: 0.8135
