### Sentiment Analysis

Sentiment analysis, also known as opinion mining, is a type of text mining that finds and extracts subjective information from source material. It helps businesses gauge the social sentiment associated with their brand, product, or service while monitoring online discussions.<br>
This notebook builds machine learning models for sentiment classification using natural language processing, implemented with PySpark in Jupyter.<br>

In [1]:
import findspark
findspark.init()

Starting the Spark Session

In [2]:
import pyspark
#create SparkSession instance
from pyspark import SparkConf
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local[5]').config('spark.driver.memory','16g').appName('sentanaly').getOrCreate()

23/02/16 18:22:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


Import the required modules

In [3]:
#importing pyspark ml sql features
from pyspark.ml import Pipeline 
from pyspark.ml.feature import CountVectorizer,StringIndexer, RegexTokenizer,StopWordsRemover
from pyspark.sql.functions import col, udf,regexp_replace,isnull
from pyspark.sql.types import StringType,IntegerType
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

This is the Sentiment140 dataset. It contains 1,600,000 tweets extracted using the Twitter API.<br>
It contains the following 6 fields:
1. target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
2. ids: the id of the tweet (2087)
3. date: the date of the tweet (Sat May 16 23:58:44 UTC 2009)
4. flag: the query (lyx). If there is no query, then this value is NO_QUERY.
5. user: the user that tweeted (robotickilldozr)
6. text: the text of the tweet (Lyx is cool)<br>
The CSV file has no header row, so Spark names the columns `_c0` through `_c5` in the order listed above. We now load the dataset.

In [29]:
#read the csv containing twitter data
news_data = spark.read.csv('trainingsentimentdata.csv',header= False)
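
To make the six-field row layout concrete without starting Spark, the following minimal sketch parses one row with Python's standard `csv` module. The sample row is assembled from the example values in the field list above; its polarity value and the quoted-field layout are assumptions made for illustration, not values read from the file.

```python
import csv
import io

# Field names in the order given by the dataset description
FIELDS = ["target", "ids", "date", "flag", "user", "text"]

# Illustrative row only: the polarity "4" and the quoting style are assumptions
sample = '"4","2087","Sat May 16 23:58:44 UTC 2009","lyx","robotickilldozr","Lyx is cool"\n'

def parse_rows(handle):
    """Yield one dict per CSV row, keyed by the field names above."""
    for row in csv.reader(handle):
        yield dict(zip(FIELDS, row))

rows = list(parse_rows(io.StringIO(sample)))
print(rows[0]["target"], "|", rows[0]["text"])
```

Every value is read as a string, which matches the all-string schema that `printSchema()` reports for the Spark DataFrame.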

In [30]:
news_data.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)



In [31]:
#printing the data
news_data.show()

+---+----------+--------------------+--------+---------------+--------------------+
|_c0|       _c1|                 _c2|     _c3|            _c4|                 _c5|
+---+----------+--------------------+--------+---------------+--------------------+
|  0|1467810369|Mon Apr 06 22:19:...|NO_QUERY|_TheSpecialOne_|@switchfoot http:...|
|  0|1467810672|Mon Apr 06 22:19:...|NO_QUERY|  scotthamilton|is upset that he ...|
|  0|1467810917|Mon Apr 06 22:19:...|NO_QUERY|       mattycus|@Kenichan I dived...|
|  0|1467811184|Mon Apr 06 22:19:...|NO_QUERY|        ElleCTF|my whole body fee...|
|  0|1467811193|Mon Apr 06 22:19:...|NO_QUERY|         Karoli|@nationwideclass ...|
|  0|1467811372|Mon Apr 06 22:20:...|NO_QUERY|       joy_wolf|@Kwesidei not the...|
|  0|1467811592|Mon Apr 06 22:20:...|NO_QUERY|        mybirch|         Need a hug |
|  0|1467811594|Mon Apr 06 22:20:...|NO_QUERY|           coZZ|@LOLTrish hey  lo...|
|  0|1467811795|Mon Apr 06 22:20:...|NO_QUERY|2Hood4Hollywood|@Tatiana_K nop

In [8]:
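#optional: work on a small sample while experimenting
#news_data = news_data.limit(500)
#cache() needs parentheses; the bare attribute news_data.cache only returns the bound method
#news_data.cache()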

<bound method DataFrame.cache of DataFrame[_c0: string, _c1: string, _c2: string, _c3: string, _c4: string, _c5: string]>

Count the total number of rows in the dataset

In [32]:
#count the rows of dataset
news_data.count()

                                                                                

1600000

Select the tweet text (`_c5`) and the sentiment label (`_c0`) of each tweet

In [33]:
#pick only _c5 and _c0 columns and place in title_category
title_category = news_data.select("_c5","_c0")
title_category.show()

+--------------------+---+
|                 _c5|_c0|
+--------------------+---+
|@switchfoot http:...|  0|
|is upset that he ...|  0|
|@Kenichan I dived...|  0|
|my whole body fee...|  0|
|@nationwideclass ...|  0|
|@Kwesidei not the...|  0|
|         Need a hug |  0|
|@LOLTrish hey  lo...|  0|
|@Tatiana_K nope t...|  0|
|@twittera que me ...|  0|
|spring break in p...|  0|
|I just re-pierced...|  0|
|@caregiving I cou...|  0|
|@octolinz16 It it...|  0|
|@smarrison i woul...|  0|
|@iamjazzyfizzle I...|  0|
|Hollis' death sce...|  0|
|about to file taxes |  0|
|@LettyA ahh ive a...|  0|
|@FakerPattyPattz ...|  0|
+--------------------+---+
only showing top 20 rows



Define a helper function that counts the null values in each column

In [34]:
#function to count null values in each column
def null_value_count(df):
    null_columns_counts = []  #collects (column name, null count) pairs
    for k in df.columns:
        nullRows = df.where(col(k).isNull()).count()  #rows where column k is null
        if nullRows > 0:
            null_columns_counts.append((k, nullRows))
    return null_columns_counts  #empty list if no column contains nulls

Apply the helper function to the DataFrame `title_category`

In [35]:
null_columns_count_list = null_value_count(title_category)
#spark.createDataFrame(null_columns_count_list, ['Column_With_Null_Value', 'Null_Values_Count']).show()
null_columns_count_list

                                                                                

[]

# Cleaning the dataset

The returned list is empty, so neither column contains nulls. We still call `dropna()` as a safeguard.

In [36]:
#drop rows that contain null or NaN values
title_category = title_category.dropna()
title_category.count()
title_category.show(truncate=False)

                                                                                

+---------------------------------------------------------------------------------------------------------------------+---+
|_c5                                                                                                                  |_c0|
+---------------------------------------------------------------------------------------------------------------------+---+
|@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D  |0  |
|is upset that he can't update his Facebook by texting it... and might cry as a result  School today also. Blah!      |0  |
|@Kenichan I dived many times for the ball. Managed to save 50%  The rest go out of bounds                            |0  |
|my whole body feels itchy and like its on fire                                                                       |0  |
|@nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there.       |0  |
|@Kwesid

In [37]:
#new column names: 'Tweets' for the text (_c5) and 'Sentiment' for the label (_c0)
newColumns = ['Tweets','Sentiment']

#rename the columns positionally; toDF assigns the new names in order
title_category = title_category.toDF(*newColumns)
title_category.printSchema()
title_category.show()

root
 |-- Tweets: string (nullable = true)
 |-- Sentiment: string (nullable = true)

+--------------------+---------+
|              Tweets|Sentiment|
+--------------------+---------+
|@switchfoot http:...|        0|
|is upset that he ...|        0|
|@Kenichan I dived...|        0|
|my whole body fee...|        0|
|@nationwideclass ...|        0|
|@Kwesidei not the...|        0|
|         Need a hug |        0|
|@LOLTrish hey  lo...|        0|
|@Tatiana_K nope t...|        0|
|@twittera que me ...|        0|
|spring break in p...|        0|
|I just re-pierced...|        0|
|@caregiving I cou...|        0|
|@octolinz16 It it...|        0|
|@smarrison i woul...|        0|
|@iamjazzyfizzle I...|        0|
|Hollis' death sce...|        0|
|about to file taxes |        0|
|@LettyA ahh ive a...|        0|
|@FakerPattyPattz ...|        0|
+--------------------+---------+
only showing top 20 rows



Next, we remove the digits from the tweets

In [38]:
#remove runs of digits from the tweets (raw string so that \d is not treated as a Python escape)
title_category = title_category.withColumn("only_str",regexp_replace(col('Tweets'), r'\d+', ''))
title_category.select("Tweets","only_str").show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------+
|Tweets                                                                                                               |only_str                                                                                                             |
+---------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------+
|@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D  |@switchfoot http://twitpic.com/yzl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D    |
|is upset that he can't update his Facebook by t
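
The effect of the digit-removal step can be reproduced on a single string with Python's `re` module. This is a minimal sketch of the same pattern (`\d+`) applied outside Spark; the helper name `strip_digits` is hypothetical.

```python
import re

def strip_digits(text):
    """Remove every run of digits, mirroring regexp_replace(col, r'\d+', '')."""
    return re.sub(r"\d+", "", text)

tweet = "@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer."
print(strip_digits(tweet))
# -> @switchfoot http://twitpic.com/yzl - Awww, that's a bummer.
```

Note that the pattern also strips digits inside URLs and words, which is why `2y1zl` becomes `yzl` in the output above.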

Split the text into constituent words

In [39]:
#split the text to words or tokens
regex_tokenizer = RegexTokenizer(inputCol="only_str", outputCol="words", pattern="\\W")
raw_words = regex_tokenizer.transform(title_category)
raw_words.show()

+--------------------+---------+--------------------+--------------------+
|              Tweets|Sentiment|            only_str|               words|
+--------------------+---------+--------------------+--------------------+
|@switchfoot http:...|        0|@switchfoot http:...|[switchfoot, http...|
|is upset that he ...|        0|is upset that he ...|[is, upset, that,...|
|@Kenichan I dived...|        0|@Kenichan I dived...|[kenichan, i, div...|
|my whole body fee...|        0|my whole body fee...|[my, whole, body,...|
|@nationwideclass ...|        0|@nationwideclass ...|[nationwideclass,...|
|@Kwesidei not the...|        0|@Kwesidei not the...|[kwesidei, not, t...|
|         Need a hug |        0|         Need a hug |      [need, a, hug]|
|@LOLTrish hey  lo...|        0|@LOLTrish hey  lo...|[loltrish, hey, l...|
|@Tatiana_K nope t...|        0|@Tatiana_K nope t...|[tatiana_k, nope,...|
|@twittera que me ...|        0|@twittera que me ...|[twittera, que, m...|
|spring break in p...|   

Remove the stop words from the tokenized list of words

In [40]:
#Removing the stop words from the list of words
remover = StopWordsRemover(inputCol="words", outputCol="filtered")
words_df = remover.transform(raw_words)
words_df.select("words","filtered").show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------+
|words                                                                                                                                |filtered                                                                                     |
+-------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------+
|[switchfoot, http, twitpic, com, yzl, awww, that, s, a, bummer, you, shoulda, got, david, carr, of, third, day, to, do, it, d]       |[switchfoot, http, twitpic, com, yzl, awww, bummer, shoulda, got, david, carr, third, day, d]|
|[is, upset, that, he, can, t, update, his, facebook, by, texting, it, and, migh
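
The tokenizing and stop-word steps can be imitated in plain Python to see what happens to one tweet. This is a rough sketch, not Spark's implementation: it lowercases the text, splits on non-word characters and drops empty tokens, then filters against `STOP_WORDS`, a small hypothetical subset chosen for this example (Spark's default English list is much longer).

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only
STOP_WORDS = {"that", "s", "a", "you", "of", "to", "do", "it"}

def tokenize(text):
    """Lowercase, split on non-word characters, and drop empty tokens."""
    return [tok for tok in re.split(r"\W", text.lower(), flags=re.ASCII) if tok]

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word set."""
    return [tok for tok in tokens if tok not in stop_words]

text = ("@switchfoot http://twitpic.com/yzl - Awww, that's a bummer.  "
        "You shoulda got David Carr of Third Day to do it. ;D")
words = tokenize(text)
filtered = remove_stop_words(words)
print(words)
print(filtered)
```

Because the underscore counts as a word character, handles such as `tatiana_k` stay in one piece, while apostrophes split `that's` into `that` and `s`.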

The filtered tokens are converted into numeric feature vectors after the split, in the Model Training and Prediction section.

# Partition the dataset into training and test datasets


In [41]:
#Partition the dataset into trainingData 80% and testData 20%
(trainingData, testData) = words_df.randomSplit([0.8, 0.2],seed = 11)
trainingData.show()
testData.show()

                                                                                

+--------------------+---------+--------------------+--------------------+--------------------+
|              Tweets|Sentiment|            only_str|               words|            filtered|
+--------------------+---------+--------------------+--------------------+--------------------+
|                 ...|        0|                 ...|[i, missed, the, ...|[missed, new, moo...|
|           FUCK YOU!|        0|           FUCK YOU!|         [fuck, you]|              [fuck]|
|          i want ...|        0|          i want ...|[i, want, some, b...|[want, ben, amp, ...|
|        my head f...|        0|        my head f...|[my, head, feels,...|[head, feels, lik...|
|        my heart ...|        0|        my heart ...|[my, heart, hurts...|[heart, hurts, ba...|
|      this weeken...|        0|      this weeken...|[this, weekend, h...|[weekend, sucked,...|
|            #canucks|        0|            #canucks|           [canucks]|           [canucks]|
|     &lt;- but mu...|        0|     &lt


+--------------------+---------+--------------------+--------------------+--------------------+
|              Tweets|Sentiment|            only_str|               words|            filtered|
+--------------------+---------+--------------------+--------------------+--------------------+
|       i really2 ...|        0|       i really d...|[i, really, don, ...|[really, like, co...|
|      My current ...|        0|      My current ...|[my, current, hea...|[current, headset...|
|               angry|        0|               angry|             [angry]|             [angry]|
|     jb isnt show...|        0|     jb isnt show...|[jb, isnt, showin...|[jb, isnt, showin...|
|     ok thats it ...|        0|     ok thats it ...|[ok, thats, it, y...|    [ok, thats, win]|
|     ...lonely night|        0|     ...lonely night|     [lonely, night]|     [lonely, night]|
|    I just cut my...|        0|    I just cut my...|[i, just, cut, my...|[cut, beard, grow...|
|       wompppp wompp|        0|       w
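
A seeded random split can be pictured as one random draw per row. The sketch below is a plain-Python approximation under that assumption, not Spark's actual sampler; `bernoulli_split` is a hypothetical helper name.

```python
import random

def bernoulli_split(rows, train_fraction=0.8, seed=11):
    """Send each row to train with probability train_fraction, else to test."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(1000))
train, test = bernoulli_split(rows)
print(len(train), len(test))  # close to 800 and 200, but usually not exact
```

With per-row sampling the 80/20 proportions hold only approximately, so the two parts should be counted if their exact sizes matter.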

                                                                                

# Model Training and Prediction

In [42]:
from pyspark.ml.feature import HashingTF, IDF

#HashingTF maps each list of terms to a fixed-length term-frequency vector using the hashing trick
hashtf = HashingTF(numFeatures=2**16, inputCol="filtered", outputCol='tf')

#IDF (inverse document frequency) down-weights terms that occur in many documents
idf = IDF(inputCol='tf', outputCol="features", minDocFreq=5) #minDocFreq: ignore terms that appear in fewer than 5 documents
pipeline = Pipeline(stages=[hashtf, idf])

pipelineFit = pipeline.fit(trainingData)
train_df = pipelineFit.transform(trainingData)
val_df = pipelineFit.transform(testData)
train_df.show(5)

23/02/16 19:02:13 WARN DAGScheduler: Broadcasting large task binary with size 1082.5 KiB

+--------------------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
|              Tweets|Sentiment|            only_str|               words|            filtered|                  tf|            features|
+--------------------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                 ...|        0|                 ...|[i, missed, the, ...|[missed, new, moo...|(65536,[4495,2429...|(65536,[4495,2429...|
|           FUCK YOU!|        0|           FUCK YOU!|         [fuck, you]|              [fuck]|(65536,[40503],[1...|(65536,[40503],[5...|
|          i want ...|        0|          i want ...|[i, want, some, b...|[want, ben, amp, ...|(65536,[13007,352...|(65536,[13007,352...|
|        my head f...|        0|        my head f...|[my, head, feels,...|[head, feels, lik...|(65536,[2548,1165...|(65536,[2548,1165...|
|        my heart ...|        0|  
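
The two feature stages can be sketched in plain Python to show what the `tf` and `features` columns contain. This is an illustration under stated assumptions: `zlib.crc32` stands in for the hash function (Spark uses a different one, so bucket indices will not match the output above), and the IDF weight follows the formula log((m + 1) / (df + 1)) given in the Spark documentation, where m is the number of documents and df the number of documents containing the term. All function names are hypothetical.

```python
import math
import zlib
from collections import Counter

NUM_FEATURES = 2 ** 16  # same vector size as in the pipeline above

def bucket(token, num_features=NUM_FEATURES):
    """Map a token to a bucket index (crc32 is a stand-in hash)."""
    return zlib.crc32(token.encode("utf-8")) % num_features

def hashing_tf(tokens, num_features=NUM_FEATURES):
    """Sparse term-frequency vector as {bucket index: count}."""
    return dict(Counter(bucket(tok, num_features) for tok in tokens))

def fit_idf(tf_vectors, min_doc_freq=0):
    """IDF weight per bucket; buckets seen in too few documents get weight 0."""
    m = len(tf_vectors)
    doc_freq = Counter()
    for vec in tf_vectors:
        doc_freq.update(vec.keys())  # each document counts once per bucket
    return {idx: math.log((m + 1) / (df + 1)) if df >= min_doc_freq else 0.0
            for idx, df in doc_freq.items()}

def tf_idf(vec, idf):
    """Scale each term frequency by its IDF weight."""
    return {idx: tf * idf.get(idx, 0.0) for idx, tf in vec.items()}

docs = [["need", "a", "hug"], ["need", "sleep"], ["need"]]
tf_vectors = [hashing_tf(doc) for doc in docs]
idf_weights = fit_idf(tf_vectors)
print(tf_idf(tf_vectors[0], idf_weights))
```

A term that occurs in every document, such as `need` here, receives a weight of zero, which is the sense in which IDF favors rare terms over common ones.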

                                                                                

In [43]:
#rename 'Sentiment' to 'label' and cast it from string to integer
#note: the cast keeps the original polarity values (e.g. 0 and 4); it does not re-index them
train_df=train_df.withColumnRenamed('Sentiment', 'label')
train_df=train_df.withColumn("label",train_df.label.cast('int'))
val_df=val_df.withColumnRenamed('Sentiment', 'label')
val_df=val_df.withColumn("label",val_df.label.cast('int'))
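
Casting keeps the raw polarity codes, so a label column holding 0 and 4 spans five possible integer values even though only two classes occur. The sketch below shows one way to re-index such labels to consecutive values; it assumes that only the codes 0 and 4 are present, and `index_labels` is a hypothetical helper, not part of the notebook.

```python
def index_labels(labels):
    """Map the distinct label values to 0.0, 1.0, ... in sorted order."""
    distinct = sorted(set(labels))
    mapping = {original: float(i) for i, original in enumerate(distinct)}
    return [mapping[value] for value in labels], mapping

raw = [0, 4, 4, 0, 4]  # assumed polarity codes: 0 = negative, 4 = positive
indexed, mapping = index_labels(raw)
print(mapping)   # -> {0: 0.0, 4: 1.0}
print(indexed)   # -> [0.0, 1.0, 1.0, 0.0, 1.0]
print(max(raw) + 1, "values spanned by the raw codes,", len(mapping), "classes after indexing")
```

In PySpark a comparable mapping could be expressed with `StringIndexer` (imported earlier) or a `when(...).otherwise(...)` column expression. Whether re-indexing would change the accuracies reported in this notebook has not been verified here.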

## Logistic Regression

In [44]:
from pyspark.ml.classification import LogisticRegression
## Fitting the model
lr = LogisticRegression(featuresCol = 'features', labelCol = 'label', maxIter=10)
lrModel = lr.fit(train_df)
lrPreds = lrModel.transform(val_df)

23/02/16 19:02:21 WARN DAGScheduler: Broadcasting large task binary with size 1086.6 KiB
23/02/16 19:02:48 WARN DAGScheduler: Broadcasting large task binary with size 1088.1 KiB
23/02/16 19:03:13 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
23/02/16 19:03:13 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
23/02/16 19:03:16 WARN DAGScheduler: Broadcasting large task binary with size 1088.1 KiB
23/02/16 19:03:18 WARN DAGScheduler: Broadcasting large task binary with size 1088.1 KiB
23/02/16 19:03:19 WARN DAGScheduler: Broadcasting large task binary with size 1088.1 KiB
23/02/16 19:03:20 WARN DAGScheduler: Broadcasting large task binary with size 1088.1 KiB
23/02/16 19:03:21 WARN DAGScheduler: Broadcasting large task binary with size 1088.1 KiB
23/02/16 19:03:22 WARN DAGScheduler: Broadcasting large task binary with size 1088.1 KiB
23/02/16 19:03:23 WARN DAGScheduler: Broadcasting large task binary with size

In [45]:
## Evaluating the model
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
lr_accuracy = evaluator.evaluate(lrPreds)
print("Accuracy of Logistic Regression is = %g"% (lr_accuracy))

23/02/16 19:03:28 WARN DAGScheduler: Broadcasting large task binary with size 3.6 MiB

Accuracy of Logistic Regression is = 0.763975
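
The accuracy metric is the fraction of rows whose prediction equals the label, and the test error is its complement. A minimal sketch on made-up values (the lists below are illustrative, not taken from the model output):

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match their labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equally long")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

preds = [0.0, 0.0, 1.0, 0.0]   # illustrative predictions
labels = [0, 0, 0, 0]          # illustrative labels
acc = accuracy(preds, labels)
print("accuracy = %g, test error = %g" % (acc, 1.0 - acc))
# -> accuracy = 0.75, test error = 0.25
```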


                                                                                

## Decision Tree Model

In [46]:
from pyspark.ml.classification import DecisionTreeClassifier
## Fitting the model
dt = DecisionTreeClassifier(featuresCol = 'features', labelCol = 'label', maxDepth = 3)
dtModel = dt.fit(train_df)
dtPreds = dtModel.transform(val_df)

23/02/16 19:05:10 WARN DAGScheduler: Broadcasting large task binary with size 1086.1 KiB
23/02/16 19:05:17 WARN DAGScheduler: Broadcasting large task binary with size 1086.1 KiB
23/02/16 19:05:42 WARN DAGScheduler: Broadcasting large task binary with size 1753.2 KiB
23/02/16 19:06:26 WARN DAGScheduler: Broadcasting large task binary with size 2.3 MiB
23/02/16 19:06:58 WARN MemoryStore: Not enough space to cache rdd_460_3 in memory! (computed 1270.7 MiB so far)
23/02/16 19:06:58 WARN BlockManager: Persisting block rdd_460_3 to disk instead.
23/02/16 19:06:58 WARN MemoryStore: Not enough space to cache rdd_460_0 in memory! (computed 1270.7 MiB so far)
23/02/16 19:06:58 WARN BlockManager: Persisting block rdd_460_0 to disk instead.
23/02/16 19:06:58 WARN MemoryStore: Not enough space to cache rdd_460_2 in memory! (computed 1270.7 MiB so far)
23/02/16 19:06:58 WARN BlockManager: Persisting block rdd_460_2 to disk instead.
23/02/16 19:06:59 WARN MemoryStore: Not enough space to cache rdd_46

In [47]:
## Evaluating the model
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
dt_accuracy = evaluator.evaluate(dtPreds)
#Accuracy of Decision Tree
print("Accuracy of Decision Trees is = %g"% (dt_accuracy))

23/02/16 19:39:40 WARN DAGScheduler: Broadcasting large task binary with size 1106.9 KiB

Accuracy of Decision Trees is = 0.530731




## Naive Bayes Model

In [48]:
#apply the multinomial Naive Bayes algorithm
nb = NaiveBayes(modelType="multinomial",labelCol="label", featuresCol="features")
nbModel = nb.fit(train_df)
#get predictions by applying the fitted model to the test data
nb_predictions = nbModel.transform(val_df)

23/02/16 19:39:59 WARN DAGScheduler: Broadcasting large task binary with size 1091.4 KiB
23/02/16 19:40:26 WARN DAGScheduler: Broadcasting large task binary with size 1073.4 KiB
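
For intuition about what a multinomial Naive Bayes model estimates, here is a small self-contained sketch on raw token counts with additive smoothing. It is a textbook-style illustration under simplifying assumptions (toy data, token counts instead of TF-IDF weights), not Spark's implementation; all names and documents are hypothetical.

```python
import math
from collections import defaultdict

def train_multinomial_nb(docs, labels, smoothing=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    classes = sorted(set(labels))
    vocab = sorted({tok for doc in docs for tok in doc})
    doc_counts = {c: 0 for c in classes}
    term_counts = {c: defaultdict(int) for c in classes}
    for doc, label in zip(docs, labels):
        doc_counts[label] += 1
        for tok in doc:
            term_counts[label][tok] += 1
    n_docs = len(docs)
    log_prior = {c: math.log((doc_counts[c] + smoothing) /
                             (n_docs + smoothing * len(classes)))
                 for c in classes}
    log_lik = {}
    for c in classes:
        total = sum(term_counts[c].values()) + smoothing * len(vocab)
        log_lik[c] = {tok: math.log((term_counts[c][tok] + smoothing) / total)
                      for tok in vocab}
    return log_prior, log_lik

def predict(doc, log_prior, log_lik):
    """Return the class with the highest log posterior; unseen tokens are skipped."""
    scores = {c: log_prior[c] + sum(log_lik[c][tok] for tok in doc if tok in log_lik[c])
              for c in log_prior}
    return max(scores, key=scores.get)

# Toy training data: 0 = negative, 1 = positive
docs = [["need", "hug"], ["sad", "angry"], ["love", "great"], ["great", "day"]]
labels = [0, 0, 1, 1]
log_prior, log_lik = train_multinomial_nb(docs, labels)
print(predict(["angry", "sad"], log_prior, log_lik))
print(predict(["great"], log_prior, log_lik))
```

The model scores each class by adding the log prior to the log likelihoods of the observed tokens, which is why the class labels and the feature values fed to it both influence the result.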
                                                                                

In [49]:
nb_predictions.select("prediction", "label", "features").show(100)

23/02/16 19:40:27 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB

+----------+-----+--------------------+
|prediction|label|            features|
+----------+-----+--------------------+
|       0.0|    0|(65536,[11650,191...|
|       0.0|    0|(65536,[1198,2001...|
|       0.0|    0|(65536,[465],[7.4...|
|       0.0|    0|(65536,[2284,2108...|
|       1.0|    0|(65536,[1589,1034...|
|       0.0|    0|(65536,[11828,588...|
|       0.0|    0|(65536,[10372,160...|
|       1.0|    0|(65536,[7626,3817...|
|       0.0|    0|(65536,[4570,4832...|
|       0.0|    0|(65536,[65018],[1...|
|       0.0|    0|(65536,[6083,7173...|
|       0.0|    0|(65536,[19153,605...|
|       0.0|    0|(65536,[21823,241...|
|       0.0|    0|(65536,[34288],[4...|
|       0.0|    0|(65536,[32656,417...|
|       0.0|    0|(65536,[2331,2743...|
|       0.0|    0|(65536,[10077,143...|
|       0.0|    0|(65536,[17625,409...|
|       0.0|    0|(65536,[2731,6589...|
|       1.0|    0|(65536,[12001,382...|
|       0.0|    0|(65536,[11985,163...|
|       1.0|    0|(65536,[1198,9818...|


                                                                                

In [50]:
#evaluate and print the accuracy
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
nb_accuracy = evaluator.evaluate(nb_predictions)
print("Accuracy of NaiveBayes is = %g"% (nb_accuracy))
print("Test Error of NaiveBayes = %g " % (1.0 - nb_accuracy))

23/02/16 19:40:34 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB

Accuracy of NaiveBayes is = 0.373336
Test Error of NaiveBayes = 0.626664 




In [51]:
spark.stop()