# Sentiment Analysis

Sentiment analysis, also known as opinion mining, is a type of text mining that identifies and extracts subjective information from source material. It helps businesses gauge the social sentiment around their brand, product, or service while monitoring online discussions.

In [1]:
import findspark
findspark.init()

Starting the Spark Session

In [2]:
import pyspark
#create SparkSession instance
from pyspark import SparkConf
from pyspark.sql import SparkSession
spark = SparkSession.builder.config('spark.executor.memory','16g').appName('sentanaly').getOrCreate()

23/01/20 19:34:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


Import the required modules

In [3]:
#importing pyspark ml sql features
from pyspark.ml import Pipeline 
from pyspark.ml.feature import CountVectorizer,StringIndexer, RegexTokenizer,StopWordsRemover
from pyspark.sql.functions import col, udf,regexp_replace,isnull
from pyspark.sql.types import StringType,IntegerType
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

Now we load the dataset, which contains tweets labelled with a sentiment value: 0 represents negative sentiment and 1 represents positive sentiment.

In [4]:
import numpy as np

In [5]:
#read the csv containing twitter data
news_data = spark.read.csv('trainingsentimentdata.csv',header= False)
#printing the data
news_data.printSchema()
news_data.show()
news_data = news_data.limit(500)
news_data.cache() #cache the 500-row sample for reuse



root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)

+---+----------+--------------------+--------+---------------+--------------------+
|_c0|       _c1|                 _c2|     _c3|            _c4|                 _c5|
+---+----------+--------------------+--------+---------------+--------------------+
|  0|1467810369|Mon Apr 06 22:19:...|NO_QUERY|_TheSpecialOne_|@switchfoot http:...|
|  0|1467810672|Mon Apr 06 22:19:...|NO_QUERY|  scotthamilton|is upset that he ...|
|  0|1467810917|Mon Apr 06 22:19:...|NO_QUERY|       mattycus|@Kenichan I dived...|
|  0|1467811184|Mon Apr 06 22:19:...|NO_QUERY|        ElleCTF|my whole body fee...|
|  0|1467811193|Mon Apr 06 22:19:...|NO_QUERY|         Karoli|@nationwideclass ...|
|  0|1467811372|Mon Apr 06 22:20:...|NO_QUERY|       joy_wolf|@Kwesidei not the...|
|  0|1467811592|Mon Apr 06 2

DataFrame[_c0: string, _c1: string, _c2: string, _c3: string, _c4: string, _c5: string]

We can check the total number of items in the dataset

In [6]:
#count the rows of dataset
news_data.count()

500
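The row count alone does not show whether both sentiment classes are present. Because `limit(500)` keeps only a leading slice of the data, a file that is sorted by label can yield a single-class sample. The plain-Python sketch below illustrates the effect on hypothetical toy data and contrasts it with drawing the same number of rows from each label. The `rows` list and the helper `stratified_sample` are assumptions made for illustration and are not part of the notebook.

```python
import random
from collections import Counter

# Hypothetical toy data: (label, text) pairs sorted by label, like a file
# in which all negative rows come before the positive ones.
rows = [(0, f"neg tweet {i}") for i in range(1000)] + \
       [(1, f"pos tweet {i}") for i in range(1000)]

# Keeping only the first 500 rows of a label-sorted file
head = rows[:500]
head_counts = Counter(label for label, _ in head)

def stratified_sample(rows, per_label, seed=11):
    """Draw per_label rows from each label, using a fixed seed."""
    rng = random.Random(seed)
    by_label = {}
    for label, text in rows:
        by_label.setdefault(label, []).append((label, text))
    sample = []
    for label in sorted(by_label):
        sample.extend(rng.sample(by_label[label], per_label))
    return sample

balanced = stratified_sample(rows, per_label=250)
balanced_counts = Counter(label for label, _ in balanced)

print(head_counts)      # only label 0 is present
print(balanced_counts)  # 250 rows of each label
```

In Spark, the per-label counts of a DataFrame can be inspected with `groupBy` on the label column followed by `count()`.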

We select the text of each tweet (column _c5) and its sentiment label (column _c0)

In [7]:
#pick only _c5 and _c0 columns and place in title_category
title_category = news_data.select("_c5","_c0")
title_category.show()

+--------------------+---+
|                 _c5|_c0|
+--------------------+---+
|@switchfoot http:...|  0|
|is upset that he ...|  0|
|@Kenichan I dived...|  0|
|my whole body fee...|  0|
|@nationwideclass ...|  0|
|@Kwesidei not the...|  0|
|         Need a hug |  0|
|@LOLTrish hey  lo...|  0|
|@Tatiana_K nope t...|  0|
|@twittera que me ...|  0|
|spring break in p...|  0|
|I just re-pierced...|  0|
|@caregiving I cou...|  0|
|@octolinz16 It it...|  0|
|@smarrison i woul...|  0|
|@iamjazzyfizzle I...|  0|
|Hollis' death sce...|  0|
|about to file taxes |  0|
|@LettyA ahh ive a...|  0|
|@FakerPattyPattz ...|  0|
+--------------------+---+
only showing top 20 rows



The following custom function counts the null values in each column

In [8]:
#function to count null values in each column
def null_value_count(df):
  null_columns_counts = [] #list of (column, null count) pairs
  for k in df.columns:
    nullRows = df.where(col(k).isNull()).count() #count null rows in column k
    if nullRows > 0:
      null_columns_counts.append((k, nullRows))
  return null_columns_counts #return the columns that contain nulls, with their counts

We apply the custom function to the data frame title_category

In [9]:
null_columns_count_list = null_value_count(title_category)
#spark.createDataFrame(null_columns_count_list, ['Column_With_Null_Value', 'Null_Values_Count']).show()

# Cleaning the dataset

Now we can drop the null values

In [10]:
#drop rows that contain null values
title_category = title_category.dropna()
title_category.count()
title_category.show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------+---+
|_c5                                                                                                                  |_c0|
+---------------------------------------------------------------------------------------------------------------------+---+
|@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D  |0  |
|is upset that he can't update his Facebook by texting it... and might cry as a result  School today also. Blah!      |0  |
|@Kenichan I dived many times for the ball. Managed to save 50%  The rest go out of bounds                            |0  |
|my whole body feels itchy and like its on fire                                                                       |0  |
|@nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there.       |0  |
|@Kwesid

In [11]:
from functools import reduce

#current column names taken from the csv (_c5 and _c0)
oldColumns = title_category.schema.names
#new column names 'Tweets' and 'Sentiment'
newColumns = ['Tweets','Sentiment']

#rename each old column to its new name, one column at a time
title_category = reduce(lambda title_category, idx: title_category.withColumnRenamed(oldColumns[idx], newColumns[idx]),range(len(oldColumns)), title_category)
title_category.printSchema()
title_category.show()

root
 |-- Tweets: string (nullable = true)
 |-- Sentiment: string (nullable = true)

+--------------------+---------+
|              Tweets|Sentiment|
+--------------------+---------+
|@switchfoot http:...|        0|
|is upset that he ...|        0|
|@Kenichan I dived...|        0|
|my whole body fee...|        0|
|@nationwideclass ...|        0|
|@Kwesidei not the...|        0|
|         Need a hug |        0|
|@LOLTrish hey  lo...|        0|
|@Tatiana_K nope t...|        0|
|@twittera que me ...|        0|
|spring break in p...|        0|
|I just re-pierced...|        0|
|@caregiving I cou...|        0|
|@octolinz16 It it...|        0|
|@smarrison i woul...|        0|
|@iamjazzyfizzle I...|        0|
|Hollis' death sce...|        0|
|about to file taxes |        0|
|@LettyA ahh ive a...|        0|
|@FakerPattyPattz ...|        0|
+--------------------+---------+
only showing top 20 rows



Now we can remove the digits from the tweets

In [12]:
#remove every run of digits from the tweets
title_category = title_category.withColumn("only_str",regexp_replace(col('Tweets'), r'\d+', ''))
title_category.select("Tweets","only_str").show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------+
|Tweets                                                                                                               |only_str                                                                                                             |
+---------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------+
|@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D  |@switchfoot http://twitpic.com/yzl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D    |
|is upset that he can't update his Facebook by t
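To see what the digit pattern matches, the same substitution can be tried on single strings in plain Python. This is a small sketch using the standard `re` module. The helper name `strip_digits` is hypothetical, and the first input is the URL from the first tweet shown above.

```python
import re

def strip_digits(text):
    # Remove every run of one or more digits, wherever it occurs
    return re.sub(r"\d+", "", text)

print(strip_digits("http://twitpic.com/2y1zl"))  # http://twitpic.com/yzl
print(strip_digits("Managed to save 50%"))       # Managed to save %
```

Digits are removed even when they sit inside a longer token, which is why the URL in the first tweet changes from `2y1zl` to `yzl`.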

Split the text into constituent words

In [13]:
#split the text to words or tokens
#https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.ml.feature.RegexTokenizer.html
regex_tokenizer = RegexTokenizer(inputCol="only_str", outputCol="words", pattern="\\W")
raw_words = regex_tokenizer.transform(title_category)
raw_words.show()

+--------------------+---------+--------------------+--------------------+
|              Tweets|Sentiment|            only_str|               words|
+--------------------+---------+--------------------+--------------------+
|@switchfoot http:...|        0|@switchfoot http:...|[switchfoot, http...|
|is upset that he ...|        0|is upset that he ...|[is, upset, that,...|
|@Kenichan I dived...|        0|@Kenichan I dived...|[kenichan, i, div...|
|my whole body fee...|        0|my whole body fee...|[my, whole, body,...|
|@nationwideclass ...|        0|@nationwideclass ...|[nationwideclass,...|
|@Kwesidei not the...|        0|@Kwesidei not the...|[kwesidei, not, t...|
|         Need a hug |        0|         Need a hug |      [need, a, hug]|
|@LOLTrish hey  lo...|        0|@LOLTrish hey  lo...|[loltrish, hey, l...|
|@Tatiana_K nope t...|        0|@Tatiana_K nope t...|[tatiana_k, nope,...|
|@twittera que me ...|        0|@twittera que me ...|[twittera, que, m...|
|spring break in p...|   
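The tokenizer output above can be approximated in plain Python: lowercase the text, split on every non-word character, and drop empty tokens. The sketch below assumes the RegexTokenizer defaults (pattern used as a delimiter, lowercasing on, minimum token length 1). The function `regex_tokenize` is a hypothetical stand-in, and Python's `\W` is Unicode-aware, so results may differ from Spark on non-ASCII text.

```python
import re

def regex_tokenize(text, pattern=r"\W", min_token_length=1):
    """Lowercase, split on the pattern, and drop tokens that are too short."""
    tokens = re.split(pattern, text.lower())
    return [t for t in tokens if len(t) >= min_token_length]

print(regex_tokenize("Need a hug "))      # ['need', 'a', 'hug']
print(regex_tokenize("@Tatiana_K nope"))  # ['tatiana_k', 'nope']
print(regex_tokenize("that's a bummer"))  # ['that', 's', 'a', 'bummer']
```

The underscore counts as a word character, so handles such as `tatiana_k` stay in one piece, while the apostrophe splits `that's` into `that` and `s`.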

Remove the stop words from the tokenized list of words

In [14]:
#Removing the stop words from the list of words
#https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.StopWordsRemover.html
remover = StopWordsRemover(inputCol="words", outputCol="filtered")
words_df = remover.transform(raw_words)
words_df.select("words","filtered").show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------+
|words                                                                                                                                |filtered                                                                                     |
+-------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------+
|[switchfoot, http, twitpic, com, yzl, awww, that, s, a, bummer, you, shoulda, got, david, carr, of, third, day, to, do, it, d]       |[switchfoot, http, twitpic, com, yzl, awww, bummer, shoulda, got, david, carr, third, day, d]|
|[is, upset, that, he, can, t, update, his, facebook, by, texting, it, and, migh
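Stop-word removal is a membership filter over each token list. The sketch below reproduces the first row of the output above in plain Python. `STOP_WORDS` is a tiny illustrative subset chosen for this example, not the full default English list that StopWordsRemover loads, and `remove_stop_words` is a hypothetical helper.

```python
# Illustrative subset only; the real default list is much longer
STOP_WORDS = {"that", "s", "a", "you", "of", "to", "do", "it"}

def remove_stop_words(words, stop_words=STOP_WORDS):
    """Keep the tokens that are not stop words (case-insensitive match)."""
    return [w for w in words if w.lower() not in stop_words]

words = ["switchfoot", "http", "twitpic", "com", "yzl", "awww", "that", "s",
         "a", "bummer", "you", "shoulda", "got", "david", "carr", "of",
         "third", "day", "to", "do", "it", "d"]
filtered = remove_stop_words(words)
print(filtered)
```

Fragments such as `http`, `com`, and `d` survive the filter, so a stricter cleaning step would be needed to drop URL pieces and emoticon leftovers.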

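The filtered tokens are converted into term-frequency vectors in the Model Training and Prediction section, after the dataset is split, so that the IDF weights are fit on the training data only.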

# Partition the dataset into training and test datasets


In [15]:
#Partition the dataset into trainingData 80% and testData 20%
(trainingData, testData) = words_df.randomSplit([0.8, 0.2],seed = 11)
trainingData.show()
testData.show()

+--------------------+---------+--------------------+--------------------+--------------------+
|              Tweets|Sentiment|            only_str|               words|            filtered|
+--------------------+---------+--------------------+--------------------+--------------------+
| Body Of Missing ...|        0| Body Of Missing ...|[body, of, missin...|[body, missing, n...|
| wonder if Jon lo...|        0| wonder if Jon lo...|[wonder, if, jon,...|[wonder, jon, los...|
|#3 woke up and wa...|        0|# woke up and was...|[woke, up, and, w...|[woke, accident, ...|
|&quot;On popular ...|        0|&quot;On popular ...|[quot, on, popula...|[quot, popular, m...|
|...and, India mis...|        0|...and, India mis...|[and, india, miss...|[india, missed, t...|
|@AmaNorris wow th...|        0|@AmaNorris wow th...|[amanorris, wow, ...|[amanorris, wow, ...|
|@Appomattox_News ...|        0|@Appomattox_News ...|[appomattox_news,...|[appomattox_news,...|
|@B_Barnett I did ...|        0|@B_Barne
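A weighted random split can be pictured as one random draw per row, with the weights deciding which split the row lands in. The plain-Python sketch below works that way. It is a simplified illustration under that assumption, not Spark's implementation, and the helper `random_split` is hypothetical. One consequence is that the split sizes are only approximately 80% and 20%.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split using a seeded uniform draw per row."""
    total = sum(weights)
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total          # cumulative, normalized weights
        bounds.append(acc)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point remainder
            if u < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

train, test = random_split(list(range(500)), [0.8, 0.2], seed=11)
print(len(train), len(test))  # close to 400 and 100, not necessarily exact
```

Fixing the seed makes the split repeatable, which is what `seed = 11` does in the cell above.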

# Model Training and Prediction

In [16]:
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.ml.feature import StringIndexer
from pyspark.ml import Pipeline

#tokenizer = Tokenizer(inputCol="text", outputCol="words")
#HashingTF maps a sequence of terms to a fixed-length vector of term frequencies using the hashing trick
#https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.HashingTF.html
hashtf = HashingTF(numFeatures=2**16, inputCol="filtered", outputCol='tf')

#IDF (Inverse Document Frequency) down-weights terms that appear in many documents
#https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.mllib.feature.IDF.html
idf = IDF(inputCol='tf', outputCol="features", minDocFreq=5) #minDocFreq: terms that appear in fewer than 5 documents get an IDF of 0
#label_stringIdx = StringIndexer(inputCol = "target", outputCol = "label")
pipeline = Pipeline(stages=[hashtf, idf])

pipelineFit = pipeline.fit(trainingData)
train_df = pipelineFit.transform(trainingData)
val_df = pipelineFit.transform(testData)
train_df.show(5)

                                                                                

+--------------------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
|              Tweets|Sentiment|            only_str|               words|            filtered|                  tf|            features|
+--------------------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
| Body Of Missing ...|        0| Body Of Missing ...|[body, of, missin...|[body, missing, n...|(65536,[731,3159,...|(65536,[731,3159,...|
| wonder if Jon lo...|        0| wonder if Jon lo...|[wonder, if, jon,...|[wonder, jon, los...|(65536,[19153,329...|(65536,[19153,329...|
|#3 woke up and wa...|        0|# woke up and was...|[woke, up, and, w...|[woke, accident, ...|(65536,[5660,7427...|(65536,[5660,7427...|
|&quot;On popular ...|        0|&quot;On popular ...|[quot, on, popula...|[quot, popular, m...|(65536,[178,1903,...|(65536,[178,1903,...|
|...and, India mis...|        0|..

23/01/20 19:34:53 WARN DAGScheduler: Broadcasting large task binary with size 1087.4 KiB
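The two pipeline stages can be mimicked in plain Python to show what ends up in the `tf` and `features` columns. In this sketch, `zlib.crc32` is a stand-in hash chosen for illustration (Spark's HashingTF uses MurmurHash3, so the bucket indices will differ), and the IDF weight follows the formula given in the Spark documentation, log((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents that contain the term. The helpers `hashing_tf`, `fit_idf`, and `tf_idf` and the toy `docs` are hypothetical.

```python
import math
import zlib
from collections import Counter

def hashing_tf(tokens, num_features=2**16):
    """Map tokens to {bucket index: term count} using a stable hash."""
    counts = Counter()
    for tok in tokens:
        # Stand-in hash for illustration; Spark uses MurmurHash3
        idx = zlib.crc32(tok.encode("utf-8")) % num_features
        counts[idx] += 1
    return dict(counts)

def fit_idf(tf_vectors, min_doc_freq=0):
    """idf = log((m + 1) / (df + 1)); buckets below min_doc_freq get 0."""
    m = len(tf_vectors)
    doc_freq = Counter()
    for vec in tf_vectors:
        doc_freq.update(vec.keys())   # each bucket counted once per document
    return {
        idx: (math.log((m + 1) / (df + 1)) if df >= min_doc_freq else 0.0)
        for idx, df in doc_freq.items()
    }

def tf_idf(vec, idf):
    """Scale each term count by the IDF weight of its bucket."""
    return {idx: tf * idf.get(idx, 0.0) for idx, tf in vec.items()}

# Toy documents; fit the IDF weights on them and transform the same documents
docs = [["need", "a", "hug"], ["need", "sleep"], ["hug", "hug", "please"]]
tf_vectors = [hashing_tf(d) for d in docs]
idf = fit_idf(tf_vectors, min_doc_freq=2)
features = [tf_idf(v, idf) for v in tf_vectors]
print(features)
```

As in the pipeline above, the IDF weights should be fit on the training vectors only and then applied unchanged to the test vectors.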


In [17]:
from pyspark.sql.types import IntegerType
train_df=train_df.withColumnRenamed('Sentiment', 'label')
train_df=train_df.withColumn("label",train_df.label.cast('int'))
val_df=val_df.withColumnRenamed('Sentiment', 'label')
val_df=val_df.withColumn("label",val_df.label.cast('int'))

## Logistic Regression

In [18]:
from pyspark.ml.classification import LogisticRegression
## Fitting the model
lr = LogisticRegression(featuresCol = 'features', labelCol = 'label', maxIter=10)
lrModel = lr.fit(train_df)
lrPreds = lrModel.transform(val_df)

23/01/20 19:34:54 WARN DAGScheduler: Broadcasting large task binary with size 1091.5 KiB
23/01/20 19:34:54 WARN Instrumentation: [ce702cde] All labels are the same value and fitIntercept=true, so the coefficients will be zeros. Training is not needed.


In [19]:
## Evaluating the model
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
lr_accuracy = evaluator.evaluate(lrPreds)
print("Accuracy of Logistic Regression is = %g"% (lr_accuracy))

23/01/20 19:34:55 WARN DAGScheduler: Broadcasting large task binary with size 1111.7 KiB


Accuracy of Logistic Regression is = 1

Note that this accuracy is not a meaningful measure of model quality. As the Instrumentation warning above reports, every label in the 500-row sample has the same value (0), so any model that always predicts class 0 scores 1. The same applies to the Decision Tree and Naive Bayes results below.


## Decision Tree Model

In [20]:
from pyspark.ml.classification import DecisionTreeClassifier
## Fitting the model
#https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.DecisionTreeClassifier.html
dt = DecisionTreeClassifier(featuresCol = 'features', labelCol = 'label', maxDepth = 3)
dtModel = dt.fit(train_df)
dtPreds = dtModel.transform(val_df)

23/01/20 19:34:57 WARN DAGScheduler: Broadcasting large task binary with size 1091.0 KiB
23/01/20 19:34:57 WARN DAGScheduler: Broadcasting large task binary with size 1091.0 KiB
23/01/20 19:34:57 WARN DAGScheduler: Broadcasting large task binary with size 1756.2 KiB
23/01/20 19:34:59 WARN DAGScheduler: Broadcasting large task binary with size 2.2 MiB
                                                                                

In [21]:
## Evaluating the model
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
dt_accuracy = evaluator.evaluate(dtPreds)
#Accuracy of Decision Tree
print("Accuracy of Decision Trees is = %g"% (dt_accuracy))


Accuracy of Decision Trees is = 1


23/01/20 19:35:01 WARN DAGScheduler: Broadcasting large task binary with size 1109.4 KiB


## Naive Bayes Model

In [22]:
#applying the NaiveBayes algorithm
nb = NaiveBayes(modelType="multinomial",labelCol="label", featuresCol="features")
nbModel = nb.fit(train_df)
#get the predictions by applying the fitted model to the validation data
nb_predictions = nbModel.transform(val_df)

23/01/20 19:35:02 WARN DAGScheduler: Broadcasting large task binary with size 1095.4 KiB
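For readers who want to see what a multinomial Naive Bayes classifier computes, here is a textbook-style sketch in plain Python that works on sparse `{index: value}` vectors. It is an assumption-laden illustration, not Spark's code: the class priors are left unsmoothed, the feature likelihoods use additive smoothing of 1.0, and the names `train_multinomial_nb` and `predict` as well as the toy data are hypothetical.

```python
import math
from collections import defaultdict

def train_multinomial_nb(vectors, labels, num_features, smoothing=1.0):
    """Fit log priors and smoothed log likelihoods from sparse vectors."""
    class_docs = defaultdict(int)
    feature_sums = defaultdict(lambda: defaultdict(float))
    class_totals = defaultdict(float)
    for vec, y in zip(vectors, labels):
        class_docs[y] += 1
        for idx, value in vec.items():
            feature_sums[y][idx] += value
            class_totals[y] += value
    model = {}
    for y, n_docs in class_docs.items():
        log_prior = math.log(n_docs / len(labels))
        denom = class_totals[y] + smoothing * num_features
        log_like = {idx: math.log((s + smoothing) / denom)
                    for idx, s in feature_sums[y].items()}
        log_unseen = math.log(smoothing / denom)  # features never seen in class y
        model[y] = (log_prior, log_like, log_unseen)
    return model

def predict(model, vec):
    """Return the label with the highest log posterior score."""
    best_label, best_score = None, -math.inf
    for y, (log_prior, log_like, log_unseen) in sorted(model.items()):
        score = log_prior + sum(v * log_like.get(idx, log_unseen)
                                for idx, v in vec.items())
        if score > best_score:
            best_label, best_score = y, score
    return best_label

# Toy data: features 0 and 1 go with label 0, features 2 and 3 with label 1
vectors = [{0: 2, 1: 1}, {0: 1}, {2: 3}, {2: 1, 3: 1}]
labels = [0, 0, 1, 1]
model = train_multinomial_nb(vectors, labels, num_features=4)
print(predict(model, {0: 1}))  # 0
print(predict(model, {2: 2}))  # 1
```

If every training label is the same, the fitted model contains a single class and `predict` returns that class for any input.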


In [23]:
nb_predictions.select("prediction", "label", "features").show(100)

+----------+-----+--------------------+
|prediction|label|            features|
+----------+-----+--------------------+
|       0.0|    0|(65536,[2464,1430...|
|       0.0|    0|(65536,[1903,4529...|
|       0.0|    0|(65536,[1903,9859...|
|       0.0|    0|(65536,[12806,230...|
|       0.0|    0|(65536,[14013,171...|
|       0.0|    0|(65536,[22076,344...|
|       0.0|    0|(65536,[22351,260...|
|       0.0|    0|(65536,[8741,2418...|
|       0.0|    0|(65536,[13712,159...|
|       0.0|    0|(65536,[2762,6040...|
|       0.0|    0|(65536,[5827,8449...|
|       0.0|    0|(65536,[7173,3399...|
|       0.0|    0|(65536,[2548,2888...|
|       0.0|    0|(65536,[31448,478...|
|       0.0|    0|(65536,[308,13889...|
|       0.0|    0|(65536,[6040,6122...|
|       0.0|    0|(65536,[1198,4207...|
|       0.0|    0|(65536,[2338,4166...|
|       0.0|    0|(65536,[3386,2202...|
|       0.0|    0|(65536,[9859,1298...|
|       0.0|    0|(65536,[7194,1908...|
|       0.0|    0|(65536,[6042,8923...|


23/01/20 19:35:02 WARN DAGScheduler: Broadcasting large task binary with size 1606.6 KiB


In [24]:
#evaluate and print the accuracy
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
nb_accuracy = evaluator.evaluate(nb_predictions)
print("Accuracy of NaiveBayes is = %g"% (nb_accuracy))
print("Test Error of NaiveBayes = %g " % (1.0 - nb_accuracy))

Accuracy of NaiveBayes is = 1
Test Error of NaiveBayes = 0 
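Accuracy is the fraction of predictions that match the labels, and the test error printed above is one minus that fraction. A useful reference point is the accuracy of always predicting the most frequent training label. The plain-Python sketch below computes both. The helper names and the toy label lists are hypothetical.

```python
from collections import Counter

def accuracy(predictions, labels):
    """Fraction of predictions that equal the true labels."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return accuracy([majority] * len(test_labels), test_labels)

# When every label is 0, the constant predictor is already perfect
print(majority_baseline_accuracy([0] * 400, [0] * 100))     # 1.0
# On a balanced test set the same baseline drops to 0.5
print(majority_baseline_accuracy([0] * 400, [0, 1] * 50))   # 0.5
```

A model is only informative when its accuracy clearly exceeds this baseline on a test set that contains both classes.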


23/01/20 19:35:03 WARN DAGScheduler: Broadcasting large task binary with size 1617.9 KiB


In [25]:
spark.stop()