# BDDA Project 1 

# Snigdha Mathur (015050) 

## Question 1: Performing Text Classification on the csv file "Corona_NLP_train" using PySpark.

### We have used Multinomial NaiveBayes for calculating the accuracy of the model

Importing garbage collector for reducing the memory usage

In [2]:
import gc

In [3]:
gc.collect()

187

Importing all libraries

In [2]:
from pyspark.ml.feature import CountVectorizer,StringIndexer, RegexTokenizer,StopWordsRemover
from pyspark.sql.functions import col, udf,regexp_replace,isnull
from pyspark.sql.types import StringType,IntegerType
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [3]:
from pyspark.sql import SparkSession

Creating a new session for Spark with the name ProjNLP2

In [4]:
spark = SparkSession.builder.appName('ProjNLP2').getOrCreate()

Reading the csv file

In [1]:
df=spark.read.csv("Corona_NLP_train.csv", sep=",", header=True, inferSchema=True)

Reading the file and taking a sample (70%) because the size of the file is too large to perform modelling

In [3]:
df2= df.sample(fraction=0.7)

In [7]:
df2.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+------------------+
|            UserName|          ScreenName|            Location|             TweetAt|       OriginalTweet|         Sentiment|
+--------------------+--------------------+--------------------+--------------------+--------------------+------------------+
|                3800|               48752|                  UK|          16-03-2020|advice Talk to yo...|          Positive|
|                3801|               48753|           Vagabonds|          16-03-2020|Coronavirus Austr...|          Positive|
|              PLEASE|         don't panic| THERE WILL BE EN...|                null|                null|              null|
|           Stay calm|          stay safe.|                null|                null|                null|              null|
|#COVID19france #C...|            Positive|                null|                null|                null|            

In [8]:
df2.count()

47631

Depicting the count of tweets made per location and displaying top 20 rows

In [11]:
df2.groupBy("Location").count().show()

+--------------------+-----+
|            Location|count|
+--------------------+-----+
|                 ...|    1|
| Mumbai, Maharashtra|    4|
| Brisbane, Australia|    8|
|          South Asia|    1|
|West Woofle-Dust ...|    1|
|   St Petersburg, FL|   12|
| All across Michigan|    1|
|     Northumberland |    1|
|     stoke on trent |    1|
|   Dallas, Texas USA|    1|
|some where around...|    1|
|           Worcester|    3|
|           Bangalore|   14|
|            Novi, MI|    1|
|Sagaon, Kalyan Do...|    1|
|Ferrara, Emilia R...|    1|
|      Luton, England|    2|
|         black site |    1|
| to all the gas s...|    1|
|Just to the left ...|    1|
+--------------------+-----+
only showing top 20 rows



Selecting the columns (tweet and sentiment) that are needed for prediction and text classification

In [14]:
new_df = df2.select("OriginalTweet", "Sentiment")

In [15]:
new_df.show()

+--------------------+------------------+
|       OriginalTweet|         Sentiment|
+--------------------+------------------+
|advice Talk to yo...|          Positive|
|Coronavirus Austr...|          Positive|
|                null|              null|
|                null|              null|
|                null|              null|
|                null|              null|
|As news of the re...|          Positive|
|"Cashier at groce...|          Positive|
|Was at the superm...|              null|
|                null|              null|
|Due to COVID-19 o...|          Positive|
|All month there h...|           Neutral|
|                null|              null|
|#horningsea is a ...|Extremely Positive|
|Me: I don't need ...|              null|
|                null|              null|
|ADARA Releases CO...|          Positive|
|                null|              null|
|????? ????? ?????...|              null|
|                null|              null|
+--------------------+------------

Checking for NULL values in Sentiment and Tweet

In [16]:
def null_value_count(df2):
    null_columns_counts = []
    numRows = df2.count()
    for k in df2.columns:
        nullRows = df2.where(col(k).isNull()).count()
        if(nullRows > 0):
            temp = k,nullRows
            null_columns_counts.append(temp)
            return(null_columns_counts)



In [17]:
null_columns_count_list = null_value_count(new_df)

Displaying the number of null vales in tweets

In [18]:
spark.createDataFrame(null_columns_count_list, ['Column_With_Null_Value', 'Null_Values_Count']).show()

+----------------------+-----------------+
|Column_With_Null_Value|Null_Values_Count|
+----------------------+-----------------+
|         OriginalTweet|            18712|
+----------------------+-----------------+



Dropping the null values

In [19]:
new_df = new_df.dropna()

In [20]:
new_df.show()

+--------------------+------------------+
|       OriginalTweet|         Sentiment|
+--------------------+------------------+
|advice Talk to yo...|          Positive|
|Coronavirus Austr...|          Positive|
|As news of the re...|          Positive|
|"Cashier at groce...|          Positive|
|Due to COVID-19 o...|          Positive|
|All month there h...|           Neutral|
|#horningsea is a ...|Extremely Positive|
|ADARA Releases CO...|          Positive|
|For those who are...|          Positive|
|with 100  nations...|Extremely Negative|
|In preparation fo...|          Negative|
|This morning I te...|Extremely Negative|
|Went to the super...|           Neutral|
|Worried about the...|          Positive|
|my wife works ret...|          Negative|
|This is the line ...|           Neutral|
| Please Share  Kn...|Extremely Positive|
|"""Everything we...|            racism|
|Why we stock up o...|          Negative|
|Global food price...|Extremely Negative|
+--------------------+------------

Showing the count of tweets after dropping the null values from teh dataset

In [21]:
new_df.count()

19991

Spliting text into raw words (tokenizing) and foeing a new column 'words' 

In [19]:
regex_tokenizer = RegexTokenizer(inputCol="OriginalTweet", outputCol="words", pattern="\\W")
tokenized = regex_tokenizer.transform(new_df)

In [20]:
tokenized.show()

+--------------------+------------------+--------------------+
|       OriginalTweet|         Sentiment|               words|
+--------------------+------------------+--------------------+
|@MeNyrbie @Phil_G...|           Neutral|[menyrbie, phil_g...|
|"Cashier at groce...|          Positive|[cashier, at, gro...|
|For corona preven...|          Negative|[for, corona, pre...|
|All month there h...|           Neutral|[all, month, ther...|
|#horningsea is a ...|Extremely Positive|[horningsea, is, ...|
|ADARA Releases CO...|          Positive|[adara, releases,...|
|For those who are...|          Positive|[for, those, who,...|
|with 100  nations...|Extremely Negative|[with, 100, natio...|
|@10DowningStreet ...|          Negative|[10downingstreet,...|
|UK #consumer poll...|Extremely Positive|[uk, consumer, po...|
|In preparation fo...|          Negative|[in, preparation,...|
|This morning I te...|Extremely Negative|[this, morning, i...|
|Went to the super...|           Neutral|[went, to, the

Removing Stopwords from the raw words

In [21]:
remover = StopWordsRemover(inputCol="words", outputCol="filtered")
removed = remover.transform(tokenized)

Displaying the filtered words 

In [22]:
removed.select("words","filtered").show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|words                                                                                                                                                                                                                                                                                                                                      |filtered                                          

Label Encoding the Sentiment column into SentimentIndex 

Positive- 0.0 <br>
Negative - 1.0 <br>
Neutral - 2.0 <br>
Extremely Positive - 3.0 <br>
Extremely Negative - 4.0 <br>

In [23]:
indexer = StringIndexer(inputCol="Sentiment", outputCol="SentimentIndex")
feature_data = indexer.fit(removed).transform(removed)

In [24]:
feature_data.select("Sentiment","SentimentIndex").show()

+------------------+--------------+
|         Sentiment|SentimentIndex|
+------------------+--------------+
|           Neutral|           2.0|
|          Positive|           0.0|
|          Negative|           1.0|
|           Neutral|           2.0|
|Extremely Positive|           3.0|
|          Positive|           0.0|
|          Positive|           0.0|
|Extremely Negative|           4.0|
|          Negative|           1.0|
|Extremely Positive|           3.0|
|          Negative|           1.0|
|Extremely Negative|           4.0|
|           Neutral|           2.0|
|          Negative|           1.0|
|          Positive|           0.0|
|          Positive|           0.0|
|Extremely Negative|           4.0|
|           Neutral|           2.0|
|Extremely Positive|           3.0|
|            racism|         285.0|
+------------------+--------------+
only showing top 20 rows



Converting text into vectors of token counts using CountVectorizer

In [25]:
cv = CountVectorizer(inputCol="filtered", outputCol="features")
model = cv.fit(feature_data)
countVectorizer_feateures = model.transform(feature_data)

Splitting the data into train and test with 80% in train and rest 20% in test

In [26]:
(trainingData, testData) = countVectorizer_feateures.randomSplit([0.8, 0.2])

Implementing Multinomial Naivebayes on the labelCol "SentimentIndex" 

In [34]:
nb = NaiveBayes(modelType="multinomial",labelCol="SentimentIndex", featuresCol="features")
nbModel = nb.fit(trainingData)
nb_predictions = nbModel.transform(testData)

Showing the predicted vs the actual value of the sentiment index

In [31]:
nb_predictions.select("prediction", "SentimentIndex", "features").show(5)

+----------+--------------+--------------------+
|prediction|SentimentIndex|            features|
+----------+--------------+--------------------+
|       4.0|           4.0|(41642,[11,12,149...|
|       0.0|           2.0|(41642,[2,10,829,...|
|       1.0|           4.0|(41642,[5,19,21,8...|
|       0.0|           2.0|(41642,[3,53,55,5...|
|       1.0|           0.0|(41642,[8,24,26,8...|
+----------+--------------+--------------------+
only showing top 5 rows



Calculating the accuracy of the model and displaying the test error

In [32]:
evaluator = MulticlassClassificationEvaluator(labelCol="SentimentIndex", predictionCol="prediction", metricName="accuracy")
nb_accuracy = evaluator.evaluate(nb_predictions)
print("Accuracy of NaiveBayes is = %g"% (nb_accuracy))
print("Test Error of NaiveBayes = %g " % (1.0 - nb_accuracy))

Accuracy of NaiveBayes is = 0.437672
Test Error of NaiveBayes = 0.562328 


### The accuracy comes out to be 43.7% and test error as 56.2%

### This means that only 43.7% of the sentiments were identified correctly in the dataset