<a href="https://colab.research.google.com/github/Laughing-Bulls/twitter/blob/main/Final_ML_Algorithms.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To decide which machine learning algorithm to use for the sentiment analysis of tweets, we start with an exploratory comparison:


**1. Set up:**

In [1]:
# Load the packages required

!pip install pyspark

from pyspark.mllib.regression import LabeledPoint
from pyspark.ml.feature import HashingTF
from pyspark import SparkConf, SparkContext
from pyspark.ml.classification import LogisticRegression, NaiveBayes
from pyspark.sql.session import SparkSession
from pyspark.sql.functions import split, regexp_replace
from numpy import array
import numpy as np

Collecting pyspark
  Downloading pyspark-3.2.0.tar.gz (281.3 MB)
Collecting py4j==0.10.9.2
  Downloading py4j-0.10.9.2-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.2.0-py2.py3-none-any.whl size=281805912 sha256=b23ad918314a11a13d593bda794fbc8abc87901faebf995b80fbeaeb1f23ffe1
  Stored in directory: /root/.cache/pip/wheels/0b/de/d2/9be5d59d7331c6c2a7c1b6d1a4f463ce107332b1ecd4e80718
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.2 pyspark-3.2.0


In [2]:
# Boilerplate Spark stuff:
conf = SparkConf().setMaster("local").setAppName("SparkDecisionTree")
sc = SparkContext(conf = conf)
spark = SparkSession(sc)

**2. Load and prepare the necessary data:**

In [14]:
# We read the processed data files
# In order to read them like this we need to upload them to the "Files" of the Notebook
train = spark.read.csv("processed_training_tweets.csv", inferSchema=True, header=True)
test = spark.read.csv("processed_test_tweets.csv", inferSchema=True, header=True)

# We notice the issue that the "words" columns are type "string" instead of array<string> like we want
print("train data types: ", train.dtypes)
print("test data types: ", test.dtypes, "\n")

# We fix this by stripping the square brackets and splitting on commas to get array<string>
# (the "score" column keeps its int type, as the updated dtypes show)
train = train.withColumn('words', split(regexp_replace(train["words"], r'\[|\]', ''), ',').cast('array<string>'))
test = test.withColumn('words', split(regexp_replace(test["words"], r'\[|\]', ''), ',').cast('array<string>'))
print("updated train data types: ", train.dtypes)
print("updated test data types: ", test.dtypes, "\n")

# We remove the neutral tweets (score = 2) from the test data (we are going to be classifying tweets only as positive or negative)
test = test[test["score"] != 2]

# Preview of the data
print("train overview: ")
train.show(truncate=False, n=5)
print("test overview: ")
test.show(truncate=False, n=5)

train data types:  [('_c0', 'int'), ('score', 'int'), ('words', 'string')]
test data types:  [('_c0', 'int'), ('score', 'int'), ('words', 'string')] 

updated train data types:  [('_c0', 'int'), ('score', 'int'), ('words', 'array<string>')]
updated test data types:  [('_c0', 'int'), ('score', 'int'), ('words', 'array<string>')] 

train overview: 
+---+-----+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
|_c0|score|words                                                                                                                                                      |
+---+-----+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
|0  |0    |['bummer',  'shoulda',  'got',  'david',  'carr',  'third',  'day',  'do',  'it',  'd']                                                 
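
The bracket-stripping and comma-splitting step above can be mimicked in plain Python to see what each token looks like afterwards. This is only a sketch: the `raw` string is a hypothetical cell value shaped like the CSV column, and the `ast.literal_eval` variant is an alternative for comparison, not what the notebook does.

```python
import ast
import re

# Hypothetical cell value, formatted like the "words" column in the CSV.
raw = "['bummer',  'shoulda',  'got']"

# Same idea as regexp_replace + split: drop the brackets, then split on commas.
# The tokens keep their quotes and leading spaces.
naive_tokens = re.sub(r"\[|\]", "", raw).split(",")

# Alternative (an assumption, not the notebook's approach): parse the list literal.
clean_tokens = ast.literal_eval(raw)

print(naive_tokens)
print(clean_tokens)
```

The first form matches the previews in the notebook, where the quotes remain part of each array element. That is harmless for hashing as long as training and test data are processed the same way.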

In [15]:
# Hash each word to a feature index and store the per-tweet term counts in a sparse vector
hashTF = HashingTF(inputCol="words", outputCol="numerical")
num_train = hashTF.transform(train).select('score', 'words', 'numerical')
num_test = hashTF.transform(test).select('score', 'words', 'numerical')

# Preview of the modified data
print("num_train overview: ")
num_train.show(truncate=False, n=5)
print("num_test overview: ")
num_test.show(truncate=False, n=5)

num_train overview: 
+-----+-----------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|score|words                                                                                                                                                      |numerical                                                                                                                                                                                       |
+-----+-----------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------
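
To make the hashing step more concrete, here is a minimal pure-Python sketch of the hashing trick. It assumes `zlib.crc32` as a stand-in hash (Spark's `HashingTF` uses MurmurHash3, so the indices will differ) and uses 262144 buckets, which is the default `numFeatures` of `HashingTF`. The function name `hashed_term_frequencies` is hypothetical.

```python
import zlib
from collections import Counter

def hashed_term_frequencies(tokens, num_features=262144):
    """Map each token to a bucket index and count occurrences per bucket."""
    counts = Counter()
    for token in tokens:
        # crc32 is a stand-in hash; Spark's HashingTF uses MurmurHash3.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        counts[index] += 1
    return dict(counts)

# Tokens borrowed from one of the test tweets shown in the notebook.
tokens = ["read", "kindl", "love", "it", "good", "read"]
vector = hashed_term_frequencies(tokens)
print(vector)
```

The result is a sparse mapping from bucket index to count. Two different words can land in the same bucket, which is the price paid for not having to build a vocabulary.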

**3. Train different models to find the best one:**

In [16]:
# Logistic Regression Training
log_reg = LogisticRegression(labelCol = "score", featuresCol="numerical", maxIter = 10, regParam = 0.01).fit(num_train)

In [19]:
# Logistic Regression Prediction
print("Logistic Regression: \n")
pred_log_reg = log_reg.transform(num_test)
results_log_reg = pred_log_reg.select("words", "prediction", "score")
print("results_log_reg overview: ")
results_log_reg.show(truncate=False, n=5)

correct_pred_log_reg = results_log_reg.filter(results_log_reg['prediction'] == results_log_reg['score']).count()
print("# Correct predictions:", correct_pred_log_reg, ", # Data points:", results_log_reg.count(),
      ", Accuracy:", correct_pred_log_reg/results_log_reg.count())

Logistic Regression: 

results_log_reg overview: 
+----------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+-----+
|words                                                                                                                                                           |prediction|score|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+-----+
|['loooooooovvvvvvee',  'kindl',  'not',  'that',  'dx',  'cool',  'but',  '',  'fantast',  'it',  'own',  'right']                                              |4.0       |4    |
|['read',  'kindl',  'love',  'it',  'lee',  'child',  'good',  'read']                                                                                          |4.0       |4    |
|['ok',  'first',  'asses',  'kindl',  'it',  'fuc
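
The accuracy figure above is just the share of rows where `prediction` equals `score`. A small sketch of the same calculation in plain Python, extended with per-class counts, might look like this. The helper name `accuracy_and_confusion` and the sample pairs are made up for illustration and are not results from the notebook.

```python
from collections import Counter

def accuracy_and_confusion(pairs):
    """pairs: iterable of (prediction, label).

    Returns (accuracy, Counter keyed by (label, prediction)).
    """
    confusion = Counter()
    for prediction, label in pairs:
        confusion[(float(label), float(prediction))] += 1
    total = sum(confusion.values())
    correct = sum(n for (label, pred), n in confusion.items() if label == pred)
    return (correct / total if total else 0.0), confusion

# Made-up (prediction, score) pairs, for illustration only.
pairs = [(4.0, 4), (4.0, 4), (0.0, 4), (0.0, 0)]
accuracy, confusion = accuracy_and_confusion(pairs)
print(accuracy)
print(confusion)
```

Looking at the per-class counts as well as the overall accuracy shows whether a model's errors are concentrated on positive or negative tweets.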

In [20]:
# Naive Bayes Training
naive_bayes = NaiveBayes(labelCol = "score", featuresCol="numerical", smoothing=1.0, modelType="multinomial").fit(num_train)

In [21]:
# Naive Bayes Prediction
print("Naive Bayes: \n")
pred_naive_bayes = naive_bayes.transform(num_test)
# replace(1.0, 4.0) rewrites any 1.0 value as 4.0 so predictions use the same positive label as "score"
results_naive_bayes = pred_naive_bayes.select("words", "prediction", "score").replace(1.0, 4.0)
print("results_naive_bayes overview: ")
results_naive_bayes.show(truncate=False, n=5)

correct_pred_naive_bayes = results_naive_bayes.filter(results_naive_bayes['prediction'] == results_naive_bayes['score']).count()
print("# Correct predictions:", correct_pred_naive_bayes, ", # Data points:", results_naive_bayes.count(),
      ", Accuracy:", correct_pred_naive_bayes/results_naive_bayes.count())

Naive Bayes: 

results_naive_bayes overview: 
+----------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+-----+
|words                                                                                                                                                           |prediction|score|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+-----+
|['loooooooovvvvvvee',  'kindl',  'not',  'that',  'dx',  'cool',  'but',  '',  'fantast',  'it',  'own',  'right']                                              |4.0       |4    |
|['read',  'kindl',  'love',  'it',  'lee',  'child',  'good',  'read']                                                                                          |4.0       |4    |
|['ok',  'first',  'asses',  'kindl',  'it',  'fuck', 
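
For intuition about what the `smoothing=1.0` and `modelType="multinomial"` settings mean, here is a textbook multinomial Naive Bayes with additive (Laplace) smoothing in plain Python. It is a sketch under simplifying assumptions, not Spark's implementation: it works on raw tokens instead of hashed vectors, and the function names and toy data are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, smoothing=1.0):
    """Return (log_priors, log_likelihoods, vocabulary) from tokenised documents."""
    vocabulary = {t for doc in docs for t in doc}
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)
    log_priors = {c: math.log(n / len(labels)) for c, n in class_docs.items()}
    log_likelihoods = {}
    for c in class_docs:
        # Additive smoothing: every vocabulary word gets `smoothing` extra counts.
        total = sum(token_counts[c].values()) + smoothing * len(vocabulary)
        log_likelihoods[c] = {
            t: math.log((token_counts[c][t] + smoothing) / total) for t in vocabulary
        }
    return log_priors, log_likelihoods, vocabulary

def predict(model, doc):
    """Pick the class with the highest log posterior; unseen words are ignored."""
    log_priors, log_likelihoods, vocabulary = model
    scores = {
        c: log_priors[c] + sum(log_likelihoods[c][t] for t in doc if t in vocabulary)
        for c in log_priors
    }
    return max(scores, key=scores.get)

# Toy data invented for illustration (0 = negative, 4 = positive).
docs = [["love", "it"], ["good", "read"], ["bad", "day"], ["hate", "it"]]
labels = [4, 4, 0, 0]
model = train_multinomial_nb(docs, labels)
print(predict(model, ["love", "good"]))
print(predict(model, ["bad", "hate"]))
```

Training needs only one pass to count words per class, which is consistent with Naive Bayes being quick to fit compared with an iterative optimiser.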

**4. Results:**

In this comparison, the Naive Bayes model was faster to train: training took 37 seconds, compared with almost 4 minutes for the Logistic Regression model. Naive Bayes was also more accurate, making correct predictions for 84.68% of the test tweets, versus 78.55% for Logistic Regression.
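
One way to collect training times like these in a repeatable form is a small wall-clock helper. The sketch below uses a hypothetical `timed` function and a stand-in workload. In the notebook it could wrap an estimator's fit call, for example `timed(estimator.fit, num_train)`, where `estimator` is an assumed name for an unfitted `NaiveBayes` or `LogisticRegression` object.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds) measured with a wall clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload so the example runs without Spark.
result, seconds = timed(sum, range(1000))
print(result, f"{seconds:.6f}s")
```

Single runs on a shared Colab machine can vary, so repeating the measurement a few times gives a fairer comparison.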

Therefore, we will implement the Naive Bayes model for the unsupervised sentiment analysis, since we were able to observe its efficacy in a supervised setting.