# Question 3: Classification - Twitter Sentiment Analysis (1.0 mark)

Use twitter_training.csv to build a model that classifies the content of a Tweet as Positive, Neutral, Negative, or Irrelevant, then use twitter_validation.csv to test the model.
More information about the dataset:
https://www.kaggle.com/datasets/jp797498e/twitter-entity-sentiment-analysis?resource=download

In [1]:
import findspark
findspark.init()

In [2]:
from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark import SparkConf

In [3]:
spark = SparkSession.builder.appName('classification-twitter').getOrCreate()
spark

In [4]:
from pyspark.sql.types import DoubleType, StringType
# Import only the functions used below instead of a wildcard import
from pyspark.sql.functions import col, count, isnan, length, udf, when
from pyspark.ml.feature import (
    Tokenizer,
    StopWordsRemover,
    RegexTokenizer,
    CountVectorizer,
    IDF,
    StringIndexer,
    VectorAssembler
)
from pyspark.ml.classification import NaiveBayes, LogisticRegression
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from bs4 import BeautifulSoup
from pyspark import keyword_only
from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol

# Clean and Prepare the Data

In [5]:
# Load training data
twitter_train = spark.read.csv('twitter_training.csv',inferSchema=True,header=True)

In [6]:
twitter_train.show(5)

+--------+-----------+---------+--------------------+
|Tweet ID|     entity|sentiment|       Tweet content|
+--------+-----------+---------+--------------------+
|    2401|Borderlands| Positive|im getting on bor...|
|    2401|Borderlands| Positive|I am coming to th...|
|    2401|Borderlands| Positive|im getting on bor...|
|    2401|Borderlands| Positive|im coming on bord...|
|    2401|Borderlands| Positive|im getting on bor...|
+--------+-----------+---------+--------------------+
only showing top 5 rows



In [7]:
twitter_train.groupby('sentiment').count().show()

+----------+-----+
| sentiment|count|
+----------+-----+
|Irrelevant|12990|
|  Positive|20832|
|   Neutral|18318|
|  Negative|22542|
+----------+-----+



In [8]:
twitter_train.count()

74682

In [9]:
twitter_train.printSchema()

root
 |-- Tweet ID: integer (nullable = true)
 |-- entity: string (nullable = true)
 |-- sentiment: string (nullable = true)
 |-- Tweet content: string (nullable = true)



In [10]:
twitter_train.show(5)

+--------+-----------+---------+--------------------+
|Tweet ID|     entity|sentiment|       Tweet content|
+--------+-----------+---------+--------------------+
|    2401|Borderlands| Positive|im getting on bor...|
|    2401|Borderlands| Positive|I am coming to th...|
|    2401|Borderlands| Positive|im getting on bor...|
|    2401|Borderlands| Positive|im coming on bord...|
|    2401|Borderlands| Positive|im getting on bor...|
+--------+-----------+---------+--------------------+
only showing top 5 rows



In [11]:
twitter_train.head()

Row(Tweet ID=2401, entity='Borderlands', sentiment='Positive', Tweet content='im getting on borderlands and i will murder you all ,')

In [12]:
data_train = twitter_train.select("sentiment", "entity", "Tweet content")

In [13]:
# 3. Check for NaN and null values
data_train.select([count(when(isnan(c), c)).alias(c) for c in data_train.columns]).toPandas().T

Unnamed: 0,0
sentiment,0
entity,0
Tweet content,0


In [14]:
data_train.select([count(when(col(c).isNull(), c)).alias(c) for c in data_train.columns]).toPandas().T

Unnamed: 0,0
sentiment,0
entity,0
Tweet content,686


In [15]:
data_train = data_train.dropna(subset='Tweet content')

In [16]:
data_train.select([count(when(col(c).isNull(), c)).alias(c) for c in data_train.columns]).toPandas().T

Unnamed: 0,0
sentiment,0
entity,0
Tweet content,0


In [17]:
# 4. Check for duplicate rows
num_rows = data_train.count()
num_dist_rows = data_train.distinct().count()
dup_rows = num_rows - num_dist_rows

In [18]:
display(num_rows, num_dist_rows, dup_rows)

73996

70912

3084

In [19]:
# The data contains duplicate rows
dup_rows = num_rows - num_dist_rows
dup_rows

3084

In [20]:
# Remove duplicate rows
data_train = data_train.drop_duplicates()

In [21]:
data_train.distinct().count()

70912

## Train data

In [22]:
data_train = data_train.withColumn('length', length(data_train['Tweet content']))

In [23]:
data_train = data_train.filter(data_train['length'] > 0)

In [24]:
data_train.show(5)

+---------+-----------+--------------------+------+
|sentiment|     entity|       Tweet content|length|
+---------+-----------+--------------------+------+
| Negative|Borderlands|@ Borderlands how...|    82|
| Positive|Borderlands|I am here... With...|   203|
| Positive|Borderlands|We've been over p...|   208|
| Positive|Borderlands|Actually. I think...|   124|
| Negative|Borderlands|That cricket was ...|   147|
+---------+-----------+--------------------+------+
only showing top 5 rows



In [25]:
# Average tweet length per sentiment: no clear difference between classes
data_train.groupby('sentiment').mean().show()

+----------+------------------+
| sentiment|       avg(length)|
+----------+------------------+
|Irrelevant|110.82921926509609|
|  Positive| 98.03137475688402|
|   Neutral| 118.8148339950515|
|  Negative|112.01758048056406|
+----------+------------------+



In [26]:
data_train.groupby('sentiment').count().show()

+----------+-----+
| sentiment|count|
+----------+-----+
|Irrelevant|12437|
|  Positive|19538|
|   Neutral|17379|
|  Negative|21558|
+----------+-----+



In [27]:
data_train.select('entity').distinct().count()

32

In [28]:
data_train.groupby('entity').count().show(32, truncate=True)

+--------------------+-----+
|              entity|count|
+--------------------+-----+
|       Cyberpunk2077| 2144|
|         Borderlands| 2191|
|       Xbox(Xseries)| 2164|
|   PlayStation5(PS5)| 2177|
|                FIFA| 2224|
|           Overwatch| 2208|
|             Verizon| 2301|
|        WorldOfCraft| 2244|
|      AssassinsCreed| 2141|
|PlayerUnknownsBat...| 2112|
|               CS-GO| 2169|
|         Battlefield| 2236|
| GrandTheftAuto(GTA)| 2201|
|           HomeDepot| 2198|
|               NBA2K| 2290|
|              Google| 2179|
|               Dota2| 2217|
|RedDeadRedemption...| 2123|
|CallOfDutyBlackop...| 2231|
|     LeagueOfLegends| 2228|
|           Microsoft| 2277|
|           MaddenNFL| 2293|
|            Fortnite| 2162|
|TomClancysRainbowSix| 2287|
|              Nvidia| 2174|
|              Amazon| 2207|
|         Hearthstone| 2201|
|          CallOfDuty| 2304|
|TomClancysGhostRecon| 2261|
|     johnson&johnson| 2244|
|         ApexLegends| 2246|
|            F

## Test data

In [29]:
# Load validation data
twitter_test = spark.read.csv('twitter_validation.csv',inferSchema=True,header=True)

In [30]:
twitter_test.count()

1509

In [31]:
twitter_test.printSchema()

root
 |-- Tweet ID: string (nullable = true)
 |-- entity: string (nullable = true)
 |-- sentiment: string (nullable = true)
 |-- Tweet content: string (nullable = true)



In [32]:
twitter_test.show(5)

+--------+---------+----------+--------------------+
|Tweet ID|   entity| sentiment|       Tweet content|
+--------+---------+----------+--------------------+
|    3364| Facebook|Irrelevant|I mentioned on Fa...|
|     352|   Amazon|   Neutral|BBC News - Amazon...|
|    8312|Microsoft|  Negative|@Microsoft Why do...|
|    4371|    CS-GO|  Negative|CSGO matchmaking ...|
|    4433|   Google|   Neutral|Now the President...|
+--------+---------+----------+--------------------+
only showing top 5 rows



In [33]:
twitter_test.head()

Row(Tweet ID='3364', entity='Facebook', sentiment='Irrelevant', Tweet content='I mentioned on Facebook that I was struggling for motivation to go for a run the other day, which has been translated by Tom’s great auntie as ‘Hayley can’t get out of bed’ and told to his grandma, who now thinks I’m a lazy, terrible person 🤣')

In [34]:
data_val = twitter_test.select("sentiment", "entity", "Tweet content")

In [35]:
# 3. Check for NaN and null values
data_val.select([count(when(isnan(c), c)).alias(c) for c in data_val.columns]).toPandas().T

Unnamed: 0,0
sentiment,0
entity,0
Tweet content,0


In [36]:
data_val.select([count(when(col(c).isNull(), c)).alias(c) for c in data_val.columns]).toPandas().T

Unnamed: 0,0
sentiment,494
entity,466
Tweet content,509


In [37]:
data_val = data_val.dropna()

In [38]:
data_val.select([count(when(col(c).isNull(), c)).alias(c) for c in data_val.columns]).toPandas().T

Unnamed: 0,0
sentiment,0
entity,0
Tweet content,0


In [39]:
# 4. Check for duplicate rows
num_rows = data_val.count()
num_dist_rows = data_val.distinct().count()
dup_rows = num_rows - num_dist_rows

In [40]:
display(num_rows, num_dist_rows, dup_rows)

1000

1000

0

In [41]:
data_val.distinct().count()

1000

In [42]:
data_val.groupby('sentiment').count().show()

+----------+-----+
| sentiment|count|
+----------+-----+
|Irrelevant|  172|
|   Neutral|  285|
|  Positive|  277|
|  Negative|  266|
+----------+-----+



## Feature Transformations

In [43]:
class BsTextExtractor(Transformer, HasInputCol, HasOutputCol):

    @keyword_only
    def __init__(self, inputCol=None, outputCol=None):
        super(BsTextExtractor, self).__init__()
        kwargs = self._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, inputCol=None, outputCol=None):
        kwargs = self._input_kwargs
        return self._set(**kwargs)

    def _transform(self, dataset):

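        def f(s):
            # Name the parser explicitly so BeautifulSoup does not have to guess one,
            # then keep only the text content of the tweet.
            cleaned_post = BeautifulSoup(s, "html.parser").text
            return cleaned_post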

        t = StringType()
        out_col = self.getOutputCol()
        in_col = dataset[self.getInputCol()]
        return dataset.withColumn(out_col, udf(f, t)(in_col))
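
To illustrate what the `BsTextExtractor` stage does to each tweet, here is a minimal sketch that uses only the standard library. It swaps BeautifulSoup for `html.parser.HTMLParser` so the example has no third-party dependency, and the names `_TextCollector` and `strip_markup` are hypothetical helpers that do not appear in the notebook.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True decodes references such as &amp; before handle_data
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_markup(text):
    """Return `text` with tags removed and character references decoded."""
    collector = _TextCollector()
    collector.feed(text)
    collector.close()
    return "".join(collector.parts)


sample = "I <b>love</b> this game &amp; its soundtrack"
print(strip_markup(sample))
```

Tweets that contain no markup pass through unchanged, so for plain-text rows this stage is effectively a no-op.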

In [44]:
sentiment_to_num = StringIndexer(inputCol='sentiment',outputCol='sentimentIndex') #1
entity_to_num = StringIndexer(inputCol='entity',outputCol='entityIndex') #2
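
By default `StringIndexer` orders labels by descending frequency, so the most common label gets index 0.0. The sketch below reproduces that ordering in plain Python from the class counts of the cleaned training data; it only illustrates the rule and is not Spark's implementation.

```python
# Class counts copied from the cleaned training data shown earlier.
sentiment_counts = {
    "Irrelevant": 12437,
    "Positive": 19538,
    "Neutral": 17379,
    "Negative": 21558,
}

# The most frequent label gets index 0.0, the next one 1.0, and so on.
ordered = sorted(sentiment_counts, key=sentiment_counts.get, reverse=True)
label_to_index = {label: float(i) for i, label in enumerate(ordered)}

for label, index in label_to_index.items():
    print(f"{index} -> {label}")
```

This mapping is what lets the `sentimentIndex` values in the confusion matrices be read back as Negative (0.0), Positive (1.0), Neutral (2.0) and Irrelevant (3.0).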

In [45]:
text_extractor = BsTextExtractor(inputCol="Tweet content", outputCol="cleaned_Tweet") #3
tokenizer = RegexTokenizer(inputCol="cleaned_Tweet", outputCol="token_text", pattern="\\W") #4
stopremove = StopWordsRemover(inputCol='token_text',outputCol='stop_tokens') #5
count_vec = CountVectorizer(inputCol='stop_tokens',outputCol='c_vec') #6
idf = IDF(inputCol="c_vec", outputCol="tf_idf") #7
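
Stages 4 to 7 turn each cleaned tweet into a TF-IDF vector. The plain-Python sketch below walks through the same steps on three toy sentences. It rests on several assumptions: the stop-word list is a tiny illustrative set (Spark's default English list is much longer), Python's `re.ASCII` flag is used to approximate how the Java regex `\W` behaves, and the weighting follows the smoothed form `log((m + 1) / (df + 1))` described in the Spark MLlib documentation.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not Spark's default list).
STOP_WORDS = {"i", "on", "and", "will", "you", "all", "the", "is", "a"}


def tokenize(text):
    """Lowercase, split on non-word characters, and drop empty tokens."""
    return [t for t in re.split(r"\W", text.lower(), flags=re.ASCII) if t]


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


docs = [
    "im getting on borderlands and i will murder you all ,",
    "Borderlands is the best game",
    "I love the game",
]
token_lists = [remove_stop_words(tokenize(d)) for d in docs]

# Document frequency: the number of documents in which each term occurs.
doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
n_docs = len(token_lists)


def tf_idf(tokens):
    """Raw term counts weighted by the smoothed IDF log((m + 1) / (df + 1))."""
    counts = Counter(tokens)
    return {t: c * math.log((n_docs + 1) / (doc_freq[t] + 1)) for t, c in counts.items()}


print(token_lists[0])
print(tf_idf(token_lists[1]))
```

Terms that occur in many documents (here "borderlands" and "game") receive a smaller weight than terms that occur in only one (here "best").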

In [46]:
clean_up = VectorAssembler(inputCols=['tf_idf','entityIndex'],
                           outputCol='features') # 8

## The Model
We'll train and compare two classifiers: Naive Bayes and Logistic Regression.

In [47]:
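# Naive Bayes: Spark's default settings, written out explicitly.
# Logistic Regression: L2 regularization (elasticNetParam=0) with regParam=0.3, capped at 20 iterations.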
nb = NaiveBayes(labelCol='sentimentIndex',featuresCol='features', smoothing=1.0, modelType="multinomial")
lg = LogisticRegression(labelCol='sentimentIndex',featuresCol='features', maxIter=20, regParam=0.3, elasticNetParam=0)
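
The `smoothing=1.0` argument corresponds to additive (Laplace) smoothing, which keeps a word that never appeared with a class from driving that class's probability to zero. The following toy sketch shows the idea for a multinomial model in plain Python. The training pairs are invented for illustration, and the class prior is left unsmoothed for simplicity, so Spark's own implementation may differ in such details.

```python
import math
from collections import Counter

# Toy training set of (label, tokens) pairs; purely illustrative.
train = [
    ("Positive", ["love", "game"]),
    ("Positive", ["great", "game"]),
    ("Negative", ["hate", "lag"]),
]
vocab = sorted({t for _, tokens in train for t in tokens})
alpha = 1.0  # plays the role of smoothing=1.0

class_docs = Counter(label for label, _ in train)
term_counts = {label: Counter() for label in class_docs}
for label, tokens in train:
    term_counts[label].update(tokens)


def log_posterior(tokens, label):
    """Unnormalised score: log P(label) + sum of log P(token | label)."""
    total = sum(term_counts[label].values())
    score = math.log(class_docs[label] / len(train))
    for t in tokens:
        if t in vocab:  # words outside the vocabulary are ignored
            score += math.log((term_counts[label][t] + alpha) / (total + alpha * len(vocab)))
    return score


def predict(tokens):
    return max(class_docs, key=lambda label: log_posterior(tokens, label))


print(predict(["love", "game"]))
print(predict(["lag"]))
```

Without the `alpha` term, the word "lag" would have probability zero under the Positive class and its logarithm would be undefined.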

## Pipeline

In [48]:
data_prep_pipe = Pipeline(stages=[sentiment_to_num,
                                  entity_to_num,
                                  text_extractor,
                                  tokenizer,
                                  stopremove,
                                  count_vec,
                                  idf,
                                  clean_up])

In [49]:
cleaner = data_prep_pipe.fit(data_train)

In [50]:
train_clean_data = cleaner.transform(data_train)

In [51]:
train_clean_data.show(5, truncate=True)

+---------+-----------+--------------------+------+--------------+-----------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|sentiment|     entity|       Tweet content|length|sentimentIndex|entityIndex|       cleaned_Tweet|          token_text|         stop_tokens|               c_vec|              tf_idf|            features|
+---------+-----------+--------------------+------+--------------+-----------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| Negative|Borderlands|@ Borderlands how...|    82|           0.0|       21.0|@ Borderlands how...|[borderlands, how...|[borderlands, fil...|(30756,[63,187,28...|(30756,[63,187,28...|(30757,[63,187,28...|
| Positive|Borderlands|I am here... With...|   203|           1.0|       21.0|I am here... With...|[i, am, here, wit...|[samsung, full, h...|(30756,[5,10,63,6...|(30756,[5,10,63,6.

In [52]:
test_clean_data = cleaner.transform(data_val)

In [53]:
test_clean_data.show(5, truncate=True)

+----------+---------+--------------------+--------------+-----------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| sentiment|   entity|       Tweet content|sentimentIndex|entityIndex|       cleaned_Tweet|          token_text|         stop_tokens|               c_vec|              tf_idf|            features|
+----------+---------+--------------------+--------------+-----------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|Irrelevant| Facebook|I mentioned on Fa...|           3.0|        5.0|I mentioned on Fa...|[i, mentioned, on...|[mentioned, faceb...|(30756,[2,5,24,30...|(30756,[2,5,24,30...|(30757,[2,5,24,30...|
|   Neutral|   Amazon|BBC News - Amazon...|           2.0|       17.0|BBC News - Amazon...|[bbc, news, amazo...|[bbc, news, amazo...|(30756,[3,22,29,1...|(30756,[3,22,29,1...|(30757,[3,22,29,1...|
|  Negative|Mic

## Training and Evaluation!

In [54]:
train_data = train_clean_data.select("features",'sentimentIndex')

In [55]:
train_data.count()

70912

In [56]:
# Train both models (this may take some time)
nb_model = nb.fit(train_data)
lg_model = lg.fit(train_data)

## Model Comparison
Evaluate both models on twitter_validation.csv.

In [57]:
test_data = test_clean_data.select("features",'sentimentIndex')

In [58]:
test_data.count()

1000

In [59]:
test_data.printSchema()

root
 |-- features: vector (nullable = true)
 |-- sentimentIndex: double (nullable = false)



In [60]:
nb_predictions = nb_model.transform(test_data)
lg_predictions = lg_model.transform(test_data)

In [61]:
# Compare the prediction column with the true label and compute accuracy
acc_evaluator = MulticlassClassificationEvaluator(labelCol="sentimentIndex", 
                                                  predictionCol="prediction", 
                                                  metricName="accuracy")

In [62]:
nb_acc = acc_evaluator.evaluate(nb_predictions)
lg_acc = acc_evaluator.evaluate(lg_predictions)

In [63]:
print("Naive Bayes - confusion matrix:")
nb_cm = nb_predictions.groupBy('sentimentIndex', 'prediction').count().orderBy('sentimentIndex', 'prediction')
nb_cm.show()

Naive Bayes - confusion matrix:
+--------------+----------+-----+
|sentimentIndex|prediction|count|
+--------------+----------+-----+
|           0.0|       0.0|  226|
|           0.0|       1.0|   30|
|           0.0|       2.0|    1|
|           0.0|       3.0|    9|
|           1.0|       0.0|   14|
|           1.0|       1.0|  252|
|           1.0|       2.0|    5|
|           1.0|       3.0|    6|
|           2.0|       0.0|   20|
|           2.0|       1.0|   42|
|           2.0|       2.0|  210|
|           2.0|       3.0|   13|
|           3.0|       0.0|    8|
|           3.0|       1.0|   14|
|           3.0|       2.0|    3|
|           3.0|       3.0|  147|
+--------------+----------+-----+



In [64]:
print("Logistic Regression - confusion matrix:")
lg_cm = lg_predictions.groupBy('sentimentIndex', 'prediction').count().orderBy('sentimentIndex', 'prediction')
lg_cm.show()

Logistic Regression - confusion matrix:
+--------------+----------+-----+
|sentimentIndex|prediction|count|
+--------------+----------+-----+
|           0.0|       0.0|  248|
|           0.0|       1.0|   16|
|           0.0|       2.0|    2|
|           1.0|       0.0|   14|
|           1.0|       1.0|  260|
|           1.0|       2.0|    2|
|           1.0|       3.0|    1|
|           2.0|       0.0|   25|
|           2.0|       1.0|   34|
|           2.0|       2.0|  223|
|           2.0|       3.0|    3|
|           3.0|       0.0|   15|
|           3.0|       1.0|   25|
|           3.0|       2.0|    3|
|           3.0|       3.0|  129|
+--------------+----------+-----+



In [65]:
print("Accuracy on twitter_validation.csv:")
print('-'*80)
print('Naive Bayes - accuracy: {0:2.2f}%'.format(nb_acc*100))
print('-'*80)
print('Logistic Regression - accuracy: {0:2.2f}%'.format(lg_acc*100))

Accuracy on twitter_validation.csv:
--------------------------------------------------------------------------------
Naive Bayes - accuracy: 83.50%
--------------------------------------------------------------------------------
Logistic Regression - accuracy: 86.00%


#### Comparison of Naive Bayes and Logistic Regression

| Metric           | Naive Bayes  | Logistic Regression |
|------------------|--------------|---------------------|
| **Accuracy**     | 83.50%       | 86.00%              |
| **Confusion matrix** (actual - predicted) | | |
| 0.0 - 0.0        | 226          | 248                 |
| 0.0 - 1.0        | 30           | 16                  |
| 0.0 - 2.0        | 1            | 2                   |
| 0.0 - 3.0        | 9            | 0                   |
| 1.0 - 0.0        | 14           | 14                  |
| 1.0 - 1.0        | 252          | 260                 |
| 1.0 - 2.0        | 5            | 2                   |
| 1.0 - 3.0        | 6            | 1                   |
| 2.0 - 0.0        | 20           | 25                  |
| 2.0 - 1.0        | 42           | 34                  |
| 2.0 - 2.0        | 210          | 223                 |
| 2.0 - 3.0        | 13           | 3                   |
| 3.0 - 0.0        | 8            | 15                  |
| 3.0 - 1.0        | 14           | 25                  |
| 3.0 - 2.0        | 3            | 3                   |
| 3.0 - 3.0        | 147          | 129                 |



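Based on the comparison above, Logistic Regression reaches higher accuracy on twitter_validation.csv (86.00% vs. 83.50%) and misclassifies fewer tweets overall (140 vs. 165 out of 1,000). The one exception is class 3.0, where Naive Bayes classifies 147 tweets correctly against 129 for Logistic Regression. Overall, Logistic Regression is the better fit for this dataset.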
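
The accuracy figures can be re-derived from the confusion-matrix counts, and the same counts also give per-class recall and precision, which show where each model gains or loses. The sketch below is plain Python over the numbers reported above; the helper names `accuracy`, `recall_per_class` and `precision_per_class` are hypothetical and not part of the notebook.

```python
# Confusion-matrix counts copied from the outputs above.
# Rows are actual sentimentIndex values, columns are predictions (0.0 to 3.0).
nb_cm = [
    [226, 30, 1, 9],
    [14, 252, 5, 6],
    [20, 42, 210, 13],
    [8, 14, 3, 147],
]
lg_cm = [
    [248, 16, 2, 0],
    [14, 260, 2, 1],
    [25, 34, 223, 3],
    [15, 25, 3, 129],
]


def accuracy(cm):
    """Share of tweets on the diagonal (correct predictions)."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / sum(map(sum, cm))


def recall_per_class(cm):
    """Correct predictions divided by the number of actual tweets per class."""
    return [cm[i][i] / sum(cm[i]) for i in range(len(cm))]


def precision_per_class(cm):
    """Correct predictions divided by the number of predicted tweets per class."""
    return [cm[j][j] / sum(row[j] for row in cm) for j in range(len(cm))]


for name, cm in [("Naive Bayes", nb_cm), ("Logistic Regression", lg_cm)]:
    print(name)
    print("  accuracy :", round(accuracy(cm), 4))
    print("  recall   :", [round(r, 3) for r in recall_per_class(cm)])
    print("  precision:", [round(p, 3) for p in precision_per_class(cm)])
```

Looking at recall per class makes the trade-off visible: Logistic Regression is ahead on classes 0.0 to 2.0, while Naive Bayes recovers more of class 3.0.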

In [67]:
# Save the fitted models (nb and lg are the unfitted estimators)
nb_model.save("NB_Twitter_Sentiment_model")
lg_model.save("Lg_Twitter_Sentiment_model")