# 2014222 - Semester 2 CA-02 - May 2024

### GitHub account

https://github.com/2014222-student-cct-ie/2024--Semester-2--CA2/

### Analysis of a large dataset gleaned from the Twitter API and available on Moodle as “ProjectTweets.csv”

# Part 1

In [1]:
# Use the Python programming language to meet the requirements of the assessment and apply suitable Machine
# Learning algorithms to discover and deliver insights.

# Import the necessary libraries (NumPy and pandas) to perform data cleansing.
# These libraries are conventionally used to perform mathematical and statistical
# operations during a data analysis process.

import numpy as np
import pandas as pd

# Import the Matplotlib and Plotly libraries to perform data visualisation

import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt

# I am using this line of code to see all columns in a wide DataFrame

pd.set_option('display.max_columns', None)

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import col
from pyspark.ml import Pipeline
from pyspark.sql.types import FloatType


# For normalization
from pyspark.ml.feature import MinMaxScaler 
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit


# process the tweets data
# !pip install textblob
from pyspark.sql.functions import udf
from textblob import TextBlob

# pip install nltk
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

from pyspark.sql import DataFrame

# The re module provides regular expression support.
# In Python a regular expression search is typically written as:
# match = re.search(pat, str)
# The re.search() function takes a regular expression pattern and a string and searches
# for that pattern within the string.

import re

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

#!pip install pyspark

# Import the warnings module

import warnings

# Ignore all warnings by calling the filterwarnings() function and passing the 'ignore' argument

warnings.filterwarnings('ignore')

In [2]:
# Read the CSV file by applying the pd.read_csv() function.

tweets_dataset = pd.read_csv('ProjectTweets.csv')

In [3]:
# Print the tweets_dataset

tweets_dataset

Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,1,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
1,2,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
2,3,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
3,4,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
4,5,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew
...,...,...,...,...,...,...
1599994,1599995,2193601966,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,AmandaMarie1028,Just woke up. Having no school is the best fee...
1599995,1599996,2193601969,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,TheWDBoards,TheWDB.com - Very cool to hear old Walt interv...
1599996,1599997,2193601991,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,bpbabe,Are you ready for your MoJo Makeover? Ask me f...
1599997,1599998,2193602064,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,tinydiamondz,Happy 38th Birthday to my boo of alll time!!! ...


In [4]:
# Print the dimensions of the tweets_dataset DataFrame.

tweets_dataset.shape

(1599999, 6)

As we can see, the tweets dataframe contains 1599999 rows × 6 columns.

In [5]:
# Print the first 5 rows of the tweets dataframe by applying the .head() method.
# This method displays the top 5 observations of the dataset.

tweets_dataset.head()

Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,1,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
1,2,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
2,3,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
3,4,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
4,5,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew


In [6]:
# Get the number of unique values in each column of the tweets dataframe
# by applying the .nunique() method.
# The counts hint at which columns are categorical (few unique values)
# and which are continuous or identifier-like (many unique values).
# Duplicated data can be handled or removed based on further analysis.

tweets_dataset.nunique()

0                                                                                                                      1599999
1467810369                                                                                                             1598314
Mon Apr 06 22:19:45 PDT 2009                                                                                            774362
NO_QUERY                                                                                                                     1
_TheSpecialOne_                                                                                                         659775
@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D    1581465
dtype: int64

In [7]:
# Print the last 5 rows of the tweets dataframe by applying the .tail() method,
# which displays the last 5 observations of the dataset

tweets_dataset.tail()

Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
1599994,1599995,2193601966,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,AmandaMarie1028,Just woke up. Having no school is the best fee...
1599995,1599996,2193601969,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,TheWDBoards,TheWDB.com - Very cool to hear old Walt interv...
1599996,1599997,2193601991,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,bpbabe,Are you ready for your MoJo Makeover? Ask me f...
1599997,1599998,2193602064,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,tinydiamondz,Happy 38th Birthday to my boo of alll time!!! ...
1599998,1599999,2193602129,Tue Jun 16 08:40:50 PDT 2009,NO_QUERY,RyanTrevMorris,happy #charitytuesday @theNSPCC @SparksCharity...


In [8]:
# Get information about the tweets dataframe by applying the .info() method.
# It displays the number of non-null records in each column, the data type of each column,
# and the memory usage of the dataset,
# which helps to understand the structure of the data.

tweets_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599999 entries, 0 to 1599998
Data columns (total 6 columns):
 #   Column                                                                                                               Non-Null Count    Dtype 
---  ------                                                                                                               --------------    ----- 
 0   0                                                                                                                    1599999 non-null  int64 
 1   1467810369                                                                                                           1599999 non-null  int64 
 2   Mon Apr 06 22:19:45 PDT 2009                                                                                         1599999 non-null  object
 3   NO_QUERY                                                                                                             1599999 non-null  object
 4   _

As we can see, the tweets dataset is structured as a table with almost 1.6 million tweets across six columns, each storing a different piece of information about the tweet: an index, the tweet ID, the timestamp, a query flag, the user, and the tweet text.

The data types vary from integers for numeric data to objects for textual data.

The tweets dataset doesn't have proper headers, which is why pandas is using the first row as column names by default.

Therefore we need to clean this by assigning proper column names.
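
To illustrate why the first record ended up as the column names, here is a minimal sketch that uses only the standard library's csv module rather than pandas. The two sample rows are copied from the preview above, with the tweet text replaced by a placeholder, and the column names are the ones assigned in the next cell. It illustrates the header behaviour only and is not a description of how pandas parses the file internally.

```python
import csv
import io

# Two headerless records copied from the dataset preview
# (the tweet text is replaced by a placeholder).
raw = (
    "0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,first tweet\n"
    "1,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,second tweet\n"
)

columns = ['index', 'tweet_ID', 'date_timestamp', 'query', 'twitter_user', 'tweet_text']

# Without explicit names, the first record is consumed as the header.
implicit = list(csv.DictReader(io.StringIO(raw)))

# With explicit names, every record is kept as data.
explicit = list(csv.DictReader(io.StringIO(raw), fieldnames=columns))

print(len(implicit))                 # only one record survives as data
print(len(explicit))                 # both records survive
print(explicit[0]['twitter_user'])   # the first tweet's user is still available
```

The same idea applies to pd.read_csv: passing header=None together with names keeps the first record as data.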

In [9]:
# Assigning proper column names.

columns = ['index', 'tweet_ID', 'date_timestamp', 'query', 'twitter_user', 'tweet_text']

tweets_dataset = pd.read_csv('ProjectTweets.csv', header=None, names=columns, delimiter=',')

tweets_dataset.head()

Unnamed: 0,index,tweet_ID,date_timestamp,query,twitter_user,tweet_text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,1,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,2,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,3,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,4,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


**As we can see, the column 'index' duplicates the DataFrame's built-in row index.**

In [10]:
# Drop the column called 'index'

tweets_dataset = tweets_dataset.drop(columns=['index'])

In [11]:
# Display the tweets_dataset to confirm the column has been dropped

tweets_dataset.head()

Unnamed: 0,tweet_ID,date_timestamp,query,twitter_user,tweet_text
0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


The column called query only has 1 value, which says NO_QUERY.

Therefore I am going to drop this column.

In [12]:
# Drop the column called 'query' if it holds only a single value

if len(tweets_dataset['query'].unique()) < 2:

    print("The column called query only has 1 value which says NO_QUERY,"
          "\nTherefore we are going to drop this column\n")

    tweets_dataset = tweets_dataset[['tweet_ID', 'date_timestamp', 'twitter_user', 'tweet_text']]

    print(tweets_dataset.columns)

The column called query only has 1 value which says NO_QUERY,
Therefore we are going to drop this column

Index(['tweet_ID', 'date_timestamp', 'twitter_user', 'tweet_text'], dtype='object')


In [13]:
# Drop Duplicates

tweets_dataset = tweets_dataset.drop_duplicates()

In [14]:
# Print the first 5 rows of the tweets dataframe by applying the .head() method.
# This method displays the top 5 observations of the dataset.

tweets_dataset.head()

Unnamed: 0,tweet_ID,date_timestamp,twitter_user,tweet_text
0,1467810369,Mon Apr 06 22:19:45 PDT 2009,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,1467810672,Mon Apr 06 22:19:49 PDT 2009,scotthamilton,is upset that he can't update his Facebook by ...
2,1467810917,Mon Apr 06 22:19:53 PDT 2009,mattycus,@Kenichan I dived many times for the ball. Man...
3,1467811184,Mon Apr 06 22:19:57 PDT 2009,ElleCTF,my whole body feels itchy and like its on fire
4,1467811193,Mon Apr 06 22:19:57 PDT 2009,Karoli,"@nationwideclass no, it's not behaving at all...."


In [15]:
# Check the information again to see how many entries remain

tweets_dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1598315 entries, 0 to 1599999
Data columns (total 4 columns):
 #   Column          Non-Null Count    Dtype 
---  ------          --------------    ----- 
 0   tweet_ID        1598315 non-null  int64 
 1   date_timestamp  1598315 non-null  object
 2   twitter_user    1598315 non-null  object
 3   tweet_text      1598315 non-null  object
dtypes: int64(1), object(3)
memory usage: 61.0+ MB


**This information is helpful for diagnosing issues with data processing and understanding the structure of the Tweets dataset.**

In [16]:
# Identify missing values in the Tweets dataframe by applying the .isnull().sum() methods.
# I am using this to get the number of missing records in each column

tweets_dataset.isnull().sum()

tweet_ID          0
date_timestamp    0
twitter_user      0
tweet_text        0
dtype: int64

## Cleaning the text of the tweets

The dataset contains many @ mentions, hashtags, retweets, hyperlinks, colons, emojis, emoticons, and other Unicode characters such as dingbats, symbols and pictographs, transport and map symbols, flags (iOS), and Chinese characters.

The idea is to clean the text of all the tweets.
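
As an illustration of what such cleaning could look like, the sketch below removes URLs and @ mentions before stripping punctuation. It is a possible variant, not the remove_punctuation step that this notebook actually applies further below (which strips punctuation only). The function name clean_tweet and the regular expressions are assumptions introduced here, and the patterns are not an exhaustive tweet grammar.

```python
import re

# Illustrative patterns (assumptions, not an exhaustive tweet grammar).
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')
MENTION_PATTERN = re.compile(r'@\w+')
NON_WORD_PATTERN = re.compile(r'[^\w\s]')
WHITESPACE_PATTERN = re.compile(r'\s+')


def clean_tweet(text):
    """Remove URLs and @mentions, then punctuation, then collapse whitespace."""
    text = URL_PATTERN.sub(' ', text)
    text = MENTION_PATTERN.sub(' ', text)
    text = NON_WORD_PATTERN.sub('', text)
    return WHITESPACE_PATTERN.sub(' ', text).strip()


# Sample text taken from the first tweet in the dataset preview.
sample = "@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer."
print(clean_tweet(sample))
```

Removing the URL first avoids fusing it into a single meaningless token once the punctuation is stripped.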

In [17]:
# Printing only the 'tweet_text' column

print(tweets_dataset['tweet_text'].head(30))

0     @switchfoot http://twitpic.com/2y1zl - Awww, t...
1     is upset that he can't update his Facebook by ...
2     @Kenichan I dived many times for the ball. Man...
3       my whole body feels itchy and like its on fire 
4     @nationwideclass no, it's not behaving at all....
5                         @Kwesidei not the whole crew 
6                                           Need a hug 
7     @LOLTrish hey  long time no see! Yes.. Rains a...
8                  @Tatiana_K nope they didn't have it 
9                             @twittera que me muera ? 
10          spring break in plain city... it's snowing 
11                           I just re-pierced my ears 
12    @caregiving I couldn't bear to watch it.  And ...
13    @octolinz16 It it counts, idk why I did either...
14    @smarrison i would've been the first, but i di...
15    @iamjazzyfizzle I wish I got to watch it with ...
16    Hollis' death scene will hurt me severely to w...
17                                 about to file

In [18]:
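def remove_punctuation(text):
    # Remove every character that is not a word character (letter, digit, underscore) or whitespace.
    # URLs are not removed, only their punctuation, so each one collapses into a single token.
    return re.sub(r'[^\w\s]', '', text)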

In [19]:
# Apply the function to the entire column

tweets_dataset['cleaned_tweet_text'] = tweets_dataset['tweet_text'].apply(remove_punctuation)

In [20]:
# Print the first 30 rows of both the original and cleaned text columns

print(tweets_dataset[['tweet_text', 'cleaned_tweet_text']].head(30))

                                           tweet_text  \
0   @switchfoot http://twitpic.com/2y1zl - Awww, t...   
1   is upset that he can't update his Facebook by ...   
2   @Kenichan I dived many times for the ball. Man...   
3     my whole body feels itchy and like its on fire    
4   @nationwideclass no, it's not behaving at all....   
5                       @Kwesidei not the whole crew    
6                                         Need a hug    
7   @LOLTrish hey  long time no see! Yes.. Rains a...   
8                @Tatiana_K nope they didn't have it    
9                           @twittera que me muera ?    
10        spring break in plain city... it's snowing    
11                         I just re-pierced my ears    
12  @caregiving I couldn't bear to watch it.  And ...   
13  @octolinz16 It it counts, idk why I did either...   
14  @smarrison i would've been the first, but i di...   
15  @iamjazzyfizzle I wish I got to watch it with ...   
16  Hollis' death scene will hu

In [21]:
# Exporting the DataFrame to a CSV file

tweets_dataset.to_csv('cleaned_tweet_text_file.csv', index=False)

print('The CSV file has been created successfully.')

The CSV file has been created successfully.


# Initialize a SparkSession


In [22]:
spark = SparkSession.builder \
    .appName("CA2 Tweets Data Analysis") \
    .getOrCreate()

24/05/07 19:14:39 WARN Utils: Your hostname, Geomars-Mac-Studio.local resolves to a loopback address: 127.0.0.1; using 192.168.0.110 instead (on interface en0)
24/05/07 19:14:39 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/05/07 19:14:40 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [23]:
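# Define the schema of the tweets dataset.
# Note: IntegerType is a 32-bit signed integer (maximum 2147483647), so larger tweet IDs
# do not fit; this is likely why the tweet_ID count in the describe() output further below
# is lower than for the other columns. LongType (64-bit) would hold every ID.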

schema = StructType([
    StructField("tweet_ID", IntegerType(), True),
    StructField("date_timestamp", StringType(), True),
    StructField("twitter_user", StringType(), True),
    StructField("tweet_text", StringType(), True),
    StructField("cleaned_tweet_text", StringType(), True)
])

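# Load the previously saved dataset with the columns
# Tweet ID, Date / Timestamp, Twitter User, Tweet text, Cleaned tweet text.
# Note: with header=False the header line of the CSV is read as an ordinary data row,
# which is why it appears as the first row of the output below.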

tweets_dataset = spark.read.csv('clean_table_tweets.csv', schema=schema, header=False)

# tweets_dataset = spark.read.csv('clean_table_tweets_dataset.csv', schema=schema, header=False)

# Show the first 20 rows

tweets_dataset.show()

# Print the schema to verify

tweets_dataset.printSchema()

+----------+--------------------+---------------+--------------------+--------------------+
|  tweet_ID|      date_timestamp|   twitter_user|          tweet_text|  cleaned_tweet_text|
+----------+--------------------+---------------+--------------------+--------------------+
|      NULL|      date_timestamp|   twitter_user|          tweet_text|  cleaned_tweet_text|
|1467810369|Mon Apr 06 22:19:...|_TheSpecialOne_|@switchfoot http:...|switchfoot httptw...|
|1467810672|Mon Apr 06 22:19:...|  scotthamilton|is upset that he ...|is upset that he ...|
|1467810917|Mon Apr 06 22:19:...|       mattycus|@Kenichan I dived...|Kenichan I dived ...|
|1467811184|Mon Apr 06 22:19:...|        ElleCTF|my whole body fee...|my whole body fee...|
|1467811193|Mon Apr 06 22:19:...|         Karoli|@nationwideclass ...|nationwideclass n...|
|1467811372|Mon Apr 06 22:20:...|       joy_wolf|@Kwesidei not the...|Kwesidei not the ...|
|1467811592|Mon Apr 06 22:20:...|        mybirch|         Need a hug |         N

In [24]:
# The describe() method provides descriptive statistics that summarise the central tendency
# and dispersion of each column, ignoring NULL values.
# In this case, describe() helps me to see the count, mean, standard deviation,
# minimum, and maximum values for each column of the Spark DataFrame (tweets_dataset).
# For string columns, min and max are the lexicographically smallest and largest values,
# and mean and stddev are NULL unless some of the values can be read as numbers.

# I am also using The show() function to display the DataFrame in a tabular format.
# This is particularly useful when working in a console or interactive environment (like a Jupyter notebook).
# It makes the data easier to understand and inspect visually.

tweets_dataset.describe().show()

24/05/07 19:14:43 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.

+-------+--------------------+--------------------+--------------------+--------------------+--------------------+
|summary|            tweet_ID|      date_timestamp|        twitter_user|          tweet_text|  cleaned_tweet_text|
+-------+--------------------+--------------------+--------------------+--------------------+--------------------+
|  count|             1165608|             1598316|             1598316|             1598316|             1598243|
|   mean|1.9151690781009285E9|                NULL| 4.325887521835714E9|                NULL|                NULL|
| stddev| 1.575916195208061E8|                NULL|5.162733218454888E10|                NULL|                NULL|
|    min|          1467810369|Fri Apr 17 20:30:...|        000catnap000|                 ...|\t We love album ...|
|    max|          2072532109|      date_timestamp|          zzzzeus111|ï¿½ï¿½ï¿½ï¿½ï¿½ß§...|ï½ï½ï½ï½ï½ßï½Çï½ï...|
+-------+--------------------+--------------------+--------------------+--------
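
The describe() output above reports 1165608 non-null values for tweet_ID but 1598316 for the other columns, and the largest tweet_ID shown is 2072532109 even though the pandas preview earlier contains IDs such as 2193602129. A likely explanation, offered here as an assumption rather than something verified against the Spark run, is that IntegerType is a 32-bit signed integer, so larger IDs cannot be stored and come back as NULL. The check below uses plain Python to show which of the IDs quoted in this notebook fit in 32 bits.

```python
# Largest value a 32-bit signed integer (such as Spark's IntegerType) can hold.
INT32_MAX = 2**31 - 1

# Tweet IDs copied from the outputs shown in this notebook.
tweet_ids = [1467810369, 2072532109, 2193601966, 2193602129]

for tweet_id in tweet_ids:
    fits = tweet_id <= INT32_MAX
    print(tweet_id, "fits in 32 bits" if fits else "does not fit in 32 bits")

# IDs that would need a 64-bit type such as LongType.
overflowing = [t for t in tweet_ids if t > INT32_MAX]
```

Declaring the column as LongType in the schema would be one way to keep every ID.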

                                                                                

# Part 2

## Sentiment Analysis of the tweets

Tweets are a great way to get qualitative data because they show feelings, thoughts, and responses.

Sentiment analysis turns these feelings into numbers that can be used for statistical analysis.

This helps to find bigger trends and patterns that we might not notice just by reading.

### TextBlob Sentiment Analysis

**I am using the TextBlob library, which provides a simple API for common natural language processing (NLP) tasks, including sentiment analysis.**

In [25]:
# A User Defined Function (UDF) allows me to integrate custom Python logic into Spark DataFrame operations.
# Here, I am creating a UDF that uses TextBlob to perform sentiment analysis.

def sentiment_analysis(text):
    return TextBlob(text).sentiment.polarity

# Register the User Defined Function (UDF)

sentiment_udf = udf(sentiment_analysis, FloatType())
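
The describe() output earlier shows fewer non-null values in cleaned_tweet_text (1598243) than in the other columns, so some rows will reach the UDF as None. A scoring function that expects a string will typically raise an error once Spark evaluates those rows. The sketch below shows one way to guard a scoring function. The names null_safe and toy_polarity are hypothetical, and toy_polarity is only a stand-in scorer so that the example runs without TextBlob installed.

```python
from functools import wraps


def null_safe(score_fn):
    """Wrap a text-scoring function so that missing or blank text yields None."""
    @wraps(score_fn)
    def wrapper(text):
        if text is None or not text.strip():
            return None
        return float(score_fn(text))
    return wrapper


@null_safe
def toy_polarity(text):
    # Hypothetical stand-in scorer: +1 if "good" is present, -1 if "bad" is present.
    words = text.lower().split()
    return ("good" in words) - ("bad" in words)


print(toy_polarity("a good day"))   # scored normally
print(toy_polarity(None))           # skipped instead of raising
```

In the notebook the same wrapper could be applied to sentiment_analysis before it is passed to udf(); a None returned by a Python UDF is stored by Spark as NULL.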

In [26]:
# Apply the sentiment analysis UDF

tweets_dataset = tweets_dataset.withColumn("senti_score_TextBlob", sentiment_udf(tweets_dataset['cleaned_tweet_text']))

In [27]:
# Selecting and displaying only the required columns

tweets_dataset.select("cleaned_tweet_text", "senti_score_TextBlob").show()

+--------------------+--------------------+
|  cleaned_tweet_text|senti_score_TextBlob|
+--------------------+--------------------+
|  cleaned_tweet_text|                 0.0|
|switchfoot httptw...|                 0.2|
|is upset that he ...|                 0.0|
|Kenichan I dived ...|                 0.5|
|my whole body fee...|                 0.2|
|nationwideclass n...|              -0.625|
|Kwesidei not the ...|                 0.2|
|         Need a hug |                 0.0|
|LOLTrish hey  lon...|          0.27333334|
|Tatiana_K nope th...|                 0.0|
|twittera que me m...|                 0.0|
|spring break in p...|         -0.21428572|
|I just repierced ...|                 0.0|
|caregiving I coul...|                 0.0|
|octolinz16 It it ...|                 0.0|
|smarrison i would...|               0.075|
|iamjazzyfizzle I ...|                 0.0|
|Hollis death scen...|                 0.0|
|about to file taxes |                 0.0|
|LettyA ahh ive al...|          


In [28]:
# Save TextBlob sentiment_analysis_tweets results back to a CSV file

# tweets_dataset.write.csv('sentiment_analysis_textblob.csv', header=True)

### VADER Sentiment Analysis

VADER uses a rule-based sentiment analysis framework that excels at handling the informal language typically found on Twitter, Facebook, etc.

In [29]:
# VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool
# that is specifically attuned to sentiments expressed in social media, and works well on texts from other domains.

nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/geomarmunoz/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [30]:
# Function to calculate sentiment
def vader_sentiment(text):
    sid = SentimentIntensityAnalyzer()
    return sid.polarity_scores(text)['compound']

# Register the UDF
vader_sentiment_udf = udf(vader_sentiment, FloatType())
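
The vader_sentiment function above builds a new SentimentIntensityAnalyzer on every call, and constructing the analyzer loads the VADER lexicon each time. A common pattern is to build the expensive object once and reuse it. The sketch below shows that pattern with a hypothetical stand-in class (ExpensiveAnalyzer) so that it runs without NLTK. How often the instance would be rebuilt under Spark depends on how the UDF is shipped to the workers, which this sketch does not test.

```python
load_count = 0    # counts how often the expensive constructor runs
_analyzer = None  # process-wide cached instance


class ExpensiveAnalyzer:
    """Hypothetical stand-in for an analyzer that loads a lexicon on construction."""

    def __init__(self):
        global load_count
        load_count += 1
        self.lexicon = {"good": 1.0, "bad": -1.0}

    def score(self, text):
        return sum(self.lexicon.get(word, 0.0) for word in text.lower().split())


def get_analyzer():
    # Build the analyzer on first use, then reuse it for later calls in this process.
    global _analyzer
    if _analyzer is None:
        _analyzer = ExpensiveAnalyzer()
    return _analyzer


def cached_sentiment(text):
    return get_analyzer().score(text)


scores = [cached_sentiment(t) for t in ["good day", "bad day", "good good"]]
print(scores, load_count)
```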

In [31]:
tweets_dataset = tweets_dataset.withColumn("senti_score_Vader", vader_sentiment_udf(tweets_dataset['cleaned_tweet_text']))

In [32]:
tweets_dataset.select("cleaned_tweet_text", "senti_score_Vader").show()

+--------------------+-----------------+
|  cleaned_tweet_text|senti_score_Vader|
+--------------------+-----------------+
|  cleaned_tweet_text|              0.0|
|switchfoot httptw...|          -0.3818|
|is upset that he ...|          -0.7269|
|Kenichan I dived ...|           0.4939|
|my whole body fee...|            -0.25|
|nationwideclass n...|          -0.6597|
|Kwesidei not the ...|              0.0|
|         Need a hug |           0.4767|
|LOLTrish hey  lon...|           0.8286|
|Tatiana_K nope th...|              0.0|
|twittera que me m...|              0.0|
|spring break in p...|              0.0|
|I just repierced ...|              0.0|
|caregiving I coul...|          -0.5994|
|octolinz16 It it ...|          -0.1027|
|smarrison i would...|           0.3724|
|iamjazzyfizzle I ...|           0.2732|
|Hollis death scen...|          -0.9081|
|about to file taxes |              0.0|
|LettyA ahh ive al...|           0.6369|
+--------------------+-----------------+
only showing top

## TextBlob and Vader results

In [33]:
# Show the results to verify

tweets_dataset.select("cleaned_tweet_text", "senti_score_Vader", "senti_score_TextBlob").show()

# tweets_dataset.show()

+--------------------+-----------------+--------------------+
|  cleaned_tweet_text|senti_score_Vader|senti_score_TextBlob|
+--------------------+-----------------+--------------------+
|  cleaned_tweet_text|              0.0|                 0.0|
|switchfoot httptw...|          -0.3818|                 0.2|
|is upset that he ...|          -0.7269|                 0.0|
|Kenichan I dived ...|           0.4939|                 0.5|
|my whole body fee...|            -0.25|                 0.2|
|nationwideclass n...|          -0.6597|              -0.625|
|Kwesidei not the ...|              0.0|                 0.2|
|         Need a hug |           0.4767|                 0.0|
|LOLTrish hey  lon...|           0.8286|          0.27333334|
|Tatiana_K nope th...|              0.0|                 0.0|
|twittera que me m...|              0.0|                 0.0|
|spring break in p...|              0.0|         -0.21428572|
|I just repierced ...|              0.0|                 0.0|
|caregiv

In [34]:
# Save Vader sentiment_analysis_tweets results back to a CSV file

# tweets_dataset.write.csv('sentiment_analysis_vader.csv', header=True)

# Define UDFs for Sentiment Analysis

Create UDFs for TextBlob and VADER, and also for sentiment classification:

In [35]:
# UDF for TextBlob sentiment analysis

def textblob_sentiment(text):
    return TextBlob(text).sentiment.polarity

textblob_udf = udf(textblob_sentiment, FloatType())

# UDF for VADER sentiment analysis

def vader_sentiment(text):
    sid = SentimentIntensityAnalyzer()
    return sid.polarity_scores(text)['compound']

vader_udf = udf(vader_sentiment, FloatType())

# UDF for sentiment classification based on the score

def classify_sentiment(score):
    if score > 0:
        return 'Positive'
    elif score < 0:
        return 'Negative'
    else:
        return 'Neutral'

classify_udf = udf(classify_sentiment, StringType())
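
The classify_sentiment function above labels any non-zero score as Positive or Negative, so a score of 0.075 counts the same as 0.8. A common alternative is a small neutral band around zero. The VADER authors' documentation suggests ±0.05 for the compound score, while the value to use for TextBlob polarity is a judgment call. The sketch below is one possible variant; classify_with_threshold and its threshold parameter are hypothetical names introduced here. It also returns None for a missing score instead of raising.

```python
def classify_with_threshold(score, threshold=0.05):
    """Label a polarity score, treating values inside +/- threshold as Neutral."""
    if score is None:
        return None
    if score >= threshold:
        return 'Positive'
    if score <= -threshold:
        return 'Negative'
    return 'Neutral'


# Scores copied from the VADER output shown earlier in this notebook, plus a missing value.
for score in [-0.3818, 0.4939, 0.0, -0.1027, None]:
    print(score, classify_with_threshold(score))
```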

In [36]:
# Apply sentiment analysis

tweets_dataset = tweets_dataset.withColumn("TextBlob_Score", textblob_udf(col("cleaned_tweet_text")))
tweets_dataset = tweets_dataset.withColumn("Vader_Score", vader_udf(col("cleaned_tweet_text")))

# Classify sentiments

tweets_dataset = tweets_dataset.withColumn("TextBlob_Sentiment", classify_udf(col("TextBlob_Score")))
tweets_dataset = tweets_dataset.withColumn("Vader_Sentiment", classify_udf(col("Vader_Score")))

In [37]:
# Print Sentiment comparison table between TextBlob_Sentiment vs Vader_Sentiment

tweets_dataset.select(['cleaned_tweet_text', 'TextBlob_Sentiment', 'Vader_Sentiment']).show()

+--------------------+------------------+---------------+
|  cleaned_tweet_text|TextBlob_Sentiment|Vader_Sentiment|
+--------------------+------------------+---------------+
|  cleaned_tweet_text|           Neutral|        Neutral|
|switchfoot httptw...|          Positive|       Negative|
|is upset that he ...|           Neutral|       Negative|
|Kenichan I dived ...|          Positive|       Positive|
|my whole body fee...|          Positive|       Negative|
|nationwideclass n...|          Negative|       Negative|
|Kwesidei not the ...|          Positive|        Neutral|
|         Need a hug |           Neutral|       Positive|
|LOLTrish hey  lon...|          Positive|       Positive|
|Tatiana_K nope th...|           Neutral|        Neutral|
|twittera que me m...|           Neutral|        Neutral|
|spring break in p...|          Negative|        Neutral|
|I just repierced ...|           Neutral|        Neutral|
|caregiving I coul...|           Neutral|       Negative|
|octolinz16 It

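We can see that **Vader Sentiment appears more accurate** because it is tuned for sentiments expressed in social media and is optimized to understand text that includes emojis, slang, and shorthand, which makes it highly effective for datasets primarily composed of social media commentary. In the sample above, for example, Vader labels the tweet beginning "is upset that he ..." as Negative, while TextBlob labels it Neutral.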
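
Because the dataset carries no ground-truth sentiment labels, one way to put a number on the comparison is to measure how often the two tools agree. The sketch below does this in plain Python for the 13 tweet rows visible in the comparison table above (the header row that was read as data is left out). On the full dataset the same tally could be computed with a Spark groupBy on the two label columns.

```python
from collections import Counter

# (TextBlob_Sentiment, Vader_Sentiment) pairs copied from the comparison table.
pairs = [
    ('Positive', 'Negative'), ('Neutral', 'Negative'), ('Positive', 'Positive'),
    ('Positive', 'Negative'), ('Negative', 'Negative'), ('Positive', 'Neutral'),
    ('Neutral', 'Positive'), ('Positive', 'Positive'), ('Neutral', 'Neutral'),
    ('Neutral', 'Neutral'), ('Negative', 'Neutral'), ('Neutral', 'Neutral'),
    ('Neutral', 'Negative'),
]

# Tally each label combination, then count the rows where both tools agree.
confusion = Counter(pairs)
agreements = sum(n for (tb, vd), n in confusion.items() if tb == vd)

print(f"Agreement: {agreements}/{len(pairs)} = {agreements / len(pairs):.2f}")
for (tb, vd), n in sorted(confusion.items()):
    print(f"TextBlob={tb:<8} VADER={vd:<8} rows={n}")
```

A low agreement rate shows where the tools differ, but deciding which one is right for a given tweet still needs labelled examples or manual inspection.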

# This will continue in Part 2