<a href="https://colab.research.google.com/github/RiccardoRobb/BigData_project/blob/main/BigData_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Twitter sentiment analysis**
### BigData 2023 project

[@Author](https://github.com/RiccardoRobb): Riccardo Ruberto 1860609

---
## **Initial configuration**
### PySpark installation

In [1]:
! pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


### Useful imports

In [2]:
import pyspark
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark import SparkContext, SparkConf

### Spark configuration

In [3]:
# Build the Spark configuration
conf = SparkConf().\
                set('spark.ui.port', "8000").\
                set('spark.executor.memory', '4G').\
                set('spark.driver.memory', '45G').\
                set('spark.driver.maxResultSize', '10G').\
                setAppName("Twitter sentiment analysis").\
                setMaster("local[*]")

# Create the context and the session
sc = pyspark.SparkContext(conf=conf)
spark = SparkSession.builder.getOrCreate()

### Check Spark configuration

In [4]:
spark

In [5]:
sc._conf.getAll()

[('spark.app.startTime', '1687364497767'),
 ('spark.driver.memory', '45G'),
 ('spark.driver.host', '100b8f9ee09e'),
 ('spark.executor.id', 'driver'),
 ('spark.sql.warehouse.dir', 'file:/content/spark-warehouse'),
 ('spark.ui.port', '8000'),
 ('spark.app.submitTime', '1687364497560'),
 ('spark.driver.maxResultSize', '10G'),
 ('spark.app.id', 'local-1687364500337'),
 ('spark.driver.extraJavaOptions',
  '-Djava.net.preferIPv6Addresses=false -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED

---
## Load dataset [***Sentiment140***]
### Download dataset

In [6]:
! wget https://raw.githubusercontent.com/RiccardoRobb/BigData_project/main/Sentiment140.zip

! unzip "./*.zip" && rm *.zip
! mv training.1600000.processed.noemoticon.csv train140.csv

--2023-06-21 16:21:44--  https://raw.githubusercontent.com/RiccardoRobb/BigData_project/main/Sentiment140.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84855679 (81M) [application/zip]
Saving to: ‘Sentiment140.zip’


2023-06-21 16:21:44 (312 MB/s) - ‘Sentiment140.zip’ saved [84855679/84855679]

Archive:  ./Sentiment140.zip
  inflating: training.1600000.processed.noemoticon.csv  


### Create data frame [***Sentiment140***]
1600000 tweets

In [7]:
schema = StructType([
    StructField("target", IntegerType(), True),
    StructField("id", LongType(), True),
    StructField("full_date", StringType(), True),
    StructField("flag", StringType(), True),
    StructField("user", StringType(), True),
    StructField("text", StringType(), True),
])

df = spark.read.csv('./train140.csv', schema=schema, header="false")
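
The six-column, headerless layout that the schema describes can be illustrated without Spark. The following is only a minimal sketch: the sample row is hypothetical (not taken from the dataset), it assumes double-quoted fields, and the `Tweet` and `parse_rows` names are introduced here for illustration.

```python
import csv
import io
from typing import Iterator, NamedTuple, TextIO


class Tweet(NamedTuple):
    """One record, mirroring the field names and types of the Spark schema."""
    target: int
    id: int
    full_date: str
    flag: str
    user: str
    text: str


# Hypothetical row in the assumed layout: no header, six quoted fields.
SAMPLE = '"0","1234567890","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","some_user","hello, world"\n'


def parse_rows(handle: TextIO) -> Iterator[Tweet]:
    """Yield typed records from a headerless CSV stream."""
    for row in csv.reader(handle):
        target, tweet_id, full_date, flag, user, text = row
        yield Tweet(int(target), int(tweet_id), full_date, flag, user, text)


tweets = list(parse_rows(io.StringIO(SAMPLE)))
print(tweets[0])
```

The quoting matters because tweet text can contain commas; a naive `split(",")` would break such rows into too many fields.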

---
## **Initial cleaning**
### Removal of unnecessary columns

In [8]:
# sentiment of the tweet is not affected by the "user"
df = df.drop("user")

# check whether "flag" carries any information
print(df.select(countDistinct("flag")).collect()[0][0])

# "flag" has only one value == NO_QUERY, so I delete it
df = df.drop("flag")

1


### From ***full_date*** to ***day_name***, ***hour***, ***date***

In [9]:
months_map = {"Jan": "01", "Feb": "02", "Mar": "03", "Apr": "04", "May": "05", "Jun": "06", "Jul": "07", "Aug": "08", "Sep": "09", "Oct": "10", "Nov": "11", "Dec": "12"}

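# .get() returns None (a null date) for an unexpected month name instead of failing the whole job
convert_date_udf = udf(lambda month_name: months_map.get(month_name), StringType())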

In [10]:
split_col = pyspark.sql.functions.split(df['full_date'], ' ')

df = df.withColumn("day_name", split_col.getItem(0)) \
      .withColumn("hour", split_col.getItem(3)) \
      .withColumn("date", to_date( concat_ws("-", split_col.getItem(2), convert_date_udf(split_col.getItem(1)), split_col.getItem(5)), "dd-MM-yyyy"))

df = df.drop("full_date")
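
The month mapping and the field positions used above can be sanity-checked in plain Python before running them on the full data frame. This is a sketch under assumptions: `split_full_date` is a hypothetical helper that mirrors the Spark expressions, and the cross-check relies on Python's default (C) locale for month abbreviations.

```python
from datetime import date, datetime

months_map = {"Jan": "01", "Feb": "02", "Mar": "03", "Apr": "04",
              "May": "05", "Jun": "06", "Jul": "07", "Aug": "08",
              "Sep": "09", "Oct": "10", "Nov": "11", "Dec": "12"}


def split_full_date(full_date):
    """Split a string such as 'Mon Apr 06 22:19:45 PDT 2009' into
    (day_name, hour, date), using the same positions as the Spark code."""
    day_name, month_name, day, clock, _timezone, year = full_date.split(" ")
    parsed = date(int(year), int(months_map[month_name]), int(day))
    return day_name, clock, parsed


def mapping_is_consistent():
    """Compare every entry of months_map with datetime's own month parsing."""
    return all(datetime.strptime(name, "%b").month == int(number)
               for name, number in months_map.items())


print(mapping_is_consistent())
print(split_full_date("Mon Apr 06 22:19:45 PDT 2009"))
```

Note that the ***hour*** column holds the full `hh:mm:ss` time of day, not only the hour.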

---
## **Data analysis**
### ***target*** values analysis

In [11]:
df.select("target").distinct().show()
# "target" is either 0 or 4

+------+
|target|
+------+
|     4|
|     0|
+------+



In [12]:
sad_tweets = df.filter(col("target") == 0)
happy_tweets = df.filter(col("target") == 4)

print("Sad tweets = ", sad_tweets.count())
print("Happy tweets = ", happy_tweets.count())

Sad tweets =  800000
Happy tweets =  800000


Sad tweets and happy tweets are balanced.

### Time frame of interest

In [13]:
print("Min date = ", df.select(min(df.date)).collect()[0][0])
print("Max date = ", df.select(max(df.date)).collect()[0][0])

Min date =  2009-04-06
Max date =  2009-06-25


The time frame is too narrow: the tweets were collected over less than three months.
*The date column would be useful only for predicting tweets written during the [2009-04-06, 2009-06-25] period.*

### Better to delete the ***date*** column

In [14]:
df = df.drop("date")

---
## **Data processing**
### Case normalization

In [15]:
df = df.withColumn("text", lower(col("text")))

### Username and links removal

In [16]:
import re

# Twitter usernames can contain alphanumeric and '_' characters
username_regex = r"@[A-Za-z0-9_]+"

# http:// / https:// links
link_regex1 = r"https?://[^ ]+"

# www. links (the dot is escaped so that it matches only a literal '.')
link_regex2 = r"www\.[^ ]+"


master_regex = r"|".join((username_regex, link_regex1, link_regex2))

df = df.withColumn("text", regexp_replace(df.text, master_regex, ""))
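
The behaviour of the combined pattern can be checked on a single string with Python's `re` module, since the patterns use only syntax shared by Python and Java regular expressions. A minimal sketch, assuming the dot in the `www` pattern is meant literally; the sample text and the `strip_mentions_and_links` helper are hypothetical.

```python
import re

USERNAME = r"@[A-Za-z0-9_]+"
HTTP_LINK = r"https?://[^ ]+"
WWW_LINK = r"www\.[^ ]+"          # escaped dot: matches only a literal '.'
MASTER = "|".join((USERNAME, HTTP_LINK, WWW_LINK))


def strip_mentions_and_links(text):
    """Remove @mentions, http(s) links and www. links from a string."""
    return re.sub(MASTER, "", text)


sample = "@someuser check http://example.com/a and www.example.org wwwhat a day"
cleaned = strip_mentions_and_links(sample)
print(repr(cleaned))

# With an unescaped dot, an ordinary word starting with "www" is removed too.
print(repr(re.sub(r"www.[^ ]+", "", "wwwhat a day")))
```

The removal leaves doubled and leading spaces behind, which is why a trimming step is still needed afterwards.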

### Filter out punctuation and other non-letter characters

In [17]:
# raw string, so that \s reaches the regex engine unchanged
df = df.withColumn("text", regexp_replace(df.text, r"[^a-zA-Z\s]", ""))

### Trimming

In [18]:
df = df.withColumn("text", trim(col("text")))

# collapse runs of spaces into a single space
df = df.withColumn("text", trim(regexp_replace(df.text, " +", " ")))
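
The combined effect of the non-letter filter and the trimming step can be reproduced on one string in plain Python. This is a sketch, not the notebook's code: `strip_non_letters_and_squeeze` is a hypothetical helper, and the input is an illustrative lowercased string with mentions and links already removed.

```python
import re


def strip_non_letters_and_squeeze(text):
    """Drop everything except letters and whitespace, then trim and
    collapse runs of spaces, in the same order as the Spark steps."""
    letters_only = re.sub(r"[^a-zA-Z\s]", "", text)
    return re.sub(r" +", " ", letters_only.strip()).strip()


raw = "  - awww, that's a bummer.  you shoulda got david carr of third day to do it. ;d"
print(strip_non_letters_and_squeeze(raw))
```

Two side effects are worth noticing: apostrophes are deleted rather than replaced by a space, so `that's` becomes `thats`, and digits are removed together with punctuation.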

### Tokenization

In [19]:
from pyspark.ml.feature import Tokenizer

tokenizer = Tokenizer(inputCol = "text", outputCol = "tokens")
tokens_df = tokenizer.transform(df)

### Stopwords removal

In [20]:
from pyspark.ml.feature import StopWordsRemover

stopwords_remover = StopWordsRemover(inputCol = "tokens", outputCol = "terms")
terms_df = stopwords_remover.transform(tokens_df)
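
What tokenization and stop-word removal do to a single cleaned tweet can be approximated in plain Python. This is only a sketch: `STOP_WORDS` is a tiny hypothetical subset (Spark's default English list is much longer), and `str.split()` only approximates Spark's whitespace splitting.

```python
# Hypothetical subset of English stop words, for illustration only.
STOP_WORDS = {"a", "you", "of", "to", "do", "it"}


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word set."""
    return [token for token in tokens if token not in stop_words]


tokens = tokenize("awww thats a bummer you shoulda got david carr of third day to do it d")
terms = remove_stop_words(tokens)
print(terms)
```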

### Stemming

In [24]:
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer(language = "english")
stemmer_udf = udf(lambda tokens: [stemmer.stem(token) for token in tokens], ArrayType(StringType()))


tweets_df = terms_df.withColumn("terms_stemmed", stemmer_udf("terms"))

### Removal of unnecessary columns

In [25]:
tweets_df = tweets_df.drop("id", "text", "tokens", "terms")

tweets_df.show(7, truncate = False)

+------+--------+--------+-----------------------------------------------------------------------------------+
|target|day_name|hour    |terms_stemmed                                                                      |
+------+--------+--------+-----------------------------------------------------------------------------------+
|0     |Mon     |22:19:45|[awww, that, bummer, shoulda, got, david, carr, third, day, d]                     |
|0     |Mon     |22:19:49|[upset, cant, updat, facebook, text, might, cri, result, school, today, also, blah]|
|0     |Mon     |22:19:53|[dive, mani, time, ball, manag, save, rest, go, bound]                             |
|0     |Mon     |22:19:57|[whole, bodi, feel, itchi, like, fire]                                             |
|0     |Mon     |22:19:57|[behav, im, mad, cant, see]                                                        |
|0     |Mon     |22:20:00|[whole, crew]                                                                      |
|

---
## Verify data

In [None]:
#df.printSchema()
#df.show(10)

#print("tot = ", df.count())

tweets_df.printSchema()
tweets_df.show(10)

print("tot = ", tweets_df.count())

# TODO
* **Are tokens with len == 1 useful?**
* **Check the validity of the UDF used for month mapping**
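
One possible way to approach the question about one-character tokens is to count them first and only then decide whether to drop them. The sketch below works on a few rows copied from the `terms_stemmed` output shown earlier; `single_char_counts` and `drop_short_tokens` are hypothetical helpers, and the minimum length of 2 is an assumption.

```python
from collections import Counter

# A few rows copied from the terms_stemmed column shown in the output above.
rows = [
    ["awww", "that", "bummer", "shoulda", "got", "david", "carr", "third", "day", "d"],
    ["whole", "bodi", "feel", "itchi", "like", "fire"],
    ["behav", "im", "mad", "cant", "see"],
]


def single_char_counts(rows):
    """Count how often each one-character token occurs across all rows."""
    return Counter(token for row in rows for token in row if len(token) == 1)


def drop_short_tokens(row, min_len=2):
    """Keep only the tokens with at least min_len characters."""
    return [token for token in row if len(token) >= min_len]


print(single_char_counts(rows))
print(drop_short_tokens(rows[0]))
```

In the first row the lone `d` is what remains of the `;D` emoticon after punctuation removal, which suggests that such tokens may be residue rather than words.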