<a href="https://colab.research.google.com/github/RiccardoRobb/BigData_project/blob/main/BigData_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Twitter sentiment analysis**
### BigData 2023 project

[@Author](https://github.com/RiccardoRobb): Riccardo Ruberto 1860609

---
# **Initial configuration**
### PySpark installation

In [4]:
! pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


---
### Useful imports

In [5]:
import pyspark
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark import SparkContext, SparkConf

---
### Spark configuration

In [6]:
# Build the configuration
conf = (
    SparkConf()
    .set("spark.ui.port", "8000")
    .set("spark.executor.memory", "4G")
    .set("spark.driver.memory", "45G")
    .set("spark.driver.maxResultSize", "10G")
    .setAppName("Twitter sentiment analysis")
    .setMaster("local[*]")
)

# Create the context, then the session on top of it
sc = pyspark.SparkContext(conf=conf)
spark = SparkSession.builder.getOrCreate()

---
### Check Spark configuration

In [7]:
spark

In [8]:
sc._conf.getAll()

[('spark.driver.port', '33227'),
 ('spark.app.startTime', '1687274292748'),
 ('spark.driver.memory', '45G'),
 ('spark.executor.id', 'driver'),
 ('spark.sql.warehouse.dir', 'file:/content/spark-warehouse'),
 ('spark.ui.port', '8000'),
 ('spark.driver.maxResultSize', '10G'),
 ('spark.driver.extraJavaOptions',
  '-Djava.net.preferIPv6Addresses=false -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calenda
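
The list of pairs returned by `getAll()` is easier to query once it is converted to a dictionary. The following is a minimal sketch that uses a few pairs copied from the output above in place of a live `sc._conf.getAll()` call; the names `pairs` and `settings` are illustrative and not part of the notebook.

```python
# Pairs copied from the output above; in the notebook this list
# would come from sc._conf.getAll().
pairs = [
    ("spark.driver.memory", "45G"),
    ("spark.ui.port", "8000"),
    ("spark.driver.maxResultSize", "10G"),
]

# Convert the (key, value) pairs into a dictionary for direct lookup.
settings = dict(pairs)

# Read back individual settings by key.
print(settings["spark.driver.memory"])   # 45G
print(settings.get("spark.ui.port"))     # 8000
```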

---
## **Data load**
### Kaggle installation

In [9]:
! pip install -q kaggle

---
### Useful imports

In [10]:
from google.colab import files
from os import environ

---
### Upload kaggle API key
You need to upload your `kaggle.json` file, generated by Kaggle.

In [11]:
uploaded = files.upload()

Saving kaggle.json to kaggle.json


---
### Setup kaggle directory

In [12]:
# define the Kaggle config folder (-p: no error if the folder already exists)
! mkdir -p "./kaggle" && mv "./kaggle.json" "./kaggle/kaggle.json"
environ['KAGGLE_CONFIG_DIR'] = './kaggle'

# make the Kaggle API key readable by the current user only
! chmod 600 ./kaggle/kaggle.json
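
The same setup can also be done without shell commands. The sketch below is one possible pure-Python equivalent, assuming the key was uploaded to the working directory; the helper name `setup_kaggle_dir` and its parameters are hypothetical and do not appear in the notebook.

```python
import os
import shutil
from pathlib import Path


def setup_kaggle_dir(key_file: str, config_dir: str) -> Path:
    """Move the API key into config_dir and restrict it to the owner.

    Hypothetical helper mirroring the mkdir / mv / chmod cell.
    """
    target_dir = Path(config_dir)
    # Create the folder; no error if it already exists.
    target_dir.mkdir(parents=True, exist_ok=True)

    target = target_dir / "kaggle.json"
    shutil.move(str(key_file), str(target))

    # Equivalent of `chmod 600`: read/write for the owner only.
    os.chmod(target, 0o600)

    # Tell the Kaggle client where to look for the key.
    os.environ["KAGGLE_CONFIG_DIR"] = str(target_dir)
    return target


# Example call matching the paths used in this notebook:
# setup_kaggle_dir("./kaggle.json", "./kaggle")
```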

---
### Download dataset and create data frame [***Sentiment140***]
1,600,000 tweets

In [84]:
! kaggle datasets download -d kazanova/sentiment140
! unzip "./*.zip" && rm *.zip
! mv training.1600000.processed.noemoticon.csv train140.csv

schema = StructType([
    StructField("target", IntegerType(), True),
    StructField("id", LongType(), True),
    StructField("full_date", StringType(), True),
    StructField("flag", StringType(), True),
    StructField("user", StringType(), True),
    StructField("text", StringType(), True),
])

# the CSV has no header row, so the schema is supplied explicitly
df = spark.read.csv('./train140.csv', schema=schema, header=False)

Downloading sentiment140.zip to /content
 82% 66.0M/80.9M [00:00<00:00, 229MB/s]
100% 80.9M/80.9M [00:00<00:00, 224MB/s]
Archive:  ./sentiment140.zip
  inflating: training.1600000.processed.noemoticon.csv  


### Removal of unnecessary columns

In [85]:
# the sentiment of a tweet is not affected by the "user" or by the "id"
df = df.drop("user", "id")

# check whether "flag" carries any information
print(df.select(countDistinct("flag")).collect()[0][0])

# "flag" has a single distinct value (NO_QUERY), so it is dropped as well
df = df.drop("flag")

1


### From ***full_date*** to ***day_name***, ***hour***, ***date***

In [86]:
months_map = {"Jan": "01", "Feb": "02", "Mar": "03", "Apr": "04", "May": "05", "Jun": "06", "Jul": "07", "Aug": "08", "Sep": "09", "Oct": "10", "Nov": "11", "Dec": "12"}

convert_date_udf = udf(lambda month_name : months_map[month_name], StringType())

In [87]:
split_col = pyspark.sql.functions.split(df['full_date'], ' ')

df = df.withColumn("day_name", split_col.getItem(0)) \
      .withColumn("hour", split_col.getItem(3)) \
      .withColumn("date", concat_ws("-", split_col.getItem(5), convert_date_udf(split_col.getItem(1)), split_col.getItem(2)))

df = df.drop("full_date")
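
To make the index arithmetic in the cell above easier to follow, here is a plain-Python sketch of the same split-and-reassemble logic applied to a single value. It assumes `full_date` strings look like `Mon Apr 06 22:19:45 PDT 2009`, which is consistent with the first row of the resulting table; the function name `split_full_date` is illustrative only.

```python
from typing import Tuple

MONTHS = {"Jan": "01", "Feb": "02", "Mar": "03", "Apr": "04",
          "May": "05", "Jun": "06", "Jul": "07", "Aug": "08",
          "Sep": "09", "Oct": "10", "Nov": "11", "Dec": "12"}


def split_full_date(full_date: str) -> Tuple[str, str, str]:
    """Return (day_name, hour, date) from a space-separated date string."""
    parts = full_date.split(" ")
    day_name = parts[0]                      # e.g. "Mon"
    hour = parts[3]                          # e.g. "22:19:45"
    # year-month-day, with the month name mapped to its number
    date = "-".join([parts[5], MONTHS[parts[1]], parts[2]])
    return day_name, hour, date


# Assumed sample value, not read from the dataset file.
print(split_full_date("Mon Apr 06 22:19:45 PDT 2009"))
# ('Mon', '22:19:45', '2009-04-06')
```

Index 4 (the time zone) is skipped, just as in the Spark version, so the resulting `hour` and `date` stay in whatever zone the original string used.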

---
### Verify schema

In [88]:
df.printSchema()
df.show(8)

root
 |-- target: integer (nullable = true)
 |-- text: string (nullable = true)
 |-- day_name: string (nullable = true)
 |-- hour: string (nullable = true)
 |-- date: string (nullable = false)

+------+--------------------+--------+--------+----------+
|target|                text|day_name|    hour|      date|
+------+--------------------+--------+--------+----------+
|     0|@switchfoot http:...|     Mon|22:19:45|2009-04-06|
|     0|is upset that he ...|     Mon|22:19:49|2009-04-06|
|     0|@Kenichan I dived...|     Mon|22:19:53|2009-04-06|
|     0|my whole body fee...|     Mon|22:19:57|2009-04-06|
|     0|@nationwideclass ...|     Mon|22:19:57|2009-04-06|
|     0|@Kwesidei not the...|     Mon|22:20:00|2009-04-06|
|     0|         Need a hug |     Mon|22:20:03|2009-04-06|
|     0|@LOLTrish hey  lo...|     Mon|22:20:03|2009-04-06|
+------+--------------------+--------+--------+----------+
only showing top 8 rows

