# Sentiment Analysis of Twitter Posts
<center><b>Notebook: Data Description, Cleaning, Exploratory Data Analysis, and Preprocessing</b></center>
<br>

**By**: Stephen Borja, Justin Ching, Erin Chua, and Zhean Ganituen.

**Dataset**: Hussein, S. (2021). Twitter Sentiments Dataset [Dataset]. Mendeley. https://doi.org/10.17632/Z9ZW7NT5H2.1

**Motivation**: Every minute, social media users generate a large influx of textual data on live events. Performing sentiment analysis on this data provides a real-time view of public perception, enabling quick insights into the general population's opinions and reactions.

**Goal**: By the end of the project, our goal is to create and compare supervised learning algorithms for sentiment analysis.

### **Dataset Description**

The Twitter Sentiments Dataset contains nearly 163k tweets. The period over which the tweets were collected is unknown, but the dataset was published to Mendeley Data on May 14, 2021 by Sherif Hussein of Mansoura University.

Tweets were extracted using the Twitter API, but the specifics of how the tweets were selected are unmentioned. The tweets are mostly English with a mix of some Hindi words for code-switching<a name="cite_ref-1"></a>[<sup>[1]</sup>](#cite_note-1). All of them seem to be talking about the political state of India. Most tweets mention Narendra Modi, the current Prime Minister of India.

Each tweet was labeled automatically using TextBlob's sentiment analysis (El-Demerdash, Hussein, & Zaki, 2022).

The `Twitter_Data` file has two columns:
- **`clean_text`**: The tweet's text
- **`category`**: The tweet's sentiment category

What each row and column represents: `each row represents one tweet.` <br>
Number of observations: `162,980`

---
**References**
1. El-Demerdash, A. A., Hussein, S. E., & Zaki, J. F. W. (2022). Course Evaluation Based on Deep Learning and SSA Hyperparameters Optimization. Computers, Materials & Continua, 71(1), 941–959. https://doi.org/10.32604/cmc.2022.021839

---

<a name="cite_note-1"></a>1. [^](#cite_ref-1) Code-switching is the practice of alternating between two languages, $L_1$ (the native language) and $L_2$ (the second language), in a conversation. In this context, the code-switching is done to appear more casual since the conversation takes place on Twitter (now X).

## 1 Project Setup
This section sets up the imports for the project (ensure these are installed via uv and are part of the environment) and loads the raw dataset.

In [1]:
import pandas as pd
import numpy as np
import os
import sys

# Import boilerplate file generated by `codegen.py`
from boilerplate import stopwords_set

# Import lib file for function definitions
sys.path.append(os.path.abspath("../lib"))
from janitor import normalize, rem_punctuation, rem_numbers, collapse_whitespace, rem_stopwords, clean_and_tokenize, find_spam_and_empty
from bag_of_words import BagOfWordsModel

# Imports for Stemming and Lemmatization
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet
import nltk

# Imports for Text Vectorization
from sklearn.feature_extraction.text import CountVectorizer

# Set up NLTK objects for stemming and lemmatization
nltk.download('wordnet')
nltk.download('omw-1.4')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Load raw data file
df = pd.read_csv("../data/Twitter_Data.csv")
df = df.dropna()

[nltk_data] Downloading package wordnet to /home/zrgnt/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/zrgnt/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


## 2 Data Cleaning
This section discusses the methodology for data cleaning.

We follow a data cleaning methodology similar to the one presented in [1].

The cleaning pipeline has four main functions. First, `normalize` converts the input text to ASCII-only characters (say, "cómo estás" becomes "como estas") and lowercases alphabetic symbols. The dataset contains Unicode characters (e.g., emojis and accented characters); any character that has no ASCII equivalent is replaced with the empty string (`''`). Second, `rem_punctuation` replaces punctuation marks and special characters with the empty string. Third, `collapse_whitespace` collapses every run of whitespace characters into a single space. Formally, it is a transducer $\Box^+ \mapsto \Box$, where $\Box$ is the whitespace character. Since the strings are clean at this point, tokenization reduces to a string split at word boundaries. Finally, `rem_stopwords` removes the stopwords from the tokenized string.
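
The steps above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the code in `lib/janitor.py`: the `*_sketch` functions are hypothetical stand-ins, and the real helpers may treat edge cases (numbers, URLs, mentions) differently.

```python
import re
import string
import unicodedata


# Hypothetical stand-ins for the helpers in lib/janitor.py; the real
# implementations may differ.
def normalize_sketch(text: str) -> str:
    """Transliterate to ASCII where possible, drop the rest, and lowercase."""
    decomposed = unicodedata.normalize("NFKD", text)  # split accents from letters
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return ascii_only.lower()


def rem_punctuation_sketch(text: str) -> str:
    """Delete ASCII punctuation and special characters."""
    return text.translate(str.maketrans("", "", string.punctuation))


def collapse_whitespace_sketch(text: str) -> str:
    """Map every run of whitespace to a single space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def clean_and_tokenize_sketch(text: str, stopwords: set[str]) -> list[str]:
    """Run the three cleaning steps, split on spaces, and drop stopwords."""
    cleaned = collapse_whitespace_sketch(
        rem_punctuation_sketch(normalize_sketch(text))
    )
    return [tok for tok in cleaned.split(" ") if tok and tok not in stopwords]


example = "Cómo   estás?  Modi is NOT okay \U0001F600!!"
print(clean_and_tokenize_sketch(example, {"is", "you", "when", "after"}))
# ['como', 'estas', 'modi', 'not', 'okay']
```

Running the steps in this order means the split never sees empty tokens caused by removed punctuation or emojis, which is why tokenization can stay a plain string split.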

**Remark on Stopwords.** For stopword removal, we refer to the English stopword lists defined in NLTK and Wolfram Mathematica [2,3]. However, since the task is sentiment analysis, words that convey polarity, intensification, and negation are important, and words like "not" and "okay" are commonly included in stopword lists. Therefore, the stopwords from [2,3] are manually adjusted to include only sentiment-neutral words, such as "after", "when", and "you".
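
As a rough illustration of this adjustment, the sketch below removes sentiment-bearing words from a stopword list with a set difference. Both word sets are small hypothetical examples chosen for this sketch; they are not the generated `stopwords_set` used in the notebook.

```python
# Illustrative subset of a general-purpose English stopword list
# (hypothetical; not the project's generated `stopwords_set`).
base_stopwords = {"after", "when", "you", "the", "not", "no", "very", "okay"}

# Words that carry negation, intensification, or polarity must survive
# cleaning, so they are taken out of the stopword list.
sentiment_bearing = {"not", "no", "very", "okay"}

adjusted_stopwords = base_stopwords - sentiment_bearing

tokens = ["you", "are", "not", "okay", "after", "the", "rally"]
kept = [tok for tok in tokens if tok not in adjusted_stopwords]
print(kept)  # ['are', 'not', 'okay', 'rally']
```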

**Remark on Potential Spam.** Since the corpus comes from Twitter, spam may become an issue at the vector representation step. Hence, we employ simple rule-based spam detection.
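
The notebook does not list the actual rules, so the following is only a hypothetical example of what one rule-based check could look like. The function name, the repetition rule, and both thresholds (`max_share`, `min_len`) are assumptions made for illustration; the real `find_spam_and_empty` may work differently. Returning `None` for rejected entries is consistent with the later `dropna` call on the cleaned column.

```python
from collections import Counter
from typing import Optional


def find_spam_and_empty_sketch(
    tokens: list[str], max_share: float = 0.5, min_len: int = 6
) -> Optional[list[str]]:
    """Return None for empty or spam-like token lists, else the tokens unchanged."""
    if not tokens:
        return None  # nothing left after cleaning
    # Hypothetical rule: in a sufficiently long tweet, one token that makes up
    # more than `max_share` of all tokens is treated as repetition spam.
    if len(tokens) >= min_len:
        _, top_count = Counter(tokens).most_common(1)[0]
        if top_count / len(tokens) > max_share:
            return None
    return tokens


print(find_spam_and_empty_sketch([]))                        # None
print(find_spam_and_empty_sketch(["win"] * 5 + ["now"]))     # None
print(find_spam_and_empty_sketch(["modi", "speech", "today"]))
# ['modi', 'speech', 'today']
```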

---
**References**
1. George, M., & Murugesan, R. (2024). Improving sentiment analysis of financial news headlines using hybrid Word2Vec-TFIDF feature extraction technique. Procedia Computer Science, 244, 1–8. https://doi.org/10.1016/j.procs.2024.10.172
2. Wolfram Language. (2015). DeleteStopwords. Wolfram Language & System Documentation Center. Retrieved from https://reference.wolfram.com/language/ref/DeleteStopwords.html
3. Bird, S., & Loper, E. (2004, July). NLTK: The Natural Language Toolkit. Proceedings of the ACL Interactive Poster and Demonstration Sessions, 214–217. Retrieved from https://aclanthology.org/P04-3031/

In [2]:
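# Perform the cleaning stage.
# At this point NaN entries shouldn't exist (they were dropped when the data was loaded).
df["clean_ours"] = df["clean_text"].map(lambda x: clean_and_tokenize(x, stopwords_set))
# Check each token list for spam and emptiness; rows left with missing values are dropped below.
df["clean_ours"] = df["clean_ours"].map(find_spam_and_empty)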

df = df.dropna(subset=["clean_ours"]).copy()

## 3 Preprocessing

> 🏗️ Perhaps swap S3 and S4. Refer to literature on what comes first.

This section discusses preprocessing steps for the cleaned data.

### Stemming and Lemmatization

We follow a methodology similar to the one presented in [1]. We preprocess the dataset entries via stemming and lemmatization, using NLTK's `PorterStemmer` and `WordNetLemmatizer`, respectively [2]. For lemmatization, we use WordNet for English and Open Multilingual WordNet version 1.4 for translations and multilingual support, which is important in our case since some tweets contain text from Indian languages.
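
To make the effect of stemming concrete, here is a small sketch using NLTK's `PorterStemmer`, which needs no corpus download. The token list is a toy example, not data from the corpus.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Toy tokens; stems are truncated forms and need not be dictionary words.
tokens = ["running", "studies", "achievement"]
stems = [stemmer.stem(tok) for tok in tokens]
print(stems)  # ['run', 'studi', 'achiev']
```

Lemmatization, in contrast, returns dictionary forms. Note that `WordNetLemmatizer.lemmatize` treats every token as a noun unless a `pos` argument is passed, so verb forms are left unchanged by default. This may be why inflected forms such as "accused" and "accusing" both appear in the vocabulary printed later in this notebook.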

---
**References**
1. George, M., & Murugesan, R. (2024). Improving sentiment analysis of financial news headlines using hybrid Word2Vec-TFIDF feature extraction technique. Procedia Computer Science, 244, 1–8. https://doi.org/10.1016/j.procs.2024.10.172
2. Bird, S., & Loper, E. (2004, July). NLTK: The Natural Language Toolkit. Proceedings of the ACL Interactive Poster and Demonstration Sessions, 214–217. Retrieved from https://aclanthology.org/P04-3031/

In [3]:
df["stemmed"] = df["clean_ours"].map(lambda tokens: [stemmer.stem(t) for t in tokens])
df["lemmatized"] = df["clean_ours"].map(lambda tokens: [lemmatizer.lemmatize(t) for t in tokens])

# do this, since vectorizer expects a string not an array of strings
df["lemmatized_str"] = df["lemmatized"].apply(lambda x: " ".join(x))

### Vector Representation

After the stemming and lemmatization steps, we can now map each entry to its vector representation under a Bag of Words (BoW) model. We use scikit-learn's `CountVectorizer`, which is a ready-to-use implementation of the BoW model.
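
As a minimal sketch of what a BoW representation computes, the pure-Python example below builds a vocabulary and a document-term count matrix for three toy documents. It is an illustration only and is not the project's `BagOfWordsModel`; for this toy input, `CountVectorizer` with default settings should yield the same counts, stored as a sparse matrix instead of nested lists.

```python
from collections import Counter

# Toy documents, already cleaned and joined back into strings.
docs = ["modi speech good", "speech bad bad", "good good policy"]

# Vocabulary: the sorted unique tokens; each term becomes one column.
vocabulary = sorted({tok for doc in docs for tok in doc.split()})

# Document-term matrix: one row per document, one count per vocabulary term.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts.get(term, 0) for term in vocabulary])

print(vocabulary)  # ['bad', 'good', 'modi', 'policy', 'speech']
for row in matrix:
    print(row)
# [0, 1, 1, 0, 1]
# [2, 0, 0, 0, 1]
# [0, 2, 0, 1, 0]
```

Most cells are zero even in this tiny example, which is why real BoW matrices are stored in sparse form.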

**Related Literature and Vectorization Techniques.** Traditional vectorization techniques include BoW and Term Frequency-Inverse Document Frequency (TF-IDF). TF-IDF weights each word based on its frequency in a document and how rare it is across the corpus, reducing the impact of common words. BoW, in contrast, simply counts word occurrences without considering corpus-level frequency. In this project, BoW was chosen because stopwords were already removed during preprocessing, and the dataset is domain-specific [3]. In such datasets, frequent words are often meaningful domain keywords, so scaling them down (as TF-IDF would) could reduce the importance of these key terms in the feature representation.
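
The difference between the two weightings can be seen on a toy corpus. The sketch below uses the textbook definition $\text{tf-idf}(t, d) = \text{tf}(t, d) \cdot \ln(N / \text{df}(t))$, which is an assumption made for illustration; library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed variant by default, so their weights would not be exactly zero.

```python
import math
from collections import Counter

# Toy tokenized corpus in which "modi" is a domain keyword present everywhere.
docs = [
    ["modi", "speech", "good"],
    ["modi", "policy", "bad"],
    ["modi", "rally", "good"],
]
n_docs = len(docs)

# Document frequency: in how many documents each term occurs.
doc_freq = Counter(term for doc in docs for term in set(doc))


def tf_idf(term: str, doc: list[str]) -> float:
    """Textbook TF-IDF: raw term count times ln(N / df)."""
    return doc.count(term) * math.log(n_docs / doc_freq[term])


first = docs[0]
bow_weights = {term: first.count(term) for term in first}
tfidf_weights = {term: round(tf_idf(term, first), 3) for term in first}

print(bow_weights)    # {'modi': 1, 'speech': 1, 'good': 1}
print(tfidf_weights)  # {'modi': 0.0, 'speech': 1.099, 'good': 0.405}
```

Under BoW every term in the first document counts equally, whereas this TF-IDF variant assigns the ubiquitous keyword "modi" a weight of zero, which illustrates the concern about down-weighting domain keywords.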

The resulting matrix has 162,969 rows, which is the number of entries in the dataset (with `NaN` entries removed), and 101,284 columns, which is the number of unique words in the dataset.

---

**References**
1. Rani, D., Kumar, R., & Chauhan, N. (2022, October). Study and comparision of vectorization techniques used in text classification. In 2022 13th international conference on computing communication and networking technologies (ICCCNT) (pp. 1-6). IEEE.
2. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

In [10]:
bow = BagOfWordsModel(df["lemmatized_str"], 0.0004)

# some sanity checks
assert bow.matrix.shape[0] == df.shape[0], "number of rows in the matrix DOES NOT match the number of documents"
# density = stored non-zero entries / total cells; a BoW matrix should not be fully dense
assert bow.matrix.nnz / (bow.matrix.shape[0] * bow.matrix.shape[1]) < 1, "the matrix is fully dense, something went wrong"

['aadhaar' 'aadhar' 'aaj' 'aam' 'aap' 'aayega' 'aayog' 'abdullah' 'abe'
 'abhi' 'abhinandan' 'abhisar' 'ability' 'abki' 'able' 'about' 'above'
 'abroad' 'absolute' 'absolutely' 'absurd' 'abt' 'abuse' 'abused'
 'abusing' 'abusive' 'abv' 'accept' 'acceptable' 'accepted' 'accepting'
 'access' 'acche' 'accomplished' 'accomplishment' 'account'
 'accountability' 'accountable' 'accuse' 'accused' 'accusing' 'ache'
 'achhe' 'achieve' 'achieved' 'achievement' 'achieving' 'acknowledge'
 'acronym' 'act' 'acting' 'action' 'active' 'activist' 'activity' 'actor'
 'actual' 'actually' 'ad' 'adani' 'adanis' 'add' 'added' 'address'
 'addressed' 'addressing' 'adityanath' 'administration' 'admire' 'admit'
 'advance' 'advani' 'advantage' 'advertisement' 'advertising' 'advice'
 'advise' 'advisor' 'affair' 'affect' 'affected' 'afford' 'affordable'
 'afraid' 'agar' 'age' 'agency' 'agenda' 'agent' 'aggressive' 'ago'
 'agree' 'agreed' 'agriculture' 'ahead' 'aid' 'aiims' 'aim' 'air'
 'aircraft']


## 4 Exploratory Data Analysis

This section discusses the exploratory data analysis conducted on the dataset after cleaning.

> Notes from Zhean: <br>
> From manual checking via OpenRefine, there are a total of 162,972 entries. `df.info()` should report the same count post-processing.
> Furthermore, there should be two columns. The first, `clean_text` (a bit of a misnomer since it is still dirty), contains the tweets (text data). The second, `category`, contains the sentiment of the tweet and is a tribool (1 for positive, 0 for neutral or indeterminate, and -1 for negative).
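
A check along the lines of the note above could be sketched as follows. The `toy` frame and the `LABEL_NAMES` mapping are hypothetical stand-ins for the loaded dataset; the same calls would apply to the real `df`.

```python
import pandas as pd

# Hypothetical mapping from the tribool codes to readable names.
LABEL_NAMES = {-1: "negative", 0: "neutral", 1: "positive"}

# Toy stand-in for the raw data, including one row with a missing tweet.
toy = pd.DataFrame(
    {
        "clean_text": ["good work", "no comment", "bad move", None],
        "category": [1.0, 0.0, -1.0, 1.0],
    }
).dropna()

# The two expected columns should be present.
assert list(toy.columns) == ["clean_text", "category"]

# Every label should be one of the three tribool values.
labels = toy["category"].astype(int)
unexpected = set(labels.unique()) - set(LABEL_NAMES)
assert not unexpected, f"unexpected labels: {unexpected}"

# Class distribution by readable name.
counts = labels.map(LABEL_NAMES).value_counts()
print(len(toy))  # 3
print(counts.to_dict())
```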
