# Sentiment Analysis of Twitter Posts
<!-- Notebook name goes here -->
<center><b>Notebook: Data Description, Cleaning, Exploratory Data Analysis, and Preprocessing</b></center>
<br>

**by**: Stephen Borja, Justin Ching, Erin Chua, and Zhean Ganituen.

**dataset**: Hussein, S. (2021). Twitter Sentiments Dataset [Dataset]. Mendeley. https://doi.org/10.17632/Z9ZW7NT5H2.1

**motivation**: Every minute, social media users generate a large influx of textual data on live events. Performing sentiment analysis on this data provides a real-time view of public perception, enabling quick insights into the general population’s opinions and reactions.

**goal**: By the end of the project, our goal is to create and compare supervised learning algorithms for sentiment analysis.

### **dataset description**

The Twitter Sentiments Dataset contains nearly 163k tweets from Twitter. The period over which they were collected is unknown, but the dataset was published to Mendeley Data on May 14, 2021 by Sherif Hussein of Mansoura University.

Tweets were extracted using the Twitter API, but the specifics of how the tweets were selected are not mentioned. The tweets are mostly in English, with some Hindi words mixed in through code-switching<a name="cite_ref-1"></a><sup>[1](#cite_note-1)</sup> <u>(El-Demerdash., 2021)</u>. All of them appear to discuss the political state of India, and most mention Narendra Modi, the current Prime Minister of India.

Each tweet was labeled automatically using TextBlob's sentiment analysis <u>(El-Demerdash, Hussein, & Zaki, 2021)</u>.

Twitter_Data
- **`clean_text`**: The tweet's text
- **`category`**: The tweet's sentiment category

What each row and column represents: `each row represents one tweet.` <br>
Number of observations: `162,980`

---

<a name="cite_note-1"></a>1. [^](#cite_ref-1) Code-switching is the practice of alternating between two languages $L_1$ (the native language) and $L_2$ (the source language) in a conversation. In this context, the code-switching is done to appear more casual since the conversation is done via Twitter (now, X). 

## **1 project set up**
We set the global imports for the project here (ensure these packages are installed via uv and are part of the environment). We also load the dataset here.

In [1]:
import pandas as pd
import numpy as np
import os
import sys

# Use lib directory
sys.path.append(os.path.abspath("../lib"))

# Imports from lib files
from janitor import *
from lemmatize import lemmatizer
from boilerplate import stopwords_set
from bag_of_words import BagOfWordsModel

# Pandas configuration
pd.set_option("display.max_colwidth", None)

# Load raw data file
df = pd.read_csv("../data/Twitter_Data.csv")

## **2 data cleaning**
This section discusses the methodology for data cleaning.

To avoid wasting computation, a preliminary step is to ensure that no `NaN` or duplicate entries exist before the cleaning steps. Every time we drop entries (via `dropna()` or `drop_duplicates()`), we show the result of `info()` to see how many entries were filtered out.

Let's first drop the `NaN` entries.

In [2]:
df = df.dropna()
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162969 entries, 0 to 162979
Data columns (total 2 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   clean_text  162969 non-null  object 
 1   category    162969 non-null  float64
dtypes: float64(1), object(1)
memory usage: 3.7+ MB


Now, remove the duplicates.

In [3]:
df = df.drop_duplicates()
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162969 entries, 0 to 162979
Data columns (total 2 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   clean_text  162969 non-null  object 
 1   category    162969 non-null  float64
dtypes: float64(1), object(1)
memory usage: 3.7+ MB


## **main cleaning pipeline**

We follow a data cleaning methodology similar to the one presented in (George & Murugesan, 2024).

### **normalization**
The first function is `normalize`. It normalizes the text input to ASCII-only characters (say, "cómo estás" becomes "como estas") and lowercases alphabetic symbols. The dataset contains Unicode characters (e.g., emojis and accented characters); the function maps accented characters to their ASCII base letters and replaces characters with no ASCII equivalent (such as emojis) with the empty string (`''`).

In [4]:
normalize??

[31mSignature:[39m normalize(text: str) -> str
[31mSource:[39m   
[38;5;28;01mdef[39;00m normalize(text: str) -> str:
    [33m"""[39m
[33m    Normalize text from a pandas entry to ASCII-only lowercase characters. Hence, this removes Unicode characters with no ASCII[39m
[33m    equivalent (e.g., emojis and CJKs).[39m

[33m    Do not use this function alone, use `clean_and_tokenize()`.[39m

[33m    # Parameters[39m
[33m    * text: String entry.[39m

[33m    # Returns[39m
[33m    ASCII-normalized text containing only lowercase letters.[39m

[33m    # Examples[39m
[33m    normalize("¬øC√≥mo est√°s?")[39m
[33m    $ 'como estas?'[39m

[33m    normalize(" hahahaha HUY! Kamusta üòÖ Mayaman $$$ ka na ba?")[39m
[33m    $ ' hahahaha huy! kamusta  mayaman $$$ ka na ba?'[39m
[33m    """[39m
    normalized = unicodedata.normalize([33m"NFKD"[39m, text)
    ascii_text = normalized.encode([33m"ascii"[39m, [33m"ignore"[39m).decode([33m"ascii"[39m)

    [38;5;2

### **punctuations**

Punctuation adds little information to the sentiment of a message. `i hate you!` and `i hate you` carry the same sentiment (of course, the exclamation point accentuates the emotion in the message, but that intensity is irrelevant in a classification study). Hence, we defined `rem_punctuation` as seen below:

In [5]:
rem_punctuation??

[31mSignature:[39m rem_punctuation(text: str) -> str
[31mSource:[39m   
[38;5;28;01mdef[39;00m rem_punctuation(text: str) -> str:
    [33m"""[39m
[33m    Removes the punctuations. This function simply replaces all punctuation marks and special characters[39m
[33m    to the empty string. Hence, for symbols enclosed by whitespace, the whitespace are not collapsed to a single whitespace[39m
[33m    (for more information, see the examples).[39m

[33m    Do not use this function alone, use `clean_and_tokenize()`.[39m

[33m    # Parameters[39m
[33m    * text: String entry.[39m

[33m    # Returns[39m
[33m    Text with the punctuation removed.[39m

[33m    # Examples[39m
[33m    rem_punctuation("this word $$ has two spaces after it!")[39m
[33m    $ 'this word  has two spaces after it'[39m

[33m    rem_punctuation("these!words@have$no%space")[39m
[33m    $ 'thesewordshavenospace'[39m
[33m    """[39m
    [38;5;28;01mreturn[39;00m re.sub([33mf"[{re.escape(st
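Because the output above is truncated mid-expression, here is a hedged sketch of one common way to delete punctuation with a regular expression. The name `rem_punctuation_sketch` is hypothetical, and using `string.punctuation` as the symbol set is an assumption; the project's real function may cover a different set.

```python
import re
import string

# Character class built from every ASCII punctuation mark (assumed symbol set).
_PUNCT_RE = re.compile(f"[{re.escape(string.punctuation)}]")


def rem_punctuation_sketch(text: str) -> str:
    """Hypothetical stand-in: delete punctuation, leave whitespace untouched."""
    return _PUNCT_RE.sub("", text)


print(repr(rem_punctuation_sketch("this word $$ has two spaces after it!")))
# 'this word  has two spaces after it'
```

The double space left behind where `$$` used to be is why whitespace is collapsed in a later step.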

### **numbers**
Similar to punctuation, numbers do not add information to the sentiment of a message. Hence, we defined `rem_numbers` as seen below:

In [6]:
rem_numbers??

[31mSignature:[39m rem_numbers(text: str) -> str
[31mSource:[39m   
[38;5;28;01mdef[39;00m rem_numbers(text: str) -> str:
    [33m"""[39m
[33m    Removes numbers. This function simply replaces all numerical symbols to the empty string. Hence, for symbols enclosed by[39m
[33m    whitespace, the whitespace are not collapsed to a single whitespace (for more information, see the examples).[39m

[33m    Do not use this function alone, use `clean_and_tokenize()`.[39m

[33m    # Parameters[39m
[33m    * text: String entry.[39m

[33m    # Returns[39m
[33m    Text with the numerical symbol removed[39m

[33m    # Examples[39m
[33m    rem_numbers(" h3llo, k4must4 k4  n4?")[39m
[33m    ' hllo, kmust k  n?'[39m
[33m    """[39m
    [38;5;28;01mreturn[39;00m re.sub([33mr"\d+"[39m, [33m""[39m, text)
[31mFile:[39m      ~/STINTSY-Order-of-Erin/lib/janitor.py
[31mType:[39m      function

### **whitespace**
Finally, `collapse_whitespace` collapses every run of space characters into a single space and strips leading and trailing spaces. Formally, it is a transducer

$$
\Box^+ \mapsto \Box \qquad \text{where the space character is } \Box
$$

Informally, it replaces every run of consecutive spaces with a single space character.

In [7]:
collapse_whitespace??

[31mSignature:[39m collapse_whitespace(text: str) -> str
[31mSource:[39m   
[38;5;28;01mdef[39;00m collapse_whitespace(text: str) -> str:
    [33m"""[39m
[33m    This collapses whitespace. Here, collapsing means the transduction of all whitespace strings of any[39m
[33m    length to a whitespace string of unit length (e.g., "   " -> " "; formally " "+ -> " ").[39m

[33m    Do not use this function alone, use `clean_and_tokenize()`.[39m

[33m    # Parameters[39m
[33m    * text: String entry.[39m

[33m    # Returns[39m
[33m    Text with the whitespaces collapsed.[39m

[33m    # Examples[39m
[33m    collapse_whitespace("  huh,  was.  that!!! ")[39m
[33m    $ 'huh, was. that!!!'[39m
[33m    """[39m
    [38;5;28;01mreturn[39;00m re.sub([33m" +"[39m, [33m" "[39m, text).strip()
[31mFile:[39m      ~/STINTSY-Order-of-Erin/lib/janitor.py
[31mType:[39m      function

To call all these cleaning functions seamlessly, we have the `clean` function, a wrapper that calls these separate components. Its definition is quite long; see [this appendix](#appendix:-clean-wrapper-function-definition).
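To illustrate why the order of the components matters, here is a small sketch that chains simplified steps. All names in it (`drop_punctuation`, `drop_digits`, `squeeze_spaces`, `run_steps`) are hypothetical stand-ins, not the actual `clean` wrapper, and the sample string is made up for the demonstration.

```python
import re
import string
from functools import reduce
from typing import Callable

Step = Callable[[str], str]


# Hypothetical, simplified stand-ins for the janitor helpers.
def drop_punctuation(text: str) -> str:
    return re.sub(f"[{re.escape(string.punctuation)}]", "", text)


def drop_digits(text: str) -> str:
    return re.sub(r"\d+", "", text)


def squeeze_spaces(text: str) -> str:
    return re.sub(" +", " ", text).strip()


def run_steps(text: str, steps: list[Step]) -> str:
    """Apply each step to the output of the previous one."""
    return reduce(lambda acc, step: step(acc), steps, text)


raw = "rs15 lakh !!! for ri8 answer"
good = run_steps(raw, [drop_punctuation, drop_digits, squeeze_spaces])
bad = run_steps(raw, [squeeze_spaces, drop_punctuation, drop_digits])
print(repr(good))  # 'rs lakh for ri answer'
print(repr(bad))   # 'rs lakh  for ri answer'
```

Collapsing spaces last cleans up the gaps that the removal steps leave behind; collapsing first leaves a double space in the result.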

We can now clean the dataset and store the result in a new column named `clean_ours` (to differentiate it from the still-dirty `clean_text` column provided by the dataset author).

In [8]:
df["clean_ours"] = df["clean_text"].map(clean)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162969 entries, 0 to 162979
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   clean_text  162969 non-null  object 
 1   category    162969 non-null  float64
 2   clean_ours  162969 non-null  object 
dtypes: float64(1), object(2)
memory usage: 5.0+ MB


### **spam, expressions, onomatopoeia, etc**

Since the domain of the corpus is Twitter, spam (e.g., `bbbb`), expressions (e.g., `bruhhhh`), and onomatopoeia (e.g., `hahahaha`) may become an issue at the vector representation step. Hence, we employed a simple rule-based spam removal algorithm.

We remove words that contain the same character three or more times in a row, or that repeat the same substring two or more times in a row. These rules are expressed as regular expressions:

$$
\text{same\_char\_thrice} := (.)\textbackslash1^{\{2,\}}
$$

and

$$
\text{same\_substring\_twice} := (.^+)\textbackslash1^+
$$

Furthermore, we also remove any word shorter than three characters, since these are either stopwords (that weren't detected in the stopword removal stage) or more spam.

Finally, we apply an adaptive character diversity threshold to the string $s$, flagging it as spam when

$$
\frac{\texttt{\#\_unique\_chars}(s)}{|s|} < 0.3 + \left(\frac{0.1 \cdot \text{min}(|s|, 10)}{10}\right)
$$

This measures the diversity of characters in a string: if the string repeats the same characters a lot, we expect it to be unintelligible or useless, so we remove it.

These rules are applied by the `find_spam_and_empty` wrapper function. Its definition is quite long; see [this appendix](#appendix:-find_spam_and_empty-wrapper-function-definition).
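As a rough illustration of the rules described above, the sketch below applies them token by token. It is not the project's `find_spam_and_empty`; the names are hypothetical, and two details are assumptions: the repeated-substring rule is anchored to the whole token, and an entry with no surviving tokens becomes `None`.

```python
import re
from typing import Optional

SAME_CHAR_THRICE = re.compile(r"(.)\1{2,}")  # e.g. "bbbb", "bruhhhh"
REPEATED_SUBSTRING = re.compile(r"(.+)\1+")  # e.g. "hahahaha"


def low_diversity(token: str) -> bool:
    """True when unique/total characters falls below the adaptive threshold."""
    threshold = 0.3 + 0.1 * min(len(token), 10) / 10
    return len(set(token)) / len(token) < threshold


def is_spam_token(token: str) -> bool:
    return (
        len(token) < 3
        or SAME_CHAR_THRICE.search(token) is not None
        # Assumption: the token must consist entirely of the repeated substring.
        or REPEATED_SUBSTRING.fullmatch(token) is not None
        or low_diversity(token)
    )


def drop_spam_tokens(text: str) -> Optional[str]:
    """Return the text without spam tokens, or None if nothing is left."""
    kept = [tok for tok in text.split() if not is_spam_token(tok)]
    return " ".join(kept) if kept else None


print(drop_spam_tokens("hahahaha modi is ok bruhhhh"))  # modi
```

With this formula the threshold grows from 0.33 for three-character tokens to 0.4 for tokens of ten or more characters, so longer tokens need proportionally more distinct characters to survive.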

Let's first look at a random sample of 10 entries in the dataset before applying this spam removal step.

In [9]:
df.sample(10)

Unnamed: 0,clean_text,category,clean_ours
59241,upaera isro chief saraswat reveals then government didnt give resources permission complete mission shakti credits modinsa doval,1.0,upaera isro chief saraswat reveals then government didnt give resources permission complete mission shakti credits modinsa doval
55393,can find flaw everything done modi will keep finding the flaw everything which modi will congress chatora,0.0,can find flaw everything done modi will keep finding the flaw everything which modi will congress chatora
159267,yes what failed years modi did years ultimate destruction leading ultimate annihilation north south divide religious extremism hindutva brigade lynchings rapes dangerous country for foreigners corruption scams much more hidden coming soon,-1.0,yes what failed years modi did years ultimate destruction leading ultimate annihilation north south divide religious extremism hindutva brigade lynchings rapes dangerous country for foreigners corruption scams much more hidden coming soon
161021,who guess can nmodi modi who ever gives ri8 anwer will get rs15 lakh,0.0,who guess can nmodi modi who ever gives ri anwer will get rs lakh
107062,they are several terrorist party leaders who work only for self intrest but the otherside there narender modi who work only for national intrest jai hind jai bjp modi hai mumkin hai,0.0,they are several terrorist party leaders who work only for self intrest but the otherside there narender modi who work only for national intrest jai hind jai bjp modi hai mumkin hai
82819,lol gross violation modi holds the ministry space any technology that why announced this achievement modi haters,1.0,lol gross violation modi holds the ministry space any technology that why announced this achievement modi haters
160921,modi protects the rich rahul gandhi,1.0,modi protects the rich rahul gandhi
3278,who pays for modis propaganda who pays for bank write offs who pays for npas you get worried only when poor people get something,-1.0,who pays for modis propaganda who pays for bank write offs who pays for npas you get worried only when poor people get something
45105,challenge that rahul cannot put any factory its simply not his capacity nor has any vision only freebies can promise from the treasury which full due good governance modi,1.0,challenge that rahul cannot put any factory its simply not his capacity nor has any vision only freebies can promise from the treasury which full due good governance modi
77426,‘congress aane phir batayenge tum logo ’ placard activist threatened for ‘modi once more’ sticker car via,1.0,congress aane phir batayenge tum logo placard activist threatened for modi once more sticker car via


Let's now call this function on the `clean_ours` column of the dataset.

In [10]:
df["clean_ours"] = df["clean_ours"].map(find_spam_and_empty)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162969 entries, 0 to 162979
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   clean_text  162969 non-null  object 
 1   category    162969 non-null  float64
 2   clean_ours  162942 non-null  object 
dtypes: float64(1), object(2)
memory usage: 5.0+ MB


Now let's look at another random sample of 10 entries in the dataset after the spam removal step.

In [11]:
df.sample(10)

Unnamed: 0,clean_text,category,clean_ours
70767,file away your hindu hate file google golwalker indian modi calls him guru worthy worship also wiki kathua rape 8yo muslim girl modis ministers sided with rapists fyi put modi novisa list for genocide till became,-1.0,file away your hindu hate file google golwalker indian modi calls him guru worthy worship also wiki kathua rape muslim girl modis ministers sided with rapists fyi put modi novisa list for genocide till became
63881,makes india powerful but some people doesnt like and running this india should not repeat the mistake they did removing atal vote for modi\n,1.0,makes india powerful but some people doesnt like and running this india should not repeat the mistake they did removing atal vote for modi
3921,modi biopic controversy shabana azmi accuses filmmakers using her husband’ name intentionally,0.0,modi biopic controversy shabana azmi accuses filmmakers using her husband name intentionally
76107,only modi gang can take bhakts seriously sometimes inka mangal grah jana desh liye bahut mangalmay hoga,-1.0,only modi gang can take bhakts seriously sometimes inka mangal grah jana desh liye bahut mangalmay hoga
27801,account was blocked said posted this screenshot which did not happened others who are critics can you please help,0.0,account was blocked said posted this screenshot which did not happened others who are critics can you please help
61325,very much openly violation model code conduct modi india sent notice,0.0,very much openly violation model code conduct modi india sent notice
130876,kya modi modi modi this india election and fattu only selling fake nationalism vision for youth farmers health sectors education sector busy doing main main lets see how long media can save this fattu,-1.0,kya modi modi modi this india election and fattu only selling fake nationalism vision for youth farmers health sectors education sector busy doing main main lets see how long media can save this fattu
42017,guess modi waiting for the nuclear bomb hit pak starts his address the nation,0.0,guess modi waiting for the nuclear bomb hit pak starts his address the nation
97833,its like modi showing dekho chutia,0.0,its like modi showing dekho chutia
139421,does his categorisation cover bjp many states kejriwal may have more interesting things say about delhimodi needs correct his abject disdain for opposition its the other side coin called democracy escaping from thatthey will not speak like his party,1.0,does his categorisation cover bjp many states kejriwal may have more interesting things say about delhimodi needs correct his abject disdain for opposition its the other side coin called democracy escaping from thatthey will not speak like his party


## **post-cleaning steps**

At some point during the cleaning stage, some entries of the dataset could have been reduced to `NaN` or the empty string `""`, or we could have introduced duplicates again. So, let's call `dropna` and `drop_duplicates` again.

In [12]:
df = df.dropna()
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162942 entries, 0 to 162979
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   clean_text  162942 non-null  object 
 1   category    162942 non-null  float64
 2   clean_ours  162942 non-null  object 
dtypes: float64(1), object(2)
memory usage: 5.0+ MB


In [13]:
df = df.drop_duplicates()
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162942 entries, 0 to 162979
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   clean_text  162942 non-null  object 
 1   category    162942 non-null  float64
 2   clean_ours  162942 non-null  object 
dtypes: float64(1), object(2)
memory usage: 5.0+ MB


## **3 preprocessing**

> 🏗️ Perhaps swap S3 and S4. Refer to literature on what comes first.

This section discusses preprocessing steps for the cleaned data. Before and after each preprocessing step, we show 10 random entries in the dataset to illustrate the effects of each preprocessing task.

## **lemmatization**

We follow a preprocessing methodology similar to the one presented in <u>(George & Murugesan, 2024)</u>. We preprocess the dataset entries via lemmatization, using NLTK's `WordNetLemmatizer` <u>(Bird & Loper, 2004)</u>. For this step, we use WordNet for English lemmatization and Open Multilingual WordNet version 1.4 for translations and multilingual support, which is important in our case since some tweets contain text from Indian languages.

In [14]:
df.sample(10)

Unnamed: 0,clean_text,category,clean_ours
143260,glad that this one not playing nationalism card inn toh bass chakar chakar karwa hamare kahani wale baba hai modi,1.0,glad that this one not playing nationalism card inn toh bass chakar chakar karwa hamare kahani wale hai modi
74441,was not critiquing modi all suspect even modi himself may not approve this fawning language,0.0,was not critiquing modi all suspect even modi himself may not approve this fawning language
30913,rahul fight elections wayanad kerala look who celebrating wayanad waving pakistan flags now you know why congress selected this constituency,0.0,rahul fight elections wayanad kerala look who celebrating wayanad waving pakistan flags now you know why congress selected this constituency
8660,hai modi minister oho she another chor hai,0.0,hai modi minister oho she another chor hai
92811,dont you watch your fav channelsuch republic where they show how modi travels different countriesthose countries are this planet well for your information shocked well maybe you shud try educate yourself more rather than commenting others tweets like bug,-1.0,dont you watch your fav channelsuch republic where they show how modi travels different countriesthose countries are this planet well for your information shocked well maybe you shud try educate yourself more rather than commenting others tweets like bug
109109,should renamed immoral modis code misconduct,0.0,should renamed immoral modis code misconduct
60166,modi will accompany them,0.0,modi will accompany them
3187,differences has clear see condition roads naamdar sonia gandhis raebareli roads modis varanasi ाौीा,1.0,differences has clear see condition roads naamdar sonia gandhis raebareli roads modis varanasi
138012,dont try single out malya modi during last 70yrs free india thousands such malyasmodies shared the moneymaking tricks with politicians enjoyed royal life this our faulty system that creates malyas modiescorrect the system first,1.0,dont try single out malya modi during last yrs free india thousands such malyasmodies shared the moneymaking tricks with politicians enjoyed royal life this our faulty system that creates malyas modiescorrect the system first
70156,immensely proud the enhanced status our country major thanks modi and his team jai hind\n,1.0,immensely proud the enhanced status our country major thanks modi and his team jai hind


In [15]:
df["lemmatized"] = df["clean_ours"].map(lemmatizer)
df.sample(10)

Unnamed: 0,clean_text,category,clean_ours,lemmatized
145691,india isnt ruled any kings queens although bhakts like you would like modi anointed king\nhes talking about the hate campaign\nthis family has lost family members duty one khalistan terrorists and another ltte\nhes asking for balance,-1.0,india isnt ruled any kings queens although bhakts like you would like modi anointed king hes talking about the hate campaign this family has lost family members duty one khalistan terrorists and another ltte hes asking for balance,india isnt ruled any king queen although bhakts like you would like modi anointed king he talking about the hate campaign this family ha lost family member duty one khalistan terrorist and another ltte he asking for balance
64723,what modi has done add women empowerment bank account and govt benefits transferred directly account generic drugshealth insurancechild educationincentive farmers manuremake india etc,1.0,what modi has done add women empowerment bank account and govt benefits transferred directly account generic drugshealth insurancechild educationincentive farmers manuremake india etc,what modi ha done add woman empowerment bank account and govt benefit transferred directly account generic drugshealth insurancechild educationincentive farmer manuremake india etc
16024,not modi who dilip dsouza here and here,0.0,not modi who dilip dsouza here and here,not modi who dilip dsouza here and here
136509,chowkidar shaib dont speak lieplzzzz now stop your nonsense modi style talk,0.0,chowkidar shaib dont speak now stop your nonsense modi style talk,chowkidar shaib dont speak now stop your nonsense modi style talk
162919,are you remember 2014 modi also why tag lines sayad aap education and unemployment health issues baat krti acha lgta,0.0,are you remember modi also why tag lines sayad education and unemployment health issues baat krti acha lgta,are you remember modi also why tag line sayad education and unemployment health issue baat krti acha lgta
78314,the same accident may happened when crossing the modis convoy simply show the hands him pass away even modi helped the same guy like rahul did thsts all all the medias debating this issue for the days that too their primetime\nfull page paper ads,1.0,the same accident may happened when crossing the modis convoy simply show the hands him pass away even modi helped the same guy like rahul did thsts all all the medias debating this issue for the days that too their primetime full page paper ads,the same accident may happened when crossing the modis convoy simply show the hand him pas away even modi helped the same guy like rahul did thsts all all the medias debating this issue for the day that too their primetime full page paper ad
147521,modi hate and corrupt,-1.0,modi hate and corrupt,modi hate and corrupt
116124,rahul gandhi says haryana’ karnal that modi refers voters “mitron” friends but businessmen anil ambani and mehul choksi “bhai” brother “they take money from friends and give brothers” gandhi alleges according ani,0.0,rahul gandhi says haryana karnal that modi refers voters mitron friends but businessmen anil ambani and mehul choksi bhai brother they take money from friends and give brothers gandhi alleges according ani,rahul gandhi say haryana karnal that modi refers voter mitron friend but businessmen anil ambani and mehul choksi bhai brother they take money from friend and give brother gandhi alleges according ani
17332,there genius congress partymostly congress leaders are mentally retarded brainless whose mind full hatred with modi and love for muslims terrorists and pakistansuch party must buried soon possible,1.0,there genius congress partymostly congress leaders are mentally retarded brainless whose mind full hatred with modi and love for muslims terrorists and pakistansuch party must buried soon possible,there genius congress partymostly congress leader are mentally retarded brainless whose mind full hatred with modi and love for muslim terrorist and pakistansuch party must buried soon possible
94008,stop policing food issues its choice what eat you cant dictate attack his journalism whos really pathetic but for god sake dont dictate what one can eat not you are not helping modi but are irritating supporters who want real development,-1.0,stop policing food issues its choice what eat you cant dictate attack his journalism whos really pathetic but for god sake dont dictate what one can eat not you are not helping modi but are irritating supporters who want real development,stop policing food issue it choice what eat you cant dictate attack his journalism who really pathetic but for god sake dont dictate what one can eat not you are not helping modi but are irritating supporter who want real development


## **stop word removal**

After lemmatization, we can remove the stop words present in the dataset. Stop word removal _needs_ to come after lemmatization, since this step requires all words to be reduced to their base dictionary form, and `stopwords_set` only contains the base dictionary forms of the stop words.

**stopwords.** For stop word removal, we refer to the English stopword lists defined in NLTK and Wolfram Mathematica <u>(Bird & Loper, 2004; Wolfram Research, 2015)</u>. However, since the task is sentiment analysis, words that convey polarity, intensification, and negation are important. Words like "not" and "okay" are commonly included as stopwords. Therefore, the stopword lists from NLTK and Mathematica are manually adjusted to include only stopwords that are neutral in sentiment; examples are "after", "when", and "you".
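The filtering idea can be sketched as follows. The stop list here is a made-up miniature, not the project's `stopwords_set`, and `rem_stopwords_sketch` is a hypothetical name; returning `None` for an entry with no remaining tokens is also an assumption. Whether a given word survives in the real pipeline depends on the manually adjusted list in the project's `lib` directory.

```python
from typing import Optional

# Hypothetical mini stop list; the project's real list is `stopwords_set`.
NEUTRAL_STOPWORDS = {"after", "when", "you", "the", "and"}


def rem_stopwords_sketch(text: str, stopwords: set[str]) -> Optional[str]:
    """Drop tokens found in `stopwords`; return None when nothing remains."""
    kept = [tok for tok in text.split() if tok not in stopwords]
    return " ".join(kept) if kept else None


print(rem_stopwords_sketch("you are not okay after the match", NEUTRAL_STOPWORDS))
# are not okay match
```

Because this toy list leaves out polarity and negation words, "not" and "okay" stay in the output while the neutral function words are dropped.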

In [16]:
df.sample(10)

Unnamed: 0,clean_text,category,clean_ours,lemmatized
73128,only more thing need modi again,1.0,only more thing need modi again,only more thing need modi again
134912,tweeted this 24th march and march 25th surya was given prestigious seat isnt narendra modi listening young voices,1.0,tweeted this march and march surya was given prestigious seat isnt narendra modi listening young voices,tweeted this march and march surya wa given prestigious seat isnt narendra modi listening young voice
143882,modi positive only for himselfnegative for others,1.0,modi positive only for himselfnegative for others,modi positive only for himselfnegative for others
54723,frustrated taklu unkil shivering modis skyrocketing popularity the last miles our nation,-1.0,frustrated taklu unkil shivering modis skyrocketing popularity the last miles our nation,frustrated taklu unkil shivering modis skyrocketing popularity the last mile our nation
63374,vote modi again and again,0.0,vote modi again and again,vote modi again and again
35521,ohh come economist and can work for any country the world not hungry for politics like modi and shah,0.0,ohh come economist and can work for any country the world not hungry for politics like modi and shah,ohh come economist and can work for any country the world not hungry for politics like modi and shah
99176,dear you have given pakistan befitting reply after uri pulwama kindly don’ exaggerate the issue any further seek votes plank development see neither ‘sabka saath’ nor ‘sabka vikas’,1.0,dear you have given pakistan befitting reply after uri pulwama kindly don exaggerate the issue any further seek votes plank development see neither sabka saath nor sabka vikas,dear you have given pakistan befitting reply after uri pulwama kindly don exaggerate the issue any further seek vote plank development see neither sabka saath nor sabka vikas
147793,another masterstroke modi,0.0,another masterstroke modi,another masterstroke modi
131504,plight tea tribes can only understood tea seller like,0.0,plight tea tribes can only understood tea seller like,plight tea tribe can only understood tea seller like
113373,shame and modi wanted,0.0,shame and modi wanted,shame and modi wanted


In [17]:
df["lemmatized"] = df["lemmatized"].map(lambda t: rem_stopwords(t, stopwords_set))
df = df.dropna(subset=["lemmatized"])
df.sample(10)

Unnamed: 0,clean_text,category,clean_ours,lemmatized
48407,congratulations team and modi government,0.0,congratulations team and modi government,congratulation team modi government
87115,blind modi hater,-1.0,blind modi hater,blind modi hater
45076,surgical strike\nairstrike\nspacestrike strike like modi\n,0.0,surgical strike airstrike spacestrike strike like modi,surgical strike airstrike spacestrike strike like modi
132379,arunachal modi quips congress concerned about malai not bhalai,0.0,arunachal modi quips congress concerned about malai not bhalai,arunachal modi quip congress concerned about malai bhalai
58773,not modi your time has come italy,0.0,not modi your time has come italy,modi time ha italy
147597,its admission that modi has economically performed well that you can plan for such big schemes,1.0,its admission that modi has economically performed well that you can plan for such big schemes,admission modi ha economically performed plan such big scheme
64461,scientists have announced then credit would have given them chowkidar modi credit kaise milta fir,0.0,scientists have announced then credit would have given them chowkidar modi credit kaise milta fir,scientist announced credit chowkidar modi credit kaise milta fir
55469,mamta baji should also complain icj against modi shoud complain court people india who will give appropriate answer,1.0,mamta baji should also complain icj against modi shoud complain court people india who will give appropriate answer,mamta baji complain icj modi shoud complain court people india appropriate answer
80204,vows his unborn kids head that does not work for can assure you that there connection with kejri here,0.0,vows his unborn kids head that does not work for can assure you that there connection with kejri here,vow unborn kid head doe work assure connection kejri
80923,eds changing slug one side there strong chowkidar the other line tainted people modi meerut,1.0,eds changing slug one side there strong chowkidar the other line tainted people modi meerut,changing slug strong chowkidar tainted people modi meerut


## **looking at the DataFrame**

After preprocessing, the dataset now contains:

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162942 entries, 0 to 162979
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   clean_text  162942 non-null  object 
 1   category    162942 non-null  float64
 2   clean_ours  162942 non-null  object 
 3   lemmatized  162942 non-null  object 
dtypes: float64(1), object(3)
memory usage: 6.2+ MB


Here are 5 randomly picked entries in the DataFrame, with all columns shown for comparison.

In [19]:
display(df.sample(5))

Unnamed: 0,clean_text,category,clean_ours,lemmatized
22414,lok sabha polls kalyan for modi again rajasthan ashok gehlot points governor duty the indian express,0.0,lok sabha polls kalyan for modi again rajasthan ashok gehlot points governor duty the indian express,lok sabha poll kalyan modi rajasthan ashok gehlot point governor duty indian express
113444,sir you have been standing with congress parties leader and have said your fous dilouge khamos lier and cheater our country the bloody rascle narendra modi,-1.0,sir you have been standing with congress parties leader and have said your fous dilouge khamos lier and cheater our country the bloody rascle narendra modi,sir standing congress party leader fous dilouge khamos lier cheater country bloody rascle narendra modi
99633,sister from past years searching for truth but not what about bringing black money 100 day again you may say facts sis better keep your facts with you and vote modi not supporter any party modi wins will not run house have,1.0,sister from past years searching for truth but not what about bringing black money day again you may say facts sis better keep your facts with you and vote modi not supporter any party modi wins will not run house have,sister year searching truth about bringing black money day fact better fact vote modi supporter party modi win house
74692,evms eci all are hacked why attack pak why launch missiles blow satellites modi planning get 100 voteshare,0.0,evms eci all are hacked why attack pak why launch missiles blow satellites modi planning get voteshare,evms eci all hacked attack pak launch missile blow satellite modi planning voteshare
41659,modi will declare aap inc alliance,0.0,modi will declare inc alliance,modi declare alliance


## **tokenization** 

Since the data cleaning and preprocessing stage is comprehensive, the tokenization step in the BoW model reduces to a simple word-boundary split operation. Each preprocessed entry in the DataFrame is split by spaces. For example, the entry `"shri narendra modis"` (entry: 42052) becomes `["shri", "narendra", "modis"]`. By the end of tokenization, all entries are transformed into arrays of strings.
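
As a minimal sketch of this step (the helper name `tokenize` is hypothetical and is not taken from the project's code), splitting a preprocessed entry could look like this:

```python
def tokenize(text: str) -> list[str]:
    """Split a cleaned entry into tokens on whitespace.

    str.split() with no argument splits on any run of whitespace, which
    matches a plain split on spaces once whitespace has been collapsed.
    """
    return text.split()


tokens = tokenize("shri narendra modis")
print(tokens)  # ['shri', 'narendra', 'modis']
```

Because the cleaning stage already removes punctuation and numbers, no regular expression is needed at this point.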

## **word bigrams** 

As noted earlier, modifiers and polarity words are not included in the stopword set. The BoW model constructs a vocabulary containing both unigrams and bigrams. Including bigrams allows the model to capture common word patterns, such as  

$$
\left\langle \texttt{Adj}\right\rangle \left\langle \texttt{M} \mid \texttt{Pron} \right\rangle 
$$  

or  

$$
\left\langle \texttt{Adv}\right\rangle \left\langle \texttt{V} \mid \texttt{Adj} \mid \texttt{Adv} \right\rangle 
$$  
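
To make the idea of bigram features concrete, the sketch below builds the unigram and bigram terms for one tokenized entry. The helper name `with_bigrams` is hypothetical, and the sketch only illustrates the concept; it is not the project's implementation.

```python
def with_bigrams(tokens: list[str]) -> list[str]:
    """Return the unigrams followed by all adjacent-word bigrams."""
    # Pair each token with its right-hand neighbour.
    bigrams = [f"{left} {right}" for left, right in zip(tokens, tokens[1:])]
    return tokens + bigrams


terms = with_bigrams(["modi", "govt", "good", "work"])
print(terms)
# ['modi', 'govt', 'good', 'work', 'modi govt', 'govt good', 'good work']
```

A pair such as `good work` keeps the modifier next to the word it modifies, which a unigram-only vocabulary cannot express.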

## **vector representation**

After the stemming and lemmatization steps, each entry can now be represented as a vector using a Bag of Words (BoW) model. We employ scikit-learn's `CountVectorizer`, which provides a ready-to-use implementation of BoW <u>(Pedregosa et al., 2011)</u>.

A comparison of other traditional vector representations is discussed in [this appendix](#appendix:-comparison-of-traditional-vectorization-techniques).
Words with modifiers have the modifiers directly attached, enabling subsequent models to capture the concept of modification fully. Consequently, after tokenization and bigram construction, the vocabulary size can grow up to $O(n^2)$, where $n$ is the number of unique tokens.

**minimum document frequency constraint:** Despite cleaning and spam removal, some tokens remain irrelevant or too rare. To address this, a minimum document frequency constraint is applied: $\texttt{min\_df} = 10$, meaning a token must appear in at least 10 documents to be included in the BoW vocabulary. This reduces noise and ensures the model focuses on meaningful terms.
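
The effect of the constraint can be sketched in plain Python. This is an illustration of the document-frequency filter only, not scikit-learn's implementation; the function name `build_vocabulary` and the toy corpus are assumptions made for the example, and a threshold of 2 is used so the toy data stays small.

```python
from collections import Counter


def build_vocabulary(documents: list[list[str]], min_df: int) -> list[str]:
    """Keep the terms that occur in at least `min_df` documents."""
    doc_freq: Counter[str] = Counter()
    for terms in documents:
        # set() counts a term once per document, however often it repeats.
        doc_freq.update(set(terms))
    return sorted(term for term, df in doc_freq.items() if df >= min_df)


# Toy corpus: each document is a list of unigram and bigram terms.
docs = [
    ["modi", "govt", "modi govt"],
    ["modi", "govt", "modi govt", "good"],
    ["modi", "modi", "work"],
]
vocab = build_vocabulary(docs, min_df=2)
print(vocab)  # ['govt', 'modi', 'modi govt']
```

Note that `modi` appears twice in the third document but is still counted once there, since the constraint is about the number of documents and not the number of occurrences.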

---

These parameters of the BoW model are encapsulated in the `BagOfWordsModel` class. The class definition is available in [this appendix](#appendix:-BagOfWordsModel-class-definition).

In [20]:
bow = BagOfWordsModel(df["lemmatized"], 10)

# some sanity checks
assert bow.matrix.shape[0] == df.shape[0], "number of rows in the matrix DOES NOT match the number of documents"
assert bow.sparsity,                       "the sparsity is ZERO, something went wrong"



The warning above is expected. Recall that our tokenization step reduces to a simple split on spaces, so the `tokenizer` attribute of `BagOfWordsModel` is set to bypass the default tokenization pattern. Overriding that default is what triggers the warning.

### **model metrics**

To get an idea of the model, we will now look at its shape and sparsity.

The resulting matrix has a shape of

In [21]:
bow.matrix.shape

(162942, 30386)

The first entry of the pair is the number of documents (those that remain after all the data cleaning and preprocessing steps), and the second entry is the number of terms in the vocabulary (the unique unigrams and bigrams that satisfy the minimum document frequency constraint).

The resulting model has a sparsity of

In [22]:
bow.sparsity

0.0004960460127828437
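
The class definition in the appendix is cut off, so how `sparsity` is computed is not shown. A value this small suggests that it reports the fraction of non-zero cells (often called density) rather than the fraction of zero cells; this is an assumption. Under that reading, each document would have about 15 non-zero terms on average, since $0.000496 \times 30386 \approx 15.07$. The sketch below shows the calculation on a toy matrix, using a hypothetical `density` helper and a list of dictionaries in place of a SciPy sparse matrix.

```python
def density(rows: list[dict[int, int]], n_columns: int) -> float:
    """Fraction of non-zero cells in a document-term matrix.

    Each row maps a column index to its non-zero count, so the number of
    stored entries in a row equals its number of non-zero cells.
    """
    non_zero = sum(len(row) for row in rows)
    return non_zero / (len(rows) * n_columns)


# Toy matrix with 3 documents and 4 vocabulary terms.
rows = [{0: 2, 3: 1}, {1: 1}, {}]
d = density(rows, n_columns=4)
print(d)      # 0.25 (3 non-zero cells out of 12)
print(1 - d)  # 0.75 (fraction of zero cells)
```

For a SciPy sparse matrix, the same ratio can be obtained from its `nnz` attribute divided by the product of the two entries of `shape`.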

> 🏗️ perhaps discuss sparsity's relevance

Next, we look at the most frequent and least frequent terms in the model.

In [23]:
doc_frequencies = np.asarray((bow.matrix > 0).sum(axis=0)).flatten()
freq_order = np.argsort(doc_frequencies)[::-1]
bow.feature_names[freq_order[:50]]

array(['modi', 'india', 'ha', 'all', 'people', 'bjp', 'like', 'congress',
       'narendra', 'only', 'election', 'narendra modi', 'vote', 'govt',
       'about', 'indian', 'year', 'time', 'country', 'just', 'modis',
       'more', 'nation', 'rahul', 'even', 'government', 'party', 'power',
       'gandhi', 'minister', 'leader', 'good', 'modi govt', 'need',
       'modi ha', 'space', 'work', 'prime', 'money', 'credit', 'sir',
       'pakistan', 'back', 'day', 'today', 'prime minister', 'scientist',
       'never', 'support', 'win'], dtype=object)

We see that the main talking point of the Tweets is Indian politics, with keywords like "modi", "india", and "bjp". For additional context, "bjp" refers to the _Bharatiya Janata Party_, a conservative political party in India and one of the two major Indian political parties.

Now we look at the least frequent terms.

In [24]:
bow.feature_names[freq_order[-50:]]

array(['healthy democracy', 'ha mass', 'ha separate', 'ha shifted',
       'hat drdo', 'about defeat', 'yet ha', 'yes more', 'yes narendra',
       'hatred people', 'ha requested', 'hate more', 'hate much',
       'hatemonger', 'hater gonna', 'heal', 'hazaribagh', 'head drdo',
       'sleep night', 'abinandan', 'able provide', 'able speak',
       'able vote', 'youth need', 'youth power', 'hai isliye', 'hai chor',
       'handy', 'hand narendra', 'hand people', 'hae', 'ha withdrawn',
       'happens credit', 'happier', 'bhaiyo', 'socha', 'social political',
       'social security', 'biased journalist', 'big congratulation',
       'sirmodi', 'bhutan', 'bhi berozgar', 'bhi mumkin', 'skta',
       'bhatt aditi', 'bhi aur', 'slamming', 'smart modi', 'slogan blame'],
      dtype=object)

The themes seen in the most frequent terms are still present in this subset, although more filler or non-distinct terms appear, like "handy", "heal", and "happier".

Still, the presence of terms like "healthy democracy", "able vote", and "youth power" indicates that this subset remains relevant to the main theme of the dataset.

# **4 exploratory data analysis**

This section discusses the exploratory data analysis conducted on the dataset after cleaning.

> Notes from Zhean: <br>
> From manual checking via OpenRefine, there are a total of 162972 entries. `df.info()` should have the same result post-processing.
> Furthermore, there should be two columns. The first, `clean_text` (a bit of a misnomer, since the text is still dirty), contains the Tweets (text data). The second, `category`, contains the sentiment of the Tweet and is a tribool (1 for positive, 0 for neutral or indeterminate, and -1 for negative).

# **references**
Bird, S., & Loper, E. (2004, July). NLTK: The natural language toolkit. *Proceedings of the ACL Interactive Poster and Demonstration Sessions*, 214–217. https://aclanthology.org/P04-3031/

El-Demerdash, A. A., Hussein, S. E., & Zaki, J. F. W. (2021). Course evaluation based on deep learning and SSA hyperparameters optimization. *Computers, Materials & Continua, 71*(1), 941–959. https://doi.org/10.32604/cmc.2022.021839

George, M., & Murugesan, R. (2024). Improving sentiment analysis of financial news headlines using hybrid Word2Vec-TFIDF feature extraction technique. *Procedia Computer Science, 244*, 1–8.

Hussein, S. (2021). *Twitter sentiments dataset*. Mendeley.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. *Journal of Machine Learning Research, 12*, 2825–2830.

Rani, D., Kumar, R., & Chauhan, N. (2022, October). Study and comparison of vectorization techniques used in text classification. In *2022 13th International Conference on Computing Communication and Networking Technologies (ICCCNT)* (pp. 1–6). IEEE.

Wolfram Research. (2015). *DeleteStopwords*. https://reference.wolfram.com/language/ref/DeleteStopwords.html

# **appendix: `clean` wrapper function definition**
Below is the definition of the `clean` wrapper function that encapsulates all internal functions used in the cleaning pipeline.

In [26]:
clean??

Signature: clean(text: str) -> str
Source:   
def clean(text: str) -> str:
    """
    This is the main function for data cleaning (i.e., it calls all the cleaning functions in the prescribed order).

    This function should be used as a first-class function in a map.

    # Parameters
    * text: The string entry from a DataFrame column.
    * stopwords: stopword dictionary.

    # Returns
    Clean string
    """
    # cleaning on the base string
    text = normalize(text)
    text = rem_punctuation(text)
    text = rem_numbers(text)
    text = collapse_whitespace(text)

    return text
File:      ~/STINTSY-Order-of-Erin/lib/janitor.py
Type:      function

# **appendix: `find_spam_and_empty` wrapper function definition**
Below is the definition of the `find_spam_and_empty` wrapper function that encapsulates all internal functions for the spam detection algorithm.

In [28]:
find_spam_and_empty??

Signature: find_spam_and_empty(text: str, min_length: int = 3) -> str | None
Source:   
def find_spam_and_empty(text: str, min_length: int = 3) -> str | None:
    """
    Filter out empty text and unintelligible/spammy unintelligible substrings in the text.

    Spammy substrings:
    - Shorter than min_length
    - Containing non-alphabetic characters
    - Consisting of a repeated substring (e.g., 'aaaaaa', 'ababab', 'abcabcabc')

    # Parameters
    * text: input string.
    * min_length: minimum length of word to keep.

    # Returns
        Cleaned string, or None if empty after filtering.
    """
    cleaned_tokens = []
    for t in text.split():
        if len(t) < min_length:

# **appendix: comparison of traditional vectorization techniques**

Traditional vectorization techniques include BoW and Term Frequency-Inverse Document Frequency (TF-IDF). TF-IDF weights each word based on its frequency in a document and its rarity across the corpus, reducing the impact of common words. BoW, in contrast, simply counts word occurrences without considering corpus-level frequency. In this project, BoW was chosen because stopwords were already removed during preprocessing, and the dataset is domain-specific <u>(Rani et al., 2022)</u>. In such datasets, frequent words are often meaningful domain keywords, so scaling them down (as TF-IDF would) could reduce the importance of these key terms in the feature representation.
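
The difference can be seen on a toy corpus. The sketch below uses the textbook weighting $\text{tf} \times \log(N / \text{df})$, which is an assumption made for illustration; library implementations such as scikit-learn's `TfidfVectorizer` apply a smoothed variant by default, so their exact numbers differ. The function name `tf_idf` and the documents are hypothetical.

```python
import math
from collections import Counter

# Toy corpus of tokenized documents; "modi" appears in every document.
corpus = [
    ["modi", "govt", "good"],
    ["modi", "vote"],
    ["modi", "govt", "bad"],
]


def tf_idf(doc: list[str], corpus: list[list[str]]) -> dict[str, float]:
    """Textbook TF-IDF: term count times log(N / document frequency)."""
    n_docs = len(corpus)
    weights = {}
    for term, tf in Counter(doc).items():
        df = sum(term in other for other in corpus)
        weights[term] = tf * math.log(n_docs / df)
    return weights


bow_counts = Counter(corpus[0])        # BoW: raw counts only
weights = tf_idf(corpus[0], corpus)    # TF-IDF: counts scaled by rarity

print(bow_counts["modi"])  # 1
print(weights["modi"])     # 0.0, because "modi" occurs in all documents
print(weights["good"] > weights["govt"] > 0)  # True
```

Here the domain keyword `modi` keeps its count under BoW but is scaled down to zero under this TF-IDF variant, which is the behaviour the paragraph above argues against for a domain-specific dataset.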

# **appendix: `BagOfWordsModel` class definition**
Below is the definition of the `BagOfWordsModel` class that encapsulates the desired parameters.

In [25]:
BagOfWordsModel??

Init signature: BagOfWordsModel(texts: Iterable[str], min_freq: int | float | None = None)
Source:        
class BagOfWordsModel:
    """
    A Bag-of-Words representation for a text corpus.

    # Attributes
    * matrix (scipy.sparse.csr_matrix): The document-term matrix of word counts.
    * feature_names (list[str]): List of feature names corresponding to the matrix columns.
    *
    # Usage
    ```
    bow = BagOfWordsModel(df["lemmatized_str"])
    ```
    """

    def __init__(self, texts: Iterable[str], min_freq: int | float | None = None):
        """
        Initialize the BagOfWordsModel by fitting the vectorizer to the text corpus. This also filters out tokens
        that do not appear more than 