# Sentiment Analysis of Twitter Posts
<center><b>Notebook: Data Description, Cleaning, Exploratory Data Analysis, and Preprocessing</b></center>
<br>

**by**: Stephen Borja, Justin Ching, Erin Chua, and Zhean Ganituen.

**dataset**: Hussein, S. (2021). Twitter Sentiments Dataset [Dataset]. Mendeley. https://doi.org/10.17632/Z9ZW7NT5H2.1

**motivation**: Every minute, social media users generate a large influx of textual data on live events. Performing sentiment analysis on this data provides a real-time view of public perception, enabling quick insights into the general population's opinions and reactions.

**goal**: By the end of the project, our goal is to create and compare supervised learning algorithms for sentiment analysis.

### **dataset description**

The Twitter Sentiments Dataset contains nearly 163k tweets from Twitter. The period in which the tweets were collected is unknown, but the dataset was published to Mendeley Data on May 14, 2021 by Sherif Hussein of Mansoura University.

Tweets were extracted using the Twitter API, but how the tweets were selected is not documented. The tweets are mostly in English, mixed with some Hindi words through code-switching <u>(El-Demerdash, 2021)</u>. All of them seem to be talking about the political state of India. Most tweets mention Narendra Modi, the current Prime Minister of India.

Each tweet was assigned a label using TextBlob's sentiment analysis <u>(El-Demerdash, Hussein, & Zaki, 2021)</u>, which assigns labels automatically.

Columns in `Twitter_Data.csv`:
- **`clean_text`**: The tweet's text
- **`category`**: The tweet's sentiment category

What each row and column represents: `each row represents one tweet.` <br>
Number of observations: `162,980`

---

<a name="cite_note-1"></a>1. [^](#cite_ref-1) Code-switching is the practice of alternating between two languages $L_1$ (the native language) and $L_2$ (the source language) in a conversation. In this context, the code-switching is done to appear more casual since the conversation is done via Twitter (now, X). 

## **1. Project Set-up**
We set up the global imports for the project here (ensure that these packages are installed via uv and are part of the environment). We also load the dataset in this step.

In [254]:
import pandas as pd
import numpy as np
import os
import sys

# Use lib directory
sys.path.append(os.path.abspath("../lib"))

# Imports from lib files
from janitor import *
from lemmatize import lemmatizer
from boilerplate import stopwords_set
from bag_of_words import BagOfWordsModel

# Pandas configuration
pd.set_option("display.max_colwidth", None)

# Load raw data file
df = pd.read_csv("../data/Twitter_Data.csv")

## **2. Data Cleaning**
This section discusses the methodology for data cleaning.

To avoid wasting computation, we first ensure that no `NaN` or duplicate entries exist before running the cleaning steps. Every time we drop rows (with `dropna()` or `drop_duplicates()`), we show the result of `info()` to see how many entries were filtered out.

Let's first drop the `NaN` entries.

In [255]:
df = df.dropna()
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162969 entries, 0 to 162979
Data columns (total 2 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   clean_text  162969 non-null  object 
 1   category    162969 non-null  float64
dtypes: float64(1), object(1)
memory usage: 3.7+ MB


Now, remove the duplicates.

In [256]:
df = df.drop_duplicates()
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162969 entries, 0 to 162979
Data columns (total 2 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   clean_text  162969 non-null  object 
 1   category    162969 non-null  float64
dtypes: float64(1), object(1)
memory usage: 3.7+ MB


We also ensure that all the values in the `category` column belong to the set [-1, 0, 1], which represents the three sentiments: negative, neutral, and positive.

In [257]:
df['category'].unique()

array([-1.,  0.,  1.])

Then remove any values outside of the provided range to keep the data consistent.

In [258]:
df = df[df['category'].isin([-1, 0, 1])]
df['category'].sample(10)

129756   -1.0
159102    1.0
104282    0.0
91065     1.0
34390     1.0
61503     0.0
127467   -1.0
100487    0.0
42512     1.0
11929    -1.0
Name: category, dtype: float64

When pandas reads a CSV file into a DataFrame, it infers `float64` for a numeric column that contains decimals or `NaN` values. Text values are inferred and loaded as `object`, the generic dtype used for strings. We can check the dtypes of our DataFrame columns through `.info()`.
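To make the inference rule concrete, here is a small sketch on made-up CSV snippets (the rows below are hypothetical and are not taken from `Twitter_Data.csv`). It only illustrates that a single missing value is enough to turn an integer-looking column into `float64`, and that casting back to integers becomes possible once the missing rows are dropped.

```python
from io import StringIO

import pandas as pd

# Hypothetical CSV snippets: the first has a missing category, the second does not.
with_gap = pd.read_csv(StringIO("clean_text,category\nfoo,1\nbar,\nbaz,-1\n"))
no_gap = pd.read_csv(StringIO("clean_text,category\nfoo,1\nbaz,-1\n"))

print(with_gap["category"].dtype)  # float64, because NaN cannot be stored as an integer
print(no_gap["category"].dtype)    # int64

# After dropping the missing row, the cast to an integer dtype succeeds.
cleaned = with_gap.dropna().astype({"category": "int64"})
print(cleaned["category"].dtype)   # int64
```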

In [259]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162969 entries, 0 to 162979
Data columns (total 2 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   clean_text  162969 non-null  object 
 1   category    162969 non-null  float64
dtypes: float64(1), object(1)
memory usage: 3.7+ MB


First, we convert the `category` column from `float64` to `int64`. This is safe now that the `NaN` rows are dropped and only the values in [-1, 0, 1] remain.

In [260]:
df['category'] = df['category'].astype(int)
df['category'].info()

<class 'pandas.core.series.Series'>
Index: 162969 entries, 0 to 162979
Series name: category
Non-Null Count   Dtype
--------------   -----
162969 non-null  int64
dtypes: int64(1)
memory usage: 2.5 MB


In [261]:
df['category'].sample(10)

87818     1
85707    -1
120828   -1
105907    0
120843    1
125318    1
140007    1
30775    -1
31051     0
33845    -1
Name: category, dtype: int64

Next, we convert the `clean_text` column from the `object` type into the pandas-defined `string` type for consistency and better performance.

In [262]:
df['clean_text'] = df['clean_text'].astype('string')
df['clean_text'].info()

<class 'pandas.core.series.Series'>
Index: 162969 entries, 0 to 162979
Series name: clean_text
Non-Null Count   Dtype 
--------------   ----- 
162969 non-null  string
dtypes: string(1)
memory usage: 2.5 MB


In [263]:
type(df.loc[0, 'clean_text'])

str

## **Main Cleaning Pipeline**

We follow a similar methodology for data cleaning presented in (George & Murugesan, 2024). 

### **Normalization**

Because the text comes from tweets, we noticed that emojis and accented characters are prevalent, as seen in the samples below. Although these serve as a form of emotional expression in a real-world context, they are not relevant to _textual_ sentiment analysis, so we normalize the text.

In [264]:
# Finding a sample of rows with accented characters
accented_char_rows = df[df['clean_text'].str.contains(r'É|é|Á|á|ó|Ó|ú|Ú|í|Í')]
accented_char_rows['clean_text'].sample(5)

89813                                                                                                                                                                                                    just love the new conçept must watch thé video 
72727                                                                                                                                                                        modi hails indias arrival space power after shoots down satellite test vía 
86264                                                                                                                                                    there should some basic quality check such clichéd juvenile satire silly even for modi bashing 
124646    arnab vehemently ranted for weeks against sushma and vasundhara raje during lalit modi exposérahul was apparently unbiased anchor until and arun purie travelled together with amit shah during karnataka poll both now are panna pramukhs bjp

In [265]:
# Finding a sample of rows with emojis
rows_with_emojis = df[df['clean_text'].str.contains(r'[\u263a-\U0001f645]', regex=True)]
rows_with_emojis['clean_text'].sample(5)

121548      both seem stupid from there rhetoric speeches one says being anti modi anti india smh ‍♂️ and another one still communist 21st century that’sin itself idiotic ‍♂️
61554     modi has sensed somethingit was really desperate move himwas the nyay effect ा बदल ी े anyway kudos the drdo isro for the achievement hats off all the scientists✌️ 
121880                                                                         @ kmsharma yes this chokidar modi sir thief who theft heart all man world\n❤️❤️❤️❤️❤️❤️✌✌✌✌✌✌✋ 
82437                                                                                                                                 modi the great lion one roar enough ✌️✌️
54332                                                               amazing promosyon really sir very congratulations super performance summit our prayers with you sir⛤⛤⛤⛤⛤  
Name: clean_text, dtype: string

The first function, `normalize`, converts the input text to ASCII-only characters (say, "cómo estás" becomes "como estas") and lowercases alphabetic symbols. The dataset contains Unicode characters (e.g., emojis and accented characters): accented letters are reduced to their ASCII base letter, and characters with no ASCII equivalent, such as emojis, are replaced with the empty string (`''`).
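The idea can be sketched with the standard library alone. The helper name `normalize_sketch` is hypothetical, and the final lowercasing step is an assumption based on the description of `normalize`, not a copy of the project's code. NFKD decomposition splits an accented letter into a base letter plus a combining mark; encoding to ASCII with `errors="ignore"` then drops the mark, along with anything else that has no ASCII form.

```python
import unicodedata


def normalize_sketch(text: str) -> str:
    """Illustrative ASCII folding; not the project's `normalize` itself."""
    # Split characters such as "é" into "e" plus a combining accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop every code point that cannot be encoded as ASCII (accents, emojis).
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Assumed final step: lowercase the remaining letters.
    return ascii_only.lower()


print(normalize_sketch("¿Cómo estás?"))  # como estas?
```

Note that removed characters leave their surrounding spaces behind, which is why a later whitespace step is still needed.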

In [266]:
normalize??

Signature: normalize(text: str) -> str
Source:   
def normalize(text: str) -> str:
    """
    Normalize text from a pandas entry to ASCII-only lowercase characters. Hence, this removes Unicode characters with no ASCII
    equivalent (e.g., emojis and CJKs).

    Do not use this function alone, use `clean_and_tokenize()`.

    # Parameters
    * text: String entry.

    # Returns
    ASCII-normalized text containing only lowercase letters.

    # Examples
    normalize("¿Cómo estás?")
    $ 'como estas?'

    normalize(" hahahaha HUY! Kamusta 😅 Mayaman $$$ ka na ba?")
    $ ' hahahaha huy! kamusta  mayaman $$$ ka na ba?'
    """
    normalized = unicodedata.normalize("NFKD", text)
    ascii_text = normalized.encode("ascii", "ignore").decode("ascii")

### **Punctuations**
Punctuation is part of natural speech and writing: it gives sentences structure, clarity, and tone. In the context of a classification study, however, punctuation adds little information about the sentiment of a message. The sentiment of `i hate you!` and `i hate you` is the same, even though the punctuation mark `!` accentuates it. We can see a sample of rows with punctuation below.

In [267]:
# Finding a sample of rows with punctuation
rows_with_punc = df[df['clean_text'].str.contains(r'[^\w\s]')]
rows_with_punc['clean_text'].sample(5)

113455                                                     like congress claims because sowed the seeds happened all these loooters too sowed their seeds under government when comes obviously has come robbers aren’ going wait 
43529                                                                                              heres how ajay devgn kriti sanon rajkummar rao madhavan replied narendra modi’ ‘vote kar’ election 2019 campaign via namo app\n
128898    had manmohan singhs govt given clearance drdo े could have launched this 2014 2015 but singh hardly took any decision was the present govt modi which gave clearance and see the result for that modi govt must credited
26163       modi had made advani the president india then you guys would have started hyperventilating about babri masjid seat gandhinagar given his daughter then dynast would been complain you hate modiji’ guts sadhavi effect
71902                                                                           

The function `rem_punctuation` replaces all punctuation marks and special characters with the empty string (`''`).
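A minimal sketch of this step is shown here. The helper name `rem_punctuation_sketch` is hypothetical, and building the character class from `string.punctuation` is an assumption, since the full pattern used by the project is not visible in this notebook. `string.punctuation` covers ASCII punctuation only, which is sufficient if the text has already been reduced to ASCII.

```python
import re
import string

# Character class matching any single ASCII punctuation mark.
# re.escape keeps characters such as "]" and "\" from breaking the class.
PUNCT_RE = re.compile(f"[{re.escape(string.punctuation)}]")


def rem_punctuation_sketch(text: str) -> str:
    """Illustrative punctuation removal; whitespace is left untouched."""
    return PUNCT_RE.sub("", text)


print(rem_punctuation_sketch("these!words@have$no%space"))  # thesewordshavenospace
```

Because the marks are deleted rather than replaced with a space, a symbol surrounded by spaces leaves two adjacent spaces behind.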

In [268]:
rem_punctuation??

Signature: rem_punctuation(text: str) -> str
Source:   
def rem_punctuation(text: str) -> str:
    """
    Removes the punctuations. This function simply replaces all punctuation marks and special characters
    to the empty string. Hence, for symbols enclosed by whitespace, the whitespace are not collapsed to a single whitespace
    (for more information, see the examples).

    Do not use this function alone, use `clean_and_tokenize()`.

    # Parameters
    * text: String entry.

    # Returns
    Text with the punctuation removed.

    # Examples
    rem_punctuation("this word $$ has two spaces after it!")
    $ 'this word  has two spaces after it'

    rem_punctuation("these!words@have$no%space")
    $ 'thesewordshavenospace'
    """
    return re.sub(f"[{re.escape(string.

### **Numbers**
Similar to punctuation, numbers do not add any information to the sentiment of a message, as seen in the samples below.

In [269]:
# Finding a sample of rows that contain numbers
rows_with_numbers = df[df['clean_text'].str.contains(r'\d')]
rows_with_numbers['clean_text'].sample(5)

68104                                                                                                                                                               model code conduct was revoked 2014\nnow guided modi code conduct
21202     2009 the environment ministry categorised 170000 hectares hasdeo arand “nogo” area for mining for its rich unfragmented forest cover feb this year modi govt permitted coal mines there adani group will operate this mine 
5911                                 modi announced and implemented prudent allocation already done till 2014 upa didnt buy rafale because there was money this lakh crore more than indian defense budgetwhere will money come from 
111753         23rd may chowkidar ashok swine chowkidar salil chowkidar and their brethren will face cheer haran from indians bet they will not twitter for month after that\ntheir only agenda hate modi and consequently hate india
47989                                                                       

Hence, we define `rem_numbers` as a function that replaces all numerical characters with the empty string (`''`).

In [270]:
rem_numbers??

Signature: rem_numbers(text: str) -> str
Source:   
def rem_numbers(text: str) -> str:
    """
    Removes numbers. This function simply replaces all numerical symbols to the empty string. Hence, for symbols enclosed by
    whitespace, the whitespace are not collapsed to a single whitespace (for more information, see the examples).

    Do not use this function alone, use `clean_and_tokenize()`.

    # Parameters
    * text: String entry.

    # Returns
    Text with the numerical symbol removed

    # Examples
    rem_numbers(" h3llo, k4must4 k4  n4?")
    ' hllo, kmust k  n?'
    """
    return re.sub(r"\d+", "", text)
File:      c:\users\erin\documents\github\stintsy-order-of-erin\lib\janitor.py
Type:      function

### **Whitespace**
We also noticed the prevalence of excess whitespace between words, as seen in the sample below.

In [271]:
# Finding a sample of rows that contain 2 or more whitespaces in a row
rows_with_whitespaces = df[df['clean_text'].str.contains(r'\s{2,}')]
rows_with_whitespaces['clean_text'].sample(5)

79293                                                                                           fantastic speech hari  and perfect insight why need someone with caliber such our beloved and dynamic prime minister sri narendra modi once again 
142795                                                                                                                                                                                                             modis zumla dreams youngsters  
27095                                                                                                                                                                         who says modi didnt create jobs see even vivek oberoi got the role  
33385                                                                                                                                                   this nonsense bjp leaders will speak like this when modi slogan says sab sat sabka vikas  
39615     micro this story e

Thus, the function `collapse_whitespace` collapses every run of consecutive space characters into a single space and strips leading and trailing whitespace. Formally, it is a transducer 

$$
\Box^+ \mapsto \Box \qquad \text{where the space character is } \Box
$$

Informally, it replaces every string of one or more spaces with a single space character.
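The distinction between the space character and whitespace in general matters in practice. The sketch below contrasts the pattern `" +"` with the broader `\s+`; both helper names are hypothetical. With `" +"`, newline characters inside a tweet are left in place, which would explain why `\n` can still show up in cleaned text after this step.

```python
import re


def collapse_spaces(text: str) -> str:
    """Collapse runs of the space character only, then trim both ends."""
    return re.sub(" +", " ", text).strip()


def collapse_all_whitespace(text: str) -> str:
    """Alternative: collapse runs of any whitespace (spaces, tabs, newlines)."""
    return re.sub(r"\s+", " ", text).strip()


sample = "a  b\nc "
print(repr(collapse_spaces(sample)))          # 'a b\nc'
print(repr(collapse_all_whitespace(sample)))  # 'a b c'
```

Which variant is preferable depends on whether line breaks inside a tweet should survive cleaning; the second one treats them as ordinary word separators.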

In [272]:
collapse_whitespace??

Signature: collapse_whitespace(text: str) -> str
Source:   
def collapse_whitespace(text: str) -> str:
    """
    This collapses whitespace. Here, collapsing means the transduction of all whitespace strings of any
    length to a whitespace string of unit length (e.g., "   " -> " "; formally " "+ -> " ").

    Do not use this function alone, use `clean_and_tokenize()`.

    # Parameters
    * text: String entry.

    # Returns
    Text with the whitespaces collapsed.

    # Examples
    collapse_whitespace("  huh,  was.  that!!! ")
    $ 'huh, was. that!!!'
    """
    return re.sub(" +", " ", text).strip()
File:      c:\users\erin\documents\github\stintsy-order-of-erin\lib\janitor.py
Type:      function

To call all of these cleaning functions in one step, we use the `clean` function, a wrapper that calls each of the separate components. Its definition is quite long; see [this appendix](#appendix:-clean-wrapper-function-definition) for the full definition.
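As a rough picture of what such a wrapper does, the sketch below chains the four steps in one function. The name `clean_sketch`, the order of the steps, and the punctuation pattern are assumptions made for illustration; the project's actual `clean` is the one defined in the appendix.

```python
import re
import string
import unicodedata

PUNCT_RE = re.compile(f"[{re.escape(string.punctuation)}]")


def clean_sketch(text: str) -> str:
    """Illustrative pipeline: normalize, strip punctuation and digits, collapse spaces."""
    # 1. Fold to lowercase ASCII (drops accents, emojis, curly quotes).
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii").lower()
    # 2. Remove ASCII punctuation.
    text = PUNCT_RE.sub("", text)
    # 3. Remove digits.
    text = re.sub(r"\d+", "", text)
    # 4. Collapse the runs of spaces left behind by steps 1 to 3.
    return re.sub(" +", " ", text).strip()


print(clean_sketch("  Modi’s 2019   speech: clichéd!! "))  # modis speech cliched
```

Running the whitespace step last is the design choice that matters here: every earlier step deletes characters and can leave doubled spaces behind.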

We can now clean the dataset and store the result in a new column named `clean_ours` (to differentiate it from the still-dirty `clean_text` column from the dataset author).

In [273]:
df["clean_ours"] = df["clean_text"].map(clean).astype('string')
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162969 entries, 0 to 162979
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   clean_text  162969 non-null  string
 1   category    162969 non-null  int64 
 2   clean_ours  162969 non-null  string
dtypes: int64(1), string(2)
memory usage: 9.0 MB


To confirm that the character cleaning worked, we compare `clean_text` and `clean_ours` for the filtered rows below.

In [274]:

example_rows = df[df['clean_text'].str.contains(r'\s{2,}|\d|[^\w\s]|[\u263a-\U0001f645]|[ÉéÁáóÓúÚíÍ]')]
example_rows.sample(10)

| index | clean_text | category | clean_ours |
| --- | --- | --- | --- |
| 16510 | pmofull speech narendra modi’ lakhs security guards across the country | 0 | pmofull speech narendra modi lakhs security guards across the country |
| 41434 | doubt messiah poor\nworld bank says india longer poor people’ country indians pulled out poverty every minute indias per capita income has also increased under modi govt | -1 | doubt messiah poor\nworld bank says india longer poor people country indians pulled out poverty every minute indias per capita income has also increased under modi govt |
| 60921 | opposition parties will have copy modi’ permanent campaign trick sooner later theprint via | 0 | opposition parties will have copy modi permanent campaign trick sooner later theprint via |
| 63017 | next 2448 hours the opposition and their loyal gang intellectuals will start demanding the proof thats what modi wants everytime modi lays trap and these idiots walk into | -1 | next hours the opposition and their loyal gang intellectuals will start demanding the proof thats what modi wants everytime modi lays trap and these idiots walk into |
| 102834 | that next time where you the theatrics wiping out sweat pidi journo looks more real | 1 | that next time where you the theatrics wiping out sweat pidi journo looks more real |
| 34695 | more taxes would needed realize the pappus offer\ninflation would follow miserable future offered stupid along with his mad offer guaranteed income see better inflation rate modi\n2010121 201187 201210 201394\n201549 201645 201736 20183 | -1 | more taxes would needed realize the pappus offer\ninflation would follow miserable future offered stupid along with his mad offer guaranteed income see better inflation rate modi |
| 59315 | definitely today under modi’ leadership india has become force reckoned with let’ not surprised pappu will ask for the proof well from drdo | -1 | definitely today under modi leadership india has become force reckoned with let not surprised pappu will ask for the proof well from drdo |
| 110245 | very true ramesh jarakiholi will give new angle this fight guess prabhakar kore and ramesh katthi won’ have much say already many rss workers are ground spreading that whoever bjp candidate vote for modi | 1 | very true ramesh jarakiholi will give new angle this fight guess prabhakar kore and ramesh katthi won have much say already many rss workers are ground spreading that whoever bjp candidate vote for modi |
| 158692 | china bring computer 50s india could have made 60s again 30yrs delay 2007 china got its anti missile sattelite modi made 04yrs but 7yrs cong gov mum | 0 | china bring computer s india could have made s again yrs delay china got its anti missile sattelite modi made yrs but yrs cong gov mum |
| 27952 | didn’ modi bhakats are weak maths they just jumping the gun | -1 | didn modi bhakats are weak maths they just jumping the gun |


### **Spam, Expressions, Onomatopoeia, etc.**

Since the domain of the corpus is Twitter, spam (e.g., `bbbb`), expressions (e.g., `bruhhhh`), and onomatopoeia (e.g., `hahahaha`) may become an issue at the vector representation step. Hence, we employ a simple rule-based spam removal algorithm.

We remove words that contain the same character three or more times in a row, or the same substring two or more times in a row. These rules are implemented with regular expressions:

$$
\mathtt{same\_char\_thrice} := \mathtt{(.)\backslash 1\{2,\}}
$$

and

$$
\mathtt{same\_substring\_twice} := \mathtt{(.+)\backslash 1+}
$$

Furthermore, we also remove any string that has a length less than three, since these are either stopwords (that weren't detected in the stopword removal stage) or more spam. 
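The rules so far can be sketched as follows. The helper names are hypothetical, and how each pattern is anchored is an assumption: the sketch uses `search` for the repeated-character rule and `fullmatch` for the repeated-substring rule, so that a word counts as spam only when the whole word is a repetition. An unanchored `search` with the second pattern would also flag ordinary words that contain a doubled letter, such as `will`.

```python
import re

SAME_CHAR_THRICE = re.compile(r"(.)\1{2,}")      # e.g. "hhhh" inside "bruhhhh"
SAME_SUBSTRING_TWICE = re.compile(r"(.+)\1+")    # e.g. "ha" repeated in "hahahaha"


def looks_like_spam(word: str) -> bool:
    """Illustrative word-level check combining the three rules described above."""
    if len(word) < 3:
        return True
    if SAME_CHAR_THRICE.search(word):
        return True
    # Assumption: the whole word must be a repeated substring.
    return SAME_SUBSTRING_TWICE.fullmatch(word) is not None


def remove_spam_words(text: str) -> str:
    """Keep only the words that do not look like spam."""
    return " ".join(w for w in text.split() if not looks_like_spam(w))


print(remove_spam_words("modi hahahaha will win bruhhhh ok"))  # modi will win
```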

Finally, we employ an adaptive character diversity threshold. A string $s$ is removed when

$$
\frac{\texttt{\#\_unique\_chars}(s)}{|s|} < 0.3 + \left(\frac{0.1 \cdot \min(|s|, 10)}{10}\right)
$$

The left-hand side measures the diversity of characters in the string; if the string repeats the same characters a lot, we expect it to be unintelligible or useless, hence we remove it.
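A direct transcription of the inequality might look like the sketch below. The function names are hypothetical, and treating the empty string as low-diversity is an assumption added to avoid a division by zero. The threshold grows from roughly 0.31 for a one-character string to 0.4 for strings of ten or more characters, so longer strings need proportionally more distinct characters to be kept.

```python
def diversity_threshold(s: str) -> float:
    """Right-hand side of the inequality: 0.3 + 0.1 * min(|s|, 10) / 10."""
    return 0.3 + 0.1 * min(len(s), 10) / 10


def is_low_diversity(s: str) -> bool:
    """True when the share of unique characters falls below the threshold."""
    if not s:
        return True  # assumption: an empty string carries no information
    return len(set(s)) / len(s) < diversity_threshold(s)


print(is_low_diversity("aaaaaaaaab"))  # True: 2 unique characters out of 10
print(is_low_diversity("modi"))        # False: 4 unique characters out of 4
```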

The `find_spam_and_empty` wrapper function that applies these rules is quite long; see its definition in [this appendix](#appendix:-find_spam_and_empty-wrapper-function-definition).

Let's first look at a random sample of 10 entries in the dataset before the cleaning pipeline.

In [275]:
df.sample(10)

| index | clean_text | category | clean_ours |
| --- | --- | --- | --- |
| 71301 | first reply why did not tested then modi will tell why did | 1 | first reply why did not tested then modi will tell why did |
| 117536 | modi spoke all the oppos seen with their tales btwn the legs | 0 | modi spoke all the oppos seen with their tales btwn the legs |
| 126262 | you dont have any sorrow over farmers suicide then vote for modi you want the dead body soldiers used for political gain then definitely vote for modi | -1 | you dont have any sorrow over farmers suicide then vote for modi you want the dead body soldiers used for political gain then definitely vote for modi |
| 60908 | jab tak bhaiya president hain inc modi rehenge happy theaters day | 1 | jab tak bhaiya president hain inc modi rehenge happy theaters day |
| 78292 | modis ayushman bharat success coz congress kept millions below poverty line over decades | 1 | modis ayushman bharat success coz congress kept millions below poverty line over decades |
| 49865 | lying concealing factsinfodata are part his via | 0 | lying concealing factsinfodata are part his via |
| 122273 | modi the can’ protect defence files his own office modi feeds peoples life’ due dogs due his flawed policies calls himself chowkidhars | -1 | modi the can protect defence files his own office modi feeds peoples life due dogs due his flawed policies calls himself chowkidhars |
| 103927 | hmmmdoing everything against modi his supporters | 0 | hmmmdoing everything against modi his supporters |
| 63519 | not only does this tweet officially pellucids paid media and the state journalism today but also shows the surpassing rise fascism india today stating that even after much ruination and despotism mrmodi fears neither the opposition nor the public | 1 | not only does this tweet officially pellucids paid media and the state journalism today but also shows the surpassing rise fascism india today stating that even after much ruination and despotism mrmodi fears neither the opposition nor the public |
| 47181 | mission shakti\nthanks drdo isro and modi proud you all india now space superpower | 1 | mission shakti\nthanks drdo isro and modi proud you all india now space superpower |


Let's now call this function on the `clean_ours` column of the dataset.

In [276]:
df["clean_ours"] = df["clean_ours"].map(find_spam_and_empty).astype('string')
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162969 entries, 0 to 162979
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   clean_text  162969 non-null  string
 1   category    162969 non-null  int64 
 2   clean_ours  162942 non-null  string
dtypes: int64(1), string(2)
memory usage: 9.0 MB


In [277]:
type(df.loc[0, 'clean_ours'])

str

And look at another random sample of 10 entries in the dataset after the cleaning pipeline.

In [278]:
df.sample(10)

| index | clean_text | category | clean_ours |
| --- | --- | --- | --- |
| 50519 | see what manish tiwari saying this the reason why narendra modi needed | 0 | see what manish tiwari saying this the reason why narendra modi needed |
| 1458 | only modi matters | 0 | only modi matters |
| 65540 | low show for lowly subhumans pakistan showed how much fear modi 272 | 1 | low show for lowly subhumans pakistan showed how much fear modi |
| 63804 | opinion modis “grand disclosure” extension the nationalist card being vigorously pushed the bjp after the and the balakot air strikes writes | 1 | opinion modis grand disclosure extension the nationalist card being vigorously pushed the bjp after the and the balakot air strikes writes |
| 112699 | another jumla exposed modi worst indias history | -1 | another jumla exposed modi worst indias history |
| 69845 | drdo chief upa govt didnt give the nod ahead but now modi had the will power dont understand why congress trying hard push our country back every front | -1 | drdo chief upa govt didnt give the nod ahead but now modi had the will power dont understand why congress trying hard push our country back every front |
| 76199 | two strong reasons for not voting for congress forces were prohibited from avenging 2611 attack\n2isro was not allowed conduct mission shakti test | 1 | two strong reasons for not voting for congress forces were prohibited from avenging attack isro was not allowed conduct mission shakti test |
| 69060 | today proud indian have great leader who leading today who kept national interest first proud our narendra modi and our great sciencetists | 1 | today proud indian have great leader who leading today who kept national interest first proud our narendra modi and our great sciencetists |
| 120390 | will send you doggy bag tomorrow\nnice one because nirav modi has dog wanted nirav set free\nwof wof nice try\nwhat next | 1 | will send you doggy bag tomorrow nice one because nirav modi has dog wanted nirav set free wof wof nice try what next |
| 147651 | 72k won’ spent ipl tickets pop corn emi but essential goods should only boost the profits those companies who cater them overall they will have chance rise become consumers like you unlike modi’ election strategy that drove many ruin | 1 | won spent ipl tickets pop corn emi but essential goods should only boost the profits those companies who cater them overall they will have chance rise become consumers like you unlike modi election strategy that drove many ruin |


## **Post-Cleaning Steps**

At some point during the cleaning stage, some entries of the dataset could have been reduced to `NaN` or the empty string `""`, or we could have introduced duplicates again. So, let's call `dropna` and `drop_duplicates` again.
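One detail is worth keeping in mind: `dropna` removes missing values only, so an entry that was reduced to the empty string `""` would survive it unless it is first converted to a missing value. The non-null count of `clean_ours` already dropped after the spam step, which suggests that emptied entries come back as missing values here, but that is an inference from the output. The sketch below shows the general pattern on a made-up frame (all values are hypothetical).

```python
import pandas as pd

# Toy data: one emptied entry, one missing entry, and one exact duplicate.
toy = pd.DataFrame(
    {
        "category": [1, 0, 1, -1],
        "clean_ours": ["modi speech", "", "modi speech", None],
    }
)

# Turn empty strings into missing values so that dropna() can see them.
toy["clean_ours"] = toy["clean_ours"].mask(toy["clean_ours"] == "")

# Drop missing rows first, then rows that are identical in every column.
toy = toy.dropna().drop_duplicates()
print(len(toy))  # 1
```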

In [279]:
df = df.dropna()
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162942 entries, 0 to 162979
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   clean_text  162942 non-null  string
 1   category    162942 non-null  int64 
 2   clean_ours  162942 non-null  string
dtypes: int64(1), string(2)
memory usage: 5.0 MB


In [280]:
df = df.drop_duplicates()
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162942 entries, 0 to 162979
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   clean_text  162942 non-null  string
 1   category    162942 non-null  int64 
 2   clean_ours  162942 non-null  string
dtypes: int64(1), string(2)
memory usage: 5.0 MB


## **3. Preprocessing**

> 🏗️ Perhaps swap S3 and S4. Refer to literature on what comes first.

This section discusses the preprocessing steps for the cleaned data. Because the goal is to analyze the textual sentiment of tweets, the following preprocessing steps are needed to provide the Bag of Words model with the relevant information required to get the semantic embeddings of each tweet.

Before and after each preprocessing step, we show a random sample of entries from the dataset to illustrate the effect of each preprocessing task.

## **Lemmatization**

We follow a methodology similar to the one presented in <u>(George & Murugesan, 2024)</u> and preprocess the dataset entries via lemmatization. We use NLTK's `WordNetLemmatizer` for this task <u>(Bird & Loper, 2004)</u>. For the lemmatization step, we use WordNet for English lemmatization and Open Multilingual WordNet version 1.4 for translations and multilingual support, which is important in our case since some tweets contain text from Indian languages.

In [281]:
df["lemmatized"] = df["clean_ours"].map(lemmatizer)
df.sample(10)

Unnamed: 0,clean_text,category,clean_ours,lemmatized
99797,the people india dont want towards instability prime minister narendra modi,0,the people india dont want towards instability prime minister narendra modi,the people india dont want towards instability prime minister narendra modi
154363,modi went for two seats could have left gujarat argument was that vikas gujarat running awayhe went varanasi argument was that one knows him outside gujarat had prove his pan india image and counter baseless argument congress,0,modi went for two seats could have left gujarat argument was that vikas gujarat running awayhe went varanasi argument was that one knows him outside gujarat had prove his pan india image and counter baseless argument congress,modi went for two seat could have left gujarat argument wa that vikas gujarat running awayhe went varanasi argument wa that one know him outside gujarat had prove his pan india image and counter baseless argument congress
84525,nehru was launching anti satellite missile 1959 for which modi taking credit,0,nehru was launching anti satellite missile for which modi taking credit,nehru wa launching anti satellite missile for which modi taking credit
17430,modi way making job creation 2014 sale chai and proud chaiwallah 2017 sale pakodas and proud business man 2019 became chowkidar front richman home 2019 sale tshirts proud bhakth,1,modi way making job creation sale chai and proud chaiwallah sale pakodas and proud business man became chowkidar front richman home sale tshirts proud bhakth,modi way making job creation sale chai and proud chaiwallah sale pakodas and proud business man became chowkidar front richman home sale tshirts proud bhakth
82287,right\nthe least educated modis bravados now unsold,-1,right the least educated modis bravados now unsold,right the least educated modis bravado now unsold
44536,modi did itt,0,modi did itt,modi did itt
152977,this what call sensible criticism possible only bcoz rahul’ idea had been modi’ there would have been discussion bcoz fear,0,this what call sensible criticism possible only bcoz rahul idea had been modi there would have been discussion bcoz fear,this what call sensible criticism possible only bcoz rahul idea had been modi there would have been discussion bcoz fear
148410,another difference modi chose state where bjp was not projected weak made stronger rahul chose state where his party projected win seats benefits from partys strength like parasite,1,another difference modi chose state where bjp was not projected weak made stronger rahul chose state where his party projected win seats benefits from partys strength like parasite,another difference modi chose state where bjp wa not projected weak made stronger rahul chose state where his party projected win seat benefit from party strength like parasite
12564,you guys give time then can hamara modi sal hoya aaj bhe pakistan pakistan aur nahru par ilzam lagata hai give him some time bhai,0,you guys give time then can hamara modi sal hoya bhe pakistan pakistan aur nahru par ilzam lagata hai give him some time bhai,you guy give time then can hamara modi sal hoya bhe pakistan pakistan aur nahru par ilzam lagata hai give him some time bhai
106793,most idiotic question from criminal modis steps polling arent connected all pakistani stooges like must hanged even after youre dead,-1,most idiotic question from criminal modis steps polling arent connected all pakistani stooges like must hanged even after youre dead,most idiotic question from criminal modis step polling arent connected all pakistani stooge like must hanged even after youre dead


## **Stop Word Removal**

After lemmatization, we can remove the stop words present in the dataset. Stop word removal _needs_ to come after lemmatization because it requires all words to be reduced to their base dictionary form: the `stopwords_set` only contains the base dictionary forms of the stop words.

**stopwords.** For stop word removal, we refer to the English stopword lists defined in NLTK and Wolfram Mathematica <u>(Bird & Loper, 2004; Wolfram Research, 2015)</u>. However, since the task is sentiment analysis, words that carry polarity, intensification, or negation are important, and words like "not" and "okay" are commonly included in stopword lists. Therefore, the NLTK and Mathematica stopword lists are manually adjusted to include only neutral stopwords, such as "after", "when", and "you".
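As a rough illustration of this step, the sketch below filters tokens against a small stopword set. The function name `remove_stopwords_sketch` and the five-word set are hypothetical; the notebook's actual `rem_stopwords` and `stopwords_set` are not shown here and may differ. Returning `None` for an entry that becomes empty is also an assumption, chosen so that a `dropna` call on the column can discard such rows.

```python
from typing import Optional

# Hypothetical, heavily shortened stopword set: only neutral words are listed,
# so polarity and negation words such as "not" or "never" are kept.
NEUTRAL_STOPWORDS = {"after", "when", "you", "the", "for"}


def remove_stopwords_sketch(text: str, stopwords: set[str]) -> Optional[str]:
    """Drop every token found in `stopwords`; return None if nothing is left."""
    kept = [token for token in text.split() if token not in stopwords]
    return " ".join(kept) if kept else None


print(remove_stopwords_sketch("you never vote for the people", NEUTRAL_STOPWORDS))
print(remove_stopwords_sketch("when you", NEUTRAL_STOPWORDS))
```

The negation word "never" survives the filter, while an entry made up entirely of neutral stopwords is reduced to `None`.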

In [282]:
df["lemmatized"] = df["lemmatized"].map(lambda t: rem_stopwords(t, stopwords_set))
df = df.dropna(subset=["lemmatized"])
df.sample(10)

Unnamed: 0,clean_text,category,clean_ours,lemmatized
148169,bjp party poor says modi meanwhile bjp assets increased 627 last amit shah assets jumped times yrs gadkari income increased 141 yrs other news\nbsnl verge bankruptcy\nbank npa has been highest\nyouths have job,-1,bjp party poor says modi meanwhile bjp assets increased last amit shah assets jumped times yrs gadkari income increased yrs other news bsnl verge bankruptcy bank npa has been highest youths have job,bjp party poor modi bjp asset increased amit shah asset jumped time gadkari income increased news bsnl verge bankruptcy bank npa ha highest youth job
27628,not supported modiproud anti indian,0,not supported modiproud anti indian,supported modiproud anti indian
98798,shameless communal who has ganged with and who meet these creeps every evening chalk out narrative favouring pak and design anti modi caimpaign their hate for man they shame india 24x7 kick ass,-1,shameless communal who has ganged with and who meet these creeps every evening chalk out narrative favouring pak and design anti modi caimpaign their hate for man they shame india kick ass,shameless communal ha ganged meet creep evening chalk narrative favouring pak design anti modi caimpaign hate man shame india kick
120226,would expect more balanced argument from welleducated man like you this article criticizes but gives clean chit clean starts from home,1,would expect more balanced argument from welleducated man like you this article criticizes but gives clean chit clean starts from home,expect more balanced argument welleducated man like article criticizes clean chit clean start
46914,love kaftan anytime any day,1,love kaftan anytime any day,love kaftan anytime day
130127,when indian population turn 600crores modi will definitely bring back black money with the help baba ramdev,-1,when indian population turn crores modi will definitely bring back black money with the help ramdev,indian population turn crore modi definitely bring back black money help ramdev
95825,modi was the 1969,0,modi was the,modi
110417,absolutely but this weird behaviour modi would enable trumpus understand that country can all weather friend except pakistan,-1,absolutely but this weird behaviour modi would enable trumpus understand that country can all weather friend except pakistan,absolutely weird behaviour modi enable trumpus understand country all weather friend pakistan
120234,not fan hater modi rahul big fan india \nbest use that money was for the poor people india for public welfare schemes\ndont you all think that this could the best use tax payers money,1,not fan hater modi rahul big fan india best use that money was for the poor people india for public welfare schemes dont you all think that this could the best use tax payers money,fan hater modi rahul big fan india best money poor people india public welfare scheme all best tax payer money
117918,you vote modi you are with india you are not with modi you are anti india the above shit bjp candidate from bangalore dear not with modi with indian people not anti india sanjeev bard,-1,you vote modi you are with india you are not with modi you are anti india the above shit bjp candidate from bangalore dear not with modi with indian people not anti india sanjeev bard,vote modi india modi anti india above shit bjp candidate bangalore dear modi indian people anti india sanjeev bard


## **Looking at the DataFrame**

After preprocessing, the dataset now contains:

In [283]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162942 entries, 0 to 162979
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   clean_text  162942 non-null  string
 1   category    162942 non-null  int64 
 2   clean_ours  162942 non-null  string
 3   lemmatized  162942 non-null  object
dtypes: int64(1), object(1), string(2)
memory usage: 6.2+ MB


Here are 5 randomly picked entries in the DataFrame with all columns shown for comparison.

In [284]:
display(df.sample(5))

Unnamed: 0,clean_text,category,clean_ours,lemmatized
16111,two penny hate monger modi and netanyahu lover etc etc etc and not peaceful any way all,-1,two penny hate monger modi and netanyahu lover etc etc etc and not peaceful any way all,penny hate monger modi netanyahu lover peaceful all
2266,western buyers who were shifting india from china because price rise there went pakistan bangladesh sri lanka vietnam etc because they have better price they are clocking big time thanks modi,1,western buyers who were shifting india from china because price rise there went pakistan bangladesh sri lanka vietnam etc because they have better price they are clocking big time thanks modi,western buyer shifting india china price rise pakistan bangladesh sri lanka vietnam better price clocking big time modi
139905,want watch debate between kanhaiya kumar the great orator narendra modi plz arrange debate,1,want watch debate between kanhaiya kumar the great orator narendra modi plz arrange debate,watch debate kanhaiya kumar great orator narendra modi plz arrange debate
83968,narendra modi says meerut ‘the rld and bsp together make sharab alcohol this alcohol will ruin you’,0,narendra modi says meerut the rld and bsp together make sharab alcohol this alcohol will ruin you,narendra modi meerut rld bsp sharab alcohol alcohol ruin
57144,everytime modi govt achieves somethingone your body part starts burningironically you use the same body part think and that body part not brain,0,everytime modi govt achieves somethingone your body part starts burningironically you use the same body part think and that body part not brain,everytime modi govt achieves somethingone body start burningironically body body brain


## **Tokenization** 

Since the data cleaning and preprocessing stage is comprehensive, the tokenization step in the BoW model reduces to a simple word-boundary split operation. Each preprocessed entry in the DataFrame is split by spaces. For example, the entry `"shri narendra modis"` (entry: 42052) becomes `["shri", "narendra", "modis"]`. By the end of tokenization, all entries are transformed into arrays of strings.
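A minimal sketch of this split, assuming the entries are plain strings with single spaces between words. The variable names are illustrative only and do not come from the notebook.

```python
# Stand-in for a few preprocessed entries; the first one is quoted in the text above.
entries = ["shri narendra modis", "love kaftan anytime day"]

# Whitespace was already collapsed during cleaning, so a plain split is enough.
tokenized = [entry.split(" ") for entry in entries]

print(tokenized[0])
```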

## **Word Bigrams** 

As noted earlier, modifiers and polarity words are not included in the stopword set. The BoW model constructs a vocabulary containing both unigrams and bigrams. Including bigrams allows the model to capture common word patterns, such as  

$$
\left\langle \texttt{Adj}\right\rangle \left\langle \texttt{M} \mid \texttt{Pron} \right\rangle 
$$  

<center>or</center>

$$
\left\langle \texttt{Adv}\right\rangle \left\langle \texttt{V} \mid \texttt{Adj} \mid \texttt{Adv} \right\rangle 
$$  
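To make these patterns concrete, the sketch below builds adjacent-word bigrams from a token list. The helper `word_bigrams` is hypothetical and only mirrors the idea; it is not the code the BoW model uses internally.

```python
def word_bigrams(tokens: list[str]) -> list[str]:
    """Join each token with its right-hand neighbour to form word bigrams."""
    return [f"{left} {right}" for left, right in zip(tokens, tokens[1:])]


tokens = "absolutely weird behaviour modi".split()
features = tokens + word_bigrams(tokens)  # unigrams followed by bigrams
print(features)
```

Here `absolutely weird` is an instance of the second pattern, an adverb followed by an adjective.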

## **Vector Representation**

After the lemmatization and stop word removal steps, each entry can now be represented as a vector using a Bag of Words (BoW) model. We employ scikit-learn's `CountVectorizer`, which provides a ready-to-use implementation of BoW <u>(Pedregosa et al., 2011)</u>.

A comparison with other traditional vector representations is discussed in [this appendix](#appendix:-comparison-of-traditional-vectorization-techniques).

Because bigrams are included, words with modifiers have the modifiers directly attached, enabling subsequent models to capture the concept of modification fully. Consequently, after tokenization and bigram construction, the vocabulary size can grow up to $O(n^2)$, where $n$ is the number of unique tokens, since every ordered pair of tokens is a potential bigram.

**minimum document frequency constraint:** Despite cleaning and spam removal, some tokens remain irrelevant or too rare. To address this, a minimum document frequency constraint is applied: $\texttt{min\_df} = 10$, meaning a token must appear in at least 10 documents to be included in the BoW vocabulary. This reduces noise and ensures the model focuses on meaningful terms.
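The effect of the constraint can be sketched in plain Python. The helper below is hypothetical and uses a toy threshold of 2 instead of 10; it is not the `CountVectorizer` implementation, only an illustration of counting how many documents contain each token.

```python
from collections import Counter


def filter_by_document_frequency(docs: list[list[str]], min_df: int) -> list[str]:
    """Return the sorted vocabulary of tokens that occur in at least `min_df` documents."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each token once per document
    return sorted(token for token, df in doc_freq.items() if df >= min_df)


docs = [
    ["modi", "india", "vote"],
    ["modi", "india", "modi"],
    ["modi", "kaftan"],
]
print(filter_by_document_frequency(docs, min_df=2))
```

Note that a token repeated inside a single document still counts once, so the constraint is about how widely a term is used, not how often.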

---

These parameters of the BoW model are encapsulated in the `BagOfWordsModel` class. The class definition is available in [this appendix](#appendix:-BagOfWordsModel-class-definition).

In [285]:
bow = BagOfWordsModel(df["lemmatized"], 10)

# some sanity checks
assert bow.matrix.shape[0] == df.shape[0], "number of rows in the matrix does NOT match the number of documents"
assert bow.sparsity,                       "the sparsity is TOO HIGH, something went wrong"



The warning raised by the cell above is expected. Recall that our tokenization step essentially reduces to a split on spaces, so the `tokenizer` attribute of the `BagOfWordsModel` is set to override the default tokenization pattern, and overriding it triggers this warning.

### **Model Metrics**

To get an idea of the model, we now look at its shape and sparsity. The shape gives the number of documents and the number of tokens in the model, while the sparsity is the proportion of elements in the document-term matrix that are zero, which indicates how varied the words are across the dataset.

The resulting matrix has a shape of

In [286]:
bow.matrix.shape

(162942, 30386)

The first entry of the pair is the number of documents (the ones that remain after all the data cleaning and preprocessing steps) and the second entry is the number of tokens in the vocabulary (unique unigrams and bigrams).

The resulting model has a sparsity of

In [293]:
1 - bow.sparsity

0.9995039539872171

The model is 99.95% sparse, meaning each tweet uses only a tiny fraction of the vocabulary: tweets rarely share the same words, which leads to a large vocabulary.
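To put the sparsity figure in perspective, the arithmetic below converts it into an average number of non-zero entries per tweet. It is a back-of-the-envelope sketch that assumes sparsity is defined as the fraction of zero cells in the document-term matrix.

```python
# Values copied from the outputs above.
n_docs, n_terms = 162942, 30386
sparsity = 0.9995039539872171

total_cells = n_docs * n_terms            # every (document, term) pair
nonzero_cells = (1 - sparsity) * total_cells
avg_terms_per_doc = nonzero_cells / n_docs

print(f"total cells:        {total_cells:,}")
print(f"non-zero cells:     {nonzero_cells:,.0f}")
print(f"avg terms per doc:  {avg_terms_per_doc:.1f}")
```

Under these numbers, a tweet activates roughly 15 of the 30,386 vocabulary columns on average.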

We now look at the most frequent and least frequent terms in the model.

In [288]:
doc_frequencies = np.asarray((bow.matrix > 0).sum(axis=0)).flatten()
freq_order = np.argsort(doc_frequencies)[::-1]
bow.feature_names[freq_order[:50]]

array(['modi', 'india', 'ha', 'all', 'people', 'bjp', 'like', 'congress',
       'narendra', 'only', 'election', 'narendra modi', 'vote', 'govt',
       'about', 'indian', 'year', 'time', 'country', 'just', 'modis',
       'more', 'nation', 'rahul', 'even', 'government', 'party', 'power',
       'gandhi', 'minister', 'leader', 'good', 'modi govt', 'need',
       'modi ha', 'space', 'work', 'prime', 'money', 'credit', 'sir',
       'pakistan', 'back', 'day', 'today', 'prime minister', 'scientist',
       'never', 'support', 'win'], dtype=object)

We see that the main talking point of the tweets revolves around Indian politics, with keywords like "modi", "india", and "bjp". For additional context, "bjp" refers to the _Bharatiya Janata Party_, a conservative political party in India and one of the two major Indian political parties.

Next, we look at the least frequent terms.

In [289]:
bow.feature_names[freq_order[-50:]]

array(['healthy democracy', 'ha mass', 'ha separate', 'ha shifted',
       'hat drdo', 'about defeat', 'yet ha', 'yes more', 'yes narendra',
       'hatred people', 'ha requested', 'hate more', 'hate much',
       'hatemonger', 'hater gonna', 'heal', 'hazaribagh', 'head drdo',
       'sleep night', 'abinandan', 'able provide', 'able speak',
       'able vote', 'youth need', 'youth power', 'hai isliye', 'hai chor',
       'handy', 'hand narendra', 'hand people', 'hae', 'ha withdrawn',
       'happens credit', 'happier', 'bhaiyo', 'socha', 'social political',
       'social security', 'biased journalist', 'big congratulation',
       'sirmodi', 'bhutan', 'bhi berozgar', 'bhi mumkin', 'skta',
       'bhatt aditi', 'bhi aur', 'slamming', 'smart modi', 'slogan blame'],
      dtype=object)

The themes seen in the most frequent terms are still present in this subset, although filler or non-distinct words, like "handy", "heal", and "happier", appear more often.

But the presence of terms like "healthy democracy" and "able vote" still points to this subset being relevant to the main theme of the dataset.

# **4 exploratory data analysis**

This section discusses the exploratory data analysis conducted on the dataset after cleaning.

> Notes from Zhean: <br>
> From manual checking via OpenRefine, there are a total of 162972. `df.info()` should have the same result post-processing.
> Furthermore, there should be two columns. The first, `clean_text` (which is a bit of a misnomer since it is still dirty), contains the tweets (text data). The second, `category`, contains the sentiment of the tweet as a tribool (1 for positive, 0 for neutral or indeterminate, and -1 for negative).

# **references**
Bird, S., & Loper, E. (2004, July). NLTK: The natural language toolkit. *Proceedings of the ACL Interactive Poster and Demonstration Sessions*, 214–217. https://aclanthology.org/P04-3031/

El-Demerdash, A. A., Hussein, S. E., & Zaki, J. F. W. (2021). Course evaluation based on deep learning and SSA hyperparameters optimization. *Computers, Materials & Continua, 71*(1), 941–959. https://doi.org/10.32604/cmc.2022.021839

George, M., & Murugesan, R. (2024). Improving sentiment analysis of financial news headlines using hybrid Word2Vec-TFIDF feature extraction technique. *Procedia Computer Science, 244*, 1–8.

Hussein, S. (2021). *Twitter sentiments dataset*. Mendeley.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. *Journal of Machine Learning Research, 12*, 2825–2830.

Rani, D., Kumar, R., & Chauhan, N. (2022, October). Study and comparison of vectorization techniques used in text classification. In *2022 13th International Conference on Computing Communication and Networking Technologies (ICCCNT)* (pp. 1–6). IEEE.

Wolfram Research. (2015). *DeleteStopwords*. https://reference.wolfram.com/language/ref/DeleteStopwords.html

# **appendix: `clean` wrapper function definition**
Below is the definition of the `clean` wrapper function that encapsulates all internal functions used in the cleaning pipeline.

In [290]:
clean??

Signature: clean(text: str) -> str
Source:   
def clean(text: str) -> str:
    """
    This is the main function for data cleaning (i.e., it calls all the cleaning functions in the prescribed order).

    This function should be used as a first-class function in a map.

    # Parameters
    * text: The string entry from a DataFrame column.
    * stopwords: stopword dictionary.

    # Returns
    Clean string
    """
    # cleaning on the base string
    text = normalize(text)
    text = rem_punctuation(text)
    text = rem_numbers(text)
    text = collapse_whitespace(text)

    return text
File:      c:\users\erin\documents\github\stintsy-order-of-erin\lib\janitor.py
Type:      function

# **appendix: `find_spam_and_empty` wrapper function definition**
Below is the definition of the `find_spam_and_empty` wrapper function that encapsulates all internal functions for the spam detection algorithm.

In [291]:
find_spam_and_empty??

Signature: find_spam_and_empty(text: str, min_length: int = 3) -> str | None
Source:   
def find_spam_and_empty(text: str, min_length: int = 3) -> str | None:
    """
    Filter out empty text and unintelligible/spammy unintelligible substrings in the text.

    Spammy substrings:
    - Shorter than min_length
    - Containing non-alphabetic characters
    - Consisting of a repeated substring (e.g., 'aaaaaa', 'ababab', 'abcabcabc')

    # Parameters
    * text: input string.
    * min_length: minimum length of word to keep.

    # Returns
        Cleaned string, or None if empty after filtering.
    """
    cleaned_tokens = []
    for t in text.split():
        if len(t) < min_length:

# **appendix: comparison of traditional vectorization techniques**

Traditional vectorization techniques include BoW and Term Frequency-Inverse Document Frequency (TF-IDF). TF-IDF weights each word based on its frequency in a document and its rarity across the corpus, reducing the impact of common words. BoW, in contrast, simply counts word occurrences without considering corpus-level frequency. In this project, BoW was chosen because stopwords were already removed during preprocessing, and the dataset is domain-specific <u>(Rani et al., 2022)</u>. In such datasets, frequent words are often meaningful domain keywords, so scaling them down (as TF-IDF would) could reduce the importance of these key terms in the feature representation.
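The difference can be seen on a toy corpus. The sketch below uses the textbook weighting $\text{tf} \times \ln(N / \text{df})$, which is only one of several TF-IDF variants and is an assumption here (scikit-learn, for instance, smooths the idf by default). The three documents and the helper names are made up for illustration.

```python
import math
from collections import Counter

docs = [
    "modi vote india".split(),
    "modi good leader".split(),
    "modi vote vote".split(),
]

n_docs = len(docs)
# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for tokens in docs for term in set(tokens))


def bow_vector(tokens: list[str]) -> dict[str, int]:
    """Raw term counts, as in a Bag of Words representation."""
    return dict(Counter(tokens))


def tfidf_vector(tokens: list[str]) -> dict[str, float]:
    """Term count scaled by the unsmoothed idf, ln(N / df)."""
    counts = Counter(tokens)
    return {term: count * math.log(n_docs / doc_freq[term]) for term, count in counts.items()}


print(bow_vector(docs[2]))
print(tfidf_vector(docs[2]))
```

The domain keyword `modi` appears in every document, so its TF-IDF weight drops to zero while its BoW count is preserved. This is the down-weighting of key terms that the paragraph above gives as the reason for choosing BoW.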

# **appendix: `BagOfWordsModel` class definition**
Below is the definition of the `BagOfWordsModel` class that encapsulates the desired parameters.

In [292]:
BagOfWordsModel??

Init signature: BagOfWordsModel(texts: Iterable[str], min_freq: int | float | None = None)
Source:        
class BagOfWordsModel:
    """
    A Bag-of-Words representation for a text corpus.

    # Attributes
    * matrix (scipy.sparse.csr_matrix): The document-term matrix of word counts.
    * feature_names (list[str]): List of feature names corresponding to the matrix columns.
    *
    # Usage
    ```
    bow = BagOfWordsModel(df["lemmatized_str"])
    ```
    """

    def __init__(self, texts: Iterable[str], min_freq: int | float | None = None):
        """
        Initialize the BagOfWordsModel by fitting the vectorizer to the text corpus. This also filters out tokens
        that do not appear more than 