# Sentiment Analysis of Twitter Posts

<!-- Notebook name goes here -->
<center><b>Notebook: Data Description, Cleaning, Exploratory Data Analysis, and Preprocessing</b></center>
<br>

**by**: Stephen Borja, Justin Ching, Erin Chua, and Zhean Ganituen.

**dataset**: Hussein, S. (2021). Twitter Sentiments Dataset [Dataset]. Mendeley. https://doi.org/10.17632/Z9ZW7NT5H2.1

**motivation**: Every minute, social media users generate a large influx of textual data on live events. Performing sentiment analysis on this data provides a real-time view of public perception, enabling quick insights into the general population's opinions and reactions.

**goal**: By the end of the project, our goal is to create and compare supervised learning algorithms for sentiment analysis.

### **dataset description**

The Twitter Sentiments Dataset is a dataset that contains nearly 163k tweets from Twitter. The time period of when these were collected is unknown, but it was published to Mendeley Data on May 14, 2021 by Sherif Hussein of Mansoura University.

Tweets were extracted using the Twitter API, but how the tweets were selected is not specified. The tweets are mostly English with some Hindi words mixed in through code-switching <u>(El-Demerdash, 2021)</u>. All of them seem to discuss the political state of India. Most tweets mention Narendra Modi, the current Prime Minister of India.

Each tweet was assigned a label using TextBlob's sentiment analysis <u>(El-Demerdash, Hussein, & Zaki, 2021)</u>, which assigns labels automatically.

Twitter_Data

- **`clean_text`**: The tweet's text
- **`category`**: The tweet's sentiment category

What each row and column represents: `each row represents one tweet.` <br>
Number of observations: `162,980`

---

<a name="cite_note-1"></a>1. [^](#cite_ref-1) Code-switching is the practice of alternating between two languages $L_1$ (the native language) and $L_2$ (the source language) in a conversation. In this context, the code-switching is done to appear more casual since the conversation is done via Twitter (now, X).


## **1. Project Set-up**

We set the global imports for the project (ensure that these are installed via uv and are part of the environment). We also load the dataset here.


In [79]:
import pandas as pd
import numpy as np
import os
import sys

# Use lib directory
sys.path.append(os.path.abspath("../lib"))

# Imports from lib files
from janitor import *
from lemmatize import lemmatizer
from boilerplate import stopwords_set
from bag_of_words import BagOfWordsModel

# Pandas configuration
pd.set_option("display.max_colwidth", None)

# Load raw data file
df = pd.read_csv("../data/Twitter_Data.csv")

## **2. Data Cleaning**

This section discusses the methodology for data cleaning.




So as not to waste computational time, a preliminary step is to ensure that no `NaN` or duplicate entries exist before the cleaning steps. Every time we drop rows, we show the result of `info()` to see how many entries were filtered out.

Let's first drop the `NaN` entries.

In [80]:
df = df.dropna()
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162969 entries, 0 to 162979
Data columns (total 2 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   clean_text  162969 non-null  object 
 1   category    162969 non-null  float64
dtypes: float64(1), object(1)
memory usage: 3.7+ MB


Now, remove the duplicates.


In [81]:
df = df.drop_duplicates()
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162969 entries, 0 to 162979
Data columns (total 2 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   clean_text  162969 non-null  object 
 1   category    162969 non-null  float64
dtypes: float64(1), object(1)
memory usage: 3.7+ MB


We also ensure that every value in the `category` column is one of -1, 0, or 1, which represent the three sentiments: negative, neutral, and positive.


In [82]:
df["category"].unique()

array([-1.,  0.,  1.])

Then we remove any rows whose value falls outside this set to keep the data consistent.


In [83]:
df = df[df["category"].isin([-1, 0, 1])]
df["category"].sample(10)

124887    0.0
143995    1.0
127684    1.0
97193     1.0
56583     0.0
21555    -1.0
67785    -1.0
65603     1.0
1980      0.0
108019    1.0
Name: category, dtype: float64

When pandas reads a CSV file into a DataFrame, it infers a numeric column as `float64` when the column contains decimals or `NaN` values. Text values are loaded with the generic `object` dtype. We can check the dtype of each DataFrame column through `.info()`.
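To illustrate the inference rule, here is a minimal sketch that reads a small in-memory CSV. The three rows are made up for illustration and are not taken from the dataset. A single missing value in an otherwise integer column is enough for pandas to fall back to `float64`.

```python
import io

import pandas as pd

# Hypothetical three-row CSV; the second row has no category value.
csv_text = "clean_text,category\ngood day,1\nno label,\nbad day,-1\n"
toy = pd.read_csv(io.StringIO(csv_text))

# The missing value is read as NaN, which forces the column to float64.
category_dtype = str(toy["category"].dtype)

# Once the NaN row is dropped, the column can be cast to an integer type.
as_int = toy.dropna()["category"].astype(int)

print(category_dtype, as_int.tolist())
```

This is why the cast to an integer type is only attempted after the `NaN` rows have been removed.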


In [84]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162969 entries, 0 to 162979
Data columns (total 2 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   clean_text  162969 non-null  object 
 1   category    162969 non-null  float64
dtypes: float64(1), object(1)
memory usage: 3.7+ MB


First, we convert the `category` column from `float64` to `int64`. This is safe now that the `NaN` rows have been dropped and every remaining value is one of -1, 0, or 1.


In [85]:
df["category"] = df["category"].astype(int)
df["category"].info()

<class 'pandas.core.series.Series'>
Index: 162969 entries, 0 to 162979
Series name: category
Non-Null Count   Dtype
--------------   -----
162969 non-null  int64
dtypes: int64(1)
memory usage: 2.5 MB


In [86]:
df["category"].sample(10)

132994    1
59144     0
31184    -1
147998   -1
24931     0
51014     1
55969     0
53409    -1
107186    0
161426    1
Name: category, dtype: int64

Next, we convert the `clean_text` column from the `object` type to the pandas-defined `string` type for consistency and better performance.


In [87]:
df["clean_text"] = df["clean_text"].astype("string")
df["clean_text"].info()

<class 'pandas.core.series.Series'>
Index: 162969 entries, 0 to 162979
Series name: clean_text
Non-Null Count   Dtype 
--------------   ----- 
162969 non-null  string
dtypes: string(1)
memory usage: 2.5 MB


## **Main Cleaning Pipeline**

We follow a data cleaning methodology similar to the one presented in (George & Murugesan, 2024).


### **Normalization**

Because the text consists of tweets, emojis and accented characters are prevalent, as seen in the samples below. Although these serve as a form of emotional expression in a real-world context, they are not relevant to _textual_ sentiment analysis, so we normalize the text.


In [88]:
# Finding a sample of rows with accented characters
accented_char_rows = df[df["clean_text"].str.contains(r"É|é|Á|á|ó|Ó|ú|Ú|í|Í")]
accented_char_rows["clean_text"].sample(5)

86264     there should some basic quality check such clichéd juvenile satire silly even for modi bashing 
97413     sir please one exposé about the degree modi also everyone wants see his degree entire political science
159585    and might get déjà watching this interview and reminisce his interview with modi which they asked sir how you get this much energy
89813     just love the new conçe

In [89]:
# Finding a sample of rows with emojis
rows_with_emojis = df[df["clean_text"].str.contains(r"[\u263a-\U0001f645]", regex=True)]
rows_with_emojis["clean_text"].sample(5)

48840     modi divine gift nation says union minister harsh vardhan ⚡article 370 
155000    rahul gandhi attacks modi for failing deliver 2014 poll promises ⚡bairstow⚡ 
49985     our scientists have successfully shot down low earth orbit satellite 300 away space modi become the fourthafter russia china✌ 
82342     modi government again indians have made their mind meerut rally❣ 
128690    ️great news worlds biggest electric car manufacturer confirms will enter india 2019 ceo founder tesla motors confirmed this today will one the biggest make india success for modi 
Name: clean_text, dtype: string

The first function, `normalize`, converts the input text to lowercase, ASCII-only characters (say, "cómo estás" becomes "como estas"). Accented letters are reduced to their base letters, while Unicode characters with no ASCII equivalent (e.g., emojis) are replaced with the empty string (`''`).
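The idea can be sketched with the standard library alone. The function name `to_ascii_lower` is hypothetical, and the final lowercasing step is an assumption based on the description above rather than a copy of the project's `normalize`.

```python
import unicodedata


def to_ascii_lower(text: str) -> str:
    """Hypothetical stand-in for `normalize`; the lowercasing step is an assumption."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with errors="ignore" drops the marks and any symbol without an ASCII form.
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    return ascii_text.lower()


print(to_ascii_lower("Cómo estás"))
```

Note that a removed emoji leaves its surrounding spaces behind, which is one reason whitespace is collapsed later in the pipeline.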


In [90]:
normalize??

Signature: normalize(text: str) -> str
Source:   
def normalize(text: str) -> str:
    """
    Normalize text from a pandas entry to ASCII-only lowercase characters. Hence, this removes Unicode characters with no ASCII
    equivalent (e.g., emojis and CJKs).

    Do not use this function alone, use `clean_and_tokenize()`.

    # Parameters
    * text: String entry.

    # Returns
    ASCII-normalized text containing only lowercase letters.

    # Examples
    normalize("¿Cómo estás?")
    $ 'como estas?'

    normalize(" hahahaha HUY! Kamusta 😅 Mayaman $$$ ka na ba?")
    $ ' hahahaha huy! kamusta  mayaman $$$ ka na ba?'
    """
    normalized = unicodedata.normalize("NFKD", text)
    ascii_text = normalized.encode("ascii", "ignore").decode("ascii")

### **Punctuations**

Punctuation marks give sentences structure, clarity, and tone in natural speech and reading, but in the context of a classification study they add little information about the sentiment of a message. We treat `i hate you!` and `i hate you` as carrying the same sentiment, even though the `!` accentuates it. A sample of rows with punctuation is shown below.


In [91]:
# Finding a sample of rows with punctuation
rows_with_punc = df[df["clean_text"].str.contains(r"[^\w\s]")]
rows_with_punc["clean_text"].sample(5)

122653    modi’ address didn’ violate model code conduct finds election commission read more 
35178     very soon namo going nomo  modi shah bjp rss destroyed indian economy and lives indian people but modi shah ambani’ adanis modi’ bjp rss well prospered money and power  modi double chor hai
82089     modi worse than upa2 which turn was worse than upa1 vajpayee was good guess ’ spiral the bottom
139294    every bjp campaign same routine elect modi else there hidden agenda modi not elected will india face blood bath form riots bandhs lynching genocide far fetched but who knows there plan which coming government will face ☹☹☹☹
6338      indi

The function `rem_punctuation` replaces all punctuation marks and special characters with the empty string (`''`).
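A minimal sketch of this step is shown below. It assumes that "punctuation" means the ASCII set in `string.punctuation`, which is sufficient once non-ASCII symbols have already been removed by normalization. The name `strip_punctuation` is hypothetical.

```python
import re
import string

# Assumption: the characters to remove are exactly those in string.punctuation.
PUNCT_PATTERN = re.compile(f"[{re.escape(string.punctuation)}]")


def strip_punctuation(text: str) -> str:
    """Hypothetical stand-in for `rem_punctuation`."""
    # Each punctuation character becomes the empty string; nearby spaces are left untouched.
    return PUNCT_PATTERN.sub("", text)


print(strip_punctuation("these!words@have$no%space"))
```

Because the replacement is the empty string rather than a space, symbols that sit between two spaces leave a double space behind, and words joined only by punctuation are fused together.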


In [92]:
rem_punctuation??

Signature: rem_punctuation(text: str) -> str
Source:   
def rem_punctuation(text: str) -> str:
    """
    Removes the punctuations. This function simply replaces all punctuation marks and special characters
    to the empty string. Hence, for symbols enclosed by whitespace, the whitespace are not collapsed to a single whitespace
    (for more information, see the examples).

    Do not use this function alone, use `clean_and_tokenize()`.

    # Parameters
    * text: String entry.

    # Returns
    Text with the punctuation removed.

    # Examples
    rem_punctuation("this word $$ has two spaces after it!")
    $ 'this word  has two spaces after it'

    rem_punctuation("these!words@have$no%space")
    $ 'thesewordshavenospace'
    """
    return re.sub(f"[{re.escape(string.

### **Numbers**

Similar to punctuations, numbers do not add any information to the sentiment of a message as seen in the samples below.


In [93]:
# Finding a sample of rows that contain numbers
rows_with_numbers = df[df["clean_text"].str.contains(r"\d")]
rows_with_numbers["clean_text"].sample(5)

152411                                                                                                                       modi was contesting those seats for the 1st time was making his lok sabha rahul has been sitting for yrs significant difference
18735                                      congress had done the same things catch votes poor india since freedom but done nothing for the poor actually declares 72000 the poor family india yesterday rahul koi astra kam nahi karena bar fir modi sarkar 
157038                                              nirav modis statement london court was threatened the congress leaders escape and run away from india paid them commission 456 congress leaders true not why was rushed london when nirav got arressted 
28104                                 modi government formed 2019 then request you include railway ticketing that body weight person should included while booking ticket heavy weight man having probs reservations seat comfortable upper and m

Hence, we define `rem_numbers` as a function that replaces all digits with the empty string (`''`).


In [94]:
rem_numbers??

Signature: rem_numbers(text: str) -> str
Source:   
def rem_numbers(text: str) -> str:
    """
    Removes numbers. This function simply replaces all numerical symbols to the empty string. Hence, for symbols enclosed by
    whitespace, the whitespace are not collapsed to a single whitespace (for more information, see the examples).

    Do not use this function alone, use `clean_and_tokenize()`.

    # Parameters
    * text: String entry.

    # Returns
    Text with the numerical symbol removed

    # Examples
    rem_numbers(" h3llo, k4must4 k4  n4?")
    ' hllo, kmust k  n?'
    """
    return re.sub(r"\d+", "", text)
File:      c:\users\erin\documents\github\stintsy-order-of-erin\lib\janitor.py
Type:      function

### **Whitespace**

We also noticed the prevalence of excess whitespace between words, as seen in the sample below.


In [95]:
# Finding a sample of rows that contain 2 or more whitespaces in a row
rows_with_whitespaces = df[df["clean_text"].str.contains(r"\s{2,}")]
rows_with_whitespaces["clean_text"].sample(5)

25226                                       full report card modi out now you will shocked see how india has changed last years  india now suffering from highest unemployment rate years nsso data all top most 
54816                                                                                                                                         someone wants say something about yours devta prachar mantri modi  
119101                                                                                         modi stopped hawala money  this filmmakers buttocks say hope this  will hang themselves after seeing namos victory
79761     rahul you are doing brilliant journalism  make sure you dont contract the disease modi bashing from your fellow journalist india today compulsive contrarianism broker journalists not accepted indians
90733                                                                                      its not nonbhakt the correct word either antinational antihindu\n\npe

Thus, the function `collapse_whitespace` collapses each run of consecutive space characters into a single space and strips leading and trailing whitespace. Formally, it is a transducer

$$
\Box^+ \mapsto \Box \qquad \text{where the space character is } \Box
$$

Informally, it replaces every run of spaces with a single space character.
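The choice of pattern matters here. The sketch below contrasts the space-only pattern `" +"` used by `collapse_whitespace` with a hypothetical variant based on `\s+`, which also matches tabs and newlines. Both helper names are illustrative and not part of the project's library.

```python
import re


def collapse_spaces(text: str) -> str:
    # Space-only pattern: runs of the space character are collapsed, newlines are kept.
    return re.sub(" +", " ", text).strip()


def collapse_all_whitespace(text: str) -> str:
    # Hypothetical variant: \s+ also matches tabs and newlines.
    return re.sub(r"\s+", " ", text).strip()


sample = "  huh,  was.\n\nthat!!! "
print(repr(collapse_spaces(sample)))
print(repr(collapse_all_whitespace(sample)))
```

With the space-only pattern, newline characters inside a tweet survive this step, so they have to be handled elsewhere if they are unwanted.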


In [96]:
collapse_whitespace??

Signature: collapse_whitespace(text: str) -> str
Source:   
def collapse_whitespace(text: str) -> str:
    """
    This collapses whitespace. Here, collapsing means the transduction of all whitespace strings of any
    length to a whitespace string of unit length (e.g., "   " -> " "; formally " "+ -> " ").

    Do not use this function alone, use `clean_and_tokenize()`.

    # Parameters
    * text: String entry.

    # Returns
    Text with the whitespaces collapsed.

    # Examples
    collapse_whitespace("  huh,  was.  that!!! ")
    $ 'huh, was. that!!!'
    """
    return re.sub(" +", " ", text).strip()
File:      c:\users\erin\documents\github\stintsy-order-of-erin\lib\janitor.py
Type:      function

To call all these cleaning functions in one step, we use the `clean` function, a wrapper that calls each of these components. Its definition is quite long; see [this appendix](#appendix:-clean-wrapper-function-definition) for the full source.
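As a rough illustration of how such a wrapper can be composed, the sketch below chains four simplified steps in a fixed order. The step names, their implementations, and the order (normalize, punctuation, numbers, whitespace) are assumptions made for this example; the actual `clean` function is the one defined in the appendix.

```python
import re
import string
import unicodedata


def ascii_lower(text: str) -> str:
    # Decompose accented letters, drop non-ASCII code points, then lowercase.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii").lower()


def drop_punctuation(text: str) -> str:
    # Remove every ASCII punctuation character.
    return re.sub(f"[{re.escape(string.punctuation)}]", "", text)


def drop_digits(text: str) -> str:
    # Remove every run of digits.
    return re.sub(r"\d+", "", text)


def squeeze_spaces(text: str) -> str:
    # Collapse runs of spaces and trim both ends.
    return re.sub(" +", " ", text).strip()


# Assumed order: whitespace is collapsed last so gaps left by earlier steps are closed.
STEPS = (ascii_lower, drop_punctuation, drop_digits, squeeze_spaces)


def clean_sketch(text: str) -> str:
    """Hypothetical wrapper that applies each step in turn."""
    for step in STEPS:
        text = step(text)
    return text


print(clean_sketch("Modi’s 2019 rally ⚡  was GREAT!!"))
```

Running the whitespace step last is a deliberate choice in this sketch: every earlier step deletes characters without touching the spaces around them.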

We can now clean the dataset and store the result in a new column named `clean_ours` (to differentiate it from the still-dirty `clean_text` column provided by the dataset author).


In [97]:
df["clean_ours"] = df["clean_text"].map(clean).astype("string")
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162969 entries, 0 to 162979
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   clean_text  162969 non-null  string
 1   category    162969 non-null  int64 
 2   clean_ours  162969 non-null  string
dtypes: int64(1), string(2)
memory usage: 5.0 MB


To confirm that the character cleaning worked, we compare `clean_text` and `clean_ours` for a sample of the rows matched by the filter below.


In [98]:
example_rows = df[
    df["clean_text"].str.contains(r"\s{2,}|\d|[^\w\s]|[\u263a-\U0001f645]|[ÉéÁáóÓúÚíÍ]")
]
example_rows.sample(10)

Unnamed: 0,clean_text,category,clean_ours
72379,modi announcement – where bunker banker via,0,modi announcement where bunker banker via
112121,modi sought vote saying that wont for indians but will chowkidar guard against corruption won bcoz congress party was accused corruption 2014 now rahul gandhi says modi corrupt coz gave rafale contract his friend anil ambani younger,-1,modi sought vote saying that wont for indians but will chowkidar guard against corruption won bcoz congress party was accused corruption now rahul gandhi says modi corrupt coz gave rafale contract his friend anil ambani younger
89417,watch the official trailer modi inspiring story modi awesome man check out,1,watch the official trailer modi inspiring story modi awesome man check out
58982,with the bosh modi spewed climate change about the environment that indians can never harm nature them the moon mama the sun dada the earth mata wants tell about low earth orbit satellites abms ’ worse than lie that’ malfeasance,-1,with the bosh modi spewed climate change about the environment that indians can never harm nature them the moon mama the sun dada the earth mata wants tell about low earth orbit satellites abms worse than lie that malfeasance
96332,the power modi again the manifesto language improved much,1,the power modi again the manifesto language improved much
53558,selfproclaimed journalist abhisar sharma who often comes with bizarre conspiracy theories and indulges false propaganda discredit narendra modi government seems involved controversy has been caught handing out ‘something’ one the villagers,-1,selfproclaimed journalist abhisar sharma who often comes with bizarre conspiracy theories and indulges false propaganda discredit narendra modi government seems involved controversy has been caught handing out something one the villagers
121816,lies about mgnrega target the modi government here are the facts – opindia news via the problem with dynast that never read people are fool,0,lies about mgnrega target the modi government here are the facts opindia news via the problem with dynast that never read people are fool
80327,congress led upa\nsurgical strike dont air strike dont asat missile dont modi sarkar\nsurgical strikego for air strikego for asat missilego for modi hai mumkin hai े ै ोी ी े ं ा ा।,0,congress led upa\nsurgical strike dont air strike dont asat missile dont modi sarkar\nsurgical strikego for air strikego for asat missilego for modi hai mumkin hai
44791,fantastic mission shakti superb achievement india joined league super power becoming 4th the world under the able leadership this was possible why such things happens modi raising magic,1,fantastic mission shakti superb achievement india joined league super power becoming th the world under the able leadership this was possible why such things happens modi raising magic
113607,fact reality modi pappu can’ even give figures correct and the guy promising moon,1,fact reality modi pappu can even give figures correct and the guy promising moon


We are now finished with basic text cleaning, but the data cleaning does not end here. Given that the text is sourced from Twitter, it includes characteristics, such as spam and informal expressions, which are not addressed by basic cleaning methods. As a result, we move on to further cleaning tailored to the nature of Twitter data.


### **Spam, Expressions, Onomatopoeia, etc.**

Since the domain of the corpus is Twitter, spam (e.g., `bbbb`), expressions (e.g., `bruhhhh`), and onomatopoeia (e.g., `hahahaha`) may become an issue at the vector representation step. Hence, we employ a simple rule-based spam removal algorithm.

We remove any word that contains the same character three or more times in a row, or that repeats the same substring two or more times in a row. These rules are expressed as regular expressions:

$$
\text{same\_char\_thrice} := (.)\backslash 1^{\{2,\}}
$$

and

$$
\text{same\_substring\_twice} := (.^+)\backslash 1^+
$$

Furthermore, we also remove any string that has a length less than three, since these are either stopwords (that weren't detected in the stopword removal stage) or more spam.
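The three rules so far (repeated character, repeated substring, minimum length) can be sketched as a per-word check. The helper names below are hypothetical. Anchoring the substring rule at the start of the word (`match` rather than `search`) is an assumption of this sketch, made so that ordinary words with a doubled letter such as `need` are not flagged; the project's own logic is the one given in the appendix and may differ.

```python
import re

# Patterns as written above, in Python regex syntax.
SAME_CHAR_THRICE = re.compile(r"(.)\1{2,}")
SAME_SUBSTRING_TWICE = re.compile(r"(.+)\1+")
MIN_LENGTH = 3


def looks_like_spam(word: str) -> bool:
    """Hypothetical per-word check combining the three rules."""
    # Rule 3: words shorter than three characters are dropped.
    if len(word) < MIN_LENGTH:
        return True
    # Rule 1: the same character three or more times in a row, anywhere in the word.
    if SAME_CHAR_THRICE.search(word):
        return True
    # Rule 2 (assumed anchoring): a repeated substring at the start of the word.
    return SAME_SUBSTRING_TWICE.match(word) is not None


def drop_spam_words(text: str) -> str:
    # Keep only the words that pass the check and rejoin them with single spaces.
    return " ".join(w for w in text.split() if not looks_like_spam(w))


print(drop_spam_words("dont need k need modi"))
print(drop_spam_words("ooo narendra modi right people"))
```

A side effect of the substring rule is that legitimate reduplicated words (for example `meme`) are removed as well, which is a trade-off of any purely rule-based filter.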

Finally, we apply an adaptive character-diversity threshold to the string $s$, removing it when the following inequality holds:

$$
\frac{\texttt{\#\_unique\_chars}(s)}{|s|} < 0.3 + \left(\frac{0.1 \cdot \text{min}(|s|, 10)}{10}\right)
$$

The left-hand side measures the diversity of characters in a string; if the string repeats the same characters a lot, we expect it to be unintelligible or useless, so we remove it.
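To make the inequality concrete, the following sketch evaluates both sides for a few strings. The helper names are hypothetical, and treating $s$ as a single token (rather than a whole tweet) and treating the empty string as removable are assumptions of this example.

```python
def diversity_threshold(length: int) -> float:
    # 0.3 + 0.1 * min(|s|, 10) / 10: grows from 0.31 (length 1) to 0.40 (length 10 or more).
    return 0.3 + 0.1 * min(length, 10) / 10


def is_low_diversity(s: str) -> bool:
    """Hypothetical helper: True when the unique-character ratio is below the threshold."""
    if not s:
        # Assumption: an empty string is treated as removable.
        return True
    ratio = len(set(s)) / len(s)
    return ratio < diversity_threshold(len(s))


for token in ("modi", "allllll", "birthdaaaaaay"):
    ratio = len(set(token)) / len(token)
    print(token, round(ratio, 3), round(diversity_threshold(len(token)), 3), is_low_diversity(token))
```

For `allllll`, the ratio is 2/7 (about 0.286) against a threshold of 0.37, so it is flagged. `birthdaaaaaay` has 8 unique characters out of 13 and passes this test, so it would have to be caught by the repeated-character rule instead.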

The definition of this wrapper function is quite long, see its definition in [this appendix](#appendix:-find_spam_and_empty-wrapper-function-definition).

Let's first look at a random sample of 10 entries from the dataset that will be modified by the function.


In [99]:
affected = df[df["clean_ours"].apply(spam_affected)]
affected_sample = affected["clean_ours"].sample(10)
affected_sample

86268                                                                                             the view nehruindirarajiv were the only architect india shastrimorarjicharanvpchandrashekharraoataldevegowdagujralatalmanmohanmodi has role all
127984                                                                                                       seems hussain haqqani working now for modi exactly what did for nawaz sharif when doctored benazirs pictures just before elections s
4068                                                                                                                                                                                    narendra modi biopic sparks meme fest indian social media
41535                                                                                                                  prime minister modi just tweeted that will address the nation and people are already checking out their rs note this khauf
138991                          

Let's now call this function on the `clean_ours` column of the dataset.


In [100]:
df["clean_ours"] = df["clean_ours"].map(find_spam_and_empty).astype("string")

To confirm that the function removed the spammy substrings, we compare the `before` and `after` columns.


In [101]:
comparison = pd.DataFrame({"before": affected_sample, "after": df["clean_ours"]})

changed = comparison[comparison["before"] != comparison["after"]]
changed.sample(10)

Unnamed: 0,before,after
5824,dont need k need modi,dont need need modi
86268,the view nehruindirarajiv were the only architect india shastrimorarjicharanvpchandrashekharraoataldevegowdagujralatalmanmohanmodi has role all,the view nehruindirarajiv were the only architect india has role all
138991,only channel tlking all purchased indian election ralley times modi spoke about pakistan must more sad state our politics no taking about growth creating job opportunities,only channel tlking all purchased indian election ralley times modi spoke about pakistan must more sad state our politics taking about growth creating job opportunities
48495,really feel sorry for modi dont have anything talk about his work done for the people just took the credit our brave and qualified scientists thanks chacha nehru who made the isro,really feel sorry for modi dont have anything talk about his work done for the people just took the credit our brave and qualified scientists thanks nehru who made the isro
41535,prime minister modi just tweeted that will address the nation and people are already checking out their rs note this khauf,prime minister modi just tweeted that will address the nation and people are already checking out their note this khauf
127984,seems hussain haqqani working now for modi exactly what did for nawaz sharif when doctored benazirs pictures just before elections s,seems hussain haqqani working now for modi exactly what did for nawaz sharif when doctored benazirs pictures just before elections
4068,narendra modi biopic sparks meme fest indian social media,narendra modi biopic sparks fest indian social media
114932,modi namechanger not gamechanger mnrega aadhar direct benefits transfer renamed jan dhan nirmal bharat abhiyan packaged swachh bharat with huge publicity budget fdi retail the liberalisation insurance and gst itself all upa schemes,modi namechanger not gamechanger mnrega direct benefits transfer renamed jan dhan nirmal bharat abhiyan packaged swachh bharat with huge publicity budget fdi retail the liberalisation insurance and gst itself all upa schemes
29528,our baby happiest birthday grateful have friend like you and tani wala kaman nagregret nga years modi world years nako ara life hahahaha thank you for everything love you much tani nagenjoy kaman gin give namon,our baby happiest birthday grateful have friend like you and tani wala kaman nagregret nga years modi world years nako ara life thank you for everything love you much tani nagenjoy kaman gin give namon
6684,ooo narendra modi right people,narendra modi right people


Let's examine whether applying this function has caused any significant changes to the DataFrame structure, given that it can convert entire cells to `NaN`.


In [102]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162969 entries, 0 to 162979
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   clean_text  162969 non-null  string
 1   category    162969 non-null  int64 
 2   clean_ours  162942 non-null  string
dtypes: int64(1), string(2)
memory usage: 5.0 MB


The DataFrame structure is intact, but `clean_ours` now has 27 fewer non-null values, reflecting cells that were entirely filtered out as spam as seen below.


In [103]:
missing_rows = df[df['clean_ours'].isna()]
missing_rows[['clean_text', 'clean_ours']]

Unnamed: 0,clean_text,clean_ours
21806,bjpmpsubramanianswamyiamchowkidarcampaignpmmodi,
21855,terrorfundinghurriyatleaderspropertyseizedhafizsaeedmodigovt,
24148,pmnarendramodirequestsofexservicemanindianarmyhavildarombirsinghsharma9258,
35636,2019,
35866,‍,
35968,whattttttt,
37837,allllll,
40587,1145am,
40977,⌚1145 ❤,
48127,birthdaaaaaay,


## **Post-Cleaning Steps**

At some point during the cleaning stage, some entries of the dataset could have been reduced to `NaN` or the empty string `""`, or we could have introduced duplicates again. So, let's call `dropna` and `drop_duplicates` again to finalize the cleaning stage.


In [104]:
df = df.dropna()
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162942 entries, 0 to 162979
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   clean_text  162942 non-null  string
 1   category    162942 non-null  int64 
 2   clean_ours  162942 non-null  string
dtypes: int64(1), string(2)
memory usage: 5.0 MB


In [105]:
df = df.drop_duplicates()
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162942 entries, 0 to 162979
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   clean_text  162942 non-null  string
 1   category    162942 non-null  int64 
 2   clean_ours  162942 non-null  string
dtypes: int64(1), string(2)
memory usage: 5.0 MB


## **3. Preprocessing**

> 🏗️ Perhaps swap S3 and S4. Refer to literature on what comes first.

This section discusses the preprocessing steps for the cleaned data. Because the goal is to analyze the textual sentiment of tweets, the following steps prepare the text so that the Bag of Words model receives only the information it needs to build a vector representation of each tweet.

Before and after each preprocessing step, we will show 5 random entries in the dataset to show the effects of each preprocessing task.

## **Lemmatization**

We follow a methodology similar to the one presented in <u>(George & Murugesan, 2024)</u> and preprocess the dataset entries via lemmatization. We use NLTK's `WordNetLemmatizer` for this task <u>(Bird & Loper, 2004)</u>. For the lemmatization step, we use WordNet for English lemmatization and Open Multilingual WordNet version 1.4 for translations and multilingual support, which is important in our case since some tweets contain text from Indian languages.


In [106]:
df["lemmatized"] = df["clean_ours"].map(lemmatizer)
df.sample(10)

Unnamed: 0,clean_text,category,clean_ours,lemmatized
158639,modi doesn’ drink doesn’ have guests and journalists the plane haven’ restocked the bar even once the last years this discussion over year old,1,modi doesn drink doesn have guests and journalists the plane haven restocked the bar even once the last years this discussion over year old,modi doesn drink doesn have guest and journalist the plane haven restocked the bar even once the last year this discussion over year old
110054,when puppet like bilawal zardari makes indian media and modi happy was supporting pakistan aur helping modi secure win indian electionsit unfortunately that his father and aunt has robbed this nation and brought people sindh their knees,1,when puppet like bilawal zardari makes indian media and modi happy was supporting pakistan aur helping modi secure win indian electionsit unfortunately that his father and aunt has robbed this nation and brought people sindh their knees,when puppet like bilawal zardari make indian medium and modi happy wa supporting pakistan aur helping modi secure win indian electionsit unfortunately that his father and aunt ha robbed this nation and brought people sindh their knee
148707,modi chose varanasi make impact the hindi heartland vadodara already bjps safe seat\nits reverse rahuls case hes scrambling for safe seat amethi more,1,modi chose varanasi make impact the hindi heartland vadodara already bjps safe seat its reverse rahuls case hes scrambling for safe seat amethi more,modi chose varanasi make impact the hindi heartland vadodara already bjps safe seat it reverse rahuls case he scrambling for safe seat amethi more
98812,was unable walk and can now walk hail jesus mean modi,-1,was unable walk and can now walk hail jesus mean modi,wa unable walk and can now walk hail jesus mean modi
56087,vote for modi not for candidate,0,vote for modi not for candidate,vote for modi not for candidate
56546,not for modi,0,not for modi,not for modi
278,apke yar modi message our people celebrate pakistan day believe time begin comprehensive dialogue with india address resolve all issues esp the central issue kashmir forge new relationship based peace prosperity for all our people,1,apke yar modi message our people celebrate pakistan day believe time begin comprehensive dialogue with india address resolve all issues esp the central issue kashmir forge new relationship based peace prosperity for all our people,apke yar modi message our people celebrate pakistan day believe time begin comprehensive dialogue with india address resolve all issue esp the central issue kashmir forge new relationship based peace prosperity for all our people
94792,for your info,0,for your info,for your info
155476,prime minister narendra modi interacts with people main bhi chowkidar program,1,prime minister narendra modi interacts with people main bhi chowkidar program,prime minister narendra modi interacts with people main bhi chowkidar program
55683,sir you are first politician who congrats isro nahi jyadatar log modi congrats kar rahe,1,sir you are first politician who congrats isro nahi jyadatar log modi congrats kar rahe,sir you are first politician who congrats isro nahi jyadatar log modi congrats kar rahe


## **Stop Word Removal**

After lemmatization, we can remove the stop words present in the dataset. Stop word removal _needs_ to come after lemmatization, since it requires all words to be reduced to their base dictionary form and the `stopwords_set` only contains the base dictionary forms of the stopwords.

**stopwords.** For stop word removal, we refer to the English stopword lists defined in NLTK and Wolfram Mathematica <u>(Bird & Loper, 2004; Wolfram Research, 2015)</u>. However, since the task is sentiment analysis, words that convey polarity, intensification, and negation are important, and words like "not" and "okay" are commonly included in such lists. Therefore, the stopwords from NLTK and Mathematica are manually adjusted to include only neutral stopwords, such as "after", "when", and "you".
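
As a rough illustration of the idea (not the notebook's actual `rem_stopwords`, whose definition is not shown here), the sketch below filters a lemmatized tweet against a small stopword set that deliberately keeps negations. The function name `remove_neutral_stopwords` and the toy set `NEUTRAL_STOPWORDS` are assumptions made for this example. Returning `None` for an entry that ends up empty is also an assumption, chosen so that a later `dropna` can discard it.

```python
from typing import Optional

# Toy stopword set (an assumption for this sketch): only neutral words are
# listed, so polarity and negation words such as "not" are never removed.
NEUTRAL_STOPWORDS = {"after", "when", "you", "the", "for"}


def remove_neutral_stopwords(text: str, stopwords: set[str]) -> Optional[str]:
    """Drop every token found in `stopwords`; return None if nothing is left."""
    kept = [token for token in text.split() if token not in stopwords]
    return " ".join(kept) if kept else None


# "for" is dropped, while the negation "not" survives.
print(remove_neutral_stopwords("vote for modi not for candidate", NEUTRAL_STOPWORDS))
# An entry made up entirely of stopwords becomes None.
print(remove_neutral_stopwords("after you", NEUTRAL_STOPWORDS))
```

Keeping "not" in the text matters for the later bigram step, since a pair such as "not good" can only be formed if the negation is still present.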


In [107]:
df["lemmatized"] = df["lemmatized"].map(lambda t: rem_stopwords(t, stopwords_set))
df = df.dropna(subset=["lemmatized"])
df.sample(10)

Unnamed: 0,clean_text,category,clean_ours,lemmatized
35810,hey looser are with india not with fake chowkidars like modi,-1,hey looser are with india not with fake chowkidars like modi,hey looser india fake chowkidars like modi
55111,this modi will destroy congress and its ecosystem,-1,this modi will destroy congress and its ecosystem,modi destroy congress ecosystem
39501,ram ram mitrvar modiamit say they will rule india for next years are democracy otherwise,0,ram ram mitrvar modiamit say they will rule india for next years are democracy otherwise,ram ram mitrvar modiamit rule india year democracy
122040,this main difference between raga and modimodi never changed anything abruptly anything during his tenure but made them effective removing weaknesses raga everyday come with new idea implemented,1,this main difference between raga and never changed anything abruptly anything during his tenure but made them effective removing weaknesses raga everyday come with new idea implemented,main difference raga never changed abruptly tenure effective removing weakness raga everyday idea implemented
88615,now even this also left,0,now even this also left,even left
107857,nothing will work anymore modi only works,0,nothing will work anymore modi only works,work modi only work
88254,modi himself the grip intoxication power,0,modi himself the grip intoxication power,modi grip intoxication power
20983,next after modi,0,next after modi,modi
93377,you mean the plan which already implemented modi,-1,you mean the plan which already implemented modi,plan already implemented modi
143553,let tell you what happened there kannaya why unemployment growing tejasvi modi took amazing revenge pulvama attack kannaya didnt expected that youll indirectly refuses the actual question tejasvi actually had this urgent campaign bye public,1,let tell you what happened there kannaya why unemployment growing tejasvi modi took amazing revenge pulvama attack kannaya didnt expected that youll indirectly refuses the actual question tejasvi actually had this urgent campaign bye public,happened kannaya unemployment growing tejasvi modi amazing revenge pulvama attack kannaya expected indirectly refuse actual question tejasvi actually urgent campaign bye public


## **Looking at the DataFrame**

After preprocessing, the dataset now contains:


In [108]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162942 entries, 0 to 162979
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   clean_text  162942 non-null  string
 1   category    162942 non-null  int64 
 2   clean_ours  162942 non-null  string
 3   lemmatized  162942 non-null  object
dtypes: int64(1), object(1), string(2)
memory usage: 6.2+ MB


Here are 5 randomly picked entries in the dataframe with all columns shown for comparison.


In [109]:
display(df.sample(5))

Unnamed: 0,clean_text,category,clean_ours,lemmatized
121831,voters must ask questions modi which modi asked the government 201314,0,voters must ask questions modi which modi asked the government,voter ask question modi modi asked government
94248,joke why did upa hold back asat you blame the guy who acted next rafale cong had 1st chance again but twas modithe doer again what makes after the doer and not the one who was doodling surely you love india maybe looking for job with cong,1,joke why did upa hold back asat you blame the guy who acted next rafale cong had chance again but twas modithe doer again what makes after the doer and not the one who was doodling surely you love india maybe looking for job with cong,joke upa hold back asat blame guy acted rafale cong chance twas modithe doer doer doodling surely love india maybe job cong
159655,first give press conferences first time history india cbi cbirbi govt govt fights happened because modi wanted control all democratic institutions first time history india4 supreme court judges gave press conference say,1,first give press conferences first time history india cbi cbirbi govt govt fights happened because modi wanted control all democratic institutions first time history india supreme court judges gave press conference say,press conference time history india cbi cbirbi govt govt fight happened modi wanted control all democratic institution time history india supreme court judge press conference
124987,pakistan would love see modi power again hindu nationalism and hindu extremism what pakistan want more modi regimes would enough fuel independence movements all across india,1,pakistan would love see modi power again hindu nationalism and hindu extremism what pakistan want more modi regimes would enough fuel independence movements all across india,pakistan love modi power hindu nationalism hindu extremism pakistan more modi regime enough fuel independence movement all india
133503,congress always cheated people but your chowkidar will fight against infiltration terrorism and corruption modi tells bjp rally,0,congress always cheated people but your chowkidar will fight against infiltration terrorism and corruption modi tells bjp rally,congress always cheated people chowkidar fight infiltration terrorism corruption modi bjp rally


## **Tokenization**

Since the data cleaning and preprocessing stage is comprehensive, the tokenization step in the BoW model reduces to a simple word-boundary split operation. Each preprocessed entry in the DataFrame is split by spaces. For example, the entry `"shri narendra modis"` (entry: 42052) becomes `["shri", "narendra", "modis"]`. By the end of tokenization, all entries are transformed into arrays of strings.
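
The split described above can be sketched in a couple of lines. The helper name `tokenize` is an assumption for this example. It uses `str.split()` with no argument, which splits on any run of whitespace. That should be equivalent to splitting on single spaces here because whitespace was already collapsed during cleaning.

```python
def tokenize(entry: str) -> list[str]:
    """Split a preprocessed entry into tokens at whitespace boundaries."""
    return entry.split()


tokens = tokenize("shri narendra modis")
print(tokens)  # ['shri', 'narendra', 'modis']
```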

## **Word Bigrams**

As noted earlier, modifiers and polarity words are not included in the stopword set. The BoW model constructs a vocabulary containing both unigrams and bigrams. Including bigrams allows the model to capture common word patterns, such as

$$
\left\langle \texttt{Adj}\right\rangle \left\langle \texttt{M} \mid \texttt{Pron} \right\rangle
$$

<center>or</center>

$$
\left\langle \texttt{Adv}\right\rangle \left\langle \texttt{V} \mid \texttt{Adj} \mid \texttt{Adv} \right\rangle
$$
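
To make the unigram-plus-bigram vocabulary concrete, here is a minimal pure-Python sketch. The helper `unigrams_and_bigrams` is a hypothetical name used only for illustration. In the notebook, this step is handled by the BoW model itself.

```python
def unigrams_and_bigrams(tokens: list[str]) -> list[str]:
    """Return the tokens followed by every pair of adjacent tokens."""
    # zip(tokens, tokens[1:]) pairs each token with its right-hand neighbour.
    bigrams = [f"{left} {right}" for left, right in zip(tokens, tokens[1:])]
    return tokens + bigrams


features = unigrams_and_bigrams(["modi", "not", "good", "leader"])
print(features)
# ['modi', 'not', 'good', 'leader', 'modi not', 'not good', 'good leader']
```

The bigram "not good" keeps the negation attached to the word it modifies, which a unigram-only vocabulary would lose.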

## **Vector Representation**

After lemmatization and stop word removal, each entry can now be represented as a vector using a Bag of Words (BoW) model. We employ scikit-learn's `CountVectorizer`, which provides a ready-to-use implementation of BoW <u>(Pedregosa et al., 2011)</u>.

A comparison with other traditional vector representations is discussed in [this appendix](#appendix:-comparison-of-traditional-vectorization-techniques).
Words with modifiers have the modifiers directly attached, enabling subsequent models to capture the concept of modification fully. Consequently, after tokenization and bigram construction, the vocabulary size can grow up to $O(n^2)$, where $n$ is the number of unique tokens.

**minimum document frequency constraint:** Despite cleaning and spam removal, some tokens remain irrelevant or too rare. To address this, a minimum document frequency constraint is applied: $\texttt{min\_df} = 10$, meaning a token must appear in at least 10 documents to be included in the BoW vocabulary. This reduces noise and ensures the model focuses on meaningful terms.
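
The effect of a minimum document frequency can be sketched without any library. The helpers `document_frequencies` and `filter_vocabulary` are hypothetical names, and the toy corpus uses a threshold of 2 instead of 10 so that the example stays small.

```python
from collections import Counter


def document_frequencies(docs: list[list[str]]) -> Counter:
    """Count, for each token, the number of documents that contain it."""
    frequencies: Counter = Counter()
    for tokens in docs:
        # set() ensures a token is counted once per document, not per occurrence.
        frequencies.update(set(tokens))
    return frequencies


def filter_vocabulary(docs: list[list[str]], min_df: int) -> list[str]:
    """Keep only tokens that appear in at least `min_df` documents."""
    frequencies = document_frequencies(docs)
    return sorted(token for token, count in frequencies.items() if count >= min_df)


docs = [
    ["modi", "vote", "modi"],
    ["modi", "india"],
    ["india", "vote", "sirmodi"],
]
vocabulary = filter_vocabulary(docs, min_df=2)
print(vocabulary)  # ['india', 'modi', 'vote']
```

Note that "modi" occurs twice in the first document but still counts as one document, and the rare token "sirmodi" is dropped.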

---

These parameters of the BoW model are encapsulated in the `BagOfWordsModel` class. The class definition is available in [this appendix](#appendix:-BagOfWordsModel-class-definition).


In [110]:
bow = BagOfWordsModel(df["lemmatized"], 10)

# some sanity checks
assert (
    bow.matrix.shape[0] == df.shape[0]
), "number of rows in the matrix does not match the number of documents"
# this only checks that `bow.sparsity` is nonzero (truthy)
assert bow.sparsity, "the sparsity is TOO HIGH, something went wrong"



The warning emitted by this cell is expected. Recall that our tokenization step reduces to a simple split on spaces, so the `BagOfWordsModel` is given a custom `tokenizer` function instead of relying on the default tokenization pattern, and that is what triggers the warning.


### **Model Metrics**

To get a sense of the model, we now look at its shape and sparsity. The shape gives the number of documents and the number of tokens in the model, while the sparsity is the fraction of matrix entries that are zero, which indicates how varied the words in the dataset are.


The resulting matrix has a shape of


In [111]:
bow.matrix.shape

(162942, 30386)

The first entry of the pair is the number of documents (the ones that remain after all the data cleaning and preprocessing steps) and the second entry is the number of tokens (the unique unigrams and bigrams in the vocabulary).

The resulting model has a sparsity of


In [112]:
1 - bow.sparsity

0.9995039539872171

The model is about 99.95% sparse: each tweet contains only a small fraction of the vocabulary, and tweets often do not share the same words, which leads to a large vocabulary.
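
To relate this figure to individual tweets, the sketch below first computes sparsity on a toy matrix and then plugs in the shape and sparsity reported above. The helper `sparsity` is a hypothetical name and works on a plain list of lists, not on the notebook's sparse matrix.

```python
def sparsity(matrix: list[list[int]]) -> float:
    """Fraction of entries in a dense matrix that are zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total


# Toy document-term matrix: 2 documents, 4 terms, 2 nonzero counts.
toy_sparsity = sparsity([[1, 0, 0, 0], [0, 0, 2, 0]])
print(toy_sparsity)  # 0.75

# Values reported by the notebook cells above.
n_docs, n_features = 162_942, 30_386
reported_sparsity = 0.9995039539872171

# The density times the vocabulary size gives the average number of
# distinct vocabulary terms (unigrams or bigrams) present in one tweet.
avg_terms_per_tweet = (1 - reported_sparsity) * n_features
print(round(avg_terms_per_tweet, 1))
```

Under these numbers, a tweet contains roughly 15 of the 30,386 vocabulary terms on average.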


Next, we look at the most frequent and least frequent terms in the model.


In [113]:
doc_frequencies = np.asarray((bow.matrix > 0).sum(axis=0)).flatten()
freq_order = np.argsort(doc_frequencies)[::-1]
bow.feature_names[freq_order[:50]]

array(['modi', 'india', 'ha', 'all', 'people', 'bjp', 'like', 'congress',
       'narendra', 'only', 'election', 'narendra modi', 'vote', 'govt',
       'about', 'indian', 'year', 'time', 'country', 'just', 'modis',
       'more', 'nation', 'rahul', 'even', 'government', 'party', 'power',
       'gandhi', 'minister', 'leader', 'good', 'modi govt', 'need',
       'modi ha', 'space', 'work', 'prime', 'money', 'credit', 'sir',
       'pakistan', 'back', 'day', 'today', 'prime minister', 'scientist',
       'never', 'support', 'win'], dtype=object)

We see that the main talking point of the tweets is Indian politics, with keywords like "modi", "india", and "bjp". For additional context, "bjp" refers to the _Bharatiya Janata Party_, a conservative political party in India and one of the two major Indian political parties.


Next, we look at the least frequent terms.


In [114]:
bow.feature_names[freq_order[-50:]]

array(['healthy democracy', 'ha mass', 'ha separate', 'ha shifted',
       'hat drdo', 'about defeat', 'yet ha', 'yes more', 'yes narendra',
       'hatred people', 'ha requested', 'hate more', 'hate much',
       'hatemonger', 'hater gonna', 'heal', 'hazaribagh', 'head drdo',
       'sleep night', 'abinandan', 'able provide', 'able speak',
       'able vote', 'youth need', 'youth power', 'hai isliye', 'hai chor',
       'handy', 'hand narendra', 'hand people', 'hae', 'ha withdrawn',
       'happens credit', 'happier', 'bhaiyo', 'socha', 'social political',
       'social security', 'biased journalist', 'big congratulation',
       'sirmodi', 'bhutan', 'bhi berozgar', 'bhi mumkin', 'skta',
       'bhatt aditi', 'bhi aur', 'slamming', 'smart modi', 'slogan blame'],
      dtype=object)

The themes found in the most frequent terms are still present in this subset, although more filler or non-distinct words appear, like "handy", "happier", and "heal".

But the presence of terms like "healthy democracy", "able vote", and "youth power" still points to this subset being relevant to the main theme of the dataset.


# **4. Exploratory Data Analysis**

This section discusses the exploratory data analysis conducted on the dataset after cleaning.

> Notes from Zhean: <br>
> From manual checking via OpenRefine, there are a total of 162972 entries. `df.info()` should have the same result post-processing.
> Furthermore, there should be two columns. The first, `clean_text` (which is a bit of a misnomer since it is still dirty), contains the Tweets (text data). The second, `category`, contains the sentiment of the Tweet and is a tribool (1 positive, 0 neutral or indeterminate, and -1 for negative).


# **references**

Bird, S., & Loper, E. (2004, July). NLTK: The natural language toolkit. _Proceedings of the ACL Interactive Poster and Demonstration Sessions_, 214–217. https://aclanthology.org/P04-3031/

El-Demerdash, A. A., Hussein, S. E., & Zaki, J. F. W. (2021). Course evaluation based on deep learning and SSA hyperparameters optimization. _Computers, Materials & Continua, 71_(1), 941–959. https://doi.org/10.32604/cmc.2022.021839

George, M., & Murugesan, R. (2024). Improving sentiment analysis of financial news headlines using hybrid Word2Vec-TFIDF feature extraction technique. _Procedia Computer Science, 244_, 1–8.

Hussein, S. (2021). _Twitter sentiments dataset_. Mendeley.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. _Journal of Machine Learning Research, 12_, 2825‚Äì2830.

Rani, D., Kumar, R., & Chauhan, N. (2022, October). Study and comparison of vectorization techniques used in text classification. In _2022 13th International Conference on Computing Communication and Networking Technologies (ICCCNT)_ (pp. 1–6). IEEE.

Wolfram Research. (2015). _DeleteStopwords_. https://reference.wolfram.com/language/ref/DeleteStopwords.html


# **appendix: `clean` wrapper function definition**

Below is the definition of the `clean` wrapper function that encapsulates all internal functions used in the cleaning pipeline.


In [115]:
clean??

Signature: clean(text: str) -> str
Source:
def clean(text: str) -> str:
    """
    This is the main function for data cleaning (i.e., it calls all the cleaning functions in the prescribed order).

    This function should be used as a first-class function in a map.

    # Parameters
    * text: The string entry from a DataFrame column.
    * stopwords: stopword dictionary.

    # Returns
    Clean string
    """
    # cleaning on the base string
    text = normalize(text)
    text = rem_punctuation(text)
    text = rem_numbers(text)
    text = collapse_whitespace(text)

    return text
File:      c:\users\erin\documents\github\stintsy-order-of-erin\lib\janitor.py
Type:      function

# **appendix: `find_spam_and_empty` wrapper function definition**

Below is the definition of the `find_spam_and_empty` wrapper function that encapsulates all internal functions for the spam detection algorithm.


In [116]:
find_spam_and_empty??

Signature: find_spam_and_empty(text: str, min_length: int = 3) -> str | None
Source:
def find_spam_and_empty(text: str, min_length: int = 3) -> str | None:
    """
    Filter out empty text and unintelligible/spammy unintelligible substrings in the text.

    Spammy substrings:
    - Shorter than min_length
    - Containing non-alphabetic characters
    - Consisting of a repeated substring (e.g., 'aaaaaa', 'ababab', 'abcabcabc')

    # Parameters
    * text: input string.
    * min_length: minimum length of word to keep.

    # Returns
        Cleaned string, or None if empty after filtering.
    """
    cleaned_tokens = []
    for t in text.split():
        if len(t) < min_length:

# **appendix: comparison of traditional vectorization techniques**

Traditional vectorization techniques include BoW and Term Frequency-Inverse Document Frequency (TF-IDF). TF-IDF weights each word based on its frequency in a document and its rarity across the corpus, reducing the impact of common words. BoW, in contrast, simply counts word occurrences without considering corpus-level frequency. In this project, BoW was chosen because stopwords were already removed during preprocessing, and the dataset is domain-specific <u>(Rani et al., 2022)</u>. In such datasets, frequent words are often meaningful domain keywords, so scaling them down (as TF-IDF would) could reduce the importance of these key terms in the feature representation.
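
The difference between the two weightings can be seen on a toy corpus. The sketch below uses the unsmoothed textbook definition $\text{tf} \cdot \ln(N / \text{df})$, which is an assumption made for simplicity; scikit-learn's `TfidfVectorizer` applies a smoothed variant by default. The helper names and the corpus are hypothetical.

```python
import math
from collections import Counter

# Toy tokenized corpus in which "modi" is a domain keyword found in every document.
docs = [["modi", "good", "leader"], ["modi", "bad"], ["modi", "vote", "vote"]]

n_docs = len(docs)
# Number of documents containing each token.
doc_freq = Counter(token for tokens in docs for token in set(tokens))


def bow_counts(tokens: list[str]) -> dict[str, int]:
    """BoW: raw occurrence counts within one document."""
    return dict(Counter(tokens))


def tfidf_weights(tokens: list[str]) -> dict[str, float]:
    """TF-IDF: counts scaled by ln(N / df), so corpus-wide terms shrink."""
    counts = Counter(tokens)
    return {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}


print(bow_counts(docs[2]))     # {'modi': 1, 'vote': 2}
print(tfidf_weights(docs[2]))  # 'modi' is weighted 0.0 since it is in every document
```

Here BoW keeps the domain keyword "modi" as a feature with a nonzero count, while this TF-IDF variant removes its weight entirely, which illustrates the concern described above.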


# **appendix: `BagOfWordsModel` class definition**

Below is the definition of the `BagOfWordsModel` class that encapsulates the desired parameters.


In [117]:
BagOfWordsModel??

Init signature: BagOfWordsModel(texts: Iterable[str], min_freq: int | float | None = None)
Source:
class BagOfWordsModel:
    """
    A Bag-of-Words representation for a text corpus.

    # Attributes
    * matrix (scipy.sparse.csr_matrix): The document-term matrix of word counts.
    * feature_names (list[str]): List of feature names corresponding to the matrix columns.
    *
    # Usage
    ```
    bow = BagOfWordsModel(df["lemmatized_str"])
    ```
    """

    def __init__(self, texts: Iterable[str], min_freq: int | float | None = None):
        """
        Initialize the BagOfWordsModel by fitting the vectorizer to the text corpus. This also filters out tokens
        that do not appear more than 