# Sentiment Analysis of Twitter Posts

<!-- Notebook name goes here -->
<center><b>Notebook: Data Description, Cleaning, Exploratory Data Analysis, and Preprocessing</b></center>
<br>

**by**: Stephen Borja, Justin Ching, Erin Chua, and Zhean Ganituen.

**dataset**: Hussein, S. (2021). Twitter Sentiments Dataset [Dataset]. Mendeley. https://doi.org/10.17632/Z9ZW7NT5H2.1

**motivation**: Every minute, social media users generate a large influx of textual data on live events. Performing sentiment analysis on this data provides a real-time view of public perception, enabling quick insights into the general population's opinions and reactions.

**goal**: By the end of the project, our goal is to create and compare supervised learning algorithms for sentiment analysis.

### **dataset description**

The Twitter Sentiments Dataset contains nearly 163k tweets from Twitter. The period over which they were collected is unknown, but the dataset was published to Mendeley Data on May 14, 2021 by Sherif Hussein of Mansoura University.

Tweets were extracted using the Twitter API, but the specifics of how the tweets were selected are unmentioned. The tweets are mostly English with a mix of some Hindi words for code-switching <u>(El-Demerdash., 2021)</u>. All of them seem to be talking about the political state of India. Most tweets mention Narendra Modi, the current Prime Minister of India.

Each tweet was assigned a label using TextBlob's sentiment analysis <u>(El-Demerdash, Hussein, & Zaki, 2021)</u>, which assigns labels automatically.

Twitter_Data

- **`clean_text`**: The tweet's text
- **`category`**: The tweet's sentiment category

What each row and column represents: `each row represents one tweet.` <br>
Number of observations: `162,980`

---

<a name="cite_note-1"></a>1. [^](#cite_ref-1) Code-switching is the practice of alternating between two languages $L_1$ (the native language) and $L_2$ (the source language) in a conversation. In this context, the code-switching is done to appear more casual since the conversation is done via Twitter (now, X).


## **1. Project Set-up**

We set up the global imports for the project here (ensure that these packages are installed via uv and are part of the environment) and load the dataset.


In [453]:
import pandas as pd
import numpy as np
import os
import sys

# Use lib directory
sys.path.append(os.path.abspath("../lib"))

# Imports from lib files
from janitor import *
from lemmatize import lemmatizer
from boilerplate import stopwords_set
from bag_of_words import BagOfWordsModel

# Pandas configuration
pd.set_option("display.max_colwidth", None)

# Load raw data file
df = pd.read_csv("../data/Twitter_Data.csv")

## **2. Data Cleaning**

This section discusses the methodology for data cleaning.

So as not to waste computation, a preliminary step is to ensure that no `NaN` or duplicate entries exist before the main cleaning steps. Every time we call a drop function (`dropna()` or `drop_duplicates()`), we show the result of `info()` to see how many entries were filtered out.

Let's first drop the `NaN` entries.


In [454]:
df = df.dropna()
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162969 entries, 0 to 162979
Data columns (total 2 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   clean_text  162969 non-null  object 
 1   category    162969 non-null  float64
dtypes: float64(1), object(1)
memory usage: 3.7+ MB


Now, remove the duplicates.


In [455]:
df = df.drop_duplicates()
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162969 entries, 0 to 162979
Data columns (total 2 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   clean_text  162969 non-null  object 
 1   category    162969 non-null  float64
dtypes: float64(1), object(1)
memory usage: 3.7+ MB


We also check that all values in the `category` column belong to the set {-1, 0, 1}, whose members represent the three sentiments: negative, neutral, and positive, respectively.


In [456]:
df["category"].unique()

array([-1.,  0.,  1.])

We then filter out any rows whose `category` falls outside this set to keep the data consistent.


In [457]:
df = df[df["category"].isin([-1, 0, 1])]
df["category"].sample(10)

3659      1.0
117064    1.0
136016    0.0
72881     1.0
132625    1.0
110168    0.0
50209     1.0
46837     1.0
55919     1.0
41572     0.0
Name: category, dtype: float64

When pandas reads a CSV file into a DataFrame, it infers a numeric column as `float64` when the column contains decimals or `NaN` values. Text values are loaded as `object`, the generic dtype that pandas uses for strings. We can check the dtypes of our DataFrame columns through `.info()`.


In [458]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162969 entries, 0 to 162979
Data columns (total 2 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   clean_text  162969 non-null  object 
 1   category    162969 non-null  float64
dtypes: float64(1), object(1)
memory usage: 3.7+ MB
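The dtype behaviour described above can be reproduced on a tiny in-memory CSV. This is a minimal sketch with made-up rows, not the project's data: a single missing label is enough for pandas to read the whole `category` column as `float64`, and the column can be cast to an integer dtype only after the missing row is dropped.

```python
import io

import pandas as pd

# Made-up three-row CSV; the second row has no category value.
csv_text = "clean_text,category\nfirst tweet,1\nsecond tweet,\nthird tweet,-1\n"

raw = pd.read_csv(io.StringIO(csv_text))
print(raw["category"].dtype)  # float64, because NaN cannot be stored in int64

# Drop the incomplete row, then cast the labels to integers.
cleaned = raw.dropna()
cleaned = cleaned.assign(category=cleaned["category"].astype("int64"))
print(cleaned["category"].tolist())
```

Casting before `dropna()` would raise an error, which is why the notebook converts the dtype only after the missing rows are gone.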


First, we convert the `category` column from `float64` to `int64`, which is safe now that the `NaN` rows and any values outside {-1, 0, 1} have been removed.


In [459]:
df["category"] = df["category"].astype(int)
df["category"].info()

<class 'pandas.core.series.Series'>
Index: 162969 entries, 0 to 162979
Series name: category
Non-Null Count   Dtype
--------------   -----
162969 non-null  int64
dtypes: int64(1)
memory usage: 2.5 MB


In [460]:
df["category"].sample(10)

22847     0
17788     0
7755     -1
60042     1
87280     1
58364    -1
47166     0
60971     1
11434     0
114741   -1
Name: category, dtype: int64

Next, we convert the `clean_text` column from the `object` dtype to the pandas-defined `string` dtype for consistency and better performance.


In [461]:
df["clean_text"] = df["clean_text"].astype("string")
df["clean_text"].info()

<class 'pandas.core.series.Series'>
Index: 162969 entries, 0 to 162979
Series name: clean_text
Non-Null Count   Dtype 
--------------   ----- 
162969 non-null  string
dtypes: string(1)
memory usage: 2.5 MB


In [462]:
type(df.loc[0, "clean_text"])

str

## **Main Cleaning Pipeline**

We follow a data cleaning methodology similar to the one presented in (George & Murugesan, 2024).


### **Normalization**

Because the text consists of tweets, emojis and accented characters are prevalent, as seen in the samples below. Although these serve as a form of emotional expression in a real-world context, they are not relevant to _textual_ sentiment analysis, so we normalize the text.


In [463]:
# Finding a sample of rows with accented characters
accented_char_rows = df[df["clean_text"].str.contains(r"É|é|Á|á|ó|Ó|ú|Ú|í|Í")]
accented_char_rows["clean_text"].sample(5)

114886                    declares modi breaks thread communal harmony karnatakaso rahul will contest election from karnataka contrary earlier decisión kerala addition amethi 
74639     stop write dont know about 2012 congress party has capability test this fire but today have capability leader like narendra modi who can take any difficult decisión 
50461                              vía not against any particular nation demonstration our own technology former drdo chief saraswat tells cnnnews18s follow live updates here 
59831                                                                                                      india shoots down satellite test modi hails arrival space power vía 
23047                                                                     unlikely titfortat istan darpok nikammé babus chorriforri crook donnie bullyfears strength look jago 
Name: clean_text, dtype: string

In [464]:
# Finding a sample of rows with emojis
rows_with_emojis = df[df["clean_text"].str.contains(r"[\u263a-\U0001f645]", regex=True)]
rows_with_emojis["clean_text"].sample(5)

21900     look this modi landmark example good governance ✌️ global political leaders and business houses use modi’ reforms bench mark best practices िजयसं्पसभा 
61261                                               life congressi\nends with nehru ￣￣￣￣￣￣￣￣￣￣￣ dont blame nehru modiji did nehru did not modi\n＿＿＿＿＿＿＿＿＿＿＿ \ •• 
125459                                                                                                                                   ✨ “chaukidar” ✨\n✨ join 
162951                                                                                                                           now confirmed modi supporter ☺☺☺
46706                              congrats isro and drdo for their great work and years research but what exactly did modi this why are people praising him ‍♂️ 
Name: clean_text, dtype: string

The first function, `normalize`, converts the input text to lowercase, ASCII-only characters (say, "cómo estás" becomes "como estas"). The dataset contains Unicode characters with no ASCII equivalent (e.g., emojis), which the function replaces with the empty string (`''`); accented characters are reduced to their base letters.
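As a quick illustration of the idea, the sketch below applies NFKD decomposition followed by an ASCII round-trip, the two calls visible in the `normalize` source. The name `normalize_sketch` and the final `.lower()` call are assumptions made for this example; the project's own function in `lib/janitor.py` remains the reference.

```python
import unicodedata


def normalize_sketch(text: str) -> str:
    """Hypothetical stand-in for `normalize`; not the project's actual code."""
    # NFKD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding to ASCII with errors="ignore" drops the combining marks, emojis,
    # and any other character that has no ASCII equivalent.
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Lowercasing is assumed from the description of the function.
    return ascii_only.lower()


print(normalize_sketch("Cómo estás"))
print(normalize_sketch("decisión vía ☺"))
```

Note that removing a character leaves the surrounding spaces in place, so a later whitespace step is still needed.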


In [465]:
normalize??

Signature: normalize(text: str) -> str
Source:   
def normalize(text: str) -> str:
    """
    Normalize text from a pandas entry to ASCII-only lowercase characters. Hence, this removes Unicode characters with no ASCII
    equivalent (e.g., emojis and CJKs).

    Do not use this function alone, use `clean_and_tokenize()`.

    # Parameters
    * text: String entry.

    # Returns
    ASCII-normalized text containing only lowercase letters.

    # Examples
    normalize("¿Cómo estás?")
    $ 'como estas?'

    normalize(" hahahaha HUY! Kamusta 😅 Mayaman $$$ ka na ba?")
    $ ' hahahaha huy! kamusta  mayaman $$$ ka na ba?'
    """
    normalized = unicodedata.normalize("NFKD", text)
    ascii_text = normalized.encode("ascii", "ignore").decode("ascii")


### **Punctuations**

Punctuation marks give natural speech and writing a sense of structure, clarity, and tone, but in the context of a classification study they add little information about the sentiment of a message. The sentiments of `i hate you!` and `i hate you` are the same, even though the punctuation mark `!` accentuates the sentiment. We can see a sample of rows with punctuation below.


In [466]:
# Finding a sample of rows with punctuation
rows_with_punc = df[df["clean_text"].str.contains(r"[^\w\s]")]
rows_with_punc["clean_text"].sample(5)

144450                                                                                                        the truth nehru makes modi look petty minded crass can’ think thing that modi has done over the last years that made feel proud indian
162676                                                                                                                                                opposition’ show strength andhra pradesh’ vizag can opposition unite against modi more videos 
139450                                                                                                                                                                                    modi took stoneage and rahulji aims for golden era\nाेशबचा
72535     ’ shame rajdeep how you find ways target modi for every achievements his and his government tried give spin while scientists giving credit very due but loyalty rest only with one family more trust left with journalists and especially 


The function `rem_punctuation` replaces all punctuation marks and special characters with the empty string (`''`).
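The sketch below shows one way such a removal could look. The visible part of the project's source builds a character class with `re.escape`, but the listing is cut off, so the use of `string.punctuation` (ASCII punctuation only) and the name `rem_punctuation_sketch` are assumptions made for illustration.

```python
import re
import string

# Assumption: "punctuation" means the ASCII set in `string.punctuation`.
# `re.escape` keeps characters such as `]`, `^`, `-`, and `\` literal
# inside the character class.
PUNCTUATION_PATTERN = re.compile(f"[{re.escape(string.punctuation)}]")


def rem_punctuation_sketch(text: str) -> str:
    """Hypothetical stand-in for `rem_punctuation`; deletes ASCII punctuation."""
    return PUNCTUATION_PATTERN.sub("", text)


print(rem_punctuation_sketch("this word $$ has two spaces after it!"))
print(rem_punctuation_sketch("these!words@have$no%space"))
# A curly apostrophe is not ASCII punctuation, so it survives this step alone.
print(rem_punctuation_sketch("can’ think"))
```

Under this assumption, typographic characters such as `’` are only removed because normalization to ASCII runs as well, which is one reason the individual steps are meant to be used together.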


In [467]:
rem_punctuation??

Signature: rem_punctuation(text: str) -> str
Source:   
def rem_punctuation(text: str) -> str:
    """
    Removes the punctuations. This function simply replaces all punctuation marks and special characters
    to the empty string. Hence, for symbols enclosed by whitespace, the whitespace are not collapsed to a single whitespace
    (for more information, see the examples).

    Do not use this function alone, use `clean_and_tokenize()`.

    # Parameters
    * text: String entry.

    # Returns
    Text with the punctuation removed.

    # Examples
    rem_punctuation("this word $$ has two spaces after it!")
    $ 'this word  has two spaces after it'

    rem_punctuation("these!words@have$no%space")
    $ 'thesewordshavenospace'
    """
    return re.sub(f"[{re.escape(string.

### **Numbers**

Similar to punctuation, numbers add little information to the sentiment of a message, as seen in the samples below.


In [468]:
# Finding a sample of rows that contain numbers
rows_with_numbers = df[df["clean_text"].str.contains(r"\d")]
rows_with_numbers["clean_text"].sample(5)

146733                                                                                                           its not that easy end article 370 mehbooba mufti hold beer modi
27684                                                    would have been the first person criticize joshi was made mhrd 2014 now praising him not because love but hate for modi
108020                                            political parties have the polls only meet the expectations voters says modi and says bjp will win more seats than 2014 polls 
113465                 ్ి సమయంో ీు ్ి ాీు ్ి ూా ి్్ింేు 2903 1830 ajay gadde persnol how many jobs did your government provided youth how can you partial towards northern india
147838    ppl ppl this not about modi western analysts have been doubting for long time when are successful something they point poverty india ’ set pattern modi came only 2014
Name: clean_text, dtype: string

Hence, we define `rem_numbers` as a function that replaces all digits with the empty string (`''`).


In [469]:
rem_numbers??

Signature: rem_numbers(text: str) -> str
Source:   
def rem_numbers(text: str) -> str:
    """
    Removes numbers. This function simply replaces all numerical symbols to the empty string. Hence, for symbols enclosed by
    whitespace, the whitespace are not collapsed to a single whitespace (for more information, see the examples).

    Do not use this function alone, use `clean_and_tokenize()`.

    # Parameters
    * text: String entry.

    # Returns
    Text with the numerical symbol removed

    # Examples
    rem_numbers(" h3llo, k4must4 k4  n4?")
    ' hllo, kmust k  n?'
    """
    return re.sub(r"\d+", "", text)
File:      a:\college\year 3\term 2\stintsy\stintsy-order-of-erin\lib\janitor.py
Type:      function

### **Whitespace**

We also noticed the prevalence of excess whitespace between words, as seen in the sample below.


In [470]:
# Finding a sample of rows that contain 2 or more whitespaces in a row
rows_with_whitespaces = df[df["clean_text"].str.contains(r"\s{2,}")]
rows_with_whitespaces["clean_text"].sample(5)

126955                                                                                                                               how important sentence can missed from tweet  killed the democracy its because the state government now got  
18357                                                                                    ran away looks like you are smoking something special  changed identity chowkider not \nanyways where are those lakhs which was being screamed your modi 
16883                                                                                                                                                                         clearly modi tsunami after balakot strike the very least modi wave  
7597      congress was blaming modi that under his govt ppl unemployed now instead bribing poor ppl and make them nakaras lyk just give them jobs mehnat roti khane mein maza hai rishwat khareedi roti mein nahi  khair tumhe samajh nahi aayega 
26099                       

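Thus, the function `collapse_whitespace` collapses every run of space characters into a single space. Formally, it is a transducer

$$
\Box^+ \mapsto \Box \qquad \text{where the space character is } \Box
$$

Informally, it replaces every run of one or more spaces with a single space and then strips leading and trailing whitespace. Note that the pattern shown in the source below matches only the space character, so other whitespace such as `\n` inside a tweet is left in place.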
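To make the behaviour concrete, the sketch below compares the space-only pattern `" +"` used by `collapse_whitespace` with a `\s+` variant that treats every whitespace character alike. Both function names are hypothetical and exist only for this comparison.

```python
import re


def collapse_spaces(text: str) -> str:
    # Same pattern as the project's function: runs of the space character only.
    return re.sub(" +", " ", text).strip()


def collapse_all_whitespace(text: str) -> str:
    # Hypothetical variant: newlines and tabs are collapsed to a space as well.
    return re.sub(r"\s+", " ", text).strip()


sample = "  huh,  was.\nthat!!! "
print(repr(collapse_spaces(sample)))
print(repr(collapse_all_whitespace(sample)))
```

With the space-only pattern the embedded newline survives, which matches the `\n` sequences still visible in some cleaned tweets.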


In [471]:
collapse_whitespace??

Signature: collapse_whitespace(text: str) -> str
Source:   
def collapse_whitespace(text: str) -> str:
    """
    This collapses whitespace. Here, collapsing means the transduction of all whitespace strings of any
    length to a whitespace string of unit length (e.g., "   " -> " "; formally " "+ -> " ").

    Do not use this function alone, use `clean_and_tokenize()`.

    # Parameters
    * text: String entry.

    # Returns
    Text with the whitespaces collapsed.

    # Examples
    collapse_whitespace("  huh,  was.  that!!! ")
    $ 'huh, was. that!!!'
    """
    return re.sub(" +", " ", text).strip()
File:      a:\college\year 3\term 2\stintsy\stintsy-order-of-erin\lib\janitor.py
Type:      function

To call all these cleaning functions in one step, we use the `clean` function, a wrapper that calls each of these components. Its definition is quite long; see [this appendix](#appendix:-clean-wrapper-function-definition).

We can now clean the dataset and store the result in a new column named `clean_ours` (to differentiate it from the still-dirty `clean_text` column provided by the dataset author).


In [472]:
df["clean_ours"] = df["clean_text"].map(clean).astype("string")
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162969 entries, 0 to 162979
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   clean_text  162969 non-null  string
 1   category    162969 non-null  int64 
 2   clean_ours  162969 non-null  string
dtypes: int64(1), string(2)
memory usage: 9.0 MB


To confirm that the character cleaning worked, we can compare `clean_text` and `clean_ours` in the filtered rows below.


In [473]:
example_rows = df[
    df["clean_text"].str.contains(r"\s{2,}|\d|[^\w\s]|[\u263a-\U0001f645]|[ÉéÁáóÓúÚíÍ]")
]
example_rows.sample(10)

Unnamed: 0,clean_text,category,clean_ours
71349,wont say much read below please\nanswer people are largely criticizing rahul gandhis minimum income plan 72000 but why they did not say anything narendra modis demonitisation when most the black money did not return anonymous,1,wont say much read below please\nanswer people are largely criticizing rahul gandhis minimum income plan but why they did not say anything narendra modis demonitisation when most the black money did not return anonymous
98592,income tax raids karnataka kannadigas must blame jawaharlal nehru not modi for bringing income tax act 1961,0,income tax raids karnataka kannadigas must blame jawaharlal nehru not modi for bringing income tax act
84576,were joking yesterday about surgical strikes space well heres modi campaign rally today “land sky space government has shown courage conduct surgical strike all spheres”,0,were joking yesterday about surgical strikes space well heres modi campaign rally today land sky space government has shown courage conduct surgical strike all spheres
90594,congress claims credit for mission shakti congress era defence minister says had idea – opindia news via,0,congress claims credit for mission shakti congress era defence minister says had idea opindia news via
154801,modi govt says terror and talks can’ together you you not accept this precondition valid and fair,1,modi govt says terror and talks can together you you not accept this precondition valid and fair
113329,modi mentions how the strict laws his govt against economic offenders are reaping resultsproperties worth rs14000crore belonging vijaymallya have been seized even though total liability against him stands rs9000crore says,1,modi mentions how the strict laws his govt against economic offenders are reaping resultsproperties worth rscrore belonging vijaymallya have been seized even though total liability against him stands rscrore says
108588,should vote modi for orop implemented after years 35000 crores disbursed crore veterans,0,should vote modi for orop implemented after years crores disbursed crore veterans
139770,obviously the ministers like giriraj singh anant hegdektaka every day comments against muslims christians the communal politician become naturally communities will appeal against bjp ाेशबचा,1,obviously the ministers like giriraj singh anant hegdektaka every day comments against muslims christians the communal politician become naturally communities will appeal against bjp
69959,now the debris left over the space will cleaned under the swatch bharat mission modi sarkar hai toh mumkin hai bai,0,now the debris left over the space will cleaned under the swatch bharat mission modi sarkar hai toh mumkin hai bai
39354,modi’ skill india raga’ kill india,0,modi skill india raga kill india


We are now finished with basic text cleaning, but the data cleaning does not end here. Given that the text is sourced from Twitter, it includes characteristics, such as spam and informal expressions, which are not addressed by basic cleaning methods. As a result, we move on to further cleaning tailored to the nature of Twitter data.


### **Spam, Expressions, Onomatopoeia, etc.**

Since the domain of the corpus is Twitter, spam (e.g., `bbbb`), expressions (e.g., `bruhhhh`), and onomatopoeia (e.g., `hahahaha`) may become an issue at the vector representation step. Hence, we employ a simple rule-based spam removal algorithm.

We remove words that repeat the same character three or more times in a row, or that repeat the same substring at least twice in a row. These rules are implemented with regular expressions:

$$
\text{same\_char\_thrice} := (.)\backslash 1^{\{2,\}}
$$

and

$$
\text{same\_substring\_twice} := (.^+)\backslash 1^+
$$

Furthermore, we remove any word shorter than three characters, since these are either stopwords (that weren't detected in the stopword removal stage) or more spam.

Finally, we employ an adaptive character diversity threshold: a string $s$ is flagged when

$$
\frac{\texttt{\#\_unique\_chars}(s)}{|s|} < 0.3 + \left(\frac{0.1 \cdot \text{min}(|s|, 10)}{10}\right)
$$

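The left-hand side measures the diversity of characters in a string; if the string repeats the same characters a lot, we expect it to be unintelligible or useless, so we remove it. The threshold on the right-hand side grows with the length of the string, from 0.31 for a one-character string to 0.4 for strings of ten or more characters.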
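The rules above can be tried out on single words with the sketch below. It only illustrates the two regular expressions and the diversity inequality as written in this section; the helper names are hypothetical, and how the project's `find_spam_and_empty` combines and anchors these checks is defined in the appendix, not here.

```python
import re

# The two patterns from the text, written as Python regular expressions.
SAME_CHAR_THRICE = re.compile(r"(.)\1{2,}")
SAME_SUBSTRING_TWICE = re.compile(r"(.+)\1+")


def diversity_threshold(length: int) -> float:
    # Right-hand side of the inequality: 0.3 + 0.1 * min(|s|, 10) / 10.
    return 0.3 + 0.1 * min(length, 10) / 10


def is_low_diversity(s: str) -> bool:
    # Left-hand side: unique characters divided by length (s must be non-empty).
    return len(set(s)) / len(s) < diversity_threshold(len(s))


for word in ["bbbb", "bruhhhh", "hahahaha", "modi"]:
    print(
        word,
        bool(SAME_CHAR_THRICE.search(word)),
        bool(SAME_SUBSTRING_TWICE.search(word)),
        is_low_diversity(word),
    )
```

A plain `search` like this would also flag ordinary words with a doubled letter, such as `accused`, which the sample output in this section keeps. The project's function therefore appears to apply the patterns more selectively than this sketch does.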

The definition of this wrapper function is quite long, see its definition in [this appendix](#appendix:-find_spam_and_empty-wrapper-function-definition).

Let's first look at a random sample of 10 entries from the dataset that will be modified by the function.


In [474]:
affected = df[df["clean_ours"].apply(spam_affected)]
affected_sample = affected["clean_ours"].sample(10)
affected_sample

121729                                                                                        accused just like accused bibi and vice versa bilawal accuse modi yaar and then nazriyati and now again have stuck deal accused isnt accuse pak really
12236                                                                                                          you are just aaptardmind you also played role creating chaos that brought modi power and now you are crying against your own creation
136123                                                                            month remaining for ssc rrb fci other competitive exam guys keep away form yogi modi rahul gandhi whatsapp instagram before the exam otherwise you know the result
74552                                                                                                                                              suppose this happening and suddnly mpdi enters the hall all chants modi modi modi modi hahahhahah
111591    not convic

Let's now call this function on the `clean_ours` column of the dataset.


In [475]:
df["clean_ours"] = df["clean_ours"].map(find_spam_and_empty).astype("string")

To confirm that the function removed the spammy substrings, we can compare the `before` and `after` columns.


In [476]:
comparison = pd.DataFrame({"before": affected_sample, "after": df["clean_ours"]})

changed = comparison[comparison["before"] != comparison["after"]]
changed.sample(10)

Unnamed: 0,before,after
92314,mean its all good say good things about your religion and ideology everyone does that but should have some basis facts koi modi marketing campaign nahi hai mann mein aaya pel,mean its all good say good things about your religion and ideology everyone does that but should have some basis facts koi modi marketing campaign nahi hai mann mein pel
54063,its that cant survive another five year modi motherjaat swineis retweeting aatankistans tweet,its that cant survive another five year modi motherjaat swineis retweeting tweet
121729,accused just like accused bibi and vice versa bilawal accuse modi yaar and then nazriyati and now again have stuck deal accused isnt accuse pak really,accused just like accused and vice versa bilawal accuse modi yaar and then nazriyati and now again have stuck deal accused isnt accuse pak really
2441,how aaj tak and media creates anti modi voices must watch aktk via,how tak and media creates anti modi voices must watch aktk via
144833,question for modi and sitharaman why couldnt the iafs su fighters engage intruding paf fs therein lies scandal,question for modi and sitharaman why couldnt the iafs fighters engage intruding paf therein lies scandal
12236,you are just aaptardmind you also played role creating chaos that brought modi power and now you are crying against your own creation,you are just you also played role creating chaos that brought modi power and now you are crying against your own creation
136123,month remaining for ssc rrb fci other competitive exam guys keep away form yogi modi rahul gandhi whatsapp instagram before the exam otherwise you know the result,month remaining for fci other competitive exam guys keep away form yogi modi rahul gandhi whatsapp instagram before the exam otherwise you know the result
111591,not conviction they knew writing the wall the anti corruption movement rise modi and after modi was announced candidate they became pro modi its they are opportunists thats okay prob when ppl say zeerepublic are pro bjppro nationalist,not conviction they knew writing the wall the anti corruption movement rise modi and after modi was announced candidate they became pro modi its they are opportunists thats okay prob when say zeerepublic are pro bjppro nationalist
74552,suppose this happening and suddnly mpdi enters the hall all chants modi modi modi modi hahahhahah,suppose this happening and suddnly mpdi enters the hall all chants modi modi modi modi
8347,god our bbc news google sorry for narendra modi images top criminals list,god our news google sorry for narendra modi images top criminals list


Let's examine whether applying this function has caused any significant changes to the DataFrame structure, given that it can convert entire cells to `NaN`.


In [477]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162969 entries, 0 to 162979
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   clean_text  162969 non-null  string
 1   category    162969 non-null  int64 
 2   clean_ours  162942 non-null  string
dtypes: int64(1), string(2)
memory usage: 9.0 MB


The DataFrame structure is intact, but `clean_ours` now has 27 fewer non-null values, reflecting cells that were entirely filtered out as spam.


## **Post-Cleaning Steps**

At some point during the cleaning stage, some entries of the dataset could have been reduced to `NaN` or the empty string `""`, or we could have introduced duplicates again. So, let's call `dropna` and `drop_duplicates` again to finalize the cleaning stage.


In [478]:
df = df.dropna()
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162942 entries, 0 to 162979
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   clean_text  162942 non-null  string
 1   category    162942 non-null  int64 
 2   clean_ours  162942 non-null  string
dtypes: int64(1), string(2)
memory usage: 5.0 MB


In [479]:
df = df.drop_duplicates()
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162942 entries, 0 to 162979
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   clean_text  162942 non-null  string
 1   category    162942 non-null  int64 
 2   clean_ours  162942 non-null  string
dtypes: int64(1), string(2)
memory usage: 5.0 MB
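One detail worth keeping in mind when reading the unchanged row count: `drop_duplicates()` without arguments compares entire rows, so two rows whose raw `clean_text` differs count as distinct even if their cleaned text is identical. The sketch below uses made-up rows to show the difference from passing `subset`; whether deduplicating on the cleaned column is desirable for this project is left as an open choice.

```python
import pandas as pd

# Made-up rows: the raw texts differ, but they clean to the same string.
toy = pd.DataFrame(
    {
        "clean_text": ["modi 2014 wins", "modi 2019 wins", "modi wins"],
        "category": [1, 1, 1],
        "clean_ours": ["modi wins", "modi wins", "modi wins"],
    }
)

# Whole-row comparison: nothing is dropped because `clean_text` differs.
all_columns = toy.drop_duplicates()

# Comparison restricted to the cleaned text and its label.
cleaned_only = toy.drop_duplicates(subset=["clean_ours", "category"])

print(len(all_columns), len(cleaned_only))
```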


## **3. Preprocessing**

> 🏗️ Perhaps swap S3 and S4. Refer to literature on what comes first.

This section discusses preprocessing steps for the cleaned data. Because the goal is to analyze the textual sentiment of tweets, the following preprocessing steps provide the Bag of Words model with the information it needs to build a vector representation of each tweet.

Before and after each preprocessing step, we will show 5 random entries in the dataset to show the effects of each preprocessing task.

## **Lemmatization**

We follow a methodology similar to the one presented in <u>(George & Murugesan, 2024)</u> and preprocess the dataset entries via lemmatization. We use NLTK's `WordNetLemmatizer` for this task <u>(Bird & Loper, 2004)</u>. For the lemmatization step, we use WordNet for English lemmatization and Open Multilingual WordNet version 1.4 for translations and multilingual support, which is important in our case since some tweets contain text from Indian languages.


In [480]:
df["lemmatized"] = df["clean_ours"].map(lemmatizer)
df.sample(10)

Unnamed: 0,clean_text,category,clean_ours,lemmatized
106289,wait for modis 2nd term the desperation will conveyed accordingly,0,wait for modis term the desperation will conveyed accordingly,wait for modis term the desperation will conveyed accordingly
48662,the bjp spot slipping from 240 200 now hear back towards 160180 type number leading indicators that happening modi assassination plots timesnow nitin gadkari making statements,-1,the bjp spot slipping from now hear back towards type number leading indicators that happening modi plots timesnow nitin gadkari making statements,the bjp spot slipping from now hear back towards type number leading indicator that happening modi plot timesnow nitin gadkari making statement
107500,the failed promise every village ranked the bottom usagetopopulation ratio along with tanzania with only oneinfour indians using the internet according 2018 report,-1,the failed promise every village ranked the bottom usagetopopulation ratio along with tanzania with only oneinfour indians using the internet according report,the failed promise every village ranked the bottom usagetopopulation ratio along with tanzania with only oneinfour indian using the internet according report
75969,india shot down one its satellites space with antisatellite missile wednesday prime minister narendra modi said hailing the countrys first test such technology major breakthrough that establishes space power\n,1,india shot down one its satellites space with antisatellite missile wednesday prime minister narendra modi said hailing the countrys first test such technology major breakthrough that establishes space power,india shot down one it satellite space with antisatellite missile wednesday prime minister narendra modi said hailing the country first test such technology major breakthrough that establishes space power
138071,all these pakistani lover anti india fake agenda gajwa hind behind making film fool nation became faud baji money destroy india make another pakistan modi bunker saving nation they scare how suxese besharm gandi nali keede,-1,all these pakistani lover anti india fake agenda gajwa hind behind making film fool nation became faud baji money destroy india make another pakistan modi bunker saving nation they scare how suxese besharm gandi nali keede,all these pakistani lover anti india fake agenda gajwa hind behind making film fool nation became faud baji money destroy india make another pakistan modi bunker saving nation they scare how suxese besharm gandi nali keede
17787,modi totally messed the economy will remembered the history for wasteful expenditures like demonetisation statue makingsight seeing and spending thousands crores public money for improving his imaage,0,modi totally messed the economy will remembered the history for wasteful expenditures like demonetisation statue makingsight seeing and spending thousands crores public money for improving his imaage,modi totally messed the economy will remembered the history for wasteful expenditure like demonetisation statue makingsight seeing and spending thousand crore public money for improving his imaage
100422,modi bashing can demonising him has become thing agar aap yeah nahi karte log aapko secular nahi mante,0,modi bashing can demonising him has become thing agar yeah nahi karte log secular nahi mante,modi bashing can demonising him ha become thing agar yeah nahi karte log secular nahi mante
154761,good day for banking industry for more mergers future india need less bank but more branches modi hai toh mumkin hai merger years hind modi,1,good day for banking industry for more mergers future india need less bank but more branches modi hai toh mumkin hai merger years hind modi,good day for banking industry for more merger future india need less bank but more branch modi hai toh mumkin hai merger year hind modi
64825,modi govt will extol the virtues national security while ignoring crucial component economic security bhumish khudkhudia public policy professional writes,1,modi govt will extol the virtues national security while ignoring crucial component economic security bhumish public policy professional writes,modi govt will extol the virtue national security while ignoring crucial component economic security bhumish public policy professional writes
35756,ask electoral commission modi has nothing now,0,ask electoral commission modi has nothing now,ask electoral commission modi ha nothing now


## **Stop Word Removal**

After lemmatization, we can remove the stop words present in the dataset. Stop word removal _needs_ to come after lemmatization because it requires all words to be reduced to their base dictionary form: the `stopwords_set` only contains the base dictionary forms of the stopwords.

**stopwords.** For stop word removal, we refer to the English stopword lists defined in NLTK and Wolfram Mathematica <u>(Bird & Loper, 2004; Wolfram Research, 2015)</u>. However, since the task is sentiment analysis, words that convey polarity, intensification, and negation are important, and words like "not" and "okay" are commonly included in stopword lists. Therefore, the stopwords from NLTK and Mathematica are manually adjusted to include only neutral words, for example "after", "when", and "you".
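The filtering idea can be sketched as follows. This is only an illustration: the notebook's own `rem_stopwords` is not shown here, so the function name `remove_stopwords_sketch`, the tiny `neutral_stopwords` set, and the choice to return `None` for entries that become empty (inferred from the `dropna` call in the next cell) are all assumptions.

```python
def remove_stopwords_sketch(text, stopwords):
    """Drop stopwords from a whitespace-separated string.

    Returns None when no tokens survive, so that such rows can later be
    removed with DataFrame.dropna (an assumption about the real pipeline).
    """
    kept = [tok for tok in text.split() if tok not in stopwords]
    return " ".join(kept) if kept else None


# Hypothetical neutral stopword set; "not" is deliberately absent so that
# negation survives for the sentiment model.
neutral_stopwords = {"after", "when", "you", "the", "are", "where"}

print(remove_stopwords_sketch("where are the job modi", neutral_stopwords))
print(remove_stopwords_sketch("you are not the leader", neutral_stopwords))
print(remove_stopwords_sketch("when you", neutral_stopwords))
```

The second call keeps "not", which is the reason the stopword lists are curated by hand rather than used as published.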


In [481]:
df["lemmatized"] = df["lemmatized"].map(lambda t: rem_stopwords(t, stopwords_set))
df = df.dropna(subset=["lemmatized"])
df.sample(10)

Unnamed: 0,clean_text,category,clean_ours,lemmatized
106279,every politician promises all these more the fact remains the condition small keeps getting worse with every passing day easy credit available for ambani nirav modi mallaya but not for small businesses who toil their lives away and finally shut shop,1,every politician promises all these more the fact remains the condition small keeps getting worse with every passing day easy credit available for ambani nirav modi mallaya but not for small businesses who toil their lives away and finally shut shop,politician promise all more fact remains condition small worse passing day easy credit available ambani nirav modi mallaya small business toil life away finally shut shop
107411,the election commission does not act this violation the model code the railways then modi has ensured india the new north korea ाेशबचा,1,the election commission does not act this violation the model code the railways then modi has ensured india the new north korea,election commission doe violation model code railway modi ha ensured india north korea
161658,sorry modi sarkar will come with lakh new vacancies 2020,-1,sorry modi sarkar will come with lakh new vacancies,sorry modi sarkar lakh vacancy
5629,modi could too didn’ had the urge spend public money selfglorification also you think the massive political funding comes for free eventually modi has return the favor diverting public funds welfare programs his owners hence bjp cant,1,modi could too didn had the urge spend public money selfglorification also you think the massive political funding comes for free eventually modi has return the favor diverting public funds welfare programs his owners hence bjp cant,modi urge spend public money selfglorification massive political funding free eventually modi ha return favor diverting public fund welfare program owner bjp
89989,just type modi and jumlas google and click search and see magic,1,just type modi and jumlas google and click search and see magic,just type modi jumlas google click search magic
43989,narendra modi his address nation antisatellite weapon asat successfully targeted live satellite low earth orbit leo part mission shakti,1,narendra modi his address nation antisatellite weapon asat successfully targeted live satellite low earth orbit leo part mission shakti,narendra modi address nation antisatellite weapon asat successfully targeted live satellite low earth orbit leo mission shakti
117436,400 for modi make india super power 2022,1,for modi make india super power,modi india super power
67618,where are the jobs modi where are the jobs,0,where are the jobs modi where are the jobs,job modi job
117732,gandhi vows that elected will remove people with rss links from the bureaucracy“they are judges they are professors they went from rssrun crammers pass the civilservice exam rss military academies into the army”\n,-1,gandhi vows that elected will remove people with rss links from the bureaucracythey are judges they are professors they went from rssrun crammers pass the civilservice exam rss military academies into the army,gandhi vow elected remove people rss link bureaucracythey judge professor rssrun crammer civilservice exam rss military academy army
19383,positive campaign always yields better results instead giving 6000month for free modi should introduce universal job guarantee yojna give voters moment pride and usefulness\niss desh yuva jagrook hai mehanati hai swabhimani hai usko job chahiye bheekh,1,positive campaign always yields better results instead giving month for free modi should introduce universal job guarantee yojna give voters moment pride and usefulness iss desh yuva jagrook hai mehanati hai swabhimani hai usko job chahiye bheekh,positive campaign always yield better result month free modi introduce universal job guarantee yojna voter moment pride usefulness iss desh yuva jagrook hai mehanati hai swabhimani hai usko job chahiye bheekh


## **Looking at the DataFrame**

After preprocessing, the dataset now contains:


In [482]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162942 entries, 0 to 162979
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   clean_text  162942 non-null  string
 1   category    162942 non-null  int64 
 2   clean_ours  162942 non-null  string
 3   lemmatized  162942 non-null  object
dtypes: int64(1), object(1), string(2)
memory usage: 6.2+ MB


Here are 5 randomly picked entries from the DataFrame, with all columns shown for comparison.


In [483]:
display(df.sample(5))

Unnamed: 0,clean_text,category,clean_ours,lemmatized
62203,did you see his face rahul gandhi scoffs after modis address rahul congress should celebrate along with the nation rather than sulk grudging that the proud acheivement took place ndas tenure,1,did you see his face rahul gandhi scoffs after modis address rahul congress should celebrate along with the nation rather than sulk grudging that the proud acheivement took place ndas tenure,face rahul gandhi scoff modis address rahul congress celebrate along nation rather sulk grudging proud acheivement place ndas tenure
116917,will scrap the niti aayog and replace with lean planning commission voted power tweets president updates,0,will scrap the niti and replace with lean planning commission voted power tweets president updates,scrap niti replace lean planning commission voted power tweet president update
75246,very happy that india became the 4th space power world thanks modi support our scientist for doing this,1,very happy that india became the space power world thanks modi support our scientist for doing this,very happy india space power modi support scientist
46221,jindabad congratulations successfully testing indias first missile har har modi\nghar ghar,1,jindabad congratulations successfully testing indias first missile har har modi ghar ghar,jindabad congratulation successfully testing india missile har har modi ghar ghar
99336,why this not used varanasi modi fight there,0,why this not used varanasi modi fight there,varanasi modi fight


## **Tokenization**

Since the data cleaning and preprocessing stage is comprehensive, the tokenization step in the BoW model reduces to a simple word-boundary split operation. Each preprocessed entry in the DataFrame is split by spaces. For example, the entry `"shri narendra modis"` (entry: 42052) becomes `["shri", "narendra", "modis"]`. By the end of tokenization, all entries are transformed into arrays of strings.
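As a minimal sketch of this step (the helper name `tokenize` is hypothetical and not taken from the project's code), a whitespace split is all that is required:

```python
def tokenize(entry):
    """Split a preprocessed entry on whitespace into a list of tokens."""
    # str.split() with no arguments splits on any run of whitespace.
    return entry.split()


print(tokenize("shri narendra modis"))  # ['shri', 'narendra', 'modis']
```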

## **Word Bigrams**

As noted earlier, modifiers and polarity words are not included in the stopword set. The BoW model constructs a vocabulary containing both unigrams and bigrams. Including bigrams allows the model to capture common word patterns, such as

$$
\left\langle \texttt{Adj}\right\rangle \left\langle \texttt{M} \mid \texttt{Pron} \right\rangle
$$

<center>or</center>

$$
\left\langle \texttt{Adv}\right\rangle \left\langle \texttt{V} \mid \texttt{Adj} \mid \texttt{Adv} \right\rangle
$$
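To make the idea concrete, the sketch below builds the unigram and bigram features for one tokenized entry. The function name `unigrams_and_bigrams` is an assumption for illustration; in the notebook this work is delegated to the vectorizer.

```python
def unigrams_and_bigrams(tokens):
    """Return the tokens followed by all adjacent token pairs joined by a space."""
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


print(unigrams_and_bigrams(["very", "happy", "india"]))
# ['very', 'happy', 'india', 'very happy', 'happy india']
```

Here "very happy" is an instance of the adverb-adjective pattern: the intensifier stays attached to the word it modifies instead of being counted on its own.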

## **Vector Representation**

After the lemmatization and stop word removal steps, each entry can now be represented as a vector using a Bag of Words (BoW) model. We employ scikit-learn's `CountVectorizer`, which provides a ready-to-use implementation of BoW <u>(Pedregosa et al., 2011)</u>.

A comparison with other traditional vector representations is discussed in [this appendix](#appendix:-comparison-of-traditional-vectorization-techniques).

Because bigrams are included, words with modifiers have the modifiers directly attached, enabling subsequent models to capture the concept of modification. Consequently, after tokenization and bigram construction, the vocabulary size can grow up to $O(n^2)$, where $n$ is the number of unique tokens.

**minimum document frequency constraint:** Despite cleaning and spam removal, some tokens remain irrelevant or too rare. To address this, a minimum document frequency constraint is applied: $\texttt{min\_df} = 10$, meaning a token must appear in at least 10 documents to be included in the BoW vocabulary. This reduces noise and ensures the model focuses on meaningful terms.

---

These parameters of the BoW model are encapsulated in the `BagOfWordsModel` class. The class definition is available in [this appendix](#appendix:-BagOfWordsModel-class-definition).


In [484]:
bow = BagOfWordsModel(df["lemmatized"], 10)

# some sanity checks
assert (
    bow.matrix.shape[0] == df.shape[0]
), "number of rows in the matrix DOES NOT match the number of documents"
assert bow.sparsity, "the sparsity is TOO HIGH, something went wrong"



The warning raised by the cell above is expected. Recall that our tokenization step reduces to a simple split on whitespace, so the `tokenizer` attribute of the `BagOfWordsModel` is set to bypass the default tokenization pattern, and that override is what triggers the warning.


### **Model Metrics**

To get a sense of the model, we now look at its shape and sparsity. The shape is the number of documents and the number of tokens in the model, while sparsity is the proportion of elements in the document-term matrix that are zero, which indicates how rarely documents share vocabulary.


The resulting matrix has a shape of


In [485]:
bow.matrix.shape

(162942, 30386)

The first entry of the pair is the number of documents (the ones that remain after all the data cleaning and preprocessing steps) and the second entry is the number of tokens (the unique unigrams and bigrams in the vocabulary).

The resulting model has a sparsity of


In [486]:
1 - bow.sparsity

0.9995039539872171

The matrix is 99.95% sparse: each tweet contains only a tiny fraction of the vocabulary, and tweets often do not share the same words.
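For reference, a sparsity figure of this kind can be computed from a SciPy sparse matrix as sketched below. The helper name `sparsity_of` and the toy matrix are assumptions; how `BagOfWordsModel` computes its own attribute is not shown in this section.

```python
from scipy.sparse import csr_matrix


def sparsity_of(matrix):
    """Fraction of zero entries in a sparse document-term matrix."""
    rows, cols = matrix.shape
    # nnz counts the stored entries; for a matrix built from dense data
    # these are exactly the non-zero cells.
    return 1.0 - matrix.nnz / (rows * cols)


# 3 documents, 4 terms, 3 non-zero counts out of 12 cells.
toy = csr_matrix([[1, 0, 0, 2], [0, 0, 3, 0], [0, 0, 0, 0]])
print(sparsity_of(toy))  # 0.75
```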


Next, we look at the most frequent and least frequent terms in the model.


In [487]:
doc_frequencies = np.asarray((bow.matrix > 0).sum(axis=0)).flatten()
freq_order = np.argsort(doc_frequencies)[::-1]
bow.feature_names[freq_order[:50]]

array(['modi', 'india', 'ha', 'all', 'people', 'bjp', 'like', 'congress',
       'narendra', 'only', 'election', 'narendra modi', 'vote', 'govt',
       'about', 'indian', 'year', 'time', 'country', 'just', 'modis',
       'more', 'nation', 'rahul', 'even', 'government', 'party', 'power',
       'gandhi', 'minister', 'leader', 'good', 'modi govt', 'need',
       'modi ha', 'space', 'work', 'prime', 'money', 'credit', 'sir',
       'pakistan', 'back', 'day', 'today', 'prime minister', 'scientist',
       'never', 'support', 'win'], dtype=object)

The most frequent terms show the main talking point of the tweets: Indian politics, with keywords like "modi", "india", and "bjp". For additional context, "bjp" refers to the _Bharatiya Janata Party_, a conservative political party in India and one of the two major Indian political parties.


Next, we look at the least frequent terms.


In [488]:
bow.feature_names[freq_order[-50:]]

array(['healthy democracy', 'ha mass', 'ha separate', 'ha shifted',
       'hat drdo', 'about defeat', 'yet ha', 'yes more', 'yes narendra',
       'hatred people', 'ha requested', 'hate more', 'hate much',
       'hatemonger', 'hater gonna', 'heal', 'hazaribagh', 'head drdo',
       'sleep night', 'abinandan', 'able provide', 'able speak',
       'able vote', 'youth need', 'youth power', 'hai isliye', 'hai chor',
       'handy', 'hand narendra', 'hand people', 'hae', 'ha withdrawn',
       'happens credit', 'happier', 'bhaiyo', 'socha', 'social political',
       'social security', 'biased journalist', 'big congratulation',
       'sirmodi', 'bhutan', 'bhi berozgar', 'bhi mumkin', 'skta',
       'bhatt aditi', 'bhi aur', 'slamming', 'smart modi', 'slogan blame'],
      dtype=object)

The themes from the most frequent terms are still present in this subset, although filler or non-distinct words such as "handy", "happier", and "heal" appear more often.

The presence of terms like "healthy democracy", "social security", and "youth power" indicates that this subset is still relevant to the main theme of the dataset.


# **4 exploratory data analysis**

This section discusses the exploratory data analysis conducted on the dataset after cleaning.

> Notes from Zhean: <br>
> From manual checking via OpenRefine, there are a total of 162972. `df.info()` should have the same result post-processing.
> Furthermore, there should be two columns: `clean_text` (a bit of a misnomer since it is still dirty), which contains the Tweets (text data), and `category`, which contains the sentiment of the Tweet as a tribool (1 for positive, 0 for neutral or indeterminate, and -1 for negative).


# **references**

Bird, S., & Loper, E. (2004, July). NLTK: The natural language toolkit. _Proceedings of the ACL Interactive Poster and Demonstration Sessions_, 214–217. https://aclanthology.org/P04-3031/

El-Demerdash, A. A., Hussein, S. E., & Zaki, J. F. W. (2021). Course evaluation based on deep learning and SSA hyperparameters optimization. _Computers, Materials & Continua, 71_(1), 941–959. https://doi.org/10.32604/cmc.2022.021839

George, M., & Murugesan, R. (2024). Improving sentiment analysis of financial news headlines using hybrid Word2Vec-TFIDF feature extraction technique. _Procedia Computer Science, 244_, 1–8.

Hussein, S. (2021). _Twitter sentiments dataset_. Mendeley.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. _Journal of Machine Learning Research, 12_, 2825–2830.

Rani, D., Kumar, R., & Chauhan, N. (2022, October). Study and comparison of vectorization techniques used in text classification. In _2022 13th International Conference on Computing Communication and Networking Technologies (ICCCNT)_ (pp. 1–6). IEEE.

Wolfram Research. (2015). _DeleteStopwords_. https://reference.wolfram.com/language/ref/DeleteStopwords.html


# **appendix: `clean` wrapper function definition**

Below is the definition of the `clean` wrapper function that encapsulates all internal functions used in the cleaning pipeline.


In [489]:
clean??

Signature: clean(text: str) -> str
Source:
def clean(text: str) -> str:
    """
    This is the main function for data cleaning (i.e., it calls all the cleaning functions in the prescribed order).

    This function should be used as a first-class function in a map.

    # Parameters
    * text: The string entry from a DataFrame column.
    * stopwords: stopword dictionary.

    # Returns
    Clean string
    """
    # cleaning on the base string
    text = normalize(text)
    text = rem_punctuation(text)
    text = rem_numbers(text)
    text = collapse_whitespace(text)

    return text
File:      a:\college\year 3\term 2\stintsy\stintsy-order-of-erin\lib\janitor.py
Type:      function

# **appendix: `find_spam_and_empty` wrapper function definition**

Below is the definition of the `find_spam_and_empty` wrapper function that encapsulates all internal functions for the spam detection algorithm.


In [490]:
find_spam_and_empty??

Signature: find_spam_and_empty(text: str, min_length: int = 3) -> str | None
Source:
def find_spam_and_empty(text: str, min_length: int = 3) -> str | None:
    """
    Filter out empty text and unintelligible/spammy unintelligible substrings in the text.

    Spammy substrings:
    - Shorter than min_length
    - Containing non-alphabetic characters
    - Consisting of a repeated substring (e.g., 'aaaaaa', 'ababab', 'abcabcabc')

    # Parameters
    * text: input string.
    * min_length: minimum length of word to keep.

    # Returns
        Cleaned string, or None if empty after filtering.
    """
    cleaned_tokens = []
    for t in text.split():
        if len(t) < min_length:

# **appendix: comparison of traditional vectorization techniques**

Traditional vectorization techniques include BoW and Term Frequency-Inverse Document Frequency (TF-IDF). TF-IDF weights each word based on its frequency in a document and its rarity across the corpus, reducing the impact of common words. BoW, in contrast, simply counts word occurrences without considering corpus-level frequency. In this project, BoW was chosen because stopwords were already removed during preprocessing, and the dataset is domain-specific <u>(Rani et al., 2022)</u>. In such datasets, frequent words are often meaningful domain keywords, so scaling them down (as TF-IDF would) could reduce the importance of these key terms in the feature representation.
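The difference can be illustrated on a toy corpus. The sketch below assumes the textbook, unsmoothed weighting $\text{tf} \cdot \log(N / \text{df})$; libraries such as scikit-learn use a smoothed variant by default, so the exact numbers would differ there. The corpus and the helper name `tfidf_weights` are hypothetical.

```python
import math
from collections import Counter

# Toy tokenized corpus; "modi" is a domain keyword present in every document.
docs = [
    ["modi", "india", "good"],
    ["modi", "election", "vote"],
    ["modi", "india", "space"],
]


def tfidf_weights(doc, corpus):
    """Unsmoothed TF-IDF: term count times log(N / document frequency)."""
    n_docs = len(corpus)
    weights = {}
    for term, tf in Counter(doc).items():
        df = sum(1 for d in corpus if term in d)
        weights[term] = tf * math.log(n_docs / df)
    return weights


first = docs[0]
bow = Counter(first)
tfidf = tfidf_weights(first, docs)

print(bow["modi"])                     # 1
print(tfidf["modi"])                   # 0.0
print(tfidf["good"] > tfidf["india"])  # True
```

Under this weighting the keyword "modi" is scaled to zero because it appears in every document, whereas BoW keeps its count, which is the behavior the paragraph above argues for in a domain-specific dataset.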


# **appendix: `BagOfWordsModel` class definition**

Below is the definition of the `BagOfWordsModel` class that encapsulates the desired parameters.


In [491]:
BagOfWordsModel??

Init signature: BagOfWordsModel(texts: Iterable[str], min_freq: int | float | None = None)
Source:
class BagOfWordsModel:
    """
    A Bag-of-Words representation for a text corpus.

    # Attributes
    * matrix (scipy.sparse.csr_matrix): The document-term matrix of word counts.
    * feature_names (list[str]): List of feature names corresponding to the matrix columns.
    *
    # Usage
    ```
    bow = BagOfWordsModel(df["lemmatized_str"])
    ```
    """

    def __init__(self, texts: Iterable[str], min_freq: int | float | None = None):
        """
        Initialize the BagOfWordsModel by fitting the vectorizer to the text corpus. This also filters out tokens
        that do not appear more than 