# Sentiment Analysis of Twitter Posts

<!-- Notebook name goes here -->
<center><b>Notebook: Data Description, Cleaning, Exploratory Data Analysis, and Preprocessing</b></center>
<br>

**by**: Stephen Borja, Justin Ching, Erin Chua, and Zhean Ganituen.

**dataset**: Hussein, S. (2021). Twitter Sentiments Dataset [Dataset]. Mendeley. https://doi.org/10.17632/Z9ZW7NT5H2.1

**motivation**: Every minute, social media users generate a large influx of textual data on live events. Performing sentiment analysis on this data provides a real-time view of public perception, enabling quick insights into the general population's opinions and reactions.

**goal**: By the end of the project, our goal is to create and compare supervised learning algorithms for sentiment analysis.

### **dataset description**

The Twitter Sentiments Dataset contains nearly 163k tweets from Twitter. The collection period is unknown, but the dataset was published to Mendeley Data on May 14, 2021 by Sherif Hussein of Mansoura University.

Tweets were extracted using the Twitter API, but the source does not describe how they were selected. The tweets are mostly in English, with some Hindi words mixed in through code-switching<a name="cite_ref-1"></a><sup>[1](#cite_note-1)</sup> <u>(El-Demerdash et al., 2021)</u>. All of them appear to discuss the political state of India, and most mention Narendra Modi, the current Prime Minister of India.

Each tweet was labeled automatically using TextBlob's sentiment analysis <u>(El-Demerdash, Hussein, & Zaki, 2021)</u>.

`Twitter_Data.csv` contains two columns:

- **`clean_text`**: The tweet's text
- **`category`**: The tweet's sentiment category

What each row and column represents: `each row represents one tweet.` <br>
Number of observations: `162,980`

---

<a name="cite_note-1"></a>1. [^](#cite_ref-1) Code-switching is the practice of alternating between two languages $L_1$ (the native language) and $L_2$ (the source language) in a conversation. In this context, the code-switching is done to appear more casual since the conversation is done via Twitter (now, X).


## **1. Project Set-up**

We set up the global imports for the project (ensure that these are installed via uv and are part of the environment) and then load the dataset.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tqdm.auto import tqdm
from sklearn.model_selection import train_test_split
import os
import sys
from wordcloud import WordCloud

# Set tqdm to pandas
tqdm.pandas()

# Use lib directory
sys.path.append(os.path.abspath("../lib"))

# Imports from lib files
from janitor import *
from lemmatize import lemmatizer
from boilerplate import stopwords_set
from bag_of_words import BagOfWordsModel

# Pandas configuration
pd.set_option("display.max_colwidth", None)

# Load raw data file
df = pd.read_csv("../data/Twitter_Data.csv")

## **2. Data Cleaning**

This section discusses the methodology for data cleaning.


To avoid wasting computation, a preliminary step is to ensure that no **`NaN`** or duplicate entries exist before the main cleaning steps. We call `info()` after each step to see how the rows in our `DataFrame` change.


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 162980 entries, 0 to 162979
Data columns (total 2 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   clean_text  162976 non-null  object 
 1   category    162973 non-null  float64
dtypes: float64(1), object(1)
memory usage: 2.5+ MB


The non-null counts for **`clean_text`** (162,976) and **`category`** (162,973) are both lower than the total number of entries (162,980), so our first step is to drop the `NaN` entries. We first check which rows have a **`NaN`** in either column.


In [3]:
NaN_rows = df[df.isna().any(axis=1)]
NaN_rows

Unnamed: 0,clean_text,category
148,,0.0
130448,the foundation stone northeast gas grid inaugurated modi came major,
155642,dear terrorists you can run but you cant hide are giving more years modi which you won’ see you,
155698,offense the best defence with mission shakti modi has again proved why the real chowkidar our,
155770,have always heard politicians backing out their promises but modi has been fulfilling his each every,
158693,modi government plans felicitate the faceless nameless warriors india totally deserved,
158694,,-1.0
159442,chidambaram gives praises modinomics,
159443,,0.0
160559,the reason why modi contested from seats 2014 and the real reason why rahul doing the same now,


We found a total of 11 rows with **`NaN`** values, so we drop them to preserve the integrity and accuracy of our data analysis.


In [4]:
df = df.dropna()
NaN_rows = df[df.isna().any(axis=1)]
NaN_rows

Unnamed: 0,clean_text,category


Another issue commonly found in real-world datasets is duplicate rows, which often come from manual data entry errors, system glitches, or merging data from multiple overlapping sources. We first check for duplicates in our `DataFrame` and then remove them if any exist.


In [5]:
duplicate_rows = df[df.duplicated()]
duplicate_rows

Unnamed: 0,clean_text,category


There exist no duplicate rows within our `DataFrame`.


When pandas loads a CSV file into a `DataFrame`, it infers numeric columns as `float64` when it encounters decimals or **`NaN`** values. Text values are loaded into an `object` column, which by default is the generic dtype for strings. We can check the dtype of our `DataFrame` columns through [`info()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html).


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162969 entries, 0 to 162979
Data columns (total 2 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   clean_text  162969 non-null  object 
 1   category    162969 non-null  float64
dtypes: float64(1), object(1)
memory usage: 3.7+ MB


We can see that the **`clean_text`** column has dtype `object` and the **`category`** column has dtype `float64`. To determine whether the columns are assigned the right data types, we check the unique values in each column.


In [7]:
for item in df["category"].unique():
    print(item)

-1.0
0.0
1.0


In [8]:
for item in df["clean_text"].unique()[:3]:
    print(item)

when modi promised “minimum government maximum governance” expected him begin the difficult job reforming the state why does take years get justice state should and not business and should exit psus and temples
talk all the nonsense and continue all the drama will vote for modi 
what did just say vote for modi  welcome bjp told you rahul the main campaigner for modi think modi should just relax


Now that we have seen the unique values of each column, we can say that neither column was assigned the most suitable data type: **`category`** holds only whole numbers, and **`clean_text`** holds only text.


We first convert the **`category`** column from `float64` to `int64`, since the values for a tweet's sentiment category (**`-1`**, **`0`**, **`1`**) will only ever be whole numbers. This step is done after dropping the rows with **`NaN`** values because **`NaN`** is a floating-point value and cannot be stored in an `int64` column.
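
The ordering matters because of how `NaN` is represented. The following is a minimal sketch using only the standard library; `to_int_or_none` is a hypothetical helper written for illustration and is not part of the project. pandas refuses the same kind of cast when non-finite values are present.

```python
import math

# NaN is a float value; Python's int type has no way to represent it.
nan = float("nan")
print(isinstance(nan, float))  # True
print(math.isnan(nan))         # True


def to_int_or_none(value: float):
    """Hypothetical helper: cast to int, or return None when the value is NaN."""
    try:
        return int(value)
    except ValueError:
        # int(float("nan")) raises ValueError, so NaN rows have to go first.
        return None


print([to_int_or_none(v) for v in (-1.0, 0.0, 1.0, nan)])  # [-1, 0, 1, None]
```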


In [9]:
df["category"] = df["category"].astype(int)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162969 entries, 0 to 162979
Data columns (total 2 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   clean_text  162969 non-null  object
 1   category    162969 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 3.7+ MB


After successfully converting the **`category`** column into `int64`, we next convert the **`clean_text`** column from the `object` type into the pandas-defined `string` type for consistency and better performance.


In [10]:
df["clean_text"] = df["clean_text"].astype("string")
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162969 entries, 0 to 162979
Data columns (total 2 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   clean_text  162969 non-null  string
 1   category    162969 non-null  int64 
dtypes: int64(1), string(1)
memory usage: 3.7 MB


We are now finished with the _initial_ data cleaning steps. This stage focused on standard issues that are common in public datasets. We now move on to our main cleaning pipeline, which focuses on cleaning the tweets themselves.


## **Main Cleaning Pipeline**

We follow a data cleaning methodology similar to the one presented in <u>(George & Murugesan, 2024)</u>.


### **Normalization**

Because the text consists of tweets, the presence of emojis and accented characters is to be expected. To see whether our data has these special characters, we search **`clean_text`** for a sample set of them and display some of the matching rows.
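
The character class used in the next cell spans the code points U+263A through U+1F645, which covers many emojis but not all of them. The sketch below illustrates that boundary with the standard `re` module; `has_sampled_emoji` is a hypothetical helper name used only here.

```python
import re

# Same character class as in the notebook cell: U+263A through U+1F645.
EMOJI_RANGE = re.compile(r"[\u263a-\U0001f645]")


def has_sampled_emoji(text: str) -> bool:
    """Return True if the text contains a code point inside the sampled range."""
    return EMOJI_RANGE.search(text) is not None


print(has_sampled_emoji("credit goes modi\u270c"))  # True: U+270C is inside the range
print(has_sampled_emoji("modithemed caf\u00e9"))    # False: U+00E9 is below the range
print(has_sampled_emoji("launch \U0001f680"))       # False: U+1F680 is above the range
```

This is why the search is described as a sample: characters outside the range are not reported by it.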


In [11]:
# Finding a sample of rows with emojis
rows_with_emojis = df[df["clean_text"].str.contains(r"[\u263a-\U0001f645]", regex=True)]
rows_with_emojis["clean_text"].sample(5)

65299                                                                                                                                                                        credit goes modi✌✌
46551                                                                                                                                                                       right said madam ☺️
133143    criminals love their number\n️stolen modireg tse 1563 section⚖️fir\ncheckout️places youll find daily harassmentcriminal instigatingtaunting\nyeah right its all coincidence\ntheres️ 
17203                                                                                                                        nirav modis paintings may fetch crore auction ndtv news ⚡buttler⚡ 
44710                                                                                                                                                            space war message from modi✌️ 
Name: clean_te

In [12]:
# Finding a sample of rows with accented characters
accented_char_rows = df[df["clean_text"].str.contains(r"É|é|Á|á|ó|Ó|ú|Ú|í|Í")]
accented_char_rows["clean_text"].sample(5)

156327    advani must ruing the day nourished his protégés bjp such modi arun jaitley venkaiah naidu sushma swaraj who serially betrayed him the longheld view that the kind politics you practise eventually catches with you 
23047                                                                                                                     unlikely titfortat istan darpok nikammé babus chorriforri crook donnie bullyfears strength look jago 
23608                                                                                          dinesh rodi ardent fan modi has opened rodi resto cafe themed modi tamil nadus thoothukudi take peep inside the modithemed café 
24641     sagara sangamam moment for komali haasan  just how many blows can the ulaga nalayagan take first his protégé madhavan backs modi now this komali haasan can always drown away his sorrows the teynampet tasmac store 
161501                                                                                            

Although these characters do serve as a form of emotional expression in a real-world context, we treat them as irrelevant to _textual_ sentiment analysis, so we normalize the text.


To normalize the text, the `normalize` function was created. It converts the input to lowercase, ASCII-only characters (say, "cómo estás" becomes "como estas"). Accented characters are reduced to their ASCII base letters, while Unicode characters with no ASCII equivalent (e.g., emojis) are replaced with the empty string (`''`).
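
The sketch below shows why accented letters survive as base letters while emojis disappear. It is a self-contained stand-in named `normalize_sketch` (a hypothetical name), not the project's `normalize`; the final lowercasing step is an assumption based on the function's docstring.

```python
import unicodedata


def normalize_sketch(text: str) -> str:
    """Illustrative stand-in for `normalize`; the lowercasing step is assumed."""
    # NFKD splits an accented letter into its base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Dropping non-ASCII characters removes the combining marks and anything
    # that has no ASCII base character, such as emojis.
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return ascii_only.lower()


print([hex(ord(c)) for c in unicodedata.normalize("NFKD", "é")])  # ['0x65', '0x301']
print(normalize_sketch("Cómo estás"))       # como estas
print(repr(normalize_sketch("modi \u270c")))  # 'modi '
```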


In [13]:
normalize??

[31mSignature:[39m normalize(text: str) -> str
[31mSource:[39m   
[38;5;28;01mdef[39;00m normalize(text: str) -> str:
    [33m"""[39m
[33m    Normalize text from a pandas entry to ASCII-only lowercase characters. Hence, this removes Unicode characters with no ASCII[39m
[33m    equivalent (e.g., emojis and CJKs).[39m

[33m    Do not use this function alone, use `clean_and_tokenize()`.[39m

[33m    # Parameters[39m
[33m    * text: String entry.[39m

[33m    # Returns[39m
[33m    ASCII-normalized text containing only lowercase letters.[39m

[33m    # Examples[39m
[33m    normalize("¬øC√≥mo est√°s?")[39m
[33m    $ 'como estas?'[39m

[33m    normalize(" hahahaha HUY! Kamusta üòÖ Mayaman $$$ ka na ba?")[39m
[33m    $ ' hahahaha huy! kamusta  mayaman $$$ ka na ba?'[39m
[33m    """[39m
    normalized = unicodedata.normalize([33m"NFKD"[39m, text)
    ascii_text = normalized.encode([33m"ascii"[39m, [33m"ignore"[39m).decode([33m"ascii"[39m)

    [38;5;2

### **Punctuations**

Punctuation is part of natural speech and reading: it gives sentences structure, clarity, and tone. In the context of a classification study, however, punctuation adds little information about the sentiment of a message. We treat the sentiment of `i hate you!` and `i hate you` as the same, even though the punctuation mark `!` is used to accentuate the sentiment. We can see a sample of rows with punctuation below.


In [14]:
# Finding a sample of rows with punctuation
rows_with_punc = df[df["clean_text"].str.contains(r"[^\w\s]")]
rows_with_punc["clean_text"].sample(5)

2404                                                                                                                 nonexhaustive list important data that the modi govt has not released doesn’ have via 
118497                                                                                                                congress equated modi stands for masood azhar osama bin laden dawood ibrahim and isi’
4698                                            yes the time has come\nbut 1st have receive ₹ lakh from otherwise will miss ₹ lakh get ₹72000 thank you sir will not leave modi till receive amount ₹ lakh 
111183    let india crore out 130 crore people are belonging below proverty then said that get 72000 now solution 20of 10cr 2cr ×72000 xxxxxx then what about people lets modi may take care remain people 
146677                                                                                                                                 ‘oppn scared chowkidar people trus

To address this, the function `rem_punctuation` was made, which replaces all punctuation marks and special characters with an empty string (`''`).
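
One detail worth illustrating is that an ASCII punctuation class does not cover typographic characters such as the curly apostrophe seen in the sample above. The sketch below assumes the character class is built from `string.punctuation`; the project's source is truncated in the output, so this is a guess, and `rem_punctuation_sketch` is a hypothetical name.

```python
import re
import string

# Assumption: the character class is built from `string.punctuation` (ASCII only).
PUNCT = re.compile(f"[{re.escape(string.punctuation)}]")


def rem_punctuation_sketch(text: str) -> str:
    """Remove ASCII punctuation without collapsing the surrounding whitespace."""
    return PUNCT.sub("", text)


print(rem_punctuation_sketch("these!words@have$no%space"))  # thesewordshavenospace
print(rem_punctuation_sketch("doesn’ have via"))            # doesn’ have via
```

Under that assumption, the curly apostrophe is only removed if ASCII normalization runs before punctuation removal.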


In [15]:
rem_punctuation??

[31mSignature:[39m rem_punctuation(text: str) -> str
[31mSource:[39m   
[38;5;28;01mdef[39;00m rem_punctuation(text: str) -> str:
    [33m"""[39m
[33m    Removes the punctuations. This function simply replaces all punctuation marks and special characters[39m
[33m    to the empty string. Hence, for symbols enclosed by whitespace, the whitespace are not collapsed to a single whitespace[39m
[33m    (for more information, see the examples).[39m

[33m    Do not use this function alone, use `clean_and_tokenize()`.[39m

[33m    # Parameters[39m
[33m    * text: String entry.[39m

[33m    # Returns[39m
[33m    Text with the punctuation removed.[39m

[33m    # Examples[39m
[33m    rem_punctuation("this word $$ has two spaces after it!")[39m
[33m    $ 'this word  has two spaces after it'[39m

[33m    rem_punctuation("these!words@have$no%space")[39m
[33m    $ 'thesewordshavenospace'[39m
[33m    """[39m
    [38;5;28;01mreturn[39;00m re.sub([33mf"[{re.escape(st

### **Numbers**

Similar to punctuations, numbers do not add any information to the sentiment of a message.


In [16]:
# Finding a sample of rows that contain numbers
rows_with_numbers = df[df["clean_text"].str.contains(r"\d")]
rows_with_numbers["clean_text"].sample(5)

122093         actually 1984 was bloody massacre and living never forgot they have been anticongress since then might have changed congress perception with his push for small but important step towards peace delaying til elections
102669                                                                                                                                                      off course stealing 30000 crores better than stealing 250 crores chor modi
141013    after losing ge2019 messers should follow footsteps gods rama laxmana taking jalsamaadhi committing suicide drowning oneself this will set scintillating example for generations aryanbrahminist politicians their followers
152527                                 lolak begging with party seat delhilol all these fellow were collie before 2014 and yes aap irrelevant even delhi nowif modi has done nothing than why the hell begging front rahul for lagbagh
100530                  gradually moving from modi baiting agendapolicy issu

Hence, we defined `rem_numbers` as a function that replaces all numerical values with an empty string (`''`).


In [17]:
rem_numbers??

[31mSignature:[39m rem_numbers(text: str) -> str
[31mSource:[39m   
[38;5;28;01mdef[39;00m rem_numbers(text: str) -> str:
    [33m"""[39m
[33m    Removes numbers. This function simply replaces all numerical symbols to the empty string. Hence, for symbols enclosed by[39m
[33m    whitespace, the whitespace are not collapsed to a single whitespace (for more information, see the examples).[39m

[33m    Do not use this function alone, use `clean_and_tokenize()`.[39m

[33m    # Parameters[39m
[33m    * text: String entry.[39m

[33m    # Returns[39m
[33m    Text with the numerical symbol removed[39m

[33m    # Examples[39m
[33m    rem_numbers(" h3llo, k4must4 k4  n4?")[39m
[33m    ' hllo, kmust k  n?'[39m
[33m    """[39m
    [38;5;28;01mreturn[39;00m re.sub([33mr"\d+"[39m, [33m""[39m, text)
[31mFile:[39m      ~/STINTSY-Order-of-Erin/lib/janitor.py
[31mType:[39m      function

### **Whitespace**

Similar to punctuation, extra whitespace does not add any information to the text and usually comes from user error. We check whether our data has runs of consecutive whitespace.


In [18]:
# Finding a sample of rows that contain 2 or more whitespaces in a row
rows_with_whitespaces = df[df["clean_text"].str.contains(r"\s{2,}")]
rows_with_whitespaces["clean_text"].sample(5)

106781                                                             those who criticise today arnab live interview with modi  challenge those people have guts tell face live interview with arnab goswami waiting hashtag 
45637            indias chowkidar narendra modi makes india capable hunting down its enemies land air sea and now space another addition india arsenal  india has entered its name elite space power under leadership modi
88894                                  this proves you congress are corrupt and congress with corruption scams policy paralysis indecisiveness shame this ideology which wants take india back into stone age namo again  
37519     dear failif insist him pay lakh promise which modi never gave might try meet this certain extentif comes back power remember also said that’ the kind money kept family swiss mauritius  will take and give poor
59339                                                                                    \ndidnt believe modi will space p

To address the problem, the function `collapse_whitespace` was made, which collapses every run of space characters into a single space. Formally, it is a transducer

$$
\Box^+ \mapsto \Box \qquad \text{where the space character is } \Box
$$

Informally, it replaces every run of consecutive spaces with a single space character. As its source shows, it also strips leading and trailing whitespace.
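
The distinction between the space character and whitespace in general matters for tweets that contain newlines. The sketch below compares the pattern `" +"` with a hypothetical `\s+` variant; both function names are illustrative and are not part of `lib/janitor.py`.

```python
import re


def collapse_spaces(text: str) -> str:
    # Runs of the space character only, then trim both ends.
    return re.sub(" +", " ", text).strip()


def collapse_all_whitespace(text: str) -> str:
    # Hypothetical variant: \s+ also matches tabs and newlines.
    return re.sub(r"\s+", " ", text).strip()


sample = "modi did  2014\nachhe   din "
print(repr(collapse_spaces(sample)))          # 'modi did 2014\nachhe din'
print(repr(collapse_all_whitespace(sample)))  # 'modi did 2014 achhe din'
```

With the space-only pattern, newline characters inside a tweet are left in place.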


In [19]:
collapse_whitespace??

[31mSignature:[39m collapse_whitespace(text: str) -> str
[31mSource:[39m   
[38;5;28;01mdef[39;00m collapse_whitespace(text: str) -> str:
    [33m"""[39m
[33m    This collapses whitespace. Here, collapsing means the transduction of all whitespace strings of any[39m
[33m    length to a whitespace string of unit length (e.g., "   " -> " "; formally " "+ -> " ").[39m

[33m    Do not use this function alone, use `clean_and_tokenize()`.[39m

[33m    # Parameters[39m
[33m    * text: String entry.[39m

[33m    # Returns[39m
[33m    Text with the whitespaces collapsed.[39m

[33m    # Examples[39m
[33m    collapse_whitespace("  huh,  was.  that!!! ")[39m
[33m    $ 'huh, was. that!!!'[39m
[33m    """[39m
    [38;5;28;01mreturn[39;00m re.sub([33m" +"[39m, [33m" "[39m, text).strip()
[31mFile:[39m      ~/STINTSY-Order-of-Erin/lib/janitor.py
[31mType:[39m      function

To call all of these cleaning functions in one step, we have the `clean` function, a wrapper that calls each of the separate components. Its definition is quite long; see [this appendix](#appendix:-clean-wrapper-function-definition).
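
For readers who want the overall shape without opening the appendix, the following is a rough sketch of how the four steps could be chained. `clean_sketch` is a hypothetical name, the step order follows the order of the subsections above, and the punctuation class is assumed to come from `string.punctuation`; the real `clean` may differ.

```python
import re
import string
import unicodedata


def clean_sketch(text: str) -> str:
    """Hypothetical stand-in for `clean`; the real wrapper is in the appendix."""
    # 1. Normalize to lowercase ASCII.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii").lower()
    # 2. Remove ASCII punctuation (assumed character class).
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)
    # 3. Remove digits.
    text = re.sub(r"\d+", "", text)
    # 4. Collapse runs of spaces and trim both ends.
    return re.sub(" +", " ", text).strip()


print(clean_sketch("Politics was NEVER about courtesy,  but what’s new since 2014?"))
# politics was never about courtesy but whats new since
```

Collapsing spaces last cleans up the gaps that the earlier removals leave behind.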

We can now clean the dataset and store the result in a new column named `clean_ours` (to differentiate it from the still-dirty `clean_text` column provided by the dataset author).


In [20]:
df["clean_ours"] = df["clean_text"].map(clean).astype("string")
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162969 entries, 0 to 162979
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   clean_text  162969 non-null  string
 1   category    162969 non-null  int64 
 2   clean_ours  162969 non-null  string
dtypes: int64(1), string(2)
memory usage: 5.0 MB


To confirm that the character cleaning worked, we compare `clean_text` and `clean_ours` on a sample of the rows that contain any of the characters targeted above.


In [21]:
example_rows = df[
    df["clean_text"].str.contains(r"\s{2,}|\d|[^\w\s]|[\u263a-\U0001f645]|[ÉéÁáóÓúÚíÍ]")
]
example_rows.sample(10)

Unnamed: 0,clean_text,category,clean_ours
94411,sees the biggest danger his design propaganda and deceit however must remember that don’ protect our freedoms they may lost forever must fight back this election save the soul india,-1,sees the biggest danger his design propaganda and deceit however must remember that don protect our freedoms they may lost forever must fight back this election save the soul india
29593,politics was never about courtesy but what’ happening since 2014 new heres what changed,1,politics was never about courtesy but what happening since new heres what changed
152631,you have any proof that modi had committed 1500000 what had actually talked why you are bluffing jumla rahul the grandson feroze,0,you have any proof that modi had committed what had actually talked why you are bluffing jumla rahul the grandson feroze
106890,this viscerally antimodi joker spewing out his bile will eat crow 23rd may,0,this viscerally antimodi joker spewing out his bile will eat crow rd may
133462,legendary singer lata mangeshkar releases song which recital the ‘saugandh mujhe mitti ’ poem which prime minister modi has often recited,1,legendary singer lata mangeshkar releases song which recital the saugandh mujhe mitti poem which prime minister modi has often recited
5678,the last four years not even single legislative step has been taken the central government protect the environment” said lawyer ritwick dutta ”every single law related environment being diluted which will make urban areas unliveable\n,-1,the last four years not even single legislative step has been taken the central government protect the environment said lawyer ritwick dutta every single law related environment being diluted which will make urban areas unliveable
95979,thats what modi did 2014\nrararara achhe din aayenge,0,thats what modi did \nrararara achhe din aayenge
17308,same when promise give 72000 but you are little think people and you could not think about modis thinking\nwhere your brains nerves stop workingfrom there starts think about powerful india,1,same when promise give but you are little think people and you could not think about modis thinking\nwhere your brains nerves stop workingfrom there starts think about powerful india
21683,useful chart modi still popular mprajjharkhand but less gujkarnataka maharashtra etc punjab just tweeted ’ orissa 625 but 649 236cm 413 wbengal 432cm 456 439cm 222 game,1,useful chart modi still popular mprajjharkhand but less gujkarnataka maharashtra etc punjab just tweeted orissa but cm wbengal cm cm game
53684,wait and see its not over yet more come \nmodi here for least more years and manmohan innocent trees bring real smile face media yet why,1,wait and see its not over yet more come \nmodi here for least more years and manmohan innocent trees bring real smile face media yet why


We are now finished with basic text cleaning, but the data cleaning does not end here. Given that the text is sourced from Twitter, it includes characteristics, such as spam and informal expressions, which are not addressed by basic cleaning methods. As a result, we move on to further cleaning tailored to the nature of Twitter data.


### **Spam, Expressions, Onomatopoeia, etc.**

Since the domain of the corpus is Twitter, spam (e.g., `bbbb`), expressions (e.g., `bruhhhh`), and onomatopoeia (e.g., `hahahaha`) may become an issue at the vector representation step. Hence, we employ a simple rule-based spam removal algorithm.

We remove words that contain the same character three or more times in a row, or a substring that is immediately repeated. These are detected using regular expressions:

$$
\text{same\_char\_thrice} := (.)\textbackslash1^{\{2,\}}
$$

and

$$
\text{same\_substring\_twice} := (.^+)\textbackslash1^+
$$

Furthermore, we also remove any word shorter than three characters, since these are either stopwords (that may slip past the stopword removal stage) or more spam.
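
The rules so far can be sketched as a word-level filter. This is illustrative only and is not the project's `find_spam_and_empty`; `drop_spammy_words` is a hypothetical name. One assumption is labeled in the code: the substring rule is anchored at the start of each word, which is what the sample output further down suggests (words such as `off` survive, while `kaka` is removed).

```python
import re
from typing import Optional

SAME_CHAR_THRICE = re.compile(r"(.)\1{2,}")
SAME_SUBSTRING_TWICE = re.compile(r"(.+)\1+")


def drop_spammy_words(text: str) -> Optional[str]:
    """Illustrative filter; returns None when no word survives."""
    kept = []
    for word in text.split():
        if len(word) < 3:
            continue  # shorter than three characters
        if SAME_CHAR_THRICE.search(word):
            continue  # same character three or more times in a row
        # Assumption: the repeated-substring rule is anchored at the word start.
        if SAME_SUBSTRING_TWICE.match(word):
            continue  # word starts with an immediately repeated substring
        kept.append(word)
    return " ".join(kept) if kept else None


print(drop_spammy_words("amazing hats off you modi kaka"))  # amazing hats off you modi
print(drop_spammy_words("whattttttt"))                      # None
```

Returning `None` for an emptied entry mirrors the way fully filtered cells show up as missing values later in this section.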

Finally, we employ an adaptive character diversity threshold. A string $s$ is removed when

$$
\frac{\texttt{\#\_unique\_chars}(s)}{|s|} < 0.3 + \left(\frac{0.1 \cdot \text{min}(|s|, 10)}{10}\right)
$$

The left-hand side measures the diversity of characters in a string. If the string repeats the same characters a lot, we expect it to be unintelligible or useless, hence we remove it.
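
As a worked example of the inequality above, the sketch below computes both sides for a few strings. The function names are hypothetical, and treating the empty string as low-diversity is an assumption made only to avoid a division by zero.

```python
def diversity_threshold(length: int) -> float:
    # 0.3 for very short strings, rising linearly to 0.4 at length 10 and above.
    return 0.3 + 0.1 * min(length, 10) / 10


def is_low_diversity(s: str) -> bool:
    """True when the share of unique characters falls below the threshold."""
    if not s:
        return True  # assumption: an empty string carries no information
    return len(set(s)) / len(s) < diversity_threshold(len(s))


for s in ("modi", "narendra", "aaaaaaaaab"):
    ratio = round(len(set(s)) / len(s), 3)
    limit = round(diversity_threshold(len(s)), 3)
    print(s, ratio, limit, is_low_diversity(s))
# modi 1.0 0.34 False
# narendra 0.625 0.38 False
# aaaaaaaaab 0.2 0.4 True
```

The threshold grows with length because longer strings have more room to repeat characters legitimately, up to the cap at ten characters.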

The definition of the wrapper function `find_spam_and_empty` is quite long; see [this appendix](#appendix:-find_spam_and_empty-wrapper-function-definition).

Let's first look at a random sample of 10 entries from the dataset that will be modified by the function.


In [22]:
affected = df[df["clean_ours"].apply(spam_affected)]
affected_sample = affected["clean_ours"].sample(10)
affected_sample

4463                                                                                                                                                                                      can identify slave did ever modi said l that rafool and his pidis said
116058                                                                                                                                                                                    aap files complaint with against narendra modi for violating poll code
65237                                                                                                                                          year old tejasvi surya candidate for bangalore south place late anantkumar inspiration for youth jai bjp jai modi
18013                     strange logic since death rajiv gandhi theirs gandhi family minister they have given the best till date pvn then mms both were intellectuals non politician both steered india new economic aspersions whic

Let's now call this function on the `clean_ours` column of the dataset.


In [23]:
df["clean_ours"] = df["clean_ours"].map(find_spam_and_empty).astype("string")

To confirm that the function removed the spammy substrings, we compare the `before` and `after` columns.


In [24]:
comparison = pd.DataFrame({"before": affected_sample, "after": df["clean_ours"]})

changed = comparison[comparison["before"] != comparison["after"]]
changed.sample(10)

Unnamed: 0,before,after
116058,aap files complaint with against narendra modi for violating poll code,files complaint with against narendra modi for violating poll code
4463,can identify slave did ever modi said l that rafool and his pidis said,can identify slave did ever modi said that rafool and his pidis said
45730,amazing hats off you modi kaka,amazing hats off you modi
76428,yes before modi had army and drdo cheap politics ever history india cashing everything for politics sensible ppl except bhakt should think and reward him,yes before modi had army and drdo cheap politics ever history india cashing everything for politics sensible except bhakt should think and reward him
18013,strange logic since death rajiv gandhi theirs gandhi family minister they have given the best till date pvn then mms both were intellectuals non politician both steered india new economic aspersions which destroyed past years modi,strange logic since death rajiv gandhi theirs gandhi family minister they have given the best till date pvn then both were intellectuals non politician both steered india new economic aspersions which destroyed past years modi
125544,you said that you told that for defeat modi you ready any job\npls resigh gaunfor farming\nv hindu hates you\ntum gaddar,you said that you told that for defeat modi you ready any job pls resigh gaunfor farming hindu hates you tum gaddar
81216,modi doesnt put ppl who abuse him jail unlike happy abuse away its karma,modi doesnt put who abuse him jail unlike happy abuse away its karma
117605,another indian govt lie how can sophisticated air defense system hit own aircraft pakistani fighters shot down this chopper too russian air defense system awesome maybe modi sarkar looking for discounted price for their next deal with russian s,another indian govt lie how can sophisticated air defense system hit own aircraft pakistani fighters shot down this chopper too russian air defense system awesome maybe modi sarkar looking for discounted price for their next deal with russian
65237,year old tejasvi surya candidate for bangalore south place late anantkumar inspiration for youth jai bjp jai modi,year old tejasvi surya candidate for bangalore south place late inspiration for youth jai bjp jai modi
103716,actually anandabazar become mouth pice pakistani pakistani supporter desh gaddar news paper mein anandabazar will write golden word continue propaganda against india modi birodh aur desh birodh mein difference pata hai fir desh birodhi propaganda,actually become mouth pice pakistani pakistani supporter desh gaddar news paper mein will write golden word continue propaganda against india modi birodh aur desh birodh mein difference pata hai fir desh birodhi propaganda


Let's examine whether applying this function has caused any significant changes to the DataFrame structure, given that it can convert entire cells to `NaN`.


In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162969 entries, 0 to 162979
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   clean_text  162969 non-null  string
 1   category    162969 non-null  int64 
 2   clean_ours  162942 non-null  string
dtypes: int64(1), string(2)
memory usage: 5.0 MB


The DataFrame structure is intact, but **`clean_ours`** now has 27 fewer non-null values, reflecting cells that were entirely filtered out as spam as seen below.


In [26]:
spam_rows = df[df["clean_ours"].isna()]
spam_rows[["clean_text", "clean_ours"]]

Unnamed: 0,clean_text,clean_ours
21806,bjpmpsubramanianswamyiamchowkidarcampaignpmmodi,
21855,terrorfundinghurriyatleaderspropertyseizedhafizsaeedmodigovt,
24148,pmnarendramodirequestsofexservicemanindianarmyhavildarombirsinghsharma9258,
35636,2019,
35866,‚Äç,
35968,whattttttt,
37837,allllll,
40587,1145am,
40977,⌚1145 ❤,
48127,birthdaaaaaay,


## **Post-Cleaning Steps**

During the cleaning stage, some entries could have been reduced to `NaN` or the empty string `""`, and duplicates could have been introduced again. So, let's call `dropna` and `drop_duplicates` once more to finalize the cleaning stage.


In [27]:
df = df.dropna()
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162942 entries, 0 to 162979
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   clean_text  162942 non-null  string
 1   category    162942 non-null  int64 
 2   clean_ours  162942 non-null  string
dtypes: int64(1), string(2)
memory usage: 5.0 MB


In [28]:
df = df.drop_duplicates()
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162942 entries, 0 to 162979
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   clean_text  162942 non-null  string
 1   category    162942 non-null  int64 
 2   clean_ours  162942 non-null  string
dtypes: int64(1), string(2)
memory usage: 5.0 MB
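
One caveat worth checking at this point: `dropna` only removes missing values, so an entry reduced to the empty string `""` would survive it. The sketch below uses made-up toy data rather than the notebook's `df`, and shows one possible way to turn blank strings into missing values before dropping rows. The `toy` frame and the names `is_blank` and `normalized` are illustrative assumptions, not part of the original pipeline.

```python
import pandas as pd

# Toy data (hypothetical): one blank string, one missing value, one duplicate row.
toy = pd.DataFrame(
    {
        "clean_text": ["good day", "!!!", "2019", "good day"],
        "category": [1, 0, 0, 1],
        "clean_ours": ["good day", "", None, "good day"],
    }
)

# dropna removes the row holding None, but the row holding "" is kept.
after_dropna = toy.dropna()

# Mark blank or whitespace-only strings, then replace them with a missing value.
normalized = toy.copy()
is_blank = normalized["clean_ours"].str.strip().eq("")
normalized["clean_ours"] = normalized["clean_ours"].mask(is_blank)

# Now dropna removes both problem rows, and drop_duplicates removes the repeat.
finalized = normalized.dropna().drop_duplicates()

print(len(after_dropna), len(finalized))
```

If the cleaning functions already return `NaN` instead of `""`, this extra normalization is unnecessary and the two calls in the cells above are sufficient.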


# **3. Preprocessing**

> WIP Narrative and Sequence
> 🏗️ Perhaps swap S3 and S4. Refer to literature on what comes first.

This section discusses the preprocessing steps applied to the cleaned data. Because the goal is to analyze the sentiment of tweets, these steps are needed to give the Bag of Words (BoW) model the information it requires to build a vector representation of each tweet.

After each preprocessing step, we show a random sample of entries from the dataset to illustrate the effect of that step.

## **Lemmatization**

We follow a data cleaning methodology similar to the one presented in <u>(George & Murugesan, 2024)</u> and preprocess the dataset entries via lemmatization. For this step, we use spaCy's `en_core_web_sm` (version 3.8.0), a pretrained language pipeline for English <u>(Honnibal et al., 2020)</u>.

In [None]:
df["lemmatized"] = df["clean_ours"].progress_apply(lemmatizer)
df.sample(10)

  0%|          | 0/162942 [00:00<?, ?it/s]

## **Stop Word Removal**

After lemmatization, we can remove the stop words present in the dataset. Stop word removal _needs_ to come after lemmatization because it requires all words to be reduced to their base dictionary form: the `stopwords_set` only contains the base dictionary forms of the stopwords.

**stopwords.** For stop word removal, we refer to the English stopword lists defined in NLTK and Wolfram Mathematica <u>(Bird & Loper, 2004; Wolfram Research, 2015)</u>. However, since the task is sentiment analysis, words that convey polarity, intensification, or negation are important, and words like "not" and "okay" are commonly included in stopword lists. We therefore manually adjusted the stopwords from <u>(Bird & Loper, 2004; Wolfram Research, 2015)</u> to include only neutral words, such as "after", "when", and "you".
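
The definition of `rem_stopwords` is not shown in this section, so the following is only a minimal sketch of how such a filter could work. The function name `remove_stopwords_sketch` and the tiny `NEUTRAL_STOPWORDS` set are hypothetical; the notebook's real `stopwords_set` is assumed to be much larger.

```python
from typing import Optional, Set

# Hypothetical neutral stopwords. Negation and polarity words such as
# "not" and "okay" are deliberately left out so that they are kept.
NEUTRAL_STOPWORDS: Set[str] = {"after", "when", "you", "the", "a", "be", "to"}


def remove_stopwords_sketch(text: str, stopwords: Set[str]) -> Optional[str]:
    """Drop stopwords from a space-separated, lemmatized string.

    Returns None when no token is left, so that a later `dropna`
    call can discard the row.
    """
    kept = [token for token in text.split() if token not in stopwords]
    return " ".join(kept) if kept else None


example = remove_stopwords_sketch("you be not okay after the vote", NEUTRAL_STOPWORDS)
emptied = remove_stopwords_sketch("when you", NEUTRAL_STOPWORDS)
print(example)  # not okay vote
print(emptied)  # None
```

Returning `None` for fully emptied entries is a design assumption that matches the `dropna(subset=["lemmatized"])` call in the next cell.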

In [None]:
df["lemmatized"] = df["lemmatized"].map(lambda t: rem_stopwords(t, stopwords_set))
df = df.dropna(subset=["lemmatized"])
df.sample(10)

After preprocessing, the dataset now contains:


In [None]:
df.info()

Here are 5 randomly picked entries in the dataframe with all columns shown for comparison.


In [None]:
display(df.sample(5))

## **Tokenization**

Since the data cleaning and preprocessing stage is comprehensive, the tokenization step in the BoW model reduces to a simple word-boundary split operation. Each preprocessed entry in the DataFrame is split by spaces. For example, the entry `"shri narendra modis"` (entry: 42052) becomes `["shri", "narendra", "modis"]`. By the end of tokenization, all entries are transformed into arrays of strings.
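
As a minimal illustration of the split described above (the helper name `tokenize_sketch` is hypothetical):

```python
from typing import List


def tokenize_sketch(entry: str) -> List[str]:
    """Split a preprocessed entry on whitespace."""
    return entry.split()


tokens = tokenize_sketch("shri narendra modis")
print(tokens)  # ['shri', 'narendra', 'modis']
```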

## **Word Bigrams**

As noted earlier, modifiers and polarity words are not included in the stopword set. The BoW model constructs a vocabulary containing both unigrams and bigrams. Including bigrams allows the model to capture common word patterns, such as

$$
\left\langle \texttt{Adj}\right\rangle \left\langle \texttt{M} \mid \texttt{Pron} \right\rangle
$$

<center>or</center>

$$
\left\langle \texttt{Adv}\right\rangle \left\langle \texttt{V} \mid \texttt{Adj} \mid \texttt{Adv} \right\rangle
$$
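
To make the unigram-plus-bigram vocabulary concrete, here is a small sketch of how adjacent tokens can be paired. The function name `unigrams_and_bigrams` is hypothetical; in the notebook this work is assumed to be delegated to the vectorizer rather than done by hand.

```python
from typing import List


def unigrams_and_bigrams(tokens: List[str]) -> List[str]:
    """Return the tokens followed by every pair of adjacent tokens."""
    bigrams = [f"{left} {right}" for left, right in zip(tokens, tokens[1:])]
    return tokens + bigrams


features = unigrams_and_bigrams(["not", "very", "good", "leader"])
print(features)
# ['not', 'very', 'good', 'leader', 'not very', 'very good', 'good leader']
```

Because "not" and "very" were kept out of the stopword set, bigrams such as "not very" and "very good" remain available as features.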

## **Vector Representation**

After the lemmatization and stop word removal steps, each entry can now be represented as a vector using a Bag of Words (BoW) model. We employ scikit-learn's `CountVectorizer`, which provides a ready-to-use implementation of BoW <u>(Pedregosa et al., 2011)</u>.

A comparison with other traditional vector representations is discussed in [this appendix](#appendix:-comparison-of-traditional-vectorization-techniques).

Because bigrams keep modifiers attached to the words they modify, subsequent models can capture the modification directly. Consequently, after tokenization and bigram construction, the vocabulary can grow to as many as $O(n^2)$ entries, where $n$ is the number of unique tokens.

**minimum document frequency constraint:** Despite cleaning and spam removal, some tokens remain irrelevant or too rare. To address this, a minimum document frequency constraint is applied: $\texttt{min\_df} = 10$, meaning a token must appear in at least 10 documents to be included in the BoW vocabulary. This reduces noise and ensures the model focuses on meaningful terms.
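
The effect of a minimum document frequency can be sketched in plain Python. This is an illustration of the idea only, not the notebook's implementation; the function name `vocabulary_with_min_df` and the toy documents are made up, and a threshold of 2 is used so the toy corpus stays small.

```python
from collections import Counter
from typing import Iterable, List


def vocabulary_with_min_df(docs: Iterable[List[str]], min_df: int) -> List[str]:
    """Keep tokens that occur in at least `min_df` different documents."""
    doc_counts: Counter = Counter()
    for tokens in docs:
        # set() ensures a token is counted once per document,
        # no matter how often it repeats inside that document.
        doc_counts.update(set(tokens))
    return sorted(token for token, count in doc_counts.items() if count >= min_df)


docs = [
    ["modi", "good", "good"],
    ["modi", "bad"],
    ["modi", "good"],
    ["rare"],
]
vocab = vocabulary_with_min_df(docs, min_df=2)
print(vocab)  # ['good', 'modi']
```

Note that "good" appears three times overall but only in two documents; it is the document count that is compared against the threshold.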

---

These parameters of the BoW model are encapsulated in the `BagOfWordsModel` class. The class definition is available in [this appendix](#appendix:-BagOfWordsModel-class-definition).

In [None]:
bow = BagOfWordsModel(
    texts=df["lemmatized"],   # preprocessed documents (one string per tweet)
    min_freq=10,              # words must appear in at least 10 different documents to be included
)

# some sanity checks
assert (
    bow.matrix.shape[0] == df.shape[0]
), "number of rows in the matrix DOES NOT match the number of documents"
assert bow.sparsity, "the sparsity is TOO HIGH, something went wrong"

The warning above is expected. Recall that our tokenization step reduces to a simple split on spaces, so the `tokenizer` attribute of `BagOfWordsModel` is set to override the default tokenization pattern, and overriding it is what triggers this warning.
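
For reference, scikit-learn's `CountVectorizer` emits a warning of this kind when a custom `tokenizer` is supplied while `token_pattern` keeps its default value. The sketch below reproduces it on a toy corpus and shows that passing `token_pattern=None` silences it. How `BagOfWordsModel` configures its vectorizer internally is not shown here, so the parameters below are assumptions.

```python
import warnings

from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical), already cleaned and space-separated.
docs = ["modi good leader", "modi not good", "bjp rally modi"]

# Custom tokenizer with the default token_pattern: a warning is raised on fit.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    noisy = CountVectorizer(tokenizer=str.split, ngram_range=(1, 2))
    noisy.fit(docs)

# Same setup, but token_pattern=None states that the pattern is unused.
with warnings.catch_warnings(record=True) as caught_quiet:
    warnings.simplefilter("always")
    quiet = CountVectorizer(tokenizer=str.split, token_pattern=None, ngram_range=(1, 2))
    matrix = quiet.fit_transform(docs)

print([str(w.message) for w in caught])
print(matrix.shape)  # (documents, unigrams + bigrams)
```

Either variant produces the same vocabulary; the only difference is whether the warning is shown.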


### **Model Metrics**

To get an idea of the model, we now look at its shape and sparsity. The shape gives the number of documents and the number of tokens in the model, while the sparsity is the proportion of elements in the matrix that are zero, which indicates how rarely documents share the same tokens.


The resulting matrix has a shape of


In [None]:
bow.matrix.shape

The first entry of the pair is the number of documents (the ones that remain after all the data cleaning and preprocessing steps) and the second entry is the number of tokens (or unique words in the vocabulary).

The resulting model has a sparsity of


In [None]:
1 - bow.sparsity

The model is 99.95% sparse, meaning that each tweet uses only a tiny fraction of the vocabulary: tweets rarely share the same words, which leads to a large vocabulary.
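
The sparsity of a document-term matrix can be computed as one minus the fraction of non-zero cells. The sketch below assumes a SciPy sparse matrix like the one `CountVectorizer` returns; the helper name `sparsity_of` is hypothetical, and the way `bow.sparsity` is defined inside `BagOfWordsModel` is not shown in this section.

```python
from scipy.sparse import csr_matrix


def sparsity_of(matrix) -> float:
    """Fraction of cells in a sparse matrix that are zero."""
    rows, cols = matrix.shape
    return 1.0 - matrix.nnz / (rows * cols)


# Toy 3 x 4 document-term matrix with 3 non-zero cells out of 12.
toy_matrix = csr_matrix([[1, 0, 0, 2], [0, 0, 3, 0], [0, 0, 0, 0]])
value = sparsity_of(toy_matrix)
print(value)  # 0.75
```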


We now look at the most frequent and least frequent terms in the model.


In [None]:
doc_frequencies = np.asarray((bow.matrix > 0).sum(axis=0)).flatten()
freq_order = np.argsort(doc_frequencies)[::-1]
bow.feature_names[freq_order[:50]]

We see that the main talking point of the tweets is Indian politics, with keywords like "modi", "india", and "bjp". For additional context, "bjp" refers to the _Bharatiya Janata Party_, a conservative political party in India and one of the two major Indian political parties.

To better understand these, we can check the wordcloud generated from the model.

In [None]:
wc = WordCloud(width=800, height=400, background_color="white", min_font_size=10).generate(" ".join(bow.feature_names))
plt.figure(figsize=(10,5))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()

Next, we look at the least frequent terms.


In [None]:
bow.feature_names[freq_order[-50:]]

The themes seen in the most frequent terms are still present in this subset, although filler or non-distinct words such as "photos", "soft", and "types" appear more often.

Still, the presence of words like "reelection" and "wars" indicates that this subset remains relevant to the main theme of the dataset.


# **4. Exploratory Data Analysis**

This section discusses the exploratory data analysis conducted on the dataset after cleaning.

> Notes from Zhean: <br>
> From manual checking via OpenRefine, there are a total of 162972. `df.info()` should have the same result post-processing.
> Furthermore, there should be two columns: `clean_text` (a bit of a misnomer, since it is still dirty) contains the tweets (text data), and `category` contains the sentiment of the tweet as a tribool (1 for positive, 0 for neutral or indeterminate, and -1 for negative).


Now that we have our clean, lemmatized tweets, we can work with a new DataFrame containing only the **`lemmatized`** and **`category`** columns.


In [None]:
df_cleaned = df.copy()
df_cleaned = df_cleaned.drop(["clean_text", "clean_ours"], axis=1)

df_cleaned = df_cleaned[["lemmatized", "category"]]  # for column reordering

df_cleaned

Because we will be splitting this dataset later, we need to know whether the distribution of the categories is balanced. An imbalanced distribution may bias a model toward the majority class. Understanding the distribution tells us whether stratified splitting is necessary so that no class is under- or overrepresented in any split.

We'll be using a bar graph as that is the simplest way for us to see the differences between the categorical data.


In [None]:
count = df_cleaned["category"].value_counts()

plt.title("Sentiment Labels Distribution")

count.plot(kind="bar")
plt.xlabel("Sentiment")
plt.xticks(rotation=0)

plt.ylabel("Count")

plt.grid(axis="y", linestyle="--", alpha=0.8)  # horizontal lines

plt.show()

We can see that there is a noticeable difference between the three classes. The positive class (1) has a count of over 70,000, the neutral class (0) has around 55,000, and the negative class (-1) has around 30,000.

This imbalance indicates that we must use stratified splitting in the later section.


# **5. Dataset Splitting**

Before being able to use the dataset, we need to partition it into three sets:

1. **Training** - used to train the model to learn and change its parameters
2. **Validation** - used to evaluate the model, comparing its predictions to correct answers for hyperparameter tuning
3. **Test** - used to test the model with new, unseen data

The following section will be dedicated solely to splitting the dataset. We will split the dataset with 70% for training, 15% for validation, and 15% for testing as this is a standard partitioning.

## **Splitting the dataset into Training, Validation, and Testing sets**

We'll first split the dataset into 70% and 30% parts using scikit-learn's `train_test_split` function. As mentioned earlier, the distribution of categories is imbalanced, so we use the function's `stratify` parameter to preserve the class proportions in both parts.


In [None]:
train, temp = train_test_split(
    df_cleaned, test_size=0.3, stratify=df_cleaned["category"], random_state=5
)  # 70/30 split

print(train.shape, temp.shape)

We now have our two sets for training and testing, but we're still missing one more for validation. We can split the 30% part into two halves of 15% so that we have a part for validation and the other part for testing.


In [None]:
validation, test = train_test_split(
    temp, test_size=0.5, stratify=temp["category"], random_state=5
)  # 15/15 split

print(train.shape, validation.shape, test.shape)

Now that we have our training, validation, and testing sets, we can use these on the models.
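
One way to confirm that stratification worked is to compare the class proportions of each split against the full dataset. The sketch below does this on made-up toy data with the same 70/15/15 procedure; the `toy_*` names and the class counts are illustrative assumptions, and the same `value_counts(normalize=True)` check could be run on the real splits.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced dataset (hypothetical): 100 positive, 60 neutral, 40 negative.
toy = pd.DataFrame(
    {
        "lemmatized": [f"tweet {i}" for i in range(200)],
        "category": [1] * 100 + [0] * 60 + [-1] * 40,
    }
)

# Same two-stage procedure as above: 70/30, then the 30% part split in half.
toy_train, toy_temp = train_test_split(
    toy, test_size=0.3, stratify=toy["category"], random_state=5
)
toy_validation, toy_test = train_test_split(
    toy_temp, test_size=0.5, stratify=toy_temp["category"], random_state=5
)

# One column of class proportions per split, indexed by category label.
parts = [("full", toy), ("train", toy_train), ("validation", toy_validation), ("test", toy_test)]
proportions = pd.DataFrame(
    {name: part["category"].value_counts(normalize=True) for name, part in parts}
)
print(proportions.round(3))
```

If the columns of `proportions` are close to one another, each split mirrors the class balance of the full dataset.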


# **References**

Bird, S., & Loper, E. (2004, July). NLTK: The natural language toolkit. _Proceedings of the ACL Interactive Poster and Demonstration Sessions_, 214–217. https://aclanthology.org/P04-3031/

El-Demerdash, A. A., Hussein, S. E., & Zaki, J. F. W. (2021). Course evaluation based on deep learning and SSA hyperparameters optimization. _Computers, Materials & Continua, 71_(1), 941–959. https://doi.org/10.32604/cmc.2022.021839

George, M., & Murugesan, R. (2024). Improving sentiment analysis of financial news headlines using hybrid Word2Vec-TFIDF feature extraction technique. _Procedia Computer Science, 244_, 1–8.

Honnibal, M., Montani, I., Van Landeghem, S., & Boyd, A. (2020). spaCy: Industrial-strength Natural Language Processing in Python. https://doi.org/10.5281/zenodo.1212303

Hussein, S. (2021). _Twitter sentiments dataset_ [Dataset]. Mendeley. https://doi.org/10.17632/Z9ZW7NT5H2.1

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. _Journal of Machine Learning Research, 12_, 2825–2830.

Rani, D., Kumar, R., & Chauhan, N. (2022, October). Study and comparison of vectorization techniques used in text classification. In _2022 13th International Conference on Computing Communication and Networking Technologies (ICCCNT)_ (pp. 1–6). IEEE.

Wolfram Research. (2015). _DeleteStopwords_. https://reference.wolfram.com/language/ref/DeleteStopwords.html

# **Appendix: `clean` wrapper function definition**

Below is the definition of the `clean` wrapper function that encapsulates all internal functions used in the cleaning pipeline.


In [None]:
clean??

# **Appendix: `find_spam_and_empty` wrapper function definition**

Below is the definition of the `find_spam_and_empty` wrapper function that encapsulates all internal functions for the spam detection algorithm.


In [None]:
find_spam_and_empty??

# **Appendix: comparison of traditional vectorization techniques**

Traditional vectorization techniques include BoW and Term Frequency-Inverse Document Frequency (TF-IDF). TF-IDF weights each word based on its frequency in a document and its rarity across the corpus, reducing the impact of common words. BoW, in contrast, simply counts word occurrences without considering corpus-level frequency. In this project, BoW was chosen because stopwords were already removed during preprocessing, and the dataset is domain-specific <u>(Rani et al., 2022)</u>. In such datasets, frequent words are often meaningful domain keywords, so scaling them down (as TF-IDF would) could reduce the importance of these key terms in the feature representation.


# **Appendix: `BagOfWordsModel` class definition**

Below is the definition of the `BagOfWordsModel` class that encapsulates the desired parameters.


In [None]:
BagOfWordsModel??