# Sentiment Analysis of Twitter Posts

<!-- Notebook name goes here -->
<center><b>Notebook: Data Description, Cleaning, Exploratory Data Analysis, and Preprocessing</b></center>
<br>

**by**: Stephen Borja, Justin Ching, Erin Chua, and Zhean Ganituen.

**dataset**: Hussein, S. (2021). Twitter Sentiments Dataset [Dataset]. Mendeley. https://doi.org/10.17632/Z9ZW7NT5H2.1

**motivation**: Every minute, social media users generate a large influx of textual data on live events. Performing sentiment analysis on this data provides a real-time view of public perception, enabling quick insights into the general population's opinions and reactions.

**goal**: By the end of the project, our goal is to create and compare supervised learning algorithms for sentiment analysis.

### **dataset description**

The Twitter Sentiments Dataset contains nearly 163k tweets from Twitter. The period over which these were collected is unknown, but the dataset was published on Mendeley Data on May 14, 2021 by Sherif Hussein of Mansoura University.

Tweets were extracted using the Twitter API, but how the tweets were selected is not specified. The tweets are mostly in English, with some Hindi words mixed in through code-switching <u>(El-Demerdash, 2021)</u>. All of them appear to discuss the political state of India. Most tweets mention Narendra Modi, the current Prime Minister of India.

Each tweet was labeled automatically using TextBlob's sentiment analysis <u>(El-Demerdash, Hussein, & Zaki, 2021)</u>.

Twitter_Data

- **`clean_text`**: The tweet's text
- **`category`**: The tweet's sentiment category

What each row and column represents: `each row represents one tweet.` <br>
Number of observations: `162,980`

---

<a name="cite_note-1"></a>1. [^](#cite_ref-1) Code-switching is the practice of alternating between two languages $L_1$ (the native language) and $L_2$ (the source language) in a conversation. In this context, the code-switching is done to appear more casual since the conversation is done via Twitter (now, X).


## **1. Project Set-up**

We set the global imports for the project here (ensure that these packages are installed via uv and are part of the environment). We also load the dataset.


In [175]:
import pandas as pd
import numpy as np
import os
import sys

# Use lib directory
sys.path.append(os.path.abspath("../lib"))

# Imports from lib files
from janitor import *
from lemmatize import lemmatizer
from boilerplate import stopwords_set
from bag_of_words import BagOfWordsModel

# Pandas configuration
pd.set_option("display.max_colwidth", None)

# Load raw data file
df = pd.read_csv("../data/Twitter_Data.csv")

## **2. Data Cleaning**

This section discusses the methodology for data cleaning.




To avoid wasting computation, a preliminary step is to ensure that no `NaN` or duplicate entries exist before the main cleaning steps. We call `.info()` after each step to see how the number of rows in our DataFrame changes.

In [176]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 162980 entries, 0 to 162979
Data columns (total 2 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   clean_text  162976 non-null  object 
 1   category    162973 non-null  float64
dtypes: float64(1), object(1)
memory usage: 2.5+ MB


The non-null counts of `clean_text` and `category` are both lower than the total number of entries, so our first step is to drop the `NaN` entries. We first check which rows contain a `NaN` in either column.

In [177]:
NaN_rows = df[df.isna().any(axis=1)]
NaN_rows

Unnamed: 0,clean_text,category
148,,0.0
130448,the foundation stone northeast gas grid inaugurated modi came major,
155642,dear terrorists you can run but you cant hide are giving more years modi which you won’ see you,
155698,offense the best defence with mission shakti modi has again proved why the real chowkidar our,
155770,have always heard politicians backing out their promises but modi has been fulfilling his each every,
158693,modi government plans felicitate the faceless nameless warriors india totally deserved,
158694,,-1.0
159442,chidambaram gives praises modinomics,
159443,,0.0
160559,the reason why modi contested from seats 2014 and the real reason why rahul doing the same now,


As expected from the non-null counts (4 missing values in `clean_text` and 7 in `category`), there are a total of 11 rows that have `NaN` values. We drop them to preserve the integrity and accuracy of our data analysis.
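As a quick cross-check of such a count, the per-column and per-row tallies of missing values can be computed directly. The sketch below uses a small toy DataFrame (the values are made up for illustration and are not taken from the dataset) and only standard pandas calls.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only).
toy = pd.DataFrame(
    {
        "clean_text": ["modi wins", None, "good policy", "bad policy"],
        "category": [1.0, 0.0, np.nan, -1.0],
    }
)

# Number of missing values in each column.
nan_per_column = toy.isna().sum()

# Number of rows with at least one missing value; dropna() removes exactly these.
rows_with_nan = int(toy.isna().any(axis=1).sum())

print(nan_per_column.to_dict())
print(rows_with_nan)
```

When no row is missing both columns at once, the per-column counts add up to the number of rows that `dropna()` removes.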

In [178]:
df = df.dropna()
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162969 entries, 0 to 162979
Data columns (total 2 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   clean_text  162969 non-null  object 
 1   category    162969 non-null  float64
dtypes: float64(1), object(1)
memory usage: 3.7+ MB


Another issue commonly found in real-world datasets is duplicate rows, which often come from manual data entry errors, system glitches, or the merging of multiple, overlapping sources. We first check for duplicates in our DataFrame and then remove them.
> 🍠 do i need to cite this

In [179]:
duplicate_rows = df[df.duplicated()]
duplicate_rows

Unnamed: 0,clean_text,category


No duplicate rows exist in our DataFrame, but we still call `drop_duplicates()` for consistency.

In [180]:
df = df.drop_duplicates()
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162969 entries, 0 to 162979
Data columns (total 2 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   clean_text  162969 non-null  object 
 1   category    162969 non-null  float64
dtypes: float64(1), object(1)
memory usage: 3.7+ MB
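To make the duplicate check concrete, here is a minimal sketch on a toy DataFrame (made-up values, not rows from the dataset). It illustrates that `duplicated()` compares all columns by default, so the same text with a different label is not flagged.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "clean_text": ["modi wins", "modi wins", "modi wins"],
        "category": [1, 1, 0],
    }
)

# duplicated() flags a row only if every column matches an earlier row,
# so the third row (same text, different label) is not a duplicate.
dup_mask = toy.duplicated()

# drop_duplicates() keeps the first occurrence by default (keep="first").
deduped = toy.drop_duplicates()

print(dup_mask.tolist())
print(len(deduped))
```

If identical tweets with conflicting labels should also be treated as duplicates, `subset=["clean_text"]` could be passed, but that is a design choice this notebook does not make.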


When reading a CSV file into a DataFrame, pandas defaults numeric columns to `float64` when it encounters decimals or `NaN` values. Text values are inferred and loaded as `object`, the generic dtype for strings. We can check the dtype of our DataFrame columns through `.info()`.


In [181]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162969 entries, 0 to 162979
Data columns (total 2 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   clean_text  162969 non-null  object 
 1   category    162969 non-null  float64
dtypes: float64(1), object(1)
memory usage: 3.7+ MB


We can see that the `clean_text` column has dtype `object` and the `category` column has dtype `float64`. We first convert `category` from `float64` to `int64`, since a tweet's sentiment category (-1, 0, or 1) is always a whole number. This step is done after dropping the `NaN` rows because `NaN` is a floating-point value and cannot be stored in an `int64` column.


In [182]:
df["category"] = df["category"].astype(int)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162969 entries, 0 to 162979
Data columns (total 2 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   clean_text  162969 non-null  object
 1   category    162969 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 3.7+ MB
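The ordering constraint (drop `NaN` first, cast second) can be illustrated with a short sketch on a toy Series. The values are made up; only standard pandas behavior is assumed.

```python
import numpy as np
import pandas as pd

labels = pd.Series([1.0, np.nan, -1.0])

# Casting while NaN is present fails, because NaN has no integer representation.
try:
    labels.astype(int)
    cast_failed = False
except ValueError:
    cast_failed = True

# After dropping the missing value, the cast succeeds.
clean_labels = labels.dropna().astype(int)

print(cast_failed)
print(clean_labels.tolist())
```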


After successfully converting the `category` column into `int64`, we next convert the `clean_text` column from the `object` dtype into the pandas-defined `string` dtype for consistency and explicit string semantics.


In [183]:
df["clean_text"] = df["clean_text"].astype("string")
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162969 entries, 0 to 162979
Data columns (total 2 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   clean_text  162969 non-null  string
 1   category    162969 non-null  int64 
dtypes: int64(1), string(1)
memory usage: 3.7 MB


Since the sentiment values in the `category` column should belong to the set {-1, 0, 1}, representing the negative, neutral, and positive sentiments respectively, we check all unique values in `category` and remove any rows whose value falls outside this set.


In [184]:
df["category"].unique()

array([-1,  0,  1])

All existing values in the `category` column in the DataFrame are within the expected range, but we still drop any rows that have values outside of the provided range for data consistency and extra precaution.


In [None]:
df = df[df["category"].isin([-1, 0, 1])]
df.info()

31875     1
126871   -1
84922    -1
13722     0
2910      0
106910    1
39553     0
88887     0
64471     1
12340    -1
Name: category, dtype: int64
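The label filter can be tried on a toy DataFrame that deliberately contains an invalid label. This is a minimal sketch with made-up values; the name `VALID_LABELS` is introduced here for illustration only.

```python
import pandas as pd

VALID_LABELS = [-1, 0, 1]  # negative, neutral, positive

toy = pd.DataFrame(
    {"clean_text": ["a", "b", "c", "d"], "category": [-1, 0, 2, 1]}
)

# Keep only rows whose label is one of the three expected values.
filtered = toy[toy["category"].isin(VALID_LABELS)]

# Labels that would be discarded, useful to inspect before filtering.
unexpected = sorted(set(toy["category"]) - set(VALID_LABELS))

print(filtered["category"].tolist())
print(unexpected)
```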

## **Main Cleaning Pipeline**

We follow a data-cleaning methodology similar to the one presented in (George & Murugesan, 2024).


### **Normalization**

Because the text consists of tweets, we noticed a prevalence of emojis and accented characters, as seen in the samples below. Although these serve as a form of emotional expression in a real-world context, they are not relevant to _textual_ sentiment analysis, so we normalize the text.


In [187]:
# Finding a sample of rows with accented characters
accented_char_rows = df[df["clean_text"].str.contains(r"É|é|Á|á|ó|Ó|ú|Ú|í|Í")]
accented_char_rows["clean_text"].sample(5)

61048                                                           vía bjp leaders hail for indias successful demonstration antimissile technology read 
97413                                         sir please one exposé about the degree modi also everyone wants see his degree entire political science
21088                                  leaders opposition parties will joint press conference today 100 says will exposé one scam the modi government
50461    vía not against any particular nation demonstration our own technology former drdo chief saraswat tells cnnnews18s follow live updates here 
23608                dinesh rodi ardent fan modi has opened rodi resto cafe themed modi tamil nadus thoothukudi take peep inside the modithemed café 
Name: clean_text, dtype: string

In [188]:
# Finding a sample of rows with emojis
rows_with_emojis = df[df["clean_text"].str.contains(r"[\u263a-\U0001f645]", regex=True)]
rows_with_emojis["clean_text"].sample(5)

73411                                                                                                                                                                                   love ❤️ love ❤️ love ❤️ chal jutha 
119259                                                                                                                                                                                                         how sweet ☺️
88147                                                                                                                                                                         here ☺️ the trailer upcoming web series modi 
23615     too much appeasement for vote banking resulted you have forgotten your hindu customs and religion that have festival called “ ayudha pooja “ khangress doesn’ respect hindu festivals vote for modi modi again ✌️
23190                                                                                     

The first function, `normalize`, converts the input text to lowercase, ASCII-only characters (say, "cómo estás" becomes "como estas"). The dataset contains Unicode characters: accented characters are mapped to their ASCII base letters, while characters with no ASCII equivalent (e.g., emojis) are replaced with the empty string (`''`).
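The folding step can be sketched with the standard library alone. The helper name `ascii_fold` is hypothetical and this is not the project's `normalize`; it only mirrors the idea of NFKD decomposition followed by an ASCII encode that ignores unmappable characters, with lowercasing assumed from the description above.

```python
import unicodedata


def ascii_fold(text: str) -> str:
    """Hypothetical helper mirroring the idea behind `normalize`."""
    # NFKD splits an accented letter into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding to ASCII with errors="ignore" drops the combining marks and any
    # character without an ASCII form (emojis, inverted punctuation, ...).
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return ascii_only.lower()


print(ascii_fold("¿Cómo estás?"))
print(repr(ascii_fold("café ☺ MODI")))
```

Note that removed characters leave their surrounding spaces behind, which is why whitespace is collapsed in a later step.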


In [189]:
normalize??

Signature: normalize(text: str) -> str
Source:   
def normalize(text: str) -> str:
    """
    Normalize text from a pandas entry to ASCII-only lowercase characters. Hence, this removes Unicode characters with no ASCII
    equivalent (e.g., emojis and CJKs).

    Do not use this function alone, use `clean_and_tokenize()`.

    # Parameters
    * text: String entry.

    # Returns
    ASCII-normalized text containing only lowercase letters.

    # Examples
    normalize("¿Cómo estás?")
    $ 'como estas?'

    normalize(" hahahaha HUY! Kamusta 😅 Mayaman $$$ ka na ba?")
    $ ' hahahaha huy! kamusta  mayaman $$$ ka na ba?'
    """
    normalized = unicodedata.normalize("NFKD", text)
    ascii_text = normalized.encode("ascii", "ignore").decode("ascii")


### **Punctuations**

Punctuation marks are part of natural speech and reading; they give sentences structure, clarity, and tone. In the context of a classification study, however, punctuation does not add much information about the sentiment of a message. The sentiment of `i hate you!` and `i hate you` is the same, even though the punctuation mark `!` is used to accentuate the sentiment. We can see a sample of rows with punctuation below.


In [190]:
# Finding a sample of rows with punctuation
rows_with_punc = df[df["clean_text"].str.contains(r"[^\w\s]")]
rows_with_punc["clean_text"].sample(5)

28665                                           hey you cut modi not the nation ’ with democracy people like you are rupturing india great and modi the fat ugly moron sht ’ with india ’ with democracy but ’ not with hate monger airhead
3560                                                                                                                                              modi can execute this scheme simply raiding gandhi’ and vadra’ assets swiss bank accounts
27259                                                                                   why necessary for you that year old brahmin must fight election just because you don’ like modi why don’ you bell the cat instead preaching others 
26503                                                           nikal demonetisation killed over 100 people left over crore jobless still you were over modi’ masterstoke demonetisation people get three times food than vote for congress
70648    were national disaster you wo

The function `rem_punctuation` replaces all punctuation marks and special characters with the empty string (`''`).
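A minimal sketch of this kind of removal is given here, assuming the character class is built from Python's `string.punctuation` (the source listing in this notebook is truncated at that point, so this is an assumption). The helper name `strip_ascii_punctuation` is hypothetical.

```python
import re
import string

# Character class covering the ASCII punctuation listed in string.punctuation.
PUNCT_PATTERN = re.compile(f"[{re.escape(string.punctuation)}]")


def strip_ascii_punctuation(text: str) -> str:
    """Hypothetical helper: delete ASCII punctuation without touching whitespace."""
    return PUNCT_PATTERN.sub("", text)


print(strip_ascii_punctuation("i hate you!"))
print(repr(strip_ascii_punctuation("this word $$ has two spaces after it!")))
print(strip_ascii_punctuation("don’ like"))
```

Because `string.punctuation` contains only ASCII symbols, a curly apostrophe such as `’` survives this step on its own; in a pipeline like the one described here, it is the earlier ASCII normalization that would remove it.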


In [191]:
rem_punctuation??

Signature: rem_punctuation(text: str) -> str
Source:   
def rem_punctuation(text: str) -> str:
    """
    Removes the punctuations. This function simply replaces all punctuation marks and special characters
    to the empty string. Hence, for symbols enclosed by whitespace, the whitespace are not collapsed to a single whitespace
    (for more information, see the examples).

    Do not use this function alone, use `clean_and_tokenize()`.

    # Parameters
    * text: String entry.

    # Returns
    Text with the punctuation removed.

    # Examples
    rem_punctuation("this word $$ has two spaces after it!")
    $ 'this word  has two spaces after it'

    rem_punctuation("these!words@have$no%space")
    $ 'thesewordshavenospace'
    """
    return re.sub(f"[{re.escape(string.

### **Numbers**

Similar to punctuation, numbers add little information about the sentiment of a message, as seen in the samples below.


In [192]:
# Finding a sample of rows that contain numbers
rows_with_numbers = df[df["clean_text"].str.contains(r"\d")]
rows_with_numbers["clean_text"].sample(5)

161779                                                                                                                               according survey hindus want modi will the countys again 2019 you are also then retweet and follow hindu supporter\n्टरहि्ू
10223                                                                                                                                  \nmodi govt built 153 crore houses under awas yojana from 201418 this multiple times more than earlier govt\nvia namo app
88050     modi was fact shocked and made immobile when said will not accept modi prime minister after 2014 elections entire india got stumped and came stand still after repeated persuasion accepted and modi became how can you forget something happened 2014
82740                                                                                                                                                      the hindu “unlike 2014 there modi wave this time open elec

Hence, we define `rem_numbers` as a function that replaces all digits with the empty string (`''`).


In [193]:
rem_numbers??

Signature: rem_numbers(text: str) -> str
Source:   
def rem_numbers(text: str) -> str:
    """
    Removes numbers. This function simply replaces all numerical symbols to the empty string. Hence, for symbols enclosed by
    whitespace, the whitespace are not collapsed to a single whitespace (for more information, see the examples).

    Do not use this function alone, use `clean_and_tokenize()`.

    # Parameters
    * text: String entry.

    # Returns
    Text with the numerical symbol removed

    # Examples
    rem_numbers(" h3llo, k4must4 k4  n4?")
    ' hllo, kmust k  n?'
    """
    return re.sub(r"\d+", "", text)
File:      c:\users\erin\documents\github\stintsy-order-of-erin\lib\janitor.py
Type:      function

### **Whitespace**

We also noticed the prevalence of excess whitespace between words, as seen in the sample below.


In [194]:
# Finding a sample of rows that contain 2 or more whitespaces in a row
rows_with_whitespaces = df[df["clean_text"].str.contains(r"\s{2,}")]
rows_with_whitespaces["clean_text"].sample(5)

123951    and your bjp friends with pakistan birthday wish karne waha chale jaate chacha nirav modi vijay mallaya mehul choksi such scamsters whom you helped flee from country with friends like those can public interest served  
104950                                                                                                                  sir need thank nehru and congress for doing nothing and leaving all the work done modi they also know that  
66865                                                                                                                           hello  thank you for admitting that had brave leadership who started this mission least years modi  
76220                                                    modi gets the credit credit does not capability credit goes one who gets things done did your rajdeeps program madhavan said upa was disaster disaster  and rajdeep beeped 
16422                                                                               

Thus, the function `collapse_whitespace` collapses every run of space characters into a single space and strips leading and trailing whitespace. Formally, it is a transducer

$$
\Box^+ \mapsto \Box \qquad \text{where the space character is } \Box
$$

Informally, it replaces every run of one or more spaces with a single space character.
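The choice of pattern matters here: `" +"` matches only the space character, whereas `\s+` also matches tabs and newlines. The sketch below compares the two; `collapse_spaces` follows the `" +"` pattern used by `collapse_whitespace`, while `collapse_all_whitespace` is a hypothetical variant that is not part of `janitor.py`.

```python
import re


def collapse_spaces(text: str) -> str:
    """Collapse runs of the space character only, then trim both ends."""
    return re.sub(" +", " ", text).strip()


def collapse_all_whitespace(text: str) -> str:
    """Hypothetical variant: collapse any whitespace run (tabs, newlines too)."""
    return re.sub(r"\s+", " ", text).strip()


sample = "modi\nfollow  for updates"
print(repr(collapse_spaces(sample)))
print(repr(collapse_all_whitespace(sample)))
```

With the space-only pattern, a newline inside a tweet is preserved, so any embedded `\n` has to be handled by a later step.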


In [195]:
collapse_whitespace??

Signature: collapse_whitespace(text: str) -> str
Source:   
def collapse_whitespace(text: str) -> str:
    """
    This collapses whitespace. Here, collapsing means the transduction of all whitespace strings of any
    length to a whitespace string of unit length (e.g., "   " -> " "; formally " "+ -> " ").

    Do not use this function alone, use `clean_and_tokenize()`.

    # Parameters
    * text: String entry.

    # Returns
    Text with the whitespaces collapsed.

    # Examples
    collapse_whitespace("  huh,  was.  that!!! ")
    $ 'huh, was. that!!!'
    """
    return re.sub(" +", " ", text).strip()
File:      c:\users\erin\documents\github\stintsy-order-of-erin\lib\janitor.py
Type:      function

To call all of these cleaning functions in one step, we use the `clean` function, a wrapper that calls each component in turn. Its definition is quite long; see [this appendix](#appendix:-clean-wrapper-function-definition).
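For orientation, the composition of the four steps might look like the sketch below. The function `clean_sketch` is hypothetical and is not the project's `clean`; the order of the steps and the use of `string.punctuation` are assumptions made for illustration.

```python
import re
import string
import unicodedata


def clean_sketch(text: str) -> str:
    """Hypothetical composition of the four steps; not the project's `clean`."""
    # 1. Normalize: fold to lowercase ASCII, dropping unmappable characters.
    text = (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("ascii")
        .lower()
    )
    # 2. Remove ASCII punctuation.
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)
    # 3. Remove digits.
    text = re.sub(r"\d+", "", text)
    # 4. Collapse runs of spaces and trim the ends.
    return re.sub(" +", " ", text).strip()


print(clean_sketch("india’ scientists have destroyed “live satellite” 300 times"))
```

Running whitespace collapsing last is deliberate in this sketch, since each earlier removal can leave doubled spaces behind.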

We can now clean the dataset and store the result in a new column named `clean_ours` (to differentiate it from the still-dirty `clean_text` column provided by the dataset author).


In [196]:
df["clean_ours"] = df["clean_text"].map(clean).astype("string")
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162969 entries, 0 to 162979
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   clean_text  162969 non-null  string
 1   category    162969 non-null  int64 
 2   clean_ours  162969 non-null  string
dtypes: int64(1), string(2)
memory usage: 5.0 MB


To confirm that the character cleaning worked, we compare `clean_text` and `clean_ours` for a sample of rows that contain the targeted patterns.


In [197]:
example_rows = df[
    df["clean_text"].str.contains(r"\s{2,}|\d|[^\w\s]|[\u263a-\U0001f645]|[ÉéÁáóÓúÚíÍ]")
]
example_rows.sample(10)

Unnamed: 0,clean_text,category,clean_ours
100701,2019 elections are achievements narendra modi people and large hugely appreciative his leadership quality,1,elections are achievements narendra modi people and large hugely appreciative his leadership quality
91067,india upfront issues distinguished voices questions has modi demonstrated ‘dum’ fight ‘garibi’ does india believe that rahul better placed give ‘nyay’ the ‘kisan’ can ‘handout’ niti create naukri join,1,india upfront issues distinguished voices questions has modi demonstrated dum fight garibi does india believe that rahul better placed give nyay the kisan can handout niti create naukri join
118749,full modi prime minister first 2019 interview speaks arnab goswami media network watch ▶️,1,full modi prime minister first interview speaks arnab goswami media network watch
61787,india’ scientists have destroyed “live satellite” the lower earth orbit today completing hightech and difficult task dubbed,-1,india scientists have destroyed live satellite the lower earth orbit today completing hightech and difficult task dubbed
69690,india shot down satellite modi says shifting balance power asia via nytimes ☞,-1,india shot down satellite modi says shifting balance power asia via nytimes
37885,major terrorist attacks since balakot bombings wiped out 300 pakis has modi finally made the difference,1,major terrorist attacks since balakot bombings wiped out pakis has modi finally made the difference
9839,shot the arm bjp will sure terrible harm bjp coming elections congress will have fun for five years atal lost due onion price modi can also face the same music quick and guaranty minimum 15000month job all graduates dig pools\n,1,shot the arm bjp will sure terrible harm bjp coming elections congress will have fun for five years atal lost due onion price modi can also face the same music quick and guaranty minimum month job all graduates dig pools
131982,from nehru rahul gandhi family lying from past four generation poverty narendra modi ‚Å¶ ‚Å¶ ‚Å¶,-1,from nehru rahul gandhi family lying from past four generation poverty narendra modi
15835,congress was corrupt despite that implemented rti which was used expose many its scams under modi information commissioner vacancies have been left open and they dont respond rti queries still the one fighting corruption ‚Äç‚ôÇÔ∏è‚Äç‚ôÇÔ∏è,0,congress was corrupt despite that implemented rti which was used expose many its scams under modi information commissioner vacancies have been left open and they dont respond rti queries still the one fighting corruption
10003,narendra modi scores 100100 the prime minister this great country deserves another chance hope the rest his cabinet colleagues follow suit and win hearts and brains the electoral,1,narendra modi scores the prime minister this great country deserves another chance hope the rest his cabinet colleagues follow suit and win hearts and brains the electoral


We are now finished with basic text cleaning, but the data cleaning does not end here. Given that the text is sourced from Twitter, it includes characteristics, such as spam and informal expressions, which are not addressed by basic cleaning methods. As a result, we move on to further cleaning tailored to the nature of Twitter data.


### **Spam, Expressions, Onomatopoeia, etc.**

Since the domain of the corpus is Twitter, spam (e.g., `bbbb`), expressions (e.g., `bruhhhh`), and onomatopoeia (e.g., `hahahaha`) may become an issue by the vector representation step. Hence we employed a simple rule-based spam removal algorithm.

We remove words that contain the same character three or more times in a row, or that repeat the same substring two or more times in a row. Both rules are expressed as regular expressions:

$$
\text{same\_char\_thrice} := (.)\backslash 1^{\{2,\}}
$$

and

$$
\text{same\_substring\_twice} := (.^+)\backslash 1^+
$$

Furthermore, we also remove any string that has a length less than three, since these are either stopwords (that weren't detected in the stopword removal stage) or more spam.
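The rules above can be sketched as a per-token check. The function `looks_like_spam` is hypothetical and is not the project's `find_spam_and_empty`. In particular, applying the rules token by token and anchoring the repeated-substring pattern at the start of the token are assumptions of this sketch, chosen so that ordinary double letters inside a word (as in `will`) do not trigger a match.

```python
import re

# Same character three or more times in a row, e.g. "whattttttt".
SAME_CHAR_THRICE = re.compile(r"(.)\1{2,}")
# Same substring two or more times in a row, e.g. "hahaha".
SAME_SUBSTRING_TWICE = re.compile(r"(.+)\1+")


def looks_like_spam(token: str) -> bool:
    """Hypothetical per-token check combining the three rules."""
    # Tokens shorter than three characters are dropped outright.
    if len(token) < 3:
        return True
    # A character repeated three or more times anywhere in the token.
    if SAME_CHAR_THRICE.search(token):
        return True
    # A substring repeated at the start of the token (match() anchors at
    # position 0; this anchoring is an assumption of the sketch).
    return SAME_SUBSTRING_TWICE.match(token) is not None


tokens = "look how modi boy runs away form live program hahaha".split()
print(" ".join(t for t in tokens if not looks_like_spam(t)))
```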

Finally, we apply an adaptive character-diversity threshold to the string $s$, removing it when:

$$
\frac{\texttt{\#\_unique\_chars}(s)}{|s|} < 0.3 + \left(\frac{0.1 \cdot \text{min}(|s|, 10)}{10}\right)
$$

The left-hand side measures the diversity of characters in the string; if the string repeats the same characters a lot, we expect it to be unintelligible or useless, hence we remove it.
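As a worked instance of the inequality, the 8-character string `hahahaha` has 2 unique characters, so its ratio is $2/8 = 0.25$, while the threshold is $0.3 + 0.1 \cdot 8 / 10 = 0.38$; the string is therefore removed. The sketch below encodes the same test. The helper name `low_diversity` and the handling of the empty string are assumptions, not the project's implementation.

```python
def low_diversity(token: str) -> bool:
    """Hypothetical check: True if character diversity is below the threshold."""
    # Treat the empty string as removable (an assumption of this sketch).
    if not token:
        return True
    # Share of distinct characters in the token.
    ratio = len(set(token)) / len(token)
    # The threshold grows with length and is capped at 0.4 for 10+ characters.
    threshold = 0.3 + 0.1 * min(len(token), 10) / 10
    return ratio < threshold


for word in ["hahahaha", "aaaaaaaaab", "modi"]:
    print(word, low_diversity(word))
```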

The definition of this wrapper function is quite long, see its definition in [this appendix](#appendix:-find_spam_and_empty-wrapper-function-definition).

Let's first look at a random sample of 10 entries from the dataset that will be modified by the function.


In [198]:
affected = df[df["clean_ours"].apply(spam_affected)]
affected_sample = affected["clean_ours"].sample(10)
affected_sample

5826                                                                                                                                                                      modi for ambani adani rahul for aam garib aadmi choice yours choose wisely
154860                      problem south bangalore constituency late ananth kumar wife doing lot social work every voter cofusionall want vote modi central but due humiliation done her and for sake humanity please justice her you wonderful man
5761                                                                                                                                     aap comes pawar central will give lakh per annam evary indian who did nat voted modi new young arvind yojna
52589                                                                                                                   live west bengal mamata banerjee said that drdo achievements are being used for publicity mongering modi\nfollow for updates
75481               

Let's now call this function on the `clean_ours` column of the dataset.


In [199]:
df["clean_ours"] = df["clean_ours"].map(find_spam_and_empty).astype("string")

To confirm that the function removed the spammy substrings, we compare the `before` and `after` columns.


In [200]:
comparison = pd.DataFrame({"before": affected_sample, "after": df["clean_ours"]})

changed = comparison[comparison["before"] != comparison["after"]]
changed.sample(10)

Unnamed: 0,before,after
52589,live west bengal mamata banerjee said that drdo achievements are being used for publicity mongering modi\nfollow for updates,live west bengal banerjee said that drdo achievements are being used for publicity mongering modi follow for updates
5826,modi for ambani adani rahul for aam garib aadmi choice yours choose wisely,modi for ambani adani rahul for garib choice yours choose wisely
112920,our kids financial sector assured give gm gold free every newly married couple after submitting marriage certificate once our kids get elected they can order all jewellery shops india give gm gold free newly married modi created funds,our kids financial sector assured give gold free every newly married couple after submitting marriage certificate once our kids get elected they can order all jewellery shops india give gold free newly married modi created funds
154860,problem south bangalore constituency late ananth kumar wife doing lot social work every voter cofusionall want vote modi central but due humiliation done her and for sake humanity please justice her you wonderful man,problem south bangalore constituency late kumar wife doing lot social work every voter cofusionall want vote modi central but due humiliation done her and for sake humanity please justice her you wonderful man
52769,hii everyonetoday the proud moment for usindia becomes th nation enter elite space power club with antisatellite weapon announces modi,hii everyonetoday the proud moment for usindia becomes nation enter elite space power club with antisatellite weapon announces modi
75481,for nd consecutive time narendra modi will launch the bjp poll campaign for lok sabha elections from jammu the city temples addressing election rally dhoomi akhnoor,for consecutive time narendra modi will launch the bjp poll campaign for lok sabha elections from jammu the city temples addressing election rally dhoomi akhnoor
20922,delhi who has seats begging party for alliance which has seats party with seats rejecting that offer\nvaranasi loksabha seat against modi aap gives cong cong gives gives bsp bsp candidate joins bjp,delhi who has seats begging party for alliance which has seats party with seats rejecting that offer varanasi loksabha seat against modi gives cong cong gives gives bsp bsp candidate joins bjp
85090,real interest rates were jacked rrr the great economist sabotage the recovery under modi,real interest rates were jacked the great economist sabotage the recovery under modi
134418,look how modi boy runs away form live program hahaha,look how modi boy runs away form live program
5761,aap comes pawar central will give lakh per annam evary indian who did nat voted modi new young arvind yojna,comes pawar central will give lakh per annam evary indian who did nat voted modi new young arvind yojna
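A before/after comparison of this kind depends on index alignment between the sampled rows and the full column. The sketch below shows one way to make that explicit by selecting the `after` values with the sample's own index; the toy values are made up for illustration.

```python
import pandas as pd

# A small "before" sample and the full "after" column (toy values).
before = pd.Series(
    ["look how modi boy runs away hahaha", "modi wins"],
    index=[3, 7],
    dtype="string",
)
after_all = pd.Series(
    ["a", "look how modi boy runs away", "b", "modi wins"],
    index=[1, 3, 5, 7],
    dtype="string",
)

# Restrict "after" to the sampled rows so both columns share the same index.
comparison = pd.DataFrame({"before": before, "after": after_all.loc[before.index]})

# Keep only rows whose text actually changed.
changed = comparison[comparison["before"] != comparison["after"]]
print(changed)
```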


Let's examine whether applying this function has caused any significant changes to the DataFrame structure, given that it can convert entire cells to `NaN`.


In [201]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162969 entries, 0 to 162979
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   clean_text  162969 non-null  string
 1   category    162969 non-null  int64 
 2   clean_ours  162942 non-null  string
dtypes: int64(1), string(2)
memory usage: 5.0 MB


The DataFrame structure is intact, but `clean_ours` now has 27 fewer non-null values, reflecting cells that were entirely filtered out as spam as seen below.


In [202]:
missing_rows = df[df['clean_ours'].isna()]
missing_rows[['clean_text', 'clean_ours']]

Unnamed: 0,clean_text,clean_ours
21806,bjpmpsubramanianswamyiamchowkidarcampaignpmmodi,
21855,terrorfundinghurriyatleaderspropertyseizedhafizsaeedmodigovt,
24148,pmnarendramodirequestsofexservicemanindianarmyhavildarombirsinghsharma9258,
35636,2019,
35866,‚Äç,
35968,whattttttt,
37837,allllll,
40587,1145am,
40977,⌚1145 ❤,
48127,birthdaaaaaay,


## **Post-Cleaning Steps**

At some point during the cleaning stage, some entries of the dataset could have been reduced to `NaN` or the empty string `""`, or we could have introduced duplicates again. So, let's call `dropna` and `drop_duplicates` again to finalize the cleaning stage.


In [203]:
df = df.dropna()
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162942 entries, 0 to 162979
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   clean_text  162942 non-null  string
 1   category    162942 non-null  int64 
 2   clean_ours  162942 non-null  string
dtypes: int64(1), string(2)
memory usage: 5.0 MB


In [204]:
df = df.drop_duplicates()
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162942 entries, 0 to 162979
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   clean_text  162942 non-null  string
 1   category    162942 non-null  int64 
 2   clean_ours  162942 non-null  string
dtypes: int64(1), string(2)
memory usage: 5.0 MB
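
One caveat worth keeping in mind: `dropna` removes only missing values, so an empty string `""` would survive it unless it is first converted to a missing value. The sketch below illustrates this on a hypothetical toy frame (the rows are made up for illustration and are not the project's data); it marks blank strings as missing and then applies the same two calls used above.

```python
import pandas as pd

# Hypothetical toy frame standing in for the cleaned dataset.
toy = pd.DataFrame(
    {
        "clean_ours": ["modi govt", "", "modi govt", None, "   ", "vote india"],
        "category": [1, 0, 1, 0, -1, 1],
    }
)
toy["clean_ours"] = toy["clean_ours"].astype("string")

# dropna() alone would keep "" and "   ", so mark blank strings as missing first.
blank = toy["clean_ours"].str.strip().eq("").fillna(False).astype(bool)
toy.loc[blank, "clean_ours"] = pd.NA

# Same finalization as in the notebook: drop missing rows, then duplicates.
toy = toy.dropna().drop_duplicates()
print(toy)
```

Only the two distinct, non-blank rows remain in the toy frame after these steps.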


# **3. Preprocessing**

> 🏗️ Perhaps swap S3 and S4. Refer to literature on what comes first.

This section discusses the preprocessing steps applied to the cleaned data. Because the goal is to analyze the sentiment of tweets, these steps are needed to give the Bag of Words (BoW) model the information it requires to build a vector representation of each tweet.

After each preprocessing step, we show a random sample of entries from the dataset to illustrate the effect of that step.

## **Lemmatization**

We follow a data cleaning methodology similar to the one presented in <u>(George & Murugesan, 2024)</u> and preprocess the dataset entries via lemmatization. We use NLTK's `WordNetLemmatizer` for this task <u>(Bird & Loper, 2004)</u>. For the lemmatization step, we use WordNet for English lemmatization and Open Multilingual WordNet version 1.4 for translations and multilingual support, which is important in our case since some tweets contain text from Indian languages.


In [205]:
df["lemmatized"] = df["clean_ours"].map(lemmatizer)
df.sample(10)

Unnamed: 0,clean_text,category,clean_ours,lemmatized
40695,there’ modi govt there’ only modi,0,there modi govt there only modi,there modi govt there only modi
146107,ranjith joins list filmmakers against modi,0,ranjith joins list filmmakers against modi,ranjith join list filmmaker against modi
153907,need review for hair\nwhich one best brown black @ ahmedabad india,1,need review for hair which one best brown black ahmedabad india,need review for hair which one best brown black ahmedabad india
60420,modi announces drdo’ recent achievement his election rally,0,modi announces drdo recent achievement his election rally,modi announces drdo recent achievement his election rally
3576,why would demobilization lead money modi and ambani not the only one with money,0,why would demobilization lead money modi and ambani not the only one with money,why would demobilization lead money modi and ambani not the only one with money
31169,still modi says its acche din,0,still modi says its acche din,still modi say it acche din
94694,oooextending last post make that conclusivesee what were they including ndtv trying project the public\nwhateverjeetega toh modi ✌✌,0,last post make that conclusivesee what were they including ndtv trying project the public whateverjeetega toh modi,last post make that conclusivesee what were they including ndtv trying project the public whateverjeetega toh modi
51972,modi plz check some institutes nehru built\ndrdo\ncsir\nbarc\napsara\nincospar isro\nnpl\niit\niist\niofs\nongc\naiims\niim\nnit\nbokaro rourkela steel\ncdri\ncbri\ncecri\nceeri\ncftri\ncgcri\ncimap\nclri\ncmeri\ncrri\ncsio\ncsmcri\ncazri\ntoday’ glories are based yesterday’ preperation,0,modi plz check some institutes nehru built drdo csir barc apsara incospar isro npl iofs ongc aiims nit bokaro rourkela steel cdri cbri cecri ceeri cftri cgcri cimap clri cmeri crri csio csmcri cazri today glories are based yesterday preperation,modi plz check some institute nehru built drdo csir barc apsara incospar isro npl iofs ongc aiims nit bokaro rourkela steel cdri cbri cecri ceeri cftri cgcri cimap clri cmeri crri csio csmcri cazri today glory are based yesterday preperation
124696,ignore trolls modi sending them irritate you focus your work looking the impending defeat modi frustrated happy see inch shrink inch,1,ignore trolls modi sending them irritate you focus your work looking the impending defeat modi frustrated happy see inch shrink inch,ignore troll modi sending them irritate you focus your work looking the impending defeat modi frustrated happy see inch shrink inch
105367,modi meansmaker developed india,1,modi meansmaker developed india,modi meansmaker developed india


## **Stop Word Removal**

After lemmatization, we can remove the stop words present in the dataset. Stop word removal _needs_ to come after lemmatization because it requires all words to be reduced to their base dictionary form: the `stopwords_set` contains only the base dictionary forms of the stop words.

**stopwords.** For stop word removal, we refer to the English stopword lists defined in NLTK and Wolfram Mathematica <u>(Bird & Loper, 2004; Wolfram Research, 2015)</u>. However, since the task is sentiment analysis, words that convey polarity, intensification, or negation are important, and words like "not" and "okay" are commonly included in stopword lists. Therefore, we manually adjust the NLTK and Mathematica lists so that they include only stopwords that are neutral in sentiment, such as "after", "when", and "you".
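
The filtering idea described above can be sketched as follows. This is a minimal illustration, not the project's `rem_stopwords`: the function name `rem_stopwords_sketch`, the tiny `neutral_stopwords` set, and the choice to return `None` for an entry that ends up empty are all assumptions made for this example.

```python
from typing import Optional, Set


def rem_stopwords_sketch(text: str, stopwords: Set[str]) -> Optional[str]:
    """Drop stop words from a whitespace-separated string.

    Returns None when no token survives, so that a later dropna() can
    remove the entry (an assumption about how empty results are handled).
    """
    kept = [tok for tok in text.split() if tok not in stopwords]
    return " ".join(kept) if kept else None


# Hypothetical neutral stop words; polarity and negation words such as
# "not" and "okay" are deliberately left out of the set.
neutral_stopwords = {"after", "when", "you", "this"}

print(rem_stopwords_sketch("after you vote this is not okay", neutral_stopwords))
print(rem_stopwords_sketch("when you", neutral_stopwords))
```

In this sketch the negation "not" and the polarity word "okay" survive because they are absent from the set, which is the behavior the adjusted list is meant to achieve.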


In [206]:
df["lemmatized"] = df["lemmatized"].map(lambda t: rem_stopwords(t, stopwords_set))
df = df.dropna(subset=["lemmatized"])
df.sample(10)

Unnamed: 0,clean_text,category,clean_ours,lemmatized
85309,please follow this thread schemes modi thanks for information,1,please follow this thread schemes modi thanks for information,please follow thread scheme modi
15630,this anchor taught tough lesson about modi jis work live show that hell never forget shop guy says hell vote for modi and then the anchor tried mock gst passerby gives him good gyana about gst benefits must watch,1,this anchor taught tough lesson about modi jis work live show that hell never forget shop guy says hell vote for modi and then the anchor tried mock gst passerby gives him good gyana about gst benefits must watch,anchor taught tough lesson about modi ji work live never forget shop guy vote modi anchor mock gst passerby good gyana about gst benefit watch
32423,bengaluru bjp mla says some incompetent candidates party banking modis popularity,-1,bengaluru bjp mla says some incompetent candidates party banking modis popularity,bengaluru bjp mla incompetent candidate party banking modis popularity
136111,her response happiness and more and more modis blessings,1,her response happiness and more and more modis blessings,response happiness more more modis blessing
131969,voodoo vindaloo whoodunit,0,voodoo vindaloo whoodunit,voodoo vindaloo whoodunit
41379,modi wave wont felt telugu states professionals the hans india ⚡assembly elections,0,modi wave wont felt telugu states professionals the hans india assembly elections,modi wave felt telugu state professional han india assembly election
33092,finance minister arun jaitley tuesday said that there only one gamechanger the 2019 lok sabha elections and that none other than prime minister narendra modi,1,finance minister arun jaitley tuesday said that there only one gamechanger the lok sabha elections and that none other than prime minister narendra modi,finance minister arun jaitley tuesday only gamechanger lok sabha election prime minister narendra modi
70692,unable accept the facts all opposition that modi becoming famous daybyday the crusial election time what things happening not created better oppositions accept the defeat pool avoid their mental agony,1,unable accept the facts all opposition that modi becoming famous daybyday the crusial election time what things happening not created better oppositions accept the defeat pool avoid their mental agony,unable accept fact all opposition modi famous daybyday crusial election time thing happening created better opposition accept defeat pool avoid mental agony
149895,leader kcr believes masood azhar pak govt not indian army their claimsits abt modibjp kcr good that trust indian army matram sense leader ledu,1,leader kcr believes masood azhar pak govt not indian army their claimsits abt modibjp kcr good that trust indian army matram sense leader ledu,leader kcr belief masood azhar pak govt indian army claimsits abt modibjp kcr good trust indian army matram sense leader ledu
39134,there series modi besides movie what else radio series street theaters all over india comic books wah sahab how much more marketing you have any shame you had done some work you wouldnt have required this level marketing,1,there series modi besides movie what else radio series street theaters all over india comic books wah sahab how much more marketing you have any shame you had done some work you wouldnt have required this level marketing,series modi movie radio series street theater all india comic book wah sahab much more marketing shame work required level marketing


## **Looking at the DataFrame**

After preprocessing, the dataset now contains:


In [207]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162942 entries, 0 to 162979
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   clean_text  162942 non-null  string
 1   category    162942 non-null  int64 
 2   clean_ours  162942 non-null  string
 3   lemmatized  162942 non-null  object
dtypes: int64(1), object(1), string(2)
memory usage: 6.2+ MB


Here are 5 randomly picked entries in the DataFrame, with all columns shown for comparison.


In [208]:
display(df.sample(5))

Unnamed: 0,clean_text,category,clean_ours,lemmatized
109570,modi will back our tab tum chane bhunnaaa aur khana usskae next elections plans banana,0,modi will back our tab tum chane aur khana usskae next elections plans banana,modi back tab tum chane aur khana usskae election plan banana
154819,you lack outlook even not accepting good suggestionsyuo are also hopeless this time along with modi,1,you lack outlook even not accepting good suggestionsyuo are also hopeless this time along with modi,lack outlook even accepting good suggestionsyuo hopeless time along modi
3127,from rahul gandhis fake promise today\nfrom where will bring money this much money india than according him modi government failure and chowkidar chor\nthan more than this money was india before five years\nthan why not thought this,-1,from rahul gandhis fake promise today from where will bring money this much money india than according him modi government failure and chowkidar chor than more than this money was india before five years than why not thought this,rahul gandhi fake promise today bring money much money india modi government failure chowkidar chor more money india year thought
154668,will most stupid decision fight election from need keep check his strategic team whether they are committed win the election either they want modi once again just saying rethink and get rid sanghis,1,will most stupid decision fight election from need keep check his strategic team whether they are committed win the election either they want modi once again just saying rethink and get rid sanghis,stupid decision fight election need check strategic team committed win election modi just rethink rid sanghis
69331,curious know what this historian eats and drinks daytime may not like one man but understand technical achievements drdo and isro appreciate modi that these guys are exposing themselves will recorded history,-1,curious know what this historian eats and drinks daytime may not like one man but understand technical achievements drdo and isro appreciate modi that these guys are exposing themselves will recorded history,curious historian eats drink daytime like man understand technical achievement drdo isro appreciate modi guy exposing recorded history


## **Tokenization**

Since the data cleaning and preprocessing stage is comprehensive, the tokenization step in the BoW model reduces to a simple word-boundary split operation. Each preprocessed entry in the DataFrame is split by spaces. For example, the entry `"shri narendra modis"` (entry: 42052) becomes `["shri", "narendra", "modis"]`. By the end of tokenization, all entries are transformed into arrays of strings.
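
As a minimal sketch of the split described above (the function name `tokenize` is chosen here for illustration and is not taken from the project's code):

```python
from typing import List


def tokenize(entry: str) -> List[str]:
    """Split a preprocessed entry into tokens on whitespace."""
    # str.split() with no arguments splits on any run of whitespace,
    # which matches a plain space split once whitespace has been collapsed.
    return entry.split()


tokens = tokenize("shri narendra modis")
print(tokens)  # ['shri', 'narendra', 'modis']
```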

## **Word Bigrams**

As noted earlier, modifiers and polarity words are not included in the stopword set. The BoW model constructs a vocabulary containing both unigrams and bigrams. Including bigrams allows the model to capture common word patterns, such as

$$
\left\langle \texttt{Adj}\right\rangle \left\langle \texttt{M} \mid \texttt{Pron} \right\rangle
$$

<center>or</center>

$$
\left\langle \texttt{Adv}\right\rangle \left\langle \texttt{V} \mid \texttt{Adj} \mid \texttt{Adv} \right\rangle
$$
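
To make the unigram-plus-bigram vocabulary concrete, here is a small sketch that builds both from a token list. The helper name `unigrams_and_bigrams` is hypothetical; joining the two words of a bigram with a space mirrors how terms such as `narendra modi` appear among the feature names later in this notebook.

```python
from typing import List


def unigrams_and_bigrams(tokens: List[str]) -> List[str]:
    """Return the tokens followed by every pair of adjacent tokens."""
    bigrams = [f"{first} {second}" for first, second in zip(tokens, tokens[1:])]
    return tokens + bigrams


terms = unigrams_and_bigrams(["not", "good", "leader"])
print(terms)  # ['not', 'good', 'leader', 'not good', 'good leader']
```

The bigram `not good` keeps the negation attached to the word it modifies, which a unigram-only vocabulary would lose.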

## **Vector Representation**

After the lemmatization and stop word removal steps, each entry can now be represented as a vector using a Bag of Words (BoW) model. We employ scikit-learn's `CountVectorizer`, which provides a ready-to-use implementation of BoW <u>(Pedregosa et al., 2011)</u>.

A comparison with other traditional vector representations is discussed in [this appendix](#appendix:-comparison-of-traditional-vectorization-techniques).

With bigrams, a modifier is attached directly to the word it modifies, so subsequent models can capture the modification as a single feature. Consequently, after tokenization and bigram construction, the vocabulary size can grow up to $O(n^2)$, where $n$ is the number of unique tokens.

**minimum document frequency constraint:** Despite cleaning and spam removal, some tokens remain irrelevant or too rare. To address this, a minimum document frequency constraint is applied: $\texttt{min\_df} = 10$, meaning a token must appear in at least 10 documents to be included in the BoW vocabulary. This reduces noise and ensures the model focuses on meaningful terms.
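
The constraint can be illustrated in plain Python. The sketch below counts, for each term, the number of documents that contain it and keeps only terms that meet the threshold; the helper name `filter_by_min_df` and the toy documents are assumptions for illustration, and a threshold of 2 is used only because the toy corpus is tiny.

```python
from collections import Counter
from typing import Iterable, List, Set


def filter_by_min_df(docs: Iterable[List[str]], min_df: int) -> Set[str]:
    """Keep only terms that occur in at least `min_df` documents."""
    doc_freq = Counter()
    for terms in docs:
        # set() ensures a term is counted once per document,
        # no matter how often it repeats inside that document.
        doc_freq.update(set(terms))
    return {term for term, df in doc_freq.items() if df >= min_df}


toy_docs = [
    ["modi", "govt", "modi"],
    ["modi", "vote"],
    ["vote", "india"],
]
vocab = filter_by_min_df(toy_docs, min_df=2)
print(sorted(vocab))  # ['modi', 'vote']
```

Note that "modi" appears twice in the first document but still contributes only one to its document frequency.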

---

These parameters of the BoW model are encapsulated in the `BagOfWordsModel` class. The class definition is available in [this appendix](#appendix:-BagOfWordsModel-class-definition).


In [209]:
bow = BagOfWordsModel(df["lemmatized"], 10)

# some sanity checks
assert (
    bow.matrix.shape[0] == df.shape[0]
), "number of rows in the matrix does NOT match the number of documents"
assert bow.sparsity, "the sparsity is TOO HIGH, something went wrong"



The warning raised by the cell above is expected. Recall that our tokenization step reduces to a simple split on whitespace, so the `tokenizer` attribute of the `BagOfWordsModel` is set to override the default tokenization pattern, and overriding that default is what triggers the warning.


### **Model Metrics**

To get an idea of the model, we now look at its shape and sparsity. The shape gives the number of documents and the number of tokens present in the model. Sparsity is the fraction of elements in the document-term matrix that are zero; it indicates how little vocabulary the tweets share with one another.


The resulting matrix has a shape of


In [210]:
bow.matrix.shape

(162942, 30386)

The first entry of the pair is the number of documents (those that remain after all the data cleaning and preprocessing steps), and the second entry is the number of terms in the vocabulary (unique unigrams and bigrams).

The resulting model has a sparsity of


In [211]:
1 - bow.sparsity

0.9995039539872171

The matrix is 99.95% sparse: each tweet uses only a tiny fraction of the vocabulary, and tweets rarely share the same terms.
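
As a worked example of the sparsity figure, the sketch below computes the fraction of zero entries of a SciPy sparse matrix. The helper name `sparsity_of` and the toy matrix are assumptions. Since the cell above evaluates `1 - bow.sparsity`, the `sparsity` attribute of `BagOfWordsModel` appears to hold the fraction of nonzero entries, but that is an inference from the output rather than something confirmed by the class definition.

```python
from scipy.sparse import csr_matrix


def sparsity_of(matrix) -> float:
    """Fraction of entries in a sparse matrix that are zero."""
    rows, cols = matrix.shape
    # nnz is the number of stored (nonzero) entries.
    return 1.0 - matrix.nnz / (rows * cols)


# Toy document-term matrix: 2 documents, 4 terms, 3 nonzero counts.
toy_matrix = csr_matrix([[1, 0, 0, 2], [0, 0, 3, 0]])
print(sparsity_of(toy_matrix))  # 0.625
```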


Next, we look at the most frequent and least frequent terms in the model.


In [212]:
doc_frequencies = np.asarray((bow.matrix > 0).sum(axis=0)).flatten()
freq_order = np.argsort(doc_frequencies)[::-1]
bow.feature_names[freq_order[:50]]

array(['modi', 'india', 'ha', 'all', 'people', 'bjp', 'like', 'congress',
       'narendra', 'only', 'election', 'narendra modi', 'vote', 'govt',
       'about', 'indian', 'year', 'time', 'country', 'just', 'modis',
       'more', 'nation', 'rahul', 'even', 'government', 'party', 'power',
       'gandhi', 'minister', 'leader', 'good', 'modi govt', 'need',
       'modi ha', 'space', 'work', 'prime', 'money', 'credit', 'sir',
       'pakistan', 'back', 'day', 'today', 'prime minister', 'scientist',
       'never', 'support', 'win'], dtype=object)

We see that the main talking point of the tweets is Indian politics, with keywords like "modi", "india", and "bjp". For additional context, "bjp" refers to the _Bharatiya Janata Party_, a conservative political party in India and one of the two major Indian political parties.


Next, we look at the least frequent terms.


In [213]:
bow.feature_names[freq_order[-50:]]

array(['healthy democracy', 'ha mass', 'ha separate', 'ha shifted',
       'hat drdo', 'about defeat', 'yet ha', 'yes more', 'yes narendra',
       'hatred people', 'ha requested', 'hate more', 'hate much',
       'hatemonger', 'hater gonna', 'heal', 'hazaribagh', 'head drdo',
       'sleep night', 'abinandan', 'able provide', 'able speak',
       'able vote', 'youth need', 'youth power', 'hai isliye', 'hai chor',
       'handy', 'hand narendra', 'hand people', 'hae', 'ha withdrawn',
       'happens credit', 'happier', 'bhaiyo', 'socha', 'social political',
       'social security', 'biased journalist', 'big congratulation',
       'sirmodi', 'bhutan', 'bhi berozgar', 'bhi mumkin', 'skta',
       'bhatt aditi', 'bhi aur', 'slamming', 'smart modi', 'slogan blame'],
      dtype=object)

The themes seen in the most frequent terms are still present in this subset. However, more filler or non-distinct terms also appear, like "handy", "happier", and "heal".

Still, the presence of terms like "healthy democracy", "able vote", and "youth power" indicates that this subset remains relevant to the main theme of the dataset.


# **4. Exploratory Data Analysis**

This section discusses the exploratory data analysis conducted on the dataset after cleaning.

> Notes from Zhean: <br>
> From manual checking via OpenRefine, there are a total of 162972 entries. `df.info()` should have the same result post-processing.
> Furthermore, there should be two columns. The first, `clean_text` (which is a bit of a misnomer since it is still dirty), contains the Tweets (text data). The second, `category`, contains the sentiment of the Tweet and is a tribool (1 for positive, 0 for neutral or indeterminate, and -1 for negative).


# **references**

Bird, S., & Loper, E. (2004, July). NLTK: The natural language toolkit. _Proceedings of the ACL Interactive Poster and Demonstration Sessions_, 214–217. https://aclanthology.org/P04-3031/

El-Demerdash, A. A., Hussein, S. E., & Zaki, J. F. W. (2021). Course evaluation based on deep learning and SSA hyperparameters optimization. _Computers, Materials & Continua, 71_(1), 941–959. https://doi.org/10.32604/cmc.2022.021839

George, M., & Murugesan, R. (2024). Improving sentiment analysis of financial news headlines using hybrid Word2Vec-TFIDF feature extraction technique. _Procedia Computer Science, 244_, 1–8.

Hussein, S. (2021). _Twitter sentiments dataset_. Mendeley.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. _Journal of Machine Learning Research, 12_, 2825–2830.

Rani, D., Kumar, R., & Chauhan, N. (2022, October). Study and comparison of vectorization techniques used in text classification. In _2022 13th International Conference on Computing Communication and Networking Technologies (ICCCNT)_ (pp. 1–6). IEEE.

Wolfram Research. (2015). _DeleteStopwords_. https://reference.wolfram.com/language/ref/DeleteStopwords.html


# **appendix: `clean` wrapper function definition**

Below is the definition of the `clean` wrapper function that encapsulates all internal functions used in the cleaning pipeline.


In [214]:
clean??

Signature: clean(text: str) -> str
Source:   
def clean(text: str) -> str:
    """
    This is the main function for data cleaning (i.e., it calls all the cleaning functions in the prescribed order).

    This function should be used as a first-class function in a map.

    # Parameters
    * text: The string entry from a DataFrame column.
    * stopwords: stopword dictionary.

    # Returns
    Clean string
    """
    # cleaning on the base string
    text = normalize(text)
    text = rem_punctuation(text)
    text = rem_numbers(text)
    text = collapse_whitespace(text)

    return text
File:      c:\users\erin\documents\github\stintsy-order-of-erin\lib\janitor.py
Type:      function

# **appendix: `find_spam_and_empty` wrapper function definition**

Below is the definition of the `find_spam_and_empty` wrapper function that encapsulates all internal functions for the spam detection algorithm.


In [215]:
find_spam_and_empty??

Signature: find_spam_and_empty(text: str, min_length: int = 3) -> str | None
Source:   
def find_spam_and_empty(text: str, min_length: int = 3) -> str | None:
    """
    Filter out empty text and unintelligible/spammy unintelligible substrings in the text.

    Spammy substrings:
    - Shorter than min_length
    - Containing non-alphabetic characters
    - Consisting of a repeated substring (e.g., 'aaaaaa', 'ababab', 'abcabcabc')

    # Parameters
    * text: input string.
    * min_length: minimum length of word to keep.

    # Returns
        Cleaned string, or None if empty after filtering.
    """
    cleaned_tokens = []
    for t in text.split():
        if len(t) < min_length:

# **appendix: comparison of traditional vectorization techniques**

Traditional vectorization techniques include BoW and Term Frequency-Inverse Document Frequency (TF-IDF). TF-IDF weights each word based on its frequency in a document and its rarity across the corpus, reducing the impact of common words. BoW, in contrast, simply counts word occurrences without considering corpus-level frequency. In this project, BoW was chosen because stopwords were already removed during preprocessing, and the dataset is domain-specific <u>(Rani et al., 2022)</u>. In such datasets, frequent words are often meaningful domain keywords, so scaling them down (as TF-IDF would) could reduce the importance of these key terms in the feature representation.
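
The difference between the two weightings can be seen in a small worked example. The sketch below uses the textbook, unsmoothed form $\text{tf-idf}(t, d) = \text{tf}(t, d) \cdot \ln(N / \text{df}(t))$, which is an assumption made for illustration; scikit-learn's `TfidfVectorizer` uses a smoothed variant by default, so its numbers would differ. The helper name `tf_idf` and the toy documents are likewise hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Textbook TF-IDF: raw term count times ln(N / document frequency)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append(
            {term: count * math.log(n_docs / doc_freq[term]) for term, count in counts.items()}
        )
    return weights


# "modi" occurs in every toy document, like a domain keyword would.
toy_docs = [["modi", "govt"], ["modi", "vote"], ["modi", "india"]]
bow_counts = [Counter(doc) for doc in toy_docs]
weights = tf_idf(toy_docs)

print(bow_counts[0]["modi"], weights[0]["modi"])  # 1 0.0
```

Under BoW the domain keyword "modi" keeps its count of 1, while this form of TF-IDF scales it down to zero because it appears in every document, which is the effect the paragraph above describes.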


# **appendix: `BagOfWordsModel` class definition**

Below is the definition of the `BagOfWordsModel` class that encapsulates the desired parameters.


In [216]:
BagOfWordsModel??

Init signature: BagOfWordsModel(texts: Iterable[str], min_freq: int | float | None = None)
Source:        
class BagOfWordsModel:
    """
    A Bag-of-Words representation for a text corpus.

    # Attributes
    * matrix (scipy.sparse.csr_matrix): The document-term matrix of word counts.
    * feature_names (list[str]): List of feature names corresponding to the matrix columns.
    *
    # Usage
    ```
    bow = BagOfWordsModel(df["lemmatized_str"])
    ```
    """

    def __init__(self, texts: Iterable[str], min_freq: int | float | None = None):
        """
        Initialize the BagOfWordsModel by fitting the vectorizer to the text corpus. This also filters out tokens
        that do not appear more than 