# Text Processing & EDA — Step 1

This notebook covers the **first step** of the LLM project pipeline:

1. **Exploratory Data Analysis**  

2. **Text preprocessing**  
   - Normalize the text by making all letters lowercase.
   - Remove all HTML tags (e.g., `<br/>`).
   - Remove all email addresses.
   - Remove all URLs.
   - Remove all punctuation.
   - Remove stop words.
   - Lemmatize the words (via spaCy).

3. **Saving cleaned data**  
   - Export the preprocessed DataFrame to Google Drive (`MyDrive/preprocessed_reviews.csv`)  
   - Ready for modeling


In [1]:
import pandas as pd  # DataFrame handling
import re, string

# Mounting Google Drive

In [2]:
# mount or connect to drive to get the dataset
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# file path on MyDrive
file_path = '/content/drive/My Drive/IMDB Dataset.csv'

# Read the CSV file into a pandas DataFrame
df = pd.read_csv(file_path)

# EDA: Exploring the Dataset

In [4]:
df.describe()

,review,sentiment
count,50000,50000
unique,49582,2
top,Loved today's show!!! It was a variety and not...,positive
freq,5,25000
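
The `describe()` output reports 50,000 rows but only 49,582 unique reviews, so 418 rows repeat an earlier review text. A minimal sketch of how such duplicates could be counted, shown on a small made-up list rather than the real dataset (the sample strings are illustrative assumptions):

```python
from collections import Counter

# Hypothetical sample standing in for df['review'] (assumption)
reviews = [
    "great movie",
    "terrible plot",
    "great movie",
    "loved it",
    "great movie",
]

counts = Counter(reviews)

# Rows beyond the first occurrence of each text
n_duplicates = len(reviews) - len(counts)

# Texts that occur more than once, most frequent first
repeated = [(text, n) for text, n in counts.most_common() if n > 1]

print("Duplicate rows:", n_duplicates)
print("Repeated texts:", repeated)
```

On the DataFrame itself, `df.duplicated(subset='review').sum()` should give the same kind of count, and `df.drop_duplicates(subset='review')` would drop the repeats if that is wanted before modeling.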


In [5]:
# Display the first five rows of the DataFrame
print(df.head())

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive


In [6]:
# Displaying the data shape + type + count for each class
print("Shape:", df.shape)
print(df.dtypes)
print("\n")
display(df['sentiment'].value_counts())

Shape: (50000, 2)
review       object
sentiment    object
dtype: object




sentiment,count
positive,25000
negative,25000




```
< The previous output shows that the dataset is balanced: 25,000 positive and 25,000 negative reviews >
```



In [7]:
# Count missing values
print(df[['review','sentiment']].isnull().sum())

review       0
sentiment    0
dtype: int64




```
< The previous output shows that there are no missing values >
```



In [8]:
# Count the most frequent words for each class
from collections import Counter
import re

def top_n(texts, n=15):
    cnt = Counter()
    for t in texts:
        cnt.update(re.findall(r'\b\w+\b', t.lower()))
    return cnt.most_common(n)

for label in df['sentiment'].unique():
    print(f"\nTop words for {label}:")
    for word, freq in top_n(df[df['sentiment']==label]['review']):
        print(f"  {word}: {freq}")



Top words for positive:
  the: 341281
  and: 176634
  a: 164323
  of: 152105
  to: 131322
  is: 111830
  in: 99250
  br: 97954
  it: 95133
  i: 81997
  this: 69648
  that: 69593
  s: 64675
  as: 51106
  with: 45718

Top words for negative:
  the: 326712
  a: 158647
  and: 147807
  of: 137305
  to: 136802
  br: 103997
  is: 99252
  it: 95724
  i: 93636
  in: 87531
  this: 81354
  that: 74286
  s: 60333
  was: 52269
  movie: 50117
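
Both lists are dominated by function words, plus `br` (left over from `<br />` tags) and `s` (split off from possessives and contractions), so they say little about sentiment. A rough sketch of a variant of `top_n` that strips tags and skips stop words before counting; the name `top_n_filtered` and the tiny `SMALL_STOP` set are illustrative assumptions, not part of the notebook:

```python
import re
from collections import Counter

# Tiny illustrative stop word set (assumption); a real list would be much longer
SMALL_STOP = {"the", "a", "and", "of", "to", "is", "in", "it", "i", "this", "that", "s", "was"}

def top_n_filtered(texts, n=15, stop=SMALL_STOP):
    """Count the most frequent words after removing HTML tags and stop words."""
    cnt = Counter()
    for t in texts:
        t = re.sub(r'<[^>]+>', ' ', t.lower())  # drop tags such as <br />
        words = re.findall(r'\b\w+\b', t)
        cnt.update(w for w in words if w not in stop)
    return cnt.most_common(n)

# Made-up sample texts (assumption)
sample = [
    "The movie was great.<br /><br />Great acting!",
    "This is a great movie.",
]
print(top_n_filtered(sample, n=2))
```

Run per class on the real reviews, a filtered count like this is more likely to surface words that differ between positive and negative reviews.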


# Text Processing

Required preprocessing steps:

● Normalize the text by making all letters lowercase.

● Remove all HTML tags (e.g., \<br/>).

● Remove all email addresses.

● Remove all URLs.

● Remove all punctuation.

● Remove stop words.

● Lemmatize the words.
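
The order of these steps matters: email addresses and URLs have to be removed before punctuation, because stripping `@`, `.`, and `/` first would leave fragments that the patterns no longer match. A minimal sketch of the regex-based steps chained in one function; `clean_text` is a hypothetical helper name, and stop word removal and lemmatization are left out because they need external libraries:

```python
import re
import string

TAG_RE = re.compile(r'<[^>]+>')
EMAIL_RE = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')
URL_RE = re.compile(r'https?://\S+|www\.\S+')
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_text(text):
    """Apply lowercase, tag, email, URL, and punctuation removal in that order."""
    text = text.lower()
    text = TAG_RE.sub('', text)
    text = EMAIL_RE.sub('', text)
    text = URL_RE.sub('', text)
    text = text.translate(PUNCT_TABLE)
    # Collapse leftover whitespace (an addition to the listed steps)
    return ' '.join(text.split())

# Made-up example text (assumption)
example = "Great film! <br />Mail me at Jane.Doe@example.com or see www.example.com/reviews now."
print(clean_text(example))
```

With the DataFrame, a function like this could be applied as `df['review'].apply(clean_text)`; the cells below perform the same steps one at a time so each effect can be inspected.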


### Normalize letters to lowercase

In [9]:
#Normalize the text by making all letters lowercase.
df['review'] = df['review'].str.lower()
print(df.head())

                                              review sentiment
0  one of the other reviewers has mentioned that ...  positive
1  a wonderful little production. <br /><br />the...  positive
2  i thought this was a wonderful way to spend ti...  positive
3  basically there's a family where a little boy ...  negative
4  petter mattei's "love in the time of money" is...  positive


### Remove HTML tags

In [10]:
# Remove all HTML tags (e.g., <br/>).
# Use pandas Series.str.replace with a regex to delete anything between < and >
df['review'] = df['review'].str.replace(r'<[^>]+>', '', regex=True)
print(df.head())

                                              review sentiment
0  one of the other reviewers has mentioned that ...  positive
1  a wonderful little production. the filming tec...  positive
2  i thought this was a wonderful way to spend ti...  positive
3  basically there's a family where a little boy ...  negative
4  petter mattei's "love in the time of money" is...  positive
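
Replacing a tag with an empty string joins the neighbouring words whenever the tag was the only separator between them, which is the likely source of fused tokens such as `me.the` (later `methe`) in review 0. A hedged alternative sketch that substitutes a space and then collapses repeated whitespace; `strip_tags` is a hypothetical helper, not the cell's actual code:

```python
import re

def strip_tags(text):
    """Replace each HTML tag with a space, then collapse runs of whitespace."""
    no_tags = re.sub(r'<[^>]+>', ' ', text)
    return re.sub(r'\s+', ' ', no_tags).strip()

# Made-up fragment modelled on the pattern seen in the reviews (assumption)
raw = "happened with me.<br /><br />the first thing"

deleted = re.sub(r'<[^>]+>', '', raw)  # tags deleted: neighbouring words fuse
spaced = strip_tags(raw)               # tags replaced by a space

print(deleted)
print(spaced)
```

The pandas equivalent would be to pass `' '` instead of `''` as the replacement in `str.replace`. The outputs shown in the rest of this notebook come from the empty-string version.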


### Remove email addresses

**Before:**

In [11]:
# Print first 5 reviews containing email addresses
# Print full text
for idx, text in df.loc[df['review'].str.contains(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}', na=False), 'review'].head(5).items():
    print(idx, text)

1281 i like many others saw this as a child and i loved it and it horrified me up until adulthood, i have been trying to find this movie and even been searching for it to play again on tv someday, since it originally played on usa networks. does anyone know where to buy this movie, or does anyone have it and would be willing to make a copy for me? also does anyone know if there is a chance for it to be played on tv again? maybe all of us fans should write a station in hopes of them airing it again. i don't think they did a good job of promoting this movie in the past because no one really knows about, people only know of the stepford wives and stepford husband movies. no one is familiar with the fact that there was a children version. maybe they should also do a re-make of it since they seem to be doing that a lot lately with a lot of my favorite old thriller/horror flicks. well if anyone has any input please i beg of you write me with information. thanks taira tcampo23@aol.com
3568 i 

In [12]:
# Remove all email addresses using a regex
# Note: the TLD class is [A-Za-z]; a '|' inside a character class would match a literal pipe
df['review'] = df['review'].str.replace(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '', regex=True)

**After:**

In [13]:
# Show that the emails were removed from the rows that previously contained them
pd.set_option('display.max_colwidth', None)
df.loc[[1281, 3568]]

,review,sentiment
1281,"i like many others saw this as a child and i loved it and it horrified me up until adulthood, i have been trying to find this movie and even been searching for it to play again on tv someday, since it originally played on usa networks. does anyone know where to buy this movie, or does anyone have it and would be willing to make a copy for me? also does anyone know if there is a chance for it to be played on tv again? maybe all of us fans should write a station in hopes of them airing it again. i don't think they did a good job of promoting this movie in the past because no one really knows about, people only know of the stepford wives and stepford husband movies. no one is familiar with the fact that there was a children version. maybe they should also do a re-make of it since they seem to be doing that a lot lately with a lot of my favorite old thriller/horror flicks. well if anyone has any input please i beg of you write me with information. thanks taira",positive
3568,"i have noticed that people have asked if anyone has this show. i have all 26 episodes that aired in the u.s. and will be willing to share these with anyone interested. all i require is that you supply the vhs tapes or blank dvd's i have them on both formats and pay for shipping. my email is , just send me an email and your request and i will notify you and we can make the arrangements. the quality is very good and they are very enjoyable to watch especially if you have not been able to see them since they aired in the 60's. it was one of my favorite shows as a child and hold a very special place in my heart because it brings back a lot of memories of my childhood as well as other shows like ultraman and astroboy.peter",positive


### Remove all URLs

**Before:**

In [14]:
# Print first row that contains a URL
df.loc[df['review'].str.contains(r'https?://\S+|www\.\S+', na=False)].head(1)

,review,sentiment
742,"mario lewis of the competitive enterprise institute has written a definitive 120-page point-by-point, line-by-line refutation of this mendacious film, which should be titled a convenient lie. the website address where his debunking report, which is titled ""a skeptic's guide to an inconvenient truth"" can be found at is :www.cei.org. a shorter 10-page version can be found at: www.cei.org/pdf/5539.pdf once you read those demolitions, you'll realize that alleged ""global warming"" is no more real or dangerous than the y2k scare of 1999, which gore also endorsed, as he did the pseudo-scientific film the day after tomorrow, which was based on a book written by alleged ufo abductee whitley strieber. as james ""the amazing"" randi does to psychics, and philip klass does to ufos, and gerald posner does to jfk conspir-idiocy theories, so does mario lewis does to al gore's movie and the whole ""global warming"" scam.",negative


In [15]:
# Remove all URLs from 'review'
df['review'] = df['review'].str.replace(r'https?://\S+|www\.\S+', '', regex=True)

**After:**

In [16]:
# Show the same row (index 742) to confirm the URLs were removed
pd.set_option('display.max_colwidth', None)
df.loc[[742]]

,review,sentiment
742,"mario lewis of the competitive enterprise institute has written a definitive 120-page point-by-point, line-by-line refutation of this mendacious film, which should be titled a convenient lie. the website address where his debunking report, which is titled ""a skeptic's guide to an inconvenient truth"" can be found at is : a shorter 10-page version can be found at: once you read those demolitions, you'll realize that alleged ""global warming"" is no more real or dangerous than the y2k scare of 1999, which gore also endorsed, as he did the pseudo-scientific film the day after tomorrow, which was based on a book written by alleged ufo abductee whitley strieber. as james ""the amazing"" randi does to psychics, and philip klass does to ufos, and gerald posner does to jfk conspir-idiocy theories, so does mario lewis does to al gore's movie and the whole ""global warming"" scam.",negative


### Remove all punctuation

**Before:**

In [17]:
# detect first row with punctuation and display it
idx = df.index[df['review'].str.contains(f"[{re.escape(string.punctuation)}]", na=False)][0]
print(idx, df.at[idx, 'review'])

0 one of the other reviewers has mentioned that after watching just 1 oz episode you'll be hooked. they are right, as this is exactly what happened with me.the first thing that struck me about oz was its brutality and unflinching scenes of violence, which set in right from the word go. trust me, this is not a show for the faint hearted or timid. this show pulls no punches with regards to drugs, sex or violence. its is hardcore, in the classic use of the word.it is called oz as that is the nickname given to the oswald maximum security state penitentary. it focuses mainly on emerald city, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. em city is home to many..aryans, muslims, gangstas, latinos, christians, italians, irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.i would say the main appeal of the show is due to the fact that it goes where other shows w

In [18]:
#remove punctuation from 'review'
df['review'] = df['review'].str.translate(str.maketrans('', '', string.punctuation))

**After:**

In [19]:
#show same row after punctuation removal
print(idx, df.at[idx, 'review'])

0 one of the other reviewers has mentioned that after watching just 1 oz episode youll be hooked they are right as this is exactly what happened with methe first thing that struck me about oz was its brutality and unflinching scenes of violence which set in right from the word go trust me this is not a show for the faint hearted or timid this show pulls no punches with regards to drugs sex or violence its is hardcore in the classic use of the wordit is called oz as that is the nickname given to the oswald maximum security state penitentary it focuses mainly on emerald city an experimental section of the prison where all the cells have glass fronts and face inwards so privacy is not high on the agenda em city is home to manyaryans muslims gangstas latinos christians italians irish and moreso scuffles death stares dodgy dealings and shady agreements are never far awayi would say the main appeal of the show is due to the fact that it goes where other shows wouldnt dare forget pretty pictu
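
Deleting punctuation outright fuses words that were separated only by punctuation; for example, `many..aryans` becomes `manyaryans` in the output above. One possible alternative, sketched here under the assumption that extra token splits such as `you ll` are acceptable, maps each punctuation character to a space instead; `punct_to_space` is a hypothetical helper:

```python
import string

# Map every ASCII punctuation character to a single space
SPACE_TABLE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def punct_to_space(text):
    """Replace punctuation with spaces and collapse repeated whitespace."""
    return ' '.join(text.translate(SPACE_TABLE).split())

# Fragment modelled on review 0
sample = "home to many..aryans, muslims and more....so scuffles"

deleted = sample.translate(str.maketrans('', '', string.punctuation))
spaced = punct_to_space(sample)

print(deleted)  # punctuation deleted: words fuse
print(spaced)   # punctuation replaced by spaces
```

The trade-off is that contractions split into two tokens (`you'll` becomes `you ll` rather than `youll`), so which variant is preferable depends on the downstream tokenizer.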

### Remove stop words

**Before:**

In [20]:
# Detect the first row containing stop words and display it, using scikit-learn's built-in English stop word list (ENGLISH_STOP_WORDS)
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS as stop
idx = df.index[df['review'].str.lower().str.split().apply(lambda ws: any(w in stop for w in ws))][0]
print(idx, df.at[idx, 'review'])

0 one of the other reviewers has mentioned that after watching just 1 oz episode youll be hooked they are right as this is exactly what happened with methe first thing that struck me about oz was its brutality and unflinching scenes of violence which set in right from the word go trust me this is not a show for the faint hearted or timid this show pulls no punches with regards to drugs sex or violence its is hardcore in the classic use of the wordit is called oz as that is the nickname given to the oswald maximum security state penitentary it focuses mainly on emerald city an experimental section of the prison where all the cells have glass fronts and face inwards so privacy is not high on the agenda em city is home to manyaryans muslims gangstas latinos christians italians irish and moreso scuffles death stares dodgy dealings and shady agreements are never far awayi would say the main appeal of the show is due to the fact that it goes where other shows wouldnt dare forget pretty pictu

In [21]:
# Remove stop words from 'review'
df['review'] = df['review'].str.split().apply(lambda ws: ' '.join(w for w in ws if w.lower() not in stop))

**After:**

In [22]:
#show same row after stop words removal
print(idx, df.at[idx, 'review'])

0 reviewers mentioned watching just 1 oz episode youll hooked right exactly happened methe thing struck oz brutality unflinching scenes violence set right word trust faint hearted timid pulls punches regards drugs sex violence hardcore classic use wordit called oz nickname given oswald maximum security state penitentary focuses mainly emerald city experimental section prison cells glass fronts face inwards privacy high agenda em city home manyaryans muslims gangstas latinos christians italians irish moreso scuffles death stares dodgy dealings shady agreements far awayi say main appeal fact goes shows wouldnt dare forget pretty pictures painted mainstream audiences forget charm forget romanceoz doesnt mess episode saw struck nasty surreal say ready watched developed taste oz got accustomed high levels graphic violence just violence injustice crooked guards wholl sold nickel inmates wholl kill order away mannered middle class inmates turned prison bitches lack street skills prison experi



```
< As shown, words such as "one", "of", "the", and "has" have been removed >
```
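
The comparison also shows that `not` and `no` were dropped (`this is not a show for the faint hearted or timid` became `faint hearted timid`), which can matter for sentiment data. A rough sketch of stop word removal that keeps a few negation words; the `NEGATIONS` set and the small `BASE_STOP` set are illustrative assumptions standing in for a full stop word list:

```python
# Small illustrative stop word set (assumption); a real list is far longer
BASE_STOP = {"this", "is", "a", "for", "the", "or", "not", "no", "me"}

# Negation words to keep because they can flip sentiment (assumption)
NEGATIONS = {"not", "no", "never", "nor"}

def remove_stop_words(text, stop=BASE_STOP, keep=NEGATIONS):
    """Drop stop words from whitespace-separated text, except those in `keep`."""
    effective_stop = set(stop) - set(keep)
    return ' '.join(w for w in text.split() if w.lower() not in effective_stop)

sample = "trust me this is not a show for the faint hearted"

plain = remove_stop_words(sample, keep=set())  # plain stop word removal
with_negations = remove_stop_words(sample)     # negations preserved

print(plain)
print(with_negations)
```

With scikit-learn's list, the same idea would amount to subtracting the negation set from `stop` before filtering. Whether this helps is something to check during modeling rather than assume.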



### Lemmatize the words

**Before:**



```
# spaCy lemmatizes each token using its part-of-speech tag, unlike some methods that treat every word as a noun by default:
“running” → “run”

“better” → “good”
```



In [23]:
# get the first review that changes after spaCy lemmatization
import spacy
nlp = spacy.load("en_core_web_sm", disable=["parser","ner"])
def spacy_lemma(text):
    return " ".join(tok.lemma_ for tok in nlp(text))

#Locate the first review whose lemmatized version differs from the original
idx = next(i for i, t in df['review'].items() if spacy_lemma(t) != t)
#Show that index and its original review text
print(idx, df.at[idx, 'review'])


0 reviewers mentioned watching just 1 oz episode youll hooked right exactly happened methe thing struck oz brutality unflinching scenes violence set right word trust faint hearted timid pulls punches regards drugs sex violence hardcore classic use wordit called oz nickname given oswald maximum security state penitentary focuses mainly emerald city experimental section prison cells glass fronts face inwards privacy high agenda em city home manyaryans muslims gangstas latinos christians italians irish moreso scuffles death stares dodgy dealings shady agreements far awayi say main appeal fact goes shows wouldnt dare forget pretty pictures painted mainstream audiences forget charm forget romanceoz doesnt mess episode saw struck nasty surreal say ready watched developed taste oz got accustomed high levels graphic violence just violence injustice crooked guards wholl sold nickel inmates wholl kill order away mannered middle class inmates turned prison bitches lack street skills prison experi

In [24]:
#  lemmatize every review using spaCy
df['review'] = df['review'].apply(spacy_lemma)

**After:**

In [25]:
# print that same review again to see the lemmatized version
print(idx, df.at[idx, 'review'])

0 reviewer mention watch just 1 oz episode you ll hook right exactly happen methe thing strike oz brutality unflinche scene violence set right word trust faint hearted timid pull punch regard drug sex violence hardcore classic use wordit call oz nickname give oswald maximum security state penitentary focus mainly emerald city experimental section prison cell glass front face inward privacy high agenda em city home manyaryan muslim gangstas latinos christians italians irish moreso scuffle death stare dodgy dealing shady agreement far awayi say main appeal fact go show would not dare forget pretty picture paint mainstream audience forget charm forget romanceoz do not mess episode see strike nasty surreal say ready watch develop taste oz get accustomed high level graphic violence just violence injustice crook guard who ll sell nickel inmate who ll kill order away mannered middle class inmate turn prison bitch lack street skill prison experience watch oz comfortable uncomfortable viewingth

# Save the Processed Data to MyDrive

In [26]:
#Display the first 5 rows
df.head()

,review,sentiment
0,reviewer mention watch just 1 oz episode you ll hook right exactly happen methe thing strike oz brutality unflinche scene violence set right word trust faint hearted timid pull punch regard drug sex violence hardcore classic use wordit call oz nickname give oswald maximum security state penitentary focus mainly emerald city experimental section prison cell glass front face inward privacy high agenda em city home manyaryan muslim gangstas latinos christians italians irish moreso scuffle death stare dodgy dealing shady agreement far awayi say main appeal fact go show would not dare forget pretty picture paint mainstream audience forget charm forget romanceoz do not mess episode see strike nasty surreal say ready watch develop taste oz get accustomed high level graphic violence just violence injustice crook guard who ll sell nickel inmate who ll kill order away mannered middle class inmate turn prison bitch lack street skill prison experience watch oz comfortable uncomfortable viewingthat touch darker,positive
1,wonderful little production filming technique unassume oldtimebbc fashion give comfort discomforte sense realism entire piece actor extremely choose michael sheen get polari voice pat truly seamless editing guide reference williams diary entry worth watch terrificly write perform piece masterful production great master comedy life realism really come home little thing fantasy guard use traditional dream technique remain solid disappear play knowledge sense particularly scene concern orton halliwell set particularly flat halliwell mural decorate surface terribly,positive
2,think wonderful way spend time hot summer weekend sit air condition theater watch lighthearted comedy plot simplistic dialogue witty character likable bread suspect serial killer disappoint realize match point 2 risk addiction think proof woody allen fully control style grow lovethis i d laugh woodys comedy year dare say decade I ve impress scarlet johanson manage tone sexy image jump right average spirited young womanthis crown jewel career witty devil wear prada interesting superman great comedy friend,positive
3,basically there s family little boy jake think there s zombie closet parent fight timethis movie slow soap opera suddenly jake decide rambo kill zombieok you re go make film decide thriller drama drama movie watchable parent divorce argue like real life jake closet totally ruin film expect boogeyman similar movie instead watch drama meaningless thriller spots3 10 just play parent descent dialog shot jake just ignore,negative
4,petter matteis love time money visually stunning film watch mr mattei offer vivid portrait human relation movie tell money power success people different situation encounter variation arthur schnitzlers play theme director transfer action present time new york different character meet connect connect way person know previous point contact stylishly film sophisticated luxurious look take people live world live habitatthe thing get soul picture different stage loneliness inhabit big city exactly good place human relation fulfillment discern case people encounterthe act good mr matteis direction steve buscemi rosario dawson carol kane michael imperioli adrian grenier rest talente cast make character come alivewe wish mr mattei good luck await anxiously work,positive


In [27]:
#Save the preprocessed DataFrame to MyDrive as CSV
output_path = '/content/drive/MyDrive/preprocessed_reviews.csv'
df.to_csv(output_path, index=False)
print(f"DataFrame saved to {output_path}")

DataFrame saved to /content/drive/MyDrive/preprocessed_reviews.csv


In [29]:
# Confirm the file exists and view its size
import os

path = '/content/drive/MyDrive/preprocessed_reviews.csv'
exists = os.path.exists(path)

print(f"Exists: {exists}")
if exists:
    # Read the size only after confirming the file exists, so a missing file does not raise an error
    size_mb = os.path.getsize(path) / (1024**2)
    print(f"Size: {size_mb:.2f} MB")


Exists: True
Size: 35.13 MB
