# **Data Preprocessing and Embeddings**

In this section we'll learn how to carefully preprocess data, which is an incredibly important step in Generative AI pipelines.

We will work with Kaggle's [IMDB Dataset of 50K Movie Reviews](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?resource=download).

## 1. Text Preprocessing

### 1.1 Get Text Data

In [1]:
import kagglehub
from pathlib import Path
import pandas as pd

# Download latest version
path = kagglehub.dataset_download("lakshmi25npathi/imdb-dataset-of-50k-movie-reviews")

# Convert path to a Path object for convenience
path = Path(path)

# List the files
print(list(path.glob("*")))

# Get 'IMDB Dataset.csv'
csv_file = path / "IMDB Dataset.csv"

# Load into pandas
df = pd.read_csv(csv_file)

[PosixPath('/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv')]


In [2]:
# Check the first rows
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [3]:
df.shape

(50000, 2)

### 1.2 Turn Reviews into lower case

We can simply use pandas' `Series.str.lower()`, the vectorized counterpart of Python's `str.lower()` method:

In [4]:
df['review'][3]

"Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.<br /><br />OK, first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jake with his closet which totally ruins all the film! I expected to see a BOOGEYMAN similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. As for the shots with Jake: just ignore them."

In [5]:
df['review'] = df['review'].str.lower()

In [6]:
df

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. <br /><br />the...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive
...,...,...
49995,i thought this movie did a down right good job...,positive
49996,"bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,i am a catholic taught in parochial elementary...,negative
49998,i'm going to have to disagree with the previou...,negative


### 1.3 Remove HTML tags

Another cleaning step is to remove the HTML tags from our reviews.

We can use Python's [`re`](https://docs.python.org/3/library/re.html) regular expressions module to do so, replacing each tag with an empty string.

In [7]:
import re
def remove_html_tags(text):
    pattern = re.compile('<.*?>')
    return pattern.sub(r'', text)

In [8]:
text = "<html><body><p> Movie 1</p><p> Actor - Aamir Khan</p><p> Click here to <a href='http://google.com'>download</a></p></body></html>"

In [9]:
remove_html_tags(text)

' Movie 1 Actor - Aamir Khan Click here to download'

In [10]:
# apply the function to the full dataset with the apply() method

df['review'] = df['review'].apply(remove_html_tags)
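One caveat: replacing tags with an empty string glues together text that was separated only by tags, so `time.<br /><br />This` becomes `time.This`. A possible variant, sketched here with a hypothetical helper name `remove_html_tags_keep_spacing` (not part of the notebook), replaces each tag with a space and then collapses repeated whitespace:

```python
import re

TAG_RE = re.compile(r"<.*?>")   # same non-greedy tag pattern as above
WS_RE = re.compile(r"\s+")      # any run of whitespace

def remove_html_tags_keep_spacing(text):
    """Replace each tag with a space, then collapse repeated whitespace."""
    no_tags = TAG_RE.sub(" ", text)
    return WS_RE.sub(" ", no_tags).strip()

sample = "fighting all the time.<br /><br />This movie is slower"
print(remove_html_tags_keep_spacing(sample))
# fighting all the time. This movie is slower
```

Whether this matters depends on the next steps: a tokenizer that splits on whitespace would otherwise treat `time.this` as a single token.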

In [11]:
df['review'][3]

"basically there's a family where a little boy (jake) thinks there's a zombie in his closet & his parents are fighting all the time.this movie is slower than a soap opera... and suddenly, jake decides to become rambo and kill the zombie.ok, first of all when you're going to make a film you must decide if its a thriller or a drama! as a drama the movie is watchable. parents are divorcing & arguing like in real life. and then we have jake with his closet which totally ruins all the film! i expected to see a boogeyman similar movie, and instead i watched a drama with some meaningless thriller spots.3 out of 10 just for the well playing parents & descent dialogs. as for the shots with jake: just ignore them."

### 1.4 Punctuation Handling

Can we handle weird punctuation if we don't like some of it? Yes we can...

In [12]:
import string,time
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [13]:
exclude = string.punctuation
exclude

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [14]:
def remove_punc(text):
    for char in exclude:
        text = text.replace(char,'')
    return text

In [15]:
text = 'string. With. Punctuation?'

In [16]:
start = time.time()
print(remove_punc(text))
time1 = time.time() - start
print(time1*50000)

string With Punctuation
18.143653869628906


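A more idiomatic alternative is `str.translate`, which removes all the characters in a single pass. It is typically faster on long texts, but note that the single timing below, taken on a very short string, does not show a speed-up: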

In [17]:
def remove_punc1(text):
    return text.translate(str.maketrans('', '', exclude))

In [18]:
start = time.time()
remove_punc1(text)
time2 = time.time() - start
print(time2*50000)

47.206878662109375


In [19]:
time1/time2

0.3843434343434343
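Timing one call with `time.time()` is too noisy to compare the two functions fairly. A steadier benchmark can be sketched with `timeit`, building the translation table once and using a longer input. The names `PUNCT_TABLE`, `remove_punc_loop` and `remove_punc_translate` and the repeated sample string are assumptions made for illustration, and the timings will vary by machine:

```python
import string
import timeit

exclude = string.punctuation
# Build the translation table once instead of on every call
PUNCT_TABLE = str.maketrans("", "", exclude)

def remove_punc_loop(text):
    # One full pass over the text per punctuation character
    for char in exclude:
        text = text.replace(char, "")
    return text

def remove_punc_translate(text):
    # A single pass that drops every character mapped to None in the table
    return text.translate(PUNCT_TABLE)

sample = "string. With. Punctuation? " * 200

# Both approaches should produce exactly the same cleaned text
assert remove_punc_loop(sample) == remove_punc_translate(sample)

loop_time = timeit.timeit(lambda: remove_punc_loop(sample), number=200)
translate_time = timeit.timeit(lambda: remove_punc_translate(sample), number=200)
print(f"loop: {loop_time:.4f}s  translate: {translate_time:.4f}s")
```

Repeating each call many times averages out one-off costs, so the comparison reflects the work done on the text itself.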

In [20]:
df['review'][5]

'probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but it\'s not preachy or boring. it just never gets old, despite my having seen it some 15 or more times in the last 25 years. paul lukas\' performance brings tears to my eyes, and bette davis, in one of her very few truly sympathetic roles, is a delight. the kids are, as grandma says, more like "dressed-up midgets" than children, but that only makes them more fun to watch. and the mother\'s slow awakening to what\'s happening in the world and under her own roof is believable and startling. if i had a dozen thumbs, they\'d all be "up" for this movie.'

In [21]:
remove_punc1(df['review'][5])

'probably my alltime favorite movie a story of selflessness sacrifice and dedication to a noble cause but its not preachy or boring it just never gets old despite my having seen it some 15 or more times in the last 25 years paul lukas performance brings tears to my eyes and bette davis in one of her very few truly sympathetic roles is a delight the kids are as grandma says more like dressedup midgets than children but that only makes them more fun to watch and the mothers slow awakening to whats happening in the world and under her own roof is believable and startling if i had a dozen thumbs theyd all be up for this movie'

In [22]:
# we could clean the entire dataset by applying the function to every review
# df['review'] = df['review'].apply(remove_punc1)

### 1.5 Handling Chat Shortcuts

How about changing chat shortcuts? Like LOL -> laughing out loud (we would find a lot of these in social media data!)

In [23]:
# create a dictionary mapping shortcuts to their full phrases

chat_words = {
    "AFAIK": "As Far As I Know",
    "AFK": "Away From Keyboard",
    "ASAP": "As Soon As Possible",
    "FYI": "For Your Information",
    "BRB": "Be Right Back",
    "BTW": "By The Way",
    "OMG": "Oh My God",
    "IMO": "In My Opinion",
    "LOL": "Laugh Out Loud",
    "TTYL": "Talk To You Later",
    "GTG": "Got To Go",
    "TTYT": "Talk To You Tomorrow",
    "IDK": "I Don't Know",
    "TMI": "Too Much Information",
    "IMHO": "In My Humble Opinion",
    "ICYMI": "In Case You Missed It",
    "FAQ": "Frequently Asked Questions",
    "TGIF": "Thank God It's Friday",
    "FYA": "For Your Action",
}

In [24]:
# create a chat conversion function

def chat_conversion(text):
    new_text = []
    for w in text.split():
        if w.upper() in chat_words:
            new_text.append(chat_words[w.upper()])
        else:
            new_text.append(w)
    return " ".join(new_text)

In [25]:
chat_conversion('Do this work ASAP')

'Do this work As Soon As Possible'
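Because `chat_conversion` splits on whitespace, a shortcut with punctuation attached (for example `ASAP!`) is not found in the dictionary and stays unchanged. One possible workaround, shown as a sketch with a hypothetical function name `chat_conversion_regex` and a two-entry subset of the dictionary, matches shortcuts as whole words with a regular expression:

```python
import re

# Small subset of the shortcut dictionary, just for this illustration
chat_words = {
    "ASAP": "As Soon As Possible",
    "LOL": "Laugh Out Loud",
}

# One pattern that matches any shortcut as a whole word, ignoring case
shortcut_pattern = re.compile(
    r"\b(" + "|".join(map(re.escape, chat_words)) + r")\b",
    re.IGNORECASE,
)

def chat_conversion_regex(text):
    # Look up the upper-cased match so 'asap' and 'ASAP' both resolve
    return shortcut_pattern.sub(lambda m: chat_words[m.group(0).upper()], text)

print(chat_conversion_regex("Do this work asap!"))
# Do this work As Soon As Possible!
```

The `\b` word boundaries keep the pattern from firing inside longer words, while surrounding punctuation is left in place.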

### 1.6 Incorrect Text Handling

In [26]:
from textblob import TextBlob

In [27]:
incorrect_text = 'ceertain conditionas duriing seveal ggenerations aree moodified in the saame maner.'

textBlb = TextBlob(incorrect_text)

textBlb.correct().string

'certain conditions during several generations are modified in the same manner.'

### 1.7 Stopwords

Sometimes we want to remove stopwords because the meaning of a sentence is carried by the rest of the words (this movie was awesome, I loved it -> movie awesome loved -> positive review).

In [28]:
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [29]:
stopwords.words('english')

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 'her',
 'here',
 'hers',
 'herself',
 "he's",
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 'if',
 "i'll",
 "i'm",
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 "i've",
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'on

In [30]:
len(stopwords.words('english'))

198

In [31]:
# Build the stopword set once: set lookups are fast, and this avoids
# rebuilding the stopword list for every single word
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    # Keep only the words that are not stopwords
    return " ".join(word for word in text.split() if word not in stop_words)

In [32]:
remove_stopwords('probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but it\'s not preachy or boring. it just never gets old, despite my having seen it some 15 or more times')

'probably all-time favorite movie, story selflessness, sacrifice dedication noble cause, preachy boring. never gets old, despite seen 15 times'

### 1.8 Handling Emojis

Emojis are just Unicode characters. So, for example, if we want to remove them:

In [33]:
import re
def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"  # dingbats
                           u"\U000024C2-\U0001F251"  # enclosed characters; caution: this very wide range also covers non-emoji text such as CJK characters
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

In [34]:
remove_emoji("Loved the movie. It was 😘😘")

'Loved the movie. It was '

In [35]:
remove_emoji("Lmao 😂😂")

'Lmao '

What if we want to keep the emoji? We can use the `emoji` library to "extract" the emojis' meaning:

In [36]:
!pip install emoji



In [37]:
import emoji
print(emoji.demojize('Python is 🔥'))

Python is :fire:


In [38]:
print(emoji.demojize('Loved the movie. It was 😘'))

Loved the movie. It was :face_blowing_a_kiss:


### 1.9 Tokenization

Tokenization is a key step when working with LLMs: text must be split into tokens (words, subwords, or sentences) before a model can process it.

How can we do that?

#### 1.9.1 Using the `split()` function

We can use the `split()` function:

In [39]:
# word tokenization
sent1 = 'I am going to delhi'
sent1.split()

['I', 'am', 'going', 'to', 'delhi']

In [40]:
# sentence tokenization
sent2 = 'I am going to delhi. I will stay there for 3 days. Let\'s hope the trip to be great'
sent2.split('.')

['I am going to delhi',
 ' I will stay there for 3 days',
 " Let's hope the trip to be great"]

But it has some limitations:

In [41]:
# Problems with split function
sent3 = 'I am going to delhi!'
sent3.split()

['I', 'am', 'going', 'to', 'delhi!']

In [42]:
sent4 = 'Where do think I should go? I have 3 day holiday'
sent4.split('.')

['Where do think I should go? I have 3 day holiday']

#### 1.9.2 Using Regular Expressions

In [43]:
import re
sent3 = 'I am going to delhi!'
tokens = re.findall(r"[\w']+", sent3)  # raw string, so \w is not treated as an invalid string escape
tokens

['I', 'am', 'going', 'to', 'delhi']

In [44]:
text = """Lorem Ipsum is simply dummy text of the printing and typesetting industry?
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s,
when an unknown printer took a galley of type and scrambled it to make a type specimen book."""
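# this pattern only splits on punctuation followed by a space, so a '?' followed by a newline is missed
sentences = re.compile('[.!?] ').split(text)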
sentences

["Lorem Ipsum is simply dummy text of the printing and typesetting industry?\nLorem Ipsum has been the industry's standard dummy text ever since the 1500s,\nwhen an unknown printer took a galley of type and scrambled it to make a type specimen book."]
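The text comes back as a single sentence because the `?` is followed by a newline rather than a space. One possible refinement, offered as a sketch and not as a robust sentence splitter, is to split on any whitespace that follows `.`, `!` or `?`. The lookbehind keeps the punctuation attached to each sentence:

```python
import re

text = """Lorem Ipsum is simply dummy text of the printing and typesetting industry?
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s,
when an unknown printer took a galley of type and scrambled it to make a type specimen book."""

# Split on whitespace (spaces or newlines) that comes right after ., ! or ?
sentences = re.split(r"(?<=[.!?])\s+", text)

print(len(sentences))  # 2
for sentence in sentences:
    print(repr(sentence))
```

Abbreviations such as `Dr. Smith` would still be split in the wrong place, which is one reason to prefer the library tokenizers that follow.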

#### 1.9.3 Using `NLTK`

In [2]:
import nltk

nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [3]:
from nltk.tokenize import word_tokenize,sent_tokenize

sent1 = 'I am going to visit delhi!'
word_tokenize(sent1)

['I', 'am', 'going', 'to', 'visit', 'delhi', '!']

In [4]:
text = """Lorem Ipsum is simply dummy text of the printing and typesetting industry?
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s,
when an unknown printer took a galley of type and scrambled it to make a type specimen book."""

sent_tokenize(text)

['Lorem Ipsum is simply dummy text of the printing and typesetting industry?',
 "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s,\nwhen an unknown printer took a galley of type and scrambled it to make a type specimen book."]

In [5]:
sent5 = 'I have a Ph.D in A.I'
sent6 = "We're here to help! mail us at nks@gmail.com"
sent7 = 'A 5km ride cost $10.50'

word_tokenize(sent5)

['I', 'have', 'a', 'Ph.D', 'in', 'A.I']

In [6]:
word_tokenize(sent6)

['We',
 "'re",
 'here',
 'to',
 'help',
 '!',
 'mail',
 'us',
 'at',
 'nks',
 '@',
 'gmail.com']

In [8]:
word_tokenize(sent7)

['A', '5km', 'ride', 'cost', '$', '10.50']

#### 1.9.4 Using spaCy

In [9]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [10]:
doc1 = nlp(sent5)
doc2 = nlp(sent6)
doc3 = nlp(sent7)
doc4 = nlp(sent1)

In [11]:
doc4 = nlp(sent1)
doc4

I am going to visit delhi!

In [12]:
for token in doc4:
    print(token)

I
am
going
to
visit
delhi
!


### 1.10 Stemming

Stemming is a preprocessing technique in Natural Language Processing (NLP) where you reduce a word to its base or root form.

In [13]:
from nltk.stem.porter import PorterStemmer

In [14]:
ps = PorterStemmer()
def stem_words(text):
    return " ".join([ps.stem(word) for word in text.split()])

In [15]:
sample = "walk walks walking walked"
stem_words(sample)

'walk walk walk walk'

In [16]:
text = 'probably my alltime favorite movie a story of selflessness sacrifice and dedication to a noble cause but its not preachy or boring it just never gets old despite my having seen it some 15 or more times in the last 25 years paul lukas performance brings tears to my eyes and bette davis in one of her very few truly sympathetic roles is a delight the kids are as grandma says more like dressedup midgets than children but that only makes them more fun to watch and the mothers slow awakening to whats happening in the world and under her own roof is believable and startling if i had a dozen thumbs theyd all be up for this movie'
print(text)

probably my alltime favorite movie a story of selflessness sacrifice and dedication to a noble cause but its not preachy or boring it just never gets old despite my having seen it some 15 or more times in the last 25 years paul lukas performance brings tears to my eyes and bette davis in one of her very few truly sympathetic roles is a delight the kids are as grandma says more like dressedup midgets than children but that only makes them more fun to watch and the mothers slow awakening to whats happening in the world and under her own roof is believable and startling if i had a dozen thumbs theyd all be up for this movie


In [17]:
stem_words(text)

'probabl my alltim favorit movi a stori of selfless sacrific and dedic to a nobl caus but it not preachi or bore it just never get old despit my have seen it some 15 or more time in the last 25 year paul luka perform bring tear to my eye and bett davi in one of her veri few truli sympathet role is a delight the kid are as grandma say more like dressedup midget than children but that onli make them more fun to watch and the mother slow awaken to what happen in the world and under her own roof is believ and startl if i had a dozen thumb theyd all be up for thi movi'

### 1.11 Lemmatization

Lemmatization reduces words to their lemma — the dictionary form of a word.

Unlike stemming, which just chops off suffixes blindly, lemmatization uses vocabulary and morphology of words to get real dictionary words.

In [18]:
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
nltk.download('omw-1.4')
wordnet_lemmatizer = WordNetLemmatizer()

sentence = "He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun."
punctuations = "?:!.,;"

# Build a new list instead of removing items from a list while iterating over it
sentence_words = [word for word in nltk.word_tokenize(sentence) if word not in punctuations]

print("{0:20}{1:20}".format("Word", "Lemma"))
for word in sentence_words:
    # pos='v' treats every token as a verb, which is why the noun 'hours' is left unchanged
    print("{0:20}{1:20}".format(word, wordnet_lemmatizer.lemmatize(word, pos='v')))

Word                Lemma               


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


He                  He                  
was                 be                  
running             run                 
and                 and                 
eating              eat                 
at                  at                  
same                same                
time                time                
He                  He                  
has                 have                
bad                 bad                 
habit               habit               
of                  of                  
swimming            swim                
after               after               
playing             play                
long                long                
hours               hours               
in                  in                  
the                 the                 
Sun                 Sun                 

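
>**Note:** Lemmatization is generally slower than stemming, because it looks each word up in a dictionary (WordNet here) instead of applying suffix-stripping rules.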

## 2. Text Representation - Word Embeddings



Word embedding is a foundational technique in Natural Language Processing (NLP) used to represent words numerically so that machine learning models can process them. Early approaches like one-hot encoding represent words as sparse binary vectors, but they suffer from high dimensionality and no semantic understanding.

Bag of Words (BoW) improves on this by counting word occurrences but loses word order and context.

N-grams attempt to reintroduce local context by considering sequences of N words, yet they still generate sparse and large representations.

The evolution came with Word2Vec, which introduced dense, low-dimensional vectors that capture semantic relationships between words based on their context in large corpora. Words with similar meanings have vectors close to each other in the embedding space. However, Word2Vec assigns one vector per word, ignoring word meaning variation depending on context.

Modern Transformer-based models like BERT and GPT overcome this limitation by creating contextual embeddings, where the same word can have different representations based on its surrounding text. This allows Large Language Models (LLMs) to capture nuances, polysemy, and complex linguistic structures, making them much more powerful in tasks like translation, summarization, and question answering.
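To make the starting point of this progression concrete, one-hot encoding can be sketched in a few lines. The two-sentence corpus and the helper name `one_hot` are assumptions chosen for illustration:

```python
corpus = ["people watch Mario", "Mario watch Mario"]

# Vocabulary: every distinct lower-cased word, in sorted order
vocab = sorted({word.lower() for doc in corpus for word in doc.split()})
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # A vector of zeros with a single 1 at the word's vocabulary index
    vec = [0] * len(vocab)
    vec[word_to_index[word.lower()]] = 1
    return vec

print(vocab)             # ['mario', 'people', 'watch']
print(one_hot("watch"))  # [0, 0, 1]
```

Each vector is as long as the vocabulary, and any two different words are equally far apart, which is why this representation carries no notion of similarity.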

### 2.1 Bag of Words

Bag of Words (BoW) represents text as a vector of word counts. It builds a vocabulary of all unique words in a corpus and for each document counts how often each word appears. The order of words is ignored, and only frequency matters. This results in a fixed-size, sparse vector. Though simple, BoW loses the context and semantics between words.
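Before reaching for a library, the counting step can be sketched by hand. This is a minimal illustration on four toy sentences, assuming simple lower-casing and whitespace splitting; the helper name `bow_vector` is hypothetical:

```python
from collections import Counter

docs = [
    "people watch Mario",
    "Mario watch Mario",
    "people write comment",
    "Mario write comment",
]

# Lower-case and split each document into words
tokenized = [doc.lower().split() for doc in docs]

# The vocabulary is the sorted set of all distinct words in the corpus
vocab = sorted({word for tokens in tokenized for word in tokens})

def bow_vector(tokens):
    # Count each vocabulary word in the document; absent words count as 0
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

print(vocab)  # ['comment', 'mario', 'people', 'watch', 'write']
for tokens in tokenized:
    print(bow_vector(tokens))
# [0, 1, 1, 1, 0]
# [0, 2, 0, 1, 0]
# [1, 0, 1, 0, 1]
# [1, 1, 0, 0, 1]
```

Every document maps to a vector with one slot per vocabulary word, regardless of the order in which the words appeared.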


In [19]:
import numpy as np
import pandas as pd

df = pd.DataFrame({"text":["people watch Mario",
                         "Mario watch Mario",
                         "people write comment",
                          "Mario write comment"],"output":[1,1,0,1]})

df

Unnamed: 0,text,output
0,people watch Mario,1
1,Mario watch Mario,1
2,people write comment,0
3,Mario write comment,1


Bag of Words is easy to implement with `sklearn`:

In [20]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

In [21]:
bow = cv.fit_transform(df['text'])

In [22]:
bow

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 11 stored elements and shape (4, 5)>

In [23]:
#vocabulary
print(cv.vocabulary_)

{'people': 2, 'watch': 3, 'mario': 1, 'write': 4, 'comment': 0}


In [24]:
bow.toarray() # BoW representation of our dataframe

array([[0, 1, 1, 1, 0],
       [0, 2, 0, 1, 0],
       [1, 0, 1, 0, 1],
       [1, 1, 0, 0, 1]])

In [25]:
print(bow[0].toarray()) # people watch Mario: 1 occurrence each of 'people', 'watch' and 'mario' (CountVectorizer lowercases by default)
print(bow[1].toarray())
print(bow[2].toarray())

[[0 1 1 1 0]]
[[0 2 0 1 0]]
[[1 0 1 0 1]]


In [26]:
# transform a new sentence: 'Matteo' is not in the fitted vocabulary, so it is ignored
cv.transform(['Matteo watch Mario']).toarray()

array([[0, 1, 0, 1, 0]])

In [27]:
# for example, we could store X and y to pass to our AI model in the following way
X = bow.toarray()
y = df['output']

### 2.2 N-grams

N-grams are contiguous sequences of N items (typically words) from a given text. They capture local word order by grouping words into fixed-size windows, such as bigrams ($N=2$) or trigrams ($N=3$). This preserves some context compared to Bag of Words. However, N-grams can still lead to large, sparse vectors and struggle with longer-range dependencies. They are a simple way to add structure to text representation.
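The sliding-window idea can be sketched directly. The helper name `ngrams` is an assumption made for this illustration:

```python
def ngrams(tokens, n):
    # Slide a window of size n over the token list, one position at a time
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "people watch mario".split()

print(ngrams(tokens, 2))  # ['people watch', 'watch mario']
print(ngrams(tokens, 3))  # ['people watch mario']
```

A document with `k` tokens yields `k - n + 1` n-grams, so the number of distinct features grows quickly as `n` increases.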

In [28]:
# Bigrams
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(2,2))

In [29]:
bow = cv.fit_transform(df['text'])

In [30]:
print(cv.vocabulary_)

{'people watch': 2, 'watch mario': 4, 'mario watch': 0, 'people write': 3, 'write comment': 5, 'mario write': 1}


In [31]:
print(bow[0].toarray())
print(bow[1].toarray())
print(bow[2].toarray())

[[0 0 1 0 1 0]]
[[1 0 0 0 1 0]]
[[0 0 0 1 0 1]]


In [32]:
# Trigrams
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(3,3))

In [33]:
bow = cv.fit_transform(df['text'])

In [34]:
print(cv.vocabulary_)

{'people watch mario': 2, 'mario watch mario': 0, 'people write comment': 3, 'mario write comment': 1}


In [35]:
print(bow[0].toarray())
print(bow[1].toarray())
print(bow[2].toarray())

[[0 0 1 0]]
[[1 0 0 0]]
[[0 0 0 1]]


### 2.3 TF-IDF (Term Frequency - Inverse Document Frequency)

TF-IDF (Term Frequency - Inverse Document Frequency) is a technique to represent words by how important they are to a document in a corpus.

* Term Frequency (TF) counts how often a word appears in a document.

* Inverse Document Frequency (IDF) downscales words that appear in many documents (like "the", "and") since they are less informative.

* TF-IDF score = TF x IDF, highlighting rare but important words.

This helps models focus on words that better discriminate between documents instead of frequent, generic terms.

In [36]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfid= TfidfVectorizer()

In [37]:
arr = tfid.fit_transform(df['text']).toarray()

In [38]:
arr

array([[0.        , 0.49681612, 0.61366674, 0.61366674, 0.        ],
       [0.        , 0.8508161 , 0.        , 0.52546357, 0.        ],
       [0.57735027, 0.        , 0.57735027, 0.        , 0.57735027],
       [0.61366674, 0.49681612, 0.        , 0.        , 0.61366674]])

In [39]:
print(tfid.idf_)

[1.51082562 1.22314355 1.51082562 1.51082562 1.51082562]
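The numbers above can be reproduced by hand. With its default settings, `TfidfVectorizer` uses a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and then scales each document vector to unit L2 length. The sketch below assumes lower-cased, whitespace-split tokens, and the helper name `tfidf_vector` is hypothetical:

```python
import math

docs = [
    "people watch mario",
    "mario watch mario",
    "people write comment",
    "mario write comment",
]
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for tokens in tokenized for word in tokens})
n_docs = len(docs)

# Smoothed inverse document frequency for each vocabulary term
idf = {}
for term in vocab:
    doc_freq = sum(term in tokens for tokens in tokenized)
    idf[term] = math.log((1 + n_docs) / (1 + doc_freq)) + 1

def tfidf_vector(tokens):
    # Raw term count times IDF, then L2-normalize the whole vector
    raw = [tokens.count(term) * idf[term] for term in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

print([round(idf[term], 4) for term in vocab])
print([round(x, 4) for x in tfidf_vector(tokenized[0])])
```

`mario` appears in three of the four documents, so it receives the lowest IDF, matching the second entry of `tfid.idf_`.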


### 2.4 Word2Vec

Word2Vec is a neural network-based model that learns dense vector representations of words based on their context in large corpora. Unlike sparse methods like BoW or TF-IDF, Word2Vec maps words to a continuous vector space where semantically similar words are close together.

It uses two main architectures: CBOW (Continuous Bag of Words), which predicts a word from its context, and Skip-gram, which predicts context words from a target word. These embeddings capture relationships like "king" - "man" + "woman" ≈ "queen".

Word2Vec revolutionized NLP by introducing efficient, meaningful word representations.
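To see what Skip-gram is trained on, we can list the (target, context) pairs produced by a sliding window. This is only a sketch of the data preparation step, not of the neural network itself, and the helper name `skipgram_pairs` is an assumption:

```python
def skipgram_pairs(tokens, window):
    # For each target word, pair it with every word at most `window` positions away
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["the", "wildlings", "are", "dead"]
for pair in skipgram_pairs(tokens, window=1):
    print(pair)
# ('the', 'wildlings')
# ('wildlings', 'the')
# ('wildlings', 'are')
# ('are', 'wildlings')
# ('are', 'dead')
# ('dead', 'are')
```

CBOW uses the same windows in the opposite direction: the surrounding words together form the input, and the word in the middle is the prediction target.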

Let's work through an example using the Game of Thrones books dataset from Kaggle:

In [1]:
import kagglehub
from pathlib import Path

# Download latest version
path = kagglehub.dataset_download("khulasasndh/game-of-thrones-books")

# Convert path to a Path object for convenience
path = Path(path)

# List the files
print(list(path.glob("*")))

[PosixPath('/kaggle/input/game-of-thrones-books/004ssb.txt'), PosixPath('/kaggle/input/game-of-thrones-books/005ssb.txt'), PosixPath('/kaggle/input/game-of-thrones-books/001ssb.txt'), PosixPath('/kaggle/input/game-of-thrones-books/002ssb.txt'), PosixPath('/kaggle/input/game-of-thrones-books/003ssb.txt')]


In [5]:
txt_file = path / "001ssb.txt"
txt_file

PosixPath('/kaggle/input/game-of-thrones-books/001ssb.txt')

In [3]:
!pip install --upgrade gensim --user   # for word2vec



In [11]:
# we're going to tokenize the book

from nltk import sent_tokenize
from gensim.utils import simple_preprocess
import nltk
import gensim
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [7]:
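story = []

# Read the whole book; the `with` block closes the file automatically
with open(txt_file) as f:
    corpus = f.read()

raw_sent = sent_tokenize(corpus)  # sentence tokenization
print(raw_sent[:10])

# simple_preprocess lowercases each sentence, splits it into word tokens and
# drops very short or very long tokens (which is why 'a' is missing below)
# https://tedboy.github.io/nlps/generated/generated/gensim.utils.simple_preprocess.html#gensim-utils-simple-preprocess
for sent in raw_sent:
    story.append(simple_preprocess(sent))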

['A Game Of Thrones \nBook One of A Song of Ice and Fire \nBy George R. R. Martin \nPROLOGUE \n"We should start back," Gared urged as the woods began to grow dark around them.', '"The wildlings are \ndead."', '"Do the dead frighten you?"', 'Ser Waymar Royce asked with just the hint of a smile.', 'Gared did not rise to the bait.', 'He was an old man, past fifty, and he had seen the lordlings come and go.', '"Dead is dead," he said.', '"We have no business with the dead."', '"Are they dead?"', 'Royce asked softly.']


In [8]:
story[:50]

[['game',
  'of',
  'thrones',
  'book',
  'one',
  'of',
  'song',
  'of',
  'ice',
  'and',
  'fire',
  'by',
  'george',
  'martin',
  'prologue',
  'we',
  'should',
  'start',
  'back',
  'gared',
  'urged',
  'as',
  'the',
  'woods',
  'began',
  'to',
  'grow',
  'dark',
  'around',
  'them'],
 ['the', 'wildlings', 'are', 'dead'],
 ['do', 'the', 'dead', 'frighten', 'you'],
 ['ser',
  'waymar',
  'royce',
  'asked',
  'with',
  'just',
  'the',
  'hint',
  'of',
  'smile'],
 ['gared', 'did', 'not', 'rise', 'to', 'the', 'bait'],
 ['he',
  'was',
  'an',
  'old',
  'man',
  'past',
  'fifty',
  'and',
  'he',
  'had',
  'seen',
  'the',
  'lordlings',
  'come',
  'and',
  'go'],
 ['dead', 'is', 'dead', 'he', 'said'],
 ['we', 'have', 'no', 'business', 'with', 'the', 'dead'],
 ['are', 'they', 'dead'],
 ['royce', 'asked', 'softly'],
 ['what', 'proof', 'have', 'we'],
 ['will', 'saw', 'them', 'gared', 'said'],
 ['if',
  'he',
  'says',
  'they',
  'are',
  'dead',
  'that',
  'proof',


In [9]:
len(story)

27244

In [13]:
# Initialize the Word2Vec model

model = gensim.models.Word2Vec(
    window=10,  # context window
    min_count=2 # words that appear fewer than min_count times will be dropped from the vocabulary and ignored during training
)

In [14]:
# build the vocabulary from our tokenized sentences

model.build_vocab(story)

In [15]:
model.train(story, total_examples=model.corpus_count, epochs=model.epochs)

(1058976, 1423500)

In [16]:
# what is the most similar word in our embedding to Daenerys?

model.wv.most_similar('daenerys')

[('animal', 0.9971373081207275),
 ('needles', 0.9969444274902344),
 ('terrible', 0.996902585029602),
 ('riding', 0.9960564970970154),
 ('noye', 0.9960087537765503),
 ('milk', 0.9959766268730164),
 ('plump', 0.9959636926651001),
 ('eyed', 0.9959321618080139),
 ('tables', 0.995866596698761),
 ('frozen', 0.9958657622337341)]

In [23]:
model.wv.most_similar('prince')

[('tommen', 0.990727961063385),
 ('jory', 0.9907121062278748),
 ('quietly', 0.9905585646629333),
 ('summoned', 0.9903908371925354),
 ('poole', 0.9897469282150269),
 ('dress', 0.9895995855331421),
 ('used', 0.989297091960907),
 ('uncertainly', 0.9891262054443359),
 ('final', 0.9890204668045044),
 ('yoren', 0.9889761805534363)]

In [17]:
# How similar are Arya and Sansa?

model.wv.similarity('arya', 'sansa')

0.986129
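The similarity scores returned by `model.wv.similarity` and `most_similar` are cosine similarities between word vectors. A minimal sketch of that computation on made-up two-dimensional vectors (the function name `cosine_similarity` and the example vectors are assumptions):

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the two vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

When nearly every pair of words scores close to 1, as in the outputs above, the vectors all point in almost the same direction. That may indicate that a single book and the default number of training epochs were not enough to separate them well.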

In [24]:
# get all vectors
vec = model.wv.get_normed_vectors()

In [25]:
vec

array([[-0.08774093,  0.07370759,  0.03242683, ..., -0.13685684,
        -0.00982286,  0.13431473],
       [-0.06997778,  0.08711688, -0.00980961, ..., -0.16992667,
         0.07999758,  0.03873117],
       [-0.13372657,  0.06356536,  0.07830802, ..., -0.0664702 ,
         0.08242119, -0.15897521],
       ...,
       [ 0.0104038 , -0.0020393 ,  0.03665128, ..., -0.17739414,
         0.08987062, -0.03940014],
       [-0.01686865,  0.0656902 ,  0.02911037, ..., -0.14694607,
         0.02747683, -0.01544604],
       [ 0.02031591,  0.14081398,  0.00206513, ..., -0.16663148,
         0.06609166, -0.01143338]], dtype=float32)

In [26]:
# let's visualize our vectors

from sklearn.decomposition import PCA

In [28]:
pca = PCA(n_components=3) # reduces the dimension size to 3
X = pca.fit_transform(model.wv.get_normed_vectors())

In [29]:
X

array([[-0.36606967, -0.20003307, -0.3087351 ],
       [-0.441606  , -0.07403529, -0.25082672],
       [ 0.4274943 , -0.12707937,  0.07356171],
       ...,
       [-0.01386701, -0.12182158,  0.3731503 ],
       [-0.17011   ,  0.07849652,  0.00550981],
       [-0.42613995, -0.08157599, -0.05143297]], dtype=float32)

In [30]:
X.shape

(7432, 3)

In [55]:
# visualize:
import plotly.express as px
import pandas as pd

words = list(model.wv.index_to_key)
df = pd.DataFrame(X, columns=['x', 'y', 'z'])
df['word'] = words

# Plot
fig = px.scatter_3d(df[500:650], x='x', y='y', z='z',
                    hover_name='word',
                    color='word')  # Careful: too many colors if many words
fig.show()

We won't actually use Word2Vec going forward because, as noted earlier, modern models have surpassed it with Transformer-based contextual embeddings.

Still, it is very useful to understand how Word2Vec works, because its core idea of learning a word's vector from the contexts it appears in underlies later embedding methods.

## 3. Text Classification Using ML

We will now classify text data with simple ML models, using some of the text representations we saw earlier and comparing the results.

### 3.1 Get Data

In [56]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
import string
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics import confusion_matrix
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 255)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [57]:
import kagglehub
from pathlib import Path
import pandas as pd

# Download latest version
path = kagglehub.dataset_download("lakshmi25npathi/imdb-dataset-of-50k-movie-reviews")

# Convert path to a Path object for convenience
path = Path(path)

# List the files
print(list(path.glob("*")))

# Get 'IMDB Dataset.csv'
csv_file = path / "IMDB Dataset.csv"

# Load into pandas
df = pd.read_csv(csv_file)

[PosixPath('/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv')]


In [59]:
df.head()

Unnamed: 0,review,sentiment
0,"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of v...",positive
1,"A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen-...",positive
2,"I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue is witty and the characters are likable (even the well b...",positive
3,"Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.<br...",negative
4,"Petter Mattei's ""Love in the Time of Money"" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what money, power and success do to people in the different situ...",positive


In [60]:
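# let's consider only the first 10000 examples
# (.copy() gives an independent DataFrame, avoiding pandas' SettingWithCopyWarning later)
df = df.iloc[:10000].copy()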

Let's check if our data has any problems:

In [61]:
df['sentiment'].value_counts()

sentiment,count
positive,5028
negative,4972


In [62]:
# does it have missing values?
df.isnull().sum()

Unnamed: 0,0
review,0
sentiment,0


In [64]:
# does it have duplicates?
df.duplicated().sum()

17

In [66]:
# drop duplicates
df.drop_duplicates(inplace=True)

In [67]:
df.duplicated().sum()

0

### 3.2 Basic Preprocessing

We will now:
* Remove HTML tags
* Convert everything to lowercase
* Remove stopwords

In [68]:
import re

TAG_PATTERN = re.compile(r'<.*?>')  # non-greedy: matches one tag at a time

def remove_tags(raw_text):
    """Strip HTML tags such as <br /> from a review."""
    return TAG_PATTERN.sub('', raw_text)

In [69]:
df['review'] = df['review'].apply(remove_tags)

In [70]:
df.head()

Unnamed: 0,review,sentiment
0,"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.The first thing that struck me about Oz was its brutality and unflinching scenes of violence, whi...",positive
1,"A wonderful little production. The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. The actors are extremely well chosen- Michael Sheen not only ...",positive
2,"I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue is witty and the characters are likable (even the well b...",positive
3,"Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.OK, first of al...",negative
4,"Petter Mattei's ""Love in the Time of Money"" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what money, power and success do to people in the different situ...",positive
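The table above shows a side effect of replacing tags with an empty string: sentences separated only by `<br />` tags get glued together (for example `time.This`), so a whitespace split treats them as a single token. One possible variation, sketched here with the hypothetical names `TAG_RE`, `glued` and `spaced`, replaces each tag with a space and then collapses repeated whitespace:

```python
import re

TAG_RE = re.compile(r"<.*?>")  # non-greedy: matches one tag at a time

sample = "fighting all the time.<br /><br />This movie is slower"

# Behaviour of remove_tags: tags replaced by nothing
glued = TAG_RE.sub("", sample)

# Variation: replace tags with a space, then collapse runs of whitespace
spaced = re.sub(r"\s+", " ", TAG_RE.sub(" ", sample)).strip()

print(glued)
print(spaced)
```

Whether this matters depends on the tokenizer used afterwards. `CountVectorizer`'s default token pattern splits on punctuation anyway, while a plain `str.split()` does not.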


In [71]:
df['review'] = df['review'].apply(lambda x:x.lower())

In [72]:
df.head()

Unnamed: 0,review,sentiment
0,"one of the other reviewers has mentioned that after watching just 1 oz episode you'll be hooked. they are right, as this is exactly what happened with me.the first thing that struck me about oz was its brutality and unflinching scenes of violence, whi...",positive
1,"a wonderful little production. the filming technique is very unassuming- very old-time-bbc fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. the actors are extremely well chosen- michael sheen not only ...",positive
2,"i thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. the plot is simplistic, but the dialogue is witty and the characters are likable (even the well b...",positive
3,"basically there's a family where a little boy (jake) thinks there's a zombie in his closet & his parents are fighting all the time.this movie is slower than a soap opera... and suddenly, jake decides to become rambo and kill the zombie.ok, first of al...",negative
4,"petter mattei's ""love in the time of money"" is a visually stunning film to watch. mr. mattei offers us a vivid portrait about human relations. this is a movie that seems to be telling us what money, power and success do to people in the different situ...",positive


In [79]:
from nltk.corpus import stopwords
import nltk

nltk.download('stopwords')
sw_list = stopwords.words('english')

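sw_set = set(sw_list)  # set lookups are much faster than list lookups

# Keep only the whitespace-separated tokens that are not stopwords
df['review'] = df['review'].apply(
    lambda x: " ".join(word for word in x.split() if word not in sw_set)
)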

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
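Two details of this whitespace-based filtering are worth seeing on a small example. The sketch below uses a tiny hand-made stopword set (`toy_stopwords`, a hypothetical stand-in, since NLTK's English list is much longer) and a helper named `remove_stopwords` that is not part of the notebook:

```python
# Hypothetical mini stopword list; NLTK's English list is much longer
toy_stopwords = {"this", "is", "not", "a", "the", "i", "did"}

def remove_stopwords(text, stop_words):
    """Drop whitespace-separated tokens that appear in stop_words."""
    return " ".join(tok for tok in text.split() if tok not in stop_words)

r1 = remove_stopwords("this is not a good movie", toy_stopwords)
r2 = remove_stopwords("the movie, i did like.", toy_stopwords)
print(r1)
print(r2)
```

First, NLTK's English list includes negations such as "not", so "not a good movie" becomes "good movie", which may hurt sentiment classification. Second, tokens with attached punctuation (`movie,`, `like.`) are compared as-is, so a stopword followed by a comma or period is not removed.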


Now we separate `X` (review) and `y` (sentiment) data:

In [82]:
X = df.iloc[:,0:1]
y = df['sentiment']

In [81]:
X.head()

Unnamed: 0,review
0,"one reviewers mentioned watching 1 oz episode hooked. right, exactly happened me.the first thing struck oz brutality unflinching scenes violence, set right word go. trust me, show faint hearted timid. show pulls punches regards drugs, sex violence. ha..."
1,"wonderful little production. filming technique unassuming- old-time-bbc fashion gives comforting, sometimes discomforting, sense realism entire piece. actors extremely well chosen- michael sheen ""has got polari"" voices pat too! truly see seamless edit..."
2,"thought wonderful way spend time hot summer weekend, sitting air conditioned theater watching light-hearted comedy. plot simplistic, dialogue witty characters likable (even well bread suspected serial killer). may disappointed realize match point 2: r..."
3,"basically there's family little boy (jake) thinks there's zombie closet & parents fighting time.this movie slower soap opera... suddenly, jake decides become rambo kill zombie.ok, first going make film must decide thriller drama! drama movie watchable..."
4,"petter mattei's ""love time money"" visually stunning film watch. mr. mattei offers us vivid portrait human relations. movie seems telling us money, power success people different situations encounter. variation arthur schnitzler's play theme, director ..."


In [84]:
y[:100]

Unnamed: 0,sentiment
0,positive
1,positive
2,positive
3,negative
4,positive
5,positive
6,positive
7,negative
8,negative
9,positive


In [85]:
# let's encode the labels with sklearn's LabelEncoder (negative -> 0, positive -> 1)

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

y = encoder.fit_transform(y)

y[:100]

array([1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0,
       1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,
       1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1])
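The mapping to 0 and 1 is not arbitrary: `LabelEncoder` sorts the class names and numbers them in that order, which is why `negative` becomes 0 and `positive` becomes 1. A small sketch on toy labels (the variable names are illustrative):

```python
from sklearn.preprocessing import LabelEncoder

labels = ["positive", "negative", "negative", "positive"]  # toy labels

enc = LabelEncoder()
encoded = enc.fit_transform(labels)

print(enc.classes_)                   # class names in sorted order
print(encoded)                        # integer code for each label
print(enc.inverse_transform([0, 1]))  # map codes back to names
```

`inverse_transform` is handy later for turning model predictions back into readable labels.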

In [86]:
# let's split into train (80%) and test(20%)...

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=1)

In [87]:
X_train.shape

(7986, 1)

In [88]:
X_test.shape

(1997, 1)
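Our classes are almost balanced, so a plain random split works fine here. With skewed labels, `train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in both splits. A minimal sketch on toy data (the arrays below are assumptions for illustration only):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 30% positive
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 30 + [0] * 70)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1, stratify=y_toy
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())  # share of positives in each split
```

Both splits keep the 30% share of positive examples.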

In [89]:
# Applying BoW
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

In [90]:
X_train_bow = cv.fit_transform(X_train['review']).toarray()
X_test_bow = cv.transform(X_test['review']).toarray()

> **Note:**
When using Bag of Words (BoW), always call `fit_transform` on the training data and `transform` on the test data. `fit_transform` builds the vocabulary from the training set and encodes the training documents with it. `transform` reuses that learned vocabulary to encode the test set, without changing it. This prevents data leakage: the model only learns from the training data, and any word that appears only in the test data is ignored. Calling `fit_transform` on both sets would build two different vocabularies, so the columns of the train and test matrices would no longer refer to the same words and the model's predictions would be unreliable.
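To make the difference concrete, here is a small sketch on a toy corpus. The documents and the names `cv_demo`, `vocab` and `refit_vocab` are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["good movie", "bad movie"]
test_docs = ["good film"]   # "film" never appears in training

cv_demo = CountVectorizer()
train_bow = cv_demo.fit_transform(train_docs)   # learn vocabulary + encode
test_bow = cv_demo.transform(test_docs)         # encode with the same vocabulary

# Vocabulary terms ordered by their column index
vocab = sorted(cv_demo.vocabulary_, key=cv_demo.vocabulary_.get)
print(vocab)
print(test_bow.toarray())

# Refitting on the test set builds a different, incompatible vocabulary
refit = CountVectorizer().fit(test_docs)
refit_vocab = sorted(refit.vocabulary_, key=refit.vocabulary_.get)
print(refit_vocab)
```

The unseen word "film" is dropped from the test encoding, and the refitted vocabulary has different columns, so a model trained on the first matrix could not use it.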

In [91]:
X_train_bow

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
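The output is almost entirely zeros because each review uses only a tiny part of the vocabulary. `CountVectorizer` returns a SciPy sparse matrix that stores only the non-zero entries, and `.toarray()` expands it into a dense array that stores every zero as well. A sketch on a toy corpus (the documents are hypothetical) compares the two:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["good movie", "bad movie", "great acting great plot"]

sparse_bow = CountVectorizer().fit_transform(docs)  # SciPy sparse matrix
dense_bow = sparse_bow.toarray()                    # dense NumPy array

n_cells = dense_bow.shape[0] * dense_bow.shape[1]
print(dense_bow.shape)            # (documents, vocabulary size)
print(sparse_bow.nnz, n_cells)    # stored non-zeros vs. total cells
```

On the real data the gap is far larger, so the dense array can use a lot of memory. Some estimators, such as the `GaussianNB` used later in this section, need dense input, while others (for example `RandomForestClassifier`) accept the sparse matrix directly.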

Let's first try to classify through `GaussianNB`.

`GaussianNB` is a Naive Bayes classifier from scikit-learn that assumes each feature follows a **Gaussian (Normal) distribution** within each class. It is designed for classification tasks with continuous features. The `fit` method trains the model by estimating the mean and variance of every feature for each class. It is fast to train, but BoW features are sparse, discrete word counts, so the Gaussian assumption fits them poorly; `MultinomialNB` is the Naive Bayes variant usually recommended for word counts. We still start with `GaussianNB` as a simple first attempt.
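To see what `fit` estimates, the sketch below trains `GaussianNB` on a tiny toy dataset (made-up numbers, not our reviews) and compares the learned per-class means, stored in `theta_`, with means computed by hand:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy continuous features: two classes, two features
X_toy = np.array([[1.0, 2.0], [2.0, 4.0], [9.0, 8.0], [11.0, 10.0]])
y_toy = np.array([0, 0, 1, 1])

gnb_toy = GaussianNB().fit(X_toy, y_toy)

# theta_ holds the per-class mean of each feature
print(gnb_toy.theta_)
print(X_toy[y_toy == 0].mean(axis=0), X_toy[y_toy == 1].mean(axis=0))

# New points are assigned to the class whose Gaussians explain them best
preds = gnb_toy.predict([[1.5, 3.0], [10.0, 9.0]])
print(preds)
```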


In [93]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()

gnb.fit(X_train_bow, y_train)

In [94]:
# is it performing well?

y_pred = gnb.predict(X_test_bow)

from sklearn.metrics import accuracy_score,confusion_matrix
accuracy_score(y_test,y_pred)

0.6324486730095142
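Accuracy alone does not say which kind of mistake the model makes. The `confusion_matrix` function imported above breaks the predictions down by true and predicted class. A sketch with hypothetical labels and predictions (these are not the notebook's results):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels and predictions, for illustration only
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 1, 0, 0, 1, 1]

acc = accuracy_score(y_true, y_hat)
cm = confusion_matrix(y_true, y_hat)  # rows: true class, columns: predicted class
print(acc)
print(cm)
```

In the notebook, `confusion_matrix(y_test, y_pred)` would show whether the errors are mostly negatives predicted as positive or the other way around.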

Now let's try with a Random Forest instead:

In [95]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()

rf.fit(X_train_bow,y_train)

y_pred = rf.predict(X_test_bow)

accuracy_score(y_test,y_pred)

0.8512769153730596
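A Random Forest is trained with randomness (bootstrap samples and random feature subsets), so `RandomForestClassifier()` without a seed can give a slightly different accuracy on every run. Setting `random_state` makes the result repeatable. A small sketch on toy count-like data (the data and the parameter values are assumptions chosen for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_toy = rng.integers(0, 3, size=(60, 8))   # toy count-like features
y_toy = (X_toy[:, 0] > 0).astype(int)      # label depends on feature 0

# Two forests built with the same seed
rf_a = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_toy, y_toy)
rf_b = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_toy, y_toy)

same = np.array_equal(rf_a.predict_proba(X_toy), rf_b.predict_proba(X_toy))
print(same)
```

With the same seed and the same data, both forests produce identical predictions.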

Way better!

Can we tweak our BoW settings to improve on these results?