<a href="https://colab.research.google.com/github/MatteoFalcioni/GenAI/blob/main/02_PT2_data_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 2. Text Representation - Word Embeddings



### 2.4 Word2Vec

Word2Vec is a neural network-based model that learns dense vector representations of words based on their context in large corpora. Unlike sparse methods like BoW or TF-IDF, Word2Vec maps words to a continuous vector space where semantically similar words are close together.

It uses two main architectures: CBOW (Continuous Bag of Words), which predicts a word from its context, and Skip-gram, which predicts context words from a target word. These embeddings capture relationships like "king" - "man" + "woman" ≈ "queen".
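To make the difference between the two architectures concrete, here is a minimal sketch of how training examples could be derived from one tokenized sentence. The helper `context_pairs` and the fixed `window=2` are illustrative assumptions; the real Word2Vec training loop adds further tricks (such as sampling the window size and negative sampling) that are not shown here.

```python
def context_pairs(tokens, window=2):
    """Return (target, context_words) for every position in the sentence."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]       # up to `window` words before
        right = tokens[i + 1:i + 1 + window]      # up to `window` words after
        pairs.append((target, left + right))
    return pairs

sentence = ["gared", "did", "not", "rise", "to", "the", "bait"]
pairs = context_pairs(sentence, window=2)

# CBOW: the context words are the input, the target word is the label
cbow_examples = [(context, target) for target, context in pairs]

# Skip-gram: the target word is the input, each context word is a separate label
skipgram_examples = [(target, c) for target, context in pairs for c in context]

print(cbow_examples[3])       # (['did', 'not', 'to', 'the'], 'rise')
print(skipgram_examples[:2])  # [('gared', 'did'), ('gared', 'not')]
```

CBOW produces one example per position, while Skip-gram produces one example per (target, context word) pair, which is why Skip-gram generates more training signal for rare words.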

Word2Vec revolutionized NLP by introducing efficient, meaningful word representations.

Let's work through an example using the Game of Thrones books dataset from Kaggle:

In [2]:
import kagglehub
from pathlib import Path

# Download latest version
path = kagglehub.dataset_download("khulasasndh/game-of-thrones-books")

# Convert path to a Path object for convenience
path = Path(path)

# List the files
print(list(path.glob("*")))

[PosixPath('/kaggle/input/game-of-thrones-books/004ssb.txt'), PosixPath('/kaggle/input/game-of-thrones-books/005ssb.txt'), PosixPath('/kaggle/input/game-of-thrones-books/001ssb.txt'), PosixPath('/kaggle/input/game-of-thrones-books/002ssb.txt'), PosixPath('/kaggle/input/game-of-thrones-books/003ssb.txt')]


In [4]:
txt_file = path / "001ssb.txt"
txt_file

PosixPath('/kaggle/input/game-of-thrones-books/001ssb.txt')

In [3]:
!pip install --upgrade gensim --user   # for word2vec

Collecting scipy<1.14.0,>=1.7.0 (from gensim)
  Using cached scipy-1.13.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Using cached scipy-1.13.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (38.6 MB)
Installing collected packages: scipy
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tsfresh 0.21.0 requires scipy>=1.14.0; python_version >= "3.10", but you have scipy 1.13.1 which is incompatible.
Successfully installed scipy-1.13.1


In [8]:
# we're going to tokenize the book

from gensim.utils import simple_preprocess
import nltk
import gensim
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [9]:
from nltk import sent_tokenize


story = []

# use a context manager so the file is closed after reading
with open(txt_file) as f:
    corpus = f.read()

raw_sent = sent_tokenize(corpus)  # sentence tokenization
print(raw_sent[:10])

for sent in raw_sent:
    story.append(simple_preprocess(sent)) # apply simple preprocess to sentence tokens https://tedboy.github.io/nlps/generated/generated/gensim.utils.simple_preprocess.html#gensim-utils-simple-preprocess

['A Game Of Thrones \nBook One of A Song of Ice and Fire \nBy George R. R. Martin \nPROLOGUE \n"We should start back," Gared urged as the woods began to grow dark around them.', '"The wildlings are \ndead."', '"Do the dead frighten you?"', 'Ser Waymar Royce asked with just the hint of a smile.', 'Gared did not rise to the bait.', 'He was an old man, past fifty, and he had seen the lordlings come and go.', '"Dead is dead," he said.', '"We have no business with the dead."', '"Are they dead?"', 'Royce asked softly.']


In [10]:
story[:50]

[['game',
  'of',
  'thrones',
  'book',
  'one',
  'of',
  'song',
  'of',
  'ice',
  'and',
  'fire',
  'by',
  'george',
  'martin',
  'prologue',
  'we',
  'should',
  'start',
  'back',
  'gared',
  'urged',
  'as',
  'the',
  'woods',
  'began',
  'to',
  'grow',
  'dark',
  'around',
  'them'],
 ['the', 'wildlings', 'are', 'dead'],
 ['do', 'the', 'dead', 'frighten', 'you'],
 ['ser',
  'waymar',
  'royce',
  'asked',
  'with',
  'just',
  'the',
  'hint',
  'of',
  'smile'],
 ['gared', 'did', 'not', 'rise', 'to', 'the', 'bait'],
 ['he',
  'was',
  'an',
  'old',
  'man',
  'past',
  'fifty',
  'and',
  'he',
  'had',
  'seen',
  'the',
  'lordlings',
  'come',
  'and',
  'go'],
 ['dead', 'is', 'dead', 'he', 'said'],
 ['we', 'have', 'no', 'business', 'with', 'the', 'dead'],
 ['are', 'they', 'dead'],
 ['royce', 'asked', 'softly'],
 ['what', 'proof', 'have', 'we'],
 ['will', 'saw', 'them', 'gared', 'said'],
 ['if',
  'he',
  'says',
  'they',
  'are',
  'dead',
  'that',
  'proof',


In [11]:
len(story)

27244

In [12]:
# Initialize the Word2Vec model

model = gensim.models.Word2Vec(
    window=10,  # context window
    min_count=2 # words that appear fewer than min_count times will be dropped from the vocabulary and ignored during training
)
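As a rough illustration of what `min_count` does, the sketch below counts word frequencies over a few tokenized sentences and keeps only the words that reach the threshold. The helper `filter_vocab` is hypothetical and only mimics the filtering idea; it is not how gensim implements it internally.

```python
from collections import Counter

def filter_vocab(sentences, min_count=2):
    """Keep only the words that occur at least `min_count` times overall."""
    counts = Counter(word for sent in sentences for word in sent)
    return {word for word, n in counts.items() if n >= min_count}

sentences = [
    ["the", "wildlings", "are", "dead"],
    ["do", "the", "dead", "frighten", "you"],
    ["are", "they", "dead"],
]

vocab = filter_vocab(sentences, min_count=2)
print(sorted(vocab))  # ['are', 'dead', 'the']
```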

In [13]:
# build the vocabulary from the tokenized sentences (training happens in the next step)

model.build_vocab(story)

In [14]:
model.train(story, total_examples=model.corpus_count, epochs=model.epochs)

(1059274, 1423500)

In [15]:
# what is the most similar word in our embedding to Daenerys?

model.wv.most_similar('daenerys')

[('halder', 0.9964900612831116),
 ('qotho', 0.9961335062980652),
 ('needles', 0.9957756400108337),
 ('wine', 0.995547890663147),
 ('rider', 0.9953729510307312),
 ('jyck', 0.9952894449234009),
 ('mycah', 0.9949954748153687),
 ('septon', 0.9949194192886353),
 ('toad', 0.994870662689209),
 ('eunuch', 0.9948117136955261)]

In [16]:
model.wv.most_similar('prince')

[('hit', 0.9927298426628113),
 ('dwarf', 0.9914879202842712),
 ('grief', 0.9908930659294128),
 ('vayon', 0.9906121492385864),
 ('bride', 0.9902135729789734),
 ('weary', 0.9898699522018433),
 ('jory', 0.9897211194038391),
 ('dress', 0.9890674352645874),
 ('urged', 0.9890033006668091),
 ('nymeria', 0.9888286590576172)]

In [17]:
# How similar are Arya and Sansa?

model.wv.similarity('arya', 'sansa')

0.98409224
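The score returned by `similarity` is the cosine similarity between the two word vectors. A minimal sketch of that computation follows; the three-dimensional vectors are made up for illustration and are not taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (hypothetical values)
arya = [1.0, 2.0, 0.5]
sansa = [2.0, 4.0, 1.0]    # same direction as `arya`, twice the length
other = [-2.0, 1.0, 0.0]   # orthogonal to `arya`

sim_same_direction = cosine_similarity(arya, sansa)
sim_orthogonal = cosine_similarity(arya, other)
print(sim_same_direction, sim_orthogonal)
```

Because only the angle matters, vectors pointing in the same direction score close to 1 regardless of their length, and orthogonal vectors score 0.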

In [18]:
# get all vectors
vec = model.wv.get_normed_vectors()

In [19]:
vec

array([[-0.09421236,  0.03557105,  0.02413206, ..., -0.12434553,
        -0.030233  ,  0.11922417],
       [-0.06206828,  0.07936141, -0.00230469, ..., -0.17681174,
         0.03534278, -0.01524555],
       [-0.1886489 ,  0.02586647,  0.0334567 , ..., -0.08167767,
         0.07794011, -0.12723646],
       ...,
       [-0.03891177, -0.05499551,  0.02513578, ..., -0.21452597,
         0.01065187, -0.04674507],
       [-0.03270711,  0.06273347, -0.00369681, ..., -0.17462158,
        -0.00834404, -0.03143879],
       [-0.01699405,  0.10845236, -0.0230192 , ..., -0.18052326,
         0.04669645, -0.08183569]], dtype=float32)

In [20]:
# let's visualize our vectors

from sklearn.decomposition import PCA

In [21]:
pca = PCA(n_components=3) # reduces the dimension size to 3
X = pca.fit_transform(model.wv.get_normed_vectors())

In [22]:
X

array([[-0.3879074 ,  0.10762352,  0.34413856],
       [-0.4298803 ,  0.00398082,  0.2327003 ],
       [ 0.42144346,  0.16180176,  0.02893546],
       ...,
       [ 0.03301668,  0.18026906, -0.2665986 ],
       [-0.24509203, -0.06821531,  0.01601654],
       [-0.3246151 ,  0.03585994,  0.02492589]], dtype=float32)

In [23]:
X.shape

(7432, 3)

In [24]:
# visualize:
import plotly.express as px
import pandas as pd

words = list(model.wv.index_to_key)
df = pd.DataFrame(X, columns=['x', 'y', 'z'])
df['word'] = words

# Plot
fig = px.scatter_3d(df[500:650], x='x', y='y', z='z',
                    hover_name='word',
                    color='word')  # Careful: too many colors if many words
fig.show()

We won't actually be using Word2Vec because, as we said before, Transformer-based models now produce embeddings that outperform it.

Still, it is very useful to understand how Word2Vec works, because learning dense vectors from the contexts in which words appear is the core idea behind word embeddings.

## 3. Text Classification Using ML

We will now classify text data using a simple ML model, using some of the embeddings we saw earlier on and comparing our results.

### 3.1 Get Data

In [25]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
import string
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics import confusion_matrix
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 255)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [26]:
import kagglehub
from pathlib import Path
import pandas as pd

# Download latest version
path = kagglehub.dataset_download("lakshmi25npathi/imdb-dataset-of-50k-movie-reviews")

# Convert path to a Path object for convenience
path = Path(path)

# List the files
print(list(path.glob("*")))

# Get 'IMDB Dataset.csv'
csv_file = path / "IMDB Dataset.csv"

# Load into pandas
df = pd.read_csv(csv_file)

[PosixPath('/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv')]


In [27]:
df.head()

,review,sentiment
0,"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of v...",positive
1,"A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen-...",positive
2,"I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue is witty and the characters are likable (even the well b...",positive
3,"Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.<br...",negative
4,"Petter Mattei's ""Love in the Time of Money"" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what money, power and success do to people in the different situ...",positive


In [28]:
# let's consider only 10000 examples
df = df.iloc[:10000].copy()  # copy so that later in-place edits do not act on a slice of the original

Let's check if our data has any problems:

In [29]:
df['sentiment'].value_counts()

sentiment,count
positive,5028
negative,4972


In [30]:
# does it have missing values?
df.isnull().sum()

,0
review,0
sentiment,0


In [31]:
# does it have duplicates?
df.duplicated().sum()

17

In [32]:
# drop duplicates
df.drop_duplicates(inplace=True)

In [33]:
df.duplicated().sum()

0

### 3.2 Basic Preprocessing

We will now:
* Remove HTML tags
* Convert everything to lowercase
* Remove stopwords
In [34]:
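import re

def remove_tags(raw_text):
    # Non-greedy match strips each <...> tag; tags are replaced with nothing,
    # so text on either side of a <br /> ends up joined (e.g. "me.The")
    return re.sub(r'<.*?>', '', raw_text)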

In [35]:
df['review'] = df['review'].apply(remove_tags)

In [36]:
df.head()

,review,sentiment
0,"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.The first thing that struck me about Oz was its brutality and unflinching scenes of violence, whi...",positive
1,"A wonderful little production. The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. The actors are extremely well chosen- Michael Sheen not only ...",positive
2,"I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue is witty and the characters are likable (even the well b...",positive
3,"Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.OK, first of al...",negative
4,"Petter Mattei's ""Love in the Time of Money"" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what money, power and success do to people in the different situ...",positive


In [37]:
df['review'] = df['review'].apply(lambda x:x.lower())

In [38]:
df.head()

,review,sentiment
0,"one of the other reviewers has mentioned that after watching just 1 oz episode you'll be hooked. they are right, as this is exactly what happened with me.the first thing that struck me about oz was its brutality and unflinching scenes of violence, whi...",positive
1,"a wonderful little production. the filming technique is very unassuming- very old-time-bbc fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. the actors are extremely well chosen- michael sheen not only ...",positive
2,"i thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. the plot is simplistic, but the dialogue is witty and the characters are likable (even the well b...",positive
3,"basically there's a family where a little boy (jake) thinks there's a zombie in his closet & his parents are fighting all the time.this movie is slower than a soap opera... and suddenly, jake decides to become rambo and kill the zombie.ok, first of al...",negative
4,"petter mattei's ""love in the time of money"" is a visually stunning film to watch. mr. mattei offers us a vivid portrait about human relations. this is a movie that seems to be telling us what money, power and success do to people in the different situ...",positive


In [39]:
from nltk.corpus import stopwords
import nltk

nltk.download('stopwords')
sw_list = stopwords.words('english')
sw_set = set(sw_list)  # set membership is much faster than scanning a list

df['review'] = df['review'].apply(lambda x: " ".join(item for item in x.split() if item not in sw_set))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
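The three preprocessing steps can also be folded into a single function, which is handy when the same cleaning has to be applied to new text later. The sketch below is illustrative: `preprocess` is a hypothetical helper and `STOPWORDS` is a tiny hand-picked set standing in for NLTK's English list.

```python
import re

# Tiny hand-picked stopword set, for illustration only (assumption);
# the notebook itself uses NLTK's English stopword list.
STOPWORDS = {"a", "the", "is", "this", "and", "of"}

def preprocess(text, stopwords=STOPWORDS):
    text = re.sub(r"<.*?>", "", text)   # 1. strip HTML tags
    text = text.lower()                 # 2. lowercase
    tokens = [t for t in text.split() if t not in stopwords]  # 3. drop stopwords
    return " ".join(tokens)

cleaned = preprocess("This is a GREAT movie.<br /><br />The plot is simple")
print(cleaned)  # great movie.the plot simple
```

Note the glued token `movie.the`: removing a tag without leaving a space joins the surrounding words, which also hides the stopword `the` from the filter. Replacing tags with a single space instead would avoid this, at the cost of producing output that differs from the cells above.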


Now we separate `X` (review) and `y` (sentiment) data:

In [40]:
X = df.iloc[:,0:1]
y = df['sentiment']

In [41]:
X.head()

,review
0,"one reviewers mentioned watching 1 oz episode hooked. right, exactly happened me.the first thing struck oz brutality unflinching scenes violence, set right word go. trust me, show faint hearted timid. show pulls punches regards drugs, sex violence. ha..."
1,"wonderful little production. filming technique unassuming- old-time-bbc fashion gives comforting, sometimes discomforting, sense realism entire piece. actors extremely well chosen- michael sheen ""has got polari"" voices pat too! truly see seamless edit..."
2,"thought wonderful way spend time hot summer weekend, sitting air conditioned theater watching light-hearted comedy. plot simplistic, dialogue witty characters likable (even well bread suspected serial killer). may disappointed realize match point 2: r..."
3,"basically there's family little boy (jake) thinks there's zombie closet & parents fighting time.this movie slower soap opera... suddenly, jake decides become rambo kill zombie.ok, first going make film must decide thriller drama! drama movie watchable..."
4,"petter mattei's ""love time money"" visually stunning film watch. mr. mattei offers us vivid portrait human relations. movie seems telling us money, power success people different situations encounter. variation arthur schnitzler's play theme, director ..."


In [42]:
y[:100]

,sentiment
0,positive
1,positive
2,positive
3,negative
4,positive
5,positive
6,positive
7,negative
8,negative
9,positive


In [43]:
# let's encode negative 0 and positive 1, using sklearn's LabelEncoder


from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

y = encoder.fit_transform(y)

y[:100]

array([1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0,
       1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,
       1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1])

In [44]:
# let's split into train (80%) and test(20%)...

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=1)

In [45]:
X_train.shape

(7986, 1)

In [46]:
X_test.shape

(1997, 1)

### 3.3 Using BoW

In [47]:
# Applying BoW
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

In [48]:
X_train_bow = cv.fit_transform(X_train['review']).toarray()
X_test_bow = cv.transform(X_test['review']).toarray()

> **Note:**
When using Bag of Words (BoW), always apply `fit_transform` on the training data and `transform` on the testing data. The `fit_transform` step builds the vocabulary from the training set and encodes the documents accordingly. The `transform` step applies this learned vocabulary to the test set without altering it. This ensures no data leakage and that the model only learns patterns from the training data. Any unseen words in the test data are ignored. Using `fit_transform` on both train and test could result in different vocabularies, leading to inconsistent and unreliable model performance.
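To see why the vocabulary must come from the training set only, here is a minimal bag-of-words vectorizer written from scratch. `TinyBoW` is a hypothetical class for illustration and is far simpler than scikit-learn's `CountVectorizer` (no tokenization rules, no sparse output).

```python
class TinyBoW:
    """Minimal bag-of-words vectorizer (illustrative, not scikit-learn's implementation)."""

    def fit(self, docs):
        # The vocabulary is built once, from the documents passed to fit
        words = sorted({w for doc in docs for w in doc.split()})
        self.vocab = {w: i for i, w in enumerate(words)}
        return self

    def transform(self, docs):
        rows = []
        for doc in docs:
            row = [0] * len(self.vocab)
            for w in doc.split():
                if w in self.vocab:   # words not seen during fit are skipped
                    row[self.vocab[w]] += 1
            rows.append(row)
        return rows

train = ["good movie", "bad movie"]
test = ["good good film"]   # "film" never appears in the training data

bow = TinyBoW().fit(train)
test_vectors = bow.transform(test)
print(bow.vocab)      # {'bad': 0, 'good': 1, 'movie': 2}
print(test_vectors)   # [[0, 2, 0]]
```

The test vector has exactly the same columns as the training vectors, and `film` simply contributes nothing. Fitting again on the test set would instead produce a different column layout that the trained classifier could not interpret.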

In [49]:
X_train_bow

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

Let's first try to classify through `GaussianNB`.

`GaussianNB` is a Naive Bayes classifier from scikit-learn that assumes each feature follows a **Gaussian (Normal) distribution** within each class, so it is designed for continuous features. The `fit` method trains the model by estimating the per-class mean and variance of every feature. It is fast and simple, which makes it a convenient baseline, but sparse word counts are far from Gaussian, so it is not guaranteed to do well on BoW features; `MultinomialNB` is the Naive Bayes variant usually preferred for count data.
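The idea behind a Gaussian Naive Bayes classifier can be sketched in a few lines of plain Python: estimate a mean and variance per feature and per class, then pick the class with the highest log-posterior. The functions `fit_gaussian_nb` and `predict_one`, the toy data, and the fixed `eps` added to the variances are assumptions made for illustration; scikit-learn's implementation differs in its details (for example in how it smooths variances).

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y, eps=1e-9):
    """Estimate log-prior, per-feature means and variances for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # eps keeps the variance strictly positive (stand-in for proper smoothing)
        variances = [sum((v - m) ** 2 for v in col) / n + eps
                     for col, m in zip(columns, means)]
        model[label] = (math.log(n / len(X)), means, variances)
    return model

def predict_one(model, row):
    """Return the class with the highest log-posterior for one sample."""
    def log_posterior(label):
        log_prior, means, variances = model[label]
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(row, means, variances)
        )
        return log_prior + log_likelihood
    return max(model, key=log_posterior)

# Toy continuous data with two well-separated classes
X = [[1.0, 2.1], [1.2, 1.9], [0.9, 2.0], [3.0, 0.5], [3.2, 0.4], [2.9, 0.6]]
y = [0, 0, 0, 1, 1, 1]

nb_model = fit_gaussian_nb(X, y)
print(predict_one(nb_model, [1.1, 2.0]))  # 0
print(predict_one(nb_model, [3.1, 0.5]))  # 1
```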


In [50]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()

gnb.fit(X_train_bow, y_train)

In [51]:
# is it performing well?

y_pred = gnb.predict(X_test_bow)

from sklearn.metrics import accuracy_score,confusion_matrix
accuracy_score(y_test,y_pred)

0.6324486730095142
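Accuracy alone does not show which class the errors fall on. The cell above imports `confusion_matrix` but does not call it, so here is a small from-scratch sketch of both metrics. The helper names and the toy labels are illustrative; the row/column layout follows the usual convention of true labels on the rows and predicted labels on the columns.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def confusion_counts(y_true, y_pred):
    """2x2 matrix: rows are true labels, columns are predicted labels (0 = negative, 1 = positive)."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Toy labels (hypothetical values)
y_true_toy = [1, 0, 1, 1, 0, 0]
y_pred_toy = [1, 0, 0, 1, 1, 0]

acc = accuracy(y_true_toy, y_pred_toy)
cm = confusion_counts(y_true_toy, y_pred_toy)
print(acc)  # 4 of 6 predictions are correct
print(cm)   # [[2, 1], [1, 2]]
```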

Now let's try with a Random Forest instead:

In [52]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()

rf.fit(X_train_bow,y_train)

y_pred = rf.predict(X_test_bow)

accuracy_score(y_test,y_pred)

0.8477716574862293

Way better!

Also, a somewhat counterintuitive fact is that the model can learn almost as well if we lower the dimensionality of the feature space, since this classification task is not very complex.

Let's keep only the $3000$ most frequent terms as features:

In [53]:
cv = CountVectorizer(max_features=3000)   # keep only the 3000 most frequent terms

X_train_bow = cv.fit_transform(X_train['review']).toarray()
X_test_bow = cv.transform(X_test['review']).toarray()

rf = RandomForestClassifier()

rf.fit(X_train_bow,y_train)
y_pred = rf.predict(X_test_bow)
accuracy_score(y_test,y_pred)

0.8387581372058087
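Limiting the vocabulary with `max_features` amounts to keeping the terms with the highest total count across the corpus. A minimal sketch of that selection, using a hypothetical helper `top_k_vocab` and a toy corpus:

```python
from collections import Counter

def top_k_vocab(docs, k):
    """Return the k most frequent words across all documents."""
    counts = Counter(w for doc in docs for w in doc.split())
    # most_common sorts by count, highest first
    return [w for w, _ in counts.most_common(k)]

docs = ["great movie great cast", "boring movie", "great plot"]
kept = top_k_vocab(docs, k=2)
print(kept)  # ['great', 'movie']
```

Rare words such as `cast` or `plot` are dropped, which shrinks the feature matrix considerably while discarding little of the signal a sentiment classifier needs.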

### 3.4 Using N-grams

How will N-grams perform for this task?

In [54]:
cv = CountVectorizer(ngram_range=(1,2),max_features=5000)

X_train_bow = cv.fit_transform(X_train['review']).toarray()
X_test_bow = cv.transform(X_test['review']).toarray()

rf = RandomForestClassifier()

rf.fit(X_train_bow,y_train)
y_pred = rf.predict(X_test_bow)
accuracy_score(y_test,y_pred)

0.8392588883324987
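For reference, `ngram_range=(1,2)` means that both single words and pairs of adjacent words become features. The sketch below shows the idea on one tokenized phrase; `ngrams` is a hypothetical helper, not part of scikit-learn.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous n-grams with n_min <= n <= n_max, joined by spaces."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

features = ngrams(["not", "good", "movie"])
print(features)
# ['not', 'good', 'movie', 'not good', 'good movie']
```

The bigram `not good` carries a meaning that the separate unigrams `not` and `good` cannot express, which is the main motivation for adding N-grams to a sentiment model.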

### 3.5 Using TF-IDF

In [55]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()

In [56]:
X_train_tfidf = tfidf.fit_transform(X_train['review']).toarray()
X_test_tfidf = tfidf.transform(X_test['review']).toarray()  # dense, consistent with the training matrix

In [57]:
rf = RandomForestClassifier()

rf.fit(X_train_tfidf,y_train)
y_pred = rf.predict(X_test_tfidf)

accuracy_score(y_test,y_pred)

0.8452679018527791
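To show what the TF-IDF weights represent, here is a from-scratch sketch that follows the formula scikit-learn uses under its default settings (`smooth_idf=True`, `norm='l2'`): the raw term count is multiplied by `ln((1 + n) / (1 + df)) + 1` and each document vector is then scaled to unit length. The function name `tfidf` and the toy corpus are illustrative, and whitespace splitting stands in for the library's tokenizer.

```python
import math

def tfidf(docs):
    """Return (vocabulary, rows) of L2-normalized TF-IDF weights."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n = len(docs)
    # document frequency: in how many documents each word appears
    df = {w: sum(w in toks for toks in tokenized) for w in vocab}
    # smoothed inverse document frequency
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for toks in tokenized:
        row = [toks.count(w) * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in row))
        rows.append([x / norm for x in row])
    return vocab, rows

tfidf_vocab, tfidf_rows = tfidf(["good movie", "bad movie", "good plot"])
print(tfidf_vocab)   # ['bad', 'good', 'movie', 'plot']
print(tfidf_rows[1])
```

In the second document, `bad` appears in only one document while `movie` appears in two, so `bad` receives the larger weight even though both occur once.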

This was just an introduction to data preprocessing. As we said before, for the embeddings actually used with LLMs we will rely on Transformer-based encoding.

In the next notebook we'll learn more about LLMs.