In [None]:
# Upgrade pip, setuptools, wheel and install gensim
%pip install --upgrade pip setuptools wheel
%pip install gensim

# Day 3: Text representation - Word Vectorization & Embeddings

## What is vectorization? 
Vectorization is the process of converting text into vectors of real numbers, which is the input format that machine learning models expect.

This notebook covers advanced NLP representation techniques, focusing on word vectorization, word embeddings, and their applications. You'll learn how to move beyond Bag-of-Words and TF-IDF to dense vector representations, visualize embeddings, and use them in downstream tasks.

**Outline:**
- Import Required Libraries
- Load and Explore the Dataset
- Text Preprocessing Recap
- Word Embeddings: Introduction
- Word2Vec Embedding with Gensim
- GloVe Embedding Integration
- Visualizing Word Embeddings
- Finding Similar Words and Analogies
- Document Embeddings (Averaging, Doc2Vec)
- Using Pretrained Embeddings in Scikit-learn Pipelines
- Save Embeddings and Processed Data

## 1. Bag of Words

Bag of Words involves three major steps: tokenization, vocabulary creation, and vector creation (counting how often each vocabulary word occurs in a given document).


In [13]:
# Bag of Words Example
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

corpus = [
    'The quick brown fox jumps over the lazy dog',
    'Never jump over the lazy dog quickly',
    'A fox is quick and brown'
 ]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.toarray())
# Print the vocabulary to see which word each column of the matrix corresponds to.
print(sorted(vectorizer.vocabulary_.keys()))

# The X returned by fit_transform is a sparse matrix for memory efficiency.
# .toarray() converts this sparse matrix to a dense NumPy array, so we can easily create a DataFrame.
df_bow = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
print(df_bow)
df_bow

[[0 1 1 1 0 0 1 1 0 1 1 0 2]
 [0 0 1 0 0 1 0 1 1 1 0 1 1]
 [1 1 0 1 1 0 0 0 0 0 1 0 0]]
['and', 'brown', 'dog', 'fox', 'is', 'jump', 'jumps', 'lazy', 'never', 'over', 'quick', 'quickly', 'the']
   and  brown  dog  fox  is  jump  jumps  lazy  never  over  quick  quickly  \
0    0      1    1    1   0     0      1     1      0     1      1        0   
1    0      0    1    0   0     1      0     1      1     1      0        1   
2    1      1    0    1   1     0      0     0      0     0      1        0   

   the  
0    2  
1    1  
2    0  


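|   | and | brown | dog | fox | is | jump | jumps | lazy | never | over | quick | quickly | the |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 2 |
| 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 |
| 2 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |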
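The three Bag of Words steps can also be sketched without scikit-learn. This is a minimal illustration, not the library's implementation: the helper names (`tokenize`, `count_vector`) are made up for this example, and the regular expression is assumed to mirror `CountVectorizer`'s default token pattern, which keeps only tokens of two or more characters (so "A" is dropped).

```python
import re
from collections import Counter

corpus = [
    'The quick brown fox jumps over the lazy dog',
    'Never jump over the lazy dog quickly',
    'A fox is quick and brown',
]

def tokenize(text):
    # Step 1: lowercase and keep tokens of two or more word characters
    # (assumed to match CountVectorizer's default behaviour).
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in corpus]

# Step 2: the vocabulary is the sorted set of all tokens in the corpus.
vocabulary = sorted({token for doc in tokenized for token in doc})

def count_vector(tokens):
    # Step 3: one count per vocabulary word, in vocabulary order.
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

bow_matrix = [count_vector(tokens) for tokens in tokenized]

print(vocabulary)
for row in bow_matrix:
    print(row)
```

Each row should line up with the matrix printed by `CountVectorizer` above, which makes it easy to see where every count comes from.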


## 2. Term Frequency–Inverse Document Frequency (TF-IDF)

TF-IDF (Term Frequency–Inverse Document Frequency) is a numerical statistic intended to reflect how important a word is to a document. It weighs the frequency of a word in a document against its frequency across the entire corpus, which reduces the impact of common words. Although it is another frequency-based method, it is not as naive as Bag of Words. See this [article](https://neptune.ai/blog/vectorization-techniques-in-nlp-guide) for more details.

Like Bag-of-Words, TF-IDF still produces high-dimensional, sparse vectors and does not capture word meaning or context. Word embeddings (Word2Vec, GloVe, FastText) instead create dense, low-dimensional vectors that capture semantic relationships between words.

How does TF-IDF improve over Bag of Words?

In Bag of Words, we witnessed how vectorization was just concerned with the frequency of vocabulary words in a given document. As a result, articles, prepositions, and conjunctions which don’t contribute a lot to the meaning get as much importance as, say, adjectives. 

TF-IDF helps us to overcome this issue. Words that get repeated too often don’t overpower less frequent but important words.

It has two parts:

**TF**

TF stands for Term Frequency. It can be understood as a normalized frequency score, calculated as:

$$\mathrm{TF}(t, d) = \frac{\text{number of times term } t \text{ appears in document } d}{\text{total number of terms in document } d}$$

This number always stays ≤ 1, so it measures how frequent a word is relative to all of the words in a document.

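**IDF**

IDF stands for Inverse Document Frequency. Before going into IDF, we must make sense of DF, the Document Frequency:

$$\mathrm{DF}(t) = \frac{\text{number of documents containing term } t}{\text{total number of documents}}$$

DF tells us the proportion of documents that contain a certain word. So what is IDF?

It is based on the reciprocal of the Document Frequency, and the final IDF score is:

$$\mathrm{IDF}(t) = \log\left(\frac{1}{\mathrm{DF}(t)}\right) = \log\left(\frac{\text{total number of documents}}{\text{number of documents containing term } t}\right)$$
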
Why invert the DF?

As discussed above, the intuition is that the more common a word is across all documents, the less important it is for the current document.

The logarithm dampens the effect of IDF in the final calculation.

The final TF-IDF score is the product of the two parts:

$$\mathrm{TF\text{-}IDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)$$

This is how TF-IDF manages to incorporate the significance of a word. The higher the score, the more important that word is.
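To make the definitions concrete, here is a small sketch that computes the plain TF-IDF score (term frequency times the log of the inverse document frequency) by hand. The function names and the tokenization rule (lowercase, tokens of two or more characters) are assumptions made for this illustration, and the natural logarithm is used.

```python
import math
import re

corpus = [
    'The quick brown fox jumps over the lazy dog',
    'Never jump over the lazy dog quickly',
    'A fox is quick and brown',
]

def tokenize(text):
    # Assumed tokenization: lowercase, tokens of two or more characters.
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [tokenize(doc) for doc in corpus]
n_docs = len(docs)

def tf(term, tokens):
    # Share of the document's tokens that are `term`.
    return tokens.count(term) / len(tokens)

def idf(term):
    # log(total documents / documents containing the term).
    # Assumes the term occurs in at least one document.
    n_containing = sum(1 for tokens in docs if term in tokens)
    return math.log(n_docs / n_containing)

def tf_idf(term, tokens):
    return tf(term, tokens) * idf(term)

for term in ("the", "jumps"):
    print(term, round(tf_idf(term, docs[0]), 4))
```

In the first document, "the" appears twice and "jumps" only once, yet "jumps" receives the higher score because it occurs in just one of the three documents.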

Let's see how TF-IDF looks in practice. As with Bag of Words, we use scikit-learn for this exercise.

Further reading:
- https://medium.com/analytics-vidhya/understanding-tf-idf-in-nlp-4a28eebdee6a
- https://neptune.ai/blog/vectorization-techniques-in-nlp-guide

In [14]:
# TF-IDF Example
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'The quick brown fox jumps over the lazy dog',
    'Never jump over the lazy dog quickly',
    'A fox is quick and brown'
 ]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
df_tfidf = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
print(df_tfidf.round(2))
df_tfidf

    and  brown   dog   fox    is  jump  jumps  lazy  never  over  quick  \
0  0.00   0.29  0.29  0.29  0.00  0.00   0.38  0.29   0.00  0.29   0.29   
1  0.00   0.00  0.33  0.00  0.00  0.43   0.00  0.33   0.43  0.33   0.00   
2  0.52   0.39  0.00  0.39  0.52  0.00   0.00  0.00   0.00  0.00   0.39   

   quickly   the  
0     0.00  0.58  
1     0.43  0.33  
2     0.00  0.00  


|   | and | brown | dog | fox | is | jump | jumps | lazy | never | over | quick | quickly | the |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.291992 | 0.291992 | 0.291992 | 0.0 | 0.0 | 0.383935 | 0.291992 | 0.0 | 0.291992 | 0.291992 | 0.0 | 0.583984 |
| 1 | 0.0 | 0.0 | 0.329928 | 0.0 | 0.0 | 0.433816 | 0.0 | 0.329928 | 0.433816 | 0.329928 | 0.0 | 0.433816 | 0.329928 |
| 2 | 0.51742 | 0.393511 | 0.0 | 0.393511 | 0.51742 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.393511 | 0.0 | 0.0 |
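The values above do not match a plain "term frequency times log inverse document frequency" calculation. With its default settings (`smooth_idf=True`, `norm='l2'`), `TfidfVectorizer` uses raw counts as TF, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and then scales each row to unit length. The sketch below reproduces that recipe in plain Python; the variable and function names are illustrative and it is not scikit-learn's own code.

```python
import math
import re
from collections import Counter

corpus = [
    'The quick brown fox jumps over the lazy dog',
    'Never jump over the lazy dog quickly',
    'A fox is quick and brown',
]

# Assumed to mirror the vectorizer's default tokenization.
docs = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in corpus]
vocab = sorted({token for doc in docs for token in doc})
n_docs = len(docs)

# Document frequency: number of documents containing each term.
df = {term: sum(1 for doc in docs if term in doc) for term in vocab}

# Smoothed IDF, as used when smooth_idf=True.
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_row(tokens):
    counts = Counter(tokens)
    raw = [counts[term] * idf[term] for term in vocab]
    # L2 normalization: divide by the Euclidean length of the row.
    length = math.sqrt(sum(value * value for value in raw))
    return [value / length for value in raw]

rows = [tfidf_row(doc) for doc in docs]
for row in rows:
    print([round(value, 2) for value in row])
```

Because of the `+ 1` in the smoothed IDF, a word that appears in every document still gets a non-zero weight, unlike in the plain formula.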


## 5. N-gram Vectorization

N-grams are contiguous sequences of n items (words) from a given text. Using n-grams (bigrams, trigrams, etc.) can capture more context than single words.

In [15]:
# N-gram (Bigram) Vectorization
vectorizer_bigram = CountVectorizer(ngram_range=(2,2))
X_bigram = vectorizer_bigram.fit_transform(corpus)
df_bigram = pd.DataFrame(X_bigram.toarray(), columns=vectorizer_bigram.get_feature_names_out())
df_bigram

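|   | and brown | brown fox | dog quickly | fox is | fox jumps | is quick | jump over | jumps over | lazy dog | never jump | over the | quick and | quick brown | the lazy | the quick |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 |
| 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 |
| 2 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |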
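To see where the bigram columns come from, here is a minimal sketch of n-gram extraction over a list of tokens. The `ngrams` helper is a hypothetical name introduced for this example, not part of scikit-learn.

```python
def ngrams(tokens, n):
    # Slide a window of length n over the tokens and join each window
    # into a single space-separated string.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "never jump over the lazy dog quickly".split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

The bigrams correspond to the non-zero columns of the second row in the table above. In `CountVectorizer`, passing `ngram_range=(1, 2)` instead of `(2, 2)` would keep single words and bigrams together in one vocabulary.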


## 3. [Word2Vec](https://arxiv.org/pdf/1301.3781.pdf)

Word2Vec is a neural-network-based method for generating [word embeddings](https://neptune.ai/blog/word-embeddings-guide): dense vector representations of words that capture semantic relationships. The two earlier methods ignore semantics entirely. Word2Vec instead learns each word's vector from the contexts in which the word appears, so words used in similar contexts end up with similar vectors.

In [10]:
# Word2Vec Example with Gensim
from gensim.models import Word2Vec
import nltk
nltk.download('punkt_tab')

corpus = [
    'The quick brown fox jumps over the lazy dog',
    'Never jump over the lazy dog quickly',
    'A fox is quick and brown'
 ]

# Tokenize sentences
tokenized_corpus = [nltk.word_tokenize(sentence.lower()) for sentence in corpus]

# Train Word2Vec model
w2v_model = Word2Vec(
    sentences=tokenized_corpus,  # List of tokenized sentences (list of list of words)
    vector_size=50,              # Dimensionality of the word vectors (embedding size)
    window=3,                    # Maximum distance between the current and predicted word within a sentence
    min_count=1,                 # Ignores all words with total frequency lower than this
    workers=1,                   # Number of CPU cores to use during training
    seed=42                      # Random seed for reproducibility
)

# Get vector for a word
print('Vector for "fox":', w2v_model.wv['fox'])

# Find most similar words
print('Most similar to "fox":', w2v_model.wv.most_similar('fox'))

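The `window` parameter controls which neighbouring words count as context for a given word. The sketch below shows, in simplified form, how (center word, context word) pairs could be collected for a fixed window. It is an illustration only: `context_pairs` is a made-up helper, and real Word2Vec training adds further steps (for example, varying the effective window size and negative sampling) that are not shown here.

```python
def context_pairs(tokens, window):
    # For each position, pair the center word with every word that lies
    # at most `window` positions to its left or right.
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "a fox is quick and brown".split()

for center, context in context_pairs(tokens, window=2):
    print(center, "->", context)
```

A larger window yields more pairs per word and tends to capture broader, more topical associations, while a smaller window focuses on immediate neighbours.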

## 4. Global Vectors for Word Representation ([GloVe](https://nlp.stanford.edu/pubs/glove.pdf))
GloVe also produces word embeddings learned from context. Word2Vec is a window-based method: the model relies on local information, which is limited by the window size we choose. GloVe, on the other hand, is trained on corpus-wide word co-occurrence counts, so it captures both global and local statistics.

In [None]:
# GloVe Example: Using Pretrained Embeddings
import numpy as np
import requests, zipfile, io, os

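# Download the GloVe 6B archive if needed. Note: this is a large download
# (several hundred MB); only the 50-dimensional file is extracted.
if not os.path.exists('glove.6B.50d.txt'):
    url = 'http://nlp.stanford.edu/data/glove.6B.zip'
    r = requests.get(url)
    r.raise_for_status()  # Stop early if the download failed
    z = zipfile.ZipFile(io.BytesIO(r.content))
    z.extract('glove.6B.50d.txt')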

# Load GloVe vectors
glove_vectors = {}
with open('glove.6B.50d.txt', encoding='utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], dtype='float32')
        glove_vectors[word] = vector

# Example: Get vector for 'fox' and find cosine similarity with 'dog'
from numpy.linalg import norm
def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (norm(vec1) * norm(vec2))

vec_fox = glove_vectors['fox']
vec_dog = glove_vectors['dog']
print('Cosine similarity between fox and dog:', cosine_similarity(vec_fox, vec_dog))
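Cosine similarity can also be used to rank every word in the vocabulary against a query word. The sketch below assumes a dictionary that maps words to NumPy vectors, like `glove_vectors` above; the `most_similar` helper and the toy vectors are made up for illustration so the example runs without downloading anything.

```python
import numpy as np

def most_similar(word, vectors, topn=3):
    # Rank all other words by cosine similarity to `word`.
    query = vectors[word] / np.linalg.norm(vectors[word])
    scores = []
    for other, vec in vectors.items():
        if other == word:
            continue
        similarity = float(np.dot(query, vec) / np.linalg.norm(vec))
        scores.append((other, similarity))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]

# Toy 3-dimensional vectors, invented purely for this example.
toy_vectors = {
    "fox": np.array([1.0, 0.9, 0.0]),
    "dog": np.array([0.9, 1.0, 0.1]),
    "car": np.array([0.0, 0.1, 1.0]),
}

neighbours = most_similar("fox", toy_vectors, topn=2)
print(neighbours)
```

With the real embeddings loaded, `most_similar('fox', glove_vectors)` should work the same way, although a plain Python loop over the full vocabulary will be slow compared with a vectorized matrix product.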

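## 6. Doc2Vec

Doc2Vec extends Word2Vec by learning a vector for each document (identified by a tag) alongside the word vectors, so whole documents can be compared directly.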

In [None]:
# Doc2Vec Example with Gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # Needed by word_tokenize in recent NLTK releases

corpus = [
    'The quick brown fox jumps over the lazy dog',
    'Never jump over the lazy dog quickly',
    'A fox is quick and brown'
 ]

# Tag documents
tagged_data = [TaggedDocument(words=nltk.word_tokenize(doc.lower()), tags=[str(i)]) for i, doc in enumerate(corpus)]

# Train Doc2Vec model
doc2vec_model = Doc2Vec(vector_size=50, window=2, min_count=1, workers=1, epochs=40, seed=42)
doc2vec_model.build_vocab(tagged_data)
doc2vec_model.train(tagged_data, total_examples=doc2vec_model.corpus_count, epochs=doc2vec_model.epochs)

# Get vector for first document
print('Vector for first document:', doc2vec_model.dv['0'])

# Find most similar documents
print('Most similar to first document:', doc2vec_model.dv.most_similar('0'))
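A simpler alternative to Doc2Vec, listed in the outline as "Averaging", is to represent a document by the mean of its word vectors. The sketch below assumes any mapping from words to NumPy vectors that supports `in` and indexing; `average_embedding` and the toy vectors are hypothetical and exist only for this example.

```python
import numpy as np

def average_embedding(tokens, vectors, dim):
    # Average the vectors of all tokens that have an embedding.
    # Unknown tokens are skipped; a zero vector is returned if none are known.
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return np.zeros(dim, dtype="float32")
    return np.mean(known, axis=0)

# Toy 2-dimensional vectors, invented purely for this example.
toy_vectors = {
    "fox": np.array([1.0, 0.0]),
    "dog": np.array([0.0, 1.0]),
}

doc_vec = average_embedding(["the", "fox", "dog"], toy_vectors, dim=2)
print(doc_vec)
```

The same function should work with the `glove_vectors` dictionary (with `dim=50`) or with a trained gensim model by passing `w2v_model.wv` and `w2v_model.vector_size`. Averaging ignores word order, which is one motivation for learned document vectors such as Doc2Vec.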