# NLP Vectorization Techniques: A Practical Guide

This notebook demonstrates various NLP vectorization techniques inspired by the Neptune.ai guide: [Vectorization Techniques in NLP](https://neptune.ai/blog/vectorization-techniques-in-nlp-guide).

**Outline:**
1. Import Required Libraries
2. Load and Explore Sample Text Data
3. Bag of Words Vectorization
4. TF-IDF Vectorization
5. N-gram Vectorization
6. Word2Vec Embeddings
7. GloVe Embeddings
8. FastText Embeddings
9. Visualize Word Embeddings
10. Finding Similar Words and Analogies
11. Document Embeddings (Averaging, Doc2Vec)
12. Using Pretrained Embeddings in ML Pipelines


## 1. Import Required Libraries

We will use pandas, numpy, scikit-learn, gensim, NLTK, and matplotlib for vectorization and visualization, plus requests to download the GloVe vectors.

In [1]:
# Import Required Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import gensim
from gensim.models import Word2Vec, FastText
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import nltk
import os
import requests
import zipfile
import io
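nltk.download('punkt')      # tokenizer data used by nltk.word_tokenize
nltk.download('punkt_tab')  # newer NLTK releases look for this resource instead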

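If this cell fails with `ModuleNotFoundError: No module named 'gensim'`, install the package first (for example, `pip install gensim`) and rerun the cell.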

## 2. Load and Explore Sample Text Data

Let's create a small sample corpus for demonstration and display its contents.

In [None]:
# Sample corpus for demonstration
corpus = [
    'The quick brown fox jumps over the lazy dog',
    'Never jump over the lazy dog quickly',
    'A fox is quick and brown',
    'Dogs are loyal and friendly animals',
    'Foxes are wild animals and very clever',
    'The dog barked at the fox',
    'Quick brown foxes leap over lazy dogs in summer',
    'A lazy dog sleeps all day',
    'Wild animals live in the forest',
    'Friendly dogs make great pets'
]

# Display the corpus
df_corpus = pd.DataFrame({'Text': corpus})
df_corpus

## 3. Bag of Words Vectorization

Bag of Words (BoW) is a simple and commonly used method that converts each document into a vector of word counts over a shared vocabulary, ignoring word order.

In [None]:
# Bag of Words Vectorization
vectorizer_bow = CountVectorizer()
X_bow = vectorizer_bow.fit_transform(corpus)
print(X_bow.toarray())  # dense count matrix: one row per document, one column per term
df_bow = pd.DataFrame(X_bow.toarray(), columns=vectorizer_bow.get_feature_names_out())
df_bow

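Let's print the sorted vocabulary to see which term each column corresponds to. Note that the default tokenizer lowercases the text and drops single-character tokens such as "a".

In [None]:
sorted(vectorizer_bow.vocabulary_.keys())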
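
To make the counting step concrete, here is a minimal sketch that builds a bag-of-words matrix by hand with the standard library only. It assumes plain whitespace tokenization, which is simpler than what `CountVectorizer` does by default, and the two documents are taken from the sample corpus.

```python
from collections import Counter

# Two short documents taken from the sample corpus.
docs = [
    'The dog barked at the fox',
    'A lazy dog sleeps all day',
]

# Lowercase and split on whitespace. This is a simplification:
# CountVectorizer's default tokenizer would also drop the one-character token "a".
tokenized = [doc.lower().split() for doc in docs]

# The vocabulary is the sorted set of all tokens; each token gets one column.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# Each document becomes a row of per-token counts.
bow_matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    bow_matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in bow_matrix:
    print(row)
```

Each row has one entry per vocabulary term, so "the" contributes a 2 in the first row and a 0 in the second.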

## 4. TF-IDF Vectorization

TF-IDF (Term Frequency–Inverse Document Frequency) weighs how often a word appears in a document against how many documents in the corpus contain it, reducing the impact of words that are common everywhere.

In [None]:
# TF-IDF Vectorization
vectorizer_tfidf = TfidfVectorizer()
X_tfidf = vectorizer_tfidf.fit_transform(corpus)
df_tfidf = pd.DataFrame(X_tfidf.toarray(), columns=vectorizer_tfidf.get_feature_names_out())
df_tfidf.round(2)
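
The weighting can also be worked through by hand. The sketch below assumes the smoothed formula that scikit-learn documents for `TfidfVectorizer` with default settings, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each row. The three documents and the helper names (`smoothed_idf`, `tfidf_vector`) are illustrative, and tokenization is plain whitespace splitting.

```python
import math
from collections import Counter

# Three lowercased documents from the sample corpus.
docs = [
    'the dog barked at the fox',
    'a lazy dog sleeps all day',
    'wild animals live in the forest',
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def smoothed_idf(term):
    # Smoothed inverse document frequency (assumed to match smooth_idf=True).
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_vector(tokens):
    # Raw term count times IDF, then scaled so the vector has unit L2 norm.
    counts = Counter(tokens)
    raw = {term: count * smoothed_idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(weight * weight for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}

weights = tfidf_vector(tokenized[0])
for term, weight in sorted(weights.items()):
    print(f'{term}: {weight:.3f}')
```

In the first document, "barked" appears in only one document and therefore outweighs "dog", which appears in two, even though both occur once.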

## 5. N-gram Vectorization

N-grams are contiguous sequences of n items (words) from a given text. Using n-grams (bigrams, trigrams, etc.) can capture more context than single words.

In [None]:
# N-gram (Bigram) Vectorization
vectorizer_bigram = CountVectorizer(ngram_range=(2,2))
X_bigram = vectorizer_bigram.fit_transform(corpus)
df_bigram = pd.DataFrame(X_bigram.toarray(), columns=vectorizer_bigram.get_feature_names_out())
df_bigram
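
As a rough illustration of what the vectorizer counts, the following sketch extracts word n-grams with a sliding window. The helper `word_ngrams` is a hypothetical name, and it assumes whitespace tokenization rather than scikit-learn's tokenizer.

```python
def word_ngrams(text, n):
    """Return the contiguous word n-grams of a text, in order of appearance."""
    tokens = text.lower().split()
    # Slide a window of n tokens across the text; short texts yield an empty list.
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = 'The dog barked at the fox'
bigrams = word_ngrams(sentence, 2)
trigrams = word_ngrams(sentence, 3)

print(bigrams)
print(trigrams)
```

A sentence of k tokens yields k - n + 1 n-grams, which is why the vocabulary grows quickly as n increases.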

## 6. Word2Vec Embeddings

Word2Vec is a neural network-based technique that learns dense vector representations for words, capturing semantic relationships.

In [None]:
# Tokenize corpus for Word2Vec
corpus_tokenized = [nltk.word_tokenize(doc.lower()) for doc in corpus]

# Train Word2Vec model
w2v_model = Word2Vec(sentences=corpus_tokenized, vector_size=50, window=3, min_count=1, workers=1, seed=42)

# Get vector for a word
print('Vector for "fox":', w2v_model.wv['fox'])

# Find most similar words to 'fox' (on a corpus this small, the rankings are mostly noise)
print('Most similar to "fox":', w2v_model.wv.most_similar('fox'))
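
The `window` parameter controls which neighbouring words count as context during training. The sketch below lists the (center, context) pairs that a fixed window of size 2 would produce for one sentence. It is a simplification: the helper `context_pairs` is hypothetical, and gensim is understood to shrink the window randomly per position, which this sketch does not do.

```python
def context_pairs(tokens, window):
    """Pair each token with every other token at most `window` positions away."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = 'the dog barked at the fox'.split()
pairs = context_pairs(tokens, window=2)

print(len(pairs))
print([context for center, context in pairs if center == 'barked'])
```

Words that repeatedly share contexts in pairs like these end up with similar vectors, which is the property the similarity queries rely on.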

## 7. GloVe Embeddings

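GloVe (Global Vectors for Word Representation) is a popular word embedding method that learns vectors from global word-word co-occurrence statistics in a corpus. Pretrained GloVe vectors are widely available, and we load a set of them here rather than training our own.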

In [None]:
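# Download and load GloVe embeddings (50d for demo)
# Note: glove.6B.zip is a large archive (over 800 MB), so the first run is slow.
glove_path = 'glove.6B.50d.txt'
if not os.path.exists(glove_path):
    url = 'http://nlp.stanford.edu/data/glove.6B.zip'
    r = requests.get(url)
    r.raise_for_status()  # stop early if the download failed
    z = zipfile.ZipFile(io.BytesIO(r.content))
    z.extract('glove.6B.50d.txt')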

glove_vectors = {}
with open(glove_path, encoding='utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], dtype='float32')
        glove_vectors[word] = vector

def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

vec_fox = glove_vectors.get('fox')
vec_dog = glove_vectors.get('dog')
if vec_fox is not None and vec_dog is not None:
    print('Cosine similarity between fox and dog:', cosine_similarity(vec_fox, vec_dog))
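
The co-occurrence statistics that GloVe is trained on can be sketched in a few lines. This example assumes a symmetric window and weights each pair by the inverse of the distance between the two words, as described in the GloVe paper. The function name and the two sentences are illustrative, and this is not the reference implementation.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window):
    """Accumulate distance-weighted, symmetric word-word co-occurrence counts."""
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Look back at most `window` positions; closer words count more.
            for j in range(max(0, i - window), i):
                weight = 1.0 / (i - j)
                counts[(word, tokens[j])] += weight
                counts[(tokens[j], word)] += weight
    return counts

sentences = [
    'the dog barked at the fox'.split(),
    'a lazy dog sleeps all day'.split(),
]
counts = cooccurrence_counts(sentences, window=2)

print(counts[('dog', 'barked')])  # adjacent words
print(counts[('dog', 'at')])      # two positions apart
```

GloVe then fits word vectors so that their dot products approximate the logarithm of these counts.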

## 8. FastText Embeddings

FastText, developed by Facebook, extends Word2Vec by representing words as bags of character n-grams, allowing it to generate embeddings for out-of-vocabulary words.

In [None]:
# Train FastText model
ft_model = FastText(sentences=corpus_tokenized, vector_size=50, window=3, min_count=1, workers=1, seed=42)

# Get vector for a word
print('Vector for "fox":', ft_model.wv['fox'])

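# Get vector for an out-of-vocabulary word (e.g., 'foxy', which never appears in the corpus;
# 'foxes' would not qualify because it occurs in two of the sample sentences)
print('Vector for "foxy":', ft_model.wv['foxy'])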
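
To see why an unseen word can still get a vector, it helps to list the character n-grams involved. The sketch below assumes the word is wrapped in `<` and `>` boundary markers and uses n-gram lengths 3 to 6, which are understood to be gensim's defaults (`min_n=3`, `max_n=6`). The helper `char_ngrams` is a hypothetical name, not part of gensim.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List the character n-grams of a word wrapped in boundary markers."""
    wrapped = f'<{word}>'
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[i:i + n])
    return ngrams

fox = set(char_ngrams('fox'))
foxy = set(char_ngrams('foxy'))
shared = sorted(fox & foxy)

print(sorted(fox))
print(shared)
```

Because an unseen word such as "foxy" shares n-grams with "fox", its vector can be assembled from subword vectors learned during training.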

## 9. Visualize Word Embeddings

We can use dimensionality reduction techniques like t-SNE or PCA to visualize high-dimensional word embeddings in 2D space.

In [None]:
# Visualize Word2Vec Embeddings with t-SNE
words = list(w2v_model.wv.index_to_key)
word_vectors = np.array([w2v_model.wv[w] for w in words])
tsne = TSNE(n_components=2, random_state=42, perplexity=5)
word_vec_2d = tsne.fit_transform(word_vectors)

plt.figure(figsize=(8,6))
plt.scatter(word_vec_2d[:,0], word_vec_2d[:,1])
for i, word in enumerate(words):
    plt.annotate(word, xy=(word_vec_2d[i,0], word_vec_2d[i,1]))
plt.title('Word2Vec Embeddings Visualized with t-SNE')
plt.show()

## 10. Finding Similar Words and Analogies

Embedding models can be used to find similar words and solve analogy tasks (e.g., king - man + woman ≈ queen).

In [None]:
# Find most similar words to 'dog'
print('Most similar to "dog":', w2v_model.wv.most_similar('dog'))

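# Analogy: 'fox' - 'dog' + 'cat' ≈ ?
# 'cat' does not occur in the toy corpus, so this lookup raises a KeyError,
# which is caught below. Analogies need a model trained on a much larger corpus.
try:
    result = w2v_model.wv.most_similar(positive=['fox', 'cat'], negative=['dog'])
    print("'fox' - 'dog' + 'cat' ≈", result[0][0])
except KeyError as e:
    print('Analogy not found:', e)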
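
The arithmetic behind an analogy query can be shown with a handful of made-up vectors. The 3-dimensional values below are invented purely for illustration and do not come from any trained model. The helper `analogy` is hypothetical and skips the vector normalization that gensim is understood to apply before combining vectors.

```python
import math

# Hand-made toy vectors, chosen only to illustrate the arithmetic.
toy_vectors = {
    'king':  [0.9, 0.8, 0.1],
    'queen': [0.9, 0.1, 0.8],
    'man':   [0.1, 0.9, 0.1],
    'woman': [0.1, 0.1, 0.9],
    'apple': [0.5, 0.5, 0.5],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in
              zip(toy_vectors[a], toy_vectors[b], toy_vectors[c])]
    candidates = [w for w in toy_vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(target, toy_vectors[w]))

print(analogy('king', 'man', 'woman'))
```

Excluding the query words from the candidates matters, because the target vector is often closest to one of its own inputs.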

## 11. Document Embeddings (Averaging, Doc2Vec)

We can represent entire documents by averaging word vectors or using Doc2Vec for document-level embeddings.

In [None]:
# Document Embeddings by Averaging Word2Vec Vectors
def document_vector(doc):
    words = [w for w in nltk.word_tokenize(doc.lower()) if w in w2v_model.wv]
    if words:
        return np.mean([w2v_model.wv[w] for w in words], axis=0)
    else:
        return np.zeros(w2v_model.vector_size)

doc_vectors = np.array([document_vector(doc) for doc in corpus])
print('Shape of document vectors (averaged):', doc_vectors.shape)

# Doc2Vec Example
tagged_data = [TaggedDocument(words=nltk.word_tokenize(doc.lower()), tags=[str(i)]) for i, doc in enumerate(corpus)]
d2v_model = Doc2Vec(vector_size=50, window=2, min_count=1, workers=1, epochs=40, seed=42)
d2v_model.build_vocab(tagged_data)
d2v_model.train(tagged_data, total_examples=d2v_model.corpus_count, epochs=d2v_model.epochs)

# Get vector for first document
print('Doc2Vec vector for first document:', d2v_model.dv['0'])

## 12. Using Pretrained Embeddings in ML Pipelines

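We can use embeddings as features for text classification with scikit-learn. Here, we'll use the averaged Word2Vec vectors computed above as features. Those vectors were trained on our toy corpus, but pretrained vectors can be averaged in the same way.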

In [None]:
# Example: Text Classification with Averaged Word2Vec Vectors
# Create dummy labels for demonstration (e.g., 0 for animal-related, 1 for pet-related)
labels = [0,0,0,1,0,1,0,1,0,1]

X_train, X_test, y_train, y_test = train_test_split(doc_vectors, labels, test_size=0.3, random_state=42)

clf = LogisticRegression(max_iter=200)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))