# NLP Assignment 2: Text Vectorization Techniques

This notebook demonstrates:
1. **Bag of Words (BoW)** - Count occurrence
2. **Bag of Words (BoW)** - Normalized count occurrence
3. **TF-IDF (Term Frequency-Inverse Document Frequency)**
4. **Word2Vec Embeddings**

## 1. Import Required Libraries

In [2]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models import Word2Vec
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import warnings
warnings.filterwarnings('ignore')

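# Download required NLTK data
nltk.download('punkt', quiet=True)
# Newer NLTK releases load the tokenizer data from 'punkt_tab' instead of 'punkt'
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)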
print("Libraries imported successfully!")



Libraries imported successfully!


## 2. Prepare Sample Data

In [3]:
# Sample corpus of documents
documents = [
    "Natural language processing is a subfield of artificial intelligence",
    "Machine learning and deep learning are part of artificial intelligence",
    "Natural language processing uses machine learning algorithms",
    "Deep learning models are used in natural language processing",
    "Artificial intelligence includes machine learning and natural language processing"
]

print("Sample Documents:")
for i, doc in enumerate(documents, 1):
    print(f"{i}. {doc}")

Sample Documents:
1. Natural language processing is a subfield of artificial intelligence
2. Machine learning and deep learning are part of artificial intelligence
3. Natural language processing uses machine learning algorithms
4. Deep learning models are used in natural language processing
5. Artificial intelligence includes machine learning and natural language processing


## 3. Bag of Words (BoW) - Count Occurrence

The Bag of Words model represents text as a vector of word counts, ignoring grammar and word order.
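To make the counting step concrete, here is a minimal standard-library sketch of what a count vectorizer does. It assumes the default `CountVectorizer` tokenization (lowercasing, then keeping tokens of two or more word characters); `bow_counts` and `toy_docs` are illustrative names that do not appear elsewhere in the notebook.

```python
import re
from collections import Counter

# Same token pattern CountVectorizer uses by default: 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def bow_counts(docs):
    """Return (sorted vocabulary, one row of counts per document)."""
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[word] for word in vocab])
    return vocab, rows

# Hypothetical two-document corpus, used only for illustration.
toy_docs = [
    "Machine learning and deep learning",
    "Deep learning is a subfield",
]
vocab, rows = bow_counts(toy_docs)
print(vocab)   # ['and', 'deep', 'is', 'learning', 'machine', 'subfield']
for row in rows:
    print(row)  # [1, 1, 0, 2, 1, 0] then [0, 1, 1, 1, 0, 1]
```

Note that the single-character token "a" never enters the vocabulary, and that word order within each document is discarded.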

In [4]:
# Create Count Vectorizer
count_vectorizer = CountVectorizer()

# Fit and transform the documents
bow_matrix = count_vectorizer.fit_transform(documents)

# Get feature names (vocabulary)
feature_names = count_vectorizer.get_feature_names_out()

# Convert to DataFrame for better visualization
bow_df = pd.DataFrame(bow_matrix.toarray(), columns=feature_names)
bow_df.index = [f"Doc {i+1}" for i in range(len(documents))]

print("Bag of Words (Count Occurrence):")
print(f"Vocabulary size: {len(feature_names)}")
print(f"\nVocabulary: {list(feature_names)}\n")
print(bow_df)

Bag of Words (Count Occurrence):
Vocabulary size: 20

Vocabulary: ['algorithms', 'and', 'are', 'artificial', 'deep', 'in', 'includes', 'intelligence', 'is', 'language', 'learning', 'machine', 'models', 'natural', 'of', 'part', 'processing', 'subfield', 'used', 'uses']

       algorithms  and  are  artificial  deep  in  includes  intelligence  is  \
Doc 1           0    0    0           1     0   0         0             1   1   
Doc 2           0    1    1           1     1   0         0             1   0   
Doc 3           1    0    0           0     0   0         0             0   0   
Doc 4           0    0    1           0     1   1         0             0   0   
Doc 5           0    1    0           1     0   0         1             1   0   

       language  learning  machine  models  natural  of  part  processing  \
Doc 1         1         0        0       0        1   1     0           1   
Doc 2         0         2        1       0        0   1     1           0   
Doc 3       

## 4. Bag of Words - Normalized Count Occurrence

Normalized counts make documents of different lengths comparable: each count is divided by the total number of counted words in its document, so every row sums to 1.
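As a worked example of the arithmetic, the sketch below normalizes the counts for Doc 2 by hand. It uses a plain whitespace split rather than `CountVectorizer`, which gives the same tokens for this sentence; `l1_normalize` is an illustrative helper, not part of the notebook.

```python
from collections import Counter

def l1_normalize(counts):
    """Divide each count by the row total so the values sum to 1."""
    total = sum(counts.values())
    if total == 0:  # guard against empty documents
        return {word: 0.0 for word in counts}
    return {word: n / total for word, n in counts.items()}

# Doc 2 from the sample corpus, lowercased.
doc2 = "machine learning and deep learning are part of artificial intelligence"
counts = Counter(doc2.split())
normalized = l1_normalize(counts)

print(normalized["learning"])  # 0.2 (2 of 10 tokens)
print(normalized["machine"])   # 0.1 (1 of 10 tokens)
```

Doc 1 has nine whitespace-separated words, but `CountVectorizer` drops the single-character token "a", so its denominator is 8 and each of its words scores 1/8 = 0.125.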

In [5]:
# Normalize the count matrix (L1 normalization - divide by sum of all counts per document)
bow_normalized = bow_matrix.toarray()
row_sums = bow_normalized.sum(axis=1, keepdims=True)
bow_normalized = bow_normalized / row_sums

# Convert to DataFrame
bow_normalized_df = pd.DataFrame(bow_normalized, columns=feature_names)
bow_normalized_df.index = [f"Doc {i+1}" for i in range(len(documents))]

print("Bag of Words (Normalized Count Occurrence):")
print(bow_normalized_df.round(4))

Bag of Words (Normalized Count Occurrence):
       algorithms     and     are  artificial    deep      in  includes  \
Doc 1      0.0000  0.0000  0.0000      0.1250  0.0000  0.0000    0.0000   
Doc 2      0.0000  0.1000  0.1000      0.1000  0.1000  0.0000    0.0000   
Doc 3      0.1429  0.0000  0.0000      0.0000  0.0000  0.0000    0.0000   
Doc 4      0.0000  0.0000  0.1111      0.0000  0.1111  0.1111    0.0000   
Doc 5      0.0000  0.1111  0.0000      0.1111  0.0000  0.0000    0.1111   

       intelligence     is  language  learning  machine  models  natural  \
Doc 1        0.1250  0.125    0.1250    0.0000   0.0000  0.0000   0.1250   
Doc 2        0.1000  0.000    0.0000    0.2000   0.1000  0.0000   0.0000   
Doc 3        0.0000  0.000    0.1429    0.1429   0.1429  0.0000   0.1429   
Doc 4        0.0000  0.000    0.1111    0.1111   0.0000  0.1111   0.1111   
Doc 5        0.1111  0.000    0.1111    0.1111   0.1111  0.0000   0.1111   

          of  part  processing  subfield    used

## 5. TF-IDF (Term Frequency - Inverse Document Frequency)

TF-IDF weights each word by how informative it is: a word that is frequent in a document but rare across the corpus gets a higher score.
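The sketch below recomputes a few scores by hand, assuming scikit-learn's documented defaults for `TfidfVectorizer`: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The helper names are illustrative.

```python
import math
import re
from collections import Counter

documents = [
    "Natural language processing is a subfield of artificial intelligence",
    "Machine learning and deep learning are part of artificial intelligence",
    "Natural language processing uses machine learning algorithms",
    "Deep learning models are used in natural language processing",
    "Artificial intelligence includes machine learning and natural language processing",
]

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")  # default 2+ character tokens
tokenized = [TOKEN_PATTERN.findall(d.lower()) for d in documents]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(term for toks in tokenized for term in set(toks))

def smoothed_idf(term):
    """idf = ln((1 + n) / (1 + df)) + 1, the smooth_idf=True variant."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf_row(tokens):
    """tf * idf for each term, then scale the row to unit L2 norm."""
    tf = Counter(tokens)
    raw = {t: tf[t] * smoothed_idf(t) for t in tf}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

doc3 = tfidf_row(tokenized[2])
for term in ("algorithms", "machine", "language"):
    print(f"{term:12s}{doc3[term]:.4f}")
```

"algorithms" occurs in only one document, so it scores highest in Doc 3 (0.5186), while "language" occurs in four of the five documents and scores lowest (0.2922).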

In [6]:
# Create TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the documents
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

# Get feature names
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()

# Convert to DataFrame
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_feature_names)
tfidf_df.index = [f"Doc {i+1}" for i in range(len(documents))]

print("TF-IDF Matrix:")
print(tfidf_df.round(4))

TF-IDF Matrix:
       algorithms     and     are  artificial    deep      in  includes  \
Doc 1      0.0000  0.0000  0.0000      0.3157  0.0000  0.0000    0.0000   
Doc 2      0.0000  0.3235  0.3235      0.2686  0.3235  0.0000    0.0000   
Doc 3      0.5186  0.0000  0.0000      0.0000  0.0000  0.0000    0.0000   
Doc 4      0.0000  0.0000  0.3418      0.0000  0.3418  0.4237    0.0000   
Doc 5      0.0000  0.3906  0.0000      0.3242  0.0000  0.0000    0.4842   

       intelligence      is  language  learning  machine  models  natural  \
Doc 1        0.3157  0.4714    0.2656    0.0000   0.0000  0.0000   0.2656   
Doc 2        0.2686  0.0000    0.0000    0.4518   0.2686  0.0000   0.0000   
Doc 3        0.0000  0.0000    0.2922    0.2922   0.3473  0.0000   0.2922   
Doc 4        0.0000  0.0000    0.2387    0.2387   0.0000  0.4237   0.2387   
Doc 5        0.3242  0.0000    0.2728    0.2728   0.3242  0.0000   0.2728   

           of   part  processing  subfield    used    uses  
Doc 1  0.3

In [7]:
# Analyze TF-IDF scores
print("\nTop 5 important words per document (based on TF-IDF):")
print("="*60)
for i, doc_idx in enumerate(tfidf_df.index):
    doc_scores = tfidf_df.loc[doc_idx]
    top_words = doc_scores.nlargest(5)
    print(f"\n{doc_idx}: {documents[i][:50]}...")
    print("-" * 60)
    for word, score in top_words.items():
        if score > 0:
            print(f"  {word:20s}: {score:.4f}")


Top 5 important words per document (based on TF-IDF):

Doc 1: Natural language processing is a subfield of artif...
------------------------------------------------------------
  is                  : 0.4714
  subfield            : 0.4714
  of                  : 0.3803
  artificial          : 0.3157
  intelligence        : 0.3157

Doc 2: Machine learning and deep learning are part of art...
------------------------------------------------------------
  learning            : 0.4518
  part                : 0.4010
  and                 : 0.3235
  are                 : 0.3235
  deep                : 0.3235

Doc 3: Natural language processing uses machine learning ...
------------------------------------------------------------
  algorithms          : 0.5186
  uses                : 0.5186
  machine             : 0.3473
  language            : 0.2922
  learning            : 0.2922

Doc 4: Deep learning models are used in natural language ...
-------------------------------------------------

## 6. Word2Vec Embeddings

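Word2Vec learns a dense vector for each word from the contexts in which the word occurs. Given enough training text, words used in similar contexts end up with similar vectors, which lets the embeddings capture semantic relationships.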

In [8]:
# Tokenize documents for Word2Vec
tokenized_docs = [word_tokenize(doc.lower()) for doc in documents]

print("Tokenized Documents:")
for i, tokens in enumerate(tokenized_docs, 1):
    print(f"Doc {i}: {tokens}")

Tokenized Documents:
Doc 1: ['natural', 'language', 'processing', 'is', 'a', 'subfield', 'of', 'artificial', 'intelligence']
Doc 2: ['machine', 'learning', 'and', 'deep', 'learning', 'are', 'part', 'of', 'artificial', 'intelligence']
Doc 3: ['natural', 'language', 'processing', 'uses', 'machine', 'learning', 'algorithms']
Doc 4: ['deep', 'learning', 'models', 'are', 'used', 'in', 'natural', 'language', 'processing']
Doc 5: ['artificial', 'intelligence', 'includes', 'machine', 'learning', 'and', 'natural', 'language', 'processing']


In [9]:
# Train Word2Vec model
# Parameters:
# - vector_size: dimensionality of word vectors
# - window: maximum distance between current and predicted word
# - min_count: ignores words with frequency less than this
# - sg: 1 for skip-gram, 0 for CBOW

word2vec_model = Word2Vec(
    sentences=tokenized_docs,
    vector_size=50,
    window=3,
    min_count=1,
    sg=1,  # Skip-gram
    epochs=100
)

print("Word2Vec Model trained successfully!")
print(f"Vocabulary size: {len(word2vec_model.wv)}")
print(f"Vector dimensionality: {word2vec_model.wv.vector_size}")

Word2Vec Model trained successfully!
Vocabulary size: 21
Vector dimensionality: 50
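The Word2Vec vocabulary has 21 entries, one more than the 20 found by `CountVectorizer`. The difference appears to come from tokenization rather than from the models. The sketch below uses a whitespace split as a stand-in for `word_tokenize` (an assumption that holds here because the sentences contain no punctuation) and compares it with the default `CountVectorizer` token pattern.

```python
import re

documents = [
    "Natural language processing is a subfield of artificial intelligence",
    "Machine learning and deep learning are part of artificial intelligence",
    "Natural language processing uses machine learning algorithms",
    "Deep learning models are used in natural language processing",
    "Artificial intelligence includes machine learning and natural language processing",
]

# Stand-in for word_tokenize: plain whitespace split on lowercased text.
split_vocab = {tok for doc in documents for tok in doc.lower().split()}

# CountVectorizer's default token pattern keeps only 2+ character tokens.
pattern_vocab = {
    tok
    for doc in documents
    for tok in re.findall(r"(?u)\b\w\w+\b", doc.lower())
}

print(len(split_vocab), len(pattern_vocab))  # 21 20
print(split_vocab - pattern_vocab)           # {'a'}
```

The extra entry is the single-character word "a" from Doc 1, which is visible in the tokenized output above but absent from the Bag of Words vocabulary.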


In [10]:
# Display word vectors for some key words
print("\nWord Embeddings (Word2Vec Vectors):")
print("="*60)

key_words = ['natural', 'language', 'processing', 'machine', 'learning', 'artificial']
for word in key_words:
    if word in word2vec_model.wv:
        vector = word2vec_model.wv[word]
        print(f"\n'{word}': {vector[:10]}... (showing first 10 dimensions)")
        print(f"Full vector shape: {vector.shape}")


Word Embeddings (Word2Vec Vectors):

'natural': [ 0.0146931  -0.01868898 -0.00132968  0.00742045 -0.00301441  0.01420391
  0.0214204   0.01612241 -0.00467669  0.01201618]... (showing first 10 dimensions)
Full vector shape: (50,)

'language': [-0.01798476  0.00768935  0.00966274  0.01206527  0.01451019 -0.01418762
  0.00465903  0.01443525 -0.0078618  -0.01484121]... (showing first 10 dimensions)
Full vector shape: (50,)

'processing': [-0.01713256  0.00932818 -0.00881199  0.0022169   0.01634271 -0.01073778
  0.01165339 -0.01098105 -0.00942809  0.01617806]... (showing first 10 dimensions)
Full vector shape: (50,)

'machine': [-0.00070727  0.00644643 -0.01438979 -0.00216373  0.01459725  0.01254674
 -0.00417216  0.00822531 -0.01943616  0.00931134]... (showing first 10 dimensions)
Full vector shape: (50,)

'learning': [-0.00161282  0.00060139  0.00946599  0.01822652 -0.01933221 -0.01576339
  0.01516495  0.02005878 -0.01218199 -0.00974327]... (showing first 10 dimensions)
Full vector shape:

In [11]:
# Find similar words using Word2Vec
print("\n" + "="*60)
print("Word Similarity Analysis:")
print("="*60)

test_words = ['machine', 'natural', 'artificial']
for word in test_words:
    if word in word2vec_model.wv:
        print(f"\nWords most similar to '{word}':")
        similar_words = word2vec_model.wv.most_similar(word, topn=3)
        for similar_word, similarity in similar_words:
            print(f"  {similar_word:15s}: {similarity:.4f}")


Word Similarity Analysis:

Words most similar to 'machine':
  language       : 0.2754
  are            : 0.2705
  used           : 0.2293

Words most similar to 'natural':
  in             : 0.2464
  are            : 0.1646
  artificial     : 0.1006

Words most similar to 'artificial':
  intelligence   : 0.2914
  part           : 0.2819
  are            : 0.2571
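The scores reported by `most_similar` are cosine similarities between word vectors. A minimal sketch of that measure, using hypothetical 3-dimensional vectors rather than the trained embeddings:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical vectors chosen to show the range of the measure.
same_direction = cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])   # close to 1
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])       # 0
opposite = cosine_similarity([1.0, 2.0, 0.0], [-1.0, -2.0, 0.0])       # close to -1

print(round(same_direction, 4), round(orthogonal, 4), round(opposite, 4))
```

All scores in the output above are below 0.3. With only five short training sentences, the rankings are more likely to reflect noise than genuine semantic similarity, so they should be read with caution.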


In [12]:
# Create document embeddings by averaging word vectors
def get_document_embedding(tokens, model):
    """Average word vectors to get document embedding"""
    vectors = [model.wv[word] for word in tokens if word in model.wv]
    if len(vectors) > 0:
        return np.mean(vectors, axis=0)
    else:
        return np.zeros(model.wv.vector_size)

# Generate embeddings for all documents
doc_embeddings = [get_document_embedding(tokens, word2vec_model) for tokens in tokenized_docs]

print("\nDocument Embeddings (Word2Vec - averaged word vectors):")
print("="*60)
for i, embedding in enumerate(doc_embeddings, 1):
    print(f"Doc {i} embedding shape: {embedding.shape}")
    print(f"  First 10 dimensions: {embedding[:10]}")
    print()


Document Embeddings (Word2Vec - averaged word vectors):
Doc 1 embedding shape: (50,)
  First 10 dimensions: [-5.0833886e-03  1.8785053e-03 -2.4312432e-03  3.3965025e-05
 -1.6989702e-03 -9.4283614e-03  3.9328886e-03  6.4147082e-03
 -5.4802126e-03 -6.6288989e-03]

Doc 2 embedding shape: (50,)
  First 10 dimensions: [-0.00378224  0.00388806 -0.00300068 -0.00431014 -0.00550977 -0.00615963
  0.00437296  0.00823436 -0.00594596 -0.00830627]

Doc 3 embedding shape: (50,)
  First 10 dimensions: [-0.00853967  0.00293612  0.0030956   0.00590068  0.00358409  0.00133784
  0.00738684  0.00685054 -0.00959731  0.00473232]

Doc 4 embedding shape: (50,)
  First 10 dimensions: [ 0.00022861 -0.00370952 -0.0031969   0.00464718 -0.00277108 -0.00417245
  0.00935719  0.00510561 -0.00798658 -0.00360427]

Doc 5 embedding shape: (50,)
  First 10 dimensions: [-0.00284712  0.00132368  0.00156641  0.00317542 -0.00110207 -0.0055602
  0.00660508  0.0030253  -0.00862447 -0.0051529 ]
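Averaging is a simple way to get a fixed-size document vector, but like Bag of Words it ignores word order. The sketch below illustrates this with hypothetical 2-dimensional vectors: two sentences with the same words in a different order get identical embeddings. `average_vectors` mirrors the logic of `get_document_embedding` without depending on gensim.

```python
def average_vectors(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

# Hypothetical word vectors, for illustration only.
toy_vectors = {
    "dog": [1.0, 0.0],
    "bites": [0.0, 2.0],
    "man": [2.0, 1.0],
}

a = average_vectors(["dog", "bites", "man"], toy_vectors, dim=2)
b = average_vectors(["man", "bites", "dog"], toy_vectors, dim=2)

print(a, b)     # [1.0, 1.0] [1.0, 1.0]
print(a == b)   # True: word order does not affect the average
```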



## 7. Summary and Comparison

### Comparison of Methods:

| Method | Type | Dimensionality | Semantic Meaning | Sparsity |
|--------|------|----------------|------------------|----------|
| **BoW (Count)** | Frequency-based | Vocabulary size | No | High (sparse) |
| **BoW (Normalized)** | Frequency-based | Vocabulary size | No | High (sparse) |
| **TF-IDF** | Weighted frequency | Vocabulary size | No | High (sparse) |
| **Word2Vec** | Neural embedding | Fixed (50 in this example) | Yes | Low (dense) |
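The "Sparsity" column can be checked on the sample corpus by counting the zero cells of the Bag of Words matrix. The sketch below does this with the standard library, again assuming the default `CountVectorizer` token pattern.

```python
import re

documents = [
    "Natural language processing is a subfield of artificial intelligence",
    "Machine learning and deep learning are part of artificial intelligence",
    "Natural language processing uses machine learning algorithms",
    "Deep learning models are used in natural language processing",
    "Artificial intelligence includes machine learning and natural language processing",
]

pattern = re.compile(r"(?u)\b\w\w+\b")  # default 2+ character tokens
token_sets = [set(pattern.findall(d.lower())) for d in documents]
vocab = set().union(*token_sets)

# A cell is non-zero when the document contains the vocabulary word.
nonzero = sum(len(s) for s in token_sets)
total_cells = len(documents) * len(vocab)
sparsity = 1 - nonzero / total_cells

print(f"{nonzero} non-zero of {total_cells} cells, sparsity = {sparsity:.2f}")
# 42 non-zero of 100 cells, sparsity = 0.58
```

Even in this five-sentence corpus more than half of the cells are zero. The fraction generally grows with vocabulary size, because each document uses only a small share of the words in the corpus.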

### Key Differences:

1. **BoW & TF-IDF**: Create sparse vectors in which each dimension corresponds to one vocabulary word
2. **Word2Vec**: Creates dense vectors that can capture semantic relationships between words when trained on enough text
3. **Normalization**: Makes documents of different lengths comparable
4. **TF-IDF**: Down-weights words that appear in many documents
5. **Word2Vec**: Supports similar-word lookup and word arithmetic
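Word arithmetic means adding and subtracting word vectors and then looking up the nearest word to the result. The sketch below shows the idea with hypothetical 2-dimensional vectors (axis 0 loosely standing for "royalty", axis 1 for "gender"); these are hand-picked for illustration and are not output of the model trained above, which has seen far too little text for such analogies to work.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical embeddings, for illustration only.
toy = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "prince": [1.0, 0.9],
    "apple": [-1.0, 0.0],
}

# king - man + woman
target = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]

# Exclude the query words, then pick the candidate closest to the target.
candidates = {w: v for w, v in toy.items() if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))

print(target, best)  # [1.0, -1.0] queen
```

gensim exposes a related query through the `positive` and `negative` arguments of `most_similar`.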

In [13]:
# Final comparison
print("Dimensionality Comparison:")
print("="*60)
print(f"BoW (Count):         {bow_matrix.shape[1]} dimensions (vocabulary size)")
print(f"BoW (Normalized):    {bow_normalized.shape[1]} dimensions (vocabulary size)")
print(f"TF-IDF:              {tfidf_matrix.shape[1]} dimensions (vocabulary size)")
print(f"Word2Vec:            {word2vec_model.wv.vector_size} dimensions (fixed)")
print()
print(f"Number of documents: {len(documents)}")
print("="*60)

Dimensionality Comparison:
BoW (Count):         20 dimensions (vocabulary size)
BoW (Normalized):    20 dimensions (vocabulary size)
TF-IDF:              20 dimensions (vocabulary size)
Word2Vec:            50 dimensions (fixed)

Number of documents: 5
