**Perform the bag-of-words approach (count occurrence, normalized count occurrence) and TF-IDF on data. Create embeddings using Word2Vec.**

In [None]:
%pip install nltk scikit-learn gensim pandas




In [None]:
import nltk
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
nltk.download('punkt')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
documents = [
    "Natural Language Processing is amazing!",
    "Machine learning and deep learning are subfields of AI.",
    "Natural Language Processing (NLP) is part of AI and Machine Learning.",
    "Deep learning improves NLP tasks.",
]

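**1. Bag-of-Words (BoW)**
Counts how often each word occurs in each document and stores the counts in a document-term matrix (one row per document, one column per vocabulary word).
Normalized count occurrence (term frequency, TF) divides each count by the total number of words in its document.
The cell below prints raw counts only; `TfidfVectorizer` in the next section applies its own L2 normalization to each row.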




In [None]:
# 1. Bag of Words (raw count occurrence)
print("\n--- Bag-of-Words (BoW) ---")

# Convert text to count vectors
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)

# Convert to DataFrame
bow_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
print("\nBoW Matrix (Raw Count):\n", bow_df)



--- Bag-of-Words (BoW) ---

BoW Matrix (Raw Count):
    ai  amazing  and  are  deep  improves  is  language  learning  machine  \
0   0        1    0    0     0         0   1         1         0        0   
1   1        0    1    1     1         0   0         0         2        1   
2   1        0    1    0     0         0   1         1         1        1   
3   0        0    0    0     1         1   0         0         1        0   

   natural  nlp  of  part  processing  subfields  tasks  
0        1    0   0     0           1          0      0  
1        0    0   1     0           0          1      0  
2        1    1   1     1           1          0      0  
3        0    1   0     0           0          0      1  
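
The matrix above holds raw counts only. As a minimal sketch of the normalized count (term frequency), each count can be divided by the number of counted tokens in its document. The regular expression mirrors `CountVectorizer`'s default token pattern (lowercased tokens of two or more word characters); the helper name `term_frequencies` is hypothetical and not part of scikit-learn.

```python
import re
from collections import Counter

documents = [
    "Natural Language Processing is amazing!",
    "Machine learning and deep learning are subfields of AI.",
    "Natural Language Processing (NLP) is part of AI and Machine Learning.",
    "Deep learning improves NLP tasks.",
]

# Same default token pattern as CountVectorizer: tokens of 2+ word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def term_frequencies(doc):
    """Return {term: count / total tokens} for one document (hypothetical helper)."""
    tokens = TOKEN_PATTERN.findall(doc.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


tf_rows = [term_frequencies(doc) for doc in documents]
for i, row in enumerate(tf_rows):
    print(i, {term: round(value, 3) for term, value in sorted(row.items())})
```

Each row sums to 1, which makes documents of different lengths comparable. For instance, "learning" occurs twice among the nine tokens of document 1, so its normalized count is 2/9.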


**2. TF-IDF (Term Frequency-Inverse Document Frequency)**
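Weights each term's frequency in a document by its inverse document frequency, so words that appear in many documents are down-weighted.
Terms that are distinctive for a document therefore receive more weight. In the matrix below, "amazing" (found only in document 0) scores higher in document 0 than "is" (found in two documents).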

In [None]:
# 2. TF-IDF (Term Frequency-Inverse Document Frequency)
print("\n--- TF-IDF ---")

tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(documents)

# Convert to DataFrame
tfidf_df = pd.DataFrame(X_tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
print("\nTF-IDF Matrix:\n", tfidf_df)


--- TF-IDF ---

TF-IDF Matrix:
          ai   amazing       and       are      deep  improves        is  \
0  0.000000  0.535566  0.000000  0.000000  0.000000  0.000000  0.422247   
1  0.303739  0.000000  0.303739  0.385254  0.303739  0.000000  0.000000   
2  0.297954  0.000000  0.297954  0.000000  0.000000  0.000000  0.297954   
3  0.000000  0.000000  0.000000  0.000000  0.412640  0.523381  0.000000   

   language  learning   machine   natural       nlp        of      part  \
0  0.422247  0.000000  0.000000  0.422247  0.000000  0.000000  0.000000   
1  0.000000  0.491805  0.303739  0.000000  0.000000  0.303739  0.000000   
2  0.297954  0.241220  0.297954  0.297954  0.297954  0.297954  0.377917   
3  0.000000  0.334067  0.000000  0.000000  0.412640  0.000000  0.000000   

   processing  subfields     tasks  
0    0.422247   0.000000  0.000000  
1    0.000000   0.385254  0.000000  
2    0.297954   0.000000  0.000000  
3    0.000000   0.000000  0.523381  
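
The values above can be reproduced by hand. The sketch below assumes `TfidfVectorizer`'s default settings: raw counts as term frequency, the smoothed inverse document frequency `idf = ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The function name `tfidf_row` is hypothetical.

```python
import math
import re
from collections import Counter

documents = [
    "Natural Language Processing is amazing!",
    "Machine learning and deep learning are subfields of AI.",
    "Natural Language Processing (NLP) is part of AI and Machine Learning.",
    "Deep learning improves NLP tasks.",
]

# Tokenize the same way as the vectorizer defaults: lowercase, 2+ word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")
tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in documents]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}


def tfidf_row(tokens):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    counts = Counter(tokens)
    weights = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


tfidf_rows = [tfidf_row(tokens) for tokens in tokenized]
print(round(tfidf_rows[0]["amazing"], 4))
print(round(tfidf_rows[1]["learning"], 4))
```

Under these assumptions the hand-computed weights agree with the printed matrix, for example roughly 0.5356 for "amazing" in document 0 and 0.4918 for "learning" in document 1.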


In [None]:
# Download the 'punkt_tab' data package (recent NLTK releases need it for word_tokenize)
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

**3. Word2Vec**
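Learns word embeddings with the Continuous Bag of Words (CBOW) architecture, which predicts a word from its surrounding context words. CBOW is gensim's default (`sg=0`).
Each word is mapped to a dense numeric vector (100 dimensions in this example). With only four short sentences, the resulting vectors demonstrate the API but are too lightly trained to capture real semantic similarity.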
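
To make the CBOW idea concrete, the sketch below lists the (context, target) pairs that a fixed window would produce for one tokenized sentence. It is a simplified illustration, not gensim's implementation: the function name `cbow_pairs` is hypothetical, and training details such as negative sampling are omitted.

```python
def cbow_pairs(tokens, window=2):
    """Return (context words, target word) pairs for a fixed window (hypothetical helper)."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]      # up to `window` words before the target
        right = tokens[i + 1:i + 1 + window]     # up to `window` words after the target
        pairs.append((left + right, target))
    return pairs


sentence = ["deep", "learning", "improves", "nlp", "tasks"]
pairs = cbow_pairs(sentence, window=2)
for context, target in pairs:
    print(context, "->", target)
```

In CBOW the context word vectors are combined (averaged by default in gensim) and the model is trained to predict the target word from that combination. The training cell in this notebook uses `window=5`.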

In [None]:
# 3. Word2Vec Embeddings
print("\n--- Word2Vec Embeddings ---")

# Lowercase and tokenize each document (punctuation marks are kept as tokens)
tokenized_corpus = [word_tokenize(doc.lower()) for doc in documents]

# Train Word2Vec model (sg=0 selects CBOW, which is also the default)
word2vec_model = Word2Vec(sentences=tokenized_corpus, vector_size=100, window=5, min_count=1, sg=0, workers=4)

# Get word embedding for a word
word = "learning"
if word in word2vec_model.wv:
    print(f"\nWord Embedding for '{word}':\n", word2vec_model.wv[word])
else:
    print(f"\nWord '{word}' not found in vocabulary.")



--- Word2Vec Embeddings ---

Word Embedding for 'learning':
 [-5.3808355e-04  2.4747057e-04  5.1054261e-03  9.0144686e-03
 -9.2937471e-03 -7.1226158e-03  6.4635528e-03  8.9830793e-03
 -5.0169979e-03 -3.7686056e-03  7.3833433e-03 -1.5425601e-03
 -4.5443131e-03  6.5564695e-03 -4.8607639e-03 -1.8169996e-03
  2.8797004e-03  9.9720282e-04 -8.2858307e-03 -9.4584674e-03
  7.3169568e-03  5.0672246e-03  6.7624357e-03  7.5818895e-04
  6.3456185e-03 -3.4061950e-03 -9.5028954e-04  5.7748272e-03
 -7.5254757e-03 -3.9375424e-03 -7.5109187e-03 -9.3785976e-04
  9.5394310e-03 -7.3277755e-03 -2.3322091e-03 -1.9385585e-03
  8.0853011e-03 -5.9210146e-03  4.6518860e-05 -4.7458964e-03
 -9.5945410e-03  4.9975729e-03 -8.7691769e-03 -4.3799556e-03
 -3.1460335e-05 -2.9525690e-04 -7.6617515e-03  9.6108336e-03
  4.9875192e-03  9.2362370e-03 -8.1496472e-03  4.4931308e-03
 -4.1252617e-03  8.2184700e-04  8.4948484e-03 -4.4612531e-03
  4.5260801e-03 -6.7854030e-03 -3.5504571e-03  9.4027435e-03
 -1.5833589e-03  3.1810