# **_1. Count Vectorizer (Bag of Words)_**

<u>**Concept:**</u>  
Count Vectorizer is the simplest text-representation technique: it converts text into a matrix of word counts, with one row per document and one column per vocabulary word.

<u>**Example:**</u>  
Consider three simple sentences (`corpus_3` in the code below):  
1. "NLP is evolving fast"  
2. "Machine learning is evolving"  
3. "Future of NLP is bright"

In [87]:
corpus_1 = [
    "the cat sat on the mat",
    "the dog sat on the mat",
    "the mat is on the floor"
]

corpus_2 = [
    "artificial intelligence is the future",
    "machine learning is a subset of artificial intelligence",
    "the future of AI is bright"
]

corpus_3 = ["NLP is evolving fast", "Machine learning is evolving", "Future of NLP is bright"]

corpus_4 = [
    "Machine learning improves decision-making in businesses.",
    "Businesses use machine learning for data analysis.",
    "Deep learning and neural networks outperform traditional models."
]

In [88]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Corpus of sentences
corpus = corpus_3
# Convert text into a matrix of word occurrences
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Print results
print("Feature Names:", vectorizer.get_feature_names_out())  # Unique words
print("Word Frequency Matrix:\n", X.toarray())  # Word frequency matrix
print("Cosine Similarity:\n", cosine_similarity(X[0], X[1])[0][0])  # Cosine similarity between sentences 1 and 2

Feature Names: ['bright' 'evolving' 'fast' 'future' 'is' 'learning' 'machine' 'nlp' 'of']
Word Frequency Matrix:
 [[0 1 1 0 1 0 0 1 0]
 [0 1 0 0 1 1 1 0 0]
 [1 0 0 1 1 0 0 1 1]]
Cosine Similarity:
 0.5


<u>**Analysis:**</u>  
- Each row represents a sentence.  
- Each column represents a word.  
- The numbers show how many times a word appears in each sentence.
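
To make the rows and columns concrete, the matrix above can be rebuilt by hand. The sketch below is a simplification that assumes lowercase, whitespace-based tokenization (which happens to give the same result as `CountVectorizer`'s defaults on this corpus). `build_count_matrix` and `cosine` are illustrative helper names, not scikit-learn functions.

```python
import math
from collections import Counter

corpus = ["NLP is evolving fast", "Machine learning is evolving", "Future of NLP is bright"]

def build_count_matrix(docs):
    """Return (sorted vocabulary, one row of word counts per document)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocab])
    return vocab, rows

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

vocab, matrix = build_count_matrix(corpus)
print(vocab)
for row in matrix:
    print(row)
print(cosine(matrix[0], matrix[1]))
```

Sentences 1 and 2 share two words ("is" and "evolving") and each has four distinct words, so the similarity is 2 / (2 × 2) = 0.5, matching the output above.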

<u>**Limitations:**</u>  
- Does not capture meaning: related words such as "fast" and "quick" get unrelated columns.  
- Common words like "is" are weighted the same as informative words like "NLP" or "machine".  
- Ignores word order.
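
The word-order limitation is easy to see with a pair of sentences that mean opposite things. This is a small sketch with made-up sentences (not taken from the corpora above): their word counts are identical, while counting adjacent word pairs (bigrams, which `CountVectorizer` can also produce through its `ngram_range` parameter) tells them apart. The helper names are illustrative.

```python
from collections import Counter

def unigram_counts(sentence):
    """Bag of words: how often each word occurs, order discarded."""
    return Counter(sentence.lower().split())

def bigram_counts(sentence):
    """Counts of adjacent word pairs, which retain some word order."""
    tokens = sentence.lower().split()
    return Counter(zip(tokens, tokens[1:]))

a = "dog bites man"
b = "man bites dog"

print(unigram_counts(a) == unigram_counts(b))  # identical bags of words
print(bigram_counts(a) == bigram_counts(b))    # the word pairs differ
```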

# **_2. TF-IDF (Term Frequency-Inverse Document Frequency)_**

<u>**Concept:**</u>  
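TF-IDF improves on Count Vectorizer by down-weighting words that appear in many documents, so frequent but uninformative words such as "is" contribute less than rarer, more distinctive ones.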

<u>**Example:**</u>  
We apply TF-IDF to the same corpus (`corpus_3`).

In [89]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(X.toarray())  # TF-IDF scores

# Cosine similarity using TF-IDF
print("Cosine Similarity:\n", cosine_similarity(X[0], X[1])[0][0])  # Cosine similarity between sentences 1 and 2

['bright' 'evolving' 'fast' 'future' 'is' 'learning' 'machine' 'nlp' 'of']
[[0.         0.4804584  0.63174505 0.         0.37311881 0.
  0.         0.4804584  0.        ]
 [0.         0.44451431 0.         0.         0.34520502 0.5844829
  0.5844829  0.         0.        ]
 [0.50461134 0.         0.         0.50461134 0.29803159 0.
  0.         0.38376993 0.50461134]]
Cosine Similarity:
 0.34237311738896226
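
The scores above can be reproduced by hand, which shows where the down-weighting comes from. This sketch assumes scikit-learn's default settings (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row) and lowercase whitespace tokenization. `tfidf_rows` is an illustrative helper, not a library function.

```python
import math
from collections import Counter

corpus = ["NLP is evolving fast", "Machine learning is evolving", "Future of NLP is bright"]

def tfidf_rows(docs):
    """Return (sorted vocabulary, one L2-normalized TF-IDF row per document)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({w for tokens in tokenized for w in tokens})
    n = len(docs)
    # Document frequency: number of documents that contain each word
    df = {w: sum(w in tokens for tokens in tokenized) for w in vocab}
    # Smoothed inverse document frequency
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in raw))
        rows.append([x / norm for x in raw])
    return vocab, rows

vocab, rows = tfidf_rows(corpus)
first = dict(zip(vocab, rows[0]))
print(round(first["is"], 4), round(first["nlp"], 4), round(first["fast"], 4))

# Rows are unit length, so their dot product is the cosine similarity
print(sum(a * b for a, b in zip(rows[0], rows[1])))
```

"is" appears in all three sentences, so its IDF takes the minimum value of 1 and it ends up with the lowest weight in the first row; "fast" appears in only one sentence and gets the highest.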


# **_3. Word2Vec_**

In [92]:
from gensim.models import Word2Vec

# Tokenize the corpus (lowercase each sentence and split it into words)
tokenized_corpus = [sentence.lower().split() for sentence in corpus]

# Train Word2Vec model (three sentences are far too few for meaningful
# vectors; this cell only demonstrates the API)
model = Word2Vec(tokenized_corpus, vector_size=5, window=2, min_count=1, workers=4)

# Get the word embedding for "nlp"
print("Word embedding for 'nlp':", model.wv["nlp"])

# Find words most similar to "evolving"
print("Words most similar to 'evolving':", model.wv.most_similar("evolving"))

Word embedding for 'nlp': [ 0.1476101  -0.03066943 -0.09073226  0.13108103 -0.09720321]
Words most similar to 'evolving': [('learning', 0.8512495160102844), ('future', 0.774187445640564), ('bright', 0.6178626418113708), ('fast', 0.20796996355056763), ('is', 0.2018294483423233), ('machine', 0.18539191782474518), ('of', 0.07266710698604584), ('nlp', -0.6734715700149536)]


In [91]:
tokenized_corpus


[['nlp', 'is', 'evolving', 'fast'],
 ['machine', 'learning', 'is', 'evolving'],
 ['future', 'of', 'nlp', 'is', 'bright']]

In [110]:
from gensim.models import Word2Vec
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import nltk

# Download necessary NLTK data
# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('wordnet')

# Load the text of "Alice's Adventures in Wonderland"
with open("alice_in_wonderland.txt", "r", encoding="utf-8") as file:
    alice_text = file.read()

# Preprocess the text: tokenize sentences, remove stop words, and lemmatize
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
sentences = sent_tokenize(alice_text)
tokenized_sentences = [
    [lemmatizer.lemmatize(word.lower()) for word in word_tokenize(sentence) if word.isalnum() and word.lower() not in stop_words]
    for sentence in sentences
]

# Train Word2Vec model with improved parameters
word2vec_model = Word2Vec(
    sentences=tokenized_sentences,
    vector_size=200,  # Increased vector size
    window=15,        # Larger context window
    min_count=5,      # Ignore rare words
    workers=4,
    sg=1,             # Use Skip-Gram model
    negative=10,      # Negative sampling
    epochs=10000      # Train for more epochs
)

# Save the model for later use
word2vec_model.save("alice_word2vec_optimized.model")

# Example: Get the word embedding for "alice"
if "alice" in word2vec_model.wv:
    print("Word embedding for 'alice':", word2vec_model.wv["alice"])

# Example: Find words most similar to "queen"
if "queen" in word2vec_model.wv:
    print("Words most similar to 'queen':", word2vec_model.wv.most_similar("queen"))

Word embedding for 'alice': [ 1.90299869e-01 -2.04534546e-01 -3.73386443e-01 -7.34372884e-02
  6.52075931e-02  1.18548110e-01  1.10937469e-01  3.34335774e-01
 -3.95782024e-01 -2.06006467e-01 -6.18554689e-02  1.98676437e-02
  2.47976258e-02  1.40431687e-01 -9.42094177e-02  9.02730078e-02
 -1.98675245e-01  3.22468020e-02 -4.22513157e-01 -1.01513542e-01
  1.04099557e-01 -2.36585543e-01  1.78225487e-01 -1.21879496e-01
  1.98791564e-01 -2.77300805e-01  2.91333228e-01 -1.60437614e-01
 -1.03365280e-01 -2.38217771e-01 -8.74060467e-02  1.17918357e-01
  6.23091795e-02  8.25336110e-03  5.44182807e-02 -2.06720367e-01
 -1.38412982e-01  3.46360624e-01 -9.96583030e-02  3.28844965e-01
  1.42124772e-01 -3.15592848e-02  3.00972790e-01  1.73621457e-02
  2.12500930e-01  1.43602759e-01  1.71818528e-02  1.51198423e-02
  8.48782659e-02  4.61562350e-03 -1.49530187e-01  2.27833316e-01
 -8.44580755e-02  1.01157874e-01 -2.52896100e-01 -1.50145769e-01
 -3.90410364e-01  1.01244040e-02 -8.20528939e-02 -4.28183489e-

In [111]:
# Example: Calculate cosine similarity between two words
try:
    word1 = "alice"
    word2 = "queen"
    # Use a distinct name so the imported sklearn cosine_similarity function is not overwritten
    similarity = word2vec_model.wv.similarity(word1, word2)
    print(f"Cosine similarity between '{word1}' and '{word2}': {similarity}")
except KeyError as e:
    print(f"Error: {e}. One of the words is not in the vocabulary.")

Cosine similarity between 'alice' and 'queen': 0.061967842280864716


In [124]:
word2vec_model.wv.most_similar("alice")

[('said', 0.25662288069725037),
 ('executed', 0.2416895627975464),
 ('join', 0.21482664346694946),
 ('asking', 0.20554569363594055),
 ('fetch', 0.1829044669866562),
 ('unimportant', 0.18088111281394958),
 ('evening', 0.17958509922027588),
 ('repeat', 0.17934490740299225),
 ('oop', 0.17799676954746246),
 ('would', 0.17165294289588928)]

In [112]:
# Python program to generate word vectors using Word2Vec
 
# importing all necessary modules
from gensim.models import Word2Vec
import gensim
from nltk.tokenize import sent_tokenize, word_tokenize
import warnings
 
warnings.filterwarnings(action='ignore')
 
 
# Read the text of "Alice's Adventures in Wonderland"
with open("alice_in_wonderland.txt", "r", encoding="utf-8") as sample:
    s = sample.read()
 
# Replace newline characters with spaces
f = s.replace("\n", " ")
 
data = []
 
# iterate through each sentence in the file
for i in sent_tokenize(f):
    temp = []
 
    # tokenize the sentence into words
    for j in word_tokenize(i):
        temp.append(j.lower())
 
    data.append(temp)
 
# Create CBOW model
model1 = gensim.models.Word2Vec(data, min_count=1,
                                vector_size=100, window=5)
 
# Print results
print("Cosine similarity between 'alice' " +
      "and 'wonderland' - CBOW : ",
      model1.wv.similarity('alice', 'wonderland'))
 
print("Cosine similarity between 'alice' " +
      "and 'machines' - CBOW : ",
      model1.wv.similarity('alice', 'machines'))
 
# Create Skip Gram model
model2 = gensim.models.Word2Vec(data, min_count=1, vector_size=100,
                                window=5, sg=1)
 
# Print results
print("Cosine similarity between 'alice' " +
      "and 'wonderland' - Skip Gram : ",
      model2.wv.similarity('alice', 'wonderland'))
 
print("Cosine similarity between 'alice' " +
      "and 'machines' - Skip Gram : ",
      model2.wv.similarity('alice', 'machines'))

Cosine similarity between 'alice' and 'wonderland' - CBOW :  0.99473584
Cosine similarity between 'alice' and 'machines' - CBOW :  0.9594937
Cosine similarity between 'alice' and 'wonderland' - Skip Gram :  0.8946934
Cosine similarity between 'alice' and 'machines' - Skip Gram :  0.8759582
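
The `sg` flag switches between the two training objectives: CBOW (the default, `sg=0`) predicts a word from its surrounding context, while Skip-Gram (`sg=1`) predicts each context word from the center word. The sketch below only enumerates the training examples each mode would draw from one sentence for a given `window`. It is a simplification that leaves out gensim's actual training loop (negative sampling, subsampling, and so on), and `context_indices`, `skipgram_pairs`, and `cbow_examples` are illustrative names.

```python
def context_indices(position, length, window):
    """Indices within `window` words of `position`, excluding the position itself."""
    start = max(0, position - window)
    end = min(length, position + window + 1)
    return [i for i in range(start, end) if i != position]

def skipgram_pairs(tokens, window):
    """(center, context) pairs: one training example per neighboring word."""
    return [
        (tokens[i], tokens[j])
        for i in range(len(tokens))
        for j in context_indices(i, len(tokens), window)
    ]

def cbow_examples(tokens, window):
    """(context words, center) examples: one training example per position."""
    return [
        ([tokens[j] for j in context_indices(i, len(tokens), window)], tokens[i])
        for i in range(len(tokens))
    ]

sentence = ["nlp", "is", "evolving", "fast"]
for pair in skipgram_pairs(sentence, window=1):
    print("skip-gram:", pair)
for example in cbow_examples(sentence, window=1):
    print("cbow:", example)
```

Skip-Gram yields more examples from the same sentence (one per word pair rather than one per position), which is one reason it is often preferred for small corpora, at the cost of slower training.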


In [117]:
model2.wv.similarity('king', 'queen')

0.97924376

In [125]:
model2.wv.most_similar("alice")

[('thought', 0.9901615977287292),
 (',', 0.9874187111854553),
 ('herself', 0.9844238758087158),
 (';', 0.982392430305481),
 ('so', 0.979295551776886),
 ('it', 0.9789102077484131),
 ('hatter', 0.9743627905845642),
 ('then', 0.9742511510848999),
 ('but', 0.9740870594978333),
 ('very', 0.9684762358665466)]

In [126]:
from gensim.test.utils import common_texts
from gensim.models import Word2Vec

In [127]:
model = Word2Vec(sentences=common_texts, vector_size=100, window=5, min_count=1, workers=4)

In [128]:
common_texts

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

In [129]:
import gensim.downloader

In [130]:
glove_vectors = gensim.downloader.load('glove-twitter-25')



In [139]:
glove_vectors.most_similar('fast')

[('hard', 0.9070873856544495),
 ('fall', 0.8905317187309265),
 ('work', 0.883845329284668),
 ('get', 0.882828950881958),
 ('quick', 0.8828085064888),
 ('easy', 0.8759028911590576),
 ('yet', 0.8758782744407654),
 ('clean', 0.8740941882133484),
 ('working', 0.8733303546905518),
 ('out', 0.8721160888671875)]

In [138]:
glove_vectors.most_similar('actress')

[('singer', 0.9319669604301453),
 ('comedian', 0.8841865658760071),
 ('bollywood', 0.8836835622787476),
 ('superstar', 0.8688210844993591),
 ('musician', 0.863633930683136),
 ('poet', 0.8499681949615479),
 ('vocalist', 0.8467605710029602),
 ('entertainer', 0.8449896574020386),
 ('actor', 0.8414555788040161),
 ('celebrity', 0.8391618132591248)]

In [149]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

# Sample dataset
emails = ["Win a free iPhone now", "Meeting scheduled at 10 AM", "Congratulations, you won!", "Please find the report attached"]
labels = [1, 0, 1, 0]  # 1 = Spam, 0 = Not Spam

# Convert text to vectors
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

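# Split into train and test sets (the index range is split alongside the data
# so that test rows can be mapped back to their original text)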
X_train, X_test, y_train, y_test, train_indices, test_indices = train_test_split(
    X, labels, range(len(emails)), test_size=0.25, random_state=67
)

# Get the original text for X_test
X_test_text = [emails[i] for i in test_indices]

# Train the classifier
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

# Predict on test data
predictions = classifier.predict(X_test)
print("Predictions:", predictions)
print("X_test Text:", X_test_text)

Predictions: [1]
X_test Text: ['Congratulations, you won!']


In [151]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Sample dataset
reviews = ["The movie was absolutely fantastic", "The plot was terrible and boring", "Loved the cinematography", "Worst film ever"]
labels = [1, 0, 1, 0]  # 1 = Positive, 0 = Negative

# Convert text to vectors
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(reviews)

# Train classifier
classifier = LogisticRegression()
classifier.fit(X, labels)

# Predict on new data
print(classifier.predict(vectorizer.transform(["The film was worst"])))


[0]


| Feature Extraction | Captures Meaning? | Handles Rare Words? | Works for Large Text?                  | Best For                              |
|--------------------|-------------------|---------------------|----------------------------------------|---------------------------------------|
| Count Vectorizer   | ❌ No             | ❌ No               | ✅ Yes                                 | Simple text classification            |
| TF-IDF             | ❌ No             | ✅ Yes              | ✅ Yes                                 | News classification, document ranking |
| Word2Vec           | ✅ Yes            | ✅ Yes              | ✅ Yes (needs a large training corpus) | Sentiment analysis, chatbots          |
In [152]:
from gensim.models import Word2Vec
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Sample dataset
sentences = [
    ["government", "passes", "economic", "reforms"],
    ["football", "team", "wins", "championship"],
    ["new", "phone", "features", "AI", "camera"],
]
labels = [0, 1, 2]  # 0 = Politics, 1 = Sports, 2 = Technology

# Train Word2Vec model
w2v_model = Word2Vec(sentences, vector_size=10, window=5, min_count=1, sg=1)

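# Convert sentences to vectorized form (mean of word embeddings)
def get_sentence_embedding(sentence):
    # Keep only in-vocabulary words; fall back to a zero vector if none remain
    vectors = [w2v_model.wv[word] for word in sentence if word in w2v_model.wv]
    if not vectors:
        return np.zeros(w2v_model.vector_size)
    return np.mean(vectors, axis=0)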

X = np.array([get_sentence_embedding(sent) for sent in sentences])

# Train classifier
classifier = RandomForestClassifier()
classifier.fit(X, labels)

# Predict on new article
new_article = ["team", "scores", "goal"]
new_embedding = get_sentence_embedding(new_article)
print(classifier.predict([new_embedding]))


[1]
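
Averaging word vectors, as `get_sentence_embedding` does, silently drops words the model has never seen. In the prediction above only "team" is in the training vocabulary, so the sentence vector is just the vector for "team". The sketch below uses a tiny hand-written embedding table (hypothetical values, not from a trained model) to show the effect; `mean_embedding` and `toy_vectors` are illustrative names.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, for illustration only
toy_vectors = {
    "team": np.array([1.0, 0.0, 2.0]),
    "wins": np.array([3.0, 2.0, 0.0]),
}

def mean_embedding(tokens, vectors, dim=3):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

print(mean_embedding(["team", "wins"], toy_vectors))            # average of both vectors
print(mean_embedding(["team", "scores", "goal"], toy_vectors))  # only "team" is known
print(mean_embedding(["scores", "goal"], toy_vectors))          # nothing known: zeros
```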
