
# **Import required libraries**

In [None]:
!pip install gensim
!pip install matplotlib

Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
Installing collected packages: gensim
Successfully installed gensim-4.4.0


In [None]:
import gensim
from gensim.models import KeyedVectors
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt

# **Load Pre-trained Word2Vec (Google News)**

In [None]:
import gensim.downloader as api
from gensim.models import KeyedVectors

# Load pre-trained Word2Vec model (may take time on first download)
model = api.load("word2vec-google-news-300")

# Print vocabulary size
print("Vocabulary Size:", len(model.key_to_index))

# Display vector for a sample word
word = "apple"
vector = model[word]

print("\nWord:", word)
print("Vector length:", len(vector))
print("First 10 values of the vector:\n", vector[:10])

Vocabulary Size: 3000000

Word: apple
Vector length: 300
First 10 values of the vector:
 [-0.06445312 -0.16015625 -0.01208496  0.13476562 -0.22949219  0.16210938
  0.3046875  -0.1796875  -0.12109375  0.25390625]


# **Load Pre-trained GloVe**

In [None]:
import gensim.downloader as api

# Load GloVe embeddings (100-dimensional)
model = api.load("glove-wiki-gigaword-100")

# Print vocabulary size
print("Vocabulary Size:", len(model.key_to_index))

# Display vector for a sample word
word = "machine"
vector = model[word]

print("\nWord:", word)
print("Vector length:", len(vector))
print("First 10 values of the vector:\n", vector[:10])

Vocabulary Size: 400000

Word: machine
Vector length: 100
First 10 values of the vector:
 [-0.65365  0.49419 -0.26245 -0.20722 -0.11413  0.35701  1.0454   0.21881
  0.52769  0.60606]


# **Explore Word Similarity**

In [None]:
import gensim.downloader as api

# Load pre-trained GloVe model (100D)
model = api.load("glove-wiki-gigaword-100")

# Define word pairs
word_pairs = [
    ('computer','laptop'),
    ('mobile','phone'),
    ('city','village'),
    ('engineer','scientist'),
    ('hospital','clinic'),
    ('school','university'),
    ('car','road'),
    ('music','song'),
    ('rain','cloud'),
    ('earth','planet'),
    ('teacher','professor'),
    ('india','china'),
    ('football','cricket'),
    ('river','lake'),
    ('happy','joy'),
    ('sad','cry'),
    ('bird','airplane'),
    ('mountain','hill'),
    ('book','library'),
    ('market','shop')
]

print("Word Similarity Scores:\n")

for w1, w2 in word_pairs:
    similarity = model.similarity(w1, w2)
    print(f"{w1} - {w2} : {similarity:.4f}")


Word Similarity Scores:

computer - laptop : 0.7024
mobile - phone : 0.7307
city - village : 0.6327
engineer - scientist : 0.6081
hospital - clinic : 0.8140
school - university : 0.7548
car - road : 0.5374
music - song : 0.7319
rain - cloud : 0.5149
earth - planet : 0.8551
teacher - professor : 0.6188
india - china : 0.5997
football - cricket : 0.6660
river - lake : 0.7426
happy - joy : 0.5189
sad - cry : 0.5766
bird - airplane : 0.2634
mountain - hill : 0.6858
book - library : 0.5616
market - shop : 0.4741


# **Nearest Neighbor Exploration**

In [None]:
import gensim.downloader as api

# Load pre-trained GloVe embeddings (100D)
model = api.load("glove-wiki-gigaword-100")

# Choose at least 5 words; they are written in lowercase because this GloVe vocabulary is lowercased
chosen_words = ["india", "computer", "nlp", "machine", "ai"]

for word in chosen_words:
    print(f"\nTop similar words for '{word}':\n")

    similar_words = model.most_similar(word, topn=5)

    for similar_word, score in similar_words:
        print(f"{similar_word} : {score:.4f}")


Top similar words for 'india':

pakistan : 0.8370
indian : 0.7802
delhi : 0.7712
bangladesh : 0.7662
lanka : 0.7639

Top similar words for 'computer':

computers : 0.8752
software : 0.8373
technology : 0.7642
pc : 0.7366
hardware : 0.7290

Top similar words for 'nlp':

hagelin : 0.6065
oth : 0.5276
grn : 0.5221
lib : 0.5205
inp : 0.4841

Top similar words for 'machine':

machines : 0.7854
device : 0.6773
equipment : 0.6412
gun : 0.6409
guns : 0.6362

Top similar words for 'ai':

hey : 0.6296
sugiyama : 0.6117
gonna : 0.5951
ya : 0.5950
fukuhara : 0.5866


# **Word Analogy Tasks**

In [None]:
import gensim.downloader as api

# Load pre-trained Word2Vec (Google News, 300-dimensional)
model = api.load("word2vec-google-news-300")

# Analogy 1: doctor - hospital + school (hospital is to doctor as school is to ?)
result1 = model.most_similar(
    positive=["doctor", "school"],
    negative=["hospital"],
    topn=5
)

# Analogy 2: bigger - big + small (big is to bigger as small is to ?)
result2 = model.most_similar(
    positive=["bigger", "small"],
    negative=["big"],
    topn=5
)

# Analogy 3: mother - woman + man (woman is to mother as man is to ?)
result3 = model.most_similar(
    positive=["mother", "man"],
    negative=["woman"],
    topn=5
)


print(result1)


print(result2)


print(result3)


[('guidance_counselor', 0.5969594717025757), ('teacher', 0.5755364298820496), ('eighth_grade', 0.5226408243179321), ('schoolers', 0.5168290138244629), ('elementary', 0.5085657238960266)]
[('larger', 0.7402471899986267), ('smaller', 0.7329993844032288), ('tiny', 0.5698219537734985), ('tinier', 0.543969452381134), ('large', 0.5191665887832642)]
[('father', 0.8097569346427917), ('son', 0.7835153937339783), ('uncle', 0.7502947449684143), ('dad', 0.7315300703048706), ('brother', 0.7288522124290466)]


# **Word2Vec Similarity**

In [None]:
import gensim.downloader as api

# Load Word2Vec
word2vec = api.load("word2vec-google-news-300")

pairs = [
    ('doctor','nurse'),
    ('cat','dog'),
    ('car','bus'),
    ('king','queen'),
    ('teacher','student'),
    ('man','woman'),
    ('apple','orange'),
    ('university','college'),
    ('river','water'),
    ('sun','moon')
]

print("Word2Vec Similarity Scores:\n")

for w1, w2 in pairs:
    print(w1, "-", w2, ":", round(word2vec.similarity(w1,w2),4))

Word2Vec Similarity Scores:

doctor - nurse : 0.632
cat - dog : 0.7609
car - bus : 0.4693
king - queen : 0.6511
teacher - student : 0.6301
man - woman : 0.7664
apple - orange : 0.392
university - college : 0.6385
river - water : 0.5769
sun - moon : 0.4263


# **Word2Vec Neighbours**

In [None]:
words = ["king", "doctor", "car", "music", "university"]

print("Word2Vec Nearest Neighbours:\n")

for word in words:
    print(f"\nTop similar words for '{word}':\n")
    similar_words = word2vec.most_similar(word, topn=5)
    for sim_word, score in similar_words:
        print(sim_word, ":", round(score,4))

Word2Vec Nearest Neighbours:


Top similar words for 'king':

kings : 0.7138
queen : 0.6511
monarch : 0.6413
crown_prince : 0.6204
prince : 0.616

Top similar words for 'doctor':

physician : 0.7806
doctors : 0.7477
gynecologist : 0.6948
surgeon : 0.6793
dentist : 0.6785

Top similar words for 'car':

vehicle : 0.7821
cars : 0.7424
SUV : 0.7161
minivan : 0.6907
truck : 0.6736

Top similar words for 'music':

classical_music : 0.7198
jazz : 0.6835
Music : 0.6596
Without_Donny_Kirshner : 0.6416
songs : 0.6396

Top similar words for 'university':

universities : 0.7004
faculty : 0.6781
unversity : 0.6758
undergraduate : 0.6587
univeristy : 0.6585


# **Word2Vec Analogy**

In [None]:
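# Note: the Google News vectors are case-sensitive (the outputs contain both 'hospital' and 'Hospital'),
# so these lowercase queries may not match the most common spelling of proper nouns.
print("Word2Vec Analogy Results:\n")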

print("king - man + woman =")
print(word2vec.most_similar(positive=['king','woman'], negative=['man'], topn=5))

print("\nparis - france + india =")
print(word2vec.most_similar(positive=['paris','india'], negative=['france'], topn=5))

print("\nteacher - school + hospital =")
print(word2vec.most_similar(positive=['teacher','hospital'], negative=['school'], topn=5))

Word2Vec Analogy Results:

king - man + woman =
[('queen', 0.7118193507194519), ('monarch', 0.6189674139022827), ('princess', 0.5902431011199951), ('crown_prince', 0.5499460697174072), ('prince', 0.5377321839332581)]

paris - france + india =
[('chennai', 0.5442505478858948), ('delhi', 0.5149926543235779), ('mumbai', 0.5024341344833374), ('hyderabad', 0.49932485818862915), ('gujarat', 0.48732805252075195)]

teacher - school + hospital =
[('Hospital', 0.6331106424331665), ('nurse', 0.6280134320259094), ('hopsital', 0.6217317581176758), ('intensive_care', 0.5683753490447998), ('Hosptial', 0.5647749304771423)]


# **GloVe Analogy**

In [None]:
# Load GloVe
glove = api.load("glove-wiki-gigaword-100")

print("GloVe Analogy Results:\n")

print("king - man + woman =")
print(glove.most_similar(positive=['king','woman'], negative=['man'], topn=5))

print("\nparis - france + india =")
print(glove.most_similar(positive=['paris','india'], negative=['france'], topn=5))

print("\nteacher - school + hospital =")
print(glove.most_similar(positive=['teacher','hospital'], negative=['school'], topn=5))

GloVe Analogy Results:

king - man + woman =
[('queen', 0.7698540687561035), ('monarch', 0.6843381524085999), ('throne', 0.6755736470222473), ('daughter', 0.6594556570053101), ('princess', 0.6520534157752991)]

paris - france + india =
[('delhi', 0.8654932975769043), ('mumbai', 0.7718895077705383), ('bombay', 0.7222235798835754), ('dhaka', 0.6891742944717407), ('calcutta', 0.6761991381645203)]

teacher - school + hospital =
[('nurse', 0.7798740267753601), ('doctor', 0.76134192943573), ('patient', 0.6908571124076843), ('physician', 0.6851393580436707), ('hospitalized', 0.6718847751617432)]


# **STEP 8: Reflection and Interpretation**

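In this lab, I learned how Word2Vec and GloVe represent words as numerical vectors whose geometry reflects word meaning. Related words such as doctor and nurse had a fairly high similarity score (0.632 with Word2Vec), while a loosely related pair such as bird and airplane scored only 0.2634 with GloVe. The nearest neighbour task showed that related words lie close together in vector space, although rare or ambiguous tokens such as nlp and ai returned unrelated neighbours in GloVe. In the analogy task, king – man + woman gave queen in both models, which shows that the vectors capture the relationship between the word pairs. On the other two analogies, GloVe's top results (delhi, nurse) were closer to the expected answers than Word2Vec's (chennai, Hospital). The Word2Vec vectors are case-sensitive and the queries were lowercase, which may have hurt its results.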
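
The analogy arithmetic can be illustrated without the pre-trained models. The sketch below is not gensim's implementation; it only mirrors the general idea of adding and subtracting unit-length vectors and ranking the remaining words by cosine similarity. The 2-dimensional `toy_vectors` are made-up values chosen for illustration (first axis roughly "royalty", second axis roughly "gender"), and the helper names `unit` and `analogy` are hypothetical.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def analogy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for sign, words in ((1.0, positive), (-1.0, negative)):
        for word in words:
            for k, x in enumerate(unit(vectors[word])):
                query[k] += sign * x
    query = unit(query)
    # The query words themselves are excluded from the candidates.
    exclude = set(positive) | set(negative)
    scored = [
        (word, sum(q * x for q, x in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up toy vectors (assumption, not taken from Word2Vec or GloVe).
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.1, 1.0],
    "woman": [0.1, -1.0],
    "prince": [0.8, 1.0],
    "apple": [-1.0, 0.0],
}

print(analogy(toy_vectors, positive=["king", "woman"], negative=["man"]))
```

With these toy values the top-ranked word is queen, because subtracting man and adding woman flips the second axis while leaving the first axis almost unchanged.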

# **STEP 9: Lab Report**

**Title**

Word2Vec and GloVe Word Embedding Analysis

**Objective**

To understand and analyze Word2Vec and GloVe models using similarity, neighbours, and analogy tasks.

**Introduction**

Word embeddings represent words as numerical vectors. Similar words are placed close together in vector space.

**Model Description**

Word2Vec trains a shallow neural network to predict words from their local context, using either the CBOW or the Skip-Gram architecture.
GloVe uses global word co-occurrence statistics to generate vectors.
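
The difference between the two inputs can be sketched on a toy sentence. This is only an illustration of the training data each approach starts from, not a training loop: the sentence, the window size of 2, and the helper `context_windows` are assumptions made for this example. Real GloVe also weights co-occurrence counts by distance and then fits vectors to them; only the raw counting is shown here.

```python
from collections import Counter

def context_windows(tokens, window=2):
    """Yield (target, context_words) for every position in the token list."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

# Toy sentence (assumption, chosen for illustration only).
tokens = "the doctor works at the hospital".split()

# Skip-Gram training pairs: (input word, one context word to predict).
skipgram_pairs = [(t, c) for t, ctx in context_windows(tokens) for c in ctx]

# CBOW training pairs: (all context words, target word to predict).
cbow_pairs = [(ctx, t) for t, ctx in context_windows(tokens)]

# GloVe-style global statistics: how often each (word, context word) pair occurs.
cooccurrence = Counter(skipgram_pairs)

print(skipgram_pairs[:4])
print(cbow_pairs[1])
print(cooccurrence[("the", "works")])
```

Word2Vec consumes pairs like these one at a time during training, whereas GloVe first aggregates them into counts over the whole corpus.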

**Results**

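Similarity scores were highest for closely related words: with GloVe they ranged from 0.2634 (bird - airplane) to 0.8551 (earth - planet), and with Word2Vec from 0.392 (apple - orange) to 0.7664 (man - woman).
Nearest neighbour results were meaningful for common words such as india, computer, and king, but not for nlp and ai in GloVe.
Analogy tasks like king – man + woman gave queen in both models.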

**Conclusion**

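Both models capture semantic relationships effectively. In the three analogies tested with both models, each returned queen for king – man + woman, and GloVe gave the more accurate top results for the other two (delhi and nurse, versus chennai and Hospital from Word2Vec).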
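
One rough way to put a number on an analogy comparison is top-1 accuracy: the fraction of analogies whose first result is an accepted answer. The sketch below copies the top-1 results from the outputs recorded above. The `expected` answers are my own assumption and were not part of the lab, and the function name `top1_accuracy` is hypothetical. Three analogies are far too few for a reliable comparison.

```python
# Top-1 predictions copied from the analogy outputs recorded in this notebook.
top1 = {
    "Word2Vec": {
        "king - man + woman": "queen",
        "paris - france + india": "chennai",
        "teacher - school + hospital": "Hospital",
    },
    "GloVe": {
        "king - man + woman": "queen",
        "paris - france + india": "delhi",
        "teacher - school + hospital": "nurse",
    },
}

# Accepted answers per analogy (assumption, not given in the lab).
expected = {
    "king - man + woman": {"queen"},
    "paris - france + india": {"delhi"},
    "teacher - school + hospital": {"doctor", "nurse"},
}

def top1_accuracy(predictions, expected):
    """Fraction of analogies whose top prediction is an accepted answer."""
    hits = sum(predictions[query] in answers for query, answers in expected.items())
    return hits / len(expected)

for name, predictions in top1.items():
    print(name, round(top1_accuracy(predictions, expected), 2))
```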

# **7 Theory Answers (Very Short)**

**Word Embedding:** Representation of words as numerical vectors.

**Word2Vec:** Neural network model to generate word vectors.

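**CBOW:** Predicts the target word from its surrounding context words.

**Skip-Gram:** Predicts the surrounding context words from the target word.

**GloVe:** Uses global word co-occurrence statistics.

**Cosine Similarity:** Measures the similarity between two vectors as the cosine of the angle between them.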

**Application:** Used in chatbots, translation, sentiment analysis.
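
Cosine similarity can be computed by hand as the dot product divided by the product of the two vector lengths. The sketch below uses plain Python and made-up 2-dimensional vectors, so it runs without the pre-trained models. The function name `cosine_sim` is hypothetical, chosen to avoid shadowing the `cosine_similarity` imported from scikit-learn earlier.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors for illustration.
print(cosine_sim([1.0, 2.0], [2.0, 4.0]))   # same direction, different length
print(cosine_sim([1.0, 0.0], [0.0, 1.0]))   # perpendicular
print(cosine_sim([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Because the lengths are divided out, only the direction of the vectors matters, which is why two words can score close to 1 even when their vectors differ in magnitude.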