<a href="https://colab.research.google.com/github/Bosy-Ayman/IR/blob/main/Assignment_(3)_Term_Representation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Objective
The objective of this assignment is to gain practical experience implementing popular word embedding models for information retrieval: Skip-gram, CBOW, and GloVe.


# Assignment Requirements
**a. Skip-gram / CBOW Model using the Gensim Library:**

Implement the training algorithm (Skip-gram model) using a large corpus of text data of your choice.


Experiment with different hyperparameters and analyze their impact on the quality of the word embeddings.

Compare the performance and characteristics of the CBOW model with the Skip-gram model.

**b. GloVe Model using the Flair Library:**

Choose either the word "rose" or "tie" and write two sentences that use the same word with different meanings.

Use GloVe to get the word embeddings, then check the similarity between the GloVe embeddings of the shared word in the two sentences.

In [69]:
#install FLAIR
!pip install flair



In [70]:
import pandas as pd
from gensim.models import Word2Vec
from scipy import spatial

In [71]:
from flair.data import Sentence # represent a sentence
from flair.embeddings import WordEmbeddings

from termcolor import colored #add color to text output

# initialize embedding by specifying which model we want to use
glove_embedding = WordEmbeddings('glove')

# a. Skip-gram / CBOW  Model using Gensim Library

In [72]:
df = pd.read_csv('/content/twitter_dataset (2).csv')
df.head()

Unnamed: 0,Tweet_ID,Text
0,1,Party least receive say or single. Prevent prevent husband affect. May himself cup style evening protect. Effect another themselves stage perform....
1,2,Hotel still Congress may member staff. Media draw buy fly. Identify on another turn minute would.\nLocal subject way believe which question some m...
2,3,Nice be her debate industry that year. Film where generation push discover partner level.\nNearly money store style may enjoy. Kid discuss blue sa...
3,4,Laugh explain situation career occur serious. Five particular important size.\nCatch continue east teach dark discussion spring. Then candidate fi...
4,5,Involve sense former often approach government. While season family term close do number. Cost through second image indeed.\nProduction thousand w...


In [73]:
# Wrap each tweet in a Flair Sentence (not used by the Word2Vec training below)
sentence_list = [Sentence(text) for text in df['Text']]


In [74]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import nltk

# Download the tokenizer model and the stop-word list used below
nltk.download('punkt')
nltk.download('stopwords')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [75]:
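# Lowercase and tokenize each tweet, then drop English stop words.
# Punctuation tokens such as '.' are kept, so they can appear in similarity results.
sentences = [word_tokenize(tweet.lower()) for tweet in df['Text']]
stop_words = set(stopwords.words('english'))
sentences = [[word for word in sentence if word not in stop_words] for sentence in sentences]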

model = Word2Vec(sentences=sentences,
                 sg=1,              # Indicates the Skip-gram model
                 vector_size=100,   # Each word will be represented as a vector of 100 dimensions
                 window=2,          # Size of the context window (model will consider two words to the left/right of the target word)
                 min_count=1,       # Minimum frequency count of words required to be included in the vocabulary
                 workers=4,         # Utilizing multiple CPU cores
                 epochs=20)         # Number of iterations over the corpus (epochs)

word_embeddings = model.wv
print(word_embeddings['party'])


[-0.22683775  0.2182583   0.2926825   0.5055522  -0.08856519  0.0874006
  0.12288462 -0.1399819  -0.28057373 -0.02035831 -0.15872064 -0.46922153
  0.18242688  0.09030382 -0.12382513 -0.01707836  0.05956579 -0.16294993
 -0.09502144 -0.49716672  0.43930364  0.34270528  0.14791672 -0.23497978
 -0.3049218  -0.0542582  -0.42367712  0.02894799  0.31370208  0.14484785
 -0.07302731  0.08523177  0.07438986 -0.09783047 -0.19768862 -0.03004237
  0.08894357 -0.07823889 -0.0091505  -0.1841218  -0.48894605  0.12340184
 -0.39107838  0.12193912  0.1640068  -0.363533   -0.02346594 -0.21154469
  0.00852326 -0.12792711 -0.17689666  0.14973143 -0.11548445 -0.12964123
 -0.23888247  0.01175362 -0.20547694 -0.01734245  0.24121162 -0.24950263
  0.13978246 -0.12203284 -0.34663266  0.13930176 -0.37118828  0.08714594
  0.43834853  0.27831665 -0.12842563  0.11857757  0.09859692 -0.11854866
  0.30077752 -0.28461844 -0.03670403  0.08141619  0.12774009  0.05200681
 -0.14563595  0.4473571   0.07837964 -0.26990706  0.

In [76]:
skip_gram_model = Word2Vec(sentences=sentences,
                           sg=1,              # Indicates the Skip-gram model
                           vector_size=100,   # Each word will be represented as a vector of 100 dimensions
                           window=5,          # Size of the context window (model will consider five words to the left/right of the target word)
                           min_count=0,       # Lower minimum frequency count to include more words in the vocabulary
                           workers=4,         # Utilizing multiple CPU cores
                           epochs=20)         # Number of iterations over the corpus (epochs)
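
The `sg` and `window` settings above decide which training examples the model sees. As a rough illustration only (plain Python, not Gensim's internal code), the sketch below builds Skip-gram pairs and CBOW examples from a toy token list. The helper names `skipgram_pairs` and `cbow_examples` and the token list are assumptions made for this example. Gensim also randomly shrinks the effective window for each target word by default, which this sketch ignores.

```python
def skipgram_pairs(tokens, window):
    """Return (target, context) pairs: each word predicts each neighbor."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def cbow_examples(tokens, window):
    """Return (context_words, target) examples: neighbors jointly predict the word."""
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples


# Toy token list (hypothetical, loosely based on one tweet in the dataset)
tokens = ["hotel", "still", "congress", "may", "member"]
sg = skipgram_pairs(tokens, window=2)
cb = cbow_examples(tokens, window=2)

print(len(sg), sg[:3])
print(cb[2])
```

Skip-gram produces one example per (word, neighbor) pair, while CBOW produces one example per word, which is one reason CBOW usually trains faster on the same corpus.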


# **Word Similarity**


In [77]:
skip_gram_model.wv.most_similar(positive=[ "media"])


[('information', 0.4375090003013611),
 ('strong', 0.3916800320148468),
 ('painting', 0.38238534331321716),
 ('everyone', 0.37348809838294983),
 ('seem', 0.3702261447906494),
 ('increase', 0.36771634221076965),
 ('agree', 0.34971004724502563),
 ('great', 0.34173569083213806),
 ('easy', 0.3390679657459259),
 ('.', 0.33716365694999695)]
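
The `'.'` entry in the list above appears because punctuation tokens were left in the training sentences. One possible way to avoid that, shown here as a sketch on toy data rather than the notebook's actual preprocessing, is to keep only alphabetic tokens. The helper name `keep_word_tokens` and the toy inputs are assumptions for illustration.

```python
def keep_word_tokens(tokenized_sentences, stop_words):
    """Drop stop words and any token that is not purely alphabetic."""
    return [
        [tok for tok in sent if tok.isalpha() and tok not in stop_words]
        for sent in tokenized_sentences
    ]


# Toy input (hypothetical): two already-tokenized, lowercased sentences
toy_sentences = [
    ["media", "draw", "buy", "fly", "."],
    ["identify", "on", "another", "turn", "."],
]
toy_stop_words = {"on"}

cleaned = keep_word_tokens(toy_sentences, toy_stop_words)
print(cleaned)
```

Retraining on sentences filtered this way would change the similarity results shown in this notebook, so the recorded outputs would no longer match.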

# **CBOW**

In [78]:
cbow_model = Word2Vec(sentences=sentences,
                      sg=0,              # Indicates the CBOW model
                      vector_size=100,
                      window=5,
                      min_count=1,
                      workers=4,
                      epochs=20)

In [79]:
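# Note: word_embeddings comes from the first Skip-gram model (window=2), not cbow_model.
# To query the CBOW vectors instead, call cbow_model.wv.most_similar('car').
print(word_embeddings.most_similar('car'))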

[('kid', 0.40061330795288086), ('game', 0.3643937408924103), ('form', 0.3620052933692932), ('part', 0.35906222462654114), ('camera', 0.3527991771697998), ('along', 0.35020798444747925), ('evidence', 0.3489747941493988), ('stock', 0.3361576497554779), ('movie', 0.33015307784080505), ('program', 0.327744722366333)]
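
The scores returned by `most_similar` are cosine similarities between the query word's vector and every other vector in the vocabulary. The sketch below reproduces that idea in plain Python on a tiny hand-made vector table. The table values and the helper names `cosine` and `most_similar` are assumptions for illustration and are unrelated to the trained models above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scores = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy 2-dimensional vectors (hypothetical values)
toy_vectors = {
    "car": [1.0, 0.0],
    "game": [0.8, 0.6],
    "kid": [0.6, 0.8],
    "stock": [-1.0, 0.0],
}

result = most_similar("car", toy_vectors)
print(result)
```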


# Compare between CBOW & Skip-gram

In [80]:
import time

word = 'career'

In [81]:
# Time one similarity query per model (this measures lookup speed, not training time)
start_time_skipgram = time.time()
skipgram_similar_words = skip_gram_model.wv.most_similar(word)
end_time_skipgram = time.time()
skipgram_time = end_time_skipgram - start_time_skipgram

start_time_cbow = time.time()
cbow_similar_words = cbow_model.wv.most_similar(word)
end_time_cbow = time.time()
cbow_time = end_time_cbow - start_time_cbow


In [82]:
print("Skip-gram model:")
print(skipgram_similar_words)
print("Time taken for Skip-gram model:", skipgram_time, "seconds")

print("CBOW model:")
print(cbow_similar_words)
print("Time taken for CBOW model:", cbow_time, "seconds")

if skipgram_time<cbow_time:
    print("Skip-gram model was faster.")
elif cbow_time < skipgram_time:
    print("CBOW model was faster")
else:
    print("Both took the same amount of time")


Skip-gram model:
[('.', 0.4454668462276459), ('current', 0.41692256927490234), ('husband', 0.37884750962257385), ('politics', 0.3552754819393158), ('pm', 0.3448687195777893), ('leave', 0.3403257727622986), ('sure', 0.3394918143749237), ('onto', 0.3320145606994629), ('term', 0.32907429337501526), ('scene', 0.3204682469367981)]
Time taken for Skip-gram model: 0.00667572021484375 seconds
CBOW model:
[('fight', 0.5081037878990173), ('various', 0.4429844617843628), ('eat', 0.4175970256328583), ('within', 0.4072027802467346), ('support', 0.4054358899593353), ('capital', 0.40062740445137024), ('individual', 0.3944961726665497), ('school', 0.3934437036514282), ('compare', 0.37474745512008667), ('country', 0.3696836233139038)]
Time taken for CBOW model: 0.0006642341613769531 seconds
CBOW model was faster
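
Both timings above are single measurements of a millisecond-scale query, so the difference may be mostly noise, and the first call on a model may include one-time setup work. A more stable approach, sketched here with the standard-library `timeit` module, is to repeat the call many times and keep the best run. The helper `time_call` and the stand-in function `dummy_query` are assumptions for illustration; in the notebook, the timed callable would be something like `lambda: cbow_model.wv.most_similar(word)`.

```python
import timeit


def time_call(fn, repeat=5, number=100):
    """Return the best per-call time in seconds over several repeated runs."""
    totals = timeit.repeat(fn, repeat=repeat, number=number)
    return min(totals) / number


def dummy_query():
    """Stand-in for a similarity query (hypothetical workload)."""
    return sorted(range(1000), reverse=True)[:10]


best = time_call(dummy_query)
print(f"best per-call time: {best:.6f} s")
```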


In [None]:
import numpy as np

word = "career"

# Retrain both models with Gensim's default hyperparameters
cbow_model = Word2Vec(sentences=sentences, min_count=1, sg=0)
skip_gram_model = Word2Vec(sentences=sentences, min_count=1, sg=1)

cbow_word_vector = cbow_model.wv[word]
skip_gram_word_vector = skip_gram_model.wv[word]

def cosine_similarity(v1, v2):
    """Cosine of the angle between two vectors."""
    dot_product = np.dot(v1, v2)
    norm_v1 = np.linalg.norm(v1)
    norm_v2 = np.linalg.norm(v2)
    return dot_product / (norm_v1 * norm_v2)

# The two models are trained independently, so their vector spaces are not aligned;
# treat this score as a rough indication only.
similarity_score = cosine_similarity(cbow_word_vector, skip_gram_word_vector)

print("Cosine Similarity of '{}' is {:.2f}".format(word, similarity_score))
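
Because each model learns its own vector space, another way to compare them is to check how much their nearest-neighbor lists overlap. The sketch below computes the Jaccard overlap of the two neighbor lists for 'career' that were printed in the comparison output earlier (the `window=5`, `epochs=20` models). The helper name `jaccard` is an assumption for illustration.

```python
# Neighbor words for 'career', copied from the comparison output above
skipgram_neighbors = ['.', 'current', 'husband', 'politics', 'pm',
                      'leave', 'sure', 'onto', 'term', 'scene']
cbow_neighbors = ['fight', 'various', 'eat', 'within', 'support',
                  'capital', 'individual', 'school', 'compare', 'country']


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two word lists."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


overlap = jaccard(skipgram_neighbors, cbow_neighbors)
print(f"Jaccard overlap of top-10 neighbors: {overlap:.2f}")
```

The two lists share no words, so on this corpus the models disagree completely about the neighborhood of 'career'.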


# b. GloVe Model using Flair Library


In [None]:
sentence1 = Sentence("She gave her a rose as a symbol of love.")
sentence2 = Sentence("He watched as the sun rose over the horizon.")

In [None]:
glove_embedding.embed(sentence1)
glove_embedding.embed(sentence2)

In [None]:
print("Embedded tokens for Sentence 1:")
for token in sentence1:
    print(colored(token.text, 'blue', attrs=['bold']))
    print(token.embedding)

print("\nEmbedded tokens for Sentence 2:")
for token in sentence2:
    print(colored(token.text, 'green', attrs=['bold']))
    print(token.embedding)

In [None]:
# "rose" is token 4 in sentence1 but token 5 in sentence2 (0-based indexing)
similarity = 1 - spatial.distance.cosine(sentence1[4].embedding, sentence2[5].embedding)
print(f"\nSimilarity between the embeddings of the word 'rose': {similarity}")
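
GloVe is a static embedding: each word maps to one fixed vector, whatever sentence it appears in. The sketch below imitates that with a small lookup table to show why the similarity of "rose" with itself should come out as 1 (up to floating-point error) even though the two sentences use different senses. The table values and the helpers `static_embed` and `cosine` are assumptions for illustration, not real GloVe vectors or Flair calls.

```python
import math

# Toy lookup table standing in for GloVe (hypothetical values)
toy_table = {
    "rose": [0.2, -0.5, 0.7],
    "love": [0.9, 0.1, 0.3],
    "sun": [-0.4, 0.8, 0.1],
}


def static_embed(tokens, table):
    """Look up each known token's vector; the surrounding context is ignored."""
    return {tok: table[tok] for tok in tokens if tok in table}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


s1 = static_embed("she gave her a rose as a symbol of love".split(), toy_table)
s2 = static_embed("he watched as the sun rose over the horizon".split(), toy_table)

sim = cosine(s1["rose"], s2["rose"])
print(f"similarity of 'rose' across the two sentences: {sim:.4f}")
```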
