# Word Embedding: word2vec vs. GloVe


### Word Embedding using word2vec

Word2Vec is a shallow feed-forward neural network model that learns word embeddings from the contexts in which words appear.

In [20]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [29]:
import gensim
import nltk
import numpy as np
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
from sklearn.metrics.pairwise import cosine_similarity

In [36]:
corpus = [
    "I love natural language processing",
    "Word embeddings are useful for NLP tasks",
    "Word2Vec and GloVe are popular word embedding techniques",
    "Embeddings capture semantic relationships between words"
]

The corpus is a list of short text sentences; each element of the list is a single sentence or phrase. It is a deliberately small collection of text, used here only for demonstration and experimentation.

In [3]:
tokenized_corpus = [word_tokenize(sentence.lower()) for sentence in corpus]

This creates a tokenized version of the original corpus: each sentence is lowercased and then broken down into its individual words.

In [4]:
tokenized_corpus

[['i', 'love', 'natural', 'language', 'processing'],
 ['word', 'embeddings', 'are', 'useful', 'for', 'nlp', 'tasks'],
 ['word2vec',
  'and',
  'glove',
  'are',
  'popular',
  'word',
  'embedding',
  'techniques'],
 ['embeddings', 'capture', 'semantic', 'relationships', 'between', 'words']]

In [5]:
model = Word2Vec(sentences=tokenized_corpus, vector_size=100, window=5, min_count=1, sg=0)

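A Word2Vec model is created and trained on the provided **tokenized_corpus**. Here "vector_size=100" sets the embedding dimension, "window=5" sets the context window, and "min_count=1" keeps every word, even those that appear only once.

We are using the Continuous Bag of Words (CBOW) architecture here, which is specified using "sg=0". Setting "sg=1" would select the Skip-gram architecture instead.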
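To make the difference between the two architectures that the "sg" flag switches between more concrete, here is a minimal pure-Python sketch of how training examples could be built from one tokenized sentence. The helper name training_examples and the fixed window of 2 are assumptions for illustration only; gensim's actual implementation differs in detail (for example, it samples a reduced window size at random).

```python
def training_examples(tokens, window=2):
    """Yield (context_words, center_word) for every position in a sentence.

    Hypothetical helper for illustration; not part of gensim.
    """
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, center


sentence = ["i", "love", "natural", "language", "processing"]

# CBOW: the whole context is used at once to predict the center word.
cbow_examples = list(training_examples(sentence))

# Skip-gram: the center word is used to predict each context word separately.
skipgram_examples = [
    (center, context_word)
    for context, center in training_examples(sentence)
    for context_word in context
]

print(cbow_examples[1])       # (['i', 'natural', 'language'], 'love')
print(skipgram_examples[:2])  # [('i', 'love'), ('i', 'natural')]
```

CBOW produces one example per word position, while Skip-gram produces one example per (center, context) pair, so Skip-gram sees more training pairs from the same text.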

In [14]:
vector = model.wv['love']

print("Word vector for 'love':", vector)

Word vector for 'love': [-8.7276660e-03  2.1322442e-03 -8.7234029e-04 -9.3180090e-03
 -9.4277617e-03 -1.4115345e-03  4.4360980e-03  3.7051938e-03
 -6.4997668e-03 -6.8744812e-03 -4.9965675e-03 -2.2889164e-03
 -7.2506540e-03 -9.6036931e-03 -2.7440847e-03 -8.3625512e-03
 -6.0377736e-03 -5.6711119e-03 -2.3457278e-03 -1.7099340e-03
 -8.9553650e-03 -7.3277560e-04  8.1559829e-03  7.6896204e-03
 -7.2059133e-03 -3.6662850e-03  3.1182212e-03 -9.5726410e-03
  1.4767149e-03  6.5224483e-03  5.7462831e-03 -8.7638423e-03
 -4.5154700e-03 -8.1404923e-03  4.4413522e-05  9.2636719e-03
  5.9747146e-03  5.0679548e-03  5.0612800e-03 -3.2449535e-03
  9.5528029e-03 -7.3563098e-03 -7.2702118e-03 -2.2664559e-03
 -7.7782269e-04 -3.2147393e-03 -5.9475785e-04  7.4897148e-03
 -6.9789623e-04 -1.6242372e-03  2.7439173e-03 -8.3591873e-03
  7.8553734e-03  8.5351923e-03 -9.5840860e-03  2.4437979e-03
  9.9072168e-03 -7.6671345e-03 -6.9674756e-03 -7.7371001e-03
  8.3942134e-03 -6.8317266e-04  9.1467435e-03 -8.1588430e-03


Using the trained Word2Vec model, we look up the numerical representation (a 100-dimensional vector) of the word 'love' and print it.

### Word Embedding using GloVe

GloVe is an unsupervised learning algorithm that relies on statistical properties of word co-occurrence within a corpus to learn word embeddings.
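As a rough illustration of the co-occurrence statistics GloVe starts from, the sketch below counts how often word pairs appear within a small window of each other. The function name cooccurrence_counts, the toy sentences, and the window of 2 are assumptions made for this example; the 1/distance weighting mirrors the weighting described for GloVe, but this is not the reference implementation and it does not perform the embedding training itself.

```python
from collections import defaultdict


def cooccurrence_counts(sentences, window=2):
    """Return a symmetric {(word_a, word_b): weight} co-occurrence table.

    Hypothetical helper for illustration only.
    """
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Look back at up to `window` preceding words.
            for j in range(max(0, i - window), i):
                # Closer pairs contribute more: weight is 1 / distance.
                weight = 1.0 / (i - j)
                counts[(word, tokens[j])] += weight
                counts[(tokens[j], word)] += weight
    return counts


toy_sentences = [
    ["word", "embeddings", "are", "useful"],
    ["word", "embeddings", "capture", "meaning"],
]
counts = cooccurrence_counts(toy_sentences)

print(counts[("word", "embeddings")])  # 2.0 (adjacent in both sentences)
print(counts[("word", "are")])         # 0.5 (two positions apart, once)
```

GloVe then fits word vectors so that their dot products approximate the logarithm of counts like these, which is why it is described as relying on corpus-wide statistics rather than on one context window at a time.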

In [8]:
glove_file = "glove.6B.100d.txt"

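This is the file of pretrained GloVe vectors we will use for the example (100-dimensional vectors from the glove.6B set). It is assumed to have been downloaded already and placed in the working directory.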

In [15]:
embeddings_index = {}
with open(glove_file, encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.array(values[1:], dtype="float32")
        embeddings_index[word] = vector


This code builds a Python dictionary called embeddings_index that maps each word to its GloVe vector. Each line of the file holds a word followed by its vector components, so the loop splits the line, takes the first item as the word, and converts the remaining items to a float32 NumPy array.
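To see the file layout without downloading anything, the same parsing loop can be run on a tiny in-memory sample. The two lines below are made-up values in the same "word followed by numbers" layout, not real GloVe vectors, and toy_index is a hypothetical name used only here.

```python
import io

import numpy as np

# Two made-up lines in the GloVe text layout: a token, then its components.
sample = io.StringIO("love 0.1 -0.2 0.3\nking 0.4 0.5 -0.6\n")

toy_index = {}
for line in sample:
    values = line.split()
    # First item is the word; the rest are the vector components.
    toy_index[values[0]] = np.array(values[1:], dtype="float32")

print(toy_index["love"].shape)  # (3,)
print(toy_index.get("queen"))   # None, because "queen" is not in the sample
```

Using .get rather than square brackets is a small design choice: a missing word yields None instead of raising a KeyError, so the caller can decide how to handle out-of-vocabulary words.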

In [18]:
word = "love"
vector = embeddings_index.get(word)

Let's test it out using the word "love". Note that .get returns None if the word is not in the dictionary.

In [19]:
print("GloVe vector for 'love':", vector)

GloVe vector for 'love': [ 2.5975e-01  5.5833e-01  5.7986e-01 -2.1361e-01  1.3084e-01  9.4385e-01
 -4.2817e-01 -3.7420e-01 -9.4499e-02 -4.3344e-01 -2.0937e-01  3.4702e-01
  8.2516e-02  7.9735e-01  1.6606e-01 -2.6878e-01  5.8830e-01  6.7397e-01
 -4.9965e-01  1.4764e+00  5.5261e-01  2.5295e-02 -1.6068e-01 -1.3878e-01
  4.8686e-01  1.1420e+00  5.6195e-02 -7.3306e-01  8.6932e-01 -3.5892e-01
 -5.1877e-01  9.0402e-01  4.9249e-01 -1.4915e-01  4.8493e-02  2.6096e-01
  1.1352e-01  4.1275e-01  5.3803e-01 -4.4950e-01  8.5733e-02  9.1184e-02
  5.0177e-03 -3.4645e-01 -1.1058e-01 -2.2235e-01 -6.5290e-01 -5.1838e-02
  5.3791e-01 -8.1040e-01 -1.8253e-01  2.4194e-01  5.4855e-01  8.7731e-01
  2.2165e-01 -2.7124e+00  4.9405e-01  4.4703e-01  5.5882e-01  2.6076e-01
  2.3760e-01  1.0668e+00 -5.6971e-01 -6.4960e-01  3.3511e-01  3.4609e-01
  1.1033e+00  8.5261e-02  2.4847e-02 -4.5453e-01  7.7012e-02  2.1321e-01
  1.0444e-01  6.7157e-02 -3.4261e-01  8.5534e-01  1.3361e-01 -4.3296e-01
 -5.6726e-01 -2.1348e-01 -

This is the numerical representation of the word 'love', printed as a vector.

GloVe vectors can be used to calculate the similarity between two words, which is helpful for capturing semantic relationships between them. Here we use cosine similarity, which measures the angle between the two vectors.

In [30]:
word1 = "king"
word2 = "queen"


# Ensure both words are in the embeddings dictionary
if word1 in embeddings_index and word2 in embeddings_index:
    # Calculate cosine similarity between the word vectors
    vector1 = embeddings_index[word1]
    vector2 = embeddings_index[word2]
    similarity_score = cosine_similarity([vector1], [vector2])[0][0]
    print(f"Cosine Similarity between '{word1}' and '{word2}' using GloVe: {similarity_score:.4f}")
else:
    print("One or both words are not in the embeddings dictionary.")


Cosine Similarity between 'king' and 'queen' using GloVe: 0.7508
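For reference, the cosine similarity computed above is the dot product of the two vectors divided by the product of their lengths. A minimal pure-Python sketch of that formula follows; the function name cosine and the small test vectors are illustrative assumptions, and in practice the scikit-learn call used above is the more convenient choice.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine([1.0, 0.0], [1.0, 0.0]))            # 1.0 (same direction)
print(cosine([1.0, 0.0], [0.0, 1.0]))            # 0.0 (orthogonal)
print(round(cosine([1.0, 2.0], [2.0, 4.0]), 6))  # 1.0 (same direction, different length)
```

Because the result depends only on direction and not on vector length, a score near 1 means the two words point the same way in the embedding space, while a score near 0 means they are unrelated under this measure.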


### Exercises:
**1. Modify the Word2Vec example:** We have implemented the example using the Continuous Bag of Words architecture ("sg=0"). Try it using Skip-gram ("sg=1") and report your findings.

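**2. Calculate word similarity using Word2Vec:** Similar to the GloVe example, write code to calculate the cosine similarity between the two words "queen" and "king". Note that both words must be in the model's vocabulary, so you will need to extend the corpus with sentences that contain them.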