# Advanced Certification in AIML
## A Program by IIIT-H and TalentSprint

### Not for Grading

### Learning Objectives:

At the end of the experiment, you will be able to:
 
*  generate word embeddings using pre-trained models
*  visualize similar words in a 2D plane

In [None]:
! wget https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/glove.6B.zip
! unzip glove.6B.zip
! wget https://cdn.talentsprint.com/talentsprint1/archives/sc/aiml/experiment_related_data/AIML_DS_GOOGLENEWS-VECTORS-NEGATIVE-300_STD.rar
! unrar e /content/AIML_DS_GOOGLENEWS-VECTORS-NEGATIVE-300_STD.rar
    

## PART I
### Find the similarity between words using GloVe

In [None]:
# Import required Packages
import pandas as pd
import numpy as np

# pprint is a Python standard-library module for pretty-printing data structures
import pprint

* **Load the GloVe pretrained model**

  GloVe stands for "Global Vectors for Word Representation". It was developed at Stanford for generating word embeddings. GloVe combines global corpus statistics (word-word co-occurrence counts) with local context-window information to produce word vectors.
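
  As a rough illustration of the "global statistics" mentioned above: GloVe is trained on word-word co-occurrence counts gathered over the whole corpus. The sketch below only counts co-occurrences within a small context window for a toy corpus; it does not train any embeddings. The corpus, the window size, and the helper name `cooccurrence_counts` are assumptions made for illustration.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered (center, context) word pair occurs within `window` positions."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, center in enumerate(tokens):
            # Context = up to `window` tokens on either side of the center word
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[(center, tokens[j])] += 1
    return counts

# Hypothetical two-sentence corpus
toy_corpus = [
    "the king rules the land",
    "the queen rules the land",
]
counts = cooccurrence_counts(toy_corpus, window=2)
print(counts[("king", "rules")], counts[("queen", "rules")], counts[("king", "queen")])
```

  Here 'king' and 'queen' never appear together, yet both co-occur with 'rules'. Shared context of this kind is what lets embedding models place such words near each other.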


In [None]:
GloVe_Dict = {}
# Load the 50-dimensional GloVe vectors (the file is UTF-8 encoded text)
with open("glove.6B.50d.txt", 'r', encoding="utf-8") as f:
  for line in f:
      values = line.split()
      word = values[0]
      vector = np.asarray(values[1:], "float32")
      GloVe_Dict[word] = vector

In [None]:
# Length of the word vocabulary
print(len(GloVe_Dict))

* Look up the GloVe embeddings for the given list of words

In [None]:
words = ['king', 'queen', 'river', 'water', 'ocean', 'tree', 'leaf', 'happy', 'glad', 'mother', 'daughter']

In [None]:
# Creating a PrettyPrinter() object
pp = pprint.PrettyPrinter()

# Vector representation of a specific word 
print("Size of the vector is", len(GloVe_Dict["king"]))
pp.pprint(GloVe_Dict["king"])

In [None]:
# Vector representation of each word using GloVe
vectors = []
for word in words:
  vector = GloVe_Dict[word]
  vectors.append(vector)
print("There are %d words and the vector size of each word is %d" %((len(vectors),len(vectors[0]))))

* Measure the similarity between the words using cosine_similarity


In [None]:
# Importing the cosine similarity
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
word_similarity = []
for word_1 in words:
  row_wise_similarity = []
  for word_2 in words:
    # Get the vectors of both words from GloVe
    vec_1, vec_2 = GloVe_Dict[word_1], GloVe_Dict[word_2]

    # cosine_similarity expects 2D input, so reshape each 1D vector to shape (1, n)
    vec_1, vec_2 = np.array(vec_1).reshape(1,-1), np.array(vec_2).reshape(1,-1)

    # Measure the cosine similarity between the vectors
    similarity = cosine_similarity(vec_1, vec_2)
    row_wise_similarity.append(np.array(similarity).item())

  # Store this row of cosine similarity values
  word_similarity.append(row_wise_similarity)

# Create a DataFrame to view the similarity between words
pd.DataFrame(word_similarity, columns=words, index=words)

*GloVe captures semantic relationships between words. The higher the cosine similarity, the closer the words are in meaning.*

*For example, the word 'king' is close to the word 'queen'.*
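
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on the angle between them. The sketch below computes it by hand on made-up 3-dimensional vectors (not real GloVe values); the helper name `cosine` is an assumption for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, chosen so the results are easy to check by hand
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]    # same direction as a, twice the length
c = [3.0, 0.0, -1.0]   # orthogonal to a: 1*3 + 2*0 + 3*(-1) = 0

print(cosine(a, b))  # approximately 1: same direction
print(cosine(a, c))  # 0: orthogonal
```

Because the length of the vectors cancels out, `a` and `b` score (approximately) 1 even though `b` is twice as long.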

* Visualize the words in a 2D plane by reducing the dimensions using PCA

In [None]:
# Create a 2-dimensional PCA model of the word vectors using the scikit-learn PCA class
from sklearn.decomposition import PCA

# n_components in PCA specifies the number of dimensions to keep
pca = PCA(n_components=2)

# Fit and transform the vectors using PCA model
reduced_vectors = pca.fit_transform(vectors)
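
To make the PCA step less of a black box, here is a minimal NumPy sketch of the same idea: center the data, find the directions of greatest variance with an SVD, and project onto the first two. The helper name `pca_2d` and the random toy data are assumptions for illustration. The result can differ from scikit-learn's output by a sign flip per component.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)  # PCA works on mean-centered data
    # Rows of Vt are the principal directions, ordered by decreasing variance
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T

# Toy data: 5 made-up points in 4 dimensions
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
reduced = pca_2d(X)
print(reduced.shape)  # (5, 2)
```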

In [None]:
from matplotlib import pyplot as plt
plt.figure(figsize=(7,5))
plt.scatter(reduced_vectors[:,0],reduced_vectors[:,1], s = 12, color = 'red')
plt.xlim([-3.5,4.5])
plt.ylim([-3.5,3.5])
x, y = reduced_vectors[:,0] , reduced_vectors[:,1]
for i in range(len(x)):
  plt.annotate(words[i],xy=(x[i], y[i]),xytext=(x[i]+0.05,y[i]+0.05))

## PART II
### Find the similarity between words using Word2Vec



* Load Gensim pretrained model

  * Gensim is an open-source Python library for natural language processing. It was developed and is maintained by the Czech natural language processing researcher Radim Řehůřek and his company RaRe Technologies.

  * Use gensim to load a word2vec model pretrained on Google News, covering approximately 3 million words and phrases. Each vector has 300 dimensions.

  * Load the Google News bin file with **limit = 500000**, so that only the first 500,000 word vectors are read into memory. **binary = True** tells gensim that the file is in binary word2vec format; otherwise the file is read as plain text.
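
To illustrate what a `limit` does, the sketch below reads the plain-text word2vec format (a `count dim` header followed by one word per line) and stops after the first `limit` entries. This is not gensim's implementation; the helper name `load_text_vectors` and the toy file contents are hypothetical. Such files are typically ordered from most to least frequent word, so a limit usually keeps the most common words.

```python
import io

def load_text_vectors(handle, limit=None):
    """Read word2vec text-format vectors, keeping at most `limit` entries."""
    _vocab_size, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        if limit is not None and len(vectors) >= limit:
            break  # stop early: the remaining vectors are never read
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        assert len(values) == dim, "each row must have `dim` numbers"
        vectors[word] = values
    return vectors

# Hypothetical 4-word file with 3-dimensional vectors
toy_file = io.StringIO(
    "4 3\n"
    "the 0.1 0.2 0.3\n"
    "king 0.4 0.5 0.6\n"
    "queen 0.7 0.8 0.9\n"
    "zymurgy 1.0 1.1 1.2\n"
)
subset = load_text_vectors(toy_file, limit=2)
print(list(subset))  # ['the', 'king']
```

Words beyond the limit are simply absent from the result, so looking them up fails just as it would for a word that was never in the file.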

In [None]:
import gensim

# Load the first 500,000 of the 300-dimensional Google News vectors
model = gensim.models.KeyedVectors.load_word2vec_format('AIML_DS_GOOGLENEWS-VECTORS-NEGATIVE-300_STD.bin', binary=True, limit=500000)

* Look up the Word2Vec embeddings for the list of words

In [None]:
# Vector representation of a specific word 
print("Size of the vector is", len(model["king"]))
pp.pprint(model["king"])

In [None]:
# Vector representation of each word using Word2Vec
word2vec = []

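for word in words:
  try:
    word2vec.append(model[word])
  except KeyError:
    # The word is not in the loaded vocabulary; report it instead of failing silently
    print("Skipping out-of-vocabulary word:", word)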
print("There are %d words and the vector size of each word is %d" %(len(word2vec),len(word2vec[0])))
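
If a word is skipped because it is missing from the model, `word2vec` ends up shorter than `words`, and any later code that pairs `words[i]` with the i-th vector will mislabel points. One way to avoid this is to collect the found words alongside their vectors. The sketch below uses a plain dictionary with made-up values; the helper name `lookup_vectors` is hypothetical. The same `in` test also works on a gensim `KeyedVectors` object.

```python
def lookup_vectors(words, embeddings):
    """Return (found_words, vectors) in matching order, skipping missing words."""
    found_words, vectors = [], []
    for word in words:
        if word in embeddings:
            found_words.append(word)
            vectors.append(embeddings[word])
    return found_words, vectors

# Toy embedding table (made-up values); 'ocean' is deliberately missing
toy_embeddings = {"king": [0.1, 0.2], "queen": [0.2, 0.1], "river": [0.9, 0.8]}
found, vecs = lookup_vectors(["king", "ocean", "river"], toy_embeddings)
print(found)  # ['king', 'river']
```

Using `found` instead of the original word list for plot labels keeps each label attached to the right vector.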

* Measure the similarity between the words using cosine_similarity


In [None]:
w2v_similarity = []
for word_1 in words:
  w2v_row_wise_similarity = []
  for word_2 in words:
    # Get the vectors of both words from Word2Vec
    vec_1, vec_2 = model[word_1], model[word_2]

    # cosine_similarity expects 2D input, so reshape each 1D vector to shape (1, n)
    vec_1, vec_2 = np.array(vec_1).reshape(1,-1), np.array(vec_2).reshape(1,-1)

    # Measure the cosine similarity between the two vectors
    similarity = cosine_similarity(vec_1,vec_2)
    w2v_row_wise_similarity.append(np.array(similarity).item())

  # Store this row of cosine similarity values
  w2v_similarity.append(w2v_row_wise_similarity)

pd.DataFrame(w2v_similarity, columns = words, index = words)

*The higher the cosine similarity, the closer the words are in meaning.*

*For example, the word 'king' is similar to the word 'queen'.*
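
The nested loops used for the similarity tables compute one pair at a time. As an alternative sketch, the whole table can be computed in one step by scaling every row to unit length and taking a matrix product. The helper name `cosine_similarity_matrix` and the toy matrix are assumptions for illustration. scikit-learn's `cosine_similarity` likewise accepts a full 2D array and returns the pairwise matrix.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    X = np.asarray(X, dtype=float)
    # Scale every row to unit length; the dot product of unit vectors is their cosine
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    return unit @ unit.T

# Toy matrix: three made-up 2-dimensional vectors
toy = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
sim = cosine_similarity_matrix(toy)
print(np.round(sim, 3))
```

The diagonal is 1 (each vector compared with itself) and the matrix is symmetric, the same structure as the DataFrames produced for GloVe and Word2Vec.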

* Visualize the words in a 2D plane by reducing the dimensions using PCA

In [None]:
# Create a 2-dimensional PCA model of the word vectors using the scikit-learn PCA class
from sklearn.decomposition import PCA

# n_components in PCA specifies the number of dimensions to keep
pca = PCA(n_components=2)

# Fit and transform the vectors using PCA model
reduced_w2v = pca.fit_transform(word2vec)

In [None]:
plt.figure(figsize=(8,5))
plt.scatter(reduced_w2v[:,0],reduced_w2v[:,1], s = 12, color = 'red')
plt.xlim([-2.5,2.5])
plt.ylim([-2.5,2.5])
x, y = reduced_w2v[:,0] , reduced_w2v[:,1]
for i in range(len(x)):
  plt.annotate(words[i],xy=(x[i], y[i]),xytext=(x[i]+0.05,y[i]+0.05))