# Advanced Certification in AIML
## A Program by IIIT-H and TalentSprint

### Learning Objectives:

At the end of the experiment, you will be able to:

*  generate the vectors for the given words
*  find the similarities between the words


In [None]:
#@title Experiment Walkthrough Video
#@markdown Word2vec similarity
from IPython.display import HTML

HTML("""<video width="520" height="440" controls>
  <source src="https://cdn.exec.talentsprint.com/content/2021-06-08_iiith_aiml_word2vec_similarity.mp4">
</video>
""")

In [None]:
! wget https://cdn.talentsprint.com/talentsprint1/archives/sc/aiml/experiment_related_data/AIML_DS_GOOGLENEWS-VECTORS-NEGATIVE-300_STD.rar
! unrar e /content/AIML_DS_GOOGLENEWS-VECTORS-NEGATIVE-300_STD.rar
! wget https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/Word2Vec_Similarity/dimensionality_reduction.py

### Importing required packages

In [None]:
import numpy as np
import gensim
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity

### Representation using Word2Vec pre-trained model

Load Gensim pretrained model

  * Gensim is an open-source Python library for natural language processing. It is developed and maintained by the Czech natural language processing researcher Radim Rehurek and his company RaRe Technologies.

  * Use Gensim to load a word2vec model pretrained on Google News, covering approximately 3 million words and phrases. Each vector has 300 dimensions.

  * Load the Google News .bin file with **limit = 500000**, which reads only the first 500,000 word vectors from the file. If **binary = True**, the file is read as binary word2vec format; otherwise it is read as plain text.

In [None]:
# Load the 300-dimensional vectors directly from the file. The file is in binary (.bin) format, so set binary=True (the default is False)
model = gensim.models.KeyedVectors.load_word2vec_format('AIML_DS_GOOGLENEWS-VECTORS-NEGATIVE-300_STD.bin', binary=True, limit=500000)

Get the word embeddings for the given list of words

In [None]:
words_list = ['India','Delhi','Turkey', 'Ankara', 'Russia', 'Moscow','Japan', 'Tokyo', 'Vietnam', 'Hanoi','China', 'Beijing']

In [None]:
vect = []
for word in words_list:
    # Get the vector of each word and append it to the list
    vect.append(model[word])
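
Because the model is loaded with a limit, a word may be absent from the vocabulary, and `model[word]` then raises a `KeyError`. The sketch below shows one way to skip such words. The helper name `collect_vectors` and the toy 3-dimensional vectors are hypothetical, and a plain dictionary stands in for the loaded model so that the cell runs without the data file; the same membership test (`word in model`) should also work on the Gensim model.

```python
import numpy as np

def collect_vectors(model, words):
    """Return (found_words, vectors), skipping words missing from the model."""
    found, vectors = [], []
    for word in words:
        if word in model:  # membership test works for dicts and KeyedVectors
            found.append(word)
            vectors.append(model[word])
        else:
            print(f"Skipping out-of-vocabulary word: {word}")
    return found, np.array(vectors)

# Toy stand-in for the loaded model (hypothetical 3-dimensional vectors)
toy_model = {
    "India": np.array([1.0, 0.0, 0.0]),
    "Delhi": np.array([0.9, 0.1, 0.0]),
}

found_words, toy_vectors = collect_vectors(toy_model, ["India", "Delhi", "Atlantis"])
print(found_words, toy_vectors.shape)
```

Keeping the list of found words alongside the vectors matters for plotting later, since the labels must line up with the rows of the vector array.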

###  Visualization and Plotting the reduced Word2Vec representation

Each word vector has 300 dimensions. To plot the words on a plane, reduce the 300-dimensional vectors to 2 dimensions.


In [None]:
from dimensionality_reduction import reduce_dimensions

reduced_vector = reduce_dimensions(vect)
len(reduced_vector), len(reduced_vector[0])
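
The internals of `reduce_dimensions` are not shown in this notebook. As an illustration only, the sketch below reduces 300-dimensional vectors to 2 dimensions with PCA, implemented with NumPy's SVD. This is an assumption about one reasonable technique, not necessarily what the course helper does. The function name `pca_2d` and the random toy data are hypothetical.

```python
import numpy as np

def pca_2d(vectors):
    """Project vectors onto their top two principal components."""
    X = np.asarray(vectors, dtype=float)
    # PCA requires centered data: subtract the per-dimension mean
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    # Keep the first two directions and project the data onto them
    return X_centered @ Vt[:2].T

# Toy data standing in for 12 word vectors of length 300 (random, not real embeddings)
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(12, 300))

toy_reduced = pca_2d(toy_vectors)
print(toy_reduced.shape)
```

The result has one row per word and two columns, which is the shape the plotting cell expects.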


Visualize the words in a 2D plane

In [None]:
# Split the reduced vectors into x and y coordinates
x, y = reduced_vector[:, 0], reduced_vector[:, 1]

plt.figure(figsize=(16, 5))
plt.scatter(x, y)

# Label each point with its word, slightly offset from the marker
for i in range(len(x)):
    plt.annotate(words_list[i], xy=(x[i], y[i]), xytext=(x[i] + 0.02, y[i] + 0.02))
plt.show()

### Finding the cosine similarity between two words

In [None]:
# cosine_similarity expects 2D arrays, so reshape each 1D vector to shape (1, 300)
cosine_similarity(model['Tokyo'].reshape(1,-1), model['Japan'].reshape(1,-1))
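
The value returned by `cosine_similarity` is the cosine of the angle between the two vectors: their dot product divided by the product of their lengths. A minimal sketch of that formula in NumPy follows. The function name `cosine` and the small 2-dimensional vectors are hypothetical and chosen only to make the geometry easy to see.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between u and v: (u . v) / (|u| * |v|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

orthogonal = cosine([1, 0], [0, 1])   # vectors at 90 degrees
parallel = cosine([1, 2], [2, 4])     # same direction, different length
opposite = cosine([1, 0], [-1, 0])    # opposite directions

print(orthogonal, parallel, opposite)
```

Note that the length of a vector does not affect the result; only its direction does.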

### Finding the nearest or most similar words of a given word using Word2vec

In [None]:
model.most_similar('Tokyo', topn=5)

Each result pairs a word with its cosine similarity to 'Tokyo'. A cosine value of 0 means that the two vectors are at 90 degrees to each other (orthogonal), indicating no similarity. The closer the cosine value is to 1, the smaller the angle and the more similar the vectors.
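
To make the ranking concrete, the sketch below reproduces the idea behind a nearest-word search on a toy vocabulary: normalize every vector, take dot products with the query vector, and sort in descending order. The function name `top_n_similar`, the vocabulary, and the 3-dimensional vectors are all hypothetical. This is a simplified illustration, not Gensim's actual implementation.

```python
import numpy as np

def top_n_similar(query, vocab, matrix, topn=2):
    """Rank vocabulary words by cosine similarity to the query word."""
    # Normalize rows to unit length so dot products equal cosine similarities
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query_vector = unit[vocab.index(query)]
    sims = unit @ query_vector
    # Sort indices from most to least similar
    order = np.argsort(-sims)
    # Exclude the query word itself from the results
    results = [(vocab[i], float(sims[i])) for i in order if vocab[i] != query]
    return results[:topn]

# Toy vocabulary with made-up vectors (not real embeddings)
toy_vocab = ["Tokyo", "Japan", "Osaka", "banana"]
toy_matrix = np.array([
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.1],
    [1.0, 0.8, 0.1],
    [0.0, 0.1, 1.0],
])

neighbours = top_n_similar("Tokyo", toy_vocab, toy_matrix)
print(neighbours)
```

With a real model, the matrix has one row per vocabulary word, so a single matrix-vector product scores every word at once.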

### Ungraded Exercises

### Exercise 01: Generate the vectors for the words given below and visualize them in 2D

In [None]:
words = ['king', 'queen', 'man', 'woman', 'best', 'good', 'strong', 'strongest']
# YOUR CODE HERE

### Exercise 02: Find the cosine similarity for 'king' and 'ruler'

In [None]:
# YOUR CODE HERE

### Exercise 03: Find the top 5 nearest or most similar words to the word 'king'

In [None]:
# YOUR CODE HERE