## Word Embedding Magic
I just came across this data and realized it contains 50-dimensional GloVe vector representations of words. In this notebook, I will try to show some of the magic of this dataset and the attention it deserves!

<img src="https://humancomputation.com/blog/wp-content/uploads/2016/11/Experiment.png" width="500px" height="5px"/>

Pic Credit : https://humancomputation.com



**Introduction**

Word embeddings play an important role in natural language processing. Throwing the one-hot vector representation out of the window, this feature-learning technique maps words or phrases from the vocabulary to vectors of real numbers. GloVe is one such approach; in this dataset each word is mapped to a 50-dimensional vector. These vectors can be used to capture semantic relationships between words, such as "Man is to Woman as King is to Queen", or, loosely, Man + Female ≈ Woman. Embeddings play an important role in many applications. They are also a kind of transfer learning: the embeddings are learnt from a large corpus of data and can then be reused on smaller datasets.
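To make the contrast with one-hot vectors concrete, here is a minimal sketch. The three-word vocabulary and the dense values are made up for illustration (they are not real GloVe numbers), and `cosine` is a hypothetical helper. With one-hot vectors every pair of distinct words looks equally unrelated, while dense vectors can place related words closer together.

```python
import numpy as np

# Toy vocabulary; the dense values below are invented for illustration only.
vocab = ["king", "queen", "pizza"]

# One-hot: each word gets its own axis, so distinct words are always orthogonal.
one_hot = np.eye(len(vocab))

# Hypothetical 3-dimensional dense embeddings (real GloVe vectors here have 50).
dense = np.array([
    [0.9, 0.8, 0.1],  # king
    [0.8, 0.9, 0.1],  # queen
    [0.1, 0.0, 0.9],  # pizza
])

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

one_hot_king_queen = cosine(one_hot[0], one_hot[1])
dense_king_queen = cosine(dense[0], dense[1])
dense_king_pizza = cosine(dense[0], dense[2])

print("one-hot king/queen:", one_hot_king_queen)
print("dense   king/queen:", round(dense_king_queen, 3))
print("dense   king/pizza:", round(dense_king_pizza, 3))
```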


In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
print(os.listdir("../input"))

In [3]:
def read_data(file_name):
    """Read a GloVe text file and return (vocabulary set, word -> vector dict)."""
    word_vocab = set() # a set, not a list, to avoid duplicate entries
    word2vector = {}
    # Open as UTF-8 explicitly so the platform default encoding does not matter
    with open(file_name,'r',encoding='utf-8') as f:
        for line in f:
            line_ = line.strip() # remove surrounding whitespace
            words_Vec = line_.split()
            word_vocab.add(words_Vec[0])
            word2vector[words_Vec[0]] = np.array(words_Vec[1:],dtype=float)
    print("Total Words in DataSet:",len(word_vocab))
    return word_vocab,word2vector

**Read the File**

Let us first load the file. The file is formatted as follows:

word1 ["embedding vector"]

word2 ["embedding vector"]

-- and so on
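As a small illustration of that layout, the sketch below parses a single line. The sample line is made up (only 4 values instead of 50), and `parse_line` is a hypothetical helper, not part of the dataset or of any library.

```python
import numpy as np

# A made-up line in the "word v1 v2 ... vN" layout; the real file has 50 values per word.
sample_line = "pizza 0.12 -0.5 1.3 0.07\n"

def parse_line(line):
    """Split one line into (word, vector). Hypothetical helper for illustration."""
    parts = line.strip().split()
    word = parts[0]
    vector = np.array(parts[1:], dtype=float)  # numeric strings are converted to floats
    return word, vector

word, vector = parse_line(sample_line)
print(word, vector.shape, vector)
```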

Let's read the dataset to show the magic of embeddings.

In [4]:
vocab, w2v = read_data("../input/glove.6B.50d.txt")

# Application : Similarity Score

Let's try to get the similarity score between related words such as King and Queen or Baby and Mother.

**Distance Metrics**

Every word is converted to a vector, so to check how close two words are we can use any measure such as L2 (Euclidean) distance or cosine similarity. Since cosine similarity is more commonly used, let's implement it.

We define cosine similarity as follows:

$$\text{CosineSimilarity}(u, v) = \frac{u \cdot v}{\|u\|_2 \, \|v\|_2} = \cos(\theta) \tag{1}$$

In [5]:
def cos_sim(u,v):
    """
    u: vector of 1st word
    v: vector of 2nd Word
    """
    numerator_ = u.dot(v)
    denominator_= np.sqrt(np.sum(np.square(u))) * np.sqrt(np.sum(np.square(v)))
    return numerator_/denominator_
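A quick sanity check of the formula on hand-picked toy vectors (not embeddings) may help. This sketch redefines `cos_sim` with `np.linalg.norm` so that it runs on its own. It also shows why cosine is often preferred over L2 distance: scaling a vector changes its L2 distance a lot but leaves the cosine untouched.

```python
import numpy as np

def cos_sim(u, v):
    """Cosine similarity, written with np.linalg.norm for brevity."""
    return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 2.0, 3.0])
same_direction = 10 * u                  # same direction, 10x the length
opposite = -u                            # opposite direction
orthogonal = np.array([3.0, 0.0, -1.0])  # u . orthogonal = 3 + 0 - 3 = 0

sim_same = cos_sim(u, same_direction)
sim_opp = cos_sim(u, opposite)
sim_orth = cos_sim(u, orthogonal)
l2_same = np.linalg.norm(u - same_direction)

print("same direction:", sim_same)
print("opposite      :", sim_opp)
print("orthogonal    :", sim_orth)
print("L2 distance to the scaled copy:", l2_same)
```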

**Examples for Fun**

In [6]:
all_words = w2v.keys()

In [7]:
print("Similarity Score of King and Queen",cos_sim(w2v['king'],w2v['queen']))
print("Similarity Score of Mother and Pizza",cos_sim(w2v['mother'],w2v['pizza']))
print("Similarity Score of Man and Pizza",cos_sim(w2v['man'],w2v['pizza']))
print("Similarity Score of Mother and baby",cos_sim(w2v['mother'],w2v['baby']))
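Pairwise scores extend naturally to a nearest-neighbour lookup. The sketch below is one possible way to do it, not something from the original notebook: `most_similar` is a hypothetical helper and the toy embeddings are invented values. With the real data you could pass the `w2v` dictionary instead.

```python
import numpy as np

def most_similar(query, embeddings, top_k=3):
    """Return the top_k words closest to `query` by cosine similarity (hypothetical helper)."""
    words = [w for w in embeddings if w != query]       # skip the query word itself
    matrix = np.stack([embeddings[w] for w in words])   # shape: (n_words, dim)
    q = embeddings[query]
    # Cosine similarity of every row against the query in one vectorised step
    sims = matrix @ q / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)[:top_k]                   # indices of the highest scores
    return [(words[i], float(sims[i])) for i in order]

# Invented 3-dimensional vectors, for illustration only
toy = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.8, 0.9, 0.1]),
    "royal": np.array([0.6, 0.6, 0.4]),
    "pizza": np.array([0.1, 0.0, 0.9]),
}

neighbours = most_similar("king", toy, top_k=2)
print(neighbours)
```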

# Visualization of Word Embedding

In [21]:
def return_matrix(random_words,dim =50):
    # Stack the embedding of each word into one (len(random_words), dim) matrix
    word_matrix = np.zeros((len(random_words),dim))
    for i,word in enumerate(random_words):
        word_matrix[i] = w2v[word]
    return word_matrix

## Visualization Using PCA

In [65]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from pylab import rcParams
rcParams['figure.figsize'] = 15, 10


In [66]:
random_words = ['man','woman','king','queen','microwave','baby','boy','girl','pizza','royal','mother','father','doctor','cook','throne']
return_matrix_ = return_matrix(random_words)
pca_ = PCA(n_components=2)
viz_data = pca_.fit_transform(return_matrix_) 

In [67]:
plt.scatter(viz_data[:,0],viz_data[:,1])
for label,x,y in zip(random_words,viz_data[:,0],viz_data[:,1]):
    plt.annotate(
        label,
        xy=(x,y),
        xytext=(-14, 14),
        textcoords='offset points',
        bbox=dict(boxstyle='round,pad=0.5', fc='yellow', alpha=0.5),
        arrowprops=dict(arrowstyle = '->', connectionstyle='arc3,rad=0')
    )
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.title('PCA representation for Word Embedding')
plt.xlim(-10,10)
plt.ylim(-5,6)
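A 2-D PCA plot throws away most of the 50 dimensions, so it can be worth checking how much variance the two plotted components keep. The sketch below computes this with plain NumPy on a random stand-in matrix (not the real embeddings). With scikit-learn, the fitted `PCA` object exposes the same quantity as `explained_variance_ratio_`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the (15, 50) word matrix: random values, not real embeddings
X = rng.normal(size=(15, 50))

# PCA works on mean-centred data
X_centered = X - X.mean(axis=0)

# Singular values of the centred data give the variance along each principal axis
_, singular_values, _ = np.linalg.svd(X_centered, full_matrices=False)
explained_variance = singular_values ** 2 / (X.shape[0] - 1)
explained_ratio = explained_variance / explained_variance.sum()

kept = explained_ratio[:2].sum()
print(f"Share of variance kept by the first 2 components: {kept:.2%}")
```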

## Visualization using T-SNE 

In [89]:
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, verbose=1,perplexity=3,method='exact')
tsne_results = tsne.fit_transform(return_matrix_)

In [90]:
plt.scatter(tsne_results[:,0],tsne_results[:,1])
for label,x,y in zip(random_words,tsne_results[:,0],tsne_results[:,1]):
    plt.annotate(
        label,
        xy=(x,y),
        xytext=(-14, 14),
        textcoords='offset points',
        bbox=dict(boxstyle='round,pad=0.5', fc='yellow', alpha=0.5),
        arrowprops=dict(arrowstyle = '->', connectionstyle='arc3,rad=0')
    )
plt.xlabel('TSNE Component 1')
plt.ylabel('TSNE Component 2')
plt.title('TSNE representation for Word Embedding')

We can see how related words are grouped together. Amazing, right?!

# Application : Analogy 

Let's try another application: analogies, i.e. "Man is to Woman as King is to ?" or "India is to Delhi as Japan is to ?".

The intuition is that the new word should be close to (word3 - (word1 - word2)), where word1 = cook, word2 = pizza and word3 = doctor in "cook is to pizza as doctor is to ?".
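Written as a formula, the same intuition reads as follows. The notation is an assumption for illustration: $e_w$ stands for the embedding vector of word $w$, $V$ for the vocabulary, and $\operatorname{sim}$ for the cosine similarity of equation (1).

$$w_4 = \arg\max_{w \in V} \; \operatorname{sim}\big(e_w,\; e_{w_3} - (e_{w_1} - e_{w_2})\big) \tag{2}$$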

In [97]:
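def find_w4(word1,word2,word3, w2v):
    """
    Return the word whose vector is closest (by cosine similarity) to
    word3 - (word1 - word2), i.e. "word1 is to word2 as word3 is to ?".
    Every word in w2v is searched, including the three input words.
    """
    word_list = w2v.keys()
    max_sim = -1000
    word_selected = None
    #Make sure they are lower case
    word1,word2,word3 = word1.lower(),word2.lower(),word3.lower()
    diff_vec = w2v[word3] - (w2v[word1]-w2v[word2]) #word3 - (word1 - word2)
    for word in word_list:
        vec = w2v[word]
        sim_ = cos_sim(u=diff_vec,v=vec)
        if sim_ > max_sim:
            max_sim = sim_
            word_selected =  word

    return word_selected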

In [100]:
print("King is to Queen as Man is to ",find_w4('king','queen','man',w2v))
print("Cook is to Pizza as Doctor is to ",find_w4('cook','pizza','doctor',w2v))
print("India is to Delhi as Japan is to ",find_w4('india','delhi','japan',w2v))
print("kid is to toy as doctor is to ",find_w4('kid','toy','doctor',w2v))

**Cook is to Pizza as Doctor is to pizza**

Doctor is to Pizza!! Haha.. This was a secret I guess. Well machine learning can be funny. 

<img src="https://9bf6ddc20002c5f1a946-ef07da46c7e506e973e0d9fa57c693df.ssl.cf1.rackcdn.com/636445631371205663+32594.png" height = "100" width ="200"/>
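One possible reason for answers like this is that the search also considers the three input words, and the target vector often stays very close to one of them. The sketch below is a variant that skips the inputs. `find_w4_excluding_inputs` is a hypothetical helper and the toy vectors are invented, so whether this changes the result on the real GloVe data is something you would have to try.

```python
import numpy as np

def cos_sim(u, v):
    return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

def find_w4_excluding_inputs(word1, word2, word3, embeddings):
    """Analogy search that never returns one of the three input words."""
    word1, word2, word3 = word1.lower(), word2.lower(), word3.lower()
    inputs = {word1, word2, word3}
    target = embeddings[word3] - (embeddings[word1] - embeddings[word2])
    best_word, best_sim = None, -np.inf
    for word, vec in embeddings.items():
        if word in inputs:
            continue  # skip the query words themselves
        sim = cos_sim(target, vec)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# Invented 3-dimensional vectors, for illustration only
toy = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([1.0, 0.3, 0.0]),
    "king":  np.array([0.0, 0.0, 1.0]),
    "queen": np.array([0.3, 0.6, 1.0]),
    "pizza": np.array([-1.0, -1.0, 0.2]),
}

# Searching every word: the nearest neighbour of the target is an input word
target = toy["king"] - (toy["man"] - toy["woman"])
naive_answer = max(toy, key=lambda w: cos_sim(target, toy[w]))

filtered_answer = find_w4_excluding_inputs("man", "woman", "king", toy)
print("searching all words :", naive_answer)
print("excluding the inputs:", filtered_answer)
```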