# Visualizing Word Embeddings

In [2]:
import pandas as pd
import numpy as np
import flair
from sklearn.preprocessing import LabelEncoder
from flair.embeddings import WordEmbeddings,TransformerWordEmbeddings,ELMoEmbeddings
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE 
from flair.data import Sentence,Token
import plotly.graph_objects as go
import plotly.express as px


### Pretrained Language Models - Context is King

1. GloVe
2. ELMo
3. BERT

First, we define helper functions that extract token embeddings and project them into two dimensions for easier visualization.

In [3]:
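def get_embeddings(text, emb_model):
    # Wrap the raw text in a flair Sentence, which tokenizes it
    sentence = Sentence(text)
    # Attach an embedding vector to every token (done in place)
    emb_model.embed(sentence)
    # Stack the per-token vectors into a (num_tokens, emb_dim) array;
    # .detach().cpu() keeps this working when flair runs on a GPU
    emb_mat = np.vstack([token.embedding.detach().cpu().numpy() for token in sentence])
    print(emb_mat.shape)
    return emb_mat, sentence.tokens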
    

In [5]:
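def tsne_plot_embeddings(emb_mat, tokens, perplexity=30):
    # Project the embeddings to 2D; perplexity should stay below the number of tokens
    tsne = TSNE(n_components=2, random_state=12, perplexity=perplexity)
    emb_tsne = tsne.fit_transform(np.asarray(emb_mat))
    # Build the frame column by column so the coordinates stay numeric
    df_tsne = pd.DataFrame({
        'Dim1': emb_tsne[:, 0],
        'Dim2': emb_tsne[:, 1],
        # Prefix the position so repeated words get their own color and legend entry
        'token': [f"{i}: {t.text}" for i, t in enumerate(tokens)],
        'text': [t.text for t in tokens],
    })
    fig = px.scatter(df_tsne, x="Dim1", y="Dim2", color="token", text="text")
    fig.update_traces(textposition='top center')
    return fig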
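
A t-SNE layout depends noticeably on the perplexity setting, which is why the cells below pass a different value for each model. The following is a minimal sketch of that sensitivity, assuming a synthetic random matrix (`fake_embeddings`, a hypothetical stand-in with the same 37-row shape as the token matrices in this notebook) rather than real embeddings. The perplexity values are illustrative and are all kept below the number of rows.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for a (num_tokens, emb_dim) embedding matrix; not real embeddings
rng = np.random.default_rng(12)
fake_embeddings = rng.normal(size=(37, 100))

layouts = {}
for perplexity in (5, 15, 30):
    # Same seed for every run, so only the perplexity differs between layouts
    tsne = TSNE(n_components=2, random_state=12, perplexity=perplexity)
    layouts[perplexity] = tsne.fit_transform(fake_embeddings)
    print(perplexity, layouts[perplexity].shape)
```

Each entry in `layouts` is a 2D coordinate array that could be plotted side by side to see how the neighborhoods shift. Distances in a t-SNE plot are only a rough guide, so it helps to check that a cluster survives more than one perplexity value before reading much into it.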

Let's visualize the text below.

This sample has three sentences that mention a train and training at a law school. Words like <i> train </i> are called homographs because they are spelled the same but can have different meanings depending on the context.

In [34]:
text="""
The train briefly stopped at the SBT Terminal.
It was in this city that I underwent my training at law school. 
There were so many qualified professors who used to train us back then. 
"""

GloVe Embeddings

We visualize the GloVe embeddings by reducing them to two dimensions with t-SNE.

From the plot we can see that related words like <i> professors, school </i> and <i> training </i> lie close together. However, the two occurrences of <i> train </i> also appear close to each other, even though they are used in different senses in our text: the train in the first sentence refers to the railway train, while the train in the third sentence means to teach. This is expected, because GloVe assigns each word a single vector regardless of its context.
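
The t-SNE plot is only a projection, so it can help to compare the two <i> train </i> vectors directly in the original embedding space. Below is a minimal sketch using cosine similarity. The helpers `cosine_similarity` and `token_positions` are hypothetical names introduced for illustration, and `toy_table` holds made-up 3-dimensional vectors (not real GloVe values) that mimic a static lookup table.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    u = np.asarray(u, dtype=float).ravel()
    v = np.asarray(v, dtype=float).ravel()
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def token_positions(token_texts, word):
    """Indices of every exact (case-insensitive) occurrence of `word`."""
    return [i for i, t in enumerate(token_texts) if t.lower() == word.lower()]

# Toy static lookup table: one fixed vector per word, whatever the context
# (hypothetical values chosen only for illustration)
toy_table = {
    "the": [0.1, 0.3, 0.2],
    "train": [0.9, 0.1, 0.4],
    "stopped": [0.2, 0.8, 0.5],
}
toy_tokens = ["the", "train", "stopped", "train"]
toy_emb = np.array([toy_table[t] for t in toy_tokens])

first, second = token_positions(toy_tokens, "train")
sim = cosine_similarity(toy_emb[first], toy_emb[second])
print(sim)  # close to 1.0, because both rows come from the same table entry
```

With the objects from this notebook, the same check would be `idx = token_positions([t.text for t in tokens], "train")` followed by `cosine_similarity(embeddings[idx[0]], embeddings[idx[1]])`. For a static model the two rows should be identical, whereas for a contextual model the similarity would be expected to drop below 1.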

In [59]:
perplexity_s=25
glove_embedding = WordEmbeddings('glove')
embeddings,tokens=get_embeddings(text,glove_embedding)
fig=tsne_plot_embeddings(embeddings,tokens,perplexity_s)
fig.update_layout(height=400, title_text="GloVe embeddings")
fig.show()

(37, 100)



In the plot below, ELMo successfully differentiates the word <i> train </i> referring to the locomotive from the word <i> train </i> meaning to teach or coach (and yes, coach is a homograph too!).

This capability, known as word sense disambiguation, is a key strength of the [ELMo model](https://arxiv.org/pdf/1802.05365.pdf).

In [58]:
perplexity_s=30
elmo_embedding =  ELMoEmbeddings()
embeddings,tokens=get_embeddings(text,elmo_embedding)
fig=tsne_plot_embeddings(embeddings,tokens,perplexity_s)
fig.update_layout(height=400, title_text="ELMo embeddings")
fig.show()

(37, 3072)


BERT

The BERT model also clearly separates the two senses. Its representation of each word depends on the other words in the sentence, and it reads that context in both directions.

In [45]:
perplexity_s=34
bert_embedding = TransformerWordEmbeddings('bert-base-uncased')
embeddings,tokens=get_embeddings(text,bert_embedding)
fig=tsne_plot_embeddings(embeddings,tokens,perplexity_s)
fig.update_layout(height=400, title_text=" BERT embeddings")
fig.show()

(37, 768)


In [None]:
# text="""On our way a huge piece of tree bark was lying on the road, so we had to take a longer route to reach Rob's home.
# When we reached his place the dog gave a loud bark and became silent afterward.
# """

In [8]:
# text="""
# I will miss the training school and all the fun I had. 
# I am grateful that I could train a bright batch of students here. It was time to board the train to my hometown. 
# """

In [6]:
# def tsne_plot_embeddings_sns(emb_mat,tokens,perplexity=30):
#     tsne = TSNE(n_components=2,random_state=12,perplexity=perplexity)
#     emb_tsne = tsne.fit_transform(emb_mat)
#     emb_tsne_data = np.vstack((emb_tsne.T, tokens)).T
#     df_tsne = pd.DataFrame(emb_tsne_data, columns=['Dim1', 'Dim2', 'token'])
#     ax = sns.scatterplot(data=df_tsne, x='Dim1', y='Dim2', hue='token')
#     ax.set_title('T-SNE ELMO Embeddings, colored by Part of Speech Tag')
#     plt.legend(bbox_to_anchor=(1.01, 1),borderaxespad=0)
#     label_point(df_tsne.Dim1, df_tsne.Dim2, pd.Series([t.text for t in df_tsne.token]), plt.gca())

In [7]:
# def label_point(x, y, val, ax):
#     a = pd.concat({'x': x, 'y': y, 'val': val}, axis=1)
#     for i, point in a.iterrows():
#         ax.text(point['x']+.02, point['y'], str(point['val']))  