# Tutorial 3 - ELMO: Deep Contextualized Word Representations

<img src="https://live.staticflickr.com/3545/3508966591_9dc9cbe3f5_b.jpg">

Word embeddings are very useful for building meaningful vector-space representations of textual data. In their work titled ["Deep contextualized word representations"](https://arxiv.org/pdf/1802.05365.pdf), Peters et al. present a deep bidirectional language model which models:
1. complex characteristics of word use (e.g., syntax and semantics)
2. how these uses vary across linguistic contexts (i.e., to model polysemy)


The ELMo model uses vectors derived from a bidirectional LSTM that is trained with a coupled language model (LM) objective on a large text corpus, hence the name __ELMo (Embeddings from Language Models)__ representations.


In [1]:
!nvidia-smi

Thu Jul 22 16:24:00 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P0    23W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

### Load Dependencies

In [1]:
!pip install chart-studio

Collecting chart-studio
  Downloading chart_studio-1.1.0-py3-none-any.whl (64 kB)
Installing collected packages: chart-studio
Successfully installed chart-studio-1.1.0


In [2]:
import os
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
from sklearn import preprocessing

from IPython.display import HTML
import logging
logging.getLogger('tensorflow').disabled = True

In [3]:
import chart_studio.plotly as py
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

init_notebook_mode(connected=True)

## TensorFlow-Hub

__TensorFlow Hub__ is a repository of trained machine learning models.
It covers image classification, text embeddings, audio, and video action recognition, and lets you browse trained models and datasets from across the TensorFlow ecosystem.


Loading the pretrained ELMo model from TF-Hub is as simple as specifying the endpoint/URL of the model of interest. In this case, we will be using _version 3_ of the model.

In [4]:
url = "https://tfhub.dev/google/elmo/3"
embed = hub.load(url)

In [6]:
# elmo = hub.Module("https://tfhub.dev/google/elmo/3", trainable=True)
# embeddings = elmo(
#     ["the cat is on the mat", "dogs are in the fog"],
#     signature="default",
#     as_dict=True)["elmo"]


## Contextual Vectors


Techniques covered so far, such as Word2Vec, GloVe, and FastText, are great at preparing a dense vector representation of a given word. Yet they struggle to capture the meaning of the same word in different contexts (_word sense disambiguation_).

No matter how many senses a word has, these static word embedding methods generate a single vector representation for it. This can lead to problems in downstream NLP tasks.

For instance, consider the word __bank__. It can refer to a financial institution, point to a location (the sloping land beside a body of water such as a river), or be used as a verb (to bank upon someone).

Unless a model understands the different meanings of the word __bank__, it may misclassify a sentence about financial institutions as one referring to river banks.
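
To make the limitation concrete, here is a minimal sketch of how a static embedding lookup behaves. The table `static_embeddings`, the helper `embed_static`, and the 3-dimensional vectors are all made up for illustration; they do not come from any trained model.

```python
# Toy static embedding table: one fixed vector per word type.
# The vectors are invented for illustration only.
static_embeddings = {
    "river": [0.9, 0.1, 0.0],
    "money": [0.0, 0.8, 0.3],
    "bank": [0.4, 0.5, 0.2],
}


def embed_static(sentence):
    """Look up each known token; the surrounding context is never consulted."""
    return {
        tok: static_embeddings[tok]
        for tok in sentence.lower().split()
        if tok in static_embeddings
    }


river_sense = embed_static("We walked along the river bank")
money_sense = embed_static("I have deposited my money at the bank")

# The lookup ignores the sentence, so both senses get the same vector.
same_vector = river_sense["bank"] == money_sense["bank"]
print(same_vector)  # True
```

A contextual model, by contrast, is expected to return different vectors for the two occurrences of `bank`.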

Let us understand this better with the help of an example.

In [7]:
# sample dataset
word = 'bank'
wsd_bank = [
            'The river bank is an amazing place',
            'I have deposited my money at the bank',
            'Did you withdraw money from the bank',
            'We walked along the river bank',
            'My wife and I have a joint bank account',
            'She is a dependable person and you can bank upon her'
            ]

In [8]:
# identify the position of the word of interest across the corpus
idx_of_interest = []
idx = 0
for s in wsd_bank:
  for w in s.split():
    if w.lower() in [word, word + 's', word + 'ing', word + 'ed', word + 'er']:
      idx_of_interest.append(idx)
    idx += 1

idx_of_interest

[2, 14, 21, 27, 35, 45]

In [9]:
# transform corpus into tensors
tokens_input = [s.split() for s in wsd_bank]
tokens_input

[['The', 'river', 'bank', 'is', 'an', 'amazing', 'place'],
 ['I', 'have', 'deposited', 'my', 'money', 'at', 'the', 'bank'],
 ['Did', 'you', 'withdraw', 'money', 'from', 'the', 'bank'],
 ['We', 'walked', 'along', 'the', 'river', 'bank'],
 ['My', 'wife', 'and', 'I', 'have', 'a', 'joint', 'bank', 'account'],
 ['She',
  'is',
  'a',
  'dependable',
  'person',
  'and',
  'you',
  'can',
  'bank',
  'upon',
  'her']]

In [10]:
tokens_length = [len(s.split()) for s in wsd_bank]
tokens_length

[7, 8, 7, 6, 9, 11]

In [11]:
# convert the list of sentences into a string tensor (one sentence per element)
sentences = tf.constant(wsd_bank)
sentences

<tf.Tensor: shape=(6,), dtype=string, numpy=
array([b'The river bank is an amazing place',
       b'I have deposited my money at the bank',
       b'Did you withdraw money from the bank',
       b'We walked along the river bank',
       b'My wife and I have a joint bank account',
       b'She is a dependable person and you can bank upon her'],
      dtype=object)>

In [12]:
words = tf.strings.split(sentences, ' ')
words

<tf.RaggedTensor [[b'The', b'river', b'bank', b'is', b'an', b'amazing', b'place'], [b'I', b'have', b'deposited', b'my', b'money', b'at', b'the', b'bank'], [b'Did', b'you', b'withdraw', b'money', b'from', b'the', b'bank'], [b'We', b'walked', b'along', b'the', b'river', b'bank'], [b'My', b'wife', b'and', b'I', b'have', b'a', b'joint', b'bank', b'account'], [b'She', b'is', b'a', b'dependable', b'person', b'and', b'you', b'can', b'bank', b'upon', b'her']]>

In [13]:
# standardize the size and transform ragged tensors to fixed length tensors
tokens = words.to_tensor(default_value='', 
                         shape=[None, max(tokens_length)])
tokens

<tf.Tensor: shape=(6, 11), dtype=string, numpy=
array([[b'The', b'river', b'bank', b'is', b'an', b'amazing', b'place',
        b'', b'', b'', b''],
       [b'I', b'have', b'deposited', b'my', b'money', b'at', b'the',
        b'bank', b'', b'', b''],
       [b'Did', b'you', b'withdraw', b'money', b'from', b'the', b'bank',
        b'', b'', b'', b''],
       [b'We', b'walked', b'along', b'the', b'river', b'bank', b'', b'',
        b'', b'', b''],
       [b'My', b'wife', b'and', b'I', b'have', b'a', b'joint', b'bank',
        b'account', b'', b''],
       [b'She', b'is', b'a', b'dependable', b'person', b'and', b'you',
        b'can', b'bank', b'upon', b'her']], dtype=object)>

### ELMO Embeddings

The ELMo model has three layers, and the embedding you get depends on which layer's output is used for inference.

![](https://i.imgur.com/zNe5Ydx.png)

[Source](http://jalammar.github.io/illustrated-bert/)

![](https://i.imgur.com/E65GAvp.png)

From the TF-Hub documentation, we have the following output options available:
+ __word_emb__: the character-based word representations with shape [batch_size, max_length, 512].
+ __lstm_outputs1__: the first LSTM hidden state with shape [batch_size, max_length, 1024].
+ __lstm_outputs2__: the second LSTM hidden state with shape [batch_size, max_length, 1024].
+ __elmo__: the weighted sum of the 3 layers, where the weights are trainable. This tensor has shape [batch_size, max_length, 1024].
+ __default__: a fixed mean-pooling of all contextualized word representations with shape [batch_size, 1024].
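
The __elmo__ output above is described as a trainable weighted sum of the 3 layers. The sketch below illustrates that idea for a single token, assuming the formulation from the Peters et al. paper (softmax-normalized layer weights and a scalar `gamma`). The helper names `softmax` and `elmo_mix`, the 4-dimensional toy vectors, and the scores are hypothetical, and the sketch assumes the three layer vectors already share the same dimensionality. It is not the TF-Hub implementation.

```python
import math


def softmax(scores):
    """Normalize raw layer scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def elmo_mix(layer_vectors, layer_scores, gamma=1.0):
    """Collapse the per-layer vectors of one token into a single vector."""
    weights = softmax(layer_scores)
    dim = len(layer_vectors[0])
    return [
        gamma * sum(w * layer[d] for w, layer in zip(weights, layer_vectors))
        for d in range(dim)
    ]


# Three toy 4-dimensional layer outputs for a single token (illustrative only).
layers = [
    [1.0, 0.0, 0.0, 0.0],  # stand-in for the token layer
    [0.0, 1.0, 0.0, 0.0],  # stand-in for the first LSTM layer
    [0.0, 0.0, 1.0, 0.0],  # stand-in for the second LSTM layer
]

# Equal scores give every layer the same weight (1/3 each).
mixed = elmo_mix(layers, layer_scores=[0.0, 0.0, 0.0])
print(mixed)
```

In a trained model the scores and `gamma` would be learned for the downstream task, which is why the mix can emphasize different layers for different tasks.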

For this section, as we are interested in understanding how the model understands and builds different embeddings of the same word used in different contexts, we will make use of __lstm_outputs1__.

You can experiment with other layers as well and note the difference in behaviour.

In [22]:
# get embeddings
outputs = embed.signatures["tokens"](tokens=tokens,
                                     sequence_len=tf.constant(tokens_length))

In [None]:
outputs

In [17]:
# flatten the output tensor into one vector per token,
# skipping padded positions (returned as all-zero vectors)
elmo_vectors = [j.numpy()
                for i in outputs['lstm_outputs1']
                for j in i
                if not np.all(j.numpy() == 0)]

In [20]:
len(elmo_vectors)

48

In [21]:
elmo_vectors[0].shape

(1024,)

## Visualize the Embeddings

We have the embeddings for each of the words in our sample corpus. Let us now visualize the contextual embeddings to understand how ELMO picks up on the context of any given word.

Note that in this sample setup, our word of interest is __bank__ and its usage in different contexts.

We will perform the following steps to visualize the contextual embeddings:

+ __Dimensionality Reduction__ : ELMo transforms each word into a $1024$-dimensional contextual embedding vector. To visualize such a large vector, we first project it into a manageable 2D space using __Principal Component Analysis (PCA)__.
+ __Scatter Plot__ : We plot each usage of the _word of interest_ on the scatter plot and label/annotate each data point with the sentence where the word was used.

In [24]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca_output = pca.fit_transform(elmo_vectors)

In [26]:
data = [
    go.Scatter(
        # keep only the PCA points that belong to the word of interest
        x=[i[0] for idx, i in enumerate(pca_output) if idx in idx_of_interest],
        y=[i[1] for idx, i in enumerate(pca_output) if idx in idx_of_interest],
        mode='markers+text',
        textposition="bottom center",
        text=wsd_bank,  # label each point with its source sentence
        marker=dict(
            size=16,
            color=idx_of_interest,  # set color equal to a variable
            opacity=0.8,
            colorscale='sunset',
            showscale=False
        )
    )
]
layout = dict(
    yaxis=dict(zeroline=True),
    xaxis=dict(zeroline=True),
)
fig = go.Figure(data=data, layout=layout)
fig.show(renderer="colab")

This is pretty interesting. As you can see in the above plot, our sample corpus had 6 sentences containing the word __bank__ in different contexts. Vectors from contexts where the meaning is similar are placed close together, while vectors from contexts where the word refers to something entirely different lie farther apart.

The vectors where the word _bank_ is used in the context of a __financial institution__ are placed closer to each other than to its usage in the context of a __location__ (river bank) or as a __verb__ (to bank upon a dependable person).
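
The visual impression of "nearby" vectors can also be checked numerically. The sketch below computes pairwise cosine similarity. The helper names `cosine_similarity` and `pairwise_similarity` and the 2-dimensional toy vectors are assumptions made for illustration; they are not values produced by ELMo.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_similarity(vectors):
    """Similarity of every vector with every other vector."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Toy stand-ins for contextual vectors of "bank" (illustrative values only).
toy_vectors = [
    [1.0, 0.1],  # e.g. a river-bank usage
    [0.1, 1.0],  # e.g. a financial usage
    [0.2, 0.9],  # e.g. another financial usage
]

sims = pairwise_similarity(toy_vectors)
for row in sims:
    print([round(s, 2) for s in row])
```

In the notebook, the same helper could be applied to `[elmo_vectors[i] for i in idx_of_interest]` to compare the six usages of _bank_ in the original 1024-dimensional space rather than in the 2D PCA projection.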

## Application: Document clustering with Elmo Embeddings

We have performed document clustering in previous notebooks using different embedding techniques (FastText, Word2Vec, etc.).

In this section, we will use ELMo embeddings to understand how they perform at a document level.

+ Remember that ELMo provides us a direct interface to get sentence-level embeddings.
+ This is in contrast to earlier techniques, where we had to average or otherwise aggregate word-level embeddings to get document/sentence-level embeddings.
+ Also note that ELMo learns contextual embeddings, hence it is imperative that the model uses the sentences/corpus as-is (without any pre-processing).
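
The __default__ output is documented as a fixed mean-pooling of the contextualized word representations. As a rough illustration of what mean-pooling over a padded batch involves, here is a minimal sketch. The helper `mean_pool` and the tiny 2-dimensional batch are hypothetical, and this is not the TF-Hub implementation.

```python
def mean_pool(token_vectors, lengths):
    """Average the first `length` token vectors of each sentence, skipping padding."""
    pooled = []
    for vectors, length in zip(token_vectors, lengths):
        real = vectors[:length]  # drop padded positions
        dim = len(real[0])
        pooled.append([sum(vec[d] for vec in real) / length for d in range(dim)])
    return pooled


# Two toy sentences padded to 3 tokens, with 2-dimensional token vectors.
batch = [
    [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]],  # 2 real tokens + 1 padding token
    [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]],  # 3 real tokens
]

sentence_vectors = mean_pool(batch, lengths=[2, 3])
print(sentence_vectors)  # [[2.0, 3.0], [4.0, 4.0]]
```

Dividing by the true sentence length rather than the padded length keeps padding from pulling short sentences toward zero.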

In [27]:
# sample corpus
corpus = ['The sky is blue and beautiful.',
          'Love this blue and beautiful sky!',
          'The quick brown fox jumps over the lazy dog.',
          "A king's breakfast has sausages, ham, bacon, eggs, toast and beans",
          'I love green eggs, ham, sausages and bacon!',
          'The brown fox is quick and the blue dog is lazy!',
          'The sky is very blue and the sky is very beautiful today',
          'The dog is lazy but the brown fox is quick!'    
]
labels = ['weather', 'weather', 'animals', 'food', 'food', 'animals', 'weather', 'animals']

In [28]:
corpus = np.array(corpus)
corpus_df = pd.DataFrame({'Document': corpus, 
                          'Category': labels})
corpus_df = corpus_df[['Document', 'Category']]

In [None]:
# get elmo sentence embeddings
elmo_sent_vectors = embed.signatures["default"](tf.constant(corpus))['default']
elmo_sent_vectors.shape

TensorShape([8, 1024])

In [None]:
from sklearn.cluster import AffinityPropagation

ap = AffinityPropagation()
ap.fit(elmo_sent_vectors.numpy())

cluster_labels = ap.labels_
cluster_labels = pd.DataFrame(cluster_labels, 
                              columns=['ClusterLabel'])

pd.concat([corpus_df, cluster_labels], axis=1)

| | Document | Category | ClusterLabel |
|---|---|---|---|
| 0 | The sky is blue and beautiful. | weather | 0 |
| 1 | Love this blue and beautiful sky! | weather | 0 |
| 2 | The quick brown fox jumps over the lazy dog. | animals | 2 |
| 3 | A king's breakfast has sausages, ham, bacon, e... | food | 1 |
| 4 | I love green eggs, ham, sausages and bacon! | food | 1 |
| 5 | The brown fox is quick and the blue dog is lazy! | animals | 2 |
| 6 | The sky is very blue and the sky is very beaut... | weather | 0 |
| 7 | The dog is lazy but the brown fox is quick! | animals | 2 |
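
In the output above, each cluster lines up with a single category. One simple way to quantify this agreement is cluster purity, sketched below. The `categories` and `cluster_ids` lists are copied from the output shown above, while the purity calculation itself is an addition for illustration and not part of the original notebook.

```python
from collections import Counter, defaultdict

# Values copied from the clustering output shown above.
categories = ['weather', 'weather', 'animals', 'food',
              'food', 'animals', 'weather', 'animals']
cluster_ids = [0, 0, 2, 1, 1, 2, 0, 2]

# Group the true categories by the cluster each document was assigned to.
members = defaultdict(list)
for cluster, category in zip(cluster_ids, categories):
    members[cluster].append(category)

# For each cluster, find its most common category and how often it occurs.
majority = {c: Counter(cats).most_common(1)[0] for c, cats in members.items()}

# Purity: fraction of documents that belong to their cluster's majority category.
purity = sum(count for _, count in majority.values()) / len(categories)

for cluster in sorted(majority):
    print(cluster, majority[cluster][0])
print(purity)  # 1.0
```

A purity of 1.0 on such a small corpus only shows that the clusters match the labels here; it says little about behaviour on larger or noisier collections.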
