# Using ELMO

In this notebook, we explore ELMo embeddings. Link to the ELMo paper: https://arxiv.org/pdf/1802.05365.pdf

Also have a look at the elmo tutorial provided by allennlp: https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md

This notebook is heavily inspired by the above tutorial, except that it is in a notebook format for easy reproducibility.

Another tutorial that explains ELMo very well: http://jalammar.github.io/illustrated-bert/ (start reading from the section on word embeddings)

### Elmo in a nutshell

Word embeddings are quite powerful. They give a dense representation (an embedding vector) for each word, and these vectors capture relations between words. Word embeddings are conceptually simple: given a word, it is converted to an embedding vector, and that vector is the same in whichever sentence the word appears.

ELMo takes it a step further. Instead of computing an embedding from the word alone, it looks at the whole sentence before giving the embedding of each word. These are called contextualized embeddings, since the whole context is considered before deciding the embedding of a particular word. Internally, ELMo uses a bidirectional LSTM language model.

### Why Contextualized Embeddings?

Contextualized embeddings are important because words by themselves are ambiguous, and their meaning only becomes clear from their usage. For example, consider the two sentences:
1. "When not flying, bats hang upside down from their feet, a posture known as roosting."
2. "Bats are made of either wood, or a metal alloy (typically aluminum)."

Given just the word "bat", it is difficult to tell whether it means the animal or a wooden club, but there is no ambiguity when it occurs in a whole sentence. ELMo exploits this context to give richer representations for the words.

Note that the above case is an example of homography (one spelling, unrelated meanings). ELMo handles polysemy (one word with several related senses) in a similar way.

### Elmo as a Model

ELMo has three layers. The first is a context-independent word-level representation computed from the word's characters. The second and third layers are bidirectional LSTMs.

The authors report that the second layer (the lower biLSTM) is better for syntactic tasks like POS tagging, while the third layer (the upper biLSTM) is better for more semantic tasks like word sense disambiguation.

ELMo has trainable parameters for weighting each of the layers. For a task like POS tagging, one would therefore expect the learned weight of the second layer to be higher, and for a more semantic task the weight of the third layer.
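
The layer weighting described above can be sketched in a few lines. In the paper, the final vector for a token is a scalar `gamma` times the softmax-weighted sum of that token's layer vectors. The snippet below is only an illustration of that formula in plain Python: the function names, the toy 2-dimensional vectors and the score values are all made up for this example, and in a real model the scores and `gamma` are learned.

```python
import math


def softmax(scores):
    """Turn unnormalized layer scores into weights that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_layers(layer_vectors, scores, gamma=1.0):
    """Weighted sum of one token's per-layer vectors.

    layer_vectors: list of equal-length lists, one vector per ELMo layer.
    scores, gamma: hypothetical stand-ins for the learned parameters.
    """
    weights = softmax(scores)
    dim = len(layer_vectors[0])
    return [
        gamma * sum(w * vec[d] for w, vec in zip(weights, layer_vectors))
        for d in range(dim)
    ]


# Toy vectors for a single token (real ELMo vectors have 1024 dimensions).
toy_layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

uniform = mix_layers(toy_layers, scores=[0.0, 0.0, 0.0])    # equal weights
top_heavy = mix_layers(toy_layers, scores=[0.0, 0.0, 5.0])  # favours layer 3
print(uniform, top_heavy)
```

With equal scores every layer contributes one third. Raising the score of the third layer pulls the result towards that layer's vector, which is what a semantic task would be expected to learn.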

### Using Elmo in your code

There are a few different ways you might want to use ELMo embeddings, depending on your use case:
1. Use the ELMo embeddings directly in your existing framework.
2. Use the ELMo embeddings directly, but learn the weight of each layer to compute the final representation.
3. Fine-tune ELMo on the corpus you want to use, and then use the fine-tuned model.
4. Train ELMo along with other PyTorch models.

#### 1. Use elmo embeddings directly

There are essentially two ways to go about this:
1. Store the ELMo embeddings of the sentences in an HDF5 file (read here with `h5py`).
2. Retrieve the embeddings interactively.

##### Storing in an HDF5 file

In [2]:
sentences = ['You are a bold one.', 'Hello there!', 'Perhaps the archives are incomplete.']

In [3]:
with open('sent.txt', 'w') as f:
    f.write('\n'.join(sentences))

In your terminal do: `allennlp elmo sent.txt elmo_layers.hdf5 --all`. This should create `elmo_layers.hdf5`. We can open and check it.

In [8]:
import h5py
h5_file = h5py.File('elmo_layers.hdf5', 'r')

In [12]:
[k for k in h5_file.keys()]

['0', '1', '2', 'sentence_to_index']

In [14]:
h5_file['0'], h5_file['1'], h5_file['2']

(<HDF5 dataset "0": shape (3, 5, 1024), type "<f4">,
 <HDF5 dataset "1": shape (3, 2, 1024), type "<f4">,
 <HDF5 dataset "2": shape (3, 5, 1024), type "<f4">)

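Each numbered dataset holds one sentence and is stored as (NumLayers x SeqLen x EmbeddingDim). The leading 3 is the number of ELMo layers (all of them are written because of `--all`), not a batch size, and the second dimension is the number of tokens in that sentence.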
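
When looping over such a file, note that the key names are strings, so a plain sort would put `'10'` before `'2'` once there are more than ten sentences. The helper below is a hypothetical convenience function (not part of allennlp or h5py) that keeps only the numeric keys and orders them numerically.

```python
def sentence_keys(keys):
    """Return the per-sentence dataset names in numeric order.

    Non-numeric entries such as 'sentence_to_index' are skipped, and the
    sort is numeric so that '10' comes after '2'.
    """
    return sorted((k for k in keys if k.isdigit()), key=int)


# Key names as listed above, plus a longer made-up listing to show the ordering.
print(sentence_keys(['0', '1', '2', 'sentence_to_index']))        # ['0', '1', '2']
print(sentence_keys(['0', '1', '10', '2', 'sentence_to_index']))  # ['0', '1', '2', '10']

# With the file opened above, the helper would be used roughly like this:
# for key in sentence_keys(h5_file.keys()):
#     layers = h5_file[key][...]  # NumPy array of shape (layers, tokens, 1024)
```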

##### Retrieve it interactively

In [15]:
from allennlp.commands.elmo import ElmoEmbedder

In [68]:
from allennlp.data.tokenizers.word_tokenizer import WordTokenizer

Initialize the embedder

In [21]:
elmo_emb = ElmoEmbedder()

12/11/2018 15:03:25 - INFO - allennlp.commands.elmo -   Initializing ELMo.


In [70]:
wtokenizer = WordTokenizer()
tokenized_sents = wtokenizer.batch_tokenize(sentences)

In [74]:
tokenized_sents

[[You, are, a, bold, one, .],
 [Hello, there, !],
 [Perhaps, the, archives, are, incomplete, .]]

In [78]:
tok_sents = [[x.text for x in y] for y in tokenized_sents]

In [79]:
tok_sents

[['You', 'are', 'a', 'bold', 'one', '.'],
 ['Hello', 'there', '!'],
 ['Perhaps', 'the', 'archives', 'are', 'incomplete', '.']]

In [85]:
vec_elmo = list(elmo_emb.embed_sentences(tok_sents))



In [87]:
[v.shape for v in vec_elmo]

[(3, 6, 1024), (3, 3, 1024), (3, 6, 1024)]

In [96]:
vec_elmo[0][0]

array([[ 0.61176616, -0.1803728 , -0.6626564 , ...,  0.108252  ,
        -0.31069914, -0.76622486],
       [-0.03124004,  0.08035831, -0.282419  , ...,  0.03819396,
         0.4789119 ,  0.08654939],
       [ 0.10400566,  0.12288515, -0.07056469, ..., -0.12283114,
        -0.02834528, -0.06579691],
       [ 1.1506842 , -0.05340729, -0.30548504, ..., -0.2646592 ,
        -0.4776665 ,  0.10963759],
       [ 0.31488723, -0.08592107, -0.39453682, ..., -0.66952085,
         0.08430362,  0.26585263],
       [-0.88715065, -0.20039944, -1.060133  , ..., -0.26554623,
         0.21145949,  0.19772954]], dtype=float32)

In [97]:
h5_file['0'][0]

array([[ 0.61176616, -0.1803728 , -0.6626564 , ...,  0.108252  ,
        -0.31069914, -0.76622486],
       [-0.03124004,  0.08035831, -0.282419  , ...,  0.03819396,
         0.4789119 ,  0.08654939],
       [ 0.10400566,  0.12288515, -0.07056469, ..., -0.12283114,
        -0.02834528, -0.06579691],
       [ 1.1506842 , -0.05340729, -0.30548504, ..., -0.2646592 ,
        -0.4776665 ,  0.10963759],
       [-0.543568  , -0.09246673, -1.7003565 , ..., -1.9184163 ,
        -0.04922673,  0.6620259 ]], dtype=float32)

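The only difference between the command line and this method is that we tokenize explicitly in the latter. The explicit tokenizer splits the trailing punctuation into its own token, so the first sentence has 6 tokens here but only 5 in the HDF5 file. This gives an extra representation for the punctuation, and it also explains why the last row differs between the two outputs above: one is the vector for `.` and the other is the vector for the unsplit `one.`.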
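
The token counts can be checked without running ELMo at all. The sketch below assumes that the command-line tool splits on whitespace, which is consistent with the sequence lengths 5, 2 and 5 in the HDF5 file. The regex tokenizer is only a rough stand-in for `WordTokenizer` that happens to agree with it on these three sentences. It is not the same algorithm.

```python
import re

sentences = ['You are a bold one.', 'Hello there!', 'Perhaps the archives are incomplete.']


def whitespace_tokens(sentence):
    """Split on whitespace only, so punctuation stays glued to the word."""
    return sentence.split()


def punct_aware_tokens(sentence):
    """Illustrative stand-in: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


print([len(whitespace_tokens(s)) for s in sentences])   # [5, 2, 5]
print([len(punct_aware_tokens(s)) for s in sentences])  # [6, 3, 6]
print(whitespace_tokens(sentences[0])[-1], punct_aware_tokens(sentences[0])[-2:])
```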

#### 2. Learning only the layer weights for the final representation

In [88]:
from allennlp.modules.elmo import Elmo, batch_to_ids

In [89]:
options_file = "https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_options.json"
weight_file = "https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5"

In [90]:
elmo = Elmo(options_file, weight_file, 1, dropout=0)

12/11/2018 15:52:15 - INFO - allennlp.modules.elmo -   Initializing ELMo


The different parameters are:
- (i) options file: the configuration file for ELMo.
- (ii) weights file: the pretrained model weights to load.
- (iii) number of representations: how many separately weighted combinations of the layers to return. In most cases 1 suffices, namely when you only use the ELMo embeddings at the input of your model. If you also need to use them at the output, make this 2.

In [99]:
tok_sents

[['You', 'are', 'a', 'bold', 'one', '.'],
 ['Hello', 'there', '!'],
 ['Perhaps', 'the', 'archives', 'are', 'incomplete', '.']]

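First we convert each word to character ids. ELMo builds its word-level representation with a character-level convolution instead of a fixed word vocabulary, so it can handle words it has never seen, and hence it requires character-level input. Note that, despite the variable name `char_embs` used below, `batch_to_ids` returns integer character ids, not embeddings.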

In [102]:
char_embs = batch_to_ids(tok_sents)

In [103]:
char_embs.shape

torch.Size([3, 6, 50])

In [104]:
char_embs

tensor([[[259,  90, 112, 118, 260, 261, 261, 261, 261, 261, 261, 261, 261, 261,
          261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261,
          261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261,
          261, 261, 261, 261, 261, 261, 261, 261],
         [259,  98, 115, 102, 260, 261, 261, 261, 261, 261, 261, 261, 261, 261,
          261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261,
          261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261,
          261, 261, 261, 261, 261, 261, 261, 261],
         [259,  98, 260, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261,
          261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261,
          261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261, 261,
          261, 261, 261, 261, 261, 261, 261, 261],
         [259,  99, 112, 109, 101, 260, 261, 261, 261, 261, 261, 261, 261, 261,
          261, 261, 261, 261, 261, 261, 261, 26
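
The ids in the tensor above follow a simple pattern that can be reproduced by hand. The sketch below is inferred from the printed rows only, not copied from allennlp: each row starts with 259, each character appears as its UTF-8 byte value plus one, 260 closes the word, and 261 pads the row to 50 entries. The constant and function names are made up for this illustration, and `batch_to_ids` remains the function to use in practice.

```python
MAX_WORD_LENGTH = 50  # width of each row in the tensor above
BEGIN_WORD_ID = 259   # first id in every row shown above
END_WORD_ID = 260     # id that follows the last character
PAD_ID = 261          # fills the rest of the row


def word_to_char_ids(word):
    """Encode one word as a fixed-width list of character ids (illustrative)."""
    # Shift every UTF-8 byte by one and leave room for the two boundary ids.
    byte_ids = [b + 1 for b in word.encode('utf-8')][:MAX_WORD_LENGTH - 2]
    ids = [BEGIN_WORD_ID] + byte_ids + [END_WORD_ID]
    return ids + [PAD_ID] * (MAX_WORD_LENGTH - len(ids))


for word in ['You', 'are', 'a', 'bold']:
    print(word, word_to_char_ids(word)[:7])
```

The printed prefixes match the first four rows of the tensor above, for example `[259, 90, 112, 118, 260, 261, 261]` for `You`.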

In [105]:
emb_elmo = elmo(char_embs)



The output is a dict with keys `elmo_representations` and `mask`.

In [108]:
type(emb_elmo), emb_elmo.keys()

(dict, dict_keys(['elmo_representations', 'mask']))

In [109]:
emb_elmo_vec = emb_elmo['elmo_representations']

`elmo_representations` is a list whose length equals the number of representations requested when constructing `Elmo` (1 here).

In [112]:
type(emb_elmo_vec), len(emb_elmo_vec)

(list, 1)

In [113]:
emb_elmo_vec0 = emb_elmo_vec[0]

In [114]:
emb_elmo_vec0.shape

torch.Size([3, 6, 1024])

The mask is simply 1 where a word is present and 0 where a shorter sentence was padded to the length of the longest sentence in the batch.

In [115]:
mask = emb_elmo['mask']

In [116]:
mask

tensor([[1, 1, 1, 1, 1, 1],
        [1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1]])
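
A typical use of the mask is to ignore the padded positions when pooling token vectors into one sentence vector. The sketch below shows the idea in plain Python on made-up 2-dimensional vectors. `masked_mean` is a hypothetical helper, not an allennlp function.

```python
def masked_mean(token_vectors, mask_row):
    """Average only the vectors whose mask entry is 1."""
    kept = [vec for vec, m in zip(token_vectors, mask_row) if m == 1]
    if not kept:
        raise ValueError("mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# The mask printed above, as plain lists.
mask_rows = [[1, 1, 1, 1, 1, 1],
             [1, 1, 1, 0, 0, 0],
             [1, 1, 1, 1, 1, 1]]
lengths = [sum(row) for row in mask_rows]
print(lengths)  # [6, 3, 6]

# Toy vectors for the second sentence: three real tokens, three padded slots.
toy_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0],
               [9.0, 9.0], [9.0, 9.0], [9.0, 9.0]]
print(masked_mean(toy_vectors, mask_rows[1]))  # [3.0, 4.0]
```

On the actual tensors, the same pooling could presumably be written as `(emb_elmo_vec0 * mask.unsqueeze(-1).float()).sum(dim=1) / mask.sum(dim=1, keepdim=True).float()`.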

#### 3. Fine-tune on a separate corpus

The recommended way is to use this repository: https://github.com/allenai/bilm-tf.

In particular see: https://github.com/allenai/bilm-tf#how-to-do-fine-tune-a-model-on-additional-unlabeled-data

#### 4. Train elmo along with other pytorch models.

Simply set `requires_grad=True` when instantiating the `Elmo` module.

In [117]:
elmo = Elmo(options_file, weight_file, 1, dropout=0, requires_grad=True)

12/11/2018 16:10:31 - INFO - allennlp.modules.elmo -   Initializing ELMo


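This allows the whole ELMo model to be trained: gradients also flow into the pretrained language model weights, rather than only into the layer-mixing weights.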