# Handling missing words in embeddings
* I think the best solution to this problem is to use a language model that can generate embedding vectors even for words it has never seen. This is the case with fastText and ELMo embeddings / models, which build word vectors from subword or character-level information.
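
To make the idea concrete, here is a toy sketch of how a subword model can still produce a vector for an unseen word: split the word into character n-grams, look each n-gram up in a hashed table, and average the results. This is only an illustration in the spirit of fastText, not its actual implementation; `DIM`, `BUCKETS`, `stable_hash` and `bucket_vector` are made-up names, and the pseudo-random vectors stand in for trained n-gram vectors.

```python
import random

DIM = 8          # toy embedding size (assumption; real models use hundreds of dimensions)
BUCKETS = 1000   # toy number of hash buckets for n-grams (assumption)

def char_ngrams(word, n_min=3, n_max=5):
    """Character n-grams of the word wrapped in '<' and '>' boundary markers."""
    marked = "<" + word + ">"
    grams = [marked[i:i + n]
             for n in range(n_min, n_max + 1)
             for i in range(len(marked) - n + 1)]
    return grams or [marked]   # very short words fall back to the marked word itself

def stable_hash(text):
    """Small deterministic string hash (Python's built-in hash() is salted per process)."""
    h = 0
    for ch in text:
        h = (h * 31 + ord(ch)) % BUCKETS
    return h

def bucket_vector(bucket):
    """Deterministic pseudo-random vector standing in for a trained n-gram vector."""
    rng = random.Random(bucket)
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]

def subword_vector(word):
    """Average the vectors of all character n-grams of the word."""
    vectors = [bucket_vector(stable_hash(g)) for g in char_ngrams(word)]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

vec = subword_vector("hungryish")   # a made-up word that no vocabulary would contain
print(len(vec))
```

Because the vector is assembled from n-grams, no word is ever "missing"; words that share many n-grams also end up with similar vectors.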


**The authors train a char-CNN + bi-LSTM language model, then use the internal representations from this model to produce "ELMo" representations.**

https://stackoverflow.com/questions/53798582/is-elmo-a-word-embedding-or-a-sentence-embedding

The output dictionary contains:

word_emb: the character-based word representations with shape [batch_size, max_length, 512].

lstm_outputs1: the first LSTM hidden state with shape [batch_size, max_length, 1024].

lstm_outputs2: the second LSTM hidden state with shape [batch_size, max_length, 1024].

elmo: the weighted sum of the 3 layers, where the weights are trainable. This tensor has shape [batch_size, max_length, 1024].

default: a fixed mean-pooling of all contextualized word representations with shape [batch_size, 1024].

**The 4th output (elmo) is the actual contextual word embedding. The 5th one (default) reduces the sequence output of the 4th to a single vector, effectively turning the whole thing into a sentence embedding.**
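
The step from per-token vectors to one sentence vector can be sketched in plain Python. This is a minimal illustration of mean pooling over a padded batch, assuming padding positions are excluded from the average; `mean_pool` is a hypothetical helper, not part of the hub module.

```python
def mean_pool(embeddings, lengths):
    """Average the first `length` token vectors of each sentence.

    embeddings: nested list shaped [batch_size, max_length, dim]
    lengths:    true sentence lengths, shaped [batch_size]
    returns:    nested list shaped [batch_size, dim]
    """
    pooled = []
    for sentence, length in zip(embeddings, lengths):
        real_tokens = sentence[:length]      # drop the padded positions
        dim = len(sentence[0])
        pooled.append([sum(token[d] for token in real_tokens) / length
                       for d in range(dim)])
    return pooled

# toy batch: 2 sentences, max_length 3, dim 2; the second sentence has one padded slot
batch = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
         [[2.0, 0.0], [4.0, 2.0], [0.0, 0.0]]]
print(mean_pool(batch, [3, 2]))   # [[3.0, 4.0], [3.0, 1.0]]
```

The shape goes from [batch_size, max_length, dim] to [batch_size, dim], which matches the relation between the elmo and default outputs listed above.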

How can I train my own ELMo embeddings on my own data for later use in the model?
* Sadly, this cannot be done with the TensorFlow Hub model directly. You have to train the biLM model used in the paper and convert it to a TensorFlow Hub module yourself. The code to train the biLM is at https://github.com/allenai/bilm-tf.

# Getting necessary imports

In [1]:
import tensorflow as tf
import tensorflow_hub as hub

#  Importing ELMo that's downloaded locally
* original_file.tar.gz extracted using 7-Zip
 
* https://tfhub.dev/google/elmo/2 **Reference**

In [2]:
elmo = hub.Module("D:/dataset/Embedding/tf_module_ELMO2", trainable=False) # trainable = True / False

# Using [elmo]
**elmo: the weighted sum of the 3 layers, where the weights are trainable. This tensor has shape [batch_size, max_length, 1024]**
* has actual word embedding
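
A rough sketch of what "weighted sum of the 3 layers" means, following the formulation in the ELMo paper: the three layer weights are softmax-normalized, the layers are summed with those weights, and the result is multiplied by a single scaling factor. It assumes the hub module follows the paper and that all three layers already have the same size; `scalar_mix` is a hypothetical helper written for illustration.

```python
import math

def softmax(weights):
    """Normalize raw weights so they are positive and sum to 1."""
    largest = max(weights)
    exps = [math.exp(w - largest) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]

def scalar_mix(layers, weights, scaling):
    """Return scaling * sum_j softmax(weights)[j] * layers[j] for one token.

    layers:  list of equally sized vectors, one per biLM layer
    weights: one raw (trainable) weight per layer
    scaling: a single (trainable) scalar
    """
    s = softmax(weights)
    dim = len(layers[0])
    return [scaling * sum(s[j] * layers[j][d] for j in range(len(layers)))
            for d in range(dim)]

# toy token with 3 layers of size 2; equal weights give a plain average
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = scalar_mix(layers, weights=[0.0, 0.0, 0.0], scaling=1.0)
print(mixed)
```

With all raw weights equal, the mix is simply the average of the layers; training can then shift the weights toward whichever layer helps the downstream task.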

# 1.  Passing sentences as input
**signature="default"**
* word_emb: the character-based word representations with shape [batch_size, max_length, 512].

# without seq len

In [3]:
# tokens per sentence: 6, 5, 5
sentence_embeddings = elmo(["the cat is on the mat ", "dogs are in the fog","I am hungry want food"],
                  signature="default",
                  as_dict=True)["elmo"]

# The default signature also needs the length of each sentence; the module computes it in the background after splitting each string on spaces.

INFO:tensorflow:Saver not created because there are no variables in the graph to restore


In [4]:
sentence_embeddings.shape
# [batch_size, max_length, 1024]

TensorShape([Dimension(3), Dimension(6), Dimension(1024)])

# with seq len

In [12]:
sentence_embedding_with_seq_len = elmo( 
                                    inputs = { #6                           # 7                          #8
                                    ["the cat is on the mat ", "dogs are in the fog extra text"," long sent be I am hungry want food"],
                                    "sequence_len"  : 8                                         
                                    },
                                    signature = "default",
                                    as_dict=True)["elmo"]

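SyntaxError: invalid syntax (<ipython-input-12-a90fc7d0fda8>, line 6)

* This fails because the list of sentences sits inside a dict literal without a key. The default signature takes only a batch of strings and has no "sequence_len" input; explicit lengths are passed with signature="tokens" (next section).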
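
If explicit sequence lengths are wanted for raw sentences, one option is to tokenize and pad them by hand and then use signature="tokens". The helper below is a minimal sketch of that preprocessing; `to_tokens_input` is a hypothetical name, and `str.split()` is assumed to be close enough to the module's own split on spaces.

```python
def to_tokens_input(sentences, pad=""):
    """Split each sentence on whitespace and pad all token lists to the same length.

    Returns (padded_tokens, lengths), where lengths holds the true token counts.
    """
    tokenized = [sentence.split() for sentence in sentences]
    lengths = [len(tokens) for tokens in tokenized]
    max_length = max(lengths)
    padded = [tokens + [pad] * (max_length - len(tokens)) for tokens in tokenized]
    return padded, lengths

tokens_input, tokens_length = to_tokens_input(
    ["the cat is on the mat ",
     "dogs are in the fog extra text",
     " long sent be I am hungry want food"])

print(tokens_length)          # [6, 7, 8]
print(len(tokens_input[0]))   # 8, every row is padded to the longest sentence
```

The two return values have the same layout as the "tokens" and "sequence_len" inputs that the tokens signature takes.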

# 2. Passing tokenized sentences as input
**signature="tokens"**



In [13]:
tokens_input = [["the", "cat", "is", "on", "the", "mat"],
                ["dogs", "are", "in", "the", "fog", ""]]
# the tokens signature also needs the true length of each sentence, so that the "" padding is ignored
tokens_length = [6, 5]

token_embeddings = elmo(inputs={
                            "tokens": tokens_input,
                            "sequence_len": tokens_length
                          },
                  signature="tokens",
                  as_dict=True)["elmo"]


INFO:tensorflow:Saver not created because there are no variables in the graph to restore


In [14]:
token_embeddings.shape
# [batch_size, max_length, 1024]

TensorShape([Dimension(2), Dimension(6), Dimension(1024)])

# Embedding of one sentence (seq_len, 1024)
* one 1024-dimensional vector per token

In [30]:
token_embeddings[0].shape

TensorShape([Dimension(6), Dimension(1024)])

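In [31]:
# Keras classifier on top of a sentence-embedding layer.
# UniversalEmbedding is a helper that is not defined in this section.
from tensorflow.keras.layers import Input, Lambda, Dense
from tensorflow.keras.models import Model

input_text = Input(shape=(1,), dtype=tf.string)
embedding = Lambda(UniversalEmbedding, output_shape=(512, ))(input_text)
dense = Dense(256, activation='relu')(embedding)
pred = Dense(2, activation='softmax')(dense)
model = Model(inputs=[input_text], outputs=pred)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])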

# Using [elmo] test

In [51]:
test1 = elmo(["the cat is on the mat "],
                  signature="default",
                  as_dict=True)["elmo"]

INFO:tensorflow:Saver not created because there are no variables in the graph to restore


In [52]:
test1.shape

TensorShape([Dimension(1), Dimension(6), Dimension(1024)])

In [63]:
print(test1[0][1].shape) # "cat" is token 1 (token 2 is "is")
print(test1[0][5].shape) # "mat" is token 5

(1024,)
(1024,)


In [69]:
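# In TF 1.x graph mode, == on two tensors compares the tensor objects, not their values,
# so False here says nothing about how similar the two vectors are.
test1[0][1] == test1[0][5]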

False
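
To actually compare two word vectors, they first have to be evaluated to plain numbers (in TF 1.x, typically by running the tensor in a session), and a similarity measure is usually more informative than exact equality. Below is a minimal sketch of cosine similarity on plain Python lists; the toy vectors are made up and only stand in for evaluated ELMo vectors.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equally sized, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# made-up 3-dimensional stand-ins for two evaluated word vectors
cat_vector = [1.0, 2.0, 0.0]
mat_vector = [2.0, 4.0, 0.0]

print(cosine_similarity(cat_vector, mat_vector))
```

A value near 1 means the vectors point in almost the same direction, a value near 0 means they are unrelated.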

# Using [default]
**default: a fixed mean-pooling of all contextualized word representations with shape [batch_size, 1024].**

In [53]:
test2 = elmo(["the cat is on the mat "],
                  signature="default",
                  as_dict=True)["default"]

INFO:tensorflow:Saver not created because there are no variables in the graph to restore


In [76]:
test2.shape  # sentence embedding

TensorShape([Dimension(1), Dimension(1024)])

In [21]:
# with multiple sentences

In [20]:
sentence_embeddings = elmo(["the cat is on the mat ", "dogs are in the fog","I am hungry want food"],
                  signature="default",
                  as_dict=True)["default"]
sentence_embeddings.shape

INFO:tensorflow:Saver not created because there are no variables in the graph to restore


TensorShape([Dimension(3), Dimension(1024)])

# Using ["word_emb"]
**the character-based word representations with shape [batch_size, max_length, 512].**
* char level embedding

In [28]:
test3_ = elmo(["the cat is on the mat "],
                  signature="default",
                  as_dict=True)["word_emb"]

INFO:tensorflow:Saver not created because there are no variables in the graph to restore


In [29]:
test3_.shape

TensorShape([Dimension(1), Dimension(6), Dimension(512)])

In [34]:
print(test3_[0])

Tensor("strided_slice_1:0", shape=(6, 512), dtype=float32)



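## as per me
The CNN kernels below have shapes (1, 1, 16, 32), (1, 2, 16, 32), (1, 3, 16, 64), (1, 4, 16, 128), (1, 5, 16, 256), (1, 6, 16, 512), (1, 7, 16, 1024).

* (w, x, y, z)
* (w, x) is the size of the kernel used for the char-level convolution (x is 1 to 7 characters wide)
* (y) is the number of input channels, i.e. the character embedding size (16), not the max length of a word
* (z) is the number of filters

* 32 + 32 + 64 + 128 + 256 + 512 + 1024 = 2048 is the total number of filters, so each word gets a 2048-dimensional CNN feature vector
* this is finally projected down to the 512-dimensional word_emb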
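
The arithmetic behind these shapes can be checked with a few lines of Python. The (width, filters) pairs are read off the W_cnn_* shapes in the variable listing; the sketch assumes a "valid" convolution followed by max-over-time pooling, and it ignores the begin/end-of-word markers and padding that the real model adds to each token. `conv_positions` and `cnn_feature_size` are hypothetical helper names.

```python
# (kernel width in characters, number of filters), taken from the W_cnn_* shapes
FILTERS = [(1, 32), (2, 32), (3, 64), (4, 128), (5, 256), (6, 512), (7, 1024)]

def conv_positions(n_chars, width):
    """Number of positions a 'valid' 1-D convolution of this width visits in a word."""
    return max(n_chars - width + 1, 0)

def cnn_feature_size(filters):
    """Max-over-time pooling keeps one value per filter, so the sizes simply add up."""
    return sum(n_filters for _, n_filters in filters)

word = "hungry"
for width, n_filters in FILTERS:
    # each filter of this width yields one pooled value, whatever the word length
    print(width, n_filters, conv_positions(len(word), width))

print(cnn_feature_size(FILTERS))   # 2048
```

Pooling over positions is what makes the feature size independent of word length: a 3-letter word and a 12-letter word both end up as one fixed-size vector.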

In [45]:
elmo.variables

[<tf.Variable 'module_1/aggregation/scaling:0' shape=() dtype=float32>,
 <tf.Variable 'module_1/aggregation/weights:0' shape=(3,) dtype=float32>,
 <tf.Variable 'module_1/bilm/CNN/W_cnn_0:0' shape=(1, 1, 16, 32) dtype=float32>,
 <tf.Variable 'module_1/bilm/CNN/W_cnn_1:0' shape=(1, 2, 16, 32) dtype=float32>,
 <tf.Variable 'module_1/bilm/CNN/W_cnn_2:0' shape=(1, 3, 16, 64) dtype=float32>,
 <tf.Variable 'module_1/bilm/CNN/W_cnn_3:0' shape=(1, 4, 16, 128) dtype=float32>,
 <tf.Variable 'module_1/bilm/CNN/W_cnn_4:0' shape=(1, 5, 16, 256) dtype=float32>,
 <tf.Variable 'module_1/bilm/CNN/W_cnn_5:0' shape=(1, 6, 16, 512) dtype=float32>,
 <tf.Variable 'module_1/bilm/CNN/W_cnn_6:0' shape=(1, 7, 16, 1024) dtype=float32>,
 <tf.Variable 'module_1/bilm/CNN/b_cnn_0:0' shape=(32,) dtype=float32>,
 <tf.Variable 'module_1/bilm/CNN/b_cnn_1:0' shape=(32,) dtype=float32>,
 <tf.Variable 'module_1/bilm/CNN/b_cnn_2:0' shape=(64,) dtype=float32>,
 <tf.Variable 'module_1/bilm/CNN/b_cnn_3:0' shape=(128,) dtype=flo

In [71]:
import numpy as np 
a = ['aaa bbbb cccc uuuu vvvv wrwr', 'ddd ee fffff ppppp']
a = np.array(a, dtype=object)[:, np.newaxis]
a.shape==(2,1)


True