<a href="https://colab.research.google.com/github/SaketMunda/introduction-to-nlp/blob/master/nlp_with_tensorflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing with TensorFlow

NLP has the goal of deriving information from natural language (which could be sequences of text or speech).

Many NLP problems are also framed as sequence-to-sequence (seq2seq) problems.

In [1]:
# We're going to experiment with deep learning models, so check that a GPU is enabled
!nvidia-smi -L

GPU 0: Tesla T4 (UUID: GPU-4e5bcf02-8dad-ddba-edaf-fed1b85c1e60)


## Get Helper functions

In [2]:
# Get helper_functions.py script from Github
!wget https://raw.githubusercontent.com/SaketMunda/ml-helpers/master/helper_functions.py

from helper_functions import unzip_data, create_tensorboard_callback

--2023-02-01 03:47:35--  https://raw.githubusercontent.com/SaketMunda/ml-helpers/master/helper_functions.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2904 (2.8K) [text/plain]
Saving to: ‘helper_functions.py’


2023-02-01 03:47:35 (54.4 MB/s) - ‘helper_functions.py’ saved [2904/2904]



## Get a Text Dataset
The dataset we're going to use is Kaggle's introduction to NLP dataset (text samples of Tweets labelled as disaster or not disaster).

See the original source here: https://www.kaggle.com/c/nlp-getting-started

In [3]:
!wget https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip

# unzip the data
unzip_data('nlp_getting_started.zip')

--2023-02-01 03:47:38--  https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 142.250.153.128, 108.177.119.128, 108.177.127.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.250.153.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 607343 (593K) [application/zip]
Saving to: ‘nlp_getting_started.zip’


2023-02-01 03:47:38 (150 MB/s) - ‘nlp_getting_started.zip’ saved [607343/607343]



## Visualizing Text Dataset

To visualize our text samples, we first have to read them in. We can do that with pandas.

In [4]:
import pandas as pd
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

# check the shapes
train_df.shape, test_df.shape

((7613, 5), (3263, 4))

In [5]:
# view some samples
train_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


Here, `text` is the tweet and `target` indicates whether the tweet is about a disaster: `1` means a disaster, `0` means not a disaster.

Let's visualize some random training samples. Before that, it's good practice to shuffle the training samples first:

In [6]:
train_df_shuffled = train_df.sample(frac=1, random_state=17)
# frac=1 returns 100% of the rows, in random order
train_df_shuffled

Unnamed: 0,id,keyword,location,text,target
7027,10072,typhoon,,Typhoon Soudelor: When will it hit Taiwan ÛÒ ...,1
318,463,armageddon,,RT @RTRRTcoach: #Love #TrueLove #romance lith ...,0
1681,2425,collide,www.youtube.com?Malkavius2,I liked a @YouTube video from @gassymexican ht...,0
5131,7318,nuclear%20reactor,"New York, New York",Japan's Restart of Nuclear Reactor Fleet Fast ...,1
2967,4262,drowning,"Hendersonville, NC",#ICYMI #Annoucement from Al Jackson... http://...,0
...,...,...,...,...,...
406,584,arson,"Jerusalem, Israel",Mourning notices for stabbing arson victims st...,1
5510,7863,quarantined,"Livonia, MI",Reddit's new content policy goes into effect m...,0
2191,3139,debris,,Plane debris discovered on Reunion Island belo...,1
7409,10600,wounded,santo domingo,Police Officer Wounded Suspect Dead After Exch...,1


In [7]:
# What does the test set look like?
test_df.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


In [8]:
# How many examples of each class ?
train_df['target'].value_counts()

0    4342
1    3271
Name: target, dtype: int64

In [9]:
# Let's visualize some random samples

import random
random_index = random.randint(0, len(train_df_shuffled)-5)
for row in train_df_shuffled[["text", "target"]][random_index:random_index+5].itertuples():
  _, text, target = row
  print(f"Target: {target}", "(disaster)" if target > 0 else "(not a disaster)")
  print(f"Text:\n{text}\n")
  print("----------\n")

Target: 0 (not a disaster)
Text:
Choking Hazard Prompts Recall Of Kraft Cheese Singles http://t.co/XGKyVF9t4f

----------

Target: 0 (not a disaster)
Text:
Real Hip Hop: Apollo Brown Feat M.O.P. - Detonate 
#JTW http://t.co/cEiaO1TEXr

----------

Target: 0 (not a disaster)
Text:
#GrowingupBlack walking past chicken frying was like entering a war zone.

----------

Target: 0 (not a disaster)
Text:
I hear the mumbling i hear the cackling i got em scared shook panicking

----------

Target: 0 (not a disaster)
Text:
Last Chance Animal Rescue has 3 new posts. http://t.co/1EB2DaUYfn #animalrescue

----------



## Split dataset into Train and Validation sets

The test set doesn't contain the target variable, so we need some other unseen data to validate the model after training. Let's hold out a portion of the training set for validation.


In [10]:
from sklearn.model_selection import train_test_split

train_sentences, val_sentences, train_labels, val_labels = train_test_split(train_df_shuffled["text"].to_numpy(),
                                                                            train_df_shuffled["target"].to_numpy(),
                                                                            test_size=0.1,
                                                                            random_state=17)
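
As a rough illustration of what the split does, the sketch below shuffles a list of indices and holds out 10% of them for validation using only the standard library. The function name `simple_split` is hypothetical, and rounding the validation size up is an assumption; this is not scikit-learn's implementation.

```python
import math
import random

def simple_split(items, test_size=0.1, seed=17):
    """Shuffle a copy of `items` and split off a validation portion."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_val = math.ceil(len(shuffled) * test_size)  # assumption: round the validation size up
    return shuffled[n_val:], shuffled[:n_val]

# 7613 is the number of rows in train.csv reported above
train_idx, val_idx = simple_split(range(7613), test_size=0.1)
print(len(train_idx), len(val_idx))  # 6851 762
```

Every sample ends up in exactly one of the two sets, so the validation samples stay unseen during training.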

## Converting Text into Numbers

Our labels are in numerical form (0 and 1) but our tweets are in string form.

But machine learning algorithms only learn from numbers, so we have to convert those tweets/texts into numbers.

In NLP, there are two main concepts for turning text into numbers,
- **Tokenization** : A straight mapping from a **word** (*word-level tokenization*), a character (*character-level tokenization*) or a sub-word (*sub-word tokenization*) to a numerical value. For example, given the sentence "My name is Alpha" and word-level tokenization, "My" would be `0`, "name" `1`, "is" `2` and "Alpha" `3`. These integer indices could also be one-hot encoded.
- **Embeddings** : An embedding is a representation of natural language which can be learned. The representation comes in the form of a **feature vector**. For example, the word "Alpha" could be represented by the 5-D vector `[0.564, 0.897, 0.456, -0.987, 0.15]`. The size of the feature vector is tunable. There are two ways to use embeddings:

    - **Create your own embedding** - Once your text has been turned into numbers (required for an embedding), you can put them through an embedding layer (such as `tf.keras.layers.Embedding`) and an embedding representation will be learned during model training.
    - **Reuse a pre-learned embedding** - Many pre-trained embeddings exist online. These pre-trained embeddings have often been learned on large corpuses of text (such as all of Wikipedia) and thus have a good underlying representation of natural language. You can use a pre-trained embedding to initialize your model and fine-tune it to your own specific task.


Simply, 

**Tokenization** : Straight mapping from word to number.

**Embedding** : Richer representation of relationships between tokens.
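
To make the "straight mapping" concrete, here is a minimal pure-Python sketch of word-level and character-level tokenization of the example sentence. The dictionary names are hypothetical, and indices are assigned in order of first appearance, which is only one of several possible conventions.

```python
sentence = "My name is Alpha"

# Word-level tokenization: each distinct word gets the next free integer
word_to_index = {}
for word in sentence.split():
    word_to_index.setdefault(word, len(word_to_index))
word_tokens = [word_to_index[w] for w in sentence.split()]

# Character-level tokenization: same idea, one integer per distinct character
char_to_index = {}
for ch in sentence:
    char_to_index.setdefault(ch, len(char_to_index))
char_tokens = [char_to_index[c] for c in sentence]

print(word_tokens)       # [0, 1, 2, 3]
print(len(char_tokens))  # 16 (one token per character, spaces included)
```

Character-level sequences are much longer than word-level ones for the same text, but their vocabulary is far smaller.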

Which level of tokenization and embedding should you use? It depends on your problem. You could try character-level tokenization/embeddings and word-level tokenization/embeddings and see which performs best. You might even want to try stacking them (e.g. combining the outputs of your embedding layers using [tf.keras.layers.concatenate](https://www.tensorflow.org/api_docs/python/tf/keras/layers/concatenate)).

If you're looking for pre-trained word embeddings, [Word2vec embeddings](https://jalammar.github.io/illustrated-word2vec/), [GloVe embeddings](https://nlp.stanford.edu/projects/glove/) and many of the options available on TensorFlow Hub are great places to start.

Much like searching for a pre-trained computer vision model, you can search for pre-trained word embeddings to use for your problem. Try searching for something like "use pre-trained word embeddings in TensorFlow".

### Text Vectorization

Mapping words to numbers.

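To tokenize our words, we'll use the `TextVectorization` preprocessing layer (`tf.keras.layers.TextVectorization` in recent TensorFlow versions; the import below uses its older `experimental.preprocessing` location).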

In [11]:
import tensorflow as tf
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

# Using the default TextVectorization variables
text_vectorizer = TextVectorization(max_tokens=None, # how many words in the vocabulary (all of the different words in your text)
                                 standardize="lower_and_strip_punctuation", # how to process the text
                                 split="whitespace", # how to split the text
                                 ngrams=None, # create groups of n-words
                                 output_mode='int', # how to map tokens to numbers
                                 output_sequence_length=None) # How long should the output sequence of tokens be?

About the above params,

- `max_tokens` : The maximum number of words in your vocabulary (e.g. 20000 or the number of unique words in your text), including a value for OOV (out-of-vocabulary) tokens
- `standardize` : How to standardize the text before splitting (the default lowercases it and strips punctuation)
- `split` : How to split the text into tokens (the default splits on whitespace)
- `ngrams` : How many words to group per token; for example, `ngrams=2` produces tokens for single words and for contiguous pairs of words
- `output_mode` : How to output tokens: `int` (integer mapping), `binary` (multi-hot encoding, named `multi_hot` in newer TensorFlow versions), `count` or `tf-idf` (spelled `tf_idf` in newer versions)
- `output_sequence_length` : Length of the tokenized sequence to output. For example, if set to 150, all tokenized sequences will be 150 tokens long.
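
The sketch below walks through the same steps in plain Python: standardize, split, map to integers, then pad or truncate to a fixed length. It only illustrates the idea and is not how `TextVectorization` is implemented; `simple_vectorize` and the toy vocabulary are hypothetical.

```python
import string

def simple_vectorize(text, vocab, output_sequence_length):
    """Lowercase, strip punctuation, split on whitespace, map to ints, pad/truncate."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = cleaned.split()
    index = {word: i for i, word in enumerate(vocab)}
    ids = [index.get(tok, 1) for tok in tokens]        # 1 = "[UNK]" (out of vocabulary)
    ids = ids[:output_sequence_length]                  # truncate long sequences
    ids += [0] * (output_sequence_length - len(ids))    # 0 = padding
    return ids

# Hypothetical toy vocabulary; index 0 is padding and index 1 is "[UNK]"
toy_vocab = ["", "[UNK]", "a", "in", "my", "flood", "theres"]
print(simple_vectorize("There's a flood in my village!", toy_vocab, output_sequence_length=8))
# [6, 2, 5, 3, 4, 1, 0, 0]
```

"village" is not in the toy vocabulary, so it maps to the `[UNK]` index, and the two trailing zeros are padding.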

In the above cell, we have initialized the object with the default settings but let's customize it a little bit for our own use case.

In particular, let's set values for `max_tokens` and `output_sequence_length`.

For `max_tokens` (the number of words in the vocabulary), multiples of 10,000 (`10,000`, `20,000`, `30,000`) or the exact number of unique words in your text (e.g. `32,179`) are common values.

For our use case, we'll use `10,000`.

And for the `output_sequence_length` we'll use the average number of tokens per Tweet in the training set. But first, we'll need to find it.

In [12]:
# Find average number of tokens (words) in training tweets
round(sum([len(i.split()) for i in train_sentences])/len(train_sentences))

15

In [13]:
# Setup text vectorization with custom variables
max_vocab_length = 10000
max_length = 15

text_vectorizer = TextVectorization(max_tokens=max_vocab_length,
                                    output_mode='int',
                                    output_sequence_length=max_length)

To map our `TextVectorization` instance `text_vectorizer` to our data, we can call the `adapt()` method on it whilst passing it our training set.

In [14]:
# fit the text vectorizer to the training text
text_vectorizer.adapt(train_sentences)

Training data mapped! Let's try our `text_vectorizer` on a custom sentence.

In [15]:
# create a sample sentence
sample_sentence = "There's a flood in my village!"
text_vectorizer([sample_sentence])

<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[281,   3, 214,   4,  13, 881,   0,   0,   0,   0,   0,   0,   0,
          0,   0]])>

Let's try our `text_vectorizer` on a random sentence from the training set.

In [16]:
import random
random_sentence = random.choice(train_sentences)
print(f'Original Text: \n{random_sentence}\
        \n\n Vectorized Text:')
text_vectorizer([random_sentence])

Original Text: 
One day I want someone to run for the ferry fall and crack there face open for almost knocking me over just to get on a boat ??????        

 Vectorized Text:


<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[  61,   94,    8,  142,  320,    5,  321,   10,    2, 5649,  292,
           7, 3083,   77,  262]])>

We can also check the unique tokens in our vocabulary using the `get_vocabulary()` method

In [17]:
# get the unique words in the vocabulary
words_in_vocab = text_vectorizer.get_vocabulary()
top_5_words = words_in_vocab[:5] # most common tokens (notice the [UNK] token for "unknown" words)
bottom_5_words = words_in_vocab[-5:] # least common tokens

print(f"Number of words in Vocab:{len(words_in_vocab)}")
print(f"Top 5 most common words:{top_5_words}")
print(f"Bottom 5 least common words:{bottom_5_words}")

Number of words in Vocab:10000
Top 5 most common words:['', '[UNK]', 'the', 'a', 'in']
Bottom 5 least common words:['ovo', 'overåÊhostages', 'overzero', 'overwatch', 'overturns']
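
The ordering above (padding token, `[UNK]`, then words from most to least common) can be mimicked with a small frequency count. This is a simplified sketch of the idea, with a hypothetical `build_vocab` helper and toy sentences; it is not the layer's actual `adapt()` logic.

```python
from collections import Counter

def build_vocab(sentences, max_tokens):
    """Return a vocabulary list: padding, [UNK], then words by descending frequency."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    # Reserve two slots for the padding token "" and the "[UNK]" token
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

toy_sentences = ["the flood hit the town", "the fire is in the forest", "a flood warning"]
vocab = build_vocab(toy_sentences, max_tokens=5)
print(vocab)
```

Words that do not make the cut are the ones that later map to `[UNK]`.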


### Embedding

**Create an Embedding using Embedding Layer**

We've got a way to map our text to numbers. How about we go a step further and turn those numbers into an embedding?

The powerful thing about an embedding is that it can be learned during training. This means that, rather than being static, a word's numeric representation can improve as a model goes through data samples.

We can see what an embedding of a word looks like by using the `tf.keras.layers.Embedding` layer.



In [18]:
tf.random.set_seed(17)
from tensorflow.keras import layers

embedding = layers.Embedding(input_dim=max_vocab_length,
                             output_dim=128,
                             embeddings_initializer="uniform",
                             input_length=max_length,
                             name="embedding_1")
embedding

<keras.layers.core.embedding.Embedding at 0x7effc0164f10>

`embedding` is a TensorFlow layer, so we can use it as part of a model, meaning its parameters (word representations) can be updated and improved as the model learns.



In [19]:
# Get a random sentence from training set
random_sentence = random.choice(train_sentences)
print(f"Original text: \n{random_sentence}")
print("\n\nEmbedded version:")

# embed the random sentence
sample_embed = embedding(text_vectorizer([random_sentence]))
sample_embed

Original text: 
Pic of 16yr old PKK suicide bomber who detonated bomb in Turkey Army trench released http://t.co/1yB8SiZarG http://t.co/69iIzvyQYC


Embedded version:


<tf.Tensor: shape=(1, 15, 128), dtype=float32, numpy=
array([[[ 0.0302324 ,  0.03418828, -0.0086311 , ..., -0.0166344 ,
         -0.04029257, -0.02557068],
        [-0.01625264,  0.02158523, -0.02340857, ..., -0.0386681 ,
         -0.0491827 ,  0.0268322 ],
        [ 0.01352911,  0.03944962,  0.00111158, ...,  0.0023241 ,
         -0.03535595, -0.02569958],
        ...,
        [-0.03414071, -0.00578203, -0.00360845, ...,  0.02272416,
         -0.0006538 ,  0.02469077],
        [ 0.00044195, -0.01737716,  0.01441267, ..., -0.02856711,
          0.04359982,  0.03417357],
        [-0.03798722, -0.0472891 , -0.00219483, ...,  0.00439606,
          0.03558886,  0.02385728]]], dtype=float32)>

These values might not mean much to us but they're what our computer sees each word as. When our model looks for patterns in different samples, these values will be updated as necessary.
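
Conceptually, an embedding layer is a lookup table with one row per token index. The pure-Python sketch below uses toy sizes and hypothetical helper names, and assumes small uniform random starting values, to show how a batch of token ids becomes a three-dimensional structure.

```python
import random

def make_embedding_table(vocab_size, dim, seed=17):
    """Create a vocab_size x dim table of small random values (hypothetical helper)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(vocab_size)]

def embed(batch_of_token_ids, table):
    """Replace every token id with its row from the table."""
    return [[table[token_id] for token_id in sequence] for sequence in batch_of_token_ids]

table = make_embedding_table(vocab_size=10, dim=4)   # toy sizes, not 10000 x 128
batch = [[3, 1, 7, 0, 0]]                            # one sequence of five token ids
embedded = embed(batch, table)

# Shape is (number of sequences, tokens per sequence, embedding dimension)
print(len(embedded), len(embedded[0]), len(embedded[0][0]))  # 1 5 4
```

During training, the rows of this table are the values that get adjusted; the lookup itself stays the same.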

The embedded tensor has shape `(1, 15, 128)`, which means:
- 1: the number of sequences (sentences) we passed in
- 15: the `max_length` we chose for every sentence; sequences longer than 15 tokens are truncated and shorter ones are padded
- 128: the size of the vector representing each token. For example, the embedding of the first token in the sentence above looks like this:

In [20]:
sample_embed[0][0]

<tf.Tensor: shape=(128,), dtype=float32, numpy=
array([ 0.0302324 ,  0.03418828, -0.0086311 ,  0.04724665, -0.04517785,
       -0.00348382, -0.0239004 ,  0.00090963, -0.02799761,  0.00294216,
       -0.01197449,  0.02722785, -0.04120981,  0.02095106, -0.01188434,
        0.03945949,  0.02153495,  0.03358576,  0.001255  ,  0.04303605,
       -0.0018708 , -0.01481173,  0.04547796,  0.04528261,  0.00333439,
        0.04331895,  0.00089623,  0.00450709, -0.00557902, -0.02301625,
        0.02571747,  0.02261743, -0.03729066,  0.04433427, -0.03010633,
       -0.02109414, -0.01550237, -0.03067459,  0.018877  ,  0.04126067,
       -0.00881205,  0.03377115,  0.004122  , -0.02305061, -0.01989758,
       -0.03329762,  0.04769159,  0.00363921,  0.0160043 ,  0.00016095,
       -0.00751973, -0.01647564,  0.04507012,  0.03735936,  0.00348307,
       -0.0443404 , -0.00769715,  0.00465501,  0.04798652,  0.02703634,
        0.00351572, -0.02186524, -0.0483281 , -0.0120188 , -0.01574825,
        0.017328

## Evaluation function for our Model Experiments

Since we're going to run multiple experiments with deep learning models and scikit-learn algorithms, we should create a common function to evaluate and compare them.

Classification models can be evaluated with several metrics, such as accuracy, precision, recall and F1-score. Let's compute them all in one function.

In [21]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
def calculate_results(model, y_true, y_pred):
  """
  Calculates Model accuracy, precision, recall and f1-score for a binary classification model
  """
  # calculate the model accuracy
  model_accuracy = accuracy_score(y_true, y_pred)
  # calculate precision, recall, f1-score using "weighted" average
  model_precision, model_recall, model_f1, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted")
  model_results = {"accuracy": model_accuracy * 100,
                   "precision":model_precision,
                   "recall": model_recall,
                   "f1-score": model_f1}
                  
  return model_results
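
To see what the "weighted" average means, here is a hand-rolled sketch that computes per-class precision, recall and F1 and then weights each class by its share of the true labels. The function name `weighted_scores` and the toy labels are made up for illustration; the notebook itself relies on scikit-learn for these numbers.

```python
def weighted_scores(y_true, y_pred):
    """Support-weighted precision, recall and F1, computed class by class."""
    classes = sorted(set(y_true))
    total = len(y_true)
    precision = recall = f1 = 0.0
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        p_c = tp / (tp + fp) if tp + fp else 0.0
        r_c = tp / (tp + fn) if tp + fn else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if p_c + r_c else 0.0
        weight = (tp + fn) / total   # support: share of true samples in class c
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return precision, recall, f1

# Toy labels for illustration only
toy_true = [0, 0, 0, 0, 1, 1, 0, 1, 0, 1]
toy_pred = [0, 0, 0, 0, 1, 1, 0, 1, 0, 0]
precision, recall, f1 = weighted_scores(toy_true, toy_pred)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```

With this weighting, recall always equals plain accuracy, because each class's recall is multiplied by exactly the fraction of samples it covers.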

## Modelling a Text Dataset

We'll run the following text classification experiments:
- **Model 0** : Naive Bayes (baseline)
- **Model 1** : Feed-forward neural network (dense model) 
- **Model 2** : LSTM Model (RNN)
- **Model 3** : GRU (RNN)
- **Model 4** : Bidirectional-LSTM (RNN)
- **Model 5** : 1D CNN
- **Model 6** : TF Hub Pre-trained Feature Extractor
- **Model 7** : Same as model 6 with 10% of training samples

### Model 0 : Getting a baseline

We'll build this model with scikit-learn: a `Pipeline` that uses the TF-IDF (term frequency-inverse document frequency) formula to convert words into numbers and then models them with the Multinomial Naive Bayes algorithm.

> 📖 **Reading**: About TF-IDF on [Scikit-learn documentation](https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting)
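
As a rough sketch of the idea, the snippet below computes TF-IDF weights by hand using the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, as described in the scikit-learn documentation linked above. The tokenization is a plain whitespace split and the documents are made up, so the numbers will not match `TfidfVectorizer` on real data.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {word: weight} dict per document, L2-normalised."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Term count times smoothed inverse document frequency
        weights = {w: count * (math.log((1 + n_docs) / (1 + df[w])) + 1)
                   for w, count in tf.items()}
        norm = math.sqrt(sum(v * v for v in weights.values()))
        vectors.append({w: v / norm for w, v in weights.items()})
    return vectors

docs = ["forest fire near town", "fire alarm test", "love this town"]
vectors = tfidf(docs)
print(vectors[1])
```

In the second document, "alarm" and "test" receive a higher weight than "fire" because "fire" also appears in another document.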

In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Create tokenization and modelling pipeline
model_0 = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB())
])

# Fit the pipeline to the training data
model_0.fit(train_sentences, train_labels)

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])

Multinomial Naive Bayes is a shallow model, so it trains quickly.
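
For intuition, here is a compact pure-Python sketch of Multinomial Naive Bayes with additive (Laplace) smoothing. It works on raw word counts rather than TF-IDF weights, and the helper names and toy sentences are hypothetical, so treat it as an illustration of the algorithm rather than a stand-in for `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict

def train_nb(sentences, labels, alpha=1.0):
    """Fit word-count Multinomial Naive Bayes with additive smoothing."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for sentence, label in zip(sentences, labels):
        word_counts[label].update(sentence.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    model = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        log_prior = math.log(class_counts[label] / len(labels))
        log_likelihood = {
            w: math.log((word_counts[label][w] + alpha) / (total + alpha * len(vocab)))
            for w in vocab
        }
        model[label] = (log_prior, log_likelihood)
    return model

def predict_nb(model, sentence):
    """Pick the class with the highest log-probability; unseen words are ignored."""
    scores = {}
    for label, (log_prior, log_likelihood) in model.items():
        scores[label] = log_prior + sum(
            log_likelihood.get(w, 0.0) for w in sentence.lower().split())
    return max(scores, key=scores.get)

toy_sentences = ["forest fire spreading fast", "flood hits the town",
                 "i love this song", "this pizza is fire"]
toy_labels = [1, 1, 0, 0]
model = train_nb(toy_sentences, toy_labels)
print(predict_nb(model, "flood in the forest"))  # 1
```

Training is just counting words per class, which is why this kind of model fits in a fraction of the time a neural network needs.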

In [23]:
# evaluate the model
baseline_score = model_0.score(val_sentences, val_labels)
print(f"Our baseline model achieves an accuracy of: {baseline_score*100:.2f}%")

Our baseline model achieves an accuracy of: 80.18%


In [24]:
# let's make some predictions
baseline_preds = model_0.predict(val_sentences)
baseline_preds[:10]

array([0, 0, 0, 0, 1, 1, 0, 1, 0, 0])

In [25]:
# compare with the true labels (close, but not identical)
val_labels[:10]

array([0, 0, 0, 0, 1, 1, 0, 1, 0, 1])

In [26]:
# get the results
baseline_results = calculate_results(model=model_0,
                                     y_true=val_labels,
                                     y_pred=baseline_preds)
baseline_results

{'accuracy': 80.18372703412074,
 'precision': 0.8125567744156732,
 'recall': 0.8018372703412073,
 'f1-score': 0.7968681002825004}

### Model 1 : A Simple Dense Model

The first "deep" model we're going to build is a single layer dense model.

In [27]:
from tensorflow.keras import layers

# Create directory to save Tensorboard logs
SAVE_DIR = "model_logs"

# Build model with functional API
inputs = layers.Input(shape=(1,), dtype=tf.string) # inputs are 1-dimensional strings
x = text_vectorizer(inputs) # turn text inputs into numbers using text_vectorizer
x = embedding(x) # create an embedding of the numerized text
x = layers.GlobalAveragePooling1D()(x) # lower the dimensionality of the embedding
outputs = layers.Dense(1, activation="sigmoid")(x) # create the output layer
model_1 = tf.keras.Model(inputs, outputs, name="model_1_dense") # construct the model

# compile the model
model_1.compile(loss='binary_crossentropy',
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

# fit the model
history_model_1 = model_1.fit(x=train_sentences,
                              y=train_labels,
                              epochs=5,
                              validation_data=(val_sentences, val_labels),
                              callbacks=[create_tensorboard_callback(dir_name=SAVE_DIR,
                                                                     experiment_name="simple_dense_model")])

Saving Tensorboard log files to: model_logs/simple_dense_model/20230201-034746
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
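
The two layers after the embedding can be pictured with plain Python: global average pooling collapses the token axis by averaging, and the dense output unit turns the pooled vector into a probability with a sigmoid. The sketch assumes no masking of padded positions and uses made-up weights and toy sizes.

```python
import math

def global_average_pool(sequence):
    """Average a (tokens x dim) list of vectors over the token axis."""
    n_tokens = len(sequence)
    dim = len(sequence[0])
    return [sum(vec[d] for vec in sequence) / n_tokens for d in range(dim)]

def dense_sigmoid(features, weights, bias):
    """A single output unit: weighted sum followed by a sigmoid."""
    z = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Toy sequence: 3 tokens with 2-dimensional embeddings (the model uses 15 x 128)
sequence = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = global_average_pool(sequence)                                # [3.0, 4.0]
probability = dense_sigmoid(pooled, weights=[0.5, -0.25], bias=0.0)   # made-up weights
print(pooled, probability)
```

Pooling is what lets a single dense unit handle a whole sentence: whatever the sequence length, it always receives one vector of fixed size.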


In [28]:
# evaluate the results
model_1.evaluate(val_sentences, val_labels)



[0.4938335716724396, 0.7952755689620972]

In [29]:
# make some predictions
model_1_pred_probs = model_1.predict(val_sentences)
model_1_pred_probs[:10]



array([[0.1209681 ],
       [0.00963379],
       [0.07517961],
       [0.43156   ],
       [0.9998043 ],
       [0.8580827 ],
       [0.13515034],
       [0.894733  ],
       [0.58565694],
       [0.08657645]], dtype=float32)

Alright! Let's do some more evaluation using our common evaluation function.

In [30]:
# to compare with val_labels (0s and 1s), round the probabilities to labels
model_1_preds = tf.squeeze(tf.round(model_1_pred_probs))
model_1_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 0., 0., 0., 1., 1., 0., 1., 1., 0.], dtype=float32)>
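
The rounding step is equivalent to thresholding each probability at 0.5. Here is a plain-Python version using the first ten prediction probabilities and validation labels copied from the outputs above:

```python
# Prediction probabilities and true labels copied from the outputs above
pred_probs = [0.1209681, 0.00963379, 0.07517961, 0.43156, 0.9998043,
              0.8580827, 0.13515034, 0.894733, 0.58565694, 0.08657645]
true_labels = [0, 0, 0, 0, 1, 1, 0, 1, 0, 1]

# Threshold at 0.5: probabilities above it become class 1, the rest class 0
pred_labels = [1 if p > 0.5 else 0 for p in pred_probs]
matches = sum(int(p == t) for p, t in zip(pred_labels, true_labels))

print(pred_labels)                               # [0, 0, 0, 0, 1, 1, 0, 1, 1, 0]
print(f"{matches}/{len(true_labels)} correct")   # 8/10 correct
```

The two mistakes are the last two samples: one borderline probability (about 0.59) rounded up to a disaster, and one true disaster that the model scored at only about 0.09.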

In [31]:
simple_dense_results = calculate_results(model=model_1,
                                         y_true=val_labels,
                                         y_pred=model_1_preds)

simple_dense_results

{'accuracy': 79.52755905511812,
 'precision': 0.7966788703003602,
 'recall': 0.7952755905511811,
 'f1-score': 0.7932321923411083}

To compare results like these between two models, let's create a helper function.

In [32]:
def compare_baseline_to_new_results(baseline_results, new_model_results):
  for k,v in baseline_results.items():
    print(f"Baseline {k}: {v:.2f}, New {k}: {new_model_results[k]:.2f}, Difference: {new_model_results[k] - v:.3f}")

In [33]:
compare_baseline_to_new_results(baseline_results, simple_dense_results)

Baseline accuracy: 80.18, New accuracy: 79.53, Difference: -0.656
Baseline precision: 0.81, New precision: 0.80, Difference: -0.016
Baseline recall: 0.80, New recall: 0.80, Difference: -0.007
Baseline f1-score: 0.80, New f1-score: 0.79, Difference: -0.004


## Visualizing Learned Embeddings

Our first model (`model_1`) contained an embedding layer (`embedding`) which learned a way of representing words as feature vectors by passing over the training data.

To understand what a text embedding is, let's visualize the embedding our model learned.

In [34]:
# get the vocabulary from the text vectorization layer
words_in_vocab = text_vectorizer.get_vocabulary()
len(words_in_vocab), words_in_vocab[:10]

(10000, ['', '[UNK]', 'the', 'a', 'in', 'to', 'of', 'and', 'i', 'is'])

And now let's get our embedding layer's weights (these are the numerical representations of each word). First, the model summary shows where the embedding layer sits in the model.

In [35]:
model_1.summary()

Model: "model_1_dense"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization_1 (TextV  (None, 15)               0         
 ectorization)                                                   
                                                                 
 embedding_1 (Embedding)     (None, 15, 128)           1280000   
                                                                 
 global_average_pooling1d (G  (None, 128)              0         
 lobalAveragePooling1D)                                          
                                                                 
 dense (Dense)               (None, 1)                 129       
                                                                 
Total params: 1,280,129
Trainable params: 1,280,129
N

In [43]:
# get the weight matrix of embedding layer
embed_weights = model_1.layers[2].get_weights()[0] 
print(embed_weights.shape) # (vocab size, embedding_dim): each word is an embedding_dim-sized vector

(10000, 128)
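
The parameter counts in the model summary follow directly from these sizes. This quick arithmetic check uses the values from this notebook:

```python
vocab_size = 10000      # max_vocab_length
embedding_dim = 128     # output_dim of the embedding layer

embedding_params = vocab_size * embedding_dim   # one 128-d vector per token
dense_params = embedding_dim * 1 + 1            # 128 weights + 1 bias for the output unit

print(embedding_params, dense_params, embedding_params + dense_params)
# 1280000 129 1280129
```

Almost all of the model's parameters live in the embedding table, which is why the weight matrix is worth inspecting.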


In [44]:
embed_weights[0]

array([-0.05793177, -0.03914464, -0.02473383, -0.00400928,  0.0294634 ,
       -0.01301204,  0.05229734,  0.04887475,  0.06025089,  0.00805068,
        0.06344494,  0.06382315,  0.03528398, -0.02437459, -0.0378452 ,
       -0.00860966, -0.02997298, -0.00939367,  0.0595504 , -0.06665281,
       -0.01535042,  0.01206796,  0.01324633,  0.00885767, -0.03989451,
        0.0059811 , -0.05867312,  0.01391945, -0.0134295 , -0.03151251,
       -0.01170612, -0.01470305,  0.05553512,  0.06478363, -0.03305589,
        0.00180238,  0.03188965, -0.01603714,  0.0065343 , -0.0136434 ,
       -0.01092496,  0.03955515, -0.00912938, -0.04180072, -0.01730737,
       -0.0026758 , -0.05257864, -0.00593809, -0.00243136, -0.00406672,
        0.05934988,  0.00771909, -0.00551351, -0.03471248, -0.02665269,
       -0.01058054,  0.01899303, -0.05025218, -0.01186906, -0.00643933,
       -0.0397414 , -0.02110855, -0.00443364, -0.00674049,  0.0227233 ,
        0.02054001, -0.01021375, -0.03714589,  0.0349581 , -0.00

Now we've got these two objects, we can use the [Embedding Projector Tool](http://projector.tensorflow.org/) to visualize our embedding.

To use the embedding projector tool, we need two files,
- the embedding vectors (same as embedding weights)
- the metadata of the embedding vectors (the words they represent - our vocabulary)

In [45]:
import io

# Create output writers (the with-block closes both files when done)
with io.open("embedding_vectors.tsv", "w", encoding="utf-8") as out_v, \
     io.open("embedding_metadata.tsv", "w", encoding="utf-8") as out_m:
  # Write embedding vectors and words to file
  for num, word in enumerate(words_in_vocab):
    if num == 0:
      continue # skip padding token
    vec = embed_weights[num]
    out_m.write(word + "\n") # write words to file
    out_v.write("\t".join([str(x) for x in vec]) + "\n") # write corresponding word vector to file


# Download files locally to upload to Embedding projector

try:
  from google.colab import files
except ImportError:
  pass
else:
  files.download("embedding_vectors.tsv")
  files.download("embedding_metadata.tsv")
