<a href="https://colab.research.google.com/github/SaketMunda/introduction-to-nlp/blob/master/nlp_with_tensorflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing with TensorFlow

NLP aims to derive information from natural language, which may come as sequences of text or speech.

Many NLP problems are also framed as sequence-to-sequence (seq2seq) problems.

In [1]:
# We're going to experiment with deep learning models, so check that a GPU is enabled
!nvidia-smi -L

GPU 0: Tesla T4 (UUID: GPU-1ff35a48-fdfc-9e13-a94c-2ab1fba38cc1)


## Get Helper functions

In [2]:
# Get helper_functions.py script from Github
!wget https://raw.githubusercontent.com/SaketMunda/ml-helpers/master/helper_functions.py

from helper_functions import unzip_data, create_tensorboard_callback

--2023-02-03 04:38:33--  https://raw.githubusercontent.com/SaketMunda/ml-helpers/master/helper_functions.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2904 (2.8K) [text/plain]
Saving to: ‘helper_functions.py’


2023-02-03 04:38:33 (47.7 MB/s) - ‘helper_functions.py’ saved [2904/2904]



## Get a Text Dataset
The dataset we're going to use is Kaggle's introduction to NLP dataset (text samples of Tweets labelled as disaster or not disaster).

See the original source here: https://www.kaggle.com/c/nlp-getting-started

In [3]:
!wget https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip

# unzip the data
unzip_data('nlp_getting_started.zip')

--2023-02-03 04:38:35--  https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 142.251.10.128, 142.251.12.128, 172.217.194.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.251.10.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 607343 (593K) [application/zip]
Saving to: ‘nlp_getting_started.zip’


2023-02-03 04:38:36 (151 MB/s) - ‘nlp_getting_started.zip’ saved [607343/607343]



## Visualizing Text Dataset

To visualize our text samples, we first have to read them in. We'll do that with pandas.

In [4]:
import pandas as pd
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

# check the shapes
train_df.shape, test_df.shape

((7613, 5), (3263, 4))

In [5]:
# view some samples
train_df.head()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | | | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | | | Just got sent this photo from Ruby #Alaska as ... | 1 |


Here, `text` is the tweet and `target` indicates whether the tweet is about a disaster: `1` means disaster and `0` means not a disaster.

Let's visualize some random training samples. Before that, it's good practice to shuffle the training samples.

In [6]:
train_df_shuffled = train_df.sample(frac=1, random_state=17)
# frac=1 returns 100% of the rows, in shuffled order
train_df_shuffled

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 7027 | 10072 | typhoon | | Typhoon Soudelor: When will it hit Taiwan ÛÒ ... | 1 |
| 318 | 463 | armageddon | | RT @RTRRTcoach: #Love #TrueLove #romance lith ... | 0 |
| 1681 | 2425 | collide | www.youtube.com?Malkavius2 | I liked a @YouTube video from @gassymexican ht... | 0 |
| 5131 | 7318 | nuclear%20reactor | New York, New York | Japan's Restart of Nuclear Reactor Fleet Fast ... | 1 |
| 2967 | 4262 | drowning | Hendersonville, NC | #ICYMI #Annoucement from Al Jackson... http://... | 0 |
| ... | ... | ... | ... | ... | ... |
| 406 | 584 | arson | Jerusalem, Israel | Mourning notices for stabbing arson victims st... | 1 |
| 5510 | 7863 | quarantined | Livonia, MI | Reddit's new content policy goes into effect m... | 0 |
| 2191 | 3139 | debris | | Plane debris discovered on Reunion Island belo... | 1 |
| 7409 | 10600 | wounded | santo domingo | Police Officer Wounded Suspect Dead After Exch... | 1 |


In [7]:
# what does the test set look like?
test_df.head()

| | id | keyword | location | text |
|---|---|---|---|---|
| 0 | 0 | | | Just happened a terrible car crash |
| 1 | 2 | | | Heard about #earthquake is different cities, s... |
| 2 | 3 | | | there is a forest fire at spot pond, geese are... |
| 3 | 9 | | | Apocalypse lighting. #Spokane #wildfires |
| 4 | 11 | | | Typhoon Soudelor kills 28 in China and Taiwan |


In [8]:
# How many examples of each class ?
train_df['target'].value_counts()

0    4342
1    3271
Name: target, dtype: int64
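
The counts above show the classes are not perfectly balanced. As a rough sanity check, the sketch below turns them into percentages and computes the accuracy a trivial "always predict the most frequent class" model would reach on the full training set. The counts are copied from the output above; the variable names are only illustrative, and the share in a validation split may differ slightly.

```python
# Class counts copied from the value_counts() output above
class_counts = {0: 4342, 1: 3271}
total = sum(class_counts.values())

# Share of each class in the training set
class_share = {label: count / total for label, count in class_counts.items()}

# Accuracy of a trivial model that always predicts the most frequent class
majority_baseline = max(class_share.values())

for label, share in class_share.items():
    print(f"class {label}: {share:.1%}")
print(f"majority-class accuracy: {majority_baseline:.1%}")
```

Any model we train later should comfortably beat this number to be considered useful.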

In [9]:
# Let's visualize some random samples

import random
random_index = random.randint(0, len(train_df_shuffled)-5)
for row in train_df_shuffled[["text", "target"]][random_index:random_index+5].itertuples():
  _, text, target = row
  print(f"Target: {target}", "(disaster)" if target > 0 else "(not a disaster)")
  print(f"Text:\n{text}\n")
  print("----------\n")

Target: 0 (not a disaster)
Text:
When he lets you drive his truck and you start panicking because you had to 'flip that bitch'. ?????? http://t.co/W6O0uiZF8p

----------

Target: 0 (not a disaster)
Text:
Join #charity 10k #run event! @DoningtonDash
11am start Sun 20 Sept 2015
Castle Donington Community First Responders
https://t.co/G1Nw99YJ8U

----------

Target: 0 (not a disaster)
Text:
@czallstarwes more like demolition derby ??

----------

Target: 0 (not a disaster)
Text:
@parksboardfacts first off it is the #ZippoLine as no one wants to use it and the community never asked for this blight on the park #moveit

----------

Target: 1 (disaster)
Text:
on the flip side I'm at Walmart and there is a bomb and everyone had to evacuate so stay tuned if I blow up or not

----------



## Split dataset into Train and Validation sets

The test set doesn't contain the target variable, so we need some other unseen data to validate the model after training. Let's hold out a portion of the training set as a validation set.


In [10]:
from sklearn.model_selection import train_test_split

train_sentences, val_sentences, train_labels, val_labels = train_test_split(train_df_shuffled["text"].to_numpy(),
                                                                            train_df_shuffled["target"].to_numpy(),
                                                                            test_size=0.1,
                                                                            random_state=17)

## Converting Text into Numbers

Our labels are in numerical form (0 and 1) but our tweets are in string form.

But machine learning algorithms only work with numbers, so we have to convert the tweets into numbers.

In NLP, there are two main concepts for turning text into numbers:
- **Tokenization**: A direct mapping from a **word** (*word-level tokenization*), a character (*character-level tokenization*) or a sub-word (*sub-word tokenization*) to a numerical value. For example, with word-level tokenization of the sentence "My name is Alpha", "My" would be `0`, "name" `1`, "is" `2` and "Alpha" `3`. Each token gets one integer ID, which could then be one-hot encoded if needed.
- **Embeddings**: An embedding is a representation of natural language which can be learned. The representation comes in the form of a **feature vector**. For example, the word "Alpha" could be represented by the 5-dimensional vector `[0.564, 0.897, 0.456, -0.987, 0.15]`. The size of the feature vector is tuneable. There are two ways to use embeddings:

    - **Create your own embedding** - Once your text has been turned into numbers (required for an embedding), you can put them through an embedding layer (such as `tf.keras.layers.Embedding`) and an embedding representation will be learned during model training.
    - **Reuse a pre-learned embedding** - Many pre-trained embeddings exist online. These pre-trained embeddings have often been learned on large corpuses of text (such as all of Wikipedia) and thus have a good underlying representation of natural language. You can use a pre-trained embedding to initialize your model and fine-tune it to your own specific task.


Simply, 

**Tokenization** : Straight mapping from word to number.

**Embedding** : Richer representation of relationships between tokens.
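
To make the difference between tokenization levels concrete, here is a minimal pure-Python sketch using the "My name is Alpha" sentence. It assigns ids in order of first appearance, which is just one possible convention (Keras orders its vocabulary by frequency instead); the dictionary names are illustrative only.

```python
sentence = "My name is Alpha"

# Word-level tokenization: one integer per distinct word, in order of first appearance
words = sentence.split()
word_to_id = {}
for word in words:
    word_to_id.setdefault(word, len(word_to_id))
word_ids = [word_to_id[word] for word in words]

# Character-level tokenization: one integer per distinct character (spaces included)
char_to_id = {}
for char in sentence:
    char_to_id.setdefault(char, len(char_to_id))
char_ids = [char_to_id[char] for char in sentence]

print(word_ids)       # one id per word
print(len(char_ids))  # one id per character, so a much longer sequence
```

Word-level tokens give short sequences but a large vocabulary; character-level tokens give a tiny vocabulary but long sequences.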

Which approach works best depends on your problem. You could try character-level tokenization/embeddings and word-level tokenization/embeddings and see which performs best. You might even want to try stacking them (e.g. combining the outputs of your embedding layers using [tf.keras.layers.concatenate](https://www.tensorflow.org/api_docs/python/tf/keras/layers/concatenate)).

If you're looking for pre-trained word embeddings, [Word2vec embeddings](https://jalammar.github.io/illustrated-word2vec/), [GloVe embeddings](https://nlp.stanford.edu/projects/glove/) and many of the options available on TensorFlow Hub are great places to start.

Much like searching for a pre-trained computer vision model, you can search for pre-trained word embeddings to use for your problem. Try searching for something like "use pre-trained word embeddings in TensorFlow".

### Text Vectorization

Mapping words to numbers.

To tokenize our words, we'll use the preprocessing layer
`tf.keras.layers.experimental.preprocessing.TextVectorization` (available as `tf.keras.layers.TextVectorization` in newer TensorFlow releases).

In [11]:
import tensorflow as tf
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

# Using the default TextVectorization variables
text_vectorizer = TextVectorization(max_tokens=None, # how many words in the vocabulary (all of the different words in your text)
                                 standardize="lower_and_strip_punctuation", # how to process the text
                                 split="whitespace", # how to split the text
                                 ngrams=None, # create groups of n-words
                                 output_mode='int', # how to map tokens to numbers
                                 output_sequence_length=None) # How long should the output sequence of tokens be?

About the above params,

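- `max_tokens`: The maximum number of words in your vocabulary (e.g. 20000 or the number of unique words in your text); this count includes a value for OOV (out of vocabulary) tokens
- `standardize`: How to standardize the text before splitting (the default, `"lower_and_strip_punctuation"`, lowercases it and removes punctuation)
- `split`: How to split the text into tokens (the default, `"whitespace"`, splits on whitespace)
- `ngrams`: Whether to group tokens into n-grams. For example, `2` creates n-grams of up to 2 consecutive tokens (single words plus pairs of words); `None` creates no n-grams
- `output_mode`: How to output tokens: `int` (integer mapping), `multi_hot` (named `binary` in older TensorFlow releases, similar to one-hot encoding), `count` or `tf_idf`
- `output_sequence_length`: Length of the tokenized sequence to output. For example, if set to 150, every sequence is padded or truncated to 150 tokens.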
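
The parameters above describe a small pipeline: standardize, split, look up integer ids, then pad or truncate. The following pure-Python sketch mimics that pipeline to show what each step does. It is not the Keras implementation; `toy_vectorize` and `toy_vocab` are hypothetical names, and the only conventions borrowed from the layer are that id `0` is used for padding and id `1` for unknown words.

```python
import string

def toy_vectorize(text, vocab, sequence_length):
    """Mimic lower_and_strip_punctuation -> whitespace split -> int lookup -> pad/truncate."""
    # standardize: lowercase and drop punctuation characters
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    # split: break on whitespace
    tokens = cleaned.split()
    # output_mode="int": look each token up, 1 is reserved for out-of-vocabulary words
    ids = [vocab.get(token, 1) for token in tokens]
    # output_sequence_length: truncate long sequences, pad short ones with 0
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

# Hypothetical toy vocabulary; ids 0 and 1 are reserved for padding and unknown words
toy_vocab = {"a": 2, "flood": 3, "in": 4, "my": 5}

print(toy_vectorize("There's a flood in my village!", toy_vocab, sequence_length=8))
```

Words missing from the toy vocabulary ("theres", "village") map to `1`, and the two trailing zeros are padding.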

In the above cell, we have initialized the object with the default settings but let's customize it a little bit for our own use case.

In particular, let's set values for `max_tokens` and `output_sequence_length`.

For `max_tokens` (the number of words in the vocabulary), multiples of 10,000 (`10,000`, `20,000`, `30,000`) or the exact number of unique words in your text (e.g. `32,179`) are common values.

For our use case, we'll use `10,000`.

And for the `output_sequence_length` we'll use the average number of tokens per Tweet in the training set. But first, we'll need to find it.

In [12]:
# Find average number of tokens (words) in training tweets
round(sum([len(i.split()) for i in train_sentences])/len(train_sentences))

15
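
The average is only one way to choose a sequence length: roughly speaking, longer-than-average tweets will be truncated. A quick way to judge a candidate length is to check what fraction of sentences fit within it. The sketch below does this on a handful of tweets shown earlier in this notebook, standing in for `train_sentences`; the `coverage` helper is a hypothetical name, not part of any library.

```python
# A few tweets from the samples shown earlier, standing in for train_sentences
sample_sentences = [
    "Forest fire near La Ronge Sask. Canada",
    "Just happened a terrible car crash",
    "Typhoon Soudelor kills 28 in China and Taiwan",
    "Apocalypse lighting. #Spokane #wildfires",
]

def coverage(sentences, max_length):
    """Fraction of sentences whose whitespace token count is <= max_length."""
    lengths = [len(sentence.split()) for sentence in sentences]
    return sum(length <= max_length for length in lengths) / len(lengths)

token_counts = [len(sentence.split()) for sentence in sample_sentences]
print(token_counts)
print(coverage(sample_sentences, max_length=6))
```

Running `coverage(train_sentences, 15)` on the real data would show how many tweets are kept whole at the length chosen here.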

In [13]:
# Setup text vectorization with custom variables
max_vocab_length = 10000
max_length = 15

text_vectorizer = TextVectorization(max_tokens=max_vocab_length,
                                    output_mode='int',
                                    output_sequence_length=max_length)

To map our `TextVectorization` instance `text_vectorizer` to our data, we can call the `adapt()` method on it whilst passing it our training set.

In [14]:
# fit the text vectorizer to the training text
text_vectorizer.adapt(train_sentences)

Training data mapped! Let's try our `text_vectorizer` on a custom sentence.

In [15]:
# create a sample sentence
sample_sentence = "There's a flood in my village!"
text_vectorizer([sample_sentence])

<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[281,   3, 214,   4,  13, 881,   0,   0,   0,   0,   0,   0,   0,
          0,   0]])>

Let's try our `text_vectorizer` on a random sentence from the training set.

In [16]:
import random
random_sentence = random.choice(train_sentences)
print(f'Original Text: \n{random_sentence}\
        \n\n Vectorized Text:')
text_vectorizer([random_sentence])

Original Text: 
4 THOSE WHO CARE ABOUT SIBLING ABUSE SURVIVORS join the new family tree: http://t.co/LQD1WEfpQd http://t.co/GgnbVZoHWu        

 Vectorized Text:


<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[ 180,  163,   65,  527,   53, 4622, 3202,  404, 1396,    2,   49,
         285, 1172,    1,    1]])>

We can also check the unique tokens in our vocabulary using the `get_vocabulary()` method

In [17]:
# get the unique words in the vocabulary
words_in_vocab = text_vectorizer.get_vocabulary()
top_5_words = words_in_vocab[:5] # most common tokens (notice the [UNK] token for "unknown" words)
bottom_5_words = words_in_vocab[-5:] # least common tokens

print(f"Number of words in Vocab:{len(words_in_vocab)}")
print(f"Top 5 most common words:{top_5_words}")
print(f"Bottom 5 least common words:{bottom_5_words}")

Number of words in Vocab:10000
Top 5 most common words:['', '[UNK]', 'the', 'a', 'in']
Bottom 5 least common words:['ovo', 'overåÊhostages', 'overzero', 'overwatch', 'overturns']


### Embedding

**Create an Embedding using Embedding Layer**

We've got a way to map our text to numbers. How about we go a step further and turn those numbers into an embedding?

The powerful thing about an embedding is that it can be learned during training. This means that rather than being static, a word's numeric representation can be improved as a model goes through data samples.

We can see what an embedding of a word looks like by using the `tf.keras.layers.Embedding` layer.



In [18]:
tf.random.set_seed(17)
from tensorflow.keras import layers

embedding = layers.Embedding(input_dim=max_vocab_length,
                             output_dim=128,
                             embeddings_initializer="uniform",
                             input_length=max_length,
                             name="embedding_1")
embedding

<keras.layers.core.embedding.Embedding at 0x7fedb02e4070>

`embedding` is a TensorFlow layer, so we can use it as part of a model. This means its parameters (the word representations) can be updated and improved as the model learns.



In [19]:
# Get a random sentence from training set
random_sentence = random.choice(train_sentences)
print(f"Original text: \n{random_sentence}")
print("\n\nEmbedded version:")

# embed the random sentence
sample_embed = embedding(text_vectorizer([random_sentence]))
sample_embed

Original text: 
you know you hate your body when you buy 2 bags of chips and a variety pack of fruit snacks and a redbull as a snack


Embedded version:


<tf.Tensor: shape=(1, 15, 128), dtype=float32, numpy=
array([[[-0.02013494,  0.00367814,  0.00704851, ...,  0.00118704,
         -0.03017279, -0.02486727],
        [-0.04007417, -0.00953885, -0.04488143, ..., -0.03326275,
         -0.03895334, -0.0359932 ],
        [-0.02013494,  0.00367814,  0.00704851, ...,  0.00118704,
         -0.03017279, -0.02486727],
        ...,
        [-0.02031898,  0.03425609,  0.03940277, ..., -0.04002044,
          0.01343619, -0.00712932],
        [ 0.02456499, -0.0027171 ,  0.02670671, ...,  0.01225286,
          0.04258173,  0.00253669],
        [-0.04131304,  0.01487033, -0.04388602, ..., -0.04878377,
          0.04338017,  0.02946276]]], dtype=float32)>

These values might not mean much to us but they're what our computer sees each word as. When our model looks for patterns in different samples, these values will be updated as necessary.
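
Conceptually, an embedding layer is a lookup table: each token id selects one row of a weight matrix. The pure-Python sketch below illustrates this with tiny, made-up sizes (the real layer above uses a vocabulary of 10,000 and 128 dimensions); `embedding_table` and `embed` are illustrative names, and the random initial values only resemble a uniform initializer in spirit.

```python
import random

random.seed(17)
vocab_size, embedding_dim = 10, 4  # tiny hypothetical sizes for readability

# One row of random floats per token id
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]

def embed(batch_of_token_ids):
    """Replace every token id with its row from the table."""
    return [[embedding_table[token_id] for token_id in sequence]
            for sequence in batch_of_token_ids]

batch = [[7, 3, 7, 0, 0]]  # one sequence of five token ids; id 7 appears twice
embedded = embed(batch)

print(len(embedded), len(embedded[0]), len(embedded[0][0]))  # batch, tokens, dim
print(embedded[0][0] == embedded[0][2])  # same id, same vector
```

This is also why the first and third rows of the output above are identical: both positions hold the same token.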

If we review the shape of the embedded tensor, `(1, 15, 128)`, it means:
- 1: the number of sequences (sentences) we passed in
- 15: the `max_length` every sentence is normalized to; sentences longer than 15 tokens are truncated and shorter ones are padded
- 128: the size of the vector for each token. For example, the first token of the sentence above is `you`, and its embedding looks like this:

In [20]:
sample_embed[0][0]

<tf.Tensor: shape=(128,), dtype=float32, numpy=
array([-0.02013494,  0.00367814,  0.00704851,  0.03835148,  0.01900088,
        0.00061114,  0.01187392,  0.04401598, -0.04037824, -0.00385005,
        0.01364303,  0.01933743,  0.03797182,  0.0317191 , -0.03191744,
       -0.02826787, -0.00074808,  0.04151082,  0.02736055, -0.04909203,
       -0.04026247, -0.02701092, -0.00230974,  0.02857598, -0.03896632,
       -0.0462239 , -0.00654601, -0.04307854, -0.03492272,  0.03922273,
        0.00525022,  0.03337574,  0.04375429,  0.02135426, -0.01749852,
       -0.03767998,  0.00622987, -0.00512462,  0.00042514,  0.03733614,
        0.02118028, -0.02392187,  0.04331103,  0.02193166,  0.01353315,
        0.0086139 , -0.01341078,  0.0312569 , -0.01739498, -0.04435037,
       -0.0316669 ,  0.01047034, -0.00990814, -0.04527396, -0.00659496,
       -0.00198267, -0.01089675,  0.00849969,  0.03371694,  0.0001568 ,
       -0.03257991, -0.03354436, -0.00902262,  0.04805397, -0.03052126,
        0.034209

## Evaluation function for our Model Experiments

We're going to run multiple experiments with deep learning models and scikit-learn algorithms, so to track and compare them we'll create a common evaluation function.

There are various metrics for evaluating classification models, such as accuracy, precision, recall and F1-score. Let's compute them all in one function.

In [55]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
def calculate_results(y_true, y_pred):
  """
  Calculates Model accuracy, precision, recall and f1-score for a binary classification model
  """
  # calculate the model accuracy
  model_accuracy = accuracy_score(y_true, y_pred)
  # calculate precision, recall, f1-score using "weighted" average
  model_precision, model_recall, model_f1, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted")
  model_results = {"accuracy": model_accuracy * 100,
                   "precision":model_precision,
                   "recall": model_recall,
                   "f1-score": model_f1}
                  
  return model_results
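
To see what the `"weighted"` average means, here is a pure-Python sketch that computes precision, recall and F1 per class and then averages them using each class's support (its number of true samples) as the weight. It is meant as an illustration of the idea under that assumption, not a replacement for scikit-learn; the function name and toy labels are made up.

```python
def weighted_precision_recall_f1(y_true, y_pred):
    """Per-class precision/recall/F1 averaged with class support as weights."""
    totals = {"precision": 0.0, "recall": 0.0, "f1-score": 0.0}
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        support = tp + fn  # number of true samples of this class
        totals["precision"] += support * precision
        totals["recall"] += support * recall
        totals["f1-score"] += support * f1
    return {name: value / len(y_true) for name, value in totals.items()}

toy_true = [0, 0, 0, 1, 1]
toy_pred = [0, 0, 1, 1, 0]
toy_metrics = weighted_precision_recall_f1(toy_true, toy_pred)
print(toy_metrics)
```

One consequence of weighting by support is that weighted recall always equals plain accuracy, which is why those two numbers match in the results later in this notebook.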

## Modelling a Text Dataset

For our text classifier, we'll run the following modelling experiments:
- **Model 0** : Naive Bayes (baseline)
- **Model 1** : Feed-forward neural network (dense model) 
- **Model 2** : LSTM Model (RNN)
- **Model 3** : GRU (RNN)
- **Model 4** : Bidirectional-LSTM (RNN)
- **Model 5** : 1D CNN
- **Model 6** : TF Hub Pre-trained Feature Extractor
- **Model 7** : Same as model 6 with 10% of training samples

### Model 0 : Getting a baseline

We'll use the `scikit-learn` library for this model: a Scikit-Learn Pipeline that uses the TF-IDF (term frequency-inverse document frequency) formula to convert words into numbers and then models them with the Multinomial Naive Bayes algorithm.

> 📖 **Reading**: About TF-IDF on [Scikit-learn documentation](https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting)
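
As a rough illustration of the TF-IDF idea, the sketch below weights each term by its count in a document times a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, and then L2-normalizes the result. This follows the formula described in the scikit-learn documentation for its default settings, but it is a simplified sketch: it splits on whitespace only, whereas `TfidfVectorizer` applies its own tokenization. The toy documents are made up.

```python
import math

docs = ["forest fire near town", "fire fire everywhere", "nice sunny day"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
df = {}
for tokens in tokenized:
    for term in set(tokens):
        df[term] = df.get(term, 0) + 1

def tfidf(tokens):
    """Raw term count * smoothed idf, then L2-normalized."""
    weights = {}
    for term in set(tokens):
        tf = tokens.count(term)
        idf = math.log((1 + n_docs) / (1 + df[term])) + 1
        weights[term] = tf * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

first = tfidf(tokenized[0])
print(first)
```

In the first document, "fire" gets a lower weight than "forest" because it also appears in another document and is therefore less distinctive.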

In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Create tokenization and modelling pipeline
model_0 = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB())
])

# Fit the pipeline to the training data
model_0.fit(train_sentences, train_labels)

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])

Multinomial Naive Bayes is a shallow model, so it trains quickly.

In [23]:
# evaluate the model
baseline_score = model_0.score(val_sentences, val_labels)
print(f"Our baseline model achieves an accuracy of: {baseline_score*100:.2f}%")

Our baseline model achieves an accuracy of: 80.18%


In [24]:
# let's make some predictions
baseline_preds = model_0.predict(val_sentences)
baseline_preds[:10]

array([0, 0, 0, 0, 1, 1, 0, 1, 0, 0])

In [25]:
# compare with the true labels (similar, but not identical)
val_labels[:10]

array([0, 0, 0, 0, 1, 1, 0, 1, 0, 1])

In [26]:
# get the results
baseline_results = calculate_results(y_true=val_labels,
                                     y_pred=baseline_preds)
baseline_results

{'accuracy': 80.18372703412074,
 'precision': 0.8125567744156732,
 'recall': 0.8018372703412073,
 'f1-score': 0.7968681002825004}

### Model 1 : A Simple Dense Model

The first "deep" model we're going to build is a single layer dense model.
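
The model in the next cell condenses the `(tokens, embedding_dim)` output of the embedding layer into a single vector with `GlobalAveragePooling1D` before a one-unit sigmoid layer. The pure-Python sketch below shows those two operations on made-up numbers; the helper names, weights and vectors are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import math

def global_average_pool(sequence_of_vectors):
    """Average the token vectors element-wise: (tokens, dim) -> (dim,)."""
    n_tokens = len(sequence_of_vectors)
    return [sum(column) / n_tokens for column in zip(*sequence_of_vectors)]

def dense_sigmoid(vector, weights, bias):
    """A single-unit dense layer followed by a sigmoid."""
    z = sum(v * w for v, w in zip(vector, weights)) + bias
    return 1 / (1 + math.exp(-z))

# Hypothetical 3-token sequence with 2-dimensional embeddings
token_vectors = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
pooled = global_average_pool(token_vectors)
probability = dense_sigmoid(pooled, weights=[0.5, -0.5], bias=0.0)

print(pooled)       # [2.0, 2.0]
print(probability)  # 0.5
```

Averaging discards word order, which is one reason to try the sequence models later in this notebook.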

In [27]:
from tensorflow.keras import layers

# Create directory to save Tensorboard logs
SAVE_DIR = "model_logs"

# Build model with functional API
inputs = layers.Input(shape=(1,), dtype=tf.string) # inputs are 1-dimensional strings
x = text_vectorizer(inputs) # turn text inputs into numbers using text_vectorizer
x = embedding(x) # create an embedding of the numerized text
x = layers.GlobalAveragePooling1D()(x) # lower the dimensionality of the embedding
outputs = layers.Dense(1, activation="sigmoid")(x) # create the output layer
model_1 = tf.keras.Model(inputs, outputs, name="model_1_dense") # construct the model

# compile the model
model_1.compile(loss='binary_crossentropy',
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

# fit the model
history_model_1 = model_1.fit(x=train_sentences,
                              y=train_labels,
                              epochs=5,
                              validation_data=(val_sentences, val_labels),
                              callbacks=[create_tensorboard_callback(dir_name=SAVE_DIR,
                                                                     experiment_name="simple_dense_model")])

Saving Tensorboard log files to: model_logs/simple_dense_model/20230203-043844
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [28]:
# evaluate the results
model_1.evaluate(val_sentences, val_labels)



[0.4938335716724396, 0.7952755689620972]

In [29]:
# make some predictions
model_1_pred_probs = model_1.predict(val_sentences)
model_1_pred_probs[:10]



array([[0.12096813],
       [0.00963379],
       [0.07517964],
       [0.43156   ],
       [0.9998043 ],
       [0.8580827 ],
       [0.13515034],
       [0.894733  ],
       [0.58565694],
       [0.08657642]], dtype=float32)

Alright! Let's run some more evaluations using our common evaluation function.
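
The predictions above are probabilities, while `val_labels` holds 0s and 1s, so they have to be thresholded before the two can be compared. The next cell does this with `tf.round`; the pure-Python sketch below shows the same idea with an explicit threshold, using the ten probabilities printed above. The `to_labels` helper is a hypothetical name. One small difference worth knowing: `tf.round` rounds exactly `0.5` to the nearest even number (`0`), whereas `>=` maps it to `1`.

```python
# Probabilities copied from the model_1 predictions shown above
pred_probs = [0.12096813, 0.00963379, 0.07517964, 0.43156, 0.9998043,
              0.8580827, 0.13515034, 0.894733, 0.58565694, 0.08657642]

def to_labels(probabilities, threshold=0.5):
    """Label a sample 1 when its probability reaches the threshold, else 0."""
    return [int(p >= threshold) for p in probabilities]

default_labels = to_labels(pred_probs)
strict_labels = to_labels(pred_probs, threshold=0.9)

print(default_labels)
print(strict_labels)
```

Raising the threshold makes the model flag fewer tweets as disasters, trading recall for precision.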

In [30]:
# for the evaluation we have to make it similar to our val_labels which is in 0 and 1
model_1_preds = tf.squeeze(tf.round(model_1_pred_probs))
model_1_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 0., 0., 0., 1., 1., 0., 1., 1., 0.], dtype=float32)>

In [31]:
simple_dense_results = calculate_results(y_true=val_labels,
                                         y_pred=model_1_preds)

simple_dense_results

{'accuracy': 79.52755905511812,
 'precision': 0.7966788703003602,
 'recall': 0.7952755905511811,
 'f1-score': 0.7932321923411083}

To compare results like these between two models, let's create a helper function.

In [32]:
def compare_baseline_to_new_results(baseline_results, new_model_results):
  for k,v in baseline_results.items():
    print(f"Baseline {k}: {v:.2f}, New {k}: {new_model_results[k]:.2f}, Difference: {new_model_results[k] - v:.3f}")

In [33]:
compare_baseline_to_new_results(baseline_results, simple_dense_results)

Baseline accuracy: 80.18, New accuracy: 79.53, Difference: -0.656
Baseline precision: 0.81, New precision: 0.80, Difference: -0.016
Baseline recall: 0.80, New recall: 0.80, Difference: -0.007
Baseline f1-score: 0.80, New f1-score: 0.79, Difference: -0.004


## Visualizing Learned Embeddings

Our first model (`model_1`) contained an embedding layer (`embedding`) which learned a way of representing words as feature vectors by passing over the training data.

To understand what a text embedding is, let's visualize the embedding our model learned.

In [34]:
# get the vocabulary from the text vectorization layer
words_in_vocab = text_vectorizer.get_vocabulary()
len(words_in_vocab), words_in_vocab[:10]

(10000, ['', '[UNK]', 'the', 'a', 'in', 'to', 'of', 'and', 'i', 'is'])

And now let's get our embedding layer's weights (these are the numerical representations of each word).

In [35]:
model_1.summary()

Model: "model_1_dense"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization_1 (TextV  (None, 15)               0         
 ectorization)                                                   
                                                                 
 embedding_1 (Embedding)     (None, 15, 128)           1280000   
                                                                 
 global_average_pooling1d (G  (None, 128)              0         
 lobalAveragePooling1D)                                          
                                                                 
 dense (Dense)               (None, 1)                 129       
                                                                 
Total params: 1,280,129
Trainable params: 1,280,129
N

In [36]:
# get the weight matrix of embedding layer
embed_weights = model_1.layers[2].get_weights()[0] 
print(embed_weights.shape) # same size as vocab size and embedding_dim (each word is a embedding_dim size vector)

(10000, 128)


In [37]:
embed_weights[0]

array([-0.05793182, -0.03914462, -0.02473386, -0.00400929,  0.0294634 ,
       -0.01301203,  0.05229731,  0.04887474,  0.06025089,  0.00805069,
        0.06344496,  0.06382321,  0.03528399, -0.02437459, -0.0378452 ,
       -0.00860966, -0.02997299, -0.00939368,  0.0595504 , -0.06665281,
       -0.01535043,  0.01206796,  0.0132463 ,  0.00885767, -0.03989455,
        0.00598112, -0.05867314,  0.01391945, -0.01342947, -0.03151251,
       -0.01170613, -0.01470302,  0.05553512,  0.06478365, -0.03305589,
        0.00180235,  0.03188964, -0.01603713,  0.0065343 , -0.0136434 ,
       -0.01092492,  0.0395552 , -0.00912934, -0.04180073, -0.01730741,
       -0.00267581, -0.05257862, -0.00593809, -0.00243134, -0.00406672,
        0.05934988,  0.00771908, -0.0055135 , -0.03471246, -0.0266527 ,
       -0.01058053,  0.01899303, -0.0502522 , -0.01186908, -0.00643936,
       -0.0397414 , -0.02110856, -0.00443363, -0.00674049,  0.02272329,
        0.02054001, -0.01021376, -0.03714586,  0.03495812, -0.00

Now we've got these two objects, we can use the [Embedding Projector Tool](http://projector.tensorflow.org/) to visualize our embedding.

To use the embedding projector tool, we need two files,
- the embedding vectors (same as embedding weights)
- the metadata of the embedding vectors (the words they represent - our vocabulary)

In [38]:
# import io

# # Create output writers
# out_v = io.open("embedding_vectors.tsv", "w", encoding="utf-8")
# out_m = io.open("embedding_metadata.tsv", "w", encoding="utf-8")

# # Write embedding vectors and words to file
# for num, word in enumerate(words_in_vocab):
#   if num==0:
#     continue # skip padding token
#   vec = embed_weights[num]
#   out_m.write(word + "\n") # write words to file
#   out_v.write("\t".join([str(x) for x in vec]) + "\n") # write corresponding word vector to file

# out_v.close()
# out_m.close()


# # Download files locally to upload to Embedding projector

# try:
#   from google.colab import files
# except ImportError:
#   pass
# else:
#   files.download("embedding_vectors.tsv")
#   files.download("embedding_metadata.tsv")

## Recurrent Neural Networks (RNNs)

For the next experiments we're going to use a special kind of neural network designed for sequence data, the **Recurrent Neural Network (RNN)**. RNNs use information from earlier in a sequence to help with later steps, for example predicting the next location from prior locations or generating a new sequence based on past sequences.

Recurrent Neural Networks can be used for a number of sequence-based problems:
- **One to One:** one input, one output, such as image classification
- **One to many:** one input, many outputs, such as image captioning
- **Many to one:** many inputs, one output, such as text classification
- **Many to Many:** many inputs, many outputs, such as machine translation (translating English to Spanish) or speech to text

The most common RNN cells or layers used for designing the network are:
- LSTM (Long Short-Term Memory)
- GRU (Gated Recurrent Unit)
- Bidirectional RNNs (passes forward and backward along a sequence, left to right and right to left)

The architecture of our RNN models will be:

      Input(text) -> Tokenize -> Embedding -> Layers -> Output (label probability)

### Model 2 : LSTM

> **Note**: As a best practice when comparing different models, create a new embedding layer for each one. An embedding layer is a learned representation of words, so if we reused the same embedding layer for each model, we'd be mixing what one model has learned with the next.

In [48]:
# set random seed and creating embedding layer (new embedding layer for each model)
tf.random.set_seed(17)
from tensorflow.keras import layers

model_2_embedding = layers.Embedding(input_dim=max_vocab_length,
                                           output_dim=128,
                                           embeddings_initializer="uniform",
                                           input_length=max_length,
                                           name="embedding_2")

# Create LSTM model
inputs = layers.Input(shape=(1,), dtype="string") # named `inputs` to avoid shadowing Python's built-in input()
x = text_vectorizer(inputs)
x = model_2_embedding(x)
#print(x.shape)
#x = layers.LSTM(64, return_sequences=True)(x) # return a vector for each token in the Tweet (needed when stacking recurrent layers)
x = layers.LSTM(64)(x) # return only the final hidden state
outputs = layers.Dense(1, activation='sigmoid')(x)
model_2 = tf.keras.Model(inputs, outputs, name="model_2_LSTM")

According to the TensorFlow documentation, an LSTM layer accepts a 3D input tensor of shape `[batch, timesteps, feature]`. When stacking another LSTM layer on top, the preceding LSTM must set `return_sequences=True` so that it outputs a 3D tensor (one vector per timestep). Otherwise the next LSTM receives a 2D tensor and raises an error (expected `ndim=3`, found `ndim=2`).
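
The effect of `return_sequences` can be sketched with a much simpler recurrent layer. The example below uses a plain scalar `tanh` recurrence, not an LSTM (there are no gates or cell state), and the weights are arbitrary made-up values; it only illustrates the difference between returning one state per timestep and returning the final state.

```python
import math

def simple_rnn(sequence, w_x=0.5, w_h=0.8, bias=0.0, return_sequences=False):
    """Scalar recurrent layer: h_t = tanh(w_x * x_t + w_h * h_prev + bias)."""
    hidden = 0.0
    all_hidden = []
    for x_t in sequence:
        hidden = math.tanh(w_x * x_t + w_h * hidden + bias)
        all_hidden.append(hidden)
    # Either one state per timestep (needed to feed another recurrent layer)
    # or only the final state (enough for a classifier head)
    return all_hidden if return_sequences else hidden

steps = [1.0, 0.0, -1.0]
per_step = simple_rnn(steps, return_sequences=True)
final = simple_rnn(steps)

print(len(per_step), final == per_step[-1])
```

With `return_sequences=True` the output keeps the time axis, so another recurrent layer can consume it; without it, only the last state is passed on, which is what the `Dense` output layer in `model_2` needs.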

### Helper Function for Compiling and Fitting the Model

In [43]:
# for compiling and fitting the model let's create a helper function

def compile_fit_RNNs(model, dir_name, experiment_name):
  """
  Compiles and fits the model, saving TensorBoard logs for the experiment
  in the given directory.

  Returns the model's training history.
  """
  model.compile(loss='binary_crossentropy',
                optimizer=tf.keras.optimizers.Adam(),
                metrics=['accuracy'])

  # fit the model
  history = model.fit(train_sentences,
                      train_labels,
                      epochs=5,
                      validation_data=(val_sentences, val_labels),
                      callbacks=[create_tensorboard_callback(dir_name, experiment_name)])
  
  return history

In [49]:
model_2_history = compile_fit_RNNs(model=model_2, dir_name=SAVE_DIR, experiment_name="model_2_LSTM")

Saving Tensorboard log files to: model_logs/model_2_LSTM/20230203-051605
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Let's check the summary of the LSTM model:

In [50]:
model_2.summary()

Model: "model_2_LSTM"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_5 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization_1 (TextV  multiple                 0         
 ectorization)                                                   
                                                                 
 embedding_2 (Embedding)     (None, 15, 128)           1280000   
                                                                 
 lstm_4 (LSTM)               (None, 64)                49408     
                                                                 
 dense_3 (Dense)             (None, 1)                 65        
                                                                 
Total params: 1,329,473
Trainable params: 1,329,473
Non-trainable params: 0
____________________________________________
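The 49,408 parameters of the LSTM layer can be reproduced by hand. A minimal sketch, assuming the standard LSTM layout of four gate blocks, each with an input kernel, a recurrent kernel and a bias; the helper name `lstm_param_count` is hypothetical, and the vocabulary size of 10,000 is inferred from the embedding row (1,280,000 / 128):

```python
def lstm_param_count(input_dim, units):
    """Parameters of an LSTM layer: 4 gate blocks, each with
    an input kernel, a recurrent kernel and a bias vector."""
    per_gate = input_dim * units + units * units + units
    return 4 * per_gate

embedding_params = 10_000 * 128  # assumed vocabulary size x embedding dimension
lstm_params = lstm_param_count(input_dim=128, units=64)
dense_params = 64 * 1 + 1        # weights + bias

print(lstm_params)                                     # 49408
print(embedding_params + lstm_params + dense_params)   # 1329473
```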

In [51]:
# make some predictions
model_2_pred_probs = model_2.predict(val_sentences)
model_2_pred_probs[:10]



array([[0.0211548 ],
       [0.00182996],
       [0.01620475],
       [0.16472116],
       [0.9998977 ],
       [0.99410546],
       [0.01473757],
       [0.7996622 ],
       [0.8936984 ],
       [0.54785377]], dtype=float32)

In [52]:
# convert them to compare with labels
model_2_preds = tf.squeeze(tf.round(model_2_pred_probs))
model_2_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 0., 0., 0., 1., 1., 0., 1., 1., 1.], dtype=float32)>
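Rounding the sigmoid outputs amounts to applying a 0.5 decision threshold. A plain-Python sketch of the same idea on the ten probabilities printed above; the variable names are illustrative, and the explicit comparison is a stand-in for `tf.round`, not the notebook's own code:

```python
pred_probs = [0.0211548, 0.00182996, 0.01620475, 0.16472116, 0.9998977,
              0.99410546, 0.01473757, 0.7996622, 0.8936984, 0.54785377]

# Anything above 0.5 becomes the positive class (1.0), everything else 0.0
pred_labels = [1.0 if p > 0.5 else 0.0 for p in pred_probs]
print(pred_labels)
# [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0]
```

Note that the last prediction (0.5479) only just clears the threshold, so the hard label hides how uncertain the model was.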

In [56]:
# Let's calculate the results
model_2_results = calculate_results(y_true=val_labels,
                                    y_pred=model_2_preds)
model_2_results

{'accuracy': 76.64041994750657,
 'precision': 0.7659025055163764,
 'recall': 0.7664041994750657,
 'f1-score': 0.7660201893659146}

In [57]:
# compare the baseline results with model_2
compare_baseline_to_new_results(baseline_results, model_2_results)

Baseline accuracy: 80.18, New accuracy: 76.64, Difference: -3.543
Baseline precision: 0.81, New precision: 0.77, Difference: -0.047
Baseline recall: 0.80, New recall: 0.77, Difference: -0.035
Baseline f1-score: 0.80, New f1-score: 0.77, Difference: -0.031


### Model 3 : GRU

The GRU cell has similar features to an LSTM cell but has fewer parameters.

In [60]:
# set the random seed and create new embedding layer
tf.random.set_seed(17)

model_3_embedding = layers.Embedding(input_dim=max_vocab_length,
                                     output_dim=128,
                                     embeddings_initializer='uniform',
                                     input_length=max_length,
                                     name="embedding_3")

# Build GRU model
inputs = layers.Input(shape=(1,), dtype='string')
x = text_vectorizer(inputs)
x = model_3_embedding(x)
x = layers.GRU(64)(x)
outputs = layers.Dense(1, activation='sigmoid')(x)
model_3 = tf.keras.Model(inputs, outputs, name='model_3_GRU')


In [61]:
# let's see the summary
model_3.summary()

Model: "model_3_GRU"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_7 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization_1 (TextV  multiple                 0         
 ectorization)                                                   
                                                                 
 embedding_3 (Embedding)     (None, 15, 128)           1280000   
                                                                 
 gru_1 (GRU)                 (None, 64)                37248     
                                                                 
 dense_5 (Dense)             (None, 1)                 65        
                                                                 
Total params: 1,317,313
Trainable params: 1,317,313
Non-trainable params: 0
_____________________________________________

The architecture is the same as the LSTM model, but the GRU layer has fewer parameters (37,248 versus 49,408).
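The gap comes from the gate count: a GRU has three gate blocks where an LSTM has four. A minimal sketch of the arithmetic, assuming the TensorFlow 2 default `reset_after=True`, which keeps separate input and recurrent bias vectors for each gate block; `gru_param_count` is a hypothetical helper:

```python
def gru_param_count(input_dim, units):
    """Parameters of a GRU layer with reset_after=True: 3 gate blocks,
    each with an input kernel, a recurrent kernel and two bias vectors."""
    per_gate = input_dim * units + units * units + 2 * units
    return 3 * per_gate

gru_params = gru_param_count(input_dim=128, units=64)
lstm_params = 49_408  # taken from the model_2 summary

print(gru_params)                # 37248
print(lstm_params - gru_params)  # 12160
```

The difference of 12,160 matches the gap between the two models' total parameter counts (1,329,473 versus 1,317,313).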

In [62]:
# compile and fit the model
model_3_history = compile_fit_RNNs(model_3, SAVE_DIR, "model_3_GRU")

Saving Tensorboard log files to: model_logs/model_3_GRU/20230203-052809
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [63]:
# make predictions
model_3_pred_probs = model_3.predict(val_sentences)
# convert them into labels
model_3_preds = tf.squeeze(tf.round(model_3_pred_probs)) 
model_3_preds[:10]



<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 0., 0., 0., 1., 1., 0., 1., 1., 0.], dtype=float32)>

In [66]:
# calculate the result
model_3_results = calculate_results(y_true=val_labels,
                                    y_pred=model_3_preds)
model_3_results

{'accuracy': 75.59055118110236,
 'precision': 0.7551876442680979,
 'recall': 0.7559055118110236,
 'f1-score': 0.7551196015270819}

In [67]:
# compare with baseline score
compare_baseline_to_new_results(baseline_results, model_3_results)

Baseline accuracy: 80.18, New accuracy: 75.59, Difference: -4.593
Baseline precision: 0.81, New precision: 0.76, Difference: -0.057
Baseline recall: 0.80, New recall: 0.76, Difference: -0.046
Baseline f1-score: 0.80, New f1-score: 0.76, Difference: -0.042


It looks like the baseline still outperforms both recurrent models (LSTM and GRU). Let's try one more model.

### Model 4 : Bidirectional RNNs

A standard RNN processes a sequence from left to right, whereas a bidirectional RNN processes it from left to right and also from right to left, then combines the two results (Keras concatenates them by default).
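To make the two passes concrete, here is a minimal NumPy sketch. It assumes a plain tanh RNN cell rather than an LSTM, random weights, and a random sequence standing in for one embedded Tweet; all names are illustrative:

```python
import numpy as np

def rnn_final_state(sequence, w_x, w_h):
    """Run a plain tanh RNN over the sequence and return the last hidden state."""
    hidden = np.zeros(w_h.shape[0])
    for x_t in sequence:
        hidden = np.tanh(x_t @ w_x + hidden @ w_h)
    return hidden

rng = np.random.default_rng(17)
timesteps, features, units = 15, 128, 64
sequence = rng.normal(size=(timesteps, features))  # stand-in for one embedded Tweet

# Each direction has its own set of weights
fwd_w_x = rng.normal(scale=0.1, size=(features, units))
fwd_w_h = rng.normal(scale=0.1, size=(units, units))
bwd_w_x = rng.normal(scale=0.1, size=(features, units))
bwd_w_h = rng.normal(scale=0.1, size=(units, units))

forward_state = rnn_final_state(sequence, fwd_w_x, fwd_w_h)         # left to right
backward_state = rnn_final_state(sequence[::-1], bwd_w_x, bwd_w_h)  # right to left

# Concatenating both directions doubles the output width
bidirectional_output = np.concatenate([forward_state, backward_state])
print(bidirectional_output.shape)  # (128,)
```

Because each direction keeps its own weights and contributes its own 64-dimensional state, both the parameter count and the output width are twice those of a single-direction layer.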

In [69]:
tf.random.set_seed(17)
model_4_embedding = layers.Embedding(input_dim=max_vocab_length,
                                     output_dim=128,
                                     embeddings_initializer='uniform',
                                     input_length=max_length,
                                     name="embedding_4")

# Build the Bidirectional Model
inputs = layers.Input(shape=(1,), dtype='string')
x = text_vectorizer(inputs)
x = model_4_embedding(x)
x = layers.Bidirectional(layers.LSTM(64))(x)
outputs = layers.Dense(1, activation='sigmoid')(x)
model_4 = tf.keras.Model(inputs, outputs, name='model_4_Bidirectional')

In [70]:
# see the summary
model_4.summary()

Model: "model_4_Bidirectional"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_8 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization_1 (TextV  multiple                 0         
 ectorization)                                                   
                                                                 
 embedding_4 (Embedding)     (None, 15, 128)           1280000   
                                                                 
 bidirectional (Bidirectiona  (None, 128)              98816     
 l)                                                              
                                                                 
 dense_6 (Dense)             (None, 1)                 129       
                                                                 
Total params: 1,378,945
Trainable params: 1,3

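Notice that the Bidirectional layer has twice the parameters of the single LSTM layer in model_2 (98,816 versus 49,408), and its output width doubles from 64 to 128, because it holds one LSTM per direction.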
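The totals in the summary can be checked with a few lines of arithmetic. A sketch, assuming the single-direction LSTM count of 49,408 from the model_2 summary and a vocabulary size of 10,000 inferred from the embedding row:

```python
single_lstm_params = 49_408                      # LSTM(64) on 128-dim inputs, from model_2
bidirectional_params = 2 * single_lstm_params    # one LSTM per direction

bidirectional_units = 2 * 64                     # forward and backward outputs concatenated
dense_params = bidirectional_units * 1 + 1       # weights + bias

embedding_params = 10_000 * 128                  # assumed vocabulary size x embedding dimension

print(bidirectional_params)                                    # 98816
print(dense_params)                                            # 129
print(embedding_params + bidirectional_params + dense_params)  # 1378945
```

The wider recurrent output is also why the final `Dense` layer grows from 65 to 129 parameters.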

In [71]:
# compile and fit
model_4_history = compile_fit_RNNs(model_4, SAVE_DIR, "model_4_Bidirectional")

Saving Tensorboard log files to: model_logs/model_4_Bidirectional/20230203-053715
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [72]:
# make predictions
model_4_pred_probs = model_4.predict(val_sentences)



In [73]:
# convert them into labels
model_4_preds = tf.squeeze(tf.round(model_4_pred_probs))
model_4_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 0., 0., 0., 1., 1., 0., 1., 1., 0.], dtype=float32)>

In [74]:
# calculate the result
model_4_results = calculate_results(y_true=val_labels,
                                    y_pred=model_4_preds)
model_4_results

{'accuracy': 75.98425196850394,
 'precision': 0.7600694650010854,
 'recall': 0.7598425196850394,
 'f1-score': 0.7599441744417845}

In [75]:
# compare with baseline scores
compare_baseline_to_new_results(baseline_results, model_4_results)

Baseline accuracy: 80.18, New accuracy: 75.98, Difference: -4.199
Baseline precision: 0.81, New precision: 0.76, Difference: -0.052
Baseline recall: 0.80, New recall: 0.76, Difference: -0.042
Baseline f1-score: 0.80, New f1-score: 0.76, Difference: -0.037
