# Introduction to NLP Fundamentals in tensorflow for this project.

In this colab we'll be using a dataset from Kaggle that has text samples of Tweets labelled as diaster or not diaster.

The real Tweets are actually about disasters,  for example:


    Jetstar and Virgin forced to cancel Bali flights again because of ash from Mount Raung volcano*


The NOT real Tweets are Tweets not about disasters (they can be on anything), for example:


    'Education is the most powerful weapon which you can use to change the world.' Nelson #Mandela #quote


See the original source here: https://www.kaggle.com/c/nlp-getting-started/data.

## Imports and Helper Fuctions

```
!wget https://raw.githubusercontent.com/Charliecr94/Tensor_flow_projects/main/Extras/helper_functions.py
from helper_functions import unzip_data, create_tensorboard_callback, plot_loss_curves, compare_historys

```

## Download a text dataset
```
# Download data.
!wget "https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip"

# Unzip data
unzip_data("nlp_getting_started.zip")
``` 



In [52]:
!wget https://raw.githubusercontent.com/Charliecr94/Tensor_flow_projects/main/Extras/helper_functions.py
from helper_functions import unzip_data, create_tensorboard_callback, plot_loss_curves, compare_historys
# Download data.
!wget "https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip"

# Unzip data
unzip_data("nlp_getting_started.zip")

--2023-03-03 16:57:10--  https://raw.githubusercontent.com/Charliecr94/Tensor_flow_projects/main/Extras/helper_functions.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10234 (10.0K) [text/plain]
Saving to: ‘helper_functions.py.1’


2023-03-03 16:57:11 (72.0 MB/s) - ‘helper_functions.py.1’ saved [10234/10234]

--2023-03-03 16:57:11--  https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 142.250.152.128, 142.250.128.128, 142.251.6.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.250.152.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 607343 (593K) [application/zip]
Saving to: ‘nlp_getting_started.zip.1’


2023-03-03 16:57:11 

# 1. Get the Data ready

### Exploring the dataset.

In [53]:
import pandas as pd
import numpy as np
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("train.csv")
train_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [54]:
# Shuffle training datafram
train_df_shuffled = train_df.sample(frac=1, random_state=42)
train_df_shuffled.head()

Unnamed: 0,id,keyword,location,text,target
2644,3796,destruction,,So you have a new weapon that can cause un-ima...,1
2227,3185,deluge,,The f$&amp;@ing things I do for #GISHWHES Just...,0
5448,7769,police,UK,DT @georgegalloway: RT @Galloway4Mayor: ÛÏThe...,1
132,191,aftershock,,Aftershock back to school kick off was great. ...,0
6845,9810,trauma,"Montgomery County, MD",in response to trauma Children of Addicts deve...,0


In [55]:
# How many examples of each class?
train_df.target.value_counts()

0    4342
1    3271
Name: target, dtype: int64

Since we have Two target values, we're dealing with a **binary classification** problem. Where:
  * `1` = a real disaster Tweet
  * `2` = not a real disaster Tweet
 




In [56]:
# How many total samples?
len(train_df), len(test_df)

(7613, 7613)

### Visualize some random training examples:

In [57]:
import random 
random_index = random.randint(0, len(train_df)-5)
for row in train_df_shuffled[["text","target"]][random_index:random_index+5].itertuples():
  _, text, target = row
  print(f"Target: {target}", "(real disaster)" if target > 0 else "(not real disaster)")
  print(f"Text:\n{text}\n")
  print("---\n")

Target: 1 (real disaster)
Text:
It's never a good sign when you pull up to work &amp; there's five ambulances &amp; a fire truck in the bay. Wompppp at least it's Friday

---

Target: 0 (not real disaster)
Text:
RIZZO IS ON ???????? THAT BALL WAS OBLITERATED

---

Target: 1 (real disaster)
Text:
Palestinian Teen Killed Amid Protests Against Arson Attack http://t.co/okVsImoGic

---

Target: 0 (not real disaster)
Text:
#frontpage: #Bioterror lab faced secret sanctions. #RickPerry doesn't make the cut for @FoxNews #GOPDebate http://t.co/fZujg7sXJg @USATODAY

---

Target: 0 (not real disaster)
Text:
My emotions are a train wreck. My body is a train wreck. I'm a wreck

---



### Splitting data into training and validation sets

In [58]:
from sklearn.model_selection import train_test_split

# Split training data into training and validation sets.
train_sentences, val_sentences, train_labels, val_labels = train_test_split(train_df_shuffled["text"].to_numpy(),
                                                                           train_df_shuffled["target"].to_numpy(),
                                                                           test_size= 0.1,
                                                                           random_state= 42)                                                                                                                                                

In [59]:
# Check the lengths
len(train_sentences), len(val_sentences), len(train_labels), len(val_labels)

(6851, 762, 6851, 762)

### Converting text data to numbers using tokenisation and embeddings.

In [60]:
# Find the average numbers of tokens (words) in the trainin tweets
round(sum([len(i.split()) for i in train_sentences])/len(train_sentences))

15

In [61]:
import tensorflow as tf
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

# Setup text vectorization variables
max_vocab_length = 10000
max_length = 15

# Use the default TextVectorization parameters
text_vectorizer = TextVectorization(max_tokens= max_vocab_length, # how many words in the vocabulary
                                    standardize='lower_and_strip_punctuation',
                                    split="whitespace",
                                    ngrams=None,
                                    output_mode="int",
                                    output_sequence_length=max_length,
                                    pad_to_max_tokens= True)


To map our `TextVectorization` instance `text_vectorizer` to our data, we can call the `adapt()` method on it whilst passing it our training text.

In [62]:
# fit the text vectorizer to the training text
text_vectorizer.adapt(train_sentences)

In [63]:
# Create a sample sentences and tokenize it
sample_sentences = "There's a flood in my street!"
text_vectorizer([sample_sentences])

<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[264,   3, 232,   4,  13, 698,   0,   0,   0,   0,   0,   0,   0,
          0,   0]])>

Beautiful!

Now we've got a way to turn our text into numbers, so let's try this `text_vectorizer` instance on a few random training sentences!

In [64]:
# Choose a random sentence from the training dataset and tokenize it
random_sentence = random.choice(train_sentences)
print(f"Original text:\n{random_sentence}\
      \n\nVectorized version:")
text_vectorizer([random_sentence])

Original text:
Woman sneaks into airplane cockpit; terrorism not suspected http://t.co/1W58Ehv9S1 http://t.co/p8Ih0hni3l      

Vectorized version:


<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[ 410, 8353,   66,  444, 5961,  361,   34, 1355,    1,    1,    0,
           0,    0,    0,    0]])>

Finally, we can check out some uniques tokens in our vocabulary using `get_vocabulary()` method

In [65]:
# Get the unique words in the vocabulary
words_in_vocabulary = text_vectorizer.get_vocabulary()
top_5_words = words_in_vocabulary[:5]
bottom_5_words = words_in_vocabulary[-5:]
print(f"Number of words in vocab: {len(words_in_vocabulary)}")
print(f"Top 5 most common words:{top_5_words}")
print(f"Bottom 5 least common words:{bottom_5_words}")

Number of words in vocab: 10000
Top 5 most common words:['', '[UNK]', 'the', 'a', 'in']
Bottom 5 least common words:['pages', 'paeds', 'pads', 'padres', 'paddytomlinson1']


### Creating an Embedding using an Embedding Layer

In [66]:
from tensorflow.keras import layers

embedding = layers.Embedding(input_dim=max_vocab_length,
                             output_dim=128,
                             embeddings_initializer="uniform",
                             input_length=max_length,
                             name="embedding_1")
embedding


<keras.layers.core.embedding.Embedding at 0x7fc8358f3c70>

In [67]:
# Get a random sentence from training set
random_sentence = random.choice(train_sentences)
print(f"Original text:\n{random_sentence}\
      \n\nEmbedded version:")

# Embed the random sentence (turn it into numerical representation)
sample_embed = embedding(text_vectorizer([random_sentence]))
sample_embed

Original text:
First Tweet collided with a Selfie. Pretty 'Sweet' if you ask me???? http://t.co/knomg9pfiz      

Embedded version:


<tf.Tensor: shape=(1, 15, 128), dtype=float32, numpy=
array([[[-0.02093711,  0.04156688,  0.01657852, ...,  0.03687808,
         -0.03648955,  0.01522502],
        [ 0.04606913, -0.01693554, -0.02520466, ..., -0.01627056,
         -0.04016382,  0.00044153],
        [-0.02461989,  0.01436492, -0.01566825, ...,  0.04777322,
         -0.01958194,  0.04180281],
        ...,
        [-0.01186168,  0.03959366,  0.00843366, ..., -0.0155185 ,
          0.03665351,  0.02508214],
        [ 0.02575022, -0.00555981, -0.01584203, ...,  0.0129073 ,
         -0.04916935, -0.04302556],
        [ 0.02575022, -0.00555981, -0.01584203, ...,  0.0129073 ,
         -0.04916935, -0.04302556]]], dtype=float32)>

Each token in the sentences gets turned into a length 128 feature vector.

So now the our data is ready let's test some models:

# Modeling a text dataset

In this colab we'll explore 7 modeling solutions.

  * **Model 0:** Naive Bayes (From Sklearn)
  * **Model 1:** Feed-forward neural network
  * **Model 2:** LSTM model (RNN)
  * **Model 3:** GRU model (RNN)
  * **Model 4:** Bidirectional-LSTM model (RNN)
  * **Model 5:** 1D Convolutional Neural Network (CNN)
  * **Model 6:** TensorFlow Hub Pretrained Feature Extrator (using transfer learning for NLP)
  * Model 7: Same as model 6 with 10% of training data.


### Model 0: Getting a baseline with Scikit-Learn

To create our baseline, we'll create a Scikit-Learn Pipeline using the TF-IDF (term frequency-inverse document frequency) formula to convert our words to numbers and then model them with the Multinomial Naive Bayes algorithm. This was chosen via referring to the [Scikit-Learn machine learning map.](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)



In [68]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Create tokenization and modeling pipeline
model_0= Pipeline([
                   ("tfidf", TfidfVectorizer()),
                   ("clf", MultinomialNB())
                   ])

# Fit the pipeline to the training data
model_0.fit(train_sentences, train_labels)


In [69]:
# Evaluate our baseline model
baseline_score = model_0.score(val_sentences, val_labels)
print(f"The model have a {baseline_score*100:.2f}% of accuracy")

The model have a 79.27% of accuracy


In [70]:
baseline_preds = model_0.predict(val_sentences)
baseline_preds[:20]

array([1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1])

# Evaluating the models

### Creating an evaluation function for our model experiments.

Since we're going to be evaluating several models in the same way, let's create a helper function which takes and array of predictions and ground truth labels and computes the following:

* Accuracy
* Precision
* Recall
* F1-score


In [71]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def calculate_results(y_true, y_pred):
  """
  Calculates model accuracy, precision, recall and f1 score of a binary classification model.
  """

  # Calculate model accuracy
  model_accuracy = accuracy_score(y_true, y_pred)
  
  # Calaculate model precision, recall and f1-score
  model_precision, model_recall, model_f1, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted")
  model_results = {"accuracy": model_accuracy,
                   "precision": model_precision,
                   "recall": model_recall,
                   "f1": model_f1}
  return model_results


In [72]:
# Get baseline results
baseline_results = calculate_results(y_true=val_labels, y_pred=baseline_preds)
baseline_results

{'accuracy': 0.7926509186351706,
 'precision': 0.8111390004213173,
 'recall': 0.7926509186351706,
 'f1': 0.7862189758049549}

In [73]:
# Create a helper function to compare our baseline results to new model results
def compare_baseline_to_new_results(baseline_results, new_model_results):
  for key, value in baseline_results.items():
    print(f"Baseline {key}: {value:.2f}, New {key}: {new_model_results[key]:.2f}, Difference: {new_model_results[key]-value:.2f}")

compare_baseline_to_new_results(baseline_results=baseline_results, 
                                new_model_results=model_1_results)

Baseline accuracy: 0.79, New accuracy: 0.79, Difference: -0.00
Baseline precision: 0.81, New precision: 0.80, Difference: -0.02
Baseline recall: 0.79, New recall: 0.79, Difference: -0.00
Baseline f1: 0.79, New f1: 0.79, Difference: 0.00


# Improve throught experimentation

### Model 1: A simple dense model.

In [74]:
# Create a tensorboard callback
from helper_functions import create_tensorboard_callback

# Create a directory to save tensoBoard logs
SAVE_DIR = "model_logs"

In [75]:
# Build model with the Functional API
from tensorflow.keras import layers

inputs = layers.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs) # turn the input text into numbers
x = embedding(x) # create an embedding of the numberized inputs
x = layers.GlobalAveragePooling1D()(x) # Lower the dimensionalty of the embedding
outputs = layers.Dense(1, activation="sigmoid")(x) # Create the output layer
model_1 = tf.keras.Model(inputs, outputs, name="model_1_dense") # construct the model



In [76]:
# Compile model
model_1.compile(loss="binary_crossentropy",
                optimizer= tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

In [77]:
model_1.summary()

Model: "model_1_dense"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_3 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization_1 (TextV  (None, 15)               0         
 ectorization)                                                   
                                                                 
 embedding_1 (Embedding)     (None, 15, 128)           1280000   
                                                                 
 global_average_pooling1d_1   (None, 128)              0         
 (GlobalAveragePooling1D)                                        
                                                                 
 dense_2 (Dense)             (None, 1)                 129       
                                                                 
Total params: 1,280,129
Trainable params: 1,280,129
N

In [78]:
# Fit the model
model_1_history = model_1.fit(train_sentences, # input sentences can be a list of strings due to text preprocessing layer built-in model
                              train_labels,
                              epochs=5,
                              validation_data=(val_sentences, val_labels),
                              callbacks=[create_tensorboard_callback(dir_name=SAVE_DIR, 
                                                                     experiment_name="simple_dense_model")])

Saving TensorBoard log files to: model_logs/simple_dense_model/20230303-165714
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Let's check this model performance on the validation set

In [79]:
# Check the results
model_1.evaluate(val_sentences, val_labels)



[0.4766300618648529, 0.787401556968689]

In [80]:
# Make some predictions and evaluate those
model_1_pred_probs = model_1.predict(val_sentences)
model_1_pred_probs.shape



(762, 1)

In [81]:
# Look at the firts 3 predictions
model_1_pred_probs[:8]

array([[0.4133136 ],
       [0.7443478 ],
       [0.99777716],
       [0.10999878],
       [0.10660768],
       [0.9387579 ],
       [0.9154959 ],
       [0.9926271 ]], dtype=float32)

For comparing our `pred_probs` with the `true_labels` we need to convert the predicts to label format.

In [82]:
# Convert model prediction probabilities to label format
model_1_preds = tf.squeeze(tf.round(model_1_pred_probs))
model_1_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 1., 1., 0., 0., 1., 1., 1., 1., 0.], dtype=float32)>

In [83]:
# Calculate our model_1 results
model_1_results = calculate_results(y_true= val_labels,
                                    y_pred= model_1_preds)
model_1_results

{'accuracy': 0.7874015748031497,
 'precision': 0.7914920592553047,
 'recall': 0.7874015748031497,
 'f1': 0.7846966492209201}

In [None]:
# View tensorboard logs of transfer learning modelling experiments (should be 4 models)
# Upload TensorBoard dev records
!tensorboard dev upload --logdir ./model_logs \
   --name "First deep model on text data" \
   --description "Trying a dense model with an embedding layer" \
   --one_shot # exits the uploader when upload has finished

In [None]:
# Delete Tensoboard instance.
!tensorboard dev delete --experiment_id Y87inY29TPCA0nwXKXIkqg

Looks like our baseline is out performing our firts deep learning model...

### Model_2: LSTM RNN

To make sure we're not reusing trained embeddings, we'll create another embedding layer for our second model.

In [86]:
# Set random seed and create embedding layer
tf.random.set_seed(42)
from tensorflow.keras import layers
model_2_embedding = layers.Embedding(input_dim=max_vocab_length,
                                     output_dim=128,
                                     embeddings_initializer="uniform",
                                     input_length= max_length,
                                     name="embedding_2")

In [87]:
#  Set random seed and create embedding layer
inputs = layers.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
x = model_2_embedding(x)
x = layers.LSTM(64)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model_2= tf.keras.Model(inputs,outputs, name="model_2_LSTM")

In [88]:
# Compile model
model_2.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

In [89]:
model_2.summary()

Model: "model_2_LSTM"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_4 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization_1 (TextV  (None, 15)               0         
 ectorization)                                                   
                                                                 
 embedding_2 (Embedding)     (None, 15, 128)           1280000   
                                                                 
 lstm_1 (LSTM)               (None, 64)                49408     
                                                                 
 dense_3 (Dense)             (None, 1)                 65        
                                                                 
Total params: 1,329,473
Trainable params: 1,329,473
Non-trainable params: 0
____________________________________________

In [90]:
# Fit the model
model_2_history = model_2.fit(train_sentences,
                               train_labels,
                               epochs=5,
                               validation_data=(val_sentences, val_labels),
                               callbacks= [create_tensorboard_callback(SAVE_DIR,
                                                                       "LSTM")])

Saving TensorBoard log files to: model_logs/LSTM/20230303-170342
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [91]:
# Make predictions on the validation dataset
model_2_pred_probs = model_2.predict(val_sentences)
model_2_pred_probs.shape, model_2_pred_probs[:10] # view the firts 10



((762, 1), array([[0.01593563],
        [0.7578992 ],
        [0.9992505 ],
        [0.05997077],
        [0.00282428],
        [0.99931824],
        [0.9792722 ],
        [0.99962115],
        [0.9994947 ],
        [0.2321203 ]], dtype=float32))

Our current model returns prediction probabilities rather than classes, we can turn these predictions into classes by rounding to the nearest integer.

In [92]:
# Round out predictions and reduce to 1-dimensional array
model_2_preds = tf.squeeze(tf.round(model_2_pred_probs))
model_2_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 1., 1., 0., 0., 1., 1., 1., 1., 0.], dtype=float32)>

In [93]:
# Calculate LSTM model results
model_2_results = calculate_results(y_true= val_labels,
                                    y_pred=model_2_preds)
model_2_results

{'accuracy': 0.7650918635170604,
 'precision': 0.7664434345240916,
 'recall': 0.7650918635170604,
 'f1': 0.7630272521222509}

In [94]:
# Compare model 2 to baseline
compare_baseline_to_new_results(baseline_results, model_2_results)

Baseline accuracy: 0.79, New accuracy: 0.77, Difference: -0.03
Baseline precision: 0.81, New precision: 0.77, Difference: -0.04
Baseline recall: 0.79, New recall: 0.77, Difference: -0.03
Baseline f1: 0.79, New f1: 0.76, Difference: -0.02


### Model_3: GRU RNN

In [102]:
# Create embedding layer
model_3_embedding = layers.Embedding(input_dim=max_vocab_length,
                                     output_dim=128,
                                     embeddings_initializer="uniform",
                                     input_length=max_length,
                                     name="embedding_3")
# Build an RNN using the GRU cell
inputs = layers.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
x = model_3_embedding(x)
x = layers.GRU(64)(x) 
outputs = layers.Dense(1, activation="sigmoid")(x)
model_3 = tf.keras.Model(inputs, outputs, name="model_3_GRU")

In [103]:
model_3.summary()

Model: "model_3_GRU"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_8 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization_1 (TextV  (None, 15)               0         
 ectorization)                                                   
                                                                 
 embedding_3 (Embedding)     (None, 15, 128)           1280000   
                                                                 
 gru_4 (GRU)                 (None, 64)                37248     
                                                                 
 dense_6 (Dense)             (None, 1)                 65        
                                                                 
Total params: 1,317,313
Trainable params: 1,317,313
Non-trainable params: 0
_____________________________________________

In [104]:
# Compile GRU model
model_3.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

In [105]:
# Fit the model
model_3_history = model_3.fit(train_sentences,
                              train_labels,
                              epochs=5,
                              validation_data= (val_sentences, val_labels),
                              callbacks=[create_tensorboard_callback(SAVE_DIR,"GRU")])

Saving TensorBoard log files to: model_logs/GRU/20230303-172229
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [106]:
# Make predictions on validation data
model_3_pred_probs = model_3.predict(val_sentences)
model_3_pred_probs.shape, model_3_pred_probs[:10]




((762, 1), array([[0.30253026],
        [0.89125985],
        [0.99651426],
        [0.17802308],
        [0.00878775],
        [0.99373806],
        [0.76566184],
        [0.9978204 ],
        [0.9964696 ],
        [0.37190127]], dtype=float32))

In [108]:
# Convert prediction probabilities to prediction classes
model_3_pred = tf.squeeze(tf.round(model_3_pred_probs))
model_3_pred[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 1., 1., 0., 0., 1., 1., 1., 1., 0.], dtype=float32)>

In [110]:
# Calculate model_3 results
model_3_results = calculate_results(y_true= val_labels,
                                    y_pred= model_3_pred)
model_3_results

{'accuracy': 0.7755905511811023,
 'precision': 0.7759863909628747,
 'recall': 0.7755905511811023,
 'f1': 0.7743062301518678}

In [112]:
# Compare to baseline
compare_baseline_to_new_results(baseline_results, model_3_results)

Baseline accuracy: 0.79, New accuracy: 0.78, Difference: -0.02
Baseline precision: 0.81, New precision: 0.78, Difference: -0.04
Baseline recall: 0.79, New recall: 0.78, Difference: -0.02
Baseline f1: 0.79, New f1: 0.77, Difference: -0.01


### Model_4: Bidirectonal RNN.

In [116]:
# Create embedding layer
model_4_embedding = layers.Embedding(input_dim=max_vocab_length,
                                     output_dim=128,
                                     embeddings_initializer="uniform",
                                     input_length=max_length,
                                     name="embedding_4")

# Build a bidirectional RNN in Tensorflow
inputs = layers.Input(shape= (1,), dtype= "string")
x = text_vectorizer(inputs)
x = model_4_embedding(x)
x = layers.Bidirectional(layers.LSTM(64))(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model_4 = tf.keras.Model(inputs, outputs, name="model_4_Bidirectional")

In [119]:
# Get a summary
model_4.summary()

Model: "model_4_Bidirectional"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_10 (InputLayer)       [(None, 1)]               0         
                                                                 
 text_vectorization_1 (TextV  (None, 15)               0         
 ectorization)                                                   
                                                                 
 embedding_4 (Embedding)     (None, 15, 128)           1280000   
                                                                 
 bidirectional_2 (Bidirectio  (None, 128)              98816     
 nal)                                                            
                                                                 
 dense_8 (Dense)             (None, 1)                 129       
                                                                 
Total params: 1,378,945
Trainable params: 1,3

In [118]:
# Compile model
model_4.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

In [120]:
# Fit the model
model_4_history = model_4.fit(train_sentences,
                              train_labels,
                              epochs=5,
                              validation_data = (val_sentences, val_labels),
                              callbacks= [create_tensorboard_callback(SAVE_DIR, "bidirectional_RNN")])

Saving TensorBoard log files to: model_logs/bidirectional_RNN/20230303-181943
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [121]:
# Make predictions with bidirectional RNN on the validation data
model_4_pred_probs = model_4.predict(val_sentences)
model_4_pred_probs[:10]



array([[0.06770237],
       [0.8605525 ],
       [0.9992816 ],
       [0.07982261],
       [0.00726422],
       [0.9973387 ],
       [0.9584599 ],
       [0.999514  ],
       [0.99960023],
       [0.13278319]], dtype=float32)

In [122]:
# Convert prediction probabilities to labels
model_4_preds = tf.squeeze(tf.round(model_4_pred_probs))
model_4_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 1., 1., 0., 0., 1., 1., 1., 1., 0.], dtype=float32)>

In [123]:
# Calculate bidirectional RNN model results
model_4_results = calculate_results(val_labels, model_4_preds)
model_4_results

{'accuracy': 0.7716535433070866,
 'precision': 0.7728196127186888,
 'recall': 0.7716535433070866,
 'f1': 0.7698331286570831}

In [125]:
# Check to see how the bidirectional model performs against the baseline
compare_baseline_to_new_results(baseline_results, model_4_results)

Baseline accuracy: 0.79, New accuracy: 0.77, Difference: -0.02
Baseline precision: 0.81, New precision: 0.77, Difference: -0.04
Baseline recall: 0.79, New recall: 0.77, Difference: -0.02
Baseline f1: 0.79, New f1: 0.77, Difference: -0.02


 ### Model_5: Conv1D

In [None]:
from tensorflow.keras import layersembedding 


In [128]:
# Test out the embedding, 1D convolutional and max pooling
embedding_test = embedding(text_vectorizer(["this is a test sentence"])) # turn target sentence into embedding
conv_1d = layers.Conv1D(filters=32, kernel_size=5, activation="relu") # convolve over target sequence 5 words at a time
conv_1d_output = conv_1d(embedding_test) # pass embedding through 1D convolutional layer
max_pool = layers.GlobalMaxPool1D() 
max_pool_output = max_pool(conv_1d_output) # get the most important features
embedding_test.shape, conv_1d_output.shape, max_pool_output.shape

(TensorShape([1, 15, 128]), TensorShape([1, 11, 32]), TensorShape([1, 32]))