<a href="https://colab.research.google.com/github/SheshamJoseph/Deep-Learning-with-Tensorflow-ZTM/blob/main/08_natural_Language_processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing with TensorFlow

The main goal of natural language processing (NLP) is to derive information from natural language. It covers both text and speech, with applications such as email spam classification, Twitter sentiment analysis, and machine translation.

In this section we will cover:
* Downloading text data
* Visualizing it
* Converting text into numbers using tokenization
* Modelling a text dataset
  * Starting with a baseline (TF-IDF)
  * Building deep learning text models
    * Dense, LSTM, GRU, Conv1D, Transfer Learning
* Comparing the performance of each model
* Combining models into an ensemble
* Saving and loading a pretrained model
* Finding the most wrong predictions

In [1]:
# import datetime
# print(f"Notebook last run (end-to-end): {datetime.datetime.now()}")

## Check for GPU

In [2]:
!nvidia-smi

/bin/bash: line 1: nvidia-smi: command not found


## Fetch helper functions

In [3]:
!wget https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py

--2024-07-09 13:31:07--  https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10246 (10K) [text/plain]
Saving to: ‘helper_functions.py’


2024-07-09 13:31:07 (71.1 MB/s) - ‘helper_functions.py’ saved [10246/10246]



In [4]:
# Import series of helper functions for the notebook
from helper_functions import unzip_data, create_tensorboard_callback, plot_loss_curves, compare_historys

## Download text dataset
We'll start by using the [Real or Not?](https://www.kaggle.com/c/nlp-getting-started/data) dataset from Kaggle, which contains text-based Tweets about natural disasters.

In [5]:
!wget https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip

# unzip data
unzip_data("nlp_getting_started.zip")

--2024-07-09 13:31:20--  https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 142.251.179.207, 64.233.180.207, 142.251.16.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.251.179.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 607343 (593K) [application/zip]
Saving to: ‘nlp_getting_started.zip’


2024-07-09 13:31:20 (23.6 MB/s) - ‘nlp_getting_started.zip’ saved [607343/607343]



## Visualizing the text dataset

In [6]:
import pandas as pd
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

train_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [7]:
# shuffle the training data
train_df_shuffled = train_df.sample(frac=1, random_state=42)
train_df_shuffled.head()

Unnamed: 0,id,keyword,location,text,target
2644,3796,destruction,,So you have a new weapon that can cause un-ima...,1
2227,3185,deluge,,The f$&amp;@ing things I do for #GISHWHES Just...,0
5448,7769,police,UK,DT @georgegalloway: RT @Galloway4Mayor: ÛÏThe...,1
132,191,aftershock,,Aftershock back to school kick off was great. ...,0
6845,9810,trauma,"Montgomery County, MD",in response to trauma Children of Addicts deve...,0


In [8]:
test_df.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


In [9]:
# check how many examples for each class
train_df.target.value_counts()

target
0    4342
1    3271
Name: count, dtype: int64

In [10]:
len(train_df), len(test_df)

(7613, 3263)

## Visualize some of the text

In [11]:
import random
random_index = random.randint(0, len(train_df)-1)
for row in train_df_shuffled[["text", "target"]][random_index:random_index+5].itertuples():
    _, text, target = row
    print(f"Target : {target}", "(real disaster)" if target > 0 else "(not real disaster)")
    print(f"Text :\n{text}\n")
    print("-----------------------\n")

Target : 0 (not real disaster)
Text :
#np agalloch - the desolation song

-----------------------

Target : 0 (not real disaster)
Text :
New Ladies Shoulder Tote Handbag Women Cross Body Bag Faux Leather Fashion Purse - Full reÛ_ http://t.co/BLAAWHYScT http://t.co/dDR0zjXVQN

-----------------------

Target : 1 (real disaster)
Text :

-----------------------

Target : 1 (real disaster)
Text :
Kach was a group to which belonged Baruch Goldstein a mass murderer who in 1994 shot and killed 29 PalestinianÛ_ http://t.co/bXGNQ57xvb

-----------------------

Target : 1 (real disaster)
Text :
Rly tragedy in MP: Some live to recount horror: ÛÏWhen I saw coaches of my train plunging into water I called ... http://t.co/CaR5QEUVHH

-----------------------



In [12]:
# split data into training and validation sets
from sklearn.model_selection import train_test_split

train_sentences, val_sentences, train_labels, val_labels = train_test_split(train_df_shuffled["text"].to_numpy(),
                                                                            train_df_shuffled["target"].to_numpy(),
                                                                            test_size=0.2,
                                                                            random_state=42)

In [13]:
# Check the lengths
len(train_sentences), len(train_labels), len(val_sentences), len(val_labels)

(6090, 6090, 1523, 1523)

## Converting text to numbers

In [14]:
# carrying out text vectorization
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

max_vocab_length = 10000
max_length = 15

text_vectorizer = TextVectorization(
    max_tokens=max_vocab_length,
    output_mode='int',
    output_sequence_length=max_length
)


In [15]:
# Fit the text vectorizer to the training text
text_vectorizer.adapt(train_sentences)

# create a sample sentence and tokenize it
sample_sentence = "There's a fire in my living room!"
text_vectorizer([sample_sentence])

<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[ 240,    3,   44,    4,   13, 1169, 1080,    0,    0,    0,    0,
           0,    0,    0,    0]])>

In [16]:
# Get the unique words in the vocabulary
words_in_vocab = text_vectorizer.get_vocabulary()
top_5_words = words_in_vocab[:5] # most common tokens (notice the [UNK] token for "unknown" words)
bottom_5_words = words_in_vocab[-5:] # least common tokens
print(f"Number of words in vocab: {len(words_in_vocab)}")
print(f"Top 5 most common words: {top_5_words}")
print(f"Bottom 5 least common words: {bottom_5_words}")

Number of words in vocab: 10000
Top 5 most common words: ['', '[UNK]', 'the', 'a', 'in']
Bottom 5 least common words: ['minded', 'mindblowing', 'milne', 'milledgeville', 'millcityio']
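
The integer encoding above can be approximated in plain Python. The sketch below lowercases text, strips punctuation, ranks words by frequency, reserves index 0 for padding and index 1 for unknown words, then pads or truncates to a fixed length. The helper names (`standardize`, `build_vocab`, `vectorize`) and the toy corpus are hypothetical, and this only approximates the idea; it is not how `TextVectorization` is implemented.

```python
import re
import string
from collections import Counter

def standardize(text):
    """Lowercase the text and strip ASCII punctuation."""
    return re.sub(f"[{re.escape(string.punctuation)}]", "", text.lower())

def build_vocab(sentences, max_tokens):
    """Rank words by frequency; index 0 is padding, index 1 is unknown."""
    counts = Counter(word for s in sentences for word in standardize(s).split())
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def vectorize(sentence, vocab, sequence_length):
    """Map words to indices, then truncate or zero-pad to sequence_length."""
    index = {word: i for i, word in enumerate(vocab)}
    ids = [index.get(word, 1) for word in standardize(sentence).split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

# Hypothetical toy data, not taken from the Tweets dataset
toy_corpus = ["Fire in the forest", "the flood in the city", "Fire alarm!"]
toy_vocab = build_vocab(toy_corpus, max_tokens=8)
toy_ids = vectorize("There's a fire in the kitchen", toy_vocab, sequence_length=8)
print(toy_vocab)
print(toy_ids)
```

Words that never appeared in the toy corpus ("theres", "a", "kitchen") map to index 1, and the unused trailing positions are filled with 0, mirroring the zero-padding visible in the tensor above.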


## Creating an embedding layer

In [17]:
from tensorflow.keras import layers

tf.random.set_seed(42)
embedding = layers.Embedding(input_dim=max_vocab_length,
                             output_dim=128,
                             embeddings_initializer='uniform',
                             input_length=max_length,
                             name='embedding_1')

embedding

<keras.src.layers.core.embedding.Embedding at 0x7eda6181a5c0>

In [18]:
# Make a sample embedding
sample_embed = embedding(text_vectorizer([sample_sentence]))
sample_embed

<tf.Tensor: shape=(1, 15, 128), dtype=float32, numpy=
array([[[-0.04315957,  0.00098763, -0.01503975, ...,  0.00560836,
         -0.01111202, -0.03593566],
        [ 0.01508427, -0.02151479,  0.02530565, ...,  0.01069943,
         -0.02532423,  0.00253395],
        [ 0.04744754,  0.0166708 ,  0.00908464, ..., -0.03297221,
          0.01442195,  0.02773907],
        ...,
        [-0.043354  ,  0.03913673,  0.04365997, ..., -0.00332902,
          0.00191905, -0.01499425],
        [-0.043354  ,  0.03913673,  0.04365997, ..., -0.00332902,
          0.00191905, -0.01499425],
        [-0.043354  ,  0.03913673,  0.04365997, ..., -0.00332902,
          0.00191905, -0.01499425]]], dtype=float32)>
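
Conceptually, an embedding layer is a lookup table with one trainable row per token index. The sketch below illustrates that idea in plain Python with a deliberately tiny table; the function names and sizes are hypothetical, and the ±0.05 range is an assumption based on the default range of Keras's `'uniform'` initializer.

```python
import random

def make_embedding_table(vocab_size, embedding_dim, seed=42):
    """One row of small random floats per token index (updated during training)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
            for _ in range(vocab_size)]

def embed(token_ids, table):
    """Replace each integer token with its row from the table."""
    return [table[i] for i in token_ids]

table = make_embedding_table(vocab_size=20, embedding_dim=4)
token_ids = [7, 3, 12, 0, 0, 0]  # three words followed by zero-padding
embedded = embed(token_ids, table)

print(len(embedded), len(embedded[0]))            # sequence length, embedding dimension
print(embedded[3] == embedded[4] == embedded[5])  # padding positions share one vector
```

This is why the last rows of `sample_embed` above are identical: every padded position holds token 0, so each one looks up the same row.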

## Modelling a text dataset
Here we will build several models and compare them to see which one performs best. The models we will build are:
* Model 0: Naive Bayes (baseline)
* Model 1: Feed-forward neural network (dense model)
* Model 2: LSTM model
* Model 3: GRU model
* Model 4: Bidirectional-LSTM model
* Model 5: 1D Convolutional Neural Network
* Model 6: TensorFlow Hub Pretrained Feature Extractor
* Model 7: Same as model 6 with 10% of training data

### Model_0 (baseline)
To create our baseline model, we'll build a Scikit-Learn Pipeline that uses TF-IDF (term frequency-inverse document frequency) to convert words to numbers and then models them with the Multinomial Naive Bayes algorithm.

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

model_0 = Pipeline([
    ('tfidf', TfidfVectorizer()),  # convert words to numbers using TF-IDF
    ('clf', MultinomialNB())       # classify the resulting vectors
])

model_0.fit(train_sentences, train_labels)

In [20]:
baseline_score = model_0.score(val_sentences, val_labels)
print(f"Baseline model accuracy : {baseline_score*100:.2f}%")

Baseline model accuracy : 79.91%


In [21]:
# make predictions
baseline_preds = model_0.predict(val_sentences)
baseline_preds[:10]

array([1, 1, 1, 0, 0, 1, 1, 1, 1, 0])
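
To see what the TF-IDF step of the baseline does, here is a minimal pure-Python sketch. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation, which is my understanding of Scikit-Learn's defaults. The `tfidf` function and the toy documents are hypothetical, and the whitespace tokenizer is a simplification of what `TfidfVectorizer` does.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed idf: ln((1 + n) / (1 + df)) + 1
        weights = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

# Hypothetical toy documents
toy_docs = ["forest fire near town", "fire drill at school", "school is out"]
toy_vectors = tfidf(toy_docs)
print(toy_vectors[0])
```

In the first toy document, "fire" receives a lower weight than "forest" because it also appears in another document, which is the intuition behind down-weighting common words.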

### Creating an evaluation function for model experiments

In [22]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def calculate_results(y_true, y_pred):
    """
    Calculates a model's accuracy, precision, recall and f1-score
    Returns a dictionary of accuracy, precision, recall, f1-score
    """
    model_accuracy = accuracy_score(y_true, y_pred) * 100
    model_precision, model_recall, model_f1, _ = precision_recall_fscore_support(y_true, y_pred, average='weighted')
    model_results = {
        "accuracy": model_accuracy,
        "precision": model_precision,
        "recall": model_recall,
        "f1-score": model_f1
    }
    return model_results

In [23]:
baseline_results = calculate_results(val_labels, baseline_preds)
baseline_results

{'accuracy': 79.9080761654629,
 'precision': 0.8146358812834972,
 'recall': 0.799080761654629,
 'f1-score': 0.7920155324845473}
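
The `average='weighted'` setting computes each metric per class and then averages them using the number of true examples in each class as weights. The sketch below reproduces that calculation in plain Python on hypothetical toy labels; the function name is illustrative and this is not Scikit-Learn's implementation.

```python
from collections import Counter

def weighted_precision_recall_f1(y_true, y_pred):
    """Per-class precision/recall/F1, averaged with class support as weights."""
    support = Counter(y_true)
    totals = {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        weight = n_true / len(y_true)
        totals["precision"] += weight * precision
        totals["recall"] += weight * recall
        totals["f1"] += weight * f1
    return totals

# Hypothetical toy labels: 3 positives, 5 negatives
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 0, 0, 0, 0, 1, 0, 0]
toy_scores = weighted_precision_recall_f1(toy_true, toy_pred)
toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
print(toy_scores)
print(toy_accuracy)
```

Support-weighted recall always equals accuracy, which is why the `recall` value in the results above matches `accuracy` divided by 100.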

### Model_1: Simple dense model

In [24]:
from helper_functions import create_tensorboard_callback
from tensorflow.keras import layers

SAVE_DIR = 'model_logs'

# using  the functional api to create the model
inputs = layers.Input(shape=(1,), dtype='string')
x = text_vectorizer(inputs)
x = embedding(x)
x = layers.GlobalAveragePooling1D()(x)
outputs = layers.Dense(1, activation='sigmoid')(x)
model_1 = tf.keras.Model(inputs, outputs, name='model_1_dense')
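
The pooling layer in this model collapses the sequence axis by averaging, turning one vector per token into a single vector per Tweet. A minimal sketch of that operation on a hypothetical toy sequence follows. It assumes no masking, so padded positions are included in the average, which I believe matches the embedding layer above since it does not set `mask_zero`.

```python
def global_average_pool_1d(sequence):
    """Average a (timesteps, features) list of vectors over the timestep axis."""
    n_steps = len(sequence)
    n_features = len(sequence[0])
    return [sum(step[j] for step in sequence) / n_steps for j in range(n_features)]

# Hypothetical input: 3 timesteps with 2 features each
toy_sequence = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
pooled = global_average_pool_1d(toy_sequence)
print(pooled)  # one value per feature: [2.0, 20.0]
```

Applied to a `(15, 128)` embedded Tweet, the same operation yields a single 128-dimensional vector, which the final `Dense` layer then maps to one probability.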

In [25]:
# compile
model_1.compile(loss=tf.keras.losses.BinaryCrossentropy(),
                optimizer=tf.keras.optimizers.Adam(),
                metrics=['accuracy'])

In [26]:
model_1.summary()

Model: "model_1_dense"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization (TextVe  (None, 15)                0         
 ctorization)                                                    
                                                                 
 embedding_1 (Embedding)     (None, 15, 128)           1280000   
                                                                 
 global_average_pooling1d (  (None, 128)               0         
 GlobalAveragePooling1D)                                         
                                                                 
 dense (Dense)               (None, 1)                 129       
                                                                 
Total params: 1280129 (4.88 MB)
Trainable params: 128

In [27]:
# Fit
model_1_history = model_1.fit(train_sentences,
                              train_labels,
                              epochs=5,
                              validation_data=(val_sentences, val_labels),
                              callbacks=[create_tensorboard_callback(SAVE_DIR, 'model_1_dense')])

Saving TensorBoard log files to: model_logs/model_1_dense/20240709-133125
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [28]:
model_1.evaluate(val_sentences, val_labels)



[0.475711852312088, 0.7925148010253906]

In [29]:
# get model_1's weights
embed_weights = model_1.get_layer('embedding_1').get_weights()[0]
print(embed_weights.shape)

(10000, 128)


In [30]:
# make predictions
model_1_pred_probs = model_1.predict(val_sentences)
model_1_pred_probs[:10]



array([[0.41732004],
       [0.8769269 ],
       [0.9969989 ],
       [0.13979489],
       [0.09902692],
       [0.94456834],
       [0.9663556 ],
       [0.9927332 ],
       [0.9424975 ],
       [0.30259675]], dtype=float32)

In [31]:
# turn model predictions into a single-dimension tensor
model_1_preds = tf.squeeze(tf.round(model_1_pred_probs))
model_1_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 1., 1., 0., 0., 1., 1., 1., 1., 0.], dtype=float32)>

In [32]:
# get model_1 metrics
model_1_results = calculate_results(val_labels, model_1_preds)
model_1_results

{'accuracy': 79.25147734734077,
 'precision': 0.794219865754067,
 'recall': 0.7925147734734077,
 'f1-score': 0.789793113639757}

In [33]:
# compare model_1 to baseline
import numpy as np
np.array(list(model_1_results.values())) > np.array(list(baseline_results.values()))

array([False, False, False, False])

In [34]:
# create helper function to do this comparison
def compare_baseline_to_model(new_model_results, baseline_results=baseline_results):
    for key, value in baseline_results.items():
        print(f"Baseline {key}: {value:.2f}, New {key}: {new_model_results[key]:.2f}, Difference: {new_model_results[key]-value:.2f}")

## Recurrent Neural Networks

### Model_2: LSTM

In [35]:
tf.random.set_seed(42)

# create new embedding layer
model_2_embedding = layers.Embedding(input_dim=max_vocab_length,
                                     output_dim=128,
                                     embeddings_initializer='uniform',
                                     input_length=max_length,
                                     name='embedding_1')

In [36]:
# create LSTM
inputs = layers.Input(shape=(1,), dtype='string')
x = text_vectorizer(inputs)
x = model_2_embedding(x)
# x = layers.LSTM(64, return_sequences=True)(x)  # return a vector for each word in the Tweet
x = layers.LSTM(64)(x)
outputs = layers.Dense(1, activation='sigmoid')(x)
model_2 = tf.keras.Model(inputs, outputs, name='model_2_lstm')

In [37]:
model_2.compile(loss=tf.keras.losses.BinaryCrossentropy(),
                optimizer=tf.keras.optimizers.Adam(),
                metrics=['accuracy'])

In [38]:
model_2.summary()

Model: "model_2_lstm"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization (TextVe  (None, 15)                0         
 ctorization)                                                    
                                                                 
 embedding_1 (Embedding)     (None, 15, 128)           1280000   
                                                                 
 lstm (LSTM)                 (None, 64)                49408     
                                                                 
 dense_1 (Dense)             (None, 1)                 65        
                                                                 
Total params: 1329473 (5.07 MB)
Trainable params: 1329473 (5.07 MB)
Non-trainable params: 0 (0.00 Byte)
________________

In [39]:
model_2_history = model_2.fit(train_sentences,
                              train_labels,
                              epochs=5,
                              validation_data=(val_sentences, val_labels),
                              callbacks=[create_tensorboard_callback(SAVE_DIR, 'model_2_lstm')])

Saving TensorBoard log files to: model_logs/model_2_lstm/20240709-133210
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [40]:
model_2.evaluate(val_sentences, val_labels)



[0.7871161103248596, 0.7721602320671082]

In [41]:
model_2_pred_probs = model_2.predict(val_sentences)
model_2_pred_probs[:10]



array([[0.02479582],
       [0.9610905 ],
       [0.9988289 ],
       [0.02040034],
       [0.00288349],
       [0.9989458 ],
       [0.9361678 ],
       [0.9996759 ],
       [0.99798644],
       [0.06923723]], dtype=float32)

In [42]:
model_2_preds = tf.squeeze(tf.round(model_2_pred_probs))
model_2_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 1., 1., 0., 0., 1., 1., 1., 1., 0.], dtype=float32)>

In [43]:
model_2_results = calculate_results(val_labels, model_2_preds)
model_2_results

{'accuracy': 77.21602101116218,
 'precision': 0.7738605376654213,
 'recall': 0.7721602101116218,
 'f1-score': 0.7687081060333852}

In [44]:
compare_baseline_to_model(model_2_results)

Baseline accuracy: 79.91, New accuracy: 77.22, Difference: -2.69
Baseline precision: 0.81, New precision: 0.77, Difference: -0.04
Baseline recall: 0.80, New recall: 0.77, Difference: -0.03
Baseline f1-score: 0.79, New f1-score: 0.77, Difference: -0.02


### Model_3: GRU

In [46]:
# set random seed and create embedding
tf.random.set_seed(42)
from tensorflow.keras import layers

model_3_embedding = layers.Embedding(input_dim=max_vocab_length,
                                     output_dim=128,
                                     embeddings_initializer='uniform',
                                     input_length=max_length,
                                     name='embedding_3')

In [49]:
# build an RNN using GRU
inputs = layers.Input(shape=(1,), dtype='string')
x = text_vectorizer(inputs)
x = model_3_embedding(x)
# x = layers.GRU(64, return_sequences=True)(x)  # stacking recurrent GRU layers requires return_sequences=True
x = layers.GRU(64)(x)
# x = layers.Dense(64, activation='relu')(x)  # optional dense layer
outputs = layers.Dense(1, activation='sigmoid')(x)
model_3 = tf.keras.Model(inputs, outputs, name='model_3_gru')

In [50]:
model_3.compile(loss=tf.keras.losses.BinaryCrossentropy(),
                optimizer=tf.keras.optimizers.Adam(),
                metrics=['accuracy'])

model_3.summary()

Model: "model_3_gru"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_3 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization (TextVe  (None, 15)                0         
 ctorization)                                                    
                                                                 
 embedding_3 (Embedding)     (None, 15, 128)           1280000   
                                                                 
 gru (GRU)                   (None, 64)                37248     
                                                                 
 dense_2 (Dense)             (None, 1)                 65        
                                                                 
Total params: 1317313 (5.03 MB)
Trainable params: 1317313 (5.03 MB)
Non-trainable params: 0 (0.00 Byte)
_________________

In [51]:
model_3_history = model_3.fit(train_sentences,
                              train_labels,
                              epochs=5,
                              validation_data=(val_sentences, val_labels),
                              callbacks=[create_tensorboard_callback(SAVE_DIR, 'model_3_gru')])


Saving TensorBoard log files to: model_logs/model_3_gru/20240709-135637
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [52]:
model_3.evaluate(val_sentences, val_labels)



[0.6827343106269836, 0.787261962890625]

In [53]:
model_3_pred_probs = model_3.predict(val_sentences)
model_3_pred_probs[:10]



array([[0.2927048 ],
       [0.993047  ],
       [0.99942476],
       [0.03573595],
       [0.00558204],
       [0.99761134],
       [0.6197026 ],
       [0.9995598 ],
       [0.9959177 ],
       [0.20512453]], dtype=float32)

In [54]:
model_3_preds = tf.squeeze(tf.round(model_3_pred_probs))
model_3_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 1., 1., 0., 0., 1., 1., 1., 1., 0.], dtype=float32)>

In [56]:
model_3_results = calculate_results(val_labels, model_3_preds)
model_3_results

{'accuracy': 78.72619829284308,
 'precision': 0.7887763804833361,
 'recall': 0.7872619829284307,
 'f1-score': 0.784471420314181}

In [57]:
compare_baseline_to_model(model_3_results)

Baseline accuracy: 79.91, New accuracy: 78.73, Difference: -1.18
Baseline precision: 0.81, New precision: 0.79, Difference: -0.03
Baseline recall: 0.80, New recall: 0.79, Difference: -0.01
Baseline f1-score: 0.79, New f1-score: 0.78, Difference: -0.01


### Model_4: Bidirectional RNN model

In [58]:
tf.random.set_seed(42)

model_4_embedding = layers.Embedding(input_dim=max_vocab_length,
                                     output_dim=128,
                                     embeddings_initializer='uniform',
                                     input_length=max_length,
                                     name='embedding_4')

In [59]:
# create bidirectional RNN
inputs = layers.Input(shape=(1,), dtype='string')
x = text_vectorizer(inputs)
x = model_4_embedding(x)
# x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)  # needed when stacking recurrent layers
x = layers.Bidirectional(layers.LSTM(64))(x)
outputs = layers.Dense(1, activation='sigmoid')(x)
model_4 = tf.keras.Model(inputs, outputs, name='model_4_bidirectional')

In [60]:
model_4.compile(loss=tf.keras.losses.BinaryCrossentropy(),
                optimizer=tf.keras.optimizers.Adam(),
                metrics=['accuracy'])

In [61]:
model_4.summary()

Model: "model_4_bidirectional"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_4 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization (TextVe  (None, 15)                0         
 ctorization)                                                    
                                                                 
 embedding_4 (Embedding)     (None, 15, 128)           1280000   
                                                                 
 bidirectional (Bidirection  (None, 128)               98816     
 al)                                                             
                                                                 
 dense_3 (Dense)             (None, 1)                 129       
                                                                 
Total params: 1378945 (5.26 MB)
Trainable par
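
The parameter counts reported in the model summaries can be checked by hand. The sketch below uses the standard formulas for embedding, dense, LSTM and GRU layers; the helper functions are hypothetical, and the GRU formula assumes two bias vectors per gate, which is my understanding of the Keras default (`reset_after=True`).

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding_dim-sized row per token in the vocabulary
    return vocab_size * embedding_dim

def dense_params(input_dim, units):
    # Weights plus one bias per unit
    return input_dim * units + units

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)

def gru_params(input_dim, units):
    # 3 gates, each with input weights, recurrent weights and two biases (assumed)
    return 3 * ((input_dim + units) * units + 2 * units)

print(embedding_params(10000, 128))               # 1280000
print(lstm_params(128, 64))                       # 49408 (model_2 LSTM)
print(gru_params(128, 64))                        # 37248 (model_3 GRU)
print(2 * lstm_params(128, 64))                   # 98816 (model_4, forward + backward LSTM)
print(dense_params(128, 1), dense_params(64, 1))  # 129 65
```

The bidirectional layer has twice the parameters of a single LSTM because it runs one LSTM forwards and another backwards, and concatenating their 64-unit outputs explains the `(None, 128)` output shape.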

In [62]:
model_4_history = model_4.fit(train_sentences,
                              train_labels,
                              epochs=5,
                              validation_data=(val_sentences, val_labels),
                              callbacks=[create_tensorboard_callback(SAVE_DIR, 'model_4_bidirectional')])

Saving TensorBoard log files to: model_logs/model_4_bidirectional/20240709-141908
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [63]:
model_4.evaluate(val_sentences, val_labels)



[0.8263726830482483, 0.770190417766571]