**Natural Language Processing with TensorFlow**

- NLP problems are also called **Sequence Problems** because data is presented in a Sequence.

Natural Language covers
- Text (Such as email, blog post, book, Tweet)
- Speech (a conversation you have with a doctor, voice commands you give to a smart speaker)

- One **use case** is scanning incoming emails to decide whether they are spam or not (classification)
- Another **use case** is analyzing feedback and complaints to find which segment of the business each one is about

Both of the above deal with *sequences*. You might come across the term **seq2seq** (sequence-to-sequence), which means using the information in one sequence to produce another sequence.

A typical workflow in NLP is

*Text --> Turn into numbers --> Build a model --> Train the model --> use patterns to make predictions*

**What we are going to cover**

- Getting data
- Visualizing text
- Converting text into numbers using tokenization
- Turning our tokenized text into an embedding
- Modelling a text dataset
    - Starting with a baseline (TF-IDF)
    - Building deep learning models like
       - LSTM, GRU, Conv1D, Transfer Learning
- Comparing the performance of each of our models
- Combining the models into an **ensemble**
- Saving and Loading a **Trained Model**
- Finding the most wrong predictions




**Download the Helper Functions**

In [1]:
# Download helper functions script
!wget https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py

--2025-10-09 11:07:12--  https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10246 (10K) [text/plain]
Saving to: ‘helper_functions.py’


2025-10-09 11:07:12 (29.3 MB/s) - ‘helper_functions.py’ saved [10246/10246]



**Import the helper functions**

In [2]:
# Import series of helper functions for the notebook
from helper_functions import unzip_data, create_tensorboard_callback, plot_loss_curves, compare_historys

**Download the Text Dataset**

We will be using the **Real or Not** dataset from **Kaggle**, which contains **text-based Tweets** about natural disasters.


The Real Tweets are actually about disasters, for example:

*Jetstar and Virgin forced to cancel Bali flights again because of ash from Mount Raung volcano*


The Not Real Tweets are Tweets not about disasters (they can be on anything), for example:

*Education is the most powerful weapon which you can use to change the world.Nelson #Mandela #quote*


In [3]:
# Download data (same as from Kaggle)
!wget "https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip"

# Unzip data
unzip_data("nlp_getting_started.zip")

--2025-10-09 11:07:17--  https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 142.250.125.207, 173.194.64.207, 209.85.200.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.250.125.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 607343 (593K) [application/zip]
Saving to: ‘nlp_getting_started.zip’


2025-10-09 11:07:17 (133 MB/s) - ‘nlp_getting_started.zip’ saved [607343/607343]



**Unzipping Files**

Unzipping gives us the following files:
- **sample_submission.csv:-** An example of the file that you would submit to the Kaggle competition
- **train.csv:-** training samples of real and not real disaster Tweets
- **test.csv:-** testing samples of real and not real disaster Tweets


**Visualizing a Text Dataset**

In [4]:
import pandas as pd

train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")
train_df.head(5)

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [5]:
# Shuffle training dataframe
train_df_shuffled = train_df.sample(frac=1, random_state=42)
# shuffle with random_state=42 for reproducibility
train_df_shuffled.head()

Unnamed: 0,id,keyword,location,text,target
2644,3796,destruction,,So you have a new weapon that can cause un-ima...,1
2227,3185,deluge,,The f$&amp;@ing things I do for #GISHWHES Just...,0
5448,7769,police,UK,DT @georgegalloway: RT @Galloway4Mayor: ÛÏThe...,1
132,191,aftershock,,Aftershock back to school kick off was great. ...,0
6845,9810,trauma,"Montgomery County, MD",in response to trauma Children of Addicts deve...,0


In the **training data** we have a **target** column, and we will try to predict it from the text.
The **test dataset** does not have a **target** column.

Inputs (text column) -> Machine Learning Algorithm -> Outputs (target column)

In [6]:
test_df.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


In [7]:
train_df.target.value_counts()

target,count
0,4342
1,3271


From the above we can see that we are dealing with a **binary classification** problem.
The dataset is also fairly balanced: about 57% of the samples belong to the negative class and about 43% to the positive class.

Where,
- 1 = a real disaster Tweet
- 0 = not a real disaster Tweet
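The class balance can be checked with a few lines of arithmetic. This is a minimal sketch that hard-codes the two counts from the `value_counts()` output above; the variable names are illustrative only.

```python
# Class counts copied from the value_counts() output above
class_counts = {0: 4342, 1: 3271}
total = sum(class_counts.values())

# Share of each class as a percentage of all training samples
class_share = {label: 100 * count / total for label, count in class_counts.items()}

for label, share in class_share.items():
    print(f"target={label}: {share:.1f}% of {total} samples")
```

A split of roughly 57/43 is close enough to even that accuracy remains a reasonable first metric.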

In [8]:
# How many samples total?
print(f"Total training samples: {len(train_df)}")
print(f"Total test samples: {len(test_df)}")
print(f"Total samples: {len(train_df) + len(test_df)}")

Total training samples: 7613
Total test samples: 3263
Total samples: 10876


**Train/Test Split**

We have an abundance of test samples: 3,263 of 10,876 (about 30%), whereas an 80/20 split is the usual recommendation.

**Question:** Why visualize random samples? You could visualize samples in order but this could lead to only seeing a certain subset of data. Better to visualize a substantial quantity (100+) of random samples to get an idea of the different kinds of data you're working with. In machine learning, never underestimate the power of randomness.

In [9]:
import random
random_index = random.randint(0, len(train_df)-5) # create random indexes not higher than the total number of samples

'''
train_df_shuffled[['text','target']]: Get only the "text" and "target" column
[random_index:random_index+5]: Get only five rows starting from the random index
.itertuples(): Iterate over the rows as namedtuples.
The return is a named tuple like
Pandas(Index=100, text="Fire in the building", target=1)
Pandas(Index=101, text="Lovely weather today", target=0)
We get these results in the row variable
_: is used to get the index
text: is used to get the text
target: is used to get the target (0 or 1)
'''


for row in train_df_shuffled[['text', 'target']][random_index:random_index+5].itertuples():
  _, text, target = row
  print(f"Target: {target}", "(real disaster)" if target > 0 else "(not real disaster)")
  print(f"Text:\n{text}\n")
  print("---\n")

Target: 1 (real disaster)
Text:
The majority of those killed were civilians on the ground after the jet first bombed the city's main street then dramatically plummeted

---

Target: 0 (not real disaster)
Text:
@NEPD_Loyko Texans hope you are wrong. Radio in Houston have him as starter after Foster injury

---

Target: 0 (not real disaster)
Text:
Even then our words slip and souls coincide Finer than subatomic spells Just as we collide http://t.co/2WcbrgN62J

---

Target: 1 (real disaster)
Text:
Maj Muzzamil Pilot Offr of MI-17 crashed near Mansehra today. May Almighty give strength to family to bear the loss http://t.co/EI1K01zAb3

---

Target: 0 (not real disaster)
Text:
Ignition Knock (Detonation) Sensor-Senso Standard KS57 http://t.co/bzZdeDcthL http://t.co/OQJNUyIBxM

---



**Split data into Training and Validation Sets**

- The test data has no **labels** and we need a way to evaluate the model, so we split the **training data** into a **training set** and a **validation set.**
- The model trains on the **training set** and its performance is checked on the unseen **validation set.**

In [10]:
from sklearn.model_selection import train_test_split

'''
Each column is converted to a NumPy array.
train_df_shuffled["text"] is a Pandas Series
(basically a 1-D labeled array with an index).
train_test_split accepts NumPy arrays, plain Python lists and
Pandas objects alike. If you pass a Pandas Series directly it works too,
but the returned splits are then Pandas Series (original index kept)
instead of NumPy arrays. Converting with .to_numpy() first gives us
plain arrays.
'''

# Use train_test_split to split training data into training and validation sets
train_sentences, val_sentences, train_labels, val_labels = train_test_split(train_df_shuffled["text"].to_numpy(),
                                                                            train_df_shuffled["target"].to_numpy(),
                                                                            test_size=0.1,
                                                                            random_state=42)


In [11]:
# Check the lengths
len(train_sentences), len(train_labels), len(val_sentences), len(val_labels)

(6851, 6851, 762, 762)
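The split sizes above follow directly from `test_size=0.1`. The sketch below reproduces them by hand, assuming the held-out count is rounded up and the remainder goes to training; that assumption matches the printed lengths.

```python
import math

n_samples = 7613   # rows in train.csv
test_size = 0.1    # fraction held out for validation

# Assumption: the held-out count is rounded up, the rest is used for training
n_val = math.ceil(test_size * n_samples)
n_train = n_samples - n_val

print(n_train, n_val)
```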

In [12]:
# View the first 10 training sentences and their labels
train_sentences[:10], train_labels[:10]

(array(['@mogacola @zamtriossu i screamed after hitting tweet',
        'Imagine getting flattened by Kurt Zouma',
        '@Gurmeetramrahim #MSGDoing111WelfareWorks Green S welfare force ke appx 65000 members har time disaster victim ki help ke liye tyar hai....',
        "@shakjn @C7 @Magnums im shaking in fear he's gonna hack the planet",
        'Somehow find you and I collide http://t.co/Ee8RpOahPk',
        '@EvaHanderek @MarleyKnysh great times until the bus driver held us hostage in the mall parking lot lmfao',
        'destroy the free fandom honestly',
        'Weapons stolen from National Guard Armory in New Albany still missing #Gunsense http://t.co/lKNU8902JE',
        '@wfaaweather Pete when will the heat wave pass? Is it really going to be mid month? Frisco Boy Scouts have a canoe trip in Okla.',
        'Patient-reported outcomes in long-term survivors of metastatic colorectal cancer - British Journal of Surgery http://t.co/5Yl4DC1Tqt'],
       dtype=object),
 array([0,

We now have a **training and validation set**.

**Converting Text into Numbers**

- We now have **training set and validation set** containing Tweets and Labels

**Question:** What is the most important step before we can use a **machine learning algorithm** with our text data?

**Answer** Turn the text into numbers.

*For any Machine Learning algorithm, the inputs need to be in numerical form.*

In NLP, there are two main concepts for turning text into numbers:

- **Tokenization:** A straight mapping from a word/character/sub-word to a numerical value. There are three main levels of tokenization:
   - **Word-level tokenization** with a sentence like "I love NLP" might result in "I" being "0", "love" being "1" and "NLP" being "2". Every word in the **sequence** is considered a single **token**.
   - **Character-level tokenization** converts characters to numbers, for example the letters A-Z to the values *1-26*. Every character in the sequence is considered a single **token**.
   - **Sub-word tokenization** sits between **word-level** and **character-level** tokenization. It involves breaking words into smaller parts and then converting those smaller parts to numbers. For example, "my favourite food is pineapple pizza" might become "my, fav, avour, rite, fo, oo, od, is, pin, ine, app, le, piz, za". These **sub-words** are then converted to numerical form.

- **Embeddings:** An embedding is a representation of natural language which can be learned. The **representation** takes the form of a **feature vector**. For example, the word "dance" could be represented by the 5-dimensional vector [-0.8547, 0.4559, -0.3332, 0.9877, 0.1112]. It's important to note that the size of the feature vector is tuneable. There are two ways to use embeddings:
  - **Create your own embedding:** Once your text has been turned into numbers (required for an embedding), you can pass them through an **embedding layer** such as **tf.keras.layers.Embedding** and an embedding representation will be learned during model training.
  - **Reuse a pre-learned embedding:** Many pre-trained embeddings exist online. These **pre-trained** embeddings have often been learned on large corpora of text and thus have a good **underlying representation of natural language.** We can use a **pre-trained embedding** to initialize our model and later **fine-tune** it to our own specific task.

  **Which level of tokenization should you use?** It mostly depends on your problem. You can try character-level or word-level tokenization/embeddings and choose whichever performs best.
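The word-level and character-level mappings described above can be sketched in plain Python. This is only an illustration of the idea (the helper `char_to_id` is a made-up name), not how any particular library assigns IDs.

```python
sentence = "I love NLP"

# Word-level: each new word gets the next free integer ID
word_to_id = {}
for word in sentence.split():
    word_to_id.setdefault(word, len(word_to_id))
word_tokens = [word_to_id[word] for word in sentence.split()]

# Character-level: letters a-z map to 1-26, anything else (e.g. spaces) maps to 0
def char_to_id(ch):
    ch = ch.lower()
    return ord(ch) - ord("a") + 1 if "a" <= ch <= "z" else 0

char_tokens = [char_to_id(ch) for ch in sentence]

print(word_tokens)   # one token per word
print(char_tokens)   # one token per character
```

Word-level tokenization gives short sequences but a large vocabulary; character-level tokenization gives a tiny vocabulary but much longer sequences.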


**Text Vectorization (tokenization)**

Creating *tokens* is the most important step. To tokenize our text, we will use the **tf.keras.layers.TextVectorization** layer (in older TensorFlow releases it lived under **tf.keras.layers.experimental.preprocessing**).

The **TextVectorization** layer takes the following parameters as input

- **max_tokens:-** Maximum number of words in your vocabulary (e.g. 20,000 or the *number of unique words in your text*). This count includes the **OOV (out of vocabulary)** token.
- **standardize:-** Method for standardizing text. Default is **lower_and_strip_punctuation**, which lowercases text and removes all punctuation marks.
- **split:-** How to split text. Default is **whitespace**, which splits text on spaces.
- **output_mode:-** How to output tokens. Can be **int (integer mapping)**, **multi_hot** (called **binary** in older versions; one entry per vocabulary word, set to 1 if the word is present), **count** or **tf_idf**.
- **output_sequence_length:-** Length of tokenized sequence to output. For example, if *output_sequence_length=150*, all tokenized sequences will be **150 tokens long.**
- **pad_to_max_tokens:-** Defaults to *False*. If *True*, the output feature axis is padded to *max_tokens* even if the *number of unique tokens* in the vocabulary is less than *max_tokens*. Only applies to the **multi_hot**, **count** and **tf_idf** output modes.


In [13]:
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

# Use the default TextVectorization arguments
text_vectorizer = TextVectorization(
                                    max_tokens=None, # how many words in the vocabulary (None = all of the different words in your text)
                                    standardize="lower_and_strip_punctuation", # how to process text
                                    split="whitespace", # how to split tokens
                                    ngrams=None, # create groups of n-words?
                                    output_mode="int", # how to map tokens to numbers
                                    output_sequence_length=None # how long should the output sequence of tokens be?
                                    # pad_to_max_tokens=True # Not valid if using max_tokens=None
                                    )



The **TextVectorization** object has been initialized with default settings, but let's customize it a little.

- We will customize the **max_tokens** and **output_sequence_length** arguments in particular.

- **max_tokens** (the number of words in the vocabulary) is commonly set to a multiple of 10,000 (such as 10,000, 20,000 or 30,000). For our case we will use **10,000**.

- **output_sequence_length** will be the average number of tokens per Tweet in the training set.





In [14]:
'''
The code snippet loops over each sentence in 'train_sentences',
splits the sentence on whitespace and
stores the number of words in a list,
giving a list like [5, 3, 7, 2, ...].
It then sums the list and divides by the total number of sentences.

The result is the average number of tokens per sentence.
'''
round(sum([len(i.split()) for i in train_sentences])/len(train_sentences))

15

In [15]:
max_vocab_length = 10000 # max number of words to have in our vocabulary
max_length = 15 # max length our sequences will be (e.g. how many words from a Tweet does our model see?)

text_vectorizer = TextVectorization(max_tokens=max_vocab_length,
                                    output_mode="int",
                                    output_sequence_length=max_length)

**Training Data Mapping**
**Text Vectorizer** is a special layer which learns **how to turn text(words) into numbers**

**What .adapt(train_sentences) does**

The **.adapt()** step is where the text_vectorizer actually learns the vocabulary.
It scans through all your **training sentences** to figure out **which words exist** and **how often they appear.**

For example,

train_sentences = ["I love pizza", "Pizza is delicious", "I hate cold pizza"]

- .adapt() reads all of those sentences
- and builds a vocabulary: a list of all unique words, such as

['', '[UNK]', 'pizza', 'i', 'love', 'is', 'delicious', 'hate', 'cold']

Index 0 (the empty string) is reserved for padding and index 1 ('[UNK]') for out-of-vocabulary words. Every other word is then assigned an integer:

'pizza' → 2
'i' → 3
'love' → 4
'is' → 5
...

Now, whenever you pass a new sentence to **text_vectorizer** it converts it automatically into a **sequence of numbers**

text_vectorizer(["I love pizza"])  ➜  [[3, 4, 2]]
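To make the vocabulary-building step concrete, here is a simplified pure-Python sketch of what adapting and then calling a vectorizer does. It is not the Keras implementation: the helpers `standardize`, `build_vocab` and `vectorize` are hypothetical, and breaking ties between equally frequent words alphabetically is an assumption, so the IDs of rare words may differ from the listing above.

```python
import string
from collections import Counter

def standardize(text):
    # Lowercase and strip punctuation
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocab(sentences, max_tokens=None):
    counts = Counter(word for s in sentences for word in standardize(s).split())
    # Most frequent words first; ties broken alphabetically (assumption)
    ordered = sorted(counts, key=lambda word: (-counts[word], word))
    vocab = ["", "[UNK]"] + ordered   # index 0 = padding, index 1 = unknown word
    return vocab[:max_tokens] if max_tokens else vocab

def vectorize(sentence, vocab, seq_len):
    lookup = {word: idx for idx, word in enumerate(vocab)}
    ids = [lookup.get(word, 1) for word in standardize(sentence).split()]
    ids = ids[:seq_len]                       # truncate long sequences
    return ids + [0] * (seq_len - len(ids))   # pad short sequences with 0

train = ["I love pizza", "Pizza is delicious", "I hate cold pizza"]
vocab = build_vocab(train)
vector = vectorize("I love hot pizza!", vocab, seq_len=6)

print(vocab)
print(vector)
```

The word "hot" never appeared in the training sentences, so it maps to index 1, and the two trailing zeros are padding.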


In [16]:
text_vectorizer.adapt(train_sentences)

Now the **training data** is mapped. Let's try the **text_vectorizer** on a custom sentence.

In [17]:
# Create sample sentence and tokenize it
sample_sentence = "There's a flood in my street!"
text_vectorizer([sample_sentence])

<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[264,   3, 232,   4,  13, 698,   0,   0,   0,   0,   0,   0,   0,
          0,   0]])>

Our **text** has been converted into **numbers**.
- Notice the 0's at the end of the returned tensor. This is because we set the output length to **15**, so any sequence we pass in is output with a length of **15**: shorter sequences are padded with 0's and longer ones are truncated.

In [18]:
# Choose a random sentence from the training dataset and tokenize it
random_sentence = random.choice(train_sentences)
print(f"Original text:\n{random_sentence}\
      \n\nVectorized version:")
text_vectorizer([random_sentence])

Original text:
It was finally demolished in the spring of 2013 and the property has sat vacant since. The justÛ_: saddlebrooke... http://t.co/b8n6e4rYvZ      

Vectorized version:


<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[  15,   23,  852,  606,    4,    2, 1110,    6, 1336,    7,    2,
         927,   41, 2721, 2622]])>

Let's check the **get_vocabulary()** method, which returns the learned vocabulary ordered from most to least frequent.

In [19]:
words_in_vocab = text_vectorizer.get_vocabulary()
top_5_words = words_in_vocab[:5]
bottom_5_words = words_in_vocab[-5:]
print(f"Number of words in vocab: {len(words_in_vocab)}")
print(f"Top 5 most common words: {top_5_words}")
print(f"Bottom 5 least common words: {bottom_5_words}")

Number of words in vocab: 10000
Top 5 most common words: ['', '[UNK]', np.str_('the'), np.str_('a'), np.str_('in')]
Bottom 5 least common words: [np.str_('pages'), np.str_('paeds'), np.str_('pads'), np.str_('padres'), np.str_('paddytomlinson1')]


**Creating an Embedding using an Embedding Layer**

- We now have a way of **mapping text to numbers**. We will go a step further and turn those **numbers into an embedding**

But why is an **embedding a necessary step**?

- After the **vectorization layer** our text sentences become something like
'I love pizza' ==> [4,6,2,0,0]. These are just word IDs, simple integers, and here is the problem: nothing about meaning can be deduced from them. We cannot tell that **pizza** and **burger** are similar or that **pizza** and **table** are not.

The model just sees different integers, like looking at phone numbers:

pizza = 125
burger = 89
table = 500

There is no relationship between these numbers.

**Enter the Embedding Layer**

- The embedding layer gives meaning to those numbers. It takes each **word ID** and converts it into a **dense vector of real numbers**: a small list of numbers that represents what the word means in context.

Word ID → Embedding Vector

2 (“pizza”)  → [0.12, -0.45, 0.88, 0.33, -0.02]

6 (“love”)   → [0.91,  0.11, 0.55, -0.23, 0.70]

4 (“I”)      → [0.02, -0.09, 0.30,  0.45, 0.10]

Now, instead of just having [4, 6, 2],
your sentence becomes something like:

[
 [0.02, -0.09, 0.30, 0.45, 0.10],

 [0.91,  0.11, 0.55, -0.23, 0.70],

 [0.12, -0.45, 0.88, 0.33, -0.02]
]

**Why is this powerful**
The embeddings help the model understand relationships between words:

- "king" and "queen" have similar vectors and differ mostly by gender dimensions

- "pizza" and "burger" appear in similar food contexts, so their vectors are close


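Embeddings also **reduce dimensionality**. Instead of a one-hot vector as long as the whole vocabulary (10,000 values in our case), each word is represented by a small dense vector of, for example, 16, 32 or 128 values.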

Embeddings are learned during training and you do not have to define the relationships yourself.
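One common way to measure how "close" two embedding vectors are is cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; real embeddings are learned and have many more dimensions.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, chosen by hand for this example
pizza = [0.90, 0.80, 0.10]
burger = [0.85, 0.75, 0.20]
table = [0.10, 0.20, 0.90]

food_similarity = cosine_similarity(pizza, burger)
other_similarity = cosine_similarity(pizza, table)

print(f"pizza vs burger: {food_similarity:.2f}")
print(f"pizza vs table:  {other_similarity:.2f}")
```

A value near 1 means the vectors point in almost the same direction; a value near 0 means they are largely unrelated.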

**Analogy:**

Think of it like this:

**Vectorization =** giving every word a roll number (just an ID).

**Embedding =** giving every student (word) a personality profile —
strengths, weaknesses, interests (numbers that describe meaning).


TextVectorization
Turns text into word IDs
“I love pizza” → [4, 6, 2]

Embedding
Turns word IDs into meaning-rich vectors
[4, 6, 2] → [[0.02, -0.09, 0.30...], ...]
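Under the hood, an embedding layer can be thought of as a lookup table with one row per vocabulary entry. The following sketch mimics that with a plain list of lists; the tiny sizes and the random values drawn from [-0.05, 0.05] are assumptions for illustration, and nothing is being trained here.

```python
import random

random.seed(42)

vocab_size = 10      # number of rows (one per token ID)
embedding_dim = 5    # number of values per token

# Randomly initialized table, standing in for a freshly created embedding layer
embedding_matrix = [
    [random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]

token_ids = [4, 6, 2]   # "I love pizza" after vectorization

# Embedding = pick the row that belongs to each token ID
embedded = [embedding_matrix[token_id] for token_id in token_ids]

print(len(embedded), len(embedded[0]))   # tokens x embedding_dim
```

During training, the rows of this table are the weights that get updated.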





**Using the tf.keras.layers.Embedding Layer**

- As discussed above, the **powerful thing about embeddings** is that they can be learned during training. Rather than being a static number (a token ID that never changes), a word's numeric representation can be improved as the model goes through data samples.

*We will see what an embedding layer looks like by using the **tf.keras.layers.Embedding** layer*


The main parameters we are concerned about here are:

- **input_dim -** The size of the vocabulary (e.g. len(text_vectorizer.get_vocabulary())).
- **output_dim -** The size of the **output embedding vector**, for example, a value of **100** outputs a feature vector of **size 100 for each word.**
- **embeddings_initializer -** How to initialize the embedding matrix. Default is **uniform**, which randomly initializes the embedding matrix with a **uniform distribution.** This can be changed to use **pre-learned embeddings.**
- **input_length -** Length of the sequences being passed to the *embedding layer.* Recent Keras versions (Keras 3) treat this argument as deprecated and it can simply be omitted.



In [20]:
tf.random.set_seed(42)
from tensorflow.keras import layers

embedding = layers.Embedding(input_dim=max_vocab_length, # size of the vocabulary
                             output_dim=128, # size of each embedding vector
                             embeddings_initializer="uniform", # default, initialize randomly
                             input_length=max_length, # how long is each input sequence
                             name="embedding_1")

embedding



<Embedding name=embedding_1, built=False>

In [21]:
# Get a random sentence from training set
random_sentence = random.choice(train_sentences)
print(f"Original text:\n{random_sentence}\
      \n\nEmbedded version:")

# Embed the random sentence (turn it into numerical representation)
sample_embed = embedding(text_vectorizer([random_sentence]))
sample_embed

Original text:
@smallforestelf Umm because a gun stopped the gunman with who was carrying a bomb!      

Embedded version:


<tf.Tensor: shape=(1, 15, 128), dtype=float32, numpy=
array([[[ 0.02692438,  0.00451346, -0.01685288, ..., -0.00061289,
          0.0050822 , -0.04684842],
        [ 0.02526628,  0.01787874, -0.01643982, ..., -0.0060253 ,
          0.01488583, -0.02433949],
        [ 0.03143467,  0.02759944,  0.03550751, ..., -0.03582587,
         -0.00115436, -0.046015  ],
        ...,
        [-0.01401442, -0.01166731, -0.03310205, ...,  0.0241088 ,
         -0.02523102, -0.04639594],
        [-0.03721518, -0.00669315, -0.02011895, ...,  0.03867329,
         -0.02918214,  0.01138765],
        [-0.02670651,  0.03277263, -0.02580332, ..., -0.04259557,
          0.03003252, -0.00745965]]], dtype=float32)>

Embedded version:
<tf.Tensor: shape=(1, 15, 128)>

- **1:** This is the batch size. We passed a single sentence, so it is 1.
- **15:** This is the sequence length (the number of token positions per sentence). Every **sentence is 15 tokens long**: if a **sentence is shorter**, the **text vectorizer pads it with 0's**, and if it is longer it is truncated.
- **128:** This is the **embedding dimension**, the number of values used to represent each token's meaning.

Embedding(input_dim=vocab_size, output_dim=128)

This means that each word (or token) will be represented by **128 floating-point numbers** that capture its semantic meaning, i.e. how it relates to other words.

So the tensor shape (1, 15, 128) means:

“We have 1 sentence, which is **15 tokens long,** and **each token is represented by 128 features** that capture its meaning.”

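| Word | Feature1 | Feature2 | ... | Feature128 |
|------|----------|----------|-----|------------|
| she  | 0.01     | -0.02    | ... | 0.03       |
| keep | -0.04    | 0.05     | ... | 0.01       |
| it   | 0.03     | -0.01    | ... | 0.02       |
| wet  | -0.07    | 0.02     | ... | 0.04       |
| like | 0.05     | 0.01     | ... | -0.06      |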

**Each token in the sentence is turned into a 128-feature vector.**

In [22]:
# Check out a single token's embedding
sample_embed[0][0]

<tf.Tensor: shape=(128,), dtype=float32, numpy=
array([ 2.6924375e-02,  4.5134649e-03, -1.6852878e-02, -1.3054766e-02,
        1.2090482e-02,  1.8951464e-02,  6.9807768e-03, -2.3053348e-02,
       -2.2919655e-02,  3.5134520e-02,  4.5245077e-02,  3.4086559e-02,
        4.6049122e-02, -3.1613886e-02, -7.4185021e-03, -1.7319538e-02,
        8.7212771e-05, -3.7154663e-02,  3.5107542e-02,  1.3365459e-02,
        3.0832402e-03, -8.5102208e-03, -2.5723541e-02, -2.4223639e-02,
        3.8212802e-02,  3.4688581e-02, -2.9946422e-02, -4.9709786e-02,
       -2.6710952e-02, -9.5573552e-03,  2.9594254e-02, -4.9265575e-02,
       -2.3818128e-03, -1.2318648e-02, -4.1492641e-02, -2.7676001e-03,
       -4.6026219e-02, -1.8073402e-02,  3.6907684e-02,  3.5456154e-02,
        1.1080660e-02, -8.0083497e-03, -4.6619713e-02, -1.9938529e-02,
        3.8194302e-02,  9.6670873e-03, -2.8421164e-02,  7.5739846e-03,
        1.6802121e-02,  4.8079602e-03,  2.3047376e-02, -1.8289804e-02,
        2.7891699e-02, -3.597

These values might not mean much to us but they're what our computer sees each word as. When our model looks for patterns in different samples, these values will be updated as necessary.

Now that we've got a way to turn our text data into numbers, we can start to build machine learning models to model it.

To get plenty of practice, we're going to build a series of different models, each as its own experiment. We'll then compare the results of each model and see which one performed best.

More specifically, we'll be building the following:

**Model 0:** Naive Bayes (baseline)

**Model 1:** Feed-forward neural network (dense model)

**Model 2:** LSTM model

**Model 3:** GRU model

**Model 4:** Bidirectional-LSTM model

**Model 5:** 1D Convolutional Neural Network

**Model 6:** TensorFlow Hub Pretrained Feature Extractor

**Model 7:** Same as model 6 with 10% of training data


Each experiment will go through the following steps:

- Construct the Model
- Train the Model
- Make Predictions with the Model
- Track Prediction Evaluation Metrics for Later Comparison

**Model 0: Getting a baseline**

In all Machine Learning modelling experiments, it is important to **create a baseline** model, so that we can benchmark against it.

To create our baseline, we'll create a **Scikit-Learn Pipeline** using the **TF-IDF (term frequency-inverse document frequency) formula** to convert our words to numbers and then model them with the **Multinomial Naive Bayes algorithm.** This choice was made by referring to the Scikit-Learn machine learning map.
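To show what TF-IDF does to a sentence, here is a small hand-rolled sketch. It assumes the smoothed variant idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which to my understanding matches the `TfidfVectorizer` defaults; the three toy documents and the helper `tfidf` are made up for the example.

```python
import math
from collections import Counter

docs = ["flood in my street", "pizza in my kitchen", "fire in the forest"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for doc in tokenized for word in set(doc))

# Smoothed inverse document frequency (assumed formula, see above)
idf = {word: math.log((1 + n_docs) / (1 + count)) + 1 for word, count in doc_freq.items()}

def tfidf(doc_tokens):
    term_freq = Counter(doc_tokens)
    weights = {word: term_freq[word] * idf[word] for word in term_freq}
    norm = math.sqrt(sum(value * value for value in weights.values()))
    return {word: value / norm for word, value in weights.items()}   # L2-normalize

vec = tfidf(tokenized[0])
for word, weight in sorted(vec.items(), key=lambda item: -item[1]):
    print(f"{word:>8}: {weight:.3f}")
```

Words that occur in every document (here "in") get the lowest weight, while words unique to one document (here "flood" and "street") get the highest.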


In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Create tokenization and modelling pipeline
model_0 = Pipeline([
                    ("tfidf", TfidfVectorizer()), # convert words to numbers using tfidf
                    ("clf", MultinomialNB()) # model the text
])

# Fit the pipeline to the training data
model_0.fit(train_sentences, train_labels)

Let's evaluate our model and find our baseline metric.

In [24]:
baseline_score = model_0.score(val_sentences, val_labels)
print(f"Our baseline model achieves an accuracy of: {baseline_score*100:.2f}%")

Our baseline model achieves an accuracy of: 79.27%


Let's do some **predictions** with our baseline model

In [25]:
# Make predictions
baseline_preds = model_0.predict(val_sentences)
baseline_preds[:20]

array([1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1])

**Creating an Evaluation Function for our Model Experiments**

- We are going to create this **function** because we will **evaluate several models** in the same way going forward.
- Let's create a helper function which takes **an array of predictions and ground truth labels** and computes the following:
    - Accuracy
    - Precision
    - Recall
    - F1-Score

Since we are dealing with a **Classification Problem**, the above metrics are appropriate. If we were dealing with a **Regression Problem**, we would use metrics like **MAE (Mean Absolute Error)**.
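For intuition, the four metrics can be computed by hand from the counts of true/false positives and negatives. This sketch scores the positive class only, so it is not identical to a class-weighted average; `binary_metrics` and the toy labels are illustrative.

```python
def binary_metrics(y_true, y_pred):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)   # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)   # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)   # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)   # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

results = binary_metrics(y_true, y_pred)
print(results)
```

Precision asks "of the Tweets predicted as disasters, how many really were?", while recall asks "of the real disaster Tweets, how many did we catch?".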

In [26]:
# Function to evaluate: accuracy, precision, recall, f1-score
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def calculate_results(y_true, y_pred):
  """
  Calculates model accuracy, precision, recall and f1 score of a binary classification model.

  Args:
  -----
  y_true = true labels in the form of a 1D array
  y_pred = predicted labels in the form of a 1D array

  Returns a dictionary of accuracy (as a percentage) and precision, recall, f1-score (each between 0 and 1).
  """
  # Calculate model accuracy
  model_accuracy = accuracy_score(y_true, y_pred) * 100
  # Calculate model precision, recall and f1 score using "weighted" average
  model_precision, model_recall, model_f1, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted")
  model_results = {"accuracy": model_accuracy,
                  "precision": model_precision,
                  "recall": model_recall,
                  "f1": model_f1}
  return model_results

In [27]:
# Get baseline results
baseline_results = calculate_results(y_true=val_labels,
                                     y_pred=baseline_preds)
baseline_results

{'accuracy': 79.26509186351706,
 'precision': 0.8111390004213173,
 'recall': 0.7926509186351706,
 'f1': 0.7862189758049549}

**Model 1: A simple Dense Model**

The first **deep** model we are going to build is a **single layer dense model**. It will:

- Take our **text** and **labels** as input
- **Tokenize** the **text**
- Create an **embedding**
- Find the average of the **embedding (using Global Average Pooling)** and then pass the average through a **fully connected layer** with one output and a **sigmoid** activation.

And since we're going to be building a number of TensorFlow deep learning models, we'll import our **create_tensorboard_callback()** function from helper_functions.py to keep track of the results of each.


In [28]:
# Create tensorboard callback (need to create a new one for each model)
from helper_functions import create_tensorboard_callback

# Create directory to save TensorBoard logs
SAVE_DIR = "model_logs"

In [29]:
# Build model with the Functional API
from tensorflow.keras import layers
inputs = layers.Input(shape=(1,), dtype="string") # inputs are 1-dimensional strings
x = text_vectorizer(inputs) # turn the input text into numbers
x = embedding(x) # create an embedding of the tokenized text
x = layers.GlobalAveragePooling1D()(x) # lower the dimensionality of the embedding (try running the model without this layer and see what happens)
outputs = layers.Dense(1, activation="sigmoid")(x) # create the output layer, want binary outputs so use sigmoid activation
model_1 = tf.keras.Model(inputs, outputs, name="model_1_dense") # construct the model


In [30]:
'''
The Sequential API code is

from tensorflow.keras import Sequential, layers

model_1_seq = Sequential([
    layers.Input(shape=(1,), dtype="string"),
    text_vectorizer,                      # convert text → numbers
    embedding,                            # convert numbers → meaning vectors
    layers.GlobalAveragePooling1D(),      # summarize each sentence
    layers.Dense(1, activation="sigmoid") # output: binary classification
], name="model_1_dense_sequential")

'''


Let's discuss the code above

- Our model takes a **1-dimensional string as input**.
- It then tokenizes the string using **text_vectorizer** and creates an **embedding**
- By using **GlobalAveragePooling1D()** layer we reduce the dimensionality of the tensor we pass to the **output layer.**
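
To make the dimensionality reduction concrete, here is a minimal NumPy sketch of what global average pooling does to an embedded batch. The random array is only a stand-in for the real embedding output (an assumption for illustration), and unlike the Keras layer this sketch ignores masking.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for the embedding output: (batch, tokens, embedding_dim).
# The values are random placeholders, not real embeddings.
batch_size, max_length, embedding_dim = 2, 15, 128
embedded = rng.normal(size=(batch_size, max_length, embedding_dim))

# Global average pooling: average over the token axis (axis=1),
# leaving one 128-dimensional vector per sentence.
pooled = embedded.mean(axis=1)

print(embedded.shape)  # (2, 15, 128)
print(pooled.shape)    # (2, 128)
```

Without this step the Dense layer would receive a 3-dimensional tensor and produce one output per token rather than one output per sentence.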

In [31]:
# Compile model
model_1.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

In [32]:
model_1.summary()


Most of the trainable parameters are contained within the **embedding layer**. Recall we created an embedding of size **128 (output_dim=128)** for a vocabulary of size **10,000** (input_dim=10000), hence the **1,280,000** trainable parameters.
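
The parameter count can be checked by hand. The short sketch below assumes the sizes quoted above (vocabulary of 10,000, embedding size 128) and a single-unit Dense output; the text vectorizer and pooling layers contribute no trainable weights.

```python
vocab_size = 10_000   # input_dim of the embedding layer
embedding_dim = 128   # output_dim of the embedding layer

# One 128-dimensional vector per vocabulary item
embedding_params = vocab_size * embedding_dim

# Dense(1): one weight per pooled feature plus one bias
dense_params = embedding_dim * 1 + 1

total_params = embedding_params + dense_params

print(embedding_params)  # 1280000
print(dense_params)      # 129
print(total_params)      # 1280129
```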

In [33]:
# Fit the model
model_1_history = model_1.fit(train_sentences, # input sentences can be a list of strings due to text preprocessing layer built-in model
                              train_labels,
                              epochs=5,
                              validation_data=(val_sentences, val_labels),
                              callbacks=[create_tensorboard_callback(dir_name=SAVE_DIR,
                                                                     experiment_name="simple_dense_model")])

Saving TensorBoard log files to: model_logs/simple_dense_model/20251009-110718
Epoch 1/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 5s 16ms/step - accuracy: 0.6374 - loss: 0.6494 - val_accuracy: 0.7559 - val_loss: 0.5338
Epoch 2/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 3s 12ms/step - accuracy: 0.8088 - loss: 0.4660 - val_accuracy: 0.7861 - val_loss: 0.4740
Epoch 3/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 3s 12ms/step - accuracy: 0.8530 - loss: 0.3619 - val_accuracy: 0.7953 - val_loss: 0.4621
Epoch 4/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 3s 12ms/step - accuracy: 0.8870 - loss: 0.2956 - val_accuracy: 0.7887 - val_loss: 0.4684
Epoch 5/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 3s 14ms/step - accuracy: 0.9069 - loss: 0.2468 - val_accuracy: 0.7795 - val_loss: 0.4843


In [34]:
# Check the results
model_1.evaluate(val_sentences, val_labels)

24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.7684 - loss: 0.5165


[0.4842676818370819, 0.7795275449752808]

What **embedding.weights** gives you

It returns a list of all the weight variables in that layer.

For a standard embedding layer, there is only one matrix, so the list has one element.

Shape: (vocab_size, embedding_dim) → (10000, 128) in our example

In [35]:
embedding.weights

[<Variable path=embedding_1/embeddings, shape=(10000, 128), dtype=float32, value=[[-0.01325036  0.0211815  -0.01513357 ... -0.05524667  0.04565595
   -0.01864273]
  [ 0.03761405  0.01036259 -0.00125246 ...  0.02195662 -0.0276198
   -0.00208184]
  [ 0.03693467 -0.01538322 -0.00384381 ... -0.0255676   0.03773013
   -0.02904656]
  ...
  [ 0.01115602 -0.00955745  0.00068473 ... -0.02583609  0.01994188
   -0.01180987]
  [-0.01268602  0.06382184  0.0853771  ... -0.07023269  0.02309494
   -0.02182389]
  [ 0.09210607  0.1089362   0.05232262 ... -0.07559539  0.10191818
   -0.03709696]]>]

**shape=(10000, 128):-** The *10000* is the number of vocabulary items and *128* is the number of dimensions in each word's vector.
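
One way to picture this matrix is as a lookup table: each token id selects one row. The NumPy sketch below assumes a randomly initialised matrix and made-up token ids (both hypothetical, chosen only for illustration), with id 0 standing in for padding.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embedding_dim = 10_000, 128

# Random stand-in for the learned embedding matrix (assumed range, not the trained weights)
embedding_matrix = rng.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim))

# Hypothetical token ids for one short sentence; 0 is used here as the padding id
token_ids = np.array([[12, 7, 345, 0, 0]])

# The "embedding layer" is just row indexing into the matrix
vectors = embedding_matrix[token_ids]

print(vectors.shape)  # (1, 5, 128)

# Both padding positions pick the same row, so their vectors are identical
print(np.array_equal(vectors[0, 3], vectors[0, 4]))  # True
```

This is also why padded positions in an embedded sentence show up as repeated identical rows.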

In [36]:
embed_weights = model_1.get_layer("embedding_1").get_weights()[0]
print(embed_weights.shape)

(10000, 128)


**Recurrent Neural Networks (RNN's)**

We will use a special kind of neural network, the **Recurrent Neural Network (RNN)**, for text data.

- **RNN:** Uses information from the **past** to help with the **future**. It takes an input **X** and computes **y** based on all previous inputs.

- The concept is helpful when dealing with **sequence data** such as passages of natural language text.

- When you read a sentence, you take into account the context of previous words when **deciphering the meaning** of the current word.

- When an RNN looks at a sequence of text (already in numerical form), the patterns it learns are continually updated based on the order of the sequence.
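
The idea of carrying information forward can be sketched in a few lines of NumPy. This is a plain (vanilla) RNN with random weights, not the Keras implementation; the sizes (15 steps, 128 input features, 64 units) are assumptions chosen to match the models in this notebook.

```python
import numpy as np

rng = np.random.default_rng(42)

timesteps, input_dim, units = 15, 128, 64

# Stand-in for one embedded sentence: 15 token vectors of size 128
x = rng.normal(size=(timesteps, input_dim))

# Randomly initialised weights (illustrative only, nothing is trained here)
W_x = rng.normal(scale=0.1, size=(input_dim, units))  # input -> hidden
W_h = rng.normal(scale=0.1, size=(units, units))      # hidden -> hidden
b = np.zeros(units)

# The hidden state starts empty and is updated once per token,
# so each step sees the current input plus a summary of everything before it.
h = np.zeros(units)
for x_t in x:
    h = np.tanh(x_t @ W_x + h @ W_h + b)

print(h.shape)  # (64,)
```

Because `h` is overwritten at every step, changing the order of the tokens changes the final state, which is what makes the model sensitive to sequence order.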

**Recurrent neural networks** can be used for a number of sequence-based problems:

- **One to one:** one input, one output, such as image classification.
- **One to many:** one input, many outputs, such as image captioning (image input, a sequence of text as caption output).
- **Many to one:** many inputs, one output, such as text classification (classifying a Tweet as real disaster or not real disaster).
- **Many to many:** many inputs, many outputs, such as machine translation (translating English to Spanish) or speech to text (audio wave as input, text as output).

When you come across RNN's in the wild, you'll most likely come across variants of the following:

- Long short-term memory cells (LSTMs).
- Gated recurrent units (GRUs).
- Bidirectional RNN's (passes forward and backward along a sequence, left to right and right to left).














**Model 2: LSTM**

We will start with an **LSTM-powered RNN**.

- LSTM cell and LSTM layer are often used interchangeably. We will use **tensorflow.keras.layers.LSTM()**

Our model is going to take on a very similar structure to model_1:

**Input (text)** -> Tokenize -> Embedding -> Layers -> **Output (label probability)**


The main difference will be that we're going to **add an LSTM layer** between our **embedding** and **output.**

**Note:-** Do not re-use the **trained embedding** from the previous model. We will create a new **embedding layer**, *model_2_embedding*, for this model. The **text_vectorizer** layer can be reused since it does not get updated during training.
- Think of an **embedding** as a notebook that stores the meaning of words considering the context.
- If both models use the **same embedding layer**, it is like both writing in the same notebook: the second model starts from what the first has already learned, so the comparison between models is no longer fair.
- This is a form of **data leakage** between experiments.

*An embedding layer starts with random numbers (unlike pretrained embeddings such as Word2Vec or GloVe, which arrive already trained). Each word is represented by a vector, a list of numbers that the model learns to adjust to capture meaning. For example, when a model sees "The cat sat on the mat", it makes predictions, compares them with the correct answers, and **adjusts the embedding weights** to reduce the error. Over time, the embeddings for words like "cat" and "dog" move **closer together**, while "cat" and "banana" stay far apart.*









**LSTM**

- An inherent problem with **RNNs** is that they can forget what happened many steps ago because when **Gradients are propagated through many time steps** they vanish. This means that long-term dependencies are lost.

An **LSTM** is a special kind of RNN that remembers information for longer periods using a clever internal structure called a **cell state**.

It uses gates to control information flow:

- **Forget Gate** → what to throw away
- **Input Gate** → what new info to store
- **Cell State** → memory itself
- **Output Gate** → what to send out

**Inside the LSTM Cell**

At each step $t$, we have:
- Input vector $x_t$
- Previous hidden state $h_{t-1}$
- Previous cell state $c_{t-1}$

The **LSTM** updates them to produce:
- New cell state $c_t$
- New hidden state $h_t$
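
The update described above can be sketched as a single step function in NumPy. This follows the standard LSTM formulation with one combined weight matrix; the gate ordering inside that matrix is an arbitrary choice for this sketch and does not mirror Keras internals. Weights are random and the sizes (128 inputs, 64 units, 15 steps) are assumptions matching this notebook.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step. W has shape (units + input_dim, 4 * units)."""
    units = h_prev.shape[0]
    z = np.concatenate([h_prev, x_t]) @ W + b
    f = sigmoid(z[0 * units:1 * units])   # forget gate: what to throw away
    i = sigmoid(z[1 * units:2 * units])   # input gate: what new info to store
    g = np.tanh(z[2 * units:3 * units])   # candidate values for the cell state
    o = sigmoid(z[3 * units:4 * units])   # output gate: what to send out
    c_t = f * c_prev + i * g              # new cell state (the memory)
    h_t = o * np.tanh(c_t)                # new hidden state
    return h_t, c_t

rng = np.random.default_rng(42)
input_dim, units, timesteps = 128, 64, 15

W = rng.normal(scale=0.1, size=(units + input_dim, 4 * units))
b = np.zeros(4 * units)

h, c = np.zeros(units), np.zeros(units)
for x_t in rng.normal(size=(timesteps, input_dim)):
    h, c = lstm_step(x_t, h, c, W, b)

print(h.shape, c.shape)   # (64,) (64,)
print(W.size + b.size)    # 49408
```

The last line counts the weights of this sketch: four gate-sized blocks, each with `units * (input_dim + units)` weights and `units` biases.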






In [37]:
''' Set random seed and create embedding layer
    (new embedding layer for each model)
'''

tf.random.set_seed(42)
from tensorflow.keras import layers
model_2_embedding = layers.Embedding(input_dim=max_vocab_length,
                                     output_dim=128,
                                     embeddings_initializer="uniform",
                                     input_length=max_length,
                                     name="embedding_2")

# Create LSTM Model
inputs = layers.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
x = model_2_embedding(x)
print(x.shape)
# x = layers.LSTM(64, return_sequences=True)(x) # return vector for each word in the Tweet (you can stack RNN cells as long as return_sequences=True)
x = layers.LSTM(64)(x) # return vector for whole sequence
print(x.shape)
# x = layers.Dense(64, activation="relu")(x) # optional dense layer on top of output of LSTM cell
outputs = layers.Dense(1, activation="sigmoid")(x)
model_2 = tf.keras.Model(inputs, outputs, name="model_2_LSTM")




(None, 15, 128)
(None, 64)




**1. Input:**
Each sentence/Tweet is padded or truncated to 15 tokens by the **text_vectorizer**.

**2. Embedding Layer:**
Each token is converted to a 128-dimensional vector (embedding).
So now your input looks like a matrix of shape:
**(15 tokens, 128 dimensions)**

**3. LSTM Layer (64 units):**

The LSTM layer steps through the 15 embeddings one at a time, carrying a 64-dimensional hidden state and cell state from step to step.
Internally, gates (forget, input, output) control what the memory keeps through the sequence.

After processing the sequence, the LSTM layer outputs a single 64-dimensional vector (because return_sequences=False), summarizing the entire sentence.

**4. Dense Layer:**

The 64-dimensional vector goes into a single Dense unit with sigmoid activation.
Sigmoid squashes the output to the range 0–1, representing the probability that the sentence belongs to the positive class (label 1).


Sentence (15 words)
     ↓

Embeddings (15 × 128)
     ↓

LSTM (64 units)
     ↓
    
Vector summarizing sentence (64 numbers)
     ↓

Dense layer (1 output)
     ↓

Probability of the positive class (label 1)



**Note:**

Reading the documentation for the **TensorFlow LSTM layer**, you'll find a plethora of parameters. Many of these have been tuned to make sure they compute as fast as possible. The main ones you'll be looking to adjust are **units** (number of hidden units) and **return_sequences** (**set this to True** when stacking LSTM or other recurrent layers).

In [38]:

# Compile model
model_2.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

In [39]:
model_2.summary()

In [40]:
# Fit model
model_2_history = model_2.fit(train_sentences,
                              train_labels,
                              epochs=5,
                              validation_data=(val_sentences, val_labels),
                              callbacks=[create_tensorboard_callback(SAVE_DIR,
                                                                     "LSTM")])

Saving TensorBoard log files to: model_logs/LSTM/20251009-110734
Epoch 1/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 7s 22ms/step - accuracy: 0.6753 - loss: 0.5800 - val_accuracy: 0.7795 - val_loss: 0.4600
Epoch 2/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 6s 25ms/step - accuracy: 0.8633 - loss: 0.3289 - val_accuracy: 0.7585 - val_loss: 0.5105
Epoch 3/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 9s 20ms/step - accuracy: 0.9145 - loss: 0.2262 - val_accuracy: 0.7572 - val_loss: 0.6181
Epoch 4/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 6s 24ms/step - accuracy: 0.9436 - loss: 0.1576 - val_accuracy: 0.7480 - val_loss: 0.6738
Epoch 5/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 4s 20ms/step - accuracy: 0.9622 - loss: 0.1205 - val_accuracy: 0.7717 - val_loss: 0.6166


In [41]:
model_2_pred_probs = model_2.predict(val_sentences)
model_2_pred_probs.shape, model_2_pred_probs[:10]

24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 13ms/step


((762, 1),
 array([[0.24991111],
        [0.95657367],
        [0.99856645],
        [0.02831087],
        [0.00459421],
        [0.9809498 ],
        [0.73174715],
        [0.9979942 ],
        [0.9989453 ],
        [0.38142493]], dtype=float32))

We can turn **these probabilities** into **prediction classes** by rounding to the nearest integer (**prediction probabilities** under 0.5 go to 0 and those above 0.5 go to 1).

The model_2_pred_probs output looks like

    [[0.8],
     [0.3],
     [0.9]]

The **round** function changes it to

    [[1],
     [0],
     [1]]

The **squeeze** function removes the extra dimension

    [1, 0, 1]
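
The same two steps can be tried on a tiny array. This sketch uses NumPy's `round` and `squeeze` as stand-ins for `tf.round` and `tf.squeeze` (an assumption made so it runs without TensorFlow); both libraries round exact halves to the nearest even number.

```python
import numpy as np

# Toy prediction probabilities with the same (n, 1) layout as model.predict()
pred_probs = np.array([[0.8],
                       [0.3],
                       [0.9]], dtype=np.float32)

rounded = np.round(pred_probs)   # still shape (3, 1)
preds = np.squeeze(rounded)      # drops the size-1 axis -> shape (3,)

print(rounded.shape)  # (3, 1)
print(preds.shape)    # (3,)
print(preds)          # [1. 0. 1.]
```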

In [42]:
# Round out predictions and reduce to 1-dimensional array
model_2_preds = tf.squeeze(tf.round(model_2_pred_probs))
model_2_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 1., 1., 0., 0., 1., 1., 1., 1., 0.], dtype=float32)>

In [43]:
# Calculate LSTM model results
model_2_results = calculate_results(y_true=val_labels,
                                    y_pred=model_2_preds)
model_2_results


{'accuracy': 77.16535433070865,
 'precision': 0.7747861668850706,
 'recall': 0.7716535433070866,
 'f1': 0.7688960790251899}

In [44]:
# Create a helper function to compare our baseline results to new model results
def compare_baseline_to_new_results(baseline_results, new_model_results):
  for key, value in baseline_results.items():
    print(f"Baseline {key}: {value:.2f}, New {key}: {new_model_results[key]:.2f}, Difference: {new_model_results[key]-value:.2f}")

In [45]:
# Compare model 2 to baseline
compare_baseline_to_new_results(baseline_results, model_2_results)

Baseline accuracy: 79.27, New accuracy: 77.17, Difference: -2.10
Baseline precision: 0.81, New precision: 0.77, Difference: -0.04
Baseline recall: 0.79, New recall: 0.77, Difference: -0.02
Baseline f1: 0.79, New f1: 0.77, Difference: -0.02


**Model 3: GRU**

Another very popular recurrent cell is the **GRU** or **Gated Recurrent Unit**
- **GRU** has similar features to an LSTM cell but has fewer parameters
- To use the **GRU** cell in TensorFlow we can call **tensorflow.keras.layers.GRU()**

The architecture of the GRU-powered model will follow the same structure we've been using:

*Input (text) -> Tokenize -> Embedding -> Layers -> Output (label probability)*

In [46]:
tf.random.set_seed(42)
from tensorflow.keras import layers

model_3_embedding = layers.Embedding(input_dim=max_vocab_length,
                                     output_dim=128,
                                     embeddings_initializer="uniform",
                                     input_length=max_length,
                                     name="embedding_3")
# Build an RNN using the GRU Cell
inputs = layers.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
x = model_3_embedding(x)
# x = layers.GRU(64, return_sequences=True)(x) # stacking recurrent cells requires return_sequences=True
x = layers.GRU(64)(x)

# x = layers.Dense(64, activation="relu")(x)
# optional dense layer after GRU cell

outputs = layers.Dense(1, activation="sigmoid")(x)
model_3 = tf.keras.Model(inputs, outputs, name="model_3_GRU")



In [47]:
# Compile GRU model
model_3.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

In [48]:
model_3.summary()

- Notice the difference in the **number of trainable parameters** between **model_2 (LSTM)** and **model_3 (GRU)**
- The main difference comes from the **LSTM cell** having more trainable parameters than a **GRU cell.**

- We'll fit our model just as we've been doing previously. We'll also track our model's results using our **create_tensorboard_callback()** function.
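
The gap in parameters can be worked out by hand. The sketch below assumes 128-dimensional inputs, 64 units, and the Keras default GRU variant (`reset_after=True`), which keeps two bias vectors per gate; if the layers were built with other settings the numbers would differ.

```python
input_dim = 128   # embedding size feeding the recurrent layer
units = 64        # hidden units in the recurrent layer

# LSTM: 4 weight blocks (forget, input, candidate, output),
# each with input weights, recurrent weights and one bias vector
lstm_params = 4 * (units * (input_dim + units) + units)

# GRU: 3 weight blocks (update, reset, candidate);
# with reset_after=True (assumed default) each block has two bias vectors
gru_params = 3 * (units * (input_dim + units) + 2 * units)

print(lstm_params)  # 49408
print(gru_params)   # 37248
```

The LSTM carries one extra weight block, which accounts for most of the difference.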

**GRU Theory**

- A plain **RNN** **forgets long-term information**, a consequence of the **vanishing gradient problem.**
- **GRU** is a simplified variant of the **LSTM** which helps with remembering long-term information

**Imagine you’re reading a story:**

“The man put the milk in the fridge because it was hot.”
When you read the word **“it”**, your brain remembers that **“it”** refers to **“milk”** — something you read several words earlier.
That’s long-term dependency.

*GRU helps a neural network decide how much of the past to remember and how much to forget, just like your brain.*

It does this using two gates
- **Update Gate (z):** Decides how much of the past to keep
- **Reset Gate (r):** Decides how much of the past to forget

At each step $t$ (for each word or data point in the sequence):
- Take the current input $x_t$
- Take the previous hidden state $h_{t-1}$
- Compute

  Update gate: $z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$

  Reset gate: $r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$

  Candidate memory: $\tilde{h}_t = \tanh(W_h \cdot [r_t * h_{t-1}, x_t])$

**Blend the Info**

$$h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t$$
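
The GRU update above can be written out as a short NumPy function. This is a sketch of the equations as stated (no bias terms, random untrained weights), not the Keras implementation; the sizes are assumptions matching this notebook.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU time step following the update/reset/candidate/blend equations."""
    concat = np.concatenate([h_prev, x_t])
    z_t = sigmoid(concat @ W_z)   # update gate
    r_t = sigmoid(concat @ W_r)   # reset gate
    # Candidate memory: the reset gate scales how much of the past is used
    h_candidate = np.tanh(np.concatenate([r_t * h_prev, x_t]) @ W_h)
    # Blend old state and candidate using the update gate
    return (1 - z_t) * h_prev + z_t * h_candidate

rng = np.random.default_rng(42)
input_dim, units, timesteps = 128, 64, 15

# Three weight matrices, one per equation (random placeholders)
W_z = rng.normal(scale=0.1, size=(units + input_dim, units))
W_r = rng.normal(scale=0.1, size=(units + input_dim, units))
W_h = rng.normal(scale=0.1, size=(units + input_dim, units))

h = np.zeros(units)
for x_t in rng.normal(size=(timesteps, input_dim)):
    h = gru_step(x_t, h, W_z, W_r, W_h)

print(h.shape)  # (64,)
```

Note that there is no separate cell state: the hidden state `h` is the only memory passed from step to step.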

**Why GRU is Popular**

✅ Easier to train than LSTM (fewer gates, fewer parameters)
✅ Works well on smaller datasets
✅ Faster and simpler while still capturing long-term dependencies

- **Gates:** LSTM has three (input, forget, output) while GRU has two (update, reset)
- **Memory Cell:** LSTM has a separate memory cell while GRU does not
- **Speed:** LSTM is slower while GRU is faster
- **Accuracy:** Often similar
- **Complexity:** LSTM has high complexity while GRU is moderate.


In [49]:
# Fit model
model_3_history = model_3.fit(train_sentences,
                              train_labels,
                              epochs=5,
                              validation_data=(val_sentences, val_labels),
                              callbacks=[create_tensorboard_callback(SAVE_DIR, "GRU")])

Saving TensorBoard log files to: model_logs/GRU/20251009-110807
Epoch 1/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 8s 28ms/step - accuracy: 0.6509 - loss: 0.6006 - val_accuracy: 0.7769 - val_loss: 0.4562
Epoch 2/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 5s 21ms/step - accuracy: 0.8624 - loss: 0.3341 - val_accuracy: 0.7625 - val_loss: 0.5111
Epoch 3/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 6s 26ms/step - accuracy: 0.9110 - loss: 0.2328 - val_accuracy: 0.7612 - val_loss: 0.5913
Epoch 4/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 5s 23ms/step - accuracy: 0.9380 - loss: 0.1719 - val_accuracy: 0.7638 - val_loss: 0.6025
Epoch 5/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 10s 45ms/step - accuracy: 0.9596 - loss: 0.1322 - val_accuracy: 0.7598 - val_loss: 0.6520


In [50]:
# Make predictions on the validation data
model_3_pred_probs = model_3.predict(val_sentences)
model_3_pred_probs.shape, model_3_pred_probs[:10]

24/24 ━━━━━━━━━━━━━━━━━━━━ 1s 14ms/step


((762, 1),
 array([[0.0585444 ],
        [0.93690646],
        [0.994304  ],
        [0.10785145],
        [0.01040519],
        [0.99212456],
        [0.19128388],
        [0.99574375],
        [0.9955693 ],
        [0.8610971 ]], dtype=float32))

In [51]:
# Convert prediction probabilities to prediction classes
model_3_preds = tf.squeeze(tf.round(model_3_pred_probs))
model_3_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 1., 1., 0., 0., 1., 0., 1., 1., 1.], dtype=float32)>

In [52]:
# Calculate model_3 results
model_3_results = calculate_results(y_true=val_labels,
                                    y_pred=model_3_preds)
model_3_results

{'accuracy': 75.98425196850394,
 'precision': 0.7637560697167074,
 'recall': 0.7598425196850394,
 'f1': 0.7563819709955472}

In [53]:
# Compare to baseline
compare_baseline_to_new_results(baseline_results, model_3_results)

Baseline accuracy: 79.27, New accuracy: 75.98, Difference: -3.28
Baseline precision: 0.81, New precision: 0.76, Difference: -0.05
Baseline recall: 0.79, New recall: 0.76, Difference: -0.03
Baseline f1: 0.79, New f1: 0.76, Difference: -0.03


**Model 4: Bidirectional RNN Model**

- A standard **RNN** will process a **sequence from left to right**, whereas a **bidirectional RNN** will process the sequence from **left to right** and then again from **right to left**.
- It's like reading a sentence normally from **left to right** and then, to understand it fully, reading it again from **right to left**.

*In practice, many sequence models often see an improvement in performance when using bidirectional RNN's.*

However, this improvement in performance often comes at the **cost of longer training times and increased model parameters** (since the model goes left to right and right to left, the number of trainable parameters in the recurrent layer doubles).

**TensorFlow** helps by providing the **tensorflow.keras.layers.Bidirectional** class. We can use the **Bidirectional** class to wrap our existing RNNs, instantly making them **bidirectional**.
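
To show what wrapping a recurrent layer this way amounts to, here is a minimal NumPy sketch. It uses a plain tanh RNN instead of an LSTM to stay short, random untrained weights, and joins the two directions by concatenation (assumed here to mirror the wrapper's default merge behaviour).

```python
import numpy as np

def run_rnn(sequence, W_x, W_h):
    """Run a plain tanh RNN over a sequence and return the final hidden state."""
    h = np.zeros(W_h.shape[0])
    for x_t in sequence:
        h = np.tanh(x_t @ W_x + h @ W_h)
    return h

rng = np.random.default_rng(42)
timesteps, input_dim, units = 15, 128, 64
sequence = rng.normal(size=(timesteps, input_dim))

# Each direction has its own, separate set of weights (hence double the parameters)
fwd_W_x = rng.normal(scale=0.1, size=(input_dim, units))
fwd_W_h = rng.normal(scale=0.1, size=(units, units))
bwd_W_x = rng.normal(scale=0.1, size=(input_dim, units))
bwd_W_h = rng.normal(scale=0.1, size=(units, units))

forward_state = run_rnn(sequence, fwd_W_x, fwd_W_h)          # left to right
backward_state = run_rnn(sequence[::-1], bwd_W_x, bwd_W_h)   # right to left

# Concatenate both summaries into one vector
output = np.concatenate([forward_state, backward_state])

print(output.shape)  # (128,)
```

With 64 units per direction, the layer after the bidirectional wrapper receives 128 features instead of 64.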

In [54]:
# Set random seed and create embedding layer (new embedding layer for each model)
tf.random.set_seed(42)
from tensorflow.keras import layers
model_4_embedding = layers.Embedding(input_dim=max_vocab_length,
                                     output_dim=128,
                                     embeddings_initializer="uniform",
                                     input_length=max_length,
                                     name="embedding_4")
inputs = layers.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
x = model_4_embedding(x)

# x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
# stacking RNN layers requires return_sequences=True
x = layers.Bidirectional(layers.LSTM(64))(x)
# bidirectional goes both ways so has double the parameters of a
# regular LSTM layer

outputs = layers.Dense(1, activation="sigmoid")(x)
model_4 = tf.keras.Model(inputs, outputs, name="model_4_Bidirectional")




**Note:** You can use the **Bidirectional** wrapper on any **RNN Cell in TensorFlow**. For example, **layers.Bidirectional(layers.GRU(64))** creates a **bidirectional GRU cell.**

In [55]:
# Compile
model_4.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

In [56]:
model_4.summary()

Notice the **increased number of trainable parameters** in model_4 (bidirectional LSTM) compared to **model_2 (regular LSTM)**. This is due to the bidirectionality we added to our RNN.

**Time to fit our bidirectional model** and track its performance.

In [57]:
# Fit the model (takes longer because of the bidirectional layers)
model_4_history = model_4.fit(train_sentences,
                              train_labels,
                              epochs=5,
                              validation_data=(val_sentences, val_labels),
                              callbacks=[create_tensorboard_callback(SAVE_DIR,
                                                                     "bidirectional_RNN")])

Saving TensorBoard log files to: model_logs/bidirectional_RNN/20251009-110842
Epoch 1/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 12s 38ms/step - accuracy: 0.6775 - loss: 0.5814 - val_accuracy: 0.7769 - val_loss: 0.4608
Epoch 2/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 8s 30ms/step - accuracy: 0.8653 - loss: 0.3274 - val_accuracy: 0.7717 - val_loss: 0.5058
Epoch 3/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 7s 32ms/step - accuracy: 0.9167 - loss: 0.2218 - val_accuracy: 0.7493 - val_loss: 0.6033
Epoch 4/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 7s 31ms/step - accuracy: 0.9471 - loss: 0.1459 - val_accuracy: 0.7533 - val_loss: 0.6587
Epoch 5/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 8s 35ms/step - accuracy: 0.9622 - loss: 0.1151 - val_accuracy: 0.7428 - val_loss: 0.7429


Due to the bidirectionality of our model we see a slight increase in training time.

In [58]:
# Make predictions with bidirectional RNN on the validation data
model_4_pred_probs = model_4.predict(val_sentences)
model_4_pred_probs[:10]

24/24 ━━━━━━━━━━━━━━━━━━━━ 1s 21ms/step


array([[0.11086708],
       [0.988601  ],
       [0.9996399 ],
       [0.07283685],
       [0.00842149],
       [0.9950955 ],
       [0.84937453],
       [0.9998815 ],
       [0.99890023],
       [0.53743035]], dtype=float32)

In [59]:
# Convert prediction probabilities to labels
model_4_preds = tf.squeeze(tf.round(model_4_pred_probs))
model_4_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 1., 1., 0., 0., 1., 1., 1., 1., 1.], dtype=float32)>

In [60]:
# Calculate bidirectional RNN model results
model_4_results = calculate_results(val_labels, model_4_preds)
model_4_results

{'accuracy': 74.2782152230971,
 'precision': 0.7436694999217812,
 'recall': 0.7427821522309711,
 'f1': 0.7404494883483694}

In [61]:
# Check to see how the bidirectional model performs against the baseline
compare_baseline_to_new_results(baseline_results, model_4_results)

Baseline accuracy: 79.27, New accuracy: 74.28, Difference: -4.99
Baseline precision: 0.81, New precision: 0.74, Difference: -0.07
Baseline recall: 0.79, New recall: 0.74, Difference: -0.05
Baseline f1: 0.79, New f1: 0.74, Difference: -0.05


**Convolutional Neural Networks For Text**

- You might've used convolutional neural networks (CNNs) for images before but they can also be used for sequences.

- The main difference between using CNNs for images and sequences is the shape of the data. Images come in 2 dimensions (height x width), whereas sequences are often 1-dimensional (a string of text).

- So to use CNNs with sequences, we use a 1-dimensional convolution instead of a 2-dimensional convolution.

A typical CNN architecture for sequences will look like the following:

Inputs (text) -> Tokenization -> Embedding -> Layers -> Outputs (class probabilities)


The difference again is in the layers component. Instead of using an LSTM or GRU cell, we're going to use a **tensorflow.keras.layers.Conv1D() layer** followed by a **tensorflow.keras.layers.GlobalMaxPool1D() layer.**


1. 1-dimensional convolving filters are used as ngram detectors, each filter specializing in a closely-related family of ngrams (an ngram is a collection of n words, for example, an ngram of 5 might result in "hello, my name is Daniel").
2. Max-pooling over time extracts the relevant ngrams for making a decision.
3. The rest of the network classifies the text based on this information.
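
To make the ngram idea concrete, here is a small pure-Python sketch that lists every 5-word window of a sentence. The `ngrams` helper and the example sentence are hypothetical, written only for this illustration; a Conv1D filter sees the same windows, but as embedding vectors rather than words.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical 9-word sentence
tokens = "hello my name is Daniel and I like text".split()

five_grams = ngrams(tokens, 5)

print(len(five_grams))   # 5  (9 tokens - 5 + 1)
print(five_grams[0])     # ('hello', 'my', 'name', 'is', 'Daniel')
print(five_grams[-1])    # ('Daniel', 'and', 'I', 'like', 'text')
```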







**Model 5: Conv1D**

Before building a full **1-dimensional CNN Model** let's see a **1-dimensional Convolution Layer** also called **Temporal Convolution** in action.

We will first create an embedding of a sample of text and experiment with passing it through a **Conv1D()** layer and a **GlobalMaxPool1D()** layer.

In [62]:
# Test out the embedding, 1D Convolutional and max pooling

# Turn the target sentence into embedding
embedding_test = embedding(text_vectorizer(["This is a test sentence"]))
conv_1d = layers.Conv1D(filters=32, kernel_size=5, activation="relu")
conv_1d_output = conv_1d(embedding_test)
max_pool = layers.GlobalMaxPool1D()
max_pool_output = max_pool(conv_1d_output)
embedding_test.shape, conv_1d_output.shape, max_pool_output.shape



(TensorShape([1, 15, 128]), TensorShape([1, 11, 32]), TensorShape([1, 32]))

- **embedding_test:-** Converts the text into a tensor of shape (1, 15, 128), where '1' is the batch size, '15' is the input length (the sentence is padded to 15 tokens) and '128' is the size of the vector each token is converted to.

- **conv_1d = layers.Conv1D(filters=32, kernel_size=5, activation="relu")** This is a **pattern detector** sliding over your sentence.
    - **kernel_size=5** means that the convolution looks at **5 tokens at a time**. For example, the first window covers *This is a test sentence*, and the next covers *is a test sentence* plus one padding token.
    - **filters=32** means we have **32 detectors**, each trying to find a different type of word pattern. For example, one might respond to a 'positive tone' and another to a 'negative tone'. The output shape becomes (1, 11, 32): 1 is the number of sentences in the batch, 11 is the number of positions (word windows) the convolution moved across and 32 is the number of filters (pattern detectors).

What the convolution actually did

Let’s say your sentence (after embedding)
had 15 words → shape (1, 15, 128)

Now your convolution is set with kernel_size=5 → looks at 5 words at a time.

*The convolution starts at word 1–5, then slides to words 2–6, then 3–7, etc.
So, how many such windows does it get?*

👉 15 (words) − 5 (window) + 1 = 11 windows

That’s why you see 11 as the second number in (1, 11, 32).

Each window (chunk of 5 words) gets processed by 32 filters —
so you get 32 output numbers per window, showing how strongly each filter responded to that chunk.

Input sentence (15 words)

↓

Convolution with window=5 slides 11 times

↓

Each slide produces 32 numbers (from 32 filters)

↓

Resulting output → (1, 11, 32)

- **max_pool = layers.GlobalMaxPool1D()**
We’re starting with the conv_1d_output of shape:
(1, 11, 32) → which means:
- 1 = one sentence (batch size)
- 11 = 11 sliding windows (5-word chunks)
- 32 = 32 filters (each finding a pattern)

For each of the 32 filters (pattern detectors), find the strongest signal across all 11 chunks.

It “summarizes” each filter’s responses by keeping only its maximum value — the one where it responded most strongly.
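
The pooling step is a maximum over the window axis. The NumPy sketch below uses random numbers as a stand-in for the real convolution output (an assumption for illustration).

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for conv_1d_output: (batch, windows, filters)
conv_output = rng.random((1, 11, 32))

# Global max pooling: for each filter, keep the largest response across all 11 windows
max_pooled = conv_output.max(axis=1)

print(conv_output.shape)  # (1, 11, 32)
print(max_pooled.shape)   # (1, 32)
```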

**An intuitive way of understanding the above**

- Each sentence is **15 words**, and each word is represented by **128 numbers** so the shape is *(15, 128)*, 15 rows and 128 columns with each row representing a word with 128 dimensions.

You’ve told the Conv1D layer:

- **kernel_size = 5** → look at 5 words at a time
- **filters = 32** → use 32 different “pattern detectors”

Now the convolution takes a window of **5 rows (words) x 128 columns (embedding dims)** and multiplies it element-wise with each of the **32 kernels**, summing the products (plus a bias) into one number per kernel.
Since each word is a **128-dim vector**, the window is a small **5 x 128 matrix**, and each kernel has the same **5 x 128** shape.

Now,
- Each **row** of the kernel corresponds to a position in the **5-word window.**
- Each **column** corresponds to an *embedding feature.*
- Each element Kij is a trainable weight.
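
The sliding-window computation can be reproduced by hand in NumPy. This is a sketch with random inputs and random untrained kernels, using the sizes discussed above (15 tokens, 128 embedding dims, kernel size 5, 32 filters, ReLU); it is not the Keras implementation.

```python
import numpy as np

rng = np.random.default_rng(42)

seq_len, embedding_dim = 15, 128
kernel_size, filters = 5, 32

# Stand-in for one embedded sentence
embedded = rng.normal(size=(seq_len, embedding_dim))

# 32 kernels, each a 5 x 128 matrix of trainable weights (random here), plus one bias per kernel
kernels = rng.normal(scale=0.1, size=(kernel_size, embedding_dim, filters))
bias = np.zeros(filters)

num_windows = seq_len - kernel_size + 1   # 15 - 5 + 1 = 11
output = np.zeros((num_windows, filters))

for start in range(num_windows):
    window = embedded[start:start + kernel_size]   # shape (5, 128)
    # Multiply the window with every kernel and sum over positions and features
    response = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias   # shape (32,)
    output[start] = np.maximum(response, 0.0)      # ReLU activation

print(output.shape)                # (11, 32)
print(kernels.size + bias.size)    # 20512
```

The second number is the trainable parameter count of such a layer: 5 × 128 × 32 weights plus 32 biases.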










In [63]:
embedding_test.shape
embedding_test[:1]

<tf.Tensor: shape=(1, 15, 128), dtype=float32, numpy=
array([[[ 0.00689882,  0.01558637,  0.04679224, ...,  0.04097068,
         -0.0077493 ,  0.03530424],
        [ 0.02853888,  0.01967032,  0.02235212, ..., -0.03355449,
         -0.01723835, -0.05492988],
        [-0.00563933, -0.0075665 , -0.02510787, ...,  0.01567699,
         -0.01620559, -0.05446053],
        ...,
        [-0.01325036,  0.0211815 , -0.01513357, ..., -0.05524667,
          0.04565595, -0.01864273],
        [-0.01325036,  0.0211815 , -0.01513357, ..., -0.05524667,
          0.04565595, -0.01864273],
        [-0.01325036,  0.0211815 , -0.01513357, ..., -0.05524667,
          0.04565595, -0.01864273]]], dtype=float32)>

In [64]:
conv_1d_output.shape
conv_1d_output[:1]

<tf.Tensor: shape=(1, 11, 32), dtype=float32, numpy=
array([[[0.03655996, 0.12613723, 0.        , 0.06870178, 0.09322613,
         0.08816093, 0.02233853, 0.        , 0.00532557, 0.        ,
         0.        , 0.17987034, 0.        , 0.02863005, 0.13624111,
         0.        , 0.08822712, 0.09706487, 0.        , 0.0009329 ,
         0.03885968, 0.1387533 , 0.        , 0.04235066, 0.03736172,
         0.05846445, 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        ],
        [0.        , 0.        , 0.02259161, 0.        , 0.02077084,
         0.01890488, 0.        , 0.10071232, 0.        , 0.06708212,
         0.05403037, 0.        , 0.07743414, 0.        , 0.13780522,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.02061899, 0.03057478, 0.02854359, 0.        ,
         0.03003151, 0.        , 0.00337284, 0.04962951, 0.00481275,
         0.        , 0.        ],
        [0.        , 0.02430483, 0.        , 0.0259

In [66]:

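# Set the random seed for reproducibility
tf.random.set_seed(42)
from tensorflow.keras import layers

# Create a separate embedding layer for model_5
model_5_embedding = layers.Embedding(input_dim=max_vocab_length,
                                     output_dim=128,
                                     embeddings_initializer="uniform",
                                     input_length=max_length,
                                     name="embedding_5")

# Build the 1D CNN with the Functional API
inputs = layers.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)  # turn text into token indices
x = model_5_embedding(x)  # turn token indices into 128-dim vectors
x = layers.Conv1D(filters=32, kernel_size=5, activation="relu")(x)  # slide 32 kernels over 5-word windows
x = layers.GlobalMaxPool1D()(x)  # keep the largest activation of each filter
outputs = layers.Dense(1, activation="sigmoid")(x)  # single probability for binary classification
model_5 = tf.keras.Model(inputs, outputs, name="model_5_Conv1D")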


# Compile the model
model_5.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])
# Print the summary
model_5.summary()



Woohoo! Looking great! Notice how the number of trainable parameters for the 1-dimensional convolutional layer is similar to that of the LSTM layer in model_2.
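
The Conv1D and Dense parameter counts can be checked by hand. The sketch below assumes a sequence length of 15 (taken from the `embedding_test` shape shown earlier) and the default Conv1D settings of stride 1 with "valid" padding. The variable names are illustrative and not part of the notebook.

```python
# Shapes and settings taken from the model above (seq_len is an assumption)
seq_len, embedding_dim = 15, 128
kernel_size, filters = 5, 32

# Conv1D: one (kernel_size x embedding_dim) kernel plus one bias per filter
conv1d_params = kernel_size * embedding_dim * filters + filters

# With "valid" padding and stride 1, the kernel fits in this many positions
conv1d_steps = seq_len - kernel_size + 1

# GlobalMaxPool1D has no weights; it keeps the max over all positions for each filter
pooled_features = filters

# Dense(1): one weight per pooled feature plus one bias
dense_params = pooled_features * 1 + 1

print(conv1d_params, conv1d_steps, dense_params)  # 20512 11 33
```

So the Conv1D layer turns a (15, 128) sentence into (11, 32), the pooling layer reduces that to 32 features, and the final Dense layer maps those 32 features to one probability.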

Let's fit our 1D CNN model to our text data. In line with previous experiments, we'll save its results using our create_tensorboard_callback() function.

In [67]:
# Fit the model
model_5_history = model_5.fit(train_sentences,
                              train_labels,
                              epochs=5,
                              validation_data=(val_sentences, val_labels),
                              callbacks=[create_tensorboard_callback(
                                  SAVE_DIR, "Conv1D")])

Saving TensorBoard log files to: model_logs/Conv1D/20251009-122023
Epoch 1/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 5s 16ms/step - accuracy: 0.6507 - loss: 0.6239 - val_accuracy: 0.7822 - val_loss: 0.4680
Epoch 2/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 3s 15ms/step - accuracy: 0.8460 - loss: 0.3681 - val_accuracy: 0.7874 - val_loss: 0.4760
Epoch 3/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 4s 18ms/step - accuracy: 0.9128 - loss: 0.2329 - val_accuracy: 0.7861 - val_loss: 0.5309
Epoch 4/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 3s 14ms/step - accuracy: 0.9508 - loss: 0.1488 - val_accuracy: 0.7861 - val_loss: 0.6004
Epoch 5/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 3s 14ms/step - accuracy: 0.9663 - loss: 0.1045 - val_accuracy: 0.7822 - val_loss: 0.6521


In [68]:
# Make predictions with model_5
model_5_pred_probs = model_5.predict(val_sentences)
model_5_pred_probs[:10]

24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step


array([[0.89048654],
       [0.89241564],
       [0.9999319 ],
       [0.1475555 ],
       [0.00546581],
       [0.9963466 ],
       [0.9844464 ],
       [0.9991346 ],
       [0.999044  ],
       [0.2209142 ]], dtype=float32)

In [69]:
# Convert model_5 prediction probabilities to labels
model_5_preds = tf.squeeze(tf.round(model_5_pred_probs))
model_5_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([1., 1., 1., 0., 0., 1., 1., 1., 1., 0.], dtype=float32)>
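
Rounding a sigmoid output is the same as applying a 0.5 decision threshold. The NumPy sketch below uses the ten probabilities printed above to show the two views side by side. `pred_probs`, `preds_rounded` and `preds_threshold` are illustrative names, and NumPy stands in for the TensorFlow calls.

```python
import numpy as np

# The first ten prediction probabilities from the output above
pred_probs = np.array([[0.89048654], [0.89241564], [0.9999319], [0.1475555],
                       [0.00546581], [0.9963466], [0.9844464], [0.9991346],
                       [0.999044], [0.2209142]], dtype=np.float32)

# Round to the nearest integer and drop the extra dimension: (10, 1) -> (10,)
preds_rounded = np.squeeze(np.round(pred_probs))

# Equivalent view: anything above 0.5 is class 1, everything else is class 0
preds_threshold = (np.squeeze(pred_probs) > 0.5).astype(np.float32)

print(preds_rounded)  # [1. 1. 1. 0. 0. 1. 1. 1. 1. 0.]
```

Writing the threshold explicitly makes it easy to try a different cut-off later, for example if false positives are more costly than false negatives.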

In [70]:
# Calculate model_5 evaluation metrics
model_5_results = calculate_results(y_true=val_labels,
                                    y_pred=model_5_preds)
model_5_results

{'accuracy': 78.21522309711287,
 'precision': 0.7843528649827862,
 'recall': 0.7821522309711286,
 'f1': 0.7800522271093586}

In [71]:
# Compare model_5 results to baseline
compare_baseline_to_new_results(baseline_results, model_5_results)

Baseline accuracy: 79.27, New accuracy: 78.22, Difference: -1.05
Baseline precision: 0.81, New precision: 0.78, Difference: -0.03
Baseline recall: 0.79, New recall: 0.78, Difference: -0.01
Baseline f1: 0.79, New f1: 0.78, Difference: -0.01


**Using Pretrained Embeddings (Transfer Learning for NLP)**