**Natural Language Processing with TensorFlow**

- NLP problems are also called **sequence problems** because the data (words, characters or audio) is presented as an ordered sequence.

Natural Language covers
- Text (Such as email, blog post, book, Tweet)
- Speech (a conversation you have with a doctor, voice commands you give to a smart speaker)

- One **use case** is scanning incoming emails to see if they are spam or not (classification)
- Another **use case** is analyzing feedback complaints to find which segment of business it is talking about

Both of the above are referred to as *sequences*. You might come across terms like **seq2seq** (sequence to sequence), in other words, finding information in one sequence to produce another sequence.

A typical workflow in NLP is

*Text --> Turn into numbers --> Build a model --> Train the model --> use patterns to make predictions*

**What we are going to cover**

- Getting data
- Visualizing text
- Converting text into numbers using tokenization
- Turning our tokenized text into an embedding
- Modelling a text dataset
    - Starting with a baseline (TF-IDF)
    - Building deep learning models like
       - LSTM, GRU, Conv1D, Transfer Learning
- Comparing the performance of each of our models
- Combining the models into an **ensemble**
- Saving and loading a **trained model**
- Finding the most wrong predictions




**Download the Helper Functions**

In [36]:
# Download helper functions script
!wget https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py

--2025-10-07 11:38:37--  https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10246 (10K) [text/plain]
Saving to: ‘helper_functions.py.1’


2025-10-07 11:38:37 (57.3 MB/s) - ‘helper_functions.py.1’ saved [10246/10246]



**Import the helper functions**

In [37]:
# Import series of helper functions for the notebook
from helper_functions import unzip_data, create_tensorboard_callback, plot_loss_curves, compare_historys

**Download the Text Dataset**

We will be using the **Real or Not** dataset from **Kaggle**, which contains **text-based Tweets** about natural disasters.


The Real Tweets are actually about disasters, for example:

*Jetstar and Virgin forced to cancel Bali flights again because of ash from Mount Raung volcano*


The Not Real Tweets are Tweets not about disasters (they can be on anything), for example:

*Education is the most powerful weapon which you can use to change the world.Nelson #Mandela #quote*


In [38]:
# Download data (same as from Kaggle)
!wget "https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip"

# Unzip data
unzip_data("nlp_getting_started.zip")

--2025-10-07 11:38:37--  https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 142.251.183.207, 64.233.179.207, 173.194.193.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.251.183.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 607343 (593K) [application/zip]
Saving to: ‘nlp_getting_started.zip.1’


2025-10-07 11:38:37 (24.2 MB/s) - ‘nlp_getting_started.zip.1’ saved [607343/607343]



**Unzipping Files**
 By Unzipping we get the following files
 - **sample_submission.csv:-** An example of the file that you would submit to Kaggle competition
 - **train.csv:-** training samples of real and not real disaster Tweets
 - **test.csv:-** testing samples of real and not real disaster Tweets


**Visualizing a Text Dataset**

In [39]:
import pandas as pd

train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")
train_df.head(5)

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [40]:
# Shuffle training dataframe
train_df_shuffled = train_df.sample(frac=1, random_state=42)
# shuffle with random_state=42 for reproducibility
train_df_shuffled.head()

Unnamed: 0,id,keyword,location,text,target
2644,3796,destruction,,So you have a new weapon that can cause un-ima...,1
2227,3185,deluge,,The f$&amp;@ing things I do for #GISHWHES Just...,0
5448,7769,police,UK,DT @georgegalloway: RT @Galloway4Mayor: ÛÏThe...,1
132,191,aftershock,,Aftershock back to school kick off was great. ...,0
6845,9810,trauma,"Montgomery County, MD",in response to trauma Children of Addicts deve...,0


In the **training data** we have a **target** column and from the analysis of the text we will try to predict the **target** column.
The **test dataset** does not have a **target** column.

Inputs (text column) -> Machine Learning Algorithm -> Outputs (target column)

In [41]:
test_df.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


In [42]:
train_df.target.value_counts()

target,count
0,4342
1,3271


So, from the above we can deduce that we are dealing with a **binary classification** problem.
Also, the dataset is fairly balanced, with roughly 57% negative class (4,342 Tweets) and 43% positive class (3,271 Tweets).

Where,
- 1 = a real disaster Tweet
- 0 = not a real disaster Tweet
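
The class shares can be checked from the counts printed above. This is a minimal plain-Python sketch with the two counts copied in by hand; in the notebook itself, `train_df.target.value_counts(normalize=True)` should give the same proportions directly.

```python
# Class counts copied from the value_counts() output above
counts = {0: 4342, 1: 3271}
total = sum(counts.values())

# Share of each class as a percentage of all training samples
percentages = {label: 100 * n / total for label, n in counts.items()}

for label, pct in percentages.items():
    print(f"target={label}: {pct:.1f}%")
```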

In [43]:
# How many samples total?
print(f"Total training samples: {len(train_df)}")
print(f"Total test samples: {len(test_df)}")
print(f"Total samples: {len(train_df) + len(test_df)}")

Total training samples: 7613
Total test samples: 3263
Total samples: 10876


**Train/Test Split**
We have an abundance of test samples: 3,263 of the 10,876 samples (about 30%) sit in the test set, whereas an 80/20 split is a more common recommendation.

**Question:** Why visualize random samples? You could visualize samples in order but this could lead to only seeing a certain subset of data. Better to visualize a substantial quantity (100+) of random samples to get an idea of the different kinds of data you're working with. In machine learning, never underestimate the power of randomness.

In [44]:
import random
random_index = random.randint(0, len(train_df)-5) # create random indexes not higher than the total number of samples

'''
train_df_shuffled[['text','target']]: Get only the "text" and "target" column
[random_index:random_index+5]: Get only five rows starting from the random index
.itertuples(): Iterate over the rows as namedtuples.
The return is a named tuple like
Pandas(Index=100, text="Fire in the building", target=1)
Pandas(Index=101, text="Lovely weather today", target=0)
We get these results in the row variable
_: is used to get the index
text: is used to get the text
target: is used to get the target (0 or 1)
'''


for row in train_df_shuffled[['text', 'target']][random_index:random_index+5].itertuples():
  _, text, target = row
  print(f"Target: {target}", "(real disaster)" if target > 0 else "(not real disaster)")
  print(f"Text:\n{text}\n")
  print("---\n")

Target: 1 (real disaster)
Text:
Pakistan's Supreme Court rules to allow military trials for suspects in terrorism cases http://t.co/ajpbdCalew

---

Target: 0 (not real disaster)
Text:
ITS JUST NOW SINKING IN THIS IS THE LAST EPISODE MY HEART HURTS SO BAD

---

Target: 1 (real disaster)
Text:
Burning buildings? Media outrage? http://t.co/pHixZnv1YN

---

Target: 0 (not real disaster)
Text:
Has An Ancient Nuclear Reactor Been Discovered In Africa? ÛÒ Your... http://t.co/qadUfO8zXg

---

Target: 1 (real disaster)
Text:
@JoeDawg42 TOR for a TOR situation only. Wind damage enhanced wording is key IMO

---



**Split data into Training and Validation Sets**

- The test data has no **labels** and we need a way to evaluate the model, so we split the **training data** into a **training set** and a **validation set.**
- The model trains on the **training set** and its performance is checked on the unseen **validation set.**

In [45]:
from sklearn.model_selection import train_test_split

'''
Each column is converted to a NumPy array with .to_numpy().
train_df_shuffled["text"] is a pandas Series
(basically a 1-D labeled array with an index).
train_test_split accepts NumPy arrays, plain Python lists and pandas objects.
Passing a pandas Series directly also works, but the split pieces then come
back as pandas Series that keep their original index.
Converting first gives plain NumPy arrays, which are simple to slice
and to pass to the model later on.
'''

# Use train_test_split to split training data into training and validation sets
train_sentences, val_sentences, train_labels, val_labels = train_test_split(train_df_shuffled["text"].to_numpy(),
                                                                            train_df_shuffled["target"].to_numpy(),
                                                                            test_size=0.1,
                                                                            random_state=42)


In [46]:
# Check the lengths
len(train_sentences), len(train_labels), len(val_sentences), len(val_labels)

(6851, 6851, 762, 762)
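
The lengths above follow from `test_size=0.1` applied to the 7,613 training rows. The sketch below reproduces the arithmetic, assuming the rounding rule that scikit-learn is documented to use (validation size rounded up, training size rounded down); it is an illustration, not the library's code.

```python
import math

n_samples = 7613   # total training samples, from the count printed earlier
test_size = 0.1    # same fraction passed to train_test_split

n_val = math.ceil(test_size * n_samples)           # validation size rounds up
n_train = math.floor((1 - test_size) * n_samples)  # training size rounds down

print(n_train, n_val)  # 6851 762
```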

In [47]:
# View the first 10 training sentences and their labels
train_sentences[:10], train_labels[:10]

(array(['@mogacola @zamtriossu i screamed after hitting tweet',
        'Imagine getting flattened by Kurt Zouma',
        '@Gurmeetramrahim #MSGDoing111WelfareWorks Green S welfare force ke appx 65000 members har time disaster victim ki help ke liye tyar hai....',
        "@shakjn @C7 @Magnums im shaking in fear he's gonna hack the planet",
        'Somehow find you and I collide http://t.co/Ee8RpOahPk',
        '@EvaHanderek @MarleyKnysh great times until the bus driver held us hostage in the mall parking lot lmfao',
        'destroy the free fandom honestly',
        'Weapons stolen from National Guard Armory in New Albany still missing #Gunsense http://t.co/lKNU8902JE',
        '@wfaaweather Pete when will the heat wave pass? Is it really going to be mid month? Frisco Boy Scouts have a canoe trip in Okla.',
        'Patient-reported outcomes in long-term survivors of metastatic colorectal cancer - British Journal of Surgery http://t.co/5Yl4DC1Tqt'],
       dtype=object),
 array([0,

We now have a **training and validation set**.

**Converting Text into Numbers**

- We now have **training set and validation set** containing Tweets and Labels

**Question:** What is the most important step before we can use a **machine learning algorithm** with our text data?

**Answer** Turn the text into numbers.

*For any machine learning algorithm, the inputs need to be in numerical form.*

In NLP, there are two main concepts of turning text into numbers

- **Tokenization:** A straight mapping from a word/character/sub-word to a numerical value. There are three main levels of tokenization
   - Using **word-level tokenization** with the sentence like "I love NLP" might result in "I" being "0", "love" being "1" and "NLP" being "2". Every word in the **sequence** is considered a single **token**
   - **Character Level Tokenization** Converting the letters A-Z to value *1-26* and each character in the sequence is considered a single **token**
   - **Sub-word tokenization** sits between **word-level** and **character-level** tokenization. It involves breaking words into smaller parts and then converting those smaller parts to numbers. For example, "my favourite food is pineapple pizza" might become "my, fav, avour, rite, fo, oo, od, is, pin, ine, app, le, piz, za". After this, these **sub-words** are converted to numerical form.

- **Embeddings-** An embedding is a representation of natural language which can be learned. The **representation** is in the form of a **feature vector**. For example, the word "dance" could be represented by the 5-dimensional vector [-0.8547, 0.4559, -0.3332, 0.9877, 0.1112]. It's important to note here that the size of the feature vector is tuneable. There are two ways to use embeddings:
  - **Create your own embedding** Once your text has been turned into numbers (required for embedding), you can put them through an **embedding layer** such as **tf.keras.layers.Embedding** and an embedding representation will be learned during model training
  - **Reuse a pre-learned embedding** Many pre-trained embeddings exist online. These **pre-trained** embeddings have often been learned on large corpora of text and thus have a good **underlying representation of natural language.** We can use a **pre-trained embedding** to initialize our model and later **fine-tune** it to our own specific task.

  **Which level of tokenization to use** mostly depends on your problem. You can try character-level and word-level tokenization/embeddings and choose whichever performs best.
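
To make the word-level and character-level ideas concrete, here is a minimal plain-Python sketch. The mapping of non-letters to 0 is an assumption made for this example only; real tokenizers handle spaces and punctuation in their own ways.

```python
import string

sentence = "I love NLP"

# Word-level: each distinct word gets the next free integer
word_to_id = {}
for word in sentence.split():
    word_to_id.setdefault(word, len(word_to_id))
word_tokens = [word_to_id[w] for w in sentence.split()]

# Character-level: map a-z to 1-26; anything else becomes 0 (assumption for this sketch)
char_to_id = {ch: i for i, ch in enumerate(string.ascii_lowercase, start=1)}
char_tokens = [char_to_id.get(ch, 0) for ch in sentence.lower()]

print(word_tokens)  # [0, 1, 2]
print(char_tokens)  # [9, 0, 12, 15, 22, 5, 0, 14, 12, 16]
```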


**Text Vectorization (tokenization)**

Creating *tokens* is the most important step. To tokenize our text, we will use the following layer:
**tf.keras.layers.TextVectorization** (older TensorFlow releases exposed it as **tf.keras.layers.experimental.preprocessing.TextVectorization**)

The **TextVectorization** layer takes the following parameters as input

- **max_tokens:-** Maximum number of words in your vocabulary (e.g. 20,000 or the *number of unique words or text*). It also includes **OOV (out of vocabulary)** tokens
- **standardize:-** Method for standardizing text. Default is **lower_and_strip_punctuation** which lowers text and removes all punctuation marks.
- **split:-** Text splitting, default is **whitespace** which splits text on spaces
- **output_mode:-** How to output tokens. Can be **int** (integer mapping), **multi_hot** (a 0/1 vector marking which vocabulary words are present; older releases called this **binary**), **count** or **tf_idf**
- **output_sequence_length:-** Length of tokenized sequence to output. For example, if *output_sequence_length=150*, all tokenized sequences will be **150 tokens long.**
- **pad_to_max_tokens:-** Defaults to *False*. If *True*, the output feature axis is padded to *max_tokens* even if the *number of unique tokens* in the vocabulary is less than *max_tokens.* This only applies to the **multi_hot**, **count** and **tf_idf** output modes.
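
The `standardize`, `split` and `output_sequence_length` steps can be pictured with a small plain-Python sketch. `toy_standardize` and `toy_pad` are hypothetical helpers written for this illustration and only imitate the defaults described above; the real layer pads with the integer 0 after mapping tokens to IDs.

```python
import string

def toy_standardize(text):
    """Lowercase the text and drop ASCII punctuation (hypothetical helper)."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def toy_pad(tokens, output_sequence_length, pad_value=""):
    """Truncate or right-pad a token list to a fixed length (hypothetical helper)."""
    clipped = tokens[:output_sequence_length]
    return clipped + [pad_value] * (output_sequence_length - len(clipped))

# Standardize, then split on whitespace
tokens = toy_standardize("There's a flood in my street!").split()
print(tokens)              # ['theres', 'a', 'flood', 'in', 'my', 'street']
print(toy_pad(tokens, 8))  # two empty padding slots appended
print(toy_pad(tokens, 3))  # truncated to the first three tokens
```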


In [48]:
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

# Use the default TextVectorization variables
text_vectorizer = TextVectorization(
                                    max_tokens=None, # how many words in the vocabulary (all of the different words in your text)
                                    standardize="lower_and_strip_punctuation", # how to process text
                                    split="whitespace", # how to split tokens
                                    ngrams=None, # create groups of n-words?
                                    output_mode="int", # how to map tokens to numbers
                                    output_sequence_length=None, # how long should the output sequence of tokens be?
                                    # pad_to_max_tokens=True # Not valid if using max_tokens=None
                                    )



The **TextVectorization** object has been initialized with default settings but let's customize it a little.

- We will particularly customize the **max_tokens** and **output_sequence_length** parameters

- The **max_tokens** value (number of words in the vocabulary) is commonly a multiple of 10,000 (for example 10,000, 20,000 or 30,000). For our case we will use **10,000**

The **output_sequence_length** will be the average number of tokens per Tweet in the training set.





In [49]:
'''
This code snippet loops over each sentence in 'train_sentences',
splits the sentence on whitespace,
and records the number of tokens in each sentence,
which gives a list like [5, 3, 7, 2, ...].
It then sums the list and divides by the total number of sentences.

Finally, we round the result to get the average number of tokens per sentence.

'''
round(sum([len(i.split()) for i in train_sentences])/len(train_sentences))

15

In [50]:
max_vocab_length = 10000
# max number of words to have in our vocabulary
max_length = 15
''' max length our sequences will be (e.g. how many words from a
    Tweet does our model see?)'''

text_vectorizer = TextVectorization(max_tokens=max_vocab_length,
                                    output_mode="int",
                                    output_sequence_length=max_length)

**Training Data Mapping**
**Text Vectorizer** is a special layer which learns **how to turn text(words) into numbers**

**What .adapt(train_sentences) does**

The **.adapt()** step is where the text_vectorizer actually learns the vocabulary.
It scans through all your **training sentences** to figure out **which words exist** and **how often they appear.**

For example,

train_sentences = ["I love pizza", "Pizza is delicious", "I hate cold pizza"]

- the .adapt() will read all those sentences
- build a vocabulary: A dictionary of all unique words like

['', '[UNK]', 'pizza', 'i', 'love', 'is', 'delicious', 'hate', 'cold']

(index 0, the empty string, is reserved for padding and index 1, '[UNK]', for out-of-vocabulary words)

It then assigns an integer number to each word.

'pizza' → 2
'i' → 3
'love' → 4
'is' → 5
...

Now, whenever you pass a new sentence to **text_vectorizer** it converts it automatically into a **sequence of numbers**

text_vectorizer(["I love pizza"])  ➜  [[3, 4, 2]]
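
The vocabulary-building idea can be sketched in plain Python. `toy_adapt` and `toy_vectorize` are hypothetical stand-ins, not the Keras implementation; the alphabetical tie-break between equally frequent words is an arbitrary choice made for this sketch.

```python
from collections import Counter

def toy_adapt(sentences, max_tokens=None):
    """Build a frequency-sorted vocabulary; slots 0 and 1 are reserved."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    # Most frequent first; ties broken alphabetically (arbitrary choice for this sketch)
    words = sorted(counts, key=lambda w: (-counts[w], w))
    vocab = ["", "[UNK]"] + words
    return vocab[:max_tokens] if max_tokens else vocab

def toy_vectorize(sentence, vocab):
    """Map each word to its vocabulary index, using 1 for unknown words."""
    index = {word: i for i, word in enumerate(vocab)}
    return [index.get(word, 1) for word in sentence.lower().split()]

demo_sentences = ["I love pizza", "Pizza is delicious", "I hate cold pizza"]
vocab = toy_adapt(demo_sentences)
print(vocab)
# ['', '[UNK]', 'pizza', 'i', 'cold', 'delicious', 'hate', 'is', 'love']
print(toy_vectorize("I love burgers", vocab))  # [3, 8, 1]
```

The most frequent words ("pizza", then "i") get the lowest IDs, and the unseen word "burgers" falls back to the `[UNK]` index 1. IDs for words that appear equally often depend on the tie-break, so they need not match the illustration above.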


In [51]:
text_vectorizer.adapt(train_sentences)

Now the **training data** is mapped. Let's try the **text_vectorizer** on a custom sentence.

In [52]:
# Create sample sentence and tokenize it
sample_sentence = "There's a flood in my street!"
text_vectorizer([sample_sentence])

<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[264,   3, 232,   4,  13, 698,   0,   0,   0,   0,   0,   0,   0,
          0,   0]])>

Our **text** has been converted into **numbers**.
- Notice the 0's at the end of the returned tensor. This is because we set the output length to **15**, so any sequence we input is output with a length of **15** (shorter sequences are padded with 0's).

In [53]:
# Choose a random sentence from the training dataset and tokenize it
random_sentence = random.choice(train_sentences)
print(f"Original text:\n{random_sentence}\
      \n\nVectorized version:")
text_vectorizer([random_sentence])

Original text:
Real Hip Hop: Apollo Brown Feat M.O.P. - Detonate 
#JTW http://t.co/cEiaO1TEXr      

Vectorized version:


<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[ 369, 2436, 2950, 1044,  631, 1210,  940,  437,    1,    1,    0,
           0,    0,    0,    0]])>

Let's check the **get_vocabulary()** function

In [54]:
words_in_vocab = text_vectorizer.get_vocabulary()
top_5_words = words_in_vocab[:5]
bottom_5_words = words_in_vocab[-5:]
print(f"Number of words in vocab: {len(words_in_vocab)}")
print(f"Top 5 most common words: {top_5_words}")
print(f"Bottom 5 least common words: {bottom_5_words}")

Number of words in vocab: 10000
Top 5 most common words: ['', '[UNK]', np.str_('the'), np.str_('a'), np.str_('in')]
Bottom 5 least common words: [np.str_('pages'), np.str_('paeds'), np.str_('pads'), np.str_('padres'), np.str_('paddytomlinson1')]


**Creating an Embedding using an Embedding Layer**

- We now have a way of **mapping text to numbers**. We will go a step further and turn those **numbers into an embedding**

But why is **embedding a necessary step**?

- After the **vectorization layer** our text sentences become something like
'I love pizza' ==> [4,6,2,0,0]. These are just word IDs, simple integers. The problem is that nothing can be deduced from these **numbers**: we cannot tell that **pizza** and **burger** are similar, or that **pizza** and **table** are not.

The model just sees different integers, like phone numbers:

pizza = 125
burger = 89
table = 500

There is no relationship between these numbers.

**Enter the Embedding Layer**

- The embedding layer gives meaning to those numbers. It takes each **word ID** and converts it into a **dense vector of real numbers**: a small list of numbers that represents what the word means in context.

Word ID → Embedding Vector

2 (“pizza”)  → [0.12, -0.45, 0.88, 0.33, -0.02]

6 (“love”)   → [0.91,  0.11, 0.55, -0.23, 0.70]

4 (“I”)      → [0.02, -0.09, 0.30,  0.45, 0.10]

Now, instead of just having [4, 6, 2],
your sentence becomes something like:

[
 [0.02, -0.09, 0.30, 0.45, 0.10],

 [0.91,  0.11, 0.55, -0.23, 0.70],

 [0.12, -0.45, 0.88, 0.33, -0.02]
]
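
Mechanically, an embedding is a table lookup: each word ID selects one row of numbers. The sketch below uses the illustrative vectors from the example above; `embed` is a hypothetical helper, and in a real layer the table is a trainable weight matrix.

```python
# Hypothetical 5-dimensional embedding table keyed by word ID
# (values copied from the illustration above)
embedding_table = {
    2: [0.12, -0.45, 0.88, 0.33, -0.02],  # "pizza"
    6: [0.91, 0.11, 0.55, -0.23, 0.70],   # "love"
    4: [0.02, -0.09, 0.30, 0.45, 0.10],   # "I"
}

def embed(token_ids, table):
    """Replace each token ID with its vector: a plain table lookup."""
    return [table[token_id] for token_id in token_ids]

sentence_ids = [4, 6, 2]  # "I love pizza"
embedded = embed(sentence_ids, embedding_table)
print(len(embedded), len(embedded[0]))  # 3 5
```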

**Why is this powerful**
The embeddings help the model understand relationships between words:

- "king" and "queen" tend to end up with similar vectors, differing mostly along directions related to gender

- "pizza" and "burger" appear in similar food contexts, so their vectors tend to be close


Embeddings also **reduce dimensionality**. Instead of a one-hot vector as long as the vocabulary (10,000 entries here), each word becomes a small dense vector of, for example, 16, 32 or 128 values.

Embeddings are learned during training and you do not have to define the relationships yourself.
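
"Close" vectors are commonly compared with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; they are not values learned by any model in this notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors, chosen so the two food words point in a similar direction
vectors = {
    "pizza":  [0.9, 0.8, 0.1],
    "burger": [0.8, 0.9, 0.2],
    "table":  [-0.1, 0.2, 0.9],
}

food_sim = cosine_similarity(vectors["pizza"], vectors["burger"])
other_sim = cosine_similarity(vectors["pizza"], vectors["table"])
print(f"pizza vs burger: {food_sim:.2f}")
print(f"pizza vs table:  {other_sim:.2f}")
```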

**Analogy:**

Think of it like this:

**Vectorization =** giving every word a roll number (just an ID).

**Embedding =** giving every student (word) a personality profile —
strengths, weaknesses, interests (numbers that describe meaning).


TextVectorization
Turns text into word IDs
“I love pizza” → [4, 6, 2]

Embedding
Turns word IDs into meaning-rich vectors
[4, 6, 2] → [[0.02, -0.09, 0.30...], ...]





**Creating an Embedding Layer**

- As discussed above, the **powerful thing about embeddings** is that they can be learned during training. Rather than being stuck with a static mapping like **1=1, 2=2**, a word's numeric representation can be improved as the model goes through data samples.

*We will see what an embedding layer looks like by using the **tf.keras.layers.Embedding** layer*


The main parameters we are concerned about here are:

- **input_dim -** The size of the vocabulary (e.g. len(text_vectorizer.get_vocabulary())).
- **output_dim -** The size of the **output embedding vector**, for example, a value of **100** outputs a feature vector of **size 100 for each word.**
- **embeddings_initializer -** How to initialize the embedding matrix. The default is **uniform**, which randomly initializes the embedding matrix with a **uniform distribution.** This can be changed by using **pre-learned embeddings.**
- **input_length -** Length of sequences being passed to the *embedding layer.* Recent Keras releases treat this argument as deprecated, so it can usually be omitted.



In [55]:
tf.random.set_seed(42)
from tensorflow.keras import layers

embedding = layers.Embedding(input_dim=max_vocab_length, # size of the vocabulary
                             output_dim=128, # set size of embedding vector
                             embeddings_initializer="uniform", # default, initialize randomly
                             input_length=max_length, # how long is each input (may be deprecated in recent Keras versions)
                             name="embedding_1")

embedding



<Embedding name=embedding_1, built=False>

In [56]:
# Get a random sentence from training set
random_sentence = random.choice(train_sentences)
print(f"Original text:\n{random_sentence}\
      \n\nEmbedded version:")

# Embed the random sentence (turn it into numerical representation)
sample_embed = embedding(text_vectorizer([random_sentence]))
sample_embed

Original text:
@EvaHanderek @MarleyKnysh great times until the bus driver held us hostage in the mall parking lot lmfao      

Embedded version:


<tf.Tensor: shape=(1, 15, 128), dtype=float32, numpy=
array([[[ 0.0350262 , -0.00978339, -0.02676543, ..., -0.0073467 ,
         -0.02346202, -0.04070182],
        [ 0.0350262 , -0.00978339, -0.02676543, ..., -0.0073467 ,
         -0.02346202, -0.04070182],
        [-0.00649419, -0.0431314 ,  0.01509755, ...,  0.02880701,
         -0.04773539, -0.00163373],
        ...,
        [ 0.00105276,  0.00192619, -0.03825718, ...,  0.04465655,
         -0.01222809, -0.00943607],
        [-0.00403442,  0.00815734,  0.03066098, ..., -0.01933029,
         -0.02879018,  0.0165939 ],
        [-0.0157411 ,  0.01961218, -0.00990567, ...,  0.02623982,
          0.0370774 ,  0.00839784]]], dtype=float32)>

Embedded version:
<tf.Tensor: shape=(1, 15, 128)>

- **1:** the batch size. We passed a single sentence, so it is 1.
- **15:** the sequence length (the number of token positions per sentence). Every sentence comes out **15 tokens long**: shorter sentences are **padded with 0s** and longer ones are truncated by the text vectorizer.
- **128:** the **embedding dimension**, i.e. how many numbers are used to represent each word's meaning.

Embedding(input_dim=vocab_size, output_dim=128)

This means that each word (or token) will be represented by **128 floating-point numbers** that capture its semantic meaning, i.e. how it relates to other words.

So the tensor shape (1, 15, 128) means:

“We have 1 sentence, which is **15 tokens long,** and **each token is represented by 128 features** that capture its meaning.”

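| Word | Feature 1 | Feature 2 | ... | Feature 128 |
|------|-----------|-----------|-----|-------------|
| she  | 0.01      | -0.02     | ... | 0.03        |
| keep | -0.04     | 0.05      | ... | 0.01        |
| it   | 0.03      | -0.01     | ... | 0.02        |
| wet  | -0.07     | 0.02      | ... | 0.04        |
| like | 0.05      | 0.01      | ... | -0.06       |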










**Each token in the sentence gets turned into a feature vector of length 128**

In [57]:
# Check out a single token's embedding
sample_embed[0][0]

<tf.Tensor: shape=(128,), dtype=float32, numpy=
array([ 0.0350262 , -0.00978339, -0.02676543,  0.03956392, -0.03344665,
       -0.04311233,  0.02388332, -0.01232858, -0.00422556,  0.04479102,
       -0.04634058,  0.02043059, -0.04931803, -0.03226129, -0.01436562,
        0.03988515,  0.0421787 ,  0.01196321, -0.03263714, -0.03924365,
       -0.01383326, -0.0406701 , -0.00065632, -0.0104547 ,  0.04966575,
       -0.01922205,  0.03410057,  0.04362017, -0.0415011 , -0.02972913,
       -0.01344228,  0.02470872, -0.01575663,  0.0412953 , -0.03093391,
        0.01897729,  0.02087748,  0.02792082, -0.04690794, -0.04125341,
       -0.00636761, -0.02103144,  0.01261086, -0.00069611, -0.01704327,
        0.03868764, -0.01426035, -0.01755022,  0.00700565,  0.03556034,
       -0.0017521 ,  0.02194133, -0.01111495,  0.03713404,  0.01719724,
       -0.03117243,  0.00123413, -0.01852653, -0.01440245, -0.00058364,
       -0.027253  ,  0.01461956,  0.02197878, -0.04186672,  0.0480007 ,
       -0.040687

These values might not mean much to us but they're what our computer sees each word as. When our model looks for patterns in different samples, these values will be updated as necessary.

Now that we've got a way to turn our text data into numbers, we can start to build machine learning models to model it.

To get plenty of practice, we're going to build a series of different models, each as its own experiment. We'll then compare the results of each model and see which one performed best.

More specifically, we'll be building the following:

**Model 0:** Naive Bayes (baseline)

**Model 1:** Feed-forward neural network (dense model)

**Model 2:** LSTM model

**Model 3:** GRU model

**Model 4:** Bidirectional-LSTM model

**Model 5:** 1D Convolutional Neural Network

**Model 6:** TensorFlow Hub Pretrained Feature Extractor

**Model 7:** Same as model 6 with 10% of training data

Each experiment will go through the following steps:

- Construct the model
- Train the model
- Make predictions with the model
- Track prediction evaluation metrics for later comparison

**Model 0: Getting a baseline**

In all machine learning modelling experiments, it is important to create a **baseline** model so that later models can be benchmarked against it.

To create our baseline, we'll create a **Scikit-Learn Pipeline** using the **TF-IDF (term frequency-inverse document frequency) formula** to convert our words to numbers and then model them with the **Multinomial Naive Bayes algorithm.** This choice follows the Scikit-Learn machine learning map.
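
To give a feel for what TF-IDF computes, here is a plain-Python sketch on three made-up documents. It assumes the smoothed formula that scikit-learn documents as the `TfidfVectorizer` default, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, and it uses a simple whitespace split rather than the library's tokenizer.

```python
import math
from collections import Counter

# Made-up mini corpus for illustration
docs = ["forest fire near town", "fire crews fight forest fire", "lovely weather today"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
df = Counter(term for tokens in tokenized for term in set(tokens))

def smoothed_idf(term):
    """Rarer terms get a larger weight."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf_vector(tokens):
    """Term count times idf, scaled so the vector has unit L2 norm."""
    tf = Counter(tokens)
    weights = {term: count * smoothed_idf(term) for term, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

first = tfidf_vector(tokenized[0])
for term, weight in sorted(first.items()):
    print(f"{term}: {weight:.3f}")
```

In the first document, "near" and "town" appear in only one document each, so they receive a higher weight than "fire" and "forest", which appear in two.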


In [58]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Create tokenization and modelling pipeline
model_0 = Pipeline([
                    ("tfidf", TfidfVectorizer()), # convert words to numbers using tfidf
                    ("clf", MultinomialNB()) # model the text
])

# Fit the pipeline to the training data
model_0.fit(train_sentences, train_labels)

Let's evaluate our model and find our baseline metric.

In [59]:
baseline_score = model_0.score(val_sentences, val_labels)
print(f"Our baseline model achieves an accuracy of: {baseline_score*100:.2f}%")

Our baseline model achieves an accuracy of: 79.27%


Let's do some **predictions** with our baseline model

In [60]:
# Make predictions
baseline_preds = model_0.predict(val_sentences)
baseline_preds[:20]

array([1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1])

**Creating an Evaluation Function for our Model Experiments**

- We are going to create this **function** because we are going to **evaluate several models** in the same way going forward.
- Let's create a helper function which takes **an array of predictions and ground truth labels** and compute the following
    - Accuracy
    - Precision
    - Recall
    - F1-Score

Since we are dealing with a **classification problem**, the above metrics are appropriate. If we were dealing with a **regression problem**, we would use metrics like **MAE (Mean Absolute Error)**

In [61]:
# Function to evaluate: accuracy, precision, recall, f1-score
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def calculate_results(y_true, y_pred):
  """
  Calculates model accuracy, precision, recall and f1 score of a binary classification model.

  Args:
  -----
  y_true = true labels in the form of a 1D array
  y_pred = predicted labels in the form of a 1D array

  Returns a dictionary of accuracy, precision, recall, f1-score.
  """
  # Calculate model accuracy
  model_accuracy = accuracy_score(y_true, y_pred) * 100
  # Calculate model precision, recall and f1 score using "weighted" average
  model_precision, model_recall, model_f1, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted")
  model_results = {"accuracy": model_accuracy,
                  "precision": model_precision,
                  "recall": model_recall,
                  "f1": model_f1}
  return model_results

In [62]:
# Get baseline results
baseline_results = calculate_results(y_true=val_labels,
                                     y_pred=baseline_preds)
baseline_results

{'accuracy': 79.26509186351706,
 'precision': 0.8111390004213173,
 'recall': 0.7926509186351706,
 'f1': 0.7862189758049549}
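
The "weighted" average can be unpacked by hand: compute precision, recall and F1 per class, then weight each class by its share of the true labels. The sketch below is a plain-Python illustration with made-up labels; `weighted_prf` is a hypothetical helper, not scikit-learn's implementation.

```python
def weighted_prf(y_true, y_pred):
    """Support-weighted precision, recall and F1 across all classes."""
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        p_c = tp / (tp + fp) if tp + fp else 0.0
        r_c = tp / (tp + fn) if tp + fn else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if p_c + r_c else 0.0
        weight = (tp + fn) / total  # share of true samples in this class
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return precision, recall, f1

# Made-up labels for illustration
y_true = [1, 1, 0, 0, 0]
y_pred = [1, 0, 0, 0, 1]
precision, recall, f1 = weighted_prf(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(precision, recall, f1, accuracy)
```

Weighted recall always equals accuracy, which is why the baseline's recall (0.7927) matches its accuracy (79.27%) in the results above.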

**Model 1: A simple Dense Model**

The first **deep** model we are going to build is a **single-layer dense model**. It will:

- Take our **text** and **labels** as input
- **Tokenize** the **text**
- Create an **embedding**
- Find the average of the **embedding** (using **Global Average Pooling**) and then pass the average through a **fully connected layer** with one output and a **sigmoid** activation.

And since we're going to be building a number of TensorFlow deep learning models, we'll import our **create_tensorboard_callback()** function from helper_functions.py to keep track of the results of each.


In [63]:
# Create tensorboard callback (need to create a new one for each model)
from helper_functions import create_tensorboard_callback

# Create directory to save TensorBoard logs
SAVE_DIR = "model_logs"

In [64]:
# Build model with the Functional API
from tensorflow.keras import layers

inputs = layers.Input(shape=(1,), dtype="string") # inputs are 1-dimensional strings
x = text_vectorizer(inputs) # turn the input text into numbers
x = embedding(x) # create an embedding of the tokenized integers
x = layers.GlobalAveragePooling1D()(x) # lower the dimensionality of the embedding (try running the model without this layer and see what happens)
outputs = layers.Dense(1, activation="sigmoid")(x) # create the output layer, want binary outputs so use sigmoid activation
model_1 = tf.keras.Model(inputs, outputs, name="model_1_dense") # construct the model


In [65]:
'''
The Sequential API code is

from tensorflow.keras import Sequential, layers

model_1_seq = Sequential([
    layers.Input(shape=(1,), dtype="string"),
    text_vectorizer,                      # convert text → numbers
    embedding,                            # convert numbers → meaning vectors
    layers.GlobalAveragePooling1D(),      # summarize each sentence
    layers.Dense(1, activation="sigmoid") # output: binary classification
], name="model_1_dense_sequential")

'''


Let's discuss the code above

- Our model takes a **1-dimensional string as input**.
- It then tokenizes the string using **text_vectorizer** and creates an **embedding**.
- By using a **GlobalAveragePooling1D()** layer, we reduce the dimensionality of the tensor we pass to the **output layer**: the layer averages over the sequence (token) axis, so each sentence becomes a single vector.
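To make the pooling step concrete, here is a minimal pure-Python sketch of the computation for one sentence, ignoring batching and masking. The function name `global_average_pool_1d` and the toy values are assumptions for illustration; in the model itself this work is done by `layers.GlobalAveragePooling1D()`.

```python
def global_average_pool_1d(sequence):
    """Average a list of token vectors feature by feature."""
    num_tokens = len(sequence)
    # zip(*sequence) groups the values of each embedding dimension together.
    return [sum(feature_values) / num_tokens for feature_values in zip(*sequence)]


# One toy "sentence": 3 tokens, each embedded in 4 dimensions (made-up values).
embedded_sentence = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 2.0, 2.0, 0.0],
    [5.0, 4.0, 2.0, 2.0],
]

pooled = global_average_pool_1d(embedded_sentence)
print(pooled)  # [3.0, 2.0, 2.0, 2.0]
```

The input has shape (tokens, embedding_dim) and the output has shape (embedding_dim,), which is why the Dense layer after it receives one fixed-size vector per sentence regardless of sentence length.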

In [66]:
# Compile model
model_1.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

In [67]:
model_1.summary()


Most of the trainable parameters are contained within the **embedding layer**. Recall we created an embedding of size **128 (output_dim=128)** for a vocabulary of size **10,000** (input_dim=10000), hence the **1,280,000** trainable parameters.
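The parameter count can be checked by hand. The arithmetic below assumes the model is exactly as built above: an embedding with input_dim=10000 and output_dim=128, followed by a single Dense unit, with no trainable parameters in the text vectorization and pooling layers.

```python
vocab_size = 10_000    # input_dim of the embedding layer
embedding_dim = 128    # output_dim of the embedding layer

# One 128-dimensional vector per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# The Dense output layer has one weight per pooled feature plus one bias.
dense_params = embedding_dim * 1 + 1

total_params = embedding_params + dense_params

print(f"Embedding: {embedding_params:,}")  # Embedding: 1,280,000
print(f"Dense:     {dense_params:,}")      # Dense:     129
print(f"Total:     {total_params:,}")      # Total:     1,280,129
```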

In [68]:
# Fit the model
model_1_history = model_1.fit(train_sentences, # input sentences can be a list of strings because the text preprocessing layer is built into the model
                              train_labels,
                              epochs=5,
                              validation_data=(val_sentences, val_labels),
                              callbacks=[create_tensorboard_callback(dir_name=SAVE_DIR,
                                                                     experiment_name="simple_dense_model")])

Saving TensorBoard log files to: model_logs/simple_dense_model/20251007-113839
Epoch 1/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 5s 15ms/step - accuracy: 0.6334 - loss: 0.6501 - val_accuracy: 0.7585 - val_loss: 0.5339
Epoch 2/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 4s 19ms/step - accuracy: 0.8084 - loss: 0.4666 - val_accuracy: 0.7887 - val_loss: 0.4738
Epoch 3/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 3s 13ms/step - accuracy: 0.8513 - loss: 0.3625 - val_accuracy: 0.7953 - val_loss: 0.4617
Epoch 4/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 3s 13ms/step - accuracy: 0.8873 - loss: 0.2962 - val_accuracy: 0.7900 - val_loss: 0.4678
Epoch 5/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 3s 13ms/step - accuracy: 0.9066 - loss: 0.2474 - val_accuracy: 0.7795 - val_loss: 0.4833


In [69]:
# Check the results
model_1.evaluate(val_sentences, val_labels)

24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.7684 - loss: 0.5156


[0.4833453297615051, 0.7795275449752808]
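The list returned by `evaluate()` holds the loss followed by the accuracy, and that accuracy is a fraction between 0 and 1. The baseline accuracy printed earlier appears to be a percentage, so the two need to be on the same scale before they are compared. The snippet below is a small sketch that uses only the numbers printed above; the variable names are assumptions for illustration.

```python
baseline_accuracy_pct = 79.26509186351706   # from baseline_results["accuracy"]
model_1_val_accuracy = 0.7795275449752808   # second value returned by model_1.evaluate()

# Put both accuracies on the percentage scale before comparing them.
model_1_accuracy_pct = model_1_val_accuracy * 100
difference = model_1_accuracy_pct - baseline_accuracy_pct

print(f"Baseline: {baseline_accuracy_pct:.2f}%")
print(f"Model 1:  {model_1_accuracy_pct:.2f}%")
print(f"Difference: {difference:+.2f} percentage points")
```

On these numbers the simple dense model lands a little over one percentage point below the baseline on validation accuracy.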

What **embedding.weights** gives you

- It returns a list of all the weight variables in that layer.
- A standard embedding layer has only one weight matrix, so the list has one element.
- Shape: (vocab_size, embedding_dim) → (10000, 128) in our example

In [70]:
embedding.weights

[<Variable path=embedding_1/embeddings, shape=(10000, 128), dtype=float32, value=[[-0.03937228 -0.01227688  0.01821358 ... -0.0341537  -0.05376278
    0.04604749]
  [ 0.02324624 -0.02324826 -0.01383099 ... -0.02071906 -0.03644653
   -0.02663261]
  [-0.01937372 -0.01664208 -0.01929737 ...  0.02645427 -0.02991473
    0.00916128]
  ...
  [ 0.0107158  -0.03941555 -0.03236634 ...  0.01032839  0.03100609
   -0.02696295]
  [-0.08993362 -0.05932302  0.05058928 ... -0.00678272 -0.04834094
    0.05455987]
  [-0.09880479 -0.01107263  0.07939789 ... -0.08908401 -0.02404846
    0.06953207]]>]

**shape=(10000, 128):** *10000* is the number of items in the vocabulary and *128* is the number of dimensions in each token's embedding vector.

In [71]:
embed_weights = model_1.get_layer("embedding_1").get_weights()[0]
print(embed_weights.shape)

(10000, 128)
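One way to read this matrix: row *i* is the learned vector for the token with index *i*, so embedding a tokenized sentence amounts to looking up one row per token. The sketch below uses a tiny made-up matrix (5 tokens, 3 dimensions) in place of the real (10000, 128) weights; all values and names are illustrative assumptions.

```python
# Toy embedding matrix: 5 vocabulary entries, 3 dimensions each (made-up values).
embedding_matrix = [
    [0.00, 0.00, 0.00],    # index 0
    [0.10, -0.20, 0.30],   # index 1
    [-0.40, 0.50, 0.60],   # index 2
    [0.70, 0.80, -0.90],   # index 3
    [0.05, 0.15, 0.25],    # index 4
]

# A tokenized "sentence": each integer is a position in the vocabulary.
token_ids = [3, 1, 3]

# Embedding lookup: pick the matrix row that matches each token index.
embedded = [embedding_matrix[token_id] for token_id in token_ids]

print(len(embedded), "tokens x", len(embedded[0]), "dimensions")
print(embedded[0] == embedded[2])  # the same token always maps to the same vector
```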


**Recurrent Neural Networks (RNNs)**

We will use a special kind of neural network, the **Recurrent Neural Network (RNN)**, for text data.

- **RNN:** Uses information from the **past** to help with the **future**. It takes an input **X** and computes an output **y** based on all previous inputs.

- The concept is helpful when dealing with **sequence data** such as passages of natural language text.

- When you read a sentence, you take into account the context of previous words when **deciphering the meaning** of the current word.

- When an RNN looks at a sequence of text (already in numerical form), the patterns it learns are continually updated based on the order of the sequence.
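The idea of a state that is updated in sequence order can be sketched with a one-unit recurrence in plain Python. This is a toy illustration, not the Keras implementation: the weights `w_x`, `w_h` and `bias` are fixed, made-up values, whereas a real RNN learns them during training.

```python
import math


def simple_rnn(inputs, w_x=0.5, w_h=0.8, bias=0.0):
    """Run a one-unit RNN over a sequence and return the final hidden state."""
    hidden = 0.0  # no information before the first input
    for x in inputs:
        # The new state mixes the current input with the previous state.
        hidden = math.tanh(w_x * x + w_h * hidden + bias)
    return hidden


sequence = [1.0, 2.0, 3.0]
forward_state = simple_rnn(sequence)
reversed_state = simple_rnn(list(reversed(sequence)))

# Same values, different order, different final state.
print(forward_state, reversed_state)
```

Because each step depends on the previous hidden state, feeding the same numbers in a different order gives a different result, which is what lets an RNN pick up on word order.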

**Recurrent neural networks** can be used for a number of sequence-based problems:

- **One to one:** one input, one output, such as image classification.
- **One to many:** one input, many outputs, such as image captioning (image input, a sequence of text as caption output).
- **Many to one:** many inputs, one output, such as text classification (classifying a Tweet as a real disaster or not a real disaster).
- **Many to many:** many inputs, many outputs, such as machine translation (translating English to Spanish) or speech to text (audio wave as input, text as output).
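The difference between many to one and many to many comes down to which hidden states are kept. The sketch below uses a toy one-unit recurrence with fixed, made-up weights (an assumption for illustration): keeping every state gives one output per input, while keeping only the last state gives a single output for the whole sequence. In Keras recurrent layers this choice is controlled by the `return_sequences` argument.

```python
import math
from itertools import accumulate


def step(hidden, x, w_x=0.5, w_h=0.8):
    """One recurrent update: combine the previous state with the current input."""
    return math.tanh(w_x * x + w_h * hidden)


inputs = [1.0, 2.0, 3.0, 4.0]

# accumulate yields the initial state first, so drop it to keep one state per input.
all_states = list(accumulate(inputs, step, initial=0.0))[1:]

many_to_many = all_states      # one output per input (e.g. translation)
many_to_one = all_states[-1]   # a single output for the sequence (e.g. classification)

print(len(many_to_many), "outputs for", len(inputs), "inputs")
print(many_to_one)
```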

When you come across RNNs in the wild, you'll most likely come across variants of the following:

- Long short-term memory cells (LSTMs).
- Gated recurrent units (GRUs).
- Bidirectional RNNs (which pass forward and backward along a sequence, left to right and right to left).












