**Natural Language Processing with TensorFlow**

- NLP problems are also called **sequence problems** because the data is presented as a sequence.

Natural language covers:
- Text (such as an email, blog post, book, or Tweet)
- Speech (a conversation you have with a doctor, voice commands you give to a smart speaker)

- One **use case** is scanning incoming emails to see whether they are spam or not (classification)
- Another **use case** is analyzing feedback and complaints to find which segment of the business each one is about

Both of the above are referred to as *sequences*. You might also come across the term **seq2seq** (sequence to sequence), which means using the information in one sequence to produce another sequence.

A typical workflow in NLP is

*Text --> Turn into numbers --> Build a model --> Train the model --> use patterns to make predictions*

**What we are going to cover**

- Getting data
- Visualizing text
- Converting text into numbers using tokenization
- Turning our tokenized text into an embedding
- Modelling a text dataset
    - Starting with a baseline (TF-IDF)
    - Building deep learning models like
       - LSTM, GRU, Conv1D, Transfer Learning
- Comparing the performance of each of our models
- Combining the models into an **ensemble**
- Saving and loading a **trained model**
- Finding the most wrong predictions




**Download the Helper Functions**

In [17]:
# Download helper functions script
!wget https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py

--2025-10-03 10:57:51--  https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10246 (10K) [text/plain]
Saving to: ‘helper_functions.py.2’


2025-10-03 10:57:51 (57.2 MB/s) - ‘helper_functions.py.2’ saved [10246/10246]



**Import the helper functions**

In [18]:
# Import series of helper functions for the notebook
from helper_functions import unzip_data, create_tensorboard_callback, plot_loss_curves, compare_historys

**Download the Text Dataset**

We will be using the **Real or Not** dataset from **Kaggle**, which contains **text-based Tweets** about natural disasters.


The Real Tweets are actually about disasters, for example:

*Jetstar and Virgin forced to cancel Bali flights again because of ash from Mount Raung volcano*


The Not Real Tweets are Tweets not about disasters (they can be on anything), for example:

*Education is the most powerful weapon which you can use to change the world.Nelson #Mandela #quote*


In [19]:
# Download data (same as from Kaggle)
!wget "https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip"

# Unzip data
unzip_data("nlp_getting_started.zip")

--2025-10-03 10:57:52--  https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 172.253.62.207, 142.251.163.207, 142.251.167.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|172.253.62.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 607343 (593K) [application/zip]
Saving to: ‘nlp_getting_started.zip.1’


2025-10-03 10:57:52 (48.9 MB/s) - ‘nlp_getting_started.zip.1’ saved [607343/607343]



**Unzipping Files**
Unzipping gives us the following files:
- **sample_submission.csv:** an example of the file you would submit to the Kaggle competition
- **train.csv:** training samples of real and not real disaster Tweets
- **test.csv:** testing samples of real and not real disaster Tweets


**Visualizing a Text Dataset**

In [20]:
import pandas as pd

train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")
train_df.head(5)

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | | | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | | | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [21]:
# Shuffle training dataframe (random_state=42 for reproducibility)
train_df_shuffled = train_df.sample(frac=1, random_state=42)
train_df_shuffled.head()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 2644 | 3796 | destruction | | So you have a new weapon that can cause un-ima... | 1 |
| 2227 | 3185 | deluge | | The f$&amp;@ing things I do for #GISHWHES Just... | 0 |
| 5448 | 7769 | police | UK | DT @georgegalloway: RT @Galloway4Mayor: ÛÏThe... | 1 |
| 132 | 191 | aftershock | | Aftershock back to school kick off was great. ... | 0 |
| 6845 | 9810 | trauma | Montgomery County, MD | in response to trauma Children of Addicts deve... | 0 |


The **training data** has a **target** column, which we will try to predict from the **text** column.
The **test dataset** does not have a **target** column.

Inputs (text column) -> Machine Learning Algorithm -> Outputs (target column)

In [22]:
test_df.head()

| | id | keyword | location | text |
|---|---|---|---|---|
| 0 | 0 | | | Just happened a terrible car crash |
| 1 | 2 | | | Heard about #earthquake is different cities, s... |
| 2 | 3 | | | there is a forest fire at spot pond, geese are... |
| 3 | 9 | | | Apocalypse lighting. #Spokane #wildfires |
| 4 | 11 | | | Typhoon Soudelor kills 28 in China and Taiwan |


In [23]:
train_df.target.value_counts()

| target | count |
|---|---|
| 0 | 4342 |
| 1 | 3271 |


From the counts above, we can see that this is a **binary classification** problem.
The dataset is also fairly balanced: about 57% of the samples are in the negative class and about 43% are in the positive class.

Where,
- 1 = a real disaster Tweet
- 0 = not a real disaster Tweet
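
The class balance can be checked with a little arithmetic. This is a minimal sketch that hard-codes the two counts from the `value_counts()` output; the names `class_counts` and `class_share` are made up for illustration.

```python
# Class counts copied from the value_counts() output above
class_counts = {0: 4342, 1: 3271}

total = sum(class_counts.values())

# Fraction of the training samples that fall in each class
class_share = {label: count / total for label, count in class_counts.items()}

for label, share in class_share.items():
    print(f"target={label}: {share:.1%}")
# prints:
# target=0: 57.0%
# target=1: 43.0%
```

With the DataFrame loaded, `train_df.target.value_counts(normalize=True)` should give the same fractions directly.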

In [24]:
# How many samples total?
print(f"Total training samples: {len(train_df)}")
print(f"Total test samples: {len(test_df)}")
print(f"Total samples: {len(train_df) + len(test_df)}")

Total training samples: 7613
Total test samples: 3263
Total samples: 10876


**Train/Test Split**
The test set is relatively large: 3,263 of the 10,876 samples, which is roughly a 70/30 train/test split. A split closer to 80/20 is more commonly recommended.
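
To put a number on how large the test set is, the split ratio can be worked out from the sample counts printed above. A quick sketch, with variable names chosen only for illustration:

```python
# Sample counts copied from the output above
n_train = 7613
n_test = 3263
n_total = n_train + n_test

# Share of all samples that ended up in each file
train_share = n_train / n_total
test_share = n_test / n_total

print(f"Train: {train_share:.1%}, Test: {test_share:.1%}")
# prints: Train: 70.0%, Test: 30.0%
```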

**Question:** Why visualize random samples? You could visualize samples in order but this could lead to only seeing a certain subset of data. Better to visualize a substantial quantity (100+) of random samples to get an idea of the different kinds of data you're working with. In machine learning, never underestimate the power of randomness.
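
One way to look at rows scattered across the dataset is to draw them at random without replacement. The sketch below uses plain Python and a handful of made-up `(text, target)` rows as a stand-in for the real DataFrame, so the texts and labels here are placeholders, not samples from the dataset.

```python
import random

# Made-up (text, target) rows standing in for the real DataFrame
rows = [
    ("Fire in the building", 1),
    ("Lovely weather today", 0),
    ("Flood warning issued for the river valley", 1),
    ("This new album is fire", 0),
    ("Earthquake shakes the city centre", 1),
    ("My exam was a total disaster", 0),
]

random.seed(42)  # fixed seed so the same rows come back on every run

# random.sample draws without replacement, so no row is shown twice
sampled_rows = random.sample(rows, k=3)

for text, target in sampled_rows:
    label = "(real disaster)" if target > 0 else "(not real disaster)"
    print(f"Target: {target} {label}")
    print(f"Text:\n{text}\n")
```

With the DataFrame from this notebook, `train_df_shuffled.sample(n=5)` gives a similar non-contiguous draw.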

In [25]:
import random
# Pick a random start index, leaving room for five rows after it
random_index = random.randint(0, len(train_df)-5)

# train_df_shuffled[['text', 'target']] keeps only the "text" and "target" columns
# [random_index:random_index+5] takes five consecutive rows starting at the random index
# .itertuples() yields each row as a namedtuple, for example:
#   Pandas(Index=100, text="Fire in the building", target=1)
#   Pandas(Index=101, text="Lovely weather today", target=0)
# Unpacking a row gives the index (discarded as _), the text, and the target (0 or 1)
for row in train_df_shuffled[['text', 'target']][random_index:random_index+5].itertuples():
  _, text, target = row
  print(f"Target: {target}", "(real disaster)" if target > 0 else "(not real disaster)")
  print(f"Text:\n{text}\n")
  print("---\n")





Target: 0 (not real disaster)
Text:
Hellfire is surrounded by desires so be careful and donÛªt let your desires control you! #Afterlife

---

Target: 0 (not real disaster)
Text:
ÛÏ@YMcglaun: @JulieKragt @WildWestSixGun You're a lot safer that way.Ûyeah a lot more stable &amp; if I get in trouble I have a seat right there

---

Target: 1 (real disaster)
Text:
RT owenrbroadhurst RT JuanMThompson: At this hour 70 yrs ago one of the greatest acts of mass murder in world histÛ_ http://t.co/ODWs0waW9Q

---

Target: 1 (real disaster)
Text:
@SourMashNumber7 @tomfromireland @rfcgeom66 @BBCTalkback They didn't succeed the other two times either. Bomb didn't detonate&amp;Shots missed.

---

Target: 0 (not real disaster)
Text:
China's Stock Market Crash: Are There Gems In The Rubble?: ChinaÛªs stock market crash this summer has sparked ... http://t.co/2OqSGZqlbz

---



**Split data into Training and Validation Sets**

- The test data has no **labels**, so to evaluate the model we split the **training data** into a **training set** and a **validation set**.
- The model trains on the **training set**, and its performance is checked on the unseen **validation set**.

In [26]:
from sklearn.model_selection import train_test_split

# .to_numpy() converts each column (a pandas Series, i.e. a 1-D labeled array
# with an index) into a plain NumPy array.
# train_test_split also accepts pandas objects directly and returns splits of
# the same type as its inputs, so the conversion is optional. It is used here
# so that the sentences and labels are plain NumPy arrays.

# Use train_test_split to split training data into training and validation sets
train_sentences, val_sentences, train_labels, val_labels = train_test_split(train_df_shuffled["text"].to_numpy(),
                                                                            train_df_shuffled["target"].to_numpy(),
                                                                            test_size=0.1,
                                                                            random_state=42)


In [27]:
# Check the lengths
len(train_sentences), len(train_labels), len(val_sentences), len(val_labels)

(6851, 6851, 762, 762)
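
The lengths above follow from `test_size=0.1` applied to the 7,613 training rows. The sketch below reproduces them with plain arithmetic. The rounding rule (validation count rounded up, remainder used for training) is an assumption that happens to match the output above, not a statement about scikit-learn's internals.

```python
import math

n_samples = 7613   # total rows in train.csv, from the count printed earlier
test_size = 0.1    # fraction held out for validation

# Assumption: the validation count is rounded up and training gets the remainder
n_val = math.ceil(test_size * n_samples)
n_train = n_samples - n_val

print(n_train, n_val)
# prints: 6851 762
```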

In [28]:
# View the first 10 training sentences and their labels
train_sentences[:10], train_labels[:10]

(array(['@mogacola @zamtriossu i screamed after hitting tweet',
        'Imagine getting flattened by Kurt Zouma',
        '@Gurmeetramrahim #MSGDoing111WelfareWorks Green S welfare force ke appx 65000 members har time disaster victim ki help ke liye tyar hai....',
        "@shakjn @C7 @Magnums im shaking in fear he's gonna hack the planet",
        'Somehow find you and I collide http://t.co/Ee8RpOahPk',
        '@EvaHanderek @MarleyKnysh great times until the bus driver held us hostage in the mall parking lot lmfao',
        'destroy the free fandom honestly',
        'Weapons stolen from National Guard Armory in New Albany still missing #Gunsense http://t.co/lKNU8902JE',
        '@wfaaweather Pete when will the heat wave pass? Is it really going to be mid month? Frisco Boy Scouts have a canoe trip in Okla.',
        'Patient-reported outcomes in long-term survivors of metastatic colorectal cancer - British Journal of Surgery http://t.co/5Yl4DC1Tqt'],
       dtype=object),
 array([0,

We now have a **training set and a validation set**.

**Converting Text into Numbers**

- The **training set and validation set** each contain Tweets and their labels

**Question:** What is the most important step before we can use a **machine learning algorithm** with our text data?

**Answer:** Turn the text into numbers.

*For any machine learning algorithm, the inputs need to be in numerical form.*
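
As a rough illustration of what "numerical form" can mean, the sketch below maps each word to an integer id using plain Python. It is not a TensorFlow API. The `vocab` dictionary, the `encode` helper, and the `<pad>`/`<unk>` ids are hypothetical names and conventions chosen for this example, and the two sentences are taken from the samples shown earlier.

```python
sentences = [
    "Forest fire near La Ronge Sask. Canada",
    "Imagine getting flattened by Kurt Zouma",
]

# Build a vocabulary: each distinct lowercase word gets the next free integer id.
# Id 0 is reserved for padding and id 1 for unknown words (a hypothetical convention).
vocab = {"<pad>": 0, "<unk>": 1}
for sentence in sentences:
    for word in sentence.lower().split():
        vocab.setdefault(word, len(vocab))


def encode(sentence):
    """Map each whitespace-separated word to its integer id."""
    return [vocab.get(word, vocab["<unk>"]) for word in sentence.lower().split()]


encoded = encode("forest fire near Kurt")
print(encoded)
# prints: [2, 3, 4, 13]

# A word that never appeared in the vocabulary falls back to the <unk> id
print(encode("volcano"))
# prints: [1]
```

Splitting on whitespace is the simplest possible choice and leaves punctuation attached to words (for example `sask.`), which a real tokenizer would usually handle.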

In NLP, there are two main concepts for turning text into numbers:

- **Tokenization:**


