**Natural Language Processing with TensorFlow**

- NLP problems are also called **sequence problems** because the data arrives as a sequence, where the order of the items matters.

Natural language covers:
- Text (such as an email, blog post, book, or Tweet)
- Speech (a conversation you have with a doctor, voice commands you give to a smart speaker)

- One **use case** is scanning incoming emails to decide whether they are spam or not (classification)
- Another **use case** is analyzing customer feedback and complaints to find which segment of the business each one is talking about

In both use cases the input is a *sequence*. You might also come across the term **seq2seq** (sequence to sequence), in other words, using the information in one sequence to produce another sequence.

A typical NLP workflow is:

*Text --> Turn into numbers --> Build a model --> Train the model --> use patterns to make predictions*
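The "turn into numbers" step of this workflow can be sketched in plain Python. This is a minimal word-level illustration, not the tokenizer used later in the notebook. The helper names `build_vocab` and `encode` are hypothetical, and reserving index 0 for padding and index 1 for unknown words is an assumption of this sketch.

```python
# Minimal word-level tokenization sketch (illustrative only).
# build_vocab and encode are hypothetical helpers, not library functions.

def build_vocab(texts):
    """Map each distinct lowercase word to an integer, in order of first appearance."""
    vocab = {"<PAD>": 0, "<UNK>": 1}  # reserved indices (an assumption of this sketch)
    for text in texts:
        for word in text.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def encode(text, vocab):
    """Replace each word with its integer index; unseen words map to <UNK>."""
    return [vocab.get(word, vocab["<UNK>"]) for word in text.lower().split()]


texts = ["forest fire near la ronge", "fire crews near the forest"]
vocab = build_vocab(texts)
encoded = encode("forest fire near volcano", vocab)
print(encoded)  # "volcano" was never seen, so it maps to the <UNK> index
```

Once every Tweet is a list of integers like this, a model can learn patterns over those numbers instead of over raw strings.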

**What we are going to cover**

- Getting data
- Visualizing text
- Converting text into numbers using tokenization
- Turning our tokenized text into an embedding
- Modelling a text dataset
    - Starting with a baseline (TF-IDF)
    - Building deep learning models like
       - LSTM, GRU, Conv1D, Transfer Learning
- Comparing the performance of each of our models
- Combining the models into an **ensemble**
- Saving and loading a **trained model**
- Finding the most wrong predictions




**Download the Helper Functions**

In [4]:
# Download helper functions script
!wget https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py

--2025-10-03 07:58:22--  https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10246 (10K) [text/plain]
Saving to: ‘helper_functions.py.1’


2025-10-03 07:58:22 (101 MB/s) - ‘helper_functions.py.1’ saved [10246/10246]



**Import the helper functions**

In [5]:
# Import series of helper functions for the notebook
from helper_functions import unzip_data, create_tensorboard_callback, plot_loss_curves, compare_historys

**Download the Text Dataset**

We will be using the **Real or Not** dataset from **Kaggle**, which contains **text-based Tweets** about natural disasters.


The Real Tweets are actually about disasters, for example:

*Jetstar and Virgin forced to cancel Bali flights again because of ash from Mount Raung volcano*


The Not Real Tweets are Tweets that are not about disasters (they can be about anything), for example:

*Education is the most powerful weapon which you can use to change the world.Nelson #Mandela #quote*


In [6]:
# Download data (same as from Kaggle)
!wget "https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip"

# Unzip data
unzip_data("nlp_getting_started.zip")

--2025-10-03 07:58:23--  https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 142.251.167.207, 142.251.179.207, 64.233.180.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.251.167.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 607343 (593K) [application/zip]
Saving to: ‘nlp_getting_started.zip’


2025-10-03 07:58:23 (89.0 MB/s) - ‘nlp_getting_started.zip’ saved [607343/607343]



**Unzipping Files**

Unzipping gives us the following files:
- **sample_submission.csv:** an example of the file that you would submit to the Kaggle competition
- **train.csv:** training samples of real and not real disaster Tweets
- **test.csv:** testing samples of real and not real disaster Tweets
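`unzip_data` comes from the downloaded helper script, so its body is not shown in this notebook. A minimal sketch of what such a helper might do, using only the standard library `zipfile` module, is below. The function name `extract_zip`, the demo archive, and its contents are made up for illustration.

```python
import os
import tempfile
import zipfile


def extract_zip(zip_path, dest_dir="."):
    """Extract every file in zip_path into dest_dir and return the extracted names."""
    with zipfile.ZipFile(zip_path, "r") as archive:
        archive.extractall(dest_dir)
        return archive.namelist()


# Demo on a throwaway archive so the sketch runs without downloading anything.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("train.csv", "id,text,target\n1,example tweet,0\n")
    extracted = extract_zip(demo_zip, dest_dir=tmp)
    train_exists = os.path.exists(os.path.join(tmp, "train.csv"))

print(extracted, train_exists)
```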


**Visualizing a Text Dataset**

In [9]:
import pandas as pd

train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")
train_df.head(5)

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | | | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | | | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [10]:
# Shuffle training dataframe (random_state=42 makes the shuffle reproducible)
train_df_shuffled = train_df.sample(frac=1, random_state=42)
train_df_shuffled.head()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 2644 | 3796 | destruction | | So you have a new weapon that can cause un-ima... | 1 |
| 2227 | 3185 | deluge | | The f$&amp;@ing things I do for #GISHWHES Just... | 0 |
| 5448 | 7769 | police | UK | DT @georgegalloway: RT @Galloway4Mayor: ÛÏThe... | 1 |
| 132 | 191 | aftershock | | Aftershock back to school kick off was great. ... | 0 |
| 6845 | 9810 | trauma | Montgomery County, MD | in response to trauma Children of Addicts deve... | 0 |


The **training data** has a **target** column, and we will try to predict that column from the text.
The **test dataset** does not have a **target** column.

Inputs (text column) -> Machine Learning Algorithm -> Outputs (target column)
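The shuffle and the inputs/outputs split above can be tried on a toy DataFrame. This is a small sketch assuming pandas is installed. The rows are made up for illustration, and the variable names `sentences` and `labels` are not taken from the notebook.

```python
import pandas as pd

# Tiny stand-in for train.csv; these rows are invented for the example.
toy_df = pd.DataFrame({
    "text": [
        "flood warning issued",
        "great day at the beach",
        "wildfire spreads overnight",
        "my new phone is fire",
    ],
    "target": [1, 0, 1, 0],
})

# frac=1 samples 100% of the rows, which amounts to a full shuffle.
# A fixed random_state makes the shuffled order repeatable between runs.
shuffled_a = toy_df.sample(frac=1, random_state=42)
shuffled_b = toy_df.sample(frac=1, random_state=42)
same_order = shuffled_a.index.tolist() == shuffled_b.index.tolist()

# Inputs are the raw text, outputs are the 0/1 labels.
sentences = shuffled_a["text"].to_numpy()
labels = shuffled_a["target"].to_numpy()
print(same_order, len(sentences), len(labels))
```

Note that `sample` keeps the original index values, which is why the shuffled output shows row labels such as 2644 and 2227 rather than 0, 1, 2.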

In [11]:
test_df.head()

| | id | keyword | location | text |
|---|---|---|---|---|
| 0 | 0 | | | Just happened a terrible car crash |
| 1 | 2 | | | Heard about #earthquake is different cities, s... |
| 2 | 3 | | | there is a forest fire at spot pond, geese are... |
| 3 | 9 | | | Apocalypse lighting. #Spokane #wildfires |
| 4 | 11 | | | Typhoon Soudelor kills 28 in China and Taiwan |


In [12]:
train_df.target.value_counts()

| target | count |
|---|---|
| 0 | 4342 |
| 1 | 3271 |


From the above, we can see that we are dealing with a **binary classification** problem.
The dataset is also fairly balanced: about 57% negative class (4,342 Tweets) and 43% positive class (3,271 Tweets).

Where:
- 1 = a real disaster Tweet
- 0 = not a real disaster Tweet
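The class balance can be checked directly from the `value_counts()` output. The sketch below hard-codes those two counts rather than reading `train.csv`, so it runs on its own.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 4342, 1: 3271}

total = sum(counts.values())
# Share of each class as a percentage, rounded to one decimal place.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}
print(total, shares)
```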