# Introduction to NLP Fundamentals in TensorFlow

source: https://github.com/mrdbourke/tensorflow-deep-learning/blob/main/08_introduction_to_nlp_in_tensorflow.ipynb

NLP aims to derive information from natural language, which may come as sequences of text or speech.

Another common term for NLP problems is sequence to sequence (**seq2seq**).

<img src="./course_images/nlp/nlp_inputs_and_outputs.png">

<img src="./course_images/nlp/architecture_of_an_rnn.png">

## Check for GPU

In [None]:
!nvidia-smi -L

## Get helper functions

In [5]:
# https://www.tomshardware.com/how-to/use-wget-download-files-command-line
!wget -O helper_functions.py https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py

# Import series of helper functions for the notebook
from helper_functions import unzip_data, create_tensorboard_callback, plot_loss_curves, compare_historys




## Get a text dataset

The dataset we're going to be using is Kaggle's introduction to NLP dataset (text samples of Tweets labelled as disaster or not disaster).

https://www.kaggle.com/c/nlp-getting-started

In [3]:
!wget https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip

# Unzip data
unzip_data("nlp_getting_started.zip")


## Visualizing a text dataset

To visualize our text samples, we first have to read them in. One way to do so would be to use plain Python.

But I prefer to get visual straight away.

So another way is to use pandas, which loads the whole dataset into memory (so it needs plenty of RAM for large datasets).

A third option is TensorFlow's text loading utilities: https://www.tensorflow.org/tutorials/load_data/text

In [6]:
import pandas as pd
import os
DATASET_PATH = "./data/nlp_getting_started"
train_df = pd.read_csv(os.path.join(DATASET_PATH, "train.csv"))
test_df = pd.read_csv(os.path.join(DATASET_PATH, "test.csv"))
train_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [7]:
# Shuffle training dataframe
train_df_shuffled = train_df.sample(frac=1, random_state=42)
train_df_shuffled.head()

Unnamed: 0,id,keyword,location,text,target
2644,3796,destruction,,So you have a new weapon that can cause un-ima...,1
2227,3185,deluge,,The f$&amp;@ing things I do for #GISHWHES Just...,0
5448,7769,police,UK,DT @georgegalloway: RT @Galloway4Mayor: ÛÏThe...,1
132,191,aftershock,,Aftershock back to school kick off was great. ...,0
6845,9810,trauma,"Montgomery County, MD",in response to trauma Children of Addicts deve...,0


In [8]:
# What does the test dataframe look like?
test_df.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


🔑 If the data is imbalanced (classes not split roughly 50/50), see:
https://www.tensorflow.org/tutorials/structured_data/imbalanced_data

In [9]:
# How many examples of each class?
train_df.target.value_counts()

target
0    4342
1    3271
Name: count, dtype: int64
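
To put a number on how balanced the classes are, the counts above can be turned into shares. The sketch below also derives per-class weights with an inverse-frequency heuristic; that heuristic is an assumption for illustration, not something this notebook uses.

```python
# Class counts copied from the value_counts() output above
class_counts = {0: 4342, 1: 3271}
total = sum(class_counts.values())

# Share of each class in the training set
class_share = {label: count / total for label, count in class_counts.items()}

# Assumed heuristic: weight each class inversely to its frequency so that
# both classes contribute equally to the loss overall
class_weight = {label: total / (len(class_counts) * count)
                for label, count in class_counts.items()}

for label in class_counts:
    print(f"class {label}: share={class_share[label]:.3f}, weight={class_weight[label]:.3f}")
```

With roughly 57% negative and 43% positive samples, the imbalance here is mild. If it were more severe, a dictionary like `class_weight` could be passed to the `class_weight` argument of a Keras model's `fit` method.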

In [10]:
# How many total samples?
len(train_df), len(test_df)

(7613, 3263)

In [11]:
# Let's visualize some random training examples
import random
random_index = random.randint(0, len(train_df)-5) # create random indexes not higher than the total number of samples
for row in train_df_shuffled[["text", "target"]][random_index:random_index+5].itertuples(): # return tuples, in this case, "index (_), text, target"
    _, text, target = row
    print(f"Target: {target}",  "(real disaster)" if target > 0 else "(not real disaster)")
    print(f"Text:\n{text}\n")
    print("---\n")

Target: 1 (real disaster)
Text:
70 years after : Hiroshima and Nagasaki - consequences of a nuclear detonation @ICRC http://t.co/BKh7Z6CWWl

---

Target: 0 (not real disaster)
Text:
Can you save
Can you save my
Can you save my heavydirtysoul?

---

Target: 0 (not real disaster)
Text:
@ThatPersianGuy @YOUNGSAFE ?? Eden Hazard as Harden is spot on flopping is identical

---

Target: 1 (real disaster)
Text:
the mv should just be them strutting like they mean it while buildings are burning up in the bg and flames everywhere how cool would that be

---

Target: 1 (real disaster)
Text:
shit is hard to get over but sometimes the tragedy means it's over soulja..

---



In [12]:
### Split data into training and validation sets
from sklearn.model_selection import train_test_split
# Use train_test_split to split training data into training and validation sets
train_sentences, val_sentences, train_labels , val_labels = train_test_split(train_df_shuffled["text"].to_numpy(), 
                                                                             train_df_shuffled["target"].to_numpy(), 
                                                                             test_size=0.1, # use 10% of training data for validation,
                                                                             random_state=42)

In [13]:
# Check the lengths
len(train_sentences), len(train_labels), len(val_sentences), len(val_labels)

(6851, 6851, 762, 762)
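
The split sizes can be checked by hand. This is a small sketch of the arithmetic, assuming the validation count is rounded up and the rest goes to training:

```python
import math

n_samples = 7613   # rows in the training dataframe (from the output above)
test_size = 0.1    # same fraction passed to train_test_split

# Assumption: the validation count is rounded up, the remainder is used for training
n_val = math.ceil(test_size * n_samples)
n_train = n_samples - n_val

print(n_train, n_val)
```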

In [14]:
len(train_df_shuffled)

7613

In [15]:
# Check the first 10 samples
train_sentences[:10], train_labels[:10]

(array(['@mogacola @zamtriossu i screamed after hitting tweet',
        'Imagine getting flattened by Kurt Zouma',
        '@Gurmeetramrahim #MSGDoing111WelfareWorks Green S welfare force ke appx 65000 members har time disaster victim ki help ke liye tyar hai....',
        "@shakjn @C7 @Magnums im shaking in fear he's gonna hack the planet",
        'Somehow find you and I collide http://t.co/Ee8RpOahPk',
        '@EvaHanderek @MarleyKnysh great times until the bus driver held us hostage in the mall parking lot lmfao',
        'destroy the free fandom honestly',
        'Weapons stolen from National Guard Armory in New Albany still missing #Gunsense http://t.co/lKNU8902JE',
        '@wfaaweather Pete when will the heat wave pass? Is it really going to be mid month? Frisco Boy Scouts have a canoe trip in Okla.',
        'Patient-reported outcomes in long-term survivors of metastatic colorectal cancer - British Journal of Surgery http://t.co/5Yl4DC1Tqt'],
       dtype=object),
 array([0,

## Converting text into numbers

<img src="./course_images/nlp/tokenization_vs_embedding.png">

In NLP, there are two main concepts for turning text into numbers:

* **Tokenization** - A straight mapping from word or character or sub-word to a numerical value. There are three main levels of tokenization:

    1. Using **word-level tokenization** with the sentence "I love TensorFlow" might result in "I" being 0, "love" being 1 and "TensorFlow" being 2. In this case, every word in a sequence is considered a single token.
    2. **Character-level tokenization**, such as converting the letters A-Z to values 1-26. In this case, every character in a sequence is considered a single token.
    3. **Sub-word tokenization** is in between word-level and character-level tokenization. It involves breaking individual words into smaller parts and then converting those smaller parts into numbers. For example, "my favourite food is pineapple pizza" might become "my, fav, avour, rite, fo, oo, od, is, pin, ine, app, le, piz, za". After doing this, these sub-words would then be mapped to a numerical value. In this case, every word could be considered multiple tokens.

* **Embeddings** - An embedding is a representation of natural language which can be learned. Representation comes in the form of a feature vector. For example, the word "dance" could be represented by the 5-dimensional vector [-0.8547, 0.4559, -0.3332, 0.9877, 0.1112]. It's important to note here, the size of the feature vector is tuneable. There are two ways to use embeddings:
    1. **Create your own embedding** - Once your text has been turned into numbers (required for an embedding), you can put them through an embedding layer (such as tf.keras.layers.Embedding) and an embedding representation will be learned during model training.
    2. **Reuse a pre-learned embedding** - Many pre-trained embeddings exist online. These pre-trained embeddings have often been learned on large corpora of text (such as all of Wikipedia) and thus have a good underlying representation of natural language. You can use a pre-trained embedding to initialize your model and fine-tune it to your own specific task.

If you're looking for pre-trained word embeddings, [Word2vec embeddings](http://jalammar.github.io/illustrated-word2vec/), [GloVe embeddings](https://nlp.stanford.edu/projects/glove/) and many of the options available on [TensorFlow Hub](https://tfhub.dev/s?module-type=text-embedding) are great places to start.

https://www.tensorflow.org/text/guide/word_embeddings
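
To make the difference between word-level and character-level tokenization concrete, here is a minimal pure-Python sketch. The helper `char_to_index` is a hypothetical name, and mapping every non-letter character to 0 is an assumption made for illustration.

```python
sentence = "I love TensorFlow"

# Word-level: each whitespace-separated word is one token
word_tokens = sentence.split()
word_to_index = {word: idx for idx, word in enumerate(word_tokens)}
word_ids = [word_to_index[word] for word in word_tokens]

# Character-level: each character is one token
# (letters a-z map to 1-26, anything else maps to 0 in this sketch)
def char_to_index(ch):
    ch = ch.lower()
    return ord(ch) - ord("a") + 1 if "a" <= ch <= "z" else 0

char_ids = [char_to_index(ch) for ch in sentence]

print(word_ids)   # 3 tokens
print(char_ids)   # 17 tokens, one per character
```

The same sentence becomes 3 tokens at the word level but 17 at the character level, which is the trade-off between a large vocabulary with short sequences and a small vocabulary with long sequences.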




------
When dealing with a text problem, one of the first things you'll have to do before you can build a model is to convert your text to numbers.

There are a few ways to do this, namely:
* Tokenization - direct mapping of a token (a token could be a word or a character) to a number
* Embedding - create a matrix with one feature vector per token (the size of the feature vector can be defined and the embedding can be learned)

### Text vectorization (tokenization)

In [16]:
train_sentences[:5]

array(['@mogacola @zamtriossu i screamed after hitting tweet',
       'Imagine getting flattened by Kurt Zouma',
       '@Gurmeetramrahim #MSGDoing111WelfareWorks Green S welfare force ke appx 65000 members har time disaster victim ki help ke liye tyar hai....',
       "@shakjn @C7 @Magnums im shaking in fear he's gonna hack the planet",
       'Somehow find you and I collide http://t.co/Ee8RpOahPk'],
      dtype=object)

In [19]:
import tensorflow as tf
from keras.layers import TextVectorization

# Use the default TextVectorization parameters
text_vectorizer = TextVectorization(max_tokens=None, # how many words in the vocabulary (an out-of-vocabulary token is added automatically)
                                    standardize="lower_and_strip_punctuation", # how to process text
                                    split="whitespace", # how to split text into tokens
                                    ngrams=None, # create groups of n words?
                                    output_mode="int", # how to map tokens to numbers
                                    output_sequence_length=None) # how long should the output sequences be?
                                    #pad_to_max_tokens=True) # Not valid if using max_tokens=None

In [21]:
len(train_sentences[0].split())

7

In [23]:
# Find the average number of tokens (words) in the training tweets
sum(len(i.split()) for i in train_sentences) / len(train_sentences)

14.901036345059115
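
The average alone doesn't show how many Tweets a given sequence length would truncate. The sketch below measures that coverage on a handful of Tweets taken from the dataframes shown earlier; the small `sentences` list is a stand-in for `train_sentences`, and `coverage` is a hypothetical helper.

```python
# Small stand-in for train_sentences (Tweets copied from the outputs above)
sentences = [
    "Forest fire near La Ronge Sask. Canada",
    "Just happened a terrible car crash",
    "Typhoon Soudelor kills 28 in China and Taiwan",
    "Apocalypse lighting. #Spokane #wildfires",
]

# Number of whitespace-separated tokens per sentence
token_counts = [len(s.split()) for s in sentences]
mean_tokens = sum(token_counts) / len(token_counts)

def coverage(max_length):
    """Fraction of sentences that fit into max_length tokens without truncation."""
    return sum(count <= max_length for count in token_counts) / len(token_counts)

print(f"mean tokens: {mean_tokens}")
print(f"coverage at 6 tokens: {coverage(6)}")
print(f"coverage at 8 tokens: {coverage(8)}")
```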

In [28]:
# Setup text vectorization variables
max_vocab_length = 10000 # max number of words to have in our vocabulary
max_length = 15 # max length our sequences will be (e.g. how many words from a Tweet does a model see?)

text_vectorizer = TextVectorization(max_tokens=max_vocab_length, 
                                    output_mode="int",
                                    output_sequence_length=max_length)

In [29]:
# Fit the text vectorizer to the training text
text_vectorizer.adapt(train_sentences)

In [30]:
# Create a sample sentence and tokenize it
sample_sentence = "There's a flood in my street!"
text_vectorizer([sample_sentence])

<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[264,   3, 232,   4,  13, 698,   0,   0,   0,   0,   0,   0,   0,
          0,   0]], dtype=int64)>
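
The output above has six non-zero token ids followed by nine zeros. The pure-Python sketch below imitates the steps that could produce such a vector (lowercase, strip punctuation, split on whitespace, map to integers, pad). It is a simplified illustration with hypothetical helpers and a toy corpus, not the actual `TextVectorization` implementation, so the ids differ from the real ones.

```python
import string
from collections import Counter

def standardize(text):
    """Lowercase the text and strip punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocab(corpus, max_tokens):
    """Map the most frequent words to integers; 0 is reserved for padding, 1 for unknown words."""
    counts = Counter(word for text in corpus for word in standardize(text).split())
    words = [word for word, _ in counts.most_common(max_tokens - 2)]
    return {word: idx + 2 for idx, word in enumerate(words)}

def vectorize(text, vocab, sequence_length):
    """Turn text into a fixed-length list of token ids (truncate or pad with 0)."""
    ids = [vocab.get(word, 1) for word in standardize(text).split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

# Toy corpus (hypothetical, only for this sketch)
toy_corpus = ["There's a flood", "a fire in my street", "a flood in the city"]
toy_vocab = build_vocab(toy_corpus, max_tokens=20)

toy_vector = vectorize("There's a flood in my street!", toy_vocab, sequence_length=15)
unknown_vector = vectorize("earthquake", toy_vocab, sequence_length=15)

print(toy_vector)      # 6 token ids followed by 9 padding zeros
print(unknown_vector)  # starts with 1 because "earthquake" is not in the toy vocabulary
```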

In [32]:
# Choose a random sentence from the training dataset and tokenize it
random_sentence = random.choice(train_sentences)
print(f"Original text:\n {random_sentence} \
      \n\nVectorized version:")
text_vectorizer([random_sentence])


Original text:
 @TeaFrystlik -- causing the entire sky around their battle to darken to a violent storm as an ungodly powerful bolt of lightning struck at--       

Vectorized version:


<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[7763, 1426,    2,  855,  985,  470,  131,  442,    5,    1,    5,
           3,  357,   84,   26]], dtype=int64)>

In [33]:
# Get the unique words in the vocabulary
words_in_vocab = text_vectorizer.get_vocabulary() # get all of the unique words in our training data
top_5_words = words_in_vocab[:5] # get the most common words
bottom_5_words = words_in_vocab[-5:] # get the least common words

print(f"Number of words in vocab: {len(words_in_vocab)}")
print(f"5 most common words: {top_5_words}")
print(f"5 least common words: {bottom_5_words}")

Number of words in vocab: 10000
5 most common words: ['', '[UNK]', 'the', 'a', 'in']
5 least common words: ['pages', 'paeds', 'pads', 'padres', 'paddytomlinson1']


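`[UNK]` stands for unknown: any word outside the top 10,000 words in the vocabulary (for example #MSG ...) is mapped to this token. The empty string at index 0 is the padding token.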

### Creating an Embedding using an Embedding Layer



To make our embedding, we're going to use TensorFlow's embedding layer: https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding

The parameters we care most about for our embedding layer:
* `input_dim` = the size of our vocabulary
* `output_dim` = the size of the output embedding vector, for example, a value of 100 would mean each token gets represented by a vector of length 100
* `input_length` = length of the sequences being passed to the embedding layer
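
Conceptually, an embedding layer is a lookup table with `input_dim` rows and `output_dim` columns. The sketch below imitates that lookup in pure Python with toy sizes; the initial value range of -0.05 to 0.05 is an assumption for illustration, and nothing here is learned.

```python
import random

vocab_size = 10      # plays the role of input_dim (toy value)
embedding_dim = 4    # plays the role of output_dim (toy value)

random.seed(42)
# One row of random numbers per token id (assumed range, not trained)
embedding_matrix = [[random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
                    for _ in range(vocab_size)]

def embed(token_ids):
    """Replace every token id with its row of the embedding matrix."""
    return [embedding_matrix[token_id] for token_id in token_ids]

token_ids = [3, 7, 2, 0, 0]   # a padded sequence of length 5
embedded = embed(token_ids)

print(len(embedded), len(embedded[0]))   # sequence length x embedding size
print(embedded[3] == embedded[4])        # both padding positions share the same vector
print(vocab_size * embedding_dim)        # number of values in the lookup table
```

The same reasoning explains why repeated token ids, such as trailing padding zeros, produce identical rows in an embedded sequence.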

In [41]:
from keras import layers

embedding = layers.Embedding(input_dim=max_vocab_length, # set input shape
                             output_dim=128, # output shape (machines are better with 2^x)
                             embeddings_initializer="uniform", # random uniform numbers
                             input_length=max_length # how long is each input
                             )
embedding

<keras.src.layers.core.embedding.Embedding at 0x21f0cad6ce0>

In [43]:
# Get a random sentence from the training set
random_sentence = random.choice(train_sentences)
print(f"Original text:\n {random_sentence} \
      \n\nEmbedded version:")

# Embed the random sentence (turn into dense vectors of fixed size)
sample_embed = embedding(text_vectorizer([random_sentence]))
sample_embed

Original text:
 3 Former Executives To Be Prosecuted In Fukushima Nuclear Disaster http://t.co/UmjpRRwRUU       

Embedded version:


<tf.Tensor: shape=(1, 15, 128), dtype=float32, numpy=
array([[[-0.04630945,  0.03003553,  0.04536592, ..., -0.03408079,
         -0.01924235, -0.04502784],
        [-0.04709723,  0.04912701,  0.03619016, ...,  0.04549519,
          0.00036404, -0.00139327],
        [ 0.01585795,  0.00337551,  0.00492793, ...,  0.04084766,
         -0.03795891,  0.04436871],
        ...,
        [-0.00663267, -0.04182254,  0.03362733, ..., -0.01988559,
         -0.03918185,  0.01323232],
        [-0.00663267, -0.04182254,  0.03362733, ..., -0.01988559,
         -0.03918185,  0.01323232],
        [-0.00663267, -0.04182254,  0.03362733, ..., -0.01988559,
         -0.03918185,  0.01323232]]], dtype=float32)>

In [45]:
# Check out a single token's embedding
sample_embed[0][0], sample_embed[0][0].shape, random_sentence, random_sentence[0]

(<tf.Tensor: shape=(128,), dtype=float32, numpy=
 array([-0.04630945,  0.03003553,  0.04536592,  0.00295794, -0.04226234,
         0.01014543,  0.01253842, -0.00868255, -0.02019215,  0.00629608,
         0.01416207,  0.04675606, -0.03144509, -0.03714043,  0.00228071,
         0.03287682,  0.04439881, -0.01037916, -0.03326933, -0.02649232,
        -0.03246701,  0.03414718,  0.03284794,  0.03025026,  0.04029608,
         0.01379324, -0.03973158, -0.00312669, -0.04953605,  0.04527733,
        -0.0192475 , -0.01899569,  0.03932152,  0.04046657, -0.01831449,
         0.02464113, -0.01457719,  0.0205737 , -0.02518891, -0.01302762,
        -0.01501725,  0.04553136, -0.01511531,  0.03703994, -0.00019265,
         0.02361307, -0.04368327,  0.03934497, -0.00815278,  0.007218  ,
         0.03695719, -0.00295029, -0.03657174,  0.02954403, -0.04816339,
        -0.03918196, -0.04368923,  0.03993369,  0.00704107, -0.0066998 ,
        -0.00564903, -0.03345776,  0.01649005, -0.01574318,  0.013226  ,
  

## Modelling a text dataset (running a series of experiments)

<img src="./course_images/nlp/experiments_we_are_running.png">

Now we've got a way to turn our text sequences into numbers,
it's time to start building a series of modelling experiments.

We'll start with a baseline and move on from there.

* Model 0: Naive Bayes (baseline, not a neural network), chosen using scikit-learn's machine learning map: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
* Model 1: Feed-forward neural network (dense model)
* Model 2: LSTM model (RNN)
* Model 3: GRU model (RNN)
* Model 4: Bidirectional-LSTM model (RNN)
* Model 5: 1D Convolutional Neural Network (CNN)
* Model 6: TensorFlow Hub pretrained feature extractor (using transfer learning for NLP)
* Model 7: Same as model 6 with 10% of training data

How are we going to approach all of these?

Use the standard steps in modelling with TensorFlow:

* Create a model
* Build a model
* Fit a model
* Evaluate our model

### Model 0: Getting a baseline

As with all machine learning modelling experiments, it's important to create a baseline model so you've got a benchmark for future experiments to build upon.

To create our baseline, we'll use scikit-learn's Multinomial Naive Bayes, with the TF-IDF formula converting our words to numbers.

🔑 **Note**: It's common practice to use a non-DL algorithm as a baseline because of its speed, and then use DL later to see if you can improve upon it.
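
To give a feel for what TF-IDF does before handing it to a pipeline, here is a pure-Python sketch on a toy corpus. It assumes the smoothed variant `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation; the corpus and the `tfidf` helper are hypothetical and only serve the illustration.

```python
import math
from collections import Counter

# Toy corpus (hypothetical)
docs = ["fire in the forest", "the game was fire", "forest walk"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
df = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency (assumed formula, see lead-in)
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

def tfidf(tokens):
    """Term frequency times idf, scaled so the vector has unit L2 norm."""
    tf = Counter(tokens)
    raw = {term: tf[term] * idf[term] for term in tf}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}

vectors = [tfidf(tokens) for tokens in tokenized]
for doc, vector in zip(docs, vectors):
    print(doc, {term: round(value, 3) for term, value in vector.items()})
```

Terms that appear in fewer documents (such as "in") receive a higher weight than terms shared across documents (such as "fire"), which is what makes TF-IDF features informative for a simple classifier.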

In [46]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Create tokenization and modelling pipeline (do these steps in order)
model_0 = Pipeline([
    ("tfidf", TfidfVectorizer()), # convert words to numbers using tfidf
    ("clf", MultinomialNB()) # model the text
])

# Fit the pipeline to the training data
model_0.fit(train_sentences, train_labels)

In [47]:
# Evaluate our baseline model (score in scikit learn = evaluate in tf)
baseline_score = model_0.score(val_sentences, val_labels)
print(f"Our baseline model achieves an accuracy of: {baseline_score*100:.2f}%")

Our baseline model achieves an accuracy of: 79.27%


In [50]:
# Make predictions
baseline_preds = model_0.predict(val_sentences)
baseline_preds[:20]

array([1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
      dtype=int64)

<img src="./course_images/nlp/classification_evaluation_methods.png">

### Creating an evaluation for our model experiments

We could evaluate each model's predictions with different metrics by hand every time, however this would be cumbersome and is easily fixed with a function.

Let's create one to compare our model's predictions with the truth labels using the following metrics:

* Accuracy
* Precision
* Recall
* F1-score

> For a deep overview of many different evaluation methods, see the scikit-learn documentation: https://scikit-learn.org/stable/modules/model_evaluation.html
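
To see how these four metrics relate, the sketch below computes them by hand for a tiny set of made-up labels, averaging the per-class scores weighted by how many true samples each class has. The function name `weighted_scores` and the example labels are hypothetical.

```python
def weighted_scores(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall and F1 over all classes."""
    n = len(y_true)
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / n
    precision = recall = f1 = 0.0
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted as label
        fp = sum(t != label and p == label for t, p in pairs)  # wrongly predicted as label
        fn = sum(t == label and p != label for t, p in pairs)  # missed samples of label
        class_precision = tp / (tp + fp) if tp + fp else 0.0
        class_recall = tp / (tp + fn) if tp + fn else 0.0
        denom = class_precision + class_recall
        class_f1 = 2 * class_precision * class_recall / denom if denom else 0.0
        weight = (tp + fn) / n  # share of true samples belonging to this class
        precision += weight * class_precision
        recall += weight * class_recall
        f1 += weight * class_f1
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels for illustration
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
scores = weighted_scores(y_true, y_pred)
print(scores)
```

One consequence of weighting by class support is that the weighted recall always equals the accuracy, so seeing the same number for both is expected and not a bug.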

In [52]:
# Function to evaluate accuracy, precision, recall, f1-score
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def calculate_results(y_true, y_pred):
    """
    Calculates model accuracy, precision, recall and f1 score of a binary classification model.
    """
    # Calculate model accuracy
    model_accuracy = accuracy_score(y_true, y_pred) * 100 # percentage
    # Calculate model precision, recall and f1-score using the "weighted" average (accounts for class imbalance, see the docs); _ is the support, which we don't need
    model_precision, model_recall, model_f1, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted")
    model_results = {"accuracy": model_accuracy,
                     "precision": model_precision,
                     "recall": model_recall,
                     "f1": model_f1}
    return model_results


In [None]:
# We could move this function into our helper file (and import it in every notebook)
#from helper_functions import calculate_results...

In [54]:
# Get baseline results
baseline_results = calculate_results(y_true=val_labels, 
                                     y_pred=baseline_preds)
baseline_results

{'accuracy': 79.26509186351706,
 'precision': 0.8111390004213173,
 'recall': 0.7926509186351706,
 'f1': 0.7862189758049549}