<a href="https://colab.research.google.com/github/SilahicAmil/Intro-NLP/blob/main/Into_To_Natural_Language_Processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to NLP fundamentals in TensorFlow

The goal of natural language processing (NLP) is to derive information from natural language, which could be sequences, text or speech.

Another common term for NLP problems is sequence-to-sequence (seq2seq) problems.

## Check for GPU


In [1]:
!nvidia-smi -L

GPU 0: Tesla T4 (UUID: GPU-a8d759d1-3817-e0f3-8a0d-93c3849d9599)


## Helper functions

In [2]:
!wget https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py

# Import a series of helper functions
from helper_functions import unzip_data, create_tensorboard_callback, plot_loss_curves, compare_historys


--2021-07-02 04:16:43--  https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10246 (10K) [text/plain]
Saving to: ‘helper_functions.py’


2021-07-02 04:16:43 (112 MB/s) - ‘helper_functions.py’ saved [10246/10246]



## Get our text data set

The data we're going to be using is Kaggle's introduction to NLP dataset (text samples of Tweets labelled as disaster or not disaster).

In [3]:
!wget https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip

# Unzip the downloaded data
unzip_data("nlp_getting_started.zip")

--2021-07-02 04:16:45--  https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.164.144, 172.253.115.128, 172.253.122.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.164.144|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 607343 (593K) [application/zip]
Saving to: ‘nlp_getting_started.zip’


2021-07-02 04:16:45 (155 MB/s) - ‘nlp_getting_started.zip’ saved [607343/607343]



## Visualize a text dataset

In [4]:
import pandas as pd
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

train_df.head()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | | | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | | | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [5]:
# Shuffle the training data

train_df_shuffled = train_df.sample(frac=1,
                                    random_state=42)
train_df_shuffled.head()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 2644 | 3796 | destruction | | So you have a new weapon that can cause un-ima... | 1 |
| 2227 | 3185 | deluge | | The f$&amp;@ing things I do for #GISHWHES Just... | 0 |
| 5448 | 7769 | police | UK | DT @georgegalloway: RT @Galloway4Mayor: ÛÏThe... | 1 |
| 132 | 191 | aftershock | | Aftershock back to school kick off was great. ... | 0 |
| 6845 | 9810 | trauma | Montgomery County, MD | in response to trauma Children of Addicts deve... | 0 |


In [6]:
# What does the test data look like?

test_df.head()

| | id | keyword | location | text |
|---|---|---|---|---|
| 0 | 0 | | | Just happened a terrible car crash |
| 1 | 2 | | | Heard about #earthquake is different cities, s... |
| 2 | 3 | | | there is a forest fire at spot pond, geese are... |
| 3 | 9 | | | Apocalypse lighting. #Spokane #wildfires |
| 4 | 11 | | | Typhoon Soudelor kills 28 in China and Taiwan |


In [7]:
# How many examples of each class are there?
train_df.target.value_counts()

0    4342
1    3271
Name: target, dtype: int64
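The two counts above suggest the classes are reasonably balanced. As a quick sketch, the class proportions can be worked out directly from those counts (the variable names below are illustrative and not part of the original notebook):

```python
# Class counts copied from the value_counts() output above
class_counts = {0: 4342, 1: 3271}

total = sum(class_counts.values())  # total number of training examples

# Share of each class as a percentage of the training set
class_share = {label: 100 * count / total for label, count in class_counts.items()}

for label, share in class_share.items():
    print(f"Class {label}: {share:.1f}% of training samples")
```

With a split of roughly 57% to 43%, accuracy remains a usable metric here, although precision and recall still give a fuller picture.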

In [8]:
len(train_df), len(test_df)

(7613, 3263)

In [9]:
# Visualize random training examples
import random
random_index = random.randint(0, len(train_df)-5) # Create a random start index that leaves room for 5 examples
for row in train_df_shuffled[["text", "target"]][random_index:random_index+5].itertuples():
  _, text, target = row

  print(f"Target: {target}", "(Real Disaster)" if target > 0 else "(Not a real disaster)")
  print(f"Text:\n{text}\n")
  print("---\n")

Target: 0 (Not a real disaster)
Text:
There's still room for you at our party for first responders from around the country! 3rd annual best ever. http://t.co/mNh6FXhOdB

---

Target: 0 (Not a real disaster)
Text:
Hellfire is surrounded by desires so be careful and donÛªt let your desires control you! #Afterlife

---

Target: 0 (Not a real disaster)
Text:
ÛÏ@YMcglaun: @JulieKragt @WildWestSixGun You're a lot safer that way.Ûyeah a lot more stable &amp; if I get in trouble I have a seat right there

---

Target: 1 (Real Disaster)
Text:
RT owenrbroadhurst RT JuanMThompson: At this hour 70 yrs ago one of the greatest acts of mass murder in world histÛ_ http://t.co/ODWs0waW9Q

---

Target: 1 (Real Disaster)
Text:
@SourMashNumber7 @tomfromireland @rfcgeom66 @BBCTalkback They didn't succeed the other two times either. Bomb didn't detonate&amp;Shots missed.

---



## Split data into training and validation sets

In [10]:
from sklearn.model_selection import train_test_split


In [11]:
# Use train test split to split data into training and validation sets 
train_sentences, val_sentences, train_labels, val_labels = train_test_split(train_df_shuffled["text"].to_numpy(),
                                                                            train_df_shuffled["target"].to_numpy(),
                                                                            test_size=0.1,
                                                                            random_state=42)

In [12]:
# Check the lengths
len(train_sentences), len(train_labels), len(val_sentences), len(val_labels)

(6851, 6851, 762, 762)
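The lengths above follow from `test_size=0.1`. The sketch below reproduces the arithmetic with a plain-Python split. `simple_split` is a hypothetical helper written for illustration; it does not reproduce scikit-learn's exact shuffling, and it assumes the validation size is rounded up, which is consistent with the 762 validation samples shown above.

```python
import math
import random

def simple_split(items, test_size=0.1, seed=42):
    """Shuffle a copy of `items` and split it into (train, validation) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = math.ceil(len(shuffled) * test_size)  # round the validation size up
    return shuffled[n_val:], shuffled[:n_val]

# 7613 is the number of rows in the training dataframe
train_part, val_part = simple_split(range(7613), test_size=0.1)
print(len(train_part), len(val_part))
```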

In [13]:
# Check the first 10 examples

train_sentences[:10], train_labels[:10]

(array(['@mogacola @zamtriossu i screamed after hitting tweet',
        'Imagine getting flattened by Kurt Zouma',
        '@Gurmeetramrahim #MSGDoing111WelfareWorks Green S welfare force ke appx 65000 members har time disaster victim ki help ke liye tyar hai....',
        "@shakjn @C7 @Magnums im shaking in fear he's gonna hack the planet",
        'Somehow find you and I collide http://t.co/Ee8RpOahPk',
        '@EvaHanderek @MarleyKnysh great times until the bus driver held us hostage in the mall parking lot lmfao',
        'destroy the free fandom honestly',
        'Weapons stolen from National Guard Armory in New Albany still missing #Gunsense http://t.co/lKNU8902JE',
        '@wfaaweather Pete when will the heat wave pass? Is it really going to be mid month? Frisco Boy Scouts have a canoe trip in Okla.',
        'Patient-reported outcomes in long-term survivors of metastatic colorectal cancer - British Journal of Surgery http://t.co/5Yl4DC1Tqt'],
       dtype=object), array([0, 

## Converting text into numbers

When dealing with a text problem, one of the first things you'll have to do before you can build a model is convert the text to numbers.

There are a few ways to do this:
* Tokenization - a direct mapping from each token (such as a word or character) to a number
* Embedding - a matrix of feature vectors, one vector per token, which can be learned during training
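To make the difference between the two approaches concrete, here is a minimal plain-Python sketch. The toy sentence, the `vocab` dictionary and the random `embedding_matrix` are all made up for illustration; in a real model the embedding values are learned rather than random.

```python
import random

sentence = "i love tensorflow and i love nlp"
tokens = sentence.split()

# Tokenization: map each unique token to an integer (0 is kept free for padding)
vocab = {}
for token in tokens:
    if token not in vocab:
        vocab[token] = len(vocab) + 1
token_ids = [vocab[token] for token in tokens]
print(token_ids)  # [1, 2, 3, 4, 1, 2, 5]

# Embedding: one feature vector per token ID (random here, learned in a real model)
embedding_dim = 4
rng = random.Random(42)
embedding_matrix = [[rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
                    for _ in range(len(vocab) + 1)]
embedded = [embedding_matrix[token_id] for token_id in token_ids]
print(len(embedded), len(embedded[0]))  # 7 4
```

Repeated words share the same ID and therefore the same vector, which is what lets a model learn one representation per token.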


## Text Vectorization (Tokenization)

In [14]:
import tensorflow as tf
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

# Use the default Text Vectorization parameters
text_vectorizer = TextVectorization(max_tokens=None, # How many words in the vocabulary (automatically adds <OOV>)
                                    standardize="lower_and_strip_punctuation",
                                    split="whitespace",
                                    ngrams=None, # Create groups of n words
                                    output_mode="int", # How to map tokens to numbers
                                    output_sequence_length=None, # How long you want the sequences to be
                                    pad_to_max_tokens=True)

In [15]:
# Find the average number of tokens (words) in the training tweets

round(sum([len(i.split()) for i in train_sentences])/len(train_sentences))

15

In [16]:
# Set up text vectorization variables
max_vocab_length = 10000 # Max number of words to have in the vocabulary
max_length = 15 # Max length of each sequence (the average number of tokens per tweet)

# Update vectorizer
text_vectorizer = TextVectorization(max_tokens=max_vocab_length,
                                    output_mode="int",
                                    output_sequence_length=max_length)

In [17]:
# fit the text vectorizer to the training text

text_vectorizer.adapt(train_sentences)

In [18]:
# Create a sample sentence and tokenize it
sample_sentence = "There's a flood in my street!"
text_vectorizer([sample_sentence])

<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[264,   3, 232,   4,  13, 698,   0,   0,   0,   0,   0,   0,   0,
          0,   0]])>
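The sample sentence has six tokens, so the remaining nine positions are filled with zeros to reach the sequence length of 15. A rough sketch of that pad-or-truncate behaviour is below. `pad_or_truncate` is a hypothetical helper, and it assumes that longer inputs are cut at the end, which is how I understand `output_sequence_length` to behave.

```python
def pad_or_truncate(token_ids, length, pad_value=0):
    """Return `token_ids` cut or right-padded to exactly `length` items."""
    trimmed = list(token_ids[:length])
    return trimmed + [pad_value] * (length - len(trimmed))

# Token IDs taken from the vectorized sample sentence above
sample_ids = [264, 3, 232, 4, 13, 698]
print(pad_or_truncate(sample_ids, 15))          # six IDs followed by nine zeros
print(pad_or_truncate(list(range(1, 21)), 15))  # a 20-token input keeps its first 15 IDs
```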

In [19]:
# Choose a random sentence from the training dataset and tokenize it

random_sentence = random.choice(train_sentences)
print(f"Original text:\n {random_sentence}\
      \n\nVectorized Version:")

text_vectorizer([random_sentence])

Original text:
 #breaking Firefighters battling blaze at east Cary condo building http://t.co/mIM8hH2ce6      

Vectorized Version:


<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[ 379, 1083, 3161,  749,   17,  856,    1,    1,  630,    1,    0,
           0,    0,    0,    0]])>

In [20]:
# Get the unique words in the vocab
words_in_vocab = text_vectorizer.get_vocabulary()
top_5_words = words_in_vocab[:5]
bottom_5_words= words_in_vocab[-5:]
print(f"Number of words in vocab: {len(words_in_vocab)}")
print(f"5 Most common words: {top_5_words}")
print(f"5 Least common words: {bottom_5_words}")


Number of words in vocab: 10000
5 Most common words: ['', '[UNK]', 'the', 'a', 'in']
5 Least common words: ['pages', 'paeds', 'pads', 'padres', 'paddytomlinson1']
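The output above shows index 0 reserved for padding (`''`) and index 1 for unknown words (`'[UNK]'`), followed by words in order of decreasing frequency. The sketch below imitates that layout on a toy corpus. `build_vocab` and `encode` are hypothetical helpers, and the tokenization is simplified to lowercasing and whitespace splitting (no punctuation stripping).

```python
from collections import Counter

def build_vocab(sentences, max_tokens):
    """Build a frequency-sorted vocabulary with reserved padding and unknown tokens."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def encode(sentence, vocab):
    """Map words to vocabulary indices, using 1 ([UNK]) for unseen words."""
    index = {word: i for i, word in enumerate(vocab)}
    return [index.get(word, 1) for word in sentence.lower().split()]

toy_sentences = ["the flood hit the town", "the fire hit the forest"]
toy_vocab = build_vocab(toy_sentences, max_tokens=5)
print(toy_vocab)
print(encode("the storm hit", toy_vocab))  # "storm" is not in the vocabulary, so it maps to 1
```

This also explains the `1` values in the vectorized tweet earlier: words that fall outside the 10,000 most common tokens are replaced by the unknown token.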


## Creating an Embedding using an Embedding Layer

To make our embedding, we're going to use TensorFlow's Embedding layer.

In [21]:
from tensorflow.keras import layers

embedding = layers.Embedding(input_dim=max_vocab_length, # Size of the input vocabulary
                             output_dim=128, # Size of each embedding vector (multiples of 8 can run faster on GPUs)
                             embeddings_initializer="uniform",
                             input_length=max_length) # Length of each input sequence

embedding

<tensorflow.python.keras.layers.embeddings.Embedding at 0x7f729381cc90>

In [22]:
# Get a random sentence from the training set
random_sentence = random.choice(train_sentences)
print(f"Original text:\n {random_sentence}\
      \n\nEmbedded Version:")

# Embed the random sentence

sample_embed = embedding(text_vectorizer([random_sentence]))
sample_embed

Original text:
 Noel back up      

Embedded Version:


<tf.Tensor: shape=(1, 15, 128), dtype=float32, numpy=
array([[[ 0.01900217, -0.00778875,  0.0252149 , ...,  0.02049587,
         -0.03541503,  0.02888921],
        [-0.00855951,  0.03940299, -0.04917406, ...,  0.04089766,
         -0.04314646, -0.00851778],
        [ 0.02618853,  0.03172157,  0.03705489, ..., -0.03908883,
          0.02005341,  0.03536557],
        ...,
        [ 0.02915983, -0.00653829,  0.00810423, ..., -0.01640823,
          0.00160346,  0.032729  ],
        [ 0.02915983, -0.00653829,  0.00810423, ..., -0.01640823,
          0.00160346,  0.032729  ],
        [ 0.02915983, -0.00653829,  0.00810423, ..., -0.01640823,
          0.00160346,  0.032729  ]]], dtype=float32)>

In [23]:
# Check out a single token's embedding
sample_embed[0][0], sample_embed[0][0].shape, random_sentence

(<tf.Tensor: shape=(128,), dtype=float32, numpy=
 array([ 0.01900217, -0.00778875,  0.0252149 ,  0.0143753 ,  0.02000191,
        -0.02781319, -0.02472391, -0.03461609,  0.01694708,  0.04769664,
         0.01938396, -0.03676265,  0.03805673,  0.03136755, -0.03141034,
         0.0147112 ,  0.00924078,  0.03401972, -0.03402137, -0.0091555 ,
         0.04875575, -0.03683907, -0.03879149, -0.03475635,  0.02311564,
        -0.01561357,  0.00574952,  0.04904005, -0.00394764, -0.04003202,
         0.00768739,  0.01375408, -0.03029318, -0.02753223,  0.04416238,
         0.00457255, -0.0124725 , -0.01446985, -0.02118945, -0.03293609,
         0.02645091, -0.01945975,  0.0086719 ,  0.02053357,  0.0203228 ,
        -0.03013772, -0.04735538, -0.04992881,  0.03636857, -0.03751533,
        -0.01551702, -0.02634298,  0.01285874, -0.00662382, -0.0299028 ,
         0.02153771,  0.04789722,  0.01314703, -0.00780978, -0.03815674,
         0.03145741, -0.04256538, -0.02866439, -0.02616819,  0.03275535,
  

## Modelling a text data set (running experiments)

Now that we've got a way to turn our text sequences into numbers, it's time to start building a series of modelling experiments.

We'll start with a baseline and move on from there.
* Model 0: Naive Bayes (baseline), chosen from scikit-learn's ML map
* Model 1: Feed-forward neural network (dense model)
* Model 2: LSTM model (RNN)
* Model 3: GRU model (RNN)
* Model 4: Bidirectional-LSTM model (RNN)
* Model 5: 1D Convolutional Neural Network (CNN)
* Model 6: TensorFlow Hub Pretrained Feature Extractor (transfer learning for NLP)
* Model 7: Same as model 6 with 10% of training data

How are we going to approach all 8 models?

We'll use the standard steps in modelling:
* Create a model
* Build the model
* Fit the model
* Evaluate the model
* Experiment
* Save and reload the model



## Model 0: Getting a baseline

As with all machine learning modelling experiments, it's important to create a baseline model so you have a benchmark for future models.

To create the baseline, we'll use scikit-learn's Multinomial Naive Bayes with the TF-IDF formula to convert our words to numbers.

> **Note:** It's common practice to use non-DL algorithms as a baseline because of their speed, and then later use DL to see if you can improve on them.
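As a rough illustration of what the TF-IDF step does, here is a plain-Python sketch. It assumes a smoothed IDF of `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is my understanding of scikit-learn's defaults (`smooth_idf=True`, `norm='l2'`). The `tfidf` function is a hypothetical helper with simplified whitespace tokenization, not the library's implementation.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document using smoothed IDF and L2 normalization."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

    weighted = []
    for tokens in tokenized:
        # Term frequency (raw count) multiplied by inverse document frequency
        raw = {term: count * idf[term] for term, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weighted.append({term: w / norm for term, w in raw.items()})
    return weighted

toy_docs = ["forest fire near town", "fire drill at school", "great party in town"]
toy_weights = tfidf(toy_docs)
for doc, weights in zip(toy_docs, toy_weights):
    print(doc, {term: round(w, 2) for term, w in weights.items()})
```

Words that appear in fewer documents (such as "forest") receive a higher weight than words shared across documents (such as "fire"), which is the property the Naive Bayes classifier then benefits from.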

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Create tokenization and modelling pipeline

model_0 = Pipeline([
                    ("tfidf", TfidfVectorizer()), # Convert words to numbers
                    ("clf", MultinomialNB()) # Model the text
])

# Fit the pipeline to the training data

model_0.fit(train_sentences, train_labels)



Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('clf',
                 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))],
         verbose=False)

In [25]:
# Evaluate our baseline model
baseline_score = model_0.score(val_sentences, val_labels)
print(f"Our baseline model achieves an accuracy of: {baseline_score*100:.2f}%")

Our baseline model achieves an accuracy of: 79.27%


In [26]:
# Make predictions
baseline_preds = model_0.predict(val_sentences)
baseline_preds[:20]

array([1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1])

## Create an evaluation function for the model

We could evaluate each model with different metrics every time, but that would be cumbersome. Instead, we'll write a function that calculates the same metrics for every model:

* Accuracy
* Precision
* Recall
* F1 Score
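As a rough sketch of what these four metrics measure, the snippet below computes them by hand on toy labels. `weighted_metrics` is a hypothetical helper; it assumes "weighted" averaging means each class's score is weighted by the number of true samples in that class.

```python
def weighted_metrics(y_true, y_pred):
    """Compute accuracy plus support-weighted precision, recall and F1."""
    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    precision = recall = f1 = 0.0
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        p_c = tp / (tp + fp) if tp + fp else 0.0
        r_c = tp / (tp + fn) if tp + fn else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if p_c + r_c else 0.0
        weight = (tp + fn) / n  # class support as a fraction of all samples
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

toy_true = [1, 1, 1, 0, 0, 0, 0, 1]
toy_pred = [1, 0, 1, 0, 0, 1, 0, 1]
toy_results = weighted_metrics(toy_true, toy_pred)
print(toy_results)
```

Under this weighting scheme, weighted recall always equals accuracy, which is worth keeping in mind when reading the results for each model.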


In [27]:
# Evaluation function

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def calculate_results(y_true, y_pred):
  """
  Calculates accuracy, precision, recall and F1 score of a binary classification model.

  Accuracy is returned as a percentage (0-100); precision, recall and
  F1 score are weighted averages on a 0-1 scale.
  """

  # Calculate model accuracy
  model_accuracy = accuracy_score(y_true, y_pred) * 100

  # Calculate model precision, recall and F1 score (weighted average)
  model_precision, model_recall, model_f1, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted")
  model_results = {"accuracy": model_accuracy,
                   "precision": model_precision,
                   "recall": model_recall,
                   "f1 score": model_f1}

  return model_results

In [28]:
# Get baseline results

baseline_results = calculate_results(y_true=val_labels,
                                     y_pred=baseline_preds)

baseline_results

{'accuracy': 79.26509186351706,
 'f1 score': 0.7862189758049549,
 'precision': 0.8111390004213173,
 'recall': 0.7926509186351706}

## Model 1: Simple Dense Model

In [29]:
# Create a TensorBoard callback to track each model (a new one is needed for each model)
from helper_functions import create_tensorboard_callback

# Create a directory to save tensorboard logs

SAVE_DIR = "model_logs"

In [51]:
# Build model with Functional API
from tensorflow.keras import layers

inputs = layers.Input(shape=(1,), dtype=tf.string) # Inputs are one-dimensional strings

x = text_vectorizer(inputs) # Turn the input text into numbers

x = embedding(x) # Create an embedding of the numerized inputs

x = layers.GlobalAveragePooling1D()(x) # Condense the feature vector for each token into a single vector

outputs = layers.Dense(1, activation="sigmoid")(x) # Output layer: sigmoid activation for binary outputs

model_1 = tf.keras.Model(inputs, outputs, name="model_1_dense")
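The pooling step is what turns a variable number of token vectors into one fixed-size vector per sentence. The sketch below shows the idea for a single sample in plain Python. `global_average_pool_1d` is a hypothetical stand-in for illustration, not the Keras layer itself.

```python
def global_average_pool_1d(sequence):
    """Average a list of token vectors feature by feature into a single vector."""
    n_tokens = len(sequence)
    n_features = len(sequence[0])
    return [sum(token[f] for token in sequence) / n_tokens for f in range(n_features)]

# Toy "embedded sentence": 3 tokens, each with 4 features
embedded_sentence = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 2.0, 1.0, 0.0],
    [2.0, 2.0, 2.0, 2.0],
]
pooled = global_average_pool_1d(embedded_sentence)
print(pooled)  # [2.0, 2.0, 2.0, 2.0]
```

In the model above the same idea takes each sentence from 15 token vectors of 128 features down to a single 128-feature vector, which the final Dense layer can consume.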


In [52]:
model_1.summary()

Model: "model_1_dense"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         [(None, 1)]               0         
_________________________________________________________________
text_vectorization_1 (TextVe (None, 15)                0         
_________________________________________________________________
embedding (Embedding)        (None, 15, 128)           1280000   
_________________________________________________________________
global_average_pooling1d_1 ( (None, 128)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 129       
Total params: 1,280,129
Trainable params: 1,280,129
Non-trainable params: 0
_________________________________________________________________
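The parameter counts in the summary can be checked by hand. The sketch below redoes the arithmetic using the vocabulary size and embedding dimension set earlier; the variable names are illustrative only.

```python
vocab_size = 10000   # max_vocab_length
embedding_dim = 128  # output_dim of the Embedding layer
dense_units = 1      # single sigmoid output

embedding_params = vocab_size * embedding_dim             # one vector per vocabulary entry
dense_params = embedding_dim * dense_units + dense_units  # weights plus bias
total_params = embedding_params + dense_params

print(embedding_params, dense_params, total_params)  # 1280000 129 1280129
```

Almost all of the trainable parameters sit in the embedding, while the text vectorization and pooling layers contribute none.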


In [53]:
# Compile model
model_1.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

In [54]:
# Fit the model
model_1_history = model_1.fit(x=train_sentences,
                              y=train_labels,
                              epochs=5,
                              validation_data=(val_sentences, val_labels),
                              callbacks=[create_tensorboard_callback(dir_name=SAVE_DIR,
                                                                     experiment_name="model_1_dense")])

Saving TensorBoard log files to: model_logs/model_1_dense/20210702-043318
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [56]:
model_1.evaluate(val_sentences, val_labels)



[0.4698525071144104, 0.7821522355079651]

In [57]:
# Make some predictions and evaluate them
model_1_pred_probs = model_1.predict(val_sentences)
model_1_pred_probs.shape

(762, 1)

In [61]:
# Look at the first 10 prediction probabilities
model_1_pred_probs[:10]

array([[0.31518134],
       [0.8112293 ],
       [0.9973266 ],
       [0.15898155],
       [0.11760436],
       [0.94113594],
       [0.86556405],
       [0.9967026 ],
       [0.9734354 ],
       [0.2927875 ]], dtype=float32)

In [62]:
# Convert model prediction probabilities to label format
model_1_preds = tf.squeeze(tf.round(model_1_pred_probs))

model_1_preds[:20]

<tf.Tensor: shape=(20,), dtype=float32, numpy=
array([0., 1., 1., 0., 0., 1., 1., 1., 1., 0., 0., 1., 0., 0., 0., 0., 0.,
       0., 0., 1.], dtype=float32)>
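Rounding a sigmoid output amounts to applying a 0.5 decision threshold. The sketch below makes that threshold explicit in plain Python. `probs_to_labels` is a hypothetical helper, and the strict `>` comparison is an assumption chosen so that a probability of exactly 0.5 maps to 0.

```python
def probs_to_labels(pred_probs, threshold=0.5):
    """Turn nested [[p], [p], ...] probabilities into a flat list of 0/1 labels."""
    return [1 if row[0] > threshold else 0 for row in pred_probs]

# First five prediction probabilities copied from the output above
first_probs = [[0.31518134], [0.8112293], [0.9973266], [0.15898155], [0.11760436]]
print(probs_to_labels(first_probs))  # [0, 1, 1, 0, 0]
```

Making the threshold a parameter also makes it easy to trade precision against recall later, for example by raising it to reduce false disaster alerts.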

In [63]:
# Calculate our model_1 results
model_1_results = calculate_results(y_true=val_labels,
                                    y_pred=model_1_preds)
model_1_results

{'accuracy': 78.21522309711287,
 'f1 score': 0.7799245444538409,
 'precision': 0.7846540517195822,
 'recall': 0.7821522309711286}

In [64]:
baseline_results

{'accuracy': 79.26509186351706,
 'f1 score': 0.7862189758049549,
 'precision': 0.8111390004213173,
 'recall': 0.7926509186351706}

In [66]:
# Baseline results are outperforming the first deep learning model

import numpy as np
np.array(list(model_1_results.values())) > np.array(list(baseline_results.values()))

array([False, False, False, False])
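The array comparison above relies on both dictionaries listing their metrics in the same order. As an alternative sketch, the comparison can be done key by key, which also shows by how much each metric differs. `compare_results` is a hypothetical helper, and the numbers are copied from the two result dictionaries above.

```python
# Metric values copied from the baseline and model_1 results above
baseline_scores = {"accuracy": 79.26509186351706, "precision": 0.8111390004213173,
                   "recall": 0.7926509186351706, "f1 score": 0.7862189758049549}
dense_scores = {"accuracy": 78.21522309711287, "precision": 0.7846540517195822,
                "recall": 0.7821522309711286, "f1 score": 0.7799245444538409}

def compare_results(reference, candidate):
    """Return candidate minus reference for every metric present in both dicts."""
    return {metric: candidate[metric] - reference[metric]
            for metric in reference if metric in candidate}

score_diffs = compare_results(baseline_scores, dense_scores)
for metric, diff in score_diffs.items():
    print(f"{metric}: {diff:+.4f}")
```

Every difference is negative, so the dense model trails the baseline on all four metrics. Accuracy is on a 0-100 scale, so its gap looks larger than the others.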