
# Introduction to NLP Fundamentals

* NLP aims to derive information from natural language (sequences of text or speech).
* A core neural network architecture for sequences is the RNN (recurrent neural network). It differs from other network types in that it remembers the previous word (or token) and uses it when processing the current one.

In [None]:
!nvidia-smi -L

GPU 0: Tesla T4 (UUID: GPU-57ced2f7-0129-f746-435e-781c8272de17)


In [None]:
# Import helper function
!wget https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py

from helper_functions import unzip_data, create_tensorboard_callback, plot_loss_curves, compare_historys

--2024-08-09 06:21:01--  https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10246 (10K) [text/plain]
Saving to: ‘helper_functions.py’


2024-08-09 06:21:01 (84.4 MB/s) - ‘helper_functions.py’ saved [10246/10246]



# Become one with the data

## Get a text dataset

The dataset we're going to use is Kaggle's introduction to NLP dataset (text samples of Tweets labelled as disaster or not disaster).

The original source: https://www.kaggle.com/competitions/nlp-getting-started


In [None]:
!wget https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip

# Unzip the data
unzip_data("nlp_getting_started.zip")

--2024-08-09 06:21:12--  https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 142.250.4.207, 172.253.118.207, 74.125.200.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.250.4.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 607343 (593K) [application/zip]
Saving to: ‘nlp_getting_started.zip’


2024-08-09 06:21:13 (706 KB/s) - ‘nlp_getting_started.zip’ saved [607343/607343]



## Visualizing our dataset

Our data is stored as CSV files, so to visualize it we first need to read it. One way to do so is with plain Python (for example the built-in `csv` module).

Another way, used here, is **pandas**.

In [None]:
import pandas as pd
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

In [None]:
train_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


## Shuffle training dataframe


In [None]:
train_df_shuffled = train_df.sample(frac=1, random_state=42) # Returns the rows in random order; frac=1 means sample (shuffle) 100% of the rows
train_df_shuffled.head()

Unnamed: 0,id,keyword,location,text,target
2644,3796,destruction,,So you have a new weapon that can cause un-ima...,1
2227,3185,deluge,,The f$&amp;@ing things I do for #GISHWHES Just...,0
5448,7769,police,UK,DT @georgegalloway: RT @Galloway4Mayor: ÛÏThe...,1
132,191,aftershock,,Aftershock back to school kick off was great. ...,0
6845,9810,trauma,"Montgomery County, MD",in response to trauma Children of Addicts deve...,0


In [None]:
# See the test dataframe
test_df.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


In [None]:
# How many examples of each class are there? (0 = not disaster, 1 = disaster)
train_df.target.value_counts()

target,count
0,4342
1,3271


In [None]:
# How many total samples?
len(train_df), len(test_df)

(7613, 3263)
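The class counts above imply a mild imbalance rather than an even split. As a quick check, here is a small sketch in plain Python; the numbers are copied from the `value_counts()` output above and nothing else is assumed:

```python
# Class counts copied from the value_counts() output above
counts = {0: 4342, 1: 3271}  # 0 = not disaster, 1 = disaster
total = sum(counts.values())  # should equal len(train_df)

for label, count in counts.items():
    print(f"class {label}: {count} samples ({count / total:.1%})")
```

This works out to roughly 57% "not disaster" and 43% "disaster", which is balanced enough that accuracy remains a reasonable first metric.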

In [None]:
# Visualize random training samples
import random
random_index = random.randint(0, len(train_df)-5)
for row in train_df_shuffled[["text", "target"]][random_index:random_index+5].itertuples():
  _, text, target=row
  print(f"Target: {target}", "(real disaster)" if target > 0 else "(not disaster)")
  print(f"Text sample:\n{text}\n")
  print("---\n")


Target: 1 (real disaster)
Text sample:
#flood #disaster Heavy rain causes flash flooding of streets in Manitou, Colorado Springs areas

---

Target: 0 (not disaster)
Text sample:
I'm drowning in spirits to wash you out

---

Target: 1 (real disaster)
Text sample:
FAAN orders evacuation of abandoned aircraft at MMA: FAAN noted that the action had become necessary due to re... http://t.co/ZUqgvJnEQA

---

Target: 0 (not disaster)
Text sample:
Dont even come if you worried about curfew #BC19

---

Target: 0 (not disaster)
Text sample:
I presume my timeline will be inundated with 'soggy bottom' &amp; lashings of 'moist' tweets now! :-D

---



## Split data into training and validation sets

> The test set doesn't have a target column, so we need to split the training set to get labelled validation data.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Use train_test_split to split training data into training and validation sets
# .to_numpy() converts each pandas Series into a NumPy array (train_test_split also accepts Series, but arrays are convenient to pass to the models later)

train_sentences, val_sentences, train_labels, val_labels = train_test_split(train_df_shuffled["text"].to_numpy(),
                                                                            train_df_shuffled["target"].to_numpy(),
                                                                            test_size=0.1, # Use 10% of the training data for validation
                                                                            random_state=42)

In [None]:
# Check the lengths
len(train_sentences), len(train_labels), len(val_sentences), len(val_labels)

(6851, 6851, 762, 762)
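The split sizes above follow directly from `test_size=0.1`. A short arithmetic sketch, assuming scikit-learn rounds the held-out share up and gives the remainder to training (which matches the output above):

```python
import math

n_samples = 7613   # len(train_df), from the output above
test_size = 0.1    # fraction passed to train_test_split

# The validation share is rounded up; the rest goes to training
n_val = math.ceil(test_size * n_samples)
n_train = n_samples - n_val

print(n_train, n_val)  # 6851 762
```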

In [None]:
# Check the first 10 examples
train_sentences[:10], train_labels[:10]

(array(['@mogacola @zamtriossu i screamed after hitting tweet',
        'Imagine getting flattened by Kurt Zouma',
        '@Gurmeetramrahim #MSGDoing111WelfareWorks Green S welfare force ke appx 65000 members har time disaster victim ki help ke liye tyar hai....',
        "@shakjn @C7 @Magnums im shaking in fear he's gonna hack the planet",
        'Somehow find you and I collide http://t.co/Ee8RpOahPk',
        '@EvaHanderek @MarleyKnysh great times until the bus driver held us hostage in the mall parking lot lmfao',
        'destroy the free fandom honestly',
        'Weapons stolen from National Guard Armory in New Albany still missing #Gunsense http://t.co/lKNU8902JE',
        '@wfaaweather Pete when will the heat wave pass? Is it really going to be mid month? Frisco Boy Scouts have a canoe trip in Okla.',
        'Patient-reported outcomes in long-term survivors of metastatic colorectal cancer - British Journal of Surgery http://t.co/5Yl4DC1Tqt'],
       dtype=object),
 array([0,

# Convert text into numbers

> When working on a text problem, one of the first things you have to do is convert your text into numbers. There are two common ways to do so:

* Tokenization - a direct mapping from each token to a number (a token could be a word or a character).
* Embedding - a feature vector for each token (the size of the vector is configurable); together these vectors form a matrix that can be learned during training.
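The two ideas above can be sketched in plain Python. This is a toy illustration only: the sentence, the `embedding_dim` of 4 and the random values are made up, and a real model would learn the vectors during training.

```python
import random

sentence = "forest fire near the forest"

# Tokenization: map each unique token to an integer ID
vocab = {}
for token in sentence.split():
    if token not in vocab:
        vocab[token] = len(vocab)
token_ids = [vocab[token] for token in sentence.split()]
print(token_ids)  # [0, 1, 2, 3, 0]

# Embedding: a lookup table with one feature vector per token ID
# (random here; in a real model these values are learned)
random.seed(42)
embedding_dim = 4  # hypothetical, tiny size for readability
embedding_table = [[random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
                   for _ in range(len(vocab))]
embedded = [embedding_table[i] for i in token_ids]
print(len(embedded), len(embedded[0]))  # 5 4
```

Note that both occurrences of "forest" get the same ID and therefore the same vector.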



## Text vectorization (Tokenization)

In [None]:
# What does our data look like?
train_sentences[:5]

array(['@mogacola @zamtriossu i screamed after hitting tweet',
       'Imagine getting flattened by Kurt Zouma',
       '@Gurmeetramrahim #MSGDoing111WelfareWorks Green S welfare force ke appx 65000 members har time disaster victim ki help ke liye tyar hai....',
       "@shakjn @C7 @Magnums im shaking in fear he's gonna hack the planet",
       'Somehow find you and I collide http://t.co/Ee8RpOahPk'],
      dtype=object)

In [None]:
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

In [None]:
# Use the default TextVectorization parameters

text_vectorizer = TextVectorization(max_tokens=None, # max vocabulary size; None means no limit
                                    standardize="lower_and_strip_punctuation", # how to clean the text
                                    split="whitespace", # how to split the text into tokens
                                    ngrams=None, # whether to group tokens into n-grams; None keeps single tokens
                                    output_mode="int", # map each token to an integer
                                    output_sequence_length=None) # length to pad/truncate sequences to; None means no fixed length
                                    # pad_to_max_tokens=True

In [None]:
train_sentences[0].split()

['@mogacola', '@zamtriossu', 'i', 'screamed', 'after', 'hitting', 'tweet']

In [None]:
len(train_sentences[0].split())

7

In [None]:
# Find the avg number of tokens (words) in the training tweets
round(sum([len(i.split()) for i in train_sentences])/len(train_sentences))

15

In [None]:
# Setup text vectorization variables
max_vocab_length = 10000 #max number of words to have in our vocab
max_length = 15 # max length our sequences will be (how many words from a single tweet our model will see?)

text_vectorizer = TextVectorization(max_tokens=max_vocab_length,
                                    output_mode="int",
                                    output_sequence_length = max_length)

In [None]:
# Fit the text vectorizer to the training text
# A TextVectorization layer must either be adapted to a dataset or be given a vocabulary before use
text_vectorizer.adapt(train_sentences)

In [None]:
# Create a sample sentence and tokenize it
sample_sentence = "Math is the best"
text_vectorizer([sample_sentence])

<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[  1,   9,   2, 149,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0]])>
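In the output above, `0` is the padding index and `1` is the `[UNK]` (out-of-vocabulary) index, which suggests "math" is not among the 10,000 tokens kept from the training text. The sketch below mimics that behaviour in plain Python; `toy_vocab` and `toy_vectorize` are hypothetical stand-ins, not the Keras implementation.

```python
import string

# Hypothetical tiny vocabulary; index 0 is padding and index 1 is [UNK],
# mirroring the layout of the TextVectorization vocabulary
toy_vocab = ["", "[UNK]", "the", "is", "best"]
token_to_id = {token: i for i, token in enumerate(toy_vocab)}

def toy_vectorize(text, max_length=15):
    # Lowercase and strip punctuation, then split on whitespace
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    ids = [token_to_id.get(token, 1) for token in cleaned.split()]
    # Truncate long sequences, pad short ones with 0
    ids = ids[:max_length]
    return ids + [0] * (max_length - len(ids))

print(toy_vectorize("Math is the best!"))
# [1, 3, 2, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```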

In [None]:
# Choose a random sentence from the training dataset
random_sentence = random.choice(train_sentences)
print(f"Original sentence:\n {random_sentence}\n\n")
print("Vectorized version:")
text_vectorizer([random_sentence])

Original sentence:
 @SenateMajLdr let's try to do our best to prevent another outbreak of violence by talking to each other both the people and the politics


Vectorized version:


<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[8665,  541,  831,    5,   68,  103,  149,    5, 1378,  165,  298,
           6, 2236,   18, 1462]])>

In [None]:
# Get the unique words in the vocabulary
words_in_vocab = text_vectorizer.get_vocabulary() # Get all of the unique words
top_5_words = words_in_vocab[:5]
bottom_5_words = words_in_vocab[-5:]
print(f"Number of words in vocab: {len(words_in_vocab)}")
print(f"5 most common words: {top_5_words}")
print(f"5 least common words: {bottom_5_words}")

Number of words in vocab: 10000
5 most common words: ['', '[UNK]', 'the', 'a', 'in']
5 least common words: ['pages', 'paeds', 'pads', 'padres', 'paddytomlinson1']


## Create an embedding

TensorFlow offers an embedding layer that can be used directly after setting a few parameters:
* `input_dim`: The size of our vocabulary.
* `output_dim`: The size of the output embedding vector.
* `input_length`: The length of the sequences being passed to the embedding layer (recent Keras versions deprecate this argument, so it may trigger a warning).


In [None]:
tf.random.set_seed(42)
from tensorflow.keras import layers

embedding = layers.Embedding(input_dim=max_vocab_length, # set input shape
                             output_dim=128, # set size of embedding vector
                             embeddings_initializer="uniform", # default, initialize randomly
                             input_length=max_length, # how long is each input
                             name="embedding_1")

embedding



<Embedding name=embedding_1, built=False>
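As a rough size check, the embedding layer stores one vector of length `output_dim` for every vocabulary entry. A small arithmetic sketch using the values set above:

```python
max_vocab_length = 10000  # input_dim, as set earlier
embedding_dim = 128       # output_dim

# One trainable vector of size embedding_dim per vocabulary entry
embedding_params = max_vocab_length * embedding_dim
print(f"{embedding_params:,}")  # 1,280,000
```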

In [None]:
# Get a random sentence
random_sentence = random.choice(train_sentences)
print(f"Original text:\n {random_sentence}\
      \n\nEmbedded version:")

# Embed the random sentence (turn it into a dense vector)
sample_embed = embedding(text_vectorizer([random_sentence]))
sample_embed

Original text:
 JOBOOZOSO: USAT usatoday_nfl Michael Floyd's hand injury shouldn't devalue his fantasy stock http://t.co/DGkmUEoAxZ      

Embedded version:


<tf.Tensor: shape=(1, 15, 128), dtype=float32, numpy=
array([[[ 9.2579946e-03,  2.8663624e-02,  8.7894499e-05, ...,
          1.0342397e-02, -3.5464652e-03, -9.8634511e-05],
        [-4.4020917e-02,  4.9836565e-02, -4.4883560e-02, ...,
         -4.5130767e-02,  4.5497902e-03,  4.2583037e-02],
        [ 1.1597823e-02, -3.6085021e-02,  5.1572807e-03, ...,
          8.4151626e-03,  3.3579540e-02,  1.9274238e-02],
        ...,
        [ 9.2579946e-03,  2.8663624e-02,  8.7894499e-05, ...,
          1.0342397e-02, -3.5464652e-03, -9.8634511e-05],
        [ 1.3435591e-02, -4.9216557e-02,  2.8530743e-02, ...,
          1.5006270e-02,  2.4644319e-02,  4.3834224e-03],
        [ 1.3435591e-02, -4.9216557e-02,  2.8530743e-02, ...,
          1.5006270e-02,  2.4644319e-02,  4.3834224e-03]]], dtype=float32)>

In [None]:
# Check out a single token's embedding
sample_embed[0][0], sample_embed[0][0].shape, random_sentence

(<tf.Tensor: shape=(128,), dtype=float32, numpy=
 array([ 9.25799459e-03,  2.86636241e-02,  8.78944993e-05,  3.00476812e-02,
         2.61380412e-02,  4.63107117e-02, -4.29058187e-02,  3.80254500e-02,
         2.69741528e-02,  4.76666428e-02,  4.29976918e-02, -9.20299441e-03,
         2.34788917e-02,  1.13563053e-02, -4.50099967e-02,  4.43215258e-02,
        -3.54205258e-02,  2.80140080e-02, -3.26441303e-02, -4.33638804e-02,
        -1.96484476e-03, -4.73546274e-02,  4.14412729e-02, -2.05518957e-02,
        -1.28874779e-02,  3.24938446e-03, -4.35409546e-02,  9.82233137e-03,
         2.10933350e-02,  9.12146643e-03, -3.53891253e-02, -2.01856382e-02,
        -5.04239649e-03, -7.87900761e-03, -4.65510748e-02, -4.38164845e-02,
        -2.29247939e-02,  9.90790129e-03,  7.47663900e-03,  4.94228713e-02,
         3.39393280e-02, -7.18677044e-03,  1.24184005e-02, -2.67630704e-02,
        -9.25575569e-03,  4.99517582e-02,  2.28365548e-02,  1.36413239e-02,
         2.14672424e-02,  4.86176051e-0

# Create Models

* Model 0: Naive Bayes (baseline)
* Model 1: Feed-forward neural network
* Model 2: LSTM model (RNN)
* Model 3: GRU (RNN)
* Model 4: Bidirectional-LSTM (RNN)
* Model 5: 1D Convolutional Neural Network (CNN)
* Model 6: TF Hub pretrained feature extractor
* Model 7: Same as model 6 with 10% of training data

## Model 0: Getting a baseline

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline # Chains steps together, loosely similar to Keras Sequential

# Create a tokenization and modelling pipeline
model_0 = Pipeline([
                    ("tfidf", TfidfVectorizer()), # Convert words to numbers using tfidf
                    ("clf", MultinomialNB()) # model the text
])

# Fit the model
model_0.fit(train_sentences, train_labels)

In [None]:
# Evaluate the baseline model
baseline_score = model_0.score(val_sentences, val_labels) # score() is scikit-learn's counterpart to Keras' evaluate(); for classifiers it returns accuracy
print(f"Baseline model accuracy: {baseline_score *100:.2f}%")

Baseline model accuracy: 79.27%


In [None]:
# Make predictions
baseline_preds = model_0.predict(val_sentences)
baseline_preds[:20]

array([1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1])

### Create a function to evaluate our model

Different metrics:
* Accuracy
* Precision
* Recall
* F1-score

In [None]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def calculate_results(y_true, y_pred):
  """
  Calculate model accuracy, precision, recall and f1 of a binary classification model.
  """
  # Calculate model accuracy (as a percentage, 0-100)
  model_accuracy = accuracy_score(y_true, y_pred) * 100
  # Calculate model precision, recall and f1 (weighted averages, on a 0-1 scale)
  # Note: the trailing underscore discards the fourth return value (support),
  # which we don't need
  model_precision, model_recall, model_f1, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted")
  model_results = {"accuracy": model_accuracy,
                   "precision": model_precision,
                   "recall": model_recall,
                   "f1-score": model_f1}
  return model_results

In [None]:
# Get baseline results
baseline_results = calculate_results(y_true=val_labels,
                                     y_pred=baseline_preds)
baseline_results

{'accuracy': 79.26509186351706,
 'precision': 0.8111390004213173,
 'recall': 0.7926509186351706,
 'f1-score': 0.7862189758049549}
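In the results above, the weighted recall equals the accuracy divided by 100. That is expected with `average="weighted"`: each class's recall is weighted by its share of the true labels, which sums to the overall fraction of correct predictions. The plain-Python sketch below spells out the computation; `weighted_metrics` and the toy labels are made up for illustration and are not part of scikit-learn.

```python
def weighted_metrics(y_true, y_pred):
    """Support-weighted precision, recall and F1, written out in plain Python."""
    totals = {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        weight = (tp + fn) / len(y_true)  # share of samples with this true label
        totals["precision"] += weight * precision
        totals["recall"] += weight * recall
        totals["f1"] += weight * f1
    return totals

# Toy labels (made up for illustration)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
results = weighted_metrics(y_true, y_pred)
print(accuracy, results)
```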

## Model 1: A simple dense model

In [None]:
# Create a function to check the difference between the baseline results and the given model's results
def compare_baseline_to_new_results(baseline_results, new_model_results):
  for key, value in baseline_results.items():
    print(f"Baseline {key}: {value:.2f} | New model {key}: {new_model_results[key]:.2f} | Difference: {new_model_results[key]-value:.2f}")

In [None]:
# Create a tensorboard callback

#from helper_functions import create_tensorboard_callback

# Create a directory to save the logs
#SAVE_DIR = "model_logs" NOTE: Tensorboard is no longer used

### Build model with Functional API

In [None]:
from tensorflow.keras import layers

inputs = layers.Input(shape=(1,), dtype=tf.string) # Each input is a single string
x = text_vectorizer(inputs) # Turn the input strings into numbers
x = embedding(x) # Create an embedding of the numberized inputs
x = layers.GlobalAveragePooling1D()(x) # Average the token embeddings into one vector per sentence
outputs = layers.Dense(1, activation="sigmoid")(x) # Binary classification: one output unit with a sigmoid activation

model_1 = tf.keras.Model(inputs, outputs, name="Model_1_Dense")
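`GlobalAveragePooling1D` collapses the token axis by averaging, so an embedded sentence of shape `(15, 128)` becomes a single 128-value vector that the dense layer can consume. A minimal plain-Python sketch with toy sizes (3 tokens, 2 dimensions; the numbers are made up):

```python
# One embedded sentence: 3 tokens, each a 2-dimensional vector (toy sizes)
embedded_sentence = [
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
]

# Average over the token axis, leaving one value per embedding dimension
num_tokens = len(embedded_sentence)
pooled = [sum(column) / num_tokens for column in zip(*embedded_sentence)]
print(pooled)  # [3.0, 4.0]
```

Since no mask is passed along in this notebook, padded positions presumably take part in the average as well.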

In [None]:
# Compile the model
model_1.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])


In [None]:
model_1.summary()

In [None]:
# Fit the model
model_1_history = model_1.fit(x=train_sentences,
                               y=train_labels,
                               epochs=5,
                               validation_data=(val_sentences, val_labels))

Epoch 1/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 4s 6ms/step - accuracy: 0.6460 - loss: 0.6497 - val_accuracy: 0.7612 - val_loss: 0.5339
Epoch 2/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - accuracy: 0.8096 - loss: 0.4662 - val_accuracy: 0.7913 - val_loss: 0.4739
Epoch 3/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 2s 4ms/step - accuracy: 0.8533 - loss: 0.3620 - val_accuracy: 0.7953 - val_loss: 0.4620
Epoch 4/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.8866 - loss: 0.2956 - val_accuracy: 0.7887 - val_loss: 0.4683
Epoch 5/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.9066 - loss: 0.2468 - val_accuracy: 0.7808 - val_loss: 0.4841


In [None]:
# Evaluate model_1
model_1.evaluate(val_sentences, val_labels)

24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.7692 - loss: 0.5164


[0.484100341796875, 0.7808399200439453]

In [None]:
model_1_pred_probs = model_1.predict(val_sentences)
model_1_pred_probs.shape

24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step


(762, 1)

In [None]:
# Look at one prediction
model_1_pred_probs[0]

array([0.32025823], dtype=float32)

In [None]:
# Look at the first 10 predictions
model_1_pred_probs[:10]

array([[0.32025823],
       [0.7347645 ],
       [0.9974777 ],
       [0.18962027],
       [0.10636731],
       [0.94413483],
       [0.8883857 ],
       [0.99445564],
       [0.96072006],
       [0.3873078 ]], dtype=float32)

> NOTE: How to read the predictions: a probability below 0.5 means the model predicts "not a disaster", and a probability above 0.5 means it predicts "disaster".

In [None]:
# Convert the model's predictions to label format
model_1_preds = tf.squeeze(tf.round(model_1_pred_probs))
model_1_preds[:20]

<tf.Tensor: shape=(20,), dtype=float32, numpy=
array([0., 1., 1., 0., 0., 1., 1., 1., 1., 0., 0., 1., 0., 0., 0., 0., 0.,
       0., 0., 1.], dtype=float32)>
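Rounding the probabilities amounts to applying a 0.5 decision threshold. The sketch below makes the threshold explicit in plain Python, using the first ten probabilities copied from the earlier output; the `threshold` variable is introduced here for illustration and could be tuned if precision or recall matters more.

```python
# First ten prediction probabilities, copied from the output above
pred_probs = [0.32025823, 0.7347645, 0.9974777, 0.18962027, 0.10636731,
              0.94413483, 0.8883857, 0.99445564, 0.96072006, 0.3873078]

# Explicit threshold: anything above 0.5 is labelled 1 (disaster)
threshold = 0.5
pred_labels = [1 if p > threshold else 0 for p in pred_probs]
print(pred_labels)  # [0, 1, 1, 0, 0, 1, 1, 1, 1, 0]
```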

In [None]:
# Calculate the model_1 results
model_1_results = calculate_results(y_true=val_labels,
                                    y_pred=model_1_preds)
model_1_results

{'accuracy': 78.08398950131233,
 'precision': 0.783783808499639,
 'recall': 0.7808398950131233,
 'f1-score': 0.7783998521836788}

In [None]:
baseline_results

{'accuracy': 79.26509186351706,
 'precision': 0.8111390004213173,
 'recall': 0.7926509186351706,
 'f1-score': 0.7862189758049549}

## Visualize the embedding layer

In [None]:
# Get the vocabulary from the text vectorizer layer
words_in_vocab = text_vectorizer.get_vocabulary()
len(words_in_vocab), words_in_vocab[:10]

(10000, ['', '[UNK]', 'the', 'a', 'in', 'to', 'of', 'and', 'i', 'is'])

In [None]:
model_1.summary()

In [None]:
# Get the embedding layer's weights (the numerical representation of each word)
embed_weights= model_1.get_layer("embedding_1").get_weights()[0]
len(embed_weights)

10000

In [None]:
# Check the shape of the embedding matrix: (vocab size, embedding_dim),
# i.e. each word is represented by an embedding_dim-sized vector
print(embed_weights.shape)

(10000, 128)


### Use the TF embed projector tool

> https://projector.tensorflow.org/

NOTE: To use the Embedding Projector tool, we need two files:

* The embedding vectors (same as the embedding weights).
* The metadata of the embedding vectors (the words they represent, i.e. our vocabulary).

In [None]:
# Code below is adapted from: https://www.tensorflow.org/tutorials/text/word_embeddings#retrieve_the_trained_word_embeddings_and_save_them_to_disk
import io

# Write embedding vectors and their words to two TSV files
with io.open("embedding_vectors.tsv", "w", encoding="utf-8") as out_v, \
     io.open("embedding_metadata.tsv", "w", encoding="utf-8") as out_m:
  for num, word in enumerate(words_in_vocab):
    if num == 0:
      continue # skip padding token
    vec = embed_weights[num]
    out_m.write(word + "\n") # write words to file
    out_v.write("\t".join([str(x) for x in vec]) + "\n") # write corresponding word vector to file

In [None]:
# Download files locally to upload to Embedding Projector
try:
  from google.colab import files
except ImportError:
  pass
else:
  files.download("embedding_vectors.tsv")
  files.download("embedding_metadata.tsv")


# Recurrent Neural Networks (RNNs)

> RNNs are useful for sequence data.
> They take the representation of a previous input and use it to aid the representation of the next input.

Resources: MIT sequence modelling lecture: https://youtu.be/SEnXr6v2ifU
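The idea of carrying a representation from one input to the next can be sketched with a one-unit recurrent cell in plain Python. This is a toy illustration: `toy_rnn` and its weights `w_x` and `w_h` are hypothetical, and real RNN layers use weight matrices learned during training.

```python
import math

def toy_rnn(inputs, w_x=0.5, w_h=0.8):
    """Run a one-unit recurrent cell over a sequence and return the final state."""
    h = 0.0  # hidden state, carried from one step to the next
    for x in inputs:
        # The new state depends on the current input AND the previous state
        h = math.tanh(w_x * x + w_h * h)
    return h

# Same values, different order: the final state differs because order matters
print(toy_rnn([1.0, 0.0, -1.0]))
print(toy_rnn([-1.0, 0.0, 1.0]))
```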

## Model 2: LSTM

LSTM stands for Long Short-Term Memory.

The usual structure of an RNN model:
```
Input --> Tokenize --> Embedding --> Layers (RNNs/dense) --> Output (label probability)
```

In [None]:
# Create an LSTM model
from tensorflow.keras import layers

inputs = layers.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
x = embedding(x)
x = layers.LSTM(units=64)(x) # Units are usually chosen as a multiple of 8 (8, 16, 24, ...)
#x = layers.Dense(64, activation="relu")(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model_2 = tf.keras.Model(inputs, outputs, name="model_2_LSTM")

In [None]:
# Compile the model
model_2.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])


In [None]:
# Check the summary
model_2.summary()

In [None]:
# Fit the model
model_2_history = model_2.fit(x=train_sentences,
                              y=train_labels,
                              epochs=5,
                              validation_data=(val_sentences, val_labels))

Epoch 1/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 5s 9ms/step - accuracy: 0.9082 - loss: 0.3088 - val_accuracy: 0.7703 - val_loss: 0.5424
Epoch 2/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 2s 6ms/step - accuracy: 0.9348 - loss: 0.1720 - val_accuracy: 0.7730 - val_loss: 0.5975
Epoch 3/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 3s 6ms/step - accuracy: 0.9459 - loss: 0.1401 - val_accuracy: 0.7730 - val_loss: 0.6890
Epoch 4/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - accuracy: 0.9580 - loss: 0.1160 - val_accuracy: 0.7651 - val_loss: 0.7975
Epoch 5/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - accuracy: 0.9660 - loss: 0.0945 - val_accuracy: 0.7638 - val_loss: 0.8464
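The first epoch above already starts at roughly 0.91 training accuracy. One plausible explanation is that the `embedding` layer object is the same one used in `model_1`, so it arrives with weights that were already trained. The toy class below (`ToyEmbedding` is hypothetical, not a Keras API) illustrates in plain Python how reusing one layer object shares its state between models:

```python
class ToyEmbedding:
    """Stand-in for a layer that owns trainable weights."""
    def __init__(self):
        self.weights = [0.0, 0.0, 0.0]

    def train_step(self):
        # Pretend one training step nudges every weight
        self.weights = [w + 0.1 for w in self.weights]

shared = ToyEmbedding()
model_a_layers = [shared]  # first "model" uses the layer
model_b_layers = [shared]  # second "model" reuses the SAME object

model_a_layers[0].train_step()    # training model A ...
print(model_b_layers[0].weights)  # ... changes what model B starts from

fresh = ToyEmbedding()            # a new object starts from scratch
print(fresh.weights)
```

If an independent comparison between models is wanted, creating a separate `layers.Embedding(...)` instance for each model would give each one freshly initialized weights.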


In [None]:
# Make predictions with LSTM
model_2_pred_probs = model_2.predict(val_sentences)
model_2_pred_probs[:10]

24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step


array([[0.0250964 ],
       [0.89659965],
       [0.99973553],
       [0.06181465],
       [0.00200438],
       [0.9980106 ],
       [0.655495  ],
       [0.9997806 ],
       [0.9995478 ],
       [0.4251484 ]], dtype=float32)

In [None]:
# Convert model 2 predictions to labels (should be the same format as val_labels)
model_2_preds = tf.squeeze(tf.round(model_2_pred_probs))
model_2_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 1., 1., 0., 0., 1., 1., 1., 1., 0.], dtype=float32)>

In [None]:
val_labels[:10]

array([0, 0, 1, 1, 1, 1, 1, 1, 1, 0])

In [None]:
# Calculate model_2 results
model_2_results = calculate_results(y_true=val_labels,
                                    y_pred=model_2_preds)
model_2_results

{'accuracy': 76.37795275590551,
 'precision': 0.765729922717656,
 'recall': 0.7637795275590551,
 'f1-score': 0.7613639638656104}

## Model 3: GRU

> The GRU cell has similar features to an LSTM cell but has fewer parameters.
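To make the parameter difference concrete, the counts for the 64-unit layers used in this notebook can be worked out by hand. This is a sketch under the assumption of Keras defaults (in particular `reset_after=True` for the GRU, which adds a second bias vector per gate block):

```python
input_dim = 128  # embedding output size used in this notebook
units = 64       # units in the LSTM/GRU layers

# LSTM: 4 gate blocks, each with input weights, recurrent weights and one bias
lstm_params = 4 * (units * (input_dim + units) + units)

# GRU: 3 gate blocks; assuming Keras' default reset_after=True,
# each block carries two bias vectors
gru_params = 3 * (units * (input_dim + units) + 2 * units)

print(lstm_params, gru_params)  # 49408 37248
```

Under these assumptions the GRU layer has about a quarter fewer parameters than the LSTM layer.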

In [None]:
# Create the model
from tensorflow.keras import layers

inputs = layers.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
x = embedding(x)
x = layers.GRU(units=64)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)

model_3 = tf.keras.Model(inputs, outputs, name="model_3_GRU")

In [None]:
# Compile the model
model_3.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

In [None]:
# Get the summary of the model
model_3.summary()

In [None]:
# Fit the model
model_3_history = model_3.fit(train_sentences,
                              train_labels,
                              epochs=5,
                              validation_data=(val_sentences, val_labels))

Epoch 1/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 3s 12ms/step - accuracy: 0.9781 - loss: 0.0622 - val_accuracy: 0.7743 - val_loss: 0.9287
Epoch 2/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 4s 8ms/step - accuracy: 0.9778 - loss: 0.0494 - val_accuracy: 0.7743 - val_loss: 1.0664
Epoch 3/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 3s 9ms/step - accuracy: 0.9799 - loss: 0.0442 - val_accuracy: 0.7743 - val_loss: 1.1612
Epoch 4/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 2s 7ms/step - accuracy: 0.9799 - loss: 0.0433 - val_accuracy: 0.7717 - val_loss: 1.1593
Epoch 5/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 3s 8ms/step - accuracy: 0.9833 - loss: 0.0367 - val_accuracy: 0.7730 - val_loss: 1.0104


In [None]:
# Predict using the model
model_3_pred_probs = model_3.predict(val_sentences)
model_3_pred_probs[:10]

24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step


array([[1.9019637e-02],
       [7.1900171e-01],
       [9.9968159e-01],
       [2.3428182e-01],
       [1.7765754e-04],
       [9.9972039e-01],
       [3.9443439e-01],
       [9.9991691e-01],
       [9.9985969e-01],
       [5.4861248e-01]], dtype=float32)

In [None]:
# Convert the probabilities to labels
model_3_preds = tf.squeeze(tf.round(model_3_pred_probs))
model_3_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 1., 1., 0., 0., 1., 0., 1., 1., 1.], dtype=float32)>

In [None]:
# Calculate the performance of the model
model_3_results = calculate_results(y_true=val_labels,
                                    y_pred=model_3_preds)
model_3_results

{'accuracy': 77.29658792650919,
 'precision': 0.7733211200948874,
 'recall': 0.7729658792650919,
 'f1-score': 0.7716665369372698}

In [None]:
# Compare the results of the baseline and model_3
compare_baseline_to_new_results(baseline_results, model_3_results)

Baseline accuracy: 79.27 | New model accuracy: 77.30 | Difference: -1.97
Baseline precision: 0.81 | New model precision: 0.77 | Difference: -0.04
Baseline recall: 0.79 | New model recall: 0.77 | Difference: -0.02
Baseline f1-score: 0.79 | New model f1-score: 0.77 | Difference: -0.01


## Model 4: Bidirectional RNN

> The LSTM and GRU layers above process the sequence in one direction only (e.g. left to right), while a bidirectional RNN processes it in both directions (left to right AND right to left).
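A bidirectional layer can be pictured as two recurrent passes, one over the sequence and one over its reverse, whose results are joined. The plain-Python sketch below uses a one-unit toy cell; `run_rnn` and all weight values are hypothetical, chosen only to show the mechanics.

```python
import math

def run_rnn(inputs, w_x, w_h):
    """One-unit toy recurrent cell; returns the final hidden state."""
    h = 0.0
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
    return h

sequence = [1.0, 0.5, -1.0]

# Each direction has its own weights, as the two wrapped layers would
forward_state = run_rnn(sequence, w_x=0.5, w_h=0.8)         # left to right
backward_state = run_rnn(sequence[::-1], w_x=0.3, w_h=0.6)  # right to left

# The two final states are concatenated, so the output is twice as wide
bidirectional_output = [forward_state, backward_state]
print(len(bidirectional_output))  # 2
```

By the same logic, wrapping a 64-unit LSTM in `Bidirectional` with the default concatenation yields 128 output features per sentence.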

In [None]:
from tensorflow.keras import layers
# Create the model
inputs = layers.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
x = embedding(x)
x = layers.Bidirectional(layers.LSTM(units=64, return_sequences=False))(x)
# Note: return_sequences controls whether the layer returns only the last output (False)
# or the full output sequence (True). When stacking RNN layers, every layer except the last needs return_sequences=True
outputs = layers.Dense(1, activation="sigmoid")(x)

model_4 = tf.keras.Model(inputs, outputs, name="model_4_bidirectional")


In [None]:
# Compile the model
model_4.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

In [None]:
# Get the summary of the model
model_4.summary()

In [None]:
# Fit the model
model_4_history = model_4.fit(train_sentences,
                              train_labels,
                              epochs=5,
                              validation_data=(val_sentences, val_labels))

Epoch 1/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 5s 9ms/step - accuracy: 0.9619 - loss: 0.1827 - val_accuracy: 0.7598 - val_loss: 1.0570
Epoch 2/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 2s 7ms/step - accuracy: 0.9803 - loss: 0.0447 - val_accuracy: 0.7559 - val_loss: 1.1320
Epoch 3/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 3s 7ms/step - accuracy: 0.9807 - loss: 0.0443 - val_accuracy: 0.7533 - val_loss: 1.1256
Epoch 4/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 3s 7ms/step - accuracy: 0.9833 - loss: 0.0389 - val_accuracy: 0.7664 - val_loss: 1.1893
Epoch 5/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 2s 9ms/step - accuracy: 0.9812 - loss: 0.0421 - val_accuracy: 0.7598 - val_loss: 1.2703


In [None]:
# Make predictions
model_4_pred_probs = model_4.predict(val_sentences)
model_4_pred_probs[:10]

24/24 ━━━━━━━━━━━━━━━━━━━━ 1s 26ms/step


array([[2.2136413e-03],
       [6.5088516e-01],
       [9.9998510e-01],
       [3.4370574e-01],
       [3.9557250e-05],
       [9.9996746e-01],
       [5.3374711e-02],
       [9.9999583e-01],
       [9.9998689e-01],
       [7.4126083e-01]], dtype=float32)

In [None]:
# Convert the prediction probabilities to labels
model_4_preds = tf.squeeze(tf.round(model_4_pred_probs))
model_4_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 1., 1., 0., 0., 1., 0., 1., 1., 1.], dtype=float32)>

In [None]:
# Check the performance
model_4_results = calculate_results(y_true=val_labels,
                                    y_pred=model_4_preds)
model_4_results

{'accuracy': 75.98425196850394,
 'precision': 0.7618096125081139,
 'recall': 0.7598425196850394,
 'f1-score': 0.7573149475055201}
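
Note that accuracy is reported as a percentage while the other three metrics are fractions between 0 and 1. Recall also equals accuracy divided by 100, which is what support-weighted averaging across classes produces. The sketch below assumes that `calculate_results` (defined earlier in the notebook) uses weighted averaging; the function `weighted_metrics` and the toy labels are hypothetical and only illustrate the arithmetic.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Accuracy (percent) plus support-weighted precision, recall and F1."""
    n = len(y_true)
    support = Counter(y_true)  # number of true examples per class
    precision = recall = f1 = 0.0
    for cls, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        predicted = sum(1 for p in y_pred if p == cls)
        cls_precision = tp / predicted if predicted else 0.0
        cls_recall = tp / count
        denom = cls_precision + cls_recall
        cls_f1 = 2 * cls_precision * cls_recall / denom if denom else 0.0
        weight = count / n  # weight each class by its share of the true labels
        precision += weight * cls_precision
        recall += weight * cls_recall
        f1 += weight * cls_f1
    accuracy = 100 * sum(t == p for t, p in zip(y_true, y_pred)) / n
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1-score": f1}


# Toy labels, made up for illustration only.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]
results = weighted_metrics(y_true, y_pred)
print(results)
```

With weighted averaging, recall always matches accuracy (up to the factor of 100), so precision and F1 are the metrics that add information here.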

In [None]:
# Compare the results with the baseline
compare_baseline_to_new_results(baseline_results, model_4_results)

Baseline accuracy: 79.27 | New model accuracy: 75.98 | Difference: -3.28
Baseline precision: 0.81 | New model precision: 0.76 | Difference: -0.05
Baseline recall: 0.79 | New model recall: 0.76 | Difference: -0.03
Baseline f1-score: 0.79 | New model f1-score: 0.76 | Difference: -0.03
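
The model_4 training log helps explain the gap to the baseline: training accuracy sits near 98% while validation loss climbs from 1.0570 to 1.2703, the usual signature of overfitting. Here is a minimal sketch, using only the validation losses copied from that log, of how one might pick the epoch worth keeping; the helper names are hypothetical and not part of the notebook.

```python
# Validation losses copied from the model_4 training log (epochs 1-5).
val_losses = [1.0570, 1.1320, 1.1256, 1.1893, 1.2703]


def best_epoch(losses):
    """Return the 1-indexed epoch with the lowest validation loss."""
    return min(range(len(losses)), key=lambda i: losses[i]) + 1


def epochs_without_improvement(losses):
    """Count how many epochs have passed since the best validation loss."""
    return len(losses) - best_epoch(losses)


print(best_epoch(val_losses))                  # 1
print(epochs_without_improvement(val_losses))  # 4
```

In Keras, `tf.keras.callbacks.EarlyStopping(monitor="val_loss", restore_best_weights=True)` automates the same idea during `fit`.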


##Model 5: Convolutional model

In [None]:
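# Create the model
from tensorflow.keras import layers

inputs = layers.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)  # turn raw strings into integer token IDs
x = embedding(x)  # look up a dense vector for each token
x = layers.Conv1D(filters=32, kernel_size=3, activation="relu")(x)  # slide 32 filters over windows of 3 tokens
x = layers.GlobalAveragePooling1D()(x)  # average each filter's activations over the sequence
outputs = layers.Dense(1, activation="sigmoid")(x)  # single probability for the positive class

model_5 = tf.keras.Model(inputs, outputs, name="model_5_Conv1D")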
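
To make the two new layers concrete, here is a minimal pure-Python sketch of what a single `Conv1D` filter with `kernel_size=3` (default "valid" padding, stride 1) followed by global average pooling computes. The one-channel toy sequence and the hand-picked kernel are made up for illustration; the real layer applies 32 filters across all embedding dimensions and learns its weights.

```python
def conv1d_valid(sequence, kernel, bias=0.0):
    """Slide `kernel` over `sequence` (stride 1, no padding) and apply ReLU."""
    k = len(kernel)
    outputs = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        total = sum(w * x for w, x in zip(kernel, window)) + bias
        outputs.append(max(0.0, total))  # ReLU activation
    return outputs


def global_average_pool(values):
    """Collapse the sequence axis to a single number by averaging."""
    return sum(values) / len(values)


# Toy one-channel input of length 6 and a hand-picked kernel (illustrative only).
sequence = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]
kernel = [-1.0, 0.0, 1.0]

feature_map = conv1d_valid(sequence, kernel)
print(len(feature_map))                  # 4, i.e. 6 - 3 + 1
print(feature_map)                       # [1.0, 2.0, 2.0, 1.0]
print(global_average_pool(feature_map))  # 1.5
```

With "valid" padding the sequence axis shrinks by `kernel_size - 1`, and pooling then reduces each filter's feature map to one value, so the final `Dense` layer receives one number per filter regardless of tweet length.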

In [None]:
# Compile the model
model_5.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

In [None]:
# Get the summary
model_5.summary()

In [None]:
# Fit the model
model_5_history = model_5.fit(train_sentences,
                              train_labels,
                              epochs=5,
                              validation_data=(val_sentences, val_labels))

Epoch 1/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 6s 12ms/step - accuracy: 0.9500 - loss: 0.3109 - val_accuracy: 0.7664 - val_loss: 0.7230
Epoch 2/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 2s 6ms/step - accuracy: 0.9646 - loss: 0.0997 - val_accuracy: 0.7520 - val_loss: 0.9148
Epoch 3/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - accuracy: 0.9693 - loss: 0.0783 - val_accuracy: 0.7559 - val_loss: 1.0549
Epoch 4/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 3s 7ms/step - accuracy: 0.9713 - loss: 0.0665 - val_accuracy: 0.7572 - val_loss: 1.1589
Epoch 5/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 3s 9ms/step - accuracy: 0.9747 - loss: 0.0586 - val_accuracy: 0.7572 - val_loss: 1.2436


In [None]:
# Get the prediction probabilities
model_5_pred_probs = model_5.predict(val_sentences)
model_5_pred_probs[:10]

24/24 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step


array([[2.9219800e-01],
       [8.3111352e-01],
       [9.9999475e-01],
       [1.4103575e-01],
       [3.1097198e-08],
       [9.9370885e-01],
       [9.9273974e-01],
       [1.0000000e+00],
       [1.0000000e+00],
       [9.0129387e-01]], dtype=float32)

In [None]:
# Convert the prediction probabilities into labels
model_5_preds = tf.squeeze(tf.round(model_5_pred_probs))
model_5_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 1., 1., 0., 0., 1., 1., 1., 1., 1.], dtype=float32)>

In [None]:
# Calculate the performance
model_5_results = calculate_results(y_true=val_labels,
                                    y_pred=model_5_preds)
model_5_results

{'accuracy': 75.7217847769029,
 'precision': 0.7568784468743027,
 'recall': 0.7572178477690289,
 'f1-score': 0.7563501070538545}

In [None]:
# Compare model_5 results with the baseline
compare_baseline_to_new_results(baseline_results,
                                model_5_results)

Baseline accuracy: 79.27 | New model accuracy: 75.72 | Difference: -3.54
Baseline precision: 0.81 | New model precision: 0.76 | Difference: -0.05
Baseline recall: 0.79 | New model recall: 0.76 | Difference: -0.04
Baseline f1-score: 0.79 | New model f1-score: 0.76 | Difference: -0.03


## Model 6: Use TensorFlow Hub

> This means we'll use transfer learning: the sentence encoder is already trained, so we keep it as it is and only add a small classification head on top to fit our task.

In [None]:
# Load the model
import tensorflow_hub as hub
embed = hub.load("https://www.kaggle.com/models/google/universal-sentence-encoder/TensorFlow2/universal-sentence-encoder/2")

In [None]:
# We can use this encoding layer in place of our text_vectorizer and embedding layer
sentence_encoder_layer = hub.KerasLayer("https://www.kaggle.com/models/google/universal-sentence-encoder/TensorFlow2/universal-sentence-encoder/2",
                                        input_shape=[],
                                        dtype="string",
                                        trainable=False,
                                        name="USE")


In [None]:
# Create model using the Sequential API
model_6 = tf.keras.Sequential([
  sentence_encoder_layer, # take in sentences and then encode them into an embedding
  layers.Dense(64, activation="relu"),
  layers.Dense(1, activation="sigmoid")
], name="model_6_USE")

# # Compile model
# model_6.compile(loss="binary_crossentropy",
#                 optimizer=tf.keras.optimizers.Adam(),
#                 metrics=["accuracy"])

# model_6.summary()

ValueError: Only instances of `keras.Layer` can be added to a Sequential model. Received: <tensorflow_hub.keras_layer.KerasLayer object at 0x780f2a4e78b0> (of type <class 'tensorflow_hub.keras_layer.KerasLayer'>)
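
This error is consistent with `tf.keras` resolving to Keras 3 (the default from TensorFlow 2.16 onwards) while `hub.KerasLayer` is still built on the legacy Keras 2 package (`tf_keras`), so Keras 3 does not recognise it as one of its own layers. Besides wrapping the Hub module in a custom layer, as the next cell does, a commonly used workaround is to ask TensorFlow to fall back to legacy Keras. A minimal sketch, assuming the `tf_keras` package is installed and that the variable is set before TensorFlow is first imported (in Colab this means restarting the runtime):

```python
import os

# Must be set before the first `import tensorflow`; it has no effect afterwards.
os.environ["TF_USE_LEGACY_KERAS"] = "1"

# With the variable set (and `tf_keras` installed), `tf.keras` resolves to
# Keras 2, which is what `tensorflow_hub.KerasLayer` subclasses:
# import tensorflow as tf
# import tensorflow_hub as hub
```

The trade-off is that the whole notebook then runs on Keras 2, so this switch affects every model in it, not just this one.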

In [None]:
import tensorflow as tf
from tensorflow.keras import layers
import tensorflow_hub as hub

class HubKerasLayer(layers.Layer):
    """Thin wrapper so a module loaded from TensorFlow Hub can be used as a Keras layer."""

    def __init__(self, hub_layer, **kwargs):
        super().__init__(**kwargs)
        self.hub_layer = hub_layer  # callable returned by hub.load()

    def build(self, input_shape):
        self.trainable = False  # Freeze the weights of the TensorFlow Hub layer
        self.built = True

    def call(self, inputs):
        # Map a batch of strings to their sentence embeddings
        return self.hub_layer(inputs)

# Load the Universal Sentence Encoder model
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# Create a Keras layer using the HubKerasLayer class
sentence_encoder_layer = HubKerasLayer(embed)

# Build model_6 with the Functional API
inputs = tf.keras.Input(shape=(), dtype=tf.string)
x = sentence_encoder_layer(inputs)
x = layers.Dense(64, activation="relu")(x)
outputs = layers.Dense(1, activation="sigmoid")(x)

model_6 = tf.keras.Model(inputs, outputs, name="model_6_USE")
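
Because the encoder is frozen, only the two `Dense` layers are trained. A rough count of those trainable parameters, assuming the encoder returns 512-dimensional sentence embeddings (the size documented for the Universal Sentence Encoder; it is not shown in this notebook's output):

```python
def dense_params(input_dim, units):
    """Weights plus biases of a fully connected layer."""
    return input_dim * units + units


embedding_dim = 512  # assumed output size of the Universal Sentence Encoder
hidden = dense_params(embedding_dim, 64)  # Dense(64, activation="relu")
output = dense_params(64, 1)              # Dense(1, activation="sigmoid")

print(hidden, output, hidden + output)  # 32832 65 32897
```

`model_6.summary()` can be used to check these figures against the actual model.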

In [None]:
# Compile model
model_6.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])