# NLP with TensorFlow

#### The data for this project is a Kaggle dataset that can be found at the link below. This project aims to build a deep learning model that predicts whether a given tweet is about a real disaster.
https://www.kaggle.com/competitions/nlp-getting-started/data

In [None]:
# Import the libraries needed for this project.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf

#### Let's look at the first rows of the training data.

In [5]:
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
train_data.head(10)

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1
5,8,,,#RockyFire Update => California Hwy. 20 closed...,1
6,10,,,#flood #disaster Heavy rain causes flash flood...,1
7,13,,,I'm on top of the hill and I can see a fire in...,1
8,14,,,There's an emergency evacuation happening now ...,1
9,15,,,I'm afraid that the tornado is coming to our a...,1


#### It seems that the training data is not shuffled. Let's shuffle it.

In [6]:
shuffled_train_data = train_data.sample(frac=1, random_state=42)
shuffled_train_data.head(10)

Unnamed: 0,id,keyword,location,text,target
2644,3796,destruction,,So you have a new weapon that can cause un-ima...,1
2227,3185,deluge,,The f$&amp;@ing things I do for #GISHWHES Just...,0
5448,7769,police,UK,DT @georgegalloway: RT @Galloway4Mayor: ÛÏThe...,1
132,191,aftershock,,Aftershock back to school kick off was great. ...,0
6845,9810,trauma,"Montgomery County, MD",in response to trauma Children of Addicts deve...,0
5559,7934,rainstorm,,@Calum5SOS you look like you got caught in a r...,0
1765,2538,collision,,my favorite lady came to our volunteer meeting...,1
1817,2611,crashed,,@brianroemmele UX fail of EMV - people want to...,1
6810,9756,tragedy,"Los Angeles, CA",Can't find my ariana grande shirt this is a f...,0
4398,6254,hijacking,"Athens,Greece",The Murderous Story Of AmericaÛªs First Hijac...,1


#### Let's look at the test data.

In [7]:
test_data.head(10)
# The test set has no `target` column, so it is unlabeled.

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan
5,12,,,We're shaking...It's an earthquake
6,21,,,They'd probably still show more life than Arse...
7,22,,,Hey! How are you?
8,27,,,What a nice hat?
9,29,,,Fuck off!


#### Good, now let's check whether the labels of the training data are balanced.

In [9]:
shuffled_train_data['target'].value_counts()
# It's fairly balanced: about 57% for target = 0 and 43% for target = 1.

0    4342
1    3271
Name: target, dtype: int64
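The class shares can be checked directly from the counts printed above. This is a minimal sketch that hard-codes those two counts instead of reading `train.csv`; in pandas, `value_counts(normalize=True)` should give the same fractions.

```python
# Class counts copied from the value_counts() output above.
class_counts = {0: 4342, 1: 3271}

total = sum(class_counts.values())
class_shares = {label: count / total for label, count in class_counts.items()}

for label, share in class_shares.items():
    print(f"target={label}: {share:.1%}")
# target=0: 57.0%
# target=1: 43.0%
```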

In [13]:
# Let's compare the sizes of the training and test sets.
print(f"training data set : {len(train_data)}, test data set : {len(test_data)}, total data set : {len(train_data) + len(test_data)}")

training data set : 7613, test data set : 3263, total data set : 10876


In [19]:
# Let's visualize some random training examples
import random
random_index = random.randint(0, len(shuffled_train_data)-3) # pick a random start index that leaves room for 3 consecutive rows
for row in shuffled_train_data[["text", "target"]][random_index:random_index+3].itertuples():
  _, text, target = row
  print(f"Target: {target}", "(disaster)" if target > 0 else "(no disaster)")
  print(f"Text:\n{text}\n")
  print("---\n")

Target: 1 (disaster)
Text:
Mourning notices for stabbing arson victims stir Û÷politics of griefÛª in Israel http://t.co/KkbXIBlAH7

---

Target: 1 (disaster)
Text:
Mass murderer Che Guevara greeting a woman in North Korea http://t.co/GlJBNSFGLl'

---

Target: 0 (no disaster)
Text:
Womens Flower Printed Shoulder Handbags Cross Body Metal Chain Satchel Bags Blue http://t.co/rjZw6C8asX http://t.co/WtdIav11ua

---



### Split train data into training and validation sets

Because the test data has no labels and we have to evaluate our trained models, we'll split off some of the training data to create a validation set.

I also convert the split data from pandas Series to NumPy arrays of texts and labels for ease of use later.

In [20]:
from sklearn.model_selection import train_test_split

train_texts, val_texts, train_targets, val_targets = train_test_split(shuffled_train_data["text"].to_numpy(),
                                                                            shuffled_train_data["target"].to_numpy(),
                                                                            test_size=0.1, 
                                                                            random_state=42) 

In [25]:
# Check the lengths
len(train_texts), len(train_targets), len(val_texts), len(val_targets)

(6851, 6851, 762, 762)
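The sizes above can be reproduced by hand. The sketch below assumes that `train_test_split` rounds a fractional `test_size` up to the next whole sample and gives the remaining rows to the training set, which matches the printed lengths.

```python
import math

n_samples = 7613   # rows in train.csv, from the output earlier in the notebook
test_size = 0.1

# Assumption: the validation count is the ceiling of test_size * n_samples,
# and the training set receives the remaining rows.
n_val = math.ceil(test_size * n_samples)
n_train = n_samples - n_val

print(n_train, n_val)  # 6851 762
```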

In [70]:
# To set the sequence length used in the next section, find the average number of tokens (words) per training tweet.
avg_length = round(sum([len(i.split()) for i in train_texts])/len(train_texts))
avg_length

15
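Using the mean as the sequence length means that tweets longer than the mean are truncated. One possible alternative, shown here only as a sketch, is to look at a high percentile of the token counts as well. The helper `token_length_summary` and the toy texts are hypothetical and not part of the original notebook.

```python
import math

def token_length_summary(texts, percentile=95):
    """Return (rounded mean length, nearest-rank percentile length) in whitespace tokens."""
    lengths = sorted(len(text.split()) for text in texts)
    mean_len = round(sum(lengths) / len(lengths))
    # Nearest-rank percentile: smallest length that covers `percentile` % of the texts.
    rank = max(1, math.ceil(percentile / 100 * len(lengths)))
    return mean_len, lengths[rank - 1]

# Toy texts for illustration only (1, 4 and 10 tokens).
toy_texts = ["fire", "forest fire near town", "a b c d e f g h i j"]
mean_len, p95_len = token_length_summary(toy_texts)
print(mean_len, p95_len)  # 5 10
```

Applied to `train_texts`, the first value would correspond to `avg_length`; the second shows how long a sequence would need to be to keep most tweets intact.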

### Converting text to numbers
ML models only accept numbers as input, so we need to convert the text to numbers.

We will use a TextVectorization layer to map words to integer indices, and an Embedding layer to map those indices to dense vectors.

More information about the hyperparameters of TextVectorization is available at https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization.


In [71]:

# In recent TensorFlow releases TextVectorization is available directly under tensorflow.keras.layers.
from tensorflow.keras.layers import TextVectorization

text_vectorizer = TextVectorization(max_tokens=10000, # how many words in the vocabulary
                                    standardize="lower_and_strip_punctuation", # how to process text
                                    split="whitespace", # how to split tokens
                                    ngrams=None, 
                                    output_mode="int", 
                                    output_sequence_length=avg_length)

#### Let's fit the text vectorizer to the training data set.

In [72]:

text_vectorizer.adapt(train_texts)

In [73]:
# Let's check how the text vectorizer works on a sample sentence.
sample_sentence = "Today is a rainy day, take care!"
text_vectorizer([sample_sentence])

<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[ 124,    9,    3, 9375,  101,  167,  488,    0,    0,    0,    0,
           0,    0,    0,    0]])>
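To make the output above less opaque, here is a pure-Python sketch of roughly what the layer does in `"int"` mode: standardize, split on whitespace, look up each token, then pad or truncate to a fixed length. It assumes index 0 is reserved for padding and index 1 for out-of-vocabulary tokens, and it approximates punctuation stripping with `string.punctuation`. The function names and the toy corpus are hypothetical.

```python
import string
from collections import Counter

PAD_ID, OOV_ID = 0, 1  # reserved ids: padding and out-of-vocabulary

def standardize(text):
    # Lowercase and drop ASCII punctuation (an approximation of the layer's rule).
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocab(texts, max_tokens):
    counts = Counter(tok for text in texts for tok in standardize(text).split())
    # The most frequent tokens get the lowest ids after the two reserved ones.
    kept = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return {tok: idx for idx, tok in enumerate(kept, start=2)}

def vectorize(text, vocab, seq_len):
    ids = [vocab.get(tok, OOV_ID) for tok in standardize(text).split()][:seq_len]
    return ids + [PAD_ID] * (seq_len - len(ids))  # right-pad with zeros

toy_corpus = ["Fire in the forest!", "The fire is spreading", "the the"]
toy_vocab = build_vocab(toy_corpus, max_tokens=10)
print(vectorize("The fire, today!", toy_vocab, seq_len=6))  # [2, 3, 1, 0, 0, 0]
```

The trailing zeros in the notebook output play the same role as the padding here: the sample sentence has 7 tokens and the sequence length is 15.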

#### Creating an Embedding Layer

For more information about the hyperparameters of the Embedding Layer please see the following link 

https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding

In [74]:
tf.random.set_seed(42)
from tensorflow.keras import layers

embedding = layers.Embedding(input_dim=10000, 
                             output_dim=128, 
                             embeddings_initializer="uniform", 
                             input_length=avg_length,
                             name="embedding_1") 

embedding

<keras.layers.core.embedding.Embedding at 0x7f966c197ac0>

In [75]:
sample_s  = 'today is a dangerous day'
a = text_vectorizer(sample_s)
embedding(a).shape

TensorShape([15, 128])
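Conceptually, an embedding layer is a trainable lookup table with one row per vocabulary index. The NumPy sketch below mimics that lookup to show where the `(15, 128)` shape comes from. The token ids are made up, and the initial value range of -0.05 to 0.05 is an assumption about the `"uniform"` initializer's defaults.

```python
import numpy as np

rng = np.random.default_rng(42)
vocab_size, embed_dim, seq_len = 10000, 128, 15

# One row of `embed_dim` values per vocabulary index (assumed init range).
embedding_matrix = rng.uniform(-0.05, 0.05, size=(vocab_size, embed_dim))

# Illustrative token ids for a 5-word sentence, right-padded with zeros to length 15.
token_ids = np.array([124, 9, 3, 1, 101] + [0] * (seq_len - 5))

# The "embedding" is plain row indexing into the table.
embedded = embedding_matrix[token_ids]
print(embedded.shape)  # (15, 128)
```

Every padded position maps to the same row (index 0), which is worth keeping in mind when the sequence is later averaged.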

#### Creating a baseline model: a scikit-learn Pipeline with TF-IDF features and a Multinomial Naive Bayes classifier.

In [76]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

model_0 = Pipeline([
                    ("tfidf", TfidfVectorizer()),
                    ("clf", MultinomialNB())
])

# Fit the pipeline to the training data
model_0.fit(train_texts, train_targets)

In [77]:
baseline_score = model_0.score(val_texts, val_targets)
print(f"the accuracy of the baseline model is: {baseline_score*100:.2f}%")

the accuracy of the baseline model is: 79.27%


#### Creating a function for evaluation of our model 

We create a helper function that reports four metrics: accuracy, precision, recall, and F1-score.

In [78]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

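def calculate_results(y_true, y_pred):
  """Return accuracy, precision, recall, and F1-score for a set of predictions.

  Accuracy is reported as a percentage (0-100); precision, recall, and
  F1-score are support-weighted averages on a 0-1 scale.
  """
  model_accuracy = accuracy_score(y_true, y_pred) * 100
  model_precision, model_recall, model_f1, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted")
  model_results = {"Accuracy": model_accuracy,
                  "Precision": model_precision,
                  "Recall": model_recall,
                  "F1-score": model_f1}
  return model_results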
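With `average="weighted"`, each class's precision, recall, and F1 are averaged using the class's share of true examples as the weight. The pure-Python sketch below recomputes these from raw counts on a toy example. The helper `weighted_prf` and the toy labels are hypothetical, and the sketch only approximates what scikit-learn does internally.

```python
def weighted_prf(y_true, y_pred):
    """Support-weighted precision, recall, and F1 computed from raw counts."""
    n = len(y_true)
    pairs = list(zip(y_true, y_pred))
    precision = recall = f1 = 0.0
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        p_c = tp / (tp + fp) if tp + fp else 0.0
        r_c = tp / (tp + fn) if tp + fn else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if p_c + r_c else 0.0
        weight = (tp + fn) / n  # share of true examples with this label
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return precision, recall, f1

# Toy labels for illustration only.
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 0, 0, 0, 0, 1]
toy_precision, toy_recall, toy_f1 = weighted_prf(toy_true, toy_pred)
toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
print(toy_precision, toy_recall, toy_f1, toy_accuracy)
```

One consequence of this weighting is that weighted recall always equals plain accuracy, so the "Recall" entry returned by `calculate_results` repeats the "Accuracy" entry on a 0-1 scale.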

#### Let's make predictions on the validation set to obtain y_pred.

In [79]:
# Make predictions
baseline_preds = model_0.predict(val_texts)

In [80]:
#Baseline results
baseline_results = calculate_results(y_true=val_targets,
                                     y_pred=baseline_preds)
baseline_results

{'Accuracy': 79.26509186351706,
 'Precision': 0.8111390004213173,
 'Recall': 0.7926509186351706,
 'F1-score': 0.7862189758049549}

#### Model 1: Dense neural network


In [94]:
from tensorflow.keras import layers
inputs = layers.Input(shape=(1,), dtype="string") # input is a 1D string
x = text_vectorizer(inputs) # apply the text_vectorizer to the string inputs to get numbers
x = embedding(x) # create an embedding of the integer token ids
x = layers.GlobalAveragePooling1D()(x) # average over the sequence axis: (None, 15, 128) -> (None, 128)
# x = layers.Dense(128, activation="relu")(x) # optional dense layer on top of the pooled output
outputs = layers.Dense(1, activation="sigmoid")(x) 
model_1 = tf.keras.Model(inputs, outputs, name="model_1_dense")

model_1.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])
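The pooling step in this model can be pictured as a mean over the time axis. The NumPy sketch below uses random numbers as a stand-in for the embedding output. It assumes no masking is applied, so padded positions count towards the average just like real tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_len, embed_dim = 2, 15, 128

# Stand-in for the embedding layer output: (batch, time steps, features).
embedded = rng.normal(size=(batch_size, seq_len, embed_dim))

# Global average pooling over the time axis collapses each sequence to one vector.
pooled = embedded.mean(axis=1)

print(pooled.shape)  # (2, 128)
```

The final `Dense(1, activation="sigmoid")` layer then only has to map one 128-dimensional vector per tweet to a single probability.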



In [95]:
model_1.summary()

Model: "model_1_dense"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_11 (InputLayer)       [(None, 1)]               0         
                                                                 
 text_vectorization_3 (TextV  (None, 15)               0         
 ectorization)                                                   
                                                                 
 embedding_1 (Embedding)     (None, 15, 128)           1280000   
                                                                 
 global_average_pooling1d_10  (None, 128)              0         
  (GlobalAveragePooling1D)                                       
                                                                 
 dense_12 (Dense)            (None, 1)                 129       
                                                                 
Total params: 1,280,129
Trainable params: 1,280,129
N
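The parameter counts in the summary above can be reproduced with simple arithmetic: the embedding table holds one 128-dimensional vector per vocabulary entry, and the output layer has one weight per pooled feature plus a bias.

```python
vocab_size, embed_dim = 10000, 128

embedding_params = vocab_size * embed_dim   # one vector per vocabulary index
dense_params = embed_dim * 1 + 1            # 128 weights + 1 bias for the sigmoid unit
total_params = embedding_params + dense_params

print(embedding_params, dense_params, total_params)  # 1280000 129 1280129
```

Almost all of the model's capacity sits in the embedding table, while the vectorization and pooling layers contribute no trainable parameters.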

In [96]:
model_1_history = model_1.fit(train_texts, train_targets, epochs = 5, validation_data=(val_texts, val_targets))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


#### Let's make predictions on the validation data set to find y_pred.

In [97]:
model_1_pred_probs = model_1.predict(val_texts)
model_1_pred_probs.shape



(762, 1)

In [98]:
# Round the probabilities to 0/1 labels and remove the extra dimension: (762, 1) -> (762,)
model_1_preds = tf.squeeze(tf.round(model_1_pred_probs))
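The same rounding and squeezing can be illustrated with NumPy on a few made-up probabilities. The values are hypothetical. Note that rounding sends a probability of exactly 0.5 to 0, so it behaves like a strict `> 0.5` threshold.

```python
import numpy as np

# Made-up sigmoid outputs with the same (n, 1) layout as model.predict().
pred_probs = np.array([[0.91], [0.12], [0.50], [0.73]])

# Round to the nearest label, then drop the trailing axis of size 1.
preds = np.squeeze(np.round(pred_probs))

# Equivalent explicit threshold, which makes the decision rule visible.
preds_thresholded = (pred_probs > 0.5).astype(int).ravel()

print(preds.shape, preds.tolist(), preds_thresholded.tolist())
# (4,) [1.0, 0.0, 0.0, 1.0] [1, 0, 0, 1]
```

Writing the threshold explicitly also makes it easy to try a cut-off other than 0.5 if precision and recall need to be traded off.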

In [99]:
model_1_results = calculate_results(y_true=val_targets, 
                                    y_pred=model_1_preds)
model_1_results

{'Accuracy': 77.82152230971128,
 'Precision': 0.7798979990634543,
 'Recall': 0.7782152230971129,
 'F1-score': 0.7762659531210079}

Model 1 does not beat the baseline: it reaches 77.82% validation accuracy, against 79.27% for the TF-IDF and Naive Bayes pipeline.
Let's try a more advanced neural network.

#### Model 2: RNN architecture (LSTM)

In [105]:
tf.random.set_seed(42)
from tensorflow.keras import layers

model_2_embedding = layers.Embedding(input_dim=10000,
                                     output_dim=128,
                                     embeddings_initializer="uniform",
                                     input_length=avg_length,
                                     name="embedding_2")

# Create LSTM model
inputs = layers.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
x = model_2_embedding(x)
print(x.shape)
x = layers.LSTM(64, return_sequences=True)(x) 
x = layers.LSTM(64)(x) 
print(x.shape)
outputs = layers.Dense(1, activation="sigmoid")(x)
model_2 = tf.keras.Model(inputs, outputs, name="model_2_LSTM")

(None, 15, 128)
(None, 64)


In [106]:
model_2.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

In [107]:
model_2.summary()


Model: "model_2_LSTM"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_13 (InputLayer)       [(None, 1)]               0         
                                                                 
 text_vectorization_3 (TextV  (None, 15)               0         
 ectorization)                                                   
                                                                 
 embedding_2 (Embedding)     (None, 15, 128)           1280000   
                                                                 
 lstm_2 (LSTM)               (None, 15, 64)            49408     
                                                                 
 lstm_3 (LSTM)               (None, 64)                33024     
                                                                 
 dense_14 (Dense)            (None, 1)                 65        
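The LSTM parameter counts in this summary follow from the standard LSTM layout: four gates, each with an input kernel, a recurrent kernel, and a bias. The helper `lstm_params` below is a hypothetical name used for a quick check, assuming biases are enabled.

```python
def lstm_params(input_dim, units):
    """Parameters of one LSTM layer: 4 gates x (input kernel + recurrent kernel + bias)."""
    return 4 * ((input_dim + units) * units + units)

first_lstm = lstm_params(input_dim=128, units=64)   # fed by the 128-d embeddings
second_lstm = lstm_params(input_dim=64, units=64)   # fed by the first LSTM's 64-d outputs
output_dense = 64 * 1 + 1                           # 64 weights + 1 bias

print(first_lstm, second_lstm, output_dense)  # 49408 33024 65
```

The first layer is larger only because its input is wider (128 features per step instead of 64).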
                                                      

In [108]:
model_2_history = model_2.fit(train_texts,
                              train_targets,
                              epochs=5,
                              validation_data=(val_texts, val_targets)
                              )

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


#### Like the simple dense model, the LSTM does not beat the baseline.

Let's compute its validation metrics, then try other architectures to look for better accuracy.

In [110]:
model_2_pred_probs = model_2.predict(val_texts)
model_2_preds = tf.squeeze(tf.round(model_2_pred_probs))
model_2_results = calculate_results(y_true=val_targets,
                                    y_pred=model_2_preds)
model_2_results



{'Accuracy': 75.8530183727034,
 'Precision': 0.7593999956661763,
 'Recall': 0.7585301837270341,
 'F1-score': 0.7566051475454213}

#### The next model is based on a GRU.