<a href="https://colab.research.google.com/github/Henil21/Tweet_sentiment_NLP/blob/main/NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to NLP fundamentals in TensorFlow

The main goal of natural language processing (NLP) is to derive information from natural language.

Natural language is a broad term but you can consider it to cover any of the following:

* Text (such as that contained in an email, blog post, book, Tweet)
* Speech (a conversation you have with a doctor, voice commands you give to a smart speaker)


> Text -> turn into numbers -> build a model -> train the model to find patterns -> use patterns (make predictions)


In [1]:
!nvidia-smi  -L

GPU 0: Tesla T4 (UUID: GPU-662b710f-955f-6a2d-40e4-084fa3551ff3)


## Getting helper functions 🐚

In [2]:
!wget https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py

--2023-01-14 10:50:27--  https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10246 (10K) [text/plain]
Saving to: ‘helper_functions.py’


2023-01-14 10:50:27 (79.0 MB/s) - ‘helper_functions.py’ saved [10246/10246]



In [3]:
# Import a series of helper functions for the notebook
from helper_functions import unzip_data, create_tensorboard_callback, plot_loss_curves, compare_historys

## Get a text dataset

> Description of the dataset: text samples of Tweets labelled as disaster or not disaster.

In [4]:
!wget https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip
unzip_data('nlp_getting_started.zip')

--2023-01-14 10:50:31--  https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 142.250.4.128, 142.251.10.128, 142.251.12.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.250.4.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 607343 (593K) [application/zip]
Saving to: ‘nlp_getting_started.zip’


2023-01-14 10:50:32 (133 MB/s) - ‘nlp_getting_started.zip’ saved [607343/607343]



## Visualizing Our Data


In [5]:
import pandas as pd
train_dir=pd.read_csv("train.csv")
test_dir=pd.read_csv("test.csv")
train_dir.head()


|  | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 |  |  | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 |  |  | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 |  |  | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 |  |  | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 |  |  | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [6]:
# shuffling the training data
train_shf=train_dir.sample(frac=1,random_state=42)
train_shf.head()

|  | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 2644 | 3796 | destruction |  | So you have a new weapon that can cause un-ima... | 1 |
| 2227 | 3185 | deluge |  | The f$&amp;@ing things I do for #GISHWHES Just... | 0 |
| 5448 | 7769 | police | UK | DT @georgegalloway: RT @Galloway4Mayor: ÛÏThe... | 1 |
| 132 | 191 | aftershock |  | Aftershock back to school kick off was great. ... | 0 |
| 6845 | 9810 | trauma | Montgomery County, MD | in response to trauma Children of Addicts deve... | 0 |


In [7]:
train_dir.target.value_counts()

0    4342
1    3271
Name: target, dtype: int64
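
The counts above point to a mild class imbalance. As a quick sketch, the proportions can be worked out directly from the two printed counts (the variable names here are illustrative and not part of the notebook):

```python
# Class counts copied from the value_counts() output above
class_counts = {0: 4342, 1: 3271}

total = sum(class_counts.values())

# Fraction of the training set that belongs to each class
class_fractions = {label: count / total for label, count in class_counts.items()}

for label, fraction in class_fractions.items():
    print(f"target={label}: {fraction:.1%}")
```

With roughly 57% of samples in one class and 43% in the other, accuracy is still informative, but it is worth reading it alongside precision and recall.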

In [8]:
# Let's visualize some random training examples
import random
# Pick a random start index that leaves room for 5 samples
random_index=random.randint(0, len(train_shf)-5)
for row in train_shf[["text","target"]][random_index:random_index+5].itertuples():
  _,text,target=row
  print(f"target:{target}","(real disaster)" if target>0 else "(not real disaster)")
  print(f"Text:\n{text}\n")
  print("----\n")

target:0 (not real disaster)
Text:
DISASTER AVERTED: Police kill gunman with Û÷hoax deviceÛª atåÊcinema http://t.co/5NG0FzpVdS

----

target:0 (not real disaster)
Text:
Yo I got bars and I'm not even a rapper

----

target:0 (not real disaster)
Text:
If you sit and rant on snapchat to your apparent fans when you have about 8000 followers I hope your in a train crash xoxo

----

target:1 (real disaster)
Text:
Just came back from camping and returned with a new song which gets recorded tomorrow. Can't wait! #Desolation #TheConspiracyTheory #NewEP

----

target:0 (not real disaster)
Text:
If Ryan doesn't release new music soon I might explode

----



### Split data into training and validation sets ✅

In [9]:
from sklearn.model_selection import train_test_split

In [10]:
# use train_test_split to split training data into training and validation sets
train_sentences,val_sentences,train_labels,val_labels=train_test_split(train_shf["text"].to_numpy(),
                                                                       train_shf["target"].to_numpy(),
                                                                       test_size=0.1,
                                                                       random_state=42)
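
To make the split above less of a black box, here is a minimal pure-Python sketch of a shuffled hold-out split. `holdout_split` is a hypothetical helper written for illustration; it is not scikit-learn's implementation and will not reproduce the same shuffle order. It assumes the validation share is rounded up, which gives 762 validation samples for the 7613 rows in this dataset.

```python
import math
import random

def holdout_split(samples, labels, test_size=0.1, seed=42):
    """Shuffle indices, then hold out a `test_size` fraction for validation."""
    n_samples = len(samples)
    n_val = math.ceil(n_samples * test_size)  # assumption: round the validation share up
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)      # seeded, so the split is repeatable
    split_at = n_samples - n_val
    train_idx, val_idx = indices[:split_at], indices[split_at:]
    return ([samples[i] for i in train_idx], [samples[i] for i in val_idx],
            [labels[i] for i in train_idx], [labels[i] for i in val_idx])

# Toy data with the same number of rows as the training CSV
toy_samples = [f"tweet {i}" for i in range(7613)]
toy_labels = [i % 2 for i in range(7613)]
train_x, val_x, train_y, val_y = holdout_split(toy_samples, toy_labels)
print(len(train_x), len(val_x))
```

Samples and labels are indexed with the same shuffled positions, so each sentence stays paired with its label.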

In [11]:
val_labels[:10]

array([0, 0, 1, 1, 1, 1, 1, 1, 1, 0])

In [12]:
val_sentences[:10]

array(['DFR EP016 Monthly Meltdown - On Dnbheaven 2015.08.06 http://t.co/EjKRf8N8A8 #Drum and Bass #heavy #nasty http://t.co/SPHWE6wFI5',
       'FedEx no longer to transport bioterror germs in wake of anthrax lab mishaps http://t.co/qZQc8WWwcN via @usatoday',
       'Gunmen kill four in El Salvador bus attack: Suspected Salvadoran gang members killed four people and wounded s... http://t.co/CNtwB6ScZj',
       '@camilacabello97 Internally and externally screaming',
       'Radiation emergency #preparedness starts with knowing to: get inside stay inside and stay tuned http://t.co/RFFPqBAz2F via @CDCgov',
       'Investigators rule catastrophic structural failure resulted in 2014 Virg.. Related Articles: http://t.co/Cy1LFeNyV8',
       'How the West was burned: Thousands of wildfires ablaze in #California alone http://t.co/iCSjGZ9tE1 #climate #energy http://t.co/9FxmN0l0Bd',
       "Map: Typhoon Soudelor's predicted path as it approaches Taiwan; expected to make landfall over southern C

### Converting text into numbers ⚡




* Tokenization - A direct mapping from a word, character or sub-word to a numerical value. There are three main levels of tokenization:
  1. Word-level tokenization of the sentence "I love TensorFlow" might result in "I" being 0, "love" being 1 and "TensorFlow" being 2. In this case, every word in a sequence is considered a single token.
  2. Character-level tokenization, such as converting the letters A-Z to values 1-26. In this case, every character in a sequence is considered a single token.
  3. Sub-word tokenization sits between word-level and character-level tokenization. It involves breaking individual words into smaller parts and then converting those smaller parts into numbers. For example, "my favourite food is pineapple pizza" might become "my, fav, avour, rite, fo, oo, od, is, pin, ine, app, le, piz, za". These sub-words are then mapped to numerical values. In this case, every word could be considered multiple tokens.
* Embeddings - An embedding is a learnable representation of natural language in the form of a feature vector. For example, the word "dance" could be represented by the 5-dimensional vector [-0.8547, 0.4559, -0.3332, 0.9877, 0.1112]. The size of the feature vector is tuneable. There are two ways to use embeddings:
  1. Create your own embedding - Once your text has been turned into numbers (required for an embedding), you can pass them through an embedding layer (such as tf.keras.layers.Embedding) and an embedding representation will be learned during model training.
  2. Reuse a pre-learned embedding - Many pre-trained embeddings exist online. They have often been learned on large corpora of text (such as all of Wikipedia) and thus have a good underlying representation of natural language. You can use a pre-trained embedding to initialize your model and fine-tune it to your own specific task.
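
The word-level and character-level schemes described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, using the example sentence from the list; real tokenizers also handle punctuation, casing and unknown words.

```python
sentence = "I love TensorFlow"

# Word-level: each distinct word gets the next free integer
word_to_id = {}
for word in sentence.split():
    word_to_id.setdefault(word, len(word_to_id))
word_tokens = [word_to_id[word] for word in sentence.split()]

# Character-level: map the letters A-Z (ignoring case) to 1-26, skip everything else
char_tokens = [ord(ch) - ord("a") + 1 for ch in sentence.lower() if "a" <= ch <= "z"]

print(word_to_id)       # {'I': 0, 'love': 1, 'TensorFlow': 2}
print(word_tokens)      # [0, 1, 2]
print(char_tokens[:5])  # [9, 12, 15, 22, 5]  ->  i, l, o, v, e
```

Note the trade-off: the word-level sequence has 3 tokens drawn from a potentially huge vocabulary, while the character-level sequence has 15 tokens drawn from only 26 possible values.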

### Text vectorization (Tokenization)
The TextVectorization layer takes the following parameters:

* max_tokens - The maximum number of words in your vocabulary (e.g. 20000 or the number of unique words in your text). This count includes a slot for OOV (out of vocabulary) tokens.
* standardize - Method for standardizing text. Default is "lower_and_strip_punctuation", which lowercases text and removes all punctuation marks.
* split - How to split text. Default is "whitespace", which splits on spaces.
* ngrams - How many words to group per token, for example, ngrams=2 produces tokens made of contiguous sequences of 2 words.
* output_mode - How to output tokens. Can be "int" (integer mapping), "multi_hot" (one 0/1 flag per vocabulary entry, named "binary" in older TensorFlow releases), "count" or "tf_idf" (spelled "tf-idf" in older releases). See documentation for more.
* output_sequence_length - Length of tokenized sequence to output. For example, if output_sequence_length=150, all tokenized sequences will be 150 tokens long (longer inputs are truncated, shorter ones are padded).
* pad_to_max_tokens - Defaults to False. If True, the output feature axis will be padded to max_tokens even if the number of unique tokens in the vocabulary is less than max_tokens. Only valid in certain modes, see docs for more.
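
To see how `max_tokens`, `standardize`, `split` and `output_sequence_length` interact in "int" mode, here is a rough pure-Python imitation. The helpers `standardize`, `build_vocabulary` and `vectorize` are hypothetical names for this sketch and are not the layer's actual implementation. It assumes index 0 is reserved for padding and index 1 for out-of-vocabulary words, and the way ties between equally frequent words are ordered is an artefact of the sketch.

```python
import string
from collections import Counter

def standardize(text):
    """Lowercase and strip ASCII punctuation (a stand-in for "lower_and_strip_punctuation")."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocabulary(texts, max_tokens):
    """Index 0 is padding, index 1 is the OOV token, then words by descending frequency."""
    counts = Counter(word for text in texts for word in standardize(text).split())
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def vectorize(text, vocabulary, output_sequence_length):
    """Map words to integer ids, then truncate or zero-pad to a fixed length."""
    word_to_id = {word: idx for idx, word in enumerate(vocabulary)}
    ids = [word_to_id.get(word, 1) for word in standardize(text).split()]
    ids = ids[:output_sequence_length]
    return ids + [0] * (output_sequence_length - len(ids))

corpus = ["Forest fire near the town", "Flood in the street", "The fire is out"]
vocabulary = build_vocabulary(corpus, max_tokens=10)
print(vocabulary)
print(vectorize("There's a flood in my street", vocabulary, output_sequence_length=8))
```

Words that never made it into the vocabulary ("theres", "a", "my") all collapse to id 1, and the trailing zeros are padding.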

In [13]:
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization # older TF releases: tensorflow.keras.layers.experimental.preprocessing

# Use the default TextVectorization parameters
text_vec=TextVectorization(max_tokens=None, # how many different words are in our vocabulary (automatically adds <OOV>)
                           standardize="lower_and_strip_punctuation",
                           split="whitespace",
                           ngrams=None, # create groups of n words
                           output_mode="int",
                           output_sequence_length=None, # how long we want our sequences to be (how many tokens of a Tweet to keep)
                           # pad_to_max_tokens=True  # not valid when max_tokens is None
                           )


In [14]:
len(train_sentences[0].split())

7

In [15]:
# Find the average number of tokens (words) in the training Tweets
round(sum([len(i.split()) for i in train_sentences])/len(train_sentences))

15

In [16]:
# Setup text vectorization with custom variables
max_vocab_length = 10000 # max number of words to have in our vocabulary
max_length = 15 # max length our sequences will be (e.g. how many words from a Tweet does our model see?)

text_vectorizer = TextVectorization(max_tokens=max_vocab_length,
                                    output_mode="int",
                                    output_sequence_length=max_length)

In [17]:
# Fit the text vectorizer to the training text
text_vectorizer.adapt(train_sentences)

# Create a sample sentence and tokenize it
sample_tweet="There's a flood in my street"
text_vectorizer([sample_tweet])

<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[264,   3, 232,   4,  13, 698,   0,   0,   0,   0,   0,   0,   0,
          0,   0]])>

In [18]:
# Choose a random sentence from the training dataset and tokenize it
random_sentence=random.choice(train_sentences)
print(f"original text:\n{random_sentence}\n\n vectorized text",text_vectorizer(random_sentence))

original text:
Putin's plan to destroy Western food en masse is causing a huge public backlash http://t.co/FAJbxz5kar

 vectorized text tf.Tensor(
[4825  241    5  305 1102  260 3890    1    9 1426    3  775  926    1
    1], shape=(15,), dtype=int64)


In [19]:
# Get the unique words in the vocabulary
words_in_voc=text_vectorizer.get_vocabulary()
top_5=words_in_voc[:5]
bottom_5=words_in_voc[-5:]
print(f"Number of words in vocab: {len(words_in_voc)}\n")
print(f"5 most common words in vocab: {top_5}\n")
print(f"5 least common words in vocab: {bottom_5}")


Number of words in vocab: 10000

5 most common words in vocab: ['', '[UNK]', 'the', 'a', 'in']

5 least common words in vocab: ['pages', 'paeds', 'pads', 'padres', 'paddytomlinson1']


### Creating an Embedding Layer

https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding

The parameters we care most about are:
* `input_dim` = the size of our vocabulary
* `output_dim` = the size of the output embedding vector, e.g. a value of 100 means each token is represented by a vector of length 100
* `input_length` = the length of the sequences being passed to the embedding layer
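
Conceptually, an embedding layer is a lookup table with `input_dim` rows and `output_dim` columns. The sketch below imitates that lookup in plain Python using the sizes chosen in this notebook. The random initial range is an assumption for illustration, and in a real layer the table values are updated during training.

```python
import random

vocab_size = 10000     # input_dim: one row per token id
embedding_dim = 128    # output_dim: length of each token's vector
sequence_length = 15   # input_length: tokens per sequence

rng = random.Random(42)
# The table has shape (vocab_size, embedding_dim); here it holds small random
# values, standing in for an untrained embedding.
embedding_table = [[rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
                   for _ in range(vocab_size)]

# Token ids for one padded sequence (the sample tweet vectorized earlier)
token_ids = [264, 3, 232, 4, 13, 698] + [0] * 9

# Embedding a sequence is row indexing: one vector per token id
embedded = [embedding_table[token_id] for token_id in token_ids]

print(len(embedded), len(embedded[0]))  # 15 128
print(vocab_size * embedding_dim)       # 1280000 values in the table
```

Every occurrence of the same token id, including the padding id 0, maps to the same row of the table.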

In [20]:
from tensorflow.keras import layers
embedding = layers.Embedding(input_dim=max_vocab_length, # size of the vocabulary
                             output_dim=128, # size of each token's embedding vector
                             input_length=max_length) # length of each input sequence

In [21]:
# Get a random sentence
random_sentence=random.choice(train_sentences)
print(f"Original text:\n{random_sentence}")

# Embed the random sentence (turn it into a dense vector of fixed size)
sample_embed = embedding(text_vectorizer([random_sentence]))



Original text:
#NowPlaying Fitz And The Tantrums - Out Of My League on #Crush #Listen http://t.co/Pwd5L0GLkV #NowPlaying


In [22]:
# Check out a single token's embedding
sample_embed[0][0],sample_embed[0][0].shape, random_sentence

(<tf.Tensor: shape=(128,), dtype=float32, numpy=
 array([ 2.4226818e-02, -4.7777679e-02,  3.5049524e-02,  3.6900640e-03,
        -2.9904878e-02, -1.7876554e-02, -3.4891665e-02, -1.5524577e-02,
        -3.7678648e-02, -4.9924206e-02,  2.0213652e-02,  4.2819623e-02,
        -7.2308294e-03,  2.5348712e-02,  2.0667799e-03, -4.1906156e-02,
         1.7752759e-03,  4.6461012e-02, -1.2989067e-02, -4.7076058e-02,
         2.6168648e-02,  3.3077862e-02,  1.4537636e-02, -2.6907062e-02,
         1.1931382e-02, -2.0432360e-03, -6.6225305e-03,  3.1097341e-02,
         3.6220733e-02,  4.9807739e-02,  2.8792154e-02, -2.7266478e-02,
         2.3760263e-02, -3.3342898e-02,  3.2542121e-02, -4.5779776e-02,
        -1.2452196e-02, -3.9768852e-02,  2.8728221e-02, -4.9280919e-02,
        -4.3931808e-02, -2.9616356e-03, -4.7358263e-02,  8.2042813e-03,
        -3.0100858e-02,  2.3684908e-02,  1.0478329e-02,  4.7073755e-02,
         7.1232319e-03, -1.6918492e-02,  4.1845925e-03, -4.2973008e-02,
        -1.9965

### Modelling a text dataset

Once you've got your inputs and outputs prepared, it's a matter of figuring out which machine learning model to build in between them to bridge the gap.

Now that we've got a way to turn our text data into numbers, we can start to build machine learning models to model it.

To get plenty of practice, we're going to build a series of different models, each as its own experiment. We'll then compare the results of each model and see which one performed best.

More specifically, we'll be building the following:

* Model 0: Naive Bayes (baseline)
* Model 1: Feed-forward neural network (dense model)
* Model 2: LSTM model
* Model 3: GRU model
* Model 4: Bidirectional-LSTM model
* Model 5: 1D Convolutional Neural Network
* Model 6: TensorFlow Hub Pretrained Feature Extractor
* Model 7: Same as model 6 with 10% of training data




### Model 0: Getting a baseline
As with all machine learning modelling experiments, it's important to create a baseline model so you've got a benchmark for future experiments to build upon.

To create our baseline, we'll create a Scikit-Learn Pipeline using the TF-IDF (term frequency-inverse document frequency) formula to convert our words to numbers and then model them with the Multinomial Naive Bayes algorithm. 
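
For intuition about the TF-IDF step, here is a small pure-Python sketch on a made-up three-document corpus. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is my understanding of `TfidfVectorizer`'s defaults. Tokenization is simplified to whitespace splitting, and the Naive Bayes step is not shown.

```python
import math
from collections import Counter

docs = ["flood in the city", "fire in the forest", "the concert was fire"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def idf(word):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

def tfidf_vector(tokens):
    """Term count times IDF, scaled so the vector has unit L2 norm."""
    term_counts = Counter(tokens)
    weights = {word: count * idf(word) for word, count in term_counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {word: w / norm for word, w in weights.items()}

for doc, tokens in zip(docs, tokenized):
    print(doc, "->", {w: round(v, 3) for w, v in tfidf_vector(tokens).items()})
```

A word such as "the" that appears in every document gets the lowest weight, while a word that appears in only one document, such as "flood", gets the highest.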

In [23]:
# Convert text into numbers
from sklearn.feature_extraction.text import TfidfVectorizer
# Our model
from sklearn.naive_bayes import MultinomialNB

from sklearn.pipeline import Pipeline

# Create tokenization and modelling pipeline
model_0 = Pipeline([
    ("tfidf",TfidfVectorizer()), # convert words to numbers using TF-IDF
    ("clf",MultinomialNB()) # model the text
])
# Fit the pipeline to the training data
model_0.fit(train_sentences,train_labels)

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])

In [24]:
# Scikit-Learn models use .score() where Keras models use .evaluate(); for classifiers it returns accuracy
baseline_score=model_0.score(val_sentences,val_labels)
baseline_score*100

79.26509186351706

In [25]:
baseline_pred=model_0.predict(val_sentences)

baseline_pred[:10]

array([1, 1, 1, 0, 0, 1, 1, 1, 1, 0])

In [26]:
# Compare with the matching ground truth labels (the predictions were made on the validation set)
val_labels[:10]

array([0, 0, 1, 1, 1, 1, 1, 1, 1, 0])

### Creating an evaluation function for our model experiments

We could evaluate these as they are, but since we're going to be evaluating several models in the same way going forward, let's create a helper function which takes an array of predictions and ground truth labels and computes the following:

* Accuracy
* Precision
* Recall
* F1-score
> 🔑 Note: Since we're dealing with a classification problem, the above metrics are the most appropriate. If we were working with a regression problem, other metrics such as MAE (mean absolute error) would be a better choice.
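
To make these four metrics concrete, the sketch below computes them by hand in plain Python, using "weighted" averaging (each class's score is weighted by how many true samples it has). The helpers `per_class_scores` and `weighted_scores` are hypothetical names for illustration, and the inputs are just the first ten validation labels and baseline predictions printed earlier, so the numbers differ from the full validation set.

```python
def per_class_scores(y_true, y_pred, label):
    """Precision, recall, F1 and support for a single class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1, tp + fn

def weighted_scores(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall and F1, as percentages."""
    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    totals = [0.0, 0.0, 0.0]
    for label in sorted(set(y_true)):
        *scores, support = per_class_scores(y_true, y_pred, label)
        for i, score in enumerate(scores):
            totals[i] += score * support / n
    return {"accuracy": accuracy * 100, "precision": totals[0] * 100,
            "recall": totals[1] * 100, "f1": totals[2] * 100}

y_true = [0, 0, 1, 1, 1, 1, 1, 1, 1, 0]  # first ten validation labels
y_pred = [1, 1, 1, 0, 0, 1, 1, 1, 1, 0]  # first ten baseline predictions
print(weighted_scores(y_true, y_pred))
```

One property worth knowing: with weighted averaging, recall always equals accuracy, because weighting each class's recall by its support simply counts all correct predictions.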

In [27]:
from sklearn.metrics import accuracy_score,precision_recall_fscore_support
def calculate_results(y_true,y_pred):
  """
  Calculates accuracy, precision, recall and F1 score of a classification model.
  All values are returned as percentages in a dictionary.
  """
  # Calculate model accuracy
  model_accuracy=accuracy_score(y_true,y_pred)*100
  # Calculate model precision, recall and f1 score using "weighted" average
  model_precision, model_recall, model_f1, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted")
  model_results = {"accuracy": model_accuracy,
                   "precision":model_precision*100,
                   "recall":model_recall*100,
                   "f1":model_f1*100
                   }
  return model_results

In [28]:
bline=calculate_results(y_true=val_labels, y_pred=baseline_pred)
bline

{'accuracy': 79.26509186351706,
 'precision': 81.11390004213173,
 'recall': 79.26509186351706,
 'f1': 78.6218975804955}

### Model 1: A Simple Dense Model 🚀

In [29]:
# Set the directory where create_tensorboard_callback will save TensorBoard logs
from helper_functions import create_tensorboard_callback
SAVE_DIR="model_logs"

In [30]:
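# Build the model with the Functional API
from tensorflow.keras import layers
inputs=layers.Input(shape=(1,),dtype=tf.string) # inputs are 1-dimensional strings
x=text_vectorizer(inputs) # turn the input text into numbers
x=embedding(x) # create an embedding of the numerized text
x=layers.GlobalAveragePooling1D()(x) # average the embedding over the token dimension
outputs=layers.Dense(1,activation="sigmoid")(x) # binary output, so a single sigmoid unit
model_1=tf.keras.Model(inputs,outputs,name="model_1_dense")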
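
The last two layers of this model are simple enough to write out by hand. The sketch below uses toy sizes (3 tokens with 4-dimensional embeddings instead of 15 x 128) and made-up weights. `global_average_pool` and `dense_sigmoid` are illustrative helpers, not Keras functions.

```python
import math

def global_average_pool(token_vectors):
    """Average over the token axis: (sequence_length, dim) -> (dim,)."""
    n_tokens = len(token_vectors)
    return [sum(column) / n_tokens for column in zip(*token_vectors)]

def dense_sigmoid(features, weights, bias):
    """One output unit: weighted sum plus bias, squashed into (0, 1)."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Toy embedded sequence: 3 tokens, 4 dimensions each
token_vectors = [[1.0, 0.0, 2.0, -1.0],
                 [3.0, 0.0, 0.0, 1.0],
                 [2.0, 3.0, 1.0, 0.0]]

pooled = global_average_pool(token_vectors)
probability = dense_sigmoid(pooled, weights=[0.5, -0.5, 0.0, 1.0], bias=0.0)

print(pooled)                 # [2.0, 1.0, 1.0, 0.0]
print(round(probability, 4))  # 0.6225
```

Pooling collapses a variable number of token vectors into one fixed-size vector, which is why the dense layer only needs one weight per embedding dimension plus a bias (128 + 1 = 129 parameters in the notebook's model).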

In [31]:
# Compile the model
model_1.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

# Fit the model
model_1_history = model_1.fit(train_sentences, # input sentences can be a list of strings due to text preprocessing layer built-in model
                              train_labels,
                              epochs=5,
                              validation_data=(val_sentences, val_labels),
                              callbacks=[create_tensorboard_callback(dir_name=SAVE_DIR, 
                                                                     experiment_name="simple_dense_model")])

Saving TensorBoard log files to: model_logs/simple_dense_model/20230114-105036
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [32]:
model_1.summary()

Model: "model_1_dense"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization_1 (TextV  (None, 15)               0         
 ectorization)                                                   
                                                                 
 embedding (Embedding)       (None, 15, 128)           1280000   
                                                                 
 global_average_pooling1d (G  (None, 128)              0         
 lobalAveragePooling1D)                                          
                                                                 
 dense (Dense)               (None, 1)                 129       
                                                                 
Total params: 1,280,129
Trainable params: 1,280,129
N