<a href="https://colab.research.google.com/github/86lekwenshiung/Neural-Network-with-Tensorflow/blob/main/07_Natural_Language_Processing_in_Tensorflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 0.0 Natural Language Processing in TensorFlow
___

The main goal of natural language processing (NLP) is to derive information from natural language.
Natural language is a broad term but you can consider it to cover any of the following:
* Text (such as that contained in an email, blog post, book, Tweet)
* Speech (a conversation you have with a doctor, voice commands you give to a smart speaker)

**What is NLP used for?**

Natural Language Processing is the driving force behind the following common applications:
* Language translation applications such as Google Translate
* Word Processors such as Microsoft Word and Grammarly that employ NLP to check grammatical accuracy of texts.
* Interactive Voice Response (IVR) applications used in call centers to respond to certain users’ requests.
* Personal assistant applications such as OK Google, Siri, Cortana, and Alexa.

**Workflow**
```
Download text -> visualize text -> turn into numbers (tokenization, embedding) -> build a model -> train the model to find patterns -> compare models -> ensemble models
```

Many NLP problems are also framed as sequence-to-sequence (seq2seq) problems, where a sequence goes in and a sequence (or a label) comes out.

**Typical Architecture of an RNN**

| Hyperparameter/Layer type | What does it do? | Typical values |
|---|---|---|
| Input text(s) | Target texts/sequences you'd like to discover patterns in | Whatever you can represent as text or a sequence |
| Input layer | Takes in target sequence | input_shape = [batch_size, embedding_size] or [batch_size, sequence_shape] |
| Text vectorization layer | Maps input sequences to numbers | Multiple, can create with tf.keras.layers.experimental.preprocessing.TextVectorization |
| Embedding | Turns the mapping of text vectors into an embedding matrix | Multiple, can create with tf.keras.layers.Embedding |
| RNN cells | Find patterns in sequences | SimpleRNN, LSTM, GRU |
| Hidden activation | Adds non-linearity to learned features (non-straight lines) | Usually Tanh (tf.keras.activations.tanh) |
| Pooling layer | Reduces the dimensionality of learned sequence features | Average (tf.keras.layers.GlobalAveragePooling1D) or Max (tf.keras.layers.GlobalMaxPool1D) |
| Fully connected layer | Further refines learned features from recurrent layers | tf.keras.layers.Dense |
| Output layer | Takes learned features and outputs them in shape of target labels | output_shape = [number_of_classes] (e.g. Disaster, Not Disaster) |
| Output activation | Adds non-linearities to output layer | tf.keras.activations.sigmoid (binary classification) or tf.keras.activations.softmax (multi-class classification) |


`source` : 
* https://towardsdatascience.com/whatnlpscientistsdo-905aa987c5c0
* https://becominghuman.ai/a-simple-introduction-to-natural-language-processing-ea66a1747b32
* https://blog.insightdatascience.com/how-to-solve-90-of-nlp-problems-a-step-by-step-guide-fda605278e4e


In [26]:
import tensorflow as tf
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

import pandas as pd
import numpy as np
import random

from sklearn.model_selection import train_test_split

import zipfile
import os

# 1.0 Getting Data from kaggle (Natural Language Processing with Disaster Tweets)
___

source : https://www.kaggle.com/philculliton/nlp-getting-started-tutorial

In [4]:
!wget https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip

--2021-08-26 10:05:42--  https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.193.128, 172.217.204.128, 172.217.203.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.193.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 607343 (593K) [application/zip]
Saving to: ‘nlp_getting_started.zip’


2021-08-26 10:05:42 (60.6 MB/s) - ‘nlp_getting_started.zip’ saved [607343/607343]



In [5]:
# Unzip file
zip_ref = zipfile.ZipFile('nlp_getting_started.zip')
zip_ref.extractall()
zip_ref.close()

In [6]:
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

### 1.1 Visualising Data
___

In [7]:
# Checking Training Data
df_train.head()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | NaN | NaN | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | NaN | NaN | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | NaN | NaN | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | NaN | NaN | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | NaN | NaN | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [8]:
df_train = df_train.sample(frac = 1 , random_state = 42)
df_train.head()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 2644 | 3796 | destruction | NaN | So you have a new weapon that can cause un-ima... | 1 |
| 2227 | 3185 | deluge | NaN | The f$&amp;@ing things I do for #GISHWHES Just... | 0 |
| 5448 | 7769 | police | UK | DT @georgegalloway: RT @Galloway4Mayor: ÛÏThe... | 1 |
| 132 | 191 | aftershock | NaN | Aftershock back to school kick off was great. ... | 0 |
| 6845 | 9810 | trauma | Montgomery County, MD | in response to trauma Children of Addicts deve... | 0 |


In [9]:
# Checking Test Data
df_test.head()

| | id | keyword | location | text |
|---|---|---|---|---|
| 0 | 0 | NaN | NaN | Just happened a terrible car crash |
| 1 | 2 | NaN | NaN | Heard about #earthquake is different cities, s... |
| 2 | 3 | NaN | NaN | there is a forest fire at spot pond, geese are... |
| 3 | 9 | NaN | NaN | Apocalypse lighting. #Spokane #wildfires |
| 4 | 11 | NaN | NaN | Typhoon Soudelor kills 28 in China and Taiwan |


In [10]:
# Target True : False Ratio
df_train['target'].value_counts(normalize = True)

0    0.57034
1    0.42966
Name: target, dtype: float64
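
The class ratio above also sets a floor for every later model: a classifier that always predicts the majority class (0) would already be right about 57% of the time. A minimal sketch of that check, using a small hypothetical label list rather than the actual dataset (the function name is illustrative only):

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)

# Hypothetical labels, not taken from the dataset
toy_labels = [0, 0, 0, 0, 1, 1, 1]
print(majority_baseline_accuracy(toy_labels))
```

Any model scoring close to this value on the real labels has learned little beyond the class distribution.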

In [14]:
# Preview only the text and target columns
df_train[['text' , 'target']].head()

| | text | target |
|---|---|---|
| 2644 | So you have a new weapon that can cause un-ima... | 1 |
| 2227 | The f$&amp;@ing things I do for #GISHWHES Just... | 0 |
| 5448 | DT @georgegalloway: RT @Galloway4Mayor: ÛÏThe... | 1 |
| 132 | Aftershock back to school kick off was great. ... | 0 |
| 6845 | in response to trauma Children of Addicts deve... | 0 |


In [15]:
# randint is inclusive at both ends, so leave room for the 5 rows sliced below
random_index = random.randint(0 , len(df_train) - 5)

for row in df_train[['text' , 'target']][random_index : random_index +5].itertuples():
  _ , text , target = row

  print(f"Target: {target}" , "(Disaster)" if target > 0 else "(Not Disaster)")
  print(f'Text: {text}')
  print('-------\n')

Target: 0 (Not Disaster)
Text: Bright &amp; BLAZING Fireman Birthday Party http://t.co/9rFo9GY3nE #Weddings
-------

Target: 0 (Not Disaster)
Text: I wanna tweet a 'niall thx for not making me was to electrocute myself' tweet but I'm scared I'll jinx it
-------

Target: 1 (Disaster)
Text: @aria_ahrary @TheTawniest The out of control wild fires in California even in the Northern part of the state. Very troubling.
-------

Target: 0 (Not Disaster)
Text: Just burned the shit outta myself on my dirt bike ??
-------

Target: 0 (Not Disaster)
Text: @FEVWarrior -in the Vault that could take a look at those wounds of yours if you'd like to go to one of these places first.' Zarry has had-
-------



### 1.2 Data Split Training and Validation
___

In [16]:
# Define X (tweet text) and y (target labels)
X = df_train['text'].to_numpy()
y = df_train['target'].to_numpy()

In [17]:
# Hold out 10% of the labelled data for validation
train_sentences , val_sentences , train_label , val_label = train_test_split(X , y , test_size = 0.1 , random_state = 42)

In [18]:
print(f'Train Sentence: {train_sentences.shape}')
print(f'Val Sentence: {val_sentences.shape}')
print(f'Train Label: {train_label.shape}')
print(f'Val Label: {val_label.shape}')

Train Sentence: (6851,)
Val Sentence: (762,)
Train Label: (6851,)
Val Label: (762,)
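
The shapes printed above follow from `test_size = 0.1` applied to 6851 + 762 = 7613 labelled tweets. A short sketch of the arithmetic, assuming the held-out part is rounded up (which is consistent with the shapes shown; `split_sizes` is a hypothetical helper, not a scikit-learn function):

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_val) for a fractional test_size, rounding the held-out part up."""
    n_val = math.ceil(test_size * n_samples)
    return n_samples - n_val, n_val

print(split_sizes(7613, 0.1))  # (6851, 762)
```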


### 1.3 Converting Text into Numbers
___

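* Tokenization : A direct mapping from each token (word, character or sub-word) to an integer. Simple, but the vocabulary, and with it the model's input size, grows with the number of distinct tokens.
* Embedding : Each token is represented by a dense vector, stored as a row of a weight matrix that can be learned during training. This gives a richer representation of the relationships between tokens.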

#### 1.3.1 Tokenization
___

In [30]:
# # Default settings of TextVectorization
# text_vectorizer = TextVectorization(max_tokens = None,
#                                     standardize = 'lower_and_strip_punctuation',
#                                     split = 'whitespace',
#                                     ngrams = None, # grouping of words
#                                     output_mode = 'int',
#                                     output_sequence_length = None,
#                                     pad_to_max_tokens = True) 

In [34]:
# Setting up text vectorisation variables

max_vocab_length = 10000  # Max number of words in our vocab
max_length = 15 # max length our sequence will be (In this case the sequence is a tweet)

text_vectorizer = TextVectorization(max_tokens = max_vocab_length,
                                    output_mode = 'int',
                                    output_sequence_length = max_length,
                                    pad_to_max_tokens = True)

In [37]:
# max_length is 15 and our sentence has only 7 words, so the remaining 8 positions are padded with 0s.
# Words that are not in the vocabulary (here 'Bukit' and 'Timah') map to 1, the [UNK] token.
sample_sentence = 'There is a flood in Bukit Timah'
text_vectorizer(sample_sentence)

<tf.Tensor: shape=(15,), dtype=int64, numpy=
array([ 74,   9,   3, 232,   4,   1,   1,   0,   0,   0,   0,   0,   0,
         0,   0])>
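
The output above shows two behaviours at once: out-of-vocabulary words map to index 1, and short sequences are padded with 0 up to the fixed length. The following is a minimal pure-Python sketch of the same idea, not the `TextVectorization` implementation. The toy corpus, `build_vocab` and `vectorize` are hypothetical, and punctuation stripping is left out for brevity.

```python
from collections import Counter

PAD, OOV = 0, 1  # index 0 pads, index 1 marks out-of-vocabulary tokens

def build_vocab(sentences, max_tokens):
    """Map the most frequent lower-cased tokens to integers starting at 2."""
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return {tok: idx for idx, tok in enumerate(most_common, start=2)}

def vectorize(sentence, vocab, seq_len):
    """Convert a sentence to a fixed-length list of integer ids."""
    ids = [vocab.get(tok, OOV) for tok in sentence.lower().split()]
    ids = ids[:seq_len]                        # truncate long sentences
    return ids + [PAD] * (seq_len - len(ids))  # pad short ones with 0

# Hypothetical toy corpus standing in for the training tweets
corpus = ["there is a flood", "there is a fire in the forest"]
vocab = build_vocab(corpus, max_tokens=10)
print(vectorize("There is a flood in Bukit Timah", vocab, seq_len=15))
# [2, 3, 4, 5, 7, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```

As in the notebook output, the two unseen words become 1 and the last eight positions are 0.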

In [45]:
# Visualizing a random sentence from our training dataset.
random_sentence = random.choice(train_sentences)
print(f'Original Sentence: {random_sentence}')
print('--------')
print(f'Vectorized Sentence: {text_vectorizer(random_sentence)}')

Original Sentence: Im so anxious though because so many ppl will me watching me meet them and that makes me uncomfortable BUT I CANT LET THAT RUIN THE MOMENT
--------
Vectorized Sentence: [  32   28 4142  804  152   28  123  929   38   31  636   31 1389   93
    7]


In [50]:
# Checking the unique vocabulary
word_in_vocab = text_vectorizer.get_vocabulary()
print(f'Top 10 words: {word_in_vocab[:10]}')
print(f'Bottom 10 words: {word_in_vocab[-10:]}')

Top 10 words: ['', '[UNK]', 'the', 'a', 'in', 'to', 'of', 'and', 'i', 'is']
Bottom 10 words: ['painthey', 'painful', 'paine', 'paging', 'pageshi', 'pages', 'paeds', 'pads', 'padres', 'paddytomlinson1']


In [35]:
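# Fit the text vectorizer to the training text
# (this cell was executed before the vectorizer was called in the cells above; run it first when going top to bottom)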
text_vectorizer.adapt(train_sentences)

#### 1.3.2 Embedding
___

* Key parameters for the embedding layer:
  - `input_dim` = size of our vocabulary
  - `output_dim` = size of the output embedding vector. A value of 100 means each token gets represented by a vector of length 100
  - `input_length` = length of the sequences being passed to the embedding layer
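
Conceptually, the parameters above describe a lookup table with `input_dim` rows and `output_dim` columns, where each token id selects one row. A minimal pure-Python sketch of that lookup follows. It is not the Keras layer: `make_embedding` and `embed` are hypothetical helpers, and the small uniform random initial values are an assumption made for illustration.

```python
import random

def make_embedding(input_dim, output_dim, seed=0):
    """Create an input_dim x output_dim table of small random values."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(output_dim)]
            for _ in range(input_dim)]

def embed(token_ids, table):
    """Look up one row of the table per token id."""
    return [table[i] for i in token_ids]

table = make_embedding(input_dim=10, output_dim=4)
embedded = embed([2, 3, 4, 5, 7, 1, 1, 0], table)
print(len(embedded), len(embedded[0]))  # 8 4
```

A sequence of 8 token ids becomes an 8 x 4 matrix here. In a trainable layer the rows of the table are weights that get updated during training.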

In [52]:
embedding = tf.keras.layers.Embedding(input_dim = max_vocab_length,
                                      output_dim = 128,
                                      input_length = max_length)
embedding

<keras.layers.embeddings.Embedding at 0x7fe9a35fae90>

In [61]:
# Visualizing a random sentence from our training dataset and its embedding.
random_sentence = random.choice(train_sentences)

print(f'Original Sentence : {random_sentence}')
print('--------')
print(f'Embedded Sentence : {embedding(text_vectorizer(random_sentence)).shape}')
print(f'Embedded Sentence : {embedding(text_vectorizer(random_sentence))}')

Original Sentence : I don't pray harm on members of ISIS.I pray they experience the life-rebooting love of God &amp; become 'Paul's' in Gods mind-blowing final Act
--------
Embedded Sentence : (15, 128)
Embedded Sentence : [[-0.00283871  0.0118186   0.02091951 ... -0.01730181 -0.04164167
   0.00070925]
 [ 0.02597681 -0.01401955  0.02378706 ... -0.04332544  0.03536881
  -0.01831583]
 [-0.04824638  0.00782424  0.00272657 ... -0.00690223  0.03945352
  -0.03128378]
 ...
 [ 0.04091967 -0.03550221  0.04409833 ...  0.02903645  0.03072219
  -0.00509418]
 [ 0.04361527  0.01893303  0.0297743  ...  0.02776686  0.03634677
   0.03193242]
 [-0.04824767 -0.02490045 -0.04372081 ...  0.04281325  0.00329584
   0.01509192]]


### 1.4 Model 0 : Baseline Naive Bayes Model with TF-IDF Encoder
___

The models built and compared in this notebook:

* Model 0 : Naive Bayes with TF-IDF encoder
* Model 1 : Feed-Forward neural network (dense)
* Model 2 : LSTM model (RNN)
* Model 3 : GRU model (RNN)
* Model 4 : Bidirectional - LSTM model (RNN)
* Model 5 : 1D CNN
* Model 6 : TF Hub Pretrained Feature Extractor
* Model 7 : TF Hub Pretrained Feature Extractor with 10% data.

In [63]:
# Model 0 : Import Libraries
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

In [64]:
model_0 = Pipeline([
                    ('tfidf' , TfidfVectorizer()), # convert words to numbers using tfidf
                    ('clf' , MultinomialNB()) #model the text
])

model_0.fit(train_sentences , train_label)

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('clf',
                 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))],
         verbose=False)
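
To make the `tfidf` step of the pipeline above less of a black box, here is a minimal pure-Python sketch of TF-IDF weighting with a smoothed inverse document frequency and L2 normalisation, mirroring the `smooth_idf=True` and `norm='l2'` settings in the printed pipeline. It is an illustration, not scikit-learn's implementation: it splits on whitespace only, and the two example documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Number of documents each term appears in
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed inverse document frequency: rarer terms get larger weights
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for toks in tokenized:
        raw = {t: count * idf[t] for t, count in Counter(toks).items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors

# Hypothetical documents, not taken from the dataset
docs = ["forest fire near town", "great party near town"]
for vec in tfidf(docs):
    print({t: round(w, 3) for t, w in vec.items()})
```

Words that appear in only one document ("forest", "fire") receive a higher weight than words shared by both ("near", "town"), which is what lets the Naive Bayes classifier focus on distinctive terms.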

In [68]:
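# Baseline model accuracy on the validation set
baseline_score = model_0.score(val_sentences , val_label)
print(f'Baseline accuracy: {baseline_score:.4f}')

# Keep the predictions for the detailed evaluation below
model_0_preds = model_0.predict(val_sentences)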

In [87]:
def eval_classification(y_true , y_pred):
  """Return accuracy, precision, recall and F1 score for a binary classifier."""

  from sklearn.metrics import accuracy_score , precision_score , recall_score , f1_score

  # Define scoring variables (f1 is named so that it does not shadow the imported f1_score function)
  accuracy = accuracy_score(y_true , y_pred)
  precision = precision_score(y_true , y_pred)
  recall = recall_score(y_true , y_pred)
  f1 = f1_score(y_true , y_pred)

  score_dict = {'Accuracy' : accuracy,
                'Precision' : precision,
                'Recall' : recall,
                'F1 Score' : f1}

  return score_dict

In [88]:
eval_classification(y_true = val_label , y_pred = model_0_preds)

{'Accuracy': 0.7926509186351706,
 'F1 Score': 0.734006734006734,
 'Precision': 0.8861788617886179,
 'Recall': 0.6264367816091954}
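
The four scores above are all functions of the confusion-matrix counts. The sketch below recomputes them from counts that were back-computed from the reported scores and the 762 validation samples. These counts are an inference, not an output of the notebook, and `classification_scores` is a hypothetical helper.

```python
def classification_scores(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)   # of the predicted disasters, how many were real
    recall = tp / (tp + fn)      # of the real disasters, how many were found
    return {
        'Accuracy': (tp + tn) / (tp + fp + fn + tn),
        'Precision': precision,
        'Recall': recall,
        'F1 Score': 2 * precision * recall / (precision + recall),
    }

# Assumed counts, inferred from the scores reported above (218 + 28 + 130 + 386 = 762)
scores = classification_scores(tp=218, fp=28, fn=130, tn=386)
for name, value in scores.items():
    print(f'{name}: {value:.4f}')
```

Read this way, the baseline rarely raises a false alarm (28 false positives) but misses 130 of the 348 disaster tweets, which is why precision is high and recall is comparatively low.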

### 1.5 Model 1 : Feed Forward Neural Network
___