<a href="https://colab.research.google.com/github/86lekwenshiung/Neural-Network-with-Tensorflow/blob/main/07_Natural_Language_Processing_in_Tensorflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 0.0 Natural Language Processing in Tensorflow
___

The main goal of natural language processing (NLP) is to derive information from natural language.
Natural language is a broad term but you can consider it to cover any of the following:
* Text (such as that contained in an email, blog post, book, Tweet)
* Speech (a conversation you have with a doctor, voice commands you give to a smart speaker)

**What is NLP used for?**

Natural Language Processing is the driving force behind the following common applications:
* Language translation applications such as Google Translate
* Word Processors such as Microsoft Word and Grammarly that employ NLP to check grammatical accuracy of texts.
* Interactive Voice Response (IVR) applications used in call centers to respond to certain users’ requests.
* Personal assistant applications such as OK Google, Siri, Cortana, and Alexa.

**Workflow**
```
Download text -> Visualize Text -> turn into numbers (tokenization , embedding) -> build a model -> train the model to find patterns -> compare model -> ensemble model
```

Many NLP problems are also referred to as sequence-to-sequence (seq2seq) problems, since both the input and the output can be sequences (for example, translation).

**Typical Architecture of a RNN**

| Hyperparameter/Layer type | What does it do? | Typical values |
|---|---|---|
| Input text(s) | Target texts/sequences you'd like to discover patterns in | Whatever you can represent as text or a sequence |
| Input layer | Takes in the target sequence | input_shape = [batch_size, embedding_size] or [batch_size, sequence_shape] |
| Text vectorization layer | Maps input sequences to numbers | Multiple, can create with tf.keras.layers.experimental.preprocessing.TextVectorization |
| Embedding | Turns the mapping of text vectors into an embedding matrix | Multiple, can create with tf.keras.layers.Embedding |
| RNN cell(s) | Find patterns in sequences | SimpleRNN, LSTM, GRU |
| Hidden activation | Adds non-linearity to learned features (non-straight lines) | Usually Tanh (tf.keras.activations.tanh) |
| Pooling layer | Reduces the dimensionality of learned sequence features | Average (tf.keras.layers.GlobalAveragePooling1D) or Max (tf.keras.layers.GlobalMaxPool1D) |
| Fully connected layer | Further refines learned features from the preceding (e.g. recurrent) layers | tf.keras.layers.Dense |
| Output layer | Takes learned features and outputs them in the shape of the target labels | output_shape = [number_of_classes] (e.g. Disaster, Not Disaster) |
| Output activation | Adds non-linearity to the output layer | tf.keras.activations.sigmoid (binary classification) or tf.keras.activations.softmax (multi-class classification) |


`source` : 
* https://towardsdatascience.com/whatnlpscientistsdo-905aa987c5c0
* https://becominghuman.ai/a-simple-introduction-to-natural-language-processing-ea66a1747b32
* https://blog.insightdatascience.com/how-to-solve-90-of-nlp-problems-a-step-by-step-guide-fda605278e4e


In [114]:
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

import pandas as pd
import numpy as np
import random

from sklearn.model_selection import train_test_split

import zipfile
import os

# 0.5 General Function
___

### TensorBoard Callbacks
___

In [89]:
import datetime

def create_tensorboard_callback(dir_name , experiment_name):
  log_dir = dir_name +'/' + experiment_name +'/' +datetime.datetime.now().strftime('%Y%m%d-%H%M%S')
  tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir = log_dir)
  print(f'Saving Tensorboard log files to {log_dir}')
  return tensorboard_callback
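
To make the log directory layout concrete, here is a minimal, standard-library-only sketch of how the path is assembled. The helper name `build_log_dir` and the fixed timestamp are assumptions made for illustration; they are not part of the notebook.

```python
import datetime

def build_log_dir(dir_name, experiment_name, now=None):
    """Return '<dir_name>/<experiment_name>/<YYYYmmdd-HHMMSS>' (hypothetical helper)."""
    now = now or datetime.datetime.now()
    return "/".join([dir_name, experiment_name, now.strftime("%Y%m%d-%H%M%S")])

# A fixed timestamp keeps the example reproducible.
example_time = datetime.datetime(2021, 8, 26, 18, 33, 11)
log_dir = build_log_dir("model_logs", "model_1_Dense", now=example_time)
print(log_dir)
```

Because the timestamp is part of the path, every training run writes to its own subdirectory, which is what lets TensorBoard show runs side by side.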

### Classification Evaluation Metrics
___

In [90]:
def eval_classification(y_true , y_pred):

  from sklearn.metrics import accuracy_score , precision_score , recall_score , f1_score

  # Compute each score
  accuracy = accuracy_score(y_true , y_pred)
  precision = precision_score(y_true , y_pred)
  recall = recall_score(y_true , y_pred)
  f1 = f1_score(y_true , y_pred) # separate name so the imported f1_score function is not shadowed

  score_dict = {'Accuracy' : accuracy,
                'Precision' : precision,
                'Recall' : recall,
                'F1 Score' : f1}

  return score_dict
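
For intuition about what these four scores measure, the sketch below recomputes them by counting true/false positives and negatives on a small made-up example. The function `binary_metrics` and the toy labels are illustrative assumptions; the notebook itself relies on scikit-learn.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels, computed by counting."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"Accuracy": accuracy, "Precision": precision, "Recall": recall, "F1 Score": f1}

# Toy labels (made up): 2 true positives, 1 false positive, 1 false negative, 2 true negatives.
toy_true = [1, 0, 1, 1, 0, 0]
toy_pred = [1, 0, 0, 1, 1, 0]
scores = binary_metrics(toy_true, toy_pred)
print(scores)
```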

# 1.0 Getting Data from kaggle (Natural Language Processing with Disaster Tweets)
___

source : https://www.kaggle.com/philculliton/nlp-getting-started-tutorial

In [91]:
!wget https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip

--2021-08-26 18:12:00--  https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 173.194.69.128, 108.177.96.128, 108.177.119.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|173.194.69.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 607343 (593K) [application/zip]
Saving to: ‘nlp_getting_started.zip.3’


2021-08-26 18:12:00 (81.5 MB/s) - ‘nlp_getting_started.zip.3’ saved [607343/607343]



In [92]:
# Unzip file
with zipfile.ZipFile('nlp_getting_started.zip') as zip_ref:
  zip_ref.extractall()

In [93]:
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

### 1.1 Visualising Data
___

In [94]:
# Checking Training Data
df_train.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [95]:
df_train = df_train.sample(frac = 1 , random_state = 42)
df_train.head()

Unnamed: 0,id,keyword,location,text,target
2644,3796,destruction,,So you have a new weapon that can cause un-ima...,1
2227,3185,deluge,,The f$&amp;@ing things I do for #GISHWHES Just...,0
5448,7769,police,UK,DT @georgegalloway: RT @Galloway4Mayor: ÛÏThe...,1
132,191,aftershock,,Aftershock back to school kick off was great. ...,0
6845,9810,trauma,"Montgomery County, MD",in response to trauma Children of Addicts deve...,0


In [96]:
# Checking Test Data
df_test.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


In [97]:
# Target True : False Ratio
df_train['target'].value_counts(normalize = True)

0    0.57034
1    0.42966
Name: target, dtype: float64

In [98]:
# Look at only the text and target columns
df_train[['text' , 'target']].head()

Unnamed: 0,text,target
2644,So you have a new weapon that can cause un-ima...,1
2227,The f$&amp;@ing things I do for #GISHWHES Just...,0
5448,DT @georgegalloway: RT @Galloway4Mayor: ÛÏThe...,1
132,Aftershock back to school kick off was great. ...,0
6845,in response to trauma Children of Addicts deve...,0


In [99]:
random_index = random.randint(0 , len(df_train) - 5) # leave room for the 5-row slice below

for row in df_train[['text' , 'target']][random_index : random_index +5].itertuples():
  _ , text , target = row

  print(f"Target: {target}" , "(Disaster)" if target > 0 else "(Not Disaster)")
  print(f'Text: {text}')
  print('-------\n')

Target: 1 (Disaster)
Text: On the sneak America has us spoiled. A natural disaster will humble niggas.
-------

Target: 0 (Not Disaster)
Text: Kids are inundated with images and information online and in media and have no way to deconstruct. - Kerri Sackville #TMS7
-------

Target: 1 (Disaster)
Text: Ashes 2015: AustraliaÛªs collapse at Trent Bridge among worst in history: England bundled out Australia for 60 ... http://t.co/985DwWPdEt
-------

Target: 1 (Disaster)
Text: Heat wave in WB heavy losses and no compensations (report) -  http://t.co/wMDihdiz1r (via PalinfoEn)   #Palestine
-------

Target: 1 (Disaster)
Text: @sonofbobBOB @Shimmyfab @trickxie usually I'd agree. Once the whole chopping heads off throwing gays off rooftops &amp; suicide bombing start
-------



### 1.2 Data Split Training and Validation
___

In [100]:
# Define X and y variables (`train` holds the tweet text, `val` holds the target labels)
train = df_train['text'].to_numpy()
val = df_train['target'].to_numpy()

In [101]:
train_sentences , val_sentences ,train_label , val_label = train_test_split(train , val , test_size = 0.1 , random_state  = 42)

In [102]:
print(f'Train Sentence: {train_sentences.shape}')
print(f'Val Sentence: {val_sentences.shape}')
print(f'Train Label: {train_label.shape}')
print(f'Val Label: {val_label.shape}')

Train Sentence: (6851,)
Val Sentence: (762,)
Train Label: (6851,)
Val Label: (762,)
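
The split sizes can be checked by hand. This is a small arithmetic sketch, assuming that a fractional `test_size` is rounded up to a whole number of samples, which is consistent with the shapes printed above.

```python
import math

n_samples = 6851 + 762   # total rows, taken from the shapes printed above
test_size = 0.1

n_val = math.ceil(test_size * n_samples)   # assumption: the fractional size is rounded up
n_train = n_samples - n_val

print(n_train, n_val)
```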


### 1.3 Converting Text into Numbers
___

* Tokenization : A direct mapping from each token to a number. It is simple, but the representation can get very large as the number of words increases.
* Embedding : Each token is represented by a vector taken from a weight matrix that is learned during training. This gives a richer representation of the relationships between tokens.
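
The difference between the two ideas can be sketched in plain Python. Everything here is a toy assumption: the five-word vocabulary, the embedding size of 4, and the random values (in a real model the table's values are learned).

```python
import random

# Hypothetical toy vocabulary: index 0 is padding, index 1 is the unknown token.
vocab = {"": 0, "[UNK]": 1, "flood": 2, "fire": 3, "storm": 4}

# Tokenization: each token becomes a single integer.
tokens = ["flood", "storm", "volcano"]
token_ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

# Embedding: each integer selects one row of a (vocab_size x embedding_dim) table.
rng = random.Random(42)
embedding_dim = 4
embedding_table = [[rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
                   for _ in vocab]
embedded = [embedding_table[i] for i in token_ids]

print(token_ids)
print(len(embedded), len(embedded[0]))
```

The word "volcano" is not in the toy vocabulary, so it falls back to the unknown-token index.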

#### 1.3.1 Tokenization
___

In [103]:
# # Default settings of TextVectorization
# text_vectorizer = TextVectorization(max_tokens = None,
#                                     standardize = 'lower_and_strip_punctuation',
#                                     split = 'whitespace',
#                                     ngrams = None, # grouping of words
#                                     output_mode = 'int',
#                                     output_sequence_length = None,
#                                     pad_to_max_tokens = True) 

In [104]:
#   This example instantiates a `TextVectorization` layer that lowercases text, splits on whitespace, strips punctuation, and outputs integer vocab indices.

max_vocab_length = 10000  # Max number of words in our vocab
max_length = 15 # max length our sequence will be (In this case the sequence is a tweet)

text_vectorizer = TextVectorization(max_tokens = max_vocab_length,
                                    output_mode = 'int',
                                    output_sequence_length = max_length,
                                    pad_to_max_tokens = True)

In [105]:
text_vectorizer.adapt(train_sentences)

In [106]:
# max_length is set to 15 but our sentence has only 7 words, so the remaining 8 positions are padded with 0s.
# 'Bukit' and 'Timah' are not in the vocabulary, so both map to index 1 ([UNK]).
sample_sentence = 'There is a flood in Bukit Timah'
text_vectorizer(sample_sentence)

<tf.Tensor: shape=(15,), dtype=int64, numpy=
array([ 74,   9,   3, 232,   4,   1,   1,   0,   0,   0,   0,   0,   0,
         0,   0])>
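
A rough pure-Python imitation of what the layer does to this sentence may help: lowercase, strip punctuation, split on whitespace, look up each word, then pad or truncate to the target length. The function `vectorize` and the five-entry `toy_vocab` are assumptions for illustration (the indices are copied from the output above); this is not how the Keras layer is implemented internally.

```python
import string

def vectorize(text, vocab, max_length):
    """Lowercase, strip punctuation, split on whitespace, map to ids, pad/truncate."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    ids = [vocab.get(word, 1) for word in cleaned.split()]  # 1 = [UNK]
    ids = ids[:max_length]                                  # truncate long inputs
    return ids + [0] * (max_length - len(ids))              # 0 = padding

# Only the words needed for this sentence; indices taken from the output above.
toy_vocab = {"there": 74, "is": 9, "a": 3, "flood": 232, "in": 4}
vector = vectorize("There is a flood in Bukit Timah", toy_vocab, max_length=15)
print(vector)
```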

In [107]:
# Visualizing a random sentence from our training dataset.
random_sentence = random.choice(train_sentences)
print(f'Original Sentence: {random_sentence}')
print('--------')
print(f'Vectorized Sentence: {text_vectorizer(random_sentence)}')

Original Sentence: Goulburn man Henry Van Bilsen missing: Emergency services are searching for a Goulburn man who disappeared from hisÛ_ http://t.co/z99pKJzTRp
--------
Vectorized Sentence: [5549   89 5489 1929    1  373   73  327   22  669   10    3 5549   89
   65]


In [108]:
# Checking the unique vocabulary
word_in_vocab = text_vectorizer.get_vocabulary()
print(f'Top 10 words: {word_in_vocab[:10]}')
print(f'Bottom 10 words: {word_in_vocab[-10:]}')

Top 10 words: ['', '[UNK]', 'the', 'a', 'in', 'to', 'of', 'and', 'i', 'is']
Bottom 10 words: ['painthey', 'painful', 'paine', 'paging', 'pageshi', 'pages', 'paeds', 'pads', 'padres', 'paddytomlinson1']
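
The ordering of the vocabulary above (padding first, then `[UNK]`, then the most frequent words) can be imitated with a word counter. This is a simplified sketch: `build_vocabulary` and the toy sentences are assumptions, and the real layer also strips punctuation before counting.

```python
from collections import Counter

def build_vocabulary(sentences, max_tokens):
    """Reserve index 0 for padding and 1 for [UNK], then add words by frequency."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

toy_sentences = ["the fire is near the town",
                 "the flood is rising",
                 "the fire is out"]
toy_vocabulary = build_vocabulary(toy_sentences, max_tokens=5)
print(toy_vocabulary)
```

With `max_tokens=5`, only the three most frequent words fit after the two reserved entries; every other word would be mapped to `[UNK]`.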


#### 1.3.2 Embedding
___

* Key Parameters for embedding layer:
  - `input_dim` = size of our vocab
  - `output_dim` = size of the output embedding vector. A value of 100 means each token is represented by a vector of length 100
  - `input_length` = length of the sequences being passed to the embedding layer

In [109]:
embedding = tf.keras.layers.Embedding(input_dim = max_vocab_length,
                                      output_dim = 128,
                                      input_length = max_length)
embedding

<keras.layers.embeddings.Embedding at 0x7f062e97ad10>

In [110]:
# Visualizing a random sentence from our training dataset.
random_sentence = random.choice(train_sentences)

print(f'Original Sentence : {random_sentence}')
print('--------')
print(f'Embedded Sentence : {embedding(text_vectorizer(random_sentence)).shape}')
print(f'Embedded Sentence : {embedding(text_vectorizer(random_sentence))}')

Original Sentence : Officer Wounded Suspect Killed in Exchange of Gunfire: Richmond police officer wounded suspect killed in exc... http://t.co/zDHwRN6cZc
--------
Embedded Sentence : (15, 128)
Embedded Sentence : [[-0.0239583   0.02416081 -0.00979527 ... -0.04668007 -0.01397713
  -0.00261535]
 [ 0.03097672 -0.00587211 -0.04343293 ...  0.04963129  0.00380563
  -0.03805446]
 [-0.01389981  0.01931682 -0.03164418 ... -0.01101432  0.02029847
   0.01594095]
 ...
 [-0.01389981  0.01931682 -0.03164418 ... -0.01101432  0.02029847
   0.01594095]
 [ 0.01624436  0.02262406 -0.00955763 ... -0.03733759  0.03619066
  -0.01987487]
 [-0.01294749  0.00542659 -0.04776496 ... -0.0214572  -0.02556634
   0.01253258]]


### 1.4 Model 0 : Baseline Model (Naive Bayes with TF-IDF encoder)
___

* Model 0 : Naive Bayes with TF-IDF encoder
* Model 1 : Feed-Forward neural network (dense)
* Model 2 : LSTM model (RNN)
* Model 3 : GRU model (RNN)
* Model 4 : Bidirectional - LSTM model (RNN)
* Model 5 : 1D CNN
* Model 6 : TF Hub Pretrained Feature Extractor
* Model 7 : TF Hub Pretrained Feature Extractor with 10% data.

In [111]:
model_0 = Pipeline([
                    ('tfidf' , TfidfVectorizer()), # convert words to numbers using tfidf
                    ('clf' , MultinomialNB()) #model the text
])

model_0.fit(train_sentences , train_label)

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('clf',
                 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))],
         verbose=False)

In [112]:
# Baseline model score
model_0.score(val_sentences , val_label)
model_0_preds = model_0.predict(val_sentences)

In [113]:
eval_classification(y_true = val_label , y_pred = model_0_preds)

{'Accuracy': 0.7926509186351706,
 'F1 Score': 0.734006734006734,
 'Precision': 0.8861788617886179,
 'Recall': 0.6264367816091954}
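
As a quick sanity check, the F1 score is the harmonic mean of precision and recall, so it can be recomputed from the two values reported above:

```python
precision = 0.8861788617886179   # copied from the results above
recall = 0.6264367816091954      # copied from the results above

# F1 = 2 * P * R / (P + R)
f1 = 2 * precision * recall / (precision + recall)
print(f1)
```

The high precision and lower recall suggest the baseline rarely raises false alarms but misses a fair share of real disaster tweets.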

### 1.5 Model 1 : Feed Forward Neural Network
___

In [129]:
# Build model with the functional API

inputs = layers.Input(shape = (1,) , dtype = tf.string) # inputs are 1-D strings
x = text_vectorizer(inputs) # turn the text into numbers
x = embedding(x) # create an embedding 
# x = layers.GlobalAveragePooling1D()(x) # condense the feature vector for each token
outputs = layers.Dense(1 , activation = 'sigmoid')(x)
model_1 = tf.keras.Model(inputs = inputs , outputs = outputs , name ='model_1')

In [130]:
# Without a pooling layer, the Dense layer is applied to each of the 15 token vectors separately,
# so the output shape is (None, 15, 1) instead of one prediction per tweet.
model_1.summary()

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_4 (InputLayer)         [(None, 1)]               0         
_________________________________________________________________
text_vectorization_5 (TextVe (None, 15)                0         
_________________________________________________________________
embedding_2 (Embedding)      (None, 15, 128)           1280000   
_________________________________________________________________
dense_3 (Dense)              (None, 15, 1)             129       
Total params: 1,280,129
Trainable params: 1,280,129
Non-trainable params: 0
_________________________________________________________________


In [132]:
# Build model with the functional API

inputs = layers.Input(shape = (1,) , dtype = tf.string) # inputs are 1-D strings
x = text_vectorizer(inputs) # turn the text into numbers
x = embedding(x) # create an embedding 
x = layers.GlobalAveragePooling1D()(x) # condense the feature vector for each token
outputs = layers.Dense(1 , activation = 'sigmoid')(x)
model_1 = tf.keras.Model(inputs = inputs , outputs = outputs , name ='model_1')

In [133]:
# After GlobalAveragePooling1D, the 15 token vectors are averaged into a single 128-dimensional vector,
# so the Dense layer now outputs one prediction per tweet: (None, 1).
model_1.summary()

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_5 (InputLayer)         [(None, 1)]               0         
_________________________________________________________________
text_vectorization_5 (TextVe (None, 15)                0         
_________________________________________________________________
embedding_2 (Embedding)      (None, 15, 128)           1280000   
_________________________________________________________________
global_average_pooling1d_1 ( (None, 128)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 129       
Total params: 1,280,129
Trainable params: 1,280,129
Non-trainable params: 0
_________________________________________________________________


In [134]:
# Tensorboard Save directory
save_dir = 'model_logs'

model_1.compile(optimizer = tf.keras.optimizers.Adam(),
                loss = tf.keras.losses.BinaryCrossentropy(),
                metrics = ['accuracy'])

history_1 = model_1.fit(train_sentences,
                        train_label,
                        epochs = 5,
                        validation_data = (val_sentences, val_label),
                        callbacks = [create_tensorboard_callback(dir_name = save_dir , 
                                                                 experiment_name = 'model_1_Dense')])

Saving Tensorboard log files to model_logs/model_1_Dense/20210826-183311
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [135]:
# Model 1 Score
model_1.evaluate(val_sentences , val_label)



[0.4712037444114685, 0.7900262475013733]

In [145]:
# Comparing the predicted probabilities on the validation set with the actual labels
model_1_preds = model_1.predict(val_sentences)
print(f' Sample Test Prediction : {model_1_preds[-10:]}')
print('----------------------')
print(f' Actual Label : {val_label[-10:]}')

 Sample Test Prediction : [[0.9894411 ]
 [0.03264952]
 [0.9403504 ]
 [0.6411978 ]
 [0.09302598]
 [0.2739401 ]
 [0.12194067]
 [0.75411695]
 [0.36215562]
 [0.01136661]]
----------------------
 Actual Label : [1 0 1 1 0 0 0 0 0 0]


In [151]:
# Converting the predicted probabilities to 0/1 labels by rounding

model_1_preds = tf.squeeze(tf.round(model_1_preds))
model_1_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 1., 1., 0., 0., 1., 1., 1., 1., 0.], dtype=float32)>
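
The rounding step is equivalent to applying a 0.5 cutoff to each probability. The sketch below applies that cutoff to the ten validation probabilities printed earlier (the last ten rows) and compares them with their actual labels; it uses plain Python instead of TensorFlow purely for illustration.

```python
# Last ten validation probabilities and labels, copied from the output above.
probabilities = [0.9894411, 0.03264952, 0.9403504, 0.6411978, 0.09302598,
                 0.2739401, 0.12194067, 0.75411695, 0.36215562, 0.01136661]
actual = [1, 0, 1, 1, 0, 0, 0, 0, 0, 0]

# Anything above 0.5 is labelled 1 (Disaster), everything else 0 (Not Disaster).
predicted = [1 if p > 0.5 else 0 for p in probabilities]
matches = sum(int(p == a) for p, a in zip(predicted, actual))

print(predicted)
print(f"{matches} of {len(actual)} correct")
```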

In [154]:
model_1_results = eval_classification(val_label , model_1_preds)
model_1_results

{'Accuracy': 0.7900262467191601,
 'F1 Score': 0.7460317460317459,
 'Precision': 0.8333333333333334,
 'Recall': 0.6752873563218391}

#### 1.5.1 Visualizing via TensorFlow Projector
___
- [Tensorflow Projector](https://projector.tensorflow.org/)
- [Word Embedding](https://www.tensorflow.org/text/guide/word_embeddings)

In [157]:
words_in_vocab = text_vectorizer.get_vocabulary()
len(words_in_vocab) , words_in_vocab[:10]

(10000, ['', '[UNK]', 'the', 'a', 'in', 'to', 'of', 'and', 'i', 'is'])

In [171]:
# Get the weight matrix of embedding layer
# (these are the numerical representation of each token in our training data)
# Every token in the vocabulary is represented by one 128-dimensional vector.
embed_weights = model_1.get_layer('embedding_2').get_weights()[0]
embed_weights.shape

(10000, 128)

In [172]:
# Create Embedding files (we got this from Tensorflow's word embedding documentation)

import io

out_v = io.open('vectors.tsv', 'w', encoding='utf-8')
out_m = io.open('metadata.tsv', 'w', encoding='utf-8')

for index, word in enumerate(words_in_vocab):
  if index == 0:
    continue  # skip 0, it's padding.
  vec = embed_weights[index]
  out_v.write('\t'.join([str(x) for x in vec]) + "\n")
  out_m.write(word + "\n")
out_v.close()
out_m.close()
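
To see what the two files look like, here is a small sketch that writes the same format to in-memory buffers. The four-word vocabulary and the 2-dimensional weights are made-up stand-ins for `words_in_vocab` and `embed_weights`.

```python
import io

toy_vocab = ["", "[UNK]", "fire", "flood"]                         # hypothetical vocabulary
toy_weights = [[0.0, 0.0], [0.1, -0.2], [0.3, 0.4], [-0.5, 0.6]]   # hypothetical embeddings

vectors_buffer = io.StringIO()
metadata_buffer = io.StringIO()

for index, word in enumerate(toy_vocab):
    if index == 0:
        continue  # skip index 0, it is padding
    # One line per token: tab-separated vector values, and the matching word.
    vectors_buffer.write("\t".join(str(x) for x in toy_weights[index]) + "\n")
    metadata_buffer.write(word + "\n")

vector_lines = vectors_buffer.getvalue().splitlines()
metadata_lines = metadata_buffer.getvalue().splitlines()
print(vector_lines)
print(metadata_lines)
```

Line *n* of the vectors file and line *n* of the metadata file describe the same token, so both files must skip the padding entry together.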

In [173]:
# Download file from Colab to upload to projector (we got this from Tensorflow's word embedding documentation)
try:
  from google.colab import files
  files.download('vectors.tsv')
  files.download('metadata.tsv')
except Exception:
  pass


<p align = 'center'>
  Extract from TensorFlow Projector for Model_1 (load vectors.tsv and metadata.tsv)
  <img src = 'https://raw.githubusercontent.com/86lekwenshiung/Neural-Network-with-Tensorflow/main/images/08-tf_projector.png'>
</p>