<a href="https://colab.research.google.com/github/86lekwenshiung/Neural-Network-with-Tensorflow/blob/main/07_Natural_Language_Processing_in_Tensorflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 0.0 Natural Language Processing in TensorFlow
___

The main goal of natural language processing (NLP) is to derive information from natural language.
Natural language is a broad term but you can consider it to cover any of the following:
* Text (such as that contained in an email, blog post, book, Tweet)
* Speech (a conversation you have with a doctor, voice commands you give to a smart speaker)

**What is NLP used for?**

Natural Language Processing is the driving force behind the following common applications:
* Language translation applications such as Google Translate
* Word processors and writing tools such as Microsoft Word and Grammarly, which employ NLP to check the grammatical accuracy of text.
* Interactive Voice Response (IVR) applications used in call centers to respond to certain users’ requests.
* Personal assistant applications such as OK Google, Siri, Cortana, and Alexa.

**Workflow**
```
Download text -> Visualize Text -> turn into numbers (tokenization , embedding) -> build a model -> train the model to find patterns -> compare model -> ensemble model
```

Many NLP problems are also framed as sequence-to-sequence (seq2seq) problems.

**Typical Architecture of an RNN**

| Hyperparameter/Layer type | What does it do? | Typical values |
|---|---|---|
| Input text(s) | Target texts/sequences you'd like to discover patterns in | Whatever you can represent as text or a sequence |
| Input layer | Takes in target sequence | input_shape = [batch_size, embedding_size] or [batch_size, sequence_shape] |
| Text vectorization layer | Maps input sequences to numbers | Multiple, can create with tf.keras.layers.TextVectorization |
| Embedding | Turns the text vectors (token indices) into dense vectors via a learned embedding matrix | Multiple, can create with tf.keras.layers.Embedding |
| RNN cells | Find patterns in sequences | SimpleRNN, LSTM, GRU |
| Hidden activation | Adds non-linearity to learned features (non-straight lines) | Usually Tanh (tf.keras.activations.tanh) |
| Pooling layer | Reduces the dimensionality of learned sequence features | Average (tf.keras.layers.GlobalAveragePooling1D) or Max (tf.keras.layers.GlobalMaxPool1D) |
| Fully connected layer | Further refines learned features from the recurrent layers | tf.keras.layers.Dense |
| Output layer | Takes learned features and outputs them in shape of target labels | output_shape = [number_of_classes] (e.g. Disaster, Not Disaster) |
| Output activation | Converts output-layer values into prediction probabilities | tf.keras.activations.sigmoid (binary classification) or tf.keras.activations.softmax (multi-class classification) |
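
To make the table more concrete, here is a minimal sketch of how these layers might be stacked for a binary classifier. The toy sentences, the vocabulary size (1000), the sequence length (8), and the layer widths are all assumptions chosen for illustration, not values taken from this notebook. The text vectorization step is applied outside the model to keep the sketch short.

```python
import tensorflow as tf

# Hypothetical toy data (not from the disaster tweets dataset)
texts = tf.constant(["forest fire near the town", "what a lovely sunny day"])

# Text vectorization: map each word to an integer index and pad to a fixed length
vectorizer = tf.keras.layers.TextVectorization(max_tokens=1000,
                                               output_sequence_length=8)
vectorizer.adapt(texts)
token_ids = vectorizer(texts)  # shape: (2, 8)

# Embedding -> RNN cell (LSTM uses tanh by default) -> sigmoid output
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=1000, output_dim=16),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

probs = model(token_ids)  # untrained, so the probabilities are arbitrary
print(probs.shape)
```

The single sigmoid unit matches the binary case in the table; a multi-class problem would instead use one output unit per class with a softmax activation.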


`source` : 
* https://towardsdatascience.com/whatnlpscientistsdo-905aa987c5c0
* https://becominghuman.ai/a-simple-introduction-to-natural-language-processing-ea66a1747b32
* https://blog.insightdatascience.com/how-to-solve-90-of-nlp-problems-a-step-by-step-guide-fda605278e4e


In [64]:
import pandas as pd
import numpy as np
import random

from sklearn.model_selection import train_test_split

import zipfile
import os

# 1.0 Getting Data from Kaggle (Natural Language Processing with Disaster Tweets)
___

source : https://www.kaggle.com/philculliton/nlp-getting-started-tutorial

In [2]:
!wget https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip

--2021-08-25 18:16:18--  https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 172.253.114.128, 172.253.119.128, 108.177.111.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|172.253.114.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 607343 (593K) [application/zip]
Saving to: ‘nlp_getting_started.zip’


2021-08-25 18:16:18 (101 MB/s) - ‘nlp_getting_started.zip’ saved [607343/607343]



In [3]:
# Unzip file (the context manager closes the archive automatically)
with zipfile.ZipFile('nlp_getting_started.zip') as zip_ref:
    zip_ref.extractall()

In [4]:
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

### 1.1 Visualising Data
___

In [5]:
# Checking Training Data
df_train.head()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | NaN | NaN | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | NaN | NaN | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | NaN | NaN | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | NaN | NaN | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | NaN | NaN | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [11]:
# Shuffle the training data (frac = 1 samples every row, in random order)
df_train = df_train.sample(frac = 1 , random_state = 42)
df_train.head()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 900 | 1302 | bloody | Singapore | Damn bloody hot | 0 |
| 5470 | 7804 | quarantine | Joshua Tree, CA | Reddit Will Now Quarantine Offensive Content h... | 1 |
| 2625 | 3766 | destruction | NaN | 'Every kingdom divided against itself is heade... | 0 |
| 5319 | 7594 | outbreak | Pro-American and Anti-#Occupy | #BREAKING 10th death confirmed in Legionnaires... | 1 |
| 3174 | 4556 | emergency%20plan | Reddit | http://t.co/F7LJwxJ5jp #GamerGate The end of R... | 0 |


In [6]:
# Checking Test Data
df_test.head()

| | id | keyword | location | text |
|---|---|---|---|---|
| 0 | 0 | NaN | NaN | Just happened a terrible car crash |
| 1 | 2 | NaN | NaN | Heard about #earthquake is different cities, s... |
| 2 | 3 | NaN | NaN | there is a forest fire at spot pond, geese are... |
| 3 | 9 | NaN | NaN | Apocalypse lighting. #Spokane #wildfires |
| 4 | 11 | NaN | NaN | Typhoon Soudelor kills 28 in China and Taiwan |


In [17]:
# Class balance: proportion of each target value (1 = Disaster, 0 = Not Disaster)
df_train['target'].value_counts(normalize = True)

0    0.57034
1    0.42966
Name: target, dtype: float64

In [37]:
# Preview only the text and target columns
df_train[['text' , 'target']]

| | text | target |
|---|---|---|
| 900 | Damn bloody hot | 0 |
| 5470 | Reddit Will Now Quarantine Offensive Content h... | 1 |
| 2625 | 'Every kingdom divided against itself is heade... | 0 |
| 5319 | #BREAKING 10th death confirmed in Legionnaires... | 1 |
| 3174 | http://t.co/F7LJwxJ5jp #GamerGate The end of R... | 0 |
| ... | ... | ... |
| 3854 | Maryland mansion fire that killed 6 caused by ... | 1 |
| 1566 | I regress and I slip and I fall off that cliff | 0 |
| 5342 | Pandemonium In Aba As Woman Delivers Baby With... | 0 |
| 7465 | Driver rams car into Israeli soldiers wounds 3... | 1 |


In [62]:
# random.randint is inclusive on both ends, so leave room for a full 5-row slice
random_index = random.randint(0 , len(df_train) - 5)

for row in df_train[['text' , 'target']][random_index : random_index +5].itertuples():
  _ , text , target = row

  print(f"Target: {target}" , "(Disaster)" if target > 0 else "(Not Disaster)")
  print(f'Text: {text}')
  print('-------\n')

Target: 0 (Not Disaster)
Text: &lt;meltdown of proportions commences I manage to calm myself long enough to turn the waters to hot and wait for the steam to cloud my vision-
-------

Target: 1 (Disaster)
Text: #science Now that a piece of wreckage from flight MH370 has been confirmed on RÌ©union Island is it possible t...  http://t.co/qNVXJ2pAlJ
-------

Target: 1 (Disaster)
Text: http://t.co/iXiYBAp8Qa The Latest: More homes razed by Northern California wildfire - Lynchburg News and Advance http://t.co/zEpzQYDby4
-------

Target: 0 (Not Disaster)
Text: if they kill off Val I'm rioting #Emmerdale
-------

Target: 0 (Not Disaster)
Text: This guy idk just made me his woman crush ?? first one ever ??
-------



### 1.2 Data Split Training and Validation
___

In [72]:
# Define X and y variables
# Note: despite the names, `train` holds all tweet texts (X) and `val` holds all target labels (y)
train = df_train['text'].to_numpy()
val = df_train['target'].to_numpy()

In [71]:
train_sentences , val_sentences ,train_label , val_label = train_test_split(train , val , test_size = 0.1 , random_state  = 42)

In [73]:
print(f'Train Sentence: {train_sentences.shape}')
print(f'Val Sentence: {val_sentences.shape}')
print(f'Train Label: {train_label.shape}')
print(f'Val Label: {val_label.shape}')

Train Sentence: (6851,)
Val Sentence: (762,)
Train Label: (6851,)
Val Label: (762,)


### 1.3 Converting Text into Numbers
___

* Tokenization: a direct mapping from each token (e.g. a word) to a number. Simple, but the representation (and the model) can get very large as the number of words increases.
* Embedding: represents each token as a vector, i.e. a row of a learned weight matrix. This gives a richer representation of the relationships between tokens.
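
The difference between the two can be sketched in plain Python. This is only an illustration under stated assumptions: the two sentences, the whitespace tokenizer, the embedding size of 3, and the random (untrained) embedding values are all made up here; in practice a Keras `Embedding` layer learns its matrix during training.

```python
import random

# Hypothetical toy sentences (not from the dataset)
sentences = ["forest fire near town", "fire drill at school"]

# Tokenization: give every unique word an integer index (0 is reserved for padding)
vocab = {}
for sentence in sentences:
    for word in sentence.lower().split():
        if word not in vocab:
            vocab[word] = len(vocab) + 1

token_ids = [[vocab[w] for w in s.lower().split()] for s in sentences]
print(token_ids)  # [[1, 2, 3, 4], [2, 5, 6, 7]]

# Embedding: one vector per token index, stored as rows of a matrix.
# The values are random here; a real model would learn them.
rng = random.Random(42)
embedding_dim = 3
embedding_matrix = [[rng.uniform(-1, 1) for _ in range(embedding_dim)]
                    for _ in range(len(vocab) + 1)]

# Look up the vector for every token in every sentence
embedded = [[embedding_matrix[i] for i in ids] for ids in token_ids]
print(len(embedded[0]), len(embedded[0][0]))  # 4 tokens, each a 3-dimensional vector
```

Note that "fire" gets the same index (2), and therefore the same vector, in both sentences.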