# Learn the basics of NLP with TensorFlow

I'm going to follow this GitHub tutorial:

https://github.com/mrdbourke/tensorflow-deep-learning/blob/main/08_introduction_to_nlp_in_tensorflow.ipynb

Get the dataset from Kaggle.

In [1]:
import pandas as pd
import numpy as np

In [2]:
train_data = pd.read_csv('./dataaset/train.csv')

In [3]:
train_data.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


Split the data into training and validation sets.

In [4]:
from sklearn.model_selection import train_test_split

train_sentences, val_sentences, train_lables, val_lables = train_test_split(
    train_data["text"].to_numpy(),
    train_data["target"].to_numpy(),
    test_size=0.1
    )
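The split above is random on every run, so the validation scores further down will vary from run to run. Here is a minimal NumPy sketch of what a 90/10 hold-out split does, with a fixed seed for reproducibility. `holdout_split` and the toy arrays are made up for illustration; in scikit-learn, the `random_state` argument of `train_test_split` plays the role of the seed.

```python
import numpy as np

def holdout_split(features, labels, test_size=0.1, seed=42):
    """Shuffle the row indices once, then hold out a fraction for validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))
    n_val = int(np.ceil(test_size * len(features)))
    val_idx, train_idx = order[:n_val], order[n_val:]
    return features[train_idx], features[val_idx], labels[train_idx], labels[val_idx]

# Toy data standing in for train_data["text"] and train_data["target"]
toy_text = np.array([f"tweet {i}" for i in range(20)], dtype=object)
toy_target = np.arange(20) % 2

tr_x, va_x, tr_y, va_y = holdout_split(toy_text, toy_target)
print(len(tr_x), len(va_x))
```

Calling it twice with the same seed returns the same rows, which makes later model comparisons fairer.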

In [5]:
train_sentences

array(['MEN CRUSH EVERY FUCKING DAY???????????????????????????? http://t.co/Fs4y1c9mNf',
       '@Blazing_Ben @PattyDs50 @gwfrazee @JoshuaAssaraf Not really. Sadly I have come to expect that from Obama.',
       '@aphyr I\x89Ûªve been following you this long\x89Û_ Sunk cost fallacy or somethin\x89Ûª',
       ..., '@OfficialMqm you are terrorist',
       '@Mmchale13 *tries to electrocute self with phone cord*',
       '70 Years After Atomic Bombs Japan Still Struggles With War Past: The anniversary of the devastation wrought b... http://t.co/Targ56iGBZ'],
      dtype=object)

# Converting text into numbers

Create a text vectorization layer that maps words to integer indices.

In [6]:
from tensorflow.keras.layers import TextVectorization

In [7]:
text2vec = TextVectorization(
    max_tokens=10000, standardize='lower_and_strip_punctuation',
    split='whitespace', ngrams=None, output_mode='int',
    output_sequence_length=15, pad_to_max_tokens=False, vocabulary=None,
    idf_weights=None, sparse=False, ragged=False
)

In [8]:
text2vec.adapt(train_sentences)

See how the words of a sample sentence are mapped to integers.

In [9]:
sample_sentence = "There is a flood in my street!"
text2vec([sample_sentence])

<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[ 74,   9,   3, 215,   4,  13, 696,   0,   0,   0,   0,   0,   0,
          0,   0]], dtype=int64)>
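To make the steps concrete, here is a rough pure-Python sketch of what the layer does with these settings: lowercase, strip punctuation, split on whitespace, look up each word, then pad with zeros to length 15. This is not the Keras implementation. The function names and the toy corpus are made up, so the ids differ from the real output above.

```python
import re
import string
from collections import Counter

def standardize(text):
    """Lowercase and strip punctuation, roughly like 'lower_and_strip_punctuation'."""
    return re.sub(f"[{re.escape(string.punctuation)}]", "", text.lower())

def build_vocab(sentences, max_tokens=10000):
    """Index 0 is padding, index 1 is '[UNK]', the rest are ordered by frequency."""
    counts = Counter(word for s in sentences for word in standardize(s).split())
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def vectorize(text, vocab, sequence_length=15):
    """Map words to integer ids, then truncate or zero-pad to a fixed length."""
    lookup = {word: i for i, word in enumerate(vocab)}
    ids = [lookup.get(word, 1) for word in standardize(text).split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

# Hypothetical three-sentence corpus, only for illustration
toy_corpus = ["There is a flood", "A flood in my street", "the flood is here"]
vocab = build_vocab(toy_corpus)
print(vectorize("There is a flood in my street!", vocab))
```

Words that are not in the vocabulary map to 1 (`[UNK]`), and the trailing zeros are padding, just like in the tensor above.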

Get the first five words in the vocabulary.

In [10]:
text2vec.get_vocabulary()[:5]

['', '[UNK]', 'the', 'a', 'in']

Get the words at indices 100 to 104.

In [11]:
text2vec.get_vocabulary()[100:105]

['day', 'first', 'cant', 'buildings', 'attack']

# Creating Embedding layer

We are going to use TensorFlow's Embedding layer.

https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding

In [12]:
from tensorflow.keras import layers

embedding = layers.Embedding(input_dim=10000,  # vocabulary size (matches max_tokens)
                             output_dim=128,   # size of each embedding vector
                             input_length=15   # length of each input sequence (matches output_sequence_length)
                            )

embedding

<keras.layers.embeddings.Embedding at 0x2ce0ca15c70>

Get a random sentence from the training set

In [13]:
import random
random_sentence = random.choice(train_sentences)

print(f"Original text:\n {random_sentence}\
        \n\nEmbedded version:")

# Embed the random sentence (turn it into dense vectors of fixed size)
sample_embed = embedding(text2vec([random_sentence]))
sample_embed


Original text:
 Where will the winds take my gypsy blood this time? http://t.co/66YVulIZbk        

Embedded version:


<tf.Tensor: shape=(1, 15, 128), dtype=float32, numpy=
array([[[-0.01654463, -0.02916887,  0.00402608, ...,  0.0486904 ,
         -0.02967325, -0.01296461],
        [-0.02516477,  0.04138014, -0.01501142, ...,  0.0399284 ,
          0.04077123,  0.04924088],
        [ 0.02981856,  0.0029608 ,  0.00556079, ..., -0.0150679 ,
          0.03289032, -0.02314901],
        ...,
        [-0.00976108, -0.02075405,  0.02464462, ...,  0.01897743,
          0.01493886,  0.01296253],
        [-0.00976108, -0.02075405,  0.02464462, ...,  0.01897743,
          0.01493886,  0.01296253],
        [-0.00976108, -0.02075405,  0.02464462, ...,  0.01897743,
          0.01493886,  0.01296253]]], dtype=float32)>
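Conceptually, an embedding layer is a lookup table: each integer id selects one row of a `(vocab_size, embedding_dim)` weight matrix. Below is a NumPy sketch of that idea, using a random table as a stand-in for the layer's untrained weights. The uniform range is an assumption on my side, chosen because it matches the magnitude of the values above. The token ids are the ones from the sample sentence earlier. The sketch also shows why the last rows of the output above are identical: they all belong to the padding token 0.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 10000, 128

# Stand-in for the layer's weight matrix (assumed small uniform initial values)
table = rng.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim)).astype(np.float32)

# Ids of "There is a flood in my street!" from the vectorizer output, padded to 15
token_ids = np.array([[74, 9, 3, 215, 4, 13, 696, 0, 0, 0, 0, 0, 0, 0, 0]])

# Fancy indexing picks one 128-dimensional row per token id
embedded = table[token_ids]
print(embedded.shape)  # (1, 15, 128)

# All padding positions share row 0 of the table, so their vectors are equal
print(np.array_equal(embedded[0, -1], embedded[0, -2]))  # True
```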

In [14]:
sample_embed[0][0], sample_embed[0][0].shape, random_sentence

(<tf.Tensor: shape=(128,), dtype=float32, numpy=
 array([-0.01654463, -0.02916887,  0.00402608,  0.03156095, -0.01280761,
         0.0102576 , -0.02029748,  0.04976146,  0.00745783, -0.04059462,
         0.00715708,  0.027896  , -0.03073162, -0.0142231 ,  0.04000933,
         0.03548099,  0.02797547, -0.0358911 , -0.02592714, -0.03627281,
         0.01622288,  0.02645807,  0.04425314,  0.01021792,  0.02007533,
         0.01331313,  0.0130183 , -0.01840711, -0.00985068, -0.03270923,
         0.0340765 ,  0.00344791,  0.04997556,  0.00017343, -0.01647408,
        -0.04760018, -0.02892542, -0.04200842,  0.02270316,  0.01299495,
         0.03675191, -0.0264742 ,  0.01844417, -0.03964479,  0.02460872,
        -0.02527495,  0.03979373,  0.0433034 , -0.0170002 ,  0.01859173,
         0.00591798, -0.02757793, -0.03790073, -0.03183411,  0.03713011,
        -0.02045463,  0.01253465,  0.03721421,  0.04483048, -0.04005899,
        -0.01912232,  0.02871368, -0.00541382,  0.0066871 ,  0.01636289,
  

# Modelling a text dataset by running a series of experiments

These are the models we are going to try on the text data:

0, Naive Bayes with TF-IDF encoder (baseline)

1, Feed-forward neural network (dense model)

2, LSTM (RNN)

3, GRU (RNN)

4, Bidirectional-LSTM (RNN)

5, 1D Convolutional Neural Network

6, TensorFlow Hub Pretrained Feature Extractor

7, TensorFlow Hub Pretrained Feature Extractor (10% of data)

How are we going to approach all of these?

Use the standard steps for modelling with TensorFlow:

* Create a model
* Build a model
* Fit a model
* Evaluate our model

# Build a GRU model

In [15]:
import tensorflow as tf
from tensorflow.keras import layers

inputs = layers.Input(shape=(1,), dtype=tf.string)
x = text2vec(inputs)
x = embedding(x)
x = layers.GRU(64)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model_3 = tf.keras.Model(inputs, outputs, name="model_3_GRU")

In [16]:
model_3.summary()

Model: "model_3_GRU"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization (TextVec  (None, 15)               0         
 torization)                                                     
                                                                 
 embedding (Embedding)       (None, 15, 128)           1280000   
                                                                 
 gru (GRU)                   (None, 64)                37248     
                                                                 
 dense (Dense)               (None, 1)                 65        
                                                                 
Total params: 1,317,313
Trainable params: 1,317,313
Non-trainable params: 0
_____________________________________________
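The parameter counts in the summary can be checked by hand. Here is a small sketch of the arithmetic, assuming the Keras GRU default `reset_after=True`, which keeps two bias vectors for each of the three gates:

```python
vocab_size, embedding_dim, gru_units = 10000, 128, 64

# Embedding: one 128-dimensional vector per vocabulary entry
embedding_params = vocab_size * embedding_dim

# GRU: 3 gates, each with input weights, recurrent weights and two bias vectors
gru_params = 3 * (gru_units * (embedding_dim + gru_units) + 2 * gru_units)

# Dense: one weight per GRU unit plus one bias
dense_params = gru_units * 1 + 1

total_params = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total_params)
# 1280000 37248 65 1317313
```

Almost all of the parameters sit in the embedding table, so the vocabulary size and embedding dimension dominate the model size.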

In [21]:
# Compile model
model_3.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

In [18]:
# Create a TensorBoard callback (a new one is needed for each model)
from helper_function import create_tensorboard_callback

# Create a directory to save TensorBoard logs
SAVE_DIR = "model_logs"
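The `helper_function` module is not shown in these notes. Going by the log path printed during training (`model_logs/model_3_GRU/20211229-184431`), the helper probably looks something like the sketch below. Treat both function bodies as assumptions: `build_log_dir` is a name I made up, and only `tf.keras.callbacks.TensorBoard` is a real Keras API.

```python
import datetime

def build_log_dir(dir_name, experiment_name):
    """Return dir_name/experiment_name/<timestamp> so every run gets its own folder."""
    timestamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    return f"{dir_name}/{experiment_name}/{timestamp}"

def create_tensorboard_callback(dir_name, experiment_name):
    """Hypothetical stand-in for the helper imported above."""
    import tensorflow as tf  # imported here so build_log_dir works without TensorFlow

    log_dir = build_log_dir(dir_name, experiment_name)
    print(f"Saving TensorBoard log files to: {log_dir}")
    return tf.keras.callbacks.TensorBoard(log_dir=log_dir)

print(build_log_dir("model_logs", "model_3_GRU"))
```

A fresh timestamped directory per call is the reason a new callback is needed for each model: otherwise the runs would overwrite each other's logs.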

In [22]:
# Fit the model
model_3_history = model_3.fit(x=train_sentences,
                             y=train_lables,
                             epochs=5,
                             validation_data=(val_sentences, val_lables),
                             callbacks=[create_tensorboard_callback(dir_name=SAVE_DIR,
                                                                   experiment_name="model_3_GRU")])

Saving TensorBoard log files to: model_logs/model_3_GRU/20211229-184431
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [24]:
# Make predictions with the GRU model
model_3_pred_probs = model_3.predict(val_sentences)
model_3_pred_probs[:10]

array([[0.9983696 ],
       [0.785221  ],
       [0.00168416],
       [0.43720007],
       [0.9947716 ],
       [0.77878106],
       [0.8552263 ],
       [0.14568129],
       [0.28284764],
       [0.0279229 ]], dtype=float32)

In [26]:
# Convert model 3 prediction probabilities to labels
model_3_pred = tf.squeeze(tf.round(model_3_pred_probs))
model_3_pred[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([1., 1., 0., 0., 1., 1., 1., 0., 0., 0.], dtype=float32)>
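For sigmoid outputs between 0 and 1, rounding amounts to thresholding at 0.5. A NumPy sketch with the ten probabilities printed above, written with an explicit threshold so it is easy to change later:

```python
import numpy as np

# First ten prediction probabilities, copied from the model output above
pred_probs = np.array([[0.9983696], [0.785221], [0.00168416], [0.43720007], [0.9947716],
                       [0.77878106], [0.8552263], [0.14568129], [0.28284764], [0.0279229]],
                      dtype=np.float32)

threshold = 0.5
# squeeze() drops the trailing dimension: shape (10, 1) becomes (10,)
pred_labels = (pred_probs.squeeze() > threshold).astype(np.float32)
print(pred_labels)
# [1. 1. 0. 0. 1. 1. 1. 0. 0. 0.]
```

The fourth tweet (probability about 0.44) lands just below the threshold, so it is labelled 0.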

In [27]:
# Calculate model 3 results

from Evaluation import caluculate_results
model_3_results = caluculate_results(y_true=val_lables,
                                    y_pre=model_3_pred)
model_3_results

{'accuracy': 77.95275590551181,
 'prediction': 0.7789333564214768,
 'recall': 0.7795275590551181,
 'f1': 0.7776283995045866}
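The `Evaluation` module is not shown here, so the following is only a guess at what it computes. Recall above equals accuracy divided by 100, which is what you get when per-class scores are averaged with class-frequency weights, so the sketch assumes weighted averaging. The `'prediction'` key in the output presumably holds precision. `weighted_metrics` and the toy labels are made up for illustration.

```python
import numpy as np

def weighted_metrics(y_true, y_pred):
    """Accuracy in percent, plus precision, recall and F1 weighted by class frequency."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = np.asarray(y_pred).astype(int)
    accuracy = float(np.mean(y_true == y_pred)) * 100
    precision = recall = f1 = 0.0
    for label in np.unique(y_true):
        tp = np.sum((y_pred == label) & (y_true == label))
        fp = np.sum((y_pred == label) & (y_true != label))
        fn = np.sum((y_pred != label) & (y_true == label))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = np.mean(y_true == label)  # share of true samples with this label
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return {"accuracy": accuracy, "precision": float(precision),
            "recall": float(recall), "f1": float(f1)}

# Toy labels, not the real validation set
results = weighted_metrics([1, 0, 1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 0, 1, 1, 0])
print(results)
```

With weighted averaging, recall always matches accuracy (up to the factor of 100), so precision and F1 are the more informative numbers to compare across models.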