# Learn the basics of NLP with TensorFlow

I'm going to follow this GitHub tutorial.

https://github.com/mrdbourke/tensorflow-deep-learning/blob/main/08_introduction_to_nlp_in_tensorflow.ipynb

Get the dataset from Kaggle.

In [1]:
import pandas as pd
import numpy as np

In [2]:
train_data = pd.read_csv('./dataaset/train.csv')

In [3]:
train_data.head()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | | | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | | | Just got sent this photo from Ruby #Alaska as ... | 1 |


Split the data into training and validation sets (10% is held out for validation).

In [4]:
from sklearn.model_selection import train_test_split

train_sentences, val_sentences, train_lables, val_lables = train_test_split(
    train_data["text"].to_numpy(),
    train_data["target"].to_numpy(),
    test_size=0.1
    )
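The split above shuffles the rows and holds out 10% of them. As a rough illustration of that idea, here is a minimal pure-Python sketch; `holdout_split` and the toy data are hypothetical and are not part of scikit-learn. It uses a fixed seed so the split is repeatable (`train_test_split` offers `random_state` for the same purpose).

```python
import math
import random

def holdout_split(items, labels, test_size=0.1, seed=42):
    """Shuffle paired items/labels and hold out a fraction for validation."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)        # seeded shuffle, so the split is repeatable
    n_val = math.ceil(len(items) * test_size)   # number of held-out examples
    val_idx, train_idx = indices[:n_val], indices[n_val:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(items, train_idx), pick(items, val_idx),
            pick(labels, train_idx), pick(labels, val_idx))

# Toy data standing in for the tweets and targets (not the Kaggle dataset)
texts = [f"tweet {i}" for i in range(20)]
targets = [i % 2 for i in range(20)]

tr_x, va_x, tr_y, va_y = holdout_split(texts, targets)
print(len(tr_x), len(va_x))
```

Each sentence stays paired with its label because the same shuffled indices are used for both lists.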

In [5]:
train_sentences

array(["@okgabby_ damn suh. don't let that ruin your year bruh. this our year. better start carpooling like we did back in the day",
       "'If you are going to achieve excellence in big things you develop the habit in little matters....' dont know the author",
       'Choking Hazard Prompts Recall Of Kraft Cheese Singles http://t.co/XGKyVF9t4f',
       ...,
       'Aquarium Ornament Wreck Sailing Boat Sunk Ship Destroyer Fish Tank Cave Decor - Full read \x89Û_ http://t.co/nosA8JJjiN http://t.co/WUKvdavUJu',
       'Bluedio Turbine Hurricane H Bluetooth 4.1 Wireless Stereo Headphones Headset BLK - Full re\x89Û_ http://t.co/WeUDLkc4o4 http://t.co/trl1dskF81',
       "http://t.co/XlFi7ovhFJ VIDEO: 'We're picking up bodies from water': Rescuers are searching for hundreds\x89Û_ http://t.co/rAq4ZpdvKe"],
      dtype=object)

# Converting text into numbers

Create a text vectorization layer that maps each word to an integer index.

In [6]:
from tensorflow.keras.layers import TextVectorization

In [7]:
text2vec = TextVectorization(
    max_tokens=10000, standardize='lower_and_strip_punctuation',
    split='whitespace', ngrams=None, output_mode='int',
    output_sequence_length=15, pad_to_max_tokens=False, vocabulary=None,
    idf_weights=None, sparse=False, ragged=False
)

In [8]:
text2vec.adapt(train_sentences)

See how the words of a sample sentence are mapped to integers.

In [9]:
sample_sentence = "There is a flood in my street!"
text2vec([sample_sentence])

<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[ 72,   9,   3, 216,   5,  13, 701,   0,   0,   0,   0,   0,   0,
          0,   0]], dtype=int64)>
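To make the output above less opaque, here is a rough pure-Python sketch of the steps the layer was configured to perform: lowercase, strip punctuation, split on whitespace, look up integer ids, then pad with zeros to length 15. The `vectorize` helper and `toy_vocab` are hypothetical; the toy vocabulary simply reuses the ids printed above, and `string.punctuation` only approximates the punctuation set Keras strips.

```python
import string

def vectorize(sentence, vocab_index, sequence_length=15):
    """Lowercase, strip punctuation, split on whitespace, map to ids, then pad or truncate."""
    cleaned = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    ids = [vocab_index.get(token, 1) for token in cleaned.split()]  # 1 = out-of-vocabulary
    ids = ids[:sequence_length]                                      # truncate long inputs
    return ids + [0] * (sequence_length - len(ids))                  # 0 = padding

# Toy vocabulary reusing the ids shown above; the real one comes from adapt()
toy_vocab = {"there": 72, "is": 9, "a": 3, "flood": 216, "in": 5, "my": 13, "street": 701}

print(vectorize("There is a flood in my street!", toy_vocab))
```

The seven words become seven ids and the remaining eight positions are filled with 0, which is why the tensor above ends in a run of zeros.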

Get the first five words in the vocabulary.

In [10]:
text2vec.get_vocabulary()[:5]

['', '[UNK]', 'the', 'a', 'to']

Get the words at indices 100 to 104.

In [11]:
text2vec.get_vocabulary()[100:105]

['time', 'first', 'got', 'world', 'love']
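The vocabulary is ordered by how often each word appears, with index 0 reserved for padding (`''`) and index 1 for out-of-vocabulary words (`'[UNK]'`). A minimal sketch of that ordering, assuming simple lowercase whitespace tokenization; `build_vocabulary` and the three-sentence corpus are hypothetical, and ties between equally frequent words may be ordered differently by Keras.

```python
from collections import Counter

def build_vocabulary(sentences, max_tokens=10000):
    """Reserve two slots for padding and unknown words, then add words by frequency."""
    counts = Counter(token for s in sentences for token in s.lower().split())
    most_common = [token for token, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

corpus = ["the flood hit the town", "the fire reached a town", "a storm"]
vocab = build_vocabulary(corpus)
print(vocab[:3])
```

This is why very common words such as 'the', 'a' and 'to' sit right after the two reserved entries in the real vocabulary.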

# Creating Embedding layer

We are going to use TensorFlow's embedding layer.

https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding

In [12]:
from tensorflow.keras import layers

embedding = layers.Embedding(input_dim=10000,  # vocabulary size (matches max_tokens)
                             output_dim=128,   # size of each embedding vector
                             input_length=15   # length of each input sequence (matches output_sequence_length)
                            )

embedding

<keras.layers.embeddings.Embedding at 0x286893a7dc0>

Get a random sentence from the training set

In [13]:
import random
random_sentence = random.choice(train_sentences)

print(f"Original text:\n {random_sentence}\
        \n\nEmbedded version:")

# Embed the random sentence (turn it into dense vectors of fixed size)
sample_embed = embedding(text2vec([random_sentence]))
sample_embed


Original text:
 Boy saves autistic brother from drowning: A nine-year-old in Maine dove into a pool to save his autistic brother from drowning        

Embedded version:


<tf.Tensor: shape=(1, 15, 128), dtype=float32, numpy=
array([[[-0.01633505, -0.0333326 ,  0.02428682, ...,  0.03221834,
          0.02810869, -0.03226112],
        [-0.0431253 ,  0.04908233,  0.01159443, ...,  0.04121036,
         -0.04984248,  0.02418424],
        [ 0.03160769,  0.04385887, -0.0315024 , ...,  0.00981873,
         -0.04838436,  0.01833885],
        ...,
        [-0.02509712,  0.00252758,  0.01120732, ...,  0.03815761,
          0.02797906, -0.00581179],
        [-0.01354187,  0.01167788,  0.03739823, ..., -0.02990412,
         -0.00281852, -0.00471834],
        [ 0.02459984, -0.03324062, -0.03261051, ..., -0.02778139,
         -0.01932004, -0.03199853]]], dtype=float32)>
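Conceptually, an embedding layer is a lookup table with one row per vocabulary entry: each token id selects its row. The NumPy sketch below illustrates only the shapes involved; the table is random (Keras' default initializer also draws small uniform values, and the real table is updated during training), and the token ids are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 10000, 128

# Random lookup table standing in for the layer's trainable weights
table = rng.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim)).astype(np.float32)

# One vectorized sentence: 7 token ids followed by 8 padding zeros, shape (1, 15)
token_ids = np.array([[72, 9, 3, 216, 5, 13, 701] + [0] * 8])

embedded = table[token_ids]   # picks one 128-dimensional row per token id
print(embedded.shape)         # (1, 15, 128)
```

Every padded position maps to the same row (row 0), and the same word always maps to the same vector wherever it appears.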

In [14]:
sample_embed[0][0], sample_embed[0][0].shape, random_sentence

(<tf.Tensor: shape=(128,), dtype=float32, numpy=
 array([-0.01633505, -0.0333326 ,  0.02428682, -0.01172941,  0.01253531,
        -0.02390733, -0.01692311, -0.02879509,  0.02773492,  0.0234932 ,
         0.04054533,  0.03897861,  0.00245571,  0.02664149, -0.02715945,
        -0.0043594 , -0.0085379 ,  0.00631136,  0.01319552,  0.00289512,
         0.0367268 , -0.01734496, -0.02654569,  0.01388904, -0.04451586,
        -0.00824375,  0.00538548,  0.02327209,  0.0431522 , -0.04723018,
        -0.04571656, -0.04024692, -0.0012235 ,  0.00289857,  0.02234027,
        -0.01382188,  0.00131958, -0.02938548, -0.0448084 ,  0.00228596,
        -0.04224558, -0.02415817,  0.00232942,  0.01215092, -0.02803907,
         0.04449954,  0.03510661, -0.00244106,  0.02183453,  0.02190504,
         0.01023483,  0.01911161, -0.02444285,  0.0303578 ,  0.04145867,
        -0.0187322 ,  0.01889959,  0.01434069, -0.03980677, -0.01571401,
         0.04318125, -0.04801739, -0.01193171, -0.04218401, -0.04322013,
  

# Modelling a text dataset by running a series of experiments

We will try the following models on the text data:

0. Naive Bayes with TF-IDF encoder (baseline)
1. Feed-forward neural network (dense model)
2. LSTM (RNN)
3. GRU (RNN)
4. Bidirectional-LSTM (RNN)
5. 1D Convolutional Neural Network
6. TensorFlow Hub Pretrained Feature Extractor
7. TensorFlow Hub Pretrained Feature Extractor (10% of data)

How are we going to approach all of these?

Use the standard steps for modelling with TensorFlow:

* Create a model
* Build a model
* Fit a model
* Evaluate our model

# Create Conv1D layer

For more information on CNNs, see:
* https://poloclub.github.io/cnn-explainer/

For the ReLU activation function, see:
* https://ml-cheatsheet.readthedocs.io/en/latest/activation_functions.html

In [15]:
# Test embedding layer, Conv1D layer and max pooling
from tensorflow.keras import layers

embedding_test = embedding(text2vec(["This is a test sentence"]))  # turn the target sequence into an embedding
conv_1d = layers.Conv1D(filters=32,
                       kernel_size=5,
                       activation="relu",
                       padding="valid")  # "valid": output is shorter than the input; "same": output length equals input length
conv_1d_output = conv_1d(embedding_test)  # pass the test embedding through the Conv1D layer
max_pool = layers.GlobalMaxPool1D()
max_pool_output = max_pool(conv_1d_output)  # keep the highest value of each filter ("the most important feature")

embedding_test.shape, conv_1d_output.shape, max_pool_output.shape

(TensorShape([1, 15, 128]), TensorShape([1, 11, 32]), TensorShape([1, 32]))
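The shapes above follow from how a "valid" 1D convolution slides over the sequence: with 15 time steps and a kernel of size 5 there are 15 - 5 + 1 = 11 window positions, and global max pooling then keeps one value per filter. A minimal NumPy sketch of that arithmetic for a single example (no batch dimension); `conv1d_valid` is a hypothetical helper with random weights, not the Keras implementation.

```python
import numpy as np

def conv1d_valid(x, kernel, bias):
    """x: (steps, channels); kernel: (kernel_size, channels, filters).

    Returns a (steps - kernel_size + 1, filters) array after ReLU.
    """
    kernel_size = kernel.shape[0]
    out_steps = x.shape[0] - kernel_size + 1
    # Stack every window of kernel_size consecutive steps: (out_steps, kernel_size, channels)
    windows = np.stack([x[i:i + kernel_size] for i in range(out_steps)])
    # Multiply each window with each filter and sum over window position and channel
    out = np.tensordot(windows, kernel, axes=([1, 2], [0, 1])) + bias
    return np.maximum(out, 0.0)  # ReLU

rng = np.random.default_rng(0)
x = rng.normal(size=(15, 128))          # one embedded sentence
kernel = rng.normal(size=(5, 128, 32))  # 32 filters spanning 5 steps each
bias = np.zeros(32)

features = conv1d_valid(x, kernel, bias)
pooled = features.max(axis=0)           # global max pooling over the time axis
print(features.shape, pooled.shape)     # (11, 32) (32,)
```

With `padding="same"` the input would be zero-padded so that the output keeps all 15 steps.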

In [16]:
# Create a TensorBoard callback (a new one is needed for each model)
from helper_function import create_tensorboard_callback

# Create a directory to save TensorBoard logs
SAVE_DIR = "model_logs"
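The `helper_function` module is not shown in these notes. Judging by the log path printed during training (`model_logs/Conv1D/20220102-092036`), the helper appears to build a directory from the save directory, the experiment name and a timestamp. The sketch below reproduces only that path logic; `build_log_dir` is a hypothetical name and may differ from what the real helper does.

```python
import datetime
import os

def build_log_dir(dir_name, experiment_name, now=None):
    """Return <dir_name>/<experiment_name>/<YYYYMMDD-HHMMSS>."""
    now = now or datetime.datetime.now()
    return os.path.join(dir_name, experiment_name, now.strftime("%Y%m%d-%H%M%S"))

# A fixed timestamp is used here only to make the example repeatable
log_dir = build_log_dir("model_logs", "Conv1D", datetime.datetime(2022, 1, 2, 9, 20, 36))
print(log_dir)

# The path could then be handed to the Keras callback, for example:
# tf.keras.callbacks.TensorBoard(log_dir=log_dir)
```

Giving every run its own timestamped directory keeps the logs of different experiments separate in TensorBoard.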

# Create the Conv1D model

In [17]:
import tensorflow as tf
from tensorflow.keras import layers
inputs = layers.Input(shape=(1,), dtype = tf.string)
x = text2vec(inputs)
x = embedding(x)
x = layers.Conv1D(filters=64, kernel_size=5, activation="relu", padding="valid")(x)
x = layers.GlobalMaxPool1D()(x)
output = layers.Dense(1, activation="sigmoid")(x)
model_5 = tf.keras.Model(inputs, output, name="model_5_Conv1D")

In [18]:
model_5.summary()

Model: "model_5_Conv1D"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization (TextVec  (None, 15)               0         
 torization)                                                     
                                                                 
 embedding (Embedding)       (None, 15, 128)           1280000   
                                                                 
 conv1d_1 (Conv1D)           (None, 11, 64)            41024     
                                                                 
 global_max_pooling1d_1 (Glo  (None, 64)               0         
 balMaxPooling1D)                                                
                                                                 
 dense (Dense)               (None, 1)              
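The parameter counts in the summary can be checked by hand: the embedding stores one 128-dimensional vector per vocabulary entry, and each Conv1D filter has kernel_size x input_channels weights plus one bias. The Dense row is cut off in the output above, so its figure below is computed from the layer definition (64 inputs, 1 output) rather than read from the summary.

```python
vocab_size, embedding_dim = 10000, 128
kernel_size, filters = 5, 64

embedding_params = vocab_size * embedding_dim                    # one vector per token
conv_params = kernel_size * embedding_dim * filters + filters    # weights + one bias per filter
dense_params = filters * 1 + 1                                   # 64 weights + 1 bias

print(embedding_params, conv_params, dense_params)
```

The first two values match the `1280000` and `41024` shown in the summary.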

In [19]:
# Compile model
model_5.compile(loss="binary_crossentropy",
               optimizer=tf.keras.optimizers.Adam(),
               metrics=["accuracy"])

In [22]:
model_5_history = model_5.fit(x=train_sentences,
                             y=train_lables,
                             epochs=5,
                             validation_data=(val_sentences, val_lables),
                             callbacks=[create_tensorboard_callback(SAVE_DIR,
                                                                   "Conv1D")])

Saving TensorBoard log files to: model_logs/Conv1D/20220102-092036
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [24]:
# Make predictions
model_5_pred_probs = model_5.predict(val_sentences)
model_5_pred_probs[:10]

array([[0.7710103 ],
       [0.01703349],
       [0.99977785],
       [0.00562745],
       [0.48096937],
       [0.99865043],
       [0.923563  ],
       [1.        ],
       [0.94818616],
       [0.9999392 ]], dtype=float32)

In [25]:
# Convert the predicted probabilities to label format (0 or 1)
model_5_pred = tf.squeeze(tf.round(model_5_pred_probs))
model_5_pred[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([1., 0., 1., 0., 0., 1., 1., 1., 1., 1.], dtype=float32)>
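Rounding a sigmoid output amounts to applying a 0.5 threshold. A small pure-Python sketch using the ten probabilities printed above; `to_labels` is a hypothetical helper. Note that `tf.round` rounds halves to the nearest even number, so a probability of exactly 0.5 becomes 0, which the strict `>` below mirrors.

```python
pred_probs = [0.7710103, 0.01703349, 0.99977785, 0.00562745, 0.48096937,
              0.99865043, 0.923563, 1.0, 0.94818616, 0.9999392]

def to_labels(probs, threshold=0.5):
    """Map each probability to class 1 if it exceeds the threshold, else class 0."""
    return [1 if p > threshold else 0 for p in probs]

print(to_labels(pred_probs))  # [1, 0, 1, 0, 0, 1, 1, 1, 1, 1]
```

Writing the threshold explicitly also makes it easy to try a different cut-off than 0.5.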

In [26]:
from Evaluation import caluculate_results

caluculate_results(y_true=val_lables,
                  y_pre=model_5_pred)

{'accuracy': 74.1469816272966,
 'prediction': 0.7413901742830225,
 'recall': 0.7414698162729659,
 'f1': 0.7414287125560604}
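The `Evaluation` module is not shown here, so the following is only a sketch of how such metrics can be computed for binary labels, not the author's helper. `binary_metrics` and the toy labels are hypothetical. It treats class 1 as the positive class, so its numbers can differ from a helper that averages over both classes, and it returns accuracy as a fraction rather than a percentage. The `'prediction'` key in the output above presumably holds the precision.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 with class 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)   # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)   # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)   # false negatives
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels for illustration only (not the validation set)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
results = binary_metrics(y_true, y_pred)
print(results)
```

Precision asks how many predicted disasters were real, and recall asks how many real disasters were found; F1 is their harmonic mean.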