# Learn the basics of NLP with TensorFlow

I'm going to follow this GitHub tutorial:

https://github.com/mrdbourke/tensorflow-deep-learning/blob/main/08_introduction_to_nlp_in_tensorflow.ipynb

Get the dataset from Kaggle.

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf

In [2]:
train_data = pd.read_csv('./dataaset/train.csv')

In [3]:
train_data.head()

|   | id | keyword | location | text | target |
|---|----|---------|----------|------|--------|
| 0 | 1 |  |  | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 |  |  | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 |  |  | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 |  |  | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 |  |  | Just got sent this photo from Ruby #Alaska as ... | 1 |


Split the data into training and validation sets (10% is held out for validation).

In [4]:
from sklearn.model_selection import train_test_split

train_sentences, val_sentences, train_lables, val_lables = train_test_split(
    train_data["text"].to_numpy(),
    train_data["target"].to_numpy(),
    test_size=0.1
    )

In [5]:
train_sentences

array(["'Congress' should be renamed Italian Goonda Party. They are a motley crowd of hooligans and selfavowed crooks determined to derail democracy",
       'Landslide kills three near Venice after heavyåÊrain http://t.co/q3Xq8R658r',
       'Not one character in the final destination series has ever survived ??',
       ...,
       'The Witches of the Glass Castle. Supernatural YA where sibling rivalry magic and love collide #wogc #kindle http://t.co/IzakNpJeQW',
       'Horrible Accident Man Died In Wings of Airplane (29-07-2015) http://t.co/TfcdRONRA6',
       '@AdamTuss and is the car that derailed a 5000 series by chance. They used to have issues w/ wheel climbing RE: 1/2007 Mt. Vern Sq derailment'],
      dtype=object)
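
The split above is not seeded, so the sentences that land in each set change from run to run. `train_test_split` accepts a `random_state` argument to make the split repeatable. The idea behind a seeded hold-out split can be sketched in plain Python as below. The function name `split_train_val` and the toy data are made up for illustration, and this is not how scikit-learn implements it internally.

```python
import math
import random

def split_train_val(items, labels, val_fraction=0.1, seed=42):
    """Shuffle paired items/labels and hold out a fraction for validation."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)          # fixed seed -> same split every run
    n_val = math.ceil(val_fraction * len(items))  # size of the held-out set, rounded up
    val_idx, train_idx = indices[:n_val], indices[n_val:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(items, train_idx), pick(items, val_idx),
            pick(labels, train_idx), pick(labels, val_idx))

# Hypothetical toy data standing in for the tweets and their targets
texts = [f"tweet {i}" for i in range(20)]
targets = [i % 2 for i in range(20)]
train_x, val_x, train_y, val_y = split_train_val(texts, targets)
print(len(train_x), len(val_x))
```

Because the same shuffled indices are used for both the sentences and the labels, each sentence keeps its own label after the split.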

# Converting text into numbers

Create a text vectorization layer that maps each word to an integer.

In [6]:
from tensorflow.keras.layers import TextVectorization

In [7]:
text2vec = TextVectorization(
    max_tokens=10000, standardize='lower_and_strip_punctuation',
    split='whitespace', ngrams=None, output_mode='int',
    output_sequence_length=15, pad_to_max_tokens=False, vocabulary=None,
    idf_weights=None, sparse=False, ragged=False
)

In [8]:
text2vec.adapt(train_sentences)

See how the words in a sample sentence are mapped to integers.

In [9]:
sample_sentence = "There is a flood in my street!"
text2vec([sample_sentence])

<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[ 74,   9,   3, 224,   4,  13, 789,   0,   0,   0,   0,   0,   0,
          0,   0]], dtype=int64)>
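
To make the output above less magical, here is a rough plain-Python sketch of what the layer does with these settings: lowercase, strip punctuation, split on whitespace, look up each token, then pad with 0 up to 15 positions. The `vectorize` function and the small `vocab` dict are hypothetical. The ids are copied from the output above, and `string.punctuation` is only an approximation of the characters the real layer strips.

```python
import string

def vectorize(sentence, vocab, sequence_length=15):
    """Lowercase, strip punctuation, split on whitespace, map to ids, pad/truncate."""
    cleaned = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    ids = [vocab.get(token, 1) for token in cleaned.split()]  # 1 = out-of-vocabulary ([UNK])
    ids = ids[:sequence_length]                                # truncate long sentences
    return ids + [0] * (sequence_length - len(ids))            # 0 = padding

# Token ids copied from the output above; the dict itself is a hand-built stand-in
vocab = {"there": 74, "is": 9, "a": 3, "flood": 224, "in": 4, "my": 13, "street": 789}
print(vectorize("There is a flood in my street!", vocab))
```

The seven words become seven ids and the remaining eight positions are filled with 0, which matches the shape `(1, 15)` of the tensor above.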

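Get the first five words in the vocabulary. Index 0 is the padding token, index 1 is the out-of-vocabulary token, and the rest are sorted by frequency.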

In [10]:
text2vec.get_vocabulary()[:5]

['', '[UNK]', 'the', 'a', 'in']

Get the words at indices 100 to 104.

In [11]:
text2vec.get_vocabulary()[100:105]

['man', 'fires', 'world', 'rt', 'love']
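
The ordering of the vocabulary can be illustrated with a small sketch: count the tokens, keep the most frequent ones, and reserve the first two slots for padding and unknown words. `build_vocabulary` and the toy `corpus` are assumptions for illustration, and the real `adapt` step also applies the standardization configured on the layer.

```python
from collections import Counter

def build_vocabulary(sentences, max_tokens=10000):
    """Return a frequency-sorted vocabulary with padding and [UNK] in slots 0 and 1."""
    counts = Counter(token for s in sentences for token in s.lower().split())
    # Two slots are reserved, so at most max_tokens - 2 real words are kept
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

# Hypothetical mini corpus
corpus = ["the fire in the forest", "a fire in the city", "the flood"]
vocab = build_vocabulary(corpus)
print(vocab[:5])
```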

# Creating Embedding layer

We are going to use TensorFlow's Embedding layer.

https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding

In [12]:
from tensorflow.keras import layers

embedding = layers.Embedding(input_dim=10000,  # size of the vocabulary (max_tokens)
                             output_dim=128,   # size of each embedding vector
                             input_length=15   # length of each input sequence (output_sequence_length)
                            )

embedding

<keras.layers.embeddings.Embedding at 0x1d6d86758e0>

Get a random sentence from the training set

In [13]:
import random
random_sentence = random.choice(train_sentences)

print(f"Original text:\n {random_sentence}\
        \n\nEmbedded version:")

# Embed the random sentence (turn it into dense vectors of fixed size)
sample_embed = embedding(text2vec([random_sentence]))
sample_embed


Original text:
 ! Residents Return To Destroyed Homes As Washington Wildfire Burns on http://t.co/UcI8stQUg1        

Embedded version:


<tf.Tensor: shape=(1, 15, 128), dtype=float32, numpy=
array([[[ 0.0337841 , -0.04265943, -0.01576056, ..., -0.04682023,
         -0.01124186,  0.04004878],
        [ 0.01670153,  0.00922741,  0.00738541, ..., -0.02125056,
         -0.04204486, -0.0330943 ],
        [-0.04058049,  0.0493963 , -0.00905965, ..., -0.01078407,
         -0.03960771, -0.00675224],
        ...,
        [-0.04179449,  0.03868698, -0.01063225, ...,  0.00155853,
          0.01839009,  0.035278  ],
        [-0.04179449,  0.03868698, -0.01063225, ...,  0.00155853,
          0.01839009,  0.035278  ],
        [-0.04179449,  0.03868698, -0.01063225, ...,  0.00155853,
          0.01839009,  0.035278  ]]], dtype=float32)>
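
Conceptually, an embedding layer is a lookup table with one row per token id. The NumPy sketch below shows why the output has shape `(1, 15, 128)` and why the last rows above are identical: every padded position has id 0 and so picks the same row. The table here is random, and the value range is only an assumption chosen to resemble the untrained values shown above.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 10000, 128

# Randomly initialised lookup table: one row of 128 numbers per token id
embedding_table = rng.uniform(-0.05, 0.05, size=(vocab_size, embed_dim)).astype(np.float32)

# One vectorized sentence, padded with 0 to length 15
token_ids = np.array([[74, 9, 3, 224, 4, 13, 789, 0, 0, 0, 0, 0, 0, 0, 0]])

embedded = embedding_table[token_ids]  # integer indexing picks one row per id
print(embedded.shape)
```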

In [14]:
sample_embed[0][0], sample_embed[0][0].shape, random_sentence

(<tf.Tensor: shape=(128,), dtype=float32, numpy=
 array([ 0.0337841 , -0.04265943, -0.01576056, -0.04015245,  0.04022225,
        -0.01401131,  0.03750006, -0.03184675,  0.02951287, -0.01774532,
         0.03969281, -0.01669383, -0.03744419,  0.00642474,  0.02990139,
        -0.03153314,  0.04652533, -0.00644759, -0.01268534, -0.0338016 ,
         0.03158467,  0.01509025, -0.01870737, -0.04347018,  0.01826001,
        -0.00459472, -0.00907284,  0.04063076,  0.02000294, -0.00943081,
        -0.01354191, -0.00212723, -0.02696894, -0.01937709,  0.00748347,
         0.01899696, -0.02962916, -0.00429243,  0.0160205 ,  0.04642891,
        -0.01777109,  0.0095791 , -0.0049394 ,  0.03092838,  0.02097987,
         0.03320912, -0.01151361, -0.03555859,  0.00896908,  0.02704009,
         0.03326956, -0.03886857,  0.03857337, -0.00783563, -0.02429022,
        -0.01858344, -0.01822566,  0.02388005,  0.00076503,  0.00428661,
         0.02374265,  0.03831632, -0.03448   ,  0.02039299,  0.04884393,
  

# Modelling a text dataset by running a series of experiments

These are the models we can try on the text data:

0, Naive Bayes with TF-IDF encoder (baseline)

1, Feed-forward neural network (dense model)

2, LSTM (RNN)

3, GRU (RNN)

4, Bidirectional-LSTM (RNN)

5, 1D Convolutional Neural Network

6, TensorFlow Hub Pretrained Feature Extractor

7, TensorFlow Hub Pretrained Feature Extractor (10% of data)

How are we going to approach all of these?

Use the standard steps for modelling with TensorFlow:

* Create a model
* Build a model
* Fit a model
* Evaluate our model

## Model 1: Simple dense layer

Create a simple model that makes predictions with a single dense layer.

![Simple dense layer](img/SimpleDenceLayer.png)

In [15]:
# Create a TensorBoard callback (we need a new one for each model)
from helper_function import create_tensorboard_callback

# Create a directory to save TensorBoard logs
SAVE_DIR = "model_logs"

In [20]:
# Build the model with the Functional API
from tensorflow.keras import layers
inputs = layers.Input(shape=(1,), dtype=tf.string) # inputs are 1-dimensional strings
x = text2vec(inputs) # turn the input text into numbers
x = embedding(x) # create an embedding of the numerized inputs
x = layers.GlobalAveragePooling1D()(x) # average over the 15 token positions to get one 128-dimensional vector per sentence (try running the model without this layer and see what happens)
outputs = layers.Dense(1, activation="sigmoid")(x) # create the output layer; we want binary outputs, so use the sigmoid activation
model_1 = tf.keras.Model(inputs, outputs, name="model_1_dence") 
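
What happens after the embedding can be sketched with NumPy: average the 15 token vectors into one vector per sentence, then apply a single dense unit with a sigmoid. All arrays below are random stand-ins (the real weights are learned during training), so only the shapes and the value range are meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)
embedded = rng.normal(size=(1, 15, 128))        # stand-in for the embedding output
pooled = embedded.mean(axis=1)                  # average over the 15 positions -> (1, 128)

weights = rng.normal(scale=0.1, size=(128, 1))  # hypothetical Dense(1) kernel
bias = np.zeros(1)                              # hypothetical Dense(1) bias
logits = pooled @ weights + bias                # shape (1, 1)
probs = 1.0 / (1.0 + np.exp(-logits))           # sigmoid squashes the value into (0, 1)
print(pooled.shape, probs.shape)
```

Without the pooling step the dense layer would be applied to each of the 15 positions separately, giving 15 outputs per sentence instead of one.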

In [21]:
model_1.summary()

Model: "model_1_dence"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization (TextVec  (None, 15)               0         
 torization)                                                     
                                                                 
 embedding (Embedding)       (None, 15, 128)           1280000   
                                                                 
 global_average_pooling1d (G  (None, 128)              0         
 lobalAveragePooling1D)                                          
                                                                 
 dense_1 (Dense)             (None, 1)                 129       
                                                                 
Total params: 1,280,129
Trainable params: 1,280,129
Non-trainable params: 0
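
The parameter counts in the summary can be checked by hand. The embedding stores one 128-number vector per vocabulary entry, and the dense layer has 128 weights plus one bias:

```python
vocab_size, embed_dim = 10000, 128

embedding_params = vocab_size * embed_dim  # one 128-vector per token
dense_params = embed_dim * 1 + 1           # 128 weights plus 1 bias
total_params = embedding_params + dense_params
print(embedding_params, dense_params, total_params)
```

The text vectorization and pooling layers have no trainable weights, which is why they show 0 in the summary.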

### Compile the model

In [22]:
model_1.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])
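
For intuition about the `binary_crossentropy` loss, here is a small plain-Python sketch of the formula, the mean of `-[y*log(p) + (1-y)*log(1-p)]`. The function and the example numbers are illustrative only, and the clipping constant `eps` is an assumption rather than the exact value Keras uses.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Same labels, once with confident correct guesses and once with confident wrong ones
confident_right = binary_crossentropy([1, 0], [0.99, 0.01])
confident_wrong = binary_crossentropy([1, 0], [0.01, 0.99])
print(confident_right, confident_wrong)
```

Confident wrong predictions are punished much harder than confident correct ones are rewarded, which is what pushes the sigmoid output towards the right label.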

In [23]:
# Fit the model

model_1_history = model_1.fit(x=train_sentences,
                             y=train_lables,
                             epochs=5,
                             validation_data=(val_sentences, val_lables),
                             callbacks=[create_tensorboard_callback(dir_name=SAVE_DIR,
                                                                   experiment_name="model_1_dense")])

Saving TensorBoard log files to: model_logs/model_1_dense/20211228-143317
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [25]:
# Check the results on the validation set (returns [loss, accuracy])

model_1.evaluate(val_sentences, val_lables)



[0.47177064418792725, 0.7887139320373535]

In [33]:
# Make some predictions and evaluate those

model_1_pred_probs = model_1.predict(val_sentences)
print(model_1_pred_probs.shape)
print(model_1_pred_probs[:10])

(762, 1)
[[0.09253025]
 [0.9972645 ]
 [0.02822798]
 [0.9914555 ]
 [0.036149  ]
 [0.9618006 ]
 [0.70454437]
 [0.992393  ]
 [0.37690672]
 [0.03268823]]


In [34]:
# Convert model prediction to label format

model_1_preds = tf.squeeze(tf.round(model_1_pred_probs))
model_1_preds[:20]

<tf.Tensor: shape=(20,), dtype=float32, numpy=
array([0., 1., 0., 1., 0., 1., 1., 1., 0., 0., 1., 1., 0., 1., 1., 0., 1.,
       1., 1., 0.], dtype=float32)>
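
Rounding a probability is the same as applying a 0.5 threshold. A plain-Python sketch using the first ten probabilities printed earlier:

```python
# First ten prediction probabilities copied from the output above
pred_probs = [0.09253025, 0.9972645, 0.02822798, 0.9914555, 0.036149,
              0.9618006, 0.70454437, 0.992393, 0.37690672, 0.03268823]

# Anything above 0.5 becomes class 1 (disaster), everything else class 0
pred_labels = [1.0 if p > 0.5 else 0.0 for p in pred_probs]
print(pred_labels)
```

These match the first ten values of `model_1_preds` above.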

In [40]:
# Calculate our model_1 results 
from Evaluation import caluculate_results
model_1_results = caluculate_results(y_true=val_lables, 
                                    y_pre=model_1_preds)
model_1_results

{'accuracy': 78.87139107611549,
 'prediction': 0.7929120493899015,
 'recall': 0.7887139107611548,
 'f1': 0.7868932145420301}
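
The `caluculate_results` helper lives in my own `Evaluation` module and is not shown here. As a rough idea of what such a helper might compute, the sketch below returns accuracy in percent plus support-weighted precision, recall and F1. This is an assumption about the helper, not its actual code; `weighted_metrics` and the toy labels are made up, and I am guessing that the `'prediction'` key above holds the precision.

```python
def weighted_metrics(y_true, y_pred):
    """Accuracy (in percent) plus support-weighted precision, recall and F1."""
    n = len(y_true)
    accuracy = 100.0 * sum(t == p for t, p in zip(y_true, y_pred)) / n
    precision = recall = f1 = 0.0
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        cls_precision = tp / (tp + fp) if tp + fp else 0.0
        cls_recall = tp / (tp + fn) if tp + fn else 0.0
        denom = cls_precision + cls_recall
        cls_f1 = 2 * cls_precision * cls_recall / denom if denom else 0.0
        weight = (tp + fn) / n  # share of true examples belonging to this class
        precision += weight * cls_precision
        recall += weight * cls_recall
        f1 += weight * cls_f1
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical toy labels
results = weighted_metrics(y_true=[1, 0, 1, 1, 0, 0], y_pred=[1, 0, 0, 1, 0, 1])
print(results)
```

With support weighting, recall always equals the accuracy as a fraction. The output above is consistent with that: the recall is 0.7887 and the accuracy is 78.87%.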

## Visualize embeddings

In [42]:
# Get the vocabulary from the text vectorization layer
words_in_vocab = text2vec.get_vocabulary()
len(words_in_vocab), words_in_vocab[:10]

(10000, ['', '[UNK]', 'the', 'a', 'in', 'to', 'of', 'and', 'i', 'is'])

In [46]:
# Get the weight matrix of embedding layer
embed_weights = model_1.get_layer("embedding").get_weights()[0]
embed_weights, embed_weights.shape

(array([[-0.05605679,  0.05359774,  0.00250347, ...,  0.01564652,
          0.03309035,  0.02445746],
        [-0.01606945,  0.03757437,  0.04642731, ...,  0.02036933,
         -0.02405428, -0.04277565],
        [-0.0141153 ,  0.03720097,  0.05430916, ...,  0.06198725,
         -0.01132409,  0.01451792],
        ...,
        [-0.08347785,  0.06697842,  0.07331058, ...,  0.03520257,
          0.01218209, -0.01386334],
        [ 0.03933527, -0.01829431, -0.08557785, ..., -0.11267728,
         -0.08746089,  0.05525713],
        [ 0.02558482, -0.09147273, -0.07978733, ..., -0.04099841,
         -0.05591433,  0.09141265]], dtype=float32),
 (10000, 128))

The shape `(10000, 128)` means that each of the 10,000 tokens in the vocabulary is represented by a vector of 128 numbers.

The TensorFlow tutorial below explains how to retrieve the trained word embeddings and save them to disk:

https://www.tensorflow.org/tutorials/text/word_embeddings#retrieve_the_trained_word_embeddings_and_save_them_to_disk

In [51]:
# Code below is adapted from the TensorFlow word embeddings tutorial linked above
import io

# Create output writers (forward slashes avoid accidental backslash escapes and also work on Windows)
out_v = io.open("model_1/cembedding_vectors.tsv", "w", encoding="utf-8")
out_m = io.open("model_1/embedding_metadata.tsv", "w", encoding="utf-8")

# Write embedding vectors and words to file
for num, word in enumerate(words_in_vocab):
  if num == 0:
    continue # skip padding token
  vec = embed_weights[num]
  out_m.write(word + "\n") # write words to file
  out_v.write("\t".join([str(x) for x in vec]) + "\n") # write corresponding word vector to file
out_v.close()
out_m.close()
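
A quick sanity check after exporting is to read both files back and confirm they have the same number of rows, since each vector row needs a matching word row. The sketch below does this on tiny stand-in data in a temporary directory. The file names, `words` and `weights` are hypothetical, and it uses `with` blocks so the files are closed even if something fails mid-loop.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-ins for words_in_vocab and embed_weights
words = ["", "[UNK]", "the", "fire"]
weights = [[0.0, 0.0], [0.1, -0.2], [0.3, 0.4], [-0.5, 0.6]]

out_dir = Path(tempfile.mkdtemp())
vectors_path = out_dir / "embedding_vectors.tsv"
metadata_path = out_dir / "embedding_metadata.tsv"

# "with" closes both files even if an error occurs mid-loop
with open(vectors_path, "w", encoding="utf-8") as out_v, \
     open(metadata_path, "w", encoding="utf-8") as out_m:
    for index, word in enumerate(words):
        if index == 0:
            continue  # skip the padding token
        out_m.write(word + "\n")
        out_v.write("\t".join(str(x) for x in weights[index]) + "\n")

# Read the files back and compare row counts
vector_rows = vectors_path.read_text(encoding="utf-8").splitlines()
metadata_rows = metadata_path.read_text(encoding="utf-8").splitlines()
print(len(vector_rows), len(metadata_rows))
```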
