# Learn the basics of NLP with TensorFlow

I'm going to follow this GitHub tutorial.

https://github.com/mrdbourke/tensorflow-deep-learning/blob/main/08_introduction_to_nlp_in_tensorflow.ipynb

Get the dataset from Kaggle.

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf

In [2]:
train_data = pd.read_csv('./dataaset/train.csv')

In [3]:
train_data.head()

| Unnamed: 0 | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | | | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | | | Just got sent this photo from Ruby #Alaska as ... | 1 |


Split the data into training and validation sets.

In [4]:
from sklearn.model_selection import train_test_split

train_sentences, val_sentences, train_lables, val_lables = train_test_split(
    train_data["text"].to_numpy(),
    train_data["target"].to_numpy(),
    test_size=0.1
    )
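
`train_test_split` shuffles the rows before splitting, so the validation set is different on every run unless a seed is fixed (in scikit-learn, through the `random_state` argument). Here is a rough plain-Python sketch of what a shuffled 90/10 split does. The function name `shuffled_split` and the toy data are made up for illustration; they are not part of the notebook.

```python
import math
import random

def shuffled_split(items, labels, test_size=0.1, seed=42):
    """Shuffle paired items/labels and hold out a fraction for validation."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)          # fixed seed -> same split every run
    n_val = math.ceil(len(items) * test_size)     # size of the validation set
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    train_items = [items[i] for i in train_idx]
    val_items = [items[i] for i in val_idx]
    train_labels = [labels[i] for i in train_idx]
    val_labels = [labels[i] for i in val_idx]
    return train_items, val_items, train_labels, val_labels

# Toy data (hypothetical): 20 "tweets" with alternating labels
texts = [f"tweet {i}" for i in range(20)]
labels = [i % 2 for i in range(20)]

train_x, val_x, train_y, val_y = shuffled_split(texts, labels)
print(len(train_x), len(val_x))
```

Each sentence keeps its own label because both lists are indexed with the same shuffled indices.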

In [5]:
train_sentences

array(['U.S National Park Services Tonto National Forest: Stop the Annihilation of the Salt River Wild Horse... https://t.co/m8MvDSPJp7 via @Change',
       ':StarMade: :Stardate 3: :Planetary Annihilation:: http://t.co/I2hHvIUmTm via @YouTube',
       'my vibrator shaped vape done busted', ...,
       'RT @WIRED: Reddit will now quarantine offensive content http://t.co/zlAGv1U5ZA',
       "Oops: Bounty hunters try to raid Phoenix police chief's home: A group of armed bounty hunters surrounded the h... http://t.co/dGELJ8rYt9",
       'Remembering Pittsburgh Eyewitness History of Steel City by Len Barcousky PB Penn http://t.co/dhGAVw8bSW http://t.co/0lMhEAEX9k'],
      dtype=object)

# Converting text into numbers

Create a text vectorization layer that maps each word to an integer index.

In [6]:
from tensorflow.keras.layers import TextVectorization

In [7]:
text2vec = TextVectorization(
    max_tokens=10000, standardize='lower_and_strip_punctuation',
    split='whitespace', ngrams=None, output_mode='int',
    output_sequence_length=15, pad_to_max_tokens=False, vocabulary=None,
    idf_weights=None, sparse=False, ragged=False
)

In [8]:
text2vec.adapt(train_sentences)

See how the words in a sample sentence are turned into numbers.

In [9]:
sample_sentence = "There is a flood in my street!"
text2vec([sample_sentence])

<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[ 72,   9,   3, 230,   4,  13, 770,   0,   0,   0,   0,   0,   0,
          0,   0]], dtype=int64)>
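
The output above has 7 non-zero indices (one per word) followed by zeros that pad the sequence up to `output_sequence_length=15`. As a simplified sketch of what happens inside the layer, assuming a tiny made-up vocabulary (`toy_vocab`) and hypothetical helper names, the steps are: standardize, split on whitespace, look up each token, then pad or truncate.

```python
import string

def standardize(text):
    """Lowercase and strip punctuation, similar to 'lower_and_strip_punctuation'."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def encode(text, vocab, sequence_length=15):
    """Map whitespace-separated tokens to vocabulary indices, then pad/truncate."""
    index = {word: i for i, word in enumerate(vocab)}
    ids = [index.get(token, 1) for token in standardize(text).split()]  # 1 = [UNK]
    ids = ids[:sequence_length]                                         # truncate long inputs
    return ids + [0] * (sequence_length - len(ids))                     # 0 = padding

# Hypothetical vocabulary; the real one is learned by text2vec.adapt()
toy_vocab = ["", "[UNK]", "the", "a", "in", "is", "flood"]

encoded = encode("There is a flood in my street!", toy_vocab)
print(encoded)
```

Words that are missing from the vocabulary ("there", "my", "street" in this toy case) all map to index 1, the `[UNK]` token.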

Get the first five words in the vocabulary.

In [10]:
text2vec.get_vocabulary()[:5]

['', '[UNK]', 'the', 'a', 'in']

Get the words at indices 100 to 104.

In [11]:
text2vec.get_vocabulary()[100:105]

['buildings', 'see', 'had', 'world', 'bomb']
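
The vocabulary starts with the padding token `''` and the out-of-vocabulary token `'[UNK]'`, followed by words ordered from most to least frequent. A minimal sketch of how such a vocabulary could be built is below. The function name, the toy sentences, and the alphabetical tie-breaking are assumptions of this sketch, not a description of the layer's internals.

```python
from collections import Counter

def build_vocabulary(sentences, max_tokens=10000):
    """Return ['', '[UNK]'] followed by words sorted by descending frequency."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    # Most frequent first; ties broken alphabetically (assumption of this sketch)
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    # The two reserved tokens count towards max_tokens
    return ["", "[UNK]"] + ranked[: max_tokens - 2]

# Toy corpus (hypothetical)
toy_sentences = ["the fire in the forest", "a fire in the city", "the flood"]
vocab = build_vocabulary(toy_sentences)
print(vocab[:5])
```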

# Creating Embedding layer

We are going to use TensorFlow's embedding layer.

https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding

In [12]:
from tensorflow.keras import layers

embedding = layers.Embedding(input_dim = 10000, # size of the vocabulary (matches max_tokens)
                             output_dim = 128, # size of each embedding vector
                             input_length = 15 # length of each input sequence (matches output_sequence_length)
                            )

embedding

<keras.layers.embeddings.Embedding at 0x24c4c09cdf0>

Get a random sentence from the training set

In [13]:
import random
random_sentence = random.choice(train_sentences)

print(f"Original text:\n {random_sentence}\
        \n\nEmbedded version:")

# Embed the random sentence (turn it into dense vectors of fixed size)
sample_embed = embedding(text2vec([random_sentence]))
sample_embed


Original text:
 Experts in France begin examining airplane debris found on Reunion Island: French air accident experts o... http://t.co/YVVPznZmXg #news        

Embedded version:


<tf.Tensor: shape=(1, 15, 128), dtype=float32, numpy=
array([[[-0.00442255, -0.01719322, -0.01430987, ..., -0.00445938,
          0.03908605, -0.03841059],
        [ 0.02364143, -0.04795272,  0.01264676, ...,  0.01682056,
          0.03746491, -0.0089851 ],
        [-0.01584182, -0.01135397, -0.04369679, ..., -0.01374605,
          0.01498913,  0.01496622],
        ...,
        [-0.01032605, -0.04815351, -0.02526944, ...,  0.0045061 ,
         -0.02644936,  0.02694004],
        [ 0.03044695,  0.03209635,  0.04036835, ...,  0.00870193,
         -0.02781923,  0.00543054],
        [-0.00442255, -0.01719322, -0.01430987, ..., -0.00445938,
          0.03908605, -0.03841059]]], dtype=float32)>
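
Note that the first and last rows printed above are identical, which is consistent with the 1st and 15th tokens of the sentence both being "experts": an embedding layer is essentially a lookup table with one vector per token index. The plain-Python sketch below illustrates the idea. The helper names are made up, and the uniform range of ±0.05 is an assumption chosen to mimic the small random initial values seen above.

```python
import random

def make_embedding_table(vocab_size, dim, seed=0):
    """Create one small random vector per token index."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(vocab_size)]

def embed(token_ids, table):
    """Look up the vector for each token index."""
    return [table[i] for i in token_ids]

table = make_embedding_table(vocab_size=50, dim=8)
token_ids = [7, 3, 12, 7]          # the same token appears first and last
vectors = embed(token_ids, table)

print(len(vectors), len(vectors[0]))   # sequence length, embedding size
print(vectors[0] == vectors[3])        # same token -> same vector
```

During training, the values in the table are updated like any other weights, so tokens that play similar roles can end up with similar vectors.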

In [14]:
sample_embed[0][0], sample_embed[0][0].shape, random_sentence

(<tf.Tensor: shape=(128,), dtype=float32, numpy=
 array([-0.00442255, -0.01719322, -0.01430987,  0.00200362,  0.03842178,
        -0.03178269, -0.04890714, -0.04801202, -0.02231261,  0.01776468,
        -0.01583605,  0.0418413 ,  0.01031483,  0.01288477, -0.03431915,
        -0.01988486, -0.00027715, -0.04026494,  0.02971555, -0.00029796,
         0.00697303,  0.00502187,  0.03120411,  0.0180714 , -0.04032099,
        -0.01823191, -0.03402852,  0.01127168, -0.03959861,  0.01169776,
        -0.04184027, -0.04222951,  0.0238943 ,  0.03858123, -0.03544161,
         0.00300155, -0.0342504 ,  0.0216804 ,  0.028552  ,  0.01701982,
        -0.04821209,  0.02308965,  0.0434227 , -0.03937116,  0.045908  ,
         0.01280899,  0.0460065 ,  0.03295021,  0.01391354, -0.01722723,
         0.02689068,  0.013389  , -0.04365537,  0.02658138,  0.04557014,
        -0.01708128, -0.03273678, -0.02143202, -0.04409257,  0.00171136,
        -0.02703009,  0.00799965, -0.02625896, -0.01353822, -0.02869183,
  

# Modelling a text dataset by running a series of experiments

We will try several models for learning from text:

0, Naive Bayes with TF-IDF encoder (baseline)

1, Feed-forward neural network (dense model)

2, LSTM (RNN)

3, GRU (RNN)

4, Bidirectional-LSTM (RNN)

5, 1D Convolutional Neural Network

6, TensorFlow Hub Pretrained Feature Extractor

7, TensorFlow Hub Pretrained Feature Extractor (10% of data)
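
For the baseline (model 0), TF-IDF scores a word higher when it is frequent in one text but rare across the whole dataset. Below is a minimal sketch of one textbook variant, `tf * log(N / df)`. The function name and toy documents are made up, and libraries such as scikit-learn use smoothed and normalized variants, so their numbers will differ.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute tf * log(N / df) for every word in every document."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores

# Toy documents (hypothetical)
docs = ["fire in the forest", "flood in the street", "the new phone"]
scores = tf_idf(docs)
print(scores[0])
```

Here "the" appears in every document, so its score is zero, while "fire" appears in only one and gets the highest score in the first document.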

How are we going to approach all of these?

Use the standard steps for modelling with TensorFlow:

* Create a model
* Build a model
* Fit a model
* Evaluate our model

## Model 1: Simple dense layer

Create a simple model that predicts with a single dense layer.

![Simple Dense Layer](img/SimpleDenceLayer.png)

In [15]:
# Create a TensorBoard callback (need to create a new one for each model)
from helper_function import create_tensorboard_callback

# Create a directory to save TensorBoard logs
SAVE_DIR = "model_logs"

In [26]:
# Build the model with the Functional API
from tensorflow.keras import layers
inputs = layers.Input(shape=(1,), dtype=tf.string) # inputs are 1-dimensional strings
x = text2vec(inputs) # turn the input text into numbers
x = embedding(x) # create an embedding of the numericalized inputs
outputs = layers.Dense(1, activation="sigmoid")(x) # create the output layer; we want binary outputs, so use the sigmoid activation
model_1 = tf.keras.Model(inputs, outputs, name="model_1_dence") 

In [27]:
model_1.summary()

Model: "model_1_dence"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization (TextVec  (None, 15)               0         
 torization)                                                     
                                                                 
 embedding (Embedding)       (None, 15, 128)           1280000   
                                                                 
 dense_1 (Dense)             (None, 15, 1)             129       
                                                                 
Total params: 1,280,129
Trainable params: 1,280,129
Non-trainable params: 0
_________________________________________________________________
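
The parameter counts in the summary can be checked by hand: the embedding stores one 128-dimensional vector for each of the 10,000 vocabulary entries, and the dense layer has 128 weights plus 1 bias. The variable names below are only for this check.

```python
vocab_size = 10000      # input_dim of the Embedding layer
embedding_dim = 128     # output_dim of the Embedding layer

embedding_params = vocab_size * embedding_dim   # one vector per token
dense_params = embedding_dim * 1 + 1            # 128 weights + 1 bias
total_params = embedding_params + dense_params

print(embedding_params, dense_params, total_params)  # 1280000 129 1280129
```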


### Compile the model

In [29]:
model_1.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

In [30]:
# Fit the model

model_1_history = model_1.fit(x=train_sentences,
                             y=train_lables,
                             epochs=5,
                             validation_data=(val_sentences, val_lables),
                             callbacks=[create_tensorboard_callback(dir_name=SAVE_DIR,
                                                                   experiment_name="model_1_dense")])

Saving TensorBoard log files to: model_logs/model_1_dense/20211225-100620
Epoch 1/5


ValueError: in user code:

    File "C:\Users\USER\AppData\Roaming\Python\Python38\site-packages\keras\engine\training.py", line 878, in train_function  *
        return step_function(self, iterator)
    File "C:\Users\USER\AppData\Roaming\Python\Python38\site-packages\keras\engine\training.py", line 867, in step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "C:\Users\USER\AppData\Roaming\Python\Python38\site-packages\keras\engine\training.py", line 860, in run_step  **
        outputs = model.train_step(data)
    File "C:\Users\USER\AppData\Roaming\Python\Python38\site-packages\keras\engine\training.py", line 809, in train_step
        loss = self.compiled_loss(
    File "C:\Users\USER\AppData\Roaming\Python\Python38\site-packages\keras\engine\compile_utils.py", line 201, in __call__
        loss_value = loss_obj(y_t, y_p, sample_weight=sw)
    File "C:\Users\USER\AppData\Roaming\Python\Python38\site-packages\keras\losses.py", line 141, in __call__
        losses = call_fn(y_true, y_pred)
    File "C:\Users\USER\AppData\Roaming\Python\Python38\site-packages\keras\losses.py", line 245, in call  **
        return ag_fn(y_true, y_pred, **self._fn_kwargs)
    File "C:\Users\USER\AppData\Roaming\Python\Python38\site-packages\keras\losses.py", line 1807, in binary_crossentropy
        backend.binary_crossentropy(y_true, y_pred, from_logits=from_logits),
    File "C:\Users\USER\AppData\Roaming\Python\Python38\site-packages\keras\backend.py", line 5158, in binary_crossentropy
        return tf.nn.sigmoid_cross_entropy_with_logits(labels=target, logits=output)

    ValueError: `logits` and `labels` must have the same shape, received ((None, 15, 1) vs (None,)).
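
The error says the model outputs one prediction per token, shape `(None, 15, 1)`, while each label is a single number, shape `(None,)`. The `Dense` layer is applied to each of the 15 token vectors separately, so the sequence axis has to be collapsed before the output layer. One common way is to average the token vectors; in Keras that would be a `layers.GlobalAveragePooling1D()` layer between the embedding and the `Dense` layer. The plain-Python sketch below only illustrates the shape change, and the helper name is made up.

```python
def global_average_pool(sequence):
    """Average a (seq_len, dim) list of vectors over the sequence axis."""
    seq_len = len(sequence)
    return [sum(column) / seq_len for column in zip(*sequence)]

# One "sentence": 15 token vectors with 128 dimensions each (dummy values)
embedded = [[float(t)] * 128 for t in range(15)]
pooled = global_average_pool(embedded)

print(len(embedded), len(embedded[0]))  # 15 128
print(len(pooled))                      # 128
```

With one 128-dimensional vector per sentence, the dense layer gives one prediction per sentence, which lines up with one label per sentence.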
