# Learn NLP basics with TensorFlow

I'm going to follow this GitHub tutorial:

https://github.com/mrdbourke/tensorflow-deep-learning/blob/main/08_introduction_to_nlp_in_tensorflow.ipynb

Get the dataset from Kaggle.

In [1]:
import pandas as pd
import numpy as np

In [2]:
train_data = pd.read_csv('./dataaset/train.csv')

In [3]:
train_data.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


Split the data into training and validation sets.

In [4]:
from sklearn.model_selection import train_test_split

train_sentences, val_sentences, train_lables, val_lables = train_test_split(
    train_data["text"].to_numpy(),
    train_data["target"].to_numpy(),
    test_size=0.1
    )
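Conceptually, `test_size=0.1` shuffles the rows and holds out about 10% of them for validation. The sketch below illustrates that idea in plain Python. It is not scikit-learn's implementation, and `shuffle_split` and its `seed` argument are hypothetical names. A fixed seed (the analogue of `random_state` in `train_test_split`) makes the split repeatable between runs.

```python
import math
import random

def shuffle_split(items, labels, test_size=0.1, seed=42):
    """Shuffle paired items/labels and hold out a fraction for validation."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)       # fixed seed -> same split every run
    n_val = math.ceil(test_size * len(items))  # number of held-out rows
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    train_x = [items[i] for i in train_idx]
    val_x = [items[i] for i in val_idx]
    train_y = [labels[i] for i in train_idx]
    val_y = [labels[i] for i in val_idx]
    return train_x, val_x, train_y, val_y

# Toy data standing in for the tweet texts and their targets
texts = [f"tweet {i}" for i in range(20)]
targets = [i % 2 for i in range(20)]
train_x, val_x, train_y, val_y = shuffle_split(texts, targets)
print(len(train_x), len(val_x))
```

Without a fixed seed, every run produces a different split, so the sentences printed in the next cell will differ from run to run.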

In [5]:
train_sentences

array(['#foodscare #offers2go #NestleIndia slips into loss after #Magginoodle #ban unsafe and hazardous for #humanconsumption',
       '70 Years After Atomic Bombs Japan Still Struggles With War Past: The anniversary of the devastation wrought b... http://t.co/o6AA0nWLha',
       "#BakeOffFriends #GBBO 'The one with the mudslide and the guy with the hat'",
       ...,
       "#Cowboys: Wednesday's injury report: RB Lance Dunbar injures ankle is listed as day-to-day:  http://t.co/RkB7EgKveb",
       'Watch Sarah Palin OBLITERATE Planned Parenthood For Targeting Minority Women! \x89ÛÒ BB4SP http://t.co/sAYZt2oagm',
       'Fucking yes /r/antiPOZi is quarantined.  Triggered the cucks we have.'],
      dtype=object)

# Converting text into numbers

Create a text vectorization layer that maps each word to an integer index.

In [6]:
from tensorflow.keras.layers import TextVectorization

In [7]:
text2vec = TextVectorization(
    max_tokens=10000, standardize='lower_and_strip_punctuation',
    split='whitespace', ngrams=None, output_mode='int',
    output_sequence_length=15, pad_to_max_tokens=False, vocabulary=None,
    idf_weights=None, sparse=False, ragged=False
)

In [8]:
text2vec.adapt(train_sentences)

See how the words in a sample sentence are mapped to integers.

In [9]:
sample_sentence = "There is a flood in my street!"
text2vec([sample_sentence])

<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[ 74,   9,   3, 210,   4,  13, 668,   0,   0,   0,   0,   0,   0,
          0,   0]], dtype=int64)>
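To make the result above less of a black box, here is a rough plain-Python sketch of what the layer does with these settings: lowercase the text, strip punctuation, split on whitespace, look each word up in the vocabulary, then pad with zeros up to length 15. The `vectorize` function and `toy_vocab` are hypothetical. The word/index pairs are simply read off the tensor printed above, and the punctuation set is assumed to match `string.punctuation`.

```python
import string

# Toy vocabulary: word -> index pairs read off the tensor printed above.
# Index 0 is reserved for padding and index 1 for unknown words ([UNK]).
toy_vocab = {"there": 74, "is": 9, "a": 3, "flood": 210,
             "in": 4, "my": 13, "street": 668}

def vectorize(sentence, vocab, seq_len=15):
    """Mimic lower_and_strip_punctuation + whitespace split + int output."""
    cleaned = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = cleaned.split()
    ids = [vocab.get(tok, 1) for tok in tokens][:seq_len]  # truncate long inputs
    return ids + [0] * (seq_len - len(ids))                # pad short inputs

print(vectorize("There is a flood in my street!", toy_vocab))
```

Sentences longer than 15 words are cut off, and any word missing from the vocabulary becomes index 1.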

Get the first five words in the vocabulary.

In [10]:
text2vec.get_vocabulary()[:5]

['', '[UNK]', 'the', 'a', 'in']

Get the words at indices 100 to 104.

In [11]:
text2vec.get_vocabulary()[100:105]

['first', 'day', 'see', 'rt', 'nuclear']
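The vocabulary appears to be ordered by frequency: after the padding token `''` and `'[UNK]'`, very common words such as `'the'` come first. Below is a rough sketch of how such a list could be built. `build_vocab` is a hypothetical helper, and the order it gives to equally frequent words is an assumption that may differ from what `adapt` does.

```python
from collections import Counter
import string

def build_vocab(sentences, max_tokens=10000):
    """Count words and return them most-frequent-first after the special tokens."""
    counts = Counter()
    for sentence in sentences:
        cleaned = sentence.lower().translate(str.maketrans("", "", string.punctuation))
        counts.update(cleaned.split())
    # Two of the max_tokens slots are taken by the padding and [UNK] tokens
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

# Toy corpus for illustration only
corpus = ["The flood hit the town", "A fire near the forest", "Flood warning in the area"]
vocab = build_vocab(corpus)
print(vocab[:4])
```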

# Creating Embedding layer

We are going to use TensorFlow's embedding layer.

https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding

In [12]:
from tensorflow.keras import layers

embedding = layers.Embedding(input_dim = 10000, # size of the vocabulary (matches max_tokens)
                             output_dim = 128, # size of each embedding vector
                             input_length = 15 # length of each input sequence (matches output_sequence_length)
                            )

embedding

<keras.layers.embeddings.Embedding at 0x1f7ed4da820>

Get a random sentence from the training set

In [13]:
import random
random_sentence = random.choice(train_sentences)

print(f"Original text:\n {random_sentence}\
        \n\nEmbedded version:")

# Embed the random sentence (turn it into dense vectors of fixed size)
sample_embed = embedding(text2vec([random_sentence]))
sample_embed


Original text:
 @Silent0siris why not even more awesome norse landscapes with loads of atmosphere and life than boring/dead snotgreen wastelands =/        

Embedded version:


<tf.Tensor: shape=(1, 15, 128), dtype=float32, numpy=
array([[[-8.7750182e-03,  3.8374934e-02, -2.0454038e-02, ...,
         -3.2008775e-03, -3.0277824e-02,  1.7723035e-02],
        [-2.8183436e-02,  2.5909398e-02,  4.9008381e-02, ...,
          2.5891397e-02,  2.1168362e-02,  1.5717398e-02],
        [ 4.5796517e-02, -2.3933221e-02, -2.7477741e-05, ...,
          8.1192739e-03, -4.1421104e-02, -2.9009020e-02],
        ...,
        [-2.2014428e-02, -4.5775604e-02, -2.8667951e-02, ...,
         -2.0075215e-02,  9.3334913e-03, -1.6072858e-02],
        [ 4.3553893e-02, -2.4615599e-02, -2.1535957e-02, ...,
          1.4880490e-02,  3.9464608e-03, -8.4826723e-03],
        [ 3.4357179e-02, -3.4353390e-02,  1.5562143e-02, ...,
         -8.1611983e-03,  8.8150986e-03,  1.4191236e-02]]], dtype=float32)>
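An embedding layer is essentially a trainable lookup table with one row per vocabulary index: each integer in the vectorized sentence selects one row, so a sequence of 15 indices becomes 15 vectors. The sketch below mimics that lookup with plain Python lists. The sizes are shrunk toy values (1,000 rows of 8 numbers rather than 10,000 rows of 128), the uniform initialisation in [-0.05, 0.05] is assumed to mirror the Keras default, and `make_table` and `embed` are hypothetical names.

```python
import random

def make_table(vocab_size, dim, seed=0):
    """Create a randomly initialised lookup table (one row per token index)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(vocab_size)]

def embed(token_ids, table):
    """Replace every token index with its row from the table."""
    return [table[i] for i in token_ids]

table = make_table(vocab_size=1000, dim=8)
# Indices of "There is a flood in my street!" from the earlier cell, padded to 15
token_ids = [74, 9, 3, 210, 4, 13, 668] + [0] * 8
embedded = embed(token_ids, table)
print(len(embedded), len(embedded[0]))  # sequence length, embedding size
```

All padding positions (index 0) share the same row. During training the rows are updated like any other weights, which is how the vectors come to carry meaning.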

In [14]:
sample_embed[0][0], sample_embed[0][0].shape, random_sentence

(<tf.Tensor: shape=(128,), dtype=float32, numpy=
 array([-0.00877502,  0.03837493, -0.02045404,  0.028922  , -0.00135047,
        -0.04771801,  0.03413505,  0.0370185 , -0.01713576, -0.01167939,
        -0.0361629 ,  0.01092094,  0.00961093,  0.01341688, -0.03212906,
         0.03236114,  0.01870133,  0.04086839, -0.0220531 ,  0.00364311,
        -0.02071682, -0.00677913,  0.02209851, -0.04576121, -0.04687148,
         0.04765746, -0.0471023 ,  0.03654218,  0.0071661 ,  0.03409285,
         0.03632246,  0.01396228, -0.04551941,  0.02249939,  0.04124662,
         0.01090536,  0.01335294,  0.01863958, -0.00444819, -0.02384531,
        -0.02345958, -0.02696501, -0.03813272,  0.01408628,  0.00684815,
        -0.02348231, -0.01978018, -0.00857029, -0.01169363,  0.0453099 ,
        -0.0010746 , -0.03824887, -0.02235872,  0.02410522, -0.00468149,
        -0.02590205,  0.02142422, -0.03827486, -0.03778159,  0.00542183,
        -0.04529473, -0.00608424, -0.01701016, -0.00801777, -0.03720414,
  

# Modelling a text dataset by running a series of experiments

We will try several models on the text data:

0, Naive Bayes with TF-IDF encoder (baseline)

1, Feed-forward neural network (dense model)

2, LSTM (RNN)

3, GRU (RNN)

4, Bidirectional-LSTM (RNN)

5, 1D Convolutional Neural Network

6, TensorFlow Hub Pretrained Feature Extractor

7, TensorFlow Hub Pretrained Feature Extractor (10% of data)

How are we going to approach all of these?

Use the standard steps for modelling with TensorFlow:

* Create a model
* Build a model
* Fit a model
* Evaluate our model

# Create a Conv1D layer

For more on CNNs, see:
* https://poloclub.github.io/cnn-explainer/

For the ReLU activation function, see:
* https://ml-cheatsheet.readthedocs.io/en/latest/activation_functions.html

In [17]:
# Test the embedding layer, Conv1D layer and max pooling
from tensorflow.keras import layers

embedding_test = embedding(text2vec(["This is a test sentence"]))  # turn the target sequence into an embedding
conv_1d = layers.Conv1D(filters=32,
                        kernel_size=5,
                        activation="relu",
                        padding="valid")  # "valid": the output is shorter than the input; "same": the output keeps the input length
conv_1d_output = conv_1d(embedding_test)  # pass the test embedding through the Conv1D layer
max_pool = layers.GlobalMaxPool1D()
max_pool_output = max_pool(conv_1d_output)  # equivalent to "get the most important feature", i.e. the feature with the highest value

embedding_test.shape, conv_1d_output.shape, max_pool_output.shape

(TensorShape([1, 15, 128]), TensorShape([1, 11, 32]), TensorShape([1, 32]))
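The shapes above follow from simple arithmetic. With `padding="valid"`, a kernel of size 5 fits into a length-15 sequence at 15 - 5 + 1 = 11 positions, and each of the 32 filters produces one number per position. Global max pooling then keeps the largest value of each filter across those positions, which gives 32 numbers. Here is a small sketch of both ideas, assuming a stride of 1; the function names are hypothetical.

```python
def conv1d_output_length(seq_len, kernel_size, padding="valid"):
    """Output length of a stride-1 1D convolution."""
    if padding == "valid":
        return seq_len - kernel_size + 1  # kernel must fit entirely inside the input
    if padding == "same":
        return seq_len                    # input is padded so the length is preserved
    raise ValueError(f"unknown padding: {padding}")

def global_max_pool_1d(feature_map):
    """feature_map: list of time steps, each a list with one value per filter."""
    return [max(column) for column in zip(*feature_map)]

print(conv1d_output_length(15, 5, "valid"))  # 11
print(conv1d_output_length(15, 5, "same"))   # 15

# Toy feature map: 3 time steps, 2 filters
print(global_max_pool_1d([[0.1, 0.0], [0.7, 0.2], [0.3, 0.9]]))  # [0.7, 0.9]
```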

In [15]:
# Create a TensorBoard callback (we need a new one for each model)
from helper_function import create_tensorboard_callback

# Create a directory to save TensorBoard logs
SAVE_DIR = "model_logs"