# Learn the basics of NLP with TensorFlow

I'm going to follow this GitHub tutorial:

https://github.com/mrdbourke/tensorflow-deep-learning/blob/main/08_introduction_to_nlp_in_tensorflow.ipynb

Get the dataset from Kaggle.

In [1]:
import pandas as pd
import numpy as np

In [2]:
train_data = pd.read_csv('./dataaset/train.csv')

In [3]:
train_data.head()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | | | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | | | Just got sent this photo from Ruby #Alaska as ... | 1 |


Split the data into training and validation sets.

In [4]:
from sklearn.model_selection import train_test_split

train_sentences, val_sentences, train_labels, val_labels = train_test_split(
    train_data["text"].to_numpy(),
    train_data["target"].to_numpy(),
    test_size=0.1
    )
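To sanity-check what `test_size=0.1` does, here is a small sketch on made-up data. The `toy_*` names and `random_state=42` are my own assumptions, not part of the tutorial. It should hold out 10% of the samples for validation, and fixing `random_state` makes the shuffle repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 fake sentences with alternating 0/1 labels
toy_text = np.array([f"sentence {i}" for i in range(20)], dtype=object)
toy_target = np.array([i % 2 for i in range(20)])

# Same call pattern as above, plus random_state so the split is repeatable
toy_train_x, toy_val_x, toy_train_y, toy_val_y = train_test_split(
    toy_text, toy_target, test_size=0.1, random_state=42
)

# Running the split again with the same seed returns the same validation rows
_, toy_val_x_again, _, _ = train_test_split(
    toy_text, toy_target, test_size=0.1, random_state=42
)

print(len(toy_train_x), len(toy_val_x))  # 18 2
```

Without a fixed `random_state`, every run of the cell shuffles differently, so the sentences printed below will change from run to run.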

In [5]:
train_sentences

array(["Simmering beneath #NHL good times the league's own concussion issues @PioneerPress\n\nhttp://t.co/zl7FhUCxHL",
       'U.S National Park Services Tonto National Forest: Stop the Annihilation of the Salt River Wild Horse... https://t.co/MatIJwkzbh via @Change',
       '#world FedEx no longer to transport bioterror germs in wake of anthrax lab mishaps  http://t.co/wvExJjRG6E',
       ...,
       'Hollywood movie about trapped miners released in #Chile http://t.co/r18aUtnLSd #ZippedNews http://t.co/CNqaE9foj6',
       '*New!* Stretcher in 5 min https://t.co/q5MDsNbCMh (by FUJIWARA Shunichiro 2015-08-05) [Technology]',
       'Barak will Tell the American People that the lives of the Hostages in Iran depends on Congress Voting to give Terrorist a Nuke for hostages.'],
      dtype=object)

# Converting text into numbers

Create a text vectorization layer that maps words to integers.

In [6]:
from tensorflow.keras.layers import TextVectorization

In [15]:
text2vec = TextVectorization(
    max_tokens=10000, standardize='lower_and_strip_punctuation',
    split='whitespace', ngrams=None, output_mode='int',
    output_sequence_length=None, pad_to_max_tokens=False, vocabulary=None,
    idf_weights=None, sparse=False, ragged=False
)

In [8]:
text2vec.adapt(train_sentences)

See how the words in a sample sentence are mapped to integers.

In [9]:
sample_sentence = "There is a flood in my street!"
text2vec([sample_sentence])

<tf.Tensor: shape=(1, 7), dtype=int64, numpy=array([[ 73,   9,   3, 218,   4,  13, 668]])>

Get the first five words in the vocabulary.

In [10]:
text2vec.get_vocabulary()[:5]

['', '[UNK]', 'the', 'a', 'in']

Get the words at indices 100 to 104.

In [11]:
text2vec.get_vocabulary()[100:105]

['going', 'bomb', 'first', 'world', 'see']
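To make the word-to-integer mapping less magical, here is a rough pure-Python sketch of what an integer-output text vectorizer does: lowercase and strip punctuation, split on whitespace, rank words by frequency, and reserve index 0 for padding (`''`) and index 1 for unknown words (`'[UNK]'`), as in the vocabulary printed above. This is not TensorFlow's actual implementation. The helper names, the toy sentences, and the alphabetical tie-breaking are all my own assumptions.

```python
import re
import string
from collections import Counter

def standardize(text):
    """Lowercase and strip ASCII punctuation (rough stand-in for 'lower_and_strip_punctuation')."""
    return re.sub(f"[{re.escape(string.punctuation)}]", "", text.lower())

def build_vocabulary(sentences, max_tokens):
    """Rank words by frequency; indices 0 and 1 are reserved for padding and unknown words."""
    counts = Counter(word for s in sentences for word in standardize(s).split())
    # Most frequent first; ties broken alphabetically (an assumption, for determinism)
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return ["", "[UNK]"] + ordered[: max_tokens - 2]

def vectorize(sentence, vocabulary):
    """Map each word to its vocabulary index, or to 1 ('[UNK]') if it is not in the vocabulary."""
    index = {word: i for i, word in enumerate(vocabulary)}
    return [index.get(word, 1) for word in standardize(sentence).split()]

# Hypothetical toy corpus
toy_sentences = [
    "There is a flood in my street!",
    "A fire in the forest",
    "The flood is over",
]
toy_vocab = build_vocabulary(toy_sentences, max_tokens=10)
toy_ids = vectorize("A flood in Tokyo!", toy_vocab)

print(toy_vocab[:5])  # ['', '[UNK]', 'a', 'flood', 'in']
print(toy_ids)        # [2, 3, 4, 1]
```

"Tokyo" never appears in the toy corpus, so it maps to index 1, the `'[UNK]'` slot.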

# Creating an Embedding layer

We are going to use TensorFlow's Embedding layer.

https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding

In [12]:
from tensorflow.keras import layers

embedding = layers.Embedding(input_dim = 10000, # vocabulary size (same as max_tokens)
                             output_dim = 128 # size of each embedding vector
                             # input_length is left out: sentences vary in length
                            )

embedding

<keras.layers.embeddings.Embedding at 0x7f4294225d00>

Get a random sentence from the training set

In [13]:
import random
random_sentence = random.choice(train_sentences)

print(f"Original text:\n {random_sentence}\
        \n\nEmbedded version:")

# Embed the random sentence (turn it into dense vectors of fixed size)
sample_embed = embedding(text2vec([random_sentence]))
sample_embed


Original text:
 I came up with an idea of a fragrance concept for a bath bomb called The Blood of my Enemies. So you can say that's what you bathe in.        

Embedded version:


<tf.Tensor: shape=(1, 29, 128), dtype=float32, numpy=
array([[[ 0.02958575, -0.01597808, -0.01917583, ...,  0.03905093,
         -0.02408947, -0.01861982],
        [ 0.02632899, -0.0480075 , -0.0182404 , ...,  0.0014418 ,
          0.00921815,  0.03629432],
        [-0.04899255, -0.0412196 , -0.01272858, ...,  0.04235465,
          0.0411538 , -0.04494948],
        ...,
        [-0.01053566,  0.0204275 ,  0.00317945, ..., -0.03147644,
         -0.0228538 ,  0.03613024],
        [-0.03692415, -0.00882105,  0.01451446, ...,  0.01105335,
          0.03074242, -0.00395235],
        [-0.01825533, -0.03438073,  0.03379628, ...,  0.00583144,
         -0.00925924,  0.02018284]]], dtype=float32)>
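The output shape `(1, 29, 128)` reads as one sentence, 29 tokens, 128 numbers per token. Conceptually, an embedding layer is a lookup table: one row per vocabulary index, and each token index picks out its row. Here is a minimal NumPy sketch of that idea. The toy sizes, the random matrix, and the variable names are assumptions for illustration, and nothing here is trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; the notebook uses a vocabulary of 10000 and 128 dimensions
vocab_size, embedding_dim = 10, 4

# One row of small random values per vocabulary index
embedding_matrix = rng.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim))

# A batch holding one sentence of four token indices
token_ids = np.array([[2, 3, 4, 1]])

# Integer-array indexing looks up one row per token
embedded = embedding_matrix[token_ids]

print(embedded.shape)  # (1, 4, 4)
```

In the real layer the rows are weights that get updated during training, so words used in similar contexts can end up with similar vectors.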

In [14]:
sample_embed[0][0], sample_embed[0][0].shape, random_sentence

(<tf.Tensor: shape=(128,), dtype=float32, numpy=
 array([ 2.95857452e-02, -1.59780756e-02, -1.91758275e-02, -4.34473418e-02,
        -4.15468924e-02,  9.74230841e-03,  3.61525752e-02, -3.13725844e-02,
         3.18538025e-03, -9.12852213e-03,  2.62330882e-02,  1.42868869e-02,
         2.21077688e-02,  1.76996700e-02, -3.93583290e-02, -4.68223095e-02,
         1.66739561e-02, -4.23863307e-02, -2.55356561e-02,  1.80129670e-02,
        -2.92800423e-02,  4.26120795e-02,  4.11627777e-02,  1.35728456e-02,
        -4.69775461e-02, -4.24849652e-02,  3.65714468e-02, -3.03423051e-02,
         1.49317123e-02,  2.76662447e-02,  2.05638073e-02, -3.21891159e-03,
         9.80960205e-03,  2.10213661e-03, -3.11073661e-02, -1.41449086e-02,
        -3.63717563e-02, -4.56285849e-02,  1.92188136e-02,  2.33934857e-02,
         2.31421329e-02, -6.20067120e-04, -2.98727993e-02, -9.18539613e-03,
         3.26558091e-02,  2.99807452e-02, -2.63557434e-02, -1.16965771e-02,
         3.78444232e-02,  4.54924367e-0

# Modelling a text dataset by running a series of experiments

We will try the following models on the text data:

0, Naive Bayes with TF-IDF encoder (baseline)

1, Feed-forward neural network (dense model)

2, LSTM (RNN)

3, GRU (RNN)

4, Bidirectional-LSTM (RNN)

5, 1D Convolutional Neural Network

6, TensorFlow Hub Pretrained Feature Extractor

7, TensorFlow Hub Pretrained Feature Extractor (10% of data)
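As a preview of model 0, here is one possible sketch of a Naive Bayes baseline with a TF-IDF encoder, built as a scikit-learn pipeline. The toy sentences, labels, and variable names are made up for illustration. On the real data you would fit on `train_sentences` with the training labels and score on the validation split.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 1 = disaster, 0 = not a disaster
toy_sentences = [
    "Forest fire near the town, residents evacuated",
    "Flood water is rising in my street",
    "Earthquake destroyed several buildings",
    "I love this new song so much",
    "Great coffee with friends this morning",
    "Watching a movie at home tonight",
]
toy_labels = [1, 1, 1, 0, 0, 0]

# TF-IDF turns each sentence into a weighted word-count vector,
# then Multinomial Naive Bayes classifies those vectors
baseline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])
baseline.fit(toy_sentences, toy_labels)

toy_predictions = baseline.predict(["Wildfire spreading near the town"])
toy_accuracy = baseline.score(toy_sentences, toy_labels)
print(toy_predictions, toy_accuracy)
```

A baseline like this trains in well under a second, which gives the deep models something cheap to beat.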

How are we going to approach all of these?

Use the standard steps for modelling with TensorFlow:

* Create a model
* Build a model
* Fit a model
* Evaluate our model
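
The steps above can be sketched with Keras roughly as follows. This is only an illustration under assumptions: I am reading "build" as compiling the model, the data is random numbers standing in for vectorized text, and the layer sizes and epoch count are arbitrary.

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy data standing in for vectorized sentences and their labels
rng = np.random.default_rng(42)
features = rng.normal(size=(64, 8)).astype("float32")
labels = (features.sum(axis=1) > 0).astype("float32")

# 1. Create a model
model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary output, like `target`
])

# 2. Build (compile) the model: choose the loss, optimizer and metrics
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# 3. Fit the model
history = model.fit(features, labels, epochs=3, verbose=0)

# 4. Evaluate the model (here on the training data, only to show the call)
loss, accuracy = model.evaluate(features, labels, verbose=0)
print(f"loss={loss:.3f}, accuracy={accuracy:.3f}")
```

For the real experiments, evaluation should use `val_sentences` and the validation labels rather than the data the model was trained on.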