# Learn the basics of NLP with TensorFlow

I'm going to follow this GitHub tutorial.

https://github.com/mrdbourke/tensorflow-deep-learning/blob/main/08_introduction_to_nlp_in_tensorflow.ipynb

Get the dataset from Kaggle.

In [1]:
import pandas as pd
import numpy as np

In [2]:
train_data = pd.read_csv('./dataaset/train.csv')

In [3]:
train_data.head()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | | | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | | | Just got sent this photo from Ruby #Alaska as ... | 1 |


Split the data into training and validation sets.

In [4]:
from sklearn.model_selection import train_test_split

train_sentences, val_sentences, train_lables, val_lables = train_test_split(
    train_data["text"].to_numpy(),
    train_data["target"].to_numpy(),
    test_size=0.1
    )
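The split above is random, so every run gives a different validation set. Here is a minimal NumPy-only sketch of what a 90/10 split does, with a fixed seed so it is repeatable. The helper name `simple_split`, the seed and the toy arrays are all made up for illustration; with scikit-learn itself, passing `random_state` to `train_test_split` gives the same kind of repeatability.

```python
import numpy as np

def simple_split(sentences, labels, test_size=0.1, seed=42):
    """Shuffle the indices once, then carve off a `test_size` fraction for validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(sentences))
    n_val = int(np.ceil(len(sentences) * test_size))
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    return sentences[train_idx], sentences[val_idx], labels[train_idx], labels[val_idx]

# Toy data (hypothetical), standing in for the tweet texts and targets.
toy_sentences = np.array([f"tweet {i}" for i in range(20)], dtype=object)
toy_labels = np.array([i % 2 for i in range(20)])

tr_s, va_s, tr_l, va_l = simple_split(toy_sentences, toy_labels)
print(len(tr_s), len(va_s))
```

Sentences and labels are indexed with the same shuffled indices, so each sentence keeps its own label.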

In [5]:
train_sentences

array(['hurricane?? sick!',
       'ML 2.0 SICILY ITALY http://t.co/z6hxx6d2pm #euroquake',
       "Don't be so modest. You certainly... *sniff* *sniiiiiiff* Er Donny? Is something burning?",
       ...,
       'Naaa I bee dead.. Like a legit zombie .. I feel every sore part in my body ?? https://t.co/J4fSDPfA63',
       'Suicide Bomber Kills 13 At SaudiåÊMosque http://t.co/h99bHB29xt',
       'U.S National Park Services Tonto National Forest: Stop the Annihilation of the Salt River Wild Horse... http://t.co/6LoJOoROuk via @Change'],
      dtype=object)

# Converting text into numbers

Create a layer that turns words into integer indices.

In [6]:
from tensorflow.keras.layers import TextVectorization

In [21]:
text2vec = TextVectorization(
    max_tokens=10000, standardize='lower_and_strip_punctuation',
    split='whitespace', ngrams=None, output_mode='int',
    output_sequence_length=None, pad_to_max_tokens=False, vocabulary=None,
    idf_weights=None, sparse=False, ragged=False
)

In [8]:
text2vec.adapt(train_sentences)

See how the words in a sample sentence are mapped to integers.

In [9]:
sample_sentence = "There is a flood in my street!"
text2vec([sample_sentence])

<tf.Tensor: shape=(1, 7), dtype=int64, numpy=array([[ 74,   9,   3, 210,   4,  13, 696]])>

Get the first five words in the vocabulary.

In [10]:
text2vec.get_vocabulary()[:5]

['', '[UNK]', 'the', 'a', 'in']

Get the words at indices 100 to 104.

In [11]:
text2vec.get_vocabulary()[100:105]

['off', 'nuclear', 'going', 'world', 'cant']
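To make the mapping less of a black box, here is a rough pure-Python sketch of what a vectorizer like this does: lowercase, strip punctuation, split on whitespace, then number the words by frequency, with index 0 kept for padding and index 1 for unknown words. This is not the layer's real implementation. The function names and the toy corpus are hypothetical, and the order of equally frequent words is an assumption.

```python
import string
from collections import Counter

def standardize(text):
    """Lowercase and drop ASCII punctuation (approximates 'lower_and_strip_punctuation')."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocab(sentences, max_tokens=10000):
    """Most frequent words first, after the reserved padding and unknown tokens."""
    counts = Counter(tok for s in sentences for tok in standardize(s).split())
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def vectorize(sentence, vocab):
    """Replace each word with its vocabulary index, or 1 if the word is unknown."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, 1) for tok in standardize(sentence).split()]

toy_corpus = ["The flood is in the street", "A fire in the forest!"]
vocab = build_vocab(toy_corpus)
print(vocab[:4])
print(vectorize("There is a flood in my street!", vocab))
```

Words that never appeared in the toy corpus ("there", "my") all land on index 1, which is the job of the `[UNK]` entry in the real vocabulary.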

# Creating Embedding layer

We are going to use TensorFlow's embedding layer.

https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding

In [12]:
from tensorflow.keras import layers

embedding = layers.Embedding(input_dim = 10000, # vocabulary size (matches max_tokens above)
                             output_dim = 128, # size of each embedding vector
                             input_length = 10000 # expected length of each input sequence
                            )

embedding

<keras.layers.embeddings.Embedding at 0x7f556eea9340>

Get a random sentence from the training set

In [13]:
import random
random_sentence = random.choice(train_sentences)

print(f"Original text:\n {random_sentence}\
        \n\nEmbedded version:")

# Embed the random sentence (turn it into dense vectors of fixed size)
sample_embed = embedding(text2vec([random_sentence]))
sample_embed


Original text:
 One thing I wanna see before I die&gt; #Trump standing in a good windstorm with no hat on!!! #Hardball        

Embedded version:


<tf.Tensor: shape=(1, 19, 128), dtype=float32, numpy=
array([[[-0.0435282 ,  0.0175561 , -0.03012011, ...,  0.02523227,
         -0.01872987,  0.00905143],
        [ 0.038867  ,  0.01798466, -0.0340655 , ...,  0.00472493,
         -0.00747948,  0.04603646],
        [ 0.04763714, -0.00991907,  0.04974221, ..., -0.02522643,
          0.00574137,  0.01926524],
        ...,
        [ 0.01530648, -0.00745587, -0.00504302, ..., -0.01269235,
         -0.04317092,  0.03196422],
        [ 0.00611197,  0.01676695, -0.03667712, ..., -0.04615531,
          0.02169428,  0.04329332],
        [-0.03812277, -0.04817264, -0.00614817, ...,  0.01390323,
         -0.0039277 , -0.02616168]]], dtype=float32)>
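The shape `(1, 19, 128)` reads as 1 sentence, 19 tokens, and 128 numbers per token. Conceptually, an embedding layer is a lookup table with one row per vocabulary index. Here is a small NumPy sketch of that idea, not the Keras implementation: the random matrix is a stand-in for the layer's trainable weights, and the uniform range is an assumption chosen to resemble the values printed above.

```python
import numpy as np

vocab_size, embed_dim = 10000, 128
rng = np.random.default_rng(0)

# Stand-in for the trainable weights: one 128-dimensional row per vocabulary index.
embedding_matrix = rng.uniform(-0.05, 0.05, size=(vocab_size, embed_dim)).astype(np.float32)

# A batch holding one sentence of 7 token indices (the flood example from earlier).
token_ids = np.array([[74, 9, 3, 210, 4, 13, 696]])

# Integer-array indexing picks one row per token.
embedded = embedding_matrix[token_ids]
print(embedded.shape)  # (1, 7, 128)
```

During training, the rows of this table are what get updated, so words used in similar contexts can end up with similar vectors.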

In [14]:
sample_embed[0][0], sample_embed[0][0].shape, random_sentence

(<tf.Tensor: shape=(128,), dtype=float32, numpy=
 array([-0.0435282 ,  0.0175561 , -0.03012011,  0.00377155,  0.01986678,
        -0.00550107, -0.04804999, -0.01466942, -0.01857733,  0.01076485,
        -0.02527479,  0.02206281,  0.0234226 ,  0.03638004, -0.0222977 ,
        -0.04812552, -0.00531228,  0.04633999,  0.02013744, -0.02050192,
        -0.04420015, -0.01526312, -0.04588317,  0.04618523, -0.02494662,
        -0.02533022, -0.04571458, -0.02759495, -0.03889594, -0.03725729,
        -0.02101976, -0.04304085,  0.03541197,  0.04382397,  0.02169443,
         0.02458951,  0.0108851 , -0.00196086, -0.01927493, -0.00802384,
        -0.02563849,  0.00569295,  0.0269939 ,  0.03198583,  0.02407658,
         0.00926826, -0.01008099, -0.00028284, -0.03355294,  0.04632728,
         0.00797959,  0.03472814, -0.04120144, -0.04846063, -0.00177735,
        -0.01190611,  0.0157467 ,  0.04107188,  0.01387191, -0.02925472,
         0.03234943, -0.02097943, -0.00973696,  0.02425394, -0.02399017,
  

# Modelling a text dataset by running a series of experiments

We will try several models for learning from text:

0, Naive Bayes with TF-IDF encoder (baseline)

1, Feed-forward neural network (dense model)

2, LSTM (RNN)

3, GRU (RNN)

4, Bidirectional-LSTM (RNN)

5, 1D Convolutional Neural Network

6, TensorFlow Hub Pretrained Feature Extractor

7, TensorFlow Hub Pretrained Feature Extractor (10% of data)

How are we going to approach all of these?

Use the standard steps for modelling with TensorFlow:

* Create a model
* Build a model
* Fit a model
* Evaluate our model

## Model 0 : Naive Bayes with TF-IDF encoder

This is a popular baseline when we aren't using a deep learning model.

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

model_0 = Pipeline([
                    ("tfidf", TfidfVectorizer()), # convert words to numbers using tfidf
                    ("clf", MultinomialNB()) # model the text
])
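The "convert words to numbers using tfidf" step deserves a quick look. Here is a pure-Python sketch of TF-IDF weighting, using the smoothed formula that scikit-learn documents as its default (`idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row). The whitespace tokenisation, the function name `tfidf` and the toy documents are simplifications of mine, not what `TfidfVectorizer` does internally.

```python
import math
from collections import Counter

def tfidf(docs):
    """Smoothed TF-IDF with L2-normalised rows, returned as one dict per document."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for tokens in tokenised:
        tf = Counter(tokens)
        weights = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

toy_docs = ["forest fire near town", "fire alarm", "nice sunny day"]
rows = tfidf(toy_docs)

# "fire" appears in two documents, so in the first document it weighs less than "forest".
print(rows[0]["fire"] < rows[0]["forest"])
```

Words that show up in many documents get a smaller weight, so the classifier pays more attention to the rarer, more distinctive words.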

In [16]:
# Fit the pipeline to the training data
model_0.fit(train_sentences, train_lables)

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])

In [17]:
# Evaluate our baseline model
baseline_score = model_0.score(val_sentences, val_lables)
baseline_score

0.8097112860892388

In [18]:
# Make predictions with the baseline model
baseline_pre = model_0.predict(val_sentences)

Evaluate the predictions.

In [19]:
from Evaluation import caluculate_results
baseline_results =  caluculate_results(y_true = val_lables,
                                       y_pre = baseline_pre)

In [20]:
baseline_results

{'accuracy': 80.97112860892388,
 'prediction': 0.8212871828521435,
 'recall': 0.8097112860892388,
 'f1': 0.8037619034994363}
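The helper imported from `Evaluation` isn't shown here, so the following is only a guess at the kind of thing it computes: accuracy in percent, plus precision, recall and F1 averaged over the classes and weighted by how many true samples each class has. The function name `weighted_metrics`, the key names and the toy labels are hypothetical; I'm also assuming the key printed as `'prediction'` holds precision. With scikit-learn, `precision_recall_fscore_support(..., average="weighted")` would be the usual shortcut.

```python
from collections import Counter

def weighted_metrics(y_true, y_pred):
    """Accuracy (in percent) plus support-weighted precision, recall and F1."""
    y_true, y_pred = list(y_true), list(y_pred)
    n = len(y_true)
    support = Counter(y_true)  # number of true samples per class
    accuracy = 100 * sum(t == p for t, p in zip(y_true, y_pred)) / n
    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        prec = tp / predicted if predicted else 0.0
        rec = tp / count
        f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        weight = count / n
        precision += weight * prec
        recall += weight * rec
        f1 += weight * f
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

toy_true = [1, 0, 1, 1, 0, 0]
toy_pred = [1, 0, 0, 1, 0, 1]
toy_results = weighted_metrics(toy_true, toy_pred)
print(toy_results)
```

With this weighting, recall always equals accuracy divided by 100, which is the pattern in the baseline results above.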