# Learn NLP basics with TensorFlow

I'm going to follow this GitHub tutorial:

https://github.com/mrdbourke/tensorflow-deep-learning/blob/main/08_introduction_to_nlp_in_tensorflow.ipynb

Get the dataset from Kaggle.

In [1]:
import pandas as pd
import numpy as np

In [2]:
train_data = pd.read_csv('./dataaset/train.csv')

In [3]:
train_data.head()

|   | id | keyword | location | text | target |
|---|----|---------|----------|------|--------|
| 0 | 1 |  |  | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 |  |  | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 |  |  | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 |  |  | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 |  |  | Just got sent this photo from Ruby #Alaska as ... | 1 |


Split the data into training and validation sets (90% / 10%).

In [4]:
from sklearn.model_selection import train_test_split

train_sentences, val_sentences, train_lables, val_lables = train_test_split(
    train_data["text"].to_numpy(),
    train_data["target"].to_numpy(),
    test_size=0.1
    )
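As a rough illustration of what a shuffled 90/10 split does, here is a minimal standard-library sketch. The helper `shuffle_split` and its `seed` parameter are hypothetical names made up for this example, not part of scikit-learn; the real `train_test_split` takes a `random_state` argument if you want the same split on every run.

```python
import math
import random


def shuffle_split(features, labels, test_size=0.1, seed=42):
    """Shuffle paired samples and hold out a fraction of them for validation."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)  # fixed seed -> same split every run
    # Rounding the hold-out size up is a choice made for this sketch.
    n_val = math.ceil(test_size * len(indices))
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    return (
        [features[i] for i in train_idx],
        [features[i] for i in val_idx],
        [labels[i] for i in train_idx],
        [labels[i] for i in val_idx],
    )


# Toy data standing in for the tweet texts and their targets.
texts = [f"tweet {i}" for i in range(25)]
targets = [i % 2 for i in range(25)]
train_x, val_x, train_y, val_y = shuffle_split(texts, targets, test_size=0.1)
print(len(train_x), len(val_x))
```

Each text stays paired with its label because both lists are indexed with the same shuffled indices.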

In [5]:
train_sentences

       '@iK4LEN Sirens was cancelled.',
       'My brother-n-law riooooos got the call to head up north and fight the wild fires. Dudes a beast at\x89Û_ https://t.co/463P0yS0Eb',
       ...,
       'George Njenga the hero saved his burning friend from a razing wildfire... http://t.co/us8r6Qsn0p',
       'Omg earthquake',
       'Goulburn man Henry Van Bilsen missing: Emergency services are searching for a Goulburn man who disappeared from his\x89Û_ http://t.co/z99pKJzTRp'],
      dtype=object)

# Converting text into numbers

Create a text vectorization layer that maps each word to an integer.

In [6]:
from tensorflow.keras.layers import TextVectorization

In [7]:
text2vec = TextVectorization(
    max_tokens=10000, standardize='lower_and_strip_punctuation',
    split='whitespace', ngrams=None, output_mode='int',
    output_sequence_length=None, pad_to_max_tokens=False, vocabulary=None,
    idf_weights=None, sparse=False, ragged=False
)

In [8]:
text2vec.adapt(train_sentences)

See how the words of a sample sentence are mapped to integers.

In [9]:
sample_sentence = "There is a flood in my street!"
text2vec([sample_sentence])

<tf.Tensor: shape=(1, 7), dtype=int64, numpy=array([[ 75,   9,   3, 210,   4,  13, 693]], dtype=int64)>

Get the first five words in the vocabulary.

In [10]:
text2vec.get_vocabulary()[:5]

['', '[UNK]', 'the', 'a', 'in']

Get the words at indices 100 to 104.

In [11]:
text2vec.get_vocabulary()[100:105]

['why', 'going', 'see', 'day', 'love']
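To make the mapping less of a black box, here is a minimal pure-Python sketch of the same idea: lowercase, strip punctuation, split on whitespace, then look each word up in a frequency-ordered vocabulary where index 0 is the padding token `''` and index 1 is `'[UNK]'`. The helpers `standardize`, `build_vocabulary` and `vectorize` are hypothetical names for this example, and the punctuation handling and tie ordering only approximate what `TextVectorization` does.

```python
import string
from collections import Counter


def standardize(text):
    """Lowercase and drop ASCII punctuation (roughly 'lower_and_strip_punctuation')."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def build_vocabulary(sentences, max_tokens):
    """Most frequent words first, after the reserved padding and unknown tokens."""
    counts = Counter(word for s in sentences for word in standardize(s).split())
    # Two slots are reserved, so at most max_tokens - 2 real words are kept.
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common


def vectorize(sentence, vocabulary):
    """Map each word to its vocabulary index; unseen words map to 1 ('[UNK]')."""
    index = {word: i for i, word in enumerate(vocabulary)}
    return [index.get(word, 1) for word in standardize(sentence).split()]


# Toy corpus, not the tweet dataset.
corpus = ["There is a flood", "A flood in my street!", "a fire"]
vocab = build_vocabulary(corpus, max_tokens=10000)
ids = vectorize("A flood, a tsunami!", vocab)
print(vocab[:4], ids)
```

In this toy corpus "a" is the most frequent word and "flood" the second, so they get indices 2 and 3, while the unseen word "tsunami" falls back to index 1.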

# Creating Embedding layer

We are going to use TensorFlow's embedding layer.

https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding

In [12]:
from tensorflow.keras import layers

embedding = layers.Embedding(input_dim = 10000, # vocabulary size (matches max_tokens above)
                             output_dim = 128, # size of each embedding vector
                             input_length = 10000 # declared length of each input sequence
                            )

embedding

<keras.layers.embeddings.Embedding at 0x2003df72a30>

Get a random sentence from the training set

In [13]:
import random
random_sentence = random.choice(train_sentences)

print(f"Original text:\n {random_sentence}\
        \n\nEmbedded version:")

# Embed the random sentence (turn it into dense vectors of fixed size)
sample_embed = embedding(text2vec([random_sentence]))
sample_embed


Original text:
 'I know a dill pickle when I taste one' -me        

Embedded version:


<tf.Tensor: shape=(1, 10, 128), dtype=float32, numpy=
array([[[-0.02135775, -0.03758646, -0.01996627, ...,  0.0278645 ,
         -0.00598536, -0.03844702],
        [-0.04226902, -0.00163647,  0.01667083, ..., -0.0428711 ,
          0.01848867,  0.02290911],
        [-0.03672609,  0.02936644,  0.03548683, ..., -0.00916771,
          0.01369672,  0.02785199],
        ...,
        [ 0.01022422, -0.0036725 ,  0.03455767, ...,  0.04977309,
          0.00265974,  0.03455759],
        [-0.0054733 , -0.01895433,  0.01308021, ..., -0.01617257,
         -0.0139548 , -0.03751872],
        [-0.00618932, -0.0489864 , -0.04190475, ..., -0.03495146,
          0.01029365,  0.035929  ]]], dtype=float32)>
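The shape `(1, 10, 128)` reads as one sentence, ten tokens, and 128 values per token. Conceptually an embedding layer is a lookup table with one row per vocabulary index, which the following standard-library sketch tries to show. `make_embedding_table` and `embed` are hypothetical helpers, and the small uniform initial range is an assumption chosen to resemble the magnitude of the values printed above; in the real layer the rows are trainable weights.

```python
import random


def make_embedding_table(vocab_size, embedding_dim, seed=0):
    """One randomly initialised vector (row) per vocabulary index."""
    rng = random.Random(seed)
    return [
        [rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
        for _ in range(vocab_size)
    ]


def embed(batch_of_token_ids, table):
    """Replace every token id with its row from the table."""
    return [[table[token_id] for token_id in sequence] for sequence in batch_of_token_ids]


table = make_embedding_table(vocab_size=10, embedding_dim=4)
embedded = embed([[3, 1, 3]], table)  # a batch holding one 3-token sentence

# (batch size, tokens per sentence, values per token)
shape = (len(embedded), len(embedded[0]), len(embedded[0][0]))
print(shape)
```

The same token id always maps to the same vector, so the first and third positions in this example hold identical rows.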

In [15]:
sample_embed[0][0], sample_embed[0][0].shape, random_sentence

(<tf.Tensor: shape=(128,), dtype=float32, numpy=
 array([-2.13577505e-02, -3.75864618e-02, -1.99662689e-02, -4.48975675e-02,
        -4.63089831e-02,  8.18821043e-03,  4.48398627e-02,  4.28785421e-02,
         1.22422464e-02, -3.86800542e-02, -3.70315202e-02, -1.32973567e-02,
         2.33544819e-02,  2.45478265e-02,  4.21095155e-02,  1.07404366e-02,
         9.82768461e-03, -4.57753912e-02,  5.08897379e-03, -2.09103227e-02,
         3.58012356e-02, -8.75191763e-03,  3.25902589e-02, -4.40030210e-02,
        -2.79144049e-02, -2.47259866e-02,  1.21647716e-02,  7.70256668e-03,
         3.79385240e-02, -2.14370247e-02, -1.16806030e-02,  2.45687254e-02,
        -2.04440001e-02,  3.18035968e-02, -8.55914503e-03, -1.69825181e-02,
         4.21337374e-02,  9.49386507e-03, -4.96903807e-03,  1.86784007e-02,
         1.38633586e-02, -9.06940550e-03, -2.76816841e-02,  1.41907483e-04,
         3.39682437e-02,  3.29332426e-03,  1.79452822e-03,  1.82428211e-03,
         4.10582907e-02, -4.32386883e-0

# Modelling a text dataset by running a series of experiments

These are the models we will try on the text data:

* Model 0: Naive Bayes with TF-IDF encoder (baseline)
* Model 1: Feed-forward neural network (dense model)
* Model 2: LSTM (RNN)
* Model 3: GRU (RNN)
* Model 4: Bidirectional-LSTM (RNN)
* Model 5: 1D Convolutional Neural Network
* Model 6: TensorFlow Hub Pretrained Feature Extractor
* Model 7: TensorFlow Hub Pretrained Feature Extractor (10% of data)

How are we going to approach all of these?

We'll use the standard steps for modelling with TensorFlow:

* Create a model
* Build a model
* Fit a model
* Evaluate our model

## Model 0 : Naive Bayes with TF-IDF encoder

This is a well-known baseline that doesn't use a deep learning model.

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

model_0 = Pipeline([
                    ("tfidf", TfidfVectorizer()), # convert words to numbers using tfidf
                    ("clf", MultinomialNB()) # model the text
])
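To get a feel for what the `tfidf` step produces, here is a minimal pure-Python sketch of TF-IDF weighting. It follows the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and the L2 normalisation that scikit-learn documents as the `TfidfVectorizer` defaults, but it tokenises by naive whitespace splitting, so its numbers should not be expected to match the library's output on real tweets. `tfidf_vectors` is a hypothetical helper name.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, one L2-normalised TF-IDF vector per document)."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each token appear?
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocabulary}

    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        raw = [counts[tok] * idf[tok] for tok in vocabulary]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # guard against empty documents
        vectors.append([v / norm for v in raw])
    return vocabulary, vectors


# Toy documents, not the tweet dataset.
vocabulary, vectors = tfidf_vectors(
    ["forest fire near town", "fire drill today", "sunny day today"]
)
first_doc = dict(zip(vocabulary, vectors[0]))
print({tok: round(w, 3) for tok, w in first_doc.items() if w > 0})
```

Words that appear in fewer documents get a larger weight: in the first toy document "forest" outweighs "fire", because "fire" also occurs in the second document.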

In [24]:
# Fit the pipeline to the training data
model_0.fit(train_sentences, train_lables)

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])

In [28]:
# Evaluate our baseline model
baseline_score = model_0.score(val_sentences, val_lables)
baseline_score

0.8044619422572179

In [34]:
# Make predictions with the baseline model
baseline_pre = model_0.predict(val_sentences)

Calculate the evaluation metrics

In [35]:
from Evaluation import caluculate_results
baseline_results =  caluculate_results(y_true = val_lables,
                                       y_pre = baseline_pre)

In [36]:
baseline_results

{'accuracy': 80.4461942257218,
 'prediction': 0.8156975938097631,
 'recall': 0.8044619422572179,
 'f1': 0.7977634834805553}
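The `Evaluation` module imported above isn't shown in these notes. As an assumption about what such a helper might compute, here is a minimal standard-library sketch that returns accuracy in percent plus support-weighted precision, recall and F1. The function name `weighted_classification_scores` and its dictionary keys are hypothetical. In the results above, recall equals accuracy divided by 100, which is what support-weighted recall always gives, but that is only an inference about how the original helper averages.

```python
from collections import Counter


def weighted_classification_scores(y_true, y_pred):
    """Accuracy (in percent) plus support-weighted precision, recall and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)  # number of true samples per label
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        actual = support[label]
        p_label = tp / predicted if predicted else 0.0
        r_label = tp / actual if actual else 0.0
        denom = p_label + r_label
        f_label = 2 * p_label * r_label / denom if denom else 0.0
        weight = actual / total  # each label counts in proportion to its support
        precision += weight * p_label
        recall += weight * r_label
        f1 += weight * f_label
    accuracy = 100 * sum(1 for t, p in zip(y_true, y_pred) if t == p) / total
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels, not the validation set.
scores = weighted_classification_scores([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 0, 1])
print(scores)
```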

## Model 1: Simple dense layer