# Learn the basics of NLP with TensorFlow

I'm going to follow this GitHub tutorial.

https://github.com/mrdbourke/tensorflow-deep-learning/blob/main/08_introduction_to_nlp_in_tensorflow.ipynb

Get the dataset from Kaggle.

In [1]:
import pandas as pd
import numpy as np

In [2]:
train_data = pd.read_csv('./dataaset/train.csv')

In [3]:
train_data.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


Split the data into training and validation sets (10% is held out for validation).

In [4]:
from sklearn.model_selection import train_test_split

train_sentences, val_sentences, train_labels, val_labels = train_test_split(
    train_data["text"].to_numpy(),
    train_data["target"].to_numpy(),
    test_size=0.1  # hold out 10% of the rows for validation
    )
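
To make the split less of a black box, here is a minimal pure-Python sketch of a shuffled hold-out split. It is not scikit-learn's implementation: the helper name `shuffle_split`, the toy data, and the rounding of the validation size are all assumptions made for illustration.

```python
import random

def shuffle_split(sentences, labels, test_size=0.1, seed=42):
    """Shuffle paired data and hold out a fraction of it for validation."""
    if len(sentences) != len(labels):
        raise ValueError("sentences and labels must have the same length")
    indices = list(range(len(sentences)))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_val = max(1, round(len(indices) * test_size))  # size of the held-out set
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    train = ([sentences[i] for i in train_idx], [labels[i] for i in train_idx])
    val = ([sentences[i] for i in val_idx], [labels[i] for i in val_idx])
    return train, val

# Hypothetical toy data standing in for the tweets and their targets
toy_sentences = [f"tweet {i}" for i in range(20)]
toy_labels = [i % 2 for i in range(20)]
(train_x, train_y), (val_x, val_y) = shuffle_split(toy_sentences, toy_labels)
print(len(train_x), len(val_x))
```

Because the shuffle is random, the sentences shown in the notebook output will differ from run to run. `train_test_split` accepts a `random_state` argument if a repeatable split is wanted.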

In [5]:
train_sentences

array(['Uber reduces drunk driving fatalities says independent study http://t.co/jVIVT6zrv7',
       "A little concerned about the number of forest fires where I'll be living",
       'descended or sunk however it may be to the shadowed land beyond the crest of a striking cobra landing harshly upon his back; torch and',
       ...,
       '[Comment] Deaths of older children: what do the data tell #US? http://t.co/p8Yr2po6Jn\n #nghlth',
       "@astros stunningly poor defense it's not all on the pitcher. If our bats are MIA like the top of 1st inning this team is in trouble.",
       'Sinkhole leaking sewage opens in housing estate\nIrish Independent-3 Aug 2015'],
      dtype=object)

# Converting text into numbers

Create a text vectorization layer that maps each word to an integer index.

In [6]:
from tensorflow.keras.layers import TextVectorization

In [7]:
text2vec = TextVectorization(
    max_tokens=10000, standardize='lower_and_strip_punctuation',
    split='whitespace', ngrams=None, output_mode='int',
    output_sequence_length=15, pad_to_max_tokens=False, vocabulary=None,
    idf_weights=None, sparse=False, ragged=False
)

In [8]:
text2vec.adapt(train_sentences)

See how the words of a sample sentence are mapped to integers (the sequence is padded with zeros up to length 15).

In [9]:
sample_sentence = "There is a flood in my street!"
text2vec([sample_sentence])

<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[ 75,   9,   3, 213,   4,  13, 696,   0,   0,   0,   0,   0,   0,
          0,   0]], dtype=int64)>

Get the first five words in the vocabulary.

In [10]:
text2vec.get_vocabulary()[:5]

['', '[UNK]', 'the', 'a', 'in']

Get the words at indices 100 to 104.

In [11]:
text2vec.get_vocabulary()[100:105]

['buildings', 'cant', 'bomb', 'world', 'going']
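
The steps the layer performs (lowercase, strip punctuation, split on whitespace, look up an index, pad to a fixed length) can be sketched roughly in plain Python. This is only an approximation of what `TextVectorization` does with the settings above, not its actual implementation. The helper names and the toy corpus are hypothetical, and punctuation handling here uses `string.punctuation` as a stand-in.

```python
import string
from collections import Counter

def standardize(text):
    """Lowercase the text and remove ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocab(sentences, max_tokens=10000):
    """Index 0 is padding, index 1 is unknown; the rest are ordered by frequency."""
    counts = Counter(word for s in sentences for word in standardize(s).split())
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def vectorize(text, vocab, sequence_length=15):
    """Map words to indices, then truncate or zero-pad to a fixed length."""
    index = {word: i for i, word in enumerate(vocab)}
    ids = [index.get(word, 1) for word in standardize(text).split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

# Hypothetical three-sentence corpus standing in for train_sentences
toy_corpus = ["There is a flood", "A flood in the street", "the street is wet"]
toy_vocab = build_vocab(toy_corpus)
toy_vector = vectorize("There is a flood in my street!", toy_vocab)
print(toy_vocab[:2])
print(toy_vector)
```

The word "my" never appears in the toy corpus, so it maps to index 1 (`[UNK]`), and the trailing zeros are padding. This mirrors why the vocabulary above starts with `''` and `'[UNK]'`.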

# Creating an Embedding layer

We are going to use TensorFlow's Embedding layer.

https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding

In [12]:
from tensorflow.keras import layers

embedding = layers.Embedding(input_dim = 10000, # vocabulary size (matches max_tokens)
                             output_dim = 128, # size of each embedding vector
                             input_length = 15 # length of each input sequence (matches output_sequence_length)
                            )

embedding

<keras.layers.embeddings.Embedding at 0x286a6d90520>

Get a random sentence from the training set

In [13]:
import random
random_sentence = random.choice(train_sentences)

print(f"Original text:\n {random_sentence}\
        \n\nEmbedded version:")

# Embed the random sentence (turn it into dense vectors of fixed size)
sample_embed = embedding(text2vec([random_sentence]))
sample_embed


Original text:
 #Sismo M 1.9 - 5km S of Volcano Hawaii: Time2015-08-06 01:04:01 UTC2015-08-05 15:04:01 -10:00 at ep... http://t.co/RTUeTdfBqb #CSismica        

Embedded version:


<tf.Tensor: shape=(1, 15, 128), dtype=float32, numpy=
array([[[-0.00110046, -0.01278925, -0.04345628, ..., -0.03667394,
          0.04411669, -0.04826413],
        [-0.00228111,  0.00094842, -0.033382  , ...,  0.01971699,
         -0.0270062 , -0.00865058],
        [-0.00937158,  0.03930024, -0.02984457, ...,  0.04780022,
          0.03601365, -0.00774721],
        ...,
        [ 0.02725519, -0.01262987,  0.01173335, ..., -0.01251786,
         -0.02179431, -0.00612444],
        [ 0.02651985,  0.02771715, -0.03794049, ..., -0.04656031,
          0.00767535, -0.00552149],
        [ 0.04439998, -0.01148913, -0.01172309, ..., -0.00683018,
         -0.04659912, -0.0124477 ]]], dtype=float32)>
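
Conceptually, an embedding layer is a lookup table: each integer index selects one row of a `(vocabulary size, embedding dimension)` matrix. The sketch below illustrates that idea in plain Python and is not how Keras implements it. The helper names are hypothetical, and the `±0.05` range is an assumption chosen to resemble the untrained values printed above.

```python
import random

def make_embedding_table(vocab_size, embedding_dim, seed=0):
    """One row of small random floats per token index (values are untrained)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
            for _ in range(vocab_size)]

def embed(token_ids, table):
    """Replace each integer index with its row from the table."""
    return [table[i] for i in token_ids]

table = make_embedding_table(vocab_size=10000, embedding_dim=128)
# Token indices of the sample sentence from earlier, padded to length 15
token_ids = [75, 9, 3, 213, 4, 13, 696] + [0] * 8
embedded = embed(token_ids, table)
print(len(embedded), len(embedded[0]))
```

Every padded position uses index 0, so all of them receive the same vector. In a real model the rows start random like this and are then updated during training.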

In [14]:
sample_embed[0][0], sample_embed[0][0].shape, random_sentence

(<tf.Tensor: shape=(128,), dtype=float32, numpy=
 array([-0.00110046, -0.01278925, -0.04345628, -0.03915166, -0.00508578,
        -0.00270509, -0.01162582,  0.00302484,  0.01827208, -0.01555096,
        -0.01495733, -0.02727971,  0.03612879,  0.01345444,  0.02473441,
         0.01068245, -0.04250053,  0.01603914, -0.04426651,  0.03726841,
         0.02058747, -0.00264572,  0.03418631,  0.02683239, -0.01889637,
         0.04407502, -0.02523658, -0.00983491, -0.02066312,  0.01693488,
        -0.00597306, -0.01795416,  0.02507987, -0.03917861, -0.0427908 ,
         0.03813349,  0.04971219, -0.00885048, -0.02547884, -0.02516809,
        -0.03229671,  0.02315296, -0.0471476 ,  0.0072696 ,  0.01165967,
        -0.01315971,  0.03781008, -0.04096187,  0.04793744, -0.01164491,
         0.03358987,  0.00564047, -0.03524301,  0.02943058, -0.01222594,
         0.0306639 , -0.03248478, -0.01927526, -0.04097341,  0.01935121,
        -0.03554165, -0.01483761, -0.02703774, -0.00381593, -0.04680095,
  

# Modelling a text dataset by running a series of experiments

These are the models we will try on the text data:

0, Naive Bayes with TF-IDF encoder (baseline)

1, Feed-forward neural network (dense model)

2, LSTM (RNN)

3, GRU (RNN)

4, Bidirectional-LSTM (RNN)

5, 1D Convolutional Neural Network

6, TensorFlow Hub Pretrained Feature Extractor

7, TensorFlow Hub Pretrained Feature Extractor (10% of data)

How are we going to approach all of these?

Use the standard steps for modelling with TensorFlow:

* Create a model
* Build a model
* Fit a model
* Evaluate our model
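
For the evaluation step, comparing several experiments is easier when every model is scored the same way. Below is a minimal sketch of such a scoring helper for binary labels. The function name `evaluate`, the choice of metrics, and the toy labels are assumptions for illustration and do not come from the tutorial.

```python
def evaluate(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical labels and predictions standing in for a validation set
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 1, 1, 0]
results = evaluate(toy_true, toy_pred)
print(results)
```

Storing one such dictionary per experiment makes it straightforward to line the models up against the baseline.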

# Create a TensorFlow Hub pretrained model

Refer to this model:
* https://tfhub.dev/google/universal-sentence-encoder/4

This approach takes a long time on a local PC, so the code below is commented out.

In [16]:
import tensorflow_hub as hub

In [18]:
# embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
# embed_samples = embed([sample_sentence,
#                        "When you call the universal sentence encoder on a sentence, it turns it into numbers."])
# print(embed_samples[0][:50])

In [None]:
# # Create a Keras layer from the pretrained encoder (needs `import tensorflow as tf` for tf.string)
# sentence_encoder_layer = hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder/4",
#                                        input_shape=[],
#                                        dtype=tf.string,
#                                        trainable=False,
#                                        name="USE")

In [None]:
# # Create the model using the Sequential API
# model_6 = tf.keras.Sequential([
#     sentence_encoder_layer,
#     layers.Dense(1, activation="sigmoid")
# ], name="model_6")

In [None]:
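# # Compile the model, inspect it, then train and predict
# # (the compile settings are assumed defaults for a binary classifier)
# model_6.compile(loss="binary_crossentropy",
#                 optimizer=tf.keras.optimizers.Adam(),
#                 metrics=["accuracy"])
# model_6.summary()
# model_6.fit()      # pass the training sentences and labels here
# model_6.predict()  # pass the validation sentences here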