# Natural Language Processing (NLP) with TensorFlow

## What are we going to cover?

- Downloading and preparing a text dataset
- How to prepare a text dataset for modelling (`tokenization` and `embedding`)
- Setting up multiple modelling experiments with recurrent neural networks (RNNs)
- Building a text `feature extraction model using TensorFlow Hub`
- Finding the most wrong prediction examples
- Using a model we've built to make predictions on text from the wild

see: https://en.wikipedia.org/wiki/Recurrent_neural_network  
see: https://awari.com.br/deep-learning-rnn-utilizando-redes-rnn-recurrent-neural-networks-no-deep-learning/

RNN - Recurrent Neural Network  
LSTM - Long short-term memory


How to add a deep learning model to an Android application: https://www.youtube.com/watch?v=tySgZ1rEbW4&t=987s

| Hyperparameter/Layer Type | What does it do?                                                  | Typical Value                                                                                                   |
| ------------------------- | ----------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------- |
| Input (text)              | Target text/sequences you'd like to discover patterns in          | Whatever you can represent as text or a sequence                                                                |
| Input Layer               | Takes in the target sequence                                      | input_shape = [batch_size, embedding_size] or [batch_size, sequence_shape]                                      |
| Text Vectorization Layer  | Maps input sequences to numbers                                   | tf.keras.layers.experimental.preprocessing.TextVectorization                                                    |
| Embedding                 | Maps text vectors to an embedding matrix (how words relate)       | tf.keras.layers.Embedding                                                                                       |
| RNN                       | Finds patterns in sequences                                       | SimpleRNN, LSTM, GRU                                                                                            |
| Hidden activation         | Adds non-linearity to learned features (non-straight lines)       | Usually tanh (hyperbolic tangent), tf.keras.activations.tanh                                                    |
| Pooling Layer             | Reduces the dimensionality of learned sequence features (usually for Conv1D models) | tf.keras.layers.GlobalAveragePooling1D or tf.keras.layers.GlobalMaxPool1D                     |
| Fully connected Layer     | Further refines learned features from recurrent layers            | tf.keras.layers.Dense                                                                                           |
| Output Layer              | Takes learned features and outputs them in the shape of the target labels | output_shape = number_of_classes                                                                        |
| Output Activation         | Adds non-linearity to the output layer                            | tf.keras.activations.sigmoid (binary classification) or tf.keras.activations.softmax (multi-class classification) |


```python
import tensorflow as tf
from tensorflow.keras import layers

# create LSTM model
# (text_vectorizer and embedding are the layers created later in this notebook)
inputs = layers.Input(shape=(1,), dtype='string')
x = text_vectorizer(inputs) # turn input sequences into numbers
x = embedding(x) # create embedding matrix
x = layers.LSTM(64, activation='tanh')(x) # return vector for whole sequence
outputs = layers.Dense(1, activation='sigmoid')(x)
model = tf.keras.Model(inputs, outputs)

# compile the model
model.compile(loss=tf.keras.losses.BinaryCrossentropy(),
              optimizer=tf.keras.optimizers.Adam(),
              metrics=['accuracy'])

# fit the model
history = model.fit(train_sequences, train_labels, epochs=5)
```


## Check GPU


In [1]:
!nvidia-smi -L

GPU 0: NVIDIA GeForce GTX 1650 Ti (UUID: GPU-396e0d2a-a22a-a57f-bd25-8f6297e10119)


In [2]:
import os
import sys

# add the project root to the path so the helper functions can be imported
sys.path.append(os.path.join('../'))

## Helper Functions

In past modules, we've created a bunch of helper functions to do small tasks required for our notebooks. Rather than rewrite all of these, we can import a script and load them in from there. The script we've got available can be found on GitHub: https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py


In [6]:
# import urllib.request as ur
# uncomment this line below and run it to download helper_functions file
# ur.urlretrieve('https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py', filename='helper_functions.py')
from helper_functions import create_tensorboard_callback, plot_loss_curves, walk_through_dir, compare_historys

## Kaggle Dataset

The dataset we're going to use is Kaggle's introduction to NLP dataset.

see: https://www.kaggle.com/competitions/nlp-getting-started  
see: https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip


In [6]:
STORAGE = os.path.join('../../', 'storage')
IMAGE_PATH = f'{STORAGE}/images'
ZIP_PATH = f'{STORAGE}/zip'
MODEL_PATH = f'{STORAGE}/models'
NLP = f'{STORAGE}/nlp'

# concat paths
LIST_PATHS = [IMAGE_PATH, ZIP_PATH, MODEL_PATH, NLP]

In [7]:
# confirm dir create
os.listdir(STORAGE)

['binary_classification',
 'datasets',
 'images',
 'models',
 'multi_class_classification',
 'nlp',
 'transfer_learning',
 'zip']

In [8]:
for path in LIST_PATHS:
    # create each directory (and any missing parents) if it doesn't exist yet
    os.makedirs(path, exist_ok=True)

In [9]:
import zipfile
import urllib.request as ur
import shutil

filename = 'nlp_getting_started.zip'
url = f'https://storage.googleapis.com/ztm_tf_course/{filename}'

if not os.path.exists(f'{ZIP_PATH}/{filename}'):
    # download zip file
    ur.urlretrieve(url, filename)
    shutil.move(filename, f'{ZIP_PATH}')

# remove the previously extracted folder if it exists
folder = filename.split('.')[0]
if os.path.isdir(f'{NLP}/{folder}'):
    shutil.rmtree(f'{NLP}/{folder}')

# unzip the downloaded file (the context manager closes it for us)
with zipfile.ZipFile(f'{ZIP_PATH}/{filename}', 'r') as zip_ref:
    zip_ref.extractall(f'{NLP}')

## Visualizing a text dataset

see: https://www.w3schools.com/python/python_file_open.asp  
see: https://www.tensorflow.org/tutorials/load_data/text?hl=pt-br

To visualize our text samples, we first have to read them in. One way to do so would be to use Python's built-in file reading, but we'd rather get visual straight away, so we'll use the pandas library instead.


In [10]:
import pandas as pd

train_df = pd.read_csv(f'{NLP}/train.csv')
test_df = pd.read_csv(f'{NLP}/test.csv')

In [12]:
display(train_df.head())

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [14]:
# shuffle training dataframe
train_df_shuffled = train_df.sample(frac=1, random_state=42)
train_df_shuffled.head()

Unnamed: 0,id,keyword,location,text,target
2644,3796,destruction,,So you have a new weapon that can cause un-ima...,1
2227,3185,deluge,,The f$&amp;@ing things I do for #GISHWHES Just...,0
5448,7769,police,UK,DT @georgegalloway: RT @Galloway4Mayor: ÛÏThe...,1
132,191,aftershock,,Aftershock back to school kick off was great. ...,0
6845,9810,trauma,"Montgomery County, MD",in response to trauma Children of Addicts deve...,0


In [15]:
# how many examples of each class?
train_df.target.value_counts()

0    4342
1    3271
Name: target, dtype: int64

### Classification when data is imbalanced

see: https://www.tensorflow.org/tutorials/structured_data/imbalanced_data?hl=pt-br
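
With 4342 negative and 3271 positive tweets, our classes are only mildly imbalanced, so we won't correct for it here. If we did want to, one common option is inverse-frequency class weights. The snippet below is a minimal sketch of that arithmetic using the counts above; the variable names are our own, and the weighting scheme is one reasonable choice rather than the only one.

```python
# Class counts taken from the value_counts() output above.
class_counts = {0: 4342, 1: 3271}

total = sum(class_counts.values())
n_classes = len(class_counts)

# Inverse-frequency weights: the rarer class gets the larger weight.
class_weight = {label: total / (n_classes * count)
                for label, count in class_counts.items()}

positive_share = class_counts[1] / total
print(f'Positive share: {positive_share:.1%}')
print(class_weight)
```

A dictionary like this could be passed to Keras through the `class_weight` argument of `model.fit()`.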


In [16]:
# how many total samples
train_df.shape[0], test_df.shape[0]

(7613, 3263)

In [19]:
# let's visualize some random training examples
import random

random_index = random.randint(0, train_df.shape[0] - 5) # pick a random start index that leaves room for 5 samples
for row in train_df_shuffled[['text', 'target']][random_index:random_index+5].itertuples():
    _, text, target = row
    print(f'Target: {target}', '(real disaster)' if target > 0 else '(not real disaster)')
    print(f'Text: {text}')
    print('----\n')

Target: 0 (not real disaster)
Text: @chikislizeth08 you're not injured anymore? ??
----

Target: 0 (not real disaster)
Text: WRAPUP 2-U.S. cable TV companies' shares crushed after Disney disappoints http://t.co/jFJLbF40To
----

Target: 0 (not real disaster)
Text: What the fuck was that. There was a loud bang and a flash of light outside. I'm pretty sure I'm not dead but what the hell??
----

Target: 1 (real disaster)
Text: 'Three #people were #killed when a severe #rainstorm in the #Italian #Alps caused a #landslide' http://t.co/hAXJ6Go2ac
----

Target: 1 (real disaster)
Text: RÌ©union Debris Is Almost Surely From Flight 370 Officials Say - New York Times http://t.co/VFbW3NyO9L
----



## Split data into training and validation sets

see: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html


In [152]:
from sklearn.model_selection import train_test_split

# use train_test_split to split the training data into training and validation sets
train_sequences, val_sequences, train_labels, val_labels = train_test_split(train_df_shuffled['text'].to_numpy(),
                                                                           train_df_shuffled['target'].to_numpy(),
                                                                           test_size=0.1, # use 10% of the training data for validation
                                                                           random_state=42)

In [153]:
# check the lengths
train_sequences.shape[0], train_labels.shape[0], val_sequences.shape[0], val_labels.shape[0]

(6851, 6851, 762, 762)
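
To see where 6851 and 762 come from, here is a plain-Python sketch of a shuffled split. It is not scikit-learn's implementation; `split_indices` is a hypothetical helper, and it assumes the validation size is rounded up, which is consistent with the 762 samples above.

```python
import math
import random

def split_indices(n_samples, test_size=0.1, seed=42):
    """Shuffle the sample indices and split them into train and validation parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = math.ceil(n_samples * test_size)  # validation size is rounded up
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = split_indices(7613, test_size=0.1)
print(len(train_idx), len(val_idx))  # 6851 762
```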

In [25]:
# check the first 5 samples
train_sequences[:5], train_labels[:5]

(array(['@mogacola @zamtriossu i screamed after hitting tweet',
        'Imagine getting flattened by Kurt Zouma',
        '@Gurmeetramrahim #MSGDoing111WelfareWorks Green S welfare force ke appx 65000 members har time disaster victim ki help ke liye tyar hai....',
        "@shakjn @C7 @Magnums im shaking in fear he's gonna hack the planet",
        'Somehow find you and I collide http://t.co/Ee8RpOahPk'],
       dtype=object),
 array([0, 0, 1, 0, 0], dtype=int64))

## Converting text into numbers

see: https://www.tensorflow.org/text/guide/word_embeddings?hl=pt-br

When dealing with a text problem, one of the first things you'll have to do before you can build a model is to convert your text to numbers.

There are a few ways to do this, namely:

- `Tokenization`: direct mapping of a token (a token could be a word or a character) to a number

- `Embedding`: create a matrix of feature vectors, one per token (the size of the feature vector can be defined and this embedding can be learned)

```python

# "I Love Tensorflow"

one_hot_encoding = [[1, 0, 0],
                    [0, 1, 0],
                    [0, 0, 1]]

embedding_matrix = [[0.234, 0.2323, 0.34],
                    [0.343, 0.222, 0.333],
                    [0.111, 0.343, 0.999]]
```

`Tokenization`: straight mapping from token to number (can be modelled, but quickly gets too big)  
`Embedding`: richer representation of relationship between tokens (can limit size + can be learned)
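
The three representations can be sketched in a few lines of plain Python. This is only an illustration: the vocabulary is built from a single sentence, and the embedding values are random placeholders standing in for numbers a model would learn.

```python
import random

sentence = 'I love TensorFlow'
tokens = sentence.lower().split()

# Tokenization: map each distinct token to an integer index.
vocab = {token: index for index, token in enumerate(tokens)}
token_ids = [vocab[token] for token in tokens]

# One-hot encoding: one vector per token, as long as the vocabulary.
one_hot = [[1 if i == token_id else 0 for i in range(len(vocab))]
           for token_id in token_ids]

# Embedding: one small dense vector per token. The values here are random
# placeholders; in a model they would be learned during training.
embedding_dim = 2
rng = random.Random(0)
embedding_table = [[rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
                   for _ in range(len(vocab))]
embedded = [embedding_table[token_id] for token_id in token_ids]

print(token_ids)                        # [0, 1, 2]
print(one_hot)                          # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(len(embedded), len(embedded[0]))  # 3 2
```

Note how the one-hot vectors grow with the vocabulary, while the embedding size stays whatever we choose.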

```mermaid
graph TD;
    A["I Love Tensorflow"] --> Number["0 1 2"]
    A --> OneHotEncoder["1, 0, 0<br/>0, 1, 0<br/>0, 0, 1"]
    A --> Embedding["0.234, 0.2323, 0.34<br/>0.343, 0.222, 0.333<br/>0.111, 0.343, 0.999"]
```


In [29]:
import tensorflow as tf

# use the default TextVectorization parameters
text_vectorizer = tf.keras.layers.experimental.preprocessing.TextVectorization(max_tokens=None, # how many words in the vocabulary (None = no limit; an OOV token is added automatically)
                                                                               standardize='lower_and_strip_punctuation',
                                                                               split='whitespace',
                                                                               ngrams=None, # create groups of n words?
                                                                               output_mode='int', # how to map tokens to numbers
                                                                               output_sequence_length=None, # how long do you want your sequences to be?
                                                                               pad_to_max_tokens=False)

## Text Vectorization (Tokenization)

see: https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization  
see: https://keras.io/api/layers/preprocessing_layers/text/text_vectorization/  
see: https://en.wikipedia.org/wiki/Tf%E2%80%93idf  
see: https://monkeylearn.com/blog/what-is-tf-idf/


In [197]:
len(train_sequences[2].split())

20

In [34]:
round(sum([len(i.split()) for i in train_sequences]))

102087

In [33]:
# Find the average number of tokens (words) in the training tweets
# NOTE: this value is used below as the output sequence length (max_length)
round(sum([len(i.split()) for i in train_sequences])/len(train_sequences))

15

In [35]:
# setup text vectorization variables
max_vocab_length = 10000 # max number of words to have in our vocabulary
max_length = 15 # max length our sequences will be (e.g. how many words from a Tweet does a model see?)

In [36]:
text_vectorizer = tf.keras.layers.experimental.preprocessing.TextVectorization(max_tokens=max_vocab_length,
                                                                               standardize='lower_and_strip_punctuation',
                                                                               split='whitespace',
                                                                               ngrams=None,
                                                                               output_mode='int',
                                                                               output_sequence_length=max_length,
                                                                               pad_to_max_tokens=True)

In [37]:
# Fit the text vectorizer to the training text
text_vectorizer.adapt(train_sequences)

In [43]:
# create a sample sequence and tokenize it
sample_sequence = "There's a flood in my street!"
text_vectorizer([sample_sequence])

<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[264,   3, 232,   4,  13, 698,   0,   0,   0,   0,   0,   0,   0,
          0,   0]], dtype=int64)>
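
The trailing `0`s in the output above are padding: the six-word sentence is stretched to `max_length = 15`. The sketch below imitates the steps in plain Python (lowercase, strip punctuation, split on whitespace, look up ids, then truncate or pad). It is an approximation, not TensorFlow's implementation, and `toy_vocab` is a hypothetical vocabulary, so the ids differ from the real ones.

```python
import string

def vectorize(text, vocab, sequence_length):
    """Lowercase, strip punctuation, split on whitespace, map to ids, pad/truncate."""
    cleaned = text.lower().translate(str.maketrans('', '', string.punctuation))
    ids = [vocab.get(token, 1) for token in cleaned.split()]  # 1 = unknown token
    ids = ids[:sequence_length]                                # truncate long inputs
    return ids + [0] * (sequence_length - len(ids))            # pad short inputs with 0

# Hypothetical toy vocabulary; 0 is reserved for padding and 1 for unknown tokens.
toy_vocab = {'a': 2, 'in': 3, 'my': 4, 'flood': 5, 'street': 6}

vectorized = vectorize("There's a flood in my street!", toy_vocab, sequence_length=15)
print(vectorized)
# [1, 2, 5, 3, 4, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

In this toy version "theres" is not in the vocabulary, so it maps to `1`; in the real output it maps to `264` because the word appears in the training tweets.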

In [50]:
# choose random sentence from the training dataset and tokenize it 
random_sentence = random.choice(train_sequences)
print(f'Original Text:\n{random_sentence}\n\nVectorized: {text_vectorizer(random_sentence)}')

Original Text:
kou is like [CASH REGISTER] [BUILDINGS BURNING]

Vectorized: [   1    9   25 6019 2322   95   86    0    0    0    0    0    0    0
    0]


In [51]:
# get the unique words in the vocabulary
words_in_vocab = text_vectorizer.get_vocabulary() # get all the unique words in the training data
top_5_words = words_in_vocab[:5] # get the most common words
bottom_5_words = words_in_vocab[-5:]  # get the least common words

print(f'Number of words in vocabulary: {len(words_in_vocab)}')
print(f'5 most common words: {top_5_words}')
print(f'5 least common words: {bottom_5_words}')

Number of words in vocabulary: 10000
5 most common words: ['', '[UNK]', 'the', 'a', 'in']
5 least common words: ['pages', 'paeds', 'pads', 'padres', 'paddytomlinson1']
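
The first two entries, `''` and `'[UNK]'`, are the padding and unknown-word slots, and they count towards the 10000-token limit. Here is a rough plain-Python sketch of how such a frequency-ordered vocabulary can be built. `build_vocabulary` is a hypothetical helper, and how ties between equally frequent words are ordered is an assumption of this sketch.

```python
from collections import Counter

def build_vocabulary(texts, max_tokens):
    """Return a vocabulary list: padding, unknown, then tokens by frequency."""
    counts = Counter(token for text in texts for token in text.lower().split())
    # Two slots are reserved, so only max_tokens - 2 real tokens fit.
    most_common = [token for token, _ in counts.most_common(max_tokens - 2)]
    return ['', '[UNK]'] + most_common

toy_texts = ['the flood in the street', 'the fire in the forest', 'a fire']
vocabulary = build_vocabulary(toy_texts, max_tokens=5)
print(vocabulary)
```

Every word that does not make the cut ("flood", "street", and so on in this toy example) would be mapped to `[UNK]`.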


## Creating an Embedding using an Embedding Layer

To make our embedding, we're going to use TensorFlow's embedding layer: https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding

The parameters we care most about for our embedding layer:

- `input_dim`: the size of our vocabulary
- `output_dim`: the size of the output embedding vector; for example, a value of 100 would mean each token gets represented by a vector 100 long
- `input_length`: the length of the sequences being passed to the embedding layer


In [58]:
embedding = tf.keras.layers.Embedding(input_dim=max_vocab_length, # size of the vocabulary
                                      output_dim=128,
                                      embeddings_initializer='uniform',
                                      input_length=max_length, # how long is each input
                                      )
embedding

<keras.layers.core.embedding.Embedding at 0x1ed447ce7f0>

In [56]:
# choose random sentence from the training dataset and tokenize it 
random_sentence = random.choice(train_sequences)
print(f'Original Text:\n{random_sentence}\n\nEmbedding:\n{embedding(text_vectorizer([random_sentence]))}')

Original Text:
City of Calgary activates Municipal Emergency Plan - 660 NEWS http://t.co/KFBjVJiVQB http://t.co/BN7Xpzqdm0

Embedding:
[[[-0.00648278  0.01337972 -0.01433364 ...  0.00690647 -0.02951211
   -0.02901908]
  [-0.04939894 -0.03808736 -0.00597149 ...  0.0353103  -0.01542674
    0.01706082]
  [-0.02199901  0.00479995 -0.03400421 ... -0.02642218  0.031021
    0.02360031]
  ...
  [-0.02232008 -0.02389628 -0.04893912 ... -0.00577093  0.0222831
    0.04473752]
  [-0.02232008 -0.02389628 -0.04893912 ... -0.00577093  0.0222831
    0.04473752]
  [-0.02232008 -0.02389628 -0.04893912 ... -0.00577093  0.0222831
    0.04473752]]]
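
Notice that the last rows of the output above are identical. Those positions hold the padding token (`0`), and an embedding layer is essentially a lookup table, so the same id always returns the same row. A minimal sketch with toy sizes and random placeholder weights (all names and numbers here are illustrative):

```python
import random

vocab_size, embedding_dim = 10, 4  # toy sizes, far smaller than 10000 x 128

# The embedding layer's weights: one row per token id.
rng = random.Random(42)
embedding_matrix = [[rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
                    for _ in range(vocab_size)]

# One vectorized sentence: three real tokens followed by padding (id 0).
token_ids = [7, 3, 5, 0, 0, 0]

# The forward pass is just a row lookup per token id.
embedded = [embedding_matrix[token_id] for token_id in token_ids]

print(len(embedded), len(embedded[0]))            # 6 4
print(embedded[3] == embedded[4] == embedded[5])  # True
```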


In [59]:
sample_embedding = embedding(text_vectorizer([random_sentence]))
sample_embedding

<tf.Tensor: shape=(1, 15, 128), dtype=float32, numpy=
array([[[-0.04640205,  0.00171908, -0.03961884, ..., -0.04194402,
          0.04199127,  0.02135679],
        [-0.01387553,  0.01867838, -0.00582208, ...,  0.0170106 ,
          0.03363771,  0.03008801],
        [ 0.04845527, -0.0147807 ,  0.01513132, ..., -0.00106633,
          0.0464595 ,  0.03556531],
        ...,
        [-0.04297487,  0.04677823,  0.01795694, ...,  0.03392654,
         -0.00842122,  0.0277623 ],
        [-0.04297487,  0.04677823,  0.01795694, ...,  0.03392654,
         -0.00842122,  0.0277623 ],
        [-0.04297487,  0.04677823,  0.01795694, ...,  0.03392654,
         -0.00842122,  0.0277623 ]]], dtype=float32)>

In [62]:
sample_embedding[0][0], random_sentence

(<tf.Tensor: shape=(128,), dtype=float32, numpy=
 array([-4.6402048e-02,  1.7190799e-03, -3.9618839e-02,  4.8778508e-02,
        -1.4259994e-02,  4.9060073e-02,  8.7298453e-05, -4.8830178e-02,
         9.7618699e-03,  4.4864416e-03,  3.3057522e-02, -3.7397552e-02,
         4.4677854e-03,  2.5168072e-02,  9.6130855e-03,  2.2334542e-02,
         6.5403581e-03, -4.5781817e-02,  2.4690758e-02, -2.4229145e-02,
         2.0585645e-02,  2.1276500e-02,  2.2293832e-02, -3.1096149e-02,
         4.6419073e-02,  6.8682916e-03, -2.4764612e-04, -1.5066911e-02,
         3.4228329e-02,  3.4954917e-02,  4.6412397e-02,  4.5008671e-02,
         1.6224612e-02,  3.2790009e-02,  6.4618699e-03, -1.9150902e-02,
         1.0181166e-02,  1.5592996e-02,  2.5926899e-02, -9.3286261e-03,
        -1.0728098e-02, -6.5713748e-03, -6.9539621e-04, -3.0418063e-02,
         2.2396017e-02,  6.4811334e-03, -2.0963192e-02, -4.3181431e-02,
         3.9815012e-02,  1.1992704e-02,  2.4409581e-02, -3.5525076e-03,
        -1.0038

## Modelling a text dataset (running a series of experiments)

see: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

0. Model: Naive Bayes with `TF-IDF` encoder (baseline)
1. Model: Feed Forward Neural Network (dense model)
2. Model: `LSTM` (RNN) Long Short-Term Memory Recurrent Neural Network
3. Model: `GRU` (RNN) Gated Recurrent Unit
4. Model: Bidirectional `LSTM` (RNN)
5. Model: 1D Convolutional Neural Network
6. Model: Tensorflow Hub Pretrained Feature Extractor
7. Model: Tensorflow Hub Pretrained Feature Extractor (10% of data)

Now that we've got a way to turn our text sequences into numbers, it's time to start building a series of modelling experiments. We'll start with a baseline and move on from there.

How are we going to approach all of these?

- Create a model
- Build a model
- Fit a model
- Evaluate a model


## Model 0: Getting a baseline

As with all machine learning modelling experiments, it's important to create a baseline model so you've got a benchmark for future experiments to build upon.

To create our baseline, we'll use scikit-learn's Multinomial Naive Bayes, with the TF-IDF formula converting our words to numbers.

> **Note**: It's common practice to use non-DL (non-deep-learning) algorithms as a baseline because of their speed, and then later use DL to see if you can improve upon them.
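
To get a feel for what TF-IDF does, here is a plain-Python sketch. It uses a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation, which follows the defaults described in scikit-learn's documentation, but treat it as an illustration rather than a drop-in replacement. The `tfidf` function and the toy documents are our own.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {token: weight} dict per document (smoothed IDF, L2-normalised)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: in how many documents does each token appear?
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))

    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        weights = {token: count * (math.log((1 + n_docs) / (1 + doc_freq[token])) + 1)
                   for token, count in term_freq.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({token: w / norm for token, w in weights.items()})
    return vectors

toy_docs = ['forest fire near the town', 'the party was fire', 'the flood hit the town']
vectors = tfidf(toy_docs)
for vector in vectors:
    print({token: round(weight, 2) for token, weight in vector.items()})
```

Words that appear in every document (like "the") get a lower weight than words that are specific to one document (like "forest").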


In [63]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

In [64]:
# create tokenization and modelling pipeline
model_0 = Pipeline([ 
    ('tfidf', TfidfVectorizer()),  # convert words to numbers using tf-idf
    ('clf', MultinomialNB()) # model the text
])

# Fit the pipeline to training data
model_0.fit(train_sequences, train_labels)

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])

In [66]:
# Evaluate our baseline model
baseline_score = model_0.score(val_sequences, val_labels)
print(f'Our baseline model achieves an accuracy of: {baseline_score*100:.2f}%')

Our baseline model achieves an accuracy of: 79.27%


In [68]:
# make predictions
baseline_preds = model_0.predict(val_sequences)
baseline_preds[:10]

array([1, 1, 1, 0, 0, 1, 1, 1, 1, 0], dtype=int64)

In [130]:
# create function to evaluate: accuracy, precision, recall, f1-score
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def calculate_results(y_true, y_pred):
    """
    Calculate model accuracy, precision, recall and f1-score of a binary classification model.

    Returns a dictionary with each metric expressed as a percentage.
    """
    # calculate accuracy
    m_accuracy = accuracy_score(y_true, y_pred)

    # calculate model precision, recall and f1-score using 'weighted' average
    m_precision, m_recall, m_f1_score, _ = precision_recall_fscore_support(y_true, y_pred, average='weighted')

    results = {
        'accuracy':     m_accuracy,
        'precision':    m_precision,
        'recall':       m_recall,
        'f1':           m_f1_score}

    # scale every metric from a fraction to a percentage
    return {metric: value * 100 for metric, value in results.items()}

In [131]:
# get baseline results
baseline_results = calculate_results(y_true=val_labels, y_pred=baseline_preds)
baseline_results

{'accuracy': 79.26509186351706,
 'precision': 81.11390004213173,
 'recall': 79.26509186351706,
 'f1': 78.6218975804955}
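
Notice that `recall` and `accuracy` are identical above. That's expected: with `average='weighted'`, each class's recall is weighted by its share of the samples, which works out to the overall fraction of correct predictions. A small sketch with hypothetical toy labels (not taken from our validation set) shows this:

```python
def weighted_recall(y_true, y_pred):
    """Per-class recall averaged with weights equal to each class's support."""
    total = 0.0
    for label in set(y_true):
        support = sum(1 for t in y_true if t == label)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        total += (hits / support) * (support / len(y_true))
    return total

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

# Hypothetical toy labels, not taken from the validation set.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]

print(weighted_recall(y_true, y_pred), accuracy(y_true, y_pred))
```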

## Model 1: A simple dense model

In [181]:
# build model with functional api
inputs = tf.keras.layers.Input(shape=(1,), dtype=tf.string) # inputs are 1-dimensional strings
x = text_vectorizer(inputs) # turn the input text into numbers
x = embedding(x) # create an embedding of the numerized inputs
# x = tf.keras.layers.GlobalAveragePooling1D()(x) # condense the feature vector for each token to one vector
x = tf.keras.layers.GlobalMaxPooling1D()(x)
# x = tf.keras.layers.GlobalAvgPool1D()(x)
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(x) # create the output layer, want binary outputs so use sigmoid activation functions
model_1 = tf.keras.Model(inputs, outputs, name='mode_1_dense')

In [182]:
model_1.summary()

Model: "mode_1_dense"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_8 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization_2 (TextV  (None, 15)               0         
 ectorization)                                                   
                                                                 
 embedding_1 (Embedding)     (None, 15, 128)           1280000   
                                                                 
 global_max_pooling1d_1 (Glo  (None, 128)              0         
 balMaxPooling1D)                                                
                                                                 
 dense_7 (Dense)             (None, 1)                 129       
                                                                 
Total params: 1,280,129
Trainable params: 1,280,129
No
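
The parameter counts in the summary can be checked by hand: the embedding layer stores one 128-long vector per vocabulary entry, and the dense layer has one weight per input feature plus a bias.

```python
vocab_size = 10000     # max_vocab_length
embedding_dim = 128    # output_dim of the embedding layer

embedding_params = vocab_size * embedding_dim  # one 128-long vector per token
dense_params = embedding_dim * 1 + 1           # 128 weights + 1 bias

print(embedding_params)                 # 1280000
print(dense_params)                     # 129
print(embedding_params + dense_params)  # 1280129
```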

In [183]:
# compile the model
model_1.compile(loss=tf.keras.losses.BinaryCrossentropy(),
            optimizer=tf.keras.optimizers.Adam(),
            metrics=['accuracy'])

In [184]:
# create a TensorBoard callback (need to create a new one for each model)
from helper_functions import create_tensorboard_callback

# create a directory to save tensorboard logs
tensorboard_logs = f'{NLP}/tensorboard/logs'

In [185]:
model_1_history = model_1.fit(x=train_sequences,
                              y=train_labels,
                              epochs=5,
                              validation_data=(val_sequences, val_labels),
                              callbacks=[create_tensorboard_callback(dir_name=tensorboard_logs, experiment_name='model_1_dense')])

Saving TensorBoard log files to: ../../storage/nlp/tensorboard/logs/model_1_dense/20240408-223551
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [186]:
# check the results
# GlobalAveragePooling1D    ===>>>      [0.491241455078125, 0.7900262475013733]
# GlobalMaxPooling1D        ===>>>      [0.536510705947876, 0.7952755689620972]
# GlobalAvgPool1D           ===>>>      [0.48092105984687805, 0.7821522355079651]
model_1.evaluate(val_sequences, val_labels)



[0.5994751453399658, 0.808398962020874]

In [188]:
model_1_pred_probs = model_1.predict(val_sequences)
model_1_pred_probs.shape



(762, 1)

In [189]:
# look at a single prediction
model_1_pred_probs[0]

array([0.6496586], dtype=float32)

In [190]:
# look at the first 5
model_1_pred_probs[:5]

array([[0.6496586 ],
       [0.8561691 ],
       [0.99309194],
       [0.03033629],
       [0.09752803]], dtype=float32)

In [192]:
# convert model prediction probabilities to label format
model_1_preds = tf.squeeze(tf.round(model_1_pred_probs))
model_1_preds[:5]

<tf.Tensor: shape=(5,), dtype=float32, numpy=array([1., 1., 1., 0., 0.], dtype=float32)>
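
Rounding a sigmoid output is the same as applying a 0.5 threshold. A plain-Python version of that step, using the five probabilities printed above:

```python
# Prediction probabilities copied from the output above.
pred_probs = [0.6496586, 0.8561691, 0.99309194, 0.03033629, 0.09752803]

threshold = 0.5
pred_labels = [1 if prob > threshold else 0 for prob in pred_probs]
print(pred_labels)  # [1, 1, 1, 0, 0]
```

Writing the threshold out explicitly also makes it easy to experiment with values other than 0.5.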

In [193]:
# calculate our model_1 results
model_1_results = calculate_results(y_true=val_labels, y_pred=model_1_preds)
model_1_results

{'accuracy': 80.83989501312337,
 'precision': 81.01190494960657,
 'recall': 80.83989501312337,
 'f1': 80.69722254427087}

In [194]:
import numpy as np

np.array(list(model_1_results.values())) > np.array(list(baseline_results.values()))

array([ True, False,  True,  True])
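
The boolean array tells us which metrics improved, but not by how much. A small sketch that prints the differences, using the results above rounded to two decimals (in the notebook you could use `baseline_results` and `model_1_results` directly; the `*_rounded` names are just for this example):

```python
# Values rounded from the result dictionaries printed above.
baseline_rounded = {'accuracy': 79.27, 'precision': 81.11, 'recall': 79.27, 'f1': 78.62}
model_1_rounded = {'accuracy': 80.84, 'precision': 81.01, 'recall': 80.84, 'f1': 80.70}

differences = {metric: model_1_rounded[metric] - baseline_value
               for metric, baseline_value in baseline_rounded.items()}

for metric, difference in differences.items():
    print(f'{metric:<10} baseline: {baseline_rounded[metric]:.2f}  '
          f'model_1: {model_1_rounded[metric]:.2f}  diff: {difference:+.2f}')
```

Only precision is (slightly) lower than the baseline, which matches the single `False` above.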

## Visualizing learned embeddings

In [198]:
# get the vocabulary from the text vectorization layer
words_in_vocab = text_vectorizer.get_vocabulary()
len(words_in_vocab), words_in_vocab[:10]

(10000, ['', '[UNK]', 'the', 'a', 'in', 'to', 'of', 'and', 'i', 'is'])

In [200]:
model_1.summary()

Model: "mode_1_dense"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_8 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization_2 (TextV  (None, 15)               0         
 ectorization)                                                   
                                                                 
 embedding_1 (Embedding)     (None, 15, 128)           1280000   
                                                                 
 global_max_pooling1d_1 (Glo  (None, 128)              0         
 balMaxPooling1D)                                                
                                                                 
 dense_7 (Dense)             (None, 1)                 129       
                                                                 
Total params: 1,280,129
Trainable params: 1,280,129
No

In [202]:
# get the weight matrix of embedding layer
# (these are the numerical representations of each token in our training data)
embed_weights = model_1.get_layer('embedding_1').get_weights()
print(embed_weights[0].shape) # same size as vocab size and embedding_dim (output_dim of our embedding layer)

(10000, 128)


In [204]:
# create embedding files (we got this from tensorflow's word embedding documentation)
# see: https://www.tensorflow.org/text/guide/word_embeddings?hl=pt-br
import io

out_v = io.open(f'{NLP}/vectors.tsv', 'w', encoding='utf-8')
out_m = io.open(f'{NLP}/metadata.tsv', 'w', encoding='utf-8')

for index, word in enumerate(words_in_vocab):
  if index == 0:
    continue  # skip 0, it's padding.
  vec = embed_weights[0][index]
  out_v.write('\t'.join([str(x) for x in vec]) + "\n")
  out_m.write(word + "\n")

out_v.close()
out_m.close()

## Recurrent Neural Networks (RNNs)

RNNs are useful for sequence data. The promise of a recurrent neural network is to use the representation of a previous input to aid the representation of a later input.

* MIT's sequence modelling lecture: https://www.youtube.com/watch?v=qjrad0V0uJE
* Chris Olah's intro to LSTMs: https://colah.github.io/posts/2015-08-Understanding-LSTMs/
* Andrej Karpathy's "The Unreasonable Effectiveness of Recurrent Neural Networks": https://karpathy.github.io/2015/05/21/rnn-effectiveness/
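
To make "use the representation of a previous input" concrete, here is a minimal sketch of a single-unit recurrent cell in plain Python. The weights `w_x`, `w_h` and `bias` are fixed, made-up values (a real RNN learns them), and `simple_rnn` is a hypothetical helper, not a Keras function.

```python
import math

def simple_rnn(inputs, w_x=0.5, w_h=0.8, bias=0.0):
    """Run a one-unit recurrent cell over a sequence and return every hidden state."""
    hidden = 0.0
    states = []
    for x in inputs:
        # The new state depends on the current input AND the previous state.
        hidden = math.tanh(w_x * x + w_h * hidden + bias)
        states.append(hidden)
    return states

forward = simple_rnn([1.0, 0.0, -1.0])
backward = simple_rnn([-1.0, 0.0, 1.0])
print(forward[-1], backward[-1])
```

The two sequences contain the same values in a different order, yet the final states differ: the cell is sensitive to order, which is exactly what we want for text.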

<div style="width: 720px; height:450px">
    <img src='https://miro.medium.com/v2/resize:fit:1400/1*3ltsv1uzGR6UBjZ6CUs04A.jpeg' style="width:100%">
</div>

## Model 2: LSTM

LSTM - Long Short-Term Memory (one of the most popular RNN cells)

In [207]:
# create an LSTM model
inputs = tf.keras.layers.Input(shape=(1,), dtype=tf.string)
x = text_vectorizer(inputs)
x = embedding(x)
print(x.shape)
x = tf.keras.layers.LSTM(units=64, return_sequences=True)(x)
print(x.shape)
x = tf.keras.layers.LSTM(units=64)(x)
print(x.shape)
x = tf.keras.layers.Dense(64, activation='relu')(x)
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(x)
model_2 = tf.keras.Model(inputs, outputs, name='model_2_LSTM')


(None, 15, 128)
(None, 15, 64)
(None, 64)
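
The printed shapes follow a simple rule: with `return_sequences=True` a recurrent layer outputs one vector per timestep, otherwise only the vector for the final timestep. Stacking two LSTM layers therefore needs `return_sequences=True` on the first one, so the second still receives a sequence. A sketch of that shape logic (the `recurrent_output_shape` helper is hypothetical and only does the arithmetic, no actual computation):

```python
def recurrent_output_shape(input_shape, units, return_sequences=False):
    """Shape produced by a recurrent layer for (batch, timesteps, features) input."""
    batch, timesteps, _ = input_shape
    if return_sequences:
        return (batch, timesteps, units)  # one vector per timestep
    return (batch, units)                 # only the final timestep's vector

embedding_shape = (None, 15, 128)  # output of the embedding layer
first_lstm_shape = recurrent_output_shape(embedding_shape, units=64, return_sequences=True)
second_lstm_shape = recurrent_output_shape(first_lstm_shape, units=64)

print(first_lstm_shape)   # (None, 15, 64)
print(second_lstm_shape)  # (None, 64)
```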
