# What is NLP?

Natural Language Processing: The ability of a computer program to understand human language as it is spoken and written.

Example: OpenAI GPT-3

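# What is NLU?

Natural Language Understanding: a subfield of NLP concerned with interpreting the intended meaning of language, not just its literal form.

What is the difference? Consider the sentence:

"Please crack the car window, it is getting hot."

* A purely literal reading (NLP without understanding) takes "crack" to mean breaking the window.
* NLU recognizes the idiom and understands the request to open the window slightly.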

# Sequence Problems
1. one to one
2. one to many
3. many to one
4. many to many
5. many to many synchronized

Use Cases:
* Classification
* Machine Translation
* Text Generation
* Voice Assistants


## What is an RNN

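A Recurrent Neural Network (RNN) is a type of neural network designed for sequential data. It processes a sequence one element at a time and carries a hidden state forward, which lets it learn tasks such as predicting the next element in a sequence.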

## Architecture of an RNN

A typical RNN text model, together with the steps to build and train it, looks like this:
1. Input Layer
2. Text vectorization layer
3. Embedding layer
4. RNN Cell: LSTM layer
    * Tanh activation function
5. Hidden Activation layer
6. Output Layer
    * Sigmoid
7. Creation of Model
8. Compile
9. Fit


In [None]:
!nvidia-smi -L

In [None]:
# Helper Functions
from _helper_functions import unzip_data, create_tensorboard_callback, plot_loss_curves, compare_historys, walk_through_dir

## Get a text dataset

The dataset we're going to be using is Kaggle's introduction to NLP dataset (text samples of Tweets labelled as disaster or not disaster).

[Source](https://www.kaggle.com/c/nlp-getting-started)

In [None]:
!wget -nc -P ../Downloads/ https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip

In [None]:
# Unzip
unzip_data('../Downloads/nlp_getting_started.zip', '../Downloads/08_NLP')

# Walkthrough dir
walk_through_dir('../Downloads/08_NLP')

## Visualizing a text dataset

To visualize text samples, we first need to read them in. Two options are Python's built-in file reading and pandas; we'll use pandas.

In [None]:
import pandas as pd

train_df = pd.read_csv('../Downloads/08_NLP/train.csv')
test_df = pd.read_csv('../Downloads/08_NLP/test.csv')
train_df.head()

In [None]:
# Shuffle training dataframe
train_df_shuffled = train_df.sample(frac=1, random_state=42)
train_df_shuffled.head()

In [None]:
# Check test dataframe
test_df.head()

In [None]:
# How many examples of each class?
train_df.target.value_counts()
# 0 = not disaster, 1 = disaster

In [None]:
# How many total samples?
len(train_df), len(test_df)

In [None]:
# Visualize random training examples
import random

random_index = random.randint(0, len(train_df) - 5)
for row in train_df_shuffled[["text", "target"]][random_index:random_index + 5].itertuples():
    _, text, target = row
    print(f"Target: {target}", "(real disaster)" if target == 1 else "(not disaster)")
    print(f"Text:\n{text}\n")
    print("---\n")

### Split data into training and validation sets

In [None]:
from sklearn.model_selection import train_test_split

train_sentences, val_sentences, train_labels, val_labels = train_test_split(train_df_shuffled["text"].to_numpy(),
                                                                            train_df_shuffled["target"].to_numpy(),
                                                                            test_size=0.1, 
                                                                            random_state=42)

In [None]:
# Check lengths

len(train_sentences), len(train_labels), len(val_sentences), len(val_labels)

In [None]:
# Check first 10
train_sentences[:10], train_labels[:10]

### Converting text into numbers

Tokenization vs Embedding

In NLP, there are two main concepts for turning text into numbers:
* **Tokenization** - A straight mapping from word or character or sub-word to a numerical value. There are three main levels of tokenization:
  1. Using **word-level tokenization** with the sentence "I love TensorFlow" might result in "I" being `0`, "love" being `1` and "TensorFlow" being `2`. In this case, every word in a sequence is considered a single **token**.
  2. **Character-level tokenization**, such as converting the letters A-Z to values `1-26`. In this case, every character in a sequence is considered a single **token**.
  3. **Sub-word tokenization** is in between word-level and character-level tokenization. It involves breaking individual words into smaller parts and then converting those smaller parts into numbers. For example, "my favourite food is pineapple pizza" might become "my, fav, avour, rite, fo, oo, od, is, pin, ine, app, le, piz, za". After doing this, these sub-words would then be mapped to a numerical value. In this case, every word could be considered multiple **tokens**.
* **Embeddings** - An embedding is a representation of natural language which can be learned. Representation comes in the form of a **feature vector**. For example, the word "dance" could be represented by the 5-dimensional vector `[-0.8547, 0.4559, -0.3332, 0.9877, 0.1112]`. It's important to note here, the size of the feature vector is tuneable. There are two ways to use embeddings: 
  1. **Create your own embedding** - Once your text has been turned into numbers (required for an embedding), you can put them through an embedding layer (such as [`tf.keras.layers.Embedding`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding)) and an embedding representation will be learned during model training.
  2. **Reuse a pre-learned embedding** - Many pre-trained embeddings exist online. These pre-trained embeddings have often been learned on large corpora of text (such as all of Wikipedia) and thus have a good underlying representation of natural language. You can use a pre-trained embedding to initialize your model and fine-tune it to your own specific task.
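
As a rough illustration of the word-level and character-level mappings described above, here is a minimal pure-Python sketch. The helpers `build_vocab` and `encode` are hypothetical names made up for this example (they are not TensorFlow functions), and ids are assigned in order of first appearance, which is only one of many possible schemes.

```python
# Minimal sketch of word-level and character-level tokenization (pure Python).
# build_vocab and encode are hypothetical helpers, not part of any library.

def build_vocab(tokens):
    """Map each unique token to an integer id, in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Replace each token with its integer id."""
    return [vocab[token] for token in tokens]


sentence = "I love TensorFlow"

# Word-level: every whitespace-separated word is one token
word_tokens = sentence.split()
word_vocab = build_vocab(word_tokens)
word_ids = encode(word_tokens, word_vocab)
print(word_vocab)  # {'I': 0, 'love': 1, 'TensorFlow': 2}
print(word_ids)    # [0, 1, 2]

# Character-level: every character (spaces included) is one token
char_tokens = list(sentence.lower())
char_vocab = build_vocab(char_tokens)
char_ids = encode(char_tokens, char_vocab)
print(len(word_ids), len(char_ids))  # 3 17
```

The same sentence becomes 3 tokens at word level and 17 at character level, which is the trade-off between the two: a larger vocabulary with short sequences versus a tiny vocabulary with long sequences.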


![](https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/images/08-tokenization-vs-embedding.png)
*Example of **tokenization** (straight mapping from word to number) and **embedding** (richer representation of relationships between tokens).*

Which level of tokenization or embedding to use depends on your problem. You could try character-level tokenization/embeddings and word-level tokenization/embeddings and see which perform best. You might even want to try stacking them (e.g. combining the outputs of your embedding layers using [`tf.keras.layers.concatenate`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/concatenate)). 

If you're looking for pre-trained word embeddings, [Word2vec embeddings](http://jalammar.github.io/illustrated-word2vec/), [GloVe embeddings](https://nlp.stanford.edu/projects/glove/) and many of the options available on [TensorFlow Hub](https://tfhub.dev/s?module-type=text-embedding) are great places to start.

In [None]:
import tensorflow as tf
import numpy as np
from tensorflow.keras.layers import TextVectorization


sent_lens = [len(sentence.split()) for sentence in train_sentences] 
 
# Use the 95th percentile of sentence lengths (in words) as the max length
max_len = int(np.percentile(sent_lens, 95))

print('max len:', max_len)

In [None]:
# Find the average number of tokens (words) in the training tweets
avg_tokens = round(sum(len(i.split()) for i in train_sentences) / len(train_sentences))
print('avg tokens per tweet:', avg_tokens)

# Setup text vectorization variables
max_vocab_length = 10000 # max number of words to have in our vocab

text_vectorizer = TextVectorization(
                    max_tokens=max_vocab_length, # max size of the vocabulary (includes the padding and OOV tokens)
                    standardize="lower_and_strip_punctuation", # lowercase the text and strip punctuation
                    split="whitespace", # split text into words via whitespace
                    ngrams=None, # optionally create groups of n words (None = single words only)
                    output_mode="int", # map each token to an integer index
                    output_sequence_length=max_len, # pad or truncate every output sequence to this length
                    pad_to_max_tokens=True, # documented as applying only to "multi_hot", "count" and "tf_idf" modes
                )
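
To make the effect of `output_sequence_length` concrete, here is a small pure-Python sketch of padding and truncating token id sequences to a fixed length. The function `pad_or_truncate` is a hypothetical helper written for this illustration; it assumes padding with `0` at the end of the sequence and truncation from the end, which is how the layer is understood to behave in `"int"` mode.

```python
# Sketch: force every sequence of token ids to the same length.
# pad_or_truncate is a hypothetical helper, not a TensorFlow function.

def pad_or_truncate(token_ids, output_sequence_length, pad_value=0):
    """Cut sequences that are too long and right-pad those that are too short."""
    trimmed = token_ids[:output_sequence_length]
    padding = [pad_value] * (output_sequence_length - len(trimmed))
    return trimmed + padding


short_sequence = [12, 7, 93]
long_sequence = [5, 8, 2, 41, 9, 3, 77]

print(pad_or_truncate(short_sequence, 5))  # [12, 7, 93, 0, 0]
print(pad_or_truncate(long_sequence, 5))   # [5, 8, 2, 41, 9]
```

Choosing the 95th percentile of sentence lengths as the target length means only about 5% of tweets lose tokens to truncation, while the rest are padded.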

In [None]:
tf.config.list_physical_devices('GPU')


In [None]:
# Fit the text vectorizer to the training text
# (adapt() builds the vocabulary in place and returns None, so there is nothing to assign)
text_vectorizer.adapt(train_sentences)

In [None]:
# Create a sample sentence and tokenize it

sample_sentence = "There's a flood in my street!"
text_vectorizer([sample_sentence])

In [None]:
# Choose a random sentence from the training dataset and tokenize it
random_index = random.randint(0, len(train_sentences) - 1)
random_sentence = train_sentences[random_index]
vectorized_sentence = text_vectorizer([random_sentence])
print(f'original sentence: {random_sentence}\nVectorized: {vectorized_sentence[0]}')

In [None]:
# Get the unique words in the vocab

words_in_vocab = text_vectorizer.get_vocabulary() # get all of the unique tokens learned from our training data
top_5_words = words_in_vocab[:5] # first 5 entries: the padding token '', '[UNK]', then the most common words
bottom_5_words = words_in_vocab[-5:] # last 5 entries (least common words kept in the vocab)

print(f'Number of words in vocab: {len(words_in_vocab)}\nTop 5 words: {top_5_words}\nBottom 5 words: {bottom_5_words}')

### Creating an Embedding using an Embedding Layer

The main parameters of the embedding layer are:

1. `input_dim` = the size of our vocab
2. `output_dim` = the size of our output embedding vector
3. `input_length` = the length of our input sequence
4. `mask_zero` = whether or not to mask zero values in our input sequence
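
Conceptually, an embedding layer is a lookup table with `input_dim` rows and `output_dim` columns. The sketch below illustrates that idea in pure Python with tiny sizes; `embedding_matrix` and `embed` are hypothetical names for this example, and the small random values only stand in for weights that a real layer would learn during training.

```python
import random

random.seed(42)

input_dim = 10   # vocabulary size (the notebook uses 10000)
output_dim = 4   # embedding vector size (the notebook uses 128)

# One row of output_dim numbers per token id; a real layer learns these values
embedding_matrix = [
    [random.uniform(-0.05, 0.05) for _ in range(output_dim)]
    for _ in range(input_dim)
]


def embed(token_ids):
    """Look up the embedding vector for every token id in the sequence."""
    return [embedding_matrix[token_id] for token_id in token_ids]


token_ids = [3, 7, 0, 0]  # a tokenized sentence, zero-padded to length 4
embedded = embed(token_ids)

print(len(embedded), len(embedded[0]))  # 4 4
print(embedded[2] == embedded[3])       # True
```

A sequence of `input_length` token ids therefore becomes an `input_length x output_dim` matrix, and identical ids (such as the padding zeros) always map to the same vector.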

In [None]:
from tensorflow.keras import layers

embedding = layers.Embedding(
                            input_dim=max_vocab_length, #set input shape
                            output_dim=128, #set output shape
                            input_length=max_len #set input length
                            )
embedding                       

In [None]:
# Get a random sentence from the training set
random_sentence = random.choice(train_sentences)

print(f'original sentence: {random_sentence}')

# Embed the random sentence (turn it into dense vectors)

sample_embed = embedding(text_vectorizer([random_sentence]))
sample_embed

In [None]:
# Check out a single token's embedding, alongside the first word of the sentence
sample_embed[0][0], sample_embed[0][0].shape, random_sentence.split()[0]

## Modelling a text dataset

We'll run a series of modelling experiments:

* **Model 0**: Naive Bayes (baseline)
* **Model 1**: Feed-forward neural network (dense model)
* **Model 2**: LSTM model
* **Model 3**: GRU model
* **Model 4**: Bidirectional-LSTM model
* **Model 5**: 1D Convolutional Neural Network
* **Model 6**: TensorFlow Hub Pretrained Feature Extractor
* **Model 7**: Same as model 6 with 10% of training data

Each experiment will go through the following steps:
* Construct the model
* Train the model
* Make predictions with the model
* Track prediction evaluation metrics for later comparison


### Model 0: Getting a baseline

Our baseline is scikit-learn's Multinomial Naive Bayes classifier, with the TF-IDF formula used to convert our words into numbers.
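
To give a feel for what TF-IDF does, here is a small pure-Python sketch on three made-up sentences. It assumes the smoothed IDF variant, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization, which is how scikit-learn documents its default settings; the simple `split()` tokenization and the example sentences are simplifications for illustration only.

```python
import math
from collections import Counter

# Three made-up documents, already lowercased
docs = [
    "forest fire near the town",
    "the fire alarm is a joke",
    "i love the new album",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def idf(term):
    """Smoothed inverse document frequency (assumed default variant)."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1


def tfidf(tokens):
    """Term count times IDF, scaled so the vector has unit L2 norm."""
    counts = Counter(tokens)
    weights = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


vector = tfidf(tokenized[0])
print(vector["the"] < vector["fire"] < vector["forest"])  # True
```

Words that appear in every document ("the") receive the lowest weight, while words unique to one document ("forest") receive the highest, which is what lets a simple classifier focus on distinctive terms.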

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Create tokenization and modelling pipeline
model_0 = Pipeline([
    ("tfidf", TfidfVectorizer()), # convert words to numbers using tfidf
    ("clf", MultinomialNB()), # model the text using a naive bayes classifier
])

# Fit the pipeline to the training data
model_0.fit(train_sentences, train_labels)

In [None]:
# Evaluate our baseline model
baseline_score =  model_0.score(val_sentences, val_labels)
print(f'Baseline accuracy score: {baseline_score * 100:.2f}%')

In [None]:
# Make predictions

baseline_preds = model_0.predict(val_sentences)
baseline_preds[:20]

#### Evaluating the model
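
The `calculate_results` helper is imported from a separate module that is not shown in this notebook, so its exact behaviour is not reproduced here. As a rough guide to the kind of metrics typically compared for a binary classifier, the sketch below computes accuracy, precision, recall and F1 for the positive class in pure Python. The function `binary_metrics` and the example labels are hypothetical, and the helper may average its metrics differently.

```python
# Sketch of common binary classification metrics (positive class = 1).
# binary_metrics is a hypothetical stand-in, not the notebook's helper.

def binary_metrics(y_true, y_pred):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


example_true = [1, 0, 1, 1, 0, 0, 1, 0]
example_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(binary_metrics(example_true, example_pred))
# {'accuracy': 0.75, 'precision': 0.75, 'recall': 0.75, 'f1': 0.75}
```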

In [None]:
from _helper_functions import calculate_results

In [None]:
baseline_results = calculate_results(y_true=val_labels, y_pred=baseline_preds)
baseline_results

### Model 1: Feed forward network (dense model)


In [None]:
# Create a tensorboard callback
from _helper_functions import create_tensorboard_callback

# Create a dir to save the tensorboard logs
SAVE_DIR = "../training_logs"

In [None]:
# Build model with the Functional API
from tensorflow.keras import layers
inputs = layers.Input(shape=(1,), dtype="string") # inputs are 1-dimensional strings
x = text_vectorizer(inputs) # turn the input text into numbers
x = embedding(x) # create an embedding of the tokenized text
# x = layers.Dense(128, activation="relu")(x) # optionally add a dense layer with 128 hidden units
# Average the token embeddings so each sample becomes a single vector
x = layers.GlobalAveragePooling1D()(x)
x = layers.Dense(1, activation="sigmoid")(x) # output layer: one unit giving the probability of "disaster"
model_1 = tf.keras.Model(inputs=inputs, outputs=x, name='model_1_dense') # create the model
model_1.summary()
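
The pooling step in the model above reduces a sequence of token vectors to one vector per sample. Here is a minimal pure-Python sketch of that averaging; `global_average_pooling_1d` is a hypothetical function written for illustration, and it ignores masking of padded positions for simplicity.

```python
# Sketch: average a (tokens x features) matrix over the token axis.
# global_average_pooling_1d is a hypothetical helper, not the Keras layer.

def global_average_pooling_1d(sequence):
    """Return the per-feature mean across all token vectors."""
    n_tokens = len(sequence)
    n_features = len(sequence[0])
    return [
        sum(token[j] for token in sequence) / n_tokens
        for j in range(n_features)
    ]


# 3 tokens, each with a 2-dimensional embedding
token_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = global_average_pooling_1d(token_vectors)
print(pooled)  # [3.0, 4.0]
```

Without this step the final dense layer would produce one prediction per token rather than one per tweet.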

In [None]:
# Compile model
model_1.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

In [None]:
# Fit the model
model_1_history = model_1.fit(train_sentences,
                              train_labels,
                              epochs=5,
                              validation_data=(val_sentences, val_labels),
                              callbacks=[create_tensorboard_callback(dir_name=SAVE_DIR,
                                                                     experiment_name="model_1_dense")])


In [None]:
model_1.evaluate(val_sentences,val_labels)

In [None]:
model_1_pred_probs = model_1.predict(val_sentences)
print(model_1_pred_probs.shape)
print(model_1_pred_probs[:10])


In [None]:
# Convert model prediction probabilities to label format (0 or 1)
# Round to 0 or 1, then squeeze to remove the trailing dimension of size 1
model_1_preds = tf.squeeze(tf.round(model_1_pred_probs))
model_1_preds[:10]

In [None]:
# Calc model_1 results
model_1_results = calculate_results(y_true=val_labels, y_pred=model_1_preds)
model_1_results


In [None]:
baseline_results

In [None]:
import numpy as np
np.array(list(model_1_results.values())) > np.array(list(baseline_results.values()))

## Visualizing Learned Embeddings

In [None]:
# Get the vocab from text vectorization layer
words_in_vocab = text_vectorizer.get_vocabulary() # get all of the unique words in our training data
len(words_in_vocab), words_in_vocab[:10]


In [None]:
model_1.summary()

In [None]:
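# Get the weight matrix of the embedding layer
# (look the layer up by its actual name, since auto-generated names like 'embedding_1' vary between sessions)
embedded_weights = model_1.get_layer(embedding.name).get_weights()[0]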
embedded_weights.shape

### Word Embedding Projector
Visualize on embedding projector: https://projector.tensorflow.org/

In [None]:
# Write the embedding vectors and their words to TSV files for the projector
with open('../extras/vectors.tsv', 'w', encoding='utf-8') as out_v, \
     open('../extras/metadata.tsv', 'w', encoding='utf-8') as out_m:
    for index, word in enumerate(words_in_vocab):
        if index == 0:
            continue  # skip 0, it's padding.
        vec = embedded_weights[index]
        out_v.write('\t'.join([str(x) for x in vec]) + "\n")
        out_m.write(word + "\n")

In [None]:
model_1.predict(['Excellent do dry news ash hungry'])

## Recurrent Neural Networks (RNNs)

RNNs are useful for sequence data.
They use the representation of a previous input to aid the representation of a later input.
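
The idea of carrying a representation from one input to the next can be sketched with a single-number hidden state. The code below is a toy recurrence in pure Python, assuming scalar inputs and hand-picked weights (`w_x`, `w_h`, `b` are hypothetical values); real RNN layers use weight matrices that are learned during training.

```python
import math

# Toy recurrent step: the new state depends on the current input
# AND on the previous state. All weights here are made-up scalars.

def rnn_step(x_t, h_prev, w_x, w_h, b):
    return math.tanh(w_x * x_t + w_h * h_prev + b)


def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Run the recurrence over a sequence and return every hidden state."""
    h = 0.0  # initial hidden state
    states = []
    for x_t in inputs:
        h = rnn_step(x_t, h, w_x, w_h, b)
        states.append(h)
    return states


states_a = run_rnn([1.0, 0.0, 0.0])
states_b = run_rnn([0.0, 0.0, 0.0])

# The last two inputs are identical, yet the final states differ,
# because the first input is still "remembered" in the hidden state.
print(states_a[-1] != states_b[-1])  # True
```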


In [None]:
train_sentences[:5]

### Model 2: LSTM

LSTM = long short-term memory (one of the most popular RNN cells)

`Input (text) -> Tokenize -> Embedding -> Layers (RNNs/dense) -> Output (label probability)`

In [None]:
# Create an LSTM model
from tensorflow.keras import layers

inputs = layers.Input(shape=(1, ), dtype="string") # inputs are 1-dimensional strings
x = text_vectorizer(inputs) # turn the input text into numbers
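# Note: this is the same `embedding` layer object used in model_1, so its weights are shared between the models
x = embedding(x) # create an embedding of the tokenized text
# print(x.shape)
# x = layers.LSTM(64, return_sequences=True)(x) # return the full sequence when stacking recurrent layers
# print(x.shape)
x = layers.LSTM(64)(x) # LSTM layer with 64 units; returns only the final hidden state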
# print(x.shape)
# x = layers.Dense(64, activation="relu")(x) # optionally add a dense layer on top of the LSTM layer
# print(x.shape)
outputs = layers.Dense(1, activation="sigmoid")(x)
model_2 = tf.keras.Model(inputs, outputs, name='model_2_LSTM') # create the model

In [None]:
# Compile
model_2.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

In [None]:
# Fit
model_2_history = model_2.fit(train_sentences,
                                train_labels,
                                epochs=5,
                                validation_data=(val_sentences, val_labels),
                                callbacks=[
                                    create_tensorboard_callback(
                                        dir_name=SAVE_DIR,
                                        experiment_name="model_2_LSTM"
                                    )
                                ]
                            )


In [None]:
# Make predictions with LSTM model
model_2_pred_probs = model_2.predict(val_sentences)
model_2_pred_probs[:10]

In [None]:
# Convert model 2 pred probs to labels
model_2_preds = tf.squeeze(tf.round(model_2_pred_probs))
model_2_preds[:10]

In [None]:
# Calc model_2 results
model_2_results = calculate_results(y_true=val_labels, y_pred=model_2_preds)
model_2_results, baseline_results

### Model 3: GRU

The gated recurrent unit (GRU) is a gated RNN cell similar to the LSTM, but with fewer parameters.
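
To put a number on "fewer parameters", the sketch below counts trainable weights for one LSTM layer and one GRU layer using the standard textbook formulas: four gate-like weight sets for the LSTM and three for the GRU. The functions `lstm_params` and `gru_params` are hypothetical helpers; the `reset_after` flag assumes the GRU variant that keeps two bias vectors per gate, which is understood to be the default in recent Keras versions, so treat the exact GRU figure as an assumption to check against `model_3.summary()`.

```python
# Sketch: parameter counts for a single recurrent layer.
# lstm_params and gru_params are hypothetical helpers.

def lstm_params(input_dim, units):
    """4 weight sets, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)


def gru_params(input_dim, units, reset_after=True):
    """3 weight sets; the reset_after variant keeps two bias vectors per set."""
    biases = 2 * units if reset_after else units
    return 3 * (units * (input_dim + units) + biases)


# Same sizes as the notebook: 128-dimensional embeddings, 64 units
print(lstm_params(128, 64))  # 49408
print(gru_params(128, 64))   # 37248
```

With these sizes the GRU has roughly three quarters of the LSTM's parameters, because it uses three weight sets instead of four.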

In [None]:
# GRU Model
from tensorflow.keras import layers

inputs = layers.Input(shape=(1, ), dtype="string") # inputs are 1-dimensional strings
x = text_vectorizer(inputs) # turn the input text into numbers
x = embedding(x) # create an embedding of the tokenized text
x = layers.GRU(64)(x) # GRU layer with 64 units; returns only the final hidden state
# x = layers.LSTM(64, return_sequences=True)(x)
# x = layers.GRU(64)(x)
# x = layers.Dense(64, activation="relu")(x)
# x = layers.GlobalAveragePooling1D()(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model_3 = tf.keras.Model(inputs, outputs, name='model_3_GRU') # create the model
model_3.summary()

In [None]:
# Compile
model_3.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])
                

In [None]:
model_3.fit(train_sentences,
            train_labels,
            epochs=5,
            validation_data=(val_sentences, val_labels),
            callbacks=[
                create_tensorboard_callback(
                    dir_name=SAVE_DIR,
                    experiment_name="model_3_GRU"
                )
            ]
            )

In [None]:
# Make predictions with GRU model
model_3_pred_probs = model_3.predict(val_sentences)
model_3_pred_probs[:10]

model_3_preds = tf.squeeze(tf.round(model_3_pred_probs))
model_3_preds[:10]

In [None]:
# Calc model_3 results
model_3_results = calculate_results(y_true=val_labels, y_pred=model_3_preds)
model_results = [baseline_results, model_1_results, model_2_results, model_3_results]

for i, result in enumerate(model_results):
    print(f'{i}: ', result["accuracy"])

### Model 4: Bidirectional RNN

A bidirectional RNN combines two RNNs that process the sequence in opposite directions: one from the beginning to the end, and the other from the end to the beginning. This lets the learned representation draw on context from both earlier and later tokens, rather than only the ones that came before.
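
The two-direction idea can be sketched with a toy scalar recurrence: run it once left to right, once right to left, and concatenate the two final states. The functions `final_state` and `bidirectional` and all weight values are hypothetical; a real bidirectional layer uses learned weight matrices, with a separate set for each direction.

```python
import math

# Toy bidirectional pass with scalar states and made-up weights.

def final_state(inputs, w_x, w_h):
    """Run a simple tanh recurrence and return the last hidden state."""
    h = 0.0
    for x_t in inputs:
        h = math.tanh(w_x * x_t + w_h * h)
    return h


def bidirectional(inputs, forward_weights, backward_weights):
    """Concatenate the final states of a forward and a backward pass."""
    forward = final_state(inputs, *forward_weights)
    backward = final_state(inputs[::-1], *backward_weights)
    return [forward, backward]


sequence = [1.0, 0.0, 0.0]
output = bidirectional(sequence,
                       forward_weights=(0.5, 0.8),
                       backward_weights=(0.3, 0.6))
print(len(output))  # 2
```

Concatenation doubles the output size, which is why wrapping a 64-unit LSTM in a bidirectional layer yields a 128-dimensional output.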

In [None]:
# Bidirectional Model
from tensorflow.keras import layers

inputs = layers.Input(shape=(1, ), dtype="string") # inputs are 1-dimensional strings
x = text_vectorizer(inputs)
x = embedding(x)
# x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(64))(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model_4 = tf.keras.Model(inputs, outputs, name="model_4_bidirectional")
model_4.summary()

In [None]:
# Compile
model_4.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])
                

In [None]:
# Fit
model_4_history = model_4.fit(train_sentences,
                                train_labels,
                                epochs=5,
                                validation_data=(val_sentences, val_labels),
                                callbacks=[
                                    create_tensorboard_callback(
                                        dir_name=SAVE_DIR,
                                        experiment_name="model_4_bidirectional"
                                    )
                                ]
                            )
                            

In [None]:
# Make predictions with bidirectional model
model_4_pred_probs = model_4.predict(val_sentences)
model_4_pred_probs[:10]

model_4_preds = tf.squeeze(tf.round(model_4_pred_probs))
model_4_preds[:10]

In [None]:
# Calc model_4 results
model_4_results = calculate_results(y_true=val_labels, y_pred=model_4_preds)
model_results = [baseline_results, model_1_results, model_2_results, model_3_results, model_4_results]

for i, result in enumerate(model_results):
    print(f'{i}: ', result["accuracy"])

### Model 5: Conv1D Model
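
A 1D convolution slides a small window (the kernel) along the token sequence and produces one value per window position. The pure-Python sketch below shows this for a single input channel and a single filter with no padding, followed by a ReLU; `conv1d_valid` and the example numbers are hypothetical, and a real layer applies many learned filters across all embedding dimensions at once.

```python
# Sketch: single-channel, single-filter 1D convolution without padding.
# conv1d_valid is a hypothetical helper, not the Keras layer.

def conv1d_valid(sequence, kernel, bias=0.0):
    """Slide the kernel over the sequence; apply ReLU to each window sum."""
    kernel_size = len(kernel)
    outputs = []
    for start in range(len(sequence) - kernel_size + 1):
        window = sequence[start:start + kernel_size]
        total = sum(w * x for w, x in zip(kernel, window)) + bias
        outputs.append(max(0.0, total))  # ReLU activation
    return outputs


sequence = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
kernel = [0.2] * 5  # kernel_size=5, like the model in this section

features = conv1d_valid(sequence, kernel)
print(len(features))  # 3
```

With no padding, the output has `sequence_length - kernel_size + 1` positions, so a kernel of size 5 looks at 5 neighbouring tokens at a time and shortens the sequence by 4.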

In [None]:
# Conv1D Model
from tensorflow.keras import layers

inputs = layers.Input(shape=(1, ), dtype="string") # inputs are 1-dimensional strings
x = text_vectorizer(inputs)
x = embedding(x)
# x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
x = layers.Conv1D(64, kernel_size=5, activation="relu")(x)
x = layers.GlobalAveragePooling1D()(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model_5 = tf.keras.Model(inputs, outputs, name="model_5_conv1d")
model_5.summary()

In [None]:
# Compile and fit
model_5.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])
                
model_5.fit(train_sentences,
                train_labels,
                epochs=5,
                validation_data=(val_sentences, val_labels),
                callbacks=[
                    create_tensorboard_callback(
                        dir_name=SAVE_DIR,
                        experiment_name="model_5_conv1d"
                    )
                ]
            )
