<a href="https://colab.research.google.com/github/SaketMunda/introduction-to-nlp/blob/master/exercises_nlp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercises on Natural Language Processing with TensorFlow

- [x] Rebuild, compile and train model_1, model_2 and model_5 using the [Keras Sequential API](https://www.tensorflow.org/api_docs/python/tf/keras/Sequential) instead of the Functional API.
- [x] Retrain the baseline model with 10% of the training data. How does it perform compared to the Universal Sentence Encoder model with 10% of the training data?
- [ ] Try fine-tuning the TF Hub Universal Sentence Encoder model by setting `trainable=True` when instantiating it as a Keras layer.

```
# We can use this encoding layer in place of our text_vectorizer and embedding layer
sentence_encoder_layer = hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder/4",
                                        input_shape=[],
                                        dtype=tf.string,
                                        trainable=True) # turn training on to fine-tune the TensorFlow Hub model
```
- [ ] Retrain the best model you've got so far on the whole training set (no validation split). Then use this trained model to make predictions on the test dataset and format the predictions to match the sample_submission.csv file from Kaggle (see the Files tab in Colab for what the sample_submission.csv file looks like). Once you've done this, [make a submission to the Kaggle competition](https://www.kaggle.com/c/nlp-getting-started/data). How did your model perform?
- [ ] Combine the ensemble predictions using the majority vote (mode). How does this perform compared to averaging the prediction probabilities of each model?
- [ ] Make a confusion matrix with the best performing model's predictions on the validation set and the validation ground truth labels.
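
The difference between the two ensemble strategies in the exercise list can be sketched in plain Python. The probabilities below are made-up illustrative values, not outputs of any model in this notebook, and the function names are hypothetical.

```python
# Hypothetical prediction probabilities from three models for five tweets
# (illustrative values only, not results from this notebook).
model_probs = [
    [0.90, 0.40, 0.20, 0.60, 0.45],  # model A
    [0.80, 0.30, 0.60, 0.70, 0.40],  # model B
    [0.70, 0.90, 0.10, 0.30, 0.95],  # model C
]

def majority_vote(probs_per_model, threshold=0.5):
    """Round each model's probability to a label, then keep the most common label."""
    labels_per_model = [[int(p >= threshold) for p in probs] for probs in probs_per_model]
    n_models = len(labels_per_model)
    # With an odd number of models and binary labels there are no ties.
    return [int(sum(votes) > n_models / 2) for votes in zip(*labels_per_model)]

def mean_probability(probs_per_model, threshold=0.5):
    """Average the probabilities across models, then round the average to a label."""
    n_models = len(probs_per_model)
    return [int(sum(ps) / n_models >= threshold) for ps in zip(*probs_per_model)]

vote_preds = majority_vote(model_probs)
mean_preds = mean_probability(model_probs)
print("majority vote:", vote_preds)
print("mean of probs:", mean_preds)
```

The two strategies disagree whenever one very confident model outweighs two mildly confident ones, which is the case for the second and fifth tweets here.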

## Get ready with the Data

In [1]:
# First checking the prerequisite for this exercise notebook
!nvidia-smi -L

GPU 0: Tesla T4 (UUID: GPU-148bce30-f304-adb8-29c6-9fc207b2b92f)


In [2]:
# import the helper function
!wget https://raw.githubusercontent.com/SaketMunda/ml-helpers/master/helper_functions.py

from helper_functions import unzip_data, create_tensorboard_callback

--2023-02-27 03:27:29--  https://raw.githubusercontent.com/SaketMunda/ml-helpers/master/helper_functions.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2904 (2.8K) [text/plain]
Saving to: ‘helper_functions.py’


2023-02-27 03:27:29 (43.2 MB/s) - ‘helper_functions.py’ saved [2904/2904]



In [3]:
# import the dataset of tweets
!wget https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip

# unzip the data
unzip_data('nlp_getting_started.zip')

--2023-02-27 03:27:53--  https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 209.85.147.128, 142.250.125.128, 142.250.136.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|209.85.147.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 607343 (593K) [application/zip]
Saving to: ‘nlp_getting_started.zip’


2023-02-27 03:27:53 (104 MB/s) - ‘nlp_getting_started.zip’ saved [607343/607343]



In [4]:
# visualize the text dataset
import pandas as pd
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

# shapes
train_df.shape, test_df.shape

((7613, 5), (3263, 4))

In [5]:
# view any samples
train_df.head()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | NaN | NaN | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | NaN | NaN | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | NaN | NaN | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | NaN | NaN | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | NaN | NaN | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [6]:
# need to reshuffle the training dataset
train_shuffled_df = train_df.sample(frac=1, random_state=17)
train_shuffled_df.head()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 7027 | 10072 | typhoon | NaN | Typhoon Soudelor: When will it hit Taiwan ÛÒ ... | 1 |
| 318 | 463 | armageddon | NaN | RT @RTRRTcoach: #Love #TrueLove #romance lith ... | 0 |
| 1681 | 2425 | collide | www.youtube.com?Malkavius2 | I liked a @YouTube video from @gassymexican ht... | 0 |
| 5131 | 7318 | nuclear%20reactor | New York, New York | Japan's Restart of Nuclear Reactor Fleet Fast ... | 1 |
| 2967 | 4262 | drowning | Hendersonville, NC | #ICYMI #Annoucement from Al Jackson... http://... | 0 |


In [10]:
# split train_shuffled_df into training and validation sets
from sklearn.model_selection import train_test_split

train_sentences, val_sentences, train_labels, val_labels = train_test_split(train_shuffled_df['text'].to_numpy(),
                                                                              train_shuffled_df['target'].to_numpy(),
                                                                              test_size=0.1,
                                                                              random_state=17)

In [11]:
# Text Vectorizer & Embedding Layer
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization  # the experimental.preprocessing path is deprecated
from tensorflow.keras import layers

max_vocab_length = 10000
max_length = 15
# using the default vectorizer settings for everything else
text_vectorizer = TextVectorization(max_tokens=max_vocab_length,
                                    output_mode='int',
                                    output_sequence_length=max_length)

# fit the text vectorizer to the training text
text_vectorizer.adapt(train_sentences)

# creating the embedding layer
embedding = layers.Embedding(input_dim=max_vocab_length,
                             output_dim=128,
                             embeddings_initializer='uniform',
                             input_length=max_length,
                             name='embedding_1')
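
To make the `output_mode='int'` behaviour more concrete, here is a rough pure-Python imitation of what the text vectorizer does: standardize, split on whitespace, map tokens to integer ids, then pad or truncate to a fixed length. It is a simplified sketch, not the TensorFlow implementation. The helper names are hypothetical and the punctuation handling is only an approximation of the layer's default standardization.

```python
import re
from collections import Counter

def standardize(text):
    """Lowercase, drop punctuation (approximation), split on whitespace."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()

def build_vocab(texts, max_tokens):
    """Most frequent tokens first; id 0 is reserved for padding, id 1 for unknown tokens."""
    counts = Counter(tok for text in texts for tok in standardize(text))
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return {tok: idx + 2 for idx, tok in enumerate(most_common)}

def vectorize(text, vocab, sequence_length):
    """Map tokens to ids, truncate to sequence_length, then right-pad with zeros."""
    ids = [vocab.get(tok, 1) for tok in standardize(text)][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

# Tiny illustrative corpus (not the notebook's training data)
corpus = ["Forest fire near La Ronge", "Fire in the forest"]
vocab = build_vocab(corpus, max_tokens=10000)
encoded = vectorize("Forest flood!", vocab, sequence_length=6)
print(encoded)
```

The embedding layer then looks up one 128-dimensional vector per integer id, so a batch of sentences becomes a tensor of shape `(batch, max_length, 128)`.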

## 1. Rebuild, Compile and Train model_1, model_2 and model_5 using the Keras Sequential API instead of the Functional API


A brief description of each model:

- `model_1` : Simple Dense Model
- `model_2` : LSTM (RNN)
- `model_5` : 1D CNN


### Model_1 : Simple Dense Model using Sequential API

In [14]:
from tensorflow.keras import layers

SAVE_DIR='model_logs'

# Build model with Sequential API
model_1 = tf.keras.Sequential([
    layers.Input(shape=(1,), dtype=tf.string),
    text_vectorizer,
    embedding,
    layers.GlobalAveragePooling1D(),
    layers.Dense(1, activation='sigmoid', name='model_1_sequential_dense')
])

# compile the model
model_1.compile(loss='binary_crossentropy',
                optimizer=tf.keras.optimizers.Adam(),
                metrics=['accuracy'])

# fit the model
history_1 = model_1.fit(train_sentences,
                        train_labels,
                        epochs=5,
                        validation_data=(val_sentences, val_labels),
                        callbacks=[create_tensorboard_callback(dir_name=SAVE_DIR,
                                                                experiment_name='model_1_sequential')])

Saving Tensorboard log files to: model_logs/model_1_sequential/20230227-041319
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [15]:
model_1_results = model_1.evaluate(val_sentences, val_labels)
model_1_results



[0.4876011312007904, 0.7939632534980774]
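
The `GlobalAveragePooling1D` layer in `model_1` is what turns the per-token embeddings into a single vector per tweet before the final `Dense` layer. A minimal sketch of that operation on one sequence, using plain lists and a made-up 2-dimensional embedding instead of the notebook's 128 dimensions (the function name is hypothetical):

```python
def global_average_pooling_1d(sequence):
    """Average a (timesteps, features) list of vectors over the timestep axis."""
    n_steps = len(sequence)
    n_features = len(sequence[0])
    return [sum(step[i] for step in sequence) / n_steps for i in range(n_features)]

# Three tokens, each with a 2-dimensional embedding (illustrative values)
token_vectors = [
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
]
pooled = global_average_pooling_1d(token_vectors)
print(pooled)
```

Without a mask, padded positions are averaged in as well, so very short tweets are pulled toward the embedding of the padding token.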

### Model_2 : LSTM (RNN)

In [16]:
# create new embeddings
model_2_embedding = layers.Embedding(input_dim=max_vocab_length,
                                     output_dim=128,
                                     embeddings_initializer='uniform',
                                     input_length=max_length,
                                     name='embedding_2')

# Create the model
model_2 = tf.keras.Sequential([
    layers.Input(shape=(1,), dtype=tf.string),
    text_vectorizer,
    model_2_embedding,
    layers.LSTM(64),
    layers.Dense(1, activation='sigmoid')
])

# compile the model
model_2.compile(loss='binary_crossentropy',
                optimizer=tf.keras.optimizers.Adam(),
                metrics=['accuracy'])

# fit the model
model_2_history = model_2.fit(train_sentences,
                              train_labels,
                              epochs=5,
                              validation_data=(val_sentences, val_labels),
                              callbacks=[create_tensorboard_callback(dir_name=SAVE_DIR,
                                                                     experiment_name='model_2_LSTM')])

Saving Tensorboard log files to: model_logs/model_2_LSTM/20230227-042654
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [17]:
model_2_results = model_2.evaluate(val_sentences,val_labels)
model_2_results



[0.9480307102203369, 0.7506561875343323]

### Model_5 : 1D CNN

In [18]:
# create embeddings
model_5_embedding = layers.Embedding(input_dim=max_vocab_length,
                                     output_dim=128,
                                     embeddings_initializer='uniform',
                                     input_length=max_length,
                                     name='model_5_embedding')

# create the model
model_5 = tf.keras.Sequential([
    layers.Input(shape=(1,), dtype=tf.string),
    text_vectorizer,
    model_5_embedding,
    layers.Conv1D(filters=32, kernel_size=5, strides=1, activation='relu', padding='valid'),
    layers.GlobalMaxPool1D(),
    layers.Dense(1, activation='sigmoid', name='model_5_CNN')
])

# compile the model
model_5.compile(loss='binary_crossentropy',
                optimizer=tf.keras.optimizers.Adam(),
                metrics=['accuracy'])

# fit the model
model_5_history = model_5.fit(train_sentences,
                              train_labels,
                              epochs=5,
                              validation_data=(val_sentences, val_labels),
                              callbacks=[create_tensorboard_callback(dir_name=SAVE_DIR,
                                                                      experiment_name='model_5_CNN')])



Saving Tensorboard log files to: model_logs/model_5_CNN/20230227-044948
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [19]:
model_5_results = model_5.evaluate(val_sentences, val_labels)
model_5_results



[0.7121530771255493, 0.7650918364524841]
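
To see how the tensor shapes change inside `model_5`, the arithmetic for a `Conv1D` layer with `padding='valid'` followed by global max pooling can be sketched as below. The helper names are hypothetical. The numbers plugged in (`max_length=15`, `kernel_size=5`, `strides=1`) are the ones used in the cell above.

```python
def conv1d_valid_output_length(input_length, kernel_size, strides=1):
    """Number of positions a kernel can take without padding the input."""
    return (input_length - kernel_size) // strides + 1

def global_max_pool_1d(sequence):
    """Keep the largest value of each feature across all timesteps."""
    return [max(step[i] for step in sequence) for i in range(len(sequence[0]))]

# 15 tokens per tweet, kernel covering 5 tokens at a time, stride 1
conv_steps = conv1d_valid_output_length(input_length=15, kernel_size=5, strides=1)
print(conv_steps)  # each tweet becomes (conv_steps, 32) after Conv1D with 32 filters

# Toy feature map with 3 positions and 2 filters (illustrative values)
feature_map = [[0.1, 0.9], [0.7, 0.2], [0.4, 0.5]]
pooled = global_max_pool_1d(feature_map)
print(pooled)
```

After pooling, each tweet is represented by one value per filter, which is what the final sigmoid layer receives.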

## 2. Retrain the `baseline` model with 10% of the training data

In [21]:
# Randomly sample 10% of the training data
train_10_percent_df = train_df.sample(frac=0.1, random_state=17)
train_10_percent_df.shape

(761, 5)

In [22]:
# create training and validation data
from sklearn.model_selection import train_test_split

train_sentences, val_sentences, train_labels, val_labels = train_test_split(train_10_percent_df['text'].to_numpy(),
                                                                            train_10_percent_df['target'].to_numpy(),
                                                                            test_size=0.1,
                                                                            random_state=17)
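
It helps to know how many samples end up on each side of these splits. The sketch below assumes that, when only a fractional `test_size` is given, the validation count is rounded up and the remainder goes to training. The helper name is hypothetical.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_val) for a fractional test_size, rounding the validation count up."""
    n_val = math.ceil(test_size * n_samples)
    return n_samples - n_val, n_val

full_split = split_sizes(7613, 0.1)   # split of the full shuffled training set
small_split = split_sizes(761, 0.1)   # split of the 10% sample
print(full_split, small_split)
```

With only a few dozen validation tweets in the 10% split, a single misclassified tweet moves the accuracy by more than one percentage point.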

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

model_0 = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB())
])

# Fit the pipeline to the training data
model_0.fit(train_sentences, train_labels)

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])

In [24]:
# baseline_scores with 10% of data
baseline_10_percent_score = model_0.score(val_sentences, val_labels)
baseline_10_percent_score

0.8051948051948052

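It looks like `baseline_10_percent_score` (about 80.5% accuracy) still beats the Universal Sentence Encoder model that was trained on 10% of the data. Keep in mind that this score comes from only 77 validation tweets, so the comparison is a rough one.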
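
A single accuracy number hides which kind of mistakes a model makes, and a confusion matrix is one way to look closer. Below is a minimal sketch for binary labels in plain Python. The labels are made up for illustration (they are not predictions from `model_0`) and the function name is hypothetical. `sklearn.metrics.confusion_matrix(y_true, y_pred)` produces the same layout, with true labels as rows and predicted labels as columns.

```python
def confusion_matrix_2x2(y_true, y_pred):
    """Rows are true labels (0, 1); columns are predicted labels (0, 1)."""
    matrix = [[0, 0], [0, 0]]
    for true_label, pred_label in zip(y_true, y_pred):
        matrix[true_label][pred_label] += 1
    return matrix

# Illustrative labels only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix_2x2(y_true, y_pred)
accuracy = (cm[0][0] + cm[1][1]) / len(y_true)
print(cm, accuracy)
```

Here `cm[0][1]` counts false positives (non-disaster tweets flagged as disasters) and `cm[1][0]` counts false negatives.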