<a href="https://colab.research.google.com/github/SilahicAmil/NLP-NLTK/blob/main/Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Sentiment Analysis 

Sentiment analysis on the IMBD dataset

In [12]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import re
import string
import shutil
from collections import Counter

# TensorFlow imports
import tensorflow as tf
import tensorflow_datasets as tfds
import keras
from tensorflow.keras import layers
from tensorflow.keras import losses
from keras import callbacks

In [None]:
!nvidia-smi -L

GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-f41d6473-b238-8712-aa69-df63c4f05efb)


## Dataset import

In [None]:


url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

dataset = tf.keras.utils.get_file("aclImdb_v1",
                                  url,
                                  untar=True,
                                  cache_dir=".",
                                  cache_subdir="")


dataset_dir = os.path.join(os.path.dirname(dataset), "aclImdb")

Downloading data from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz


In [None]:
os.listdir(dataset_dir)

['README', 'test', 'imdb.vocab', 'train', 'imdbEr.txt']

In [None]:
train_dir = os.path.join(dataset_dir, "train")
os.listdir(train_dir)

['urls_unsup.txt',
 'labeledBow.feat',
 'neg',
 'pos',
 'urls_pos.txt',
 'unsup',
 'urls_neg.txt',
 'unsupBow.feat']

In [None]:
sample_file = os.path.join(train_dir, "pos/1181_9.txt")
with open(sample_file) as f:
  print(f.read())

Rachel Griffiths writes and directs this award winning short film. A heartwarming story about coping with grief and cherishing the memory of those we've loved and lost. Although, only 15 minutes long, Griffiths manages to capture so much emotion and truth onto film in the short space of time. Bud Tingwell gives a touching performance as Will, a widower struggling to cope with his wife's death. Will is confronted by the harsh reality of loneliness and helplessness as he proceeds to take care of Ruth's pet cow, Tulip. The film displays the grief and responsibility one feels for those they have loved and lost. Good cinematography, great direction, and superbly acted. It will bring tears to all those who have lost a loved one, and survived.


## Loading the dataset and some preprocessing

In [None]:
# removing irrelevant folder
remove_dir = os.path.join(train_dir, "unsup")
shutil.rmtree(remove_dir)

In [None]:
# Creating validation set
# text_dataset_from_directory creates a labeled td.data.Datset

batch_size = 32
seed = 42

train_set = tf.keras.utils.text_dataset_from_directory("aclImdb/train",
                                                       batch_size=batch_size,
                                                       validation_split=0.2,
                                                       subset="training",
                                                       seed=seed)

Found 25000 files belonging to 2 classes.
Using 20000 files for training.


Originally 25k examples in the training folder which now 80% will be used for training and the other 5k for validation.

In [None]:
# Prinitng out examples
for text_batch, label_batch in train_set.take(1):
  for i in range(5):
    print("Review", text_batch.numpy()[i])
    print("Label", label_batch.numpy()[i])

Review b'"Pandemonium" is a horror movie spoof that comes off more stupid than funny. Believe me when I tell you, I love comedies. Especially comedy spoofs. "Airplane", "The Naked Gun" trilogy, "Blazing Saddles", "High Anxiety", and "Spaceballs" are some of my favorite comedies that spoof a particular genre. "Pandemonium" is not up there with those films. Most of the scenes in this movie had me sitting there in stunned silence because the movie wasn\'t all that funny. There are a few laughs in the film, but when you watch a comedy, you expect to laugh a lot more than a few times and that\'s all this film has going for it. Geez, "Scream" had more laughs than this film and that was more of a horror film. How bizarre is that?<br /><br />*1/2 (out of four)'
Label 0
Review b"David Mamet is a very interesting and a very un-equal director. His first movie 'House of Games' was the one I liked best, and it set a series of films with characters whose perspective of life changes as they get into 

In the the reviews there is raw text and the occasional HTML tags. Let's see how we can handle these.

Labels 0 or 1 correspond to pos or neg movie reviews.

0- neg

1- pos

which we can see is confirmed below

In [None]:
print('Label 0 is', train_set.class_names[0])
print('Label 1 is', train_set.class_names[1])

Label 0 is neg
Label 1 is pos


## Creating Test and Validation dataset

In [None]:
# Validation set
val_set = tf.keras.utils.text_dataset_from_directory("aclImdb/train",
                                                     batch_size=batch_size,
                                                     validation_split=0.2,
                                                     subset="validation",
                                                     seed=seed)

Found 25000 files belonging to 2 classes.
Using 5000 files for validation.


In [None]:
# Test set

test_set = tf.keras.utils.text_dataset_from_directory("aclImdb/test",
                                                      batch_size=batch_size)

Found 25000 files belonging to 2 classes.


## Preparing dataset for training

Standardizing, tokenizing and vectorizing the datasets with tf.keras.layers.TextVectorization.

Standardization refers to making the making the dataset to simplify it. Removing punctuation, HTML elements and etc.

Tokenization is splitting string to tokens. Example: splitting a sentence into individual words by splitting on the white space.

Vectorization is converting tokens into numbers so they can be used in a nueral net for learning.

In [None]:
# Standardizing dataset

def standardize_datasets(input_data):
  lowercase = tf.strings.lower(input_data)
  stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')

  return tf.strings.regex_replace(stripped_html,
                                 '[%s]' % re.escape(string.punctuation),
                                 '')

In [None]:
# TextVectorization layer does everything. Standardizes, tokenize and vectorize
MAX_FEATS = 10000
SEQUENCE_LEN = 250

vectorization_layer = layers.TextVectorization(
    standardize=standardize_datasets,
    max_tokens=MAX_FEATS,
    output_mode="int", # creates unique int for each token
    output_sequence_length=SEQUENCE_LEN)

Note: When using .adapt() only use it on the trainin data

In [None]:
# Text only dataset, no labels
train_text_set = train_set.map(lambda x, y: x)
vectorization_layer.adapt(train_text_set)

In [None]:
# Function to see results of the layer
def vect_text(text, label):
  text = tf.expand_dims(text, -1)
  
  return vectorization_layer(text), label

In [None]:
# Review batch from the dataset

text_batch, label_batch = next(iter(train_set))
first_review, first_label = text_batch[0], label_batch[0]

print(f"First Review: {first_review}\nFirst Label {train_set.class_names[first_label]}\nVectorized Review: {vect_text(first_review, first_label)}")

First Review: b'Great movie - especially the music - Etta James - "At Last". This speaks volumes when you have finally found that special someone.'
First Label neg
Vectorized Review: (<tf.Tensor: shape=(1, 250), dtype=int64, numpy=
array([[  86,   17,  260,    2,  222,    1,  571,   31,  229,   11, 2418,
           1,   51,   22,   25,  404,  251,   12,  306,  282,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,   

WE can see each token is an integer. Let's see what token corresponds to what integer

In [None]:
print(f"1337 -> {vectorization_layer.get_vocabulary()[1337]}\n420 -> {vectorization_layer.get_vocabulary()[420]}\nVocab Size: {len(vectorization_layer.get_vocabulary())}")

1337 -> sent
420 -> yes
Vocab Size: 10000


## Applying TextVectorization to train, val and test sets

In [None]:
# Vectorizing Text
train_set = train_set.map(vect_text)
test_set = test_set.map(vect_text)
val_set = val_set.map(vect_text)

## Creating a perfomant dataset

using .cache() and .prefetch() from tf.data.Datset

In [None]:
AUTOTUNE = tf.data.AUTOTUNE

train_set = train_set.cache().prefetch(buffer_size=AUTOTUNE)
test_set = test_set.cache().prefetch(buffer_size=AUTOTUNE)
val_set = val_set.cache().prefetch(buffer_size=AUTOTUNE)

## Model Creation Time

using the TF sequential API

Topology of the model:

First Layer is the embedding layer. This takes the int encoded reviews and looks up the embedding vector for each word index. The vectors are learned as the model trains. Vectors add a dimension to the output. So the dimensions look like `(batch, sequence, embedding)`.

Then we dropout to avoid overfitting

Next we use the GlobalAveragePooling1D to return a fixed output length vector.  Which averages over the sequence dimension. This lets the model handle intput of varying lengths.

The last layer is a dense layer with a single output



In [None]:
EMBEDDING_DIMS = 16

model_1 = tf.keras.Sequential([
      layers.Embedding(MAX_FEATS +1, EMBEDDING_DIMS),
      layers.Dropout(0.2),
      layers.GlobalAveragePooling1D(),
      layers.Dropout(0.2),
      layers.Dense(1) 
                              
])

In [None]:
model_1.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 16)          160016    
                                                                 
 dropout (Dropout)           (None, None, 16)          0         
                                                                 
 global_average_pooling1d (G  (None, 16)               0         
 lobalAveragePooling1D)                                          
                                                                 
 dropout_1 (Dropout)         (None, 16)                0         
                                                                 
 dense (Dense)               (None, 1)                 17        
                                                                 
Total params: 160,033
Trainable params: 160,033
Non-trainable params: 0
__________________________________________________

## Loss Function and optimizer

Since this is a binary classifier (0 or 1) we use the BinaryCrossentropy loss func

In [None]:
# Compiling the model
model_1.compile(loss=losses.BinaryCrossentropy(from_logits=True),
                optimizer="adam",
                metrics=tf.metrics.BinaryAccuracy(threshold=0.0))

In [None]:
# Trainig the model
%%time

hist_1 = model_1.fit(train_set,
                     validation_data=val_set,
                     epochs=10)

Epoch 1/10
100/625 [===>..........................] - ETA: 6s - loss: 0.6917 - binary_accuracy: 0.5184

KeyboardInterrupt: ignored

## Evaluating the model

In [None]:
loss, accuracy = model_1.evaluate(test_set)

print(f"Loss: {loss}\nAccuracy: {accuracy}")

Even though the model is very naive it achieves an accuracy of 87%

## Plotting accuracy and loss over time

In [None]:
hist_dict_1 = hist_1.history
hist_dict_1.keys()

There are 4 entries which is 1 for each monitored metrics during traning and validation. Lets plot these and see how it converges

In [None]:
# Validation and Training loss plot
acc = hist_dict_1["binary_accuracy"]
val_acc = hist_dict_1["val_binary_accuracy"]
loss = hist_dict_1["loss"]
val_loss = hist_dict_1["val_loss"]

epochs = range(1, len(acc) +1)

plt.plot(epochs, loss, "bo", label="Train Loss")
plt.plot(epochs, val_loss, "b", label="Val Loss")
plt.title("Train and Val Loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend()

In [None]:
# Accuracy over epochs plot
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')

## Testing out another model

In [None]:
#model_1 = tf.keras.Sequential([
 #     layers.Embedding(MAX_FEATS +1, EMBEDDING_DIMS),
  #    layers.Dropout(0.2),
   #   layers.GlobalAveragePooling1D(),
    #  layers.Dropout(0.2),
     # layers.Dense(1) 
                              
#])

In [None]:
model_2 = tf.keras.Sequential([
      layers.Embedding(MAX_FEATS +1, EMBEDDING_DIMS),
      layers.Dropout(0.2),
      layers.Conv1D(32, kernel_size=5, padding="same", activation="relu"),
      layers.Dropout(0.2),
      layers.Conv1D(16, kernel_size=5, activation="relu"),
      layers.GlobalAveragePooling1D(),
      layers.Dropout(0.2),
      layers.Dense(1) 
                              
])

In [None]:
model_2.compile(loss=tf.losses.BinaryCrossentropy(from_logits=True),
                optimizer="adam",
                metrics=tf.metrics.BinaryAccuracy(threshold=0.0))

In [None]:
model_1.summary()

In [None]:
model_2.summary()

In [None]:
#Training model 2
%%time
hist_model_2 = model_2.fit(train_set,
                           epochs=10,
                           validation_data=val_set)

In [None]:
loss, accuracy = model_2.evaluate(test_set)

print(f"Loss: {loss}\nAccuracy: {accuracy}")

Even thought we added more deep learning layers. It has an accuracy of 85%. Less than the naive baseline. As we know trying to imporve upon the baseline is very hard. Lets see if we can create some better models

## More model testing

In [None]:
model_3 = tf.keras.Sequential([
      layers.Embedding(MAX_FEATS +1, EMBEDDING_DIMS),
      layers.Dropout(0.2),
      layers.Conv1D(128, kernel_size=5, padding="same", activation="relu"),
      layers.Conv1D(64, kernel_size=5, activation="relu"),
      layers.Dropout(0.5),

      layers.Conv1D(32, kernel_size=5, activation="relu"),
      layers.Conv1D(16, kernel_size=5, activation="relu"),

      layers.Dropout(0.2),
      layers.GlobalAveragePooling1D(),
      layers.Dense(1) 
                              
])

In [None]:
model_3.compile(loss=tf.losses.BinaryCrossentropy(from_logits=True),
                optimizer="adam",
                metrics=["accuracy"])

In [None]:
model_3.summary()

In [None]:
# Early stopping

early_stop = callbacks.EarlyStopping(monitor ="val_loss", 
                                        mode ="min", patience=2, 
                                        restore_best_weights=True)

In [None]:
hist_3 = model_3.fit(train_set,
            epochs=50,
            validation_data=test_set,
            callbacks=[early_stop]
            )

3 epochs is all it needed. Maybe if I put the pataience higher it would be different but lets see the results

In [None]:
loss, accuracy = model_3.evaluate(test_set)

print(f"Loss: {loss}\nAccuracy: {accuracy}")

86% POGGERZ. Almost as good as the naive model with a few tweaks. Maybe if the patience was higher we could make it better.

In [None]:
# Same Model
model_4 = tf.keras.Sequential([
      layers.Embedding(MAX_FEATS +1, EMBEDDING_DIMS),
      layers.Dropout(0.2),
      layers.Conv1D(128, kernel_size=5, padding="same", activation="relu"),
      layers.Conv1D(64, kernel_size=5, activation="relu"),
      layers.Dropout(0.5),

      layers.Conv1D(32, kernel_size=5, activation="relu"),
      layers.Conv1D(16, kernel_size=5, activation="relu"),

      layers.Dropout(0.2),
      layers.GlobalAveragePooling1D(),
      layers.Dense(1) 
                              
])

In [None]:
model_4.compile(loss=tf.losses.BinaryCrossentropy(from_logits=True),
                optimizer="adam",
                metrics=["accuracy"])

In [None]:
# higher patience
early_stop = callbacks.EarlyStopping(monitor="val_loss", 
                                        mode="min", patience=5, 
                                        restore_best_weights=True)

In [None]:
hist_4 = model_4.fit(train_set,
                     epochs=50,
                     validation_data=test_set,
                     callbacks=[early_stop])

In [None]:
loss, accuracy = model_4.evaluate(test_set)

print(f"Loss: {loss}\nAccuracy: {accuracy}")

Still at 86% hmmm. Could we improve this with a better model? More hidden layers? more dropout? Or maybe the we need to experiment with different embeddings? Lets test them out

## GRU Model

In [None]:
model_5 = tf.keras.Sequential([
      layers.Embedding(MAX_FEATS +1, EMBEDDING_DIMS),
      layers.GRU(64, return_sequences=True, dropout=0.2),
      layers.GRU(64, dropout=0.2),
      layers.Dense(1)
                              
])

In [None]:
model_5.compile(loss=tf.losses.BinaryCrossentropy(from_logits=True),
                optimizer="adam",
                metrics=tf.metrics.BinaryAccuracy(threshold=0.0))

In [None]:
model_5.summary()

Model: "sequential_16"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_18 (Embedding)    (None, None, 16)          160016    
                                                                 
 gru_45 (GRU)                (None, None, 64)          15744     
                                                                 
 gru_46 (GRU)                (None, 64)                24960     
                                                                 
 dense_16 (Dense)            (None, 1)                 65        
                                                                 
Total params: 200,785
Trainable params: 200,785
Non-trainable params: 0
_________________________________________________________________


In [None]:
hist_5 = model_5.fit(train_set,
                     epochs=50,
                     validation_data=test_set,
                     callbacks=[early_stop])

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50


In [None]:
loss, accuracy = model_5.evaluate(test_set)

print(f"Loss: {loss}\nAccuracy: {accuracy}")

Loss: 0.3343585431575775
Accuracy: 0.86080002784729


86% is what we got. Even with GRU layers some of the most powerful RNN layers out there but still our naive Conv1D model outperforms it.

## Different Approach

In [9]:
datasets, info = tfds.load("imdb_reviews", as_supervised=True, with_info=True)

train_size = info.splits["train"].num_examples

[1mDownloading and preparing dataset imdb_reviews/plain_text/1.0.0 (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...[0m


Dl Completed...: 0 url [00:00, ? url/s]

Dl Size...: 0 MiB [00:00, ? MiB/s]





0 examples [00:00, ? examples/s]

Shuffling and writing examples to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0.incompleteTD0JYU/imdb_reviews-train.tfrecord


  0%|          | 0/25000 [00:00<?, ? examples/s]

0 examples [00:00, ? examples/s]

Shuffling and writing examples to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0.incompleteTD0JYU/imdb_reviews-test.tfrecord


  0%|          | 0/25000 [00:00<?, ? examples/s]

0 examples [00:00, ? examples/s]

Shuffling and writing examples to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0.incompleteTD0JYU/imdb_reviews-unsupervised.tfrecord


  0%|          | 0/50000 [00:00<?, ? examples/s]



[1mDataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.[0m


In [16]:
def preprocess(X_batch, y_batch):
  X_batch = tf.strings.substr(X_batch, 0, 300)
  X_batch = tf.strings.regex_replace(X_batch, b"<br\\s*?>", b" ")
  X_batch = tf.strings.regex_replace(X_batch, b"[^a-zA-Z']", b" ")
  X_batch = tf.strings.split(X_batch)

  return X_batch.to_tensor(default_value=b"<pad>"), y_batch

In [17]:
# Vocab counting

vocab = Counter()

for X_batch, y_batch in datasets["train"].batch(32).map(preprocess):
  for review in X_batch:
    vocab.update(list(review.numpy()))

In [18]:
vocab.most_common()[:5]

[(b'<pad>', 205484),
 (b'the', 61137),
 (b'a', 38564),
 (b'of', 33983),
 (b'and', 33431)]

In [19]:
# Trunc vocab
VOCAB_SIZE = 10000

trunc_vocab = [
  word for word, count in vocab.most_common()[:VOCAB_SIZE]
]

In [21]:
# OOV Buckets
NUM_OOV_BUCKETS = 1000

words = tf.constant(trunc_vocab)

word_ids = tf.range(len(trunc_vocab), dtype=tf.int64)
vocab_init = tf.lookup.KeyValueTensorInitializer(words, word_ids)

table = tf.lookup.StaticVocabularyTable(vocab_init, NUM_OOV_BUCKETS)

In [24]:
# Encoding words
def encode_wrds(X_batch, y_batch):
  return table.lookup(X_batch), y_batch

# Train set
train_set = datasets["train"].batch(32).map(preprocess)
train_set = train_set.map(encode_wrds).prefetch(1)
# Test set
test_set = datasets["test"].batch(32).map(preprocess)
test_set = test_set.map(encode_wrds).prefetch(1)

### Model Creation

In [26]:
EMBED_SIZE = 128

model_6 = keras.models.Sequential([
  keras.layers.Embedding(VOCAB_SIZE + NUM_OOV_BUCKETS, EMBED_SIZE, input_shape=[None], mask_zero=True),
  keras.layers.GRU(128, return_sequences=True),
  keras.layers.GRU(128),
  keras.layers.Dense(1, activation="sigmoid")
])

In [27]:
model_6.compile(loss="binary_crossentropy",
                optimizer="adam",
                metrics=["accuracy"])

In [28]:
model_6.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 128)         1408000   
                                                                 
 gru (GRU)                   (None, None, 128)         99072     
                                                                 
 gru_1 (GRU)                 (None, 128)               99072     
                                                                 
 dense (Dense)               (None, 1)                 129       
                                                                 
Total params: 1,606,273
Trainable params: 1,606,273
Non-trainable params: 0
_________________________________________________________________


In [29]:
hist_6 = model_6.fit(train_set,
                     epochs=10,
                     validation_data=(test_set))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [30]:
loss, accuracy = model_6.evaluate(test_set, verbose=1)

print(f"Loss: {loss}\nAccuracy: {accuracy}")

Loss: 0.06768446415662766
Accuracy: 0.9787600040435791


LETSSSS GOOOOOOOO 98% ACCURACY!!!!!! Probably overfitted but POGGERS regardless