<a href="https://colab.research.google.com/github/BrendanL72/ACM-Research-Coding-Challenge-F21/blob/main/Sentiment_Analysis_RNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This code was shamelessly ripped/adapted from the "Text classification with an RNN" tutorial offered by the TensorFlow website.

https://www.tensorflow.org/text/tutorials/text_classification_rnn


#Setup

The IMDB reviews dataset is part of the TensorFlow Datasets catalog, so it can be loaded by name instead of being located and parsed by hand. The first call downloads the data (about 80 MiB) and caches it; later calls reuse the cached copy.

In [1]:
import numpy as np

import tensorflow_datasets as tfds
import tensorflow as tf

In [2]:
import matplotlib.pyplot as plt


def plot_graphs(history, metric):
  plt.plot(history.history[metric])
  plt.plot(history.history['val_'+metric], '')
  plt.xlabel("Epochs")
  plt.ylabel(metric)
  plt.legend([metric, 'val_'+metric])

Loads imdb_reviews by name into a dictionary of `tf.data.Dataset` splits.

In [3]:
dataset, info = tfds.load('imdb_reviews',
                          with_info=True,
                          as_supervised=True) #VERY IMPORTANT: returns (text, label) pairs instead of feature dictionaries

train_dataset, test_dataset = dataset['train'], dataset['test']

train_dataset.element_spec

Downloading and preparing dataset imdb_reviews/plain_text/1.0.0 (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...

Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


(TensorSpec(shape=(), dtype=tf.string, name=None),
 TensorSpec(shape=(), dtype=tf.int64, name=None))

Print the first review to make sure that the data has been loaded in properly

In [4]:
for example, label in train_dataset.take(1):
  print('Text: ', example.numpy())
  print('Rating: ', label.numpy())

Text:  b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."
Rating:  0


Shuffle the training data, then batch and prefetch both the training and testing datasets so they are ready for the model

In [5]:
#Constants for the shuffle buffer size and the number of reviews per batch
BUFFER_SIZE = 10000
BATCH_SIZE = 64

train_dataset = train_dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
test_dataset = test_dataset.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
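
To make the `shuffle(BUFFER_SIZE).batch(BATCH_SIZE)` step more concrete, here is a rough pure-Python sketch of buffer-based shuffling followed by batching. It is only an illustration of the idea, not how tf.data is implemented; `buffered_shuffle` and `batched` are hypothetical helper names.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in a randomized order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            # Buffer is full: emit one randomly chosen element.
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # Input exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer

def batched(items, batch_size):
    """Group items into lists of batch_size; the last batch may be smaller."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

shuffled = list(buffered_shuffle(range(10), buffer_size=4))
batches = list(batched(shuffled, batch_size=4))
print(batches)
```

Because the first element can only come from the first `buffer_size` items, a buffer smaller than the dataset gives an approximate shuffle. Here the training split has 25,000 reviews and the buffer holds 10,000, so the shuffle is not perfectly uniform.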

In [6]:
num_loops = 1
num_reviews_show = 4
for example, label in train_dataset.take(num_loops):
  print('TEXTS: ', example.numpy()[:num_reviews_show])
  print()
  print('RATINGS: ', label.numpy()[:num_reviews_show])

TEXTS:  [b"I have seen this movie a whole dozen times and it's awesome. But the only thing with it was that in the beginning, there was too much talk of who's going out with who. I think that it would be interesting to do a remake of it. But on the official site, they said that they will not be making a remake of it because so many people have gotten saved when viewing it. What's even happened to Patty Dunning now? She is a pretty good actress. She has done several other movies in the 70s and 80s, but we haven't heard from her since. I know for sure about Thom Rachford, who plays Jerry, works for Accounting at RD Films. But overall, I have to say that the series itself is like Left Behind gone old school."
 b"The prerequisite for making such a film is a complete ignorance of Nietzche's work and personality, psychoanalytical techniques and Vienna's history. Take a well-know genius you have not read, describe him as demented, include crazy physicians to cure him, a couple of somewhat goo

#Creating the text encoder

Notes:
*   Allowing too small a vocabulary causes more of the tokens to be unknown ([UNK]), while a larger vocabulary makes the embedding layer bigger and can run slower

*   keras.layers.experimental.preprocessing.TextVectorization() is a layer. By default it lowercases the text, strips punctuation, splits on whitespace, and maps each token to an integer index (the embedding layer turns those indices into vectors later). Standardization can be turned off with standardize=None.

*   The adapt() method makes the layer look at the dataset and create a vocabulary ordered by word frequency
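
The notes above can be sketched in plain Python. This is a simplified stand-in for what the layer does, not the Keras implementation; `standardize`, `build_vocab`, and `encode` are hypothetical helpers, and the three sample texts are made up for illustration.

```python
import re
from collections import Counter

def standardize(text):
    """Lowercase and strip punctuation, roughly like the layer's default."""
    return re.sub(r"[^\w\s]", "", text.lower())

def build_vocab(texts, max_tokens):
    """Index 0 is padding, index 1 is [UNK], the rest are ordered by frequency."""
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def encode(text, vocab):
    """Map each token to its index, falling back to 1 ([UNK]) if unseen."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, 1) for tok in standardize(text).split()]

texts = ["The movie was great.", "The movie was bad!", "The plot was thin."]
vocab = build_vocab(texts, max_tokens=5)
ids = encode("The movie was awesome", vocab)

print(vocab)
print(ids)
print(" ".join(vocab[i] for i in ids))  # round trip through the limited vocabulary
```

With only five slots, two of which are reserved, rare words such as "awesome" fall outside the vocabulary and come back as [UNK] in the round trip.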

In [7]:
MAX_VOCAB = 1000 #sets the max number of words the computer can know

encoder = tf.keras.layers.experimental.preprocessing.TextVectorization(
    max_tokens=MAX_VOCAB)

encoder.adapt(train_dataset.map(lambda text, label: text))

In [8]:
vocab = np.array(encoder.get_vocabulary())
#see the most used words, dtype not included
vocab[:30]

array(['', '[UNK]', 'the', 'and', 'a', 'of', 'to', 'is', 'in', 'it', 'i',
       'this', 'that', 'br', 'was', 'as', 'for', 'with', 'movie', 'but',
       'film', 'on', 'not', 'you', 'are', 'his', 'have', 'he', 'be',
       'one'], dtype='<U14')

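Testing that the encoder turns reviews into arrays of integer token ids. Shorter reviews are padded with zeros to the length of the longest review in the batch.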

In [9]:
encoded_example = encoder(example)[:3].numpy()
encoded_example

array([[ 10,  26, 108, ...,   0,   0,   0],
       [  2,   1,  16, ...,   0,   0,   0],
       [ 11, 207,   1, ...,   0,   0,   0]])

Compares the original text with what the computer reads using its limited vocabulary

In [10]:
for n in range(3):
  print("Original: ", example[n].numpy())
  print("Round-trip: ", " ".join(vocab[encoded_example[n]]))
  print()

Original:  b"I have seen this movie a whole dozen times and it's awesome. But the only thing with it was that in the beginning, there was too much talk of who's going out with who. I think that it would be interesting to do a remake of it. But on the official site, they said that they will not be making a remake of it because so many people have gotten saved when viewing it. What's even happened to Patty Dunning now? She is a pretty good actress. She has done several other movies in the 70s and 80s, but we haven't heard from her since. I know for sure about Thom Rachford, who plays Jerry, works for Accounting at RD Films. But overall, I have to say that the series itself is like Left Behind gone old school."
Round-trip:  i have seen this movie a whole [UNK] times and its [UNK] but the only thing with it was that in the beginning there was too much talk of whos going out with who i think that it would be interesting to do a remake of it but on the [UNK] [UNK] they said that they will no

In [11]:
#the explanation for this had a lot of words that i don't understand, look up later
#Sequential allows you to easily lasagna your layers in one line of code
model = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Embedding(
        input_dim=len(encoder.get_vocabulary()),  #vocabulary size, i.e. the number of distinct token ids
        output_dim=64,                            #maps each token id to a trainable vector of 64 numbers
        mask_zero=True),                          #treats id 0 as padding so later layers skip it, which handles variable lengths
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),  #reads the sequence forwards and backwards and outputs one vector per review
    tf.keras.layers.Dense(64, activation='relu'), #hidden layer on top of the LSTM output
    tf.keras.layers.Dense(1)                      #converts the vector into a single logit
])
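
As a sanity check on the architecture, the number of trainable parameters can be worked out by hand with the standard formulas for embedding, LSTM, and dense layers. The sketch below assumes the vocabulary has exactly 1000 entries (MAX_VOCAB) and that the bidirectional wrapper concatenates both directions, which is its default; the result can be compared against `model.summary()` once the model has been built.

```python
vocab_size = 1000   # assumed: len(encoder.get_vocabulary()) with MAX_VOCAB = 1000
embed_dim = 64      # output_dim of the Embedding layer
lstm_units = 64
dense_units = 64

embedding = vocab_size * embed_dim
# One LSTM direction has 4 gates, each with input weights, recurrent weights and a bias.
lstm_one_direction = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
bidirectional = 2 * lstm_one_direction
# Both directions are concatenated, so the hidden Dense layer sees 2 * lstm_units inputs.
dense_hidden = (2 * lstm_units) * dense_units + dense_units
dense_output = dense_units * 1 + 1

total = embedding + bidirectional + dense_hidden + dense_output
print(embedding, bidirectional, dense_hidden, dense_output, total)
```

The text encoder itself contributes no trainable weights, so it does not appear in the total.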

Testing the padding and masking. The same text should produce the same results regardless of whether there is padding. Obviously the actual number does not matter since the model hasn't learned anything yet.

In [12]:
#no padding
sample_text = ('The movie was cool. The animation and the graphics '
               'were out of this world. I would recommend this movie.')
predictions = model.predict(np.array([sample_text]))
print(predictions[0])

[0.00243256]


In [13]:
#with padding
padding = "wow " * 100 #can interchange string with anything to produce same results
predictions = model.predict(np.array([sample_text, padding]))
print(predictions[0])

[0.00243255]
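
Why padding should not change the result can be shown with a toy example. The sketch below uses a made-up scalar RNN rather than the real LSTM, and the embedding values are hypothetical; the only point is that skipping steps whose token id is 0 leaves the final state untouched.

```python
import math

def run_rnn(token_ids, embedding, w=0.5, u=0.8):
    """Toy scalar RNN that skips padded steps (token id 0), as a mask would."""
    h = 0.0
    for tok in token_ids:
        if tok == 0:
            # Masked step: carry the previous state forward unchanged.
            continue
        h = math.tanh(w * embedding[tok] + u * h)
    return h

embedding = {1: 0.1, 2: -0.7, 3: 0.4}   # hypothetical 1-D embeddings
unpadded = [2, 3, 1]
padded = unpadded + [0] * 5             # same review, padded to a longer batch length

print(run_rnn(unpadded, embedding))
print(run_rnn(padded, embedding))
```

In the real model the two printed values agree to about seven decimal places rather than exactly, which is consistent with floating-point rounding in the batched computation.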


In [14]:
#compiling the model
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])

#Training the model

In [None]:
#this takes way too long on CPU: at least 2 s/step, while the demo takes 200-400 ms/step
#each epoch takes >700 seconds, so the full 10 epochs take roughly 2 hours!
#the reason is that the demo runs with GPU acceleration turned on
history = model.fit(train_dataset, 
                    epochs = 10,
                    validation_data = test_dataset,
                    validation_steps = 30)

Epoch 1/10
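
The timing in the comments above can be checked with a quick back-of-the-envelope calculation. This sketch assumes the 25,000-review training split, the batch size of 64 used earlier, and the observed lower bound of 2 seconds per step; it ignores the time spent on validation.

```python
import math

train_examples = 25000      # training split size of imdb_reviews
batch_size = 64             # BATCH_SIZE used when batching the dataset
seconds_per_step_cpu = 2.0  # lower bound observed without GPU acceleration
epochs = 10

steps_per_epoch = math.ceil(train_examples / batch_size)
epoch_seconds = steps_per_epoch * seconds_per_step_cpu
total_hours = epoch_seconds * epochs / 3600

print(steps_per_epoch, epoch_seconds, round(total_hours, 2))
```

At 200-400 ms per step, the same 391 steps would take roughly 80 to 160 seconds per epoch.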

#Testing the model

In [None]:
test_loss, test_acc = model.evaluate(test_dataset)

print('Test Loss:', test_loss)
print('Test Accuracy:', test_acc)

Make a graph to visualize the accuracy and loss of the model over the epochs. Accuracy should climb quickly at first and then level off, with diminishing returns from each extra epoch. Validation accuracy will usually be a little lower than training accuracy, since the model is never trained on the validation data.

In [None]:
plt.figure(figsize=(16, 8))
plt.subplot(1, 2, 1)
plot_graphs(history, 'accuracy')
plt.ylim(None, 1)
plt.subplot(1, 2, 2)
plot_graphs(history, 'loss')
plt.ylim(0, None)

#Testing the code on the given text

Getting the text we want to read into Colab. `files.upload()` opens a file picker for a file on the local machine.
https://towardsdatascience.com/importing-data-to-google-colab-the-clean-way-5ceef9e9e3c8

In [None]:
from google.colab import files
uploaded = files.upload()
uploaded


In [None]:
file_name = list(uploaded.keys())[0]
print(file_name)
uploaded[file_name]

#the uploaded input.txt arrives as bytes, so decode it as UTF-8; its text blocks are separated by blank lines (Windows \r\n line endings)
text_blocks = uploaded[file_name].decode("utf-8").split("\r\n\r\n")

#print out the first 100 characters to confirm that the text blocks have been properly parsed for use
print(text_blocks[0][:100])
print(text_blocks[1][:100])
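
The split above relies on the file using Windows line endings. A slightly more forgiving variant, sketched below, accepts either Windows or Unix endings; `split_blocks` is a hypothetical helper and the sample bytes are made up for illustration.

```python
import re

def split_blocks(raw_bytes):
    """Decode UTF-8 bytes and split on blank lines, for Windows or Unix line endings."""
    text = raw_bytes.decode("utf-8")
    blocks = re.split(r"(?:\r?\n){2,}", text)
    # Drop empty pieces and surrounding whitespace.
    return [b.strip() for b in blocks if b.strip()]

windows_sample = b"First passage.\r\nStill first.\r\n\r\nSecond passage."
unix_sample = b"First passage.\nStill first.\n\nSecond passage."

print(split_blocks(windows_sample))
print(split_blocks(unix_sample))
```

In the notebook this would be called as `split_blocks(uploaded[file_name])`.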

Predicting the first block of text. This one is a passage from Fahrenheit 451 (thank you Mr. Webster) where characters are having a philosophical debate. There is a lot of negative discussion and negative wording between the characters, so I expect the model to pick up on this and rate it slightly negatively.

In [None]:
sample_text = text_blocks[0]
print(sample_text)
predictions = model.predict(np.array([sample_text]))
print(predictions[0])

Predicting the next block of text. This one reads like a recommendation letter, and should score very highly, unless the computer does not recognize some of the very dated English.

In [None]:
sample_text = text_blocks[1]
print(text_blocks[1])
predictions = model.predict(np.array([sample_text]))
print(predictions[0])

In [None]:
positive_words = ["good", "better", "best"]
for word in positive_words:
  predictions = model.predict(np.array([word]))
  print(predictions[0])
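
The numbers printed by `model.predict` are raw logits, because the last layer is `Dense(1)` with no activation and the loss was set up with `from_logits=True`. A minimal sketch for reading them: values at or above 0 lean positive, values below 0 lean negative, and a sigmoid turns a logit into a probability. The logits used here are hypothetical, not outputs of this model.

```python
import math

def logit_to_probability(logit):
    """Sigmoid: maps a raw logit to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-logit))

def label_for(logit):
    # A logit of 0 corresponds to a probability of 0.5, so it is the decision boundary.
    return "positive" if logit >= 0 else "negative"

for logit in [-2.0, 0.0, 1.5]:   # hypothetical values, not outputs of this model
    print(logit, round(logit_to_probability(logit), 3), label_for(logit))
```

In the notebook the same helpers could be applied as `logit_to_probability(float(predictions[0][0]))`.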