# IMDB Dataset Sentiment Analysis

The dataset contains IMDB movie reviews, each annotated as either positive or negative. The task is to build a model that can recognize whether a review is positive or negative. The dataset is taken from Kaggle and can be found [here](https://www.kaggle.com/columbine/imdb-dataset-sentiment-analysis-in-csv-format).

In [45]:
import tensorflow as tf
import tqdm
import nltk
from nltk.corpus import stopwords
import matplotlib.pyplot as plt
import numpy as np
%matplotlib notebook
plt.style.use('dark_background')

# Data Preprocessing

### Reading the data

The data is read line by line, skipping the first line (the header). Each line is parsed as CSV and the parsed records are cached. Only the training data is shuffled. All three datasets are batched with a batch size of 64 and prefetched to reduce step time during training.

In [2]:
def decode_line(line):
    return tf.io.decode_csv(line, record_defaults=[str(), int()])

# Parentheses are used instead of backslashes, because a comment
# after a line-continuation backslash is a syntax error.
train_data = (
    tf.data.TextLineDataset('Train.csv')
    .skip(1)  # Skip header
    .map(decode_line)  # Parse each CSV line into (text, label)
    .cache()  # Cache parsed records (in memory, since no filename is given)
    .shuffle(buffer_size=1024)  # Only training data is shuffled
    .batch(64)
    .prefetch(tf.data.experimental.AUTOTUNE)  # Prepare upcoming batches in the background
)

val_data = tf.data.TextLineDataset('Valid.csv') \
    .skip(1) \
    .map(decode_line) \
    .cache() \
    .batch(64) \
    .prefetch(tf.data.experimental.AUTOTUNE)

test_data = tf.data.TextLineDataset('Test.csv') \
    .skip(1) \
    .map(decode_line) \
    .cache() \
    .batch(64) \
    .prefetch(tf.data.experimental.AUTOTUNE)

x, y = next(iter(train_data))
x[0], y[0]

(<tf.Tensor: shape=(), dtype=string, numpy=b'I really thought this wasn\'t that bad. Not a great work of art but Dermot M was the stronger performer by far. Patricia Arquette was overacting much of the time. He was actually playing cello which was very impressive, and his lines were never forced. Besides, he is an incredibly Beautiful Man. Really sexy. Add that to the talent, and most anything he\'s been in is a lot more tolerable. He always gives his all even if some of the projects he\'s been involved in didn\'t quite hit the highest mark.. Not the fault of the actor in most cases. He\'s unfortunately been in some strange films that just didn\'t resonate at the box office. Always with A-list actors but just not always a "hit". But he is "worth every penny" of any DVD rented or purchased. See The Wedding Date with Debra Messing - one of his best overall films. WORTH EVERY PENNY! ; ) (if you haven\'t seen it yet, do, then you\'ll understand that quote!)'>,
 <tf.Tensor: shape=(), dtype=
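
The `shuffle(buffer_size=1024)` step does not shuffle the whole file at once. It keeps a bounded buffer and draws random elements from it, so a buffer smaller than the dataset only gives an approximate shuffle. The following is a rough pure-Python sketch of that idea, not TensorFlow's actual implementation; the function name `buffered_shuffle` and the `seed` parameter are hypothetical.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in a locally shuffled order using a bounded buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything
            buffer.append(item)
            continue
        # Emit a random buffered element and replace it with the new item
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: emit the remaining elements in random order
    rng.shuffle(buffer)
    yield from buffer

shuffled = list(buffered_shuffle(range(10), buffer_size=3, seed=0))
print(shuffled)
```

With a buffer of 3, the first emitted element can only be one of the first three inputs, which is why a larger buffer mixes the data more thoroughly.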

### Tokenizing and cleaning the sentences

The `Tokenizer` class removes all punctuation and all digits, converts every word to lower case, and then splits each review into single-word tokens.

In [3]:
class Tokenizer(tf.keras.layers.Layer):
    
    def __init__(self, name='Tokenizer'):
        super(Tokenizer, self).__init__(name=name)
    
    def call(self, text):
        # remove everything except words and spaces
        text = tf.strings.regex_replace(text, r'[^\w\s]', '')
        # remove digits
        text = tf.strings.regex_replace(text, r'\d', '')
        # all letters to lower case
        text = tf.strings.lower(text)
        # tokenize sentences into single word tokens
        return tf.strings.split(text)

In [4]:
tokenizer = Tokenizer()

train_data = train_data.map(lambda x, y: (tokenizer(x), y))
val_data = val_data.map(lambda x, y: (tokenizer(x), y))
test_data = test_data.map(lambda x, y: (tokenizer(x), y))

x, y = next(iter(train_data))
x[0], y[0]

(<tf.Tensor: shape=(578,), dtype=string, numpy=
 array([b'let', b'me', b'just', b'say', b'i', b'loved', b'the',
        b'original', b'boogeyman', b'sure', b'its', b'a', b'flawed',
        b'clichd', b's', b'horror', b'movie', b'but', b'hey', b'those',
        b'types', b'are', b'fun', b'to', b'watch', b'and', b'plus', b'it',
        b'gave', b'us', b'something', b'a', b'bit', b'different', b'so',
        b'i', b'gladly', b'bought', b'it', b'and', b'to', b'my',
        b'surprise', b'this', b'movie', b'came', b'along', b'with', b'it',
        b'only', b'copy', b'they', b'had', b'actually', b'so', b'i',
        b'thought', b'eh', b'what', b'the', b'hell', b'and', b'bought',
        b'it', b'mistake', b'so', b'that', b'night', b'i', b'felt', b'in',
        b'the', b'mood', b'to', b'watch', b'a', b'movie', b'i',
        b'actually', b'bought', b'tons', b'that', b'day', b'and',
        b'figured', b'this', b'was', b'the', b'shortest', b'out', b'of',
        b'all', b'the', b'ones', b'i', b
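
To see what the cleaning rules do to a single review, the same three steps can be sketched with Python's `re` module. This is only an approximation of the TensorFlow layer: the `re.ASCII` flag is an assumption made here so that non-ASCII letters are dropped, which is consistent with the `clichd` token in the output above. The function name `tokenize` and the sample sentence are made up for illustration.

```python
import re

def tokenize(text):
    # Remove everything except word characters and whitespace
    text = re.sub(r'[^\w\s]', '', text, flags=re.ASCII)
    # Remove digits
    text = re.sub(r'\d', '', text, flags=re.ASCII)
    # Lower-case and split on whitespace
    return text.lower().split()

tokens = tokenize("A flawed, clichéd 80s horror movie.<br /><br />But hey!")
print(tokens)
# ['a', 'flawed', 'clichd', 's', 'horror', 'moviebr', 'br', 'but', 'hey']
```

The example also shows two side effects of the simple rules: HTML line breaks such as `<br />` leave `br` fragments behind, and `80s` is reduced to `s`.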

### Removing Stopwords

In [5]:
class StopwordFilter(tf.keras.layers.Layer):
    
    def __init__(self, sw, name='StopwordFilter'):
        super(StopwordFilter, self).__init__(name=name)
        
        remove_words_keys = tf.constant(sw, dtype=tf.string)
        remove_words_values = tf.fill([len(sw)], True)
        
        # Create a dict that maps a tf.string to True, if the string is in sw, else False
        stopwords_init = tf.lookup.KeyValueTensorInitializer(
                remove_words_keys, remove_words_values)
        self.stopwords_table = tf.lookup.StaticHashTable(
            stopwords_init, default_value=False)
    
    def call(self, words):
        is_stopword = tf.ragged.map_flat_values(
            self.stopwords_table.lookup, words)
        is_stopword = tf.cast(is_stopword, tf.bool)
        return tf.ragged.boolean_mask(words, ~is_stopword)

In [6]:
stopword_filter = StopwordFilter(stopwords.words('english'))

train_data = train_data.map(lambda x, y: (stopword_filter(x), y))
val_data = val_data.map(lambda x, y: (stopword_filter(x), y))
test_data = test_data.map(lambda x, y: (stopword_filter(x), y))

x, y = next(iter(train_data))
x[0], y[0]

(<tf.Tensor: shape=(187,), dtype=string, numpy=
 array([b'tried', b'really', b'thought', b'maybe', b'gave', b'joao',
        b'pedro', b'rodrigues', b'another', b'chance', b'could', b'enjoy',
        b'movie', b'know', b'seeing', b'fantasma', b'felt', b'ill',
        b'nearly', b'disgusted', b'core', b'reviews', b'quite', b'good',
        b'favor', b'like', b'hell', b'least', b'didnt', b'pay', b'dollars',
        b'quad', b'give', b'shotbr', b'br', b'sometimes', b'better', b'go',
        b'dentist', b'ask', b'root', b'canal', b'without', b'previous',
        b'anesthetic', b'alleviate', b'horror', b'much', b'pain', b'often',
        b'wonder', b'wouldnt', b'better', b'go', b'back', b'childhood',
        b'demand', b'former', b'bullies', b'really', b'let', b'occasions',
        b'often', b'think', b'world', b'really', b'flat', b'sail', b'away',
        b'far', b'enough', b'get', b'away', b'fall', b'clear', b'evil',
        b'lovecraftian', b'thing', b'snatch', b'tentacles', b'squeeze',


### Feature Engineering

Each unique word is mapped to an integer index, so a vocabulary of the training data needs to be built first.

In [7]:
class WordIndexer(tf.keras.layers.Layer):
    
    def __init__(self, vocab, name='WordIndexer'):
        super(WordIndexer, self).__init__(name=name)
        
        vocab_keys = tf.constant(sorted(list(vocab)), dtype=tf.string)
        vocab_values = tf.range(2, len(vocab_keys) + 2, dtype=tf.int64)
        
        # Create a dict that maps a tf.string to a unique index >= 2 for all of the strings in the vocab
        # else return 1, 0 is reserved for padding
        vocab_init = tf.lookup.KeyValueTensorInitializer(
            vocab_keys, vocab_values)
        self.vocab_table = tf.lookup.StaticHashTable(
            vocab_init, default_value=1)

    def call(self, words):
        return tf.ragged.map_flat_values(self.vocab_table.lookup, words)

In [8]:
vocab = set()
for x, y in tqdm.tqdm(train_data):
    vocab.update(word for text in x for word in text.numpy())

625it [00:26, 23.35it/s]


In [9]:
word_indexer = WordIndexer(vocab)

train_data = train_data.map(lambda x, y: (word_indexer(x), y))
val_data = val_data.map(lambda x, y: (word_indexer(x), y))
test_data = test_data.map(lambda x, y: (word_indexer(x), y))

x, y = next(iter(train_data))
x[0], y[0]

(<tf.Tensor: shape=(152,), dtype=int64, numpy=
 array([ 21330,  48756, 144877, 132738,  80409,  76889,  50541, 135975,
        104436, 137679, 115952,  57114, 136645, 142808, 142462,  68681,
        123863,  26536, 135125,    213, 123013,  50601,  52514,  16397,
         57114,  20956,  67094,  44742, 138843,  60798,  29957,  82285,
         18300,   3947,  59605,   6114,  81928, 135975, 151831,  73148,
        113137,  13805, 131075, 120582,  48417, 114444, 133309,  77103,
        144780,  42717,  82291,  46706,  60324,  44442,  41854,  15955,
        144780, 144018,  85466,  34880, 113545,  97720,  15955, 144780,
         46004,  31212,  48913,  85716, 118775, 124180,  16397,  80042,
         50601,  53017,  79593, 138858,  46759,  47551,  77422,  86572,
         10971,  50541,  82285, 130160, 114857, 139736,  16397,  58243,
         62993,  42734,  11493,  54426,  50541, 150176,  88940, 119780,
         78669, 118775,  35672,  43705, 147364, 130864,  22863,  42717,
        119780,  
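
The indexing scheme can be illustrated with a plain dictionary: known words get indices starting at 2, unknown words map to 1, and 0 stays free for padding. This is a simplified sketch on Python lists with a made-up three-word vocabulary; the helper names `build_index` and `index_words` are hypothetical.

```python
def build_index(vocab):
    """Map each word in sorted order to a unique index >= 2."""
    return {word: i for i, word in enumerate(sorted(vocab), start=2)}

def index_words(batch, word_to_index, unknown_index=1):
    """Replace each word by its index, or by unknown_index if it is not in the vocabulary."""
    return [[word_to_index.get(w, unknown_index) for w in review]
            for review in batch]

example_vocab = {'movie', 'good', 'bad'}
word_to_index = build_index(example_vocab)
indexed = index_words([['good', 'movie'], ['bad', 'sequel']], word_to_index)
print(word_to_index)  # {'bad': 2, 'good': 3, 'movie': 4}
print(indexed)        # [[3, 4], [2, 1]]
```

Words that only occur in the validation or test data, like `sequel` here, all share the index 1.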

### Padding

The longest review in the training data defines the input size. Reviews with fewer words are padded with zeros to match that size.

In [10]:
class Padder(tf.keras.layers.Layer):
    
    def __init__(self, input_size, name='Padder'):
        super(Padder, self).__init__(name=name)
        
        self.input_size = input_size

    def call(self, x):
        # Pad the ragged tensors with 0 to build a full tensor
        return x.to_tensor(default_value=0, shape=[None, self.input_size])

input_size = max(train_data.map(lambda x, y: x.bounding_shape()[1]))
padder = Padder(input_size)

train_data = train_data.map(lambda x, y: (padder(x), y))
val_data = val_data.map(lambda x, y: (padder(x), y))
test_data = test_data.map(lambda x, y: (padder(x), y))

In [11]:
x, y = next(iter(train_data))
x, y

(<tf.Tensor: shape=(64, 1440), dtype=int64, numpy=
 array([[ 41911,  77144,  18104, ...,      0,      0,      0],
        [129163,  49237, 118677, ...,      0,      0,      0],
        [154690,  13306,  46612, ...,      0,      0,      0],
        ...,
        [ 21330, 135042,  42427, ...,      0,      0,      0],
        [148998,  66986,  28836, ...,      0,      0,      0],
        [ 88605,  57114,  80208, ...,      0,      0,      0]])>,
 <tf.Tensor: shape=(64,), dtype=int32, numpy=
 array([0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0,
        1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1,
        0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0],
       dtype=int32)>)
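
As a small illustration of the padding step, the following sketch pads lists of indices to a fixed length. It also truncates longer sequences, under the assumption that this is how a validation or test review longer than `input_size` would be handled when a fixed shape is requested. The function name `pad_batch` is hypothetical.

```python
def pad_batch(batch, input_size, pad_value=0):
    """Pad (or truncate) every sequence in the batch to exactly input_size entries."""
    return [review[:input_size] + [pad_value] * (input_size - len(review))
            for review in batch]

padded = pad_batch([[3, 4], [2, 1, 5, 6, 7]], input_size=4)
print(padded)  # [[3, 4, 0, 0], [2, 1, 5, 6]]
```

Because 0 is never assigned to a word, the padded positions cannot be confused with real tokens.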

# Model Definition

In [12]:
model = tf.keras.Sequential([
    tf.keras.Input([input_size,]),
    tf.keras.layers.Embedding(len(vocab) + 2, 4, input_length=input_size),
    tf.keras.layers.SeparableConv1D(4, kernel_size=7, padding='same'),
    tf.keras.layers.Activation('relu'),
    tf.keras.layers.MaxPooling1D(pool_size=5, strides=4),
    tf.keras.layers.Conv1D(8, kernel_size=3, strides=2, padding='same'),
    tf.keras.layers.Activation('relu'),
    tf.keras.layers.Conv1D(16, kernel_size=3, strides=2, padding='same'),
    tf.keras.layers.Activation('relu'),
    tf.keras.layers.Conv1D(16, kernel_size=3, strides=2, padding='same'),
    tf.keras.layers.Activation('relu'),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1),
    tf.keras.layers.Activation('sigmoid')])

print(model.summary())

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 1440, 4)           622600    
_________________________________________________________________
separable_conv1d (SeparableC (None, 1440, 4)           48        
_________________________________________________________________
activation (Activation)      (None, 1440, 4)           0         
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 359, 4)            0         
_________________________________________________________________
conv1d (Conv1D)              (None, 180, 8)            104       
_________________________________________________________________
activation_1 (Activation)    (None, 180, 8)            0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 90, 16)            4
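
The output shapes and parameter counts in the summary can be checked by hand. The helper functions below are hypothetical and only encode the usual formulas: `'same'` padding gives `ceil(length / stride)`, pooling without padding gives `floor((length - pool_size) / stride) + 1`, and a separable convolution has a depthwise kernel, a pointwise kernel, and one bias per output channel.

```python
import math

def same_conv_length(length, stride):
    # Output length of a convolution with padding='same'
    return math.ceil(length / stride)

def valid_pool_length(length, pool_size, stride):
    # Output length of a pooling layer without padding
    return (length - pool_size) // stride + 1

def conv1d_params(kernel_size, in_channels, out_channels):
    # Kernel weights plus one bias per output channel
    return kernel_size * in_channels * out_channels + out_channels

def separable_conv1d_params(kernel_size, in_channels, out_channels):
    # Depthwise kernel + pointwise kernel + bias (depth multiplier of 1 assumed)
    return kernel_size * in_channels + in_channels * out_channels + out_channels

pooled = valid_pool_length(1440, pool_size=5, stride=4)  # max_pooling1d
conv_a = same_conv_length(pooled, stride=2)              # conv1d
conv_b = same_conv_length(conv_a, stride=2)              # conv1d_1
embedding_rows = 622600 // 4                             # rows of the embedding matrix

print(pooled, conv_a, conv_b)             # 359 180 90
print(separable_conv1d_params(7, 4, 4))   # 48
print(conv1d_params(3, 4, 8))             # 104
print(embedding_rows)                     # 155650
```

The 155650 embedding rows correspond to `len(vocab) + 2`, so the vocabulary holds 155648 unique tokens. Almost all trainable parameters therefore sit in the embedding layer.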

# Training

In [13]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model_history = model.fit(train_data, validation_data=val_data, epochs=8)

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


In [46]:
# Plotting the accuracy per epoch
plt.plot(model_history.history['accuracy'], color=(0.2, 0.4, 0.6, 0.6))
plt.plot(model_history.history['val_accuracy'], color=(0.2, 0.9, 0.8, 0.6))
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('Epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.gcf().patch.set_alpha(0.0)
plt.gca().patch.set_alpha(0.0)
plt.tight_layout()
plt.show()


In [15]:
model.evaluate(test_data)



[0.7511438131332397, 0.8705999851226807]
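
The two values returned by `model.evaluate` are the loss (binary cross-entropy) followed by the accuracy, in the order given to `model.compile`. As a rough illustration of what these two numbers measure, here is a pure-Python sketch on four made-up predictions. The labels and probabilities are hypothetical, and the 0.5 threshold is an assumption about how the accuracy metric binarizes the sigmoid output.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy, clipping probabilities to avoid log(0)."""
    losses = []
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1.0 - p)))
    return sum(losses) / len(losses)

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions on the correct side of the threshold."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

# Hypothetical labels and predicted probabilities
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.1, 0.4, 0.8]

loss_value = binary_crossentropy(y_true, y_prob)
acc_value = accuracy(y_true, y_prob)
print(loss_value, acc_value)
```

A model can have a high accuracy and still a fairly high loss when its few wrong predictions are made with great confidence, since the cross-entropy penalizes confident mistakes heavily.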

# Full Model Export

In [27]:
# Create a model with the preprocessing pipeline
full_model = tf.keras.Sequential([
    tf.keras.layers.Input([], dtype=tf.string),
    tokenizer, stopword_filter, word_indexer, padder,
    model])

In [43]:
example_positive_review = 'This is one of the finest science fiction ' \
    'films ever made. Everything is so carefully and expertly constructed ' \
    'to the point that repeated viewings are just as good as the first. ' \
    'Also, the atmosphere, along with the amazing sets, is real shocker ' \
    'and few movies have managed to create the same kind eerie feeling.'
example_negative_review = 'I watch a lot of movies and I like to give them ' \
    'all a chance just in case there is something interesting or exciting to ' \
    'warrant a viewing Unfortunately this movie has none of these features it ' \
    'is pointless and offers nothing in the way of story line,acting or direction ' \
    'The plot is non-existent with the actors just going through the motions and ' \
    'the dialogue is sooo boring its embarrassing. I wish the previous reviewers ' \
    'had posted earlier as this would have saved me 95 mins of my time'
test_sentence = 'So good. Very good. Best movie ever! So cool.'
test2_sentence = 'I wish I could say: "So good. Very good. Best movie ever!' \
    ' So cool." But this is not the case. It was actually hard to watch.'
test3_sentence = 'Thanks for your attention!'

In [44]:
with np.printoptions(precision=3, suppress=True):
    print(full_model(tf.constant([example_positive_review, 
                                  example_negative_review,
                                  test_sentence, 
                                  test2_sentence, 
                                  test3_sentence])).numpy())

[[0.995]
 [0.   ]
 [0.979]
 [0.984]
 [0.976]]
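
The sigmoid outputs can be turned into labels by thresholding. The sketch below copies the five scores printed above and assumes a threshold of 0.5, with scores above it read as positive.

```python
# Scores copied from the output above, in the same order as the inputs
scores = {
    'example_positive_review': 0.995,
    'example_negative_review': 0.0,
    'test_sentence': 0.979,
    'test2_sentence': 0.984,
    'test3_sentence': 0.976,
}

# Assumed decision rule: a score above 0.5 counts as positive
labels = {name: 'positive' if score > 0.5 else 'negative'
          for name, score in scores.items()}

for name, label in labels.items():
    print(f'{name}: {label}')
```

Under this rule `test2_sentence` is labeled positive although it describes a negative experience, and the neutral `test3_sentence` is also labeled positive. This suggests the model reacts to individual words rather than to negation or context.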
