## Step 1: Downloading the data

As in the XGBoost in SageMaker notebook, we will be using the [IMDb dataset](http://ai.stanford.edu/~amaas/data/sentiment/).

> Maas, Andrew L., et al. [Learning Word Vectors for Sentiment Analysis](http://ai.stanford.edu/~amaas/data/sentiment/). In _Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies_. Association for Computational Linguistics, 2011.

gunzip -c aclImdb_v1.tar.gz | tar xopf -
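
If you would rather stay in Python than shell out, the same extraction can be sketched with the standard-library `tarfile` module. This is a minimal sketch, assuming the archive `aclImdb_v1.tar.gz` has already been downloaded into the working directory; the helper name `extract_archive` and its `dest_dir` parameter are hypothetical and not part of the original notebook.

```python
import tarfile

def extract_archive(archive_path, dest_dir="."):
    """Extract a gzip-compressed tar archive into dest_dir (hypothetical helper)."""
    with tarfile.open(archive_path, "r:gz") as archive:
        # Newer Python releases support extraction filters; use the safer
        # "data" filter when it is available and fall back otherwise.
        if hasattr(tarfile, "data_filter"):
            archive.extractall(path=dest_dir, filter="data")
        else:
            archive.extractall(path=dest_dir)

# Example call (assumes the archive is in the current directory):
# extract_archive("aclImdb_v1.tar.gz")
```

The extracted `aclImdb` directory is what `data_dir` should point to when reading the reviews.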

## Step 2: Preparing and processing the data

As in the XGBoost notebook, we start with some initial data processing, and the first few steps are the same as in that example. To begin with, we will read in each of the reviews and combine them into a single input structure. Then, we will split the dataset into a training set and a testing set.

In [1]:
import os
import glob

def read_imdb_data(data_dir='/Users/kurie_jumi/dev/aclImdb'):
    data = {}
    labels = {}
    
    for data_type in ['train', 'test']:
        data[data_type] = {}
        labels[data_type] = {}
        
        for sentiment in ['pos', 'neg']:
            data[data_type][sentiment] = []
            labels[data_type][sentiment] = []
            
            path = os.path.join(data_dir, data_type, sentiment, '*.txt')
            files = glob.glob(path)
            
            for f in files:
                with open(f, encoding='utf-8') as review:  # Explicit encoding so reads do not depend on the platform default
                    data[data_type][sentiment].append(review.read())
                    # Here we represent a positive review by '1' and a negative review by '0'
                    labels[data_type][sentiment].append(1 if sentiment == 'pos' else 0)
                    
            assert len(data[data_type][sentiment]) == len(labels[data_type][sentiment]), \
                    "{}/{} data size does not match labels size".format(data_type, sentiment)
                
    return data, labels

In [2]:
data, labels = read_imdb_data()
print("IMDB reviews: train = {} pos / {} neg, test = {} pos / {} neg".format(
            len(data['train']['pos']), len(data['train']['neg']),
            len(data['test']['pos']), len(data['test']['neg'])))

IMDB reviews: train = 12500 pos / 12500 neg, test = 12500 pos / 12500 neg


Now that we've read the raw training and testing data from the downloaded dataset, we will combine the positive and negative reviews and shuffle the resulting records.

In [3]:
from sklearn.utils import shuffle

def prepare_imdb_data(data, labels):
    """Prepare training and test sets from IMDb movie reviews."""
    
    #Combine positive and negative reviews and labels
    data_train = data['train']['pos'] + data['train']['neg']
    data_test = data['test']['pos'] + data['test']['neg']
    labels_train = labels['train']['pos'] + labels['train']['neg']
    labels_test = labels['test']['pos'] + labels['test']['neg']
    
    #Shuffle reviews and corresponding labels within training and test sets
    data_train, labels_train = shuffle(data_train, labels_train)
    data_test, labels_test = shuffle(data_test, labels_test)
    
    # Return unified training data, test data, training labels, test labels
    return data_train, data_test, labels_train, labels_test

In [4]:
train_X, test_X, train_y, test_y = prepare_imdb_data(data, labels)
print("IMDb reviews (combined): train = {}, test = {}".format(len(train_X), len(test_X)))

IMDb reviews (combined): train = 25000, test = 25000


Now that we have our training and testing sets unified and prepared, we should do a quick check and see an example of the data our model will be trained on. This is generally a good idea as it allows you to see how each of the further processing steps affects the reviews and it also ensures that the data has been loaded correctly.

In [5]:
print(train_X[100])
print(train_y[100])

Of all the movies I have seen, and that's most of them, this is by far the best one made that is primarily about the U.S. Naval Airships (Blimps) during the WW-II era. Yes there are other good LTA related movies, but most use special effects more than any real-time shots. This Man's Navy has considerably more real-time footage of blimps etc. True, lots of corny dialog but that's what makes more interesting Hollywood movies, even today. P.S. I spent 10 years(out of 20) and have over 5,000 hours in Navy Airships of all types, from 1949 through 1959. Proud member of the Naval Airship Association etc. [ATC(LA/AC) USN Retired]
1


The first step in processing the reviews is to remove any HTML tags that appear. In addition, we wish to stem the words in our input so that words such as *entertained* and *entertaining* are considered the same with regard to sentiment analysis.

In [6]:
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

import re
from bs4 import BeautifulSoup

def review_to_words(review):
    nltk.download("stopwords", quiet=True)
    stemmer = PorterStemmer()
    stop_words = set(stopwords.words("english")) # Build the stopword set once per call
    
    text = BeautifulSoup(review, "html.parser").get_text() # Remove HTML tags
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower()) # Convert to lower case and replace non-alphanumeric characters with spaces
    words = text.split() # Split string into words
    words = [w for w in words if w not in stop_words] # Remove stopwords
    words = [stemmer.stem(w) for w in words] # Stem each word, reusing the stemmer created above
    
    return words

The `review_to_words` function defined above uses `BeautifulSoup` to remove any HTML tags that appear and uses the `nltk` package to remove stopwords and stem the remaining words. As a check to ensure we know how everything is working, try applying `review_to_words` to one of the reviews in the training set.
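
To see what stemming does in isolation, here is a small sketch that runs `PorterStemmer` from `nltk` on the two example words mentioned earlier. It only assumes that `nltk` is installed; the stemmer itself needs no extra corpus download.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Stem two inflected forms of the same word and compare the results.
stems = {word: stemmer.stem(word) for word in ["entertained", "entertaining"]}
print(stems)
```

Both words should reduce to the same stem, which is why they end up being treated as a single feature later on.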

In [7]:
# TODO: Apply review_to_words to a review (train_X[100] or any other review)
review_to_words(train_X[100])

['movi',
 'seen',
 'far',
 'best',
 'one',
 'made',
 'primarili',
 'u',
 'naval',
 'airship',
 'blimp',
 'ww',
 'ii',
 'era',
 'ye',
 'good',
 'lta',
 'relat',
 'movi',
 'use',
 'special',
 'effect',
 'real',
 'time',
 'shot',
 'man',
 'navi',
 'consider',
 'real',
 'time',
 'footag',
 'blimp',
 'etc',
 'true',
 'lot',
 'corni',
 'dialog',
 'make',
 'interest',
 'hollywood',
 'movi',
 'even',
 'today',
 'p',
 'spent',
 '10',
 'year',
 '20',
 '5',
 '000',
 'hour',
 'navi',
 'airship',
 'type',
 '1949',
 '1959',
 'proud',
 'member',
 'naval',
 'airship',
 'associ',
 'etc',
 'atc',
 'la',
 'ac',
 'usn',
 'retir']

In [None]:
# I will skip the caching step (and probably just work with less data for testing)

## Transform the data

In the XGBoost notebook we transformed the data from its word representation to a bag-of-words feature representation. For the model we are going to construct in this notebook we will use a very similar feature representation. To start, we will represent each word as an integer. Of course, some of the words that appear in the reviews occur very infrequently and so likely don't contain much information for the purposes of sentiment analysis. To deal with this, we will fix the size of our working vocabulary and only include the words that appear most frequently. We will then combine all of the infrequent words into a single category which, in our case, we will label `1`.

Since we will be using a recurrent neural network, it will be convenient if the length of each review is the same. To do this, we will fix a size for our reviews and then pad short reviews with the category 'no word' (which we will label `0`) and truncate long reviews.
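
The encoding and padding described above can be sketched as follows. This is a minimal illustration, not the notebook's final implementation: the function name `convert_and_pad`, the constants `NO_WORD` and `INFREQUENT`, the default `pad=500`, and the toy dictionary are all assumptions made for the example.

```python
NO_WORD = 0      # label used to pad short reviews ('no word')
INFREQUENT = 1   # label for words outside the working vocabulary

def convert_and_pad(word_dict, words, pad=500):
    """Map a list of words to a fixed-length list of integer labels.

    Returns the encoded review and its length before padding
    (capped at `pad` because long reviews are truncated).
    """
    # Truncate long reviews, then look up each word; unknown words map to INFREQUENT.
    encoded = [word_dict.get(word, INFREQUENT) for word in words[:pad]]
    length = len(encoded)
    # Pad short reviews with NO_WORD up to the fixed size.
    encoded += [NO_WORD] * (pad - length)
    return encoded, length

# Toy vocabulary: real labels start at 2 because 0 and 1 are reserved.
toy_dict = {"interest": 2, "stori": 3}
print(convert_and_pad(toy_dict, ["stori", "interest", "blimp"], pad=5))
# → ([3, 2, 1, 0, 0], 3)
```

Returning the original length alongside the padded list is a design choice that lets a model ignore the padding later, if it needs to.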

### (TODO) Create a word dictionary

To begin with, we need to construct a way to map words that appear in the reviews to integers. Here we fix the size of our vocabulary (including the 'no word' and 'infrequent' categories) to be `5000` but you may wish to change this to see how it affects the model.

> **TODO:** Complete the implementation for the `build_dict()` method below. Note that even though the vocab_size is set to `5000`, we only want to construct a mapping for the most frequently appearing `4998` words. This is because we want to reserve the special labels `0` for 'no word' and `1` for 'infrequent word'.

In [8]:
sample_data = train_X[:100]

In [72]:
sample_data[1]

'Sorry, gave it a 1, which is the rating I give to movies on which I walk out or fall asleep. In this case I fell asleep 10 minutes from the end, really, really bored and not caring at all about what happened next.'

In [64]:
sample_data_s = train_X[:2]

In [102]:

sample_data_s = ['The story is interesting.', 'The film was interesting','Sorry, gave it a 1.', 'The movie sucks.']


In [105]:
import numpy as np
import operator

def build_dict(data, vocab_size = 5000):
    """Construct and return a dictionary mapping each of the most frequently appearing words to a unique integer."""
    
    # TODO: Determine how often each word appears in `data`. Note that in this version `data` is a list of raw
    #       review strings, and each one is converted to a list of words with review_to_words below.
    
    word_count = {} # A dict storing the words that appear in the reviews along with how often they occur
    
    for sentence in data:
        # preprocess the sentence in 'data' by using review_to_words to create a list of words
        word_list = review_to_words(sentence)
        
        print(word_list)
        # add words as key to word_count dictionary and count of words as values
        for word in word_list:
            # Increment the count, starting from 0 for words not seen before
            word_count[word] = word_count.get(word, 0) + 1
    
    # TODO: Sort the words found in `data` so that sorted_words[0] is the most frequently appearing word and
    #       sorted_words[-1] is the least frequently appearing word.
    print(word_count)
    sorted_words = [word for word, _ in sorted(word_count.items(), key=operator.itemgetter(1), reverse=True)]
    print(type(sorted_words))
    print(sorted_words)
    
    word_dict = {} # This is what we are building, a dictionary that translates words into integers
    
    for idx, word in enumerate(sorted_words[:vocab_size - 2]): # The -2 is so that we save room for the 'no word'
        word_dict[word] = idx + 2                              # and 'infrequent' labels
         
    return word_dict
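
For comparison, the same counting and ranking can be sketched more compactly with `collections.Counter`. This is an alternative under stated assumptions, not a replacement for the function above: the name `build_dict_from_tokens` is hypothetical, and it assumes the reviews have already been converted to lists of words (for example with `review_to_words`).

```python
from collections import Counter

def build_dict_from_tokens(tokenized_reviews, vocab_size=5000):
    """Map the most frequent words to integers starting at 2 (hypothetical variant).

    `tokenized_reviews` is assumed to be a list of reviews, each a list of words.
    Labels 0 and 1 stay reserved for 'no word' and 'infrequent word'.
    """
    # Count every word across all reviews.
    word_count = Counter(word for review in tokenized_reviews for word in review)
    # Keep only the most frequent words, leaving room for the two reserved labels.
    most_common = word_count.most_common(vocab_size - 2)
    return {word: idx + 2 for idx, (word, _) in enumerate(most_common)}

toy_reviews = [["stori", "interest"], ["film", "interest"]]
print(build_dict_from_tokens(toy_reviews))
```

Here `interest` appears twice, so it receives the label `2`, and the remaining words take the labels that follow.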

In [107]:
word_dict = build_dict(sample_data_s)


['stori', 'interest']
['film', 'interest']
['sorri', 'gave', '1']
['movi', 'suck']
{'stori': 1, 'interest': 2, 'film': 1, 'sorri': 1, 'gave': 1, '1': 1, 'movi': 1, 'suck': 1}
<class 'list'>
['interest', 'stori', 'film', 'sorri', 'gave', '1', 'movi', 'suck']


In [110]:
# TODO: Use this space to determine the five most frequently appearing words in the training set.
for key,value in word_dict.items():
    if value < 7:
        print("key = {}, value = {}".format(key,value))

key = interest, value = 2
key = stori, value = 3
key = film, value = 4
key = sorri, value = 5
key = gave, value = 6
