# Classifying IMDb Reviews with an RNN

We'll design an RNN model that classifies a movie review from the [IMDb](https://www.imdb.com/) website as either positive (1) or negative (0).

Start by importing the modules, types and functions that we'll need.

In [1]:
import nltk
import pandas as pd
import re
import torch
from math import ceil, log
from nltk import WordNetLemmatizer
from torch import optim
from torch import nn
from torchtext import datasets
from typing import Dict, Iterable, List, Tuple

## 1) Load Dataset

The dataset is available via the `torchtext` module (specifically, the `torchtext.datasets` submodule). However, it comes as a sequence of `(sentiment, review)` tuples.

We'll read the dataset from the source and transform it into a `pandas` `DataFrame` so that it's easier to manipulate.

In [2]:
def sentiment_to_int(sentiment: str) -> int:
    # 1 means 'positive', while 0 means 'negative'.
    return int(sentiment.lower() == 'pos')

def convert_dataset_tuple(dataset_tuple: Tuple[str, str]) -> Tuple[int, str]:
    # Convert the sentiment tag into an integer.
    # Make the review all lower case to normalize it.
    (sentiment, review) = dataset_tuple
    return (sentiment_to_int(sentiment), review.lower())

def convert_dataset_to_list(dataset: Iterable[Tuple[str, str]]) -> List[Tuple[int, str]]:
    # Convert all tuples to our specifications.
    return list(map(convert_dataset_tuple, dataset))

def convert_dataset_to_dataframe(dataset: Iterable[Tuple[str, str]]) -> pd.DataFrame:
    # Build a pandas DataFrame from the converted tuples.
    return pd.DataFrame(data=convert_dataset_to_list(dataset),
                        columns=['sentiment', 'review'])
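
`sentiment_to_int()` assumes that the labels arrive as the strings `'pos'` and `'neg'`. As a hedged sketch, the hypothetical helper `label_to_int()` below also accepts integer labels, under the assumption (not confirmed by this notebook, so check the documentation of your `torchtext` release) that 1 means negative and 2 means positive.

```python
from typing import Union

def label_to_int(label: Union[str, int]) -> int:
    """Map a raw IMDb label to 1 (positive) or 0 (negative)."""
    if isinstance(label, str):
        # String labels, as used in this notebook: 'pos' / 'neg'.
        return int(label.strip().lower() == 'pos')
    # ASSUMPTION: integer labels use 1 for negative and 2 for positive.
    return int(label == 2)

print([label_to_int(x) for x in ['pos', 'neg', 'POS', 1, 2]])
# [1, 0, 1, 0, 1]
```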

In [3]:
def load_imdb_dataframes() -> Tuple[pd.DataFrame, pd.DataFrame]:
    # Read and transform both the train and test datasets.
    (train_set_iter, test_set_iter) = datasets.IMDB()
    return (convert_dataset_to_dataframe(train_set_iter),
            convert_dataset_to_dataframe(test_set_iter))

In [4]:
(df_train, df_test) = load_imdb_dataframes()

## 2) Analyse Dataset

How many records are available in each dataset?

In [5]:
len(df_train), len(df_test)

(25000, 25000)

Both the train and test datasets are the same size: 25,000 records.

How do these records split between positive and negative reviews?

In [6]:
df_train['sentiment'].value_counts(normalize=True)

0    0.5
1    0.5
Name: sentiment, dtype: float64

In [7]:
df_test['sentiment'].value_counts(normalize=True)

0    0.5
1    0.5
Name: sentiment, dtype: float64

Both datasets are balanced, with 50% of the records in each class. That's good, because it may spare us from dealing with class weights during the model training phase.

But are there any empty reviews in those datasets? If so, we may have to discard them.

In [8]:
def trim_str(value: str) -> str:
    # Trim if not empty.
    return value.strip() if value else value

def is_str_null_or_empty(value: str) -> bool:
    # Both bool('') and bool(None) evaluate to False.
    return not trim_str(value)

def is_str_series_complete(series: Iterable[str]) -> bool:
    # If there's no single null/empty string,
    # then the series is complete.
    return not any(map(is_str_null_or_empty, series))

In [9]:
is_str_series_complete(df_train['review'])

True

In [10]:
is_str_series_complete(df_test['review'])

True

No, there are no empty reviews in the datasets, so the class balance is preserved.

Normally, we'd need to analyse the datasets statistically, but they're entirely text-based. Other than the balance of the classes, there's not much left to look at.

Out of curiosity, what are the sizes of the shortest and longest reviews?

In [11]:
def minmax_length_in_str_series(series: pd.Series) -> Tuple[int, int]:
    # Call both pd.Series.min and pd.Series.max on the series.
    minmax_length = series.apply(len).agg(['min', 'max'])
    return (minmax_length['min'], minmax_length['max'])

In [12]:
minmax_length_in_str_series(df_train['review'])

(52, 13704)

In [13]:
minmax_length_in_str_series(df_test['review'])

(32, 12988)

There were definitely some very excited people reviewing those movies... (others, not so much)

What do these reviews look like?

In [14]:
df_train.head()

| | sentiment | review |
|---|---|---|
| 0 | 0 | i rented i am curious-yellow from my video sto... |
| 1 | 0 | "i am curious: yellow" is a risible and preten... |
| 2 | 0 | if only to avoid making this type of film in t... |
| 3 | 0 | this film was probably inspired by godard's ma... |
| 4 | 0 | oh, brother...after hearing about this ridicul... |


In [15]:
df_test.head()

| | sentiment | review |
|---|---|---|
| 0 | 0 | i love sci-fi and am willing to put up with a ... |
| 1 | 0 | worth the entertainment value of a rental, esp... |
| 2 | 0 | its a totally average film with a few semi-alr... |
| 3 | 0 | star rating: ***** saturday night **** friday ... |
| 4 | 0 | first off let me say, if you haven't enjoyed a... |


In [16]:
def print_first_n_reviews(dataframe: pd.DataFrame, n: int = 5) -> None:
    for review in dataframe['review'][:n]:
        print(f'\t{review}\n')

def print_first_n_reviews_from_dataframes(n: int = 5) -> None:
    for df in [df_train, df_test]:
        print_first_n_reviews(df, n)

In [17]:
print_first_n_reviews_from_dataframes()

	i rented i am curious-yellow from my video store because of all the controversy that surrounded it when it was first released in 1967. i also heard that at first it was seized by u.s. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" i really had to see this for myself.<br /><br />the plot is centered around a young swedish drama student named lena who wants to learn everything she can about life. in particular she wants to focus her attentions to making some sort of documentary on what the average swede thought about certain political issues such as the vietnam war and race issues in the united states. in between asking politicians and ordinary denizens of stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />what kills me about i am curious-yellow is that 40 years ago, this was considered pornographic. really, the sex and nudity scenes are few and far between, ev

We can see that there are HTML line breaks amidst the reviews. We'll have to remove them.

## 3) Process Datasets

After playing around a little with the datasets, I came up with the following strategy to transform the reviews into usable data for the classification model:

1. Remove the HTML line breaks;
2. Break the review into sentences;
3. Break each sentence into tokens;
4. Filter out any entirely non-alphanumerical token;
5. Filter out stopwords;
6. Lemmatize remaining words.

As I played around with the datasets, I noticed that the line breaks are the only HTML tags in the reviews. That means we won't need a full HTML-cleaning module; a simple regular expression will do just fine.

In [18]:
def remove_html_line_breaks(review: str) -> str:
    # We split the review where there's a line break tag,
    # and join all the pieces again, with a blank space in between.
    html_line_break_pattern = r'<\s*(?:/\s*br|br\s*/)\s*>'
    return ' '.join(re.split(html_line_break_pattern, review))
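
As a quick check of what that pattern does and doesn't catch, the sketch below runs it on a made-up string. The pattern requires a slash, so `<br />` and `</br>` are removed while a bare `<br>` is left in place. `LOOSE_BR` and `strip_breaks()` are hypothetical names introduced only for this comparison; they aren't part of the notebook.

```python
import re

# Pattern used in the notebook: needs a slash, so it matches <br /> and </br>.
STRICT_BR = r'<\s*(?:/\s*br|br\s*/)\s*>'
# Hypothetical looser variant: the slash is optional on either side.
LOOSE_BR = r'<\s*/?\s*br\s*/?\s*>'

def strip_breaks(text: str, pattern: str) -> str:
    # Split on the tag and re-join the pieces with single spaces.
    return ' '.join(re.split(pattern, text))

sample = 'good.<br /><br />bad<br>end'
print(repr(strip_breaks(sample, STRICT_BR)))  # 'good.  bad<br>end'
print(repr(strip_breaks(sample, LOOSE_BR)))   # 'good.  bad end'
```

Two consecutive tags leave a double space behind, which shouldn't matter here because the text is tokenized afterwards.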

To break the reviews into sentences, and then into words, we'll use the `nltk` module, a toolkit for natural language processing. It provides the `sent_tokenize()` and `word_tokenize()` functions, which split strings into sentences and into words, respectively. Both rely on NLTK's tokenizer data, which may need to be downloaded with `nltk.download()` beforehand.

In [19]:
def gen_sentences_from_review(review: str) -> Iterable[str]:
    # Split into sentences, but keep only the non-empty ones.
    yield from filter(bool, nltk.sent_tokenize(review))

def gen_words_from_sentence(sentence: str) -> Iterable[str]:
    # Split into words, but keep only the non-empty ones.
    yield from filter(bool, nltk.word_tokenize(sentence))

Here, we define the filter that drops tokens containing no alphanumerical characters at all.

In [20]:
def is_alphanumerical_token(token: str) -> bool:
    # If it has any alphanumerical character,
    # then it's an alphanumerical token.
    return any(filter(str.isalnum, token))

def filter_alphanumerical_tokens(token_iter: Iterable[str]) -> Iterable[str]:
    # Keep only what is considered alphanumerical.
    yield from filter(is_alphanumerical_token, token_iter)
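
To make the rule concrete, here is a small standalone sketch on a hand-picked token list. It restates `is_alphanumerical_token()` with a generator expression, which should behave the same as the `any(filter(...))` form above. A token survives as soon as it contains one letter or digit, which is why tokens such as `'1967.'` or `'u.s.'` keep their punctuation.

```python
def is_alphanumerical_token(token: str) -> bool:
    # True as soon as one character is a letter or a digit.
    return any(ch.isalnum() for ch in token)

# Hand-picked tokens, similar to what a word tokenizer might emit.
tokens = ['1967.', 'u.s.', 'curious-yellow', '...', ',', "n't", '``']
kept = [t for t in tokens if is_alphanumerical_token(t)]
print(kept)
# ['1967.', 'u.s.', 'curious-yellow', "n't"]
```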

We'll also use the stopword list from the `nltk` module to filter the stopwords out of the text.

In [21]:
en_stopwords = set(nltk.corpus.stopwords.words('english'))
# These ones aren't included in the set, so we'll add them.
en_stopwords |= {"n't", "'s", "'t", "'ve"}

def is_not_stopword(word: str) -> bool:
    # Not in the set, not a stopword.
    return word not in en_stopwords

def filter_relevant_tokens(word_iter: Iterable[str]) -> Iterable[str]:
    # Keep only non-stopwords.
    yield from filter(is_not_stopword, word_iter)

We'll use the `WordNetLemmatizer`, also from the `nltk` module, to lemmatize the remaining tokens. It isn't very powerful, but it should do for our purposes.

In [22]:
lemmer = WordNetLemmatizer()

def gen_lemmatized_tokens(word_iter: Iterable[str]) -> Iterable[str]:
    yield from map(lemmer.lemmatize, word_iter)

Here we chain everything together, in the order in which we'll apply those transformations.

In [23]:
def normalize_sentence(sentence: str) -> List[str]:
    # Steps 3 through 6, as described previously.
    tokens = gen_words_from_sentence(sentence)
    alphanum_tokens = filter_alphanumerical_tokens(tokens)
    relevant_tokens = filter_relevant_tokens(alphanum_tokens)
    lemmas = gen_lemmatized_tokens(relevant_tokens)
    return list(lemmas)

In [24]:
def gen_normalized_sentences_from_review(review: str) -> Iterable[List[str]]:
    # Step 2, then 3 through 6.
    sentences = gen_sentences_from_review(review)
    normalized_sentences = map(normalize_sentence, sentences)
    yield from filter(bool, normalized_sentences)

In [25]:
def normalize_review(review: str) -> List[List[str]]:
    # Step 1, then 2 through 6
    review = remove_html_line_breaks(review)
    normalized_sentences = gen_normalized_sentences_from_review(review)
    return list(normalized_sentences)

Finally, it's just a matter of applying those transformations to the datasets.

In [26]:
def transform_reviews(dataframe: pd.DataFrame, transformation) -> None:
    dataframe['review'] = dataframe['review'].apply(transformation)

def transform_reviews_in_dataframes(transformation) -> None:
    for dataframe in [df_train, df_test]:
        transform_reviews(dataframe, transformation)

In [27]:
transform_reviews_in_dataframes(normalize_review)

In [28]:
print_first_n_reviews_from_dataframes()

	[['rented', 'curious-yellow', 'video', 'store', 'controversy', 'surrounded', 'first', 'released', '1967.', 'also', 'heard', 'first', 'seized', 'u.s.', 'custom', 'ever', 'tried', 'enter', 'country', 'therefore', 'fan', 'film', 'considered', 'controversial', 'really', 'see'], ['plot', 'centered', 'around', 'young', 'swedish', 'drama', 'student', 'named', 'lena', 'want', 'learn', 'everything', 'life'], ['particular', 'want', 'focus', 'attention', 'making', 'sort', 'documentary', 'average', 'swede', 'thought', 'certain', 'political', 'issue', 'vietnam', 'war', 'race', 'issue', 'united', 'state'], ['asking', 'politician', 'ordinary', 'denizen', 'stockholm', 'opinion', 'politics', 'sex', 'drama', 'teacher', 'classmate', 'married', 'men'], ['kill', 'curious-yellow', '40', 'year', 'ago', 'considered', 'pornographic'], ['really', 'sex', 'nudity', 'scene', 'far', 'even', 'shot', 'like', 'cheaply', 'made', 'porno'], ['countryman', 'mind', 'find', 'shocking', 'reality', 'sex', 'nudity', 'major', 

## 4) Modeling

We now work on the definition of the classification model.

### 4.1) Vocabulary

We'll have to translate the words in the reviews into numbers so that our classification model is able to understand the data we pass to it. For that purpose, we'll build a _vocabulary_, or rather a class that reads a word and returns a corresponding integer. Later on, we'll translate these values into vectors.

In [29]:
class Vocabulary:
    def __init__(self, word_iter: Iterable[str]):
        # Initialized with source from which to read all the words
        # in the vocabulary.
        self.index_table = Vocabulary.make_index_table_(word_iter)
        self.padding_index = self.index_table[None]


    @staticmethod
    def make_index_table_(word_iter: Iterable[str]) -> Dict[str, int]:
        # Straightforward. If a word is not in the table, add it
        # with a new corresponding integer value.
        # The catch, however, is that we can't use 0, because we'll
        # need it later on for padding. It's reserved under the key None.
        table = { None: 0 }
        for word in word_iter:
            if word not in table:
                table[word] = len(table)
        return table


    def translate_sentence(self, sentence: Iterable[str]) -> torch.Tensor:
        # Translate a sentence into a torch Tensor of integers.
        return torch.tensor(list(map(self.translate_word_, sentence)), dtype=torch.long)


    def translate_word_(self, word: str) -> int:
        return self.index_table[word]


    def __len__(self):
        # Just so we can easily get the vocabulary's size.
        return len(self.index_table)
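
The indexing scheme can be tried on a handful of words without `torch`. The sketch below is a simplified stand-in for `Vocabulary.make_index_table_()`; the function name `build_index_table()` and the sample words are made up for illustration. It also shows that a word that was never indexed is simply absent from the table, so looking it up would raise a `KeyError`.

```python
from typing import Dict, Iterable, Optional

def build_index_table(words: Iterable[str]) -> Dict[Optional[str], int]:
    # Index 0 is reserved for padding, keyed by None as in the class above.
    table: Dict[Optional[str], int] = {None: 0}
    for word in words:
        if word not in table:
            # New words receive the next free integer.
            table[word] = len(table)
    return table

table = build_index_table(['plot', 'film', 'plot', 'drama'])
print(table)
# {None: 0, 'plot': 1, 'film': 2, 'drama': 3}
print([table[w] for w in ['film', 'drama', 'plot']])
# [2, 3, 1]
print('sequel' in table)
# False
```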

Now we build the vocabulary that we'll use with our model.

In [30]:
# We're considering all the words in both datasets.
# That can become a problem, as some words may be present in one but not both.
# But we'll carry on for now.
imdb_vocab = Vocabulary(word
                        for df in [df_train, df_test]
                            for review in df['review']
                                for sentence in review
                                    for word in sentence)

### 4.2) Architecture

For the architecture, I'm planning to use two LSTM (_Long Short-Term Memory_) layers to process two separate sequences in each review: the word sequence in each sentence, then the sentence sequence in each review.

Truth be told, that's more of an experiment than anything else, and I'm not entirely confident about the classification results. This approach may lead to a model that expects each sentence of a review to be influenced by the previous ones, which, although it could theoretically work, doesn't necessarily hold true.

But there's no harm in trying, is there? The worst that could happen is that we end up discovering another way of _"how not to make a working light bulb."_

In [31]:
class ReviewClassifier(nn.Module):
    def __init__(self, vocab: Vocabulary, n_sentiments: int):
        super(ReviewClassifier, self).__init__()

        vocab_len = len(vocab)

        # "Guess-timate" the sizes of the layers.
        # For the datasets we're working on, I believe they'll be approx.:
        # - self.embedding_len = 6 or 7
        # - self.hidden_len = 4
        self.embedding_len = int(vocab_len ** (1 / (1 + log(vocab_len, 10))))
        self.hidden_len = n_sentiments ** 2
        self.n_sentiments = n_sentiments

        # We're using Embedding to convert the words (integers) into numerical vectors.
        self.word_embedder = nn.Embedding(vocab_len, self.embedding_len,
                                          padding_idx=vocab.padding_index)

        # We're using three layers:
        # - an LSTM layer to process the word sequence of each sentence;
        # - an LSTM layer to process the sentence sequence of each review;
        # - a Linear layer to get the likelihood of each sentiment for a given review.
        # The layers feed each other in that same sequence.
        self.sentence_sentiment_layer = nn.LSTM(self.embedding_len, self.hidden_len, batch_first=True)
        self.review_sentiment_layer = nn.LSTM(self.hidden_len, self.hidden_len, batch_first=True)
        self.to_sentiment_layer = nn.Linear(self.hidden_len, self.n_sentiments)


    def forward(self, reviews: Iterable[Iterable[torch.Tensor]]) -> torch.Tensor:
        # We need to process the first layer and transform the result
        # so that it can be read by the next layer.
        reviews = self.reviews_to_review_layer_input_(reviews)

        # Process the previous values.
        X = self.process_review_layer_(reviews)

        # Calculate the likelihood of each sentiment.
        X = self.to_sentiment_layer(X)
        X = torch.tanh(X)

        # Get the log-probability of each sentiment (the input that NLLLoss expects).
        return nn.functional.log_softmax(X, dim=1)


    def reviews_to_review_layer_input_(self, reviews: Iterable[Iterable[torch.Tensor]]) -> List[torch.Tensor]:
        # Process each review and build a list with the results.
        return list(map(self.process_sentences_layer_, reviews))


    def process_sentences_layer_(self, sentences: Iterable[torch.Tensor]) -> torch.Tensor:
        # The sentence length is variable, so we need to pad them to the longest length.
        X = nn.utils.rnn.pad_sequence(sentences, batch_first=True)
        # Then, we convert the integer-words into numerical vectors.
        X = self.word_embedder(X)

        # Pack the padded sentences to filter out the padded values.
        # That's why we don't associate a word with 0 in the vocabulary.
        sentences_lengths = list(map(len, sentences))
        X = nn.utils.rnn.pack_padded_sequence(X, sentences_lengths, batch_first=True, enforce_sorted=False)

        # Initial values for the hidden and cell states.
        hidden = torch.randn(1, len(sentences), self.hidden_len)
        cell = torch.randn(1, len(sentences), self.hidden_len)
        # Process the LSTM layer.
        X, _ = self.sentence_sentiment_layer(X, (hidden, cell))

        # Unpack the result, getting the padded values
        # as well as their corresponding true lengths.
        X, X_lengths = nn.utils.rnn.pad_packed_sequence(X, batch_first=True)

        # To pick the last values in the processed word sequences,
        # we partially flatten the result matrix and pick the corresponding indices.
        flattened_indices = torch.arange(len(X_lengths)) * max(X_lengths) + (X_lengths - 1)
        X = X.view(-1, self.hidden_len)[flattened_indices, :]

        # Use TanH so that we get a better symmetry.
        return torch.tanh(X)


    def process_review_layer_(self, reviews: Iterable[torch.Tensor]) -> torch.Tensor:
        # If you understood what was done in `process_sentences_layer_()`,
        # then you may also understand this one, as they're very, very similar.

        # Pad to length of the longest sentence sequence.
        X = nn.utils.rnn.pad_sequence(reviews, batch_first=True)

        # Pack the padded matrix.
        reviews_lengths = list(map(len, reviews))
        X = nn.utils.rnn.pack_padded_sequence(X, reviews_lengths, batch_first=True, enforce_sorted=False)

        # Initial values for the hidden and cell states.
        hidden = torch.randn(1, len(reviews), self.hidden_len)
        cell = torch.randn(1, len(reviews), self.hidden_len)
        # Process the LSTM layer.
        X, _ = self.review_sentiment_layer(X, (hidden, cell))

        # Unpack the result.
        X, X_lengths = nn.utils.rnn.pad_packed_sequence(X, batch_first=True)

        # Partially flatten the result matrix and pick
        # only the last values of each sequence.
        flattened_indices = torch.arange(len(X_lengths)) * max(X_lengths) + (X_lengths - 1)
        X = X.view(-1, self.hidden_len)[flattened_indices, :]

        # TanH for better symmetry.
        return torch.tanh(X)
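
The `flattened_indices` arithmetic used in both layer methods can be checked by hand. The sketch below reproduces it with plain Python lists instead of tensors; the function name and the sample values are invented for illustration. Each row `i` of a padded `(batch, max_len)` matrix starts at position `i * max_len` once flattened, and its last real entry sits `length - 1` positions further.

```python
from typing import List

def last_step_flat_indices(lengths: List[int]) -> List[int]:
    # Flattened position of the last non-padded entry of each row.
    max_len = max(lengths)
    return [i * max_len + (n - 1) for i, n in enumerate(lengths)]

# Three padded sequences (0 is padding), with true lengths 3, 1 and 2.
padded = [[11, 12, 13],
          [21,  0,  0],
          [31, 32,  0]]
lengths = [3, 1, 2]

# Flatten row by row, mirroring X.view(-1, hidden_len) on the first two axes.
flat = [value for row in padded for value in row]
indices = last_step_flat_indices(lengths)
print(indices)                     # [2, 3, 7]
print([flat[i] for i in indices])  # [13, 21, 32]
```

In PyTorch, indexing with `X[torch.arange(len(X_lengths)), X_lengths - 1]` should select the same rows without flattening; that is offered as an alternative to consider, not as what the notebook does.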

This function takes a list of reviews and transforms them so that they can be read by the model. We use the vocabulary we built previously to translate the words into integers.

In [32]:
def reviews_to_model_input(reviews: Iterable[Iterable[Iterable[str]]]) -> List[List[torch.Tensor]]:
    return [list(map(imdb_vocab.translate_sentence, r)) for r in reviews]

## 5) Training

In [33]:
imdb_classif = ReviewClassifier(imdb_vocab, len(['neg', 'pos']))
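
To see what sizes the "guess-timate" in `ReviewClassifier.__init__()` produces, the sketch below evaluates the same expression for a few vocabulary sizes. Those sizes are hypothetical, since the notebook doesn't print `len(imdb_vocab)`. Writing `L = log10(vocab_len)`, the expression equals `10 ** (L / (1 + L))`, so it grows very slowly and always stays below 10.

```python
from math import log

def guess_embedding_len(vocab_len: int) -> int:
    # Same expression as in ReviewClassifier.__init__.
    return int(vocab_len ** (1 / (1 + log(vocab_len, 10))))

# Hypothetical vocabulary sizes, chosen only to illustrate the formula.
sizes = {v: guess_embedding_len(v) for v in (10_000, 100_000, 1_000_000)}
print(sizes)
# {10000: 6, 100000: 6, 1000000: 7}

# The hidden size is the square of the number of sentiments.
n_sentiments = 2
print(n_sentiments ** 2)
# 4
```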

In [34]:
def train_model(model: ReviewClassifier, train_data: pd.DataFrame, *,
                n_epochs: int = 1, batch_size: int = 64) -> None:
    loss_criterion = nn.NLLLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.03)

    for _ in range(n_epochs):
        for batch in range(ceil(len(train_data) / batch_size)):
            i_start = batch*batch_size
            i_end = i_start+batch_size

            model.zero_grad()

            X = reviews_to_model_input(train_data['review'][i_start:i_end])
            Y = torch.from_numpy(train_data['sentiment'][i_start:i_end].values)

            Y_pred = model(X)

            batch_loss = loss_criterion(Y_pred, Y)
            batch_loss.backward()

            optimizer.step()

In [35]:
train_model(imdb_classif, df_train, n_epochs=5)
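
`train_model()` walks the `DataFrame` in its stored order. The first rows shown by `df_train.head()` are all negative, so if the data happens to be grouped by class (an assumption, not verified here), many batches would contain a single class. The sketch below shows one way to draw shuffled batch positions instead. `shuffled_batches()` is a hypothetical helper, not part of the notebook.

```python
import random
from typing import Iterator, List

def shuffled_batches(n_rows: int, batch_size: int, seed: int = 0) -> Iterator[List[int]]:
    # Shuffle the row positions once, then cut them into consecutive batches.
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)
    for start in range(0, n_rows, batch_size):
        yield positions[start:start + batch_size]

batches = list(shuffled_batches(n_rows=10, batch_size=4))
print(len(batches))               # 3
print([len(b) for b in batches])  # [4, 4, 2]
```

Each batch is a list of row positions, so it could be used with something like `train_data.iloc[batch]` to select the matching reviews and sentiments.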

## 6) Classifying

In [36]:
def predict_with_model(model: ReviewClassifier, test_data: pd.DataFrame) -> Tuple[torch.Tensor, torch.Tensor]:
    with torch.no_grad():
        X = reviews_to_model_input(test_data['review'])
        Y = torch.from_numpy(test_data['sentiment'].values)

        Y_pred = model(X).argmax(dim=1)

        return (Y_pred, Y)

In [37]:
(Y_pred, Y_true) = predict_with_model(imdb_classif, df_test)

In [38]:
accuracy = (Y_pred == Y_true).sum() / len(Y_true)

print("Accuracy:", accuracy.item())

Accuracy: 0.5
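
On a balanced test set, an accuracy of 0.5 is also what a constant guess would score, so it may be worth checking how the predictions are distributed across the classes. The sketch below is one possible check, run on made-up labels. `prediction_report()` is a hypothetical helper and the numbers are not results from this notebook.

```python
from collections import Counter
from typing import Dict, Sequence

def prediction_report(y_pred: Sequence[int], y_true: Sequence[int]) -> Dict[str, object]:
    # Accuracy plus how often each class was predicted.
    correct = sum(int(p == t) for p, t in zip(y_pred, y_true))
    return {
        'accuracy': correct / len(y_true),
        'predicted_counts': dict(Counter(y_pred)),
    }

# Made-up labels: a balanced set where the model always answers 0.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0]
print(prediction_report(y_pred, y_true))
# {'accuracy': 0.5, 'predicted_counts': {0: 6}}
```

With the tensors from the cells above, the inputs could be obtained with `Y_pred.tolist()` and `Y_true.tolist()`.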
