Recurrent Neural Network (RNN)
This is the README for the Recurrent Neural Network (RNN) Full Implementation assignment by George Barker and Andre Zeromski. We created an RNN model that classifies IMDB movie reviews as positive or negative with 87.7% accuracy.
How to run code:
To test the trained model, run the file “LSTM-RNN.py”. The result of testing will be printed. To train the model yourself, comment out lines 38 and 39 (pickling in the model) and uncomment line 45 (training the model). For the stacked LSTM, to test the trained model, run the file “Deep-LSTM-RNN.py”. The result of testing will be printed. To train the model yourself, comment out lines 39 and 40 (pickling in the model) and uncomment line 46 (training the model).
We imported the IMDB movie reviews dataset with “from keras.datasets import imdb” and the function imdb.load_data(num_words=None). This returns (x_train, y_train), (x_test, y_test) for the IMDB reviews, with each word replaced by an integer index that ranks how frequently the word occurs across the dataset (smaller indices correspond to more frequent words). Each review is therefore a sequence of words represented as integers. Each review has a y value of 0 or 1, corresponding to whether the review is negative or positive. The train and test data are split 50 / 50, so both the train set and the test set contain 25,000 reviews. The num_words parameter lets us specify the number of words we’d like to keep in the vocabulary, so only the top “num_words” most frequent words are kept. This is beneficial as it allows us to reduce our vocabulary.
The vocabulary of the IMDB reviews is about 90,000 unique words when imported this way. Such a large vocabulary is problematic, as it is difficult to train a model to handle it accurately. To deal with this, we can reduce it. Zipf’s Law states that the frequency of any word is inversely proportional to its rank in the frequency table. Therefore, we can cut a large part of the vocabulary by keeping only the most frequently occurring words; this should not remove a significant portion of meaning from the reviews, since the most frequent words account for most word occurrences. We started by limiting the vocabulary to 8000 words. We found that limiting the vocabulary further, to 5000 words, increased accuracy.
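When num_words is set, Keras replaces any word index at or above the cutoff with an out-of-vocabulary marker (index 2 by default). A minimal pure-Python sketch of that capping step, with a toy review made up for illustration:

```python
# Sketch of what imdb.load_data(num_words=5000) does to each review:
# word indices at or above the cutoff are replaced by an OOV marker.
# The cutoff and OOV index mirror Keras defaults; the review is a toy.
NUM_WORDS = 5000
OOV_INDEX = 2

def cap_vocabulary(review, num_words=NUM_WORDS, oov_index=OOV_INDEX):
    """Replace rare word indices (rank >= num_words) with the OOV index."""
    return [w if w < num_words else oov_index for w in review]

review = [14, 22, 7431, 9999, 4]   # large indices are rare words
print(cap_vocabulary(review))      # [14, 22, 2, 2, 4]
```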
We further manipulate the dataset by padding and truncating each review. The mean review length is about 235 words. We limited review length to 600 words because this covers the large majority of reviews without cutting off their endings. Since we are working with variable-length input data, we use padding to standardize the inputs, appending 0s to inputs shorter than our specified maximum review length. This ensures all inputs fed into the model are the same size: shorter reviews are padded to a length of 600 and longer reviews are truncated after 600 words.
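This step can be sketched in a few lines of pure Python. Note that Keras's pad_sequences defaults to pre-padding and pre-truncating; the sketch below uses post-padding and post-truncation to match the description above (equivalent to passing padding='post' and truncating='post'):

```python
# Pure-Python equivalent of padding/truncating reviews to a fixed length,
# as described above (post-padding with 0s, post-truncation at 600 words).
MAX_LEN = 600

def pad_review(review, max_len=MAX_LEN):
    """Pad short reviews with 0s and truncate long ones to max_len words."""
    return review[:max_len] + [0] * (max_len - len(review))

print(len(pad_review([1, 2, 3])))          # 600 (padded)
print(len(pad_review(list(range(700)))))   # 600 (truncated)
```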
We create our model by calling Sequential() to initialize our linear stack of layers. Our first layer is the embedding layer. We used the Keras implementation of the embedding layer to convert our integer representation of words into word embeddings. For each word in our input sequence, the embedding layer outputs the word’s vector representation. The embedding layer has two required arguments. We set the first argument to the number of words in our vocabulary; since the IMDB dataset represents words as integers ordered by frequency, only integers less than 5000 are embedded. The second argument is the desired length of the embedded vector for each word, which in our case we set to 5. We include an optional argument, input_length, set to the maximum review length, which specifies the length of the input sequences.
As stated previously, we padded our input sequences so that every review has a fixed length of 600. Therefore, a single input sequence, as represented by the embedding layer, is a 2D matrix: one dimension is the number of words in each padded review (600) and the other is the embedded word vector of length 5. The embedding layer is used as the first layer in our model and converts the positive integers representing words into vectors of fixed size.
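The embedding layer is, in essence, a trainable lookup table with one vector per word index. A NumPy sketch of the lookup, using the sizes above (vocabulary 5000, embedding length 5, review length 600) and random weights standing in for the learned embedding:

```python
import numpy as np

# The embedding layer as a lookup table: indexing a (vocab, 5) weight
# matrix with 600 word indices yields the 600 x 5 matrix described
# above. Random values stand in for the learned weights and review.
rng = np.random.default_rng(0)
vocab_size, embed_dim, max_len = 5000, 5, 600

embedding = rng.normal(size=(vocab_size, embed_dim))   # lookup table
padded_review = rng.integers(0, vocab_size, size=max_len)

embedded = embedding[padded_review]
print(embedded.shape)  # (600, 5)
```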
The model’s next layer is a bidirectional LSTM. We used a bidirectional wrapper on an LSTM so that relationships between words are examined in both the forward and the backward direction. A bidirectional LSTM allows the model to view past and future words in the sequence when determining the sentiment of the current word, including whether future words neutralize the sentiment of the current one. This leads to a more robust determination of a word’s sentiment given the words that surround it. Bidirectional LSTMs are unsuitable for online or streaming prediction, since the backward pass needs the entire sequence before any output can be computed; in our setting, however, the full review is always available before classification, so using a bidirectional wrapper has clear advantages. When increasing the number of nodes in this layer from 64 to 100, we did not see an increase in accuracy.
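The bidirectional idea can be illustrated with a toy recurrence (a cumulative sum stands in for a real LSTM here): the sequence is processed forward and backward, and the two per-timestep outputs are concatenated, so each timestep carries both past and future context and the feature width doubles (e.g. 64 LSTM units produce 128 outputs):

```python
import numpy as np

# Toy illustration of the bidirectional wrapper (not a real LSTM):
# a cumulative sum plays the role of the recurrent summary.
x = np.array([1.0, 2.0, 3.0, 4.0])

forward = np.cumsum(x)               # past context at each step
backward = np.cumsum(x[::-1])[::-1]  # future context at each step

bidirectional = np.stack([forward, backward], axis=1)
print(bidirectional.shape)  # (4, 2): sequence length x doubled features
```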
A dropout rate of 0.3 means that the model randomly selects nodes to be dropped out with a probability of 30% during each pass. Adding dropout is a regularization technique used to improve the robustness of the model by blocking the use of certain nodes, forcing the weights to adjust so that the output of the model is not reliant on just a few sets of nodes. We tested adding a single dropout layer to various parts of our LSTM, including the input, recurrent, and output layers. We found that it was not useful to add dropout to our input layer, and propose that this is because dropout there blocks the use of certain input nodes during training, meaning certain features of an embedded word would not be considered. This adjusts the weights coming from the input layer so that the model places less importance on certain features of the word when it is passed to the recurrent layer, potentially removing features that were useful in determining the word’s sentiment.
We also tried adding dropout to the recurrent layer and saw no consistent increase in accuracy over our base case with no dropout in the model. Papers on natural language processing and RNNs (Bayer et al., 2013) propose that adding dropout to the recurrent layer changes the dynamics of the model dramatically, as the recurrent layer is the most dynamic and sensitive part of the model. Essentially, the effect of zeroing out a node is propagated forward through each time step, greatly hindering the model’s ability to memorize. In our case, adding dropout to the recurrent layer compromised the model’s ability to memorize. Therefore, we added dropout to the output layer. This prevents overfitting while still retaining the important features of our input data and without diminishing the LSTM’s memory. It also makes sense to put it here: the LSTM is followed by two dense layers, so dropping out at the end of the LSTM yields robustness gains similar to how dropout works in a standard feed-forward neural network. We also found that lowering the dropout rate to 0.2 did not improve generalization to the test set.
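The dropout mechanism itself is simple to sketch. The snippet below shows inverted dropout at rate 0.3 (the scheme Keras's Dropout layer uses during training): each activation is zeroed with 30% probability, and survivors are scaled by 1/0.7 so the expected activation is unchanged. The activation vector is a made-up stand-in for the LSTM output:

```python
import numpy as np

# Inverted dropout with rate 0.3: ~30% of nodes are zeroed each pass,
# and the survivors are rescaled so the expected output is unchanged.
rng = np.random.default_rng(42)
rate = 0.3

activations = np.ones(10)                     # stand-in LSTM outputs
mask = rng.random(activations.shape) >= rate  # True = node kept
dropped = activations * mask / (1.0 - rate)

print(mask.sum(), "of", mask.size, "nodes kept")
```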
We connect our LSTM to two dense (fully connected) layers. By connecting the output of our LSTM to two dense layers, we are able to extract higher-level features from the LSTM’s output. We tested using just one dense layer and three dense layers (the added layer being a tanh layer after the ReLU layer) after the LSTM and found that neither improved accuracy. The first layer after the LSTM is a ReLU dense layer. The ReLU layer is efficient since its derivative is piecewise constant (0 or 1), which is useful for backpropagation. The output of the ReLU layer is then sent as input to the sigmoid layer. We also experimented with tanh instead of ReLU for this layer; this change did not yield a significant increase in accuracy. We use a sigmoid output layer with one node to get an output between 0 and 1 for the binary classification of whether the movie review is positive or negative. The output of this layer is the model’s prediction, which can then be compared to the actual y value, an integer that is either 1 or 0 depending on whether the review was positive or negative.
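The dense head can be sketched directly in NumPy. The layer widths below are illustrative assumptions (128 inputs from a bidirectional 64-unit LSTM, a 16-unit hidden layer), and small random weights stand in for trained ones:

```python
import numpy as np

# Sketch of the two dense layers after the LSTM: a ReLU layer, then a
# single sigmoid unit producing a probability in (0, 1). Sizes and
# weights are illustrative stand-ins, not the trained model's values.
rng = np.random.default_rng(1)

lstm_out = rng.normal(size=128)                    # e.g. bidirectional 64-unit LSTM
W1, b1 = rng.normal(size=(128, 16)) * 0.1, np.zeros(16)
w2, b2 = rng.normal(size=16) * 0.1, 0.0

hidden = np.maximum(0.0, lstm_out @ W1 + b1)            # ReLU dense layer
prediction = 1.0 / (1.0 + np.exp(-(hidden @ w2 + b2)))  # sigmoid output

print(0.0 < prediction < 1.0)  # classify positive if prediction > 0.5
```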
The compile function configures the model for training. Here we specify an optimizer, a loss function, and a metric. We used the Adam optimizer. This optimizer tunes the weights similarly to stochastic gradient descent, except that Adam maintains adaptive per-parameter learning rates based on estimates of the first and second moments of the gradients. We compiled the model with a loss function of binary cross entropy. Cross entropy is a loss function that measures the performance of classification when the output is a probability between 0 and 1. The equation for binary cross entropy is -(y log(p) + (1 - y) log(1 - p)). We use binary cross entropy since we are categorizing only two classes: positive or negative sentiment. Cross entropy punishes errors that are confident and wrong most heavily: if we predicted 0.1 for a true value of 1, the log loss is much greater than if the prediction were 0.2 or 0.5. Our metric was accuracy, which is sufficient because we are interested in how often our model correctly identifies the sentiment of a movie review.
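The confident-and-wrong penalty is easy to verify numerically from the formula above:

```python
import math

# Binary cross-entropy for one example: -(y*log(p) + (1-y)*log(1-p)).
# For a true label y = 1, a confident wrong prediction (p = 0.1) is
# punished far more than an uncertain one (p = 0.5).
def binary_cross_entropy(y, p):
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(round(binary_cross_entropy(1, 0.1), 3))  # 2.303
print(round(binary_cross_entropy(1, 0.5), 3))  # 0.693
```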
The fit function trains the model for a fixed number of epochs on the input data. We give x_train, y_train, and the epoch count as arguments. The epoch count was also important for accuracy: when the number of epochs was increased to 4, we got better accuracy and more consistently broke 87%. A problem we had in training was inconsistent final accuracy even when training the same model. The reason is the random initialization of parameters when the model is created: even with the same model, training data, and test data, random initialization affects the final results. We reduced the occurrence of this inconsistency by applying Occam’s razor.
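Run-to-run variance from random initialization can also be controlled directly by fixing random seeds (with Keras/TensorFlow one would additionally call tf.random.set_seed). A NumPy sketch of the principle, with a hypothetical init_weights helper:

```python
import numpy as np

# Random weight initialization is a source of run-to-run variance:
# fixing the RNG seed makes the initial weights, and hence training,
# repeatable. init_weights is a hypothetical stand-in for a layer init.
def init_weights(seed):
    return np.random.default_rng(seed).normal(size=5)

print(np.array_equal(init_weights(0), init_weights(0)))  # True: same seed
print(np.array_equal(init_weights(0), init_weights(1)))  # False: different seed
```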
Occam’s razor states that entities should not be multiplied beyond necessity. In our RNN model, this translates to using the minimum number of parameters needed to accurately represent the target function. Theoretically, a model should be able to zero out extra parameters, but in practice this may not happen. Extra parameters can lead the model to make mistakes when it is unable to zero out unnecessary parameters during training. Thus, we chose to decrease parameters and retain the simplest model when we could. For example, one of our models used a very large word embedding of length 100. We ran the same model with a word embedding of length 5 and found the accuracy was very similar: 87.7% versus 87.8%. Therefore, we chose the model with fewer parameters, which had about 250,000 fewer. This should reduce the chance of the model making a mistake in training and testing.
We now discuss the implementation of the deep (stacked) LSTM RNN and compare its accuracy to our model’s. The stacked LSTM RNN was created by adding an additional LSTM layer to the model. We found that adding a second bidirectional LSTM did not change accuracy significantly, although it did increase the number of parameters. We achieved an accuracy of 87.3% with our stacked LSTM. Our model without the bidirectional wrappers, using just stacked LSTM layers, achieved similar accuracy, and other modifications did not show significant improvement beyond 87%. We believe this lack of improvement makes sense: the features extracted by the first LSTM layer are unlikely to be broken down further into additional useful features by a second LSTM layer. Since LSTMs are exceptional at finding short- and long-term dependencies within the data, searching for deeper short- and long-term relationships among features that already encode those dependencies seems redundant.
The other consequence of using a stacked LSTM is that the model has significantly more parameters. We can describe the addition of another LSTM layer in terms of Occam’s razor: we do not necessarily want more parameters, as in practice accuracy can decrease when the model is unable to zero out unnecessary parameters during training. In the case of sentiment analysis, it is hard to see how features extracted from features already abstracted from the sequential data would matter for classifying a movie review. The added parameters could therefore lead to decreased accuracy. This is consistent with the stacked RNN achieving accuracy similar to our model’s. We therefore do not propose the stacked LSTM as an architecture for pushing accuracy past 87-89%.
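The parameter cost of the extra layer can be estimated with the standard LSTM parameter-count formula, 4 * units * (input_dim + units + 1): four gates, each with input weights, recurrent weights, and a bias. The layer sizes below assume a second bidirectional 64-unit LSTM stacked on the first (whose concatenated forward+backward output has width 128):

```python
# Parameter count of one LSTM layer: four gates, each with input
# weights, recurrent weights, and a bias.
def lstm_params(input_dim, units):
    return 4 * units * (input_dim + units + 1)

units = 64
# Second bidirectional LSTM stacked on the first: its input is the
# first layer's concatenated forward+backward output (2 * 64 = 128).
extra = 2 * lstm_params(2 * units, units)
print(extra)  # 98816 extra parameters from the added layer
```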
Overall, there are some limitations to an RNN model for sentiment analysis. An RNN is limited by the sequential nature of its feature extraction: an LSTM keeps track of the global order of features. In something like a movie review, the text may jump around when describing a film’s attributes, so global order may not be the most important signal. A promising alternative for future work would be to use a CNN to model the text for sentiment analysis. A CNN may be beneficial for feature extraction and may allow accuracy to be increased further, as it is not constrained by temporal relationships in the data the way an RNN is.