<img src="../../../figs/holberton_logo.png" alt="logo" width="500"/>

# Recurrent Neural Network for IMDb Movie Reviews

## IMDb Dataset

IMDb (Internet Movie Database) is an **online platform that provides a comprehensive database of movies, TV shows, and other entertainment content**. 


IMDb movie reviews refer to **user-generated reviews and ratings** for movies available on the platform. These reviews allow users to express their opinions and share their experiences regarding the quality, plot, acting, and other aspects of movies. 


IMDb movie reviews are typically **short texts written by viewers**, and they reflect a range of sentiments, including **positive and negative feedback**. 

With a large user base and a vast collection of movies, IMDb reviews show how films are received by the public, which makes them a useful resource for studying audience preferences and opinions.

## Why RNNs for Sentiment Analysis of Movie Reviews

**RNNs excel at modeling sequential data**, making them well-suited for sentiment analysis of movie reviews. 

Movie reviews often contain phrases, sentences, and paragraphs that convey sentiment and context. RNNs can **effectively capture the dependencies and relationships between words in a review** by maintaining a hidden state that retains information from previous inputs. 

This allows RNNs to consider the entire context of the review and make predictions based on the collective sentiment expressed throughout the text.


- **Sequential Modeling**: RNNs are designed to model sequential data, making them a natural choice for sentiment analysis of movie reviews, which are typically composed of sequences of words and sentences.


- **Contextual Understanding**: RNNs can capture the contextual dependencies between words and sentences, allowing them to consider the overall sentiment expressed in a movie review rather than treating each word in isolation.


- **Variable-Length Inputs**: Movie reviews can vary in length, but RNNs can handle inputs of variable lengths by dynamically processing the text in a sequential manner, accommodating reviews of different lengths without requiring fixed-size inputs.
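
The hidden-state idea behind the points above can be sketched in a few lines of NumPy. This is a minimal illustration of a plain (Elman-style) recurrence, not the LSTM used later in this notebook; the weight names `W_x`, `W_h`, `b` and the toy sizes are assumptions made for the example.

```python
import numpy as np

def rnn_final_state(inputs, W_x, W_h, b):
    """Run a plain RNN over `inputs` and return the last hidden state.

    inputs: array of shape (timesteps, input_dim), one row per word vector.
    """
    h = np.zeros(W_h.shape[0])                  # initial hidden state
    for x_t in inputs:                          # one step per word
        h = np.tanh(W_x @ x_t + W_h @ h + b)    # mix new input with memory
    return h

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3                    # toy sizes, for illustration only
W_x = rng.normal(size=(hidden_dim, input_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

short_review = rng.normal(size=(5, input_dim))   # a review of 5 "words"
long_review = rng.normal(size=(12, input_dim))   # a review of 12 "words"

h_short = rnn_final_state(short_review, W_x, W_h, b)
h_long = rnn_final_state(long_review, W_x, W_h, b)
print(h_short.shape, h_long.shape)               # both have shape (3,)
```

Both reviews end up summarized in a vector of the same size, whatever their length, which is what lets a classifier sit on top of the recurrence.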

## Let's get started

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
```

## Preprocess the data

The preprocessing phase involves several steps to prepare the data for training an RNN sentiment analysis model. 

- Specify the **maximum number of words to include in the vocabulary** (`max_features`), limiting the vocabulary size to the most frequently occurring words. 


- Specify the **maximum length of each movie review** (`maxlen`), ensuring that all reviews are truncated or padded to a fixed length. 


- Determine the `batch_size`, i.e., the number of samples processed in each iteration during training. 


- **Load the data and split into training and testing**
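
The steps above could look like the sketch below. The three values are assumptions chosen for illustration (the settings used in the original notebook are not shown here), and `load_imdb` is a hypothetical helper that wraps `imdb.load_data` so that nothing is downloaded until it is called.

```python
# Assumed values for illustration; tune them for your own experiments.
max_features = 20000   # keep only the 20,000 most frequent words
maxlen = 200           # every review will be truncated or padded to 200 tokens
batch_size = 32        # samples processed per training iteration

def load_imdb(num_words=max_features):
    """Return ((x_train, y_train), (x_test, y_test)) for the IMDb reviews.

    The first call downloads the dataset, so the import and the call are
    deferred until the data is actually needed.
    """
    from tensorflow.keras.datasets import imdb
    return imdb.load_data(num_words=num_words)
```

A typical call would be `(x_train, y_train), (x_test, y_test) = load_imdb()`; the dataset ships already split into training and testing reviews.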








## Explore the dataset

### How the data are stored

**Movie reviews are converted into numbers** by using a process called text encoding. In this process, **each word in a movie review is assigned a unique numerical index** based on its position in a predefined vocabulary. 

The vocabulary consists of a set of frequently occurring words in the dataset. The **movie review is then represented as a sequence of these numerical indices**, where each index corresponds to a specific word in the review. 

This numerical representation lets us store movie reviews as sequences of integers, which can be processed efficiently and fed to machine learning models such as recurrent neural networks (RNNs) for sentiment analysis and other natural language processing tasks.
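
To make the encoding concrete, here is a toy sketch that builds a vocabulary from two made-up reviews and encodes a new sentence. The reviews, the frequency-based ranking, and the choice of `0` for unknown words are assumptions for this example; the real dataset comes pre-encoded.

```python
from collections import Counter

reviews = [
    "a great movie with a great cast",
    "a boring movie",
]

# Rank words by frequency: the most frequent word gets the smallest index.
counts = Counter(word for review in reviews for word in review.split())
ranked = sorted(counts, key=lambda w: (-counts[w], w))   # ties broken alphabetically
word_index = {word: rank for rank, word in enumerate(ranked, start=1)}

def encode(review, word_index, oov_index=0):
    """Map each word to its index; words outside the vocabulary map to `oov_index`."""
    return [word_index.get(word, oov_index) for word in review.split()]

print(word_index)
# {'a': 1, 'great': 2, 'movie': 3, 'boring': 4, 'cast': 5, 'with': 6}
print(encode("a great boring film", word_index))
# [1, 2, 4, 0]
```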


### Display random reviews

Exploring a **subset of randomly selected reviews from the IMDb dataset** offers a glimpse into the diversity of opinions and sentiments expressed by viewers. 

These reviews are a mix of positive and negative feedback. By reading them, we can observe the varying lengths, writing styles, and overall sentiment conveyed by users.
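
Since the reviews are stored as integers, they have to be decoded before they can be read. The sketch below assumes the Keras conventions for this dataset: `imdb.get_word_index()` returns a word-to-rank dictionary, and `imdb.load_data` shifts every rank by `index_from=3` to reserve the lowest indices for padding, start, and unknown tokens. The helper names and the toy data are hypothetical.

```python
import random

def decode_review(encoded, word_index, index_from=3):
    """Turn a list of integer indices back into text.

    `word_index` maps word -> rank. Indices below `index_from` are reserved
    tokens (padding, start, unknown) and are rendered as "?".
    """
    reverse_index = {rank + index_from: word for word, rank in word_index.items()}
    return " ".join(reverse_index.get(i, "?") for i in encoded)

def show_random_reviews(x, y, word_index, n=3, seed=0):
    """Print `n` randomly chosen reviews together with their labels."""
    rng = random.Random(seed)
    for i in rng.sample(range(len(x)), n):
        label = "positive" if y[i] == 1 else "negative"
        print(f"[{label}] {decode_review(x[i], word_index)}")

# Toy stand-ins for the real data (hypothetical values).
toy_word_index = {"the": 1, "movie": 2, "great": 3, "boring": 4}
toy_x = [[1, 4, 5, 6], [1, 4, 5, 7]]   # 1 is the start marker, then shifted ranks
toy_y = [1, 0]
show_random_reviews(toy_x, toy_y, toy_word_index, n=2)
```

With the real dataset, `word_index` would come from `imdb.get_word_index()` and `x`, `y` from the training split.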

## Further explore the data

### Display the distribution of review lengths
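
One way to summarize review lengths before plotting them is sketched below with the standard library only. The function names and the bin width of 100 tokens are assumptions; the toy reviews stand in for the encoded training data.

```python
from collections import Counter
from statistics import mean, median

def length_summary(reviews):
    """Return basic statistics on the number of tokens per review."""
    lengths = [len(review) for review in reviews]
    return {
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
    }

def length_histogram(reviews, bin_width=100):
    """Count reviews per length bin; each key is the lower edge of its bin."""
    bins = Counter((len(review) // bin_width) * bin_width for review in reviews)
    return dict(sorted(bins.items()))

# Toy reviews of 50, 120, 180 and 430 tokens (placeholder content).
toy_reviews = [[1] * 50, [1] * 120, [1] * 180, [1] * 430]
print(length_summary(toy_reviews))
print(length_histogram(toy_reviews))   # {0: 1, 100: 2, 400: 1}
```

The histogram dictionary can be passed to any plotting library as bar heights, and it is a quick way to judge whether a given `maxlen` would truncate many reviews.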

### Display the distribution of movie review sentiment
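
A minimal sketch for counting the two classes, assuming the usual labeling of this dataset (`1` for positive, `0` for negative). The function name and the toy labels are hypothetical.

```python
from collections import Counter

def sentiment_counts(labels):
    """Count negative (0) and positive (1) labels."""
    counts = Counter(int(label) for label in labels)
    return {"negative": counts.get(0, 0), "positive": counts.get(1, 0)}

toy_labels = [1, 0, 1, 1, 0]
print(sentiment_counts(toy_labels))   # {'negative': 2, 'positive': 3}
```

On the real data, `sentiment_counts(y_train)` shows whether the classes are balanced, which matters when interpreting accuracy later.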

### Generate a word cloud of the most commonly used words
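
A word cloud is driven by word frequencies, so the sketch below only computes those, using the standard library. The stopword list is a small assumed sample, and the input is expected to be decoded review text rather than integer sequences.

```python
from collections import Counter

# A deliberately small, assumed stopword list; a real one would be longer.
STOPWORDS = {"the", "a", "and", "of", "to", "is", "it", "in", "this", "that"}

def word_frequencies(texts, stopwords=STOPWORDS, top_n=50):
    """Return the `top_n` most frequent non-stopword words with their counts."""
    counts = Counter(
        word
        for text in texts
        for word in text.lower().split()
        if word not in stopwords
    )
    return dict(counts.most_common(top_n))

toy_texts = ["The plot is great and the cast is great", "A boring plot"]
print(word_frequencies(toy_texts, top_n=3))
```

Assuming the third-party `wordcloud` package is installed, the resulting dictionary could be rendered with `WordCloud().generate_from_frequencies(...)`.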

## Pad review sequences

The goal is to **preprocess the movie review data by padding the sequences to a fixed length**. 

The `sequence.pad_sequences()` function is used to ensure that all movie reviews have the same length (`maxlen`): longer reviews are truncated and shorter ones are filled with zeros. An RNN can process sequences of any length, but **the reviews in a batch must share the same length to be stacked into a single tensor**, which is why padding is needed for training. 


The training data is then split into training and validation sets: the first 10,000 samples are set aside for validation, and the remaining samples are used for training.
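
The NumPy sketch below mimics what `sequence.pad_sequences(x, maxlen=maxlen)` does with its default settings, which pad and truncate at the front of each sequence. `pad_to_length` is a hypothetical stand-in written for illustration, not the Keras implementation, and the toy data replaces the real reviews.

```python
import numpy as np

def pad_to_length(sequences, maxlen, value=0):
    """Left-pad with `value` and keep only the last `maxlen` tokens of long sequences."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        trimmed = list(seq)[-maxlen:]             # drop tokens from the front
        if trimmed:
            out[row, -len(trimmed):] = trimmed    # right-align, padding on the left
    return out

toy_x = [[5, 6], [1, 2, 3, 4, 5, 6], [7, 8, 9]]
toy_y = np.array([1, 0, 1])

padded = pad_to_length(toy_x, maxlen=4)
print(padded)
# [[0 0 5 6]
#  [3 4 5 6]
#  [0 7 8 9]]

# Hold out the first n_val samples for validation (10,000 in the text, 1 here).
n_val = 1
x_val, y_val = padded[:n_val], toy_y[:n_val]
x_train, y_train = padded[n_val:], toy_y[n_val:]
print(x_val.shape, x_train.shape)   # (1, 4) (2, 4)
```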

## Building the Recurrent Neural Network

The network uses an embedding layer to convert words into dense vectors, an LSTM layer to capture sequential dependencies, and a dense output layer for binary classification based on the sentiment of the movie reviews.

- **Embedding Layer**: The network begins with an embedding layer that converts the input words into dense vectors of fixed dimensions (128 in this case). This layer helps capture the semantic meaning and relationships between words in the movie reviews.



- **LSTM Layer**: A Long Short-Term Memory (LSTM) layer follows the embedding layer. LSTMs are a type of recurrent neural network (RNN) that can effectively capture long-term dependencies in sequential data. The layer consists of 128 LSTM units and includes dropout regularization to prevent overfitting.



- **Output Layer**: The final layer of the network is a dense layer with a single neuron and a sigmoid activation function. This layer produces a binary output, indicating the sentiment of the movie review (positive or negative).
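
The three layers described above could be assembled as in the sketch below. The layer sizes (128 and 128) come from the text; the vocabulary size and the dropout rate of `0.2` are assumptions, and `build_model` is a hypothetical helper name.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

def build_model(max_features=20000, embedding_dim=128, lstm_units=128, dropout=0.2):
    """Embedding -> LSTM -> single sigmoid unit for binary sentiment."""
    return Sequential([
        # Maps each word index to a dense vector of size `embedding_dim`.
        Embedding(max_features, embedding_dim),
        # Reads the sequence step by step; dropout is applied to the inputs.
        LSTM(lstm_units, dropout=dropout),
        # Outputs a value between 0 and 1, read as the probability of "positive".
        Dense(1, activation="sigmoid"),
    ])

model = build_model()
```

Calling `model.summary()` after the model has seen its first batch shows the parameter count of each layer.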

## Compile the model

- The choice of the loss function, `binary_crossentropy`, is suitable for binary classification tasks, such as sentiment analysis, where the goal is to predict one of two classes (positive or negative sentiment). 


- The `Adam` optimizer is selected as it is an efficient optimization algorithm commonly used for training neural networks. 


- By minimizing the binary cross-entropy loss with the Adam optimizer, the network learns to predict the sentiment of movie reviews more accurately.
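
To see what the loss measures, here is a small NumPy version of binary cross-entropy, written as a sketch rather than as the Keras implementation. In Keras, the corresponding step would typically be `model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])`, where tracking accuracy is an assumption based on the rest of this notebook.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between labels (0 or 1) and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip to avoid log(0) when a prediction is exactly 0 or 1.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float(np.mean(losses))

confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(f"{confident_right:.4f} {confident_wrong:.4f}")   # 0.1054 2.3026
```

Confident correct predictions give a loss close to zero, while confident wrong ones are penalized heavily, which is the signal the optimizer uses to adjust the weights.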


## Train the model

The model is trained using the training data (`x_train` and `y_train`). Additionally, it uses the validation data (`x_val` and `y_val`) to monitor the model's performance during training. 

The model is trained by iterating over the specified number of epochs (10 in our case), updating its parameters based on the training data, and evaluating its performance on the validation data.
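
The training call could look like the following sketch. To keep it runnable on its own, it uses random stand-in data and a deliberately tiny model, so the numbers it produces are meaningless; with the real data you would pass the padded IMDb arrays and train for the 10 epochs mentioned above.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Synthetic stand-in data so the sketch runs without downloading IMDb.
rng = np.random.default_rng(0)
vocab_size, maxlen = 100, 20
x = rng.integers(1, vocab_size, size=(64, maxlen))
y = rng.integers(0, 2, size=(64,))
x_val, y_val = x[:16], y[:16]          # first samples held out for validation
x_train, y_train = x[16:], y[16:]

# Tiny model (sizes chosen for speed, not accuracy).
model = Sequential([
    Embedding(vocab_size, 8),
    LSTM(8),
    Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

history = model.fit(
    x_train, y_train,
    batch_size=16,
    epochs=2,                          # the text uses 10; 2 keeps the demo fast
    validation_data=(x_val, y_val),
    verbose=0,
)
print(sorted(history.history))         # metric names recorded for each epoch
```

The returned `history` object stores one value per epoch for each training and validation metric, which is useful for plotting learning curves.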

## Evaluate

The evaluation step helps assess how well the trained model generalizes to new movie reviews and provides an indication of its overall effectiveness in predicting the sentiment of movie reviews.

We evaluate the trained model on the testing data (`x_test` and `y_test`), computing the loss and accuracy of its predictions. The batch size controls the number of samples processed at once during evaluation. The resulting test loss and accuracy are then printed, showing how the model performs on unseen data. 
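
A small helper for this step might look as follows. `report_test_metrics` is a hypothetical name, and the sketch assumes the model was compiled with `metrics=["accuracy"]`, so that `evaluate` returns the loss followed by the accuracy.

```python
def report_test_metrics(model, x_test, y_test, batch_size=32):
    """Evaluate a compiled Keras model and print its test loss and accuracy.

    Assumes `model.evaluate` returns [loss, accuracy], which is the case
    when the model was compiled with metrics=["accuracy"].
    """
    test_loss, test_acc = model.evaluate(
        x_test, y_test, batch_size=batch_size, verbose=0
    )
    print(f"Test loss: {test_loss:.4f}")
    print(f"Test accuracy: {test_acc:.4f}")
    return test_loss, test_acc
```

It would be called as `report_test_metrics(model, x_test, y_test, batch_size=batch_size)` once the test reviews have been padded in the same way as the training reviews.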

## Visualize accuracy
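
One way to plot the accuracy curves is sketched below with Matplotlib. It assumes the dictionary found in `history.history` after training, with the keys `accuracy` and `val_accuracy`; the numbers in the example call are hypothetical placeholders, not results from this model.

```python
import matplotlib.pyplot as plt

def plot_accuracy(history_dict):
    """Plot training and validation accuracy per epoch and return the figure."""
    epochs = range(1, len(history_dict["accuracy"]) + 1)
    fig, ax = plt.subplots()
    ax.plot(epochs, history_dict["accuracy"], marker="o", label="training")
    ax.plot(epochs, history_dict["val_accuracy"], marker="o", label="validation")
    ax.set_xlabel("Epoch")
    ax.set_ylabel("Accuracy")
    ax.set_title("Training vs. validation accuracy")
    ax.legend()
    return fig

# Hypothetical values, used only to exercise the function.
fig = plot_accuracy({
    "accuracy": [0.70, 0.82, 0.88],
    "val_accuracy": [0.72, 0.80, 0.81],
})
```

A widening gap between the two curves, with training accuracy still rising while validation accuracy stalls, is a common sign of overfitting.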