A complete, step-by-step guide to building and training a sentiment analysis model on the IMDB movie review dataset. This project uses PyTorch and a Long Short-Term Memory (LSTM) recurrent neural network (RNN) to classify reviews as positive or negative.
This repository is primarily structured as a Jupyter Notebook (IMDB_Sentiment_Analysis.ipynb) that walks through the entire process from data loading to model inference.
The goal of this project is to classify movie reviews as either "positive" or "negative." To achieve this, we will:
-
Load Data: Use
pandasto load the providedimdb_tr.csv(for training) andimdb_te.csv(for testing). -
Preprocess Text: Clean the raw review text by removing HTML tags, punctuation, and stopwords.
-
Build a Vocabulary: Create a word-to-index mapping from the training data.
-
Create PyTorch Datasets: Convert the text data into numerical sequences (tensors) and use PyTorch's
DatasetandDataLoaderfor efficient batching. -
Define the Model: Build an LSTM-based neural network using
torch.nn. The architecture consists of an Embedding layer, an LSTM layer, and a final Linear layer with a Sigmoid activation. -
Train the Model: Train the network on the training data, using a validation split to monitor for overfitting.
-
Evaluate the Model: Test the trained model's performance on the unseen test dataset (
imdb_te.csv) and report metrics like accuracy, precision, recall, and F1-score. -
Run Inference: Create a function to predict the sentiment of any new, custom review.
-
IMDB_Sentiment_Analysis.ipynb: The main Jupyter Notebook containing all the code and explanations. -
imdb_tr.csv: The training and validation dataset (35,000 reviews). -
test/imdb_te.csv: The testing dataset (15,000 reviews). -
README.md: This file.
You will need Python 3.7+ and the following libraries. You can install them using pip:
pip install torch pandas numpy nltk scikit-learn tqdm jupyterlab
-
Clone this repository:
git clone https://github.com/Rahmat-ML/pytorch-imdb-sentiment-lstm cd pytorch-imdb-sentiment-lstm -
Ensure you have the data files (
imdb_tr.csvandtest/imdb_te.csv) in the correct locations.
The entire project is contained within the Jupyter Notebook.
-
Start Jupyter Lab:
jupyter lab -
Open
IMDB_Sentiment_Analysis.ipynb. -
Run the cells sequentially from top to bottom. The notebook is commented to explain each step of the process.
The core of our model is an LSTM network, which is well-suited for sequence data like text.
-
Embedding Layer: Converts integer-encoded words (from our vocabulary) into dense vectors (embeddings) of a fixed size.
-
LSTM Layer: Processes the sequence of embeddings, capturing contextual information and long-range dependencies in the text.
-
Dropout Layer: (Optional but recommended) Helps prevent overfitting during training.
-
Linear Layer: A fully-connected layer that maps the final hidden state of the LSTM to a single output.
-
Sigmoid Activation: Squashes the output to a value between 0 and 1, representing the probability of the review being "positive."