This project builds a sentiment analysis system that classifies IMDb movie reviews as positive or negative. Three models are trained and compared:
- LSTM — Long Short-Term Memory with learned word embeddings
- GRU — Gated Recurrent Unit with learned word embeddings
- BERT — Fine-tuned `bert-base-uncased` transformer
Developed for a Deep Learning for NLP course.
```
Sentiment-Analysis/
│
├── data/                 # CSV splits (train / val / test)
├── checkpoints/          # Saved model weights and vocabularies
│   └── bert_tokenizer/   # Saved BERT tokenizer (after training)
├── results/
│   ├── figures/          # Auto-generated plots (loss, confusion matrix, etc.)
│   └── metrics/          # JSON files with training history and test metrics
├── src/
│   ├── preprocess.py     # Vocabulary builder for LSTM / GRU
│   ├── dataset.py        # SentimentDataset and BertSentimentDataset
│   ├── models.py         # LSTMClassifier, GRUClassifier, BERTClassifier
│   ├── plots.py          # Auto-generates figures after every training run
│   ├── utils.py          # Metrics, seeding, device helpers
│   ├── train.py          # Training pipeline (all three models)
│   ├── evaluate.py       # Standalone evaluation on the test set
│   └── predict.py        # Single-text inference
├── report/
├── README.md
└── requirements.txt
```
This project uses the IMDb Large Movie Review Dataset (25k train / 25k test, balanced).
Two loading options are available.

Option 1: Hugging Face `datasets`. Pass `--use_hf_imdb` and the dataset is downloaded automatically on first run:

```
python src/train.py --model lstm --use_hf_imdb
```
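A rough sketch of what the `--use_hf_imdb` path presumably does under the hood (the exact loading code lives in `src/train.py`):

```python
# Load the IMDb dataset from the Hugging Face Hub (cached after first run).
from datasets import load_dataset

ds = load_dataset("imdb")            # splits: "train" (25k) and "test" (25k), balanced
print(ds["train"][0]["text"][:80])   # raw review text
print(ds["train"][0]["label"])       # 0 = negative, 1 = positive
```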
Option 2: Local CSV files. Place `train.csv`, `val.csv`, and `test.csv` inside `data/`. Each file must have the following columns:
| Column | Description |
|---|---|
| `text` | Review text |
| `label` | 1 = positive, 0 = negative |
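If you need to produce these CSVs yourself, one option is to derive them from the Hugging Face copy of the dataset. A hypothetical helper, assuming a 90/10 train/validation split (the split ratio actually used by the project is not documented here):

```python
# Hypothetical script to build data/train.csv, data/val.csv, data/test.csv
# in the column layout shown above.
import pandas as pd
from datasets import load_dataset
from sklearn.model_selection import train_test_split

ds = load_dataset("imdb")
train_df = ds["train"].to_pandas()[["text", "label"]]
test_df = ds["test"].to_pandas()[["text", "label"]]

# Carve a stratified validation split out of the training data (ratio is an assumption).
train_df, val_df = train_test_split(
    train_df, test_size=0.1, stratify=train_df["label"], random_state=42
)

train_df.to_csv("data/train.csv", index=False)
val_df.to_csv("data/val.csv", index=False)
test_df.to_csv("data/test.csv", index=False)
```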
Then point training at the directory:

```
python src/train.py --model lstm --data_dir data
```

Install dependencies with:

```
pip install -r requirements.txt
```

Dependencies: torch, transformers, datasets, pandas, scikit-learn, joblib, matplotlib, seaborn
Plots are generated automatically at the end of every training run and saved to results/figures/. Comparison plots (LSTM vs GRU vs BERT) appear once two or more models have been trained.
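The exact plotting code lives in `src/plots.py`; below is a minimal sketch of how a loss curve could be rebuilt from the history JSON, assuming it stores per-epoch `train_loss` / `val_loss` lists (the field names are assumptions):

```python
import json
import matplotlib.pyplot as plt

with open("results/metrics/history_lstm.json") as f:
    history = json.load(f)

# Plot per-epoch training and validation loss on one figure.
plt.plot(history["train_loss"], label="train")
plt.plot(history["val_loss"], label="validation")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.savefig("results/figures/loss_lstm.png", dpi=150)
```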
Train each model:

```
python src/train.py --model lstm --use_hf_imdb --epochs 5 --batch_size 64
python src/train.py --model gru --use_hf_imdb --epochs 5 --batch_size 64
python src/train.py --model bert --use_hf_imdb --epochs 3 --batch_size 16
```

BERT automatically uses a lower learning rate (2e-5), the AdamW optimizer, a linear warmup scheduler, and gradient clipping. Use a smaller batch size due to its memory requirements.
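A sketch of that BERT optimization setup (`model`, `train_loader`, and `num_epochs` are placeholders for objects built in `src/train.py`):

```python
import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

criterion = torch.nn.BCEWithLogitsLoss()
optimizer = AdamW(model.parameters(), lr=2e-5)

# Linear warmup over the first 10% of total training steps.
num_training_steps = len(train_loader) * num_epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),
    num_training_steps=num_training_steps,
)

for batch in train_loader:
    optimizer.zero_grad()
    logits = model(batch["input_ids"], batch["attention_mask"])
    loss = criterion(logits.squeeze(-1), batch["label"].float())
    loss.backward()
    # Clip gradients to a max L2 norm of 1.0 before the optimizer step.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
```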
After training, each model saves:
- `checkpoints/best_{model}.pt` — best checkpoint by validation loss
- `checkpoints/vocab_{model}.joblib` — vocabulary (LSTM / GRU only)
- `checkpoints/bert_tokenizer/` — tokenizer config (BERT only)
- `results/metrics/history_{model}.json` — per-epoch training history
- `results/metrics/test_metrics_{model}.json` — final test set metrics
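Reloading these artifacts is straightforward. A sketch (file names match the list above; the checkpoint's internal layout is an assumption):

```python
import json
import joblib
import torch

vocab = joblib.load("checkpoints/vocab_lstm.joblib")            # LSTM / GRU vocabulary
state_dict = torch.load("checkpoints/best_lstm.pt", map_location="cpu")

with open("results/metrics/test_metrics_lstm.json") as f:
    print(json.load(f))                                         # final test metrics
```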
Run standalone evaluation on the test set using a saved checkpoint.

```
python src/evaluate.py \
    --model lstm \
    --checkpoint checkpoints/best_lstm.pt \
    --vocab_path checkpoints/vocab_lstm.joblib \
    --use_hf_imdb
```

```
python src/evaluate.py \
    --model gru \
    --checkpoint checkpoints/best_gru.pt \
    --vocab_path checkpoints/vocab_gru.joblib \
    --use_hf_imdb
```

```
python src/evaluate.py \
    --model bert \
    --checkpoint checkpoints/best_bert.pt \
    --bert_tokenizer_dir checkpoints/bert_tokenizer \
    --use_hf_imdb
```
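Evaluation reports accuracy, precision, recall, and F1 (see the results table below). A minimal, self-contained sketch of how those metrics are computed with scikit-learn (`y_true` / `y_pred` are placeholder labels and predictions):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0]  # placeholder ground-truth labels
y_pred = [1, 0, 0, 1, 0, 1]  # placeholder model predictions

print(f"accuracy:  {accuracy_score(y_true, y_pred):.4f}")
print(f"precision: {precision_score(y_true, y_pred):.4f}")
print(f"recall:    {recall_score(y_true, y_pred):.4f}")
print(f"f1:        {f1_score(y_true, y_pred):.4f}")
```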
Run inference on a single custom text.

```
python src/predict.py \
    --model gru \
    --checkpoint checkpoints/best_gru.pt \
    --vocab_path checkpoints/vocab_gru.joblib \
    --text "This movie was absolutely fantastic"
```

```
python src/predict.py \
    --model bert \
    --checkpoint checkpoints/best_bert.pt \
    --bert_tokenizer_dir checkpoints/bert_tokenizer \
    --text "This movie was absolutely fantastic"
```

Example output:

```
{'text': 'This movie was absolutely fantastic', 'prediction': 'positive', 'probability_positive': 0.98}
```
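A sketch of the inference step that produces this output: the model emits a single logit, which a sigmoid converts to a positive-class probability (the 0.5 threshold is an assumption consistent with `BCEWithLogitsLoss` training; `model` and `encode` are placeholders for the loaded model and tokenizer/vocabulary lookup):

```python
import torch

@torch.no_grad()
def predict_text(model, encode, text, threshold=0.5):
    model.eval()
    input_ids = encode(text)                # token IDs, shape (1, T)
    logit = model(input_ids)                # single logit, shape (1, 1)
    prob_pos = torch.sigmoid(logit).item()
    label = "positive" if prob_pos >= threshold else "negative"
    return {"text": text, "prediction": label,
            "probability_positive": round(prob_pos, 2)}
```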
LSTM / GRU hyperparameters:

| Parameter | Value |
|---|---|
| Embedding dimension | 128 |
| Hidden dimension | 128 |
| Dropout | 0.3 |
| Batch size | 64 |
| Epochs | 5 |
| Max sequence length | 200 |
| Optimizer | Adam (lr=1e-3) |
| Loss | BCEWithLogitsLoss |
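A minimal model definition consistent with these hyperparameters (the real `LSTMClassifier` in `src/models.py` may differ in detail, e.g. bidirectionality or pooling):

```python
import torch.nn as nn

class LSTMClassifierSketch(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=128, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_dim, 1)    # single logit for BCEWithLogitsLoss

    def forward(self, input_ids):
        embedded = self.embedding(input_ids)       # (B, T, E)
        _, (hidden, _) = self.lstm(embedded)       # hidden: (1, B, H)
        return self.fc(self.dropout(hidden[-1]))  # (B, 1)
```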
BERT hyperparameters:

| Parameter | Value |
|---|---|
| Base model | bert-base-uncased |
| Dropout | 0.3 |
| Batch size | 16 |
| Epochs | 3 |
| Max sequence length | 256 |
| Optimizer | AdamW (lr=2e-5) |
| Scheduler | Linear warmup (10% of steps) |
| Gradient clipping | 1.0 |
| Loss | BCEWithLogitsLoss |
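Likewise, a sketch of a classification head over `bert-base-uncased` matching this table (assumed structure; the real `BERTClassifier` may differ):

```python
import torch.nn as nn
from transformers import BertModel

class BERTClassifierSketch(nn.Module):
    def __init__(self, dropout=0.3):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(self.bert.config.hidden_size, 1)  # single logit

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.fc(self.dropout(out.pooler_output))       # (B, 1)
```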
| Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| LSTM | 73.79% | 73.67% | 74.05% | 73.86% |
| GRU | 82.49% | 80.17% | 86.34% | 83.14% |
| BERT | 91.36% | 92.20% | 90.35% | 91.27% |
BERT significantly outperforms both RNN models, scoring roughly 9 percentage points higher in accuracy than the GRU and roughly 18 points higher than the LSTM. The GRU also outperforms the LSTM despite being the simpler architecture.
Daniel Wehde & Rami Aabed