Deep Learning project implementing a many-to-one LSTM architecture for sentiment classification of movie reviews.
- Source: Kaggle - IMDB Dataset of 50K Movie Reviews
- Original Size: 50,000 reviews
- Cleaned Size: 49,578 reviews (422 duplicates removed)
- Classes: Binary (Positive/Negative)
- Balance: 1.01:1 (50.18% positive, 49.82% negative) → nearly perfectly balanced
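To double-check these numbers yourself, here is a minimal pandas sketch (not part of the repo; it assumes the raw CSV is already in place as described below):

```python
# Count exact duplicate rows and check the class balance after dropping them.
import pandas as pd

df = pd.read_csv("data/imdb_dataset.csv")
print(f"Duplicates: {df.duplicated().sum()}")       # 422 reported for this dataset

df = df.drop_duplicates()
# Class balance after deduplication: ~50.18% positive / ~49.82% negative
print(df["sentiment"].value_counts(normalize=True) * 100)
```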
- Visit the [Kaggle - IMDB Dataset](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews) page
- Download `IMDB Dataset.csv`
- Place it in the `data/` folder as `imdb_dataset.csv`
Alternatively, download via the Kaggle CLI:

```bash
# Install Kaggle CLI
pip install kaggle

# Download dataset
kaggle datasets download -d lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
unzip imdb-dataset-of-50k-movie-reviews.zip -d data/
mv data/IMDB\ Dataset.csv data/imdb_dataset.csv
```
- Model: Many-to-one LSTM
- Layers: Embedding → LSTM → Dropout → Dense (sigmoid)
- Parameters: ~1.4M trainable parameters
- Gating: forget, input, and output gates regulate the cell state
- Learning: captures long-term dependencies that carry sentiment nuance
- Python: 3.12 (stable version)
- Deep Learning: TensorFlow 2.20.0
- Data Processing: NumPy 1.26.x, Pandas 2.3.3
- Visualization: Matplotlib 3.10.7, Seaborn 0.13.2, WordCloud 1.9.4
- NLP: NLTK 3.9.2
- ML Utils: scikit-learn 1.7.2
```
imdb-sentiment-lstm/
├── .venv/                          # Virtual environment
├── data/
│   ├── imdb_dataset.csv            # Original dataset (50K reviews)
│   ├── imdb_dataset_formatted.csv  # HTML tags removed
│   ├── imdb_dataset_cleaned.csv    # Final cleaned (49,578 reviews)
│   ├── X_train_preprocessed.npy    # Preprocessed training sequences
│   ├── X_val_preprocessed.npy      # Preprocessed validation sequences
│   ├── y_train.npy                 # Training labels
│   └── y_val.npy                   # Validation labels
├── doc/
│   ├── imbd_sentiment_analysis_project_presentation_d18zgx_vadasz_csaba.pptx  # Presentation (Hungarian)
│   └── imbd_sentiment_analysis_project_documentation_d18zgx_vadasz_csaba.pdf  # Documentation (Hungarian)
├── models/
│   ├── tokenizer.pickle            # Keras tokenizer (vocab: 10K)
│   └── lstm_sentiment_model.h5     # Trained model
├── notebooks/                      # Jupyter notebooks for experiments
├── visualizations/
│   ├── eda/                        # Exploratory Data Analysis plots (7)
│   ├── preprocessing/              # Preprocessing visualizations (2)
│   └── training/                   # Training history plots & model architecture
├── src/
│   ├── __init__.py
│   ├── check_versions.py           # PyPI version checker
│   ├── config.py                   # Configuration & hyperparameters
│   ├── data_clean.py               # Data cleaning & EDA
│   ├── data_inspect.py             # Initial data inspection
│   ├── data_format.py              # HTML tag removal
│   ├── data_loader.py              # Data loading & train/val split
│   ├── data_preprocess.py          # Tokenization & padding
│   └── model.py                    # LSTM model architecture
├── .gitignore
├── img.png                         # Self-generated AI image (DALL-E 3)
├── LICENSE                         # MIT License
├── main.py                         # Main entry point
├── README.md
└── requirements.txt                # Packages to install with versions
```
Clone the repository:

```bash
git clone <your-repo-url>
cd imdb-sentiment-lstm
```

Create and activate a virtual environment:

```bash
# Windows
python -m venv .venv
.venv\Scripts\activate

# Mac/Linux
python -m venv .venv
source .venv/bin/activate
```

Install dependencies:

```bash
pip install -r requirements.txt
```

If prompted to update pip:

```bash
python -m pip install --upgrade pip
```
Explore the raw dataset structure and basic statistics.
```bash
python src/data_inspect.py
```
Output:
- Dataset info (50,000 rows, 2 columns)
- First 5 samples
- Sentiment distribution
- HTML tag detection
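The checks above can be reproduced with a few pandas calls; this is a hedged sketch of the same steps, not the exact contents of `src/data_inspect.py`:

```python
# Inspect shape, samples, class distribution, and HTML tag presence.
import pandas as pd

df = pd.read_csv("data/imdb_dataset.csv")

df.info()                                # 50,000 rows, 2 columns
print(df.head())                         # first 5 samples
print(df["sentiment"].value_counts())    # sentiment distribution

# HTML tag detection: how many reviews contain tags such as <br />?
has_html = df["review"].str.contains(r"<[^>]+>", regex=True)
print(f"Reviews with HTML tags: {has_html.sum()}")
```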
Remove HTML tags and format the text for analysis.

```bash
python src/data_format.py
```

Output:

- Cleaned reviews (HTML tags removed)
- Saved to `data/imdb_dataset_formatted.csv`
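A minimal sketch of the tag removal with a regex; `src/data_format.py` may differ in detail (e.g. it could use an HTML parser instead):

```python
# Strip HTML tags like <br /> and write the formatted CSV.
import pandas as pd

df = pd.read_csv("data/imdb_dataset.csv")
df["review"] = df["review"].str.replace(r"<[^>]+>", " ", regex=True)
df.to_csv("data/imdb_dataset_formatted.csv", index=False)
```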
Comprehensive data cleaning and exploratory analysis.

```bash
python src/data_clean.py
```

What it does:

- ✅ Missing values check: 0 missing values found
- ✅ Duplicate removal: 422 duplicates removed (0.84%)
- ✅ Sentiment validation: 0 invalid values found
- ✅ Text length analysis: character & word counts
- ✅ Outlier detection: IQR method (7.39% outliers kept; see the sketch after the statistics below)
- ✅ Descriptive statistics: mean, median, std, min, max
- ✅ 7 visualizations created:
  - Sentiment distribution (bar chart)
  - Text length histogram (word & character count)
  - Text length boxplot (by sentiment)
  - Word clouds (positive & negative)
  - Top 20 frequent words (positive & negative)

Output:

- `data/imdb_dataset_cleaned.csv` (49,578 reviews)
- 7 PNG visualizations in `visualizations/eda/`
Key Statistics:

```
Total Reviews:      49,578
Positive:           24,882 (50.18%)
Negative:           24,696 (49.82%)
Avg Word Count:     229 words
Median Word Count:  172 words
```
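Here is a sketch of the duplicate removal and the IQR outlier check mentioned above, assuming word count as the length measure; the full pipeline (validation, statistics, plots) lives in `src/data_clean.py`:

```python
import pandas as pd

df = pd.read_csv("data/imdb_dataset_formatted.csv")

# Drop exact duplicates: 50,000 -> 49,578 reviews
df = df.drop_duplicates().reset_index(drop=True)

# IQR outlier detection on word counts; outliers are only reported, not removed
word_counts = df["review"].str.split().str.len()
q1, q3 = word_counts.quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (word_counts < q1 - 1.5 * iqr) | (word_counts > q3 + 1.5 * iqr)
print(f"Outliers: {is_outlier.mean() * 100:.2f}% (kept)")   # ~7.39% reported

df.to_csv("data/imdb_dataset_cleaned.csv", index=False)
```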
Tokenization, sequence padding, and train/validation split.

```bash
python src/data_preprocess.py
```

What it does:

- ✅ Tokenization: convert text to integer sequences (sketched after the statistics below)
- ✅ Vocabulary: top 10,000 most frequent words
- ✅ Padding: all sequences padded/truncated to 200 tokens
- ✅ Train/val split: 80/20 stratified split (39,662 / 9,916)
- ✅ Save preprocessed data: arrays saved as .npy files (38.2 MB)
- ✅ 2 visualizations created:
  - Sequence length distribution (train & val)
  - Vocabulary statistics (Zipf's law)
Output:

- `models/tokenizer.pickle` (4.7 MB)
- `data/X_train_preprocessed.npy` (30.26 MB)
- `data/X_val_preprocessed.npy` (7.57 MB)
- `data/y_train.npy` (309 KB)
- `data/y_val.npy` (77 KB)
- 2 PNG visualizations in `visualizations/preprocessing/`
Key Statistics:

```
Training Set:     39,662 samples (80%)
Validation Set:   9,916 samples (20%)
Vocabulary Size:  10,000 words
Sequence Length:  200 tokens (padded/truncated)
Padding:          58.9% padded, 40.8% truncated
```
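A sketch of this step with the hyperparameters above. The repo pickles a legacy Keras `Tokenizer`; since that API has moved between Keras versions, this sketch uses the equivalent `TextVectorization` layer instead, and the split seed (42) is an assumption:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import TextVectorization

df = pd.read_csv("data/imdb_dataset_cleaned.csv")
texts = df["review"].to_numpy()
labels = (df["sentiment"] == "positive").astype(int).to_numpy()

# 80/20 stratified split -> 39,662 train / 9,916 validation samples
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# Top-10,000-word vocabulary; pad/truncate every sequence to 200 tokens
vectorizer = TextVectorization(max_tokens=10_000, output_sequence_length=200)
vectorizer.adapt(X_train)
X_train_seq = vectorizer(X_train).numpy()
X_val_seq = vectorizer(X_val).numpy()
```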
Build and compile the LSTM architecture.

```bash
python src/model.py
```

Architecture:

```
Input (batch_size, 200)
        ↓
Embedding Layer (vocab_size=10K, embedding_dim=128)
        ↓
LSTM Layer (128 units, dropout=0.5, recurrent_dropout=0.2)
        ↓
Dropout Layer (0.5)
        ↓
Dense Output (1 unit, sigmoid activation)
        ↓
Output (batch_size, 1) - probability [0=negative, 1=positive]
```
Model Summary:

```
Total Parameters:     1,411,713 (5.39 MB)
Trainable Parameters: 1,411,713
```

Layer Breakdown:

- Embedding: 1,280,000 params (10,000 × 128)
- LSTM: 131,584 params (4 gates × ((128 + 128) × 128 + 128))
- Dense: 129 params (128 weights + 1 bias)
Output:

- `visualizations/training/model_architecture.json`
- `visualizations/training/model_config.json`
- `visualizations/training/model_architecture.png`
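The architecture above maps directly onto a Keras `Sequential` model. This is a sketch, with the optimizer and loss assumed (Adam + binary cross-entropy, the usual pairing for a sigmoid output); `src/model.py` holds the actual definition:

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout, Embedding

model = Sequential([
    Embedding(input_dim=10_000, output_dim=128),    # 10,000 x 128 = 1,280,000 params
    LSTM(128, dropout=0.5, recurrent_dropout=0.2),  # 4 x ((128+128) x 128 + 128) = 131,584 params
    Dropout(0.5),
    Dense(1, activation="sigmoid"),                 # 128 weights + 1 bias = 129 params
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.build(input_shape=(None, 200))
model.summary()   # Total params: 1,411,713
```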
Training:

- Train the LSTM model (10 epochs, batch_size=64)
- Early stopping with patience=3
- Save training history & plots
- Save the trained model

Evaluation:

- Evaluate on the validation set
- Confusion matrix
- Classification report
- Sample predictions
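A training sketch matching the settings above; it reuses the compiled `model` from the previous sketch and the saved `.npy` arrays. Monitoring validation loss with `restore_best_weights` is an assumption:

```python
import numpy as np
from tensorflow.keras.callbacks import EarlyStopping

X_train = np.load("data/X_train_preprocessed.npy")
y_train = np.load("data/y_train.npy")
X_val = np.load("data/X_val_preprocessed.npy")
y_val = np.load("data/y_val.npy")

early_stop = EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True)

history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=10,
    batch_size=64,
    callbacks=[early_stop],
)
model.save("models/lstm_sentiment_model.h5")
```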
After cleaning, the dataset shows excellent characteristics for training:

- Near-perfect balance: 50.18% positive vs. 49.82% negative (no resampling needed)
- Good text length distribution: 229 words on average, well suited to an LSTM
- Minimal duplicates: only 0.84% removed
- No missing data: 100% complete dataset
- Outliers kept: 7.39% unusually long/short reviews retained (they may carry valuable sentiment signal)

Check the visualizations in `visualizations/eda/` for detailed insights!
Created as part of Deep Learning coursework at University of Pannonia.
Developed with a focus on understanding LSTM mechanics and practical NLP implementation.
Developed by Csaba79-coder | Csaba Vadász
MIT License
Solution: Make sure the virtual environment is activated:

```bash
# Windows
.venv\Scripts\activate

# Mac/Linux
source .venv/bin/activate
```
Solution: We use NumPy 1.26.x (not 2.x) for TensorFlow compatibility. This is intentional and stable.
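A quick inline check (independent of `src/check_versions.py`) that the installed versions match the pinned stack:

```python
import numpy as np
import tensorflow as tf

print(np.__version__)   # expected: 1.26.x
print(tf.__version__)   # expected: 2.20.0
```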
Solution: Install it separately if needed:

```bash
pip install wordcloud==1.9.4
```
Solution: Run scripts from the project root:

```bash
python src/data_clean.py          # ✅ Correct
cd src && python data_clean.py    # ❌ Wrong
```