A comprehensive Python application for fetching, processing, and analyzing top-rated movies from The Movie Database (TMDB) API with advanced text preprocessing capabilities.
## Table of Contents

- Overview
- Features
- Project Structure
- Installation
- Configuration
- Usage
- API Reference
- Data Processing Pipeline
- Output Format
- Dependencies
- Contributing
- License
- Troubleshooting
## Overview

This project provides a robust data pipeline for collecting and preprocessing movie data from TMDB's extensive database. It fetches top-rated movies with their descriptions and genres, then applies comprehensive text preprocessing techniques, including tokenization, lemmatization, stemming, and stopword removal, to prepare the data for further analysis or machine learning applications.
## Features

- Data Acquisition: Automated fetching of top-rated movies from TMDB API
- Genre Mapping: Intelligent conversion of genre IDs to human-readable names
- Text Preprocessing: Advanced NLP preprocessing pipeline
- Rate Limiting: Respectful API usage with built-in delays
- Error Handling: Robust error handling for API failures
- CSV Export: Clean data export for further analysis
### Data Acquisition

- Fetch top-rated movies from TMDB API with pagination support
- Automatic genre ID to name mapping
- Configurable number of pages to process
- Built-in rate limiting to respect API guidelines
- Comprehensive error handling and status code validation
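The acquisition steps above can be sketched with `requests`. The helper names here (`map_genre_ids`, `fetch_page`) are illustrative and not necessarily what `data_acquisition.py` defines, though the endpoint is TMDB's documented `/movie/top_rated`:

```python
import time

# Illustrative sketch of the acquisition steps; the real module's
# function names and structure may differ.
API_KEY = "your_api_key_here"  # placeholder, as in data_acquisition.py
BASE_URL = "https://api.themoviedb.org/3"

def map_genre_ids(genre_ids, id_to_name):
    """Convert TMDB genre IDs to a comma-separated, readable string."""
    return ", ".join(id_to_name.get(g, "Unknown") for g in genre_ids)

def fetch_page(page):
    """Fetch one page of top-rated movies, with basic error handling."""
    import requests  # imported here so the sketch parses without requests installed
    resp = requests.get(
        f"{BASE_URL}/movie/top_rated",
        params={"api_key": API_KEY, "language": "en-US", "page": page},
        timeout=10,
    )
    resp.raise_for_status()  # surfaces 401/404/5xx as exceptions
    time.sleep(0.5)          # rate-limiting delay between pages
    return resp.json()["results"]
```

The genre lookup table itself comes from TMDB's `/genre/movie/list` endpoint and only needs to be fetched once per run.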
### Text Preprocessing Pipeline

- HTML Tag Removal: Clean HTML content from descriptions
- URL Removal: Strip web URLs and links
- Punctuation Removal: Remove all punctuation marks
- Emoji Handling: Remove or replace emojis
- Tokenization: Split text into individual words
- Chat/Abbreviation Expansion: Convert common abbreviations (u → you, r → are)
- Spell Correction: Automatic spelling correction
- Stopword Removal: Remove common English stopwords
- Lemmatization: Convert words to their base forms
- Stemming: Reduce words to their root forms
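A simplified sketch of a few of these steps (lowercasing, URL and punctuation removal, tokenization, stopword filtering) using only the standard library; the project's actual pipeline also applies NLTK lemmatization, stemming, and spell correction:

```python
import re
import string

# Tiny illustrative stopword set; the real pipeline uses NLTK's English list.
STOPWORDS = {"this", "is", "a", "the", "it", "at"}

def simple_preprocess(text):
    text = text.lower()                                               # lowercase
    text = re.sub(r"https?://\S+", "", text)                          # strip URLs
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    tokens = text.split()                                             # tokenize
    return " ".join(t for t in tokens if t not in STOPWORDS)          # remove stopwords

print(simple_preprocess("This is a GREAT movie at https://example.com"))
# great movie
```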
### Output

- CSV format with clean, structured data
- Preprocessed text ready for analysis
- Genre information preserved and readable
## Project Structure

```
tmdb/
├── README.md                          # Project documentation
├── main.py                            # Main application entry point
├── data_acquisition.py                # TMDB API interaction module
├── text_preprocessing.py              # NLP preprocessing pipeline
├── hello.py                           # Simple greeting module
├── requirements.txt                   # Python dependencies
├── pyproject.toml                     # Project configuration
├── uv.lock                            # Dependency lock file
├── top_rated_movies_with_genres.csv   # Sample output data
└── __pycache__/                       # Python cache directory
```
## Installation

### Prerequisites

- Python 3.10 or higher
- TMDB API key (free registration required)
- Internet connection for API calls and NLTK data downloads
### Setup

Clone the repository:

```bash
git clone <repository-url>
cd tmdb
```

Create and activate a virtual environment:

```bash
# Using venv
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Or using conda
conda create -n tmdb python=3.10
conda activate tmdb
```

Install dependencies:

```bash
pip install -r requirements.txt
```

The application will automatically download the required NLTK data on first run:
- punkt (tokenization)
- stopwords (English stopwords)
- wordnet (lemmatization)
- omw-1.4 (multilingual wordnet)
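If the automatic download ever fails, the same resources can be fetched explicitly. The helper below is a sketch, not part of the project's modules:

```python
# The four NLTK resources the pipeline relies on.
NLTK_PACKAGES = ["punkt", "stopwords", "wordnet", "omw-1.4"]

def download_nltk_data(packages=NLTK_PACKAGES, quiet=True):
    """Download the listed NLTK resources (a no-op for ones already cached)."""
    import nltk  # imported lazily so this file parses without nltk installed
    for pkg in packages:
        nltk.download(pkg, quiet=quiet)

if __name__ == "__main__":
    download_nltk_data()
```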
## Configuration

### API Key Setup

- Register for a free account at TMDB
- Navigate to API settings and generate an API key
- Update the API key in `data_acquisition.py`:

```python
API_KEY = "your_api_key_here"
```
⚠️ Security Note: For production use, store API keys in environment variables or configuration files, not in source code.
- `pages`: Number of pages to fetch (default: 10, max: ~500)
- `time.sleep(0.5)`: API rate-limiting delay (adjustable)
- `language`: API language parameter (default: "en-US")
- `chat_dict`: Expandable abbreviation dictionary
- Language settings for the spell checker and stopwords
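The `chat_dict` expansion works on tokens; here is a minimal sketch of the idea (the actual dictionary in `text_preprocessing.py` is larger):

```python
# Illustrative abbreviation dictionary; extend it with project-specific slang.
chat_dict = {"u": "you", "r": "are", "btw": "by the way"}

def expand_abbreviations(tokens, mapping=chat_dict):
    """Replace each token found in the mapping; keep the rest as-is."""
    return [mapping.get(token, token) for token in tokens]

print(expand_abbreviations(["u", "r", "great"]))
# ['you', 'are', 'great']
```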
## Usage

Run the main application to fetch and process movie data:

```bash
python main.py
```

This will:

- Fetch top-rated movies from TMDB (10 pages by default)
- Apply text preprocessing to movie descriptions
- Save results to `top_rated_movies_with_genres.csv`
- Display the first 10 processed movies
Modify the `pages` parameter in `main.py`:

```python
df = fetch_top_rated_movies(pages=50)  # Fetch 50 pages (~1000 movies)
```

Use the preprocessing function independently:
```python
from text_preprocessing import preprocess_text

# Example usage
raw_text = "This is a GREAT movie!!! 😍 Check it out at https://example.com"
clean_text = preprocess_text(raw_text)
print(clean_text)  # Output: "great movi check"
```

Use the data acquisition module separately:
```python
from data_acquisition import fetch_top_rated_movies

# Fetch data without preprocessing
df = fetch_top_rated_movies(pages=5)
print(df.head())
```

## API Reference

### fetch_top_rated_movies(pages=100)

Fetches top-rated movies from the TMDB API.
Parameters:

- `pages` (int): Number of pages to fetch (default: 100)

Returns:

- `pandas.DataFrame` with columns:
  - `movie_name`: Movie title
  - `description`: Movie overview/description
  - `genre`: Comma-separated genre names
Example:

```python
df = fetch_top_rated_movies(pages=5)
print(f"Fetched {len(df)} movies")
```

### preprocess_text(text)

Applies the comprehensive text preprocessing pipeline.
Parameters:

- `text` (str): Raw text to preprocess

Returns:

- `str`: Cleaned and preprocessed text
Processing Steps:
- Lowercase conversion
- HTML tag removal
- URL removal
- Punctuation removal
- Emoji removal
- Tokenization
- Abbreviation expansion
- Spell correction
- Stopword removal
- Lemmatization
- Stemming
Example:

```python
raw = "This movie is AMAZING!!! 😍 You should watch it!"
clean = preprocess_text(raw)
print(clean)  # Output: "movi amaz watch"
```

## Data Processing Pipeline

### Data Acquisition Flow

```mermaid
graph LR
    A[TMDB API] --> B[Genre Mapping]
    B --> C[Movie Data Fetch]
    C --> D[Rate Limiting]
    D --> E[DataFrame Creation]
```
### Text Preprocessing Flow

```mermaid
graph TD
    A[Raw Text] --> B[Lowercase]
    B --> C[HTML Removal]
    C --> D[URL Removal]
    D --> E[Punctuation Removal]
    E --> F[Emoji Removal]
    F --> G[Tokenization]
    G --> H[Abbreviation Expansion]
    H --> I[Spell Correction]
    I --> J[Stopword Removal]
    J --> K[Lemmatization]
    K --> L[Stemming]
    L --> M[Clean Text]
```
## Output Format

The application generates a CSV file with the following structure:

| Column | Description | Example |
|---|---|---|
| `movie_name` | Original movie title | "The Shawshank Redemption" |
| `description` | Preprocessed movie description | "imprison 1940 doubl murder wife..." |
| `genre` | Comma-separated genre names | "Drama, Crime" |
Sample rows:

```csv
movie_name,description,genre
The Shawshank Redemption,imprison 1940 doubl murder wife lover stand banker...,"Drama, Crime"
The Godfather,span year 1945 1955 chronicl fiction italianamerican...,"Drama, Crime"
Spirited Away,young girl chihiro becom trap strang new world spirit...,"Animation, Family, Fantasy"
```

Note that the `genre` field is quoted because it contains commas.
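Because the genre column contains commas, the file must be read with a quoting-aware CSV parser. A quick standard-library check, using abbreviated sample rows:

```python
import csv
import io

# Two abbreviated rows in the output format; genre values are quoted
# because they contain commas.
sample = '''movie_name,description,genre
The Shawshank Redemption,imprison 1940 doubl murder wife lover,"Drama, Crime"
Spirited Away,young girl chihiro becom trap strang new world,"Animation, Family, Fantasy"
'''

rows = list(csv.DictReader(io.StringIO(sample)))
print(rows[0]["genre"])
# Drama, Crime
```

`pandas.read_csv` applies the same quoting rules by default, so the `genre` column loads as a single string per row.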
## Dependencies

| Package | Version | Purpose |
|---|---|---|
| `requests` | 2.32.5 | HTTP requests for API calls |
| `pandas` | 2.3.2 | Data manipulation and analysis |
| `nltk` | 3.9.1 | Natural language processing |
| `beautifulsoup4` | 4.13.5 | HTML parsing and removal |
| `emoji` | 2.14.1 | Emoji handling and removal |
| `autocorrect` | - | Automatic spell correction |
Supporting packages:

- `numpy` (2.2.6): Numerical computing support
- `regex` (2025.7.34): Advanced regular expressions
- `tqdm` (4.67.1): Progress bars for long operations
- `click` (8.2.1): Command-line interface utilities
- `setuptools` (65.5.0): Package building tools
- `pip` (25.2): Package installer
## Contributing

We welcome contributions! Please follow these guidelines:
- Fork the repository
- Create a feature branch: `git checkout -b feature-name`
- Make your changes
- Add tests if applicable
- Commit your changes: `git commit -m "Add feature"`
- Push to the branch: `git push origin feature-name`
- Submit a pull request
- Follow PEP 8 Python style guidelines
- Use meaningful variable and function names
- Add docstrings to all functions
- Include type hints where appropriate
Before submitting a pull request:
- Test with different page counts
- Verify API key functionality
- Test text preprocessing with various inputs
- Ensure CSV output is properly formatted
## License

This project is licensed under the MIT License. See the LICENSE file for details.
## Troubleshooting

Problem: 401 Unauthorized or API key errors
Solution:
- Verify your TMDB API key is correct
- Ensure the API key is active and not expired
- Check for any typos in the API key
Problem: NLTK data not found errors

Solution:

```python
import nltk
nltk.download('all')  # Download all NLTK data
```

Problem: Out of memory errors when processing many pages

Solution:

- Reduce the number of pages processed
- Process data in smaller batches
- Increase system memory or use a more powerful machine
Problem: API rate limit exceeded

Solution:

- Increase the `time.sleep()` value in `data_acquisition.py`
- Reduce the number of concurrent requests
- Consider implementing exponential backoff
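Exponential backoff can be added as a thin wrapper around the request call. This is a sketch, assuming the response object exposes `status_code` the way `requests` responses do:

```python
import time

def fetch_with_backoff(do_request, max_retries=5, base_delay=0.5):
    """Retry a request callable, doubling the delay after each 429 response."""
    for attempt in range(max_retries):
        response = do_request()
        if response.status_code != 429:  # 429 = Too Many Requests
            return response
        time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, 4s, ...
    raise RuntimeError("rate limit persisted after retries")
```

Usage: pass something like `lambda: requests.get(url, params=params)` as `do_request`.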
### Performance Tips

- Batch Processing: Process movies in smaller batches
- Parallel Processing: Use multiprocessing for text preprocessing
- Memory Management: Clear intermediate variables
- Caching: Cache genre mappings to avoid repeated API calls
- Reduce Preprocessing Steps: Comment out unnecessary preprocessing steps
- Optimize Regex: Use compiled regex patterns for better performance
- Vectorized Operations: Use pandas vectorized operations where possible
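For example, compiling the cleaning patterns once keeps regex compilation out of the per-description hot path. The pattern names below are illustrative, not the project's actual identifiers:

```python
import re

# Compile once at module load instead of on every call.
URL_RE = re.compile(r"https?://\S+")
HTML_TAG_RE = re.compile(r"<[^>]+>")

def strip_noise(text):
    """Remove URLs and HTML tags using precompiled patterns."""
    text = URL_RE.sub("", text)
    return HTML_TAG_RE.sub("", text).strip()

print(strip_noise("<b>Great</b> movie, see https://example.com"))
# Great movie, see
```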
### Getting Help

If you encounter issues not covered here:
- Check the Issues page for similar problems
- Create a new issue with:
- Python version
- Operating system
- Full error message
- Steps to reproduce
- Contact the maintainers
## Acknowledgments

- The Movie Database (TMDB) for providing the comprehensive movie API
- NLTK team for natural language processing tools
- Pandas developers for data manipulation capabilities
- Open source community for various supporting libraries
Made with ❤️ for movie data enthusiasts and NLP practitioners