# Movie Dataset Cleaning Pipeline
## Comprehensive Data Cleaning for EDA Project

This notebook cleans the four movie datasets in the `Original_Data` folder and exports analysis-ready versions to the `cleaned_data` folder.

### 📋 Datasets to Process:
1. **Box Office Mojo** (`bom.movie_gross.csv.gz`) - Box office gross earnings
2. **TMDb Movies** (`tmdb.movies.csv.gz`) - The Movie Database metadata
3. **Movie Budgets** (`tn.movie_budgets.csv.gz`) - Production budgets and revenues
4. **Rotten Tomatoes Info** (`rt.movie_info.tsv.gz`) - Movie ratings and metadata
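
The list above mixes comma-separated and tab-separated files, all gzip-compressed. As a minimal sketch (the helper name `load_raw` is hypothetical and not part of this notebook), the delimiter could be picked from the file name so that all four files go through one loader:

```python
from pathlib import Path

import pandas as pd


def load_raw(path):
    """Read a gzip-compressed CSV or TSV file into a DataFrame.

    Hypothetical helper: the delimiter is inferred from the file name.
    A ".tsv" suffix means tab-separated; anything else is read as
    comma-separated.
    """
    path = Path(path)
    # For "rt.movie_info.tsv.gz", path.suffixes is
    # [".movie_info", ".tsv", ".gz"], so the check below matches.
    sep = "\t" if ".tsv" in path.suffixes else ","
    return pd.read_csv(path, sep=sep, compression="gzip")
```

With the files listed above, a call might look like `load_raw("Original_Data/rt.movie_info.tsv.gz")`. The cells below call `pd.read_csv` directly instead, which works equally well for a single file.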

### 🎯 Cleaning Objectives:
- Handle missing values appropriately
- Standardize data types and formats
- Remove duplicates
- Clean financial data (currency formatting)
- Validate and correct data inconsistencies
- Export clean datasets for analysis
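
Two of the objectives above, removing duplicates and cleaning currency-formatted values, can be sketched as follows. This is an illustration under stated assumptions, not the notebook's actual cleaning code: the function names `parse_currency` and `basic_clean` and the column names in the demo frame are hypothetical.

```python
import pandas as pd


def parse_currency(series):
    """Convert strings such as "$1,234,567" to numbers.

    Values that cannot be parsed (including missing values) become NaN.
    """
    # Cast to str first so that non-string entries do not break .str methods.
    cleaned = series.astype(str).str.replace(r"[$,]", "", regex=True).str.strip()
    return pd.to_numeric(cleaned, errors="coerce")


def basic_clean(df, money_cols):
    """Drop exact duplicate rows, then parse the given currency columns."""
    out = df.drop_duplicates().copy()
    for col in money_cols:
        out[col] = parse_currency(out[col])
    return out


# Demo data with hypothetical column names.
demo = pd.DataFrame(
    {
        "title": ["A", "A", "B"],
        "budget": ["$1,000,000", "$1,000,000", "n/a"],
    }
)
print(basic_clean(demo, money_cols=["budget"]))
```

Using `errors="coerce"` turns unparseable entries into NaN rather than raising, so they show up in the missing-value counts and can be handled alongside the other gaps.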

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import warnings
from pathlib import Path

# Configure pandas for better display
pd.set_option("display.max_columns", None)
pd.set_option("display.width", None)
warnings.filterwarnings("ignore")

# Create the output directory for the cleaned datasets
Path("cleaned_data").mkdir(exist_ok=True)

print("✅ Environment setup complete!")
print(f"📁 Working directory: {Path.cwd()}")

## 1. Box Office Mojo Data Cleaning

We start with the Box Office Mojo dataset, which contains movie gross earnings.

In [None]:
# Load Box Office Mojo data
print("🎬 Processing Box Office Mojo Data")
print("=" * 40)

bom_df = pd.read_csv("Original_Data/bom.movie_gross.csv.gz", compression="gzip")

print(f"📊 Original shape: {bom_df.shape}")
print(f"📋 Columns: {list(bom_df.columns)}")
print("\n🔍 First 5 rows:")
display(bom_df.head())

# Check data types and missing values
print("\n📈 Data info:")
bom_df.info()
print("\n❌ Missing values:")
print(bom_df.isnull().sum())