# Notebook 2: Data Cleaning & Initial EDA

## Purpose
This notebook cleans the raw data collected in Notebook 1 and runs an initial exploratory data analysis (EDA) to understand the dataset's structure, quality, and basic patterns.

## Objectives
1. Load raw data from CSV files
2. Assess data quality (missing values, duplicates, data types, outliers)
3. Clean and standardize the data with documented decisions
4. Perform univariate analysis on all variables
5. Create initial visualizations (distributions, summary statistics)
6. Filter to wide releases (1,000+ theaters) from 2010 to 2024
7. Save cleaned dataset for feature engineering

## Key Questions
- How much missing data do we have in critical variables (budget, revenue)?
- What is the distribution of our target variable (revenue)?
- Are there extreme outliers that need special handling?
- What percentage of movies meet our filtering criteria?

## Outputs
- `data/processed/movies_cleaned.csv`
- Documentation of cleaning decisions and data quality issues

## Notes
- Document and justify every cleaning decision
- Keep track of how many rows are dropped at each step
- Target: 2,500-3,000 movies after cleaning

---
## Setup and Imports

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

---
## Load Raw Data

In [None]:
# Load raw datasets
# Paths assume Notebook 1 has already written these files
df_movies = pd.read_csv('data/raw/movies_tmdb_raw.csv')
df_revenue = pd.read_csv('data/raw/revenue_boxofficemojo_raw.csv')

---
## Data Quality Assessment

In [None]:
# Check shape, data types, missing values
# Create visualizations of missing data patterns
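
A minimal sketch of the assessment step, assuming a pandas DataFrame as input. The helper name `quality_report` and the columns in the demo frame (`title`, `budget`, `revenue`) are hypothetical stand-ins, since the raw schema is not fixed yet. It tabulates dtype, missing counts, missing percentage, and cardinality per column, and counts fully duplicated rows.

```python
import pandas as pd


def quality_report(df):
    """Summarize dtype, missing values, and cardinality for each column."""
    report = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(1),
        "n_unique": df.nunique(),
    })
    # Columns with the most missing data come first
    return report.sort_values("pct_missing", ascending=False)


# Tiny made-up frame, used only to show the call (not real data)
demo = pd.DataFrame({
    "title": ["A", "B", "B", "C"],
    "budget": [1_000_000, None, None, 5_000_000],
    "revenue": [3_000_000, 2_000_000, 2_000_000, None],
})

report = quality_report(demo)
n_duplicates = int(demo.duplicated().sum())

print(f"Shape: {demo.shape}, fully duplicated rows: {n_duplicates}")
print(report)
```

For a quick visual of the missing-data pattern, the `pct_missing` column could be passed to `report["pct_missing"].plot.barh()`.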

---
## Data Cleaning

In [None]:
# Remove duplicates
# Handle missing values
# Fix data types
# Standardize categorical variables
# Validate ranges
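
The steps above could be sketched roughly as follows. The function name `clean_movies` and the column names (`title`, `release_date`, `budget`, `revenue`, `genre`) are assumptions for illustration, not the actual schema. Treating non-positive budget or revenue as missing is also an assumption: some sources record unknown amounts as 0, but that should be verified against the raw data and documented before it is adopted.

```python
import pandas as pd


def clean_movies(df):
    """Apply the cleaning steps in order and return a new DataFrame."""
    out = df.copy()

    # 1. Remove exact duplicate rows
    out = out.drop_duplicates()

    # 2. Fix data types; unparseable values become NaT/NaN instead of raising
    out["release_date"] = pd.to_datetime(out["release_date"], errors="coerce")
    for col in ["budget", "revenue"]:
        out[col] = pd.to_numeric(out[col], errors="coerce")

    # 3. Validate ranges: treat non-positive amounts as missing (assumption)
    for col in ["budget", "revenue"]:
        out[col] = out[col].where(out[col] > 0)

    # 4. Standardize a categorical column: trim whitespace, unify case
    out["genre"] = out["genre"].str.strip().str.title()

    # 5. Handle missing values: drop rows lacking a critical variable
    out = out.dropna(subset=["budget", "revenue", "release_date"])

    return out.reset_index(drop=True)


# Made-up rows that exercise each step (not real data)
raw = pd.DataFrame({
    "title": ["A", "A", "B", "C", "D"],
    "release_date": ["2015-06-01", "2015-06-01", "2018-03-09",
                     "not a date", "2020-01-17"],
    "budget": ["1000000", "1000000", "0", "5000000", "2500000"],
    "revenue": [3_000_000, 3_000_000, 2_000_000, 9_000_000, 4_000_000],
    "genre": [" action", " action", "Drama ", "COMEDY", "comedy"],
})

cleaned = clean_movies(raw)
print(cleaned)
```

Dropping rows with a missing critical variable is only one option; imputation or keeping the rows with a missing-value flag may suit the later modeling better, and whichever is chosen belongs in the cleaning log.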

---
## Univariate Analysis

In [None]:
# Distribution plots for numeric variables
# Count plots for categorical variables
# Summary statistics
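
A possible starting point for the numeric side of this section, shown without plots so it runs anywhere pandas is installed. The helpers `numeric_summary` and `iqr_outlier_mask` and the demo columns are hypothetical. The 1.5 x IQR rule is one common outlier heuristic, swapped in here as an example rather than a decision already made for this project.

```python
import numpy as np
import pandas as pd


def numeric_summary(df, cols):
    """Describe numeric columns and append skewness."""
    summary = df[cols].describe().T
    summary["skew"] = df[cols].skew()
    return summary


def iqr_outlier_mask(s, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


# Made-up values with one extreme revenue (not real data)
demo = pd.DataFrame({
    "revenue": [1e6, 2e6, 3e6, 4e6, 5e6, 500e6],
    "genre": ["Drama", "Drama", "Action", "Comedy", "Drama", "Action"],
})

summary = numeric_summary(demo, ["revenue"])
mask = iqr_outlier_mask(demo["revenue"])
counts = demo["genre"].value_counts()

# Revenue is usually right-skewed, so a log scale is often easier to read
demo["log10_revenue"] = np.log10(demo["revenue"])

print(summary)
print(f"Flagged outliers: {int(mask.sum())}")
print(counts)
```

The same columns can then feed the plots: `sns.histplot` on `log10_revenue` for the distribution and `sns.countplot` on `genre` for the categorical counts.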

---
## Save Cleaned Data

In [None]:
# Save cleaned dataset
# df_cleaned.to_csv('data/processed/movies_cleaned.csv', index=False)
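
Objective 6 (wide releases, 2010 to 2024) has to be applied before saving. A hedged sketch of that filter and the save step follows; the column names `theaters` and `release_date`, the helper names, and the inclusive year bounds are assumptions, and `release_date` is assumed to be a datetime column already.

```python
from pathlib import Path

import pandas as pd

MIN_THEATERS = 1000       # "wide release" threshold from the objectives
YEAR_RANGE = (2010, 2024)  # assumed inclusive on both ends


def filter_wide_releases(df):
    """Keep rows with enough theaters and a release year in range."""
    year = df["release_date"].dt.year
    mask = (df["theaters"] >= MIN_THEATERS) & year.between(*YEAR_RANGE)
    return df[mask].reset_index(drop=True)


def save_cleaned(df, path):
    """Write the DataFrame to CSV, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return path


# Made-up rows around the boundaries (not real data)
demo = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "release_date": pd.to_datetime(
        ["2009-12-25", "2010-01-08", "2017-07-21", "2024-12-20"]
    ),
    "theaters": [3200, 850, 1000, 4100],
})

wide = filter_wide_releases(demo)
pct_kept = 100 * len(wide) / len(demo)
print(f"Kept {len(wide)} of {len(demo)} rows ({pct_kept:.1f}%)")

# In the notebook, the save call would look like:
# save_cleaned(wide, "data/processed/movies_cleaned.csv")
```

The kept percentage answers the key question about how many movies meet the filtering criteria.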

---
## Summary of Cleaning Process

In [None]:
# Document:
# - Original row count
# - Rows dropped at each step
# - Final row count
# - Key decisions made
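
One way to collect these numbers as the notebook runs is a small logging helper. `CleaningLog` below is a hypothetical class sketched for this purpose, not an existing library API, and the demo frame is invented.

```python
import pandas as pd


class CleaningLog:
    """Record the row count and the decision taken after each step."""

    def __init__(self, df):
        self.steps = [("original", len(df), "")]

    def record(self, step, df, decision=""):
        """Store the row count after a step and pass the frame through."""
        self.steps.append((step, len(df), decision))
        return df

    def summary(self):
        out = pd.DataFrame(self.steps, columns=["step", "rows", "decision"])
        # Rows dropped relative to the previous step (0 for the first row)
        out["dropped"] = (-out["rows"].diff()).fillna(0).astype(int)
        return out[["step", "rows", "dropped", "decision"]]


# Made-up frame to show the call pattern (not real data)
df = pd.DataFrame({
    "title": ["A", "A", "B", "C", "D"],
    "revenue": [3.0, 3.0, None, 9.0, None],
})

log = CleaningLog(df)
df = log.record("drop duplicates", df.drop_duplicates(),
                "exact duplicate rows removed")
df = log.record("drop missing revenue", df.dropna(subset=["revenue"]),
                "revenue is the target variable")

summary = log.summary()
print(summary)
```

Because `record` returns the frame it was given, each cleaning step can be wrapped in place without restructuring the cells.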