# Movie Industry Exploratory Data Analysis
## Comprehensive Analysis of Movie Performance, Ratings, and Financial Data

This notebook performs exploratory data analysis on four cleaned movie datasets. The objectives and data sources are listed below.

### 🎯 Analysis Objectives:
- **Movie Performance Trends** - Box office performance over time
- **Rating Patterns** - TMDb and Rotten Tomatoes rating distributions
- **Financial Analysis** - Budget vs revenue relationships
- **Genre Analysis** - Popular genres and their performance
- **Studio Analysis** - Top performing studios and distributors
- **Temporal Trends** - Movie industry evolution over years

### 📋 Cleaned Datasets Used:
- **TMDb Movies** (26,517 records) - Movie metadata, ratings, popularity
- **Movie Budgets** (5,782 records) - Production budgets and revenues
- **Box Office Mojo** (3,387 records) - Domestic and foreign gross earnings
- **Rotten Tomatoes** (1,560 records) - Critical ratings and movie info

In [None]:
# Import required libraries for comprehensive EDA
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
from pathlib import Path
from datetime import datetime
import re

# Configure visualization settings
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)
pd.set_option('display.max_columns', None)
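# Suppress all warnings to keep cell output tidy
# (note: this also hides deprecation and dtype warnings)
warnings.filterwarnings('ignore')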

print("✅ Libraries imported successfully!")
print("📊 Analysis environment ready")
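
The `seaborn-v0_8` style name is only available in Matplotlib 3.6 and later; earlier releases expose the same style as `seaborn`. A minimal sketch of a fallback follows, assuming a helper named `pick_style` (hypothetical, not part of the original notebook) that returns the first style Matplotlib recognizes:

```python
import matplotlib.pyplot as plt


def pick_style(preferred, available=None):
    """Return the first name in `preferred` that is an available style.

    Falls back to Matplotlib's built-in 'default' style if none match.
    `pick_style` is an illustrative helper, not a Matplotlib function.
    """
    if available is None:
        available = plt.style.available
    for name in preferred:
        if name in available:
            return name
    return "default"


# Try the newer style name first, then the pre-3.6 name.
style = pick_style(["seaborn-v0_8", "seaborn"])
plt.style.use(style)
print(f"Using Matplotlib style: {style}")
```

Keeping the lookup in a small function makes the choice easy to check without depending on which Matplotlib version is installed.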

## Data Loading and Initial Overview

Load all four cleaned datasets, convert the release-date columns to datetimes, and print each dataset's shape, column names, and memory usage.

In [None]:
# Load all cleaned datasets
print("📂 Loading cleaned datasets...")
print("=" * 40)

# Load datasets
tmdb_df = pd.read_csv('cleaned_data/tmdb_movies_cleaned.csv')
budgets_df = pd.read_csv('cleaned_data/movie_budgets_cleaned.csv')
bom_df = pd.read_csv('cleaned_data/box_office_mojo_cleaned.csv')
rt_df = pd.read_csv('cleaned_data/rotten_tomatoes_info_cleaned.csv')

# Convert date columns
tmdb_df['release_date'] = pd.to_datetime(tmdb_df['release_date'])
budgets_df['release_date'] = pd.to_datetime(budgets_df['release_date'])

datasets = {
    'TMDb Movies': tmdb_df,
    'Movie Budgets': budgets_df,
    'Box Office Mojo': bom_df,
    'Rotten Tomatoes': rt_df
}

# Display basic info for each dataset
for name, df in datasets.items():
    print(f"\n📊 {name}:")
    print(f"   Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
    print(f"   Columns: {list(df.columns)}")
    print(f"   Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")

print("\n✅ All datasets loaded successfully!")
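
The loading cell reports shape, columns, and memory usage, but not missing values or duplicate rows. One way to add those checks is sketched below. The helper `summarize_quality`, the toy DataFrames, and their column names are assumptions made for illustration; they stand in for the real CSV files, whose contents are not shown here.

```python
import pandas as pd


def summarize_quality(datasets, expected_rows=None):
    """Build a one-row-per-dataset table of basic quality metrics.

    `datasets` maps a display name to a DataFrame. `expected_rows`
    optionally maps the same names to the row count you expect.
    """
    expected_rows = expected_rows or {}
    records = []
    for name, df in datasets.items():
        records.append({
            "dataset": name,
            "rows": len(df),
            "duplicate_rows": int(df.duplicated().sum()),
            "missing_cells": int(df.isna().sum().sum()),
            # None means no expected count was supplied for this dataset
            "row_count_ok": (
                len(df) == expected_rows[name] if name in expected_rows else None
            ),
        })
    return pd.DataFrame(records).set_index("dataset")


# Toy stand-ins for the real DataFrames (hypothetical columns and values)
demo = {
    "Movie Budgets": pd.DataFrame({
        "movie": ["A", "B", "B"],
        "release_date": pd.to_datetime(["2010-01-01", "2011-05-06", "2011-05-06"]),
        "production_budget": [1_000_000, 2_500_000, 2_500_000],
    }),
    "Rotten Tomatoes": pd.DataFrame({
        "rating": ["PG", None],
        "runtime": [95.0, None],
    }),
}

report = summarize_quality(demo, expected_rows={"Movie Budgets": 3})
print(report)
```

In the notebook itself, the same function could be called on the `datasets` dictionary, passing the record counts listed in the overview (for example `{'TMDb Movies': 26517, 'Movie Budgets': 5782}`) as `expected_rows`.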