### **Data Cleaning: TMDB Movies**
The TMDB dataset supplies a popularity score and genre IDs for each movie. We prepare it for analysis with three preprocessing steps:

1.  **Removing Redundancy**: Drop the `Unnamed: 0` column, which serves as a duplicate index, to streamline the dataframe.
2.  **Temporal Feature Engineering**: Convert `release_date` to a standard datetime format and extract the `release_year`, so that movie trends can be grouped and compared by year.
3.  **Genre Data Parsing**: The `genre_ids` are currently stored as strings (e.g., `"[12, 14]"`). We use `ast.literal_eval` to convert these into actual Python lists, enabling us to later map these IDs to their specific genre names (like Action or Comedy).
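
The genre parsing step can be sketched in isolation. This is a minimal illustration, not the notebook's exact logic: `parse_genre_ids` is a hypothetical helper name, and returning an empty list for missing or malformed values is an assumption (the cell below leaves non-string values unchanged instead).

```python
import ast


def parse_genre_ids(raw):
    """Turn a string such as "[12, 14]" into a Python list.

    Assumption for this sketch: anything that cannot be parsed
    (missing values, malformed text) becomes an empty list.
    """
    if isinstance(raw, list):
        return raw  # already parsed
    if not isinstance(raw, str):
        return []  # e.g. NaN from a missing cell
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []  # malformed literal
    return list(parsed) if isinstance(parsed, (list, tuple)) else []


parsed_ok = parse_genre_ids("[12, 14]")
parsed_bad = parse_genre_ids("[12, 14")
parsed_missing = parse_genre_ids(float("nan"))
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for text read from a file.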

In [9]:
# ==========================================
# DATA CLEANING: TMDB Movies
# ==========================================

import ast
import pandas as pd
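# Load the dataset from its path relative to the notebook
tmdb_movies = pd.read_csv('../data/zippedData/tmdb.movies.csv.gz')

# Preview the first few rows to confirm the load succeeded.
# print() is needed here: a bare expression in the middle of a cell is not displayed.
print(tmdb_movies.head())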

# 1. Drop redundant index column
if 'Unnamed: 0' in tmdb_movies.columns:
    tmdb_movies.drop(columns=['Unnamed: 0'], inplace=True)

# 2. Feature Engineering: Convert release_date to Datetime and extract Year
tmdb_movies['release_date'] = pd.to_datetime(tmdb_movies['release_date'])
tmdb_movies['release_year'] = tmdb_movies['release_date'].dt.year

# 3. Clean Genre IDs 
# They are strings like "[12, 14]". We turn them into actual Python lists.
tmdb_movies['genre_ids'] = tmdb_movies['genre_ids'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)
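
Once `genre_ids` holds real lists, mapping the IDs to names could look like the sketch below. The lookup table is deliberately partial and for illustration only; `GENRE_NAMES`, `ids_to_names`, the toy rows, and the `"Unknown"` fallback are all assumptions, and the full ID list should be taken from TMDB's own genre reference.

```python
import pandas as pd

# Partial ID-to-name lookup (illustrative; not the complete TMDB list)
GENRE_NAMES = {12: "Adventure", 14: "Fantasy", 28: "Action", 35: "Comedy"}

# Toy rows standing in for the cleaned dataframe
toy = pd.DataFrame({
    "title": ["Film A", "Film B"],
    "genre_ids": [[12, 14], [35, 999]],  # 999 is a made-up, unmapped ID
})


def ids_to_names(ids):
    """Translate a list of genre IDs, labelling unmapped IDs as 'Unknown'."""
    return [GENRE_NAMES.get(genre_id, "Unknown") for genre_id in ids]


toy["genre_names"] = toy["genre_ids"].apply(ids_to_names)
```

Keeping an explicit fallback makes unmapped IDs visible in later counts rather than silently dropping them.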


### **Data Export**
After performing feature engineering and cleaning, we save the processed TMDB dataset to a CSV file. This ensures that the cleaned data is ready for the exploratory data analysis (EDA) phase without needing to rerun the cleaning script.

In [10]:
tmdb_movies_cleaned = tmdb_movies.copy()
tmdb_movies_cleaned.to_csv('../data/cleanedData/tmdb_cleaned_data.csv', index=False)
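
One caveat with a CSV export is that the format stores everything as text, so parsed lists and datetimes come back as strings unless they are converted again on load. The sketch below shows the round trip on a toy frame written to an in-memory buffer; the column values are invented, and using `parse_dates` plus `converters` is one possible way to restore the types, not necessarily what the EDA notebook does.

```python
import ast
import io

import pandas as pd

# Toy frame with the same column types the cleaning cell produces
cleaned = pd.DataFrame({
    "title": ["Film A", "Film B"],
    "release_date": pd.to_datetime(["2010-07-16", "2014-11-07"]),
    "genre_ids": [[12, 14], [35]],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)

# Naive reload: lists and dates come back as plain text
buffer.seek(0)
naive = pd.read_csv(buffer)

# Reload that restores both types
buffer.seek(0)
restored = pd.read_csv(
    buffer,
    parse_dates=["release_date"],
    converters={"genre_ids": ast.literal_eval},
)
```

If re-parsing on every load becomes a burden, a typed format such as pickle or Parquet is an alternative worth considering.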

## **Data Integration: Bridging Financials and Metadata**
To answer our strategic business questions, we must integrate two primary data sources:
* **The Numbers (TNDB):** Provides the "Financial Backbone" (Budgets, ROI, and Worldwide Gross).
* **TheMovieDB (TMDB):** Provides "Market Context" (Genre Classifications and Popularity Scores).

By merging these datasets on movie titles, we create a unified database that allows us to see not just *how much* a movie made, but *what kind* of movie it was and how the audience engaged with it.
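
Title-based merging only works when both sides spell a title the same way, so it helps to normalize the titles and then check how many rows actually matched. The sketch below uses invented rows; the column names `movie`, `title`, `worldwide_gross`, and `popularity` are assumptions about the cleaned files.

```python
import pandas as pd

# Toy financial and metadata tables (values are invented)
financials = pd.DataFrame({
    "movie": ["Film A ", "FILM B", "Film C"],
    "worldwide_gross": [300, 120, 45],
})
metadata = pd.DataFrame({
    "title": ["film a", "Film B"],
    "popularity": [26.5, 9.1],
})

# Normalize case and surrounding whitespace before joining
financials["title_std"] = financials["movie"].str.lower().str.strip()
metadata["title_std"] = metadata["title"].str.lower().str.strip()

# indicator=True adds a _merge column recording where each row matched
merged = financials.merge(
    metadata[["title_std", "popularity"]],
    on="title_std",
    how="left",
    indicator=True,
)
matched = int((merged["_merge"] == "both").sum())
unmatched = int((merged["_merge"] == "left_only").sum())
```

Inspecting the `left_only` rows is a quick way to spot titles that differ by punctuation, subtitles, or release-year suffixes.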

# **Section: Content Strategy & Global Market Analysis**

### **Objective**
The goal of this analysis is to identify which movie types offer the highest financial returns and global reach. We will integrate data from all five sources to build a "Master Dataset" that links financial performance with genre and audience interest.

---
## **1. Master Data Integration**
The cell below loads all four cleaned files and merges three of them:
* **TNDB**: Financials (Gross, Budget, ROI).
* **IMDB & TMDB**: Metadata (Genres, Popularity, Ratings).

BOM is loaded but is not joined in this cell.
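
Because IMDB stores several genres in one comma-separated field, a per-genre comparison needs one row per (movie, genre) pair. The following toy sketch shows that reshaping followed by a median return per genre; the `roi` column name and all values are assumptions for illustration.

```python
import pandas as pd

# Toy merged table (invented values)
toy = pd.DataFrame({
    "movie": ["Film A", "Film B", "Film C"],
    "genres": ["Action,Adventure", "Comedy", "Action, Comedy"],
    "roi": [2.0, 0.5, 1.0],
})

# Split the genre string into a list, then emit one row per genre
exploded = toy.assign(genres=toy["genres"].str.split(",")).explode("genres")
exploded["genres"] = exploded["genres"].str.strip()  # remove stray spaces

# Median ROI per genre, highest first
median_roi = (
    exploded.groupby("genres")["roi"].median().sort_values(ascending=False)
)
```

A movie with several genres counts once in each of them, so per-genre totals should not be summed back into an overall figure.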

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load all cleaned datasets.
# Assumption: every cleaned file sits in ../data/cleanedData/, the folder the
# TMDB export above writes to. Adjust CLEANED_DIR if the files live elsewhere.
CLEANED_DIR = '../data/cleanedData/'
tndb = pd.read_csv(CLEANED_DIR + 'tndb_cleaned_data.csv')
tmdb = pd.read_csv(CLEANED_DIR + 'tmdb_cleaned_data.csv')
imdb = pd.read_csv(CLEANED_DIR + 'imdb_cleaned_data.csv')
bom = pd.read_csv(CLEANED_DIR + 'bom_cleaned_data.csv')

# Standardize titles for a clean merge
tndb['title_std'] = tndb['movie'].str.lower().str.strip()
imdb['title_std'] = imdb['primary_title'].str.lower().str.strip()
tmdb['title_std'] = tmdb['title'].str.lower().str.strip()

# Drop exact (title, year) duplicates in the metadata.
# Note: the merges below join on title_std alone, so a title released in
# several years can still match more than once.
imdb_clean = imdb.drop_duplicates(subset=['title_std', 'start_year'])
tmdb_clean = tmdb.drop_duplicates(subset=['title_std', 'release_year'])

# Master Merge: Financials + Genre/Quality metadata
df_master = pd.merge(tndb, imdb_clean[['title_std', 'genres', 'averagerating', 'numvotes']], on='title_std', how='inner')
df_master = pd.merge(df_master, tmdb_clean[['title_std', 'popularity']], on='title_std', how='left')

# Explode genres: one row per (movie, genre) pair for individual analysis
df_exploded = df_master.copy()
df_exploded['genres'] = df_exploded['genres'].str.split(',')
df_exploded = df_exploded.explode('genres')
df_exploded['genres'] = df_exploded['genres'].str.strip().str.capitalize()

# Rows are not guaranteed to be unique movies (see the note on duplicate titles)
print(f"Master Dataset created with {df_master.shape[0]} rows.")
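
A point worth checking in the merge above is that the metadata is de-duplicated on title and year, while the join key is the title alone. The toy sketch below shows how a title shared by two releases still multiplies rows, and how the `validate` argument of `merge` can surface the problem. The data is invented, and keeping the most recent release is only one possible tie-break, not the notebook's method.

```python
import pandas as pd

# One financial row, two metadata rows sharing a title across years
left = pd.DataFrame({"title_std": ["the remake"], "budget": [100]})
right = pd.DataFrame({
    "title_std": ["the remake", "the remake"],
    "start_year": [1990, 2015],
    "genres": ["Drama", "Action"],
})

# De-duplicating on (title, year) keeps both rows
right_dedup = right.drop_duplicates(subset=["title_std", "start_year"])
blown_up = left.merge(right_dedup, on="title_std", how="inner")

# validate="many_to_one" raises if the right-hand keys are not unique
try:
    left.merge(right_dedup, on="title_std", how="inner", validate="many_to_one")
    duplicate_keys_detected = False
except pd.errors.MergeError:
    duplicate_keys_detected = True

# One possible fix (an assumption): keep only the latest release per title
right_unique = right.sort_values("start_year").drop_duplicates(
    subset="title_std", keep="last"
)
safe = left.merge(right_unique, on="title_std", how="inner", validate="many_to_one")
```

Where the financial table also carries a release year, joining on both title and year would be a more precise alternative to picking one release.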