## 1. Business Understanding

### Project Background
Film production is a high-risk, capital-intensive industry. For a new movie studio, selecting projects with strong financial performance is essential to minimizing risk and ensuring sustainable growth.

### Real-World Problem
The studio needs to decide which films to produce by identifying those with:
- High global gross profit and efficient budget utilization (high ROI).
- Favorable trends in market performance over time.
- Successful genres and proven talent (directors and actors).

### Stakeholders
- **Studio Management & Investors:** Require data-driven recommendations to allocate resources effectively.
- **Production Teams:** Benefit by focusing on film attributes that historically correlate with success.

This analysis gives both groups an evidence-based basis for deciding which films to produce.


## 2. Data Understanding

### Data Sources
1. **Movie Budgets Data:**  
   - Contains production budgets, domestic and worldwide gross revenue, and release dates for films; the analysis keeps releases from 2000 onward.
   - Offers insights into financial performance.
   
2. **IMDb Data:**  
   - Provides film metadata including ratings, runtime, genres, and details about key personnel (actors, directors, etc.).
   - Comes from a SQLite database (and a supplemental CSV) that enriches our analysis.
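As a minimal sketch of how metadata can be pulled from a SQLite database into pandas, the block below builds a small in-memory database and queries it. The table names (`movie_basics`, `movie_ratings`), column names, and rows are hypothetical stand-ins, not the confirmed schema of the IMDb file used here.

```python
import sqlite3
import pandas as pd

# In-memory stand-in for the IMDb SQLite file.
# Table names, column names, and rows are hypothetical (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT, start_year INTEGER)"
)
conn.execute("CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL)")
conn.executemany(
    "INSERT INTO movie_basics VALUES (?, ?, ?)",
    [("m1", "Example Film", 2010), ("m2", "Older Film", 1995)],
)
conn.executemany("INSERT INTO movie_ratings VALUES (?, ?)", [("m1", 7.5)])

# Join titles to ratings and keep only films from 2000 onward.
query = """
SELECT b.movie_id, b.primary_title, b.start_year, r.averagerating
FROM movie_basics AS b
LEFT JOIN movie_ratings AS r ON r.movie_id = b.movie_id
WHERE b.start_year >= ?
"""
imdb_df = pd.read_sql_query(query, conn, params=(2000,))
conn.close()
```

Filtering by year inside the SQL query keeps the DataFrame small before any merge with the budget data; with a real database file, `sqlite3.connect` would take the file path instead of `":memory:"`.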

### Data Properties and Relevance
- **Temporal Coverage:** Films released from 2000 onward.
- **Key Features:** Budget, revenues, profit, ROI, runtime, genres, cast/crew details.
- **Utility:** The financial data combined with IMDb metadata allows us to identify trends and evaluate the efficiency of budget usage across films.
- **Limitations:** The raw data contains text encoding issues and missing values; both are handled during data preparation.

Together, these sources are well suited to the studio's problem of identifying films with high potential.

## 3. Data Preparation

### Overview
In this section, we load, clean, and merge the raw datasets to create a master dataset for analysis. Steps include:
- Parsing dates and filtering films released from 2000 onward.
- Converting currency strings to floats.
- Fixing text encoding issues.
- Calculating Global Gross Profit and ROI.
- Merging budget data with enriched IMDb data via a composite key ("Title & Year").
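The steps listed above can be sketched on a toy example. This is an illustration under stated assumptions, not the notebook's actual implementation: profit is assumed to be worldwide gross minus production budget, ROI is assumed to be profit divided by budget, and the helper names (`money_to_float`, `make_key`), the IMDb-side column names, and the `genres` value are hypothetical.

```python
import pandas as pd

# Two budget rows in the same shape as the budgets table.
budgets = pd.DataFrame({
    "movie": ["Avatar", "Pirates of the Caribbean: On Stranger Tides"],
    "release_date": pd.to_datetime(["2009-12-18", "2011-05-20"]),
    "production_budget": ["$425,000,000", "$410,600,000"],
    "worldwide_gross": ["$2,776,345,279", "$1,045,663,875"],
})

# Hypothetical IMDb-side row; column names and the genres value are placeholders.
imdb = pd.DataFrame({
    "primary_title": ["Avatar"],
    "start_year": [2009],
    "genres": ["GenreA,GenreB"],
})

def money_to_float(series):
    """Strip '$' and ',' from currency strings and convert to float."""
    return series.str.replace(r"[$,]", "", regex=True).astype(float)

for col in ["production_budget", "worldwide_gross"]:
    budgets[col] = money_to_float(budgets[col])

budgets["release_year"] = budgets["release_date"].dt.year

# Assumed definitions: profit = gross - budget, ROI = profit / budget.
budgets["global_profit"] = budgets["worldwide_gross"] - budgets["production_budget"]
budgets["roi"] = budgets["global_profit"] / budgets["production_budget"]

def make_key(title, year):
    """Build a normalized 'title_year' composite key."""
    return title.str.strip().str.lower() + "_" + year.astype(str)

budgets["title_year"] = make_key(budgets["movie"], budgets["release_year"])
imdb["title_year"] = make_key(imdb["primary_title"], imdb["start_year"])

# Left merge keeps every budget row, even when no IMDb match exists.
master = budgets.merge(imdb[["title_year", "genres"]], on="title_year", how="left")
```

Lowercasing and trimming titles before building the key reduces missed matches caused by capitalization or stray whitespace; a left merge leaves unmatched films with missing IMDb fields rather than dropping them.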

In [8]:
import pandas as pd
import sqlite3
import os
import zipfile
import warnings
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.ticker import StrMethodFormatter

# ------------------------------------------------------------------------------
# Basic configuration and warning suppression
# ------------------------------------------------------------------------------
pd.set_option('display.max_columns', None)
pd.options.display.float_format = '{:.0f}'.format
warnings.simplefilter('ignore')


In [None]:
# ------------------------------------------------------------------------------
# Load and Clean Budget Data
# ------------------------------------------------------------------------------
# Load the gzip-compressed movie budgets CSV file and parse release dates.
budgets_df = pd.read_csv('./zippedData/tn.movie_budgets.csv.gz',
                          parse_dates=['release_date'], encoding='utf-8')
budgets_df.head(2)

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,2009-12-18,Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,2011-05-20,Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"


In [10]:
# Extract the release year from the release_date column.
budgets_df['release_year'] = budgets_df['release_date'].dt.year

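# Keep only movies released in 2000 or later; copy so that later column
# assignments modify an independent DataFrame rather than a view.
clean_budgets = budgets_df[budgets_df['release_year'] >= 2000].copy()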
clean_budgets.head(2)

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross,release_year
0,1,2009-12-18,Avatar,"$425,000,000","$760,507,625","$2,776,345,279",2009
1,2,2011-05-20,Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875",2011
