# Final Analysis: IMDb and Financial Data Integration

## Business Understanding

### Objective
This project analyzes the intersection of audience preferences (IMDb ratings) and financial performance (budgets and revenue) to derive actionable insights for movie studios. By merging IMDb data with TMDb and budget datasets, we aim to:
- Identify the most profitable genres and their audience ratings.
- Correlate financial performance with popularity and ratings.
- Recommend genre and studio strategies that maximize return on investment (ROI).

### Data Sources
1. **IMDb Database**:
   - Contains movie ratings, genres, and key details.
2. **TMDb Dataset**:
   - Includes popularity metrics and genre encodings.
3. **Budget Dataset**:
   - Provides production budgets plus domestic and worldwide gross revenue.



## CRISP-DM Framework

1. **Business Understanding**: Define the objectives and questions.
2. **Data Understanding**: Explore the datasets to understand their structure and content.
3. **Data Preparation**: Clean, transform, and merge data for analysis.
4. **Modeling/Analysis**: Uncover trends and correlations through visualizations and metrics.
5. **Evaluation**: Summarize key findings and actionable insights.
6. **Deployment**: Present the final results in a structured manner.

We begin with data understanding and preparation.


## Data Understanding

We work with the following datasets:
- **IMDb Database**: Contains movie ratings and genres.
- **TMDb Data**: Offers genre encodings, popularity scores, and audience ratings.
- **Budget Data**: Includes production budgets plus domestic and worldwide gross revenue.

The datasets are merged on movie title and release year: TMDb is joined to the budget data first, and the result is then joined to the IMDb data on the same keys.

## Data Preparation

### Steps:
1. Clean and format financial and popularity data (TMDb and budget datasets).
2. Extract relevant information from the IMDb database (`movie_basics` and `movie_ratings`).
3. Merge all datasets on movie titles and release years for a unified analysis base.

### Code Implementation:


In [1]:
# Import all libraries to be used
import pandas as pd
import numpy as np
import seaborn as sns
sns.set_style('whitegrid')
import scipy as sp
import scipy.stats as st
import sqlite3
from zipfile import ZipFile
import os
import statsmodels.api as sm
from matplotlib.colors import ListedColormap
from statsmodels.stats.power import TTestIndPower, TTestPower
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.express as px
# Ignore all warnings
import warnings
warnings.filterwarnings('ignore')

In [5]:
# Load financial datasets
tmdb_movies_path = "C:\\Projects\\group3-phase2-project\\data\\tmdb.movies.csv"
movie_budgets_path = "C:\\Projects\\group3-phase2-project\\data\\tn.movie_budgets.csv"

df_tmdb_movies = pd.read_csv(tmdb_movies_path, encoding='latin1')
df_movie_budgets = pd.read_csv(movie_budgets_path, encoding='latin1')

In [6]:
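# Clean financial data: strip '$' and thousands separators, then convert to float
def clean_currency(column):
    # Raw string avoids an invalid-escape warning for the regex pattern '\$'
    return column.replace({r'\$': '', ',': ''}, regex=True).astype(float)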

df_movie_budgets['production_budget'] = clean_currency(df_movie_budgets['production_budget'])
df_movie_budgets['domestic_gross'] = clean_currency(df_movie_budgets['domestic_gross'])
df_movie_budgets['worldwide_gross'] = clean_currency(df_movie_budgets['worldwide_gross'])
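
As a quick sanity check on the currency cleaning above, the sketch below runs the same replacement on a few made-up strings. The sample values are hypothetical and only assume that the raw columns look like `$1,234,567`.

```python
import pandas as pd

def clean_currency(column):
    # Strip dollar signs and thousands separators, then convert to float
    return column.replace({r'\$': '', ',': ''}, regex=True).astype(float)

# Hypothetical sample values (not taken from the budget dataset)
sample = pd.Series(['$425,000,000', '$0', '$1,045,663,875'])
cleaned = clean_currency(sample)
print(cleaned.tolist())
```

If a column contained anything other than digits, `$`, and `,`, the `astype(float)` call would raise, which is a useful early warning for unexpected formats.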


In [7]:
# Standardize release_date
df_movie_budgets['release_date'] = pd.to_datetime(df_movie_budgets['release_date'], errors='coerce')
df_tmdb_movies['release_date'] = pd.to_datetime(df_tmdb_movies['release_date'], errors='coerce')

In [8]:
# Extract release year
df_movie_budgets['release_year'] = df_movie_budgets['release_date'].dt.year
df_tmdb_movies['release_year'] = df_tmdb_movies['release_date'].dt.year
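
One side effect of `errors='coerce'` is worth seeing on a small example: unparseable dates become `NaT`, so the extracted year is missing for those rows. The sketch below uses hypothetical date strings and an explicit `format`, which is an assumption about how the raw values are written; the optional cast to pandas' nullable `Int64` dtype is a suggestion, not something the cells above do.

```python
import pandas as pd

# Hypothetical raw values: two parseable dates and one malformed entry
raw = pd.Series(['Dec 18, 2009', 'May 20, 2011', 'not a date'])

# Malformed entries become NaT instead of raising an error
dates = pd.to_datetime(raw, format='%b %d, %Y', errors='coerce')

# The year is missing wherever the date could not be parsed
years = dates.dt.year

# Optional: a nullable integer dtype keeps years as whole numbers
years_int = years.astype('Int64')
print(years_int)
```

Counting `years.isna().sum()` after this step shows how many rows would silently drop out of a merge keyed on release year.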

In [9]:
# Merge TMDb and budget datasets
merged_df = pd.merge(
    df_tmdb_movies,
    df_movie_budgets,
    how='inner',
    left_on=['title', 'release_year'],
    right_on=['movie', 'release_year']
)
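
Because an inner join on title and year quietly discards rows that do not match, it can help to measure how many rows each side loses. The sketch below is one way to do that with `indicator=True` on an outer merge; the two small frames and their values are hypothetical stand-ins for `df_tmdb_movies` and `df_movie_budgets`.

```python
import pandas as pd

# Hypothetical stand-ins for the TMDb and budget tables
tmdb = pd.DataFrame({
    'title': ['Film A', 'Film B', 'Film C'],
    'release_year': [2009, 2010, 2015],
    'popularity': [26.5, 27.9, 0.6],
})
budgets = pd.DataFrame({
    'movie': ['Film A', 'Film B', 'Film D'],
    'release_year': [2009, 2010, 2012],
    'production_budget': [425e6, 160e6, 10e6],
})

# An outer merge keeps unmatched rows; '_merge' records where each row came from
check = pd.merge(
    tmdb,
    budgets,
    how='outer',
    left_on=['title', 'release_year'],
    right_on=['movie', 'release_year'],
    indicator=True,
)
counts = check['_merge'].value_counts()
print(counts)
```

Rows marked `both` are the ones the inner merge would keep; `left_only` and `right_only` rows are candidates for title normalization or manual review.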


In [None]:
# Load and filter IMDb data
imdb_db_path = '/mnt/data/extracted_imdb_db/im.db'
conn = sqlite3.connect(imdb_db_path)

movie_basics_query = "SELECT * FROM movie_basics;"
movie_ratings_query = "SELECT * FROM movie_ratings;"

# Keep only the columns needed for the merge and genre analysis
movie_basics = pd.read_sql_query(movie_basics_query, conn)[['movie_id', 'primary_title', 'start_year', 'genres']]
movie_ratings = pd.read_sql_query(movie_ratings_query, conn)

# Close the connection once both tables are loaded
conn.close()

# Rename and merge IMDb data
movie_basics = movie_basics.rename(columns={'primary_title': 'title', 'start_year': 'release_year', 'genres': 'imdb_genres'})
imdb_merged = pd.merge(movie_basics, movie_ratings, on='movie_id', how='inner')

# Join the financial data to IMDb on the shared title and release_year columns
final_merged_df = pd.merge(
    merged_df,
    imdb_merged,
    how='inner',
    on=['title', 'release_year']
)

# Add metrics: profit in dollars, and ROI as profit per dollar of budget
final_merged_df['profitability'] = final_merged_df['worldwide_gross'] - final_merged_df['production_budget']
final_merged_df['ROI'] = final_merged_df['profitability'] / final_merged_df['production_budget']

final_merged_df.head()
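
To make the two metrics concrete, the sketch below computes them on three made-up films. The numbers are hypothetical, and replacing a zero budget with `NaN` is an added safeguard (an assumption, not part of the cell above) so that the division does not produce `inf`.

```python
import numpy as np
import pandas as pd

# Hypothetical figures, in millions of dollars
films = pd.DataFrame({
    'worldwide_gross': [300.0, 50.0, 20.0],
    'production_budget': [100.0, 100.0, 0.0],
})

# Profit in absolute terms
films['profitability'] = films['worldwide_gross'] - films['production_budget']

# Treat a zero budget as missing so the ratio is NaN rather than inf
safe_budget = films['production_budget'].replace(0, np.nan)
films['ROI'] = films['profitability'] / safe_budget
print(films)
```

An ROI of 2.0 means the film earned back its budget plus twice that amount in profit, while a negative ROI means it lost money.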