# Comprehensive Movie Data Analysis

This notebook integrates all available movie datasets, including the IMDB database, to answer the business question:

**"What kinds of movies should a new studio produce for financial success?"**

We focus on:
- Genre profitability
- Budget-revenue relationships
- Impact of review scores


## 1. Load All Datasets Including IMDB Database

We load all CSV/TSV files and connect to the IMDB SQLite database. Relevant IMDB tables (`movie_basics`, `movie_ratings`) are read into pandas DataFrames. The head of each DataFrame is previewed.

In [2]:
import pandas as pd
import sqlite3

# Load CSV/TSV files
df_bom = pd.read_csv('Data/bom.movie_gross.csv')
df_rt_info = pd.read_csv('Data/rt.movie_info.tsv', sep='\t')
df_rt_reviews = pd.read_csv('Data/rt.reviews.tsv', sep='\t', encoding='latin-1', low_memory=False)
df_tmdb = pd.read_csv('Data/tmdb.movies.csv')
df_tn = pd.read_csv('Data/tn.movie_budgets.csv')

# Connect to the IMDB SQLite database, load the two tables, then release the connection
conn = sqlite3.connect('Data/im.db')
df_imdb_basics = pd.read_sql_query("SELECT * FROM movie_basics", conn)
df_imdb_ratings = pd.read_sql_query("SELECT * FROM movie_ratings", conn)
conn.close()

# Preview heads
display(df_bom.head())
display(df_rt_info.head())
display(df_rt_reviews.head())
display(df_tmdb.head())
display(df_tn.head())
display(df_imdb_basics.head())
display(df_imdb_ratings.head())

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010


Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
0,1,"This gritty, fast-paced, and innovative police...",R,Action and Adventure|Classics|Drama,William Friedkin,Ernest Tidyman,"Oct 9, 1971","Sep 25, 2001",,,104 minutes,
1,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000.0,108 minutes,Entertainment One
2,5,Illeana Douglas delivers a superb performance ...,R,Drama|Musical and Performing Arts,Allison Anders,Allison Anders,"Sep 13, 1996","Apr 18, 2000",,,116 minutes,
3,6,Michael Douglas runs afoul of a treacherous su...,R,Drama|Mystery and Suspense,Barry Levinson,Paul Attanasio|Michael Crichton,"Dec 9, 1994","Aug 27, 1997",,,128 minutes,
4,7,,NR,Drama|Romance,Rodney Bennett,Giles Cooper,,,,,200 minutes,


Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
0,3,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0,Patrick Nabarro,"November 10, 2018"
1,3,It's an allegory in search of a meaning that n...,,rotten,Annalee Newitz,0,io9.com,"May 23, 2018"
2,3,... life lived in a bubble in financial dealin...,,fresh,Sean Axmaker,0,Stream on Demand,"January 4, 2018"
3,3,Continuing along a line introduced in last yea...,,fresh,Daniel Kasman,0,MUBI,"November 16, 2017"
4,3,... a perverse twist on neorealism...,,fresh,,0,Cinema Scope,"October 12, 2017"


Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186


Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"


Unnamed: 0,movie_id,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"


Unnamed: 0,movie_id,averagerating,numvotes
0,tt10356526,8.3,31
1,tt10384606,8.9,559
2,tt1042974,6.4,20
3,tt1043726,4.2,50352
4,tt1060240,6.5,21
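The two IMDB tables share `movie_id`, so they can also be joined inside SQLite before anything reaches pandas. The sketch below is a minimal illustration under stated assumptions: it builds a small in-memory database with hypothetical rows (the `tt_demo_*` IDs, titles, and rating values are invented for the example) instead of opening `Data/im.db`, and it keeps only a few of the real column names.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# Hypothetical in-memory stand-in for Data/im.db; rows are invented for illustration.
with closing(sqlite3.connect(":memory:")) as demo_conn:
    demo_conn.executescript(
        """
        CREATE TABLE movie_basics (movie_id TEXT PRIMARY KEY, primary_title TEXT, genres TEXT);
        CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL, numvotes INTEGER);
        INSERT INTO movie_basics VALUES
            ('tt_demo_1', 'Demo Film A', 'Action,Drama'),
            ('tt_demo_2', 'Demo Film B', 'Comedy');
        INSERT INTO movie_ratings VALUES ('tt_demo_1', 7.0, 120);
        """
    )
    # LEFT JOIN keeps titles without a rating row; their rating columns become NaN.
    demo_imdb = pd.read_sql_query(
        """
        SELECT b.movie_id, b.primary_title, b.genres, r.averagerating, r.numvotes
        FROM movie_basics AS b
        LEFT JOIN movie_ratings AS r ON b.movie_id = r.movie_id
        ORDER BY b.movie_id
        """,
        demo_conn,
    )

print(demo_imdb)
```

`contextlib.closing` closes the connection when the block exits, even if a query raises.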


## 2. Initial Data Exploration

Explore the structure, missing values, and key columns of each DataFrame (including IMDB). Identify variables relevant to profitability, genre, ratings, and merging.

In [3]:
# Explore structure and missing values
for name, df in [
    ("BOM", df_bom),
    ("RottenTomatoes Info", df_rt_info),
    ("RottenTomatoes Reviews", df_rt_reviews),
    ("TMDB", df_tmdb),
    ("TheNumbers", df_tn),
    ("IMDB Basics", df_imdb_basics),
    ("IMDB Ratings", df_imdb_ratings)
]:
    print(f"\n{name} columns: {df.columns.tolist()}")
    print(df.info())
    print(df.isnull().sum())
    display(df.describe(include='all'))


BOM columns: ['title', 'studio', 'domestic_gross', 'foreign_gross', 'year']
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB
None
title                0
studio               5
domestic_gross      28
foreign_gross     1350
year                 0
dtype: int64


Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
count,3387,3382,3359.0,2037.0,3387.0
unique,3386,257,,1204.0,
top,Bluebeard,IFC,,1200000.0,
freq,2,166,,23.0,
mean,,,28745850.0,,2013.958075
std,,,66982500.0,,2.478141
min,,,100.0,,2010.0
25%,,,120000.0,,2012.0
50%,,,1400000.0,,2014.0
75%,,,27900000.0,,2016.0



RottenTomatoes Info columns: ['id', 'synopsis', 'rating', 'genre', 'director', 'writer', 'theater_date', 'dvd_date', 'currency', 'box_office', 'runtime', 'studio']
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1560 entries, 0 to 1559
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            1560 non-null   int64 
 1   synopsis      1498 non-null   object
 2   rating        1557 non-null   object
 3   genre         1552 non-null   object
 4   director      1361 non-null   object
 5   writer        1111 non-null   object
 6   theater_date  1201 non-null   object
 7   dvd_date      1201 non-null   object
 8   currency      340 non-null    object
 9   box_office    340 non-null    object
 10  runtime       1530 non-null   object
 11  studio        494 non-null    object
dtypes: int64(1), object(11)
memory usage: 146.4+ KB
None
id                 0
synopsis          62
rating             3
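genre              8
director         199
writer           449
theater_date     359
dvd_date         359
currency        1220
box_office      1220
runtime           30
studio          1066
dtype: int64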

Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
count,1560.0,1498,1557,1552,1361,1111,1201,1201,340,340.0,1530,494
unique,,1497,6,299,1125,1069,1025,717,1,336.0,142,200
top,,A group of air crash survivors are stranded in...,R,Drama,Steven Spielberg,Woody Allen,"Jan 1, 1987","Jun 1, 2004",$,600000.0,90 minutes,Universal Pictures
freq,,2,521,151,10,4,8,11,340,2.0,72,35
mean,1007.303846,,,,,,,,,,,
std,579.164527,,,,,,,,,,,
min,1.0,,,,,,,,,,,
25%,504.75,,,,,,,,,,,
50%,1007.5,,,,,,,,,,,
75%,1503.25,,,,,,,,,,,



RottenTomatoes Reviews columns: ['id', 'review', 'rating', 'fresh', 'critic', 'top_critic', 'publisher', 'date']
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54432 entries, 0 to 54431
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          54432 non-null  int64 
 1   review      48869 non-null  object
 2   rating      40915 non-null  object
 3   fresh       54432 non-null  object
 4   critic      51710 non-null  object
 5   top_critic  54432 non-null  int64 
 6   publisher   54123 non-null  object
 7   date        54432 non-null  object
dtypes: int64(2), object(6)
memory usage: 3.3+ MB
None
id                0
review         5563
rating        13517
fresh             0
critic         2722
top_critic        0
publisher       309
date              0
dtype: int64


Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
count,54432.0,48869,40915,54432,51710,54432.0,54123,54432
unique,,48682,186,2,3496,,1281,5963
top,,Parental Content Review,3/5,fresh,Emanuel Levy,,eFilmCritic.com,"January 1, 2000"
freq,,24,4327,33035,595,,673,4303
mean,1045.706882,,,,,0.240594,,
std,586.657046,,,,,0.427448,,
min,3.0,,,,,0.0,,
25%,542.0,,,,,0.0,,
50%,1083.0,,,,,0.0,,
75%,1541.0,,,,,0.0,,



TMDB columns: ['Unnamed: 0', 'genre_ids', 'id', 'original_language', 'original_title', 'popularity', 'release_date', 'title', 'vote_average', 'vote_count']
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26517 entries, 0 to 26516
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         26517 non-null  int64  
 1   genre_ids          26517 non-null  object 
 2   id                 26517 non-null  int64  
 3   original_language  26517 non-null  object 
 4   original_title     26517 non-null  object 
 5   popularity         26517 non-null  float64
 6   release_date       26517 non-null  object 
 7   title              26517 non-null  object 
 8   vote_average       26517 non-null  float64
 9   vote_count         26517 non-null  int64  
dtypes: float64(2), int64(3), object(5)
memory usage: 2.0+ MB
None
Unnamed: 0           0
genre_ids            0
id                   0
original_language    0
original_title       0
popularity           0
release_date         0
title                0
vote_average         0
vote_count           0
dtype: int64

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
count,26517.0,26517,26517.0,26517,26517,26517.0,26517,26517,26517.0,26517.0
unique,,2477,,76,24835,,3433,24688,,
top,,[99],,en,Eden,,2010-01-01,Eden,,
freq,,3700,,23291,7,,269,7,,
mean,13258.0,,295050.15326,,,3.130912,,,5.991281,194.224837
std,7654.94288,,153661.615648,,,4.355229,,,1.852946,960.961095
min,0.0,,27.0,,,0.6,,,0.0,1.0
25%,6629.0,,157851.0,,,0.6,,,5.0,2.0
50%,13258.0,,309581.0,,,1.374,,,6.0,5.0
75%,19887.0,,419542.0,,,3.694,,,7.0,28.0



TheNumbers columns: ['id', 'release_date', 'movie', 'production_budget', 'domestic_gross', 'worldwide_gross']
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   id                 5782 non-null   int64 
 1   release_date       5782 non-null   object
 2   movie              5782 non-null   object
 3   production_budget  5782 non-null   object
 4   domestic_gross     5782 non-null   object
 5   worldwide_gross    5782 non-null   object
dtypes: int64(1), object(5)
memory usage: 271.2+ KB
None
id                   0
release_date         0
movie                0
production_budget    0
domestic_gross       0
worldwide_gross      0
dtype: int64


Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
count,5782.0,5782,5782,5782,5782,5782
unique,,2418,5698,509,5164,5356
top,,"Dec 31, 2014",Halloween,"$20,000,000",$0,$0
freq,,24,3,231,548,367
mean,50.372363,,,,,
std,28.821076,,,,,
min,1.0,,,,,
25%,25.0,,,,,
50%,50.0,,,,,
75%,75.0,,,,,



IMDB Basics columns: ['movie_id', 'primary_title', 'original_title', 'start_year', 'runtime_minutes', 'genres']
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146144 entries, 0 to 146143
Data columns (total 6 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   movie_id         146144 non-null  object 
 1   primary_title    146144 non-null  object 
 2   original_title   146123 non-null  object 
 3   start_year       146144 non-null  int64  
 4   runtime_minutes  114405 non-null  float64
 5   genres           140736 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 6.7+ MB
None
movie_id               0
primary_title          0
original_title        21
start_year             0
runtime_minutes    31739
genres              5408
dtype: int64


Unnamed: 0,movie_id,primary_title,original_title,start_year,runtime_minutes,genres
count,146144,146144,146123,146144.0,114405.0,140736
unique,146144,136071,137773,,,1085
top,tt0063540,Home,Broken,,,Documentary
freq,1,24,19,,,32185
mean,,,,2014.621798,86.187247,
std,,,,2.733583,166.36059,
min,,,,2010.0,1.0,
25%,,,,2012.0,70.0,
50%,,,,2015.0,87.0,
75%,,,,2017.0,99.0,



IMDB Ratings columns: ['movie_id', 'averagerating', 'numvotes']
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73856 entries, 0 to 73855
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   movie_id       73856 non-null  object 
 1   averagerating  73856 non-null  float64
 2   numvotes       73856 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 1.7+ MB
None
movie_id         0
averagerating    0
numvotes         0
dtype: int64


Unnamed: 0,movie_id,averagerating,numvotes
count,73856,73856.0,73856.0
unique,73856,,
top,tt10356526,,
freq,1,,
mean,,6.332729,3523.662
std,,1.474978,30294.02
min,,1.0,5.0
25%,,5.5,14.0
50%,,6.5,49.0
75%,,7.4,282.0
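The per-frame output above is long, and the share of missing values is easier to compare as a percentage. For example, `foreign_gross` in the BOM data is missing in 1350 of 3387 rows, roughly 40%. A minimal sketch of a compact summary follows; `summarize_missing` is a hypothetical helper, not part of the notebook, and the toy frame holds illustrative values only.

```python
import pandas as pd


def summarize_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, and missing percentage."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isnull().sum(),
        "pct_missing": (df.isnull().mean() * 100).round(1),
    })
    # Columns with the largest share of missing values come first.
    return summary.sort_values("pct_missing", ascending=False)


# Toy frame with illustrative values only.
demo = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "domestic_gross": [1.0, None, 3.0, 4.0],
    "foreign_gross": [None, None, "5", None],
})
missing_report = summarize_missing(demo)
print(missing_report)
```

Applied to the real frames, a call such as `summarize_missing(df_bom)` would give one short table per dataset.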


## 3. Data Cleaning and Standardization

Clean all DataFrames: handle missing values, standardize column names and types, remove duplicates, and ensure consistent formatting (e.g., lowercase titles).

In [4]:
# --- BOM ---
df_bom['studio'] = df_bom['studio'].fillna('Unknown')
df_bom['domestic_gross'] = pd.to_numeric(df_bom['domestic_gross'], errors='coerce')
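# foreign_gross is stored as text and may contain thousands separators; strip them so those rows are not coerced to NaN
df_bom['foreign_gross'] = pd.to_numeric(df_bom['foreign_gross'].astype(str).str.replace(',', '', regex=False), errors='coerce')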
df_bom['title'] = df_bom['title'].str.lower()
df_bom = df_bom.rename(columns={'title': 'bom_title'})
df_bom = df_bom.drop_duplicates()

# --- RottenTomatoes Info ---
# Label missing categorical values 'Unknown' rather than imputing the mode,
# which would credit the most frequent director, writer, or studio with unrelated films.
for col in ['rating', 'genre', 'director', 'writer', 'studio']:
    if col in df_rt_info.columns:
        df_rt_info[col] = df_rt_info[col].fillna('Unknown')
df_rt_info['theater_date'] = pd.to_datetime(df_rt_info['theater_date'], errors='coerce')
df_rt_info['dvd_date'] = pd.to_datetime(df_rt_info['dvd_date'], errors='coerce')
if 'box_office' in df_rt_info.columns:
    # Raw string avoids an invalid-escape warning for '\$'
    df_rt_info['box_office'] = df_rt_info['box_office'].replace(r'[\$,]', '', regex=True).astype(float)
# rt.movie_info.tsv has no title column (see the column list in Section 2), so this block is skipped as loaded.
if 'movie title' in df_rt_info.columns:
    df_rt_info['movie title'] = df_rt_info['movie title'].str.lower()
    df_rt_info = df_rt_info.rename(columns={'movie title': 'rt_movie_title'})
df_rt_info = df_rt_info.drop_duplicates()

# --- RottenTomatoes Reviews ---
if 'review' in df_rt_reviews.columns:
    df_rt_reviews = df_rt_reviews.dropna(subset=['review'])
if 'date' in df_rt_reviews.columns:
    df_rt_reviews['date'] = pd.to_datetime(df_rt_reviews['date'], errors='coerce')
df_rt_reviews = df_rt_reviews.drop_duplicates()

# --- TMDB ---
df_tmdb = df_tmdb.dropna()
df_tmdb['title'] = df_tmdb['title'].str.lower()
df_tmdb['original_title'] = df_tmdb['original_title'].str.lower()
df_tmdb['release_date'] = pd.to_datetime(df_tmdb['release_date'], errors='coerce')
df_tmdb = df_tmdb.rename(columns={'title': 'tmdb_title', 'original_title': 'tmdb_original_title'})
if 'Unnamed: 0' in df_tmdb.columns:
    df_tmdb = df_tmdb.drop(columns=['Unnamed: 0'])
df_tmdb = df_tmdb.drop_duplicates()

# --- TheNumbers ---
for col in ['production_budget', 'domestic_gross', 'worldwide_gross']:
    df_tn[col] = df_tn[col].replace(r'[\$,]', '', regex=True).astype(float)
df_tn['release_date'] = pd.to_datetime(df_tn['release_date'], errors='coerce')
df_tn['movie'] = df_tn['movie'].str.lower()
df_tn = df_tn.rename(columns={'movie': 'tn_movie'})
df_tn = df_tn.drop_duplicates()

# --- IMDB Basics ---
df_imdb_basics['primary_title'] = df_imdb_basics['primary_title'].str.lower()
df_imdb_basics = df_imdb_basics.drop_duplicates()

# --- IMDB Ratings ---
df_imdb_ratings = df_imdb_ratings.drop_duplicates()
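The TheNumbers summary in Section 2 shows `$0` as the most frequent value for both `domestic_gross` (548 rows) and `worldwide_gross` (367 rows). The sketch below shows one way to handle that during cleaning. It rests on an assumption the notebook does not state: that a gross of exactly zero means "not recorded" rather than "earned nothing". `parse_currency` is a hypothetical helper and the rows are illustrative.

```python
import numpy as np
import pandas as pd


def parse_currency(series: pd.Series) -> pd.Series:
    """Convert strings such as '$425,000,000' to floats; unparseable values become NaN."""
    cleaned = series.astype(str).str.replace(r"[\$,]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce").astype(float)


# Illustrative rows only.
demo_tn = pd.DataFrame({
    "production_budget": ["$1,000,000", "$2,500,000", "$500,000"],
    "worldwide_gross": ["$3,000,000", "$0", "$750,000"],
})
for col in ["production_budget", "worldwide_gross"]:
    demo_tn[col] = parse_currency(demo_tn[col])

# Assumption: a gross of exactly 0 is treated as missing, not as a true zero.
demo_tn["worldwide_gross"] = demo_tn["worldwide_gross"].replace(0, np.nan)
print(demo_tn)
```

If zero grosses are kept as real values instead, they pull average profit down for every genre they touch, so the choice should be made explicitly either way.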

## 4. Merge Datasets with Title and ID Matching

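Merge the datasets into a single DataFrame. IMDB basics and ratings join on `movie_id`; TheNumbers and BOM join on exact lowercase titles; TMDB and IMDB are attached by fuzzy title matching. The Rotten Tomatoes info table has no title column (see Section 2), so it can only be merged if one is supplied. Document and handle merge issues.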

In [None]:
from fuzzywuzzy import process, fuzz

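# Merge IMDB basics and ratings
df_imdb = pd.merge(df_imdb_basics, df_imdb_ratings, on='movie_id', how='left')
# IMDB titles are not unique (see Section 2); keep the most-voted entry per title
# so the later title-based merge does not multiply rows
df_imdb = df_imdb.sort_values('numvotes', ascending=False).drop_duplicates(subset='primary_title')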

# Merge TheNumbers and BOM on movie title (exact)
df_merged = pd.merge(df_tn, df_bom, left_on='tn_movie', right_on='bom_title', how='left')

# Merge with TMDB on title (fuzzy)
def fuzzy_merge_titles(df_left, df_right, left_on, right_on, threshold=85):
    matches = []
    for left_value in df_left[left_on]:
        best_match = process.extractOne(left_value, df_right[right_on], scorer=fuzz.token_sort_ratio, score_cutoff=threshold)
        if best_match:
            matches.append(best_match[0])
        else:
            matches.append(None)
    df_left['tmdb_match_title'] = matches
    merged = pd.merge(df_left, df_right, left_on='tmdb_match_title', right_on=right_on, how='left')
    return merged

df_merged = fuzzy_merge_titles(df_merged, df_tmdb, 'tn_movie', 'tmdb_title')

# Helper: best fuzzy match for one title, or None below the cutoff (one lookup per title)
def best_match(title, choices, threshold=85):
    match = process.extractOne(title, choices, scorer=fuzz.token_sort_ratio, score_cutoff=threshold)
    return match[0] if match else None

# Merge with IMDB on title (fuzzy)
df_merged['imdb_match_title'] = [best_match(title, df_imdb['primary_title']) for title in df_merged['tn_movie']]
df_merged = pd.merge(df_merged, df_imdb, left_on='imdb_match_title', right_on='primary_title', how='left')

# Merge with RottenTomatoes Info (fuzzy), only if a title column exists.
# As loaded, rt.movie_info.tsv has no title column, so this step is skipped.
if 'rt_movie_title' in df_rt_info.columns:
    df_merged['rt_match_title'] = [best_match(title, df_rt_info['rt_movie_title']) for title in df_merged['tn_movie']]
    df_merged = pd.merge(df_merged, df_rt_info, left_on='rt_match_title', right_on='rt_movie_title', how='left')

# Note: RottenTomatoes Reviews share `id` with RottenTomatoes Info but carry no title, so they are not merged here.

display(df_merged.head())
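Lowercasing alone leaves some titles unmatched in an exact merge: the BOM preview in Section 1 contains `Alice in Wonderland (2010)`, which would not equal a title stored without the year. The sketch below normalizes titles before merging and uses the merge indicator to measure the match rate. `normalize_title` and the toy rows are hypothetical, and the normalization rules (drop a trailing four-digit year, strip punctuation) are an assumption that may be too aggressive for remakes sharing a name.

```python
import re

import pandas as pd


def normalize_title(title: str) -> str:
    """Lowercase, drop a trailing '(YYYY)' suffix, strip punctuation, collapse whitespace."""
    title = title.lower()
    title = re.sub(r"\s*\(\d{4}\)\s*$", "", title)
    title = re.sub(r"[^\w\s]", " ", title)
    return re.sub(r"\s+", " ", title).strip()


# Toy rows for illustration.
demo_budgets = pd.DataFrame({"movie": ["Alice in Wonderland", "Inception", "Dark Phoenix"]})
demo_gross = pd.DataFrame({"title": ["Alice in Wonderland (2010)", "Inception"]})

demo_budgets["key"] = demo_budgets["movie"].map(normalize_title)
demo_gross["key"] = demo_gross["title"].map(normalize_title)

# indicator=True adds a _merge column; validate raises if the right-hand keys are not unique.
audit = demo_budgets.merge(
    demo_gross, on="key", how="left", indicator=True, validate="many_to_one"
)
match_rate = (audit["_merge"] == "both").mean()
print(audit[["movie", "title", "_merge"]])
print(f"Matched {match_rate:.0%} of budget rows")
```

Checking the match rate after each merge step makes it visible how many rows carry data from each source before any averages are computed.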

## 5. Feature Engineering (Profit, Date, Genre, Review Scores)

Create new features: profit (worldwide gross minus production budget, stored in the `profit_margin` column), release year/month, genre dummies, and an aggregated review score (IMDB first, then other sources).

In [None]:
# Profit in dollars (the column is named profit_margin, but it is a difference, not a ratio)
df_merged['profit_margin'] = df_merged['worldwide_gross'] - df_merged['production_budget']

# Release Year/Month (prefer TheNumbers; merge suffixes rename its column to release_date_x)
if 'release_date_x' in df_merged.columns:
    df_merged['release_date'] = df_merged['release_date_x']
elif 'release_date_y' in df_merged.columns:
    df_merged['release_date'] = df_merged['release_date_y']
elif 'release_date' not in df_merged.columns:
    df_merged['release_date'] = pd.NaT

df_merged['release_year'] = pd.to_datetime(df_merged['release_date'], errors='coerce').dt.year
df_merged['release_month'] = pd.to_datetime(df_merged['release_date'], errors='coerce').dt.month

# Genre Dummies (prefer IMDB, fallback to RT or TMDB)
if 'genres' in df_merged.columns and df_merged['genres'].notnull().any():
    genres = df_merged['genres'].str.get_dummies(sep=',')
elif 'genre' in df_merged.columns and df_merged['genre'].notnull().any():
    genres = df_merged['genre'].str.get_dummies(sep='|')  # RottenTomatoes separates genres with '|'
elif 'genres_y' in df_merged.columns and df_merged['genres_y'].notnull().any():
    genres = df_merged['genres_y'].str.get_dummies(sep=',')
else:
    genres = pd.DataFrame()

if not genres.empty:
    df_merged = pd.concat([df_merged, genres], axis=1)

# Aggregated Review Score (IMDB, fallback to TMDB/RT)
# The IMDB column is named 'averagerating' (no underscore), as shown in Section 2.
if 'averagerating' in df_merged.columns and df_merged['averagerating'].notnull().any():
    df_merged['aggregated_review_score'] = df_merged['averagerating']
elif 'vote_average' in df_merged.columns and df_merged['vote_average'].notnull().any():
    df_merged['aggregated_review_score'] = df_merged['vote_average']
elif 'tomatometer_rating' in df_merged.columns and df_merged['tomatometer_rating'].notnull().any():
    df_merged['aggregated_review_score'] = df_merged['tomatometer_rating']
else:
    # NaN keeps the column numeric so later correlations do not fail
    df_merged['aggregated_review_score'] = float('nan')

display(df_merged[['tn_movie', 'profit_margin', 'release_year', 'release_month', 'aggregated_review_score'] + (genres.columns.tolist() if not genres.empty else [])].head())
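The feature above is an absolute dollar difference. Two ratio measures are often reported alongside it: return on investment, $\text{ROI} = (\text{gross} - \text{budget}) / \text{budget}$, and profit margin in the strict sense, $(\text{gross} - \text{budget}) / \text{gross}$. A minimal sketch on toy numbers follows; the column names `profit`, `roi`, and `margin` are hypothetical and do not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Toy values (arbitrary units) for illustration.
demo = pd.DataFrame({
    "production_budget": [100.0, 10.0, 50.0, 0.0],
    "worldwide_gross": [300.0, 50.0, 25.0, 5.0],
})
demo["profit"] = demo["worldwide_gross"] - demo["production_budget"]

# Replace zero denominators with NaN so the ratios are NaN instead of inf.
budget = demo["production_budget"].replace(0, np.nan)
gross = demo["worldwide_gross"].replace(0, np.nan)
demo["roi"] = demo["profit"] / budget      # profit per dollar spent
demo["margin"] = demo["profit"] / gross    # share of revenue kept as profit
print(demo)
```

In this toy data the second film earns the smallest non-negative profit (40) but the highest ROI (4.0), which is why ranking by absolute profit and ranking by ROI can point a studio toward different projects.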

## 6. Genre Profitability Analysis

Group by genre and calculate the average profit per movie (the `profit_margin` column: worldwide gross minus production budget). Identify the most and least profitable genres. Visualize results with a bar chart.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Use IMDB genres if available
if 'genres' in df_merged.columns and df_merged['genres'].notnull().any():
    genre_col = 'genres'
elif 'genre' in df_merged.columns and df_merged['genre'].notnull().any():
    genre_col = 'genre'
elif 'genres_y' in df_merged.columns and df_merged['genres_y'].notnull().any():
    genre_col = 'genres_y'
else:
    genre_col = None

if genre_col:
    # Explode genres for multi-genre movies (IMDB separates with ',', RottenTomatoes with '|')
    sep = '|' if genre_col == 'genre' else ','
    df_exploded = df_merged.dropna(subset=[genre_col, 'profit_margin']).copy()
    df_exploded[genre_col] = df_exploded[genre_col].str.split(sep)
    df_exploded = df_exploded.explode(genre_col)
    df_exploded[genre_col] = df_exploded[genre_col].str.strip()
    genre_profit = df_exploded.groupby(genre_col)['profit_margin'].mean().sort_values(ascending=False)
    print("Most Profitable Genre:", genre_profit.index[0])
    print("Least Profitable Genre:", genre_profit.index[-1])

    plt.figure(figsize=(12, 6))
    sns.barplot(x=genre_profit.index, y=genre_profit.values, palette='viridis')
    plt.title('Average Profit by Genre')
    plt.xlabel('Genre')
    plt.ylabel('Average Profit (USD)')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
else:
    print("No genre column found for profitability analysis.")
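A mean over a handful of films can be dominated by a single blockbuster, so it may help to report the number of films and the median next to the mean. The sketch below is one way to do that on toy data; `MIN_MOVIES` is a hypothetical threshold and the `profit` values are invented.

```python
import pandas as pd

# Toy data: comma-separated genres as in the IMDB `genres` column.
demo = pd.DataFrame({
    "genres": ["Action,Drama", "Drama", "Comedy,Drama", "Action", None],
    "profit": [100.0, -10.0, 20.0, 300.0, 50.0],
})

# One row per (movie, genre) pair; rows without a genre are dropped first.
exploded = (
    demo.dropna(subset=["genres"])
        .assign(genre=lambda d: d["genres"].str.split(","))
        .explode("genre")
)
exploded["genre"] = exploded["genre"].str.strip()

genre_stats = exploded.groupby("genre")["profit"].agg(
    n_movies="count", mean_profit="mean", median_profit="median"
)

MIN_MOVIES = 2  # hypothetical cutoff: ignore genres backed by too few films
reliable = (
    genre_stats[genre_stats["n_movies"] >= MIN_MOVIES]
    .sort_values("median_profit", ascending=False)
)
print(reliable)
```

Because a multi-genre film counts once under each of its genres, the per-genre rows are not independent; the table is a ranking aid rather than a decomposition of total profit.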

## 7. Budget vs Revenue Analysis

Calculate and visualize the correlation between production budget and worldwide gross revenue using scatter plots and correlation coefficients.

In [None]:
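# Drop rows with missing values, and keep one row per film so that
# films duplicated by the title merges are not counted more than once
df_corr = df_merged.dropna(subset=['production_budget', 'worldwide_gross'])
df_corr = df_corr.drop_duplicates(subset=['tn_movie', 'production_budget', 'worldwide_gross'])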
corr = df_corr['production_budget'].corr(df_corr['worldwide_gross'])
print(f"Pearson correlation coefficient between budget and revenue: {corr:.2f}")

plt.figure(figsize=(8, 6))
plt.scatter(df_corr['production_budget'], df_corr['worldwide_gross'], alpha=0.5)
plt.title('Production Budget vs. Worldwide Gross Revenue')
plt.xlabel('Production Budget')
plt.ylabel('Worldwide Gross Revenue')
plt.grid(True)
plt.show()
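Budgets and grosses span several orders of magnitude, so a Pearson coefficient on raw dollars is driven largely by the biggest films. Two common complements are Pearson on log-transformed values and the Spearman rank correlation. The sketch below computes both on toy numbers; Spearman is obtained as Pearson on ranks, which is its definition and avoids any dependency beyond pandas and NumPy.

```python
import numpy as np
import pandas as pd

# Toy values for illustration.
demo = pd.DataFrame({
    "production_budget": [1e6, 5e6, 2e7, 1e8, 3e8],
    "worldwide_gross": [4e6, 3e6, 9e7, 2.5e8, 1.2e9],
})

# Logarithms need strictly positive inputs.
positive = demo[(demo["production_budget"] > 0) & (demo["worldwide_gross"] > 0)]

pearson_raw = positive["production_budget"].corr(positive["worldwide_gross"])
pearson_log = np.log10(positive["production_budget"]).corr(
    np.log10(positive["worldwide_gross"])
)
# Spearman rank correlation = Pearson correlation of the ranks.
spearman = positive["production_budget"].rank().corr(positive["worldwide_gross"].rank())

print(f"Pearson (raw):   {pearson_raw:.2f}")
print(f"Pearson (log10): {pearson_log:.2f}")
print(f"Spearman:        {spearman:.2f}")
```

On a scatter plot, the matching change would be logarithmic axes, for example `plt.xscale('log')` and `plt.yscale('log')`.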

## 8. Review Scores vs Revenue Analysis

Analyze and visualize the relationship between aggregated review scores and worldwide gross revenue using scatter plots and correlation coefficients.

In [None]:
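# Drop rows with missing values, and keep one row per film so merge duplicates are not double-counted
df_review = df_merged.dropna(subset=['aggregated_review_score', 'worldwide_gross'])
df_review = df_review.drop_duplicates(subset=['tn_movie', 'production_budget', 'worldwide_gross'])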
corr_review = df_review['aggregated_review_score'].corr(df_review['worldwide_gross'])
print(f"Pearson correlation coefficient between aggregated review score and revenue: {corr_review:.2f}")

plt.figure(figsize=(8, 6))
plt.scatter(df_review['aggregated_review_score'], df_review['worldwide_gross'], color='purple', alpha=0.6)
plt.title('Aggregated Review Score vs. Worldwide Gross Revenue')
plt.xlabel('Aggregated Review Score')
plt.ylabel('Worldwide Gross Revenue')
plt.grid(True)
plt.show()
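The IMDB ratings summary in Section 2 shows a median of 49 votes and a minimum of 5, so many average ratings rest on very few opinions. The sketch below filters on vote count before correlating. `MIN_VOTES` is a hypothetical cutoff, the first four rating and vote pairs echo the preview in Section 1, and the remaining rows and all gross figures are invented for illustration.

```python
import pandas as pd

# Ratings and vote counts in the first four rows echo the Section 1 preview;
# the other rows and every gross value are invented.
demo = pd.DataFrame({
    "averagerating": [8.3, 8.9, 6.4, 4.2, 7.1, 6.8],
    "numvotes": [31, 559, 20, 50352, 12000, 8000],
    "worldwide_gross": [1e5, 2e6, 5e4, 3e8, 4e8, 1.5e8],
})

MIN_VOTES = 1000  # hypothetical threshold for a "stable" average rating
well_rated = demo[demo["numvotes"] >= MIN_VOTES]

corr_all = demo["averagerating"].corr(demo["worldwide_gross"])
corr_filtered = well_rated["averagerating"].corr(well_rated["worldwide_gross"])
print(f"All rows ({len(demo)}): r = {corr_all:.2f}")
print(f"numvotes >= {MIN_VOTES} ({len(well_rated)}): r = {corr_filtered:.2f}")
```

Comparing the two coefficients shows how sensitive the review-score result is to thinly rated titles; the right cutoff depends on how many films survive the filter.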

## 9. Visualization of Key Findings

This section summarizes the results of the genre profitability, budget vs. revenue, and review score vs. revenue analyses as actionable insights for stakeholders.

### Key Findings

- **Most Profitable Genres:** The bar chart in Section 6 ranks genres by average profit (worldwide gross minus production budget); the genres at the top of that ranking are the strongest candidates.
- **Budget-Revenue Relationship:** Production budget and worldwide gross revenue are strongly and positively correlated (Pearson coefficient printed in Section 7). Correlation alone does not show that a larger budget causes higher revenue.
- **Review Scores Impact:** Aggregated review scores and worldwide gross revenue show a weak-to-moderate positive correlation (Pearson coefficient printed in Section 8).

### Actionable Insights

- **Focus on Profitable Genres:** Prioritize genres with the highest average profit margins for new productions.
- **Budget Allocation:** Higher budgets are generally associated with higher revenues, but ROI should be considered.
- **Quality Matters:** While review scores have a weaker correlation with revenue, higher-rated movies tend to perform better.

### Next Steps

- Further refine genre mapping and consider sub-genres.
- Explore advanced regression models to control for confounding variables.
- Investigate outliers and exceptions for deeper business insights.
