# Wamonyolo Studios Business Analysis  

##  Overview  
Wamonyolo Studios is planning to launch a new movie studio. To succeed, the company needs to understand what makes movies profitable. By analyzing past industry data, we can uncover insights that will guide Wamonyolo Studios toward smart, profit-driven decisions.  

---

##  Business Problem  
As a new player in the movie industry, Wamonyolo faces several key questions:  
-  How long should their films be?  
-  Which genres are the most profitable?  
-  Should they build their studio from scratch or acquire an existing one?  

Using industry datasets and analysis, we aim to answer these questions and shape a winning strategy.  

---

##  Data Preparation  
The **IMDb** dataset is the largest and most detailed. It provides:  
- Movie runtimes  
- Genres  
- Release years  
- Directors, writers, and actors  

**Limitation:** It does *not* include financial data like budgets or box office revenue.  

To complete the picture, we merge IMDb with financial datasets:  
- **Box Office Mojo (BOM):** Domestic + international box office gross  
- **The Numbers:** Budget + revenue  
- **The Movie DB (TMDB):** Ratings, popularity, and sometimes financial data  

This way, we connect *what a movie is* with *how it performs financially*.  

---

##  Why Merging Matters  
- **IMDb = What the movie is** (content + creators)  
- **Financial datasets = How the movie performed** (cost + revenue)  

When combined, the data allows us to answer:  
- Do longer films earn more or less?  
- Which genres deliver the highest returns?  
- Are certain directors/writers consistently successful?  

---
 
IMDb provides the richest descriptive information, but lacks financial details.  
By merging it with BOM, The Numbers, and TMDB, Wamonyolo Studios can analyze both creativity *and* profitability—ensuring a smart, data-driven entry into the movie market.  


# Import all necessary libraries

In [212]:
# Step 1: Import all necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Reading the data 

In [213]:
# Box Office Mojo 
bom_movie_gross = pd.read_csv('zippedData/bom.movie_gross.csv.gz')

# === The Numbers ===
tn_movie_budgets = pd.read_csv('zippedData/tn.movie_budgets.csv.gz')

# === The Movie Database (TMDb) ===
tmdb_movies = pd.read_csv('zippedData/tmdb.movies.csv.gz')

# === Rotten Tomatoes ===
rt_movies = pd.read_csv('zippedData/rt.movie_info.tsv.gz', sep='\t', encoding='latin-1')
rt_reviews = pd.read_csv('zippedData/rt.reviews.tsv.gz', sep='\t', encoding='latin-1')


im.db.zip is a compressed archive. Listing its contents shows what it holds before extracting anything.

In [214]:
import zipfile, pandas as pd

with zipfile.ZipFile('zippedData/im.db.zip') as z:
    print(z.namelist())   #  shows you all files inside



['im.db']


In [215]:
import zipfile

with zipfile.ZipFile("zippedData/im.db.zip", "r") as z:
    z.extractall("zippedData/")  # this will create 'zippedData/im.db'


The archive contains a single SQLite database file called im.db, so we open it with sqlite3 rather than reading it as a CSV.

In [216]:
import sqlite3
#import pandas as pd

conn = sqlite3.connect("zippedData/im.db")
tables = pd.read_sql("SELECT name FROM sqlite_master WHERE type='table';", conn)
print(tables)



            name
0   movie_basics
1      directors
2      known_for
3     movie_akas
4  movie_ratings
5        persons
6     principals
7        writers


Now we load those tables into pandas DataFrames with simple SQL queries.

In [217]:
movie_basics = pd.read_sql("SELECT * FROM movie_basics;", conn)

directors = pd.read_sql("SELECT * FROM directors;", conn)
known_for = pd.read_sql("SELECT * FROM known_for;", conn)
movie_akas = pd.read_sql("SELECT * FROM movie_akas;", conn)
movie_ratings = pd.read_sql("SELECT * FROM movie_ratings;", conn)
persons = pd.read_sql("SELECT * FROM persons;", conn)
principals = pd.read_sql("SELECT * FROM principals;", conn)
writers = pd.read_sql("SELECT * FROM writers;", conn)
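
Loading every table in full works, but SQL can also do some of the filtering and joining before the data reaches pandas. The sketch below is only an illustration on a toy in-memory database: the table contents are made up, and the rating column names (`averagerating`, `numvotes`) are assumptions that should be checked against the real `movie_ratings` table.

```python
import sqlite3
import pandas as pd

# Toy in-memory database standing in for im.db (all values are made up)
demo_conn = sqlite3.connect(":memory:")
pd.DataFrame({
    "movie_id": ["tt1", "tt2", "tt3"],
    "primary_title": ["Alpha", "Beta", "Gamma"],
    "start_year": [2015, 2016, 2017],
    "runtime_minutes": [95.0, None, 120.0],
}).to_sql("movie_basics", demo_conn, index=False)
pd.DataFrame({
    "movie_id": ["tt1", "tt3"],
    "averagerating": [7.1, 6.4],   # assumed column name
    "numvotes": [1200, 300],       # assumed column name
}).to_sql("movie_ratings", demo_conn, index=False)

# Join basics with ratings and skip rows without a runtime, all in SQL
query = """
SELECT b.movie_id, b.primary_title, b.start_year, b.runtime_minutes,
       r.averagerating, r.numvotes
FROM movie_basics AS b
LEFT JOIN movie_ratings AS r ON b.movie_id = r.movie_id
WHERE b.runtime_minutes IS NOT NULL;
"""
basics_with_ratings = pd.read_sql(query, demo_conn)
demo_conn.close()
print(basics_with_ratings)
```

A LEFT JOIN keeps movies that have no rating row, which matters if ratings are only used as an optional extra.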

# Data Cleaning 
We’ll clean only the datasets that are most useful for the analysis (IMDb + financials). Rotten Tomatoes and TMDB are optional and can be cleaned later.


# Datasets to Clean First

1. IMDb tables (content & metadata)
   - movie_basics (title, year, runtime, genres)
   - movie_ratings (average rating, votes)
2. Box Office Mojo (bom_movie_gross)
   - Domestic & foreign gross
3. The Numbers (tn_movie_budgets)
   - Budget + gross

In [218]:
movie_basics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146144 entries, 0 to 146143
Data columns (total 6 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   movie_id         146144 non-null  object 
 1   primary_title    146144 non-null  object 
 2   original_title   146123 non-null  object 
 3   start_year       146144 non-null  int64  
 4   runtime_minutes  114405 non-null  float64
 5   genres           140736 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 6.7+ MB


In [219]:
# Check duplicates
movie_basics.duplicated().sum()

0

In [220]:
# Reload movie_basics from the database and preview it

movie_basics = pd.read_sql("SELECT * FROM movie_basics;", conn)
movie_basics


Unnamed: 0,movie_id,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"
...,...,...,...,...,...,...
146139,tt9916538,Kuambil Lagi Hatiku,Kuambil Lagi Hatiku,2019,123.0,Drama
146140,tt9916622,Rodolpho Teóphilo - O Legado de um Pioneiro,Rodolpho Teóphilo - O Legado de um Pioneiro,2015,,Documentary
146141,tt9916706,Dankyavar Danka,Dankyavar Danka,2013,,Comedy
146142,tt9916730,6 Gunn,6 Gunn,2017,116.0,


Step 1

Extract only the columns that we need 

In [221]:
runtime_df = movie_basics[['primary_title', 'start_year', 'runtime_minutes']]

 
movie_basics has several columns (movie_id, original_title, genres, etc.), but for runtime analysis we only care about:

- primary_title: movie name (for identification and merging later)
- start_year: release year (to filter by time and merge with financial datasets)
- runtime_minutes: our main feature of interest (movie length)


Step 2

Remove movies that haven't been released yet 

In [222]:
runtime_df = runtime_df[runtime_df['start_year'] < 2025]


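Some rows have release years in the future. The filter keeps only rows with a start_year earlier than 2025.

Since we only analyze historical performance, unreleased films would give misleading results.

Dropping them also keeps the dataset consistent with the financial data, which only covers films that have already been released.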

Step 3

Drop rows with missing runtimes

In [223]:
runtime_df = runtime_df.dropna(axis=0, subset=['runtime_minutes'])


Missing runtimes = useless for analysis.

Dropping them ensures we don’t get NaN values messing up plots/stats.

 Step 4
Inspect the cleaned result

In [224]:
print(runtime_df.shape)     # how many rows/columns after cleaning
print(runtime_df.isna().sum())  # check if any nulls remain
runtime_df.head()           # preview first 5 rows
runtime_df.info()           # check datatypes
runtime_df.describe()       # quick stats (mean, min, max runtime)


(114405, 3)
primary_title      0
start_year         0
runtime_minutes    0
dtype: int64
<class 'pandas.core.frame.DataFrame'>
Int64Index: 114405 entries, 0 to 146142
Data columns (total 3 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   primary_title    114405 non-null  object 
 1   start_year       114405 non-null  int64  
 2   runtime_minutes  114405 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 3.5+ MB


Unnamed: 0,start_year,runtime_minutes
count,114405.0,114405.0
mean,2014.396801,86.187247
std,2.63748,166.36059
min,2010.0,1.0
25%,2012.0,70.0
50%,2014.0,87.0
75%,2017.0,99.0
max,2022.0,51420.0



shape - see how much data we have left after cleaning.

isna() - make sure runtimes are fully clean.

head() - sanity check if columns look correct.

info() - confirm datatypes (start_year should be int, runtime_minutes int/float).

describe() - see runtime distribution (are there very short/long outliers?).
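
The describe() output above shows a minimum runtime of 1 minute and a maximum of 51,420 minutes, so outliers are clearly present. One possible way to handle them, sketched here on made-up rows, is to keep only runtimes inside a plausible range. The bounds of 40 and 300 minutes are an assumption for illustration, not a value taken from the original analysis.

```python
import pandas as pd

# Made-up runtimes, including an implausible 51420-minute entry
sample = pd.DataFrame({
    "primary_title": ["A", "B", "C", "D", "E", "F"],
    "runtime_minutes": [1.0, 70.0, 87.0, 99.0, 130.0, 51420.0],
})

# Assumed bounds for a feature-length film; adjust as needed
MIN_RUNTIME, MAX_RUNTIME = 40, 300

# between() is inclusive on both ends by default
plausible = sample[sample["runtime_minutes"].between(MIN_RUNTIME, MAX_RUNTIME)]
print(plausible["runtime_minutes"].describe())
```

Fixed bounds are easy to explain to a business audience; a quantile-based cutoff would be an alternative if no natural limits are known.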

# Now we’re prepping The Numbers and TMDb release dates so they can align with IMDb’s start_year

In [225]:
tn_movie_budgets = pd.read_csv('zippedData/tn.movie_budgets.csv.gz')
tn_movie_budgets.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"


In [226]:
tn_movie_budgets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   id                 5782 non-null   int64 
 1   release_date       5782 non-null   object
 2   movie              5782 non-null   object
 3   production_budget  5782 non-null   object
 4   domestic_gross     5782 non-null   object
 5   worldwide_gross    5782 non-null   object
dtypes: int64(1), object(5)
memory usage: 271.2+ KB


In [227]:
#  The Movie Database (TMDb) 
tmdb_movies = pd.read_csv('zippedData/tmdb.movies.csv.gz')
tmdb_movies.head()

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186


In [228]:
tmdb_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26517 entries, 0 to 26516
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         26517 non-null  int64  
 1   genre_ids          26517 non-null  object 
 2   id                 26517 non-null  int64  
 3   original_language  26517 non-null  object 
 4   original_title     26517 non-null  object 
 5   popularity         26517 non-null  float64
 6   release_date       26517 non-null  object 
 7   title              26517 non-null  object 
 8   vote_average       26517 non-null  float64
 9   vote_count         26517 non-null  int64  
dtypes: float64(2), int64(3), object(5)
memory usage: 2.0+ MB


Step 1

Convert release_date into datetime 

In [229]:
tn_movie_budgets['release_date'] = pd.to_datetime(tn_movie_budgets['release_date'])
tmdb_movies['release_date'] = pd.to_datetime(tmdb_movies['release_date'])



Dates are often read in as strings → can’t extract year/month directly.

pd.to_datetime() standardizes them into true datetime objects.
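
The two datasets store dates differently ("Dec 18, 2009" in The Numbers, "2010-11-19" in TMDb). Passing an explicit format makes the parsing rule visible and avoids per-row format guessing. The snippet below is a small sketch on hand-typed strings; the `errors="coerce"` choice is an assumption about how bad rows should be handled.

```python
import pandas as pd

tn_dates = pd.Series(["Dec 18, 2009", "May 20, 2011", "not a date"])
tmdb_dates = pd.Series(["2010-11-19", "1995-11-22"])

# errors="coerce" turns unparseable strings into NaT instead of raising
tn_parsed = pd.to_datetime(tn_dates, format="%b %d, %Y", errors="coerce")
tmdb_parsed = pd.to_datetime(tmdb_dates, format="%Y-%m-%d")

print(tn_parsed)
print(tmdb_parsed.dt.year.tolist())
```

Counting the NaT values afterwards (`tn_parsed.isna().sum()`) gives a quick check of how many dates failed to parse.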

Step 2

Extract release year (to match IMDb format)

In [230]:
tn_movie_budgets['release_year'] = tn_movie_budgets['release_date'].dt.year
tmdb_movies['release_year'] = tmdb_movies['release_date'].dt.year



IMDb uses just the year (start_year).

To merge datasets later, we need the same format (year only).

Step 3: 
Extract release month (both numeric & string)

In [231]:
tn_movie_budgets['month_dt'] = tn_movie_budgets['release_date'].dt.month  # numeric month (1–12)
tn_movie_budgets['month'] = tn_movie_budgets['release_date'].dt.month     # duplicate here, can adjust if you want month names



Month helps analyze seasonality (e.g., summer blockbusters, holiday releases).

month_dt → numeric (for calculations).

month → could later be turned into month names for plots.

(Small note: you might want dt.month_name() if you prefer full names like “July”)
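
As a sketch of that note, `dt.month_name()` can produce the name column directly from a datetime column. The dates below are typed in by hand, and in the notebook this would have to run before the raw release_date column is dropped.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2009-12-18", "2011-05-20", "2019-06-07"]))
months = pd.DataFrame({
    "month": dates.dt.month,              # numeric month, 1-12
    "month_name": dates.dt.month_name(),  # full English month name
})
print(months)
```

Keeping the numeric column alongside the name is still useful, since sorting by name alone would put the months in alphabetical rather than calendar order.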

 Step 4:
Drop raw release_date

In [232]:
tn_movie_budgets = tn_movie_budgets.drop(columns=['release_date'])




We’ve extracted all useful parts (year + month).

Dropping avoids duplication and keeps dataframe cleaner.

 Step 5
Inspect

In [233]:
print(tn_movie_budgets[['movie','release_year','month_dt','month']].head())
print(tmdb_movies[['title','release_year']].head())


                                         movie  release_year  month_dt  month
0                                       Avatar          2009        12     12
1  Pirates of the Caribbean: On Stranger Tides          2011         5      5
2                                 Dark Phoenix          2019         6      6
3                      Avengers: Age of Ultron          2015         5      5
4            Star Wars Ep. VIII: The Last Jedi          2017        12     12
                                          title  release_year
0  Harry Potter and the Deathly Hallows: Part 1          2010
1                      How to Train Your Dragon          2010
2                                    Iron Man 2          2010
3                                     Toy Story          1995
4                                     Inception          2010


# Now you’re cleaning up the financial columns from The Numbers so they’re ready for calculations and plots. 

 Step 1: Identify the money columns

In [234]:
cols = ['production_budget', 'domestic_gross', 'worldwide_gross']



These are stored as strings with $ and commas (e.g., "$100,000,000").
We can’t do math or plots with strings, so we must convert them to numbers.

Step 2: Remove $ and ,

In [235]:
tn_movie_budgets[cols] = tn_movie_budgets[cols].replace(r'[\$,]', '', regex=True)




[\$,] means: match dollar signs $ or commas ,.

.replace(..., regex=True) strips them out → "100000000".
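
A minimal, self-contained version of the clean-up, using a raw string for the pattern so the backslash is passed to the regex engine unchanged. The two rows are illustrative only.

```python
import pandas as pd

money = pd.DataFrame({
    "production_budget": ["$425,000,000", "$5,000,000"],
    "worldwide_gross": ["$2,776,345,279", "$0"],
})
cols = ["production_budget", "worldwide_gross"]

# Strip "$" and "," from every cell, then convert the digit strings to integers
cleaned = money[cols].replace(r"[\$,]", "", regex=True).astype("int64")
print(cleaned.dtypes)
print(cleaned)
```

Note that a value of "$0" survives as the integer 0, which later matters for any ratio that divides by revenue.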

Step 3: Convert to integers

In [236]:
tn_movie_budgets[cols] = tn_movie_budgets[cols].astype('int64')




Converts cleaned strings into integers so we can:

Calculate profits/losses

Plot histograms, scatterplots

Run regressions

Step 4 Inspect the result

In [237]:
print(tn_movie_budgets[cols].dtypes)   # confirm int64
tn_movie_budgets[cols].describe()      # check ranges, averages, etc.
tn_movie_budgets.head(3)               # preview cleaned values


production_budget    int64
domestic_gross       int64
worldwide_gross      int64
dtype: object


Unnamed: 0,id,movie,production_budget,domestic_gross,worldwide_gross,release_year,month_dt,month
0,1,Avatar,425000000,760507625,2776345279,2009,12,12
1,2,Pirates of the Caribbean: On Stranger Tides,410600000,241063875,1045663875,2011,5,5
2,3,Dark Phoenix,350000000,42762350,149762350,2019,6,6




describe() shows if values are realistic (e.g., budgets in millions, not billions).

# Standardizing titles across all datasets to improve your merge success rate

 Step 1: Apply .str.title() to titles

In [238]:
runtime_df['primary_title'] = runtime_df['primary_title'].str.title()
tn_movie_budgets['movie'] = tn_movie_budgets['movie'].str.title()
bom_movie_gross['title'] = bom_movie_gross['title'].str.title()
tmdb_movies['title'] = tmdb_movies['title'].str.title()



In different datasets, titles may appear as "avatar", "Avatar", or "AVATAR".

.str.title() converts them all to "Avatar" → making matches more consistent when merging.
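
Title-casing fixes capitalization only. It also capitalizes the letter after an apostrophe (Python's `"dog's".title()` gives `"Dog'S"`), and it leaves differences such as a trailing "(2010)" or a missing colon untouched. One possible alternative, shown here as a sketch, is to build a separate normalized key just for merging. `make_title_key` is a hypothetical helper, not part of the original notebook.

```python
import re
import pandas as pd

def make_title_key(title: str) -> str:
    """Hypothetical helper: build a lowercase, punctuation-free merge key."""
    key = title.lower()
    key = re.sub(r"\s*\(\d{4}\)\s*$", "", key)  # drop a trailing "(2010)"-style year
    key = re.sub(r"[^a-z0-9 ]", "", key)        # drop punctuation (and accented letters)
    return re.sub(r"\s+", " ", key).strip()     # collapse repeated spaces

titles = pd.Series([
    "Alice In Wonderland (2010)",
    "Harry Potter and the Deathly Hallows: Part 1",
    "Harry Potter And The Deathly Hallows Part 1",
])
keys = titles.map(make_title_key)
print(keys.tolist())
```

Merging on such a key (plus the year) while keeping the original title column for display would likely raise the match rate, at the cost of a few extra false matches between near-identical titles.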

 Step 2: Inspect for consistency

In [239]:
print(runtime_df['primary_title'].head(5))
print(tn_movie_budgets['movie'].head(5))
print(bom_movie_gross['title'].head(5))
print(tmdb_movies['title'].head(5))


0                          Sunghursh
1    One Day Before The Rainy Season
2         The Other Side Of The Wind
4           The Wandering Soap Opera
5                        A Thin Life
Name: primary_title, dtype: object
0                                         Avatar
1    Pirates Of The Caribbean: On Stranger Tides
2                                   Dark Phoenix
3                        Avengers: Age Of Ultron
4              Star Wars Ep. Viii: The Last Jedi
Name: movie, dtype: object
0                                    Toy Story 3
1                     Alice In Wonderland (2010)
2    Harry Potter And The Deathly Hallows Part 1
3                                      Inception
4                            Shrek Forever After
Name: title, dtype: object
0    Harry Potter And The Deathly Hallows: Part 1
1                        How To Train Your Dragon
2                                      Iron Man 2
3                                       Toy Story
4                                   

# Now you’re adding profit margin columns to tn_movie_budgets so you can analyze which movies actually made money relative to their costs.

 Step 1: Domestic profit margin

In [240]:
tn_movie_budgets['dom_profit_margin'] = (
    (tn_movie_budgets['domestic_gross'] - tn_movie_budgets['production_budget'])
    / tn_movie_budgets['domestic_gross']
) * 100


Formula:

$$\text{Profit Margin} = \frac{\text{Revenue} - \text{Cost}}{\text{Revenue}} \times 100$$

Tells you what % of revenue was actual profit from U.S. box office only.

Step 2: Worldwide profit margin

In [241]:
tn_movie_budgets['ww_profit_margin'] = (
    (tn_movie_budgets['worldwide_gross'] - tn_movie_budgets['production_budget'])
    / tn_movie_budgets['worldwide_gross']
) * 100



Same idea, but using global revenue.

Helps you see if movies depended more on domestic vs international markets for profitability.
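
Both margin formulas divide by gross revenue, so a film with a recorded gross of 0 yields -inf in pandas rather than a missing value, and -inf still counts as non-null in info(). A possible guard, sketched on made-up numbers, is to treat a zero gross as missing before dividing. Whether a zero really means "no data" is an assumption here.

```python
import numpy as np
import pandas as pd

films = pd.DataFrame({
    "production_budget": [100, 50, 20],
    "domestic_gross": [250, 0, 10],
})

# Treat a zero gross as "no usable revenue figure" so the ratio becomes NaN
revenue = films["domestic_gross"].replace(0, np.nan)
films["dom_profit_margin"] = (revenue - films["production_budget"]) / revenue * 100
print(films)
```

NaN values are skipped by median() and mean() by default, whereas a single -inf would drag a group average to -inf.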

 Step 3: Inspect results

In [242]:
tn_movie_budgets[['movie','production_budget','domestic_gross','worldwide_gross','dom_profit_margin','ww_profit_margin']].head(10)


Unnamed: 0,movie,production_budget,domestic_gross,worldwide_gross,dom_profit_margin,ww_profit_margin
0,Avatar,425000000,760507625,2776345279,44.116274,84.692106
1,Pirates Of The Caribbean: On Stranger Tides,410600000,241063875,1045663875,-70.3283,60.73308
2,Dark Phoenix,350000000,42762350,149762350,-718.477001,-133.703598
3,Avengers: Age Of Ultron,330600000,459005868,1403013963,27.974777,76.436443
4,Star Wars Ep. Viii: The Last Jedi,317000000,620181382,1316721747,48.885921,75.925058
5,Star Wars Ep. Vii: The Force Awakens,306000000,936662225,2053311220,67.330806,85.097242
6,Avengers: Infinity War,300000000,678815482,2048134200,55.805369,85.352522
7,Pirates Of The Caribbean: At WorldâS End,300000000,309420425,963420425,3.044539,68.860947
8,Justice League,300000000,229024295,655945209,-30.99047,54.264473
9,Spectre,300000000,200074175,879620923,-49.944389,65.894399


# Same structure as the profit margins, but now for profit amount and ROI, again using tn_movie_budgets

Step 4: Worldwide profit amount

In [243]:
tn_movie_budgets['world_wide_profit_amount'] = (
    tn_movie_budgets['worldwide_gross'] - tn_movie_budgets['production_budget']
)




This gives you the absolute dollar profit (or loss) a movie made globally.

Unlike margins, this shows the real money gained.

Example: If budget = $100M, worldwide gross = $250M →
Profit = $150M.

 Step 5: Return on Investment (ROI)

In [244]:
tn_movie_budgets['ROI_perc'] = (
    tn_movie_budgets['world_wide_profit_amount'] / tn_movie_budgets['production_budget']
) * 100



ROI tells you how efficiently money was used.

Formula:

$$\text{ROI} = \frac{\text{Net Profit}}{\text{Budget}} \times 100$$

A blockbuster making $200M profit on a $200M budget → ROI = 100%.

But a small film making $20M profit on $5M budget → ROI = 400%.

So ROI highlights hidden winners among low-budget films.
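
The two examples above can be checked with a few lines of pandas. The gross figures are simply the budgets plus the stated profits; the film names are placeholders.

```python
import pandas as pd

examples = pd.DataFrame({
    "film": ["Blockbuster", "Small film"],
    "production_budget": [200_000_000, 5_000_000],
    "worldwide_gross": [400_000_000, 25_000_000],
})
examples["profit"] = examples["worldwide_gross"] - examples["production_budget"]
examples["ROI_perc"] = examples["profit"] / examples["production_budget"] * 100
print(examples[["film", "profit", "ROI_perc"]])
```

The blockbuster earns ten times more in absolute dollars, yet the small film returns four times as much per dollar invested.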

Step 6: Inspect results

In [245]:
tn_movie_budgets[['movie','production_budget','worldwide_gross',
                  'world_wide_profit_amount','ROI_perc']].head(10)

print(tn_movie_budgets['release_year'].unique()[:20])
print(tn_movie_budgets['release_year'].dtype)


[2009 2011 2019 2015 2017 2018 2007 2012 2013 2010 2016 2014 2006 2008
 2005 1997 2004 1999 1995 2003]
int64


# Now filtering the dataset by year tn_movie_budgets

In [246]:
tn_movie_budgets = tn_movie_budgets[tn_movie_budgets['release_year'] > 2000]


In [247]:
print(tn_movie_budgets.shape)
print(tn_movie_budgets['release_year'].min(), tn_movie_budgets['release_year'].max())



(4198, 12)
2001 2020


In [248]:
tn_movie_budgets.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4198 entries, 0 to 5781
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   id                        4198 non-null   int64  
 1   movie                     4198 non-null   object 
 2   production_budget         4198 non-null   int64  
 3   domestic_gross            4198 non-null   int64  
 4   worldwide_gross           4198 non-null   int64  
 5   release_year              4198 non-null   int64  
 6   month_dt                  4198 non-null   int64  
 7   month                     4198 non-null   int64  
 8   dom_profit_margin         4198 non-null   float64
 9   ww_profit_margin          4198 non-null   float64
 10  world_wide_profit_amount  4198 non-null   int64  
 11  ROI_perc                  4198 non-null   float64
dtypes: float64(3), int64(8), object(1)
memory usage: 426.4+ KB




Older movies (released in 2000 or earlier) may not reflect today’s industry dynamics.

Budgets, marketing, and box office models have changed drastically since 2000 (e.g., streaming, globalization).



# Shifting into release month analysis, using tn_movie_budgets

 Step 1: Group by release month and calculate medians

In [249]:
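# Group movies by release month and take the median of numeric columns
# (numeric_only=True skips text columns such as 'movie', which newer pandas versions refuse to aggregate)
month_df = tn_movie_budgets.groupby('month').median(numeric_only=True)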

# Reset index so 'month' becomes a column again
month_df = month_df.reset_index()

# Sort by month number (1–12)
month_df = month_df.sort_values('month')

# Add month names
month_dict = {
    1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr',
    5: 'May', 6: 'Jun', 7: 'Jul', 8: 'Aug',
    9: 'Sep', 10: 'Oct', 11: 'Nov', 12: 'Dec'
}
month_df['month_name'] = month_df['month'].map(month_dict)




Grouping by month lets you see if certain months tend to produce higher profits/ROI.

Using the median reduces the impact of extreme outliers (e.g., Avengers making billions).

Sorting ensures the months are in calendar order.

Adding names (Jan, Feb, etc.) makes plots readable.
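
To see why the median is the safer summary here, the toy example below compares it with the mean for a month that contains one extreme ROI value. All numbers are made up, and including the count is a suggestion rather than part of the original analysis.

```python
import pandas as pd

releases = pd.DataFrame({
    "month": [5, 5, 5, 1, 1],
    "ROI_perc": [50.0, 80.0, 2000.0, 10.0, 30.0],  # made-up values
})
summary = (
    releases.groupby("month")["ROI_perc"]
    .agg(["median", "mean", "count"])
    .reset_index()
)
print(summary)
```

For month 5 the single outlier pulls the mean far above the median, while the count column shows how many films each figure rests on.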

# Merging 

#  The Numbers (box office + budget) with IMDb

# Merge datasets on title + year

In [250]:
print(tn_movie_budgets['release_year'].unique()[:20])
print(runtime_df['start_year'].unique()[:20])


[2009 2011 2019 2015 2017 2018 2007 2012 2013 2010 2016 2014 2006 2008
 2005 2004 2003 2001 2020 2002]
[2013 2019 2018 2017 2012 2010 2011 2015 2016 2014 2020 2022 2021]


In [251]:
overlap_years = set(tn_movie_budgets['release_year']).intersection(set(runtime_df['start_year']))
print("Overlap years:", overlap_years)


Overlap years: {2016, 2017, 2018, 2019, 2020, 2010, 2011, 2012, 2013, 2014, 2015}


In [252]:
tn_2019 = tn_movie_budgets[tn_movie_budgets['release_year'] == 2019]['movie'].unique()
imdb_2019 = runtime_df[runtime_df['start_year'] == 2019]['primary_title'].unique()

print("The Numbers (2019) sample:", tn_2019[:20])
print("IMDb (2019) sample:", imdb_2019[:20])


The Numbers (2019) sample: ['Dark Phoenix' 'Aladdin' 'Captain Marvel' 'Dumbo' 'Alita: Battle Angel'
 'Godzilla: King Of The Monsters' 'Pokã©Mon: Detective Pikachu'
 'How To Train Your Dragon: The Hidden World'
 'Men In Black: International' 'Wonder Park'
 'The Lego Movie 2: The Second Part' 'Army Of The Dead' 'Shazam!'
 'The Secret Life Of Pets 2' 'Renegades' 'Playmobil' '355'
 'A Dogâ\x80\x99S Way Home' 'Cold Pursuit' 'Midway']
IMDb (2019) sample: ['One Day Before The Rainy Season' 'Alita: Battle Angel' 'Shazam!'
 'The Legend Of Secret Pass' 'The Dirt' 'Pet Sematary' 'Bolden'
 'Disrupted Land' 'Fiddler: A Miracle Of Miracles' 'Soccer In The City'
 'When I Became A Butterfly' 'Paradise' 'Aporia' 'Debout' 'Krishnam'
 'Kala-A-Zar' 'Terror In The Skies' 'Bull' 'Troublemaker' 'Snatchers']


In [253]:
numbers_and_runtime = tn_movie_budgets.merge(
    runtime_df,
    left_on=['movie', 'release_year'],
    right_on=['primary_title', 'start_year'],
    how='inner'
)
# Keep only movies with valid domestic gross
numbers_and_runtime = numbers_and_runtime.loc[numbers_and_runtime['domestic_gross'] > 0]


- **Merge on both title + year:** Some movies share the same title (Halloween 1978 vs Halloween 2018). Matching on year as well avoids wrong matches.
- **Inner join (how='inner'):** Keeps only rows where a movie exists in both datasets, so each row has financial data + runtime.
- **Filter out domestic_gross == 0:** Removes movies with no recorded U.S. box office, which either never played in U.S. theaters or have missing data. This keeps the analysis focused on box office performers.
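
An inner join silently drops every film that fails to match, so it can help to measure the match rate first. The sketch below uses a left join with pandas' `indicator=True` option on toy frames; the runtimes are made up for illustration.

```python
import pandas as pd

tn = pd.DataFrame({
    "movie": ["Avatar", "Halloween", "Dark Phoenix"],
    "release_year": [2009, 2018, 2019],
})
imdb = pd.DataFrame({
    "primary_title": ["Halloween", "Dark Phoenix", "Bolden"],
    "start_year": [2018, 2019, 2019],
    "runtime_minutes": [100.0, 110.0, 105.0],  # made-up values
})
check = tn.merge(
    imdb,
    left_on=["movie", "release_year"],
    right_on=["primary_title", "start_year"],
    how="left",
    indicator=True,  # adds a _merge column: "both" or "left_only"
)
print(check["_merge"].value_counts())
print(check.loc[check["_merge"] == "left_only", "movie"].tolist())
```

Inspecting the "left_only" titles often reveals systematic causes of missed matches, such as punctuation differences or off-by-one release years.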

# Inspect merged results

In [254]:
print(numbers_and_runtime.shape)
numbers_and_runtime.head()
numbers_and_runtime.info()


(1395, 15)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1395 entries, 0 to 1558
Data columns (total 15 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   id                        1395 non-null   int64  
 1   movie                     1395 non-null   object 
 2   production_budget         1395 non-null   int64  
 3   domestic_gross            1395 non-null   int64  
 4   worldwide_gross           1395 non-null   int64  
 5   release_year              1395 non-null   int64  
 6   month_dt                  1395 non-null   int64  
 7   month                     1395 non-null   int64  
 8   dom_profit_margin         1395 non-null   float64
 9   ww_profit_margin          1395 non-null   float64
 10  world_wide_profit_amount  1395 non-null   int64  
 11  ROI_perc                  1395 non-null   float64
 12  primary_title             1395 non-null   object 
 13  start_year                1395 non-null   int64  
 1

# Creating dataframe with studio and box office data

 Step 1: 
Select relevant columns from Box Office Mojo
We only need the movie title, studio, and release year from BOM because these are the identifiers we will merge with The Numbers dataset.

In [255]:
# Selecting only the necessary columns from BOM
studio_df = bom_movie_gross[['title', 'studio', 'year']]


 Step 2: 
Merge with The Numbers dataset

Now we merge studio_df with tn_movie_budgets to attach financial data (budget, domestic gross, worldwide gross) to each movie.

In [256]:
# Merge studio info from BOM with financial info from The Numbers
studio_df = studio_df.merge(
    tn_movie_budgets,             # TN dataset with budgets & grosses
    left_on=['title', 'year'],    # BOM columns to merge on
    right_on=['movie', 'release_year'], # TN columns to merge on
    how='inner'                   # Only keep movies that exist in both datasets
)


Some movies may have the same title but are different movies released in different years. Matching only by title could create incorrect combinations.

 Step 3: 
Inspect the merged dataframe

In [257]:
# Check the shape of the new dataframe
print(studio_df.shape)

# Preview first 5 rows (the head() default)
print(studio_df.head())

(1255, 15)
                        title studio  year  id                       movie  \
0                 Toy Story 3     BV  2010  47                 Toy Story 3   
1                   Inception     WB  2010  38                   Inception   
2         Shrek Forever After   P/DW  2010  27         Shrek Forever After   
3  The Twilight Saga: Eclipse   Sum.  2010  53  The Twilight Saga: Eclipse   
4                  Iron Man 2   Par.  2010  15                  Iron Man 2   

   production_budget  domestic_gross  worldwide_gross  release_year  month_dt  \
0          200000000       415004880       1068879522          2010         6   
1          160000000       292576195        835524642          2010         7   
2          165000000       238736787        756244673          2010         5   
3           68000000       300531751        706102828          2010         6   
4          170000000       312433331        621156389          2010         5   

   month  dom_profit_margin  ww_p

# Calculating average studio-level metrics

Step 1: Group by studio

In [258]:
avg_studio = studio_df.groupby('studio').mean(numeric_only=True).reset_index()


We want to see studio-level performance rather than movie-level.

Grouping and averaging helps us identify which studios consistently produce profitable movies.

- groupby('studio'): groups all movies by their production studio.
- mean(): calculates the average of all numeric columns for each studio, e.g., production_budget, domestic_gross, worldwide_gross, dom_profit_margin, ww_profit_margin, ROI_perc.
- reset_index(): converts the grouped index (studio) back into a regular column so we can easily access and plot it.

Step 2: Filter only profitable studios

In [259]:
avg_studio = avg_studio[avg_studio['dom_profit_margin'] > 0]


Negative-profit studios can skew analysis and plots.

Focusing on positive-profit studios helps highlight the best-performing studios.

dom_profit_margin > 0 keeps only studios whose average domestic profit margin is positive.

This removes studios that on average lose money domestically, so the analysis focuses on studios that are financially successful.
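
An average based on one or two films says little about how consistent a studio is. One possible refinement, sketched on made-up data, is to compute the film count alongside the average and require a minimum number of films. The threshold `MIN_FILMS` is an assumption, not a value from the original analysis.

```python
import pandas as pd

studio_films = pd.DataFrame({
    "studio": ["X", "X", "X", "Y", "Z", "Z"],
    "dom_profit_margin": [10.0, 30.0, 20.0, 90.0, -40.0, -20.0],  # made-up values
})
MIN_FILMS = 2  # assumed threshold

per_studio = studio_films.groupby("studio")["dom_profit_margin"].agg(
    avg_margin="mean", n_films="count"
)
# Keep studios that are profitable on average AND have enough films to judge
reliable = per_studio[
    (per_studio["n_films"] >= MIN_FILMS) & (per_studio["avg_margin"] > 0)
]
print(reliable)
```

Here studio Y has the highest margin but only one film, so it is set aside rather than treated as a consistent performer.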

In [260]:
print(avg_studio.shape)   # How many studios are left after filtering
print(avg_studio.head(5)) # Preview the first 5 studios with average metrics


(14, 13)
     studio         year         id  production_budget  domestic_gross  \
0        3D  2010.000000  31.000000       5.000000e+06    6.096582e+06   
3    Affirm  2017.500000  53.500000       3.500000e+06    1.167510e+07   
11  BH Tilt  2016.600000  57.200000       2.800000e+06    8.717903e+06   
15      CBS  2011.545455  56.363636       2.063636e+07    2.758124e+07   
48     MBox  2014.000000   3.000000       2.600000e+06    3.827060e+06   

    worldwide_gross  release_year  month_dt     month  dom_profit_margin  \
0      1.651520e+07   2010.000000  5.000000  5.000000          17.986833   
3      1.573575e+07   2017.500000  5.500000  5.500000          68.518543   
11     1.323772e+07   2016.600000  7.800000  7.800000          61.680377   
15     5.372220e+07   2011.545455  5.181818  5.181818          11.384555   
48     1.529836e+07   2014.000000  5.000000  5.000000          32.062732   

    ww_profit_margin  world_wide_profit_amount    ROI_perc  
0          69.724865        

# Merging The Numbers with TMDb to analyze genres

#  Merge datasets

In [261]:
genre_df = tn_movie_budgets.merge(tmdb_movies, left_on=['movie', 'release_year'], right_on=['title', 'release_year'])


To analyze profitability by genre, we need both financial info and genre info in the same DataFrame.

In [262]:
genre_df.loc[:,'genre_ids'] = genre_df['genre_ids'].map(lambda genre_string: genre_string.strip('[]').split(', '))


TMDb assigns multiple genres to a movie.

genre_ids in TMDb is a string like "[28, 12, 878]". Splitting it into a list prepares it for exploding later, so each movie-genre combination becomes a separate row for analysis.

- strip('[]') removes the square brackets.
- split(', ') converts the string into a list of genre IDs (still strings, e.g. '28').
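
An alternative to strip/split, shown here only as a sketch, is `ast.literal_eval`, which parses the string as a real Python list. It also handles a movie with no genres cleanly: `"[]"` becomes an empty list, whereas strip/split would return `['']`. The two rows below are illustrative.

```python
import ast
import pandas as pd

tmdb_sample = pd.DataFrame({
    "title": ["Inception", "No Genres"],
    "genre_ids": ["[28, 878, 12]", "[]"],
})
# literal_eval safely parses each string into a list of ints
tmdb_sample["genre_ids"] = tmdb_sample["genre_ids"].map(ast.literal_eval)

# explode() gives one row per genre; an empty list becomes a single NaN row
exploded = tmdb_sample.explode("genre_ids")
print(exploded)
```

With this approach the IDs are integers, so a lookup dictionary would need integer keys (28 rather than '28').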

In [263]:
genre_df = genre_df.loc[(genre_df['worldwide_gross'] > 0) & (genre_df['domestic_gross'] > 0)]
genre_ids_df = genre_df.explode('genre_ids')


Keep only movies with revenue: we only want movies that actually earned money, both domestically and worldwide, so that the profitability metrics by genre are meaningful.

explode('genre_ids') creates one row per movie per genre. If a movie has 3 genres, it now appears in 3 rows, one for each genre.

This allows financial metrics to be aggregated per genre rather than per movie.
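To see the effect of exploding, here is a minimal sketch on a toy frame (hypothetical titles and figures, not taken from the datasets).

```python
import pandas as pd

toy = pd.DataFrame({
    "movie": ["Film A", "Film B"],
    "worldwide_gross": [300, 50],
    "genre_ids": [["28", "12", "878"], ["27"]],
})

# One row per (movie, genre); the original index label is repeated
exploded = toy.explode("genre_ids")
print(exploded)

# Caution: the financial columns are duplicated on every genre row
print(exploded["worldwide_gross"].sum())  # counts Film A three times
```

The repeated index explains why Avatar shows up several times with index 0 in the printed output. It also means that totals taken over the exploded frame count a multi-genre movie once per genre, so per-genre averages are safer than grand totals.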


# Map genre IDs to names

In [264]:
# Step 1: Map genre_ids to readable genre names using a dictionary
genre_map = {
    '28': 'Action', '12': 'Adventure', '16': 'Animation', '35': 'Comedy', '80': 'Crime',
    '99': 'Documentary', '18': 'Drama', '10751': 'Family', '14': 'Fantasy', '36': 'History',
    '27': 'Horror', '10402': 'Music', '9648': 'Mystery', '10749': 'Romance', '878': 'Sci-Fi',
    '10770': 'TV Movie', '53': 'Thriller', '10752': 'War', '37': 'Western'
}
# Step 2: Add a new column for readable genre names
genre_ids_df['genre_name'] = genre_ids_df['genre_ids'].map(genre_map)
# Step 3: Inspect the resulting dataframe
print(genre_ids_df[['movie', 'production_budget', 'domestic_gross', 'worldwide_gross', 'ROI_perc', 'genre_name']].head())

                                         movie  production_budget  \
0                                       Avatar          425000000   
0                                       Avatar          425000000   
0                                       Avatar          425000000   
0                                       Avatar          425000000   
1  Pirates Of The Caribbean: On Stranger Tides          410600000   

   domestic_gross  worldwide_gross    ROI_perc genre_name  
0       760507625       2776345279  553.257713     Action  
0       760507625       2776345279  553.257713  Adventure  
0       760507625       2776345279  553.257713    Fantasy  
0       760507625       2776345279  553.257713     Sci-Fi  
1       241063875       1045663875  154.667286  Adventure  


genre_map provides a mapping from TMDb’s numeric IDs to human-readable genre names.

map() converts each genre ID in genre_ids_df to its corresponding genre_name.

We now have a clean dataset (genre_ids_df) with financials and readable genres, ready for aggregations such as calculating the mean ROI per genre.
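One detail worth checking after a dictionary lookup is what happens to IDs that are not in the dictionary. The sketch below uses a hypothetical two-entry subset (`genre_lookup`) and made-up IDs to show that `Series.map` returns NaN for unmatched keys.

```python
import pandas as pd

# Hypothetical subset of the ID -> name mapping, for illustration only
genre_lookup = {"28": "Action", "27": "Horror"}

ids = pd.Series(["28", "27", "9999", ""])  # last two have no entry
names = ids.map(genre_lookup)
print(names)

# Count rows that did not receive a genre name
print(names.isna().sum())
```

Rows with a NaN genre name are silently left out of a later `groupby('genre_name')`, so counting them first shows how many movie-genre rows would be lost.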


In [279]:
# Rename the correct genre_name column
# Keep genre_name_y (from converter) and drop genre_name_x
genre_overall = genre_overall.rename(columns={'genre_name_y': 'genre_name'})

# Drop duplicate or unnecessary columns
genre_overall = genre_overall.drop(columns=['genre_name_x', 'id_x', 'id_y', 'Unnamed: 0'], errors='ignore')

# Keep only the useful columns
genre_overall_clean = genre_overall[[
    'movie',
    'release_year',
    'production_budget',
    'domestic_gross',
    'worldwide_gross',
    'ROI_perc',
    'genre_ids',
    'genre_name',
    'month',        # <-- keep this
    'month_dt'      # <-- and this
]]

print(genre_overall_clean.head())


                                         movie  release_year  \
0                                       Avatar          2009   
1                                       Avatar          2009   
2                                       Avatar          2009   
3                                       Avatar          2009   
4  Pirates Of The Caribbean: On Stranger Tides          2011   

   production_budget  domestic_gross  worldwide_gross    ROI_perc genre_ids  \
0          425000000       760507625       2776345279  553.257713        28   
1          425000000       760507625       2776345279  553.257713        12   
2          425000000       760507625       2776345279  553.257713        14   
3          425000000       760507625       2776345279  553.257713       878   
4          410600000       241063875       1045663875  154.667286        12   

  genre_name  month  month_dt  
0     Action     12        12  
1  Adventure     12        12  
2    Fantasy     12        12  
3     Sci-Fi

tmdb_movies → raw TMDb data with columns like title, release_date, genre_ids (as strings like "[28, 12, 878]").

genre_df → merged tn_movie_budgets + tmdb_movies to bring financials together with genre_ids.

genre_ids_df → exploded version of genre_df['genre_ids'], so each row now represents one movie–one genre instead of a list of IDs.

In [270]:
print(genre_overall.columns)


Index(['movie', 'production_budget', 'domestic_gross', 'worldwide_gross',
       'release_year', 'month_dt', 'month', 'dom_profit_margin',
       'ww_profit_margin', 'world_wide_profit_amount', 'ROI_perc', 'genre_ids',
       'original_language', 'original_title', 'popularity', 'release_date',
       'title', 'vote_average', 'vote_count', 'genre_name'],
      dtype='object')


TMDb only gives numeric IDs in genre_ids.

We need readable genre names to analyze which genres are most profitable.

# Analyze profitability by genre

# Group by genre: mean version (average)

In [271]:
#Group by genre_name, calculate mean of financial metrics
genre_groups = genre_overall_clean.groupby('genre_name').mean(numeric_only=True)

#  Sort by ROI_perc and pick top 7 genres
genre_groups = genre_groups.sort_values('ROI_perc', ascending=False).head(7)

print(genre_groups)



            release_year  production_budget  domestic_gross  worldwide_gross  \
genre_name                                                                     
Horror       2014.006061       2.291297e+07    3.915706e+07     9.026821e+07   
Thriller     2013.623288       3.731461e+07    4.332908e+07     1.084243e+08   
Mystery      2013.771186       3.295345e+07    4.284843e+07     1.021399e+08   
Romance      2013.214953       2.846243e+07    4.188975e+07     9.342080e+07   
Animation    2014.290909       1.003909e+08    1.393303e+08     3.849198e+08   
Sci-Fi       2014.258537       9.271988e+07    1.123468e+08     3.077264e+08   
Music        2014.019608       2.693529e+07    4.828898e+07     9.604752e+07   

               ROI_perc  
genre_name               
Horror      1069.092677  
Thriller     436.286887  
Mystery      436.142740  
Romance      291.691268  
Animation    287.135700  
Sci-Fi       261.474050  
Music        249.632305  


We are grouping by genre_name and calculating the average financial metrics (like ROI, budget, and gross) because we want to find out which genres are the most profitable on average.

By grouping, we turn many individual movies into a single “genre profile.”

By taking the mean, we can compare genres fairly, instead of looking at random single movies.

By sorting by ROI, we highlight which genres give the highest return on investment — this tells us where money is being made most efficiently.

Finally, limiting to the top 7 gives us a focused view of the genres that perform the best, so the analysis is actionable.
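Since the ranking rests on ROI_perc, it helps to sanity-check that column. It is computed earlier in the notebook; the sketch below assumes the usual definition, ROI = (worldwide gross - budget) / budget × 100, which is consistent with the values printed for Avatar and Pirates of the Caribbean. The helper name `roi_percent` is hypothetical.

```python
def roi_percent(budget, worldwide_gross):
    """Return on investment as a percentage of the production budget.

    Assumed formula, inferred from the ROI_perc values shown in the output.
    """
    return (worldwide_gross - budget) / budget * 100


# Budget and worldwide gross as printed in the genre table
avatar = roi_percent(425_000_000, 2_776_345_279)
pirates = roi_percent(410_600_000, 1_045_663_875)

# Compare with ROI_perc in the table: 553.257713 and 154.667286
print(round(avatar, 6), round(pirates, 6))
```

Under this definition an ROI of 100% means a film earned back its budget twice over, and a negative ROI means the worldwide gross did not cover the budget.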

What it does (mean):
Takes the average ROI, budget, gross, etc. across all movies in each genre.

Pros:

Captures the overall profitability of the genre.

Good if you want the "expected value" of investing in that genre.

Cons:

Sensitive to outliers (e.g., one mega-hit Marvel movie can make a hypothetical "Superhero" genre look extremely profitable, even if most of its films lose money).


#  Median version (middle value)


In [272]:
# Group by genre_name and calculate the median of numeric columns
genre_groups_med = genre_overall_clean.groupby('genre_name').median(numeric_only=True)

# Sort by ROI_perc and keep top 7 genres
genre_groups_med = genre_groups_med.sort_values('ROI_perc', ascending=False).head(7)

print(genre_groups_med)


            release_year  production_budget  domestic_gross  worldwide_gross  \
genre_name                                                                     
Horror            2014.0         10000000.0      29136626.0       59922558.0   
Animation         2015.0         87500000.0     121440343.5      327829122.5   
Adventure         2015.0        110000000.0      93432655.0      282778100.0   
Family            2014.0         78000000.0      82051601.0      200859554.0   
Fantasy           2014.0         90000000.0      68549695.0      213691277.0   
Mystery           2015.0         21500000.0      30322525.0       63757397.0   
Comedy            2014.0         28000000.0      37915414.0       67130045.0   

              ROI_perc  
genre_name              
Horror      231.669132  
Animation   200.418943  
Adventure   167.114096  
Family      166.547080  
Fantasy     165.951426  
Mystery     156.768909  
Comedy      152.905265  


We already looked at average ROI per genre using the mean. That gave us a sense of overall profitability but was sensitive to outliers (e.g., one mega-hit movie making a genre look profitable even if most others flopped).

What it does:
Takes the median (middle) ROI, budget, gross, etc. for movies in each genre.

Pros:

Shows what the typical movie in the genre earns.

More robust against extreme values (one flop or one blockbuster won’t skew results).

Cons:

Doesn’t capture the impact of extreme successes, which are important in the film industry (because a few blockbusters can fund the entire studio).

N.B.

Mean = overall average performance of the genre → influenced by big winners and losers.

Median = typical performance of the genre → tells you what a "normal" movie in that genre does.
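The gap between the two statistics shows up in the results above: Horror has a mean ROI of about 1069% but a median of about 232%. The sketch below reproduces the same effect on made-up numbers (the genres "A" and "B" and all ROI values are hypothetical) by computing both statistics side by side.

```python
import pandas as pd

toy = pd.DataFrame({
    "genre_name": ["A"] * 5 + ["B"] * 5,
    "ROI_perc": [-50, -40, -30, -20, 5000,   # genre A: one runaway hit
                 80, 90, 100, 110, 120],     # genre B: steady earners
})

# Mean and median per genre in one table
by_genre = toy.groupby("genre_name")["ROI_perc"].agg(["mean", "median"])
print(by_genre)
```

Genre A ranks far ahead on the mean even though its typical film loses money, while genre B looks the same under both measures. Reading the two columns together is one way to tell a genre carried by a few hits from one that performs consistently.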

In [280]:
# Filter Horror movies only
horror_month_df = genre_overall_clean[genre_overall_clean['genre_name'] == 'Horror']

# Drop very low earners
horror_month_df = horror_month_df[horror_month_df['worldwide_gross'] > 100000]

# Group by release month and take the median of numeric columns
horror_month_df = horror_month_df.groupby('month').median(numeric_only=True).reset_index()

# Sort by calendar order (month_dt ensures Jan -> Dec)
horror_month_df = horror_month_df.sort_values('month_dt')

# Map month numbers to names
month_dict = {
    1:"Jan", 2:"Feb", 3:"Mar", 4:"Apr", 5:"May", 6:"Jun",
    7:"Jul", 8:"Aug", 9:"Sep", 10:"Oct", 11:"Nov", 12:"Dec"
}
horror_month_df['month_name'] = horror_month_df['month'].map(month_dict)

print(horror_month_df.head())


   month  release_year  production_budget  domestic_gross  worldwide_gross  \
0      1        2014.0         12500000.0      33694789.0       77892256.0   
1      2        2014.0         10000000.0      26797294.0       48461873.5   
2      3        2015.0          5000000.0      14674077.0       23250755.0   
3      4        2013.5          5000000.0      35485286.5       67527083.0   
4      5        2015.0         35000000.0      29136626.0       84154026.0   

     ROI_perc  month_dt month_name  
0  325.677601       1.0        Jan  
1  475.462447       2.0        Feb  
2  499.201020       3.0        Mar  
3  333.270935       4.0        Apr  
4  145.898193       5.0        May  


I filter the dataset down to Horror movies and drop tiny releases, keeping only films with worldwide_gross > 100000.

I group those movies by release month and take the median of numeric metrics (so we see the typical horror movie performance per month).

I reset the index and sort by month_dt so months appear in calendar order (Jan - Dec).

I map month numbers to readable month names (Jan, Feb, ...) so the table is easy to read and plot.
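A monthly median is only as reliable as the number of films behind it. As a possible companion to the steps above, this sketch (toy data, hypothetical column values) reports the film count next to the median for each month using named aggregation.

```python
import pandas as pd

toy = pd.DataFrame({
    "month": [1, 1, 1, 10, 10, 3],
    "ROI_perc": [100.0, 300.0, 200.0, 50.0, 450.0, 900.0],
})

# Median ROI and number of films per release month, in calendar order
summary = (
    toy.groupby("month")["ROI_perc"]
       .agg(median_roi="median", n_films="count")
       .reset_index()
       .sort_values("month")
)

# Readable month labels for tables and plots
month_names = {1: "Jan", 3: "Mar", 10: "Oct"}
summary["month_name"] = summary["month"].map(month_names)
print(summary)
```

In the toy data, March has the highest median but rests on a single film, which is the kind of case the count column is meant to flag.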

# An overview of the descriptive statistics for the dataframes we will use in our simple linear regression analysis

In [281]:

display(horror_month_df.describe())
display(avg_studio.describe())
display(genre_overall_clean.describe())

Unnamed: 0,month,release_year,production_budget,domestic_gross,worldwide_gross,ROI_perc,month_dt
count,12.0,12.0,12.0,12.0,12.0,12.0,12.0
mean,6.5,2014.375,12250000.0,29674630.0,64444840.0,354.721341,6.5
std,3.605551,1.130668,7981513.0,12670510.0,29535790.0,268.683956,3.605551
min,1.0,2012.5,5000000.0,6810754.0,8890094.0,87.42072,1.0
25%,3.75,2013.875,9000000.0,21276890.0,46092800.0,215.755023,3.75
50%,6.5,2014.0,10500000.0,31374900.0,71661260.0,299.898476,6.5
75%,9.25,2015.125,13125000.0,35229700.0,82412410.0,390.501129,9.25
max,12.0,2016.0,35000000.0,49595540.0,105015000.0,1112.211863,12.0


Unnamed: 0,year,id,production_budget,domestic_gross,worldwide_gross,release_year,month_dt,month,dom_profit_margin,ww_profit_margin,world_wide_profit_amount,ROI_perc
count,14.0,14.0,14.0,14.0,14.0,14.0,14.0,14.0,14.0,14.0,14.0,14.0
mean,2014.184938,49.006175,27565870.0,41439360.0,99748350.0,2014.184938,6.353598,6.353598,34.775649,66.245899,72182480.0,476.523624
std,2.424399,19.496003,37084560.0,47617390.0,136920500.0,2.424399,1.950358,1.950358,22.123616,13.546435,100249200.0,371.478565
min,2010.0,3.0,2500000.0,3827060.0,13237720.0,2010.0,4.0,4.0,1.574618,46.83353,6704317.0,205.213397
25%,2012.120352,44.925,3875000.0,9457203.0,15930610.0,2012.120352,5.045455,5.045455,19.55969,54.784204,12351400.0,243.356975
50%,2014.102564,52.815068,9325000.0,22609920.0,44361420.0,2014.102564,6.153846,6.153846,34.27645,67.14795,30793240.0,320.355018
75%,2016.375,56.990909,38410710.0,69465660.0,126014300.0,2016.375,6.925,6.925,50.476864,75.044508,87603570.0,555.441756
max,2017.5,83.0,133400000.0,168291500.0,507802800.0,2017.5,12.0,12.0,68.518543,89.515856,374402800.0,1574.515218


Unnamed: 0,release_year,production_budget,domestic_gross,worldwide_gross,ROI_perc,month,month_dt
count,4138.0,4138.0,4138.0,4138.0,4138.0,4138.0,4138.0
mean,2013.831078,55346690.0,69369000.0,179467200.0,290.635152,7.044949,7.044949
std,2.72895,61375370.0,96194190.0,269222200.0,1084.418256,3.453326,3.453326
min,2001.0,30000.0,388.0,528.0,-99.8964,1.0,1.0
25%,2012.0,11800000.0,8574339.0,18190830.0,12.138712,4.0,4.0
50%,2014.0,31750000.0,35608240.0,74966850.0,134.604971,7.0,7.0
75%,2016.0,79000000.0,85067180.0,216562300.0,312.646417,10.0,10.0
max,2019.0,425000000.0,760507600.0,2776345000.0,41556.474,12.0,12.0
