# IMDbMovies

## Contents <a id='back'></a>

1. [Introduction](#introduction)
2. [Loading the Dataset](#loading-the-dataset)
3. [Data Cleaning](#data-cleaning)
   - 3.1 [Header style](#header-style)
   - 3.2 [Handling Missing Values](#handling-missing-values)
   - 3.3 [Data Transformation](#data-transformation)
   - 3.4 [Conclusions](#conclusions)
4. [Exploratory Data Analysis (EDA)](#exploratory-data-analysis-eda)
   - 4.1 [Overview of the Dataset](#overview-of-the-dataset)
   - 4.2 [Visualizing Data](#visualizing-data)

## 1. Introduction <a id='introduction'></a>

The IMDbMovies dataset is a comprehensive collection with information on more than 9000 movies on IMDb. The dataset includes key features such as title, director, writer, genres, runtime, release year, budget, and gross revenue. 

The primary goal of this analysis is to gain deeper insight into how the film industry has evolved over the years.

[Back to contents](#back)

## 2. Loading the Dataset <a id='loading-the-dataset'></a>

The IMDB Movies dataset is available on [kaggle.com](https://www.kaggle.com/datasets/elvinrustam/imdb-movies-dataset). We will be using the unclean version, so we can carry out our own cleanup. 

The first step is to fix a few rows whose values are correct but shifted one column to the right. The affected movies are:
 
- Sleepy Hollow
- Texas Chainsaw
- G.I. Joe: Retaliation
- The Woman in the Window
- The Social Dilemma
- Bank of Dave
- Eden
- CHIPS
- The Diving Bell and the Butterfly
- Untamed Heart
- Roxanne
- The Postcard Killings
- Cover Girl
- The Strangers: Chapter 1

We'll fix these rows by hand in Google Docs.

Note that "The Fall of Minneapolis" has a similar issue, but also a lot of missing values, and the value for budget (302) doesn't make sense. We'll drop this row. 
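
To keep this step reproducible inside the notebook, the same correction could also be sketched in pandas. This is only a minimal sketch on a toy frame, not the procedure used here: the helper `shift_row_left` and the sample rows are hypothetical, and it assumes the affected rows have already been located by their index label.

```python
import pandas as pd

def shift_row_left(frame: pd.DataFrame, row_label) -> None:
    """Move every value in one row one column to the left, in place."""
    values = frame.loc[row_label].tolist()
    # Drop the first (spurious) cell and pad the end with a missing value.
    frame.loc[row_label] = values[1:] + [None]

# Toy data (hypothetical): row 1 is shifted one column to the right.
toy = pd.DataFrame({
    'title': ['Good Movie', None],
    'director': ['Jane Doe', 'Shifted Movie'],
    'runtime': ['1h 30m', 'John Roe'],
    'spare': [None, '2h 5m'],
})

shift_row_left(toy, 1)
print(toy.loc[1].tolist())
```

Row 0 is left untouched, so the helper can be applied one label at a time to the movies listed above.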

The rest of the preprocessing will be done here. We can load the Dataset now.

In [1]:
import pandas as pd
import plotly.express as px
import os

data_path = 'datasets'
df = pd.read_csv(os.path.join(data_path, "IMDbMovies.csv"))

Let's see what the data looks like.

In [2]:
df.head(5)

Unnamed: 0,Title,Summary,Director,Writer,Main Genres,Motion Picture Rating,Runtime,Release Year,Rating,Number of Ratings,Budget,Gross in US & Canada,Gross worldwide,Opening Weekend Gross in US & Canada
0,Napoleon,An epic that details the checkered rise and fa...,Ridley Scott,David Scarpa,"Action,Adventure,Biography",R,2h 38m,2023.0,6.7/10,38K,,"$37,514,498","$84,968,381","$20,638,887Nov 26, 2023"
1,The Hunger Games: The Ballad of Songbirds & Sn...,Coriolanus Snow mentors and develops feelings ...,Francis Lawrence,"Michael Lesslie,Michael Arndt,Suzanne Collins","Action,Adventure,Drama",PG-13,2h 37m,2023.0,7.2/10,37K,"$100,000,000 (estimated)","$105,043,414","$191,729,235","$44,607,143Nov 19, 2023"
2,The Killer,"After a fateful near-miss, an assassin battles...",David Fincher,"Andrew Kevin Walker,Luc Jacamon,Alexis Nolent","Action,Adventure,Crime",R,1h 58m,2023.0,6.8/10,117K,,,"$421,332",
3,Leo,A 74-year-old lizard named Leo and his turtle ...,"David Wachtenheim,Robert Smigel,Robert Marianetti","Paul Sado,Robert Smigel,Adam Sandler","Animation,Comedy,Family",PG,1h 42m,2023.0,7.0/10,10K,,,,
4,Thanksgiving,"After a Black Friday riot ends in tragedy, a m...",Eli Roth,"Eli Roth,Jeff Rendell","Horror,Mystery,Thriller",R,1h 46m,2023.0,7.0/10,9.1K,,"$25,408,677","$29,666,585","$10,306,272Nov 19, 2023"


There seem to be many missing values in the `Budget` and gross revenue columns, and some columns are in a format that is awkward to work with. Time for cleanup.

[Back to contents](#back)

## 3. Data Cleaning <a id="data-cleaning"></a>

### 3.1 Header style <a id="header-style"></a>

First we'll fix the names of the existing columns.

In [3]:
base_columns_names = df.columns
base_columns_names

Index(['Title', 'Summary', 'Director', 'Writer', 'Main Genres',
       'Motion Picture Rating', 'Runtime', 'Release Year', 'Rating',
       'Number of Ratings', 'Budget', 'Gross in US & Canada',
       'Gross worldwide', 'Opening Weekend Gross in US & Canada'],
      dtype='object')

In [4]:
fixed_columns_names = {}

for name in base_columns_names:
    fixed_name = name.lower()
    fixed_name = fixed_name.replace(' & ', '&') #turns "us & canada" into "us&canada", to avoid so many '_' characters in the next step
    fixed_name = fixed_name.replace(' ', '_')
    fixed_columns_names[name] = fixed_name

fixed_columns_names

{'Title': 'title',
 'Summary': 'summary',
 'Director': 'director',
 'Writer': 'writer',
 'Main Genres': 'main_genres',
 'Motion Picture Rating': 'motion_picture_rating',
 'Runtime': 'runtime',
 'Release Year': 'release_year',
 'Rating': 'rating',
 'Number of Ratings': 'number_of_ratings',
 'Budget': 'budget',
 'Gross in US & Canada': 'gross_in_us&canada',
 'Gross worldwide': 'gross_worldwide',
 'Opening Weekend Gross in US & Canada': 'opening_weekend_gross_in_us&canada'}

In [5]:
df = df.rename(columns = fixed_columns_names)
df.columns

Index(['title', 'summary', 'director', 'writer', 'main_genres',
       'motion_picture_rating', 'runtime', 'release_year', 'rating',
       'number_of_ratings', 'budget', 'gross_in_us&canada', 'gross_worldwide',
       'opening_weekend_gross_in_us&canada'],
      dtype='object')
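
The same renaming can also be sketched without an explicit loop, using the string methods available on a pandas `Index`. The three sample names below are taken from the header above; `regex=False` is set so the patterns are treated as plain text.

```python
import pandas as pd

original = pd.Index(['Title', 'Main Genres', 'Gross in US & Canada'])

# Apply the same three steps as the loop, but to the whole index at once.
renamed = (
    original.str.lower()
    .str.replace(' & ', '&', regex=False)
    .str.replace(' ', '_', regex=False)
)
print(list(renamed))
```

On the real data this would be applied to `df.columns` and the result assigned back to it.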

With the columns renamed, let's move on to improving the format of certain columns. We might end up renaming some of them again.

[Back to contents](#back)

### 3.2 Handling Missing Values <a id="handling-missing-values"></a>

First, let's see how many missing values we have to deal with.

In [6]:
df.isna().sum()

title                                    0
summary                                  0
director                                31
writer                                 324
main_genres                              7
motion_picture_rating                  797
runtime                                165
release_year                             7
rating                                 270
number_of_ratings                      270
budget                                3203
gross_in_us&canada                    3018
gross_worldwide                       1954
opening_weekend_gross_in_us&canada    3387
dtype: int64

Many of these are not critical. To mark the data as unavailable, we can fill text columns with `'unknown'` and numerical columns with `0`, since `0` is not a value you would expect as a real data point in any of these fields.

The columns `director`, `writer`, `main_genres`, and `motion_picture_rating` can be filled with the value `'unknown'`.

In [7]:
columns_to_fix = ['director', 'writer', 'main_genres', 'motion_picture_rating']

for column in columns_to_fix:
    df[column] = df[column].fillna(value = 'unknown')

The missing values in the columns `rating`, `number_of_ratings`, `runtime`, `budget`, `gross_in_us&canada`, `gross_worldwide`, and `opening_weekend_gross_in_us&canada` can also be filled, but we need to keep each column's general format so the new values don't interfere with later processing.

In [8]:
# rating is in the format (value/10)
df['rating'] = df['rating'].fillna(value = '0/10')
df['number_of_ratings'] = df['number_of_ratings'].fillna(value = 0)

In [9]:
# runtime is expressed in hours (h) and minutes (m)
df['runtime'] = df['runtime'].fillna(value = '0m')

In [10]:
# all entries have a currency, and $ is the most common
columns_to_fix = ['budget', 'gross_in_us&canada', 'gross_worldwide']

for column in columns_to_fix:
    df[column] = df[column].fillna(value = '$0') 

In [11]:
# all entries have the $ at the beginning, followed by the date of the opening weekend
df['opening_weekend_gross_in_us&canada'] = df['opening_weekend_gross_in_us&canada'].fillna(value = '$0Jan 1, 1900')
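
As an aside, the per-column fills above could be collected into a single call, since `DataFrame.fillna` accepts a dictionary that maps column names to fill values. The toy frame below is hypothetical and only covers four of the columns.

```python
import pandas as pd

# Toy frame (hypothetical values) with one gap in each column.
toy = pd.DataFrame({
    'director': ['Ridley Scott', None],
    'rating': [None, '7.2/10'],
    'runtime': ['2h 38m', None],
    'budget': [None, '$100,000,000 (estimated)'],
})

# One placeholder per column, each matching that column's text format.
placeholders = {
    'director': 'unknown',
    'rating': '0/10',
    'runtime': '0m',
    'budget': '$0',
}
toy = toy.fillna(value=placeholders)
print(toy)
```

Keeping the placeholders in one dictionary makes it easier to see at a glance which value marks missing data in each column.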

The `release_year` is going to be critical. We drop rows that are missing this value.

In [12]:
df.dropna(subset = ['release_year'], inplace=True)

In [13]:
df.isna().sum()

title                                 0
summary                               0
director                              0
writer                                0
main_genres                           0
motion_picture_rating                 0
runtime                               0
release_year                          0
rating                                0
number_of_ratings                     0
budget                                0
gross_in_us&canada                    0
gross_worldwide                       0
opening_weekend_gross_in_us&canada    0
dtype: int64

No more missing values. Let's check duplicates.

In [14]:
df.duplicated().sum()

0

No obvious duplicates.

Just to make sure, let's double check. We could find duplicates in `title`, but those could still be different movies. A duplicated `summary` would be more suspicious.

In [15]:
df['summary'].duplicated().sum()

32

There are some duplicates here. Let's see what they are.

In [16]:
dup = df[df['summary'].duplicated(keep=False)] 
dup = dup.sort_values('summary')

dup


Unnamed: 0,title,summary,director,writer,main_genres,motion_picture_rating,runtime,release_year,rating,number_of_ratings,budget,gross_in_us&canada,gross_worldwide,opening_weekend_gross_in_us&canada
7803,Aladdin,A kind-hearted street urchin and a power-hungr...,Guy Ritchie,"John August,Guy Ritchie","Adventure,Comedy,Family",PG,2h 8m,2019.0,6.9/10,284K,"$183,000,000 (estimated)","$355,559,216","$1,054,304,000","$91,500,929May 26, 2019"
557,Aladdin,A kind-hearted street urchin and a power-hungr...,"John Musker,Ron Clements","Ron Clements,John Musker,Ted Elliott","Animation,Adventure,Comedy",G,1h 30m,1992.0,8.0/10,454K,"$28,000,000 (estimated)","$217,350,219","$504,050,219","$196,664Nov 15, 1992"
5521,We're the Millers 2,Plot is unknown.,Rawson Marshall Thurber,Adam Sztykiel,Comedy,R,2h 3m,2001.0,0/10,0,$0,$0,$0,"$0Jan 1, 1900"
6403,Real Steel 2,Plot is unknown.,unknown,John Gatins,"Action,Drama,Family",Unrated,1h 43m,2022.0,0/10,0,$0,$0,$0,"$0Jan 1, 1900"
3459,Godzilla x Kong: The New Empire,Plot kept under wraps.,Adam Wingard,"Simon Barrett,Jeremy Slater,Terry Rossio","Action,Adventure,Sci-Fi",unknown,0m,2024.0,0/10,0,$0,$0,$0,"$0Jan 1, 1900"
5536,Iron Lung,Plot kept under wraps.,Mark Fischbach,"Mark Fischbach,David Szymanski","Horror,Sci-Fi",unknown,2h 1m,2022.0,0/10,0,$0,$0,$0,"$0Jan 1, 1900"
2302,Fast X: Part 2,Plot kept under wraps.,Louis Leterrier,"Oren Uziel,Christina Hodson","Action,Crime,Thriller",unknown,0m,2025.0,0/10,0,$0,$0,$0,"$0Jan 1, 1900"
1144,Lethal Weapon 5,Plot kept under wraps.,Mel Gibson,"Shane Black,Richard Wenk","Action,Thriller",R,1h 55m,2018.0,0/10,0,$0,$0,$0,"$0Jan 1, 1900"
5319,Kind of Kindness,Plot kept under wraps.,Yorgos Lanthimos,"Efthimis Filippou,Yorgos Lanthimos",unknown,unknown,0m,2024.0,0/10,0,$0,$0,$0,"$0Jan 1, 1900"
8322,Untitled Margot Robbie Ocean's Eleven Film,Plot under wraps,Jay Roach,"Jack Golden Russell,George Clayton Johnson,Car...",unknown,Not Rated,2h 5m,1955.0,0/10,0,$0,$0,$0,"$0Jan 1, 1900"


Most of the duplicates come from placeholder text. Only two summaries are genuinely duplicated, for 'Aladdin' and 'Lady and the Tramp', and in both cases the two entries are the original film and its remake.

We can be reasonably sure that there are no duplicates in the dataset.

[Back to contents](#back)

### 3.3 Data Transformation <a id="data-transformation"></a>



We could benefit from reformatting some of the columns. For easy reference, let's see again what they look like.

In [17]:
df.head(5)

Unnamed: 0,title,summary,director,writer,main_genres,motion_picture_rating,runtime,release_year,rating,number_of_ratings,budget,gross_in_us&canada,gross_worldwide,opening_weekend_gross_in_us&canada
0,Napoleon,An epic that details the checkered rise and fa...,Ridley Scott,David Scarpa,"Action,Adventure,Biography",R,2h 38m,2023.0,6.7/10,38K,$0,"$37,514,498","$84,968,381","$20,638,887Nov 26, 2023"
1,The Hunger Games: The Ballad of Songbirds & Sn...,Coriolanus Snow mentors and develops feelings ...,Francis Lawrence,"Michael Lesslie,Michael Arndt,Suzanne Collins","Action,Adventure,Drama",PG-13,2h 37m,2023.0,7.2/10,37K,"$100,000,000 (estimated)","$105,043,414","$191,729,235","$44,607,143Nov 19, 2023"
2,The Killer,"After a fateful near-miss, an assassin battles...",David Fincher,"Andrew Kevin Walker,Luc Jacamon,Alexis Nolent","Action,Adventure,Crime",R,1h 58m,2023.0,6.8/10,117K,$0,$0,"$421,332","$0Jan 1, 1900"
3,Leo,A 74-year-old lizard named Leo and his turtle ...,"David Wachtenheim,Robert Smigel,Robert Marianetti","Paul Sado,Robert Smigel,Adam Sandler","Animation,Comedy,Family",PG,1h 42m,2023.0,7.0/10,10K,$0,$0,$0,"$0Jan 1, 1900"
4,Thanksgiving,"After a Black Friday riot ends in tragedy, a m...",Eli Roth,"Eli Roth,Jeff Rendell","Horror,Mystery,Thriller",R,1h 46m,2023.0,7.0/10,9.1K,$0,"$25,408,677","$29,666,585","$10,306,272Nov 19, 2023"


`runtime` is presented in hours + minutes. This format is useful for display, but for comparisons and calculations the runtime in minutes would be much better.

In [18]:
runtime_minutes=[]

for time in df['runtime']:
    hours = 0
    minutes = 0
    h_pos = time.find('h')
    m_pos = time.find('m')
  
    if h_pos > 0:
        hours = int(time[:h_pos])
    if m_pos > 0:
        minutes = int(time[h_pos+1:-1])
    runtime_minutes.append(hours * 60 + minutes)

df.insert(7, 'runtime_minutes', runtime_minutes)

In [19]:
df[['runtime', 'runtime_minutes']].head(5)

Unnamed: 0,runtime,runtime_minutes
0,2h 38m,158
1,2h 37m,157
2,1h 58m,118
3,1h 42m,102
4,1h 46m,106
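
The loop above could also be written as a vectorized sketch with a regular expression. The helper name `runtime_to_minutes` is hypothetical, and the pattern assumes every value looks like `2h 38m`, `3h`, or `45m`.

```python
import pandas as pd

def runtime_to_minutes(runtime: pd.Series) -> pd.Series:
    """Convert strings such as '2h 38m', '3h' or '45m' to total minutes."""
    # Both parts are optional, so a missing part comes back as NaN.
    parts = runtime.str.extract(r'(?:(?P<hours>\d+)h)?\s*(?:(?P<minutes>\d+)m)?')
    # Treat a missing part as zero before converting to integers.
    parts = parts.fillna('0').astype(int)
    return parts['hours'] * 60 + parts['minutes']

sample = pd.Series(['2h 38m', '3h', '45m', '0m'])
print(runtime_to_minutes(sample).tolist())
```

The placeholder `0m` used for missing runtimes still maps to `0`, so it stays recognizable as missing data.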


Let's work on the ratings next. The current format is `value/max`, with max being 10 for all entries. We can transform that into a single number with a range of 0 to 100, so we only deal with integers.

**NOTE:** The movie `This is Spinal Tap` has a max rating of 11, but this is an easter egg, and we can ignore it in our analysis.

In [20]:
rating_int = []

for rating in df['rating']:
    slash_pos = rating.find('/')
    new_rating = rating[0:slash_pos]
    new_rating = int(new_rating.replace('.', ''))
    rating_int.append(new_rating)

df.insert(10, 'rating_int', rating_int)

In [21]:
df[['rating', 'rating_int']].head(5)

Unnamed: 0,rating,rating_int
0,6.7/10,67
1,7.2/10,72
2,6.8/10,68
3,7.0/10,70
4,7.0/10,70
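
Removing the decimal point works as long as every rating has exactly one decimal digit. If a value such as `7/10` ever appeared, it would become `7` instead of `70`. A slightly more defensive sketch, with the hypothetical helper name `rating_to_int`, goes through a float instead:

```python
import pandas as pd

def rating_to_int(rating: pd.Series) -> pd.Series:
    """Convert 'value/10' strings to integers on a 0 to 100 scale."""
    score = rating.str.split('/').str[0].astype(float)
    # Round before casting so float noise cannot push a value down by one.
    return (score * 10).round().astype(int)

sample = pd.Series(['6.7/10', '7/10', '0/10', '5.9/10'])
print(rating_to_int(sample).tolist())
```

Whether the dataset actually contains ratings without a decimal digit is an assumption we have not checked here.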


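Every budget value carries the suffix `(estimated)`. We can move that marker into the column name and strip it from the values.

In [22]:
# regex=False treats the parentheses literally; strip() removes the leftover trailing space
df['budget'] = df['budget'].str.replace('(estimated)', '', regex=False).str.strip()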

In [23]:
df = df.rename(columns={'budget': 'budget_estimated'})

`opening_weekend_gross_in_us&canada` has both the gross revenue and the weekend date. Let's split them.

In [24]:
dates = []
revenues = []
for value in df['opening_weekend_gross_in_us&canada']:
    date_pos = 0
    for letter in value:
        if letter.isalpha():
            break
        date_pos += 1
    revenue = value[:date_pos]
    date = value[date_pos:]
    dates.append(date)
    revenues.append(revenue)

df.insert(16, 'gross_opening', revenues)
df.insert(17, 'opening_date', dates)

In [25]:
df[['opening_weekend_gross_in_us&canada', 'gross_opening', 'opening_date']].head(5)

Unnamed: 0,opening_weekend_gross_in_us&canada,gross_opening,opening_date
0,"$20,638,887Nov 26, 2023","$20,638,887","Nov 26, 2023"
1,"$44,607,143Nov 19, 2023","$44,607,143","Nov 19, 2023"
2,"$0Jan 1, 1900",$0,"Jan 1, 1900"
3,"$0Jan 1, 1900",$0,"Jan 1, 1900"
4,"$10,306,272Nov 19, 2023","$10,306,272","Nov 19, 2023"
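
The split could also be sketched with `str.extract`, which has the side benefit of letting us parse the date in the same step. The helper `split_opening` is hypothetical. It assumes the amount never contains an ASCII letter and that dates always look like `Nov 26, 2023`, with English month abbreviations.

```python
import pandas as pd

def split_opening(column: pd.Series) -> pd.DataFrame:
    """Split '$20,638,887Nov 26, 2023' into an amount string and a date."""
    # Everything before the first ASCII letter is the amount; the rest is the date.
    parts = column.str.extract(r'^(?P<gross_opening>[^A-Za-z]*)(?P<opening_date>.*)$')
    parts['opening_date'] = pd.to_datetime(parts['opening_date'], format='%b %d, %Y')
    return parts

sample = pd.Series(['$20,638,887Nov 26, 2023', '$0Jan 1, 1900'])
result = split_opening(sample)
print(result)
```

With a real datetime column, the placeholder date `Jan 1, 1900` can later be filtered out by year.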


Now we can convert `budget_estimated`, `gross_in_us&canada`, `gross_worldwide` and `gross_opening` to int. We need to preserve the currency, since not every movie uses dollars.

In [26]:
def decimal_pos(value:str):
    nr_pos = 0
    for char in value:
        if char.isdecimal():
            break
        nr_pos += 1
    return nr_pos

In [27]:
currency = []

for value in df['budget_estimated']:
    nr_pos = decimal_pos(value)
    currency.append(value[:nr_pos])

df.insert(12, 'currency', currency)

In [28]:
def money_to_int(value_str: str):
    value = value_str[decimal_pos(value_str):]
    value = value.replace(',', '')
    return int(value)

In [29]:
df['budget_estimated'] = df['budget_estimated'].apply(money_to_int)

In [30]:
df['gross_in_us&canada'] = df['gross_in_us&canada'].apply(money_to_int)

In [31]:
df['gross_opening'] = df['gross_opening'].apply(money_to_int)

In [32]:
df['gross_worldwide'] = df['gross_worldwide'].apply(money_to_int)

In [33]:
df.head(10)

Unnamed: 0,title,summary,director,writer,main_genres,motion_picture_rating,runtime,runtime_minutes,release_year,rating,rating_int,number_of_ratings,currency,budget_estimated,gross_in_us&canada,gross_worldwide,opening_weekend_gross_in_us&canada,gross_opening,opening_date
0,Napoleon,An epic that details the checkered rise and fa...,Ridley Scott,David Scarpa,"Action,Adventure,Biography",R,2h 38m,158,2023.0,6.7/10,67,38K,$,0,37514498,84968381,"$20,638,887Nov 26, 2023",20638887,"Nov 26, 2023"
1,The Hunger Games: The Ballad of Songbirds & Sn...,Coriolanus Snow mentors and develops feelings ...,Francis Lawrence,"Michael Lesslie,Michael Arndt,Suzanne Collins","Action,Adventure,Drama",PG-13,2h 37m,157,2023.0,7.2/10,72,37K,$,100000000,105043414,191729235,"$44,607,143Nov 19, 2023",44607143,"Nov 19, 2023"
2,The Killer,"After a fateful near-miss, an assassin battles...",David Fincher,"Andrew Kevin Walker,Luc Jacamon,Alexis Nolent","Action,Adventure,Crime",R,1h 58m,118,2023.0,6.8/10,68,117K,$,0,0,421332,"$0Jan 1, 1900",0,"Jan 1, 1900"
3,Leo,A 74-year-old lizard named Leo and his turtle ...,"David Wachtenheim,Robert Smigel,Robert Marianetti","Paul Sado,Robert Smigel,Adam Sandler","Animation,Comedy,Family",PG,1h 42m,102,2023.0,7.0/10,70,10K,$,0,0,0,"$0Jan 1, 1900",0,"Jan 1, 1900"
4,Thanksgiving,"After a Black Friday riot ends in tragedy, a m...",Eli Roth,"Eli Roth,Jeff Rendell","Horror,Mystery,Thriller",R,1h 46m,106,2023.0,7.0/10,70,9.1K,$,0,25408677,29666585,"$10,306,272Nov 19, 2023",10306272,"Nov 19, 2023"
5,Oppenheimer,"The story of American scientist, J. Robert Opp...",Christopher Nolan,"Martin Sherwin,Christopher Nolan,Kai Bird","Biography,Drama,History",R,3h,180,2023.0,8.5/10,85,525K,$,100000000,325367020,950554020,"$82,455,420Jul 23, 2023",82455420,"Jul 23, 2023"
6,Saltburn,A student at Oxford University finds himself d...,Emerald Fennell,Emerald Fennell,"Comedy,Drama,Thriller",R,2h 11m,131,2023.0,7.5/10,75,7.7K,$,0,4358312,7545932,"$322,651Nov 19, 2023",322651,"Nov 19, 2023"
7,The Marvels,Carol Danvers gets her powers entangled with t...,Nia DaCosta,"Elissa Karasik,Nia DaCosta,Megan McDonnell","Action,Adventure,Fantasy",PG-13,1h 45m,105,2023.0,6.0/10,60,57K,$,220000000,77926655,188196816,"$46,110,859Nov 12, 2023",46110859,"Nov 12, 2023"
8,Wish,A young girl named Asha wishes on a star and g...,"Fawn Veerasunthorn,Chris Buck","Jennifer Lee,Chris Buck,Allison Moore","Animation,Adventure,Comedy",PG,1h 35m,95,2023.0,5.9/10,59,6.4K,$,200000000,33974306,51276049,"$19,698,228Nov 26, 2023",19698228,"Nov 26, 2023"
9,The Creator,Against the backdrop of a war between humans a...,Gareth Edwards,"Chris Weitz,Gareth Edwards","Action,Adventure,Drama",PG-13,2h 13m,133,2023.0,6.9/10,69,79K,$,80000000,40774679,104181028,"$14,079,512Oct 1, 2023",14079512,"Oct 1, 2023"
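
For reference, the currency and the amount could be separated in one pass with a regular expression. This is only a sketch: the helper `split_money` and the sample values are hypothetical, and it assumes each value is a non-digit currency marker followed by digits and commas.

```python
import pandas as pd

def split_money(column: pd.Series) -> pd.DataFrame:
    """Split strings such as '$100,000,000' into currency and integer amount."""
    parts = column.str.extract(r'^(?P<currency>\D*)(?P<amount>[\d,]+)')
    parts['currency'] = parts['currency'].str.strip()
    # Remove the thousands separators before converting to int.
    parts['amount'] = parts['amount'].str.replace(',', '', regex=False).astype(int)
    return parts

sample = pd.Series(['$100,000,000 ', '€3,500', '$0'])
result = split_money(sample)
print(result)
```

Trailing text after the number, such as a leftover space, is ignored because the pattern only anchors at the start of the string.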


Sometimes `director` has notes enclosed in parentheses. Those could cause issues. Let's clean that.

In [34]:
def clean_parenthesis(text: str):
    open_p = text.find('(')
    close_p = text.find(')', open_p + 1)
    # Only cut when a matching pair exists; find() returns -1 when nothing is found
    if open_p != -1 and close_p != -1:
        text = text[:open_p] + text[close_p + 1:]
        text = clean_parenthesis(text)  # Recurse in case there are several notes
    return text

df['director'] = df['director'].apply(clean_parenthesis)

Some of these columns are no longer needed. We can drop them.

In [35]:
df.drop(columns = ['rating', 'opening_weekend_gross_in_us&canada'], inplace=True)

Now that the dataset is ready, let's save it.

In [50]:

df.to_csv(os.path.join(data_path, 'IMDbMovies_clean.csv'))

[Back to Contents](#back)

### 3.4 Conclusions <a id='conclusions'></a>

We corrected several issues with the dataset:

- Fixed header styles
- Filled missing values
- Reformatted several columns

The headers were fixed to make processing simpler.

All missing values were replaced with appropriate values that keep the format of the column they are in.

We checked for duplicates and didn't find any.

We reformatted several columns to make their values easier to work with.

We are ready to start with EDA.

[Back to Contents](#back)

## 4. Exploratory Data Analysis (EDA) <a id="exploratory-data-analysis-eda"></a>

### 4.1 Overview of the Dataset <a id="overview-of-the-dataset"></a>

Let's start exploring our dataset. We can begin by checking the range of years that we have to work with, and how many movies there are per year.

In [37]:
min_year = df['release_year'].min()
max_year = df['release_year'].max()

print(f'The earliest movie is from: {min_year}.\nThe latest movie is from: {max_year}')

The earliest movie is from: 1915.0.
The latest movie is from: 2027.0


In [38]:
fig = px.histogram(df, x='release_year', histfunc='count')
fig.show()

There isn't much data for the first half of the 20th century. Only after 1950 do we consistently have more than 10 movies per year, and most of the data comes from the last 40 years.

[Back to Contents](#back)

### 4.2 Visualizing Data <a id="visualizing-data"></a>

Movies have become more expensive to produce in recent years. But by how much, and when did this trend start?

In [39]:
# a budget of 0 means the data is missing from our DataSet
filtered_budget = df[df['budget_estimated'] != 0]
# It's quite hard to find exchange data from before 1990. We'll only use dollars
filtered_budget = filtered_budget[filtered_budget['currency'] == '$']

fig = px.histogram(filtered_budget, x='release_year', y='budget_estimated', histfunc='avg')
fig.show()

The trend started slightly before the 80s, accelerated during the 90s, and more or less levelled off around 2000.
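
A bar in the histogram may cover more than one year, depending on the bins Plotly picks. To read exact per-year figures, the same average can be sketched with `groupby`. The toy frame below is hypothetical; on the real data `filtered_budget` would take its place.

```python
import pandas as pd

# Toy data (hypothetical): a zero budget marks missing data.
toy = pd.DataFrame({
    'release_year': [1979.0, 1979.0, 1995.0, 1995.0, 1995.0],
    'currency': ['$', '$', '$', '€', '$'],
    'budget_estimated': [10_000_000, 0, 50_000_000, 30_000_000, 70_000_000],
})

# Same filters as the chart: known budgets, dollars only.
usable = toy[(toy['budget_estimated'] != 0) & (toy['currency'] == '$')]
mean_budget = usable.groupby('release_year')['budget_estimated'].mean()
print(mean_budget)
```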

And how much money did they make in those years? Was the cost increase worth it?

In [40]:
# a gross of 0 means the data is missing from our DataSet
filtered_gross = df[df['gross_worldwide'] != 0]
# It's quite hard to find exchange data from before 1990. We'll only use dollars
filtered_gross = filtered_gross[filtered_gross['currency'] == '$']

fig = px.histogram(filtered_gross, x='release_year', y='gross_worldwide', histfunc='avg')
fig.show()

Gross revenue has, in general, risen together with production costs. There is a sharp drop in 2020, showing the effect of COVID-19 on the film industry, and revenue has not recovered since: in this dataset, the industry is back at the revenue level of the late 80s.

What happened around 1940? Apparently there were some extremely successful movies. Let's find them.

In [41]:
max_gross_index = filtered_gross[filtered_gross['release_year'].isin(range(1937, 1943))].groupby('release_year')[['gross_worldwide']].idxmax()
filtered_gross.loc[max_gross_index['gross_worldwide'], ['release_year', 'title', 'budget_estimated', 'gross_worldwide']]



Unnamed: 0,release_year,title,budget_estimated,gross_worldwide
1203,1937.0,Snow White and the Seven Dwarfs,1499000,184960471
8079,1938.0,The Lady Vanishes,0,39776
251,1939.0,Gone with the Wind,3977000,402382193
3966,1940.0,Pinocchio,2600000,121892045
917,1941.0,Citizen Kane,839727,1645944
4914,1942.0,Bambi,0,267447150


Here they are.

- Snow White (1937): $185M
- Gone with the Wind (1939): $402M
- Pinocchio (1940): $122M
- Bambi (1942): $267M

Three of the four (Snow White, Pinocchio, and Bambi) are Disney animated features, so those were good years for Disney.

Let's take a look at the full picture.

In [42]:
fig = px.scatter(filtered_gross, x='release_year', y='gross_worldwide', hover_data=['title'])
fig.show()

Now let's take a look at the evolution of genres.

In [43]:
# List of all the genre classifications on IMDb
genres = set()

for line in df['main_genres'].unique():
    for genre in line.split(','):
        genres.add(genre.strip())

genres.discard('unknown') # This marks missing data, so it's not needed.
genres

{'Action',
 'Adventure',
 'Animation',
 'Biography',
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Family',
 'Fantasy',
 'Film-Noir',
 'History',
 'Horror',
 'Music',
 'Musical',
 'Mystery',
 'Romance',
 'Sci-Fi',
 'Sport',
 'Thriller',
 'War',
 'Western'}

In [44]:
selected_genre = 'Action'
selected_datapoint = 'budget_estimated'

# Filter out currencies other than $, so we can comfortably chart for revenue and budget.
filtered_dollar_df = df[df['currency'] == '$']
selected_genre_df = filtered_dollar_df[filtered_dollar_df['main_genres'].str.contains(selected_genre)]

fig = px.scatter(selected_genre_df, x='release_year', y=selected_datapoint, hover_data=['title'])
fig.show()
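
One caveat with `str.contains` is that it matches substrings, so selecting `'Music'` would also pick up every `'Musical'` movie. A minimal sketch of an exact match, using the hypothetical helper `has_genre`, splits the genre list first:

```python
import pandas as pd

def has_genre(main_genres: pd.Series, genre: str) -> pd.Series:
    """True where `genre` is one of the comma-separated entries."""
    return main_genres.str.split(',').apply(
        lambda entries: genre in [entry.strip() for entry in entries]
    )

sample = pd.Series(['Action,Adventure', 'Musical,Romance', 'Music,Drama', 'unknown'])
print(has_genre(sample, 'Music').tolist())
print(sample.str.contains('Music').tolist())
```

For `'Action'` both approaches give the same rows, since no other genre in the list contains that word.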

Does a higher budget correlate with higher revenue?

In [45]:
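# Zeros mark missing data, so keep only movies where both values are known
known_both = filtered_dollar_df[(filtered_dollar_df['budget_estimated'] > 0) & (filtered_dollar_df['gross_worldwide'] > 0)]
fig = px.scatter(known_both, x='budget_estimated', y='gross_worldwide', hover_data=['title', 'release_year'])
fig.show()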
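
To put a number on the question, we could compute a Pearson correlation coefficient with `Series.corr`. The sketch below uses a hypothetical toy frame and drops rows where either value is `0`, since zeros mark missing data in this notebook.

```python
import pandas as pd

# Toy data (hypothetical): the first row has no known budget.
toy = pd.DataFrame({
    'budget_estimated': [0, 10, 20, 30],
    'gross_worldwide': [5, 20, 40, 60],
})

# Keep only rows where both values are known.
known = toy[(toy['budget_estimated'] > 0) & (toy['gross_worldwide'] > 0)]
pearson = known['budget_estimated'].corr(known['gross_worldwide'])
print(len(known), pearson)
```

A coefficient close to 1 indicates a strong linear relationship, but it says nothing about whether the extra budget pays for itself.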

Let's see which directors bring in the most revenue.

In [46]:
# List of all the directors
directors = set()

for line in filtered_dollar_df['director'].unique():
    
    for d in line.split(','):
        directors.add(d.replace('(*)', '').strip())

#directors.discard('unknown') # This marks missing data, so it's not needed.
directors

{'Billy Ray',
 'Channing Tatum',
 'Brandon Cronenberg',
 'Indar Dzhendubaev',
 'Assaf Bernstein',
 'Otto Jongerius',
 'Stephen Meek',
 'Garth Jennings',
 'Brandon Christensen',
 'Michele Massimo Tarantini',
 'Lisa Azuelos',
 'Ben Palmer',
 'Leslye Headland',
 'Andrew Ahn',
 'Mark Cullen',
 'Frank Henenlotter',
 'Dominic Sena',
 'Jean-Marc Vallée',
 'A.K. Sajan',
 'Mario Van Peebles',
 'Asif Kapadia',
 'Bob Saget',
 'Milos Forman',
 'Matt Brown',
 'Richard Stanley',
 'Nigel Cole',
 'Meredith Bragg',
 'Ken Annakin',
 'Malcolm D. Lee',
 'Andrew Renzi',
 'Marc Abraham',
 'Mike Marvin',
 'Carlota Pereda',
 'James Watkins',
 'Sarah Smith',
 'André van Duren',
 'Ralph Bakshi',
 'Jay Lee',
 'Patrick Magee',
 'Kent Alterman',
 'Devin Sanchez',
 'Hannah Barlow',
 'Dan Mazer',
 'Blair Hayes',
 'Kevin Sorbo',
 'John Maybury',
 'Ka-Wai Kam',
 'Tom Savini',
 'Mick Jackson',
 'Eddie Sternberg',
 'Mel Ferrer',
 'William Olsson',
 'Leonard Nimoy',
 'Edward Yang',
 'Matt Johnson',
 'Qui Nguyen',
 ...

Let's create a new DataFrame with the data for each director.

In [47]:
#WARNING: Long process!

directors_data = pd.DataFrame(columns = ['name', 'titles', 'gross_max', 'gross_min', 'gross_mean', 'rating_max', 'rating_min', 'rating_mean'])

for d in directors:
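    # Match the name literally: characters such as '.' would otherwise be read as regex syntax
    movies_by_d = df[df['director'].str.contains(d, regex=False)]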
    count = movies_by_d['title'].count()

    # Filter out movies with missing info on gross_worldwide
    movies_gross = movies_by_d[movies_by_d['gross_worldwide'] > 0]
    gross_max = movies_gross['gross_worldwide'].max()
    gross_min = movies_gross['gross_worldwide'].min()
    gross_mean = movies_gross['gross_worldwide'].mean()
    
    # Filter out movies with missing info on rating
    movies_rating = movies_by_d[movies_by_d['rating_int'] > 0]
    rating_max = movies_rating['rating_int'].max()
    rating_min = movies_rating['rating_int'].min()
    rating_mean = movies_rating['rating_int'].mean()

    
    # Join this director's titles into a single comma-separated string
    titles = ', '.join(movies_by_d['title'])
    directors_data.loc[len(directors_data)] = [d, titles, gross_max, gross_min, gross_mean, rating_max, rating_min, rating_mean]

directors_data

Unnamed: 0,name,titles,gross_max,gross_min,gross_mean,rating_max,rating_min,rating_mean
0,Billy Ray,"Breach, Secret in Their Eyes, Shattered Glass",40953935.0,2944752.0,2.625123e+07,71.0,63.0,68.0
1,Channing Tatum,Dog,84774243.0,84774243.0,8.477424e+07,65.0,65.0,65.0
2,Brandon Cronenberg,"Infinity Pool, Possessor",5183424.0,911180.0,3.047302e+06,65.0,60.0,62.5
3,Indar Dzhendubaev,I Am Dragon,10495305.0,10495305.0,1.049530e+07,68.0,68.0,68.0
4,Assaf Bernstein,Look Away,1119537.0,1119537.0,1.119537e+06,58.0,58.0,58.0
...,...,...,...,...,...,...,...,...
4347,Joseph Winter,Deadstream,1031519.0,1031519.0,1.031519e+06,64.0,64.0,64.0
4348,Laura Poitras,All the Beauty and the Bloodshed,1483975.0,1483975.0,1.483975e+06,75.0,75.0,75.0
4349,Mario Miscione,Circle,,,,60.0,60.0,60.0
4350,James Cummins,The Boneyard,,,,56.0,56.0,56.0
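
The loop above scans the whole DataFrame once per director, which is why it is slow. A faster alternative can be sketched with `explode` and `groupby`. This is not the code that produced the table above: the helper `summarise_directors` and the toy data are hypothetical, and it assumes co-directors are separated by commas.

```python
import pandas as pd

def summarise_directors(movies: pd.DataFrame) -> pd.DataFrame:
    """One row per director with title list, gross and rating statistics."""
    # One row per (movie, director) pair.
    rows = movies.assign(director=movies['director'].str.split(','))
    rows = rows.explode('director', ignore_index=True)
    rows['director'] = rows['director'].str.strip()
    # Zeros mark missing data: turn them into NaN so the statistics skip them.
    for column in ['gross_worldwide', 'rating_int']:
        rows[column] = rows[column].where(rows[column] > 0)
    summary = rows.groupby('director').agg(
        titles=('title', lambda t: ', '.join(t)),
        gross_max=('gross_worldwide', 'max'),
        gross_min=('gross_worldwide', 'min'),
        gross_mean=('gross_worldwide', 'mean'),
        rating_max=('rating_int', 'max'),
        rating_min=('rating_int', 'min'),
        rating_mean=('rating_int', 'mean'),
    )
    return summary.reset_index().rename(columns={'director': 'name'})

# Toy data (hypothetical): movie A has two directors.
toy = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'director': ['X, Y', 'X', 'Y'],
    'gross_worldwide': [100, 0, 300],
    'rating_int': [70, 80, 0],
})
summary = summarise_directors(toy).set_index('name')
print(summary)
```

Because names are split and compared exactly, this version also avoids counting a movie for a director whose name merely appears inside another name.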


In [48]:
fig = px.scatter(directors_data, x='gross_mean', y='rating_mean', hover_data=['name', 'titles'])
fig.show()

Let's save the directors dataset.

In [52]:
directors_data.to_csv(os.path.join(data_path, 'directors.csv'))

[Back to Contents](#back)