# **Urban Productions: Movie Production Analysis**

## **Business Understanding**

### **Overview**
Urban Enterprises has decided to venture into the movie production industry and wants to build a new studio.  
The head of the studio needs to know what types of films to create.  
Our goal is to analyze movie performance data to provide insights on movie production.  
We use datasets from *Box Office Mojo*, *IMDB*, and *The Numbers*.

### **Problem Statement**
The company has little knowledge of what it takes to run a successful movie production business. It therefore needs research on which types of films are currently performing best in the industry. The stakeholders need meaningful insights on movie trends, i.e. the top-performing genres, in order to make decisions that will make the investment profitable.


### **Objectives**

1. Determine the best-performing movie genres by examining total gross and movie ratings, so the business can take steps that maximize profit and ensure a return on investment.

2. Evaluate whether there is a correlation between a movie's total gross and its rating.

3. Compare production budgets with total gross. Examine whether production budgets have increased over the years and whether gross is keeping up.

4. Explore whether release dates/months affect movie performance, in order to suggest the best release strategies.

5. Identify some of the most influential names in the industry associated with top-performing movies.

6. Verify whether movie studios have an effect on total gross.








## **Data Understanding and Data Cleaning** 


In [199]:
import pandas as pd
import sqlite3
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np


### Box Office Mojo Data

In [200]:
bom_df = pd.read_csv("./Data/bom.movie_gross.csv")

In [201]:
bom_df.head(10)

| | title | studio | domestic_gross | foreign_gross | year |
|---|---|---|---|---|---|
| 0 | Toy Story 3 | BV | 415000000.0 | 652000000 | 2010 |
| 1 | Alice in Wonderland (2010) | BV | 334200000.0 | 691300000 | 2010 |
| 2 | Harry Potter and the Deathly Hallows Part 1 | WB | 296000000.0 | 664300000 | 2010 |
| 3 | Inception | WB | 292600000.0 | 535700000 | 2010 |
| 4 | Shrek Forever After | P/DW | 238700000.0 | 513900000 | 2010 |
| 5 | The Twilight Saga: Eclipse | Sum. | 300500000.0 | 398000000 | 2010 |
| 6 | Iron Man 2 | Par. | 312400000.0 | 311500000 | 2010 |
| 7 | Tangled | BV | 200800000.0 | 391000000 | 2010 |
| 8 | Despicable Me | Uni. | 251500000.0 | 291600000 | 2010 |
| 9 | How to Train Your Dragon | P/DW | 217600000.0 | 277300000 | 2010 |


In [202]:
bom_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


In [203]:
bom_df.dtypes

title              object
studio             object
domestic_gross    float64
foreign_gross      object
year                int64
dtype: object

In [204]:
bom_df['foreign_gross'] = pd.to_numeric(bom_df['foreign_gross'], errors = 'coerce')


In [205]:
bom_df.dtypes

title              object
studio             object
domestic_gross    float64
foreign_gross     float64
year                int64
dtype: object
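
Because `errors='coerce'` silently turns unparseable strings into NaN, it can be worth checking what is lost: the non-null count for `foreign_gross` drops from 2037 to 2032 after the conversion. The sketch below uses a small made-up sample (the values and the helper name `find_unparseable` are illustrative assumptions, not taken from the dataset) to list entries that were present but failed to convert, and shows one possible repair if thousands separators turn out to be the cause.

```python
import pandas as pd

def find_unparseable(series: pd.Series) -> pd.Series:
    """Return the original entries that are non-null but fail numeric conversion."""
    converted = pd.to_numeric(series, errors="coerce")
    # Present before conversion, missing after it: these were coerced to NaN
    lost = series.notna() & converted.isna()
    return series[lost]

# Hypothetical sample: one value uses a thousands separator
sample = pd.Series(["652000000", "1,131.6", None, "535700000"])
bad_values = find_unparseable(sample)
print(bad_values)

# One possible repair (an assumption about the cause): strip commas, then convert
repaired = pd.to_numeric(sample.str.replace(",", "", regex=False), errors="coerce")
print(repaired)
```

Running the helper on the raw column before coercing would show whether the lost values are genuinely invalid or merely formatted differently.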

#### Data Cleaning

In [206]:
bom_df.duplicated().sum()

0

In [207]:
bom_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2032 non-null   float64
 4   year            3387 non-null   int64  
dtypes: float64(2), int64(1), object(2)
memory usage: 132.4+ KB


In [208]:
bom_df.isna().sum()

title                0
studio               5
domestic_gross      28
foreign_gross     1355
year                 0
dtype: int64

In [209]:
#missing value percentage
missing_percentage = (bom_df.isna().sum() / len(bom_df)) * 100
missing_percentage

title              0.000000
studio             0.147623
domestic_gross     0.826690
foreign_gross     40.005905
year               0.000000
dtype: float64

The missing values in `studio` and `domestic_gross` make up a small percentage of the data (under 1% each), so those rows will be dropped. `foreign_gross` has a large share of missing values (about 40%), so they will be filled using either the mean or the median, depending on the distribution of the data.
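
The drop-or-impute rule described here can be sketched as a small helper. This is only an illustration under stated assumptions: the function name `missing_report`, the 5% cutoff `drop_threshold`, and the demo frame are all hypothetical and not part of the analysis.

```python
import numpy as np
import pandas as pd

def missing_report(df: pd.DataFrame, drop_threshold: float = 5.0) -> pd.DataFrame:
    """Summarize missing values per column and suggest an action.

    drop_threshold is a hypothetical cutoff (in percent), not a value from the analysis.
    """
    counts = df.isna().sum()
    percent = counts / len(df) * 100
    # No missing values: nothing to do; small share: drop rows; otherwise impute
    action = np.where(counts == 0, "none",
                      np.where(percent < drop_threshold, "drop rows", "impute"))
    return pd.DataFrame({"missing": counts, "percent": percent, "action": action})

# Small made-up frame for illustration: 100 rows
demo = pd.DataFrame({
    "studio": ["BV"] * 99 + [None],                 # 1% missing
    "foreign_gross": [1.0] * 60 + [np.nan] * 40,    # 40% missing
    "year": [2010] * 100,                           # complete
})
report = missing_report(demo)
print(report)
```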

In [210]:
bom_df=bom_df.dropna(subset=['studio'])
bom_df=bom_df.dropna(subset=['domestic_gross'])

In [211]:
median_foreign=bom_df['foreign_gross'].median()
median_foreign

19600000.0

In [212]:
mean_foreign=bom_df['foreign_gross'].mean()
mean_foreign

75979668.67282717

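The mean (about 76.0 million) is almost four times the median (19.6 million), so the distribution is strongly right-skewed rather than symmetrical. We therefore fill the missing values with the median, which is less sensitive to extreme values.

In [213]:
bom_df['foreign_gross']=bom_df['foreign_gross'].fillna(bom_df['foreign_gross'].median())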

In [214]:
bom_df.isna().sum()

title             0
studio            0
domestic_gross    0
foreign_gross     0
year              0
dtype: int64
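
Comparing the mean with the median is one way to judge symmetry; a more direct check is the sample skewness from `Series.skew()`. The sketch below picks a fill value on that basis. The helper name `choose_fill_value`, the 0.5 cutoff, and the sample values are assumptions made for illustration only.

```python
import pandas as pd

def choose_fill_value(series: pd.Series, skew_limit: float = 0.5) -> tuple:
    """Pick the mean for roughly symmetric data and the median otherwise.

    skew_limit is a rule-of-thumb cutoff chosen for illustration.
    """
    skewness = series.skew()  # sample skewness; NaN values are ignored
    if abs(skewness) <= skew_limit:
        return "mean", series.mean()
    return "median", series.median()

# Made-up right-skewed sample: a few very large values pull the mean up
gross = pd.Series([1e6, 2e6, 3e6, 5e6, None, 4e8, None, 9e8])
method, value = choose_fill_value(gross)
filled = gross.fillna(value)
print(method, value)
```

With heavily skewed revenue data, a mean-based fill would assign every missing row a value far above what a typical movie earns, which is why the median is usually the safer default.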

### The Numbers Data

In [215]:
tn_df = pd.read_csv("./Data/tn.movie_budgets.csv")

In [216]:
tn_df.head(5)

| | id | release_date | movie | production_budget | domestic_gross | worldwide_gross |
|---|---|---|---|---|---|---|
| 0 | 1 | Dec 18, 2009 | Avatar | $425,000,000 | $760,507,625 | $2,776,345,279 |
| 1 | 2 | May 20, 2011 | Pirates of the Caribbean: On Stranger Tides | $410,600,000 | $241,063,875 | $1,045,663,875 |
| 2 | 3 | Jun 7, 2019 | Dark Phoenix | $350,000,000 | $42,762,350 | $149,762,350 |
| 3 | 4 | May 1, 2015 | Avengers: Age of Ultron | $330,600,000 | $459,005,868 | $1,403,013,963 |
| 4 | 5 | Dec 15, 2017 | Star Wars Ep. VIII: The Last Jedi | $317,000,000 | $620,181,382 | $1,316,721,747 |


In [217]:
tn_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   id                 5782 non-null   int64 
 1   release_date       5782 non-null   object
 2   movie              5782 non-null   object
 3   production_budget  5782 non-null   object
 4   domestic_gross     5782 non-null   object
 5   worldwide_gross    5782 non-null   object
dtypes: int64(1), object(5)
memory usage: 271.2+ KB


In [218]:
# Strip "$" and "," first; pd.to_numeric alone coerces strings like "$425,000,000" to NaN
tn_df[['production_budget','domestic_gross','worldwide_gross']] = tn_df[['production_budget','domestic_gross','worldwide_gross']].replace(r'[$,]', '', regex=True).astype(float)

In [219]:
tn_df.dtypes

id                     int64
release_date          object
movie                 object
production_budget    float64
domestic_gross       float64
worldwide_gross      float64
dtype: object

In [220]:
tn_df.duplicated().sum()

0

In [221]:
tn_df.isnull().sum()

id                   0
release_date         0
movie                0
production_budget    0
domestic_gross       0
worldwide_gross      0
dtype: int64
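
Currency strings such as "$425,000,000" cannot be parsed by `pd.to_numeric` directly, so the symbol and thousands separators have to be removed before conversion. The sketch below shows this on three rows copied from the `tn_df` preview, and also parses `release_date` so that the month is available for the release-timing objective. The column name `release_month` is an assumption introduced for illustration.

```python
import pandas as pd

# Three rows copied from the tn_df preview above
sample = pd.DataFrame({
    "release_date": ["Dec 18, 2009", "May 20, 2011", "Jun 7, 2019"],
    "production_budget": ["$425,000,000", "$410,600,000", "$350,000,000"],
    "worldwide_gross": ["$2,776,345,279", "$1,045,663,875", "$149,762,350"],
})

money_cols = ["production_budget", "worldwide_gross"]
for col in money_cols:
    # Remove "$" and "," so the remaining digits can be parsed as numbers
    cleaned = sample[col].str.replace(r"[$,]", "", regex=True)
    sample[col] = pd.to_numeric(cleaned, errors="coerce")

# Parse dates such as "Dec 18, 2009", then extract the month number
sample["release_date"] = pd.to_datetime(sample["release_date"], format="%b %d, %Y")
sample["release_month"] = sample["release_date"].dt.month

print(sample.dtypes)
print(sample[money_cols].isna().sum())
```

Checking `isna().sum()` immediately after a coercing conversion is a cheap way to confirm that the parse worked rather than wiping the column.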

### IMDB Data

In [222]:
conn=sqlite3.connect('./Data/im.db')

In [223]:
pd.read_sql("SELECT name FROM sqlite_master WHERE type='table';", conn)


| | name |
|---|---|
| 0 | movie_basics |
| 1 | directors |
| 2 | known_for |
| 3 | movie_akas |
| 4 | movie_ratings |
| 5 | persons |
| 6 | principals |
| 7 | writers |


In [224]:
pd.read_sql("SELECT * FROM movie_basics ;",conn).head(5)


| | movie_id | primary_title | original_title | start_year | runtime_minutes | genres |
|---|---|---|---|---|---|---|
| 0 | tt0063540 | Sunghursh | Sunghursh | 2013 | 175.0 | Action,Crime,Drama |
| 1 | tt0066787 | One Day Before the Rainy Season | Ashad Ka Ek Din | 2019 | 114.0 | Biography,Drama |
| 2 | tt0069049 | The Other Side of the Wind | The Other Side of the Wind | 2018 | 122.0 | Drama |
| 3 | tt0069204 | Sabse Bada Sukh | Sabse Bada Sukh | 2018 | NaN | Comedy,Drama |
| 4 | tt0100275 | The Wandering Soap Opera | La Telenovela Errante | 2017 | 80.0 | Comedy,Drama,Fantasy |


In [225]:
pd.read_sql("SELECT * FROM movie_ratings;",conn).head(5)

| | movie_id | averagerating | numvotes |
|---|---|---|---|
| 0 | tt10356526 | 8.3 | 31 |
| 1 | tt10384606 | 8.9 | 559 |
| 2 | tt1042974 | 6.4 | 20 |
| 3 | tt1043726 | 4.2 | 50352 |
| 4 | tt1060240 | 6.5 | 21 |


In [226]:
pd.read_sql("SELECT * FROM persons;",conn).head()

| | person_id | primary_name | birth_year | death_year | primary_profession |
|---|---|---|---|---|---|
| 0 | nm0061671 | Mary Ellen Bauder | NaN | NaN | miscellaneous,production_manager,producer |
| 1 | nm0061865 | Joseph Bauer | NaN | NaN | composer,music_department,sound_department |
| 2 | nm0062070 | Bruce Baum | NaN | NaN | miscellaneous,actor,writer |
| 3 | nm0062195 | Axel Baumann | NaN | NaN | camera_department,cinematographer,art_department |
| 4 | nm0062798 | Pete Baxter | NaN | NaN | production_designer,art_department,set_decorator |


In [227]:
im_df=pd.read_sql("""SELECT * FROM movie_basics
JOIN movie_ratings
                  USING (movie_id);""",conn)
im_df.head(5)

| | movie_id | primary_title | original_title | start_year | runtime_minutes | genres | averagerating | numvotes |
|---|---|---|---|---|---|---|---|---|
| 0 | tt0063540 | Sunghursh | Sunghursh | 2013 | 175.0 | Action,Crime,Drama | 7.0 | 77 |
| 1 | tt0066787 | One Day Before the Rainy Season | Ashad Ka Ek Din | 2019 | 114.0 | Biography,Drama | 7.2 | 43 |
| 2 | tt0069049 | The Other Side of the Wind | The Other Side of the Wind | 2018 | 122.0 | Drama | 6.9 | 4517 |
| 3 | tt0069204 | Sabse Bada Sukh | Sabse Bada Sukh | 2018 | NaN | Comedy,Drama | 6.1 | 13 |
| 4 | tt0100275 | The Wandering Soap Opera | La Telenovela Errante | 2017 | 80.0 | Comedy,Drama,Fantasy | 6.5 | 119 |
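
A plain `JOIN ... USING (movie_id)` is an inner join, so any movie in `movie_basics` without a row in `movie_ratings` is left out of `im_df`. The sketch below illustrates the difference from a `LEFT JOIN` on an in-memory SQLite database. The tables here are tiny stand-ins with made-up ids and titles, and the connection name `conn_demo` is hypothetical.

```python
import sqlite3
import pandas as pd

# In-memory database with simplified stand-in tables (made-up rows)
conn_demo = sqlite3.connect(":memory:")
conn_demo.executescript("""
    CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT);
    CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL);
    INSERT INTO movie_basics VALUES ('m1', 'A'), ('m2', 'B'), ('m3', 'C');
    INSERT INTO movie_ratings VALUES ('m1', 7.0), ('m3', 6.5);
""")

# Inner join: only movies that have a rating survive
inner = pd.read_sql(
    "SELECT * FROM movie_basics JOIN movie_ratings USING (movie_id);", conn_demo
)

# Left join: every movie is kept; unrated ones get a missing averagerating
left = pd.read_sql(
    "SELECT * FROM movie_basics LEFT JOIN movie_ratings USING (movie_id);", conn_demo
)
conn_demo.close()

print(len(inner), len(left), left["averagerating"].isna().sum())
```

For an analysis that depends on ratings, the inner join is a reasonable choice; the left join is mainly useful for counting how many titles are lost.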


In [228]:
im_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73856 entries, 0 to 73855
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   movie_id         73856 non-null  object 
 1   primary_title    73856 non-null  object 
 2   original_title   73856 non-null  object 
 3   start_year       73856 non-null  int64  
 4   runtime_minutes  66236 non-null  float64
 5   genres           73052 non-null  object 
 6   averagerating    73856 non-null  float64
 7   numvotes         73856 non-null  int64  
dtypes: float64(2), int64(2), object(4)
memory usage: 4.5+ MB


In [229]:
im_df.dtypes

movie_id            object
primary_title       object
original_title      object
start_year           int64
runtime_minutes    float64
genres              object
averagerating      float64
numvotes             int64
dtype: object

In [230]:
im_df.duplicated().sum()

0

In [231]:
im_df.isnull().sum()

movie_id              0
primary_title         0
original_title        0
start_year            0
runtime_minutes    7620
genres              804
averagerating         0
numvotes              0
dtype: int64

In [232]:
#missing value percentage   
missing_percentage_2= (im_df.isnull().sum() / len(im_df)) * 100
missing_percentage_2

movie_id            0.000000
primary_title       0.000000
original_title      0.000000
start_year          0.000000
runtime_minutes    10.317374
genres              1.088605
averagerating       0.000000
numvotes            0.000000
dtype: float64

The `genres` column has only about 1% missing values, so those rows can be dropped. The `runtime_minutes` column has about 10% missing values, so we fill them using either the mean or the median, depending on the skewness.

In [233]:
im_df.dropna(subset=['genres'],inplace=True)

In [234]:
runtime_mean=im_df['runtime_minutes'].mean()
runtime_mean

94.7322732805843

In [235]:
runtime_median=im_df['runtime_minutes'].median()
runtime_median

91.0

The mean (94.7) is higher than the median (91.0), which suggests the data is right-skewed. Hence we fill the missing values using the median.

In [236]:
im_df['runtime_minutes']=im_df['runtime_minutes'].fillna(im_df['runtime_minutes'].median())

In [237]:
im_df.isnull().sum()

movie_id           0
primary_title      0
original_title     0
start_year         0
runtime_minutes    0
genres             0
averagerating      0
numvotes           0
dtype: int64
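
The `genres` column stores several comma-separated labels per movie, so comparing individual genres (the first objective) needs one row per movie-genre pair. The sketch below shows one possible way to do that with `str.split` and `explode`, using rows copied from the `im_df` preview. It is an assumption about a later step, not part of the cleaning above, and the column name `genre` is hypothetical.

```python
import pandas as pd

# Rows copied from the im_df preview above
sample = pd.DataFrame({
    "primary_title": ["Sunghursh", "The Other Side of the Wind", "The Wandering Soap Opera"],
    "genres": ["Action,Crime,Drama", "Drama", "Comedy,Drama,Fantasy"],
    "averagerating": [7.0, 6.9, 6.5],
})

# Split the comma-separated string into a list, then give each genre its own row
exploded = sample.assign(genre=sample["genres"].str.split(",")).explode("genre")

# Mean rating per individual genre
genre_rating = exploded.groupby("genre")["averagerating"].mean()
print(genre_rating)
```

Note that a movie with three genres contributes to three group averages, so per-genre counts will sum to more than the number of movies.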