# IMDB Movie Data EDA

**Authors:** Franko Ndou, Anthony Brocco

# Overview

IMDB maintains a SQL database containing vast amounts of movie data. Together with two CSV files we have obtained, it lets us perform an EDA aimed at complex business problems. Our goal is to find the films currently performing best at the box office and to translate our findings into understandable data visualizations and recommendations.

# Business Problem

Universal Pictures is looking to create the next big film. They have a massive budget to find the best directors, actors, and business practices, aiming not only for the greatest film of our generation but also for the largest ROI possible. Our job is to perform an exploratory data analysis on large data sets to help Universal decide how best to achieve this goal.

# Creating the Production team

To make a best-selling film, you need a best-selling production team, so we need to find out who the best director and writer for the job are. We leave actors out of this analysis because directors often write roles with certain actors in mind. Casting is part of the director's artistic vision, so choosing actors from statistics is unlikely to raise ROI and may even negatively affect the film.

## Setting up the workspace

In [60]:
#Importing libraries
import pandas as pd
import sqlite3 
import seaborn as sns
import matplotlib.pyplot as plt
import warnings

# Ignores warnings 
warnings.filterwarnings("ignore")

In [3]:
# Creating data frames and establishing connections
budgets = pd.read_csv('../zippedData/movie_budget_cleaned.csv')
gross = pd.read_csv('../zippedData/gross_movie_cleaned.csv')
conn = sqlite3.connect('../zippedData/im.db')

In [108]:
# Finding directors with highly rated movies (rating >= 8, more than 2,500 votes), ordered by vote count
# Note: the `movie` column in the result holds the director's name
pd.read_sql("""
    SELECT movie, max_averagerating, numvotes
    FROM (
        SELECT p.primary_name AS movie, MAX(averagerating) AS max_averagerating, numvotes
        FROM movie_ratings AS mr
        JOIN directors AS d ON mr.movie_id = d.movie_id
        JOIN persons AS p ON p.person_id = d.person_id
        WHERE averagerating >= 8 AND numvotes > 2500
        GROUP BY p.primary_name, numvotes
    ) AS subquery
    GROUP BY movie
    ORDER BY numvotes DESC
    
""", conn).head(10)

Unnamed: 0,movie,max_averagerating,numvotes
0,Christopher Nolan,8.6,1299334
1,Joss Whedon,8.1,1183655
2,James Gunn,8.1,948394
3,Tim Miller,8.0,820847
4,J.J. Abrams,8.0,784780
5,George Miller,8.1,780910
6,David Fincher,8.1,761592
7,David Yates,8.1,691835
8,Ridley Scott,8.0,680116
9,Éric Toledano,8.5,677343
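
One caveat about the query above: the outer `GROUP BY movie` selects `max_averagerating` and `numvotes` without aggregating them, so SQLite is free to return any one of a director's qualifying films. That appears to be why Christopher Nolan is listed with 8.6 and 1,299,334 votes here, while Inception (8.8, 1,841,066 votes) shows up in the later tables. The sketch below is one possible way to pick each director's most-voted qualifying film deterministically. It runs against an in-memory database with hypothetical toy rows; only the table and column names mirror `im.db`, and the duplicate credit row is an assumption about the data.

```python
import sqlite3
import pandas as pd

# Hypothetical toy rows; only the table and column names mirror im.db.
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE persons (person_id TEXT, primary_name TEXT);
    CREATE TABLE directors (movie_id TEXT, person_id TEXT);
    CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL, numvotes INTEGER);

    INSERT INTO persons VALUES ('p1', 'Director A'), ('p2', 'Director B');
    INSERT INTO directors VALUES
        ('m1', 'p1'), ('m1', 'p1'),   -- assumed duplicate credit row
        ('m2', 'p1'), ('m3', 'p2'), ('m4', 'p2');
    INSERT INTO movie_ratings VALUES
        ('m1', 8.6, 1000000), ('m2', 8.8, 2000000),
        ('m3', 8.1, 500000),  ('m4', 7.0, 900000);
""")

top_directors = pd.read_sql("""
    WITH credited AS (
        -- One row per director and qualifying film; DISTINCT drops repeated credits
        SELECT DISTINCT p.person_id, p.primary_name, mr.averagerating, mr.numvotes
        FROM movie_ratings AS mr
        JOIN directors AS d ON mr.movie_id = d.movie_id
        JOIN persons AS p ON p.person_id = d.person_id
        WHERE mr.averagerating >= 8 AND mr.numvotes > 2500
    )
    -- Keep only each director's most-voted qualifying film
    SELECT c.primary_name AS director, c.averagerating, c.numvotes
    FROM credited AS c
    WHERE c.numvotes = (
        SELECT MAX(c2.numvotes) FROM credited AS c2 WHERE c2.person_id = c.person_id
    )
    ORDER BY c.numvotes DESC
""", demo)

print(top_directors)
```

The same pattern would apply to the writers query by swapping the `directors` table for `writers`.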


In [109]:
# Finding writers with highly rated movies (rating >= 8, more than 2,500 votes), ordered by vote count
# Note: the `movie` column in the result holds the writer's name
pd.read_sql("""
    SELECT movie, max_averagerating, numvotes
    FROM (
        SELECT p.primary_name AS movie, MAX(averagerating) AS max_averagerating, numvotes
        FROM movie_ratings AS mr
        JOIN writers AS w ON mr.movie_id = w.movie_id
        JOIN persons AS p ON p.person_id = w.person_id
        WHERE averagerating >= 8 AND numvotes > 2500
        GROUP BY p.primary_name, numvotes
    ) AS subquery
    GROUP BY movie
    ORDER BY numvotes DESC
""", conn).head(10)

Unnamed: 0,movie,max_averagerating,numvotes
0,David S. Goyer,8.4,1387769
1,Bob Kane,8.4,1387769
2,Jonathan Nolan,8.6,1299334
3,Christopher Nolan,8.6,1299334
4,Zak Penn,8.1,1183655
5,Joss Whedon,8.1,1183655
6,Terence Winter,8.2,1035358
7,Jordan Belfort,8.2,1035358
8,Laeta Kalogridis,8.1,1005960
9,Dennis Lehane,8.1,1005960


Observing the data, Christopher Nolan appears to be one of the most critically acclaimed directors working today, and he seems to be a fantastic writer as well. His brother, Jonathan Nolan, also appears on the writers list. The two likely collaborate often, but we should delve deeper into this data and look at the box office performance and critical reception of his movies.

In [101]:
# Movies Christopher Nolan has directed
c_nolan_films = pd.read_sql("""
    SELECT mb.primary_title AS movie, mb.genres, MAX(mr.averagerating) AS averagerating, MAX(mr.numvotes) AS numvotes
    FROM movie_basics AS mb
    JOIN movie_ratings AS mr ON mb.movie_id = mr.movie_id
    JOIN directors AS d ON mb.movie_id = d.movie_id
    JOIN persons AS p ON p.person_id = d.Person_id
    WHERE p.primary_name = "Christopher Nolan"
    GROUP BY mb.primary_title, mb.genres
    ORDER BY MAX(mr.numvotes) DESC
""", conn)
                            
c_nolan_films.head()

Unnamed: 0,movie,genres,averagerating,numvotes
0,Inception,"Action,Adventure,Sci-Fi",8.8,1841066
1,The Dark Knight Rises,"Action,Thriller",8.4,1387769
2,Interstellar,"Adventure,Drama,Sci-Fi",8.6,1299334
3,Dunkirk,"Action,Drama,History",7.9,466580


In [102]:
# Movies Christopher Nolan has written
pd.read_sql("""
    SELECT mb.primary_title AS movie, mb.genres, MAX(mr.averagerating) AS averagerating, MAX(mr.numvotes) AS numvotes
    FROM movie_basics AS mb
    JOIN movie_ratings AS mr ON mb.movie_id = mr.movie_id
    JOIN writers AS d ON mb.movie_id = d.movie_id
    JOIN persons AS p ON p.person_id = d.Person_id
    WHERE p.primary_name = "Christopher Nolan"
    GROUP BY mb.primary_title, mb.genres
    ORDER BY MAX(mr.numvotes) DESC
""", conn).head()

Unnamed: 0,movie,genres,averagerating,numvotes
0,Inception,"Action,Adventure,Sci-Fi",8.8,1841066
1,The Dark Knight Rises,"Action,Thriller",8.4,1387769
2,Interstellar,"Adventure,Drama,Sci-Fi",8.6,1299334
3,Man of Steel,"Action,Adventure,Sci-Fi",7.1,647288
4,Dunkirk,"Action,Drama,History",7.9,466580


In [103]:
# Movies that Jonathan Nolan has written
pd.read_sql("""
    SELECT mb.primary_title AS movie, mb.genres, MAX(mr.averagerating) AS averagerating, MAX(mr.numvotes) AS numvotes
    FROM movie_basics AS mb
    JOIN movie_ratings AS mr ON mb.movie_id = mr.movie_id
    JOIN writers AS d ON mb.movie_id = d.movie_id
    JOIN persons AS p ON p.person_id = d.Person_id
    WHERE p.primary_name = "Jonathan Nolan"
    GROUP BY mb.primary_title, mb.genres
    ORDER BY MAX(mr.numvotes) DESC
""", conn).head()

Unnamed: 0,movie,genres,averagerating,numvotes
0,The Dark Knight Rises,"Action,Thriller",8.4,1387769
1,Interstellar,"Adventure,Drama,Sci-Fi",8.6,1299334


There is some overlap between the brothers' writing credits, but it is not complete. We should check the ROI of the average Christopher Nolan film as well as the budget and gross of each film.
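
The overlap can also be checked programmatically rather than by eye. The following is a minimal sketch that rebuilds the two writer result sets from the titles printed above (the Christopher Nolan list was limited by `.head()`, so it may be incomplete) and uses an outer merge with `indicator=True` to separate shared credits from solo ones. The frame and variable names are illustrative, not part of the original notebook.

```python
import pandas as pd

# Titles copied from the two writer tables printed above
christopher = pd.DataFrame({"movie": [
    "Inception", "The Dark Knight Rises", "Interstellar", "Man of Steel", "Dunkirk",
]})
jonathan = pd.DataFrame({"movie": ["The Dark Knight Rises", "Interstellar"]})

# indicator=True adds a `_merge` column: 'both', 'left_only' or 'right_only'
overlap = christopher.merge(jonathan, on="movie", how="outer", indicator=True)

shared = sorted(overlap.loc[overlap["_merge"] == "both", "movie"])
christopher_only = sorted(overlap.loc[overlap["_merge"] == "left_only", "movie"])

print("Written by both:", shared)
print("Christopher only:", christopher_only)
```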

In [107]:
# Merging dataframes to check for ROI
# (with no `on=` argument, merge joins on the columns both frames share)
c_nolan_films = budgets.merge(c_nolan_films)
pd.set_option('display.float_format', '{:.2f}'.format)
c_nolan_films.describe()

Unnamed: 0.1,Unnamed: 0,id,production_budget,domestic_gross,worldwide_gross,ROI,averagerating,numvotes
count,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0
mean,111.5,37.5,187500000.0,279700367.0,771545121.0,3.13,8.43,1248687.25
std,69.58,23.98,58665719.69,122443157.23,249586963.97,0.79,0.39,572862.16
min,10.0,11.0,150000000.0,188017894.0,499837368.0,2.33,7.9,466580.0
25%,100.75,26.75,157500000.0,189555683.5,624743873.25,2.79,8.28,1091145.5
50%,134.0,35.0,162500000.0,241322237.5,750952008.5,2.99,8.5,1343551.5
75%,144.75,45.75,192500000.0,331466921.0,897753256.25,3.33,8.65,1501093.25
max,168.0,69.0,275000000.0,448139099.0,1084439099.0,4.22,8.8,1841066.0
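
Because the merge above infers its join key, a title that is spelled differently in the two sources would silently drop out of the result. A hedged sketch of a more defensive merge follows, using hypothetical toy frames; the assumption that `movie` is the shared key comes from the SQL alias above, not from the CSV itself.

```python
import pandas as pd

# Hypothetical toy frames standing in for `budgets` and `c_nolan_films`
budgets_demo = pd.DataFrame({
    "movie": ["Film A", "Film B", "Film C"],
    "production_budget": [100, 200, 300],
})
films_demo = pd.DataFrame({
    "movie": ["Film A", "Film B", "Film D"],
    "averagerating": [8.0, 7.5, 8.2],
})

# Explicit key, left join so no film disappears, and a uniqueness check on the key
checked = films_demo.merge(
    budgets_demo, on="movie", how="left", indicator=True, validate="one_to_one"
)

# Films with no budget row are flagged instead of being dropped
unmatched = checked.loc[checked["_merge"] == "left_only", "movie"].tolist()
matched = checked[checked["_merge"] == "both"].drop(columns="_merge")

print("No budget data for:", unmatched)
print(matched)
```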


The table tells us that his films tend to have a massive return on investment. We can assume that, given a large budget, he would make the most of it. However, it is better to be confident than to assume. Note that the sample contains only four films; the figures below are medians (the 50% row), with the means in parentheses.

- The median Christopher Nolan film has a budget of $162,500,000 (mean $187,500,000).

- The median Christopher Nolan film has a worldwide gross of $750,952,008 (mean $771,545,121).

- The median Christopher Nolan film has an ROI of 2.99 (mean 3.13).

This tells us a decent amount about what his films are capable of. However, it is better to run a confidence test and truly see whether he is worth betting on.
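
For reference, here is a minimal sketch of how summary figures like these can be computed. It assumes ROI is defined as `(worldwide_gross - production_budget) / production_budget`; the notebook does not show how the `ROI` column in the cleaned CSV was derived, so treat that formula as an assumption. The four films and their dollar amounts are hypothetical.

```python
import pandas as pd

# Hypothetical films; the real values live in the merged `c_nolan_films` frame
films = pd.DataFrame({
    "movie": ["Film A", "Film B", "Film C", "Film D"],
    "production_budget": [100_000_000, 150_000_000, 200_000_000, 400_000_000],
    "worldwide_gross": [300_000_000, 600_000_000, 500_000_000, 2_000_000_000],
})

# Assumed definition: profit relative to budget
films["ROI"] = (
    films["worldwide_gross"] - films["production_budget"]
) / films["production_budget"]

# Mean and median side by side; with four films, one outlier moves the mean a lot
summary = films[["production_budget", "worldwide_gross", "ROI"]].agg(["mean", "median"])
print(summary)
```

With so few observations the mean and median can differ noticeably, which is why it helps to state which one is being reported.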

## Hypothesis testing