# Real Box Office Data Analysis

This project analyzes real box office data (provided by the Flatiron School data science program) gathered from Box Office Mojo, IMDB, Rotten Tomatoes, The Movie Database (TMDB), and The Numbers in order to provide business insights for investors looking to open a new movie studio.

## Data Explanation

The data files used for the analysis are not included in this repository, both to reflect industry best practices and to simulate the process I would follow if I were working with private or sensitive data. The zipped data can be obtained from the Flatiron School repository that served as a model example for this project, accessible [here](https://github.com/learn-co-curriculum/dsc-phase-2-project-v3/tree/gating/zippedData).

## Introduction

The purpose of this project is to create actionable, data-based recommendations that the investors in a new movie studio can follow to maximize the success of their new films. For this analysis, I define 'success' as whether a movie is profitable and how profitable it is, not just whether it receives positive user ratings. In my exploratory analysis, I found a correlation between a movie's budget and the user ratings it received, as well as a correlation between user ratings and total profit. I also found the number of foreign markets a movie was released in to be a strong predictor of its profit. Finally, I found the original language a movie was produced in to be another strong predictor of success.

## Data Assembly and Cleaning

In my exploratory analysis, I found that only three of the five data sources I assembled for this project were well suited to my analysis. The Box Office Mojo data is an inferior version of the data from The Numbers, and joining the Rotten Tomatoes data with everything else would produce a large number of lost entries or NaN values, so it was not feasible to include it. In the cells below, I assemble the data into a single dataframe and clean out the missing and placeholder values from the columns I plan to use in my analysis.

In [1]:
# Import necessary packages
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sqlite3
import seaborn as sns
from pandasql import sqldf
from scipy import stats

In [3]:
# Load TMDB data
tmdbdf = pd.read_csv("data/tmdb.movies.csv")
# Trim columns from TMDB data that are not useful for this analysis, reformat data to be easier to work with
tmdbdf.drop(columns=['id', 'Unnamed: 0', 'genre_ids', 'release_date'], inplace=True)
tmdbdf.columns = ['original_language', 'original_title', 'popularity', 'title', 'TMDBvote_average', 'TMDBvote_count']
tmdbdf.set_index('original_title', inplace=True)

# Load data from The Numbers
tndf = pd.read_csv("data/tn.movie_budgets.csv")
# Same trimming process as above
tndf.drop(columns='id', inplace=True)
tndf.set_index('movie', inplace=True)

# Set up connection to IMDB SQL database
conn = sqlite3.connect('data/im.db')

# Query SQL database for IMDB ratings and number of regions movies were released in
q = """
SELECT primary_title, averagerating AS IMDBrating, numvotes AS IMDBnumvotes, COUNT(DISTINCT region) AS num_markets
  FROM movie_basics as MB
  JOIN movie_ratings as MR
      USING(movie_id)
  JOIN movie_akas
      USING(movie_id)
GROUP BY primary_title
"""
imdbdf = pd.read_sql(q, conn)
imdbdf.set_index('primary_title', inplace=True)

# Join dataframes together to create the main source of data I will use in this analysis
fulldf = tndf.join([tmdbdf, imdbdf], how='inner')

# Convert the dollar-formatted columns to integers, calculate a new profit column, and then drop the gross columns that are no longer needed
# Remove the dollar signs and commas, then convert data type to integer
fulldf['worldwide_gross'] = fulldf['worldwide_gross'].map(lambda x: int(x.replace(",", "")[1:]))
fulldf['production_budget'] = fulldf['production_budget'].map(lambda x: int(x.replace(",", "")[1:]))
# Calculate the new profit column: worldwide gross - production budget
fulldf['profit'] = fulldf['worldwide_gross'] - fulldf['production_budget'] 
fulldf.drop(columns=['domestic_gross', 'worldwide_gross'], inplace=True)
fulldf.head()

| | release_date | production_budget | original_language | popularity | title | TMDBvote_average | TMDBvote_count | IMDBrating | IMDBnumvotes | num_markets | profit |
|---|---|---|---|---|---|---|---|---|---|---|---|
| #Horror | Nov 20, 2015 | 1500000 | de | 6.099 | #Horror | 3.3 | 102 | 3.0 | 3092 | 3 | -1500000 |
| 10 Cloverfield Lane | Mar 11, 2016 | 5000000 | en | 17.892 | 10 Cloverfield Lane | 6.9 | 4629 | 7.2 | 260383 | 25 | 103286422 |
| 10 Days in a Madhouse | Nov 11, 2015 | 12000000 | en | 0.955 | 10 Days in a Madhouse | 5.4 | 7 | 6.7 | 1114 | 2 | -11985384 |
| 12 Strong | Jan 19, 2018 | 35000000 | en | 13.183 | 12 Strong | 5.6 | 1312 | 6.6 | 50155 | 26 | 36118378 |
| 12 Years a Slave | Oct 18, 2013 | 20000000 | en | 16.493 | 12 Years a Slave | 7.9 | 6631 | 8.1 | 577301 | 32 | 161025343 |
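
The currency conversion in the cell above can be illustrated with a small standalone helper. This is only a sketch: `parse_dollars` is a hypothetical name, and the sample strings are made-up values in the same `$1,500,000` style as the budget and gross columns. Unlike the `[1:]` slice used above, it removes the dollar sign explicitly, so it does not assume the symbol is always the first character.

```python
def parse_dollars(value: str) -> int:
    """Convert a string such as "$1,500,000" to the integer 1500000."""
    # Strip surrounding whitespace, then drop the dollar sign and thousands separators
    cleaned = value.strip().replace("$", "").replace(",", "")
    return int(cleaned)


# Hypothetical sample values in the same format as the budget columns
samples = ["$1,500,000", "$20,000,000", "$0"]
parsed = [parse_dollars(s) for s in samples]
print(parsed)
```

A helper like this could be passed to `.map()` in place of the lambda, assuming every value in the column follows the same format.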


All of the code used to create this dataframe was taken from the exploration branch of this repository, where I did my exploratory analysis. Now that I have the data I want to use in a workable format, I need to check for missing values and obvious placeholders.

In [6]:
fulldf.describe()

| | production_budget | popularity | TMDBvote_average | TMDBvote_count | IMDBrating | IMDBnumvotes | num_markets | profit |
|---|---|---|---|---|---|---|---|---|
| count | 2140.0 | 2140.0 | 2140.0 | 2140.0 | 2140.0 | 2140.0 | 2140.0 | 2140.0 |
| mean | 38771420.0 | 10.868855 | 6.206355 | 1760.184579 | 6.261075 | 96494.31 | 20.4 | 83706600.0 |
| std | 52498730.0 | 8.31712 | 1.127242 | 2749.367955 | 1.103754 | 153729.8 | 10.884141 | 186528900.0 |
| min | 9000.0 | 0.6 | 0.0 | 1.0 | 1.6 | 5.0 | 1.0 | -110450200.0 |
| 25% | 5000000.0 | 5.86025 | 5.6 | 71.0 | 5.7 | 2986.0 | 12.0 | -1500000.0 |
| 50% | 20000000.0 | 9.5925 | 6.3 | 675.5 | 6.4 | 38740.0 | 23.0 | 14174730.0 |
| 75% | 48000000.0 | 14.50275 | 6.9 | 2148.25 | 7.0 | 119054.5 | 29.0 | 79185450.0 |
| max | 425000000.0 | 80.773 | 10.0 | 22186.0 | 9.2 | 1841066.0 | 48.0 | 2351345000.0 |


A potential issue with this dataset is the small number of records I have ended up with. The IMDB database is rather large, but because my analysis relies on profit as the metric of success, I can only use records that overlap with the data from The Numbers, which covers far fewer movies. Regardless, given the relatively small number of movies released each year, the 2,140 remaining records should be sufficient for this analysis, and later in my analysis I use statistical tests to confirm that this is the case.
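
One way to quantify how many records survive a title-based join is pandas' merge indicator. The sketch below is not the code used in this project. It runs on hypothetical toy frames (`budgets` and `ratings`, with two made-up unmatched titles) and only shows how the match rate could be measured.

```python
import pandas as pd

# Hypothetical toy frames standing in for The Numbers and IMDB title lists
budgets = pd.DataFrame(
    {"movie": ["12 Strong", "10 Cloverfield Lane", "Some Unmatched Film"]}
)
ratings = pd.DataFrame(
    {"primary_title": ["12 Strong", "10 Cloverfield Lane", "Another Film"]}
)

# A left merge with indicator=True adds a "_merge" column recording
# whether each budget record found a matching title ("both") or not ("left_only")
merged = budgets.merge(
    ratings, left_on="movie", right_on="primary_title", how="left", indicator=True
)

matched = int((merged["_merge"] == "both").sum())
match_rate = matched / len(budgets)
print(f"{matched} of {len(budgets)} budget records matched ({match_rate:.0%})")
```

Running the same check against the real frames would show how much of The Numbers data is lost to title mismatches, as opposed to movies that are simply absent from the other sources.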

In [5]:
fulldf.isna().any()

release_date         False
production_budget    False
original_language    False
popularity           False
title                False
TMDBvote_average     False
TMDBvote_count       False
IMDBrating           False
IMDBnumvotes         False
num_markets          False
profit               False
dtype: bool

Thankfully, it looks like there aren't any missing values!
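
Note that `isna()` only flags true NaN values, so numeric placeholders would not appear in the output above. For example, the `#Horror` row in the `head()` output has a profit exactly equal to the negative of its budget, which implies a recorded worldwide gross of 0, and `describe()` reports a minimum `TMDBvote_average` of 0.0. Whether such zeros are placeholders or genuine values is an assumption that would need checking against the source data. A minimal sketch of how these rows could be flagged, using three rows copied from the `head()` output (`sample`, `zero_gross`, and `zero_rating` are hypothetical names):

```python
import pandas as pd

# Three rows copied from the head() output, restricted to the relevant columns
sample = pd.DataFrame(
    {
        "production_budget": [1500000, 5000000, 12000000],
        "TMDBvote_average": [3.3, 6.9, 5.4],
        "profit": [-1500000, 103286422, -11985384],
    },
    index=["#Horror", "10 Cloverfield Lane", "10 Days in a Madhouse"],
)

# profit == -production_budget means the worldwide gross was recorded as zero
zero_gross = sample[sample["profit"] == -sample["production_budget"]]

# A vote average of exactly zero may also be a placeholder rather than a real rating
zero_rating = sample[sample["TMDBvote_average"] == 0]

print(zero_gross.index.tolist())
print(len(zero_rating))
```

Applying the same two filters to `fulldf` would show how many records are affected before deciding whether to drop them.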