# Title

## Overview

## Business Problem

Microsoft sees all the big companies creating original video content and they want to get in on the fun. They have decided to create a new movie studio, but they don’t know anything about creating movies. You are charged with exploring what types of films are currently doing the best at the box office. You must then translate those findings into actionable insights that the head of Microsoft's new movie studio can use to help decide what type of films to create.

Movie production has significant up-front costs that will require internal stakeholder support to adequately fund new projects. Further, it will be important to generate engagement with Microsoft's titles both to maximize return on investment and to legitimize Microsoft as a content producer in the future. Therefore, this analysis aims to generate recommendations on how best to deploy the content production budget. 

We will use data from IMDB, The Numbers and The Movie Database to determine answers to the following questions:

- What genres of movie are likely to optimize return on investment?

- What is the relationship between a movie's budget and its popularity, as measured by The Movie Database's popularity score? Relatedly, can the popularity score serve as a proxy for the engagement that spending generates?

- Which season or month should we target for releases in order to optimize our return on investment?





## Data Understanding

For this analysis, we use a database from IMDB and two datasets, one from The Numbers and one from The Movie Database (TMDB). The IMDB database contains robust information for each title, most notably the title, release date and relevant genres. The dataset from The Numbers will primarily be used for return on investment data, including worldwide gross revenue and movie production budget. TMDB provides a proprietary popularity score, which is calculated from a number of factors to measure engagement. Documentation for TMDB's popularity score can be found [here](https://developers.themoviedb.org/3/getting-started/popularity).

In [1]:
# import necessary packages
import pandas as pd
import sqlite3
import matplotlib.pyplot as plt
import zipfile
import numpy as np
%matplotlib inline

The IMDB dataset includes information on movies spanning nearly a century. Because audience appetites change over time, we limit our dataset to movies released in 2010 or later.

In [2]:
# Extract IMDb SQL .db file
with zipfile.ZipFile('./data/im.db.zip') as zipObj:
    # Extract all contents of .zip file into current directory
    zipObj.extractall(path='./data/')
    
# Create connection to IMDb DB
conn = sqlite3.connect('./data/im.db')
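
Before querying, it can help to confirm which tables the database exposes. The following is a minimal sketch, not part of the original notebook: `list_tables` is a hypothetical helper, and it is demonstrated on a throwaway in-memory database rather than `im.db` so that it runs on its own. The two table names are borrowed from the query used later in this notebook.

```python
import sqlite3

import pandas as pd


def list_tables(conn):
    """Return the names of all tables in a SQLite database, sorted alphabetically."""
    query = "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name;"
    return pd.read_sql(query, conn)["name"].tolist()


# Throwaway in-memory database standing in for im.db (illustrative schema only).
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT)")
demo_conn.execute("CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL)")

tables = list_tables(demo_conn)
print(tables)
```

Calling `list_tables(conn)` on the real connection would work the same way, assuming the file extracted correctly.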

The Numbers contains budget, revenue and release date data for 5782 movies.

In [3]:
df_budget = pd.read_csv('./data/tn.movie_budgets.csv.gz', index_col=0)
df_budget.head(3)


id,release_date,movie,production_budget,domestic_gross,worldwide_gross
1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"


The Movie Database contains genre, release date, title and popularity score data for 26517 movies.


In [4]:
df_movies = pd.read_csv('./data/tmdb.movies.csv.gz', index_col=0)
df_movies.head(3)


,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368


## Data Preparation

Our goal was to combine IMDB's expansive list of movies with TMDB's popularity score and budget data from The Numbers.

### Data Cleaning

In [24]:
# TODO: consider renaming columns in our DataFrames. Not critical, but it will read better if we have time.

#### The Movie Database

We checked TMDB's data for instances where both original title and release date were duplicated, and dropped those rows. Based on the other data in the table, we did not have an easy way to determine which of the duplicated records was accurate. Since the dropped data was a small subset, the impact on our overall analysis should be minimal.

There were also instances where the title was duplicated but the original title was not, primarily for movies whose titles were translated from another language. To avoid duplicating records, we dropped the records with duplicated titles as well.
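
The two de-duplication passes described above can be illustrated on a toy table. This is a hedged sketch with invented rows, not the notebook's data. It follows the prose and de-duplicates on `title` in the second pass, whereas the notebook cell further down passes `subset='original_title'`, so treat the second step as one possible reading.

```python
import pandas as pd

# Toy rows modeled on the TMDB columns; the values are invented for illustration.
toy = pd.DataFrame(
    {
        "original_title": ["Film A", "Film A", "Película B", "Film B"],
        "title": ["Film A", "Film A", "Movie B", "Movie B"],
        "release_date": ["2010-01-01", "2010-01-01", "2011-05-05", "2012-06-06"],
    }
)

# Pass 1: rows that repeat both the original title and the release date.
n_exact = toy[["original_title", "release_date"]].duplicated().sum()
step1 = toy.drop_duplicates(subset=["original_title", "release_date"])

# Pass 2: rows whose display title repeats even though the original title differs.
n_title = step1["title"].duplicated().sum()
step2 = step1.drop_duplicates(subset="title")

print(n_exact, n_title, len(step2))
```

`drop_duplicates` keeps the first occurrence by default, so the surviving record depends on row order rather than on which record is more accurate.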

In [5]:
df_movies[['original_title', 'release_date']].duplicated().sum()

1026

In [6]:
df_movies.drop_duplicates(subset=['original_title', 'release_date'], inplace=True)

In [13]:
# Drop rows with no genres (genre_ids is stored as a string, so an empty list is '[]').
df_movies.drop(df_movies[df_movies['genre_ids'] == '[]'].index, inplace=True)

In [8]:
df_movies['title'].duplicated().sum()

803

In [10]:
# examine rows where 'title' differs from 'original_title' (mostly translated titles)
df_duplicated_titles = df_movies[df_movies['original_title'] != df_movies['title']]
df_duplicated_titles.head(10)


,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
14,"[27, 80]",41439,en,Saw 3D,20.37,2010-10-28,Saw: The Final Chapter,6.0,1488
49,"[10749, 18]",61979,es,Tres metros sobre el cielo,13.721,2010-12-20,Three Steps Above Heaven,7.5,960
67,"[16, 12, 14, 10751]",42949,en,Arthur 3: la guerre des deux mondes,12.679,2010-08-22,Arthur 3: The War of the Two Worlds,5.6,865
70,"[80, 18, 9648, 10749]",25376,es,El secreto de sus ojos,12.531,2010-04-16,The Secret in Their Eyes,7.9,1141
75,[16],28874,ja,サマーウォーズ,12.275,2010-10-13,Summer Wars,7.5,447
79,"[28, 53, 80, 9648]",33613,sv,Luftslottet som sprängdes,12.235,2010-10-29,The Girl Who Kicked the Hornet's Nest,7.0,705
84,"[12, 14, 16, 878]",37933,ja,ゲド戦記,12.005,2010-08-13,Tales from Earthsea,6.6,502
87,"[18, 28, 53, 80, 9648]",24253,sv,Flickan som lekte med elden,11.655,2010-07-09,The Girl Who Played with Fire,7.0,881
98,"[14, 12, 28, 9648]",35552,fr,Les Aventures extraordinaires d'Adèle Blanc-Sec,11.221,2010-04-14,The Extraordinary Adventures of Adèle Blanc-Sec,6.0,671
103,"[28, 18, 36]",11645,ja,乱,10.885,1985-09-26,Ran,8.1,600


In [11]:
# Drop rows whose 'original_title' repeats, keeping the first occurrence.
df_movies.drop_duplicates(subset='original_title', inplace=True)

In [14]:
df_movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22422 entries, 0 to 26516
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   genre_ids          22422 non-null  object 
 1   id                 22422 non-null  int64  
 2   original_language  22422 non-null  object 
 3   original_title     22422 non-null  object 
 4   popularity         22422 non-null  float64
 5   release_date       22422 non-null  object 
 6   title              22422 non-null  object 
 7   vote_average       22422 non-null  float64
 8   vote_count         22422 non-null  int64  
dtypes: float64(2), int64(2), object(5)
memory usage: 1.7+ MB


#### The Numbers

In [15]:
df_budget = pd.read_csv('./data/tn.movie_budgets.csv.gz', index_col=0)

In [21]:
# Converted release date to a date/time format.
df_budget['release_date'] = pd.to_datetime(df_budget['release_date'])

In [19]:
df_budget.head()

id,release_date,movie,production_budget,domestic_gross,worldwide_gross
1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"


In [22]:
# Created release year and release month columns as integers.
df_budget['release_year'] = df_budget['release_date'].dt.year
df_budget['release_month'] = df_budget['release_date'].dt.month
df_budget.head()

id,release_date,movie,production_budget,domestic_gross,worldwide_gross,release_year,release_month
1,2009-12-18,Avatar,"$425,000,000","$760,507,625","$2,776,345,279",2009,12
2,2011-05-20,Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875",2011,5
3,2019-06-07,Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350",2019,6
4,2015-05-01,Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963",2015,5
5,2017-12-15,Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747",2017,12


In [23]:
# Removed the dollar sign and commas from production budget and converted the column values to integers.
# The character class makes the pattern explicit; a bare '$' is an end-of-string anchor when treated as a regex.
df_budget['production_budget'] = (
    df_budget['production_budget']
    .str.replace(r'[$,]', '', regex=True)
    .astype('int64')
)

In [None]:
# Removed the dollar sign and commas from worldwide gross and converted the column values to integers.
# int64 is requested explicitly because the largest gross (over 2.7 billion) does not fit in a 32-bit integer.
df_budget['worldwide_gross'] = (
    df_budget['worldwide_gross']
    .str.replace(r'[$,]', '', regex=True)
    .astype('int64')
)
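
Since both currency columns are cleaned the same way, the steps could be folded into one helper. The sketch below is an illustration, not the notebook's code: `currency_to_int` is a hypothetical name, and including `domestic_gross` is an assumption, since the notebook only converts the other two columns. The toy values are copied from the `head()` output above.

```python
import pandas as pd


def currency_to_int(series):
    """Convert strings such as '$1,234' to 64-bit integers."""
    cleaned = series.str.replace(r"[$,]", "", regex=True)
    return cleaned.astype("int64")


# Toy frame using the column names from The Numbers.
demo = pd.DataFrame(
    {
        "production_budget": ["$425,000,000", "$410,600,000"],
        "domestic_gross": ["$760,507,625", "$241,063,875"],
        "worldwide_gross": ["$2,776,345,279", "$1,045,663,875"],
    }
)

money_cols = ["production_budget", "domestic_gross", "worldwide_gross"]
# Apply the helper column by column and write the numeric results back.
demo[money_cols] = demo[money_cols].apply(currency_to_int)
print(demo.dtypes)
```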

#### IMDB

In [26]:
imdbq = """
SELECT
    movie_id,
    primary_title,
    start_year,
    genres,
    directors.person_id AS director_id,
    writers.person_id AS writer_id,  
    movie_ratings.averagerating,
    movie_ratings.numvotes
    
FROM
    movie_basics
    JOIN
        movie_ratings
            USING(movie_id)
    JOIN
        directors
            USING(movie_id)
    JOIN
        writers
            USING(movie_id)
    
WHERE
    start_year >= 2010 AND
    start_year <= 2022


GROUP BY
    movie_basics.movie_id
;
"""
imdbq_result = pd.read_sql(imdbq, conn)
imdbq_result

,movie_id,primary_title,start_year,genres,averagerating,numvotes
0,tt0063540,Sunghursh,2013,"Action,Crime,Drama",7.0,77
1,tt0066787,One Day Before the Rainy Season,2019,"Biography,Drama",7.2,43
2,tt0069049,The Other Side of the Wind,2018,Drama,6.9,4517
3,tt0069204,Sabse Bada Sukh,2018,"Comedy,Drama",6.1,13
4,tt0100275,The Wandering Soap Opera,2017,"Comedy,Drama,Fantasy",6.5,119
...,...,...,...,...,...,...
73851,tt9913084,Diabolik sono io,2019,Documentary,6.2,6
73852,tt9914286,Sokagin Çocuklari,2019,"Drama,Family",8.7,136
73853,tt9914642,Albatross,2017,Documentary,8.5,8
73854,tt9914942,La vida sense la Sara Amat,2019,,6.6,5
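
The `genres` column in the result above holds comma-separated strings, and some titles have no genre at all. For a per-genre analysis, one option is to split each string and produce one row per movie and genre. The sketch below assumes that approach; the rows are copied from the output above, and `genre_rows` and `genre` are hypothetical names.

```python
import pandas as pd

# Toy rows shaped like the IMDb query result.
imdb_demo = pd.DataFrame(
    {
        "movie_id": ["tt0063540", "tt0069049", "tt9914942"],
        "genres": ["Action,Crime,Drama", "Drama", None],
    }
)

# Drop titles with no genre, split the string into a list, then emit one row per genre.
genre_rows = (
    imdb_demo.dropna(subset=["genres"])
    .assign(genre=lambda df: df["genres"].str.split(","))
    .explode("genre")
    .drop(columns="genres")
    .reset_index(drop=True)
)
print(genre_rows)
```

A movie with three genres then counts once toward each of them, which is worth keeping in mind when comparing genre-level averages.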


### Merging Datasets
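
One way the combination described under Data Preparation could be sketched is shown below. It assumes that IMDB and The Numbers are matched on title plus release year, and that TMDB's popularity score is attached by title only. Those join keys are assumptions, not the notebook's confirmed method. The column names follow the notebook, and the rows are invented.

```python
import pandas as pd

# Toy stand-ins for the three cleaned tables.
imdb = pd.DataFrame(
    {
        "primary_title": ["Film A", "Film B", "Film C"],
        "start_year": [2011, 2015, 2018],
        "genres": ["Action,Drama", "Comedy", "Horror"],
    }
)
budget = pd.DataFrame(
    {
        "movie": ["Film A", "Film B", "Film Z"],
        "release_year": [2011, 2015, 2012],
        "production_budget": [100, 50, 10],
        "worldwide_gross": [300, 40, 5],
    }
)
tmdb = pd.DataFrame({"title": ["Film A", "Film Q"], "popularity": [33.5, 12.1]})

# Inner join on title AND year, so remakes that share a title are not matched together.
merged = imdb.merge(
    budget,
    left_on=["primary_title", "start_year"],
    right_on=["movie", "release_year"],
    how="inner",
)

# Left join the popularity score; titles missing from TMDB keep a NaN popularity.
# validate raises an error if a title appears more than once on the TMDB side.
merged = merged.merge(
    tmdb, left_on="primary_title", right_on="title", how="left", validate="many_to_one"
)
print(merged[["primary_title", "start_year", "production_budget", "popularity"]])
```

Matching on exact title strings will miss titles that are spelled or punctuated differently across sources, so the match rate is worth checking after each join.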

### Feature Engineering
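
Return on investment is central to the questions posed earlier, but the notebook does not yet state a formula for it. The sketch below assumes ROI is net worldwide profit divided by production budget; that definition is an assumption. The figures are copied from The Numbers output shown earlier, already converted to integers, and `roi` is a hypothetical column name.

```python
import pandas as pd

# Toy frame with numeric budget columns.
roi_demo = pd.DataFrame(
    {
        "movie": ["Avatar", "Dark Phoenix"],
        "production_budget": [425_000_000, 350_000_000],
        "worldwide_gross": [2_776_345_279, 149_762_350],
    }
)

# Assumed definition: (worldwide gross - production budget) / production budget.
roi_demo["roi"] = (
    roi_demo["worldwide_gross"] - roi_demo["production_budget"]
) / roi_demo["production_budget"]
print(roi_demo[["movie", "roi"]].round(2))
```

Under this definition a value of 0 means the movie earned back exactly its production budget, and a negative value means it did not.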

## Analysis

### Return on Investment by Genre


### Popularity by Investment

### Popularity by Release Month

## Conclusions


- **Conclusion Number 1**
- **Conclusion Number 2**
- **Conclusion Number 3**

- **Recommendation Number 1**
- **Recommendation Number 2**
- **Recommendation Number 3**

### Next Steps