# Microsoft Movie Data Analysis

### By Chuck Nadel


## Welcome!

#### In this notebook, I analyzed movie data from 2010-2019 in order to 'recommend' to Microsoft what kind of movies they should make. The three main things I looked at were the relationship between genre and movie success, the relationship between IMDB votes and profit, and the relationship between budget and profit. I also looked at which directors did the best job of generating buzz on IMDB, as well as creating successful films with a relatively small amount of money.

## Importing Packages

In [161]:
import pandas as pd
import sqlite3
import plotly.express as px
import plotly.graph_objects as go

## Uploading Databases

In [162]:
# The Numbers DF
numsDF = pd.read_csv('zippedData/tn.movie_budgets.csv.gz')

# IMDB SQL Database
conn = sqlite3.connect('zippedData/im.db')
imdbDF = pd.read_sql('''
        SELECT *
        FROM sqlite_master
        WHERE type='table'
        ''', conn)

# IMDB Basics and Ratings Information, merged for ease of analysis
basics_ratingsDF = pd.read_sql('''
        SELECT * FROM movie_basics
        INNER JOIN movie_ratings ON movie_basics.movie_id=movie_ratings.movie_id
        ''', conn)
directors_infoDF = pd.read_sql('''
        SELECT * FROM directors
        INNER JOIN persons ON directors.person_id=persons.person_id
        ''', conn)

In [163]:
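# Before we combine both IMDB Dataframes, we need to drop the duplicate columns they were merged on
# Note: the round trip through .T turns every column into the object dtype, so numeric columns must be converted back later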
basics_ratingsDF = basics_ratingsDF.T.drop_duplicates().T
directors_infoDF = directors_infoDF.T.drop_duplicates().T
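
As a hedged alternative to the transpose trick above, duplicate columns can also be dropped by name, which leaves the remaining dtypes untouched. The sketch below uses a made-up two-row frame (`joined` and its values are illustrative, not taken from the IMDB database). Note that it removes columns with a repeated *name*, whereas `.T.drop_duplicates().T` removes columns with repeated *contents*.

```python
import pandas as pd

# Toy frame standing in for a SELECT * join result with a repeated key column
# (hypothetical values, not from im.db)
joined = pd.DataFrame(
    [["tt001", "Film A", "tt001", 7.5], ["tt002", "Film B", "tt002", 6.1]],
    columns=["movie_id", "primary_title", "movie_id", "averagerating"],
)

# Keep only the first occurrence of each column name
deduped = joined.loc[:, ~joined.columns.duplicated()]

# Numeric columns keep their numeric dtype
print(deduped.dtypes)
```

With this approach the later `pd.to_numeric` conversions would be unnecessary for columns that were already numeric in the database.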

In [164]:
# Now, we can combine these two dataframes
directors_infoDF = directors_infoDF.drop_duplicates()
complete_imdbDF = pd.merge(basics_ratingsDF, directors_infoDF, on='movie_id', how='left')

# Lets also drop some columns that we will not be using
complete_imdbDF.drop(['birth_year','death_year','primary_profession', 'runtime_minutes'], axis=1, inplace=True)

## Cleaning the Data

#### First, let's look at the information provided by each imported dataset

In [165]:
complete_imdbDF.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 86783 entries, 0 to 86782
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   movie_id        86783 non-null  object
 1   primary_title   86783 non-null  object
 2   original_title  86783 non-null  object
 3   start_year      86783 non-null  object
 4   genres          85844 non-null  object
 5   averagerating   86783 non-null  object
 6   numvotes        86783 non-null  object
 7   person_id       86030 non-null  object
 8   primary_name    86030 non-null  object
dtypes: object(9)
memory usage: 6.6+ MB


Our IMDB dataframe is mostly okay, but about 940 rows are missing a genre and about 750 rows are missing a director. Because the directors table was merged in, each row is a movie-director pairing, so these are row counts rather than counts of distinct films.
Since the number of rows missing a genre is relatively small, we will just drop those rows. We will do the same for the rows without a director.
We also need to convert the numvotes column to a numeric (integer) type.

In [166]:
# The Numbers DF
numsDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   id                 5782 non-null   int64 
 1   release_date       5782 non-null   object
 2   movie              5782 non-null   object
 3   production_budget  5782 non-null   object
 4   domestic_gross     5782 non-null   object
 5   worldwide_gross    5782 non-null   object
dtypes: int64(1), object(5)
memory usage: 271.2+ KB


At first glance, all seems well here. However, this dataset stores the production_budget, domestic_gross, and worldwide_gross columns as objects (strings), as opposed to integers.
To address this, we will strip the dollar signs and commas from the values in these columns so that they can be turned into integers for further analysis.

#### Now, let's clean the Basics and Ratings Dataframe and the Numbers Dataframe, based on what we wrote above.

Basics and Ratings Dataframe:

In [167]:
# Drop rows that are missing a genre or a director
complete_imdbDF.dropna(subset = ['genres','person_id','primary_name'], axis=0, inplace=True)
# Convert numvotes from object to an integer type
complete_imdbDF['numvotes'] = pd.to_numeric(complete_imdbDF['numvotes'], downcast='integer')

In [168]:
complete_imdbDF.isnull().sum()

movie_id          0
primary_title     0
original_title    0
start_year        0
genres            0
averagerating     0
numvotes          0
person_id         0
primary_name      0
dtype: int64

The Numbers Dataframe:

In [169]:
# Remove $ Sign, commas from production_budget, domestic_gross, and worldwide_gross columns
columnstofix = ['production_budget', 'domestic_gross','worldwide_gross']
for column in columnstofix:
    numsDF[column] = numsDF[column].apply(lambda x:x.replace('$',''))
    numsDF[column] = numsDF[column].apply(lambda x:x.replace(',',''))

In [170]:
# Convert cleaned columns to int data type
for column in columnstofix:
    numsDF[column] = pd.to_numeric(numsDF[column])
# Create a new 'year' column from the last four characters of release_date
numsDF['year'] = numsDF['release_date'].str[-4:]
# Convert 'year' from a string to an integer
numsDF['year'] = pd.to_numeric(numsDF['year'], downcast='integer')
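
The same cleaning steps can be sketched in a more vectorized form. This is only an illustration under stated assumptions: the frame `toy` is a two-row stand-in built from the budget strings implied by the table below, and the date format string `"%b %d, %Y"` assumes every `release_date` looks like `Dec 18, 2009` (with English month abbreviations).

```python
import pandas as pd

# Two-row stand-in for numsDF (assumed raw string formats)
toy = pd.DataFrame({
    "release_date": ["Dec 18, 2009", "May 20, 2011"],
    "production_budget": ["$425,000,000", "$410,600,000"],
})

# Strip '$' and ',' in one vectorized pass, then convert to integers
toy["production_budget"] = pd.to_numeric(
    toy["production_budget"].str.replace(r"[$,]", "", regex=True)
)

# Parse the full date, then take the year from the resulting datetime
toy["year"] = pd.to_datetime(toy["release_date"], format="%b %d, %Y").dt.year

print(toy)
```

Parsing the date instead of slicing the last four characters would raise an error on malformed dates rather than silently producing a wrong year.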

Let's also create a new column: Worldwide_Profit

In [171]:
numsDF['Worldwide_Profit'] = numsDF['worldwide_gross']-numsDF['production_budget']
numsDF.head()

| | id | release_date | movie | production_budget | domestic_gross | worldwide_gross | year | Worldwide_Profit |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Dec 18, 2009 | Avatar | 425000000 | 760507625 | 2776345279 | 2009 | 2351345279 |
| 1 | 2 | May 20, 2011 | Pirates of the Caribbean: On Stranger Tides | 410600000 | 241063875 | 1045663875 | 2011 | 635063875 |
| 2 | 3 | Jun 7, 2019 | Dark Phoenix | 350000000 | 42762350 | 149762350 | 2019 | -200237650 |
| 3 | 4 | May 1, 2015 | Avengers: Age of Ultron | 330600000 | 459005868 | 1403013963 | 2015 | 1072413963 |
| 4 | 5 | Dec 15, 2017 | Star Wars Ep. VIII: The Last Jedi | 317000000 | 620181382 | 1316721747 | 2017 | 999721747 |


## #1 Safest (profit-wise) Movies by Genre

Let's look at the % of movies in the green, black, and red based on their Genre

Green = 100%+ Profit

Black = 0-100% Profit 

Red = Lost Money
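
To make the three bands concrete, here is a minimal sketch of the classification as a plain function, applied to two films from the table above. The function name `classify_return` is hypothetical; the labels 'Yes', 'Breakeven', and 'No' correspond to green, black, and red.

```python
def classify_return(pct_return: float) -> str:
    """Map a % return to the green/black/red bands described above."""
    if pct_return >= 100:
        return "Yes"        # green: 100%+ profit
    if pct_return < 0:
        return "No"         # red: lost money
    return "Breakeven"      # black: 0-100% profit

# % return = profit / budget * 100, using figures from the table above
avatar_return = 2351345279 / 425000000 * 100
dark_phoenix_return = -200237650 / 350000000 * 100

print(classify_return(avatar_return))
print(classify_return(dark_phoenix_return))
```

Note that a film returning exactly 100% lands in the green band, and a film returning exactly 0% lands in the black band.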

In [172]:
# First, let's create a new numerical column in NumsDF, %ROI (Profit/Budget)
numsDF['% Return'] = numsDF['Worldwide_Profit']/numsDF['production_budget']*100
# Now, let's categorize these results in a new column, simple_success, based on the parameters above
numsDF['simple_success'] = numsDF['% Return'].apply(lambda x: 'Yes' if x >= 100 else 'No' if x < 0 else 'Breakeven')
# Finally, we need to merge this dataframe with the genre series in the IMDB dataframe
numsimdbDF = complete_imdbDF.merge(numsDF, how = 'inner', left_on = ['primary_title', 'start_year'], right_on = ['movie', 'year'])
# Since the genres are listed together, we want to make a dataframe where there is only 1 genre per row. We will do this by splitting the genre entries, and then emitting one row per genre
# Split by genres
numsimdbDF['genres'] = numsimdbDF['genres'].str.split(',')
# Make a list of genres, along with the simple success metric made
genre_list = [{'Profit?':row.simple_success,'genres':g} for idx, row in numsimdbDF.iterrows() for g in row.genres ]
# Convert it to dataframe
profit_genreDF = pd.DataFrame(genre_list)
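
The split-and-copy step can also be written with `DataFrame.explode`, which avoids the row-by-row loop. The following is a sketch on a made-up two-film frame (titles and genres are illustrative), not a drop-in replacement that has been run against the real data.

```python
import pandas as pd

# Hypothetical stand-in for the merged dataframe
toy = pd.DataFrame({
    "primary_title": ["Film A", "Film B"],
    "genres": ["Action,Adventure,Sci-Fi", "Drama"],
    "simple_success": ["Yes", "No"],
})

long_form = (
    toy.assign(genres=toy["genres"].str.split(","))   # string -> list of genres
       .explode("genres")                             # one row per (film, genre)
       .rename(columns={"simple_success": "Profit?"})
       [["Profit?", "genres"]]
       .reset_index(drop=True)
)

print(long_form)
```

A film tagged with three genres contributes three rows, so counts in the long-form frame are genre entries rather than distinct films.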

Let's now look at the overall counts of each result across all genre entries, before breaking them down by genre

In [173]:
profit_genreDF['Profit?'].value_counts().sort_values(ascending=False)

Yes          2262
No           1141
Breakeven     765
Name: Profit?, dtype: int64

In [174]:
# Now, we want to make another dataframe that groups each genre together and counts the number of movies that were in the green, black, and red.
df_stacked = profit_genreDF.groupby(['genres', 'Profit?']).size().reset_index(name='counts')
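# Normalize within each genre so the three shares sum to 1 (transform keeps the original row index, so the assignment stays aligned)
df_stacked['counts'] = df_stacked.groupby('genres')['counts'].transform(lambda x: x/x.sum())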
# Generate Bar Graph
genreProfitStacked = px.bar(df_stacked, x = 'genres', y = 'counts', color = 'Profit?', barmode = 'stack', color_discrete_map = {'Breakeven': 'black', 'Yes':'seagreen','No':'darkred'})
genreProfitStacked.update_layout(yaxis_title='Movie Profitability Rate', xaxis_title='Genre', title = 'Movie Profitability Rate by Genre')
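
As a cross-check on the per-genre shares, the same table can be sketched with `pd.crosstab` and `normalize="index"`, which divides each row by its own total. The six rows below are invented for illustration only.

```python
import pandas as pd

# Hypothetical long-form data: one row per (film, genre) entry
toy = pd.DataFrame({
    "genres": ["Animation", "Animation", "Animation", "Animation", "Western", "Western"],
    "Profit?": ["Yes", "Yes", "Yes", "No", "No", "Breakeven"],
})

# Rows are genres, columns are outcomes, each row sums to 1
shares = pd.crosstab(toy["genres"], toy["Profit?"], normalize="index")

print(shares)
```

Unlike the groupby-size approach, the crosstab also fills in an explicit 0 for outcomes that never occur in a genre.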

As shown here, Western, War, and Documentary films lost money most often, while Animation and Adventure films lost money least often.

In [175]:
# Let's make another visualization with just the profitability success rate (Green only), and sort from most successful to least
df_green = df_stacked[df_stacked['Profit?'] == 'Yes'].sort_values(by = 'counts', ascending=False)
genreProfitGreen = px.bar(df_green, x = 'genres', y = 'counts', color = 'Profit?', barmode = 'stack', color_discrete_map = {'Yes':'seagreen'}, range_y = [0,1])
genreProfitGreen.update_layout(yaxis_title='Profitability Success Rate', xaxis_title='Genre')

This graph shows that Animation, Adventure, Science Fiction, and Mystery movies do the best in terms of generating consistent profits. Since Animation generated by far the most consistent profits (returning 100% or more 73% of the time) and lost money least often (8% of the time), I would recommend that Microsoft focus on creating an animated film.

## #2 Correlation Between Number of IMDB Votes and Box Office Profit

In [176]:
# We will use the originally merged database, numsimdbDF for this analysis
numsimdbDF.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1651 entries, 0 to 1650
Data columns (total 19 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   movie_id           1651 non-null   object 
 1   primary_title      1651 non-null   object 
 2   original_title     1651 non-null   object 
 3   start_year         1651 non-null   object 
 4   genres             1651 non-null   object 
 5   averagerating      1651 non-null   object 
 6   numvotes           1651 non-null   int32  
 7   person_id          1651 non-null   object 
 8   primary_name       1651 non-null   object 
 9   id                 1651 non-null   int64  
 10  release_date       1651 non-null   object 
 11  movie              1651 non-null   object 
 12  production_budget  1651 non-null   int64  
 13  domestic_gross     1651 non-null   int64  
 14  worldwide_gross    1651 non-null   int64  
 15  year               1651 non-null   int16  
 16  Worldwide_Profit   1651 

In [177]:
# Let's make a new Dataframe with primary_title, numvotes, Worldwide_Profit, and % Return
question2DF = numsimdbDF[['primary_title','numvotes','Worldwide_Profit', '% Return']]

Now, let's graph numvotes and Worldwide_Profit on a scatterplot

In [178]:
votesProfit = px.scatter(question2DF, x="numvotes", y="Worldwide_Profit", trendline = 'ols', hover_name='primary_title')
correlation = question2DF["numvotes"].corr(question2DF["Worldwide_Profit"])
votesProfit.update_layout(annotations=[dict(x=0.95, y=0.95, xref='paper', yref='paper', text=f'correlation: {correlation:.2f}', 
                                 showarrow=False, font=dict(size=14))],xaxis_title='Number of Votes', yaxis_title='Worldwide Profit', title = 'Number of Votes as a Predictor of Worldwide Profit')

It appears that there is a positive correlation between the number of votes a film receives on IMDB and the profit it made. The correlation coefficient is annotated in the top-right corner of the plot.
Further down, we also look at which directors attract the most votes and which deliver the most profit for their budget.
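
For reference, `Series.corr` computes the Pearson correlation coefficient by default:

$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}$$

The sketch below checks the formula by hand against pandas on five made-up points (the numbers are illustrative, not from the movie data).

```python
import numpy as np
import pandas as pd

# Hypothetical vote counts and profits
votes = pd.Series([100.0, 200.0, 300.0, 400.0, 500.0])
profit = pd.Series([1.0, 3.0, 2.0, 5.0, 4.0])

# Deviations from the mean
dv = votes - votes.mean()
dp = profit - profit.mean()

# Pearson r computed directly from the formula
r_manual = (dv * dp).sum() / np.sqrt((dv ** 2).sum() * (dp ** 2).sum())

# Pearson r as computed by pandas
r_pandas = votes.corr(profit)

print(r_manual, r_pandas)
```

Because Pearson's r measures linear association only, a few extreme films can move it noticeably.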

Now how about we look at Profit % versus numvotes

In [179]:
votesProfit = px.scatter(question2DF, x="numvotes", y="% Return")
correlation = question2DF["numvotes"].corr(question2DF["% Return"])
votesProfit.update_layout(annotations=[dict(x=0.95, y=0.95, xref='paper', yref='paper', text=f'correlation: {correlation:.2f}', 
                                 showarrow=False, font=dict(size=14))],xaxis_title='Number of Votes', yaxis_title='% Return')

So, there is little to no correlation between the number of votes a movie receives and its % Return.

Let's now explore which directors generate the highest number of votes 

In [180]:
# Let's make another dataframe with directors, their movies, and the number of IMDB votes received
completeDF = complete_imdbDF.merge(numsDF, how = 'inner', left_on = ['primary_title', 'start_year'], right_on = ['movie', 'year'])
directors_votesDF = completeDF[['primary_name','numvotes', 'Worldwide_Profit']]
# Now, we sum the IMDB votes and worldwide profit across each director's films, and sort by total votes
directors_votesDF = directors_votesDF.groupby(['primary_name']).sum().sort_values(by=['numvotes'], ascending=False).reset_index()
# We'll also make this dataframe smaller by keeping the 50 directors with the most total votes (note that this does not filter by number of films)
directors_votesDF = directors_votesDF.head(50)
# We'll also make a new dataframe with just the top 10 directors by votes, and return those directors in a table
directors_votesTopTen = directors_votesDF.head(10)
go.Figure(data=[go.Table(
    header=dict(values=list(directors_votesTopTen.columns),
                fill_color='paleturquoise',
                align='center'),
    cells=dict(values=[directors_votesTopTen.primary_name, directors_votesTopTen.numvotes, directors_votesTopTen.Worldwide_Profit],
               fill_color='lavender',
               align='center'))]).update_layout(title = 'Top Ten Directors Based on IMDB Votes')

From a perspective of generating buzz, Christopher Nolan is the best. Martin Scorsese, Ridley Scott, and the Russo Brothers also rank highly.
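
If the goal is to set aside directors with only a single film in the data, one possible approach is to count films per director alongside the vote totals and filter on that count. The sketch below uses invented directors and numbers, and the two-film threshold is an assumption rather than something the analysis above applied.

```python
import pandas as pd

# Hypothetical stand-in for the merged director/film dataframe
toy = pd.DataFrame({
    "primary_name": ["Director A", "Director A", "Director B", "Director C", "Director C"],
    "movie": ["A1", "A2", "B1", "C1", "C2"],
    "numvotes": [500, 300, 900, 100, 50],
    "Worldwide_Profit": [10, 20, 50, -5, 5],
})

# Aggregate per director: number of distinct films, total votes, total profit
per_director = (
    toy.groupby("primary_name")
       .agg(films=("movie", "nunique"),
            numvotes=("numvotes", "sum"),
            Worldwide_Profit=("Worldwide_Profit", "sum"))
       .reset_index()
)

MIN_FILMS = 2  # assumed threshold for excluding one-film directors
repeat_directors = (
    per_director[per_director["films"] >= MIN_FILMS]
    .sort_values("numvotes", ascending=False)
    .reset_index(drop=True)
)

print(repeat_directors)
```

Here Director B has the most votes for a single film but is excluded because only one film is recorded.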

## #3 Budget vs Return on Investment

In [181]:
# Let's make a new dataframe with the Movie's Title, Production Budget, % Return, Director's Name, and Worldwide Profit
Question3DF = completeDF[['primary_title','production_budget','% Return','primary_name', 'Worldwide_Profit']]
Question3DF.head()

| | primary_title | production_budget | % Return | primary_name | Worldwide_Profit |
|---|---|---|---|---|---|
| 0 | Foodfight! | 45000000 | -99.836209 | Lawrence Kasanoff | -44926294 |
| 1 | The Secret Life of Walter Mitty | 91000000 | 106.44086 | Ben Stiller | 96861183 |
| 2 | A Walk Among the Tombstones | 28000000 | 121.816382 | Scott Frank | 34108587 |
| 3 | Jurassic World | 215000000 | 666.909239 | Colin Trevorrow | 1433854864 |
| 4 | The Rum Diary | 45000000 | -52.122818 | Bruce Robinson | -23455268 |


Now, let's look at the % Return of a movie, compared to its production budget

In [182]:
# The y-axis here is % Return, so the title should say so
px.scatter(Question3DF, x = 'production_budget', y = '% Return', hover_name='primary_title').update_layout(
    xaxis_title='Production Budget', yaxis_title='% Return',
    title = 'Production Budget as a Predictor of % Return')

It appears that we have an outlier in our dataset, with The Gallows having a small budget but generating a 41557% Return.

However, even when excluding that, there is no clear correlation between Production Budget and % Return
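
One way to check a claim like this numerically is to compute the correlation with and without the extreme value. The sketch below uses five invented films, and the cutoff `RETURN_CAP` is a hypothetical threshold chosen only to exclude the one extreme point.

```python
import pandas as pd

# Hypothetical films: one tiny-budget film with an extreme % return
toy = pd.DataFrame({
    "primary_title": ["Tiny Hit", "Film B", "Film C", "Film D", "Film E"],
    "production_budget": [100_000, 20_000_000, 50_000_000, 100_000_000, 200_000_000],
    "% Return": [40000.0, 150.0, -20.0, 80.0, 120.0],
})

RETURN_CAP = 10_000  # assumed cutoff for what counts as an outlier

# Drop the outlier, then compare Pearson correlations
trimmed = toy[toy["% Return"] < RETURN_CAP]
r_all = toy["production_budget"].corr(toy["% Return"])
r_trimmed = trimmed["production_budget"].corr(trimmed["% Return"])

print(f"with outlier: {r_all:.2f}, without outlier: {r_trimmed:.2f}")
```

In this toy data a single film is enough to flip the sign of the coefficient, which is why it is worth reporting both numbers.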

Now, let's look at Budget versus Profit

In [183]:
BudgetProfit = px.scatter(Question3DF, x = 'production_budget', y = 'Worldwide_Profit', hover_name='primary_title').update_layout(
    xaxis_title='Production Budget', yaxis_title='Worldwide Profit',
    title = 'Production Budget as a Predictor of Worldwide Profit')
BudgetProfit

This looks good, as there appears to be a positive correlation between Production Budget and Worldwide Profit. Let's find that number and add it to our graph

In [184]:

# Correlation Coefficient
correlationQ3 = Question3DF["production_budget"].corr(Question3DF["Worldwide_Profit"])
BudgetProfit.update_layout(annotations=[dict(x=0.95, y=0.95, xref='paper', yref='paper', text=f'correlation: {correlationQ3:.3f}', 
                                 showarrow=False, font=dict(size=14))],xaxis_title='Production Budget', yaxis_title='Worldwide Profit', title = 'Production Budget as a Predictor of Worldwide Profit')

Finally, let's group this data by director, in order to find which directors do best, whether or not they receive a large budget

In [185]:
# By Director
BudProfDirectorDF = Question3DF[['primary_name','production_budget','Worldwide_Profit']]
BudProfDirectorDF = BudProfDirectorDF.groupby(['primary_name']).sum()[['Worldwide_Profit','production_budget']].sort_values(by=['Worldwide_Profit'], ascending=False).reset_index()
BudProfDirector = px.scatter(BudProfDirectorDF, x = 'production_budget', y = 'Worldwide_Profit', hover_name='primary_name')
BudProfDirector.update_layout(xaxis_title='Production Budget', yaxis_title='Worldwide Profit', title = 'Production Budget as a Predictor of Worldwide Profit, Grouped by Director')
BudProfDirector

This looks good. As we can see, directors who do a good job generally get larger production budgets. In addition, there appear to be 4 directors who have created highly profitable films without needing an enormous budget. Let's highlight those ones for our PowerPoint audience, who can't hover over their names in this notebook.
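
Rather than picking these directors by eye, one could also rank directors by profit per budget dollar. This is only a sketch: the column name `profit_per_dollar` and the three directors below are made up for illustration, and this ranking was not part of the analysis above.

```python
import pandas as pd

# Hypothetical per-director totals
toy = pd.DataFrame({
    "primary_name": ["Director A", "Director B", "Director C"],
    "production_budget": [100_000_000, 400_000_000, 50_000_000],
    "Worldwide_Profit": [900_000_000, 1_200_000_000, 20_000_000],
})

# Profit earned for every dollar of production budget
toy["profit_per_dollar"] = toy["Worldwide_Profit"] / toy["production_budget"]

ranked = toy.sort_values("profit_per_dollar", ascending=False).reset_index(drop=True)

print(ranked)
```

Director B has the largest total profit here, but Director A earns three times as much per budget dollar, which is the kind of distinction the scatterplot is meant to surface.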

In [186]:

directors = ['Pierre Coffin', 'Kyle Balda', 'Chris Renaud', 'James Wan']
BudProfDirectorDF.loc[BudProfDirectorDF['primary_name'].isin(directors), 'director_color' ] = 'blue'
BudProfDirectorDF['director_color'] = BudProfDirectorDF['director_color'].fillna(value = 'red')
# Nice. Now, let's update our scatterplot
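# color_discrete_map='identity' tells plotly to use the values in director_color ('blue'/'red') as the actual marker colors
BudProfDirector = px.scatter(BudProfDirectorDF, x = 'production_budget', y = 'Worldwide_Profit', hover_name='primary_name', color = 'director_color', color_discrete_map = 'identity')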
BudProfDirector.update_layout(xaxis_title='Production Budget', yaxis_title='Worldwide Profit', title = 'Production Budget as a Predictor of Worldwide Profit, Grouped by Director')
BudProfDirector

Very nice! It is clear that we should be targeting one of the 4 directors highlighted above. Since we've decided on making animated films, hiring Pierre Coffin, Chris Renaud, or Kyle Balda would be our best options, as they all have extensive experience in animated films.

## Overview

##### In this analysis, we explored movie data from 2010-2019 in order to 'recommend' to Microsoft what kind of movies they should make. The main takeaways we had were that Microsoft should create Animated films, as they generate profits the most consistently by far, and should hire Pierre Coffin, Chris Renaud, or Kyle Balda to direct. In addition, because films with more IMDB votes tended to earn more profit, Microsoft should focus on generating buzz online, so that more people go out and see the film.

##### In essence, if I were Microsoft, I would create a Minecraft animated film, as Minecraft is part of Microsoft's intellectual property, and the data shows that an animated film has a high probability of being successful.