#### Christopher B. Johnson
#### 11/12/17


# Project: Investigate What Makes a Movie Exceptional

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction


>This project will explore the different variables involved in making movies that are exceptional.  So, what defines a good movie?  

>  Using the dataset from the TMDb database, and for the sake of this investigation, a good movie can be defined in three ways: popularity; film profit as a proportion of budget, $$filmProfit = \frac{revenue-budget}{budget}$$ (the result is adjusted by budget so that a film with a 1 mil budget that makes 1 mil in profit is considered more successful than a 100 mil budget film that makes 1 mil); or the average voter score.  These three dependent variables address the more generic notions of fame, fortune, and quality, respectively.  Finally, there is Chris' opinion on the matter, which, after a brief review of the data, suggests that this is a highly subjective topic.
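
> As a quick check of the budget-adjusted profit defined above, the sketch below evaluates the formula for the two cases mentioned in the text. The function name `film_profit` and the sample figures are illustrative assumptions and are not taken from the dataset.

```python
def film_profit(revenue, budget):
    """Return profit as a proportion of budget: (revenue - budget) / budget."""
    if budget <= 0:
        # Zero budgets are placeholders in this dataset, so the ratio is undefined.
        raise ValueError("budget must be positive")
    return (revenue - budget) / float(budget)

# A 1 mil budget film that makes 1 mil in profit (revenue of 2 mil).
small = film_profit(revenue=2000000, budget=1000000)

# A 100 mil budget film that makes 1 mil in profit (revenue of 101 mil).
large = film_profit(revenue=101000000, budget=100000000)

print(small, large)
```

> Under this definition the small film scores 100 times higher than the large one, even though both earn the same absolute profit.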

>  Given the size of the dataset (approximately 10,000 movies), each definition of success will be applied to its top 500 movies (roughly 5%), since this appears to capture most of the significant information.
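
> A minimal sketch of how such a top slice could be selected with pandas, assuming a toy DataFrame in place of the real file. The names `df`, `top_fraction`, and `n_top` are hypothetical, and the data is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; the real notebook reads tmdb-movies.csv instead.
df = pd.DataFrame({
    'id': list(range(1, 21)),
    'popularity': [0.5 * i for i in range(1, 21)],
})

top_fraction = 0.05  # the report keeps roughly the top 5% of movies
n_top = max(1, int(round(len(df) * top_fraction)))

# nlargest returns the n rows with the highest values in the given column.
df_top = df.nlargest(n_top, 'popularity')
print(df_top)
```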

## Questions

>>1. Is there some common combination of genres that most successful movies share?
>>2. Is there any correlation between successful movies and their director?
>>3. Are most successful movies associated with big budgets?

>In order to answer these questions, as stated above, movies will first be ranked by the three definitions of success (fame, fortune, and quality).  These are the dependent variables, based on the fields popularity, (budget, revenue), and vote_average.  An attempt will then be made to determine whether there is any correlation between the movie genres, directors, or budgets and these top-ranking successful movies.
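
> One simple first step toward the director question is to count how often each director appears among the top-ranked movies. The sketch below uses made-up (title, director) pairs, and counting occurrences is only a rough proxy, not a statistical test of correlation.

```python
from collections import Counter

# Hypothetical (title, director) pairs standing in for a top-ranked subset.
top_movies = [
    ('Movie A', 'Director X'),
    ('Movie B', 'Director Y'),
    ('Movie C', 'Director X'),
    ('Movie D', 'Director Z'),
    ('Movie E', 'Director X'),
]

# Count how many top-ranked movies each director has.
director_counts = Counter(director for _, director in top_movies)

# The most frequent director in the top-ranked subset.
most_frequent = director_counts.most_common(1)[0]
print(most_frequent)
```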

### Data Uncertainty
> The current website for this data is at https://www.kaggle.com/tmdb/tmdb-movie-metadata

> While the site confirms that the existing zeros in the data are to be ignored, it does not explain how the budget_adj and revenue_adj columns should be interpreted.  Since these numbers went down in some cases, I don't think they are adjusted to 2017 dollars; they may instead be the actual values after review.  Revenues also dropped after adjustment in some cases.  Because it is not clear what these adjusted numbers represent, they will not be used.

In [2]:
#import all packages for this analysis
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#make all plots in the notebook vs separate window
%matplotlib inline


<a id='wrangling'></a>
## Data Wrangling


> In this section, the movie csv file is loaded into a dataframe and the tail is displayed to show some concerns with the data.  The problems are as follows:
1.  There are zeros in the budget, revenue, budget_adj, and revenue_adj columns.  These are likely placeholders rather than real values.  These movies should not be included in the assessment of success using net profits, as they would wildly skew the results, and there is no easy way to recover the missing information.
2.  Some columns, such as homepage, tagline, and overview, are not needed for this analysis and only clutter the table, so they will not be used.  While it could be interesting to look at the vocabulary and punctuation to see if, say, more !'s result in more viewers, this will not be pursued.
3.  While creating the money rank, the highest performers had a profit ratio of over 1 million.  Upon further inspection, for those movies, someone had entered the budget assuming millions and then put down absolute revenue.  For example, id 10495, Karate Kid III, has a budget of 113.00 and a revenue of 115103979.  While it's rather crude, I'm going to assume that all movies have a budget greater than 100,000, so only those movies will be considered.  I'm going to leave the revenue alone for now.
4.  While working on casting the genres from lists to sets, it failed on index 426, which is a blank cell.  The cell was identified as a float instead of an empty list.  This was also a lesson in groupby not accepting mutable types, so I cast the column to string, split on |, converted to a set (to remove duplicates), and then sorted the set and cast it to a tuple (an immutable sequence that groupby can use).  This fixed the genre column.  I also applied it to director (likely unnecessary) and cast (flawed, since it identifies a particular group of actors rather than individuals; that works for genres, but not so much for actors).
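
> The first three cleaning decisions above can be sketched as follows. This is a minimal illustration on a made-up DataFrame, not the notebook's actual cleaning cell: the names `MIN_BUDGET` and `df_money` are hypothetical, and only the figures for id 10495 are taken from the text.

```python
import pandas as pd

# Hypothetical sample rows; only the id 10495 figures come from the text above.
df = pd.DataFrame({
    'id': [1, 2, 3, 10495],
    'budget': [150000000, 0, 19000, 113],
    'revenue': [500000000, 0, 0, 115103979],
    'homepage': [None, None, None, None],
})

# Drop columns that are not used in this analysis (ignored if absent).
df = df.drop(['homepage', 'tagline', 'overview'], axis=1, errors='ignore')

MIN_BUDGET = 100000  # assumed floor from item 3; crude but simple

# Keep rows with a plausible budget and a non-zero revenue.
mask = (df['budget'] > MIN_BUDGET) & (df['revenue'] > 0)
df_money = df[mask].copy()

# Profit as a proportion of budget, as defined in the Introduction.
df_money['filmProfit'] = (
    (df_money['revenue'] - df_money['budget']) / df_money['budget'].astype(float)
)
print(df_money)
```

> Filtering on the budget floor removes both the zero placeholders and the rows where the budget was apparently entered in millions.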

### General Properties

In [10]:
# Loading the movie data and doing an initial inspection.  Other problems were found as documented above.

filename = 'tmdb-movies.csv'
df_movies = pd.read_csv(filename)
print(df_movies.tail())

          id    imdb_id  popularity  budget  revenue  \
10861     21  tt0060371    0.080598       0        0   
10862  20379  tt0060472    0.065543       0        0   
10863  39768  tt0060161    0.065141       0        0   
10864  21449  tt0061177    0.064317       0        0   
10865  22293  tt0060666    0.035919   19000        0   

                 original_title  \
10861        The Endless Summer   
10862                Grand Prix   
10863       Beregis Avtomobilya   
10864    What's Up, Tiger Lily?   
10865  Manos: The Hands of Fate   

                                                    cast homepage  \
10861  Michael Hynson|Robert August|Lord 'Tally Ho' B...      NaN   
10862  James Garner|Eva Marie Saint|Yves Montand|Tosh...      NaN   
10863  Innokentiy Smoktunovskiy|Oleg Efremov|Georgi Z...      NaN   
10864  Tatsuya Mihashi|Akiko Wakabayashi|Mie Hama|Joh...      NaN   
10865  Harold P. Warren|Tom Neyman|John Reynolds|Dian...      NaN   

                 director            

### Prepare the data for investigation

>The approach here will be to create multiple tables much like preparing a schema for a database.  Once the tables are built, we can sort them as desired, and then do the necessary joins to create the associations in preparation for display.
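
> A minimal sketch of the sort-then-join idea described above, using two made-up tables keyed by id. The table names `df_common`, `df_rank`, and `df_joined` are hypothetical, as are the values.

```python
import pandas as pd

# Hypothetical independent-variable table keyed by movie id.
df_common = pd.DataFrame({
    'id': [1, 2, 3],
    'director': ['Director X', 'Director Y', 'Director Z'],
})

# Hypothetical ranking table: only movies that made the cut appear here.
df_rank = pd.DataFrame({
    'id': [3, 1],
    'popularity': [9.5, 7.2],
})

# Sort the ranking table, then inner-join on id to attach the attributes.
df_rank = df_rank.sort_values('popularity', ascending=False)
df_joined = df_rank.merge(df_common, on='id', how='inner')
print(df_joined)
```

> An inner join keeps only the movies present in both tables, so the result contains the ranked movies together with their independent variables.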

>First, prepare the table of independent variables.  The common fields dataframe had problems; specifically, around index 426, there was an empty cell in the data which was considered a float, when all other values were objects.  This caused a crash.  The intent with the genres was to look at which combinations of genres are associated with successful movies, so it was desirable to capture the combinations.  However, this resulted in combinations such as drama,comedy and comedy,drama being treated as separate entries.  The answer is to simply compare sets, but that became difficult since groupby will not work on mutable data objects.  Tuples would work with groupby, but on their own they don't solve the problem of order significance.  Hence the solution below: first cast to string (to eliminate the float) so the value can be split into a list on the | token.  Then convert the list to a set to eliminate any possible duplicates in the data.  Finally, return the sorted tuple, which fixes the order significance problem when feeding into groupby.
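
> To illustrate the normalization described above on single values, here is a small pure-Python sketch. The helper name `normalize_genres` is hypothetical. It mirrors the cast, split, set, and sorted-tuple steps applied to one cell rather than to a whole column.

```python
def normalize_genres(cell):
    """Turn a pipe-delimited cell into a sorted, duplicate-free tuple."""
    # str() mirrors the cast-to-string step: an empty cell (float NaN) becomes 'nan'.
    parts = str(cell).split('|')
    # set() removes duplicates; sorted() makes the order insignificant.
    return tuple(sorted(set(parts)))

a = normalize_genres('Drama|Comedy')
b = normalize_genres('Comedy|Drama|Comedy')
missing = normalize_genres(float('nan'))

print(a, b, missing)
```

> Both orderings map to the same hashable tuple, so they fall into the same group. Note that an empty cell ends up as the tuple `('nan',)` as a side effect of the string cast, so it forms a group of its own.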

In [12]:
independentFields = ['id','budget','cast','director','genres']
df_commonFields = pd.read_csv(filename,usecols=independentFields)

# columns whose pipe-delimited values are converted to sorted tuples
listOfColumns = ['genres','cast','director']

# groupby can't work on mutable objects, so each cell ends up as a tuple
# https://stackoverflow.com/questions/39622884/pandas-groupby-over-list
def groomIndependentVars(df, colList):
    """Groom the independent fields (in place) so they support groupby."""
    def setToTuple(x):
        # sorting makes the order of the values insignificant
        return tuple(sorted(x))
    for column in colList:
        # cast to string first so empty cells (float NaN) can be split too;
        # note that such cells become the tuple ('nan',)
        df[column] = df[column].astype(str)
        df[column] = df[column].str.split(pat='|')
        # a set removes any duplicate values within a cell
        df[column] = df[column].apply(set)
        df[column] = df[column].apply(setToTuple)
    return df

df_commonFields = groomIndependentVars(df_commonFields,listOfColumns)


### Data Cleaning : create the tables for investigation

In [9]:
# After discussing the structure of the data and any problems that need to be
#   cleaned, perform those cleaning steps in the second part of this section.
filename = 'tmdb-movies.csv'

# 0. get table of common properties (to look for correlations between dependent variables of money/fame/quality and independent variables of directors/release months/genres/budgets)
independentFields = ['id','budget','cast','director','genres']
df_commonFields = pd.read_csv(filename,usecols=independentFields)
print(df_commonFields.head())

# 1. Money : rank movies by net profits, and try to identify common : directors, release months, genres, budgets
moneyFields = ['id','budget','revenue']
df_movieMoney = pd.read_csv(filename, usecols=moneyFields)
print(df_movieMoney.head())

# 2. Fame : rank movies by popularity and try to identify common : directors, release months, genres, budgets
# 3. Quality : rank movies by vote, corrected by the number of vote counts (higher ranked votes by larger numbers of people are better than high votes by few people)
# 4. Summary : are there any movies common to Money,Fame and Quality? : identify common directors, release months, genres, budgets



       id     budget                                               cast  \
0  135397  150000000  Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...   
1   76341  150000000  Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...   
2  262500  110000000  Shailene Woodley|Theo James|Kate Winslet|Ansel...   
3  140607  200000000  Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...   
4  168259  190000000  Vin Diesel|Paul Walker|Jason Statham|Michelle ...   

           director                                     genres  
0   Colin Trevorrow  Action|Adventure|Science Fiction|Thriller  
1     George Miller  Action|Adventure|Science Fiction|Thriller  
2  Robert Schwentke         Adventure|Science Fiction|Thriller  
3       J.J. Abrams   Action|Adventure|Science Fiction|Fantasy  
4         James Wan                      Action|Crime|Thriller  
       id     budget     revenue
0  135397  150000000  1513528810
1   76341  150000000   378436354
2  262500  110000000   295238201
3  140607  200000000  20681

<a id='eda'></a>
## Exploratory Data Analysis


### Research Question 1: Is there some common combination of genres that most successful movies share?

In [None]:
# Use this, and more code cells, to explore your data. Don't forget to add
#   Markdown cells to document your observations and findings.


### Research Question 2: Is there any correlation between successful movies and their director?

In [None]:
# Continue to explore the data to address your additional research
#   questions. Add more headers as needed if you have more questions to
#   investigate.


<a id='conclusions'></a>
## Conclusions
