# Predict the Oscars with Data Science

In this practical workshop you'll use a dataset of past Oscar nominees and winners to build a model that predicts the winner of the Best Picture award. You'll get an introduction to a data scientist's tools and methods, including an overview of basic machine learning concepts. Unlike this year's Oscars, our model will predict only one winner!

## Initial imports and loading data with Pandas

In [1]:
import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier

pd.set_option('mode.chained_assignment', None)

In [2]:
train_file = "train.csv"
initial_train = pd.read_csv(train_file)

train = initial_train[(initial_train['Year'] > 1980)]

test_file = "test.csv"
test = pd.read_csv(test_file)

## Understanding your data

Run the two cells below: select each cell and press *`Shift-Enter`*.

In [3]:
train.head(5)

|  | Year | Movie | Won? | Budget | Opening Weekend | IMDB Rating | Genres | Won Golden Globe | Won Bafta | Oscar Nominations | Golden Globe Nominations | Bafta Nominations | IMdB id | Won Producers | Won Directors | Won Actors | Rate | Metascore |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2016 | Arrival | 0 | 47000000.0 | 24000000.0 | 8.1 | Drama, Mystery, Sci-Fi, Thriller | 0 | 0 | 8 | 2 | 9 | tt2543164 | 0 | 0 | 0 | PG-13 | 81.0 |
| 1 | 2016 | Fences | 0 | 24000000.0 | 129462.0 | 7.5 | Drama | 0 | 0 | 4 | 2 | 1 | tt2671706 | 0 | 0 | 0 | PG-13 | 79.0 |
| 2 | 2016 | Hacksaw Ridge | 0 | 40000000.0 | 15190758.0 | 8.3 | Drama, History, War | 0 | 0 | 6 | 3 | 5 | tt2119532 | 0 | 0 | 0 | R | 71.0 |
| 3 | 2016 | Hell or High Water | 0 | 12000000.0 | 621329.0 | 7.7 | Action, Crime, Drama, Western | 0 | 0 | 4 | 3 | 3 | tt2582782 | 0 | 0 | 0 | R | 88.0 |
| 4 | 2016 | Hidden Figures | 0 | 25000000.0 | 515499.0 | 7.9 | Biography, Drama, History | 0 | 0 | 3 | 2 | 1 | tt4846340 | 0 | 0 | 1 | PG | 74.0 |


In [4]:
train['Won?'].value_counts()

0    163
1     34
Name: Won?, dtype: int64
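
The counts above show that the classes are imbalanced: 34 winners against 163 non-winners. One way to see why this matters is to compute the accuracy of a trivial model that always predicts "did not win". The sketch below uses only the two counts printed above; the variable names are illustrative.

```python
# Class counts taken from the value_counts() output above
losers, winners = 163, 34
total = losers + winners

# A model that always predicts "did not win" is right for every non-winner
baseline_accuracy = losers / total
print(f"{total} nominees, baseline accuracy {baseline_accuracy:.3f}")
```

Any accuracy score reported later is worth comparing against this baseline rather than against 0.5.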

## Formatting your Data

In [5]:
# Encode the content rating as a number so it can be analyzed.
# .loc replaces the deprecated .ix indexer, which newer pandas versions no longer provide.
rate_codes = {"G": 1, "PG": 2, "PG-13": 3, "R": 4}

for df in (train, test):
    for rating, code in rate_codes.items():
        df.loc[df["Rate"] == rating, "Rate"] = code


In [6]:
train.head(5)

|  | Year | Movie | Won? | Budget | Opening Weekend | IMDB Rating | Genres | Won Golden Globe | Won Bafta | Oscar Nominations | Golden Globe Nominations | Bafta Nominations | IMdB id | Won Producers | Won Directors | Won Actors | Rate | Metascore |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2016 | Arrival | 0 | 47000000.0 | 24000000.0 | 8.1 | Drama, Mystery, Sci-Fi, Thriller | 0 | 0 | 8 | 2 | 9 | tt2543164 | 0 | 0 | 0 | 3 | 81.0 |
| 1 | 2016 | Fences | 0 | 24000000.0 | 129462.0 | 7.5 | Drama | 0 | 0 | 4 | 2 | 1 | tt2671706 | 0 | 0 | 0 | 3 | 79.0 |
| 2 | 2016 | Hacksaw Ridge | 0 | 40000000.0 | 15190758.0 | 8.3 | Drama, History, War | 0 | 0 | 6 | 3 | 5 | tt2119532 | 0 | 0 | 0 | 4 | 71.0 |
| 3 | 2016 | Hell or High Water | 0 | 12000000.0 | 621329.0 | 7.7 | Action, Crime, Drama, Western | 0 | 0 | 4 | 3 | 3 | tt2582782 | 0 | 0 | 0 | 4 | 88.0 |
| 4 | 2016 | Hidden Figures | 0 | 25000000.0 | 515499.0 | 7.9 | Biography, Drama, History | 0 | 0 | 3 | 2 | 1 | tt4846340 | 0 | 0 | 1 | 2 | 74.0 |


## Cleaning your Data

In [9]:
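# Fill missing ratings with the column median and assign the result back.
# (fillna(..., inplace=True) on a selected column may not update the frame in recent pandas.)
train["IMDB Rating"] = train["IMDB Rating"].fillna(train["IMDB Rating"].median())
test["IMDB Rating"] = test["IMDB Rating"].fillna(test["IMDB Rating"].median())

train["Metascore"] = train["Metascore"].fillna(train["Metascore"].median())
# Note: gaps in the test Metascore are filled with the training median
test["Metascore"] = test["Metascore"].fillna(train["Metascore"].median())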
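
The cleaning step fills missing ratings with the column median. A small sketch on a hypothetical toy frame (`toy` and its values are invented for illustration and are not part of the workshop data) shows one way to count the gaps before and after filling:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the workshop data
toy = pd.DataFrame({
    "IMDB Rating": [8.1, np.nan, 7.5, 7.9],
    "Metascore": [81.0, 79.0, np.nan, np.nan],
})

# Count the missing values per column before filling
missing_before = toy.isna().sum()

# Fill each column with its own median, assigning the result back
for column in ["IMDB Rating", "Metascore"]:
    toy[column] = toy[column].fillna(toy[column].median())

# After filling, no column should have missing values left
missing_after = toy.isna().sum()
print(missing_before)
print(missing_after)
```

Running the same `isna().sum()` check on `train` and `test` would confirm that the columns used as features contain no gaps before a model is fitted.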

## Decision Tree

In [10]:
target = train["Won?"].values

feature_names = [
    "Oscar Nominations",
    "Won Golden Globe",
    "Golden Globe Nominations",
    "Won Bafta",
    "Bafta Nominations",
    "Won Producers",
    "Won Actors",
    "Won Directors",
    "Metascore",
    "IMDB Rating"]

features = train[feature_names].values

# Fit your first decision tree: my_tree
my_tree = tree.DecisionTreeClassifier()
my_tree = my_tree.fit(features, target)

In [11]:
tree_importances = pd.DataFrame(my_tree.feature_importances_, feature_names, columns=["Importances"])

print(tree_importances)
print('Score', my_tree.score(features, target))

                          Importances
Oscar Nominations            0.151787
Won Golden Globe             0.041471
Golden Globe Nominations     0.123654
Won Bafta                    0.000000
Bafta Nominations            0.115543
Won Producers                0.000000
Won Actors                   0.013440
Won Directors                0.415422
Metascore                    0.075739
IMDB Rating                  0.062944
Score 1.0


## Predicting

In [12]:
test_features = test[feature_names].values

pred_tree = my_tree.predict_proba(test_features)[:, 1]

movie_name = np.array(test["Movie"])
year = np.array(test["Year"])
won = np.array(test["Won?"])

tree_prediction = pd.DataFrame(pred_tree.round(2), movie_name, columns=["Probability"])
tree_prediction["Year"] = year
tree_prediction["Actually Won?"] = won

In [14]:
tree_prediction[tree_prediction['Year'] != 2017]

|  | Probability | Year | Actually Won? |
| --- | --- | --- | --- |
| Avatar | 0.0 | 2009 | 0.0 |
| The Blind Side | 0.0 | 2009 | 0.0 |
| District 9 | 0.0 | 2009 | 0.0 |
| An Education | 0.0 | 2009 | 0.0 |
| The Hurt Locker | 0.0 | 2009 | 1.0 |
| Inglourious Basterds | 1.0 | 2009 | 0.0 |
| Precious: Based on the Novel 'Push' by Sapphire | 0.0 | 2009 | 0.0 |
| A Serious Man | 0.0 | 2009 | 0.0 |
| Up | 0.0 | 2009 | 0.0 |
| Up in the Air | 0.0 | 2009 | 0.0 |


## Overfitting

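* The resulting model is too closely tied to the training set: the tree scores 1.0 on the data it was trained on, yet for 2009 it gives Inglourious Basterds a probability of 1.0 and the actual winner, The Hurt Locker, a probability of 0.0.
* An overfit model doesn't generalize to new data, and generalizing is the whole point of prediction.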
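
A common way to detect overfitting is to score the model on data it has not seen. The sketch below is an illustration on synthetic, noisy data from scikit-learn's `make_classification`, not on the Oscar dataset; the sample sizes and noise level are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% of labels flipped at random (not the Oscar data)
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=3, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# An unrestricted tree keeps splitting until it fits the training rows exactly
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_score = deep_tree.score(X_train, y_train)
test_score = deep_tree.score(X_test, y_test)
print("train", train_score, "held-out", test_score)
```

The gap between the two scores is the symptom to look for: a perfect training score says little about how the tree behaves on new rows.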

## Random Forest Classifier

Random forest classifiers combine many decision trees into one classifier. Each tree is built on a random sample of the training data, and this process is repeated many times (1000 times in our case, set by `n_estimators=1000`). The forest's prediction averages the predictions of its individual trees, which makes it less prone to overfitting than a single tree.
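
The procedure described above can be sketched by hand. This is a simplified illustration on synthetic data, not scikit-learn's actual implementation; the tree count of 25 and the names `trees` and `single_tree` are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the features and target
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
rng = np.random.default_rng(0)

trees = []
for seed in range(25):
    # Bootstrap sample: draw row indices with replacement
    rows = rng.integers(0, len(X), size=len(X))
    # Each split considers only a random subset of the features
    single_tree = DecisionTreeClassifier(max_features="sqrt", random_state=seed)
    trees.append(single_tree.fit(X[rows], y[rows]))

# Average the per-tree probabilities of the positive class
probabilities = np.mean([t.predict_proba(X)[:, 1] for t in trees], axis=0)
print(probabilities[:5])
```

Because each tree sees a different sample, individual trees overfit in different ways, and averaging tends to cancel part of that out.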

In [15]:
forest = RandomForestClassifier(
    max_depth=25,
    min_samples_split=15,
    n_estimators=1000,
    random_state=1)

my_forest = forest.fit(features, target)

In [16]:
forest_importances = pd.DataFrame(my_forest.feature_importances_, feature_names, columns=["Importances"])

print(forest_importances)
print('Score', my_forest.score(features, target))

                          Importances
Oscar Nominations            0.117848
Won Golden Globe             0.064634
Golden Globe Nominations     0.047376
Won Bafta                    0.015263
Bafta Nominations            0.062306
Won Producers                0.119394
Won Actors                   0.069556
Won Directors                0.364292
Metascore                    0.057246
IMDB Rating                  0.082086
Score 0.939086294416


## Predicting with Random Forest Classifier

In [17]:
pred_forest = my_forest.predict_proba(test_features)[:, 1]

forest_prediction = pd.DataFrame(pred_forest, movie_name, columns=["Probability"])
forest_prediction["Year"] = year
forest_prediction["Actually Won?"] = won

In [18]:
normalized_prediction = forest_prediction.copy()

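# Rescale each movie's probability so that the probabilities within a year sum to 1.
# The grouped version avoids the row-by-row loop and is safe if a title appears in more than one year.
year_totals = forest_prediction.groupby("Year")["Probability"].transform("sum")
normalized_prediction["Probability"] = (forest_prediction["Probability"] / year_totals).round(2)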
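
The normalization divides each movie's probability by the sum of the probabilities of all nominees from the same year, so that each year's values add up to about 1. A worked example with hypothetical numbers (the films, years, and probabilities below are invented for illustration):

```python
import pandas as pd

# Hypothetical raw forest probabilities for two award years
raw = pd.DataFrame(
    {
        "Probability": [0.30, 0.15, 0.05, 0.20, 0.20],
        "Year": [2000, 2000, 2000, 2001, 2001],
    },
    index=["Film A", "Film B", "Film C", "Film D", "Film E"],
)

# Sum of the raw probabilities within each year, repeated on every row
year_totals = raw.groupby("Year")["Probability"].transform("sum")

# Each film's share of its year's total, rounded to two decimals
normalized = (raw["Probability"] / year_totals).round(2)
print(normalized)
```

Film A's raw 0.30 is six tenths of its year's total of 0.50, so its normalized value is 0.6, while Films D and E split their year evenly at 0.5 each.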

In [19]:
normalized_prediction[normalized_prediction["Year"] == 1976].sort_values("Probability", ascending=False)

|  | Probability | Year | Actually Won? |
| --- | --- | --- | --- |
| Rocky | 0.76 | 1976 | 1.0 |
| Network | 0.12 | 1976 | 0.0 |
| Taxi Driver | 0.09 | 1976 | 0.0 |
| All the President's Men | 0.04 | 1976 | 0.0 |
| Bound for Glory | 0.0 | 1976 | 0.0 |


In [20]:
normalized_prediction[normalized_prediction["Year"] == 1984].sort_values("Probability", ascending=False)

|  | Probability | Year | Actually Won? |
| --- | --- | --- | --- |
| Amadeus | 0.8 | 1984 | 1.0 |
| A Passage to India | 0.1 | 1984 | 0.0 |
| The Killing Fields | 0.09 | 1984 | 0.0 |
| A Soldier's Story | 0.02 | 1984 | 0.0 |
| Places in the Heart | 0.0 | 1984 | 0.0 |


In [21]:
normalized_prediction[normalized_prediction["Year"] == 1996].sort_values("Probability", ascending=False)

|  | Probability | Year | Actually Won? |
| --- | --- | --- | --- |
| The English Patient | 0.9 | 1996 | 1.0 |
| Fargo | 0.05 | 1996 | 0.0 |
| Secrets & Lies | 0.03 | 1996 | 0.0 |
| Shine | 0.01 | 1996 | 0.0 |
| Jerry Maguire | 0.0 | 1996 | 0.0 |


In [22]:
normalized_prediction[normalized_prediction["Year"] == 2009].sort_values("Probability", ascending=False)

|  | Probability | Year | Actually Won? |
| --- | --- | --- | --- |
| The Hurt Locker | 0.41 | 2009 | 1.0 |
| Inglourious Basterds | 0.28 | 2009 | 0.0 |
| Up | 0.15 | 2009 | 0.0 |
| Avatar | 0.09 | 2009 | 0.0 |
| Up in the Air | 0.02 | 2009 | 0.0 |
| The Blind Side | 0.01 | 2009 | 0.0 |
| District 9 | 0.01 | 2009 | 0.0 |
| An Education | 0.01 | 2009 | 0.0 |
| A Serious Man | 0.01 | 2009 | 0.0 |
| Precious: Based on the Novel 'Push' by Sapphire | 0.0 | 2009 | 0.0 |


In [23]:
normalized_prediction[normalized_prediction["Year"] == 2017].sort_values("Probability", ascending=False)

|  | Probability | Year | Actually Won? |
| --- | --- | --- | --- |
| The Shape of Water | 0.47 | 2017 |  |
| Three Billboards Outside Ebbing, Missouri | 0.28 | 2017 |  |
| Dunkirk | 0.08 | 2017 |  |
| Call Me By Your Name | 0.06 | 2017 |  |
| Lady Bird | 0.05 | 2017 |  |
| Phantom Thread | 0.03 | 2017 |  |
| Darkest Hour | 0.02 | 2017 |  |
| The Post | 0.02 | 2017 |  |
| Get Out | 0.0 | 2017 |  |
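
One way to summarize the per-year outputs is to take each year's highest-probability movie as the model's pick and check it against the `Actually Won?` column. A minimal sketch, using the top three rows copied from the 1976 and 2009 outputs above rather than the full `normalized_prediction` frame (the names `predictions`, `top_pick`, and `hits` are illustrative):

```python
import pandas as pd

# Top three rows copied from the 1976 and 2009 outputs above
predictions = pd.DataFrame(
    {
        "Probability": [0.76, 0.12, 0.09, 0.41, 0.28, 0.15],
        "Year": [1976, 1976, 1976, 2009, 2009, 2009],
        "Actually Won?": [1.0, 0.0, 0.0, 1.0, 0.0, 0.0],
    },
    index=["Rocky", "Network", "Taxi Driver",
           "The Hurt Locker", "Inglourious Basterds", "Up"],
)

# For each year, idxmax returns the index label (the movie) with the highest probability
top_pick = predictions.groupby("Year")["Probability"].idxmax()

# Look up whether each year's top pick actually won
hits = predictions.loc[top_pick.tolist(), "Actually Won?"] == 1.0
print(top_pick)
print("Share of years called correctly:", hits.mean())
```

Applied to the full frame, the same steps would need the 2017 rows filtered out first, since their `Actually Won?` values are missing.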
