# Predict the Oscars with Data Science

In this practical workshop you'll use a dataset of previous Oscar winners to build a model that predicts the winner of the Best Picture award. You'll get an introduction to a data scientist's tools and methods, including an overview of basic machine learning concepts. Unlike this year's Oscars, our model will predict only one winner!

## Initial imports and loading data with Pandas

In [1]:
import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier

pd.set_option('mode.chained_assignment', None)

In [2]:
train_file = "train.csv"
initial_train = pd.read_csv(train_file)

# Keep only films from after 1980; copy so later edits don't touch initial_train
train = initial_train[initial_train['Year'] > 1980].copy()

test_file = "test.csv"
test = pd.read_csv(test_file)

## Understanding your data

Run the two cells below: select each cell and press *`Shift-Enter`*.

In [3]:
train.head(5)

| | Year | Movie | Won? | Budget | Opening Weekend | IMDB Rating | Genres | Won Golden Globe | Won Bafta | Oscar Nominations | Golden Globe Nominations | Bafta Nominations | IMdB id | Won Producers | Won Directors | Won Actors | Rate | Metascore |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015 | Spotlight | 1 | 20000000.0 | 295009.0 | 8.1 | Crime, Drama, History | 0 | 0 | 6 | 3 | 3 | tt1895587 | 0 | 0 | 1 | R | 93.0 |
| 1 | 2015 | The Big Short | 0 | 28000000.0 | 10531026.0 | 7.8 | Biography, Comedy, Drama, History | 0 | 0 | 5 | 4 | 5 | tt1596363 | 1 | 0 | 0 | R | 81.0 |
| 2 | 2015 | Bridge of Spies | 0 | 40000000.0 | 15371203.0 | 7.6 | Drama, History, Thriller | 0 | 0 | 6 | 1 | 9 | tt3682448 | 0 | 0 | 0 | PG-13 | 81.0 |
| 3 | 2015 | Brooklyn | 0 | 11000000.0 | 187281.0 | 7.5 | Drama, Romance | 0 | 0 | 3 | 1 | 6 | tt2381111 | 0 | 0 | 0 | PG-13 | 87.0 |
| 4 | 2015 | Mad Max: Fury Road | 0 | 150000000.0 | 45428128.0 | 8.1 | Action, Adventure, Sci-Fi, Thriller | 0 | 0 | 10 | 2 | 7 | tt1392190 | 0 | 0 | 0 | R | 90.0 |


In [4]:
train['Won?'].value_counts()

0    155
1     33
Name: Won?, dtype: int64

## Formatting your data

In [5]:
# Set Rate to a number to be able to analyze it
train.loc[train["Rate"] == "G", "Rate"] = 1
train.loc[train["Rate"] == "PG", "Rate"] = 2
train.loc[train["Rate"] == "PG-13", "Rate"] = 3
train.loc[train["Rate"] == "R", "Rate"] = 4

test.loc[test["Rate"] == "G", "Rate"] = 1
test.loc[test["Rate"] == "PG", "Rate"] = 2
test.loc[test["Rate"] == "PG-13", "Rate"] = 3
test.loc[test["Rate"] == "R", "Rate"] = 4

In [6]:
train.head(5)

| | Year | Movie | Won? | Budget | Opening Weekend | IMDB Rating | Genres | Won Golden Globe | Won Bafta | Oscar Nominations | Golden Globe Nominations | Bafta Nominations | IMdB id | Won Producers | Won Directors | Won Actors | Rate | Metascore |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015 | Spotlight | 1 | 20000000.0 | 295009.0 | 8.1 | Crime, Drama, History | 0 | 0 | 6 | 3 | 3 | tt1895587 | 0 | 0 | 1 | 4 | 93.0 |
| 1 | 2015 | The Big Short | 0 | 28000000.0 | 10531026.0 | 7.8 | Biography, Comedy, Drama, History | 0 | 0 | 5 | 4 | 5 | tt1596363 | 1 | 0 | 0 | 4 | 81.0 |
| 2 | 2015 | Bridge of Spies | 0 | 40000000.0 | 15371203.0 | 7.6 | Drama, History, Thriller | 0 | 0 | 6 | 1 | 9 | tt3682448 | 0 | 0 | 0 | 3 | 81.0 |
| 3 | 2015 | Brooklyn | 0 | 11000000.0 | 187281.0 | 7.5 | Drama, Romance | 0 | 0 | 3 | 1 | 6 | tt2381111 | 0 | 0 | 0 | 3 | 87.0 |
| 4 | 2015 | Mad Max: Fury Road | 0 | 150000000.0 | 45428128.0 | 8.1 | Action, Adventure, Sci-Fi, Thriller | 0 | 0 | 10 | 2 | 7 | tt1392190 | 0 | 0 | 0 | 4 | 90.0 |


## Cleaning your data

In [7]:
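# Fill missing values with the column median
train["IMDB Rating"] = train["IMDB Rating"].fillna(train["IMDB Rating"].median())
test["IMDB Rating"] = test["IMDB Rating"].fillna(test["IMDB Rating"].median())

train["Metascore"] = train["Metascore"].fillna(train["Metascore"].median())
# Gaps in the test Metascore are filled with the training median
test["Metascore"] = test["Metascore"].fillna(train["Metascore"].median())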

## Decision Tree

In [21]:
target = train["Won?"].values

feature_names = [
    "Oscar Nominations",
    "Won Golden Globe",
    "Golden Globe Nominations",
    "Won Bafta",
    "Bafta Nominations",
    "Won Producers",
    "Won Actors",
    "Won Directors",
    "Metascore",
    "IMDB Rating"]

features = train[feature_names].values

# Fit your first decision tree: my_tree
my_tree = tree.DecisionTreeClassifier()
my_tree = my_tree.fit(features, target)

In [22]:
tree_importances = pd.DataFrame(my_tree.feature_importances_, feature_names, columns=["Importances"])

print(tree_importances)
print('Score', my_tree.score(features, target))

                          Importances
Oscar Nominations            0.103140
Won Golden Globe             0.067384
Golden Globe Nominations     0.066598
Won Bafta                    0.000000
Bafta Nominations            0.148232
Won Producers                0.032166
Won Actors                   0.040028
Won Directors                0.447840
Metascore                    0.031822
IMDB Rating                  0.062790
('Score', 1.0)


## Predicting

In [23]:
test_features = test[feature_names].values

pred_tree = my_tree.predict_proba(test_features)[:, 1]

movie_name = np.array(test["Movie"])
year = np.array(test["Year"])
won = np.array(test["Won?"])

tree_prediction = pd.DataFrame(pred_tree.round(2), movie_name, columns=["Probability"])
tree_prediction["Year"] = year
tree_prediction["Actually Won?"] = won

In [24]:
tree_prediction[tree_prediction['Year'] != 2016]

| Movie | Probability | Year | Actually Won? |
|---|---|---|---|
| Avatar | 0.0 | 2009 | 0 |
| The Blind Side | 0.0 | 2009 | 0 |
| District 9 | 0.0 | 2009 | 0 |
| An Education | 0.0 | 2009 | 0 |
| The Hurt Locker | 0.0 | 2009 | 1 |
| Inglourious Basterds | 1.0 | 2009 | 0 |
| Precious: Based on the Novel 'Push' by Sapphire | 0.0 | 2009 | 0 |
| A Serious Man | 0.0 | 2009 | 0 |
| Up | 0.0 | 2009 | 0 |
| Up in the Air | 0.0 | 2009 | 0 |


## Overfitting

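* The resulting model is tied too closely to the training set: the decision tree scores 1.0 on the training data, yet for 2009 it gives Inglourious Basterds a probability of 1.0 and the actual winner, The Hurt Locker, a probability of 0.0.
* It doesn't generalize to new data, and generalizing is the whole point of prediction.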
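One way to see overfitting in numbers is to compare accuracy on the rows used for fitting with accuracy on rows the tree has never seen. The sketch below is not part of the workshop data flow: it uses a synthetic, mostly noisy dataset as a stand-in (the feature matrix `X`, the labels `y`, and the 25% hold-out size are all assumptions chosen for illustration).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical): 10 random features and a label that
# depends only weakly on the first one, so most of the variation is noise.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + rng.normal(scale=2.0, size=200) > 1.5).astype(int)

# Hold out a quarter of the rows; the tree never sees them during fitting.
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# An unconstrained tree keeps splitting until every training row is classified.
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_score = deep_tree.score(X_train, y_train)
holdout_score = deep_tree.score(X_holdout, y_holdout)

print("Training accuracy:", train_score)
print("Held-out accuracy:", holdout_score)
```

A large gap between the two numbers suggests the tree has memorized noise in the training rows rather than learned a pattern that carries over to new data.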

## Random Forest Classifier

A random forest classifier combines many decision trees into a single classifier. Each tree is built on a random subset of the data, and this process is repeated many times (1000 times in our case). The predictions of all the trees are then combined to produce the forest's prediction.
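To make the idea concrete, here is a rough hand-rolled sketch of the same recipe: draw a random sample of rows, fit a tree, repeat, then average the trees' probabilities. It is a simplification for illustration, not scikit-learn's actual implementation, and the synthetic data, the name `n_trees`, and the choice of 50 trees are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(1)

# Synthetic stand-in data (hypothetical): the label depends on two features.
X = rng.normal(size=(300, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

n_trees = 50  # the workshop's forest uses 1000; fewer keeps the sketch fast
trees = []
for _ in range(n_trees):
    # Draw a bootstrap sample: rows picked at random, with replacement.
    rows = rng.randint(0, len(X), size=len(X))
    # max_features="sqrt" makes each split consider a random subset of columns.
    one_tree = DecisionTreeClassifier(max_features="sqrt", random_state=rng)
    trees.append(one_tree.fit(X[rows], y[rows]))

# The ensemble's probability of class 1 is the average over all trees.
ensemble_proba = np.mean([t.predict_proba(X)[:, 1] for t in trees], axis=0)
ensemble_pred = (ensemble_proba >= 0.5).astype(int)

print("Ensemble training accuracy:", (ensemble_pred == y).mean())
```

Because each tree sees a different sample, the individual trees make different mistakes, and averaging tends to smooth those mistakes out. `RandomForestClassifier` packages this loop, plus further options such as `max_depth` and `min_samples_split`, behind a single `fit` call.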

In [25]:
forest = RandomForestClassifier(
    max_depth=25,
    min_samples_split=15,
    n_estimators=1000,
    random_state=1)

my_forest = forest.fit(features, target)

In [26]:
forest_importances = pd.DataFrame(my_forest.feature_importances_, feature_names, columns=["Importances"])

print(forest_importances)
print('Score', my_forest.score(features, target))

                          Importances
Oscar Nominations            0.121838
Won Golden Globe             0.047893
Golden Globe Nominations     0.048486
Won Bafta                    0.016283
Bafta Nominations            0.060315
Won Producers                0.133330
Won Actors                   0.069735
Won Directors                0.375788
Metascore                    0.037767
IMDB Rating                  0.088565
('Score', 0.94148936170212771)


## Predicting with Random Forest Classifier

In [27]:
pred_forest = my_forest.predict_proba(test_features)[:, 1]

forest_prediction = pd.DataFrame(pred_forest, movie_name, columns=["Probability"])
forest_prediction["Year"] = year
forest_prediction["Actually Won?"] = won

In [28]:
normalized_prediction = forest_prediction.copy()

# Rescale each film's probability so that the probabilities within a year sum to 1
for index, row in normalized_prediction.iterrows():
    normalized_prediction.loc[index, "Probability"] = \
        (row["Probability"] / forest_prediction["Probability"][forest_prediction["Year"] == row["Year"]].sum()).round(2)
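The forest scores each film independently, so the raw probabilities for one year need not add up to 1. Dividing each film's probability by its year's total turns them into shares of a single award. As a possible alternative to the row-by-row loop, the same rescaling can be sketched with `groupby(...).transform("sum")`; the film names and probabilities below are made up for illustration.

```python
import pandas as pd

# Hypothetical raw forest probabilities for two years.
raw = pd.DataFrame(
    {"Probability": [0.30, 0.10, 0.10, 0.20, 0.20],
     "Year": [2000, 2000, 2000, 2001, 2001]},
    index=["Film A", "Film B", "Film C", "Film D", "Film E"])

# Total raw probability within each year, repeated on every row of that year.
year_totals = raw.groupby("Year")["Probability"].transform("sum")

normalized = raw.copy()
normalized["Probability"] = (raw["Probability"] / year_totals).round(2)

print(normalized)
print(normalized.groupby("Year")["Probability"].sum())
```

Rounding to two decimals means a year's total can drift slightly from 1 when there are many nominees.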

In [29]:
normalized_prediction[normalized_prediction["Year"] == 1976].sort_values("Probability", ascending=False)

| Movie | Probability | Year | Actually Won? |
|---|---|---|---|
| Rocky | 0.74 | 1976 | 1 |
| Network | 0.13 | 1976 | 0 |
| Taxi Driver | 0.09 | 1976 | 0 |
| All the President's Men | 0.03 | 1976 | 0 |
| Bound for Glory | 0.0 | 1976 | 0 |


In [30]:
normalized_prediction[normalized_prediction["Year"] == 1984].sort_values("Probability", ascending=False)

| Movie | Probability | Year | Actually Won? |
|---|---|---|---|
| Amadeus | 0.77 | 1984 | 1 |
| A Passage to India | 0.12 | 1984 | 0 |
| The Killing Fields | 0.09 | 1984 | 0 |
| A Soldier's Story | 0.02 | 1984 | 0 |
| Places in the Heart | 0.0 | 1984 | 0 |


In [31]:
normalized_prediction[normalized_prediction["Year"] == 1996].sort_values("Probability", ascending=False)

| Movie | Probability | Year | Actually Won? |
|---|---|---|---|
| The English Patient | 0.89 | 1996 | 1 |
| Fargo | 0.07 | 1996 | 0 |
| Secrets & Lies | 0.03 | 1996 | 0 |
| Shine | 0.01 | 1996 | 0 |
| Jerry Maguire | 0.0 | 1996 | 0 |


In [32]:
normalized_prediction[normalized_prediction["Year"] == 2009].sort_values("Probability", ascending=False)

| Movie | Probability | Year | Actually Won? |
|---|---|---|---|
| The Hurt Locker | 0.41 | 2009 | 1 |
| Inglourious Basterds | 0.28 | 2009 | 0 |
| Up | 0.16 | 2009 | 0 |
| Avatar | 0.07 | 2009 | 0 |
| A Serious Man | 0.02 | 2009 | 0 |
| Up in the Air | 0.02 | 2009 | 0 |
| The Blind Side | 0.01 | 2009 | 0 |
| District 9 | 0.01 | 2009 | 0 |
| An Education | 0.01 | 2009 | 0 |
| Precious: Based on the Novel 'Push' by Sapphire | 0.0 | 2009 | 0 |


In [33]:
normalized_prediction[normalized_prediction["Year"] == 2016].sort_values("Probability", ascending=False)

| Movie | Probability | Year | Actually Won? |
|---|---|---|---|
| La La Land | 0.53 | 2016 | 0 |
| Hidden Figures | 0.17 | 2016 | 0 |
| Moonlight | 0.13 | 2016 | 1 |
| Manchester by the Sea | 0.09 | 2016 | 0 |
| Arrival | 0.04 | 2016 | 0 |
| Hacksaw Ridge | 0.03 | 2016 | 0 |
| Lion | 0.02 | 2016 | 0 |
| Fences | 0.0 | 2016 | 0 |
| Hell or High Water | 0.0 | 2016 | 0 |
