# Predict the Oscars with Data Science

In this practical workshop you'll use a dataset of past Oscar nominees and winners to build a model that predicts the winner of the Best Picture award. Along the way you'll get an introduction to a data scientist's tools and methods, including an overview of basic machine learning concepts. Unlike this year's Oscars, our model will predict only one winner!

## Initial imports and loading data with Pandas

In [2]:
import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier

pd.set_option('mode.chained_assignment', None)

In [3]:
train_file = "train.csv"
initial_train = pd.read_csv(train_file)

# Keep only films released after 1980; .copy() makes train an independent DataFrame
train = initial_train[initial_train['Year'] > 1980].copy()

test_file = "test.csv"
test = pd.read_csv(test_file)

## Understanding your data

Run the two cells below: select each cell and press *`Shift-Enter`*.

In [4]:
train.head(5)

Unnamed: 0,Year,Movie,Won?,Budget,Opening Weekend,IMDB Rating,Genres,Won Golden Globe,Won Bafta,Oscar Nominations,Golden Globe Nominations,Bafta Nominations,IMdB id,Won Producers,Won Directors,Won Actors,Rate,Metascore
0,2015,Spotlight,1,20000000.0,295009.0,8.1,"Crime, Drama, History",0,0,6,3,3,tt1895587,0,0,1,R,93.0
1,2015,The Big Short,0,28000000.0,10531026.0,7.8,"Biography, Comedy, Drama, History",0,0,5,4,5,tt1596363,1,0,0,R,81.0
2,2015,Bridge of Spies,0,40000000.0,15371203.0,7.6,"Drama, History, Thriller",0,0,6,1,9,tt3682448,0,0,0,PG-13,81.0
3,2015,Brooklyn,0,11000000.0,187281.0,7.5,"Drama, Romance",0,0,3,1,6,tt2381111,0,0,0,PG-13,87.0
4,2015,Mad Max: Fury Road,0,150000000.0,45428128.0,8.1,"Action, Adventure, Sci-Fi, Thriller",0,0,10,2,7,tt1392190,0,0,0,R,90.0


In [5]:
train['Won?'].value_counts()

0    155
1     33
Name: Won?, dtype: int64
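
The counts above show an imbalanced target: 33 winners against 155 non-winners. As a rough point of reference, the sketch below computes the accuracy of a trivial model that always predicts "did not win". The `counts` series is rebuilt by hand from the output above, and `baseline_accuracy` is an illustrative name rather than part of the workshop code.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({0: 155, 1: 33}, name="Won?")

# Accuracy of always predicting the majority class (0 = did not win)
baseline_accuracy = counts.max() / counts.sum()
print(round(baseline_accuracy, 3))  # 0.824
```

Any accuracy score reported on this data is therefore best compared against roughly 82%, not 50%.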

## Formatting your Data

In [6]:
# Encode the Rate column as a number so the model can analyze it
train.loc[train["Rate"] == "G", "Rate"] = 1
train.loc[train["Rate"] == "PG", "Rate"] = 2
train.loc[train["Rate"] == "PG-13", "Rate"] = 3
train.loc[train["Rate"] == "R", "Rate"] = 4

test.loc[test["Rate"] == "G", "Rate"] = 1
test.loc[test["Rate"] == "PG", "Rate"] = 2
test.loc[test["Rate"] == "PG-13", "Rate"] = 3
test.loc[test["Rate"] == "R", "Rate"] = 4

In [7]:
train.head(5)

Unnamed: 0,Year,Movie,Won?,Budget,Opening Weekend,IMDB Rating,Genres,Won Golden Globe,Won Bafta,Oscar Nominations,Golden Globe Nominations,Bafta Nominations,IMdB id,Won Producers,Won Directors,Won Actors,Rate,Metascore
0,2015,Spotlight,1,20000000.0,295009.0,8.1,"Crime, Drama, History",0,0,6,3,3,tt1895587,0,0,1,4,93.0
1,2015,The Big Short,0,28000000.0,10531026.0,7.8,"Biography, Comedy, Drama, History",0,0,5,4,5,tt1596363,1,0,0,4,81.0
2,2015,Bridge of Spies,0,40000000.0,15371203.0,7.6,"Drama, History, Thriller",0,0,6,1,9,tt3682448,0,0,0,3,81.0
3,2015,Brooklyn,0,11000000.0,187281.0,7.5,"Drama, Romance",0,0,3,1,6,tt2381111,0,0,0,3,87.0
4,2015,Mad Max: Fury Road,0,150000000.0,45428128.0,8.1,"Action, Adventure, Sci-Fi, Thriller",0,0,10,2,7,tt1392190,0,0,0,4,90.0


## Cleaning your Data

In [8]:
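# Fill missing values with the column median (assigning back avoids chained in-place edits)
train["IMDB Rating"] = train["IMDB Rating"].fillna(train["IMDB Rating"].median())
test["IMDB Rating"] = test["IMDB Rating"].fillna(test["IMDB Rating"].median())

train["Metascore"] = train["Metascore"].fillna(train["Metascore"].median())
# Note: missing test Metascore values are filled with the training median
test["Metascore"] = test["Metascore"].fillna(train["Metascore"].median())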

## Decision Tree

In [9]:
target = train["Won?"].values

feature_names = [
    "Won Bafta",
    "Won Golden Globe",
    "IMDB Rating"]

features = train[feature_names].values

# Fit your first decision tree: my_tree
my_tree = tree.DecisionTreeClassifier()
my_tree = my_tree.fit(features, target)

In [10]:
tree_importances = pd.DataFrame(my_tree.feature_importances_, feature_names, columns=["Importances"])

print(tree_importances)
# Accuracy measured on the training data itself, not on a held-out set
print('Score', my_tree.score(features, target))

                  Importances
Won Bafta            0.060264
Won Golden Globe     0.272212
IMDB Rating          0.667524
('Score', 0.88297872340425532)
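
The score above is computed on the same rows the tree was fitted on, so it probably overstates how well the model generalizes. One common way to get a less optimistic estimate is k-fold cross-validation. The sketch below shows the idea with scikit-learn's `cross_val_score`. It runs on synthetic data that only mimics the three feature columns (the values are random, not real Oscar data), and `max_depth=3` and `cv=5` are arbitrary choices made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data (random values, NOT real Oscar data)
rng = np.random.RandomState(0)
n_rows = 200
demo = pd.DataFrame({
    "Won Bafta": rng.randint(0, 2, n_rows),
    "Won Golden Globe": rng.randint(0, 2, n_rows),
    "IMDB Rating": rng.uniform(6.0, 9.0, n_rows).round(1),
})
# Make the label loosely depend on the features so there is something to learn
signal = demo["Won Bafta"] + demo["Won Golden Globe"] + (demo["IMDB Rating"] > 8.0)
demo["Won?"] = (signal >= 2).astype(int)

feature_names = ["Won Bafta", "Won Golden Globe", "IMDB Rating"]
X = demo[feature_names].values
y = demo["Won?"].values

# 5-fold cross-validation: each fold is scored on rows the tree did not see
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
cv_scores = cross_val_score(clf, X, y, cv=5)
print("Fold accuracies:", cv_scores.round(2))
print("Mean accuracy:", cv_scores.mean().round(2))
```

With the workshop data, `X` and `y` would be the `features` and `target` arrays defined earlier.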


## Predicting

In [11]:
test_features = test[feature_names].values

pred_tree = my_tree.predict_proba(test_features)[:, 1]

movie_name = np.array(test["Movie"])
year = np.array(test["Year"])
won = np.array(test["Won?"])

tree_prediction = pd.DataFrame(pred_tree.round(2), movie_name, columns=["Probability"])
tree_prediction["Year"] = year
tree_prediction["Actually Won?"] = won

In [12]:
tree_prediction[tree_prediction['Year'] != 2016]

Unnamed: 0,Probability,Year,Actually Won?
Avatar,0.0,2009,0
The Blind Side,0.0,2009,0
District 9,0.0,2009,0
An Education,0.0,2009,0
The Hurt Locker,0.0,2009,1
Inglourious Basterds,0.25,2009,0
Precious: Based on the Novel 'Push' by Sapphire,0.0,2009,0
A Serious Man,0.0,2009,0
Up,0.25,2009,0
Up in the Air,0.0,2009,0
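
The table lists a probability per film, but the goal is a single predicted winner per year. One simple way to get there, sketched below, is to pick the film with the highest probability within each year. The DataFrame is rebuilt by hand from the 2009 rows shown above, and `predicted_winners` is an illustrative name.

```python
import pandas as pd

# 2009 rows copied from the prediction table above
tree_prediction = pd.DataFrame(
    {
        "Probability": [0.0, 0.0, 0.0, 0.0, 0.0, 0.25, 0.0, 0.0, 0.25, 0.0],
        "Year": [2009] * 10,
        "Actually Won?": [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    },
    index=[
        "Avatar", "The Blind Side", "District 9", "An Education",
        "The Hurt Locker", "Inglourious Basterds",
        "Precious: Based on the Novel 'Push' by Sapphire",
        "A Serious Man", "Up", "Up in the Air",
    ],
)

# For each year, take the index label (the movie) with the highest probability.
# idxmax returns the first label on ties, so ties are broken by row order.
predicted_winners = tree_prediction.groupby("Year")["Probability"].idxmax()
print(predicted_winners)
```

For 2009, Inglourious Basterds and Up tie at 0.25 and the first in row order is selected, while the actual winner, The Hurt Locker, receives 0.0. The three-feature tree misses that year.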
