# 2018 Oscars Predictions: How BigML got 6 for 6!

## Looking for the data

Machine Learning bases its predictions on past data, so we had to gather all the information we could about the history of the Oscar nominees and winners in each category we analyzed. The first place that comes to mind is IMDB, where anyone can find all the relevant information about movies these days. We expected to get that data through an API. There are several APIs, like the [Open Movies Database](http://imdbapi.net/), that let you search movies by title or year and return basic information: title, rating, genre, etc.

In [3]:
import pandas
pandas.read_csv('Movies.csv')

Unnamed: 0,year,movie,movie_id,Certificate,Duration,Genre,Rate,Metascore,Synopsis,Director_Stars,Votes,Gross
0,2000,Unbreakable,tt0217869,PG-13,106,"Drama, Mystery, Sci-Fi",7.2,62,A man learns something extraordinary about him...,Director: M. Night Shyamalan | Stars: Bruce Wi...,251457,95000000
1,2000,Requiem for a Dream,tt0180093,R,102,Drama,8.4,68,The drug-induced utopias of four Coney Island ...,Director: Darren Aronofsky | Stars: Ellen Burs...,596759,3610000
2,2000,Snatch,tt0208092,R,102,"Comedy, Crime",8.3,55,"Unscrupulous boxing promoters, violent bookmak...","Director: Guy Ritchie | Stars: Jason Statham, ...",623296,30090000
3,2000,Gladiator,tt0172495,R,155,"Action, Adventure, Drama",8.5,64,When a Roman general is betrayed and his famil...,"Director: Ridley Scott | Stars: Russell Crowe,...",1024040,187670000
4,2000,Memento,tt0209144,R,113,"Mystery, Thriller",8.5,80,A man juggles searching for his wife's murdere...,Director: Christopher Nolan | Stars: Guy Pearc...,879688,25530000


However, we still lacked the main information we needed: the award winners per category. It was time to ask our data wrangling team for help!

## Feeding from many sources

After browsing the different sections of IMDB's website, we realized that a lot of valuable information was there: not only the winners and nominees of the Oscars, but also the winners of the BAFTA awards, the Golden Globes, and other important prizes.

We needed to adapt our retrieval methods to the way the website presents each of these items, but in the end we were able to extract the following data:

In [12]:
pandas.read_csv('Movies_awards.csv')


Unnamed: 0,year,movie_id,awards,won_nominated,categories
0,2017,tt5726616,"Academy Awards, USA",Nominated; Nominated; Nominated; Nominated,Best Performance by an Actor in a Leading Role...
1,2017,tt5726616,"Golden Globes, USA",Nominated; Nominated; Nominated,Best Motion Picture - Drama; Best Performance ...
2,2017,tt5726616,BAFTA Awards,Won; Nominated; Nominated; Nominated,Best Screenplay (Adapted); Best Film; Best Lea...
3,2017,tt5726616,Screen Actors Guild Awards,Nominated,Outstanding Performance by a Male Actor in a L...
4,2017,tt5726616,AACTA International Awards,Nominated; Nominated; Nominated; Nominated; No...,Best Film; Best Direction; Best Screenplay; Be...
5,2017,tt5726616,Adelaide Film Festival,Nominated,Best Feature


Other useful information, like the release date or the number of nominations, was also found in other sections and added to new files:

In [16]:
pandas.read_csv('Movies_release_date.csv')

Unnamed: 0,movie_id,movie_title,realease_date,user_review,critic_reviews,popularity,awards,awards_wins,awards_nominations
0,tt0217869,Unbreakable,22 November 2000 (USA),1356,293,184,2 wins & 12 nominations.,2,12
1,tt0180093,Requiem for a Dream,15 December 2000 (USA),1930,238,245,Another 32 wins & 61 nominations.,32,61
2,tt0208092,Snatch,19 January 2001 (USA),735,151,307,4 wins & 6 nominations.,4,6


In [17]:
pandas.read_csv('Movies_countries.csv')

Unnamed: 0,movie_id,country,main_country,budget,budget_currency,opening_weekend,opening_weekend_country,opening_weekend_date,budget_USD,opening_weekend_USD
0,tt0322674,Netherlands; Luxembourg,Netherlands,,,1207.0,USA,2005-05-06,,1207.0
1,tt1206488,Spain; Peru,Spain,,,1914.0,USA,2010-08-27,,1914.0
2,tt0265208,USA,USA,25000000.0,USD,3184.0,USA,2001-03-30,25000000.0,3184.0
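The `awards_wins` and `awards_nominations` columns shown in `Movies_release_date.csv` look like numeric versions of the free-text `awards` column. As a minimal sketch of how such counts could be extracted, assuming the text always follows the "N wins & M nominations." pattern seen in the excerpt (the `parse_awards` helper is hypothetical, not part of the original pipeline):

```python
import re


def parse_awards(summary):
    """Return (wins, nominations) found in an IMDB-style awards summary.

    Hypothetical helper: it assumes the counts appear as "<n> win(s)" and
    "<m> nomination(s)" and falls back to 0 when a part is missing.
    """
    wins = re.search(r"(\d+)\s+wins?", summary)
    nominations = re.search(r"(\d+)\s+nominations?", summary)
    return (int(wins.group(1)) if wins else 0,
            int(nominations.group(1)) if nominations else 0)


# Strings taken from the excerpt above
for text in ["2 wins & 12 nominations.", "Another 32 wins & 61 nominations."]:
    print(text, "->", parse_awards(text))
```

Falling back to 0 keeps the resulting columns numeric even for movies with no wins or no nominations.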


At this point, after cleaning up some errors, we finally had enough data to train a Supervised Learning model that predicts the 2018 Oscar winners based on the previous ones. Still, a problem remained: we needed to aggregate this data into an ML-ready dataset.

## Getting your data ML-ready

As you can see in the excerpts above, the fields we used to store the information depended entirely on the structure of the page they were retrieved from, but they all shared a `movie_id` field that identifies each film.

To feed this data to a Supervised Learning model, you need a single file with a particular structure where:
- Every row has to contain all the existing properties of a particular movie
- Every column should contain one of the properties
- One of the columns in the file should be the objective field (the target for our predictions)


To fulfill the first condition, we used the `movie_id` field as the key to join the information in all the retrieved files. We also changed the columns of the original CSVs to meet the second and third conditions. For example, we created new fields like `Oscar_best_picture_nominated` and `Oscar_best_picture_won` with `yes` or `no` values based on the pairing of the `won_nominated` and `categories` fields of the `Movies_awards` file. When predicting each award, we selected as objective field the one that stored the information about the winners of that award, e.g. `Oscar_best_picture_won` was the objective field when predicting the winner of the best picture award. After all these transformations, we finally reached an ML-ready dataset structure like the one we published in our [gallery](https://bigml.com/user/academy_awards/gallery/dataset/5a94302592fb565ed400103b). We were ready to start modeling!
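The join and the derived `yes`/`no` fields can be sketched in plain Python. This is only an illustration under stated assumptions: the rows, movie ids and titles are made up, the `award_flags` helper is hypothetical, and the category label `"Best Motion Picture of the Year"` is assumed rather than taken from the original files.

```python
def award_flags(won_nominated, categories, category):
    """Pair the two ';'-separated strings and return (nominated, won) as yes/no."""
    outcomes = [o.strip() for o in won_nominated.split(";")]
    names = [c.strip() for c in categories.split(";")]
    nominated, won = "no", "no"
    for outcome, name in zip(outcomes, names):
        if name == category:
            nominated = "yes"  # being listed at all means at least a nomination
            if outcome == "Won":
                won = "yes"
    return nominated, won


# Illustrative rows only: ids, titles and category strings are made up
movies = [{"movie_id": "tt0000001", "movie": "Movie A", "year": 2016},
          {"movie_id": "tt0000002", "movie": "Movie B", "year": 2016}]
awards = [{"movie_id": "tt0000001", "awards": "Academy Awards, USA",
           "won_nominated": "Won; Nominated",
           "categories": "Best Motion Picture of the Year; "
                         "Best Achievement in Directing"}]

BEST_PICTURE = "Best Motion Picture of the Year"  # assumed category label

# Index the Oscar rows by movie_id, then left-join them onto the movies
oscars = {a["movie_id"]: a for a in awards
          if a["awards"] == "Academy Awards, USA"}
dataset = []
for movie in movies:
    row = dict(movie)
    oscar = oscars.get(movie["movie_id"])
    if oscar:
        nominated, won = award_flags(oscar["won_nominated"],
                                     oscar["categories"], BEST_PICTURE)
    else:
        nominated, won = "no", "no"  # no Oscar row: neither nominated nor won
    row["Oscar_best_picture_nominated"] = nominated
    row["Oscar_best_picture_won"] = won
    dataset.append(row)

for row in dataset:
    print(row)
```

Each movie ends up as a single row with one column per property, and `Oscar_best_picture_won` is available as the objective field.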

## Training the models

And now it was time to train the real models that we used to predict this year's winners. Each model was trained using all the available data from 2000 to 2016. Let's see how to do that using the Python bindings.

In [12]:
from bigml.api import BigML

# COMPLETE_DATASET = "dataset/5aa2c1e0e3de6875f90049cf"

COMPLETE_DATASET = "shared/dataset/5T3HptQKaQs57vgGi8f4vq1oSAT"

# creating the API connection
api = BigML() # the BIGML_USERNAME and BIGML_API_KEY environment variables are
              # expected.

# creating the training dataset with movies from 2000 to 2016
filter_expression = ["<=", 2000, ["field", "year"], 2016] # 2000 <= year <= 2016
training_dataset = api.create_dataset(COMPLETE_DATASET,
                                     {"json_filter": filter_expression,
                                      "name": "Movies 2000-2016"})
api.ok(training_dataset)
print("Number of movies to train: %s." % training_dataset['object']['rows'])

Validation error: {'origin_dataset': 'Invalid dataset id'}
Double-check the arguments for the call, please.


The resource couldn't be created: {'code': 400, 'status': {'code': -1206, 'message': 'Validation error', 'extra': {'origin_dataset': 'Invalid dataset id'}}}


TypeError: 'NoneType' object is not subscriptable

The type of model we selected was a `Deepnet`, i.e. BigML's deep neural network, which provides a unique first-class automatic optimization option ([Automatic Network Search](https://blog.bigml.com/2017/10/04/deepnets-behind-the-scenes/)). By using it, we tried to ensure that the right topology was chosen and that we obtained a top-performing classifier. A `Deepnet` was built for each award category by setting as objective field the one storing that award's winners. We'll take the **Best Picture Award** as an example. The information about the previous winners of this award is stored in the `Oscar_Best_Picture_won` field, so we created a `Deepnet` and configured this field as its objective field.

In [2]:
deepnet = api.create_deepnet(training_dataset,
                             {"objective_field": "Oscar_Best_Picture_won",
                              "search":  True})
api.ok(deepnet)

True

And that's all! We did not need to worry about discovering topologies or parameters to find the best performing network. Each Deepnet took around 30 minutes to train since it involved training dozens of different networks in the background, but in the end it produced a top-notch neural network without our intervention. Would that be enough to ensure sound predictions?

## Ensuring the predictions' quality

Before publishing any result, we wanted to ensure that our model produced sensible predictions, so we started by splitting our dataset in two: a training dataset and a test dataset. We chose to filter the instances by date, so that the training dataset contained the films from 2000 to 2012 and the test dataset contained those from 2013 to 2016. After building the model using the training dataset, we evaluated it by running the instances in the test dataset through it and comparing the predicted results to the real ones. The split can be easily done in BigML thanks to our transformations language: [Flatline](https://github.com/bigmlcom/flatline).



In [8]:
filter_expression = ["<=", 2000, ["field", "year"], 2012] # 2000 <= year <= 2012
training_dataset = api.create_dataset(COMPLETE_DATASET,
                                     {"json_filter": filter_expression,
                                      "name": "Movies 2000-2012"})
api.ok(training_dataset)

filter_expression = ["<=", 2013, ["field", "year"], 2016] # 2013 <= year <= 2016
test_dataset = api.create_dataset(COMPLETE_DATASET,
                                  {"json_filter": filter_expression,
                                   "name": "Movies 2013-2016"})
api.ok(test_dataset)

evaluation_deepnet = api.create_deepnet(training_dataset,
                                        {"objective_field": "Oscar_Best_Picture_won",
                                         "search":  True})
api.ok(evaluation_deepnet)

evaluation = api.create_evaluation(evaluation_deepnet, test_dataset)
api.ok(evaluation)

print(evaluation['object']['result']['model']['accuracy'])


2018-03-08 23:36:28,902: Validation error: {'origin_dataset': 'Invalid dataset id'}
Double-check the arguments for the call, please.
2018-03-08 23:36:28,902: The resource couldn't be created: {'status': {'extra': {'origin_dataset': 'Invalid dataset id'}, 'code': -1206, 'message': 'Validation error'}, 'code': 400}
2018-03-08 23:36:30,499: Validation error: {'origin_dataset': 'Invalid dataset id'}
Double-check the arguments for the call, please.
2018-03-08 23:36:30,500: The resource couldn't be created: {'status': {'extra': {'origin_dataset': 'Invalid dataset id'}, 'code': -1206, 'message': 'Validation error'}, 'code': 400}
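To make the evaluation idea concrete, here is a minimal local sketch of a split by year followed by an accuracy computation. It is not BigML's implementation: the `split_by_year` and `accuracy` helpers are hypothetical, and both the rows and the predicted labels are made up for illustration.

```python
def split_by_year(rows, last_training_year):
    """Rows up to last_training_year go to training, later ones to test."""
    train = [r for r in rows if r["year"] <= last_training_year]
    test = [r for r in rows if r["year"] > last_training_year]
    return train, test


def accuracy(actual, predicted):
    """Fraction of instances whose predicted label matches the real one."""
    matches = sum(1 for a, p in zip(actual, predicted) if a == p)
    return matches / len(actual)


# Made-up rows: only the year and the objective field matter here
rows = [{"year": year, "Oscar_Best_Picture_won": won}
        for year, won in [(2011, "no"), (2012, "yes"), (2013, "no"),
                          (2014, "yes"), (2016, "no")]]

train, test = split_by_year(rows, 2012)
actual = [r["Oscar_Best_Picture_won"] for r in test]
predicted = ["no", "yes", "yes"]  # hypothetical model output for the test rows

print("training rows:", len(train), "test rows:", len(test))
print("accuracy:", accuracy(actual, predicted))
```

Splitting by date rather than at random keeps the test closer to the real task, where the model has to predict a year it has never seen.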


## And the winner is...

To get the predicted winners, we just used the data we had for the 2017 movies and created a `Batch Prediction` using the `Deepnet` associated with the award category that we meant to predict. The predictions included whether the movie was predicted to win the prize and also the probability of this prediction.

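In [None]:
# assumption: the 2017 movies to score are selected from COMPLETE_DATASET
movies_2017 = api.create_dataset(COMPLETE_DATASET, {"json_filter": ["=", ["field", "year"], 2017]})
api.ok(movies_2017)
batch_prediction = api.create_batch_prediction(deepnet, movies_2017, {"probability": True})
api.ok(batch_prediction)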

And, ta-da! For each award, the movie we named as the winner was the one predicted to win with the highest probability. This year, BigML’s predictions proved right, [6 out of 6](https://blog.bigml.com/2018/03/05/2018-oscars-predictions-proved-right-6-out-of-6/) for the winners of the major award categories. Not by magic or luck, [just Machine Learning](https://medium.com/enrique-dans/and-this-years-oscar-goes-to-bigml-machine-learning-1823837ae3aa) put to work!
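The selection rule for one award can be sketched as follows. The `pick_winner` helper, the row layout, the titles and the probabilities are all hypothetical; the sketch assumes each row carries the predicted class and the probability of that predicted class.

```python
def pick_winner(predictions):
    """Return the movie predicted to win with the highest probability.

    Only rows predicted as "yes" compete, because a high probability
    attached to a "no" prediction means a confident loss, not a win.
    Returns None when no movie is predicted to win.
    """
    candidates = [p for p in predictions if p["prediction"] == "yes"]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p["probability"])["movie"]


# Made-up batch prediction rows for a single award category
predictions = [
    {"movie": "Movie A", "prediction": "no", "probability": 0.97},
    {"movie": "Movie B", "prediction": "yes", "probability": 0.81},
    {"movie": "Movie C", "prediction": "yes", "probability": 0.64},
]

print(pick_winner(predictions))
```

Filtering on the predicted class first matters: "Movie A" has the highest probability in the table, but that probability belongs to a `no` prediction.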