# Oscar Award Winner Predictions

Living in Los Angeles, the entertainment capital of the world, it’s hard not to be drawn into the buzz of awards season. The most recent Academy Awards were especially interesting to me because I had seen the majority of the nominated movies. In theory, each film is independent of the films that came before and after it, but I wondered whether patterns in historical data could be used to make educated guesses about future winners.

I decided to see whether I could build a model that predicts the winners of specific Oscar categories based on characteristics of winners from past Academy Awards. At some point I would love to extend the predictions to every category, but for this project I chose three: Cinematography, Directing, and Best Picture.

The following report details the steps I took to gather, clean, and transform the data. It then presents descriptive plots that add context to the overarching problem. Last, it shows the results of the models and the features that ended up being the most important for predictive accuracy. The first half is written in Python, while the modeling and visualizations are done in R.

To answer this question, I gathered data on Oscar-nominated movies dating back to 1980 to see whether I could predict the winners of the 2015-2019 Academy Awards in the categories of Cinematography, Directing, and Best Picture. Throughout the rest of the report, the Academy Award year refers to the year the movies were released, not the year the award show took place. For example, the year 2019 in the dataset represents movies released in 2019, even though the award show took place in February of 2020. The models are classifiers: within each category and year, the film with the highest predicted probability of winning is predicted as the winner.
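
The "highest probability within a category and a year" rule can be sketched in pandas. This is only an illustration: the film names, the `win_prob` column (standing in for a classifier's output), and the `predicted_winner` column are made up, not taken from the actual models.

```python
import pandas as pd

# Hypothetical scored nominees; win_prob stands in for a classifier's output.
scored = pd.DataFrame({
    "year":     [2019, 2019, 2019, 2019, 2019],
    "category": ["BEST PICTURE", "BEST PICTURE", "BEST PICTURE", "DIRECTING", "DIRECTING"],
    "entity":   ["Film A", "Film B", "Film C", "Film A", "Film B"],
    "win_prob": [0.20, 0.55, 0.25, 0.70, 0.30],
})

# Within each (year, category) group, find the row label with the highest probability.
top_rows = scored.groupby(["year", "category"])["win_prob"].idxmax()

# Flag exactly one predicted winner per group.
scored["predicted_winner"] = 0
scored.loc[top_rows, "predicted_winner"] = 1

predicted = scored.loc[scored["predicted_winner"] == 1, ["year", "category", "entity"]]
print(predicted)
```

Picking the maximum per group, rather than applying a fixed probability cutoff, guarantees one predicted winner in every category and year.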

In [1]:
#import Python libraries, including the tmdb and Rotten Tomatoes API clients
import tmdbsimple as tmdb
from rotten_tomatoes_client import RottenTomatoesClient
import pandas as pd 
import numpy as np
import time

The dataset used for this project came from a variety of sources. First, I needed a comprehensive list of Academy Award nominees and winners over the past 40 years. Luckily for me, someone had already compiled this list.

Next, I wanted detailed characteristics of each film, such as the top three actors and actresses, the runtime, and the genre. This information came from the API of The Movie Database (TMDb), which let me programmatically send each award nominee to the API and request details for that movie. This worked seamlessly for roughly 90% of the data. However, discrepancies in titles, such as the word “and” instead of “&”, required some manual work to make sure my data was not only clean but accurate. The last piece of information I wanted was each film’s average review score from critics, which I gathered using a Rotten Tomatoes API. As with the TMDb API, this also required manually adjusting movie titles so they lined up across the API and the list of Academy Award nominees.
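
The title matching described above can be sketched as two small helpers. This is a minimal illustration, not the exact code used in this report: `normalize_title` and `find_match` are hypothetical names, and the search results are illustrative dictionaries shaped like API search results (`title`, `release_date`, `id`), not real API output.

```python
def normalize_title(title):
    """Lower-case a title and treat '&' and 'and' as the same word."""
    return " ".join(title.lower().replace("&", " and ").split())


def find_match(results, title, year, years_back=2):
    """Return the first result whose title matches and whose release year
    falls in [year - years_back, year]; return None when nothing matches."""
    wanted = normalize_title(title)
    for result in results:
        release_date = result.get("release_date") or ""
        if not release_date:
            continue  # skip entries without a release date
        release_year = int(release_date.split("-")[0])
        same_title = normalize_title(result["title"]) == wanted
        if same_title and year - years_back <= release_year <= year:
            return result
    return None


# Illustrative search results; the ids and dates are placeholders.
results = [
    {"title": "Thelma & Louise", "release_date": "2005-01-01", "id": 1},
    {"title": "Thelma & Louise", "release_date": "1991-05-24", "id": 2},
]
match = find_match(results, "Thelma and Louise", 1991)
print(match["id"])
```

Normalizing both sides of the comparison handles the “and” versus “&” case without a manual rename, although foreign-language and subtitle differences would still need manual overrides.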

In [2]:
#For purposes of my project looking at data from 1980 onwards in three categories, Cinematography, Best Picture, and Directing
academy = pd.read_csv(r'C:\Users\Patrick\Downloads\oscars-nominees-and-winners_zip\data\data_csv.csv')
mask1 = academy['year']>=1980
mask2 = academy['category']=='CINEMATOGRAPHY'
mask3 = academy['category']=='BEST PICTURE'
mask4 = academy['category']=='DIRECTING'
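#.copy() so the column assignments below modify an independent frame rather than a slice of the original
academy = academy[(mask1) & (mask2 | mask3 | mask4)].copy()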
academy['winner']=academy['winner'].apply(lambda x:1 if x == True else 0)
#manual data processing to rename movie titles that differ from API's
academy['entity']=academy['entity'].replace({'Good Fellas':'GoodFellas','My Left Foot':'My Left Foot: The Story of Christy Brown','The Godfather, Part III':'The Godfather: Part III'})
academy['entity']=academy['entity'].replace({'Red':'Three Colors: Red','Malèna':'Malena','Moulin Rouge':'Moulin Rouge!',"Precious: Based on the Novel 'Push' by Sapphire":'Precious'})
academy['entity']=academy['entity'].replace({'The Postman (Il Postino)':'The Postman'})
academy_distinct_list=academy[['entity','year']].drop_duplicates()
#create a zipped list of movie names and the year, this will be used for the api function
academy_titles=list(zip(academy_distinct_list['entity'],academy_distinct_list['year']))

In [3]:
#empty list to store api outputs
movie_list=[]

In [4]:
#insertion of api key
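#read the key from an environment variable instead of hard-coding it (the variable name TMDB_API_KEY is an assumption)
import os
tmdb.API_KEY = os.environ['TMDB_API_KEY']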
def movie_dataframe(movie_title,year):
    try:
        movie_dict={}
        time.sleep(5)
        search = tmdb.Search()
        #taking care of data anomalies when connecting to api.
        if movie_title == 'The Lover':
            response=search.movie(query="L'Amant")
        elif movie_title== 'The Postman':
            response=search.movie(query="Il Postino")
        else:
            response=search.movie(query=movie_title)
        info=None
        if '&' in movie_title:
            movie_title=movie_title.replace('&','and')
        for s in search.results:
            if '&' in s['title']:
                s['title']=s['title'].replace('&','and')
            if s.get('release_date')==None or s.get('release_date')=='':
                pass
            else:
                #api looks for movie with the same title name and release date within 3 years.
                if s['title'].lower() ==  movie_title.lower() and (int(s['release_date'].split('-')[0]) ==year or int(s['release_date'].split('-')[0]) ==year-1 or int(s['release_date'].split('-')[0]) ==year-2):
                    info=s
                    break
        #using api output, gather below details from the movie.
        movie_dict['Title']=info['title']
        movie_dict['Synopsis']=info['overview']
        movie_dict['Release_date']=info['release_date']
        id_num = info['id']
        movie=tmdb.Movies(id_num)
        time.sleep(5)
        movie_details = movie.info()
        movie_dict['Genre1']=None
        movie_dict['Genre2']=None
        movie_dict['Genre3']=None
        movie_dict['Production_Company1']=None
        movie_dict['Production_Company2']=None
        movie_dict['Production_Company3']=None
        if type(movie_details['genres'])==list:
            for num in range(len(movie_details['genres'])):
                if num < 3:
                    movie_dict['Genre{}'.format(num+1)]=movie_details['genres'][num]['name']
        else:
            movie_dict['Genre1']=movie_details.get('genres').get('name')
        if type(movie_details['production_companies'])==list:
            for num in range(len(movie_details['production_companies'])):
                if num < 3:
                    movie_dict['Production_Company{}'.format(num+1)]=movie_details['production_companies'][num]['name']
        else:
            movie_dict['Production_Company1']=movie_details.get('production_companies').get('name')
        movie_dict['Original_Language']=movie_details.get('original_language')
        movie_dict['Budget']=movie_details.get('budget')
        movie_dict['Runtime']=movie_details.get('runtime')   
        movie_dict['Tagline']=movie_details.get('tagline')
        movie_dict['Revenue']=movie_details.get('revenue')
        movie=tmdb.Movies(id_num)
        credits = movie.credits()
        movie_dict['Cast1']=None
        movie_dict['Cast2']=None
        movie_dict['Cast3']=None
        if type(credits.get('cast'))==list:
            for num in range(len(credits['cast'])):
                if num <3:
                    movie_dict['Cast{}'.format(num+1)]=credits['cast'][num]['name']
        else:
            movie_dict['Cast1']=credits.get('cast').get('name')
        movie_dict['Director1']=None
        movie_dict['Director2']=None
        movie_dict['Director3']=None
        movie_dict['Cinematographer1']=None
        movie_dict['Cinematographer2']=None
        movie_dict['Cinematographer3']=None
        movie_dict['Producer1']=None
        movie_dict['Producer2']=None
        movie_dict['Producer3']=None
        if type(credits.get('crew'))==list:
            k=0
            for crew in credits['crew']:
                if k <3 and crew.get('job')=='Director':
                    movie_dict['Director{}'.format(k+1)]=crew.get('name')
                    k+=1
            k=0
            for crew in credits['crew']:
                if k <3 and crew.get('job')=='Director of Photography':
                    movie_dict['Cinematographer{}'.format(k+1)]=crew.get('name')
                    k+=1
            k=0
            for crew in credits['crew']:
                if k <3 and crew.get('job')=='Producer':
                    movie_dict['Producer{}'.format(k+1)]=crew.get('name')
                    k+=1
        movie=tmdb.Movies(id_num)
        time.sleep(5)
        releases=movie.release_dates()
        for r in releases.get('results'):
            if r.get('iso_3166_1')=='US':
                if type(r.get('release_dates'))==list:
                    movie_dict['MPAA_Rating']=r.get('release_dates')[0].get('certification')
                else:
                    movie_dict['MPAA_Rating']=r.get('release_dates').get('certification')
        movie_title_rt = None
        #Following code is fixing data of titles to align with Rotten Tomatoes API.
        if movie_title == 'WarGames' :
            movie_title_rt = 'WarGames (War Games)' 
        elif movie_title == 'Star Trek IV: The Voyage Home' :
            movie_title_rt = 'Star Trek IV - The Voyage Home'  
        elif movie_title == 'Matewan' :
            movie_title_rt = 'Matewan: A Luta Final'
        elif movie_title == 'My Life as a Dog' :
            movie_title_rt = 'My Life as a Dog (Mitt Liv som Hund)'  
        elif movie_title == 'My Left Foot: The Story of Christy Brown' :
            movie_title_rt = 'My Left Foot'  
        elif movie_title == 'The Godfather: Part III' :
            movie_title_rt = 'The Godfather, Part III'  
        elif movie_title == 'The Lover' :
            movie_title_rt = "The Lover (L'amant)"
        elif movie_title == 'Farewell My Concubine' :
            movie_title_rt = 'Farewell My Concubine (Ba wang bie ji)' 
        elif movie_title == 'Three Colors: Red' :
            movie_title_rt = 'Three Colors: Red (Trois couleurs: Rouge)'  
        elif movie_title == 'Shanghai Triad' :
            movie_title_rt = 'Shanghai Triad (Yao a yao yao dao waipo qiao)'  
        elif movie_title == 'The Postman' :
            movie_title_rt = 'Il Postino: The Postman (Il Postino)'  
        elif movie_title == 'Life Is Beautiful' :
            movie_title_rt = 'Life Is Beautiful (La Vita è bella)'  
        elif movie_title == 'City of God' :
            movie_title_rt = 'Cidade de Deus (City of God)'  
        elif movie_title == 'Good Night, and Good Luck.' :
            movie_title_rt = 'Good Night, and Good Luck'  
        elif movie_title == 'The White Ribbon' :
            movie_title_rt = 'The White Ribbon (Das weisse Band)'  
        elif movie_title == 'Precious' :
            movie_title_rt = 'Precious: Based on the Novel Push by Sapphire'  
        elif movie_title == 'Birdman or (The Unexpected Virtue of Ignorance)' :
            movie_title_rt = 'Birdman'
        elif movie_title == 'Fanny and Alexander' :
            movie_title_rt = 'Fanny & Alexander'
        elif movie_title == 'Henry and June' :
            movie_title_rt = 'Henry & June'
        elif movie_title == 'Thelma and Louise' :
            movie_title_rt = 'Thelma & Louise'
        elif movie_title == 'Secrets and Lies' :
            movie_title_rt = 'Secrets & Lies'
        elif movie_title == 'Extremely Loud and Incredibly Close' :
            movie_title_rt = 'Extremely Loud & Incredibly Close'
        else:
            movie_title_rt= movie_title        
        search = RottenTomatoesClient.search(term=movie_title_rt,limit=30)
        info=None
        #Similar to tmdb, the Rotten Tomatoes results are matched on movie title and release year (within one year either way) to get the critics review score.
        for s in search['movies']:
            if s['name'].lower()==movie_title_rt.lower() and (s['year']==year or s['year']==year-1 or s['year']==year+1):
                info=s
        #manual inputs for movies not picked up in Rotten Tomatoes API
        if movie_title_rt == 'Parasite':
            movie_dict['Critics_Average_Score']=99
        elif movie_title_rt == 'Cold War':
            movie_dict['Critics_Average_Score']=92
        elif movie_title_rt == 'Once upon a Time… in Hollywood':
            movie_dict['Critics_Average_Score']=85
        if info == None:
            print(movie_title_rt,'rt')
        else:
            movie_dict['Critics_Average_Score']=info.get('meterScore')
        return movie_dict
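    #catch Exception rather than everything so that Ctrl-C can still interrupt the long-running loop
    except Exception: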
        print(movie_title)
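
The three crew loops inside `movie_dataframe` share one pattern: keep the first three names listed for a given job. A minimal sketch of that pattern as a helper, assuming `crew` is a list of dictionaries with `job` and `name` keys as in the function above. `top_names` is a hypothetical helper, not part of tmdbsimple, and the crew entries are placeholders.

```python
def top_names(crew, job, prefix, limit=3):
    """Map Prefix1..PrefixN to the first `limit` crew names with the given job."""
    names = [member.get("name") for member in crew if member.get("job") == job]
    # Pad with None so every column exists even when fewer names are listed.
    names = (names + [None] * limit)[:limit]
    return {"{}{}".format(prefix, i + 1): name for i, name in enumerate(names)}


# Placeholder crew list shaped like the credits used above.
crew = [
    {"job": "Director", "name": "Person A"},
    {"job": "Producer", "name": "Person B"},
    {"job": "Director", "name": "Person C"},
]

row = {}
row.update(top_names(crew, "Director", "Director"))
row.update(top_names(crew, "Director of Photography", "Cinematographer"))
print(row)
```

The padding step plays the same role as pre-filling `Director1` through `Director3` with `None`, so every movie produces the same set of columns.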

In [None]:
#for loop goes through zipped list to input movie title and year into function
for t, y in academy_titles:
    movie_list.append(movie_dataframe(t,y))

In [None]:
#turns list into dataframe
movie_df=pd.DataFrame(movie_list)

In [None]:
#manually change critic scores and rating for those not picked up by API
#assigning through movie_df.iloc[row, col] avoids chained assignment, which may not update movie_df
score_col=movie_df.columns.get_loc('Critics_Average_Score')
rating_col=movie_df.columns.get_loc('MPAA_Rating')
movie_df.iloc[220,score_col]=63
movie_df.iloc[229,score_col]=69
movie_df.iloc[265,score_col]=91
movie_df.iloc[290,score_col]=69
movie_df.iloc[236,rating_col]='PG-13'
ratings=['R','PG','PG','R','R','PG','PG','R','R','Unrated','PG-13','R','PG','R','R','R','R','R','PG',
'R','PG-13','R','PG-13','PG-13','PG-13','R','R','PG-13','PG-13','R','PG-13','R','R','PG-13','R',
'PG-13','R','R','R','R','R','R','R','PG-13','PG-13','R','R','R','R']
#NOTE: nums (the row positions that pair with ratings) is not defined in the cells shown here; it must be defined before this loop runs
for i,n in enumerate(nums):
    movie_df.iloc[n,rating_col]=ratings[i]

In [48]:
#join newly acquired movie info to original Oscar winner dataset.
academy_df=academy.merge(movie_df,how='left',left_on='entity',right_on='Title')
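
Because this is a left merge on title alone, nominees whose titles found no match end up with empty detail columns without any warning. One way to surface them is the `indicator` flag of `merge`. A small sketch, with made-up frames and film names:

```python
import pandas as pd

# Made-up stand-ins for the nominee list and the API output.
nominees = pd.DataFrame({"entity": ["Film A", "Film B", "Film C"],
                         "year": [2000, 2000, 2001]})
details = pd.DataFrame({"Title": ["Film A", "Film C"],
                        "Runtime": [120, 95]})

# indicator=True adds a _merge column saying where each row came from.
checked = nominees.merge(details, how="left", left_on="entity",
                         right_on="Title", indicator=True)

# Rows marked left_only are nominees with no matching movie details.
unmatched = checked.loc[checked["_merge"] == "left_only", "entity"].tolist()
print(unmatched)
```

Any title that shows up in `unmatched` is a candidate for the kind of manual rename handled earlier.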

In [51]:
#export to csv
academy_df.to_csv(r'C:\Users\Patrick\Documents\academy_dataset.csv')

In [211]:
#import from csv to create final dataset
academy_df=pd.read_csv(r'C:\Users\Patrick\Documents\academy_dataset.csv')
academy_df.drop(columns='Unnamed: 0',inplace=True)

In [171]:
#create dummy variables for each of the categorical fields. 
academy_df=pd.get_dummies(data=academy_df,columns=['Genre1','Genre2','Genre3'],prefix='Genre')
academy_df=pd.get_dummies(data=academy_df,columns=['Production_Company1','Production_Company2','Production_Company3'],prefix='Production_Company')
academy_df=pd.get_dummies(data=academy_df,columns=['Cast1','Cast2','Cast3'],prefix='Cast')
academy_df=pd.get_dummies(data=academy_df,columns=['Director1','Director2','Director3'],prefix='Director')
academy_df=pd.get_dummies(data=academy_df,columns=['Cinematographer1','Cinematographer2','Cinematographer3'],prefix='Cinematographer')
academy_df=pd.get_dummies(data=academy_df,columns=['Producer1','Producer2','Producer3'],prefix='Producer')
academy_df=pd.get_dummies(data=academy_df,columns=['Original_Language','MPAA_Rating'])
titles=['Genre_','Production_Company_','Cast_','Director_','Cinematographer_','Producer_']
#Following for loop ensures that there aren't duplicate categorical columns for example "Genre1_Horror", "Genre2_Horror" become just "Genre_Horror"
g=list(academy_df.columns)
for title in titles:
    for column in g:
        if column.startswith(title):
            try:
                academy_df[column+'1']=academy_df[column].max(axis=1)
                academy_df.drop(columns=column,inplace=True)
                academy_df.rename(columns={column+'1':column},inplace=True)
            except:
                pass
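
The loop above relies on an exception to skip columns that are already unique. An alternative sketch of the same collapse, assuming the goal is one 0/1 column per distinct name: group the duplicated column names and take the maximum. The frame and film names below are made up.

```python
import pandas as pd

# Made-up movies with two genre slots each.
raw = pd.DataFrame({
    "Title":  ["Film A", "Film B", "Film C"],
    "Genre1": ["Drama", "Horror", "Drama"],
    "Genre2": ["Horror", "Drama", None],
})

# A shared prefix makes both slots produce columns named Genre_Drama and Genre_Horror.
dummies = pd.get_dummies(raw, columns=["Genre1", "Genre2"], prefix="Genre")

# Select the duplicated indicator columns and make them plain integers.
genre_cols = dummies.loc[:, dummies.columns.str.startswith("Genre_")].astype(int)

# Transpose so duplicated names become index labels, take the max per name, transpose back.
collapsed = genre_cols.T.groupby(level=0).max().T

result = pd.concat([dummies[["Title"]], collapsed], axis=1)
print(result)
```

Taking the maximum means a movie is flagged for a genre if it appears in any of its slots, which matches the intent of the comment about "Genre1_Horror" and "Genre2_Horror".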

In [173]:
#Turn release date into a datetime object
#create date_rank column that ranks release dates by category and year
academy_df['Release_date']=pd.to_datetime(academy_df['Release_date'])
academy_df['Date_Rank']=academy_df.groupby(by=['year','category'])['Release_date'].rank(ascending=False)
academy_df.to_csv(r'C:\Users\Patrick\Documents\academy_final_dataset.csv',index=False)
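
To make the `Date_Rank` column concrete, here is a small sketch on made-up nominees. With `ascending=False`, the most recently released film in each year and category gets rank 1.

```python
import pandas as pd

# Made-up nominees in a single year and category.
nominees = pd.DataFrame({
    "year": [2000, 2000, 2000],
    "category": ["BEST PICTURE"] * 3,
    "entity": ["Film A", "Film B", "Film C"],
    "Release_date": pd.to_datetime(["2000-03-01", "2000-12-20", "2000-09-15"]),
})

# Rank release dates within each (year, category); the latest date gets rank 1.
nominees["Date_Rank"] = (
    nominees.groupby(["year", "category"])["Release_date"].rank(ascending=False)
)
print(nominees[["entity", "Date_Rank"]])
```

Here Film B, released in December, gets rank 1 and Film A, released in March, gets rank 3.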

In [69]:
#Run same process with test dataset
academy = pd.read_csv(r'C:\Users\Patrick\Documents\oscar_winners.csv')
academy['winner']=academy['winner'].apply(lambda x:1 if x == True else 0)
academy_distinct_list=academy[['entity','year']].drop_duplicates()
academy_titles=list(zip(academy_distinct_list['entity'],academy_distinct_list['year']))

In [70]:
academy
movie_list=[]

In [71]:
for t, y in academy_titles:
    movie_list.append(movie_dataframe(t,y))

Cold War rt
Once upon a Time… in Hollywood rt
Parasite rt


In [100]:
movie_df=pd.DataFrame(movie_list)

In [101]:
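#assigning through movie_df.iloc[row, col] avoids chained assignment, which may not update movie_df
rating_col=movie_df.columns.get_loc('MPAA_Rating')
title_col=movie_df.columns.get_loc('Title')
movie_df.iloc[0,rating_col]='R'
movie_df.iloc[2,rating_col]='R'
movie_df.iloc[10,rating_col]='R'
movie_df.iloc[16,rating_col]='PG-13'
movie_df.iloc[17,rating_col]='PG-13'
movie_df.iloc[14,title_col]='Once upon a Time… in Hollywood'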

In [102]:
academy_df=academy.merge(movie_df,how='left',left_on='entity',right_on='Title')

In [104]:
academy_df=pd.get_dummies(data=academy_df,columns=['Genre1','Genre2','Genre3'],prefix='Genre')
academy_df=pd.get_dummies(data=academy_df,columns=['Production_Company1','Production_Company2','Production_Company3'],prefix='Production_Company')
academy_df=pd.get_dummies(data=academy_df,columns=['Cast1','Cast2','Cast3'],prefix='Cast')
academy_df=pd.get_dummies(data=academy_df,columns=['Director1','Director2','Director3'],prefix='Director')
academy_df=pd.get_dummies(data=academy_df,columns=['Cinematographer1','Cinematographer2','Cinematographer3'],prefix='Cinematographer')
academy_df=pd.get_dummies(data=academy_df,columns=['Producer1','Producer2','Producer3'],prefix='Producer')
academy_df=pd.get_dummies(data=academy_df,columns=['Original_Language','MPAA_Rating'])
titles=['Genre_','Production_Company_','Cast_','Director_','Cinematographer_','Producer_']
g=list(academy_df.columns)
for title in titles:
    for column in g:
        if column.startswith(title):
            try:
                academy_df[column+'1']=academy_df[column].max(axis=1)
                academy_df.drop(columns=column,inplace=True)
                academy_df.rename(columns={column+'1':column},inplace=True)
            except:
                pass

In [105]:
academy_df['Release_date']=pd.to_datetime(academy_df['Release_date'])
academy_df['Date_Rank']=academy_df.groupby(by=['year','category'])['Release_date'].rank(ascending=False)
academy_df.to_csv(r'C:\Users\Patrick\Documents\academy_test_dataset.csv',index=False)

At the end of the data cleaning process, I ended up with the following variables for each movie: Top 3 Genres, Top 3 Actors/Actresses, Top 3 Production Companies, Top 3 Cinematographers, Top 3 Directors, Top 3 Producers, MPAA Rating, Original Language, Movie Runtime, Budget, Revenue, and Release Date. “Top 3” reflects the fact that a movie can have more than one entry in each of those categories, so for simplicity’s sake I kept the first three listed.