 Objective: 
           When one thinks about prestige in the filmmaking industry, the Academy Awards come to mind. There are often Oscar favorites that are pegged as sure winners from the get-go. The purpose of this project is to explore the correlations among films that have received Oscars, to see whether there is indeed a particular formula that makes an Oscar-winning movie.
           
 Datasets: 
           IMDB Top 5000 Movies
           The Academy Awards 1927-2015

In [1]:
import pandas as pd
import numpy as np
import patsy
import statsmodels.api as sm
from scipy.stats import ttest_ind
import matplotlib.pyplot as plt


First, I cleaned up the data from both CSV files, keeping the columns I wanted to explore and merging the two.

In [2]:
#oscars dataset from 1927-2015
oscars = pd.read_csv('database.csv')
#imdb top 5000 dataset
imdb = pd.read_csv('movie_metadata.csv', encoding = "ISO-8859-1")

In [3]:
#film names had two junk characters appended at the end; trim them off
imdb['Film'] = imdb['Film'].str[:-2]
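
Two trailing junk characters are consistent with a UTF-8 non-breaking space being decoded as ISO-8859-1. That is an assumption about the file, not something the notebook confirms. The sketch below uses made-up titles to show the effect, and compares slicing with a more defensive alternative that strips the suffix only where it is present.

```python
import pandas as pd

# Assumption: each raw title ends with a UTF-8 non-breaking space (bytes C2 A0).
raw = "Titanic\u00a0".encode("utf-8")
decoded = raw.decode("ISO-8859-1")  # 'Titanic' followed by two stray characters
print(len(decoded) - len("Titanic"))

# Toy titles (hypothetical data): one carries the junk suffix, one does not.
titles = pd.Series([decoded, "Clean Title"])

# Slicing trims every row, including the one that was already clean.
sliced = titles.str[:-2]

# Removing only the known suffix leaves clean titles untouched.
stripped = titles.str.replace("\u00c2\u00a0$", "", regex=True)

print(sliced.tolist())
print(stripped.tolist())
```

If every title in the file really does carry the suffix, the two approaches give the same result. The regex version is safer only when some rows are already clean.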

In [4]:
#picked the pertinent columns from both datasets
imdb = imdb[['director_name', 'Film', 'gross', 'genres', 'duration', 'plot_keywords']]
oscars = oscars[['Year', 'Award', 'Name', 'Film']]

In [5]:
#used film name to merge the two datasets
joined = pd.merge(imdb, oscars, on="Film")
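
Because the merge keys on the film title alone, it can be worth checking how many rows on each side actually found a match. The following is a minimal sketch using small hypothetical frames (`imdb_demo` and `oscars_demo` are stand-ins, not the real data) and the `indicator` option of `pd.merge`.

```python
import pandas as pd

# Toy stand-ins for the two datasets (hypothetical rows, not the real data).
imdb_demo = pd.DataFrame({
    "Film": ["Film A", "Film B"],
    "gross": [100.0, 200.0],
})
oscars_demo = pd.DataFrame({
    "Film": ["Film A", "Film A", "Film C"],
    "Award": ["Actress in a Leading Role", "Actress in a Supporting Role", "Directing"],
})

# An outer join with indicator=True labels each row by where it came from.
check = pd.merge(imdb_demo, oscars_demo, on="Film", how="outer", indicator=True)
match_counts = check["_merge"].value_counts()
print(match_counts)

# The default inner join keeps only the 'both' rows,
# one per (film, award) pair.
joined_demo = pd.merge(imdb_demo, oscars_demo, on="Film")
print(len(joined_demo))
```

A film with several awards appears once per award after the join, so row counts in the merged frame reflect awards rather than distinct films. Titles shared by different films (remakes, for example) would also be matched to each other, since the title is the only key.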

In [6]:
dropped_df = joined.dropna()

dropped_df.head()


Unnamed: 0,director_name,Film,gross,genres,duration,plot_keywords,Year,Award,Name
0,James Cameron,Titanic,658672302.0,Drama|Romance,194.0,artist|love|ship|titanic|wet,1997,Actress in a Leading Role,Kate Winslet
1,James Cameron,Titanic,658672302.0,Drama|Romance,194.0,artist|love|ship|titanic|wet,1997,Actress in a Supporting Role,Gloria Stuart
2,Christopher Nolan,The Dark Knight,533316061.0,Action|Crime|Drama|Thriller,152.0,based on comic book|dc comics|psychopath|star ...,2008,Actor in a Supporting Role,Heath Ledger
3,David Fincher,The Curious Case of Benjamin Button,127490802.0,Drama|Fantasy|Romance,166.0,deformed baby|diary|lingerie slip|older man yo...,2008,Actor in a Leading Role,Brad Pitt
4,David Fincher,The Curious Case of Benjamin Button,127490802.0,Drama|Fantasy|Romance,166.0,deformed baby|diary|lingerie slip|older man yo...,2008,Actress in a Supporting Role,Taraji P. Henson


I then explored whether a particular genre of film might be more likely to nab an award. A common misconception is that Oscars are categorized by genre. They are in fact categorized by role in the film (Actor, Music Score, etc.) or by type of film (Animated Feature Film, Documentary, etc.).

The visualization below is a word cloud of the genres of the Oscar-winning films, with each entry sized by the frequency of that genre string. The dataset lists each film's genres as a single string of pipe-separated genres, so the majority of genre strings are unique, which skews the word cloud.

Therefore, I decided to split the genre strings up and find the word frequencies for every individual genre.

In [7]:
%%HTML

<div class='tableauPlaceholder' id='viz1501801152816' style='position: relative'><noscript><a href='#'><img alt='Sheet 1 ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;At&#47;Attempt1_14&#47;Sheet1&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='site_root' value='' /><param name='name' value='Attempt1_14&#47;Sheet1' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;At&#47;Attempt1_14&#47;Sheet1&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='filter' value='publish=yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1501801152816');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>


In [8]:
#clean up film genres: split each pipe-delimited string into individual genres
genres = []
for row in joined['genres'].dropna():
    genres.extend(row.split('|'))
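
The cell above builds the flat list of genres but does not show the counting step, which may have been done in Tableau. One way to compute the frequencies in Python, assuming the standard library's `collections.Counter`, is sketched below on three genre strings taken from the table above (`genre_strings` and `genres_demo` are illustrative names).

```python
from collections import Counter

# Sample genre strings in the same pipe-delimited form as the 'genres' column.
genre_strings = [
    "Drama|Romance",
    "Action|Crime|Drama|Thriller",
    "Drama|Fantasy|Romance",
]

# Flatten the strings into one list of individual genres.
genres_demo = []
for row in genre_strings:
    genres_demo.extend(row.split("|"))

# Count how often each individual genre occurs.
genre_counts = Counter(genres_demo)
print(genre_counts.most_common(2))  # [('Drama', 3), ('Romance', 2)]
```

Applied to the full `genres` list, `most_common()` would give the ranking that the genre visualization is based on.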


Splitting up the genres gave a more pertinent slice of the data. Drama proved to be the clear winner in terms of frequency. This could be for many reasons. By definition, drama is:

" [a] composition in verse or prose intended to portray life or character or to tell a story usually involving conflicts and emotions through action and dialogue and typically designed for theatrical performance "

This description fits practically any film, which would explain why Drama appears in almost every genre string.

The second most frequent genre was Romance, with 224 hits.

In [9]:
%%HTML

<div class='tableauPlaceholder' id='viz1501801299793' style='position: relative'><noscript><a href='#'><img alt='&lt;Genre Map&gt; ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ge&#47;Genre_Map&#47;Sheet2&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='site_root' value='' /><param name='name' value='Genre_Map&#47;Sheet2' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ge&#47;Genre_Map&#47;Sheet2&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='filter' value='publish=yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1501801299793');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

In [10]:
plot_words = []

#skip rows with no plot keywords, then split each pipe-delimited string
non_null = joined[joined['plot_keywords'].notnull()]
for plot_word in non_null['plot_keywords']:
    plot_words.extend(plot_word.split('|'))

#a plain list has no to_csv method, so wrap it in a Series before writing
pd.Series(plot_words, name='plot_word').to_csv('plot_words.csv', index=False)