# 500_Oscar_merge_Prep


## Purpose
The purpose of this notebook is to merge our Oscar datasets with our movie industry dataset, creating the final dataset that we will use in our analysis.

## Datasets

 - input: wins.pkl, noms.pkl, movies.pkl
 - output: oscarMovies.pkl

# Loading the Datasets

First we need to load the datasets that we prepared earlier; then we can begin to combine them.

In [1]:
import os.path
import pandas as pd
import numpy as np

In [2]:
if not os.path.exists('../../data/processed/movies.pkl'):  # check the file that is actually loaded below
    print("Missing dataset file")
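
The same check can be extended to every input file. A minimal sketch, assuming the three pickles live under `../../data/processed/` as in the loading cells; the helper name `missing_files` is hypothetical and not part of the original notebook:

```python
import os.path

# Hypothetical helper: return the paths that do not exist on disk.
def missing_files(paths):
    return [p for p in paths if not os.path.exists(p)]

# Paths taken from the loading cells in this notebook.
input_files = [
    '../../data/processed/movies.pkl',
    '../../data/processed/wins.pkl',
    '../../data/processed/noms.pkl',
]

for path in missing_files(input_files):
    print("Missing dataset file:", path)
```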

In [3]:
movies = pd.read_pickle('../../data/processed/movies.pkl')

In [5]:
wins = pd.read_pickle('../../data/processed/wins.pkl')
noms = pd.read_pickle('../../data/processed/noms.pkl')

In [6]:
movies.head(5)

Unnamed: 0,budget,company,country,director,genre,gross,name,rating,released,runtime,score,star,votes,writer,year,scoreRank,grossRank,HarMean
0,8000000.0,Columbia Pictures Corporation,USA,Rob Reiner,Adventure,52287414.0,Stand by Me,R,1986-08-22,89,8.1,Wil Wheaton,299174,Stephen King,1986,6685.0,5495.0,0.884794
1,6000000.0,Paramount Pictures,USA,John Hughes,Comedy,70136369.0,Ferris Bueller's Day Off,PG-13,1986-06-11,103,7.8,Matthew Broderick,264740,John Hughes,1986,6436.5,5878.0,0.901333
2,15000000.0,Paramount Pictures,USA,Tony Scott,Action,179800601.0,Top Gun,PG,1986-05-16,110,6.9,Tom Cruise,236909,Jim Cash,1986,4651.5,6613.0,0.801096
3,18500000.0,Twentieth Century Fox Film Corporation,USA,James Cameron,Action,85160248.0,Aliens,R,1986-07-18,137,8.4,Sigourney Weaver,540152,James Cameron,1986,6775.0,6095.0,0.941311
4,9000000.0,Walt Disney Pictures,USA,Randal Kleiser,Adventure,18564613.0,Flight of the Navigator,PG,1986-08-01,90,6.9,Joey Cramer,36636,Mark H. Baker,1986,4651.5,3996.0,0.630507


In [7]:
movies.shape

(6820, 18)

We can see that the datasets have loaded correctly and are now ready to be combined.

# Combining the Datasets 

To create our final dataset, we needed to combine our two smaller datasets, which hold the counts of Oscar wins and nominations, with the movie industry dataset.

Our first step was to merge each of these two datasets with the movie industry dataset individually.


In [8]:
wins = movies.merge(wins, left_on='name', right_on='movie_name')  # merge the datasets on the movie name

In [9]:
noms = movies.merge(noms, left_on='name', right_on='movie_name')  # merge the datasets on the movie name
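
Because `merge` defaults to an inner join, only movies that appear in both tables survive, which is why the merged frames are much smaller than `movies`. A toy sketch of that behaviour (the small frames are invented for illustration and are not our data):

```python
import pandas as pd

# Toy stand-ins for the real frames; values are invented for illustration.
toy_movies = pd.DataFrame({'name': ['Aliens', 'Top Gun', 'Platoon'],
                           'year': [1986, 1986, 1986]})
toy_noms = pd.DataFrame({'movie_name': ['Aliens', 'Platoon'],
                         'Oscar_noms': [1, 2]})

# Default how='inner': rows without a match on either side are dropped.
toy_merged = toy_movies.merge(toy_noms, left_on='name', right_on='movie_name')
print(toy_merged)
```

Merging on the title alone treats two different films that share a title as the same movie. If the Oscar tables also carried a release year, adding it to the join keys would be one way to avoid that.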

In [10]:
noms.shape

(353, 20)

In [11]:
noms.head()

Unnamed: 0,budget,company,country,director,genre,gross,name,rating,released,runtime,score,star,votes,writer,year,scoreRank,grossRank,HarMean,Oscar_noms,movie_name
0,18500000.0,Twentieth Century Fox Film Corporation,USA,James Cameron,Action,85160248.0,Aliens,R,1986-07-18,137,8.4,Sigourney Weaver,540152,James Cameron,1986,6775.0,6095.0,0.941311,1,Aliens
1,6000000.0,Hemdale,UK,Oliver Stone,Drama,138530565.0,Platoon,R,1987-02-06,120,8.1,Charlie Sheen,317585,Oliver Stone,1986,6685.0,6479.0,0.965281,2,Platoon
2,6000000.0,De Laurentiis Entertainment Group (DEG),USA,David Lynch,Drama,8551228.0,Blue Velvet,R,1986-10-23,120,7.8,Isabella Rossellini,146768,David Lynch,1986,6436.5,3057.0,0.607958,1,Blue Velvet
3,24500000.0,Warner Bros.,UK,Roland Joffé,Adventure,17218023.0,The Mission,PG,1986-10-31,125,7.5,Robert De Niro,47497,Robert Bolt,1986,5973.0,3887.0,0.690732,1,The Mission
4,13800000.0,Touchstone Pictures,USA,Martin Scorsese,Drama,52293982.0,The Color of Money,R,1986-10-17,119,7.0,Paul Newman,62495,Walter Tevis,1986,4885.5,5496.0,0.758738,1,The Color of Money


We can see that our noms dataset has successfully been merged.

In [12]:
wins.shape

(107, 20)

In [13]:
wins.head()

Unnamed: 0,budget,company,country,director,genre,gross,name,rating,released,runtime,score,star,votes,writer,year,scoreRank,grossRank,HarMean,Oscar_wins,movie_name
0,6000000.0,Hemdale,UK,Oliver Stone,Drama,138530565.0,Platoon,R,1987-02-06,120,8.1,Charlie Sheen,317585,Oliver Stone,1986,6685.0,6479.0,0.965281,1,Platoon
1,13800000.0,Touchstone Pictures,USA,Martin Scorsese,Drama,52293982.0,The Color of Money,R,1986-10-17,119,7.0,Paul Newman,62495,Walter Tevis,1986,4885.5,5496.0,0.758738,1,The Color of Money
2,6400000.0,Orion Pictures,USA,Woody Allen,Comedy,40084041.0,Hannah and Her Sisters,PG-13,1986-03-14,107,8.0,Mia Farrow,56988,Woody Allen,1986,6616.5,5121.0,0.846879,2,Hannah and Her Sisters
3,0.0,Paramount Pictures,USA,Randa Haines,Drama,31853080.0,Children of a Lesser God,R,1986-10-31,119,7.2,William Hurt,12538,Mark Medoff,1986,5360.5,4739.0,0.737876,1,Children of a Lesser God
4,25000000.0,Paramount Pictures,USA,Brian De Palma,Crime,76270454.0,The Untouchables,R,1987-06-03,119,7.9,Kevin Costner,234254,Oscar Fraley,1987,6541.5,5984.0,0.916852,1,The Untouchables


Our wins dataset has also merged successfully, so we can now prepare to create our final dataset.

The next step we took was to add our noms dataset and our wins dataset to our movie industry dataset. This way we would have all the movies in the file with the count of their Oscar wins and nominations.

In [14]:
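# DataFrame.append was removed in pandas 2.0; sort=True keeps the alphabetical column order shown in the outputs
oscar_wins = pd.concat([movies, wins], sort=True)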

In [15]:
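# DataFrame.append was removed in pandas 2.0; sort=True keeps the alphabetical column order shown in the outputs
oscar_noms = pd.concat([movies, noms], sort=True)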

We added the data with the Oscar win counts to the end of the movie industry dataset. Then we dropped the duplicate rows, keeping the last duplicate, as these are the rows that carry the Oscar counts. We then applied the same process to the Oscar nominations data.

In [16]:
oscar_wins.drop_duplicates(subset='name', keep='last', inplace=True)

In [17]:
oscar_noms.drop_duplicates(subset='name', keep='last', inplace=True)

We now had two datasets with the Oscar win and nomination counts for all movies, since any movie that was not in either Oscar dataset before the merge must not have won or been nominated for any Oscars.
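
The stack-then-deduplicate pattern can be seen on a toy example. The frames below are invented; the sketch also shows a left merge as a possible one-step alternative, which is not what this notebook does:

```python
import pandas as pd

toy_movies = pd.DataFrame({'name': ['Aliens', 'Top Gun', 'Platoon']})
# Stand-in for the result of the earlier inner merge: only movies with a win count.
toy_wins = pd.DataFrame({'name': ['Platoon'], 'Oscar_wins': [1]})

# Stack the counted rows under the full list, then keep the last copy of each title.
stacked = pd.concat([toy_movies, toy_wins], sort=True)
deduped = stacked.drop_duplicates(subset='name', keep='last')

# Alternative sketch: a left merge keeps every movie and leaves NaN where there is no count.
left_merged = toy_movies.merge(toy_wins, on='name', how='left')
print(deduped)
print(left_merged)
```

Note that `drop_duplicates(subset='name')` leaves one row per title, so it also collapses distinct movies that share a title. This is consistent with the row count falling from 6820 in `movies` to 6731 in the shapes printed next.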

In [18]:
oscar_noms.shape

(6731, 20)

In [19]:
oscar_wins.shape

(6731, 20)

We can see now that our two datasets have the same shape, making merging very easy.

We created the final dataset for our analysis by adding the Oscar_noms column to our Oscar wins dataset. We merged the data on movie name to ensure that the rows matched up correctly.


In [20]:
oscarMovies = pd.merge(oscar_wins, oscar_noms[['name', 'Oscar_noms']], on='name', how='left')
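
As a sanity check on this kind of column merge, `merge` accepts a `validate` argument that raises an error if the join keys are not unique. A small sketch on invented frames; the original notebook does not use `validate`:

```python
import pandas as pd

toy_wins = pd.DataFrame({'name': ['Aliens', 'Platoon'],
                         'Oscar_wins': [float('nan'), 1.0]})
toy_noms = pd.DataFrame({'name': ['Aliens', 'Platoon'],
                         'Oscar_noms': [1, 2],
                         'year': [1986, 1986]})

# Select only the key and the column we want to bring across.
# validate='one_to_one' raises MergeError if 'name' repeats on either side.
toy_final = pd.merge(toy_wins, toy_noms[['name', 'Oscar_noms']],
                     on='name', how='left', validate='one_to_one')
print(toy_final)
```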

In [21]:
oscarMovies.head()

Unnamed: 0,HarMean,Oscar_wins,budget,company,country,director,genre,gross,grossRank,movie_name,...,rating,released,runtime,score,scoreRank,star,votes,writer,year,Oscar_noms
0,0.884794,,8000000.0,Columbia Pictures Corporation,USA,Rob Reiner,Adventure,52287414.0,5495.0,,...,R,1986-08-22,89,8.1,6685.0,Wil Wheaton,299174,Stephen King,1986,
1,0.901333,,6000000.0,Paramount Pictures,USA,John Hughes,Comedy,70136369.0,5878.0,,...,PG-13,1986-06-11,103,7.8,6436.5,Matthew Broderick,264740,John Hughes,1986,
2,0.801096,,15000000.0,Paramount Pictures,USA,Tony Scott,Action,179800601.0,6613.0,,...,PG,1986-05-16,110,6.9,4651.5,Tom Cruise,236909,Jim Cash,1986,
3,0.941311,,18500000.0,Twentieth Century Fox Film Corporation,USA,James Cameron,Action,85160248.0,6095.0,,...,R,1986-07-18,137,8.4,6775.0,Sigourney Weaver,540152,James Cameron,1986,1.0
4,0.630507,,9000000.0,Walt Disney Pictures,USA,Randal Kleiser,Adventure,18564613.0,3996.0,,...,PG,1986-08-01,90,6.9,4651.5,Joey Cramer,36636,Mark H. Baker,1986,


The dataset now has one more column than before, the Oscar_noms column. We now have both the Oscar wins and Oscar nominations in a single dataset.

Next we replaced the NaN values in the Oscar wins and nominations columns with 0, as movies missing from the Oscar datasets have no wins or nominations.


In [22]:
oscarMovies['Oscar_noms'] = oscarMovies['Oscar_noms'].fillna('0')  # replace all NaN values with '0' (cast to int in a later cell)

In [23]:
oscarMovies['Oscar_wins'] = oscarMovies['Oscar_wins'].fillna('0')  # replace all NaN values with '0' (cast to int in a later cell)

In [24]:
oscarMovies.head()

Unnamed: 0,HarMean,Oscar_wins,budget,company,country,director,genre,gross,grossRank,movie_name,...,rating,released,runtime,score,scoreRank,star,votes,writer,year,Oscar_noms
0,0.884794,0,8000000.0,Columbia Pictures Corporation,USA,Rob Reiner,Adventure,52287414.0,5495.0,,...,R,1986-08-22,89,8.1,6685.0,Wil Wheaton,299174,Stephen King,1986,0
1,0.901333,0,6000000.0,Paramount Pictures,USA,John Hughes,Comedy,70136369.0,5878.0,,...,PG-13,1986-06-11,103,7.8,6436.5,Matthew Broderick,264740,John Hughes,1986,0
2,0.801096,0,15000000.0,Paramount Pictures,USA,Tony Scott,Action,179800601.0,6613.0,,...,PG,1986-05-16,110,6.9,4651.5,Tom Cruise,236909,Jim Cash,1986,0
3,0.941311,0,18500000.0,Twentieth Century Fox Film Corporation,USA,James Cameron,Action,85160248.0,6095.0,,...,R,1986-07-18,137,8.4,6775.0,Sigourney Weaver,540152,James Cameron,1986,1
4,0.630507,0,9000000.0,Walt Disney Pictures,USA,Randal Kleiser,Adventure,18564613.0,3996.0,,...,PG,1986-08-01,90,6.9,4651.5,Joey Cramer,36636,Mark H. Baker,1986,0


One of the final steps in preparing this dataset was to remove the columns that did not contain the key information we needed to perform our analysis. This was just a way to clean up the dataset further.

In [25]:
oscarMovies = oscarMovies.drop(
    ['movie_name', 'company', 'released', 'country', 'rating',
     'genre', 'runtime', 'votes', 'writer'],
    axis=1,
)
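
An equivalent way to express this clean-up is to list the columns to keep rather than the ones to drop. A toy sketch with invented data and a shortened column list:

```python
import pandas as pd

toy = pd.DataFrame({'name': ['Aliens'], 'company': ['Fox'],
                    'score': [8.4], 'votes': [540152]})

# Dropping the unwanted columns...
dropped = toy.drop(columns=['company', 'votes'])
# ...gives the same frame as selecting the wanted ones.
kept = toy[['name', 'score']]
print(dropped.equals(kept))
```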

In [26]:
oscarMovies.head()

Unnamed: 0,HarMean,Oscar_wins,budget,director,gross,grossRank,name,score,scoreRank,star,year,Oscar_noms
0,0.884794,0,8000000.0,Rob Reiner,52287414.0,5495.0,Stand by Me,8.1,6685.0,Wil Wheaton,1986,0
1,0.901333,0,6000000.0,John Hughes,70136369.0,5878.0,Ferris Bueller's Day Off,7.8,6436.5,Matthew Broderick,1986,0
2,0.801096,0,15000000.0,Tony Scott,179800601.0,6613.0,Top Gun,6.9,4651.5,Tom Cruise,1986,0
3,0.941311,0,18500000.0,James Cameron,85160248.0,6095.0,Aliens,8.4,6775.0,Sigourney Weaver,1986,1
4,0.630507,0,9000000.0,Randal Kleiser,18564613.0,3996.0,Flight of the Navigator,6.9,4651.5,Joey Cramer,1986,0


In [27]:
oscarMovies.shape

(6731, 12)

We also changed the type of our Oscar win and nomination columns to integers to make the later analysis easier.

In [28]:
oscarMovies.Oscar_noms = oscarMovies.Oscar_noms.astype(np.int64)  # change the column type to int
oscarMovies.Oscar_wins = oscarMovies.Oscar_wins.astype(np.int64)  # change the column type to int

In [29]:
oscarMovies.dtypes

HarMean       float64
Oscar_wins      int64
budget        float64
director       object
gross         float64
grossRank     float64
name           object
score         float64
scoreRank     float64
star           object
year            int64
Oscar_noms      int64
dtype: object
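
The fill and the type change can also be done together. A sketch on an invented column, using a numeric `0` rather than the string `'0'` so that the column stays numeric throughout; this is an alternative, not the code the notebook ran:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Oscar_wins': [np.nan, 1.0, np.nan, 3.0]})

# A missing count means "no wins", so fill with 0 and cast to integers.
toy['Oscar_wins'] = toy['Oscar_wins'].fillna(0).astype(np.int64)
print(toy.dtypes)
```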

Finally, we save the dataset to a pickle file.

In [30]:
oscarMovies.to_pickle('../../data/analysis/oscarMovies.pkl')  # save to a pickle file
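
To confirm that a pickle round-trips cleanly, the file can be read back and compared with the original frame. A minimal sketch on an invented frame, written to a temporary directory so that it does not touch the real data:

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({'name': ['Aliens', 'Platoon'], 'Oscar_wins': [0, 1]})

# Write to a temporary directory so the sketch does not touch real data.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy.pkl')
    toy.to_pickle(path)
    reloaded = pd.read_pickle(path)

# equals() compares values, dtypes, and index.
round_trip_ok = toy.equals(reloaded)
print(round_trip_ok)
```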