# 400_Prep_Movie-Industry

## Purpose
The purpose of this notebook is to load our data and prepare it for analysis.<br>We also create our measure of success here, because this dataset will be merged with two other datasets.

## Datasets

 - input: movies.csv
 - output: movies.pkl

# Loading the Data

Before analysing the data we need to load the dataset. We then inspect it and apply any cleaning needed to make sure the information is suitable for analysis.

In [145]:
import os.path
import pandas as pd
import numpy as np
from scipy import stats

In [146]:
# Ensure the file exists before trying to load it
if not os.path.exists("movies.csv"):
    print("Missing dataset file")
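The check above only prints a message, so a later cell would still fail when it tries to read the file. As a minimal sketch of a stricter alternative, the helper below stops with an error as soon as the file is missing. The name `require_file` and the file name used in the demonstration are hypothetical and not part of the original notebook.

```python
from pathlib import Path


def require_file(path):
    """Return the path as a Path object, or raise if the file is missing."""
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(f"Missing dataset file: {p}")
    return p


# Demonstration with a file name that is assumed not to exist.
try:
    require_file("definitely_missing_file.csv")
    missing_detected = False
except FileNotFoundError:
    missing_detected = True
```

In the notebook itself this could be called as `require_file("movies.csv")` before `pd.read_csv`.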

In [147]:
movies = pd.read_csv('movies.csv', encoding='latin-1')  # Load the dataset

# Cleaning the Data

The dataset we have here seems to be cleaned already but we will still carry out some checks to ensure the data is correct.

In [148]:
movies.head(5) # Brings up the first 5 rows

Unnamed: 0,budget,company,country,director,genre,gross,name,rating,released,runtime,score,star,votes,writer,year
0,8000000.0,Columbia Pictures Corporation,USA,Rob Reiner,Adventure,52287414.0,Stand by Me,R,1986-08-22,89,8.1,Wil Wheaton,299174,Stephen King,1986
1,6000000.0,Paramount Pictures,USA,John Hughes,Comedy,70136369.0,Ferris Bueller's Day Off,PG-13,1986-06-11,103,7.8,Matthew Broderick,264740,John Hughes,1986
2,15000000.0,Paramount Pictures,USA,Tony Scott,Action,179800601.0,Top Gun,PG,1986-05-16,110,6.9,Tom Cruise,236909,Jim Cash,1986
3,18500000.0,Twentieth Century Fox Film Corporation,USA,James Cameron,Action,85160248.0,Aliens,R,1986-07-18,137,8.4,Sigourney Weaver,540152,James Cameron,1986
4,9000000.0,Walt Disney Pictures,USA,Randal Kleiser,Adventure,18564613.0,Flight of the Navigator,PG,1986-08-01,90,6.9,Joey Cramer,36636,Mark H. Baker,1986


We can look at the shape of the dataset and check if any information is missing.

In [149]:
movies.shape

(6820, 15)

In [150]:
movies.isnull().sum()  # Check for missing values

budget      0
company     0
country     0
director    0
genre       0
gross       0
name        0
rating      0
released    0
runtime     0
score       0
star        0
votes       0
writer      0
year        0
dtype: int64

We can convert the release date from a string to a datetime type to enable easier analysis later. First we check the current column types.

In [151]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6820 entries, 0 to 6819
Data columns (total 15 columns):
budget      6820 non-null float64
company     6820 non-null object
country     6820 non-null object
director    6820 non-null object
genre       6820 non-null object
gross       6820 non-null float64
name        6820 non-null object
rating      6820 non-null object
released    6820 non-null object
runtime     6820 non-null int64
score       6820 non-null float64
star        6820 non-null object
votes       6820 non-null int64
writer      6820 non-null object
year        6820 non-null int64
dtypes: float64(3), int64(3), object(9)
memory usage: 799.3+ KB


In [152]:
movies['released'] = pd.to_datetime(movies['released'])
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6820 entries, 0 to 6819
Data columns (total 15 columns):
budget      6820 non-null float64
company     6820 non-null object
country     6820 non-null object
director    6820 non-null object
genre       6820 non-null object
gross       6820 non-null float64
name        6820 non-null object
rating      6820 non-null object
released    6820 non-null datetime64[ns]
runtime     6820 non-null int64
score       6820 non-null float64
star        6820 non-null object
votes       6820 non-null int64
writer      6820 non-null object
year        6820 non-null int64
dtypes: datetime64[ns](1), float64(3), int64(3), object(8)
memory usage: 799.3+ KB
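Once a column has the `datetime64[ns]` type, its parts can be read through the `.dt` accessor. The sketch below uses a small made-up frame rather than the real dataset; the column names `release_year` and `release_month` are illustrative assumptions. It also shows `errors="coerce"`, which turns unparseable strings into `NaT` instead of raising.

```python
import pandas as pd

# Small illustrative frame; the last value is deliberately not a date.
sample = pd.DataFrame({"released": ["1986-08-22", "1987-02-06", "not a date"]})

# Unparseable strings become NaT instead of raising an error.
sample["released"] = pd.to_datetime(sample["released"], errors="coerce")

# Date parts are available through the .dt accessor.
sample["release_year"] = sample["released"].dt.year
sample["release_month"] = sample["released"].dt.month

# Count how many values could not be parsed.
n_unparsed = int(sample["released"].isna().sum())
```

Counting the `NaT` values after conversion is a quick way to confirm that every date was parsed.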


In [153]:
movies

Unnamed: 0,budget,company,country,director,genre,gross,name,rating,released,runtime,score,star,votes,writer,year
0,8000000.0,Columbia Pictures Corporation,USA,Rob Reiner,Adventure,52287414.0,Stand by Me,R,1986-08-22,89,8.1,Wil Wheaton,299174,Stephen King,1986
1,6000000.0,Paramount Pictures,USA,John Hughes,Comedy,70136369.0,Ferris Bueller's Day Off,PG-13,1986-06-11,103,7.8,Matthew Broderick,264740,John Hughes,1986
2,15000000.0,Paramount Pictures,USA,Tony Scott,Action,179800601.0,Top Gun,PG,1986-05-16,110,6.9,Tom Cruise,236909,Jim Cash,1986
3,18500000.0,Twentieth Century Fox Film Corporation,USA,James Cameron,Action,85160248.0,Aliens,R,1986-07-18,137,8.4,Sigourney Weaver,540152,James Cameron,1986
4,9000000.0,Walt Disney Pictures,USA,Randal Kleiser,Adventure,18564613.0,Flight of the Navigator,PG,1986-08-01,90,6.9,Joey Cramer,36636,Mark H. Baker,1986
5,6000000.0,Hemdale,UK,Oliver Stone,Drama,138530565.0,Platoon,R,1987-02-06,120,8.1,Charlie Sheen,317585,Oliver Stone,1986
6,25000000.0,Henson Associates (HA),UK,Jim Henson,Adventure,12729917.0,Labyrinth,PG,1986-06-27,101,7.4,David Bowie,102879,Dennis Lee,1986
7,6000000.0,De Laurentiis Entertainment Group (DEG),USA,David Lynch,Drama,8551228.0,Blue Velvet,R,1986-10-23,120,7.8,Isabella Rossellini,146768,David Lynch,1986
8,9000000.0,Paramount Pictures,USA,Howard Deutch,Comedy,40471663.0,Pretty in Pink,PG-13,1986-02-28,96,6.8,Molly Ringwald,60565,John Hughes,1986
9,15000000.0,SLM Production Group,USA,David Cronenberg,Drama,40456565.0,The Fly,R,1986-08-15,96,7.5,Jeff Goldblum,129698,George Langelaan,1986


This dataset was clean from the beginning: we searched for issues in the data and found none. Therefore no cleaning was required on this dataset.

# Preparing the Data


## Measure of success

We created our measure of success here because this dataset will be merged into two separate datasets. To get the best measure, we calculated it here while the dataset still contained the largest number of movies.

We created our measure of success using the harmonic mean. The inputs are the average score of each movie and its gross income.

The first step in creating the harmonic mean was to get our gross and our score into the same numeric range. We did this by ranking the scores and the gross values. This put both into the range 1 to 6820. Because the ranks are ascending, 1 is the worst rank and 6820 is the best, and tied values share the average of their ranks.

In [154]:
# We used the built-in .rank method to rank the scores and the gross, storing the values in new columns.
# Ties receive the average of their ranks, and the lowest value gets rank 1.
movies['scoreRank'] = movies['score'].rank(method='average', na_option='keep', ascending=True)
movies['grossRank'] = movies['gross'].rank(method='average', na_option='keep', ascending=True)
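To make the ranking behaviour concrete, here is a minimal sketch on five made-up scores (not taken from the full dataset ordering). With `ascending=True` the smallest value receives rank 1, and with `method="average"` tied values share the mean of the ranks they span.

```python
import pandas as pd

# Five illustrative scores; 6.9 appears twice to show how ties are handled.
scores = pd.Series([8.1, 7.8, 6.9, 8.4, 6.9])

# Lowest score gets rank 1; the two 6.9 values share ranks 1 and 2, so each gets 1.5.
ranks = scores.rank(method="average", ascending=True)

print(ranks.tolist())  # [4.0, 3.0, 1.5, 5.0, 1.5]
```

This is why half-ranks such as 6436.5 appear in the `scoreRank` column: many movies share the same score.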

With this done we could then create our harmonic mean based on the rankings.

In [155]:
scores = movies[['grossRank', 'scoreRank']]  # Select the gross rank and score rank columns
movies['HarMean'] = scores.apply(stats.hmean, axis=1)  # Apply hmean row by row to get the harmonic mean of the two ranks
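For two values a and b, the harmonic mean is 2ab / (a + b), which is pulled towards the smaller of the two. As a hedged sketch, the code below checks that formula against `stats.hmean` using the two ranks shown for "Stand by Me" in this notebook's output (scoreRank 6685.0, grossRank 5495.0). The helper name `harmonic_mean_of_two` is an assumption for illustration. The result is the value before normalisation, so it does not match the final `HarMean` column directly.

```python
import numpy as np
from scipy import stats


def harmonic_mean_of_two(a, b):
    """Harmonic mean of two positive numbers: 2ab / (a + b)."""
    return 2 * a * b / (a + b)


score_rank, gross_rank = 6685.0, 5495.0  # ranks shown for "Stand by Me"

manual = harmonic_mean_of_two(score_rank, gross_rank)
library = stats.hmean([gross_rank, score_rank])
arithmetic = (score_rank + gross_rank) / 2

# hmean also accepts a 2-D array with axis=1, computing one value per row.
rank_rows = np.array([[5495.0, 6685.0], [5878.0, 6436.5]])
row_means = stats.hmean(rank_rows, axis=1)
```

The harmonic mean here is roughly 6031.87, below the arithmetic mean of 6090.0, so a movie needs both a high score rank and a high gross rank to do well on this measure. Passing `axis=1` on an array is an alternative to `DataFrame.apply` that avoids a Python-level call per row.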

We then normalised the harmonic mean to the range 0 to 1, as this made the measure easier to understand and compare.

In [156]:
# Normalise the harmonic mean to the range [0, 1] with the standard min-max formula
movies['HarMean'] = (movies['HarMean'] - movies['HarMean'].min()) / (movies['HarMean'].max() - movies['HarMean'].min())
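The min-max formula maps the smallest value to 0 and the largest to 1, with everything else placed proportionally in between. A minimal sketch on three made-up numbers follows; the helper name `min_max_normalise` is hypothetical, and the guard against a constant column is an added assumption rather than something the notebook does.

```python
import pandas as pd


def min_max_normalise(values):
    """Rescale a Series to [0, 1] using (x - min) / (max - min)."""
    lo, hi = values.min(), values.max()
    if hi == lo:
        # All values identical: the formula would divide by zero.
        raise ValueError("Cannot normalise a constant series")
    return (values - lo) / (hi - lo)


sample = pd.Series([2.0, 4.0, 10.0])
normalised = min_max_normalise(sample)

print(normalised.tolist())  # [0.0, 0.25, 1.0]
```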

In [157]:
movies

Unnamed: 0,budget,company,country,director,genre,gross,name,rating,released,runtime,score,star,votes,writer,year,scoreRank,grossRank,HarMean
0,8000000.0,Columbia Pictures Corporation,USA,Rob Reiner,Adventure,52287414.0,Stand by Me,R,1986-08-22,89,8.1,Wil Wheaton,299174,Stephen King,1986,6685.0,5495.0,0.884794
1,6000000.0,Paramount Pictures,USA,John Hughes,Comedy,70136369.0,Ferris Bueller's Day Off,PG-13,1986-06-11,103,7.8,Matthew Broderick,264740,John Hughes,1986,6436.5,5878.0,0.901333
2,15000000.0,Paramount Pictures,USA,Tony Scott,Action,179800601.0,Top Gun,PG,1986-05-16,110,6.9,Tom Cruise,236909,Jim Cash,1986,4651.5,6613.0,0.801096
3,18500000.0,Twentieth Century Fox Film Corporation,USA,James Cameron,Action,85160248.0,Aliens,R,1986-07-18,137,8.4,Sigourney Weaver,540152,James Cameron,1986,6775.0,6095.0,0.941311
4,9000000.0,Walt Disney Pictures,USA,Randal Kleiser,Adventure,18564613.0,Flight of the Navigator,PG,1986-08-01,90,6.9,Joey Cramer,36636,Mark H. Baker,1986,4651.5,3996.0,0.630507
5,6000000.0,Hemdale,UK,Oliver Stone,Drama,138530565.0,Platoon,R,1987-02-06,120,8.1,Charlie Sheen,317585,Oliver Stone,1986,6685.0,6479.0,0.965281
6,25000000.0,Henson Associates (HA),UK,Jim Henson,Adventure,12729917.0,Labyrinth,PG,1986-06-27,101,7.4,David Bowie,102879,Dennis Lee,1986,5798.5,3468.0,0.636565
7,6000000.0,De Laurentiis Entertainment Group (DEG),USA,David Lynch,Drama,8551228.0,Blue Velvet,R,1986-10-23,120,7.8,Isabella Rossellini,146768,David Lynch,1986,6436.5,3057.0,0.607958
8,9000000.0,Paramount Pictures,USA,Howard Deutch,Comedy,40471663.0,Pretty in Pink,PG-13,1986-02-28,96,6.8,Molly Ringwald,60565,John Hughes,1986,4408.5,5140.0,0.696145
9,15000000.0,SLM Production Group,USA,David Cronenberg,Drama,40456565.0,The Fly,R,1986-08-15,96,7.5,Jeff Goldblum,129698,George Langelaan,1986,5973.0,5139.0,0.810375


With the harmonic mean created, our dataset was prepared and we could begin merging the datasets.

In [158]:
movies.to_pickle('movies.pkl')  # Store the data in a pickle file
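One reason to prefer a pickle over a CSV at this stage is that column types such as `datetime64[ns]` survive the round trip, so the release dates do not need converting again in the next notebook. The sketch below illustrates this on a one-row made-up frame written to a temporary directory; the file name `example.pkl` is an assumption.

```python
import os
import tempfile

import pandas as pd

# One-row illustrative frame with a datetime column.
frame = pd.DataFrame(
    {"name": ["Aliens"], "released": pd.to_datetime(["1986-07-18"])}
)

# Write to a temporary location and read it straight back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.pkl")
    frame.to_pickle(path)
    restored = pd.read_pickle(path)

# The restored frame keeps the same values and column types.
dtypes_match = bool((restored.dtypes == frame.dtypes).all())
```

In the downstream notebooks the saved file would be loaded with `pd.read_pickle('movies.pkl')`. Pickle files should only be loaded from trusted sources.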