## **Table of Contents**
  1. Gathering the Data
  2. Data Preprocess
  3. Data Analysis

## **Hollywood Movie Production Sales**
**Introduction**

This notebook analyzes sample data on Hollywood productions. The sample covers 2007 to 2012 and contains the fields listed below.

**Fields in HollywoodsMostProfitableStories.csv**

  - Film
  - Genre
  - Lead Studio
  - Audience score %
  - Profitability
  - Rotten Tomatoes %
  - Worldwide Gross
  - Year

<a gathering-the-data="gathering-data"></a>
### Gathering the Data 

In [9]:
# Initialize libraries
import pandas as pd
import numpy as np
from scipy import stats as st
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [21]:
# Read the file and display the first rows of the data set
# to show which fields the file contains
df = pd.read_csv('/HollywoodsMostProfitableStories.csv')
df.head(15)

|   | Film | Genre | Lead Studio | Audience score % | Profitability | Rotten Tomatoes % | Worldwide Gross | Year |
|---|---|---|---|---|---|---|---|---|
| 0 | 27 Dresses | Comedy | Fox | 71.0 | 5.343622 | 40.0 | 160.308654 | 2008 |
| 1 | (500) Days of Summer | Comedy | Fox | 81.0 | 8.096 | 87.0 | 60.72 | 2009 |
| 2 | A Dangerous Method | Drama | Independent | 89.0 | 0.448645 | 79.0 | 8.972895 | 2011 |
| 3 | A Serious Man | Drama | Universal | 64.0 | 4.382857 | 89.0 | 30.68 | 2009 |
| 4 | Across the Universe | Romance | Independent | 84.0 | 0.652603 | 54.0 | 29.367143 | 2007 |
| 5 | Beginners | Comedy | Independent | 80.0 | 4.471875 | 84.0 | 14.31 | 2011 |
| 6 | Dear John | Drama | Sony | 66.0 | 4.5988 | 29.0 | 114.97 | 2010 |
| 7 | Enchanted | Comedy | Disney | 80.0 | 4.005737 | 93.0 | 340.487652 | 2007 |
| 8 | Fireproof | Drama | Independent | 51.0 | 66.934 | 40.0 | 33.467 | 2008 |
| 9 | Four Christmases | Comedy | Warner Bros. | 52.0 | 2.022925 | 26.0 | 161.834 | 2008 |


In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74 entries, 0 to 73
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   film               74 non-null     object 
 1   genre              74 non-null     object 
 2   lead studio        73 non-null     object 
 3   audience  score %  73 non-null     float64
 4   profitability      71 non-null     float64
 5   rotten tomatoes %  73 non-null     float64
 6   worldwide gross    74 non-null     float64
 7   year               74 non-null     int64  
dtypes: float64(4), int64(1), object(3)
memory usage: 4.8+ KB


<a data-pre="data-pre"></a>
### Data Preprocess

In [23]:
# Lowercase the column names so they are easier to work with
df = df.rename(columns=lambda x: x.lower())
df.columns

Index(['film', 'genre', 'lead studio', 'audience  score %', 'profitability',
       'rotten tomatoes %', 'worldwide gross', 'year'],
      dtype='object')
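Note that the output above shows `audience  score %` with two consecutive spaces, which is easy to mistype when selecting the column later. As an optional alternative, the sketch below collapses repeated whitespace while lowercasing. The toy frame and the `normalize` helper are hypothetical and not part of this notebook. The later cells use the double-space name, so they would need updating if this approach were adopted.

```python
import pandas as pd

# Toy frame with headers shaped like the ones in the CSV (values are made up).
sample = pd.DataFrame(
    [["Example Film", 70.0, 55.0]],
    columns=["Film", "Audience  score %", "Rotten Tomatoes %"],
)

def normalize(name: str) -> str:
    # Hypothetical helper: str.split() with no arguments splits on any run
    # of whitespace, so re-joining with one space collapses doubled spaces.
    return " ".join(name.lower().split())

cleaned = sample.rename(columns=normalize)
print(list(cleaned.columns))
```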

In [35]:
# Start checking whether the data is clean: count fully duplicated rows
df.duplicated().sum()

0

Our data does not contain any duplicates, so we move on to checking for missing values.
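`duplicated()` with no arguments only flags rows that match on every column. A film entered twice with slightly different numbers would pass that check. The sketch below, on made-up rows, shows how a title-only check might look. Whether `film` should be treated as a unique key in this dataset is an assumption.

```python
import pandas as pd

# Made-up rows: the same title appears twice with different gross values.
sample = pd.DataFrame({
    "film": ["Film A", "Film B", "Film A"],
    "worldwide gross": [10.0, 20.0, 10.5],
})

# Compares every column, so the two "Film A" rows are not duplicates.
full_row_dupes = sample.duplicated().sum()

# Compares titles only, so the second "Film A" row is flagged.
title_dupes = sample.duplicated(subset=["film"]).sum()

# keep=False marks every occurrence, which makes the rows easy to inspect.
repeated_titles = sample[sample.duplicated(subset=["film"], keep=False)]
print(full_row_dupes, title_dupes)
print(repeated_titles)
```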

In [36]:
df.isnull().sum()

film                 0
genre                0
lead studio          1
audience  score %    1
profitability        3
rotten tomatoes %    1
worldwide gross      0
year                 0
dtype: int64
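The counts above say how many values are missing per column, but not which rows they belong to or whether one row is missing several values. A minimal sketch of listing every affected row at once, using a made-up frame rather than the notebook's `df`:

```python
import numpy as np
import pandas as pd

# Made-up rows with missing values in two different columns.
sample = pd.DataFrame({
    "film": ["Film A", "Film B", "Film C"],
    "lead studio": ["Studio X", None, "Studio Y"],
    "profitability": [1.5, 2.0, np.nan],
})

# isna() gives a boolean frame; any(axis=1) is True for a row
# when at least one of its columns is missing.
rows_with_nulls = sample[sample.isna().any(axis=1)]
print(rows_with_nulls)
```

Applied to the notebook's data, the same expression would be `df[df.isna().any(axis=1)]`.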

Next, we handle the null values in the dataset by dropping the affected rows, starting with the missing lead studio.

In [37]:
df[df['lead studio'].isna()]

|   | film | genre | lead studio | audience score % | profitability | rotten tomatoes % | worldwide gross | year |
|---|---|---|---|---|---|---|---|---|
| 38 | No Reservations | Comedy | NaN | 64.0 | 3.30718 | 39.0 | 92.60105 | 2007 |


In [43]:
df = df.dropna(subset=['lead studio']).reset_index(drop=True)

In [44]:
df.isnull().sum()

film                 0
genre                0
lead studio          0
audience  score %    1
profitability        3
rotten tomatoes %    1
worldwide gross      0
year                 0
dtype: int64

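Next, we repeat the same steps for the remaining null values. As the output below shows, the one row missing an audience score is also missing its Rotten Tomatoes score, so dropping it clears both columns.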

In [51]:
df[df['audience  score %'].isna()]

|   | film | genre | lead studio | audience score % | profitability | rotten tomatoes % | worldwide gross | year |
|---|---|---|---|---|---|---|---|---|
| 49 | Something Borrowed | Romance | Independent | NaN | 1.719514 | NaN | 60.183 | 2011 |


In [53]:
df = df.dropna(subset=['audience  score %']).reset_index(drop=True)
df.isnull().sum()

film                 0
genre                0
lead studio          0
audience  score %    0
profitability        3
rotten tomatoes %    0
worldwide gross      0
year                 0
dtype: int64

In [54]:
df[df['profitability'].isna()]

|   | film | genre | lead studio | audience score % | profitability | rotten tomatoes % | worldwide gross | year |
|---|---|---|---|---|---|---|---|---|
| 18 | Jane Eyre | Romance | Universal | 77.0 | NaN | 85.0 | 30.147 | 2011 |
| 40 | Our Family Wedding | Comedy | Independent | 49.0 | NaN | 14.0 | 21.37 | 2010 |
| 68 | When in Rome | Comedy | Disney | 44.0 | NaN | 15.0 | 43.04 | 2010 |


In [57]:
df = df.dropna(subset=['profitability']).reset_index(drop=True)
df.isnull().sum()

film                 0
genre                0
lead studio          0
audience  score %    0
profitability        0
rotten tomatoes %    0
worldwide gross      0
year                 0
dtype: int64
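The three `dropna` steps above could also be written as a single pass. The sketch below shows one way to do that on a made-up frame, with a count of how many rows were removed. The `required` list is a name introduced here for illustration. Judging by the null counts and rows shown earlier, the notebook's own data loses 5 of its 74 rows, leaving 69.

```python
import numpy as np
import pandas as pd

# Made-up rows; each of B, C and D is missing a different field.
sample = pd.DataFrame({
    "film": ["Film A", "Film B", "Film C", "Film D"],
    "lead studio": ["Studio X", None, "Studio Y", "Studio Z"],
    "audience  score %": [70.0, 64.0, np.nan, 80.0],
    "profitability": [1.5, 3.3, 1.7, np.nan],
})

# Columns that must be present for a row to be kept (illustrative name).
required = ["lead studio", "audience  score %", "profitability"]

before = len(sample)
# Drop any row missing a required field, then renumber the index from 0.
cleaned = sample.dropna(subset=required).reset_index(drop=True)
print(f"dropped {before - len(cleaned)} of {before} rows")
```

Dropping step by step, as the notebook does, has the advantage of showing each affected row before it is removed. The single pass is shorter but hides that inspection.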

## **Data Visualization**