Phase II project

PROJECT OVERVIEW

Your company sees the big companies creating original video content and wants to get in on the fun. It has decided to create a new movie studio, but it doesn’t know anything about creating movies. You are charged with exploring which types of films are currently doing best at the box office. You must then translate those findings into actionable insights that the head of the new movie studio can use to decide what types of films to create.


OBJECTIVES

1.

BUSINESS UNDERSTANDING

DATA UNDERSTANDING

In [1]:
# import data analysis libraries
import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

In [3]:
# import data and create df
# df = pd.read_csv("bom.movie_gross.csv") # encoding issues
df = pd.read_csv("bom.movie_gross.csv", encoding="latin-1")

# check the first 5 rows
df.head()

                                         title  studio  domestic_gross  foreign_gross  year
0                                  Toy Story 3      BV     415000000.0      652000000  2010
1                   Alice in Wonderland (2010)      BV     334200000.0      691300000  2010
2  Harry Potter and the Deathly Hallows Part 1      WB     296000000.0      664300000  2010
3                                    Inception      WB     292600000.0      535700000  2010
4                          Shrek Forever After    P/DW     238700000.0      513900000  2010


In [4]:
# checking the dataset information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


In [5]:
#checking the dataset shape
df.shape

(3387, 5)

1. **Data Cleaning**


A. Duplicates

In [6]:
# to check for duplicates
df.duplicated().sum()

# the data has no duplicates

0
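The check above only counts rows that are identical in every column. As a minimal sketch on a made-up toy frame (the film names and values below are illustrative, not taken from the dataset), the same method can also be restricted to key columns through its `subset` parameter, which would catch a film listed twice with differing details:

```python
import pandas as pd

# Toy frame standing in for the box office data (illustrative values only)
toy = pd.DataFrame({
    "title": ["Film A", "Film A", "Film B"],
    "studio": ["X", "Y", "X"],
    "year": [2010, 2010, 2011],
})

# Rows identical in every column: none here, because the studio differs
full_dupes = toy.duplicated().sum()

# Rows repeating the same title and year, regardless of the other columns
key_dupes = toy.duplicated(subset=["title", "year"]).sum()

print(full_dupes, key_dupes)
```

Whether title and year are the right key for this dataset is an assumption; re-releases could legitimately share a title.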

In [7]:
# make df copy to be used in data cleaning

data = df.copy()

B. Missing Values

In [8]:
# percentage of missing values per column
# sorted in descending order

data.isna().sum().sort_values(ascending=False)/len(data)*100

foreign_gross     39.858282
domestic_gross     0.826690
studio             0.147623
title              0.000000
year               0.000000
dtype: float64
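The percentage calculation above divides the count of missing values by the number of rows. A minimal sketch on a toy frame (made-up values) shows that taking the mean of the boolean `isna()` mask gives the same figures, since the mean of a True/False column is the fraction of True entries:

```python
import numpy as np
import pandas as pd

# Toy frame with known gaps (illustrative values only)
toy = pd.DataFrame({
    "studio": ["BV", None, "WB", "Fox"],
    "domestic_gross": [1.0, 2.0, np.nan, np.nan],
})

# Approach used in the notebook: missing count divided by row count
pct_a = toy.isna().sum().sort_values(ascending=False) / len(toy) * 100

# Equivalent shortcut: mean of the boolean mask, scaled to a percentage
pct_b = toy.isna().mean().mul(100).sort_values(ascending=False)

print(pct_a)
print(pct_b)
```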


In [9]:
# check for unique values in each column

for col in data.columns:
    print(col)  # print the column name itself rather than a one-element set
    print(data[col].unique())
    print()

title
['Toy Story 3' 'Alice in Wonderland (2010)'
 'Harry Potter and the Deathly Hallows Part 1' ... 'El Pacto' 'The Swan'
 'An Actor Prepares']

studio
['BV' 'WB' 'P/DW' 'Sum.' 'Par.' 'Uni.' 'Fox' 'Wein.' 'Sony' 'FoxS' 'SGem'
 'WB (NL)' 'LGF' 'MBox' 'CL' 'W/Dim.' 'CBS' 'Focus' 'MGM' 'Over.' 'Mira.'
 'IFC' 'CJ' 'NM' 'SPC' 'ParV' 'Gold.' 'JS' 'RAtt.' 'Magn.' 'Free' '3D'
 'UTV' 'Rela.' 'Zeit.' 'Anch.' 'PDA' 'Lorb.' 'App.' 'Drft.' 'Osci.' 'IW'
 'Rog.' nan 'Eros' 'Relbig.' 'Viv.' 'Hann.' 'Strand' 'NGE' 'Scre.' 'Kino'
 'Abr.' 'CZ' 'ATO' 'First' 'GK' 'FInd.' 'NFC' 'TFC' 'Pala.' 'Imag.' 'NAV'
 'Arth.' 'CLS' 'Mont.' 'Olive' 'CGld' 'FOAK' 'IVP' 'Yash' 'ICir' 'FM'
 'Vita.' 'WOW' 'Truly' 'Indic.' 'FD' 'Vari.' 'TriS' 'ORF' 'IM' 'Elev.'
 'Cohen' 'NeoC' 'Jan.' 'MNE' 'Trib.' 'Rocket' 'OMNI/FSR' 'KKM' 'Argo.'
 'SMod' 'Libre' 'FRun' 'WHE' 'P4' 'KC' 'SD' 'AM' 'MPFT' 'Icar.' 'AGF'
 'A23' 'Da.' 'NYer' 'Rialto' 'DF' 'KL' 'ALP' 'LG/S' 'WGUSA' 'MPI' 'RTWC'
 'FIP' 'RF' 'ArcEnt' 'PalUni' 'EpicPics' 'EO

In [10]:
# Check the number of missing values in each column

data.isna().sum().sort_values(ascending=False)

foreign_gross     1350
domestic_gross      28
studio               5
title                0
year                 0
dtype: int64


fillna()
The fillna() method fills missing (NaN) values with a specified value.
In the code below, missing values in the two gross columns are filled with 0, and missing studio names are filled with 'Unspecified'.

In [15]:
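# Assign the result back instead of calling fillna(..., inplace=True) on a
# selected column, which newer pandas versions warn about and may not apply.

# Replace missing values in 'foreign_gross' column with 0
data['foreign_gross'] = data['foreign_gross'].fillna(0)

# Replace missing values in 'domestic_gross' column with 0
data['domestic_gross'] = data['domestic_gross'].fillna(0)

# Replace missing values in 'studio' column with 'Unspecified'
data['studio'] = data['studio'].fillna('Unspecified')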

In [16]:
# Confirm the number of missing values in each column

data.isna().sum().sort_values(ascending=False)

title             0
studio            0
domestic_gross    0
foreign_gross     0
year              0
dtype: int64
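The per-column fills above can also be expressed as a single call. The sketch below uses a toy frame with made-up values and passes `DataFrame.fillna` a dictionary mapping each column name to its fill value; it returns a new frame and leaves the original untouched:

```python
import numpy as np
import pandas as pd

# Toy frame with one gap per column (illustrative values only)
toy = pd.DataFrame({
    "studio": ["BV", None],
    "domestic_gross": [100.0, np.nan],
})

# One fill value per column, applied in a single call
fill_values = {"domestic_gross": 0, "studio": "Unspecified"}
filled = toy.fillna(value=fill_values)

print(filled)
print(filled.isna().sum())
```

Keeping the fill values in one dictionary makes it easier to see, and later revisit, which placeholder was chosen for each column.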


C. Data Manipulation

Convert the float64 gross values to int64 so that they are stored as whole numbers, since the domestic gross and foreign gross figures are whole.
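One way this conversion might look is sketched below on a toy frame. The `df.info()` output earlier shows `foreign_gross` stored as `object`, so the sketch assumes (this is not confirmed by the output shown) that some entries are strings, possibly with thousands separators, mixed with the 0 placeholders. It parses them with `pd.to_numeric` before casting:

```python
import pandas as pd

# Toy frame mimicking the assumed mix of types (illustrative values only)
toy = pd.DataFrame({
    "domestic_gross": [415000000.0, 0.0, 1000.0],
    "foreign_gross": ["652000000", "1,131.6", 0],
})

# Normalise everything to strings, drop commas, then parse as numbers;
# anything unparseable becomes NaN (errors="coerce") and is filled with 0
foreign = pd.to_numeric(
    toy["foreign_gross"].astype(str).str.replace(",", "", regex=False),
    errors="coerce",
)
toy["foreign_gross"] = foreign.fillna(0).round().astype("int64")

# domestic_gross is already float64 with no gaps, so a direct cast works
toy["domestic_gross"] = toy["domestic_gross"].astype("int64")

print(toy.dtypes)
```

Casting to int64 fails if any NaN remains, which is why the missing values are filled before the cast.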