# MOVIES ANALYSIS WITH EDA

## BUSINESS UNDERSTANDING
### What factors should Microsoft consider when starting a Movie Studio?
Microsoft has decided to start a new movie studio. To do this, they need to know which kinds of movies are performing best at the box office, so that they can decide what type of films to produce. Exploratory data analysis (EDA) is used here to provide concrete recommendations to Microsoft.

## Data Understanding

The data for this analysis comes from:
* ### im.db
This is a SQLite database that consists of data about movies from IMDB.

* ### bom.movie_gross.csv
This is a CSV file that contains data about movies from Box Office.

## Importing Libraries

In [43]:
import numpy as np

#for analysis and manipulation of data
import pandas as pd

#for connecting to im.db
import sqlite3 

# for visualization
import matplotlib.pyplot as plt
import seaborn as sns

## Loading Data

In [44]:
# connecting to the im.db database (despite its name, imdb_df is a connection object, not a DataFrame)
imdb_df = sqlite3.connect("movies-Data/im.db")
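
Once a connection is open, a common first step is to list the tables the database contains. The sketch below does this against an in-memory database, so that it runs without `im.db`; the table `demo_movies` and its columns are hypothetical stand-ins, not the actual IMDB schema.

```python
import sqlite3
import pandas as pd

# In-memory stand-in for im.db; the table name and columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_movies (movie_id TEXT, title TEXT)")
conn.commit()

# sqlite_master is SQLite's built-in catalog of schema objects.
tables = pd.read_sql("SELECT name FROM sqlite_master WHERE type = 'table'", conn)
table_names = tables["name"].tolist()
print(table_names)

conn.close()
```

The same query should work unchanged when run against the `imdb_df` connection.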

In [45]:
#load the bom data
bom_df = pd.read_csv("movies-Data/bom.movie_gross.csv")

## Data Understanding (Box Office)

In [46]:
# preview of first 5 rows
bom_df.head()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010


In [47]:
# preview of the last rows
bom_df.tail()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
3382,The Quake,Magn.,6200.0,,2018
3383,Edward II (2018 re-release),FM,4800.0,,2018
3384,El Pacto,Sony,2500.0,,2018
3385,The Swan,Synergetic,2400.0,,2018
3386,An Actor Prepares,Grav.,1700.0,,2018


In [48]:
# Number of rows and columns
bom_df.shape

(3387, 5)

In [49]:
#Statistical summary of data
bom_df.describe()

Unnamed: 0,domestic_gross,year
count,3359.0,3387.0
mean,28745850.0,2013.958075
std,66982500.0,2.478141
min,100.0,2010.0
25%,120000.0,2012.0
50%,1400000.0,2014.0
75%,27900000.0,2016.0
max,936700000.0,2018.0


In [50]:
# information about the columns
bom_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


This is what we get from the above information:
* Box Office data has 5 columns namely ```title```, ```studio```, ```domestic_gross```, ```foreign_gross``` and ```year```. 
* The ```title```, ```studio``` and ```foreign_gross``` columns are stored as strings (```object``` dtype), while ```domestic_gross``` (```float64```) and ```year``` (```int64```) are numerical.
* The data has a total of 3387 rows.
* The following columns have null values, because their non-null counts are below 3387:
    * studio
    * domestic_gross
    * foreign_gross

## Data Cleaning (Box Office)


In [51]:
# number of null values in each column
bom_df.isnull().sum()

title                0
studio               5
domestic_gross      28
foreign_gross     1350
year                 0
dtype: int64

In [52]:
# sample rows where studio has no missing values
bom_df[bom_df["studio"].notna()].sample(5, random_state = 1)

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
571,Poetry,Kino,356000.0,1900000.0,2011
3343,The Third Murder,FM,89300.0,,2018
1850,Camp X-Ray,IFC,13300.0,,2014
3257,Don't Worry He Won't Get Far on Foot,Amazon,1400000.0,2500000.0,2018
1512,Non-Stop,Uni.,92200000.0,130600000.0,2014


In [53]:
# all rows where studio is missing
bom_df[bom_df["studio"].isna()]

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
210,Outside the Law (Hors-la-loi),,96900.0,3300000.0,2010
555,Fireflies in the Garden,,70600.0,3300000.0,2011
933,Keith Lemon: The Film,,,4000000.0,2012
1862,Plot for Peace,,7100.0,,2014
2825,Secret Superstar,,,122000000.0,2017


In [54]:
# drop rows where studio has a null value
bom_df.dropna(subset = ["studio"], inplace = True)

In [55]:
bom_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3382 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3382 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3356 non-null   float64
 3   foreign_gross   2033 non-null   object 
 4   year            3382 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 158.5+ KB


In [56]:
# percentage of missing values in domestic_gross
bom_df["domestic_gross"].isnull().sum()/len(bom_df)*100

0.768775872264932

This is a very small portion of the dataset (less than 1%), so we can simply drop the rows with null values.
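
The missing percentage can also be computed for every column at once. The sketch below uses a small made-up frame (`demo` and its values are illustrative only, not taken from the Box Office data) and relies on the fact that the mean of a boolean mask is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Toy frame with illustrative values only.
demo = pd.DataFrame({
    "studio": ["A", None, "C", "D"],
    "domestic_gross": [1.0, 2.0, np.nan, 4.0],
    "year": [2010, 2011, 2012, 2013],
})

# isna() gives a boolean mask; its column-wise mean is the missing fraction.
missing_pct = demo.isna().mean() * 100
print(missing_pct)

# Dropping rows that are missing a value in one specific column.
cleaned = demo.dropna(subset=["domestic_gross"])
```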

In [57]:
bom_df.dropna(subset = ["domestic_gross"], inplace = True)
bom_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3356 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3356 non-null   object 
 1   studio          3356 non-null   object 
 2   domestic_gross  3356 non-null   float64
 3   foreign_gross   2007 non-null   object 
 4   year            3356 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 157.3+ KB


In [35]:
# percentage of missing values in foreign gross
bom_df["foreign_gross"].isnull().sum()/len(bom_df)*100

40.19666269368295

This is a significant portion of the dataset, so dropping these rows would discard too much data. Instead, we can fill the missing values with either the mean or the median.

In [58]:
# preview of foreign gross column
bom_df["foreign_gross"].head()

0    652000000
1    691300000
2    664300000
3    535700000
4    513900000
Name: foreign_gross, dtype: object

The values look numeric, but the dtype of ```foreign_gross``` is ```object```, meaning they are stored as strings. We therefore need to convert the column to a numerical datatype.

In [59]:
# removing thousands separators and converting to a numerical datatype (nulls stay NaN)
bom_df["foreign_gross"] = bom_df["foreign_gross"].str.replace(",", "", regex=False).astype(float)
bom_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3356 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3356 non-null   object 
 1   studio          3356 non-null   object 
 2   domestic_gross  3356 non-null   float64
 3   foreign_gross   2007 non-null   float64
 4   year            3356 non-null   int64  
dtypes: float64(2), int64(1), object(2)
memory usage: 157.3+ KB
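
An alternative way to do this conversion is `pd.to_numeric` with `errors="coerce"`, which turns anything unparseable into `NaN` instead of raising an error. This is a minimal sketch on made-up strings; the values in `raw` are hypothetical and only illustrate the comma and placeholder cases.

```python
import pandas as pd

# Hypothetical raw values: plain digits, a comma separator, a null, a placeholder.
raw = pd.Series(["652000000", "1,131.6", None, "n/a"], dtype="object")

# Strip commas literally (no regex), then coerce unparseable entries to NaN.
cleaned = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")
print(cleaned)
```

With `errors="coerce"`, unexpected text silently becomes missing, so it is worth comparing the null count before and after the conversion.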


In [61]:
# mean of foreign_gross column before filling null values
bom_df["foreign_gross"].mean().round(2)

75790384.84

In [62]:
#median of foreign_gross
bom_df["foreign_gross"].median()

19400000.0

In [63]:
#checking mean after filling null values with the mean
bom_df1 = bom_df["foreign_gross"].fillna(bom_df["foreign_gross"].mean())
bom_df1.mean().round(2)

75790384.84

In [64]:
#checking mean after filling null values with median
bom_df2 = bom_df["foreign_gross"].fillna(bom_df["foreign_gross"].median())
bom_df2.mean().round(2)

53123332.05

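According to the above results, filling the null values with the median lowers the column mean from about 75.8 million to about 53.1 million, whereas filling with the mean leaves it unchanged. We will therefore fill the null values with the mean. Note that mean imputation preserves only the mean: it lowers the variance and places roughly 40% of the rows on a single value, which should be kept in mind when reading the quantiles of this column.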
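
The trade-off between the two fill strategies can be seen on a small skewed series. The numbers below are made up for illustration and are not from the Box Office data.

```python
import numpy as np
import pandas as pd

# Skewed toy series with three missing values (illustrative only).
s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0, np.nan, np.nan, np.nan])

mean_filled = s.fillna(s.mean())
median_filled = s.fillna(s.median())

# Compare summary statistics of the three versions side by side.
summary = pd.DataFrame({
    "original": s.describe(),
    "mean_filled": mean_filled.describe(),
    "median_filled": median_filled.describe(),
})
print(summary)
```

In this toy case the mean fill keeps the mean at 22.0 but shrinks the standard deviation, while the median fill keeps the median at 3.0 but pulls the mean down.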

In [66]:
# replacing null values with the mean
bom_df["foreign_gross"] = bom_df["foreign_gross"].fillna(bom_df["foreign_gross"].mean())
bom_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3356 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3356 non-null   object 
 1   studio          3356 non-null   object 
 2   domestic_gross  3356 non-null   float64
 3   foreign_gross   3356 non-null   float64
 4   year            3356 non-null   int64  
dtypes: float64(2), int64(1), object(2)
memory usage: 157.3+ KB


In [68]:
#confirming there are no missing values
bom_df.isna().sum()

title             0
studio            0
domestic_gross    0
foreign_gross     0
year              0
dtype: int64

The dataset now has no missing values, so we can check whether it contains any duplicates.

In [69]:
#checking for duplicates
bom_df.duplicated().any()

False
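
`duplicated()` with no arguments only flags rows that match in every column. A stricter check, sketched here on a made-up frame, looks for repeats on a chosen key; treating `title` plus `year` as the key is an assumption for illustration, not something the source establishes.

```python
import pandas as pd

# Toy frame: the first two rows share title and year but differ in studio.
demo = pd.DataFrame({
    "title": ["A", "A", "B"],
    "studio": ["X", "Y", "Z"],
    "year": [2010, 2010, 2011],
})

# Whole-row comparison finds nothing.
full_dupes = demo.duplicated().any()

# Key-based comparison; keep=False marks every member of a duplicate group.
key_dupes = demo.duplicated(subset=["title", "year"], keep=False)
print(demo[key_dupes])
```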

The data has no fully duplicated rows. We can now check the summary statistics after data cleaning.

In [70]:
# summary statistics of bom_df
bom_df.describe()

Unnamed: 0,domestic_gross,foreign_gross,year
count,3356.0,3356.0,3356.0
mean,28771490.0,75790380.0,2013.970203
std,67006940.0,106847200.0,2.479064
min,100.0,600.0,2010.0
25%,120000.0,12200000.0,2012.0
50%,1400000.0,75790380.0,2014.0
75%,27950000.0,75790380.0,2016.0
max,936700000.0,960500000.0,2018.0
