![example](images/director_shot.jpeg)

# Box Office Movie Analysis

**Authors:** Emmanuel Kiplimo
***

## Overview

This project analyzes the performance of movies at the box office to get a clear picture of what makes a good movie. Exploratory Data Analysis (EDA) of movie and review data gives a better understanding of the features that contribute to the popularity of films.
Microsoft can use these findings to make informed decisions as it plans to enter the film industry.

## Business Problem

The movie space has been dominated by streaming services like Netflix, Disney and Hulu. Not long ago, a close competitor, Apple, joined in and began producing original video content, including amazing blockbusters such as SEE. This makes it a question of WHEN, not IF, for Microsoft. A calculated approach that analyzes genre, actors, movie duration and viewer age group will give a better understanding of the film industry. Doing so will increase our chances of success upon entry and open up an additional source of revenue.
***
Questions to consider:
* What are the business's pain points related to this project?
* How did you pick the data analysis question(s) that you did?
* Why are these questions important from a business perspective?
***

## Data Understanding

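We have a movies database from IMDB(link), together with movie information and reviews from Rotten Tomatoes. The notebook also loads box office gross figures (`Data/bom.movie_gross.csv`) and production budgets (`Data/tn.movie_budgets.csv`).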
***
Questions to consider:
* Where did the data come from, and how do they relate to the data analysis questions?
* What do the data represent? Who is in the sample and what variables are included?
* What is the target variable?
* What are the properties of the variables you intend to use?
***

In [2]:
# Import standard packages
import pandas as pd
import numpy as np
import sqlite3

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [3]:
# Establishing connection with IMDB database
conn = sqlite3.connect("Data/im.db")

# Converting important tables into DataFrames

# movie_id, primary_title, original_title, start_year, runtime_minutes, genres
movie_basics = pd.read_sql("SELECT * FROM movie_basics", conn)

#movie_id, avgrating, numvotes
movie_ratings = pd.read_sql("SELECT * FROM movie_ratings",conn)

#movie_id, title 
movie_akas = pd.read_sql("SELECT * FROM movie_akas",conn)

#person_id, primary_name
persons = pd.read_sql("SELECT * FROM persons",conn)

#person_id, movie_id
known_for = pd.read_sql("SELECT * FROM known_for",conn)

#movie_id, person_id
directors = pd.read_sql("SELECT * FROM directors",conn)
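
The cell above reads each table with its own `pd.read_sql` call. As a rough sketch of an alternative, the table names can be looked up in SQLite's `sqlite_master` catalogue and loaded in a loop. The in-memory database, `demo_conn`, `table_names` and `tables` are hypothetical stand-ins introduced here so the example runs without `Data/im.db`; only two tables are recreated, each with a single row copied from the outputs in this notebook.

```python
import sqlite3

import pandas as pd

# In-memory stand-in for Data/im.db (assumption: used only so this sketch is runnable)
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT)")
demo_conn.execute("CREATE TABLE directors (movie_id TEXT, person_id TEXT)")
demo_conn.execute("INSERT INTO movie_basics VALUES ('tt0063540', 'Sunghursh')")
demo_conn.execute("INSERT INTO directors VALUES ('tt0285252', 'nm0899854')")
demo_conn.commit()

# sqlite_master is SQLite's built-in catalogue of schema objects
table_names = pd.read_sql(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name", demo_conn
)["name"].tolist()

# Load every table into a dict of DataFrames keyed by table name
tables = {name: pd.read_sql(f'SELECT * FROM "{name}"', demo_conn) for name in table_names}

for name, frame in tables.items():
    print(name, frame.shape)

demo_conn.close()
```

With the real database, the same loop would be pointed at `conn` instead of `demo_conn`; loading every table can be slow for large tables, so selecting only the needed ones may still be preferable.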



In [18]:
movie_basics.head()

| | movie_id | primary_title | original_title | start_year | runtime_minutes | genres |
|---|---|---|---|---|---|---|
| 0 | tt0063540 | Sunghursh | Sunghursh | 2013 | 175.0 | Action,Crime,Drama |
| 1 | tt0066787 | One Day Before the Rainy Season | Ashad Ka Ek Din | 2019 | 114.0 | Biography,Drama |
| 2 | tt0069049 | The Other Side of the Wind | The Other Side of the Wind | 2018 | 122.0 | Drama |
| 3 | tt0069204 | Sabse Bada Sukh | Sabse Bada Sukh | 2018 | NaN | Comedy,Drama |
| 4 | tt0100275 | The Wandering Soap Opera | La Telenovela Errante | 2017 | 80.0 | Comedy,Drama,Fantasy |


In [20]:
movie_basics.isna().sum()

movie_id               0
primary_title          0
original_title        21
start_year             0
runtime_minutes    31739
genres              5408
dtype: int64
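
Raw null counts are easier to judge as a share of all rows. The sketch below recomputes the percentages from the counts printed above and the 146,144 rows reported by `movie_basics.info()`; `null_counts`, `total_rows` and `null_share` are names introduced here for illustration.

```python
import pandas as pd

# Null counts copied from the isna().sum() output above
null_counts = pd.Series(
    {
        "movie_id": 0,
        "primary_title": 0,
        "original_title": 21,
        "start_year": 0,
        "runtime_minutes": 31739,
        "genres": 5408,
    }
)
total_rows = 146144  # RangeIndex size reported by movie_basics.info()

# Percentage of rows missing each column, largest first
null_share = (null_counts / total_rows * 100).round(2)
print(null_share.sort_values(ascending=False))
```

On the real DataFrame, `movie_basics.isna().mean() * 100` should give the same figures directly. Roughly a fifth of the rows have no runtime, while missing genres affect under 4% of rows.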

In [15]:
movie_basics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146144 entries, 0 to 146143
Data columns (total 6 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   movie_id         146144 non-null  object 
 1   primary_title    146144 non-null  object 
 2   original_title   146123 non-null  object 
 3   start_year       146144 non-null  int64  
 4   runtime_minutes  114405 non-null  float64
 5   genres           140736 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 6.7+ MB


In [9]:
directors.head() 

| | movie_id | person_id |
|---|---|---|
| 0 | tt0285252 | nm0899854 |
| 1 | tt0462036 | nm1940585 |
| 2 | tt0835418 | nm0151540 |
| 3 | tt0835418 | nm0151540 |
| 4 | tt0878654 | nm0089502 |
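
Rows 2 and 3 of the preview above hold the same `movie_id` and `person_id` pair, so the `directors` table appears to contain exact duplicate rows. A minimal sketch of counting and removing them, using only the five rows shown (`directors_sample` and `directors_unique` are illustrative names, not part of the notebook):

```python
import pandas as pd

# The five rows displayed by directors.head()
directors_sample = pd.DataFrame(
    {
        "movie_id": ["tt0285252", "tt0462036", "tt0835418", "tt0835418", "tt0878654"],
        "person_id": ["nm0899854", "nm1940585", "nm0151540", "nm0151540", "nm0089502"],
    }
)

# Count rows that repeat an earlier (movie_id, person_id) pair
n_duplicates = int(directors_sample.duplicated().sum())

# Keep the first occurrence of each pair
directors_unique = directors_sample.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(directors_unique))
```

Removing such repeats before joining `directors` to other tables avoids counting the same movie more than once per director.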


In [39]:
# Load box office gross figures
BOM = pd.read_csv("Data/bom.movie_gross.csv")
BOM.head()

| | title | studio | domestic_gross | foreign_gross | year |
|---|---|---|---|---|---|
| 0 | Toy Story 3 | BV | 415000000.0 | 652000000 | 2010 |
| 1 | Alice in Wonderland (2010) | BV | 334200000.0 | 691300000 | 2010 |
| 2 | Harry Potter and the Deathly Hallows Part 1 | WB | 296000000.0 | 664300000 | 2010 |
| 3 | Inception | WB | 292600000.0 | 535700000 | 2010 |
| 4 | Shrek Forever After | P/DW | 238700000.0 | 513900000 | 2010 |
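
In the preview above, `domestic_gross` is shown with a decimal point while `foreign_gross` is not, which suggests (but does not prove) that the two columns have different dtypes. The sketch below assumes `foreign_gross` is stored as text and coerces it to a number before adding the two columns; `bom_sample` and `total_gross` are names introduced here, and the rows are the first two from the preview.

```python
import pandas as pd

# First two rows of the preview; foreign_gross is stored as text here (assumption)
bom_sample = pd.DataFrame(
    {
        "title": ["Toy Story 3", "Alice in Wonderland (2010)"],
        "domestic_gross": [415000000.0, 334200000.0],
        "foreign_gross": ["652000000", "691300000"],
    }
)

# Strip any thousands separators, then convert; unparseable values become NaN
bom_sample["foreign_gross"] = pd.to_numeric(
    bom_sample["foreign_gross"].astype(str).str.replace(",", "", regex=False),
    errors="coerce",
)

# Hypothetical helper column combining both markets
bom_sample["total_gross"] = bom_sample["domestic_gross"] + bom_sample["foreign_gross"]
print(bom_sample[["title", "total_gross"]])
```

Checking `BOM.dtypes` on the real data would confirm whether this conversion is needed at all.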


In [40]:
# Load production budgets and gross figures
TN = pd.read_csv("Data/tn.movie_budgets.csv")
TN.head()

| | id | release_date | movie | production_budget | domestic_gross | worldwide_gross |
|---|---|---|---|---|---|---|
| 0 | 1 | Dec 18, 2009 | Avatar | $425,000,000 | $760,507,625 | $2,776,345,279 |
| 1 | 2 | May 20, 2011 | Pirates of the Caribbean: On Stranger Tides | $410,600,000 | $241,063,875 | $1,045,663,875 |
| 2 | 3 | Jun 7, 2019 | Dark Phoenix | $350,000,000 | $42,762,350 | $149,762,350 |
| 3 | 4 | May 1, 2015 | Avengers: Age of Ultron | $330,600,000 | $459,005,868 | $1,403,013,963 |
| 4 | 5 | Dec 15, 2017 | Star Wars Ep. VIII: The Last Jedi | $317,000,000 | $620,181,382 | $1,316,721,747 |
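
The budget and gross columns above are formatted as currency strings (for example `$425,000,000`), and `release_date` is text, so neither can be used in arithmetic or date filtering as is. One possible way to convert them, sketched on the first two rows of the preview (`tn_sample` and `money_cols` are illustrative names):

```python
import pandas as pd

# First two rows of the preview above
tn_sample = pd.DataFrame(
    {
        "release_date": ["Dec 18, 2009", "May 20, 2011"],
        "movie": ["Avatar", "Pirates of the Caribbean: On Stranger Tides"],
        "production_budget": ["$425,000,000", "$410,600,000"],
        "domestic_gross": ["$760,507,625", "$241,063,875"],
        "worldwide_gross": ["$2,776,345,279", "$1,045,663,875"],
    }
)

money_cols = ["production_budget", "domestic_gross", "worldwide_gross"]
for col in money_cols:
    # Remove the dollar sign and thousands separators, then store as 64-bit integers
    tn_sample[col] = tn_sample[col].str.replace(r"[$,]", "", regex=True).astype("int64")

# Parse dates written like "Dec 18, 2009"
tn_sample["release_date"] = pd.to_datetime(tn_sample["release_date"], format="%b %d, %Y")

print(tn_sample.dtypes)
```

`int64` is requested explicitly because worldwide gross values above roughly 2.1 billion would overflow a 32-bit integer.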


## Data Preparation

Describe and justify the process for preparing the data for analysis.

***
Questions to consider:
* Were there variables you dropped or created?
* How did you address missing values or outliers?
* Why are these choices appropriate given the data and the business problem?
***

### Cleaning `movie_basics`




In [48]:
movie_basics = pd.read_sql("SELECT * FROM movie_basics",conn) 

In [49]:
# The column 'original_title' isn't relevant in our analysis 
movie_basics.drop(labels='original_title',inplace=True,axis=1)



In [50]:
# Rename the columns with relevant column names (start_year is left as is)
movie_basics.rename(columns={'primary_title':'title', 'runtime_minutes':'duration_minutes', 'genres':'genre'}, inplace=True)

In [51]:
# Dropping rows with null values in the genre column
movie_basics.dropna(subset=['genre'], inplace=True)

In [52]:
# Split the comma-separated genre string into a list of genres
movie_basics['genre'] = movie_basics['genre'].str.split(',')

In [22]:
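# Preview one row per (movie, genre) pair; the result is only displayed, so movie_basics keeps its list-valued genre column
movie_basics.explode('genre')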


| | movie_id | title | release | duration_minutes | genre |
|---|---|---|---|---|---|
| 0 | tt0063540 | Sunghursh | 2013 | 175.0 | Action |
| 0 | tt0063540 | Sunghursh | 2013 | 175.0 | Crime |
| 0 | tt0063540 | Sunghursh | 2013 | 175.0 | Drama |
| 1 | tt0066787 | One Day Before the Rainy Season | 2019 | 114.0 | Biography |
| 1 | tt0066787 | One Day Before the Rainy Season | 2019 | 114.0 | Drama |
| ... | ... | ... | ... | ... | ... |
| 146139 | tt9916538 | Kuambil Lagi Hatiku | 2019 | 123.0 | Drama |
| 146140 | tt9916622 | Rodolpho Teóphilo - O Legado de um Pioneiro | 2015 | NaN | Documentary |
| 146141 | tt9916706 | Dankyavar Danka | 2013 | NaN | Comedy |
| 146142 | tt9916730 | 6 Gunn | 2017 | 116.0 | NaN |
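
Once each movie has one row per genre, genre frequencies can be counted directly. The sketch below repeats the split and explode steps on three rows from the earlier `movie_basics.head()` output and assigns the exploded result to a new variable so it can be reused; `genre_sample`, `exploded` and `genre_counts` are names introduced for illustration.

```python
import pandas as pd

# Three rows taken from the movie_basics preview
genre_sample = pd.DataFrame(
    {
        "title": [
            "Sunghursh",
            "One Day Before the Rainy Season",
            "The Other Side of the Wind",
        ],
        "genre": ["Action,Crime,Drama", "Biography,Drama", "Drama"],
    }
)

# Turn "Action,Crime,Drama" into ["Action", "Crime", "Drama"]
genre_sample["genre"] = genre_sample["genre"].str.split(",")

# One row per (movie, genre) pair, stored in a new DataFrame
exploded = genre_sample.explode("genre").reset_index(drop=True)

# How many movies carry each genre label
genre_counts = exploded["genre"].value_counts()
print(genre_counts)
```

Because a movie with three genres contributes three rows, counts from `exploded` describe genre labels rather than distinct movies.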


In [53]:
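# Fill missing durations with the mean duration; assigning the result back avoids chained inplace modification, which newer pandas versions warn about
movie_basics['duration_minutes'] = movie_basics['duration_minutes'].fillna(movie_basics['duration_minutes'].mean())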
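
Filling with the mean is one choice among several. As a hedged comparison, the sketch below applies mean and median imputation to the five runtimes shown in the earlier `movie_basics.head()` output and keeps a flag marking which values were imputed. The median and the flag are suggestions added here, not steps taken in this notebook, and all variable names are illustrative.

```python
import pandas as pd

# Runtimes from the movie_basics preview; the fourth movie has no runtime
runtimes = pd.Series([175.0, 114.0, 122.0, None, 80.0], name="duration_minutes")

# Remember which rows were missing before filling
was_missing = runtimes.isna()

# Mean imputation, as used in the cell above
filled_mean = runtimes.fillna(runtimes.mean())

# Median imputation, less sensitive to unusually long or short runtimes
filled_median = runtimes.fillna(runtimes.median())

print(filled_mean[3], filled_median[3], int(was_missing.sum()))
```

Since about a fifth of the rows lack a runtime, any single fill value will create a large spike at that value in the duration distribution, which is worth keeping in mind when plotting.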

In [54]:
movie_basics['title'].duplicated().value_counts()

False    131336
True       9400
Name: title, dtype: int64

In [45]:
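# Not applied: different films can share a title, so dropping on title alone would discard valid rows
# movie_basics = movie_basics.drop_duplicates(subset='title')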

In [60]:
movie_basics[movie_basics['title'].duplicated(keep=False)].sort_values(by='title')

| | movie_id | title | start_year | duration_minutes | genre |
|---|---|---|---|---|---|
| 131857 | tt8219776 | #5 | 2018 | 86.261902 | [Documentary] |
| 52892 | tt3120962 | #5 | 2013 | 68.000000 | [Biography, Comedy, Fantasy] |
| 106201 | tt6214664 | (aguirre) | 2016 | 98.000000 | [Biography, Comedy, Documentary] |
| 103890 | tt6085916 | (aguirre) | 2016 | 97.000000 | [Biography, Documentary] |
| 100818 | tt5891614 | 1 | 2016 | 22.000000 | [Documentary] |
| ... | ... | ... | ... | ... | ... |
| 66989 | tt3815122 | Ângelo de Sousa - Tudo o Que Sou Capaz | 2010 | 60.000000 | [Biography, Documentary] |
| 37636 | tt2362758 | Éden | 2013 | 73.000000 | [Drama] |
| 23712 | tt1961689 | Éden | 2011 | 64.000000 | [Documentary] |
| 93912 | tt5471216 | Ódio | 2017 | 86.261902 | [Action] |
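
The listing above shows that a repeated title does not always mean a repeated movie: the two `Éden` rows have different years and genres, while the two `(aguirre)` rows share a year. A minimal sketch, using six rows from the listing, of flagging likely duplicates on title and year together rather than on title alone (`dupes_sample`, `same_title` and `same_title_year` are illustrative names):

```python
import pandas as pd

# Six rows copied from the duplicated-title listing above
dupes_sample = pd.DataFrame(
    {
        "movie_id": [
            "tt8219776", "tt3120962", "tt6214664",
            "tt6085916", "tt2362758", "tt1961689",
        ],
        "title": ["#5", "#5", "(aguirre)", "(aguirre)", "Éden", "Éden"],
        "start_year": [2018, 2013, 2016, 2016, 2013, 2011],
    }
)

# Every row whose title occurs more than once
same_title = dupes_sample["title"].duplicated(keep=False)

# Only rows that share both title and year with another row
same_title_year = dupes_sample.duplicated(subset=["title", "start_year"], keep=False)

print(int(same_title.sum()), int(same_title_year.sum()))
print(dupes_sample[same_title_year])
```

Even rows matching on title and year carry distinct `movie_id` values here, so they may still be separate entries; the stricter key only narrows down which rows deserve a manual look.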


## Data Modeling
Describe and justify the process for analyzing or modeling the data.

***
Questions to consider:
* How did you analyze or model the data?
* How did you iterate on your initial approach to make it better?
* Why are these choices appropriate given the data and the business problem?
***

In [None]:
# Here you run your code to model the data


## Evaluation
Evaluate how well your work solves the stated business problem.

***
Questions to consider:
* How do you interpret the results?
* How well does your model fit your data? How much better is this than your baseline model?
* How confident are you that your results would generalize beyond the data you have?
* How confident are you that this model would benefit the business if put into use?
***

## Conclusions
Provide your conclusions about the work you've done, including any limitations or next steps.

***
Questions to consider:
* What would you recommend the business do as a result of this work?
* What are some reasons why your analysis might not fully solve the business problem?
* What else could you do in the future to improve this project?
***