![example](images/director_shot.jpeg)

# Project Title

**Authors:** Student 1, Student 2, Student 3
***

## Overview

A one-paragraph overview of the project, including the business problem, data, methods, results and recommendations.

## Business Problem

Summary of the business problem you are trying to solve, and the data questions that you plan to answer to solve it.

***
Questions to consider:
* What are the business's pain points related to this project?
* How did you pick the data analysis question(s) that you did?
* Why are these questions important from a business perspective?
***

## Data Understanding

Describe the data being used for this project.
***
Questions to consider:
* Where did the data come from, and how do they relate to the data analysis questions?
* What do the data represent? Who is in the sample and what variables are included?
* What is the target variable?
* What are the properties of the variables you intend to use?
***

In [53]:
# Import standard packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [54]:
# Here you run your code to explore the data
!ls data/zippedData

bom.movie_gross.csv.gz
imdb.name.basics.csv.gz
imdb.title.akas.csv.gz
imdb.title.basics.csv.gz
imdb.title.crew.csv.gz
imdb.title.principals.csv.gz
imdb.title.ratings.csv.gz
rt.movie_info.tsv.gz
rt.reviews.tsv.gz
tmdb.movies.csv.gz
tn.movie_budgets.csv.gz


## Import All Data

In [55]:
#Box Office Mojo Data (bom)
bom_moviegross_df = pd.read_csv('data/zippedData/bom.movie_gross.csv.gz')

#IMDB Data (imdb)
imdb_name_basics = pd.read_csv('data/zippedData/imdb.name.basics.csv.gz') 
imdb_title_akas_df = pd.read_csv('data/zippedData/imdb.title.akas.csv.gz')
imdb_title_basics_df = pd.read_csv('data/zippedData/imdb.title.basics.csv.gz')
imdb_title_crew_df = pd.read_csv('data/zippedData/imdb.title.crew.csv.gz')
imdb_title_principals_df = pd.read_csv('data/zippedData/imdb.title.principals.csv.gz')
imdb_title_ratings_df = pd.read_csv('data/zippedData/imdb.title.ratings.csv.gz')

#Rotten Tomatoes Data (rt). These files are tab-separated rather than comma-separated, so read_csv needs a tab delimiter ('\t') and an explicit encoding to parse them properly.
#rt_movie_info_df = pd.read_csv('data/zippedData/rt.movie_info.tsv.gz', delimiter='\t', encoding='ISO-8859-1')
#rt_reviews_df = pd.read_csv('data/zippedData/rt.reviews.tsv.gz', delimiter='\t', encoding='ISO-8859-1')

#The Movie Database (tmdb)
tmdb_movies_df = pd.read_csv('data/zippedData/tmdb.movies.csv.gz')

#The Numbers (tn)
tn_movie_budgets_df = pd.read_csv('data/zippedData/tn.movie_budgets.csv.gz')
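
As a minimal sketch of how the tab-separated Rotten Tomatoes files could be read, the cell below parses a small in-memory sample instead of the real files. The sample rows and the names `raw_bytes` and `rt_sample_df` are hypothetical and only illustrate the `sep` and `encoding` arguments.

```python
import io
import pandas as pd

# Hypothetical sample standing in for a tab-separated, ISO-8859-1 encoded file
raw_bytes = "id\ttitle\trating\n1\tAmélie\tR\n2\tUp\tPG\n".encode("ISO-8859-1")

# sep="\t" (backslash-t) is the tab character; "/t" would be a slash followed by the letter t
rt_sample_df = pd.read_csv(io.BytesIO(raw_bytes), sep="\t", encoding="ISO-8859-1")

print(rt_sample_df)
```

For the real files, the same `sep` and `encoding` arguments would be passed along with the file path.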

## Preview DataFrames

In [56]:
bom_moviegross_df.head()

|   | title | studio | domestic_gross | foreign_gross | year |
|---|---|---|---|---|---|
| 0 | Toy Story 3 | BV | 415000000.0 | 652000000 | 2010 |
| 1 | Alice in Wonderland (2010) | BV | 334200000.0 | 691300000 | 2010 |
| 2 | Harry Potter and the Deathly Hallows Part 1 | WB | 296000000.0 | 664300000 | 2010 |
| 3 | Inception | WB | 292600000.0 | 535700000 | 2010 |
| 4 | Shrek Forever After | P/DW | 238700000.0 | 513900000 | 2010 |


In [57]:
#imdb_name_basics.head()

In [58]:
#imdb_title_akas_df.head()

In [59]:
#imdb_title_basics_df.head()

In [60]:
#imdb_title_crew_df.head()

In [61]:
#imdb_title_principals_df.head()

In [62]:
#imdb_title_ratings_df.head()

In [63]:
#rt_movie_info_df.head()

In [64]:
#rt_reviews_df.head()

In [65]:
#tmdb_movies_df.head()

In [66]:
#tn_movie_budgets_df.head()

## Check for Missing Values

In [67]:
bom_moviegross_df.isna().sum()

title                0
studio               5
domestic_gross      28
foreign_gross     1350
year                 0
dtype: int64

In [68]:
#imdb_name_basics.isna().sum()

In [69]:
#imdb_title_akas_df.isna().sum()

In [70]:
#imdb_title_basics_df.isna().sum()

In [71]:
#imdb_title_crew_df.isna().sum()

In [72]:
#imdb_title_principals_df.isna().sum()

In [73]:
#imdb_title_ratings_df.isna().sum()
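
Rather than running one cell per DataFrame, the missing-value check could be done in a single loop. The sketch below assumes the DataFrames are collected in a dictionary; the `frames` dictionary and its toy contents are hypothetical stand-ins for the loaded datasets.

```python
import numpy as np
import pandas as pd

# Hypothetical small frames standing in for the loaded datasets
frames = {
    "bom": pd.DataFrame({"title": ["A", "B", "C"], "studio": ["BV", None, "WB"]}),
    "ratings": pd.DataFrame({"tconst": ["tt1", "tt2"], "averagerating": [7.1, np.nan]}),
}

# Missing-value counts per column, keyed by DataFrame name
missing_counts = {name: df.isna().sum() for name, df in frames.items()}

# Show only the columns that actually have missing values
for name, counts in missing_counts.items():
    print(name)
    print(counts[counts > 0])
```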

## Data Preparation

Describe and justify the process for preparing the data for analysis.

***
Questions to consider:
* Were there variables you dropped or created?
* How did you address missing values or outliers?
* Why are these choices appropriate given the data and the business problem?
***

In [74]:
# Here you run your code to clean the data

## Box Office Mojo Data

In [75]:
bom_moviegross_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


In [76]:
bom_moviegross_df.isna().sum()

title                0
studio               5
domestic_gross      28
foreign_gross     1350
year                 0
dtype: int64

All right, let's deal with the foreign_gross column!

In [77]:
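# Count the missing foreign_gross values and divide by the total number of rows
num_missing_bom_foreigngross = bom_moviegross_df.isna().sum()['foreign_gross']
total_moviegross_entries = len(bom_moviegross_df['foreign_gross'])
# Note: this is a fraction between 0 and 1, not a percentage
percentage_missing_foreign = num_missing_bom_foreigngross / total_moviegross_entries
print(percentage_missing_foreign)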

0.3985828166519043
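
The same share of missing values can also be computed in one step, because the mean of a boolean mask is the fraction of `True` entries. This is a minimal sketch on a hypothetical five-row series named `toy`, not the real column.

```python
import numpy as np
import pandas as pd

# Hypothetical series with two missing entries out of five
toy = pd.Series(["100", np.nan, "2,500", np.nan, "300"], name="foreign_gross")

# isna() gives a boolean mask; its mean is the fraction of missing values
missing_fraction = toy.isna().mean()

print(f"{missing_fraction:.1%}")
```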


So about 40% of the values in the 'foreign_gross' column are missing. The column also has an object dtype rather than a numeric one, so it probably contains non-numeric entries.

Eventually we want to replace all of the NaN values with the median so we do not affect the distribution too much. First, though, let's look at what the column actually contains.

In [78]:
bom_moviegross_df['foreign_gross'].describe()

count        2037
unique       1204
top       1200000
freq           23
Name: foreign_gross, dtype: object

There are 1204 unique non-null values in this series. Let's collect every unique value that is not an int or a float into a list so we can easily check what data type they are.

In [85]:
# Collect every unique value that is neither an int nor a float
# (NaN is a float, so missing values are skipped)
non_standard = []
for x in bom_moviegross_df['foreign_gross'].unique():
    if not isinstance(x, (int, float)):
        non_standard.append(x)

In [86]:
len(non_standard)

1204

In [87]:
# Count how many of the non-standard values are strings.
# If this equals len(non_standard), every non-standard value is a str.
len([x for x in non_standard if isinstance(x, str)])

1204

Wow! So ALL of the unique non-missing values in this column are strings! Let's change that!
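
One possible way to do this conversion, followed by the median fill mentioned earlier, is sketched below on toy data. The sample values are hypothetical, and the comma-formatted entry is an assumption about how some figures might be written, not something confirmed above.

```python
import numpy as np
import pandas as pd

# Hypothetical values; the comma-formatted string is an assumed edge case
foreign_gross = pd.Series(
    ["652000000", "1,200,000", np.nan, "535700000"], name="foreign_gross"
)

# Strip thousands separators, then convert; anything unparseable becomes NaN
cleaned = pd.to_numeric(
    foreign_gross.str.replace(",", "", regex=False), errors="coerce"
)

# Fill the remaining missing values with the median of the parsed values
filled = cleaned.fillna(cleaned.median())

print(filled.dtype)
print(filled.isna().sum())
```

Using `errors="coerce"` turns unexpected strings into NaN instead of raising, so it is worth comparing the missing count before and after the conversion to see whether anything was silently dropped.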

## Data Modeling
Describe and justify the process for analyzing or modeling the data.

***
Questions to consider:
* How did you analyze or model the data?
* How did you iterate on your initial approach to make it better?
* Why are these choices appropriate given the data and the business problem?
***

In [None]:
# Here you run your code to model the data


## Evaluation
Evaluate how well your work solves the stated business problem.

***
Questions to consider:
* How do you interpret the results?
* How well does your model fit your data? How much better is this than your baseline model?
* How confident are you that your results would generalize beyond the data you have?
* How confident are you that this model would benefit the business if put into use?
***

## Conclusions
Provide your conclusions about the work you've done, including any limitations or next steps.

***
Questions to consider:
* What would you recommend the business do as a result of this work?
* What are some reasons why your analysis might not fully solve the business problem?
* What else could you do in the future to improve this project?
***