# *Movie Status Report: Computing Vision*
### I.- Introduction
#### A) Overview
    We have been tasked with determining which type of movie would be the best investment for a new movie studio, based on empirical data collected over the past couple of years. We want your initial investment in a movie to succeed so you can hit the ground running in your new venture. To that end, we have taken a deep dive into the data to determine how long a movie should be, what age rating it should carry, and how much to spend to give it the best chance of being a hit at the box office.
#### B) Members
    1. Blake Medwed
    2. Rico Gutierrez
    3. Ahmed Isse
    4. Alberto Ruiz Martinez
#### C) Objectives
    1. Analyze empirical data based on the last few years
    2. Visualize the analyzed data
    3. Explain the visualizations
    4. Determine the best course of action for the initial investment

### II.- Determining Data
#### A) What production metrics make a movie successful?
    1. Runtime - How long a movie lasts
    2. Age Rating - The age bracket and potential age restrictions determined by the MPAA
    3. Movie Budget - The amount invested in film production
#### B) What metrics are affected by production?
    1. Domestic gross - Gross revenue made by the movie within its country of origin
    2. Foreign gross - Gross revenue made by the movie outside of the origin country
    3. Worldwide gross - Gross revenue made by the movie worldwide
    4. Box Office - The amount of money raised by ticket sales
    5. Popularity Rating - How highly an audience rates a movie
#### C) Where does the data come from?
    1. Box Office Mojo - Domestic gross, foreign gross, release year
    2. Rotten Tomatoes - MPAA rating, runtime, box office
    3. The Movie Database - Popularity rating, average popularity vote
    4. The Numbers - Production budget, domestic gross, worldwide gross

### III.- Cleaning and Modifying Data
##### A) Importing data into dataframes and respective libraries

In [27]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
bom_movies = pd.read_csv('bom.movie_gross.csv.gz', index_col=0)
rt_reviews = pd.read_csv("rt.reviews.tsv.gz", sep = '\t', encoding='windows-1252')
rt_movies = pd.read_csv("rt.movie_info.tsv.gz", sep = '\t', encoding='windows-1252')
tmdb = pd.read_csv('tmdb.movies.csv.gz', index_col=7)
tn = pd.read_csv('tn.movie_budgets.csv.gz', index_col=2)

##### B) Removing data that would potentially skew results
    1. Missing (not available) values
    2. Placeholder values that stand in for missing data, such as zero

In [28]:
bom_movies.dropna(inplace=True)
rt_movies.dropna(inplace=True)
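
Before dropping rows, it can help to see how much each column is missing and to treat placeholder zeros as missing too. The following is only a minimal sketch on a made-up frame: the name `sample`, its columns, and its values are illustrative assumptions, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for one of the movie tables (all values are made up)
sample = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'domestic_gross': [1_000_000.0, 0.0, np.nan, 250_000.0],
    'studio': ['X', 'Y', 'Z', None],
})

# Count missing values per column before removing anything
missing_before = sample.isna().sum()

# Treat a gross of zero as a placeholder for "unknown"
sample['domestic_gross'] = sample['domestic_gross'].replace(0, np.nan)

# Drop every row that still has a missing value in any column
cleaned = sample.dropna()
```

Checking `missing_before` first shows how many rows a blanket `dropna` is about to discard, which matters when only a few columns are actually needed later.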

##### C) Change data types
    1. Changing a string into a manageable integer type

In [29]:

rt_movies['box_office'] = rt_movies['box_office'].str.replace(',', '')
rt_movies['box_office'] = rt_movies['box_office'].astype(int)

##### D) Merging dataframes
    1. Increases the sample size for metrics such as gross revenue
    2. Keeps only the columns that carry useful information

In [30]:
df = pd.merge(bom_movies, tmdb, on = 'title')
df = df[['studio', 'domestic_gross', 'foreign_gross', 'year', 'popularity', 'vote_average']]
tn.index.name = 'title'
df = df.merge(tn, on='title', how='left' )
df.sort_index()

| title | studio | domestic_gross_x | foreign_gross | year | popularity | vote_average | id | release_date | production_budget | domestic_gross_y | worldwide_gross |
|---|---|---|---|---|---|---|---|---|---|---|---|
| '71 | RAtt. | 1300000.0 | 355000 | 2015 | 10.523 | 6.8 | | | | | |
| 10 Cloverfield Lane | Par. | 72100000.0 | 38100000 | 2016 | 17.892 | 6.9 | 54.0 | Mar 11, 2016 | $5,000,000 | $72,082,999 | $108,286,422 |
| 11-11-11 | Rocket | 32800.0 | 5700000 | 2011 | 5.196 | 4.3 | | | | | |
| 12 Strong | WB | 45800000.0 | 21600000 | 2018 | 13.183 | 5.6 | 64.0 | Jan 19, 2018 | $35,000,000 | $45,819,713 | $71,118,378 |
| 12 Years a Slave | FoxS | 56700000.0 | 131100000 | 2013 | 16.493 | 7.9 | 18.0 | Oct 18, 2013 | $20,000,000 | $56,671,993 | $181,025,343 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Zero Dark Thirty | Sony | 95700000.0 | 37100000 | 2012 | 14.239 | 6.9 | 66.0 | Dec 19, 2012 | $52,500,000 | $95,720,716 | $134,612,435 |
| Zookeeper | Sony | 80400000.0 | 89500000 | 2011 | 10.764 | 5.3 | 71.0 | Jul 8, 2011 | $80,000,000 | $80,360,866 | $170,805,525 |
| Zoolander 2 | Par. | 28800000.0 | 27900000 | 2016 | 12.997 | 4.7 | 64.0 | Feb 12, 2016 | $50,000,000 | $28,848,693 | $55,348,693 |
| Zootopia | BV | 341300000.0 | 682500000 | 2016 | 27.549 | 7.7 | 57.0 | Mar 4, 2016 | $150,000,000 | $341,268,248 | $1,019,429,616 |
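
Because the second merge uses `how='left'`, titles with no match in The Numbers keep empty budget columns, as with '71 and 11-11-11 in the output. One way to see how many rows matched is the `indicator` option of `merge`. This is a sketch under assumptions: `left` and `right` are small hypothetical stand-ins built from a few of the rows shown, and `title` is kept as an ordinary column for simplicity.

```python
import pandas as pd

# Small stand-ins for the merged Box Office Mojo / TMDB frame and The Numbers
left = pd.DataFrame({
    'title': ['10 Cloverfield Lane', '11-11-11', '12 Strong'],
    'studio': ['Par.', 'Rocket', 'WB'],
})
right = pd.DataFrame({
    'title': ['10 Cloverfield Lane', '12 Strong'],
    'production_budget': ['$5,000,000', '$35,000,000'],
})

# indicator=True adds a '_merge' column: 'both' for matched rows,
# 'left_only' for rows that found no budget data
merged = left.merge(right, on='title', how='left', indicator=True)

matched = (merged['_merge'] == 'both').sum()
unmatched = (merged['_merge'] == 'left_only').sum()
```

Comparing `matched` with `unmatched` gives a quick sense of how much of the sample a later `dropna` will remove.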


##### E) Further formatting
    1. Removing duplicate columns
    2. Removing NaN values again
    3. Converting values

In [31]:
# Removing duplicate and unneeded columns
df = df.drop('domestic_gross_y', axis=1)
df = df.drop('release_date', axis=1)

# Removing NaN
df.dropna(inplace=True)

# Formatting and converting values
# Note: the result of this call is not assigned, so 'year' keeps its original type
pd.to_datetime(df.year, format='%Y')

# regex=False treats ',' and '$' as literal characters ('$' is otherwise a regex anchor)
df['production_budget'] = df['production_budget'].str.replace(',', '', regex=False)
df['production_budget'] = df['production_budget'].str.replace('$', '', regex=False)
df['production_budget'] = df['production_budget'].astype(int)

df['worldwide_gross'] = df['worldwide_gross'].str.replace(',', '', regex=False)
df['worldwide_gross'] = df['worldwide_gross'].str.replace('$', '', regex=False)
df['worldwide_gross'] = df['worldwide_gross'].astype(int)

# Removing duplicate rows and restoring the original column name
df.drop_duplicates(inplace=True)
df = df.rename(columns={'domestic_gross_x': 'domestic_gross'})

df.head()

| title | studio | domestic_gross | foreign_gross | year | popularity | vote_average | id | production_budget | worldwide_gross |
|---|---|---|---|---|---|---|---|---|---|
| Toy Story 3 | BV | 415000000.0 | 652000000 | 2010 | 24.445 | 7.7 | 47.0 | 200000000 | 1068879522 |
| Inception | WB | 292600000.0 | 535700000 | 2010 | 27.92 | 8.3 | 38.0 | 160000000 | 835524642 |
| Shrek Forever After | P/DW | 238700000.0 | 513900000 | 2010 | 15.041 | 6.1 | 27.0 | 165000000 | 756244673 |
| The Twilight Saga: Eclipse | Sum. | 300500000.0 | 398000000 | 2010 | 20.34 | 6.0 | 53.0 | 68000000 | 706102828 |
| Iron Man 2 | Par. | 312400000.0 | 311500000 | 2010 | 28.515 | 6.8 | 15.0 | 170000000 | 621156389 |
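
The strip-and-cast steps repeat for each money column, so they could be folded into one helper. A minimal sketch, not the method used in this report: `money_to_int` is a hypothetical name, and the sample strings are budget values copied from the merged table shown earlier.

```python
import pandas as pd

def money_to_int(column: pd.Series) -> pd.Series:
    """Turn strings such as '$5,000,000' into 64-bit integers."""
    cleaned = (
        column.str.replace('$', '', regex=False)   # drop the currency sign
              .str.replace(',', '', regex=False)   # drop thousands separators
    )
    return cleaned.astype('int64')

# Example input in the same format as the raw budget column
budgets = pd.Series(['$5,000,000', '$35,000,000', '$150,000,000'])
budget_values = money_to_int(budgets)
```

Asking for `'int64'` explicitly, rather than `int`, is a design choice here: on platforms where `int` maps to a 32-bit type, grosses above roughly 2.1 billion would not fit.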


### IV.- Analyzing Data

##### A) Analyzing MPAA Age Rating

##### B) Analyzing Movie Budget

##### C) Analyzing Movie Runtime

### V.- Conclusion