# I See Dead People: An Analysis

Bryan Bumgardner, Data Scientist  
February - March 2016

### Death is a fact of life.   
That being said, it's worth studying how it fits into our popular culture. During the Metis Data Science bootcamp (shameless plug), I was working on another project when I discovered [bodycounters.com](http://www.bodycounters.com/). The dedicated volunteers at this site count and categorize the deaths in movies and, to date, have covered over 2,000 movies. Go grab some beers, count the dead people in your favorite movie, and contribute to the site. 

The data are sitting on their site for anyone to see, which gave me an idea. I reached out to the site and corresponded with Dana, who was gracious enough to share a CSV of all the data. I then cross-referenced this data with information from another site, the-numbers.com, which shares basic budget and financial information about thousands of movies. I took this data, mashed it together, and asked some questions. 

Thanks again to the volunteers at bodycounters. Y'all are the greatest. Below is a quick data science passion project to analyze and find patterns in this great data you've collected. 

### Major questions:
1. Is the number of on-screen deaths going up in modern movies?
2. Does a movie's death count have an effect on that movie's domestic gross?
3. Does a movie's death count have an effect on that movie's rating? 
4. How does budget factor into a movie's death count?
5. Is there a relationship between the movie's MPAA rating and body count?


### Table of Contents:
1. Reading and combining the data
2. Outputting various data frames
3. Visualizing
4. Conclusions
5. Documentation of the data

In [1]:
import pandas as pd 
import numpy as np
import csv
import os

import codecs
from bs4 import BeautifulSoup

import statsmodels.api as sm
import statsmodels.formula.api as smf

from sklearn.linear_model import LinearRegression 
from patsy import dmatrices


import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns

%matplotlib inline
sns.set(color_codes=True)

### Step 1: read in and combine the data

Deaths data is a csv...

In [2]:
deaths = pd.read_csv("moviesmovies.csv", encoding="latin-1", index_col="movieskey") #might as well make use of their key

In [3]:
deaths["moviename"] = deaths["moviename"].str.rstrip("\t").str.lower() #cleaning trailing tabs and lowercasing titles for the matching
deaths["deathcomment"] = deaths["deathcomment"].str.rstrip("\t")

In [4]:
deaths.head(6)

movieskey,moviename,deathcount,deathcomment
1,king kong,71,"26 with Ape. Also 3 dinosaurs, 4 T-Rexes, 35 v..."
2,wrong turn,8,8
3,death to the supermodels,7,7
4,zombie honeymoon,10,10
5,ginger dead man,7,6 and a cookie
6,punisher,58,58


I grabbed a lot of the budget data from the website by saving the page as an HTML file, because the site crashes nonstop. Reading the saved page:

In [5]:
budgetwebsite = pd.read_html("allmoviebudgets.html", header=0, encoding="utf-8")

In [6]:
all_budgets = pd.DataFrame(budgetwebsite[0])

How do I link them together? Fuzzy matching won't work: think "Terminator 2" and "Terminator 3." It would work if I had more data from the deaths source. I know how to do this in Excel... 

In [7]:
all_budgets.to_csv("all_budgets.csv", encoding='utf-8')
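
To make the sequel problem concrete, here is a minimal sketch using the standard library's `difflib.SequenceMatcher` as a stand-in for a fuzzy matcher (an assumption; no fuzzy-matching library is actually used in this notebook). The `similarity` helper and the example titles are illustrative only.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0..1 similarity score between two titles (hypothetical helper)."""
    return SequenceMatcher(None, a, b).ratio()


# Two different movies that differ by a single character.
sequel_score = similarity("terminator 2", "terminator 3")

# The same movie with a one-letter typo, which we WOULD want to match.
typo_score = similarity("the terminator", "the terminater")

print(round(sequel_score, 3), round(typo_score, 3))
```

Both scores land above 0.9 and within a few hundredths of each other, so no threshold on title similarity alone can accept the typo while rejecting the wrong sequel.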

Ok so I'm really lazy and that only matched like maybe a third of the movies. Let's scrape all the movie data - EVER - because we are crazy. And also because I'm not interested in paying for access to their database. 

So what you see below is me scraping all the budgets and domestic box offices for every movie in the alphabetized database. Every letter has a chart with all the movies starting with that letter. I downloaded each webpage from the data source as an HTML webpage then pulled the charts from that, which is less likely to get you banned than hitting up their site with Selenium.

I then mashed the charts together into one GIANT dataframe, which will be helpful for record linkage. This needs to be cleaned properly as well.

In [172]:
path = 'movie_pages/' #where the HTML files are
listing = os.listdir(path) #read the pages from the directory
frames = [] #one dataframe per downloaded page

for infile in listing: #iterate through the files
    temp_list = pd.read_html(os.path.join(path, infile), header=0) #read_html returns a list of dataframes
    frames.append(temp_list[0]) #take the first (and only) table on the page

all_movie_data = pd.concat(frames) #stack every page into the master dataframe (DataFrame.append was removed in pandas 2.0)
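
One side effect of stacking per-page tables is that each page keeps its own row labels, so the combined index can contain duplicates. A minimal sketch of the difference, using two tiny hand-built frames in place of the scraped pages (the frame names are illustrative):

```python
import pandas as pd

# Stand-ins for two scraped pages; each starts its own index at 0.
page_a = pd.DataFrame({"Movie": ["The Avengers", "Bounce"],
                       "BoxOffice": ["$23,385,416", "$36,805,288"]})
page_b = pd.DataFrame({"Movie": ["The Book of Life", "Brave"],
                       "BoxOffice": ["$50,151,543", "$237,282,182"]})

# Default behaviour: original labels are kept, so 0 and 1 each appear twice.
stacked = pd.concat([page_a, page_b])

# ignore_index=True renumbers the rows 0..n-1 in the combined frame.
reindexed = pd.concat([page_a, page_b], ignore_index=True)

print(stacked.index.tolist())
print(reindexed.index.tolist())
```

Whether to renumber is a judgment call: duplicate labels are harmless for filtering, but they make label-based operations such as `drop` or `loc` ambiguous.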

In [173]:
del all_movie_data['Trailer'] #not necessary and screwed up the dropna

In [174]:
all_movie_data.rename(columns = {'DomesticBox Officeto Date': 'BoxOffice'}, inplace = True)

In [175]:
all_movie_data

Unnamed: 0,Release Date,Movie,Genre,ProductionBudget,BoxOffice
0,"Jan 28, 2014",1,Documentary,,$0
1,,,,,
2,"Jun 8, 2012",1 Out Of 7,Drama,,$0
3,,,,,
4,"Oct 5, 1979",10,Romantic Comedy,,"$52,134,699"
5,"Mar 7, 2008","10,000 B.C.",Adventure,"$105,000,000","$94,784,201"
6,"Jul 24, 2015",10 Cent Pistol,Thriller/Suspense,,$0
7,"Mar 11, 2016",10 Cloverfield Lane,Thriller/Suspense,"$5,000,000","$26,790,517"
8,"Nov 11, 2015",10 Days in a Madhouse,Drama,"$12,000,000","$14,616"
9,"Dec 31, 2015",10 Endrathukulla,Action,,$0


In [176]:
all_movie_data = all_movie_data.dropna() #drop unnecessary lines with incomplete data

In [177]:
all_movie_data

Unnamed: 0,Release Date,Movie,Genre,ProductionBudget,BoxOffice
5,"Mar 7, 2008","10,000 B.C.",Adventure,"$105,000,000","$94,784,201"
7,"Mar 11, 2016",10 Cloverfield Lane,Thriller/Suspense,"$5,000,000","$26,790,517"
8,"Nov 11, 2015",10 Days in a Madhouse,Drama,"$12,000,000","$14,616"
28,"Nov 22, 2000",102 Dalmatians,Comedy,"$85,000,000","$66,941,559"
31,"Aug 18, 2006",10th & Wolf,Drama,"$8,000,000","$54,702"
35,"Aug 12, 2005",11:14,Drama,"$6,000,000",$0
47,"Apr 13, 1957",12 Angry Men,Drama,"$340,000",$0
55,"Mar 27, 2009",12 Rounds,Action,"$20,000,000","$12,234,694"
60,"Nov 5, 2010",127 Hours,Drama,"$18,000,000","$18,335,230"
64,"Apr 23, 2004",13 Going On 30,Comedy,"$30,000,000","$57,139,723"
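
As an aside on why the empty Trailer column got in the way of `dropna`: with no arguments, `dropna` removes any row that has a missing value in any column. A small sketch on toy rows (the `toy` frame is illustrative) shows that passing `subset` is an alternative to deleting the column first:

```python
import numpy as np
import pandas as pd

# Toy rows shaped like the scraped table; Trailer is empty everywhere.
toy = pd.DataFrame({
    "Movie": ["10", "10,000 B.C.", None],
    "ProductionBudget": [None, "$105,000,000", None],
    "BoxOffice": ["$52,134,699", "$94,784,201", None],
    "Trailer": [np.nan, np.nan, np.nan],
})

# Every row has a missing Trailer, so a bare dropna() removes them all.
dropped_everything = toy.dropna()

# Restricting the check to the columns that matter keeps the complete row.
kept_complete = toy.dropna(subset=["Movie", "ProductionBudget", "BoxOffice"])

print(len(dropped_everything), kept_complete["Movie"].tolist())
```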


In [184]:
# Drop rows whose box office is "$0". A boolean mask is needed here:
# all_movie_data['BoxOffice'] is "$0" tests object identity, evaluates to False, and filters nothing.
all_movie_data = all_movie_data[all_movie_data['BoxOffice'] != "$0"]

In [185]:
all_movie_data

Unnamed: 0,Release Date,Movie,Genre,ProductionBudget,BoxOffice
5,"Mar 7, 2008","10,000 B.C.",Adventure,"$105,000,000","$94,784,201"
7,"Mar 11, 2016",10 Cloverfield Lane,Thriller/Suspense,"$5,000,000","$26,790,517"
8,"Nov 11, 2015",10 Days in a Madhouse,Drama,"$12,000,000","$14,616"
28,"Nov 22, 2000",102 Dalmatians,Comedy,"$85,000,000","$66,941,559"
31,"Aug 18, 2006",10th & Wolf,Drama,"$8,000,000","$54,702"
55,"Mar 27, 2009",12 Rounds,Action,"$20,000,000","$12,234,694"
60,"Nov 5, 2010",127 Hours,Drama,"$18,000,000","$18,335,230"
64,"Apr 23, 2004",13 Going On 30,Comedy,"$30,000,000","$57,139,723"


In [164]:
all_movie_data

Unnamed: 0,Release Date,Movie,Genre,ProductionBudget,BoxOffice
1292,"May 21, 2008",Auf der anderen Seite,Drama,,"$742,349"
1359,"Aug 14, 1998",The Avengers,Action,"$60,000,000","$23,385,416"
1292,"Apr 16, 1993",Boiling Point,Action,,"$10,058,318"
1359,"Oct 17, 2014",The Book of Life,Adventure,"$50,000,000","$50,151,543"
1407,"Dec 20, 1989",Born on the Fourth of July,Drama,"$14,000,000","$65,441,414"
1408,"Sep 28, 2001",Born Romantic,Comedy,,"$14,597"
1426,"Apr 28, 2000",Bossa Nova,Romantic Comedy,,"$1,816,792"
1439,"Jul 10, 2015",Boulevard,Drama,,"$126,150"
1440,"Nov 17, 2000",Bounce,Drama,"$35,000,000","$36,805,288"
1445,"Apr 16, 1993",Bound by Honor,,"$35,000,000","$4,496,583"


In [150]:
test_movie_data = all_movie_data.copy() #independent copy; plain assignment would only alias the same dataframe

In [122]:
all_movie_data.to_csv("all_movie_data.csv", encoding='utf-8') #used this to prove that record linkage was needed.
# Looked at the data in Excel.

In [151]:
all_movie_data["Movie"] = all_movie_data["Movie"].str.lower()
all_movie_data["Genre"] = all_movie_data["Genre"].str.lower()
all_movie_data["ProductionBudget"] = all_movie_data["ProductionBudget"].str.strip('$')
all_movie_data["ProductionBudget"] = all_movie_data["ProductionBudget"].str.replace(",", '')

In [None]:
all_movie_data.head(10)

In [153]:
all_movie_data["BoxOffice"] = all_movie_data["BoxOffice"].str.strip('$')
all_movie_data["BoxOffice"] = all_movie_data["BoxOffice"].str.replace(",", '')
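
Stripping the `$` and commas still leaves the values as strings. For the regressions later on they need to be numeric, so here is a minimal sketch of one way to do the whole conversion in a single step with `pd.to_numeric`. The `raw` series is a toy stand-in for the `ProductionBudget` or `BoxOffice` column.

```python
import pandas as pd

# Toy stand-in for a scraped currency column, including a missing value.
raw = pd.Series(["$105,000,000", "$5,000,000", "$0", None])

# Remove "$" and "," in one regex pass, then convert to numbers.
# errors="coerce" turns anything unparseable into NaN instead of raising.
cleaned = pd.to_numeric(raw.str.replace(r"[$,]", "", regex=True), errors="coerce")

print(cleaned.tolist())
print(cleaned.dtype)
```

Once the column is numeric, the `$0` rows can also be filtered with a plain comparison such as `cleaned > 0`.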

After that, just drop rows with null or empty values because this data is dirty:

In [155]:
all_movie_data

Unnamed: 0,Release Date,Movie,Genre,ProductionBudget,BoxOffice
1359,"Aug 14, 1998",the avengers,action,60000000,23385416
1359,"Oct 17, 2014",the book of life,adventure,50000000,50151543
1407,"Dec 20, 1989",born on the fourth of july,drama,14000000,65441414
1440,"Nov 17, 2000",bounce,drama,35000000,36805288
1456,"Aug 10, 2012",the bourne legacy,thriller/suspense,125000000,113203870
1457,"Jul 23, 2004",the bourne supremacy,thriller/suspense,85000000,176087450
1463,"Aug 13, 1999",bowfinger,comedy,55000000,66365290
1532,"Feb 12, 1993",braindead,horror,3000000,242623
1557,"Jun 22, 2012",brave,adventure,185000000,237282182
1570,"Dec 18, 1985",brazil,black comedy,15000000,9929135
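
With titles lowercased on both sides, the simplest form of record linkage is an exact join on the title. The sketch below is an assumption about how that join could look, not the matching actually used here. The two toy frames and their values are made up for illustration; only the column names `moviename`, `deathcount`, `Movie`, and `ProductionBudget` come from the data above.

```python
import pandas as pd

# Made-up rows standing in for the deaths data and the budget data.
deaths_toy = pd.DataFrame({"moviename": ["movie a", "movie b", "movie c"],
                           "deathcount": [3, 10, 0]})
budgets_toy = pd.DataFrame({"Movie": ["movie a", "movie c", "movie d"],
                            "ProductionBudget": [1000000, 2500000, 400000]})

# Left join keeps every deaths row; indicator=True adds a _merge column
# saying whether a budget row was found ("both") or not ("left_only").
linked = deaths_toy.merge(budgets_toy, how="left",
                          left_on="moviename", right_on="Movie",
                          indicator=True)

# Share of deaths rows that found a budget match.
match_rate = (linked["_merge"] == "both").mean()
print(linked[["moviename", "deathcount", "ProductionBudget", "_merge"]])
print(match_rate)
```

The `_merge` column gives a quick read on how many titles matched, which is the same check as the "maybe a third" estimate from the Excel attempt.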


### 5. Documentation of the data:

Budget and financial data of movies: http://www.the-numbers.com/movie/budgets/all