# I See Dead People: An Analysis

Bryan Bumgardner, Data Scientist  
February - March 2016

### Death is a fact of life.   
That being said, it's worth studying how it fits into our popular culture. During the Metis Data Science bootcamp (shameless plug), while working on another project, I discovered [bodycounters.com](http://www.bodycounters.com/). The dedicated volunteers at this site count and categorize the deaths in movies and, to date, have covered over 2,000 movies. Go grab some beers, count the dead people in your favorite movie, and contribute to the site. 

The data are sitting on their site for anyone to see, and that gave me an idea. I reached out to the site and had some correspondence with Dana, who was gracious enough to share a CSV of all the data. I then cross-referenced this data with information from another site, the-numbers.com, which shares basic budget and financial information about thousands of movies. I took this data, mashed it together, and asked some questions. 

Thanks again to the volunteers at bodycounters. Y'all are the greatest. Below is a quick data science passion project to analyze and find patterns in this great data you've collected. 

### Major questions:
1. Is the number of on-screen deaths going up in modern movies?
2. Does a movie's death count have an effect on that movie's domestic gross?
3. How does budget factor into a movie's death count?
4. Is there a relationship between the movie's MPAA rating and body count?
5. What about death count and genre?


### Table of Contents:
1. Reading and combining the data
2. Outputting various data frames
3. Visualizing
4. Conclusions
5. Documentation of the data

In [31]:
import pandas as pd 
import numpy as np
import csv
import os
from datetime import datetime
from future.builtins import next
import re
import logging
import optparse

from fuzzywuzzy import fuzz
from difflib import SequenceMatcher

import statsmodels.api as sm
import statsmodels.formula.api as smf

from sklearn.linear_model import LinearRegression 
from patsy import dmatrices

import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns

%matplotlib inline
sns.set(color_codes=True)

### Step 1: read in and clean the data

The deaths data is a CSV provided by the wonderful folks at BodyCounters. 

In [105]:
deaths = pd.read_csv("moviesmovies.csv", encoding="latin-1", index_col="movieskey") #might as well make use of their key

deaths["moviename"] = deaths["moviename"].str.rstrip("\t").str.lower() #cleaning some trailing tabs and standardizing for the matching
deaths["deathcomment"] = deaths["deathcomment"].str.rstrip("\t")
for char in [".", ",", "/", "'"]: #strip punctuation; regex=False so "." is a literal dot, not a wildcard
    deaths["moviename"] = deaths["moviename"].str.replace(char, '', regex=False)
deaths = deaths.sort_values(by="moviename") #it will match faster if we sort these in alphabetical order (sort_values replaces the removed DataFrame.sort)

# Saved the budget page from the website as a local HTML file, because the site crashes nonstop. Parsing the table from it:
budgetwebsite = pd.read_html("allmoviebudgets.html", header=0, encoding="utf-8")
all_budgets = pd.DataFrame(budgetwebsite[0])

deaths.to_csv("deathscleaned.csv", encoding='utf-8')
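
The punctuation stripping above could also be wrapped in one helper, so that both datasets get identical treatment before matching. This is a minimal sketch, not part of the original notebook: `normalize_title` is a hypothetical function name, and the sample titles are chosen only to exercise the punctuation rules.

```python
import re

def normalize_title(title):
    """Lowercase a title, drop punctuation, and collapse runs of whitespace."""
    title = title.strip().lower()
    # Same characters removed in the cleaning cells, handled in one pass
    title = re.sub(r"[.,/'()]", "", title)
    return re.sub(r"\s+", " ", title).strip()

print(normalize_title("Face/Off"))
print(normalize_title("  Ocean's Eleven\t"))
```

Something like `deaths["moviename"].map(normalize_title)` would then apply it to a whole column.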



Let's scrape all the movie data - EVER - because we are crazy. And also because I'm not interested in paying for access to their database. 

So what you see below is me scraping all the budgets and domestic box offices for every movie in the alphabetized database. Every letter has a chart with all the movies starting with that letter. I downloaded each webpage from the data source as an HTML webpage then pulled the charts from that, which is less likely to get you banned than hitting up their site with Selenium.

I then mashed the charts together into one GIANT dataframe, which will be helpful for record linkage. This also needs to be properly cleaned.

In [3]:
path = 'movie_pages/' #where the HTML files are
listing = os.listdir(path) #read the pages from the directory
frames = [] #one dataframe per page, waiting for all the goodies

for infile in listing: #iterate through the files in the directory 'path'
    temp_list = pd.read_html(os.path.join(path, infile), header=0) #THANK YOU BASED PANDAS. Cleans HTML away and pulls out just the tables.
    frames.append(temp_list[0]) #temp_list is a list of dataframes. Take the first (and only) one

all_movie_data = pd.concat(frames, ignore_index=True) #DataFrame.append was removed in pandas 2.0, so combine everything once at the end

all_movie_data = all_movie_data.drop(columns='Trailer', errors='ignore') #not necessary and screwed up the dropna

all_movie_data.rename(columns = {'DomesticBox Officeto Date': 'BoxOffice'}, inplace = True)
all_movie_data = all_movie_data.dropna() #drop unnecessary lines with incomplete data

In [4]:
label_movie_data = all_movie_data.copy() #Save a real copy just in case (plain assignment would only create a second name for the same frame). Might try and do something with the labels, and this has the un-cleaned labels. 

In [83]:
all_movie_data = all_movie_data[all_movie_data.BoxOffice != "$0"]
all_movie_data = all_movie_data[all_movie_data.ProductionBudget != "$0"]
# remove all rows with $0 (I don't know how those values got recorded. Negligence on the site, I think.)

all_movie_data.to_csv("all_movie_data.csv", encoding='utf-8') #used this to prove that record linkage was needed.
# Looked at the data in Excel, and could only quickly match 1/4th of the entries cleanly.

#Formatting all the data so it's actually useful and easier to sum, average, and categorize, along with effective timestamps. 

all_movie_data["Movie"] = all_movie_data["Movie"].str.lower()
for char in [".", ",", ")", "'", "("]: #regex=False treats ".", "(" and ")" as literal characters
    all_movie_data["Movie"] = all_movie_data["Movie"].str.replace(char, '', regex=False)
all_movie_data["Genre"] = all_movie_data["Genre"].str.lower()
all_movie_data["ProductionBudget"] = all_movie_data["ProductionBudget"].str.strip('$')
all_movie_data["ProductionBudget"] = all_movie_data["ProductionBudget"].str.replace(",", '', regex=False)
all_movie_data["BoxOffice"] = all_movie_data["BoxOffice"].str.strip('$')
all_movie_data["BoxOffice"] = all_movie_data["BoxOffice"].str.replace(",", '', regex=False)
all_movie_data["Release Date"] = pd.to_datetime(all_movie_data["Release Date"], format='%b %d, %Y')
all_movie_data = all_movie_data.sort_values(by="Movie") #sort_values replaces the removed DataFrame.sort
all_movie_data["DeathCount"] = np.nan #placeholder column, to be filled in during record linkage (test_movie_data is not defined yet at this point)
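
Stripping `$` and commas leaves the budget and box office columns as strings, which cannot be summed or averaged yet. A minimal sketch of the numeric conversion follows. The `money` frame is a hypothetical stand-in that reuses the figures from the sample rows printed later in this notebook, and the dollar-string format is an assumption about how the site presents them.

```python
import pandas as pd

# Hypothetical stand-in frame; figures reused from the notebook's sample rows,
# written as dollar strings (format assumed)
money = pd.DataFrame({
    "ProductionBudget": ["$30,000,000", "$45,000,000", "$61,000,000"],
    "BoxOffice": ["$57,139,723", "$36,883,539", "$75,612,460"],
})

for col in ["ProductionBudget", "BoxOffice"]:
    cleaned = money[col].str.replace(r"[$,]", "", regex=True)  # drop "$" and ","
    money[col] = pd.to_numeric(cleaned, errors="coerce")       # unparseable values become NaN

print(money.dtypes)
print(money["BoxOffice"].mean())
```

Using `errors="coerce"` is a design choice: bad values surface as `NaN` that a later `dropna()` can remove, rather than stopping the whole conversion.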



Ok so I'm really lazy and looked into how well these would match with Excel. Turns out the data is filthy and won't match well at all. There are extreme inconsistencies in movie names from both datasets. So I'll have to do fuzzy matching.

### Step 2: Combine the data using record linkage

So the Pandas dataframe-compatible record linkage package, recordlinkage, is turned off right now. I'm still trying to learn how to do this with something called FuzzyWuzzy, which is what you see below.

In the meantime, I used OpenRefine to connect the datasets outside of Python and then read the result back in here as a CSV. This was much faster but doesn't scale well, so a true Python script is needed long term.

I did the quick cleaning by combining OpenRefine's built-in matching with reconcile-csv, a reconciliation service for OpenRefine that does fuzzy string matching against a CSV file:
http://okfnlabs.org/reconcile-csv/
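
To give a feel for what fuzzy matching on titles means, here is a minimal sketch using only the standard library's `difflib` (swapped in for FuzzyWuzzy so the example has no extra dependencies). The helper names `similarity` and `token_sort_similarity` are hypothetical, and "the avengers" is a made-up example title rather than a row from either dataset.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a similarity score between 0 and 1 for two titles."""
    return SequenceMatcher(None, a, b).ratio()

def token_sort_similarity(a, b):
    """Compare titles after sorting their words, so word order does not matter."""
    return similarity(" ".join(sorted(a.split())), " ".join(sorted(b.split())))

print(similarity("2 guns", "2 guns"))                         # identical titles
print(similarity("16 blocks", "sixteen blocks"))              # same movie, different spelling
print(token_sort_similarity("avengers the", "the avengers"))  # same words, different order
```

Note that `ratio()` is on a 0 to 1 scale, whereas FuzzyWuzzy's scores run from 0 to 100, so thresholds are not interchangeable between the two.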

In [81]:
"""#preparing the target dataframe
merged_movie_data = pd.DataFrame(columns = ['Movie', 'ReleaseDate', 'Genre', 'ProductionBudget', 'BoxOffice', 'DeathCount'])
test_dataframe = pd.DataFrame(columns = ['name_ratio', 'name_token_sort_ratio', 'name_partial_ratio', 'match'])
test_death_count = deaths[:100]
test_movie_data = all_movie_data[:100] #gonna see how this works
"""

#This is all under construction!!

In [77]:
"""for row in test_movie_data.iterrows():
    idx, movie = row #iterrows() yields (index, row) pairs
    current_movie_name = movie["Movie"]

    for _, row2 in test_death_count.iterrows():
        second_movie_name = row2["moviename"]
        second_movie_death_count = row2["deathcount"]

        m = SequenceMatcher(None, current_movie_name, second_movie_name)
        if m.ratio() > 0.96: #ratio() returns a float between 0 and 1, not a percentage
            test_movie_data.loc[idx, "DeathCount"] = second_movie_death_count #assumes unique index labels
            break #stop at the first close match

def send_back_first_title(test_movie_data):
    for _, row in test_movie_data.iterrows():
        return row["Movie"]"""

# Skip past this! 
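
For reference, one way the linkage could eventually be done in Python is sketched below. This is not the method used for the final data (that was OpenRefine); it relies on `difflib.get_close_matches` from the standard library. The frames `budgets` and `deaths_small` and the helper `best_match` are hypothetical. The titles and numbers come from the sample rows shown later in this notebook, with one title deliberately misspelled to exercise the matching.

```python
from difflib import get_close_matches
import pandas as pd

# Tiny stand-ins for the two datasets; "2 gunz" is a deliberate misspelling
budgets = pd.DataFrame({"Movie": ["13 going on 30", "16 blocks", "2 gunz"],
                        "BoxOffice": [57139723, 36883539, 75612460]})
deaths_small = pd.DataFrame({"moviename": ["13 going on 30", "16 blocks", "2 guns"],
                             "deathcount": [0, 2, 30]})

def best_match(title, candidates, cutoff=0.8):
    """Return the closest candidate title, or None if nothing clears the cutoff."""
    matches = get_close_matches(title, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

candidates = deaths_small["moviename"].tolist()
budgets["matched_name"] = budgets["Movie"].map(lambda t: best_match(t, candidates))

# Join on the matched title; budget rows with no match drop out of the inner join
linked = budgets.merge(deaths_small, left_on="matched_name",
                       right_on="moviename", how="inner")
print(linked[["Movie", "moviename", "BoxOffice", "deathcount"]])
```

The cutoff of 0.8 is an arbitrary starting point. On real data it would need tuning against hand-checked matches, since a loose cutoff links sequels and remakes to the wrong film.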

In [122]:
final_data_frame = pd.read_csv("CombinedData.csv", encoding="latin-1")

In [123]:
final_data_frame = final_data_frame.dropna() # In the pairing process, we had some movies with missing values
final_data_frame["Release"] = pd.to_datetime(final_data_frame["Release"])
final_data_frame = final_data_frame.drop(columns="movieskey") #drop some extra stuff that was only useful during linkage. 
final_data_frame = final_data_frame.drop(columns="identifier") #newer pandas requires columns= or axis= as a keyword

In [126]:
final_data_frame.head(3) #glad somebody had 13 Going on 30 covered

| Unnamed: 0 | moviename | BoxOffice | ProductionBudget | Genre | Release | deathcount | deathcomment |
|---|---|---|---|---|---|---|---|
| 4 | 13 going on 30 | 57139723 | 30000000 | comedy | 2004-04-23 | 0 | 0 |
| 5 | 16 blocks | 36883539 | 45000000 | action | 2006-03-03 | 2 | 2 |
| 8 | 2 guns | 75612460 | 61000000 | action | 2013-08-02 | 30 | 25, 3 chickens and 2 deer |


So we have all our stuff and some extra fun comments that will be interesting during visualization. 
Total: 1128 movies and 7 features. 

### Step 3: Statistical Analysis
Now that we have the data, let's revisit the questions we asked and find the best models to answer them.

#### Is the number of on-screen deaths going up in modern movies?   
Technique to solve this: create an average of deaths per year in movies.

In [181]:
average_deaths = final_data_frame[["Release", "deathcount", "Genre"]] #taking Genre for visualization. I have a hunch...

In [185]:
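average_deaths["Year"] = average_deaths["Release"].map(lambda x: x.strftime('%Y'))
# pull just the year out of the datetime object and put it into its own column
# This triggers the SettingWithCopyWarning shown below, because average_deaths is a slice of final_data_frame.
# Taking the slice with .copy() should avoid it. Debug later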

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


In [None]:
# need to bin by year and create averages for the movies that go in the bins 
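
The binning step could look something like the sketch below. The `toy` frame holds hypothetical rows invented for illustration, not values from the real dataset. The column names mirror `final_data_frame`, while `subset` and `yearly` are made-up variable names.

```python
import pandas as pd

# Hypothetical rows for illustration only; not values from the real dataset
toy = pd.DataFrame({
    "Release": pd.to_datetime(["2004-04-23", "2004-09-10", "2006-03-03",
                               "2013-08-02", "2013-11-01"]),
    "deathcount": [0, 10, 2, 30, 50],
    "Genre": ["comedy", "action", "action", "action", "horror"],
})

# .copy() gives an independent frame, so adding a column does not trigger
# SettingWithCopyWarning
subset = toy[["Release", "deathcount"]].copy()
subset["Year"] = subset["Release"].dt.year  # integer year from the datetime column

# Mean deaths per movie and number of movies in each release year
yearly = subset.groupby("Year")["deathcount"].agg(["mean", "count"])
print(yearly)
```

Keeping the `count` column next to the mean matters here, because years with only one or two counted movies will produce noisy averages.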

#### Does a movie's death count have an effect on that movie's domestic gross?

In [None]:
deaths_vs_gross = final_data_frame[["BoxOffice", "deathcount", "Genre"]]
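
A first pass at this question might be a simple correlation plus a straight-line fit, sketched here on the three sample rows printed by `final_data_frame.head(3)`. Three points are far too few to conclude anything; the example only shows the mechanics. Using `numpy.polyfit` in place of a full statsmodels regression is an assumption made for brevity, and `sample`, `r`, `slope` and `intercept` are made-up names.

```python
import numpy as np
import pandas as pd

# The three sample rows shown earlier in this notebook
sample = pd.DataFrame({
    "moviename": ["13 going on 30", "16 blocks", "2 guns"],
    "BoxOffice": [57139723, 36883539, 75612460],
    "deathcount": [0, 2, 30],
})

# Pearson correlation between death count and domestic gross
r = sample["deathcount"].corr(sample["BoxOffice"])

# Least-squares line: BoxOffice is approximated by slope * deathcount + intercept
slope, intercept = np.polyfit(sample["deathcount"], sample["BoxOffice"], deg=1)

print(f"r = {r:.3f}, slope = {slope:,.0f} dollars per on-screen death")
```

Even on the full dataset, a positive slope would only show association, not that deaths cause higher grosses. Budget and genre are obvious confounders to control for.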

### 5. Documentation of the data:

Budget and financial data of movies: http://www.the-numbers.com/movie/budgets/all