# Final Report

## Project Statement

The US suffers from markedly higher rates of violent crime than other OECD countries, despite having the highest level of punishment (measured by incarceration rate) in the OECD. Recent well-publicized events in Las Vegas, New York City, and Sutherland Springs, TX, as well as the recent uptick in crime in some major metropolitan areas (e.g. Chicago), have pushed violent crime to the forefront of political debate. Despite the large and growing interest in understanding crime in the US, we are often presented with incomplete information about violent crime, selected to serve a particular agenda. The goal of this project is to propose, build, and evaluate a data-driven model for predicting the number of murders in metropolitan regions using publicly available crime and census data.

## Introduction

The importance of predicting crime rates in US metropolitan areas is almost self-evident. Most crime happens in large metropolitan areas, and accurate predictions of where crime is likely to occur can help optimize enforcement efforts and, hopefully, reduce crime. The problem is that the causes of crime are extremely hard to deduce, for three reasons: important potential drivers are unobservable (e.g. corruption of the police force), the direction of causation is unclear (e.g. does crime cause poverty, or is crime caused by poverty? Both are likely to be true), and experimentation is limited by ethical and legal constraints. What’s more, many of the potential drivers of crime are highly correlated: lower income is correlated with lower education levels, and both are correlated with crime. That is why, even though theoretical research has identified multiple possible causes of violent crime, proof of causality (as well as measurement of effect size) remains elusive. To circumvent these problems, this project focuses on predicting violent crime, and in particular murder rates, from observable demographic data.

Micro-level (person) demographic data was obtained from the US Census via IPUMS (Steven Ruggles, Katie Genadek, Ronald Goeken, Josiah Grover, and Matthew Sobek. Integrated Public Use Microdata Series: Version 7.0 [dataset]. Minneapolis: University of Minnesota, 2017. https://doi.org/10.18128/D010.V7.0.). Though summary metro-level data was available directly from the Census ACS series, it suffered from inconsistent definitions, codes, and variable names across the years in our sample (2006-2016). Rather than dive into the different codes and definitions used by the Census to produce the metro-level data in the ACS series, we downloaded the ACS micro-level data directly from IPUMS, a project dedicated to publishing Census micro-level data while maintaining consistent variable definitions and codes across years. We then used the micro-level data to produce metropolitan-level summary information for each year. To our surprise, doing so not only reduced measurement error by holding the definitions of the summary variables constant, but also provided us with more metro-year data points than the Census ACS summary data, even though the IPUMS data only covers 2006-2015 (2630 vs. 814 metro-years). We believe that giving up one year of observations in exchange for roughly tripling our sample is a worthwhile trade.

Among violent crimes, we focus on the murder rate because it is the most reliably reported statistic: while different groups may report robberies and assaults at different rates, it is hard to hide a body. Predicting murder rates can therefore reveal a good deal about the underlying socioeconomic and geographic factors associated with murder. Insert description of FBI Data
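
The step of collapsing person-level records into metro-year summaries can be sketched as follows. This is a minimal illustration, not the project's actual code: the column names (`metro`, `weight`, `unemployed`, `noncitizen`) are hypothetical stand-ins for the IPUMS variables, the toy values are made up, and the use of person weights is an assumption about how the shares were computed.

```python
import pandas as pd

# Hypothetical person-level extract. Column names are illustrative
# stand-ins, not the actual IPUMS variable names, and values are made up.
people = pd.DataFrame({
    "metro":      ["A", "A", "A", "B", "B", "B"],
    "year":       [2006, 2006, 2007, 2006, 2006, 2006],
    "weight":     [100, 300, 200, 50, 50, 100],  # assumed person sampling weight
    "unemployed": [1, 0, 1, 0, 0, 1],            # 1 = unemployed
    "noncitizen": [0, 0, 1, 1, 0, 0],            # 1 = non-citizen
})


def weighted_pct(group: pd.DataFrame, flag: str) -> float:
    """Weighted percentage of people in `group` whose `flag` equals 1."""
    return 100.0 * (group[flag] * group["weight"]).sum() / group["weight"].sum()


def summarize_metro_years(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse person-level rows into one row per (metro, year)."""
    rows = []
    for (metro, year), group in df.groupby(["metro", "year"]):
        rows.append({
            "metro": metro,
            "year": year,
            "population": group["weight"].sum(),
            "pct_unemployed": weighted_pct(group, "unemployed"),
            "pct_noncitizen": weighted_pct(group, "noncitizen"),
        })
    return pd.DataFrame(rows)


metro_years = summarize_metro_years(people)
print(metro_years)
```

Because every summary variable is computed by the same function in every year, its definition cannot drift between years, which is the property that motivated working from the micro-level data.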

The IPUMS data and the FBI data were combined with a pandas merge keyed on each metro area's main city, state, and year. Although the FBI data set was larger than the IPUMS data set, over 2400 of the roughly 2600 IPUMS entries found a match, a success rate of over 90%.
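
A merge of this kind, together with a check of the match rate, might look like the sketch below. The two toy tables, their column names, and the place names are invented for illustration; only the use of a pandas merge on city, state, and year comes from the description above.

```python
import pandas as pd

KEYS = ["city", "state", "year"]

# Toy stand-ins for the two sources; names and numbers are made up.
ipums = pd.DataFrame({
    "city": ["Alphaville", "Betatown", "Gammaburg"],
    "state": ["AA", "BB", "CC"],
    "year": [2010, 2010, 2010],
    "pct_poor": [21.0, 24.0, 19.0],
    "population": [500_000, 250_000, 100_000],
})
fbi = pd.DataFrame({
    "city": ["Alphaville", "Betatown", "Deltaport"],
    "state": ["AA", "BB", "DD"],
    "year": [2010, 2010, 2010],
    "murders": [50, 10, 7],
})

# A left merge keeps every IPUMS metro-year; `indicator=True` adds a
# `_merge` column recording whether a matching FBI row was found, and
# `validate` raises if the keys are not unique on either side.
merged = ipums.merge(fbi, on=KEYS, how="left", indicator=True,
                     validate="one_to_one")

match_rate = (merged["_merge"] == "both").mean()
print(f"matched {match_rate:.0%} of IPUMS metro-years")

# Keep only the matched rows and compute murders per 100,000 residents.
matched = merged.loc[merged["_merge"] == "both"].drop(columns="_merge").copy()
matched["murder_rate"] = matched["murders"] / matched["population"] * 100_000
```

Inspecting the rows where `_merge` equals `"left_only"` is a quick way to see which metro-years failed to match, for example because of differently spelled city names.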

Preliminary EDA (presented in the EDA tab above) suggested that murder rates in US metropolitan areas are positively correlated with city size, unemployment, poverty, income inequality, and the percentage of single-parent households (a common predictor in the crime literature). They are negatively correlated with median household income, percentage of homeownership, and education (measured both as percentage of high school graduates and as percentage of 4-year college graduates). EDA also suggested both regional and metro-level differences in murder rates. The average murder rate in the South-East is almost twice the average murder rate in the North-East, and while the murder rate in some metropolitan areas is extremely low, it exceeds 20 homicides per 100,000 residents in New Orleans. Finally, following the recent political debate over the relationship (or lack thereof) between crime and immigration, we decided to incorporate proxies for immigration in our data (percentage of non-citizens, and percentage of residents whose primary language is not English). Preliminary EDA suggests a weak positive relationship between these proxies and murder rates. We believe this relationship is driven by the fact that larger metropolitan areas (which have historically had higher murder rates) also attract more immigrants.
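
As a rough illustration of this kind of check, the sketch below ranks predictors by their correlation with the murder rate. The table is a made-up toy example, and the choice of Pearson correlation (the pandas default) is an assumption; the EDA itself may have used other measures or plots.

```python
import pandas as pd

# Made-up metro-level values, for illustration only.
metro = pd.DataFrame({
    "murder_rate":   [2.0, 4.0, 6.0, 8.0, 10.0],
    "pct_poor":      [10.0, 12.0, 15.0, 17.0, 21.0],
    "median_income": [70.0, 62.0, 58.0, 50.0, 41.0],
    "pct_homeowner": [68.0, 66.0, 61.0, 60.0, 55.0],
})

# Pearson correlation of every predictor with the murder rate,
# sorted from most negative to most positive.
corr_with_murder = (
    metro.corr()["murder_rate"]
    .drop("murder_rate")
    .sort_values()
)
print(corr_with_murder)
```

Pairwise correlations like these say nothing about which variable drives which, and they can be confounded by a third variable such as city size, which is the concern raised above for the immigration proxies.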

## Literature Review

Insights on which demographic variables are used in the literature to predict crime were gathered from Glaeser and Sacerdote, Why Is There More Crime in Cities?, Journal of Political Economy, 1999, vol. 107, no. 6, pt. 2, and Levitt, Using Electoral Cycles in Police Hiring to Estimate the Effect of Police on Crime, The American Economic Review, Volume 87, Issue 3 (June 1997), 270-290.

Insights on the relationship between inequality and crime were gleaned from Kelly, Inequality and Crime, Review of Economics and Statistics, Volume 82, Issue 4, November 2000, p. 530-539. Finally, information about the unique standing of the US at the top of the OECD crime and incarceration rates was gathered from Spamann, The US Crime Puzzle: A Comparative Large-N Perspective on US Crime & Punishment (Working paper).

## Modeling Approach and Project Trajectory

We began with a simple linear regression that tries to predict metro-area murder rates (per 100k residents). After reviewing the literature and the data available from the Census and IPUMS, we settled on the following list of predictors, all of which are prevalent in the literature: %White, %Black, %Hispanic, %Ages 15-25, %Unemployed, %Poor, GINI index for household income, Median household income, %High school educated, %4-year college educated, population, %Non-citizens, %Immigrant (via the proxy of using English as a second language), and %Single-parent households (defined as households with at least one child and no more than one adult), as well as year, region, and metro area dummies. We then split the data into a train and a test set (70%/30%), fit the linear regression on the train set, and tested its predictive power (via R^2) on the test set, a procedure we repeated with every one of the models below.
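
The split-fit-score procedure can be sketched with scikit-learn as below. The data is synthetic and the column names and coefficients are invented; only the 70%/30% split, the dummy variables, the linear regression, and the use of R^2 follow the description above. The helper `fit_and_score` is a hypothetical name introduced for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 300

# Synthetic metro-year table; columns and coefficients are invented.
data = pd.DataFrame({
    "pct_poor": rng.uniform(5, 30, n),
    "pct_unemployed": rng.uniform(2, 15, n),
    "region": rng.choice(["NE", "SE", "MW", "W"], n),
    "year": rng.choice([2006, 2007, 2008], n),
})
region_effect = data["region"].map({"NE": 0.0, "SE": 3.0, "MW": 1.0, "W": 2.0})
data["murder_rate"] = (
    2.0 + 0.3 * data["pct_poor"] + region_effect + rng.normal(0, 1, n)
)

# Encode region and year as dummies, dropping one level of each to
# avoid perfect collinearity with the intercept.
X = pd.get_dummies(
    data.drop(columns="murder_rate"),
    columns=["region", "year"],
    drop_first=True,
    dtype=float,
)
y = data["murder_rate"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)


def fit_and_score(model, X_train, X_test, y_train, y_test):
    """Fit on the train set and return (train R^2, test R^2)."""
    model.fit(X_train, y_train)
    return (
        r2_score(y_train, model.predict(X_train)),
        r2_score(y_test, model.predict(X_test)),
    )


train_r2, test_r2 = fit_and_score(
    LinearRegression(), X_train, X_test, y_train, y_test
)
print(f"train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")
```

Reusing the same split and the same scoring helper for every model keeps the comparisons between models like-for-like.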

The baseline model reached an R^2 of 0.705 on the test set and 0.783 on the training set. To improve upon the baseline model, we first tried alternative regression approaches, Lasso and Ridge, using 5-fold cross validation to select the penalty parameter. To our surprise, Lasso regression performed terribly, producing an R^2 of about 0.05 on both the training and test sets. Ridge regression did much better, but was still slightly less effective than the baseline model (R^2 of 0.703 on the test set and 0.765 on the training set).
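
A minimal sketch of penalized regression with a cross-validated penalty is shown below, on synthetic data. Two choices here are assumptions rather than the project's documented setup: the predictors are standardized before fitting (penalized regressions are sensitive to feature scale), and the Ridge penalty grid is arbitrary.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic predictors on very different scales (invented for illustration).
X = rng.normal(size=(300, 8)) * np.array([1, 10, 100, 1, 1, 1, 1, 1])
true_coefs = np.array([2.0, 0.3, 0.01, 0.0, 0.0, 1.0, 0.0, -1.5])
y = X @ true_coefs + rng.normal(scale=1.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Both estimators choose their penalty (alpha) by 5-fold cross validation.
# Standardizing first is an assumption of this sketch.
models = {
    "lasso": make_pipeline(StandardScaler(), LassoCV(cv=5)),
    "ridge": make_pipeline(
        StandardScaler(), RidgeCV(alphas=np.logspace(-3, 3, 13), cv=5)
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = {
        "alpha": model[-1].alpha_,  # penalty selected by cross validation
        "train_r2": r2_score(y_train, model.predict(X_train)),
        "test_r2": r2_score(y_test, model.predict(X_test)),
    }
    print(name, scores[name])
```

When a penalized model scores far below an unpenalized one, checking the selected `alpha_` and the scale of the inputs is a reasonable first diagnostic.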

We then proceeded to explore Decision Tree models. A simple Decision Tree underperformed both the Linear and Ridge regressions (R^2 of 0.504 on the test set and 0.743 on the training set). A Random Forest estimator, with the number of trees and share of features (i.e. n_estimators and max_features) chosen by 5-fold cross validation, performed far better, outperforming the regression models (R^2 of 0.719 on the test set and 0.955 on the training set), despite some evidence of overfitting (the extremely high R^2 on the training set and the gap between that and the test set R^2). Finally, we tested a Boosting estimator, once again using 5-fold cross validation to tune the learning rate, number of features considered, and max depth, and, following the suggestions in lab, setting a high number of estimators (3000). We did not tune stopping conditions based on leaf size, as they overlap with max_depth. The Boosting estimator improved slightly upon the Random Forest, posting an R^2 of 0.722 on the test set and 0.965 on the training set. Its train-test gap (0.243) is about the same as the Random Forest's (0.236), so it shows a similar degree of overfitting.
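
The tuning of the two tree ensembles might be set up roughly as follows. This is a sketch on synthetic data: the parameter grids are small, arbitrary examples, and the Boosting estimator uses 100 trees instead of the 3000 reported above so that the example runs quickly.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)

# Synthetic data with a nonlinear target (invented for illustration).
X = rng.normal(size=(300, 5))
y = (3 * X[:, 0] + 2 * (X[:, 1] > 0) + X[:, 2] * X[:, 3]
     + rng.normal(scale=0.5, size=300))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

searches = {
    # Random Forest: tune the number of trees and the share of features.
    "random_forest": GridSearchCV(
        RandomForestRegressor(random_state=0),
        param_grid={"n_estimators": [50, 100], "max_features": [0.5, 1.0]},
        cv=5,
        scoring="r2",
    ),
    # Boosting: fix the number of trees; tune learning rate, depth, features.
    "boosting": GridSearchCV(
        GradientBoostingRegressor(n_estimators=100, random_state=0),
        param_grid={
            "learning_rate": [0.05, 0.1],
            "max_depth": [2, 3],
            "max_features": [0.5, 1.0],
        },
        cv=5,
        scoring="r2",
    ),
}

results = {}
for name, search in searches.items():
    search.fit(X_train, y_train)  # refits the best setting on the full train set
    results[name] = {
        "best_params": search.best_params_,
        "train_r2": search.score(X_train, y_train),
        "test_r2": search.score(X_test, y_test),
    }
    print(name, results[name])
```

Comparing `train_r2` with `test_r2` for each model gives the same overfitting check used in the text: the larger the gap, the more the model has fit noise in the training data.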

Finally, we tried to pool the predictions of the different models using several Stacking estimators. To do so, we generated a dataset of the predictions produced by each of the models above. We then trained Lasso, Ridge, Random Forest, and Boosting estimators on the train set predictions, using, as before, 5-fold cross validation to tune the parameters, and tested them on the test set predictions. All four Stacking estimators performed roughly the same, producing an R^2 of 0.713-0.715 on the test set and 0.986-0.995 on the training set. The extremely high train set R^2 suggests they suffered from overfitting, which may explain why they failed to outperform the simple Boosting estimator (R^2 of 0.722 on the test set).
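
The stacking procedure described above can be sketched as follows, on synthetic data with two base models and an untuned Ridge meta-model (the project tuned its meta-models by cross validation). The helper `build_meta_features` is a hypothetical name. Its `out_of_fold` option is an addition not described in the text: it builds the training meta-features from cross-validated predictions, a common way to keep the meta-model from learning from in-sample base predictions.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_predict, train_test_split


def build_meta_features(base_models, X_train, y_train, X_test, out_of_fold=False):
    """Return (train, test) matrices with one column of predictions per model."""
    train_cols, test_cols = [], []
    for model in base_models:
        fitted = clone(model).fit(X_train, y_train)
        if out_of_fold:
            # Each training row is predicted by a model that never saw it.
            train_pred = cross_val_predict(clone(model), X_train, y_train, cv=5)
        else:
            # In-sample predictions, as in the procedure described above.
            train_pred = fitted.predict(X_train)
        train_cols.append(train_pred)
        test_cols.append(fitted.predict(X_test))
    return np.column_stack(train_cols), np.column_stack(test_cols)


rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (3 * X[:, 0] + 2 * (X[:, 1] > 0) + X[:, 2] * X[:, 3]
     + rng.normal(scale=0.5, size=300))
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

base_models = [
    LinearRegression(),
    RandomForestRegressor(n_estimators=100, random_state=0),
]

stack_scores = {}
for out_of_fold in (False, True):
    Z_train, Z_test = build_meta_features(
        base_models, X_train, y_train, X_test, out_of_fold=out_of_fold
    )
    meta = Ridge(alpha=1.0).fit(Z_train, y_train)
    stack_scores[out_of_fold] = {
        "train_r2": r2_score(y_train, meta.predict(Z_train)),
        "test_r2": r2_score(y_test, meta.predict(Z_test)),
    }
    print(f"out_of_fold={out_of_fold}: {stack_scores[out_of_fold]}")
```

With in-sample meta-features, a flexible base model such as a Random Forest predicts its own training rows almost perfectly, so the meta-model's training R^2 is inflated, which is consistent with the very high train set scores reported above.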

## Results and Conclusions

We found that the estimator best suited for predicting yearly metro area murder rates from demographic data is the Boosting estimator with 3000 estimators and 5-fold cross validated parameters (an R^2 of 0.722 on the test set). A nearly as accurate but less computationally demanding estimator is a Random Forest with 5-fold cross validated parameters (an R^2 of 0.719 on the test set). Both estimators explain a large share of the variation in yearly metro murder rates and can be used to optimize the allocation of enforcement resources.

That being said, though these estimators provide the best predictions of yearly metro area murder rates, they are not easily interpretable and thus cannot give us insight into the causes of crime and how those might be addressed. Future work should take advantage of natural experiments (e.g. Levitt, Using Electoral Cycles in Police Hiring to Estimate the Effect of Police on Crime, The American Economic Review, Volume 87, Issue 3 (June 1997), 270-290) to detect causal patterns and inform policy.
