# General Comments about the data set
This table provides the volume of violent crime (murder and nonnegligent manslaughter, rape, robbery, and aggravated assault) and property crime (burglary, larceny-theft, and motor vehicle theft) as reported by city and town law enforcement agencies (listed alphabetically by state) that contributed data to the UCR Program. (Note:  Arson is not included in the property crime total in this table; however, if complete arson data were provided, they will appear in the arson column.)

Population estimation
For the 2013 population estimates used in this table, the FBI computed individual rates of growth from one year to the next for every city/town and county using 2010 decennial population counts and 2011 through 2012 population estimates from the U.S. Census Bureau. Each agency’s rates of growth were averaged; that average was then applied and added to its 2012 Census population estimate to derive the agency’s 2013 population estimate.
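
As a worked illustration of the estimation procedure described above, here is a minimal sketch that averages the two year-over-year growth rates and applies the average to the 2012 estimate. The function name `estimate_2013_population` and the population figures are made up for illustration; they are not FBI or Census values, and the FBI's exact computation may differ in detail.

```python
def estimate_2013_population(pop_2010, pop_2011, pop_2012):
    """Apply the average yearly growth rate to the 2012 estimate."""
    # Year-over-year growth rates (2010 -> 2011 and 2011 -> 2012).
    growth_rates = [
        (pop_2011 - pop_2010) / pop_2010,
        (pop_2012 - pop_2011) / pop_2011,
    ]
    average_growth = sum(growth_rates) / len(growth_rates)
    # "Applied and added": the 2012 estimate plus the average growth on it.
    return pop_2012 * (1 + average_growth)


# Hypothetical town: 10,000 -> 10,100 -> 10,302 residents.
# Growth rates are 1% and 2%, so the average is 1.5%.
example_estimate = estimate_2013_population(10_000, 10_100, 10_302)
print(round(example_estimate, 2))
```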

 

# Instructions

We'll use the property crime model you've been working on, based on the FBI:UCR data. Since your model formulation to date has used the entire New York State 2013 dataset, you'll need to validate it using some of the other crime datasets available at the FBI:UCR website. Options include other states' crime rates in 2013, crime rates in New York State in other years, or a combination of these.
__I chose to apply the model that I trained on NY to the CA data and see whether it is a good fit for the new data set.__

Iterate
Based on the results of your validation test, create a revised model, and then test both old and new models on a new holdout or set of folds.

Include your model(s) and a brief writeup of the reasoning behind the validation method you chose and the changes you made to submit and review with your mentor.

In [106]:
# import all the modules I need
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
pd.set_option('mode.chained_assignment', None)
%matplotlib inline
import seaborn as sns
import math


# Load the data table I downloaded from UCR
df = pd.read_excel(
    "table_8_offenses_known_to_law_enforcement_new_york_by_city_2013.xls", sheet_name='13tbl8ny')

Need to clean by:
- fixing headers (the real header is in row 3) and deleting rows 0-2
- dropping the Rape\n(revised\ndefinition)1 column (loaded as `Unnamed: 4`)
- filling in NaN with 0 (in the Arson3 column)


Need to create features by:
- changing categorical variables to 0 and 1

In [107]:
df.drop([0, 1, 2], inplace=True)

In [108]:
df.drop('Unnamed: 4', axis=1, inplace=True)

In [109]:
df.columns = df.loc[3].values

In [110]:
df.drop([3, 352, 353, 354], inplace=True)

In [111]:
df['Arson3'] = df['Arson3'].fillna(0)
df

Unnamed: 0,City,Population,Violent crime,Murder and nonnegligent manslaughter,Rape (legacy definition)2,Robbery,Aggravated assault,Property crime,Burglary,Larceny- theft,Motor vehicle theft,Arson3
4,Adams Village,1861,0,0,0,0,0,12,2,10,0,0
5,Addison Town and Village,2577,3,0,0,0,3,24,3,20,1,0
6,Akron Village,2846,3,0,0,0,3,16,1,15,0,0
7,Albany,97956,791,8,30,227,526,4090,705,3243,142,0
8,Albion Village,6388,23,0,3,4,16,223,53,165,5,0
9,Alfred Village,4089,5,0,0,3,2,46,10,36,0,0
10,Allegany Village,1781,3,0,0,0,3,10,0,10,0,0
11,Amherst Town,118296,107,1,7,31,68,2118,204,1882,32,3
12,Amityville Village,9519,9,0,2,4,3,210,16,188,6,1
13,Amsterdam,18182,30,0,0,12,18,405,99,291,15,0


In [112]:
df['Population'] = df['Population'].astype(int)
df['Population2'] = df['Population']**2

In [113]:
df['Robbery'] = df['Robbery'].astype(int)
df['Robbery'] = df['Robbery'].apply(lambda x: (x > 0)*1)

In [114]:
df['Murder and\nnonnegligent\nmanslaughter'] = df['Murder and\nnonnegligent\nmanslaughter'].astype(
    int)
df['Murder and\nnonnegligent\nmanslaughter'] = df['Murder and\nnonnegligent\nmanslaughter'].apply(
    lambda x: (x > 0)*1)

In [115]:
from sklearn import linear_model
%matplotlib inline
pd.options.display.float_format = '{:.3f}'.format

# Suppress annoying harmless error.
import warnings
warnings.filterwarnings(action="ignore", module="scipy",
                        message="^internal gelsd")

In [116]:
regr = linear_model.LinearRegression()
regr.fit(df[['Population', 'Population2',
             'Murder and\nnonnegligent\nmanslaughter', 'Robbery']], df['Property\ncrime'])

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [117]:
regr.coef_

array([ 1.69008467e-02, -1.73980871e-07,  6.15497194e+02,  1.48320688e+02])

In [118]:
regr.score(df[['Population', 'Population2',
               'Murder and\nnonnegligent\nmanslaughter', 'Robbery']], df['Property\ncrime'])

0.9935839631191373

Predictions: first, estimate how many property crimes there would be in Davis using the specification above.

Working with Drew, I saw how important it is for population to be in the regression specification. When we took out the linear population term, the regr score dropped from 0.994 to 0.052. Adding ten columns of random noise then raised the in-sample score to 0.073, which shows how extra features can inflate the score by overfitting the data.

In [119]:
# Predict the number of property crimes in Davis
# (features: Population, Population2, Murder flag, Robbery flag):
regr.predict([[68111, 68111**2, 0, 0]])

array([318.62882533])

In [120]:
regr = linear_model.LinearRegression()
regr.fit(df[['Population2', 'Murder and\nnonnegligent\nmanslaughter',
             'Robbery']], df['Property\ncrime'])
regr.score(df[['Population2', 'Murder and\nnonnegligent\nmanslaughter',
               'Robbery']], df['Property\ncrime'])

0.05208377602595937

In [121]:
regr = linear_model.LinearRegression()
X = df[['Population2', 'Murder and\nnonnegligent\nmanslaughter', 'Robbery']].values
N = np.random.randn(X.shape[0], 10)
X = np.concatenate((X, N), axis=1)
regr.fit(X, df['Property\ncrime'])

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [122]:
regr.score(X, df['Property\ncrime'])

0.07300583209075984
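
To make the overfitting point concrete, here is a minimal sketch on synthetic data (not the UCR table). It fits ordinary least squares with and without ten pure-noise columns and compares R^2 on the rows used for fitting against R^2 on held-out rows. All names (`r_squared`, `X_base`, `X_noisy`) and the data itself are assumptions made for this illustration. In-sample R^2 cannot go down when columns are added to a least-squares fit; held-out R^2 has no such guarantee and often gets worse.

```python
import numpy as np


def r_squared(X, y, coef):
    """R^2 of the linear prediction X @ coef against y."""
    residual = y - X @ coef
    return 1 - residual @ residual / np.sum((y - y.mean()) ** 2)


rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 3 * x + rng.normal(size=n)  # one real signal plus noise

# Design matrices: intercept + real feature, then the same plus 10 noise columns.
X_base = np.column_stack([np.ones(n), x])
X_noisy = np.column_stack([X_base, rng.normal(size=(n, 10))])

train, test = slice(0, 100), slice(100, None)

# Fit both models on the training half only.
coef_base, *_ = np.linalg.lstsq(X_base[train], y[train], rcond=None)
coef_noisy, *_ = np.linalg.lstsq(X_noisy[train], y[train], rcond=None)

r2_train_base = r_squared(X_base[train], y[train], coef_base)
r2_train_noisy = r_squared(X_noisy[train], y[train], coef_noisy)
r2_test_base = r_squared(X_base[test], y[test], coef_base)
r2_test_noisy = r_squared(X_noisy[test], y[test], coef_noisy)

print("train R^2:", r2_train_base, "->", r2_train_noisy)
print("test  R^2:", r2_test_base, "->", r2_test_noisy)
```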

Now, let's use the statsmodels regression code that I learned in this lesson to refit the model and check the params, p-values, and R^2 value for NY.

In [123]:
import statsmodels.formula.api as smf
from statsmodels.sandbox.regression.predstd import wls_prediction_std

In [125]:
# Rename columns and convert them from object dtype to integers
df['Property\ncrime'] = df['Property\ncrime'].astype(int)
df['Population2'] = df['Population2'].astype(int)
df['Property_crime'] = df['Property\ncrime']
df['Murder'] = df['Murder and\nnonnegligent\nmanslaughter']

In [93]:
# Re-fit the model here.
linear_formula = 'Property_crime ~ Population + Murder'

# Fit the model to our data using the formula.
lm = smf.ols(formula=linear_formula, data=df).fit()
print(lm.params)
print(lm.pvalues)
print(lm.rsquared)
print(lm.conf_int())

Intercept     24.873
Population     0.017
Murder       654.243
dtype: float64
Intercept    0.491
Population   0.000
Murder       0.000
dtype: float64
0.9934062182812436
                 0       1
Intercept  -46.097  95.844
Population   0.017   0.017
Murder     462.613 845.873


In [71]:
# Load the data table I downloaded for CA
df_CA = pd.read_excel(
    'table_8_offenses_known_to_law_enforcement_california_by_city_2013.xls', sheet_name='13tbl8ca')

In [72]:
df_CA.drop([0, 1, 2], inplace=True)

In [73]:
df_CA.drop('Unnamed: 4', axis=1, inplace=True)

In [74]:
df_CA.columns = df_CA.loc[3].values
df_CA

Unnamed: 0,City,Population,Violent crime,Murder and nonnegligent manslaughter,Rape (legacy definition)2,Robbery,Aggravated assault,Property crime,Burglary,Larceny- theft,Motor vehicle theft,Arson
3,City,Population,Violent\ncrime,Murder and\nnonnegligent\nmanslaughter,Rape\n(legacy\ndefinition)2,Robbery,Aggravated\nassault,Property\ncrime,Burglary,Larceny-\ntheft,Motor\nvehicle\ntheft,Arson
4,Adelanto,31165,198,2,15,52,129,886,381,372,133,17
5,Agoura Hills,20762,19,0,2,10,7,306,109,185,12,7
6,Alameda,76206,158,0,10,85,63,1902,287,1285,330,17
7,Albany,19104,29,0,1,24,4,557,94,388,75,7
8,Alhambra,84710,163,1,9,81,72,1774,344,1196,234,7
9,Aliso Viejo,50005,25,0,2,4,19,315,71,224,20,3
10,Alturas,2681,28,1,4,2,21,71,23,46,2,0
11,American Canyon,20068,54,0,4,31,19,510,91,387,32,2
12,Anaheim,345320,1130,11,82,437,600,9611,1412,6518,1681,35


In [75]:
df_CA.drop([3, 466, 467], inplace=True)

In [77]:
df_CA['Murder'] = df_CA['Murder and\nnonnegligent\nmanslaughter']
df_CA['Property_crime'] = df_CA['Property\ncrime']

In [78]:
df_CA['Murder'] = df_CA['Murder'].astype(int)
df_CA['Murder'] = df_CA['Murder'].apply(lambda x: (x > 0)*1)

In [85]:
df_CA['Population'] = df_CA['Population'].astype(int)
df_CA['Property_crime'] = df_CA['Property_crime'].astype(int)

In [94]:
# Re-fit the model here to CA 
linear_formula = 'Property_crime ~ Population + Murder'

# Fit the model to our data using the formula.
lm_CA = smf.ols(formula=linear_formula, data=df_CA).fit()
print(lm_CA.params)
# print(lm.pvalues)
# print(lm.rsquared)
# print(lm.conf_int())

Intercept    -57.299
Population     0.024
Murder       589.797
dtype: float64


In [102]:
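# Score the NY-trained model (lm) on the CA data.
# predict() only needs the right-hand-side variables of the formula.
y_pred = lm.predict(df_CA[['Population', 'Murder']])
y = df_CA['Property_crime']
SS_Residual = sum((y - y_pred) ** 2)
SS_Total = sum((y - np.mean(y)) ** 2)
r_squared = 1 - (float(SS_Residual)) / SS_Total
print(r_squared)
# These p-values come from the NY fit; they are not recomputed on CA.
print(lm.pvalues)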

0.8032009799627111
Intercept    0.491
Population   0.000
Murder       0.000
dtype: float64


We took the linear model that was trained on the 2013 NY data set and scored it on the 2013 CA data set. This gave an R^2 value of 0.803.
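
The manual R^2 calculation used for this check can be wrapped in a small helper so the same code scores either model on either data set. This is a minimal sketch; the name `out_of_sample_r2` is my own and is not part of statsmodels or scikit-learn. Note that when a model is scored on data it was not trained on, this quantity can be negative if the predictions are worse than simply predicting the mean.

```python
import numpy as np


def out_of_sample_r2(y_true, y_pred):
    """R^2 = 1 - SS_residual / SS_total, computed on the given data."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_residual = np.sum((y_true - y_pred) ** 2)
    ss_total = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_residual / ss_total


# Toy checks with made-up numbers:
perfect = out_of_sample_r2([1, 2, 3], [1, 2, 3])    # exact predictions
mean_only = out_of_sample_r2([1, 2, 3], [2, 2, 2])  # always predict the mean
reversed_ = out_of_sample_r2([1, 2, 3], [3, 2, 1])  # worse than the mean
print(perfect, mean_only, reversed_)
```

With the notebook's objects, the NY-on-CA score would be something like `out_of_sample_r2(df_CA['Property_crime'], lm.predict(df_CA))`.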

Next, let's validate the model in the other direction by checking how the CA-trained linear model performs on the NY data set.

For my model to be trustworthy, I expect the R^2 values to be the same or very similar.

In [103]:
# Score the CA-trained model (lm_CA) on the NY data.
y_pred = lm_CA.predict(df[['Population', 'Murder']])
y = df['Property_crime']
SS_Residual = sum((y - y_pred) ** 2)
SS_Total = sum((y - np.mean(y)) ** 2)
r_squared = 1 - (float(SS_Residual)) / SS_Total
print(r_squared)
# These p-values come from the CA fit; they are not recomputed on NY.
print(lm_CA.pvalues)

0.7998327830081634
Intercept    0.617
Population   0.000
Murder       0.001
dtype: float64


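Yay! The R^2 value is 0.799, very similar to the 0.803 R^2 value from before, so the model transfers about equally well in both directions and I conclude that it is trustworthy. Both values are below the in-sample R^2 of 0.993 on NY, which is expected when scoring on data the model was not trained on.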

Future steps:
- apply the NY 2013 linear model I made to another year, say 2015, for NY
- I expect that the R^2 values would be closer or improved
- do a fold analysis with 5 groups/folds to see whether the scores across folds are similar, which will tell me whether or not I'm overfitting.
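
A minimal sketch of what the 5-fold analysis could look like is below. It uses plain NumPy least squares on synthetic stand-ins for `Population`, `Murder`, and `Property_crime`, so the helper `k_fold_r2`, the data, and the coefficients are all assumptions for illustration and not results from the UCR tables. On the real data, the same loop could refit `smf.ols` on each training split instead.

```python
import numpy as np


def k_fold_r2(X, y, n_folds=5, seed=0):
    """Fit OLS on n_folds - 1 folds and return R^2 on each held-out fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        # Add an intercept column to both design matrices.
        design_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        design_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        coef, *_ = np.linalg.lstsq(design_train, y[train_idx], rcond=None)
        residual = y[test_idx] - design_test @ coef
        ss_total = np.sum((y[test_idx] - y[test_idx].mean()) ** 2)
        scores.append(1 - residual @ residual / ss_total)
    return np.array(scores)


# Synthetic stand-in data (hypothetical coefficients and noise level).
rng = np.random.default_rng(1)
population = rng.uniform(1_000, 100_000, size=300)
murder = rng.integers(0, 2, size=300)  # 0/1 flag, as in the notebook
property_crime = 0.02 * population + 500 * murder + rng.normal(0, 100, size=300)

fold_scores = k_fold_r2(np.column_stack([population, murder]), property_crime)
print(fold_scores.round(3))
```

If the five scores are close to each other and to the full-sample R^2, that is some evidence against overfitting; a wide spread would suggest the fit depends heavily on which cities land in the training split.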