# 2.5.2 [Challenge: Validating a linear regression](https://courses.thinkful.com/data-201v1/project/2.5.2)

Iterate on model from [2.4.4](https://github.com/Eileenyc/thinkful_course/blob/master/unit_2/2_4_4_crime_regression_model.ipynb)

[Download the Excel file here](https://ucr.fbi.gov/crime-in-the-u.s/2013/crime-in-the-u.s.-2013/tables/table-8/table-8-state-cuts/table_8_offenses_known_to_law_enforcement_new_york_by_city_2013.xls) on crime data in New York State in 2013, provided by the FBI: UCR ([Thinkful mirror](https://raw.githubusercontent.com/Thinkful-Ed/data-201-resources/master/New_York_offenses/NEW_YORK-Offenses_Known_to_Law_Enforcement_by_City_2013%20-%2013tbl8ny.csv)).

Assess model on new dataset using holdout or cross-validation.

Goal: Model with consistent R-squared and only statistically significant parameters across multiple samples. 

Use other stats crime rates for 2013 or crime rates in New York State in other years. 

Create revised model and test both old and new models on a new holdout or set of folds.

Include your models and brief write up of the reasoning behind the validation method you chose and changes you made. 

In [81]:
import math
import warnings
import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt

from sklearn import linear_model
from sklearn.model_selection import cross_val_score, train_test_split, KFold, cross_val_predict
from sklearn import ensemble, tree
from sklearn.metrics import r2_score
import statsmodels.formula.api as smf
from IPython.display import display


# Display preferences.
%matplotlib inline
pd.options.display.float_format = '{:.3f}'.format

# Suppress annoying harmless error.
warnings.filterwarnings(
    action="ignore",
    module="scipy",
    message="^internal gelsd"
)

In [7]:
# Grab and process the raw data.
data_path = ("unit_2_data/ny_crime_data.csv"
            )
df_crime = pd.read_csv(data_path, header=None, delimiter=',')
df_crime.columns = list(df_crime.iloc[4,:])
df_crime.columns = df_crime.columns.str.replace('\n', '_').str.replace(' ', '_').str.lower()
df_crime = df_crime.rename({'murder_and_nonnegligent_manslaughter':'murder',
                           'larceny-_theft':'larceny',
                           'arson3':'arson'},axis='columns')

df_crime = pd.concat([df_crime.iloc[5:-3,0:4],df_crime.iloc[5:-3,6:-1]],axis=1).reset_index(drop=True)

for crime in list(df_crime.columns[1:]):
    df_crime[crime] = df_crime[crime].str.replace(',','').astype('int64')
    
crime_list = df_crime.columns[2:]

for crime in crime_list:
    df_crime[crime + str('_per_capita')] = df_crime[crime]/df_crime['population']
    
# create flags for >0 crimes
for crime in crime_list:
    df_crime[crime+str('_flag')] = np.where(df_crime[crime]>0,1,0)
    
df_crime['population_squared'] = df_crime['population']**2 

df_crime = df_crime.loc[df_crime['city']!='New York'].reset_index(drop=True)

## Used RandomForest to evaluate feature importance

In [109]:
X = df_crime[['population','aggravated_assault','violent_crime', 'robbery', 'population_squared', 'murder']]
y = df_crime['property_crime']

rfr = ensemble.RandomForestRegressor(max_depth=7,
                                     
                                     n_estimators=10)
rfr.fit(X, y)
#print("--- %s seconds ---" % (time.time() - start_time))


prediction = rfr.predict(X)
print(cross_val_score(rfr, X=X, y=y, cv=5))
print(r2_score(prediction, y))

dt_features = pd.DataFrame(data={'importance':rfr.feature_importances_, 'features':X.columns})
dt_features.sort_values(by='importance', ascending=False)

[0.88931979 0.84422424 0.30404544 0.9037787  0.7247789 ]
0.9664231742761831


Unnamed: 0,importance,features
4,0.206,population_squared
2,0.194,violent_crime
3,0.191,robbery
5,0.162,murder
0,0.157,population
1,0.09,aggravated_assault


Using what I learned from the Random Forest tried out combinations of top features until the p_values of the features included in the model were all less than 0.05. The p_value for the ftest was also less than 0.05 but the actually f-value was pretty large which I think is bad. Additionally the R_squared was low but population explains variance in crime most strongly and since population was removed from this relationship the opportunity to  explain more variance is limited 

In [112]:
linear_formula = 'property_crime_per_capita ~ violent_crime_per_capita\
+robbery_flag'

lm = smf.ols(formula=linear_formula, data=train).fit()
print('\nParameters:')
print(lm.params)

print('\nP-Values:')
print(lm.pvalues)

print('\nR-squared')
print(lm.rsquared)

print(lm.conf_int())

print('\nF-value:')
print(lm.fvalue)

print('\nF p_value:')
print(lm.f_pvalue)


Parameters:
Intercept                  0.010
violent_crime_per_capita   2.973
robbery_flag               0.009
dtype: float64

P-Values:
Intercept                  0.000
violent_crime_per_capita   0.000
robbery_flag               0.000
dtype: float64

R-squared
0.37559327510469187
                             0     1
Intercept                0.008 0.012
violent_crime_per_capita 2.219 3.726
robbery_flag             0.006 0.012

F-value:
68.87406602915397

F p_value:
3.810982637024007e-24


In [43]:
df_crime.columns

Index(['city', 'population', 'violent_crime', 'murder', 'robbery',
       'aggravated_assault', 'property_crime', 'burglary', 'larceny',
       'motor_vehicle_theft', 'violent_crime_per_capita', 'murder_per_capita',
       'robbery_per_capita', 'aggravated_assault_per_capita',
       'property_crime_per_capita', 'burglary_per_capita',
       'larceny_per_capita', 'motor_vehicle_theft_per_capita',
       'violent_crime_flag', 'murder_flag', 'robbery_flag',
       'aggravated_assault_flag', 'property_crime_flag', 'burglary_flag',
       'larceny_flag', 'motor_vehicle_theft_flag', 'population_squared'],
      dtype='object')

In [91]:
crime_features = ['population', 'violent_crime', 'murder', 'robbery',
       'aggravated_assault', 'violent_crime_per_capita', 'murder_per_capita',
       'robbery_per_capita', 'aggravated_assault_per_capita',
       'violent_crime_flag', 'murder_flag', 'robbery_flag',
       'aggravated_assault_flag', 'population_squared']