# 3.1.4 [Challenge: Model Comparison](https://courses.thinkful.com/data-201v1/project/3.1.4)

Iterate on the model from [2.4.4](https://github.com/Eileenyc/thinkful_course/blob/master/unit_2/2_4_4_crime_regression_model.ipynb)

[Download the Excel file here](https://ucr.fbi.gov/crime-in-the-u.s/2013/crime-in-the-u.s.-2013/tables/table-8/table-8-state-cuts/table_8_offenses_known_to_law_enforcement_new_york_by_city_2013.xls) on crime data in New York State in 2013, provided by the FBI: UCR ([Thinkful mirror](https://raw.githubusercontent.com/Thinkful-Ed/data-201-resources/master/New_York_offenses/NEW_YORK-Offenses_Known_to_Law_Enforcement_by_City_2013%20-%2013tbl8ny.csv)).

Build a KNN regression and compare it with OLS regression.
How similar are they?
Do they miss in different ways?
Write a few paragraphs to describe the models' behaviors and why you favor one model or the other.

What is it about the data that causes the better model to outperform the weaker model?

To really get at whether these regressions are working, I think I need to test them on a separate year or state. We previously said there was a lot of overfitting with the crime dataset. What I am seeing for sure is that KNN on the crime dataset benefits from scaled features (cross-validated R² rises from 0.55 to 0.63 for the distance-weighted model), while the OLS scores do not change. This makes a lot of sense, since the features are on very different scales.

It seems like ordinary least squares works better for the crime data. There is a strong linear relationship between property crime and population that will help fill in gaps.
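One way to see how a linear relationship "fills in gaps" is to hold out the largest cities and ask both models to predict them. The sketch below uses synthetic data as a stand-in for the crime table (the variable names, the slope of 0.02, and the noise level are all assumptions for illustration, not values from the FBI data). KNN predictions are weighted averages of training targets, so they cannot exceed the largest training value, while OLS can extend the fitted line.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in: many small "cities", a few large ones, linear target plus noise.
rng = np.random.default_rng(0)
population = np.sort(rng.lognormal(mean=8.5, sigma=1.2, size=200))
property_crime = 0.02 * population + rng.normal(0, 20, size=200)

# Hold out the 10 largest cities so both models must predict beyond the training range.
X_train, y_train = population[:-10, None], property_crime[:-10]
X_test, y_test = population[-10:, None], property_crime[-10:]

ols = LinearRegression().fit(X_train, y_train)
knn = KNeighborsRegressor(n_neighbors=10, weights="distance").fit(X_train, y_train)

ols_err = np.abs(ols.predict(X_test) - y_test).mean()
knn_err = np.abs(knn.predict(X_test) - y_test).mean()
knn_max_pred = knn.predict(X_test).max()

print("OLS mean absolute error on held-out large cities: %0.1f" % ols_err)
print("KNN mean absolute error on held-out large cities: %0.1f" % knn_err)
print("Largest KNN prediction: %0.1f (training max: %0.1f)" % (knn_max_pred, y_train.max()))
```

On the real data the effect would be less clean, because the held-out cities in cross-validation are not always the largest ones.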

In [11]:
import math
import warnings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn import linear_model, neighbors
import statsmodels.formula.api as smf
from IPython.display import display

from sklearn.model_selection import cross_val_score, cross_val_predict, train_test_split, KFold
from sklearn.metrics import r2_score
from sklearn.preprocessing import MinMaxScaler


from sklearn.neighbors import KNeighborsClassifier
from scipy import stats

# Display preferences.
%matplotlib inline
pd.options.display.float_format = '{:.3f}'.format

# Suppress annoying harmless error.
warnings.filterwarnings(
    action="ignore",
    module="scipy",
    message="^internal gelsd"
)

In [9]:
# Grab and process the raw data.
data_path = "unit_3_data/ny_crime_data.csv"
df_crime = pd.read_csv(data_path, header=None, delimiter=',')
df_crime.columns = list(df_crime.iloc[4,:])
df_crime.columns = df_crime.columns.str.replace('\n', '_').str.replace(' ', '_').str.lower()
df_crime = df_crime.rename({'murder_and_nonnegligent_manslaughter':'murder',
                           'larceny-_theft':'larceny',
                           'arson3':'arson'},axis='columns')

df_crime = pd.concat([df_crime.iloc[5:-3,0:4],df_crime.iloc[5:-3,6:-1]],axis=1).reset_index(drop=True)

for crime in list(df_crime.columns[1:]):
    df_crime[crime] = df_crime[crime].str.replace(',','').astype('int64')
    
crime_list = df_crime.columns[2:]

for crime in crime_list:
    df_crime[crime + str('_per_capita')] = df_crime[crime]/df_crime['population']
    
# create flags for >0 crimes
for crime in crime_list:
    df_crime[crime+str('_flag')] = np.where(df_crime[crime]>0,1,0)
    
df_crime['population_squared'] = df_crime['population']**2 

df_crime = df_crime.loc[df_crime['city']!='New York'].reset_index(drop=True)

In [26]:
# Build our model.
# Instantiate the linear regression model
regr = linear_model.LinearRegression()

# Set target and features
Y = df_crime['property_crime']
X = df_crime[['population', 'population_squared', 'violent_crime']]
# Use Y (the target defined above); lowercase y is not defined in this notebook
predicted = cross_val_predict(regr, X, Y, cv=10)

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

score = cross_val_score(regr, X, Y, cv=5)
print("Linear Regression Cross Val Scores: %0.2f (+/- %0.2f)" % (score.mean(), score.std() * 2))

score = cross_val_score(regr, X_scaled, Y, cv=5)
print("Linear Regression Scaled Features Cross Val Scores: %0.2f (+/- %0.2f)" % (score.mean(), score.std() * 2))

cv_predictions = cross_val_predict(regr, X, Y, cv=5)
#print(cv_predictions)
print("R2 score calc cross val score: %0.2f" % (r2_score(Y, cv_predictions)))

Linear Regression Cross Val Scores: 0.69 (+/- 0.67)
Linear Regression Scaled Features Cross Val Scores: 0.69 (+/- 0.67)
R2 score calc cross val score: 0.90
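The 0.69 and 0.90 above are not two estimates of the same quantity. `cross_val_score` returns one R² per fold, each normalized by the variance of the target within that fold, and the printed value is their mean. Calling `r2_score` on the output of `cross_val_predict` instead computes a single R² over all out-of-fold predictions, normalized by the variance of the whole target. A minimal sketch on synthetic data (the data-generating numbers are assumptions, chosen only to give a skewed feature like population):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_predict, cross_val_score

# Synthetic, heavily skewed feature with a linear target (illustrative only).
rng = np.random.default_rng(0)
x = rng.lognormal(mean=8.5, sigma=1.2, size=300)
y = 0.02 * x + rng.normal(0, 40, size=300)
X = x[:, None]

model = LinearRegression()

# One R2 per fold, each relative to that fold's own target variance.
fold_scores = cross_val_score(model, X, y, cv=5)

# One R2 over all out-of-fold predictions, relative to the overall target variance.
pooled = r2_score(y, cross_val_predict(model, X, y, cv=5))

print("per-fold R2:", np.round(fold_scores, 2))
print("mean of per-fold R2: %0.2f" % fold_scores.mean())
print("pooled R2: %0.2f" % pooled)
```

When a fold happens to contain only small cities, its target variance is small and the same absolute errors produce a much lower fold R², which is one plausible reason for the wide (+/- 0.67) spread.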


In [29]:
regr_output = regr.fit(X,Y)


In [31]:
print(regr_output.coef_)
print(regr_output.intercept_)

[ 1.71870721e-02 -6.58860239e-08  3.81537367e+00]
-18.94113477691286


In [38]:
X_scaled

array([[5.16914928e-03, 4.75821272e-05, 0.00000000e+00],
       [7.94151698e-03, 9.50293164e-05, 9.23361034e-04],
       [8.98309088e-03, 1.16811532e-04, 9.23361034e-04],
       ...,
       [2.09282785e-02, 5.21118695e-04, 6.15574023e-04],
       [7.69014532e-01, 5.92103972e-01, 3.18867344e-01],
       [1.39845816e-01, 2.00448456e-02, 4.61680517e-03]])
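To make the effect of this rescaling on KNN concrete, the sketch below measures how much each feature contributes to the squared Euclidean distance between two rows, before and after `MinMaxScaler`. The three rows are hypothetical values in the order `[population, population_squared, violent_crime]`, not rows from the dataset, and `share_of_squared_distance` is a helper defined here for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical rows: [population, population_squared, violent_crime]
rows = np.array([
    [10_000, 10_000 ** 2, 5],
    [10_500, 10_500 ** 2, 60],
    [50_000, 50_000 ** 2, 300],
], dtype=float)

def share_of_squared_distance(a, b):
    """Fraction of the squared Euclidean distance contributed by each feature."""
    sq = (a - b) ** 2
    return sq / sq.sum()

# Compare the first two rows on the raw scale and on the min-max scale.
raw_share = share_of_squared_distance(rows[0], rows[1])
scaled = MinMaxScaler().fit_transform(rows)
scaled_share = share_of_squared_distance(scaled[0], scaled[1])

print("raw   :", np.round(raw_share, 4))
print("scaled:", np.round(scaled_share, 4))
```

On the raw scale nearly all of the distance comes from `population_squared`, so `violent_crime` has almost no say in which neighbors are chosen; after scaling it can matter.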

In [33]:
# Write out the model formula.
# Your dependent variable on the left, independent variables on the right
# Use a ~ to represent an '=' from the functional form
linear_formula = 'property_crime ~ population+population_squared+violent_crime'

# Fit the model to our data using the formula.
lm = smf.ols(formula=linear_formula, data=df_crime).fit()

In [34]:
lm.params

Intercept            -18.941
population             0.017
population_squared    -0.000
violent_crime          3.815
dtype: float64

In [35]:
lm.pvalues

Intercept            0.330
population           0.000
population_squared   0.000
violent_crime        0.000
dtype: float64

In [36]:
lm.rsquared

0.9352483071162015

In [13]:
# Build our model.
# Instantiate knn & knn weighted models
knn = neighbors.KNeighborsRegressor(n_neighbors=10)
knn_w = neighbors.KNeighborsRegressor(n_neighbors=10, weights='distance')

# Define features and target
Y = df_crime['property_crime']
X = df_crime[['population', 'population_squared', 'violent_crime']]

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# cross_val_score on a regressor reports R2, not classification accuracy
score = cross_val_score(knn, X, Y, cv=5)
print("Unweighted R2: %0.2f (+/- %0.2f)" % (score.mean(), score.std() * 2))
score_w = cross_val_score(knn_w, X, Y, cv=5)
print("Weighted R2: %0.2f (+/- %0.2f)" % (score_w.mean(), score_w.std() * 2))

score = cross_val_score(knn, X_scaled, Y, cv=5)
print("Unweighted R2 Using Scaled X: %0.2f (+/- %0.2f)" % (score.mean(), score.std() * 2))
score_w = cross_val_score(knn_w, X_scaled, Y, cv=5)
print("Weighted R2 Using Scaled X: %0.2f (+/- %0.2f)" % (score_w.mean(), score_w.std() * 2))

Unweighted R2: 0.59 (+/- 0.18)
Weighted R2: 0.55 (+/- 0.22)
Unweighted R2 Using Scaled X: 0.62 (+/- 0.19)
Weighted R2 Using Scaled X: 0.63 (+/- 0.18)
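The scaled scores above come from a scaler that was fit on all of `X` before cross-validation, so each test fold influenced the scaling. A common alternative is to put the scaler and the model in a pipeline, so the scaler is refit on the training folds only. The sketch below shows the pattern on synthetic data; `X_demo`, `y_demo`, and the coefficients used to generate them are assumptions for illustration, and the printed score says nothing about the crime data.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic features shaped like the notebook's X (illustrative only).
rng = np.random.default_rng(0)
population = rng.lognormal(mean=8.5, sigma=1.2, size=300)
violent = 0.002 * population + rng.normal(0, 2, size=300)
X_demo = np.column_stack([population, population ** 2, violent])
y_demo = 0.02 * population + 3.0 * violent + rng.normal(0, 20, size=300)

# The scaler is fit inside each training fold, then applied to that fold's test split.
scaled_knn = make_pipeline(
    MinMaxScaler(),
    KNeighborsRegressor(n_neighbors=10, weights="distance"),
)

scores = cross_val_score(scaled_knn, X_demo, y_demo, cv=5)
print("Pipeline KNN R2: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
```

In the notebook, the same pipeline could be passed to `cross_val_score` with the unscaled `X` and `Y` in place of `X_demo` and `y_demo`.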


In [15]:
cv_predictions = cross_val_predict(knn_w, X, Y, cv=5)
#print(cv_predictions)
print(r2_score(Y, cv_predictions))

0.5274862630338488


In [16]:
cv_predictions = cross_val_predict(knn_w, X_scaled, Y, cv=5)
#print(cv_predictions)
print(r2_score(Y, cv_predictions))

0.6116009327607101


In [20]:
knn_w.fit(X_scaled,Y)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=10, p=2,
          weights='distance')

In [41]:
cv_predictions.min()

-11.319976881860857

In [24]:
knn_w.get_params()

{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': 1,
 'n_neighbors': 10,
 'p': 2,
 'weights': 'distance'}

In [58]:
#np.random.seed(0)

T = np.linspace(0, 12491, 347)[:, np.newaxis]


In [62]:
len(Y)

347

In [63]:
len(X)

347

In [64]:
len(T)

347

In [65]:
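# Fit regression model
# The original call, knn.fit(X, Y).predict(), raised
# "TypeError: predict() missing 1 required positional argument: 'X'".
# Assumption: plot against population alone, since a 2-D scatter needs a
# single feature and the prediction grid must match the fitted columns.
n_neighbors = 5
X_pop = df_crime[['population']].values
T = np.linspace(X_pop.min(), X_pop.max(), 347)[:, np.newaxis]

for i, weights in enumerate(['uniform', 'distance']):
    knn = neighbors.KNeighborsRegressor(n_neighbors=n_neighbors, weights=weights)
    y_ = knn.fit(X_pop, Y).predict(T)

    plt.subplot(2, 1, i + 1)
    plt.scatter(X_pop, Y, c='k', label='data')
    plt.plot(T, y_, c='g', label='prediction')
    plt.axis('tight')
    plt.legend()
    plt.title("KNeighborsRegressor (k = %i, weights = '%s')" % (n_neighbors,
                                                                weights))

plt.tight_layout()
plt.show()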