## Regression Comparison - OLS vs. kNN

Comparing models is a routine part of data science. There is rarely only one model that could be applied to a given problem, so learning to choose among candidates is important. In this exercise, ordinary least squares (OLS) linear regression is compared with k-nearest neighbors (kNN) regression.

The data and linear regression model were built from the Multiple Linear Regression exercise at https://github.com/ChE99/projects/blob/master/Multiple%20Linear%20Regression.ipynb.

In [1]:
import numpy as np
import pandas as pd
from sklearn import linear_model

### A. Linear Regression

#### 1. Fit the Model

In [2]:
# Open the FBI's 2013 New York crime dataset (skip first four rows).
df_ny13 = pd.read_csv('https://raw.githubusercontent.com/Thinkful-Ed/data-201-resources/master/New_York_offenses/NEW_YORK-Offenses_Known_to_Law_Enforcement_by_City_2013%20-%2013tbl8ny.csv', skiprows=4)

In [3]:
# Rename applicable columns.
df_ny13.rename(columns={'Population': 'population', 'Robbery': 'robbery', 'Property\ncrime': 'property_crime'}, inplace=True)

# Drop unnecessary variables.
df_ny13_final = df_ny13.drop(['City', 'Rape\n(revised\ndefinition)1'], axis=1)

# Drop last three rows with nulls.
df_ny13_final = df_ny13_final[:-3]

# Create function to remove commas and convert object type columns to numeric.
cols_ny13 = df_ny13_final.columns[df_ny13_final.dtypes.eq('object')]
convert_col = lambda col_obj: pd.to_numeric(col_obj.replace(',',''))
df_ny13_final[cols_ny13] = df_ny13_final[cols_ny13].applymap(convert_col)
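
The comma-stripping step above can also be written column by column with the pandas string accessor, which avoids `applymap` (deprecated in pandas 2.1 in favor of `DataFrame.map`). This is a minimal sketch on a small made-up frame; the name `df_demo` and its values are illustrative assumptions, not rows from the FBI file:

```python
import pandas as pd

# Hypothetical sample standing in for the crime table (not real data).
df_demo = pd.DataFrame({
    'population': ['1,861', '2,577', '97,956'],
    'property_crime': ['12', '24', '4,090'],
    'robbery': [0, 3, 227],  # already numeric, left untouched
})

# Select the non-numeric (string) columns.
text_cols = df_demo.select_dtypes(exclude='number').columns

# Strip thousands separators column by column, then convert to numbers.
df_demo[text_cols] = df_demo[text_cols].apply(
    lambda col: pd.to_numeric(col.str.replace(',', '', regex=False))
)

print(df_demo.dtypes)
print(df_demo)
```

Working on whole columns rather than individual cells tends to be faster on larger tables, and `select_dtypes` avoids depending on how a given pandas version labels string columns.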

In [4]:
# Create features. Non-zero values of robbery will be coded as 1. 
df_ny13_final['population_squared'] = df_ny13_final['population'] * df_ny13_final['population']
df_ny13_final['robbery_category'] = np.where(df_ny13_final['robbery']>0, 1, 0)

# Drop rows where property_crime is zero (the log of zero is undefined).
df_ny13_final_nzpc = df_ny13_final[df_ny13_final['property_crime'] != 0]

# Drop the rows containing the population outliers.
df_ny13_or = df_ny13_final_nzpc[df_ny13_final_nzpc['population']<100000]

# Set the variables.
y_ny13_ppr_orlt = np.log(df_ny13_or['property_crime'])
X_ny13_ppr_orlt = df_ny13_or[['population','population_squared', 'robbery_category']]

# Fit the model.
regr_ny13_ppr_orlt = linear_model.LinearRegression()
regr_ny13_ppr_orlt.fit(X_ny13_ppr_orlt, y_ny13_ppr_orlt)

# Show linear regression parameters.
print('\nCoefficients: \n', regr_ny13_ppr_orlt.coef_)
print('\nIntercept: \n', regr_ny13_ppr_orlt.intercept_)
print('\nR-squared:')
print(regr_ny13_ppr_orlt.score(X_ny13_ppr_orlt, y_ny13_ppr_orlt))


Coefficients: 
 [  9.85497391e-05  -7.09633481e-10   1.17034383e+00]

Intercept: 
 2.98340998967

R-squared:
0.731359253678
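
Because the target is the log of `property_crime`, each coefficient acts multiplicatively on the original scale. The sketch below back-transforms the fitted values printed above. The example town (population 10,000 with at least one robbery) is a hypothetical input chosen only for illustration:

```python
import numpy as np

# Coefficients and intercept copied from the output above.
coef = np.array([9.85497391e-05, -7.09633481e-10, 1.17034383e+00])
intercept = 2.98340998967

# Multiplicative effect of robbery_category = 1 versus 0,
# holding population fixed.
robbery_multiplier = np.exp(coef[2])
print('Robbery multiplier:', robbery_multiplier)

# Hypothetical town: population 10,000, at least one robbery.
population = 10_000
x = np.array([population, population ** 2, 1])
log_pred = intercept + coef @ x

# Undo the log transform to get a rough prediction in crime counts.
pred_count = np.exp(log_pred)
print('Predicted property crimes:', pred_count)
```

Reading the coefficients this way makes it easier to judge whether their sizes are plausible than looking at the log-scale values alone.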


#### 2. Train and Test the Model

##### a. Train/Test Split

In [5]:
# Split the data into train and test samples. Use the outlier-removed, log-transformed dataset.
from sklearn.model_selection import train_test_split
X_train_tts_ny13, X_test_tts_ny13, y_train_tts_ny13, y_test_tts_ny13 = train_test_split(X_ny13_ppr_orlt, y_ny13_ppr_orlt, test_size=0.3)

# Train the model.
regr_tts_ny13 = linear_model.LinearRegression()
lm_tts_ny13 = regr_tts_ny13.fit(X_train_tts_ny13, y_train_tts_ny13)

# Make predictions on the test sample.
predictions_tts_ny13 = regr_tts_ny13.predict(X_test_tts_ny13)

# Compare predicted vs. actual.
print('Predicted:', predictions_tts_ny13[0:5])
print('\n')
print('Actual:')
print(y_test_tts_ny13[0:5])

Predicted: [ 6.44514237  3.28751279  3.37697186  3.52965185  4.45990579]


Actual:
322    6.347389
50     4.060443
250    4.595120
264    3.496508
15     3.218876
Name: property_crime, dtype: float64


In [6]:
# Model score.
print('Score:', lm_tts_ny13.score(X_test_tts_ny13, y_test_tts_ny13))

Score: 0.689536394435
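
Because `train_test_split` is called without `random_state`, the test score changes from run to run. The sketch below uses synthetic data generated purely for illustration (not the crime data). It fixes the seed of each split so the result is repeatable, and shows how much the score can move across different splits:

```python
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration only).
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(300, 2))
y = 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 2, size=300)

# Repeat the 70/30 split with different seeds and record test R-squared.
split_scores = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed)
    model = linear_model.LinearRegression().fit(X_tr, y_tr)
    split_scores.append(model.score(X_te, y_te))

split_scores = np.array(split_scores)
print('Min/Max R-squared:', split_scores.min(), split_scores.max())
print('Mean R-squared:', split_scores.mean())
```

A single split gives only one draw from this range, which is one reason the cross-validated score is usually the more trustworthy figure.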


##### b. K-Folds Cross Validation

In [7]:
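# Run three subsets, or folds, for cross validation.
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection.
# cv=3 is set explicitly because the default is 5 folds in newer versions.
from sklearn.model_selection import cross_val_score
scores_cv_ny13 = cross_val_score(lm_tts_ny13, X_ny13_ppr_orlt, y_ny13_ppr_orlt, cv=3)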

# The score for each fold.
print('Cross validated scores:', scores_cv_ny13)

# Mean score.
print('Average score:', scores_cv_ny13.mean())

Cross validated scores: [ 0.71470059  0.70105702  0.70858824]
Average score: 0.708115283492
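
For a regressor, passing an integer `cv` to `cross_val_score` produces consecutive, unshuffled folds. If the rows are ordered in some meaningful way, shuffled folds may give a fairer estimate. The following is a hedged sketch using `KFold` on synthetic stand-in data (an assumption for illustration, not the crime data):

```python
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption for illustration only).
rng = np.random.RandomState(1)
X = rng.uniform(0, 10, size=(300, 2))
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(0, 2, size=300)

# Three shuffled folds with a fixed seed so results are repeatable.
kfold = KFold(n_splits=3, shuffle=True, random_state=0)
cv_scores = cross_val_score(linear_model.LinearRegression(), X, y, cv=kfold)

print('Cross validated scores:', cv_scores)
print('Average score:', cv_scores.mean())
```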




### B. kNN Regression

#### 1. Fit the Model

In [8]:
# Fit unweighted and distance-weighted KNN models on the same training split used for linear regression.
from sklearn import neighbors

knn = neighbors.KNeighborsRegressor(n_neighbors=5)
knn_w = neighbors.KNeighborsRegressor(n_neighbors=5, weights='distance')
knn_tts_ny13 = knn.fit(X_train_tts_ny13, y_train_tts_ny13)
knn_wtts_ny13 = knn_w.fit(X_train_tts_ny13, y_train_tts_ny13)

#### 2. Train and Test the Model

##### a. Train/Test Split

In [9]:
# For a regressor, score() returns R-squared (not classification accuracy).
print('Unweighted Accuracy:', knn_tts_ny13.score(X_test_tts_ny13, y_test_tts_ny13))
print('Weighted Accuracy:', knn_wtts_ny13.score(X_test_tts_ny13, y_test_tts_ny13))

Unweighted Accuracy: 0.553825865141
Weighted Accuracy: 0.478814128921
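
The two kNN variants differ only in how the neighbors' targets are averaged: `weights='uniform'` (the default) takes a plain mean, while `weights='distance'` weights each neighbor by the inverse of its distance. The sketch below uses a tiny made-up one-dimensional dataset and checks the manual calculation against scikit-learn:

```python
import numpy as np
from sklearn import neighbors

# Tiny made-up training set (illustrative values only).
X_train = np.array([[1.0], [2.0], [4.0], [8.0]])
y_train = np.array([10.0, 20.0, 40.0, 80.0])
x_query = np.array([[3.0]])

k = 3
# Distances from the query to every training point.
dist = np.abs(X_train[:, 0] - x_query[0, 0])
nearest = np.argsort(dist)[:k]

# Uniform weighting: plain mean of the k nearest targets.
manual_uniform = y_train[nearest].mean()

# Distance weighting: each neighbor weighted by 1 / distance.
w = 1.0 / dist[nearest]
manual_weighted = np.sum(w * y_train[nearest]) / np.sum(w)

knn_u = neighbors.KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
knn_d = neighbors.KNeighborsRegressor(n_neighbors=k, weights='distance').fit(X_train, y_train)
sk_uniform = knn_u.predict(x_query)[0]
sk_weighted = knn_d.predict(x_query)[0]

print('Uniform: ', manual_uniform, sk_uniform)
print('Weighted:', manual_weighted, sk_weighted)
```

Distance weighting pulls the prediction toward the closest points, which can help or hurt depending on how noisy those closest points are.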


##### b. Cross Validation

In [10]:
# Run KNN cross validation. The reported scores are R-squared values.
from sklearn.model_selection import cross_val_score

score = cross_val_score(knn_tts_ny13, X_ny13_ppr_orlt, y_ny13_ppr_orlt, cv=3)
print("Unweighted Accuracy: %0.3f (+/- %0.3f)" % (score.mean(), score.std() * 2))
score_w = cross_val_score(knn_wtts_ny13, X_ny13_ppr_orlt, y_ny13_ppr_orlt, cv=3)
print("Weighted Accuracy: %0.3f (+/- %0.3f)" % (score_w.mean(), score_w.std() * 2))

Unweighted Accuracy: 0.641 (+/- 0.035)
Weighted Accuracy: 0.554 (+/- 0.058)


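A summary of the scores (R-squared) is as follows:

Method                  | OLS   | kNN (unweighted) | kNN (distance-weighted)
------------------------|-------|------------------|------------------------
Train/test split        | 0.690 | 0.554            | 0.479
3-fold cross validation | 0.708 | 0.641            | 0.554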

The scores differed markedly: in both the train/test split and cross validation, linear regression outperformed kNN. Because kNN is flexible, it is more sensitive to noise than the more rigid linear regression, and it tends to underperform linear regression when the noise-to-signal ratio is high. kNN also suffers from the curse of dimensionality: as the number of predictors grows, even the nearest neighbors of a test point tend to lie far from it, so prediction accuracy can drop significantly. Although kNN can handle nonlinearity, it does not indicate which predictors are important and produces no coefficients to interpret. Lastly, tuning k is critical to good performance.
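
kNN's sensitivity to the choice of k, noted above, can be explored with a grid search. Because kNN relies on distances, features on very different scales (such as population and population squared) may dominate the neighbor search, so the sketch also standardizes the features. This is not the notebook's method: it uses synthetic data, a hypothetical grid of k values, a `StandardScaler` inside a `Pipeline`, and `GridSearchCV` to pick k and the weighting scheme.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with features on very different scales
# (an assumption for illustration; not the crime data).
rng = np.random.RandomState(0)
small = rng.uniform(0, 1, size=300)
large = rng.uniform(0, 100000, size=300)
X = np.column_stack([small, large])
y = np.sin(2 * np.pi * small) + large / 100000 + rng.normal(0, 0.1, size=300)

# Standardize features, then fit kNN; scaling is refit inside each CV fold.
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('knn', KNeighborsRegressor()),
])

# Hypothetical search grid over k and the weighting scheme.
param_grid = {
    'knn__n_neighbors': [1, 3, 5, 10, 20],
    'knn__weights': ['uniform', 'distance'],
}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)

print('Best parameters:', search.best_params_)
print('Best CV R-squared:', search.best_score_)
```

Putting the scaler inside the pipeline keeps the test folds out of the scaling statistics, so the cross-validated score is not optimistically biased.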

Linear regression, on the other hand, has a fixed number of parameters and is computationally faster, but makes strong assumptions about the data, such as a linear relationship between the predictors and the target. The algorithm may work well if the assumptions turn out to be correct, but it may perform badly if they are wrong.
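
The trade-off described above can be illustrated with a small experiment on synthetic data (an assumption for illustration; none of it comes from the crime dataset). When the true relationship is linear, OLS should score about as well as kNN or better, and when it is strongly nonlinear, kNN should pull ahead:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(400, 1))
noise = rng.normal(0, 0.3, size=400)

# Two synthetic targets: one satisfies the linearity assumption, one does not.
targets = {
    'linear': 2.0 * X[:, 0] + noise,
    'nonlinear': np.sin(2.0 * X[:, 0]) + noise,
}

# Mean 3-fold R-squared for each model on each target.
results = {}
for name, y in targets.items():
    ols_r2 = cross_val_score(LinearRegression(), X, y, cv=3).mean()
    knn_r2 = cross_val_score(KNeighborsRegressor(n_neighbors=5), X, y, cv=3).mean()
    results[name] = (ols_r2, knn_r2)
    print(name, 'OLS:', round(ols_r2, 3), 'kNN:', round(knn_r2, 3))
```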