Find a data set and build a KNN Regression and an OLS regression. Compare the two. How similar are they? Do they miss in different ways?

Create a Jupyter notebook with your models. At the end in a markdown cell write a few paragraphs to describe the models' behaviors and why you favor one model or the other. Try to determine whether there is a situation where you would change your mind, or whether one is unambiguously better than the other. Lastly, try to note what it is about the data that causes the better model to outperform the weaker model. 

In [1]:
import numpy as np
import pandas as pd
import math
from matplotlib import pyplot as plt
from sklearn import linear_model
from sklearn import neighbors
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

%matplotlib inline
pd.options.display.float_format = '{:.3f}'.format

# Suppress a harmless SciPy warning.
import warnings
warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd")

## Cleaning and Normalizing the Data.

In [2]:
# Load the FBI UCR data file; header=4 means the column names are on the fifth row of the sheet
dataset = pd.read_excel("NYCCrime.xls", header=4)

In [3]:
# pd.read_excel already returns a DataFrame, so this call just re-wraps it
data = pd.DataFrame(dataset)

In [4]:
# Select the columns needed for this challenge
data_group = data.loc[:, ['Population', 'Property\ncrime', 'Robbery', 'Burglary', 'Larceny-\ntheft']]

In [5]:
# Rename the group columns
data_group.columns = ['Population', 'Property_crime', 'Robbery', 'Burglary', 'Theft']

In [6]:
# Drop every row that has a null value in any column.
data_group = data_group.dropna(axis=0, how='any')

In [7]:
# Keep only the values that lie within m standard deviations of the mean
def reject_outliers(data, m=2):
    return data[abs(data - np.mean(data)) < m * np.std(data)]

In [8]:
# Apply the outlier filter to each column. Rejected values become NaN, so drop those rows afterwards.
for group in data_group.loc[:, 'Population':]:
    data_group[group] = reject_outliers(data_group[group], m=2)
data_group = data_group.dropna()
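
The filtering step above relies on pandas index alignment: assigning the shorter, filtered Series back to its column leaves NaN in the rejected rows, and `dropna()` then removes them. A minimal sketch on a hypothetical toy column (the values below are made up for illustration and are not from the FBI table):

```python
import numpy as np
import pandas as pd

def reject_outliers(data, m=2):
    # Keep values within m standard deviations of the mean.
    return data[abs(data - np.mean(data)) < m * np.std(data)]

# Hypothetical toy column: nine ordinary values and one extreme value.
toy = pd.DataFrame({"Population": [10, 12, 11, 13, 12, 11, 10, 12, 13, 500]})

filtered = reject_outliers(toy["Population"], m=2)

# Assigning the shorter Series back aligns on the index, so the rejected
# row becomes NaN instead of disappearing.
toy["Population"] = filtered
n_missing = int(toy["Population"].isna().sum())

# Dropping NaN rows is what actually removes the outlier row.
cleaned = toy.dropna()
print(n_missing, len(cleaned))
```

Because the mean and standard deviation are computed with the outlier included, a single extreme value can widen the band considerably, which is one reason a fixed `m=2` cutoff is only a rough filter.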

In [9]:
# Standardize every column (the target included) to zero mean and unit variance
scaled_features = StandardScaler().fit_transform(data_group.values)
# Create a new dataframe with these rescaled values but maintain the previous indexes and column names. 
scaled_features_df = pd.DataFrame(scaled_features, index=data_group.index, columns=data_group.columns)
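
To make the rescaling concrete, here is a small sketch showing that `StandardScaler` subtracts each column's mean and divides by its population standard deviation. The array `raw` is a made-up example, not the crime data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up two-column array on very different scales.
raw = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [3.0, 500.0],
                [6.0, 700.0]])

scaled = StandardScaler().fit_transform(raw)

# The same transformation written out by hand (np.std uses ddof=0 by default).
manual = (raw - raw.mean(axis=0)) / raw.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

Since every column, including the target, ends up centered on zero, a linear model fitted to the scaled data should have an intercept very close to zero.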

## Create Linear and KNN models with the data

### Linear Regression model

A linear regression model estimates the coefficients of a formula that describes the relationship between variables. In this example, the theft, burglary, robbery, and population features are the independent variables used to predict the dependent variable, property crime. Linear regression assumes that the relationship between the predictors and the outcome is linear.
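
As a rough illustration of what "finding estimators for coefficients" means, the sketch below fits `LinearRegression` to synthetic data and compares it with a direct least-squares solve. The data, the `true_coef` values, and the variable names are assumptions chosen for the example, not anything taken from the FBI table:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in data: 200 rows, 2 predictors.
X_demo = rng.normal(size=(200, 2))
true_coef = np.array([0.75, 0.25])  # assumed values, for illustration only
y_demo = X_demo @ true_coef + rng.normal(scale=0.01, size=200)

# Ordinary least squares via scikit-learn.
ols = LinearRegression().fit(X_demo, y_demo)

# The same least-squares problem solved directly, with a column of ones
# standing in for the intercept.
X_aug = np.column_stack([np.ones(len(X_demo)), X_demo])
beta, *_ = np.linalg.lstsq(X_aug, y_demo, rcond=None)

print(ols.intercept_, ols.coef_)
print(beta)
```

When the outcome really is close to a weighted sum of the predictors, the fitted coefficients land close to the weights that generated it.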

In [10]:
# My linear regression model. 
regr = linear_model.LinearRegression()
# Dependent variable
Y = scaled_features_df['Property_crime'].values.reshape(-1, 1)
# Independent variables
X = scaled_features_df[['Theft', 'Burglary', 'Robbery', 'Population']]
# Fitting my variables to the linear model
regr.fit(X, Y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [16]:
print('\nCoefficients: \n', regr.coef_)
print('\nIntercept: \n', regr.intercept_)
score = cross_val_score(regr, X, Y, cv=5)
print('\nEach Cross Validated R2 score: \n', score)
print("\nOverall Linear Regression R2: %0.2f (+/- %0.2f)\n" % (score.mean(), score.std() * 2))


Coefficients: 
 [[ 0.74749402  0.24519983  0.02789408  0.00676561]]

Intercept: 
 [ -5.28605865e-17]

Each Cross Validated R2 score: 
 [ 0.9999052   0.99976608  0.99953834  0.99954907  0.9995398 ]

Overall Linear Regression R2: 1.00 (+/- 0.00)



### K Nearest Neighbors model

A K-nearest neighbors model learns through similarity. To predict an observation, it finds the known observations in the training data that are closest to it and uses their values to make the prediction. In this model the four independent variables define a 4D space, and each training point in that space carries its property crime (dependent) value. The model looks at the k nearest points (two in this notebook), and each of them "votes" for a predicted property crime value. Because the model is created with `weights='distance'`, the votes are combined into an average weighted by the inverse of each neighbor's distance, and that average is returned as the prediction.
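
The weighted vote can be reproduced by hand. The sketch below uses synthetic 4D data (an assumption for illustration, not the crime table) and checks a manual inverse-distance average of the two nearest neighbors against `KNeighborsRegressor` with the same settings as this notebook:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)

# Synthetic training set: 50 points in a 4D feature space.
X_train = rng.normal(size=(50, 4))
y_train = X_train.sum(axis=1)
x_new = rng.normal(size=(1, 4))  # the observation to predict

knn_demo = KNeighborsRegressor(n_neighbors=2, weights="distance")
knn_demo.fit(X_train, y_train)
sk_pred = knn_demo.predict(x_new)[0]

# Manual version: Euclidean distance to every training point,
# keep the two closest, weight each by 1 / distance.
dist = np.linalg.norm(X_train - x_new, axis=1)
nearest = np.argsort(dist)[:2]
w = 1.0 / dist[nearest]
manual_pred = np.sum(w * y_train[nearest]) / np.sum(w)

print(sk_pred, manual_pred)
```

A prediction built this way can never fall outside the range of its neighbors' values, which is one way KNN can miss differently from a linear model near the edges of the data.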

In [14]:
# My K-nearest neighbor regressor model
knn = neighbors.KNeighborsRegressor(n_neighbors=2, weights='distance')
# Dependent variable
Y = scaled_features_df['Property_crime'].values.reshape(-1, 1)
# Independent variables
X = scaled_features_df[['Theft', 'Burglary', 'Robbery', 'Population']]
# Fitting my variables to the KNN model
knn.fit(X, Y)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=2, p=2,
          weights='distance')

In [17]:
knn_score = cross_val_score(knn, X, Y, cv=5)
print('\nEach Cross Validated R2 score: \n', knn_score)
print("\nOverall KNN Regression R2: %0.2f (+/- %0.2f)\n" % (knn_score.mean(), knn_score.std() * 2))


Each Cross Validated R2 score: 
 [ 0.94834599  0.98342286  0.97749222  0.97621985  0.84722475]

Overall KNN Regression R2: 0.95 (+/- 0.10)



### Conclusion

In this scenario I favor my linear regression model because its performance is both better and more consistent. Its R-squared was higher in every cross-validation fold (0.9995 to 0.9999) than that of the K-nearest neighbors regressor (0.85 to 0.98). I believe the linear regression model outperformed the KNN model because the predictors and property crime have a very clear linear relationship. This is expected: in the UCR tables the property crime total is built from burglary, larceny-theft, and motor vehicle theft, so two of the predictors are components of the target and the outcome is close to a weighted sum of them. However, my KNN model still performed remarkably well, and if I were working with a dataset that had a non-linear or otherwise more complicated relationship, it would likely be a better choice than a linear regression model.
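
The claim that KNN would be the better choice on non-linear data can be checked with a small experiment. The sketch below is only an illustration on synthetic data: the sine-shaped relationship, the noise level, `n_neighbors=5`, and the shuffled `KFold` splitter are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)

# Synthetic non-linear relationship: y follows a sine wave plus noise.
X_nl = rng.uniform(-3, 3, size=(300, 1))
y_nl = np.sin(2 * X_nl[:, 0]) + rng.normal(scale=0.1, size=300)

# Shuffled folds so each fold covers the whole range of X.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

ols_r2 = cross_val_score(LinearRegression(), X_nl, y_nl, cv=cv).mean()
knn_r2 = cross_val_score(
    KNeighborsRegressor(n_neighbors=5, weights="distance"), X_nl, y_nl, cv=cv
).mean()

print("OLS mean R2: %0.2f" % ols_r2)
print("KNN mean R2: %0.2f" % knn_r2)
```

On data like this a straight line cannot follow the curve, while KNN only needs nearby points to have similar outcomes, so the ranking of the two models reverses.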