# Challenge

You now know two kinds of regression and two kinds of classifier. So let's use that to compare models!

Comparing models is something data scientists do all the time. There's very rarely just one model that would be possible to run for a given situation, so learning to choose the best one is very important.

Here let's work on regression. 

- Find a data set and build a KNN Regression and an OLS regression. Compare the two. How similar are they? Do they miss in different ways?
- Create a Jupyter notebook with your models. At the end in a markdown cell write a few paragraphs to describe the models' behaviors and why you favor one model or the other. 
- Try to determine whether there is a situation where you would change your mind, or whether one is unambiguously better than the other. 
- Lastly, try to note what it is about the data that causes the better model to outperform the weaker model.

### Import Statements

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import scipy.stats as stats
from scipy.stats.mstats import winsorize
from scipy.stats import bartlett
from scipy.stats import levene
from scipy.stats import jarque_bera
from scipy.stats import normaltest
from statsmodels.tsa.stattools import acf
import statsmodels.api as sm
from sklearn import linear_model
from sklearn import neighbors
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sqlalchemy import create_engine
from statsmodels.tools.eval_measures import mse, rmse

# Display preferences.
%matplotlib inline
pd.options.display.float_format = '{:.3f}'.format

import warnings
warnings.filterwarnings('ignore')

### The Dataframe

In [9]:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'houseprices'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))
house_prices_df = pd.read_sql_query('select * from houseprices',con=engine)

engine.dispose()

### KNN Regression

In [21]:
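# Distance-weighted KNN: closer neighbors count more toward each prediction.
knn = neighbors.KNeighborsRegressor(n_neighbors=10, weights='distance')
# X holds five numeric features; Y is the sale price to predict.
X = house_prices_df[['overallqual', 'grlivarea', 'garagecars', 'garagearea', 'totalbsmtsf']]
Y = house_prices_df['saleprice']
knn.fit(X, Y)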

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=10, p=2,
                    weights='distance')

In [22]:
# cross_val_score uses the estimator's default scorer, which for a regressor
# is R^2 rather than accuracy. The model here is distance-weighted.
score = cross_val_score(knn, X, Y, cv=5)
print("Distance-weighted KNN R^2: %0.2f (+/- %0.2f)" % (score.mean(), score.std() * 2))

Distance-weighted KNN R^2: 0.72 (+/- 0.06)


### OLS Regression

In [23]:
# Y is the target variable.
Y = house_prices_df['saleprice']

# X is the feature set.
X = house_prices_df[['overallqual', 'grlivarea', 'garagecars', 'garagearea', 'totalbsmtsf']]

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 465)

In [24]:
# Fit an OLS model using sklearn.
lrm = LinearRegression()
results = lrm.fit(X_train, y_train)

In [25]:
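# cross_val_score refits a fresh copy of lrm on each fold of the full X and Y,
# so the train/test split above does not affect this score. The score is R^2.
score = cross_val_score(lrm, X, Y, cv=5)
print("OLS R^2: %0.2f (+/- %0.2f)" % (score.mean(), score.std() * 2))

OLS R^2: 0.75 (+/- 0.12)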


### Analyzing the Models' Performances

In this case, I preferred the KNN model: its mean cross-validated score (R-squared, the default score for regressors) was 0.72, close to the OLS model's 0.75, and its scores varied less across the five folds (+/- 0.06 versus +/- 0.12).
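
The comparison above rests on each model's mean score and its spread across folds, so it helps to score both models on exactly the same folds. The following is a minimal sketch of that idea, not the code used for the results above. It assumes synthetic data from `make_regression` as a stand-in for the house prices table, and the shuffled `KFold` splitter and the names `X_demo`, `y_demo`, and `fold_scores` are my own choices for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (an assumption; this is not the house prices table).
X_demo, y_demo = make_regression(n_samples=500, n_features=5, noise=20.0, random_state=0)

# One shared splitter with a fixed seed, so both models see identical folds.
folds = KFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "KNN": KNeighborsRegressor(n_neighbors=10, weights="distance"),
    "OLS": LinearRegression(),
}

fold_scores = {}
for name, model in models.items():
    # With no scoring argument, regressors are scored with R^2.
    fold_scores[name] = cross_val_score(model, X_demo, y_demo, cv=folds)
    print("%s R^2: %0.2f (+/- %0.2f)"
          % (name, fold_scores[name].mean(), fold_scores[name].std() * 2))
```

Because the folds are shared, a difference in the mean or in the spread reflects the models rather than a lucky or unlucky split.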

In general, though, I don't think one model is unambiguously better than the other. Neither model takes long to code, so my approach would be to run both, improve each over several iterations, and pick whichever I can get to perform best.
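
Part of that iteration can be automated. As a hedged sketch, assuming synthetic stand-in data and a small, arbitrary grid of candidate settings, `GridSearchCV` cross-validates each KNN configuration and keeps the best one. The scaling step is my addition, since KNN distances depend on the units of each feature; the names `knn_pipeline`, `param_grid`, and `search` are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; this is not the house prices table).
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=1)

# Scale features before KNN so no single feature dominates the distances.
knn_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsRegressor()),
])

# Candidate settings to try; the values are arbitrary examples.
param_grid = {
    "knn__n_neighbors": [3, 5, 10, 20],
    "knn__weights": ["uniform", "distance"],
}

# Each combination is scored with 5-fold cross-validation (R^2 by default).
search = GridSearchCV(knn_pipeline, param_grid, cv=5)
search.fit(X_demo, y_demo)

print("Best settings:", search.best_params_)
print("Best mean R^2: %0.2f" % search.best_score_)
```

The best cross-validated score from the search can then be set against the OLS score obtained with the same `cv` setting.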

That said, my understanding is that OLS works better when the relationship between the features and the target is roughly linear, while KNN is more helpful when similar observations cluster together in groups, so I'd keep that in mind while working on the models.
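
To check that intuition, here is a small sketch on two made-up data sets: one where the target is a linear function of the features, and one where the observations fall into three tight clusters whose target levels do not follow a line. Everything in it (the data, the cluster centers, the noise levels, and the helper `mean_r2`) is an assumption chosen for illustration, not something taken from the house prices data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Case 1: the target is a linear function of two features plus noise.
X_linear = rng.uniform(-3, 3, size=(400, 2))
y_linear = 3 * X_linear[:, 0] - 2 * X_linear[:, 1] + rng.normal(0, 0.5, size=400)

# Case 2: three tight clusters; the middle cluster has a high target level
# and the two outer clusters a low one, so no straight line fits well.
centers = np.array([[-4.0, -4.0], [0.0, 0.0], [4.0, 4.0]])
levels = np.array([0.0, 10.0, 0.0])
labels = rng.integers(0, 3, size=400)
X_clustered = centers[labels] + rng.normal(0, 0.5, size=(400, 2))
y_clustered = levels[labels] + rng.normal(0, 0.5, size=400)

# Shared, shuffled folds so both models are scored on the same splits.
folds = KFold(n_splits=5, shuffle=True, random_state=0)

def mean_r2(model, X, y):
    """Return the mean cross-validated R^2 of a model (hypothetical helper)."""
    return cross_val_score(model, X, y, cv=folds).mean()

cases = {"linear": (X_linear, y_linear), "clustered": (X_clustered, y_clustered)}
results = {}
for case, (X_case, y_case) in cases.items():
    results[case] = {
        "OLS": mean_r2(LinearRegression(), X_case, y_case),
        "KNN": mean_r2(KNeighborsRegressor(n_neighbors=10), X_case, y_case),
    }
    print("%s data: OLS R^2 = %0.2f, KNN R^2 = %0.2f"
          % (case, results[case]["OLS"], results[case]["KNN"]))
```

On data built this way, I would expect OLS to score higher in the linear case, since it matches the true form of the relationship, and KNN to score higher in the clustered case, since each point's nearest neighbors share its target level.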