# Challenge: Model Comparison
Find a data set and build a KNN Regression and an OLS regression. Compare the two. How similar are they? Do they miss in different ways? Describe the models' behaviors and why you favor one model or the other. Try to determine whether there is a situation where you would change your mind, or whether one is unambiguously better than the other. Lastly, try to note what it is about the data that causes the better model to outperform the weaker model. 

Source: https://www.kaggle.com/dongeorge/beer-consumption-sao-paulo/version/2#_=_

In [0]:
import pandas as pd
import numpy as np
import scipy
from scipy import stats
import sklearn
from sklearn import linear_model
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd")

In [0]:
drink_df = pd.read_csv('https://raw.githubusercontent.com/RRamirez21/ThinkfulDrills/master/Beer_Consumption.csv')

In [80]:
drink_df.head(5)

Unnamed: 0,date,midTemperature (C),minTemperature (C),maxTemperature (C),precipitacion (mm),weekend,beerConsumption (L)
0,1/1/2015,27.3,23.9,32.5,0.0,0,25.461
1,1/2/2015,27.02,24.5,33.5,0.0,0,28.972
2,1/3/2015,24.82,22.4,29.9,0.0,1,30.814
3,1/4/2015,23.98,21.5,28.6,1.2,1,29.799
4,1/5/2015,23.82,21.0,28.3,0.0,0,28.9


In [0]:
drink_df.columns = ['date', 'midTemp', 'minTemp', 'maxTemp', 
'rain', 'weekend', 'beerCons']

In [82]:
drink_df.shape


(365, 7)

In [0]:
# Replace the date strings with a day-of-year index (1 to 365) so the column is numeric
drink_df['date'] = np.arange(1, len(drink_df) + 1)

In [84]:
regr = linear_model.LinearRegression()
Y = drink_df[['beerCons']]
X = drink_df[['date', 'midTemp', 'minTemp', 'maxTemp', 
'rain', 'weekend']]

regr.fit(X, Y)

print('\nCoefficients: \n', regr.coef_)
print('\nIntercept: \n', regr.intercept_)
print('\nR-squared: \n')
print(regr.score(X, Y))


Coefficients: 
 [[ 4.02742932e-03 -3.07717015e-02  4.68156108e-02  6.75534966e-01
  -5.84848567e-02  5.19887505e+00]]

Intercept: 
 [5.34565627]

R-squared: 

0.7315787113951985


In [0]:
Xdf = X
Ydf = Y

from sklearn.model_selection import KFold
X = Xdf.values
y = Ydf.values

In [86]:
kf = KFold(n_splits=10, random_state=1, shuffle=True)
kf.get_n_splits(X)

print(kf)

r2_scores=[]

for train_index, test_index in kf.split(X):
  #print("TRAIN:", train_index, "TEST:", test_index)
  X_train, X_test = X[train_index], X[test_index]
  y_train, y_test = y[train_index], y[test_index]
  regr.fit(X_train, y_train)
  r2_scores.append(regr.score(X_test, y_test))
  
  
r2_scores


KFold(n_splits=10, random_state=1, shuffle=True)


[0.7266384627978563,
 0.6818224014018928,
 0.683215388831223,
 0.8037159568623384,
 0.5557461469357415,
 0.7058978046954235,
 0.7155132913792982,
 0.7923353516143873,
 0.6658209707511891,
 0.6145861797589984]

In [87]:
np.mean(r2_scores)

0.6945291955028349

In [88]:
from sklearn.neighbors import KNeighborsRegressor
neighbors = KNeighborsRegressor(n_neighbors=5)
X2 = drink_df[['date', 'midTemp', 'minTemp', 'maxTemp', 
'rain', 'weekend']]
Y2 = drink_df[['beerCons']]
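# Note: fit() returns the fitted estimator, so y_pred is the KNN model itself, not an array of predictions
y_pred = neighbors.fit(X2,Y2)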

## Predict beer consumption for the 180th day
# midTemp = 27 Celsius
# minTemp = 24 Celsius
# maxTemp = 30 Celsius
# rain    = 0 mm precipitation
# weekend = 0 (not a weekend)
print('KNN predicts:', neighbors.predict([[180, 27, 24, 30 , 0, 0]]))

KNN predicts: [[23.1374]]


In [89]:
# In-sample R-squared: the model is scored on the same rows it was fitted on
print(y_pred.score(X2, Y2))

0.4755361628687208


In [0]:
X2df = X2
Y2df = Y2

from sklearn.model_selection import KFold
X2 = X2df.values
y2 = Y2df.values

In [91]:
kf2 = KFold(n_splits=10, random_state=1, shuffle=True)
kf2.get_n_splits(X2)

print(kf2)

ypred_scores=[]

for train_index, test_index in kf2.split(X2):
  #print("TRAIN:", train_index, "TEST:", test_index)
  X_train, X_test = X2[train_index], X2[test_index]
  y_train, y_test = y2[train_index], y2[test_index]
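  # Refit the KNN model on each training fold so the held-out fold stays unseen.
  # Note: the scores listed below came from scoring a KNN model fitted on all 365 rows
  # (the linear model was being refit here instead), so they are in-sample;
  # refitting per fold as done here gives different values.
  neighbors.fit(X_train, y_train)
  ypred_scores.append(neighbors.score(X_test, y_test))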
  
  
ypred_scores


KFold(n_splits=10, random_state=1, shuffle=True)


[0.4328209992729607,
 0.32694105310245203,
 0.5412079869960815,
 0.5670738424709045,
 0.21786036466409264,
 0.36385967360681837,
 0.5013545109977888,
 0.5465000727548602,
 0.5140975622245572,
 0.3522425525148175]

In [94]:
np.mean(ypred_scores)

0.43639586186053336

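Our two models differ, with the linear regression giving more accurate results than the KNN model: its mean R-squared across the 10 folds is about 0.69, against about 0.44 for the KNN scores above. KNN regression predicts the mean of the k nearest training points (here k = 5), so its quality depends entirely on which points count as "near". Because the features are unscaled, the day index (1 to 365) has a far larger range than the temperatures or the 0/1 weekend flag, so it weighs heavily in the distance, while the weekend flag barely affects it. Yet the OLS coefficients suggest that the weekend flag (about +5.2 L) and maxTemp (about +0.68 L per degree) carry most of the signal, and that the relationship is roughly linear, which is a pattern a linear model captures directly. We would reconsider if the features were standardized or if the relationship turned out to be strongly nonlinear, since KNN can follow local structure that a linear model cannot. Likewise, if we were to create categories by setting an arbitrary threshold of beer consumption (let's say 27 L), a KNN classifier might do comparatively better at identifying whether a day falls below or above that value, though this is untested here. Additionally, we could weight the neighbors (for example by distance, so that closer points count more) to update our KNN model and possibly improve its accuracy.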
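
To make the point about feature scale and neighbor weighting concrete, here is a minimal from-scratch sketch of KNN regression in plain Python. It is not the scikit-learn implementation used above. The helper names (`knn_regress`, `fit_standardizer`) and the toy rows (day index, max temperature, consumption) are hypothetical and made up for illustration, not taken from the beer data set.

```python
import math

def knn_regress(train_X, train_y, query, k=5, weighted=False):
    """Predict a target as the mean of the k nearest rows (optionally distance-weighted)."""
    # Pair each training target with its Euclidean distance to the query, keep the k closest
    pairs = sorted(
        ((math.dist(row, query), target) for row, target in zip(train_X, train_y)),
        key=lambda pair: pair[0],
    )[:k]
    if not weighted:
        return sum(t for _, t in pairs) / len(pairs)
    # Inverse-distance weighting; exact matches (distance 0) are averaged on their own
    exact = [t for d, t in pairs if d == 0]
    if exact:
        return sum(exact) / len(exact)
    weights = [1.0 / d for d, _ in pairs]
    return sum(w * t for w, (_, t) in zip(weights, pairs)) / sum(weights)

def fit_standardizer(rows):
    """Return a function that rescales a row to zero mean and unit variance per column."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return lambda row: [(v - m) / s for v, m, s in zip(row, means, stds)]

# Hypothetical toy rows: [day index, max temperature in Celsius]
X_toy = [[1, 20.0], [2, 21.0], [3, 35.0], [100, 34.5], [200, 36.0], [300, 19.0]]
y_toy = [20.0, 21.0, 30.0, 29.0, 31.0, 19.0]  # made-up consumption values
query = [4, 35.0]  # an early, hot day

# Unscaled: the day index dominates, so the neighbors are the adjacent cool days
pred_raw = knn_regress(X_toy, y_toy, query, k=3)

# Standardized: temperature counts as much as the day, so hot days become the neighbors
scale = fit_standardizer(X_toy)
pred_scaled = knn_regress([scale(r) for r in X_toy], y_toy, scale(query), k=3)

print(pred_raw, pred_scaled)
```

In scikit-learn, a comparable setup would be a `Pipeline` that chains `StandardScaler` with `KNeighborsRegressor(weights='distance')`, cross-validated the same way as the linear model. Whether that would close the gap on this data set is untested here.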