# Challenge: model comparison
Here let's work on regression. Find a data set and build a KNN Regression and an OLS regression. Compare the two. How similar are they? Do they miss in different ways?

Create a Jupyter notebook with your models. At the end in a markdown cell write a few paragraphs to describe the models' behaviors and why you favor one model or the other. Try to determine whether there is a situation where you would change your mind, or whether one is unambiguously better than the other. Lastly, try to note what it is about the data that causes the better model to outperform the weaker model. Submit a link to your notebook below.

In [36]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV, ElasticNetCV
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score
from sqlalchemy import create_engine

import warnings
warnings.filterwarnings('ignore')

In [27]:
people_df = pd.read_csv('/home/uzi/Downloads/sentiment labelled sentences/500_Person_Gender_Height_Weight_Index.csv')
people_df.head(3)

Unnamed: 0,Gender,Height,Weight,Index
0,Male,174,96,4
1,Male,189,87,2
2,Female,185,110,4


In [28]:
people_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 4 columns):
Gender    500 non-null object
Height    500 non-null int64
Weight    500 non-null int64
Index     500 non-null int64
dtypes: int64(3), object(1)
memory usage: 15.8+ KB


In [29]:
people_df['is_male'] = pd.get_dummies(people_df.Gender, drop_first=True)
people_df.head(1)

Unnamed: 0,Gender,Height,Weight,Index,is_male
0,Male,174,96,4,1


In [45]:
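# Build two KNN regressors: one with uniform weights, one weighted by inverse distance.
knn = KNeighborsRegressor(n_neighbors=10)
knn_w = KNeighborsRegressor(n_neighbors=10, weights='distance')

X = people_df[['Height','Weight']]
Y = people_df['Index']
knn.fit(X,Y)
knn_w.fit(X,Y)

# cross_val_score uses the estimator's default scorer, which for a regressor is R^2,
# so the "Accuracy" figures printed below are 5-fold cross-validated R^2 values.
score = cross_val_score(knn, X, Y, cv=5)
print("Unweighted Accuracy: %0.2f (+/- %0.2f)" % (score.mean(), score.std() * 2))

score_w = cross_val_score(knn_w, X, Y, cv=5)
print("Weighted Accuracy: %0.2f (+/- %0.2f)" % (score_w.mean(), score_w.std() * 2))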

Unweighted Accuracy: 0.94 (+/- 0.06)
Weighted Accuracy: 0.95 (+/- 0.06)


In [44]:
# statsmodels OLS
X_c = sm.add_constant(X)

sm.OLS(Y,X_c).fit().summary()

0,1,2,3
Dep. Variable:,Index,R-squared:,0.826
Model:,OLS,Adj. R-squared:,0.825
Method:,Least Squares,F-statistic:,1179.0
Date:,"Thu, 06 Feb 2020",Prob (F-statistic):,2.16e-189
Time:,21:03:39,Log-Likelihood:,-423.85
No. Observations:,500,AIC:,853.7
Df Residuals:,497,BIC:,866.3
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,6.1211,0.277,22.089,0.000,5.577,6.666
Height,-0.0350,0.002,-22.579,0.000,-0.038,-0.032
Weight,0.0337,0.001,42.998,0.000,0.032,0.035

0,1,2,3
Omnibus:,19.737,Durbin-Watson:,2.016
Prob(Omnibus):,0.0,Jarque-Bera (JB):,21.213
Skew:,-0.503,Prob(JB):,2.48e-05
Kurtosis:,3.071,Cond. No.,2200.0


In [43]:
alphas = [np.power(10.0,p) for p in np.arange(-10,40,1)]

# sklearn's OLS, Ridge, Lasso and ElasticNet
lrm = LinearRegression()

lrm.fit(X,Y)

# Ridge cross-validation regression
ridge = RidgeCV(alphas=alphas, cv=5) 
ridge.fit(X,Y)

# Lasso
lasso = LassoCV(alphas=alphas)
lasso.fit(X,Y)

# ElasticNet
e_net = ElasticNetCV(alphas=alphas, l1_ratio=0.5)
e_net.fit(X,Y)

print("R-squared of the OLS model: {}".format(lrm.score(X, Y)))
print("R-squared of the Ridge model: {}\nBest Ridge alpha value is: {}".format(ridge.score(X, Y), ridge.alpha_))
print("R-squared of the Lasso model: {}\nBest Lasso alpha value is: {}".format(lasso.score(X, Y), lasso.alpha_))
print("R-squared of the ElasticNet model: {}\nBest ElasticNet alpha value is: {}".format(e_net.score(X, Y), e_net.alpha_))

R-squared of the OLS model: 0.825906765937448
R-squared of the Ridge model: 0.8258945788888529
Best Ridge alpha value is: 1000.0
R-squared of the Lasso model: 0.8259067659371919
Best Lasso alpha value is: 1e-05
R-squared of the ElasticNet model: 0.8259067659305921
Best ElasticNet alpha value is: 0.0001


# What are we seeing?

The $Index$ variable is defined as follows:
+ 0 - Extremely Weak
+ 1 - Weak
+ 2 - Normal
+ 3 - Overweight
+ 4 - Obesity
+ 5 - Extreme Obesity

Per the documentation, $Index$ serves roughly as a proxy for Body Mass Index (BMI). Since BMI is a direct function of height (m) and weight (kg), I suspect that our linear regression models are failing to achieve a higher $R^2$ because they are handicapped by their own assumption of linearity. In other words, although
> $BMI = \frac{weight\ (kg)}{[height\ (m)]^2}$

an OLS model can only sum weighted height and weight values; it cannot multiply or divide them. To illustrate:

In [34]:
# Create a new feature that captures the multiplicative relationship between height and weight
people_df['bmi'] = people_df.Weight / (people_df.Height / 100)**2
people_df.bmi.head()

0    31.708284
1    24.355421
2    32.140248
3    27.350427
4    27.476240
Name: bmi, dtype: float64

In [48]:
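# Run a new OLS model with the new feature
X = people_df[['Height','Weight','bmi']]
Y = people_df['Index']

X_c = sm.add_constant(X)
# Note: the fit below uses X rather than X_c, so the model has no intercept and
# statsmodels reports the uncentered R-squared in the summary.
sm.OLS(Y,X).fit().summary()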

0,1,2,3
Dep. Variable:,Index,R-squared (uncentered):,0.972
Model:,OLS,Adj. R-squared (uncentered):,0.972
Method:,Least Squares,F-statistic:,5764.0
Date:,"Thu, 06 Feb 2020",Prob (F-statistic):,0.0
Time:,21:08:14,Log-Likelihood:,-506.28
No. Observations:,500,AIC:,1019.0
Df Residuals:,497,BIC:,1031.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Height,0.0001,0.001,0.196,0.845,-0.001,0.001
Weight,0.0144,0.002,7.563,0.000,0.011,0.018
bmi,0.0580,0.004,14.536,0.000,0.050,0.066

0,1,2,3
Omnibus:,13.589,Durbin-Watson:,1.961
Prob(Omnibus):,0.001,Jarque-Bera (JB):,14.18
Skew:,-0.412,Prob(JB):,0.000833
Kurtosis:,3.016,Cond. No.,29.9
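
The effect of adding a ratio feature can also be checked on synthetic data where the true relationship is known. The sketch below is an illustration under stated assumptions, not a rerun of the notebook: heights and weights are drawn uniformly at random, the target is a BMI value binned with hypothetical cutoffs (the data set's actual $Index$ cutoffs are not given here), and `r_squared` is a helper written for this example. Both fits include an intercept, so the two $R^2$ values are centered and comparable with each other.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic inputs (assumed ranges, not the notebook's data).
height_cm = rng.uniform(140, 200, n)
weight_kg = rng.uniform(50, 160, n)
bmi = weight_kg / (height_cm / 100) ** 2

# Hypothetical cutoffs that bin BMI into six ordered classes, 0 through 5.
cutoffs = [18.5, 25.0, 30.0, 35.0, 40.0]
index = np.digitize(bmi, cutoffs).astype(float)


def r_squared(features, target):
    """Fit OLS with an intercept via least squares and return the centered R^2."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Model 1: height and weight only. Model 2: the same plus the ratio feature.
r2_linear = r_squared(np.column_stack([height_cm, weight_kg]), index)
r2_with_ratio = r_squared(np.column_stack([height_cm, weight_kg, bmi]), index)

print(f"height + weight only:     R^2 = {r2_linear:.3f}")
print(f"plus weight / height^2:   R^2 = {r2_with_ratio:.3f}")
```

Because the second model nests the first, its in-sample $R^2$ cannot be lower; the size of the gap indicates how much of the signal the purely additive terms miss.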


Since the four true BMI categories are...
+ Underweight = <18.5
+ Normal weight = 18.5–24.9
+ Overweight = 25–29.9
+ Obesity = BMI of 30 or greater

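and the $Index$ feature has six categories (0 through 5), the classifications are not one-to-one. Nonetheless, adding the $bmi$ feature raised the reported $R^2$ to 97.2%, compared with 82.6% for the OLS model on height and weight alone and 94% for the K-Nearest Neighbors regression. Two caveats apply to that comparison: the 97.2% figure is an uncentered $R^2$ from a fit without a constant term, and it is computed in-sample, whereas the KNN figure is a five-fold cross-validated $R^2$. The numbers are therefore not directly comparable. I imagine the KNN regression model outperformed the original OLS model because it could capture the multiplicative relationship between height and weight; it isn't hampered by the same linearity assumption as linear regression.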
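
One detail worth checking is the "R-squared (uncentered)" label in the second OLS summary. The uncentered statistic measures residuals against $\sum y^2$ rather than $\sum (y - \bar{y})^2$, so a target with a mean far from zero can score well even when the features explain little of its variation. The sketch below illustrates this on synthetic data; the variable names and distributions are assumptions made for the example, and the feature is deliberately unrelated to the target.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Synthetic data: a positive feature with no relationship to the target,
# and a target whose mean is well away from zero.
x = rng.uniform(140, 200, n)
y = 3.0 + rng.normal(0.0, 1.0, n)

# Least-squares fit through the origin (no constant term).
slope = np.sum(x * y) / np.sum(x * x)
ss_res = np.sum((y - slope * x) ** 2)

# Same residuals, two different baselines.
r2_uncentered = 1.0 - ss_res / np.sum(y ** 2)
r2_centered = 1.0 - ss_res / np.sum((y - y.mean()) ** 2)

print(f"uncentered R^2: {r2_uncentered:.3f}")
print(f"centered R^2:   {r2_centered:.3f}")
```

For the same residuals the uncentered value can never be lower than the centered one, which is why an $R^2$ from a no-constant fit should be compared with other models only after refitting with the constant included.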