### Jose Mijangos<br>CST 463<br>Oct 10, 2018
# Stacking Regressor for Diabetes dataset
## Introduction
The Diabetes dataset contains ten features that characterize a person's health. An eleventh variable, the target, measures the progression of the disease one year after baseline. Our goal is to implement a stacking regressor and compare its performance to linear regression. We do this by measuring how well each regressor predicts disease progression on data it was not trained on.
## Imported Modules

In [1]:
%matplotlib inline
from sklearn import preprocessing
import warnings
if __name__ == '__main__':
    warnings.filterwarnings(action='ignore', category=DeprecationWarning)
import numpy as np
import numpy.linalg as LA
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from sklearn import datasets
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn import svm
from sklearn.linear_model import LinearRegression
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import ElasticNet, Lasso
from sklearn.neighbors import KNeighborsClassifier
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone

## Stacking Regressor Class
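This stacking regressor class has a single blender (meta-model) but supports any number of base predictors. During `fit`, each predictor is trained with k-fold cross-validation (10 folds by default), and its out-of-fold predictions become the blender's training features. During `predict`, the predictions of each predictor's fold models are averaged before being passed to the blender. The class provides `fit`, `predict`, and `score` methods; `score` returns the R<sup>2</sup> value.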

In [2]:
class StackingRegressor(BaseEstimator, RegressorMixin):
    """Stack several base predictors under a single blender (meta-model)."""

    def __init__(self, predictors, meta_model, n_folds=10):
        self.predictors = predictors
        self.meta_model = meta_model
        self.n_folds = n_folds

    def fit(self, X, y):
        # One list of fitted clones per base predictor (one clone per fold).
        self.predictors_ = [list() for _ in self.predictors]
        self.meta_model_ = clone(self.meta_model)
        kfold = KFold(n_splits=self.n_folds, shuffle=True, random_state=156)
        # Column i holds predictor i's predictions on rows it was not trained on.
        out_of_fold_predictions = np.zeros((X.shape[0], len(self.predictors)))
        for i, model in enumerate(self.predictors):
            for train_index, holdout_index in kfold.split(X, y):
                instance = clone(model)
                self.predictors_[i].append(instance)
                instance.fit(X[train_index], y[train_index])
                y_pred = instance.predict(X[holdout_index])
                out_of_fold_predictions[holdout_index, i] = y_pred
        # The blender learns to map base-model predictions to the target.
        self.meta_model_.fit(out_of_fold_predictions, y)
        return self

    def _meta_features(self, X):
        # Average the fold models of each predictor: one column per predictor.
        return np.column_stack([
            np.column_stack([model.predict(X) for model in predictors]).mean(axis=1)
            for predictors in self.predictors_])

    def predict(self, X):
        return self.meta_model_.predict(self._meta_features(X))

    def score(self, X, y):
        # R^2 = 1 - (residual sum of squares) / (total sum of squares)
        p = self.predict(X)
        return 1 - np.sum((y - p)**2) / np.sum((y - np.mean(y))**2)
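
To make the role of `out_of_fold_predictions` concrete, here is a minimal sketch of the same idea with a single base model. It assumes synthetic data and a plain least-squares fit via `np.linalg.lstsq` standing in for the scikit-learn predictors; the helper name `out_of_fold_predict` and the `_demo` variables are hypothetical and not part of the class above.

```python
import numpy as np

def out_of_fold_predict(X, y, n_folds=5, seed=0):
    """Return one prediction per row, each made by a model that never saw that row."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(X.shape[0])
    folds = np.array_split(indices, n_folds)
    oof = np.full(X.shape[0], np.nan)
    for k, holdout in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != k])
        # Least-squares fit with an intercept column (stand-in for a base predictor).
        A_train = np.column_stack([X[train], np.ones(len(train))])
        coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        A_hold = np.column_stack([X[holdout], np.ones(len(holdout))])
        oof[holdout] = A_hold @ coef
    return oof

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

oof_demo = out_of_fold_predict(X_demo, y_demo)
print(oof_demo.shape, np.isnan(oof_demo).sum())
```

Every row ends up with exactly one prediction, so the blender can be trained on the full training set without seeing predictions from a model that was fitted on the same row.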

## Preparing Data for Machine Learning
We retrieve the Diabetes dataset from sklearn. Then we store the independent features in X and the target feature in y.

In [3]:
dat = datasets.load_diabetes()
X, y = dat["data"], dat["target"]

Next, we split the data so that 80% of the instances are used as the training set and the rest are used as the test set.

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)
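
As a quick sanity check, the sizes of the two sets can be inspected after the split. The sketch below repeats the same call under different variable names (`X_all`, `X_tr`, `X_te` and so on are hypothetical names chosen so that nothing in the notebook is overwritten) and uses `load_diabetes(return_X_y=True)` as a shortcut for loading features and target.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Same dataset and split parameters as above, under illustrative names.
X_all, y_all = datasets.load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.20, random_state=1)

# Number of rows in each part, and the number of features.
print(X_tr.shape, X_te.shape)
```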

## Blender and Base Regressors
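For our blender we use lasso regression. The base models are elastic net, kernel ridge regression, k-nearest neighbors, and a polynomial-kernel SVM. Note that the last two are instantiated below as `KNeighborsClassifier` and `svm.SVC`, which are classifiers: they treat each distinct target value as a class label instead of predicting a continuous value.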

In [5]:
lasso = make_pipeline(RobustScaler(), Lasso(alpha=0.0005, random_state=1))

ENet = make_pipeline(RobustScaler(), ElasticNet(alpha=0.0005, l1_ratio=.75, random_state=3))
KRR = KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5)
knn = KNeighborsClassifier(n_neighbors=2)
sv = svm.SVC(kernel='poly')
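
Because `KNeighborsClassifier` and `svm.SVC` predict class labels, one variation to consider is swapping in their regression counterparts. The sketch below is an assumption about how that could look, not what this notebook ran: `KNeighborsRegressor` and `SVR` are existing scikit-learn estimators, but the names `knn_reg` and `sv_reg`, the synthetic data, and the reuse of the same hyperparameters are illustrative. The scores reported in this notebook were not produced with these models.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

# Hypothetical regression counterparts of knn and sv (not used for the reported results).
knn_reg = KNeighborsRegressor(n_neighbors=2)
sv_reg = SVR(kernel='poly')

# Small synthetic regression problem, only to show that both models
# fit continuous targets and return continuous predictions.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 3))
y_demo = X_demo[:, 0] + 0.5 * X_demo[:, 1]

knn_pred = knn_reg.fit(X_demo, y_demo).predict(X_demo[:5])
sv_pred = sv_reg.fit(X_demo, y_demo).predict(X_demo[:5])
print(knn_pred.shape, sv_pred.shape)
```

Either estimator could be passed in the `predictors` tuple in place of `knn` or `sv`, since the stacking class only relies on `fit` and `predict`.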

## R<sup>2</sup> Value on Test Data
The R<sup>2</sup> value measures how much of the variance in the target a regressor explains. A perfect regressor would score an R<sup>2</sup> value of 1. A negative R<sup>2</sup> value means that a horizontal line at the mean of the target would perform better than the regressor.
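
The three cases can be checked on a toy example. This is a minimal sketch using the same formula as the `score` method above; the function name `r2_score_manual` and the four-point dataset are made up for illustration.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])

perfect = r2_score_manual(y_true, y_true)              # exact predictions
mean_only = r2_score_manual(y_true, np.full(4, 2.5))   # horizontal line at the mean
reversed_ = r2_score_manual(y_true, y_true[::-1])      # worse than the mean

print(perfect, mean_only, reversed_)  # 1.0 0.0 -3.0
```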

In [6]:
stacking_regr = StackingRegressor(predictors=(KRR,ENet,sv,knn), meta_model=lasso)
stacking_regr.fit(X_train, y_train)
print(stacking_regr.__class__.__name__, stacking_regr.score(X_test, y_test))

linear_regr = LinearRegression()
linear_regr.fit(X_train, y_train)
print(linear_regr.__class__.__name__, linear_regr.score(X_test, y_test))

StackingRegressor 0.43134629698615057
LinearRegression 0.43843604017332694


## Cross Validated MSE
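A single train/test split can be lucky or unlucky, so we also compute the 5-fold cross-validated mean squared error (MSE) on the full dataset. A perfect regressor would score an MSE of 0, and lower is better. Because `cross_val_score` reports the negated MSE, we flip the sign of the mean.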
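
The sign convention is easiest to see on a small example. Scikit-learn scorers follow a higher-is-better convention, so the MSE scorer is named `neg_mean_squared_error` and returns non-positive values. The sketch below uses synthetic data; `X_demo`, `y_demo`, and the coefficients are assumptions for illustration, not part of the analysis.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic linear data with Gaussian noise (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 3.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.5, size=100)

# One score per fold; each is the negated MSE on that fold's held-out rows.
neg_mse = cross_val_score(LinearRegression(), X_demo, y_demo,
                          cv=5, scoring="neg_mean_squared_error")

# Flip the sign to report an ordinary (non-negative) MSE.
cv_mse = -np.mean(neg_mse)
print(neg_mse.shape, cv_mse)
```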

In [7]:
print(stacking_regr.__class__.__name__,
      -np.mean(cross_val_score(stacking_regr, X, y, cv=5, scoring="neg_mean_squared_error")))
print(linear_regr.__class__.__name__,
      -np.mean(cross_val_score(linear_regr, X, y, cv=5, scoring="neg_mean_squared_error")))

StackingRegressor 3001.2449536896247
LinearRegression 2993.0729432998864


## Conclusion
On the test set, the stacking regressor's R<sup>2</sup> value (0.431) is close to that of linear regression (0.438). However, a single score like this depends on how the data happened to be split.<br><br>
The 5-fold cross-validated MSE is a more reliable comparison because every instance is used for testing exactly once. Here the two regressors are again nearly tied, with linear regression slightly ahead (about 2993 versus 3001).<br><br>
We conclude that, for the Diabetes dataset, linear regression performs about as well as our stacking regressor.