# Boosting and Stacking Exercises

## Introduction
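We will be using the Pima Indians Diabetes database.

Each record in the dataset provides a set of diagnostic measurements (the predictors) and an Outcome column that indicates whether the patient has diabetes.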

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import pandas as pd  # data processing
import numpy as np   # linear algebra
import matplotlib.pyplot as plt  #Plotting
from sklearn.preprocessing import StandardScaler

In [None]:
from __future__ import print_function
import os
data_path = ['..', '..', 'data']

### Question 1

Import the data from the file diabetes.csv and examine the shape and data types. Rather than listing the type of each column separately, aggregate the types by count.

Determine if the float columns need to be scaled.

In [None]:
import pandas as pd
import numpy as np

filepath = '../input/pima-indians-diabetes-database/diabetes.csv'
data = pd.read_csv(filepath, sep=',')

The data has eight predictor columns and one target column.

In [None]:
data.shape

In [None]:
data.dtypes.value_counts()

In [None]:
# Mask to select float columns (np.float was removed from NumPy, so use np.float64)
float_columns = (data.dtypes == np.float64)

# Check whether the maximum of every float column is 1.0
print( (data.loc[:,float_columns].max()==1.0).all() )

# Check whether the minimum of every float column is -1.0
# (False for either check means the columns are not already scaled to [-1, 1])
print( (data.loc[:,float_columns].min()==-1.0).all() )
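
As a rough illustration of the scaling check above, the sketch below compares the ranges of two float columns and standardizes them with StandardScaler. The frame toy and its values are made up for the example and are not read from diabetes.csv. Tree-based models such as the boosted trees used later do not depend on feature scale, so scaling matters mainly for the logistic regression in question 6.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame standing in for two float columns of the data
toy = pd.DataFrame({'BMI': [22.0, 27.5, 31.0, 35.5],
                    'DiabetesPedigreeFunction': [0.2, 0.4, 0.6, 0.8]})

# Columns on very different ranges are candidates for scaling
print(toy.agg(['min', 'max']))

# Standardize each column to zero mean and unit variance
scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)
print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))
```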

In [None]:
data = data.rename(columns={'Outcome':'Diabetic'})


In [None]:
data.columns

### Question 2
Integer encode the target column (Diabetic).
Split the data into train and test data sets. Decide if the data will be stratified or not during the train/test split.

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

data['Diabetic'] = le.fit_transform(data['Diabetic'])

le.classes_

In [None]:
data.Diabetic.unique()

NOTE: We are about to create training and test sets from data. On those datasets, we are going to run grid searches over many choices of parameters. This can take some time. In order to shorten the grid search time, feel free to downsample data and create X_train, X_test, y_train, y_test from the downsampled dataset.
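
A minimal sketch of the downsampling mentioned in the note, assuming a random row sample is acceptable. The frame full is a hypothetical stand-in for data, and the 25% fraction is an arbitrary choice for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the full data frame
rng = np.random.default_rng(0)
full = pd.DataFrame({'feature': rng.normal(size=1000),
                     'Diabetic': rng.integers(0, 2, size=1000)})

# Keep a random 25% of the rows; random_state makes the sample repeatable
small = full.sample(frac=0.25, random_state=42)
print(full.shape, small.shape)
```

The train/test split and the grid searches would then be run on small instead of the full frame.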

Now split the data into train and test data sets. A stratified split was not used here. If there are issues with any of the error metrics on the test set, it can be a good idea to start model fitting over using a stratified split. Boosting is a pretty powerful model, though, so it may not be necessary in this case.

In [None]:
from sklearn.model_selection import train_test_split

# Alternatively, we could stratify the categories in the split, as was done previously
feature_columns = [x for x in data.columns if x != 'Diabetic']

X_train, X_test, y_train, y_test = train_test_split(data[feature_columns], data['Diabetic'],
                 test_size=0.3, random_state=42)

In [None]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape
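
If the stratified alternative is needed, one possible sketch is to pass the labels through the stratify argument of train_test_split. The arrays below are synthetic and only illustrate the effect on class proportions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70% class 0, 30% class 1
X = np.arange(200).reshape(-1, 1)
y = np.array([0] * 140 + [1] * 60)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Both splits keep roughly the original share of class 1
print(y_tr.mean(), y_te.mean())
```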

### Question 3
Fit gradient boosted tree models with all parameters set to their defaults for the following numbers of trees (n_estimators = [25, 50, 100, 200, 400]) and evaluate the accuracy on the test data for each of these models.
Plot the accuracy as a function of the number of estimators.
Note: This question may take some time to execute, depending on how many different values are fit for estimators. Setting max_features=4 in the gradient boosting classifier will reduce the fitting time.

Also, this is similar to question 3 from week 9, except that there is no such thing as out-of-bag error for boosted models. And the warm_start=True setting has a bug in the gradient boosted model, so don't use it. Simply create the model inside the for loop and set the number of estimators at this time. This will make the fitting take a little longer. Additionally, boosting models tend to take longer to fit than bagged ones because the trees must be fit successively.
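
One alternative to refitting inside the loop, not used in this notebook, is to fit a single model with the largest number of trees and read off the test error at each stage with staged_predict. The sketch below uses synthetic data from make_classification and default parameters, so its numbers say nothing about the diabetes data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data, used only to keep the sketch self-contained
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit once with the largest number of trees of interest
gbc = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# staged_predict yields test predictions after 1, 2, ..., 100 trees
errors = [1.0 - accuracy_score(y_te, pred) for pred in gbc.staged_predict(X_te)]
for n in (25, 50, 100):
    print(n, errors[n - 1])
```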

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

error_list = list()

# Iterate through all of the possibilities for number of estimators
tree_list =  [125, 58, 106, 20, 100]
for n_trees in tree_list:
    
    # Initialize the gradient boost classifier
    GBC = GradientBoostingClassifier(n_estimators=n_trees, 
                                     subsample=0.5,
                                     max_features=4,
                                     random_state=32)

    # Fit the model (fit and predict both use DataFrames so feature names stay consistent)
    GBC.fit(X_train, y_train)
    y_pred = GBC.predict(X_test)

    # Get the error
    error = 1. - accuracy_score(y_test, y_pred)
    
    # Store it
    error_list.append(pd.Series({'n_trees': n_trees, 'error': error}))

# Sort by number of trees so the plot below reads left to right
error_df = pd.concat(error_list, axis=1).T.set_index('n_trees').sort_index()

error_df

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

sns.set_context('talk')
sns.set_style('white')
sns.set_palette('dark')

# Create the plot
ax = error_df.plot(marker='o')

# Set parameters
ax.set(xlabel='n_trees', ylabel='error')
ax.set_xlim(0, max(error_df.index)*1.1);

### Question 4
Using a grid search with cross-validation, fit a new gradient boosted classifier with a list of estimators, similar to question 3. Also consider varying the learning rates (0.1, 0.01, 0.001, etc.), the subsampling value (1.0 or 0.5), and the number of maximum features (1, 2, etc.).
Examine the parameters of the best fit model.
Calculate relevant error metrics on this model and examine the confusion matrix.
Note: this question may take some time to execute, depending on how many parameter values are included in the grid search. It is recommended to start with only a few to ensure everything is working correctly and then add more values. Setting max_features=4 in the gradient boosting classifier will reduce the fitting time.
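
A minimal sketch of a grid that also varies subsample and max_features, as the question suggests. It runs on synthetic data from make_classification with deliberately small values so that it finishes quickly; the specific grid values and cv=3 are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data, used only to keep the sketch self-contained
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# 2 x 2 x 2 x 2 = 16 parameter combinations
param_grid = {'n_estimators': [20, 50],
              'learning_rate': [0.1, 0.01],
              'subsample': [1.0, 0.5],
              'max_features': [2, 4]}

search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid=param_grid, scoring='accuracy', cv=3)
search.fit(X, y)

print(search.best_params_)
print(len(search.cv_results_['params']))
```

The number of fits grows as the product of the list lengths times the number of folds, which is why starting with a small grid is recommended.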

In [None]:
from sklearn.model_selection import GridSearchCV

# The parameters to be fit--only n_estimators and learning rate
# have been varied here for simplicity
param_grid = {'n_estimators': [20, 100],
              'learning_rate': [0.1, 0.01]}

# The grid search object
GV_GBC = GridSearchCV(GradientBoostingClassifier(subsample=0.5,
                                                 max_features=4,
                                                 random_state=32), 
                      param_grid=param_grid, 
                      scoring='accuracy',
                      n_jobs=-1)

# Do the grid search
GV_GBC = GV_GBC.fit(X_train, y_train)


In [None]:
# The best model
GV_GBC.best_estimator_

In [None]:
from sklearn.metrics import classification_report

y_pred = GV_GBC.predict(X_test)
# classification_report expects the true labels first, then the predictions
print(classification_report(y_test, y_pred))

In [None]:
from sklearn.metrics import confusion_matrix

sns.set_context('talk')
cm = confusion_matrix(y_test, y_pred)
ax = sns.heatmap(cm, annot=True, fmt='d')

### Question 5
Create an AdaBoost model and fit it using grid search, much like question 4. Try a range of estimators between 100 and 200.
Compare the errors from AdaBoost to those from the GradientBoostingClassifier.
NOTE: Setting max_features=4 in the decision tree classifier used as the base classifier for AdaBoost will reduce the fitting time.

In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

ABC = AdaBoostClassifier(DecisionTreeClassifier(max_features=4))

param_grid = {'n_estimators': [10, 120, 167],
              'learning_rate': [0.01, 0.001]}

GV_ABC = GridSearchCV(ABC,
                      param_grid=param_grid, 
                      scoring='accuracy',
                      n_jobs=-1)

GV_ABC = GV_ABC.fit(X_train, y_train)

In [None]:
# The best model
GV_ABC.best_estimator_

In [None]:
y_pred = GV_ABC.predict(X_test)
print(classification_report(y_test, y_pred))

In [None]:
sns.set_context('talk')
cm = confusion_matrix(y_test, y_pred)
ax = sns.heatmap(cm, annot=True, fmt='d')
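
To compare the errors of two models side by side, one option is to collect the main metrics into a single table. The helper summarize and the label lists below are hypothetical and only illustrate the layout; in the notebook the inputs would be y_test and the predictions from each grid search.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def summarize(y_true, y_pred, label):
    """Return the main test-set metrics for one model as a named Series."""
    return pd.Series({'accuracy': accuracy_score(y_true, y_pred),
                      'precision': precision_score(y_true, y_pred),
                      'recall': recall_score(y_true, y_pred),
                      'f1': f1_score(y_true, y_pred)}, name=label)

# Hypothetical labels and predictions standing in for the two models
y_true = [0, 0, 1, 1, 1, 0]
pred_a = [0, 0, 1, 0, 1, 0]
pred_b = [0, 1, 1, 1, 1, 0]

# One column per model, one row per metric
comparison = pd.concat([summarize(y_true, pred_a, 'model_a'),
                        summarize(y_true, pred_b, 'model_b')], axis=1)
print(comparison)
```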

### Question 6
Fit a logistic regression model with regularization. This can be a replica of a model that worked well in the exercises from week 4.
Using VotingClassifier, fit the logistic regression model along with either the GradientBoostingClassifier or the AdaBoost model (or both) from questions 4 and 5.
Determine the error as before and compare the results to the appropriate gradient boosted model(s).
Plot the confusion matrix for the best model created in this set of exercises.

In [None]:
from sklearn.linear_model import LogisticRegressionCV

# L2 regularized logistic regression
LR_L2 = LogisticRegressionCV(Cs=5, cv=4, penalty='l2').fit(X_train, y_train)
y_pred = LR_L2.predict(X_test)
print(classification_report(y_test, y_pred))

In [None]:
sns.set_context('talk')
cm = confusion_matrix(y_test, y_pred)
ax = sns.heatmap(cm, annot=True, fmt='d')

In [None]:
from sklearn.ensemble import VotingClassifier

# The combined model--logistic regression and gradient boosted trees
estimators = [('LR_L2', LR_L2), ('GBC', GV_GBC)]

# Though it wasn't done here, it is often desirable to train 
# this model using an additional hold-out data set and/or with cross validation
VC = VotingClassifier(estimators, voting='soft')
VC = VC.fit(X_train, y_train)
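
The cross-validation mentioned in the comment above could look roughly like the sketch below, which scores a soft-voting ensemble with cross_val_score. The data is synthetic, and the estimators are plain LogisticRegression and GradientBoostingClassifier instances rather than the tuned models from this notebook, so this is an assumption-laden illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data, used only to keep the sketch self-contained
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Soft voting averages the predicted class probabilities of both models
voter = VotingClassifier(
    [('lr', LogisticRegression(max_iter=1000)),
     ('gbc', GradientBoostingClassifier(n_estimators=50, random_state=0))],
    voting='soft')

# Each fold refits the whole ensemble and scores it on the held-out part
scores = cross_val_score(voter, X, y, cv=4, scoring='accuracy')
print(scores.mean(), scores.std())
```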

In [None]:
y_pred = VC.predict(X_test)
print(classification_report(y_test, y_pred))

In [None]:
sns.set_context('talk')
cm = confusion_matrix(y_test, y_pred)
ax = sns.heatmap(cm, annot=True, fmt='d')