## Day 35 Lecture 1 Assignment

In this assignment, we will learn about gradient boosting. We will use the Haberman dataset, which describes patient survival after breast cancer surgery (loaded below), and analyze the model fitted to it.

In [1]:
import warnings

import pandas as pd
import numpy as np

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    f1_score,
    make_scorer,
)
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Rule of thumb in practice:
# * instead of GradientBoostingClassifier, use XGBClassifier
# * instead of GradientBoostingRegressor, use XGBRegressor
from xgboost import XGBClassifier

import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

In [2]:
# Attributes:
# Age of patient at time of operation (numerical)
# Patient's year of operation (year - 1900, numerical)
# Number of positive axillary nodes detected (numerical)
# Survival status (class attribute)
#  -- 1 = the patient survived 5 years or longer
#  -- 2 = the patient died within 5 years

cols = ['age', 'op_year', 'nodes', 'survival']
cancer = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/haberman.data', names=cols)

In [3]:
cancer.head()

Unnamed: 0,age,op_year,nodes,survival
0,30,64,1,1
1,30,62,3,1
2,30,65,0,1
3,31,59,2,1
4,31,65,4,1


Check for missing data and remove all rows containing missing data

In [4]:
# answer below:
cancer = cancer.dropna()
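
The prompt asks for a check as well as a removal, and the cell above only drops rows. A minimal sketch of the check, assuming a hypothetical toy frame (`toy`, with one injected `NaN`) in place of `cancer`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same columns as `cancer`, plus one injected NaN.
toy = pd.DataFrame({
    "age": [30, 31, np.nan, 34],
    "op_year": [64, 59, 65, 60],
    "nodes": [1, 2, 0, 4],
    "survival": [1, 1, 2, 1],
})

# Count missing values per column before dropping anything.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop incomplete rows and report how many were removed.
cleaned = toy.dropna()
n_dropped = len(toy) - len(cleaned)
print(f"rows dropped: {n_dropped}")
```

Running the same two steps on `cancer` would show whether `dropna()` removes anything at all.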


Adjust the target variable so that it has values of either 0 or 1

In [5]:
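# Shift the labels from {1, 2} to {0, 1}:
# 0 = survived 5 years or longer, 1 = died within 5 years
cancer["survival"] -= 1

# Rename the target to match its new meaning
cancer = cancer.rename(columns={"survival": "death_5yrs"})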

cancer.head()

Unnamed: 0,age,op_year,nodes,death_5yrs
0,30,64,1,0
1,30,62,3,0
2,30,65,0,0
3,31,59,2,0
4,31,65,4,0
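
An alternative way to recode the target is an explicit mapping, which documents the coding and turns unexpected label values into `NaN` instead of silently shifting them. A small sketch on hypothetical toy labels (the `survival` series below is made up for illustration):

```python
import pandas as pd

# Hypothetical toy labels using the original coding: 1 = survived, 2 = died.
survival = pd.Series([1, 1, 2, 1, 2], name="survival")

# Explicit recoding: 1 -> 0 (survived 5+ years), 2 -> 1 (died within 5 years).
death_5yrs = survival.map({1: 0, 2: 1}).rename("death_5yrs")
print(death_5yrs.tolist())
```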


Split the data into train and test (20% in test)

In [6]:
# answer below:
X = cancer.drop(columns=['death_5yrs'])
y = cancer['death_5yrs']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
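
The `stratify=y` argument keeps the class proportions roughly equal in the train and test sets, which matters for an imbalanced target. A quick sketch on hypothetical toy labels (75 negatives, 25 positives) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 75 negatives and 25 positives.
y_toy = np.array([0] * 75 + [1] * 25)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification, each split should mirror the overall positive rate.
print(f"positive rate overall: {y_toy.mean():.2f}")
print(f"positive rate train:   {y_tr.mean():.2f}")
print(f"positive rate test:    {y_te.mean():.2f}")
```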

Create a gradient boosted classification algorithm with a learning rate of 0.01 and max depth of 5. Report the accuracy.

In [7]:
# answer below:

num_cols = ['age', 'op_year', 'nodes']


In [8]:
preprocessing = ColumnTransformer([
    ('scale', StandardScaler(), num_cols) 
], remainder='passthrough')

In [9]:
pipeline = Pipeline([
    ('preprocessing', preprocessing),
    ('xgb', XGBClassifier())
])

In [10]:
# Set the hyperparameters required by the exercise on the pipeline's XGBoost step
pipeline.set_params(xgb__learning_rate=0.01, xgb__max_depth=5)

pipeline.fit(X_train, y_train)
train_score = pipeline.score(X_train, y_train)
test_score = pipeline.score(X_test, y_test)

print(f'train_score: {train_score}')
print(f'test_score: {test_score}')
      

train_score: 0.8278688524590164
test_score: 0.6290322580645161


Print the confusion matrix for the test data. What do you notice about our predictions?

In [11]:
# answer below:
y_pred = pipeline.predict(X_test)
confusion_matrix(y_test, y_pred)

array([[38,  8],
       [15,  1]], dtype=int64)
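
To make the pattern in this matrix concrete, the sketch below recomputes a few summary numbers from the counts printed above. It assumes scikit-learn's layout (rows are true classes, columns are predicted classes); the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class).
cm = np.array([[38, 8],
               [15, 1]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
# Accuracy of a trivial model that always predicts the larger class.
majority_baseline = cm.sum(axis=1).max() / cm.sum()
recall_pos = tp / (tp + fn)      # share of actual deaths that were detected
precision_pos = tp / (tp + fp)   # share of predicted deaths that were correct

print(f"accuracy:          {accuracy:.3f}")
print(f"majority baseline: {majority_baseline:.3f}")
print(f"recall (class 1):  {recall_pos:.3f}")
print(f"precision (class 1): {precision_pos:.3f}")
```

The model finds only 1 of the 16 positive cases, and its accuracy (about 0.63) is below what always predicting class 0 would achieve (about 0.74).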

Print the confusion matrix for a learning rate of 1 and a learning rate of 0.5. What do you see now that stands out to you in the confusion matrix?

In [13]:
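# answer below:
# Per-class metrics for the learning_rate=0.01 model fitted earlier;
# the model is not refit with a learning rate of 1 or 0.5 in this cell.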
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.72      0.83      0.77        46
           1       0.11      0.06      0.08        16

    accuracy                           0.63        62
   macro avg       0.41      0.44      0.42        62
weighted avg       0.56      0.63      0.59        62
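
The comparison across learning rates that the prompt asks for could be sketched as a loop that refits the model and prints one confusion matrix per setting. This is a sketch under stated assumptions: it uses synthetic stand-in data rather than the Haberman data, and it swaps in scikit-learn's `GradientBoostingClassifier` for `XGBClassifier` so that it depends only on scikit-learn. No results for the real dataset are implied.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (not the Haberman data).
X_syn, y_syn = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0,
    weights=[0.75, 0.25], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

# Refit once per learning rate and keep each test-set confusion matrix.
matrices = {}
for lr in (0.01, 0.5, 1.0):
    model = GradientBoostingClassifier(
        learning_rate=lr, max_depth=5, random_state=0
    )
    model.fit(X_tr, y_tr)
    matrices[lr] = confusion_matrix(y_te, model.predict(X_te))
    print(f"learning_rate={lr}")
    print(matrices[lr])
```

In the notebook itself, the analogous loop would presumably call `pipeline.set_params(xgb__learning_rate=lr)` before each `pipeline.fit(X_train, y_train)`.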



Perform a grid search for the optimal learning rate. Instead of accuracy, use a metric that will help your model predict the positive class.

In [14]:
# Separate the training predictors by class
X_train_0 = X_train[y_train == 0]
X_train_1 = X_train[y_train == 1]

# Cap the majority class at a multiple of the minority class size.
# The multiplier was chosen by trial and error; this cell uses 1.2.
n_0 = round(X_train_1.shape[0] * 1.2)
n_1 = X_train_1.shape[0]

# Downsample the majority class so that it has fewer observations
X_train_0_sample = X_train_0.sample(n_0, replace=False, random_state=42)

# Re-combine data (using the downsampled X for majority class)
X_train_downsample = pd.concat((X_train_1, X_train_0_sample))
X_train_downsample = X_train_downsample.reset_index(drop=True)

y_train_downsample = np.array([1] * n_1 + [0] * n_0)
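
The cell above rebuilds the labels by hand, which works only because the concatenation order is known. A variant that selects `X` and `y` by the same index keeps rows and labels aligned automatically. A minimal sketch on hypothetical toy data (`X_toy`, `y_toy` are made up; the 1.2 multiplier is the one used above):

```python
import pandas as pd

# Hypothetical toy training data: 10 negatives followed by 4 positives.
X_toy = pd.DataFrame({"nodes": range(14)})
y_toy = pd.Series([0] * 10 + [1] * 4, name="death_5yrs")

n_pos = int((y_toy == 1).sum())
n_neg = round(n_pos * 1.2)  # same 1.2 multiplier as in the cell above

# Sample majority-class indices, then select X and y with the same index
# so that every label stays attached to its own row.
neg_idx = y_toy[y_toy == 0].sample(n_neg, replace=False, random_state=42).index
pos_idx = y_toy[y_toy == 1].index
keep_idx = neg_idx.union(pos_idx)

X_down = X_toy.loc[keep_idx].reset_index(drop=True)
y_down = y_toy.loc[keep_idx].reset_index(drop=True)
print(y_down.value_counts().to_dict())
```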

In [15]:
# answer below:
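# Note: max_features is a scikit-learn GradientBoostingClassifier parameter.
# XGBoost does not recognize it (hence the "might not be used" warnings in the
# output); its closest XGBoost counterpart is colsample_bytree.
params = {
    "xgb__subsample": [0.5, 0.75, 1.0],
    "xgb__max_features": [0.3, 0.6, 1.0],
    "xgb__max_depth": [3, 4, 5],
}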

# Fix the number of trees and tie the learning rate to it (2 / 50 = 0.04)
n_trees = 50
learning_rate = 2 / n_trees

pipeline.set_params(xgb__n_estimators=n_trees, xgb__learning_rate=learning_rate)


In [17]:
pipeline_cv = GridSearchCV(pipeline, params, verbose=1)
pipeline_cv.fit(X_train_downsample, y_train_downsample)

pipeline_cv.best_params_

Fitting 5 folds for each of 27 candidates, totalling 135 fits
Parameters: { max_features } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[Parallel(n_jobs=1)]: Done 135 out of 135 | elapsed:    5.8s finished


{'xgb__max_depth': 4, 'xgb__max_features': 0.3, 'xgb__subsample': 0.5}
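
The search above tunes tree parameters with the default accuracy scoring, whereas the prompt asks for a search over the learning rate with a metric that rewards finding the positive class. One way this could look is sketched below with F1 as the scorer. Assumptions: synthetic stand-in data, scikit-learn's `GradientBoostingClassifier` in place of `XGBClassifier`, and a hypothetical grid of candidate learning rates.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (not the Haberman data).
X_syn, y_syn = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0,
    weights=[0.7, 0.3], random_state=0,
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("gb", GradientBoostingClassifier(n_estimators=50, random_state=0)),
])

# Hypothetical candidate learning rates.
param_grid = {"gb__learning_rate": [0.01, 0.05, 0.1, 0.5, 1.0]}

# F1 on the positive class balances precision and recall for class 1.
search = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score), cv=5)
search.fit(X_syn, y_syn)

print(search.best_params_)
print(f"best cross-validated F1: {search.best_score_:.3f}")
```

Recall could be used instead of F1 if missing a positive case is considered much more costly than a false alarm.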

List the feature importances for the model with the optimal learning rate.

In [None]:
# answer below:
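
The answer cell is empty in the source. One possible approach is sketched here on synthetic stand-in data, again with scikit-learn's `GradientBoostingClassifier` substituted for `XGBClassifier`; the feature names are reused from the notebook purely as labels, and the printed values say nothing about the real dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

feature_names = ["age", "op_year", "nodes"]

# Synthetic stand-in data labeled with the notebook's column names.
X_arr, y_syn = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0,
    random_state=0,
)
X_syn = pd.DataFrame(X_arr, columns=feature_names)

model = GradientBoostingClassifier(
    learning_rate=0.1, n_estimators=50, random_state=0
).fit(X_syn, y_syn)

# Pair each importance with its feature name and sort from most to least important.
importances = pd.Series(
    model.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```

For the fitted search in this notebook, the analogous lookup would presumably be `pipeline_cv.best_estimator_['xgb'].feature_importances_`, paired with `num_cols`.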

