In [1]:
%reload_ext nb_black

# ❤️ Heart Disease 🤒

The data today is from the [Framingham Heart Study](https://www.framinghamheartstudy.org/). Below is an excerpt from [their Wikipedia page](https://en.wikipedia.org/wiki/Framingham_Heart_Study):

> The Framingham Heart Study is a long-term, ongoing cardiovascular cohort study of residents of the city of Framingham, Massachusetts. The study began in 1948 with 5,209 adult subjects from Framingham, and is now on its fourth generation of participants. Prior to the study almost nothing was known about the epidemiology of hypertensive or arteriosclerotic cardiovascular disease. Much of the now-common knowledge concerning heart disease, such as the effects of diet, exercise, and common medications such as aspirin, is based on this longitudinal study.

### Warm-up 🥵

Warm-up warm-ups
* Describe what boosting is.
* How do random forests avoid overfitting?

Actual warm-up
* How do we use residuals in gradient boosted trees?
* How do we avoid overfitting in gradient boosted trees?

## Data Import and EDA

In [None]:
import warnings

import pandas as pd
import numpy as np

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    f1_score,
    make_scorer,
)
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# In practice, prefer the xgboost equivalents of sklearn's gradient boosting models:
# * GradientBoostingClassifier -> XGBClassifier
# * GradientBoostingRegressor  -> XGBRegressor
from xgboost import XGBClassifier

import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
def print_vif(x):
    """Utility for checking multicollinearity assumption
    
    :param x: input features to check using VIF. This is assumed to be a pandas.DataFrame
    :return: None; the VIFs are printed as a pandas Series
    """
    # Silence numpy FutureWarning about .ptp
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        x = sm.add_constant(x)

    vifs = []
    for i in range(x.shape[1]):
        vif = variance_inflation_factor(x.values, i)
        vifs.append(vif)

    print("VIF results\n-------------------------------")
    print(pd.Series(vifs, index=x.columns))
    print("-------------------------------\n")

In [None]:
data_url = "https://docs.google.com/spreadsheets/d/1Tx7KJ7iW8IkiU-aERYFXsKvDsbFJbr80POW_2DyuYGQ/export?format=csv"
heart = pd.read_csv(data_url)
heart = heart.dropna()

Do basic EDA to get familiar with this heart data.

In [None]:
heart.shape

In [None]:
heart.head(3)

In [None]:
bin_cols = [
    "male",
    "currentSmoker",
    "BPMeds",
    "prevalentStroke",
    "prevalentHyp",
    "diabetes",
]

num_cols = [
    "age",
    "education",
    "cigsPerDay",
    "totChol",
    "sysBP",
    "diaBP",
    "BMI",
    "heartRate",
    "glucose",
]

Do we have balanced classes?  If our model gets 85% accuracy, should we consider that good?
* Calculate percentages of each class using `value_counts` and the `normalize` argument
* Show a bar plot of the counts of each class

In [None]:
heart["TenYearCHD"].value_counts(normalize=True)

In [None]:
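# Pass the column by keyword; recent seaborn versions treat the first positional argument as `data`
sns.countplot(x="TenYearCHD", data=heart)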
plt.show()
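
To put the 85% question in context, it can help to compute what a model that always predicts the most common class would score. The sketch below uses made-up labels with a hypothetical 85/15 split (not the Framingham data); `y_toy` and the other names are illustrative only.

```python
import pandas as pd

# Toy labels with a hypothetical 85/15 split (NOT the Framingham data).
y_toy = pd.Series([0] * 85 + [1] * 15, name="TenYearCHD")

# Share of each class, using value_counts with normalize=True.
class_share = y_toy.value_counts(normalize=True)

# A "model" that always predicts the most common class.
majority_class = class_share.idxmax()
baseline_pred = pd.Series(majority_class, index=y_toy.index)
baseline_accuracy = (baseline_pred == y_toy).mean()

print(class_share)
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
```

Any accuracy we report later is only meaningful relative to this kind of baseline.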

Let's visualize our data with respect to our target variable, `'TenYearCHD'`. Several of our variables are categorical but already encoded as numbers for us. We might consider re-encoding `education`, but it's already encoded as ordinal, so let's keep it as is and come back to it if we think it will help.

However, it might make more sense to visualize these as categorical rather than continuous.

In [None]:
bin_cols = [
    "male",
    "BPMeds",
    "prevalentStroke",
    "prevalentHyp",
    "diabetes",
]


num_cols = [
    "age",
    "cigsPerDay",
    "totChol",
    "diaBP",
    "BMI",
    "heartRate",
    "glucose",
]

* What's an appropriate chart type to plot our categorical variables with our categorical target variable?
* Write a `for` loop to iterate over the categorical column names (in `bin_cols`)
* Show a plot of `'TenYearCHD'` with each of the categorical variables.

In [None]:
for col in bin_cols:
    perc_chd = heart[["TenYearCHD", col]].groupby(col).mean()
    display(perc_chd)

    sns.countplot(hue="TenYearCHD", x=col, data=heart)
    plt.show()

* What's an appropriate chart type to plot our continuous variables with our categorical target variable?

In [None]:
for col in num_cols:
    # Keyword arguments are required for x and y in recent seaborn versions
    sns.boxplot(x="TenYearCHD", y=col, data=heart)
    plt.show()

## Model Prep

* Perform a train test split

In [None]:
# sysBP and currentSmoker dropped based on VIF
# sysBP redundant with diaBP
# currentSmoker redundant with cigsPerDay
X = heart.drop(columns=["TenYearCHD", "sysBP", "currentSmoker"])
y = heart["TenYearCHD"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

In [None]:
print_vif(X_train)
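
For intuition about what `print_vif` reports: the VIF of a feature is `1 / (1 - R²)`, where `R²` comes from regressing that feature on all the others (plus an intercept). Here is a minimal sketch of that calculation using only `numpy`, on synthetic data made up for illustration; `vif_by_hand` is a hypothetical helper, not part of `statsmodels`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic features (hypothetical, for illustration only):
# b is nearly a copy of a, c is independent of both.
a = rng.normal(size=n)
b = a + rng.normal(scale=0.1, size=n)
c = rng.normal(size=n)
features = np.column_stack([a, b, c])


def vif_by_hand(features, j):
    """VIF of column j: 1 / (1 - R^2) from regressing it on the other columns."""
    target = features[:, j]
    others = np.delete(features, j, axis=1)
    # Add an intercept column, mirroring sm.add_constant in print_vif
    design = np.column_stack([np.ones(len(target)), others])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    r_squared = 1 - residuals.var() / target.var()
    return 1 / (1 - r_squared)


vifs = [vif_by_hand(features, j) for j in range(features.shape[1])]
print(vifs)
```

The two nearly redundant columns get large VIFs while the independent one stays close to 1, which is the same pattern that motivated dropping `sysBP` and `currentSmoker`.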

* Define a `ColumnTransformer` to scale the numeric columns
   * Leave the remaining columns untouched

In [None]:
bin_cols = [
    "male",
    "BPMeds",
    "prevalentStroke",
    "prevalentHyp",
    "diabetes",
]


num_cols = [
    "age",
    "cigsPerDay",
    "totChol",
    "diaBP",
    "BMI",
    "heartRate",
    "glucose",
]

In [None]:
# fmt: off
preprocessing = ColumnTransformer([
    ____
], ____)
# fmt: on
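
As a reference for the general shape of a `ColumnTransformer`, here is a minimal sketch on a toy frame. The data, the `toy_*` names, and the step name `"scale"` are all assumptions made for illustration; this is one possible arrangement, not necessarily the intended answer.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy frame with one numeric and one binary column (hypothetical values).
toy = pd.DataFrame({"age": [40.0, 50.0, 60.0, 70.0], "male": [0, 1, 1, 0]})
toy_num_cols = ["age"]

# "scale" is an arbitrary step name chosen for this sketch.
toy_preprocessing = ColumnTransformer(
    [("scale", StandardScaler(), toy_num_cols)],
    remainder="passthrough",  # leave every other column untouched
)

transformed = toy_preprocessing.fit_transform(toy)
# Transformed columns come first, passthrough columns after them.
print(transformed)
```

Each entry in the list is a `(name, transformer, columns)` tuple, and `remainder` controls what happens to the columns that no entry mentions.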

* Define a `Pipeline` with:
    * the `ColumnTransformer` preprocessing as the first step
    * an `XGBClassifier` as the second step

In [None]:
# fmt: off
pipeline = Pipeline([
    ____,
    ____
])
# fmt: on
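
A `Pipeline` is a list of `(name, step)` tuples, and the names matter later because they become the prefixes of the grid search keys. The sketch below shows this on synthetic data. It uses sklearn's `GradientBoostingClassifier` purely as a stand-in so that it runs without `xgboost` installed; in this notebook the second step would be an `XGBClassifier`. The step name `"preprocessing"` and the `toy_*` names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical), only to make the sketch runnable.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

toy_pipeline = Pipeline(
    [
        ("preprocessing", StandardScaler()),
        # Stand-in estimator; the notebook would use XGBClassifier() here.
        ("xgb", GradientBoostingClassifier(n_estimators=20, random_state=0)),
    ]
)
toy_pipeline.fit(X_toy, y_toy)

# Step names become prefixes for hyperparameters: "<step>__<parameter>".
xgb_param_names = sorted(
    name for name in toy_pipeline.get_params() if name.startswith("xgb__")
)
print(xgb_param_names)
```

Printing the parameter names this way is a quick check that a grid search key such as `xgb__max_depth` actually exists on the pipeline.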

* Fit the pipeline to the training data with the default params


* What is the overall accuracy?
* Are we overfitting?
* Is this a good accuracy?

In [None]:
pipeline.fit(X_train, y_train)

train_score = pipeline.score(X_train, y_train)
test_score = pipeline.score(X_test, y_test)

print(f"Train score: {train_score}")
print(f"Test score: {test_score}")

* How are we making mistakes?
  * Show a `confusion_matrix` and a `classification_report`
* In the context of the problem, what kind of mistake is the worst to make?
   * Mistake 1: Tell someone they're at risk when they're not
   * Mistake 2: Tell someone they're not at risk when they are
* Based on that, what number from a `classification_report` are we most interested in?
   * Do we want to maximize or minimize this value?

In [None]:
y_pred = pipeline.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
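
To connect the two kinds of mistakes to the numbers above: Mistake 1 is a false positive and Mistake 2 is a false negative. Recall for the positive class is the share of true cases that the model catches, so it drops as false negatives pile up. The sketch below works this out by hand on made-up labels (`y_true` and `y_hat` are hypothetical, not model output from this notebook).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Toy labels (hypothetical): 6 people without CHD, 4 with CHD.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 1, 1, 0, 0, 0])

# For binary labels, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)  # share of true CHD cases that were caught

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

In this toy case the accuracy looks passable while most of the positive cases are missed, which is the pattern to watch for with imbalanced classes.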

We can try a grid search to see if we get better performance with better parameters.  This is one of the `xgboost` author's thoughts on the hyperparameter tuning.

<img src='https://i.stack.imgur.com/9GgQK.jpg' width='70%'>

Translation of main parameters of interest:
* Name in table - `xgb_parameter_name`

---

* \# of Trees - `n_estimators`
* Learning Rate - `learning_rate`
* Row Sampling - `subsample`
* Column Sampling - `colsample_bytree`
* Max Tree Depth - `max_depth`

---

* Set up a grid search using this pictured slide as guidance
* What were the best params according to this search?

In [None]:
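# Adjusted colsample_bytree/max_depth to have a smaller grid
n_trees = 100
learning_rate = 2 / n_trees

params = {
    "____": [0.5, 0.75, 1.0],
    "____": [0.5, 0.75, 1.0],
    "____": [5, 7, 10],
    # Single-value entries so n_trees and learning_rate are actually used;
    # the "xgb" prefix assumes that is the name of the pipeline's classifier step
    "xgb__n_estimators": [n_trees],
    "xgb__learning_rate": [learning_rate],
}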

In [None]:
pipeline_cv = GridSearchCV(pipeline, params, verbose=1, cv=2)
pipeline_cv.fit(X_train, y_train)

pipeline_cv.best_params_
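
By default, `GridSearchCV` ranks parameter combinations with the estimator's own `score` method, which is accuracy for classifiers. The imports at the top include `f1_score` and `make_scorer`; the sketch below shows one way they could be wired into a search. It runs on synthetic imbalanced data and uses sklearn's `GradientBoostingClassifier` as a stand-in for `XGBClassifier` so that it does not depend on `xgboost`; all `toy_*` names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic, imbalanced data (hypothetical; roughly 85% negative class).
X_toy, y_toy = make_classification(
    n_samples=300, n_features=6, weights=[0.85, 0.15], random_state=0
)

# Stand-in estimator under the step name "xgb".
toy_pipeline = Pipeline(
    [("xgb", GradientBoostingClassifier(n_estimators=20, random_state=0))]
)

toy_params = {"xgb__max_depth": [2, 3], "xgb__subsample": [0.5, 1.0]}

toy_cv = GridSearchCV(
    toy_pipeline,
    toy_params,
    scoring=make_scorer(f1_score),  # F1 of the positive class, not accuracy
    cv=2,
)
toy_cv.fit(X_toy, y_toy)

print(toy_cv.best_params_)
print(f"Best mean CV F1: {toy_cv.best_score_:.3f}")
```

With a custom scorer, `best_params_` reflects the metric we care about rather than raw accuracy.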

* How does this affect our performance?
* Would we want to deploy this model to predict heart disease?
* How can we make it better?

In [None]:
train_score = pipeline_cv.score(X_train, y_train)
test_score = pipeline_cv.score(X_test, y_test)

print(f"Train score: {train_score}")
print(f"Test score: {test_score}\n")

y_pred = pipeline_cv.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

We're having a lot of trouble with class imbalance: our model is heavily biased towards predicting the negative class because, most of the time, that prediction is correct.

There are several strategies for dealing with class imbalance; some common and approachable ones are listed here: https://elitedatascience.com/imbalanced-classes.

Let's look into a sampling approach to balance the classes in our training set.

* Separate the training data into 2 dataframes:
    * One with the majority class
    * One with the minority class

In [None]:
# Isolate the predictors for each of the 2 classes
X_train_0 = X_train[y_train == 0]
X_train_1 = X_train[y_train == 1]

* How many rows does each have?

In [None]:
n_0 = X_train_0.shape[0]
n_1 = X_train_1.shape[0]

In [None]:
n_0

In [None]:
n_1

* Use sampling to give both classes the same number of rows
    * 'Up sample' with replacement for the minority class
    * 'Down sample' without replacement for the majority class

In [None]:
n = ____

In [None]:
# Down sample the majority class to have fewer observations
X_train_0_sample = X_train_0.sample(n, replace=False, random_state=42)

# Up sample the minority class to have more observations
X_train_1_sample = X_train_1.sample(n, replace=True, random_state=42)

* Redefine `X_train` and `y_train` with your resampled data

In [None]:
# Re-combine data: upsampled minority class first, then downsampled majority class
X_train_resample = pd.concat((X_train_1_sample, X_train_0_sample))
X_train_resample = X_train_resample.reset_index(drop=True)

y_train_resample = np.array([1] * n + [0] * n)
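
The whole resampling procedure can be sketched end to end on a tiny made-up training set. The target size used here (halfway between the two class sizes) is only an assumption for the sketch, and the `toy` names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy training set (hypothetical): 8 majority rows, 2 minority rows.
X_toy = pd.DataFrame({"age": range(40, 50)})
y_toy = pd.Series([0] * 8 + [1] * 2)

X_toy_0 = X_toy[y_toy == 0]
X_toy_1 = X_toy[y_toy == 1]

# Assumed target size: halfway between the two class sizes.
n_toy = (len(X_toy_0) + len(X_toy_1)) // 2  # (8 + 2) // 2 = 5

X_toy_0_sample = X_toy_0.sample(n_toy, replace=False, random_state=42)  # down
X_toy_1_sample = X_toy_1.sample(n_toy, replace=True, random_state=42)  # up

X_toy_resample = pd.concat((X_toy_1_sample, X_toy_0_sample)).reset_index(drop=True)
# Labels follow the concat order: minority rows first, then majority rows.
y_toy_resample = np.array([1] * n_toy + [0] * n_toy)

print(X_toy_resample.shape, y_toy_resample.mean())
```

Because the labels are rebuilt by hand, their order has to match the order of the frames passed to `pd.concat`.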

* Refit the same GridSearchCV object but with this new training data
* Print out the best parameters

In [None]:
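n_trees = 100
learning_rate = 2 / n_trees

params = {
    "xgb__subsample": [0.5, 0.75, 1.0],
    # XGBClassifier's column-sampling parameter is colsample_bytree, not max_features
    "xgb__colsample_bytree": [0.5, 0.75, 1.0],
    "xgb__max_depth": [3, 4, 5],
    # Single-value entries so n_trees and learning_rate are actually used
    "xgb__n_estimators": [n_trees],
    "xgb__learning_rate": [learning_rate],
}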

In [None]:
pipeline_cv = GridSearchCV(pipeline, params, verbose=1, cv=2)
pipeline_cv.fit(X_train_resample, y_train_resample)

pipeline_cv.best_params_

* Is the performance better? worse? different at all?

In [None]:
train_score = pipeline_cv.score(X_train, y_train)
test_score = pipeline_cv.score(X_test, y_test)

print(f"Train score: {train_score}")
print(f"Test score: {test_score}\n")

y_pred = pipeline_cv.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))