# Unit 3 Supervised Learning Capstone

This capstone project focuses on predicting the presence of heart disease in individuals using data obtained from traditional diagnostic tests. The data originates from the Cleveland Clinic database and was obtained from [Kaggle](https://www.kaggle.com/); the heart disease data set page is at [HD_Dataset](https://www.kaggle.com/ronitf/heart-disease-uci).

#### Data Overview
The original data set contained personal information and more features than are provided to the public: 76 attributes were measured originally. The data has been scrubbed of personal identifiers and the attributes reduced to 14. The 14 attributes include both categorical and continuous variables. The target feature is binary and indicates either the presence or absence of heart disease.

In [1]:
import pandas as pd

import mlflow
from mlflow import sklearn as mlflow_sklearn

import sklearn.datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier
import catboost
import lightgbm 
import xgboost
# import warnings
# warnings.filterwarnings('ignore')

In [2]:
import numpy as np
import six
from collections import defaultdict
from scipy import sparse, stats

from sklearn.base import BaseEstimator, TransformerMixin
# sklearn.externals.joblib was removed from scikit-learn; import joblib directly
from joblib import Parallel, delayed
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline, FeatureUnion, _fit_transform_one, _transform_one

import category_encoders as ce

In [3]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Import Data
### Data Set Exploration

In [4]:
# Read in the data and look at the variables and their data types
df = pd.read_csv('heart.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
age         303 non-null int64
sex         303 non-null int64
cp          303 non-null int64
trestbps    303 non-null int64
chol        303 non-null int64
fbs         303 non-null int64
restecg     303 non-null int64
thalach     303 non-null int64
exang       303 non-null int64
oldpeak     303 non-null float64
slope       303 non-null int64
ca          303 non-null int64
thal        303 non-null int64
target      303 non-null int64
dtypes: float64(1), int64(13)
memory usage: 33.2 KB


In [5]:
# Sample first three rows of the data set
df.head(3)


,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1


In [6]:
# General stats on data
df.describe()

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
count,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0
mean,54.366337,0.683168,0.966997,131.623762,246.264026,0.148515,0.528053,149.646865,0.326733,1.039604,1.39934,0.729373,2.313531,0.544554
std,9.082101,0.466011,1.032052,17.538143,51.830751,0.356198,0.52586,22.905161,0.469794,1.161075,0.616226,1.022606,0.612277,0.498835
min,29.0,0.0,0.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,47.5,0.0,0.0,120.0,211.0,0.0,0.0,133.5,0.0,0.0,1.0,0.0,2.0,0.0
50%,55.0,1.0,1.0,130.0,240.0,0.0,1.0,153.0,0.0,0.8,1.0,0.0,2.0,1.0
75%,61.0,1.0,2.0,140.0,274.5,0.0,1.0,166.0,1.0,1.6,2.0,1.0,3.0,1.0
max,77.0,1.0,3.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,2.0,4.0,3.0,1.0


In [7]:
# Look for missing data in the file 
df.isnull().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

### Variable descriptions and explanations

  1. age - subject's age in years

  2. sex - subject's gender (1 = male, 0 = female)

  3. cp - chest pain type, 4 values (0 = typical angina, 1 = atypical angina, 2 = non-anginal pain, 3 = asymptomatic)

      * Typical angina is a common condition caused by ischemia, the lack of blood or oxygen reaching the heart muscle, and is most common in men. The symptoms are reported as tightness or pain felt in the chest after physical activity or stress.

      * Atypical angina is more common in women and is more subtle, with reported symptoms of fatigue, sleep disturbances and shortness of breath, and may be accompanied by chest discomfort.

      * Non-anginal pain is caused by other conditions and can easily be mistaken for angina in emergency situations where the patient’s well-being can take precedence over time-consuming evaluations.

      * Asymptomatic means the patient feels no pain or tightness in the chest, yet other tests indicate the presence of myocardial infarction, the "silent heart attack".

  4. trestbps - resting blood pressure in mm Hg, measured on admission to the hospital 

  5. chol - serum cholesterol measured in mg/dl 

  6. fbs - fasting blood sugar > 120 mg/dl, (1 = True, 0 = False)

  7. restecg - resting electrocardiographic results (0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite LVH (left ventricular hypertrophy) by Estes' criteria)

      * A normal ECG combined with chest pain is typical of anxiety (panic attack).

      * ST-T wave abnormality is commonly linked to myocardial (Heart Muscle) disease.

      * LVH by Estes' criteria is an indication that the patient suffers from heart valve disease and may or may not have myocardial disease.

  8. thalach - maximum heart rate achieved during an exercise protocol, measured in bpm (beats per minute)

  9. exang - exercise induced angina (1 = True, 0 = False)

  10. oldpeak - ST depression induced by exercise relative to rest, measured in mm

      * An ECG (electrocardiogram) was originally recorded on a paper chart where one square equals 1 mm.

      * ST segment depression may occur as a normal variant

      * ST segment depression is a diagnostic tool to determine the presence of obstructive coronary atherosclerosis.

  11. slope - the slope of the peak exercise ST segment (0 = upsloping, 1 = flat, 2 = downsloping)

      * normal slope is upwards

      * abnormal slope is flat or downward

      * false positives are common in women

  12. ca - number of major vessels colored by fluoroscopy during an angiogram, indicating how many vessels are seen with blood flowing through them. The original documentation gives a range of 0-3, but the values in this data set run from 0 to 4.

      * A value of **zero** raises the question of whether the procedure was performed at all, since no flow through any of the arteries is not realistic.

      * An angiogram is an invasive procedure where a contrast dye is introduced by a catheter so the coronary arteries can be imaged using x-rays.

      * The procedure is diagnostic and a surgery aid to place stents, perform angioplasty and for cardiac catheterization.

      * Normal is where all 4 major coronary vessels can be seen

  13. thal - Thallium stress test, where the contrast agent thallium is introduced to provide imaging by a gamma camera in order to image the perfusion of blood and oxygen throughout the heart muscle.  

      * Original dataset values given (3 = normal; 6 = fixed defect; 7 = reversible defect) do not match the data set values 0-3.

      * Assuming that a value of zero would mean the procedure was not performed, we will assign 1-3 with the assumption that 1 = normal, 2 = fixed defect, 3 = reversible defect.
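
The codings listed above can be collected into lookup tables, which makes later plots and tables easier to read. The following is a minimal sketch rather than part of the original analysis: the names `CODE_LABELS` and `decode_categories` are hypothetical, the three sample rows mirror the coded values shown by `df.head(3)`, and the `thal` mapping encodes the assumption stated above, not a documented fact.

```python
import pandas as pd

# Lookup tables taken from the variable descriptions above.
# The thal mapping is an assumption (0 is treated as "not performed").
CODE_LABELS = {
    "cp": {0: "typical angina", 1: "atypical angina",
           2: "non-anginal pain", 3: "asymptomatic"},
    "restecg": {0: "normal", 1: "ST-T wave abnormality",
                2: "probable or definite LVH"},
    "slope": {0: "upsloping", 1: "flat", 2: "downsloping"},
    "thal": {0: "not performed (assumed)", 1: "normal",
             2: "fixed defect", 3: "reversible defect"},
}


def decode_categories(frame, code_labels=CODE_LABELS):
    """Return a copy of frame with coded columns replaced by readable labels."""
    decoded = frame.copy()
    for column, labels in code_labels.items():
        if column in decoded.columns:
            decoded[column] = decoded[column].map(labels)
    return decoded


# Coded values of the first three rows shown by df.head(3)
sample = pd.DataFrame({"cp": [3, 2, 1], "restecg": [0, 1, 0],
                       "slope": [0, 0, 2], "thal": [1, 2, 2]})
decoded_sample = decode_categories(sample)
print(decoded_sample)
```

Keeping the mappings in one dictionary means a corrected assumption about `thal` only has to be changed in one place.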

      



In [None]:
# Target bias
print('Target Bias')
target_sum = df['target'].count()
target_count = df['target'].value_counts()
# target == 1 marks heart disease present (positive), target == 0 a healthy heart
percent_pos = round(target_count[1] / target_sum * 100, 1)
percent_neg = round(100 - percent_pos, 1)
print('Total number of targets is {}; heart disease is present in {}% (positive) and absent in {}% (negative)'.
      format(target_sum, percent_pos, percent_neg))
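
The class balance can also be read directly from `value_counts(normalize=True)`. The sketch below does not load `heart.csv`; it rebuilds a stand-in target column from the summary statistics shown earlier (303 rows with a mean of 0.544554, which implies 165 ones and 138 zeros). It also assumes, as the plotting code in this notebook does, that `target == 1` means heart disease is present.

```python
import pandas as pd

# Stand-in target column rebuilt from the describe() output:
# 303 rows with mean 0.544554 implies 165 ones and 138 zeros.
target = pd.Series([1] * 165 + [0] * 138, name="target")

# normalize=True returns class fractions instead of raw counts
balance = (target.value_counts(normalize=True) * 100).round(1)
percent_disease = balance.loc[1]  # target == 1, heart disease present
percent_healthy = balance.loc[0]  # target == 0, healthy heart
print(f"{percent_disease}% heart disease, {percent_healthy}% healthy")
```

Indexing with `.loc` by class label avoids relying on the order in which `value_counts` happens to return the classes.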


In [None]:
# Gender bias
print('Gender Bias')
gender_sum = df['sex'].count()
gender_count = df['sex'].value_counts()
percent_male = round(gender_count[1] / gender_sum *100,1)
percent_female = round(100 - percent_male,1)
print('Of the {} participants, {}% are Male and {}% are Female'.
      format(gender_sum,str(percent_male), str(percent_female) ))

In [None]:
# Examine data for type, shape
df.hist(bins=25, grid=False, figsize=(12,10), color='#86bf91', zorder=2, rwidth=0.7)
plt.show()

In [None]:
# Create data sets based on variable
gender_data = df.copy(deep=True)
gender_data.rename(columns={'sex':'Gender'}, inplace=True)

# Use map instead of chained indexing, which raises SettingWithCopyWarning
gender_data['Gender'] = gender_data['Gender'].map({0: 'Female', 1: 'Male'})
gender_data['target'] = gender_data['target'].map({0: 'Healthy Heart', 1: 'Heart Disease'})

In [None]:
df["oldpeak"][df['target'] == 1].head()

In [None]:
plt.figure(figsize=(8,5))
sns.set(style="whitegrid", palette="pastel", color_codes=True)
ax = sns.violinplot(x="target", y="age", hue="Gender",
               split=True, inner="quart",
               palette={"Male": "b", "Female": "y"},
               data=gender_data)
ax.set_title('Heart Disease Gender Bias')
ax.set_ylabel('Age [years]')
ax.set_xlabel('Heart Condition')
sns.despine(left=True)

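# KDE joint plots of ST depression vs resting blood pressure, one per target class
for label in (0, 1):
    subset = df[df['target'] == label]
    g = sns.jointplot(x='oldpeak', y='trestbps', data=subset, kind='kde')
    # JointGrid.annotate is no longer available in seaborn, so add Pearson's r by hand
    r, p = stats.pearsonr(subset['oldpeak'], subset['trestbps'])
    g.ax_joint.annotate(f'pearsonr = {r:.2f}; p = {p:.2g}', xy=(0.05, 0.95), xycoords='axes fraction')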


In [None]:

# KDE joint plots of each continuous variable against age, annotated with Pearson's r
joint_specs = [
    ('chol', 'Cholesterol [mg/dl]', 'Age vs Cholesterol'),
    ('trestbps', 'Resting Blood Pressure [mm Hg]', 'Age vs Resting BP'),
    ('thalach', 'Maximum Heart Rate [bpm]', 'Age vs Maximum HR'),
    ('oldpeak', 'ECG - ST Depression (exercise relative to rest) [mm]', 'Age vs ST Depression'),
]
for column, x_label, title in joint_specs:
    g = sns.jointplot(x=column, y='age', data=gender_data, kind='kde')
    # JointGrid.annotate is no longer available in seaborn, so add Pearson's r by hand
    r, p = stats.pearsonr(gender_data[column], gender_data['age'])
    g.ax_joint.annotate(f'pearsonr = {r:.2f}; p = {p:.2g}', xy=(0.05, 0.95), xycoords='axes fraction')
    g.set_axis_labels(x_label, 'Age [years]')
    g.fig.suptitle(title, fontweight='bold')

In [None]:

fig = plt.figure(1, figsize=(18,16))
fig.subplots_adjust(hspace=0.2, wspace=0.2)

sns.set(style="whitegrid", palette="pastel", color_codes=True)
plt.subplot(221)
ax = sns.violinplot(x="target", y="chol", hue="Gender",
               split=True, inner="quart",
               palette={"Male": "b", "Female": "y"},
               data=gender_data)
ax.set_title('Cholesterol Bias', fontweight='bold')
ax.set_ylabel('Cholesterol [mg/dl]')
ax.set_xlabel('Heart Condition')
sns.despine(left=True)

# plt.figure(1, figsize=(8,5))
#sns.set(style="whitegrid", palette="pastel", color_codes=True)
plt.subplot(222)
ax = sns.violinplot(x="target", y="trestbps", hue="Gender",
               split=True, inner="quart",
               palette={"Male": "b", "Female": "y"},
               data=gender_data)
ax.set_title('Resting Blood Pressure Bias', fontweight='bold')
ax.set_ylabel('Blood Pressure [mm Hg]')
ax.set_xlabel('Heart Condition')
sns.despine(left=True)


# plt.figure(2, figsize=(16,10))
#sns.set(style="whitegrid", palette="pastel", color_codes=True)
plt.subplot(223)
ax = sns.violinplot(x="target", y="thalach", hue="Gender",
               split=True, inner="quart",
               palette={"Male": "b", "Female": "y"},
               data=gender_data)
ax.set_title('Maximum Heart Rate Bias', fontweight='bold')
ax.set_ylabel('Heart Rate [bpm]')
ax.set_xlabel('Heart Condition')
sns.despine(left=True)

#plt.figure(4, figsize=(16,10))
#sns.set(style="whitegrid", palette="pastel", color_codes=True)
plt.subplot(224)
ax = sns.violinplot(x="target", y="oldpeak", hue="Gender",
               split=True, inner="quart",
               palette={"Male": "b", "Female": "y"},
               data=gender_data)
ax.set_title('ST Depression Bias (exercise relative to rest)', fontweight='bold' )
ax.set_ylabel('ST Depression [mm]')
ax.set_xlabel('Heart Condition')
sns.despine(left=True)
plt.show()

In [None]:

# Reorder columns to group variables by type [continuous, categorical, binary]
cols = list(df.columns.values)
new_index = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak', 'cp',  'restecg', 'slope', 'ca', 'thal', 'exang', 'fbs', 'sex', 'target']
df = df.reindex(columns=new_index)
df.head(3)

In [None]:
# category_index = [ 'cp',  'restecg', 'slope', 'ca', 'thal']
# df[category_index] = df[category_index].astype('category')

# df.dtypes

In [None]:

age_female = df.age[df['sex']==0]
age_male = df.age[df['sex']==1]
df_age = df.copy()
df_age = df_age.groupby('sex')
print(df_age['age'].describe())
# Plot Data
age_male.hist(bins=30)
age_female.hist(bins=30)
plt.title('Histogram of Age Distribution')
plt.legend(['male','female'])
plt.xlabel('Age')
plt.show()

In [None]:
plt.scatter(df.trestbps,  df.age)
plt.scatter(df.chol,  df.age)
plt.scatter(df.thalach,  df.age)


In [None]:
class PandasFeatureUnion(FeatureUnion):
    def fit_transform(self, X, y=None, **fit_params):
        self._validate_transformers()
        result = Parallel(n_jobs=self.n_jobs)(
            delayed(_fit_transform_one)(trans, X, y, weight,
                                        **fit_params)
            for name, trans, weight in self._iter())

        if not result:
            # All transformers are None
            return np.zeros((X.shape[0], 0))
        Xs, transformers = zip(*result)
        self._update_transformer_list(transformers)
        if any(sparse.issparse(f) for f in Xs):
            Xs = sparse.hstack(Xs).tocsr()
        else:
            Xs = self.merge_dataframes_by_column(Xs)
        return Xs

    def merge_dataframes_by_column(self, Xs):
        return pd.concat(Xs, axis="columns", copy=False)

    def transform(self, X):
        Xs = Parallel(n_jobs=self.n_jobs)(
            delayed(_transform_one)(trans, X, None, weight)
            for name, trans, weight in self._iter())
        if not Xs:
            # All transformers are None
            return np.zeros((X.shape[0], 0))
        if any(sparse.issparse(f) for f in Xs):
            Xs = sparse.hstack(Xs).tocsr()
        else:
            Xs = self.merge_dataframes_by_column(Xs)
        return Xs

In [None]:
def _name_estimators(estimators):
    """Generate names for estimators."""

    names = [type(estimator).__name__.lower() for estimator in estimators]
    namecount = defaultdict(int)
    for est, name in zip(estimators, names):
        namecount[name] += 1

    # dict.items() is the Python 3 equivalent of six.iteritems()
    for k, v in list(namecount.items()):
        if v == 1:
            del namecount[k]

    for i in reversed(range(len(estimators))):
        name = names[i]
        if name in namecount:
            names[i] += "-%d" % namecount[name]
            namecount[name] -= 1

    return list(zip(names, estimators))
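
The naming helper above lowercases each estimator's class name and appends a numeric suffix only when a class appears more than once. As far as I can tell, scikit-learn's own `make_union` follows the same scheme, so it can serve as a quick illustration without redefining the helper; the choice of scalers here is arbitrary.

```python
from sklearn.pipeline import make_union
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two transformers share a class, so their names get numeric suffixes;
# the single MinMaxScaler keeps its plain lowercase class name.
union = make_union(StandardScaler(), MinMaxScaler(), StandardScaler())
step_names = [name for name, _ in union.transformer_list]
print(step_names)
```

Unique names matter because the union stores its transformers as `(name, transformer)` pairs and uses the names as parameter prefixes.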

In [None]:
def make_pandas_union(*transformers, **kwargs):
    n_jobs = kwargs.pop('n_jobs', None)
    if kwargs:
        raise TypeError('Unknown keyword arguments: "{}"'
                        .format(list(kwargs.keys())[0]))
    return PandasFeatureUnion(_name_estimators(transformers), n_jobs=n_jobs)

In [None]:
class OrdinalEncoderPandas(TransformerMixin, BaseEstimator):
    def __init__(self, columns):
        self.columns = columns
        self.transformers = {}
    
    def fit(self, X, y=None):
        for column in self.columns:
            self.transformers[column] = ce.OrdinalEncoder(return_df=False, handle_unknown="impute").fit(X[[column]])
        return self
    
    def transform(self, X, y=None):
        X = X.drop(list(set(X.columns) - set(self.columns)), axis=1)
        for column in self.columns:
            X[column] = self.transformers[column].transform(X[[column]])
            X[column] = X[column].apply(lambda x: x if x else -1)
        return X

In [None]:
class OneHotEncoderPandas(TransformerMixin, BaseEstimator):
    def __init__(self, columns):
        self.columns = columns
        self.transformers = {}
        self.feature_names = {}
        self.feature_names_all = []

    def fit(self, X, y=None):
        # Reset fitted state so that repeated fits do not accumulate duplicate names
        self.feature_names = {}
        self.feature_names_all = []
        for column in self.columns:
            self.transformers[column] = OneHotEncoder(sparse=False, handle_unknown='ignore').fit(X[[column]])
            features = [f"{column}_{i}" for i in self.transformers[column].get_feature_names()]
            self.feature_names[column] = features
            self.feature_names_all.extend(features)
        return self

    def transform(self, X, y=None):
        ohe_df_list = []
        for column in self.columns:
            # Keep the index of X so the result aligns when concatenated with other frames
            ohe_df = pd.DataFrame(self.transformers[column].transform(X[[column]]),
                                  columns=self.feature_names[column], index=X.index)
            ohe_df_list.append(ohe_df)
        return pd.concat(ohe_df_list, axis=1)

In [None]:
class DropColumn(TransformerMixin, BaseEstimator):
    def __init__(self, columns, no_drops):
        self.columns = columns
        self.no_drops = no_drops

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        for column in self.columns:
            if column in X.columns:
                drop_together = False
                if self.no_drops:
                    for no_drop in self.no_drops:
                        if column == no_drop and self.no_drops[no_drop] not in X.columns:
                            drop_together = True
                if not drop_together:
                    X = X.drop(columns=column)
            else:
                print(f"Drop Warning: Column {column} not in X")
        return X

In [None]:
class ChangeColumnType(TransformerMixin, BaseEstimator):
    def __init__(self, types):
        self.types = types

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        for column in self.types.keys():
            if column in X.columns:
                X[column] = X[column].astype(self.types[column])
            else:
                print(f"Change Warning: Column {column} not in X")
        return X

## Notes:
- XGBoost and Random Forest require one-hot encoding (OHE) of categorical variables unless the categorical variable is ordinal
- XGBoost, LightGBM and CatBoost support missing values, but Random Forests do not (most models in sklearn do not support missing values)
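
This data set has no missing values, so the second note does not bite here. For estimators that do reject NaN, one common option is to impute before fitting. The sketch below uses synthetic data and a median `SimpleImputer`; both are illustrative choices, not part of the original analysis.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan  # knock out roughly 10% of the values

# Impute inside the pipeline so the medians are learned from the training data only
model = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestClassifier(n_estimators=20, max_depth=4, random_state=0),
)
model.fit(X, y)
predictions = model.predict(X)
print(predictions[:10])
```

Putting the imputer in the pipeline means the same fill values are reused at prediction time.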

### Features at column positions 5 to 8 (4 features in total: cp, restecg, slope, ca) are categorical (non-binary)

In [None]:
df_samples = df.copy()


In [None]:
categorical_features = list(df_samples.columns[5:9])
target = "target"

In [None]:
# for i in categorical_features:
#     df_samples[i] = abs(df_samples[i]).astype(int).astype(str)

In [None]:
df_samples.head()

### Pipeline for models that do not support categorical variables

Use one-hot encoding (OHE) when the categorical data is not ordinal; otherwise ordinal encoding is sufficient.

This applies to XGBoost and RandomForest (CART can handle categorical variables in principle, but the scikit-learn RandomForest implementation does not).
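
The difference between the two encodings is easiest to see on a single column. This sketch uses `pandas.get_dummies` instead of the custom `OneHotEncoderPandas` class so that it stays self-contained; the four `cp` codes are chosen for illustration.

```python
import pandas as pd

# Coded chest pain types, one row per category
cp = pd.Series([3, 2, 1, 0], name="cp")

# Ordinal view: the integer codes are used as-is, which implies 0 < 1 < 2 < 3
ordinal = cp.to_frame()

# One-hot view: one indicator column per category, no implied ordering
one_hot = pd.get_dummies(cp, prefix="cp", dtype=int)
print(ordinal)
print(one_hot)
```

Each one-hot row contains exactly one 1, so a tree can split on a single category without assuming the codes are ordered.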

In [None]:
pipe_ohe = make_pipeline(
    make_pandas_union(
        DropColumn(columns=categorical_features, no_drops=None),
        make_pipeline(
            ChangeColumnType(types={i: str for i in categorical_features}),
            OneHotEncoderPandas(columns=categorical_features)
        )
    )
)

In [None]:
pipe_ohe.fit_transform(df_samples).head(5)

In [None]:
ohe_columns = list(set(pipe_ohe.fit_transform(df_samples).columns) - set([target]))

### Pipeline for models that support categorical variables

LightGBM and CatBoost can handle categorical variables natively.
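
The pipeline in this section ends by casting the encoded columns to the pandas `category` dtype. A minimal sketch of what that cast does, on a made-up frame: the column keeps its values but now carries its set of levels, which is what libraries that inspect the dtype can pick up. To my understanding LightGBM's scikit-learn interface does this by default, whereas CatBoost expects the categorical columns to be named explicitly through `cat_features`.

```python
import pandas as pd

# Made-up frame with one coded categorical column
frame = pd.DataFrame({"slope": [0, 0, 2, 1], "age": [63, 37, 41, 56]})

# Casting to "category" records the set of levels on the column itself
frame["slope"] = frame["slope"].astype("category")
slope_levels = list(frame["slope"].cat.categories)
print(frame.dtypes)
print(slope_levels)
```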

In [None]:
pipe_cat = make_pipeline(
    make_pandas_union(
        DropColumn(columns=categorical_features, no_drops=None),
        OrdinalEncoderPandas(columns=categorical_features)
    ),
    ChangeColumnType(types={i: "category" for i in categorical_features}),
)

In [None]:
pipe_cat.fit_transform(df_samples).head(5)

## Training/Testing/Validation

In [None]:
df_samples_ohe = pipe_ohe.fit_transform(df_samples)
train_ohe, test_ohe = train_test_split(df_samples_ohe, test_size=0.2, stratify=df_samples_ohe[target], random_state=0)
train_ohe, valid_ohe = train_test_split(train_ohe, test_size=0.2, stratify=train_ohe[target], random_state=0)

In [None]:
df_samples_cat = pipe_cat.fit_transform(df_samples)
train_cat, test_cat = train_test_split(df_samples_cat, test_size=0.2, stratify=df_samples_cat[target], random_state=0)
train_cat, valid_cat = train_test_split(train_cat, test_size=0.2, stratify=train_cat[target], random_state=0)
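
Applying an 80/20 split twice leaves roughly 64% of the rows for training, 16% for validation and 20% for testing, and `stratify` keeps the class balance similar in each part. The sketch below checks this on stand-in labels with the same size and balance as the data set (303 rows, 165 positive, as implied by the summary statistics); it splits row indices, not the real DataFrame.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in labels with the same size and class balance as the data set
labels = np.array([1] * 165 + [0] * 138)
indices = np.arange(len(labels))

# First split off the test set, then carve a validation set out of the remainder
train_idx, test_idx = train_test_split(
    indices, test_size=0.2, stratify=labels, random_state=0)
train_idx, valid_idx = train_test_split(
    train_idx, test_size=0.2, stratify=labels[train_idx], random_state=0)

for name, idx in [("train", train_idx), ("valid", valid_idx), ("test", test_idx)]:
    print(name, len(idx), round(labels[idx].mean(), 3))
```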

### Without ML-Flow

In [None]:
rf_classifier = RandomForestClassifier(
    criterion='entropy',
    max_features=None,
    n_estimators=20,
    max_depth=4,
    random_state=0,
    n_jobs=4)

In [None]:
rf_classifier.fit(X=train_ohe[ohe_columns], y=train_ohe[target])
# Use validation set to modify hyperparameters
print(accuracy_score(valid_ohe[target], rf_classifier.predict(valid_ohe[ohe_columns])))
# Use testing set to evaluate final performance
print(accuracy_score(test_ohe[target], rf_classifier.predict(test_ohe[ohe_columns])))
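
The comments in the cell above describe the intended roles of the two hold-out sets: the validation set guides hyperparameter choices, and the test set is only used for the final estimate. One way to make that explicit is sketched below on synthetic data from `make_classification`; the candidate depths are arbitrary, and in the notebook the `train_ohe`, `valid_ohe` and `test_ohe` frames would take the place of the synthetic arrays.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, for illustration only
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train, y_train, test_size=0.2, stratify=y_train, random_state=0)

# Pick max_depth using the validation set only
valid_scores = {}
for depth in (2, 4, 6):
    candidate = RandomForestClassifier(
        n_estimators=20, max_depth=depth, random_state=0)
    candidate.fit(X_train, y_train)
    valid_scores[depth] = accuracy_score(y_valid, candidate.predict(X_valid))
best_depth = max(valid_scores, key=valid_scores.get)

# Refit with the chosen depth and evaluate on the test set exactly once
final_model = RandomForestClassifier(
    n_estimators=20, max_depth=best_depth, random_state=0)
final_model.fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, final_model.predict(X_test))
print(best_depth, valid_scores, test_accuracy)
```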

In [None]:
xgb_classifier = xgboost.XGBClassifier(
    max_depth=4,
    learning_rate=0.008,
    n_estimators=200
)

In [None]:
xgb_classifier.fit(X=train_ohe[ohe_columns], y=train_ohe[target])
# Use validation set to modify hyperparameters
print(accuracy_score(valid_ohe[target], xgb_classifier.predict(valid_ohe[ohe_columns])))
# Use testing set to evaluate final performance
print(accuracy_score(test_ohe[target], xgb_classifier.predict(test_ohe[ohe_columns])))

In [None]:
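lgb_classifier = lightgbm.LGBMClassifier(
    objective="binary",
    # `categorical_features` is not a LightGBM parameter; fit() already
    # defaults to categorical_feature="auto", so nothing is passed here
    max_depth=4,
    learning_rate=0.01,
    n_estimators=200
)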

In [None]:
lgb_classifier.fit(X=train_ohe[ohe_columns], y=train_ohe[target])
# Use validation set to modify hyperparameters
print(accuracy_score(valid_ohe[target], lgb_classifier.predict(valid_ohe[ohe_columns])))
# Use testing set to evaluate final performance
print(accuracy_score(test_ohe[target], lgb_classifier.predict(test_ohe[ohe_columns])))

In [None]:
cat_classifier = catboost.CatBoostClassifier(
    max_depth=4,
    learning_rate=0.01,
    n_estimators=200,
    verbose=0
)

In [None]:
cat_classifier.fit(X=train_ohe[ohe_columns], y=train_ohe[target])
# Use validation set to modify hyperparameters
print(accuracy_score(valid_ohe[target], cat_classifier.predict(valid_ohe[ohe_columns])))
# Use testing set to evaluate final performance
print(accuracy_score(test_ohe[target], cat_classifier.predict(test_ohe[ohe_columns])))

### With ML-Flow

In [None]:
mlflow.set_experiment("Training/Testing/Validation")

In [None]:
with mlflow.start_run(run_name="Random Forest"):
    criterion = "entropy"
    max_features = None
    n_estimators = 20
    max_depth = 4
    
    rf_classifier = RandomForestClassifier(
        criterion=criterion,
        max_features=max_features,
        n_estimators=n_estimators,
        max_depth=max_depth,
        random_state=0,
        n_jobs=4)
    rf_classifier.fit(X=train_ohe[ohe_columns], y=train_ohe[target])
    
    valid_accuracy = accuracy_score(valid_ohe[target], rf_classifier.predict(valid_ohe[ohe_columns]))
    test_accuracy = accuracy_score(test_ohe[target], rf_classifier.predict(test_ohe[ohe_columns]))
    
    mlflow.log_param("criterion", criterion)
    mlflow.log_param("max_features", max_features)
    mlflow.log_param("n_estimators", n_estimators)
    mlflow.log_param("max_depth", max_depth)
    
    mlflow.log_metric("valid_accuracy", valid_accuracy)
    mlflow.log_metric("test_accuracy", test_accuracy)

    mlflow_sklearn.log_model(rf_classifier, "model")

In [None]:
with mlflow.start_run(run_name="XGBoost"):
    n_estimators = 200
    max_depth = 4
    learning_rate = 0.008
    
    xgb_classifier = xgboost.XGBClassifier(
        max_depth=max_depth,
        learning_rate=learning_rate,
        n_estimators=n_estimators
    )
    xgb_classifier.fit(X=train_ohe[ohe_columns], y=train_ohe[target])
    
    valid_accuracy = accuracy_score(valid_ohe[target], xgb_classifier.predict(valid_ohe[ohe_columns]))
    test_accuracy = accuracy_score(test_ohe[target], xgb_classifier.predict(test_ohe[ohe_columns]))
    
    # Log the hyperparameters this model actually uses (not the Random Forest ones)
    mlflow.log_param("learning_rate", learning_rate)
    mlflow.log_param("n_estimators", n_estimators)
    mlflow.log_param("max_depth", max_depth)
    
    mlflow.log_metric("valid_accuracy", valid_accuracy)
    mlflow.log_metric("test_accuracy", test_accuracy)

    mlflow_sklearn.log_model(xgb_classifier, "model")

In [None]:
with mlflow.start_run(run_name="LightGBM"):
    objective = "binary"
    n_estimators = 200
    max_depth = 4
    learning_rate = 0.01
    
    # categorical_feature defaults to "auto" in fit(), so it is not set here;
    # this also avoids overwriting the categorical_features list defined earlier
    lgb_classifier = lightgbm.LGBMClassifier(
        objective=objective,
        max_depth=max_depth,
        learning_rate=learning_rate,
        n_estimators=n_estimators
    )
    lgb_classifier.fit(X=train_ohe[ohe_columns], y=train_ohe[target])
    
    valid_accuracy = accuracy_score(valid_ohe[target], lgb_classifier.predict(valid_ohe[ohe_columns]))
    test_accuracy = accuracy_score(test_ohe[target], lgb_classifier.predict(test_ohe[ohe_columns]))
    
    # Log the hyperparameters this model actually uses (not the Random Forest ones)
    mlflow.log_param("objective", objective)
    mlflow.log_param("learning_rate", learning_rate)
    mlflow.log_param("n_estimators", n_estimators)
    mlflow.log_param("max_depth", max_depth)
    
    mlflow.log_metric("valid_accuracy", valid_accuracy)
    mlflow.log_metric("test_accuracy", test_accuracy)

    mlflow_sklearn.log_model(lgb_classifier, "model")

In [None]:
with mlflow.start_run(run_name="CatBoost"):
    n_estimators = 200
    max_depth = 8
    learning_rate = 0.01
    
    cat_classifier = catboost.CatBoostClassifier(
        max_depth=max_depth,
        learning_rate=learning_rate,
        n_estimators=n_estimators,
        verbose=0
    )
    cat_classifier.fit(X=train_ohe[ohe_columns], y=train_ohe[target])
    
    valid_accuracy = accuracy_score(valid_ohe[target], cat_classifier.predict(valid_ohe[ohe_columns]))
    test_accuracy = accuracy_score(test_ohe[target], cat_classifier.predict(test_ohe[ohe_columns]))
    
    # Log the hyperparameters this model actually uses (not the Random Forest ones)
    mlflow.log_param("learning_rate", learning_rate)
    mlflow.log_param("n_estimators", n_estimators)
    mlflow.log_param("max_depth", max_depth)
    
    mlflow.log_metric("valid_accuracy", valid_accuracy)
    mlflow.log_metric("test_accuracy", test_accuracy)

    mlflow_sklearn.log_model(cat_classifier, "model")
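
When hyperparameters are kept in separate variables, a tracked run can end up logging a value left over from an earlier cell. One way to avoid this is to keep them in a single dictionary that feeds both the estimator and the tracker. The sketch below shows only the scikit-learn side so that it runs without a tracking server; the name `rf_params` is hypothetical, and passing the dictionary to `mlflow.log_params` assumes the installed MLflow version provides that function.

```python
from sklearn.ensemble import RandomForestClassifier

# Hyperparameters live in one dict, so the model and the log cannot drift apart
rf_params = {
    "criterion": "entropy",
    "max_features": None,
    "n_estimators": 20,
    "max_depth": 4,
}
rf_classifier = RandomForestClassifier(random_state=0, n_jobs=4, **rf_params)

# The values the estimator reports are the ones that would be handed to the tracker
logged = {name: rf_classifier.get_params()[name] for name in rf_params}
print(logged)
```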

In [None]:
!mlflow ui --host 0.0.0.0

### Using Gridsearch
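
One possible sketch of a grid search that reuses a fixed validation set, in keeping with the train/validation/test setup of this section. It relies on scikit-learn's `GridSearchCV` with a `PredefinedSplit`; the synthetic data and the parameter grid are illustrative assumptions, not results or choices from the original analysis.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, PredefinedSplit, train_test_split

# Synthetic stand-in data, for illustration only
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# -1 marks rows that always stay in training, 0 marks the single validation fold
test_fold = np.concatenate([np.full(len(X_train), -1),
                            np.zeros(len(X_valid), dtype=int)])
X_all = np.vstack([X_train, X_valid])
y_all = np.concatenate([y_train, y_valid])

search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid={"max_depth": [2, 4, 6], "criterion": ["gini", "entropy"]},
    scoring="accuracy",
    cv=PredefinedSplit(test_fold),
)
search.fit(X_all, y_all)
print(search.best_params_, search.best_score_)
```

With the default `refit=True`, the best configuration is refit on the training and validation rows combined, so a separate test set is still needed for the final estimate.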

## Training/Testing/Validation with Early Stopping

### Without ML-Flow

### With ML-Flow

## Training/Testing with CV

### Without ML-Flow

### With ML-Flow

## Training/Testing with CV with Early Stopping

### Without ML-Flow

### With ML-Flow

## Nested CV

### Without ML-Flow

### With ML-Flow

## Nested CV with Early Stopping

### Without ML-Flow

### With ML-Flow