![image](https://www.deccanherald.com/sites/dh/files/article_images/2019/11/20/heart-attack-1574189524.jpg)

<h1><center>  <div style="background-color:skyblue;border-radius:10px; padding: 10px;">Introduction</div></center></h1>

#### 💙 Motivation
> The heart is one of the most important organs: it pumps blood around your body, delivering oxygen and nutrients to your cells and removing waste products. A heart attack occurs when an artery supplying your heart with blood and oxygen becomes blocked. Fatty deposits build up over time, forming plaques in your heart's arteries. If a plaque ruptures, a blood clot can form and block the artery, causing a heart attack. It is quite exciting to learn which parameters affect our heart, and how! We know some of these, like cholesterol, age, and blood pressure, but I wanted to use data science skills to learn more about this, so I decided to make this notebook!

#### 🎯 Goal
> The goal of this exercise is to identify the parameters that influence heart attacks and to build an ML model that predicts them.

#### 🤖 AutoML
> I was also wondering whether there is an ML library that is super easy to use and performs all the ML steps under one hood. You might think, `sklearn` is there, so why go elsewhere? True, but what I want is to do all the preprocessing, create and train the model, and evaluate and interpret it, in a few lines of code. That is a lot to expect, but to my surprise 😮, I found a few.

> I am using PyCaret in this notebook. PyCaret is an open source library for ML. It performs all the ML steps and is very easy to use. For more details, you can [check this](https://pycaret.org/).

<center> <img src="https://cdn.analyticsvidhya.com/wp-content/uploads/2020/05/Screenshot-from-2020-05-13-18-30-22.png"> </center>

> There is one more similar library which is quite fascinating and easy to use: H2O. It is also open source, and I will use it as well for comparison purposes. So, let's get started!! 🚗

<center> <img src= "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTi48tkmw7TT50iGq2uJRuvRxg6Cz_aUdkkIpGZ5bVWpsugEobka8wWUFdfhqKKKupgR1Q&usqp=CAU" ></center>

#### Run `pip install pycaret` before proceeding

In [None]:
# pip install pycaret

In [None]:
# Libraries for data handling
import numpy as np
import pandas as pd

# Libraries for visualization
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(font = 'Serif', style = 'white', rc = {'axes.facecolor':'#f1f1f1', 'figure.facecolor':'#f1f1f1'})

# Libraries for ML
from pycaret.classification import *

# Importing h2o library
import h2o
from h2o.automl import H2OAutoML

<h1><div style="background-color:skyblue; border-radius:10px; padding: 10px;"> <center>Looking into the Data 🔬</center></div></h1>

In [None]:
# Reading the data file
df = pd.read_csv('../input/heart-attack-analysis-prediction-dataset/heart.csv')
df.head()

In [None]:
# Checking the data types
df.info()

> The features `sex`,`cp`,`fbs`,`restecg`,`exng`,`slp`,`caa`,`thall` are categorical, but their data type is `int64`, so first let's change the data type.

In [None]:
# Changing the data type for the categorical variables
categorical_var = ['sex','cp','fbs','restecg','exng','slp','caa','thall']
df[categorical_var] = df[categorical_var].astype('category')

In [None]:
# Checking the mean, median, max 
df.describe()

<h1><center> <div style="background-color:skyblue;border-radius:10px; padding: 10px;"> Exploratory Data Analysis 📊</div></center></h1>

In [None]:
# Checking for the distribution of the numeric features
numeric_var = [i for i in df.columns if i not in categorical_var][:-1]

fig, ax = plt.subplots(1,5, figsize = (15,5))
for axis, num_var in zip(ax, numeric_var):
    sns.boxplot(y = num_var,data = df, ax = axis, color = 'skyblue')
    axis.set_xlabel(f"{num_var}", fontsize = 12)
    axis.set_ylabel(None)

fig.suptitle('Box Plot for Identifying Outliers', fontsize = 20, weight = 'bold')
fig.text(0.5, -0.05, 'Numeric Features', ha = 'center', fontsize = 14, weight = 'bold')
plt.tight_layout()

![image](https://miro.medium.com/max/961/0*q2_0X7rTtdNFT6Xi.jpg)

> The evaluation metric for this competition is F1-Score or F-Score.
The columns `trtbps`, `chol`, `thalachh`, and `oldpeak` contain some values that fall beyond the box-plot whiskers (which are based on the IQR, the interquartile range), but we have to decide whether to call them outliers or not. For this case, I am using a threshold of 0.05 for outlier identification: any data point that lies above the 95th percentile or below the 5th percentile is treated as an outlier. You can also choose another value for the threshold. As I am not a medical expert, I am staying on the safer side with *0.05*.
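
> To make the 5th/95th percentile rule concrete, here is a minimal sketch for a single numeric feature. The readings are hypothetical (not taken from `heart.csv`), and `percentile_bounds` and `flag_outliers` are helper names I made up for illustration. PyCaret's own `remove_outliers` option may detect outliers differently, so treat this only as an illustration of the rule described above.

```python
from statistics import quantiles

def percentile_bounds(values, lower_pct=5, upper_pct=95):
    """Return the (low, high) cut-offs at the given percentiles."""
    # quantiles(..., n=100) returns the 99 cut points for percentiles 1..99
    cuts = quantiles(values, n=100, method="inclusive")
    return cuts[lower_pct - 1], cuts[upper_pct - 1]

def flag_outliers(values, lower_pct=5, upper_pct=95):
    """Mark every value that lies outside the percentile band."""
    low, high = percentile_bounds(values, lower_pct, upper_pct)
    return [v < low or v > high for v in values]

# Hypothetical cholesterol readings, for illustration only
chol = [180, 195, 200, 205, 210, 215, 220, 225, 230, 235,
        240, 245, 250, 255, 260, 265, 270, 280, 300, 564]

flags = flag_outliers(chol)
outliers = [v for v, is_outlier in zip(chol, flags) if is_outlier]
print(outliers)
```

> Note that a percentile rule always flags roughly the same share of rows, whether or not the values are medically implausible, which is why the threshold is a judgment call.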

<h5><div class="alert alert-block alert-info"> 📌 The outlier removal part will be done during setup of the data for PyCaret.</div></h5>

In [None]:
# Let's check the output

colors = ['#06344d', '#00b2ff']
sns.set(palette = colors, font = 'Serif', style = 'white', 
        rc = {'axes.facecolor':'#f1f1f1', 'figure.facecolor':'#f1f1f1'})

fig = plt.figure(figsize = (10, 6))
ax = sns.countplot(x = 'output', data = df)

for i in ax.patches:
    ax.text(x = i.get_x() + i.get_width()/2, y = i.get_height()/7, 
            s = f"{np.round(i.get_height()/len(df)*100, 0)}%", 
            ha = 'center', size = 50, weight = 'bold', rotation = 90, color = 'white')

plt.title("Heart Attack Count", size = 20, weight = 'bold')

plt.annotate(text = "People with no \n Heart Attack", xytext = (-0.4, 140), xy = (0.1, 100),
             arrowprops = dict(arrowstyle = "->", color = 'black', connectionstyle = "angle3, angleA = 0, angleB = 90"), 
             color = 'green', weight = 'bold', size = 14)
plt.annotate(text = "People with \n Heart Attack", xytext = (0.15, 150), xy = (1, 110), 
             arrowprops = dict(arrowstyle = "->", color = 'black', connectionstyle = "angle3, angleA = 0, angleB = 90"), 
             color = 'red', weight = 'bold', size = 14)

plt.xlabel('Heart Attack Count', weight = 'bold')
plt.ylabel('Number of People', weight = 'bold');

>The counts of both classes are comparable, so there is no class imbalance. If there were, you could set `fix_imbalance` to True in `setup` (by default it is False) and choose the resampling method with `fix_imbalance_method` (by default it is SMOTE).

In [None]:
# Variation of Heart Attack rate with each numeric variable

fig, ax = plt.subplots(nrows = 2, ncols = 5, figsize = (15, 7), constrained_layout = True) # axis.patches can't be used
# plt.suptitle('Variation of Heart Attack rate', size = 16, weight = 'bold')

for axis, num_var in zip(ax.ravel(), numeric_var): 
    sns.kdeplot(data = df, x = num_var, hue = 'output', ax = axis,
                fill = True, multiple = 'stack', alpha = 0.6, linewidth = 1.5)
    axis.set_ylabel(None)
    axis.set_xlabel(None)

for i, num_var in zip(range(0, 5), numeric_var): 
    sns.histplot(data = df, x = num_var, hue = 'output', ax = ax[1][i])
    ax[1][i].set_ylabel(None)
    ax[1][i].set_xlabel(f'{num_var}', fontsize = 14)
    
fig.text(0.5, -0.05, 'Numeric Features', ha = 'center', fontsize = 14, weight = 'bold');
fig.text(0.5, 1.05, 'Effect of Numeric Features on Heart Attack', ha = 'center', fontsize = 20, weight = 'bold');

fig.show()

In [None]:
# Variation of Heart Attack rate with each categorical variable
fig, ax=plt.subplots(2,4, figsize=(20,14), sharey=True)

for col, axis in zip(categorical_var, ax.ravel()):
    sns.countplot(x=col, data=df, hue='output', ax=axis, order = np.sort(df[col].unique()))
    for i in axis.patches:    
        axis.text(x = i.get_x() + i.get_width()/2, y = i.get_height()+6,
                s = f"{i.get_height()}", 
                ha = 'center', size = 14, weight = 'bold', rotation = 0, color = 'black',
                bbox=dict(boxstyle="circle,pad=0.5", fc='pink', ec="pink", lw=2))

    axis.set_title(f'Effect of {col} on Heart Attack', fontsize=16, weight='bold', y=1.05);
    # Use the same sorted order as the bars so that the labels line up with them
    axis.set_xticklabels(np.sort(df[col].unique()), fontsize=12)
    axis.set_ylabel('Number of People', fontsize=14);
    axis.set_xlabel(None);

fig.text(0.5, 1.05, 'Effect of Categorical Features on Heart Attack', ha = 'center', fontsize = 20, weight = 'bold')
plt.tight_layout()

In [None]:
# Correlation between numeric variables
fig=plt.figure(figsize=(10,7))
axis=sns.heatmap(df[numeric_var].corr(), annot=True, linewidths=3, square=True, cmap='Blues', fmt=".0%")

axis.set_title('Correlation between the features', fontsize=16, weight='bold', y=1.05);
axis.set_xticklabels(numeric_var, fontsize=12)
axis.set_yticklabels(numeric_var, fontsize=12, rotation=0);

fig.text(0.5, 0.0, 'Numeric Features', ha = 'center', fontsize = 14, weight = 'bold');

>The strongest correlation is about -42%, between `thalachh` and `age`. Squaring the coefficient gives r² ≈ 0.18, so roughly 18% of the variance in `thalachh` can be explained by `age` and vice versa. There seems to be no multicollinearity, since for that the correlation would have to be higher than 80-85% (positive or negative). If it were present, we could set `remove_multicollinearity` to True during the setup of PyCaret.
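
> A quick sketch of how a correlation coefficient relates to explained variance: the share of variance that one variable explains in the other is the square of Pearson's r. The numbers below are hypothetical ages and maximum heart rates, not rows from the dataset, and `pearson_r` is a small helper written for this example.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical values, for illustration only
age = [29, 35, 41, 48, 52, 57, 60, 63, 67, 71]
max_hr = [190, 172, 181, 160, 168, 150, 171, 145, 155, 140]

r = pearson_r(age, max_hr)
r_squared = r ** 2
print(f"r = {r:.2f}, r^2 = {r_squared:.2f}")

# The same arithmetic applied to the value read off the heatmap
heatmap_r = -0.42
print(f"r = {heatmap_r} -> r^2 = {heatmap_r ** 2:.2f}")
```

> The sign of r is lost when squaring, so r² says how strong the linear relationship is, not its direction.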

In [None]:
# Age distribution along gender and cholesterol levels
from matplotlib.patches import Rectangle

colors = ['#06344d', '#00b2ff']
lables=['<45', '45-60', '60+']
sex=['0','1']
df['age_bins']=pd.cut(x=df['age'],bins=[25,45,60,100], labels=lables, include_lowest=True)

fig, axis1 = plt.subplots(figsize=(10,7))
axis2 = axis1.twinx()
axis1=sns.countplot(x='age_bins', data=df, hue='sex')

# for i in axis1.patches:
#         i.text(x = i.get_x() + i.get_width()/2, y = i.get_height()+6,
#                s = f"{i.get_height()}", 
#                 ha = 'center', size = 12, weight = 'bold', rotation = 0, color = 'white',
#                 bbox=dict(boxstyle="circle", pad=0.5, fc='pink', ec="pink", lw=2))

axis2 = sns.boxplot(x='age_bins', y='chol', hue='sex', data=df)
# axis2.grid(b=False)
axis2.legend('')

axis1.text(x = 0.8, y = 500,
        s = 'Female', 
        ha = 'center', size = 12, weight = 'bold', rotation = 0, color = 'white',
        bbox=dict(boxstyle="circle,pad=0.1", fc= colors[0], ec=colors[0], lw=2))

axis1.text(x = 1.15, y = 500,
        s = 'Male', 
        ha = 'center', size = 12, weight = 'bold', rotation = 0, color = 'white',
        bbox=dict(boxstyle="circle,pad=0.8", fc= colors[1], ec=colors[1], lw=2))

axis1.set_title('Age and Cholesterol levels', fontsize=16, weight='bold', y=1.05);

axis1.text(x = 1, y = -60,
        s = 'Age', 
        ha = 'center', size = 12, weight = 'bold', rotation = 0, color = 'black')
axis1.set_ylabel('Cholesterol levels', size = 12, weight = 'bold');

rect1 = Rectangle( (-0.4,150), 2.8, 200, linestyle = 'dashed', facecolor = 'None', clip_on=False, edgecolor = 'black')
axis1.add_patch(rect1)

rect2 = Rectangle( (-0.4,0), 2.8, 120, linestyle = 'dashed', facecolor = 'None', clip_on=False, edgecolor = 'black')
axis1.add_patch(rect2)

axis1.annotate(text = "Cholesterol levels \n Distribution", xytext = (-0.2, 420), xy = (0.5, 350), 
             arrowprops = dict(arrowstyle = "->", color = 'black', connectionstyle = "angle3, angleA = 0, angleB = 90"), 
             color = 'black', weight = 'bold', size = 12)

axis1.text(x = 0, y = 80,
        s = 'Age Count', 
        ha = 'center', size = 12, weight = 'bold', rotation = 0, color = 'black');

<h1><center><div style = "background-color:skyblue;border-radius:10px; padding: 10px;"> Data Setup aka Data Cleaning 🚿</div></center></h1>

> `setup` from pycaret.classification will do all the preprocessing necessary before building a model, like:
>- Missing values : Imputing or removing for both categorical and numeric features
>- Handling Outliers
>- Class Imbalance using SMOTE
>- One hot encoding : For categorical features 
>- Scaling : For numeric features
>- Train and Test split
>and many more, for more details, you can [check this](https://pycaret.readthedocs.io/en/latest/api/classification.html).
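
> To get a feel for two of the steps listed above, here is a minimal sketch of z-score scaling and one-hot encoding in plain Python. It is not PyCaret's implementation: the function names, the sample values, and the choice of the population standard deviation are assumptions made for this illustration.

```python
from statistics import mean, pstdev

def zscore(values):
    """Scale values to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population std. deviation (assumed for this sketch)
    return [(v - mu) / sigma for v in values]

def one_hot(values, prefix):
    """Turn a categorical column into one 0/1 column per category."""
    categories = sorted(set(values))
    return [{f"{prefix}_{c}": int(v == c) for c in categories} for v in values]

# Hypothetical sample values, for illustration only
trtbps = [120, 130, 140, 150, 160]
cp = [0, 2, 1, 0]

scaled = zscore(trtbps)
encoded = one_hot(cp, prefix="cp")

print([round(v, 2) for v in scaled])
print(encoded[0])
```

> After one-hot encoding, a column such as `cp` becomes several columns like `cp_0`, `cp_1`, and so on, which is why names of that form show up later in the feature importance plots.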

In [None]:
categorical_var = ['sex','cp','fbs','restecg','exng','slp','caa','thall', 'age_bins']

In [None]:
# Using 'setup' from pycaret.classification
clf = setup(df, target = 'output', categorical_features = categorical_var,
            ordinal_features = {'restecg':['0','1','2'],
                               'age_bins':['<45','45-60','60+']}, # 'restecg' is an ordinal feature like low, medium and high
            remove_outliers = True, outliers_threshold = 0.05, # Removing outliers with threshold of 5 percentile
            normalize = True,
            normalize_method = 'zscore', # Mean => 0 and std. deviation => 1
            train_size = 0.8,
            fold = 10,
            use_gpu = True)

>When you run the cell, you have to press `enter` to proceed; otherwise, you can type `quit` to stop.
>Once you proceed, you will see all the parameters and their corresponding values. If you are not happy with them, you can change the values and run `setup` again.
>For this example, the parameters look fine, so I will proceed.

> 📌 If you want to see the train and test data, you can use `get_config('X_train')` and `get_config('X_test')`

<h1><center><div style = "background-color:skyblue;border-radius:10px; padding: 10px;"> Comparing the Models 🔍</div></center></h1>

> `compare_models` trains various models like Logistic Regression, Decision Tree, SVM, Random Forest, XGBoost, etc. and compares them on various metrics like Accuracy, AUC ROC score, Recall, Precision, etc. This makes it easy for us to choose among them.

In [None]:
# Comparing the performance of various models
best_model = compare_models()

> We get the Accuracy, AUC of the ROC curve, Recall, Precision, etc. of 10+ models in a single line of code! Now we have to decide which model to choose, and this depends upon our evaluation criterion.

<h1><center><div style = "background-color:skyblue;border-radius:10px; padding: 10px;"> Evaluation Criterion 🧪</div></center></h1>

#### I am selecting `recall` as the evaluation metric
<center> <img src="https://miro.medium.com/max/1044/1*I0Yd-o2yQsHBRKFbf0rjpQ.png"></center>
<center> <img src="https://skappal7.files.wordpress.com/2018/08/confusion-matrix.jpg?w=748"></center>

> The reason I am 🔬 focusing on recall rather than precision is this: low precision means we have flagged patients without heart disease as having it, which is not so bad, since those patients will simply take more care of themselves. But if we classify a patient who actually has heart disease as healthy, that can be very dangerous.
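
> Here is a small sketch of how these metrics come out of the four confusion-matrix counts. The counts are hypothetical, chosen so that two imaginary models have the same accuracy but very different recall; `classification_metrics` is a helper written only for this example.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the usual metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)          # of the flagged patients, how many are sick
    recall = tp / (tp + fn)             # of the sick patients, how many were flagged
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical counts for two models on 100 patients (50 with heart disease)
cautious = classification_metrics(tp=45, fp=15, fn=5, tn=35)   # flags more patients
strict = classification_metrics(tp=35, fp=5, fn=15, tn=45)     # flags fewer patients

for name, metrics in [("cautious", cautious), ("strict", strict)]:
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

> Both models are right 80 times out of 100, but the "strict" one misses 15 sick patients instead of 5, which is exactly the kind of error I want to avoid here.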
***
> From the 📋 results, **Logistic Regression** has highest Accuracy, AUC, Recall, Precision, and F1. So we will use it to create the ML model.

<h3><center><div style = "background-color:pink;border-radius:10px; padding: 10px;"> Creating and Evaluating ML models using PyCaret</div></center></h3>

In [None]:
# Creating the ML model
lr = create_model('lr')

> The recall is 88%. This means that out of 100 heart disease patients, we correctly identify 88 and misclassify 12 as having no heart disease.

> Now let's see the performance of our model on the test data...

In [None]:
# Results for Test set
result = predict_model(lr)

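> The recall for the test data is 90%, which is 2 percentage points higher than the cross-validated recall on the training data. A gap this small may simply be noise on a small test set, but it could also suggest that the model is slightly underfitted, so let's tune the hyperparameters to look for a better solution.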

<h3><center><div style = "background-color:pink;border-radius:10px; padding: 10px;"> Hyperparameter Tuning 🔧🔨</div></center></h3>

> Hyperparameter tuning is a time-consuming task, as there are lots of hyperparameters associated with an algorithm, and setting them to the right values to get the best results can take a significant amount of ⏳ time ⌛. 

> With PyCaret, tuning the hyperparameters of an ML model in any module is as simple as calling `tune_model`. It tunes the hyperparameters of the model passed as an estimator using a random grid search over pre-defined grids that are fully customizable.
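
> The idea behind a random grid search can be sketched in a few lines: build every combination in the grid, sample a handful of them at random, and keep the one with the best score. This is not how `tune_model` is implemented internally; `random_grid_search` is a made-up helper, and `toy_score` is a stand-in for a cross-validated metric such as recall.

```python
import itertools
import random

def random_grid_search(grid, score_fn, n_iter=10, seed=0):
    """Sample up to n_iter parameter combinations and return the best one."""
    keys = sorted(grid)
    candidates = [dict(zip(keys, combo))
                  for combo in itertools.product(*(grid[k] for k in keys))]
    rng = random.Random(seed)
    sampled = rng.sample(candidates, k=min(n_iter, len(candidates)))
    best_params = max(sampled, key=score_fn)
    return best_params, score_fn(best_params)

def toy_score(params):
    """Hypothetical score: peaks at C = 1 and prefers a small tolerance."""
    return -((params["C"] - 1.0) ** 2) - params["tol"]

grid = {"tol": [0.0001, 0.0005, 0.001],
        "C": [0.1, 0.5, 1, 1.5, 2]}

# Sampling only 5 of the 15 combinations is cheaper but may miss the optimum
quick_params, quick_score = random_grid_search(grid, toy_score, n_iter=5)
# Sampling all 15 combinations is equivalent to an exhaustive grid search
full_params, full_score = random_grid_search(grid, toy_score, n_iter=15)

print(quick_params, round(quick_score, 4))
print(full_params, round(full_score, 4))
```

> The trade-off is visible in the two calls: fewer iterations mean less compute, at the risk of never trying the best combination.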

In [None]:
# tune hyperparameters with custom_grid
params = {"tol": [0.0001, 0.0005, 0.001],
          "C": [0.1, 0.5, 1, 1.5, 2]}

tuned_lr = tune_model(lr, custom_grid = params)

> Our Recall as well as Accuracy 📈 increased by 2%! You can see the hyperparameters of the model just by calling `model_name.get_params()`. Let's check the results for the test data.

In [None]:
# Results for Test set
result1 = predict_model(tuned_lr)

#### Now, we will try to further improve the results by stacking different models.

<h3><center><div style = "background-color:pink;border-radius:10px; padding: 10px;"> Model Stacking 📚</div></center></h3>

> Model stacking is an efficient ensemble method in which the predictions, generated by using various machine learning algorithms, are used as inputs in a second-layer learning algorithm. This second-layer algorithm is trained to optimally combine the model predictions to form a new set of predictions.

<center><img src="http://i.imgur.com/QBuDOjs.jpg" ></center>

***

> We will stack the top 3 models, but for that we first have to create the remaining two. I am not considering SVM, as its AUC ROC score was 0 (PyCaret reports an AUC of 0 for estimators that do not support `predict_proba`).

In [None]:
# Creating the models
knn = create_model('knn')
lda = create_model('lda')

In [None]:
# Stacking top 3 models
stacked_model = stack_models(estimator_list=[knn, lda, tuned_lr])

> The stacked model is not performing better than the tuned logistic regression model. Can anyone tell why? 🤔 

> I will use the tuned logistic regression model as the final model! 🔒

<h1><center><div style = "background-color:skyblue;border-radius:10px; padding: 10px;"> H2O.ai 🥛</div></center></h1>

In [None]:
# Initializing h2o
h2o.init()

#### Preprocessing and H2O
> For training with h2o, we have to create an h2o frame, which is like a pandas dataframe. We have two options for doing this:
> - read and preprocess the data using h2o frame
> - read and preprocess the data using pandas and convert it to h2o frame

> As I am more comfortable handling the data with pandas, I am choosing the second method, but if you feel comfortable handling data with an h2o frame, you can very well do that. We have already preprocessed the data using the pycaret library, so we will continue from there. From pycaret `setup` we get X_train, y_train, X_test and y_test as pandas objects.

> One simple way to feed a pandas dataframe to h2o is to write it to .csv and then read the .csv with `h2o.import_file`, which returns an h2o frame. That is what I do here. (Passing the dataframe straight to `h2o.H2OFrame(...)` should also work.)

In [None]:
# Getting training and test dataframes from pycaret setup
X_train, X_test, y_train, y_test = get_config('X_train'), get_config('X_test'), get_config('y_train'), get_config('y_test')

# Combining the X_train with y_train and X_test with y_test 
df_train = pd.concat([X_train, y_train], axis=1)
df_test = pd.concat([X_test, y_test], axis=1)

# Creating validation data (optional step)
from sklearn.model_selection import train_test_split
df_train, df_val = train_test_split(df_train, train_size = 0.7)

# Creating train and test .csv files for h2o
df_train.to_csv('train.csv', index=False)
df_test.to_csv('test.csv', index=False)
df_val.to_csv('val.csv', index=False)

In [None]:
# Reading the data using h2o frames
train = h2o.import_file('./train.csv')
test = h2o.import_file('./test.csv')
val = h2o.import_file('./val.csv')

<h3><center><div style = "background-color:pink;border-radius:10px; padding: 10px;"> Creating and Evaluating ML models using H2O</div></center></h3>

In [None]:
# Identifying predictors and response
x = train.columns
y = "output"
x.remove(y)

# For binary classification, response should be a factor
# (the validation frame is converted too, so all three frames agree)
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()
val[y] = val[y].asfactor()

In [None]:
# Run AutoML for 20 base models (limited to 1 hour max runtime by default)
aml = H2OAutoML(max_models=20, seed=1)
aml.train(x=x, y=y, training_frame=train, validation_frame = val)

In [None]:
# View the AutoML Leaderboard
lb = aml.leaderboard
lb.head(rows=lb.nrows)  # Print all rows instead of default (10 rows)

<h5><div class="alert alert-block alert-info">  The advantages I found in h2o over pycaret are that h2o can create deep learning models, and that the models it creates are already tuned over some arbitrary hyperparameters (which can be changed)! </div></h5>

In [None]:
# Choosing the best model
best_h2o = aml.leader

<h1><center><div style = "background-color:skyblue;border-radius:10px; padding: 10px;"> Analysing the model 🧐</div></center></h1>

> Analyzing the performance of a trained machine learning model is an integral step in any machine learning workflow. Analyzing model performance in PyCaret is as simple as calling `plot_model`. The function takes a trained model object and the type of plot as a string.

> The available plots include Area Under the Curve, Discrimination Threshold, Precision Recall Curve, Confusion Matrix, etc. For more info, click [here](https://pycaret.org/plot-model/).

> We shall compare the performance of the models created by PyCaret and H2O, side by side.

<div> <h3><center style="background-color:pink;border-radius:10px; padding: 10px;">AUC ROC Curve 📈</center></h3></div>

#### Area Under Curve (AUC) Receiver Operating Characteristic (ROC) Curve

> The Area Under the Curve (AUC) is the measure of the ability of a classifier to distinguish between classes and is used as a summary of the ROC curve. The higher the AUC, the better the performance of the model at distinguishing between the positive and negative classes.
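
> One way to see what the AUC measures: it equals the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half. The sketch below computes it by brute force on hypothetical labels and scores; `roc_auc` is a helper written for this example, not a function from PyCaret or H2O.

```python
def roc_auc(y_true, scores):
    """AUC as the share of positive/negative pairs that are ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and predicted probabilities, for illustration only
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.10, 0.35, 0.40, 0.45, 0.60, 0.70, 0.20, 0.90]

auc = roc_auc(y_true, scores)
print(f"AUC = {auc:.4f}")
```

> Because only the ranking of the scores matters, the AUC does not depend on any particular classification threshold.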

<div> <h4><center style="background-color:orange;border-radius:10px; padding: 10px;">AUC ROC Curve for Logistic Regression</center></h4></div>

In [None]:
plot_model(tuned_lr, plot = 'auc')

<div> <h4><center style="background-color:orange;border-radius:10px; padding: 10px;">AUC ROC Curve for ANN</center></h4></div>

In [None]:
# build the roc curve:
perf_test = best_h2o.model_performance(test)
perf_test.plot(type = "roc")

> The ANN (AUC=0.948) is performing slightly better than the Logistic Regression (AUC=0.93)

<div> <h3><center style="background-color:pink;border-radius:10px; padding: 10px;">Confusion Matrix 😵</center></h3></div>
 
> A confusion matrix is a summary of prediction results on a classification problem. The numbers of correct and incorrect predictions are summarized with count values and broken down by each class. This is the key to the confusion matrix: it shows the ways in which the model is confused when it makes predictions.

<div> <h4><center style="background-color:orange;border-radius:10px; padding: 10px;">Confusion Matrix for Logistic Regression</center></h4></div>

In [None]:
plot_model(tuned_lr, plot = 'confusion_matrix')

<div> <h4><center style="background-color:orange;border-radius:10px; padding: 10px;">Confusion Matrix for ANN</center></h4></div>

In [None]:
perf_test.confusion_matrix()

> `model_performance()` reports the confusion matrix at the threshold that gives the best F1 score, not at 0.5. This can make the comparison misleading, so let's first predict using the best_h2o model and then plot the confusion matrix again at a fixed threshold.
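
> To show why the threshold matters, here is a minimal sketch that builds a confusion matrix from predicted probabilities at a chosen cut-off. The labels and probabilities are hypothetical, and `confusion_at_threshold` is a helper written for this example; the layout follows the usual convention of actual classes in rows and predicted classes in columns.

```python
def confusion_at_threshold(y_true, probabilities, threshold=0.5):
    """Return [[TN, FP], [FN, TP]] for the given probability cut-off."""
    tn = fp = fn = tp = 0
    for actual, prob in zip(y_true, probabilities):
        predicted = 1 if prob >= threshold else 0
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        else:
            tn += 1
    return [[tn, fp], [fn, tp]]

# Hypothetical labels and positive-class probabilities
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
probs = [0.10, 0.30, 0.45, 0.60, 0.35, 0.55, 0.70, 0.90]

default_matrix = confusion_at_threshold(y_true, probs, threshold=0.5)
lenient_matrix = confusion_at_threshold(y_true, probs, threshold=0.3)

print("threshold 0.5:", default_matrix)
print("threshold 0.3:", lenient_matrix)
```

> Lowering the threshold removes the false negative but adds false positives, so two confusion matrices are only comparable when they use the same threshold.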

In [None]:
# Predicting using the best_h2o model
predictions = best_h2o.predict(test)
# Converting to pandas dataframe
prediction = predictions.as_data_frame()

In [None]:
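from sklearn.metrics import confusion_matrix
# Threshold the positive-class probability (`p1`) at 0.5 explicitly; the `predict`
# column may be based on H2O's own F1-optimal threshold rather than 0.5
matrix = confusion_matrix(y_test, (prediction['p1'] >= 0.5).astype(int))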

plt.style.use('seaborn-dark')
fig, axis1 = plt.subplots(1,1, figsize=(10,6), constrained_layout = True)
sns.heatmap(matrix, annot=True, fmt = '.0f', cbar=False, cmap='Blues',
                        linewidths=3, square=True, ax = axis1, annot_kws={"fontsize":16})
axis1.set_title("Confusion Matrix | Threshold : 0.5", fontsize=16, y=1.05);
axis1.set_xlabel('Predicted', fontsize=12)
axis1.set_ylabel('Actual', fontsize=12)
axis1.set_xticklabels([0,1], fontsize=12 )
axis1.set_yticklabels([0,1], fontsize=12, rotation=0);

> The results look similar. This means that models with the same confusion matrix can have different AUC scores. Many people get confused by the confusion matrix and AUC ROC; for a better understanding of them, you can go through [this](https://www.kaggle.com/sonukiller99/confusion-in-consfusion-matrix-not-anymore) 📙

<div> <h3><center style="background-color:pink;border-radius:10px; padding: 10px;">Classification Report 📋</center></h3></div>

<div> <h4><center style="background-color:orange;border-radius:10px; padding: 10px;">Classification Report of Logistic Regression</center></h4></div>

In [None]:
plot_model(tuned_lr, plot = 'class_report')

<div> <h4><center style="background-color:orange;border-radius:10px; padding: 10px;">Classification Report of ANN</center></h4></div>

In [None]:
perf_test

<h5><div class="alert alert-block alert-info"> 📌 H2O doesn't provide Decision Boundary and Misclassification per class plots, so I will be plotting them for PyCaret alone.</div></h5>


 <h3><center><div style="background-color:pink;border-radius:10px; padding: 10px;"> Decision Boundary </div></center></h3>

In [None]:
plot_model(tuned_lr, plot = 'boundary')

<h3><center><div style="background-color:pink;border-radius:10px; padding: 10px;"> ❌ Misclassification in each class ❌ </div></center></h3>

In [None]:
plot_model(tuned_lr, plot = 'error')

<h3><center><div style="background-color:pink;border-radius:10px; padding: 10px;"> Learning Curve 📈 </div></center></h3>

<div> <h4><center style="background-color:orange;border-radius:10px; padding: 10px;">Learning Curve for Logistic Regression</center></h4></div>

In [None]:
plot_model(tuned_lr, plot = 'learning')

<div> <h4><center style="background-color:orange;border-radius:10px; padding: 10px;">Learning Curve for ANN</center></h4></div>

In [None]:
best_h2o.learning_curve_plot()

<h1><center><div style="background-color:skyblue;border-radius:10px; padding: 10px;"> Interpreting Model 📖 </div></center></h1>

> Regardless of the end goal of your data science solutions, an end-user will always prefer solutions that are interpretable and understandable. Moreover, as a data scientist you will always benefit from the interpretability of your model to validate and improve your work.

<h3><center><div style="background-color:pink;border-radius:10px; padding: 10px;"> Global Interpretation 🌎 </div></center></h3>

<div> <h4><center style="background-color:orange;border-radius:10px; padding: 10px;">Global Interpreting Model using Logistic Regression</center></h4></div>

<div class="alert alert-block alert-info"> 📌 For interpretation with PyCaret, only tree based models can be used, like Decision Tree, Random Forest, XGBoost, CatBoost, lightgbm, etc. So, for interpreting with PyCaret, I will create a CatBoost model, but you can also choose Random Forest, XGBoost or any other tree based algorithm.</div>

In [None]:
# Creating catboost model
catboost = create_model('catboost')

In [None]:
plot_model(catboost, 'feature')

In [None]:
interpret_model(catboost)

> This plot is based on the SHAP values. This plot is made of all the dots in the train data. It demonstrates the following information:

> **Feature importance**: Variables are ranked in descending order.

> **Impact**: The horizontal location shows whether the effect of that value is associated with a higher or lower prediction.

> **Original value**: Color shows whether that variable is high (in red) or low (in blue) for that observation.

>**Correlation**: A high level of the `caa_0` has a high and positive impact on the Heart Attack. The “high” comes from the red color, and the “positive” impact is shown on the X-axis. Similarly, we will say the `thall_3` is negatively correlated with the target variable.

> 📘 If you want to learn about them in detail, you can read this [blog](https://towardsdatascience.com/explain-your-model-with-the-shap-values-bc36aac4de3d).
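
> For intuition about what a SHAP value is, here is a brute-force sketch of exact Shapley values on a tiny toy model. It is not what the SHAP library does for CatBoost (which uses much faster tree-specific algorithms): `shapley_values` and `toy_risk` are hypothetical helpers, the feature names only mimic one-hot columns, and the coefficients are invented.

```python
from itertools import permutations

def shapley_values(predict, instance, baseline):
    """Average each feature's marginal contribution over all feature orderings."""
    features = list(instance)
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        current = dict(baseline)          # start from the reference row
        previous = predict(current)
        for feature in order:
            current[feature] = instance[feature]   # switch this feature on
            value = predict(current)
            totals[feature] += value - previous    # its marginal contribution
            previous = value
    return {f: total / len(orderings) for f, total in totals.items()}

def toy_risk(row):
    """Hypothetical risk score with one interaction term (not the real model)."""
    return (0.2 + 0.3 * row["caa_0"] - 0.2 * row["thall_3"]
            + 0.1 * row["caa_0"] * row["exng_0"])

instance = {"caa_0": 1, "thall_3": 1, "exng_0": 1}   # the row to explain
baseline = {"caa_0": 0, "thall_3": 0, "exng_0": 0}   # the reference row

phi = shapley_values(toy_risk, instance, baseline)
print({f: round(v, 3) for f, v in phi.items()})
print(round(toy_risk(instance) - toy_risk(baseline), 3))
```

> The contributions add up to the difference between the prediction for the row and the prediction for the baseline, and the interaction effect is split evenly between the two features involved. Positive values push the prediction up and negative values push it down, which is what the horizontal axis of the summary plot shows.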

<div> <h4><center style="background-color:orange;border-radius:10px; padding: 10px;">Global Interpreting Model using ANN</center></h4></div>

In [None]:
best_h2o.explain(test)

<h3><center><div style="background-color:pink;border-radius:10px; padding: 10px;"> Local Interpretation </div></center></h3>

<div> <h4><center style="background-color:orange;border-radius:10px; padding: 10px;">Local Interpreting Model using Logistic Regression</center></h4></div>

In [None]:
interpret_model(catboost, plot = 'reason', observation = 12)

>You can also save the model with [save_model](https://pycaret.org/save-model/) and deploy it with [deploy_model](https://pycaret.org/deploy-model/).

<div> <h4><center style="background-color:orange;border-radius:10px; padding: 10px;">Local Interpreting Model using ANN</center></h4></div>

In [None]:
best_h2o.explain_row(test, row_index=12)

<h2><center>  <div style="background-color:cyan;border-radius:10px; padding: 10px;">Hope this increased your blood flow with excitement, and if you like this, don't forget to upvote!</div></center></h2>