# Introduction

PyCaret is an open-source, low-code machine learning library in Python that automates machine learning workflows. It is an end-to-end machine learning and model management tool that shortens the experiment cycle and makes you more productive.

Compared with other open-source machine learning libraries, PyCaret is a low-code alternative that can replace hundreds of lines of code with only a few, which makes experiments much faster to write and run. PyCaret is essentially a Python wrapper around several machine learning libraries and frameworks, such as scikit-learn, XGBoost, LightGBM, CatBoost, spaCy, Optuna, Hyperopt, Ray, and a few more.

The design and simplicity of PyCaret are inspired by the emerging role of citizen data scientists, a term first used by Gartner. Citizen data scientists are power users who can perform both simple and moderately sophisticated analytical tasks that would previously have required more technical expertise.

# Tutorial Objective

In this tutorial we will learn:

* Getting Data: How to import data from the PyCaret repository.
* Setting up Environment: How to set up a classification experiment in PyCaret and get started with building classification models.
* Create Model: How to create a model, perform cross-validation and evaluate classification metrics.
* Tune Model: How to automatically tune the hyperparameters of a classification model.
* Plot Model: How to analyze model performance using various plots.
* Predict Model: How to make predictions on new/unseen data.
* Save / Load Model: How to save/load a model for future use.
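
The steps listed above map onto a handful of PyCaret calls. The following is a minimal sketch rather than the notebook's own code: the function name `run_classification_workflow`, its parameters, the default model ID `'lr'`, the choice of the `'auc'` plot, and the file name `my_pipeline` are placeholders chosen for illustration, while the imported functions belong to `pycaret.classification`.

```python
def run_classification_workflow(data, target, new_data,
                                model_id="lr", model_name="my_pipeline"):
    """Walk through setup, training, tuning, plotting, prediction, and persistence.

    All argument names and defaults are illustrative assumptions.
    """
    # Imported lazily so that defining this function does not require PyCaret.
    from pycaret.classification import (
        setup, create_model, tune_model, plot_model,
        predict_model, save_model, load_model,
    )

    setup(data=data, target=target, session_id=42)     # set up the experiment
    model = create_model(model_id)                      # train with cross-validation
    tuned = tune_model(model)                           # search over hyperparameters
    plot_model(tuned, plot="auc", save=True)            # save a ROC curve plot to disk
    predictions = predict_model(tuned, data=new_data)   # score new/unseen rows
    save_model(tuned, model_name)                       # writes <model_name>.pkl
    reloaded = load_model(model_name)                   # restore the saved pipeline
    return predictions, reloaded
```

Calling the function needs PyCaret installed plus two DataFrames, one for training and one with unseen rows that share the same feature columns.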

# Import Needed Libraries

In [1]:
!pip install pycaret

Collecting pycaret
  Obtaining dependency information for pycaret from https://files.pythonhosted.org/packages/eb/43/ec8d59a663e0a1a67196b404ec38ccb0051708bad74a48c80d96c61dd0e5/pycaret-3.2.0-py3-none-any.whl.metadata
  Downloading pycaret-3.2.0-py3-none-any.whl.metadata (17 kB)
Collecting kaleido>=0.2.1 (from pycaret)
  Downloading kaleido-0.2.1-py2.py3-none-manylinux1_x86_64.whl (79.9 MB)
Collecting matplotlib<=3.6,>=3.3.0 (from pycaret)
  Downloading matplotlib-3.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.8 MB)
Collecting pandas<2.0.0,>=1.3.0 (from pycaret)
  Downloading pandas-1.5.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.1 MB)

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from pycaret.classification import *

# EDA 

In [3]:
df = pd.read_csv('/kaggle/input/heart-failure-prediction/heart.csv')
df.head()

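| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |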


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Sex             918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB
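
The `object` columns in the summary above are the ones that need to be treated as categorical features. As a small illustration using only the standard library, the sketch below takes the five rows shown by `df.head()` and lists every column that contains a non-numeric value. The helper names `is_number` and `non_numeric_columns` are made up for this example.

```python
import csv
import io

# The five rows displayed by df.head(), without the index column.
SAMPLE = """Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0
"""


def is_number(text):
    """Return True if the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def non_numeric_columns(csv_text):
    """List columns, in file order, that hold at least one non-numeric value."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return [col for col in rows[0] if not all(is_number(r[col]) for r in rows)]


categorical = non_numeric_columns(SAMPLE)
print(categorical)
```

With the full DataFrame loaded, `df.select_dtypes(include='object').columns.tolist()` is the pandas way to get the same kind of list.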


# PyCaret Setup

Reference: [PyCaret classification API documentation](https://pycaret.readthedocs.io/en/latest/api/classification.html)

In [5]:
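# Note: 'ejection_fraction' is not a column of this dataset (see df.info()),
# so ignore_features does not drop anything here.
s = setup(data=df, target='HeartDisease', ignore_features=['ejection_fraction'],
          categorical_features=['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope'],
          normalize=True, normalize_method='minmax', train_size=0.9, session_id=42)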

| | Description | Value |
|---|---|---|
| 0 | Session id | 42 |
| 1 | Target | HeartDisease |
| 2 | Target type | Binary |
| 3 | Original data shape | (918, 12) |
| 4 | Transformed data shape | (918, 19) |
| 5 | Transformed train set shape | (826, 19) |
| 6 | Transformed test set shape | (92, 19) |
| 7 | Ignore features | 1 |
| 8 | Ordinal features | 2 |
| 9 | Numeric features | 6 |
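
Two of the `setup` arguments are easy to check by hand: `train_size=0.9` and `normalize_method='minmax'`. The sketch below is not PyCaret's implementation. It assumes the training share is rounded down (which reproduces the 826/92 row counts reported above) and applies the usual min-max formula `(x - min) / (max - min)` to the five `Age` values shown earlier. In a real pipeline the minimum and maximum would be learned from the training rows only. The function names are illustrative.

```python
import math


def split_sizes(n_rows, train_size):
    """Row counts for a train/test split, assuming the train share rounds down."""
    n_train = math.floor(train_size * n_rows)
    return n_train, n_rows - n_train


def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map every value to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


n_train, n_test = split_sizes(918, 0.9)
print(n_train, n_test)

ages = [40, 49, 37, 48, 54]  # Age column from the df.head() output
scaled_ages = min_max_scale(ages)
print([round(a, 3) for a in scaled_ages])
```

After scaling, the youngest patient in the sample (37) maps to 0 and the oldest (54) maps to 1, so features measured on very different scales, such as `Cholesterol` and `Oldpeak`, end up in the same range.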


### Existing Classification Models in PyCaret

In [6]:
models()

| ID | Name | Reference | Turbo |
|---|---|---|---|
| lr | Logistic Regression | sklearn.linear_model._logistic.LogisticRegression | True |
| knn | K Neighbors Classifier | sklearn.neighbors._classification.KNeighborsCl... | True |
| nb | Naive Bayes | sklearn.naive_bayes.GaussianNB | True |
| dt | Decision Tree Classifier | sklearn.tree._classes.DecisionTreeClassifier | True |
| svm | SVM - Linear Kernel | sklearn.linear_model._stochastic_gradient.SGDC... | True |
| rbfsvm | SVM - Radial Kernel | sklearn.svm._classes.SVC | False |
| gpc | Gaussian Process Classifier | sklearn.gaussian_process._gpc.GaussianProcessC... | False |
| mlp | MLP Classifier | sklearn.neural_network._multilayer_perceptron.... | False |
| ridge | Ridge Classifier | sklearn.linear_model._ridge.RidgeClassifier | True |
| rf | Random Forest Classifier | sklearn.ensemble._forest.RandomForestClassifier | True |


### Get the best model

In [7]:
best_model = compare_models()

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| catboost | CatBoost Classifier | 0.8813 | 0.9368 | 0.908 | 0.8824 | 0.8946 | 0.7588 | 0.7602 | 1.559 |
| gbc | Gradient Boosting Classifier | 0.8643 | 0.9298 | 0.8795 | 0.8762 | 0.8771 | 0.7254 | 0.7268 | 0.11 |
| lightgbm | Light Gradient Boosting Machine | 0.8606 | 0.9233 | 0.8903 | 0.8639 | 0.8763 | 0.7168 | 0.7186 | 0.281 |
| rf | Random Forest Classifier | 0.8583 | 0.9276 | 0.8861 | 0.8641 | 0.8739 | 0.7121 | 0.7148 | 0.167 |
| ridge | Ridge Classifier | 0.8582 | 0.0 | 0.8861 | 0.8639 | 0.8741 | 0.7118 | 0.7139 | 0.068 |
| et | Extra Trees Classifier | 0.8582 | 0.9232 | 0.8884 | 0.864 | 0.8741 | 0.7118 | 0.7166 | 0.148 |
| lda | Linear Discriminant Analysis | 0.857 | 0.9255 | 0.884 | 0.8635 | 0.8728 | 0.7095 | 0.7117 | 0.058 |
| lr | Logistic Regression | 0.8521 | 0.9249 | 0.8816 | 0.8579 | 0.8689 | 0.6993 | 0.7011 | 0.361 |
| xgboost | Extreme Gradient Boosting | 0.851 | 0.922 | 0.8772 | 0.8582 | 0.8671 | 0.6976 | 0.699 | 0.087 |
| nb | Naive Bayes | 0.8473 | 0.914 | 0.86 | 0.8659 | 0.8618 | 0.6911 | 0.6934 | 0.057 |
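
Most columns in the comparison table can be derived from the four cells of a binary confusion matrix. The sketch below uses the standard textbook definitions and makes no claim about how PyCaret computes or averages them across folds. The counts passed in are hypothetical (they merely sum to 92, the size of the test split) and are not results from this notebook. AUC is left out because it needs predicted scores rather than hard labels.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Compute common classification metrics from confusion-matrix counts.

    Assumes every denominator is non-zero (both classes are present and
    both labels are predicted at least once).
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)

    # Cohen's kappa: agreement beyond what chance alone would produce.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)

    # Matthews correlation coefficient.
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "Accuracy": accuracy,
        "Recall": recall,
        "Prec.": precision,
        "F1": f1,
        "Kappa": kappa,
        "MCC": mcc,
    }


# Hypothetical counts for illustration only.
metrics = binary_metrics(tp=45, fp=6, fn=5, tn=36)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Kappa and MCC are lower than accuracy for the same predictions because both discount the agreement a classifier would reach by chance.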



### Evaluations

In [8]:
evaluate_model(best_model)


### Plot model


In [9]:
plot_model(best_model, plot='confusion_matrix', save=True)

'Confusion Matrix.png'
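
The saved plot arranges predictions into four cells: true positives, false positives, false negatives, and true negatives. As a rough illustration of what those cells count, the sketch below tallies them from two short label lists. The lists are invented for the example and are not predictions from `best_model`. The function name `confusion_counts` is likewise hypothetical.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    counts = Counter()
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            # Predicted positive: correct if the actual label is positive too.
            counts["tp" if actual == positive else "fp"] += 1
        else:
            # Predicted negative: a miss if the actual label was positive.
            counts["fn" if actual == positive else "tn"] += 1
    return counts["tp"], counts["fp"], counts["fn"], counts["tn"]


# Invented labels: 1 = heart disease, 0 = no heart disease.
y_true = [0, 1, 0, 1, 0, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
```

In a medical setting such as this one, the false negatives (patients with heart disease predicted as healthy) are usually the cell to watch most closely.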