# PyCaret for Classification

- PyCaret is an open-source, low-code machine learning library in Python that automates machine learning workflows.
- It is an end-to-end machine learning and model management tool that shortens the experiment cycle and reduces the amount of code you have to write.
- It bundles many machine learning algorithms behind a single interface.
- Only three lines of code (load the data, call `setup`, call `compare_models`) are required to train and compare around 20 ML models.
- PyCaret supports several tasks, including:
    - Classification: https://pycaret.org/classification1/
    - Regression
    - Clustering

## 1. Install PyCaret

In [1]:
# !pip install pycaret

## 2. Get the PyCaret version

In [2]:
from pycaret.utils import version
version()

'3.0.1'

## 3. Get the list of datasets available in PyCaret

In [3]:
from pycaret.datasets import get_data
dataset=get_data('index')

Unnamed: 0,Dataset,Data Types,Default Task,Target Variable 1,Target Variable 2,# Instances,# Attributes,Missing Values
0,anomaly,Multivariate,Anomaly Detection,,,1000,10,N
1,france,Multivariate,Association Rule Mining,InvoiceNo,Description,8557,8,N
2,germany,Multivariate,Association Rule Mining,InvoiceNo,Description,9495,8,N
3,bank,Multivariate,Classification (Binary),deposit,,45211,17,N
4,blood,Multivariate,Classification (Binary),Class,,748,5,N
5,cancer,Multivariate,Classification (Binary),Class,,683,10,N
6,credit,Multivariate,Classification (Binary),default,,24000,24,N
7,diabetes,Multivariate,Classification (Binary),Class variable,,768,9,N
8,electrical_grid,Multivariate,Classification (Binary),stabf,,10000,14,N
9,employee,Multivariate,Classification (Binary),left,,14999,10,N


## 4. Get the "diabetes" dataset

In [4]:
dataset=get_data('diabetes')
dataset

Unnamed: 0,Number of times pregnant,Plasma glucose concentration a 2 hours in an oral glucose tolerance test,Diastolic blood pressure (mm Hg),Triceps skin fold thickness (mm),2-Hour serum insulin (mu U/ml),Body mass index (weight in kg/(height in m)^2),Diabetes pedigree function,Age (years),Class variable
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


Unnamed: 0,Number of times pregnant,Plasma glucose concentration a 2 hours in an oral glucose tolerance test,Diastolic blood pressure (mm Hg),Triceps skin fold thickness (mm),2-Hour serum insulin (mu U/ml),Body mass index (weight in kg/(height in m)^2),Diabetes pedigree function,Age (years),Class variable
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


## 5. Download the "diabetes" dataset to the local system

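In [None]:
# Save the DataFrame as a CSV file in the current working directory.
dataset.to_csv('diabetesdataset.csv')
# Assumption: the notebook runs on Google Colab, whose `files` helper triggers
# a browser download. In a local session the file saved above is already on disk.
from google.colab import files
files.download('diabetesdataset.csv')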

## 6. "Parameter setting" for classification models

In [6]:
from pycaret.classification import *
s = setup(data = dataset,target = 'Class variable',train_size=0.7)

Unnamed: 0,Description,Value
0,Session id,8144
1,Target,Class variable
2,Target type,Binary
3,Original data shape,"(768, 9)"
4,Transformed data shape,"(768, 9)"
5,Transformed train set shape,"(537, 9)"
6,Transformed test set shape,"(231, 9)"
7,Numeric features,8
8,Preprocess,True
9,Imputation type,simple
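
The train and test shapes reported above follow from `train_size=0.7`. As a rough check, the row counts can be reproduced with plain arithmetic, assuming the split keeps `floor(train_size * n_rows)` rows for training. That rounding rule is an assumption that happens to match the shapes shown here, not a statement about PyCaret's internals, and the helper name `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_rows: int, train_size: float = 0.7) -> tuple:
    """Return (train_rows, test_rows) for a fractional train_size.

    Assumption: the training set gets floor(train_size * n_rows) rows
    and the test set gets whatever is left.
    """
    n_train = math.floor(train_size * n_rows)
    return n_train, n_rows - n_train


# The diabetes data has 768 rows.
print(split_sizes(768))  # (537, 231)
```

The split is random, so which rows land in each set depends on the session id shown in the first row of the table.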


## 7. "Run and Compare" the model performance

- This function trains all the models in the model library using default hyperparameters and evaluates them using cross-validation. It returns the best-performing trained model (ranked on accuracy by default). The evaluation metrics used are:
    - Classification: Accuracy, AUC, Recall, Precision, F1, Kappa, MCC
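
Most of the metrics listed above can be derived from the four cells of a binary confusion matrix. The sketch below shows the standard formulas in plain Python. The function name `binary_metrics` and the counts are hypothetical, chosen only for illustration. AUC is left out because it needs predicted scores rather than hard class labels.

```python
import math


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute common binary-classification metrics from confusion-matrix counts."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    # Cohen's kappa: agreement beyond what chance alone would produce.
    p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    kappa = (accuracy - p_chance) / (1 - p_chance)
    # Matthews correlation coefficient.
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "Accuracy": accuracy,
        "Recall": recall,
        "Precision": precision,
        "F1": f1,
        "Kappa": kappa,
        "MCC": mcc,
    }


# Hypothetical counts, for illustration only.
scores = binary_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Kappa and MCC both correct for class imbalance, which is why they sit well below accuracy in the comparison tables.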

In [8]:
cm = compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lda,Linear Discriminant Analysis,0.7783,0.8434,0.5547,0.7474,0.631,0.4786,0.4929,0.07
lr,Logistic Regression,0.7746,0.8406,0.5602,0.7365,0.631,0.4737,0.4863,0.952
rf,Random Forest Classifier,0.7729,0.848,0.6307,0.6987,0.6604,0.4905,0.494,0.129
et,Extra Trees Classifier,0.7709,0.8412,0.5713,0.7245,0.6354,0.4717,0.4812,0.135
ridge,Ridge Classifier,0.7708,0.0,0.5275,0.7407,0.6094,0.4555,0.4721,0.056
nb,Naive Bayes,0.7654,0.8275,0.6137,0.6919,0.6447,0.4709,0.4772,0.058
ada,Ada Boost Classifier,0.7652,0.8294,0.6348,0.6872,0.6526,0.4764,0.4825,0.089
gbc,Gradient Boosting Classifier,0.7615,0.8372,0.6298,0.6698,0.6455,0.4667,0.4698,0.102
lightgbm,Light Gradient Boosting Machine,0.7578,0.8226,0.6409,0.6593,0.6469,0.4631,0.4658,0.235
qda,Quadratic Discriminant Analysis,0.7446,0.8062,0.5383,0.6625,0.5874,0.4079,0.4155,0.061
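
The table above is sorted by accuracy, and ranking the same rows by another metric can change the winner. The sketch below re-sorts a few values copied from the table in plain Python. The helper `best_by` is hypothetical. In PyCaret itself, `compare_models` accepts a `sort` argument (for example `sort='F1'`) for the same purpose.

```python
# (model id, Accuracy, AUC, F1) copied from the diabetes comparison table.
rows = [
    ("lda", 0.7783, 0.8434, 0.6310),
    ("lr", 0.7746, 0.8406, 0.6310),
    ("rf", 0.7729, 0.8480, 0.6604),
    ("et", 0.7709, 0.8412, 0.6354),
    ("ada", 0.7652, 0.8294, 0.6526),
]

# Column positions inside each row tuple.
ACCURACY, AUC, F1 = 1, 2, 3


def best_by(rows, column):
    """Return the model id with the highest value in the given column."""
    return max(rows, key=lambda row: row[column])[0]


print("Best by accuracy:", best_by(rows, ACCURACY))
print("Best by AUC:", best_by(rows, AUC))
print("Best by F1:", best_by(rows, F1))
```

On this run, Linear Discriminant Analysis leads on accuracy while Random Forest leads on both AUC and F1, so the choice of metric matters when the classes are imbalanced.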



## 8. "Three lines of code" for model comparison for "Cancer" dataset

In [7]:
from pycaret.datasets import get_data
from pycaret.classification import *

dataset = get_data('cancer')
s = setup(data = dataset,target = 'Class')
cm = compare_models()

Unnamed: 0,Class,age,menopause,tumor-size,inv-nodes,node-caps,deg-malig,breast,breast-quad,irradiat
0,0,5,1,1,1,2,1,3,1,1
1,0,5,4,4,5,7,10,3,2,1
2,0,3,1,1,1,2,2,3,1,1
3,0,6,8,8,1,3,4,3,7,1
4,0,4,1,1,3,2,1,3,1,1


Unnamed: 0,Description,Value
0,Session id,2004
1,Target,Class
2,Target type,Binary
3,Original data shape,"(683, 10)"
4,Transformed data shape,"(683, 10)"
5,Transformed train set shape,"(478, 10)"
6,Transformed test set shape,"(205, 10)"
7,Numeric features,9
8,Preprocess,True
9,Imputation type,simple


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
et,Extra Trees Classifier,0.9749,0.9951,0.9765,0.9557,0.9647,0.9452,0.9469,0.147
knn,K Neighbors Classifier,0.9728,0.9896,0.9824,0.9431,0.9618,0.9407,0.9418,0.077
rf,Random Forest Classifier,0.9686,0.9907,0.9643,0.9496,0.9555,0.9313,0.933,0.155
lr,Logistic Regression,0.9643,0.9932,0.9393,0.9563,0.947,0.9201,0.9209,0.072
lightgbm,Light Gradient Boosting Machine,0.9643,0.9921,0.9577,0.9424,0.9489,0.9215,0.9229,0.138
nb,Naive Bayes,0.9623,0.9844,0.9761,0.9228,0.948,0.9185,0.9203,0.07
gbc,Gradient Boosting Classifier,0.9622,0.9907,0.964,0.934,0.9475,0.918,0.9198,0.094
svm,SVM - Linear Kernel,0.96,0.0,0.9327,0.9516,0.9398,0.91,0.9123,0.071
ridge,Ridge Classifier,0.958,0.0,0.9096,0.9677,0.9372,0.9057,0.9073,0.074
lda,Linear Discriminant Analysis,0.958,0.993,0.9096,0.9677,0.9372,0.9057,0.9073,0.074



## 9. "Three lines of code" for model comparison for "Heart Disease" dataset

In [8]:
from pycaret.datasets import get_data
from pycaret.classification import *

dataset = get_data('heart_disease')
s = setup(data = dataset,target = 'Disease')
cm = compare_models()

Unnamed: 0,age,sex,chest pain type,resting blood pressure,serum cholestoral in mg/dl,fasting blood sugar > 120 mg/dl,resting electrocardiographic results,maximum heart rate achieved,exercise induced angina,oldpeak,slope of peak,number of major vessels,thal,Disease
0,70,1,4,130,322,0,2,109,0,2.4,2,3,3,1
1,67,0,3,115,564,0,2,160,0,1.6,2,0,7,0
2,57,1,2,124,261,0,0,141,0,0.3,1,0,7,1
3,64,1,4,128,263,0,0,105,1,0.2,2,1,7,0
4,74,0,2,120,269,0,2,121,1,0.2,1,1,3,0


Unnamed: 0,Description,Value
0,Session id,6427
1,Target,Disease
2,Target type,Binary
3,Original data shape,"(270, 14)"
4,Transformed data shape,"(270, 14)"
5,Transformed train set shape,"(189, 14)"
6,Transformed test set shape,"(81, 14)"
7,Numeric features,13
8,Preprocess,True
9,Imputation type,simple


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
ridge,Ridge Classifier,0.883,0.0,0.8278,0.9236,0.8568,0.7581,0.7783,0.084
lr,Logistic Regression,0.8781,0.9295,0.8403,0.8978,0.856,0.7493,0.7643,0.11
lda,Linear Discriminant Analysis,0.8775,0.9216,0.8278,0.9036,0.8516,0.747,0.7629,0.09
et,Extra Trees Classifier,0.8673,0.9347,0.8292,0.8709,0.8427,0.7272,0.7362,0.164
nb,Naive Bayes,0.8567,0.9344,0.8417,0.8562,0.8408,0.709,0.7206,0.086
ada,Ada Boost Classifier,0.852,0.9057,0.8444,0.8286,0.8312,0.6987,0.7066,0.127
gbc,Gradient Boosting Classifier,0.852,0.9131,0.8194,0.855,0.829,0.6979,0.7092,0.109
rf,Random Forest Classifier,0.8515,0.9267,0.7806,0.8914,0.8182,0.6933,0.7128,0.163
qda,Quadratic Discriminant Analysis,0.8462,0.9096,0.8306,0.8377,0.8278,0.6873,0.6967,0.096
lightgbm,Light Gradient Boosting Machine,0.8363,0.9301,0.8069,0.8382,0.8123,0.6661,0.681,0.1
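
Sections 8 and 9 repeat the same three calls with a different dataset name and target, so a small wrapper can make the pattern reusable. This is only a sketch: the function name `compare_on` and its defaults are hypothetical, and it assumes PyCaret 3.x, where `setup` accepts `session_id` (to fix the random seed so the split and scores are repeatable) and `compare_models` accepts `sort`.

```python
def compare_on(dataset_name, target, session_id=123, sort="Accuracy"):
    """Load a PyCaret sample dataset, set up an experiment, and compare models.

    Returns the best model according to the `sort` metric.
    """
    # Imported here so the helper can be defined before PyCaret is loaded.
    from pycaret.classification import compare_models, setup
    from pycaret.datasets import get_data

    data = get_data(dataset_name)
    # A fixed session_id makes the train/test split reproducible across runs.
    setup(data=data, target=target, session_id=session_id)
    return compare_models(sort=sort)
```

With PyCaret installed, `best = compare_on('heart_disease', 'Disease')` would rerun the comparison from this section with a fixed seed.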

