Welcome to the PyCaret tutorial 🌟  
In this tutorial we will learn how to use PyCaret to choose the best model for our data.  
First, we need to install PyCaret.  
You can install it by running this command in your terminal:  
  
pip install pycaret

Now we will use PyCaret for a classification problem.


In [1]:
# importing the libraries
from pycaret.classification import * 
from pycaret.datasets import get_data   # get_data loads one of PyCaret's example datasets by name

In [None]:
# Run this cell if you want to list the datasets that the get_data function offers 💫
index = get_data('index')

We will use the glass dataset for this tutorial  
Our goal is to predict the type of glass based on its properties

In [3]:
# loading the data
data = get_data('glass') 

,RI,Na,Mg,Al,Si,K,Ca,Ba,Fe,Type
0,1.52101,13.64,4.49,1.1,71.78,0.06,8.75,0.0,0.0,1
1,1.51761,13.89,3.6,1.36,72.73,0.48,7.83,0.0,0.0,1
2,1.51618,13.53,3.55,1.54,72.99,0.39,7.78,0.0,0.0,1
3,1.51766,13.21,3.69,1.29,72.61,0.57,8.22,0.0,0.0,1
4,1.51742,13.27,3.62,1.24,73.08,0.55,8.07,0.0,0.0,1


In [4]:
# checking the unique values of the target column
data['Type'].unique().tolist()

[1, 2, 3, 5, 6, 7]

Now to the fun part: we will use PyCaret to choose the best model for our data ✨  
PyCaret will handle the data splitting, preprocessing, and model training for us, with feature engineering and feature selection available as options 🤩


In [5]:
# setup prepares the data (train/test split and preprocessing) before any model is trained
s = setup(data, target = 'Type', session_id = 123) # session_id is the random seed; fixing it gives the same results every time we run the code

,Description,Value
0,Session id,123
1,Target,Type
2,Target type,Multiclass
3,Target mapping,"1: 0, 2: 1, 3: 2, 5: 3, 6: 4, 7: 5"
4,Original data shape,"(214, 10)"
5,Transformed data shape,"(214, 10)"
6,Transformed train set shape,"(149, 10)"
7,Transformed test set shape,"(65, 10)"
8,Numeric features,9
9,Preprocess,True


The table above summarizes the settings PyCaret will use for this experiment. Most of them are defaults, and you can change them through the arguments of setup to fit your needs 😄
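
To make the `Session id` and `Target mapping` rows of the summary more concrete, here is a minimal pure-Python sketch. It is an illustration only, not PyCaret's actual implementation: the helper names `seeded_split` and `encode_labels` are hypothetical, and the 70/30 split ratio is an assumption inferred from the 149/65 train/test shapes in the table.

```python
import random


def seeded_split(n_rows, train_fraction=0.7, seed=123):
    """Shuffle row indices with a fixed seed and split them into train/test."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # same seed -> same shuffle every run
    n_train = int(n_rows * train_fraction)  # rounds down, so 214 rows -> 149
    return indices[:n_train], indices[n_train:]


def encode_labels(labels):
    """Map each distinct label to a consecutive integer, in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(labels)))}


train_idx, test_idx = seeded_split(214)
print(len(train_idx), len(test_idx))

# Re-running with the same seed reproduces the split exactly.
print(seeded_split(214) == (train_idx, test_idx))

# The glass types from this tutorial, mapped to 0..5 as in the summary table.
print(encode_labels([1, 2, 3, 5, 6, 7]))
```

Changing the seed gives a different shuffle, and therefore different train and test rows, which is why scores can vary between runs when no seed is fixed.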


In [6]:
# Now it's time to compare the models
best = compare_models()

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
et,Extra Trees Classifier,0.7724,0.568,0.7724,0.7326,0.7402,0.6833,0.6969,0.018
rf,Random Forest Classifier,0.759,0.5643,0.759,0.7251,0.7278,0.6631,0.6802,0.024
lightgbm,Light Gradient Boosting Machine,0.7252,0.5538,0.7252,0.6886,0.6919,0.6115,0.6281,0.067
gbc,Gradient Boosting Classifier,0.7052,0.0,0.7052,0.6724,0.6743,0.5886,0.6023,0.064
xgboost,Extreme Gradient Boosting,0.6843,0.5384,0.6843,0.6737,0.6586,0.5607,0.5815,0.021
dt,Decision Tree Classifier,0.679,0.4562,0.679,0.6544,0.6525,0.5593,0.5708,0.006
knn,K Neighbors Classifier,0.5919,0.4895,0.5919,0.5416,0.5488,0.418,0.4341,0.167
lda,Linear Discriminant Analysis,0.5843,0.0,0.5843,0.5414,0.537,0.4072,0.4289,0.006
lr,Logistic Regression,0.5714,0.0,0.5714,0.5448,0.5256,0.3962,0.4291,0.345
ridge,Ridge Classifier,0.5705,0.0,0.5705,0.5096,0.5111,0.3834,0.4121,0.006


The table above lists the models PyCaret trained, along with the cross-validated performance of each one on several metrics 🤩  
Depending on the problem you are trying to solve, a different metric may be the right one for judging the models 🔍
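
One detail worth noticing in the table is that the Accuracy and Recall columns are identical. That is expected if recall is averaged across classes with weights proportional to class size, which is an assumption about how the multiclass scores are computed here. The pure-Python sketch below uses made-up toy labels (not the glass data) to show why the two numbers coincide, and how an unweighted (macro) average can differ.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    """Recall for each class: correct predictions / rows of that class."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


def weighted_recall(y_true, y_pred):
    """Average of per-class recall, weighted by class frequency."""
    totals = Counter(y_true)
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls[c] * totals[c] for c in totals) / len(y_true)


def macro_recall(y_true, y_pred):
    """Plain (unweighted) average of per-class recall."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


# Toy, imbalanced labels: class 1 is common, classes 2 and 7 are rare.
y_true = [1, 1, 1, 1, 1, 1, 2, 2, 7, 7]
y_pred = [1, 1, 1, 1, 1, 1, 2, 1, 1, 1]

print(accuracy(y_true, y_pred))
print(weighted_recall(y_true, y_pred))
print(macro_recall(y_true, y_pred))
```

In this toy case the model never predicts class 7, yet accuracy still looks reasonable. That is one reason to also check F1, Kappa, or MCC when the classes are imbalanced.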


In [8]:
print(best)  # best is the top-ranked model returned by compare_models

ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                     criterion='gini', max_depth=None, max_features='sqrt',
                     max_leaf_nodes=None, max_samples=None,
                     min_impurity_decrease=0.0, min_samples_leaf=1,
                     min_samples_split=2, min_weight_fraction_leaf=0.0,
                     monotonic_cst=None, n_estimators=100, n_jobs=-1,
                     oob_score=False, random_state=123, verbose=0,
                     warm_start=False)


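Now we will use the best model to make predictions 💻  
Because we pass the full dataset to predict_model, the scores below include rows the model was trained on. To score only the hold-out test set, call predict_model(best) without a dataset.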


In [18]:
predictions = predict_model(best, data)  # passing data scores the model on the full dataset (training rows included), not only the test set
print(predictions.sample(5).to_string())  # sample(5) picks 5 random rows from the predictions dataframe

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Extra Trees Classifier,0.9299,0.9951,0.9299,0.9321,0.9294,0.9038,0.9045


          RI     Na    Mg    Al         Si     K     Ca    Ba    Fe  Type  prediction_label  prediction_score
166  1.52151  11.03  1.71  1.56  73.440002  0.58  11.62  0.00  0.00     5                 5              1.00
199  1.51609  15.01  0.00  2.51  73.050003  0.05   8.83  0.53  0.00     7                 7              0.94
31   1.51747  12.84  3.50  1.14  73.269997  0.56   8.55  0.00  0.00     1                 1              1.00
128  1.52068  13.55  2.09  1.67  72.180000  0.53   9.57  0.27  0.17     2                 2              1.00
93   1.51590  13.24  3.34  1.47  73.099998  0.39   8.22  0.00  0.00     2                 2              0.79
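
Note that `data` here is the full dataset, so it includes the rows the model was trained on. That likely explains why the accuracy above (0.9299) is much higher than the cross-validated 0.7724 in the comparison table. The toy sketch below uses made-up numbers and a hand-written 1-nearest-neighbour model (hypothetical helpers, not PyCaret) to show how mixing training rows into the evaluation inflates the score.

```python
def predict_1nn(train_rows, x):
    """Return the label of the training row whose feature is closest to x."""
    nearest = min(train_rows, key=lambda row: abs(row[0] - x))
    return nearest[1]


def accuracy(train_rows, eval_rows):
    """Share of eval_rows that the 1-nearest-neighbour model labels correctly."""
    correct = sum(predict_1nn(train_rows, x) == label for x, label in eval_rows)
    return correct / len(eval_rows)


# Made-up (feature, label) pairs; the model simply memorises the training rows.
train = [(1.0, "a"), (2.0, "a"), (3.0, "b"), (4.0, "b")]
holdout = [(2.4, "a"), (2.6, "a"), (3.6, "b")]

print(accuracy(train, train))            # rows seen during training
print(accuracy(train, holdout))          # unseen rows only
print(accuracy(train, train + holdout))  # everything, like scoring the full dataset
```

The score on all rows sits between the perfect training score and the lower hold-out score, so it overstates how well the model handles new data.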


Now it's time to use PyCaret for a regression problem 📈  
Let's import the regression module and load the dataset.



In [20]:
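# this star import replaces the classification versions of setup, compare_models and predict_model
from pycaret.regression import *
Data = get_data('diamond')  # get_data is still available from the earlier pycaret.datasets import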

,Carat Weight,Cut,Color,Clarity,Polish,Symmetry,Report,Price
0,1.1,Ideal,H,SI1,VG,EX,GIA,5169
1,0.83,Ideal,H,VS1,ID,ID,AGSL,3470
2,0.85,Ideal,H,SI1,EX,EX,GIA,3183
3,0.91,Ideal,E,SI1,VG,VG,GIA,4370
4,0.83,Ideal,G,SI1,EX,EX,GIA,3171


Just like the classification problem, we will use the setup function to prepare the data for the model


In [24]:
S = setup(Data, target = 'Price', session_id = 123) 

,Description,Value
0,Session id,123
1,Target,Price
2,Target type,Regression
3,Original data shape,"(6000, 8)"
4,Transformed data shape,"(6000, 29)"
5,Transformed train set shape,"(4200, 29)"
6,Transformed test set shape,"(1800, 29)"
7,Numeric features,1
8,Categorical features,6
9,Preprocess,True
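
The summary shows the data growing from 8 to 29 columns. A plausible reason, assuming PyCaret one-hot encodes categorical features during preprocessing, is that each category becomes its own 0/1 column. The sketch below is a hand-rolled illustration on the `Color` values from the five rows printed earlier; `one_hot` is a hypothetical helper, not a PyCaret function.

```python
def one_hot(values):
    """Turn a list of category strings into rows of 0/1 indicator columns."""
    categories = sorted(set(values))
    rows = [[int(value == category) for category in categories] for value in values]
    return categories, rows


# Color values from the first five diamonds shown above.
colors = ["H", "H", "H", "E", "G"]

columns, encoded = one_hot(colors)
print(columns)
for row in encoded:
    print(row)
```

Here one text column with three distinct values becomes three numeric columns, so six categorical features could account for the extra columns in the transformed data.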


In [25]:
# Now it's time to compare the models
Best = compare_models()

,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
xgboost,Extreme Gradient Boosting,669.4844,1821212.1477,1331.1471,0.9826,0.0722,0.053,0.029
et,Extra Trees Regressor,719.6871,2033874.0284,1390.8879,0.9809,0.0786,0.0585,0.153
rf,Random Forest Regressor,725.2299,2336386.4997,1491.5615,0.9781,0.0785,0.0577,0.135
lightgbm,Light Gradient Boosting Machine,721.5692,2747712.106,1593.0198,0.9748,0.075,0.0551,0.057
gbr,Gradient Boosting Regressor,870.4619,2677900.4116,1616.3756,0.9744,0.099,0.075,0.045
dt,Decision Tree Regressor,919.7628,3495425.391,1821.5832,0.9664,0.1025,0.0744,0.017
llar,Lasso Least Angle Regression,2489.5031,14919265.5729,3837.0972,0.8571,0.6592,0.2962,0.017
ridge,Ridge Regression,2491.2859,14957594.1466,3840.8633,0.8568,0.647,0.2966,0.016
br,Bayesian Ridge,2493.3298,14989750.3796,3844.9434,0.8565,0.6497,0.2967,0.017
lasso,Lasso Regression,2490.799,14993880.8447,3845.5124,0.8565,0.6525,0.2961,0.184
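
The table is ordered by R2, but the ranking can change with the metric. The sketch below re-sorts four rows copied from the table by MAE using plain Python. In PyCaret itself the `sort` argument of `compare_models` (for example `compare_models(sort='MAE')`) serves this purpose; the `rank_by` helper here is hypothetical and only illustrates the idea.

```python
# (model id, MAE, R2) copied from the first four rows of the comparison table.
results = [
    ("xgboost", 669.4844, 0.9826),
    ("et", 719.6871, 0.9809),
    ("rf", 725.2299, 0.9781),
    ("lightgbm", 721.5692, 0.9748),
]


def rank_by(rows, column, higher_is_better):
    """Return model ids ordered from best to worst on the given column index."""
    ordered = sorted(rows, key=lambda row: row[column], reverse=higher_is_better)
    return [row[0] for row in ordered]


by_r2 = rank_by(results, column=2, higher_is_better=True)
by_mae = rank_by(results, column=1, higher_is_better=False)

print(by_r2)
print(by_mae)
```

Random Forest and LightGBM swap places between the two rankings, so it is worth deciding which metric matters for your problem before picking a model.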


In [26]:
print(Best)


XGBRegressor(base_score=None, booster='gbtree', callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device='cpu', early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=None, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=None, max_leaves=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             multi_strategy=None, n_estimators=None, n_jobs=-1,
             num_parallel_tree=None, objective='reg:squarederror', ...)


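Now we will use the best model to make predictions 💻  
As before, we pass the full dataset, so the scores below include the training rows. Call predict_model(Best) without a dataset to score only the hold-out test set.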


In [27]:
Predictions = predict_model(Best, Data)
print(Predictions.sample(5).to_string())



,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,Extreme Gradient Boosting,436.7569,842835.1498,918.0605,0.9919,0.0545,0.0391


      Carat Weight        Cut Color Clarity Polish Symmetry Report  Price  prediction_label
1201          0.75      Ideal     G    VVS2     EX       EX    GIA   3879       3763.576172
1084          1.08      Ideal     F     SI1     VG       VG    GIA   5534       5857.025391
2060          1.29  Very Good     G     VS1      G       VG    GIA   9105       9802.951172
248           1.19      Ideal     G     VS1     ID       ID   AGSL   8034       8664.810547
5603          1.01  Very Good     E     SI1     VG       VG    GIA   5384       5259.361816


A quick explanation of the metrics used in the regression problem:  
  
MAE: Mean Absolute Error (the average of the absolute differences between the predicted and actual values) | The lower the better  
MSE: Mean Squared Error (the average of the squared differences between the predicted and actual values, so large errors are penalized more heavily) | The lower the better  
RMSE: Root Mean Squared Error (the square root of the MSE, which puts the error back in the same units as the target) | The lower the better  
R2: R-squared (the proportion of the variance in the target that the model explains; 1 means a perfect fit) | The higher the better  
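
As a rough illustration of these definitions, the sketch below computes the four metrics by hand in plain Python, using the Price and prediction_label values from the five sampled rows printed above. The `regression_metrics` helper is hypothetical and is not how PyCaret computes its scores internally.

```python
import math


def regression_metrics(actual, predicted):
    """Compute MAE, MSE, RMSE and R2 for two equal-length lists of numbers."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_actual = sum(actual) / n
    total_variation = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - sum(e * e for e in errors) / total_variation
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


# Price and prediction_label from the five sampled rows printed above.
actual = [3879, 5534, 9105, 8034, 5384]
predicted = [3763.576172, 5857.025391, 9802.951172, 8664.810547, 5259.361816]

metrics = regression_metrics(actual, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Five rows are a tiny sample, so these numbers will not match the scores reported for the whole dataset.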


Thank you for following this tutorial. I hope you found it useful and informative 🌟  
If you have any questions or suggestions, please feel free to ask me 😄  
LinkedIn: https://www.linkedin.com/notifications/?filter=all