# PyCaret

PyCaret is an open-source, low-code machine learning library and end-to-end model management tool built in Python for automating machine learning workflows. It is popular for its ease of use and for how quickly it lets you build and deploy end-to-end ML prototypes.

#### Installs

In [4]:
!pip install pycaret[full]
#!pip install pretty_errors
#!python --version

Collecting pycaret[full]
  Using cached pycaret-3.3.2-py3-none-any.whl.metadata (17 kB)
Collecting ipywidgets>=7.6.5 (from pycaret[full])
  Downloading ipywidgets-8.1.5-py3-none-any.whl.metadata (2.3 kB)
Collecting tqdm>=4.62.0 (from pycaret[full])
  Downloading tqdm-4.66.5-py3-none-any.whl.metadata (57 kB)
Collecting numpy<1.27,>=1.21 (from pycaret[full])
  Downloading numpy-1.26.4-cp311-cp311-win_amd64.whl.metadata (61 kB)
Collecting pandas<2.2.0 (from pycaret[full])
  Downloading pandas-2.1.4-cp311-cp311-win_amd64.whl.metadata (18 kB)
Collecting scipy<=1.11.4,>=1.6.1 (from pycaret[full])
  Downloading scipy-1.11.4-cp311-cp311-win_amd64.whl.metadata (60 kB)
Collecting joblib<1.4,>=1.2.0 (from pycaret[full])
  Using cached joblib-1.3.2-py3-none-any.whl.metadata (5.4 kB)
Collecting scikit-learn>1.4.0 (from pycaret[full])
  Downloading scikit_learn-1.5.2-cp311-cp311-win_amd64.whl.metadata (13 kB)
Collecting pyod>=1.1.3 (from pycaret[full])
  Using cached pyod-2.0.2.tar.gz (165 kB)
Using cached tbats-1.1.3-py3-none-any.whl (44 kB)
Downloading tqdm-4.66.5-py3-none-any.whl (78 kB)
Using cached umap_learn-0.5.6-py3-none-any.whl (85 kB)
Using cached uvicorn-0.30.6-py3-none-any.whl (62 kB)
Using cached werkzeug-2.3.8-py3-none-any.whl (242 kB)
Using cached ydata_profiling-4.10.0-py2.py3-none-any.whl (356 kB)
Using cached ImageHash-4.3.1-py2.py3-none-any.whl (296 kB)
Using cached yellowbrick-1.5-py3-none-any.whl (282 kB)

#### Dataset

In [1]:
import pretty_errors
from pycaret.datasets import get_data
data = get_data('insurance')

,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


#### Data Preparation

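Whenever you call the setup function, PyCaret profiles the dataset and infers the data types of all input features. Older PyCaret releases paused here and asked you to press Enter to confirm the inferred types; with the 3.x release installed above, setup appears to run straight through and print the summary table, so check the inferred feature counts there.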

In [2]:
# Initialize setup
from pycaret.regression import *
s = setup(data, target = 'charges')

,Description,Value
0,Session id,965
1,Target,charges
2,Target type,Regression
3,Original data shape,"(1338, 7)"
4,Transformed data shape,"(1338, 10)"
5,Transformed train set shape,"(936, 10)"
6,Transformed test set shape,"(402, 10)"
7,Numeric features,3
8,Categorical features,3
9,Preprocess,True


#### Available Models

To check the list of all models available for training, use the models function. It displays a table with the model ID, the name, and a reference to the actual estimator.

In [3]:
# Check all the available models
models()

ID,Name,Reference,Turbo
lr,Linear Regression,sklearn.linear_model._base.LinearRegression,True
lasso,Lasso Regression,sklearn.linear_model._coordinate_descent.Lasso,True
ridge,Ridge Regression,sklearn.linear_model._ridge.Ridge,True
en,Elastic Net,sklearn.linear_model._coordinate_descent.Elast...,True
lar,Least Angle Regression,sklearn.linear_model._least_angle.Lars,True
llar,Lasso Least Angle Regression,sklearn.linear_model._least_angle.LassoLars,True
omp,Orthogonal Matching Pursuit,sklearn.linear_model._omp.OrthogonalMatchingPu...,True
br,Bayesian Ridge,sklearn.linear_model._bayes.BayesianRidge,True
ard,Automatic Relevance Determination,sklearn.linear_model._bayes.ARDRegression,False
par,Passive Aggressive Regressor,sklearn.linear_model._passive_aggressive.Passi...,True


#### Model Training & Selection a

The most used function for training any model in PyCaret is create_model. It takes the ID of the estimator you want to train.

The output shows the 10-fold cross-validated metrics with mean and standard deviation. The function returns a trained model object, which is essentially a scikit-learn object.

In [4]:
# Train the decision tree
dt = create_model('dt')

Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,2686.1489,34458757.0785,5870.1582,0.6771,0.4727,0.3017
1,2584.0985,31326670.8308,5597.0234,0.7545,0.4268,0.3006
2,4061.895,60418203.2928,7772.9147,0.4323,0.5469,0.4272
3,2815.1135,38450702.7614,6200.8631,0.7877,0.5304,0.4181
4,2443.0983,29384432.8714,5420.741,0.818,0.4555,0.2403
5,2439.7678,26867269.919,5183.3647,0.8403,0.3786,0.1746
6,4376.715,63727613.8698,7982.9577,0.318,0.6576,0.4001
7,3469.1438,44938065.9864,6703.5861,0.7377,0.5819,0.375
8,2637.4232,32985692.8248,5743.3172,0.7518,0.5386,0.3572
9,3541.3879,54560682.6523,7386.5203,0.659,0.5236,0.3735


In [5]:
print(dt)

DecisionTreeRegressor(random_state=965)
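
The fold table for the decision tree lists the ten folds only, so the mean and standard deviation can be recomputed from the printed values. This is a small sketch using the R2 column; it assumes the population standard deviation (ddof=0), which may differ from what PyCaret reports. It also checks that RMSE is the square root of MSE for fold 0.

```python
import math
import statistics

# R2 values copied from the decision tree fold table
r2_by_fold = [0.6771, 0.7545, 0.4323, 0.7877, 0.818,
              0.8403, 0.318, 0.7377, 0.7518, 0.659]

r2_mean = statistics.fmean(r2_by_fold)
r2_std = statistics.pstdev(r2_by_fold)   # population std, an assumption

print(f"R2 mean={r2_mean:.4f} std={r2_std:.4f}")

# RMSE should equal sqrt(MSE); values copied from fold 0
fold0_mse, fold0_rmse = 34458757.0785, 5870.1582
rmse_matches = math.isclose(math.sqrt(fold0_mse), fold0_rmse, abs_tol=1e-3)
print("fold 0 RMSE == sqrt(MSE):", rmse_matches)
```

The R2 values range from 0.318 to 0.8403 across folds, which suggests the single tree is sensitive to which rows land in each fold.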


#### Model Training & Selection b

To train multiple models in a loop, you can write a simple list comprehension:

In [19]:
# Train multiple models
multiple_models = [create_model(i) for i in ['dt', 'lr', 'xgboost']]

# Check multiple_models
type(multiple_models), len(multiple_models)
#>>>(list, 3)

print(multiple_models)

Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,2686.1489,34458757.0785,5870.1582,0.6771,0.4727,0.3017
1,2584.0985,31326670.8308,5597.0234,0.7545,0.4268,0.3006
2,4061.895,60418203.2928,7772.9147,0.4323,0.5469,0.4272
3,2815.1135,38450702.7614,6200.8631,0.7877,0.5304,0.4181
4,2443.0983,29384432.8714,5420.741,0.818,0.4555,0.2403
5,2439.7678,26867269.919,5183.3647,0.8403,0.3786,0.1746
6,4376.715,63727613.8698,7982.9577,0.318,0.6576,0.4001
7,3469.1438,44938065.9864,6703.5861,0.7377,0.5819,0.375
8,2637.4232,32985692.8248,5743.3172,0.7518,0.5386,0.3572
9,3541.3879,54560682.6523,7386.5203,0.659,0.5236,0.3735


Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,3908.4462,27235759.4962,5218.7891,0.7448,0.5074,0.462
1,3772.0015,31762410.6016,5635.815,0.7511,0.421,0.3159
2,3770.3455,31426000.2759,5605.8898,0.7047,0.7505,0.4568
3,4062.6876,33826211.0249,5816.0305,0.8132,0.6119,0.3896
4,4542.9602,38983152.0757,6243.6489,0.7586,0.7171,0.4129
5,4159.1489,34402497.0594,5865.3642,0.7955,0.6345,0.4028
6,4755.7748,44673653.2347,6683.8352,0.5219,0.7208,0.4712
7,4564.1747,41166796.0771,6416.1356,0.7597,0.5276,0.4059
8,4156.1393,31855400.5582,5644.0589,0.7603,0.5384,0.4886
9,4788.5506,50468503.5617,7104.1188,0.6845,0.4837,0.4274


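ValueError: Estimator xgboost not available. Please see docstring for list of available estimators.

The first two models trained, but the loop stopped on 'xgboost'. The most likely cause is that the xgboost package could not be imported in this environment, in which case PyCaret does not offer that ID. Only IDs listed by models() can be passed to create_model.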
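
One way to keep such a loop from stopping on an unavailable ID is to check each ID against the available ones first. The helper below is a hypothetical sketch, not part of PyCaret: `train_available`, `available_ids`, and `train_fn` are names introduced here, and the demo uses stand-in values so it runs without PyCaret.

```python
def train_available(model_ids, available_ids, train_fn):
    """Train only the IDs that are available; return (trained, skipped)."""
    available = set(available_ids)
    trained, skipped = {}, []
    for model_id in model_ids:
        if model_id in available:
            trained[model_id] = train_fn(model_id)
        else:
            skipped.append(model_id)
    return trained, skipped


# Stand-in values so the sketch runs on its own
demo_available = ["dt", "lr", "ridge"]
demo_trained, demo_skipped = train_available(
    ["dt", "lr", "xgboost"], demo_available, train_fn=lambda i: f"model<{i}>"
)
print(demo_trained, demo_skipped)
```

In the notebook, `available_ids` would presumably be `models().index` (the models table is indexed by ID) and `train_fn` would be `create_model`.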

#### Compare all models, Check the best model, Predict on hold out

If you want to train all the models available in the library instead of a selected few, use PyCaret’s compare_models function rather than writing your own loop (the results will be the same). It returns the best model and displays the cross-validated metrics for every model it trained.

Called with only a model, predict_model scores the hold-out set created by setup. To generate predictions on an unseen dataset, use the same predict_model function and pass the extra data parameter.


In [9]:
# Compare all models
best_model = compare_models()

,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
gbr,Gradient Boosting Regressor,2561.9581,20839872.6702,4509.2381,0.8453,0.4367,0.3015,0.043
lightgbm,Light Gradient Boosting Machine,2866.222,23001115.8795,4734.2444,0.8289,0.5375,0.3573,0.067
catboost,CatBoost Regressor,2769.046,23088427.0349,4739.5314,0.8281,0.4627,0.3281,0.402
rf,Random Forest Regressor,2702.5424,23057400.4013,4755.0618,0.8279,0.4476,0.3129,0.091
et,Extra Trees Regressor,2733.2611,26184376.7773,5093.9676,0.804,0.4697,0.3097,0.073
ada,AdaBoost Regressor,4109.5438,27304096.3138,5198.232,0.7953,0.6165,0.6985,0.023
ridge,Ridge Regression,4260.2893,36580250.7406,6023.8019,0.7296,0.5986,0.4256,0.021
llar,Lasso Least Angle Regression,4248.1383,36578723.3345,6023.2914,0.7295,0.5931,0.4234,0.022
br,Bayesian Ridge,4254.6285,36580715.7925,6023.6433,0.7295,0.6006,0.4245,0.023
lasso,Lasso Regression,4248.141,36578730.0646,6023.2918,0.7295,0.5931,0.4234,0.019


In [10]:
# Check the best model
print(best_model)

GradientBoostingRegressor(random_state=965)
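
The comparison table appears to be sorted by R2 (highest first), which is why gbr is returned as the best model. The ranking depends on the metric chosen: the sketch below re-ranks the top four rows by MAE, where lower is better. The rows are copied from the table; the ranking code is plain Python, not PyCaret's implementation.

```python
# Top rows copied from the comparison table: (ID, MAE, R2)
rows = [
    ("gbr", 2561.9581, 0.8453),
    ("lightgbm", 2866.2220, 0.8289),
    ("catboost", 2769.0460, 0.8281),
    ("rf", 2702.5424, 0.8279),
]

# R2 is better when higher, MAE is better when lower
by_r2 = [r[0] for r in sorted(rows, key=lambda r: r[2], reverse=True)]
by_mae = [r[0] for r in sorted(rows, key=lambda r: r[1])]

print("ranked by R2: ", by_r2)
print("ranked by MAE:", by_mae)
```

gbr stays first under both metrics, but the order of the other three changes. compare_models accepts a `sort` argument (for example `sort='MAE'`) if you want the table ordered by a different metric.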


In [11]:
# Predict on hold-out
pred_holdout = predict_model(best_model)

,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,Gradient Boosting Regressor,2514.0753,22150045.8664,4706.3835,0.8594,0.4193,0.2869
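
For readers unfamiliar with the column names, the sketch below computes the six metrics using their standard textbook definitions. It assumes PyCaret follows these definitions (MAPE as a fraction rather than a percentage, RMSLE based on log(1 + y)); the input values are hypothetical and not taken from the insurance data.

```python
import math

def regression_metrics(y_true, y_pred):
    """Standard definitions of the six metrics named in the tables."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n

    mse = sum(e * e for e in errors) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    log_errors = [math.log1p(p) - math.log1p(t) for t, p in zip(y_true, y_pred)]

    return {
        "MAE": sum(abs(e) for e in errors) / n,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "R2": 1 - (mse * n) / ss_tot,
        "RMSLE": math.sqrt(sum(le * le for le in log_errors) / n),
        "MAPE": sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n,
    }

# Hypothetical values for illustration only
metrics = regression_metrics([100, 200, 300, 400], [110, 190, 330, 380])
print(metrics)
```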


In [12]:
# Create a copy of the data and drop the target column
data2 = data.copy()
data2.drop('charges', axis=1, inplace=True)

# Generate predictions
predictions = predict_model(best_model, data = data2)

#### Writing and Training Custom Model
So far I have shown training and model selection for the models available in PyCaret. The way PyCaret works with custom models is exactly the same: as long as your estimator is compatible with the sklearn API style, it will work the same way. Let’s see a few examples.
Before I show you how to write your own custom class, I will first demonstrate how to work with custom non-sklearn models (models that are not available in sklearn or PyCaret’s base library).

#### GPLearn Models:
While Genetic Programming (GP) can be used to perform a very wide variety of tasks, gplearn is purposefully constrained to solving symbolic regression problems.

Symbolic regression is a machine learning technique that aims to identify an underlying mathematical expression that best describes a relationship. It begins by building a population of naive random formulas to represent a relationship between known independent variables and their dependent variable targets to predict new data. Each successive generation of programs is then evolved from the one that came before it by selecting the fittest individuals from the population to undergo genetic operations.

To use models from gplearn you will have to first install it:

In [13]:
# Install gplearn
!pip install gplearn

Collecting gplearn
  Downloading gplearn-0.4.2-py3-none-any.whl.metadata (4.3 kB)
Downloading gplearn-0.4.2-py3-none-any.whl (25 kB)
Installing collected packages: gplearn
Successfully installed gplearn-0.4.2


#### Imports

In [14]:
# Import the untrained estimator
from gplearn.genetic import SymbolicRegressor
sc = SymbolicRegressor()

# Train using create_model
sc_trained = create_model(sc)

Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,2585.7516,23601031.7201,4858.0893,0.7789,0.4047,0.1856
1,3111.2659,33444118.3437,5783.089,0.7379,0.4341,0.1825
2,2530.449,26891322.3799,5185.6844,0.7473,0.4167,0.207
3,3347.8401,35402788.1869,5950.0242,0.8045,0.4587,0.2502
4,3362.2251,39594428.9866,6292.4104,0.7548,0.4648,0.2089
5,3127.5371,34475319.2734,5871.5687,0.7951,0.4067,0.1794
6,3183.5883,36586795.6063,6048.702,0.6085,0.6043,0.1908
7,3269.7147,37028528.0671,6085.1071,0.7839,0.498,0.2039
8,2765.1227,34365017.998,5862.1684,0.7415,0.5004,0.195
9,3080.1971,37237640.7976,6102.2652,0.7672,0.3646,0.17


In [15]:
print(sc_trained)

sub(mul(div(X0, 0.528), div(X0, 0.528)), mul(sub(sub(add(sub(add(X8, X4), add(mul(sub(mul(sub(div(div(X3, X4), add(add(X4, X5), div(X0, -0.762))), add(X4, X3)), div(mul(X5, X2), mul(X6, X2))), add(X4, X3)), div(mul(X2, X6), add(X5, X5))), X2)), add(X5, X5)), mul(sub(mul(sub(div(div(X3, X4), add(add(X4, X5), div(mul(X2, X6), mul(X4, X8)))), add(X4, X3)), div(mul(X2, X6), mul(X4, X8))), add(X4, X3)), div(mul(X2, X6), mul(X4, X8)))), mul(sub(add(X4, X3), add(X4, X3)), div(mul(X2, X6), mul(X4, X8)))), add(mul(sub(add(X1, X0), sub(X1, X1)), X3), sub(div(mul(X6, X1), sub(X0, X1)), div(sub(add(sub(add(X8, X4), add(X3, X2)), div(mul(X5, X2), mul(X6, X2))), mul(sub(div(div(X3, X4), div(X3, X3)), add(X4, X3)), div(mul(X2, X6), mul(X4, X8)))), div(X4, X2))))))
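
The printed program is a nested expression over the transformed feature columns (X0, X1, and so on). To make the notation concrete, here is a minimal evaluator for such expressions written as nested tuples. It is a sketch, not gplearn's implementation; in particular, the protected division (returning 1.0 when the denominator is close to zero) is an assumption about how gplearn's `div` behaves.

```python
def protected_div(a, b):
    # Assumed gplearn-style protected division: 1.0 for near-zero denominators
    return a / b if abs(b) > 0.001 else 1.0

FUNCTIONS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
    "div": protected_div,
}

def evaluate(node, x):
    """Evaluate a nested-tuple program on one feature vector x."""
    if isinstance(node, tuple):
        name, left, right = node
        return FUNCTIONS[name](evaluate(left, x), evaluate(right, x))
    if isinstance(node, str):          # feature reference such as "X0"
        return x[int(node[1:])]
    return node                        # numeric constant

# First sub-expression of the printed program: mul(div(X0, 0.528), div(X0, 0.528))
program = ("mul", ("div", "X0", 0.528), ("div", "X0", 0.528))
result = evaluate(program, [1.056])
print(result)
```

This leading term is simply (X0 / 0.528) squared, so the evolved formula starts from a quadratic in the first feature.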


#### You can also check the hold-out score for this:

In [16]:
# Check hold-out score
pred_holdout_sc = predict_model(sc_trained)

,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,SymbolicRegressor,3074.1674,42180772.6476,6494.6726,0.7322,0.4362,0.1773


#### NGBoost Models

ngboost is a Python library that implements Natural Gradient Boosting, as described in “NGBoost: Natural Gradient Boosting for Probabilistic Prediction”. It is built on top of Scikit-Learn and is designed to be scalable and modular with respect to the choice of proper scoring rule, distribution, and base learner. 

To use models from ngboost, you will have to first install ngboost:

In [17]:
# Install ngboost
!pip install ngboost

Collecting ngboost
  Downloading ngboost-0.5.1-py3-none-any.whl.metadata (4.0 kB)
Collecting lifelines>=0.25 (from ngboost)
  Downloading lifelines-0.29.0-py3-none-any.whl.metadata (3.2 kB)
Collecting autograd>=1.5 (from lifelines>=0.25->ngboost)
  Downloading autograd-1.7.0-py3-none-any.whl.metadata (7.5 kB)
Collecting autograd-gamma>=0.3 (from lifelines>=0.25->ngboost)
  Downloading autograd-gamma-0.5.0.tar.gz (4.0 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting formulaic>=0.2.2 (from lifelines>=0.25->ngboost)
  Downloading formulaic-1.0.2-py3-none-any.whl.metadata (6.8 kB)
Collecting interface-meta>=1.2.0 (from formulaic>=0.2.2->lifelines>=0.25->ngboost)
  Downloading interface_meta-1.3.0-py3-none-any.whl.metadata (6.7 kB)
Downloading ngboost-0.5.1-py3-none-any.whl (33 kB)
Downloading lifelines-0.29.0-py3-none-any.whl (349 kB)
Downloading autograd-1.7.0-py3-none-any.whl (52 kB)
Downloading formulaic-1.0.2-py3-none-

Once installed, you can import the untrained estimator from the ngboost library and use create_model to train and evaluate the model:

In [18]:
# Import untrained estimator
from ngboost import NGBRegressor
ng = NGBRegressor()

# Train using create_model
ng_trained = create_model(ng)

Processing:   0%|          | 0/4 [00:00<?, ?it/s]

ValueError: 
All the 10 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
10 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\josjohn\AppData\Local\anaconda3\envs\Py311\Lib\site-packages\sklearn\model_selection\_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\josjohn\AppData\Local\anaconda3\envs\Py311\Lib\site-packages\pycaret\internal\pipeline.py", line 278, in fit
    fitted_estimator = self._memory_fit(
                       ^^^^^^^^^^^^^^^^^
  File "C:\Users\josjohn\AppData\Local\anaconda3\envs\Py311\Lib\site-packages\joblib\memory.py", line 353, in __call__
    return self.func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\josjohn\AppData\Local\anaconda3\envs\Py311\Lib\site-packages\pycaret\internal\pipeline.py", line 69, in _fit_one
    transformer.fit(*args)
TypeError: NGBoost.fit() missing 1 required positional argument: 'Y'
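
The last frame of the traceback calls `transformer.fit(*args)`, and the error names a capital `Y`. A plausible reading, stated here as an assumption, is that the pipeline builds the argument list by looking for parameters named `X` and `y` in the estimator's fit signature, so an estimator whose target parameter is called `Y` never receives the target. The sketch below reproduces that behaviour with stand-in classes; `UpperYEstimator`, `LowerYAdapter`, and `fit_by_signature` are hypothetical names, not PyCaret or NGBoost code.

```python
from inspect import signature

class UpperYEstimator:
    """Stand-in for an estimator whose fit uses a capital 'Y' parameter."""
    def fit(self, X, Y):
        self.n_samples_ = len(Y)
        return self

class LowerYAdapter(UpperYEstimator):
    """Same estimator, but fit exposes the conventional lowercase 'y'."""
    def fit(self, X, y):
        return super().fit(X, y)

def fit_by_signature(estimator, X, y):
    """Pass only the arguments whose names appear in fit's signature."""
    params = signature(estimator.fit).parameters
    args = []
    if "X" in params:
        args.append(X)
    if "y" in params:
        args.append(y)
    return estimator.fit(*args)

X_demo, y_demo = [[1.0], [2.0]], [3.0, 4.0]

failure = None
try:
    fit_by_signature(UpperYEstimator(), X_demo, y_demo)
except TypeError as exc:
    failure = str(exc)             # the target was never passed
print("capital Y:", failure)

adapted = fit_by_signature(LowerYAdapter(), X_demo, y_demo)
print("lowercase y: fitted on", adapted.n_samples_, "samples")
```

If this reading is right, a thin subclass that renames the parameter to `y` might be enough to make such an estimator work with create_model, though that has not been verified here against NGBoost itself.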


In [None]:
print(ng_trained)

#### Writing a Custom Class

The two examples above, gplearn and ngboost, are custom models for PyCaret because they are not available in the default library, but you can use them just like any out-of-the-box model. However, there may be a use case that involves writing your own algorithm (i.e. the maths behind the algorithm), in which case you can inherit the base class from sklearn and write your own maths.
Below I create a naive estimator that learns the mean value of the target variable during the fit stage and predicts that same mean value for all new data points, irrespective of the X input (probably not useful in real life, but enough to demonstrate the functionality).

In [None]:
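# Create a custom estimator

import numpy as np
from sklearn.base import BaseEstimator

class MyOwnModel(BaseEstimator):
    def __init__(self):
        # No hyperparameters; the mean is learned in fit
        self.mean = 0

    def fit(self, X, y):
        # Learn the mean of the target variable
        self.mean = y.mean()
        return self

    def predict(self, X):
        # Return the learned mean for every row, ignoring the values in X
        return np.full(X.shape[0], self.mean)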

Now let's use this estimator for training:

In [None]:
# Instantiate the MyOwnModel class
mom = MyOwnModel()

# Train using create_model
mom_trained = create_model(mom)

In [None]:
# Generate predictions on the data
predictions = predict_model(mom_trained, data = data)
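
Before handing a custom estimator to PyCaret, it can help to confirm that it behaves under plain scikit-learn tooling. The sketch below is a variant of the mean estimator, renamed `MeanRegressor` to avoid confusion with the class above, checked with `clone` and `cross_val_score` on synthetic data. The data and the choice of scoring are assumptions for illustration; passing this check does not guarantee the estimator works inside create_model.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.model_selection import cross_val_score

class MeanRegressor(RegressorMixin, BaseEstimator):
    """Predicts the training-set mean of the target for every row."""

    def fit(self, X, y):
        # Fitted attributes conventionally end with an underscore
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(X.shape[0], self.mean_)

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(loc=10.0, size=50)

# clone() must work for cross-validation to refit the estimator per fold
model = clone(MeanRegressor()).fit(X, y)
preds = model.predict(X[:5])

scores = cross_val_score(
    MeanRegressor(), X, y, cv=5, scoring="neg_mean_absolute_error"
)
print(preds)
print(scores)
```

Because the predictions ignore X, this kind of estimator is mainly useful as a baseline: any real model should beat its cross-validated error.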