1. Introduction

The purpose of this document is to explore use cases of the Python library PyCaret. It also works through an example and gives an opinion on the framework.


2. PyCaret

PyCaret is a Python library that covers the whole workflow, from data preparation to deployment of the final model, in just a few minutes. It works in any Python notebook environment and can also compare several models automatically.

As an example, we build a Jupyter Notebook that, in just a few lines, reads the data, processes it to obtain a ranking of ML models, trains the best-performing model, and deploys it to obtain predictions on new data.

Before installing PyCaret, note that the documentation consulted for this notebook states that PyCaret only works with Python 3.6 to 3.8, so we first create a virtual environment with Python 3.7.

We then install the dependencies from requirement.txt. Alternatively, PyCaret can be installed with pip, either from a terminal or from the notebook using the commented-out cells below:

In [1]:
# %pip install pycaret[full]

In [2]:
# !pip install pycaret --user
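Since the install only makes sense on a compatible interpreter, a quick check can be run first. The following is a minimal sketch, assuming the 3.6 to 3.8 range quoted above; the helper name `python_is_supported` and the two constants are hypothetical and not part of PyCaret.

```python
import sys

# Assumed supported range, taken from the statement above (Python 3.6 to 3.8).
MIN_SUPPORTED = (3, 6)
MAX_SUPPORTED = (3, 8)


def python_is_supported(version=None, lowest=MIN_SUPPORTED, highest=MAX_SUPPORTED):
    """Return True if the (major, minor) version lies in the inclusive range."""
    if version is None:
        version = sys.version_info
    major_minor = tuple(version[:2])
    return lowest <= major_minor <= highest


current = "{}.{}".format(*sys.version_info[:2])
if python_is_supported():
    print(f"Python {current} is inside the assumed supported range.")
else:
    print(f"Python {current} is outside the assumed range; use a dedicated virtual environment.")
```

The range should be adjusted to whatever the documentation of the installed PyCaret release actually lists.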

We make the following imports:

In [3]:
import pandas as pd
import numpy as np
import pycaret
from pycaret.datasets import get_data
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

We use a dataset bundled with PyCaret called 'credit'. We first list the available datasets and then load it with the following code:

In [4]:
index = get_data('index')

Unnamed: 0,Dataset,Data Types,Default Task,Target Variable 1,Target Variable 2,# Instances,# Attributes,Missing Values
0,anomaly,Multivariate,Anomaly Detection,,,1000,10,N
1,france,Multivariate,Association Rule Mining,InvoiceNo,Description,8557,8,N
2,germany,Multivariate,Association Rule Mining,InvoiceNo,Description,9495,8,N
3,bank,Multivariate,Classification (Binary),deposit,,45211,17,N
4,blood,Multivariate,Classification (Binary),Class,,748,5,N
5,cancer,Multivariate,Classification (Binary),Class,,683,10,N
6,credit,Multivariate,Classification (Binary),default,,24000,24,N
7,diabetes,Multivariate,Classification (Binary),Class variable,,768,9,N
8,electrical_grid,Multivariate,Classification (Binary),stabf,,10000,14,N
9,employee,Multivariate,Classification (Binary),left,,14999,10,N


In [5]:
data = get_data('credit')

Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_1,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default
0,20000,2,2,1,24,2,2,-1,-1,-2,...,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1
1,90000,2,2,2,34,0,0,0,0,0,...,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0
2,50000,2,2,1,37,0,0,0,0,0,...,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0
3,50000,1,2,1,57,-1,0,-1,0,0,...,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0
4,50000,1,1,2,37,0,0,0,0,0,...,19394.0,19619.0,20024.0,2500.0,1815.0,657.0,1000.0,1000.0,800.0,0


In [6]:
data.shape

(24000, 24)

We partition the dataset, keeping 95% of the rows to train the model:

In [7]:
df = data.sample(frac=0.95, random_state=42)
df

Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_1,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default
3111,180000,2,1,2,24,0,0,0,0,0,...,68398.0,70409.0,71408.0,4000.0,2600.0,2000.0,3000.0,2025.0,4791.0,0
18679,180000,2,2,1,32,-1,2,-1,-1,-1,...,1473.0,2705.0,1473.0,0.0,1473.0,1473.0,2705.0,1473.0,1473.0,0
17472,60000,2,2,2,23,0,0,0,0,0,...,59006.0,39578.0,38973.0,2039.0,2250.0,2060.0,1506.0,1500.0,1500.0,0
21451,160000,1,2,1,32,1,2,0,0,0,...,3801.0,2540.0,2279.0,0.0,1094.0,1500.0,0.0,1000.0,0.0,1
20800,650000,2,1,2,29,1,-1,-1,-1,0,...,2482.0,5178.0,5506.0,3000.0,1000.0,2500.0,3500.0,4000.0,3000.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23674,20000,1,2,1,32,2,0,0,2,0,...,19362.0,19402.0,19607.0,2000.0,3000.0,0.0,850.0,920.0,750.0,1
13226,10000,1,3,1,30,1,2,0,0,0,...,7637.0,6056.0,2852.0,0.0,1176.0,1000.0,121.0,57.0,5507.0,1
6783,230000,2,3,1,44,-1,-1,-1,-1,-1,...,6222.0,15121.0,17425.0,11632.0,4987.0,6222.0,15121.0,17425.0,17007.0,0
3600,100000,2,2,2,23,0,0,0,0,0,...,18306.0,20594.0,26368.0,1258.0,1255.0,3000.0,3000.0,7000.0,1225.0,0


The remaining 5% is used to check the model's performance on data it has never seen before:

In [8]:
df_unseen = data.drop(df.index)
df_unseen

Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_1,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default
9,260000,2,1,2,51,-1,-1,-1,-1,-1,...,8517.0,22287.0,13668.0,21818.0,9966.0,8583.0,22301.0,0.0,3640.0,0
55,80000,1,1,2,31,-1,-1,-1,-1,-1,...,390.0,390.0,390.0,0.0,390.0,390.0,390.0,390.0,390.0,0
77,90000,1,2,2,35,0,0,0,0,0,...,35565.0,30942.0,30835.0,3621.0,3597.0,1179.0,1112.0,1104.0,1143.0,0
117,80000,2,2,1,23,1,2,3,2,0,...,9898.0,10123.0,12034.0,1650.0,0.0,0.0,379.0,2091.0,1.0,0
126,30000,1,1,2,41,2,2,2,2,2,...,28168.0,27579.0,28321.0,3500.0,0.0,2200.0,0.0,1200.0,1250.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23860,150000,1,5,1,36,0,0,0,0,0,...,136378.0,139219.0,142172.0,5500.0,3800.0,3900.0,4000.0,4100.0,4100.0,0
23864,10000,1,3,1,42,2,2,0,0,2,...,9926.0,8898.0,7667.0,0.0,1200.0,1500.0,0.0,1000.0,3000.0,1
23897,140000,1,2,1,34,0,0,0,0,0,...,44433.0,28029.0,32386.0,4000.0,5000.0,5000.0,5000.0,5000.0,10000.0,0
23930,410000,1,1,2,34,0,0,0,-1,-1,...,1467.0,1421.0,-15.0,17259.0,18600.0,1474.0,1428.0,0.0,0.0,1


As we can see, there are several predictors that will be used to predict the binary variable 'default'.

The last data-cleaning step is to reset the index of each subset:

In [9]:
df.reset_index(inplace=True, drop=True)
df_unseen.reset_index(inplace=True, drop=True)
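The split and the index reset can be checked on a small example. This is a sketch on made-up data (the `toy` frame is hypothetical and only stands in for the credit dataset); it shows that `sample` plus `drop` gives two disjoint partitions and that `reset_index(drop=True)` relabels each one from 0.

```python
import pandas as pd

# Toy data standing in for the credit dataset (values are made up).
toy = pd.DataFrame({"feature": range(100), "default": [i % 2 for i in range(100)]})

train = toy.sample(frac=0.95, random_state=42)  # 95% of the rows, shuffled
unseen = toy.drop(train.index)                  # the remaining 5%

# The two partitions share no original row labels and together cover the data.
assert set(train.index).isdisjoint(unseen.index)
assert len(train) + len(unseen) == len(toy)

# After the reset both frames are labelled 0..n-1 and the old labels are discarded.
train = train.reset_index(drop=True)
unseen = unseen.reset_index(drop=True)
print(len(train), len(unseen))  # 95 5
```

Without the reset, both frames would keep the row labels of the original dataset, which is easy to trip over when comparing predictions row by row.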

Next we compare the performance of different models. To do so, we first import the classification module:

In [10]:
from pycaret.classification import *

Next we define the PyCaret environment with the training data, so that every model we train afterwards uses this data. This step also preprocesses the data automatically, which makes it easier to apply the statistical models:

In [None]:
model_setup = setup(data=df, target='default', session_id=123, use_gpu=True)

[LightGBM] [Info] Number of positive: 1, number of negative: 1
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 0
[LightGBM] [Info] Number of data points in the train set: 2, number of used features: 0
[LightGBM] [Info] Using GPU Device: Intel(R) RaptorLake-S Mobile Graphics Controller, Vendor: Intel(R) Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 16 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000

Unnamed: 0,Description,Value
0,Session id,123
1,Target,default
2,Target type,Binary
3,Original data shape,"(22800, 24)"
4,Transformed data shape,"(22800, 24)"
5,Transformed train set shape,"(15959, 24)"
6,Transformed test set shape,"(6841, 24)"
7,Numeric features,23
8,Preprocess,True
9,Imputation type,simple


In PyCaret versions where this call is interactive, it waits for us to confirm that the automatically inferred data types are correct; if they are, we press Enter. A summary of the setup, including the transformations applied to the training data, is then displayed.

We can list the classification models available in PyCaret with the following command:

In [12]:
models()

Unnamed: 0_level_0,Name,Reference,Turbo
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
lr,Logistic Regression,sklearn.linear_model._logistic.LogisticRegression,True
knn,K Neighbors Classifier,sklearn.neighbors._classification.KNeighborsCl...,True
nb,Naive Bayes,sklearn.naive_bayes.GaussianNB,True
dt,Decision Tree Classifier,sklearn.tree._classes.DecisionTreeClassifier,True
svm,SVM - Linear Kernel,sklearn.linear_model._stochastic_gradient.SGDC...,True
rbfsvm,SVM - Radial Kernel,sklearn.svm._classes.SVC,False
gpc,Gaussian Process Classifier,sklearn.gaussian_process._gpc.GaussianProcessC...,False
mlp,MLP Classifier,sklearn.neural_network._multilayer_perceptron....,False
ridge,Ridge Classifier,sklearn.linear_model._ridge.RidgeClassifier,True
rf,Random Forest Classifier,sklearn.ensemble._forest.RandomForestClassifier,True


This output matters because the ID of each model is needed to work with that model specifically, as we will see below.

One of the most useful functions in this library compares all of the models above:

In [13]:
best_model = compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
gbc,Gradient Boosting Classifier,0.8202,0.7809,0.3677,0.6731,0.4752,0.3779,0.4032,8.904
catboost,CatBoost Classifier,0.8191,0.7823,0.3686,0.6672,0.4744,0.376,0.4004,47.573
lightgbm,Light Gradient Boosting Machine,0.8178,0.7756,0.3672,0.6607,0.4717,0.3724,0.396,2.712
ada,Ada Boost Classifier,0.8163,0.7723,0.3276,0.6777,0.4413,0.347,0.3801,1.728
rf,Random Forest Classifier,0.8126,0.762,0.3604,0.637,0.4599,0.3571,0.3785,0.424
lda,Linear Discriminant Analysis,0.81,0.7172,0.2535,0.6963,0.3716,0.2873,0.3385,0.102
et,Extra Trees Classifier,0.8079,0.7546,0.366,0.6113,0.4575,0.3497,0.3669,0.277
lr,Logistic Regression,0.8048,0.7108,0.2261,0.6797,0.336,0.2553,0.309,0.818
ridge,Ridge Classifier,0.7983,0.7172,0.1507,0.7133,0.2485,0.1857,0.2621,0.069
dummy,Dummy Classifier,0.7783,0.5,0.0,0.0,0.0,0.0,0.0,0.024


With this table we can choose the model that suits us best, taking into account its scores on the different metrics shown.
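Reading the table is easier with the definitions of the metrics at hand. The sketch below computes accuracy, recall, precision and F1 from confusion-matrix counts. The counts are hypothetical, chosen only to resemble the class balance suggested by the Dummy Classifier row (accuracy 0.7783, recall 0): a baseline that always predicts the majority class can look acceptable on accuracy while finding no defaulters at all.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, recall, precision and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}


# Hypothetical test set: 1000 clients, 222 of whom default (positive class).
# A baseline that always predicts "no default" finds none of the 222 positives.
baseline = classification_metrics(tp=0, fp=0, fn=222, tn=778)

# A hypothetical model that catches 80 defaulters and raises 40 false alarms.
model = classification_metrics(tp=80, fp=40, fn=142, tn=738)

for name, metrics in (("baseline", baseline), ("model", model)):
    print(name, {key: round(value, 4) for key, value in metrics.items()})
```

With imbalanced classes like these, recall, F1 or AUC are usually more informative than accuracy when ranking the candidates.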

In our case, as an example, we build and train a CatBoost model, one of the top-ranked models in the table, on the training data. To do so we run:

In [14]:
catboost = create_model('catboost')

Unnamed: 0_level_0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,0.817,0.7698,0.3853,0.6445,0.4823,0.3796,0.3982
1,0.8277,0.8074,0.3644,0.7207,0.4841,0.3937,0.4268
2,0.8264,0.7962,0.404,0.6842,0.508,0.411,0.432
3,0.8164,0.7905,0.3644,0.6548,0.4682,0.368,0.3911
4,0.8177,0.7807,0.3898,0.6479,0.4868,0.3841,0.4025
5,0.8239,0.785,0.3757,0.6891,0.4863,0.391,0.4172
6,0.8083,0.7632,0.3475,0.6212,0.4457,0.3408,0.3618
7,0.8127,0.7677,0.3277,0.6554,0.4369,0.3392,0.3686
8,0.8258,0.7706,0.3644,0.7088,0.4813,0.3894,0.4205
9,0.8219,0.7958,0.3796,0.6734,0.4855,0.3878,0.4111


In [15]:
print(catboost)

<catboost.core.CatBoostClassifier object at 0x000002301BF094D0>


We can see that the model has been trained 10 times, once per cross-validation fold, so that its metrics can be reported per fold and on average and the results extrapolated with more confidence. Printing a model is meant to show the hyperparameters it was trained with; for CatBoost, as seen above, print only shows the object reference.
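The idea behind the ten rows can be sketched in a few lines. The helper `kfold_indices` below is hypothetical and deliberately simplified (contiguous folds, no shuffling or stratification, which a real classification setup would normally add); the fold accuracies are copied from the create_model output above.

```python
from statistics import fmean


def kfold_indices(n_samples, n_folds=10):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    # Fold sizes differ by at most one when n_samples is not divisible by n_folds.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        stop = start + base + (1 if fold < extra else 0)
        val_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, val_idx
        start = stop


# Accuracy per fold, copied from the create_model output above.
fold_accuracy = [0.817, 0.8277, 0.8264, 0.8164, 0.8177,
                 0.8239, 0.8083, 0.8127, 0.8258, 0.8219]
mean_accuracy = fmean(fold_accuracy)
print(round(mean_accuracy, 4))  # 0.8198
```

Each fold trains on nine tenths of the training data and validates on the remaining tenth, so the mean is a less optimistic estimate than a score measured on the data used for fitting.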

To improve a model, that is, to find hyperparameters that are optimal or close to optimal, we can use tune_model. By default it tries 10 candidate configurations, evaluates each with 10-fold cross-validation, and returns the one with the best mean accuracy. Here we apply it to a random forest:

In [17]:
# Create the random forest (ID 'rf' in the models() table) before tuning it
rf = create_model('rf')
tunned_rf = tune_model(rf)

In [18]:
print(tunned_rf)

In [19]:
plot_model(tunned_rf, plot='auc')

In [None]:
plot_model(tunned_rf, plot='feature')
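The tuning step can be pictured as a random search: draw a handful of hyperparameter combinations, score each one, keep the best. The sketch below is a toy under stated assumptions: the search space, the function `toy_cv_score` (a stand-in for a mean cross-validated accuracy) and `random_search` are all hypothetical and do not reproduce PyCaret's internals.

```python
import random

# Hypothetical search space; the names mimic common tree-ensemble settings.
SEARCH_SPACE = {
    "n_estimators": [50, 100, 200, 300],
    "max_depth": [3, 5, 7, 9, 11],
    "min_samples_leaf": [1, 2, 4, 8],
}


def toy_cv_score(params):
    """Stand-in for a mean cross-validated accuracy (purely illustrative)."""
    # Peaks at max_depth=7 and n_estimators=200; real scores come from training.
    penalty = abs(params["max_depth"] - 7) * 0.01
    penalty += abs(params["n_estimators"] - 200) / 10000
    return 0.82 - penalty - 0.002 * params["min_samples_leaf"]


def random_search(space, score_fn, n_iter=10, seed=123):
    """Sample n_iter configurations at random and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        candidate = {name: rng.choice(values) for name, values in space.items()}
        score = score_fn(candidate)
        if score > best_score:
            best_params, best_score = candidate, score
    return best_params, best_score


best_params, best_score = random_search(SEARCH_SPACE, toy_cv_score)
print(best_params, round(best_score, 4))
```

Because only 10 of the 80 possible combinations are tried here, the result is the best of the sample, not necessarily the best overall; raising `n_iter` trades time for coverage.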

At this point we can obtain predictions on the unseen data, which has not been used to train the model:

In [None]:
unseen_prediction = predict_model(tunned_rf, data=df_unseen)
unseen_prediction.head()

Two new columns are added to the data. Label is the predicted class and Score is the probability associated with that prediction (recent PyCaret versions name these columns prediction_label and prediction_score).
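How the two columns relate can be sketched as follows. This assumes a 0.5 decision threshold and that the score is the probability of the class that was predicted, which is our reading of the output; the function `label_and_score` and the probabilities are made up for illustration.

```python
def label_and_score(prob_positive, threshold=0.5):
    """Turn P(class = 1) into a (label, score) pair.

    Assumption: the score is the probability of the class that was predicted.
    """
    label = 1 if prob_positive >= threshold else 0
    score = prob_positive if label == 1 else 1.0 - prob_positive
    return label, score


# Made-up probabilities of default for three clients.
for p in (0.91, 0.35, 0.50):
    print(p, label_and_score(p))
```

Under this reading a score is never below 0.5 in a binary problem, so a low score signals an uncertain prediction and not a prediction of the negative class.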

Finally, to finish setting up our random forest model, we finalize it, that is, we retrain it on the whole dataset available:

In [None]:
final_rf = finalize_model(tunned_rf)
print(final_rf)

The model is now ready to go into production, so we can save it locally with:

In [None]:
save_model(final_rf, 'modelo_final')
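Saving a model amounts to serializing the fitted object to a file so that it can be restored later without retraining. The round trip can be sketched with the standard library alone; the `trained` dictionary and the `predict` function are hypothetical stand-ins for a fitted pipeline, and plain pickle is used here only to illustrate the idea.

```python
import os
import pickle
import tempfile


def predict(params, values):
    """Flag every value above the learned threshold (toy decision rule)."""
    return [1 if v > params["threshold"] else 0 for v in values]


# Hypothetical "fitted model": just the parameters a training step would produce.
trained = {"name": "toy_threshold_model", "threshold": 50000}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "modelo_final.pkl")
    with open(path, "wb") as fh:
        pickle.dump(trained, fh)      # persist the fitted object
    with open(path, "rb") as fh:
        restored = pickle.load(fh)    # later, possibly in another session

print(predict(restored, [20000, 90000]))  # [0, 1]
```

In PyCaret itself, the counterpart of save_model is load_model, called with the same name and without the file extension; the restored pipeline can then be passed to predict_model.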

-------------------------------

In [None]:
data = get_data('diamond')

In [None]:
import plotly.express as px
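# trendline='ols' draws the fitted line that trendline_color_override colours (requires statsmodels)
fig = px.scatter(data, x='Carat Weight', y='Price', facet_col='Cut', opacity=0.25, template='plotly_dark', trendline='ols', trendline_color_override='red', title='Diamonds')
fig.show()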

In [None]:
fig = px.histogram(data, x=["Price"], template = 'plotly_dark', title = 'Histogram of Price')
fig.show()

In [None]:
data_copy = data.copy()
data_copy['Log_Price'] = np.log(data['Price'])
fig = px.histogram(data_copy, x=["Log_Price"], title='Histogram of Log Price', template='plotly_dark')
fig.show()

In [None]:
from pycaret.regression import *
s = setup(data, target = 'Price', transform_target = True, log_experiment = True, experiment_name = 'diamond')

In [None]:
best = compare_models()

In [None]:
plot_model(best, plot = 'residuals_interactive')

In [None]:
plot_model(best, plot = 'feature')


In [None]:
final_best = finalize_model(best)
save_model(final_best, 'diamond-pipeline')

--------------------

In [None]:
data = get_data('iris')

In [None]:
s = setup(data, target='species', session_id= 123)

In [None]:
eda()