pytelligence is a pycaret clone aimed at simplicity and production-ready code in half the usual development time. It offers:
- Automation of preprocessing
- Comparison of various algorithms
- Bayesian hyperparameter tuning
- Exhaustive feature reduction
- Single-artefact export
Install pytelligence via pip, either directly from PyPI:

```
pip install pytelligence
```

or from its GitHub repository:

```
pip install @git....
```
First, import pytelligence:

```python
import pytelligence as pt
```

pytelligence is configured through a single YAML file, typically called `config.yml`. For a classification use case the format is as follows:
```yaml
modelling:
  target_clf: 'class'
  numeric_cols:
    - 'deg-malig'
  categorical_cols:
    - 'age'
    - 'menopause'
    - 'tumor-size'
    - 'inv-nodes'
    - 'node-caps'
    - 'breast'
    - 'breast-quad'
    - 'irradiat'
  feature_scaling: False
```
This configuration is loaded once during the data preparation step. `target_clf` specifies the column name of the target label for classification use cases. `numeric_cols` and `categorical_cols` tell pytelligence the types of the feature columns to be used during data preparation and subsequent training. `feature_scaling` is a boolean flag that activates feature scaling during data preparation.
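For illustration, a dataframe matching this config could look like the following sketch; the column names come from the config above, while the values are invented purely to show the expected layout:

```python
import pandas as pd

# Hypothetical toy data matching the config above (values are illustrative only)
df = pd.DataFrame({
    "class": ["no-recurrence-events", "recurrence-events"],
    "deg-malig": [3, 2],
    "age": ["40-49", "50-59"],
    "menopause": ["premeno", "ge40"],
    "tumor-size": ["20-24", "30-34"],
    "inv-nodes": ["0-2", "3-5"],
    "node-caps": ["no", "yes"],
    "breast": ["left", "right"],
    "breast-quad": ["left_low", "central"],
    "irradiat": ["no", "yes"],
})
```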
With a dataframe `df` that matches the above config, the data can be prepared. Preparation includes:
- column verification
- one-hot encoding of categorical columns
- encoding of the target label for classification (if required)
- scaling of numeric features (if enabled in the config)
```python
setup, X_sample, y_sample = pt.modelling.prepare_data(
    train_data=df,
    config_path="./config.yml",
)
```
The returned `setup` is a container object that holds the prepared data and configuration; it is handed to subsequent processing steps. `X_sample` and `y_sample` are samples of the prepared dataframe and target column.
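The samples allow a quick sanity check of the preparation, assuming they are pandas objects as the wording above suggests:

```python
# Inspect the prepared feature columns and the (possibly encoded) target
print(X_sample.head())
print(y_sample.head())
```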
Various algorithms can be compared via:

```python
compare_df, algo_list, model_list = pt.modelling.compare_algorithms(
    setup=setup,
    include=[
        "lr",
        "nb",
    ],
    sort="f1",
    return_models=True,
)
```
Implemented algorithms include:
- `lr` - LogisticRegression
- `nb` - Gaussian Naive Bayes
If no algorithm list is provided via the `include` parameter, all available algorithms are compared.
`compare_df` is a pandas dataframe holding the results of the different algorithms, sorted by the metric passed via the `sort` parameter. Possible values for `sort` are `accuracy`, `precision`, `recall`, `f1` and `roc_auc`.
`algo_list` is a list of the algorithm abbreviations, sorted by `sort`. It serves as an easy means of referencing the best algorithms in later processing steps.
`model_list` holds model instances trained on the entire training dataset, sorted by `sort`. It is only computed if `return_models` is set to `True`. (Note that these are only the estimators; they do not include the preprocessing pipeline applied to the initial dataset.)
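Because these estimators lack preprocessing, raw data has to be transformed first when using them directly. A minimal sketch, assuming the stored pipeline (`setup.prep_pipe`, see further down) follows the scikit-learn transformer API and `df_new` is a hypothetical dataframe of raw, unprepared data:

```python
# Manually chain preprocessing and a trained estimator
X_new = setup.prep_pipe.transform(df_new)
predictions = model_list[0].predict(X_new)
```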
Hyperparameter tuning is automated using Bayesian optimization:

```python
compare_df_tune, model_list, opt_history_dict = (
    pt.modelling.tune_hyperparams(
        setup=setup,
        include=["lr", "nb"],
        optimize="f1",
        n_trials=5,
        return_models=True,
        feature_list=None,  # optional
    )
)
```
`optimize` determines the metric to optimize for.
`n_trials` specifies the number of hyperparameter sets to evaluate for each algorithm.
`feature_list` determines which feature columns are used during training. This optional parameter is typically used after identifying the best features via EFR (see further down).
`compare_df_tune` holds an overview of the achieved target metric for each algorithm, along with the best set of hyperparameters.
`model_list` holds trained model instances. It is only returned when `return_models=True`.
`opt_history_dict` holds plots of the optimization process, which can be accessed by algorithm key, e.g. `opt_history_dict["lr"].show()`.
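The `.show()` call suggests these objects are plotly figures; if that assumption holds, a plot can also be written to disk:

```python
# Assumption: the stored plots are plotly figures exposing write_html()
opt_history_dict["lr"].write_html("./opt_history_lr.html")
```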
Using the full list of features during model training might not be desirable for two reasons:
- Training/tuning or inference time might be unnecessarily long when only a few features hold the key information.
- Surplus features might actually decrease an algorithm's metrics.

Exhaustive feature reduction (EFR) tackles this problem by iteratively finding the weakest feature and removing it from training. This process runs until an acceptable loss is reached:
```python
best_feature_list, metric_feature_df = pt.modelling.reduce_feature_space(
    setup=setup,
    algorithm="nb",
    metric="f1",
    reference_metric=compare_df_tune.iloc[0]["metric"],
    acceptable_loss=0.99,
    hyperparams=compare_df_tune.iloc[0]["hyperparams"],
)
```
`reference_metric` is the metric score achieved so far.
`acceptable_loss` is a percentage value used to calculate a threshold metric: `threshold = acceptable_loss * reference_metric`. Once the set of features no longer reaches this threshold, EFR stops. For example, with `reference_metric=0.9` and `acceptable_loss=0.99`, EFR stops once the metric drops below `0.891`.
`hyperparams` (optional) is a dictionary holding the set of hyperparameters to use for the algorithm.
`best_feature_list` holds an unsorted list of the best feature combination found.
`metric_feature_df` is a pandas dataframe with the columns `metric`, `features` and `feature_count`, giving an overview of how the number of features relates to the metric. The `features` column contains the list of features used during each specific EFR cycle.
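As noted above, the reduced feature set can then be fed back into hyperparameter tuning via the `feature_list` parameter:

```python
# Re-tune using only the feature combination found by EFR
compare_df_tune, model_list, opt_history_dict = (
    pt.modelling.tune_hyperparams(
        setup=setup,
        include=["nb"],
        optimize="f1",
        n_trials=5,
        return_models=True,
        feature_list=best_feature_list,
    )
)
```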
The preprocessing pipeline is stored within the `setup` container and can be accessed via `setup.prep_pipe`. It is automatically added to a model instance during export:
```python
pt.modelling.export_model(
    setup=setup,
    model=model_list[0],
    target_dir="./",
)
```
The naming format of the artefact is `model_{date}_{algo}_#{number}.joblib`, where `#{number}` is used to distinguish several models exported on the same day.
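Since the artefact bundles preprocessing and model, it can be loaded with joblib and, assuming it exposes the usual scikit-learn predict interface, applied directly to raw data. A minimal sketch; the filename below is a hypothetical example of the naming scheme:

```python
import joblib

# Load an exported artefact (hypothetical filename) and predict on raw,
# unprepared data (df_new is a hypothetical dataframe).
model = joblib.load("./model_2023-01-01_lr_#1.joblib")
predictions = model.predict(df_new)
```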