pytelligence

A machine learning framework for rapid, production-grade AI development. A pycaret clone aimed at simplicity and production-ready code in half the usual development time.

Features

  • Automation of preprocessing
  • Comparison of various algorithms
  • Bayesian hyperparameter tuning
  • Exhaustive feature reduction
  • Single-artefact export

Installation

Install pytelligence via pip, either directly from PyPI

pip install pytelligence

or from its GitHub repository

pip install git+https://github.com/George713/pytelligence.git

Usage

First import pytelligence:

import pytelligence as pt

Configuration

pytelligence is built around configuration via a single YAML file, typically called config.yml. For a classification use case, the format is as follows:

modelling:
  target_clf: 'class'
  numeric_cols:
    - 'deg-malig'
  categorical_cols:
    - 'age'
    - 'menopause'
    - 'tumor-size'
    - 'inv-nodes'
    - 'node-caps'
    - 'breast'
    - 'breast-quad'
    - 'irradiat'
  feature_scaling: False

This configuration is loaded once during the data preparation step.

target_clf specifies the column name of the target label for classification use cases.

numeric_cols & categorical_cols tell pytelligence which feature columns to use during data preparation and subsequent training, and whether to treat them as numeric or categorical.

feature_scaling is a boolean flag that activates feature scaling during data preparation.
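
Since every later step reads this file, it can help to check that the YAML parses as expected before running the pipeline. A minimal sketch using PyYAML (not part of pytelligence itself):

import yaml

# Load and inspect the config to verify it parses and holds the expected keys.
with open("./config.yml") as f:
    config = yaml.safe_load(f)

print(config["modelling"]["target_clf"])       # 'class'
print(config["modelling"]["feature_scaling"])  # False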

Data Preparation

Assuming a dataframe df that fits the above config file, the data can be prepared as shown below. Preparation includes

  • column verification
  • one-hot-encoding of categorical columns
  • encoding of the target label in case of classification (if required)
  • scaling of numeric features (if turned on within config)

setup, X_sample, y_sample = pt.modelling.prepare_data(
    train_data=df,
    config_path="./config.yml",
)

The returned setup is a container object that holds the prepared data and configuration. It is handed to subsequent processing steps.

X_sample and y_sample are samples from the prepared dataframe and target column.
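
For illustration, a dataframe fitting the above configuration could look like this. The column names follow config.yml; the values are made up:

import pandas as pd

# Hypothetical toy data matching the columns declared in config.yml.
df = pd.DataFrame({
    "class": ["no-recurrence-events", "recurrence-events"],
    "deg-malig": [3, 2],
    "age": ["40-49", "50-59"],
    "menopause": ["premeno", "ge40"],
    "tumor-size": ["20-24", "15-19"],
    "inv-nodes": ["0-2", "0-2"],
    "node-caps": ["no", "yes"],
    "breast": ["left", "right"],
    "breast-quad": ["left_low", "central"],
    "irradiat": ["no", "no"],
})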

Algorithm Comparison

Various algorithms can be compared via

compare_df, algo_list, model_list = pt.modelling.compare_algorithms(
    setup=setup,
    include=[
        "lr",
        "nb",
    ],
    sort="f1",
    return_models=True,
)

Implemented algorithms include:

  • lr - LogisticRegression
  • nb - Gaussian Naive Bayes

If no algorithm list is provided to the include parameter, all available algorithms are compared.

compare_df is a pandas dataframe holding the results of the different algorithms, sorted by the metric provided via the sort parameter. Possible values for sort are accuracy, precision, recall, f1 & roc_auc.

algo_list is a list of the algorithm abbreviations, sorted by sort. It serves as an easy means of referencing the best algorithms in later processing steps.

model_list holds model instances trained on the entire training dataset, sorted by sort. It is only computed if return_models is set to True. (Note that these are only the estimators; they do not include the preprocessing pipeline applied to the initial dataset.)
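
Since all three return values share the same sort order, the best candidate can be picked by index. A short sketch building on the call above:

# compare_df, algo_list and model_list are aligned,
# so index 0 is the best algorithm by the chosen sort metric.
best_algo = algo_list[0]    # e.g. "lr"
best_model = model_list[0]  # estimator trained on the full training data
print(compare_df.head())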

Hyperparameter Tuning

Hyperparameter tuning is automated using Bayesian optimization.

compare_df_tune, model_list, opt_history_dict = pt.modelling.tune_hyperparams(
    setup=setup,
    include=["lr", "nb"],
    optimize="f1",
    n_trials=5,
    return_models=True,
    feature_list=None,  # optional
)

optimize determines the metric to optimize for.

n_trials specifies the number of hyperparameter sets to check for each algorithm.

feature_list determines which feature columns to use during training. This optional parameter is often used after analysing the best features via EFR (see further down).

compare_df_tune holds an overview of the achieved target metric for each algorithm along with the best set of hyperparameters.

model_list holds trained model instances. Only returned when return_models=True.

opt_history_dict holds plots of the optimization process, which can be accessed via the algorithm's key, e.g. opt_history_dict["lr"].show().
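
The tuning results feed directly into later steps. A sketch of extracting the best run, using the metric and hyperparams columns referenced in the EFR example below:

# Rows in compare_df_tune are sorted by the optimized metric,
# so the first row holds the best result.
best_metric = compare_df_tune.iloc[0]["metric"]
best_hyperparams = compare_df_tune.iloc[0]["hyperparams"]

# Inspect the optimization history for logistic regression.
opt_history_dict["lr"].show()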

Exhaustive Feature Reduction (EFR)

The full list of features used during model training might not be desirable for two reasons:

  1. Training/tuning or inference time might be unnecessarily large when only a few features hold the key information.
  2. Surplus features might actually decrease an algorithm's metrics.

EFR tackles this problem by iteratively identifying the weakest feature and removing it from training. This process repeats until the metric falls below an acceptable threshold (see acceptable_loss below).

best_feature_list, metric_feature_df = pt.modelling.reduce_feature_space(
    setup=setup,
    algorithm="nb",
    metric="f1",
    reference_metric=compare_df_tune.iloc[0]["metric"],
    acceptable_loss=0.99,
    hyperparams=compare_df_tune.iloc[0]["hyperparams"],
)

reference_metric is the metric score achieved so far.

acceptable_loss is a percentage value used to calculate a threshold metric: threshold = acceptable_loss * reference_metric. Once the set of features no longer reaches this threshold, EFR stops. For example, with acceptable_loss=0.99 and a reference metric of 0.80, EFR stops once the metric drops below 0.792.

(Optional) hyperparams is a dictionary holding the set of hyperparameters to use for the algorithm.

best_feature_list holds an unsorted list of the best feature combination found.

metric_feature_df is a pandas dataframe with the columns metric, features and feature_count, giving an overview of how the metric varies with the number of features. The features column contains the list of features used during each EFR cycle.
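
A typical follow-up is to re-run hyperparameter tuning restricted to the reduced feature set, via the optional feature_list parameter mentioned above. A sketch:

# Re-tune using only the features EFR identified as worth keeping.
compare_df_reduced, model_list_reduced, _ = pt.modelling.tune_hyperparams(
    setup=setup,
    include=["nb"],
    optimize="f1",
    n_trials=5,
    return_models=True,
    feature_list=best_feature_list,
)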

Single-Artefact Export

The preprocessing pipeline is stored within the setup container and can be accessed via setup.prep_pipe.

It is automatically added to a model instance during export:

pt.modelling.export_model(
    setup=setup,
    model=model_list[0],
    target_dir="./",
)

The naming format of the artefact is model_{date}_{algo}_#{number}.joblib, where #{number} disambiguates multiple models exported on the same day.
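
Since the artefact bundles the preprocessing pipeline with the estimator, it can be loaded and applied to raw data in one step. A minimal sketch, assuming the exported object exposes the usual scikit-learn predict interface (the filename below is a made-up example of the format above):

import joblib

# Load the single artefact (preprocessing pipeline + model).
model = joblib.load("./model_2023-01-01_nb_#1.joblib")

# Predict directly on raw, unprepared data.
predictions = model.predict(df.drop(columns=["class"]))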
