autoML_plus: Fast AutoML for Structured Data

Use Case

This package provides AutoML for structured data in a form that is useful to data science professionals. It focuses on structured data (regression only in the current version) because of its importance for business analytics.

Slides discussing this project are available here. I developed this project as an Insight AI Fellow in Fall 2018.

Technical Description and Basic Usage

The main script (autoML_plus.py) in the autoML_plus directory performs a hyperparameter tuning run. It is built on TPOT but has the following additional features:

It recommends running for fewer model trainings (20 or 100)
It has a quick stop option for when RMSE is already small after one training
It has a deep neural network option, which is useful on non-linear data

The main command is

python autoML_plus.py [ options ]

The options are file_name, trainings, quick_stop, seed, test_size, validation_size, target_column, verbosity, and model. Each option can be specified using

-option value

or

--option value

after "python autoML_plus.py" on the command line. All options are optional.
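As a rough illustration of the two accepted spellings, the sketch below registers every option under both a single-dash and a double-dash name with Python's argparse. The defaults are the ones documented in this README. The function name build_parser and the use of argparse are assumptions made for illustration, not necessarily how autoML_plus.py parses its arguments. The model option is left out because its accepted values are not listed here.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the options documented in this README."""
    parser = argparse.ArgumentParser(prog="autoML_plus.py")
    # (name, type, default); a default of None means "not specified".
    options = [
        ("file_name", str, None),
        ("trainings", int, 100),
        ("quick_stop", str, "NONE"),
        ("seed", int, 42),
        ("test_size", float, 0.25),
        ("validation_size", float, 0.1),
        ("target_column", int, -1),
        ("verbosity", int, 0),
    ]
    for name, type_, default in options:
        # Register both "-option" and "--option" for the same destination.
        parser.add_argument("-" + name, "--" + name, type=type_, default=default)
    return parser


# Mixing the two spellings on one command line is accepted.
args = build_parser().parse_args(["-trainings", "20", "--quick_stop", "MODERATE"])
print(args.trainings, args.quick_stop, args.seed)
```

Options that are not given on the command line keep their documented defaults, so args.seed is 42 in this example.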

file_name

If file_name is specified, it should be the pathname of a CSV file. If it is omitted, the Boston Housing Dataset built into scikit-learn will be used instead of a CSV file.

trainings

The default value is 100. Any integer value can be used, though the number of trainings actually performed will be rounded up to the nearest multiple of 5.
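The rounding of the trainings value can be sketched in a couple of lines. The helper name effective_trainings is hypothetical, and the sketch assumes that a value that is already a multiple of 5 is left unchanged.

```python
import math


def effective_trainings(requested, step=5):
    """Round a requested number of trainings up to the next multiple of `step`."""
    return math.ceil(requested / step) * step


for requested in (20, 22, 98, 100):
    print(requested, "->", effective_trainings(requested))
```

Under these assumptions, a request for 22 trainings would run 25, while 20 and 100 would run as requested.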

quick_stop

The default value is "NONE". The other options are "AGRESSIVE" and "MODERATE". Both compare r = RMSE / mean(abs(test_values)), computed after training with a random hyperparameter point, against a threshold: "AGRESSIVE" stops execution if r < 0.4, and "MODERATE" stops execution if r < 0.1.
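The quick stop criterion can be sketched in plain Python as follows. The thresholds and the definition of r come from the description above. The function names and the dictionary are hypothetical, and the sketch does not show how autoML_plus obtains its predictions.

```python
import math

# Thresholds as documented above; the dictionary itself is a hypothetical helper.
QUICK_STOP_THRESHOLDS = {"AGRESSIVE": 0.4, "MODERATE": 0.1}


def relative_rmse(test_values, predictions):
    """Return r = RMSE / mean(abs(test_values))."""
    n = len(test_values)
    squared_errors = sum((y - p) ** 2 for y, p in zip(test_values, predictions))
    rmse = math.sqrt(squared_errors / n)
    mean_abs = sum(abs(y) for y in test_values) / n
    return rmse / mean_abs


def should_quick_stop(mode, test_values, predictions):
    """True if r is below the threshold for `mode`; "NONE" never stops early."""
    threshold = QUICK_STOP_THRESHOLDS.get(mode)
    if threshold is None:
        return False
    return relative_rmse(test_values, predictions) < threshold


y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]
r = relative_rmse(y_true, y_pred)
for mode in ("NONE", "AGRESSIVE", "MODERATE"):
    print(mode, should_quick_stop(mode, y_true, y_pred))
```

In this made-up example r is slightly above 0.1, so "AGRESSIVE" would stop the run while "MODERATE" would continue.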

seed

The default value is 42. Any integer value may be specified.

test_size

The default value is 0.25. Any floating point value > 0 and < 1 may be chosen. An exception will be raised and execution will cease if test_size + validation_size >= 1.

validation_size

The default value is 0.1. Any floating point value > 0 and < 1 may be chosen. An exception will be raised and execution will cease if test_size + validation_size >= 1.
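The constraints on test_size and validation_size can be checked as in the sketch below. The function name check_split_sizes and the use of ValueError are assumptions; the README only states that an exception is raised.

```python
def check_split_sizes(test_size=0.25, validation_size=0.1):
    """Validate the two fractions and return the remaining training fraction."""
    for name, value in (("test_size", test_size),
                        ("validation_size", validation_size)):
        if not 0 < value < 1:
            raise ValueError(f"{name} must be > 0 and < 1, got {value}")
    if test_size + validation_size >= 1:
        raise ValueError("test_size + validation_size must be < 1")
    return 1 - test_size - validation_size


# With the documented defaults, roughly 65% of the rows remain for training.
print(round(check_split_sizes(), 2))

try:
    check_split_sizes(test_size=0.6, validation_size=0.4)
except ValueError as error:
    print("rejected:", error)
```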

target_column

The default value is -1 (the rightmost column), but other integer values (positive or negative) can be specified, following the Python indexing convention.
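To illustrate the indexing convention, the sketch below separates a target column from the feature columns of a small in-memory table. The function name split_features_target and the list-of-rows representation are assumptions made for illustration; autoML_plus may represent its data differently.

```python
def split_features_target(rows, target_column=-1):
    """Split each row into (features, target) using Python index semantics."""
    n_columns = len(rows[0])
    if not -n_columns <= target_column < n_columns:
        raise IndexError(
            f"target_column {target_column} is out of range for {n_columns} columns"
        )
    index = target_column % n_columns  # maps -1 to the last column
    features = [row[:index] + row[index + 1:] for row in rows]
    target = [row[index] for row in rows]
    return features, target


rows = [[1.0, 2.0, 30.0], [4.0, 5.0, 60.0]]
X, y = split_features_target(rows)        # default -1: rightmost column
X0, y0 = split_features_target(rows, 0)   # 0: leftmost column
print(X, y)
print(X0, y0)
```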

verbosity

The default value is 0 (quiet). This sets the extent to which TPOT sends progress messages to stdout. Options are 0, 1, 2, and 3.

model

Default is the full set of regression models in TPOT (RidgeCV, LassoLarsCV, ElasticNetCV, DecisionTreeRegressor, ExtraTreesRegressor, GradientBoostingRegressor, AdaBoostRegressor, RandomForestRegressor, KNeighborsRegressor, and LinearSVR).

Repo Format:

autoML_plus : Package directory for autarchy.
obtain_TPOT_results : Code for running TPOT to obtain data about how well AutoML performs.
datasets : Code to download datasets and information about benchmark datasets.
tests : Tests.


JamieGainer/autarchy
