Scalable Machine Learning Pipeline

Scope

An efficient, scalable machine learning pipeline that enables training and inference on large datasets that do not fit in memory, by scaling up using fast storage.

It builds an ML pipeline on top of existing ML libraries (IBM Snap ML, scikit-learn) and uses the AWS ML-IO library.
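To make the idea of working on data that does not fit in memory more concrete, here is a minimal sketch of reading a CSV stream in fixed-size chunks. It uses only the Python standard library and is illustrative: `iter_csv_chunks` is a hypothetical helper, not part of smlp or ML-IO, and the pipeline's actual reading logic may differ.

```python
import csv
import io
from itertools import islice
from typing import Iterator, List, TextIO


def iter_csv_chunks(handle: TextIO, chunk_size: int) -> Iterator[List[List[float]]]:
    """Yield successive lists of at most `chunk_size` numeric rows from a CSV stream."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    reader = csv.reader(handle)
    while True:
        # islice pulls at most chunk_size rows, so only one chunk is held at a time.
        chunk = [[float(value) for value in row] for row in islice(reader, chunk_size)]
        if not chunk:
            break
        yield chunk


if __name__ == "__main__":
    # Tiny in-memory stand-in for a large CSV file on fast storage.
    data = io.StringIO("1,0.5,0.25\n-1,0.1,0.2\n1,0.3,0.4\n")
    for index, chunk in enumerate(iter_csv_chunks(data, chunk_size=2)):
        print(f"chunk {index}: {len(chunk)} rows")
```

Because the generator only materializes one chunk at a time, peak memory is bounded by the chunk size rather than by the size of the file.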

Usage

Set up conda environment

conda config --prepend channels https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/
conda config --add channels conda-forge
conda config --add channels mlio
conda create --yes -n smlp-environment python=3.7
conda activate smlp-environment

Install dependencies

conda install --file requirements.txt --yes

Install smlp module locally

python setup.py install

Run a test

python test/MLPipelineTester.py --ml_lib snap

Full pipeline test example

This example runs the pipeline on the Epsilon dataset from the PASCAL Large Scale Learning Challenge, once for each of three chunk sizes.

for ch in 50000 100000 200000; do
    echo "chunk=$ch"
    python examples/smlp-demo.py \
        --dataset_path /path_to_dataset/epsilon.train.csv \
        --dataset_test_path /path_to_dataset/epsilon.test.csv \
        --chunk_size "$ch" \
        --ml_lib snap \
        --ml_obj logloss \
        --ml_model_options objective=logloss,num_round=1,min_max_depth=4,max_max_depth=4,n_threads=40,random_state=42
    echo
done
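The `--ml_model_options` value in the command above is a comma-separated list of `key=value` pairs. As a rough illustration of how such a string could be turned into keyword arguments for a model, here is a minimal sketch. `parse_model_options` and `_coerce` are hypothetical helpers, not necessarily how smlp handles the flag, and the automatic conversion to `int` or `float` is an assumption.

```python
from typing import Dict, Union

OptionValue = Union[int, float, str]


def _coerce(raw: str) -> OptionValue:
    """Convert a raw string to int or float when possible, else keep it as a string."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    return raw


def parse_model_options(spec: str) -> Dict[str, OptionValue]:
    """Parse 'key=value,key=value' into a dictionary (hypothetical helper)."""
    options: Dict[str, OptionValue] = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty segments such as a trailing comma
        key, sep, raw = item.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"expected key=value, got {item!r}")
        options[key.strip()] = _coerce(raw.strip())
    return options


if __name__ == "__main__":
    spec = ("objective=logloss,num_round=1,min_max_depth=4,"
            "max_max_depth=4,n_threads=40,random_state=42")
    print(parse_model_options(spec))
```

Keeping the options as one flat string makes it easy to pass library-specific parameters through a single command-line flag without the script having to declare each one.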

Notes

Currently we support:

ML models: Snap Booster, sklearn Decision Trees
Input data format: csv

Dependencies:

Python (>= 3.7)
scikit-learn
numpy
pai4sk
mlio-py
psutil
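After installing, a quick way to confirm that the dependencies listed above are visible to the active environment is to look them up without importing them. The sketch below is an assumption-laden convenience, not part of smlp: the import names are guesses based on common packaging (scikit-learn imports as `sklearn`, and mlio-py is assumed to import as `mlio`), and `check_modules` and `python_is_supported` are hypothetical helpers.

```python
import sys
from importlib.util import find_spec
from typing import Dict, Iterable, Tuple

# Assumed import names for the listed dependencies (not confirmed by the project).
REQUIRED_MODULES = ("sklearn", "numpy", "pai4sk", "mlio", "psutil")


def check_modules(names: Iterable[str] = REQUIRED_MODULES) -> Dict[str, bool]:
    """Map each top-level module name to whether it can be located for import."""
    return {name: find_spec(name) is not None for name in names}


def python_is_supported(minimum: Tuple[int, int] = (3, 7)) -> bool:
    """Return True when the running interpreter is at least `minimum`."""
    return sys.version_info >= minimum


if __name__ == "__main__":
    print("Python >= 3.7:", python_is_supported())
    for name, found in check_modules().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

Using `find_spec` avoids actually importing heavy libraries, so the check stays fast even when everything is installed.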

License

This project is licensed under the Apache 2.0 License. See the LICENSE file for details.

Contributing

Please see CONTRIBUTING for details. Note that this repository has been configured with the DCO bot.
