ML project workflow
Project
- bin - bash files for running pipelines; place all .sh files here
- common - data preprocessing scripts, utils, everything like python common/scripts/{some-script}.py # or from common import utils; typically, this is the project/library core
- docker - project Docker files for pure reproducibility
- presets - datasets, notebooks, etc. - everything you don't need to push to git; use presets/data for datasets, presets/notebooks for notebooks, presets/serving for serving artefacts
- requirements - different project Python requirements for docker, tests, CI, etc.
- serving - microservices, etc. - production with Reaction
- training - model, experiment, etc. - research with Alchemy & Catalyst; use training/configs for all configs, just all .yml files
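The layout above can be created in one go. The snippet below is a minimal sketch, not an official tool: the folder names come from the list above, while the `scaffold` helper and `PROJECT_DIRS` constant are hypothetical names introduced for illustration. The demo runs in a temporary directory so nothing is written to a real project.

```python
import tempfile
from pathlib import Path

# Folder names taken from the project layout described above.
PROJECT_DIRS = [
    "bin",
    "common",
    "docker",
    "presets/data",
    "presets/notebooks",
    "presets/serving",
    "requirements",
    "serving",
    "training/configs",
]


def scaffold(root):
    """Create the project folders under `root`; return their relative paths."""
    root = Path(root)
    created = []
    for rel in PROJECT_DIRS:
        path = root / rel
        path.mkdir(parents=True, exist_ok=True)  # safe to run repeatedly
        created.append(path.relative_to(root).as_posix())
    return created


# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    created = scaffold(tmp)
    all_exist = all((Path(tmp) / rel).is_dir() for rel in created)
    print(created, all_exist)
```

To use it on a real project, call `scaffold` with the project root instead of the temporary directory.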
Workflow
tip: you can save all answers to presets/_faq.md.
Before ML (miniFAQ)
- What problem are you trying to solve?
- How do you think it can be solved? What is your hypothesis?
- What is the value of the hypothesis you are testing?
- What are the main metrics? How to measure them?
- How can metrics prove that hypothesis works?
- Is it possible to check it without ML? How?
- How will your solution be integrated into the current system?
- What can go wrong? What kind of corner cases can occur?
ML (plan)
- Perform exploratory data analysis; check that the data and labeling are correct.
- Plot the main statistics, find outliers, and recheck them.
- Preprocess the data to get clean data from the raw data.
- Split data into train/valid/test parts and fix this data split for future experiments.
- Run adversarial split on your train/valid/test parts to check split correctness.
- Overfit your model on one batch from the train part to ensure that the pipeline works correctly.
- Use the train/valid parts for model training (log all your experiments) and the valid part for final model postprocessing.
- Track metrics for all experiments (use tables for this). Do not forget to write tests for the metrics you use.
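One way to fix the data split and sanity-check it, as a rough sketch. It assumes that "adversarial split" refers to the check commonly called adversarial validation: if the parts can be told apart from their features, the split is suspect. The full technique trains a classifier on all features; the simplified stand-in here scores one feature at a time with an AUC. `assign_split` and `separation_auc` are hypothetical helper names, and the data is synthetic.

```python
import hashlib
import random


def assign_split(sample_id, valid_frac=0.1, test_frac=0.1):
    """Map a sample id to 'train', 'valid' or 'test' deterministically."""
    digest = hashlib.sha256(str(sample_id).encode("utf-8")).hexdigest()
    bucket = (int(digest, 16) % 10_000) / 10_000  # stable value in [0, 1)
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + valid_frac:
        return "valid"
    return "train"


def separation_auc(values_a, values_b):
    """AUC of telling part B from part A using one feature as the score.

    About 0.5 means the feature looks the same in both parts; values near
    0 or 1 mean the parts are easy to tell apart on this feature.
    """
    wins = 0.0
    for b in values_b:
        for a in values_a:
            if b > a:
                wins += 1.0
            elif b == a:
                wins += 0.5  # ties count as half a win
    return wins / (len(values_a) * len(values_b))


# Synthetic data for illustration only: ids 0..999 with one random feature.
rng = random.Random(0)
feature = {i: rng.gauss(0.0, 1.0) for i in range(1000)}
parts = {"train": [], "valid": [], "test": []}
for sample_id, value in feature.items():
    parts[assign_split(sample_id)].append(value)

print({name: len(values) for name, values in parts.items()})
print(round(separation_auc(parts["train"], parts["valid"]), 3))
```

Because the split depends only on the sample id, rerunning the script or adding new samples never moves an existing sample to another part.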
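The one-batch overfitting check can be illustrated without any framework. This is a toy sketch under stated assumptions: a two-parameter linear model and plain gradient descent stand in for the real model and optimizer, and `overfit_one_batch` is a hypothetical helper. The point is only the shape of the check: train on a single fixed batch and confirm that the loss drops to almost zero.

```python
def overfit_one_batch(xs, ys, lr=0.1, steps=500):
    """Fit y = w * x + b on one fixed batch; return the loss at every step."""
    w, b = 0.0, 0.0
    n = len(xs)
    history = []
    for _ in range(steps):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        loss = sum(e * e for e in errors) / n  # mean squared error
        history.append(loss)
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return history


# A single batch generated by y = 2x + 1, so a perfect fit exists.
batch_x = [0.0, 1.0, 2.0, 3.0]
batch_y = [1.0, 3.0, 5.0, 7.0]

history = overfit_one_batch(batch_x, batch_y)
print(history[0], history[-1])
```

If the final loss stays far from zero on a batch the model should be able to memorize, the bug is usually in the pipeline (data loading, loss, or optimizer wiring) rather than in the model capacity.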
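Metric tests can be very small. The sketch below uses accuracy purely as an example metric; the `accuracy` function and the test names are hypothetical, and the tests are written as plain functions with `assert`, so they can be collected by pytest or called directly.

```python
def accuracy(y_true, y_pred):
    """Share of positions where the prediction equals the label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("accuracy is undefined for empty input")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def test_accuracy_perfect():
    assert accuracy([1, 0, 1], [1, 0, 1]) == 1.0


def test_accuracy_half():
    assert accuracy([1, 0, 1, 0], [1, 1, 0, 0]) == 0.5


def test_accuracy_rejects_length_mismatch():
    try:
        accuracy([1, 0], [1])
    except ValueError:
        return
    raise AssertionError("expected ValueError for mismatched lengths")


# Run the tests directly when pytest is not used.
test_accuracy_perfect()
test_accuracy_half()
test_accuracy_rejects_length_mismatch()
```

Hand-computed cases like these catch off-by-one and averaging mistakes before they distort a whole table of experiment results.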
After ML (todo)
- Write down all experiments, check their performance on the test part, and select the best one.
- Trace the model :)
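Writing down experiments and picking the best one can be as simple as a table. The snippet below is a sketch with placeholder values: the experiment names and scores are invented for illustration and are not real results, and `best_experiment` and `to_csv` are hypothetical helpers.

```python
import csv
import io

# Placeholder records for illustration only, not real measurements.
experiments = [
    {"name": "baseline", "valid_accuracy": 0.81, "test_accuracy": 0.79},
    {"name": "larger-lr", "valid_accuracy": 0.84, "test_accuracy": 0.83},
    {"name": "more-augs", "valid_accuracy": 0.85, "test_accuracy": 0.82},
]


def best_experiment(rows, metric="test_accuracy"):
    """Return the record with the highest value of `metric`."""
    return max(rows, key=lambda row: row[metric])


def to_csv(rows):
    """Render the records as CSV text, one experiment per line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


table = to_csv(experiments)
best = best_experiment(experiments)
print(table)
print(best["name"])
```

The CSV text could be saved next to the other notes, for example under presets/, so the full experiment history stays in one place.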
Extra
To keep your code simple and readable, you can use catalyst-codestyle:
# install
pip install -U catalyst-codestyle
# and run
catalyst-make-codestyle && catalyst-check-codestyle