# ML Training

This is a WIP! Please contribute back if you have ideas. You can track the associated issue here.

A machine learning training system can easily be modeled as a state machine.

While this is often more of a DAG (e.g. `query_data` -> `feature_engineer` -> `train_model` -> `evaluate_model`), there are some interesting reasons to model it as a state machine.

## More granular training routines

If, rather than treating `train` as a single action, you treat `train_epoch` as an action, you can utilize the abstractions Burr provides to get the following benefits:

### Checkpointing

You can store the current parameters and the best model so far (or pointers to them) in state, then resume from the point of failure. This allows you to recover from failures in a generic way.
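As a rough sketch using Burr's function-based `@action` API (the `run_one_epoch` helper and the state field names here are hypothetical), everything needed to resume can live in state:

```python
import random

from burr.core import State, action


def run_one_epoch(params: dict) -> tuple[dict, float]:
    """Hypothetical stand-in for one pass over the data with your framework."""
    return params, random.random()


@action(
    reads=["params", "best_params", "best_loss", "epoch"],
    writes=["params", "best_params", "best_loss", "epoch"],
)
def train_epoch(state: State) -> State:
    new_params, loss = run_one_epoch(state["params"])
    best_params, best_loss = state["best_params"], state["best_loss"]
    if loss < best_loss:
        # track the best model (or a pointer to it) so a crash never loses
        # more than the current epoch
        best_params, best_loss = new_params, loss
    return state.update(
        params=new_params,
        best_params=best_params,
        best_loss=best_loss,
        epoch=state["epoch"] + 1,
    )
```

Because all progress is captured in state, wiring in one of Burr's state persisters checkpoints it after every step, so a restarted process can pick up at the failed epoch rather than start from scratch.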

### Decoupling training logic from termination condition

You can have a condition that checks training completeness based on metric history, separate from the training logic itself. This makes it easy to swap termination conditions in and out (testing them against each other, or running different ones in different scenarios), and to reuse them.
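Concretely, the termination rule can live entirely in the transition condition, so replacing a fixed epoch budget with, say, an early-stopping rule touches only the graph definition, not the training code. A sketch, reusing the `train_epoch` action above (the `evaluate_model` action and the state fields are hypothetical):

```python
from burr.core import ApplicationBuilder, State, action, default, expr


@action(reads=["best_loss"], writes=["final_metrics"])
def evaluate_model(state: State) -> State:
    # hypothetical final evaluation step
    return state.update(final_metrics={"best_loss": state["best_loss"]})


# the termination rule is just a condition over state -- swap it for another
# (e.g. one over a metric history) without touching train_epoch
done_training = expr("epoch >= max_epochs")

app = (
    ApplicationBuilder()
    .with_actions(train_epoch, evaluate_model)
    .with_transitions(
        ("train_epoch", "evaluate_model", done_training),
        ("train_epoch", "train_epoch", default),
    )
    .with_state(
        params={}, best_params={}, best_loss=float("inf"), epoch=0, max_epochs=100
    )
    .with_entrypoint("train_epoch")
    .build()
)
app.run(halt_after=["evaluate_model"])
```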

### Visibility into lower-level training

Similar to hooks in PyTorch Lightning, you can use Burr hooks to log metrics, visualize, etc. at each step. This gives you a more granular view of training that updates live (obviously depending on the UI/model-monitoring tooling you're using).
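For instance, a `PostRunStepHook` can report metrics after every epoch. A sketch, where the `print` stands in for whatever experiment tracker you use:

```python
from burr.core import Action, State
from burr.lifecycle import PostRunStepHook


class MetricLoggerHook(PostRunStepHook):
    """Logs training progress after every step, analogous to a
    PyTorch Lightning callback."""

    def post_run_step(self, *, state: State, action: Action, **future_kwargs):
        if action.name == "train_epoch":
            print(f"epoch={state['epoch']} best_loss={state['best_loss']:.4f}")
```

Hooks are registered on the builder, e.g. via `.with_hooks(MetricLoggerHook())`.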

## Human in the loop

While some models are trained in a single shot and shipped to production, many require human input. Burr can be used to express training, then checkpoint/pause the state while a human evaluates it, and have their input (e.g. go/no-go) passed back in as an input parameter.
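A sketch of what that might look like, reusing the `app` from the earlier example (the `human_review` action and the `approve` input are hypothetical, and the action is assumed to be wired into the graph after evaluation):

```python
from burr.core import State, action


@action(reads=["final_metrics"], writes=["approved"])
def human_review(state: State, approve: bool) -> State:
    # `approve` arrives as a runtime input rather than from state, so the
    # decision can come from a human (or any external system)
    return state.update(approved=approve)


# in whatever process the scheduler wakes up:
app.run(halt_before=["human_review"])  # pause (and persist) before review
app.step(inputs={"approve": True})     # resume once a go/no-go decision exists
```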

Note that this still requires a scheduling tool (say, a task executor that runs until the next human input is needed), but that task executor does not need to be complicated: all it needs to do is run a job when prompted, and possibly on a cron schedule.

## Hyperparameter training

Similar to epoch training, hyperparameter optimization can be modeled as a state machine: the system decides what to try next based on prior results. Furthermore, one can model a step that manages multiple jobs, keeping them alive and launching new ones as they complete (based on state).
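A sketch of the decision-making step (the search policy and state fields here are hypothetical; random search stands in for whatever strategy you prefer):

```python
import random

from burr.core import State, action


@action(reads=["trial_history"], writes=["next_params"])
def suggest_params(state: State) -> State:
    history = state["trial_history"]  # list of {"params": ..., "score": ...}
    # hypothetical policy: random search that narrows around the best trial so
    # far; because the history lives in state, swapping in Bayesian
    # optimization or successive halving only changes this one action
    if history:
        best = min(history, key=lambda trial: trial["score"])
        next_params = {"lr": best["params"]["lr"] * 10 ** random.uniform(-0.5, 0.5)}
    else:
        next_params = {"lr": 10 ** random.uniform(-5, -1)}
    return state.update(next_params=next_params)
```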

For a sketch of the multi-job step, see application.py and the accompanying diagram.