A high-level experiment manager.
- 📄 One Python file for your entire experiment, flexible and easily versionable.
- 💚 No boilerplate like parsing a text file with configuration variables, writing multiprocessing code, or logging/saving results.
- ♻️ Easily re-run failed experiment settings.
```
pip install superparams
```
View on PyPI.
Superparams incentivises use of Python's built-in `dataclass` to specify both the parameters and the experiment-specific logic in one place. It bundles this with a bunch of quality-of-life improvements for managing your experiments.
```python
# file: experiments/params.py
import dataclasses as dc

import polars as pl
import superparams as sp


@dc.dataclass
class Hyperparams(sp.Experiment):
    steps      :int = 100
    batch_size :int = sp.search(16, 32)

    # and automatic string substitution!
    dataset_path :str = 'data/raw/{dataset_name}'
    dataset_name :str = sp.search('alphabet', 'numbers')

    def run(self) -> dict | pl.DataFrame:
        '''
        Runs this setting of parameters (override this method).
        Auto-stores the returned dict/pl.DataFrame in a parquet table.
        '''
        results = {
            'total_samples': self.batch_size * self.steps,
            'path': self.dataset_path,
        }
        print(results)

        # automatically save results in a parquet file by returning them
        return results

    def format_results(self, results: pl.DataFrame) -> None | pl.DataFrame:
        '''
        Useful for plotting and post-processing;
        optionally return a formatted dataframe to be saved.
        '''
        results.plot.bar('path', 'total_samples').show()
```
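The `{dataset_name}` placeholder in `dataset_path` above is filled in from the sibling field for each setting. Conceptually this is just Python's `str.format` (a rough sketch of the idea, not the library's actual mechanism):

```python
# roughly what the automatic substitution amounts to (illustrative only)
template = 'data/raw/{dataset_name}'
print(template.format(dataset_name='alphabet'))  # data/raw/alphabet
```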
Instantiating `Hyperparams` gives you an iterator over the grid of parameter settings, meaning you could add a snippet like the following to invoke your experiment from the terminal with `python -m experiments.params`.
```python
# file: experiments/params.py
if __name__ == '__main__':
    for h in Hyperparams():
        results = h.run()
        print(f'Setting ({h.steps}, {h.batch_size}): {results}')

# Setting (100, 16): {'total_samples': 1600}
# Setting (100, 32): {'total_samples': 3200}
```
But we promised no boilerplate! Instead, you can invoke the experiment from the terminal, which handles result-caching for you and enables easy multiprocessing.
```
experiment params.Hyperparams --n_proc 2
```
This will:
- print a nice overview of the running experiments
- store results and logs under `experiments/progress/params/Hyperparams`
- prompt you to resume interrupted/failed experiment settings
- do the multiprocessing for you :)
**Warning**: Python-native multiprocessing shares the `Hyperparams` data with each process by pickling it! This is woefully inefficient and becomes a massive bottleneck if the shared data exceeds ~10MB. Consider refactoring so that each `run` method instantiates this data itself.

In the future, I may refactor this to share the data more efficiently, but this is not trivial in Python and not possible in all cases. See the Python docs.
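A minimal sketch of that refactor: keep only the small `dataset_name` string as a field and build the heavy object inside `run`, so it never has to be pickled. The `load_dataset` helper here is hypothetical and stands in for whatever heavy I/O you do.

```python
import dataclasses as dc
import polars as pl
import superparams as sp


def load_dataset(name: str) -> pl.DataFrame:
    # hypothetical loader; stands in for your own heavy data loading
    return pl.read_parquet(f'data/raw/{name}.parquet')


@dc.dataclass
class Hyperparams(sp.Experiment):
    dataset_name :str = sp.search('alphabet', 'numbers')

    def run(self) -> dict:
        # the heavy data is created inside the worker process,
        # instead of being pickled and shipped to it
        data = load_dataset(self.dataset_name)
        return {'n_rows': len(data)}
```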
Dataclasses don't require Java-style repetitive constructors. To modify your hyperparameter combination, simply instantiate it as follows.

```python
Hyperparams(batch_size=sp.search(2, 4, 8))
```
You can run multiple settings on multiple processes.

```python
import os

params = Hyperparams()
params.run_all(n_proc=os.cpu_count() - 2)
```
Also note that `Experiment` objects have access to concurrency-related fields initialised by superparams. These are:
- `rank`: the process id of this experiment setting, i.e. `rank in {0, 1, 2, 3}` if `n_proc = 4`.
- `n_proc`: the total number of processes, i.e. the value passed to the `n_proc` parameter.
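For instance, `rank` can be used to spread settings over devices (a sketch, assuming one GPU per worker process):

```python
import dataclasses as dc
import superparams as sp


@dc.dataclass
class Hyperparams(sp.Experiment):
    steps :int = 100

    def run(self) -> dict:
        # rank runs from 0 to n_proc - 1, so with n_proc = 4
        # this pins the four workers to cuda:0 .. cuda:3
        device = f'cuda:{self.rank}'
        return {'device': device, 'n_proc': self.n_proc}
```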
Python throws a tantrum if you try to assign a mutable default value to a dataclass field:

```python
@dataclass
class Params(Experiment):
    iterable :list = [1, 2, 3]
    # Error > you should use field(default_factory=lambda: [1,2,3])
```
This is ugly. Python does this to protect you: if you were to instantiate a second `Params()` and modify its `iterable`, you would be modifying both instantiated `Params` objects, since the list is a single class attribute shared between them.
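A minimal illustration of the sharing Python is guarding against, using a plain class (the dataclass machinery refuses the mutable default outright):

```python
class Plain:
    items = [1, 2, 3]  # a class attribute, shared by every instance

a, b = Plain(), Plain()
a.items.append(4)
print(b.items)  # [1, 2, 3, 4] -- b sees a's modification
```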
I think this is stupid and limits the potential of dataclasses (especially given that `frozen=True` is a setting that enforces immutability, yet the error is still raised). For now, using `iterable = sp.search([1, 2, 3])` should work. In the future, I may rewrite the built-in dataclass to not follow this pattern and make this behaviour more explicit.
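For completeness, the stock-dataclass spelling that the error message asks for looks like this:

```python
import dataclasses as dc
import superparams as sp


@dc.dataclass
class Params(sp.Experiment):
    # the Python-sanctioned way to declare a mutable default
    iterable :list = dc.field(default_factory=lambda: [1, 2, 3])
```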
Note that a similar thing is much more likely to bite in functions, where Python does not guard against it. E.g. in

```python
def function(items=[1, 2, 3]):
    print(items)
    items.append(4)

function()  # [1, 2, 3]
function()  # [1, 2, 3, 4]
```
Further reference in Python docs.
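The usual workaround (plain Python, nothing superparams-specific) is the None-sentinel pattern:

```python
def function(items=None):
    # a fresh list is created on every call instead of one shared default
    items = [1, 2, 3] if items is None else items
    print(items)
    items.append(4)

function()  # [1, 2, 3]
function()  # [1, 2, 3]
```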
Any decent package should list viable alternatives. Here are some that I considered before building this package instead.
- wandb sweeps is best used for Bayesian hyperparameter search to optimise a DL model, but requires specifying settings in configuration files.
- ray tune enables SOTA algorithms like PBT (similar to genetic optimisation) and HyperBand/ASHA (large population with early stopping), and allows for relatively unsupervised optimisation by specifying a search space and objective in Python. It also integrates with search libraries such as Hyperopt and Optuna.
- orion is similar to ray tune, but more or less a wrapper around an argument parser you need to set up yourself (so you have to specify everything in plain-text cli commands).
- hydra is probably the most similar in features to `superparams`, but relies on `yaml` for specification and doesn't collate results nicely into a `polars` dataframe.
I think of superparams as more open-ended than ray-tune: there may not be a direct objective to optimise, as the right objective is often not yet established in the early stages of experimentation. And by allowing everything to be specified in a single Python dataclass, you maintain flexibility by not assuming that the entire optimisation is a black box. To me, it is valuable to be able to specify all parameters and logic in a single place, completely in LSP-understandable Python.