Commit

Merge pull request #138 from HDI-Project/documentation_update
Updated documentation on ModelHub
csala committed May 9, 2019
2 parents 1889701 + 8bae295 commit a283c19
Showing 2 changed files with 25 additions and 22 deletions.
43 changes: 23 additions & 20 deletions docs/database.rst
@@ -17,13 +17,14 @@ A Dataset represents a single set of data which can be used to train and test
models by ATM. The table stores information about the location of the data as
well as metadata to help with analysis.

-- ``dataset_id`` (Int): Unique identifier for the dataset.
+- ``id`` (Int): Unique identifier for the dataset.
- ``name`` (String): Identifier string for a classification technique.
-- ``description`` (String): Human-readable description of the dataset.
-- not described in the paper
-- ``class_column`` (String): Name of the class label column.
- ``train_path`` (String): Location of the dataset train file.
- ``test_path`` (String): Location of the dataset test file.
+- ``class_column`` (String): Name of the class label column.
+- ``description`` (String): Human-readable description of the dataset.
+
+- not described in the paper

The metadata fields below are not described in the paper.

@@ -41,15 +42,16 @@ A Datarun is a single logical job for ATM to complete. The Dataruns table
contains a reference to a dataset, configuration for ATM and BTB, and
state information.

-- ``datarun_id`` (Int): Unique identifier for the datarun.
+- ``id`` (Int): Unique identifier for the datarun.
- ``dataset_id`` (Int): ID of the dataset associated with this datarun.
- ``description`` (String): Human-readable description of the datarun.
-- not in the paper
+- not described in the paper

BTB configuration:

- ``selector`` (String): Selection technique for hyperpartitions.
- called "hyperpartition_selection_scheme" in the paper
+
- ``k_window`` (Int): The number of previous classifiers the selector will
consider, for selection techniques that set a limit of the number of
historical runs to use.
@@ -63,7 +65,7 @@ BTB configuration:
numeric hyperparameter will be chosen from a set of ``gridding`` discrete,
evenly-spaced values. If set to 0 or NULL, values will be chosen from the
full, continuous space of possibilities.
-- not in the paper
+- not described in the paper
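
Note on the ``gridding`` field above: the following is a minimal sketch of how a set of ``gridding`` discrete, evenly-spaced values could be built for one numeric hyperparameter. The helper name ``grid_values``, the inclusive endpoints, and the linear spacing are assumptions made for illustration; they are not taken from the ATM or BTB source.

```python
def grid_values(low, high, gridding):
    """Return `gridding` evenly spaced candidates between low and high.

    Hypothetical helper. Returns None when gridding is 0 or None, which
    stands for "choose from the full, continuous range instead".
    """
    if not gridding:
        return None
    if gridding == 1:
        return [float(low)]
    # Assumption: both endpoints are included in the grid.
    step = (high - low) / (gridding - 1)
    return [low + i * step for i in range(gridding)]


print(grid_values(0.0, 1.0, 5))
print(grid_values(0.0, 1.0, 0))
```

With a grid, a tuner only ever proposes one of a finite number of values per hyperparameter, so the search space of a hyperpartition can be exhausted.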

ATM configuration:

@@ -79,29 +81,29 @@ ATM configuration:
- ``deadline`` (DateTime): If provided, and if ``budget_type`` is set to
"walltime", the datarun will run until this absolute time. This overrides the
``budget`` column.
-- not in the paper
+- not described in the paper
- ``metric`` (String): The metric by which to score each classifier for
comparison purposes. Can be one of ["accuracy", "cohen_kappa", "f1",
"roc_auc", "ap", "mcc"] for binary problems, or ["accuracy", "rank_accuracy",
"cohen_kappa", "f1_micro", "f1_macro", "roc_auc_micro", "roc_auc_macro"] for
multiclass problems
-- not in the paper
+- not described in the paper
- ``score_target`` (Enum): One of ["cv", "test", "mu_sigma"]. Determines how the
final comparative metric (the *judgment metric*) is calculated.
- "cv" (cross-validation): the judgment metric is the average of a 5-fold
cross-validation test.
- "test": the judgment metric is computed on the test data.
- "mu_sigma": the judgment metric is the lower error bound on the mean CV
score.
-- not in the paper
+- not described in the paper
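
Note on ``score_target`` above: the three options can be sketched as follows. This is an illustration only. The function name ``judgment_metric``, the multiplier ``k`` (set to 2 here), and the use of the sample standard deviation are assumptions; the documentation only states that "mu_sigma" is a lower error bound on the mean CV score.

```python
from statistics import mean, stdev


def judgment_metric(cv_scores, test_score, score_target="cv", k=2.0):
    """Pick the judgment metric according to score_target (hypothetical helper)."""
    if score_target == "cv":
        # Average of the per-fold cross-validation scores.
        return mean(cv_scores)
    if score_target == "test":
        # Score computed on the held-out test data.
        return test_score
    if score_target == "mu_sigma":
        # Assumed form of the lower bound: mean minus k standard deviations.
        return mean(cv_scores) - k * stdev(cv_scores)
    raise ValueError(f"unknown score_target: {score_target!r}")


folds = [0.80, 0.90, 0.70, 0.80, 0.80]  # made-up scores for five folds
for target in ("cv", "test", "mu_sigma"):
    print(target, judgment_metric(folds, test_score=0.75, score_target=target))
```

Under these assumptions, "mu_sigma" penalizes classifiers whose fold scores vary widely, so a stable classifier can outrank one with a slightly higher but noisier mean.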

State information:

- ``start_time`` (DateTime): Time the DataRun began.
- ``end_time`` (DateTime): Time the DataRun was completed.
- ``status`` (Enum): Indicates whether the run is pending, in progress, or has
been finished. One of ["pending", "running", "complete"].
-- not in the paper
+- not described in the paper


Hyperpartitions
@@ -113,38 +115,38 @@ instance must be associated with a single datarun; the performance of a
hyperpartition in a previous datarun is assumed to have no bearing on its
performance in the future.

-- ``hyperparition_id`` (Int): Unique identifier for the hyperparition.
+- ``id`` (Int): Unique identifier for the hyperparition.
- ``datarun_id`` (Int): ID of the datarun associated with this hyperpartition.
- ``method`` (String): Code for, or path to a JSON file describing, this
hyperpartition's classification method (e.g. "svm", "knn").
-- ``categoricals`` (Base64-encoded object): List of categorical hyperparameters
+- ``categoricals_hyperparameters_64`` (Base64-encoded object): List of categorical hyperparameters
whose values are fixed to define this hyperpartition.
- called "partition_hyperparameter_values" in the paper
-- ``tunables`` (Base64-encoded object): List of continuous hyperparameters which
+- ``tunables_hyperparameters_64`` (Base64-encoded object): List of continuous hyperparameters which
are free; their values must be selected by a Tuner.
- called "conditional_hyperparameters" in the paper
-- ``constants`` (Base64-encoded object): List of categorical or continuous
+- ``constants_hyperparameters_64`` (Base64-encoded object): List of categorical or continuous
parameters whose values are always fixed. These do not define the
hyperpartition, but their values must be passed to the classification method
to fully parameterize it.
-- not in the paper
+- not described in the paper
- ``status`` (Enum): Indicates whether the hyperpartition has caused too many
classifiers to error, or whether the grid for this partition has been fully
explored. One of ["incomplete", "gridding_done", "errored"].
-- not in the paper
+- not described in the paper
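
Note on the "Base64-encoded object" columns above: one plausible way to store a Python object in a text column is to serialize it and then Base64-encode the bytes. The sketch below assumes ``pickle`` as the serializer; the helper names ``object_to_base64`` and ``base64_to_object`` and the sample values are hypothetical and are not confirmed by this commit.

```python
import base64
import pickle


def object_to_base64(obj):
    """Serialize obj and return it as an ASCII Base64 string (hypothetical helper)."""
    return base64.b64encode(pickle.dumps(obj)).decode("ascii")


def base64_to_object(text):
    """Inverse of object_to_base64. Only use on trusted data: pickle can run code."""
    return pickle.loads(base64.b64decode(text))


# Made-up example of categorical hyperparameters that define a hyperpartition.
categoricals = [("kernel", "rbf"), ("shrinking", True)]
encoded = object_to_base64(categoricals)
print(encoded)
print(base64_to_object(encoded) == categoricals)
```

Encoding to Base64 keeps the stored value plain ASCII, which avoids problems with binary data in string columns at the cost of roughly a third more storage.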


Classifiers
-----------
A Classifier represents a single train/test run using a method and a set of hyperparameters with a particular dataset.

-- ``classifier_id`` (Int): Unique identifier for the classifier.
+- ``id`` (Int): Unique identifier for the classifier.
- ``datarun_id`` (Int): ID of the datarun associated with this classifier.
- ``hyperpartition_id`` (Int): ID of the hyperpartition associated with this
classifier.
- ``host`` (String): IP address or name of the host machine where the classifier
was tested.
-- not in the paper
+- not described in the paper
- ``model_location`` (String): Path to the serialized model object for this
classifier.
- ``metrics_location`` (String): Path to the full set of metrics computed during
@@ -153,8 +155,9 @@ A Classifier represents a single train/test run using a method and a set of hype
cross-validated training data.
- ``cv_judgment_metric_stdev`` (Number): Standard deviation of the
cross-validation test.
+- not described in the paper
- ``test_judgment_metric`` (Number): Judgment metric computed on the test data.
-- ``hyperparameters_values`` (Base64-encoded object): The full set of
+- ``hyperparameters_values_64`` (Base64-encoded object): The full set of
hyperparameter values used to create this classifier.
- ``start_time`` (DateTime): Time that a worker started working on the
classifier.
4 changes: 2 additions & 2 deletions docs/quickstart.rst
@@ -31,7 +31,7 @@ Create a datarun
----------------

Before we can train any classifiers, we need to create a datarun. In ATM, a
-datarun is a single logical machine learning task. The ``enter_data.py`` script
+datarun is a single logical machine learning task. The ``enter_data`` command
will set up everything you need.::

(atm-env) $ atm enter_data
@@ -75,7 +75,7 @@ An ATM *worker* is a process that connects to a ModelHub, asks it what dataruns
need to be worked on, and trains and tests classifiers until all the work is
done. To run one, use the following command::

-(atm-env) $ atm worker.py
+(atm-env) $ atm worker

This will start a process that builds classifiers, tests them, and saves them to
the ./models/ directory. As it runs, it should print output indicating which
