Title: A Generalized Framework for Population Based Training

Author: Li et al. (2019) (DeepMind)

General Content:

Generalizes the original PBT pipeline into a fully black-box API by decomposing training into asynchronous trials and introducing a PBT controller in charge of coordination. This is inspired by the Google Vizier architecture. Extends the experimental PBT results to WaveNet.

Key take-away for own work:

Mainly an API/program-architecture/distributed-systems design paper; no strong algorithmic improvements over the original PBT idea.

Keypoints:

  • Problem with applying the original PBT: it requires continuous and simultaneous training of all workers and runs into issues when workers can be preempted by higher-priority jobs in real distributed environments

    • Also glass-box: the trainer has to know about the parallel workers and itself performs the weight copying and hyperparameter changes, which can require changing the computational graph (e.g. activations)
    • Parallel workers read and write to the same database and decide themselves to warm-start from another worker's checkpoint
  • Glass-box limitations:

    1. Can't easily make changes to the computational graph
    2. Does not easily handle the case where a worker job is preempted by another, higher-priority job
    3. Not extendable to advanced evolution/mutation strategies
  • Here: propose to decompose the whole training run into multiple trials

    • In each trial a worker only trains for a limited amount of time
    • Each trial depends on one or more other trials (the exploit idea)
    • A trainer/controller is introduced to oversee the whole population: it decides the hyperparameters and the checkpoint from which to warm-start each trial (see the sketch after this list)
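
A minimal sketch of this trial-based decomposition, assuming a hypothetical Controller with request_trial/report_trial methods and user-supplied train_step/evaluate functions (none of these names come from the paper):

```python
class Controller:
    """Central controller sketch: decides the hyperparameters and the parent
    checkpoint each new trial warm-starts from."""

    def __init__(self):
        self.completed = []  # finished trials with their fitness

    def request_trial(self):
        # A real controller would query the trial database and run the
        # exploit/explore logic; here we simply warm-start from the best trial.
        parent = max(self.completed, key=lambda t: t["fitness"], default=None)
        return {"hparams": {"learning_rate": 1e-3}, "warm_start": parent}

    def report_trial(self, result):
        self.completed.append(result)


def worker_loop(controller, train_step, evaluate, steps_per_trial=1000, num_trials=10):
    """A worker runs many short, self-contained trials instead of one long job,
    so it can be preempted between trials without breaking the population."""
    for _ in range(num_trials):
        trial = controller.request_trial()
        state = trial["warm_start"]["state"] if trial["warm_start"] else None
        for _ in range(steps_per_trial):
            state = train_step(state, trial["hparams"])
        controller.report_trial({"state": state, "fitness": evaluate(state)})
```
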
  • Vizier hyperparameter service: black-box, makes experiment setup easier and offers the highest flexibility

  • Success of PBT: it makes decisions based on incomplete observations, i.e. objective values that have not yet converged

  • Generalized framework = stateless service: each request to the service does not depend on any other, which allows for high scalability

    • All decisions are made by a central controller
    • Small workload per trial, executed by a worker
    • Trial = a continuous training session, represented as a protocol buffer
    • Parameters are represented in a DAG to incorporate the notion of parameter dependence (see the sketch after this list)
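
A rough sketch of a trial record and the parameter-dependence DAG; the paper uses a protocol buffer, and the field names and example parameters below are illustrative assumptions:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Trial:
    """Illustrative trial record (the paper stores this as a protocol buffer)."""
    trial_id: int
    hparams: Dict[str, object]
    parent_ids: List[int] = field(default_factory=list)  # warm-start dependencies
    fitness: Optional[float] = None

# Parameter dependence as a DAG: a child hyperparameter is only active for a
# particular value of its parent (assumed example, not from the paper).
PARAM_CHILDREN = {
    ("optimizer", "sgd"): ["momentum"],
    ("optimizer", "adam"): ["beta1", "beta2"],
}

def active_children(hparams):
    """Return the child hyperparameters activated by the current parent values."""
    return [child
            for (parent, value), children in PARAM_CHILDREN.items()
            if hparams.get(parent) == value
            for child in children]
```
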
  • Advantages of this approach:

    1. Allows tuning of all hyperparameters regardless of the computational graph
    2. Allows training of differentiable and non-differentiable objectives
    3. Allows all hyperparameters to be adapted dynamically over time
    4. Maximizes flexibility!
  • Initiator-based evolution: every trial is guaranteed to participate in at least one reproduction procedure; this guarantee is important for small population sizes

    • Fitness: can consist of multiple criteria; comparison uses either a weighting or a Pareto-improvement criterion
    • Reproduction (see the sketch after this list):
      • Initiator: asymmetric evolution; the initiator sends a request to the controller to initiate a competition
      • Opponent selection: compare against potential opponents from the last two generations
      • Winner = parent for the current reproduction; every population member participates in a binary tournament exactly once
      • Reproduction step: copy from the parent, then apply crossover/mutation
      • For categorical parameters, mutation = resampling a new value
      • For discrete parameters, mutation samples the adjacent lower or higher value
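
A sketch of one initiator-driven reproduction step under these rules; the data layout (dicts with fitness, generation, hparams and checkpoint keys) and the continuous-perturbation factors are assumptions, not taken from the paper:

```python
import random

def mutate(value, spec):
    """Mutation as in the notes above: resample categoricals, step discrete
    values to an adjacent entry, perturb continuous values (assumed factors)."""
    if spec["type"] == "categorical":
        return random.choice(spec["choices"])
    if spec["type"] == "discrete":
        idx = spec["values"].index(value) + random.choice([-1, 1])
        return spec["values"][max(0, min(idx, len(spec["values"]) - 1))]
    return value * random.choice([0.8, 1.2])  # continuous parameter

def reproduce(initiator, population, hparam_space):
    """Binary tournament: the initiator picks an opponent from the last two
    generations; the winner becomes the parent of a mutated child."""
    opponents = [t for t in population
                 if t["generation"] >= initiator["generation"] - 1
                 and t is not initiator]
    opponent = random.choice(opponents) if opponents else initiator
    parent = initiator if initiator["fitness"] >= opponent["fitness"] else opponent
    child_hparams = {name: mutate(value, hparam_space[name])
                     for name, value in parent["hparams"].items()}
    return {"hparams": child_hparams,
            "warm_start": parent["checkpoint"],
            "generation": initiator["generation"] + 1}
```
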
  • Proposed budget mode: simulate evolution with a large generation size using a small number of workers

    • Only start reproduction once the initiator's generation has reached the specified population size (see the sketch below)
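
A tiny sketch of that gating condition (same assumed trial layout as above):

```python
def ready_to_reproduce(trials, initiator_generation, population_size):
    """Budget mode: reproduction for a generation only starts once that
    generation has accumulated `population_size` completed trials, even if
    far fewer workers are actually running."""
    finished = [t for t in trials
                if t["generation"] == initiator_generation
                and t.get("fitness") is not None]
    return len(finished) >= population_size
```
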
  • Training replay: ultimately one needs to train on a slightly modified dataset (train + val + test); simply re-apply the previously obtained hyperparameter schedule

    • Can also apply only a subset of the schedule, which is useful for ensembling
    • Trial dependency graph: training replay requires extracting the dependency graph of a certain final trial in order to repeat the same training procedure; this is done via a topological sort (see the sketch below)
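
A sketch of training replay via a topological sort of the trial dependency graph, using Python's standard graphlib; the trial-dict layout is an assumption:

```python
from graphlib import TopologicalSorter

def replay_schedule(trials, final_trial_id):
    """trials: {trial_id: {"parent_ids": [...], "hparams": {...}}}.
    Returns the hyperparameter schedules of the chosen final trial and all of
    its ancestors, ordered so that every trial comes after its parents."""
    # Collect the final trial and all of its ancestors.
    needed, stack = set(), [final_trial_id]
    while stack:
        tid = stack.pop()
        if tid not in needed:
            needed.add(tid)
            stack.extend(trials[tid]["parent_ids"])
    # Topological sort: parents come before their children.
    graph = {tid: set(trials[tid]["parent_ids"]) for tid in needed}
    order = TopologicalSorter(graph).static_order()
    return [trials[tid]["hparams"] for tid in order]
```
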
  • Opponent selection strategy: looking only at the past 1-2 generations appears to yield the best results!

Questions/Critical Observations:

  • In Jaderberg et al., PBT is introduced as being asynchronous; here it is instead characterized as a synchronous glass-box system