hypotheses after you've already processed all necessary information using
efficiently manage the distributed training of the predicate ranking model
stored in `agatha.ml.hypothesis_predictor`.

## tl;dr

You will need the following files:

- `predicate_graph.sqlite3`
- `predicate_entities.sqlite3`
- `embeddings/predicate_subset/*.h5`

You will need to run `python3 -m agatha.ml.hypothesis_predictor` with the right
keyword arguments. If performing distributed training, you will need to run this
on each machine in your training cluster.
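
For reference, here is a minimal single-machine sketch of that invocation,
assuming each parameter described below maps to a command-line flag of the same
name; all paths are placeholders you will need to replace with your own.

```bash
python3 -m agatha.ml.hypothesis_predictor \
  --graph-db /path/to/predicate_graph.sqlite3 \
  --entity-db /path/to/predicate_entities.sqlite3 \
  --embedding-dir /path/to/embeddings/predicate_subset \
  --default_root_dir /path/to/model_output \
  --gpus 0,1 \
  --distributed_backend ddp
```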

Take a look at
[scripts/train_2020.sh](https://github.com/JSybrandt/agatha/blob/master/scripts/train_2020.sh)
for a complete example of how to train the Agatha model.


## Background

The Agatha deep learning model learns to rank entity-pairs. To learn this
This process is slow compared to most other operations in the training pipeline.
Each query has to check against the sqlite `key` index, which is stored on disk,
load the `value`, also stored on disk, and then parse the string. There are two
optimizations that make this faster: preloading and caching. Look into the API
documentation for more detail.

## Installing Apex for AMP

Apex is a bit of a weird dependency, but it allows us to take advantage of some
GPU optimizations that really cut back our memory footprint. AMP allows us to
train using 16-bit precision, enabling more samples per batch and resulting in
faster training times. However, note that if you install Apex on a node that
has one type of GPU, you will get an error if you try to train on a node with a
different type. This means that you **need** to install this dependency on a
training node with the appropriate GPU.

To install Apex, first select a location such as `~/software` to keep the
files. Next, download Apex with git: `git clone
https://github.com/NVIDIA/apex.git`. Finally, install the dependency with: `pip
install -v --no-cache-dir --global-option="--cpp_ext"
--global-option="--cuda_ext" ./`

In full, run this:

```bash
# SSH into one of your training nodes with the correct GPU configuration.
# Make sure the appropriate modules are loaded.
# Assuming you would like to install apex in ~/software/apex

# Make the software dir if it's not present
mkdir -p ~/software

# Clone apex to ~/software/apex
git clone https://github.com/NVIDIA/apex.git ~/software/apex

# Enter Apex dir
cd ~/software/apex

# Run install
pip install -v \
  --no-cache-dir \
  --global-option="--cpp_ext" \
  --global-option="--cuda_ext" \
  ./
```

## Model Parameters

This is _NOT_ an exhaustive list of the parameters present in the Agatha deep
learning model, but is a full list of the parameters you need to know to train
the model.

`amp_level`
: The AMP optimization level used by NVIDIA's Apex. `O1` works well. `O2`
causes some convergence issues, so I would stay away from that.

`default_root_dir`
: The directory to store model training files.

`dataloader-workers`
: The number of processes used to generate predicate pairs, per-gpu. Too many
dataloader workers will cause an out-of-memory error. I've found 3 works well.

`dim`
: The number of dimensions of each input embedding. We use 512 in most cases.
This parameter affects the size of various internal parameters.

`distributed_backend`
: Used to specify how to communicate between GPUs. Ignored if using only one
GPU. Set to `ddp` for distributed data parallel (even if only using gpus on
the same node).

`embedding-dir`
: The system path containing embedding `HDF5` (`*.h5`) files.

`entity-db`
: The system path to the entities `.sqlite3` database.

`gpus`
: The specific GPUs enabled on this machine. GPUs are indexed starting from 0.
On a 2-GPU node, this should be set to `0,1`.

`gradient_clip_val`
: A single step of gradient descent cannot move a parameter more than this
amount. We find that setting this to `1.0` enables convergence.

`graph-db`
: The system path to the graph `.sqlite3` database.

`lr`
: The learning rate. We use `0.02` because we're cool.

`margin`
: The objective of the Agatha training procedure is [Margin Ranking Loss][3].
This parameter determines how much higher the positive sample's ranking
criterion must be than every negative sample's. Setting this too high or too
low will cause convergence issues. Remember that the model outputs in the
`[0,1]` interval. We recommend `0.1`.

`max_epochs`
: The maximum number of times to go through the training set.

`negative-scramble-rate`
: The number of negative scrambles (easy negative samples) to generate for each
positive sample.

`negative-swap-rate`
: The number of negative swaps (hard negative samples) to generate for each
positive sample.

`neighbor-sample-rate`
: When sampling a term-pair, we also sample each pair's disjoint neighborhood.
This determines the maximum number of neighbors to include.

`num_nodes`
: This determines the number of _MACHINES_ used to train the model.

`num_sanity_val_steps`
: Before starting training in earnest, we can optionally take a few validation
steps just to make sure everything has been configured properly. If this is
set above zero, we will run multiple validation steps on the newly
instantiated model. Recommended to run around `3` just to make sure everything
is working.

`positives-per-batch`
: Number of positive samples per batch per machine. More results in faster
training. Keep in mind that the true batch size will be `num_nodes *
positives-per-batch * (negative-scramble-rate + negative-swap-rate)`. When
running with 16-bit precision on V100 gpus, we can handle around `80`
positives per batch.

`precision`
: The number of bits per-float. Set to `16` for half-precision if you've
installed apex.

`train_percent_check`
: Limits the number of actual training examples per-epoch. If set to `0.1` then
one epoch will occur after every 10\% of the training data. This is important
because we only checkpoint after every epoch, and don't want to spend too
much time computing between checkpoints. We recommend that if you set this
value, you should increase `max_epochs` accordingly.

`transformer-dropout`
: Within the transformer encoder of Agatha, there is a dropout parameter that
helps improve performance. Recommended you set this to `0.1`.

`transformer-ff-dim`
: The size of the feed-forward layer within each transformer-encoder block.
Recommended you set this to something between `2*dim` and `4*dim`.

`transformer-heads`
: The number of self-attention operations per self-attention block in the
transformer encoder. We use `16`.

`transformer-layers`
: The number of transformer encoder blocks. Each transformer-encoder contains
multi-headed self-attention and a feed-forward layer. More transformer encoder
layers should lead to higher quality, but will require additional training
time and memory.

`val_percent_check`
: Just like how `train_percent_check` limits the number of training samples
per-epoch, `val_percent_check` limits the number of validation samples
per-epoch. Recommended that if you set one, you set the other accordingly.

`validation-fraction`
: Before training, this parameter determines the training-validation split. A
higher value means less training data, but more consistent validation numbers.
Recommended you set to `0.2`.

`warmup-steps`
: Agatha uses a gradient warmup strategy to improve early convergence. This
parameter indicates the number of steps needed to reach the input learning
rate. For instance, if you specify a learning rate of `0.02` and `100` warmup
steps, at step `50` there will be an effective learning rate around `0.01`. We
set this to `100`, but higher can be better if you have the time.

`weight-decay`
: Each step, the weights of the Agatha model will be moved towards zero at this
rate. This helps with later convergence and encourages sparsity. We set to
`0.01`.

`weights_save_path`
: The result root directory. Model checkpoints will be stored in
`weights_save_path/checkpoints/version_X/`. Recommended that this is set to
the same value as `default_root_dir`.
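
Putting these together, a complete single-node training command might look like
the sketch below. The paths, GPU indices, and negative sampling rates are
illustrative assumptions; the remaining values are the recommendations from the
list above, and again we assume each parameter maps to a flag of the same name.

```bash
python3 -m agatha.ml.hypothesis_predictor \
  --graph-db /path/to/predicate_graph.sqlite3 \
  --entity-db /path/to/predicate_entities.sqlite3 \
  --embedding-dir /path/to/embeddings/predicate_subset \
  --default_root_dir /path/to/model_output \
  --weights_save_path /path/to/model_output \
  --amp_level O1 \
  --precision 16 \
  --gpus 0,1 \
  --distributed_backend ddp \
  --num_nodes 1 \
  --dim 512 \
  --lr 0.02 \
  --margin 0.1 \
  --gradient_clip_val 1.0 \
  --warmup-steps 100 \
  --weight-decay 0.01 \
  --transformer-dropout 0.1 \
  --transformer-heads 16 \
  --transformer-ff-dim 1024 \
  --dataloader-workers 3 \
  --validation-fraction 0.2 \
  --num_sanity_val_steps 3 \
  --positives-per-batch 80 \
  --negative-scramble-rate 10 \
  --negative-swap-rate 10
# With these illustrative sampling rates, the effective batch size is
# num_nodes * positives-per-batch * (scramble-rate + swap-rate)
#   = 1 * 80 * (10 + 10) = 1600.
```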

## Running Distributed Training

In order to perform distributed training, you will need to ensure that each
machine in your training cluster is configured with the same modules,
libraries, and python versions.

On palmetto, and many HPC systems, this can be done with modules and Anaconda.
I recommend adding a section to your `.bashrc` for the sake of training Agatha
that loads all necessary modules and activates the appropriate conda
environment. As part of this configuration, you will need to set some
environment variables on each machine that help coordinate training. These are
`MASTER_ADDR`, `MASTER_PORT`, and `NODE_RANK`.

### Distributed Training Env Variables

`MASTER_ADDR`
: Needs to be set to the hostname of one of your training nodes. This node will
coordinate the others.

`MASTER_PORT`
: Needs to be set to an unused network port, and must be the same on every
machine. Can be any large number. We recommend: `12910`.

`NODE_RANK`
: If you have N machines, then each machine needs a unique `NODE_RANK` value
between 0 and N-1.

We recommend setting these values automatically using a `nodefile`. A `nodefile`
is just a text file containing the hostnames of each machine in your training
cluster. The first name will be the `MASTER_ADDR` and the `NODE_RANK` will
correspond to the order of names in the file.
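
For example, a two-machine nodefile (hypothetical hostnames) could be created
like so; the first hostname becomes `MASTER_ADDR`, and each machine's line
number, minus one, becomes its `NODE_RANK`:

```bash
# node0123 will coordinate; node0124 gets NODE_RANK=1.
printf "node0123\nnode0124\n" > ~/.nodefile
```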

If `~/.nodefile` is the path to your nodefile, then you can set these values
with:

```bash
export NODEFILE="$HOME/.nodefile"
export NODE_RANK=$(grep -n $HOSTNAME $NODEFILE | awk 'BEGIN{FS=":"}{print $1-1}')
export MASTER_ADDR=$(head -1 $NODEFILE)
export MASTER_PORT=12910
```

If you're on palmetto, you've already got access to the nodefile referenced by
`PBS_NODEFILE`. However, only the first machine will have this variable set. I
recommend automatically copying this file to some shared location whenever it is
detected. You can do that in `.bashrc` by putting the following lines _BEFORE_
setting the `NODE_RANK` and `MASTER_ADDR` variables.

```bash
# If $PBS_NODEFILE is a file
if [[ -f $PBS_NODEFILE ]]; then
  cp $PBS_NODEFILE ~/.nodefile
fi
```

### Launching Training on Each Machine with Parallel

Once each machine is configured, you will then need to run the agatha training
module on each. We recommend `parallel` to help you do this. Parallel runs a
given bash script multiple times simultaneously, and has some flags that let
us run a script on each machine in a nodefile.

Put simply, you can start distributed training with the following:

```bash
parallel \
  --sshloginfile $NODEFILE \
  --ungroup \
  --nonall \
  python3 -m agatha.ml.hypothesis_predictor \
    ... agatha args ...
```

To explain the parameters:

`sshloginfile`
: Specifies the set of machines to run training on. We use the `NODEFILE`
created in the previous step.

`ungroup`
: By default, `parallel` will wait until a process exits to show us its output.
This flag gives us output every time a process writes the newline character.

`nonall`
: This specifies that the following command (`python3`) will not need its
arguments set by `parallel`, and that we would like to run the following
command as-is, once per machine in `$NODEFILE`.
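
Before launching the real job, it can be worth a quick check that `parallel`
can reach every machine; this one-liner should print each hostname in your
nodefile exactly once:

```bash
# Assumes passwordless SSH between the machines in $NODEFILE.
parallel --sshloginfile $NODEFILE --nonall hostname
```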

## Palmetto-Specific Details

On palmetto, there are a number of modules that you will need to run Agatha.
Here is what I load on every machine I use to train Agatha:

```bash
# C++ compiler modules
module load gcc/8.3.0
module load mpc/0.8.1

# NVIDIA modules
module load cuda-toolkit/10.2.89
module load cuDNN/10.2.v7.6.5
module load nccl/2.6.4-1

# Needed for parallel
module load gnu-parallel

# Needed to work with HDF5 files
module load hdf5/1.10.5

# Needed to work with sqlite
module load sqlite/3.21.0

conda activate agatha

# Copy PBS_NODEFILE if it exists
if [[ -f $PBS_NODEFILE ]]; then
  cp $PBS_NODEFILE ~/.nodefile
fi

# Set distributed training variables
export NODEFILE="$HOME/.nodefile"
export NODE_RANK=$(grep -n $HOSTNAME $NODEFILE | awk 'BEGIN{FS=":"}{print $1-1}')
export MASTER_ADDR=$(head -1 $NODEFILE)
export MASTER_PORT=12910
```

[1]:https://pytorch.org/
[2]:https://github.com/PytorchLightning/pytorch-lightning