Merged
65 commits
36610b2
added training_end
williamFalcon Oct 31, 2019
4a9dd77
added training_end
williamFalcon Oct 31, 2019
57dfd81
added training_end
williamFalcon Oct 31, 2019
ed8f161
added training_end
williamFalcon Oct 31, 2019
afbdf6f
added training_end
williamFalcon Oct 31, 2019
5b6387a
added training_end
williamFalcon Oct 31, 2019
3d7b4c7
added training_end
williamFalcon Nov 1, 2019
33592de
added training_end
williamFalcon Nov 1, 2019
89ed38a
added training_end
williamFalcon Nov 1, 2019
0e91a0d
added training_end
williamFalcon Nov 1, 2019
79a139a
added training_end
williamFalcon Nov 1, 2019
092a606
added training_end
williamFalcon Nov 1, 2019
80a1fd5
allow ddp and apex to be configured
williamFalcon Nov 2, 2019
5417f67
allow ddp and apex to be configured
williamFalcon Nov 2, 2019
0ba0640
bananas
williamFalcon Nov 3, 2019
8fc9da6
bananas
williamFalcon Nov 3, 2019
5277cc8
bananas
williamFalcon Nov 3, 2019
661e925
bananas
williamFalcon Nov 3, 2019
2914120
bananas
williamFalcon Nov 3, 2019
6a04ac1
bananas
williamFalcon Nov 3, 2019
fddcf9d
bananas
williamFalcon Nov 3, 2019
089c709
bananas
williamFalcon Nov 3, 2019
f602c3a
bananas
williamFalcon Nov 3, 2019
4a283c0
bananas
williamFalcon Nov 3, 2019
fe150ff
bananas
williamFalcon Nov 3, 2019
c1c042f
bananas
williamFalcon Nov 3, 2019
bf214e4
bananas
williamFalcon Nov 3, 2019
e552cd6
bananas
williamFalcon Nov 3, 2019
a895c45
bananas
williamFalcon Nov 3, 2019
b5571d5
added eval and train for redundancy
williamFalcon Nov 5, 2019
233a4fd
added eval and train for redundancy
williamFalcon Nov 5, 2019
33f94b3
added training_end
williamFalcon Oct 31, 2019
96f1670
added training_end
williamFalcon Oct 31, 2019
a1f9318
added training_end
williamFalcon Oct 31, 2019
5624fbc
added training_end
williamFalcon Oct 31, 2019
c70d9c4
added training_end
williamFalcon Oct 31, 2019
2117fec
added training_end
williamFalcon Oct 31, 2019
0d77e21
added training_end
williamFalcon Nov 1, 2019
cbd8189
added training_end
williamFalcon Nov 1, 2019
e214ca8
added training_end
williamFalcon Nov 1, 2019
8d5dca0
added training_end
williamFalcon Nov 1, 2019
ebd3c3b
added training_end
williamFalcon Nov 1, 2019
7330dec
added training_end
williamFalcon Nov 1, 2019
0cfcc50
allow ddp and apex to be configured
williamFalcon Nov 2, 2019
5551586
allow ddp and apex to be configured
williamFalcon Nov 2, 2019
274a6be
bananas
williamFalcon Nov 3, 2019
2260fac
bananas
williamFalcon Nov 3, 2019
d473daa
bananas
williamFalcon Nov 3, 2019
25d7351
bananas
williamFalcon Nov 3, 2019
91982f1
bananas
williamFalcon Nov 3, 2019
a916ab6
bananas
williamFalcon Nov 3, 2019
9de558a
bananas
williamFalcon Nov 3, 2019
534d68f
bananas
williamFalcon Nov 3, 2019
c40e8ce
bananas
williamFalcon Nov 3, 2019
caaca3b
bananas
williamFalcon Nov 3, 2019
b9556a3
bananas
williamFalcon Nov 3, 2019
2f046cd
bananas
williamFalcon Nov 3, 2019
d0f55fc
bananas
williamFalcon Nov 3, 2019
2e1a534
bananas
williamFalcon Nov 3, 2019
6b5d865
bananas
williamFalcon Nov 3, 2019
f3e8ff7
added eval and train for redundancy
williamFalcon Nov 5, 2019
f1fcdc1
added eval and train for redundancy
williamFalcon Nov 5, 2019
7de6f0e
Merge branch 'master' into ddp2_fix
williamFalcon Nov 5, 2019
5c4f214
added eval and train for redundancy
williamFalcon Nov 5, 2019
9f11a04
added eval and train for redundancy
williamFalcon Nov 5, 2019
1 change: 1 addition & 0 deletions README.md
@@ -294,6 +294,7 @@ Lightning also adds a text column with all the hyperparameters for this experime

#### Distributed training

- [Implement Your Own Distributed (DDP) training](https://williamfalcon.github.io/pytorch-lightning/Trainer/hooks/#init_ddp_connection)
- [16-bit mixed precision](https://williamfalcon.github.io/pytorch-lightning/Trainer/Distributed%20training/#16-bit-mixed-precision)
- [Multi-GPU](https://williamfalcon.github.io/pytorch-lightning/Trainer/Distributed%20training/#Multi-GPU)
- [Multi-node](https://williamfalcon.github.io/pytorch-lightning/Trainer/Distributed%20training/#Multi-node)
84 changes: 84 additions & 0 deletions docs/LightningModule/RequiredTrainerInterface.md
@@ -15,6 +15,7 @@ Otherwise, to Define a Lightning Module, implement the following methods:

**Optional**:

- [training_end](RequiredTrainerInterface.md#training_end)
- [validation_step](RequiredTrainerInterface.md#validation_step)
- [validation_end](RequiredTrainerInterface.md#validation_end)
- [test_step](RequiredTrainerInterface.md#test_step)
@@ -178,6 +179,89 @@ def training_step(self, batch, batch_nb, hiddens):
You can also return -1 instead of a dict to stop the current loop. This is useful if you want to
break out of the current training epoch early.

---
### training_end

``` {.python}
def training_end(self, train_step_outputs):
```
In certain cases (dp, ddp2), you might want to use the outputs from every process/GPU to compute something jointly.
For instance, when training with negative samples, you could run a batch through dp and then compute a single
softmax over ALL the outputs, so the denominator covers the full batch rather than one GPU's slice.

In such cases, define training_end to perform those calculations on the gathered outputs.


**Params**

| Param | Description |
|---|---|
| train_step_outputs | Whatever you returned in training_step (in dp/ddp2, the outputs gathered from all GPUs). |

**Return**

Dictionary or OrderedDict

| key | value | is required |
|---|---|---|
| loss | tensor scalar | Y |
| progress_bar | Dict for progress bar display. Must have only tensors | N |
| log | Dict of metrics to add to logger. Must have only tensors (no images, etc) | N |


**Example**

``` {.python}
# WITHOUT training_end
# if used in DP or DDP2, this batch is 1/nb_gpus large
def training_step(self, batch, batch_nb):
    # batch is 1/nb_gpus big
    x, y = batch

    out = self.forward(x)
    out = self.softmax(out)
    loss = nce_loss(out)
    return {'loss': loss}

# --------------
# WITH training_end to do the softmax over the full batch
def training_step(self, batch, batch_nb):
    # batch is 1/nb_gpus big
    x, y = batch

    out = self.forward(x)
    return {'out': out}

def training_end(self, train_step_outputs):
    # `out` is now the full batch (gathered from all GPUs in dp/ddp2)
    out = train_step_outputs['out']

    # this softmax now uses the full batch
    out = self.softmax(out)
    loss = nce_loss(out)
    return {'loss': loss}
```

If you define multiple optimizers, training_step will also be called with an additional ```optimizer_idx``` param, as in the snippet below (a fuller sketch follows it).
``` {.python}
# Multiple optimizers (ie: GANs)
def training_step(self, batch, batch_nb, optimizer_idx):
    if optimizer_idx == 0:
        # do training_step with encoder
        ...
    if optimizer_idx == 1:
        # do training_step with decoder
        ...
```
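
To make the dispatch concrete, here is a minimal sketch of a two-optimizer setup. It is not from this PR: `self.generator`, `self.discriminator`, `self.latent_dim`, `self.generator_loss`, and `self.discriminator_loss` are hypothetical members you would define on your own LightningModule.

``` {.python}
import torch

# Hypothetical GAN-style module: optimizer 0 updates the generator,
# optimizer 1 updates the discriminator.
def training_step(self, batch, batch_nb, optimizer_idx):
    x, _ = batch
    z = torch.randn(x.size(0), self.latent_dim, device=x.device)

    if optimizer_idx == 0:
        # generator update: try to fool the discriminator
        g_loss = self.generator_loss(self.discriminator(self.generator(z)))
        return {'loss': g_loss}

    if optimizer_idx == 1:
        # discriminator update: real samples vs. detached fakes
        fake = self.generator(z).detach()
        d_loss = self.discriminator_loss(self.discriminator(x), self.discriminator(fake))
        return {'loss': d_loss}
```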

If you use truncated backpropagation through time, you will also get an additional argument with the hidden states of the previous step, as in the snippet below (a fuller sketch follows it).
``` {.python}
# Truncated back-propagation through time
def training_step(self, batch, batch_nb, hiddens):
    # hiddens are the hiddens from the previous truncated backprop step
    ...
```
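
As a rough illustration (not from this PR), the hidden state can be carried between truncated splits by returning it from training_step. `self.rnn` and `self.loss_fn` are hypothetical helpers, and the exact key used to hand hiddens back depends on the Trainer's truncated BPTT support, so treat this as a sketch:

``` {.python}
# Sketch of a truncated-BPTT step for an RNN-style model (hypothetical helpers).
def training_step(self, batch, batch_nb, hiddens):
    x, y = batch
    out, hiddens = self.rnn(x, hiddens)   # reuse hiddens from the previous split
    loss = self.loss_fn(out, y)
    return {'loss': loss, 'hiddens': hiddens}
```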

You can also return -1 instead of a dict to stop the current loop. This is useful if you want to
break out of the current training epoch early, as in the sketch below.
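
For example, a step could bail out when the loss stops being finite. `self.loss_fn` here is a hypothetical helper, so treat this as a sketch rather than a required pattern:

``` {.python}
import torch

def training_step(self, batch, batch_nb):
    x, y = batch
    loss = self.loss_fn(self.forward(x), y)

    # stop the current training epoch early if the loss diverges
    if not torch.isfinite(loss):
        return -1

    return {'loss': loss}
```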

---
### train_dataloader

89 changes: 89 additions & 0 deletions docs/Trainer/hooks.md
@@ -175,3 +175,92 @@ def tbptt_split_batch(self, batch, split_size):

return splits
```

---
#### configure_apex
Override this hook to define your own Apex (AMP) initialization.

```python
def configure_apex(self, amp, model, optimizers, amp_level):
    """
    Override to init AMP your own way.
    Must return a model and a list of optimizers.
    :param amp: the Apex amp module
    :param model: the LightningModule to wrap
    :param optimizers: list of optimizers returned by configure_optimizers
    :param amp_level: Apex optimization level (e.g. 'O1', 'O2')
    :return: Apex wrapped model and optimizers
    """
    model, optimizers = amp.initialize(
        model, optimizers, opt_level=amp_level,
    )

    return model, optimizers
```
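
For example, a LightningModule subclass could override this hook to force a particular optimization level. This is only a sketch and assumes Apex is installed; `MyModule` and the hard-coded 'O2' level are illustrative, not part of this PR:

```python
class MyModule(LightningModule):
    def configure_apex(self, amp, model, optimizers, amp_level):
        # ignore the Trainer-provided amp_level and always use O2
        model, optimizers = amp.initialize(model, optimizers, opt_level='O2')
        return model, optimizers
```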

---
#### configure_ddp
Override this hook to define your own DDP initialization.
The only requirements are that:
1. On a validation batch, the call goes to model.validation_step.
2. On a training batch, the call goes to model.training_step.
3. On a test batch, the call goes to model.test_step.

```python
def configure_ddp(self, model, device_ids):
    """
    Override to init DDP in a different way or use your own wrapper.
    Must return the model.
    :param model: the LightningModule to wrap
    :param device_ids: GPU ids used by this process
    :return: DDP wrapped model
    """
    # Lightning DDP simply routes to test_step, val_step, etc...
    model = LightningDistributedDataParallel(
        model,
        device_ids=device_ids,
        find_unused_parameters=True
    )
    return model
```
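
As a usage sketch (not from this PR), a subclass could keep Lightning's wrapper but change its options, for example disabling unused-parameter detection; `MyModule` is illustrative:

```python
from pytorch_lightning.pt_overrides.override_data_parallel import LightningDistributedDataParallel

class MyModule(LightningModule):
    def configure_ddp(self, model, device_ids):
        # same Lightning wrapper (it routes calls to training/validation/test_step),
        # but skip the unused-parameter traversal
        model = LightningDistributedDataParallel(
            model,
            device_ids=device_ids,
            find_unused_parameters=False
        )
        return model
```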

---
#### init_ddp_connection
Override this hook to initialize the distributed (DDP) connection in your own way.

```python
def init_ddp_connection(self, proc_rank, world_size):
    """
    Connect all procs in the world using the env:// init method.
    Uses the first node as the root address.
    """

    # use the slurm job id for the port number
    # guarantees unique ports across jobs from the same grid search
    try:
        # use the last 4 numbers in the job id as the id
        default_port = os.environ['SLURM_JOB_ID']
        default_port = default_port[-4:]

        # all ports should be in the 10k+ range
        default_port = int(default_port) + 15000

    except Exception:
        default_port = 12910

    # if the user gave a port number, use that one instead
    try:
        default_port = os.environ['MASTER_PORT']
    except Exception:
        os.environ['MASTER_PORT'] = str(default_port)

    # figure out the root node addr
    try:
        root_node = os.environ['SLURM_NODELIST'].split(' ')[0]
    except Exception:
        root_node = '127.0.0.2'

    root_node = self.trainer.resolve_root_node_address(root_node)
    os.environ['MASTER_ADDR'] = root_node
    dist.init_process_group('nccl', rank=proc_rank, world_size=world_size)
```
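
As a sketch of an override (not from this PR), a cluster without SLURM could set the rendezvous address explicitly; the fixed address/port values below are placeholders you would replace, and `MyModule` is illustrative:

```python
import os
import torch.distributed as dist

class MyModule(LightningModule):
    def init_ddp_connection(self, proc_rank, world_size):
        # fixed rendezvous instead of deriving it from SLURM variables
        os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
        os.environ.setdefault('MASTER_PORT', '12910')
        dist.init_process_group('nccl', rank=proc_rank, world_size=world_size)
```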
1 change: 1 addition & 0 deletions docs/Trainer/index.md
@@ -42,6 +42,7 @@ But of course the fun is in all the advanced things it can do:

**Distributed training**

- [Implement Your Own Distributed (DDP) training](https://williamfalcon.github.io/pytorch-lightning/Trainer/hooks/#init_ddp_connection)
- [16-bit mixed precision](https://williamfalcon.github.io/pytorch-lightning/Trainer/Distributed%20training/#16-bit-mixed-precision)
- [Multi-GPU](https://williamfalcon.github.io/pytorch-lightning/Trainer/Distributed%20training/#Multi-GPU)
- [Multi-node](https://williamfalcon.github.io/pytorch-lightning/Trainer/Distributed%20training/#Multi-node)
1 change: 1 addition & 0 deletions docs/index.md
@@ -99,6 +99,7 @@ Notice a few things about this flow:

###### Distributed training

- [Implement Your Own Distributed (DDP) training](https://williamfalcon.github.io/pytorch-lightning/Trainer/hooks/#init_ddp_connection)
- [16-bit mixed precision](https://williamfalcon.github.io/pytorch-lightning/Trainer/Distributed%20training/#16-bit-mixed-precision)
- [Multi-GPU](https://williamfalcon.github.io/pytorch-lightning/Trainer/Distributed%20training/#Multi-GPU)
- [Multi-node](https://williamfalcon.github.io/pytorch-lightning/Trainer/Distributed%20training/#Multi-node)
80 changes: 79 additions & 1 deletion pytorch_lightning/root_module/root_module.py
@@ -1,8 +1,10 @@
import os
import warnings
import collections
from argparse import Namespace

import torch
import torch.distributed as dist

from pytorch_lightning.root_module.decorators import data_loader
from pytorch_lightning.root_module.grads import GradInformation
@@ -11,6 +13,7 @@
from pytorch_lightning.root_module.model_saving import ModelIO
from pytorch_lightning.trainer.trainer_io import load_hparams_from_tags_csv
import logging
from pytorch_lightning.pt_overrides.override_data_parallel import LightningDistributedDataParallel


class LightningModule(GradInformation, ModelIO, ModelHooks):
@@ -48,10 +51,19 @@ def training_step(self, *args, **kwargs):
return loss, dict with metrics for tqdm
:param called with batch, batch_nb
additional: optimizer_i if multiple optimizers used
:return:
:return: dict with loss key and optional log, progress keys
if implementing training_step, return whatever you need in that step
"""
raise NotImplementedError

def training_end(self, *args, **kwargs):
"""
return loss, dict with metrics for tqdm
:param called with outputs of training_step
:return: dict with loss key and optional log, progress keys
"""
pass

def validation_step(self, *args, **kwargs):
"""
return whatever outputs will need to be aggregated in validation_end
@@ -90,6 +102,72 @@ def test_end(self, outputs):
"""
pass

def configure_ddp(self, model, device_ids):
"""
Override to init DDP in a different way or use your own wrapper.
Must return model.
:param model:
:param device_ids:
:return: DDP wrapped model
"""
model = LightningDistributedDataParallel(
model,
device_ids=device_ids,
find_unused_parameters=True
)
return model

def init_ddp_connection(self, proc_rank, world_size):
"""
Connect all procs in the world using the env:// init
Use the first node as the root address
"""

# use slurm job id for the port number
# guarantees unique ports across jobs from same grid search
try:
# use the last 4 numbers in the job id as the id
default_port = os.environ['SLURM_JOB_ID']
default_port = default_port[-4:]

# all ports should be in the 10k+ range
default_port = int(default_port) + 15000

except Exception as e:
default_port = 12910

# if user gave a port number, use that one instead
try:
default_port = os.environ['MASTER_PORT']
except Exception:
os.environ['MASTER_PORT'] = str(default_port)

# figure out the root node addr
try:
root_node = os.environ['SLURM_NODELIST'].split(' ')[0]
except Exception:
root_node = '127.0.0.2'

root_node = self.trainer.resolve_root_node_address(root_node)
os.environ['MASTER_ADDR'] = root_node
dist.init_process_group('nccl', rank=proc_rank, world_size=world_size)

def configure_apex(self, amp, model, optimizers, amp_level):
[Inline review comment from a Contributor] What's the thought behind passing amp as an argument rather than letting the user import it themselves? From what I can see, it's always going to be the amp module from Apex. Is there some future extension we're thinking about here?
"""
Override to init AMP your own way
Must return a model and list of optimizers
:param amp:
:param model:
:param optimizers:
:param amp_level:
:return: Apex wrapped model and optimizers
"""
model, optimizers = amp.initialize(
model, optimizers, opt_level=amp_level,
)

return model, optimizers

def configure_optimizers(self):
"""
Return a list of optimizers and a list of schedulers (could be empty)