GPU training (Basic)

Audience: Users looking to save money and run large models faster using single or multiple GPUs


What is a GPU?

A Graphics Processing Unit (GPU) is a specialized hardware accelerator designed to speed up the mathematical computations used in gaming and deep learning.


Train on 1 GPU

Make sure you're running on a machine with at least one GPU. There's no need to specify any NVIDIA flags as Lightning will do it for you.

trainer = Trainer(accelerator="gpu", devices=1)
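If the script may also run on machines without a GPU, one option is to choose the Trainer arguments from the number of visible GPUs. The sketch below is only an illustration: trainer_kwargs is a hypothetical helper, not part of Lightning, and it assumes the GPU count is obtained elsewhere (for example from torch.cuda.device_count()).

```python
def trainer_kwargs(visible_gpus: int, requested: int = 1) -> dict:
    """Hypothetical helper: build Trainer arguments, falling back to CPU.

    visible_gpus is the number of GPUs the process can see, e.g. the
    value returned by torch.cuda.device_count().
    """
    if requested > 0 and visible_gpus >= requested:
        # Enough GPUs are visible: request exactly `requested` of them.
        return {"accelerator": "gpu", "devices": requested}
    # Not enough GPUs: run on a single CPU process instead.
    return {"accelerator": "cpu", "devices": 1}


# Assumed usage (requires PyTorch and Lightning to be installed):
#   import torch
#   trainer = Trainer(**trainer_kwargs(torch.cuda.device_count()))
print(trainer_kwargs(visible_gpus=0))
print(trainer_kwargs(visible_gpus=4, requested=2))
```

Falling back silently is a design choice; in a job that must use a GPU, raising an error may be preferable.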


Train on multiple GPUs

To use multiple GPUs, set devices in the Trainer to either the number of GPUs to use or the indices of specific GPUs.

trainer = Trainer(accelerator="gpu", devices=4)

Choosing GPU devices

You can select GPU devices using a range, a list of indices, or a string containing a comma-separated list of GPU IDs:

k = 1

# DEFAULT (int) specifies how many GPUs to use per node
Trainer(accelerator="gpu", devices=k)

# Above is equivalent to
Trainer(accelerator="gpu", devices=list(range(k)))

# Specify which GPUs to use (don't use when running on cluster)
Trainer(accelerator="gpu", devices=[0, 1])

# Equivalent using a string
Trainer(accelerator="gpu", devices="0, 1")

# To use all available GPUs put -1 or '-1'
# equivalent to list(range(torch.cuda.device_count()))
Trainer(accelerator="gpu", devices=-1)

The table below lists examples of possible input formats and how they are interpreted by Lightning.

devices   Type   Parsed           Meaning
3         int    [0, 1, 2]        first 3 GPUs
-1        int    [0, 1, 2, ...]   all available GPUs
[0]       list   [0]              GPU 0
[1, 3]    list   [1, 3]           GPUs 1 and 3
"3"       str    [0, 1, 2]        first 3 GPUs
"1, 3"    str    [1, 3]           GPUs 1 and 3
"-1"      str    [0, 1, 2, ...]   all available GPUs
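The rules in the table above can be sketched in plain Python. This is an illustration only, not Lightning's actual parser: parse_devices and its num_available parameter are hypothetical names, and validation (for example, rejecting indices that do not exist) is left out.

```python
from typing import List, Union


def parse_devices(devices: Union[int, str, List[int]], num_available: int) -> List[int]:
    """Hypothetical helper mirroring the table; not Lightning's own code."""
    if isinstance(devices, str):
        parts = [p.strip() for p in devices.split(",")]
        if len(parts) == 1:
            # A single number in a string behaves like the int form.
            devices = int(parts[0])
        else:
            # A comma-separated string is a list of GPU indices.
            devices = [int(p) for p in parts]
    if isinstance(devices, int):
        if devices == -1:
            # -1 means every available GPU.
            return list(range(num_available))
        # A count k means the first k GPUs.
        return list(range(devices))
    # A list is taken as explicit GPU indices.
    return list(devices)


for value in (3, -1, [1, 3], "3", "1, 3", "-1"):
    print(repr(value), "->", parse_devices(value, num_available=4))
```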

Note

When specifying the number of devices as an integer (devices=k), setting the trainer flag auto_select_gpus=True makes Lightning look for k GPUs that are not occupied by other processes. This is especially useful when GPUs are configured in "exclusive mode", in which only one process at a time can access each GPU. For more details, see the trainer guide (../common/trainer).