
PyTorch Distributed Test on High-Flyer AIHPC

We test different implementations of PyTorch distributed training and compare their performance.

We recommend using Apex for distributed training on High-Flyer AIHPC.
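For reference, the sketch below shows the kind of Apex setup this recommendation refers to: NCCL process-group initialization, mixed precision via apex.amp, and apex.parallel.DistributedDataParallel. The opt_level, learning rate, and function names are illustrative assumptions rather than the exact configuration used in these tests.

# Minimal sketch of Apex-based distributed training (illustrative, not the exact test script).
import torch
import torch.distributed as dist
import torchvision
from apex import amp
from apex.parallel import DistributedDataParallel as ApexDDP

def build_model(local_rank):
    dist.init_process_group(backend='nccl', init_method='env://')
    torch.cuda.set_device(local_rank)
    model = torchvision.models.resnet50().cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    # apex.amp rewrites the model/optimizer for mixed precision; 'O1' is an assumed level.
    model, optimizer = amp.initialize(model, optimizer, opt_level='O1')
    # Apex's DDP wrapper all-reduces gradients across the worker processes.
    model = ApexDDP(model)
    return model, optimizer

def train_step(model, optimizer, images, labels):
    loss = torch.nn.functional.cross_entropy(model(images.cuda()), labels.cuda())
    optimizer.zero_grad()
    # amp.scale_loss applies dynamic loss scaling before the backward pass.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()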

Dataset

ImageNet. We use ffrecord to aggregate the scattered files on High-Flyer AIHPC.

train_data = '/public_dataset/1/ImageNet/train.ffr'
val_data = '/public_dataset/1/ImageNet/val.ffr'
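
As a rough illustration of how these .ffr files can be consumed, the sketch below wraps ffrecord's FileReader in a standard PyTorch Dataset. The assumption that each record is a pickled (image, label) pair is ours; the actual serialization format used to pack ImageNet is not specified here.

# Hedged sketch: reading an .ffr file through a PyTorch Dataset.
import pickle
import torch
from ffrecord import FileReader

class FFRecordImageNet(torch.utils.data.Dataset):
    def __init__(self, path):
        # check_data=True validates per-sample checksums on read.
        self.reader = FileReader(path, check_data=True)

    def __len__(self):
        return self.reader.n  # number of samples in the record file

    def __getitem__(self, index):
        # read() takes a list of indices and returns the raw bytes of each sample.
        raw = self.reader.read([index])[0]
        image, label = pickle.loads(raw)  # assumed serialization format
        return image, label

train_dataset = FFRecordImageNet(train_data)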

Test Model

ResNet

torchvision.models.resnet50()

Parameters

  • batch_size: 400
  • num_nodes: 1
  • gpus: 8
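
Using these settings, a single-node launch could look like the sketch below, with one process per GPU spawned via torch.multiprocessing.spawn. The master address/port and whether batch_size 400 is per GPU or global are assumptions for illustration.

# Sketch of a 1-node, 8-GPU launch (placeholders for address, port, and training loop).
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(local_rank, world_size):
    os.environ.setdefault('MASTER_ADDR', '127.0.0.1')  # assumed single-node setup
    os.environ.setdefault('MASTER_PORT', '29500')      # assumed free port
    dist.init_process_group('nccl', rank=local_rank, world_size=world_size)
    torch.cuda.set_device(local_rank)
    # ... build resnet50, create the DataLoader with batch_size=400, run the training loop ...
    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = 1 * 8  # num_nodes * gpus
    mp.spawn(worker, args=(world_size,), nprocs=world_size)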

Results

Summary

  1. Apex is currently the most efficient implementation for PyTorch distributed training.
  2. The speedup scales roughly linearly with the number of GPUs.
  3. The higher the degree of parallelism, the lower the per-GPU utilization.
