Jexxie/Performances-of-Highly-Scalable-Deep-Learning-Training-System-with-Different-Precision

HPML_Performances-of-Highly-Scalable-Deep-Learning-Training-System-with-Different-Precision

This is the project for the Spring 2022 course ECE-GY 9143 High Performance Machine Learning, maintained by Jingxuan Wang and Yuchen Kou. We compare the performance of ResNet trained on the CIFAR-10 dataset under different numeric precisions.

Content in this repository

Environment

Usage

(how to execute the code)

Example on HPC

Code Structure

Results and Observation

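| Precision | Training Time | Model Size | Bandwidth |
|-----------|---------------|------------|-----------|
| fp16      | ----          | ----       | ----      |
| TF32      | ----          | ----       | ----      |
| mp        | ----          | ----       | ----      |
| fp32      | ----          | ----       | ----      |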
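As a rough guide to what the Model Size column measures, the sketch below estimates weight storage from the bytes each format uses per value. It is an illustration under stated assumptions, not code from this repository: `NUM_PARAMS` is a hypothetical parameter count, and `weights_size_mib` is a helper name introduced here. TF32 is a Tensor Core compute mode, so tensors stay in 32-bit storage. The mp row is left out because its footprint depends on which copies of the weights a given setup keeps.

```python
import struct

# Bytes needed to store one value in each format.
# struct codes: "e" = IEEE 754 half precision, "f" = IEEE 754 single precision.
BYTES_PER_VALUE = {
    "fp16": struct.calcsize("<e"),
    "fp32": struct.calcsize("<f"),
    # TF32 only changes how Tensor Cores multiply; values are stored as fp32.
    "TF32": struct.calcsize("<f"),
}


def weights_size_mib(num_params: int, precision: str) -> float:
    """Estimated size of the raw weights in MiB for the given precision."""
    return num_params * BYTES_PER_VALUE[precision] / 2**20


# Hypothetical parameter count, chosen only for illustration.
NUM_PARAMS = 1_000_000

for precision in ("fp16", "TF32", "fp32"):
    size = weights_size_mib(NUM_PARAMS, precision)
    print(f"{precision}: {size:.2f} MiB")
```

Under these assumptions, fp16 weights take half the space of fp32 weights, while TF32 gives no storage saving.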

Challenges we met

  • When we started this project, the lectures had not yet covered the relevant material, and the underlying principles were unclear to us. We spent a lot of time reading the NVIDIA manual, related papers, and blogs to learn how the different precisions might perform and how they differ.
  • We planned to run training on multiple machines in parallel under each precision, and we were not sure whether we could accomplish this.
  • We are not sure whether the experimental results will be consistent with our predictions.
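The numeric differences between the precisions mentioned above can be seen without a GPU. The following is a minimal sketch using only the Python standard library, not code from this repository. The helper names are introduced here, and `round_to_tf32` is an approximation that truncates an fp32 value to 10 mantissa bits; real hardware may round differently.

```python
import struct


def round_to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_to_fp32(x: float) -> float:
    """Round x to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def round_to_tf32(x: float) -> float:
    """Approximate TF32: fp32 range (8 exponent bits), 10 mantissa bits.

    Assumption: the low 13 mantissa bits are truncated, not rounded.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # clear the low 13 of the 23 fp32 mantissa bits
    return struct.unpack("<I".replace("I", "f"), struct.pack("<I", bits))[0]


# 1) Resolution: fp16 and TF32 both keep 10 mantissa bits, so a small
#    increment next to 1.0 is lost, while fp32 keeps it.
x = 1.0 + 2.0**-12
print("fp16:", round_to_fp16(x))
print("TF32:", round_to_tf32(x))
print("fp32:", round_to_fp32(x))

# 2) Range: fp16 overflows just above 65504, TF32 keeps the fp32 range.
try:
    round_to_fp16(70000.0)
except OverflowError as err:
    print("fp16 overflow:", err)
print("TF32:", round_to_tf32(70000.0))

# 3) Underflow: a tiny gradient-like value becomes zero in fp16.
#    Scaling it up first keeps it representable, which is the idea
#    behind loss scaling in mixed-precision training.
tiny = 1e-8
print("fp16 unscaled:", round_to_fp16(tiny))
print("fp16 scaled by 1024:", round_to_fp16(tiny * 1024))
```

The three cases show why the precisions are not interchangeable: fp16 trades both resolution and range for size, TF32 trades only resolution, and mixed precision relies on scaling to work around fp16 underflow.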
