Paper title:

Beyond the Memory Wall: A Case for Memory-centric HPC System for Deep Learning

Publication:

MICRO’19

Problem to solve

As researchers seek to deploy deeper and larger DNN topologies, however, end-users run up against a memory "capacity" wall, where the limited on-device physical memory constrains the algorithms that can be trained. Current trends point to an urgent need for a system architectural solution that satisfies the dual requirements of (a) fast inter-device communication for parallel training, and (b) high-performance memory virtualization over a large memory pool, so that memory-hungry DNNs can be trained on accelerator devices.
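To make the capacity wall concrete, the sketch below gives a rough, back-of-envelope estimate of the per-device memory needed for training; the parameter count, activation count, bytes per element, and optimizer-state count are illustrative assumptions, not figures from the paper.

```python
# Back-of-envelope estimate of training memory footprint for a large DNN.
# All numbers are illustrative assumptions, not measurements from the paper.

def training_footprint_gb(num_params, activation_elems, bytes_per_elem=4,
                          optimizer_states_per_param=2):
    """Rough memory needed on one device for training (data-parallel case)."""
    weights = num_params * bytes_per_elem
    gradients = num_params * bytes_per_elem
    optimizer = num_params * bytes_per_elem * optimizer_states_per_param  # e.g. Adam moments
    activations = activation_elems * bytes_per_elem  # feature maps stashed for backprop
    return (weights + gradients + optimizer + activations) / 1e9

# Hypothetical model: 1B parameters, 5B activation values per mini-batch.
print(f"{training_footprint_gb(1e9, 5e9):.1f} GB")  # ~36 GB, already beyond a 32 GB GPU
```

Even with modest assumed numbers, the footprint exceeds the physical memory of a single accelerator, which is the motivation for pooling memory across devices.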

Major contribution

This work first highlights the importance of device-side interconnects in training scaled-up DL algorithms, presenting a quantitative analysis of parallel training in the context of HPC systems with multiple accelerator (GPU/TPU) devices.

This work identifies key system-level performance bottlenecks in conventional device-centric deep learning acceleration (DC-DLA) architectures and motivates the need for a new system architecture that balances fast communication and user productivity in training large DNN algorithms.

This work proposes and evaluates a system architecture called memory-centric deep learning acceleration (MC-DLA) that provides transparent memory-capacity expansion while also enabling fast inter-device communication. Compared to DC-DLA designs, MC-DLA achieves an average 2.8× performance improvement while expanding the system-wide memory capacity exposed to the accelerators to tens of TBs.