





# Exercise 4: Implementing a New Training Loop



## **Taekyung Heo**

Postdoctoral Fellow, School of ECE, Georgia Institute of Technology, taekyung@gatech.edu

Acknowledgments: William Won (GT), Srinivas Sridharan (Facebook), Sudarshan Srinivasan (Intel)



| Time (EDT)    | Topic                                                        | Presenter                    |
|---------------|--------------------------------------------------------------|------------------------------|
| 8:30 – 9:30   | Introduction to Distributed Deep Learning Training Platforms | Tushar Krishna               |
| 9:30 – 10:30  | ASTRA-sim                                                    | Saeed Rashidi                |
| 10:30 – 11:00 | Coffee Break                                                 |                              |
| 11:00 – 11:50 | Demo and Exercises                                           | William Won and Taekyung Heo |
| 11:50 – 12:00 | Extensions and Future Development                            | Taekyung Heo                 |

#### **Tutorial Website**

Includes the agenda, slides, and ASTRA-sim installation instructions (from source or via Docker image): <https://astra-sim.github.io/tutorials/isca-2022>

**Attention:** Tutorial is being recorded

## **Training Loops**



- Training loop determines the behavior of a workload
  - Parallelization strategy
  - Computation order
  - Communication order
- Supported training loops
  - Data parallel (goal of this exercise: tweak this loop)
  - Model parallel
  - DLRM
  - Transformer
- You can implement a new training loop to support other models

## Vanilla Data-parallel Training Loop (FWD)



#### Vanilla Data-parallel Training Schedule



## Vanilla Data-parallel Training Loop (BWD)



Flow per layer: (1) compute the weight gradient -> (2) issue the weight-gradient communication -> (3) compute the input gradient -> (4) go to the previous layer
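The per-layer flow above can be written out as a plain loop. This is a minimal sketch, not ASTRA-sim code: `vanilla_backward_steps` and the step labels are hypothetical names chosen for illustration. It assumes that layer 0 skips the input-gradient step, which matches the tutorial trace (there is no `BWD_IG[0]` line).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not part of ASTRA-sim): lists the backward-pass steps
// of the vanilla data-parallel loop for a model with num_layers layers.
std::vector<std::string> vanilla_backward_steps(int num_layers) {
  assert(num_layers > 0);
  std::vector<std::string> steps;
  // The backward pass walks from the last layer down to layer 0.
  for (int layer = num_layers - 1; layer >= 0; --layer) {
    const std::string id = "[" + std::to_string(layer) + "]";
    steps.push_back("WG" + id);       // 1. compute the weight gradient
    steps.push_back("WG_COMM" + id);  // 2. issue the weight-gradient communication
    if (layer > 0) {
      steps.push_back("IG" + id);     // 3. compute the input gradient (assumed skipped for layer 0)
    }
    // 4. the loop decrement moves on to the previous layer
  }
  return steps;
}
```

Calling `vanilla_backward_steps(2)` yields `WG[1]`, `WG_COMM[1]`, `IG[1]`, `WG[0]`, `WG_COMM[0]`.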



### **Vanilla Data-parallel Training Schedule**



## Vanilla Data-parallel Training Loop

## Vanilla Data-parallel Training Schedule



### **FSM Diagram**



## Exercise: Reorder Data-parallel Training Loop

### **Reordered Data-parallel Training Loop**



### **FSM Diagram**



## Adding a New Training Loop

- See astra-sim/workload/Workload.cc
- Vanilla data-parallel loop is implemented in iterate\_data\_parallel()
- Add a reordered version, iterate\_data\_parallel\_reorder()

```
void Workload::call(EventType event, CallData* data) {
  // A pending delay is still counting down: wait and return.
  if (counter > 0) {
    generator->try_register_event(
        this, EventType::Workload_Wait, NULL, counter);
    return;
  }
  // Dispatch to the training loop selected by the parallelism policy.
  if (parallelismPolicy == ParallelismPolicy::Data) {
    iterate_data_parallel();
  } else if (parallelismPolicy == ParallelismPolicy::DataReorder) {
    iterate_data_parallel_reorder();  // new: reordered data-parallel loop
  } else if (parallelismPolicy == ParallelismPolicy::Transformer) {
    iterate_hybrid_parallel_Transformer();
  // ... remaining policies are not shown on the slide
```
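Dispatching on `ParallelismPolicy::DataReorder` implies that the policy enum gains a new value and that something selects it from the workload description. The slides do not show that part, so the following is only a sketch under stated assumptions: the enum is a cut-down stand-in, and `parse_policy` and the strings `"DATA"`, `"DATA_REORDER"`, and `"TRANSFORMER"` are hypothetical, not the simulator's actual names.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the simulator's policy enum; only the three
// values visible in the snippet above are listed.
enum class ParallelismPolicy { Data, DataReorder, Transformer };

// Hypothetical helper: maps a policy name to the enum value.
// The accepted strings are illustrative assumptions.
ParallelismPolicy parse_policy(const std::string& name) {
  if (name == "DATA") return ParallelismPolicy::Data;
  if (name == "DATA_REORDER") return ParallelismPolicy::DataReorder;
  if (name == "TRANSFORMER") return ParallelismPolicy::Transformer;
  throw std::invalid_argument("unknown parallelism policy: " + name);
}
```

Rejecting unknown names with an exception keeps a typo in the workload description from silently falling back to another loop.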

## Vanilla Training Loop (iterate\_data\_parallel)

```
void Workload::iterate_data_parallel() {
  assert(index >= 0);
  assert(index < SIZE);
  check_for_sim_end();
  if (current_state == LoopState::Forward_Pass) {
    // ... 31 lines folded: if (!layers[index]->is_weight_grad_comm_finished_b...
    if (index >= SIZE) {
      // Past the last layer: start the backward pass with its weight gradient.
      current_state = LoopState::Weight_Gradient;
      index--;
    }
    generator->register_event(this, EventType::General, NULL, 1);
    return;
  } else if (current_state == LoopState::Weight_Gradient) {
    // ... 14 lines folded: if (delay_loaded == false) { ...
    if (index == 0) {
      // Layer 0 done: the backward pass is over, begin the next iteration.
      pass_counter++;
      current_state = LoopState::Forward_Pass;
    } else {
      current_state = LoopState::Input_Gradient;
    }
    generator->register_event(this, EventType::General, NULL, 1);
    return;
  } else if (current_state == LoopState::Input_Gradient) {
    // ... 11 lines folded: if (delay_loaded == false) { ...
    delay_loaded = false;
    index--;  // move to the previous layer
    current_state = LoopState::Weight_Gradient;
    generator->register_event(this, EventType::General, NULL, 1);
    return;
  }
}
```

- The training loop is implemented as a finite-state machine (FSM)
- index represents the current layer index
- current\_state holds the current state
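To see how `index` and `current_state` drive the schedule, the transitions of the vanilla loop can be isolated from the simulator. The sketch below is a simplified model, not ASTRA-sim code: `VanillaLoop` and `step()` are hypothetical names, and compute delays, communication, and event registration are deliberately left out so that only the state updates remain.

```cpp
#include <cassert>
#include <string>

enum class LoopState { Forward_Pass, Weight_Gradient, Input_Gradient };

// Simplified model of the vanilla loop (hypothetical type): it keeps only
// the index/current_state updates and drops delays and communication.
struct VanillaLoop {
  int size;       // number of layers (SIZE in the listing above)
  int index = 0;  // current layer
  LoopState current_state = LoopState::Forward_Pass;
  int pass_counter = 0;

  explicit VanillaLoop(int num_layers) : size(num_layers) {
    assert(num_layers > 0);
  }

  // Runs one state and returns a label for the work done in it.
  std::string step() {
    assert(index >= 0 && index < size);
    const std::string id = "[" + std::to_string(index) + "]";
    switch (current_state) {
      case LoopState::Forward_Pass:
        index++;
        if (index >= size) {
          current_state = LoopState::Weight_Gradient;
          index--;
        }
        return "FWD" + id;
      case LoopState::Weight_Gradient:
        if (index == 0) {
          pass_counter++;
          current_state = LoopState::Forward_Pass;
        } else {
          current_state = LoopState::Input_Gradient;
        }
        return "BWD_WG" + id;
      case LoopState::Input_Gradient:
        index--;
        current_state = LoopState::Weight_Gradient;
        return "BWD_IG" + id;
    }
    return "";  // not reached; keeps compilers quiet
  }
};
```

Constructing `VanillaLoop loop(4)` and calling `loop.step()` repeatedly walks the forward pass over layers 0 to 3, then alternates weight gradient and input gradient from layer 3 down to the weight gradient of layer 0.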

#### **Vanilla Data-parallel Training Schedule**



### **FSM Diagram**

(Figure: FSM of the backward pass (BWD), with weight-gradient (WG) and input-gradient (IG) states.)

## Reordered Training Loop (iterate\_data\_parallel\_reorder)

```
void Workload::iterate_data_parallel_reorder() {
  assert(index >= 0);
  assert(index < SIZE);
  check_for_sim_end();
  if (current_state == LoopState::Forward_Pass) {
    // ... 16 lines folded: if (!layers[index]->is_weight_grad_comm_finished_b...
    if (index >= SIZE) {
      // Changed: the backward pass now starts with the input gradient.
      current_state = LoopState::Input_Gradient;
      index--;
    }
    generator->register_event(this, EventType::General, NULL, 1);
    return;
  } else if (current_state == LoopState::Weight_Gradient) {
    // ... 15 lines folded: if (delay_loaded == false) { ...
    if (index > 1) {
      index--;
      current_state = LoopState::Input_Gradient;
    } else if (index == 1) {
      // Layer 0 skips the input gradient: go straight to its weight gradient.
      index--;
      current_state = LoopState::Weight_Gradient;
    } else if (index == 0) {
      pass_counter++;
      current_state = LoopState::Forward_Pass;
    }
    generator->register_event(this, EventType::General, NULL, 1);
    return;
  } else if (current_state == LoopState::Input_Gradient) {
    // ... 11 lines folded: if (delay_loaded == false) { ...
    delay_loaded = false;
    // Changed: stay on the same layer and compute its weight gradient next.
    current_state = LoopState::Weight_Gradient;
    generator->register_event(this, EventType::General, NULL, 1);
    return;
  }
}
```

- You can reorder the computation schedule by changing how index and current\_state are updated
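The effect of those index and state updates can be checked outside the simulator. The following is a simplified model of the reordered transitions, not ASTRA-sim code: `ReorderLoop` and `step()` are hypothetical names, and delays, communication, and event registration are omitted.

```cpp
#include <cassert>
#include <string>

enum class LoopState { Forward_Pass, Weight_Gradient, Input_Gradient };

// Simplified model of the reordered loop (hypothetical type): only the
// index/current_state updates from the listing above are kept.
struct ReorderLoop {
  int size;       // number of layers (SIZE in the listing above)
  int index = 0;  // current layer
  LoopState current_state = LoopState::Forward_Pass;
  int pass_counter = 0;

  explicit ReorderLoop(int num_layers) : size(num_layers) {
    assert(num_layers > 0);
  }

  // Runs one state and returns a label for the work done in it.
  std::string step() {
    assert(index >= 0 && index < size);
    const std::string id = "[" + std::to_string(index) + "]";
    switch (current_state) {
      case LoopState::Forward_Pass:
        index++;
        if (index >= size) {
          current_state = LoopState::Input_Gradient;  // IG comes first now
          index--;
        }
        return "FWD" + id;
      case LoopState::Weight_Gradient:
        if (index > 1) {
          index--;
          current_state = LoopState::Input_Gradient;
        } else if (index == 1) {
          index--;  // layer 0 goes straight to its weight gradient
          current_state = LoopState::Weight_Gradient;
        } else {
          pass_counter++;
          current_state = LoopState::Forward_Pass;
        }
        return "BWD_WG" + id;
      case LoopState::Input_Gradient:
        current_state = LoopState::Weight_Gradient;  // same layer, WG next
        return "BWD_IG" + id;
    }
    return "";  // not reached; keeps compilers quiet
  }
};
```

With four layers, repeated calls to `step()` run the forward pass over layers 0 to 3 and then, for each layer from 3 down to 1, the input gradient followed by the weight gradient, finishing with the weight gradient of layer 0.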

#### **Reordered Data-parallel Training Schedule**





## Adding Debugging Messages

```
void Workload::iterate_data_parallel_reorder() {
  assert(index >= 0);
  assert(index < SIZE);
  check_for_sim_end();
  if (current_state == LoopState::Forward_Pass) {
    // ... 3 lines folded: if (!layers[index]->is_weight_grad_comm_finished_blockir...
    if (delay_loaded == false) {
      counter = layers[index]->get_fwd_pass_compute();
      delay_loaded = true;
      // Print from PE 0 only, once each time the state loads its delay.
      if (generator->id == 0)
        std::cout << "[TUTORIAL] FWD[" << index << "]" << std::endl;
    }
  } else if (current_state == LoopState::Weight_Gradient) {
    if (delay_loaded == false) {
      counter = layers[index]->get_weight_grad_compute();
      delay_loaded = true;
      if (generator->id == 0)
        std::cout << "[TUTORIAL] BWD_WG[" << index << "]" << std::endl;
    }
  } else if (current_state == LoopState::Input_Gradient) {
    if (delay_loaded == false) {
      counter = layers[index]->get_input_grad_compute();
      delay_loaded = true;
      if (generator->id == 0)
        std::cout << "[TUTORIAL] BWD_IG[" << index << "]" << std::endl;
    }
    // ... 9 lines folded: if (counter > 0) { ...
```

- You can add debugging messages to make sure that the training loop works as expected
- Make sure to print debugging messages only when (generator->id == 0)
  - Each processing element is a generator
  - If you don't filter the ID, you will see debugging messages from all PEs
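The ID filter can also be factored into a small helper so that every state prints in the same format. This is a sketch only: `format_tutorial_line` and `tutorial_log` are hypothetical helpers that do not exist in ASTRA-sim, and the PE ID is passed in as a plain integer instead of being read from `generator->id`.

```cpp
#include <cassert>
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical helper: builds a line in the "[TUTORIAL] TAG[index]" format.
std::string format_tutorial_line(const std::string& tag, int layer_index) {
  assert(layer_index >= 0);
  std::ostringstream line;
  line << "[TUTORIAL] " << tag << "[" << layer_index << "]";
  return line.str();
}

// Hypothetical helper: prints only for the PE whose ID is 0, so the trace
// is not repeated once per processing element. Returns true if it printed.
bool tutorial_log(int generator_id, const std::string& tag, int layer_index,
                  std::ostream& out = std::cout) {
  if (generator_id != 0) return false;
  out << format_tutorial_line(tag, layer_index) << std::endl;
  return true;
}
```

Inside a training loop, a call such as `tutorial_log(generator->id, "FWD", index);` would then replace the explicit `if` and `std::cout` pair.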

## Adding Debugging Messages

#### Vanilla Data-parallel Loop

./exercise\_4/exercise\_4\_vanilla.sh | grep TUTORIAL

```
[TUTORIAL] FWD[0]
[TUTORIAL] FWD[1]
[TUTORIAL] FWD[2]
[TUTORIAL] FWD[3]
[TUTORIAL] BWD_WG[3]
[TUTORIAL] BWD_IG[3]
[TUTORIAL] BWD_WG[2]
[TUTORIAL] BWD_IG[2]
[TUTORIAL] BWD_WG[1]
[TUTORIAL] BWD_IG[1]
[TUTORIAL] BWD_WG[0]
....
```

#### **Reordered Data-parallel Loop**

./exercise\_4/exercise\_4\_reorder.sh | grep TUTORIAL

```
[TUTORIAL] FWD[0]
[TUTORIAL] FWD[1]
[TUTORIAL] FWD[2]
[TUTORIAL] FWD[3]
[TUTORIAL] BWD_IG[3]
[TUTORIAL] BWD_WG[3]
[TUTORIAL] BWD_IG[2]
[TUTORIAL] BWD_WG[2]
[TUTORIAL] BWD_IG[1]
[TUTORIAL] BWD_WG[1]
[TUTORIAL] BWD_WG[0]
....
```
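Comparing traces by eye gets tedious for deeper models, so the ordering can be checked programmatically. The sketch below is an assumption-laden illustration: `parse_tutorial_line` and `input_gradient_first` are hypothetical helpers, they expect the `[TUTORIAL] TAG[index]` line format shown above with a numeric index, and they look at the trace of a single iteration.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: splits "[TUTORIAL] BWD_IG[2]" into tag "BWD_IG" and
// layer 2. Returns false for lines that do not follow that format.
bool parse_tutorial_line(const std::string& line, std::string& tag, int& layer) {
  const std::string prefix = "[TUTORIAL] ";
  if (line.compare(0, prefix.size(), prefix) != 0) return false;
  const std::size_t open = line.find('[', prefix.size());
  if (open == std::string::npos) return false;
  const std::size_t close = line.find(']', open);
  if (close == std::string::npos || close == open + 1) return false;
  tag = line.substr(prefix.size(), open - prefix.size());
  layer = std::stoi(line.substr(open + 1, close - open - 1));  // assumes digits
  return true;
}

// Hypothetical check: true if, for every layer that reports both steps,
// BWD_IG appears before BWD_WG (the reordered schedule).
bool input_gradient_first(const std::vector<std::string>& lines) {
  assert(!lines.empty());  // an empty trace usually means grep matched nothing
  std::map<int, std::size_t> ig_pos, wg_pos;
  for (std::size_t i = 0; i < lines.size(); ++i) {
    std::string tag;
    int layer = 0;
    if (!parse_tutorial_line(lines[i], tag, layer)) continue;
    // emplace keeps the first occurrence of each layer.
    if (tag == "BWD_IG") ig_pos.emplace(layer, i);
    if (tag == "BWD_WG") wg_pos.emplace(layer, i);
  }
  for (const auto& entry : ig_pos) {
    const auto wg = wg_pos.find(entry.first);
    if (wg != wg_pos.end() && entry.second > wg->second) return false;
  }
  return true;
}
```

Feeding the lines of one iteration into `input_gradient_first` should return true for the reordered loop and false for the vanilla loop.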


