# Week 3. High-Performance Modeling
---

## Distributed Training

---
- At first, training models is quick and easy
- Training becomes more time-consuming as datasets and models grow
- Longer training runs and more epochs make single-device training less efficient
- Distributed training approaches address this

### Types of distributed training
- **Data parallelism:** the model is replicated across different accelerators (GPU/TPU) and the data is split between them

- **Model parallelism:** when a model is too large to fit on a single device, it can be divided into partitions, with different partitions assigned to different accelerators

<img src = "https://i.gyazo.com/f50ba65245767fa1937dd9acfa2ee378.png">

- Each worker independently computes the errors between its predictions for its training samples and the labeled data. 
- Then each worker performs backprop to update its model based on the errors, and communicates all of its changes to the other workers so that they can update their models. 
- This means that the workers need to synchronize their gradients at the end of each batch to ensure that they are training a consistent model.
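
The synchronization step described above can be illustrated with a minimal pure-Python sketch. It assumes a toy one-parameter linear model, two equally sized shards, and a plain average standing in for the all-reduce; the names `local_gradient` and `data_parallel_step` are hypothetical and not part of any library.

```python
def local_gradient(w, shard):
    """Mean gradient of squared error for the model y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def data_parallel_step(w, shards, lr):
    # Each worker computes a gradient on its own shard, independently.
    grads = [local_gradient(w, shard) for shard in shards]
    # Stand-in for all-reduce: average the gradients so that every
    # replica applies the same update and the models stay consistent.
    avg_grad = sum(grads) / len(grads)
    return w - lr * avg_grad


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # y = 2 * x
shards = [data[:2], data[2:]]  # two hypothetical workers, equal shard sizes

w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards, lr=0.05)
```

With equally sized shards, averaging the per-worker gradients gives the same update as a single worker processing the whole batch, which is why the replicas remain identical after every step.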

<img src = "https://i.gyazo.com/78520849bffd29aea556fde68c287a78.png">

### Making your models distribute-aware

If you want to distribute a model:
- High-level APIs such as Keras and Estimators support distribution
- For more control, you can use custom training loops

<img src = "https://i.gyazo.com/8e205796c381207b7cf04c25886989fa.png">

<img src = "https://i.gyazo.com/f3712d1e09757024b3179afbeeb45f84.png">

## High-Performance Ingestion
---
- Accelerators are a key part of high-performance modeling, training, and inference, but accelerators are also expensive, so it's important to use them efficiently. 

### Why input pipelines?
- Data sometimes can't fit into memory, and CPUs are sometimes under-utilized in compute-intensive tasks like training a complex model
- Input pipelines help you avoid these inefficiencies so that you can make the most of the available hardware

### tf.data: TensorFlow Input Pipeline
- You can view input pipelines as an ETL (extract, transform, load) process, which provides a framework for applying performance optimizations.

<img src = "https://i.gyazo.com/1c40c596478f3d4f3ceadff2a7502d99.png">

### Inefficient ETL process

<img src = "https://i.gyazo.com/ff6dc4052d26825a5e6e4f07c7f5f146.png">

### An improved ETL process

<img src = "https://i.gyazo.com/0d1d5a4c6a29b695e436f56029239150.png">

- The disk and CPU spend less time idle
- The accelerator is kept 100% utilized

### Pipelining

<img src = "https://i.gyazo.com/2711da8b903a692965827da2fbdb4843.png">

### How to optimize pipeline performance?
- Prefetching: begin loading data for the next step before the current step completes
- Parallelize data extraction and transformation
- Caching: cache the dataset so that training can start immediately when a new epoch begins; this is very effective when you have enough cache (memory)
- Reduce memory usage
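
Prefetching from the list above can be sketched with the standard library alone. This is an illustration of the idea, not how `tf.data` implements it (there the equivalent is `Dataset.prefetch`); `prefetch` and `load_batches` are hypothetical helpers, and error handling in the producer thread is left out for brevity.

```python
import queue
import threading


def prefetch(iterable, buffer_size=2):
    """Yield items from `iterable` while a background thread loads ahead."""
    buf = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for item in iterable:
            buf.put(item)  # blocks while the buffer is full
        buf.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()
        if item is done:
            return
        yield item


def load_batches(n):
    # Stand-in for slow extract/transform work (disk reads, decoding, ...).
    for i in range(n):
        yield [i, i + 1]


batches = list(prefetch(load_batches(4)))
```

While the consumer (the training step) works on one batch, the producer thread is already filling the buffer with the next ones, so the two overlap instead of alternating.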

### Parallelize data transformation
<img src = "https://i.gyazo.com/c38d10057a4b7b6cc9a6c8297dc177e0.png">
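
A rough stdlib analogue of a parallel transformation step is shown below. It assumes a hypothetical per-record `preprocess` function; in `tf.data` the corresponding knob is the `num_parallel_calls` argument of `Dataset.map`.

```python
from concurrent.futures import ThreadPoolExecutor


def preprocess(record):
    # Hypothetical per-element transform (stand-in for decoding/augmentation).
    return record * 2


def parallel_map(fn, records, num_workers=4):
    # Executor.map applies fn concurrently but returns results in input order.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(fn, records))


transformed = parallel_map(preprocess, range(5))
```

In CPython, threads mainly help when the transform is I/O-bound; a CPU-bound transform would more likely need processes or a framework-level implementation.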


## Training Large Models - The Rise of Giant Neural Nets and Parallelism
---

## Overcoming memory constraints
---

- Strategy #1 - Gradient accumulation: split each batch into mini-batches, accumulate the gradients from each one, and apply the weight update only after the whole batch has been processed

- Strategy #2 - Memory swap: copy activations back and forth between accelerator memory and CPU memory
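
Gradient accumulation can be checked on a toy problem: the update computed from accumulated mini-batch gradients should match the update from the full batch. The sketch below assumes a one-parameter linear model with squared error; `grad_sum` and `accumulated_step` are hypothetical names.

```python
def grad_sum(w, examples):
    """Summed gradient of squared error for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in examples)


def accumulated_step(w, batch, mini_batch_size, lr):
    accumulated = 0.0
    for start in range(0, len(batch), mini_batch_size):
        mini_batch = batch[start:start + mini_batch_size]
        # Gradients are computed per mini-batch; only their running sum
        # is kept, so peak memory depends on the mini-batch size.
        accumulated += grad_sum(w, mini_batch)
    # A single weight update once the whole batch has been seen.
    return w - lr * accumulated / len(batch)


batch = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]
w_accumulated = accumulated_step(0.0, batch, mini_batch_size=2, lr=0.01)
w_full_batch = accumulated_step(0.0, batch, mini_batch_size=len(batch), lr=0.01)
```

Summing the gradients and dividing by the full batch size at the end keeps the result correct even when the last mini-batch is smaller than the others.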

### Parallelism revisited

<img src = "https://i.gyazo.com/bc5c3d216270e6a492fe1e83670faa53.png">

### Challenges keeping accelerators busy

<img src = "https://i.gyazo.com/3e9807f42850590dd36e6eb21bee5f7a.png" width = "400px">

## Pipeline parallelism

<img src = "https://i.gyazo.com/cba36f24a34fc1d6ba44b84537c40c46.png">

- GPipe
- PipeDream

Integrates both data and model parallelism:
- Divide mini-batch data into micro-batches
- Different workers work on different micro-batches in parallel
- Allow ML models to have significantly more parameters
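
The effect of micro-batching on accelerator utilization can be estimated with a small simulation. This is a simplified model, not GPipe's or PipeDream's actual scheduler: it assumes forward passes only, one clock tick per stage per micro-batch, and equal-cost stages. `forward_schedule` and `bubble_fraction` are hypothetical helpers.

```python
def forward_schedule(num_stages, num_micro_batches):
    """Map each clock tick to the (stage, micro_batch) pairs running on it."""
    ticks = {}
    for stage in range(num_stages):
        for mb in range(num_micro_batches):
            # A stage can start a micro-batch only after the previous
            # stage has finished it, hence the offset of one tick per stage.
            ticks.setdefault(stage + mb, []).append((stage, mb))
    return ticks


def bubble_fraction(num_stages, num_micro_batches):
    # Share of stage-time slots left idle while the pipeline fills and drains.
    total_ticks = num_stages + num_micro_batches - 1
    return (num_stages - 1) / total_ticks


schedule = forward_schedule(num_stages=4, num_micro_batches=8)
idle_with_micro_batches = bubble_fraction(4, 8)
idle_without = bubble_fraction(4, 1)
```

Under these assumptions, four stages processing a single undivided batch leave 75% of the slots idle, while splitting the batch into eight micro-batches lowers that to 3/11; more micro-batches shrink the idle share further.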

### GPipe - Key features
- Open-source TensorFlow library (using Lingvo)
- Inserts communication primitives at the partition boundaries
- Automatic parallelism to reduce memory consumption
- Gradient accumulation across micro-batches, so that model quality is preserved
- Partitioning is heuristic-based
