Mixed-precision training lowers the required resources by using lower-precision arithmetic, which has the following benefits.<br/>
Decrease the required amount of memory. Half-precision floating point format (FP16) uses 16 bits, compared to 32 bits for single precision (FP32). Lowering the required memory enables training of larger models or training with larger minibatches.<br/>
Shorten the training or inference time. Execution time can be sensitive to memory or arithmetic bandwidth. Half-precision halves the number of bytes accessed, thus reducing the time spent in memory-limited layers. NVIDIA GPUs offer up to 8x more half precision arithmetic throughput when compared to single-precision, thus speeding up math-limited layers.

![image.png](attachment:image.png)

Half-precision floating point format consists of 1 sign bit, 5 bits of exponent, and 10 fractional bits. Supported exponent values fall into the [-24, 15] range, which means the format supports non-zero value magnitudes in the [2<sup>-24</sup>, 65,504] range. Since this is narrower than the [2<sup>-149</sup>, ~3.4×10<sup>38</sup>] range supported by single-precision format, training some networks requires extra consideration. This section describes three techniques for successful training of DNNs with half precision: accumulation of FP16 products into FP32; loss scaling; and an FP32 master copy of weights. With these techniques NVIDIA and Baidu Research were able to match single-precision result accuracy for all networks that were trained (Mixed-Precision Training). Note that not all networks require training with all of these techniques.
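
The range limits above can be checked directly. The following is a minimal sketch using only the Python standard library. It assumes that the `"e"` format of the `struct` module (IEEE 754 binary16) is an adequate stand-in for GPU FP16 storage; `to_fp16` is a hypothetical helper name introduced here for illustration.

```python
import math
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest FP16 value (hypothetical helper)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses values that round beyond the FP16 maximum;
        # an infinity is returned here to mimic hardware behavior.
        return math.copysign(math.inf, x)

largest = to_fp16(65504.0)       # largest finite FP16 value, stored exactly
smallest = to_fp16(2.0 ** -24)   # smallest positive FP16 value, stored exactly
underflow = to_fp16(2.0 ** -26)  # below the FP16 range: becomes zero
overflow = to_fp16(1.0e5)        # above the FP16 range: becomes infinity

print(largest, smallest, underflow, overflow)
```

The same helper makes it easy to probe where a particular gradient value would land before committing to FP16 storage.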

Accumulation into FP32 (matrix additions are performed in FP32)<br/>
The NVIDIA Volta GPU architecture introduces Tensor Core instructions, which multiply half-precision matrices, accumulating the result into either single- or half-precision output. We found that accumulation into single precision is critical to achieving good training results. Accumulated values are converted to half precision before writing to memory. The cuDNN and cuBLAS libraries provide a variety of functions that rely on Tensor Cores for arithmetic.
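
To see why the accumulator precision matters, consider a scalar simulation of a dot product with FP16 inputs. This is a sketch under stated assumptions, not a model of Tensor Core hardware: rounding is emulated with the standard-library `struct` module, and `to_fp16`, `to_fp32`, and `dot` are hypothetical helper names.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to FP16 (inputs here stay well inside the FP16 range)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x: float) -> float:
    """Round x to FP32."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def dot(a, b, round_acc):
    """Dot product of FP16 inputs; round_acc sets the accumulator precision."""
    acc = 0.0
    for x, y in zip(a, b):
        product = to_fp16(x) * to_fp16(y)
        acc = round_acc(acc + product)  # round after every accumulation step
    return acc

a = [1.0] * 4096
b = [1.0] * 4096

fp16_acc = dot(a, b, to_fp16)           # accumulator kept in FP16
fp32_acc = to_fp16(dot(a, b, to_fp32))  # FP32 accumulator, converted to FP16 at the end

print(fp16_acc, fp32_acc)
```

The FP16 accumulator stalls at 2048: neighboring FP16 values are 2 apart at that magnitude, so adding 1 rounds back to 2048. The FP32 accumulator reaches 4096, which is then stored exactly in FP16.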

Loss Scaling<br/>
There are four types of tensors encountered when training DNNs: activations, activation gradients, weights, and weight gradients. In our experience activations, weights, and weight gradients fall within the range of value magnitudes representable in half precision. However, for some networks small-magnitude activation gradients fall below half-precision range. As an example, consider the histogram of activation gradients encountered when training the Multibox SSD detection network in Figure 2, which shows the percentage of values on a log<sub>2</sub> scale. Values smaller than 2<sup>-24</sup> become zeros in half-precision format.

Note that most of the half-precision range is not used by activation gradients, which tend to be small values with magnitudes below 1. Thus, we can “shift” the activation gradients into FP16-representable range by multiplying them by a scale factor S. In the case of the SSD network it was sufficient to multiply the gradients by 8. This suggests that activation gradient values with magnitudes below 2<sup>-27</sup> were not relevant to training of this network, whereas it was important to preserve values in the [2<sup>-27</sup>, 2<sup>-24</sup>) range.
![image.png](attachment:image.png)

A very efficient way to ensure that gradients fall into the range representable by half precision is to multiply the training loss with the scale factor. This adds just a single multiplication and by the chain rule it ensures that all the gradients are scaled up (or shifted up) at no additional cost. Loss scaling ensures that relevant gradient values lost to zeros are recovered. Weight gradients need to be scaled down by the same factor S before the weight update. The scale-down operation could be fused with the weight update itself (resulting in no extra memory accesses) or carried out separately. For more details see the Training with Mixed Precision User Guide and Mixed-Precision Training paper.
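
The effect of scaling can be illustrated on a single layer `y = w * x`, where the weight gradient is the activation gradient times `x`. The sketch below is an illustration only: FP16 storage is emulated with the standard-library `struct` module, the values are made up, and `to_fp16` and `weight_grad` are hypothetical names.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to FP16 (inputs here stay well inside the FP16 range)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def weight_grad(act_grad: float, x: float, scale: float) -> float:
    """Weight gradient for y = w * x when the activation gradient is stored in FP16."""
    scaled_act_grad = to_fp16(act_grad * scale)  # scaling the loss scales this by the same factor
    scaled_w_grad = scaled_act_grad * x          # chain rule: dL/dw = dL/dy * x
    return scaled_w_grad / scale                 # scale down in higher precision before the update

act_grad = 2.0 ** -26  # too small to be stored in FP16
x = 1.0

without_scaling = weight_grad(act_grad, x, scale=1.0)
with_scaling = weight_grad(act_grad, x, scale=8.0)

print(without_scaling, with_scaling)
```

Without scaling the gradient is lost to zero. With S = 8 the stored value is 2<sup>-23</sup>, and dividing by S afterwards recovers the original 2<sup>-26</sup>.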

The procedure described in the previous section requires you to pick a loss scaling factor to adjust the gradient magnitudes. There is no downside to choosing a large scaling factor as long as it doesn’t cause overflow during backpropagation, which would lead to weight gradients containing infinities or NaNs, which in turn would irreversibly damage the weights during the update. These overflows can be easily and efficiently detected by inspecting the computed weight gradients, for example, when multiplying the weight gradients by 1/S (the scale-down step described in the previous section). One option is to skip the weight update when an overflow is detected and simply move on to the next iteration.<br/>
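
A minimal sketch of this check, assuming the weight gradients are available as a flat list of Python floats (real frameworks operate on tensors), could look as follows. `has_overflow` and `unscale_or_skip` are hypothetical names.

```python
import math

def has_overflow(grads) -> bool:
    """True if any gradient is an infinity or a NaN."""
    return any(not math.isfinite(g) for g in grads)

def unscale_or_skip(scaled_grads, scale: float):
    """Multiply gradients by 1/S; return None if the update should be skipped."""
    inv_scale = 1.0 / scale
    unscaled = [g * inv_scale for g in scaled_grads]
    # Infinities and NaNs survive the multiplication, so checking here is enough.
    if has_overflow(unscaled):
        return None
    return unscaled

ok = unscale_or_skip([256.0, -64.0], scale=128.0)
skipped = unscale_or_skip([256.0, math.inf], scale=128.0)

print(ok, skipped)
```

Returning `None` leaves the decision to the caller, which can skip the weight update and continue with the next iteration.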

There are several options to choose the loss scaling factor. The simplest one is to pick a constant scaling factor. We trained a number of feed-forward and recurrent networks with Tensor Core math for various tasks with scaling factors ranging from 8 to 32K (many networks did not require a scaling factor), matching the network accuracy achieved by training in FP32. However, since the minimum required scaling factor can depend on the network, framework, minibatch size, etc., some trial and error may be required when picking a scaling value. A constant scaling factor can be chosen more directly if gradient statistics are available. Choose a value so that its product with the maximum absolute gradient value is below 65,504 (the maximum value representable in FP16).
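
When gradient statistics are available, the rule above can be applied mechanically. The sketch below picks the largest power of two that keeps the scaled maximum below 65,504. Restricting the factor to powers of two is an assumption made here for convenience, not a requirement from the text, and `max_safe_scale` is a hypothetical name.

```python
FP16_MAX = 65504.0

def max_safe_scale(max_abs_grad: float) -> float:
    """Largest power-of-two S with S * max_abs_grad < FP16_MAX (at least 1)."""
    if not max_abs_grad > 0.0:
        raise ValueError("max_abs_grad must be positive")
    scale = 1.0
    # Keep doubling while the next candidate still stays below the FP16 maximum.
    while scale * 2.0 * max_abs_grad < FP16_MAX:
        scale *= 2.0
    return scale

scale = max_safe_scale(3.0)
print(scale, scale * 3.0)
```

In practice it may be prudent to leave some headroom below this bound, since the observed maximum can grow during training.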
<br/>
A more robust approach is to choose the loss scaling factor dynamically. The basic idea is to start with a large scaling factor and then reconsider it in each training iteration. If no overflow occurs for a chosen number of iterations N, increase the scaling factor. If an overflow occurs, skip the weight update and decrease the scaling factor. We found that as long as one skips updates infrequently the training schedule does not have to be adjusted to reach the same accuracy as FP32 training. Note that N effectively limits how frequently we may overflow and skip updates. The rate of scaling factor updates can be adjusted by picking the increase/decrease multipliers as well as N, the number of non-overflow iterations before the increase. We successfully trained networks with N = 2000, multiplying the scaling factor by 2 to increase it and by 0.5 to decrease it; many other settings are valid as well. The dynamic loss-scaling approach leads to the following high-level training procedure:<br/>
1. Maintain a primary copy of weights in FP32.<br/>
2. Initialize S to a large value.<br/>
3. For each iteration:<br/>
a. Make an FP16 copy of the weights.<br/>
b. Forward propagation (FP16 weights and activations).<br/>
c. Multiply the resulting loss with the scaling factor S.<br/>
d. Backward propagation (FP16 weights, activations, and their gradients).<br/>
e. If there is an Inf or NaN in weight gradients:<br/>
i. Reduce S.<br/>
ii. Skip the weight update and move to the next iteration.<br/>
f. Multiply the weight gradient with 1/S.<br/>
g. Complete the weight update (including gradient clipping, etc.).<br/>
h. If there hasn’t been an Inf or NaN in the last N iterations, increase S.<br/>
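
The bookkeeping for S in steps e, f, and h can be sketched as a small class. This is an illustration under stated assumptions rather than any framework's implementation: gradients are a flat list of Python floats, the forward and backward passes are left out, and the class, method, and parameter names are hypothetical. The defaults follow the settings mentioned above (N = 2000, multipliers 2 and 0.5), while the initial scale is an arbitrary choice.

```python
import math

class DynamicLossScaler:
    def __init__(self, init_scale=2.0 ** 15, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval  # N in the procedure
        self._good_steps = 0                    # iterations since the last overflow

    def unscale(self, scaled_grads):
        """Steps e and f: return gradients times 1/S, or None to skip the update."""
        if any(not math.isfinite(g) for g in scaled_grads):
            self.scale *= self.backoff_factor   # step e.i: reduce S
            self._good_steps = 0
            return None                         # step e.ii: caller skips the update
        return [g / self.scale for g in scaled_grads]

    def update_done(self):
        """Step h: call after a completed weight update."""
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            self.scale *= self.growth_factor
            self._good_steps = 0

# Tiny walk-through with N = 2 so that the growth step is visible.
scaler = DynamicLossScaler(init_scale=1024.0, growth_interval=2)

skipped = scaler.unscale([math.inf, 1.0])  # overflow: S is halved, update skipped
scale_after_overflow = scaler.scale

for _ in range(2):                         # two clean iterations in a row
    grads = scaler.unscale([512.0, -256.0])
    scaler.update_done()                   # the weight update would happen before this call

print(scale_after_overflow, grads, scaler.scale)
```

Keeping the counter separate from the unscale step lets the caller perform the weight update (step g) between the two calls, matching the order of the procedure.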