Why is the gradient scaling factor multiplied before quantization? #59

Guangxuan-Xiao · 2022-08-17T07:58:31Z

Line 81 in ed0d8b1

p.grad.data = self.grad_quant(p.grad.data * self.grad_scaling)

In OptimLP, the gradient is multiplied by the scaling factor before it is quantized. However, gradient scaling is meant to prevent underflow when gradients are quantized to low precision. I think the current implementation cannot prevent that underflow.

Maybe the correct implementation is to multiply by the scaling factor after quantization:

p.grad.data = self.grad_quant(p.grad.data) * self.grad_scaling
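To make the concern concrete, here is a minimal sketch comparing the two orderings. It rests on assumptions that are not confirmed in this issue: that `grad_scaling` is the inverse of a loss-scaling factor (so it is smaller than 1), and that the result of the final multiplication is held in higher precision. `toy_quantize`, `loss_scale`, and the numeric values are hypothetical stand-ins for illustration, not the library's quantizer.

```python
def toy_quantize(x: float, frac_bits: int = 8) -> float:
    """Toy fixed-point quantizer: round x to the nearest multiple of 2**-frac_bits.

    Hypothetical stand-in for self.grad_quant, not the library's implementation.
    """
    step = 2.0 ** -frac_bits
    return round(x / step) * step


loss_scale = 1024.0               # hypothetical loss-scaling factor
grad_scaling = 1.0 / loss_scale   # assumption: grad_scaling undoes the loss scaling
true_grad = 1e-4                  # a gradient smaller than the quantization step
scaled_grad = true_grad * loss_scale  # gradient as computed from the scaled loss

# Current order: unscale first, then quantize the small value.
scale_then_quantize = toy_quantize(scaled_grad * grad_scaling)

# Proposed order: quantize the still-large value, then unscale.
quantize_then_scale = toy_quantize(scaled_grad) * grad_scaling

print("scale, then quantize:", scale_then_quantize)
print("quantize, then scale:", quantize_then_scale)
```

Under these assumptions the first ordering flushes the gradient to zero, while the second keeps a value close to `true_grad`. If `grad_scaling` were instead greater than 1, scaling before quantization would be the ordering that lifts small values above the quantization step.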
