
Commit

v0.2 update
Raul Puri committed Mar 13, 2018
1 parent bf1dd3a commit c978691
Showing 85 changed files with 5,396 additions and 4,539 deletions.
299 changes: 89 additions & 210 deletions README.md

Large diffs are not rendered by default.

14 changes: 14 additions & 0 deletions analysis/reproduction.md
@@ -29,6 +29,20 @@ It took several cycles of trial and error to come up with a result comparable to
* **Processing Speed**: With our hardware and batch size, we achieved a wall-time processing speed of 76k characters/second, compared to OpenAI's 12.5k characters/second.
* **Processing Time**: It took approximately 5 days to train on the million samples used in the paper and 6.5 days to train on a full epoch of the Amazon dataset.

## FP16 Training
Training our models with FP16 arithmetic has proven critical for improving the turnaround time of our experiments and the speed of ideation. In addition to the faster arithmetic, FP16 training lets us use a 2x larger batch size with no significant slowdown per batch, which provides an additional 2x training speedup on top of the faster arithmetic.

However, as is often the case with reduced-precision training, convergence and numeric instability (FP16's limited dynamic range) are concerns.

To address numeric instability during training, we used several techniques that we deemed necessary (a minimal sketch combining them follows this list):
* **Dynamic loss scaling**: We used [dynamic loss scaling](https://arxiv.org/abs/1710.03740), which iteratively searches for a scaling constant for the gradients so that small gradients do not underflow and important numerical information is not lost during training.
* **FP32 Master Params**: We keep an FP32 copy of our FP16 parameters for accumulating gradient updates. This adds minimal compute overhead, as elementwise additions are relatively fast in FP32. Furthermore, the FP32 parameters can be kept on the CPU so that they do not consume additional GPU memory.
* **FP32 Softmax CrossEntropy**: To overcome the harsh exponentiation and numerical instability of the softmax operation, we found it necessary to cast the logits to FP32 before calculating our loss. (The final linear multiplication that produces these logits is still performed in FP16.)
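
The sketch below is illustrative only, not the repository's actual implementation: `DynamicLossScaler`, `train_step`, and the parameter lists are invented names, and the optimizer is assumed to have been constructed over the FP32 master parameters. It shows how dynamic loss scaling, FP32 master weights, and an FP32 loss can fit together in one training step.

```python
import torch
import torch.nn.functional as F


class DynamicLossScaler:
    """Illustrative dynamic loss scaler: shrink the scale whenever an
    overflow (inf/nan gradient) is seen, grow it after a run of good steps."""
    def __init__(self, init_scale=2.**15, growth_interval=1000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self._good_steps = 0

    def has_overflow(self, params):
        return any(p.grad is not None and not torch.isfinite(p.grad).all()
                   for p in params)

    def update(self, overflow):
        if overflow:
            self.scale /= 2.0
            self._good_steps = 0
        else:
            self._good_steps += 1
            if self._good_steps % self.growth_interval == 0:
                self.scale *= 2.0


def train_step(model_fp16, fp16_params, master_params_fp32, optimizer, scaler,
               inputs, targets):
    # Forward pass in FP16; cast the logits to FP32 before the loss.
    logits = model_fp16(inputs)                      # final matmul in FP16
    loss = F.cross_entropy(logits.float(), targets)

    # Scale the loss so small FP16 gradients do not underflow in backward.
    model_fp16.zero_grad()
    (loss * scaler.scale).backward()

    overflow = scaler.has_overflow(fp16_params)
    if not overflow:
        # Copy FP16 grads into the FP32 master copy and unscale them.
        for p16, p32 in zip(fp16_params, master_params_fp32):
            if p16.grad is not None:
                p32.grad = p16.grad.float() / scaler.scale
        optimizer.step()                             # update FP32 master weights
        # Copy the updated FP32 master weights back into the FP16 model.
        for p16, p32 in zip(fp16_params, master_params_fp32):
            p16.data.copy_(p32.data)
    scaler.update(overflow)
    return loss.item(), overflow
```

In current PyTorch releases, `torch.cuda.amp.GradScaler` provides equivalent dynamic loss scaling out of the box; the explicit version above is only meant to make the mechanics visible.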

We also established other best practices that did not directly affect our results but address other possible sources of numeric instability that may arise during training (see the sketch after this list):
* **FP32 loss reduction**: Any reduction/averaging over more than ~65k terms can exceed FP16's dynamic range (its maximum representable value is roughly 65,504), so we convert loss scalars to FP32 as a relatively inexpensive safety measure.
* **FP32 normalization**: Any L2 norm greater than 256 has a dynamic range problem while being computed, since its sum of squares exceeds FP16's maximum. We therefore developed a FusedNorm kernel that squares and accumulates values in FP32 before returning the final norm in FP16; this kernel is used to perform weight normalization.
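
A plain-PyTorch sketch of the same two ideas, with illustrative helper names (the repository's FusedNorm is a fused CUDA kernel; this only mirrors its numerics): accumulate in FP32, hand back FP16 only for the final result.

```python
import torch


def fp32_mean_loss(per_token_losses_fp16):
    """Average many FP16 loss terms in FP32 so the running sum
    cannot exceed FP16's ~65k maximum before the division."""
    return per_token_losses_fp16.float().mean()


def fp16_safe_l2_norm(v_fp16):
    """Square and accumulate in FP32, then return the final norm in FP16.
    A fused kernel would do the same accumulation in a single GPU pass."""
    return v_fp16.float().pow(2).sum().sqrt().half()


def weight_norm(v_fp16, g_fp16, eps=1e-5):
    """Weight normalization w = g * v / ||v||, with the norm computed safely."""
    return g_fp16 * (v_fp16 / (fp16_safe_l2_norm(v_fp16) + eps))
```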

## Transfer
We chose to reproduce transfer results on the binary Stanford Sentiment Treebank rather than the IMDB dataset because of its smaller size and faster turnaround time for experiments.

2 changes: 1 addition & 1 deletion analysis/scale.md
@@ -1,5 +1,5 @@
# Data Parallel Scalability
- **These scalability results are for PyTorch 0.2.0, we're working on updated results for 0.3.0. Thanks for your Patience.**
+ **These scalability results are for PyTorch 0.2.0, we're working on updated results for >=0.3.0. Thanks for your Patience.**
Training a model on an Amazon-review-sized dataset is a significant time investment. In order to improve our ability to iterate on experiments, we found it imperative early on to investigate the scalability of data parallelism in PyTorch. The model was trained on Tesla V100s (Volta), Tesla P100s (Pascal), and VCA 6000s (Maxwell), with a batch size of 32 per GPU, in order to benchmark wall-time processing speed (in characters/second) against OpenAI's reported speed of 12.5k characters/second. Four of our Pascal-class GPUs achieved a combined speed of 13.4k characters/second.
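
For reference, a hedged sketch of how such a characters/second figure can be measured with PyTorch's built-in `nn.DataParallel`; the function and variable names are illustrative (not the repository's benchmarking code), and the character-level model is assumed to return per-position logits of shape (batch, seq_len, vocab).

```python
import time
import torch
import torch.nn as nn


def benchmark_chars_per_second(model, batches, n_gpus):
    """Wrap the model in nn.DataParallel across n_gpus and time how many
    characters (tokens) per second a fixed set of batches processes."""
    device = torch.device('cuda')
    model = nn.DataParallel(model, device_ids=list(range(n_gpus))).to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters())

    total_chars, start = 0, time.time()
    for inputs, targets in batches:            # inputs: (batch, seq_len) char ids
        inputs, targets = inputs.to(device), targets.to(device)
        logits = model(inputs)                 # (batch, seq_len, vocab)
        loss = criterion(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_chars += inputs.numel()          # batch * seq_len characters
    torch.cuda.synchronize()
    return total_chars / (time.time() - start)
```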

![scaling graph](../figures/both_scalability.png "(Distributed) Data Parallelism Scalability")
19 changes: 0 additions & 19 deletions cfg/__init__.py

This file was deleted.

39 changes: 0 additions & 39 deletions cfg/config.py

This file was deleted.

161 changes: 0 additions & 161 deletions cfg/configure_data.py

This file was deleted.

38 changes: 0 additions & 38 deletions cfg/configure_devices.py

This file was deleted.

64 changes: 0 additions & 64 deletions cfg/configure_model.py

This file was deleted.
