
Version 0.6.0

@jackalcooper released this 07 Jan 06:06
eabe79e

OneFlow v0.6.0 Release Notes

OneFlow has been open source for 528 days since July 31, 2020, and today OneFlow v0.6.0 is out. Welcome to try OneFlow v0.6.0; we would love to hear your feedback!

This version mainly updates three parts: the framework, models, and OneFlow-ONNX. Highlights include:

  • Performance optimizations in static graphs, dynamic graphs, operators, memory occupation, and more
  • A larger number of common operators
  • Improvements in static graphs and ConsistentTensor
  • Serving functionality as Nvidia Triton's backend
  • Richer visual pre-trained models similar to torchvision and timm
  • Better OneFlow-ONNX conversion functionality

The following are the detailed release notes.

Framework

1. Performance Optimization of nn.Graph

  • Compared to v0.5.0, nn.Graph in v0.6.0 delivers a 10% training speedup on models such as ResNet (AMP) and WDL (a minimal nn.Graph sketch follows this list)
    • Optimized nn.Graph's performance in high-frequency iterative training scenarios
    • Redesigned nn.Graph's scheduling instructions and refactored the interaction logic between the Actor Graph and the Eager VM so that the Graph's runtime execution is asynchronous and overlaps with Python-side input/output Tensor handling as much as possible
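
For context, training with nn.Graph means wrapping an eager model and optimizer in a static graph. A minimal sketch (the model, loss, and random data here are illustrative, not from the release):

```python
import oneflow as flow
import oneflow.nn as nn

class TrainGraph(nn.Graph):
    def __init__(self, model, loss_fn, optimizer):
        super().__init__()
        self.model = model
        self.loss_fn = loss_fn
        self.add_optimizer(optimizer)

    def build(self, x, y):
        # build() declares one training step; OneFlow compiles it once and
        # then schedules the compiled plan asynchronously at runtime.
        loss = self.loss_fn(self.model(x), y)
        loss.backward()
        return loss

model = nn.Linear(784, 10)
optimizer = flow.optim.SGD(model.parameters(), lr=0.1)
graph = TrainGraph(model, nn.CrossEntropyLoss(), optimizer)
loss = graph(flow.randn(32, 784), flow.randint(0, 10, (32,)))
```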

2. Performance Optimization of Eager

  • Compared to v0.5.0, OneFlow v0.6.0's Eager training speed increases dramatically in small-batch scenarios (the snippet after this list shows the kinds of calls affected)
    • Optimized the scheduling logic for virtual machines
    • Optimized get/set item
    • Optimized tensor.numel()
    • Optimized oneflow.Size()
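
The items above are everyday eager-mode calls; a trivial sketch:

```python
import oneflow as flow

t = flow.ones(4, 4)
t[1:3, :] = 0.5   # set item (optimized in this release)
v = t[0, 0]       # get item
n = t.numel()     # element-count query
s = t.size()      # returns an oneflow.Size
```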

3. Performance Optimization of Operators

  • Optimized some operators that affect the performance of new models, significantly improving the training speed of those models

4. Performance Optimization of Eager's Memory Occupation

  • Optimized some operators' memory occupation during network training, enabling the same computing device to run larger models or more data
    • Optimized the backward memory occupation of broadcast binary operators
    • Optimized the backward memory occupation of Slice operator
    • Optimized the memory occupation of LayerNorm operator

5. More Useful Features to Static Computation Graph (nn.Graph)

  • The newly added features concern the efficiency, debugging, completeness, and usability of static graphs (a short usage sketch follows this list)
    • To help the debugging of static graphs, we added the following features:
      • debug mode: graph.debug(1) prints more information about the graph composition
      • Provided the environment variable ONEFLOW_DEBUG_PASS to show the changes in the computation graph before and after compile-time optimization
      • Added human-readable thread-naming information to the Nsight profile to make it easier to locate target threads
      • Added many static graph test cases, plus automatic nn.Graph tests that accompany the Eager tests
    • Provided graph.save() and load() interfaces to support the deployment of models (Serving) using nn.Graph
    • To enable AMP acceleration on GPUs with TensorCore, the environment variable ONEFLOW_ENABLE_NHWC is provided to make CNN-related operators compute in channels-last (NHWC) layout
    • Enabled nn.Graph to support more usage scenarios:
      • Support for the Sparse Update Optimizer, for sparse parameter updates in WDL scenarios
      • Support for using the following nn.Module containers with nn.Graph:
        Sequential, ModuleList, ModuleDict, ParameterList, and ParameterDict
      • Support for creating Optimizers in the init function of nn.Graph
      • Support for multiple parameters sharing the same Tensor within nn.Graph
      • Support for scenarios where the actual number of processes is greater than the number of GPU devices
      • Support for more inplace execution for Consistent SBP inference under nn.Graph
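
A hedged sketch combining the debugging and deployment hooks above, reusing the TrainGraph from the earlier sketch (the value expected by ONEFLOW_DEBUG_PASS and the path-based save() signature are assumptions):

```python
import os
import oneflow as flow

# Assumption: setting this before compilation dumps the computation graph
# before and after compile-time optimization passes.
os.environ["ONEFLOW_DEBUG_PASS"] = "1"

graph = TrainGraph(model, flow.nn.CrossEntropyLoss(), optimizer)  # from the earlier sketch
graph.debug(1)  # verbosity 1: print more information about the graph composition

loss = graph(flow.randn(32, 784), flow.randint(0, 10, (32,)))

# Persist the graph for deployment (e.g. Serving, described below);
# the path argument here is an assumption.
graph.save("saved_model_dir")
```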

6. A Larger Number of Operators

7. User-Defined autograd.Function

Users can define a custom autograd.Function, just as in PyTorch.
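
A minimal sketch following the PyTorch convention the notes reference (the ctx helper names are assumed to match PyTorch's):

```python
import oneflow as flow

class Exp(flow.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        y = flow.exp(x)
        ctx.save_for_backward(y)   # stash the forward result for the backward pass
        return y

    @staticmethod
    def backward(ctx, grad_output):
        (y,) = ctx.saved_tensors
        return grad_output * y     # d/dx exp(x) = exp(x)

x = flow.randn(3, requires_grad=True)
Exp.apply(x).sum().backward()
```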

8. Added Basic Serving Functionality

OneFlow now provides model serving functionality as an Nvidia Triton backend.

9. Added Some Functionalities of Tensor (ConsistentTensor)

  • Supported Tensors using 2-D SBP to represent arbitrary hybrid parallelism (e.g., a Linear operation running data parallelism in the row direction of the device matrix and model parallelism in the column direction); see the sketch after this list
  • Supported Tensor's conversion from arbitrary 1-D SBP to 2-D SBP (the network can mix 1-D and 2-D parallelism)
  • Supported constructing a ConsistentTensor from numpy
  • Added new interfaces: oneflow.from_numpy(), oneflow.numel(), and tensor.expand_as()
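
A hedged sketch of these interfaces with 2-D SBP (four GPUs arranged as a 2x2 device matrix; the ranks-based placement constructor is an assumption about the v0.6-era API):

```python
import numpy as np
import oneflow as flow

# 2x2 device matrix over GPUs 0-3: rows are the data-parallel axis,
# columns the model-parallel axis (assumed placement constructor).
placement = flow.placement("cuda", ranks=[[0, 1], [2, 3]])
sbp = [flow.sbp.split(0), flow.sbp.broadcast]  # 2-D SBP signature

x = flow.from_numpy(np.ones((8, 4), dtype=np.float32))
x = x.to_consistent(placement=placement, sbp=sbp)
```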

Model

Released flowvision 0.0.54; a model-loading sketch follows the model lists below.

1. Richer Visual Pre-trained Models

Image Classification

  • CNN series: ResNet, DenseNet, VGG, ResNeXt, EfficientNet, etc
  • Vision Transformer series: ViT, PVT, Swin Transformer, etc
  • Vision MLP series: MLP-Mixer, ResMLP, gMLP, etc

Object Detection

  • SSD, SSDLite
  • Faster R-CNN
  • RetinaNet

Image Segmentation

  • FCN
  • DeepLabV3

Style Transfer

  • StyleNet: supports the styles sketch, candy, mosaic, rain_princess, and undie
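
A typical way to load one of the pre-trained classification models listed above (the pretrained flag follows the torchvision convention that flowvision mirrors):

```python
import flowvision

# Load an ImageNet-pretrained ResNet-50 and switch to inference mode
model = flowvision.models.resnet50(pretrained=True)
model.eval()
```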

2. Implemented Data Augmentation Operations Similar to torchvision

For data augmentation operations like CenterCrop and ColorJitter that mirror torchvision, developers can run import flowvision as torchvision and existing torchvision code will work in most scenarios.
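
A minimal sketch of the drop-in alias (transform names mirror torchvision's):

```python
import flowvision as torchvision  # drop-in alias, as described above

transform = torchvision.transforms.Compose([
    torchvision.transforms.CenterCrop(224),
    torchvision.transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    torchvision.transforms.ToTensor(),
])
```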

3. Implemented Advanced Data Augmentation Operations Similar to timm

Advanced data augmentation operations implemented in flowvision.data (a usage sketch follows the list):

  • Mixup
  • CutMix
  • Random-Erasing
  • AutoAugment
  • RandAugment
  • AugMix
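
A hedged usage sketch of Mixup/CutMix (the keyword names follow the timm convention that flowvision.data mirrors and are assumptions here):

```python
import oneflow as flow
from flowvision.data import Mixup

# Keyword names assumed to follow timm's Mixup
mixup_fn = Mixup(mixup_alpha=0.8, cutmix_alpha=1.0, num_classes=1000)

images = flow.randn(8, 3, 224, 224)       # stand-in batch
labels = flow.randint(0, 1000, (8,))
mixed_images, soft_labels = mixup_fn(images, labels)  # returns mixed inputs and soft targets
```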

4. Separated the Layers Module and Provided a Plug-and-play Block when Building a Model

flowvision.layers.attention

  • Implemented plug-and-play attention modules like Non-Local, SELayer, CBAM, BAM, ECA, etc (see the sketch after this section)

flowvision.layers.blocks

  • Provided modules that might be used for model building like PatchEmb, Pooler, ConvBnAct, etc

flowvision.layers.regularization

  • Provided regularization modules such as drop-path, drop-block, and stochastic depth to improve model generalization ability
  • Provided separate files such as activation and weight_init to organize components like activation functions and initialization methods
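
A hedged sketch of plugging these blocks into a model (the SELayer and DropPath constructor arguments are assumptions based on the usual squeeze-and-excitation and stochastic-depth designs):

```python
import oneflow as flow
from flowvision.layers.attention import SELayer
from flowvision.layers.regularization import DropPath

se = SELayer(channel=64)             # squeeze-and-excitation over 64 channels (assumed args)
drop_path = DropPath(drop_prob=0.1)  # stochastic depth on the residual branch (assumed args)

x = flow.randn(1, 64, 56, 56)
out = x + drop_path(se(x))           # typical residual-branch usage
```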

OneFlow-ONNX Conversion

Updated the OneFlow-to-ONNX toolkit (a conversion sketch follows the list):

  • Supported converting OneFlow models to ONNX models in CPU or GPU mode
  • Added test cases for operators and models to align all classification models in the flowvision library
  • Fixed onnx-runtime bugs during PReLU conversion
  • Compatible with onnxruntime v1.9.0 and later versions
  • Released the v0.5.4 oneflow-onnx package; developers can run pip install oneflow-onnx to try it out
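
A hedged conversion sketch (the helper name and arguments are assumptions based on the oneflow-onnx package; consult its README for the exact signature):

```python
import oneflow as flow
import flowvision
from oneflow_onnx.oneflow2onnx.util import convert_to_onnx_and_check

model = flowvision.models.resnet50(pretrained=True).eval()

class InferGraph(flow.nn.Graph):
    def __init__(self):
        super().__init__()
        self.model = model

    def build(self, x):
        return self.model(x)

graph = InferGraph()
graph(flow.randn(1, 3, 224, 224))   # run once so the graph is built/compiled
convert_to_onnx_and_check(graph, onnx_model_path="/tmp")  # assumed signature
```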