Version 0.8.0

@jackalcooper released this 18 Jul 06:05

OneFlow v0.8.0 Release Note

OneFlow v0.8.0 is now available. Welcome to install the new version for a better experience.

  • Highlights
  • Backwards Incompatible Change
  • Deprecations
  • New Features
  • Performance
  • Improvements
  • Bug fixes
  • Documentation

Highlights

This update contains 523 commits and the following highlights:

  • PyTorch-compatible APIs have been further improved: 68 new APIs aligned with PyTorch have been added, and 84 operator and interface compatibility bugs have been fixed. More PyTorch models can now be transferred to OneFlow with one click.

  • All operators support Global Tensor more completely and efficiently: 28 Global Tensor-related bugs have been fixed, and 180 new operator unit tests have been added.

  • Graph's advanced features have been further optimized:

    • In addition to the existing ZeRO-DP, the Zero Redundancy Optimizer (ZeRO) can now be combined with model parallelism, 2D parallelism, and 3D parallelism, which further reduces memory overhead.

    • Graph provides a new pipeline parallelism API that not only simplifies pipeline parallelism configuration but also improves the performance of pipeline parallelism and 3D parallelism.

    • Multi-dimensional debugging functionality has been added for the logical graph, the light plan physical graph, memory analysis, Python stack information, and more, making Graph.debug more efficient.

  • Empowered by OneFlow v0.8.0 and LiBai v0.2.0, 3D parallelism for GPT and BERT sees a notable speedup, and its training speed exceeds Megatron-LM with the same configuration in multiple dimensions. For more details, please click here.

  • OneEmbedding has been released recently. It is an extension component designed for large-scale recommendation systems, boasting high efficiency, extensibility, flexibility, and other advantages.

  • Multi-device adaptation: OneFlow v0.8.0 provides a neat, efficient, and easily extensible hardware abstraction layer called EP (Execution Provider) and defines a collection of basic computing interfaces called Primitive, allowing kernels to be re-implemented on top of the Primitive interfaces.

  • Added new debugging tool stacks: OneFlow-Profiler and AutoProf.

    • OneFlow-Profiler is a tool designed to collect performance information during framework execution. It can record the execution time of operators and system components, the allocation of host and device memory, and the corresponding inputs and parameters of operators. This information helps developers find the main sources of overhead in framework execution and implement targeted optimizations.

    • AutoProf is a framework designed to efficiently check the alignment between OneFlow APIs and PyTorch APIs. It can also automatically compare the performance of OneFlow APIs and PyTorch APIs.

  • Significantly optimized exception handling in the OneFlow API and improved the error messages reported when APIs raise exceptions.

  • Significantly optimized the OneFlow API documentation: the API documentation has been restructured based on functionality. In addition to general operator APIs, oneflow.nn.graph, oneflow.embedding, oneflow.autograd and other modules in OneFlow and their environment variables have also been explained in detail.

Backwards Incompatible Change

  • The Graph API for configuring ZeRO has been redesigned, reducing configuration and learning cost for users. In addition, the latest ZeRO supports 2D mixed parallelism (model parallelism plus pipeline parallelism) and 3D parallelism. (#8036, #8404, #8464)

Outdated configuration method in OneFlow v0.7.0:

import oneflow as flow

zero_stage = 2  # desired ZeRO stage (1, 2, or 3)

class Graph(flow.nn.Graph):
    def __init__(self):
        super().__init__()
        self.linear = flow.nn.Linear(3, 8, False)
        self.config.set_zero_redundancy_optimizer_mode("distributed_split")
        if zero_stage > 1:
            # stage 2
            flow.boxing.nccl.enable_use_compute_stream(True)
            if zero_stage > 2:
                # stage 3
                flow.boxing.nccl.disable_group_boxing_by_dst_parallel(True)
    def build(self, x):
        return self.linear(x)

graph = Graph()

New interface in OneFlow v0.8.0:

import oneflow as flow

class Graph(flow.nn.Graph):
    def __init__(self):
        super().__init__()
        self.linear = flow.nn.Linear(3, 8, False)
        self.config.enable_zero(stage=2)
    def build(self, x):
        return self.linear(x)

graph = Graph()

Deprecations

Python API

  • The outdated parameter axis in oneflow.sbp.split() (still accepted for compatibility) has been uniformly replaced by dim to represent the split dimension. (#8411)

v0.7.0

oneflow.sbp.split(axis=0)

v0.8.0

oneflow.sbp.split(dim=0)

  • To replace the outdated pipeline parallelism configuration method self.module_layer_0.config.stage_id = 0 (no longer recommended), we have added a new pipeline parallelism API, config.set_stage, which improves pipeline parallelism performance and avoids having to call input_tensor.to_global(placement=this_stage_placement) on every module input tensor at each stage. (#8442)

v0.7.0

import oneflow as flow

B = [flow.sbp.broadcast]
P_0 = flow.placement(type = "cuda", ranks = [0, 1])
P_1 = flow.placement(type = "cuda", ranks = [2, 3])

class Graph(flow.nn.Graph):
    def __init__(self):
        super().__init__()
        self.m_stage0 = flow.nn.Linear(8, 8, False).to_global(placement=P_0, sbp=B)
        self.m_stage1 = flow.nn.Linear(8, 8, False).to_global(placement=P_1, sbp=B)
        # Set different module's stage id to hint the graph preparing right num of buffers in pipeline.
        self.m_stage0.config.stage_id = 0 
        self.m_stage1.config.stage_id = 1
        self.config.set_gradient_accumulation_steps(4)        

    def build(self, x):
        x = x.to_global(placement=P_0, sbp=B)
        y = self.m_stage0(x)
        # Move tensor between different pipeline stages.
        y = y.to_global(placement=P_1, sbp=B)
        z = self.m_stage1(y)
        return z

v0.8.0

class Graph(flow.nn.Graph):
    def __init__(self):
        super().__init__()
        self.m_stage0 = flow.nn.Linear(8, 8, False).to_global(placement=P_0, sbp=B)
        self.m_stage1 = flow.nn.Linear(8, 8, False).to_global(placement=P_1, sbp=B)
        # set_stage(stage_id, placement)
        # The Stage ID is numbered starting from 0 and increasing by 1.
        # The Placement is all tensors placement of this module.
        self.m_stage0.config.set_stage(stage_id=0, placement=P_0)
        self.m_stage1.config.set_stage(stage_id=1, placement=P_1)
        self.config.set_gradient_accumulation_steps(4)
    
    def build(self, x):
        # tensor.to_global(placement) is applied automatically to every input tensor of this module,
        # so there is no need to call to_global() inside or outside of the module's forward function.
        y = self.m_stage0(x)
        z = self.m_stage1(y)
        return z

New Features

Graph

  • Added new interfaces oneflow.env.init_rdma and oneflow.env.rdma_is_initialized to delay enabling RDMA, thus accelerating network communication across multiple devices (note: avoid calling fork() after RDMA has been enabled; for example, a DataLoader with num_workers > 1 should be created before init_rdma). A usage sketch follows at the end of this list. (#8415)

  • Graph provides a new algorithm optimization interface, graph.config.enable_straighten_algorithm, to optimize the execution order in the computation graph and maximize the overlap between data transfer and computation. With this interface, data transfer speed rises by 0.6% in data parallelism mode and by 6% in model parallelism mode. (#8347, #8483, #8495)

  • Optimized the implementation of clip grad in Graph to support clip_grad_max_norm > 1.0 and provided a configurable clip_grad_norm_type, which previously could only be set to 2 but can now be set to +/- inf, +/- 1, +/- 2, +/- 3, and larger p-norm values. See the reference here. (#7548)

  • Global tensor in Graph supported the tensor.set_item operation for invariable ops, for example, mask[:, :len_keep] = 0 (#7751)

  • Graph exports the build_graph and compile_and_init_runtime interfaces, allowing user-defined passes to be compiled after the graph is built, thus rewriting and optimizing the graph. The two interfaces also allow Graph to restore an external graph (job). (#8168)

  • Added the RegisterJobPass interface to support user-defined external job passes that rewrite the graph. (#8370)

  • oneflow.boxing.nccl.enable_use_compute_stream(True) brings better support for NCCL logical kernels:

    • Added noncontiguous ReduceScatter kernel to support the conversion of P -> S(i), (i > 0) (#8361)

    • Supported the conversion of B -> S (#8355)

    • Enabled nccl send/recv primitives to support special SBP conversions (#8318)

  • Added the efficient fused kernel oneflow.nn.FusedMLP, which is controlled by export ONEFLOW_FUNCTOR_DISABLE_FUSED_MLP=0. (#7391, #8165, #8217, #8413)
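
A minimal sketch of the delayed-RDMA workflow from the init_rdma item above; the ordering (fork-based workers first, RDMA afterwards) follows the note in that item:

import oneflow as flow

# Create any fork-based workers first (e.g., a DataLoader with num_workers > 1),
# then enable RDMA for faster communication across machines.
if not flow.env.rdma_is_initialized():
    flow.env.init_rdma()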

Debug

  • Graph.debug offers a new parameter, max_stack_depth (default = 2), to specify the maximum Python stack depth recorded for each op in Graph, making it convenient to locate the Python context of each op (a usage sketch follows at the end of this list). (#8028)

  • In addition to printing the input/output/variable info of modules in Graph, debug now also supports printing the info of operators constructed in a module's forward. (#8135)

  • Enabled export ONEFLOW_DEBUG_MODE=true and export GLOG_v=3 to print the full memory log, which contains multi-level MemBlock info on each device (Total Memory -> Chunk -> MemBlock), Blocks with exclusive memory, Eager Variables, and other information. In addition, a lifecycle label was added to Regst to analyze each tensor's memory lifecycle.

  • LightPlan provides a simplified way to display the Actor graph, cutting down the cost of debugging based on Plan. When ONEFLOW_DEBUG_MODE=true, a series of light plan files, one per rank in Graph, is generated under the log/local_rank_0/machine/ directory; each contains the simplified actor sub-graph of that rank, and the filename is GraphName_rank_i_light_plan. (#8396)

  • The print(graph) method can display the logical graph by Module, making debugging more efficient while constructing graphs. (#8131)
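
A minimal sketch of Graph debugging with the stack-depth option described above; the keyword name max_stack_depth is assumed to match the parameter mentioned in that item:

import oneflow as flow

class LinearGraph(flow.nn.Graph):
    def __init__(self):
        super().__init__()
        self.linear = flow.nn.Linear(3, 8, False)

    def build(self, x):
        return self.linear(x)

graph = LinearGraph()
# Enable debug output before the first call; max_stack_depth (assumed keyword, see above)
# controls how many Python frames are recorded for each op.
graph.debug(max_stack_depth=2)
out = graph(flow.randn(4, 3))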

Eager

  • Supported passing extra parameters when Optimizer ParamGroup is being built, meeting other special operation demands for LrScheduler. (#7753)

    • param_groups = [{"params": [model.parameters()], "excess_param": ...}]
      optim = optim.Adam(param_groups, lr=0.1)
  • Added the oneflow.cuda.current_device interface to return the device index of the current rank (#7856)

  • Added the oneflow.utils.from_torch interface to convert a PyTorch Tensor into a OneFlow Tensor (#7851)

  • Added the oneflow.utils.to_torch interface to convert a OneFlow Tensor into a PyTorch Tensor (a round-trip sketch follows at the end of this list) (#7851)

  • Added the oneflow.cuda.empty_cache interface to manually release memory (#8482)

  • Added the oneflow.roc_auc_score interface on CPU, which is equivalent to sklearn.metrics.roc_auc_score (#7951)
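
A minimal round-trip sketch of the from_torch / to_torch helpers listed above, assuming they are importable from oneflow.utils as named:

import torch
from oneflow.utils import from_torch, to_torch

t = torch.randn(2, 3)
of_t = from_torch(t)   # PyTorch tensor -> OneFlow tensor
t2 = to_torch(of_t)    # OneFlow tensor -> PyTorch tensor
print(of_t.shape, t2.shape)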

Tensor

  • Provided the Tensor.contiguous_ interface as the in-place version of the contiguous operation (#8275)

  • Added the Tensor.local_to_global and Tensor.global_to_global interfaces, which apply different default check_meta behaviors (#8027)

  • Global Tensor's Slice/SliceUpdate supported all nd_sbp inputs, and SliceUpdate fully supported the inplace operation and backpropagation (#8313, #8337, #8344, #8416)

Global Boxing

  • Eager Global Tensor supports balanced-splitter nd_sbp eager boxing (#7768)

  • Supported executing Eager Slice Boxing on arbitrary devices, including non-CPU and non-CUDA devices (#8180)

OneEmbedding

To make better recommendations, modern recommendation systems rely on huge Embedding tables, and frequent iterations of user data require model training to be fast.

OneEmbedding is a component designed for large-scale recommendation systems, and it's efficient, extensible, and highly flexible. The following are its advantages:

  1. Hierarchical storage and dynamic capacity expansion: users can expand the capacity of the Embedding at much lower cost.

  2. Mixed parallelism strategy: it supports easily extending the model to train it on multi-machine multi-GPU.

  3. Embedding quantization for better communication: in the parallel scenario, communication data can be quantized to reduce the communication amount, thus accelerating the training.

  4. Efficient data pipeline: the model parts that have no data dependency can be executed in advance, thus overlapping with other operations in time.

  5. Automatic mixed precision training: data can be computed in FP16 to reduce the occupied memory, thus accelerating the training speed and ensuring high model convergence precision.

  6. A collection of efficient CUDA ops for common operations in recommendation systems is available.

  7. Flexible model building is supported.

See the OneEmbedding API documentation here.

PyTorch Compatibility

A collection of new functionalities and interfaces compatible with PyTorch 1.10.0 has been added.

Tensor

  • Added the Tensor.pin_memory functionality, which supports changing the memory to pinned memory when the tensor is being created. (#8073)

    • Supported passing the pin_memory parameter when the tensor is being created. (#8176)

    • DataLoader supported pin_memory (#8214)

    • Added the Tensor.is_pinned attribute (#8447)

  • Added the ~Tensor (invert) method to perform an element-wise logical NOT on tensors with dtype bool. (#7899)

  • Added the Tensor.log2 method to compute the base-2 logarithm element-wise. (#7906)

  • Added the Tensor.new_zeros method to generate a new tensor filled with zeros (a short sketch of several of these new Tensor methods follows at the end of this list). (#7937)

  • Added the oneflow.as_tensor interface to convert the input data into a new tensor that shares data. (#7855)

  • Added the Tensor.__array__ method, so np.array can take a OneFlow tensor as input to construct an np.ndarray object. (#7970)

  • Added the Tensor.new_tensor method to copy the input data to generate a new tensor. (#7973)

  • Added the Tensor.half method, which is equivalent to tensor.to(oneflow.float16). (#7971)

  • Added the Tensor.byte method to generate a new uint8 tensor, and tensor.byte() is equivalent to tensor.to(oneflow.uint8). (#8053)

  • Added the Tensor.view_as and Tensor.new_empty methods (#8077)

  • Added the Tensor.type method to implement the corresponding casts, and added the oneflow(.cuda).{Byte, Char, Short, Int, Long, Half, Float, Double}Tensor objects (#8129)

  • Added the Tensor.dot method to compute the dot product of two 1D tensors, and this method is equivalent to oneflow.dot. (#8520)

  • Added the oneflow.nn.init.orthogonal_ interface to initialize tensors (#8009)
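
A short sketch exercising a few of the newly added Tensor methods from this list; behavior is assumed to mirror the PyTorch equivalents:

import oneflow as flow

x = flow.randn(2, 3)
z = x.new_zeros((2, 3))          # new tensor of zeros with x's dtype and device
h = x.half()                     # same as x.to(flow.float16)
b = flow.tensor([True, False, True])
inv = ~b                         # element-wise logical NOT on a bool tensor
d = flow.tensor([1.0, 2.0]).dot(flow.tensor([3.0, 4.0]))  # 1D dot product
print(z.shape, h.dtype, inv, d)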

Operators

  • Added the oneflow.nn.Softshrink op (#7826)

  • Added the oneflow.nn.Threshold op (#7875)

  • Added the oneflow.nn.Hardshrink activation function (#7887)

  • Added the oneflow.isnan and oneflow.isinf interfaces to determine whether each element of a tensor is NaN or Inf (#7943)

  • The oneflow.nn.functional.* interfaces support passing NumPy scalar parameters (#7935)

  • Added the oneflow.nn.functional.cosine_similarity op to calculate the cosine similarity of two tensors (a usage sketch follows at the end of this list) (#8119)

  • Added the oneflow.nn.functional.conv_transpose1d, oneflow.nn.functional.conv_transpose2d, and oneflow.nn.functional.conv_transpose3d ops (#7991)

  • Added the oneflow.unbind interface to return a tuple of all slices along a given dimension (#7730)

  • Added the oneflow.swapdims interface to specify the swapping of two dimensions, and oneflow.swapdims is equivalent to NumPy’s swapaxes. (#7659)

  • Added the oneflow.addcmul op to execute an element-wise composite function: out=input+value×tensor1×tensor2 (#7282)

  • Added the oneflow.searchsorted op (#7949)

  • Added the oneflow.mm op (#8440)

  • Added the oneflow.tensordot interface and offered a collection of cases of equivalent transformation operations (#7968)

  • Added the oneflow.repeat_interleave op to repeat the elements of the tensor, and this op is equivalent to numpy.repeat (#8324)

  • Added the oneflow.amax and Tensor.amax methods (#7996)

  • Added the oneflow.median and Tensor.median methods (#8069)

  • Added the oneflow.normal method and fixed the Tensor.normal method (#7956)

  • Added the oneflow.amin and Tensor.amin methods (#8042)

  • Added the oneflow.mv op and Tensor.mv method (#8445)
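
A brief sketch of a few of the ops above, assuming signatures that mirror their PyTorch counterparts:

import oneflow as flow
import oneflow.nn.functional as F

a = flow.randn(4, 8)
b = flow.randn(4, 8)
sim = F.cosine_similarity(a, b, dim=1)                   # cosine similarity per row

sorted_seq = flow.tensor([1, 3, 5, 7, 9])
idx = flow.searchsorted(sorted_seq, flow.tensor([4, 8])) # insertion indices

out = flow.tensordot(flow.randn(3, 4), flow.randn(4, 5), dims=1)  # contracts the shared dim
print(sim.shape, idx, out.shape)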

Random

  • Added new interfaces oneflow.cuda.manual_seed, oneflow.cuda.manual_seed_all, oneflow.seed, oneflow.manual_seed, oneflow.initial_seed, oneflow.get_rng_state, and oneflow.set_rng_state, and improved OneFlow's random seed initialization (a short sketch follows). (#7957)
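
A short sketch of the seeding and RNG-state interfaces listed above:

import oneflow as flow

flow.manual_seed(0)                  # seed the default generator
if flow.cuda.is_available():
    flow.cuda.manual_seed_all(0)     # seed the generators of all CUDA devices

state = flow.get_rng_state()         # snapshot the RNG state ...
x = flow.randn(2, 2)
flow.set_rng_state(state)            # ... and restore it, so the next draw repeats x
y = flow.randn(2, 2)
print((x == y).all())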

AutoGrad

  • Added new interfaces oneflow.set_grad_enabled and oneflow.enable_grad to enable or disable automatic gradient computation for some sub-graphs (a short sketch follows at the end of this list). (#8016)

  • Supported cases where the upstream gradient dtype of an autograd backward operator differs from the dtype of its input. (#8233, #8309)

  • Supported executing backward computation multiple times for backward operators that do not capture any tensor. (#8031)
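
A short sketch of the gradient-mode interfaces listed above; the context-manager usage is assumed to mirror the PyTorch equivalents:

import oneflow as flow

x = flow.ones(2, 3)
x.requires_grad = True

with flow.set_grad_enabled(False):   # disable autograd in this sub-graph
    y = x * 2
assert not y.requires_grad

with flow.no_grad():
    with flow.enable_grad():         # re-enable autograd inside a no-grad region
        z = x * 3
assert z.requires_grad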

CUDA

  • Added APIs for oneflow.cuda.set_device and oneflow.cuda.synchronize. (#8322)

RNN

  • Refactored the RNN modules and migrated the Python-level layer splicing implementation to C++, which greatly improves performance. Added RNNCell-related modules and modules functionally aligned with torch.nn.utils.rnn:

    • Refactored modules: RNN, LSTM, and GRU
    • Added modules: RNNCell, LSTMCell, GRUCell, and oneflow.nn.utils.rnn
    • Supported and fixed local and global RNN unit tests, and completed the documentation.

Device

Supported heterogeneous device types: to cope with the complexity of different hardware, OneFlow, following the dependency inversion principle in software engineering, has introduced a hardware abstraction layer called Execution Provider (EP). The hardware abstraction layer is composed of a series of interfaces abstracted from the capabilities the framework requires of hardware devices at runtime. With the hardware abstraction layer in place, each module calls the interfaces provided by the abstraction layer, rather than the original hardware interfaces, to use the underlying hardware, so it does not need to be concerned with the specific details of that hardware. When a new hardware device is introduced, because the hardware abstraction interfaces remain unchanged, all modules can adapt to the new device without any modification. Likewise, when adapting new hardware to the framework, there is no need to pay attention to the framework's implementation details: one only needs to implement the interfaces according to the contract of the hardware abstraction layer and the actual situation of the hardware device, and the hardware adaptation is complete.

Execution Provider has defined a collection of runtime interfaces: device registration interface, device management interface, queue management interface, event management interface, and memory management interface.

Primitive

In addition to the runtime interfaces, the Execution Provider also defines a set of computing interfaces called Primitive, which describe the computations commonly used in a deep learning framework and thus simplify operator development during hardware adaptation. Compared with the runtime interfaces, the Primitive interfaces are looser and more flexible: all interfaces are mutually independent, and each represents a specific computing capability provided by a hardware device. Like the runtime interfaces, the Primitive interfaces are abstracted close to the device side, so developers can carry out adaptation work without an in-depth understanding of OneFlow's internal mechanisms. Developers must implement all of the Execution Provider's runtime interfaces, but when adapting Primitive they can implement interfaces selectively according to the actual needs of the project.

  • Added unit test of ep::primitive basic function (#8099)

  • Added ep::primitive::constant_pad, optimized its performance, removed the obsolete pad grad op, and used pad itself as the inverse of pad (#8152)

  • Used unary primitive interface instead of original implementation in Kernel (#8270)

  • Added environment variable ONEFLOW_EP_CUDA_CUBLAS_WORKSPACE_SIZE_MB to configure cublas workspace size (#8478)

  • Scalar logical kernel supported primitives (#8531)

  • Used primitives to implement logical not kernel (#8544)

  • Migrated all activation kernels to use primitive (#8300)

  • Bias add kernel supported primitive (#8512)

  • Decoupled oneDNN from the ep::primitive CPU device and provided the environment variable ONEFLOW_ENABLE_ONEDNN_OPTS to let oneDNN accelerate the CPU primitive interfaces (#8274)

Debug tools

  • Saved the log independently for each rank to log/local_rank_{i} when launching multiple processes by launcher. (#7825)

  • Optimized the display of OF_PROFILER_RANGE_GUARD in nsys. (#8121)

OneFlow-Profiler

OneFlow-Profiler is designed to collect various performance-related information during the execution of the framework. It can record the execution time of operators and system components, the allocation of host and device memory, and the inputs and parameters of operators. Developers can use this information to analyze which parts bring the most overhead and implement targeted optimizations.

  • Added OneFlow-Profiler. (#8047)

  • Profiled the information of the CUDA operator. (#8195)

  • Profiled the bandwidth information of the operator. (#8254)

  • Added interfaces to collect bandwidth information and optimized code implementation. (#8332)

  • Refined Profiler. (#8332)

  • Used Kineto and CUPTI to profile the information of CUDA operator. (#8417)

Auto-Test

  • When a value check fails, the values of the input tensors and Parameters are automatically printed, and the pseudo-code segment of the output program is highlighted for debugging (#8383)

AutoProf

AutoProf is a framework designed to test the performance of OneFlow and PyTorch operators. It can automatically test operator performance and print a comparison table under different CPU thread counts and GPUs. At present, it has been applied to the development of some existing operators and all new operators. Its effect is shown below:

[Image: AutoProf operator comparison table]

  • Added AutoProf, an automatic operator speed comparison framework, to automatically run ops and test: (#8207)

    • The speed of OneFlow and PyTorch.

    • The speed of CPU/GPU Kernel under different numbers of threads.

    • Total end-to-end time with CPU Kernel.

  • Optimized the display of AutoProf to save testing time. (#8303)

  • Supported API tests without actual kernel execution, in which case the measured time is end-to-end. (#8320)

  • Supported AutoProf to measure kernel bandwidth. (#8367)

IR

  • Added a pass to eliminate redundant Cast ops. (#7837)

  • Used MLIR to implement constant folding and the fusion of Conv and BN. (#7799)

  • Optimized constant folding in OneFlow C++ API. (#8124)

  • Provided fault tolerance checking for parsed module. (#8299)

  • Fixed the bug in the constant folding unit test. (#8340)

  • Supported IREE. (#8249)

  • Added oneflow_iree(python) to CI. (#8431)

  • Removed redundant output_lbns in IR. (#8409)

  • Provided a conversion marker for Variable -> constant. (#8412)

  • Removed hardcoded properties in IR. (#8420)

  • Implemented the AutoNHWC pass and provided the environment variable ONEFLOW_MLIR_PREFER_NHWC, which automatically converts the data format of common networks to channels-last and brings a noticeable acceleration on NVIDIA GPUs that support FP16. (#7890)

Performance

Graph

  • Optimized the speed and memory of GPT and BERT under 3-D parallelism:

    • Performance optimization: the fused_scale_mask_softmax operator supports broadcast input; optimized the kernel implementation and performance of softmax for specific column counts (1024); completed the previously incomplete GetSbp list of the fused_scale_mask_softmax backward operator. (#8321)

    • Communication optimization: Optimized the SBP communication cost under B->S, B->B, and B->P. (#8378)

    • Interface optimization: Fixed the inefficient edge connections caused by the misalignment between stage ids and to_global sequence dependencies when using pipeline stages. (#8442)

    • Communication optimization: nccl_use_compute_stream supported more comprehensive sbp conversions like P -> S(i). (#8361)

    • Communication optimization: Parallel use of RDMA communication. (#8415)

    • Memory optimization: Eliminated the randomness of the memory reuse algorithm, so that the memory reuse result on each rank is consistent when the sub-graphs are the same and bad cases no longer occur. (#8441)

    • Memory optimization: Removed the extra buffer problem of Stage 0 CPU copy under Pipeline parallelism. (#8484)

    • Memory optimization: Under Checkpointing and Pipeline, the input identity of the module was de-duplicated to reduce additional Checkpointing tensor, and added the block name prefix of the module to the identity. (#8509)

    • Combination optimization: ZeRO-DP can now be used together with pipeline parallelism and 3-D parallelism. (#8464)

    • Memory optimization: Removed the extra identity tensor in the ZeRO optimization. (#8407)

  • Provided new environment variable optimization switches: ONEFLOW_ENABLE_MULTI_TENSOR_MODEL_UPDATE and ONEFLOW_FUSE_MODEL_UPDATE_CAST. In the AMP case, they allow fusing the Optimizer model update kernel with the next round's forward cast operators. (#8373)

Eager

  • Enabled export ONEFLOW_EAGER_LOCAL_TO_GLOBAL_BALANCED_OVERRIDE=true to accelerate Eager Global execution by skipping the synchronization of Global Tensor meta information on each rank (use it only when you are confident that your code executes symmetrically, i.e., SPMD). (#7981)

    This environment variable indicates whether the shape of the input data is the same on every rank when local_to_global is executed. If it is set to true, there is no need to synchronize the shape across ranks, and the logical shape is computed locally.

  • Used the Python C API instead of pybind11 to optimize the calling speed of tensor and functional APIs.

    • Optimized functional return types to save overhead and avoid reference copies, and fixed the bug that the inplace tensor id might be inconsistent. (#7985)

    • Moved the tensor API from pybind11 to the Python C API, added a tensor hash function, and resolved function naming conflicts. (#8258, #8315, #8342, #8375)

  • Performance optimization: Let vm worker threads concentrate on computing tasks, and decoupled memory tasks from computing tasks. (#7976)

  • Optimized the speed of operations in DataLoader, including MakeLocalTensorFromData, which is 20% faster under swin-T dataloader. (#8066)

Operators & Tensor

  • Optimized global sparse_softmax_cross_entropy kernel. (#7298)

  • Optimized and sped up CPU permute kernel with OneDNN. (#7872)

  • Optimized and sped up CPU softmax kernel with OneDNN. (#8071, #8075)

  • Optimized the memory and speed required for the reverse calculation of the pooling kernel. (#7980)

  • Optimized Slice and Tensor getitem operations based on View to improve the speed of dataloader. (#8148, #8211, #8243)

  • Optimized the backward composition logic of flip and cumsum and removed some grad operators; used random-value tests when checking gradient diffs to increase test robustness. (#8155)

  • Optimized the memory usage of the NormalizationAddReluGrad operator and added a version that does not require addend_diff. (#8213)

  • Optimized and sped up the implementation of tensor.reshape and tensor.reshape_as from python implementation to c++ implementation. (#8304)

  • Converted tensor.view, tensor.view_as, tensor.permute, tensor.transpose, tensor.contiguous_ from python implementation to c++ implementation. (#8317)

  • Greatly optimized the performance of index_select and repeat_interleave by using gather to replace dim gather. (#8360)

  • Optimized and removed temporary memory in cumprod cpu grad kernel. (#8369)

  • The embedding operator supports AMP; improved performance on the normal path and fixed an out-of-bounds memory access in the gather CPU kernel. (#8374)

  • Optimized the performance of Tensor.fill_. (#8283)

  • Greatly optimized the performance of the broadcast element-wise binary family operators in reverse calculation. (#8339)

  • Added fusion operator BinaryCrossEntropyWithLogitsReduceMean. (#8476)

  • Added high-performance matrix multiplication Fused kernel based on cublasLt. (#8462, #8222, #8063)

Primitive

  • Lowered the elementwise.cuh template's requirement for pointer alignment.

Improvements

Graph

  • Exported oneflow env to python and used python's objects to manage its lifecycle. (#7792)

  • Used Python's reference counting to control the life cycle of Graph and constructed strict and rich destruction test cases. (#7857)

  • Supported recycling independent threads that can no longer be reused when Graph is destructed. (#7862)

  • Changed the basic resource configuration from taking effect once statically to taking effect in real time. (#8444)

  • Consolidated the nccl_comms dynamically created by Graph's NCCL logical kernels into the runtime for initial creation, to avoid deadlocks caused by inconsistency between each rank's creation order and the eager nccl comm creation order. (#8263)

  • Refactor optimization: Merged nn.graph.util.IONode and nn.graph.util.IONodeType into IOArgs. (#8272)

  • Refactor optimization: Renamed the global singleton Global object to the Singleton object. (#8490)

  • Refactor optimization: Removed gpu_device_num (#8516)

  • Refactor optimization: Removed outdated AvailableMemDesc concepts. (#8145)

  • Refactor optimization: Removed outdated Model IO Kernel logic. (#8151)

  • Refactor optimization: Replaced GpuDeviceNum with the actual number of devices to avoid coupling with specific device types. (#8166)

Eager

  • Added a C++ interface to manually trigger allocator GC on each stream (applicable to ZeRO). (https://github.com/Oneflow-Inc/oneflow/pull/8452)

  • Eager VirtualMachine instructions are now executed based on EP. (#7923)

  • Optimized and removed all redundant interfaces of Get(Ptr)OrThrow. (#7812)

  • Added the validity check of flow.save(global_dst_rank). (#7964)

  • Supported the backward function node to run multiple times if it does not capture any tensor. (#8031)

  • Added the ThreadLocalCached decorator to clear the cache in time to alleviate increasing memory. (#7858)

  • Added C++14-compatible implementations of std::inclusive_scan/std::exclusive_scan. (#8128)

  • Packaged the parameters required by the eager opkernel and passed them per thread to solve some thread-safety problems. (#7617)

  • Eager Stream supports kernel computation on pinned memory. (#8486)

  • Introduced a utility class for dim range checks to replace and simplify the various dimension-checking logic in Functors. (#8382)

  • Refactoring and optimization: removed the Blob object in EagerBlobObject, which led to redundant TensorView instructions. At the same time, in order to support ShapeView efficiently, the elem_cnt attribute has also been removed. (#7895)

  • Refactoring and optimization: extracted the algorithm used by BinAllocator to share dynamic memory pools

  • Refactoring and optimization: VectorAt and MapAt functions uniformly use reference to pass parameters to solve the mixed use of reference interface and pointer interface. (#8191)

  • Refactoring and optimization: removed the cfg application on C++. (#8158)

  • Refactoring and optimization: removed the outdated code related to RemoteBlob in Single-Client. (#8228)

  • Refactoring and optimization: merged duplicate logic in eager boxing ccl and nccl boxing expr. (#7930)

  • Refactoring and optimization: removed cfg on Python and reduced the number of symbols to optimize the link speed of compilation.

  • Refactoring and optimization: merged symbol::IdCache and symbol::Storage. (#8331)

  • Refactoring and optimization: introduced llvm::SmallVector and used oneflow::small_vector instead of fixed_vector. Besides, we have optimized the implementation and usage of Shape and Stride. (#8365, #8402)

  • Refactoring and optimization: refactored ShapeView and Shape to eliminate duplication and inconsistencies. (#8422)

  • Refactoring and optimization: eager VirtualMachine has decoupled InstructionType's dependency on StreamType. (#7607)

  • Refactoring and optimization: removed the InstructionMsg class and merged all its functions and fields into the Instruction class. (#7623)

Operators & Tensor

  • Stride support:

    • Tensor, UserOp and UserKernel in user_op:: all supported stride attribute. (#7829)

    • cast supports stride. (#8292)

  • View support and optimization:

    • Added a way, when defining ops, to declare whether non-contiguous input tensors are supported. Besides, the following non-contiguous view ops are now supported: transpose, permute, narrow, expand, expand_as, split, chunk, unfold_tensor, movedim, as_strided, select, swapaxes, T, t, hsplit, vsplit, and tensor_split. (#7813)

    • Tensor slice uses view operations by default. (https://github.com/Oneflow-Inc/oneflow/pull/8302)

  • Automatically generated version status (Feature Stage) for OneFlow's API. (#7945)

  • Optimized CUDA memset to cudaMemsetAsync. (https://github.com/Oneflow-Inc/oneflow/pull/7763)

  • LeakyReLU supported inplace optimization. (#8060)

  • Added the following parameters to nn.Embedding interface: padding_idx, max_norm, norm_type, scale_grad_by_freq. (#8110)

  • Aligned max_pool_1d, max_pool_2d, max_pool_3d, avg_pool_1d, avg_pool_2d, and avg_pool_3d with PyTorch, and distinguished them from the old pooling kernels that are aligned with TensorFlow. (#8111)

  • VectorAt supported passing in non-const references: JUST(VectorAt(vec, 1)) = 5;. (#8013)

  • Reduced the uncommon kernel template specializations of layer norm. (#8209)

  • Modified the logic of Tensor.numpy to avoid extra memory growth when saving the model. (#8449)

  • Tensor str supported printing nd sbp. (#8458)

  • Slice supports SBP inference (S->P), and the semi-automatically deduced sbp can select the same sbp as expected among the reducible nd_sbp. (#8536)

  • When printing a tensor that is on neither CPU nor CUDA, it is first copied to CPU and then printed. (#8548)

  • Refactoring and optimization: decoupled user kernels from the device tag. (#8529)

  • Refactoring and optimization: a series of kernels (squeeze, reshape_like, flatten, expand_dims, reshape, amp_white_identity, identity, identity_buffer, parallel_cast, hierarchical_parallel_cast, hierarchical_parallel_cast_like) were refactored to CopyDataContentKernel. (#8537)

  • Refactoring and optimization: removed obsolete constant_pad1d , constant_pad2d , constant_pad3d kernel. (#8113)

  • Refactoring and optimization: removed obsolete old lazy upsample kernel implementation.(#8188)

  • Refactoring and optimization: removed obsolete message in shape proto and used sequential to represent stride. (#8220)

  • Refactoring and optimization: removed the obsolete multiply kernel, which was covered by broadcast_mul. (#8359)

  • Refactoring and optimization: Renamed the shape interface in UserOp/Kernel to shape_view. (#8433)

  • Refactoring and optimization: removed oneflow gemm. (#8499)

  • Optimized the Maybe return type of such interfaces as Scalar.As(). (#8348)

Device

  • Code refactoring: ep::CpuDevice. (#7911)

  • Code refactoring: removed hard-coded special decision for device type like "cpu", "cuda" from system code. (#8201)

  • Removed all dnn-related interfaces from the old version of KernelUtil (Primitive will be used to replace those interfaces). (#8141)

  • Removed all interfaces related to mathematical calculation in the old version of KernelUtil (Primitive will be used to replace those interfaces). (#8157)

  • Removed incomplete special decisions for the 'cuda' device type in scope util. (#8173)

  • Achieved delayed capture of CUDA Graph. (#8474)

  • Code refactoring: removed cuda_event. (#8493)

  • Code refactoring: removed useless WITH_CUDA macro. (#8562)

Tests

Eager Global Module Tests:

In 0.8.0, all kernels can handle global tensors in distributed scenarios, and many known SBP-related bugs have been fixed. Global tensors now work efficiently and correctly at the kernel level: no matter how the distributed topology changes, the same algorithmic logic efficiently produces mathematically consistent results, which greatly reduces the effort of verifying correctness in complex, diverse, and asymmetric distributed parallel training.

module/functional op PR
abs Oneflow-Inc/oneflow#7540
0_dim_tensor Oneflow-Inc/oneflow#7540
activation Oneflow-Inc/oneflow#7540
adaptive_pool Oneflow-Inc/oneflow#7563
addmm Oneflow-Inc/oneflow#7565
add Oneflow-Inc/oneflow#7204
affine_grid Oneflow-Inc/oneflow#7578
arange Oneflow-Inc/oneflow#7576
argmax Oneflow-Inc/oneflow#7579
argmin Oneflow-Inc/oneflow#7581
argsort Oneflow-Inc/oneflow#7582
argwhere Oneflow-Inc/oneflow#7584
avgpool Oneflow-Inc/oneflow#7585
batch_gather Oneflow-Inc/oneflow#7590
bernoulli Oneflow-Inc/oneflow#7732
bmm Oneflow-Inc/oneflow#7741
broadcast_like Oneflow-Inc/oneflow#7742
cast Oneflow-Inc/oneflow#7773
ceil Oneflow-Inc/oneflow#7744
chunk Oneflow-Inc/oneflow#7750
clamp Oneflow-Inc/oneflow#7752
clip_grad Oneflow-Inc/oneflow#7757
concat Oneflow-Inc/oneflow#7204
conv1d Oneflow-Inc/oneflow#7769
conv2d Oneflow-Inc/oneflow#7771
conv3d Oneflow-Inc/oneflow#7771
cumsum Oneflow-Inc/oneflow#7772
deconv2d Oneflow-Inc/oneflow#7772
diagonal Oneflow-Inc/oneflow#7772
diag Oneflow-Inc/oneflow#7421
div Oneflow-Inc/oneflow#7421
dot Oneflow-Inc/oneflow#7421
dropout Oneflow-Inc/oneflow#7772
empty Oneflow-Inc/oneflow#7508
eq Oneflow-Inc/oneflow#7421
erfc Oneflow-Inc/oneflow#7421
erf Oneflow-Inc/oneflow#7421
expand Oneflow-Inc/oneflow#7772
expm1 Oneflow-Inc/oneflow#7421
eye Oneflow-Inc/oneflow#7421
flatten Oneflow-Inc/oneflow#7421
flip Oneflow-Inc/oneflow#7496
floor Oneflow-Inc/oneflow#7421
fmod Oneflow-Inc/oneflow#7421
fold Oneflow-Inc/oneflow#7772
greater_equal Oneflow-Inc/oneflow#7421
greater Oneflow-Inc/oneflow#7366
fused_bias_add_dropout Oneflow-Inc/oneflow#7867
fused_bias_add_gelu Oneflow-Inc/oneflow#7867
fused_scale_mask_softmax_dropout Oneflow-Inc/oneflow#7867
fused_scale_mask_softmax Oneflow-Inc/oneflow#7867
fused_scale_tril Oneflow-Inc/oneflow#7867
fused_self_attention Oneflow-Inc/oneflow#7867
fused_tril_softmax_mask_scale Oneflow-Inc/oneflow#7867
gather_nd Oneflow-Inc/oneflow#7880
gather Oneflow-Inc/oneflow#7880
glu Oneflow-Inc/oneflow#7880
grid_sample Oneflow-Inc/oneflow#7881
groupnorm Oneflow-Inc/oneflow#7885
masked_fill Oneflow-Inc/oneflow#7457
masked_select Oneflow-Inc/oneflow#7492
math_ops Oneflow-Inc/oneflow#7461
matmul Oneflow-Inc/oneflow#7465
maxpool Oneflow-Inc/oneflow#7683
max Oneflow-Inc/oneflow#7450
mean Oneflow-Inc/oneflow#7650
meshgrid Oneflow-Inc/oneflow#7533
min_max_observer Oneflow-Inc/oneflow#7725
min Oneflow-Inc/oneflow#7450
movedim Oneflow-Inc/oneflow#7679
moving_average_min_max_observer Oneflow-Inc/oneflow#7726
mul Oneflow-Inc/oneflow#7717
narrow Oneflow-Inc/oneflow#7647
negative Oneflow-Inc/oneflow#7644
ne Oneflow-Inc/oneflow#7642
nms Oneflow-Inc/oneflow#7536
nonzero Oneflow-Inc/oneflow#7645
normalize Oneflow-Inc/oneflow#7635
ones_like Oneflow-Inc/oneflow#7635
parital_fc Oneflow-Inc/oneflow#7534
permute Oneflow-Inc/oneflow#7635
prod Oneflow-Inc/oneflow#7635
randint Oneflow-Inc/oneflow#7508
rand Oneflow-Inc/oneflow#7508
reshape Oneflow-Inc/oneflow#7472
roi_align Oneflow-Inc/oneflow#7794
scatter_nd Oneflow-Inc/oneflow#7807
scatter_ops Oneflow-Inc/oneflow#7807
sign Oneflow-Inc/oneflow#7818
slice Oneflow-Inc/oneflow#7818
softplus Oneflow-Inc/oneflow#7818
sparse_softmax_cross_entr Oneflow-Inc/oneflow#7298
split Oneflow-Inc/oneflow#7277
sqrt_square_sum Oneflow-Inc/oneflow#7277
squeeze Oneflow-Inc/oneflow#7289
stack Oneflow-Inc/oneflow#7289
stateful_kernel_with_cache Oneflow-Inc/oneflow#7289
std Oneflow-Inc/oneflow#7303
sub Oneflow-Inc/oneflow#7303
sum Oneflow-Inc/oneflow#7303
tensor_ops Oneflow-Inc/oneflow#7307
tensor_scatter_nd_update Oneflow-Inc/oneflow#7308
tile Oneflow-Inc/oneflow#7322
transpose Oneflow-Inc/oneflow#7332
tril Oneflow-Inc/oneflow#7322
TripletMarginLoss Oneflow-Inc/oneflow#7332
triu Oneflow-Inc/oneflow#7882
unfold Oneflow-Inc/oneflow#7883
unfold_tensor Oneflow-Inc/oneflow#7883
unsqueeze Oneflow-Inc/oneflow#7882
upsample Oneflow-Inc/oneflow#7884
var Oneflow-Inc/oneflow#7891
view Oneflow-Inc/oneflow#7886
weight_norm Oneflow-Inc/oneflow#7886
where Oneflow-Inc/oneflow#7886
zeropad2d Oneflow-Inc/oneflow#7886

EP::Primitive

Completed unit tests for the Primitive log_softmax, softmax, copynd, Memset, Memcpy, matmul, batch_matmul, add, binary, unary, fill, etc. (#8132, #8139, #8137, #8109, #8143, #8108, #8154, #8118, #8291)

Exception

Improved exception handling and error messages:

  • Added reshape exception handling. (#7847)

  • Improved the error message of module when the input information does not match. (#7918)

  • Added the MAYBE_NEED_ERROR_MSG_CHECK environment variable to check whether the CHECK functions of Maybe contain an oneflow::Error message; it prompts developers to add error messages. (#7955)

  • Improved the exception error message of gather op.(#7979)

  • Improved LayerNorm error message. (#8090)

  • Optimized the error message when Eager and Graph encounter multiple inconsistent input placement in op. (#8054)

  • Improved the error message checking in activation-related kernel processing logic. (#8080)

  • Improved the error message in tensor.to_global and tensor.to_local. (#8067)

  • Improved the exception error message in the dot kernel. (#8051)

  • Rewrote the exception check in the batch_matmul kernel. (#8186)

  • Fixed the problem of exception error checking when Python parses arg. (#8205)

  • Improved the exception error checking logic of all array functor. (#8116)

  • Improved the exception error checking logic of all binary functor. (#8161)

  • Improved the exception error reporting logic in nn grad functor. (#8210)

  • Added an error message for when Graph.build is not overridden. (#8250)

  • Added TypeError type and device-related error message. (#8057)

  • Improved the error message of Eager SliceBoxing. (#8232)

  • Improved the error message of the broadcast op.

  • Improved the error message of Eager Boxing when it is at runtime. (#7926)

  • Improved the error message of Tensor index. (#8234)

  • Improved the error message in nn.functor. (#7910)

  • Added check for Physical Shape when Graph compiles exec_graph. (#8002)

  • Added default error message for CUDA check. (#8427)

  • Added similar error checking to the add_n calculation. (#8495)

  • Improved the error message of arg sort. (#8513)

  • Improved the error message of bias add. (#8524)

  • Improved the error message in autograd function. (#8496)

  • Improved the error message of batch gather. (#8533)

  • Improved the error message prompts of defense code in autograd. (#8525, #8541)

Build

  • Supported CUDA 11.5 and 11.6. (#7852, #8423)

  • Pinned the version of click to 8.0.0. (#7967)

  • Updated nccl version to 2.12.10. (#7822)

  • Aligned with PyTorch version 1.10.0 by default. (#7019)

  • Updated tvm oneflow frontend dependencies. (#8048)

  • Updated the version of LLVM/MLIR to support IREE. (#8068, #8461)

  • Constrained the protobuf version to between 3.9.2 and 4.0. (#8198)

  • Removed the cfg tool in cmake. (#8218)

  • The CMAKE_INTERPROCEDURAL_OPTIMIZATION option is enabled by default. (#8237)

  • Removed the XRT part from the OneFlow source code; OneFlow-XRT will be used as a third-party plugin for OneFlow. (#8273, #8288)

  • Changed liboneflow to a dynamic library. (#8312)

  • Updated the version of clang-tidy to 14.0.4. Supports the following syntax now: NOLINT, NOLINTNEXTLINE, NOLINTBEGIN & NOLINTEND. (#8306)

  • Removed EXTERNAL_INCLUDE_DIRS; builds only with targets. (#8421)

  • Removed obsolete linkages in cmake. (#8426)

CI

Improved the running speed and stability of CI:

  • Supported CI to automatically upload built docs. (#7894, #7917)

  • Added CI test for IREE. (#8419)

  • Printed the pip package in the container used to test in order to query version information easily. (#7952)

  • Optimized the old version of SpeedTest. (#7871, #7990, #8035)

  • Optimized the memory used by AutoTest. (#7988)

  • Adjusted the threshold of benchmark. (#8043)

  • Adjusted the timeout threshold. (#8103)

  • Optimized the warning output related to __del__ in CI. (#8049)

  • Optimized the interval of gc to improve the test speed. (#8138)

  • Optimized the use of large Tensors in CI unit tests to keep slow gc from slowing down CI. (#8177)

  • Optimized the number of CI build to improve the speed of build. (#8229)

  • Optimized the CI workflow: stop all workflows when a job fails. (#8255)

  • Increased maximum parallelism 5 -> 10. (#8259)

  • Enforced strict CI timeout-minutes. (#8266)

  • Supported optional multi-machine testing via the need-test-distributed tag. (#8372)

  • Tried to use a distributed test cache when testing on multiple machines. (https://github.com/Oneflow-Inc/oneflow/pull/8387/files)

  • Optimized the test time of global test. (#8468)

  • Optimized the execution time of test_math_ops, test_loss, test_activation, test_tensor_part1, test_tensor_part2, and other eager tests. (#8494)

  • Optimized test_convtranspose, test_einsum, test_sqrt_square_sum in expensive eager test. (#8504)

Models

  • Added the test of LiBai in CI. (#7537, #7929)

  • Fixed the speed test for Swin-Transformer. (#7840)

  • Added the benchmark test for flow-vision.(#7806, #8024)

  • Added compatibility tests for conv_mixer, densenet, ghostnet, googlenet, inception_v3, mnasnet, rexnet, rexnet_lite, res2net, shufflenet_v2, squeezenet, convnext, crossformer, efficientnet, levit, mlp_mixer, poolformer, pvt, res_mlp, uniformer, swin_transformer, senet, and other models. Fixed compatibility issues such as the conv2d module's padding parameter not supporting strings, the parameter list of functional.layer_norm not being aligned, and meshgrid not supporting list[tensor] input; added an interface for tensor.reshape_as. (#7942)

  • Fixed the bug of Swin-Transformer dataloader. (#8037)

  • Added single-node 4-GPU tests for models such as InsightFace in the oneflow_face repository. (#8130)

Bug fixes

Graph

  • Fixed the bug of nccl deadlock caused by CUDA kernel asynchronous launch limit for nccl logical kernel in 3-D parallelism. (#7924)

  • Fixed cycle import of scope and session. (#7993)

  • Used log_softmax + nll to make the sparse_softmax_cross_entropy sub-graph calculation more numerically stable. (#7987)

  • Fixed the bug that B2P boxing misses TaskEdge lbi. (#8052)

  • Fixed the problem that compilation fails because an eager free tensor is not in nn.Graph's job. (#8114)

  • Fixed a possible segmentation fault caused by BlobDesc. (#8252)

  • Solved the bug of circular import in python 3.6. (#8268)

  • Solved the problem that Graph's input and parameter/buffer tensors fail to handle non-contiguous tensors. (#8281)

  • Solved the potential deadlock caused by inconsistent partial order execution of multiple ranks in 3-D parallelism. (https://github.com/Oneflow-Inc/oneflow/pull/8226)

  • Fixed the bug that Ibverbs failed to start the environment due to incorrect mtu value in special network environment. (#8451)

  • Solved the potential deadlock caused by the partial-order execution of each rank when the subsequent sub-graph of GradAcc is inserted into the NCCL logical op; at the same time, the subsequent sub-graph of GradAcc is traversed more comprehensively to solve the problem of missing NCCL ops. (#8459)

  • Fixed the bug that NCCL logical kernels do not support the bool type. (#8455)

  • Fixed the bug of tensor detach and clone in Graph. (#8498)

Eager

  • Aligned DataLoader.__next__ interface (#7835)

  • Fixed backtracking failure when calculating higher-order derivatives, which is caused by the capturing of forward detached tensors via AutoGrad

  • Fixed inadequate execution of the semantics of sync by Barrier Instruction (#7702)

  • Fixed memory leak caused by imperfect management of VM instruction count

  • Fixed getitem when tensor device id is not in the current rank

  • Fixed global norm error on gradient calculation for various placements when calling clip grad in pipeline parallelism in eager global mode (#7879)

  • Fixed possible int32 arithmetic overflow caused by Shape.elem_cnt (#8178)

  • Fixed incorrect results produced by Module.to_global when introducing parameters (#8187)

  • Fixed extra GPU memory usage in flow.load and module.load_state_dict (#8301)

  • Fixed extra GPU memory usage when Optimizer loads models (#8310)

  • Fixed the error occurs when loading models via flow.load in multi nodes (#8314)

  • Fixed instability of eager caused by the introduction of callback thread (#8193)

  • Fixed tensor.from_numpy interface to avoid memory leak when the input of numpy is non-contiguous tensor (#8391)

  • Fixed stack overflow when destructing the deep backward computational graph after recursion (#8056)

Operators & Tensor

Global Tensor

  • Fixed global SBP inference of unfold (#7883)

  • Fixed global SBP inference of grid_sample (#7881)

  • Fixed incorrect pass of values in slice boxing kernel in certain cases (#7893)

  • Fixed eager global inplace (#7903)

  • Fixed SBP inference of upsample op (#7884)

  • Fixed SBP inference of ScatterAdd, ScatterUpdate, and ScatterScalarUpdate (#7807)

  • Fixed backward memory error of partial_fc with Global Tensor (#8041)

  • Added support for S(0) in randperm and fixed random ops producing identical local tensors across all ranks under Split (#7571)

  • Fixed tensor getitem index error in global (#8153)

  • Fixed SBP inference of RoiAlign and added global unit test (#7794)

  • Fixed SBP inference of stack op (#8181)

  • Fixed random initialization in median under CPU global (#8245)

  • Fixed SBP inference of narrow op and added global unit test for narrow and chunk (#7750)

  • Improved legal SBP list of batch_matmul (#8385)

  • Fixed NLLLoss’ failure to support model parallelism (#8380)

  • Fixed S->S and S->P inference in Slice Op SBP infer (#8521)

Tensor

  • Fixed the bug that occurs when a Tensor dim is set to -1

  • Fixed failure to directly convert a Tensor to int and float in Python (#7927)

  • Fixed the bug in Tensor.is_contiguous that skips initialization when caching and executes random initialization when getting values (#7785)

  • Fixed the bug in Tensor slice view under 1d contiguous (#7898)

  • Fixed incorrect processing of None value by Tensor.__eq__ (#7938)

  • Fixed unaligned memory size in from_numpy interface (#7963)

  • Fixed incorrect initialization of random seed in Tensor (#7904)

  • Fixed failure of oneflow.Size to create Tensor with a specified shape (#8429)

  • Aligned alpha parameter in Tensor.add (#8140)

Scalar Tensor

  • Fixed failure of add to support Scalar Tensor (#7827)

  • Fixed failure of reduce_sum to support Scalar Tensor (#7866)

  • Fixed failure of one_hot to support Scalar Tensor (#7975)

  • Fixed failure of gather to support Scalar Tensor (#8376)

  • Fixed “memory access out of bounds” error in dim_scatter kernel under Scalar Tensor (#8418)

  • Fixed failure of the start and end parameters of the arange op to support Scalar Tensor (#8522)

  • Fixed failure of all to support Scalar Tensor and 0-Size Tensor (#8547)

0-Size Tensor

  • Fixed failure of conv and deconv to support 0-Size Tensor (#8001)

  • Fixed failure of cuda_check_numerics to support 0-Size Tensor (#8050)

  • Fixed failure of expand and advanced_index to support 0-Size Tensor (#8094)

  • Fixed the bug occurs when processing 0-Size Tensor in repeat_interleave kernel and removed relevant special judge in gather (#8414)

  • Fixed failure of diag to support 0-Size Tensor (#8557)

Operators

  • Fixed sorting in nms unit test (#7831)

  • Aligned the beta and threshold parameters of the softplus op with torch (#7888)

  • Fixed failure of expand to support passing tuples as parameters (#7913)

  • Fixed computation failure in randperm when n is too large (#7908)

  • Fixed failure when passing a list or tuple as parameters to meshgrid (#7933)

  • Fixed the nn.functional.conv2d bug that required all parameters to be specified (#7892)

  • Fixed failure of rand and randn to support tuple as an input (#7914)

  • Fixed the bug occurs in concat when inputs are of inconsistent data types (#7921)

  • Fixed wrong device ids obtained by the generator in certain cases in randn, dropout, randint, rand, random_mask_like, and randperm (#7896)

  • Fixed inconsistent behaviors of __shfl_sync under sm_61 in layernorm (#7978)

  • Fixed failure of scatter op to support negative dim (#7934)

  • Fixed the bug in scatter op nd update value (#7953)

  • Fixed failure of masked_select to support certain Broadcast operations in eager mode (#7984)

  • Fixed the bug in PReLU op when dispatching num_blocks (#8004)

  • Fixed misused numpy forced synchronization logic in index_select python and transferred the logic to functor for implementation (#7965)

  • Aligned dtype parameter in prod (#7932)

  • Fixed the bug occurs when ord = 0 in linalg.vector_norm op; Fixed check on nan/inf by clip_grad (#8007)

  • Fixed failure of min and max to operate on inconsistent dtypes (#8021)

  • Added num_batches_tracked buffer to batch_norm to facilitate transfer of ResNet-18, a torch pretrained model, to OneFlow (#7920)

  • Fixed the misuse of logf, expf, and powf in math kernel (#8038)

  • Added the missing dtype parameter to cumsum and cumprod and provided the Tensor.cumsum and Tensor.cumprod methods (#8065)

  • Fixed possible overflow when dtype is not int64 in non_zero op (#7907)

  • Aligned sum, mean, all, any, and prod operations in reduce (#8085)

  • Fixed incorrect backward computation in cumprod (#8136)

  • Aligned alpha parameter in sub operation (#8026)

  • Fixed shape inference in upsample op (#8105)

  • Fixed failure of addn inplace operation on CPU tensor (#8280)

  • Fixed limit on tensor size in cum backward op based on the size of shared memory (#8289)

  • Improved the logic of dtype inference for arange op (#8338)

  • Fixed NaN propagation of UnaryFunctor (#8346)

  • Fixed ndim check of pad (#8354)

  • Fixed vector check in broadcast_min and broadcast_max backward computations (#8379)

  • Fixed the bug relative to index computation logic in cumprod op (#8388)

  • Fixed possible int32 overflow in softmax and math unary/binary CUDA kernels; for kernels that perform integer division on i in CUDA_1D_KERNEL_LOOP, added an if statement to branch the computation so that there is no performance loss in the common case where int32 suffices (#8472)

  • Fixed failure to pass size via size=(...) in random ops (normal, rand, randn, randint, and randperm) (#8506)

Device

  • Fixed error in cudaGetDeviceCount when CUDA device count=0 (#8184)

  • Fixed possible unregistration of devices caused by hob.ToString method; Used static local variables to establish dependency between static variables of device registration and the static code for device registration (#8235)

  • Fixed cudaErrorNoDevice caused by drive errors (#8262)

  • Fixed memory leak caused by realpath (#8540)

Higher order derivative

  • Introduced AutogradCapturedTensor in backward computation to avoid circular reference and allow correct backtracking to the input gradient node in higher order derivative graph (#7808)

  • Added higher order derivative of sin/cos op; Fixed autograd bugs relative to higher order derivative (#8163)

  • Fixed bugs in backward computation in concat and split_like to support higher order derivative (#8208)

Build

  • Fixed RTD [sphinx] failure to build docstr (#7901)

  • Fixed compilation failure caused by opencv copy header failure (#7944)

  • Fixed failure to generate a new .so in compilation when CMAKE_LINK_DEPENDS_NO_SHARED=YES (#7868)

  • Fixed Eigen url in cmake third party (#8223)

  • Fixed the bug caused by multi-time linking to libof_protoobj in XRT (#8326)

  • Made libproto a dynamic library to avoid collision between static global variables (#8345)

  • Made of_pyext_obj static only when there is one Python extension dynamic library that has Python symbols (#8393)

  • Fixed the bug in undefined symbol: del_curterm in source code compilation (#8398)

  • Fixed false positive warning in gcc11 compilation (#8401)

  • Fixed SegFault that occurs when unzipping dataset in the container by making zlib a dynamic library (#8481)

  • Fixed undefined reference of culibosTlsSetValue (#8479)

  • Fixed stringop-truncation compilation error for gcc9 (#8532)

CI

  • Disabled static link of Simple CI and enabled debug build to avoid too many symbols (#7940)

  • Fixed the bug in the AutoTest fake program; fixed a print error in AutoTest (#8279, #8290)

Module

  • Disabled conv3d test temporarily for its relatively large error of random values (#7969)

  • Reduced test error in nn.LayerNorm (#7941)

  • Optimized input data range of certain math op tests (#8010)

  • Fixed incorrect unit test case in permute (#8083)

  • Aligned error message of chunk to torch (#8096)

  • Fixed incorrect use of permute in tensor tests (#8144)

  • Fixed omission of test cases in instancenorm (#8215)

  • Adjusted unit test threshold for leaky_relu (#8242)

  • Annotated cpu bn grad method that tests with random values (#8257)

  • Skipped test cases of global argmax and median in multi-GPU scenarios (#8264)

  • Adjusted unit test threshold for fused_dot_feature_interaction (#8293)

  • Disabled unit tests for conv_transpose1d, conv_transpose2d, and conv_transpose3d (#8319)

  • Adjusted tolerance setting in embedding_renorm unit test (#8394)

  • Removed test cases with excessive accumulated elements in test_fused_dot_feature_interaction_pooling_sum to avoid overly large sum error (#8425)

Documentation

  • Ensured that all PyTorch references in OneFlow API documentation belong to the same PyTorch version (1.10.0) (#8058)

  • Added "copy" button for code in API docs to facilitate trial runs of sample code (#7997)

  • Refined script that automatically generates version status for OneFlow APIs and fixed bugs in docs (#8546)

  • Refined interface documentation of Tensor and Module (#7823)

    • Refined Tensor.to_global interface documentation and added descriptions of grad_sbp

    • Refined Tensor.to_local interface documentation

    • Added Tensor Attributes docs for oneflow.placement, oneflow.env.all_device_placement, and oneflow.sbp.sbp

    • Added interface documentation for Module.to_consistent (outdated) and Module.to_global

  • Fixed invalid links in Tensor docs and updated consistent to global (#7821)

  • Added docstr for Tensor.sqrt, Tensor.square, Tensor.addmm, Tensor.cosh, Tensor.diagonal, Tensor.log, Tensor.ndim, and Tensor.rsqrt (#7841)

  • Enabled derived classes of pybind11 to add documentation for non-overriding methods and added interface documentation related to Tensor and autograd (#7849)

  • Refined documentation of oneflow.argsort (#7844)

  • Refined documentation of Tensor.zero_, Tensor.is_contiguous, Tensor.is_cuda, and oneflow.nn.functional.layer_norm op (#7839)

  • Refined interface documentation of support_sparse and step in oneflow.optim.Adamw, oneflow.optim.SGD (#7848)

  • Refined interface documentation of LambdaLR.step, ReduceLROnPlateau.in_cooldown, and ReduceLROnPlateau.is_better (#7848)

  • Refined interface documentation of nn.Module (#8190)

  • Refined interface documentation of oneflow.optim.lr_scheduler.PolynomialLR (#8430)

  • Refined docs and formula illustrations for oneflow.nn.CombinedMarginLoss (#8206)

  • Refined documentation of oneflow.logical_and, oneflow.logical_or, oneflow.logical_xor, and oneflow.logical_not (#8297)

  • Fixed the bug in the documentation of quantization ops (#8333)

  • Updated solution in Troubleshooting for the case when libunwind.h is not found (#8336)

  • Restructured API documentation based on features; added and refined docs of features that are unique to OneFlow (#8392)