
Convergence on SFT is too slow and the performance is bad #17

@Kunhao18

1. Description

We are doing supervised fine-tuning of large language models with the peft and trl packages. Convergence is far slower on Ascend NPUs than on GPUs: the loss starts at 1.3 and drops to 0.3 within the first half epoch on a V100, while it stays around 0.8 even after 5 epochs on an Ascend 910B.
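For reference, a minimal sketch of the kind of peft + trl setup involved (the model name, dataset path, and hyperparameters below are placeholders, not our exact script):

from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

# Placeholder model and data; the real script uses our own checkpoint and corpus.
model_name = "meta-llama/Llama-2-7b-hf"
dataset = load_dataset("json", data_files="train.json", split="train")

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# LoRA fine-tuning via peft; values below are illustrative only.
peft_config = LoraConfig(r=64, lora_alpha=16, task_type="CAUSAL_LM")
training_args = TrainingArguments(
    output_dir="./sft-out",
    per_device_train_batch_size=4,
    num_train_epochs=5,
    learning_rate=2e-5,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=1024,
)
trainer.train()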

We are using accelerate launch for distributed training. The training scripts and arguments are identical across the different devices, except for the cuda- and npu-specific parts (sketched below).
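The device-specific part amounts to a switch along these lines (a sketch of the typical pattern, not our literal script); importing torch_npu is what registers the "npu" device:

import torch

# Pick the accelerator: importing torch_npu enables the "npu" device on Ascend;
# otherwise fall back to CUDA (or CPU).
try:
    import torch_npu  # noqa: F401
    device = "npu"
except ImportError:
    device = "cuda" if torch.cuda.is_available() else "cpu"

print(f"training on {device}")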

There was a warning during training that could be the cause of the problem:

.../python3.9/site-packages/torch/autograd/__init__.py:251: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed.  This is not an error, but may impair performance.
grad.sizes() = [1024, 64], strides() = [64, 1]
bucket_view.sizes() = [65536], strides() = [1] (Triggered internally at /usr1/02/workspace/j_yxiCvvHE/pytorch/torch_npu/csrc/distributed/reducer.cpp:314.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
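To see which parameters trigger this, a quick check after one backward pass can compare each parameter's strides with its gradient's strides, since DDP's gradient layout contract expects them to match (a sketch; model stands for the module passed to DDP):

# Sketch: list parameters whose gradient layout differs from the parameter
# layout, i.e. the condition the warning above refers to.
for name, param in model.named_parameters():
    grad = param.grad
    if grad is not None and grad.stride() != param.stride():
        print(name, "param strides", tuple(param.stride()),
              "grad strides", tuple(grad.stride()))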

We checked the source code referenced by the message and found a difference in reducer.cpp between the original torch and torch_npu:

  • In torch
void Reducer::initialize_bucket_views(Reducer::Bucket& bucket) {
  const auto& gradients = bucket.gradients;
  for (const auto i : c10::irange(bucket.variables.size())) {
    auto& v = bucket.variables[i];
    const auto offset = bucket.offsets[i];
    const auto length = bucket.lengths[i];
    // TODO(@egienvalue): remove special case after view ops are fully
    // supported on MTIA.
    // In general, on MTIA, due to the special memory layout, it doesn't
    // support as_strided which creates a view tensor and aten::view will
    // create a new tensor on MTIA for now.
    if (v.is_non_overlapping_and_dense() && !v.is_mtia()) {
      // If the param's memory is dense, match its layout, anticipating
      // the autograd engine (AccumulateGrad) will also create gradients
      // matching its layout.
      bucket.bucket_views_in.push_back(
          gradients.as_strided(v.sizes(), v.strides(), offset));
    } else {
      // Fall back to a C-style contiguous view, again anticipating
      // AccumulateGrad will do the same when stashing grads for non-dense
      // params.
      bucket.bucket_views_in.push_back(
          gradients.narrow(0, offset, length).view(v.sizes()));
    }
...
  • In torch_npu
void Reducer::initialize_bucket_views(
    Reducer::BucketReplica& replica,
    at::Tensor& contents) {
  for (const auto i : c10::irange(replica.variables.size())) {
    auto& v = replica.variables[i];
    const auto offset = replica.offsets[i];
    const auto length = replica.lengths[i];
    // element size of 'bucket_views_in' depends on variable 'gradient_as_bucket_view_'.
    if (!gradient_as_bucket_view_) {
        replica.bucket_views_in.push_back(contents.narrow(0, offset, length));
    } else {
        replica.bucket_views_in.push_back(contents.narrow(0, offset, length).view(v.sizes()));
    }

It seems torch_npu does not support matching the bucket view's strides to the gradient layout: unlike the torch version above, there is no as_strided branch for dense parameters, only a flat or C-contiguous view.
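If that missing branch is indeed the problem, one experiment on our side (an assumption, not a confirmed fix) would be to enable gradient_as_bucket_view through accelerate's DDP kwargs, so that torch_npu's initialize_bucket_views takes the narrow(...).view(v.sizes()) branch instead of the flat one:

from accelerate import Accelerator, DistributedDataParallelKwargs

# Experiment sketch: forward gradient_as_bucket_view=True to DistributedDataParallel,
# which flips the gradient_as_bucket_view_ flag checked in torch_npu's reducer.
ddp_kwargs = DistributedDataParallelKwargs(gradient_as_bucket_view=True)
accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])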

2. Environment

  • OS: Ubuntu 20.04
  • kernel: 4.19.90-vhulk2211.3.0.h1543.eulerosv2r10.aarch64
  • CANN: 7.0.RC1
  • Python: 3.9
  • torch: 2.1.0
  • torch_npu: 2.1.0rc1.post20231013
