
Convergence on SFT is too slow and the performance is bad #17

@Kunhao18

1. Description

We are doing supervised fine-tuning of large language models with the peft and trl packages. Convergence is far slower on Ascend NPUs than on GPUs: the loss starts at 1.3 and drops to 0.3 within the first half epoch on a V100, while it stays around 0.8 even after 5 epochs on an Ascend 910B.
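For reference, a minimal sketch of the kind of peft + trl setup involved (the model name, dataset path, and hyperparameters below are placeholders, not our exact script):

from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

# Placeholder model and data; the real script uses our own checkpoint and corpus.
model_name = "meta-llama/Llama-2-7b-hf"
dataset = load_dataset("json", data_files="train.json", split="train")

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# LoRA fine-tuning via peft; values below are illustrative only.
peft_config = LoraConfig(r=64, lora_alpha=16, task_type="CAUSAL_LM")
training_args = TrainingArguments(
    output_dir="./sft-out",
    per_device_train_batch_size=4,
    num_train_epochs=5,
    learning_rate=2e-5,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=1024,
)
trainer.train()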

We are using accelerate launch for distributed training. The training scripts and arguments are identical across the different devices, except for the cuda- and npu-specific parts (sketched below).
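The device-specific part amounts to a switch along these lines (a sketch of the typical pattern, not our literal script); importing torch_npu is what registers the "npu" device:

import torch

# Pick the accelerator: importing torch_npu enables the "npu" device on Ascend;
# otherwise fall back to CUDA (or CPU).
try:
    import torch_npu  # noqa: F401
    device = "npu"
except ImportError:
    device = "cuda" if torch.cuda.is_available() else "cpu"

print(f"training on {device}")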

There was a warning during training that could be the cause of the problem:

.../python3.9/site-packages/torch/autograd/__init__.py:251: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed.  This is not an error, but may impair performance.
grad.sizes() = [1024, 64], strides() = [64, 1]
bucket_view.sizes() = [65536], strides() = [1] (Triggered internally at /usr1/02/workspace/j_yxiCvvHE/pytorch/torch_npu/csrc/distributed/reducer.cpp:314.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
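To see which parameters trigger this, a quick check after one backward pass can compare each parameter's strides with its gradient's strides, since DDP's gradient layout contract expects them to match (a sketch; model stands for the module passed to DDP):

# Sketch: list parameters whose gradient layout differs from the parameter
# layout, i.e. the condition the warning above refers to.
for name, param in model.named_parameters():
    grad = param.grad
    if grad is not None and grad.stride() != param.stride():
        print(name, "param strides", tuple(param.stride()),
              "grad strides", tuple(grad.stride()))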

We checked the source code referenced by the message and found a difference in reducer.cpp between the original torch and torch_npu:

  • In torch
void Reducer::initialize_bucket_views(Reducer::Bucket& bucket) {
  const auto& gradients = bucket.gradients;
  for (const auto i : c10::irange(bucket.variables.size())) {
    auto& v = bucket.variables[i];
    const auto offset = bucket.offsets[i];
    const auto length = bucket.lengths[i];
    // TODO(@egienvalue): remove special case after view ops are fully
    // supported on MTIA.
    // In general, on MTIA, due to the special memory layout, it doesn't
    // support as_strided which creates a view tensor and aten::view will
    // create a new tensor on MTIA for now.
    if (v.is_non_overlapping_and_dense() && !v.is_mtia()) {
      // If the param's memory is dense, match its layout, anticipating
      // the autograd engine (AccumulateGrad) will also create gradients
      // matching its layout.
      bucket.bucket_views_in.push_back(
          gradients.as_strided(v.sizes(), v.strides(), offset));
    } else {
      // Fall back to a C-style contiguous view, again anticipating
      // AccumulateGrad will do the same when stashing grads for non-dense
      // params.
      bucket.bucket_views_in.push_back(
          gradients.narrow(0, offset, length).view(v.sizes()));
    }
...
  • In torch_npu
void Reducer::initialize_bucket_views(
    Reducer::BucketReplica& replica,
    at::Tensor& contents) {
  for (const auto i : c10::irange(replica.variables.size())) {
    auto& v = replica.variables[i];
    const auto offset = replica.offsets[i];
    const auto length = replica.lengths[i];
    // element size of 'bucket_views_in' depends on variable 'gradient_as_bucket_view_'.
    if (!gradient_as_bucket_view_) {
        replica.bucket_views_in.push_back(contents.narrow(0, offset, length));
    } else {
        replica.bucket_views_in.push_back(contents.narrow(0, offset, length).view(v.sizes()));
    }

It seems torch_npu does not support matching the bucket view's strides to the gradient layout: unlike the torch version above, there is no as_strided branch for dense parameters, only a flat or C-contiguous view.
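If that missing branch is indeed the problem, one experiment on our side (an assumption, not a confirmed fix) would be to enable gradient_as_bucket_view through accelerate's DDP kwargs, so that torch_npu's initialize_bucket_views takes the narrow(...).view(v.sizes()) branch instead of the flat one:

from accelerate import Accelerator, DistributedDataParallelKwargs

# Experiment sketch: forward gradient_as_bucket_view=True to DistributedDataParallel,
# which flips the gradient_as_bucket_view_ flag checked in torch_npu's reducer.
ddp_kwargs = DistributedDataParallelKwargs(gradient_as_bucket_view=True)
accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])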

2. Environment

  • OS: Ubuntu 20.04
  • kernel: 4.19.90-vhulk2211.3.0.h1543.eulerosv2r10.aarch64
  • CANN: 7.0.RC1
  • Python: 3.9
  • torch: 2.1.0
  • torch_npu: 2.1.0rc1.post20231013
