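# Accelerating OLMo Model Training with TorchAcc

TorchAcc is a distributed training acceleration framework built on PyTorch. This notebook shows how adding a few lines of TorchAcc code speeds up OLMo model training.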

# 1. Environment Setup

Install the dependencies required by OLMo.


In [1]:
! pip install boto3 cached-path omegaconf rich
! pip install git+https://github.com/allenai/OLMo --no-deps

Looking in indexes: http://mirrors.cloud.aliyuncs.com/pypi/simple/
[notice] A new release of pip is available: 23.3.2 -> 24.0
[notice] To update, run: python3.10 -m pip install --upgrade pip
Looking in indexes: http://mirrors.cloud.aliyuncs.com/pypi/simple/
Collecting git+https://github.com/allenai/OLMo
  Cloning https://github.com/allenai/OLMo to /tmp/pip-req-build-a1itauad
  Running command git clone --filter=blob:none --quiet https://github.com/allenai/OLMo /tmp/pip-req-build-a1itauad
  Resolved https://github.com/allenai/OLMo to commit 9fd9130d25ac2249c381a6283e6ea8d954aeab23
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done

In [2]:
import datasets
import itertools
import torch
import torchacc
from transformers import AutoModelForCausalLM, AutoTokenizer, get_scheduler, default_data_collator



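# 2. Prepare the Training Data

The helper below loads the WikiText-2 training split, tokenizes it, concatenates the tokens, and splits them into fixed-length blocks of `max_seq_length` tokens.
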
In [3]:
def get_dataloader(tokenizer, max_seq_length, batch_size):
    raw_datasets = datasets.load_dataset("wikitext", "wikitext-2-raw-v1", split='train')
    column_names = list(raw_datasets.features)
    text_column_name = 'text' if 'text' in column_names else column_names[0]

    def tokenize_function(examples):
        return tokenizer(examples[text_column_name], return_token_type_ids=False)

    tokenized_datasets = raw_datasets.map(tokenize_function, batched=True, remove_columns=column_names)
    block_size = max_seq_length

    def group_texts(examples):
        concatenated_examples = {k: list(itertools.chain(*examples[k])) for k in examples.keys()}
        total_length = len(concatenated_examples[list(examples.keys())[0]])
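        # When at least one full block exists, drop the trailing remainder
        # so that every chunk is exactly block_size tokens long.
        if total_length >= block_size:
            total_length = (total_length // block_size) * block_size
        result = {
            k: [t[i:i + block_size] for i in range(0, total_length, block_size)]
            for k, t in concatenated_examples.items()
        }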
        result['labels'] = result['input_ids'].copy()
        return result

    train_dataset = tokenized_datasets.map(group_texts, batched=True)
    train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, collate_fn=default_data_collator)

    return train_dataloader
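
To make the chunking step concrete, here is a minimal, dependency-free sketch of the same concatenate-and-split logic used by `group_texts`, run on hand-written token IDs. The function name `chunk_token_lists` and the toy values are illustrative assumptions and are not part of the notebook.

```python
import itertools


def chunk_token_lists(token_lists, block_size):
    """Concatenate token lists and split them into block_size-long chunks.

    Mirrors the grouping logic above: if at least one full block exists,
    the trailing remainder is dropped.
    """
    tokens = list(itertools.chain(*token_lists))
    total_length = len(tokens)
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    return [tokens[i:i + block_size] for i in range(0, total_length, block_size)]


# Three toy "documents" with 3 + 4 + 3 = 10 tokens in total.
toy_docs = [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]]
chunks = chunk_token_lists(toy_docs, block_size=4)
print(chunks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Document boundaries are ignored on purpose: tokens from neighboring texts end up in the same block, and the last two tokens are discarded because they do not fill a block.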

# 3. Define the Model

In [4]:
model = AutoModelForCausalLM.from_pretrained("allenai/OLMo-1B", cache_dir="./hf_models", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-1B", use_fast=False, trust_remote_code=True)
train_loader = get_dataloader(tokenizer, max_seq_length=2048, batch_size=1)

print(model)

OLMoForCausalLM(
  (model): Olmo(
    (transformer): ModuleDict(
      (wte): Embedding(50304, 2048)
      (emb_drop): Dropout(p=0.0, inplace=False)
      (ln_f): LayerNorm()
      (blocks): ModuleList(
        (0-15): 16 x OlmoSequentialBlock(
          (dropout): Dropout(p=0.0, inplace=False)
          (act): SwiGLU()
          (attn_out): Linear(in_features=2048, out_features=2048, bias=False)
          (ff_out): Linear(in_features=8192, out_features=2048, bias=False)
          (rotary_emb): RotaryEmbedding()
          (attn_norm): LayerNorm()
          (ff_norm): LayerNorm()
          (att_proj): Linear(in_features=2048, out_features=6144, bias=False)
          (ff_proj): Linear(in_features=2048, out_features=16384, bias=False)
        )
      )
      (ff_out): Embedding(50304, 2048)
    )
  )
)
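
The printed module tree is enough to sanity-check the model size. The sketch below sums the weight shapes shown above. It assumes that the `LayerNorm` modules hold no weights (none are printed) and that the output embedding `ff_out` shares its weights with `wte`; both are assumptions made for this estimate, not statements from the notebook.

```python
# Weight shapes copied from the printed module tree (all Linear layers are bias-free).
VOCAB_SIZE, D_MODEL, N_BLOCKS = 50304, 2048, 16

embedding = VOCAB_SIZE * D_MODEL  # wte: Embedding(50304, 2048)

per_block = (
    D_MODEL * 6144         # att_proj: Linear(2048 -> 6144)
    + D_MODEL * D_MODEL    # attn_out: Linear(2048 -> 2048)
    + D_MODEL * 16384      # ff_proj:  Linear(2048 -> 16384)
    + 8192 * D_MODEL       # ff_out:   Linear(8192 -> 2048)
)

# Assumption: LayerNorm has no weights and the output embedding is tied
# to wte, so neither adds to the total.
total = embedding + N_BLOCKS * per_block
print(f"{total:,} weights (~{total / 1e9:.2f}B)")  # 1,176,764,416 weights (~1.18B)
```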


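# 4. Accelerate Model Training with TorchAcc

Accelerating model training with TorchAcc generally takes three steps:

1. Define a `torchacc.Config`
   
   Create a `torchacc.Config` and set the acceleration options.
   
2. Call `torchacc.accelerate`
   
   Call `torchacc.accelerate` with the model and the config to prepare the model for accelerated training.
   
3. Accelerate data loading
   
   Wrap the PyTorch `DataLoader` with `torchacc.AsyncLoader` to speed up data loading.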

In [5]:
# Define a minimal TorchAcc configuration
config = torchacc.Config()
config.compute.bf16 = True # enable bf16
config.compute.acc_scaled_dot_attn = True # automatically replace torch scaled dot-product attention with TorchAcc's flash-attention version
config.dist.fsdp.size = torchacc.dist.world_size() # enable FSDP and set the number of FSDP shards
config.dist.fsdp.wrap_layer_cls = {"OlmoSequentialBlock"} # wrap each OLMo decoder layer as an FSDP unit

# Accelerate the model with one line of code
model = torchacc.accelerate(model=model, config=config)

# Load data asynchronously to speed up input
train_loader = torchacc.AsyncLoader(train_loader, model.device)
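
As a rough illustration of what setting `config.dist.fsdp.size` buys, the sketch below estimates per-device memory for model state when weights, gradients, and AdamW states are split evenly across the FSDP shards. The helper `model_state_gib`, the 16 bytes per parameter (fp32 weights, fp32 gradients, and two fp32 AdamW moments), and the parameter count of about 1.18 billion (summed from the layer shapes printed in section 3) are all assumptions for this back-of-the-envelope estimate. TorchAcc's actual memory use depends on its internals and on activations, which are ignored here.

```python
def model_state_gib(num_params, fsdp_size, bytes_per_param=16):
    """Estimate per-device memory (GiB) for weights, gradients, and optimizer state.

    Assumes perfectly even sharding and ignores activations and buffers.
    """
    return num_params * bytes_per_param / fsdp_size / 2**30


NUM_PARAMS = 1_176_764_416  # assumed count, derived from the printed layer shapes

for fsdp_size in (1, 2, 4, 8):
    gib = model_state_gib(NUM_PARAMS, fsdp_size)
    print(f"fsdp_size={fsdp_size}: ~{gib:.1f} GiB of model state per device")
```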



# 5. Define the Optimizer and LR Scheduler

In [6]:
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, betas=(0.9, 0.999), eps=1e-8)
lr_scheduler = get_scheduler(name='linear', optimizer=optimizer, num_warmup_steps=0, num_training_steps=len(train_loader))
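
With `name='linear'` and `num_warmup_steps=0`, the learning rate starts at its base value and decays linearly to zero over the training steps. The sketch below reproduces that shape in plain Python so the schedule can be inspected without a model. The function `linear_lr` is a hypothetical helper, and the default of 1154 training steps is taken from the training log in section 6.

```python
def linear_lr(step, base_lr=5e-5, num_warmup_steps=0, num_training_steps=1154):
    """Learning rate after `step` scheduler steps under linear warmup and decay."""
    if step < num_warmup_steps:
        # Warmup phase: ramp up linearly from 0 to base_lr.
        return base_lr * step / max(1, num_warmup_steps)
    # Decay phase: ramp down linearly to 0 at num_training_steps.
    remaining = num_training_steps - step
    decay_steps = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_steps)


for step in (0, 577, 1154):
    print(step, linear_lr(step))
```

Halfway through training (step 577) the rate is half the base value, and it reaches zero at the final step.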

# 6. Train the Model

In [7]:
model.train()

for step, inputs in enumerate(train_loader):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(dtype=torch.bfloat16):
        loss = model(**inputs)['loss']
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    lr_scheduler.step()

    if step % 10 == 0:
        torchacc.sync()
        print(f'[Iteration {step}/{len(train_loader)}] loss: {loss:.2f} .')

W0000 00:00:1708095186.985443    2027 hlo_rematerialization.cc:2946] Can't reduce memory use below 179.31MiB (188023804 bytes) by rematerialization; only reduced to 24.20GiB (25986106296 bytes), down from 25.15GiB (26999801936 bytes) originally


[Iteration 0/1154] loss: 2.66 .


W0000 00:00:1708095196.056070    2027 hlo_rematerialization.cc:2946] Can't reduce memory use below 194.51MiB (203962189 bytes) by rematerialization; only reduced to 23.80GiB (25559335832 bytes), down from 24.74GiB (26560471082 bytes) originally


[Iteration 10/1154] loss: 3.25 .
[Iteration 20/1154] loss: 3.22 .
[Iteration 30/1154] loss: 3.69 .
[Iteration 40/1154] loss: 3.17 .
[Iteration 50/1154] loss: 3.23 .
[Iteration 60/1154] loss: 3.27 .
[Iteration 70/1154] loss: 3.28 .
[Iteration 80/1154] loss: 3.44 .
[Iteration 90/1154] loss: 3.33 .
[Iteration 100/1154] loss: 3.32 .
[Iteration 110/1154] loss: 3.47 .
[Iteration 120/1154] loss: 3.54 .
[Iteration 130/1154] loss: 3.51 .
[Iteration 140/1154] loss: 3.17 .
[Iteration 150/1154] loss: 3.43 .
[Iteration 160/1154] loss: 3.42 .
[Iteration 170/1154] loss: 3.46 .
[Iteration 180/1154] loss: 3.45 .
[Iteration 190/1154] loss: 3.16 .
[Iteration 200/1154] loss: 3.50 .
[Iteration 210/1154] loss: 3.30 .
[Iteration 220/1154] loss: 3.56 .
[Iteration 230/1154] loss: 3.06 .
[Iteration 240/1154] loss: 3.50 .
[Iteration 250/1154] loss: 3.42 .
[Iteration 260/1154] loss: 3.39 .
[Iteration 270/1154] loss: 3.51 .
[Iteration 280/1154] loss: 3.80 .
[Iteration 290/1154] loss: 3.90 .
[Iteration 300/1154] lo

W0000 00:00:1708095716.881173    2027 hlo_rematerialization.cc:2946] Can't reduce memory use below 194.51MiB (203962193 bytes) by rematerialization; only reduced to 23.80GiB (25559335832 bytes), down from 24.74GiB (26560471082 bytes) originally


[Iteration 1090/1154] loss: 2.59 .
[Iteration 1100/1154] loss: 3.27 .
[Iteration 1110/1154] loss: 3.09 .
[Iteration 1120/1154] loss: 3.08 .
[Iteration 1130/1154] loss: 3.29 .
[Iteration 1140/1154] loss: 3.12 .
[Iteration 1150/1154] loss: 2.92 .
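
The training loop clips gradients with `torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)`, which rescales all gradients together whenever their joint L2 norm exceeds the limit. The following plain-Python sketch shows that rule on toy numbers. The helper `clip_by_global_norm` and the small `eps` term are illustrative assumptions, not the PyTorch source.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale a list of gradient vectors so their joint L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # Never scale up: gradients already within the limit are left unchanged.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [[g * scale for g in vec] for vec in grads], total_norm


grads = [[3.0, 0.0], [0.0, 4.0]]  # toy gradients with global norm 5.0
clipped, norm_before = clip_by_global_norm(grads)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before, round(norm_after, 4))
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its length is bounded.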


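# 7. Distributed Training

Multi-GPU distributed programs are hard to run inside Jupyter. To run distributed training, we recommend running the script below on a GPU development machine or on PAI-DLC. To produce `train_olmo.py`, export this notebook as a Python file from Jupyter: `File -> Save and Export Notebook As -> Executable Script`.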

```shell
pip install boto3 cached-path omegaconf rich
pip install git+https://github.com/allenai/OLMo --no-deps

torchrun --nproc_per_node=4 train_olmo.py
```
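
`torchrun` starts one process per GPU (four here) and passes each process its identity through the environment variables `RANK`, `LOCAL_RANK`, and `WORLD_SIZE`. As a small sketch, a script can read them as follows; the helper `read_torchrun_env` is hypothetical and falls back to single-process values when the script is started without `torchrun`.

```python
import os


def read_torchrun_env(environ=os.environ):
    """Return rank information from the environment, defaulting to a single process."""
    return {
        "rank": int(environ.get("RANK", 0)),              # global process index
        "local_rank": int(environ.get("LOCAL_RANK", 0)),  # process index on this node
        "world_size": int(environ.get("WORLD_SIZE", 1)),  # total number of processes
    }


info = read_torchrun_env()
print(f"rank {info['rank']} of {info['world_size']} (local rank {info['local_rank']})")
```

Printing this at the top of `train_olmo.py` is a quick way to confirm that all four processes were launched before the model is loaded.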