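# New Embodiment Finetuning Tutorial

This notebook is a tutorial on how to finetune the GR00T-N1 pretrained model on a new dataset.

# 1. Lerobot SO100 Finetuning Tutorial

GR00T-N1 is meant to be accessible to everyone, across a wide range of robot form factors. Building on Huggingface's low-cost [So100 Lerobot arm](https://github.com/huggingface/lerobot/blob/main/examples/10_use_so100.md), users can finetune GR00T-N1 on their own robot via the `new_embodiment` tag.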


![so100_eval_demo.gif](../media/so100_eval_demo.gif)

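## Step 1: Dataset

Users can finetune on any lerobot dataset. In this tutorial, we start with a sample dataset: [so100_strawberry_grape](https://huggingface.co/spaces/lerobot/visualize_dataset?dataset=youliangtan%2Fso100_strawberry_grape&episode=0)

Note that this embodiment was not used in our pretraining dataset mixture.


### First, download the dataset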

```bash
huggingface-cli download --repo-type dataset youliangtan/so100_strawberry_grape --local-dir ./demo_data/so100_strawberry_grape
```

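### Second, copy over the modality file

The `modality.json` file provides additional information about the state and action modalities to make the dataset "GR00T-compatible". Copy `examples/so100__modality.json` to `<DATASET_PATH>/meta/modality.json` in the dataset.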

```bash
cp examples/so100__modality.json ./demo_data/so100_strawberry_grape/meta/modality.json
```
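
Before loading the dataset, it can help to sanity-check the copied file. The sketch below is a hypothetical helper, not part of `gr00t`: it assumes only the layout this tutorial describes for `modality.json` (`state` and `action` sections whose fields carry integer `start` and `end` indices, plus a `video` section) and reports anything that deviates from it.

```python
import json
from pathlib import Path


def check_modality_file(path):
    """Return a list of problems found in a modality.json file (empty if none)."""
    meta = json.loads(Path(path).read_text())
    problems = []
    for section in ("state", "action"):
        fields = meta.get(section)
        if not fields:
            problems.append(f"missing or empty '{section}' section")
            continue
        for name, spec in fields.items():
            start, end = spec.get("start"), spec.get("end")
            if not isinstance(start, int) or not isinstance(end, int):
                problems.append(f"{section}.{name}: 'start' and 'end' must be integers")
            elif not 0 <= start < end:
                problems.append(
                    f"{section}.{name}: expected 0 <= start < end, got {start}, {end}"
                )
    if not meta.get("video"):
        problems.append("missing or empty 'video' section")
    return problems
```

Calling `check_modality_file("./demo_data/so100_strawberry_grape/meta/modality.json")` after the `cp` step should return an empty list if the file matches these assumptions.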

Then we can load the dataset using the `LeRobotSingleDataset` class.

In [None]:
from gr00t.utils.misc import any_describe
from gr00t.data.dataset import LeRobotSingleDataset
from gr00t.experiment.data_config import DATA_CONFIG_MAP

dataset_path = "./demo_data/so100_strawberry_grape"   # change this to your dataset path

data_config = DATA_CONFIG_MAP["so100"]

dataset = LeRobotSingleDataset(
    dataset_path=dataset_path,
    modality_configs=data_config.modality_config(),
    embodiment_tag="new_embodiment",
    video_backend="torchvision_av",
)

resp = dataset[7]
any_describe(resp)

In [None]:
# visualize the dataset
# show img
import matplotlib.pyplot as plt

# sample every 10th frame from the first 100 frames
images_list = [dataset[i]["video.webcam"][0] for i in range(0, 100, 10)]

fig, axs = plt.subplots(2, 5, figsize=(20, 10))
for i, ax in enumerate(axs.flat):
    ax.imshow(images_list[i])
    ax.axis("off")
    ax.set_title(f"Image {i}")
plt.tight_layout() # adjust the subplots to fit into the figure area.
plt.show()

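## Step 2: Finetuning

Finetuning is done with our finetuning script, `scripts/gr00t_finetune.py`. Point `--dataset-path` to wherever the dataset was downloaded (`./demo_data/so100_strawberry_grape` if you followed Step 1).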


```bash
python scripts/gr00t_finetune.py \
   --dataset-path /datasets/so100_strawberry_grape/ \
   --num-gpus 1 \
   --output-dir ~/so100-checkpoints  \
   --max-steps 2000 \
   --data-config so100 \
   --video-backend torchvision_av
```

## Step 3: Open-loop evaluation

Once training is done, you can run the following command to visualize the finetuned policy.

```bash
python scripts/eval_policy.py --plot \
   --embodiment_tag new_embodiment \
   --model_path <YOUR_CHECKPOINT_PATH> \
   --data_config so100 \
   --dataset_path /datasets/so100_strawberry_grape/ \
   --video_backend torchvision_av \
   --modality_keys single_arm gripper
```

Here is the plot after training the policy for 7000 steps.

![so100-7k-steps.png](../media/so100-7k-steps.png)


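After training for more steps, the plot looks significantly better.


Awesome! You have successfully finetuned GR00T-N1 on a new embodiment.

## Deployment

For more details about deployment, please refer to the notebook: `5_policy_deployment.md`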

---

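# 2. G1 Block Stacking Dataset Finetuning Tutorial

This section is a step-by-step guide to finetuning GR00T-N1 on the G1 Block Stacking dataset.

## Step 1: Dataset

Loading any dataset for finetuning takes 2 steps:
- 1.1: Define the modality configs and transforms for the dataset
- 1.2: Load the dataset using the `LeRobotSingleDataset` class

### Step 1.0: Download the dataset

- Download the dataset from: https://huggingface.co/datasets/unitreerobotics/G1_BlockStacking_Dataset
- Copy `examples/unitree_g1_blocks__modality.json` to `<DATASET_PATH>/meta/modality.json` in the dataset
  - This provides additional information about the state and action modalities to make the dataset "GR00T-compatible"
  - `cp examples/unitree_g1_blocks__modality.json datasets/G1_BlockStacking_Dataset/meta/modality.json`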


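**Understanding the modality configs**

This file provides detailed metadata about the state and action modalities, enabling:

- **Separation of data storage and interpretation:**
  - **State and action:** Stored as concatenated float32 arrays. The `modality.json` file supplies the metadata needed to interpret these arrays as distinct, fine-grained fields with additional training information.
  - **Video:** Stored as separate files; the configuration file allows them to be renamed to a standardized format.
  - **Annotations:** Keeps track of all annotation fields. If there are no annotations, do not include the `annotation` field in the configuration file.
- **Fine-grained splitting:** Divides the state and action arrays into more semantically meaningful fields.
- **Clear mapping:** Explicit mapping of data dimensions.
- **Sophisticated data transformations:** Supports field-specific normalization and rotation transformations during training.

#### Schema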

```json
{
    "state": {
        "<state_name>": {
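            "start": <int>,         // Starting index in the state array
            "end": <int>            // Ending index in the state array
        }
    },
    "action": {
        "<action_name>": {
            "start": <int>,         // Starting index in the action array
            "end": <int>            // Ending index in the action array
        }
    },
    "video": {
        "<video_name>": {}  // Empty dictionary to maintain consistency with other modalities
    },
    "annotation": {
        "<annotation_name>": {}  // Empty dictionary to maintain consistency with other modalities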
    }
}
```

An example is shown in `getting_started/examples/unitree_g1_blocks__modality.json`. This file is located in the `meta` folder of the lerobot dataset.
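
To make the `start`/`end` metadata concrete, here is a minimal sketch of how a concatenated state vector could be split into named fields. It assumes `end` is exclusive, as in a Python slice; `split_by_modality`, the field sizes, and the values are made up for illustration and are not taken from the G1 dataset.

```python
def split_by_modality(vector, fields):
    """Slice a flat state/action vector into named fields using [start, end) ranges."""
    return {name: vector[spec["start"]:spec["end"]] for name, spec in fields.items()}


# Hypothetical layout: a 6-dimensional state holding two 3-dimensional arms.
state_fields = {
    "left_arm": {"start": 0, "end": 3},
    "right_arm": {"start": 3, "end": 6},
}
flat_state = [0.1, -0.4, 0.7, 0.0, 0.3, 1.0]

parts = split_by_modality(flat_state, state_fields)
print(parts["left_arm"])   # [0.1, -0.4, 0.7]
print(parts["right_arm"])  # [0.0, 0.3, 1.0]
```

Keeping the storage flat and the interpretation in metadata means the same array can be re-split by editing `modality.json` alone.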


Generate the stats (`meta/metadata.json`) by running the following command:
```bash
python scripts/load_dataset.py --data_path /datasets/G1_BlockStacking_Dataset/ --embodiment_tag new_embodiment
```
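
The tutorial does not spell out what the generated statistics contain. Assuming they are per-dimension summaries of the kind a normalization step needs (minimum, maximum, mean, standard deviation), the idea can be sketched as follows; `per_dimension_stats` and the sample vectors are hypothetical and this is not the logic of `scripts/load_dataset.py`.

```python
import statistics


def per_dimension_stats(rows):
    """Compute min/max/mean/std for each column of a list of equal-length vectors."""
    columns = list(zip(*rows))
    return {
        "min": [min(c) for c in columns],
        "max": [max(c) for c in columns],
        "mean": [statistics.fmean(c) for c in columns],
        "std": [statistics.pstdev(c) for c in columns],  # population std
    }


# Three made-up 2-dimensional state vectors.
states = [[0.0, 10.0], [1.0, 20.0], [2.0, 30.0]]
stats = per_dimension_stats(states)
print(stats["min"], stats["max"])  # [0.0, 10.0] [2.0, 30.0]
```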

In [None]:
from gr00t.data.schema import EmbodimentTag

In [None]:
dataset_path = "./demo_data/g1"  # change this to your dataset path
embodiment_tag = EmbodimentTag.NEW_EMBODIMENT

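### Step 1.1: Modality configs and transforms

Modality configs let you select which specific data streams to use for each input type (video, state, action, language, etc.) during finetuning, giving you precise control over which parts of the dataset are used.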

In [None]:
from gr00t.data.dataset import ModalityConfig


# select the modality keys you want to use for finetuning
video_modality = ModalityConfig(
    delta_indices=[0],
    modality_keys=["video.cam_right_high"],
)

state_modality = ModalityConfig(
    delta_indices=[0],
    modality_keys=["state.left_arm", "state.right_arm", "state.left_hand", "state.right_hand"],
)

action_modality = ModalityConfig(
    delta_indices=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
    modality_keys=["action.left_arm", "action.right_arm", "action.left_hand", "action.right_hand"],
)

language_modality = ModalityConfig(
    delta_indices=[0],
    modality_keys=["annotation.human.task_description"],
)

modality_configs = {
    "video": video_modality,
    "state": state_modality,
    "action": action_modality,
    "language": language_modality,
}
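
The `delta_indices` above are offsets relative to the sampled time step: `[0]` picks only the current frame, while `0..15` picks a 16-step action chunk. A minimal sketch of that mapping follows. `frames_for_step` is a hypothetical helper, and clamping at the episode boundary is an assumption about how out-of-range offsets might be handled, not a description of what `LeRobotSingleDataset` does.

```python
def frames_for_step(base_index, delta_indices, episode_length):
    """Map relative offsets to absolute frame indices, clamped to the episode bounds."""
    last = episode_length - 1
    return [min(max(base_index + d, 0), last) for d in delta_indices]


action_deltas = list(range(16))  # same offsets as action_modality above

print(frames_for_step(7, [0], episode_length=100))                  # [7]
print(len(frames_for_step(7, action_deltas, episode_length=100)))   # 16
print(frames_for_step(95, action_deltas, episode_length=100)[-6:])  # [99, 99, 99, 99, 99, 99]
```

The length of `delta_indices` is what later becomes the `state_horizon` and `action_horizon` passed to `GR00TTransform`.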

In [None]:
from gr00t.data.transform.base import ComposedModalityTransform
from gr00t.data.transform import VideoToTensor, VideoCrop, VideoResize, VideoColorJitter, VideoToNumpy
from gr00t.data.transform.state_action import StateActionToTensor, StateActionTransform
from gr00t.data.transform.concat import ConcatTransform
from gr00t.model.transforms import GR00TTransform


# select the transforms you want to apply to the data
to_apply_transforms = ComposedModalityTransform(
    transforms=[
        # video transforms
        VideoToTensor(apply_to=video_modality.modality_keys, backend="torchvision"),
        VideoCrop(apply_to=video_modality.modality_keys, scale=0.95, backend="torchvision"),
        VideoResize(apply_to=video_modality.modality_keys, height=224, width=224, interpolation="linear", backend="torchvision" ),
        VideoColorJitter(apply_to=video_modality.modality_keys, brightness=0.3, contrast=0.4, saturation=0.5, hue=0.08, backend="torchvision"),
        VideoToNumpy(apply_to=video_modality.modality_keys),

        # state transforms
        StateActionToTensor(apply_to=state_modality.modality_keys),
        StateActionTransform(apply_to=state_modality.modality_keys, normalization_modes={
            "state.left_arm": "min_max",
            "state.right_arm": "min_max",
            "state.left_hand": "min_max",
            "state.right_hand": "min_max",
        }),

        # action transforms
        StateActionToTensor(apply_to=action_modality.modality_keys),
        StateActionTransform(apply_to=action_modality.modality_keys, normalization_modes={
            "action.right_arm": "min_max",
            "action.left_arm": "min_max",
            "action.right_hand": "min_max",
            "action.left_hand": "min_max",
        }),

        # ConcatTransform
        ConcatTransform(
            video_concat_order=video_modality.modality_keys,
            state_concat_order=state_modality.modality_keys,
            action_concat_order=action_modality.modality_keys,
        ),
        # model-specific transform
        GR00TTransform(
            state_horizon=len(state_modality.delta_indices),
            action_horizon=len(action_modality.delta_indices),
            max_state_dim=64,
            max_action_dim=32,
        ),
    ]
)
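
The `"min_max"` normalization mode used above can be illustrated with a small sketch. It assumes the mode rescales each dimension to `[-1, 1]` using the dataset minimum and maximum; the exact target range and edge-case handling inside `StateActionTransform` are not stated in this tutorial, and the helper names and statistics below are made up.

```python
def min_max_normalize(values, lows, highs):
    """Scale each dimension from [low, high] to [-1, 1]; constant dimensions map to 0."""
    out = []
    for x, lo, hi in zip(values, lows, highs):
        span = hi - lo
        out.append(0.0 if span == 0 else 2.0 * (x - lo) / span - 1.0)
    return out


def min_max_denormalize(values, lows, highs):
    """Invert min_max_normalize, mapping [-1, 1] back to [low, high]."""
    return [(v + 1.0) / 2.0 * (hi - lo) + lo for v, lo, hi in zip(values, lows, highs)]


lows, highs = [0.0, -2.0], [10.0, 2.0]  # made-up per-dimension dataset statistics
normalized = min_max_normalize([5.0, 2.0], lows, highs)
print(normalized)                                    # [0.0, 1.0]
print(min_max_denormalize(normalized, lows, highs))  # [5.0, 2.0]
```

The inverse matters at deployment time, when predicted actions have to be mapped back to the robot's joint ranges.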


### Step 1.2: Load the dataset

First, we load the dataset with the `LeRobotSingleDataset` class (without transforms) and visualize a few frames.

In [None]:
from gr00t.data.dataset import LeRobotSingleDataset

train_dataset = LeRobotSingleDataset(
    dataset_path=dataset_path,
    modality_configs=modality_configs,
    embodiment_tag=embodiment_tag,
    video_backend="torchvision_av",
)


In [None]:
# use matplotlib to visualize the images
import matplotlib.pyplot as plt
import numpy as np

print(train_dataset[0].keys())

# frames are already in HWC format, which is what imshow expects
images = [train_dataset[i]["video.cam_right_high"][0] for i in range(5)]

fig, axs = plt.subplots(1, 5, figsize=(20, 5))
for i, image in enumerate(images):
    axs[i].imshow(image)
    axs[i].axis("off")
plt.show()

Now, we will initialize a dataset with our modality configs and transforms.

In [None]:
train_dataset = LeRobotSingleDataset(
    dataset_path=dataset_path,
    modality_configs=modality_configs,
    embodiment_tag=embodiment_tag,
    video_backend="torchvision_av",
    transforms=to_apply_transforms,
)

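**Extra notes**:
 - We use a cached dataloader to speed up training. The cached dataloader loads all data into memory, which significantly improves training performance. However, if your dataset is large or you run into out-of-memory (OOM) errors, you can switch to the standard lerobot dataloader (`gr00t.data.dataset.LeRobotSingleDataset`). It exposes the same API as the cached dataloader, so you can switch back and forth without changing your code.
 - We use `torchvision_av` as the video backend because the videos are encoded in av rather than standard h264.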
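
The trade-off in the first note can be sketched with a generic in-memory cache around any indexable dataset. `InMemoryCache` and `SlowSquares` are hypothetical stand-ins, not the cached loader shipped with `gr00t`; unlike the loader described above, this sketch fills the cache lazily, on first access, rather than loading everything up front.

```python
class InMemoryCache:
    """Wrap an indexable dataset and keep every item it has returned in memory."""

    def __init__(self, dataset):
        self._dataset = dataset
        self._cache = {}

    def __len__(self):
        return len(self._dataset)

    def __getitem__(self, index):
        if index not in self._cache:
            self._cache[index] = self._dataset[index]  # slow path: load once
        return self._cache[index]


class SlowSquares:
    """Stand-in dataset that counts how often an item is actually loaded."""

    def __init__(self, n):
        self.n, self.loads = n, 0

    def __len__(self):
        return self.n

    def __getitem__(self, index):
        self.loads += 1
        return index * index


source = SlowSquares(4)
cached = InMemoryCache(source)
values = [cached[i] for i in range(len(cached))] + [cached[2], cached[2]]
print(values, source.loads)  # [0, 1, 4, 9, 4, 4] 4
```

Because the wrapper exposes the same `__len__` and `__getitem__` as the dataset it wraps, swapping one for the other needs no other code changes, which is the property the note relies on. After one full pass the cache holds the whole dataset, hence the OOM risk for large datasets.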


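### Step 2: Load the model

The training process is done in 3 steps:
- 2.1: Load the base model from HuggingFace or a local path
- 2.2: Prepare the training args
- 2.3: Run the training loop

#### Step 2.1: Load the base model

We load the model with the `from_pretrained` method. Its `tune_*` arguments let us specify which parts of the model to tune.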

In [None]:
import os
import torch
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
device = "cuda" if torch.cuda.is_available() else "cpu"

In [None]:
from gr00t.model.gr00t_n1 import GR00T_N1

BASE_MODEL_PATH = "nvidia/GR00T-N1-2B"
TUNE_LLM = False            # Whether to tune the LLM
TUNE_VISUAL = True          # Whether to tune the visual encoder
TUNE_PROJECTOR = True       # Whether to tune the projector
TUNE_DIFFUSION_MODEL = True # Whether to tune the diffusion model

model = GR00T_N1.from_pretrained(
    pretrained_model_name_or_path=BASE_MODEL_PATH,
    tune_llm=TUNE_LLM,  # backbone's LLM
    tune_visual=TUNE_VISUAL,  # backbone's vision tower
    tune_projector=TUNE_PROJECTOR,  # action head's projector
    tune_diffusion_model=TUNE_DIFFUSION_MODEL,  # action head's DiT
)

# Set the model's compute_dtype to bfloat16
model.compute_dtype = "bfloat16"
model.config.compute_dtype = "bfloat16"
model.to(device)

#### Step 2.2: Prepare the training args

We use huggingface's `TrainingArguments` to configure the training process. Here are the main parameters:

In [None]:
from transformers import TrainingArguments

output_dir = "output/model/path"    # CHANGE THIS ACCORDING TO YOUR LOCAL PATH
per_device_train_batch_size = 8     # CHANGE THIS ACCORDING TO YOUR GPU MEMORY
max_steps = 20                      # CHANGE THIS ACCORDING TO YOUR NEEDS
report_to = "wandb"
dataloader_num_workers = 8

training_args = TrainingArguments(
    output_dir=output_dir,
    run_name=None,
    remove_unused_columns=False,
    deepspeed="",
    gradient_checkpointing=False,
    bf16=True,
    tf32=True,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=1,
    dataloader_num_workers=dataloader_num_workers,
    dataloader_pin_memory=False,
    dataloader_persistent_workers=True,
    optim="adamw_torch",
    adam_beta1=0.95,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    learning_rate=1e-4,
    weight_decay=1e-5,
    warmup_ratio=0.05,
    lr_scheduler_type="cosine",
    logging_steps=10.0,
    num_train_epochs=300,
    max_steps=max_steps,
    save_strategy="steps",
    save_steps=500,
    evaluation_strategy="no",
    save_total_limit=8,
    report_to=report_to,
    seed=42,
    do_eval=False,
    ddp_find_unused_parameters=False,
    ddp_bucket_cap_mb=100,
    torch_compile_mode=None,
)
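
Two of these settings interact: `warmup_ratio=0.05` and `lr_scheduler_type="cosine"` together define a linear warmup followed by a cosine decay. The sketch below reproduces the shape of that schedule in plain Python so the effect of `max_steps` is easy to see. `lr_at_step` is a hypothetical helper and an approximation for illustration, not the scheduler code that `transformers` runs.

```python
import math


def lr_at_step(step, max_steps, base_lr=1e-4, warmup_ratio=0.05):
    """Learning rate with linear warmup followed by cosine decay to zero."""
    warmup_steps = math.ceil(max_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


max_steps = 2000
for step in (0, 50, 100, 1050, 2000):
    print(step, f"{lr_at_step(step, max_steps):.2e}")
```

Note that a positive `max_steps` takes precedence over `num_train_epochs`, so with the values above the run stops after `max_steps` optimizer steps and the schedule is stretched over that many steps.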


#### Step 2.3: Initialize the training runner and run the training loop

In [None]:
from gr00t.experiment.runner import TrainRunner

experiment = TrainRunner(
    train_dataset=train_dataset,
    model=model,
    training_args=training_args,
)

experiment.train()

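We can compare the offline validation results after 1k steps with those after 10k steps:

**Finetuning results on the Unitree G1 Block Stacking dataset:**

| 1k Steps | 10k Steps |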
| --- | --- |
| ![1k](../media/g1_ft_1k.png) | ![10k](../media/g1_ft_10k.png) |
| MSE: 0.0181 | MSE: 0.0022 | 
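
The MSE values in the table compare predicted actions against the recorded ones. How the evaluation aggregates them is not shown in this tutorial; assuming a plain mean over all time steps and action dimensions, the metric can be sketched as follows, with made-up trajectories and a hypothetical `mean_squared_error` helper.

```python
def mean_squared_error(predicted, target):
    """Average squared difference over all time steps and action dimensions."""
    total, count = 0.0, 0
    for pred_step, true_step in zip(predicted, target):
        for p, t in zip(pred_step, true_step):
            total += (p - t) ** 2
            count += 1
    return total / count


# Made-up 3-step, 2-dimensional action trajectories.
ground_truth = [[0.0, 1.0], [0.5, 1.0], [1.0, 1.0]]
prediction = [[0.0, 1.0], [0.5, 0.0], [0.0, 1.0]]
mse = mean_squared_error(prediction, ground_truth)
print(round(mse, 4))  # 0.3333
```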