# New Fine-Tuning Tutorial

This notebook is a guide to fine-tuning the GR00T-N1 pretrained model on a new dataset.

# 1. LeRobot SO100 Fine-Tuning Tutorial

GR00T-N1 is open to everyone, regardless of robot embodiment. With Hugging Face's low-cost [SO100 LeRobot arm](https://github.com/huggingface/lerobot/blob/main/examples/10_use_so100.md), users can fine-tune GR00T-N1 on their own robot using the `new_embodiment` tag.

![so100_eval_demo.gif](../media/so100_eval_demo.gif)

## Step 1: Dataset

Any LeRobot dataset can be used for fine-tuning. In this tutorial we start with the sample dataset [so100_strawberry_grape](https://huggingface.co/spaces/lerobot/visualize_dataset?dataset=youliangtan%2Fso100_strawberry_grape&episode=0).

> Note: this embodiment was **not** part of our pre-training mixture.

### First, download the dataset

```bash
huggingface-cli download --repo-type dataset youliangtan/so100_strawberry_grape --local-dir ./demo_data/so100_strawberry_grape
```

### Second, copy the modality file

The `modality.json` file provides extra information about the state and action modalities to make the dataset “GR00T-compatible”. Copy `examples/so100__modality.json` into `<DATASET_PATH>/meta/modality.json`.

```bash
cp examples/so100__modality.json ./demo_data/so100_strawberry_grape/meta/modality.json
```
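
Before loading the dataset, it can help to confirm that the copied file parses and contains the sections the loader will look for. The sketch below is a minimal check under stated assumptions: `check_modality_file` is a hypothetical helper, not part of `gr00t`, and the required top-level sections (`state`, `action`, `video`) are assumed from the schema described later in this tutorial.

```python
import json
from pathlib import Path

# Assumed minimum set of sections; "annotation" is treated as optional.
REQUIRED_SECTIONS = ("state", "action", "video")


def check_modality_file(dataset_path):
    """Load <dataset_path>/meta/modality.json and return the sections it is missing."""
    modality_path = Path(dataset_path) / "meta" / "modality.json"
    if not modality_path.is_file():
        raise FileNotFoundError(f"Expected a modality file at {modality_path}")
    with modality_path.open() as f:
        modality = json.load(f)  # raises json.JSONDecodeError on malformed JSON
    return [name for name in REQUIRED_SECTIONS if name not in modality]
```

Calling `check_modality_file("./demo_data/so100_strawberry_grape")` returns an empty list when all three sections are present, and raises if the file was never copied.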

You can now load the dataset with the `LeRobotSingleDataset` class.

In [None]:
from gr00t.utils.misc import any_describe
from gr00t.data.dataset import LeRobotSingleDataset
from gr00t.experiment.data_config import DATA_CONFIG_MAP

dataset_path = "./demo_data/so100_strawberry_grape"   # change this to your dataset path

data_config = DATA_CONFIG_MAP["so100"]

dataset = LeRobotSingleDataset(
    dataset_path=dataset_path,
    modality_configs=data_config.modality_config(),
    embodiment_tag="new_embodiment",
    video_backend="torchvision_av",
)

resp = dataset[7]
any_describe(resp)

In [None]:
# Visualize every 10th frame from the first 100 samples
import matplotlib.pyplot as plt

frame_indices = list(range(0, 100, 10))
images_list = [dataset[i]["video.webcam"][0] for i in frame_indices]

fig, axs = plt.subplots(2, 5, figsize=(20, 10))
for ax, idx, img in zip(axs.flat, frame_indices, images_list):
    ax.imshow(img)
    ax.axis("off")
    ax.set_title(f"Frame {idx}")  # label each panel with its dataset index
plt.tight_layout()  # adjust the subplots to fit into the figure area
plt.show()

## Step 2: Fine-Tuning

Fine-tuning is done with our script `scripts/gr00t_finetune.py`.

```bash
python scripts/gr00t_finetune.py \
   --dataset-path ./demo_data/so100_strawberry_grape/ \
   --num-gpus 1 \
   --output-dir ~/so100-checkpoints \
   --max-steps 2000 \
   --data-config so100 \
   --video-backend torchvision_av
```

> **Tip**: Default settings require ~25 GB of VRAM.  
> If you have less VRAM, add `--no-tune_diffusion_model` and/or lower `--batch-size` to avoid OOM errors.
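
A quick pre-flight check can tell you whether the default settings are likely to fit before a long run starts. The sketch below assumes PyTorch is installed and uses `torch.cuda.mem_get_info`, which reports free and total memory on the current device in bytes. The helper names are hypothetical, and the threshold is simply the approximate 25 GB figure quoted in the tip.

```python
def free_vram_gb():
    """Return free memory on the current CUDA device in GiB, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    return free_bytes / 1024**3


def fits_default_settings(free_gb, required_gb=25.0):
    """Compare free memory against the approximate requirement quoted in the tip."""
    return free_gb is not None and free_gb >= required_gb


free_gb = free_vram_gb()
if free_gb is None:
    print("No CUDA device detected")
elif fits_default_settings(free_gb):
    print(f"{free_gb:.1f} GiB free: default settings are likely to fit")
else:
    print(f"{free_gb:.1f} GiB free: consider --no-tune_diffusion_model or a smaller --batch-size")
```

Free memory is only a rough proxy, since other processes on the same GPU can claim memory after the check runs.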

## Step 3: Open-loop Evaluation

After training, visualize the fine-tuned policy:

```bash
python scripts/eval_policy.py --plot \
   --embodiment_tag new_embodiment \
   --model_path <YOUR_CHECKPOINT_PATH> \
   --data_config so100 \
   --dataset_path ./demo_data/so100_strawberry_grape/ \
   --video_backend torchvision_av \
   --modality_keys single_arm gripper
```

Here is a plot after 7 000 training steps:

![so100-7k-steps.png](../media/so100-7k-steps.png)

With more steps the curves improve noticeably.

🎉 Great! You have successfully fine-tuned GR00T-N1 on a new embodiment.

## Deployment

For deployment details, see `5_policy_deployment.md`.

---

# 2. Fine-Tuning Tutorial on G1 Block-Stacking Dataset

This section provides a step-by-step guide to fine-tuning GR00T-N1 on the G1 block-stacking dataset.

## Step 1: Dataset

Loading any dataset for fine-tuning is a two-step process:

- **1.1** Define the modality configuration and transforms for the dataset  
- **1.2** Load the dataset with the `LeRobotSingleDataset` class

### Step 1.0: Download the Dataset

- Download the dataset from:  
  [https://huggingface.co/datasets/unitreerobotics/G1_BlockStacking_Dataset](https://huggingface.co/datasets/unitreerobotics/G1_BlockStacking_Dataset)

- Copy `examples/unitree_g1_blocks__modality.json` into `<DATASET_PATH>/meta/modality.json`.  
  This supplies extra metadata about state and action modalities so the dataset becomes “GR00T-compatible”.

```bash
cp examples/unitree_g1_blocks__modality.json datasets/G1_BlockStacking_Dataset/meta/modality.json
```

---

### Understanding the Modality Configuration

The file provides detailed metadata for state and action modalities, enabling:

- **Decoupled storage and interpretation**  
  - **States & actions**: Stored as concatenated float32 arrays. `modality.json` maps these arrays into distinct, semantically meaningful fields with training hints.  
  - **Videos**: Stored as separate files; the config renames them to a canonical format.  
  - **Annotations**: Tracks all annotation fields. Omit the `annotation` key if no annotations exist.

- **Fine-grained slicing** – splits arrays into semantically meaningful fields.  
- **Clear mapping** – explicit dimension mapping.  
- **Complex transforms** – per-field normalization and rotation transforms at training time.

#### Schema

```json
{
    "state": {
        "<state_name>": {
            "start": <int>,   // start index in the state array
            "end":   <int>    // end index in the state array
        }
    },
    "action": {
        "<action_name>": {
            "start": <int>,   // start index in the action array
            "end":   <int>    // end index in the action array
        }
    },
    "video": {
        "<video_name>": {}   // empty dict for consistency
    },
    "annotation": {
        "<annotation_name>": {}   // empty dict for consistency
    }
}
```

An example can be found at `getting_started/examples/unitree_g1_blocks__modality.json`; place it inside the dataset’s `meta` folder.
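
To make the slicing concrete, the sketch below splits one concatenated state vector into named fields using `start` and `end`. It assumes `end` is exclusive, as in Python slicing. `split_by_modality` is a hypothetical helper, and the 6-dimensional layout (five arm joints plus one gripper value) is invented for illustration.

```python
def split_by_modality(vector, fields):
    """Split one concatenated state/action vector into named fields.

    `fields` maps a field name to {"start": int, "end": int}; `end` is treated
    as exclusive, matching Python slicing (an assumption of this sketch).
    """
    return {name: vector[spec["start"]:spec["end"]] for name, spec in fields.items()}


# Hypothetical 6-dimensional state: five arm joints followed by one gripper value.
state_fields = {
    "single_arm": {"start": 0, "end": 5},
    "gripper": {"start": 5, "end": 6},
}
state_vector = [0.1, 0.2, 0.3, 0.4, 0.5, 0.9]

parts = split_by_modality(state_vector, state_fields)
print(parts["single_arm"])  # [0.1, 0.2, 0.3, 0.4, 0.5]
print(parts["gripper"])     # [0.9]
```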

---

### Generate Dataset Statistics

Create `meta/metadata.json` by running:

```bash
python scripts/load_dataset.py \
  --data_path /datasets/G1_BlockStacking_Dataset/ \
  --embodiment_tag new_embodiment
```
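
These statistics are what allow per-field normalization at training time. As a rough illustration only, per-dimension statistics over a set of state vectors could be computed as below. The exact contents of `meta/metadata.json` are not shown in this tutorial, so the field names (`min`, `max`, `mean`, `std`) and the helper `per_dimension_stats` are assumptions.

```python
import statistics


def per_dimension_stats(vectors):
    """Compute min, max, mean and population std for each dimension of equal-length vectors."""
    columns = list(zip(*vectors))  # transpose: one tuple per dimension
    return {
        "min": [min(col) for col in columns],
        "max": [max(col) for col in columns],
        "mean": [statistics.fmean(col) for col in columns],
        "std": [statistics.pstdev(col) for col in columns],
    }


# Three hypothetical 2-dimensional state vectors.
states = [[0.0, 10.0], [1.0, 20.0], [2.0, 30.0]]
stats = per_dimension_stats(states)
print(stats["min"], stats["max"])  # [0.0, 10.0] [2.0, 30.0]
print(stats["mean"])               # [1.0, 20.0]
```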

In [None]:
from gr00t.data.schema import EmbodimentTag

In [None]:
dataset_path = "./demo_data/g1"  # change this to your dataset path
embodiment_tag = EmbodimentTag.NEW_EMBODIMENT

### Step 1.1: Modality Configuration & Transforms

The modality configuration selects exactly which data streams (video, state, action, language, etc.) the model sees during fine-tuning.

In [None]:
from gr00t.data.dataset import ModalityConfig


# select the modality keys you want to use for finetuning
video_modality = ModalityConfig(
    delta_indices=[0],
    modality_keys=["video.cam_right_high"],
)

state_modality = ModalityConfig(
    delta_indices=[0],
    modality_keys=["state.left_arm", "state.right_arm", "state.left_hand", "state.right_hand"],
)

action_modality = ModalityConfig(
    delta_indices=list(range(16)),  # 16-step action horizon: offsets 0..15
    modality_keys=["action.left_arm", "action.right_arm", "action.left_hand", "action.right_hand"],
)

language_modality = ModalityConfig(
    delta_indices=[0],
    modality_keys=["annotation.human.task_description"],
)

modality_configs = {
    "video": video_modality,
    "state": state_modality,
    "action": action_modality,
    "language": language_modality,
}
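
`delta_indices` can be read as time offsets relative to the sampled frame: `[0]` takes only the current observation, while offsets 0 to 15 yield a 16-step action chunk. The sketch below illustrates that reading. `resolve_indices` is a hypothetical helper, and clamping at the episode boundary is an assumption made for the example, not a description of how `LeRobotSingleDataset` pads.

```python
def resolve_indices(base_index, delta_indices, episode_length):
    """Turn relative offsets into absolute frame indices, clamped to the episode."""
    last = episode_length - 1
    return [min(max(base_index + delta, 0), last) for delta in delta_indices]


observation_deltas = [0]          # current frame only
action_deltas = list(range(16))   # current frame plus the next 15

print(resolve_indices(100, observation_deltas, episode_length=110))  # [100]
# Indices 100..109, then 109 repeated six times for offsets past the episode end.
print(resolve_indices(100, action_deltas, episode_length=110))
```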

In [None]:
from gr00t.data.transform.base import ComposedModalityTransform
from gr00t.data.transform import VideoToTensor, VideoCrop, VideoResize, VideoColorJitter, VideoToNumpy
from gr00t.data.transform.state_action import StateActionToTensor, StateActionTransform
from gr00t.data.transform.concat import ConcatTransform
from gr00t.model.transforms import GR00TTransform


# select the transforms you want to apply to the data
to_apply_transforms = ComposedModalityTransform(
    transforms=[
        # video transforms
        VideoToTensor(apply_to=video_modality.modality_keys, backend="torchvision"),
        VideoCrop(apply_to=video_modality.modality_keys, scale=0.95, backend="torchvision"),
        VideoResize(apply_to=video_modality.modality_keys, height=224, width=224, interpolation="linear", backend="torchvision"),
        VideoColorJitter(apply_to=video_modality.modality_keys, brightness=0.3, contrast=0.4, saturation=0.5, hue=0.08, backend="torchvision"),
        VideoToNumpy(apply_to=video_modality.modality_keys),

        # state transforms
        StateActionToTensor(apply_to=state_modality.modality_keys),
        StateActionTransform(apply_to=state_modality.modality_keys, normalization_modes={
            "state.left_arm": "min_max",
            "state.right_arm": "min_max",
            "state.left_hand": "min_max",
            "state.right_hand": "min_max",
        }),

        # action transforms
        StateActionToTensor(apply_to=action_modality.modality_keys),
        StateActionTransform(apply_to=action_modality.modality_keys, normalization_modes={
            "action.right_arm": "min_max",
            "action.left_arm": "min_max",
            "action.right_hand": "min_max",
            "action.left_hand": "min_max",
        }),

        # ConcatTransform
        ConcatTransform(
            video_concat_order=video_modality.modality_keys,
            state_concat_order=state_modality.modality_keys,
            action_concat_order=action_modality.modality_keys,
        ),
        # model-specific transform
        GR00TTransform(
            state_horizon=len(state_modality.delta_indices),
            action_horizon=len(action_modality.delta_indices),
            max_state_dim=64,
            max_action_dim=32,
        ),
    ]
)
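
The `min_max` mode rescales each dimension using the dataset statistics. A common convention, assumed here and not taken from the `gr00t` source, maps the recorded minimum to -1 and the maximum to 1; the inverse mapping recovers raw values from model outputs. `min_max_normalize` and `min_max_unnormalize` are hypothetical names.

```python
def min_max_normalize(values, mins, maxs):
    """Map each value from [min, max] to [-1, 1]; constant dimensions map to 0."""
    out = []
    for v, lo, hi in zip(values, mins, maxs):
        out.append(0.0 if hi == lo else 2.0 * (v - lo) / (hi - lo) - 1.0)
    return out


def min_max_unnormalize(values, mins, maxs):
    """Map each value from [-1, 1] back to [min, max]."""
    return [(v + 1.0) / 2.0 * (hi - lo) + lo for v, lo, hi in zip(values, mins, maxs)]


mins, maxs = [0.0, -90.0], [1.0, 90.0]
raw = [0.25, 45.0]
normalized = min_max_normalize(raw, mins, maxs)
print(normalized)                                   # [-0.5, 0.5]
print(min_max_unnormalize(normalized, mins, maxs))  # [0.25, 45.0]
```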


### Step 1.2: Load the Dataset

First, we load the dataset with the `LeRobotSingleDataset` class (without transforms) and visualize a few frames.

In [None]:
from gr00t.data.dataset import LeRobotSingleDataset

train_dataset = LeRobotSingleDataset(
    dataset_path=dataset_path,
    modality_configs=modality_configs,
    embodiment_tag=embodiment_tag,
    video_backend="torchvision_av",
)


In [None]:
# Visualize the first five frames with matplotlib
import matplotlib.pyplot as plt

print(train_dataset[0].keys())

# Each frame is already in HWC format, which is what imshow expects
images = [train_dataset[i]["video.cam_right_high"][0] for i in range(5)]

fig, axs = plt.subplots(1, 5, figsize=(20, 5))
for ax, image in zip(axs, images):
    ax.imshow(image)
    ax.axis("off")
plt.show()

Now we initialize the dataset with our modality configuration and transforms.

In [None]:
train_dataset = LeRobotSingleDataset(
    dataset_path=dataset_path,
    modality_configs=modality_configs,
    embodiment_tag=embodiment_tag,
    video_backend="torchvision_av",
    transforms=to_apply_transforms,
)

**Additional Notes**  
- We use a **cached data loader** to accelerate training. It loads the entire dataset into memory, which greatly improves throughput. If your dataset is very large or you encounter out-of-memory (OOM) errors, simply switch to the standard LeRobot data loader (`gr00t.data.dataset.LeRobotSingleDataset`). Both loaders share the same API, so you can toggle between them without changing your code.  
- The **video backend** is set to `torchvision_av`, which decodes frames through PyAV (`av`) because the videos are encoded in AV1 rather than standard H.264.
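
The trade-off behind a cached loader can be seen in a small wrapper that keeps every sample it has already produced. This is a generic sketch, not the `gr00t` implementation: `CachedDataset` and `SlowDataset` are hypothetical classes, and the only interface assumed is `__len__` and `__getitem__`.

```python
class CachedDataset:
    """Wrap an indexable dataset and keep each loaded sample in memory."""

    def __init__(self, dataset):
        self._dataset = dataset
        self._cache = {}

    def __len__(self):
        return len(self._dataset)

    def __getitem__(self, index):
        if index not in self._cache:
            self._cache[index] = self._dataset[index]  # slow path: load once
        return self._cache[index]


class SlowDataset:
    """Stand-in for a dataset whose samples are expensive to load."""

    def __init__(self, size):
        self.size = size
        self.loads = 0  # counts how often the expensive path runs

    def __len__(self):
        return self.size

    def __getitem__(self, index):
        self.loads += 1
        return {"index": index}


slow = SlowDataset(size=4)
cached = CachedDataset(slow)
for _ in range(3):  # three passes over the data
    for i in range(len(cached)):
        cached[i]
print(slow.loads)  # 4: each sample was loaded once, then served from memory
```

Memory use grows with the number of distinct samples, which is why a very large dataset may not fit in a cache of this kind.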

## Step 2: Load the Model

Training proceeds in three stages:
- **2.1** Load the base model from Hugging Face or a local path  
- **2.2** Prepare training parameters  
- **2.3** Run the training loop

#### Step 2.1: Load the Base Model

We will load the model using the `from_pretrained` method, whose `tune_*` arguments let us specify exactly which parts of the model to fine-tune.

In [None]:
import os

# Restrict this process to GPU 0; set this before CUDA is initialized
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

In [None]:
from gr00t.model.gr00t_n1 import GR00T_N1

BASE_MODEL_PATH = "nvidia/GR00T-N1-2B"
TUNE_LLM = False            # Whether to tune the LLM
TUNE_VISUAL = True          # Whether to tune the visual encoder
TUNE_PROJECTOR = True       # Whether to tune the projector
TUNE_DIFFUSION_MODEL = True # Whether to tune the diffusion model

model = GR00T_N1.from_pretrained(
    pretrained_model_name_or_path=BASE_MODEL_PATH,
    tune_llm=TUNE_LLM,  # backbone's LLM
    tune_visual=TUNE_VISUAL,  # backbone's vision tower
    tune_projector=TUNE_PROJECTOR,  # action head's projector
    tune_diffusion_model=TUNE_DIFFUSION_MODEL,  # action head's DiT
)

# Set the model's compute_dtype to bfloat16
model.compute_dtype = "bfloat16"
model.config.compute_dtype = "bfloat16"
model.to(device)
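
To confirm that the `tune_*` flags had the intended effect, you can count which parameters still require gradients. The helper below relies only on the standard PyTorch `parameters()`, `numel()` and `requires_grad` interface. `count_parameters` is a hypothetical name, and the stand-in classes exist only so the example runs without loading the model.

```python
def count_parameters(model):
    """Return (trainable, total) parameter counts for a PyTorch-style module."""
    trainable = total = 0
    for param in model.parameters():
        n = param.numel()
        total += n
        if param.requires_grad:
            trainable += n
    return trainable, total


# Stand-ins that mimic the small part of the PyTorch interface used above.
class FakeParam:
    def __init__(self, n, requires_grad):
        self._n = n
        self.requires_grad = requires_grad

    def numel(self):
        return self._n


class FakeModel:
    def parameters(self):
        # A frozen block of 1000 weights and a tunable block of 250.
        return iter([FakeParam(1000, False), FakeParam(250, True)])


trainable, total = count_parameters(FakeModel())
print(f"{trainable}/{total} parameters trainable ({100 * trainable / total:.1f}%)")
# 250/1250 parameters trainable (20.0%)
```

With the real model, calling `count_parameters(model)` after the cell above gives the corresponding counts for GR00T-N1.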

#### Step 2.2: Prepare Training Parameters

We configure training with Hugging Face’s `TrainingArguments`. Key parameters are outlined below:

In [None]:
from transformers import TrainingArguments

output_dir = "output/model/path"    # CHANGE THIS ACCORDING TO YOUR LOCAL PATH
per_device_train_batch_size = 8     # CHANGE THIS ACCORDING TO YOUR GPU MEMORY
max_steps = 20                      # CHANGE THIS ACCORDING TO YOUR NEEDS
report_to = "wandb"
dataloader_num_workers = 8

training_args = TrainingArguments(
    output_dir=output_dir,
    run_name=None,
    remove_unused_columns=False,
    deepspeed="",
    gradient_checkpointing=False,
    bf16=True,
    tf32=True,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=1,
    dataloader_num_workers=dataloader_num_workers,
    dataloader_pin_memory=False,
    dataloader_persistent_workers=True,
    optim="adamw_torch",
    adam_beta1=0.95,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    learning_rate=1e-4,
    weight_decay=1e-5,
    warmup_ratio=0.05,
    lr_scheduler_type="cosine",
    logging_steps=10,
    num_train_epochs=300,
    max_steps=max_steps,
    save_strategy="steps",
    save_steps=500,
    evaluation_strategy="no",
    save_total_limit=8,
    report_to=report_to,
    seed=42,
    do_eval=False,
    ddp_find_unused_parameters=False,
    ddp_bucket_cap_mb=100,
    torch_compile_mode=None,
)
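
With `warmup_ratio=0.05` and `lr_scheduler_type="cosine"`, the learning rate ramps up linearly over roughly the first 5% of steps and then decays along a half cosine. The sketch below reproduces that shape in plain Python as an approximation of the Hugging Face scheduler; it is not a call into `transformers`, and `lr_at_step` is a hypothetical helper.

```python
import math


def lr_at_step(step, max_steps, peak_lr=1e-4, warmup_ratio=0.05):
    """Linear warmup to peak_lr, then cosine decay to zero at max_steps."""
    warmup_steps = math.ceil(max_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Print the learning rate at a few points of a 2000-step run.
for step in (0, 50, 100, 1050, 2000):
    print(step, f"{lr_at_step(step, max_steps=2000):.2e}")
```

Note that `max_steps` sets the schedule length: when it is positive, it takes precedence over `num_train_epochs` in `TrainingArguments`.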


#### Step 2.3: Initialize the Training Runner and Launch the Training Loop

In [None]:
from gr00t.experiment.runner import TrainRunner

experiment = TrainRunner(
    train_dataset=train_dataset,
    model=model,
    training_args=training_args,
)

experiment.train()

We can compare the offline validation results after 1 000 steps versus 10 000 steps:

**Fine-tuning Results on the Unitree G1 Block-Stacking Dataset**

| 1 k steps | 10 k steps |
|-----------|------------|
| ![1k](../media/g1_ft_1k.png) | ![10k](../media/g1_ft_10k.png) |
| MSE: 0.0181 | MSE: 0.0022 |
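
The MSE figures summarize how far the predicted actions are from the recorded ones. A minimal sketch of such a metric follows, assuming it is the mean of squared differences over all timesteps and action dimensions; the exact reduction used by the evaluation script is not stated here, and `open_loop_mse` is a hypothetical helper.

```python
def open_loop_mse(predicted, recorded):
    """Mean squared error over all timesteps and action dimensions."""
    if len(predicted) != len(recorded):
        raise ValueError("trajectories must have the same number of timesteps")
    squared_errors = [
        (p - r) ** 2
        for pred_step, rec_step in zip(predicted, recorded)
        for p, r in zip(pred_step, rec_step)
    ]
    return sum(squared_errors) / len(squared_errors)


# Two timesteps of a hypothetical 2-dimensional action.
recorded = [[0.0, 1.0], [0.5, 1.0]]
predicted = [[0.0, 1.0], [1.0, 0.0]]
print(open_loop_mse(predicted, recorded))  # 0.3125
```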