
Code for paper "TCoT: Trajectory Chain-of-Thoughts for Robotic Manipulation with Failure Recovery in Vision-Language-Action Model"


TCoT: Trajectory Chain-of-Thoughts for Robotic Manipulation with Failure Recovery in Vision-Language-Action Model

We introduce TCoT, a unified VLA framework that augments the direct observation-to-action mapping with trajectory planning as well as failure detection and recovery. TCoT leverages hierarchical trajectories as a precise and compact representation of CoT reasoning for manipulation: global planning provides a high-level, goal-oriented trajectory that guides the robot toward its task objective, while local planning makes real-time adjustments to handle dynamic changes. Moreover, we design a Global-Local Switching Recovery algorithm that detects failures and effectively recovers from them.
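As a rough illustration of the switching idea described above (not the paper's implementation), the control loop can be sketched as follows. All function and variable names here (`run_episode`, `detect_failure`, `plan_local`, `execute`) are hypothetical placeholders:

```python
# Toy sketch of Global-Local Switching Recovery: follow the global
# trajectory waypoint by waypoint; when a failure is detected, switch
# to local re-planning around the current waypoint, then resume the
# global plan once the failure is resolved. Names are illustrative only.

def run_episode(global_plan, detect_failure, plan_local, execute, max_steps=100):
    """Follow the global plan; recover from failures via local planning."""
    mode = "global"
    step = 0
    history = []
    for waypoint in global_plan:
        while True:
            execute(waypoint)
            step += 1
            history.append((mode, waypoint))
            if step >= max_steps:
                return history  # safety cap on episode length
            if not detect_failure(waypoint):
                mode = "global"  # no failure (or recovered): back to global plan
                break
            mode = "local"       # failure detected: re-plan locally
            waypoint = plan_local(waypoint)
    return history
```

In this sketch, a single failure at a waypoint produces one extra "local" step before control returns to the global trajectory.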

Our codebase is built on top of OpenVLA and ECoT. We refer readers to those repositories for detailed documentation of the code and dependencies.

Quickstart

The global and local trajectories generated by TCoT on LIBERO look like this:

We provide a Colab notebook containing code for loading up our TCoT policy and using it to generate reasoning and actions in response to an observation. Loading the model for inference is easy:

import torch
from transformers import AutoModelForVision2Seq, AutoProcessor

device = "cuda"
path_to_hf = "TCoT/tcot-openvla-7b-libero"
processor = AutoProcessor.from_pretrained(path_to_hf, trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(path_to_hf, torch_dtype=torch.bfloat16).to(device)

image = <ROBOT IMAGE OBSERVATION HERE>
instruction = <YOUR INSTRUCTION HERE>
prompt = "A chat between a curious user and an artificial intelligence assistant. " + \
    "The assistant gives helpful, detailed, and polite answers to the user's questions. " + \
    f"USER: What action should the robot take to {instruction.lower()}? ASSISTANT: TASK:"

inputs = processor(prompt, image).to(device, dtype=torch.bfloat16)
action, generated_ids = vla.predict_action(**inputs, unnorm_key="libero", max_new_tokens=1024)
generated_text = processor.batch_decode(generated_ids)[0]

The standard model in torch.bfloat16 requires 16 GB of GPU memory, but using bitsandbytes and 4-bit quantization lowers memory usage to around 5 GB. See the Colab for more details.
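A back-of-the-envelope check of these figures for a 7B-parameter model (weights only; the quoted totals additionally include activations, KV cache, and runtime overhead):

```python
# Rough weight-memory estimate for a 7B-parameter model at different
# precisions. This counts parameters only, ignoring activations and
# other runtime overhead, so real usage is somewhat higher.
PARAMS = 7e9  # approximate parameter count of a 7B model

def weight_memory_gb(bits_per_param):
    """Memory for the weights alone, in GB."""
    return PARAMS * bits_per_param / 8 / 1e9

bf16 = weight_memory_gb(16)  # bfloat16: 2 bytes per parameter
int4 = weight_memory_gb(4)   # 4-bit quantization
print(f"bfloat16 weights: ~{bf16:.1f} GB, 4-bit weights: ~{int4:.1f} GB")
# prints: bfloat16 weights: ~14.0 GB, 4-bit weights: ~3.5 GB
```

These weight-only numbers (~14 GB and ~3.5 GB) are consistent with the 16 GB and ~5 GB totals quoted above once runtime overhead is added.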

Training and Evaluation

To train the models from scratch, use the following command:

bash ./vla-scripts/finetune_tcot.sh

To evaluate the model on LIBERO, run:

python experiments/robot/libero/run_libero_eval_tcot_globallocal.py + args

To evaluate the model on a real robot arm (AIRBOT), run the client and server:

client:
  python experiments/robot/airbot/airbot_client_aug_tcot.py + args
server:
  python experiments/robot/airbot/deploy_tcot.py + args

Pretrained models

We release two TCoT models trained as part of our work, along with the dataset of trajectory-based reasoning annotations, on our HuggingFace page:

Explicit Notes on Model Licensing & Commercial Use: While all code in this repository is released under an MIT License, our pretrained models may inherit restrictions from the underlying base models we use. Specifically, both the above models are derived from Llama-2, and as such are subject to the Llama Community License.


Installation

See the original OpenVLA repository for detailed installation instructions.

Repository Structure

High-level overview of repository/project file-tree:

  • prismatic/ - Package source; provides core utilities for model loading, training, data preprocessing, etc.
  • experiments/ - Code for evaluating the policies (LIBERO simulation and the AIRBOT arm).
  • vla-scripts/ - Core scripts for training, fine-tuning, and deploying VLAs.
  • LICENSE - All code is made available under the MIT License; happy hacking!
  • Makefile - Top-level Makefile (by default, supports linting - checking & auto-fix); extend as needed.
  • pyproject.toml - Full project configuration details (including dependencies), as well as tool configurations.
  • README.md - You are here!
