TCoT: Trajectory Chain-of-thoughts for Robotic Manipulation with Failure Recovery in Vision-Language-Action Model
We introduce TCoT, a unified VLA framework that enhances the direct observation-to-action mapping with trajectory planning as well as failure detection and recovery. TCoT leverages hierarchical trajectories as a precise and compact representation of CoT reasoning for manipulation: global planning provides a high-level, goal-oriented trajectory that guides the robot toward its task objective, while local planning makes real-time adjustments to handle dynamic changes. In addition, we design a Global-Local Switching Recovery algorithm that detects failures and effectively recovers from them.
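The switching logic above can be sketched as a simple control loop. This is a minimal illustration only; the function names (`plan_global`, `plan_local`, `detect_failure`, `execute`) are hypothetical placeholders, not the actual TCoT API:

```python
def run_episode(env, policy, max_steps=100):
    """Sketch of a global-local switching control loop with failure recovery."""
    # Global planning: a high-level, goal-oriented trajectory for the whole task.
    global_traj = policy.plan_global(env.observe())
    for _ in range(max_steps):
        obs = env.observe()
        if policy.detect_failure(obs, global_traj):
            # On a detected failure, switch back to global planning to
            # re-anchor the task, then resume local refinement.
            global_traj = policy.plan_global(obs)
        # Local planning: real-time adjustment conditioned on the global plan.
        local_traj = policy.plan_local(obs, global_traj)
        env.execute(local_traj)
        if env.done():
            return True
    return False
```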
Our codebase is built on top of OpenVLA and ECoT; refer to those repositories for detailed documentation of the code and dependencies.
The global and local trajectories generated by TCoT on LIBERO look like this:
We provide a Colab notebook containing code for loading up our TCoT policy and using it to generate reasoning and actions in response to an observation. Loading the model for inference is easy:
```python
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor

device = "cuda"
path_to_hf = "TCoT/tcot-openvla-7b-libero"
processor = AutoProcessor.from_pretrained(path_to_hf, trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    path_to_hf, torch_dtype=torch.bfloat16, trust_remote_code=True
).to(device)

image = <ROBOT IMAGE OBSERVATION HERE>
instruction = <YOUR INSTRUCTION HERE>
prompt = "A chat between a curious user and an artificial intelligence assistant. " + \
         "The assistant gives helpful, detailed, and polite answers to the user's questions. " + \
         f"USER: What action should the robot take to {instruction.lower()}? ASSISTANT: TASK:"

inputs = processor(prompt, image).to(device, dtype=torch.bfloat16)
action, generated_ids = vla.predict_action(**inputs, unnorm_key="libero", max_new_tokens=1024)
generated_text = processor.batch_decode(generated_ids)[0]
```

The standard model in torch.bfloat16 requires 16 GB of GPU memory, but using bitsandbytes and 4-bit quantization lowers memory usage to around 5 GB. See the Colab for more details.
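The 4-bit option mentioned above can be sketched as follows. This is an illustrative configuration, not the exact setup from the Colab:

```python
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig

# 4-bit quantization via bitsandbytes; reduces memory to roughly 5 GB.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

path_to_hf = "TCoT/tcot-openvla-7b-libero"
processor = AutoProcessor.from_pretrained(path_to_hf, trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    path_to_hf,
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)  # with a quantization_config, the quantized weights are placed on GPU at load time
```

Inference then proceeds exactly as in the bfloat16 example above, without the explicit `.to(device)` call on the model.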
To train the models from scratch, use the following command:

```bash
bash ./vla-scripts/finetune_tcot.sh
```

To evaluate the model on LIBERO:

```bash
python experiments/robot/libero/run_libero_eval_tcot_globallocal.py + args
```

To evaluate the model on a real robot arm (AIRBOT), run the client:

```bash
python experiments/robot/airbot/airbot_client_aug_tcot.py + args
```

and the server:

```bash
python experiments/robot/airbot/deploy_tcot.py + args
```

We release two TCoT models trained as part of our work, along with the dataset of trajectory-based reasonings, available on our HuggingFace page:
- `libero_spatial_trajectory`: The trajectory-based reasoning dataset for the libero-spatial dataset.
- `libero_goal_trajectory`: The trajectory-based reasoning dataset for the libero-goal dataset.
- `libero_object_trajectory`: The trajectory-based reasoning dataset for the libero-object dataset.
- `libero_10_trajectory`: The trajectory-based reasoning dataset for the libero-10 dataset.
Explicit Notes on Model Licensing & Commercial Use: While all code in this repository is released under an MIT License, our pretrained models may inherit restrictions from the underlying base models we use. Specifically, both the above models are derived from Llama-2, and as such are subject to the Llama Community License.
See the original OpenVLA repository for detailed installation instructions.
High-level overview of repository/project file-tree:
- `prismatic` - Package source; provides core utilities for model loading, training, data preprocessing, etc.
- `experiments` - Code for evaluating the policies on a WidowX robot.
- `vla-scripts/` - Core scripts for training, fine-tuning, and deploying VLAs.
- `LICENSE` - All code is made available under the MIT License; happy hacking!
- `Makefile` - Top-level Makefile (by default, supports linting - checking & auto-fix); extend as needed.
- `pyproject.toml` - Full project configuration details (including dependencies), as well as tool configurations.
- `README.md` - You are here!


