Hi, I have two questions about the codebase:
- Why does the training code snippet contain a BEV feature? (CarLLaVA only uses RGB images as input.)
    labels["input_bev_latent"] = bev
- Most VLMs sacrifice speed for explainable (language) output, which improves the reasoning behind and the robustness of the predicted waypoints. However, I couldn't find any language output in the ETA codebase. Does ETA produce any?