# **ORION - Outputs inspection**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import os
os.chdir('/content/drive/MyDrive/RA-KaiKaiLiu/Orion')

# Inference steps



The complete inference flow is:

* [test.py](https://github.com/xiaomi-mlab/Orion/blob/main/adzoo/orion/test.py) → loads config and checkpoint
* forward_test() → resets memory, calls simple_test(), which in turn calls simple_test_pts()
* extract_feat() → backbone + neck feature extraction
* Detection heads → pts_bbox_head.forward() and map_head.forward()
* VLM → lm_head.inference_ego() generates planning token
* Planning decoder → VAE/Diffusion/MLP generates trajectory
* Metrics → compute planning metrics (L2 error, collision, etc.)

During inference, the following methods in [orion.py](https://github.com/xiaomi-mlab/Orion/blob/2eddb627/mmcv/models/detectors/orion.py) are called in sequence:
<ul type="none">
<li>a) forward_test() - Entry point for test-time inference</li>
 <ul>
<li>Resets memory if needed </li>
<li>Delegates to simple_test() </li>
</ul>
<li>b) simple_test() - Processes single scene</li>
<ul>
<li>Calls extract_img_feat() to extract image features</li>
<li>Calls simple_test_pts() for the main inference pipeline</li>
</ul>
<li>c) simple_test_pts() - Core inference pipeline</li>
<li>This is where the three-stage processing happens:</li>
<ul>
<li>Vision Stage: Calls pts_bbox_head() for object detection and map_head() for lane detection</li>
<li>Reasoning Stage: Calls lm_head.inference_ego() or lm_head.generate() for VLM processing</li>
<li>Action Stage: Depending on config, calls trajectory decoders:
<ul><li>VAE decoder: ego_fut_decoder()</li><li>Diffusion decoder: diff_decoder()</li><li>MLP decoder: waypoint_decoder()</li></ul></li>
</ul>
</ul>
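
The call order above can be traced with a small stand-in class. This is only a sketch: `OrionFlowSketch` and its `decoder` switch are hypothetical names made up for illustration, and the methods do nothing except record the order in which the real methods are described as being called.

```python
class OrionFlowSketch:
    """Hypothetical stand-in that only records the call order described above."""

    def __init__(self, decoder="vae"):
        # Assumed config switch: "vae", "diffusion", or "mlp".
        self.decoder = decoder
        self.calls = []

    def forward_test(self, data):
        self.calls.append("forward_test")
        return self.simple_test(data)

    def simple_test(self, data):
        self.calls.append("simple_test")
        self.calls.append("extract_img_feat")
        return self.simple_test_pts(data)

    def simple_test_pts(self, data):
        self.calls.append("simple_test_pts")
        # Vision stage: object detection and lane detection heads.
        self.calls += ["pts_bbox_head", "map_head"]
        # Reasoning stage: VLM produces the planning token.
        self.calls.append("lm_head.inference_ego")
        # Action stage: exactly one trajectory decoder, chosen by config.
        decoders = {
            "vae": "ego_fut_decoder",
            "diffusion": "diff_decoder",
            "mlp": "waypoint_decoder",
        }
        self.calls.append(decoders[self.decoder])
        return self.calls


flow = OrionFlowSketch(decoder="vae")
trace = flow.forward_test(data=None)
print(" -> ".join(trace))
```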

# Vision Space
## Image Loading & Preprocessing
Pipeline: inference_only_pipeline in agent configuration [orion_stage3_agent.py](https://github.com/Rahhul17-IITH/Orion-v1/blob/main/adzoo/orion/configs/orion_stage3_agent.py)

**Input:**

* Raw images from CARLA: 6 camera views (RGB, variable resolution) [orion_b2d_agent.py](https://github.com/Rahhul17-IITH/Orion-v1/blob/main/team_code/orion_b2d_agent.py)

**Processing:**

* LoadMultiViewImageFromFilesInCeph: Load images as float32 arrays [orion_stage3_agent.py](https://github.com/Rahhul17-IITH/Orion-v1/blob/main/adzoo/orion/configs/orion_stage3_agent.py)
* ResizeCropFlipRotImage: Apply image transforms (no augmentation during inference)
* ResizeMultiview3D: Resize to (640, 640)
* NormalizeMultiviewImage: Apply normalization
* PadMultiViewImage: Pad height and width to a multiple of 32

**Output:**

* results['img']: List of 6 images, stacked to (6, 3, 640, 640) tensor [orion_b2d_agent.py](https://github.com/Rahhul17-IITH/Orion-v1/blob/main/team_code/orion_b2d_agent.py)
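
A minimal NumPy sketch of the normalize, pad and stack steps, assuming the views are already resized to 640×640. The `mean` and `std` values are placeholders and not the statistics from the Orion config; `pad_to_multiple` and `preprocess` are illustrative helper names, not functions from the repository.

```python
import numpy as np


def pad_to_multiple(img, divisor=32):
    """Zero-pad an (H, W, C) image on the bottom/right so H and W are multiples of divisor."""
    h, w = img.shape[:2]
    pad_h = (-h) % divisor
    pad_w = (-w) % divisor
    return np.pad(img, ((0, pad_h), (0, pad_w), (0, 0)))


def preprocess(views, mean, std):
    """views: list of (H, W, 3) images, assumed already resized to 640x640."""
    out = []
    for img in views:
        img = (img.astype(np.float32) - mean) / std  # per-channel normalization
        img = pad_to_multiple(img, 32)
        out.append(img.transpose(2, 0, 1))           # HWC -> CHW
    return np.stack(out, axis=0)                     # (N, 3, H, W)


# Placeholder statistics, NOT the values used by the Orion config.
mean = np.array([127.5, 127.5, 127.5], dtype=np.float32)
std = np.array([127.5, 127.5, 127.5], dtype=np.float32)

views = [np.zeros((640, 640, 3), dtype=np.float32) for _ in range(6)]
batch = preprocess(views, mean, std)
print(batch.shape)  # (6, 3, 640, 640)
```

Since 640 is already a multiple of 32, the padding step leaves these images unchanged; it only matters for other input sizes.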

## Feature Extraction
**Function:** extract_img_feat() called from simple_test() [orion.py](https://github.com/xiaomi-mlab/Orion/blob/main/mmcv/models/detectors/orion.py)

**Input:**

* img: (B, N, C, H, W) = (1, 6, 3, 640, 640) tensor (dtype: float32 or float16)

**Processing:**

* Reshape: (B*N, C, H, W) = (6, 3, 640, 640)
* img_backbone.forward() (EVAViT): Extract features
* Reshape back: (B, N, C_feat, H_feat, W_feat)

**Output:**

* img_feats_reshaped: (1, 6, 1024, 40, 40) tensor - 1024-dim features at 40×40 resolution
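
The reshape bookkeeping can be checked in isolation. In this sketch `fake_backbone` is a placeholder that returns zeros of the right shape instead of running EVAViT, and the stride of 16 is inferred from the 640 → 40 shapes listed above rather than read from the config.

```python
import numpy as np


def fake_backbone(x, c_feat=1024, stride=16):
    """Placeholder for img_backbone: returns zeros with the downsampled shape only."""
    bn, _, h, w = x.shape
    return np.zeros((bn, c_feat, h // stride, w // stride), dtype=x.dtype)


def extract_img_feat_sketch(img):
    b, n, c, h, w = img.shape
    flat = img.reshape(b * n, c, h, w)            # merge batch and camera axes
    feats = fake_backbone(flat)                   # (B*N, C_feat, H_feat, W_feat)
    return feats.reshape(b, n, *feats.shape[1:])  # split batch and camera axes again


img = np.zeros((1, 6, 3, 640, 640), dtype=np.float16)
img_feats = extract_img_feat_sketch(img)
print(img_feats.shape)  # (1, 6, 1024, 40, 40)
```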

## Position Encoding
**Function:** position_embeding() in simple_test_pts() [orion.py](https://github.com/xiaomi-mlab/Orion/blob/main/mmcv/models/detectors/orion.py)

**Input:**

* img_feats: (1, 6, 1024, 40, 40)
* location: (6, 40, 40, 2) - 2D grid coordinates from prepare_location()

**Processing:**

* Flatten spatial: (B, N*H*W, C) = (1, 9600, 1024)
* Generate depth bins and encode via position_encoder MLP

**Output:**

* pos_embed: (1, 9600, 256) - positional embeddings
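
A rough sketch of the two steps, using NumPy in place of torch. The channel-last ordering before flattening, the encoder input width `D_IN`, and the two-layer Linear/ReLU/Linear structure are assumptions made for illustration; the random `coords` array merely stands in for the depth-bin coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)
B, N, C, H, W = 1, 6, 1024, 40, 40

img_feats = np.zeros((B, N, C, H, W), dtype=np.float32)
# Move channels last, then merge cameras and spatial positions into one token axis.
tokens = img_feats.transpose(0, 1, 3, 4, 2).reshape(B, N * H * W, C)

# Hypothetical encoder sizes (e.g. depth bins x 3 coordinates as input width).
D_IN, D_OUT = 192, 256
coords = rng.standard_normal((B, N * H * W, D_IN)).astype(np.float32)
w1 = rng.standard_normal((D_IN, D_OUT)).astype(np.float32) * 0.02
w2 = rng.standard_normal((D_OUT, D_OUT)).astype(np.float32) * 0.02
pos_embed = np.maximum(coords @ w1, 0.0) @ w2  # Linear -> ReLU -> Linear

print(tokens.shape, pos_embed.shape)  # (1, 9600, 1024) (1, 9600, 256)
```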

## Object Detection
**Function:** pts_bbox_head.forward() (OrionHead) [orion.py](https://github.com/xiaomi-mlab/Orion/blob/main/mmcv/models/detectors/orion.py)

**Input:**

* img_feats: (1, 6, 1024, 40, 40)
* pos_embed: (1, 9600, 256)
* img_metas: List with camera calibration, ego pose, etc.

**Processing:**

* Flatten & project: memory = input_projection(img_feats) → (1, 9600, 256)
* Initialize 600 object queries + 256 VLM tokens
* 6-layer transformer decoder with temporal memory
* Extract VLM tokens from last layer: (1, 256, 256)
* Project to 4096-dim: (1, 256, 4096)
* Add CAN bus embedding: (1, 257, 4096) (256 VLM tokens + 1 CAN bus token)

**Output:**

* outs_bbox: Dict with detection predictions (bboxes, classes, scores)
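* det_query: (1, 257, 4096) tensor - vision tokens for VLM (the sample run printed below reports (1, 273, 4096), so the token count may differ by configuration)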
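
The token bookkeeping in the processing steps can be sketched as follows. The random weights stand in for the learned output projection and CAN bus embedding, and the position of the VLM tokens within the query axis (first 256 slots here) is an assumption, not something confirmed by the source.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_QUERY, NUM_VLM, D_MODEL, D_LLM = 600, 256, 256, 4096

# Last decoder layer output for all queries: (B, 600 + 256, 256).
decoder_out = rng.standard_normal((1, NUM_QUERY + NUM_VLM, D_MODEL)).astype(np.float32)

# Assumption: the VLM tokens occupy the first NUM_VLM slots of the query axis.
vlm_tokens = decoder_out[:, :NUM_VLM, :]                # (1, 256, 256)

# Placeholder projection from the decoder width to the LLM width.
w_out = rng.standard_normal((D_MODEL, D_LLM)).astype(np.float32) * 0.02
vlm_tokens = vlm_tokens @ w_out                         # (1, 256, 4096)

# One extra token carrying the CAN bus embedding.
can_bus_embed = rng.standard_normal((1, 1, D_LLM)).astype(np.float32)
det_query = np.concatenate([vlm_tokens, can_bus_embed], axis=1)
print(det_query.shape)  # (1, 257, 4096)
```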

## Map Detection
**Function:** map_head.forward() (OrionHeadM) [orion.py](https://github.com/xiaomi-mlab/Orion/blob/main/mmcv/models/detectors/orion.py)

**Input:**

* img_feats: (1, 6, 1024, 40, 40)
* pos_embed: (1, 9600, 256)

**Processing:**

* Similar to OrionHead but with 1800 lane queries + 256 VLM tokens
* 6-layer transformer decoder
* Extract & project VLM tokens: (1, 256, 4096)

**Output:**

* outs_lane: Dict with lane predictions (coordinates, classes)
* map_query: (1, 256, 4096) tensor - map vision tokens

In [None]:
#printing det_query, map_query
!python results-orion-outputs/scripts/visualize_vision_embeddings.py

Object Embedding Shape: (1, 273, 4096)
Object Embedding Array:
 [[[ 0.15061843  1.785963   -0.5301314  ... -1.2414083   0.5657144
    0.26052558]
  [ 0.03301103  0.80292714 -0.33988827 ... -0.79655665  0.15823792
    0.54157895]
  [ 0.0091894   1.3270832  -0.52480584 ... -1.0784613   0.5990549
    0.6027265 ]
  ...
  [ 0.4923144  -0.64405024 -0.15923941 ...  0.03407398 -0.02638396
    1.6473213 ]
  [ 0.49269247 -0.63164103 -0.15247867 ...  0.02763009 -0.05691024
    1.6609404 ]
  [ 0.7587868   0.41652974 -1.6516763  ... -2.2075346   0.46042457
    0.6991449 ]]]

Map Embedding Shape: (1, 256, 4096)
Map Embedding Array:
 [[[-0.01955852  0.9018345  -1.5668068  ... -0.85982764  0.4004894
    1.6175321 ]
  [-0.05273603  0.8952322  -1.3868546  ... -1.0558487   1.1358562
    1.7613556 ]
  [-0.13663417  0.6679472  -1.623211   ... -1.0674946   0.6367695
    1.7504694 ]
  ...
  [-0.26593617  0.46420035 -1.4571394  ... -1.073191    0.68581504
    1.777324  ]
  [ 0.01227309  0.88971245 -0.83627385

## Bounding Box Results [orion.py](https://github.com/xiaomi-mlab/Orion/blob/2eddb627/mmcv/models/detectors/orion.py)
This file is saved as an array of dictionaries, where each dictionary contains the detection results for a single sample or time step. The common keys and their meanings are:

* boxes_3d: 3D bounding boxes (LiDARInstance3DBoxes format)
* scores_3d: Confidence scores for each detection
* labels_3d: Class labels (0-8 for the 9 detection classes)
* trajs_3d (optional): Future trajectory predictions if motion forecasting is enabled
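
A common first step when inspecting these dictionaries is to drop low-confidence detections. The sketch below uses NumPy arrays in place of torch tensors; the scores are the first few values from the printed output further down, while the labels and the 0.3 threshold are made up for the example.

```python
import numpy as np

# Scores copied from the printed output; labels are hypothetical.
result = {
    "scores_3d": np.array([0.9768, 0.9757, 0.9716, 0.9677, 0.1089, 0.1081, 0.1066]),
    "labels_3d": np.array([0, 0, 3, 0, 1, 8, 2]),
}


def filter_by_score(res, threshold=0.3):
    """Keep only detections whose confidence exceeds the threshold."""
    keep = res["scores_3d"] > threshold
    return {key: value[keep] for key, value in res.items()}


confident = filter_by_score(result)
print(len(confident["scores_3d"]))  # 4
```

The same boolean mask would also index `boxes_3d` and `trajs_3d`, provided they share the first dimension with the scores.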

## Lane Results [orion.py](https://github.com/xiaomi-mlab/Orion/blob/2eddb627/mmcv/models/detectors/orion.py)
* map_scores_3d: Confidence scores for detected lanes
* map_labels_3d: Lane class labels (6 classes: Broken, Solid, SolidSolid, Center, TrafficLight, StopSign)
* map_pts_3d: Lane control points (shape: [num_lanes, n_control, 3])

In [None]:
%%bash
# printing boxes_3d, scores_3d, labels_3d, trajs_3d; map_scores_3d, map_labels_3d, map_pts_3d
conda run -n orion_env python results-orion-outputs/scripts/visualize_detection_results.py


Bounding Box Results Type: <class 'numpy.ndarray'>
Bounding Box Results Shape: (1,)
Bounding Box Results Array:
 [{'boxes_3d': LiDARInstance3DBoxes(
     tensor([[  4.5223, -20.0158,  -1.7567,  ...,   2.8503,  -2.7403,   8.9753],
         [  4.9073, -34.3515,  -1.4148,  ...,  -0.2639,   2.0223,  -7.8007],
         [ -7.2264,  23.0681,  -1.7680,  ...,   2.8710,  -0.9644,   3.6656],
         ...,
         [ -7.2264,  23.0681,  -1.7680,  ...,   2.8710,  -0.9644,   3.6656],
         [  1.1681,  -2.1684,  -1.7212,  ...,  -2.9380,  -0.1131,   0.1009],
         [  4.5223, -20.0158,  -1.7567,  ...,   2.8503,  -2.7403,   8.9753]])), 'scores_3d': tensor([0.9768, 0.9757, 0.9716, 0.9677, 0.1089, 0.1081, 0.1066, 0.1065, 0.1041,
         0.1039, 0.0933, 0.0848, 0.0833, 0.0813, 0.0793, 0.0770, 0.0750, 0.0737,
         0.0702, 0.0588, 0.0581, 0.0539, 0.0537, 0.0520, 0.0505, 0.0498, 0.0440,
         0.0331, 0.0324, 0.0311, 0.0307, 0.0281, 0.0276, 0.0261, 0.0249, 0.0226,
         0.0224, 0.0218, 0.0218

# Reasoning Space
## Vision Token Concatenation
**Function:** In simple_test_pts() [orion.py](https://github.com/xiaomi-mlab/Orion/blob/main/mmcv/models/detectors/orion.py)

**Input:**

* vision_embeded_obj: (1, 257, 4096) from OrionHead
* vision_embeded_map: (1, 256, 4096) from OrionHeadM

**Processing:**

* vision_embeded = torch.cat([vision_embeded_obj, vision_embeded_map], dim=1) [orion.py](https://github.com/xiaomi-mlab/Orion/blob/main/mmcv/models/detectors/orion.py)

**Output:**

* vision_embeded: (1, 513, 4096) tensor (dtype: float32 or float16)
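
The shape arithmetic can be verified with NumPy, whose `np.concatenate(..., axis=1)` joins along the token axis in the same way as `torch.cat(..., dim=1)`. The zero arrays are placeholders for the real embeddings.

```python
import numpy as np

vision_embeded_obj = np.zeros((1, 257, 4096), dtype=np.float16)
vision_embeded_map = np.zeros((1, 256, 4096), dtype=np.float16)

# Join along the token axis; batch and feature dimensions must already match.
vision_embeded = np.concatenate([vision_embeded_obj, vision_embeded_map], axis=1)
print(vision_embeded.shape)  # (1, 513, 4096)
```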

## Text Input Preparation
**Pipeline:** LoadAnnoatationCriticalVQATest transform [orion_stage3_agent.py](https://github.com/Rahhul17-IITH/Orion-v1/blob/main/adzoo/orion/configs/orion_stage3_agent.py)

**Input:**

* Critical QA prompts from dataset (e.g., "Describe the driving scenario and plan waypoints")

**Processing:**

* Tokenize using LLaVA tokenizer
* Format as conversation with special tokens

**Output:**

* input_ids: List of token sequences, each (seq_len,) tensor [orion.py](https://github.com/xiaomi-mlab/Orion/blob/main/mmcv/models/detectors/orion.py)

## Multi-Turn Conversation Loop
**Function:** Loop in simple_test_pts() [orion.py](https://github.com/xiaomi-mlab/Orion/blob/main/mmcv/models/detectors/orion.py)

For each conversation turn:

**Input:**

* input_ids[i]: (1, turn_len) - current turn tokens (orion.py:769)
* history_input_output_id: List of previous turns [orion.py](https://github.com/xiaomi-mlab/Orion/blob/main/mmcv/models/detectors/orion.py)

**Processing:**

* Check for special waypoint token (<waypoint_ego>) [orion.py](https://github.com/xiaomi-mlab/Orion/blob/main/mmcv/models/detectors/orion.py)
* Concatenate conversation history: context_input_ids = torch.cat(history_input_output_id, dim=-1) (orion.py:782)

**Output:**

* context_input_ids: (1, total_seq_len) - full conversation context
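
The loop can be sketched in plain Python with token-id lists in place of tensors. `WAYPOINT_TOKEN_ID` is a hypothetical id for `<waypoint_ego>`, `answer_fn` stands in for `lm_head.generate()`, and appending each generated answer to the history is an assumption suggested by the name `history_input_output_id`.

```python
WAYPOINT_TOKEN_ID = 32000  # hypothetical id for <waypoint_ego>


def run_turns(turns, answer_fn):
    """turns: list of token-id lists, one per question.

    Returns the context fed to the model at each turn and a flag telling
    whether that turn is a planning turn (contains the waypoint token).
    """
    history = []  # plays the role of history_input_output_id
    contexts, is_planning = [], []
    for ids in turns:
        history.append(ids)
        # Concatenate every stored chunk along the sequence axis.
        context = [tok for chunk in history for tok in chunk]
        contexts.append(context)
        planning = WAYPOINT_TOKEN_ID in ids
        is_planning.append(planning)
        if not planning:
            # Assumed: generated answers are kept for later turns.
            history.append(answer_fn(context))
    return contexts, is_planning


turns = [[1, 5, 6], [7, 8, WAYPOINT_TOKEN_ID]]
contexts, flags = run_turns(turns, answer_fn=lambda ctx: [99, 2])
print(flags)             # [False, True]
print(len(contexts[1]))  # 8
```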

## LLM Inference for Planning Token
**Function:** lm_head.inference_ego() [orion.py](https://github.com/xiaomi-mlab/Orion/blob/main/mmcv/models/detectors/orion.py)

**Input:**

* inputs: (1, total_seq_len) - conversation tokens [orion.py](https://github.com/xiaomi-mlab/Orion/blob/main/mmcv/models/detectors/orion.py)
* images: (1, 513, 4096) - vision embeddings
* return_ego_feature=True

**Processing** (inside LlavaLlamaForCausalLM.inference_ego(), llava_llama.py:243-311):

* Merge vision and text: prepare_inputs_labels_for_multimodal() creates inputs_embeds of shape (1, 513+total_seq_len, 4096) [llava_llama.py](https://github.com/xiaomi-mlab/Orion/blob/2eddb627/mmcv/utils/llava_llama.py)
* LLaMA forward pass: self.model(inputs_embeds=inputs_embeds) → hidden_states of shape (1, 513+total_seq_len, 4096)
* Extract planning token: Find <waypoint_ego> token position and extract its hidden state

**Output:**

* ego_feature: (1, 4096) tensor (dtype: float32) - planning token [orion.py](https://github.com/xiaomi-mlab/Orion/blob/main/mmcv/models/detectors/orion.py)
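
The extraction step amounts to a masked lookup. This sketch assumes the token ids are aligned position by position with the hidden states; in the real model the vision tokens are merged into the sequence first, so the positions would need to be offset accordingly. `WAYPOINT_TOKEN_ID` and `extract_ego_feature` are hypothetical names.

```python
import numpy as np

WAYPOINT_TOKEN_ID = 32000  # hypothetical id for <waypoint_ego>


def extract_ego_feature(hidden_states, token_ids):
    """hidden_states: (B, L, D); token_ids: (B, L), aligned position by position.

    Returns the hidden state at the waypoint token of each sequence, shape (B, D).
    """
    mask = token_ids == WAYPOINT_TOKEN_ID
    if not (mask.sum(axis=1) == 1).all():
        raise ValueError("expected exactly one waypoint token per sequence")
    return hidden_states[mask]


rng = np.random.default_rng(0)
hidden = rng.standard_normal((1, 10, 4096)).astype(np.float32)
ids = np.array([[1, 4, 4, 4, 7, 8, 9, WAYPOINT_TOKEN_ID, 3, 2]])
ego_feature = extract_ego_feature(hidden, ids)
print(ego_feature.shape)  # (1, 4096)
```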

## LLM Inference for Text Generation (Non-Planning Turns)
**Function:** lm_head.generate()

**Input:**

* inputs: (1, total_seq_len) - conversation tokens
* images: (1, 513, 4096) - vision embeddings
* Generation params: temperature=0.1, top_p=0.75, max_new_tokens=320

**Processing:**

* Similar vision-text merging
* Autoregressive generation with sampling

**Output:**

* output_ids: (1, generated_len) tensor - generated token IDs [orion.py](https://github.com/xiaomi-mlab/Orion/blob/main/mmcv/models/detectors/orion.py)
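
To show what `temperature=0.1` and `top_p=0.75` do to a single decoding step, here is a generic nucleus-sampling sketch in NumPy. It is not the code path used by `lm_head.generate()`; `sample_top_p` is an illustrative helper that operates on one vector of logits.

```python
import numpy as np


def sample_top_p(logits, temperature=0.1, top_p=0.75, rng=None):
    """Sample one token id from 1-D logits using temperature scaling and nucleus filtering."""
    if rng is None:
        rng = np.random.default_rng()
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    probs /= probs.sum()

    order = np.argsort(-probs)             # most likely token first
    cumulative = np.cumsum(probs[order])
    # Keep the smallest prefix whose mass reaches top_p (always at least one token).
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    keep = order[:cutoff]

    kept = probs[keep] / probs[keep].sum()  # renormalize over the kept tokens
    return int(rng.choice(keep, p=kept))


logits = np.array([2.0, 1.0, 0.5, -1.0])
token = sample_top_p(logits, rng=np.random.default_rng(0))
print(token)  # 0: at temperature 0.1 the top logit holds almost all the mass
```

With a temperature this low the sampling is close to greedy decoding; raising it to 1.0 on the same logits would leave two candidate tokens inside the 0.75 nucleus.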

In [None]:
#printing ego_feature
!python results-orion-outputs/scripts/visualize_ego_feature.py


Ego Feature Shape: (1, 4096)
Ego Feature Array:
 [[-0.17627868  0.07835264 -0.06944825 ... -0.07497703 -0.5635871
  -0.2444535 ]]


# Action Space
## Planning Token Preparation
**Function:** In simple_test_pts() [orion.py](https://github.com/xiaomi-mlab/Orion/blob/main/mmcv/models/detectors/orion.py)

**Input:**

* ego_feature: (1, 4096) tensor

**Processing:**

* Convert to float32: ego_feature = ego_feature.to(torch.float32) [orion.py](https://github.com/xiaomi-mlab/Orion/blob/main/mmcv/models/detectors/orion.py)
* Add time dimension: current_states = ego_feature.unsqueeze(1) → (1, 1, 4096)

**Output:**

* current_states: (1, 1, 4096) tensor
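
The same two operations in NumPy, as a quick shape check: `astype(np.float32)` corresponds to `.to(torch.float32)` and indexing with `None` corresponds to `.unsqueeze(1)`. The half-precision input is an assumption about what the LLM may return.

```python
import numpy as np

ego_feature = np.zeros((1, 4096), dtype=np.float16)  # assumed half-precision LLM output
ego_feature = ego_feature.astype(np.float32)         # counterpart of .to(torch.float32)
current_states = ego_feature[:, None, :]             # counterpart of .unsqueeze(1)
print(current_states.shape, current_states.dtype)    # (1, 1, 4096) float32
```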

## Trajectory Generation (VAE Mode)
**Function:** VAE-based planning in simple_test_pts() [orion.py](https://github.com/xiaomi-mlab/Orion/blob/main/mmcv/models/detectors/orion.py)

**Input:**

* current_states: (1, 1, 4096)

**Processing:**

* Distribution sampling: distribution_forward(current_states, None, None) → sample of shape (1, 32) (latent code) [orion.py](https://github.com/xiaomi-mlab/Orion/blob/main/mmcv/models/detectors/orion.py)
* Future state prediction: future_states_predict(B, sample, hidden_states, current_states) → states_hs of shape (6, 1, 1, 4096) (6 timesteps)
* Trajectory decoding: For each timestep, ego_fut_decoder(ego_query_hs[i]) → (1, 6, 2) (6 modes, 2D coords)
* Stack timesteps: ego_fut_preds = torch.stack(ego_fut_trajs_list, dim=2) → (1, 6, 6, 2)
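
The per-timestep decoding and stacking can be sketched with NumPy. The single random linear map `w_dec` is a placeholder for `ego_fut_decoder`, whose real architecture is not described here; only the shapes follow the steps above.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_STEPS, NUM_MODES, D = 6, 6, 4096

# One hidden state per future timestep: (6, 1, 1, 4096).
states_hs = rng.standard_normal((NUM_STEPS, 1, 1, D)).astype(np.float32)

# Placeholder decoder weights: one linear map to NUM_MODES * 2 outputs.
w_dec = rng.standard_normal((D, NUM_MODES * 2)).astype(np.float32) * 0.01

ego_fut_trajs_list = []
for i in range(NUM_STEPS):
    out = states_hs[i, :, 0, :] @ w_dec                       # (1, 12)
    ego_fut_trajs_list.append(out.reshape(1, NUM_MODES, 2))   # (1, 6, 2)

# Same semantics as torch.stack(..., dim=2): new timestep axis at position 2.
ego_fut_preds = np.stack(ego_fut_trajs_list, axis=2)
print(ego_fut_preds.shape)  # (1, 6, 6, 2)
```

Under this layout, `ego_fut_preds[0, m]` is the 6-step trajectory of mode `m`, which matches the six blocks of six (x, y) pairs in the printed array.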

In [None]:
#printing ego_fut_preds
!python results-orion-outputs/scripts/visualize_trajectory.py


Trajectory Shape: (1, 6, 6, 2)
Trajectory Array:
 [[[[-0.20913687  3.5807621 ]
   [-1.0376061   3.764699  ]
   [-1.3528324   3.6801393 ]
   [-1.2351054   3.6720114 ]
   [-1.2603774   3.6748312 ]
   [-1.2735626   3.6967483 ]]

  [[ 1.057352    3.0138216 ]
   [ 1.7058829   2.3560398 ]
   [ 1.9005634   2.1520894 ]
   [ 1.8582957   2.1411152 ]
   [ 1.9006087   2.1504965 ]
   [ 1.9230629   2.2301652 ]]

  [[-0.17620757  3.5128942 ]
   [-1.1287831   3.6970603 ]
   [-1.3341503   3.749539  ]
   [-1.2232906   3.70322   ]
   [-1.2255021   3.7313988 ]
   [-1.2195609   3.792561  ]]

  [[-0.04183037  3.5330524 ]
   [-1.0285923   3.6404078 ]
   [-1.3653128   3.6833303 ]
   [-1.2526083   3.676955  ]
   [-1.2725302   3.6854796 ]
   [-1.2821392   3.7095413 ]]

  [[-0.09738618  3.4222233 ]
   [-1.5524168   3.805614  ]
   [-1.8910606   4.282611  ]
   [-1.7400434   4.3998213 ]
   [-1.6791778   4.5095673 ]
   [-1.619999    4.524088  ]]

  [[ 0.0781251   3.640977  ]
   [-0.45696583  3.7615323 ]
   [-1.1823