diff --git a/README.md b/README.md
index 9870dfd..cc83b03 100644
--- a/README.md
+++ b/README.md
@@ -8,7 +8,7 @@
[](https://arxiv.org/abs/2312.16170)
-[](./assets/2024_NeurIPS_MMScan_Camera_Ready.pdf)
+[](./assets/2406.09401v2.pdf)
[](https://tai-wang.github.io/mmscan)
@@ -21,14 +21,21 @@
## ๐ Contents
-1. [About](#-about)
-2. [Getting Started](#-getting-started)
-3. [MMScan API Tutorial](#-mmscan-api-tutorial)
-4. [MMScan Benchmark](#-mmscan-benchmark)
-5. [TODO List](#-todo-list)
+1. [News](#-news)
+2. [About](#-about)
+3. [Getting Started](#-getting-started)
+4. [MMScan Tutorial](#-mmscan-api-tutorial)
+5. [MMScan Benchmark](#-mmscan-benchmark)
+6. [TODO List](#-todo-list)
-## ๐ About
+## ๐ฅ News
+
+- \[2025-06\] We are co-organizing the CVPR 2025 3D Scene Understanding Challenge. You're warmly invited to participate in the MMScan Hierarchical Visual Grounding track!
+The challenge test server is now online [here](https://huggingface.co/spaces/rbler/3d-iou-challenge). We look forward to your strong submissions!
+- \[2025-01\] We are delighted to present the official release of [MMScan-devkit](https://github.com/OpenRobotLab/EmbodiedScan/tree/mmscan), which encompasses a suite of data processing utilities, benchmark evaluation tools, and adaptations of some models for the MMScan benchmarks. We invite you to explore these resources and welcome any feedback or questions you may have!
+
+## ๐ About
@@ -59,7 +66,6 @@ existing benchmarks and in-the-wild evaluation.
## ๐ Getting Started
-
### Installation
1. Clone Github repo.
@@ -100,247 +106,90 @@ existing benchmarks and in-the-wild evaluation.
Please refer to the [guide](data_preparation/README.md) here.
-## ๐ MMScan API Tutorial
-
+## ๐ MMScan Tutorial
 The **MMScan Toolkit** provides comprehensive tools for dataset handling and model evaluation across MMScan tasks.
-To import the MMScan API, you can use the following commands:
-
-```bash
-import mmscan
-
-# (1) The dataset tool
-import mmscan.MMScan as MMScan_dataset
-
-# (2) The evaluator tool ('VisualGroundingEvaluator', 'QuestionAnsweringEvaluator', 'GPTEvaluator')
-import mmscan.VisualGroundingEvaluator as MMScan_VG_evaluator
-
-import mmscan.QuestionAnsweringEvaluator as MMScan_QA_evaluator
-
-import mmscan.GPTEvaluator as MMScan_GPT_evaluator
-```
-
### MMScan Dataset
The dataset tool in MMScan allows seamless access to data required for various tasks within MMScan.
-#### Usage
-
-Initialize the dataset for a specific task with:
-
-```bash
-my_dataset = MMScan_dataset(split='train', task="MMScan-QA", ratio=1.0)
-# Access a specific sample
-print(my_dataset[index])
-```
-
-#### Data Access
-
-Each dataset item is a dictionary containing key elements:
-
-(1) 3D Modality
-
-- **"ori_pcds"** (tuple\[tensor\]): Original point cloud data extracted from the .pth file.
-- **"pcds"** (np.ndarray): Point cloud data with dimensions [n_points, 6(xyz+rgb)], representing the coordinates and color of each point.
-- **"instance_labels"** (np.ndarray): Instance ID assigned to each point in the point cloud.
-- **"class_labels"** (np.ndarray): Class IDs assigned to each point in the point cloud.
-- **"bboxes"** (dict): Information about bounding boxes within the scan, structured as { object ID:
- {
- "type": object type (str),
- "bbox": 9 DoF box (np.ndarray)
- }}
-
-(2) Language Modality
-
-- **"sub_class"**: The category of the sample.
-- **"ID"**: The sample's ID.
-- **"scan_id"**: The scan's ID.
-- *For Visual Grounding task*
-- **"target_id"** (list\[int\]): IDs of target objects.
-- **"text"** (str): Text used for grounding.
-- **"target"** (list\[str\]): Text prompt to specify the target grounding object.
-- **"anchors"** (list\[str\]): Types of anchor objects.
-- **"anchor_ids"** (list\[int\]): IDs of anchor objects.
-- **"tokens_positive"** (dict): Indices of positions where mentioned objects appear in the text.
-- *For Qusetion Answering task*
-- **"question"** (str): The text of the question.
-- **"answers"** (list\[str\]): List of possible answers.
-- **"object_ids"** (list\[int\]): Object IDs referenced in the question.
-- **"object_names"** (list\[str\]): Types of referenced objects.
-- **"input_bboxes_id"** (list\[int\]): IDs of input bounding boxes.
-- **"input_bboxes"** (list\[np.ndarray\]): Input 9-DoF bounding boxes.
-
-(3) 2D Modality
+- #### Usage
-- **'img_path'** (str): File path to the RGB image.
-- **'depth_img_path'** (str): File path to the depth image.
-- **'intrinsic'** (np.ndarray): Intrinsic parameters of the camera for RGB images.
-- **'depth_intrinsic'** (np.ndarray): Intrinsic parameters of the camera for depth images.
-- **'extrinsic'** (np.ndarray): Extrinsic parameters of the camera.
-- **'visible_instance_id'** (list): IDs of visible objects in the image.
+ Initialize the dataset for a specific task with:
-### MMScan Evaluator
+    ```python
+ from mmscan import MMScan
-Our evaluation tool is designed to streamline the assessment of model outputs for the MMScan task, providing essential metrics to gauge model performance effectively.
+ # (1) The dataset tool
+    my_dataset = MMScan(split='train', task='MMScan-QA')  # split: 'train' / 'val' / 'test'; task: 'MMScan-VG' / 'MMScan-QA'
+ # Access a specific sample
+ print(my_dataset[index])
+ ```
-#### 1. Visual Grounding Evaluator
+    *Note:* For the test split, only the VG annotations are publicly available; the QA annotations have not been released.
-For the visual grounding task, our evaluator computes multiple metrics including AP (Average Precision), AR (Average Recall), AP_C, AR_C, and gtop-k:
+- #### Data Access
-- **AP and AR**: These metrics calculate the precision and recall by considering each sample as an individual category.
-- **AP_C and AR_C**: These versions categorize samples belonging to the same subclass and calculate them together.
-- **gTop-k**: An expanded metric that generalizes the traditional Top-k metric, offering superior flexibility and interpretability compared to traditional ones when oriented towards multi-target grounding.
-
-*Note:* Here, AP corresponds to APsample in the paper, and AP_C corresponds to APbox in the paper.
+    Each dataset item is a dictionary containing data from three modalities: language, 2D, and 3D ([details](https://rbler1234.gitbook.io/mmscan-devkit-tutorial#data-access)).
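+
+    A minimal sketch of inspecting one QA sample is shown below; the exact fields available depend on the chosen task and split (see the linked details):
+
+    ```python
+    from mmscan import MMScan
+
+    # Load the QA validation split and inspect one sample.
+    qa_dataset = MMScan(split='val', task='MMScan-QA')
+    sample = qa_dataset[0]
+
+    print(sample.keys())          # all fields carried by this sample
+    print(sample['question'])     # question text (QA task)
+    print(sample['answers'])      # list of ground-truth answers
+    print(sample['pcds'].shape)   # point cloud, [n_points, 6] = xyz + rgb
+    print(len(sample['images']))  # per-camera 2D info (paths, intrinsics, extrinsics)
+    ```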
-Below is an example of how to utilize the Visual Grounding Evaluator:
+### MMScan Evaluation
-```python
-# Initialize the evaluator with show_results enabled to display results
-my_evaluator = MMScan_VG_evaluator(show_results=True)
+Our evaluation tool is designed to streamline the assessment of model outputs for the MMScan task, providing essential metrics to gauge model performance effectively. We provide three evaluation tools: `VisualGroundingEvaluator`, `QuestionAnsweringEvaluator`, and `GPTEvaluator`. For more details, please refer to the [documentation](https://rbler1234.gitbook.io/mmscan-devkit-tutorial/evaluator).
-# Update the evaluator with the model's output
-my_evaluator.update(model_output)
-
-# Start the evaluation process and retrieve metric results
-metric_dict = my_evaluator.start_evaluation()
-
-# Optional: Retrieve detailed sample-level results
-print(my_evaluator.records)
-
-# Optional: Show the table of results
-print(my_evaluator.print_result())
-
-# Important: Reset the evaluator after use
-my_evaluator.reset()
-```
-
-The evaluator expects input data in a specific format, structured as follows:
-
-```python
-[
- {
- "pred_scores" (tensor/ndarray): Confidence scores for each prediction. Shape: (num_pred, 1)
-
- "pred_bboxes"/"gt_bboxes" (tensor/ndarray): List of 9 DoF bounding boxes.
- Supports two input formats:
- 1. 9-dof box format: (num_pred/gt, 9)
- 2. center, size and rotation matrix:
- "center": (num_pred/gt, 3),
- "size" : (num_pred/gt, 3),
- "rot" : (num_pred/gt, 3, 3)
-
- "subclass": The subclass of each VG sample.
- "index": Index of the sample.
- }
- ...
-]
-```
-
-#### 2. Question Answering Evaluator
-
-The question answering evaluator measures performance using several established metrics:
-
-- **Bleu-X**: Evaluates n-gram overlap between prediction and ground truths.
-- **Meteor**: Focuses on precision, recall, and synonymy.
-- **CIDEr**: Considers consensus-based agreement.
-- **SPICE**: Used for semantic propositional content.
-- **SimCSE/SBERT**: Semantic similarity measures using sentence embeddings.
-- **EM (Exact Match) and Refine EM**: Compare exact matches between predictions and ground truths.
+```python
+from mmscan import MMScan
-```python
-# Initialize evaluator with pre-trained weights for SIMCSE and SBERT
-my_evaluator = MMScan_QA_evaluator(model_config={}, show_results=True)
+# (2) The evaluator tool ('VisualGroundingEvaluator', 'QuestionAnsweringEvaluator', 'GPTEvaluator')
+from mmscan import VisualGroundingEvaluator, QuestionAnsweringEvaluator, GPTEvaluator
-# Update evaluator with model output
+# For VisualGroundingEvaluator and QuestionAnsweringEvaluator: initialize the evaluator, feed it the model output with update(), then run the evaluation to obtain the final results.
+my_evaluator = VisualGroundingEvaluator(show_results=True)  # or QuestionAnsweringEvaluator(show_results=True)
my_evaluator.update(model_output)
-
-# Start evaluation and obtain metrics
metric_dict = my_evaluator.start_evaluation()
-# Optional: View detailed sample-level results
-print(my_evaluator.records)
-
-# Important: Reset evaluator after completion
-my_evaluator.reset()
-```
+# For GPTEvaluator: initialize it with an API key, evaluate the model output with multithreading, and save the results under the specified path (tmp_path).
+gpt_evaluator = GPTEvaluator(API_key='XXX')
+metric_dict = gpt_evaluator.load_and_eval(model_output, num_threads=1, tmp_path='XXX')
-The evaluator requires input data structured as follows:
-
-```python
-[
- {
- "question" (str): The question text,
- "pred" (list[str]): The predicted answer, single element list,
- "gt" (list[str]): Ground truth answers, containing multiple elements,
- "ID": Unique ID for each QA sample,
- "index": Index of the sample,
- }
- ...
-]
```
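+
+For reference, below is a minimal sketch of the record structure the evaluators expect in `update()` / `load_and_eval()`, following the field names in the devkit tutorial (all values here are dummies):
+
+```python
+import numpy as np
+
+# Visual grounding: one record per sample. 9-DoF boxes can be passed directly
+# as (N, 9) arrays (or, alternatively, as separate center/size/rotation entries).
+vg_model_output = [{
+    'pred_scores': np.array([[0.9], [0.3]]),    # (num_pred, 1) confidence scores
+    'pred_bboxes': np.zeros((2, 9)),            # (num_pred, 9) predicted 9-DoF boxes
+    'gt_bboxes': np.zeros((1, 9)),              # (num_gt, 9) ground-truth 9-DoF boxes
+    'subclass': 'XXX',                          # subclass of the VG sample
+    'index': 0,                                 # sample index
+}]
+
+# Question answering: one record per sample; GPTEvaluator consumes the same structure.
+qa_model_output = [{
+    'question': 'What is on the table?',        # question text
+    'pred': ['a laptop'],                       # predicted answer, single-element list
+    'gt': ['a laptop', 'a notebook computer'],  # ground-truth answers
+    'ID': 'XXX',                                # unique ID of the QA sample
+    'index': 0,                                 # sample index
+}]
+```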
-#### 3. GPT Evaluator
-In addition to classical QA metrics, the GPT evaluator offers a more advanced evaluation process.
-
-```python
-# Initialize GPT evaluator with an API key for access
-my_evaluator = MMScan_GPT_Evaluator(API_key='XXX')
-
-# Load, evaluate with multiprocessing, and store results in temporary path
-metric_dict = my_evaluator.load_and_eval(model_output, num_threads=5, tmp_path='XXX')
-
-# Important: Reset evaluator when finished
-my_evaluator.reset()
-```
-
-The input structure remains the same as for the question answering evaluator:
-
-```python
-[
- {
- "question" (str): The question text,
- "pred" (list[str]): The predicted answer, single element list,
- "gt" (list[str]): Ground truth answers, containing multiple elements,
- "ID": Unique ID for each QA sample,
- "index": Index of the sample,
- }
- ...
-]
-```
+### MMScan HVG Challenge Submission
+To participate in the MMScan Visual Grounding Challenge and submit your results, please follow the instructions available on our [test server](https://huggingface.co/spaces/rbler/3d-iou-challenge). We welcome your feedback and inquiries; please feel free to contact us at linjingli@166.com.
## ๐ MMScan Benchmark
+
+

+
### MMScan Visual Grounding Benchmark
| Methods | gTop-1 | gTop-3 | APsample | APbox | AR | Release | Download |
-|---------|--------|--------|---------------------|------------------|----|-------|----|
-| ScanRefer | 4.74 | 9.19 | 9.49 | 2.28 | 47.68 | [code](https://github.com/rbler1234/EmbodiedScan/tree/mmscan-devkit/models/Scanrefer) | [model](https://drive.google.com/file/d/1C0-AJweXEc-cHTe9tLJ3Shgqyd44tXqY/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1ENOS2FE7fkLPWjIf9J76VgiPrn6dGKvi/view?usp=drive_link) |
+|---------|----------------|-----------|---------------------|------------------|----|-------|----|
+| ScanRefer | 4.74 | 9.19 | 9.49 | 2.28 | 47.68 | [code](./models/Scanrefer/README.md) | [model](https://drive.google.com/file/d/1C0-AJweXEc-cHTe9tLJ3Shgqyd44tXqY/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1ENOS2FE7fkLPWjIf9J76VgiPrn6dGKvi/view?usp=drive_link) |
| MVT | 7.94 | 13.07 | 13.67 | 2.50 | 86.86 | - | - |
| BUTD-DETR | 15.24 | 20.68 | 18.58 | 9.27 | 66.62 | - | - |
| ReGround3D | 16.35 | 26.13 | 22.89 | 5.25 | 43.24 | - | - |
-| EmbodiedScan | 19.66 | 34.00 | 29.30 | **15.18** | 59.96 | [code](https://github.com/OpenRobotLab/EmbodiedScan/tree/mmscan/models/EmbodiedScan) | [model](https://drive.google.com/file/d/1F6cHY6-JVzAk6xg5s61aTT-vD-eu_4DD/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1Ua_-Z2G3g0CthbeBkrR1a7_sqg_Spd9s/view?usp=drive_link) |
+| EmbodiedScan | 19.66 | 34.00 | 29.30 | **15.18** | 59.96 | [code](./models/EmbodiedScan/README.md) | [model](https://drive.google.com/file/d/1F6cHY6-JVzAk6xg5s61aTT-vD-eu_4DD/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1Ua_-Z2G3g0CthbeBkrR1a7_sqg_Spd9s/view?usp=drive_link) |
| 3D-VisTA | 25.38 | 35.41 | 33.47 | 6.67 | 87.52 | - | - |
| ViL3DRef | **26.34** | **37.58** | **35.09** | 6.65 | 86.86 | - | - |
### MMScan Question Answering Benchmark
+
| Methods | Overall | ST-attr | ST-space | OO-attr | OO-space | OR| Advanced | Release | Download |
|---|--------|--------|--------|--------|--------|--------|-------|----|----|
-| LL3DA | 45.7 | 39.1 | 58.5 | 43.6 | 55.9 | 37.1 | 24.0| [code](https://github.com/rbler1234/EmbodiedScan/tree/mmscan-devkit/models/LL3DA) | [model](https://drive.google.com/file/d/1mcWNHdfrhdbtySBtmG-QRH1Y1y5U3PDQ/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1VHpcnO0QmAvMa0HuZa83TEjU6AiFrP42/view?usp=drive_link) |
-| LEO |54.6 | 48.9 | 62.7 | 50.8 | 64.7 | 50.4 | 45.9 | [code](https://github.com/rbler1234/EmbodiedScan/tree/mmscan-devkit/models/LEO) | [model](https://drive.google.com/drive/folders/1HZ38LwRe-1Q_VxlWy8vqvImFjtQ_b9iA?usp=drive_link)|
+| LL3DA | 45.7 | 39.1 | 58.5 | 43.6 | 55.9 | 37.1 | 24.0 | [code](./models/LL3DA/README.md) | [model](https://drive.google.com/file/d/1mcWNHdfrhdbtySBtmG-QRH1Y1y5U3PDQ/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1VHpcnO0QmAvMa0HuZa83TEjU6AiFrP42/view?usp=drive_link) |
+| LEO |54.6 | 48.9 | 62.7 | 50.8 | 64.7 | 50.4 | 45.9 | [code](./models/LEO/README.md) | [model](https://drive.google.com/drive/folders/1HZ38LwRe-1Q_VxlWy8vqvImFjtQ_b9iA?usp=drive_link)|
| LLaVA-3D |**61.6** | 58.5 | 63.5 | 56.8 | 75.6 | 58.0 | 38.5|- | - |
*Note:* These two tables only show the results for main metrics; see the paper for complete results.
-We have released the codes of some models under [./models](./models/README.md).
+We have released the code of several adapted models under [./models](./models).
## ๐ TODO List
-
- \[ \] MMScan annotation and samples for ARKitScenes.
-- \[ \] Online evaluation platform for the MMScan benchmark.
- \[ \] Codes of more MMScan Visual Grounding baselines and Question Answering baselines.
- \[ \] Full release and further updates.
diff --git a/assets/2024_NeurIPS_MMScan_Camera_Ready.pdf b/assets/2024_NeurIPS_MMScan_Camera_Ready.pdf
deleted file mode 100644
index ff3ac2c..0000000
Binary files a/assets/2024_NeurIPS_MMScan_Camera_Ready.pdf and /dev/null differ
diff --git a/assets/2406.09401v2.pdf b/assets/2406.09401v2.pdf
new file mode 100644
index 0000000..813b1f4
Binary files /dev/null and b/assets/2406.09401v2.pdf differ
diff --git a/assets/LEO.png b/assets/LEO.png
new file mode 100644
index 0000000..15d0799
Binary files /dev/null and b/assets/LEO.png differ
diff --git a/assets/LL3DA.png b/assets/LL3DA.png
new file mode 100644
index 0000000..5e86868
Binary files /dev/null and b/assets/LL3DA.png differ
diff --git a/assets/Scanrefer.png b/assets/Scanrefer.png
new file mode 100644
index 0000000..cb297b5
Binary files /dev/null and b/assets/Scanrefer.png differ
diff --git a/assets/benchmark.png b/assets/benchmark.png
new file mode 100644
index 0000000..b0c3966
Binary files /dev/null and b/assets/benchmark.png differ
diff --git a/assets/circle.png b/assets/circle.png
new file mode 100644
index 0000000..8f073d1
Binary files /dev/null and b/assets/circle.png differ
diff --git a/assets/ex.png b/assets/ex.png
new file mode 100644
index 0000000..6dce8cb
Binary files /dev/null and b/assets/ex.png differ
diff --git a/assets/graph.png b/assets/graph.png
new file mode 100644
index 0000000..4ca22f1
Binary files /dev/null and b/assets/graph.png differ
diff --git a/assets/mix.png b/assets/mix.png
new file mode 100644
index 0000000..f0b7dbd
Binary files /dev/null and b/assets/mix.png differ
diff --git a/data_preparation/README.md b/data_preparation/README.md
index 4273267..3addbf8 100644
--- a/data_preparation/README.md
+++ b/data_preparation/README.md
@@ -1,4 +1,4 @@
-### Prepare MMscan info files.
+### MMScan Dataset Preparation
Given the licenses of respective raw datasets, we recommend users download the raw data from their official websites and then organize them following the below guide.
Detailed steps are shown as follows.
diff --git a/data_preparation/process_all_scan.py b/data_preparation/process_all_scan.py
index 7509110..e826762 100644
--- a/data_preparation/process_all_scan.py
+++ b/data_preparation/process_all_scan.py
@@ -54,31 +54,32 @@ def create_scene_pcd(es_anno: dict,
pc, color, label = pcd_result
label = np.ones_like(label) * -100
instance_ids = np.ones(pc.shape[0], dtype=np.int16) * (-100)
- bboxes = es_anno['bboxes'].reshape(-1, 9)
- bboxes[:, 3:6] = np.clip(bboxes[:, 3:6], a_min=1e-2, a_max=None)
- object_ids = es_anno['object_ids']
- object_types = es_anno['object_types'] # str
- sorted_indices = sorted(enumerate(bboxes),
- key=lambda x: -np.prod(x[1][3:6]))
- # the larger the box, the smaller the index
- sorted_indices_list = [index for index, value in sorted_indices]
-
- bboxes = [bboxes[index] for index in sorted_indices_list]
- object_ids = [object_ids[index] for index in sorted_indices_list]
- object_types = [object_types[index] for index in sorted_indices_list]
-
- for box, obj_id, obj_type in zip(bboxes, object_ids, object_types):
- obj_type_id = TYPE2INT.get(obj_type, -1)
- center, size = box[:3], box[3:6]
-
- orientation = np.array(
- euler_angles_to_matrix(torch.tensor(box[np.newaxis, 6:]),
- convention='ZXY')[0])
-
- box_pc_mask = is_inside_box(pc, center, size, orientation)
-
- instance_ids[box_pc_mask] = obj_id
- label[box_pc_mask] = obj_type_id
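+    # some scans may provide no 'bboxes' (e.g. test scans without annotations); labels then keep the -100 default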
+ if 'bboxes' in es_anno:
+ bboxes = es_anno['bboxes'].reshape(-1, 9)
+ bboxes[:, 3:6] = np.clip(bboxes[:, 3:6], a_min=1e-2, a_max=None)
+ object_ids = es_anno['object_ids']
+ object_types = es_anno['object_types'] # str
+ sorted_indices = sorted(enumerate(bboxes),
+ key=lambda x: -np.prod(x[1][3:6]))
+ # the larger the box, the smaller the index
+ sorted_indices_list = [index for index, value in sorted_indices]
+
+ bboxes = [bboxes[index] for index in sorted_indices_list]
+ object_ids = [object_ids[index] for index in sorted_indices_list]
+ object_types = [object_types[index] for index in sorted_indices_list]
+
+ for box, obj_id, obj_type in zip(bboxes, object_ids, object_types):
+ obj_type_id = TYPE2INT.get(obj_type, -1)
+ center, size = box[:3], box[3:6]
+
+ orientation = np.array(
+ euler_angles_to_matrix(torch.tensor(box[np.newaxis, 6:]),
+ convention='ZXY')[0])
+
+ box_pc_mask = is_inside_box(pc, center, size, orientation)
+
+ instance_ids[box_pc_mask] = obj_id
+ label[box_pc_mask] = obj_type_id
return pc, color, label, instance_ids
@@ -180,6 +181,11 @@ def process_one_scan(
type=str,
default=f'{path_of_version1}/embodiedscan_infos_val.pkl',
)
+ parser.add_argument(
+ '--test_pkl_path',
+ type=str,
+ default=f'{path_of_version1}/embodiedscan_infos_test.pkl',
+ )
parser.add_argument('--nproc', type=int, default=8)
args = parser.parse_args()
@@ -198,7 +204,7 @@ def process_one_scan(
allow_pickle=True)['metainfo']['categories']
es_anno.update(read_annotation_pickle(args.train_pkl_path))
es_anno.update(read_annotation_pickle(args.val_pkl_path))
-
+ es_anno.update(read_annotation_pickle(args.test_pkl_path))
# loading the required scan id
with open(f'{args.meta_path}/all_scan.json', 'r') as f:
scan_id_list = json.load(f)
diff --git a/data_preparation/utils/data_utils.py b/data_preparation/utils/data_utils.py
index 177521f..850207b 100644
--- a/data_preparation/utils/data_utils.py
+++ b/data_preparation/utils/data_utils.py
@@ -11,22 +11,29 @@ def read_annotation_pickle(path: str, show_progress: bool = True):
show_progress (bool): whether showing the progress.
Returns:
dict: A dictionary.
- scene_id : (bboxes, object_ids, object_types, visible_dict,
- extrinsics_c2w, axis_align_matrix, intrinsics, image_paths)
- bboxes: numpy array of bounding boxes,
+ scene_id : (bboxes, object_ids, object_types,
+ visible_view_object_dict, extrinsics_c2w,
+ axis_align_matrix, intrinsics, image_paths)
+ bboxes:
+ numpy array of bounding boxes,
shape (N, 9): xyz, lwh, ypr
- object_ids: numpy array of obj ids, shape (N,)
- object_types: list of strings, each string is a type of object
- visible_view_object_dict: a dictionary {view_id:
- visible_instance_ids}
- extrinsics_c2w: a list of 4x4 matrices, each matrix is the
- extrinsic matrix of a view
- axis_align_matrix: a 4x4 matrix, the axis-aligned matrix
- of the scene
- intrinsics: a list of 4x4 matrices, each matrix is the
- intrinsic matrix of a view
- image_paths: a list of strings, each string is the path
- of an image in the scene
+ object_ids:
+ numpy array of obj ids, shape (N,)
+ object_types:
+ list of strings, each string is a type of object
+ visible_view_object_dict:
+ a dictionary {view_id: visible_instance_ids}
+ extrinsics_c2w:
+ a list of 4x4 matrices, each matrix is the extrinsic
+ matrix of a view
+ axis_align_matrix:
+ a 4x4 matrix, the axis-aligned matrix of the scene
+ intrinsics:
+ a list of 4x4 matrices, each matrix is the intrinsic
+ matrix of a view
+ image_paths:
+ a list of strings, each string is the path of an image
+ in the scene
"""
with open(path, 'rb') as f:
data = np.load(f, allow_pickle=True)
@@ -39,12 +46,8 @@ def read_annotation_pickle(path: str, show_progress: bool = True):
pbar = (tqdm(range(len(datalist))) if show_progress else range(
len(datalist)))
for scene_idx in pbar:
- # print(datalist[scene_idx]['sample_idx'])
- # if "matterport3d" not in datalist[scene_idx]['sample_idx']:
- # continue
- # print(datalist[scene_idx].keys())
+
images = datalist[scene_idx]['images']
- # print(images[0].keys())
intrinsic = datalist[scene_idx].get('cam2img', None) # a 4x4 matrix
missing_intrinsic = False
@@ -61,25 +64,25 @@ def read_annotation_pickle(path: str, show_progress: bool = True):
'axis_align_matrix'] # a 4x4 matrix
scene_id = datalist[scene_idx]['sample_idx']
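+        # object-level annotations are only available when the pickle provides 'instances' (e.g. not for the test split)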
+ if 'instances' in datalist[scene_idx]:
+ instances = datalist[scene_idx]['instances']
+ bboxes = []
+ object_ids = []
+ object_types = []
+ object_type_ints = []
+ for object_idx in range(len(instances)):
+ bbox_3d = instances[object_idx]['bbox_3d'] # list of 9 values
+ bbox_label_3d = instances[object_idx]['bbox_label_3d'] # int
+ bbox_id = instances[object_idx]['bbox_id'] # int
+ object_type = object_int_to_type[bbox_label_3d]
- instances = datalist[scene_idx]['instances']
- bboxes = []
- object_ids = []
- object_types = []
- object_type_ints = []
- for object_idx in range(len(instances)):
- bbox_3d = instances[object_idx]['bbox_3d'] # list of 9 values
- bbox_label_3d = instances[object_idx]['bbox_label_3d'] # int
- bbox_id = instances[object_idx]['bbox_id'] # int
- object_type = object_int_to_type[bbox_label_3d]
-
- object_type_ints.append(bbox_label_3d)
- object_types.append(object_type)
- bboxes.append(bbox_3d)
- object_ids.append(bbox_id)
- bboxes = np.array(bboxes)
- object_ids = np.array(object_ids)
- object_type_ints = np.array(object_type_ints)
+ object_type_ints.append(bbox_label_3d)
+ object_types.append(object_type)
+ bboxes.append(bbox_3d)
+ object_ids.append(bbox_id)
+ bboxes = np.array(bboxes)
+ object_ids = np.array(object_ids)
+ object_type_ints = np.array(object_type_ints)
visible_view_object_dict = {}
visible_view_object_list = []
@@ -99,11 +102,12 @@ def read_annotation_pickle(path: str, show_progress: bool = True):
intrinsic = images[image_idx]['cam2img']
depth_intrinsic = images[image_idx]['cam2img']
- visible_instance_indices = images[image_idx][
- 'visible_instance_ids'] # numpy array of int
- visible_instance_ids = object_ids[visible_instance_indices]
- visible_view_object_dict[extrinsic_id] = visible_instance_ids
- visible_view_object_list.append(visible_instance_ids)
+ if 'instances' in datalist[scene_idx]:
+ visible_instance_indices = images[image_idx][
+ 'visible_instance_ids'] # numpy array of int
+ visible_instance_ids = object_ids[visible_instance_indices]
+ visible_view_object_dict[extrinsic_id] = visible_instance_ids
+ visible_view_object_list.append(visible_instance_ids)
extrinsics_c2w.append(cam2global)
intrinsics.append(intrinsic)
depth_intrinsics.append(depth_intrinsic)
@@ -112,14 +116,7 @@ def read_annotation_pickle(path: str, show_progress: bool = True):
if show_progress:
pbar.set_description(f'Processing scene {scene_id}')
output_data[scene_id] = {
- # object level
- 'bboxes': bboxes,
- 'object_ids': object_ids,
- 'object_types': object_types,
- 'object_type_ints': object_type_ints,
# image level
- 'visible_instance_ids': visible_view_object_list,
- 'visible_view_object_dict': visible_view_object_dict,
'extrinsics_c2w': extrinsics_c2w,
'axis_align_matrix': axis_align_matrix,
'intrinsics': intrinsics,
@@ -127,4 +124,21 @@ def read_annotation_pickle(path: str, show_progress: bool = True):
'image_paths': image_paths,
'depth_image_paths': depth_image_paths,
}
+ if 'instances' in datalist[scene_idx]:
+ output_data[scene_id].update({
+ # object level
+ 'bboxes':
+ bboxes,
+ 'object_ids':
+ object_ids,
+ 'object_types':
+ object_types,
+ 'object_type_ints':
+ object_type_ints,
+ # image level
+ 'visible_instance_ids':
+ visible_view_object_list,
+ 'visible_view_object_dict':
+ visible_view_object_dict
+ })
return output_data
diff --git a/mmscan/mmscan.py b/mmscan/mmscan.py
index 85a6005..7349ce7 100644
--- a/mmscan/mmscan.py
+++ b/mmscan/mmscan.py
@@ -65,14 +65,10 @@ def __init__(
self.dataroot = os.path.join(
os.path.dirname(os.path.dirname(ENV_PATH)), 'mmscan_data')
self.verbose = verbose
-
- # now we skip the test split because we don not provide ground truth.
- if split == 'test':
- split = 'val'
self.split = split
self.check_mode = check_mode
if self.check_mode:
- print("embodiedscan's checking mode!!!")
+ print("MMScan's checking mode")
self.pkl_name = f'{self.dataroot}/embodiedscan_split' +\
f'/embodiedscan-{self.version}' +\
f'/embodiedscan_infos_{split}.pkl'
@@ -230,16 +226,18 @@ def __getitem__(self, index_: int) -> dict:
scan_idx = self.mmscan_collect['anno'][index_]['scan_id']
pcd_info = self.__process_pcd_info__(scan_idx)
images_info = self.__process_img_info__(scan_idx)
- box_info = self.__process_box_info__(scan_idx)
data_dict['ori_pcds'] = pcd_info['ori_pcds']
data_dict['pcds'] = pcd_info['pcds']
data_dict['obj_pcds'] = pcd_info['obj_pcds']
data_dict['instance_labels'] = pcd_info['instance_labels']
data_dict['class_labels'] = pcd_info['class_labels']
- data_dict['bboxes'] = box_info
data_dict['images'] = images_info
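+        # ground-truth boxes are not provided for the test split, so they are only attached for train/val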
+ if self.split != 'test':
+ box_info = self.__process_box_info__(scan_idx)
+ data_dict['bboxes'] = box_info
+
# (3) loading the data from the collection
# necessary to use deepcopy?
data_dict.update(deepcopy(self.mmscan_collect['anno'][index_]))
@@ -282,7 +280,6 @@ def get_possess(self, table_name: str, scan_idx: str):
Args:
table_name (str): The ype of the expected data.
scan_idx (str): The scan id to get the data.
-
Returns:
The data corresponding to the table_name and scan_idx.
"""
@@ -332,7 +329,6 @@ def __filter_lang_anno__(self, samples: List[dict]) -> List[dict]:
Args:
samples (list[dict]): The samples.
-
Returns:
list[dict] : The filtered results.
"""
@@ -351,11 +347,12 @@ def __check_lang_anno__(self, sample: dict) -> bool:
Args:
sample (dict): The item from the samples.
-
Returns:
bool : Whether the item is valid or not.
"""
# fix little typo
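+        # test-split samples carry no ground-truth object ids to validate, so they always pass the check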
+ if self.split == 'test':
+ return True
anno_obj_ids = self.embodiedscan_anno[sample['scan_id']]['object_ids']
if self.task == 'MMScan-VG':
if (len(sample['target']) != len(sample['target_id'])
@@ -379,7 +376,6 @@ def __load_base_anno__(self, pkl_path: str) -> dict:
Args:
pkl_path (str): The path of the pkl.
-
Returns:
dict : The embodiedscan annotations of scans.
(with scan_idx as keys)
@@ -393,7 +389,6 @@ def __process_pcd_info__(self, scan_idx: str) -> dict:
Args:
scan_idx (str): ID of the scan.
-
Returns:
dict : The corresponding scan information.
"""
@@ -429,7 +424,6 @@ def __process_box_info__(self, scan_idx: str) -> dict:
Args:
scan_idx (str): ID of the scan.
-
Returns:
dict : The corresponding bounding boxes information.
"""
@@ -454,7 +448,6 @@ def __process_img_info__(self, scan_idx: str) -> List[dict]:
Args:
scan_idx (str): ID of the scan.
-
Returns:
list[dict] :The corresponding bounding boxes information
for each camera.
@@ -473,8 +466,9 @@ def __process_img_info__(self, scan_idx: str) -> List[dict]:
self.get_possess('depth_intrinsics', scan_idx))
img_info['extrinsic'] = deepcopy(
self.get_possess('extrinsics_c2w', scan_idx))
- img_info['visible_instance_id'] = deepcopy(
- self.get_possess('visible_instance_ids', scan_idx))
+ if self.split != 'test':
+ img_info['visible_instance_id'] = deepcopy(
+ self.get_possess('visible_instance_ids', scan_idx))
img_info_list = []
for camera_index in range(len(img_info['img_path'])):
@@ -495,7 +489,6 @@ def down_9dof_to_6dof(
the point clouds
box_9DOF(np.ndarray / Tensor):
the 9DOF bounding box
-
Returns:
np.ndarray :
The transformed 6DOF bounding box.
@@ -510,7 +503,6 @@ def __downsample_annos__(self, annos: List[dict],
Args:
annos (list[dict]): The original annotations.
ratio (float): The ratio to downsample.
-
Returns:
list[dict] : The result.
"""
diff --git a/mmscan/utils/data_io.py b/mmscan/utils/data_io.py
index 6c08c1e..f1a65ce 100644
--- a/mmscan/utils/data_io.py
+++ b/mmscan/utils/data_io.py
@@ -31,22 +31,26 @@ def read_annotation_pickle(path: str, show_progress: bool = True):
scene_id : (bboxes, object_ids, object_types,
visible_view_object_dict, extrinsics_c2w,
axis_align_matrix, intrinsics, image_paths)
-
- bboxes: numpy array of bounding boxes,
+ bboxes:
+ numpy array of bounding boxes,
shape (N, 9): xyz, lwh, ypr
- object_ids: numpy array of obj ids, shape (N,)
- object_types: list of strings, each string is a type
- of object
- visible_view_object_dict: a dictionary
- {view_id: visible_instance_ids}
- extrinsics_c2w: a list of 4x4 matrices, each matrix is
- the extrinsic matrix of a view
- axis_align_matrix: a 4x4 matrix, the axis-aligned matrix
- of the scene
- intrinsics: a list of 4x4 matrices, each matrix is the
- intrinsic matrix of a view
- image_paths: a list of strings, each string is the path
- of an image in the scene
+ object_ids:
+ numpy array of obj ids, shape (N,)
+ object_types:
+ list of strings, each string is a type of object
+ visible_view_object_dict:
+ a dictionary {view_id: visible_instance_ids}
+ extrinsics_c2w:
+ a list of 4x4 matrices, each matrix is the extrinsic
+ matrix of a view
+ axis_align_matrix:
+ a 4x4 matrix, the axis-aligned matrix of the scene
+ intrinsics:
+ a list of 4x4 matrices, each matrix is the intrinsic
+ matrix of a view
+ image_paths:
+ a list of strings, each string is the path of an image
+ in the scene
"""
with open(path, 'rb') as f:
data = np.load(f, allow_pickle=True)
@@ -60,7 +64,6 @@ def read_annotation_pickle(path: str, show_progress: bool = True):
len(datalist)))
for scene_idx in pbar:
images = datalist[scene_idx]['images']
-
intrinsic = datalist[scene_idx].get('cam2img', None) # a 4x4 matrix
missing_intrinsic = False
if intrinsic is None:
@@ -76,25 +79,25 @@ def read_annotation_pickle(path: str, show_progress: bool = True):
'axis_align_matrix'] # a 4x4 matrix
scene_id = datalist[scene_idx]['sample_idx']
-
- instances = datalist[scene_idx]['instances']
- bboxes = []
- object_ids = []
- object_types = []
- object_type_ints = []
- for object_idx in range(len(instances)):
- bbox_3d = instances[object_idx]['bbox_3d'] # list of 9 values
- bbox_label_3d = instances[object_idx]['bbox_label_3d'] # int
- bbox_id = instances[object_idx]['bbox_id'] # int
- object_type = object_int_to_type[bbox_label_3d]
-
- object_type_ints.append(bbox_label_3d)
- object_types.append(object_type)
- bboxes.append(bbox_3d)
- object_ids.append(bbox_id)
- bboxes = np.array(bboxes)
- object_ids = np.array(object_ids)
- object_type_ints = np.array(object_type_ints)
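+    # object-level annotations are only available when the pickle provides 'instances' (e.g. not for the test split)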
+ if 'instances' in datalist[scene_idx]:
+ instances = datalist[scene_idx]['instances']
+ bboxes = []
+ object_ids = []
+ object_types = []
+ object_type_ints = []
+ for object_idx in range(len(instances)):
+ bbox_3d = instances[object_idx]['bbox_3d'] # list of 9 values
+ bbox_label_3d = instances[object_idx]['bbox_label_3d'] # int
+ bbox_id = instances[object_idx]['bbox_id'] # int
+ object_type = object_int_to_type[bbox_label_3d]
+
+ object_type_ints.append(bbox_label_3d)
+ object_types.append(object_type)
+ bboxes.append(bbox_3d)
+ object_ids.append(bbox_id)
+ bboxes = np.array(bboxes)
+ object_ids = np.array(object_ids)
+ object_type_ints = np.array(object_type_ints)
visible_view_object_dict = {}
visible_view_object_list = []
@@ -114,11 +117,12 @@ def read_annotation_pickle(path: str, show_progress: bool = True):
intrinsic = images[image_idx]['cam2img']
depth_intrinsic = images[image_idx]['cam2img']
- visible_instance_indices = images[image_idx][
- 'visible_instance_ids'] # numpy array of int
- visible_instance_ids = object_ids[visible_instance_indices]
- visible_view_object_dict[extrinsic_id] = visible_instance_ids
- visible_view_object_list.append(visible_instance_ids)
+ if 'instances' in datalist[scene_idx]:
+ visible_instance_indices = images[image_idx][
+ 'visible_instance_ids'] # numpy array of int
+ visible_instance_ids = object_ids[visible_instance_indices]
+ visible_view_object_dict[extrinsic_id] = visible_instance_ids
+ visible_view_object_list.append(visible_instance_ids)
extrinsics_c2w.append(cam2global)
intrinsics.append(intrinsic)
depth_intrinsics.append(depth_intrinsic)
@@ -126,15 +130,10 @@ def read_annotation_pickle(path: str, show_progress: bool = True):
depth_image_paths.append(depth_image)
if show_progress:
pbar.set_description(f'Processing scene {scene_id}')
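+        # pre-apply the axis-alignment matrix so the extrinsics are expressed in the axis-aligned scene frame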
+ extrinsics_c2w = [(axis_align_matrix @ extrinsic) for extrinsic in
+ extrinsics_c2w]
output_data[scene_id] = {
- # object level
- 'bboxes': bboxes,
- 'object_ids': object_ids,
- 'object_types': object_types,
- 'object_type_ints': object_type_ints,
# image level
- 'visible_instance_ids': visible_view_object_list,
- 'visible_view_object_dict': visible_view_object_dict,
'extrinsics_c2w': extrinsics_c2w,
'axis_align_matrix': axis_align_matrix,
'intrinsics': intrinsics,
@@ -142,6 +141,23 @@ def read_annotation_pickle(path: str, show_progress: bool = True):
'image_paths': image_paths,
'depth_image_paths': depth_image_paths,
}
+ if 'instances' in datalist[scene_idx]:
+ output_data[scene_id].update({
+ # object level
+ 'bboxes':
+ bboxes,
+ 'object_ids':
+ object_ids,
+ 'object_types':
+ object_types,
+ 'object_type_ints':
+ object_type_ints,
+ # image level
+ 'visible_instance_ids':
+ visible_view_object_list,
+ 'visible_view_object_dict':
+ visible_view_object_dict
+ })
return output_data
diff --git a/mmscan/utils/task_utils.py b/mmscan/utils/task_utils.py
index e4c5b8e..62b7d60 100644
--- a/mmscan/utils/task_utils.py
+++ b/mmscan/utils/task_utils.py
@@ -17,6 +17,8 @@ def anno_token_flatten(samples: List[dict], keep_only_one: bool = True):
marked_indices = []
for i, d in enumerate(samples):
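+        # skip samples without 'target_id' (e.g. test-split annotations that omit ground truth)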
+ if 'target_id' not in d:
+ continue
target_ids = d['target_id']
ret_target_ids = []
ret_target = []
diff --git a/models/EmbodiedScan/README.md b/models/EmbodiedScan/README.md
new file mode 100644
index 0000000..0bdabf2
--- /dev/null
+++ b/models/EmbodiedScan/README.md
@@ -0,0 +1,47 @@
+# [EmbodiedScan](https://arxiv.org/abs/2312.16170) for MMScan Visual Grounding
+
+
+
+## Introduction
+
+In the original model, Embodied Perceptron accepts an RGB-D sequence with an arbitrary number of views, together with text, as multi-modal input. It uses classical encoders to extract features for each modality and adopts dense and isomorphic sparse fusion with corresponding decoders for different predictions. The 3D features, integrated with the text feature, can be further used for language-grounded understanding.
+
+To adapt it to the MMScan Visual Grounding setting, we replace the egocentric multi-view image input with point clouds to keep it
+consistent with the other baselines, and we remove the
+corresponding ResNet-50 backbone, reducing the model to a framework similar to L3Det.
+
+
+

+
+
+## Tutorial
+
+1. Follow the [EmbodiedScan setup instructions](https://github.com/OpenRobotLab/EmbodiedScan/blob/main/README.md) to prepare the environment. Download the [Multi-View 3D Detection model's weights](https://download.openmmlab.com/mim-example/embodiedscan/mv-3ddet.pth) and change the "load_from" path in the config file under `configs/grounding` to the path where the weights are saved.
+
+2. Install MMScan API.
+
+3. Run one of the following commands to train EmbodiedScan:
+
+ ```bash
+ # Single GPU training
+ python tools/train.py configs/grounding/pcd_4xb24_mmscan_vg_num256.py --work-dir=path/to/save
+
+ # Multiple GPUs training
+ python tools/train.py configs/grounding/pcd_4xb24_mmscan_vg_num256.py --work-dir=path/to/save --launcher="pytorch"
+ ```
+
+4. Run one of the following commands to evaluate EmbodiedScan:
+
+ ```bash
+ # Single GPU testing
+ python tools/test.py configs/grounding/pcd_4xb24_mmscan_vg_num256.py path/to/load_pth
+
+ # Multiple GPUs testing
+ python tools/test.py configs/grounding/pcd_4xb24_mmscan_vg_num256.py path/to/load_pth --launcher="pytorch"
+ ```
+
+## Results and Models
+
+| Input Modality | Det Pretrain | Epoch | gTop-1 @ 0.25 | gTop-3 @ 0.25 | Config | Download |
+| :-------: | :----: | :----:| :----: | :---------: | :--------------------------------------------------------: | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
+| Point Cloud | ✔ | 12 | 19.66 | 34.00 | [config](configs/grounding/pcd_4xb24_mmscan_vg_num256.py) | [model](https://drive.google.com/file/d/1F6cHY6-JVzAk6xg5s61aTT-vD-eu_4DD/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1Ua_-Z2G3g0CthbeBkrR1a7_sqg_Spd9s/view?usp=drive_link) |
diff --git a/models/EmbodiedScan/embodiedscan/datasets/mmscan_dataset.py b/models/EmbodiedScan/embodiedscan/datasets/mmscan_dataset.py
index eee72a2..ed8d209 100644
--- a/models/EmbodiedScan/embodiedscan/datasets/mmscan_dataset.py
+++ b/models/EmbodiedScan/embodiedscan/datasets/mmscan_dataset.py
@@ -9,7 +9,7 @@
from embodiedscan.registry import DATASETS
from embodiedscan.structures import get_box_type
from embodiedscan.structures.points import DepthPoints, get_points_type
-from lry_utils.utils_read import to_sample_idx
+from mmscan_utils.utils_read import to_sample_idx
from mmengine.dataset import BaseDataset
from mmengine.fileio import load
diff --git a/models/EmbodiedScan/embodiedscan/datasets/mv_3dvg_dataset.py b/models/EmbodiedScan/embodiedscan/datasets/mv_3dvg_dataset.py
index 6d4e802..0db7aa5 100644
--- a/models/EmbodiedScan/embodiedscan/datasets/mv_3dvg_dataset.py
+++ b/models/EmbodiedScan/embodiedscan/datasets/mv_3dvg_dataset.py
@@ -8,7 +8,7 @@
import numpy as np
from embodiedscan.registry import DATASETS
from embodiedscan.structures import get_box_type
-from lry_utils.utils_read import to_sample_idx
+from mmscan_utils.utils_read import to_sample_idx
from mmengine.dataset import BaseDataset
from mmengine.fileio import load
diff --git a/models/EmbodiedScan/embodiedscan/datasets/pcd_3dvg_dataset.py b/models/EmbodiedScan/embodiedscan/datasets/pcd_3dvg_dataset.py
index 5c33aec..b03ac02 100644
--- a/models/EmbodiedScan/embodiedscan/datasets/pcd_3dvg_dataset.py
+++ b/models/EmbodiedScan/embodiedscan/datasets/pcd_3dvg_dataset.py
@@ -8,7 +8,7 @@
import numpy as np
from embodiedscan.registry import DATASETS
from embodiedscan.structures import get_box_type
-from lry_utils.utils_read import to_sample_idx
+from mmscan_utils.utils_read import to_sample_idx
from mmengine.dataset import BaseDataset
from mmengine.fileio import load
diff --git a/models/EmbodiedScan/embodiedscan/datasets/pcd_3dvg_dataset_demo.py b/models/EmbodiedScan/embodiedscan/datasets/pcd_3dvg_dataset_demo.py
index f91a9d7..efa9bba 100644
--- a/models/EmbodiedScan/embodiedscan/datasets/pcd_3dvg_dataset_demo.py
+++ b/models/EmbodiedScan/embodiedscan/datasets/pcd_3dvg_dataset_demo.py
@@ -8,7 +8,7 @@
import numpy as np
from embodiedscan.registry import DATASETS
from embodiedscan.structures import get_box_type
-from lry_utils.utils_read import to_sample_idx
+from mmscan_utils.utils_read import to_sample_idx
from mmengine.dataset import BaseDataset
from mmengine.fileio import load
from scipy.spatial.transform import Rotation as R
diff --git a/models/EmbodiedScan/embodiedscan/datasets/transforms/default.py b/models/EmbodiedScan/embodiedscan/datasets/transforms/default.py
index 18f9952..e5f59ba 100644
--- a/models/EmbodiedScan/embodiedscan/datasets/transforms/default.py
+++ b/models/EmbodiedScan/embodiedscan/datasets/transforms/default.py
@@ -5,7 +5,7 @@
import torch
from embodiedscan.registry import TRANSFORMS
from embodiedscan.structures.points import DepthPoints, get_points_type
-from lry_utils.utils_read import NUM2RAW_3RSCAN, to_sample_idx, to_scene_id
+from mmscan_utils.utils_read import NUM2RAW_3RSCAN, to_sample_idx, to_scene_id
from mmcv.transforms import BaseTransform, Compose
diff --git a/models/EmbodiedScan/embodiedscan/datasets/transforms/pointcloud.py b/models/EmbodiedScan/embodiedscan/datasets/transforms/pointcloud.py
index 8b73dd8..01fb1c6 100644
--- a/models/EmbodiedScan/embodiedscan/datasets/transforms/pointcloud.py
+++ b/models/EmbodiedScan/embodiedscan/datasets/transforms/pointcloud.py
@@ -5,7 +5,7 @@
import torch
from embodiedscan.registry import TRANSFORMS
from embodiedscan.structures.points import DepthPoints, get_points_type
-from lry_utils.utils_read import NUM2RAW_3RSCAN, to_sample_idx, to_scene_id
+from mmscan_utils.utils_read import NUM2RAW_3RSCAN, to_sample_idx, to_scene_id
from mmcv.transforms import BaseTransform, Compose
diff --git a/models/EmbodiedScan/embodiedscan/datasets/transforms/pointcloud_demo.py b/models/EmbodiedScan/embodiedscan/datasets/transforms/pointcloud_demo.py
index 6931e6d..916b15f 100644
--- a/models/EmbodiedScan/embodiedscan/datasets/transforms/pointcloud_demo.py
+++ b/models/EmbodiedScan/embodiedscan/datasets/transforms/pointcloud_demo.py
@@ -5,7 +5,7 @@
import torch
from embodiedscan.registry import TRANSFORMS
from embodiedscan.structures.points import DepthPoints, get_points_type
-from lry_utils.utils_read import NUM2RAW_3RSCAN, to_sample_idx, to_scene_id
+from mmscan_utils.utils_read import NUM2RAW_3RSCAN, to_sample_idx, to_scene_id
from mmcv.transforms import BaseTransform, Compose
diff --git a/models/EmbodiedScan/lry_utils/utils_read.py b/models/EmbodiedScan/mmscan_utils/utils_read.py
similarity index 100%
rename from models/EmbodiedScan/lry_utils/utils_read.py
rename to models/EmbodiedScan/mmscan_utils/utils_read.py
diff --git a/models/LEO/README.md b/models/LEO/README.md
new file mode 100644
index 0000000..44bf2e6
--- /dev/null
+++ b/models/LEO/README.md
@@ -0,0 +1,50 @@
+# [LEO](https://arxiv.org/abs/2311.12871) for MMScan Question Answering
+
+
+## Introduction
+
+LEO takes egocentric 2D images, 3D point clouds, and texts as input and
+formulates comprehensive 3D tasks as autoregressive sequence prediction. Through instruction tuning, LEO extends the capabilities of large language models to unified multi-modal vision-language-action tasks.
+
+
+

+
+
+## Tutorial
+
+1. Follow the [LEO setup instructions](https://github.com/embodied-generalist/embodied-generalist/blob/main/README.md) to prepare the environment. For data preparation, you do not need to download the full datasets; you only need to:
+
+ (1) Download [Vicuna-7B](https://huggingface.co/huangjy-pku/vicuna-7b/tree/main) and update cfg_path in configs/llm/\*.yaml
+
+ (2) Download the [sft_noact.pth](https://huggingface.co/datasets/huangjy-pku/LEO_data/tree/main) and store it under the `./weights` folder
+
+2. Install MMScan API.
+
+3. Edit the config under `scripts/train_tuning_mmscan.sh` and `scripts/test_tuning_mmscan.sh`
+
+4. Run the following command to train LEO (4 GPUs):
+
+ ```bash
+ bash scripts/train_tuning_mmscan.sh
+ ```
+
+5. Run the following command to evaluate LEO (4 GPUs):
+
+ ```bash
+ bash scripts/test_tuning_mmscan.sh
+ ```
+
+    Optional: after obtaining the results, you can additionally run the GPT evaluator.
+    'test_embodied_scan_l_complete.json' is generated under the checkpoint folder after evaluation, and tmp_path is used for temporary storage.
+
+ ```bash
+ python evaluator/GPT_eval.py --file path/to/test_embodied_scan_l_complete.json
+ --tmp_path path/to/tmp --api_key your_api_key --eval_size -1
+ --nproc 4
+ ```
+
+## Results and Models
+
+| LLM | 2D Backbone | 3D Backbone | Epoch | Overall GPT Score | Config | Download |
+| :-------: | :----: | :----: | :----: |:---------: | :--------------------------------------------------------: | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
+| Vicuna7b | ConvNeXt | PointNet++ | 1 | 54.6 | [config](https://drive.google.com/file/d/1CJccZd4TOaT_JdHj073UKwdA5PWUDtja/view?usp=drive_link) | [model](https://drive.google.com/drive/folders/1HZ38LwRe-1Q_VxlWy8vqvImFjtQ_b9iA?usp=drive_link) |
diff --git a/models/LL3DA/README.md b/models/LL3DA/README.md
new file mode 100644
index 0000000..9bcc771
--- /dev/null
+++ b/models/LL3DA/README.md
@@ -0,0 +1,54 @@
+# [LL3DA](https://arxiv.org/abs/2311.18651) for MMScan Question Answering
+
+
+
+## Introduction
+
+(a) The overall pipeline of LL3DA first extracts interaction-aware 3D scene
+embeddings, which are then projected as a prefix to the textual instructions and fed into a frozen LLM.
+(b) The Interactor3D aggregates visual prompts, textual instructions, and 3D scene embeddings into a fixed-length set of querying tokens.
+(c) The prompt encoder encodes user clicks and box coordinates with positional embeddings and ROI features, respectively.
+
+
+

+
+
+## Tutorial
+
+1. Follow the [LL3DA setup instructions](https://github.com/Open3DA/LL3DA/blob/main/README.md) to prepare the environment. For data preparation, you do not need to download the full datasets; you only need to:
+
+    (1) Download the [released pre-trained weights](https://huggingface.co/CH3COOK/LL3DA-weight-release/blob/main/ll3da-opt-1.3b.pth) and put them under `./pretrained`
+
+ (2) Download the [pre-processed BERT embedding weights](https://huggingface.co/CH3COOK/bert-base-embedding/tree/main) and store them under the `./bert-base-embedding` folder
+
+2. Install MMScan API.
+
+3. Edit the config under `./scripts/opt-1.3b/eval.mmscanqa.sh` and `./scripts/opt-1.3b/tuning.mmscanqa.sh`
+
+4. Run the following command to train LL3DA (4 GPUs):
+
+ ```bash
+ bash scripts/opt-1.3b/tuning.mmscanqa.sh
+ ```
+
+5. Run the following command to evaluate LL3DA (4 GPUs):
+
+ ```bash
+ bash scripts/opt-1.3b/eval.mmscanqa.sh
+ ```
+
+    Optional: after obtaining the results, you can additionally run the GPT evaluator.
+    'qa_pred_gt_val.json' is generated under the checkpoint folder after evaluation, and tmp_path is used for temporary storage.
+
+ ```bash
+ python eval_utils/evaluate_gpt.py --file path/to/qa_pred_gt_val.json
+ --tmp_path path/to/tmp --api_key your_api_key --eval_size -1
+ --nproc 4
+ ```
+
+## Results and Models
+
+| Detector | Captioner | Iters | Overall GPT Score | Download |
+| :-------: | :----: | :----: | :---------: |:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
+| Vote2Cap-DETR | LL3DA | 100k | 45.7 | [model](https://drive.google.com/file/d/1mcWNHdfrhdbtySBtmG-QRH1Y1y5U3PDQ/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1VHpcnO0QmAvMa0HuZa83TEjU6AiFrP42/view?usp=drive_link) |
diff --git a/models/README.md b/models/README.md
deleted file mode 100644
index db3bc96..0000000
--- a/models/README.md
+++ /dev/null
@@ -1,140 +0,0 @@
-## 3D Visual Grounding Models
-
-These are 3D visual grounding models adapted for the mmscan-devkit. Currently, two models have been released: EmbodiedScan and ScanRefer.
-
-### ScanRefer
-
-1. Follow the [ScanRefer](https://github.com/daveredrum/ScanRefer/blob/master/README.md) to setup the environment. For data preparation, you need not load the datasets, only need to download the [preprocessed GLoVE embeddings](https://kaldir.vc.in.tum.de/glove.p) (~990MB) and put them under `data/`
-
-2. Install MMScan API.
-
-3. Overwrite the `lib/config.py/CONF.PATH.OUTPUT` to your desired output directory.
-
-4. Run the following command to train ScanRefer (one GPU):
-
- ```bash
- python -u scripts/train.py --use_color --epoch {10/25/50}
- ```
-
-5. Run the following command to evaluate ScanRefer (one GPU):
-
- ```bash
- python -u scripts/train.py --use_color --eval_only --use_checkpoint "path/to/pth"
- ```
-#### Results and Models
-
-| Epoch | gTop-1 @ 0.25|gTop-1 @0.50 | Config | Download |
-| :-------: | :---------:| :---------: | :--------------------------------------------------------: | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
-| 50 | 4.74 | 2.52 | [config](https://drive.google.com/file/d/1iJtsjt4K8qhNikY8UmIfiQy1CzIaSgyU/view?usp=drive_link) | [model](https://drive.google.com/file/d/1C0-AJweXEc-cHTe9tLJ3Shgqyd44tXqY/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1ENOS2FE7fkLPWjIf9J76VgiPrn6dGKvi/view?usp=drive_link)
-### EmbodiedScan
-
-1. Follow the [EmbodiedScan](https://github.com/OpenRobotLab/EmbodiedScan/blob/main/README.md) to setup the environment. Download the [Multi-View 3D Detection model's weights](https://download.openmmlab.com/mim-example/embodiedscan/mv-3ddet.pth) and change the "load_from" path in the config file under `configs/grounding` to the path where the weights are saved.
-
-2. Install MMScan API.
-
-3. Run the following command to train EmbodiedScan (multiple GPUs):
-
- ```bash
- # Single GPU training
- python tools/train.py configs/grounding/pcd_4xb24_mmscan_vg_num256.py --work-dir=path/to/save
-
- # Multiple GPUs training
- python tools/train.py configs/grounding/pcd_4xb24_mmscan_vg_num256.py --work-dir=path/to/save --launcher="pytorch"
- ```
-
-4. Run the following command to evaluate EmbodiedScan (multiple GPUs):
-
- ```bash
- # Single GPU testing
- python tools/test.py configs/grounding/pcd_4xb24_mmscan_vg_num256.py path/to/load_pth
-
- # Multiple GPUs testing
- python tools/test.py configs/grounding/pcd_4xb24_mmscan_vg_num256.py path/to/load_pth --launcher="pytorch"
- ```
-#### Results and Models
-
-| Input Modality | Det Pretrain | Epoch | gTop-1 @ 0.25 | gTop-1 @ 0.50 | Config | Download |
-| :-------: | :----: | :----:| :----: | :---------: | :--------------------------------------------------------: | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
-| Point Cloud | ✔ | 12 | 19.66 | 8.82 | [config](https://github.com/rbler1234/EmbodiedScan/blob/mmscan-devkit/models/EmbodiedScan/configs/grounding/pcd_4xb24_mmscan_vg_num256.py) | [model](https://drive.google.com/file/d/1F6cHY6-JVzAk6xg5s61aTT-vD-eu_4DD/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1Ua_-Z2G3g0CthbeBkrR1a7_sqg_Spd9s/view?usp=drive_link)
-
-## 3D Question Answering Models
-
-These are 3D question answering models adapted for the mmscan-devkit. Currently, two models have been released: LL3DA and LEO.
-
-### LL3DA
-
-1. Follow the [LL3DA](https://github.com/Open3DA/LL3DA/blob/main/README.md) to setup the environment. For data preparation, you need not load the datasets, only need to:
-
- (1) download the [release pre-trained weights.](https://huggingface.co/CH3COOK/LL3DA-weight-release/blob/main/ll3da-opt-1.3b.pth) and put them under `./pretrained`
-
- (2) Download the [pre-processed BERT embedding weights](https://huggingface.co/CH3COOK/bert-base-embedding/tree/main) and store them under the `./bert-base-embedding` folder
-
-2. Install MMScan API.
-
-3. Edit the config under `./scripts/opt-1.3b/eval.mmscanqa.sh` and `./scripts/opt-1.3b/tuning.mmscanqa.sh`
-
-4. Run the following command to train LL3DA (4 GPUs):
-
- ```bash
- bash scripts/opt-1.3b/tuning.mmscanqa.sh
- ```
-
-5. Run the following command to evaluate LL3DA (4 GPUs):
-
- ```bash
- bash scripts/opt-1.3b/eval.mmscanqa.sh
- ```
-
- Optinal: You can use the GPT evaluator by this after getting the result.
- 'qa_pred_gt_val.json' will be generated under the checkpoint folder after evaluation and the tmp_path is used for temporarily storing.
-
- ```bash
- python eval_utils/evaluate_gpt.py --file path/to/qa_pred_gt_val.json
- --tmp_path path/to/tmp --api_key your_api_key --eval_size -1
- --nproc 4
- ```
-#### Results and Models
-
-| Detector | Captioner | Iters | Overall GPT Score | Download |
-| :-------: | :----: | :----: | :---------: |:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
-| Vote2Cap-DETR | LL3DA | 100k | 45.7 | [model](https://drive.google.com/file/d/1mcWNHdfrhdbtySBtmG-QRH1Y1y5U3PDQ/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1VHpcnO0QmAvMa0HuZa83TEjU6AiFrP42/view?usp=drive_link) |
-
-
-
-### LEO
-
-1. Follow the [LEO](https://github.com/embodied-generalist/embodied-generalist/blob/main/README.md) to setup the environment. For data preparation, you need not load the datasets, only need to:
-
- (1) Download [Vicuna-7B](https://huggingface.co/huangjy-pku/vicuna-7b/tree/main) and update cfg_path in configs/llm/\*.yaml
-
- (2) Download the [sft_noact.pth](https://huggingface.co/datasets/huangjy-pku/LEO_data/tree/main) and store it under the `./weights` folder
-
-2. Install MMScan API.
-
-3. Edit the config under `scripts/train_tuning_mmscan.sh` and `scripts/test_tuning_mmscan.sh`
-
-4. Run the following command to train LEO (4 GPUs):
-
- ```bash
- bash scripts/train_tuning_mmscan.sh
- ```
-
-5. Run the following command to evaluate LEO (4 GPUs):
-
- ```bash
- bash scripts/test_tuning_mmscan.sh
- ```
-
- Optinal: You can use the GPT evaluator by this after getting the result.
- 'test_embodied_scan_l_complete.json' will be generated under the checkpoint folder after evaluation and the tmp_path is used for temporarily storing.
-
- ```bash
- python evaluator/GPT_eval.py --file path/to/test_embodied_scan_l_complete.json
- --tmp_path path/to/tmp --api_key your_api_key --eval_size -1
- --nproc 4
- ```
-#### ckpts & Logs
-
-| LLM | 2D Backbone | 3D Backbone | Epoch | Overall GPT Score | Config | Download |
-| :-------: | :----: | :----: | :----: |:---------: | :--------------------------------------------------------: | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
-| Vicuna7b | ConvNeXt | PointNet++ | 1 | 54.6 | [config](https://drive.google.com/file/d/1CJccZd4TOaT_JdHj073UKwdA5PWUDtja/view?usp=drive_link) | [model](https://drive.google.com/drive/folders/1HZ38LwRe-1Q_VxlWy8vqvImFjtQ_b9iA?usp=drive_link) |
diff --git a/models/Scanrefer/README.md b/models/Scanrefer/README.md
new file mode 100644
index 0000000..f9b6f62
--- /dev/null
+++ b/models/Scanrefer/README.md
@@ -0,0 +1,45 @@
+# [ScanRefer](https://arxiv.org/abs/1912.08830) for MMScan Visual Grounding
+
+
+
+## Introduction
+
+The PointNet++ backbone takes a point cloud as input and
+aggregates it into high-level point feature maps, which are then
+clustered and fused into object proposals by a voting module. The object proposals are masked by the objectness predictions and then
+fused with the sentence embedding of the input description, which is obtained
+by a GloVe + GRU encoder.
+
+For the MMScan Visual Grounding task, to support the
+oriented 3D box output, we add a 6D rotation representation to the original regression targets and supervise it
+with a disentangled Chamfer Distance (CD) loss over the eight box corners.
+
+
+

+
+
+## Tutorial
+
+1. Follow the [ScanRefer setup instructions](https://github.com/daveredrum/ScanRefer/blob/master/README.md) to prepare the environment. For data preparation, you do not need to download the full datasets; only download the [preprocessed GloVe embeddings](https://kaldir.vc.in.tum.de/glove.p) (~990MB) and put them under `data/`.
+
+2. Install MMScan API.
+
+3. Set `CONF.PATH.OUTPUT` in `lib/config.py` to your desired output directory.
+
+4. Run the following command to train ScanRefer (one GPU):
+
+ ```bash
+ python -u scripts/train.py --use_color --epoch {10/25/50}
+ ```
+
+5. Run the following command to evaluate ScanRefer (one GPU):
+
+ ```bash
+ python -u scripts/train.py --use_color --eval_only --use_checkpoint "path/to/pth"
+ ```
+
+## Results and Models
+
+| Epoch | gTop-1 @ 0.25|gTop-1 @0.50 | Config | Download |
+| :-------: | :---------:| :---------: | :--------------------------------------------------------: | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
+| 50 | 4.74 | 2.52 | [config](https://drive.google.com/file/d/1iJtsjt4K8qhNikY8UmIfiQy1CzIaSgyU/view?usp=drive_link) | [model](https://drive.google.com/file/d/1C0-AJweXEc-cHTe9tLJ3Shgqyd44tXqY/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1ENOS2FE7fkLPWjIf9J76VgiPrn6dGKvi/view?usp=drive_link) |