From 0569df4a4097c175f7d1b17decdef053646ee9e3 Mon Sep 17 00:00:00 2001 From: rbler1234 Date: Fri, 24 Jan 2025 16:01:59 +0800 Subject: [PATCH 1/4] edit readme --- README.md | 93 +++++++++++++++++++++++++++++++++--------------- models/README.md | 21 ++++++++++- 2 files changed, 84 insertions(+), 30 deletions(-) diff --git a/README.md b/README.md index a852d5b..12cfef0 100644 --- a/README.md +++ b/README.md @@ -21,12 +21,14 @@ ## πŸ“‹ Contents -1. [About](#-about) -2. [Getting Started](#-getting-started) -3. [Model and Benchmark](#-model-and-benchmark) -4. [TODO List](#-todo-list) +1. [About](#topic1) +2. [Getting Started](#topic2) +3. [MMScan API Tutorial](#topic3) +4. [MMScan Benchmark](#topic4) +5. [TODO List](#topic5) ## 🏠 About + @@ -55,7 +57,8 @@ Furthermore, we use this high-quality dataset to train state-of-the-art 3D visua grounding and LLMs and obtain remarkable performance improvement both on existing benchmarks and in-the-wild evaluation. -## πŸš€ Getting Started: +## πŸš€ Getting Started + ### Installation @@ -98,6 +101,7 @@ existing benchmarks and in-the-wild evaluation. Please refer to the [guide](data_preparation/README.md) here. ## πŸ‘“ MMScan API Tutorial + The **MMScan Toolkit** provides comprehensive tools for dataset handling and model evaluation in tasks. @@ -137,39 +141,41 @@ Each dataset item is a dictionary containing key elements: (1) 3D Modality -- **"ori_pcds"** (tuple\[tensor\]): Raw point cloud data from the `.pth` file. -- **"pcds"** (np.ndarray): Point cloud data, dimensions (\[n_points, 6(xyz+rgb)\]). -- **"instance_labels"** (np.ndarray): Instance IDs for each point. -- **"class_labels"** (np.ndarray): Class IDs for each point. -- **"bboxes"** (dict): Bounding boxes in the scan. +- **"ori_pcds"** (tuple\[tensor\]): Original point cloud data extracted from the .pth file. +- **"pcds"** (np.ndarray): Point cloud data with dimensions [n_points, 6(xyz+rgb)], representing the coordinates and color of each point. +- **"instance_labels"** (np.ndarray): Instance ID assigned to each point in the point cloud. +- **"class_labels"** (np.ndarray): Class IDs assigned to each point in the point cloud. +- **"bboxes"** (dict): Information about bounding boxes within the scan. (2) Language Modality -- **"sub_class"**: Sample category. -- **"ID"**: Unique sample ID. -- **"scan_id"**: Corresponding scan ID. -- **--------------For Visual Grounding Task** -- **"target_id"** (list\[int\]): IDs of target objects. -- **"text"** (str): Grounding text. +- **"sub_class"**: The sample category of the sample. +- **"ID"**: A unique identifier for the sample. +- **"scan_id"**:Identifier corresponding to the related scan. + + *For Visual Grounding Task* +- **"target_id"** (list\[int\]): IDs of target objects. +- **"text"** (str): Text used for grounding. - **"target"** (list\[str\]): Types of target objects. - **"anchors"** (list\[str\]): Types of anchor objects. - **"anchor_ids"** (list\[int\]): IDs of anchor objects. -- **"tokens_positive"** (dict): Position indices of mentioned objects in the text. -- **--------------ForQuestion Answering Task** -- **"question"** (str): The question text. +- **"tokens_positive"** (dict): Indices of positions where mentioned objects appear in the text. + + *For Question Answering Task* +- **"question"** (str): The text of the question. - **"answers"** (list\[str\]): List of possible answers. - **"object_ids"** (list\[int\]): Object IDs referenced in the question. - **"object_names"** (list\[str\]): Types of referenced objects. 
- **"input_bboxes_id"** (list\[int\]): IDs of input bounding boxes. -- **"input_bboxes"** (list\[np.ndarray\]): Input bounding boxes, 9 DoF. +- **"input_bboxes"** (list\[np.ndarray\]): Input bounding box data, with 9 degrees of freedom. (3) 2D Modality -- **'img_path'** (str): Path to RGB image. -- **'depth_img_path'** (str): Path to depth image. -- **'intrinsic'** (np.ndarray): Camera intrinsic parameters for RGB images. -- **'depth_intrinsic'** (np.ndarray): Camera intrinsic parameters for depth images. -- **'extrinsic'** (np.ndarray): Camera extrinsic parameters. +- **'img_path'** (str): File path to the RGB image. +- **'depth_img_path'** (str): File path to the depth image. +- **'intrinsic'** (np.ndarray): Intrinsic parameters of the camera for RGB images. +- **'depth_intrinsic'** (np.ndarray): Intrinsic parameters of the camera for Depth images. +- **'extrinsic'** (np.ndarray): Extrinsic parameters of the camera. - **'visible_instance_id'** (list): IDs of visible objects in the image. ### MMScan Evaluator @@ -182,7 +188,9 @@ For the visual grounding task, our evaluator computes multiple metrics including - **AP and AR**: These metrics calculate the precision and recall by considering each sample as an individual category. - **AP_C and AR_C**: These versions categorize samples belonging to the same subclass and calculate them together. -- **gtop-k**: An expanded metric that generalizes the traditional top-k metric, offering insights into broader performance aspects. +- **gTop-k**: An expanded metric that generalizes the traditional Top-k metric, offering insights into broader performance aspects. + +*Note:* Here, AP corresponds to APsample in the paper, and AP_C corresponds to APbox in the paper. Below is an example of how to utilize the Visual Grounding Evaluator: @@ -301,11 +309,38 @@ The input structure remains the same as for the question answering evaluator: ] ``` -### Models +## πŸ† MMScan Benchmark + + + +### MMScan Visual Grounding Benchmark -We have adapted the MMScan API for some [models](./models/README.md). 
+| Methods | gTop-1 | gTop-3 | APsample | APbox | AR | Release | Download | +|---------|--------|--------|---------------------|------------------|----|-------|----| +| ScanRefer | 4.74 | 9.19 | 9.49 | 2.28 | 47.68 | [code](https://github.com/rbler1234/EmbodiedScan/tree/mmscan-devkit/models/Scanrefer) | [model](https://drive.google.com/file/d/1C0-AJweXEc-cHTe9tLJ3Shgqyd44tXqY/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1ENOS2FE7fkLPWjIf9J76VgiPrn6dGKvi/view?usp=drive_link) | +| MVT | 7.94 | 13.07 | 13.67 | 2.50 | 86.86 | ~ | ~ | +| BUTD-DETR | 15.24 | 20.68 | 18.58 | 9.27 | 66.62 | ~ | ~ | +| ReGround3D | 16.35 | 26.13 | 22.89 | 5.25 | 43.24 | ~ | ~ | +| EmbodiedScan | 19.66 | 34.00 | 29.30 | **15.18** | 59.96 | [code](https://github.com/OpenRobotLab/EmbodiedScan/tree/mmscan/models/EmbodiedScan) | [model](https://drive.google.com/file/d/1F6cHY6-JVzAk6xg5s61aTT-vD-eu_4DD/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1Ua_-Z2G3g0CthbeBkrR1a7_sqg_Spd9s/view?usp=drive_link) | +| 3D-VisTA | 25.38 | 35.41 | 33.47 | 6.67 | 87.52 | ~ | ~ | +| ViL3DRef | **26.34** | **37.58** | **35.09** | 6.65 | 86.86 | ~ | ~ | + +### MMScan Question Answering Benchmark +| Methods | Overall | ST-attr | ST-space | OO-attr | OO-space | OR| Advanced | Release | Download | +|---|--------|--------|--------|--------|--------|--------|-------|----|----| +| LL3DA | 45.7 | 39.1 | 58.5 | 43.6 | 55.9 | 37.1 | 24.0| [code](https://github.com/rbler1234/EmbodiedScan/tree/mmscan-devkit/models/LL3DA) | [model](https://drive.google.com/file/d/1mcWNHdfrhdbtySBtmG-QRH1Y1y5U3PDQ/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1VHpcnO0QmAvMa0HuZa83TEjU6AiFrP42/view?usp=drive_link) | +| LEO |54.6 | 48.9 | 62.7 | 50.8 | 64.7 | 50.4 | 45.9 | [code](https://github.com/rbler1234/EmbodiedScan/tree/mmscan-devkit/models/LEO) | [model](https://drive.google.com/drive/folders/1HZ38LwRe-1Q_VxlWy8vqvImFjtQ_b9iA?usp=drive_link)| +| LLaVA-3D |**61.6** | 58.5 | 63.5 | 56.8 | 75.6 | 58.0 | 38.5|~ | ~ | + +*Note:* These two tables only show the results for main metrics; see the paper for complete results. + +We have released the codes of some models under [./models](./models/README.md). ## πŸ“ TODO List -- \[ \] More Visual Grounding baselines and Question Answering baselines. + + +- \[ \] MMScan annotation and samples for ARKitScenes. +- \[ \] Online evaluation platform for the MMScan benchmark. +- \[ \] Codes of more MMScan Visual Grounding baselines and Question Answering baselines. - \[ \] Full release and further updates. diff --git a/models/README.md b/models/README.md index 5309e7b..86aabf9 100644 --- a/models/README.md +++ b/models/README.md @@ -21,7 +21,11 @@ These are 3D visual grounding models adapted for the mmscan-devkit. 
Currently, t ```bash python -u scripts/train.py --use_color --eval_only --use_checkpoint "path/to/pth" ``` +#### ckpts & Logs +| Epoch | gTop-1 @ 0.25/0.50 | Config | Download | +| :-------: | :---------: | :--------------------------------------------------------: | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: | +| 50 | 4.74 / 2.52 | [config](https://drive.google.com/file/d/1iJtsjt4K8qhNikY8UmIfiQy1CzIaSgyU/view?usp=drive_link) | [model](https://drive.google.com/file/d/1C0-AJweXEc-cHTe9tLJ3Shgqyd44tXqY/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1ENOS2FE7fkLPWjIf9J76VgiPrn6dGKvi/view?usp=drive_link) ### EmbodiedScan 1. Follow the [EmbodiedScan](https://github.com/OpenRobotLab/EmbodiedScan/blob/main/README.md) to setup the Env. Download the [Multi-View 3D Detection model's weights](https://download.openmmlab.com/mim-example/embodiedscan/mv-3ddet.pth) and change the "load_from" path in the config file under `configs/grounding` to the path where the weights are saved. @@ -47,6 +51,11 @@ These are 3D visual grounding models adapted for the mmscan-devkit. Currently, t # Multiple GPU testing python tools/test.py configs/grounding/pcd_4xb24_mmscan_vg_num256.py path/to/load_pth --launcher="pytorch" ``` +#### ckpts & Logs + +| Input modality | Load pretrain | Epoch | gTop-1 @ 0.25/0.50 | Config | Download | +| :-------: | :----: | :----: | :---------: | :--------------------------------------------------------: | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: | +| Point cloud | True | 12 | 19.66 / 8.82 | [config](https://github.com/rbler1234/EmbodiedScan/blob/mmscan-devkit/models/EmbodiedScan/configs/grounding/pcd_4xb24_mmscan_vg_num256.py) | [model](https://drive.google.com/file/d/1F6cHY6-JVzAk6xg5s61aTT-vD-eu_4DD/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1Ua_-Z2G3g0CthbeBkrR1a7_sqg_Spd9s/view?usp=drive_link) ## 3D Question Answering Models @@ -84,6 +93,13 @@ These are 3D question answering models adapted for the mmscan-devkit. Currently, --tmp_path path/to/tmp --api_key your_api_key --eval_size -1 --nproc 4 ``` +#### ckpts & Logs + +| Detector | Captioner | Iters | GPT score overall | Download | +| :-------: | :----: | :----: | :---------: |:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: | +| Vote2Cap-DETR | ll3da | 100k | 45.7 | [model](https://drive.google.com/file/d/1mcWNHdfrhdbtySBtmG-QRH1Y1y5U3PDQ/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1VHpcnO0QmAvMa0HuZa83TEjU6AiFrP42/view?usp=drive_link) | + + ### LEO @@ -117,5 +133,8 @@ These are 3D question answering models adapted for the mmscan-devkit. 
Currently, --tmp_path path/to/tmp --api_key your_api_key --eval_size -1 --nproc 4 ``` +#### ckpts & Logs -PS : It is possible that LEO may encounter an "NaN" error in the MultiHeadAttentionSpatial module due to the training setup when training more epoches. ( no problem for 4GPU one epoch) +| LLM | 2d/3d backbones | epoch | GPT score overall | Config | Download | +| :-------: | :----: | :----: | :---------: | :--------------------------------------------------------: | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: | +| Vicuna7b | ConvNeXt / PointNet++ | 1 | 54.6 | [config](https://drive.google.com/file/d/1CJccZd4TOaT_JdHj073UKwdA5PWUDtja/view?usp=drive_link) | [model](https://drive.google.com/drive/folders/1HZ38LwRe-1Q_VxlWy8vqvImFjtQ_b9iA?usp=drive_link) | From 8c0344a1fd4ca19a8465154e68ab253908a90d9d Mon Sep 17 00:00:00 2001 From: rbler1234 Date: Sat, 25 Jan 2025 23:42:23 +0800 Subject: [PATCH 2/4] edit readme --- README.md | 95 +++++++++++++++++++++++------------------------- models/README.md | 22 +++++------ 2 files changed, 57 insertions(+), 60 deletions(-) diff --git a/README.md b/README.md index 12cfef0..1fce8ef 100644 --- a/README.md +++ b/README.md @@ -22,10 +22,9 @@ ## πŸ“‹ Contents 1. [About](#topic1) -2. [Getting Started](#topic2) +2. [MMScan Benchmark](#topic2) 3. [MMScan API Tutorial](#topic3) -4. [MMScan Benchmark](#topic4) -5. [TODO List](#topic5) +4. [TODO List](#topic4) ## 🏠 About @@ -57,10 +56,43 @@ Furthermore, we use this high-quality dataset to train state-of-the-art 3D visua grounding and LLMs and obtain remarkable performance improvement both on existing benchmarks and in-the-wild evaluation. 
-## πŸš€ Getting Started + +## πŸ† MMScan Benchmark + -### Installation +### MMScan Visual Grounding Benchmark + +| Methods | gTop-1 | gTop-3 | APsample | APbox | AR | Release | Download | +|---------|--------|--------|---------------------|------------------|----|-------|----| +| ScanRefer | 4.74 | 9.19 | 9.49 | 2.28 | 47.68 | [code](https://github.com/rbler1234/EmbodiedScan/tree/mmscan-devkit/models/Scanrefer) | [model](https://drive.google.com/file/d/1C0-AJweXEc-cHTe9tLJ3Shgqyd44tXqY/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1ENOS2FE7fkLPWjIf9J76VgiPrn6dGKvi/view?usp=drive_link) | +| MVT | 7.94 | 13.07 | 13.67 | 2.50 | 86.86 | - | - | +| BUTD-DETR | 15.24 | 20.68 | 18.58 | 9.27 | 66.62 | - | - | +| ReGround3D | 16.35 | 26.13 | 22.89 | 5.25 | 43.24 | - | - | +| EmbodiedScan | 19.66 | 34.00 | 29.30 | **15.18** | 59.96 | [code](https://github.com/OpenRobotLab/EmbodiedScan/tree/mmscan/models/EmbodiedScan) | [model](https://drive.google.com/file/d/1F6cHY6-JVzAk6xg5s61aTT-vD-eu_4DD/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1Ua_-Z2G3g0CthbeBkrR1a7_sqg_Spd9s/view?usp=drive_link) | +| 3D-VisTA | 25.38 | 35.41 | 33.47 | 6.67 | 87.52 | - | - | +| ViL3DRef | **26.34** | **37.58** | **35.09** | 6.65 | 86.86 | - | - | + +### MMScan Question Answering Benchmark +| Methods | Overall | ST-attr | ST-space | OO-attr | OO-space | OR| Advanced | Release | Download | +|---|--------|--------|--------|--------|--------|--------|-------|----|----| +| LL3DA | 45.7 | 39.1 | 58.5 | 43.6 | 55.9 | 37.1 | 24.0| [code](https://github.com/rbler1234/EmbodiedScan/tree/mmscan-devkit/models/LL3DA) | [model](https://drive.google.com/file/d/1mcWNHdfrhdbtySBtmG-QRH1Y1y5U3PDQ/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1VHpcnO0QmAvMa0HuZa83TEjU6AiFrP42/view?usp=drive_link) | +| LEO |54.6 | 48.9 | 62.7 | 50.8 | 64.7 | 50.4 | 45.9 | [code](https://github.com/rbler1234/EmbodiedScan/tree/mmscan-devkit/models/LEO) | [model](https://drive.google.com/drive/folders/1HZ38LwRe-1Q_VxlWy8vqvImFjtQ_b9iA?usp=drive_link)| +| LLaVA-3D |**61.6** | 58.5 | 63.5 | 56.8 | 75.6 | 58.0 | 38.5|- | - | + +*Note:* These two tables only show the results for main metrics; see the paper for complete results. + +We have released the codes of some models under [./models](./models/README.md). + + + +## πŸš€ MMScan API Tutorial + + +The **MMScan Toolkit** provides comprehensive tools for dataset handling and model evaluation in tasks. + +### Getting Started + 1. Clone Github repo. @@ -80,13 +112,13 @@ existing benchmarks and in-the-wild evaluation. Use `"all"` to install all components and specify `"VG"` or `"QA"` if you only need to install the components for Visual Grounding or Question Answering, respectively. -### Data Preparation +3. Download and prepare the dataset. -1. Download the Embodiedscan and MMScan annotation. (Fill in the [form](https://docs.google.com/forms/d/e/1FAIpQLScUXEDTksGiqHZp31j7Zp7zlCNV7p_08uViwP_Nbzfn3g6hhw/viewform) to apply for downloading) + a. Download the Embodiedscan and MMScan annotation. (Fill in the [form](https://docs.google.com/forms/d/e/1FAIpQLScUXEDTksGiqHZp31j7Zp7zlCNV7p_08uViwP_Nbzfn3g6hhw/viewform) to apply for downloading) - Create a folder `mmscan_data/` and then unzip the files. For the first zip file, put `embodiedscan` under `mmscan_data/embodiedscan_split` and rename it to `embodiedscan-v1`. For the second zip file, put `MMScan-beta-release` under `mmscan_data/MMScan-beta-release` and `embodiedscan-v2` under `mmscan_data/embodiedscan_split`. + b. 
Create a folder `mmscan_data/` and then unzip the files. For the first zip file, put `embodiedscan` under `mmscan_data/embodiedscan_split` and rename it to `embodiedscan-v1`. For the second zip file, put `MMScan-beta-release` under `mmscan_data/MMScan-beta-release` and `embodiedscan-v2` under `mmscan_data/embodiedscan_split`.

   The directory structure should be as below, after then, refer to the [guide](data_preparation/README.md) here.

   ```
   mmscan_data
   β”œβ”€β”€ embodiedscan_split
   β”‚   β”œβ”€β”€embodiedscan-v1/   # EmbodiedScan v1 data in 'embodiedscan.zip'
   β”‚   β”œβ”€β”€embodiedscan-v2/   # EmbodiedScan v2 data in 'embodiedscan-v2-beta.zip'
   β”œβ”€β”€ MMScan-beta-release   # MMScan veta data in 'embodiedscan-v2-beta.zip'
   ```

-2. Prepare the point clouds files.
-
-   Please refer to the [guide](data_preparation/README.md) here.
-
-## πŸ‘“ MMScan API Tutorial
-
-
-The **MMScan Toolkit** provides comprehensive tools for dataset handling and model evaluation in tasks.

To import the MMScan API, you can use the following commands:

@@ -121,7 +145,7 @@ import mmscan.QuestionAnsweringEvaluator as MMScan_QA_evaluator

import mmscan.GPTEvaluator as MMScan_GPT_evaluator
```

-### MMScan Dataset
+### MMScan Dataset Tool

The dataset tool in MMScan allows seamless access to data required for various tasks within MMScan.

@@ -152,16 +176,14 @@ Each dataset item is a dictionary containing key elements:
- **"sub_class"**: The sample category of the sample.
- **"ID"**: A unique identifier for the sample.
- **"scan_id"**:Identifier corresponding to the related scan.
-
-  *For Visual Grounding Task*
+- *For Visual Grounding task*
- **"target_id"** (list\[int\]): IDs of target objects.
- **"text"** (str): Text used for grounding.
- **"target"** (list\[str\]): Types of target objects.
- **"anchors"** (list\[str\]): Types of anchor objects.
- **"anchor_ids"** (list\[int\]): IDs of anchor objects.
- **"tokens_positive"** (dict): Indices of positions where mentioned objects appear in the text.
-
-  *For Question Answering Task*
+- *For Question Answering task*
- **"question"** (str): The text of the question.
- **"answers"** (list\[str\]): List of possible answers.
- **"object_ids"** (list\[int\]): Object IDs referenced in the question.
- **"object_names"** (list\[str\]): Types of referenced objects.

@@ -178,7 +200,7 @@ Each dataset item is a dictionary containing key elements:
- **'extrinsic'** (np.ndarray): Extrinsic parameters of the camera.
- **'visible_instance_id'** (list): IDs of visible objects in the image.

-### MMScan Evaluator
+### MMScan Evaluator Tool

Our evaluation tool is designed to streamline the assessment of model outputs for the MMScan task, providing essential metrics to gauge model performance effectively.
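As a quick orientation, below is a minimal sketch of how predictions might be handed to the Visual Grounding evaluator imported above. The constructor flag, the `update`/`start_evaluation` method names, and the per-sample field names are assumptions inferred from this tutorial rather than a verified interface; check the devkit source for the exact schema.

```python
# Hedged sketch: driving the Visual Grounding evaluator.
# All method and field names below are assumptions, not verified API.
import numpy as np
from mmscan import VisualGroundingEvaluator  # same class the tutorial aliases

evaluator = VisualGroundingEvaluator(show_results=True)  # flag is assumed

batch = [{
    "index": 0,                             # position of the sample in the dataset
    "ID": "VG_sample_0",                    # hypothetical sample ID
    "subclass": "space_OO",                 # sub-category used by AP_C / AR_C
    "pred_scores": np.random.rand(256),     # one confidence per predicted box
    "pred_bboxes": np.random.rand(256, 9),  # predicted 9-DoF boxes
    "gt_bboxes": np.random.rand(2, 9),      # ground-truth 9-DoF boxes
}]

evaluator.update(batch)                  # accumulate results batch by batch
metrics = evaluator.start_evaluation()   # AP/AR, AP_C/AR_C, gTop-k
print(metrics)
```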
@@ -309,36 +331,11 @@ The input structure remains the same as for the question answering evaluator: ] ``` -## πŸ† MMScan Benchmark - - - -### MMScan Visual Grounding Benchmark - -| Methods | gTop-1 | gTop-3 | APsample | APbox | AR | Release | Download | -|---------|--------|--------|---------------------|------------------|----|-------|----| -| ScanRefer | 4.74 | 9.19 | 9.49 | 2.28 | 47.68 | [code](https://github.com/rbler1234/EmbodiedScan/tree/mmscan-devkit/models/Scanrefer) | [model](https://drive.google.com/file/d/1C0-AJweXEc-cHTe9tLJ3Shgqyd44tXqY/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1ENOS2FE7fkLPWjIf9J76VgiPrn6dGKvi/view?usp=drive_link) | -| MVT | 7.94 | 13.07 | 13.67 | 2.50 | 86.86 | ~ | ~ | -| BUTD-DETR | 15.24 | 20.68 | 18.58 | 9.27 | 66.62 | ~ | ~ | -| ReGround3D | 16.35 | 26.13 | 22.89 | 5.25 | 43.24 | ~ | ~ | -| EmbodiedScan | 19.66 | 34.00 | 29.30 | **15.18** | 59.96 | [code](https://github.com/OpenRobotLab/EmbodiedScan/tree/mmscan/models/EmbodiedScan) | [model](https://drive.google.com/file/d/1F6cHY6-JVzAk6xg5s61aTT-vD-eu_4DD/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1Ua_-Z2G3g0CthbeBkrR1a7_sqg_Spd9s/view?usp=drive_link) | -| 3D-VisTA | 25.38 | 35.41 | 33.47 | 6.67 | 87.52 | ~ | ~ | -| ViL3DRef | **26.34** | **37.58** | **35.09** | 6.65 | 86.86 | ~ | ~ | - -### MMScan Question Answering Benchmark -| Methods | Overall | ST-attr | ST-space | OO-attr | OO-space | OR| Advanced | Release | Download | -|---|--------|--------|--------|--------|--------|--------|-------|----|----| -| LL3DA | 45.7 | 39.1 | 58.5 | 43.6 | 55.9 | 37.1 | 24.0| [code](https://github.com/rbler1234/EmbodiedScan/tree/mmscan-devkit/models/LL3DA) | [model](https://drive.google.com/file/d/1mcWNHdfrhdbtySBtmG-QRH1Y1y5U3PDQ/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1VHpcnO0QmAvMa0HuZa83TEjU6AiFrP42/view?usp=drive_link) | -| LEO |54.6 | 48.9 | 62.7 | 50.8 | 64.7 | 50.4 | 45.9 | [code](https://github.com/rbler1234/EmbodiedScan/tree/mmscan-devkit/models/LEO) | [model](https://drive.google.com/drive/folders/1HZ38LwRe-1Q_VxlWy8vqvImFjtQ_b9iA?usp=drive_link)| -| LLaVA-3D |**61.6** | 58.5 | 63.5 | 56.8 | 75.6 | 58.0 | 38.5|~ | ~ | - -*Note:* These two tables only show the results for main metrics; see the paper for complete results. -We have released the codes of some models under [./models](./models/README.md). ## πŸ“ TODO List - + - \[ \] MMScan annotation and samples for ARKitScenes. - \[ \] Online evaluation platform for the MMScan benchmark. diff --git a/models/README.md b/models/README.md index 86aabf9..f5dd0a1 100644 --- a/models/README.md +++ b/models/README.md @@ -23,9 +23,9 @@ These are 3D visual grounding models adapted for the mmscan-devkit. 
Currently, t ``` #### ckpts & Logs -| Epoch | gTop-1 @ 0.25/0.50 | Config | Download | -| :-------: | :---------: | :--------------------------------------------------------: | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: | -| 50 | 4.74 / 2.52 | [config](https://drive.google.com/file/d/1iJtsjt4K8qhNikY8UmIfiQy1CzIaSgyU/view?usp=drive_link) | [model](https://drive.google.com/file/d/1C0-AJweXEc-cHTe9tLJ3Shgqyd44tXqY/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1ENOS2FE7fkLPWjIf9J76VgiPrn6dGKvi/view?usp=drive_link) +| Epoch | gTop-1 @ 0.25|gTop-1 @0.50 | Config | Download | +| :-------: | :---------:| :---------: | :--------------------------------------------------------: | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: | +| 50 | 4.74 | 2.52 | [config](https://drive.google.com/file/d/1iJtsjt4K8qhNikY8UmIfiQy1CzIaSgyU/view?usp=drive_link) | [model](https://drive.google.com/file/d/1C0-AJweXEc-cHTe9tLJ3Shgqyd44tXqY/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1ENOS2FE7fkLPWjIf9J76VgiPrn6dGKvi/view?usp=drive_link) ### EmbodiedScan 1. Follow the [EmbodiedScan](https://github.com/OpenRobotLab/EmbodiedScan/blob/main/README.md) to setup the Env. Download the [Multi-View 3D Detection model's weights](https://download.openmmlab.com/mim-example/embodiedscan/mv-3ddet.pth) and change the "load_from" path in the config file under `configs/grounding` to the path where the weights are saved. @@ -53,9 +53,9 @@ These are 3D visual grounding models adapted for the mmscan-devkit. 
Currently, t ``` #### ckpts & Logs -| Input modality | Load pretrain | Epoch | gTop-1 @ 0.25/0.50 | Config | Download | -| :-------: | :----: | :----: | :---------: | :--------------------------------------------------------: | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: | -| Point cloud | True | 12 | 19.66 / 8.82 | [config](https://github.com/rbler1234/EmbodiedScan/blob/mmscan-devkit/models/EmbodiedScan/configs/grounding/pcd_4xb24_mmscan_vg_num256.py) | [model](https://drive.google.com/file/d/1F6cHY6-JVzAk6xg5s61aTT-vD-eu_4DD/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1Ua_-Z2G3g0CthbeBkrR1a7_sqg_Spd9s/view?usp=drive_link) +| Input Modality | Det Pretrain | Epoch | gTop-1 @ 0.25 | gTop-1 @ 0.50 | Config | Download | +| :-------: | :----: | :----:| :----: | :---------: | :--------------------------------------------------------: | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: | +| Point Cloud | ✔ | 12 | 19.66 | 8.82 | [config](https://github.com/rbler1234/EmbodiedScan/blob/mmscan-devkit/models/EmbodiedScan/configs/grounding/pcd_4xb24_mmscan_vg_num256.py) | [model](https://drive.google.com/file/d/1F6cHY6-JVzAk6xg5s61aTT-vD-eu_4DD/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1Ua_-Z2G3g0CthbeBkrR1a7_sqg_Spd9s/view?usp=drive_link) ## 3D Question Answering Models @@ -95,9 +95,9 @@ These are 3D question answering models adapted for the mmscan-devkit. Currently, ``` #### ckpts & Logs -| Detector | Captioner | Iters | GPT score overall | Download | +| Detector | Captioner | Iters | Overall GPT Score | Download | | :-------: | :----: | :----: | :---------: |:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: | -| Vote2Cap-DETR | ll3da | 100k | 45.7 | [model](https://drive.google.com/file/d/1mcWNHdfrhdbtySBtmG-QRH1Y1y5U3PDQ/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1VHpcnO0QmAvMa0HuZa83TEjU6AiFrP42/view?usp=drive_link) | +| Vote2Cap-DETR | LL3DA | 100k | 45.7 | [model](https://drive.google.com/file/d/1mcWNHdfrhdbtySBtmG-QRH1Y1y5U3PDQ/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1VHpcnO0QmAvMa0HuZa83TEjU6AiFrP42/view?usp=drive_link) | @@ -135,6 +135,6 @@ These are 3D question answering models adapted for the mmscan-devkit. 
Currently, ``` #### ckpts & Logs -| LLM | 2d/3d backbones | epoch | GPT score overall | Config | Download | -| :-------: | :----: | :----: | :---------: | :--------------------------------------------------------: | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: | -| Vicuna7b | ConvNeXt / PointNet++ | 1 | 54.6 | [config](https://drive.google.com/file/d/1CJccZd4TOaT_JdHj073UKwdA5PWUDtja/view?usp=drive_link) | [model](https://drive.google.com/drive/folders/1HZ38LwRe-1Q_VxlWy8vqvImFjtQ_b9iA?usp=drive_link) | +| LLM | 2D Backbone | 3D Backbone | Epoch | Overall GPT Score | Config | Download | +| :-------: | :----: | :----: | :----: |:---------: | :--------------------------------------------------------: | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: | +| Vicuna7b | ConvNeXt | PointNet++ | 1 | 54.6 | [config](https://drive.google.com/file/d/1CJccZd4TOaT_JdHj073UKwdA5PWUDtja/view?usp=drive_link) | [model](https://drive.google.com/drive/folders/1HZ38LwRe-1Q_VxlWy8vqvImFjtQ_b9iA?usp=drive_link) | From 5677efa629c32c4e29c0a6348ea11d3e98db3c08 Mon Sep 17 00:00:00 2001 From: rbler1234 Date: Sun, 26 Jan 2025 14:09:05 +0800 Subject: [PATCH 3/4] edit readme --- README.md | 95 ++++++++++++++++++++++++++++--------------------------- 1 file changed, 48 insertions(+), 47 deletions(-) diff --git a/README.md b/README.md index 1fce8ef..ec22624 100644 --- a/README.md +++ b/README.md @@ -21,13 +21,14 @@ ## πŸ“‹ Contents -1. [About](#topic1) -2. [MMScan Benchmark](#topic2) -3. [MMScan API Tutorial](#topic3) -4. [TODO List](#topic4) +1. [About](#-about) +2. [Getting Started](#-getting-started) +3. [MMScan API Tutorial](#-mmscan-api-tutorial) +4. [MMScan Benchmark](#-mmscan-benchmark) +5. [TODO List](#-todo-list) ## 🏠 About - + @@ -56,43 +57,10 @@ Furthermore, we use this high-quality dataset to train state-of-the-art 3D visua grounding and LLMs and obtain remarkable performance improvement both on existing benchmarks and in-the-wild evaluation. 
+## πŸš€ Getting Started -## πŸ† MMScan Benchmark - - - -### MMScan Visual Grounding Benchmark - -| Methods | gTop-1 | gTop-3 | APsample | APbox | AR | Release | Download | -|---------|--------|--------|---------------------|------------------|----|-------|----| -| ScanRefer | 4.74 | 9.19 | 9.49 | 2.28 | 47.68 | [code](https://github.com/rbler1234/EmbodiedScan/tree/mmscan-devkit/models/Scanrefer) | [model](https://drive.google.com/file/d/1C0-AJweXEc-cHTe9tLJ3Shgqyd44tXqY/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1ENOS2FE7fkLPWjIf9J76VgiPrn6dGKvi/view?usp=drive_link) | -| MVT | 7.94 | 13.07 | 13.67 | 2.50 | 86.86 | - | - | -| BUTD-DETR | 15.24 | 20.68 | 18.58 | 9.27 | 66.62 | - | - | -| ReGround3D | 16.35 | 26.13 | 22.89 | 5.25 | 43.24 | - | - | -| EmbodiedScan | 19.66 | 34.00 | 29.30 | **15.18** | 59.96 | [code](https://github.com/OpenRobotLab/EmbodiedScan/tree/mmscan/models/EmbodiedScan) | [model](https://drive.google.com/file/d/1F6cHY6-JVzAk6xg5s61aTT-vD-eu_4DD/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1Ua_-Z2G3g0CthbeBkrR1a7_sqg_Spd9s/view?usp=drive_link) | -| 3D-VisTA | 25.38 | 35.41 | 33.47 | 6.67 | 87.52 | - | - | -| ViL3DRef | **26.34** | **37.58** | **35.09** | 6.65 | 86.86 | - | - | - -### MMScan Question Answering Benchmark -| Methods | Overall | ST-attr | ST-space | OO-attr | OO-space | OR| Advanced | Release | Download | -|---|--------|--------|--------|--------|--------|--------|-------|----|----| -| LL3DA | 45.7 | 39.1 | 58.5 | 43.6 | 55.9 | 37.1 | 24.0| [code](https://github.com/rbler1234/EmbodiedScan/tree/mmscan-devkit/models/LL3DA) | [model](https://drive.google.com/file/d/1mcWNHdfrhdbtySBtmG-QRH1Y1y5U3PDQ/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1VHpcnO0QmAvMa0HuZa83TEjU6AiFrP42/view?usp=drive_link) | -| LEO |54.6 | 48.9 | 62.7 | 50.8 | 64.7 | 50.4 | 45.9 | [code](https://github.com/rbler1234/EmbodiedScan/tree/mmscan-devkit/models/LEO) | [model](https://drive.google.com/drive/folders/1HZ38LwRe-1Q_VxlWy8vqvImFjtQ_b9iA?usp=drive_link)| -| LLaVA-3D |**61.6** | 58.5 | 63.5 | 56.8 | 75.6 | 58.0 | 38.5|- | - | - -*Note:* These two tables only show the results for main metrics; see the paper for complete results. - -We have released the codes of some models under [./models](./models/README.md). - - - -## πŸš€ MMScan API Tutorial - - -The **MMScan Toolkit** provides comprehensive tools for dataset handling and model evaluation in tasks. - -### Getting Started +### Installation 1. Clone Github repo. @@ -112,13 +80,13 @@ The **MMScan Toolkit** provides comprehensive tools for dataset handling and mod Use `"all"` to install all components and specify `"VG"` or `"QA"` if you only need to install the components for Visual Grounding or Question Answering, respectively. -3. Download and prepare the dataset. +### Data Preparation - a. Download the Embodiedscan and MMScan annotation. (Fill in the [form](https://docs.google.com/forms/d/e/1FAIpQLScUXEDTksGiqHZp31j7Zp7zlCNV7p_08uViwP_Nbzfn3g6hhw/viewform) to apply for downloading) +1. Download the Embodiedscan and MMScan annotation. (Fill in the [form](https://docs.google.com/forms/d/e/1FAIpQLScUXEDTksGiqHZp31j7Zp7zlCNV7p_08uViwP_Nbzfn3g6hhw/viewform) to apply for downloading) - b. Create a folder `mmscan_data/` and then unzip the files. For the first zip file, put `embodiedscan` under `mmscan_data/embodiedscan_split` and rename it to `embodiedscan-v1`. 
For the second zip file, put `MMScan-beta-release` under `mmscan_data/MMScan-beta-release` and `embodiedscan-v2` under `mmscan_data/embodiedscan_split`. + Create a folder `mmscan_data/` and then unzip the files. For the first zip file, put `embodiedscan` under `mmscan_data/embodiedscan_split` and rename it to `embodiedscan-v1`. For the second zip file, put `MMScan-beta-release` under `mmscan_data/MMScan-beta-release` and `embodiedscan-v2` under `mmscan_data/embodiedscan_split`. - The directory structure should be as below, after then, refer to the [guide](data_preparation/README.md) here. + The directory structure should be as below: ``` mmscan_data @@ -128,6 +96,14 @@ The **MMScan Toolkit** provides comprehensive tools for dataset handling and mod β”œβ”€β”€ MMScan-beta-release # MMScan veta data in 'embodiedscan-v2-beta.zip' ``` +2. Prepare the point clouds files. + + Please refer to the [guide](data_preparation/README.md) here. + +## πŸ‘“ MMScan API Tutorial + + +The **MMScan Toolkit** provides comprehensive tools for dataset handling and model evaluation in tasks. To import the MMScan API, you can use the following commands: @@ -145,7 +121,7 @@ import mmscan.QuestionAnsweringEvaluator as MMScan_QA_evaluator import mmscan.GPTEvaluator as MMScan_GPT_evaluator ``` -### MMScan Dataset Tool +### MMScan Dataset The dataset tool in MMScan allows seamless access to data required for various tasks within MMScan. @@ -200,7 +176,7 @@ Each dataset item is a dictionary containing key elements: - **'extrinsic'** (np.ndarray): Extrinsic parameters of the camera. - **'visible_instance_id'** (list): IDs of visible objects in the image. -### MMScan Evaluator Tool +### MMScan Evaluator Our evaluation tool is designed to streamline the assessment of model outputs for the MMScan task, providing essential metrics to gauge model performance effectively. 
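For the question answering side, a similarly hedged sketch is shown below. The `pred`/`gt` fields mirror the input structure this tutorial describes for the QA and GPT evaluators, but the exact constructor arguments and method names are assumptions to be checked against the devkit.

```python
# Hedged sketch: scoring QA predictions with the Question Answering evaluator.
# Constructor arguments and method names are assumptions, not verified API.
from mmscan import QuestionAnsweringEvaluator

evaluator = QuestionAnsweringEvaluator(show_results=True)

evaluator.update([{
    "index": 0,
    "ID": "QA_sample_0",                    # hypothetical sample ID
    "question": "How many chairs surround the table?",
    "pred": ["four"],                       # model answer(s)
    "gt": ["four", "4"],                    # reference answers
}])

print(evaluator.start_evaluation())  # classical QA metrics over all samples
```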
@@ -331,11 +307,36 @@ The input structure remains the same as for the question answering evaluator: ] ``` +## πŸ† MMScan Benchmark + + + +### MMScan Visual Grounding Benchmark + +| Methods | gTop-1 | gTop-3 | APsample | APbox | AR | Release | Download | +|---------|--------|--------|---------------------|------------------|----|-------|----| +| ScanRefer | 4.74 | 9.19 | 9.49 | 2.28 | 47.68 | [code](https://github.com/rbler1234/EmbodiedScan/tree/mmscan-devkit/models/Scanrefer) | [model](https://drive.google.com/file/d/1C0-AJweXEc-cHTe9tLJ3Shgqyd44tXqY/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1ENOS2FE7fkLPWjIf9J76VgiPrn6dGKvi/view?usp=drive_link) | +| MVT | 7.94 | 13.07 | 13.67 | 2.50 | 86.86 | - | - | +| BUTD-DETR | 15.24 | 20.68 | 18.58 | 9.27 | 66.62 | - | - | +| ReGround3D | 16.35 | 26.13 | 22.89 | 5.25 | 43.24 | - | - | +| EmbodiedScan | 19.66 | 34.00 | 29.30 | **15.18** | 59.96 | [code](https://github.com/OpenRobotLab/EmbodiedScan/tree/mmscan/models/EmbodiedScan) | [model](https://drive.google.com/file/d/1F6cHY6-JVzAk6xg5s61aTT-vD-eu_4DD/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1Ua_-Z2G3g0CthbeBkrR1a7_sqg_Spd9s/view?usp=drive_link) | +| 3D-VisTA | 25.38 | 35.41 | 33.47 | 6.67 | 87.52 | - | - | +| ViL3DRef | **26.34** | **37.58** | **35.09** | 6.65 | 86.86 | - | - | + +### MMScan Question Answering Benchmark +| Methods | Overall | ST-attr | ST-space | OO-attr | OO-space | OR| Advanced | Release | Download | +|---|--------|--------|--------|--------|--------|--------|-------|----|----| +| LL3DA | 45.7 | 39.1 | 58.5 | 43.6 | 55.9 | 37.1 | 24.0| [code](https://github.com/rbler1234/EmbodiedScan/tree/mmscan-devkit/models/LL3DA) | [model](https://drive.google.com/file/d/1mcWNHdfrhdbtySBtmG-QRH1Y1y5U3PDQ/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1VHpcnO0QmAvMa0HuZa83TEjU6AiFrP42/view?usp=drive_link) | +| LEO |54.6 | 48.9 | 62.7 | 50.8 | 64.7 | 50.4 | 45.9 | [code](https://github.com/rbler1234/EmbodiedScan/tree/mmscan-devkit/models/LEO) | [model](https://drive.google.com/drive/folders/1HZ38LwRe-1Q_VxlWy8vqvImFjtQ_b9iA?usp=drive_link)| +| LLaVA-3D |**61.6** | 58.5 | 63.5 | 56.8 | 75.6 | 58.0 | 38.5|- | - | +*Note:* These two tables only show the results for main metrics; see the paper for complete results. + +We have released the codes of some models under [./models](./models/README.md). ## πŸ“ TODO List - + - \[ \] MMScan annotation and samples for ARKitScenes. - \[ \] Online evaluation platform for the MMScan benchmark. From 0ed18db9d9e794f608bd477d98075afd1c845180 Mon Sep 17 00:00:00 2001 From: rbler1234 Date: Mon, 24 Feb 2025 10:52:32 +0800 Subject: [PATCH 4/4] update readme --- README.md | 26 ++++++++++++++------------ models/README.md | 36 ++++++++++++++++++------------------ 2 files changed, 32 insertions(+), 30 deletions(-) diff --git a/README.md b/README.md index ec22624..9870dfd 100644 --- a/README.md +++ b/README.md @@ -93,7 +93,7 @@ existing benchmarks and in-the-wild evaluation. β”œβ”€β”€ embodiedscan_split β”‚ β”œβ”€β”€embodiedscan-v1/ # EmbodiedScan v1 data in 'embodiedscan.zip' β”‚ β”œβ”€β”€embodiedscan-v2/ # EmbodiedScan v2 data in 'embodiedscan-v2-beta.zip' - β”œβ”€β”€ MMScan-beta-release # MMScan veta data in 'embodiedscan-v2-beta.zip' + β”œβ”€β”€ MMScan-beta-release # MMScan data in 'embodiedscan-v2-beta.zip' ``` 2. Prepare the point clouds files. 
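Once the data is in place, a hedged sketch of loading the dataset and inspecting one sample is given below. The constructor arguments (`version`, `split`, `task`) are illustrative guesses, while the item keys follow the fields documented in the next section.

```python
# Hedged sketch: iterating MMScan after data preparation.
# Constructor argument names/values are assumptions; item keys follow the docs.
from mmscan import MMScan

dataset = MMScan(version="v1", split="train", task="MMScan-VG")

item = dataset[0]
print(item["scan_id"])     # which scan this sample comes from
print(item["pcds"].shape)  # (n_points, 6): xyz coordinates + rgb color
print(item["text"])        # grounding text for a visual grounding sample
print(item["target_id"])   # IDs of the objects the text refers to
```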
@@ -145,17 +145,21 @@ Each dataset item is a dictionary containing key elements:
- **"pcds"** (np.ndarray): Point cloud data with dimensions [n_points, 6(xyz+rgb)], representing the coordinates and color of each point.
- **"instance_labels"** (np.ndarray): Instance ID assigned to each point in the point cloud.
- **"class_labels"** (np.ndarray): Class IDs assigned to each point in the point cloud.
-- **"bboxes"** (dict): Information about bounding boxes within the scan.
+- **"bboxes"** (dict): Information about bounding boxes within the scan, structured as { object ID:
+  {
+  "type": object type (str),
+  "bbox": 9-DoF box (np.ndarray)
+  }}

(2) Language Modality

-- **"sub_class"**: The sample category of the sample.
-- **"ID"**: A unique identifier for the sample.
-- **"scan_id"**:Identifier corresponding to the related scan.
+- **"sub_class"**: The category of the sample.
+- **"ID"**: The sample's ID.
+- **"scan_id"**: The scan's ID.

- *For Visual Grounding task*
-- **"target_id"** (list\[int\]): IDs of target objects.
+- **"target_id"** (list\[int\]): IDs of target objects.
- **"text"** (str): Text used for grounding.
-- **"target"** (list\[str\]): Types of target objects.
+- **"target"** (list\[str\]): Text prompt to specify the target grounding object.
- **"anchors"** (list\[str\]): Types of anchor objects.
- **"anchor_ids"** (list\[int\]): IDs of anchor objects.
- **"tokens_positive"** (dict): Indices of positions where mentioned objects appear in the text.

@@ -165,14 +169,14 @@ Each dataset item is a dictionary containing key elements:
- **"object_ids"** (list\[int\]): Object IDs referenced in the question.
- **"object_names"** (list\[str\]): Types of referenced objects.
- **"input_bboxes_id"** (list\[int\]): IDs of input bounding boxes.
-- **"input_bboxes"** (list\[np.ndarray\]): Input bounding box data, with 9 degrees of freedom.
+- **"input_bboxes"** (list\[np.ndarray\]): Input 9-DoF bounding boxes.

(3) 2D Modality

- **'img_path'** (str): File path to the RGB image.
- **'depth_img_path'** (str): File path to the depth image.
- **'intrinsic'** (np.ndarray): Intrinsic parameters of the camera for RGB images.
-- **'depth_intrinsic'** (np.ndarray): Intrinsic parameters of the camera for Depth images.
+- **'depth_intrinsic'** (np.ndarray): Intrinsic parameters of the camera for depth images.
- **'extrinsic'** (np.ndarray): Extrinsic parameters of the camera.
- **'visible_instance_id'** (list): IDs of visible objects in the image.

@@ -186,7 +190,7 @@ For the visual grounding task, our evaluator computes multiple metrics including

- **AP and AR**: These metrics calculate the precision and recall by considering each sample as an individual category.
- **AP_C and AR_C**: These versions categorize samples belonging to the same subclass and calculate them together.
-- **gTop-k**: An expanded metric that generalizes the traditional Top-k metric, offering insights into broader performance aspects.
+- **gTop-k**: An expanded metric that generalizes the traditional Top-k metric, offering greater flexibility and interpretability when a sample contains multiple target objects.

*Note:* Here, AP corresponds to APsample in the paper, and AP_C corresponds to APbox in the paper.
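Since AP, AR, and gTop-k all rest on IoU between 9-DoF boxes, the sketch below shows one plausible way to expand a 9-DoF box (center, size, Euler angles) into its eight corners as a first step toward such an overlap test. The `"zxy"` Euler order is an assumption about the box convention; verify it against the devkit before relying on it.

```python
# Hedged sketch: corners of a 9-DoF box = [cx, cy, cz, dx, dy, dz, a, b, c].
# The Euler-angle order "zxy" is assumed, not verified.
import numpy as np
from scipy.spatial.transform import Rotation

def box_9dof_to_corners(box: np.ndarray) -> np.ndarray:
    center, size, euler = box[:3], box[3:6], box[6:9]
    # eight sign combinations scaled by the half-extents of the box
    signs = np.array([[x, y, z] for x in (-1, 1)
                      for y in (-1, 1) for z in (-1, 1)], dtype=float)
    corners = signs * (size / 2.0)
    rot = Rotation.from_euler("zxy", euler).as_matrix()
    return corners @ rot.T + center  # rotate, then translate to the center

corners = box_9dof_to_corners(np.array([0., 0., 1., 2., 1., 1., 0.3, 0., 0.]))
print(corners.shape)  # (8, 3)
```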
@@ -310,7 +314,6 @@ The input structure remains the same as for the question answering evaluator: ## πŸ† MMScan Benchmark - ### MMScan Visual Grounding Benchmark | Methods | gTop-1 | gTop-3 | APsample | APbox | AR | Release | Download | @@ -337,7 +340,6 @@ We have released the codes of some models under [./models](./models/README.md). ## πŸ“ TODO List - - \[ \] MMScan annotation and samples for ARKitScenes. - \[ \] Online evaluation platform for the MMScan benchmark. - \[ \] Codes of more MMScan Visual Grounding baselines and Question Answering baselines. diff --git a/models/README.md b/models/README.md index f5dd0a1..db3bc96 100644 --- a/models/README.md +++ b/models/README.md @@ -2,56 +2,56 @@ These are 3D visual grounding models adapted for the mmscan-devkit. Currently, two models have been released: EmbodiedScan and ScanRefer. -### Scanrefer +### ScanRefer -1. Follow the [Scanrefer](https://github.com/daveredrum/ScanRefer/blob/master/README.md) to setup the Env. For data preparation, you need not load the datasets, only need to download the [preprocessed GLoVE embeddings](https://kaldir.vc.in.tum.de/glove.p) (~990MB) and put them under `data/` +1. Follow the [ScanRefer](https://github.com/daveredrum/ScanRefer/blob/master/README.md) to setup the environment. For data preparation, you need not load the datasets, only need to download the [preprocessed GLoVE embeddings](https://kaldir.vc.in.tum.de/glove.p) (~990MB) and put them under `data/` 2. Install MMScan API. 3. Overwrite the `lib/config.py/CONF.PATH.OUTPUT` to your desired output directory. -4. Run the following command to train Scanrefer (one GPU): +4. Run the following command to train ScanRefer (one GPU): ```bash python -u scripts/train.py --use_color --epoch {10/25/50} ``` -5. Run the following command to evaluate Scanrefer (one GPU): +5. Run the following command to evaluate ScanRefer (one GPU): ```bash python -u scripts/train.py --use_color --eval_only --use_checkpoint "path/to/pth" ``` -#### ckpts & Logs +#### Results and Models | Epoch | gTop-1 @ 0.25|gTop-1 @0.50 | Config | Download | | :-------: | :---------:| :---------: | :--------------------------------------------------------: | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: | | 50 | 4.74 | 2.52 | [config](https://drive.google.com/file/d/1iJtsjt4K8qhNikY8UmIfiQy1CzIaSgyU/view?usp=drive_link) | [model](https://drive.google.com/file/d/1C0-AJweXEc-cHTe9tLJ3Shgqyd44tXqY/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1ENOS2FE7fkLPWjIf9J76VgiPrn6dGKvi/view?usp=drive_link) ### EmbodiedScan -1. Follow the [EmbodiedScan](https://github.com/OpenRobotLab/EmbodiedScan/blob/main/README.md) to setup the Env. Download the [Multi-View 3D Detection model's weights](https://download.openmmlab.com/mim-example/embodiedscan/mv-3ddet.pth) and change the "load_from" path in the config file under `configs/grounding` to the path where the weights are saved. +1. Follow the [EmbodiedScan](https://github.com/OpenRobotLab/EmbodiedScan/blob/main/README.md) to setup the environment. 
Download the [Multi-View 3D Detection model's weights](https://download.openmmlab.com/mim-example/embodiedscan/mv-3ddet.pth) and change the "load_from" path in the config file under `configs/grounding` to the path where the weights are saved. 2. Install MMScan API. -3. Run the following command to train EmbodiedScan (multiple GPU): +3. Run the following command to train EmbodiedScan (multiple GPUs): ```bash # Single GPU training python tools/train.py configs/grounding/pcd_4xb24_mmscan_vg_num256.py --work-dir=path/to/save - # Multiple GPU training + # Multiple GPUs training python tools/train.py configs/grounding/pcd_4xb24_mmscan_vg_num256.py --work-dir=path/to/save --launcher="pytorch" ``` -4. Run the following command to evaluate EmbodiedScan (multiple GPU): +4. Run the following command to evaluate EmbodiedScan (multiple GPUs): ```bash # Single GPU testing python tools/test.py configs/grounding/pcd_4xb24_mmscan_vg_num256.py path/to/load_pth - # Multiple GPU testing + # Multiple GPUs testing python tools/test.py configs/grounding/pcd_4xb24_mmscan_vg_num256.py path/to/load_pth --launcher="pytorch" ``` -#### ckpts & Logs +#### Results and Models | Input Modality | Det Pretrain | Epoch | gTop-1 @ 0.25 | gTop-1 @ 0.50 | Config | Download | | :-------: | :----: | :----:| :----: | :---------: | :--------------------------------------------------------: | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: | @@ -63,7 +63,7 @@ These are 3D question answering models adapted for the mmscan-devkit. Currently, ### LL3DA -1. Follow the [LL3DA](https://github.com/Open3DA/LL3DA/blob/main/README.md) to setup the Env. For data preparation, you need not load the datasets, only need to: +1. Follow the [LL3DA](https://github.com/Open3DA/LL3DA/blob/main/README.md) to setup the environment. For data preparation, you need not load the datasets, only need to: (1) download the [release pre-trained weights.](https://huggingface.co/CH3COOK/LL3DA-weight-release/blob/main/ll3da-opt-1.3b.pth) and put them under `./pretrained` @@ -73,13 +73,13 @@ These are 3D question answering models adapted for the mmscan-devkit. Currently, 3. Edit the config under `./scripts/opt-1.3b/eval.mmscanqa.sh` and `./scripts/opt-1.3b/tuning.mmscanqa.sh` -4. Run the following command to train LL3DA (4 GPU): +4. Run the following command to train LL3DA (4 GPUs): ```bash bash scripts/opt-1.3b/tuning.mmscanqa.sh ``` -5. Run the following command to evaluate LL3DA (4 GPU): +5. Run the following command to evaluate LL3DA (4 GPUs): ```bash bash scripts/opt-1.3b/eval.mmscanqa.sh @@ -93,7 +93,7 @@ These are 3D question answering models adapted for the mmscan-devkit. 
Currently, --tmp_path path/to/tmp --api_key your_api_key --eval_size -1 --nproc 4 ``` -#### ckpts & Logs +#### Results and Models | Detector | Captioner | Iters | Overall GPT Score | Download | | :-------: | :----: | :----: | :---------: |:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: | @@ -103,7 +103,7 @@ These are 3D question answering models adapted for the mmscan-devkit. Currently, ### LEO -1. Follow the [LEO](https://github.com/embodied-generalist/embodied-generalist/blob/main/README.md) to setup the Env. For data preparation, you need not load the datasets, only need to: +1. Follow the [LEO](https://github.com/embodied-generalist/embodied-generalist/blob/main/README.md) to setup the environment. For data preparation, you need not load the datasets, only need to: (1) Download [Vicuna-7B](https://huggingface.co/huangjy-pku/vicuna-7b/tree/main) and update cfg_path in configs/llm/\*.yaml @@ -113,13 +113,13 @@ These are 3D question answering models adapted for the mmscan-devkit. Currently, 3. Edit the config under `scripts/train_tuning_mmscan.sh` and `scripts/test_tuning_mmscan.sh` -4. Run the following command to train LEO (4 GPU): +4. Run the following command to train LEO (4 GPUs): ```bash bash scripts/train_tuning_mmscan.sh ``` -5. Run the following command to evaluate LEO (4 GPU): +5. Run the following command to evaluate LEO (4 GPUs): ```bash bash scripts/test_tuning_mmscan.sh
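Both LL3DA and LEO are scored with the GPT evaluator through `eval_utils/evaluate_gpt.py`, as shown in the commands above. For reference, a hedged sketch of what the underlying Python call might look like is given below; the constructor argument and the `load_and_eval` entry point are assumptions about the devkit's `GPTEvaluator` and should be checked against its source.

```python
# Hedged sketch: GPT-based scoring of saved QA predictions.
# Constructor and method names below are assumptions, not verified API.
from mmscan import GPTEvaluator

evaluator = GPTEvaluator(API_key="your_api_key")  # argument name assumed

qa_results = [{
    "ID": "QA_sample_0",               # hypothetical sample ID
    "question": "What lies between the sofa and the window?",
    "pred": ["a small wooden table"],  # model answer
    "gt": ["a coffee table"],          # reference answers
}]

# Mirrors the CLI above: evaluate every sample (eval_size=-1 there) and fan
# the requests out over several workers (--nproc 4).
print(evaluator.load_and_eval(qa_results, num_threads=4))
```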