# Inferencer study

### Various steps
1. Import and initialize the inferencer
2. Select some images, then **Just inference**
3. **Inference + Visualization**
4. **Remote** Inference + **Remote** Visualization

In [None]:
import sys

# # Print the PYTHONPATH before the modifications
# for i, path in enumerate(sys.path):
#     print(i, path)
# print()

# Add the new paths separately
# new_path1 = "/home/michele/.local/lib/python3.10/site-packages/"
# if not new_path1 in sys.path:
#     sys.path.insert(1, new_path1)
new_path2 = "/home/michele/code/michele_mmdet3d/"
if new_path2 not in sys.path:
    sys.path.insert(1, new_path2)

# # Print the PYTHONPATH after the modifications
# for i, path in enumerate(sys.path):
#     print(i, path)
# print()

# Print the version of MMCV
import mmcv
print(mmcv.__version__)
print(mmcv.__file__)

In [2]:
from mmdet3d.apis import LidarDet3DInferencer

In [None]:
#################################################################################################################
#                                           Initialize Inferencer                                               #
#################################################################################################################



inferencer = LidarDet3DInferencer(model='/home/michele/code/michele_mmdet3d/configs/minerva/CONDENSED_pointpillars_minerva.py',
                                  weights='/home/michele/code/michele_mmdet3d/work_dirs/pointpillars_minerva/epoch_120.pth',
                                  show_progress=False)

# inferencer = LidarDet3DInferencer(model='/home/michele/code/michele_mmdet3d/configs/minerva/CONDENSED_pointpillars_minerva.py', 
#                                   weights='/home/michele/code/michele_mmdet3d/work_dirs/pointpillars_minerva/epoch_120.pth',
#                                   want_losses=True,
#                                   show_progress = False)

In [None]:
#################################################################################################################
#                                               Just Inference                                                  #
#################################################################################################################



# Read the files in validation list
val_list_txt_file = "/home/michele/code/michele_mmdet3d/data/minerva_polimove/ImageSets/val.txt"
with open(val_list_txt_file, 'r') as file:
    val_file_names = [line.strip() for line in file]

# Choose a smaller set of the validation files
one_every_n = 1
max_number = 1
wait_time_default = 10
inputs = []
for i, name in enumerate(val_file_names):
    if i%one_every_n==0 and len(inputs)<max_number:
        inputs.append(dict(points=("/home/michele/code/michele_mmdet3d/data/minerva_polimove/training/velodyne/"+name+".bin")))

print(f"Total validation point_clouds: {len(val_file_names)}")
print(f"\tOne every n: {one_every_n}")
print(f"\tMax number: {max_number}")
print(f"\tSelected point_clouds: {len(inputs)}")

# Run inference on one input at a time and collect the results
results = []
for single_input in inputs:
    results.append(inferencer(single_input))

In [None]:
#################################################################################################################
#                                           Inference and Visualize                                             #
#################################################################################################################



# NOTE by Michele:  The visualization does not work properly if it gets passed a list. So just loop over the inputs
#       and call the "__call__" function of the inferencer on one input at a time
for single_input in inputs:
    inferencer(single_input, show=True, wait_time=wait_time_default)

In [None]:
#################################################################################################################
#                                              Remote Inferencer                                                #
#################################################################################################################



import os
import shutil
out_directory = '/home/michele/code/michele_mmdet3d/data/minerva_polimove/remote_inference/'
if os.path.exists(out_directory):
    shutil.rmtree(out_directory)
    print("Pre-existing directory removed. Will be created again from scratch")

# NOTE by Michele:  Same as above (need to use the for loop)
for single_input in inputs:
    inferencer(single_input, show=False, out_dir=out_directory)

# Visualize the predictions from the saved files
local_inferencer = LidarDet3DInferencer(model='/home/michele/code/michele_mmdet3d/configs/minerva/CONDENSED_pointpillars_minerva.py',
                                        weights='/home/michele/code/michele_mmdet3d/work_dirs/pointpillars_minerva/epoch_120.pth')

saved_predictions = [(out_directory + "preds/" + filename) for filename in os.listdir(out_directory + "preds/")]
local_inferencer.visualize_preds_fromfile(saved_predictions, show=True, wait_time = wait_time_default)

# VoxelNet(SingleStage3DDetector(Base3DDetector)): Description by ChatGPT

The `Base3DDetector` class in MMDetection3D serves as a parent class for 3D detectors, providing a unified interface for 3D object detection models. It inherits from `BaseDetector`, which is part of the 2D detection framework MMDetection, and extends it to handle 3D-specific data and operations. Below, I'll break down its interactions with PyTorch and explain its overall workflow:

### 1. **Interaction with PyTorch**
The main interactions with PyTorch happen in the `forward` method, which handles different modes of operation: `'tensor'`, `'predict'`, and `'loss'`. These modes control how the model processes inputs and interacts with PyTorch tensors.

1. **Forward Pass (`mode='tensor'`)**:  
   When `mode='tensor'`, the `forward()` method calls `_forward()`. This is where the network processes the inputs through the various layers defined in the specific detector model (e.g., backbone, neck, head). This is a typical PyTorch `nn.Module` forward pass, which returns raw tensor outputs directly without additional processing.

2. **Prediction (`mode='predict'`)**:  
   When `mode='predict'`, it processes the inputs and data samples using the `predict()` method. This involves running inference, post-processing, and formatting the results into structured outputs like bounding boxes, scores, and labels. During this stage, PyTorch is used for operations like applying non-maximum suppression (NMS) on 3D bounding boxes and converting model outputs into usable predictions.

3. **Loss Calculation (`mode='loss'`)**:  
   When `mode='loss'`, it calculates and returns the loss values using the `loss()` method. This mode is used during training. The `inputs` and `data_samples` are passed to the `loss()` function, which computes various losses (e.g., classification, regression) based on ground truth data. The resulting tensors are then used by the optimizer for backpropagation in a typical PyTorch training loop.
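
The three-way dispatch described above can be sketched in plain Python. This is a toy stand-in, not the MMDetection3D implementation: the class `ToyDetector` and the bodies of its methods are hypothetical, and only the mode names and the method names `loss`, `predict` and `_forward` are taken from the description.

```python
class ToyDetector:
    """Hypothetical stand-in that mimics the mode dispatch of `forward`."""

    def forward(self, inputs, data_samples=None, mode="tensor"):
        if mode == "loss":
            return self.loss(inputs, data_samples)
        if mode == "predict":
            return self.predict(inputs, data_samples)
        if mode == "tensor":
            return self._forward(inputs, data_samples)
        raise RuntimeError(f"Invalid mode: {mode!r}")

    def _forward(self, inputs, data_samples=None):
        # Placeholder "network": one raw value per sample, no post-processing
        return [sum(points) for points in inputs["points"]]

    def predict(self, inputs, data_samples=None):
        # Raw outputs wrapped into a structured, per-sample result
        return [{"scores_3d": [raw]} for raw in self._forward(inputs)]

    def loss(self, inputs, data_samples):
        # Toy mean squared error against the (hypothetical) targets
        raw = self._forward(inputs)
        squared = [(r - t) ** 2 for r, t in zip(raw, data_samples)]
        return {"loss_toy": sum(squared) / len(squared)}


detector = ToyDetector()
batch = {"points": [[0.5, 1.5], [2.0, 1.0]]}  # made-up data
raw_out = detector.forward(batch, mode="tensor")
preds = detector.forward(batch, mode="predict")
losses = detector.forward(batch, [2.0, 2.0], mode="loss")
print(raw_out, preds, losses)
```

The same entry point returns raw values, structured predictions, or a dictionary of losses depending only on `mode`.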

### 2. **High-Level Workflow of `Base3DDetector`**
The overall workflow of the class, when instantiated as part of a detector model, follows these steps:

1. **Initialization (`__init__` method)**:
   - The class accepts a `data_preprocessor` and an `init_cfg` argument.
   - `data_preprocessor` configures preprocessing steps like padding, mean normalization, and standardization for both point cloud and image data.
   - `init_cfg` handles initialization settings, such as loading pretrained weights or defining weight initialization schemes.

2. **Forward Method (`forward`)**:
   The `forward()` method is the unified entry point for different operations. It accepts the following key parameters:
   - `inputs`: A dictionary or list of dictionaries containing batch data. It typically includes keys like `points` (for point cloud data) and `imgs` (for image data).
   - `data_samples`: Optional annotation data for each sample, such as ground truth 3D bounding boxes.
   - `mode`: The mode to run the forward process (`'tensor'`, `'predict'`, or `'loss'`).

   Based on the mode, it delegates the call to one of three methods:
   - **`loss()`**: Computes the loss for training.
   - **`predict()`**: Processes inputs and generates predictions.
   - **`_forward()`**: Returns the raw tensor outputs for deeper analysis or debugging.

3. **Adding Predictions to Data Samples (`add_pred_to_datasample` method)**:
   This utility method formats the model outputs into the structure required for downstream use. It converts raw predictions into the `Det3DDataSample` format, which includes keys like:
   - `pred_instances_3d`: Contains 3D bounding boxes (`bboxes_3d`), labels (`labels_3d`), and scores (`scores_3d`).
   - `pred_instances`: Similar structure for 2D bounding boxes if applicable.

   It creates new `InstanceData` objects for missing 3D or 2D data and adds these predictions back into the `data_samples` list. This structure makes it easier to evaluate and visualize the results later.
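
As a rough illustration of the formatting step, the sketch below uses plain dictionaries in place of `Det3DDataSample` and `InstanceData`. The function name `add_predictions`, the file names and the box values are hypothetical; only the keys `pred_instances_3d`, `pred_instances`, `bboxes_3d`, `scores_3d` and `labels_3d` come from the description above.

```python
def add_predictions(data_samples, results_3d=None, results_2d=None):
    """Attach per-sample predictions; missing ones become empty containers."""
    count = len(data_samples)
    if results_3d is None:
        results_3d = [{} for _ in range(count)]
    if results_2d is None:
        results_2d = [{} for _ in range(count)]
    for sample, pred_3d, pred_2d in zip(data_samples, results_3d, results_2d):
        sample["pred_instances_3d"] = pred_3d
        sample["pred_instances"] = pred_2d
    return data_samples


# Made-up samples and predictions, for illustration only
samples = [{"lidar_path": "000001.bin"}, {"lidar_path": "000002.bin"}]
preds_3d = [
    {"bboxes_3d": [[10.0, 2.0, -1.0, 4.5, 1.9, 1.2, 0.1]],
     "scores_3d": [0.91], "labels_3d": [0]},
    {"bboxes_3d": [], "scores_3d": [], "labels_3d": []},
]
samples = add_predictions(samples, results_3d=preds_3d)
print(samples[0]["pred_instances_3d"]["scores_3d"])
```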

### 3. **Usage in PyTorch Training and Inference Loops**
The `Base3DDetector` class is designed to fit seamlessly into PyTorch's training and inference workflows. It is used as follows:

1. **Training**:
   - When calling `model(inputs, data_samples, mode='loss')`, the class processes the input data through the network and computes the loss.
   - The loss dictionary is used by the optimizer to perform backpropagation and update weights.

2. **Inference**:
   - When calling `model(inputs, data_samples, mode='predict')`, it produces the post-processed predictions in the format expected by MMDetection3D's evaluation pipeline.
   - Predictions are wrapped into `Det3DDataSample` objects, making it easy to evaluate using standard metrics (e.g., mAP, AP40).

3. **Debugging and Tensor Operations**:
   - When calling `model(inputs, mode='tensor')`, it skips all post-processing and returns the raw tensors directly, allowing for deeper debugging or custom analysis using PyTorch tensor operations.

### 4. **Extensibility for New 3D Detection Models**
The `Base3DDetector` class serves as a template for implementing new 3D detectors. Derived classes can:
- Override `loss()` to implement custom loss functions.
- Implement `predict()` for different types of 3D detectors (e.g., voxel-based, point-based).
- Customize `_forward()` to define different network architectures.
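
The extension pattern can be sketched with an abstract base class. Everything here is hypothetical (`ToyBase3DDetector`, `ToyCentroidDetector`, and the centroid "model" are invented for illustration); the sketch only shows a subclass supplying `loss()`, `predict()` and `_forward()` while the base class keeps the dispatch.

```python
from abc import ABC, abstractmethod


class ToyBase3DDetector(ABC):
    """Hypothetical base: owns the mode dispatch, leaves the rest abstract."""

    def forward(self, inputs, data_samples=None, mode="tensor"):
        handlers = {"loss": self.loss, "predict": self.predict,
                    "tensor": self._forward}
        if mode not in handlers:
            raise RuntimeError(f"Invalid mode: {mode!r}")
        return handlers[mode](inputs, data_samples)

    @abstractmethod
    def loss(self, inputs, data_samples): ...

    @abstractmethod
    def predict(self, inputs, data_samples=None): ...

    @abstractmethod
    def _forward(self, inputs, data_samples=None): ...


class ToyCentroidDetector(ToyBase3DDetector):
    """Toy 'detector' that outputs the centroid of each point cloud."""

    def _forward(self, inputs, data_samples=None):
        return [tuple(sum(p[i] for p in cloud) / len(cloud) for i in range(3))
                for cloud in inputs["points"]]

    def predict(self, inputs, data_samples=None):
        return [{"center": center} for center in self._forward(inputs)]

    def loss(self, inputs, data_samples):
        # Squared distance between each centroid and its target centre
        errors = [sum((c - g) ** 2 for c, g in zip(center, target))
                  for center, target in zip(self._forward(inputs), data_samples)]
        return {"loss_center": sum(errors) / len(errors)}


toy = ToyCentroidDetector()
batch = {"points": [[(0.0, 0.0, 0.0), (2.0, 2.0, 2.0)]]}  # made-up cloud
centroids = toy.forward(batch, mode="tensor")
toy_losses = toy.forward(batch, [(1.0, 1.0, 0.0)], mode="loss")
print(centroids, toy_losses)
```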

### 5. **Example: How it Fits in an MMDetection3D Pipeline**
Consider a typical 3D object detection training pipeline:
1. **Data Preprocessing**: The dataset class formats input point clouds and images, passes them through the preprocessor, and provides `inputs` and `data_samples`.
2. **Training Loop**: The model’s `forward()` method is called in `'loss'` mode, computing the losses.
3. **Inference Loop**: During testing, the same `forward()` method is called in `'predict'` mode to generate detection results.
4. **Result Formatting**: `add_pred_to_datasample()` structures the output for easy evaluation and visualization.

In essence, `Base3DDetector` provides the foundation, and specific 3D detectors extend it to implement unique architectures and functionality tailored to their tasks.

# Model analysis

### **Components of the model inside MMDetection3D:**
- General type of net is "VoxelNet" with the following sub-components:
    1. data_preprocessor 
    2. voxel_encoder
    3. middle_encoder
    4. backbone
    5. neck
    6. bbox_head
- A more detailed description of the `Base3DDetector` class is given in the section above.

### **Data Preprocessor (Det3DDataPreprocessor)**
- Following the inheritance chain *VoxelNet --> SingleStage3DDetector --> Base3DDetector --> BaseDetector --> BaseModel*, you can see that the **data_preprocessor** is called right before every call to the `_run_forward` method.
    - `_run_forward` is called with **mode=predict** in *"val_step"* and *"test_step"*, and with **mode=loss** in *"train_step"*.
    - The data_preprocessor produces the batch data that `_run_forward` passes on to the model.
- In our case the data_preprocessor is of class **"Det3DDataPreprocessor"**.
    - Calls the function *"voxelize"* that can be of four types: hard/dynamic/cylindrical/minkunet voxelization. In our case it is **hard** voxelization.
    - Performs the voxelization based on the information provided to the dictionary **"voxel_layer"** in the configuration file.
        - max_voxels
        - max_num_points
        - point_cloud_range
        - voxel_size
- The voxelization layer is of **class VoxelizationByGridShape**, which also accepts the flag *"deterministic"*; setting *"deterministic = False"* may speed up the process. <mark>It is worth a try</mark>
- There is a way to perform **dynamic voxelization**, and also a VoxelEncoder that supports dynamic voxelization (same file as the other). <mark>It is worth a try</mark>
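
To make the role of the four `voxel_layer` parameters concrete, here is a minimal pure-Python sketch of hard voxelization. It is not the MMDetection3D implementation (which works on tensors); the function name `hard_voxelize` and all numeric values are hypothetical.

```python
def hard_voxelize(points, voxel_size, point_cloud_range,
                  max_num_points, max_voxels):
    """Group points into voxels, capping voxels and points per voxel."""
    x_min, y_min, z_min, x_max, y_max, z_max = point_cloud_range
    voxels = {}  # (ix, iy, iz) -> list of points, in insertion order
    for point in points:
        x, y, z = point[:3]
        inside = (x_min <= x < x_max and y_min <= y < y_max
                  and z_min <= z < z_max)
        if not inside:
            continue  # points outside point_cloud_range are dropped
        key = (int((x - x_min) // voxel_size[0]),
               int((y - y_min) // voxel_size[1]),
               int((z - z_min) // voxel_size[2]))
        if key not in voxels:
            if len(voxels) >= max_voxels:
                continue  # voxel budget exhausted
            voxels[key] = []
        if len(voxels[key]) < max_num_points:
            voxels[key].append(point)  # extra points in a full voxel are dropped
    return voxels


# Made-up points (x, y, z, intensity) and made-up settings
cloud = [(0.2, 0.3, 0.0, 0.5), (0.8, 0.1, 0.5, 0.1), (0.5, 0.5, 0.0, 0.9),
         (2.5, 3.5, -1.0, 0.3), (5.0, 1.0, 0.0, 0.2)]
voxels = hard_voxelize(cloud, voxel_size=[1.0, 1.0, 4.0],
                       point_cloud_range=[0.0, 0.0, -2.0, 4.0, 4.0, 2.0],
                       max_num_points=2, max_voxels=10)
print({key: len(pts) for key, pts in voxels.items()})
```

With a single voxel along z, as in this example, each voxel is a pillar. The caps are what make the voxelization "hard": points beyond `max_num_points` and voxels beyond `max_voxels` are simply discarded.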

### **Voxel Encoder (PillarFeatureNet)**
- Performs the first processing of the voxels.
    - Takes as input a tensor of shape (NxMxC), with N = number of pillars, M = maximum number of points per pillar, C = number of features per point (in our case 4).
    - Extra features can be appended to each point, such as *cluster center distance* (offset from the centroid of the points in the pillar), *voxel center distance* (offset from the geometric centre of the voxel), and *distance from the origin* (Euclidean distance from the origin of the point cloud frame). So the new input becomes (NxMxC_augmented). This step is called **feature augmentation**.
    - For each pillar, **pooling** is performed (either max or average) to extract the main features of the pillar. The number of features is defined in the parameter **"feat_channels"**. <mark>May try to modify **number of features** and also **type of pooling**</mark>
    - The output is (NxC_features) where C_features is exactly the parameter *"feat_channels"*.
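
The augmentation and pooling steps can be sketched for a single pillar in plain Python. The sketch assumes the 9-value decoration of the original PointPillars formulation (point features, offsets from the point centroid, x/y offsets from the pillar centre) and skips both the zero-padding to M points and the learned layer that maps to `feat_channels`, so the pooled vector here has 9 entries. The function names and values are hypothetical.

```python
def augment_pillar(points, pillar_center_xy):
    """Append centroid offsets and pillar-centre offsets to each point."""
    count = len(points)
    mean = [sum(p[i] for p in points) / count for i in range(3)]
    augmented = []
    for x, y, z, intensity in points:
        augmented.append([
            x, y, z, intensity,
            x - mean[0], y - mean[1], z - mean[2],             # cluster centre
            x - pillar_center_xy[0], y - pillar_center_xy[1],  # pillar centre
        ])
    return augmented


def max_pool(features):
    """Channel-wise maximum over the points of one pillar."""
    return [max(channel) for channel in zip(*features)]


# Made-up pillar with two points (x, y, z, intensity)
pillar = [(1.0, 1.0, 0.0, 0.5), (3.0, 1.0, 2.0, 0.1)]
augmented = augment_pillar(pillar, pillar_center_xy=(2.0, 2.0))
pooled = max_pool(augmented)
print(len(augmented[0]), pooled)
```

Replacing `max` with a mean in `max_pool` gives the average-pooling variant.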

### **Middle Encoder (PointPillarsScatter)**
- Simply takes the voxels as listed in a **(NxC_features)** tensor, and turns them into a **(1xC_featuresxN_yxN_x)** tensor.
- Essentially, takes the pillars **from a 1-D representation** and orders them **to a grid-like structure** which can be used **for CNN-like computations**.
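
A minimal sketch of the scatter step for one sample, using nested lists instead of tensors. The function name `scatter_pillars` and the `(iy, ix)` coordinate format are assumptions made for illustration; the real layer works on batched tensors.

```python
def scatter_pillars(pillar_features, coords, ny, nx):
    """Place N feature vectors on a dense (C, ny, nx) grid of zeros."""
    num_channels = len(pillar_features[0])
    canvas = [[[0.0] * nx for _ in range(ny)] for _ in range(num_channels)]
    for features, (iy, ix) in zip(pillar_features, coords):
        for channel, value in enumerate(features):
            canvas[channel][iy][ix] = value
    return canvas


# Two made-up pillars with C_features = 2 on a 3 x 2 grid
features = [[1.0, 2.0], [3.0, 4.0]]
coords = [(0, 1), (2, 0)]  # (iy, ix) grid cell of each pillar
canvas = scatter_pillars(features, coords, ny=3, nx=2)
print(canvas)
```

Cells without a pillar stay at zero, which is what turns the sparse list of pillars into a dense pseudo-image suitable for 2D convolutions.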

### **Backbone (SECOND)**
- Convolution in 3 steps, as follows.
    - From (C, X, Y)        to      (C, X/2, Y/2)
    - From (C, X/2, Y/2)    to      (2C, X/4, Y/4)
    - From (2C, X/4, Y/4)   to      (4C, X/8, Y/8)
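- The three steps run one after the other (each takes the previous output as input), but the output of **EVERY** step is kept: the three feature maps are passed on together, upsampled with transposed convolutions and then concatenated in the neck (see below)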
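
The shape arithmetic of the three steps can be checked with a few lines of Python, assuming the steps are chained as the arrows above indicate and that each one halves the spatial size. The function name and the values of C, X and Y are hypothetical.

```python
def backbone_shapes(in_shape, out_channels, strides):
    """Return the (channels, X, Y) shape after each chained stage."""
    _, x, y = in_shape
    shapes = []
    for channels, stride in zip(out_channels, strides):
        x, y = x // stride, y // stride
        shapes.append((channels, x, y))
    return shapes


C, X, Y = 64, 496, 432  # made-up pseudo-image size
stage_shapes = backbone_shapes((C, X, Y),
                               out_channels=[C, 2 * C, 4 * C],
                               strides=[2, 2, 2])
print(stage_shapes)
```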

### **Neck (SECOND FPN)**
- An FPN-style neck (FPN: Feature Pyramid Network).
    - There is a down-sampling path and an up-sampling path; the feature maps of the different scales are connected **laterally** and then merged together.
    - Multi-scale features provide spatial resolution from the less down-sampled maps and feature richness from the more down-sampled ones.
- The up-sampling works as follows.
    - From (C, X/2, Y/2)    to      (2C, X/2, Y/2)
    - From (2C, X/4, Y/4)   to      (2C, X/2, Y/2)
    - From (4C, X/8, Y/8)   to      (2C, X/2, Y/2)
- So at the end there are three feature maps with 2C channels each, and they are concatenated along the channel dimension to get (6C, X/2, Y/2).
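
The same bookkeeping for the neck, as a sketch: each backbone output is upsampled to a common resolution and the results are stacked along the channel axis. The function name, the upsampling factors (1, 2, 4) and the input shapes (which assume C = 64 and a 496 x 432 pseudo-image) are hypothetical values chosen to match the list above.

```python
def neck_output_shape(stage_shapes, upsample_strides, out_channels):
    """Upsample every stage to one resolution, then concatenate channels."""
    upsampled = [(channels, x * stride, y * stride)
                 for (_, x, y), stride, channels
                 in zip(stage_shapes, upsample_strides, out_channels)]
    spatial_sizes = {(x, y) for _, x, y in upsampled}
    if len(spatial_sizes) != 1:
        raise ValueError("feature maps must share one spatial size")
    x, y = spatial_sizes.pop()
    total_channels = sum(channels for channels, _, _ in upsampled)
    return upsampled, (total_channels, x, y)


# Made-up backbone outputs: (C, X/2, Y/2), (2C, X/4, Y/4), (4C, X/8, Y/8)
backbone_outputs = [(64, 248, 216), (128, 124, 108), (256, 62, 54)]
upsampled, fused = neck_output_shape(backbone_outputs,
                                     upsample_strides=[1, 2, 4],
                                     out_channels=[128, 128, 128])
print(upsampled, fused)
```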

### **BBox Head (Anchor3DHead)**
- Takes as input the feature map, and outputs the final results for prediction and training.
    - For just **INFERENCE** the outputs are
        - Predicted BBox parameters (7 values).
        - Classification scores for each BBox, a value for each class.
        - Direction classification.
    - For **TRAINING** the outputs are instead
        - Classification scores, BBox offsets, direction classification predictions.
        - Loss values for the back-propagation, so *loss_cls/loss_bbox/loss_dir*.
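
For an anchor-based head, the size of the three prediction maps follows from the number of anchors per grid cell. The sketch below assumes one 1x1 convolution per output (class scores, 7 box values, 2 direction bins); the function name, the dictionary keys and the numbers are hypothetical.

```python
def head_output_channels(num_classes, num_anchors,
                         box_code_size=7, num_dir_bins=2):
    """Channels of each prediction map, per feature-map location."""
    return {
        "cls_score": num_anchors * num_classes,
        "bbox_pred": num_anchors * box_code_size,
        "dir_cls_pred": num_anchors * num_dir_bins,
    }


# Made-up setup: 3 classes, 6 anchors per grid cell
channels = head_output_channels(num_classes=3, num_anchors=6)
print(channels)
```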