# Initial Setup

The project repository is mounted from Google Drive and added to the Python path to allow clean imports from the src module. The dataset is copied to the local Colab filesystem to improve I/O performance during training. All global settings (random seed, device selection, paths, batch sizes) are defined once and reused across the notebook to ensure consistency and reproducibility.

Weights & Biases is initialized for experiment tracking, and all training stages use the same precomputed dataset statistics and DataLoaders for fair comparison across models.

In [None]:
import sys
from pathlib import Path

from google.colab import drive
drive.mount('/content/drive')

%cd "/content/drive/MyDrive/Colab Notebooks/Applied Computer Vision/Applied-Computer-Vision-Projects/Multimodal_Learning_02/"

PROJECT_ROOT = Path.cwd()
if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))

In [None]:
%%capture
# Install dependencies (%%capture must be the first line of the cell)
%pip install --no-cache-dir -r requirements.txt

In [None]:
#%%capture
#%pip install fiftyone==1.10.0 sympy==1.12 torch torchvision numpy open-clip-torch open3d

In [None]:
import os
from google.colab import userdata

import wandb
import fiftyone as fo
from PIL import Image

import numpy as np
import open3d as o3d
import matplotlib.pyplot as plt

import torch
import torchvision.transforms.v2 as transforms

In [None]:
from src.config import (SEED, IMG_SIZE, CLASSES, DRIVE_ROOT, RAW_DATA,
                        DRIVE_TRANSFORMED_DATA_PATH, TMP_TRANSFORMED_DATA_PATH)
from src.utility import set_seeds, prepare_lidar_pointclouds
from src.datasets import find_matching_files, compute_dataset_mean_std
from src.visualization import build_grouped_dataset, plot_class_distributions

In [None]:
!rm -rf /content/data
!cp -r "/content/drive/MyDrive/Colab Notebooks/Applied Computer Vision/Applied-Computer-Vision-Projects/Multimodal_Learning_02/data/assessment" /content/data

In [None]:
# Constants specific to dataset visualization
FIFTYONE_DATASET_NAME = "cilp_assessment"

In [None]:
# Usage: Call this function at the beginning and before each training phase
set_seeds(SEED)
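
`set_seeds` is imported from `src.utility` and its body is not shown in this notebook. A minimal sketch of what such a helper commonly does, assuming it seeds Python, NumPy, and PyTorch (the name `set_seeds_sketch` is hypothetical and not part of the project):

```python
import os
import random


def set_seeds_sketch(seed: int) -> None:
    """Seed the common random number generators (illustrative, not the project's code)."""
    random.seed(seed)
    # Only affects hashing in Python processes started after this point.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; skip
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # safe to call without a GPU
    except ImportError:
        pass  # PyTorch not installed; skip


# Re-seeding reproduces the same random sequence.
set_seeds_sketch(123)
first = random.random()
set_seeds_sketch(123)
second = random.random()
print(first == second)
```

Re-seeding before each training phase, as the comment in the cell above suggests, keeps a phase's randomness independent of how many random draws earlier cells consumed.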

In [None]:
# Load W&B API key from Colab Secrets and make it available as env variable
wandb_key = userdata.get('WANDB_API_KEY')
os.environ["WANDB_API_KEY"] = wandb_key
wandb.login()

# Loading and preparation of Data

The RGB–LiDAR dataset is loaded and preprocessed by normalizing RGB images and converting LiDAR depth data into aligned point cloud representations suitable for multimodal learning.

In [None]:
pcd_status = prepare_lidar_pointclouds(
    raw_dataset_dir=RAW_DATA,
    local_pointcloud_dir=TMP_TRANSFORMED_DATA_PATH,
    cache_dir=DRIVE_TRANSFORMED_DATA_PATH,
    converter_script=DRIVE_ROOT / "scripts" / "convert_lidar_to_pcd.py",
    classes=CLASSES,
)

print("LiDAR point cloud preparation:", pcd_status)

In [None]:
# Computes the per-channel mean and standard deviation of the RGB training data.
# Recompute these whenever the dataset or the training split changes.
# mean, std = compute_dataset_mean_std(root_dir=RAW_DATA, img_size=IMG_SIZE)
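
The body of `compute_dataset_mean_std` is not shown here. As a rough sketch of how per-channel statistics can be accumulated without holding the whole dataset in memory, assuming images arrive as height x width x channel arrays already scaled to [0, 1] (the function `channel_mean_std` and the toy images are hypothetical):

```python
import numpy as np


def channel_mean_std(images):
    """Per-channel mean and population std over an iterable of HxWxC arrays."""
    total = None      # running sum per channel
    total_sq = None   # running sum of squares per channel
    n_pixels = 0
    for img in images:
        arr = np.asarray(img, dtype=np.float64)
        flat = arr.reshape(-1, arr.shape[-1])  # (pixels, channels)
        if total is None:
            total = np.zeros(flat.shape[1])
            total_sq = np.zeros(flat.shape[1])
        total += flat.sum(axis=0)
        total_sq += (flat ** 2).sum(axis=0)
        n_pixels += flat.shape[0]
    mean = total / n_pixels
    # Var[x] = E[x^2] - E[x]^2; clip tiny negatives caused by rounding.
    var = np.clip(total_sq / n_pixels - mean ** 2, 0.0, None)
    return mean, np.sqrt(var)


# Toy check: one all-black and one all-white 2x2 RGB image.
toy = [np.zeros((2, 2, 3)), np.ones((2, 2, 3))]
mean, std = channel_mean_std(toy)
print(mean, std)
```

A channel that is almost constant across the dataset yields a standard deviation close to zero, so normalization divides by a very small number for that channel; it may be worth checking whether such a channel carries useful information.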

In [None]:
img_transforms = transforms.Compose([
    transforms.ToImage(),                           # Converts the input to a tensor image (no value scaling)
    transforms.Resize(IMG_SIZE),
    transforms.ToDtype(torch.float32, scale=True),  # Casts to float32 and scales values into [0, 1]
    # Precomputed per-channel statistics of the 4-channel training images
    transforms.Normalize([0.0051, 0.0052, 0.0051, 1.0000], [5.8023e-02, 5.8933e-02, 5.8108e-02, 2.4509e-07]),
    # transforms.Normalize(mean.tolist(), std.tolist())  # use instead after recomputing mean/std for a different dataset
])

In [None]:
pairs = find_matching_files(
    CLASSES,
    rgb_root=RAW_DATA,
    lidar_root=TMP_TRANSFORMED_DATA_PATH,
    rgb_subdir="rgb",
    lidar_subdir="pcd",
    rgb_ext="png",
    lidar_ext="pcd",
)
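
`find_matching_files` comes from `src.datasets` and is not shown here. Later cells read each entry as a dict with `"rgb"` and `"lidar"` keys, grouped by class. One way such pairing could work is matching files by stem; the sketch below assumes a `<root>/<class>/<subdir>/` layout, and the function `match_by_stem`, the class name `cubes`, and the file names are all made up for illustration:

```python
import tempfile
from pathlib import Path


def match_by_stem(classes, rgb_root, lidar_root, rgb_subdir="rgb",
                  lidar_subdir="pcd", rgb_ext="png", lidar_ext="pcd"):
    """Pair RGB and LiDAR files that share a file stem (hypothetical helper)."""
    pairs = {}
    for cls in classes:
        rgb_dir = Path(rgb_root) / cls / rgb_subdir
        lidar_dir = Path(lidar_root) / cls / lidar_subdir
        lidar_by_stem = {p.stem: p for p in lidar_dir.glob(f"*.{lidar_ext}")}
        # Sorting gives a stable order; unmatched RGB files are skipped.
        pairs[cls] = [
            {"rgb": p, "lidar": lidar_by_stem[p.stem]}
            for p in sorted(rgb_dir.glob(f"*.{rgb_ext}"))
            if p.stem in lidar_by_stem
        ]
    return pairs


# Tiny demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "cubes" / "rgb").mkdir(parents=True)
    (root / "cubes" / "pcd").mkdir(parents=True)
    for name in ("0001.png", "0002.png"):
        (root / "cubes" / "rgb" / name).touch()
    (root / "cubes" / "pcd" / "0001.pcd").touch()
    demo = match_by_stem(["cubes"], rgb_root=root, lidar_root=root)
    print(len(demo["cubes"]))  # only 0001 has both modalities
```

Dropping unmatched files silently keeps the pairing strict; logging the skipped stems would make missing point clouds easier to spot.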

# Create FiftyOne Grouped Dataset

A grouped dataset is created to explicitly link each RGB image with its corresponding LiDAR point cloud. This enables joint visualization and inspection of both modalities within a single sample, facilitating qualitative analysis of multimodal alignment and data quality.

In [None]:
dataset = build_grouped_dataset(
    name=FIFTYONE_DATASET_NAME,
    pairs=pairs,
    persistent=True,
    overwrite=True
)

In [None]:
session = fo.launch_app(dataset, auto=False)

print(f"✅ Created dataset '{dataset.name}' with {len(dataset)} samples")
print("Group field:", dataset.group_field)
print("Group slices:", dataset.group_slices)

# Visual Exploration - Evaluation:

In [None]:
print(dataset)

In [None]:
total_per_class = {cls: len(items) for cls, items in pairs.items()}
total_samples = sum(total_per_class.values())

print("Total samples per class:")
for cls, n in total_per_class.items():
    print(f"  {cls}: {n}")
print(f"\nTotal samples: {total_samples}")

# picks the first class and first sample from pairs
any_class = CLASSES[0]
sample = pairs[any_class][0]

sample_rgb_path = sample["rgb"]
sample_pcd_path = sample["lidar"]

# RGB image
rgb_img = Image.open(sample_rgb_path)
print("RGB image:")
print("  size (width, height):", rgb_img.size)
print("  mode:", rgb_img.mode)
print("  format:", rgb_img.format)

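# LiDAR point cloud
pcd = o3d.io.read_point_cloud(str(sample_pcd_path))
lidar = np.asarray(pcd.points)   # (N, 3) array of xyz coordinates
color = np.asarray(pcd.colors)   # per-point colors; empty if the file stores none

print("\nLiDAR point cloud:")
print("  shape:", lidar.shape)
print("  dtype:", lidar.dtype)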

## Observations about the dataset

**Observation 1 — Clear shape signal in LiDAR despite low RGB resolution**

What we see:
* RGB image: very low resolution, blurry, little texture
* LiDAR: sphere shape is clearly recognizable (smooth, rounded point distribution)

Interpretation: Although the RGB images are low-resolution and provide limited texture information, the LiDAR modality captures the geometric structure of the objects very clearly. In particular, spherical objects form smooth, rounded point clouds, making shape information more salient in LiDAR than in RGB.

----------

**Observation 2 — Complementarity of modalities**

What we see:
* RGB alone: hard to distinguish shape confidently
* LiDAR alone: shape (sphere vs cube) is obvious

Interpretation: The two modalities provide complementary information: while RGB captures appearance cues, LiDAR provides strong geometric cues. This complementarity motivates the use of multimodal contrastive learning, as each modality compensates for weaknesses in the other.

-------------

**Observation 3 — Sparse but structured LiDAR point clouds**

What we see:
* LiDAR point cloud is not dense
* Still forms a coherent spherical structure

Interpretation: While the LiDAR point clouds are relatively sparse, they retain sufficient structural information to represent object shape. However, variability in point density across samples could introduce noise during training.

## Data quality issues and observed patterns:

The RGB images share a uniform background and consistently centered objects. This may let a model rely on position or background rather than object features, and it may limit generalization to more varied scenes. RGB images are low-resolution, while LiDAR point clouds provide clearer geometric cues but vary in sparsity. Overall, the RGB and LiDAR modalities are well aligned and consistently paired, indicating good dataset completeness.

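# Creating training and validation sets

The dataset is split per class into training (80%) and validation (20%) subsets; no separate test subset is created in this section. Each class's list of pairs is sliced in its stored order, so the split is reproducible as long as that order is stable, and the seed is reset beforehand so that later random operations stay consistent. This keeps evaluation and comparisons consistent across all models and experiments. In addition, the class distribution is analyzed to verify dataset balance and identify potential class imbalance issues.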

In [None]:
set_seeds(SEED)
train_ratio = 0.8

splits = {
    "train": {},
    "val": {},
}

for cls, items in pairs.items():
    n = len(items)
    n_train = int(n * train_ratio)

    splits["train"][cls] = items[:n_train]
    splits["val"][cls] = items[n_train:]

train_size = sum(len(v) for v in splits["train"].values())
val_size = sum(len(v) for v in splits["val"].values())

print("Train/validation sizes:")
for cls in CLASSES:
    print(
        f"  {cls}: train={len(splits['train'][cls])}, "
        f"val={len(splits['val'][cls])}"
    )
print(f"\nTotal train: {train_size}")
print(f"Total val:   {val_size}")
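
The split cell above takes the first 80% of each class's list in its stored order. If that order were correlated with the content (an assumption worth checking), a seeded shuffle before slicing is one alternative. A minimal sketch, where `split_per_class` and the toy data are hypothetical:

```python
import random


def split_per_class(pairs, train_ratio=0.8, seed=42):
    """Shuffle each class with a local seeded RNG, then slice into train/val."""
    rng = random.Random(seed)  # local RNG; leaves global random state untouched
    splits = {"train": {}, "val": {}}
    for cls, items in pairs.items():
        shuffled = list(items)  # copy so the input order is preserved
        rng.shuffle(shuffled)
        n_train = int(len(shuffled) * train_ratio)
        splits["train"][cls] = shuffled[:n_train]
        splits["val"][cls] = shuffled[n_train:]
    return splits


# Toy data standing in for the real pairs dict.
toy_pairs = {"a": list(range(10)), "b": list(range(5))}
toy_splits = split_per_class(toy_pairs, train_ratio=0.8, seed=0)
print({cls: len(v) for cls, v in toy_splits["train"].items()})
```

Using a local `random.Random` instance keeps the split independent of any other random draws made earlier in the notebook.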

In [None]:
fig, axes = plot_class_distributions(
    total_per_class=total_per_class,
    splits=splits
)
plt.show()