# Dataset Exploration

## Data Preparation

From the original dataset ([Kaggle](https://www.kaggle.com/datasets/andandand/cubes-and-spheres-lidar-and-rgb)), which consists of 9999 spheres and 9999 cubes, each with RGB and lidar data, I selected a subset of 1000 cubes and 1000 spheres. I uploaded this subset as a grouped FiftyOne dataset to Hugging Face [here](https://huggingface.co/datasets/MatthiasCr/multimodal-shapes-subset). This notebook and all following notebooks and experiments start from this Hugging Face dataset.
The FiftyOne dataset already carries tags for the train and validation split, so that all experiments use the same split. The azimuth and zenith data for the lidar samples are stored as dataset-level metadata fields.

The following code created this subset and pushed it to Hugging Face. It loads the original full dataset from Google Drive, creates the grouped FiftyOne dataset, adds the azimuth and zenith data, and creates a train/val split.
This code does not need to be executed for this notebook; it is included only for completeness.

```python
from google.colab import drive
import random
import fiftyone as fo
from fiftyone import ViewField as F  # needed for the label filters in the split below
import fiftyone.utils.random as four
from fiftyone.utils.huggingface import push_to_hub
import numpy as np

drive.mount('/content/drive')
data_root = "/content/drive/MyDrive/compVision/spheres_and_cubes"

# randomly choose 1000 out of the 9999 indices of the original dataset
N = 9999
indices = sorted(random.sample(range(N), int(0.1 * N) + 1))

# create a grouped fiftyone dataset
dataset = fo.Dataset("multimodal_shapes_subset")
dataset.add_group_field("group")

shapes = {
    "cubes": "cube",
    "spheres": "sphere",
}

# using the chosen indices add 1000 sphere and 1000 cube samples to fiftyone dataset
groups = {}
for cls, label in shapes.items():
    for i in indices:
        gid = f"{cls}_{i:04d}"

        groups.setdefault(gid, fo.Group())

        # rgb sample
        rgb = fo.Sample(
            filepath=f"{data_root}/{cls}/rgb/{i:04d}.png",
            group = groups[gid].element("rgb"),
            label=fo.Classification(label=label),
        )

        # lidar sample
        lidar = fo.Sample(
            filepath=f"{data_root}/{cls}/lidar/{i:04d}.npy",
            group = groups[gid].element("lidar"),
            label=fo.Classification(label=label),
        )
        dataset.add_samples([rgb, lidar])

# also load azimuth and zenith data and store it in dataset-level info fields
azimuth_cubes = np.load(f"{data_root}/cubes/azimuth.npy")
zenith_cubes  = np.load(f"{data_root}/cubes/zenith.npy")
azimuth_spheres = np.load(f"{data_root}/spheres/azimuth.npy")
zenith_spheres = np.load(f"{data_root}/spheres/zenith.npy")

dataset.info["azimuth_cubes"] = azimuth_cubes.tolist()
dataset.info["zenith_cubes"] = zenith_cubes.tolist()
dataset.info["azimuth_spheres"] = azimuth_spheres.tolist()
dataset.info["zenith_spheres"] = zenith_spheres.tolist()
dataset.save()

# train / val split per class so the classes remain balanced inside train and val split
spheres = dataset.match(F("label.label") == "sphere")
four.random_split(spheres, {"train": 0.8, "val": 0.2})
cubes = dataset.match(F("label.label") == "cube")
four.random_split(cubes, {"train": 0.8, "val": 0.2})

# push fiftyone dataset to huggingface
push_to_hub(dataset, "multimodal-shapes-subset", label_field="label")
```
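
The idea behind the per-class split can be illustrated without FiftyOne. The following is a minimal sketch, assuming samples are plain `(id, label)` tuples. The helper `stratified_split` and the fixed seed are hypothetical additions for illustration; the code above uses `four.random_split` and does not seed `random`.

```python
import random
from collections import Counter


def stratified_split(samples, train_frac=0.8, seed=0):
    """Split (sample_id, label) pairs into train/val separately per class.

    Hypothetical helper for illustration; not part of FiftyOne.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible

    # collect the sample ids of each class
    by_label = {}
    for sample_id, label in samples:
        by_label.setdefault(label, []).append(sample_id)

    # shuffle each class on its own and cut it at the train fraction
    split = {}
    for label, ids in by_label.items():
        rng.shuffle(ids)
        n_train = round(train_frac * len(ids))
        for sample_id in ids[:n_train]:
            split[sample_id] = "train"
        for sample_id in ids[n_train:]:
            split[sample_id] = "val"
    return split


# toy data mirroring the subset: 1000 cubes and 1000 spheres
samples = [(f"cubes_{i:04d}", "cube") for i in range(1000)]
samples += [(f"spheres_{i:04d}", "sphere") for i in range(1000)]

split = stratified_split(samples)
counts = Counter((label, split[sample_id]) for sample_id, label in samples)
print(counts)
```

Because each class is shuffled and cut independently, both classes end up with the same train/val ratio, which a single split over the mixed samples would only achieve approximately.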

In [None]:
import sys

# Colab-only setup
if "google.colab" in sys.modules:
    print("Running in Google Colab. Setting up repo")

    !git clone https://github.com/Computer-Vision-Assignment-2.git
    %cd Computer-Vision-Assignment-2
    !pip install -r requirements.txt

## Load the Data and Visualize in Fiftyone

Now we load the dataset from Hugging Face.

In [1]:
import os
from PIL import Image
import fiftyone as fo
from fiftyone import ViewField as F
import fiftyone.core.groups as fog
from fiftyone.utils.huggingface import load_from_hub
import numpy as np
import open3d as o3d

Jupyter environment detected. Enabling Open3D WebVisualizer.
[Open3D INFO] WebRTC GUI backend enabled.
[Open3D INFO] WebRTCWindowSystem: HTTP handshake server disabled.


In [8]:
dataset = load_from_hub("MatthiasCr/multimodal-shapes-subset", 
                         name="multimodal-shapes-subset",
                         # fewer workers and greater batch size to hopefully avoid getting rate limited
                         num_workers=2,
                         batch_size=500
                        )

Downloading config file fiftyone.yml from MatthiasCr/multimodal-shapes-subset
Loading dataset


ServerSelectionTimeoutError: localhost:27017: [Errno 111] Connection refused (configured timeouts: socketTimeoutMS: 20000.0ms, connectTimeoutMS: 20000.0ms), Timeout: 30s, Topology Description: <TopologyDescription id: 6952ca619ed9dc27d8adb1b4, topology_type: Unknown, servers: [<ServerDescription ('localhost', 27017) server_type: Unknown, rtt: None, error=AutoReconnect('localhost:27017: [Errno 111] Connection refused (configured timeouts: socketTimeoutMS: 20000.0ms, connectTimeoutMS: 20000.0ms)')>]>

FiftyOne cannot visualize lidar data stored in raw npy files. To visualize the lidar data in FiftyOne, we first convert the lidar depth to xyz coordinates using the azimuth and zenith data. Then we use Open3D to convert the coordinates to point clouds and store them as pcd files. We add these pcd files as a third slice to the grouped dataset, so that every group has three samples: rgb, lidar (npy), and pcd.

In [None]:
# load azimuth and zenith data from dataset-level metadata fields
azimuth_cubes = np.array(dataset.info["azimuth_cubes"])
zenith_cubes  = np.array(dataset.info["zenith_cubes"])
azimuth_spheres = np.array(dataset.info["azimuth_spheres"])
zenith_spheres = np.array(dataset.info["zenith_spheres"])

In [None]:
# convert lidar depth to xyza given azimuth and zenith data
# (x, y, z are coordinates; a is a mask that is 1 where depth < 50.0 and 0 otherwise)
def get_xyza(lidar_depth, azimuth, zenith):
    x = lidar_depth * np.sin(-azimuth[:, None]) * np.cos(-zenith[None, :])
    y = lidar_depth * np.cos(-azimuth[:, None]) * np.cos(-zenith[None, :])
    z = lidar_depth * np.sin(-zenith[None, :])
    a = np.where(lidar_depth < 50.0, np.ones_like(lidar_depth), np.zeros_like(lidar_depth))
    xyza = np.stack((x, y, z, a))
    return xyza
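
As a quick sanity check of this conversion, the sketch below applies the same formulas to a tiny synthetic depth map. The array sizes, angle ranges, and depth values are made up for illustration and are not taken from the dataset. Since the formulas are a spherical-to-Cartesian conversion, the Euclidean norm of each (x, y, z) point should equal its depth.

```python
import numpy as np


def get_xyza(lidar_depth, azimuth, zenith):
    # same formulas as in the cell above
    x = lidar_depth * np.sin(-azimuth[:, None]) * np.cos(-zenith[None, :])
    y = lidar_depth * np.cos(-azimuth[:, None]) * np.cos(-zenith[None, :])
    z = lidar_depth * np.sin(-zenith[None, :])
    a = np.where(lidar_depth < 50.0, 1.0, 0.0)
    return np.stack((x, y, z, a))


# synthetic inputs (assumed shapes): 4 azimuth angles x 3 zenith angles
azimuth = np.linspace(-0.5, 0.5, 4)
zenith = np.linspace(-0.3, 0.3, 3)
depth = np.full((4, 3), 10.0)
depth[0, 0] = 50.0  # not below the 50.0 threshold, so masked out

xyza = get_xyza(depth, azimuth, zenith)
print(xyza.shape)  # (4, 4, 3): channels x azimuth x zenith

# the norm of every (x, y, z) point recovers the depth
norm = np.sqrt((xyza[:3] ** 2).sum(axis=0))
print(np.allclose(norm, depth))  # True

# the mask channel flags 11 of the 12 points as valid
print(int(xyza[3].sum()))  # 11
```

The broadcasting relies on the depth map having azimuth along its first axis and zenith along its second, which is what the indexing `azimuth[:, None]` and `zenith[None, :]` implies.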

Now for each lidar sample we create a point cloud pcd file and add it as a third sample to the group:

In [None]:
# local directory for .pcd files
os.makedirs("./data", exist_ok=True)

lidar_view = dataset.select_group_slices("lidar")

# iterate over all lidar samples
for idx, sample in enumerate(lidar_view):
    lidar_path = sample.filepath
    lidar_depth = np.load(lidar_path)

    if sample.label.label == "cube":
        az = azimuth_cubes
        ze = zenith_cubes
    else:
        az = azimuth_spheres
        ze = zenith_spheres

    xyza_data = get_xyza(lidar_depth, az, ze)
    xyz = xyza_data[:3].reshape(3, -1).T
    a = xyza_data[3].reshape(-1).astype(bool)
    # apply mask
    xyz_valid = xyz[a]

    # create point cloud
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(xyz_valid)
    pcd_filepath = f"./data/{idx}.pcd"
    o3d.io.write_point_cloud(pcd_filepath, pcd)

    # add pcd as third sample to the group
    pcd_sample = fo.Sample(
        filepath=pcd_filepath,
        label=sample.label,
    )

    # Attach to same group under a new slice
    pcd_sample['group'] = sample.group.element("pcd")

    dataset.add_sample(pcd_sample)
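
The reshaping and masking step inside the loop can be seen in isolation on a small array. This is a minimal sketch with made-up values; `xyza_data` here is a stand-in array of shape (4, H, W), not real lidar data.

```python
import numpy as np

# stand-in for the (4, H, W) output of get_xyza, with H=2 and W=3
H, W = 2, 3
xyza_data = np.arange(4 * H * W, dtype=float).reshape(4, H, W)

# mask channel: mark only the first and the last grid cell as valid
xyza_data[3] = 0.0
xyza_data[3, 0, 0] = 1.0
xyza_data[3, 1, 2] = 1.0

# flatten the grid: (3, H, W) -> (3, H*W) -> (H*W, 3), one row per point
xyz = xyza_data[:3].reshape(3, -1).T
# flatten the mask the same way so row i of xyz matches entry i of the mask
a = xyza_data[3].reshape(-1).astype(bool)

# boolean indexing keeps only the rows flagged as valid
xyz_valid = xyz[a]
print(xyz.shape)           # (6, 3)
print(xyz_valid.shape)     # (2, 3)
print(xyz_valid.tolist())  # [[0.0, 6.0, 12.0], [5.0, 11.0, 17.0]]
```

The resulting (N, 3) layout is the form that `o3d.utility.Vector3dVector` receives in the loop above.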

Launch the app to visualize the point clouds and explore the data.

In [None]:
session = fo.launch_app(dataset=dataset, auto=False)

In [None]:
print(session.url)

![](../results/fo-rgb-slices.png)
![](../results/fo-cube-rgb-and-pcd.png)

## Dataset Statistics and Observations

We first inspect the data types and sizes of the images and lidar data.

In [None]:
rgb_view = dataset.select_group_slices("rgb")
lidar_view = dataset.select_group_slices("lidar")


image_size = Image.open(rgb_view.first().filepath).size
# load the first lidar file once and reuse it for dtype and shape
lidar_sample = np.load(lidar_view.first().filepath)
lidar_dtype = lidar_sample.dtype
lidar_size = lidar_sample.shape

print(f"Image size: {image_size}")
print(f"Lidar dtype: {lidar_dtype}")
print(f"Lidar size: {lidar_size}")

Now we calculate the number of samples per class, the train/val split size, and the class distribution in the train split.

In [None]:
sphere_view = dataset.match(F("label.label") == "sphere")
cube_view = dataset.match(F("label.label") == "cube")

train_view = rgb_view.match_tags("train")
train_percentage = int((train_view.count() / rgb_view.count()) * 100)
sphere_train_view = train_view.match(F("label.label") == "sphere")
cube_train_view = train_view.match(F("label.label") == "cube")

print("Class distribution:")
print(f"Overall number of samples: {dataset.count()}")
print(f"spheres: {sphere_view.count()}")
print(f"cubes: {cube_view.count()}")
print()
print("Train/Val Split:")
print(f"Overall number of train samples: {train_view.count()} ({train_percentage}%)")
print(f"train spheres: {sphere_train_view.count()}")
print(f"train cubes: {cube_train_view.count()}")

As described at the beginning, the dataset consists of 1000 cubes and 1000 spheres. 80% of the data is used for training and 20% for validation. The class distribution within the splits is also exactly balanced, with 800 train and 200 val samples for each class.

We can also plot histograms using fiftyone:

![](../results/fo-hist-classdist.png)
![](../results/fo-hist-trainval.png)