# Dataset Statistics

This notebook shows how to compute basic statistics, such as the class distribution, for a labelled point cloud dataset.

In this code we assume the point clouds have been labelled following the process in our [Urban PointCloud Processing](https://github.com/Amsterdam-AI-Team/Urban_PointCloud_Processing/tree/main/datasets) project. For more information on the specifics of the datasets used, see [the description there](https://github.com/Amsterdam-AI-Team/Urban_PointCloud_Processing/blob/main/datasets/README.md).

In [1]:
# Add project src to path.
import set_path

import numpy as np
import pandas as pd
import laspy
import pathlib
from tqdm import tqdm

from upcp.utils import las_utils

In [2]:
# We provide some example data for demonstration purposes.
dataset_folder = '../datasets/pointcloud/'
prefix = 'processed_'
files = list(pathlib.Path(dataset_folder).glob(f'{prefix}*.laz'))

CLS_LABELS = {0: 'Unlabelled',
              1: 'Ground',
              2: 'Building',
              3: 'Tree',
              4: 'Street light',
              5: 'Traffic sign',
              6: 'Traffic light',
              7: 'Car',
              8: 'City bench',
              9: 'Rubbish bin',
              10: 'Road',
              99: 'Noise'}

---
## Statistics per point cloud tile

For each labelled point cloud tile we collect the total number of points, the number of classes present (not counting 'Unlabelled' and 'Noise'), and the number of points per class.

In [3]:
# Create dataframe.
columns = ['tilecode','n_points','n_classes']
columns.extend(CLS_LABELS.values())
data_df = pd.DataFrame(columns=columns).set_index('tilecode')

for f in tqdm(files):
    # Load point cloud.
    pc = laspy.read(f.as_posix())
    tilecode = las_utils.get_tilecode_from_filename(f.as_posix())
    
    # Count points per class.
    classes, counts = np.unique(pc.label, return_counts=True)
    data = {CLS_LABELS[c]: cnt for c, cnt in zip(classes, counts)}
    
    # Total point count.
    data['n_points'] = np.sum(counts)
    
    # Number of classes present in the point cloud (excluding 'noise' and 'unlabelled').
    real_classes = [cnt for c, cnt in zip(classes, counts) if c not in (0, 99)]
    data['n_classes'] = np.count_nonzero(real_classes)
    
    data_df.loc[tilecode] = data

data_df = data_df.fillna(0).astype('int64').sort_index()

100%|█████████████████████████████████████████████| 2/2 [00:02<00:00,  1.16s/it]


In [4]:
data_df

tilecode,n_points,n_classes,Unlabelled,Ground,Building,Tree,Street light,Traffic sign,Traffic light,Car,City bench,Rubbish bin,Road,Noise
2386_9702,7165726,9,536296,1496010,1419319,1034504,24379,8090,0,262600,25471,6348,2329219,23490
2397_9705,10854907,9,1536554,1848937,2544002,1977360,82342,3271,0,377125,41388,13364,2374463,56101
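
The counting step can be tried without any point cloud files. The following is a minimal sketch on a synthetic label array that stands in for `pc.label`; the small `labels` mapping and the `point_labels` values are made up for illustration and are not part of the dataset.

```python
import numpy as np

# Hypothetical subset of CLS_LABELS, used only for this example.
labels = {0: 'Unlabelled', 1: 'Ground', 2: 'Building', 99: 'Noise'}

# Synthetic per-point labels standing in for `pc.label`.
point_labels = np.array([1, 1, 2, 0, 99, 1, 2, 2, 2])

# Unique class ids and how often each one occurs.
classes, counts = np.unique(point_labels, return_counts=True)

# Map class ids to names; unknown ids get a placeholder name instead of raising a KeyError.
per_class = {labels.get(int(c), f'Unknown ({c})'): int(n)
             for c, n in zip(classes, counts)}

# Total number of points in the tile.
n_points = int(counts.sum())

# Number of classes present, ignoring 'Unlabelled' (0) and 'Noise' (99).
n_classes = int(np.sum(~np.isin(classes, [0, 99])))

print(per_class, n_points, n_classes)
```

Using `labels.get` with a fallback is a design choice of this sketch: the notebook's own cell indexes `CLS_LABELS[c]` directly, which is stricter and fails on a label id that is missing from the mapping.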


---
## Summary statistics for the entire dataset

We sum over all point cloud tiles and compute summary statistics:
* the total number of points for each class across the dataset;
* the percentage of points (of the total) belonging to each class;
* the number of tiles that contain points of each class.

In [7]:
# Create dataframe.
columns = ['Total']
columns.extend(CLS_LABELS.values())
stats_df = pd.DataFrame(columns=columns)

# Get total point counts.
counts = data_df.sum()
counts['Total'] = counts['n_points']
stats_df.loc['n_points'] = counts

# Compute percentage.
stats_df.loc['percentage'] = (100 * stats_df.loc['n_points'] / stats_df.loc['n_points', 'Total']).astype(float).round(2)

# Compute number of tiles where class is present.
occurs = [len(data_df)]
occurs.extend(np.count_nonzero(data_df[list(CLS_LABELS.values())], axis=0))
stats_df.loc['n_tiles'] = occurs

In [8]:
stats_df

,Total,Unlabelled,Ground,Building,Tree,Street light,Traffic sign,Traffic light,Car,City bench,Rubbish bin,Road,Noise
n_points,18020633.0,2072850.0,3344947.0,3963321.0,3011864.0,106721.0,11361.0,0.0,639725.0,66859.0,19712.0,4703682.0,79591.0
percentage,100.0,11.5,18.56,21.99,16.71,0.59,0.06,0.0,3.55,0.37,0.11,26.1,0.44
n_tiles,2.0,2.0,2.0,2.0,2.0,2.0,2.0,0.0,2.0,2.0,2.0,2.0,2.0
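
The same three summary statistics can also be written with column-wise pandas operations. This is a minimal sketch on a small hand-made table; `toy_df` and its numbers are hypothetical, and the `Total` column of the cell above is left out for brevity.

```python
import pandas as pd

# Hypothetical per-tile point counts (made-up numbers, not from the dataset).
toy_df = pd.DataFrame(
    {'Ground': [6, 4], 'Tree': [3, 0], 'Car': [1, 6]},
    index=pd.Index(['tile_a', 'tile_b'], name='tilecode'),
)

# Total number of points per class across all tiles.
n_points = toy_df.sum()

# Share of each class in the overall point count, in percent.
percentage = (100 * n_points / n_points.sum()).round(2)

# Number of tiles in which each class occurs at least once.
n_tiles = (toy_df > 0).sum()

# One row per statistic, one column per class.
summary = pd.DataFrame(
    {'n_points': n_points, 'percentage': percentage, 'n_tiles': n_tiles}
).T

print(summary)
```

Because the percentage row holds floats, the combined `summary` table stores every row as floats, which is also why the output above shows values such as `2.0` in the `n_tiles` row.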
