# Dataset statistics

This notebook shows how to compute statistics, such as the class distribution, for a labelled point cloud dataset.

The code assumes that the point clouds have been labelled following the process in our [Urban PointCloud Processing](https://github.com/Amsterdam-AI-Team/Urban_PointCloud_Processing/tree/main/datasets) project. For more information on the specifics of the datasets used, see [the description there](https://github.com/Amsterdam-AI-Team/Urban_PointCloud_Processing/blob/main/datasets/README.md).

In [None]:
import numpy as np
import pandas as pd
import laspy
import pathlib
from tqdm import tqdm

import set_path  # add project src to path
from upcp.utils import las_utils

import config as cf  # use config or config_azure

In [None]:
# Retrieve paths to point cloud demo data
files = list(pathlib.Path(cf.dataset_folder).glob(f'{cf.prefix}*.laz'))

# Set class labels
CLS_LABELS = {
    1: 'Road',
    9: 'Other ground',
    10: 'Building',
    30: 'Tree',
    40: 'Car',
    60: 'Streetlight',
    61: 'Traffic light',
    62: 'Traffic sign',
    80: 'City bench',
    81: 'Rubbish bin',
}

---
## Statistics per point cloud tile

We collect the total number of points, the number of classes, and the number of points per class for each labelled point cloud tile.
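
As a minimal sketch of the counting step, assume a small made-up label array in place of a real tile. The names `toy_labels` and `toy_names` are hypothetical and used only for illustration. `np.unique` with `return_counts=True` returns the labels that occur and the number of points carrying each one:

```python
import numpy as np

# Hypothetical per-point labels for a tiny "tile" (illustration only).
toy_labels = np.array([1, 1, 1, 10, 10, 30, 0, 99])

# Hypothetical subset of the class names used in this notebook.
toy_names = {1: 'Road', 10: 'Building', 30: 'Tree'}

# Labels that occur in the tile and the number of points per label.
classes, counts = np.unique(toy_labels, return_counts=True)

# Points per named class; labels without a name (here 0 and 99) are skipped.
per_class = {toy_names[c]: int(n) for c, n in zip(classes, counts) if c in toy_names}

n_points = int(counts.sum())  # total number of points, all labels included
n_classes = len(per_class)    # number of named classes present

print(per_class, n_points, n_classes)
```

The total point count includes every label, while the class count only covers labels that have a name.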

In [None]:
# Create dataframe
columns = ['tilecode','n_points','n_classes']
columns.extend(CLS_LABELS.values())
data_df = pd.DataFrame(columns=columns).set_index('tilecode')

for f in tqdm(files):
    # Load point cloud
    pc = laspy.read(f.as_posix())
    # Load the per-point labels stored in the .npz file that matches this tile
    tilecode = las_utils.get_tilecode_from_filename(f.as_posix())
    npz_file = np.load(cf.pred_folder + cf.prefix_pred + tilecode + '.npz')
    labels = npz_file['label']

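    # Count points per class; labels without an entry in CLS_LABELS
    # (e.g. 0 and 99) are skipped to avoid a KeyError
    classes, counts = np.unique(labels, return_counts=True)
    data = {CLS_LABELS[c]: cnt for c, cnt in zip(classes, counts) if c in CLS_LABELS}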
    
    # Get total point count
    data['n_points'] = np.sum(counts)
    
    # Get number of classes present in the point cloud (excluding 'noise' and 'unlabelled')
    data['n_classes'] = sum(1 for c in classes if c not in (0, 99))
    
    data_df.loc[tilecode] = data

data_df = data_df.fillna(0).astype('int64').sort_index()

In [None]:
data_df

---
## Summary statistics for the entire dataset

We sum over all point cloud tiles and compute summary statistics:
* the total number of points for each class across the dataset;
* the percentage of points (of the total) belonging to each class;
* the number of tiles that contain points of each class.
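
The three statistics above can be sketched on a toy table, assuming a hypothetical `toy_df` with one row per tile and one column per class (the tile names and counts are made up). Unlike the notebook, which divides by the total `n_points`, this sketch takes percentages over the listed classes only:

```python
import pandas as pd

# Hypothetical per-tile point counts (rows are tiles, columns are classes).
toy_df = pd.DataFrame(
    {'Road': [300, 0, 100], 'Tree': [50, 50, 0]},
    index=['tile_a', 'tile_b', 'tile_c'],
)

# Total number of points per class across all tiles.
totals = toy_df.sum()

# Share of all points belonging to each class, in percent.
percentage = (100 * totals / totals.sum()).round(2)

# Number of tiles that contain at least one point of each class.
n_tiles = (toy_df > 0).sum()

# One row per statistic, one column per class.
summary = pd.DataFrame(
    {'n_points': totals, 'percentage': percentage, 'n_tiles': n_tiles}
).T
print(summary)
```

Comparing against zero with `toy_df > 0` is one way to count tiles per class; `np.count_nonzero` along `axis=0` gives the same result on integer counts.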

In [None]:
# Create dataframe
columns = ['Total']
columns.extend(CLS_LABELS.values())
stats_df = pd.DataFrame(columns=columns)

# Get total point counts
counts = data_df.sum()
counts['Total'] = counts['n_points']
stats_df.loc['n_points'] = counts

# Compute percentage
stats_df.loc['percentage'] = (100 * stats_df.loc['n_points'] / stats_df.loc['n_points', 'Total']).astype(float).round(2)

# Compute number of tiles where class is present
occurs = [len(data_df)]
occurs.extend(np.count_nonzero(data_df[list(CLS_LABELS.values())], axis=0))
stats_df.loc['n_tiles'] = occurs

In [None]:
stats_df