# Outlier removal with SOR and ScOR using `py4dgeo`

This notebook demonstrates how to perform automated outlier removal in terrestrial and permanent laser scanning (TLS/PLS) point clouds using two complementary approaches implemented in `py4dgeo`:

1. **Statistical Outlier Removal (SOR)** – a widely used, distance-based filter that detects points whose local neighbor distances deviate strongly from the global distribution ([Rusu et al., 2008](#References)).
2. **Scan Outlier Ratio (ScOR)** – a LiDAR scanning- and survey-aware descriptor that compares *expected* and *observed* point spacing in the scanner's angular domain and is specifically designed for TLS/PLS data.

Both methods are applied to a small multi-temporal example data set consisting of three epochs acquired from the same scan position:

- `t1` – reference epoch (to be filtered)
- `t2` – alternative epoch used as a multi-temporal neighborhood
- `t3` – further epoch for temporally aggregated neighborhoods

In line with the associated ScOR paper ([Tabernig and Höfle, 2026](#References)), the focus is on:

- removing **detached and transient points** (e.g. insects, rain, ghost points, temporary objects),
- **preserving coherent surfaces** that are suitable for 3D surface change analysis,
- and demonstrating how **multi-temporal neighborhoods** can be used to detect and remove large transient objects.

We first run SOR on the reference epoch, then compute ScOR for:
- a **single epoch** (standard ScOR),
- a **single other epoch** (bi-temporal ScOR),
- and **temporally aggregated epochs** (multi-temporal ScOR).

The results are stored in LAS files (via `py4dgeo.Vapc`) so they can later be explored and thresholded interactively, e.g. in CloudCompare.


In [None]:
import py4dgeo
import numpy as np

## Statistical Outlier Removal (SOR) on the first epoch

With the required packages imported above, we start by defining the input and output files:

- `t1_file` – the first epoch, which we want to filter.
- `t2_file`, `t3_file` – additional epochs used later as **multi-temporal neighborhood candidates**.
- `SOR_out` – output LAS file where the SOR results for epoch 1 are stored.

We also specify the additional dimensions `return_number` and `number_of_returns`, which will later allow us to restrict neighborhoods to **last returns** only (as recommended in the ScOR paper).

Next, we configure and run the SOR filter:

- `k` – number of nearest neighbors used to compute local mean distances,
- `std_dev_multiplier` – global threshold on mean distances, expressed in units of standard deviation of the global distribution,
- `remove_points=False` – **we do not remove points yet**, but instead keep an inlier/outlier flag for inspection and further processing.

SOR is applied to `t1` and returns:

- the (possibly updated) epoch `search_points_epoch`,
- an `inlier_outlier` flag per point,
- the mean neighbor distance for each point (`mean_distances`).

We then wrap the result in a `py4dgeo.Vapc` object, store the per-point mean distances and the inlier/outlier flag as attributes, and write them to `SOR_out`. This creates a SOR-annotated point cloud that we will use as input for ScOR.
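
To make the roles of `k` and `std_dev_multiplier` concrete, here is a minimal NumPy sketch of the common SOR formulation (mean k-nearest-neighbor distance compared against global mean plus a multiple of the standard deviation). It is an illustration only: the helper `sor_flags` and the synthetic cloud are hypothetical, the neighbor search is brute force, and the implementation in `py4dgeo.SOR` may differ in its details.

```python
import numpy as np

def sor_flags(points, k=8, std_dev_multiplier=1.0):
    """Toy SOR: flag points whose mean k-NN distance exceeds mean + multiplier * std."""
    # Brute-force pairwise distances; fine for a demo, far too slow for real scans.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    # Sort each row and skip column 0 (the distance of a point to itself).
    knn = np.sort(dist, axis=1)[:, 1:k + 1]
    mean_distances = knn.mean(axis=1)
    # Global threshold derived from the distribution of all mean distances.
    threshold = mean_distances.mean() + std_dev_multiplier * mean_distances.std()
    is_inlier = mean_distances <= threshold
    return is_inlier, mean_distances

rng = np.random.default_rng(0)
surface = rng.uniform(0.0, 1.0, size=(200, 3)) * [1.0, 1.0, 0.0]  # flat patch at z = 0
stray = np.array([[0.5, 0.5, 5.0]])  # one detached point 5 m above the patch
cloud = np.vstack([surface, stray])

is_inlier, mean_distances = sor_flags(cloud, k=8, std_dev_multiplier=1.0)
print("stray point flagged as inlier:", bool(is_inlier[-1]))
print("mean neighbor distance of stray point:", float(mean_distances[-1]))
```

A smaller `std_dev_multiplier` lowers the threshold and flags more points, which is why it is worth inspecting the stored mean distances before actually removing anything.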


In [None]:
t1_file = r"E:\test_data\240102_070000 - SINGLESCANS - 240102_070700.laz" # first epoch, the one to be filtered
t2_file = r"E:\test_data\240102_080000 - SINGLESCANS - 240102_080650.laz" # second epoch, neighborhood candidates (for single other epoch multi-temporal)
t3_file = r"E:\test_data\240102_090000 - SINGLESCANS - 240102_090653.laz" # third epoch, neighborhood candidates (for aggregated-epochs multi-temporal)

SOR_out = t1_file.replace(".laz","_SOR_py4dgeo.laz")
dims = {"return_number": "return_number",
        "number_of_returns": "number_of_returns"}

k = 8
std_dev_multiplier = 1.0
# Do not remove the points directly; store the inlier/outlier flag instead.
remove_points = False

search_points_epoch = py4dgeo.read_from_las(t1_file, additional_dimensions=dims)
SOR = py4dgeo.SOR(
    epoch = search_points_epoch,
    k = k,
    std_dev_multiplier=std_dev_multiplier,
    remove_points=remove_points,
)

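# Run SOR: returns the epoch, a per-point inlier/outlier flag and the mean neighbor distances
search_points_epoch, inlier_outlier, mean_distances = SOR.run()

# Store the SOR results as per-point attributes and write them to a LAS file
vp_ob = py4dgeo.Vapc(search_points_epoch, voxel_size=0.01)
vp_ob.out["mean_distance_%s" % k] = mean_distances
if not remove_points:
    vp_ob.out["inlier_outlier_%s" % k] = inlier_outlier
vp_ob.save_as_las(SOR_out)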

## Restricting neighborhoods to last returns

ScOR, as defined in the paper, is primarily evaluated on **last and single returns**. Intermediate returns within vegetation or complex structures often represent semi-transparent or transient objects (e.g. canopy layers, moving leaves) that violate local surface assumptions.

To enforce this, we create a small helper function `mask_last_returns`:

- It takes an `Epoch` and the names of the fields `return_number` and `number_of_returns`.
- It builds a boolean mask where `return_number == number_of_returns`, i.e. **last returns only**.
- It returns a new `Epoch` containing only those last-return points and their additional dimensions.

This function is used later to build consistent, surface-focused neighborhoods for ScOR.
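
The masking logic itself can be tried out without any point cloud file. The sketch below uses made-up return numbers and assumes the additional dimensions behave like a NumPy structured array, which is what indexing with a boolean mask relies on; the arrays `attrs` and `xyz` are purely illustrative.

```python
import numpy as np

# Hypothetical per-point attributes stored as a NumPy structured array.
attrs = np.zeros(6, dtype=[("return_number", "u1"), ("number_of_returns", "u1")])
attrs["return_number"] = [1, 1, 2, 1, 2, 3]
attrs["number_of_returns"] = [1, 2, 2, 3, 3, 3]
xyz = np.arange(18, dtype=float).reshape(6, 3)  # dummy coordinates, one row per point

# A point is a last (or single) return when both fields are equal.
mask = attrs["return_number"] == attrs["number_of_returns"]

# The same mask is applied to coordinates and attributes so they stay aligned.
last_xyz = xyz[mask]
last_attrs = attrs[mask]

print(mask.tolist())   # [True, False, True, False, False, True]
print(last_xyz.shape)  # (3, 3)
```

Note that single returns (`1 of 1`) pass the mask as well, which matches the "last and single returns" wording above.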

In [None]:
def mask_last_returns(
    neighborhood_candidates1: py4dgeo.Epoch,
    return_number_field_name: str,
    number_of_returns_field_name: str,
) -> py4dgeo.Epoch:
    """Return a new Epoch containing only the last returns from the input Epoch."""
    ad = neighborhood_candidates1.additional_dimensions
    rn = ad[return_number_field_name].astype(np.int32).ravel()
    nr = ad[number_of_returns_field_name].astype(np.int32).ravel()
    # A point is a last (or single) return when its return number equals the return count
    mask = rn == nr
    # Apply the same mask to coordinates and attributes so they stay aligned
    last_return_epoch = py4dgeo.Epoch(
        cloud=neighborhood_candidates1.cloud[mask],
        additional_dimensions=ad[mask],
    )
    return last_return_epoch

## Single-epoch ScOR

In the next step, we compute **ScOR** for the first epoch using only its own points as neighborhood candidates:

1. We define additional dimensions to be loaded:
   - `return_number` and `number_of_returns` (for last-return masking),
   - `inlier_outlier_k` (here with `k = 8`) to keep track of the SOR result.

2. We specify:
   - `scan_position` – scanner location in 3D (here `[0, 0, 0]`),
   - `scan_resolution` – nominal angular step in radians (here an adjusted value),
   - `increment` – step size in the scan grid used to define neighbors (here `0.5`).

   To reduce aliasing effects (as described in the paper), we slightly **adjust** the scan resolution by multiplying it by 2 and adding a small offset. This reduces grid artefacts when discretizing the scan angles for neighborhood construction.

3. We read `search_points_t1` from `t1_file` (the SOR output) and use `mask_last_returns` to create `neighborhood_candidates_t1` containing **only last returns**.

4. We instantiate `py4dgeo.ScOR` with:
   - the full point cloud as `search_point_epoch`,
   - last returns from the same epoch as `neighborhood_candidate_epochs`,
   - scan position and scan resolution,
   - increment to define the neighborhood in the angular grid.

When we call `ScOR.run()`, we obtain:

- `scor_value_standard` – the ScOR value for each point in `search_points_t1`,  
  which gives a value between 0.0 (detached/outlier-like) and 1.0 (fully consistent surface).
- `expected_distance_standard` – the expected neighbor distance in 3D, derived from scan geometry.
- `observed_distance_standard` – the actually measured 3D neighbor distance in object space.

These quantities directly reflect the **local surface consistency** in the scan geometry, which is the main idea behind ScOR.
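
As a rough intuition for "expected" versus "observed" spacing, the sketch below converts a point to range and angles relative to the scanner and computes the chord length between two adjacent beams at that range. This is plain scan geometry, not the ScOR implementation: the function names are hypothetical, and `toy_spacing_ratio` is only an assumed stand-in for a score between 0 and 1; the actual definition is given in the ScOR paper and in `py4dgeo`.

```python
import math

def to_scanner_angles(point, scan_position=(0.0, 0.0, 0.0)):
    """Convert a 3D point to (range, azimuth, elevation) relative to the scanner."""
    dx, dy, dz = (p - s for p, s in zip(point, scan_position))
    range_m = math.sqrt(dx * dx + dy * dy + dz * dz)
    azimuth = math.atan2(dy, dx)
    elevation = math.asin(dz / range_m)
    return range_m, azimuth, elevation

def expected_spacing(range_m, angular_step):
    """Chord length between two adjacent beams separated by angular_step (radians)."""
    return 2.0 * range_m * math.sin(angular_step / 2.0)

def toy_spacing_ratio(expected, observed):
    """Hypothetical 0..1 consistency score (NOT the published ScOR formula)."""
    return min(expected / observed, 1.0)

angular_step = 0.015 * 2 + 0.0028  # adjusted scan resolution used in this notebook
range_m, azimuth, elevation = to_scanner_angles((10.0, 0.0, 0.0))
expected = expected_spacing(range_m, angular_step)

print(round(expected, 3))                          # 0.328
print(toy_spacing_ratio(expected, observed=0.33))  # close to 1: consistent surface
print(toy_spacing_ratio(expected, observed=3.0))   # close to 0: detached point
```

Because the expected spacing grows linearly with range, a ratio of this kind stays comparable between near and far surfaces, whereas a fixed distance threshold does not.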


In [None]:
t1_file = SOR_out

dims_search_points = {"return_number": "return_number",
        "number_of_returns": "number_of_returns",
        "inlier_outlier_%s"%k:"inlier_outlier_%s"%k} # we keep the SOR result as well
                                                      
return_number_field = "return_number"
number_of_returns_field = "number_of_returns"
dims_neighborhood_candidates = {return_number_field: return_number_field,
        number_of_returns_field: number_of_returns_field}


scan_position = [0,0,0]
scan_resolution = 0.015 # regular scan resolution
# To avoid aliasing effects, we adjust the scan resolution slightly (2x + small offset)
scan_resolution = scan_resolution*2 + 0.0028 # adjusted to prevent aliasing effects
increment = 0.5

# Read the point cloud
search_points_t1 = py4dgeo.read_from_las(t1_file, additional_dimensions=dims_search_points)
neighborhood_candidates_t1 = mask_last_returns(search_points_t1, "return_number", "number_of_returns")

ScOR = py4dgeo.ScOR(
    search_point_epoch=search_points_t1,
    neighborhood_candidate_epochs=neighborhood_candidates_t1,
    scan_position=scan_position,
    scan_resolution=scan_resolution,
    increment=increment
)

scor_value_standard, expected_distance_standard, observed_distance_standard = ScOR.run()

## Multi-temporal ScOR: single other epoch as neighborhood

So far, neighborhoods were defined **within the same epoch**. For permanent laser scanning (PLS) setups, we can also exploit **multi-temporal neighborhoods**, i.e. build neighborhoods from different epochs acquired from the same scan position.

In this section we:

- Fix epoch 1 (`search_points_t1`) as the **reference epoch**, and
- Use epoch 2 (`t2_file`) as the **only neighborhood candidate**.

Steps:

1. Reuse the `scan_position`, `scan_resolution` and `increment` defined for the single-epoch run.
2. Read `neighborhood_candidates_t2` from `t2_file` and again restrict to **last returns** using `mask_last_returns`.
3. Instantiate a new `py4dgeo.ScOR` object with:
   - `search_point_epoch = search_points_t1` (epoch 1),
   - `neighborhood_candidate_epochs = neighborhood_candidates_t2` (epoch 2).

`ScOR.run()` then computes:

- `scor_value_to_t2` – ScOR values for points of epoch 1, but **using neighborhoods from epoch 2**.
- `expected_distance_to_t2` and `observed_distance_to_t2` – analogous to the single-epoch case, but now based on cross-epoch neighborhoods.

Conceptually, this configuration highlights **temporal inconsistency**:  
points that represent transient objects (e.g. a person present only in one epoch) will obtain **low ScOR values**, because their local neighborhood in the other epoch is dominated by different structures (typically the background surface).
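
The effect can be illustrated with a toy scene. The sketch below is a simplification under stated assumptions: it uses plain 3D nearest-neighbor distances instead of ScOR's angular-grid neighborhoods, and the wall, the transient point and the helper `nearest_distance` are all made up for illustration.

```python
import numpy as np

def nearest_distance(search_xyz, candidate_xyz):
    """Distance from each search point to its nearest candidate point (brute force)."""
    diff = search_xyz[:, None, :] - candidate_xyz[None, :, :]
    return np.linalg.norm(diff, axis=2).min(axis=1)

# Background wall at x = 10 m, sampled on a regular 0.1 m grid (present in both epochs).
yy, zz = np.meshgrid(np.arange(0.0, 1.0, 0.1), np.arange(0.0, 1.0, 0.1))
wall = np.column_stack([np.full(yy.size, 10.0), yy.ravel(), zz.ravel()])

# Epoch 1 additionally contains a transient object 4 m in front of the wall.
transient = np.array([[6.0, 0.5, 0.5]])
epoch1 = np.vstack([wall, transient])
epoch2 = wall.copy()

# Search points come from epoch 1, neighborhood candidates from epoch 2.
observed = nearest_distance(epoch1, epoch2)
print("max distance for wall points:", float(observed[:-1].max()))
print("distance for transient point:", float(observed[-1]))
```

The wall points find matching neighbors in the other epoch, while the transient point only finds the background far behind it, so its observed spacing is much larger than expected.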


In [None]:
neighborhood_candidates_t2 = py4dgeo.read_from_las(t2_file,additional_dimensions=dims_neighborhood_candidates)
neighborhood_candidates_t2 = mask_last_returns(neighborhood_candidates_t2, "return_number", "number_of_returns")

scor = py4dgeo.ScOR(
    search_point_epoch=search_points_t1,
    neighborhood_candidate_epochs=neighborhood_candidates_t2,
    scan_position=scan_position,
    scan_resolution=scan_resolution,
    increment=increment
)
scor_value_to_t2, expected_distance_to_t2, observed_distance_to_t2 = scor.run()
print("Finished single-other-epoch neighborhood ScOR run.")

## Multi-temporal ScOR: temporally aggregated neighborhoods

In many PLS applications, we are interested not only in pairwise comparisons, but also in **temporally aggregated neighborhoods** ([Tabernig et al., 2025](#References)). Instead of a single other epoch, we aggregate several epochs into one neighborhood candidate set.

Here we:

1. Read `neighborhood_candidates_t3` from `t3_file`.
2. Ensure that both `neighborhood_candidates_t2` and `neighborhood_candidates_t3` are again restricted to **last returns**.
3. Build an aggregated, multi-epoch neighborhood consisting of:
   - `neighborhood_candidates_t1` (epoch 1),
   - `neighborhood_candidates_t2` (epoch 2),
   - `neighborhood_candidates_t3` (epoch 3).

These are passed as a **tuple of epochs** to `py4dgeo.ScOR` as `neighborhood_candidate_epochs`.

When calling `ScOR.run()` in this configuration, we obtain:

- `scor_value_aggregated` – ScOR values for epoch 1 points, but now using a **multi-epoch aggregated neighborhood**.
- `expected_distance_aggregated` and `observed_distance_aggregated` – expected and observed neighbor distances in this aggregated neighborhood.

This setup corresponds to the **Level of Aggregation (LoA)** described in the paper:

- It tends to **stabilize neighborhoods** by pooling points from several epochs.
- Large transient objects (e.g. a person present in only one epoch) become very inconsistent relative to the aggregated background and thus obtain very low ScOR values.
- Persistent surfaces (rock outcrop, ground, walls, etc.) keep similar neighborhoods across epochs and remain with high ScOR values.

Aggregated neighborhoods therefore help to **robustly detect dynamic objects** in PLS time series.
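
Conceptually, aggregation means that candidates from several epochs end up in one common set. How `py4dgeo.ScOR` combines the tuple of epochs internally is not shown in this notebook, so the following is only a sketch of the idea on raw coordinate arrays; `pool_candidates` and the random clouds are hypothetical.

```python
import numpy as np

def pool_candidates(*clouds):
    """Stack several (n_i, 3) coordinate arrays and remember each point's source epoch."""
    xyz = np.vstack(clouds)
    epoch_index = np.concatenate(
        [np.full(len(cloud), i) for i, cloud in enumerate(clouds)]
    )
    return xyz, epoch_index

rng = np.random.default_rng(42)
t1_xyz = rng.normal(size=(4, 3))  # stand-ins for the last returns of epochs 1 to 3
t2_xyz = rng.normal(size=(3, 3))
t3_xyz = rng.normal(size=(5, 3))

pooled_xyz, epoch_index = pool_candidates(t1_xyz, t2_xyz, t3_xyz)
print(pooled_xyz.shape)                   # (12, 3)
print(np.bincount(epoch_index).tolist())  # [4, 3, 5]
```

Keeping an epoch index alongside the pooled coordinates makes it possible to check afterwards how much each epoch contributes to a given neighborhood.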

In [None]:
neighborhood_candidates_t3 = py4dgeo.read_from_las(t3_file, additional_dimensions=dims_neighborhood_candidates)
neighborhood_candidates_t3 = mask_last_returns(neighborhood_candidates_t3, "return_number", "number_of_returns")

scor = py4dgeo.ScOR(
    search_point_epoch=search_points_t1,
    neighborhood_candidate_epochs=(neighborhood_candidates_t1, neighborhood_candidates_t2, neighborhood_candidates_t3),
    scan_position=scan_position,
    scan_resolution=scan_resolution,
    increment=increment)

scor_value_aggregated, expected_distance_aggregated, observed_distance_aggregated = scor.run()

## Exporting ScOR diagnostics to LAS for further analysis

Finally, we collect all ScOR-related quantities into a single `py4dgeo.Vapc` object for the reference epoch:

- From the **single-epoch ScOR** run:
  - `ScOR_standard`
  - `expected_distance_standard`
  - `observed_distance_standard`

- From the **bi-temporal ScOR** run (epoch 1 vs epoch 2):
  - `ScOR_to_t2`
  - `expected_distance_to_t2`
  - `observed_distance_to_t2`

- From the **aggregated multi-temporal ScOR** run (epochs 1–3):
  - `ScOR_aggregated`
  - `expected_distance_aggregated`
  - `observed_distance_aggregated`

These arrays are stored as additional per-point attributes in a `Vapc` object and written to the output file `ScOR_out`.

This LAS file can now be:

- loaded into **CloudCompare** (or other point cloud tools),
- visualized with ScOR fields as color scales,
- and thresholded (e.g. using a value around `ScOR ≤ 0.11` as discussed in the paper) to:

  - remove detached points and small transient clusters (e.g. insects, leaves),
  - and detect large temporary objects via multi-temporal neighborhoods.

Together with the SOR flag and distances, this provides a **complete outlier-removal toolkit** that combines:

- SOR’s global distance-based filtering, and  
- ScOR’s scan-aware, range-robust measure of local surface coherence.
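
Thresholding can also be done directly in Python instead of CloudCompare. The sketch below applies the `ScOR ≤ 0.11` value mentioned above to a small made-up array; the ScOR values and coordinates are illustrative, and a suitable threshold should be checked for each data set.

```python
import numpy as np

# Hypothetical ScOR values and coordinates for six points.
scor_values = np.array([0.02, 0.11, 0.12, 0.85, 1.00, 0.05])
xyz = np.arange(18, dtype=float).reshape(6, 3)

threshold = 0.11
is_detached = scor_values <= threshold  # low ScOR: detached or transient point

kept_xyz = xyz[~is_detached]
print(is_detached.tolist())  # [True, True, False, False, False, True]
print(kept_xyz.shape)        # (3, 3)
```

The same mask could be built from `ScOR_to_t2` or `ScOR_aggregated` to target large transient objects rather than isolated detached points.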

In [None]:
# t1_file now points to the SOR output, so the ScOR file name builds on it
ScOR_out = t1_file.replace(".laz", "_py4dgeo.laz")

vapc = py4dgeo.Vapc(search_points_t1, voxel_size=0.01)
vapc.out["ScOR_standard"] = scor_value_standard
vapc.out["expected_distance_standard"] = expected_distance_standard
vapc.out["observed_distance_standard"] = observed_distance_standard

vapc.out["ScOR_to_t2"] = scor_value_to_t2
vapc.out["expected_distance_to_t2"] = expected_distance_to_t2
vapc.out["observed_distance_to_t2"] = observed_distance_to_t2

vapc.out["ScOR_aggregated"] = scor_value_aggregated
vapc.out["expected_distance_aggregated"] = expected_distance_aggregated 
vapc.out["observed_distance_aggregated"] = observed_distance_aggregated

vapc.save_as_las(ScOR_out)

### References:
* Rusu, R.B., Marton, Z.C., Blodow, N., Dolha, M., Beetz, M., 2008. Towards 3D Point cloud based object maps for household environments. Robot. Auton. Syst. 56, 927–941. doi.org/10.1016/j.robot.2008.08.005
* Tabernig, R., Albert, W., Weiser, H., Fritzmann, P., Anders, K., Rutzinger, M., Höfle, B., 2025. Temporal aggregation of point clouds improves permanent laser scanning of landslides in forested areas. Sci. Remote Sens. 12, 100254. doi.org/10.1016/j.srs.2025.100254
* Tabernig, R., Höfle, B., 2026. Scan Outlier Ratio (ScOR): LiDAR Scanning and Survey-Aware Filtering of Detached Points in Terrestrial and Permanent Laser Scanning Point Clouds (forthcoming).