# High influx filter

This notebook demonstrates the "high influx filter", which detects unrealistically high rainfall records in personal weather station (PWS) data based on data from neighboring sensors.

The original R code stems from https://github.com/LottedeVos/PWSQC/. 

Publication:
de Vos, L. W., Leijnse, H., Overeem, A., & Uijlenhoet, R. (2019). Quality control for crowdsourced personal weather stations to enable operational rainfall monitoring. _Geophysical Research Letters_, 46(15), 8820-8829.

The idea of the filter is to compare the rainfall data of a sensor (here, a PWS) with a reference, which in a PWS network is derived from neighboring sensors, and to flag time steps where the station reports unrealistically high rainfall. Such records can be caused by, for example, people pouring liquids through the rain gauge for cleaning, handling of the device with tilting movements, or sprinklers in the vicinity.


In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
# Import packages

import poligrain as plg
import xarray as xr

import pypwsqc as pws

## Load example data

In this example we use an open PWS dataset from Amsterdam, called the "AMS PWS" dataset. The dataset can be downloaded by running the `curl` command in the cell below.

In [3]:
!curl -OL https://github.com/OpenSenseAction/OS_data_format_conventions/raw/main/notebooks/data/OpenSense_PWS_example_format_data.nc


In [4]:
# read PWS data with xarray
ds_pws = xr.open_dataset("OpenSense_PWS_example_format_data.nc")

# view data
ds_pws

## Create distance matrix

The first step is to find, for each time step, the number of neighbours within a specified range `max_distance` around each station that are reporting rainfall. The selected range depends on the use case and area of interest. In this example we use 10,000 meters.

In [5]:
# select range max_distance in which to find neighbours
max_distance = 10e3  # range around each station, meters

### Reproject coordinates to metric projection to allow for distance calculations 

Second, we reproject the coordinates to a local metric coordinate reference system, in the Amsterdam case EPSG:25832. This can be done with the function `spatial.project_point_coordinates` in the `poligrain` package.

In [6]:
ds_pws.coords["x"], ds_pws.coords["y"] = plg.spatial.project_point_coordinates(
    x=ds_pws.longitude, y=ds_pws.latitude, target_projection="EPSG:25832"
)

### Calculate distance between all stations of the network in meters

Then, we use `poligrain` to create a distance matrix with the distances between all stations in the PWS network.

In [7]:
distance_matrix = plg.spatial.calc_point_to_point_distances(ds_pws, ds_pws)
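To illustrate what the distance matrix is used for, here is a minimal sketch in plain Python. It does not use `poligrain`; the station IDs, the coordinates, and the helper `neighbours_within` are hypothetical and only serve to show how pairwise distances in a metric projection lead to a neighbour selection.

```python
import math

# Hypothetical projected coordinates in meters (not taken from the AMS PWS dataset)
stations = {
    "pws_a": (0.0, 0.0),
    "pws_b": (3_000.0, 4_000.0),  # 5 km from pws_a
    "pws_c": (12_000.0, 0.0),  # 12 km from pws_a
}
max_distance = 10e3  # meters

# Pairwise Euclidean distances between all stations
distances = {
    (i, j): math.dist(stations[i], stations[j]) for i in stations for j in stations
}


def neighbours_within(station_id, max_distance):
    """Return IDs closer than max_distance (the station itself, at distance 0, is included)."""
    return [j for j in stations if distances[(station_id, j)] < max_distance]


print(neighbours_within("pws_a", max_distance))  # ['pws_a', 'pws_b']
```

Note that a plain `< max_distance` comparison, as used in the loops of this notebook, also selects the station itself because its distance to itself is zero.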

### Calculate number of neighbours reporting rainfall per timestep

Now we can calculate the number of neighbours reporting rainfall (i.e. with a non-NaN record) per time step and save it in the data array `nbrs_not_nan`.

In [8]:
%%time
ds_pws = ds_pws.load()

nbrs_not_nan = []

for pws_id in ds_pws.id.data:
    neighbor_ids = distance_matrix.id.data[
        distance_matrix.sel(id=pws_id) < max_distance
    ]
    N = ds_pws.rainfall.sel(id=neighbor_ids).notnull().sum(dim="id")
    nbrs_not_nan.append(N)

ds_pws["nbrs_not_nan"] = xr.concat(nbrs_not_nan, dim="id")

CPU times: total: 31.8 s
Wall time: 31.9 s
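The quantity `nbrs_not_nan` is meant to hold, per station and time step, how many neighbours have a valid (non-NaN) record. A minimal sketch for a single time step, using hypothetical station IDs and rainfall values and a hypothetical helper `count_reporting`:

```python
import math

# Hypothetical rainfall (mm) at one time step; NaN marks a missing record
rainfall = {"pws_a": 0.2, "pws_b": math.nan, "pws_c": 0.0, "pws_d": 1.4}
neighbour_ids = ["pws_a", "pws_b", "pws_c", "pws_d"]


def count_reporting(neighbour_ids, rainfall):
    """Count neighbours whose record is not NaN (a record of 0 mm still counts)."""
    return sum(not math.isnan(rainfall[i]) for i in neighbour_ids)


print(count_reporting(neighbour_ids, rainfall))  # 3
```

A station reporting 0 mm is a valid neighbour; only missing records are excluded from the count.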


## Calculate reference

By default, the filter compares the rainfall observed at a station with the median rainfall of all stations within range `max_distance`; this median is stored as `reference`. If the median is below the threshold `HIthresA`, the station's HI flag is set to 1 (i.e. high influx) when its rainfall exceeds the threshold `HIthresB`. When the surrounding stations report moderate to heavy rainfall, the threshold becomes variable: for a median of `HIthresA` or higher, the station's HI flag is set to 1 when its measurement exceeds the median times `HIthresB/HIthresA`. The two thresholds are written `ϕA` and `ϕB` in the paper and passed as `hi_thres_a` and `hi_thres_b` to `pypwsqc`.

In [9]:
%%time

reference = []

for pws_id in ds_pws.id.data:
    neighbor_ids = distance_matrix.id.data[
        distance_matrix.sel(id=pws_id) < max_distance
    ]
    median = ds_pws.sel(id=neighbor_ids).rainfall.median(dim="id")
    reference.append(median)

CPU times: total: 4min 35s
Wall time: 4min 37s


In [10]:
ds_pws["reference"] = xr.concat(reference, dim="id")

# view data
ds_pws

## (Faulty Zeroes filter)

Conditions for raising Faulty Zeroes flag:

* FZflag is not -1
* Median rainfall of neighbouring stations within range `max_distance` is larger than zero for at least `nint` time intervals while the station itself reports zero rainfall.

The FZ flag remains 1 until the station reports nonzero rainfall. For settings of the parameter `nint`, see Table 1 in https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/2019GL083731
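The rule above can be sketched for a single station as a small state machine. This is not the `pypwsqc` implementation. It assumes that the `nint` intervals must be consecutive and that the flag is raised from the `nint`-th such interval onward. It ignores NaN values and the -1 case. The function name `fz_flags` and the input values are hypothetical.

```python
def fz_flags(station, median, nint):
    """Return a 0/1 Faulty Zeroes flag per time step for one station.

    station, median: sequences of rainfall values (mm), no NaN handling.
    Assumption: the nint intervals must be consecutive.
    """
    flags = []
    run = 0  # consecutive steps with station == 0 while median > 0
    raised = False
    for rain, ref in zip(station, median):
        if rain > 0:
            # nonzero rainfall at the station resets everything
            run = 0
            raised = False
        elif ref > 0:
            run += 1
            if run >= nint:
                raised = True
        else:
            # dry reference breaks the streak; a raised flag persists
            run = 0
        flags.append(1 if raised else 0)
    return flags


station = [0, 0, 0, 0, 0, 1.2, 0]
median = [0.3, 0.5, 0.2, 0, 0.4, 0.6, 0.1]
print(fz_flags(station, median, nint=3))  # [0, 0, 1, 1, 1, 0, 0]
```

In the example the flag is raised at the third wet-reference step, stays raised while the station keeps reporting zero, and is cleared once the station reports 1.2 mm.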

In [11]:
# fz_flag = pws.flagging.fz_filter(
#    pws_data=ds_pws.rainfall,
#    reference=ds_pws.reference,
#    nint=3
# )

In [12]:
# ds_pws["fz_flag"]= fz_flag

## High Influx filter

Conditions for raising High Influx flag:

* If the median is below threshold `ϕA`, the flag is raised when rainfall exceeds threshold `ϕB`.
* If the median is `ϕA` or higher, the flag is raised when rainfall exceeds the median times `ϕB`/`ϕA`.

The filter cannot be applied if fewer than `nstat` neighbours are reporting data; in that case the HI flag is set to -1.

For settings of the parameters `ϕA`, `ϕB` and `nstat`, see Table 1 in https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/2019GL083731
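The conditions above can be written out for a single station and time step. This is a sketch of the rule as described here, not the `pypwsqc` source; the function name `hi_flag_single`, its default values, and the example numbers are chosen for illustration only.

```python
def hi_flag_single(rain, median, n_valid, hi_thres_a=0.4, hi_thres_b=10.0, n_stat=5):
    """Return the HI flag (-1, 0 or 1) for one station at one time step.

    rain: rainfall at the station (mm)
    median: median rainfall of the neighbours (mm)
    n_valid: number of neighbours with a valid record
    """
    if n_valid < n_stat:
        return -1  # too few neighbours, filter cannot be applied
    if median < hi_thres_a:
        threshold = hi_thres_b  # fixed threshold when neighbours are (nearly) dry
    else:
        threshold = median * hi_thres_b / hi_thres_a  # threshold scales with the median
    return 1 if rain > threshold else 0


print(hi_flag_single(rain=12.0, median=0.1, n_valid=8))  # 1
print(hi_flag_single(rain=30.0, median=2.0, n_valid=8))  # 0 (threshold is 50 mm)
print(hi_flag_single(rain=60.0, median=2.0, n_valid=3))  # -1
```

With these illustrative values the ratio `ϕB`/`ϕA` is 25, so a station is only flagged during heavier rain when it reports more than 25 times the neighbourhood median.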

In [13]:
hi_flag = pws.flagging.hi_filter(
    pws_data=ds_pws.rainfall,
    nbrs_not_nan=ds_pws.nbrs_not_nan,
    reference=ds_pws.reference,
    hi_thres_a=0.4,
    hi_thres_b=10,
    n_stat=5,
)

In [14]:
ds_pws["hi_flag"] = hi_flag

Now we have a dataset with an HI flag for each station and time step!

In [15]:
ds_pws