# Neighbour checks for quality control flags
Covers QC16-25

## Table of contents
[QC16 Daily neighbours (wet)](#QC16---Daily-neighbours-wet)  
[QC17 Hourly neighbours (wet)](#QC17---Hourly-neighbours-wet)  
[QC18 Daily neighbours (dry)](#QC18---Daily-neighbours-dry)  
[QC19 Hourly neighbours (dry)](#QC19---Hourly-neighbours-dry)  
[QC20 Monthly neighbours](#QC20---Monthly-neighbours)  
[QC21 Timing offset](#QC21---Timing-offset)  
[QC22 Pre-QC Affinity](#QC22---Pre-QC-Affinity)  
[QC23 Pre-QC Pearson](#QC23---Pre-QC-Pearson)  
[QC24 Daily factor](#QC24---Daily-factor)  
[QC25 Monthly factor](#QC25---Monthly-factor)  

See '3.3.4 Neighbouring gauge checks on large values' in Lewis et al. (2021)

In [228]:
import datetime
import glob

import polars as pl

import geopy.distance

In [155]:
TARGET_STATION_ID = "DE_02483"
DISTANCE_THRESHOLD = 50 # km
OVERLAP_THRESHOLD = 365*3 # days (three years)

## Data reading globals
GAUGE_DATA_PATH = "../data/gauge_data"
DATA_ROWS_TO_SKIP = 20 ## First 20 rows are metadata. TODO: handle files with a different header length
UNIT_COL = "new_units" ## There is an original_units column too. TODO: generalise for data from other sources

In [156]:
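def read_metadata(data_path):
    """Read the 'key: value' header lines of a gauge file, up to and including 'other'."""
    metadata = {}

    with open(data_path, 'r') as f:
        ## Iterate over lines so a file without an 'other' key ends at EOF instead of raising
        for line in f:
            key, sep, val = line.strip().partition(':')
            if not sep:
                continue ## not a 'key: value' line
            key = key.strip().lower().replace(' ', '_')
            metadata[key] = val.strip()
            if key == 'other':
                break
    return metadata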
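
The TODO on `DATA_ROWS_TO_SKIP` asks what happens when the header is not 20 rows long. One possible answer, sketched below, is to count the header lines while parsing them, so the count can be passed to the reader instead of a fixed constant. `parse_header` is a hypothetical helper, and the four-field header is an assumed, shortened layout for illustration only; the real files carry more fields.

```python
import io


def parse_header(lines, last_key="other"):
    """Parse 'Key: value' header lines; return (metadata, number_of_header_lines).

    Hypothetical helper: parsing stops after `last_key`, so the line count
    can stand in for a hard-coded number of rows to skip.
    """
    metadata = {}
    n_lines = 0
    for line in lines:
        n_lines += 1
        key, sep, val = line.strip().partition(":")
        if not sep:
            continue  # not a 'key: value' line
        key = key.strip().lower().replace(" ", "_")
        metadata[key] = val.strip()
        if key == last_key:
            break
    return metadata, n_lines


# Assumed header layout, shortened to four fields, followed by two data rows
sample = io.StringIO(
    "Station ID: DE_02483\n"
    "Latitude: 51.1803\n"
    "Longitude: 8.4891\n"
    "Other: \n"
    "0.0\n"
    "0.1\n"
)
metadata, rows_to_skip = parse_header(sample)
print(metadata["station_id"], rows_to_skip)
```

Because `partition` splits on the first colon only, values that themselves contain a colon (such as a Windows path) are kept whole.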

### Step 1. Make or read summary metadata of stations

In [157]:
## TODO: could check whether summary metadata already exists (or let the user supply it)
all_gauge_data = glob.glob(f"{GAUGE_DATA_PATH}/*.txt")
all_gauge_data

['../data/gauge_data/DE_06303.txt',
 '../data/gauge_data/DE_02718.txt',
 '../data/gauge_data/DE_04313.txt',
 '../data/gauge_data/DE_00390.txt',
 '../data/gauge_data/DE_02483.txt',
 '../data/gauge_data/DE_03215.txt',
 '../data/gauge_data/DE_00389.txt',
 '../data/gauge_data/DE_06264.txt',
 '../data/gauge_data/DE_01300.txt',
 '../data/gauge_data/DE_04488.txt',
 '../data/gauge_data/DE_00310.txt']

In [158]:
all_station_metadata_list = []
for file in all_gauge_data:
    one_station_metadata = read_metadata(data_path=file)
    all_station_metadata_list.append(one_station_metadata)


In [159]:
all_station_metadata = pl.from_dicts(all_station_metadata_list)
all_station_metadata = all_station_metadata.with_columns(
    pl.col("latitude").cast(pl.Float64),
    pl.col("longitude").cast(pl.Float64),
    (pl.col("start_datetime")+'00').str.strptime(pl.Datetime, "%Y%m%d%H%M"),
    (pl.col("end_datetime")+'00').str.strptime(pl.Datetime, "%Y%m%d%H%M"),
)
all_station_metadata.head()

station_id,country,original_station_number,original_station_name,path_to_original_data,latitude,longitude,start_datetime,end_datetime,elevation,number_of_records,percent_missing_data,original_timestep,new_timestep,original_units,new_units,time_zone,daylight_saving_info,no_data_value,resolution,other
str,str,str,str,str,f64,f64,datetime[μs],datetime[μs],str,str,str,str,str,str,str,str,str,str,str,str
"""DE_06303""","""Germany""","""06303""","""NA""","""B:/INTENSE data/Original data/…",51.2884,8.5907,2006-01-01 00:00:00,2010-12-31 23:00:00,"""588m""","""43824""","""0.00""","""1hr""","""1hr""","""mm""","""mm""","""CET""","""NA""","""-999""","""0.10""",""""""
"""DE_02718""","""Germany""","""02718""","""NA""","""B:/INTENSE data/Original data/…",51.288,8.7928,2006-01-01 00:00:00,2010-12-31 23:00:00,"""458m""","""43824""","""0.00""","""1hr""","""1hr""","""mm""","""mm""","""CET""","""NA""","""-999""","""0.10""",""""""
"""DE_04313""","""Germany""","""04313""","""NA""","""B:/INTENSE data/Original data/…",51.4966,8.4342,2006-01-01 00:00:00,2010-12-31 23:00:00,"""361m""","""43824""","""0.00""","""1hr""","""1hr""","""mm""","""mm""","""CET""","""NA""","""-999""","""0.10""",""""""
"""DE_00390""","""Germany""","""00390""","""NA""","""B:/INTENSE data/Original data/…",50.9837,8.3679,2006-01-01 00:00:00,2010-12-31 23:00:00,"""610m""","""43824""","""0.00""","""1hr""","""1hr""","""mm""","""mm""","""CET""","""NA""","""-999""","""0.10""",""""""
"""DE_02483""","""Germany""","""02483""","""NA""","""B:/INTENSE data/Original data/…",51.1803,8.4891,2006-01-01 00:00:00,2010-12-31 23:00:00,"""839m""","""43824""","""0.00""","""1hr""","""1hr""","""mm""","""mm""","""CET""","""NA""","""-999""","""0.10""",""""""


### Step 2. Compute distance from target station

In [160]:
def compute_distance_from_target_id(metadata, target_id):
    target_station = metadata.filter(pl.col("station_id") == target_id)
    target_latlon = (target_station['latitude'].item(), target_station['longitude'].item())

    neighbour_distances = {}
    for other_station_id, other_lat, other_lon in metadata[['station_id', 'latitude', 'longitude']].rows():

        neighbour_distances[other_station_id] = geopy.distance.geodesic(target_latlon, (other_lat, other_lon)).kilometers
    return neighbour_distances

In [161]:
neighbours_distances = compute_distance_from_target_id(all_station_metadata, TARGET_STATION_ID)
neighbours_distances

{'DE_06303': 13.96385857171046,
 'DE_02718': 24.361752991036987,
 'DE_04313': 35.397225659812456,
 'DE_00390': 23.462778369145934,
 'DE_02483': 0.0,
 'DE_03215': 15.710073576478786,
 'DE_00389': 18.844342416100808,
 'DE_06264': 28.343769567230837,
 'DE_01300': 24.64263807685722,
 'DE_04488': 15.929311137997068,
 'DE_00310': 13.134594190689885}

In [162]:
# ALTERNATIVE: do we want to avoid using geopy.distance and simply write our own distance function?
# ALTERNATIVE: maybe precompute a matrix of distances? (would need `import scipy.spatial`)
# all_station_dist_mtx = scipy.spatial.distance.cdist(all_station_metadata[['latitude', 'longitude']].rows(),
#                                         all_station_metadata[['latitude', 'longitude']].rows(),
#                                         metric=lambda pnt1, pnt2: geopy.distance.geodesic(pnt1, pnt2).kilometers)
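
On the first alternative: a hand-written distance function could use the haversine formula. The sketch below is an assumption about what such a replacement might look like, not a drop-in equivalent of `geopy.distance.geodesic`. It treats the Earth as a sphere with an assumed mean radius, whereas the geodesic calculation uses an ellipsoid, so the two should agree closely over tens of kilometres but not exactly.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius (spherical approximation)


def haversine_km(latlon_1, latlon_2):
    """Great-circle distance in km between two (lat, lon) pairs given in degrees."""
    lat_1, lon_1 = map(math.radians, latlon_1)
    lat_2, lon_2 = map(math.radians, latlon_2)
    d_lat = lat_2 - lat_1
    d_lon = lon_2 - lon_1
    a = (
        math.sin(d_lat / 2) ** 2
        + math.cos(lat_1) * math.cos(lat_2) * math.sin(d_lon / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


target = (51.1803, 8.4891)     # DE_02483, coordinates from the metadata table
neighbour = (51.2884, 8.5907)  # DE_06303
print(round(haversine_km(target, neighbour), 2))
```

For this pair the result should land near the geodesic distance of about 13.96 km computed above. Whether the small difference matters for a 50 km threshold is a judgement call.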

In [163]:
neighbours_distances_df = pl.DataFrame({"station_id": neighbours_distances.keys(),
              "distances": neighbours_distances.values()
              })
neighbours_distances_df

station_id,distances
str,f64
"""DE_06303""",13.963859
"""DE_02718""",24.361753
"""DE_04313""",35.397226
"""DE_00390""",23.462778
"""DE_02483""",0.0
…,…
"""DE_00389""",18.844342
"""DE_06264""",28.34377
"""DE_01300""",24.642638
"""DE_04488""",15.929311


In [190]:
## Subset based on 50 km
close_neighbours = neighbours_distances_df.filter(
    (pl.col("distances") <= DISTANCE_THRESHOLD) &
    (pl.col("distances") != 0)
    )

## closest 10 
close_neighbours.sort('distances')[:10]

station_id,distances
str,f64
"""DE_00310""",13.134594
"""DE_06303""",13.963859
"""DE_03215""",15.710074
"""DE_04488""",15.929311
"""DE_00389""",18.844342
"""DE_00390""",23.462778
"""DE_02718""",24.361753
"""DE_01300""",24.642638
"""DE_06264""",28.34377
"""DE_04313""",35.397226


### Compute the temporal overlap

In [165]:

def compute_overlap_days(start_1, end_1, start_2, end_2):
    ## TODO: add cast to datetime functionality/checks
    ## compute overlap
    overlap_start = max(start_1, start_2)
    overlap_end = min(end_1, end_2)

    overlap_days = max(0, (overlap_end - overlap_start).days)

    return overlap_days
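
A quick check of `compute_overlap_days` on three cases. The function is repeated so the snippet runs on its own. The first record uses the start and end shown in the metadata table; the partially overlapping and disjoint records are made-up dates for illustration.

```python
import datetime


def compute_overlap_days(start_1, end_1, start_2, end_2):
    overlap_start = max(start_1, start_2)
    overlap_end = min(end_1, end_2)
    # A negative timedelta (no overlap) has negative .days, so clamp at 0
    return max(0, (overlap_end - overlap_start).days)


start = datetime.datetime(2006, 1, 1, 0)
end = datetime.datetime(2010, 12, 31, 23)

# Identical records: whole days only, the trailing 23 hours are dropped
full = compute_overlap_days(start, end, start, end)

# Made-up record that only shares 2010 with the target
partial = compute_overlap_days(
    start, end, datetime.datetime(2010, 1, 1, 0), datetime.datetime(2012, 12, 31, 23)
)

# Made-up record that starts after the target ends
disjoint = compute_overlap_days(
    start, end, datetime.datetime(2011, 1, 1, 0), datetime.datetime(2012, 1, 1, 0)
)

print(full, partial, disjoint)
```

Note that `.days` truncates to whole days, so a record pair overlapping by 364 days and 23 hours counts as 364 days.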

In [212]:
def compute_overlap_days_from_target_id(metadata, target_id):
    target_station = metadata.filter(pl.col("station_id") == target_id)
    start_1, end_1 = target_station['start_datetime'].item(), target_station['end_datetime'].item()

    neighbour_overlap_days = {}
    for other_station_id, start_2, end_2 in metadata[['station_id', 'start_datetime', 'end_datetime']].rows():
        if target_id == other_station_id:
            continue

        neighbour_overlap_days[other_station_id] = compute_overlap_days(start_1, end_1, start_2, end_2)
    return neighbour_overlap_days

In [213]:
neighbour_overlap_days = compute_overlap_days_from_target_id(all_station_metadata, TARGET_STATION_ID)
neighbour_overlap_days

{'DE_06303': 1825,
 'DE_02718': 1825,
 'DE_04313': 1825,
 'DE_00390': 1825,
 'DE_03215': 1309,
 'DE_00389': 425,
 'DE_06264': 1825,
 'DE_01300': 1825,
 'DE_04488': 1613,
 'DE_00310': 1825}

### Subset based on 3 years

In [214]:
neighbour_overlap_days_df = pl.DataFrame({"station_id": neighbour_overlap_days.keys(),
              "overlap_days": neighbour_overlap_days.values()
              })
neighbour_overlap_days_df

station_id,overlap_days
str,i64
"""DE_06303""",1825
"""DE_02718""",1825
"""DE_04313""",1825
"""DE_00390""",1825
"""DE_03215""",1309
"""DE_00389""",425
"""DE_06264""",1825
"""DE_01300""",1825
"""DE_04488""",1613
"""DE_00310""",1825


In [215]:
neighbour_overlap_days_df.filter(
    pl.col("overlap_days") >= OVERLAP_THRESHOLD
)

station_id,overlap_days
str,i64
"""DE_06303""",1825
"""DE_02718""",1825
"""DE_04313""",1825
"""DE_00390""",1825
"""DE_03215""",1309
"""DE_06264""",1825
"""DE_01300""",1825
"""DE_04488""",1613
"""DE_00310""",1825


## Get neighbours' time series data

In [216]:
num_closest_gauges = 10 ## based on IntenseQC

In [217]:
## Subset based on 50 km
close_neighbour_ids = neighbours_distances_df.filter(
    (pl.col("distances") <= DISTANCE_THRESHOLD) &
    (pl.col("distances") != 0)
)
## closest 10 values
closest_neighbour_ids = close_neighbour_ids.sort('distances')[:num_closest_gauges]['station_id'].to_list()

## Subset based on 3 years
overlapping_neighbour_ids = neighbour_overlap_days_df.filter(
    pl.col("overlap_days") >= OVERLAP_THRESHOLD
)['station_id'].to_list()

In [222]:
all_neighbour_ids = set(overlapping_neighbour_ids).intersection(set(closest_neighbour_ids))
all_neighbour_ids

{'DE_00310',
 'DE_00390',
 'DE_01300',
 'DE_02718',
 'DE_03215',
 'DE_04313',
 'DE_04488',
 'DE_06264',
 'DE_06303'}
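
The selection above combines two criteria: the closest gauges within the distance threshold, and a minimum overlap in days. As a sketch of how this could be wrapped up, the hypothetical `select_neighbours` below applies both to plain dictionaries, using the (rounded) distances and overlap days printed earlier. One assumed difference from the cells above: the target is excluded by ID rather than by `distances != 0`, so a co-located gauge would not be dropped.

```python
def select_neighbours(distances, overlap_days, target_id,
                      max_distance_km=50, min_overlap_days=365 * 3, n_closest=10):
    """Return IDs of the n closest in-range stations that also overlap long enough."""
    within_range = sorted(
        (dist, station_id)
        for station_id, dist in distances.items()
        if station_id != target_id and dist <= max_distance_km
    )
    closest = {station_id for _, station_id in within_range[:n_closest]}
    overlapping = {
        station_id for station_id, days in overlap_days.items()
        if days >= min_overlap_days
    }
    return closest & overlapping


# Values copied (rounded) from the outputs above
distances = {
    "DE_06303": 13.96, "DE_02718": 24.36, "DE_04313": 35.40, "DE_00390": 23.46,
    "DE_02483": 0.0, "DE_03215": 15.71, "DE_00389": 18.84, "DE_06264": 28.34,
    "DE_01300": 24.64, "DE_04488": 15.93, "DE_00310": 13.13,
}
overlap_days = {
    "DE_06303": 1825, "DE_02718": 1825, "DE_04313": 1825, "DE_00390": 1825,
    "DE_03215": 1309, "DE_00389": 425, "DE_06264": 1825, "DE_01300": 1825,
    "DE_04488": 1613, "DE_00310": 1825,
}

selected = select_neighbours(distances, overlap_days, target_id="DE_02483")
print(sorted(selected))
```

Only `DE_00389` is dropped here, since its 425 days of overlap fall short of the three-year threshold.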

In [223]:
all_neighbour_ids_paths = {}
for station_id in all_neighbour_ids: ## avoid shadowing the built-in `id`
    ids_path = glob.glob(f"{GAUGE_DATA_PATH}/*{station_id}.txt")
    assert len(ids_path) == 1, f"There are {len(ids_path)} data files for {station_id}"
    all_neighbour_ids_paths[station_id] = ids_path[0]

In [224]:
all_neighbour_ids_paths

{'DE_04488': '../data/gauge_data/DE_04488.txt',
 'DE_06303': '../data/gauge_data/DE_06303.txt',
 'DE_00390': '../data/gauge_data/DE_00390.txt',
 'DE_03215': '../data/gauge_data/DE_03215.txt',
 'DE_04313': '../data/gauge_data/DE_04313.txt',
 'DE_01300': '../data/gauge_data/DE_01300.txt',
 'DE_00310': '../data/gauge_data/DE_00310.txt',
 'DE_02718': '../data/gauge_data/DE_02718.txt',
 'DE_06264': '../data/gauge_data/DE_06264.txt'}

In [None]:
for station_id, data_path in all_neighbour_ids_paths.items():
    print(station_id)
    ## Get station metadata for that ID
    station_metadata = all_station_metadata.filter(pl.col("station_id") == station_id)
    assert len(station_metadata) == 1, f"There are {len(station_metadata)} metadata values for {station_id}"

    ## Read in gauge data
    units = station_metadata[UNIT_COL][0]
    rain_col = f'rain_{units}'
    gauge_data = pl.read_csv(data_path, skip_rows=DATA_ROWS_TO_SKIP, schema_overrides={rain_col: pl.Float64})

    ## make datetime column TODO: move to function
    startdate = station_metadata['start_datetime'][0]
    enddate = station_metadata['end_datetime'][0]

    hourly_date_interval = []
    delta_days = (enddate+datetime.timedelta(days=1) - startdate).days
    for i in range(delta_days * 24):
        hourly_date_interval.append(startdate + datetime.timedelta(hours=i))
    assert len(gauge_data) == len(hourly_date_interval)
    gauge_data = gauge_data.with_columns(time=pl.Series(hourly_date_interval)) ## set time columns
    gauge_data = gauge_data.select(['time', rain_col]) ## Reorder (to look nice)


DE_04488
DE_06303
DE_00390
DE_03215
DE_04313
DE_01300
DE_00310
DE_02718
DE_06264
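
The datetime column in the loop above is built from a count of whole days, which appears to rely on each record starting at 00:00 and ending at 23:00. For the "move to function" TODO, one option is to count hours directly. `hourly_timestamps` below is a hypothetical helper using only the standard library; the start and end are those shown in the metadata table.

```python
import datetime


def hourly_timestamps(start, end):
    """Return every hourly timestamp from start to end, both inclusive."""
    n_hours = int((end - start) / datetime.timedelta(hours=1)) + 1
    return [start + datetime.timedelta(hours=i) for i in range(n_hours)]


start = datetime.datetime(2006, 1, 1, 0)
end = datetime.datetime(2010, 12, 31, 23)
times = hourly_timestamps(start, end)
print(len(times), times[0], times[-1])
```

The length can then be compared against `number_of_records` from the metadata, in the same way as the existing `assert` on `len(gauge_data)`.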


In [242]:
gauge_data

time,rain_mm
datetime[μs],f64
2006-01-01 00:00:00,0.0
2006-01-01 01:00:00,0.1
2006-01-01 02:00:00,0.0
2006-01-01 03:00:00,0.0
2006-01-01 04:00:00,0.0
…,…
2010-12-31 19:00:00,0.0
2010-12-31 20:00:00,0.0
2010-12-31 21:00:00,0.0
2010-12-31 22:00:00,0.0


# QC16 - Daily neighbours (wet)
[Back to Index](#Table-of-contents)

#### Differences from `intense-qc`: 
- 

# QC17 - Hourly neighbours (wet) 
[Back to Index](#Table-of-contents)

#### Differences from `intense-qc`: 
- 

# QC18 - Daily neighbours (dry) 
[Back to Index](#Table-of-contents)

#### Differences from `intense-qc`: 
- 

# QC19 - Hourly neighbours (dry) 
[Back to Index](#Table-of-contents)

#### Differences from `intense-qc`: 
- 

# QC20 - Monthly neighbours
[Back to Index](#Table-of-contents)

#### Differences from `intense-qc`: 
- 

# QC21 - Timing offset 
[Back to Index](#Table-of-contents)

#### Differences from `intense-qc`: 
- 

# QC22 - Pre-QC Affinity  
[Back to Index](#Table-of-contents)

#### Differences from `intense-qc`: 
- 

# QC23 - Pre-QC Pearson
[Back to Index](#Table-of-contents)

#### Differences from `intense-qc`: 
- 

# QC24 - Daily factor  
[Back to Index](#Table-of-contents)

#### Differences from `intense-qc`: 
- 

# QC25 - Monthly factor
[Back to Index](#Table-of-contents)

#### Differences from `intense-qc`: 
- 