# Neighbour checks for quality control flags
Covers QC16-25

## Table of contents
[Utility neighbour functions](#Utility-neighbour-functions)  
[QC16 Daily neighbours (wet)](#QC16---Daily-neighbours-wet)  
[QC17 Hourly neighbours (wet)](#QC17---Hourly-neighbours-wet)  
[QC18 Daily neighbours (dry)](#QC18---Daily-neighbours-dry)  
[QC19 Hourly neighbours (dry)](#QC19---Hourly-neighbours-dry)  
[QC20 Monthly neighbours](#QC20---Monthly-neighbours)  
[QC21 Timing offset](#QC21---Timing-offset)  
[QC22 Pre-QC Affinity](#QC22---Pre-QC-Affinity)  
[QC23 Pre-QC Pearson](#QC23---Pre-QC-Pearson)  
[QC24 Daily factor](#QC24---Daily-factor)  
[QC25 Monthly factor](#QC25---Monthly-factor)  

See '3.3.4 Neighbouring gauge checks on large values' in Lewis et al. (2021)

In [113]:
import datetime
import glob

import zipfile ## used once for Intense format data
import pandas as pd ## used once for Intense format data

import polars as pl
import numpy as np

import scipy.stats
import geopy.distance

In [2]:
TARGET_STATION_ID = "DE_02483"
DISTANCE_THRESHOLD = 50 # 50 km
OVERLAP_THRESHOLD = 365*3 # three years

## Data reading globals
GAUGE_DATA_PATH = "../data/gauge_data"
DATA_ROWS_TO_SKIP = 20 ## First 20 rows are metadata TODO: what if not?
UNIT_COL = "new_units" ## There is an original_units col too TODO: think of way to do this for different data

MULTIPLYING_FACTORS = {"hourly": 24, "daily": 1} ## compared to daily reference


In [3]:
def read_metadata(data_path):
    metadata = {}

    with open(data_path, 'r') as f:
        while True:
            key, val = f.readline().strip().split(':', maxsplit=1)
            key = key.lower().replace(' ', '_')
            metadata[key.strip()] = val.strip()
            if key == 'other':
                break
    return metadata

In [4]:
station_metadata = read_metadata(data_path=f'{GAUGE_DATA_PATH}/{TARGET_STATION_ID}.txt')
station_metadata['start_datetime'] = datetime.datetime.strptime(station_metadata['start_datetime'], '%Y%m%d%H')
station_metadata['end_datetime'] = datetime.datetime.strptime(station_metadata['end_datetime'], '%Y%m%d%H')

In [5]:
def add_datetime_to_gauge_data(station_metadata, gauge_data, time_multiplying_factor):
    """
    Add datetime column to gauge data using metadata from that gauge.
    NOTE: Could maybe extend so can find metadata if not provided?
    """
    startdate = station_metadata['start_datetime']
    enddate = station_metadata['end_datetime']
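    assert isinstance(startdate, datetime.datetime), "Please convert start_ and end_datetime to datetime.datetime objects"
    assert isinstance(enddate, datetime.datetime), "Please convert start_ and end_datetime to datetime.datetime objects"

    ## time step implied by the number of records per day (24 -> 1 hour, 1 -> 24 hours)
    time_step = datetime.timedelta(hours=24 / time_multiplying_factor)
    delta_days = (enddate+datetime.timedelta(days=1) - startdate).days
    date_interval = [startdate + i * time_step for i in range(delta_days * time_multiplying_factor)]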

    ## add datetime column
    assert len(gauge_data) == len(date_interval)
    gauge_data = gauge_data.with_columns(time=pl.Series(date_interval)) ## set time columns

    return gauge_data


In [6]:
rain_col = f"rain_{station_metadata['original_units']}" ## outer double quotes: reusing the same quote type inside an f-string needs Python 3.12+

In [7]:
## read in gauge data
target_gauge = pl.read_csv(f'{GAUGE_DATA_PATH}/{TARGET_STATION_ID}.txt', skip_rows=DATA_ROWS_TO_SKIP, schema_overrides={rain_col: pl.Float64})

In [8]:
target_gauge = add_datetime_to_gauge_data(station_metadata, target_gauge, time_multiplying_factor=MULTIPLYING_FACTORS['hourly'])
target_gauge = target_gauge.select(['time', rain_col]) ## Reorder (to look nice)

In [9]:
## replace no-data values with NaN
no_data_value = int(station_metadata['no_data_value'])
target_gauge = target_gauge.with_columns(
    pl.when(pl.col(rain_col) == no_data_value)
    .then(np.nan)
    .otherwise(pl.col(rain_col))
    .alias(rain_col)
)

In [10]:
target_gauge.head()

time,rain_mm
datetime[μs],f64
2006-01-01 00:00:00,0.9
2006-01-01 01:00:00,0.3
2006-01-01 02:00:00,0.3
2006-01-01 03:00:00,0.0
2006-01-01 04:00:00,0.0


# Utility neighbour functions
TODO: convert these functions to classes

### Part 1. Make or read summary metadata of stations

In [11]:
## Could work by checking if metadata already exists (or user can input)
all_gauge_data_paths = glob.glob(f"{GAUGE_DATA_PATH}/*.txt")
all_gauge_data_paths

['../data/gauge_data/DE_02483.txt',
 '../data/gauge_data/DE_00310.txt',
 '../data/gauge_data/DE_00389.txt',
 '../data/gauge_data/DE_00390.txt',
 '../data/gauge_data/DE_01300.txt',
 '../data/gauge_data/DE_02718.txt',
 '../data/gauge_data/DE_03215.txt',
 '../data/gauge_data/DE_04313.txt',
 '../data/gauge_data/DE_04488.txt',
 '../data/gauge_data/DE_06264.txt',
 '../data/gauge_data/DE_06303.txt']

In [12]:
all_station_metadata_list = []
for file in all_gauge_data_paths:
    one_station_metadata = read_metadata(data_path=file)
    all_station_metadata_list.append(one_station_metadata)


In [13]:
all_station_metadata = pl.from_dicts(all_station_metadata_list)
all_station_metadata = all_station_metadata.with_columns(
    pl.col("latitude").cast(pl.Float64),
    pl.col("longitude").cast(pl.Float64),
    (pl.col("start_datetime")+'00').str.strptime(pl.Datetime, "%Y%m%d%H%M"),
    (pl.col("end_datetime")+'00').str.strptime(pl.Datetime, "%Y%m%d%H%M"),
)
all_station_metadata.head()

station_id,country,original_station_number,original_station_name,path_to_original_data,latitude,longitude,start_datetime,end_datetime,elevation,number_of_records,percent_missing_data,original_timestep,new_timestep,original_units,new_units,time_zone,daylight_saving_info,no_data_value,resolution,other
str,str,str,str,str,f64,f64,datetime[μs],datetime[μs],str,str,str,str,str,str,str,str,str,str,str,str
"""DE_02483""","""Germany""","""02483""","""NA""","""B:/INTENSE data/Original data/…",51.1803,8.4891,2006-01-01 00:00:00,2010-12-31 23:00:00,"""839m""","""43824""","""0.00""","""1hr""","""1hr""","""mm""","""mm""","""CET""","""NA""","""-999""","""0.10""",""""""
"""DE_00310""","""Germany""","""00310""","""NA""","""B:/INTENSE data/Original data/…",51.0662,8.5373,2006-01-01 00:00:00,2010-12-31 23:00:00,"""590m""","""43824""","""0.00""","""1hr""","""1hr""","""mm""","""mm""","""CET""","""NA""","""-999""","""0.10""",""""""
"""DE_00389""","""Germany""","""00389""","""NA""","""B:/INTENSE data/Original data/…",51.0148,8.4318,2009-11-01 00:00:00,2010-12-31 23:00:00,"""436m""","""10224""","""0.00""","""1hr""","""1hr""","""mm""","""mm""","""CET""","""NA""","""-999""","""0.10""",""""""
"""DE_00390""","""Germany""","""00390""","""NA""","""B:/INTENSE data/Original data/…",50.9837,8.3679,2006-01-01 00:00:00,2010-12-31 23:00:00,"""610m""","""43824""","""0.00""","""1hr""","""1hr""","""mm""","""mm""","""CET""","""NA""","""-999""","""0.10""",""""""
"""DE_01300""","""Germany""","""01300""","""NA""","""B:/INTENSE data/Original data/…",51.254,8.1565,2006-01-01 00:00:00,2010-12-31 23:00:00,"""351m""","""43824""","""0.00""","""1hr""","""1hr""","""mm""","""mm""","""CET""","""NA""","""-999""","""0.10""",""""""


### Part 2. Compute distance from target station
TODO: What if the location data is in a different projection, e.g. EPSG:27700?

In [14]:
def compute_distance_from_target_id(metadata, target_id):
    target_station = metadata.filter(pl.col("station_id") == target_id)
    target_latlon = (target_station['latitude'].item(), target_station['longitude'].item())

    neighbour_distances = {}
    for other_station_id, other_lat, other_lon in metadata[['station_id', 'latitude', 'longitude']].rows():

        neighbour_distances[other_station_id] = geopy.distance.geodesic(target_latlon, (other_lat, other_lon)).kilometers
    return neighbour_distances

In [15]:
neighbours_distances = compute_distance_from_target_id(all_station_metadata, TARGET_STATION_ID)
neighbours_distances

{'DE_02483': 0.0,
 'DE_00310': 13.134594190689885,
 'DE_00389': 18.844342416100808,
 'DE_00390': 23.462778369145934,
 'DE_01300': 24.64263807685722,
 'DE_02718': 24.361752991036987,
 'DE_03215': 15.710073576478786,
 'DE_04313': 35.397225659812456,
 'DE_04488': 15.929311137997068,
 'DE_06264': 28.343769567230837,
 'DE_06303': 13.96385857171046}

In [16]:
# ALTERNATIVE: do we want to avoid using geopy.distance and simply write a distance function?
# ALTERNATIVE: maybe precompute a matrix of distances? (would also need `import scipy.spatial`)
# all_station_dist_mtx = scipy.spatial.distance.cdist(all_station_metadata[['latitude', 'longitude']].rows(),
#                                         all_station_metadata[['latitude', 'longitude']].rows(),
#                                         metric=lambda pnt1, pnt2: geopy.distance.geodesic(pnt1, pnt2).kilometers)
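One way to drop the `geopy` dependency, as the first comment above suggests, is a great-circle (haversine) distance. The sketch below is an illustration rather than part of the original workflow: `haversine_km` and `EARTH_RADIUS_KM` are hypothetical names, and a spherical Earth is assumed, so results will differ slightly from `geopy.distance.geodesic`, which works on an ellipsoid.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; spherical approximation (assumption)


def haversine_km(latlon_1, latlon_2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat_1, lon_1 = map(math.radians, latlon_1)
    lat_2, lon_2 = map(math.radians, latlon_2)
    d_lat = lat_2 - lat_1
    d_lon = lon_2 - lon_1
    a = math.sin(d_lat / 2) ** 2 + math.cos(lat_1) * math.cos(lat_2) * math.sin(d_lon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Coordinates of DE_02483 (target) and DE_00310, copied from the metadata table above
target_latlon = (51.1803, 8.4891)
neighbour_latlon = (51.0662, 8.5373)
distance_km = haversine_km(target_latlon, neighbour_latlon)
print(round(distance_km, 2))
```

For this pair the result should land close to the 13.13 km geodesic distance reported above. At a 50 km search radius the spherical approximation is unlikely to change which neighbours are selected, but that is worth checking for stations near the threshold.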

In [16]:
neighbours_distances_df = pl.DataFrame({"station_id": neighbours_distances.keys(),
              "distances": neighbours_distances.values()
              })
neighbours_distances_df

station_id,distances
str,f64
"""DE_02483""",0.0
"""DE_00310""",13.134594
"""DE_00389""",18.844342
"""DE_00390""",23.462778
"""DE_01300""",24.642638
…,…
"""DE_03215""",15.710074
"""DE_04313""",35.397226
"""DE_04488""",15.929311
"""DE_06264""",28.34377


In [17]:
## Subset based on 50 km
close_neighbours = neighbours_distances_df.filter(
    (pl.col("distances") <= DISTANCE_THRESHOLD) &
    (pl.col("distances") != 0)
    )

## closest 10 
close_neighbours.sort('distances')[:10]

station_id,distances
str,f64
"""DE_00310""",13.134594
"""DE_06303""",13.963859
"""DE_03215""",15.710074
"""DE_04488""",15.929311
"""DE_00389""",18.844342
"""DE_00390""",23.462778
"""DE_02718""",24.361753
"""DE_01300""",24.642638
"""DE_06264""",28.34377
"""DE_04313""",35.397226


### Part 3. Compute the temporal overlap

In [18]:
def compute_overlap_days(start_1, end_1, start_2, end_2):
    ## TODO: add cast to datetime functionality/checks
    ## compute overlap
    overlap_start = max(start_1, start_2)
    overlap_end = min(end_1, end_2)

    overlap_days = max(0, (overlap_end - overlap_start).days)

    return overlap_days

In [19]:
def compute_overlap_days_from_target_id(metadata, target_id):
    target_station = metadata.filter(pl.col("station_id") == target_id)
    start_1, end_1 = target_station['start_datetime'].item(), target_station['end_datetime'].item()

    neighbour_overlap_days = {}
    for other_station_id, start_2, end_2 in metadata[['station_id', 'start_datetime', 'end_datetime']].rows():
        if target_id == other_station_id:
            continue

        neighbour_overlap_days[other_station_id] = compute_overlap_days(start_1, end_1, start_2, end_2)
    return neighbour_overlap_days

In [20]:
neighbour_overlap_days = compute_overlap_days_from_target_id(all_station_metadata, TARGET_STATION_ID)
neighbour_overlap_days

{'DE_00310': 1825,
 'DE_00389': 425,
 'DE_00390': 1825,
 'DE_01300': 1825,
 'DE_02718': 1825,
 'DE_03215': 1309,
 'DE_04313': 1825,
 'DE_04488': 1613,
 'DE_06264': 1825,
 'DE_06303': 1825}
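The values above can be checked by hand. The sketch below restates the overlap calculation in a self-contained form (`overlap_days` is a hypothetical stand-in for `compute_overlap_days`) and applies it to two record periods taken from the metadata table. Note that `timedelta.days` truncates, so a record spanning 1825 days and 23 hours counts as 1825 days.

```python
import datetime


def overlap_days(start_1, end_1, start_2, end_2):
    """Whole days shared by two [start, end] records (0 if they do not overlap)."""
    overlap_start = max(start_1, start_2)
    overlap_end = min(end_1, end_2)
    return max(0, (overlap_end - overlap_start).days)


# Record periods copied from the metadata table above
target_period = (datetime.datetime(2006, 1, 1, 0), datetime.datetime(2010, 12, 31, 23))
short_period = (datetime.datetime(2009, 11, 1, 0), datetime.datetime(2010, 12, 31, 23))

full_overlap = overlap_days(*target_period, *target_period)
short_overlap = overlap_days(*target_period, *short_period)
three_years = 365 * 3

print(full_overlap, short_overlap)   # 1825 425
print(short_overlap >= three_years)  # False
```

The second period corresponds to DE_00389, which is why that station drops out once the three-year threshold is applied.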

#### Subset based on 3 years

In [21]:
neighbour_overlap_days_df = pl.DataFrame({"station_id": neighbour_overlap_days.keys(),
              "overlap_days": neighbour_overlap_days.values()
              })
neighbour_overlap_days_df

station_id,overlap_days
str,i64
"""DE_00310""",1825
"""DE_00389""",425
"""DE_00390""",1825
"""DE_01300""",1825
"""DE_02718""",1825
"""DE_03215""",1309
"""DE_04313""",1825
"""DE_04488""",1613
"""DE_06264""",1825
"""DE_06303""",1825


In [22]:
neighbour_overlap_days_df.filter(
    pl.col("overlap_days") >= OVERLAP_THRESHOLD
)

station_id,overlap_days
str,i64
"""DE_00310""",1825
"""DE_00390""",1825
"""DE_01300""",1825
"""DE_02718""",1825
"""DE_03215""",1309
"""DE_04313""",1825
"""DE_04488""",1613
"""DE_06264""",1825
"""DE_06303""",1825


## Part 4. Bring together to get neighbours both close and overlapping

In [23]:
num_closest_gauges = 10 ## based on IntenseQC

In [25]:
## Subset based on 50 km
close_neighbour_ids = neighbours_distances_df.filter(
    (pl.col("distances") <= DISTANCE_THRESHOLD) &
    (pl.col("distances") != 0)
)
## closest 10 values
closest_neighbour_ids = close_neighbour_ids.sort('distances')[:num_closest_gauges]['station_id'].to_list()

## Subset based on 3 years
overlapping_neighbour_ids = neighbour_overlap_days_df.filter(
    pl.col("overlap_days") >= OVERLAP_THRESHOLD
)['station_id'].to_list()

In [26]:
all_neighbour_ids = set(overlapping_neighbour_ids).intersection(set(closest_neighbour_ids))
all_neighbour_ids

{'DE_00310',
 'DE_00390',
 'DE_01300',
 'DE_02718',
 'DE_03215',
 'DE_04313',
 'DE_04488',
 'DE_06264',
 'DE_06303'}

In [27]:
all_neighbour_ids_paths = {}
for neighbour_id in all_neighbour_ids:  ## avoid shadowing the built-in `id`
    ids_path = glob.glob(f"{GAUGE_DATA_PATH}/*{neighbour_id}.txt")
    assert len(ids_path) == 1, f"There are {len(ids_path)} data files for {neighbour_id}"
    all_neighbour_ids_paths[neighbour_id] = ids_path[0]

In [28]:
all_neighbour_ids_paths

{'DE_06264': '../data/gauge_data/DE_06264.txt',
 'DE_01300': '../data/gauge_data/DE_01300.txt',
 'DE_06303': '../data/gauge_data/DE_06303.txt',
 'DE_02718': '../data/gauge_data/DE_02718.txt',
 'DE_04488': '../data/gauge_data/DE_04488.txt',
 'DE_00310': '../data/gauge_data/DE_00310.txt',
 'DE_03215': '../data/gauge_data/DE_03215.txt',
 'DE_00390': '../data/gauge_data/DE_00390.txt',
 'DE_04313': '../data/gauge_data/DE_04313.txt'}

## Part 6. Get neighbouring GSDR gauge by ID (an example)

In [29]:
def get_neighbour_gauge_data(neighbour_gauge_id, time_multiplying_factor):
    data_path = all_neighbour_ids_paths[neighbour_gauge_id]
    station_metadata = all_station_metadata.filter(pl.col("station_id") == neighbour_gauge_id)
    assert len(station_metadata) == 1, f"There are {len(station_metadata)} metadata values for {neighbour_gauge_id}. Investigate because there should only be one"
    station_metadata = station_metadata.to_dicts()[0] ## convert df to a dict

    ## Read in gauge data
    units = station_metadata[UNIT_COL]
    rain_col = f'rain_{units}'
    gauge_data = pl.read_csv(data_path, skip_rows=DATA_ROWS_TO_SKIP, schema_overrides={rain_col: pl.Float64})

    ## make datetime column
    gauge_data_w_dates = add_datetime_to_gauge_data(station_metadata, gauge_data, time_multiplying_factor=time_multiplying_factor)
    gauge_data_w_dates = gauge_data_w_dates.select(['time', rain_col]) ## Reorder (to look nice)

    return gauge_data_w_dates


In [30]:
get_neighbour_gauge_data(neighbour_gauge_id='DE_06264', time_multiplying_factor=MULTIPLYING_FACTORS['hourly'])

time,rain_mm
datetime[μs],f64
2006-01-01 00:00:00,0.0
2006-01-01 01:00:00,0.1
2006-01-01 02:00:00,0.0
2006-01-01 03:00:00,0.0
2006-01-01 04:00:00,0.0
…,…
2010-12-31 19:00:00,0.0
2010-12-31 20:00:00,0.0
2010-12-31 21:00:00,0.0
2010-12-31 22:00:00,0.0


## Part 7. Get neighbouring GPCC gauge by ID (an example)

#### Note:
In the original methodology, GPCC data are extracted on the fly, so this step needs to be refactored.

In [32]:
existing_gpcc_daily_paths = {}
existing_gpcc_monthly_paths = {}
for neighbour_id in all_neighbour_ids_paths.keys():
    gpcc_id = neighbour_id.split('DE_')[1].lstrip("0")
    existing_gpcc_daily_paths[neighbour_id] = glob.glob(f"../data/GPCC/tw_{gpcc_id}.zip")
    existing_gpcc_monthly_paths[neighbour_id] = glob.glob(f"../data/GPCC/mw_{gpcc_id}.zip")

In [33]:
existing_gpcc_daily_paths

{'DE_06264': [],
 'DE_01300': [],
 'DE_06303': ['../data/GPCC/tw_6303.zip'],
 'DE_02718': [],
 'DE_04488': [],
 'DE_00310': ['../data/GPCC/tw_310.zip'],
 'DE_03215': ['../data/GPCC/tw_3215.zip'],
 'DE_00390': [],
 'DE_04313': []}

In [34]:
gpcc_id_to_use = 'DE_06303'
gpcc_id_name = gpcc_id_to_use.split('DE_')[-1].lstrip('0') # like 6303
example_dat_path = existing_gpcc_daily_paths[gpcc_id_to_use][0]
f = zipfile.ZipFile(example_dat_path).open(f"tw_{gpcc_id_name}.dat")
example_gpcc = pl.from_pandas(pd.read_csv(f, skiprows=1, header=None, sep=r'\s+'))

## drop unnecessary columns
example_gpcc = example_gpcc.drop([str(i) for i in range(4, 16)])

## make datetime column (apparently it's 7am-7pm)
example_gpcc = example_gpcc.with_columns(
    pl.datetime(pl.col("2"), pl.col("1"), pl.col("0"), 7).alias("time")
).drop(["0", "1", "2"])

## rename and reorder
example_gpcc = example_gpcc.rename({"3": rain_col})
example_gpcc = example_gpcc.select(['time', rain_col]) ## Reorder (to look nice)

example_gpcc

time,rain_mm
datetime[μs],f64
2002-12-01 07:00:00,3.0
2002-12-02 07:00:00,0.1
2002-12-03 07:00:00,0.2
2002-12-04 07:00:00,0.1
2002-12-05 07:00:00,1.0
…,…
2018-12-27 07:00:00,0.0
2018-12-28 07:00:00,1.0
2018-12-29 07:00:00,10.5
2018-12-30 07:00:00,4.7


In [None]:
## resample into daily (also round to 1 decimal place) TODO: check offset='7h' part
target_gauge_daily = target_gauge.group_by_dynamic("time", every='1d', offset='7h')\
    .agg([
                    pl.len().alias("n_hours"),
                    pl.col(rain_col).mean().round(1).alias(rain_col),
            ])\
    .filter(pl.col("n_hours") == 24).drop("n_hours")  # Keep only complete days (exactly 24 hourly values)
target_gauge_daily

time,rain_mm
datetime[μs],f64
2006-01-01 07:00:00,0.2
2006-01-02 07:00:00,0.0
2006-01-03 07:00:00,0.0
2006-01-04 07:00:00,0.0
2006-01-05 07:00:00,0.0
…,…
2010-12-26 07:00:00,0.2
2010-12-27 07:00:00,0.0
2010-12-28 07:00:00,0.0
2010-12-29 07:00:00,0.0


In [36]:
joined_gauges_gpcc = target_gauge_daily.join(example_gpcc, on='time', suffix=f'_GPCC_{gpcc_id_name}')
joined_gauges_gpcc = joined_gauges_gpcc.drop_nans()
joined_gauges_gpcc.head()

time,rain_mm,rain_mm_GPCC_6303
datetime[μs],f64,f64
2006-01-01 07:00:00,0.2,0.2
2006-01-02 07:00:00,0.0,0.0
2006-01-03 07:00:00,0.0,0.0
2006-01-04 07:00:00,0.0,0.0
2006-01-05 07:00:00,0.0,0.0


## Part 8 Compute factor, affinity index and correlation

In [37]:
a = np.around(joined_gauges_gpcc.filter(pl.col(rain_col) >= 0.1).min()[rain_col], 1)[0]
b = np.around(joined_gauges_gpcc.filter(pl.col(f"{rain_col}_GPCC_{gpcc_id_name}") >= 0.1).min()[f"{rain_col}_GPCC_{gpcc_id_name}"], 1)[0]
p = max(a, b, 0.1)
print(a, b, p)

0.1 0.1 0.1


In [38]:
joined_gauges_gpcc_duplicates = joined_gauges_gpcc.with_columns(
    pl.when(
        (pl.col(rain_col) > p) &
        (pl.col(f"{rain_col}_GPCC_{gpcc_id_name}") > p)
    ).then(1)
    .when(
        (pl.col(rain_col) == p) &
        (pl.col(f"{rain_col}_GPCC_{gpcc_id_name}") == p),
    ).then(1)
    .when(
        (pl.col(rain_col) == p) &
        (pl.col(f"{rain_col}_GPCC_{gpcc_id_name}") > p),
    ).then(0)
    .when(
        (pl.col(rain_col) > p) &
        (pl.col(f"{rain_col}_GPCC_{gpcc_id_name}") == p)
    ).then(0)
    .otherwise(np.nan)
    .alias("duplicate")
)

joined_gauges_gpcc = joined_gauges_gpcc.with_columns(
    pl.when(
        (pl.col(rain_col) > 0) & (pl.col(f"{rain_col}_GPCC_{gpcc_id_name}") > 0)
    ).then(pl.col(rain_col) / pl.col(f"{rain_col}_GPCC_{gpcc_id_name}"))
    .otherwise(np.nan)
    .alias("factor")
)
joined_gauges_gpcc

time,rain_mm,rain_mm_GPCC_6303,factor
datetime[μs],f64,f64,f64
2006-01-01 07:00:00,0.2,0.2,1.0
2006-01-02 07:00:00,0.0,0.0,
2006-01-03 07:00:00,0.0,0.0,
2006-01-04 07:00:00,0.0,0.0,
2006-01-05 07:00:00,0.0,0.0,
…,…,…,…
2010-12-26 07:00:00,0.2,3.0,0.066667
2010-12-27 07:00:00,0.0,0.4,
2010-12-28 07:00:00,0.0,0.0,
2010-12-29 07:00:00,0.0,0.0,


In [44]:
match = joined_gauges_gpcc_duplicates['duplicate'].value_counts().filter(pl.col('duplicate') == 1)['count'].item()
diff = joined_gauges_gpcc_duplicates['duplicate'].value_counts().filter(pl.col('duplicate') == 0)['count'].item()
perc = match / (match + diff)
p_corr = np.corrcoef(joined_gauges_gpcc[rain_col], joined_gauges_gpcc[f"{rain_col}_GPCC_{gpcc_id_name}"])[0, 1]
f_mean = joined_gauges_gpcc['factor'].drop_nans().mean()
print(f"diff: {diff}, match:{match}")
print(f"affinity: {perc}, p_corr: {p_corr}, f_mean: {f_mean}")

diff: 164, match:541
affinity: 0.7673758865248227, p_corr: 0.009778374017001294, f_mean: 0.21056017953340903


## Part 9 Compare target with neighbour (hourly and daily)
- Hourly data are first converted to daily data before the comparison is made

_Output:_ a DataFrame with a long list of neighbour columns and flags

Works by computing the difference between the target and each of its neighbours, then collating those differences and their flags into a single flag column that describes how similar the target is to its neighbours.


In [102]:
gpcc_id_to_use = 'DE_00310'
gpcc_id_name = gpcc_id_to_use.split('DE_')[-1].lstrip('0') # like 310

In [103]:
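## NOTE: as written, example_gpcc is only loaded for DE_06303 above; reload it for gpcc_id_to_use before relying on this join
joined_gauges_gpcc = target_gauge_daily.join(example_gpcc, on='time', suffix=f'_GPCC_{gpcc_id_name}')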
joined_gauges_gpcc = joined_gauges_gpcc.drop_nans()
joined_gauges_gpcc.head()

time,rain_mm,rain_mm_GPCC_310
datetime[μs],f64,f64
2006-01-01 07:00:00,0.2,0.2
2006-01-02 07:00:00,0.0,0.0
2006-01-03 07:00:00,0.0,0.0
2006-01-04 07:00:00,0.0,0.0
2006-01-05 07:00:00,0.0,0.0


## Part 9.1 Wet neighbours
- This uses the normalised difference between the target and the neighbour

TODO: Too few days are wetter at the target than at the GPCC neighbour to fit the distribution reliably

In [104]:
def normalise_data(data):
    return (data - data.min()) / (data.max() - data.min())

In [None]:
joined_gauges_gpcc_normalised_diff = joined_gauges_gpcc.with_columns(
    # get normalised difference between target and neighbour
    # (aliased with rain_col so the name matches the filters below for any unit)
    (normalise_data(pl.col(rain_col)) - normalise_data(pl.col(f'{rain_col}_GPCC_{gpcc_id_name}')))
    .alias(f'{rain_col}_normalised_diff')
)

In [None]:
joined_gauges_gpcc_normalised_diff_filtered = joined_gauges_gpcc_normalised_diff.filter(
    (pl.col(f'{rain_col}') >= 1.0) &
    (pl.col(f'{rain_col}').is_finite()) &
    (pl.col(f'{rain_col}_GPCC_{gpcc_id_name}').is_finite()) &
    (pl.col(f'{rain_col}_normalised_diff') > 0.0)
)

In [107]:
joined_gauges_gpcc_normalised_diff_filtered

time,rain_mm,rain_mm_GPCC_310,rain_mm_normalised_diff
datetime[μs],f64,f64,f64
2006-12-23 07:00:00,112.5,0.0,0.352886
2006-12-24 07:00:00,318.8,0.0,1.0
2007-03-16 07:00:00,3.1,0.1,0.008626
2007-04-22 07:00:00,208.1,0.0,0.65276
2008-06-29 07:00:00,4.2,0.0,0.013174
2008-07-29 07:00:00,4.3,0.5,0.008
2009-08-30 07:00:00,1.7,0.0,0.005332
2009-11-06 07:00:00,9.8,1.2,0.017568
2010-08-09 07:00:00,1.3,0.0,0.004078
2010-08-12 07:00:00,1.7,0.0,0.005332
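The normalised differences in the table can be partly reproduced by hand. The rows where the neighbour is 0.0 appear consistent with a target series whose minimum is 0 and whose maximum is 318.8. The sketch below assumes exactly that (the dry day is added as an assumption, and `min_max_normalise` is a hypothetical plain-Python counterpart to `normalise_data`).

```python
def min_max_normalise(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    lowest, highest = min(values), max(values)
    return [(v - lowest) / (highest - lowest) for v in values]


# Target values from the table above, plus a dry day so the minimum is 0 (assumption)
target_values = [0.0, 112.5, 318.8]
normalised = min_max_normalise(target_values)
print([round(v, 6) for v in normalised])  # [0.0, 0.352886, 1.0]
```

Because the scaling depends on the series maximum, a single extreme value such as 318.8 compresses every other day towards zero, which is worth keeping in mind when interpreting the fitted thresholds.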


In [126]:
if len(joined_gauges_gpcc_normalised_diff_filtered) < 30:
    print('Original methodology needs there to be at least 30 values to fit exponential function')

Original methodology needs there to be at least 30 values to fit exponential function


In [None]:
expon_params = scipy.stats.expon.fit(joined_gauges_gpcc_normalised_diff_filtered[f'{rain_col}_normalised_diff'])

In [120]:
# Calculate thresholds at key percentiles of the fitted distribution
expon_loc, expon_scale = expon_params
q95 = scipy.stats.expon.ppf(0.95, loc=expon_loc, scale=expon_scale)
q99 = scipy.stats.expon.ppf(0.99, loc=expon_loc, scale=expon_scale)
q999 = scipy.stats.expon.ppf(0.999, loc=expon_loc, scale=expon_scale)

In [122]:
q95, q99, q999

(np.float64(0.6113065102667703),
 np.float64(0.937536237060823),
 np.float64(1.4042654597317614))
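These thresholds can be cross-checked without SciPy. For a shifted exponential, the maximum-likelihood fit is `loc = min(x)` and `scale = mean(x) - loc`, and the quantile function is `loc - scale * ln(1 - q)`. The sketch below applies this to the ten normalised differences shown in the filtered table; the inputs are the rounded values as displayed, so agreement is only approximate, and `expon_quantile` is a hypothetical helper name.

```python
import math

# Normalised differences copied from the filtered table above (rounded as displayed)
diffs = [0.352886, 1.0, 0.008626, 0.65276, 0.013174,
         0.008, 0.005332, 0.017568, 0.004078, 0.005332]

# Maximum-likelihood fit of a shifted exponential: loc = min, scale = mean - loc
loc = min(diffs)
scale = sum(diffs) / len(diffs) - loc


def expon_quantile(q, loc, scale):
    """Inverse CDF of the exponential distribution: loc - scale * ln(1 - q)."""
    return loc - scale * math.log(1.0 - q)


q95, q99, q999 = (expon_quantile(q, loc, scale) for q in (0.95, 0.99, 0.999))
print(round(q95, 3), round(q99, 3), round(q999, 3))
```

With only ten points the fit is dominated by the two or three largest differences, which is the practical reason for the minimum of 30 values noted above.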

In [128]:
## Assign flags
joined_gauges_gpcc_normalised_diff = joined_gauges_gpcc_normalised_diff.with_columns(
    pl.when(
        (pl.col(rain_col) >= 1.0) &
        (pl.col(f'{rain_col}_normalised_diff') <= q95)
    ).then(0)
    .when(
        (pl.col(rain_col) >= 1.0) &
        (pl.col(f'{rain_col}_normalised_diff') > q95) &
        (pl.col(f'{rain_col}_normalised_diff') <= q99),
    ).then(1)
    .when(
        (pl.col(rain_col) >= 1.0) &
        (pl.col(f'{rain_col}_normalised_diff') > q99) &
        (pl.col(f'{rain_col}_normalised_diff') <= q999),
    ).then(2)
    .when(
        (pl.col(rain_col) >= 1.0) &
        (pl.col(f'{rain_col}_normalised_diff') > q999)
    ).then(3)
    .otherwise(0)
    .alias("temp_flags")
)

In [130]:
joined_gauges_gpcc_normalised_diff['temp_flags'].value_counts()

temp_flags,count
i32,u32
1,1
0,1758
2,1


## Part 9.2 Dry

In [None]:
#  elif high_or_dry == "dry":

#                 # Assign flags
#                 # - consider only whether dry 15-day periods at the target are
#                 # corroborated as dry by neighbours
#                 # - check based on whether 0, 1, 2 or >= 3 wet days are recorded at the
#                 # neighbour when the target is dry over the 15-day period
#                 # - dry flag works on the basis of fraction of dry days within 15-day
#                 # moving window, so 1 = all dry, 0 = all wet
#                 # -- truncating these fractions to 2 dp below and manipulating equalities
#                 # to work with these fractions, but could work in days not fractions if
#                 # change the convertToDrySpell function
#                 # - in dry day fraction calcs a threshold of 0 mm is currently used to
#                 # identify days as wet (i.e. any rainfall)
#                 frac_drydays = {}
#                 for d in range(1, 3 + 1):
#                     frac_drydays[d] = np.trunc((1.0 - (float(d) / 15.0)) * 10 ** 2) / (10 ** 2)
#                 conditions = [
#                     (df['ts1'] == 1.0) & (df['ts2'] == 1.0),
#                     (df['ts1'] == 1.0) & (df['ts2'] < 1.0) & (df['ts2'] >= frac_drydays[1]),
#                     (df['ts1'] == 1.0) & (df['ts2'] < frac_drydays[1]) & (df['ts2'] >= frac_drydays[2]),
#                     (df['ts1'] == 1.0) & (df['ts2'] < frac_drydays[2])]  # & (df['ts2'] >= frac_drydays[3])
#                 choices = [0, 1, 2, 3]

#                 # *** dp 27/11/2019 *** - commented out line below so changed to default=0
#                 # normalized_df['temp_flags'] = np.select(conditions, choices, default=np.nan)
#                 # normalized_df['temp_flags'] = np.select(conditions, choices, default=0)
#                 df['temp_flags'] = np.select(conditions, choices, default=0)

#                 # tempFlags = normalized_df['temp_flags']
#                 temp_flags = df['temp_flags']
#                 return temp_flags
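The commented-out logic above compares dry-day fractions over a 15-day window. The sketch below restates it in plain Python to show where the cut-offs come from. `dry_fraction_threshold` and `dry_flag` are hypothetical names, and both inputs are assumed to be dry-day fractions already truncated to two decimal places, as the original comments describe.

```python
import math

WINDOW_DAYS = 15  # length of the moving window in the commented code above


def dry_fraction_threshold(n_wet_days, window_days=WINDOW_DAYS):
    """Dry-day fraction for a window with n_wet_days wet days, truncated to 2 dp."""
    return math.trunc((1.0 - n_wet_days / window_days) * 100) / 100


def dry_flag(target_dry_fraction, neighbour_dry_fraction):
    """Flag a fully dry target window by how many wet days the neighbour recorded."""
    if target_dry_fraction != 1.0:
        return 0  # only fully dry target windows are assessed
    one_wet = dry_fraction_threshold(1)  # 0.93
    two_wet = dry_fraction_threshold(2)  # 0.86
    if neighbour_dry_fraction == 1.0:
        return 0  # neighbour corroborates the dry spell
    if neighbour_dry_fraction >= one_wet:
        return 1  # one wet day at the neighbour
    if neighbour_dry_fraction >= two_wet:
        return 2  # two wet days at the neighbour
    return 3      # three or more wet days at the neighbour


print([dry_fraction_threshold(d) for d in (1, 2, 3)])  # [0.93, 0.86, 0.8]
print(dry_flag(1.0, 0.86))  # 2
```

Working in whole wet-day counts instead of truncated fractions would avoid the floating-point equality checks, which is the alternative the original comments also mention.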


## Part 10 Compare target with neighbour (monthly) 
- Works differently from hourly and daily

# QC16 - Daily neighbours (wet)
[Back to Index](#Table-of-contents)

#### Differences from `intense-qc`: 
- There are many; the main one is that this implementation exposes the parameters of the wet and dry neighbour checks so they can be tuned

In [42]:
## resample data into daily (NOTE: the original method makes sure this is 7am-7pm)


In [43]:
example_id = 'DE_06303'

In [44]:
target_gauge

time,rain_mm
datetime[μs],f64
2006-01-01 00:00:00,0.9
2006-01-01 01:00:00,0.3
2006-01-01 02:00:00,0.3
2006-01-01 03:00:00,0.0
2006-01-01 04:00:00,0.0
…,…
2010-12-31 19:00:00,0.0
2010-12-31 20:00:00,0.0
2010-12-31 21:00:00,0.0
2010-12-31 22:00:00,0.0


# QC17 - Hourly neighbours (wet) 
[Back to Index](#Table-of-contents)

#### Differences from `intense-qc`: 
- 

In [40]:
example_id = 'DE_06264'

In [41]:
neighbouring_gauge = get_neighbour_gauge_data(neighbour_gauge_id=example_id, time_multiplying_factor=MULTIPLYING_FACTORS['hourly'])

In [42]:
neighbouring_gauge

time,rain_mm
datetime[μs],f64
2006-01-01 00:00:00,0.0
2006-01-01 01:00:00,0.1
2006-01-01 02:00:00,0.0
2006-01-01 03:00:00,0.0
2006-01-01 04:00:00,0.0
…,…
2010-12-31 19:00:00,0.0
2010-12-31 20:00:00,0.0
2010-12-31 21:00:00,0.0
2010-12-31 22:00:00,0.0


In [None]:
for n_id in all_neighbour_ids_paths.keys():
    print(n_id)
    one_neighbouring_gauge = get_neighbour_gauge_data(neighbour_gauge_id=n_id, time_multiplying_factor=MULTIPLYING_FACTORS['hourly'])
    joined_gauges = target_gauge.join(one_neighbouring_gauge, on='time', suffix=f'_{n_id}')
    joined_gauges = joined_gauges.drop_nans()

    ## resample into daily (also round to 1 decimal place) TODO: remove offset part
    joined_gauges = joined_gauges.group_by_dynamic("time", every='1d', offset='7h')\
        .agg([
                     pl.len().alias("n_hours"),
                     pl.col(rain_col).mean().round(1).alias(rain_col),
                     pl.col(f'{rain_col}_{n_id}').mean().round(1).alias(f'{rain_col}_{n_id}')
             ])\
        .filter(pl.col("n_hours") == 24).drop("n_hours")  # Keep only complete days (exactly 24 hourly values)

    ## NOTE: is this necessary? Why not just read resolution of each data?
    a = np.around(joined_gauges.filter(pl.col(rain_col) >= 0.1).min()[rain_col], 1)[0]
    b = np.around(joined_gauges.filter(pl.col(f"{rain_col}_{n_id}") >= 0.1).min()[f"{rain_col}_{n_id}"], 1)[0]
    p = max(a, b, 0.1)
    print(a, b, p)

    ## TODO: rename all variables
    joined_gauges_duplicates = joined_gauges.with_columns(
        pl.when(
            (pl.col(rain_col) > p) &
            (pl.col(f"{rain_col}_{n_id}") > p)
        ).then(1)
        .when(
            (pl.col(rain_col) == p) &
            (pl.col(f"{rain_col}_{n_id}") == p),
        ).then(1)
        .when(
            (pl.col(rain_col) == p) &
            (pl.col(f"{rain_col}_{n_id}") > p),
        ).then(0)
        .when(
            (pl.col(rain_col) > p) &
            (pl.col(f"{rain_col}_{n_id}") == p)
        ).then(0)
        .otherwise(np.nan)
        .alias("duplicate")
    )

    joined_gauges = joined_gauges.with_columns(
        pl.when(
            (pl.col(rain_col) > 0) & (pl.col(f"{rain_col}_{n_id}") > 0)
        ).then(pl.col(rain_col) / pl.col(f"{rain_col}_{n_id}"))
        .otherwise(np.nan)
        .alias("factor")
    )

    match = joined_gauges_duplicates['duplicate'].value_counts().filter(pl.col('duplicate') == 1)['count'].item()
    diff = joined_gauges_duplicates['duplicate'].value_counts().filter(pl.col('duplicate') == 0)['count'].item()
    affinity = match / (match + diff)
    p_corr = np.corrcoef(joined_gauges[rain_col], joined_gauges[f"{rain_col}_{n_id}"])[0, 1]
    f_mean = joined_gauges['factor'].drop_nans().mean()
    print(f"diff: {diff}, match:{match}")
    print(f"affinity: {affinity}, p_corr: {p_corr}, f_mean: {f_mean}")

    print("#"*15)

DE_06264
0.1 0.1 0.1
diff: 174, match:379
perc: 0.6853526220614828, p_corr: 0.0024745444582926776, f_mean: 2.8794833506858817
###############
DE_01300
0.1 0.1 0.1
diff: 94, match:508
perc: 0.8438538205980066, p_corr: 0.001977647232117444, f_mean: 2.1665778878233475
###############
DE_06303
0.1 0.1 0.1
diff: 140, match:487
perc: 0.7767145135566188, p_corr: 0.004315394781542508, f_mean: 2.36290034326275
###############
DE_02718
0.1 0.1 0.1
diff: 178, match:366
perc: 0.6727941176470589, p_corr: 0.003670924337190943, f_mean: 3.0951548717106143
###############
DE_04488
0.1 0.1 0.1
diff: 98, match:439
perc: 0.8175046554934823, p_corr: 0.003425391979961906, f_mean: 2.139130209744735
###############
DE_00310
0.1 0.1 0.1
diff: 124, match:455
perc: 0.7858376511226253, p_corr: 0.004957314033425343, f_mean: 2.3426827802109726
###############
DE_03215
0.1 0.1 0.1
diff: 99, match:268
perc: 0.7302452316076294, p_corr: 0.04958692601219599, f_mean: 2.738944066476524
###############
DE_00390
0.1 0.1 0.1

# QC18 - Daily neighbours (dry) 
[Back to Index](#Table-of-contents)

#### Differences from `intense-qc`: 
- 

# QC19 - Hourly neighbours (dry) 
[Back to Index](#Table-of-contents)

#### Differences from `intense-qc`: 
- 

# QC20 - Monthly neighbours
[Back to Index](#Table-of-contents)

#### Differences from `intense-qc`: 
- 

# QC21 - Timing offset 
[Back to Index](#Table-of-contents)

#### Differences from `intense-qc`: 
- 

# QC22 - Pre-QC Affinity  
[Back to Index](#Table-of-contents)

#### Differences from `intense-qc`: 
- 

# QC23 - Pre-QC Pearson
[Back to Index](#Table-of-contents)

#### Differences from `intense-qc`: 
- 

# QC24 - Daily factor  
[Back to Index](#Table-of-contents)

#### Differences from `intense-qc`: 
- 

# QC25 - Monthly factor
[Back to Index](#Table-of-contents)

#### Differences from `intense-qc`: 
- 