# Quality Control (QC) Notebook

In this notebook we calculate quality statistics and quality flags for the timeseries data (and possibly also the catchment data).

We need two flagging systems:
- Flag the entire station
    - At least 10 years of Q data for the station (NaNs not counted)
    - No Q available (this is covered by the one above)
    - Percentage of NaN values
    - Longest gap
    - Catchments
        - Percentage of the catchment area (EZG) outside of Germany
        - Overlap between catchments
- Flag values in the timeseries data
    - Negative values

The first kind of information (about the entire station) can be added to the metadata file.  
The second kind (flags for individual values) can be added to the data/ folder of each station.
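
As a rough idea of what the value-level QC could look like, here is a minimal sketch that flags negative values. The function name `flag_negative_q`, the flag column name and the toy DataFrame are assumptions for illustration only; they are not part of `camelsp`.

```python
import numpy as np
import pandas as pd


def flag_negative_q(df: pd.DataFrame, column: str = "q") -> pd.DataFrame:
    """Return a copy of df with a boolean column marking negative values.

    Hypothetical helper, not part of camelsp.
    """
    out = df.copy()
    # NaN < 0 evaluates to False, so missing values are not flagged here
    out[f"{column}_flag_negative"] = out[column] < 0
    return out


# toy data standing in for the output of Station.get_data()
toy = pd.DataFrame(
    {"q": [1.2, -0.5, np.nan, 0.0, 3.4]},
    index=pd.date_range("2000-01-01", periods=5, freq="D"),
)
flagged = flag_negative_q(toy)
print(flagged)
```

Keeping the flag in a separate column leaves the original values untouched, so the decision to drop or correct them can be made later.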

For now, we focus on the first issue (entire station statistics).

## Station-wide quality control / metrics

In [1]:
from camelsp import Station, get_metadata
from tqdm import tqdm

### flag_q_more_than_10_years 

Calculate which stations have at least 10 years of Q.  
Do not use `dateindex.max() - dateindex.min()`, because that time span also covers the NaNs. We want **10 years of values**, i.e. at least 3,650 non-NaN values.
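
To illustrate the difference, here is a small sketch with synthetic data (the series below is made up and not taken from any station): the index spans 12 years, but only the first and the last year contain values.

```python
import numpy as np
import pandas as pd

# 12 years of daily dates, but only the years 2000 and 2011 contain values
index = pd.date_range("2000-01-01", "2011-12-31", freq="D")
q = pd.Series(np.nan, index=index)
q[(q.index.year == 2000) | (q.index.year == 2011)] = 1.0

threshold = 365 * 10

# time span between first and last date: ignores the gap in between
span_days = (q.index.max() - q.index.min()).days

# number of non-NaN values: what we actually want to compare to the threshold
n_values = int(q.count())

passes_by_span = span_days >= threshold
passes_by_count = n_values >= threshold
print(span_days, n_values, passes_by_span, passes_by_count)
```

The span-based check would accept this series, while the count-based check rejects it.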

In [2]:
# set threshold for minimum number of days / data points
threshold = 365*10

# get metadata
meta = get_metadata()

camels_ids = meta['camels_id'].values

for camels_id in tqdm(camels_ids):
    s = Station(camels_id)

    # get data
    df = s.get_data()

    # count q values that are not NaN; stations without a q column count as 0
    q_count = df['q'].count() if 'q' in df.columns else 0

    # the flag is True only if the station has at least `threshold` valid q values
    meta.loc[meta['camels_id'] == camels_id, 'flag_q_more_than_10_years'] = bool(q_count >= threshold)


# save metadata
meta.to_csv('../output_data/metadata/metadata.csv', index=False)

100%|██████████| 2870/2870 [02:41<00:00, 17.78it/s]


### Percentage NaN

Decide on a threshold if this should become a flag; otherwise do not use it as a flag and include the percentage in the general metadata instead (move to merge_metadata?).
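
Independent of that decision, the metric itself could be computed as in the following sketch. The helper `percentage_nan` and its `trim` option are assumptions for illustration, not existing `camelsp` functions; whether leading and trailing NaNs should count is still open.

```python
import numpy as np
import pandas as pd


def percentage_nan(series: pd.Series, trim: bool = False) -> float:
    """Percentage of NaN values in a series (hypothetical helper).

    With trim=True, only the period between the first and the last
    valid value is considered.
    """
    if trim:
        # label-based slicing includes both end points;
        # for an all-NaN series both are None and nothing is trimmed
        series = series.loc[series.first_valid_index():series.last_valid_index()]
    if len(series) == 0:
        return float("nan")
    return float(series.isna().mean() * 100)


# toy data standing in for df['q']
toy_q = pd.Series(
    [np.nan, 1.0, np.nan, 3.0, np.nan],
    index=pd.date_range("2000-01-01", periods=5, freq="D"),
)
pct_all = percentage_nan(toy_q)
pct_trimmed = percentage_nan(toy_q, trim=True)
print(pct_all, pct_trimmed)
```

Note that this only counts NaNs that are present as rows; dates that are missing from the index entirely would have to be added first, e.g. by reindexing to a full daily range.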

In [38]:
s = Station("DE110000")

df = s.get_data()

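# index (timestamp) of the first non-NaN value in df['q']
df['q'].first_valid_index()

# index (timestamp) of the last non-NaN value in df['q'] (only this last expression is displayed)
df['q'].last_valid_index()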

Timestamp('2021-12-31 00:00:00')
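
Based on the first and last valid index, the "longest gap" metric from the list at the top could be computed roughly as follows. This is only a sketch: `longest_gap` is a hypothetical helper, and treating an all-NaN series as one single gap is an assumption.

```python
import numpy as np
import pandas as pd


def longest_gap(series: pd.Series) -> int:
    """Longest run of consecutive NaNs between the first and last valid value.

    Hypothetical helper; the result is a number of time steps
    (days for daily data).
    """
    first = series.first_valid_index()
    last = series.last_valid_index()
    if first is None:
        # assumption: a series without any valid value is one single gap
        return len(series)

    is_nan = series.loc[first:last].isna()
    # every valid value increases the counter, so each NaN run shares
    # the id of the valid value directly before it
    run_id = (~is_nan).cumsum()
    # number of NaNs per run; the maximum is the longest gap
    return int(is_nan.groupby(run_id).sum().max())


# toy data standing in for df['q']
toy_q = pd.Series(
    [np.nan, 1.0, np.nan, np.nan, 3.0, np.nan, 5.0, np.nan],
    index=pd.date_range("2000-01-01", periods=8, freq="D"),
)
gap = longest_gap(toy_q)
print(gap)
```

Leading and trailing NaNs are ignored here, because they lie outside the observation period of the station.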