# Quality Control (QC) Notebook

In this notebook we calculate quality statistics and quality flags for the timeseries data.

We need two flagging systems:
- Flag the entire station
    - At least 10 years of Q data for the station (not counting NaN values)
    - No Q available (this is covered by the one above)
    - Percentage of NaN values
    - Longest gap
    - Percentage of the catchment (EZG) outside of Germany
- Flag values in the timeseries data
    - negative values

The first kind of information (about the entire station) can be added to the metadata file.  
The second kind of QC (flags for individual values) can be added to the data/ folder of each station.
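As an illustration of the second flagging system, here is a minimal sketch that flags negative values in a timeseries. It runs on toy data rather than real station data, and the helper `flag_negative` and the column name `q_negative` are hypothetical, not part of `camelsp`.

```python
import numpy as np
import pandas as pd


def flag_negative(series: pd.Series) -> pd.Series:
    """Return True where a value is negative. NaN compares as False."""
    return series < 0


# toy data, not taken from any real station
dates = pd.date_range("2000-01-01", periods=5, freq="D")
df = pd.DataFrame({"q": [1.2, -0.5, np.nan, 0.0, 3.4]}, index=dates)

# hypothetical flag column, stored next to the values it describes
df["q_negative"] = flag_negative(df["q"])
```

Because a comparison with NaN evaluates to False, missing values are not flagged as negative here; whether they should get a separate flag is a design decision this sketch leaves open.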

For now, we focus on the first issue (entire station statistics).
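One of the station-wide metrics listed above, the longest gap, could be computed roughly as follows. This is a sketch under the assumption that the timeseries has a regular daily index, so that a gap shows up as a run of consecutive NaN values; the helper name `longest_nan_gap` is hypothetical.

```python
import numpy as np
import pandas as pd


def longest_nan_gap(series: pd.Series) -> int:
    """Length of the longest run of consecutive NaN values."""
    is_nan = series.isna()
    # every non-NaN value starts a new group; NaNs inherit the id of the preceding value
    group_id = (~is_nan).cumsum()
    # summing the boolean mask per group counts the NaNs in each run
    runs = is_nan.groupby(group_id).sum()
    return int(runs.max()) if len(runs) else 0


# toy data: runs of 2 and 3 missing days
dates = pd.date_range("2000-01-01", periods=8, freq="D")
q = pd.Series([1.0, np.nan, np.nan, 2.0, np.nan, np.nan, np.nan, 3.0], index=dates)
gap = longest_nan_gap(q)
```

If dates are missing from the index entirely, the series would first have to be reindexed to a full daily range for this count to be meaningful.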

## Station wide quality control / metrics

In [8]:
from camelsp import Station, get_metadata
from tqdm import tqdm

### flag_q_more_than_10_years 

Calculate which stations have at least 10 years of Q.  
Do not use `dateindex.max() - dateindex.min()`, because that time span also includes NaNs. We want **10 years of values**, i.e. a minimum of 365 * 10 = 3,650 non-NaN values.
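The difference between the two approaches can be seen on a toy series. The sketch below uses made-up data: 20 calendar years of daily dates, of which only the first 3,000 days hold a value.

```python
import numpy as np
import pandas as pd

# toy series: 20 years of daily dates, only the first 3,000 days have values
dates = pd.date_range("2000-01-01", "2019-12-31", freq="D")
q = pd.Series(np.nan, index=dates)
q.iloc[:3000] = 1.0

threshold = 365 * 10

# the date span counts NaN days as well and would pass the threshold
span_days = (q.index.max() - q.index.min()).days

# count() only counts non-NaN values and fails the threshold
valid_days = q.count()
has_10_years = valid_days >= threshold
```

Here the date span is well above the threshold although the series holds fewer than 3,650 values, which is why the flag is based on `count()`.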

In [28]:
# set threshold for minimum number of days / data points
threshold = 365*10

# get metadata
meta = get_metadata()

camels_ids = meta['camels_id'].values

for camels_id in tqdm(camels_ids):
    s = Station(camels_id)

    # get data
    df = s.get_data()

    # count q values that are not NaN; a station without a q column has no Q data
    q_count = df['q'].count() if 'q' in df.columns else 0

    # the flag is True only if the station has at least `threshold` valid q values
    meta.loc[meta['camels_id'] == camels_id, 'flag_q_more_than_10_years'] = q_count >= threshold


100%|██████████| 2870/2870 [01:13<00:00, 38.92it/s]


### Percentage NaN

Decide on a threshold if this should become a flag, or do not use it as a flag and include the percentage in the general metadata instead (move to merge_metadata?).

In [34]:
# inspect the data of the last station loaded above (the percentage of NaN is not calculated yet)
df

| date | q | q_flag | w | w_flag |
|---|---|---|---|---|
| 1999-05-21 | | | | False |
| 1999-05-22 | | | | False |
| 1999-05-23 | | | | False |
| 1999-05-24 | | | | False |
| 1999-05-25 | | | | False |
| ... | ... | ... | ... | ... |
| 2021-12-27 | | | | False |
| 2021-12-28 | | | | False |
| 2021-12-29 | | | | False |
| 2021-12-30 | | | | False |
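The percentage of NaN described in this section could be computed along the following lines. This is a sketch on toy data, not the notebook's final implementation; the helper `percentage_nan` and the variable `q_perc_nan` are hypothetical names.

```python
import numpy as np
import pandas as pd


def percentage_nan(series: pd.Series) -> float:
    """Share of NaN values in the series, in percent."""
    if len(series) == 0:
        return float("nan")
    # the mean of a boolean mask is the fraction of True values
    return float(series.isna().mean() * 100)


# toy data: 3 of 8 values are missing
dates = pd.date_range("1999-05-21", periods=8, freq="D")
df = pd.DataFrame(
    {"q": [np.nan, np.nan, 1.0, 2.0, np.nan, 3.0, 4.0, 5.0]}, index=dates
)
q_perc_nan = percentage_nan(df["q"])
```

The resulting number could either be stored as is in the metadata or compared against a threshold to derive a boolean flag, which is the decision left open above.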
