# Generate summed flag count tables

This notebook creates one QAQC flag count CSV file per network from the corresponding eraqc_counts_timestep files, which were generated in the final processing step for stations in the Historical Data Pipeline. These tables are then used to generate statistics for the QAQC success report.

This is carried out in two steps:

1. Generate the per-network QAQC flag count tables, at native and hourly timesteps

2. Generate one flag count table that sums all per-network tables, at native and hourly timesteps


The notebook uses the following functions:


- _pairwise_sum(): helper function that merges two input flag tables, used by network_sum_flag_counts() and total_sum_flag_counts().

- network_sum_flag_counts(): sums all station flag count tables for a given network, creating one flag count table for that network

- generate_station_tables(): runs network_sum_flag_counts() for every network

- total_sum_flag_counts(): sums all network flag count tables, creating one final flag count table 
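
The summing idea behind these functions can be tried on toy data. The sketch below is illustrative only: the two station tables, their variable columns (`tas`, `pr`), and the counts are made up, and `pairwise_sum_sketch` is a hypothetical stand-in for the helper, not the notebook's own function. It stacks two tables and adds up rows that share a flag value, then folds a list of tables with `functools.reduce`.

```python
from functools import reduce

import pandas as pd

# Hypothetical station tables; column names and counts are made up for illustration.
station_a = pd.DataFrame(
    {
        "eraqc_flag_values": ["23.0", "no_flag", "total_obs_count"],
        "tas": [2, 98, 100],
        "pr": [0, 100, 100],
    }
)
station_b = pd.DataFrame(
    {
        "eraqc_flag_values": ["10.0", "no_flag", "total_obs_count"],
        "tas": [0, 50, 50],
        "pr": [5, 45, 50],
    }
)


def pairwise_sum_sketch(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Stack two tables and add up the rows that share a flag value."""
    stacked = pd.concat([left, right])
    return stacked.groupby("eraqc_flag_values", as_index=False).sum()


# Fold the list of station tables into one network-level table.
network_total = reduce(pairwise_sum_sketch, [station_a, station_b])
print(network_total)
```

A flag value that appears in only one table is carried through unchanged, because its group contains a single row.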

## Step 0: Environment set-up

In [2]:
import boto3
import numpy as np
import pandas as pd
import xarray as xr
import matplotlib.pyplot as plt
from io import BytesIO, StringIO
from functools import reduce

import inspect

import logging
# Create a simple logger that just prints to the console
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger()

plt.rcParams["figure.dpi"] = 300

In [3]:
# Set AWS credentials
s3 = boto3.resource("s3")
s3_cl = boto3.client("s3")  # for lower-level processes

# Set relative paths to other folders and objects in repository.
bucket_name = "wecc-historical-wx"
stations_csv_path = f"s3://{bucket_name}/2_clean_wx/temp_clean_all_station_list.csv"
qaqc_dir = "3_qaqc_wx"
merge_dir = "4_merge_wx"

INFO:botocore.credentials:Found credentials in shared credentials file: ~/.aws/credentials


### The functions

In [4]:
def _pairwise_sum(flag_df_1, flag_df_2) -> pd.DataFrame:
    """
    Sums two input flag count dataframes. This is a helper function for
    network_sum_flag_counts() and total_sum_flag_counts().

    Parameters
    ----------
    flag_df_1: pd.DataFrame
        dataframe of previously summed station flag counts
    flag_df_2: pd.DataFrame
        flag counts dataframes for next station

    Returns
    -------
    summed_df: pd.DataFrame

    """
    if len(flag_df_1) == 0:
        return flag_df_2
    else:
        total_df = pd.concat([flag_df_1, flag_df_2])

        summed_df = total_df.groupby('eraqc_flag_values', as_index=False).sum()
        return summed_df

In [5]:
def network_sum_flag_counts(network: str, timestep: str) -> None:
    """
    Sums all station QAQC flag counts in a network for a given timestep (hourly or native) and sends to AWS. 
    These counts are used to generate statistics for the QAQC success report.

    Parameters
    ----------
    network: str
        network name
    timestep: str
        if set to 'hourly', merge all hourly QAQC flag count tables
        if set to 'native', merge all native timestep QAQC flag count tables

    Returns
    -------
    None

    """
    ## Setup

    # read in flag meanings CSV

    flag_meanings = pd.read_csv("era_qaqc_flag_meanings.csv")

    # only run for a valid "timestep" input
    if timestep not in ("hourly", "native"):
        print("invalid timestep: ", timestep)
        return None

    # the function iteratively adds in flag counts to this dataframe
    summed_counts_df = []

    # point to folder containing station flag count CSVs
    flags_prefix = f"{merge_dir}/{network}/eraqc_counts_{timestep}_timestep"

    ## Merge flag counts

    # loop through all CSVs at the given level
    for item in s3.Bucket(bucket_name).objects.filter(Prefix=flags_prefix):
        obj = s3_cl.get_object(Bucket=bucket_name, Key=item.key)
        flags = pd.read_csv(obj["Body"])
        # the CSV is empty
        if flags.empty:
            continue
        # the CSV is not empty
        else:
            # send current dataframe and dataframe of previously summed counts to helper function
            summed_counts_df = _pairwise_sum(summed_counts_df, flags)

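    ## Format final dataframe

    # stop here if no non-empty station tables were found for this network;
    # summed_counts_df is still an empty list and cannot be sorted
    if len(summed_counts_df) == 0:
        print(f"No flag counts found for {network} at the {timestep} timestep")
        return None

    # merging with the flag meanings dataframe (to include all flag values and
    # their meanings) is not done here; the Test section below tries it out

    # Sort rows by flag value (string values sort lexicographically, not numerically)
    summed_counts_df = summed_counts_df.sort_values(by="eraqc_flag_values")

    ## Send final counts file to AWS as CSV

    csv_s3_filepath = f"s3://wecc-historical-wx/4_merge_wx/per_network_flag_counts/{network}_flag_counts_{timestep}_timestep.csv"
    # summed_counts_df.to_csv(csv_s3_filepath, index=False)
    print(
        f"Sending summed counts dataframe for {network} to: {csv_s3_filepath}"
    )

    return summed_counts_df  # None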

In [6]:
def total_sum_flag_counts(timestep: str) -> None:
    """
    Sums all network-level QAQC flag counts for a given timestep (hourly or native) and sends to AWS. 
    These counts are used to generate statistics for the QAQC success report.

    Parameters
    ----------
    timestep: str
        if set to 'hourly', merge all hourly QAQC flag count tables
        if set to 'native', merge all native timestep QAQC flag count tables

    Returns
    -------
    None

    """
    ## Setup

    # only run for a valid "timestep" input
    if timestep not in ("hourly", "native"):
        print("invalid timestep: ", timestep)
        return None

    # the function iteratively adds in flag counts to this dataframe
    summed_counts_df = []

    # point to folder containing network-level flag count CSVs
    flags_prefix = f"{merge_dir}/per_network_flag_counts"

    ## Merge flag counts

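    # loop through all network CSVs, keeping only those for the requested timestep
    # (the per-network folder holds both the hourly and the native tables)
    for item in s3.Bucket(bucket_name).objects.filter(Prefix=flags_prefix):
        if not item.key.endswith(f"_flag_counts_{timestep}_timestep.csv"):
            continue
        obj = s3_cl.get_object(Bucket=bucket_name, Key=item.key)
        flags = pd.read_csv(obj["Body"])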
        # the CSV is empty
        if flags.empty:
            continue
        # the CSV is not empty
        else:
            # send current dataframe and dataframe of previously summed counts to helper function
            summed_counts_df = _pairwise_sum(summed_counts_df, flags)

    ## Send final counts file to AWS as CSV
    if len(summed_counts_df) == 0:
        return None
    else:
        csv_s3_filepath = (
            f"s3://wecc-historical-wx/4_merge_wx/total_flag_counts_{timestep}_timestep.csv"
        )
        # summed_counts_df.to_csv(csv_s3_filepath, index=False)
        print(f"Sending final summed counts dataframe to: {csv_s3_filepath}")

        return summed_counts_df  # None

In [7]:
def generate_station_tables(timestep: str) -> None:
    """
    Runs network_sum_flag_counts() for every network.

    Parameters
    ----------
    timestep: str
        if set to 'hourly', merge all hourly QAQC flag count tables
        if set to 'native', merge all native timestep QAQC flag count tables

    Returns
    -------
    None

    """
    # only run for a valid "timestep" input
    if timestep not in ("hourly", "native"):
        print("invalid timestep: ", timestep)
        return None

    station_list = pd.read_csv(stations_csv_path)
    network_list = station_list["network"].unique()

    for network in network_list:
        network_sum_flag_counts(network, timestep)

    return None

## Step 1: Generate flag sum tables for every network

First, loop through every network, combining each of their station flag count tables into one table. The result is one flag count table at each timestep - native and hourly - for every network.

This will take around 1 hour to run for both timesteps. 
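
The per-network combination can be illustrated without touching S3. In the sketch below, the CSV strings, the station names, and the `tas` column are hypothetical stand-ins for the station files. Note that it concatenates all non-empty tables once and groups them in a single pass, which is a different route to the same totals than repeated pairwise sums.

```python
from io import StringIO

import pandas as pd

# Hypothetical CSV contents standing in for the station files on S3.
station_csvs = {
    "NET_A1": "eraqc_flag_values,tas\n23.0,2\nno_flag,98\n",
    "NET_A2": "eraqc_flag_values,tas\n",  # header only: an empty table
    "NET_A3": "eraqc_flag_values,tas\n23.0,1\nno_flag,49\n",
}

tables = []
for station, text in station_csvs.items():
    flags = pd.read_csv(StringIO(text))
    if flags.empty:
        continue  # skip stations that have no counts
    tables.append(flags)

# One concat followed by one groupby sums every station table at once.
network_counts = (
    pd.concat(tables).groupby("eraqc_flag_values", as_index=False).sum()
)
print(network_counts)
```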

In [None]:
generate_station_tables('native')

In [None]:
generate_station_tables("hourly")

### Test

In [8]:
network = "CDEC"
station = "CDEC_PVP"
timestep = 'native'

In [9]:
result = network_sum_flag_counts(network, timestep)

Sending summed counts dataframe for CDEC to: s3://wecc-historical-wx/4_merge_wx/per_network_flag_counts/CDEC_flag_counts_native_timestep.csv


In [84]:
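# strip a trailing ".0"; the dot is escaped and the pattern anchored so that
# values such as "10" are left intact
result["eraqc_flag_values"] = result["eraqc_flag_values"].str.replace(
    r"\.0$", "", regex=True
)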

In [112]:
flag_meanings = pd.read_csv('era_qaqc_flag_meanings.csv')

In [113]:
# the flag values in the count tables were saved as strings, so the merge keys need a common dtype:
# either cast Flag_value to string (commented out below) or cast the counts to int (next cell)

# flag_meanings['Flag_value'] = flag_meanings['Flag_value'].astype(str)
flag_meanings = flag_meanings.rename(columns={'Flag_value':'eraqc_flag_values'})

In [114]:
result = result[~result['eraqc_flag_values'].isin(['no_flag','total_obs_count'])].astype(int)

In [115]:
test = result.merge(flag_meanings,on='eraqc_flag_values',how='outer')

In [116]:
test = test.sort_values(by="eraqc_flag_values")

In [117]:
test

Unnamed: 0,eraqc_flag_values,accum_pr,elevation,pr,hurs,ps_altimeter,ps_derived,ps,rsds,sfcWind_dir,sfcWind,tas,tdps_derived,QAQC_function,Flag_meaning
15,1,,,,,,,,,,,,,spurious_buoy_check,Suspect observation (i.e. buoy reports wind du...
16,2,,,,,,,,,,,,,spurious_buoy_check,Out of range for station official data tempora...
17,3,,,,,,,,,,,,,qaqc_elev_infill,Elevation infilled from DEM (USGS 3DEP)
18,4,,,,,,,,,,,,,qaqc_elev_infill,Elevation infilled from station
19,5,,,,,,,,,,,,,qaqc_elev_infill,Elevation manually infilled to be 0.0 m; occur...
20,6,,,,,,,,,,,,,qaqc_sensor_height_t,INACTIVE FLAG: Thermometer height missing
21,7,,,,,,,,,,,,,qaqc_sensor_height_t,INACTIVE FLAG: Thermometer height not 2 meters
22,8,,,,,,,,,,,,,qaqc_sensor_height_w,INACTIVE FLAG: Anemometer height missing
23,9,,,,,,,,,,,,,qaqc_sensor_height_w,INACTIVE FLAG: Anemometer height not 10 meters
24,10,,,,,,,,,,,,,qaqc_precip_logic_nonegvals,Precipitation value reported below 0 (negative...
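
The clean-up and merge steps from the test cells can be replayed on toy data. This is a minimal sketch under stated assumptions: the counts, the `pr` column values, and the three-row meanings table are made up (the flag numbers and function names are borrowed from the output above), and the real CSV may contain more columns.

```python
import pandas as pd

# Hypothetical summed counts; flag values arrive as strings such as "10.0".
counts = pd.DataFrame(
    {
        "eraqc_flag_values": ["10.0", "3.0", "no_flag", "total_obs_count"],
        "pr": [4, 0, 96, 100],
    }
)
# Hypothetical excerpt of a flag meanings table.
meanings = pd.DataFrame(
    {
        "Flag_value": [3, 4, 10],
        "QAQC_function": [
            "qaqc_elev_infill",
            "qaqc_elev_infill",
            "qaqc_precip_logic_nonegvals",
        ],
    }
)

# Drop the summary rows, then strip a trailing ".0" and cast the flags to int.
is_summary = counts["eraqc_flag_values"].isin(["no_flag", "total_obs_count"])
flag_rows = counts[~is_summary].copy()
flag_rows["eraqc_flag_values"] = (
    flag_rows["eraqc_flag_values"].str.replace(r"\.0$", "", regex=True).astype(int)
)

# An outer merge keeps flags that never occurred in the counts (their counts are NaN).
merged = flag_rows.merge(
    meanings.rename(columns={"Flag_value": "eraqc_flag_values"}),
    on="eraqc_flag_values",
    how="outer",
).sort_values("eraqc_flag_values")
print(merged)
```

With integer keys the final sort is numeric, so flag 3 comes before flag 10.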


## Step 2: Generate total flag sum table

Now combine all the network flag count tables generated in step 1 into one final flag count table, first at the hourly timestep and then at the native timestep.

Step 1 must be complete before moving on to this step.
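
Step 1 writes the hourly and native tables for every network into the same `per_network_flag_counts` folder, so the total for one timestep should only read the files for that timestep. The sketch below shows one way to select them by file name; `select_network_keys` is a hypothetical helper and the key listing is made up (apart from the CDEC path pattern shown in Step 1).

```python
def select_network_keys(keys: list[str], timestep: str) -> list[str]:
    """Return the keys whose file name ends with the requested timestep suffix."""
    suffix = f"_flag_counts_{timestep}_timestep.csv"
    return [key for key in keys if key.endswith(suffix)]


# Hypothetical listing of the per-network folder; NETB is a made-up network name.
keys = [
    "4_merge_wx/per_network_flag_counts/CDEC_flag_counts_native_timestep.csv",
    "4_merge_wx/per_network_flag_counts/CDEC_flag_counts_hourly_timestep.csv",
    "4_merge_wx/per_network_flag_counts/NETB_flag_counts_hourly_timestep.csv",
]

hourly_keys = select_network_keys(keys, "hourly")
native_keys = select_network_keys(keys, "native")
print(hourly_keys)
print(native_keys)
```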

In [None]:
total_sum_flag_counts('hourly')

In [None]:
total_sum_flag_counts("native")