# Success Report Statistics

per station 
- total % flagged
- % flagged per variable
- % flagged per QAQC flag

NEED: per station raw counts tables

per network
- % flagged per variable (highest and lowest)
- % flagged per QAQC flag (highest and lowest)
- % flagged per station (highest and lowest)

NEED: raw counts per variable, raw counts per QAQC flag, raw counts per station
HOW: sum total and flagged per station, variable, and QAQC flag

total
- % flagged total
- % flagged per network
- % flagged per variable (highest and lowest)
- % flagged per QAQC flag (highest and lowest)

NEED: raw counts per network, raw counts per variable, raw counts per QAQC flag
HOW: sum total and flagged per network, variable, and QAQC flag
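
The "sum total and flagged, then divide" step can be sketched as follows. This is only an illustration: the long-format layout, the column names (`n_flagged`, `n_total`), the network and station names, and the flag values are all assumptions made up for the example, not the project's schema. It also assumes each flagged observation is counted under exactly one QAQC flag.

```python
import pandas as pd

# Hypothetical raw flagged counts: one row per station, variable and QAQC flag
flagged = pd.DataFrame(
    {
        "network": ["NET_A", "NET_A", "NET_A", "NET_B"],
        "station": ["NET_A_1", "NET_A_1", "NET_A_2", "NET_B_1"],
        "variable": ["tas", "pr", "tas", "tas"],
        "flag": [23, 23, 27, 23],
        "n_flagged": [5, 10, 20, 1],
    }
)

# Hypothetical total observation counts: one row per station and variable
totals = pd.DataFrame(
    {
        "network": ["NET_A", "NET_A", "NET_A", "NET_B"],
        "station": ["NET_A_1", "NET_A_1", "NET_A_2", "NET_B_1"],
        "variable": ["tas", "pr", "tas", "tas"],
        "n_total": [100, 100, 200, 50],
    }
)


def pct_flagged(flagged: pd.DataFrame, totals: pd.DataFrame, by: str) -> pd.Series:
    """Percent flagged per group: 100 * sum(flagged) / sum(total)."""
    num = flagged.groupby(by)["n_flagged"].sum()
    den = totals.groupby(by)["n_total"].sum()
    # Groups with observations but no flagged rows count as 0 flagged
    num = num.reindex(den.index, fill_value=0)
    return (100 * num / den).rename("pct_flagged")


per_station = pct_flagged(flagged, totals, "station")
per_network = pct_flagged(flagged, totals, "network")
per_variable = pct_flagged(flagged, totals, "variable")

# Per-flag percentages use all observations as the denominator
per_flag = 100 * flagged.groupby("flag")["n_flagged"].sum() / totals["n_total"].sum()
overall = 100 * flagged["n_flagged"].sum() / totals["n_total"].sum()

# "Highest and lowest" are the labels of the max and min of each series
highest_station, lowest_station = per_station.idxmax(), per_station.idxmin()
```

Summing the raw counts first and dividing once avoids averaging percentages across stations with very different record lengths.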


## Environment set-up

In [1]:
import boto3
import numpy as np
import pandas as pd
import xarray as xr
import matplotlib.pyplot as plt
from io import BytesIO, StringIO
from functools import reduce

import inspect

import logging
# Create a simple logger that just prints to the console
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger()

plt.rcParams["figure.dpi"] = 300

In [2]:
# Create the S3 resource and client (credentials are read from the local AWS configuration)
s3 = boto3.resource("s3")
s3_cl = boto3.client("s3")  # for lower-level processes

# Set the bucket name and the S3 paths and prefixes used in this notebook.
bucket_name = "wecc-historical-wx"
stations_csv_path = f"s3://{bucket_name}/2_clean_wx/temp_clean_all_station_list.csv"
qaqc_dir = "3_qaqc_wx"
merge_dir = "4_merge_wx"

INFO:botocore.credentials:Found credentials in shared credentials file: ~/.aws/credentials


## Setup

In [4]:
stations_df = pd.read_csv(stations_csv_path)

INFO:aiobotocore.credentials:Found credentials in shared credentials file: ~/.aws/credentials


In [9]:
network = "ASOSAWOS"
station1 = "ASOSAWOS_72493023230"
station2 = "ASOSAWOS_69007093217"

In [10]:
key1_native = (
    f"{merge_dir}/{network}/eraqc_counts/{station1}_flag_counts_native_timestep.csv"
)
key1_hourly = (
    f"{merge_dir}/{network}/eraqc_counts/{station1}_flag_counts_hourly_standardized.csv"
)

key2_native = (
    f"{merge_dir}/{network}/eraqc_counts/{station2}_flag_counts_native_timestep.csv"
)
key2_hourly = (
    f"{merge_dir}/{network}/eraqc_counts/{station2}_flag_counts_hourly_standardized.csv"
)

In [11]:
flag_counts1_hourly = pd.read_csv(f"s3://{bucket_name}/{key1_hourly}")
flag_counts1_native = pd.read_csv(f"s3://{bucket_name}/{key1_native}")

flag_counts2_hourly = pd.read_csv(f"s3://{bucket_name}/{key2_hourly}")
flag_counts2_native = pd.read_csv(f"s3://{bucket_name}/{key2_native}")
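
The keys above differ only in the station ID and the timestep, so a small helper could build them in one place. A minimal sketch: `flag_count_key` and the short `"native"`/`"hourly"` labels are hypothetical names introduced here, while the path pattern is copied from the keys above.

```python
def flag_count_key(merge_dir: str, network: str, station: str, timestep: str) -> str:
    """Build the S3 key of one station's flag count CSV."""
    # Map a short label to the file name suffix used in the keys above
    suffixes = {"native": "native_timestep", "hourly": "hourly_standardized"}
    if timestep not in suffixes:
        raise ValueError(f"timestep must be one of {sorted(suffixes)}, got {timestep!r}")
    return (
        f"{merge_dir}/{network}/eraqc_counts/"
        f"{station}_flag_counts_{suffixes[timestep]}.csv"
    )


example_key = flag_count_key(
    "4_merge_wx", "ASOSAWOS", "ASOSAWOS_72493023230", "native"
)
```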

Generate a network-level summary table

- same layout as the station-level csv (variable vs. QAQC flag), but summed over all stations in the network -> collapse all station csvs into one table
- produce statistics (per station)
    - % flagged total
    - % flagged per variable
    - % flagged per QAQC flag

What will this second output look like?
table 1:
- station as columns
    - % flagged
    - most flagged QAQC flag
    - most flagged variable
table 2: 
- variables as columns
    - % flagged
    - most flagged QAQC flag
table 3: 
- QAQC flags as columns
    - % flagged

Those 3 tables are up for discussion - is there a cleaner way to do this?

But FOR SURE produce that table of sums across all stations
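
One possible shape for tables 2 and 3, derived directly from a summed variable vs. QAQC flag table, is sketched below. The layout (QAQC flags as rows, variables as columns), the variable names, the flag values, and the separate `n_obs` series of total observation counts are assumptions for illustration only; the real csv layout may differ.

```python
import pandas as pd

# Hypothetical network-level flag counts: rows are QAQC flags, columns are variables
counts = pd.DataFrame(
    {"tas": [5, 20], "pr": [10, 0]},
    index=pd.Index([23, 27], name="eraqc_flag"),
)

# Hypothetical total number of observations per variable
n_obs = pd.Series({"tas": 300, "pr": 100})

# Table 2: one row per variable (transpose with .T to get variables as columns)
by_variable = pd.DataFrame(
    {
        "pct_flagged": 100 * counts.sum(axis=0) / n_obs,
        "most_flagged_qaqc_flag": counts.idxmax(axis=0),
    }
)

# Table 3: one row per QAQC flag, as a share of all observations
by_flag = pd.DataFrame(
    {
        "pct_flagged": 100 * counts.sum(axis=1) / n_obs.sum(),
        "most_flagged_variable": counts.idxmax(axis=1),
    }
)
```

Because both tables come from the same summed table, the station-level and network-level reports could share this code if the csv layouts match.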

In [None]:
def network_sum_counts(network: str, timestep: str = "native_timestep") -> pd.DataFrame:
    """
    Sums all station flag count tables into one network-level raw flag count table.

    Parameters
    ----------
    network: str
        network name
    timestep: str
        file name suffix selecting which tables to sum:
        "native_timestep" or "hourly_standardized"

    Returns
    -------
    pd.DataFrame
        element-wise sum of the station tables (empty if no tables are found)
    """

    flags_prefix = f"{merge_dir}/{network}/eraqc_counts"
    suffix = f"_flag_counts_{timestep}.csv"

    station_counts = []
    for item in s3.Bucket(bucket_name).objects.filter(Prefix=flags_prefix):
        # Native and hourly tables share this prefix; keep only one kind
        if not item.key.endswith(suffix):
            continue
        obj = s3_cl.get_object(Bucket=bucket_name, Key=item.key)
        # ASSUMPTION: the first column holds the row labels and all
        # remaining columns are numeric counts
        counts = pd.read_csv(obj["Body"], index_col=0)
        if counts.empty:  # If file empty
            continue
        station_counts.append(counts)

    if not station_counts:  # If no flag count tables found
        return pd.DataFrame()

    # Align on row and column labels; a label missing from one station counts as 0
    return reduce(lambda a, b: a.add(b, fill_value=0), station_counts)

## Useful bits of code

In [None]:
# Read the CSV file containing station data
csv_filepath = "s3://wecc-historical-wx/2_clean_wx/temp_clean_all_station_list.csv"
stations_df = pd.read_csv(csv_filepath)

# Filter the dataframe to only include rows corresponding to the specified network
# And, only cleaned stations
network_df = stations_df[
    (stations_df["network"] == network) & (stations_df["cleaned"] == "Y")
]

# Check if nothing is returned. Raise ValueError and print useful message.
if len(network_df) == 0:
    unique_networks = ", ".join(stations_df["network"].unique())  # Unique networks
    raise ValueError(
        f"No stations found for network: {network}. Available networks: {unique_networks}"
    )