# Explaining the Merge Hourly Standardization Step

The hourly standardization step resamples all QAQC'd station data to an hourly timestep.
< add reason for this >
We noticed that `merge_hourly_standardization` inserts empty strings ('') into the QAQC flag columns wherever the input has a time gap of an hour or more. To address this, we modify the resampling scheme for the QAQC subset so that it inserts 'nan' in those gaps instead of leaving them empty.
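
The behavior described above can be reproduced on a toy series. This is a minimal sketch, not the pipeline code: the `tas_qc` values and timestamps are made up, and `join_flags` is a stand-in that mirrors the logic of `qaqc_flag_fcn` defined later in this notebook.

```python
import pandas as pd


def join_flags(x: pd.Series) -> str:
    # Stand-in mirroring qaqc_flag_fcn: return 'nan' for hours with no observations
    if len(x) == 0:
        return "nan"
    return ",".join(x.unique())


# Hypothetical flag series with no observations between 01:00 and 02:00
times = pd.to_datetime(
    ["2020-01-01 00:05", "2020-01-01 00:35", "2020-01-01 02:10"]
)
flags = pd.Series(["5", "7", "5"], index=times, name="tas_qc")

# Joining unique flags directly yields '' for the empty hour
naive = flags.resample("1h").apply(lambda x: ",".join(x.unique()))

# Guarding against empty input yields 'nan' instead
fixed = flags.resample("1h").apply(join_flags)

print(naive.tolist())
print(fixed.tolist())
```

The empty hour reaches the aggregation function as an empty series, which is why the guard on `len(x) == 0` is enough to replace the empty string.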

## Environment set-up

In [1]:
import boto3
import numpy as np
import pandas as pd
import xarray as xr
import matplotlib.pyplot as plt
from functools import reduce

In [2]:
# Set AWS credentials
s3 = boto3.resource("s3")
s3_cl = boto3.client("s3")  # for lower-level processes

# Set relative paths to other folders and objects in repository.
bucket_name = "wecc-historical-wx"

In [3]:
def merge_ds_to_df(ds):
    """Converts xarray ds for a station to pandas df in the format needed for processing.

    Parameters
    ----------
    ds: xr.Dataset
        Data object with information about each network and station

    Returns
    -------
    df: pd.DataFrame
        Table object with information about each network and station
    MultiIndex: pd.MultiIndex
        Original multi-index of station and time, to be used on conversion back to ds
    attrs: dict
        ds attributes, saved so the final merged file can inherit them
    var_attrs: dict
        Variable attributes, saved so the final merged file can inherit them
    """

    # Save attributes so the final merged file can inherit them
    attrs = ds.attrs
    var_attrs = {var: ds[var].attrs for var in list(ds.data_vars.keys())}

    df = ds.to_dataframe()

    # Save instrumentation heights
    if "anemometer_height_m" not in df.columns:
        try:
            df["anemometer_height_m"] = (
                np.ones(ds["time"].shape) * ds.anemometer_height_m
            )
        except Exception:
            # Height is missing or cannot be broadcast: fill with NaN
            df["anemometer_height_m"] = np.ones(len(df)) * np.nan
    if "thermometer_height_m" not in df.columns:
        try:
            df["thermometer_height_m"] = (
                np.ones(ds["time"].shape) * ds.thermometer_height_m
            )
        except Exception:
            # Height is missing or cannot be broadcast: fill with NaN
            df["thermometer_height_m"] = np.ones(len(df)) * np.nan

    # De-duplicate time axis
    df = df[~df.index.duplicated()].sort_index()

    # Save station/time multiindex
    MultiIndex = df.index
    station = df.index.get_level_values(0)
    df["station"] = station

    # Station pd.Series to str
    station = station.unique().values[0]

    # Convert time/station index to columns and reset index
    df = df.droplevel(0).reset_index()

    return df, MultiIndex, attrs, var_attrs

## Load Data

In [4]:
# url = "s3://wecc-historical-wx/3_qaqc_wx/VALLEYWATER/VALLEYWATER_6001.zarr"
url = "s3://wecc-historical-wx/3_qaqc_wx/ASOSAWOS/ASOSAWOS_72493023230.zarr"
ds = xr.open_zarr(url)

In [5]:
df, MultiIndex, attrs, var_attrs = merge_ds_to_df(ds)



## Another issue: constant and instant vars with no data

In [8]:
# -----------------------------------------------------------------------------
def qaqc_flag_fcn(x: pd.Series) -> str:
    """
    Used for resampling QAQC flag columns. Ensures that the final standardized dataframe
    does not contain any empty strings by returning 'nan' when given an empty input (i.e. in time gaps).

    Parameters
    ----------
    x : pd.Series
        sub-hourly flag values that fall within a single hour

    Returns
    -------
    str
        comma-separated unique flag values, or 'nan' if the hour contains no observations

    """
    if len(x) == 0:
        return "nan"
    else:
        return ",".join(x.unique())

In [6]:
# -----------------------------------------------------------------------------
def merge_hourly_standardization(
    df: pd.DataFrame, var_attrs: dict
) -> tuple[pd.DataFrame, dict]:
    """Resamples meteorological variables to hourly timestep according to standard conventions.

    Parameters
    ----------
    df : pd.DataFrame
        station dataset converted to dataframe through QAQC pipeline
    var_attrs : dict
        attributes for sub-hourly variables

    Returns
    -------
    df : pd.DataFrame
        returns a dataframe with all columns resampled to one hour (column name retained)
    var_attrs : dict
        returns the variable attributes dictionary (the update noting that sub-hourly variables are now hourly is currently commented out)

    Notes
    -----
    Rules:
    1. Top of the hour: take the first value in each hour. Standard convention for temperature, dewpoint, wind speed, direction, relative humidity, air pressure.
    2. Summation across the hour: sum observations within each hour. Standard convention for precipitation and solar radiation.
    3. Constant across the hour: take the first value in each hour. This applies to variables that do not change.
    """

    # Variables that remain constant within each hour
    constant_vars = [
        "time",
        "station",
        "lat",
        "lon",
        "elevation",
        "anemometer_height_m",
        "thermometer_height_m",
    ]

    # Aggregation across hour variables, standard meteorological convention: precipitation and solar radiation
    sum_vars = [
        "time",
        "pr",
        "pr_localmid",
        "pr_24h",
        "pr_1h",
        "pr_15min",
        "pr_5min",
        "rsds",
    ]

    # Top of the hour variables, standard meteorological convention: temperature, dewpoint temperature, pressure, humidity, winds
    instant_vars = [
        "hurs_derived",
        "time",
        "tas",
        "tas_derived",
        "tdps",
        "tdps_derived",
        "ps",
        "psl",
        "ps_altimeter",
        "ps_derived",
        "hurs",
        "sfcWind",
        "sfcWind_dir",
    ]

    # Substrings that identify QAQC flag columns, which remain constant within each hour
    vars_to_remove = ["qc", "eraqc", "duration", "method", "flag", "depth", "process"]

    try:

        qaqc_vars = [
            var
            for var in df.columns
            if any(item in var for item in vars_to_remove)
        ]

        # Subset the dataframe according to rules
        constant_df = df[[col for col in constant_vars if col in df.columns]]

        qaqc_df = df[[col for col in qaqc_vars if col in df.columns if col != "time"]]
        qaqc_df = qaqc_df.astype(str)
        qaqc_df.insert(0, "time", df["time"])

        sum_df = df[[col for col in sum_vars if col in df.columns]]

        instant_df = df[[col for col in instant_vars if col in df.columns]]

        # Perform hourly aggregation only if the subset contains more than one column (i.e. more than just 'time').
        # This accounts for input dataframes that contain only some of the variable subsets defined above.
        result_list = []
        if len(constant_df.columns) > 1:
            constant_result = constant_df.resample("1h", on="time").first()
            result_list.append(constant_result)

        if len(instant_df.columns) > 1:
            instant_result = instant_df.resample("1h", on="time").first()
            result_list.append(instant_result)

        if len(sum_df.columns) > 1:
            sum_result = sum_df.resample("1h", on="time").apply(
                lambda x: np.nan if x.isna().all() else x.sum(skipna=True)
            )
            result_list.append(sum_result)

        if len(qaqc_df.columns) > 1:
            qaqc_result = qaqc_df.resample("1h", on="time").apply(
                lambda x: qaqc_flag_fcn(x)
            )  # concatenating unique flags
            result_list.append(qaqc_result)

        # Aggregate and output reduced dataframe - this merges all dataframes defined
        # Merging on "time" leaves "time" as the index; reset the index so "time" is a column again
        result = reduce(
            lambda left, right: pd.merge(left, right, on=["time"], how="outer"),
            result_list,
        )
        result = result.reset_index()  # Convert time index --> column

        # # Update attributes for sub-hourly variables
        # sub_hourly_vars = [i for i in df.columns if "min" in i and "qc" not in i]
        # for var in sub_hourly_vars:
        #     var_attrs[var]["standardization"] = (
        #         "{} has been standardized to an hourly timestep, but will retain its original name".format(
        #             var
        #         )
        #     )

        return result, var_attrs

    except Exception as e:
        print("Failed")
        raise e
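
The first two rules in the docstring above can be checked on a toy frame. This is a minimal sketch with made-up values, not a run of the function itself: `tas` uses the top-of-the-hour rule, `pr` uses the summation rule with the same all-NaN guard as the `sum_df` branch, and `plain_sum` is included only for contrast.

```python
import numpy as np
import pandas as pd

# Hypothetical 30-minute data: the second hour has no valid precipitation
toy = pd.DataFrame(
    {
        "tas": [280.0, 281.0, 282.0, 283.0],
        "pr": [0.5, 1.5, np.nan, np.nan],
    },
    index=pd.DatetimeIndex(
        [
            "2020-01-01 00:00",
            "2020-01-01 00:30",
            "2020-01-01 01:00",
            "2020-01-01 01:30",
        ],
        name="time",
    ),
)

# Rule 1: top of the hour, take the first value in each hour
tas_hourly = toy["tas"].resample("1h").first()

# Rule 2: sum within the hour, but keep NaN when every value in the hour is NaN
pr_hourly = toy["pr"].resample("1h").apply(
    lambda x: np.nan if x.isna().all() else x.sum(skipna=True)
)

# For contrast: a plain sum reports 0.0 for the all-NaN hour
plain_sum = toy["pr"].resample("1h").sum()

hourly = pd.concat([tas_hourly, pr_hourly], axis=1)
print(hourly)
print(plain_sum)
```

The guard matters because a plain sum would turn an hour of missing precipitation into an apparent zero. Passing `min_count=1` to `sum` may be an alternative way to get the same result, though that has not been tested against the pipeline data here.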

In [None]:
# -----------------------------------------------------------------------------
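def _infill(result: pd.DataFrame) -> pd.DataFrame:
    """
    Work-in-progress helper for infilling constant variables in rows created by time gaps.

    Parameters
    ----------
    result : pd.DataFrame
        hourly standardized dataframe

    Returns
    -------
    result : pd.DataFrame
        hourly standardized dataframe (currently returned unchanged)

    """

    # First non-null value of each column; not yet used to infill
    first_values = result.apply(
        lambda col: col.dropna().iloc[0] if not col.dropna().empty else np.nan
    )

    return result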

In [9]:
df_st, var_attrs = merge_hourly_standardization(df, var_attrs)

In [14]:
constant_vars = [
    "time",
    "station",
    "lat",
    "lon",
    "elevation",
    "anemometer_height_m",
    "thermometer_height_m",
]

In [15]:
df_st_constant = df_st[[col for col in constant_vars if col in df_st.columns]]

In [16]:
first_values = df_st_constant.apply(
    lambda col: col.dropna().iloc[0] if not col.dropna().empty else np.nan
)

In [18]:
df_st['standardized_infill'] = np.nan

For every row that contains a station value of None, add a flag to the `standardized_infill` column.

## Test

In [21]:
df_st_show_gap = df_st.loc[
    (df_st["time"] >= "1981-02-05 00:00:00") & (df_st["time"] < "1981-02-05 23:00:00")
]

In [22]:
df_st_show_gap

| | time | station | lat | lon | elevation | anemometer_height_m | thermometer_height_m | tas | tdps | ps | ... | qaqc_process | sfcWind_dir_eraqc | sfcWind_dir_qc | sfcWind_eraqc | sfcWind_method | sfcWind_qc | tas_eraqc | tas_qc | tdps_eraqc | tdps_qc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9624 | 1981-02-05 00:00:00 | ASOSAWOS_72493023230 | 37.733 | -122.2 | 2.0 | 10.06 | | 288.75 | 281.45 | | ... | V020 | | 5.0 | | N | 5.0 | | 5.0 | | 5.0 |
| 9625 | 1981-02-05 01:00:00 | ASOSAWOS_72493023230 | 37.733 | -122.2 | 2.0 | 10.06 | | 288.75 | 281.45 | | ... | V020 | | 5.0 | | N | 5.0 | | 5.0 | | 5.0 |
| 9626 | 1981-02-05 02:00:00 | ASOSAWOS_72493023230 | 37.733 | -122.2 | 2.0 | 10.06 | | 285.95 | 281.45 | | ... | V020 | | 5.0 | | N | 5.0 | | 5.0 | | 5.0 |
| 9627 | 1981-02-05 03:00:00 | ASOSAWOS_72493023230 | 37.733 | -122.2 | 2.0 | 10.06 | | 283.15 | 281.45 | | ... | V020 | | 5.0 | | N | 5.0 | | 5.0 | | 5.0 |
| 9628 | 1981-02-05 04:00:00 | ASOSAWOS_72493023230 | 37.733 | -122.2 | 2.0 | 10.06 | | 283.75 | 281.45 | | ... | V020 | | 5.0 | | N | 5.0 | | 5.0 | | 5.0 |
| 9629 | 1981-02-05 05:00:00 | ASOSAWOS_72493023230 | 37.733 | -122.2 | 2.0 | 10.06 | | 283.75 | 281.45 | | ... | V020 | | 5.0 | | N | 5.0 | | 5.0 | | 5.0 |
| 9630 | 1981-02-05 06:00:00 | ASOSAWOS_72493023230 | 37.733 | -122.2 | 2.0 | 10.06 | | 284.25 | 282.05 | | ... | V020 | | 5.0 | | N | 5.0 | | 5.0 | | 5.0 |
| 9631 | 1981-02-05 07:00:00 | | | | | | | | | | ... | | | | | | | | | | |
| 9632 | 1981-02-05 08:00:00 | ASOSAWOS_72493023230 | 37.733 | -122.2 | 2.0 | 10.06 | | 283.15 | 280.95 | | ... | V020 | | 5.0 | | N | 5.0 | | 5.0 | | 5.0 |
| 9633 | 1981-02-05 09:00:00 | ASOSAWOS_72493023230 | 37.733 | -122.2 | 2.0 | 10.06 | | 281.45 | 279.85 | | ... | V020 | | 5.0 | | C | 5.0 | | 5.0 | | 5.0 |


In [23]:
constant_vars = [
    "time",
    "station",
    "lat",
    "lon",
    "elevation",
    "anemometer_height_m",
    "thermometer_height_m",
]

constant_df = df_st_show_gap[
    [col for col in constant_vars if col in df_st_show_gap.columns]
]

constant_df

| | time | station | lat | lon | elevation | anemometer_height_m | thermometer_height_m |
|---|---|---|---|---|---|---|---|
| 9624 | 1981-02-05 00:00:00 | ASOSAWOS_72493023230 | 37.733 | -122.2 | 2.0 | 10.06 | |
| 9625 | 1981-02-05 01:00:00 | ASOSAWOS_72493023230 | 37.733 | -122.2 | 2.0 | 10.06 | |
| 9626 | 1981-02-05 02:00:00 | ASOSAWOS_72493023230 | 37.733 | -122.2 | 2.0 | 10.06 | |
| 9627 | 1981-02-05 03:00:00 | ASOSAWOS_72493023230 | 37.733 | -122.2 | 2.0 | 10.06 | |
| 9628 | 1981-02-05 04:00:00 | ASOSAWOS_72493023230 | 37.733 | -122.2 | 2.0 | 10.06 | |
| 9629 | 1981-02-05 05:00:00 | ASOSAWOS_72493023230 | 37.733 | -122.2 | 2.0 | 10.06 | |
| 9630 | 1981-02-05 06:00:00 | ASOSAWOS_72493023230 | 37.733 | -122.2 | 2.0 | 10.06 | |
| 9631 | 1981-02-05 07:00:00 | | | | | | |
| 9632 | 1981-02-05 08:00:00 | ASOSAWOS_72493023230 | 37.733 | -122.2 | 2.0 | 10.06 | |
| 9633 | 1981-02-05 09:00:00 | ASOSAWOS_72493023230 | 37.733 | -122.2 | 2.0 | 10.06 | |


Do we need to infill these with their "correct" values? 

One option is to add a step that targets rows with None in the station column, written as a helper function. For constant variables, the helper would infill each gap row with the first valid value found in the entire dataframe, and then flag the row. The flag would live in a `standardized_infill` column that starts out as all NaN and is set to "yes" for every infilled row.
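
A possible shape for that helper is sketched below. It is a rough draft under stated assumptions rather than a tested pipeline step: the name `infill_constant_gaps` and its arguments are hypothetical, gap rows are identified only by a null `station`, and the toy frame borrows a few values from the gap shown above.

```python
import numpy as np
import pandas as pd


def infill_constant_gaps(hourly: pd.DataFrame, constant_cols: list) -> pd.DataFrame:
    """Hypothetical helper: infill constant columns in gap rows and flag those rows."""
    out = hourly.copy()

    # Flag column starts as all NaN (object dtype so it can hold "yes")
    out["standardized_infill"] = pd.Series(np.nan, index=out.index, dtype="object")

    # Assumption: a gap row is any row whose station value is null
    gap = out["station"].isna()

    for col in constant_cols:
        if col not in out.columns:
            continue
        valid = out[col].dropna()
        if valid.empty:
            continue  # nothing to infill from, e.g. a height that is never reported
        # Only overwrite nulls inside gap rows, using the first valid value overall
        out.loc[gap & out[col].isna(), col] = valid.iloc[0]

    out.loc[gap, "standardized_infill"] = "yes"
    return out


# Toy frame with one gap row at 07:00
toy = pd.DataFrame(
    {
        "time": pd.date_range("1981-02-05 06:00", periods=3, freq="1h"),
        "station": ["ASOSAWOS_72493023230", None, "ASOSAWOS_72493023230"],
        "lat": [37.733, np.nan, 37.733],
        "thermometer_height_m": [np.nan, np.nan, np.nan],
        "tas": [284.25, np.nan, 283.15],
    }
)

filled = infill_constant_gaps(toy, ["station", "lat", "thermometer_height_m"])
print(filled)
```

In this sketch the measured variables such as `tas` stay NaN in the gap row, and only the constant columns are infilled. Whether the flag should be "yes" or a more descriptive value is left open, as is the question of whether infilling is needed at all.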