# Station Matching


This notebook does the following

1. identifies stations that need to be concatenated using station location

2. concatenates these target stations, with special handing for the following cases

    a. subsets containing more than two stations

    b. stations with temporal overlap

3. moves/renames (under discussion) datasets that were input to concatenation to a separate folder in AWS (as opposed to deleting them outright)

## Environment set-up

In [2]:
import datetime
import boto3
import geopandas as gpd
import numpy as np
import pandas as pd
import xarray as xr
import matplotlib.pyplot as plt
from io import BytesIO, StringIO

# Import qaqc stage calc functions
from QAQC_pipeline import  qaqc_ds_to_df

# Silence warnings
import warnings
from shapely.errors import ShapelyDeprecationWarning

warnings.filterwarnings("ignore", category=RuntimeWarning)
warnings.filterwarnings(
    "ignore", category=ShapelyDeprecationWarning
)  # Warning is raised when creating Point object from coords. Can't figure out why.

plt.rcParams["figure.dpi"] = 300

In [3]:
# AWS credentials
s3 = boto3.resource("s3")
s3_cl = boto3.client("s3")

## AWS buckets
bucket = "wecc-historical-wx"
qaqcdir = "3_qaqc_wx/"
mergedir = "4_merge_wx/"

## Step 1: Identify candidates for concatenation and upload to AWS

We identify stations in the target network MARITIME, and ASOSAWOS that have exactly matching latitudes and longitudes. We then assign a unique ID to each susbet of stations identified and construct a dataframe of ERA-IDs and subset IDs, then send this to AWS for each network.

In [4]:
# A list of networks to be checked for concatenation
target_networks = ["ASOSAWOS","MARITIME"]

In [5]:
def concatenation_check(station_list: pd.DataFrame) -> pd.DataFrame:
    """
    This function flags stations that need to be concatenated.

    Rules
    ------
    1.) Stations are flagged if they have identical latitudes and longitudes

    Parameters
    ------
    station_list: pd.DataFrame
        list of station information

    Returns
    -------
    if success: returns input station list with a flag column assigning an integer to each group of repeat latitudes and longitudes
    if failure: None
    """
    ##### Flag stations with identical latitudes and longitudes, then assign each group a unique integer

    # List of possible variable names for longitudes and latitudes
    lat_lon_list = ["LAT", "LON", "latitude", "longitude", "LATITUDE", "LONGITUDE", 'lat','lon']
    # Extract the latitude and longitude variable names from the input dataframe
    lat_lon_cols = [col for col in station_list.columns if col in lat_lon_list]

    # Generate column flagging duplicate latitudes and longitudes
    station_list["concat_subset"] = station_list.duplicated(
        subset=lat_lon_cols, keep=False
    )
    # within each group of identical latitudes and longitudes, assign a unique integer
    station_list["concat_subset"] = (
        station_list[station_list["concat_subset"] == True].groupby(lat_lon_cols).ngroup()
    )

    ##### Order station list by flag
    concat_station_list = station_list.sort_values("concat_subset")

    ##### Keep only flagged stations
    concat_station_list = concat_station_list[~concat_station_list["concat_subset"].isna()]

    ##### Format final list
    # Convert flags to integers - this is necessary for the final concatenation step
    concat_station_list["concat_subset"] = concat_station_list["concat_subset"].astype(
        "int32"
    )
    # Now keep only the ERA-ID and flag column
    era_id_list = ['ERA-ID','era-id']
    era_id_col = [col for col in station_list.columns if col in era_id_list]
    concat_station_list = concat_station_list[era_id_col + ["concat_subset"]]

    # Standardize ERA id to "ERA-ID" (this is specific to Valleywater stations)
    if 'era-id' in era_id_col:
        concat_station_list.rename(columns={"era-id": "ERA-ID"}, inplace=True)

    return concat_station_list

In [6]:
def apply_concat_check(station_names_list: pd.DataFrame) -> None:
    """
    This function applies the conatenation check to a list of target stations. 
    It then upload a csv containing the ERA IDs and concatenation subset ID for 
    all identified stations in a network.

    Parameters
    ----------
    station__names_list: pd.DataFrame
        list of target station names

    Returns
    -------
    if success: uploads list of stations to be concatenated to AWS
    if failure: None
    """
    final_list = pd.DataFrame([])
    for station in station_names_list:

        ##### Import station list of target station
        key = "2_clean_wx/{}/stationlist_{}_cleaned.csv".format(station,station)
        bucket_name = "wecc-historical-wx"
        list_import = s3_cl.get_object(
            Bucket=bucket,
            Key=key,
        )
        station_list = pd.read_csv(BytesIO(list_import["Body"].read()))

        ##### Apply concatenation check
        concat_list = concatenation_check(station_list)

        ##### Rename the flags for each subset to <station>_<subset number>
        concat_list["concat_subset"] = station + '_' + concat_list["concat_subset"].astype(str)

        ##### Append to final list of stations to concatenate
        final_list = pd.concat([final_list,concat_list])

        ##### Upload to QAQC directory in AWS
        new_buffer = StringIO()
        final_list.to_csv(new_buffer, index = False)
        content = new_buffer.getvalue()

        # the csv is stored in each station folder within 3_qaqc_wx
        s3_cl.put_object(
            Bucket = bucket_name,
            Body = content,
            Key = qaqcdir + station + "/concat_list_{}.csv".format(station)
        )
        
    return None

In [None]:
apply_concat_check(target_networks)

## Step 2: Concatenate Stations

The concatenation process is now split into a series of modular functions.

The function concatenate_stations() does the following:

- Reads in the concatenation stations list for a given network and, for each subset of stations

    - if there are two stations in the subset, give those stations to helper function _df_concat()
    - if there are more than two station in the subset, give those stations to helper function _more_than_2()
    send final concatenated dataframe to _concat_export_help() for export

The helper function _df_concat() concatenates pairs of input dataframes.

- If there is temporal overlap between the two stations, it gives them to helpder function _overlap_concat()
- If not, it merges the two stations


The helper function _more_than_2() iteratively concatenates pairs of stations within a subset of more than two stations. It take the two oldest stations, gives them to _df_concat(). Then is takes the output from this and sends it and the NEXT station to _df_concat(), and so forth, until all stations have been concatenated.

The helper function _overlap_concat() keeps the newer station data in the time range in which the two input stations overlap.

Finally, the helper function _concat_export_help() renames the input datasets in AWS, then formats the concatenated dataframe for export, and then exports it to AWS. 

### Functions

In [63]:
def _df_concat(
    df_1: pd.DataFrame, df_2: pd.DataFrame, attrs_1: dict, attrs_2: dict
) -> tuple[pd.DataFrame, str, str, dict]:
    """
    Performs concatenation of input datasets, handling two cases
        1.) temporal overlap between the datasets
        2.) no temporal overlap

    Rules
    ------
    1.) concatenation: keep the newer station data in the time range in which both stations overlap

    Parameters
    ----------
    df_1: pd.DataFrame
        station data
    df_2: pd.DataFrame
        dtation data
    attrs_1: list of str
        attributes of df_1
    attrs_2: list of str
        attributes of df_2

    Returns
    -------
    if success:
        df_concat: pd.DataFrame
        stn_n_to_keep: str
        stn_n_to_drop: str
        attrs_new: dict
    if failure: None
    """

    # determine which dataset is older
    if df_1["time"].max() < df_2["time"].max():
        # if df_1 has an earlier end tiem than df_2, then d_2 is newer
        # we also grab the name of the newer station in this step, for use later
        df_new = df_2
        attrs_new = attrs_2
        df_old = df_1

    else:
        df_new = df_1
        attrs_new = attrs_1
        df_old = df_2

    stn_n_to_keep = df_new["station"].unique()[0]
    stn_n_to_drop = df_old["station"].unique()[0]
    print(f"Station will be concatenated and saved as: {stn_n_to_keep}")

    # now set things up to determine if there is temporal overlap between df_new and df_old
    df_overlap = df_new[df_new["time"].isin(df_old["time"])]

    # If there is no overlap between the two time series, just concatenate
    if len(df_overlap) == 0:
        print("No overlap!")
        df_concat = pd.merge(df_old, df_new, how="outer")

    # If overlap exists, split into subsets and concatenate
    else:
        print("There is overlap")
        df_concat = _overlap_concat(df_old, df_new)

    # Reset station name to be the newer station
    df_concat["station"] = stn_n_to_keep

    return df_concat, stn_n_to_keep, stn_n_to_drop, attrs_new

In [64]:
def _overlap_concat(df_new: pd.DataFrame, df_old: pd.DataFrame) -> pd.DataFrame:
    """
    Handles the cases in which there is overlap between the two input stations

    Rules
    ------
    1.) concatenation: keep the newer station data in the time range in which both stations overlap

    Parameters
    ----------
    df_new: pd.DataFrame
        weather station network
    df_old: pd.DataFrame
        weather station network

    Returns
    -------
    if success: returns pd.DataFrame
    if failure: None
    """

    # identify where there is overlap in timestamps, and keep from newer station data
    df_overlap = df_new[df_new["time"].isin(df_old["time"])]
    print(f'Length of overlapping period: {len(df_overlap)}')

    ##### Split datframes into subsets #####

    # Remove data in time overlap between old and new
    df_old_cleaned = df_old[~df_old["time"].isin(df_overlap["time"])]
    df_new_cleaned = df_new[~df_new["time"].isin(df_overlap["time"])]

    ##### Concatenate subsets #####
    df_concat = pd.concat([df_old_cleaned, df_overlap, df_new_cleaned])

    return df_concat

In [48]:
def _more_than_2(network_name: str, stns_to_pair: pd.DataFrame) -> tuple[pd.DataFrame, dict, dict]:
    """
    Performs pairwise concatenation on subsets of more than two stations flagged for concatenation

    Rules
    ------
    1.) concatenation: keep the newer station data in the time range in which both stations overlap

    Parameters
    ----------
    network_name: string
        weather station network
    stns_to_pair: pd.DataFrame
        dataframe of the input station names

    Returns
    -------
    if success:
        df_concat: pd.DataFrame
        station_names: dict
        attrs_new: dict
    if failure: None
    """

    print("Concatenating the following stations:", stns_to_pair)

    # Load datasets into a list
    datasets = [
        xr.open_zarr(
            "s3://wecc-historical-wx/3_qaqc_wx/{}/{}.zarr".format(
                network_name, stn
            ),
            consolidated=True,
        )
        for stn in stns_to_pair['ERA-ID']
    ]

    # Sort datasets by their max 'time'
    datasets_sorted = sorted(datasets, key=lambda ds: ds['time'].max())

    # Store station names, in order from oldest to newest
    names = [ds.coords["station"].values[0] for ds in datasets_sorted]

    print('Newest station:', names[-1])

    # Setup for the while loop
    ds_1 = datasets_sorted[0]
    df_1, MultiIndex_1, attrs_1, var_attrs_1, era_qc_vars_1 = qaqc_ds_to_df(ds_1)
    i = 0
    end = len(datasets_sorted) -1

    while i < end:

        print('iteration:', i)

        ds_2 = datasets_sorted[i+1]
        df_2, MultiIndex_2, attrs_2, var_attrs_2, era_qc_vars_2 = qaqc_ds_to_df(ds_2)

        # Send to helper function for concatenation
        df_concat, stn_n_to_keep, stn_n_to_drop, attrs_new = _df_concat(
            df_1, df_2, attrs_1, attrs_2
        )

        df_1 = df_concat
        attrs_1 = attrs_new

        i += 1

    # Construct station names list, for updating attributes
    newest_station = names[-1] # Get last station name from station name list
    older_stations = ", ".join(names[:-1]) # Create a string containing all older station names
    station_names = {"station_name_new": newest_station, "old_stations": older_stations}

    print('Progressive concatenation for 2+ stations is complete.')

    new_column = [newest_station] * len(df_concat)

    df_concat['station'] = new_column

    return df_concat, station_names, attrs_new

In [49]:
def _concat_export_help(
    df_concat: pd.DataFrame,
    final_concat_list: list[str],
    network_name: str,
    attrs_new: dict,
    station_names: dict,
) -> None:
    """
    Prepares the final concatenated dataset for export by
    - updating the attributes and
    - converting one of the mulit-index levels to the correct datatype
    then exports the dataset to AWS

    Rules
    ------
    1.) retains the name of the newest station

    Parameters
    ----------
    df_concat: pd.DataFrame
        dataframe of concatenated dataframes
    final_concat_list: list[str]
        list of stations that have been concatenated
    network_name: str
        weather station network
    attrs_new: dict
        attributes of newer dataframe that was input to concatenation
    station_names: dict
        library of station names, including the single new station name and a string of all the older station names

    Returns
    -------
    if successful, exports dataset of concatenated dataframes to AWS
    if failure, returns None
    """
    
    ##### Rename input files
    for station_name in final_concat_list:
        new_name = "{}_c".format(station_name) 
        _rename_file(network_name, station_name, new_name)

    ##### Prepare concatenated dataset for export

    # Delete unnecessary columns and set index
    df_concat = df_concat.drop(["hour", "day", "month", "year", "date"], axis=1)
    df_to_export = df_concat.set_index(["station", "time"])

    # Convert concatenated dataframe to dataset
    ds_concat = df_to_export.to_xarray()

    # Convert datatype of station coordinate
    ds_concat.coords["station"] = ds_concat.coords["station"].astype("<U20")

    # Include past attributes
    for i in attrs_new:
        ds_concat.attrs[i] = attrs_new[i]

    # Update 'history' attribute
    timestamp = datetime.datetime.utcnow().strftime("%m-%d-%Y, %H:%M:%S")
    ds_concat.attrs["history"] = ds_concat.attrs[
        "history"
    ] + " \nstation_matching.ipynb run on {} UTC".format(timestamp)

    # Update 'comment' attribute
    ds_concat.attrs["comment"] = (
        "Intermediary data product. This data has been subjected to cleaning, QA/QC, but may not have been standardized."
    )

    # Extract old and new station names from name dictionary
    station_name_new = station_names["station_name_new"]
    station_name_old = station_names["old_stations"]

    # Add new qaqc_files_merged attribute
    ds_concat.attrs["qaqc_files_merged"] = (
        "{}, {} merged. Overlap retained from newer station data.".format(
            station_name_old,
            station_name_new  # extract old and new station names from name dictionary
        )
    )

    ##### Export
    # export_url = "s3://wecc-historical-wx/3_qaqc_wx/{}/{}.zarr".format(
    #     network_name, station_name_new
    # )
    # print("Exporting....", export_url)
    # ds_concat.to_zarr(export_url, mode="w") ## WHEN READY TO EXPORT

    return None

In [50]:
def _rename_file(network: str, old_name: str, new_name: str) -> None:
    """
    Renames a given file in AWS by copying it over into the new name and then deleting the old file

    Parameters
    ----------
    network: str
        weather station network name
    old_name: str
        name of input dataset
    new_name: str
        new name for input dataset (ie. "_c" added to the name)

    Returns
    -------
    if success:
        None
    if failure:
        None
    """
    try:
        old_url = f"s3://wecc-historical-wx/3_qaqc_wx/{network}/{old_name}.zarr"
        print('Original file name:',old_url)

        # Copy original file, and re-name using input
        # s3.copy_object(
        #     Bucket="wecc-historical-wx",
        #     CopySource=old_url,
        #     Key=new_name,
        # )

        # Delete older version of the file 
        # s3.delete_object(Bucket="wecc-historical-wx", Key=old_name)
        print(f"File {old_name} renamed to {new_name}")
    except Exception as e:
        print(f"Error renaming file: {e}")

In [59]:
def concatenate_stations(network_name: str) -> list[str]:
    """
    Coordinates the concatenation of input datasets and exports the final concatenated dataset.
    Also returns a list of the ERA-IDs of all stations that are concatenated.

    Parameters
    ----------
    network_name: string
        weather station network

    Returns
    -------
    if success: return a list of strings
    if failure: None
    
    Notes
    -----
    Uses the following helper functions
        _df_concat(): concatenates two dataframes
        _overlap_concat(): used by _df_concat() to concatenate two stations with overlapping time ranges
        _more_than_2(): handles subsets with more than two stations, passing pairs to _df_concat() iteratively
        _concat_export_help(): formats and exports concatenated dataframe

    """
    # Initiate empty list, to which we will iteratively add the ERA-IDs of stations that are concatenated
    final_concat_list = []

    # Read in full concat station list
    print(network_name)
    concat_list = pd.read_csv(
        f"s3://wecc-historical-wx/3_qaqc_wx/{network_name}/concat_list_{network_name}.csv"
    )

    # Identify stns within designated network
    concat_by_network = concat_list.loc[
        concat_list.concat_subset.str.contains(network_name)
    ]

    ######### ! for testing
    concat_by_network = concat_list[concat_list["concat_subset"] == "ASOSAWOS_2"]
    # concat_by_network = concat_by_network.head(4)
    ######### ! for testing

    unique_pair_names = concat_by_network.concat_subset.unique()
    # For MARITIME, remove these stations becuase they're actually separate stations
    if network_name == 'MARITIME':
        unique_pair_names = unique_pair_names[1:]
        unique_pair_name = unique_pair_name[~unique_pair_name["ERA-ID"].isin['MARITIME_LJPC1','MARITIME_LJAC1']]
    else: 
        pass

    print(
        f"There are {len(concat_by_network)} stations to be concatenated into {len(unique_pair_names)} station pairs within {network_name}..."
    )
    print(unique_pair_names)

    # Set up pairs
    for pair in unique_pair_names:
        print(pair)
        # Pull out stations corresponding to pair name
        stns_to_pair = concat_by_network.loc[concat_by_network.concat_subset == pair]

        if len(stns_to_pair) == 2:  # 2 stations to concat together
            print("\n", stns_to_pair)

            # Import this subset of datasets and convert to dataframe
            url_1 = "s3://wecc-historical-wx/3_qaqc_wx/{}/{}.zarr".format(
                network_name, stns_to_pair.iloc[0]["ERA-ID"]
            )
            url_2 = "s3://wecc-historical-wx/3_qaqc_wx/{}/{}.zarr".format(
                network_name, stns_to_pair.iloc[1]["ERA-ID"]
            )

            print("Retrieving....", url_1)
            print("Retrieving....", url_2)
            ds_1 = xr.open_zarr(url_1)
            ds_2 = xr.open_zarr(url_2)

            # Convert to dataframes with corresponding information
            df_1, MultiIndex_1, attrs_1, var_attrs_1, era_qc_vars_1 = qaqc_ds_to_df(ds_1)
            df_2, MultiIndex_2, attrs_2, var_attrs_2, era_qc_vars_2 = qaqc_ds_to_df(ds_2)

            # Send to helper function for concatenation
            df_concat, stn_n_to_keep, stn_n_to_drop, attrs_new = _df_concat(
                df_1, df_2, attrs_1, attrs_2
            )

            # Construct dictionary of old and new station names
            station_names ={"station_name_new":stn_n_to_keep, "old_stations":stn_n_to_drop}

        else:
            # If there are more than 2 stations in the given subset, pass to _more_than_2()
            print("More than 2 stations within a subset.")
            df_concat, station_names, attrs_new = _more_than_2(
                network_name,
                stns_to_pair
            )
        
        print(f"Length of new dataframe: {len(df_concat)}")

        # # Add concatenated station names to station name list
        # final_concat_list.extend(stns_to_pair["ERA-ID"].tolist())

        # # Send concatenated dataframe to helper function for export
        # _concat_export_help(  
        #     df_concat,
        #     final_concat_list,
        #     network_name,
        #     attrs_new,
        #     station_names,
        # )

    # print("Concatenated stations: ", final_concat_list)

    return df_concat

### Per-Network Run

In [13]:
# Target network for concatenation: options are "ASOSAWOS", "MARITIME"
network_name = "ASOSAWOS"

In [65]:
final_concat_list = concatenate_stations(network_name)
final_concat_list

ASOSAWOS
There are 2 stations to be concatenated into 1 station pairs within ASOSAWOS...
['ASOSAWOS_2']
ASOSAWOS_2

                  ERA-ID concat_subset
4  ASOSAWOS_72272093063    ASOSAWOS_2
5  ASOSAWOS_72272193063    ASOSAWOS_2
Retrieving.... s3://wecc-historical-wx/3_qaqc_wx/ASOSAWOS/ASOSAWOS_72272093063.zarr
Retrieving.... s3://wecc-historical-wx/3_qaqc_wx/ASOSAWOS/ASOSAWOS_72272193063.zarr
Station will be concatenated and saved as: ASOSAWOS_72272193063
There is overlap
Length of overlapping period: 89474
Length of new dataframe: 547348


Unnamed: 0,time,anemometer_height_m,elevation,elevation_eraqc,lat,lon,pr,pr_depth_qc,pr_duration,pr_eraqc,...,tdps_qc,thermometer_height_m,station,hour,day,month,year,date,psl,psl_eraqc
0,2005-01-01 00:10:00,,1659.0,,32.633,-108.15,,,NaT,,...,1,,ASOSAWOS_72272193063,0,1,1,2005,2005-01-01,,
1,2005-01-01 00:30:00,,1659.0,,32.633,-108.15,,,NaT,,...,1,,ASOSAWOS_72272193063,0,1,1,2005,2005-01-01,,
2,2005-01-01 00:50:00,,1659.0,,32.633,-108.15,,,NaT,,...,1,,ASOSAWOS_72272193063,0,1,1,2005,2005-01-01,,
3,2005-01-01 01:10:00,,1659.0,,32.633,-108.15,,,NaT,,...,1,,ASOSAWOS_72272193063,1,1,1,2005,2005-01-01,,
4,2005-01-01 01:30:00,,1659.0,,32.633,-108.15,,,NaT,,...,1,,ASOSAWOS_72272193063,1,1,1,2005,2005-01-01,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
185499,2010-07-31 20:00:00,,1266.0,,31.467,-109.60,,,NaT,,...,1,,ASOSAWOS_72272193063,20,31,7,2010,2010-07-31,101210.0,
185503,2010-07-31 21:00:00,,1266.0,,31.467,-109.60,,,NaT,,...,1,,ASOSAWOS_72272193063,21,31,7,2010,2010-07-31,101200.0,
185506,2010-07-31 21:26:00,,1266.0,,31.467,-109.60,,,NaT,,...,1,,ASOSAWOS_72272193063,21,31,7,2010,2010-07-31,,
185508,2010-07-31 22:00:00,,1266.0,,31.467,-109.60,0.0,2.0,0 days 01:00:00,,...,1,,ASOSAWOS_72272193063,22,31,7,2010,2010-07-31,101230.0,
