<h1>MassBalanceMachine Data Processing - Example for Retrieving Training Features for the Iceland Region (Custom Data)</h1>

<p style='text-align: justify;'>
This notebook demonstrates the data processing workflow of the MassBalanceMachine using Icelandic glacier data.
It guides you through converting your data to the WGMS format, which will be used throughout the entire pipeline of the MassBalanceMachine. Once formatted correctly, follow the <a href='https://github.com/ODINN-SciML/MassBalanceMachine/blob/main/notebooks/data_processing_wgms.ipynb'>data processing example notebook</a> for WGMS(-like) data.  
</p>
<h2>Purpose</h2>
<p style='text-align: justify;'>
This notebook is for users whose data is not in the WGMS format, or whose records each hold more than one measurement. We work with Icelandic glacier stake measurements, which have three recordings per hydrological year: one at the start of winter, one at the end of winter (start of summer), and one at the end of summer. Our goal is to split each dataset record into three separate records, each corresponding to one stake measurement within the hydrological year. </p>
<p style='text-align: justify;'>
We strive to accommodate various data formats, but you may occasionally need to adjust your data to ensure compatibility. For guidance on the formatting requirements, refer to the WGMS documentation from the WGMS 2023 database, available at <code>example_data/wgms_documentation.md</code>. If your data format is not compatible with this notebook, feel free to use it as inspiration: adjust the code to your needs and submit a pull request with your modifications, so that other users can benefit from your contribution.
</p>
<h2>Process</h2>
<p style='text-align: justify;'>
To begin, we import the necessary libraries, including the massbalancemachine library. We then define the storage location for files related to our region of interest, which in this case is Iceland. The data used in this demonstration comes from the <a href='https://icelandicglaciers.is/index.html#/page/map'>Icelandic Glaciers inventory</a>, provided by the Icelandic Meteorological Office. Stake measurements for the Icelandic glaciers have already been retrieved via an API call and merged into a single file. The script for these steps can be found in the following directory: <code>regions/Iceland/scripts/data_processing</code>.
</p>

<p style='text-align: justify;'><b>Note:</b>
If your dataset already has one measurement period per record but its column names do not match the WGMS format, please rename them manually. The column names required for data processing are: POINT_LAT, POINT_LON, YEAR, POINT_ELEVATION, POINT_ID, TO_DATE, FROM_DATE, and POINT_BALANCE. If needed, you can reproject your coordinates to WGS84 using the function <code>convert_to_wgs84()</code>. Make sure the column names match exactly, as they are used throughout the pipeline.
</p>
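
<p style='text-align: justify;'>
A minimal sketch of such a manual rename with pandas is given below. The raw column names (<code>latitude</code>, <code>stake_id</code>, and so on) and all values are hypothetical placeholders rather than part of any real dataset; only the target names come from the list above. The final check simply reports any required column that is still missing.
</p>

```python
import pandas as pd

# Column names required by the pipeline (taken from the note above)
REQUIRED_COLUMNS = [
    'POINT_LAT', 'POINT_LON', 'YEAR', 'POINT_ELEVATION',
    'POINT_ID', 'TO_DATE', 'FROM_DATE', 'POINT_BALANCE',
]

# Hypothetical raw dataset: column names and values are made up for illustration
raw = pd.DataFrame({
    'latitude': [64.5],
    'longitude': [-18.2],
    'year': [2010],
    'elev_m': [1200.0],
    'stake_id': ['A01'],
    'date_end': ['2010-05-01'],
    'date_start': ['2009-10-01'],
    'smb': [1.5],
})

# Map each (hypothetical) raw name to the required WGMS name
rename_map = {
    'latitude': 'POINT_LAT',
    'longitude': 'POINT_LON',
    'year': 'YEAR',
    'elev_m': 'POINT_ELEVATION',
    'stake_id': 'POINT_ID',
    'date_end': 'TO_DATE',
    'date_start': 'FROM_DATE',
    'smb': 'POINT_BALANCE',
}
renamed = raw.rename(columns=rename_map)

# Fail early if a required column is still missing after renaming
missing = [col for col in REQUIRED_COLUMNS if col not in renamed.columns]
if missing:
    raise ValueError(f'Missing required columns: {missing}')
```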

In [None]:
import pandas as pd
import massbalancemachine as mbm

<h2>1. Transform your Dataset to the WGMS Format</h2>

In [None]:
# Specify the filename of the input file with the raw data
target_data_fname = './example_data/iceland/files/iceland_stake_dataset.csv'
# Load the target data
data = pd.read_csv(target_data_fname)

First, let's examine the dataset to understand its structure, including the columns and the data they contain.

In [None]:
display(data.head(10))

<h3>1.1 Reshaping the Dataset to the WGMS Format</h3>

<p style='text-align: justify;'>
As you can see, each record in the dataset contains three measurements: one at the start of the hydrological year (beginning of winter), one at the end of winter (start of summer), and one at the end of summer. The measurement periods themselves can be arbitrary, as long as each record contains exactly three of them; other layouts are not supported for now. We want to separate these measurements into individual records, each holding a single measurement with its own surface mass balance.
</p>
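
<p style='text-align: justify;'>
To make the target layout concrete, here is a rough pandas sketch of a one-to-three reshaping on a single made-up record. It is not the implementation of <code>convert_to_wgms</code>. In particular, the pairing of each balance with a start and end date (winter from <code>d1</code> to <code>d2</code>, summer from <code>d2</code> to <code>d3</code>, annual from <code>d1</code> to <code>d3</code>) is an assumption made for illustration, and all values are invented.
</p>

```python
import pandas as pd

# One made-up record; column names mirror those of the Iceland dataset
wide = pd.DataFrame({
    'stake': ['A01'],
    'yr': [2010],
    'd1': ['2009-10-01'],
    'd2': ['2010-05-01'],
    'd3': ['2010-09-30'],
    'bw_stratigraphic': [1.5],
    'bs_stratigraphic': [-2.0],
    'ba_stratigraphic': [-0.5],
})

# ASSUMPTION: which date columns delimit each balance measurement
periods = {
    'bw_stratigraphic': ('d1', 'd2'),  # winter balance
    'bs_stratigraphic': ('d2', 'd3'),  # summer balance
    'ba_stratigraphic': ('d1', 'd3'),  # annual balance
}

# Build one long-format frame per balance column, then stack them
frames = []
for smb_col, (start_col, end_col) in periods.items():
    part = wide[['stake', 'yr']].copy()
    part['FROM_DATE'] = wide[start_col]
    part['TO_DATE'] = wide[end_col]
    part['POINT_BALANCE'] = wide[smb_col]
    frames.append(part)

long_format = pd.concat(frames, ignore_index=True)
print(long_format)
```

Each input record yields three output records that share the stake and year but differ in measurement period and balance.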

In [None]:
# The dictionary keys (left) are the column names as they appear in your dataset;
# the values (right) are the final WGMS column names used in the output.
# Add further key-value pairs for any other columns you would like to keep from the original dataset.
wgms_data_columns = {
    'yr': 'YEAR',
    'stake': 'POINT_ID',
    'lat': 'POINT_LAT',
    'lon': 'POINT_LON',
    'elevation': 'POINT_ELEVATION',
    # Do not change these column names (both keys and values)
    'TO_DATE': 'TO_DATE',
    'FROM_DATE': 'FROM_DATE',
    'POINT_BALANCE': 'POINT_BALANCE',
}

# Please specify the three column names for the three measurement dates (these are specifically for the Iceland dataset)
column_names_dates = ['d1', 'd2', 'd3']

# Please specify the three column names for the three surface mass balance measurements (these are specifically for the Iceland dataset)
column_names_smb = ['bw_stratigraphic', 'bs_stratigraphic', 'ba_stratigraphic']

# Reshape the dataset to the WGMS format
data = mbm.data_processing.utils.convert_to_wgms(
    wgms_data_columns=wgms_data_columns,
    data=data,
    date_columns=column_names_dates,
    smb_columns=column_names_smb,
)

Let's take a look at the dataframe after this reshaping process.

In [None]:
display(data.head(10))

<h3>1.2 Reproject Coordinates to WGS84 Coordinate Reference System</h3>

<p style='text-align: justify;'>
If your coordinates are not yet in WGS84, you can reproject them from their current coordinate reference system (CRS) at this stage. Specify the current CRS of the coordinates through the <code>from_crs</code> argument.
</p>

In [None]:
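# Reproject the coordinates to WGS84; from_crs is the current CRS of the coordinates (4659 for this dataset)
data = mbm.data_processing.utils.convert_to_wgs84(data=data, from_crs=4659)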
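
After the conversion, latitudes should lie between -90 and 90 degrees and longitudes between -180 and 180 degrees. The sketch below shows one way to flag rows that violate this, using a small made-up dataframe; the helper name `find_invalid_coordinates` is hypothetical and not part of the massbalancemachine library.

```python
import pandas as pd

def find_invalid_coordinates(df, lat_col='POINT_LAT', lon_col='POINT_LON'):
    """Return the rows whose coordinates fall outside valid WGS84 ranges.

    Rows with missing coordinates are returned as well, because
    Series.between evaluates to False for NaN.
    """
    lat_ok = df[lat_col].between(-90, 90)
    lon_ok = df[lon_col].between(-180, 180)
    return df[~(lat_ok & lon_ok)]

# Made-up coordinates: the second row looks like projected values in metres
coords = pd.DataFrame({
    'POINT_ID': ['A01', 'A02'],
    'POINT_LAT': [64.5, 500000.0],
    'POINT_LON': [-18.2, 7100000.0],
})

invalid = find_invalid_coordinates(coords)
print(invalid['POINT_ID'].tolist())  # prints ['A02']
```

An empty result suggests the coordinates are at least plausible degrees; it does not prove that the right source CRS was specified.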

In [None]:
# Save the WGMS-formatted dataset (without the dataframe index) for the next notebook
data.to_csv('./example_data/iceland/files/iceland_wgms_dataset.csv',
            index=False)

<p style='text-align: justify;'>
Your dataset is now ready for further processing: retrieving topographical and meteorological features and converting the dataset to a monthly resolution. Once these steps are completed, the dataset is prepared for training. Please refer to this <a href='https://github.com/ODINN-SciML/MassBalanceMachine/blob/main/notebooks/data_processing_wgms.ipynb'>notebook</a> to see how data in the WGMS format is incorporated into the data processing pipeline.
</p>