# MassBalanceMachine Data Processing - Example for Retrieving Training Features for the Iceland Region (Custom Data)

In this notebook, we illustrate the data processing workflow of the MassBalanceMachine using glaciological data from Icelandic glaciers. The example walks you through converting your data to the WGMS format. Once your data is in the correct format, you can follow the data processing example [notebook](https://github.com/ODINN-SciML/MassBalanceMachine/blob/main/notebooks/data_processing_wgms.ipynb) specifically for WGMS(-like) data.

This notebook is intended for users who do not currently have their data in the WGMS format or whose data records are not yet associated with a single measurement. Specifically, we are working with stake measurements from Icelandic glaciers. Each record in our dataset includes three dates and three measurements for a single hydrological year. Our goal is to reformat this dataset so that each of these single data records is split into three separate data records, each corresponding to one of the recorded stake measurements within the hydrological year.

To begin, we will import necessary libraries, including the `massbalancemachine` library. Following this, we will define the storage location for files related to our region of interest, which in this case is Iceland. The data used in this demonstration is sourced from the [Icelandic Glaciers Inventory](https://icelandicglaciers.is/index.html#/page/map), provided by the Icelandic Meteorological Office.

Stake measurements for the Icelandic glaciers have already been retrieved via an API call and merged into a single file. The script for these processes can be found in the following directory: `regions/iceland/scripts/data_processing`.

**Note:** We strive to accommodate various data formats, but users may occasionally need to make adjustments to ensure compatibility. For assistance, refer to the WGMS documentation, which provides detailed guidance on formatting requirements. The documentation, from the WGMS 2023 database, can be found in the following file: `example_data/wgms_documentation.md`. Following it ensures that your data integrates seamlessly into the MassBalanceMachine workflow. If your data format isn't compatible with this notebook, feel free to use it as inspiration. You can adjust the code to your needs and then submit a pull request with your modifications, so that other users can benefit from your contributions in the future.

**Note:** If your dataset already records one measurement period per record but the column names are not in the WGMS format, please rename them accordingly. The following column names are required for data processing: `POINT_LAT`, `POINT_LON`, `YEAR`, `POINT_ELEVATION`, `POINT_ID`, `TO_DATE`, `FROM_DATE`, `POINT_BALANCE`. If necessary, you can also convert your coordinates to the WGS84 CRS.
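For a dataset that already has one measurement period per record, the renaming can be sketched with plain pandas. This is a minimal illustration and not part of the `massbalancemachine` API: the helper `rename_and_check`, the toy dataframe, and its source column names (`start`, `end`, `smb`, and so on) are hypothetical and only serve to show the idea.

```python
import pandas as pd

# Column names required for data processing (taken from the note above).
REQUIRED_COLUMNS = [
    "POINT_LAT", "POINT_LON", "YEAR", "POINT_ELEVATION",
    "POINT_ID", "TO_DATE", "FROM_DATE", "POINT_BALANCE",
]


def rename_and_check(df, mapping):
    """Rename columns using `mapping` and raise if a required column is missing."""
    renamed = df.rename(columns=mapping)
    missing = [col for col in REQUIRED_COLUMNS if col not in renamed.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    return renamed


# Hypothetical one-row dataset with non-WGMS column names and made-up values.
example = pd.DataFrame({
    "lat": [64.6], "lon": [-17.4], "yr": [2010], "elevation": [1200.0],
    "stake": ["A1"], "start": ["2009-10-01"], "end": ["2010-09-30"], "smb": [-0.5],
})

# Keys are the names in the toy dataset, values are the WGMS names.
example_mapping = {
    "lat": "POINT_LAT", "lon": "POINT_LON", "yr": "YEAR",
    "elevation": "POINT_ELEVATION", "stake": "POINT_ID",
    "start": "FROM_DATE", "end": "TO_DATE", "smb": "POINT_BALANCE",
}

checked = rename_and_check(example, example_mapping)
```

Raising an error early makes a missing or misspelled column visible before any later processing step fails on it.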

In [None]:
import numpy as np
import pandas as pd
import geopandas as gpd

import massbalancemachine as mbm

## 1. Transform your Dataset to the WGMS Format

In [None]:
# Specify the filename of the input file with the raw data
target_data_fname = './example_data/iceland/files/iceland_stake_dataset.csv'
# Load the target data
data = pd.read_csv(target_data_fname)

First, let's examine the dataset to understand its structure, including the columns and the data they contain.

In [None]:
display(data.head(10))

### 1.1 Reshaping the Dataset to WGMS-format

As you can see, each record in the dataset contains three measurements: one at the start of the hydrological year (beginning of winter), one at the end of winter (start of summer), and one at the end of summer. The measurement periods can also be arbitrary, as long as there are three per record. For now, we do not account for other data formats. We would like to separate these measurements into individual records, each with its own measurement period and surface mass balance.

In [None]:
# On the left side of the dictionary (the keys), specify the column names as they appear in your dataset.
# Add further key-value pairs for any other columns you would like to keep from the original dataset.
# The values on the right side will be the final column names in your dataset.
wgms_data_columns = {
      'yr'        : 'YEAR',
      'stake'     : 'POINT_ID',
      'lat'       : 'POINT_LAT',
      'lon'       : 'POINT_LON',
      'elevation' : 'POINT_ELEVATION',
      # Do not change these column names (both keys and values)
      'TO_DATE'   : 'TO_DATE',
      'FROM_DATE' : 'FROM_DATE',
      'POINT_BALANCE' : 'POINT_BALANCE',
}

# Please specify the three column names for the three measurement dates (these are specifically for the Iceland dataset)
column_names_dates = ['d1', 'd2', 'd3']

# Please specify the three column names for the three surface mass balance measurements (these are specifically for the Iceland dataset)
column_names_smb = ['bw_stratigraphic', 'bs_stratigraphic', 'ba_stratigraphic']

# Reshape the dataset to the WGMS format
data = mbm.utils.convert_to_wgms(wgms_data_columns, data, column_names_dates, column_names_smb)

Let's take a look at the dataframe after this reshaping process.

In [None]:
display(data.head(10))
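To make the reshaping step concrete, here is a minimal pandas sketch of splitting one wide record into three long records. It is not the implementation of `mbm.utils.convert_to_wgms`. The helper `split_periods`, the toy values, and in particular the pairing of dates with balances (winter from `d1` to `d2`, summer from `d2` to `d3`, annual from `d1` to `d3`) are assumptions made for illustration.

```python
import pandas as pd

# Toy record with made-up values; column names mirror the Iceland dataset.
raw = pd.DataFrame({
    "stake": ["A1"],
    "yr": [2010],
    "d1": ["2009-10-01"],
    "d2": ["2010-05-01"],
    "d3": ["2010-09-30"],
    "bw_stratigraphic": [1.5],
    "bs_stratigraphic": [-2.0],
    "ba_stratigraphic": [-0.5],
})

# ASSUMED pairing of (start date, end date, balance column) for each period.
periods = [
    ("d1", "d2", "bw_stratigraphic"),  # winter balance
    ("d2", "d3", "bs_stratigraphic"),  # summer balance
    ("d1", "d3", "ba_stratigraphic"),  # annual balance
]


def split_periods(df, periods, id_columns):
    """Return one row per (record, period) with FROM_DATE, TO_DATE and POINT_BALANCE."""
    parts = []
    for start_col, end_col, balance_col in periods:
        part = df[id_columns].copy()
        part["FROM_DATE"] = df[start_col]
        part["TO_DATE"] = df[end_col]
        part["POINT_BALANCE"] = df[balance_col]
        parts.append(part)
    long_df = pd.concat(parts, ignore_index=True)
    # Drop periods for which no balance was measured.
    return long_df.dropna(subset=["POINT_BALANCE"])


long_df = split_periods(raw, periods, id_columns=["stake", "yr"])
```

Each input record yields up to three output records, one per measurement period, which is the one-measurement-per-record layout that the WGMS format expects.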

### 1.2 Reproject Coordinates to WGS84 Coordinate Reference System

At this stage, if your coordinates are not already in WGS84, you can convert them from their current coordinate reference system (CRS). Please specify the current CRS of the coordinates as an EPSG code.

In [None]:
# Reproject the coordinates to WGS84; EPSG:4659 is the ISN93 geographic CRS of the Iceland dataset
data = mbm.utils.transform_crs(data, from_crs=4659)
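For readers who want to see what such a reprojection involves, the following is a minimal sketch using `pyproj`. It is not the implementation of `mbm.utils.transform_crs`. The helper `to_wgs84` and the toy coordinates are hypothetical, and the sketch assumes that longitude and latitude are stored in `POINT_LON` and `POINT_LAT`.

```python
import pandas as pd
from pyproj import Transformer


def to_wgs84(df, from_crs, lon_col="POINT_LON", lat_col="POINT_LAT"):
    """Return a copy of `df` with its coordinates reprojected to WGS84 (EPSG:4326)."""
    # always_xy=True fixes the axis order to (x, y), i.e. (longitude, latitude).
    transformer = Transformer.from_crs(from_crs, 4326, always_xy=True)
    lon, lat = transformer.transform(df[lon_col].to_numpy(), df[lat_col].to_numpy())
    out = df.copy()
    out[lon_col] = lon
    out[lat_col] = lat
    return out


# Hypothetical stake location with made-up coordinates in ISN93 (EPSG:4659).
points = pd.DataFrame({"POINT_LON": [-17.4], "POINT_LAT": [64.6]})
points_wgs84 = to_wgs84(points, from_crs=4659)
```

Setting `always_xy=True` avoids the common pitfall of swapped latitude and longitude, since some CRS definitions declare latitude as the first axis.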

In [None]:
# Save the WGMS-formatted dataset for use in the data processing notebook
data.to_csv('./example_data/iceland/files/iceland_wgms_dataset.csv')

Your dataset is now ready for further processing: retrieving topographical and meteorological features and converting the dataset to a monthly resolution. Once these steps are completed, the dataset is ready for training. Please refer to this [notebook](https://github.com/ODINN-SciML/MassBalanceMachine/blob/main/notebooks/data_processing_wgms.ipynb) to see how data in the WGMS format can be incorporated into the data processing pipeline.