# MassBalanceMachine Data Processing - Example for Retrieving Training Features for the Iceland Region (Custom Data)

In this notebook, we illustrate the data processing workflow of the MassBalanceMachine using glaciological data from Icelandic glaciers. The example walks you through converting your data to the WGMS format. Once your data is in the correct format, you can follow the data processing example [notebook](https://github.com/ODINN-SciML/MassBalanceMachine/blob/main/notebooks/data_processing_wgms.ipynb) written specifically for WGMS(-like) data.

This notebook is intended for users whose data is not yet in the WGMS format, or whose data records each hold more than one measurement. Specifically, we are working with stake measurements from Icelandic glaciers. Each record in our dataset includes three dates and three measurements for a single hydrological year. Our goal is to reformat this dataset so that every such record is split into three separate records, each corresponding to one of the stake measurements recorded within the hydrological year.

To begin, we will import necessary libraries, including the `massbalancemachine` library. Following this, we will define the storage location for files related to our region of interest, which in this case is Iceland. The data used in this demonstration is sourced from the [Icelandic Glaciers Inventory](https://icelandicglaciers.is/index.html#/page/map), provided by the Icelandic Meteorological Office.

Stake measurements for the Icelandic glaciers have already been retrieved via an API call and merged into a single file. The scripts for these steps can be found in the following directory: `regions/iceland/scripts/data_processing`.

**Note:** We strive to accommodate various data formats, but occasionally users may need to make adjustments to ensure compatibility. For assistance, refer to the WGMS documentation, which provides detailed guidance on the formatting requirements. The documentation, from the WGMS 2023 database, can be found at `example_data/wgms_documentation.md`. Following it ensures that your data integrates seamlessly into the MassBalanceMachine workflow.

In [None]:
import pandas as pd
import geopandas as gpd

import massbalancemachine as mbm

## 1. Transform your Dataset to the WGMS Format

In [None]:
# Specify the filename of the input file with the raw data
target_data_fname = './example_data/norway_wgms_dataset.csv'
# Load the target data
data = pd.read_csv(target_data_fname)
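The reshaping described in the introduction (one record with three dates and three measurements becomes three records) could look roughly like the sketch below. It is a minimal illustration, not the notebook's actual transformation: the input column names (`stake`, `lat`, `lon`, `d1`, `d2`, `d3`, `bw`, `bs`, `ba`) are hypothetical, and so is the assumption that the three measurements are a winter, a summer, and an annual balance spanning the date pairs shown. The output column names follow WGMS-style naming; check `example_data/wgms_documentation.md` for the exact fields and date format your data needs.

```python
import pandas as pd

# Hypothetical wide-format records: one row per stake and hydrological year.
raw = pd.DataFrame({
    "stake": ["S1", "S2"],
    "lat": [64.6, 64.7],
    "lon": [-19.6, -19.4],
    "d1": ["2019-10-01", "2019-10-03"],  # first visit (assumed: autumn)
    "d2": ["2020-05-01", "2020-05-04"],  # second visit (assumed: spring)
    "d3": ["2020-09-30", "2020-10-02"],  # third visit (assumed: next autumn)
    "bw": [1.8, 2.1],    # assumed winter balance, d1 -> d2
    "bs": [-2.5, -1.9],  # assumed summer balance, d2 -> d3
    "ba": [-0.7, 0.2],   # assumed annual balance, d1 -> d3
})

# (balance column, start-date column, end-date column) for each measurement.
periods = [("bw", "d1", "d2"), ("bs", "d2", "d3"), ("ba", "d1", "d3")]

# Build one long-format frame per measurement, then stack them.
frames = []
for balance_col, from_col, to_col in periods:
    frames.append(pd.DataFrame({
        "POINT_ID": raw["stake"],
        "POINT_LAT": raw["lat"],
        "POINT_LON": raw["lon"],
        "FROM_DATE": raw[from_col],
        "TO_DATE": raw[to_col],
        "POINT_BALANCE": raw[balance_col],
    }))

# Drop rows where a measurement is missing, so each row is one real observation.
long_format = pd.concat(frames, ignore_index=True).dropna(subset=["POINT_BALANCE"])
print(long_format)
```

Each input row yields up to three output rows, one per measurement period, which is the one-measurement-per-record layout that the rest of the workflow expects.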

## 2. Load your Target Surface Mass Balance Dataset and Retrieve RGI ID per Stake

In this step, we define and load our glaciological data for the region of interest. The WGMS dataset does not include RGI IDs by default, so we need to retrieve them from a glacier outline shapefile provided by the Randolph Glacier Inventory (v6). Each stake is then matched with an RGI ID. The MassBalanceMachine needs the RGI ID to add topographical and meteorological features for the training stage.

**How to Retrieve the Glacier Outlines:** Download the shapefiles for the region of interest from this [link](https://daacdata.apps.nsidc.org/pub/DATASETS/nsidc0770_rgi_v6/). Extract the archive and copy the .shp, .prj, .shx, and .dbf files into a directory that the Jupyter Notebook can read. Make sure the next code cell points to that directory and to the correct files.
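Before loading the outlines, it can help to confirm that all four shapefile components were copied. The helper below is a small sketch using only the standard library; `missing_shapefile_parts` is a hypothetical name introduced here and is not part of `massbalancemachine`.

```python
from pathlib import Path

# The shapefile components listed above.
REQUIRED_SUFFIXES = (".shp", ".shx", ".dbf", ".prj")


def missing_shapefile_parts(shp_path):
    """Return the required suffixes for which no file exists next to shp_path."""
    shp_path = Path(shp_path)
    return [s for s in REQUIRED_SUFFIXES if not shp_path.with_suffix(s).exists()]


# Path taken from the next code cell; adjust it to your own directory layout.
missing = missing_shapefile_parts("./example_data/glacier_outlines/08_rgi60_Scandinavia.shp")
if missing:
    print(f"Missing shapefile components: {missing}")
```

An empty list means every component is in place.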

In [None]:
# Specify the shapefile with the glacier outlines obtained from RGI v6
glacier_outline_fname = './example_data/glacier_outlines/08_rgi60_Scandinavia.shp'

# Load the glacier outlines
glacier_outline = gpd.read_file(glacier_outline_fname)

### 2.1 Match the Stake Measurements with a RGI ID

Based on the location of the stake measurement given by `POINT_LAT` and `POINT_LON`, each data record is matched with the RGI ID of the glacier on which the stake is located.

In [None]:
# Get the RGI ID for each stake measurement for the region of interest
data = mbm.utils.get_rgi(data, glacier_outline)
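For intuition, this kind of matching amounts to a point-in-polygon lookup. The sketch below shows one way to do it with a geopandas spatial join on toy data; it is not necessarily how `mbm.utils.get_rgi` is implemented. The rectangles and RGI IDs are made up, and the sketch assumes the outline file carries an `RGIId` attribute and that the coordinates are in WGS 84 (EPSG:4326).

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box

# Toy stake records using the WGMS coordinate column names.
stakes = pd.DataFrame({
    "POINT_ID": ["A1", "B1", "C1"],
    "POINT_LON": [-19.5, -16.5, -10.0],
    "POINT_LAT": [64.8, 64.4, 60.0],
})

# Toy outlines standing in for the RGI shapefile (rectangles, made-up IDs).
outlines = gpd.GeoDataFrame(
    {"RGIId": ["RGI60-00.00001", "RGI60-00.00002"]},
    geometry=[box(-20.0, 64.5, -19.0, 65.0), box(-17.0, 64.0, -16.0, 64.8)],
    crs="EPSG:4326",
)

# Turn each stake into a point geometry (x = longitude, y = latitude).
points = gpd.GeoDataFrame(
    stakes,
    geometry=gpd.points_from_xy(stakes["POINT_LON"], stakes["POINT_LAT"]),
    crs="EPSG:4326",
)

# Left join keeps every stake; stakes outside all outlines get a missing RGIId.
matched = gpd.sjoin(points, outlines[["RGIId", "geometry"]], how="left", predicate="within")
matched = matched.drop(columns=["index_right", "geometry"])
print(matched)
```

Stakes that fall outside every outline keep a missing `RGIId`, which is a useful signal that coordinates or the chosen outline file should be checked.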

Then, we can create a MassBalanceMachine `Dataset` from the loaded stake dataframe, which now includes the matched RGI IDs, as follows:

In [None]:
# Create the Dataset from the dataframe (now including RGI IDs) and the data directory
dataset = mbm.Dataset(data=data, data_path='./example_data/')