# MassBalanceMachine Data Processing - Example for Importing Custom Data for the Iceland Region

In this notebook, we outline the data processing workflow of the MassBalanceMachine through an example with glaciological data from glaciers in Iceland. This example will help you understand how to use the data processing pipeline, which retrieves topographical and meteorological features for all points with glaciological surface mass balance observations.

We start by importing some basic libraries, as well as `massbalancemachine`. Then, we specify where the files for the region we are working on (in this case, Iceland) are located.

In [1]:
import re
import os
import pandas as pd

import massbalancemachine as mbm

FILE_DIR = '.././regions/iceland/mbm/data/files/'

## 1. Define and Load your Target Surface Mass Balance Dataset

**Expected columns in the dataset (per stake):** longitude ('lon'), latitude ('lat'), RGI ID, and the hydrological year of the measurement. 

When working with custom glaciological data from a region of interest, we first need to import our dataset, which the MassBalanceMachine will then process to add topographical and climatic features for training.

We load the dataset using `pandas`, and we take a look at it.

In [None]:
# Specify the filename of the input file with the raw data
input_target_fname = 'Iceland_Stake_Data_Reprojected.csv'
# Construct the full file path
input_file_path = os.path.join(FILE_DIR, input_target_fname)

df = pd.read_csv(input_file_path)

# Inspect the dataframe
display(df)
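
Before building a `Dataset`, it can help to confirm that the expected columns are actually present. The following is a minimal sketch, not part of the MassBalanceMachine API: `find_missing_columns` is a hypothetical helper, and the column names `'RGIId'` and `'yr'` are assumptions taken from later cells in this notebook (your dataset may use different names).

```python
import pandas as pd

# Assumed column names: 'lon' and 'lat' come from the text above, while
# 'RGIId' and 'yr' are taken from later cells in this notebook.
REQUIRED_COLUMNS = ['lon', 'lat', 'RGIId', 'yr']


def find_missing_columns(data, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `data`."""
    return [col for col in required if col not in data.columns]


# Toy stand-in for the stake dataset (all values are made up)
toy_df = pd.DataFrame({
    'lon': [-19.5],
    'lat': [64.6],
    'RGIId': ['RGI60-06.00001'],
})

missing = find_missing_columns(toy_df)
print(missing)  # the toy frame lacks the year column
```

With the real data, you would call `find_missing_columns(df)` and expect an empty list.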

Then, we can create a MassBalanceMachine `Dataset` from the loaded `DataFrame`, specifying the column name for the RGI IDs and (optionally) the region we are working in.

In [None]:
# Provide the column name for the column that has the RGI IDs for each of the stakes
dataset = mbm.Dataset(df, 'RGIId', FILE_DIR)

## 2. Get the Topographical Features per Stake

Once we have created a `Dataset`, the first thing we can do is add topographical data to it. The MassBalanceMachine does this automatically (by calling OGGM) as follows:

In [None]:
# Specify the output filename to save the intermediate results
output_topo_fname = 'Iceland_Stake_Data_T_Attributes.csv'

# Specify the topographical features of interest 
vois_topo_columns = ['topo', 'aspect', 'slope', 'slope_factor', 'dis_from_border']

# Retrieve the topographical features for each stake measurement in the dataset
dataset.get_topo_features(output_topo_fname, vois_topo_columns)

## 3. Get the Meteorological Features per Stake

Once we have the topographical data, we can add the necessary climate data to the dataset. These data are pulled from ERA5-Land; the MassBalanceMachine handles this automatically for the region of interest where the glaciers are located.

In [None]:
# Specify the paths to the climate data files, which will be matched with the coordinates of the stake data
input_era5_fname = '.././regions/iceland/mbm/data/climate/ERA5_monthly_averaged_climate_data.nc'
input_gp_fname = '.././regions/iceland/mbm/data/climate/ERA5_geopotential_pressure.nc'

# Specify the output filename to save the intermediate results
output_climate_fname = 'Iceland_Stake_Data_Climate.csv'

# Retrieve the climate features for each stake measurement in the dataset
dataset.get_climate_features(output_climate_fname, input_era5_fname, input_gp_fname, 'd3')
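
To give an intuition for what matching gridded climate data with stake coordinates can look like, here is a minimal nearest-grid-point sketch using NumPy. This is an illustration under stated assumptions, not MassBalanceMachine's actual matching logic: the grid, the field values, the stake coordinates, and the helper `nearest_grid_index` are all hypothetical.

```python
import numpy as np

# Hypothetical regular grid at 0.1 degree spacing (the native ERA5-Land resolution)
grid_lats = np.linspace(63.0, 67.0, 41)
grid_lons = np.linspace(-25.0, -13.0, 121)

# Made-up monthly field with shape (month, lat, lon), standing in for one climate variable
rng = np.random.default_rng(0)
toy_field = rng.normal(270.0, 5.0, size=(12, grid_lats.size, grid_lons.size))


def nearest_grid_index(grid, value):
    """Return the index of the grid coordinate closest to `value`."""
    return int(np.abs(grid - value).argmin())


# Made-up stake location
stake_lat, stake_lon = 64.63, -19.52

i = nearest_grid_index(grid_lats, stake_lat)
j = nearest_grid_index(grid_lons, stake_lon)

# Twelve monthly values of the toy field at the grid cell nearest to the stake
stake_series = toy_field[:, i, j]
print(grid_lats[i], grid_lons[j], stake_series.shape)
```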

## 4. Transform Data to Monthly Resolution

Finally, we need to transform the dataset to a monthly resolution. This allows us to perform SMB predictions at a monthly time step, which are then integrated in both time and space to match the available glaciological and geodetic SMB observations used for training.
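
Conceptually, moving to a monthly resolution means reshaping a table with one row per stake into one row per stake and month. The sketch below shows this idea with plain `pandas` on made-up data. It is not the implementation of `convert_to_monthly`; the column naming scheme (`<variable>_<month>`) and the `stake_id` column are assumptions made for illustration.

```python
import pandas as pd

# Toy wide-format table: one row per stake, one column per variable and month
# (all names and values are made up)
wide = pd.DataFrame({
    'stake_id': ['A', 'B'],
    't2m_oct': [271.0, 270.0],
    't2m_nov': [268.5, 267.0],
    'tp_oct': [0.004, 0.005],
    'tp_nov': [0.006, 0.007],
})

# Stack all climate columns into (stake, column name, value) rows
long_df = wide.melt(id_vars='stake_id', var_name='column', value_name='value')

# Split 't2m_oct' into the variable name 't2m' and the month 'oct'
long_df[['variable', 'month']] = long_df['column'].str.split('_', expand=True)

# One row per stake and month, one column per climate variable
monthly = (
    long_df.pivot_table(index=['stake_id', 'month'], columns='variable', values='value')
    .reset_index()
)
print(monthly)
```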

In [None]:
# Define the variables of interest (vois); see the metadata file of the ERA5-Land data for all variable names
vois_climate = ['t2m', 'tp', 'sshf', 'slhf', 'ssrd', 'fal', 'str']

vois_topo_columns = ['topo', 'aspect', 'slope', 'slope_factor', 'dis_from_border']


# Create a dictionary of all the columns in the dataset that match the variables of interest of the ERA5-Land data
vois_climate_columns = {voi: [col for col in df.columns.values if re.match(f'{voi}_[a-zA-Z]*', col)] for voi in vois_climate}

# Specify the column names for the annual and seasonal (winter and summer) mass balance columns in the dataset
smb_column_names = ['ba_stratigraphic', 'bw_stratigraphic', 'bs_stratigraphic']

misc_column_names = ['yr']

# Specify the output filename to save the intermediate results
output_monthly_fname = 'Iceland_Stake_Data_Monthly.csv'

dataset.convert_to_monthly(output_monthly_fname, vois_climate_columns, vois_topo_columns, smb_column_names, misc_column_names)
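
The dictionary comprehension used above to collect the climate columns can be easier to follow on a small example. The sketch below applies the same regular expression to a hypothetical list of column names (`toy_columns` is made up; the real column names depend on the output of the climate step).

```python
import re

vois_climate = ['t2m', 'tp']

# Hypothetical column names, for illustration only
toy_columns = ['lon', 'lat', 't2m_oct', 't2m_nov', 'tp_oct', 'tpx_oct', 'yr']

# Same pattern as in the cell above: the variable name, an underscore,
# then any number of letters. re.match only anchors at the start of the string.
matched = {
    voi: [col for col in toy_columns if re.match(f'{voi}_[a-zA-Z]*', col)]
    for voi in vois_climate
}
print(matched)
# → {'t2m': ['t2m_oct', 't2m_nov'], 'tp': ['tp_oct']}
```

Note that `'tpx_oct'` is not matched for `'tp'`, because the pattern requires the underscore directly after the variable name.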

We can now take a look at the dataset that will be used for training.

In [None]:
display(dataset.df)