# MassBalanceMachine Data Processing - Example for Retrieving Training Features for the Norway Region (WGMS)

In this notebook, we will demonstrate the data processing workflow of the MassBalanceMachine using an example with glaciological data from Norwegian glaciers. This example will guide you through the data processing pipeline, which retrieves topographical and meteorological features for all points with glaciological surface mass balance observations.

We begin by importing some basic libraries along with the `massbalancemachine` library. Next, we specify the location where we will store the files for the region of interest (in this case, Norway). The data used in this example is from the [WGMS database](https://wgms.ch/data_databaseversions/), and we will use the specified columns.

In [2]:
import pandas as pd
import geopandas as gpd

import massbalancemachine as mbm

## 1. Load your Target Surface Mass Balance Dataset and Retrieve RGI ID per Stake

In this step, we define and load our glaciological data for the region of interest. The WGMS dataset does not include RGI IDs by default, so we need to retrieve them from a glacier outline shapefile provided by the Randolph Glacier Inventory (v6). Each stake is then matched with an RGI ID. The MassBalanceMachine needs the RGI ID to add topographical and meteorological features for the training stage.

**How to Retrieve the Glacier Outlines:** Download the shapefiles for the region of interest from this [link](https://daacdata.apps.nsidc.org/pub/DATASETS/nsidc0770_rgi_v6/). Extract the files and copy the .shp, .prj, and .dbf files into the correct directory so that you can use them with the Jupyter Notebook. Also, make sure the next code cell points to the correct directory and files.

**Note:** Data records that have an invalid FROM_DATE or TO_DATE, where the day is indicated as 99, are deleted from the dataset.
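
If you need to apply a comparable filter to your own data, the following is a minimal sketch. It assumes that `FROM_DATE` and `TO_DATE` are stored as `YYYYMMDD` values (an assumption about the file layout), the toy records are made up for illustration, and `has_unknown_day` is a hypothetical helper. It is not necessarily how the MassBalanceMachine performs this step.

```python
import pandas as pd

# Made-up records; dates are assumed to be YYYYMMDD integers
records = pd.DataFrame({
    "FROM_DATE": [20101001, 20110999, 20121005],
    "TO_DATE": [20110930, 20120925, 20130599],
})


def has_unknown_day(column: pd.Series) -> pd.Series:
    """Return True where the day part (last two digits) equals 99."""
    return column.astype(str).str.endswith("99")


# Keep only records where both dates have a valid day
invalid = has_unknown_day(records["FROM_DATE"]) | has_unknown_day(records["TO_DATE"])
valid_records = records[~invalid]
```

Only the first toy record survives, because the second has an unknown start day and the third an unknown end day.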

In [2]:
# Specify the filename of the input file with the raw data
target_data_fname = './example_data/norway_wgms_dataset.csv'
# Specify the shape filename of the glaciers outline obtained from RGIv6
glacier_outline_fname = './example_data/glacier_outlines/08_rgi60_Scandinavia.shp'

# Load the target data and the glacier outlines
data = pd.read_csv(target_data_fname)
glacier_outline = gpd.read_file(glacier_outline_fname)

### 1.1 Match the Stake Measurements with an RGI ID

Based on the location of the stake measurement given by POINT_LAT and POINT_LON, each data record is matched with the RGI ID for the glacier where the stake is located.
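
Conceptually, this kind of matching is a point-in-polygon spatial join. The sketch below illustrates the idea with `geopandas.sjoin` on toy data. It is not the implementation of `mbm.utils.get_rgi`: the square outlines, the coordinates, and the placeholder IDs are made up, and the `RGIId` column name is assumed to match the attribute table of the RGI v6 shapefile.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Polygon

# Toy stake table using the WGMS coordinate column names
stakes = pd.DataFrame({
    "POINT_LAT": [61.5, 61.5, 70.0],
    "POINT_LON": [7.5, 8.5, 20.0],
})

# Two made-up square "glacier outlines" with placeholder IDs (not real RGI IDs)
outlines = gpd.GeoDataFrame(
    {"RGIId": ["GLACIER_A", "GLACIER_B"]},
    geometry=[
        Polygon([(7, 61), (8, 61), (8, 62), (7, 62)]),
        Polygon([(8, 61), (9, 61), (9, 62), (8, 62)]),
    ],
    crs="EPSG:4326",
)

# Turn the stake coordinates into point geometries in the same CRS
stake_points = gpd.GeoDataFrame(
    stakes,
    geometry=gpd.points_from_xy(stakes["POINT_LON"], stakes["POINT_LAT"]),
    crs="EPSG:4326",
)

# Left join: every stake is kept, and stakes outside all outlines get a missing ID
matched = gpd.sjoin(
    stake_points,
    outlines[["RGIId", "geometry"]],
    how="left",
    predicate="within",
)
```

A left join keeps stakes that fall outside every outline, which makes unmatched records easy to inspect before deciding whether to drop them.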

In [3]:
# Get the RGI ID for each stake measurement for the region of interest
data = mbm.utils.get_rgi(data, glacier_outline)

Then, we can create a MassBalanceMachine `Dataset` from the loaded dataframe of Norwegian stake data, which now includes the matched RGI IDs:

In [4]:
# Create the Dataset from the dataframe (including the RGI IDs for each of the stakes) and the path to the data directory
dataset = mbm.Dataset(data=data, data_path='./example_data/')

## 2. Get Topographical Features per Stake Measurement

Once we have created a `Dataset`, the first thing we can do is add topographical data to it. The MassBalanceMachine does this automatically (by calling OGGM) as follows:

In [5]:
# Specify the topographical features of interest
# See the OGGM documentation for the available variables
topographical_voi = ['topo', 'aspect', 'slope', 'slope_factor', 'dis_from_border']

# Retrieve the topographical features for each stake measurement and add them to the dataset
dataset.get_topo_features(topographical_voi)

2024-07-04 15:07:58: oggm.cfg: Reading default parameters from the OGGM `params.cfg` configuration file.
2024-07-04 15:07:58: oggm.cfg: Multiprocessing switched OFF according to the parameter file.
2024-07-04 15:07:58: oggm.cfg: Multiprocessing: using all available processors (N=12)
|  -1.0 B Elapsed Time: 0:00:03                                                
2024-07-04 15:08:17: oggm.cfg: PARAMS['border'] changed from `80` to `10`.
2024-07-04 15:08:17: oggm.cfg: Multiprocessing switched ON after user settings.
2024-07-04 15:08:17: oggm.cfg: PARAMS['continue_on_error'] changed from `False` to `True`.
2024-07-04 15:08:17: oggm.workflow: init_glacier_directories from prepro level 3 on 1 glaciers.
2024-07-04 15:08:17: oggm.workflow: Execute entity tasks [gdir_from_prepro] on 1 glaciers
100% of 106.3 MiB |######################| Elapsed Time: 0:00:25 Time:  0:00:25
2024-07-04 15:08:43: oggm.workflow: Execute entity tasks [gridded_attributes] on 1 glaciers


## 3. Get the Meteorological Features per Stake

Once we have the topographical data, we can add the necessary climate data to the dataset. The climate data is taken from ERA5-Land, and the MassBalanceMachine handles this automatically for the region of interest where the glaciers are located.
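
To give an intuition for matching gridded climate data to stake coordinates, here is a minimal nearest-grid-cell lookup with NumPy. It is only an illustration under stated assumptions: the grid, the temperature field, and the stake location are made up, `nearest_index` is a hypothetical helper, and the MassBalanceMachine may use a different matching or interpolation scheme.

```python
import numpy as np

# Made-up regular grid with 0.1 degree spacing (latitude descending, longitude ascending)
grid_lat = np.linspace(62.0, 61.0, 11)
grid_lon = np.linspace(7.0, 8.0, 11)

# Made-up 2 m temperature field in Kelvin with shape (lat, lon)
row_idx, col_idx = np.meshgrid(np.arange(11), np.arange(11), indexing="ij")
t2m = 270.0 + row_idx + 0.1 * col_idx


def nearest_index(grid: np.ndarray, value: float) -> int:
    """Return the index of the grid coordinate closest to value."""
    return int(np.abs(grid - value).argmin())


# Hypothetical stake location
stake_lat, stake_lon = 61.53, 7.48

i = nearest_index(grid_lat, stake_lat)
j = nearest_index(grid_lon, stake_lon)
stake_t2m = t2m[i, j]
```

The stake is assigned the value of the grid cell whose centre is closest in latitude and longitude, which is the simplest way to sample a coarse grid at a point.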

In [None]:
# Specify the directory and the files of the climate data, that will be matched with the coordinates of the stake data
input_era5_fname = '.././regions/iceland/mbm/data/climate/ERA5_monthly_averaged_climate_data.nc'
input_gp_fname = '.././regions/iceland/mbm/data/climate/ERA5_geopotential_pressure.nc'

# Specify the output filename to save the intermediate results
output_climate_fname = 'Norway_Stake_Data_Climate.csv'

# Retrieve the climate features for each stake measurement in the dataset
dataset.get_climate_features(output_climate_fname, input_era5_fname, input_gp_fname, 'd3')

## 4. Transform Data to Monthly Resolution

Finally, we need to transform the dataset to a monthly resolution. This allows SMB predictions to be made at a monthly time step, which are then integrated in both time and space to match the available glaciological and geodetic SMB observations used for training.

In [None]:
# Define which columns are of interest (vois: variables of interest).
# See the metadata file of the ERA5-Land data for all the variable names.
vois_climate = ['t2m', 'tp', 'sshf', 'slhf', 'ssrd', 'fal', 'str']

vois_topo_columns = ['topo', 'aspect', 'slope', 'slope_factor', 'dis_from_border']


# Create a dictionary of all the columns in the dataset that match the variables of interest of the ERA5-Land data
import re

vois_climate_columns = {voi: [col for col in dataset.df.columns if re.match(f'{voi}_[a-zA-Z]*', col)] for voi in vois_climate}

# Specify the column names for the annual and seasonal (winter and summer) mass balance columns in the dataset
smb_column_names = ['ba_stratigraphic', 'bw_stratigraphic', 'bs_stratigraphic']

misc_column_names = ['yr']

# Specify the output filename to save the intermediate results
output_monthly_fname = 'Norway_Stake_Data_Monthly.csv'

dataset.convert_to_monthly(output_monthly_fname, vois_climate_columns, vois_topo_columns, smb_column_names, misc_column_names)
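
To picture what a conversion from one row per stake to one row per stake and month can look like, here is a small pandas sketch. It assumes climate columns named `<variable>_<month>`, as the regular expression above suggests; the `stake_id` column and all values are made up. This is an illustration of a wide-to-long reshape, not the implementation of `convert_to_monthly`.

```python
import pandas as pd

# Made-up wide table: one row per stake, one column per variable and month
wide = pd.DataFrame({
    "stake_id": ["A", "B"],
    "t2m_oct": [272.1, 270.4],
    "t2m_nov": [268.3, 266.9],
    "tp_oct": [0.004, 0.006],
    "tp_nov": [0.005, 0.007],
})

# Reshape to one row per stake and month, with one column per variable
monthly = pd.wide_to_long(
    wide,
    stubnames=["t2m", "tp"],
    i="stake_id",
    j="month",
    sep="_",
    suffix=r"[a-z]+",
).reset_index()
```

The two stakes with two months each become four rows, each holding the `t2m` and `tp` values for a single month.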

We can now take a look at the dataset that will be used for training.

In [None]:
display(dataset.df)