# MassBalanceMachine Data Processing - Example for Iceland

In this notebook, the data processing part of the MassBalanceMachine is outlined through an example with stake data from glaciers in Iceland. This example will help you understand how to use the data processing pipeline, which retrieves topographical and meteorological features for the stake data.

In [1]:
import pandas as pd

# Import the submodules
from mbm import data_processing

## 1. Load Your Target Surface Mass Balance Dataset

In [3]:
df = pd.read_csv('./regions/iceland/mbm/data/files/Iceland_Stake_Data_Cleaned.csv')

## 2. Get the Topographical Features

**Input:**  your initial dataset with   
**Output:**  target_smb_dataset_topo.csv

In [None]:
df = data_processing.get_oggm_data('06')

## 3. Get the Meteorological Features

## 4. Transform to WGMS Format

## 5. Feature Engineering

## 6. Clean Code

7. ```get_oggm_data.py``` The script in the ```OGGM``` folder, located outside the current working directory, retrieves topographical features for each stake on the glaciers. These features include slope, aspect, slope factor, and distance from the border. This script is stored in a separate folder because it requires a remote environment, WSL, to run. In the future, the aim is to make this a separate step in the data pipeline that no longer requires manual work, e.g., starting the remote process and activating the Conda environment. The output file of this script is: ```Iceland_Stake_Data_T_Attributes.csv```.
8. ```get_climate_data.py``` For each stake measurement, monthly averaged climate data spanning 1950 to 2024 was retrieved from the Copernicus Climate Data Store (CDS), [here](https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-land-monthly-means?tab=overview), including the geopotential surface pressure retrieved from [here](https://confluence.ecmwf.int/display/CKB/ERA5-Land%3A+data+documentation#ERA5Land:datadocumentation-parameterlistingParameterlistings). The monthly averaged climate data consists of the following variables: '10m_u_component_of_wind', '10m_v_component_of_wind', '2m_dewpoint_temperature', '2m_temperature', 'forecast_albedo', 'snow_albedo', 'snow_cover', 'snow_density', 'snow_depth', 'snow_depth_water_equivalent', 'snow_evaporation', 'snowfall', 'snowmelt', 'surface_latent_heat_flux', 'surface_net_solar_radiation', 'surface_net_thermal_radiation', 'surface_sensible_heat_flux', 'surface_solar_radiation_downwards', 'surface_thermal_radiation_downwards', 'temperature_of_snow_layer', 'total_precipitation', 'surface_pressure'. Please consult ```climate_metadata.txt``` for the corresponding variable names in the dataset file. The output of this script is: ```Iceland_Stake_Data_Climate.csv```.

   1. Data from the CDS can be fetched by: ```get_ERA5_monthly_averaged_climate_data.py```.
9. ```transform_data_to_wgms.py``` Transforms and refactors the data to the [WGMS](https://wgms.ch/) format, in case the data is ever uploaded. The output of this script is: ```Iceland_Stake_WGMS.csv```.
10. ```clean_data.py``` Removes unnecessary columns from the dataset, renames certain columns, and deletes records without climate and altitude data. The output of this script is: ```Iceland_Stake_Data_Cleaned.csv```.
11. ```feature_engineering.py``` Adds additional features to the cleaned file from step 10 that might improve the performance of the model.
   
   1. Adds a new feature to the dataset that is the height difference between the _geopotential_height_ and the stake _elevation_. With this height difference, the climate data can be downscaled to the elevation of the stake measurement location.  
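
The cleaning and feature engineering steps described above can be sketched with pandas as follows. This is only a minimal illustration under stated assumptions, not the code of ```clean_data.py``` or ```feature_engineering.py```: the toy records, the column names (```elev```, ```t2m```, ```notes```, ```height_diff```), and the sign convention (geopotential height minus stake elevation) are all hypothetical.

```python
import pandas as pd

# Toy records standing in for the stake dataset. All column names and
# values are hypothetical and are not taken from the Iceland files.
df = pd.DataFrame({
    "stake_id": ["A1", "A2", "A3", "A4"],
    "elev": [850.0, 1200.0, None, 1430.0],            # stake elevation (m)
    "geopotential_height": [910.0, 1150.0, 1300.0, 1500.0],
    "t2m": [271.3, 268.9, 267.5, None],                # one climate variable
    "notes": ["", "tilted", "", ""],                   # column not needed later
})


def clean(data):
    """Drop an unneeded column, rename one, and remove incomplete records."""
    out = data.drop(columns=["notes"]).rename(columns={"elev": "elevation"})
    # Keep only records that have both altitude and climate data.
    return out.dropna(subset=["elevation", "t2m"]).reset_index(drop=True)


def add_height_difference(data):
    """Add geopotential height minus stake elevation as a new column."""
    out = data.copy()
    # Assumed sign convention: positive when the reanalysis grid cell
    # sits above the stake.
    out["height_diff"] = out["geopotential_height"] - out["elevation"]
    return out


features = add_height_difference(clean(df))
print(features)
```

Records A3 and A4 are dropped because they lack an elevation or a climate value, and the remaining rows gain a ```height_diff``` column that a model could use to relate grid-cell climate values to the stake location.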