# Part II: Data exploration

Date: 25/08/2022

Authors: Jordi Bolibar & Facundo Sapienza

Once we have successfully retrieved the training dataset using OGGM, we can start exploring and understanding it. The goal of this notebook is to apply some basic data analysis techniques to the data, and to understand the physical reasons behind the features selected for the model.

> **_NOTE_** Before running this notebook, make sure your Jupyter kernel (top left corner of the notebook) has been configured to work with the MB_Finsen conda environment.


In [10]:
import xarray as xr
import numpy as np
import rioxarray
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt

We load the preprocessed training DataFrame:

In [6]:
training_df = pd.read_csv('Data/training_df.csv')

In [7]:
training_df.columns

Index(['rgi_id', 'hydro_year', 'PDD', 'RGI_ID', 'area', 'icecap', 'lat',
       'prcp_01', 'prcp_02', 'prcp_03', 'prcp_04', 'prcp_05', 'prcp_06',
       'prcp_07', 'prcp_08', 'prcp_09', 'prcp_10', 'prcp_11', 'prcp_12',
       'rain', 'slope', 'snow', 'temp_01', 'temp_02', 'temp_03', 'temp_04',
       'temp_05', 'temp_06', 'temp_07', 'temp_08', 'temp_09', 'temp_10',
       'temp_11', 'temp_12', 'zmax', 'zmed', 'zmin'],
      dtype='object')

## Informed feature selection

In this dataset we have already narrowed down a selection of training features for you. Nonetheless, in a research project, this step should not be taken for granted, since it can have a great impact on model design. 

The first question we should ask ourselves is: what variables affect the physical process we are trying to model here? In this case, we are talking about **glacier-wide mass balance**. Theory tells us that the mass balance integrated over the whole surface area of a glacier is affected by both the climate and the topography of the glacier.
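As a rough illustration of this climate/topography split, the columns listed above can be grouped by name. This is only a sketch: the helper `split_features` and the grouping rules are our own assumptions (for instance, `lat` and `icecap` are placed with the topographic features here), not part of the dataset or of OGGM.

```python
import re

# Column names copied from the `training_df.columns` output above.
columns = ['rgi_id', 'hydro_year', 'PDD', 'RGI_ID', 'area', 'icecap', 'lat',
           *[f'prcp_{m:02d}' for m in range(1, 13)],
           'rain', 'slope', 'snow',
           *[f'temp_{m:02d}' for m in range(1, 13)],
           'zmax', 'zmed', 'zmin']

def split_features(cols):
    """Group column names into identifiers, climate and topography.

    Hypothetical helper: the grouping rules below are an assumption
    made for illustration, not something defined by the dataset.
    """
    id_cols = {'rgi_id', 'RGI_ID', 'hydro_year'}
    climate_names = {'PDD', 'rain', 'snow'}
    # Monthly precipitation and temperature columns, e.g. prcp_01, temp_12
    monthly_pattern = re.compile(r'^(prcp|temp)_\d{2}$')

    groups = {'id': [], 'climate': [], 'topography': []}
    for c in cols:
        if c in id_cols:
            groups['id'].append(c)
        elif c in climate_names or monthly_pattern.match(c):
            groups['climate'].append(c)
        else:
            # Everything else is treated as topographic/geographic
            groups['topography'].append(c)
    return groups

feature_groups = split_features(columns)
for name, cols_in_group in feature_groups.items():
    print(name, len(cols_in_group))
```

With the real data, the same call could be made as `split_features(training_df.columns)`, which makes it easy to select only the climate or only the topographic features for later analysis.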

Unlike for point mass balance, the hypsometry of a glacier has a great impact on its glacier-wide mass balance. This is mainly manifested in two ways:
- The glacier surface slope affects ice flow dynamics, particularly the creep component. Generally, steeper glaciers have higher ice velocities, which imply a faster transfer of ice from the higher altitudes of the accumulation area to the lower altitudes of the ablation area. Therefore, a steep glacier will retreat faster (i.e. towards a smaller surface area and a higher mean altitude), but in doing so it will move to higher altitudes with a colder climate, which helps it approach equilibrium. On the other hand, flatter glaciers and ice caps will not retreat as fast, since they react mostly through thinning (see Figure). This means that the glacier surface will move to lower altitudes with a warmer climate, further enhancing melt. This is the opposite of the effect seen in steep mountain glaciers.

- Steep mountain glaciers have a wider altitudinal range, meaning that their mass balance gradient will be higher and more complex. On the other hand, ice caps behave more like an ice cube: the reduced role of ice dynamics and their small altitudinal range imply a more homogeneous mass balance gradient.

![Mountain glacier vs Ice cap retreat](Figures/Hock_Huss_glacier_icecap.png "Mountain glacier vs Ice cap retreat")
*Figure taken from Hock and Huss (2021).*
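
The contrast between steep mountain glaciers and ice caps described above can be checked in the data by deriving an altitudinal range from `zmax` and `zmin` and comparing it, together with `slope`, across the `icecap` flag. The sketch below makes several assumptions: `alt_range` and `add_altitudinal_range` are hypothetical names introduced here, `icecap` is assumed to be a 0/1 flag, and the toy values are invented for illustration and are not taken from `training_df`.

```python
import pandas as pd

def add_altitudinal_range(df):
    """Return a copy of df with a hypothetical `alt_range` column (zmax - zmin)."""
    out = df.copy()
    out['alt_range'] = out['zmax'] - out['zmin']
    return out

# Toy values for illustration only; they are NOT taken from training_df.
# Assumption: icecap == 1 marks ice caps, icecap == 0 marks other glaciers.
toy_df = pd.DataFrame({
    'icecap': [0, 0, 1, 1],
    'slope':  [25.0, 18.0, 6.0, 4.0],
    'zmin':   [2100.0, 2400.0, 600.0, 750.0],
    'zmax':   [3600.0, 3500.0, 1100.0, 1150.0],
})

toy_df = add_altitudinal_range(toy_df)

# Median slope and altitudinal range for each value of the icecap flag
summary = toy_df.groupby('icecap')[['slope', 'alt_range']].median()
print(summary)
```

Applied to the real data, `add_altitudinal_range(training_df)` followed by the same `groupby` would show whether ice caps in the dataset do have a smaller altitudinal range and gentler slopes than mountain glaciers.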

The next question at this point would be: what data are available at the spatial and temporal scales that we are trying to model?