# GeoDN Course 2: Fundamentals of Geospatial Data and Modeling - Part 2 Geospatial Foundation Models and Workflows #
> Copyright (c) 2024 International Business Machines Corporation

> This software is released under the MIT License.
> https://opensource.org/licenses/MIT

# Section 1 - Geospatial Foundation Model: Burn scar fine-tuning
This is an example of how to fine-tune a model to map burn scars from HLS data using the IBM Geospatial Foundation models as a starting point.  

To run a fine-tuning experiment for burn scar mapping, we will use the MMSegmentation library (https://github.ibm.com/GeoFM-Finetuning/mmsegmentation) to fine-tune a model, starting from the geospatial foundation model pre-trained on HLS data.

If we are starting a new fine-tuning project, we need training data (including labels) and the pre-trained foundation model weights. These have already been stored in an S3 bucket, but they are also available at: https://huggingface.co/ibm-nasa-geospatial

This notebook will guide you through how to:
 1. Create a fine-tuning configuration
 2. Submit a fine-tuning job to run
 3. Monitor and visualise the training




### Prepare
Load the `geodn.modeling` module. We also need the Python packages `os`, `json`, `glob`, `rasterio` and `numpy`, plus `pyplot` from `matplotlib` and `show` from `rasterio.plot`, so let's import those too.


In [None]:
from geodn.modeling import workflow

import os
import json
import glob
import rasterio
import numpy as np
from matplotlib import pyplot
from rasterio.plot import show

### Authenticate to GeoDN

To authenticate to the GeoDN services, add your username and password to the file `~/geodn-creds`, which can be found in the home folder. The format should be `your-username:your-password`. 

Once updated, we can use the `get_token()` function to get an authentication token, which will later be used to access the GeoDN services. The `get_token()` function takes three parameters, `username`, `password` and `geodn_modeling_url`, which are your username, your password and the URL of the GeoDN backend service to connect to, respectively.

These tokens expire after 24 hours. To also return a refresh token, pass `refresh=True` to `get_token()`.
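
As a minimal sketch of reading the `username:password` format described above, the helper below splits on the first colon only, so a password that itself contains a colon is kept intact. The function name `parse_credentials` is hypothetical and is not part of the GeoDN SDK.

```python
def parse_credentials(text: str) -> tuple:
    """Split a 'username:password' string on the first colon only."""
    username, sep, password = text.strip().partition(":")
    if not sep or not username or not password:
        raise ValueError("expected credentials in the form 'username:password'")
    return username, password


# Example with made-up credentials (not real ones)
print(parse_credentials("alice:s3cr3t:with:colons\n"))
```

Raising an error early for a malformed file tends to be easier to debug than a failed login later on.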

In [None]:
with open("../.." + '/geodn-creds', 'r') as file:
    data = file.read().rstrip()
    # Split on the first colon only, so passwords containing ':' stay intact
    username, password = data.split(':', 1)

assert username and password

# Get the tokens
id_token, access_token = workflow.get_token(username, password, geodn_modeling_url=os.environ["GEODN_URL"])
assert id_token and access_token

### Connect to GeoDN modeling

Finally, we can connect to the GeoDN Modeling service. Here you pass the token and create the connection to the GeoDN Modeling APIs. This allows you to submit models to the cluster, check their status, access logs and download files.

To determine which backend service to connect to, we use the arguments `api_url`, `core_url` and `workflow_url`. These have been set as environment variables, but they can be changed if you later need to connect to a different backend service.

In [None]:
geodn_modelling = workflow.GeoDN_Modeling(
    bearer_token=id_token,
    api_url=os.environ["GEODN_URL"],
    core_url=os.environ["GEODN_CORE_URL"],
    workflow_url=os.environ["GEODN_WORKFLOW_URL"],
)

Sometimes this function returns errors. If so, it may be that the `id_token` has expired and needs to be regenerated. Re-run the `get_token` function to generate a new token.

# 1. Creating fine-tuning configuration


![Fine-tune architecture](finetune_arch.png)

### Brief introduction to the hyperparameters we will adapt as part of this session
**Loss function**:
    Both tasks we will solve as part of this exercise are binary semantic segmentation tasks (e.g., pixelwise classification of flood vs. background). Two loss functions are available for these tasks:

| Loss function | Description | Code |
| ------------- | ----------- | ---- |
| CrossEntropyLoss | Sensitive to class imbalance, but very general and a good choice for an initial training run | `type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1, class_weight=[0.15, 0.85]` |
| DiceLoss | Invariant to class imbalance, but tends to be more sensitive to other hyperparameters | `type='DiceLoss', use_sigmoid=False, loss_weight=1` |

We observed good performance with unweighted Dice loss in our experiments.
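
To make the difference between the two losses concrete, here is a simplified pure-Python sketch for a binary problem. It is an illustration only and not MMSegmentation's implementation: the function names are hypothetical, the weighted cross entropy is normalised by the sum of the weights as a simplifying choice, and the Dice loss is computed for the foreground class only.

```python
import math


def weighted_cross_entropy(probs, labels, class_weight):
    """Weighted CE for binary labels; probs[i] is the predicted P(class 1)."""
    total, norm = 0.0, 0.0
    for p, y in zip(probs, labels):
        p_true = p if y == 1 else 1.0 - p
        w = class_weight[y]
        total += -w * math.log(max(p_true, 1e-12))
        norm += w
    return total / norm


def dice_loss(probs, labels, eps=1e-6):
    """Soft Dice loss for the foreground class (1 - Dice coefficient)."""
    intersection = sum(p * y for p, y in zip(probs, labels))
    denom = sum(probs) + sum(labels)
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)


# Toy image: 2 foreground pixels out of 20 (10% foreground)
labels = [1, 1] + [0] * 18
# A degenerate model that predicts "background" almost everywhere
probs = [0.05] * 20

print(weighted_cross_entropy(probs, labels, [0.5, 0.5]))    # equal class weights
print(weighted_cross_entropy(probs, labels, [0.15, 0.85]))  # foreground up-weighted
print(dice_loss(probs, labels))
```

With equal class weights, the cross entropy of this degenerate prediction stays comparatively small even though both foreground pixels are missed. The up-weighted cross entropy is larger, and the Dice loss is close to 1.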

**Weighting classes** in the loss function:
    As described above, some loss functions (like CE loss) are sensitive to class imbalance. We can counter class imbalance by weighting the classes in the loss. For example, in flood mapping ~5-10% of the pixels belong to flood events, while the rest is background. To counter this imbalance, we can set the class weight of the flood class to, e.g., 90%, and assign the background class a weight of 10%. For segmentation of burn scars there are only two classes, so only two class weights need to be specified.
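
One common heuristic for choosing such weights (not necessarily the one used to pick the values in this course) is to make each weight proportional to the inverse of the class frequency. The sketch below assumes a flattened label mask in which class 0 is background and class 1 is the target class. The function name is hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels, num_classes=2):
    """Return class weights proportional to 1 / class frequency, summing to 1."""
    counts = Counter(labels)
    if any(counts[c] == 0 for c in range(num_classes)):
        raise ValueError("every class must appear at least once")
    inverse = [len(labels) / counts[c] for c in range(num_classes)]
    total = sum(inverse)
    return [w / total for w in inverse]


# Toy mask: 10% of pixels belong to class 1 (e.g. flood or burn scar)
mask = [1] * 10 + [0] * 90
print([round(w, 2) for w in inverse_frequency_weights(mask)])  # [0.1, 0.9]
```

For a 10% target class this reproduces the 10% / 90% split mentioned above.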


Example of Cross Entropy Loss options for burn scars:

| Weight land class | Weight burn scar class | Code |
| ----------------- | ---------------------- | ---- |
| 0.3 | 0.7 | `[0.3, 0.7]` |
| 0.1 | 0.9 | `[0.1, 0.9]` |

**Learning rate**: Defines how much we want the model to change in response to the estimated error each time the model weights are updated. Options: `6e-4`, `6e-5`, `6e-6`

**Auxiliary head**: To stabilize the fine-tuning process, the model includes not only an encoder and a decode head for segmentation, but also an auxiliary head. This part of the architecture helps to make the model more robust during fine-tuning. You can add or remove the auxiliary head using the boolean option: `aux_head=True`, `aux_head=False`

**Depth of the decoder**: Generally, the decoder is quite light-weight compared to the GeoFM encoder. A default choice would be one or two layers of convolutions. Increasing this value gives the model more parameters with which to adapt to the downstream task, at the cost of heavier computation (fine-tuning will take more time!). Options: `decode_head_conv = 1`, `decode_head_conv = 2`

**Number of epochs**: Deep neural networks typically require a certain number of epochs to converge. For example, in our experiments, we observed that fine-tuning for flood mapping reaches a good fit after ~40-50 epochs. *Please do not use more than 50 epochs, so that the computation time stays manageable. :-)*
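
To get a feel for how many distinct experiments the options above allow, the following sketch enumerates them. The option values come from the text. The dictionary keys `loss_type` and `learning_rate` are illustrative names and are not necessarily the keys used in the payload file.

```python
from itertools import product

# Options listed in the text above; the dictionary keys are illustrative only
search_space = {
    "loss_type": ["CrossEntropyLoss", "DiceLoss"],
    "learning_rate": [6e-4, 6e-5, 6e-6],
    "aux_head": [True, False],
    "decode_head_conv": [1, 2],
}

# Cartesian product of all option lists, one dict per candidate experiment
experiments = [
    dict(zip(search_space, values)) for values in product(*search_space.values())
]

print(len(experiments))  # 24
print(experiments[0])
```

Even this small grid yields 24 combinations before class weights are varied, which is why it helps to split the options between participants rather than try them all.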

### Setting up your experiment
Now set up your experiment by editing the options you wish to choose in `example_workflows/payload_mmseg.json` (based on the descriptions above and on discussions). Don't edit `num_epochs`, `batch_size`, `number_training_files` or `project_name`.

You then generate your config, which places the options you have chosen into a configuration file that you can view using the next cell.
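
If you prefer to change options programmatically, a guard like the one below can prevent accidental edits to the four protected keys. This is a sketch under assumptions: it treats the payload as a flat dictionary, the helper name `apply_overrides` is hypothetical, and the example payload contents are made up (the real file may be structured differently).

```python
import json

# Keys the text asks you to leave unchanged
PROTECTED_KEYS = {"num_epochs", "batch_size", "number_training_files", "project_name"}


def apply_overrides(payload: dict, overrides: dict) -> dict:
    """Return a copy of payload with overrides applied, rejecting protected keys."""
    blocked = PROTECTED_KEYS.intersection(overrides)
    if blocked:
        raise ValueError(f"these options should not be edited: {sorted(blocked)}")
    return {**payload, **overrides}


# Hypothetical payload contents, for illustration only
payload = {"project_name": "demo", "num_epochs": 50, "learning_rate": 6e-5}
updated = apply_overrides(payload, {"learning_rate": 6e-4})
print(json.dumps(updated, indent=2))
```

In practice you would load the payload with `json.load`, apply the overrides, and write the result back with `json.dump`.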

------------------------------------------------


# 2. Submitting fine-tuning job to run

Now that we have the configuration script ready, we can send it to the cluster to run. The next cell will submit the job to the cluster using TorchX. This will spin up a new pod/container where the fine-tuning will run.

In [None]:
# Choose the workflow (e.g. slope_calc, ifm_standard)
wf = 'mmseg-plus-inference'


wfy, dag, input_files = geodn_modelling.parse_and_print(
    workflow_file='./example_workflows/workflow_' + wf + '.py', 
    payload_file='./example_workflows/payload_' + wf + '.json'
    )

The cell above should return a successful compilation and we can now visualise the workflow graph.  This will give you a first indication if you have constructed the graph you intended to.  In future, more detailed and interactive visualisations will be available.

In [None]:
# Optional step to view the DAG described by the workflow template and payload together
G = workflow.draw_workflow(dag)

***
## Submit the workflow to the GeoDN Modeling cluster to run

Once you have tested that your workflow function successfully ingests the user payload and generates the correct graph, it is time to run the workflow on a GeoDN Modeling cluster and check the outputs.

To do this we will use the `parse_and_submit` function. This is very similar to `parse_and_print`, but it sends the workflow to the cluster (specified by the URL at the top of the notebook). The only additional argument you need to provide is a name for your workflow instance.

If the submission is successful, you should get a response which includes the `model_run_id` which is the main identifier linked to that run of the workflow.  You will use that later to check status, access logs and generated files.
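
A failed submission may not contain the `data` or `model_run_id` fields, in which case plain dictionary indexing raises a bare `KeyError`. The hypothetical helper below, which is not part of the GeoDN SDK, sketches one way to surface the full response instead. It assumes only the response shape described above.

```python
def extract_model_run_id(response: dict) -> str:
    """Return response['data']['model_run_id'] or raise a descriptive error."""
    try:
        return response["data"]["model_run_id"]
    except (KeyError, TypeError) as err:
        raise RuntimeError(
            f"submission did not return a model_run_id: {response!r}"
        ) from err


# Made-up response for illustration
print(extract_model_run_id({"data": {"model_run_id": "abc-123"}}))
```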

In [None]:
# Submit the workflow chosen above (wf) to the cluster
response = geodn_modelling.parse_and_submit(name=wf, 
                      workflow_file='./example_workflows/workflow_' + wf + '.py', 
                      payload_file='./example_workflows/payload_' + wf + '.json',
                      )

model_run_id = response['data']['model_run_id']

In [None]:
print(model_run_id)

------------------------------------------------

# 3. Monitor the workflow and check outputs


Once you have submitted the workflow to the cluster to run, you can monitor its progress by:

### Checking the workflow status
The `workflowStatus` function will (for a given `model_run_id`) show the status of the workflow run. At present, this is the status of any steps which are running or have run. You will see `Pending`, `Running`, `Succeeded` or `Failed`. For a failed step it will show an error message (this will often refer you to the logs, see below). The JSON returned (`opr` here) contains more detailed information, such as when the steps ran.
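
Rather than re-running the status cell by hand, you could poll until the run reaches a final state. The sketch below is generic: `get_status` is a hypothetical callable that you would write yourself to return one of the four status strings. How to extract that string from the `workflowStatus` result is not covered here, so the example uses a stand-in.

```python
import time


def wait_for_completion(get_status, poll_seconds=30.0, max_polls=120):
    """Call get_status() until it returns 'Succeeded' or 'Failed'."""
    for _ in range(max_polls):
        status = get_status()
        if status in ("Succeeded", "Failed"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("workflow did not finish within the polling budget")


# Stand-in for a real status lookup: replays a fixed sequence of statuses
fake_statuses = iter(["Pending", "Running", "Running", "Succeeded"])
print(wait_for_completion(lambda: next(fake_statuses), poll_seconds=0))
```

Capping the number of polls keeps the notebook from hanging indefinitely if a run gets stuck.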

In [None]:
model_run_id

In [None]:
opr = geodn_modelling.workflowStatus(model_run_id)

### Pulling the log files:

Log files for all steps in GeoDN Modeling workflows are archived in the same S3-compatible bucket where the data files reside. This is an asynchronous process, so there is a delay before logs become available through the SDK (currently up to 10 minutes).

The GeoDN SDK `grab_logs` function finds the archived logs related to a given `model_run_id`, loads them and cleans them. The `print_logs` function prints the extracted logs.

In [None]:
logs = geodn_modelling.grab_logs(model_run_id)

In [None]:
pod_logs = workflow.print_logs(logs, level='content')
print(json.dumps(pod_logs, indent=1))

# 4. Downloading and checking output files
In a GeoDN Modeling workflow, all the data files from a particular run of a workflow are stored in an S3-compatible bucket. This includes the input files, the final outputs and all intermediate files. To check that the workflow has run correctly, we can check that the correct files have been generated, download them and take a look.

You can list the non-log files from the workflow with `geodn_modelling.listWorkflowFiles(model_run_id)` and download the ones you want to look at with `geodn_modelling.downloadFiles()`.

Alternatively, you can use the interactive downloader, where you can select multiple files, specify the download path in the box at the bottom and click the download button.
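
When the file list is long, it can help to filter it before downloading. The sketch below assumes the listing is a list of path strings. The helper name `select_files`, the default patterns and the example paths are all made up for illustration.

```python
from fnmatch import fnmatch


def select_files(paths, patterns=("*_pred.tif", "*.log.json")):
    """Keep only the paths that match at least one shell-style pattern."""
    return [p for p in paths if any(fnmatch(p, pat) for pat in patterns)]


# Made-up listing for illustration
listing = [
    "default/run-1/main/subsetted_512x512_HLS_pred.tif",
    "default/run-1/main/20240101_000000.log.json",
    "default/run-1/main/config.py",
]
print(select_files(listing))
```

Each selected path could then be passed to `downloadFiles`, as in the cells that follow.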

In [None]:
workflow.fileDownloader(geodn_modelling, model_run_id)

In [None]:
geodn_modelling.listWorkflowFiles(model_run_id)

In [None]:
geodn_modelling.downloadFiles(f"default/{model_run_id}/main/subsetted_512x512_HLS_pred.tif")
geodn_modelling.downloadFiles(f"default/{model_run_id}/main/subsetted_512x512_HLS.S30.T10SEH.2018190.v1.4_inf_merged.tif")

# Also download the training log file(s), whose names end in 'log.json'
for ofile in geodn_modelling.listWorkflowFiles(model_run_id):
    if ofile.endswith('log.json'):
        geodn_modelling.downloadFiles(ofile)

## Viewing the training metrics

In [None]:
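import pandas as pd


def load_tune_metrics():
    """Load the first '*.log.json' file in the working directory into train/val DataFrames."""
    log_name = glob.glob('*.log.json')
    if not log_name:
        raise FileNotFoundError("no '*.log.json' file found; download the training log first")

    with open(log_name[0]) as fp:
        lines = [line.rstrip('\n') for line in fp]
    # The first line is skipped; the remaining lines are parsed as one JSON record each
    metrics = [json.loads(X) for X in lines[1:]]

    train_df = pd.DataFrame.from_records([d for d in metrics if d['mode']=='train'])
    val_df = pd.DataFrame.from_records([d for d in metrics if d['mode']=='val'])
    return train_df, val_df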

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Reuse the helper defined in the previous cell instead of repeating the parsing code
train_df, val_df = load_tune_metrics()

plt.figure(figsize=(20,10))
plt.subplot(1, 2, 1)
plt.plot(train_df.index, train_df.loss, '-k');
plt.xlabel('evaluation_interval');
plt.ylabel('training loss');
plt.subplot(1, 2, 2)
plt.plot(val_df.index, val_df.aAcc, '-k');
plt.xlabel('evaluation_interval');
plt.ylabel('validation aAcc');
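
The cells above appear to assume a JSON-lines log: a first line that is skipped, followed by one JSON object per line with a `mode` field of `train` or `val`. The standard-library sketch below illustrates that assumed layout on made-up log content. The field values are invented and the real log may contain more keys.

```python
import json

# Made-up log content in the assumed format
sample_log = """\
{"env_info": "example", "seed": 0}
{"mode": "train", "epoch": 1, "loss": 0.62}
{"mode": "val", "epoch": 1, "aAcc": 91.3}
{"mode": "train", "epoch": 2, "loss": 0.41}
"""

# Skip the first line, ignore blank lines, and parse the rest as JSON records
records = [json.loads(line) for line in sample_log.splitlines()[1:] if line.strip()]
train = [r for r in records if r.get("mode") == "train"]
val = [r for r in records if r.get("mode") == "val"]

print([r["loss"] for r in train])  # [0.62, 0.41]
print([r["aAcc"] for r in val])    # [91.3]
```

Using `r.get("mode")` rather than `r["mode"]` tolerates records that lack a `mode` field.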

***
# 5. Model inference

Once we have a trained model, we can use it to run inference on other images. This can also be done in a separate workflow.

In [None]:
inference_files = sorted(glob.glob('*pred.tif'))

original_file = 'main_subsetted_512x512_HLS.S30.T10SEH.2018190.v1.4_inf_merged.tif'
predict_file = 'main_subsetted_512x512_HLS_pred.tif'

with rasterio.open(original_file) as src:
    redArray = src.read(1)
    redArray=redArray/np.max(redArray,axis=(0,1))
    greenArray = src.read(2)
    greenArray=greenArray/np.max(greenArray,axis=(0,1))
    blueArray = src.read(3)
    blueArray=blueArray/np.max(blueArray,axis=(0,1))
    
    im_rgb = np.array([redArray,greenArray,blueArray])
    
with rasterio.open(predict_file) as src:
    pred=np.squeeze(src.read())
    
pyplot.title('Original image')
show(im_rgb)
pyplot.title('Prediction')
pyplot.imshow(pred)

  