# HLS Foundation Model Finetuning notebook

This notebook demonstrates the steps to fine-tune the HLS foundation model (also known as Prithvi), which was pre-trained on the HLSL30 and HLSS30 datasets.

Note: This entire notebook is designed to run in the AWS SageMaker environment. Access details for the SageMaker environment for your account are available at http://smd-ai-workshop-creds-webapp.s3-website-us-east-1.amazonaws.com/.

![HLS Training](../images/HLS-training.png)

In [2]:
# Install required packages
!pip install -r ../requirements.txt

# Create directories needed for data, model, and config preparations
# (-p makes the command safe to re-run when the directories already exist)
!mkdir -p datasets models configs


## Dataset preparation

For this hands-on session, the Burn Scars example is used for fine-tuning. All of the data and pre-trained models are available on Hugging Face. The Hugging Face packages and git are used to download and prepare the datasets and pre-trained models.


Note: Git Large File Storage (Git LFS) is used to download large files from Hugging Face.

In [3]:
# Install git lfs
! sudo apt-get install git-lfs; git lfs install

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
  git-lfs
0 upgraded, 1 newly installed, 0 to remove and 0 not upgraded.
Need to get 3503 kB of archives.
After this operation, 10.4 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 git-lfs amd64 3.0.2-1ubuntu0.2 [3503 kB]
Fetched 3503 kB in 1s (4746 kB/s)  
debconf: delaying package configuration, since apt-utils is not installed
Selecting previously unselected package git-lfs.
(Reading database ... 14544 files and directories currently installed.)
Preparing to unpack .../git-lfs_3.0.2-1ubuntu0.2_amd64.deb ...
Unpacking git-lfs (3.0.2-1ubuntu0.2) ...
Setting up git-lfs (3.0.2-1ubuntu0.2) ...
Updated git hooks.
Git LFS initialized.


### Download HLS Burn Scars dataset from Huggingface: https://huggingface.co/datasets/ibm-nasa-geospatial/hls_burn_scars

In [4]:
! cd datasets && git clone https://huggingface.co/datasets/ibm-nasa-geospatial/hls_burn_scars && tar -xvzf hls_burn_scars/hls_burn_scars.tar.gz

Cloning into 'hls_burn_scars'...
remote: Enumerating objects: 1724, done.
remote: Counting objects: 100% (58/58), done.
remote: Compressing objects: 100% (58/58), done.
remote: Total 1724 (delta 35), reused 0 (delta 0), pack-reused 1666 (from 1)
Receiving objects: 100% (1724/1724), 260.93 KiB | 16.31 MiB/s, done.
Resolving deltas: 100% (61/61), done.
training/
training/subsetted_512x512_HLS.S30.T14SNB.2018215.v1.4_merged.tif
training/subsetted_512x512_HLS.S30.T16RFT.2019250.v1.4_merged.tif
training/subsetted_512x512_HLS.S30.T11SKV.2019152.v1.4_merged.tif
training/subsetted_512x512_HLS.S30.T13TCL.2020245.v1.4_merged.tif
training/subsetted_512x512_HLS.S30.T11TMG.2018222.v1.4_merged.tif
training/subsetted_512x512_HLS.S30.T16SEB.2019135.v1.4_merged.tif
training/subsetted_512x512_HLS.S30.T13TDM.2020250.v1.4_merged.tif
training/subsetted_512x512_HLS.S30.T12SWB.2020250.v1.4_merged.tif
training/subsetted_512x512_HLS.S30.T15SWA.2021093.v1.4_merged.tif
training/subsetted_512x512_HLS.
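The extracted filenames encode the HLS product, the tile, and the acquisition date. The sketch below parses those fields from a name like the ones listed above. It assumes the seven-digit field is a year followed by a day of year, as in the HLS naming convention; `parse_hls_name` is a hypothetical helper, not part of the dataset or of any library.

```python
import re
from datetime import date, timedelta

# Matches the HLS part of names such as
# subsetted_512x512_HLS.S30.T14SNB.2018215.v1.4_merged.tif
HLS_NAME = re.compile(
    r"HLS\.(?P<product>[LS]30)\.T(?P<tile>\w{5})\."
    r"(?P<year>\d{4})(?P<doy>\d{3})\.v(?P<version>\d+\.\d+)"
)


def parse_hls_name(filename):
    """Return product, tile, acquisition date, and version from an HLS-style name."""
    match = HLS_NAME.search(filename)
    if match is None:
        raise ValueError(f"not an HLS-style filename: {filename}")
    year, doy = int(match["year"]), int(match["doy"])
    # Assumption: the date field is YYYYDDD (year plus day of year).
    acquired = date(year, 1, 1) + timedelta(days=doy - 1)
    return {
        "product": match["product"],
        "tile": match["tile"],
        "date": acquired,
        "version": match["version"],
    }


info = parse_hls_name("subsetted_512x512_HLS.S30.T14SNB.2018215.v1.4_merged.tif")
print(info["tile"], info["date"].isoformat())  # 14SNB 2018-08-03
```

Grouping the parsed results by tile or by year is a quick way to check how the training scenes are distributed before fine-tuning.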

## Download config and Pre-trained model

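The HLS Foundation Model (pre-trained model) and the configuration for the Burn Scars downstream task are available on Hugging Face. The `huggingface_hub` Python package can download these files locally; in this notebook, the pre-trained checkpoint is fetched with `curl` in the download cell below.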

In [5]:
# Define constants
BUCKET_NAME = 'workshop-1-015' # Replace this with the bucket name available from http://smd-ai-workshop-creds-webapp.s3-website-us-east-1.amazonaws.com/ 
CONFIG_PATH = './configs'
DATASET_PATH = './datasets'
MODEL_PATH = './models'

In [6]:
# Download the pre-trained model checkpoint with curl and save it as prithvi_global_v1.pt
! cd models && curl https://www.nsstc.uah.edu/data/sujit.roy/Prithvi_checkpoints/checkpoint.pt > prithvi_global_v1.pt;

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3786M  100 3786M    0     0  14.1M      0  0:04:28  0:04:28 --:--:-- 16.1M
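A download of this size can be cut short without an obvious error, so it may be worth checking the file before uploading it. The sketch below compares the size on disk with the total that `curl` reported. `size_in_mib` and `looks_complete` are hypothetical helpers, and the figure of 3786 MiB is taken from the progress output of this run, so treat it as an assumption rather than a published value.

```python
import os


def size_in_mib(path):
    """Return the size of a file in MiB (1 MiB = 1024 * 1024 bytes)."""
    return os.path.getsize(path) / (1024 * 1024)


def looks_complete(path, expected_mib, tolerance_mib=1.0):
    """Return True when the file exists and its size is within the tolerance."""
    if not os.path.isfile(path):
        return False
    return abs(size_in_mib(path) - expected_mib) <= tolerance_mib


# 3786 is the total reported by curl above; curl rounds it, hence the tolerance.
print(looks_complete("models/prithvi_global_v1.pt", expected_mib=3786))
```

A size check only catches truncated downloads. If a checksum is published for the checkpoint, comparing against it is the stronger test.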


*Warning:* Before running the remaining cells, update the configuration file as described below:

1. Update line number 13 from `data_root = '<path to data root>'` to `data_root = '/opt/ml/data/'`. This is the base directory of our data inside SageMaker.
2. Update line number 41 from `pretrained_weights_path = '<path to pretrained weights>'` to `pretrained_weights_path = f"{data_root}/models/prithvi_global_v1.pt"`, which matches the checkpoint filename uploaded to the bucket later in this notebook. This gives the training script the path to the pre-trained model.
3. Update line number 53 from `experiment = '<experiment name>'` to `experiment = 'burn_scars'` or an experiment name of your choice.
4. Update line number 54 from `project_dir = '<project directory name>'` to `project_dir = 'v1'` or a project directory name of your choice.
5. Save the config file.
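The manual edits above can also be applied programmatically. The following is a minimal sketch that assumes the placeholder strings appear in the config exactly as quoted in the steps. `build_replacements` and `fill_placeholders` are hypothetical helpers, and `weights_filename` should be whatever name the checkpoint has under `data/models` in your bucket.

```python
def build_replacements(weights_filename, experiment="burn_scars", project_dir="v1"):
    """Map each placeholder assignment to the value used in this workshop."""
    return {
        "data_root = '<path to data root>'": "data_root = '/opt/ml/data/'",
        "pretrained_weights_path = '<path to pretrained weights>'":
            f'pretrained_weights_path = f"{{data_root}}/models/{weights_filename}"',
        "experiment = '<experiment name>'": f"experiment = '{experiment}'",
        "project_dir = '<project directory name>'": f"project_dir = '{project_dir}'",
    }


def fill_placeholders(config_text, replacements):
    """Apply every replacement and fail loudly if a placeholder is missing."""
    for old, new in replacements.items():
        if old not in config_text:
            raise KeyError(f"placeholder not found: {old}")
        config_text = config_text.replace(old, new)
    return config_text


# Small demonstration on an in-memory stand-in for the real config file.
sample = (
    "data_root = '<path to data root>'\n"
    "pretrained_weights_path = '<path to pretrained weights>'\n"
    "experiment = '<experiment name>'\n"
    "project_dir = '<project directory name>'\n"
)
# prithvi_global_v1.pt is the name under which this notebook uploads the checkpoint.
print(fill_placeholders(sample, build_replacements("prithvi_global_v1.pt")))
```

To use it on the real file, read the config into a string, pass it through `fill_placeholders`, and write the result back. The `KeyError` guards against silently skipping a step when the config layout differs from what the steps describe.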

In [8]:
# Prepare sagemaker session with files uploaded to s3 bucket
import sagemaker

sagemaker_session = sagemaker.Session()
train_images = sagemaker_session.upload_data(path='datasets/training', bucket=BUCKET_NAME, key_prefix='data/training')
val_images = sagemaker_session.upload_data(path='datasets/validation', bucket=BUCKET_NAME, key_prefix='data/validation')
# No separate test split is uploaded here; the validation images are reused as the test set
test_images = sagemaker_session.upload_data(path='datasets/validation', bucket=BUCKET_NAME, key_prefix='data/test')
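Before uploading, it can help to preview which object keys a directory upload will produce. The sketch below walks a local directory and builds `key_prefix/relative/path` keys, which is how `upload_data` is understood to lay out directory uploads; that layout is an assumption here, and `preview_s3_keys` is a hypothetical helper that makes no calls to AWS.

```python
import os


def preview_s3_keys(local_dir, key_prefix):
    """List the object keys a recursive upload of local_dir would be expected to create."""
    keys = []
    for dirpath, _dirnames, filenames in os.walk(local_dir):
        for name in filenames:
            relative = os.path.relpath(os.path.join(dirpath, name), start=local_dir)
            # S3 keys always use forward slashes, whatever the local separator is.
            keys.append(f"{key_prefix}/{relative.replace(os.sep, '/')}")
    return sorted(keys)


# Prints an empty list if the dataset has not been extracted yet.
print(preview_s3_keys("datasets/training", "data/training")[:3])
```

Comparing the number of previewed keys with the number of objects in the bucket after the upload is a simple way to confirm that nothing was skipped.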

In [11]:
# Rename the configuration file to a user-specific filename
import os

identifier = 'workshop-015' # Please update this with an identifier

config_filename = '../configs/burn_scars.yaml'
new_config_filename = f"../configs/{identifier}-burn_scars.py"
# Rename only if the original is still there, so re-running this cell does not fail
if os.path.exists(config_filename):
    os.rename(config_filename, new_config_filename)

In [25]:
# Upload the config file and the pre-trained model to the S3 bucket
configs = sagemaker_session.upload_data(path=new_config_filename, bucket=BUCKET_NAME, key_prefix='data/configs')
models = sagemaker_session.upload_data(path='models/prithvi_global_v1.pt', bucket=BUCKET_NAME, key_prefix='data/models')


Note: The HLS Foundation Model uses MMCV and MMSEG, which rely on PyTorch for training, data distribution, and so on. These packages are not available in SageMaker by default, so training requires a custom script. SageMaker uses Docker for custom training scripts. If you are interested, the code included in the training image (637423382292.dkr.ecr.us-west-2.amazonaws.com/sagemaker_hls:latest) is bundled with this repository, and the training script is `train.py`.

The current HLS Foundation Model fits on a single NVIDIA Tesla V100 GPU (16 GB VRAM), so an `ml.p3.2xlarge` instance is used for training.

In [21]:
# Set up variables for training with SageMaker
from sagemaker import get_execution_role
from sagemaker.estimator import Estimator


name = f'{identifier}-sagemaker'
role = get_execution_role()
input_s3_uri = f"s3://{BUCKET_NAME}/data"

environment_variables = {
    'CONFIG_FILE': f"/opt/ml/data/configs/{new_config_filename.split('/')[-1]}",
    'MODEL_DIR': "/opt/ml/data/models/",
    'MODEL_NAME': f"{identifier}-workshop.pth",
    'S3_URL': input_s3_uri,
    'BUCKET_NAME': BUCKET_NAME,
    'ROLE_ARN': role,
    'ROLE_NAME': role.split('/')[-1],
    'EVENT_TYPE': 'burn_scars',
    'VERSION': 'v1'
}

ecr_container_url = '637423382292.dkr.ecr.us-west-2.amazonaws.com/prithvi_global:latest'
sagemaker_role = 'SageMaker-ExecutionRole-20240206T151814'

instance_type = 'ml.p3.2xlarge'

instance_count = 1
memory_volume = 50
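The `environment_variables` dictionary is handed to the training container as ordinary process environment variables. The training script itself is not shown in this notebook, so the following is only a hypothetical sketch of how a script could read and validate those values; `read_training_env` and the choice of which keys are required are assumptions.

```python
import os

# Assumption: these are the keys a training script cannot run without.
REQUIRED_KEYS = ("CONFIG_FILE", "MODEL_DIR", "MODEL_NAME", "S3_URL", "BUCKET_NAME")
OPTIONAL_KEYS = ("EVENT_TYPE", "VERSION")


def read_training_env(environ=os.environ):
    """Collect the training settings from the environment and check required keys."""
    missing = [key for key in REQUIRED_KEYS if key not in environ]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    settings = {key: environ[key] for key in REQUIRED_KEYS}
    # Optional keys are None when they are not set.
    settings.update({key: environ.get(key) for key in OPTIONAL_KEYS})
    return settings


# A plain dictionary stands in for the container environment in this example.
example_env = {
    "CONFIG_FILE": "/opt/ml/data/configs/workshop-015-burn_scars.py",
    "MODEL_DIR": "/opt/ml/data/models/",
    "MODEL_NAME": "workshop-015-workshop.pth",
    "S3_URL": "s3://workshop-1-015/data",
    "BUCKET_NAME": "workshop-1-015",
    "EVENT_TYPE": "burn_scars",
}
print(read_training_env(example_env)["CONFIG_FILE"])
```

Failing early on a missing key is cheaper than discovering it after the training image has been downloaded and the instance has started.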

In [26]:
# Create an estimator (model) using SageMaker and the configuration from the previous cell.
estimator = Estimator(image_uri=ecr_container_url,
                      role=role,
                      base_job_name=name,
                      instance_count=instance_count,
                      instance_type=instance_type,
                      volume_size=memory_volume,  # storage volume size in GB
                      environment=environment_variables)


In [None]:
# Start training
estimator.fit()

INFO:sagemaker:Creating training-job with name: workshop-015-sagemaker-2024-11-11-17-52-05-086


2024-11-11 17:52:06 Starting - Starting the training job
2024-11-11 17:52:06 Pending - Training job waiting for capacity...
2024-11-11 17:52:40 Pending - Preparing the instances for training......
2024-11-11 17:53:34 Downloading - Downloading the training image.....................
2024-11-11 17:57:11 Training - Training image download completed. Training in progress.....
2024-11-11 17:57:34,067 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2024-11-11 17:57:34,108 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2024-11-11 17:57:34,147 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2024-11-11 17:57:34,163 sagemaker-training-toolkit INFO     Invoking user script
Training Env:
{
    "additional_framework_parameters": {},
    "channel_input_dirs": {},
    "current_host": "algo-1",
    "current_instance_group": "homoge

In [85]:
image_config = {
    'RepositoryAccessMode': 'Platform'
}

In [86]:
primary_container = {
    'ContainerHostname': 'ModelContainer',
    'Image': '637423382292.dkr.ecr.us-west-2.amazonaws.com/prithvi_global_inference',
    'ImageConfig': image_config,
    'Environment': { 
        "CHECKPOINT_FILENAME": "s3://workshop-1-015/models/workshop-015-workshop.pth",
        "S3_CONFIG_FILENAME": "s3://workshop-1-015/data/configs/workshop-015-burn_scars.py",
        "BUCKET_NAME": "workshop-1-015",
        "AIP_PREDICT_ROUTE": "/invocations",
        "BACKBONE_FILENAME": "s3://workshop-1-015/data/models/prithvi_global_v1.pt"
    }
}
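The container definition above hard-codes the bucket name and identifier in several places. As a sketch, the same dictionary can be built from those two values so it stays in step with the rest of the notebook. `build_primary_container` is a hypothetical helper; the key names and path layout are copied from the cell above.

```python
def build_primary_container(bucket, identifier, image_uri,
                            backbone_filename="prithvi_global_v1.pt"):
    """Build the PrimaryContainer dictionary from the bucket name and identifier."""
    return {
        "ContainerHostname": "ModelContainer",
        "Image": image_uri,
        "ImageConfig": {"RepositoryAccessMode": "Platform"},
        "Environment": {
            "CHECKPOINT_FILENAME": f"s3://{bucket}/models/{identifier}-workshop.pth",
            "S3_CONFIG_FILENAME": f"s3://{bucket}/data/configs/{identifier}-burn_scars.py",
            "BUCKET_NAME": bucket,
            "AIP_PREDICT_ROUTE": "/invocations",
            "BACKBONE_FILENAME": f"s3://{bucket}/data/models/{backbone_filename}",
        },
    }


container = build_primary_container(
    bucket="workshop-1-015",
    identifier="workshop-015",
    image_uri="637423382292.dkr.ecr.us-west-2.amazonaws.com/prithvi_global_inference",
)
print(container["Environment"]["CHECKPOINT_FILENAME"])
```

With this helper, changing `BUCKET_NAME` or `identifier` earlier in the notebook no longer requires editing each S3 path by hand.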

In [100]:
model_name = 'prithvi-global-v1'
execution_role_arn = get_execution_role()

In [101]:
import boto3
sm = boto3.client('sagemaker')

In [104]:
# Register the model (container image plus environment) with SageMaker
resp = sm.create_model(
    ModelName=model_name,
    PrimaryContainer=primary_container,
    ExecutionRoleArn=execution_role_arn
)

# Describe the instances that will serve the model
endpoint_config_name = f'{model_name}-endpoint'
sm.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            'VariantName': 'v1',
            'ModelName': model_name,
            'InitialInstanceCount': 1,
            'InstanceType': 'ml.p3.2xlarge'
        },
    ],
)

# Create the endpoint; it reports 'Creating' until the container is ready
endpoint_name = f'{model_name}-updated-burn-scars-endpoint'
sm.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name,
)

# Check the current endpoint status
sm.describe_endpoint(EndpointName=endpoint_name)

{'EndpointName': 'prithvi-global-v1-updated-burn-scars-endpoint',
 'EndpointArn': 'arn:aws:sagemaker:us-west-2:637423382292:endpoint/prithvi-global-v1-updated-burn-scars-endpoint',
 'EndpointConfigName': 'prithvi-global-v1-endpoint',
 'EndpointStatus': 'Creating',
 'CreationTime': datetime.datetime(2024, 11, 12, 21, 36, 48, 763000, tzinfo=tzlocal()),
 'LastModifiedTime': datetime.datetime(2024, 11, 12, 21, 36, 48, 763000, tzinfo=tzlocal()),
 'ResponseMetadata': {'RequestId': '64d44255-9133-40f8-9dac-53bdea8eccc9',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '64d44255-9133-40f8-9dac-53bdea8eccc9',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '322',
   'date': 'Tue, 12 Nov 2024 21:36:48 GMT'},
  'RetryAttempts': 0}}
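The response above shows `'EndpointStatus': 'Creating'`, so the endpoint cannot serve requests yet. A minimal sketch of a polling loop that waits for it to become usable follows. `wait_for_endpoint` is a hypothetical helper, the polling interval and limit are arbitrary choices, and `client` is expected to be a boto3 SageMaker client such as `sm`.

```python
import time


def wait_for_endpoint(client, endpoint_name, poll_seconds=30, max_polls=60):
    """Poll describe_endpoint until the endpoint is InService, or raise."""
    for _ in range(max_polls):
        response = client.describe_endpoint(EndpointName=endpoint_name)
        status = response["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"endpoint {endpoint_name} failed to deploy")
        time.sleep(poll_seconds)
    raise TimeoutError(f"endpoint {endpoint_name} was not ready after {max_polls} polls")


# Usage in this notebook (requires AWS credentials):
# wait_for_endpoint(sm, endpoint_name)
```

boto3 also provides a built-in waiter, `sm.get_waiter('endpoint_in_service')`, which does the same job. The explicit loop is shown here because it makes the status transitions visible.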

In [45]:
resp

{'ModelArn': 'arn:aws:sagemaker:us-west-2:637423382292:model/prithvi',
 'ResponseMetadata': {'RequestId': '5c2380bc-f192-4364-9ea0-816b3a04ab95',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '5c2380bc-f192-4364-9ea0-816b3a04ab95',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '69',
   'date': 'Mon, 11 Nov 2024 21:27:57 GMT'},
  'RetryAttempts': 0}}

In [103]:
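# Clean up the model, the endpoint configuration, and the endpoint.
# Deleting an endpoint that is still being created raises the ValidationException shown below.
sm.delete_model(ModelName=model_name)
sm.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm.delete_endpoint(EndpointName=endpoint_name)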

ClientError: An error occurred (ValidationException) when calling the DeleteEndpoint operation: Cannot update in-progress endpoint "arn:aws:sagemaker:us-west-2:637423382292:endpoint/prithvi-global-v1-burn-scars-endpoint".
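The error above occurs because the endpoint was still being created when the delete call was made. One way to avoid it, sketched below, is to wait until the endpoint has left its in-progress state and only then delete the endpoint, its configuration, and the model. `delete_deployment` is a hypothetical helper, and the set of states treated as in progress is an assumption based on the error message.

```python
import time

# Assumption: deletion is rejected while the endpoint is in one of these states.
BUSY_STATES = {"Creating", "Updating"}


def delete_deployment(client, endpoint_name, endpoint_config_name, model_name,
                      poll_seconds=30, max_polls=60):
    """Wait for the endpoint to settle, then delete endpoint, config, and model."""
    for _ in range(max_polls):
        response = client.describe_endpoint(EndpointName=endpoint_name)
        if response["EndpointStatus"] not in BUSY_STATES:
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"endpoint {endpoint_name} is still busy")
    # Delete the endpoint first, then the resources it was built from.
    client.delete_endpoint(EndpointName=endpoint_name)
    client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    client.delete_model(ModelName=model_name)


# Usage in this notebook (requires AWS credentials):
# delete_deployment(sm, endpoint_name, endpoint_config_name, model_name)
```

Deleting the endpoint stops the billed `ml.p3.2xlarge` instance, so this step is worth running as soon as inference is no longer needed.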