# Training the `prot_t5_xl_uniref50` model to predict protein secondary structure with SageMaker Model Parallel

## Install packages

*Note:* We only install the required libraries from Hugging Face and AWS. You also need PyTorch, so make sure you are using the `PyTorch 1.12 Python 3.8 CPU Optimized` kernel on an `ml.t3.medium` instance to run this notebook.

In [1]:
import sys

In [None]:
!{sys.executable} -m pip install "sagemaker>=2.48.0" "transformers>=4.12.3" "datasets" --upgrade
!{sys.executable} -m pip install ipywidgets
!{sys.executable} -m pip install s3fs


## Development environment 

In [None]:
import sagemaker.huggingface
import datasets

## Permissions

_If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find out more [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html)._

In [None]:
import sagemaker

sess = sagemaker.Session()
# SageMaker session bucket -> used for uploading data, models and logs
# SageMaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # fall back to the default bucket if no bucket name is given
    sagemaker_session_bucket = sess.default_bucket()

role = sagemaker.get_execution_role()
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

sagemaker role arn: arn:aws:iam::670488263423:role/sm-emr-SageMakerExecutionRole-LRXNQUL7LL5E
sagemaker bucket: sagemaker-us-east-1-670488263423
sagemaker session region: us-east-1


# Preprocessing

## Tokenization 

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer

# dataset used
dataset_name = 'agemagician/NetSurfP-SS3'

# s3 key prefix for the data
s3_prefix = 'samples/datasets/netsurfp-ss3'

In [None]:
# load dataset
dataset = load_dataset(dataset_name)

Using custom data configuration agemagician--NetSurfP-SS3-9c6b828487bb9f95
Found cached dataset parquet (/root/.cache/huggingface/datasets/agemagician___parquet/agemagician--NetSurfP-SS3-9c6b828487bb9f95/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)


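ProtT5 checkpoints expect each protein as upper-case residues separated by single spaces, with the rare residues `U`, `Z`, `O` and `B` mapped to `X`. The following is a minimal sketch of that preparation step, assuming the raw sequences and labels are plain strings. `prepare_sequence`, `encode_labels` and `LABEL_TO_ID` are hypothetical names that do not come from `train.py`, and the three-state label alphabet (`H`, `E`, `C`) is an assumption about how the secondary-structure labels are stored.

```python
import re
from typing import List

# Assumption: three-state secondary-structure labels (helix, strand, coil).
LABEL_TO_ID = {"H": 0, "E": 1, "C": 2}


def prepare_sequence(sequence: str) -> str:
    """Format a raw amino-acid string the way ProtT5 tokenizers expect."""
    # Drop any existing whitespace and normalize the case.
    residues = "".join(sequence.split()).upper()
    # Map rare or ambiguous residues to the unknown residue X.
    residues = re.sub(r"[UZOB]", "X", residues)
    # Each residue is a separate token, so put a space between residues.
    return " ".join(residues)


def encode_labels(labels: str) -> List[int]:
    """Convert a per-residue label string such as 'CCHHE' to integer ids."""
    return [LABEL_TO_ID[label] for label in "".join(labels.split())]


print(prepare_sequence("MKTUAZ"))  # M K T X A X
print(encode_labels("CCHHEE"))     # [2, 2, 0, 0, 1, 1]
```

The formatted string can then be passed to the tokenizer loaded for `Rostlab/prot_t5_xl_uniref50`, with the encoded labels aligned to the residue tokens.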

## Uploading data to `sagemaker_session_bucket`

Once the dataset is loaded, we use the `S3FileSystem` [integration](https://huggingface.co/docs/datasets/filesystems.html) of `datasets` to upload it to S3.

In [None]:
import botocore.session

# Option 1: anonymous connection (public buckets only)
# storage_options = {"anon": True}
# Option 2: private buckets with explicit credentials; add your AWS access key ID and secret key
# storage_options = {"key": aws_access_key_id, "secret": aws_secret_access_key}
# Option 3 (used here): reuse the credentials of the current botocore session
s3_session = botocore.session.Session()
storage_options = {"session": s3_session}

In [None]:
import s3fs
fs = s3fs.S3FileSystem(**storage_options)

In [None]:
# save dataset to s3
data_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/data1'
dataset.save_to_disk(data_input_path,fs=fs)

data_input_path




's3://sagemaker-us-east-1-670488263423/samples/datasets/netsurfp-ss3/data1'

In [None]:
fs.ls(data_input_path)

['sagemaker-us-east-1-670488263423/samples/datasets/netsurfp-ss3/data1/dataset_dict.json',
 'sagemaker-us-east-1-670488263423/samples/datasets/netsurfp-ss3/data1/test',
 'sagemaker-us-east-1-670488263423/samples/datasets/netsurfp-ss3/data1/train',
 'sagemaker-us-east-1-670488263423/samples/datasets/netsurfp-ss3/data1/validation']

# Fine-tuning model using Sagemaker Training Job

In [None]:
# training script
!pygmentize ./code/train.py

from __future__ import print_function

import glob
from itertools import chain
import os
import random
import zipfile
import argparse

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch...

In [None]:
model_name = "Rostlab/prot_t5_xl_uniref50"

## Creating an Estimator and starting a training job

In [None]:
from sagemaker.huggingface import HuggingFace

# hyperparameters, which are passed into the training job
hyperparameters={'epochs': 1,
                 'train_batch_size': 1,
                 'model_name': model_name
                 }

# configuration for running training on smdistributed Model Parallel
mpi_options = {
    "enabled": True,
    "processes_per_host": 8,
}
smp_options = {
    "enabled": True,
    "parameters": {
        "microbatches": 1,
        "placement_strategy": "cluster",
        "pipeline": "interleaved",
        "optimize": "memory",
        "partitions": 4,
        "ddp": True,
        # "tensor_parallel_degree": 2,
        "shard_optimizer_state": True,
        "activation_checkpointing": True, 
        "activation_strategy": "each",
        "activation_offloading": True,
    }
}

distribution={
    "smdistributed": {"modelparallel": smp_options},
    "mpi": mpi_options
}
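The configuration above launches 8 processes per host and splits the model into 4 pipeline partitions with `ddp` enabled. The sketch below checks that such a layout is consistent, assuming the usual relationship in which the total process count equals pipeline partitions times tensor-parallel degree times the number of data-parallel replicas. `parallel_layout` is a hypothetical helper, not part of the SageMaker SDK, and `instance_count=1` is assumed for a single-instance job.

```python
def parallel_layout(processes_per_host, instance_count, partitions,
                    tensor_parallel_degree=1):
    """Return the process layout implied by a model-parallel configuration.

    Assumption: world size = partitions * tensor_parallel_degree * replicas.
    """
    world_size = processes_per_host * instance_count
    model_parallel_size = partitions * tensor_parallel_degree
    if world_size % model_parallel_size != 0:
        raise ValueError(
            f"{world_size} processes cannot be split evenly into groups of "
            f"{model_parallel_size} (partitions x tensor_parallel_degree)"
        )
    return {
        "world_size": world_size,
        "model_parallel_size": model_parallel_size,
        "data_parallel_replicas": world_size // model_parallel_size,
    }


# 8 processes on one host and 4 partitions, as in the configuration above.
layout = parallel_layout(processes_per_host=8, instance_count=1, partitions=4)
print(layout)
```

With these values the sketch reports two data-parallel replicas, each holding a model split into four pipeline partitions.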

In [None]:
huggingface_estimator = HuggingFace(
    entry_point='train.py',
    source_dir='./code',
    instance_type='ml.p3dn.24xlarge',
    instance_count=1,
    role=role,
    transformers_version='4.12',
    pytorch_version='1.9',
    py_version='py38',
    distribution=distribution,
    hyperparameters=hyperparameters,
    keep_alive_period_in_seconds=60 * 60,  # keep the managed warm pool alive for 60 minutes
)
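SageMaker hands each entry in `hyperparameters` to the entry point as a `--name value` command-line argument and exposes the local path of an input channel through an environment variable such as `SM_CHANNEL_DATA` (for a channel named `data`). Since the listing of `train.py` is truncated in this notebook, the following is only a sketch of how a training script might read those values. The `parse_args` helper, the `--data_dir` and `--model_dir` argument names, and the local fallback paths are assumptions for illustration.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and SageMaker-provided paths."""
    parser = argparse.ArgumentParser()

    # Hyperparameters passed by the estimator as command-line arguments.
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--train_batch_size", type=int, default=1)
    parser.add_argument("--model_name", type=str,
                        default="Rostlab/prot_t5_xl_uniref50")

    # Paths set by SageMaker inside the training container; the fallbacks
    # (hypothetical local folders) allow the script to run outside SageMaker.
    parser.add_argument("--data_dir", type=str,
                        default=os.environ.get("SM_CHANNEL_DATA", "./data"))
    parser.add_argument("--model_dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR", "./model"))

    # Ignore any extra arguments the launcher may add.
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_args([
    "--epochs", "1",
    "--train_batch_size", "1",
    "--model_name", "Rostlab/prot_t5_xl_uniref50",
])
print(args.epochs, args.train_batch_size, args.model_name)
```

Passing an explicit argument list, as in the last lines, makes it easy to exercise the parser locally before submitting a job.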

In [None]:
# starting the train job with our uploaded datasets as input
huggingface_estimator.fit({'data': data_input_path})

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: huggingface-pytorch-training-2022-12-19-22-03-50-459


2022-12-19 22:03:51 Starting - Starting the training job...
2022-12-19 22:04:07 Starting - Preparing the instances for training.....
[1,mpirank:0,algo-1]<stderr>:  1%|          | 2/336 [01:18<3:32:45, 38.22s/it]
[1,mpirank:0,algo-1]<stderr>:  1%|          | 3/336 [01:51<3:19:54, 36.02s/it]
[1,mpirank:0,algo-1]<stderr>:  1%|          | 4/336 [02:25<3:13:32, 34.98s/it]
[1,mpirank:0,algo-1]<stderr>:  1%|▏         | 5/336 [02:58<3:10:10, 34.47s/it]
[1,mpirank:0,algo-1]<stderr>:  2%|▏         | 6/336 [03:32<3:08:52, 34.34s/it]
[1,mpirank:0,algo-1]<stderr>:  2%|▏         | 7/336 [04:06<3:06:53, 34.08s/it]
[1,mpirank:0,algo-1]<stderr>:  2%|▏         | 8/336 [04:39<3:05:20, 33.90s/it]
[1,mpirank:0,algo-1]<stderr>:  3%|▎         | 9/336 [05:13<3:04:16, 33.81s/it]
[1,mpirank:0,algo-1]<stderr>:  3%|▎         | 10/336 [05:46<3:02:57, 33.67s/it]

## Deploying the endpoint

To deploy our endpoint, we call `deploy()` on our HuggingFace estimator object, passing in our desired number of instances and instance type.

In [None]:
predictor = huggingface_estimator.deploy(initial_instance_count=1, instance_type="ml.g4dn.xlarge")

Finally, we delete the endpoint so that it stops incurring charges.

In [None]:
predictor.delete_endpoint()

# References