## Hierarchical Classifier using LLMs

A hierarchical classifier built on Large Language Models (LLMs) classifies text into multiple levels of categories, from broad categories down to more granular, specific ones, based on a predefined hierarchical structure.


This sample notebook shows you how to fine-tune an LLM for hierarchical classification and how to run inference with the resulting model.

> **Note**: This is a reference notebook, and it will not run until you make the changes suggested in the notebook.

#### Pre-requisites:
1. **Note**: This notebook contains elements that render correctly only in the Jupyter interface. Open this notebook from an Amazon SageMaker Notebook Instance or Amazon SageMaker Studio.
1. Ensure that the IAM role used has the **AmazonSageMakerFullAccess** policy attached.
1. Some hands-on experience using [Amazon SageMaker](https://aws.amazon.com/sagemaker/).
1. To use this algorithm successfully, ensure that:
    1. Either your IAM role has these three permissions and you have authority to make AWS Marketplace subscriptions in the AWS account used: 
        1. **aws-marketplace:ViewSubscriptions**
        1. **aws-marketplace:Unsubscribe**
        1. **aws-marketplace:Subscribe**  
    2. or your AWS account has a subscription to the Hierarchical classifier listing.

#### Contents:
1. [Subscribe to the algorithm](#1.-Subscribe-to-the-algorithm)
1. [Prepare dataset](#2.-Prepare-dataset)
	1. [Dataset format expected by the algorithm](#A.-Dataset-format-expected-by-the-algorithm)
	1. [Configure dataset](#B.-Configure-dataset)
	1. [Upload datasets to Amazon S3](#C.-Upload-datasets-to-Amazon-S3)
1. [Train a machine learning model](#3:-Train-a-machine-learning-model)
	1. [Set up environment](#3.1-Set-up-environment)
	1. [Train a model](#3.2-Train-a-model)
1. [Deploy model and verify results](#4:-Deploy-model-and-verify-results)
    1. [Deploy trained model](#A.-Deploy-trained-model)
    1. [Create input payload](#B.-Create-input-payload)
    1. [Perform real-time inference](#C.-Perform-real-time-inference)
    1. [Visualize output](#D.-Visualize-output)
    1. [Calculate relevant metrics](#E.-Calculate-relevant-metrics)
    1. [Delete the endpoint](#F.-Delete-the-endpoint)
1. [Perform Batch inference](#5.-Perform-Batch-inference)
1. [Clean-up](#7.-Clean-up)
	1. [Delete the model](#A.-Delete-the-model)
	1. [Unsubscribe to the listing (optional)](#B.-Unsubscribe-to-the-listing-(optional))


#### Usage instructions
You can run this notebook one cell at a time (press Shift+Enter to run a cell).

### 1. Subscribe to the algorithm

To subscribe to the algorithm:
1. Open the Hierarchical classifier algorithm listing page.
1. On the AWS Marketplace listing, click the **Continue to subscribe** button.
1. On the **Subscribe to this software** page, review the EULA, pricing, and support terms, and click **Accept Offer** if you agree with them.
1. Click the **Continue to configuration** button and choose a **region**. You will then see a **Product Arn**. This is the algorithm ARN that you need to specify when training a custom ML model. Copy the ARN corresponding to your region and specify it in the following cell.

In [1]:
algo_name = 'hierarchical-classifier-v7-1'
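
Because the Product Arn is region-specific, it can help to check that the value you copied belongs to the region you train in. The following is a minimal sketch, not part of the original notebook: the helper `algorithm_region` and the example ARN (placeholder account ID and algorithm name) are hypothetical, and the pattern assumes the usual `arn:aws:sagemaker:<region>:<account>:algorithm/<name>` layout.

```python
import re

# Assumed layout of a SageMaker algorithm ARN:
#   arn:<partition>:sagemaker:<region>:<account-id>:algorithm/<name>
ALGORITHM_ARN = re.compile(
    r"^arn:aws[\w-]*:sagemaker:(?P<region>[a-z0-9-]+):(?P<account>\d{12}):algorithm/(?P<name>.+)$"
)


def algorithm_region(algo_arn):
    """Return the region encoded in an algorithm ARN, or raise ValueError."""
    match = ALGORITHM_ARN.match(algo_arn)
    if match is None:
        raise ValueError(f"not a SageMaker algorithm ARN: {algo_arn!r}")
    return match.group("region")


# Placeholder ARN for illustration only; use the Product Arn from your subscription.
example_arn = "arn:aws:sagemaker:us-east-2:123456789012:algorithm/example-algorithm"
print(algorithm_region(example_arn))
```

Comparing the returned region with the region of your SageMaker client before starting a training job surfaces a mismatched ARN early.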

### 2. Prepare dataset

In [2]:
import base64
import json
import urllib.request
import uuid
from urllib.parse import urlparse

import boto3
import numpy as np
import sagemaker
import sagemaker as sage
from IPython.display import Image
from PIL import Image as ImageEdit
from sagemaker import ModelPackage, get_execution_role

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ubuntu/.config/sagemaker/config.yaml


#### A. Dataset format expected by the algorithm

Usage Instructions:
- For model fine-tuning, upload the data in a folder as a train.csv file.
    - Follow the naming convention below when fine-tuning the model:
        - The textual description must be in a column named DESCRIPTION.
        - The categories must be in columns named like CATEGORY_1, CATEGORY_2, CATEGORY_3, CATEGORY_4.
    - The total number of labels across all category columns must not exceed 256.
- For inference, upload all textual descriptions in dictionary format in a JSON file, as shown below:
    - {"text":<description_1>,<description_2>,<description_3>}
- Follow the correct naming convention for the uploaded files.
- The inference engine returns a prediction for each textual description.
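
The training-file rules above can be checked locally before uploading. The sketch below is illustrative only: `validate_training_csv` is a hypothetical helper, the sample rows are made up, and it assumes the 256-label limit applies to the sum of distinct labels in each category column.

```python
import csv
import io
import re

CATEGORY_PATTERN = re.compile(r"^CATEGORY_\d+$")
MAX_TOTAL_LABELS = 256  # limit stated in the usage instructions


def validate_training_csv(csv_text):
    """Return a list of problems found in the training CSV (empty if none)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    category_columns = [c for c in columns if CATEGORY_PATTERN.match(c)]

    problems = []
    if "DESCRIPTION" not in columns:
        problems.append("missing DESCRIPTION column")
    if not category_columns:
        problems.append("no CATEGORY_<n> columns found")

    # Collect the distinct, non-empty labels of each category column.
    labels = {column: set() for column in category_columns}
    for row in reader:
        for column in category_columns:
            value = (row.get(column) or "").strip()
            if value:
                labels[column].add(value)

    # Assumption: the limit counts distinct labels summed over all columns.
    total = sum(len(values) for values in labels.values())
    if total > MAX_TOTAL_LABELS:
        problems.append(f"{total} labels exceed the limit of {MAX_TOTAL_LABELS}")
    return problems


# Made-up sample data for illustration.
sample = (
    "DESCRIPTION,CATEGORY_1,CATEGORY_2\n"
    "Stainless steel kitchen knife,Home,Kitchen\n"
    "Cotton bath towel,Home,Bath\n"
)
print(validate_training_csv(sample))
```

To check a real file, pass its contents, for example `validate_training_csv(open("train.csv", newline="").read())`.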

#### B. Configure dataset

In [9]:
training_dataset="train_full_data.csv"
test_dataset = "inference.json"

#### C. Upload datasets to Amazon S3

In [43]:
#sagemaker_session = sage.Session()
#bucket=sagemaker_session.default_bucket()
#bucket

In [8]:
bucket = 'hierarchical-classifier-1'
project = 'boto3-v1'
region = 'us-east-2'

s3 = boto3.client('s3')

# Specify the file to upload
local_file_path = 'train_half_data.csv'  # Replace with your local file path

s3_object_key = f"{project}/input/train.csv"     # Specify the S3 object key (path in S3)

# Upload the file
s3.upload_file(local_file_path, bucket, s3_object_key)

print(f"File '{local_file_path}' uploaded to '{bucket}/{s3_object_key}'")

S3UploadFailedError: Failed to upload train_half_data.csv to hierarchical-classifier-1/boto3-v1/input/train.csv: An error occurred (AccessDenied) when calling the PutObject operation: User: arn:aws:sts::786796469737:assumed-role/amrit-ec2-role/i-087350f58d755cc7f is not authorized to perform: s3:PutObject on resource: "arn:aws:s3:::hierarchical-classifier-1/boto3-v1/input/train.csv" because no identity-based policy allows the s3:PutObject action
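
The `AccessDenied` error above says that no identity-based policy on the role allows `s3:PutObject` for the target object. As a hedged illustration of what such a policy could look like, the sketch below builds a minimal policy document for the upload prefix. The helper `s3_upload_policy` is hypothetical; an account administrator would still need to review the policy and attach it to the role.

```python
import json


def s3_upload_policy(bucket, prefix):
    """Build an identity-based policy allowing uploads under s3://bucket/prefix/."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                # Object-level resource: every key below the given prefix.
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            }
        ],
    }


policy = s3_upload_policy("hierarchical-classifier-1", "boto3-v1")
print(json.dumps(policy, indent=2))
```

Scoping the resource to the project prefix rather than the whole bucket keeps the grant as narrow as the upload requires.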

### 3: Train a machine learning model

Now that the dataset is available in an accessible Amazon S3 bucket, we are ready to train a machine learning model.

### 3.1 Set up environment

In [None]:
role = get_execution_role()
sm = boto3.Session(region_name='us-east-2').client("sagemaker")

### 3.2 Train a model

In [9]:
import datetime

sm.create_training_job(
    TrainingJobName='hierarchical-classifier-' + datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S'),
    HyperParameters={
        'epochs': '3'
    },
    AlgorithmSpecification={
        # Train with the subscribed AWS Marketplace algorithm instead of a TrainingImage URI.
        'AlgorithmName': algo_name,
        'TrainingInputMode': 'File',
        'EnableSageMakerMetricsTimeSeries': False
    },
    RoleArn=role,
    InputDataConfig=[{
        'ChannelName': 'train',
        'DataSource': {
            'S3DataSource': {
                'S3DataType': 'S3Prefix',
                # S3 prefix that contains train.csv
                'S3Uri': 's3://' + bucket + '/' + project + '/input/',
                'S3DataDistributionType': 'FullyReplicated'
            }
        },
        'ContentType': 'text/csv',
        'CompressionType': 'None',
        'RecordWrapperType': 'None',
        #'EnableFFM': False
    }],
    # The model artifact is written below this prefix.
    OutputDataConfig={'S3OutputPath': 's3://' + bucket + '/' + project + '/output'},
    ResourceConfig={
        'InstanceType': 'ml.g5.4xlarge',
        'InstanceCount': 1,
        'VolumeSizeInGB': 30
    },
    StoppingCondition={
        'MaxRuntimeInSeconds': 86400
    },
)

{'TrainingJobArn': 'arn:aws:sagemaker:us-east-2:786796469737:training-job/hierarchical-classifier-2024-12-02-05-07-40',
 'ResponseMetadata': {'RequestId': '6f3acf5c-1354-46c1-bc9f-f6407a68f4d2',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '6f3acf5c-1354-46c1-bc9f-f6407a68f4d2',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '118',
   'date': 'Mon, 02 Dec 2024 05:07:40 GMT'},
  'RetryAttempts': 0}}
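
`create_training_job` returns as soon as the job is accepted, so the model artifact is not available yet at this point. A minimal polling sketch follows; `wait_for_training_job` is a hypothetical helper, and it assumes that `sm_client` is a boto3 SageMaker client such as the `sm` object created above.

```python
import time

# Statuses after which the job no longer changes.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(sm_client, job_name, poll_seconds=60, sleep=time.sleep):
    """Poll describe_training_job until the job reaches a terminal status."""
    while True:
        description = sm_client.describe_training_job(TrainingJobName=job_name)
        status = description["TrainingJobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        # Job is still running (for example "InProgress"); wait and retry.
        sleep(poll_seconds)


# Example (requires AWS credentials and an existing job):
# status = wait_for_training_job(sm, "<your-training-job-name>")
```

boto3 also provides a built-in waiter for this, `sm.get_waiter("training_job_completed_or_stopped")`, which may be preferable when you do not need the final status string.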

In [10]:
# Name of the training job created above (replace with your own job name)
training_job_name = 'hierarchical-classifier-2024-11-29-10-20-41'
# S3 location where the training job stores its output:
bucket_prefix = 's3://' + bucket + '/' + project + '/output/' + training_job_name + '/output'
# File name of the model artifact:
model_filename = "model.tar.gz"
# Full S3 URI of the model artifact:
model_s3_key = f"{bucket_prefix}/{model_filename}"
# S3 model URL used when creating the model:
model_url = model_s3_key
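
Hard-coding the job name makes it easy to point at the wrong training run. As a sketch under the assumption that you kept the `create_training_job` response, the job name can be derived from its `TrainingJobArn` and the artifact URI built from it. Both helper functions are hypothetical, and the ARN below uses a placeholder account ID.

```python
def training_job_name_from_arn(training_job_arn):
    """Return the job name, i.e. the part after 'training-job/' in the ARN."""
    return training_job_arn.split("training-job/", 1)[1]


def model_artifact_uri(bucket, project, job_name, filename="model.tar.gz"):
    """Build the S3 URI of the model artifact, following the layout used above."""
    return f"s3://{bucket}/{project}/output/{job_name}/output/{filename}"


# Placeholder ARN for illustration; use response["TrainingJobArn"] in practice.
arn = (
    "arn:aws:sagemaker:us-east-2:123456789012:"
    "training-job/hierarchical-classifier-2024-12-02-05-07-40"
)
job_name = training_job_name_from_arn(arn)
print(model_artifact_uri("hierarchical-classifier-1", "boto3-v1", job_name))
```

Alternatively, once the job has completed, `describe_training_job` reports the artifact location in `ModelArtifacts.S3ModelArtifacts`, which avoids rebuilding the path by hand.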

In [57]:
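# NOTE: the recorded run of this cell failed with the ValidationException shown below.
# PrimaryContainer probably also needs to identify the inference container
# (for example through its 'Image' or 'ModelPackageName' field).
sm.create_model(ModelName='hierarchical-classifier',
                PrimaryContainer={
                    'ModelDataUrl': model_url
                },
                ExecutionRoleArn=role
                )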

ClientError: An error occurred (ValidationException) when calling the CreateModel operation: Model can not be null.

### Create endpoint configuration

In [59]:
endpoint_config_name = "hierarchical-classifier-boto3-v1"
endpoint_name = "hierarchical-classifier-boto3-v1"
inference_component_name = "hierarchical-classifier-boto3-v1"
variant_name = "variant-1"

sm.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
    ExecutionRoleArn = role,
    ProductionVariants = [
        {
            "VariantName": variant_name,
            #"ModelName": algo_name,
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
            "ManagedInstanceScaling": {
                "Status": "ENABLED",
                "MinInstanceCount": 1,
                "MaxInstanceCount": 2,
            },
        }
    ],
)

{'EndpointConfigArn': 'arn:aws:sagemaker:us-east-2:786796469737:endpoint-config/hierarchical-classifier-boto3-v1',
 'ResponseMetadata': {'RequestId': '6801eb2d-027f-4309-8612-98ea6ae48c2a',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '6801eb2d-027f-4309-8612-98ea6ae48c2a',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '113',
   'date': 'Fri, 29 Nov 2024 13:09:09 GMT'},
  'RetryAttempts': 0}}

### Create endpoint

In [60]:
sm.create_endpoint(
    EndpointName = endpoint_name,
    EndpointConfigName = endpoint_config_name,
)

{'EndpointArn': 'arn:aws:sagemaker:us-east-2:786796469737:endpoint/hierarchical-classifier-boto3-v1',
 'ResponseMetadata': {'RequestId': 'f81608f9-610d-4098-9c5d-45255347f4f5',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'f81608f9-610d-4098-9c5d-45255347f4f5',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '100',
   'date': 'Fri, 29 Nov 2024 13:09:18 GMT'},
  'RetryAttempts': 0}}

### Deploy endpoint

In [None]:
# ModelName is not a top-level parameter of create_inference_component. The original call
# passed ModelName = "hierarchical-classifier-boto3-v1" and failed with
# "ParamValidationError: Unknown parameter in input: "ModelName", must be one of:
# InferenceComponentName, EndpointName, VariantName, Specification, RuntimeConfig, Tags".
sm.create_inference_component(
    InferenceComponentName = inference_component_name,
    EndpointName = endpoint_name,
    VariantName = variant_name,
    Specification = {
        "Container": {
            "ArtifactUrl": model_url,
        },
        "ComputeResourceRequirements": {
            "NumberOfCpuCoresRequired": 1,
            "MinMemoryRequiredInMb": 1024,
        }
    },
    RuntimeConfig = {"CopyCount": 2}
)

In [None]:
output_location = 's3://{}/{}/{}'.format(bucket, project, 'output')
output_location

In [35]:
training_instance_type="ml.g5.4xlarge"

In [None]:
# Create an estimator object for running a training job
sagemaker_session = sage.Session()
# S3 prefix that holds train.csv (see section 2.C)
training_input = 's3://' + bucket + '/' + project + '/input/'

estimator = sage.algorithm.AlgorithmEstimator(
    algorithm_arn=algo_name,  # algorithm name or ARN specified in section 1
    base_job_name="hierarchical-classifier-run",
    role=role,
    input_mode="File",
    output_path=output_location,
    sagemaker_session=sagemaker_session,
    instance_count=1,
    instance_type=training_instance_type
)

# Run the training job.
estimator.fit({"train": training_input})

See this [blog post](https://aws.amazon.com/blogs/machine-learning/easily-monitor-and-visualize-metrics-while-training-models-on-amazon-sagemaker/) for more information on how to visualize metrics during training. You can also open the training job from the [Amazon SageMaker console](https://console.aws.amazon.com/sagemaker/home?#/jobs/) and monitor the metrics and logs in the **Monitor** section.

### 4: Deploy model and verify results

Now you can deploy the model for performing real-time inference.

In [None]:
model_name='ECG_model.pth'

content_type='application/json' # input is a JSON file (see section 2.A); confirm the accepted type on the listing page

real_time_inference_instance_type='ml.m5.large'
batch_transform_inference_instance_type='ml.m5.large'

#### A. Deploy trained model

In [None]:
predictor = estimator.deploy(1, real_time_inference_instance_type)

Once the endpoint is created, you can perform real-time inference.

#### B. Create input payload

In [None]:
file_name = test_dataset
output_file_name = "inference_out.json"

#### C. Perform real-time inference

In [None]:
!aws sagemaker-runtime invoke-endpoint \
    --endpoint-name $predictor.endpoint_name \
    --body fileb://$file_name \
    --content-type $content_type \
    --region $sagemaker_session.boto_region_name \
    $output_file_name

#### D. Visualize output

In [None]:
# Print the inference response; the context manager closes the file afterwards.
with open(output_file_name, "r") as file:
    print(file.read())

#### F. Delete the endpoint

Now that you have successfully performed a real-time inference, you no longer need the endpoint. You can delete it to avoid being charged.

In [None]:
predictor.delete_endpoint(delete_endpoint_config=True)

Since this is an experiment, you do not need to run a hyperparameter tuning job. However, if you would like to see how to tune a model trained using a third-party algorithm with Amazon SageMaker's hyperparameter tuning functionality, you can run the optional tuning step.

### 5. Perform Batch inference

In this section, you will perform batch inference using multiple input payloads together.

In [None]:
#upload the batch-transform job input files to S3
transform_input_folder = test_dataset
batch_input_prefix = project + "/batch"
transform_input = sagemaker_session.upload_data(transform_input_folder, key_prefix=batch_input_prefix) 
print("Transform input uploaded to " + transform_input)

In [None]:
#Run the batch-transform job
transformer = estimator.transformer(1, batch_transform_inference_instance_type)
transformer.transform(transform_input, content_type=content_type)
transformer.wait()

In [None]:
#output is available on following path
transformer.output_path
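
`transformer.output_path` is an S3 URI, whereas most S3 client calls take a bucket name and a key prefix separately. A minimal sketch of that split using `urlparse` (already imported above) follows; `split_s3_uri` is a hypothetical helper and the URI is a made-up example.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split 's3://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {s3_uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Made-up URI for illustration; pass transformer.output_path in practice.
bucket_name, key_prefix = split_s3_uri("s3://example-bucket/batch-output/run-1")
print(bucket_name, key_prefix)
```

The resulting pair can then be passed to S3 client calls that list or download the batch results.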

### 7. Clean-up

#### A. Delete the model

In [None]:
# The endpoint was already deleted in section 4.F; delete the model created for batch transform.
transformer.delete_model()

#### B. Unsubscribe to the listing (optional)

If you would like to unsubscribe from the algorithm, follow these steps. Before you cancel the subscription, ensure that you do not have any [deployable model](https://console.aws.amazon.com/sagemaker/home#/models) created from the model package or using the algorithm. Note: you can find this information by looking at the container name associated with the model.

**Steps to unsubscribe from the product on AWS Marketplace**:
1. Navigate to the __Machine Learning__ tab on [__Your Software subscriptions page__](https://aws.amazon.com/marketplace/ai/library?productType=ml&ref_=mlmp_gitdemo_indust).
2. Locate the listing whose subscription you want to cancel, and then choose __Cancel Subscription__.

