#### docs.djl.ai
### [Multi lora adapter inference advanced](https://docs.djl.ai/docs/demos/aws/sagemaker/large-model-inference/sample-llm/multi_lora_adapter_inference_advanced.html)

# Serve Multiple Fine-Tuned LoRA Adapters with DJL Serving (Advanced)

This notebook demonstrates how to deploy multiple fine-tuned LoRA adapters on top of a single copy of a base model on SageMaker, using the DJL Serving Large Model Inference (LMI) DLC. LoRA (Low-Rank Adaptation) is a technique for fine-tuning large language models that trains a small set of low-rank matrices instead of the full weights. This significantly reduces the number of trainable parameters compared to traditional fine-tuning while achieving comparable or, in some cases, better performance. You can learn more about the technique in the paper "LoRA: Low-Rank Adaptation of Large Language Models" (Hu et al., 2021).

A major benefit of LoRA is that fine-tuned adapters can easily be added to and removed from the base model, which makes switching adapters at runtime cheap. In this notebook we show how to deploy a SageMaker endpoint with a single base model and multiple LoRA adapters, and how to select a different adapter for each request.
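To make the "add and remove" property concrete, here is a minimal pure-Python sketch. It is not taken from this notebook or from PEFT. It assumes the standard LoRA formulation, in which an adapter contributes a low-rank update `scale * B @ A` on top of a frozen weight `W`. The tiny integer matrices and the helper names `matmul` and `combine` are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def combine(a, b, factor=1):
    """Return a + factor * b, element-wise."""
    return [[x + factor * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Frozen base weight (3x3) and a rank-1 adapter: B is 3x1, A is 1x3.
W = [[1, 0, 2], [0, 1, 0], [3, 0, 1]]
B = [[1], [2], [0]]
A = [[0, 1, 1]]
scale = 2  # stands in for lora_alpha / r (illustrative value)

delta = matmul(B, A)                       # low-rank update, same shape as W
merged = combine(W, delta, scale)          # "add" the adapter
restored = combine(merged, delta, -scale)  # "remove" it again

# The adapter can also be applied without touching W at all:
x = [[1], [1], [1]]
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
unmerged_out = combine(base_out, adapter_out, scale)

print(restored == W)                       # True: the base weight is recovered
print(matmul(merged, x) == unmerged_out)   # True: both paths give the same output
```

Because the base weight is never modified in the unmerged path, several adapters can share one copy of `W`, which is the property the rest of this notebook relies on.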

Since LoRA adapters are much smaller than the size of a base model (can realistically be 100x-1000x smaller), we can deploy an endpoint with a single base model and multiple LoRA adapters using much less hardware than deploying an equivalent number of fully fine-tuned models.
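The size gap can be estimated with some quick arithmetic. The sketch below is only a back-of-the-envelope calculation under assumed values: a 7B-class model with hidden size 4096 and 32 layers, a rank-16 adapter applied to four square attention projections per layer, adapter weights stored in fp32, and a base model of roughly 6.7 billion parameters served in fp16. None of these numbers are read from the adapters used in this notebook.

```python
def lora_params(hidden_size, num_layers, rank, targets_per_layer):
    """Parameters added by LoRA when every target is a hidden x hidden projection."""
    per_matrix = rank * (hidden_size + hidden_size)  # A is r x d, B is d x r
    return num_layers * targets_per_layer * per_matrix


BASE_PARAMS = 6_700_000_000  # rough size of a 7B-class model (assumption)

adapter = lora_params(hidden_size=4096, num_layers=32, rank=16, targets_per_layer=4)
adapter_mb = adapter * 4 / 1e6   # fp32: 4 bytes per parameter
base_gb = BASE_PARAMS * 2 / 1e9  # fp16: 2 bytes per parameter
ratio = BASE_PARAMS / adapter

print(f"adapter parameters: {adapter:,}")
print(f"adapter size: {adapter_mb:.1f} MB, base model size: {base_gb:.1f} GB")
print(f"base model has about {ratio:.0f}x more parameters")
```

Under these assumptions the adapter holds about 16.8 million parameters (roughly 67 MB), a few hundred times fewer than the base model, which falls inside the 100x-1000x range mentioned above.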

The example we will work through in this notebook is guided by the multi adapter example in HuggingFace's PEFT library: https://github.com/huggingface/peft/blob/main/examples/multi_adapter_examples/PEFT_Multi_LoRA_Inference.ipynb.

This is the advanced notebook demonstrating the usage of a custom handler. For the basic usage, see the main adapters notebook.

# Install Packages and Import Dependencies

In [1]:
!pip install huggingface_hub sagemaker boto3 awscli pandas --upgrade --quiet

In [2]:
# show the versions of packages when notebook was run
!pip list | grep "huggingface_hub \|sagemaker \|boto3 \|awscli \|pandas "

awscli                                1.33.22
boto3                                 1.34.140
pandas                                2.2.2
sagemaker                             2.224.4


In [3]:
import sagemaker
from sagemaker import image_uris
import boto3
import os
import time
import json
from pathlib import Path
from sagemaker.utils import name_from_base
from huggingface_hub import snapshot_download

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


# Download Model Artifacts and Upload to S3

We will be deploying an endpoint with 2 LoRA adapters. These are the models we will be using:
- Base Model: https://huggingface.co/huggyllama/llama-7b
- LoRA Fine Tuned Adapter 1: https://huggingface.co/tloen/alpaca-lora-7b
- LoRA Fine Tuned Adapter 2: https://huggingface.co/22h/cabrita-lora-v0-1

In [4]:
!rm -rf lora-multi-adapter
!mkdir -p lora-multi-adapter/adapters

# Download each adapter into its own sub-directory of adapters/.
# The deprecated local_dir_use_symlinks argument is omitted; recent huggingface_hub versions ignore it and emit a warning.
snapshot_download("tloen/alpaca-lora-7b", local_dir="lora-multi-adapter/adapters/eng_alpaca")
snapshot_download("22h/cabrita-lora-v0-1", local_dir="lora-multi-adapter/adapters/portuguese_alpaca")


'/home/sagemaker-user/multi-adapter-hosting-sagemaker/sagemaker/04_lmi_container_deep_java_library/lora-multi-adapter/adapters/portuguese_alpaca'

# Creating Inference Handler and DJL Serving Configuration

The following files provide the model server configuration (serving.properties) and a custom inference handler (model.py). The custom inference handler is optional; if it is not specified, the default handler from djl-serving is used. You can use this configuration as a starting point for writing your own inference handler for other models.

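The core structure to cover here is the model directory. The LoRA adapters are placed in an adapters/ sub-directory of the model directory, next to serving.properties and the optional model.py. In this example the base model itself is not packaged; it is referenced through option.model_id in serving.properties. The layout looks like this: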

```
|- model_dir
    |- adapters/
        |--- <adapter_1>/
        |--- <adapter_2>/
        |--- ...
        |--- <adapter_n>/
    |- serving.properties
    |- model.py (optional)
```
Each sub-directory of adapters/ contains the artifacts for one LoRA adapter. Typically there are two files: adapter_model.bin (the adapter weights) and adapter_config.json (the adapter configuration). These are usually produced by the PEFT library via the PeftModel.save_pretrained() method.
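Before packaging, it can help to check that the directory matches this layout. The following is a small sketch using only the standard library. The helper name `find_adapters` is hypothetical, and the check is deliberately shallow: it only looks for `adapter_config.json` in each sub-directory of `adapters/`. The demo runs against a throwaway directory rather than the real `lora-multi-adapter` folder.

```python
from pathlib import Path
import tempfile


def find_adapters(model_dir):
    """Map each sub-directory of <model_dir>/adapters to whether it has a config file."""
    adapters_dir = Path(model_dir) / "adapters"
    if not adapters_dir.is_dir():
        return {}
    return {
        child.name: (child / "adapter_config.json").is_file()
        for child in sorted(adapters_dir.iterdir())
        if child.is_dir()
    }


# Demo on a temporary directory that mimics the layout above.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("eng_alpaca", "portuguese_alpaca"):
        adapter_dir = Path(tmp) / "adapters" / name
        adapter_dir.mkdir(parents=True)
        (adapter_dir / "adapter_config.json").write_text("{}")
    # An adapter directory without a config file is reported as incomplete.
    (Path(tmp) / "adapters" / "incomplete").mkdir()
    found = find_adapters(tmp)

print(found)
```

In this notebook the sub-directory names (`eng_alpaca`, `portuguese_alpaca`) are also the adapter names used in inference requests, so a check like this doubles as a list of the names a client may send.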

In [5]:
%%writefile lora-multi-adapter/serving.properties
engine=Python
option.model_id=huggyllama/llama-7b
option.dtype=fp16
option.entryPoint=model.py
option.tensor_parallel_degree=1
load_on_devices=0

Writing lora-multi-adapter/serving.properties
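Note that the file sets `option.model_id`, while the custom handler in this notebook reads `properties.get("model_id")`. The sketch below illustrates that mapping, assuming the `option.` prefix is stripped before the properties reach the handler, as the handler code suggests. The parser is a simplified stand-in written for this example and is not DJL Serving's own loader.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def handler_options(props, prefix="option."):
    """Keep only the prefixed keys and drop the prefix from their names."""
    return {k[len(prefix):]: v for k, v in props.items() if k.startswith(prefix)}


SERVING_PROPERTIES = """\
engine=Python
option.model_id=huggyllama/llama-7b
option.dtype=fp16
option.entryPoint=model.py
option.tensor_parallel_degree=1
load_on_devices=0
"""

options = handler_options(parse_properties(SERVING_PROPERTIES))
print(options["model_id"])
```

Keys without the prefix, such as `engine` and `load_on_devices`, configure the model server itself and are left out of the handler-facing dictionary in this sketch.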


In [6]:
%%writefile lora-multi-adapter/model.py
from transformers import LlamaTokenizer, LlamaForCausalLM, GenerationConfig
from peft import PeftModel
import torch
import os
from djl_python.inputs import Input
from djl_python.outputs import Output
import logging

model = None
tokenizer = None

def generate_prompt(instruction, input=None):
    if input:
        return f"""Below is an instruction that describes a task, paired with an input that provides further context. 
        Write a response that appropriately completes the request. ### Instruction: {instruction} ### Input: {input} 
        ### Response:"""
    else:
        return f"""Below is an instruction that describes a task. Write a response that appropriately completes the 
        request.### Instruction: {instruction} ### Response:"""


def evaluate(
        instruction,
        adapters,
        input=None,
        max_new_tokens=64,
        **kwargs,
):
    prompts = []
    for inp in instruction:
        prompts.append(generate_prompt(inp, input))
    inputs = tokenizer(prompts, return_tensors="pt", padding=True)
    input_ids = inputs["input_ids"].to(torch.cuda.current_device())
    attention_mask = inputs["attention_mask"].to(torch.cuda.current_device())
    generation_config = GenerationConfig(num_beams=1, do_sample=False)

    logging.info(f"using adapters: {adapters}")
    with torch.no_grad():
        generation_output = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            adapters=adapters,
            generation_config=generation_config,
            max_new_tokens=max_new_tokens,
        )
    output = tokenizer.batch_decode(generation_output, skip_special_tokens=True)
    return output


def load_model(model_id):
    model = LlamaForCausalLM.from_pretrained(
        model_id,
        low_cpu_mem_usage=True,
        torch_dtype=torch.float16,
        device_map="auto"
    )
    tokenizer = LlamaTokenizer.from_pretrained(model_id)
    if not tokenizer.pad_token:
        tokenizer.pad_token = '[PAD]'
    logging.info(f"Loaded Base Model {model_id}")
    return model, tokenizer


def register_adapter(inputs: Input):
    """
    Registers lora adapter with the model.
    """
    global model
    adapter_name = inputs.get_property("name")
    adapter_model_id_or_path = inputs.get_property("src")
    logging.info(
        f"Registering adapter {adapter_name} from {adapter_model_id_or_path}")
    if isinstance(model, PeftModel):
        model.load_adapter(adapter_model_id_or_path, adapter_name)
    else:
        model = PeftModel.from_pretrained(model,
                                           adapter_model_id_or_path,
                                           adapter_name)


def handle(inputs: Input):
    global model, tokenizer
    # Load the base model lazily on the first call.
    if model is None:
        properties = inputs.get_properties()
        model_id = properties.get("model_id")
        model, tokenizer = load_model(model_id)

    # Nothing to generate for an empty request.
    if inputs.is_empty():
        return None

    json_inputs = inputs.get_as_json()
    sentence = json_inputs.get("inputs")  # list of prompts
    adapters = json_inputs.get("adapters", [])  # adapter names, matched to prompts by position
    generation_kwargs = json_inputs.get("parameters", {})
    outputs = evaluate(sentence, adapters, **generation_kwargs)

    return Output().add_as_json(outputs)

Writing lora-multi-adapter/model.py
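The handler above pairs each prompt with an adapter name by position and does not validate the request. As a sketch of what such a check could look like, the hypothetical helper below assumes one adapter name per prompt (the shape of the request sent later in this notebook) and a known set of registered adapter names. It is an illustration, not part of the DJL Serving API.

```python
def validate_request(payload, registered_adapters):
    """Return (prompt, adapter) pairs, or raise ValueError for a malformed request."""
    inputs = payload.get("inputs")
    if isinstance(inputs, str):
        inputs = [inputs]  # accept a single prompt as well as a list
    if not isinstance(inputs, list) or not inputs:
        raise ValueError("'inputs' must be a non-empty string or list of strings")

    adapters = payload.get("adapters", [])
    if len(adapters) != len(inputs):
        raise ValueError(
            f"expected {len(inputs)} adapter names, got {len(adapters)}"
        )

    unknown = sorted(set(adapters) - set(registered_adapters))
    if unknown:
        raise ValueError(f"unknown adapters: {unknown}")

    return list(zip(inputs, adapters))


pairs = validate_request(
    {"inputs": ["Tell me about Alpacas", "Tell me about AWS"],
     "adapters": ["eng_alpaca", "eng_alpaca"]},
    registered_adapters={"eng_alpaca", "portuguese_alpaca"},
)
print(pairs)
```

Calling a check like this at the top of `handle` would turn a mismatched request into a clear error message instead of a failure deep inside `model.generate`.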


In [7]:
!rm -f model.tar.gz
!rm -rf lora-multi-adapter/.ipynb_checkpoints
!tar czvf model.tar.gz -C lora-multi-adapter .

./
./adapters/
./adapters/eng_alpaca/
./adapters/eng_alpaca/.huggingface/
./adapters/eng_alpaca/.huggingface/.gitignore
./adapters/eng_alpaca/.huggingface/download/
./adapters/eng_alpaca/.huggingface/download/README.md.lock
./adapters/eng_alpaca/.huggingface/download/adapter_model.bin.lock
./adapters/eng_alpaca/.huggingface/download/adapter_config.json.lock
./adapters/eng_alpaca/.huggingface/download/.gitattributes.lock
./adapters/eng_alpaca/.huggingface/download/adapter_config.json.metadata
./adapters/eng_alpaca/.huggingface/download/README.md.metadata
./adapters/eng_alpaca/.huggingface/download/.gitattributes.metadata
./adapters/eng_alpaca/.huggingface/download/adapter_model.bin.metadata
./adapters/eng_alpaca/adapter_config.json
./adapters/eng_alpaca/.gitattributes
./adapters/eng_alpaca/README.md
./adapters/eng_alpaca/adapter_model.bin
./adapters/portuguese_alpaca/
./adapters/portuguese_alpaca/.huggingface/
./adapters/portuguese_alpaca/.huggingface/.gitignore
./adapters/portuguese_al

# Create SageMaker Model and Endpoint

In [8]:
role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # bucket to house artifacts
s3_code_prefix = "hf-large-model-djl/lora-multi-adapter"  # folder within bucket where code artifact will go

region = sess.boto_region_name  # public attribute, instead of the private sess._region_name
account_id = sess.account_id()

s3_client = boto3.client("s3")
sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")

s3_code_artifact_accelerate = sess.upload_data("model.tar.gz", bucket, s3_code_prefix)

inference_image_uri = image_uris.retrieve(
        framework="djl-deepspeed",
        region=region,
        version="0.27.0"
    )
model_name_acc = name_from_base("lora-multi-adapter")

# The LoRA adapters feature is a preview feature, and the ENABLE_ADAPTERS_PREVIEW environment variable should be set to use it
create_model_response = sm_client.create_model(
    ModelName=model_name_acc,
    ExecutionRoleArn=role,
    PrimaryContainer={"Image": inference_image_uri,
                      "ModelDataUrl": s3_code_artifact_accelerate,
                     })
model_arn = create_model_response["ModelArn"]

endpoint_config_name = f"{model_name_acc}-config"
endpoint_name = f"{model_name_acc}-endpoint"

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name_acc,
            "InstanceType": "ml.g5.12xlarge",
            "InitialInstanceCount": 1,
            "ModelDataDownloadTimeoutInSeconds": 1800,
            "ContainerStartupHealthCheckTimeoutInSeconds": 1800,
        },
    ],
)

print(f"endpoint_name: {endpoint_name}")

endpoint_name: lora-multi-adapter-2024-07-06-09-09-51-502-endpoint


In [9]:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

In [10]:
import time 

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)

Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: InService
Arn: arn:aws:sagemaker:us-east-1:626723862963:endpoint/lora-multi-adapter-2024-07-06-09-09-51-502-endpoint
Status: InService
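The polling loop above runs until the status leaves `Creating`, with no upper bound on the wait. A variant with a timeout might look like the sketch below. The helper `wait_for_status` and its parameters are hypothetical; the describe call and the sleep function are passed in so the logic can be exercised without AWS access, and the demo uses canned responses instead of a real endpoint.

```python
import time


def wait_for_status(describe, pending="Creating", poll_seconds=60,
                    timeout_seconds=3600, sleep=time.sleep):
    """Poll describe() until EndpointStatus leaves `pending`, or raise TimeoutError."""
    waited = 0
    status = describe()["EndpointStatus"]
    while status == pending:
        if waited >= timeout_seconds:
            raise TimeoutError(f"still {pending} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
        status = describe()["EndpointStatus"]
    return status


# Demo with canned responses and a no-op sleep.
responses = iter([
    {"EndpointStatus": "Creating"},
    {"EndpointStatus": "Creating"},
    {"EndpointStatus": "InService"},
])
final_status = wait_for_status(lambda: next(responses), sleep=lambda seconds: None)
print(final_status)
```

Against a real endpoint the call would be `wait_for_status(lambda: sm_client.describe_endpoint(EndpointName=endpoint_name))`, followed by a check that the returned status is `InService` rather than `Failed`. boto3 also provides an `endpoint_in_service` waiter on the SageMaker client, which may be a simpler option if its default polling settings are acceptable.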


# Make Inference Requests

In [11]:
%%time

response_model = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps({"inputs": ["Tell me about Alpacas", "Invente uma desculpa criativa pra dizer que não preciso ir à festa.", "Tell me about AWS"],
                     "adapters": ["eng_alpaca", "portuguese_alpaca", "eng_alpaca"]}),
    ContentType="application/json",
)

content = response_model["Body"].read().decode("utf-8")

CPU times: user 12 ms, sys: 2.83 ms, total: 14.8 ms
Wall time: 5.13 s
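The request body is plain JSON with parallel `inputs` and `adapters` lists. The custom handler in this notebook also reads an optional `parameters` object and forwards it to generation, where `max_new_tokens` is accepted. The hypothetical helper below sketches one way to build such a body from (prompt, adapter) pairs so the two lists cannot drift out of step.

```python
import json


def build_payload(prompts_with_adapters, max_new_tokens=None):
    """Build the JSON request body from a list of (prompt, adapter_name) pairs."""
    payload = {
        "inputs": [prompt for prompt, _ in prompts_with_adapters],
        "adapters": [adapter for _, adapter in prompts_with_adapters],
    }
    if max_new_tokens is not None:
        # Read by the custom handler as json_inputs["parameters"].
        payload["parameters"] = {"max_new_tokens": max_new_tokens}
    return json.dumps(payload)


body = build_payload(
    [
        ("Tell me about Alpacas", "eng_alpaca"),
        ("Tell me about AWS", "eng_alpaca"),
    ],
    max_new_tokens=128,
)
print(body)
```

The resulting string can be passed as `Body=body` to `smr_client.invoke_endpoint` with `ContentType="application/json"`, exactly as in the cell above.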


In [12]:
import pandas as pd
# Parse the JSON content
parsed_content = json.loads(content)
pd.set_option('max_colwidth', 800)
pd.DataFrame(parsed_content)

| | 0 |
|---|---|
| 0 | Below is an instruction that describes a task. Write a response that appropriately completes the \n request.### Instruction: Tell me about Alpacas ### Response: that Alpacas are\n small, cute, and fluffy animals native to South America. They are related to\n camels and llamas, and are prized for their soft, luxurious fleece. Alpacas are\n raised for their fleece, |
| 1 | Below is an instruction that describes a task. Write a response that appropriately completes the \n request.### Instruction: Invente uma desculpa criativa pra dizer que não preciso ir à festa. ### Response: Eu não posso ir à festa porque tenho que cuidar de minha avó. Eu tenho que cuidar dela porque ela está doente e não pode ir. Eu tenho que cuidar dela porque ela está doente e não pode ir. Eu tenho que cu |
| 2 | Below is an instruction that describes a task. Write a response that appropriately completes the \n request.### Instruction: Tell me about AWS ### Response: that Amazon Web Services (AWS) is a cloud computing platform that provides a wide range of services, including computing power, storage, databases, analytics, and more. AWS is used by businesses of all sizes to build and deploy applications, websites, and other services. |
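Each returned string contains the full prompt followed by the generated text, because the handler decodes the whole generated sequence. If only the answer is needed, a client can strip the prompt. The sketch below assumes the `### Response:` marker from the handler's prompt template and uses a hypothetical helper name, `extract_response`.

```python
def extract_response(generated, marker="### Response:"):
    """Return the text after the last marker, or the whole text if the marker is absent."""
    _, found, tail = generated.rpartition(marker)
    return tail.strip() if found else generated.strip()


sample = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.### Instruction: Tell me about AWS "
    "### Response: that Amazon Web Services (AWS) is a cloud computing platform"
)
print(extract_response(sample))
```

Applied to the parsed response, `[extract_response(text) for text in parsed_content]` would yield one answer per prompt, in request order.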


# Clean up Resources

In [13]:
sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=model_name_acc)  # also remove the SageMaker model created earlier

{'ResponseMetadata': {'RequestId': 'fda3a1a4-16ed-4d49-ac38-5aa1c1876091',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'fda3a1a4-16ed-4d49-ac38-5aa1c1876091',
   'content-type': 'application/x-amz-json-1.1',
   'date': 'Sat, 06 Jul 2024 09:15:59 GMT',
   'content-length': '0'},
  'RetryAttempts': 0}}