# Multi-LoRA inference with NVIDIA NIM

This notebook demonstrates how to deploy multiple LoRA adapters with NVIDIA NIM. NIM supports LoRA adapters in the `.nemo` format (from NeMo Framework) and in the Hugging Face model format.

We will deploy the PubMedQA LoRA adapter from the previous notebook alongside two other previously trained LoRA adapters (GSM8K, SQuAD) that are available on NVIDIA NGC as examples.

`NOTE`: Completing the LoRA training in the previous notebook ("Creating a LoRA adapter with NeMo Framework") to obtain the PubMedQA adapter is recommended but not required. You can still learn about LoRA deployment with NIM using the other adapters downloaded from NGC.

## Step-by-step instructions
This notebook shows how to send inference requests to NVIDIA NIM using the Python `requests` library.


### Check available LoRA models

Once the NIM server is up and running, we can check the available models as follows:

In [2]:
import requests
import json

url = 'http://0.0.0.0:8000/v1/models'

response = requests.get(url)
data = response.json()

print(json.dumps(data, indent=4))

{
    "object": "list",
    "data": [
        {
            "id": "meta/llama3-8b-instruct",
            "object": "model",
            "created": 1717145235,
            "owned_by": "system",
            "root": "meta/llama3-8b-instruct",
            "parent": null,
            "permission": [
                {
                    "id": "modelperm-f6bac02e9ca747d5abca4603113865a2",
                    "object": "model_permission",
                    "created": 1717145235,
                    "allow_create_engine": false,
                    "allow_sampling": true,
                    "allow_logprobs": true,
                    "allow_search_indices": false,
                    "allow_view": true,
                    "allow_fine_tuning": false,
                    "organization": "*",
                    "group": null,
                    "is_blocking": false
                }
            ]
        },
        {
            "id": "llama3-8b-pubmed-qa",
            "object": "model",
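
When many adapters are loaded, the full listing can be long. As a minimal sketch, assuming the response keeps the `data` / `id` layout shown in the output above, the model names can be pulled out with a small helper. The function name `list_model_ids` and the trimmed `sample` payload are illustrative and not part of NIM.

```python
def list_model_ids(models_response):
    """Return the `id` of every entry in a /v1/models-style response."""
    return [entry["id"] for entry in models_response.get("data", [])]


# Hypothetical, trimmed-down payload shaped like the output above.
sample = {
    "object": "list",
    "data": [
        {"id": "meta/llama3-8b-instruct", "object": "model"},
        {"id": "llama3-8b-pubmed-qa", "object": "model"},
    ],
}

print(list_model_ids(sample))
```

In the notebook itself, the same helper could be called on the `data` dictionary returned by `response.json()`. Each returned id is a value that can be passed as `"model"` in a completion request.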
  

### Inference on a single prompt

In [3]:
import requests
import json

url = 'http://0.0.0.0:8000/v1/completions'
headers = {
    'accept': 'application/json',
    'Content-Type': 'application/json'
}

prompt="AIMS: Dyschesia can be provoked by inappropriate defecation movements. The aim of this prospective study was to demonstrate dysfunction of the anal sphincter and/or the musculus (m.) puborectalis in patients with dyschesia using anorectal endosonography.\nMETHODS: Twenty consecutive patients with a medical history of dyschesia and a control group of 20 healthy subjects underwent linear anorectal endosonography (Toshiba models IUV 5060 and PVL-625 RT). In both groups, the dimensions of the anal sphincter and the m. puborectalis were measured at rest, and during voluntary squeezing and straining. Statistical analysis was performed within and between the two groups.\nRESULTS: The anal sphincter became paradoxically shorter and/or thicker during straining (versus the resting state) in 85% of patients but in only 35% of control subjects. Changes in sphincter length were statistically significantly different (p<0.01, chi(2) test) in patients compared with control subjects. The m. puborectalis became paradoxically shorter and/or thicker during straining in 80% of patients but in only 30% of controls. Both the changes in length and thickness of the m. puborectalis were significantly different (p<0.01, chi(2) test) in patients versus control subjects.\nQUESTION: Is anorectal endosonography valuable in dyschesia?\n ### ANSWER (yes|no|maybe):"

data = {
    "model": "llama3-8b-pubmed-qa",
    "prompt": prompt,
    "max_tokens": 128
}

response = requests.post(url, headers=headers, json=data)
response_data = response.json()

print(json.dumps(response_data, indent=4))


{
    "id": "cmpl-95616581ca314d789e687af9c756f87c",
    "object": "text_completion",
    "created": 1717145241,
    "model": "llama3-8b-pubmed-qa",
    "choices": [
        {
            "index": 0,
            "text": " <<< yes >>>",
            "logprobs": null,
            "finish_reason": "stop",
            "stop_reason": null
        }
    ],
    "usage": {
        "prompt_tokens": 318,
        "total_tokens": 322,
        "completion_tokens": 4
    }
}
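
The fields of interest in this response are the generated text, the finish reason, and the token usage. The sketch below, which assumes the response layout shown above, extracts them and checks that the usage counts add up. The helper name `summarize_completion` is illustrative.

```python
def summarize_completion(response_data):
    """Extract the generated text and basic bookkeeping from a completion response."""
    choice = response_data["choices"][0]
    usage = response_data["usage"]
    return {
        "text": choice["text"].strip(),
        "finish_reason": choice["finish_reason"],
        # prompt tokens plus completion tokens should equal the reported total
        "usage_consistent": (
            usage["prompt_tokens"] + usage["completion_tokens"]
            == usage["total_tokens"]
        ),
    }


# Values copied from the response printed above.
sample_response = {
    "choices": [{"index": 0, "text": " <<< yes >>>", "finish_reason": "stop"}],
    "usage": {"prompt_tokens": 318, "total_tokens": 322, "completion_tokens": 4},
}

print(summarize_completion(sample_response))
```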


## Step 3: Testing the accuracy of NIM inference

This step can be run inside the NeMo Framework training container.

In [3]:
import json
# Load the raw PubMedQA test set (a dict keyed by sample ID)
with open("./pubmedqa/data/test_set.json", "rt") as f:
    data_test = json.load(f)

In [4]:
def read_jsonl(fname):
    """Read a JSON Lines file into a list, one object per non-empty line."""
    with open(fname, 'rt') as f:
        return [json.loads(line) for line in f if line.strip()]

prepared_test = read_jsonl("./pubmedqa/data/pubmedqa_test.jsonl")

In [5]:
import requests

def infer(prompt):

    url = 'http://0.0.0.0:8000/v1/completions'
    headers = {
        'accept': 'application/json',
        'Content-Type': 'application/json'
    }

    data = {
        "model": "llama3-8b-pubmed-qa",
        "prompt": prompt,
        "max_tokens": 128
    }

    response = requests.post(url, headers=headers, json=data)
    response_data = response.json()

    return response_data["choices"][0]["text"]
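
Because the adapter is selected purely by the `"model"` field, the same request can target any model the server lists. The following is a sketch of a parameterized variant of `infer`, not the notebook's own method. It assumes the caller passes a `requests.Session`-like object, and it adds a timeout and an HTTP status check. The names `build_payload` and `infer_with` are illustrative.

```python
NIM_URL = "http://0.0.0.0:8000/v1/completions"  # endpoint used throughout this notebook


def build_payload(model, prompt, max_tokens=128):
    """Build the request body; `model` selects the base model or a LoRA adapter."""
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}


def infer_with(session, model, prompt, timeout=60):
    """POST one prompt and return the generated text.

    `session` is assumed to behave like `requests.Session`.
    """
    response = session.post(
        NIM_URL, json=build_payload(model, prompt), timeout=timeout
    )
    response.raise_for_status()  # surface HTTP errors instead of a KeyError later
    return response.json()["choices"][0]["text"]


# Usage (requires a running NIM server):
# import requests
# with requests.Session() as session:
#     print(infer_with(session, "llama3-8b-pubmed-qa", "QUESTION: ..."))
```

Reusing one session across the evaluation loop lets `requests` keep the connection open between calls, which may help when sending hundreds of prompts.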

In [6]:
from tqdm import tqdm

results = {}
sample_id = list(data_test.keys())

for i, key in tqdm(enumerate(sample_id)):
    answer = infer(prepared_test[i]['input'].strip())
    answer = answer.lower()
    if 'yes' in answer:
        results[key] = 'yes'
    elif 'no' in answer:
        results[key] = 'no'
    elif 'maybe' in answer:
        results[key] = 'maybe'
    else:
        print("Malformed answer: ", answer)
        results[key] = 'maybe'
        

500it [00:45, 10.89it/s]


In [7]:
answer

' <<< yes >>>'

In [6]:
# dump results
FILENAME="pubmedqa-llama-3-8b-lora-NIM.json"
with open(FILENAME, "w") as f:
    json.dump(results, f)

# Evaluation
!cp $FILENAME ./pubmedqa/
!cd ./pubmedqa/ && python evaluation.py $FILENAME

Accuracy 0.786000
Macro-F1 0.584112
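
The contents of `evaluation.py` are not shown here, so the following is only a sketch of the standard definitions of the two reported metrics, not the script itself. It assumes ground truth and predictions are dictionaries mapping sample IDs to `yes`, `no`, or `maybe`. The function name and the toy data are illustrative.

```python
def accuracy_and_macro_f1(truth, predictions, labels=("yes", "no", "maybe")):
    """Compute accuracy and the unweighted mean of per-label F1 scores."""
    keys = list(truth)
    accuracy = sum(truth[k] == predictions[k] for k in keys) / len(keys)

    f1_scores = []
    for label in labels:
        tp = sum(truth[k] == label and predictions[k] == label for k in keys)
        fp = sum(truth[k] != label and predictions[k] == label for k in keys)
        fn = sum(truth[k] == label and predictions[k] != label for k in keys)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the label never occurs
        f1_scores.append(2 * tp / denom if denom else 0.0)

    return accuracy, sum(f1_scores) / len(f1_scores)


# Toy example: the single "maybe" sample is misclassified as "yes".
toy_truth = {"a": "yes", "b": "no", "c": "maybe", "d": "yes"}
toy_preds = {"a": "yes", "b": "no", "c": "yes", "d": "yes"}

acc, macro_f1 = accuracy_and_macro_f1(toy_truth, toy_preds)
print(f"Accuracy {acc:.6f}")
print(f"Macro-F1 {macro_f1:.6f}")
```

Macro-F1 weights every label equally, so a label that is rarely predicted correctly pulls it well below accuracy. That is one plausible reason for the gap between the two numbers above.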


We can verify that, in this case, NIM inference provides accuracy comparable to NeMo Framework inference.

```
Accuracy 0.786000
Macro-F1 0.584112
```

Note that each individual answer also conforms to the format we specified, i.e. `<<< {answer} >>>`.
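
This fixed format also allows stricter parsing than the substring checks in the evaluation loop above, where a stray word such as "unknown" would count as `no`. The sketch below is an alternative, not the method used to produce the reported numbers. The name `parse_answer` and the fallback to `maybe` (mirroring the loop's handling of malformed answers) are assumptions.

```python
import re

# Matches the label between the `<<<` and `>>>` delimiters, ignoring case.
ANSWER_PATTERN = re.compile(r"<<<\s*(yes|no|maybe)\s*>>>", re.IGNORECASE)


def parse_answer(text, default="maybe"):
    """Return the delimited label in lowercase, or `default` if none is found."""
    match = ANSWER_PATTERN.search(text)
    return match.group(1).lower() if match else default


print(parse_answer(" <<< yes >>>"))
print(parse_answer("unknown"))
```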