# Fine-Tuning an Open-Source LLM Using QLoRA with MLflow and PEFT

In this notebook, we will demonstrate how to fine-tune the Mistral 7B/OpenHermes model for a **text-to-SQL** task.

In [None]:
# Load environment variables from a .env file
from dotenv import load_dotenv, find_dotenv
import os

load_dotenv(find_dotenv())

In [2]:
import sys

# Set the project root directory and update the system path
project_root_directory = os.getcwd().split("notebooks")[0]
sys.path.insert(0, project_root_directory)

# Import the display_table function from the utils module
from src.utils.utils import display_table
import pandas as pd

In [3]:
from azure.ai.ml import MLClient
from azure import identity
from azure.ai.ml.entities import AmlCompute
import time

try:
    credential = identity.DefaultAzureCredential()
    credential.get_token("https://management.azure.com/.default")
except Exception as ex:
    credential = identity.InteractiveBrowserCredential()

try:
    workspace_ml_client = MLClient.from_config(credential=credential)
except Exception:  # no usable config.json; fall back to environment variables
    workspace_ml_client = MLClient(
        credential,
        subscription_id=os.environ["AZURE_SUBSCRIPTION_ID"],
        resource_group_name=os.environ["AZURE_RESOURCE_GROUP"],
        workspace_name=os.environ["AZURE_WORKSPACE_NAME"],
    )

# the models, fine tuning pipelines and environments are available in the AzureML system registry, "azureml"
registry_ml_client = MLClient(credential, registry_name="azureml")
registry_ml_client_msr = MLClient(credential, registry_name="azureml-msr")
registry_ml_client_meta = MLClient(credential, registry_name="azureml-meta")
registry_ml_client_hugging = MLClient(credential, registry_name="HuggingFace")
experiment_name = "text-generation-samsum"

# generating a unique timestamp that can be used for names and versions that need to be unique
timestamp = str(int(time.time()))

Found the config file in: C:\Users\karinaa\OneDrive - Microsoft\Documents\codes\azure-samples\.azureml\config.json

## 1. Dataset Preparation

### Load Dataset from HuggingFace Hub

We will use the `b-mc2/sql-create-context` dataset from the [Hugging Face Hub](https://huggingface.co/datasets/b-mc2/sql-create-context) for this tutorial. This dataset comprises 78.6k pairs of natural language queries and their corresponding SQL statements, making it ideal for training a text-to-SQL model. The dataset includes three columns:

* `question`: A natural language question posed regarding the data.
* `context`: Additional information about the data, such as the schema for the table being queried.
* `answer`: The SQL query that represents the expected output.

In [4]:
from datasets import load_dataset

dataset = load_dataset("b-mc2/sql-create-context", split="train")

# If loading fails because of a cache problem, upgrading may help: pip install -U datasets


In [5]:
display_table(dataset.select(range(3)))

| | answer | question | context |
|---|---|---|---|
| 0 | `SELECT COUNT(*) FROM head WHERE age > 56` | How many heads of the departments are older than 56 ? | `CREATE TABLE head (age INTEGER)` |
| 1 | `SELECT name, born_state, age FROM head ORDER BY age` | List the name, born state and age of the heads of departments ordered by age. | `CREATE TABLE head (name VARCHAR, born_state VARCHAR, age VARCHAR)` |
| 2 | `SELECT creation, name, budget_in_billions FROM department` | List the creation year, name and budget of each department. | `CREATE TABLE department (creation VARCHAR, name VARCHAR, budget_in_billions VARCHAR)` |


### Split Train and Test Dataset
The `b-mc2/sql-create-context` dataset consists of a single split, "train". We will separate 20% of this as test samples.

In [6]:
split_dataset = dataset.train_test_split(test_size=0.2, seed=42)
train_dataset = split_dataset["train"]
test_dataset = split_dataset["test"]

print(f"Training dataset contains {len(train_dataset)} text-to-SQL pairs")
print(f"Test dataset contains {len(test_dataset)} text-to-SQL pairs")

Training dataset contains 62861 text-to-SQL pairs
Test dataset contains 15716 text-to-SQL pairs


### Define Prompt Template

Mistral 7B is a causal language model that operates on plain text, so we have to construct a single text prompt that combines our system instructions, the table context, and the user's question. The new `prompt` column in the dataset will contain the text fed into the model during training. Note that the prompt also includes the expected response, which lets the model be trained in a self-supervised manner (next-token prediction over the full text).

In [7]:
PROMPT_TEMPLATE = """You are a powerful text-to-SQL model. Given the SQL tables and natural language question, your job is to write SQL query that answers the question.

### Table:
{context}

### Question:
{question}

### Response:
{output}"""


def apply_prompt_template(row):
    prompt = PROMPT_TEMPLATE.format(
        question=row["question"],
        context=row["context"],
        output=row["answer"],
    )
    return {"prompt": prompt}


train_dataset = train_dataset.map(apply_prompt_template)
display_table(train_dataset.select(range(1)))

| | answer | question | context | prompt |
|---|---|---|---|---|
| 0 | `SELECT perth FROM table_name_56 WHERE gold_coast = "yes" AND sydney = "yes" AND melbourne = "yes" AND adelaide = "yes"` | Which Perth has Gold Coast yes, Sydney yes, Melbourne yes, and Adelaide yes? | `CREATE TABLE table_name_56 (perth VARCHAR, adelaide VARCHAR, melbourne VARCHAR, gold_coast VARCHAR, sydney VARCHAR)` | You are a powerful text-to-SQL model. Given the SQL tables and natural language question, your job is to write SQL query that answers the question. ### Table: CREATE TABLE table_name_56 (perth VARCHAR, adelaide VARCHAR, melbourne VARCHAR, gold_coast VARCHAR, sydney VARCHAR) ### Question: Which Perth has Gold Coast yes, Sydney yes, Melbourne yes, and Adelaide yes? ### Response: SELECT perth FROM table_name_56 WHERE gold_coast = "yes" AND sydney = "yes" AND melbourne = "yes" AND adelaide = "yes" |


### Padding the Training Dataset

As a final step of dataset preparation, we need to apply **padding** to the training dataset. Padding ensures that all input sequences in a batch are of the same length.

A crucial point to note is the need to *add padding to the left*. This approach is adopted because the model generates tokens autoregressively, meaning it continues from the last token. Adding padding to the right would cause the model to generate new tokens from these padding tokens, resulting in the output sequence including padding tokens in the middle.


* Padding to right:

```
Today |  is  |   a    |  cold  |  <pad>  ==generate=>  "Today is a cold <pad> day"
 How  |  to  | become |  <pad> |  <pad>  ==generate=>  "How to become <pad> <pad> a great engineer"
```

* Padding to left:

```
<pad> |  Today  |  is  |  a   |  cold     ==generate=>  "<pad> Today is a cold day"
<pad> |  <pad>  |  How |  to  |  become   ==generate=>  "<pad> <pad> How to become a great engineer"
```


**Note:** In this project, tokenization and padding are implemented in `src/core.py` under the name `tokenize_and_pad_to_fixed_length`.
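To make the left-padding idea concrete, here is a minimal pure-Python sketch that works on lists of token IDs. It is not the project's `tokenize_and_pad_to_fixed_length`; the helper name `pad_left` and the pad ID `0` are assumptions chosen for illustration.

```python
from typing import Dict, List


def pad_left(token_ids: List[int], max_length: int, pad_id: int) -> Dict[str, List[int]]:
    """Truncate to max_length, then pad on the left (hypothetical helper)."""
    kept = token_ids[:max_length]              # keep at most max_length tokens
    n_pad = max_length - len(kept)             # number of pad tokens to prepend
    return {
        "input_ids": [pad_id] * n_pad + kept,  # real tokens sit at the end
        "attention_mask": [0] * n_pad + [1] * len(kept),  # 0 marks padding
    }


batch = [pad_left(ids, max_length=5, pad_id=0) for ids in ([11, 12, 13, 14], [21, 22, 23])]
for row in batch:
    print(row["input_ids"], row["attention_mask"])
```

With a Hugging Face tokenizer, the same effect is usually obtained by setting `tokenizer.padding_side = "left"` and calling the tokenizer with `padding="max_length"`, `truncation=True`, and a `max_length` value.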

<br>

## 2. Loading the model

- First, check whether the model already exists in any Azure ML registry.

In [None]:
# We don't have the open hermes model in the registry, so we need to get it from hugging face
for m in registry_ml_client_hugging.models.list():
    print(m.name)

aisingapore-llama3-8b-cpt-sea-lionv2-base
aisingapore-llama3-8b-cpt-sea-lionv2.1-instruct
weblab-geniac-tanuki-8b-dpo-v1.0
lemon07r-gemma-2-ataraxy-9b
tinyllama-tinyllama-1.1b-chat-v1.0
sreenington-phi-3-mini-4k-instruct-awq
skywork-skywork-reward-gemma-2-27b
vonjack-phi-3-mini-4k-instruct-llamafied
unsloth-phi-3-medium-4k-instruct
third-intellect-phi-3-mini-4k-instruct-orca-math-word-problems-200k-model-16bit
cognitivecomputations-dolphin-2.9.2-phi-3-medium-abliterated
unsloth-phi-3.5-mini-instruct
ba2han-llama-phi-3-dora
gokaygokay-flux-prompt-enhance
lenguajenaturalai-leniachat-qwen2-1.5b-v0
groq-llama-3-groq-8b-tool-use
groq-llama-3-groq-70b-tool-use
makers-lab-indus-1.1b-it
vagosolutions-sauerkrautlm-nemo-12b-instruct
grabbe-gymnasium-detmold-grabbe-ai
silma-ai-silma-9b-instruct-v1.0
akjindal53244-llama-3.1-storm-8b
vagosolutions-llama-3.1-sauerkrautlm-8b-instruct
vagosolutions-llama-3.1-sauerkrautlm-70b-instruct
sarvamai-sarvam-2b-v0.5
ghost-x-ghost-8b-beta-1608
defog-sqlcoder-7b

#### Using Hugging Face Library

- Alternatively, we can download the model directly using the Hugging Face library.

In [None]:
from huggingface_hub import snapshot_download

TOKEN = os.environ.get("TOKEN")

snapshot_download(
    repo_id="meta-llama/Meta-Llama-3-8B", local_dir="./model_downloaded/", token=TOKEN
)
# or using component import_model = registry_ml_client.components.get(name="download_model", label="latest")

#### Using components - AML
- Here we can use the `download_model` component or the `import_model` component (also available in the "Import Model into Registry" Jupyter Notebook). This will register the model as an MLflow artifact.

- See the import_model_into_registry.ipynb example in this repository


In [None]:
import_model = registry_ml_client.components.get(name="download_model", label="latest")

## 3. Create a data asset

In [15]:
# TODO: remove it. Just to accelerate the training process (demo)
# Reduce the dataset size by taking a subset percentage
percentage = 7  # Adjust the percentage as needed
subset_size = int(len(train_dataset) * (percentage / 100))
reduced_train_dataset = train_dataset.select(range(0, subset_size))

In [16]:
len(train_dataset), len(reduced_train_dataset)

(62861, 4400)

### Training asset

In [18]:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

# Create a local directory to hold the dataset
os.makedirs("train_dataset", exist_ok=True)
# Save the dataset to that directory in Arrow format
reduced_train_dataset.save_to_disk("train_dataset")

dataset_url = "./train_dataset"

# Create a Data asset in Azure ML
data_asset_url_file = Data(
    path=dataset_url,
    type=AssetTypes.URI_FOLDER,
    name="train_dataset_folder",
    description="Training dataset for text-to-SQL model stored as a URI folder",
)

# Register the Data asset
workspace_ml_client.data.create_or_update(data_asset_url_file)

import shutil

# Clean the local directory if it exists
if os.path.exists("train_dataset"):
    shutil.rmtree("train_dataset")


## 3. How Does the Base Model Perform?
First, let's assess the base model (OpenHermes-2.5, built on Mistral 7B) on the SQL generation task before any fine-tuning. This gives us a baseline for both output quality and latency. In the sample below, the base model already returns a plausible SQL query, but generation on CPU is slow (about 1 token per second). Fine-tuning is intended to make the model follow the dataset's SQL style more consistently for our specific task.


In [15]:
# or base_model_id = "/src/core/model_downloaded/"
base_model_id = "teknium/OpenHermes-2.5-Mistral-7B"

In [16]:
import transformers
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import time


In [None]:
TOKEN = os.environ.get("TOKEN")

# `token` replaces the deprecated `use_auth_token` argument in recent transformers releases
tokenizer = AutoTokenizer.from_pretrained(base_model_id, token=TOKEN)
model = AutoModelForCausalLM.from_pretrained(base_model_id, token=TOKEN)
pipeline = transformers.pipeline(
    model=model, tokenizer=tokenizer, task="text-generation"
)

In [14]:
# Inspect the model's properties
print("Model's device:", model.device)
print("Model's dtype:", model.dtype)
print("Model's max length:", tokenizer.model_max_length)
print("Model's parameters:")
# for name, param in model.named_parameters():
#     print(f"  {name}: {param.shape}, {param.dtype}")

Model's device: cpu
Model's dtype: torch.float32
Model's max length: 1000000000000000019884624838656
Model's parameters:


In [20]:
sample = test_dataset[1]
prompt = PROMPT_TEMPLATE.format(
    context=sample["context"], question=sample["question"], output=""
)  # Leave the answer part blank
prompt

'You are a powerful text-to-SQL model. Given the SQL tables and natural language question, your job is to write SQL query that answers the question.\n\n### Table:\nCREATE TABLE table_name_61 (game INTEGER, opponent VARCHAR, record VARCHAR)\n\n### Question:\nWhat is the lowest numbered game against Phoenix with a record of 29-17?\n\n### Response:\n'

In [21]:
# Measure the latency
model.generation_config.pad_token_id = model.generation_config.eos_token_id
start_time = time.time()

with torch.no_grad():
    response = pipeline(
        prompt, max_new_tokens=256, repetition_penalty=1.15, return_full_text=False
    )

end_time = time.time()
latency = end_time - start_time

# Calculate the number of tokens generated
generated_text = response[0]["generated_text"]
num_output_tokens = len(tokenizer.tokenize(generated_text))

# Calculate the number of input tokens
num_input_tokens = len(tokenizer.tokenize(prompt))

# Calculate tokens per second
tokens_per_second = num_output_tokens / latency

# Display the results
display_table({"prompt": prompt, "generated_query": generated_text})
print(f"Latency: {latency:.2f} seconds")
print(f"Input tokens: {num_input_tokens}")
print(f"Output tokens: {num_output_tokens}")
print(f"Tokens per second: {tokens_per_second:.2f}")

| | prompt | generated_query |
|---|---|---|
| 0 | You are a powerful text-to-SQL model. Given the SQL tables and natural language question, your job is to write SQL query that answers the question. ### Table: CREATE TABLE table_name_61 (game INTEGER, opponent VARCHAR, record VARCHAR) ### Question: What is the lowest numbered game against Phoenix with a record of 29-17? ### Response: | `SELECT game FROM table_name_61 WHERE opponent = 'Phoenix' AND record = '29-17' ORDER BY game ASC LIMIT 1;` |


Latency: 37.13 seconds
Input tokens: 92
Output tokens: 39
Tokens per second: 1.05


In [8]:
# input_ids = tokenizer.encode(prompt, return_tensors='pt')

# # Generate text
# output = model.generate(input_ids, max_length=256)

# # Decode the output
# generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
# print(generated_text)

## 4. Kick-off a Training Job

Similar to conventional Transformers training, we first set up a Trainer object to organize the training iterations. There are numerous hyperparameters to configure, but MLflow will record them on your behalf.

To enable MLflow logging, you can specify `report_to="mlflow"` and name your training trial with the `run_name` parameter. This action initiates an [MLflow run](https://mlflow.org/docs/latest/tracking.html#runs) that automatically logs training metrics, hyperparameters, configurations, and the trained model. 
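As a rough illustration of the two settings named above, the sketch below collects Trainer arguments in a plain dictionary. The helper name `build_training_kwargs`, the `LLM-Model` prefix, and every hyperparameter value are assumptions for demonstration only; they are not the values used in this project's `job.py`.

```python
from datetime import datetime


def build_training_kwargs(output_dir: str, run_prefix: str = "LLM-Model") -> dict:
    """Collect keyword arguments for transformers.TrainingArguments (illustrative values)."""
    run_name = f"{run_prefix}-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"
    return {
        "output_dir": output_dir,
        "report_to": "mlflow",             # send metrics and parameters to MLflow
        "run_name": run_name,              # name shown for the MLflow run
        "per_device_train_batch_size": 2,  # assumed value
        "gradient_accumulation_steps": 4,  # assumed value
        "learning_rate": 2e-4,             # assumed value
        "max_steps": 100,                  # assumed value; keep small for a quick demo
        "logging_steps": 10,               # assumed value
    }


training_kwargs = build_training_kwargs("./outputs")
print(training_kwargs["run_name"])

# With transformers installed, the dictionary would be used as:
#   from transformers import TrainingArguments
#   training_args = TrainingArguments(**training_kwargs)
```

Keeping the arguments in a dictionary makes it easy to log or override them before constructing `TrainingArguments`.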

### Set Prompt Template and Default Inference Parameters (optional)

An LLM's prediction behavior is defined not only by the model weights but also, to a large extent, by the prompt and inference parameters such as `max_new_tokens` and `repetition_penalty`. It is therefore highly advisable to save this metadata along with the model, so that you get consistent behavior when loading the model later.

In [4]:
from azure.ai.ml import command, Input

In [5]:
# Same format as the template applied to the dataset, except that it accepts only a {prompt} variable, so both the table and the question must be passed inside it.
prompt_template = """You are a powerful text-to-SQL model. Given the SQL tables and natural language question, your job is to write SQL query that answers the question.

{prompt}

### Response:
"""

In [6]:
prompt_template

'You are a powerful text-to-SQL model. Given the SQL tables and natural language question, your job is to write SQL query that answers the question.\n\n{prompt}\n\n### Response:\n'
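To show what the `{prompt}` placeholder does at inference time, here is a small sketch that fills the template with plain string formatting. This is only an approximation of what happens when a saved prompt template is applied; the helper names `build_user_prompt` and `render_prompt` are hypothetical, and in this sketch the user-side text carries only the table and the question.

```python
INFERENCE_TEMPLATE = (
    "You are a powerful text-to-SQL model. Given the SQL tables and natural "
    "language question, your job is to write SQL query that answers the question."
    "\n\n{prompt}\n\n### Response:\n"
)


def build_user_prompt(context: str, question: str) -> str:
    """Combine table schema and question into the text that fills {prompt}."""
    return f"### Table:\n{context}\n\n### Question:\n{question}"


def render_prompt(template: str, user_prompt: str) -> str:
    """Substitute the user text into the template's single {prompt} slot."""
    return template.format(prompt=user_prompt)


user_prompt = build_user_prompt(
    "CREATE TABLE head (age INTEGER)",
    "How many heads of the departments are older than 56 ?",
)
print(render_prompt(INFERENCE_TEMPLATE, user_prompt))
```

Because the system instruction and the `### Response:` marker live in the template, the caller only supplies the schema and the question.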

In [7]:
data_asset = workspace_ml_client.data.get("train_dataset_folder", version="4")
base_model_id = "teknium/OpenHermes-2.5-Mistral-7B"
model_name = "openhermes-2-5-mistral-7b"

In [49]:
job = command(
    inputs=dict(
        data=Input(
            type="uri_folder",
            path=data_asset.path,
        ),
        base_model_id=base_model_id,
        model_name=model_name,
        max_length=256,
        prompt_template=repr(prompt_template),
        # token = os.environ.get("TOKEN")
    ),
    code=f"{project_root_directory}/src/core/",  # location of source code
    command="python job.py --data ${{inputs.data}} --base_model_id ${{inputs.base_model_id}} \
    --model_name ${{inputs.model_name}} --max_length ${{inputs.max_length}} --prompt_template ${{inputs.prompt_template}}",  # --token ${{inputs.token}}",
    environment="azureml://registries/azureml/environments/acft-hf-nlp-gpu/versions/71",
    display_name="fine-tuning-job-" + model_name,
    compute="compute-fine-tuning",  # Specify the compute target
)

workspace_ml_client.create_or_update(job)


| Experiment | Name | Type | Status | Details Page |
|---|---|---|---|---|
| notebooks | calm_coconut_skkf74jd0m | command | Starting | Link to Azure Machine Learning studio |


Training may take several hours, depending on your hardware. The primary goal of this tutorial, however, is to familiarize you with fine-tuning using PEFT and MLflow rather than to build a highly performant SQL generator. If model performance is not a priority, you may specify a smaller number of steps or interrupt the following cell to proceed with the rest of the notebook.
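The command string passed to the job implies a set of command-line flags that the training script must accept. The project's `job.py` is not shown here, so the following is only a sketch of a parser that would match those flags; the defaults and help texts are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser matching the flags used in the job's command string (hypothetical)."""
    parser = argparse.ArgumentParser(description="Fine-tuning job entry point (sketch)")
    parser.add_argument("--data", required=True, help="Mounted folder holding the dataset")
    parser.add_argument("--base_model_id", required=True, help="Hugging Face model ID")
    parser.add_argument("--model_name", required=True, help="Name used when logging the model")
    parser.add_argument("--max_length", type=int, default=256, help="Fixed sequence length")
    parser.add_argument("--prompt_template", default=None, help="Template saved with the model")
    return parser


args = build_parser().parse_args(
    [
        "--data", "./train_dataset",
        "--base_model_id", "teknium/OpenHermes-2.5-Mistral-7B",
        "--model_name", "openhermes-2-5-mistral-7b",
    ]
)
print(args.max_length, args.data)

# A folder written with Dataset.save_to_disk can be read back inside the job with:
#   from datasets import load_from_disk
#   train_dataset = load_from_disk(args.data)
```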

### What's Logged to MLflow?

Let's briefly review what is logged/saved to MLflow as a result of your training. To access the MLflow UI, run the `mlflow ui` command and open http://localhost:PORT (PORT is 5000 by default). Select the experiment "MLflow PEFT Tutorial" (or the notebook name when running on Databricks) on the left side. Then click on the latest MLflow Run named `LLM-Model-2024-...` to view the Run details.

#### Parameters

The `Parameters` section displays hundreds of parameters specified for the Trainer, LoraConfig, and BitsAndBytesConfig, such as `learning_rate`, `r`, `bnb_4bit_quant_type`. It also includes default parameters that were not explicitly specified, which is crucial for ensuring reproducibility, especially if the library's default values change.
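For orientation, the sketch below shows where parameters such as `r` and `bnb_4bit_quant_type` come from in a typical QLoRA setup. All values (rank, alpha, dropout, target modules) are illustrative assumptions, not the settings used by this project. The library imports are deferred into the function so that the snippet can be read and loaded without the GPU libraries installed.

```python
# Illustrative values only; tune them for your own model and hardware.
BNB_KWARGS = {
    "load_in_4bit": True,               # quantize base weights to 4 bits
    "bnb_4bit_quant_type": "nf4",       # NormalFloat4 quantization
    "bnb_4bit_use_double_quant": True,  # also quantize the quantization constants
}

LORA_KWARGS = {
    "task_type": "CAUSAL_LM",
    "r": 16,                 # rank of the low-rank update matrices (assumed)
    "lora_alpha": 32,        # scaling factor (assumed)
    "lora_dropout": 0.05,    # assumed
    "bias": "none",
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
}


def build_qlora_configs():
    """Return (BitsAndBytesConfig, LoraConfig) built from the dictionaries above."""
    import torch
    from peft import LoraConfig
    from transformers import BitsAndBytesConfig

    quantization_config = BitsAndBytesConfig(
        bnb_4bit_compute_dtype=torch.bfloat16, **BNB_KWARGS
    )
    peft_config = LoraConfig(**LORA_KWARGS)
    return quantization_config, peft_config
```

In a typical flow, the quantization config is passed to `AutoModelForCausalLM.from_pretrained(..., quantization_config=...)`, and the LoRA config is applied with `peft.get_peft_model(model, peft_config)`.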

#### Metrics
The `Metrics` section presents the model metrics collected during the run, such as `train_loss`. You can visualize these metrics with various types of graphs in the "Chart" tab.

#### Artifacts
The `Artifacts` section displays the files/directories saved in MLflow as a result of training. For Transformers PEFT training, you should see the following files/directories:


```

    model/
      ├─ peft/
      │  ├─ adapter_config.json       # JSON file of the LoraConfig
      │  ├─ adapter_model.safetensors # The weight file of the LoRA adapter
      │  └─ README.md                 # Empty README file generated by Transformers
      │
      ├─ LICENSE.txt                  # License information about the base model (Mistral-7B-0.1)
      ├─ MLModel                      # Contains various metadata about your model
      ├─ conda.yaml                   # Dependencies to create conda environment
      ├─ model_card.md                # Model card text for the base model
      ├─ model_card_data.yaml         # Model card data for the base model
      ├─ python_env.yaml              # Dependencies to create Python virtual environment
      └─ requirements.txt             # Pip requirements for model inference

```

Because we log the merged model, the MLflow artifacts will contain the full model in addition to the files listed above.

The adapter config is saved separately.

#### Model Metadata

The MLModel file stores detailed metadata about the PEFT and base models.
Here is an excerpt of the MLModel file (some fields are omitted for simplicity):

```
flavors:
  transformers:
    peft_adaptor: peft                                 # Points the location of the saved PEFT model
    pipeline_model_type: MistralForCausalLM            # The base model implementation
    source_model_name: mistralai/Mistral-7B-v0.1       # Repository name of the base model
    source_model_revision: xxxxxxx                     # Commit hash in the repository for the base model
    task: text-generation                              # Pipeline type
    torch_dtype: torch.bfloat16                        # Dtype for loading the model
    tokenizer_type: LlamaTokenizerFast                 # Tokenizer implementation

# Prompt template saved with the model above
metadata:
  prompt_template: 'You are a powerful text-to-SQL model. Given the SQL tables and
    natural language question, your job is to write SQL query that answers the question.


    {prompt}


    ### Response:

    '
# Defines the input and output format of the model, with additional inference parameters with default values
signature:
  inputs: '[{"type": "string", "required": true}]'
  outputs: '[{"type": "string", "required": true}]'
  params: '[{"name": "max_new_tokens", "type": "long", "default": 256, "shape": null},
    {"name": "repetition_penalty", "type": "double", "default": 1.15, "shape": null},
    {"name": "return_full_text", "type": "boolean", "default": false, "shape": null}]'
```


## Load the Saved Model from MLflow

Finally, let's load the model logged in MLflow and evaluate its performance as a text-to-SQL generator. There are two ways to load a Transformer model in MLflow:

1. Use [mlflow.transformers.load_model()](https://mlflow.org/docs/latest/python_api/mlflow.transformers.html#mlflow.transformers.load_model). This method returns a native Transformers pipeline instance. This is the approach implemented in the `src` code.
2. Use [mlflow.pyfunc.load_model()](https://mlflow.org/docs/latest/python_api/mlflow.pyfunc.html#mlflow.pyfunc.load_model). This method returns an MLflow pyfunc model that wraps the Transformers pipeline and offers additional features over the native pipeline, such as (1) a unified `predict()` API for inference, (2) model signature enforcement, and (3) automatic application of the prompt template and default parameters if they were saved. Note that not all Transformers pipelines are supported for pyfunc loading; refer to the [MLflow documentation](https://mlflow.org/docs/latest/llms/transformers/guide/index.html#supported-transformers-pipeline-types-for-pyfunc) for the full list of supported pipeline types.

The first option is preferable if you wish to use the model via the native Transformers interface. The second option offers a simplified and unified interface across different model types and is particularly useful for model testing before production deployment. In the following code, we will use the [mlflow.pyfunc.load_model()](https://mlflow.org/docs/latest/python_api/mlflow.pyfunc.html#mlflow.pyfunc.load_model) to show how it applies the prompt template and the default inference parameters defined above.


In [10]:
import mlflow

In [None]:
# Set your run ID from MLflow
run_id = "sharp_music_4hlfjk9z5h"

In [9]:
mlflow_model = mlflow.pyfunc.load_model(f"runs:/{run_id}/model")


In [16]:
prompt = """

### Table:
CREATE TABLE table_name_61 (game INTEGER, opponent VARCHAR, record VARCHAR)

### Question:
What is the lowest numbered game against Phoenix with a record of 29-17?

### Response:
"""

In [17]:
# Inference parameters such as max_new_tokens fall back to the defaults specified in the model signature
generated_query = mlflow_model.predict(prompt)[0]
display_table({"prompt": prompt, "generated_query": generated_query})

Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.


| | prompt | generated_query |
|---|---|---|
| 0 | ### Table: CREATE TABLE table_name_61 (game INTEGER, opponent VARCHAR, record VARCHAR) ### Question: What is the lowest numbered game against Phoenix with a record of 29-17? ### Response: | `SELECT MIN(game) FROM table_name_61 WHERE opponent = 'Phoenix' AND record = '29-17';` |


Perfect! The fine-tuned model generates the expected SQL query. As the code and result above show, the system prompt and default inference parameters are applied automatically, so we don't have to pass them to the loaded model. This is especially useful when you want to deploy multiple models (or update an existing model) with different system prompts or parameters: the client implementation does not need to change, because these details are abstracted behind the MLflow model.
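The defaults stored in the signature can also be overridden per request through the `params` argument of `predict()`. The sketch below wraps that call; `predict_sql` and the `_EchoModel` stand-in are hypothetical names used only so the example runs without a real model. Override keys should match the parameters declared in the signature (`max_new_tokens`, `repetition_penalty`, `return_full_text`).

```python
from typing import Any


def predict_sql(model: Any, user_prompt: str, **overrides: Any) -> Any:
    """Call predict() and return the first result, overriding signature defaults if given."""
    params = overrides or None  # None means "use the defaults saved with the model"
    return model.predict(user_prompt, params=params)[0]


class _EchoModel:
    """Stand-in for a loaded pyfunc model, for demonstration only."""

    def predict(self, data, params=None):
        return [f"prompt={data!r} params={params}"]


demo_model = _EchoModel()
print(predict_sql(demo_model, "### Question:\nHow many rows?"))
print(predict_sql(demo_model, "### Question:\nHow many rows?", max_new_tokens=64))

# With the model loaded earlier in this notebook, the call would look like:
#   predict_sql(mlflow_model, prompt, max_new_tokens=64)
```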

## Register the model

Although we could register the model directly with the SDK, we propose a dedicated registration structure that makes it easier to integrate with a CI/CD pipeline later.

In [None]:
# This is necessary if you are running the code in a Jupyter notebook locally
import mlflow

mlflow_tracking_uri = workspace_ml_client.workspaces.get(
    workspace_ml_client.workspace_name
).mlflow_tracking_uri
mlflow.set_tracking_uri(mlflow_tracking_uri)

In [42]:
from src.core.config import DeploymentConfig
from src.core.deployment import DeploymentPipeline

In [43]:
model_name = "openhermes-finetune"
task = "text-generation"
version = 1

In [44]:
config = DeploymentConfig(model_name=model_name, task=task, mode="specific")
obj = DeploymentPipeline(config)

In [None]:
_, exp_names = obj.get_all_experiments()

In [11]:
experiment_id = "fbe78c46-3912-439f-97d3-2e925ce51380"

In [12]:
obj.register_model(experiment_id=experiment_id, run_id="calm_map_y4nxvsn5zh")

INFO:src.core.deployment:Registering a new version
INFO:azure.identity._credentials.chained:ChainedTokenCredential acquired a token from ManagedIdentityCredential
Successfully registered model 'openhermes-finetune-test'.
INFO:azure.identity._credentials.chained:ChainedTokenCredential acquired a token from ManagedIdentityCredential
INFO:azure.identity._credentials.chained:ChainedTokenCredential acquired a token from ManagedIdentityCredential
2024/12/10 13:23:33 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: openhermes-finetune-test, version 1
Created version '1' of model 'openhermes-finetune-test'.
INFO:azure.identity._credentials.chained:ChainedTokenCredential acquired a token from ManagedIdentityCredential
INFO:src.core.deployment:Model openhermes-finetune-test version 1 is ready to be registered.
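For comparison, the plain SDK route mentioned at the start of this section can be sketched with `mlflow.register_model`. This is not what `DeploymentPipeline.register_model` does internally (its implementation is not shown here). The `"model"` artifact path and the helper names are assumptions for illustration.

```python
def build_run_model_uri(run_id: str, artifact_path: str = "model") -> str:
    """Return an MLflow `runs:/` URI pointing at a model logged under a run."""
    if not run_id:
        raise ValueError("run_id must be a non-empty string")
    return f"runs:/{run_id}/{artifact_path.strip('/')}"


def register_from_run(run_id: str, registered_name: str, artifact_path: str = "model"):
    """Register the logged model; needs mlflow and a configured tracking URI."""
    import mlflow  # imported lazily so the URI helper works without mlflow installed

    return mlflow.register_model(
        model_uri=build_run_model_uri(run_id, artifact_path),
        name=registered_name,
    )


model_uri = build_run_model_uri("calm_map_y4nxvsn5zh")
print(model_uri)  # runs:/calm_map_y4nxvsn5zh/model
```

Calling `register_from_run("calm_map_y4nxvsn5zh", "openhermes-finetune-test")` would then create a new version, provided the tracking URI has been set as in the cell above and the run actually logged a model under that artifact path.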


In [13]:
models = workspace_ml_client.models.list()

for model in models:
    if model.name == model_name:
        model_id = model.id
        print(f"Name: {model.name}, id: {model.id}")

INFO:azure.identity._internal.get_token_mixin:AzureMLCredential.get_token succeeded
INFO:azure.identity._internal.decorators:ManagedIdentityCredential.get_token succeeded
INFO:azure.identity._credentials.default:DefaultAzureCredential acquired a token from ManagedIdentityCredential


Name: openhermes-finetune-test, id: /subscriptions/e878de60-60e5-4a05-ba42-a9ab14136cc9/resourceGroups/ka-sand-rg/providers/Microsoft.MachineLearningServices/workspaces/ml-sandbox-core/models/openhermes-finetune-test


# 5. Endpoint

In this section, we will create and deploy an online endpoint for the fine-tuned model. The endpoint will allow us to send HTTP requests to the model and receive predictions in response. We will use the `ManagedOnlineEndpoint` and `ManagedOnlineDeployment` classes from the `azure.ai.ml.entities` module to create and manage the endpoint.

The endpoint name will be unique, incorporating a timestamp to ensure uniqueness across deployments. We will also specify the model to be deployed, the instance type, and other deployment settings.

The following variables are used in this section:
- `workspace_ml_client`: An instance of `MLClient` used to interact with the Azure ML workspace.
- `model_id`: The ID of the fine-tuned model to be deployed.
- `timestamp`: A unique timestamp to ensure the endpoint name is unique.
- `online_endpoint_name`: The name of the online endpoint to be created.
- `demo_deployment`: The deployment configuration for the endpoint.
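The naming scheme described above can be wrapped in a small helper that also guards against invalid names. The 32-character limit and the allowed character set are assumptions based on commonly documented Azure ML endpoint naming rules; check the current documentation before relying on them. `make_endpoint_name` is a hypothetical helper.

```python
import re
from datetime import datetime

MAX_ENDPOINT_NAME_LENGTH = 32  # assumed limit; verify against the Azure ML docs


def make_endpoint_name(prefix: str, now: datetime) -> str:
    """Build '<prefix>-<timestamp>' and validate it before any API call is made."""
    name = f"{prefix}-{now.strftime('%Y-%m-%d-%H-%M-%S')}"
    if len(name) > MAX_ENDPOINT_NAME_LENGTH:
        raise ValueError(f"{name!r} is {len(name)} characters long")
    # Assumed rule: starts with a letter, then letters, digits, or hyphens.
    if not re.fullmatch(r"[A-Za-z][A-Za-z0-9-]*", name):
        raise ValueError(f"{name!r} contains unsupported characters")
    return name


example_name = make_endpoint_name("openhermes", datetime(2025, 1, 20, 11, 42, 36))
print(example_name, len(example_name))  # openhermes-2025-01-20-11-42-36 30
```

A second-resolution timestamp only guarantees uniqueness if no two endpoints with the same prefix are created within the same second, which is normally fine for interactive use.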

In [46]:
from datetime import datetime

now = datetime.now()
timestamp = now.strftime("%Y-%m-%d-%H-%M-%S")
timestamp

'2025-01-20-11-42-36'

In [47]:
from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    ManagedOnlineDeployment,
    ProbeSettings,
    OnlineRequestSettings,
)

In [48]:
# Create online endpoint - endpoint names need to be unique in a region, hence using timestamp to create unique endpoint name

online_endpoint_name = "openhermes-" + timestamp
# create an online endpoint
endpoint = ManagedOnlineEndpoint(
    name=online_endpoint_name,
    description="Online endpoint for "
    + model_name
    + ", fine-tuned model for text generation",
    auth_mode="key",
)
workspace_ml_client.begin_create_or_update(endpoint).wait()

INFO:azure.core.pipeline.policies.http_logging_policy:Request URL: 'https://management.azure.com/subscriptions/e878de60-60e5-4a05-ba42-a9ab14136cc9/resourceGroups/KA-SAND-RG/providers/Microsoft.MachineLearningServices/workspaces/ml-sandbox-core?api-version=REDACTED'
Request method: 'GET'
Request headers:
    'Accept': 'application/json'
    'x-ms-client-request-id': 'ce8f11ca-d73c-11ef-8493-8c3b4a55ecfb'
    'User-Agent': 'azure-ai-ml/1.24.0 azsdk-python-mgmt-machinelearningservices/0.1.0 Python/3.12.8 (Windows-11-10.0.22631-SP0)'
    'Authorization': 'REDACTED'
    'traceparent': '00-c366611349ef033e9a9ad545338ba132-82e352d9d238cbc2-01'
No body was attached to the request
INFO:azure.core.pipeline.policies.http_logging_policy:Response status: 200
Response headers:
    'Cache-Control': 'no-cache'
    'Pragma': 'no-cache'
    'Content-Length': '3161'
    'Content-Type': 'application/json; charset=utf-8'
    'Expires': '-1'
    'Vary': 'REDACTED'
    'x-ms-ratelimit-remaining-subscription

The list of SKUs supported for deployment is available here: [Managed online endpoints SKU list](https://learn.microsoft.com/en-us/azure/machine-learning/reference-managed-online-endpoints-vm-sku-list)

[Pricing](https://azure.microsoft.com/en-gb/pricing/details/machine-learning/)

In [17]:
# The model name must not include the version; the version is passed separately.
finetuned_model_name = "open-hermes-sql"
version = "1"

registered_model = workspace_ml_client.models.get(
    name=finetuned_model_name, version=version
)

model_id = registered_model.id
model_id

'/subscriptions/e878de60-60e5-4a05-ba42-a9ab14136cc9/resourceGroups/ka-sand-rg/providers/Microsoft.MachineLearningServices/workspaces/ml-sandbox-core/models/openhermes-finetune-test/versions/1'
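Note that the ID returned by `models.get(...)` already ends with `/versions/1`, whereas the ID printed earlier from `models.list()` does not. A small guard like the hypothetical `with_version` helper below avoids appending the version suffix twice. The resource IDs in the example are shortened placeholders, not real ones.

```python
import re

# Matches a trailing "/versions/<something>" segment.
_VERSION_SUFFIX = re.compile(r"/versions/[^/]+$")


def with_version(model_id: str, version: str) -> str:
    """Append '/versions/<version>' only if the ID does not already end with one."""
    model_id = model_id.rstrip("/")
    if _VERSION_SUFFIX.search(model_id):
        return model_id
    return f"{model_id}/versions/{version}"


# Placeholder IDs for illustration only.
unversioned = "/subscriptions/<sub>/workspaces/<ws>/models/openhermes-finetune-test"
versioned = unversioned + "/versions/1"

print(with_version(unversioned, "1"))
print(with_version(versioned, "1"))
```

Both calls print the same versioned ID, so the result can be passed to a deployment regardless of which lookup produced `model_id`.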

In [10]:
# Create a deployment
demo_deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name=online_endpoint_name,
    model=model_id,  # registered_model.id already ends with "/versions/<n>"
    instance_type="Standard_NC48ads_A100_v4",  # GPU instance type for faster inference
    instance_count=1,
    # environment=environment,
    request_settings=OnlineRequestSettings(
        max_concurrent_requests_per_instance=1,
        request_timeout_ms=90000,
        max_queue_wait_ms=500,
    ),
    liveness_probe=ProbeSettings(
        failure_threshold=49,
        success_threshold=1,
        timeout=299,
        period=180,
        initial_delay=180,
    ),
    readiness_probe=ProbeSettings(
        failure_threshold=10,
        success_threshold=1,
        timeout=10,
        period=10,
        initial_delay=2000,
    ),
)
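The probe settings above determine how long the platform waits before it gives up on a container. As a rough sketch, assuming Kubernetes-style semantics (probing starts after `initial_delay` seconds, repeats every `period` seconds, and fails after `failure_threshold` consecutive failures), the approximate time budget can be computed as follows. The `Probe` class is a local stand-in, not the Azure `ProbeSettings` class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    """Local stand-in holding the three values that drive the time budget."""

    initial_delay: int  # seconds before the first probe
    period: int  # seconds between probes
    failure_threshold: int  # consecutive failures tolerated

    def worst_case_seconds(self) -> int:
        # Approximation: ignores the per-probe timeout and scheduling jitter.
        return self.initial_delay + self.period * self.failure_threshold


liveness = Probe(initial_delay=180, period=180, failure_threshold=49)
readiness = Probe(initial_delay=2000, period=10, failure_threshold=10)

print(liveness.worst_case_seconds())   # 9000
print(readiness.worst_case_seconds())  # 2100
```

Under these assumptions the container gets roughly 35 minutes to become ready, which is a generous allowance for downloading and loading a 7B-parameter model.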

In [13]:
workspace_ml_client.online_deployments.begin_create_or_update(demo_deployment).wait()
endpoint.traffic = {"blue": 100}
workspace_ml_client.begin_create_or_update(endpoint).result()

Check: endpoint openhermes-2024-10-16-18-58-18 exists


.................................................................................................................

HttpResponseError: (ResourceNotReady) User container has crashed or terminated. Please see troubleshooting guide, available here: https://aka.ms/oe-tsg#error-resourcenotready
Code: ResourceNotReady
Message: User container has crashed or terminated. Please see troubleshooting guide, available here: https://aka.ms/oe-tsg#error-resourcenotready
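When a deployment fails with `ResourceNotReady`, the container logs are usually the quickest way to find the cause. The sketch below wraps `online_deployments.get_logs` from the `azure-ai-ml` SDK. The `fetch_deployment_logs` helper and its defaults are hypothetical, and the client is passed in so the function carries no hidden state.

```python
def fetch_deployment_logs(
    ml_client,
    endpoint_name: str,
    deployment_name: str = "blue",
    lines: int = 200,
    container_type=None,
):
    """Return the last `lines` log lines of a deployment as a string."""
    kwargs = {
        "name": deployment_name,
        "endpoint_name": endpoint_name,
        "lines": lines,
    }
    # Only forward container_type when the caller asks for a specific container.
    if container_type is not None:
        kwargs["container_type"] = container_type
    return ml_client.online_deployments.get_logs(**kwargs)


# Example usage (requires a live workspace, so it is left commented out):
# print(fetch_deployment_logs(workspace_ml_client, online_endpoint_name))
```

If the inference server logs are empty, passing `container_type="storage-initializer"` may show whether the model download itself failed. Treat that value as an assumption to verify against your SDK version.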

In [None]:
# score the sample_score.json file using the online endpoint with the azureml endpoint invoke method
response = workspace_ml_client.online_endpoints.invoke(
    endpoint_name=online_endpoint_name,
    deployment_name="blue",
    request_file="./emotion-dataset/sample_score.json",
)
print("raw response: \n", response, "\n")
# convert the response to a pandas dataframe and rename the label column as scored_label
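from io import StringIO  # recent pandas versions expect a file-like object, not a raw JSON string

response_df = pd.read_json(StringIO(response))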
response_df = response_df.rename(columns={0: "scored_label"})
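The `request_file` passed to `invoke` is a JSON document on disk. A minimal sketch of writing one for a text-to-SQL prompt is shown below. The `input_data` key and the list-of-strings shape are assumptions about what an MLflow model deployment expects; the exact schema depends on the model signature, so adjust it to match yours. `write_score_request` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path


def write_score_request(prompt: str, path: Path) -> dict:
    """Write a scoring request for a single prompt and return the payload."""
    payload = {"input_data": [prompt]}  # assumed schema, see the note above
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return payload


sql_prompt = (
    "### Table:\n"
    "CREATE TABLE table_name_50 (venue VARCHAR, away_team VARCHAR)\n\n"
    "### Question:\n"
    "When Essendon played away; where did they play?\n"
)

# A temporary directory keeps the example from touching the project tree.
request_path = Path(tempfile.mkdtemp()) / "sample_score.json"
payload = write_score_request(sql_prompt, request_path)
print(request_path.read_text(encoding="utf-8"))
```

The resulting path can then be passed as `request_file=str(request_path)` to `online_endpoints.invoke`.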

In [None]:
# We only pass the table and question, since the system prompt is added by the prompt template.
test_prompt = """
### Table:
CREATE TABLE table_name_50 (venue VARCHAR, away_team VARCHAR)

### Question:
When Essendon played away; where did they play?
"""

# Inference parameters like max_tokens_length are set to default values specified in the Model Signature
generated_query = mlflow_model.predict(test_prompt)[0]
display_table({"prompt": test_prompt, "generated_query": generated_query})
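Since the model returns plain SQL, a cheap sanity check is to confirm that the query at least parses against the table definition from the prompt. The sketch below uses an in-memory SQLite database as a stand-in. SQLite's dialect may differ from your target database, so this only catches syntax errors and unknown tables or columns. `is_valid_sql` is a hypothetical helper.

```python
import sqlite3


def is_valid_sql(create_table_sql: str, query: str) -> bool:
    """Return True if `query` compiles against the schema in `create_table_sql`."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(create_table_sql)
        # EXPLAIN QUERY PLAN compiles the statement without reading any rows.
        conn.execute(f"EXPLAIN QUERY PLAN {query}")
        return True
    except (sqlite3.Error, sqlite3.Warning):
        return False
    finally:
        conn.close()


create_sql = (
    "CREATE TABLE table_name_61 (game INTEGER, opponent VARCHAR, record VARCHAR)"
)
good_query = (
    "SELECT MIN(game) FROM table_name_61 "
    "WHERE opponent = 'Phoenix' AND record = '29-17';"
)
bad_query = "SELECT MIN(games) FROM table_name_61;"  # 'games' is not a column

print(is_valid_sql(create_sql, good_query))  # True
print(is_valid_sql(create_sql, bad_query))   # False
```

A check like this fits naturally into an evaluation loop over a held-out set, where the share of queries that compile is a useful first metric before comparing results.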