# Finetuning Llama-2 on Azure Machine Learning

## Contents
1. [Introduction](#Introduction)
1. [Set up the environment](#Setup)
1. [Download model from azureml-meta registry](#Download)
1. [Data](#Data)
1. [Establish baseline](#Baseline)
1. [Finetune](#Finetune)
1. [Evaluate](#Evaluate)

## Introduction
This notebook demonstrates finetuning the Llama-2 foundation model on a summarization task using AzureML.

The Llama-2 model is now available in the AzureML Model Catalog. For details, see the [blog](https://techcommunity.microsoft.com/t5/ai-machine-learning-blog/introducing-llama-2-on-azure/ba-p/3881233).

This functionality is in public preview in Azure Machine Learning. The preview version is provided without a service level agreement, and it’s not recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see [Supplemental Terms of Use for Microsoft Azure Previews](https://azure.microsoft.com/en-us/support/legal/preview-supplemental-terms/).

Notebook summary:

1. Set up the environment
2. Load the model and data. In this example we use the [knkarthick/dialogsum](https://huggingface.co/datasets/knkarthick/dialogsum) dataset from Hugging Face.
3. Evaluate the pretrained model on the test set to establish baseline metrics
4. Finetune the model
5. Evaluate the finetuned model on the same test set

## Set up the environment <a class="anchor" id="Setup"></a>

Install and load required packages

In [1]:
! pip uninstall -y azure-identity
! pip uninstall -y azure-ai-ml

! pip install -U azure-identity
! pip install azure-ai-ml==1.9.0a20230616001 --extra-index-url https://pkgs.dev.azure.com/azure-sdk/public/_packaging/azure-sdk-for-python/pypi/simple/
! pip install torch==2.0.1
! pip install bitsandbytes
! pip install transformers==4.31.0
! pip install peft
! pip install azureml-evaluate-mlflow
! pip install pandas

Found existing installation: azure-identity 1.13.0
Uninstalling azure-identity-1.13.0:
  Successfully uninstalled azure-identity-1.13.0
Collecting azure-identity
  Downloading azure_identity-1.13.0-py3-none-any.whl (151 kB)
ERROR: azureml-inference-server-http 0.8.4 has requirement flask<2.3.0, but you'll have flask 2.3.2 which is incompatible.
Installing collected packages: azure-identity
Successfully installed azure-identity-1.13.0
Looking in indexes: https://pypi.org/simple, https://pkgs.dev.azure.com/azure-sdk/public/_packaging/azure-sdk-for-python/pypi/simple/
Collecting azure-ai-ml==1.9.0a20230616001
  Downloading https://pkgs.dev.azure.com/azure-sdk/29ec6040-b234-4e31-b139-33dc4287b756/_packaging/3572dbf9-b5ef-433b-9137-fc4d7768e7cc/pypi/download/azure-ai-ml/1.9a20230616001/azure_ai_ml-1.9.0a20230616001-py3-none-any.whl (6.2 MB)
ERROR: azure-storage-file-share 12.13.0 has requirement azure-core<2.0.0,>=1.28.0, but you'll have azure-core 1.26.4 which is incompatible.
ERROR: azure-storage-file-datalake 12.12.0 has requirement azure-core<2.0.0,>=1.28.0, but you'll have azure-core 1.26.4 which is incompatible.
ERROR: azure-storage-file-datalake 12.12.0 has requirement azure-storage-blob<13.0.0,>=12.17.0, but you'll have azure-storage-blob 12.13.0 which is incompatible.
Installing collected packages: strictyaml, azure-storage-file-share, azure-storage-file-datalake, marshmallow, pydash, azure-ai-ml
Successfully installed azure-ai-ml-1.9.0a20230616001 azure-storage-file-datalake-12.12.0 azure-storage-file-share-12.13.0 marshmallow-3.20.1 pydash-5.1.2 strictyaml-1.7.3

Collecting azureml-evaluate-mlflow
  Downloading azureml_evaluate_mlflow-0.0.21-py3-none-any.whl (213 kB)
Collecting scikit-learn>=0.24.0
  Downloading scikit_learn-1.3.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.1 MB)
Collecting mlflow-skinny==2.3.1
  Downloading mlflow_skinny-2.3.1-py3-none-any.whl (4.1 MB)
Collecting datasets>=2.11.0
  Downloading datasets-2.14.2-py3-none-any.whl (518 kB)
Collecting soundfile>=0.12.1
  Downloading soundfile-0.12.1-py2.py3-none-any.whl (24 kB)
Collecting diffusers>=0.14.0
  Downloading diffusers-0.19.3-py3-none-any.whl (1.3 MB)
Collecting mlflow==2.

Collecting alembic!=1.10.0,<2
  Downloading alembic-1.11.1-py3-none-any.whl (224 kB)
Collecting sqlalchemy<3,>=1.4.0
  Downloading SQLAlchemy-2.0.19-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.8 MB)
Collecting querystring-parser<2
  Downloading querystring_parser-1.2.4-py2.py3-none-any.whl (7.9 kB)
Collecting pooch<1.7,>=1.0
  Downloading pooch-1.6.0-py3-none-any.whl (56 kB)
Collecting audioread>=2.1.9
  Downloading audioread-3.0.0.tar.gz (377 kB)
Collecting soxr>=0.3.2
  Downloading soxr-0.3.5-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

Collecting Mako
  Downloading Mako-1.2.4-py3-none-any.whl (78 kB)
Collecting appdirs>=1.3.0
  Downloading appdirs-1.4.4-py2.py3-none-any.whl (9.6 kB)
Building wheels for collected packages: audioread
  Building wheel for audioread (setup.py) ... done
  Created wheel for audioread: filename=audioread-3.0.0-py3-none-any.whl size=23703 sha256=c9f3e9336e0188e67f41ec9a93d703e66bc6b90de610c675b6e545ab81d88c3a
  Stored in directory: /home/azureuser/.cache/pip/wheels/0a/ed/be/49df2538fca496690a024a4374455584d65c2afd6fc3d6e9c7
Successfully built audioread
ERROR: scikit-image 0.21.0 has requirement networkx>=2.8, but you'll have networkx 2.5 which is incompatible.
ERROR: scikit-image 0.21.0 has requirement scipy>=1.8, but you'll have scipy 1.5.3 which is incompatible.
ERROR: responsibleai 0.27.0 has requirement ipykernel<=6.8.0, but you'll have ipykernel 6.22.0 which is incompatible.



In [2]:
from transformers import LlamaTokenizer, LlamaForCausalLM
from transformers import Trainer, TrainingArguments
from peft import prepare_model_for_int8_training, LoraConfig, get_peft_model
from datasets import Dataset
import os
import numpy as np
import pandas as pd
from pprint import pprint
import torch

from azureml.metrics import compute_metrics, constants

2023-08-01 08:10:19.893748: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-08-01 08:10:29.982216: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-08-01 08:10:29.982344: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory


### Download model from azureml-meta registry <a class="anchor" id="Download"></a>

In [3]:
from azure.ai.ml import MLClient
from azure.identity import (
    DefaultAzureCredential,
    InteractiveBrowserCredential,
)
try:
    credential = DefaultAzureCredential()
    credential.get_token("https://management.azure.com/.default")
except Exception as ex:
    credential = InteractiveBrowserCredential()

# connect to a workspace
workspace_ml_client = None
try:
    workspace_ml_client = MLClient.from_config(credential)
    subscription_id = workspace_ml_client.subscription_id
    workspace = workspace_ml_client.workspace_name
    resource_group = workspace_ml_client.resource_group_name
except Exception as ex:
    print(ex)
    # Enter details of your workspace
    subscription_id = "9ec1d932-0f3f-486c-acc6-e7d78b358f9b"
    resource_group = "AML_shared"
    workspace = "AML_shared_eus_ws"
    workspace_ml_client = MLClient(
        credential, subscription_id, resource_group, workspace
    )
# Connect to the azureml-meta registry
registry_mlclient = MLClient(credential=credential, registry_name="azureml-meta")
model_name = "Llama-2-70b"
# Use the first version returned by the registry listing
version = list(registry_mlclient.models.list(model_name))[0].version
# Download the model files into ./<model_name>/mlflow_model_folder
registry_mlclient.models.download(model_name, version=version)

Found the config file in: /config.json
Downloading the model mlflow_model_folder at ./Llama-2-70b/mlflow_model_folder

Your file exceeds 100 MB. If you experience low speeds, latency, or broken connections, we recommend using the AzCopyv10 tool for this file transfer.

Example: azcopy copy 'https://amlmtayj4buuse01.blob.core.windows.net/azureml-me-80bb7f94-a0fa-55ee-b401-844db6ecc66f/mlflow_model_folder' './Llama-2-70b/mlflow_model_folder' 

See https://docs.microsoft.com/azure/storage/common/storage-use-azcopy-v10 for more information.

### Load model and tokenizer

In [5]:
#tokenizer_path = f'{model_name}/mlflow_model_folder/data/tokenizer'
tokenizer_path = f'{model_name}/mlflow_model_folder/data/model'
tokenizer = LlamaTokenizer.from_pretrained(tokenizer_path)
tokenizer.pad_token_id = 0

You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565


In [6]:
# Load the original (pretrained) model in 8-bit, placed automatically across the available GPUs
model_path = f'{model_name}/mlflow_model_folder/data/model'
model = LlamaForCausalLM.from_pretrained(
    model_path, device_map='auto', load_in_8bit=True, torch_dtype=torch.float16
)

Loading checkpoint shards:   0%|          | 0/15 [00:00<?, ?it/s]

In [7]:
model.hf_device_map

{'model.embed_tokens': 0,
 'model.layers.0': 0,
 'model.layers.1': 0,
 'model.layers.2': 0,
 'model.layers.3': 0,
 'model.layers.4': 0,
 'model.layers.5': 0,
 'model.layers.6': 0,
 'model.layers.7': 0,
 'model.layers.8': 0,
 'model.layers.9': 0,
 'model.layers.10': 1,
 'model.layers.11': 1,
 'model.layers.12': 1,
 'model.layers.13': 1,
 'model.layers.14': 1,
 'model.layers.15': 1,
 'model.layers.16': 1,
 'model.layers.17': 1,
 'model.layers.18': 1,
 'model.layers.19': 1,
 'model.layers.20': 1,
 'model.layers.21': 2,
 'model.layers.22': 2,
 'model.layers.23': 2,
 'model.layers.24': 2,
 'model.layers.25': 2,
 'model.layers.26': 2,
 'model.layers.27': 2,
 'model.layers.28': 2,
 'model.layers.29': 2,
 'model.layers.30': 2,
 'model.layers.31': 2,
 'model.layers.32': 3,
 'model.layers.33': 3,
 'model.layers.34': 3,
 'model.layers.35': 3,
 'model.layers.36': 3,
 'model.layers.37': 3,
 'model.layers.38': 3,
 'model.layers.39': 3,
 'model.layers.40': 3,
 'model.layers.41': 3,
 'model.layers.42'

### Load and prepare data <a class="anchor" id="Data"></a>
We use the [knkarthick/dialogsum](https://huggingface.co/datasets/knkarthick/dialogsum) dataset from Hugging Face. We select the first 1000 samples from the train split and the first 100 samples each from the test and validation splits.

In [8]:
from datasets import load_dataset

dataset_name = "knkarthick/dialogsum"
dataset = load_dataset(dataset_name)
dataset_train = dataset['train'].select(range(1000))
dataset_test = dataset['test'].select(range(100))
dataset_valid = dataset['validation'].select(range(100))

Downloading readme:   0%|          | 0.00/4.56k [00:00<?, ?B/s]

Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/11.3M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/442k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.35M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/3 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

In [9]:
def tokenize_fn(sample, include_summary=True):
    prompt =  f"Summarize the dialog.\n<dialog>: {sample['dialogue']}\n<summary>:"
    if include_summary:
        prompt += f" {sample['summary']}{tokenizer.eos_token}"
    tokenized_prompt = tokenizer(prompt, padding=True, truncation=True, max_length=1024, 
                       return_overflowing_tokens=False, return_length=False)
    return {
        'input_ids': tokenized_prompt['input_ids'],
        'attention_mask': tokenized_prompt['attention_mask']
    }    

tokenized_data_train = dataset_train.map(tokenize_fn, remove_columns=dataset_train.column_names)
tokenized_data_valid = dataset_valid.map(tokenize_fn, remove_columns=dataset_valid.column_names)
tokenized_data_test = dataset_test.map(lambda x: tokenize_fn(x, include_summary=False), remove_columns=dataset_test.column_names)

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Map:   0%|          | 0/100 [00:00<?, ? examples/s]
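To make the prompt layout concrete, the sketch below rebuilds the string that `tokenize_fn` hands to the tokenizer, without loading the model. The helper name `build_prompt`, the sample dialogue, and the `</s>` end-of-sequence text are assumptions for illustration; in the notebook the end-of-sequence text comes from `tokenizer.eos_token`.

```python
def build_prompt(dialogue, summary=None, eos_token="</s>"):
    """Rebuild the prompt string used by tokenize_fn (illustrative helper)."""
    prompt = f"Summarize the dialog.\n<dialog>: {dialogue}\n<summary>:"
    if summary is not None:
        # Training and validation prompts end with the reference summary and EOS.
        prompt += f" {summary}{eos_token}"
    return prompt


# Hypothetical sample, not taken from the dataset.
sample = {
    "dialogue": "#Person1#: Lunch at noon?\n#Person2#: Sure.",
    "summary": "They agree to have lunch at noon.",
}

train_prompt = build_prompt(sample["dialogue"], sample["summary"])
test_prompt = build_prompt(sample["dialogue"])  # the model must write the summary

print(train_prompt)
print("---")
print(test_prompt)
```

The test prompt stops right after `<summary>:`, so everything the model generates after that marker is treated as the predicted summary.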

### Compute metrics on test data to establish baseline <a class="anchor" id="Baseline"></a>
Here we evaluate the pretrained model and compute metrics using the azureml-metrics package, which is in preview.
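The summarization metrics reported in this notebook are ROUGE scores, which measure n-gram overlap between a predicted summary and a reference summary. As a rough intuition, here is a minimal sketch of a ROUGE-1 F1 score using lowercase whitespace tokenization. This is an assumption made for illustration, not how azureml-metrics computes its scores (real implementations typically use their own tokenization and may apply stemming). The function name `rouge1_f1` and the example sentences are hypothetical.

```python
from collections import Counter


def rouge1_f1(reference, prediction):
    """Unigram-overlap F1 between two strings (simplified ROUGE-1)."""
    ref_counts = Counter(reference.lower().split())
    pred_counts = Counter(prediction.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & pred_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat"
prediction = "the cat lay on the mat"
score = rouge1_f1(reference, prediction)
print(round(score, 3))
```

Five of the six tokens match in each direction here, so precision, recall, and F1 are all 5/6.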

In [10]:
model.eval()

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 8192, padding_idx=0)
    (layers): ModuleList(
      (0-79): 80 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear8bitLt(in_features=8192, out_features=8192, bias=False)
          (k_proj): Linear8bitLt(in_features=8192, out_features=1024, bias=False)
          (v_proj): Linear8bitLt(in_features=8192, out_features=1024, bias=False)
          (o_proj): Linear8bitLt(in_features=8192, out_features=8192, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear8bitLt(in_features=8192, out_features=28620, bias=False)
          (up_proj): Linear8bitLt(in_features=8192, out_features=28620, bias=False)
          (down_proj): Linear8bitLt(in_features=28620, out_features=8192, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSN

In [55]:
def compute_model_metrics(target_model):
    """Generate a summary for each test prompt and score it against the reference summary."""
    device = 'cuda'
    y_pred = []

    for inference_input in tokenized_data_test:
        with torch.no_grad():
            gen_tokens = target_model.generate(
                input_ids=torch.LongTensor([inference_input['input_ids']]).to(device),
                attention_mask=torch.LongTensor([inference_input['attention_mask']]).to(device),
                max_new_tokens=100, pad_token_id=tokenizer.pad_token_id)
        pred_text = tokenizer.batch_decode(gen_tokens.cpu().numpy(), skip_special_tokens=True)[0]
        # Keep only the text generated after the prompt's "<summary>:" marker
        summary = pred_text.split("<summary>:", 1)[-1].strip()
        y_pred.append(summary)

    # Each prediction is scored against a list of reference summaries
    y_test = [[item] for item in dataset_test['summary']]
    return compute_metrics(task_type=constants.Tasks.SUMMARIZATION, y_test=y_test, y_pred=y_pred)

In [56]:
pretrained_metrics = compute_model_metrics(model)
pprint(pretrained_metrics)

{'artifacts': {},
 'metrics': {'rouge1': 0.23757774076492347,
             'rouge2': 0.0672074786438569,
             'rougeL': 0.17331192372776122,
             'rougeLsum': 0.19087761549424737}}


## Finetune the model <a class="anchor" id="Finetune"></a>

In [57]:
model.train()
model = prepare_model_for_int8_training(model)
config = LoraConfig(
    r=4,
    lora_alpha=16,
    target_modules= [
        "q_proj",
        "v_proj",
    ],
    lora_dropout=.05,
    bias="none",
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, config)
peft_model.print_trainable_parameters()




trainable params: 8,192,000 || all params: 68,984,840,192 || trainable%: 0.01187507280904016
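The trainable parameter count printed above can be reproduced by hand, which helps show what LoRA adds to the model. Each adapted projection gains two small matrices, `lora_A` (input size by rank) and `lora_B` (rank by output size). The sketch below uses the layer dimensions from the model printout earlier in the notebook (80 decoder layers, `q_proj` 8192 to 8192, `v_proj` 8192 to 1024) and `r=4` from `LoraConfig`; the helper name `lora_params` is hypothetical.

```python
# Dimensions read from the model printout earlier in the notebook.
num_layers = 80
hidden_size = 8192          # in_features of q_proj and v_proj
q_out, v_out = 8192, 1024   # out_features of q_proj and v_proj
rank = 4                    # LoraConfig(r=4)


def lora_params(in_features, out_features, r):
    # lora_A maps in_features -> r and lora_B maps r -> out_features, both without bias.
    return in_features * r + r * out_features


per_layer = lora_params(hidden_size, q_out, rank) + lora_params(hidden_size, v_out, rank)
trainable = num_layers * per_layer
all_params = 68_984_840_192  # "all params" as printed by print_trainable_parameters()

print(f"trainable params: {trainable:,}")
print(f"trainable%: {100 * trainable / all_params:.6f}")
```

This gives 102,400 adapter parameters per layer and 8,192,000 in total, matching the printed figure.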


In [59]:
from transformers import DataCollatorForLanguageModeling

training_args = TrainingArguments(
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=1,
    warmup_steps=0,
    num_train_epochs=1,
    learning_rate=3e-4,
    fp16=False,
    evaluation_strategy="steps",
    save_strategy="no",
    output_dir='.',
    ddp_find_unused_parameters=None,
    remove_unused_columns=False,
    logging_steps=100)

data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
                  model=peft_model,
                  train_dataset=tokenized_data_train,
                  eval_dataset=tokenized_data_valid,
                  args=training_args,
                  data_collator=data_collator
                 )

In [60]:
trainer.train()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100  | 1.2229        | 1.202101        |
| 200  | 1.1684        | 1.197658        |




TrainOutput(global_step=250, training_loss=1.1910571594238282, metrics={'train_runtime': 4825.6029, 'train_samples_per_second': 0.207, 'train_steps_per_second': 0.052, 'total_flos': 1.6913370248891597e+17, 'train_loss': 1.1910571594238282, 'epoch': 1.0})
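The step count and throughput in `TrainOutput` follow from the training arguments. The sketch below checks that arithmetic, assuming a single training process (the model is spread over the GPUs by `device_map='auto'`, so the effective batch size is taken to be the per-device batch size). The variable names are illustrative.

```python
import math

num_train_samples = 1000    # size of dataset_train
batch_size = 4              # per_device_train_batch_size
grad_accum = 1              # gradient_accumulation_steps
epochs = 1                  # num_train_epochs
train_runtime = 4825.6029   # seconds, from TrainOutput

# One optimizer step consumes batch_size * grad_accum samples.
steps = math.ceil(num_train_samples / (batch_size * grad_accum)) * epochs
samples_per_second = num_train_samples * epochs / train_runtime
steps_per_second = steps / train_runtime

print(f"steps: {steps}")
print(f"samples/s: {samples_per_second:.3f}, steps/s: {steps_per_second:.3f}")
print(f"runtime: {train_runtime / 3600:.2f} hours")
```

With `logging_steps=100`, evaluation runs at steps 100 and 200, which is why the loss table has two rows for a 250-step run.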

## Evaluate the finetuned model <a class="anchor" id="Evaluate"></a>
Now that the finetuned model is ready, we compute the same metrics on the same test dataset.

In [61]:
peft_model.eval()

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(32000, 8192, padding_idx=0)
        (layers): ModuleList(
          (0-79): 80 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): Linear8bitLt(
                in_features=8192, out_features=8192, bias=False
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=8192, out_features=4, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=4, out_features=8192, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (k_proj): Linear8bitLt(in_features=8192, out_features=1024, bias=False)
            

### Compute metrics

In [62]:
finetuned_metrics = compute_model_metrics(peft_model)
pprint(finetuned_metrics)

{'artifacts': {},
 'metrics': {'rouge1': 0.45377132692714234,
             'rouge2': 0.1758754832658233,
             'rougeL': 0.354723582706404,
             'rougeLsum': 0.35477342709115767}}


## Comparison of metrics
Finetuning improved every ROUGE metric on the test set; for example, rouge1 rose from about 0.24 to about 0.45.

In [66]:
# The metric values live under the 'metrics' key of each result dict
metric_names = sorted(finetuned_metrics['metrics'])
pretrained = [pretrained_metrics['metrics'][name] for name in metric_names]
finetuned = [finetuned_metrics['metrics'][name] for name in metric_names]

result = pd.DataFrame({'metric': metric_names, 'pretrained': pretrained, 'finetuned': finetuned})
result

      metric  pretrained  finetuned
0     rouge1    0.237578   0.453771
1     rouge2    0.067207   0.175875
2     rougeL    0.173312   0.354724
3  rougeLsum    0.190878   0.354773
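To quantify the change, the sketch below takes the ROUGE values printed by the two `pprint` calls earlier in the notebook and computes the absolute gain and the finetuned-to-pretrained ratio for each metric. The values are copied by hand from those outputs; the variable names `pretrained_scores`, `finetuned_scores`, `gains`, and `ratios` are illustrative.

```python
# Values copied from the pprint outputs of the baseline and finetuned evaluations.
pretrained_scores = {
    "rouge1": 0.23757774076492347,
    "rouge2": 0.0672074786438569,
    "rougeL": 0.17331192372776122,
    "rougeLsum": 0.19087761549424737,
}
finetuned_scores = {
    "rouge1": 0.45377132692714234,
    "rouge2": 0.1758754832658233,
    "rougeL": 0.354723582706404,
    "rougeLsum": 0.35477342709115767,
}

gains = {name: finetuned_scores[name] - pretrained_scores[name] for name in pretrained_scores}
ratios = {name: finetuned_scores[name] / pretrained_scores[name] for name in pretrained_scores}

for name in sorted(gains):
    print(f"{name:<10} gain={gains[name]:+.3f}  ratio={ratios[name]:.2f}x")
```

Every metric roughly doubles or better, with rouge2 showing the largest relative gain.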
