# Finetuning Llama-2 on Azure Machine Learning

## Contents
1. [Introduction](#Introduction)
1. [Set up the environment](#Setup)
1. [Data](#Data)
1. [Establish baseline](#Baseline)
1. [Finetune](#Finetune)
1. [Evaluate](#Evaluate)

## Introduction
This notebook demonstrates finetuning the Llama-2 foundation model on a text classification dataset using AzureML.

The Llama-2 model is now available in the AzureML Model Catalog. For details, see the [blog post](https://techcommunity.microsoft.com/t5/ai-machine-learning-blog/introducing-llama-2-on-azure/ba-p/3881233).

This functionality is in public preview in Azure Machine Learning. The preview version is provided without a service level agreement, and it’s not recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

Notebook summary:

1. Set up the environment
2. Load the model and data. In this example we use the [20 Newsgroups dataset](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html)
3. Evaluate the pretrained model on the test set to establish baseline metrics
4. Finetune the model
5. Evaluate the finetuned model on the same test set

## Set up the environment <a class="anchor" id="Setup"></a>

Install and load required packages

In [None]:
! pip uninstall -y azure-identity
! pip uninstall -y azure-ai-ml

! pip install -U azure-identity
! pip install azure-ai-ml==1.9.0a20230616001 --extra-index-url https://pkgs.dev.azure.com/azure-sdk/public/_packaging/azure-sdk-for-python/pypi/simple/
!pip install torch==2.0.1
!pip install bitsandbytes
!pip install transformers==4.31.0
!pip install peft
!pip install azureml-evaluate-mlflow

!nvidia-smi

In [None]:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer, AutoModelForSequenceClassification
from transformers import LlamaTokenizerFast, LlamaForCausalLM, LlamaTokenizer, LlamaForSequenceClassification
from transformers import Trainer, TrainingArguments, DataCollatorWithPadding
from peft import prepare_model_for_int8_training, LoraConfig, get_peft_model, prepare_model_for_kbit_training
from azureml.metrics import compute_metrics, constants
from datasets import Dataset
from datasets import load_dataset
from tqdm import tqdm
import os
import torch  # used below for torch.float16, torch.no_grad and torch.LongTensor

### Download model from azureml-meta registry

In [None]:
from azure.ai.ml import MLClient
from azure.identity import (
    DefaultAzureCredential,
    InteractiveBrowserCredential,
    ClientSecretCredential,
)
from azure.ai.ml.entities import AmlCompute
import time

try:
    credential = DefaultAzureCredential()
    credential.get_token("https://management.azure.com/.default")
except Exception as ex:
    credential = InteractiveBrowserCredential()

# connect to a workspace
workspace_ml_client = None
try:
    workspace_ml_client = MLClient.from_config(credential)
    subscription_id = workspace_ml_client.subscription_id
    workspace = workspace_ml_client.workspace_name
    resource_group = workspace_ml_client.resource_group_name
except Exception as ex:
    print(ex)
    # Enter details of your workspace
    subscription_id = "<SUBSCRIPTION_ID>"
    resource_group = "<RESOURCE_GROUP>"
    workspace = "<WORKSPACE_NAME>"
    workspace_ml_client = MLClient(
        credential, subscription_id, resource_group, workspace
    )
# Connect to the azureml-meta registry
registry_mlclient = MLClient(credential=credential, registry_name="azureml-meta")
model_name = "Llama-2-70b"
# Use the first version returned by the registry listing
version = list(registry_mlclient.models.list(model_name))[0].version
# Download the model files into a local folder named after the model
registry_mlclient.models.download(model_name, version=version)

### Load model and tokenizer

In [None]:
# Load original tokenizer
tokenizer_path = f'{model_name}/mlflow_model_folder/data/tokenizer'
tokenizer = LlamaTokenizer.from_pretrained(tokenizer_path)
tokenizer.pad_token_id = 0

In [None]:
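# Load original model
model_path = f'{model_name}/mlflow_model_folder/data/model'
# num_labels=4 matches the four newsgroup categories selected below
model = LlamaForSequenceClassification.from_pretrained(model_path, device_map='auto', load_in_8bit=True, torch_dtype=torch.float16, num_labels=4)
# Tell the classification head which token id is padding (assumes the tokenizer cell above has run)
model.config.pad_token_id = tokenizer.pad_token_id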

### Load and prepare data 
We use the 20 Newsgroups dataset from scikit-learn. We subsample the dataset to keep only 4 categories (classes), then take a 200-row training set and a 100-row test set that is held out for model evaluation. (Note that after removing rows with blank text, the exact number of rows is slightly smaller.)

In [None]:
# data_dir = "text-dnn-data"  # Local directory to store data
# blobstore_datadir = data_dir  # Blob store directory to store data in
from sklearn.datasets import fetch_20newsgroups
import pandas as pd
target_column_name = "label"
feature_column_name = "sentence"


def get_20newsgroups_data():
    """Fetches 20 Newsgroups data from scikit-learn
    Returns them in form of pandas dataframes
    """
    remove = ("headers", "footers", "quotes")
    categories = [
        "rec.sport.baseball",
        "rec.sport.hockey",
        "comp.graphics",
        "sci.space",
    ]

    data = fetch_20newsgroups(
        subset="train",
        categories=categories,
        shuffle=True,
        random_state=42,
        remove=remove,
    )
    data = pd.DataFrame(
        {feature_column_name: data.data, target_column_name: data.target}
    )

    # Copy the slices so the in-place cleanup below does not write to a view of `data`
    data_train = data[:200].copy()
    data_test = data[200:300].copy()

    data_train = remove_blanks_20news(
        data_train, feature_column_name, target_column_name
    )
    data_test = remove_blanks_20news(data_test, feature_column_name, target_column_name)
    return Dataset.from_pandas(data_train), Dataset.from_pandas(data_test)


def remove_blanks_20news(data, feature_column_name, target_column_name):
    """Replaces newlines with spaces, strips whitespace, and drops rows whose text is empty"""
    for index, row in data.iterrows():
        data.at[index, feature_column_name] = (
            row[feature_column_name].replace("\n", " ").strip()
        )

    data = data[data[feature_column_name] != ""]

    return data

In [None]:
data_train, data_test = get_20newsgroups_data()

In [None]:
data_train

In [None]:
data_test

In [None]:
def tokenize(examples):
    # Truncate longer texts and pad shorter ones so every example is exactly 256 tokens
    outputs = tokenizer(examples["sentence"], truncation=True, padding="max_length", max_length=256)
    return outputs
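
The effect of `truncation=True`, `padding="max_length"` and `max_length=256` can be sketched without loading the tokenizer. The helper below is hypothetical (it is not part of `transformers`), the token ids are made up, and it assumes padding is appended on the right; the real tokenizer may pad on the other side depending on its `padding_side` setting.

```python
from typing import Dict, List


def pad_or_truncate(token_ids: List[int], max_length: int, pad_id: int = 0) -> Dict[str, List[int]]:
    """Cut the sequence at max_length, then right-pad with pad_id up to max_length."""
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    return {
        "input_ids": kept + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding that the model should ignore
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


# Made-up token ids, shorter and longer than the target length
short = pad_or_truncate([1, 15043, 3186], max_length=6)
long = pad_or_truncate(list(range(1, 11)), max_length=6)
print(short)
print(long)
```

Padding every example to the same fixed length lets the `Trainer` stack examples into batches without a separate padding collator, at the cost of some wasted computation on pad tokens.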

In [None]:
train_dataset = data_train.map(
    lambda samples: tokenize(samples), remove_columns=["__index_level_0__", "sentence"], load_from_cache_file=False)

validation_dataset = data_test.map(
    lambda samples: tokenize(samples), remove_columns=["__index_level_0__", "sentence"], load_from_cache_file=False)

In [None]:
len(train_dataset), len(validation_dataset)

In [None]:
print(validation_dataset[0])

## Evaluate the pretrained model <a class="anchor" id="Baseline"></a>

### Compute metrics on test data to establish baseline
We use the azureml-metrics package, which is in preview.
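
To make clear what a classification metric compares, here is a minimal pure-Python sketch of accuracy and macro-averaged precision and recall. It is not the azureml-metrics implementation; the function name and the toy labels are made up for illustration. It also shows why the ground-truth labels and the predictions must not be swapped: accuracy is symmetric, but precision and recall trade places.

```python
from typing import Dict, List


def classification_summary(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Accuracy plus macro-averaged precision and recall over all labels seen."""
    labels = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precisions, recalls = [], []
    for label in labels:
        true_pos = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        actual = sum(t == label for t in y_true)
        # A label that is never predicted (or never present) contributes 0.0
        precisions.append(true_pos / predicted if predicted else 0.0)
        recalls.append(true_pos / actual if actual else 0.0)
    return {
        "accuracy": accuracy,
        "precision_macro": sum(precisions) / len(labels),
        "recall_macro": sum(recalls) / len(labels),
    }


# Toy labels, made up for illustration only
toy_true = [0, 0, 1, 2]
toy_pred = [0, 1, 1, 1]
summary = classification_summary(toy_true, toy_pred)
swapped = classification_summary(toy_pred, toy_true)
print(summary)
print(swapped)
```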

In [None]:
# Evaluate model on test dataset
model.eval()

In [None]:
# Metrics Computation
device = "cuda"
l = len(validation_dataset)
batch_size = 1

predictions = []
references = []

# Note: only the first 4 examples are scored; replace 4 with l to cover the whole test set
for i in range(0, 4, batch_size):
    print('Processing: ', i)
    data_batch = validation_dataset[i:i + batch_size]
    # NOTE: Convert data_batch['input_ids'] and data_batch['attention_mask'] to torch.LongTensor
    # before passing them to the model, so that the .to(device) call works.
    #print(data_batch)
    with torch.no_grad():
        outputs = model(input_ids=torch.LongTensor(data_batch['input_ids']).to(device), 
                        attention_mask=torch.LongTensor(data_batch['attention_mask']).to(device))
    batch_predictions = outputs.logits.argmax(dim=-1)
    batch_predictions, batch_references = batch_predictions.detach().cpu().numpy().tolist(), data_batch["label"]
    predictions.extend(batch_predictions)
    references.extend(batch_references)

print(predictions)
print(references)

# Compute metrics: y_test holds the ground-truth labels, y_pred the model predictions
metrics = compute_metrics(task_type=constants.Tasks.CLASSIFICATION,
                          y_test=references,
                          y_pred=predictions)["metrics"]

print(metrics)

In [None]:
model.hf_device_map

## Finetune the model <a class="anchor" id="Finetune"></a>

In [None]:
model

In [None]:
model.train()

In [None]:
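# prepare_model_for_kbit_training (imported above) is the newer name for prepare_model_for_int8_training
model = prepare_model_for_kbit_training(model)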

config = LoraConfig(
   r=4,
   lora_alpha=16,
   target_modules= [
       "q_proj",
       "v_proj",
   ],
   lora_dropout=.05,
   bias="none",
   task_type="SEQ_CLS", # sequence classification; the available task types are listed at https://github.com/huggingface/peft/blob/96c0277a1b9a381b10ab34dbf84917f9b3b992e6/src/peft/utils/config.py#L38
)

peft_model = get_peft_model(model, config)
peft_model.print_trainable_parameters()
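
LoRA keeps each targeted weight matrix frozen and learns a low-rank update made of two small matrices, `A` (shape `r x in_features`) and `B` (shape `out_features x r`), scaled by `lora_alpha / r`. The sketch below counts the parameters one such adapter adds. The helper name and the layer size are assumptions chosen for illustration, not values read from the model; `print_trainable_parameters()` above reports the real total, which can also include other trainable modules.

```python
def lora_trainable_params(r: int, in_features: int, out_features: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * in_features + out_features * r


# Hypothetical square projection layer, chosen only for illustration
hidden = 4096
per_module = lora_trainable_params(r=4, in_features=hidden, out_features=hidden)
frozen = hidden * hidden

# Scaling applied to the low-rank update with the config used above (lora_alpha=16, r=4)
scaling = 16 / 4

print(f"adapter params: {per_module}, frozen weight params: {frozen}, scaling: {scaling}")
```

Because the adapter size grows linearly with `r` while the frozen matrix grows with the product of its dimensions, a small rank keeps the trainable fraction tiny.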

In [None]:
print(peft_model)

In [None]:
training_args = TrainingArguments(
    per_device_train_batch_size=8,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=1,
    warmup_steps=0,
    num_train_epochs=1,
    learning_rate=3e-4,
    fp16=False,
    evaluation_strategy="steps",
    save_strategy="no",
    output_dir='.',
    ddp_find_unused_parameters=None,
    remove_unused_columns=False,
    logging_steps=8)

trainer = Trainer(
                  model=peft_model,
                  train_dataset=train_dataset,
                  eval_dataset=validation_dataset,
                  args=training_args,
                 )


In [None]:
trainer.train()

## Evaluate the finetuned model <a class="anchor" id="Evaluate"></a>
Now that the finetuned model is ready, we compute accuracy and other metrics with it on the same test dataset.

### Get predictions

In [None]:
predictions = trainer.predict(validation_dataset)
print(predictions.predictions.shape, predictions.label_ids.shape)

In [None]:
import numpy as np
preds = np.argmax(predictions.predictions, axis=-1)

In [None]:
preds

In [None]:
# import evaluate
# metric = evaluate.load("glue", "mrpc")
# metric.compute(predictions=preds, references=predictions.label_ids)
from sklearn.metrics import accuracy_score
accuracy_score(predictions.label_ids, preds)

In [None]:
# Evaluate model on test dataset
peft_model.eval()

### Compute metrics

In [None]:
# Metrics Computation
device = "cuda"
l = len(validation_dataset)
batch_size = 1

predictions = []
references = []

# Note: only the first 4 examples are scored; replace 4 with l to cover the whole test set
for i in range(0, 4, batch_size):
    print('Processing: ', i)
    data_batch = validation_dataset[i:i + batch_size]
    # NOTE: Convert data_batch['input_ids'] and data_batch['attention_mask'] to torch.LongTensor
    # before passing them to the model, so that the .to(device) call works.
    #print(data_batch)
    with torch.no_grad():
        outputs = peft_model(input_ids=torch.LongTensor(data_batch['input_ids']).to(device), 
                             attention_mask=torch.LongTensor(data_batch['attention_mask']).to(device))
    batch_predictions = outputs.logits.argmax(dim=-1)
    batch_predictions, batch_references = batch_predictions.detach().cpu().numpy().tolist(), data_batch["label"]
    predictions.extend(batch_predictions)
    references.extend(batch_references)

print(predictions)
print(references)

# Compute metrics: y_test holds the ground-truth labels, y_pred the model predictions
metrics = compute_metrics(task_type=constants.Tasks.CLASSIFICATION,
                          y_test=references,
                          y_pred=predictions)["metrics"]

print(metrics)

## Comparison of metrics
Comparing the baseline metrics with those computed after finetuning, we see that accuracy and other metrics improved.
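
One way to make the comparison explicit is to keep both metric dictionaries and print the per-metric change. This is a minimal sketch: `metric_deltas`, `baseline_metrics` and `finetuned_metrics` are hypothetical names, and it assumes the metrics of interest are stored as plain numbers keyed by metric name.

```python
from typing import Dict


def metric_deltas(before: Dict[str, float], after: Dict[str, float]) -> Dict[str, float]:
    """Return after - before for every numeric metric present in both dicts."""
    shared = sorted(set(before) & set(after))
    return {
        name: after[name] - before[name]
        for name in shared
        if isinstance(before[name], (int, float)) and isinstance(after[name], (int, float))
    }


# Usage (hypothetical variable names; both metric cells above assign to `metrics`,
# so each result has to be saved under its own name right after its cell runs):
# baseline_metrics = metrics   # after the baseline cell
# finetuned_metrics = metrics  # after the finetuned cell
# for name, delta in metric_deltas(baseline_metrics, finetuned_metrics).items():
#     print(f"{name}: {delta:+.4f}")
```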