## Text Completion - SamSum 

This sample shows how to use the `text-completion` components from the `azureml` system registry to fine-tune a model that summarizes a dialog between two people, using the samsum dataset. We then deploy the fine-tuned model to an online endpoint for real-time inference.

### Training data
We will use the [samsum](https://huggingface.co/datasets/samsum) dataset, which is intended for summarizing dialogues between two people. In this notebook we summarize the dialogues and calculate BLEU and ROUGE scores for the generated summaries against the provided ground_truth summaries.
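
As a rough illustration of what a ROUGE score measures, the sketch below computes a simplified ROUGE-1 F1 (unigram overlap) between a generated summary and its ground truth. It is only a toy under stated assumptions: the function name `rouge1_f1` and the lowercase whitespace tokenization are hypothetical choices, not the metric implementation used by this notebook or by AzureML. A real evaluation would normally rely on an established metrics library.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap with lowercase whitespace tokens."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: a token counts at most as often as it appears in both texts.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


reference = "Emma will be home soon and she will let Will know."
prediction = "Emma will be home soon."
score = rouge1_f1(prediction, reference)
print(f"ROUGE-1 F1: {score:.3f}")
```

Because the tokenizer keeps punctuation attached to words, `soon.` and `soon` do not match here, which is one reason library implementations apply more careful normalization.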

### Model
We will use the `llama-2-7b` model to show how to fine-tune a model for the text-completion task. If you opened this notebook from a specific model card, remember to replace the model name accordingly. Optionally, if you need to fine-tune a model that is available on HuggingFace but not in the `azureml` system registry, you can [import](https://github.com/Azure/azureml-examples/blob/main/sdk/python/foundation-models/system/import/import_model_into_registry.ipynb) the model first.

### Outline
* Pick a model to fine-tune.
* Pick and explore training data.
* Configure the fine tuning job.
* Run the fine tuning job.

### 1. Setup pre-requisites
* Install dependencies
* Connect to AzureML Workspace. Learn more at [set up SDK authentication](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-setup-authentication?tabs=sdk). Replace `<WORKSPACE_NAME>`, `<RESOURCE_GROUP>` and `<SUBSCRIPTION_ID>` below.
* Connect to `azureml` system registry
* Set an optional experiment name

Install dependencies by running the cell below. This step is required when running in a new environment.

In [None]:
%pip install azure-ai-ml
%pip install azure-identity

%pip install mlflow
%pip install azureml-mlflow

Install the dependencies needed to download Hugging Face datasets.

In [None]:
%pip install datasets --upgrade
%pip install py7zr

In [None]:
from azure.ai.ml import MLClient
from azure.identity import (
    DefaultAzureCredential,
    InteractiveBrowserCredential,
)

# try the default credential chain first; fall back to an interactive browser login
try:
    credential = DefaultAzureCredential()
    credential.get_token("https://management.azure.com/.default")
except Exception:
    credential = InteractiveBrowserCredential()

# use config.json if one is available; otherwise connect with explicit workspace details
try:
    workspace_ml_client = MLClient.from_config(credential=credential)
except Exception:
    workspace_ml_client = MLClient(
        credential,
        subscription_id="<SUBSCRIPTION_ID>",
        resource_group_name="<RESOURCE_GROUP>",
        workspace_name="<WORKSPACE_NAME>",
    )

# the models, fine tuning pipelines and environments are available in the AzureML system registry, "azureml";
# the Llama model used in this notebook is looked up in the "azureml-meta" registry
registry_ml_client = MLClient(credential, registry_name="azureml")
registry_ml_client_meta = MLClient(credential, registry_name="azureml-meta")
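
One way to avoid editing workspace details inside the notebook is to read them from environment variables. The sketch below is a minimal illustration under assumptions: the helper `workspace_settings_from_env` and the variable names `AZURE_SUBSCRIPTION_ID`, `AZURE_RESOURCE_GROUP` and `AZUREML_WORKSPACE_NAME` are hypothetical and not defined by this notebook or by the SDK.

```python
import os
from typing import Mapping, Optional


def workspace_settings_from_env(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect MLClient keyword arguments from environment variables.

    The variable names below are hypothetical; use whatever your setup defines.
    """
    env = os.environ if env is None else env
    names = {
        "subscription_id": "AZURE_SUBSCRIPTION_ID",
        "resource_group_name": "AZURE_RESOURCE_GROUP",
        "workspace_name": "AZUREML_WORKSPACE_NAME",
    }
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {kwarg: env[var] for kwarg, var in names.items()}


# Example with an in-memory mapping instead of the real environment.
example_env = {
    "AZURE_SUBSCRIPTION_ID": "<SUBSCRIPTION_ID>",
    "AZURE_RESOURCE_GROUP": "<RESOURCE_GROUP>",
    "AZUREML_WORKSPACE_NAME": "<WORKSPACE_NAME>",
}
settings = workspace_settings_from_env(example_env)
print(settings)
```

The resulting dictionary could then be unpacked into the client constructor, for example `MLClient(credential, **settings)`.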

### 2. Pick a foundation model to fine tune

Decoder-based LLMs such as `llama` perform well on `text-completion` tasks, but the model still needs to be fine-tuned for our specific purpose before we use it. You can browse these models in the Model Catalog in the AzureML Studio, filtering by the `text-completion` task. In this example, we use the `llama-2-7b` model. If you have opened this notebook for a different model, replace the model name and version accordingly.

Note the `id` property of the model. It is passed as input to the fine-tuning job and is also available as the `Asset ID` field on the model details page in the AzureML Studio Model Catalog.

In [None]:
model_name = "Llama-2-7b"
foundation_model = registry_ml_client_meta.models.get(model_name, label="latest")
print(
    "\n\nUsing model name: {0}, version: {1}, id: {2} for fine tuning".format(
        foundation_model.name, foundation_model.version, foundation_model.id
    )
)

In [None]:
from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes

# wrap the registry model id so it can be passed to the fine tuning job as an MLflow model input
mlflow_model_llama = Input(type=AssetTypes.MLFLOW_MODEL, path=foundation_model.id)

### 4. Pick the dataset for fine-tuning the model

We use the [samsum](https://huggingface.co/datasets/samsum) dataset. The next few cells show basic data preparation for fine tuning:
* Visualize some data rows
* Preprocess the data into the required format. This is an important step for text completion because it adds the required sequences/separators to the data, and it is how the text-completion task is repurposed for a specific task such as summarization or translation.
* While fine-tuning, the text column is concatenated with the ground_truth column to produce the fine-tuning input. Hence, the data should be prepared such that `text + ground_truth` is your actual fine-tuning data.
* The bos and eos tokens are added to the data by the fine-tuning pipeline; you do not need to add them explicitly.
* We want this sample to run quickly, so we save smaller `train`, `validation` and `test` files containing 10% of the original. This means the fine-tuned model will have lower accuracy, so it should not be put to real-world use.
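
The concatenation described in the list above can be pictured with a small sketch. The helper name `build_training_sequence` is hypothetical, and the dialogue is a shortened excerpt of a samsum row. The actual pipeline performs this step internally and also adds the bos and eos tokens, which the sketch deliberately leaves out.

```python
def build_training_sequence(text: str, ground_truth: str) -> str:
    """Join prompt and target as plain concatenation: text + ground_truth."""
    # No bos/eos tokens are added here; the fine-tuning pipeline handles those.
    return text + ground_truth


# Shortened excerpt of a samsum dialogue, wrapped in the summarization prompt.
prompt = (
    "Summarize this dialog:\n"
    "Eric: Gr8! I'll watch them now!\r\nRob: Me too!"
    "\n---\nSummary:\n"
)
summary = "Eric and Rob are going to watch a stand-up on youtube."
sequence = build_training_sequence(prompt, summary)
print(sequence)
```

Since nothing is inserted between the two columns, the prompt itself has to end with the separator (here `Summary:\n`) that tells the model where the answer begins.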

##### Here is an example of what the data should look like

Text completion requires the training data to include at least two fields, one for `text` and one for `ground_truth`, as in this example. The examples below are from the samsum dataset.

Original dataset:

| dialogue (text) | summary (ground_truth) |
| :- | :- |
| Eric: MACHINE!\r\nRob: That's so gr8!\r\nEric: I know! And shows how Americans see Russian ;)\r\nRob: And it's really funny!\r\nEric: I know! I especially like the train part!\r\nRob: Hahaha! No one talks to the machine like that!\r\nEric: Is this his only stand-up?\r\nRob: Idk. I'll check.\r\nEric: Sure.\r\nRob: Turns out no! There are some of his stand-ups on youtube.\r\nEric: Gr8! I'll watch them now!\r\nRob: Me too!\r\nEric: MACHINE!\r\nRob: MACHINE!\r\nEric: TTYL?\r\nRob: Sure :) | Eric and Rob are going to watch a stand-up on youtube. | 
| Will: hey babe, what do you want for dinner tonight?\r\nEmma:  gah, don't even worry about it tonight\r\nWill: what do you mean? everything ok?\r\nEmma: not really, but it's ok, don't worry about cooking though, I'm not hungry\r\nWill: Well what time will you be home?\r\nEmma: soon, hopefully\r\nWill: you sure? Maybe you want me to pick you up?\r\nEmma: no no it's alright. I'll be home soon, i'll tell you when I get home. \r\nWill: Alright, love you. \r\nEmma: love you too. | Emma will be home soon and she will let Will know. | 

Formatted dataset the user might pass:

| text (text) | summary (ground_truth) |
| :- | :- |
| Summarize this dialog:\nEric: MACHINE!\r\nRob: That's so gr8!\r\nEric: I know! And shows how Americans see Russian ;)\r\nRob: And it's really funny!\r\nEric: I know! I especially like the train part!\r\nRob: Hahaha! No one talks to the machine like that!\r\nEric: Is this his only stand-up?\r\nRob: Idk. I'll check.\r\nEric: Sure.\r\nRob: Turns out no! There are some of his stand-ups on youtube.\r\nEric: Gr8! I'll watch them now!\r\nRob: Me too!\r\nEric: MACHINE!\r\nRob: MACHINE!\r\nEric: TTYL?\r\nRob: Sure :)\n---\nSummary:\n | Eric and Rob are going to watch a stand-up on youtube. | 
| Summarize this dialog:\nWill: hey babe, what do you want for dinner tonight?\r\nEmma:  gah, don't even worry about it tonight\r\nWill: what do you mean? everything ok?\r\nEmma: not really, but it's ok, don't worry about cooking though, I'm not hungry\r\nWill: Well what time will you be home?\r\nEmma: soon, hopefully\r\nWill: you sure? Maybe you want me to pick you up?\r\nEmma: no no it's alright. I'll be home soon, i'll tell you when I get home. \r\nWill: Alright, love you. \r\nEmma: love you too. \n---\nSummary:\n | Emma will be home soon and she will let Will know. | 
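
Before submitting a job, it can help to check that every record in a JSON Lines file carries both fields. The following is a minimal sketch, assuming the column names `text` and `summary` used in the formatted table above. The helper `find_invalid_records` is hypothetical and the sample records are toy data, not samsum rows.

```python
import json
from typing import Iterable, List


def find_invalid_records(lines: Iterable[str],
                         required=("text", "summary")) -> List[int]:
    """Return 1-based line numbers of JSONL records missing a required field."""
    bad = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            bad.append(number)
            continue
        # A field is valid only if it is a non-empty string.
        if not isinstance(record, dict) or any(
            not isinstance(record.get(field), str) or not record[field].strip()
            for field in required
        ):
            bad.append(number)
    return bad


# Toy records: the first is complete, the second lacks a summary, the third is not JSON.
sample_lines = [
    json.dumps({"text": "Summarize this dialog:\nA: hi\n---\nSummary:\n",
                "summary": "A says hi."}),
    json.dumps({"text": "Summarize this dialog:\nB: bye\n---\nSummary:\n"}),
    "not json",
]
invalid = find_invalid_records(sample_lines)
print(invalid)
```

The same function accepts an open file handle, since iterating over a text file yields its lines.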
 

In [None]:
# download the dataset using the helper script. This needs datasets library: https://pypi.org/project/datasets/
import os

exit_status = os.system("python download-dataset.py --download_dir samsum-dataset")
if exit_status != 0:
    raise Exception("Error downloading dataset")

In [None]:
# load the ./samsum-dataset/train.jsonl file into a pandas dataframe and show the first 5 rows
import pandas as pd

pd.set_option(
    "display.max_colwidth", 0
)  # set the max column width to 0 to display the full text
df = pd.read_json("./samsum-dataset/train.jsonl", lines=True)
df.head()

In [None]:
# create a function to preprocess the dataset in desired format


def get_preprocessed_samsum(df):
    # wrap each dialogue in the instruction prompt; the summary stays as the ground truth
    prompt = "Summarize this dialog:\n{}\n---\nSummary:\n"

    df["text"] = df["dialogue"].map(prompt.format)
    # keep only the two columns needed for fine tuning
    df = df[["text", "summary"]]

    return df

In [None]:
# load test.jsonl, train.jsonl and validation.jsonl from the ./samsum-dataset folder into pandas dataframes
test_df = pd.read_json("./samsum-dataset/test.jsonl", lines=True)
train_df = pd.read_json("./samsum-dataset/train.jsonl", lines=True)
validation_df = pd.read_json("./samsum-dataset/validation.jsonl", lines=True)
# map the train, validation and test dataframes to preprocess function
train_df = get_preprocessed_samsum(train_df)
validation_df = get_preprocessed_samsum(validation_df)
test_df = get_preprocessed_samsum(test_df)
# show the first 5 rows of the train dataframe
train_df.head()

In [None]:
# save 10% of the rows from the train, validation and test dataframes into files with small_ prefix in the ./samsum-dataset folder
frac = 0.1
train_df.sample(frac=frac).to_json(
    "./samsum-dataset/small_train.jsonl", orient="records", lines=True
)
validation_df.sample(frac=frac).to_json(
    "./samsum-dataset/small_validation.jsonl", orient="records", lines=True
)
test_df.sample(frac=frac).to_json(
    "./samsum-dataset/small_test.jsonl", orient="records", lines=True
)
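
The cell above draws its rows in a different random order on each run. If repeatable subsets matter, one option is to fix a seed. With pandas that is the `random_state` argument of `DataFrame.sample`. The sketch below shows the same idea using only the standard library. The helper `sample_fraction` and its default seed are hypothetical and not part of this notebook.

```python
import random
from typing import List, Sequence


def sample_fraction(records: Sequence[dict], frac: float, seed: int = 0) -> List[dict]:
    """Return a reproducible random subset holding roughly `frac` of the records."""
    if not 0 < frac <= 1:
        raise ValueError("frac must be in the interval (0, 1]")
    count = max(1, round(len(records) * frac)) if records else 0
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    return rng.sample(list(records), count)


# Toy records standing in for rows of a dataset.
records = [{"id": i} for i in range(50)]
subset = sample_fraction(records, frac=0.1)
print(len(subset))
```

Calling the function again with the same seed returns the same subset, which makes runs easier to compare.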

### 5. Submit the fine tuning job using the model and data as inputs
 
Create the job that uses the `text-generation` pipeline component. [Learn more](https://github.com/Azure/azureml-assets/blob/main/assets/training/finetune_acft_hf_nlp/components/pipeline_components/text_generation/README.md) about all the parameters supported for fine tuning.

Define finetune parameters

Fine-tuning parameters can be grouped into two categories: training parameters and optimization parameters.

Training parameters define aspects of training such as:
1. the optimizer and scheduler to use
2. the metric to optimize during fine-tuning
3. the number of training steps and the batch size

Optimization parameters help reduce GPU memory use and make effective use of the compute resources. A few of the parameters in this category are listed below. _The optimization parameters differ for each model and are packaged with the model to handle these variations._
1. enable DeepSpeed, ORT and LoRA
2. enable mixed precision training
3. enable multi-node training
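
To see how batch size, device count and epochs interact, the following sketch estimates the number of optimizer updates in a run. It is an approximation under assumptions: the function `estimate_optimizer_steps` and its parameter names are hypothetical, and the numbers are illustrative rather than taken from this notebook's job.

```python
import math


def estimate_optimizer_steps(num_examples: int,
                             per_device_batch_size: int,
                             num_devices: int = 1,
                             gradient_accumulation_steps: int = 1,
                             num_epochs: int = 1) -> int:
    """Rough count of optimizer updates for a run (an estimate, not an exact figure)."""
    # Examples consumed per optimizer update across all devices.
    effective_batch = per_device_batch_size * num_devices * gradient_accumulation_steps
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * num_epochs


# Illustrative numbers only: 1,000 examples, batch size 1 per device, 8 devices, 1 epoch.
steps = estimate_optimizer_steps(num_examples=1000, per_device_batch_size=1,
                                 num_devices=8, num_epochs=1)
print(steps)
```

An estimate like this is mainly useful for judging how much a smaller dataset or a larger batch size shortens a run.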

#### Create data inputs

In [None]:
from azure.ai.ml.entities._inputs_outputs import Input
# alternative: uncomment to use the local files created above instead of registered data assets
# training_data = Input(type="uri_file", path="./samsum-dataset/small_train.jsonl")
# validation_data = Input(type="uri_file", path="./samsum-dataset/small_validation.jsonl")

In [None]:
from azure.ai.ml.entities._inputs_outputs import Input
# these paths point at data assets registered in a specific workspace; replace them with your own
training_data = Input(type="uri_file", path="azureml://locations/westus3/workspaces/39d80c2d-b6d4-4254-a703-afa9687a022b/data/sample_text_gen_ft_train/versions/2")
validation_data = Input(type="uri_file", path="azureml://locations/westus3/workspaces/39d80c2d-b6d4-4254-a703-afa9687a022b/data/sample_text_gen_ft_test/versions/1")


Create FineTuning job object

In [None]:

from azure.ai.ml.entities._job.finetuning.custom_model_finetuning_job import CustomModelFineTuningJob
import uuid
from azure.ai.ml._restclient.v2024_01_01_preview.models import (
    FineTuningTaskType,
)
from azure.ai.ml.entities._inputs_outputs import Output

guid = uuid.uuid4()
short_guid = str(guid)[:8]

custom_model_finetuning_job = CustomModelFineTuningJob(
    task=FineTuningTaskType.TEXT_COMPLETION,
    training_data=training_data,
    validation_data=validation_data,
    hyperparameters={
        "per_device_train_batch_size": "1",
        "learning_rate": "0.00002",
        "num_train_epochs": "1",
    },
    model=mlflow_model_llama,
    display_name=f"llama-display-name-{short_guid}",
    name=f"llama-{short_guid}",
    experiment_name="llama-finetuning-experiment",
    outputs={
        "registered_model": Output(
            type="mlflow_model", name=f"llama-finetune-registered-{short_guid}"
        )
    },
)

Submit FineTuningJob

In [None]:
created_job = workspace_ml_client.jobs.create_or_update(custom_model_finetuning_job)
created_job.studio_url
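
After submission, the job runs asynchronously. One way to wait for it is a small polling loop, sketched below under assumptions: the helper `wait_for_job` is hypothetical, and the set of terminal status strings is an assumption that should be checked against the SDK version in use. A stand-in status source keeps the sketch runnable without a workspace.

```python
import time
from typing import Callable

# Assumed terminal states; check the SDK's job status values for your version.
TERMINAL_STATUSES = {"Completed", "Failed", "Canceled"}


def wait_for_job(get_status: Callable[[], str],
                 poll_seconds: float = 30.0,
                 max_polls: int = 1000) -> str:
    """Call get_status until it returns a terminal status, then return that status."""
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"Job not finished after {max_polls} polls")


# Stand-in status source so the sketch runs without a workspace.
fake_statuses = iter(["Queued", "Running", "Running", "Completed"])
final_status = wait_for_job(lambda: next(fake_statuses), poll_seconds=0)
print(final_status)

# With the objects from this notebook, the status source might look like:
# wait_for_job(lambda: workspace_ml_client.jobs.get(created_job.name).status)
```

Passing the status lookup as a callable keeps the loop independent of the SDK, so it can be exercised with fake data as shown.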