
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# 04L - Fine-tuning LLMs
In this lab, we will apply what we learned about fine-tuning in the demo notebook. The aim is to fine-tune an instruction-following LLM.

### ![Dolly](https://files.training.databricks.com/images/llm/dolly_small.png) Learning Objectives
1. Prepare a novel dataset
1. Fine-tune the `pythia-70m-deduped` model to follow instructions.
1. Evaluate the fine-tuned model's responses with ROUGE metrics.

In [0]:
assert "gpu" in spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion"), "THIS LAB REQUIRES THAT A GPU MACHINE AND RUNTIME IS UTILIZED."

## Classroom Setup

In [0]:
%pip install rouge_score==0.1.2

Collecting rouge_score==0.1.2
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py): started
  Building wheel for rouge_score (setup.py): finished with status 'done'
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24936 sha256=fdedeed32e09d331a62bb8cafda4bf93ca67ff5b769e8ff6167dea0bf06d69a1
  Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2
Note: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.


In [0]:
%run ../Includes/Classroom-Setup


Resetting the learning environment:
| enumerating serving endpoints...found 0...(0 seconds)
| No action taken

Skipping install of existing datasets to "dbfs:/mnt/dbacademy-datasets/large-language-models/v01"

Validating the locally installed datasets:
| listing local files...(7 seconds)
| removing extra path: /models--t5-base/blobs/...(0 seconds)
| removing extra path: /models--t5-small/.no_exist/df1b051c49625cf57a3d0d8d3863ed4d13564fe4/...(0 seconds)
| removing extra path: /models--t5-small/blobs/...(1 seconds)
| removing extra path: /models--t5-small/snapshots/df1b051c49625cf57a3d0d8d3863ed4d13564fe4/...(0 seconds)
| fixed 4 issues...(8 seconds total)


Importing lab testing framework.



Using the "default" schema.

Predefined paths variables:
| DA.paths.working_dir: /dbfs/mnt/dbacademy-users/labuser4466139@vocareum.com/large-language-models
| DA.paths.user_db:     dbfs:/mnt/dbacademy-users/labuser4466139@vocareum.com/large-language-models/database.db
| DA.paths.datasets:    /dbfs/mnt/dbacademy-datasets/large-language-models/v01

Setup completed (17 seconds)

The models developed or used in this course are for demonstration and learning purposes only.
Models may occasionally output offensive, inaccurate, biased information, or harmful instructions.


In [0]:
print(f"Username:          {DA.username}")
print(f"Working Directory: {DA.paths.working_dir}")

Username:          labuser4466139@vocareum.com
Working Directory: /dbfs/mnt/dbacademy-users/labuser4466139@vocareum.com/large-language-models


In [0]:
%load_ext autoreload
%autoreload 2

Create a local temporary directory on the driver. It serves as the root directory for the intermediate model checkpoints created during training. The final model will be persisted to DBFS.

In [0]:
import tempfile

tmpdir = tempfile.TemporaryDirectory()
local_training_root = tmpdir.name
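
As a minimal sketch of how `tempfile.TemporaryDirectory` behaves (the file name `checkpoint.txt` is purely illustrative and not part of the lab): the directory exists until `cleanup()` is called, after which everything under it is deleted. This is why the final model needs to be saved elsewhere before cleanup.

```python
import os
import tempfile

# Create a temporary directory; it lives until cleanup() is called.
demo_tmpdir = tempfile.TemporaryDirectory()
demo_root = demo_tmpdir.name

# Write a stand-in "checkpoint" file (hypothetical name, for illustration only).
demo_file = os.path.join(demo_root, "checkpoint.txt")
with open(demo_file, "w") as f:
    f.write("intermediate state")

exists_before = os.path.exists(demo_file)

# cleanup() removes the directory and everything inside it.
demo_tmpdir.cleanup()
exists_after = os.path.exists(demo_root)

print(exists_before, exists_after)
```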

## Fine-Tuning

In [0]:
import os
import pandas as pd
from datasets import load_dataset
from transformers import (
    TrainingArguments,
    AutoTokenizer,
    AutoConfig,
    Trainer,
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
)

import evaluate
import nltk
from nltk.tokenize import sent_tokenize



### Question 1: Data Preparation
For instruction-following use cases, we need a dataset of prompt/response pairs, along with any contextual information that can be used as input when training the model. [databricks/databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) is one such dataset; it provides high-quality, human-generated prompt/response pairs.

Let's start by loading this dataset using the `load_dataset` functionality.

In [0]:
# TODO
ds = load_dataset("databricks/databricks-dolly-15k")

Downloading and preparing dataset json/databricks--databricks-dolly-15k to /root/.cache/huggingface/datasets/databricks___json/databricks--databricks-dolly-15k-7427aa6e57c34282/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4...

Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/databricks___json/databricks--databricks-dolly-15k-7427aa6e57c34282/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4. Subsequent calls will reuse this data.

In [0]:
print(ds)

DatasetDict({
    train: Dataset({
        features: ['instruction', 'context', 'response', 'category'],
        num_rows: 15011
    })
})


In [0]:
# Test your answer. DO NOT MODIFY THIS CELL.

dbTestQuestion4_1(ds)

PASSED: All tests passed for lesson4, question1
RESULTS RECORDED: Click `Submit` when all questions are completed to log the results.


### Question 2: Select pre-trained model

The model that we are going to fine-tune is [pythia-70m-deduped](https://huggingface.co/EleutherAI/pythia-70m-deduped). This model is part of the Pythia suite of models, which was developed to support interpretability research.

Let's define the pre-trained model checkpoint.

In [0]:
# TODO
model_checkpoint = "EleutherAI/pythia-70m-deduped"

In [0]:
print(model_checkpoint)

EleutherAI/pythia-70m-deduped


In [0]:
# Test your answer. DO NOT MODIFY THIS CELL.

dbTestQuestion4_2(model_checkpoint)

PASSED: All tests passed for lesson4, question2
RESULTS RECORDED: Click `Submit` when all questions are completed to log the results.


### Question 3: Load and Configure

The next task is to load and configure the tokenizer for this model. The instruction-following process builds a body of text that contains the instruction, context input, and response values from the dataset. The body of text also includes some special tokens to identify the sections of the text. These tokens are generally configurable, and need to be added to the tokenizer.

Let's go ahead and load the tokenizer for the pre-trained model.

In [0]:
# TODO
# load the tokenizer that was used for the model
tokenizer = AutoTokenizer.from_pretrained(
  model_checkpoint,cache_dir=DA.paths.datasets
)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["### End", "### Instruction:", "### Response:\n"]}
)

3

In [0]:
print(tokenizer)

GPTNeoXTokenizerFast(name_or_path='EleutherAI/pythia-70m-deduped', vocab_size=50254, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['### End', '### Instruction:', '### Response:\n']}, clean_up_tokenization_spaces=True)


In [0]:
# Test your answer. DO NOT MODIFY THIS CELL.

dbTestQuestion4_3(tokenizer)

PASSED: All tests passed for lesson4, question3
RESULTS RECORDED: Click `Submit` when all questions are completed to log the results.


### Question 4: Tokenize

The `tokenize` method below builds the body of text for each prompt/response.

In [0]:
remove_columns = ["instruction", "response", "context", "category"]


def tokenize(x: dict, max_length: int = 1024) -> dict:
    """
    Build the prompt text for a single example (instruction, optional context, response)
    and return the tokenizer output containing `input_ids` and `attention_mask`.
    """
    instr = x["instruction"]
    resp = x["response"]
    context = x["context"]

    instr_part = f"### Instruction:\n{instr}"
    context_part = ""
    if context:
        context_part = f"\nInput:\n{context}\n"
    resp_part = f"### Response:\n{resp}"

    text = f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.

{instr_part}
{context_part}
{resp_part}

### End
"""
    return tokenizer(text, max_length=max_length, truncation=True)
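
To make the text layout easier to see, here is a minimal sketch of the string that `tokenize` assembles before it is passed to the tokenizer. The helper name `build_prompt` and the sample record are hypothetical; the record is made up and not taken from the Dolly dataset.

```python
def build_prompt(instr: str, resp: str, context: str = "") -> str:
    """Hypothetical helper: reproduces the text layout used by `tokenize`, without tokenizing."""
    instr_part = f"### Instruction:\n{instr}"
    # The "Input:" section is only emitted when the example has context.
    context_part = f"\nInput:\n{context}\n" if context else ""
    resp_part = f"### Response:\n{resp}"
    return f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.

{instr_part}
{context_part}
{resp_part}

### End
"""


# Made-up record for illustration only.
example_text = build_prompt("Name a primary color.", "Red.")
print(example_text)
```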

Let's `tokenize` the Dolly training dataset.
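
`Dataset.map` calls the function with a single example (a dict of scalar values) by default; with `batched=True` it passes a dict of lists instead. The sketch below mimics both calling conventions in plain Python (it does not use the `datasets` library, and `format_example` is a hypothetical stand-in for `tokenize`) to show why a function written for one example should not be mapped in batched mode.

```python
def format_example(x: dict) -> str:
    """Hypothetical stand-in for `tokenize`: formats ONE example into text."""
    return f"### Instruction:\n{x['instruction']}"


# Made-up rows for illustration only.
rows = [
    {"instruction": "Name a primary color."},
    {"instruction": "Count to three."},
]

# Per-example mode: the function sees one dict of scalars per call.
per_example = [format_example(row) for row in rows]

# Batched mode: the function sees one dict whose values are lists.
batch = {"instruction": [row["instruction"] for row in rows]}
batched = format_example(batch)

print(per_example[0])
print(batched)  # the list's repr is embedded in the text, merging both examples
```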

In [0]:
# TODO
# `tokenize` formats one example at a time, so map it without `batched=True`
tokenized_dataset = ds.map(tokenize, remove_columns=remove_columns)

In [0]:
print(tokenized_dataset)

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 15011
    })
})


In [0]:
print(tokenized_dataset["train"]["input_ids"])

In [0]:
# Test your answer. DO NOT MODIFY THIS CELL.

dbTestQuestion4_4(tokenized_dataset)

### Question 5: Setup Training

To set up the fine-tuning process, we need to define the `TrainingArguments`.

Let's configure the training to have **10** training epochs (`num_train_epochs`) with a per device batch size of **8**. The optimizer (`optim`) to be used should be `adamw_torch`. Finally, the reporting (`report_to`) list should be set to *tensorboard*.

In [0]:
# TODO
checkpoint_name = "test-trainer-lab"
local_checkpoint_path = os.path.join(local_training_root, checkpoint_name)
training_args = TrainingArguments(
  local_checkpoint_path,
  num_train_epochs=10,
  per_device_train_batch_size=8,
  optim="adamw_torch",
  report_to=["tensorboard"],
)
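
To get a feel for what these settings imply, the number of optimizer steps can be estimated as ceil(examples / batch size) × epochs. The sketch below assumes a single GPU, no gradient accumulation, and 6,000 training examples (the training split size used later in this lab); the variable names are illustrative.

```python
import math

num_train_examples = 6000  # assumption: size of the training split created later in this lab
per_device_train_batch_size = 8
num_train_epochs = 10
num_devices = 1  # assumption: single GPU, no gradient accumulation

# One optimizer step consumes one batch per device.
steps_per_epoch = math.ceil(
    num_train_examples / (per_device_train_batch_size * num_devices)
)
total_steps = steps_per_epoch * num_train_epochs
print(steps_per_epoch, total_steps)
```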

In [0]:
checkpoint_name = "test-trainer-lab"

In [0]:
# Test your answer. DO NOT MODIFY THIS CELL.

dbTestQuestion4_5(training_args)

PASSED: All tests passed for lesson4, question5
RESULTS RECORDED: Click `Submit` when all questions are completed to log the results.


### Question 6: AutoModelForCausalLM

The pre-trained `pythia-70m-deduped` model can be loaded using the [AutoModelForCausalLM](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForCausalLM) class.

In [0]:
# TODO
# load the pre-trained model
model = AutoModelForCausalLM.from_pretrained(
  model_checkpoint , cache_dir=DA.paths.datasets
)


In [0]:
# Test your answer. DO NOT MODIFY THIS CELL.

dbTestQuestion4_6(model)

PASSED: All tests passed for lesson4, question6
RESULTS RECORDED: Click `Submit` when all questions are completed to log the results.


### Question 7: Initialize the Trainer

Unlike the IMDB dataset used in the earlier Notebook, the Dolly dataset only contains a single *train* dataset. Let's go ahead and create a [`train_test_split`](https://huggingface.co/docs/datasets/v2.12.0/en/package_reference/main_classes#datasets.Dataset.train_test_split) of the train dataset.
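
The idea behind a seeded split can be sketched in plain Python: shuffle the row indices with a fixed seed, then cut at the requested training size. This is only an illustration of the concept, not the algorithm `datasets` uses internally, and `split_indices` is a hypothetical helper.

```python
import random


def split_indices(num_rows: int, train_size: int, seed: int):
    """Hypothetical helper: shuffle row indices with a fixed seed, then cut at train_size."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    return indices[:train_size], indices[train_size:]


train_idx, test_idx = split_indices(num_rows=10, train_size=6, seed=42)
train_idx_again, _ = split_indices(num_rows=10, train_size=6, seed=42)

print(len(train_idx), len(test_idx))
print(train_idx == train_idx_again)  # same seed, same split
```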

Also, let's initialize the [`Trainer`](https://huggingface.co/docs/transformers/main_classes/trainer) with model, training arguments, the train & test datasets, tokenizer, and data collator. Here we will use the [`DataCollatorForLanguageModeling`](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DataCollatorForLanguageModeling).

In [0]:
# TODO
# used to assist the trainer in batching the data
TRAINING_SIZE=6000
SEED=42
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False, return_tensors="pt", pad_to_multiple_of=8
)
split_dataset = tokenized_dataset["train"].train_test_split(train_size=TRAINING_SIZE,seed=SEED)
trainer = Trainer(
  model,
  training_args,
  train_dataset=split_dataset["train"],
  eval_dataset=split_dataset["test"],
  tokenizer=tokenizer,
  data_collator=data_collator,
)
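
With `mlm=False`, `DataCollatorForLanguageModeling` pads each batch to a common length and copies `input_ids` into `labels`, replacing pad token ids with `-100` so the loss ignores them. The sketch below imitates that behavior with plain lists; `collate_causal_lm` and the pad id `0` are hypothetical, and the real collator returns tensors rather than lists.

```python
def collate_causal_lm(examples, pad_id=0, pad_to_multiple_of=8):
    """Hypothetical re-implementation of causal-LM collation using plain lists."""
    longest = max(len(ids) for ids in examples)
    # Round the batch length up to the next multiple of pad_to_multiple_of.
    target = -(-longest // pad_to_multiple_of) * pad_to_multiple_of
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ids in examples:
        padding = [pad_id] * (target - len(ids))
        padded = ids + padding  # right-side padding
        batch["input_ids"].append(padded)
        batch["attention_mask"].append([1] * len(ids) + [0] * len(padding))
        # Labels mirror the inputs; pad token ids become -100 and are ignored by the loss.
        batch["labels"].append([-100 if t == pad_id else t for t in padded])
    return batch


demo_batch = collate_causal_lm([[5, 6, 7], [8, 9, 10, 11, 12]])
print(demo_batch["input_ids"][0])
print(demo_batch["labels"][0])
```

Because this lab sets `pad_token` to the EOS token, genuine EOS tokens are masked out of the labels in the same way as padding.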

In [0]:
print(split_dataset)

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 6000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 9011
    })
})


In [0]:
# Test your answer. DO NOT MODIFY THIS CELL.

dbTestQuestion4_7(trainer)

PASSED: All tests passed for lesson4, question7
RESULTS RECORDED: Click `Submit` when all questions are completed to log the results.


### Question 8: Train

Before starting the training process, let's turn on TensorBoard. This will allow us to monitor the training process as checkpoint logs are created.

In [0]:
tensorboard_display_dir = f"{local_checkpoint_path}/runs"

In [0]:
%load_ext tensorboard
%tensorboard --logdir '{tensorboard_display_dir}'

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard
Your log directory might be ephemeral to the cluster, which will be deleted after cluster termination or restart. You can choose a log directory under `/dbfs/` to persist your logs in DBFS.
Tensorboard may not be displayed in the notebook cell output when 'Third-party iFraming prevention' is disabled. You can still use Tensorboard by clicking the link below to open Tensorboard in a new tab. To enable Tensorboard in notebook cell output, please ask your workspace admin to enable 'Third-party iFraming prevention'.


Reusing TensorBoard on port 6006 (pid 3035), started 0:00:17 ago. (Use '!kill 3035' to kill it.)

Start the fine-tuning process!

In [0]:
# TODO
# invoke training - note this will take approx. 30min
trainer.train()

# save model to the local checkpoint
trainer.save_model()
trainer.save_state()

In [0]:
# Test your answer. DO NOT MODIFY THIS CELL.

dbTestQuestion4_8(trainer)

In [0]:
# persist the fine-tuned model to DBFS
final_model_path = f"{DA.paths.working_dir}/llm04_fine_tuning/{checkpoint_name}"
trainer.save_model(output_dir=final_model_path)

In [0]:
import gc
import torch

gc.collect()
torch.cuda.empty_cache()

In [0]:
fine_tuned_model = AutoModelForCausalLM.from_pretrained(final_model_path)

Recall that the model was trained using a body of text that contained an instruction and its response. A similar body of text, or prompt, needs to be provided when testing the model. The prompt that is provided only contains an instruction though. The model will `generate` the response accordingly.

In [0]:
def to_prompt(instr: str, max_length: int = 1024) -> dict:
    text = f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instr}

### Response:
"""
    return tokenizer(text, return_tensors="pt", max_length=max_length, truncation=True)


def to_response(prediction):
    decoded = tokenizer.decode(prediction)
    # extract the Response from the decoded sequence
    m = re.search(r"#+\s*Response:\s*(.+?)#+\s*End", decoded, flags=re.DOTALL)
    res = "Failed to find response"
    if m:
        res = m.group(1).strip()
    else:
        m = re.search(r"#+\s*Response:\s*(.+)", decoded, flags=re.DOTALL)
        if m:
            res = m.group(1).strip()
    return res
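
The extraction logic in `to_response` can be tried without a model. The sketch below applies the same two regular expressions to made-up decoded strings; `extract_response` is a hypothetical copy that takes an already-decoded string instead of token ids.

```python
import re


def extract_response(decoded: str) -> str:
    """Same extraction logic as `to_response`, applied to an already-decoded string."""
    m = re.search(r"#+\s*Response:\s*(.+?)#+\s*End", decoded, flags=re.DOTALL)
    if m:
        return m.group(1).strip()
    # Fall back to everything after the response marker when "### End" is missing.
    m = re.search(r"#+\s*Response:\s*(.+)", decoded, flags=re.DOTALL)
    return m.group(1).strip() if m else "Failed to find response"


# Made-up decoded outputs for illustration only.
complete = "### Instruction:\nSay hi\n\n### Response:\nHello there!\n\n### End"
cut_off = "### Instruction:\nSay hi\n\n### Response:\nHello the"

print(extract_response(complete))
print(extract_response(cut_off))
print(extract_response("no markers here"))
```

The fallback pattern matters in practice: generation stops after `max_new_tokens`, so the `### End` marker may never be produced.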

In [0]:
import re
# NOTE: this cell can take up to 5mins
res = []
for i in range(100):
    instr = ds["train"][i]["instruction"]
    resp = ds["train"][i]["response"]
    inputs = to_prompt(instr)
    pred = fine_tuned_model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        pad_token_id=tokenizer.pad_token_id,
        max_new_tokens=128,
    )
    res.append((instr, resp, to_response(pred[0])))

In [0]:
pdf = pd.DataFrame(res, columns=["instruction", "response", "generated"])
display(pdf)

**CONGRATULATIONS**

You have just taken the first step toward fine-tuning your own slimmed down version of [Dolly](https://github.com/databrickslabs/dolly)! 

Unfortunately, its responses are not very useful at the moment. With additional training and data, the model could become more capable.

### Question 9: Evaluation

Although the current model is under-trained, it is worth evaluating the responses to get a general sense of how far off the model is at this point.

Let's compute the ROUGE metrics between the reference responses and the generated responses.

In [0]:
nltk.download("punkt")

rouge_score = evaluate.load("rouge")


def compute_rouge_score(generated, reference):
    """
    Compute ROUGE scores on a batch of articles.

    This is a convenience function wrapping Hugging Face `rouge_score`,
    which expects sentences to be separated by newlines.

    :param generated: Summaries (list of strings) produced by the model
    :param reference: Ground-truth summaries (list of strings) for comparison
    """
    generated_with_newlines = ["\n".join(sent_tokenize(s.strip())) for s in generated]
    reference_with_newlines = ["\n".join(sent_tokenize(s.strip())) for s in reference]
    return rouge_score.compute(
        predictions=generated_with_newlines,
        references=reference_with_newlines,
        use_stemmer=True,
    )
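
For intuition about what ROUGE measures, here is a deliberately simplified ROUGE-1 F1 computed from unigram overlap. It lowercases and splits on whitespace with no stemming, so its values will generally differ from those of the `rouge_score` package; `rouge1_f1` is a hypothetical helper, not part of any library.

```python
from collections import Counter


def rouge1_f1(generated: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap on lowercased whitespace tokens."""
    gen_counts = Counter(generated.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each token counts at most as often as it appears in both texts.
    overlap = sum((gen_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


print(rouge1_f1("the cat sat", "the cat sat on the mat"))
```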

In [0]:
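# TODO
# compare the generated responses against the reference responses collected in `pdf`
rouge_scores = compute_rouge_score(pdf["generated"], pdf["response"])
display(rouge_scores)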

In [0]:
# Test your answer. DO NOT MODIFY THIS CELL.

dbTestQuestion4_9(rouge_scores)

## Clean up Classroom

Run the following cell to remove lesson-specific assets created during this lesson.

In [0]:
tmpdir.cleanup()

## Submit your Results (edX Verified Only)

To get credit for this lab, click the submit button in the top right to report the results. If you run into any issues, click `Run` -> `Clear state and run all`, and make sure all tests have passed before re-submitting. If you accidentally deleted any tests, take a look at the notebook's version history to recover them or reload the notebooks.

&copy; 2023 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>