## End-to-End Workflow with torchtune

### Overview

Fine-tuning an LLM is usually only one step in a larger workflow. A typical workflow might look like this:

- Download a popular model from the HF Hub.
- Fine-tune the model using a relevant fine-tuning technique. The exact technique will depend on factors such as the model, the amount and nature of the training data, your hardware setup, and the end task for which the model will be used.
- Evaluate the model on some benchmarks to validate model quality.
- Run some generations to make sure the model output looks reasonable.
- Quantize the model for efficient inference.
- [Optional] Export the model for specific environments, such as inference on a mobile phone.

In this tutorial, we’ll cover how you can use torchtune for all of the above, leveraging integrations with popular tools and libraries from the ecosystem.

### Download Gemma 2B

In [2]:
!tune download google/gemma-2-2b \
  --output-dir /tmp/google/gemma-2-2b \
  --hf-token $HF_TOKEN

Ignoring files matching the following patterns: *.safetensors
Fetching 9 files: 100%|████████████████████████| 9/9 [00:00<00:00, 14304.18it/s]
Successfully downloaded model repo and wrote to the following locations:
/tmp/google/gemma-2-2b/.cache
/tmp/google/gemma-2-2b/README.md
/tmp/google/gemma-2-2b/.gitattributes
/tmp/google/gemma-2-2b/config.json
/tmp/google/gemma-2-2b/model.safetensors.index.json
/tmp/google/gemma-2-2b/special_tokens_map.json
/tmp/google/gemma-2-2b/tokenizer_config.json
/tmp/google/gemma-2-2b/generation_config.json
/tmp/google/gemma-2-2b/tokenizer.model
/tmp/google/gemma-2-2b/tokenizer.json


### Fine-tune the model using LoRA

For this tutorial, we’ll fine-tune the model using LoRA. LoRA is a parameter-efficient fine-tuning technique that is especially helpful when you don’t have a lot of GPU memory to work with. LoRA freezes the base LLM and adds a very small percentage of learnable parameters, which keeps the memory associated with gradients and optimizer state low. For reference, using torchtune you should be able to fine-tune a Llama2 7B model with LoRA in less than 16GB of GPU memory using bfloat16 on an RTX 3090/4090.
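To make "a very small percentage of learnable parameters" concrete, here is a minimal sketch that compares the size of a rank-`r` LoRA adapter with the frozen weight it is attached to. The function name and the layer sizes are hypothetical and chosen only for illustration; they are not read from the Gemma config or from torchtune.

```python
def lora_trainable_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Ratio of LoRA adapter parameters to the frozen weight's parameters."""
    frozen = d_in * d_out            # base weight W (d_out x d_in), kept frozen
    adapter = rank * (d_in + d_out)  # A is rank x d_in, B is d_out x rank
    return adapter / frozen


# Illustrative sizes only (assumption, not taken from any real model config).
fraction = lora_trainable_fraction(d_in=2048, d_out=2048, rank=8)
print(f"adapter size relative to the frozen weight: {fraction:.2%}")
```

Because only the adapter receives gradients, the optimizer state scales with the adapter size rather than with the full weight matrix.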

We’ll fine-tune using one of torchtune’s LoRA recipes with the standard settings from its default config.

This will fine-tune our model with `batch_size=2` and `dtype=bfloat16`. With these settings the model should have a peak memory usage of ~16GB and a total training time of around two hours per epoch. We’ll need to override a few config values so that the recipe can find the right checkpoints.

Let’s look for the right config for this use case by using the tune CLI.

In [3]:
!tune ls

RECIPE                                   CONFIG                                  
full_finetune_single_device              llama2/7B_full_low_memory               
                                         code_llama2/7B_full_low_memory          
                                         llama3/8B_full_single_device            
                                         llama3_1/8B_full_single_device          
                                         mistral/7B_full_low_memory              
                                         phi3/mini_full_low_memory               
full_finetune_distributed                llama2/7B_full                          
                                         llama2/13B_full                         
                                         llama3/8B_full                          
                                         llama3_1/8B_full                        
                                         llama3/70B_full                         
                

For this tutorial we’ll use the `gemma/2B_lora` config with the `lora_finetune_distributed` recipe.

In [None]:
!tune run lora_finetune_distributed \
--config gemma/2B_lora \
checkpointer.checkpoint_dir=/tmp/google/gemma-2-2b \
tokenizer.path=/tmp/google/gemma-2-2b/tokenizer.model \
checkpointer.output_dir=/tmp/google/gemma-2-2b

The final trained weights are merged with the original model and split across two checkpoint files similar to the source checkpoints from the HF Hub. In fact the keys will be identical between these checkpoints. We also have a third checkpoint file which is much smaller in size and contains the learnt LoRA adapter weights. For this tutorial, we’ll only use the model checkpoints and not the adapter weights.
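The merge step can be sketched as follows, assuming the standard LoRA formulation in which the merged weight is `W + (alpha / rank) * B @ A`. This is a toy illustration on nested lists with made-up values, not torchtune's checkpointing code; `merge_lora` is a hypothetical helper.

```python
def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A) using plain nested lists."""
    scale = alpha / rank
    merged = []
    for i, row in enumerate(weight):
        merged_row = []
        for j, w in enumerate(row):
            # (B @ A)[i][j] = sum over the rank dimension
            delta = sum(lora_b[i][k] * lora_a[k][j] for k in range(rank))
            merged_row.append(w + scale * delta)
        merged.append(merged_row)
    return merged


# Toy 2x2 weight with a rank-1 adapter (values are made up).
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[0.5, -0.5]]    # rank x d_in
lora_b = [[2.0], [4.0]]   # d_out x rank
merged = merge_lora(weight, lora_a, lora_b, alpha=2.0, rank=1)
print(merged)
```

The merged matrix has the same shape as the original weight, which is why the merged checkpoint can keep the same keys as the source checkpoint.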

### Run Evaluation using EleutherAI’s Eval Harness

We’ve fine-tuned a model, but how well does it really do? Let’s run some evaluations!

torchtune integrates with [EleutherAI’s evaluation harness](https://github.com/EleutherAI/lm-evaluation-harness). An example of this is available through the `eleuther_eval` recipe. In this tutorial, we’re going to directly use this recipe by modifying its associated config `eleuther_evaluation.yaml`.

For this section of the tutorial, first install the EleutherAI evaluation harness. The version specifier is quoted so that the shell does not try to expand the `*`.

In [None]:
!pip install "lm_eval==0.4.*"

Since we plan to update all of the checkpoint files to point to our fine-tuned checkpoints, **let’s first copy over the config to our local working directory so we can make changes**. This will be easier than overriding all of these elements through the CLI.

In [None]:
!tune cp eleuther_evaluation ./custom_eval_config.yaml

For this tutorial we’ll use the `truthfulqa_mc2` task from the harness (if you are following the Deep NLP course, use the task specified there instead). This task measures a model’s propensity to be truthful when answering questions: each question is followed by one or more true responses and one or more false responses, and the score reflects how strongly the model, evaluated zero-shot, favors the true ones. Let’s first run a baseline without fine-tuning.
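As a rough illustration of what an MC2-style score measures, the sketch below computes the share of probability mass that a model assigns to the true responses for a single question. The function name and the log-likelihood values are hypothetical, and the harness's own implementation may differ in detail.

```python
import math


def mc2_score(true_logprobs, false_logprobs):
    """Share of probability mass assigned to the true answers for one question."""
    p_true = [math.exp(lp) for lp in true_logprobs]
    p_false = [math.exp(lp) for lp in false_logprobs]
    return sum(p_true) / (sum(p_true) + sum(p_false))


# Made-up log-likelihoods for two true and two false responses.
score = mc2_score(
    true_logprobs=[math.log(0.3), math.log(0.1)],
    false_logprobs=[math.log(0.2), math.log(0.4)],
)
print(f"per-question score: {score:.2f}")
```

The reported task accuracy would then be the average of this per-question score over the whole dataset.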

In [None]:
!tune run eleuther_eval --config ./custom_eval_config.yaml \
checkpointer.checkpoint_dir=/tmp/google/gemma-2-2b \
tokenizer.path=/tmp/google/gemma-2-2b/tokenizer.model

The model has an accuracy around 38.8%. Let’s compare this with the fine-tuned model.

First, we modify `custom_eval_config.yaml` to include the fine-tuned checkpoints.

```yaml
checkpointer:
    _component_: torchtune.utils.FullModelHFCheckpointer

    # directory with the checkpoint files
    # this should match the output_dir specified during
    # finetuning
    checkpoint_dir: <checkpoint_dir>

    # checkpoint files for the fine-tuned model. This should
    # match what's shown in the logs above
    checkpoint_files: [
        hf_model_0001_0.pt,
        hf_model_0002_0.pt,
    ]

    output_dir: <checkpoint_dir>
    model_type: GEMMA

# Make sure to update the tokenizer path to the right
# checkpoint directory as well
tokenizer:
    _component_: torchtune.models.gemma.gemma_tokenizer
    path: <checkpoint_dir>/tokenizer.model
```
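Before launching the recipe, it can help to confirm that the files listed under `checkpoint_files` actually exist in `checkpoint_dir`. The helper below is a hypothetical convenience sketch, not part of torchtune, and the directory shown is the one assumed earlier in this tutorial; substitute your own `<checkpoint_dir>`.

```python
from pathlib import Path


def missing_checkpoint_files(checkpoint_dir, checkpoint_files):
    """Return the listed file names that are not present in checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in checkpoint_files if not (root / name).is_file()]


# Directory and file names are assumptions; match them to your own config.
missing = missing_checkpoint_files(
    "/tmp/google/gemma-2-2b",
    ["hf_model_0001_0.pt", "hf_model_0002_0.pt"],
)
print("missing files:", missing)
```

An empty list means every file named in the config was found.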

Now, let’s run the recipe.

In [None]:
!tune run eleuther_eval --config ./custom_eval_config.yaml

Our fine-tuned model gets ~48% on this task, which is ~10 points better than the baseline. Great! Seems like our fine-tuning helped.

### Generation

We’ve run some evaluations and the model seems to be doing well. But does it really generate meaningful text for the prompts you care about? Let’s find out!

For this, we’ll use the [generate recipe](https://github.com/pytorch/torchtune/blob/main/recipes/generate.py) and the associated [config](https://github.com/pytorch/torchtune/blob/main/recipes/configs/generation.yaml).

Let’s first copy over the config to our local working directory so we can make changes.

In [None]:
!tune cp generation ./custom_generation_config.yaml

Let’s modify `custom_generation_config.yaml` to include the following changes.

```yaml
checkpointer:
    _component_: torchtune.utils.FullModelHFCheckpointer

    # directory with the checkpoint files
    # this should match the output_dir specified during
    # finetuning
    checkpoint_dir: <checkpoint_dir>

    # checkpoint files for the fine-tuned model. This should
    # match what's shown in the logs above
    checkpoint_files: [
        hf_model_0001_0.pt,
        hf_model_0002_0.pt,
    ]

    output_dir: /tmp/google/gemma-2-2b
    model_type: GEMMA

# Make sure to update the tokenizer path to the right
# checkpoint directory as well
tokenizer:
    _component_: torchtune.models.gemma.gemma_tokenizer
    path: <checkpoint_dir>/tokenizer.model
```

Once the config is updated, let’s kick off generation! We’ll use the default sampling settings, `top_k=300` and `temperature=0.8`. These parameters control how the sampling probabilities are computed: `top_k` restricts sampling to the 300 most likely next tokens, and `temperature` rescales the logits before the softmax, so values below 1 sharpen the distribution. These are standard settings for Llama2 7B, and we recommend inspecting the model’s output with them before changing either parameter.
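To show how these two parameters interact, here is a minimal sketch of top-k sampling with temperature, written with the standard library only. It illustrates the general technique rather than the generate recipe's actual code; the function name and the toy logits are hypothetical.

```python
import math
import random


def sample_top_k(logits, top_k, temperature, rng):
    """Sample one token index from the top_k highest logits after temperature scaling."""
    # Keep only the indices of the top_k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Divide by the temperature, then apply a numerically stable softmax.
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(0)                 # fixed seed for repeatability
logits = [2.0, 1.0, 0.5, -1.0, -3.0]   # toy vocabulary of five tokens
samples = [sample_top_k(logits, top_k=2, temperature=0.8, rng=rng) for _ in range(1000)]
print({token: samples.count(token) for token in sorted(set(samples))})
```

With `top_k=2`, only the two highest-scoring tokens can ever be drawn, and lowering the temperature shifts more of the probability toward the highest-scoring one.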

We’ll use a different prompt from the one in the config.

In [None]:
!tune run generate --config ./custom_generation_config.yaml \
prompt="What are some interesting sites to visit in the Bay Area?"

Indeed, the bridge is pretty cool! Seems like our LLM knows a little something about the Bay Area!

### Speeding up Generation using Quantization

We saw that the generation recipe took around 11.6 seconds to generate 300 tokens. One technique commonly used to speed up inference is quantization. torchtune provides an integration with the [TorchAO](https://github.com/pytorch-labs/ao) quantization APIs. Let’s first quantize the model using 4-bit weights-only quantization and see if this improves generation speed.

For this, we’ll use the [quantization recipe](https://github.com/pytorch/torchtune/blob/main/recipes/quantize.py).

Let’s first copy over the config to our local working directory so we can make changes.

In [None]:
!tune cp quantization ./custom_quantization_config.yaml

Let’s modify `custom_quantization_config.yaml` to include the following changes.

```yml
checkpointer:
    _component_: torchtune.utils.FullModelHFCheckpointer

    # directory with the checkpoint files
    # this should match the output_dir specified during
    # finetuning
    checkpoint_dir: /tmp/google/gemma-2-2b

    # checkpoint files for the fine-tuned model. This should
    # match what's shown in the logs above
    checkpoint_files: [
        hf_model_0001_0.pt,
        hf_model_0002_0.pt,
    ]

    output_dir: /tmp/google/gemma-2-2b
    model_type: GEMMA
```

Once the config is updated, let’s kick off quantization! We’ll use the default quantization method from the config.

In [None]:
!tune run quantize --config ./custom_quantization_config.yaml

> Unlike the fine-tuned checkpoints, this outputs a single checkpoint file. This is because our quantization APIs currently don’t support any conversion across formats. As a result you won’t be able to use these quantized models outside of torchtune. But you should be able to use these with the generation and evaluation recipes within torchtune. These results will help inform which quantization methods you should use with your favorite inference engine.
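For intuition about what group-wise 4-bit weight-only quantization does, the sketch below quantizes a flat list of weights in groups, storing a 4-bit code per weight plus a scale and offset per group. It is a simplified illustration under stated assumptions and does not reproduce the TorchAO kernels or their storage format; all function names are hypothetical.

```python
def quantize_group_int4(values):
    """Quantize one group to 4-bit codes (0..15); returns (codes, scale, offset)."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0   # 16 levels; fall back to 1.0 for a constant group
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, offset):
    """Map 4-bit codes back to approximate floating-point weights."""
    return [offset + c * scale for c in codes]


def quantize_weights(weights, groupsize):
    """Split the weights into consecutive groups and quantize each independently."""
    return [
        quantize_group_int4(weights[i:i + groupsize])
        for i in range(0, len(weights), groupsize)
    ]


weights = [0.0, 0.1, -0.2, 0.3, 1.0, -1.0, 0.5, 0.25]   # made-up values
groups = quantize_weights(weights, groupsize=4)
restored = [v for group in groups for v in dequantize_group(*group)]
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max reconstruction error: {max_error:.4f}")
```

Smaller groups give each scale a narrower range to cover, which reduces the reconstruction error at the cost of storing more scales and offsets.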

Now that we have the quantized model, let’s re-run generation.

Modify `custom_generation_config.yaml` to include the following changes.

```yml
checkpointer:
    # we need to use the custom torchtune checkpointer
    # instead of the HF checkpointer for loading
    # quantized models
    _component_: torchtune.utils.FullModelTorchTuneCheckpointer

    # directory with the checkpoint files
    # this should match the output_dir specified during
    # finetuning
    checkpoint_dir: <checkpoint_dir>

    # checkpoint files point to the quantized model
    checkpoint_files: [
        hf_model_0001_0-4w.pt,
    ]

    output_dir: <checkpoint_dir>
    model_type: GEMMA

# we also need to update the quantizer to what was used during
# quantization
quantizer:
    _component_: torchtune.utils.quantization.Int4WeightOnlyQuantizer
    groupsize: 256
```

Once the config is updated, let’s kick off generation! We’ll use the same sampling parameters as before, along with the same prompt we used with the unquantized model.

In [None]:
!tune run generate --config ./custom_generation_config.yaml \
prompt="What are some interesting sites to visit in the Bay Area?"