# 🎛️ Fine-tuning Guide 
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/togethercomputer/together-cookbook/blob/main/Finetuning/Finetuning_Guide.ipynb)

Large Language Models (LLMs) offer powerful general capabilities, but often require **fine-tuning** to excel at specific tasks or understand domain-specific language. Fine-tuning adapts a pretrained model using a smaller, targeted dataset, improving its performance on your specific task.

This notebook provides a step-by-step guide to fine-tuning models using the Together AI platform. We will walk through the entire process, from preparing your data to evaluating your fine-tuned model.

We will cover:

1.  **Dataset Preparation:** Loading a standard dataset, transforming it into the required format for supervised fine-tuning on Together AI, and uploading your formatted dataset to Together AI Files.
2.  **Fine-tuning Job Launch:** Configuring and initiating a fine-tuning job using the Together AI API.
3.  **Job Monitoring:** Checking the status and progress of your fine-tuning job.
4.  **Inference:** Using your newly finetuned model via the Together AI API for predictions.
5.  **Evaluation:** Comparing the performance of the finetuned model against the base model on a test set.

By following this guide, you'll gain practical experience in creating specialized LLMs tailored to your specific requirements using Together AI.

## Setup and Installation
---
First, install the necessary Python libraries. We need:
- `together`: The official Together AI Python client for interacting with the API (fine-tuning, inference, files, etc.).
- `datasets`: A library from Hugging Face for easily downloading and manipulating datasets.
- `transformers`: We won't be training locally, but we use its SQuAD metric utilities (exact match and F1) during evaluation.
- `tqdm`: To enable interactive elements like progress bars within the notebook.

In [1]:
!pip install -qU together datasets transformers tqdm

## 1. Dataset Preparation
---
Fine-tuning requires data formatted in a specific way. We'll use a conversational dataset as an example; here the goal of fine-tuning is to improve the model on multi-turn conversations.

First we need to transform this dataset into the chat format expected by Together AI for supervised fine-tuning.

The required format is a JSON object per line, where each object contains a list of conversation turns under the `"messages"` key.

Each message must have a `"role"` (`system`, `user`, or `assistant`) and `"content"`.

Conversation Data Example:
```json
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, 
              {"role": "user", "content": "Hello!"}, 
              {"role": "assistant", "content": "Hi! How can I help you?"}]}
```
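
A quick local sanity check can catch malformed records before upload. The sketch below is a hypothetical helper (`validate_chat_record` is not part of the Together SDK) that checks only the rules stated in this guide: a non-empty `messages` list, an allowed `role` and a string `content` on each message, and a system message only in the first position.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}

def validate_chat_record(line: str) -> list:
    """Return a list of problems found in one JSONL line (empty list means OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ['missing or empty "messages" list']

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        role = msg.get("role")
        if role not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {role!r}")
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i}: content must be a string")
        # A system message, if present, belongs at the very beginning
        if role == "system" and i != 0:
            problems.append(f"message {i}: system message is not first")
    return problems

good = '{"messages": [{"role": "user", "content": "Hello!"}, {"role": "assistant", "content": "Hi!"}]}'
bad = '{"messages": [{"role": "bot", "content": "Hello!"}]}'
print(validate_chat_record(good))  # []
print(validate_chat_record(bad))
```

This complements, rather than replaces, the platform's own file check used later in the notebook.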

🔗Depending on what type of fine-tuning you want to perform you can also pass in [instruction data](https://docs.together.ai/docs/fine-tuning-data-preparation#instruction-data), [preference data](https://docs.together.ai/docs/fine-tuning-data-preparation#preference-data) or even [simple text data](https://docs.together.ai/docs/fine-tuning-data-preparation#generic-text-data).

### Load Raw Dataset
We use the `datasets` library to download the CoQA dataset from the Hugging Face Hub.

Let's examine the structure of the raw dataset. CoQA provides a story, a series of questions related to the story, and corresponding answers.

In [2]:
from datasets import load_dataset

coqa_dataset = load_dataset("stanfordnlp/coqa")

In [3]:
coqa_dataset["train"].to_pandas().head()

Unnamed: 0,source,story,questions,answers
0,wikipedia,"The Vatican Apostolic Library (), more commonl...","[When was the Vat formally opened?, what is th...",{'input_text': ['It was formally established i...
1,cnn,New York (CNN) -- More than 80 Michael Jackson...,"[Where was the Auction held?, How much did the...","{'input_text': ['Hard Rock Cafe', '$2 million...."
2,gutenberg,"CHAPTER VII. THE DAUGHTER OF WITHERSTEEN \n\n""...","[What did Venters call Lassiter?, Who asked La...","{'input_text': ['gun-man', 'Jane', 'Yes', 'to ..."
3,cnn,(CNN) -- The longest-running holiday special s...,"[Who is Rudolph's father?, Why does Rudolph ru...","{'input_text': ['Donner', 'he felt like an out..."
4,gutenberg,CHAPTER XXIV. THE INTERRUPTED MASS \n\nThe mor...,"[Who arrived at the church?, Who was followed ...","{'input_text': ['the garrison first', 'Fra. Do..."


### Transform Data to Chat Format

Now, we need to convert each row of the CoQA dataset into the required chat format (`[{'role': ..., 'content': ...}, ...]`).

We'll create a function `map_fields` that takes a row from the dataset and structures it as a conversation:
1.  A `system` message containing the story (context).
2.  Alternating `user` (question) and `assistant` (answer) messages.


In [4]:
# the system prompt, if present, must always be at the beginning
system_prompt = "Read the story and extract answers for the questions.\nStory: {}"

def map_fields(row):
    """    
    Maps the fields from a row of data to a structured format for conversation.
    Args:
        row (dict): A dictionary containing the keys "story", "questions", and "answers".
            - "story" (str): The story content to be used in the system prompt.
            - "questions" (list of str): A list of questions from the user.
            - "answers" (dict): A dictionary containing the key "input_text" which is a list of answers from the assistant.
    Returns:
        dict: A dictionary with a single key "messages" which is a list of message dictionaries.
            Each message dictionary contains:
            - "role" (str): The role of the message sender, either "system", "user", or "assistant".
            - "content" (str): The content of the message.    
    """
    # create system prompt
    messages = [{"role": "system", "content": system_prompt.format(row["story"])}]
    
    # add user and assistant messages
    for q, a in zip(row["questions"], row["answers"]["input_text"]):
        messages.append({"role": "user", "content": q})
        messages.append({"role": "assistant", "content": a})
    
    return {"messages": messages}
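
To see what the mapping produces, here is a minimal sketch that applies the same logic to a made-up row. The toy story, questions, and answers are hypothetical; only the field names mirror CoQA. The function is repeated so the snippet runs on its own.

```python
system_prompt = "Read the story and extract answers for the questions.\nStory: {}"

def map_fields(row):
    """Same mapping as in the notebook cell, repeated so this snippet is self-contained."""
    messages = [{"role": "system", "content": system_prompt.format(row["story"])}]
    for q, a in zip(row["questions"], row["answers"]["input_text"]):
        messages.append({"role": "user", "content": q})
        messages.append({"role": "assistant", "content": a})
    return {"messages": messages}

# Hypothetical toy row with the same fields as a CoQA example
toy_row = {
    "story": "Ana has a red kite.",
    "questions": ["Who has a kite?", "What color is it?"],
    "answers": {"input_text": ["Ana", "red"]},
}

chat = map_fields(toy_row)
roles = [m["role"] for m in chat["messages"]]
print(roles)  # ['system', 'user', 'assistant', 'user', 'assistant']
```

Each question/answer pair becomes one `user`/`assistant` turn, so a story with N questions yields 2N + 1 messages.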

We apply this transformation function to the entire dataset using the `.map()` method. We also remove the original columns as they are no longer needed after transformation.

In [5]:
# transform the data using the mapping function

train_messages = coqa_dataset["train"].map(map_fields, remove_columns=coqa_dataset["train"].column_names)

Let's check the structure of our transformed dataset. It should now contain only the `messages` column.

Here's a summary of the processed dataset:

In [6]:
train_messages

Dataset({
    features: ['messages'],
    num_rows: 7199
})

Write the dataset out to a JSON Lines (`.jsonl`) file:

In [7]:
train_messages.to_json("coqa_prepared_train.jsonl")

23777505
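
Before uploading, it can help to read a record back from disk and confirm it parses. The sketch below uses a hypothetical `peek_jsonl` helper and a throwaway demo file so it runs anywhere; in the notebook you would pass `"coqa_prepared_train.jsonl"` instead.

```python
import json
import os
import tempfile

def peek_jsonl(path, n=1):
    """Parse and return the first n non-empty records of a JSON Lines file."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if len(records) >= n:
                break
            if line.strip():
                records.append(json.loads(line))
    return records

# Demo on a temporary file so the snippet is self-contained
sample = {"messages": [{"role": "user", "content": "Hello!"},
                       {"role": "assistant", "content": "Hi!"}]}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write(json.dumps(sample) + "\n")
        f.write(json.dumps(sample) + "\n")
    first = peek_jsonl(demo_path, n=1)

print(len(first), first[0]["messages"][0]["role"])  # 1 user
```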

### Upload Data to Together AI

Now that we have our formatted `coqa_prepared_train.jsonl` file, we check that it meets the format specification and then upload it to Together AI. Fine-tuning jobs read data directly from your uploaded files.

We use the `check_file` function to validate the file and the `client.files.upload()` method to upload it. The upload returns information about the file, including its ID, which we'll need later to start the fine-tuning job.

In [8]:
## Setup Together AI client
from together import Together
import os
import json

TOGETHER_API_KEY = os.getenv("TOGETHER_API_KEY")
WANDB_API_KEY = os.getenv("WANDB_API_KEY") # needed for logging fine-tuning to wandb


client = Together(api_key=TOGETHER_API_KEY)

In [9]:
## We're going to check to see that the file is in the right format before we finetune
from together.utils import check_file

sft_report = check_file("coqa_prepared_train.jsonl")
print(json.dumps(sft_report, indent=2))

assert sft_report["is_check_passed"]

{
  "is_check_passed": true,
  "message": "Checks passed",
  "found": true,
  "file_size": 23777505,
  "utf8": true,
  "line_type": true,
  "text_field": true,
  "key_value": true,
  "has_min_samples": true,
  "num_samples": 7199,
  "load_json": true,
  "filetype": "jsonl"
}


In [10]:
## Upload the data to Together

train_file_resp = client.files.upload("coqa_prepared_train.jsonl", check=True)
print(f"Train file response: {train_file_resp.id}")

Uploading file coqa_prepared_train.jsonl: 100%|██████████| 23.8M/23.8M [00:01<00:00, 15.7MB/s]


Train file response: file-9554964d-5711-419a-bcc2-c4edaaa07ee3


## 2. Launch Fine-tuning Job
---
With our data uploaded, we can now launch the fine-tuning job using `client.fine_tuning.create()`.

Key parameters:
- `model`: The base model you want to fine-tune (e.g., `'meta-llama/Meta-Llama-3.1-8B-Instruct-Reference'`, used below). Choose from the models available for fine-tuning on Together AI.
- `training_file`: The ID of your uploaded training JSONL file.
- `validation_file`: The ID of your uploaded validation JSONL file (optional, but highly recommended for monitoring).
- `suffix`: A custom string added to the base model name to create your unique finetuned model name (e.g., `my-coqa-ft`). Keep it short and descriptive.
- `n_epochs`: The number of times the model will see the entire training dataset.
- `n_checkpoints`: Number of checkpoints to save during training (useful for resuming or selecting the best model). Set to 1 if you only need the final model.
- `learning_rate`: Controls how much the model weights are updated during training. Needs tuning.
- `batch_size`: Number of training examples processed in one iteration. Depends on model size and available resources.
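
This notebook trains without a validation file. If you want one, a simple approach is to hold out a fraction of the prepared records. The sketch below is illustrative only: `split_records` is a hypothetical helper, and the 10% fraction and fixed seed are arbitrary choices. Each split would then be written to its own JSONL file and uploaded, and the second file's ID passed as `validation_file`.

```python
import random

def split_records(records, val_fraction=0.1, seed=0):
    """Shuffle a copy of records and split it into (train, validation) lists."""
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be strictly between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]

# Hypothetical records in the chat format used throughout this guide
records = [
    {"messages": [{"role": "user", "content": f"q{i}"},
                  {"role": "assistant", "content": f"a{i}"}]}
    for i in range(20)
]
train, val = split_records(records, val_fraction=0.1)
print(len(train), len(val))  # 18 2
```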

🔗 For an exhaustive list of all the available fine-tuning parameters refer to the [Together AI Fine-tuning API Reference](https://docs.together.ai/reference/post_fine-tunes) docs.

🔗 For a list of all the models you can fine-tune on the Together AI platform see [docs here](https://docs.together.ai/docs/fine-tuning-models).

In [None]:
## This fine-tuning job should take ~10-15 minutes to complete

ft_resp = client.fine_tuning.create(
    training_file = train_file_resp.id,  # ID returned by the upload step above
    model = 'meta-llama/Meta-Llama-3.1-8B-Instruct-Reference',
    train_on_inputs= "auto",
    n_epochs = 3,
    n_checkpoints = 1,
    wandb_api_key = WANDB_API_KEY,
    lora = True,
    warmup_ratio=0,
    learning_rate = 1e-5,
    suffix = 'test1_8b',
)

print(ft_resp.id)

## 3. Monitor Fine-tuning Job
---
Fine-tuning can take time depending on the model size, dataset size, and hyperparameters. You can monitor and alter the job's progress using the following methods:

- List all jobs: `client.fine_tuning.list()`
- Status of a job: `client.fine_tuning.retrieve(id=ft_resp.id)`
- List all events for a Job: `client.fine_tuning.list_events(id=ft_resp.id)`: Retrieves logs and events generated during the job
- Cancel job: `client.fine_tuning.cancel(id=ft_resp.id)`
- Download model after done: `client.fine_tuning.download(id=ft_resp.id)`

Once the job is complete, the response from `retrieve` will contain the name of your newly created fine-tuned model in its `output_name` field. In this notebook's run the name has the form `<your-account>/<base-model-name>-<suffix>-<id>`, for example `zainhas/Meta-Llama-3.1-8B-Instruct-Reference-test1_8b-e5a0fb5d`.
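
Rather than re-running the status cell by hand, you can poll until the job reaches a terminal state. The sketch below is deliberately generic: `wait_for_job` is a hypothetical helper that takes any zero-argument function returning the current status, and the status strings in the demo are placeholders, not necessarily the values the SDK reports. With the real client you might pass something like `lambda: client.fine_tuning.retrieve(ft_resp.id).status` and a set of the terminal status values your SDK version uses.

```python
import time

def wait_for_job(fetch_status, terminal_states, poll_seconds=30,
                 max_polls=120, sleep=time.sleep):
    """Call fetch_status() until it returns a terminal state or max_polls is reached."""
    status = None
    for _ in range(max_polls):
        status = fetch_status()
        if status in terminal_states:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"job still in state {status!r} after {max_polls} polls")

# Demo with a fake status source so the snippet runs offline.
# These status names are illustrative assumptions.
fake_statuses = iter(["pending", "running", "running", "completed"])
final = wait_for_job(
    fetch_status=lambda: next(fake_statuses),
    terminal_states={"completed", "error", "cancelled"},
    sleep=lambda seconds: None,  # skip real waiting in the demo
)
print(final)  # completed
```

Injecting `sleep` keeps the helper easy to test and lets you choose a polling interval suited to the expected job duration.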


In [None]:
# Check status of the job
resp = client.fine_tuning.retrieve(ft_resp.id)
print(resp.status)

FinetuneJobStatus.STATUS_COMPLETED


In [None]:
# this loop will print the logs of the job thus far
for event in client.fine_tuning.list_events(id=ft_resp.id).data:
    print(event.message)

Fine tune request created
Job started at Wed Apr  9 19:48:05 UTC 2025
Model data downloaded for togethercomputer/Meta-Llama-3.1-8B-Instruct-Reference__TOG__FT at Wed Apr  9 19:48:07 UTC 2025
Data downloaded for togethercomputer/Meta-Llama-3.1-8B-Instruct-Reference__TOG__FT at $2025-04-09T19:48:14.918488
WandB run initialized.
Training started for model togethercomputer/Meta-Llama-3.1-8B-Instruct-Reference__TOG__FT
Epoch completed, at step 24
Epoch completed, at step 48
Epoch completed, at step 72
Training completed for togethercomputer/Meta-Llama-3.1-8B-Instruct-Reference__TOG__FT at Wed Apr  9 20:02:24 UTC 2025
Uploading adapter model
Compressing output model
Model compression complete
Uploading output model
Model upload complete
Job finished at Wed Apr  9 20:06:33 UTC 2025


🔗 You can also navigate to the WandB page linked in your [fine-tuning dashboard](https://api.together.ai/fine-tuning) to see the fine-tuning related loss curves and more.

<img src="../images/FT_run.png" width="900">

## 4. Inference with Fine-tuned Model
---
### Option 1: Serverless LoRA Inference

Now, let's use our finetuned model! We can call it just like any other model on the Together AI platform, by providing the unique fine-tuned model `output_name` we retrieved from our fine-tuning job earlier.

🔗 See the list of all models that support [LoRA Inference](https://docs.together.ai/docs/lora-inference).

In [None]:
print(f"Fine-tuned model output_name: {ft_resp.output_name}")

Fine-tuned model output_name: zainhas/Meta-Llama-3.1-8B-Instruct-Reference-test1_8b-e5a0fb5d


In [None]:
# The first call may take longer while the adapter weights are loaded

finetuned_model = ft_resp.output_name #this is the name of the finetuned model

user_prompt = "What is the capital of France?"

response = client.chat.completions.create(
    model = finetuned_model,
    messages=[
        {
            "role": "user",
            "content": user_prompt,
        }
    ],
    max_tokens=124,
)

print(response.choices[0].message.content)

The capital of France is Paris.
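
The prompt above is a generic question. Since the model was fine-tuned on conversations that place a story in the system prompt, a prompt that mirrors that format is closer to what it was trained on. The sketch below builds such a message list; `build_coqa_messages` and the toy story are hypothetical, and the resulting list could be passed as `messages=` to `client.chat.completions.create`.

```python
SYSTEM_PROMPT = "Read the story and extract answers for the questions.\nStory: {}"

def build_coqa_messages(story, question, history=()):
    """Story in the system prompt, then prior (question, answer) turns, then the new question."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT.format(story)}]
    for past_question, past_answer in history:
        messages.append({"role": "user", "content": past_question})
        messages.append({"role": "assistant", "content": past_answer})
    messages.append({"role": "user", "content": question})
    return messages

messages = build_coqa_messages(
    story="Ana has a red kite.",            # hypothetical story
    question="What color is it?",
    history=[("Who has a kite?", "Ana")],   # earlier turn gives "it" a referent
)
print([m["role"] for m in messages])  # ['system', 'user', 'assistant', 'user']
```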


You can also prompt the model in our playground, if it supports serverless LoRA inference, by going to [your models dashboard](https://api.together.xyz/models) and clicking "OPEN IN PLAYGROUND".

<img src="../images/open_in_playground.png" alt="Open in Playground button" width="900">



### Option 2: Deploy Dedicated Endpoint

Another way to run your fine-tuned model is to deploy it on a custom dedicated endpoint. 

Once your fine-tuning job completes, you should see your new model in [your models dashboard](https://api.together.xyz/models). You can click the "+ CREATE DEDICATED ENDPOINT" button to deploy the selected model to a DE.

<img src="../images/create_DE.png" width="900">

You can then select the hardware configuration for your dedicated endpoint, including the minimum and maximum number of replicas; more replicas increase the maximum QPS the deployment can support.

<img src="../images/deploy_DE.png" width="900">

You can also deploy the model to a DE programmatically using the `Endpoints` API via the SDK:

```python
response = client.endpoints.create(
    display_name="Fine-tuned Meta Llama 3.1 8B Instruct 04-09-25",
    model="zainhas/Meta-Llama-3.1-8B-Instruct-Reference-test1_8b-e5a0fb5d",
    hardware="4x_nvidia_h100_80gb_sxm",
    autoscaling={
        "min_replicas": 1,
        "max_replicas": 1
    }
)

print(response)
```
⚠️ Running this code deploys a dedicated endpoint on your account. For detailed documentation on how to deploy, delete, and modify endpoints, see the [Endpoints API Reference](https://docs.together.ai/reference/createendpoint).

Once deployed you'll be able to see the model details under your [Endpoints Dashboard](https://api.together.ai/endpoints):


<img src="../images/deployed_DE.png" width="900">

In [None]:
# You can now query the endpoint as follows
response = client.chat.completions.create(
    model="zainhas/Meta-Llama-3.1-8B-Instruct-Reference-test1_8b-e5a0fb5d-ded38e09",
    messages=[{"role": "user", 
               "content": "What is the capital of France?"
               }
              ]
)
print(response.choices[0].message.content)

The capital of France is Paris.


## 5. Evaluation
---
To assess the impact of fine-tuning, we can compare the responses of our fine-tuned model with those of the original base model on the same prompts in our test set.

This provides a way to measure improvements, after fine-tuning, to the model's behavior for our specific task (conversational QA based on a story).

In [15]:
from tqdm.auto import tqdm
from multiprocessing.pool import ThreadPool
import transformers.data.metrics.squad_metrics as squad_metrics

base_model = "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo"
finetuned_model = ft_resp.output_name

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


In [16]:
# We'll load in 50 conversations from the CoQA validation set for evaluation
coqa_dataset_validation = load_dataset("stanfordnlp/coqa", split="validation[:50]")

In [17]:
coqa_dataset_validation

Dataset({
    features: ['source', 'story', 'questions', 'answers'],
    num_rows: 50
})

We'll use the code below to generate answers from the baseline and fine-tuned model.

In [18]:
## This function is used to generate model answers on the CoQA validation set from the untuned reference and fine-tuned models

def get_model_answers(model_name):
    """
    Generate model answers for a given model name using a dataset of questions and answers.
    Args:
        model_name (str): The name of the model to use for generating answers.
    Returns:
        list: A list of lists, where each inner list contains the answers generated by the model for the corresponding set of questions in the dataset.
    The function performs the following steps:
    1. Initializes an empty list to store the model answers.
    2. Defines an inner function `get_answers` that takes a data dictionary and generates answers for the questions in the data.
    3. Uses a thread pool to parallelize the process of generating answers for each entry in the validation dataset.
    4. Appends the generated answers to the `model_answers` list.
    5. Returns the `model_answers` list.
    Note:
        - The `client` variable is assumed to be defined elsewhere in the code.
        - The `coqa_dataset_validation` variable is assumed to hold the validation split loaded above.
    """

    model_answers = []
    system_prompt = "Read the story and extract answers for the questions.\nStory: {}"
    
    def get_answers(data):
        answers = []
        messages = [
            {
                "role": "system",
                "content": system_prompt.format(data["story"]),
            }
        ]
        for q, true_answer in zip(data["questions"], data["answers"]["input_text"]):
            try:
                messages.append(
                    {
                        "role": "user",
                        "content": q
                    }
                )
                response = client.chat.completions.create(
                    messages=messages,
                    model=model_name,
                    max_tokens=64
                )
                answer = response.choices[0].message.content
                answers.append(answer)
            except Exception:
                answers.append("Invalid Response")
        return answers

    # We'll use 8 threads to generate answers faster in parallel
    with ThreadPool(8) as pool:
        for answers in tqdm(pool.imap(get_answers, coqa_dataset_validation), total=len(coqa_dataset_validation)):
            model_answers.append(answers)

    return model_answers

We'll use the function below to calculate the exact match (EM) and F1 metrics.

In [19]:
## This function will be used to evaluate predicted answers using the Exact Match (EM) and F1 metrics
def get_metrics(pred_answers):
    """
    Calculate the Exact Match (EM) and F1 metrics for predicted answers.
    Args:
        pred_answers (list): A list of predicted answers. Each element in the list is a list of predicted answers for the questions of a single conversation.
    Returns:
        tuple: A tuple containing two elements:
            - em_score (float): The average Exact Match score across all predictions.
            - f1_score (float): The average F1 score across all predictions.
    """

    em_metrics = []
    f1_metrics = []

    for pred, data in tqdm(zip(pred_answers, coqa_dataset_validation), total=len(pred_answers)):
        for pred_answer, true_answer in zip(pred, data["answers"]["input_text"]):
            em_metrics.append(squad_metrics.compute_exact(true_answer, pred_answer))
            f1_metrics.append(squad_metrics.compute_f1(true_answer, pred_answer))

    return sum(em_metrics) / len(em_metrics), sum(f1_metrics) / len(f1_metrics)
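
To make the two metrics concrete, here is a simplified pure-Python re-implementation in the SQuAD style (lowercase, strip punctuation and the articles a/an/the, then compare tokens). It is a sketch for illustration, not the `transformers` code the notebook actually calls, and the predicted strings are made up. It shows how a correct but wordy answer gets an EM of 0 and only partial F1 credit.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)

def normalize_answer(text):
    """Lowercase, drop punctuation and the articles a/an/the, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(gold, pred):
    """1 if the normalized strings are identical, else 0."""
    return int(normalize_answer(gold) == normalize_answer(pred))

def token_f1(gold, pred):
    """Harmonic mean of token-level precision and recall after normalization."""
    gold_tokens = normalize_answer(gold).split()
    pred_tokens = normalize_answer(pred).split()
    if not gold_tokens or not pred_tokens:
        return float(gold_tokens == pred_tokens)
    overlap = sum((Counter(gold_tokens) & Counter(pred_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

gold = "Hard Rock Cafe"
verbose = "The auction was held at the Hard Rock Cafe."  # hypothetical model output
print(exact_match(gold, "hard rock cafe."))  # 1
print(exact_match(gold, verbose))            # 0
print(round(token_f1(gold, verbose), 2))     # 0.6
```

In the verbose case all 3 gold tokens are recovered (recall 1.0), but only 3 of the 7 predicted tokens match (precision 3/7), giving F1 = 0.6.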

In [20]:
# Base Model answers
answers = get_model_answers("meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo")

In [21]:
# calculate the EM and F1 metrics for baseline model
em_metric, f1_metric = get_metrics(answers)
print(f"Model: Baseline, \n\nEM: {em_metric}, F1: {f1_metric}")

Model: Baseline, 

EM: 0.0175, F1: 0.18467257739023207


In [22]:
model_name = "zainhas/Meta-Llama-3.1-8B-Instruct-Reference-test1_8b-e5a0fb5d"

print(model_name)
answers = get_model_answers(model_name)

zainhas/Meta-Llama-3.1-8B-Instruct-Reference-test1_8b-e5a0fb5d


In [23]:
em_metric, f1_metric = get_metrics(answers)
print(f"Model: {model_name}, \n\nEM: {em_metric}, F1: {f1_metric}")

Model: zainhas/Meta-Llama-3.1-8B-Instruct-Reference-test1_8b-e5a0fb5d, 

EM: 0.31, F1: 0.41019649357988347


| Llama 3.1 8B | EM | F1|
|---|---|---|
| Original | 0.02 | 0.18 |
| Fine-tuned | 0.31 | 0.41 |

On these 50 validation conversations, the fine-tuned model more than doubles the F1 score (0.18 to 0.41) and raises exact match from 0.02 to 0.31.
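
The size of the improvement can be worked out directly from the scores printed in the evaluation cells above. The short calculation below uses those reported values; the variable names are only for illustration.

```python
# Scores as printed by the evaluation cells in this notebook
baseline = {"em": 0.0175, "f1": 0.18467257739023207}
finetuned = {"em": 0.31, "f1": 0.41019649357988347}

for metric in ("em", "f1"):
    absolute_gain = finetuned[metric] - baseline[metric]
    ratio = finetuned[metric] / baseline[metric]
    print(f"{metric.upper()}: +{absolute_gain:.2f} absolute, {ratio:.1f}x the baseline")
# EM: +0.29 absolute, 17.7x the baseline
# F1: +0.23 absolute, 2.2x the baseline
```

Because the baseline EM is close to zero, its ratio is large and sensitive to small changes; the absolute gain is the more stable number to report for a 50-conversation sample.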

For a more detailed guide on Fine-tuning follow our [docs here](https://docs.together.ai/docs/finetuning).