# Mistral Fine-tuning API

Check out the docs: https://docs.mistral.ai/capabilities/finetuning/

In [None]:
#!pip install mistralai pandas

## Prepare the dataset

In this example, we use the ultrachat_200k dataset. We load a chunk of the data into a pandas DataFrame, split it into training and validation sets, and save both in the JSONL format required for fine-tuning.

In [1]:
import pandas as pd
df = pd.read_parquet('https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k/resolve/main/data/test_gen-00000-of-00001-3d4cd8309148a71f.parquet')

# keep 99.5% of the rows for training and the remaining 0.5% for evaluation
df_train = df.sample(frac=0.995, random_state=200)
df_eval = df.drop(df_train.index)

df_train.to_json("ultrachat_chunk_train.jsonl", orient="records", lines=True)
df_eval.to_json("ultrachat_chunk_eval.jsonl", orient="records", lines=True)
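
As a quick sanity check on the split, the sketch below applies the same `sample`/`drop` pattern to a tiny synthetic DataFrame and confirms that the two subsets are disjoint and together cover every row. The toy data and the `split_train_eval` helper are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def split_train_eval(df, train_frac=0.995, seed=200):
    """Split df into disjoint train/eval parts, mirroring the cell above."""
    df_train = df.sample(frac=train_frac, random_state=seed)
    # Rows that were not drawn for training become the evaluation set.
    df_eval = df.drop(df_train.index)
    return df_train, df_eval


# Toy stand-in for the ultrachat chunk: 1000 rows with a "messages" column.
toy = pd.DataFrame({
    "messages": [
        [{"role": "user", "content": f"question {i}"},
         {"role": "assistant", "content": f"answer {i}"}]
        for i in range(1000)
    ]
})

toy_train, toy_eval = split_train_eval(toy)

# The two index sets should not overlap and should cover every row.
overlap = set(toy_train.index) & set(toy_eval.index)
print(len(toy_train), len(toy_eval), len(overlap))
```

This pattern relies on the DataFrame index being unique; with duplicate index labels, `drop` would remove more rows than were sampled.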

In [2]:
!ls -lh

total 301840
-rw-r--r--   1 pierrebittner  staff   3,4K 21 jui 15:15 Step 2 - synthetize news.py
-rw-r--r--   1 pierrebittner  staff   1,7K 21 jui 15:07 Step 3 - critique.py
-rw-r--r--   1 pierrebittner  staff   5,9K 21 jui 15:14 Step 4 - specific-style.py
-rw-r--r--   1 pierrebittner  staff   6,5K 19 jui 16:19 Untitled.ipynb
drwxr-xr-x   5 pierrebittner  staff   160B 21 jui 15:07 __pycache__
-rw-r--r--   1 pierrebittner  staff   1,1K 21 jui 15:53 concatenate.py
drwxr-xr-x  19 pierrebittner  staff   608B 21 jui 15:53 data
-rw-r--r--@  1 pierrebittner  staff    48K 19 jui 15:52 mistral_finetune_api.ipynb
-rw-r--r--   1 pierrebittner  staff    40K 21 jui 16:10 mistral_finetune_api_news.ipynb
-rw-r--r--   1 pierrebittner  staff   5,1K 21 jui 16:07 news_chunk_eval.jsonl
-rw-r--r--   1 pierrebittner  staff   1,2M 21 jui 16:07 news_chunk_train.jsonl
-rw-r--r--   1 pierrebittner  staff   447B 21 jui 11:29 prompts.py
-rw-r--r--   1 pierrebittner  staff   3,3K 19 jui 11:53

## Reformat dataset
If you upload ultrachat_chunk_train.jsonl to the Mistral API as-is, you might get an “Invalid file format” error caused by data formatting issues. To fix this, download the reformat_data.py script and use it to validate and reformat both the training and evaluation data:

In [3]:
# download the validation and reformat script
!wget https://raw.githubusercontent.com/mistralai/mistral-finetune/main/utils/reformat_data.py

--2024-06-19 15:41:11--  https://raw.githubusercontent.com/mistralai/mistral-finetune/main/utils/reformat_data.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3381 (3.3K) [text/plain]
Saving to: ‘reformat_data.py.4’


2024-06-19 15:41:11 (5.14 MB/s) - ‘reformat_data.py.4’ saved [3381/3381]



In [3]:
# validate and reformat the training data
!python reformat_data.py ultrachat_chunk_train.jsonl

Skipped 3674th sample
Skipped 9176th sample
Skipped 10559th sample
Skipped 13293th sample
Skipped 13973th sample
Skipped 15219th sample


In [4]:
# validate and reformat the eval data
!python reformat_data.py ultrachat_chunk_eval.jsonl

In [5]:
df_train.iloc[3674]['messages']

array([{'content': 'What are the dimensions of the cavity, product, and shipping box of the Sharp SMC1662DS microwave?: With innovative features like preset controls, Sensor Cooking and the Carousel® turntable system, the Sharp® SMC1662DS 1.6 cu. Ft. Stainless Steel Carousel Countertop Microwave makes reheating your favorite foods, snacks and beverages easier than ever. Use popcorn and beverage settings for one-touch cooking. Express Cook allows one-touch cooking up to six minutes. The convenient and flexible "+30 Sec" key works as both instant start option and allows you to add more time during cooking.\nThe Sharp SMC1662DS microwave is a bold design statement in any kitchen. The elegant, grey interior and bright white, LED interior lighting complements the stainless steel finish of this premium appliance.\nCavity Dimensions (w x h x d): 15.5" x 10.2" x 17.1"\nProduct Dimensions (w x h x d): 21.8" x 12.8" x 17.7"\nShipping Dimensions (w x h x d) : 24.4" x 15.0" x 20.5"', 'role': 'user
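
To get a feel for why a sample might be skipped, here is a minimal sketch of a JSONL pre-check. The `find_suspect_samples` helper and its rules (valid JSON, a non-empty `messages` list, a known role, non-empty content) are illustrative assumptions; they are not the actual logic of reformat_data.py, which may apply different or additional checks.

```python
import json

# Assumed set of acceptable roles; adjust to match the API documentation.
ALLOWED_ROLES = {"system", "user", "assistant", "tool"}


def find_suspect_samples(lines):
    """Return (index, reason) pairs for JSONL lines that look malformed."""
    suspects = []
    for idx, line in enumerate(lines):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            suspects.append((idx, "invalid JSON"))
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            suspects.append((idx, "missing or empty 'messages'"))
            continue
        for msg in messages:
            if not isinstance(msg, dict) or msg.get("role") not in ALLOWED_ROLES:
                suspects.append((idx, "unexpected role"))
                break
            if not msg.get("content"):
                suspects.append((idx, "empty content"))
                break
    return suspects


# Three toy lines: one valid, one with empty content, one that is not JSON.
sample_lines = [
    json.dumps({"messages": [{"role": "user", "content": "Hi"},
                             {"role": "assistant", "content": "Hello!"}]}),
    json.dumps({"messages": [{"role": "user", "content": "Hi"},
                             {"role": "assistant", "content": ""}]}),
    "{not valid json",
]
print(find_suspect_samples(sample_lines))
```

To run the same check on the real data, pass an open file handle for ultrachat_chunk_train.jsonl instead of `sample_lines`, since iterating over a text file yields one line at a time.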

## Upload dataset

In [6]:
import os
from mistralai.client import MistralClient

api_key = os.environ.get("MISTRAL_API_KEY")
client = MistralClient(api_key=api_key)

with open("ultrachat_chunk_train.jsonl", "rb") as f:
    ultrachat_chunk_train = client.files.create(file=("ultrachat_chunk_train.jsonl", f))
with open("ultrachat_chunk_eval.jsonl", "rb") as f:
    ultrachat_chunk_eval = client.files.create(file=("ultrachat_chunk_eval.jsonl", f))

In [7]:
import json
def pprint(obj):
    print(json.dumps(obj.dict(), indent=4))

In [8]:
pprint(ultrachat_chunk_train)

{
    "id": "d04b7515-f54a-4400-abf6-029edb170df1",
    "object": "file",
    "bytes": 121379382,
    "created_at": 1718979213,
    "filename": "ultrachat_chunk_train.jsonl",
    "purpose": "fine-tune"
}


In [9]:
pprint(ultrachat_chunk_eval)

{
    "id": "22ae7d95-1bcd-4665-a1f3-ff7c7c550be3",
    "object": "file",
    "bytes": 596255,
    "created_at": 1718979215,
    "filename": "ultrachat_chunk_eval.jsonl",
    "purpose": "fine-tune"
}


## Create a fine-tuning job

In [10]:
from mistralai.models.jobs import TrainingParameters

created_jobs = client.jobs.create(
    model="open-mistral-7b",
    training_files=[ultrachat_chunk_train.id],
    validation_files=[ultrachat_chunk_eval.id],
    hyperparameters=TrainingParameters(
        training_steps=10,
        learning_rate=0.0001,
        )
)

In [11]:
pprint(created_jobs)

{
    "id": "2ef2bee4-3134-4908-ba95-95f0947ebed0",
    "hyperparameters": {
        "training_steps": 10,
        "learning_rate": 0.0001
    },
    "fine_tuned_model": null,
    "model": "open-mistral-7b",
    "status": "QUEUED",
    "job_type": "FT",
    "created_at": 1718979336,
    "modified_at": 1718979336,
    "training_files": [
        "d04b7515-f54a-4400-abf6-029edb170df1"
    ],
    "validation_files": [
        "22ae7d95-1bcd-4665-a1f3-ff7c7c550be3"
    ],
    "object": "job",
    "integrations": []
}


In [12]:
import time

retrieved_job = client.jobs.retrieve(created_jobs.id)
while retrieved_job.status in ["RUNNING", "QUEUED"]:
    retrieved_job = client.jobs.retrieve(created_jobs.id)
    pprint(retrieved_job)
    print(f"Job is {retrieved_job.status}, waiting 10 seconds")
    time.sleep(10)



{
    "id": "2ef2bee4-3134-4908-ba95-95f0947ebed0",
    "hyperparameters": {
        "training_steps": 10,
        "learning_rate": 0.0001
    },
    "fine_tuned_model": null,
    "model": "open-mistral-7b",
    "status": "RUNNING",
    "job_type": "FT",
    "created_at": 1718979336,
    "modified_at": 1718979339,
    "training_files": [
        "d04b7515-f54a-4400-abf6-029edb170df1"
    ],
    "validation_files": [
        "22ae7d95-1bcd-4665-a1f3-ff7c7c550be3"
    ],
    "object": "job",
    "integrations": [],
    "events": [
        {
            "name": "status-updated",
            "data": {
                "status": "RUNNING"
            },
            "created_at": 1718979339
        },
        {
            "name": "status-updated",
            "data": {
                "status": "QUEUED"
            },
            "created_at": 1718979336
        }
    ],
    "checkpoints": [],
    "estimated_start_time": null
}
Job is RUNNING, waiting 10 seconds
{
    "id": "2ef2bee4-3134-49

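The polling loop above runs until the job leaves the RUNNING or QUEUED state, however long that takes. A possible variant with a timeout is sketched below. The `wait_for_job` helper and its parameters are hypothetical, and it takes a plain callable so that it can be run without API access.

```python
import time


def wait_for_job(fetch_status, poll_seconds=10.0, timeout_seconds=3600.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until it leaves RUNNING/QUEUED or the timeout hits."""
    deadline = clock() + timeout_seconds
    status = fetch_status()
    while status in ("RUNNING", "QUEUED"):
        if clock() >= deadline:
            raise TimeoutError(f"job still {status} after {timeout_seconds} seconds")
        sleep(poll_seconds)
        status = fetch_status()
    return status


# Demo with a fake status source so the sketch runs without API access.
fake_statuses = iter(["QUEUED", "RUNNING", "RUNNING", "SUCCESS"])
final_status = wait_for_job(lambda: next(fake_statuses), poll_seconds=0.0)
print(final_status)
```

With the client from this notebook, the callable could be `lambda: client.jobs.retrieve(created_jobs.id).status`, which reuses the same `retrieve` call as the loop above.
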
In [15]:
# List jobs
jobs = client.jobs.list()
pprint(jobs)

{
    "data": [
        {
            "id": "68e070f1-b295-41cc-b052-a51c98e9628d",
            "hyperparameters": {
                "training_steps": 10,
                "learning_rate": 0.0001
            },
            "fine_tuned_model": "ft:open-mistral-7b:c056c2e4:20240619:68e070f1",
            "model": "open-mistral-7b",
            "status": "SUCCESS",
            "job_type": "FT",
            "created_at": 1718804524,
            "modified_at": 1718804697,
            "training_files": [
                "b3878ca4-510c-47f4-a340-4f45836f0b9b"
            ],
            "validation_files": [
                "5a7f6234-4761-4efb-8b99-7be1299aa6d7"
            ],
            "object": "job",
            "integrations": []
        }
    ],
    "object": "list"
}


In [16]:
# Retrieve a job
retrieved_jobs = client.jobs.retrieve(created_jobs.id)
pprint(retrieved_jobs)


{
    "id": "68e070f1-b295-41cc-b052-a51c98e9628d",
    "hyperparameters": {
        "training_steps": 10,
        "learning_rate": 0.0001
    },
    "fine_tuned_model": "ft:open-mistral-7b:c056c2e4:20240619:68e070f1",
    "model": "open-mistral-7b",
    "status": "SUCCESS",
    "job_type": "FT",
    "created_at": 1718804524,
    "modified_at": 1718804697,
    "training_files": [
        "b3878ca4-510c-47f4-a340-4f45836f0b9b"
    ],
    "validation_files": [
        "5a7f6234-4761-4efb-8b99-7be1299aa6d7"
    ],
    "object": "job",
    "integrations": [],
    "events": [
        {
            "name": "status-updated",
            "data": {
                "status": "SUCCESS"
            },
            "created_at": 1718804697
        },
        {
            "name": "status-updated",
            "data": {
                "status": "RUNNING"
            },
            "created_at": 1718804526
        },
        {
            "name": "status-updated",
            "data": {
              

## Use a fine-tuned model

In [17]:
from mistralai.models.chat_completion import ChatMessage

chat_response = client.chat(
    model=retrieved_jobs.fine_tuned_model,
    messages=[ChatMessage(role='user', content='What is the best French cheese?')]
)

In [18]:
pprint(chat_response)

{
    "id": "b30564ffcb6f4ea9bdf47d4611527b94",
    "object": "chat.completion",
    "created": 1718805028,
    "model": "ft:open-mistral-7b:c056c2e4:20240619:68e070f1",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "The best French cheese is a matter of personal preference. However, some of the most famous French cheeses include Roquefort, Camembert, and Brie. Roquefort is a blue-veined cheese which is made from sheep's milk, while Camembert and Brie are soft, creamy cheeses. These cheeses are known for their unique flavors and textures, and they are often enjoyed as part of a French meal. Some other popular French cheeses include Comt\u00e9, Reblochon, and Gruy\u00e8re. These cheeses are often made from cow's milk and have a hard and nutty texture. Some other popular French cheeses include Fromage de Ch\u00e8vre, a soft goat cheese, and Beaufort, a hard cheese made from cow's milk. The best 

## Integration with Weights and Biases
The fine-tuning API also supports integration with Weights & Biases (W&B) to monitor and track metrics and statistics for your fine-tuning jobs. To enable it, create a W&B account and add your W&B information to the “integrations” section of the job creation request:



In [None]:
from mistralai.models.jobs import TrainingParameters, WandbIntegrationIn

WANDB_API_KEY = "XXX"

created_jobs = client.jobs.create(
    model="open-mistral-7b",
    training_files=[ultrachat_chunk_train.id],
    validation_files=[ultrachat_chunk_eval.id],
    hyperparameters=TrainingParameters(
        training_steps=100,
        learning_rate=0.0001,
    ),
    integrations=[
        WandbIntegrationIn(
            project="test_ft_api",
            run_name="test",
            api_key=WANDB_API_KEY,
        ).dict()
    ],
)
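
Rather than writing the W&B key into the notebook, you may prefer to read it from the environment, as is done for `MISTRAL_API_KEY` earlier. The sketch below assumes the key is exported as `WANDB_API_KEY`; the `require_env` helper is hypothetical.

```python
import os


def require_env(name):
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running this cell")
    return value


# Placeholder only so that this sketch runs on its own; export the real key
# in your shell instead of writing it into the notebook.
os.environ.setdefault("WANDB_API_KEY", "XXX")
WANDB_API_KEY = require_env("WANDB_API_KEY")
```

Failing early with an explicit message avoids submitting a job whose W&B integration would otherwise be configured with an empty key.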