# Getting Started with Fine-Tuning Mistral 7B

This notebook shows a simple example of LoRA fine-tuning Mistral 7B. You can run it in Google Colab with a Pro+ account on an A100 GPU with 40 GB of memory.

<a target="_blank" href="https://colab.research.google.com/github/mistralai/mistral-finetune/blob/main/tutorials/mistral_finetune_7b.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>


Check out the `mistral-finetune` GitHub repo to learn more: https://github.com/mistralai/mistral-finetune/

## Installation

Clone the `mistral-finetune` repo:


In [1]:
%cd /content/
!git clone https://github.com/mistralai/mistral-finetune.git

/content
Cloning into 'mistral-finetune'...
remote: Enumerating objects: 472, done.
remote: Counting objects: 100% (249/249), done.
remote: Compressing objects: 100% (90/90), done.
remote: Total 472 (delta 211), reused 159 (delta 159), pack-reused 223 (from 2)
Receiving objects: 100% (472/472), 243.32 KiB | 1.27 MiB/s, done.
Resolving deltas: 100% (251/251), done.


Install all required dependencies:

In [1]:
!pip install -r /content/mistral-finetune/requirements.txt



## Model download

In [None]:
!pip install huggingface_hub

In [None]:
# huggingface login
from huggingface_hub import notebook_login

notebook_login()

In [None]:
from huggingface_hub import snapshot_download
from pathlib import Path

mistral_models_path = Path.home().joinpath('mistral_models', '7B-v0.3')
mistral_models_path.mkdir(parents=True, exist_ok=True)

snapshot_download(repo_id="mistralai/Mistral-7B-v0.3", allow_patterns=["params.json", "consolidated.safetensors", "tokenizer.model.v3"], local_dir=mistral_models_path)

! cp -r /root/mistral_models/7B-v0.3 /content/mistral_models
! rm -r /root/mistral_models/7B-v0.3

Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

tokenizer.model.v3:   0%|          | 0.00/587k [00:00<?, ?B/s]

params.json:   0%|          | 0.00/202 [00:00<?, ?B/s]

consolidated.safetensors:   0%|          | 0.00/14.5G [00:00<?, ?B/s]

'/root/mistral_models/7B-v0.3'

In [None]:
# Alternatively, you can download the model directly from Mistral

# !wget https://models.mistralcdn.com/mistral-7b-v0-3/mistral-7B-v0.3.tar

--2024-05-24 18:50:25--  https://models.mistralcdn.com/mistral-7b-v0-3/mistral-7B-v0.3.tar
Resolving models.mistralcdn.com (models.mistralcdn.com)... 104.26.6.117, 104.26.7.117, 172.67.70.68, ...
Connecting to models.mistralcdn.com (models.mistralcdn.com)|104.26.6.117|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14496675840 (14G) [application/x-tar]
Saving to: ‘mistral-7B-v0.3.tar’


2024-05-24 18:56:29 (38.1 MB/s) - ‘mistral-7B-v0.3.tar’ saved [14496675840/14496675840]



In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
!DIR=/content/mistral_models && mkdir -p $DIR && tar -xf /content/drive/MyDrive/mistral-7B-v0.3.tar -C $DIR


In [4]:
!ls /content/mistral_models

consolidated.safetensors  params.json  tokenizer.model.v3
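
Before moving on, it can help to confirm that the three files listed above are all present. This is a minimal sketch, assuming the model directory is `/content/mistral_models` as in the cells above; the helper name `missing_model_files` is hypothetical and not part of `mistral-finetune`.

```python
from pathlib import Path

# File names taken from the `ls` output above.
EXPECTED_FILES = ("consolidated.safetensors", "params.json", "tokenizer.model.v3")

def missing_model_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in model_dir."""
    model_dir = Path(model_dir)
    return [name for name in expected if not (model_dir / name).is_file()]

missing = missing_model_files("/content/mistral_models")
print("All model files found." if not missing else f"Missing files: {missing}")
```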


## Prepare dataset

To ensure effective training, mistral-finetune has strict requirements for how the training data has to be formatted. Check out the required data formatting [here](https://github.com/mistralai/mistral-finetune/tree/main?tab=readme-ov-file#prepare-dataset).

In this example, let’s use the ultrachat_200k dataset. We load a chunk of the data into pandas DataFrames, split it into training and validation sets, and save both in the `jsonl` format required for fine-tuning.
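
For reference, each line of an instruct `jsonl` file is one JSON object holding a `messages` list of role/content turns, which is the shape the reformatting cell below produces. This is a minimal sketch; the sample text and the helper name `to_jsonl_line` are illustrative only.

```python
import json

def to_jsonl_line(user_text, assistant_text):
    """Serialize one user/assistant exchange as a single JSONL line."""
    record = {
        "messages": [
            {"role": "user", "content": user_text},
            {"role": "assistant", "content": assistant_text},
        ]
    }
    # ensure_ascii=False keeps accented characters readable in the file.
    return json.dumps(record, ensure_ascii=False)

line = to_jsonl_line("What is LoRA?", "LoRA is a low-rank adaptation method.")
print(line)
```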

In [70]:
%cd /content/

/content


In [71]:
# make a new directory called data
!mkdir -p data

In [72]:
# navigate to this data directory
%cd /content/data

/content/data


In [73]:
import json

# Example raw dataset
data = [
    {
        "question": "L'obstruction uréthrique chronique due à une hyperplasie prismatique bénigne peut entraîner le changement suivant du parenchyme rein",
        "exp": "L'obstruction uréthrique chronique à cause des calculi urinaires, de la hyperophy, des tumeurs, de la grossesse normale, des tumeurs, de la prolifération utérine ou des troubles fonctionnels cause une hydronéphrose qui, par définition, est utilisée pour décrire la dilatation du bassin rénal et du calcul associé atrophie progressive du rein en raison de l'obstruction de l'écoulement de l'urine Voir Robbins 7yh/9,1012,9/f.",
        "opa": "Hyperplasie",
        "opb": "Hyperophyse",
        "opc": "Atrophie",
        "opd": "Dyplasie",
        "subject_name": "Anatomie",
        "topic_name": "Traitement urinaire",
        "id": "e9ad821a-c438-4965-9f77-760819dfa155",
        "choice_type": "et"
    },
    {
        "question": "Quelle vitamine est fournie à partir d'une seule source animale :",
        "exp": "Ans. c) Vitamine B12 Ref: Harrison's 19th ed. P 640* La vitamine B12 (Cobalamin) est synthétisée uniquement par des micro-organismes*. Chez les humains, la seule source d'origine animale est la nourriture d'origine animale, p. ex. la viande, le poisson et les produits laitiers*. L'origine non animale ne contient pas de vitamine B12 .* Les exigences quotidiennes de vitamine Bp sont d'environ 1 à 3 pg. Les magasins de corps sont de l'ordre de 2-3 mg, suffisants pour 3 à 4 ans si les fournitures sont complètement coupées.",
        "opa": "Vitamine C",
        "opb": "Vitamine B7",
        "opc": "Vitamine B12",
        "opd": "Vitamine D",
        "subject_name": "Biochimie",
        "topic_name": "Vitamines et minéraux",
        "id": "e3d3c4e1-4fb2-45e7-9f88-247cc8f373b3",
        "choice_type": "et"
    },
    {
        "question": "La prévention primordiale est faite pour prévenir le développement de ?",
        "exp": "Les facteurs de risque NIVEAUX DE LA PRÉVENTION Il y a quatre niveaux de prévention :? Prévention primordiale Prévention secondaire Prévention téiaire Niveau primaire de prévention : est-ce que la prévention primordiale (voir ci-dessous) au sens pur ? Prévention de l'émergence ou du développement de facteurs de risque dans les pays ou les groupes de population dans lesquels ils n'ont pas encore été apparus Modes d'intervention : éducation individuelle Masse éducation primordiale Le meilleur niveau de prévention pour les maladies non transmissibles",
        "opa": "Maladie",
        "opb": "Facteurs de risque",
        "opc": "Difficulté",
        "opd": "Invalidité",
        "subject_name": "Médecine sociale et préventive",
        "id": "0473aeb8-a083-4cca-ac55-c0cdba0c6f03",
        "choice_type": "et"
    },
    {
        "question": "Anakinra est une -",
        "exp": "Ans. est 'a', c'est-à-dire l'antagoniste IL-1 Anakinra est un antagoniste IL-1.o Il est utilisé pour certains syndromes rares dépendant de la production d'IL-1 : maladie inflammatoire néonatale - déclenchement de l'inflammationMuckle - syndrome WellsFamilial cold urticariaSystématique juvénile - déclenchement de",
        "opa": "IL - 1 antagoniste",
        "opb": "IL - 2 antagoniste",
        "opc": "IL - 6 antagoniste",
        "opd": "IL - 10 antagoniste",
        "subject_name": "Pharmacologie",
        "topic_name": "Immunomodulateur",
        "id": "11c4dd07-1c91-47b8-8b9f-f9182ac9e5b1",
        "choice_type": "et"
    }

]

# Function to reformat the data
def reformat_data(data):
    formatted_data = []

    for item in data:
        # System context (can be omitted if not needed)
        formatted_data.append({
            "messages": [
                {
                    "role": "user",
                    "content": (
                        f"Explication : {item['question']}\n\n"
                        f"Options :\n"
                        f"a) {item['opa']}\n"
                        f"b) {item['opb']}\n"
                        f"c) {item['opc']}\n"
                        f"d) {item['opd']}\n"
                        f"a) {item['subject_name']}\n"
                        f"b) {item.get('topic_name')}\n"
                        f"c) {item['id']}\n"
                        f"d) {item['choice_type']}"
                    )
                },
                {
                    "role": "assistant",
                    "content": item['exp'],
                }
            ]
        })

    return formatted_data

# Reformat the data
formatted_data = reformat_data(data)

In [74]:
import pandas as pd

df = pd.DataFrame(formatted_data)

In [75]:
df

|   | messages |
|---|----------|
| 0 | [{'role': 'user', 'content': 'Explication : L'... |
| 1 | [{'role': 'user', 'content': 'Explication : Qu... |
| 2 | [{'role': 'user', 'content': 'Explication : La... |
| 3 | [{'role': 'user', 'content': 'Explication : An... |


In [54]:
import pandas as pd
import json

# Load the JSONL file line by line
all_data = []
with open("/content/data/reformated_train (2).json", "r", encoding="utf-8") as f:
    for line in f:
        try:
            all_data.append(json.loads(line))  # Parse each line as a JSON object
        except json.JSONDecodeError as e:
            print(f"Error loading line: {line.strip()} - Error: {e}")

df = pd.DataFrame(all_data)

In [None]:
# read data into a pandas dataframe
import pandas as pd

df = pd.read_parquet('https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k/resolve/main/data/test_gen-00000-of-00001-3d4cd8309148a71f.parquet')

In [76]:
# split data into training and evaluation sets
df_train = df.sample(frac=0.65, random_state=200)
df_eval = df.drop(df_train.index)

In [77]:
df_train

|   | messages |
|---|----------|
| 3 | [{'role': 'user', 'content': 'Explication : An... |
| 0 | [{'role': 'user', 'content': 'Explication : L'... |
| 1 | [{'role': 'user', 'content': 'Explication : Qu... |


In [78]:
# save data into .jsonl files
df_train.to_json("ultrachat_chunk_train.jsonl", orient="records", lines=True)
df_eval.to_json("ultrachat_chunk_eval.jsonl", orient="records", lines=True)

In [79]:
!ls /content/data

'reformated_train (2).json'   ultrachat_chunk_eval.jsonl   ultrachat_chunk_train.jsonl
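
A quick look at the saved files can catch formatting problems before running the repo's own utilities. The sketch below is not a replacement for `utils.validate_data`; it only assumes the `messages`/`role`/`content` layout used above, plus the assumption that an instruct sample should end with an assistant turn. `summarize_jsonl` is a hypothetical helper.

```python
import json

def summarize_jsonl(path):
    """Count records in a JSONL file and collect line numbers that look malformed."""
    total, bad_lines = 0, []
    with open(path, "r", encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            total += 1
            try:
                messages = json.loads(line)["messages"]
                ok = bool(messages) and all(
                    isinstance(m, dict) and {"role", "content"} <= m.keys()
                    for m in messages
                )
                # Assumption: the last turn should be the assistant's reply.
                ok = ok and messages[-1]["role"] == "assistant"
            except (json.JSONDecodeError, KeyError, TypeError):
                ok = False
            if not ok:
                bad_lines.append(lineno)
    return total, bad_lines
```

Calling `summarize_jsonl("/content/data/ultrachat_chunk_train.jsonl")` returns the record count and the list of suspicious line numbers, which should be empty for well-formed data.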


In [80]:
# navigate to the mistral-finetune directory
%cd /content/mistral-finetune/

/content/mistral-finetune


In [81]:
# some of the training data doesn't have the right format,
# so we need to reformat the data into the correct format and skip the cases that don't have the right format:

!python -m utils.reformat_data /content/data/ultrachat_chunk_train.jsonl

In [82]:
# eval data looks all good
!python -m utils.reformat_data /content/data/ultrachat_chunk_eval.jsonl

In [83]:
# Now you can verify your training yaml to make sure the data is correctly formatted and to get an estimate of your training time.

!python -m utils.validate_data --train_yaml example/7B.yaml


Validating /content/data/ultrachat_chunk_train.jsonl ...
100% 3/3 [00:00<00:00, 1391.76it/s]
No errors! Data is correctly formatted!
Stats for /content/data/ultrachat_chunk_train.jsonl 
 -------------------- 
 {
    "expected": {
        "eta": "00:33:38",
        "data_tokens": 801,
        "train_tokens": 78643200,
        "epochs": "98181.27",
        "max_steps": 300,
        "data_tokens_per_dataset": {
            "/content/data/ultrachat_chunk_train.jsonl": "801.0"
        },
        "train_tokens_per_dataset": {
            "/content/data/ultrachat_chunk_train.jsonl": "78643200.0"
        },
        "epochs_per_dataset": {
            "/content/data/ultrachat_chunk_train.jsonl": "98181.3"
        }
    }
}
Validating /content/data/ultrachat_chunk_eval.jsonl ...
100% 1/1 [00:00<00:00, 1338.32it/s]
No errors! Data is correctly forma
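
The numbers in the training stats above are related by simple arithmetic, which explains the very large epoch count: the training file holds only 801 tokens, while the run is budgeted for about 78.6M training tokens. Here is a small check using only the values printed above (the variable names are ours):

```python
# Values copied from the validation output above.
data_tokens = 801
train_tokens = 78_643_200
max_steps = 300

tokens_per_step = train_tokens // max_steps  # tokens budgeted per training step
epochs = train_tokens / data_tokens          # passes over the training file

print(tokens_per_step)   # 262144
print(round(epochs, 2))  # 98181.27
```

An epoch count this high suggests that a dataset of this size is mainly useful for checking that the pipeline runs end to end.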

## Start training

In [84]:
# these settings are needed for training
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="0"

In [85]:
# define training configuration
# for your own use cases, you might want to change the data paths, model path, run_dir, and other hyperparameters

config = """
# data
data:
  instruct_data: "/content/data/ultrachat_chunk_train.jsonl"  # Fill
  data: ""  # Optionally fill with pretraining data
  eval_instruct_data: "/content/data/ultrachat_chunk_eval.jsonl"  # Optionally fill

# model
model_id_or_path: "/content/mistral_models"  # Change to downloaded path
lora:
  rank: 64

# optim
# tokens per training step = batch_size x num_GPUs x seq_len
# we recommend a sequence length of 32768
# If you run into a memory error, try reducing the sequence length
seq_len: 8192
batch_size: 1
num_microbatches: 8
max_steps: 100
optim:
  lr: 1.e-4
  weight_decay: 0.1
  pct_start: 0.05

# other
seed: 0
log_freq: 1
eval_freq: 100
no_eval: False
ckpt_freq: 100

save_adapters: True  # save only trained LoRA adapters. Set to `False` to merge LoRA adapter into the base model and save full fine-tuned model

run_dir: "/content/test_ultra"  # Fill
"""

# save the same file locally into the example.yaml file
import yaml
with open('example.yaml', 'w') as file:
    yaml.dump(yaml.safe_load(config), file)


In [86]:
# make sure the run_dir has not been created before
# only run this if you ran torchrun previously and it created the /content/test_ultra directory
! rm -r /content/test_ultra
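
`rm -r` prints an error when the directory does not exist yet. Below is a sketch of an equivalent cleanup in Python that is safe to run in a fresh session, assuming the same `run_dir` as in the config above (`reset_run_dir` is a hypothetical helper):

```python
import shutil
from pathlib import Path

def reset_run_dir(run_dir):
    """Delete run_dir if it exists; return True if something was removed."""
    path = Path(run_dir)
    if path.is_dir():
        shutil.rmtree(path)
        return True
    return False

print(reset_run_dir("/content/test_ultra"))
```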

In [87]:
# start training

!torchrun --nproc-per-node 1 -m train example.yaml

2024-12-30 17:38:03.843269: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-12-30 17:38:03.863303: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-12-30 17:38:03.869287: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-12-30 17:38:03.883813: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
args: TrainArgs(data=DataArgs(data='', shuffl

## Inference

In [None]:
!pip install mistral_inference

Collecting mistral_inference
  Downloading mistral_inference-1.1.0-py3-none-any.whl (21 kB)
Installing collected packages: mistral_inference
Successfully installed mistral_inference-1.1.0


In [None]:
from mistral_inference.transformer import Transformer
from mistral_inference.generate import generate

from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest


tokenizer = MistralTokenizer.from_file("/content/mistral_models/tokenizer.model.v3")  # change to extracted tokenizer file
model = Transformer.from_folder("/content/mistral_models")  # change to extracted model dir
model.load_lora("/content/test_ultra/checkpoints/checkpoint_000100/consolidated/lora.safetensors")

completion_request = ChatCompletionRequest(messages=[UserMessage(content="Explain Machine Learning to me in a nutshell.")])

tokens = tokenizer.encode_chat_completion(completion_request).tokens

out_tokens, _ = generate([tokens], model, max_tokens=64, temperature=0.0, eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id)
result = tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0])

print(result)

Machine learning is a subset of artificial intelligence that involves the use of algorithms to learn from data and make predictions or decisions without being explicitly programmed. It is a type of computer science that enables machines to learn and improve from experience without being explicitly programmed. Machine learning algorithms can learn from data and make predictions or decisions based
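
The adapter path passed to `load_lora` above points at the checkpoint written at step 100. Below is a sketch of building that path from the run directory and step number, assuming the `checkpoint_<6-digit step>` naming seen in this run also holds for other steps (`lora_checkpoint_path` is a hypothetical helper, not a `mistral-finetune` API):

```python
from pathlib import Path

def lora_checkpoint_path(run_dir, step):
    """Build the expected LoRA adapter path for a given training step."""
    # Assumption: directory layout inferred from the single path used above.
    return (
        Path(run_dir)
        / "checkpoints"
        / f"checkpoint_{step:06d}"
        / "consolidated"
        / "lora.safetensors"
    )

path = lora_checkpoint_path("/content/test_ultra", 100)
print(path.as_posix())
# /content/test_ultra/checkpoints/checkpoint_000100/consolidated/lora.safetensors
```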
