<a href="https://colab.research.google.com/github/GeraudBourdin/llm-scripts/blob/main/1_finetuning_mistral_7b_using_autotrain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Fine-tuning Mistral 7b with AutoTrain

**Mise en place**

Un GPU est necessaire pour le fintune de Llama ou Mistral:

- Aller dans `Runtime` (En haut à droite de Colab).
- Selectionner `Change Runtime Type`.
- Choisir un environnement de type `T4 GPU` (ou supérieur).
- Aller dans le menu latéral gauche de la page Collab.
- Ajouter un env: HF_TOKEN et mettez-y votre token "write" d'accès à hugging face.

### Step 1: installation des dépendances

In [2]:
!pip install pandas autotrain-advanced -q

In [None]:
!autotrain setup --update-torch

In [None]:
!pip install huggingface_hub ipywidgets

## Step 2: Connection à votre dépot huggingface pour uploader le model
### Connection Hugging Face
Pour vous assurer que le modèle peut être téléchargé et être utilisé pour l'inférence, il est nécessaire de se connecter au hub Hugging Face.
- Lancez la commande suivant et renseignez votre `token`. Votre token doit avoir la permission en écriture.

In [None]:
from huggingface_hub import notebook_login
notebook_login()

## Step 3: Upload your dataset

Add your data set to the root directory in the Colab under the name train.csv. The AutoTrain command will look for your data there under that name.

#### Don't have a data set and want to try finetuning on an example data set?
If you don't have a dataset you can run these commands below to get an example data set and save it to train.csv

In [37]:
# @title dépot du dataset

# Importation des bibliothèques
from huggingface_hub import hf_hub_download
import ipywidgets as widgets
from IPython.display import display
import shutil
import os
from google.colab import userdata



# Création de widgets pour la saisie de l'utilisateur
repo_input = widgets.Text(
    value=userdata.get('DATASET_DEPOT_NAME'),
    placeholder='Entrez le nom du dépôt',
    description='Dépôt:',
    disabled=False
)

file_input = widgets.Text(
    value='',
    placeholder='Entrez le nom du fichier',
    description='Fichier:',
    disabled=False
)

button = widgets.Button(description="Télécharger")

# Fonction à exécuter lors du clic sur le bouton
def on_button_clicked(b):
    repo_id = repo_input.value
    filename = file_input.value
    file_path = hf_hub_download(repo_id=repo_id, filename=filename, repo_type="dataset")
    print(f"Le fichier a été téléchargé à l'emplacement : {file_path}")
    # Déplacer et renommer le fichier
    new_file_path = 'train.csv'
    # Si le chemin est un lien symbolique -> chemin réel du fichier
    if os.path.islink(file_path):
        real_path = os.readlink(file_path)
    else:
        real_path = file_path

    shutil.move(real_path, new_file_path)
    print(f"Le fichier a été déplacé à l'emplacement : {new_file_path}")

button.on_click(on_button_clicked)

# Affichage des widgets
display(repo_input, file_input, button)


Text(value='Bourdin/dataset', description='Dépôt:', placeholder='Entrez le nom du dépôt')

Text(value='', description='Fichier:', placeholder='Entrez le nom du fichier')

Button(description='Télécharger', style=ButtonStyle())

In [30]:
%ls -la
%cat train.csv


total 16
drwxr-xr-x 1 root root 4096 Jan 23 14:30 [0m[01;34m.[0m/
drwxr-xr-x 1 root root 4096 Jan 23 13:51 [01;34m..[0m/
drwxr-xr-x 4 root root 4096 Jan 19 14:19 [01;34m.config[0m/
drwxr-xr-x 1 root root 4096 Jan 19 14:20 [01;34msample_data[0m/
lrwxrwxrwx 1 root root   52 Jan 23 14:30 [01;36mtrain.csv[0m -> ../../blobs/4e073f9876c9c52a83cb7213c0f751b5f06376f0
cat: train.csv: No such file or directory


In [None]:
!git clone https://github.com/joshbickett/finetune-llama-2.git
%cd finetune-llama-2
%mv train.csv ../train.csv
%cd ..

In [24]:
import pandas as pd
df = pd.read_csv("train.csv@")
df

FileNotFoundError: [Errno 2] No such file or directory: 'train.csv@'

In [None]:
df['text'][15]

## Step 4: Overview of AutoTrain command

#### Short overview of what the command flags do.

- `!autotrain`: Command executed in environments like a Jupyter notebook to run shell commands directly. `autotrain` is an automatic training utility.

- `llm`: A sub-command or argument specifying the type of task

- `--train`: Initiates the training process.

- `--project_name`: Sets the name of the project

- `--model abhishek/llama-2-7b-hf-small-shards`: Specifies original model that is hosted on Hugging Face named "llama-2-7b-hf-small-shards" under the "abhishek".

- `--data_path .`: The path to the dataset for training. The "." refers to the current directory. The `train.csv` file needs to be located in this directory.

- `--use_int4`: Use of INT4 quantization to reduce model size and speed up inference times at the cost of some precision.

- `--learning_rate 2e-4`: Sets the learning rate for training to 0.0002.

- `--train_batch_size 12`: Sets the batch size for training to 12.

- `--num_train_epochs 3`: The training process will iterate over the dataset 3 times.

### Steps needed before running
Go to the `!autotrain` code cell below and update it by following the steps below:

1. After `--project_name` replace `*enter-a-project-name*` with the name that you'd like to call the project
2. After `--repo_id` replace `*username*/*repository*`. Replace `*username*` with your Hugging Face username and `*repository*` with the repository name you'd like it to be created under. You don't need to create this repository before hand, it will automatically be created and uploaded once the training is completed.
3. Confirm that `train.csv` is in the root directory in the Colab. The `--data_path .` flag will make it so that AutoTrain looks for your data there.
4. Make sure to add the LoRA Target Modules to be trained `--target-modules q_proj, v_proj`
5. Once you've made these changes you're all set, run the command below!

In [None]:
!autotrain llm --train --project_name mistral-7b-mj-finetuned --model bn22/Mistral-7B-Instruct-v0.1-sharded --data_path . --use_peft --use_int4 --learning_rate 2e-4 --train_batch_size 12 --num_train_epochs 3 --trainer sft --target_modules q_proj,v_proj --push_to_hub --repo_id ashishpatel26/mistral-7b-mj-finetuned

## Step 5: Completed 🎉
After the command above is completed your Model will be uploaded to Hugging Face.

#### Learn more about AutoTrain (optional)
If you want to learn more about what command-line flags are available

## Step 6: Inference Engine

In [None]:
!autotrain llm -h

In [None]:
!pip install -q peft  accelerate bitsandbytes safetensors

In [None]:
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import transformers
adapters_name = "ashishpatel26/mistral-7b-mj-finetuned"
model_name = "bn22/Mistral-7B-Instruct-v0.1-sharded" #"mistralai/Mistral-7B-Instruct-v0.1"


device = "cuda" # the device to load the model onto

In [None]:
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

In [None]:
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_4bit=True,
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config,
    device_map='auto'
)

Loading checkpoint shards:   0%|          | 0/11 [00:00<?, ?it/s]

## Step 7: Peft Model Loading with upload model

In [None]:
model = PeftModel.from_pretrained(model, adapters_name)

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.bos_token_id = 1

stop_token_ids = [0]

print(f"Successfully loaded the model {model_name} into memory")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Successfully loaded the model bn22/Mistral-7B-Instruct-v0.1-sharded into memory


In [None]:
text = "[INST] generate a midjourney prompt for A person walks in the rain [/INST]"

encoded = tokenizer(text, return_tensors="pt", add_special_tokens=False)
model_input = encoded
model.to(device)
generated_ids = model.generate(**model_input, max_new_tokens=200, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


[INST] generate a midjourney prompt for A person walks in the rain [/INST] "As you wander through the pouring rain, you can't help but wonder what the world would be like if things were different. What if the rain was a symbol of the turmoil in your life, and the sunshine promised a brighter future? What if you suddenly found yourself lost in a small town where time stood still, and the people were trapped in a time loop? As you struggle to find your way back to reality, you discover a mysterious stranger who seems to hold the key to unlocking the secrets of the town and your own past."</s>
