<span style="font-size: 30px;">**Fine-tuning Open Source LLMs with Mistral**</span>


**Objectives:** 
- *Learn how to load LLMs from Hugging Face*
- *Fine-tune an open-source model (GPT-2 and Mistral 7B) using Hugging Face Datasets*
- *Use our fine-tuned model for a new task*

You can learn more about the fine-tuning process at the following links:
* [An Introductory Guide to Fine-Tuning LLMs](https://www.datacamp.com/tutorial/fine-tuning-large-language-models)
* [The Best Strategies for Fine-Tuning Large Language Models](https://www.kdnuggets.com/the-best-strategies-for-fine-tuning-large-language-models)
* [A Comprehensive Guide to Working with the Mistral Large Model](https://www.datacamp.com/tutorial/guide-to-working-with-the-mistral-large-model)

Let's start by understanding our main goal:
 - Fine-tuning a pre-trained model to improve its performance on a specific task. To do so, we will follow these steps:
    - **STEP 1**: Having our concrete objective clear
    - **STEP 2**: Choose a pre-trained model and a dataset
    - **STEP 3**: Load the data to use
    - **STEP 4**: Tokenizer
    - **STEP 5**: Initialize our base model
    - **STEP 6**: Evaluate method
    - **STEP 7**: Fine-tune using the Trainer Method


**Requirements**

For this tutorial, the following libraries are needed:
- The `transformers` library, which we will use throughout the whole tutorial.
- Either `pytorch` or `tensorflow` for the fine-tuning. (This notebook is implemented with `pytorch`.)
- The `huggingface_hub` library, to push the fine-tuned model to Hugging Face.
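
Before running the install cell, it can help to check which of these libraries are already present. The sketch below is an illustration only: the helper name `missing_packages` is hypothetical, and the list uses import names rather than pip package names.

```python
import importlib.util


def missing_packages(import_names):
    """Return the import names that cannot be found in the current environment."""
    return [
        name for name in import_names
        if importlib.util.find_spec(name) is None
    ]


# Import names of the libraries used in this tutorial
required = ["transformers", "datasets", "evaluate", "torch", "huggingface_hub"]
print("Missing:", missing_packages(required))
```

An empty list means every library can be imported; any name that is printed still needs a `%pip install`.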


In [1]:
%pip install numpy pandas
%pip install transformers datasets evaluate
%pip install scikit-learn
%pip install tensorflow torch
%pip install huggingface_hub
%pip install -U 'accelerate==0.27.2'

import pandas as pd
import numpy as np

## STEP 1 - Having our concrete objective clear

## STEP 2 - Choose a pre-trained model and a dataset

## STEP 3 - Load the data to use

In [2]:
from datasets import load_dataset

dataset = load_dataset("mteb/tweet_sentiment_extraction")
df = pd.DataFrame(dataset['train'])

df

Unnamed: 0,id,text,label,label_text
0,cb774db0d1,"I`d have responded, if I were going",1,neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,0,negative
2,088c60f138,my boss is bullying me...,0,negative
3,9642c003ef,what interview! leave me alone,0,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...",0,negative
...,...,...,...,...
27476,4eac33d1c0,wish we could come see u on Denver husband l...,0,negative
27477,4f4c4fc327,I`ve wondered about rake to. The client has ...,0,negative
27478,f67aae2310,Yay good for both of you. Enjoy the break - y...,2,positive
27479,ed167662a5,But it was worth it ****.,2,positive


## STEP 4 - Tokenizer

In [3]:
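from transformers import GPT2Tokenizer

# Load the dataset used to train our model (the same dataset as in STEP 3)
dataset = load_dataset("mteb/tweet_sentiment_extraction")

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# GPT-2 has no padding token, so we reuse the end-of-sequence token for padding
tokenizer.pad_token = tokenizer.eos_token


def tokenize_function(examples):
    # Pad or truncate every tweet to the tokenizer's maximum length
    return tokenizer(examples["text"], padding="max_length", truncation=True)


# Tokenize every split in batches
tokenized_datasets = dataset.map(tokenize_function, batched=True)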
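
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a toy sketch in plain Python. It is not the tokenizer's actual implementation: the function `pad_or_truncate` and the token ids are hypothetical, and it assumes right-side padding. With the GPT-2 tokenizer the maximum length is 1024 tokens, so every tweet is padded to that size unless an explicit `max_length` is passed.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Toy version of padding="max_length" with truncation=True (right-side padding)."""
    # Truncation: keep at most max_length tokens
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [1] * len(kept) + [0] * n_pad
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids; 50256 is GPT-2's end-of-sequence id, reused here as padding
short = pad_or_truncate([40, 1101, 3772], max_length=5, pad_id=50256)
long = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5, pad_id=50256)
print(short)
print(long)
```

Short inputs gain padding ids and zeros in the attention mask, while long inputs lose their trailing tokens.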

In [4]:
# Small 100-example subsets, useful for quick experiments before a full run
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(100))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(100))

## STEP 5 - Initialize our base model

In [5]:
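from transformers import GPT2ForSequenceClassification

model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=3)
# GPT-2 defines no padding token; without this, batches larger than 1 raise an error
model.config.pad_token_id = model.config.eos_token_id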

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']

You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
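
The value `num_labels=3` matches the three sentiment classes visible in the DataFrame preview of STEP 3. As a small sketch, the mapping between label ids and label names can be rebuilt from a few of those rows. The variable names are illustrative. Passing such mappings to the model as `id2label` and `label2id` is an optional extra, not something the original notebook does.

```python
# (label, label_text) pairs as they appear in the DataFrame preview of STEP 3
preview_rows = [(1, "neutral"), (0, "negative"), (0, "negative"), (2, "positive")]

# Deduplicate the pairs and order them by label id
id2label = dict(sorted(set(preview_rows)))
label2id = {text: idx for idx, text in id2label.items()}

print(id2label)
print(label2id)
```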


## STEP 6 - Evaluate method

In [6]:
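import evaluate

metric = evaluate.load("accuracy")


def compute_metrics(eval_pred):
    # eval_pred holds the model logits and the reference labels
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)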
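
As a worked example of what `compute_metrics` calculates, the sketch below repeats the argmax-then-accuracy logic on made-up logits. It uses plain Python lists so that it runs without any dependency. The helper names and numbers are illustrative and not part of the notebook.

```python
def argmax(row):
    """Index of the largest value in a list (what np.argmax does per row)."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Made-up logits for four tweets over (negative, neutral, positive)
toy_logits = [
    [0.1, 2.0, -1.0],   # predicted 1 (neutral)
    [1.5, 0.2, 0.3],    # predicted 0 (negative)
    [0.0, 0.1, 3.0],    # predicted 2 (positive)
    [2.2, 0.1, 0.4],    # predicted 0 (negative)
]
toy_labels = [1, 0, 2, 2]

result = accuracy(toy_logits, toy_labels)
print(result)  # {'accuracy': 0.75}
```

Three of the four predictions match their labels, hence an accuracy of 0.75.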

## STEP 7 - Fine-tune using the Trainer Method

In [7]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    compute_metrics=compute_metrics,  # report accuracy at each evaluation
)
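
The progress bar of the training run reports 20613 total steps. The arithmetic behind that number can be sketched as follows, assuming a single device, no gradient accumulation, and a training split of 27,481 rows. The row count is an assumption that is consistent with the reported total, and the function name is hypothetical.

```python
import math


def total_training_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for one device with no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


# Assumption: the full training split has 27,481 rows
full_run = total_training_steps(27481, batch_size=4, num_epochs=3)
# The 100-example subset built in STEP 4
small_run = total_training_steps(100, batch_size=4, num_epochs=3)

print(full_run)   # 20613
print(small_run)  # 75
```

Passing `small_train_dataset` and `small_eval_dataset` to the `Trainer` instead of the full splits would therefore shorten a trial run considerably.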




In [9]:
trainer.train()

  0%|          | 0/20613 [01:46<?, ?it/s]


RuntimeError: Placeholder storage has not been allocated on MPS device!
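
This error comes from PyTorch's MPS backend on Apple-silicon machines. A frequent cause is that the model and its input tensors are not on the same device, although that is an assumption about this particular run rather than a diagnosis. A first debugging step is to check which device PyTorch would select. The helper `pick_device` below is a hypothetical sketch that falls back to `"cpu"` when PyTorch is not installed.

```python
import importlib.util


def pick_device():
    """Return "cuda", "mps" or "cpu", depending on what PyTorch reports as available."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch not installed; nothing else to choose from
    import torch

    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)  # absent in older PyTorch releases
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print("Selected device:", device)
```

If the result is `mps`, comparing `model.device` with the device of a batch can help confirm or rule out the mismatch.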

In [None]:
# Evaluate the fine-tuned model on the test split
trainer.evaluate()

RuntimeError: Placeholder storage has not been allocated on MPS device!