# **Finetuning Phi-2**

# I. Libraries

The key imports, with a basic usage example for each:

#### datasets
- **Purpose**: library to easily load and process datasets from HF.
- **Basic Usage**:
  ```python
  from datasets import load_dataset, load_from_disk
  dataset = load_dataset('path/to/dataset', split='train')
  ```

#### peft
- **Purpose**: provides utilities for parameter-efficient fine-tuning.
- **Basic Usage**:
  ```python
  from peft import LoraConfig, prepare_model_for_kbit_training
  lora_config = LoraConfig()
  ```

#### transformers
- **Purpose**: provides classes and functions for transformer models.
- **Basic Usage**:
  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, HfArgumentParser, TrainingArguments
  model = AutoModelForCausalLM.from_pretrained('model_name')
  tokenizer = AutoTokenizer.from_pretrained('model_name')
  ```

#### trl
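- **Purpose**: provides trainers for fine-tuning transformer language models, including supervised fine-tuning (`SFTTrainer`, used here) and reinforcement-learning-based methods.
- **Basic Usage**:
  ```python
  from trl import SFTTrainer
  # Pass arguments by keyword; the second positional argument is not the tokenizer.
  trainer = SFTTrainer(model=model, tokenizer=tokenizer, train_dataset=train_dataset)
  ```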

In [1]:
!pip install -q torch peft bitsandbytes scipy trl transformers accelerate einops tqdm huggingface_hub --use-deprecated=legacy-resolver

ERROR: pip's legacy dependency resolver does not consider dependency conflicts when selecting packages. This behaviour is the source of the following dependency conflicts.
kfp 2.5.0 requires google-cloud-storage<3,>=2.2.1, but you'll have google-cloud-storage 1.44.0 which is incompatible.

In [2]:
#from dataclasses import dataclass, fields
#from typing import Optional

In [26]:
import os
import torch
import pandas as pd
from datasets import load_dataset, load_from_disk
from peft import LoraConfig, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, HfArgumentParser, TrainingArguments 
from trl import SFTTrainer
from huggingface_hub import notebook_login
from tqdm.notebook import tqdm
from sklearn.model_selection import train_test_split

In [4]:
print(f"pytorch version {torch.__version__}")

pytorch version 2.1.2


In [5]:
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f"working on {device}")

working on cuda:0


In [6]:
### log in to Hugging Face account
notebook_login()


# II. Dataset

In [27]:
dataset = load_dataset("Amod/mental_health_counseling_conversations", split="train")
dataset

Dataset({
    features: ['Context', 'Response'],
    num_rows: 3512
})

In [28]:
df = pd.DataFrame(dataset)
df.head()

|   | Context | Response |
|---|---------|----------|
| 0 | I'm going through some things with my feelings... | If everyone thinks you're worthless, then mayb... |
| 1 | I'm going through some things with my feelings... | Hello, and thank you for your question and see... |
| 2 | I'm going through some things with my feelings... | First thing I'd suggest is getting the sleep y... |
| 3 | I'm going through some things with my feelings... | Therapy is essential for those that are feelin... |
| 4 | I'm going through some things with my feelings... | I first want to let you know that you are not ... |


In [29]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3512 entries, 0 to 3511
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Context   3512 non-null   object
 1   Response  3512 non-null   object
dtypes: object(2)
memory usage: 55.0+ KB


In [30]:
def format_row(row):
    question = row["Context"]
    response = row["Response"]
    formatted_string = f"[INST] {question} [/INST] {response}"
    return formatted_string

df["Text"] = df.apply(format_row, axis=1)
df.head(3)

|   | Context | Response | Text |
|---|---------|----------|------|
| 0 | I'm going through some things with my feelings... | If everyone thinks you're worthless, then mayb... | [INST] I'm going through some things with my f... |
| 1 | I'm going through some things with my feelings... | Hello, and thank you for your question and see... | [INST] I'm going through some things with my f... |
| 2 | I'm going through some things with my feelings... | First thing I'd suggest is getting the sleep y... | [INST] I'm going through some things with my f... |
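
At inference time the prompt would normally follow the same template, stopping after `[/INST]` so that the model generates the response. A minimal sketch of that idea; `build_prompt` is a hypothetical helper that does not appear in the original notebook, and the example strings are made up:

```python
def format_example(question: str, response: str) -> str:
    """Training-time template, same string layout as format_row above."""
    return f"[INST] {question} [/INST] {response}"


def build_prompt(question: str) -> str:
    """Hypothetical inference-time helper: same template, response left open."""
    return f"[INST] {question} [/INST]"


example = format_example("I can't sleep.", "Try a regular bedtime.")
prompt = build_prompt("I can't sleep.")

# The training text starts with exactly the prompt the model will see later.
print(example.startswith(prompt))
```

Keeping both strings in one place makes it harder for the training and inference formats to drift apart.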


In [31]:
new_df = df[["Text"]]
train, test = train_test_split(new_df, test_size=0.2, shuffle=True)
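
For reference, the split sizes can be worked out by hand. The sketch below assumes scikit-learn's rule for a float `test_size` (test size rounded up, the remainder used for training); `split_sizes` is a hypothetical helper written for illustration only:

```python
import math


def split_sizes(n_samples: int, test_size: float) -> tuple:
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test set size is ceil(test_size * n_samples) and the
    training set takes whatever is left.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


n_train, n_test = split_sizes(3512, 0.2)
print(n_train, n_test)
```

These counts agree with the 2809 and 703 examples reported when the trainer maps the datasets further down. Passing a fixed `random_state` to `train_test_split` would make the shuffle reproducible across runs.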

In [32]:
train.to_csv("train_data.csv", index=False)
test.to_csv("test_data.csv", index=False)

In [34]:
train_dataset = load_dataset("csv", data_files="train_data.csv", split="train")
test_dataset  = load_dataset("csv", data_files="test_data.csv", split="train")

In [37]:
train_dataset

Dataset({
    features: ['Text'],
    num_rows: 2809
})

# III. Training

In [13]:
base_model = "microsoft/phi-2"
new_model  = "phi2-ft-mental-health"

## Tokenizer

1. **Loading the Tokenizer** (`AutoTokenizer.from_pretrained`): loads a pre-trained tokenizer for the specified model.
     - **Parameters**:
       - `base_model`: identifier of the pre-trained model (e.g., a model name or path).
       - `use_fast=True`: requests the fast (Rust-backed) tokenizer implementation, which is generally quicker than the pure-Python one.

2. **Setting the Padding Token** (`tokenizer.pad_token = tokenizer.eos_token`): reuses the end-of-sequence (EOS) token as the padding token. This is a common workaround when a tokenizer does not define a dedicated padding token, so that sequences of unequal length can still be padded into one batch.

3. **Specifying the Padding Side** (`tokenizer.padding_side = "right"`): specifies which side of the sequence the padding tokens are added to.
   - **Options**:
     - `"right"`: Padding tokens are added to the right side of the sequence.
     - `"left"`: Padding tokens are added to the left side of the sequence.
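
The effect of the padding side can be illustrated without a tokenizer. The sketch below is plain Python, not the `transformers` implementation; `pad` is a hypothetical helper and the pad id `0` is a placeholder value:

```python
def pad(ids, length, pad_id, side="right"):
    """Pad a list of token ids to `length` and return (ids, attention_mask)."""
    n_pad = max(0, length - len(ids))
    padding = [pad_id] * n_pad
    real_mask = [1] * len(ids)   # 1 marks real tokens
    pad_mask = [0] * n_pad       # 0 marks padding
    if side == "right":
        return ids + padding, real_mask + pad_mask
    if side == "left":
        return padding + ids, pad_mask + real_mask
    raise ValueError(f"unknown padding side: {side!r}")


PAD_ID = 0  # placeholder; the notebook reuses the EOS token id instead
right_ids, right_mask = pad([5, 6, 7], 5, PAD_ID, side="right")
left_ids, left_mask = pad([5, 6, 7], 5, PAD_ID, side="left")
print(right_ids, right_mask)
print(left_ids, left_mask)
```

Right padding is the usual choice for training, as in this notebook. Left padding is typically preferred for batched generation with decoder-only models, so that new tokens continue directly from the real text.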

In [14]:
tokenizer = AutoTokenizer.from_pretrained(base_model, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


## QLoRA configs


#### BitsAndBytesConfig

1. **Class**: `BitsAndBytesConfig`
2. **Parameters**:
   - `load_in_4bit=True`: loads the model weights in 4-bit precision, which substantially reduces GPU memory usage. The weights are dequantized to the compute dtype on the fly, so this saves memory rather than guaranteeing faster computation.
   - `bnb_4bit_quant_type="nf4"`: specifies the 4-bit data type used for quantization.
     - **Options**: 
       - `"nf4"`: 4-bit NormalFloat, a data type designed for normally distributed weights and the one introduced with QLoRA.
       - `"fp4"`: a standard 4-bit floating-point format.
   - `bnb_4bit_compute_dtype=torch.float16`: data type the quantized weights are dequantized to for matrix multiplications.
     - **Options**: 
       - `torch.float16`: 16-bit floating point; faster computation with less memory usage.
       - `torch.bfloat16`: 16-bit floating point with a wider exponent range, on GPUs that support it.
       - `torch.float32`: 32-bit floating point; higher precision but slower and more memory-hungry.
   - `bnb_4bit_use_double_quant=True`: enables double (nested) quantization, which also quantizes the quantization constants to save a little more memory (`True`: enabled, as in the cell below; `False`: disabled). It is a memory optimization, not a precision improvement.
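
To get a feel for what these settings buy, here is a rough weights-only memory estimate. It is a back-of-the-envelope sketch under stated assumptions: about 2.7 billion parameters for Phi-2, a quantization block of 64 weights with one 32-bit constant per block, and, for double quantization, 8-bit constants with one extra 32-bit constant per 256 blocks (the figures described in the QLoRA paper). Activations, optimizer state, and layers left unquantized are ignored:

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory needed to store the weights only, in gigabytes (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 2.7e9   # assumption: approximate Phi-2 parameter count
BLOCK = 64         # assumption: weights per quantization block

# Overhead of the quantization constants, expressed in bits per weight.
const_bits_single = 32 / BLOCK                       # one fp32 constant per block
const_bits_double = 8 / BLOCK + 32 / (BLOCK * 256)   # constants quantized again

fp16 = weight_memory_gb(N_PARAMS, 16)
nf4_single = weight_memory_gb(N_PARAMS, 4 + const_bits_single)
nf4_double = weight_memory_gb(N_PARAMS, 4 + const_bits_double)

print(f"fp16 weights:            {fp16:.2f} GB")
print(f"4-bit, single quant:     {nf4_single:.2f} GB")
print(f"4-bit, double quant:     {nf4_double:.2f} GB")
```

Under these assumptions, most of the saving comes from going to 4 bits at all; double quantization trims the remaining overhead of the constants.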


#### LoraConfig

1. **Class**: `LoraConfig`
2. **Parameters**:
   - `r=16`: rank of the low-rank matrices in LoRA, controlling the capacity of the adaptation.
     - **Options**: integer values (e.g., 4, 8, 16, 32). Higher values increase capacity but add trainable parameters and computation.
   - `lora_alpha=16`: scaling factor for the low-rank update, which is multiplied by `lora_alpha / r`. It acts much like a learning-rate multiplier on the adapter.
     - **Options**: integer values (e.g., 16, 32, 64, 128). Higher values lead to larger updates for a given rank.
   - `lora_dropout=0.1`: dropout rate applied to the input of the LoRA layers to reduce overfitting.
     - **Options**: float values between 0 and 1 (e.g., 0.05, 0.1, 0.2). Higher values increase regularization.
   - `bias="none"`: specifies which bias terms are trained.
     - **Options**: 
       - `"none"`: no bias terms are trained.
       - `"all"`: all bias terms are trained.
       - `"lora_only"`: only the bias terms of the layers that receive LoRA adapters are trained.
   - `task_type="CAUSAL_LM"`: type of task LoRA is being used for.
     - **Options**: 
       - `"CAUSAL_LM"`: Causal Language Modeling, for autoregressive tasks.
       - `"SEQ_2_SEQ_LM"`: Sequence-to-Sequence Language Modeling, for translation or summarization tasks.
   - `target_modules=["Wqkv", "fc1", "fc2"]`: model layers where LoRA will be applied.
     - **Options**: list of module names (e.g., `["Wqkv"]`, `["fc1"]`, `["fc2"]`). The names must match modules in the loaded model, so they are specific to the architecture (and model code revision) being fine-tuned.
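
The cost of a given rank can be estimated by hand: a LoRA adapter on a linear layer of shape `d_in x d_out` adds two matrices with `r * (d_in + d_out)` weights in total. The sketch below applies this to the three target modules with `r=16` and `lora_alpha=16`, the values used in the cell below. The layer dimensions (hidden size 2560, MLP size 10240, 32 blocks, fused QKV projection of width `3 * hidden`) are assumptions about Phi-2 and are not taken from the notebook:

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable weights added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Assumed Phi-2 dimensions, for illustration only.
HIDDEN, MLP, LAYERS = 2560, 10240, 32
shapes = {
    "Wqkv": (HIDDEN, 3 * HIDDEN),  # fused query/key/value projection
    "fc1": (HIDDEN, MLP),
    "fc2": (MLP, HIDDEN),
}

r, lora_alpha = 16, 16
per_layer = sum(lora_params(d_in, d_out, r) for d_in, d_out in shapes.values())
total = per_layer * LAYERS
scaling = lora_alpha / r  # factor applied to the low-rank update

print(f"trainable LoRA weights: {total:,}")
print(f"update scaling (lora_alpha / r): {scaling}")
```

With these assumed shapes the adapters amount to well under 1% of a 2.7-billion-parameter model, which is what makes the approach parameter-efficient.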

In [15]:
bnb_configs = BitsAndBytesConfig(   load_in_4bit=True,
                                    bnb_4bit_quant_type="nf4",
                                    bnb_4bit_compute_dtype=torch.float16,
                                    bnb_4bit_use_double_quant=True
                                )

peft_configs = LoraConfig(  r=16,
                            lora_alpha=16,
                            lora_dropout=0.1,
                            bias="none",
                            task_type="CAUSAL_LM",
                            target_modules=["Wqkv", "fc1", "fc2"]
                         )

## Model init and configs


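1. **Initializing the Model** (`AutoModelForCausalLM.from_pretrained`): loads a pre-trained causal language model.
   - **Parameters**:
     - `base_model`: identifier of the pre-trained model (e.g., model name or path).
     - `quantization_config=bnb_configs`: applies the 4-bit quantization settings defined above while loading.
     - `torch_dtype=torch.float16`: data type for the weights that are not quantized.
     - `trust_remote_code=True`: allows the modeling code shipped in the model repository to be downloaded and executed.
     - `flash_attn=True`: asks the model code to use Flash Attention, a faster and more memory-efficient attention kernel.
     - `flash_rotary=True`: asks the model code to use a fused kernel for rotary position embeddings.
     - `fused_dense=True`: uses fused dense operations, combining multiple operations into a single kernel for efficiency.
     - `low_cpu_mem_usage=True`: reduces peak CPU memory while the checkpoint is being loaded.
     - `device_map=device`: places the whole model on the selected device (here `cuda:0`); `device_map={"": 0}` is another way to put every module on GPU 0.
     - `revision="refs/pr/23"`: loads the model repository at a specific revision, here the Hub pull-request ref `refs/pr/23`.

   The three `flash_*`/`fused_*` flags appear to be options of the custom modeling code in this revision rather than general `transformers` arguments, so they may be ignored or rejected by other revisions.

2. **Model configs**:
   - `model.config.use_cache = False`: disables the key/value cache used to speed up generation. It is not needed during training and conflicts with gradient checkpointing.
   - `model.config.pretraining_tp = 1`: sets the tensor-parallelism degree that some architectures (e.g., Llama) record from pretraining. It is often copied from Llama fine-tuning recipes and probably has no effect for Phi-2.

3. **Preparing the Model for k-bit training** (`prepare_model_for_kbit_training`): prepares a quantized model for training by freezing the base weights, upcasting selected layers (such as layer norms) to float32 for stability, and making the inputs require gradients.
   - **Parameters**:
     - `model`: the model instance to prepare for training.
     - `use_gradient_checkpointing=True`: enables gradient checkpointing, which trades extra computation for lower activation memory during training.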

In [16]:
model = AutoModelForCausalLM.from_pretrained(
                                base_model,
                                quantization_config=bnb_configs,
                                torch_dtype=torch.float16,
                                trust_remote_code=True,
                                flash_attn=True,
                                flash_rotary=True,
                                fused_dense=True,
                                low_cpu_mem_usage=True,
                                device_map=device,
                                revision="refs/pr/23"
                                )

#model.to(device)
model.config.use_cache = False
model.config.pretraining_tp = 1

model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)


You are using an old version of the checkpointing format that is deprecated (We will also silently ignore `gradient_checkpointing_kwargs` in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method `_set_gradient_checkpointing` in your model.


## Model training

**Training arguments Parameters** (`TrainingArguments`):
 - `num_train_epochs=1`: number of passes over the entire training dataset.
 - `per_device_train_batch_size=1`: number of training samples processed at once on each device (GPU or CPU).
 - `gradient_accumulation_steps=4`: number of batches whose gradients are accumulated before each optimizer update. A backward pass still runs for every batch.
   - **Purpose**: gives a larger effective batch size (`per_device_train_batch_size * gradient_accumulation_steps` per device) than memory allows, useful when GPU memory is limited.
   - **Options**: integer values (e.g., 1, 2, 4, 8, etc.).
 - `evaluation_strategy="steps"`: determines when to perform evaluation during training.
   - **Purpose**: specifies the strategy for evaluating the model during training, based on steps, epochs, or no evaluation.
   - **Options**: `"no"` (no evaluation), `"steps"` (evaluate every `eval_steps`), `"epoch"` (evaluate at the end of each epoch).
 - `eval_steps=10`: interval in optimizer steps between evaluations when `evaluation_strategy="steps"`.
 - `logging_steps=10`: interval in optimizer steps for logging training metrics to the console or files.
 - `optim="paged_adamw_8bit"`: optimizer used for training, here **paged AdamW with 8-bit optimizer states** (provided by `bitsandbytes`).
   - **Options**: e.g., `"adamw_torch"`, `"paged_adamw_32bit"`, `"paged_adamw_8bit"`; the full list depends on the installed `transformers` version.
 - `learning_rate=2e-4`: peak learning rate for the optimizer, reached at the end of warmup.
   - **Purpose**: controls the step size of each parameter update.
   - **Options**: float values (e.g., 0.001, 0.0001, etc.).
 - `lr_scheduler_type="cosine"`: type of learning rate scheduler applied during training.
   - **Purpose**: adjusts the learning rate during training to help convergence.
   - **Options**: `"linear"`, `"cosine"`, `"cosine_with_restarts"`, `"polynomial"`, `"constant"`, `"constant_with_warmup"`, etc.
 - `save_steps=100`: interval in optimizer steps between model checkpoints.
 - `warmup_ratio=0.05`: fraction of the total training steps during which the learning rate is increased linearly from 0 to `learning_rate`.
   - **Purpose**: reduces the risk of divergence in the early stages of training by increasing the learning rate slowly.
   - **Options**: float values between 0 and 1 (e.g., 0.1, 0.05, etc.).
 - `weight_decay=0.01`: strength of weight decay regularization applied to the model parameters during optimization.
   - **Purpose**: helps prevent overfitting by penalizing large weights.
   - **Options**: float values (e.g., 0.001, 0.01, etc.).
 - `max_steps=-1`: maximum number of optimizer steps; `-1` means no explicit cap, so the length of training is set by `num_train_epochs`.
   - **Purpose**: when set to a positive value, it overrides `num_train_epochs` and stops training after that many steps.
   - **Options**: a positive integer, or `-1` to rely on `num_train_epochs`.
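
The interaction between `learning_rate`, `warmup_ratio`, and `lr_scheduler_type="cosine"` can be sketched in a few lines. This is a simplified stand-alone version of a linear-warmup plus half-cosine-decay schedule, not the library's own scheduler; `lr_at` is a hypothetical helper, and the total of 702 steps is taken from the training output of this run:

```python
import math


def lr_at(step: int, total_steps: int, base_lr: float = 2e-4,
          warmup_ratio: float = 0.05) -> float:
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Half-cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 702
for step in (0, 18, 36, 369, 702):
    print(step, f"{lr_at(step, TOTAL_STEPS):.2e}")
```

With these settings the warmup covers the first 36 steps, the peak of `2e-4` is reached at step 36, and the rate decays to zero by the final step.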

In [38]:
training_args = TrainingArguments(  output_dir="./mental_health",
                                    num_train_epochs=1,
                                    per_device_train_batch_size=1,
                                    gradient_accumulation_steps=4,
                                  #gradient_checkpointing=True,
                                    evaluation_strategy="steps",
                                    eval_steps=10,
                                    logging_steps=10,
                                    optim="paged_adamw_8bit",
                                    learning_rate=2e-4,
                                    lr_scheduler_type="cosine",
                                    save_steps=100,
                                    warmup_ratio=0.05,
                                    weight_decay=0.01,
                                    max_steps=-1
                                ) 

trainer = SFTTrainer(   model=model,
                        train_dataset=train_dataset,
                        eval_dataset=test_dataset,
                        peft_config=peft_configs,
                        dataset_text_field="Text",
                        max_seq_length=690,
                        tokenizer=tokenizer,
                        args=training_args
                    )


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.


Map:   0%|          | 0/2809 [00:00<?, ? examples/s]

Map:   0%|          | 0/703 [00:00<?, ? examples/s]

In [39]:
trainer.train()

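| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 10 | 2.6924 | 2.593595 |
| 20 | 2.663 | 2.488709 |
| 30 | 2.525 | 2.401745 |
| 40 | 2.4958 | 2.382176 |
| 50 | 2.4296 | 2.370374 |
| 60 | 2.4096 | 2.358821 |
| 70 | 2.4432 | 2.353161 |
| 80 | 2.3572 | 2.345873 |
| 90 | 2.3374 | 2.341271 |
| 100 | 2.3786 | 2.334938 |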




You are using a model of type phi to instantiate a model of type phi-msft. This is not supported for all configurations of models and can yield errors.

TrainOutput(global_step=702, training_loss=2.285884678533614, metrics={'train_runtime': 31115.9408, 'train_samples_per_second': 0.09, 'train_steps_per_second': 0.023, 'total_flos': 1.2790215975936e+16, 'train_loss': 2.285884678533614, 'epoch': 0.9996440014239943})
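
The `global_step` and `epoch` values in this output follow from the batch settings. The sketch below reproduces them under the assumption of a single device and a final incomplete accumulation group that is not counted as a step; `steps_per_epoch` is a hypothetical helper, not a `transformers` function:

```python
import math


def steps_per_epoch(n_examples: int, per_device_batch: int,
                    grad_accum: int, n_devices: int = 1) -> int:
    """Optimizer updates in one epoch (assumes leftover batches are not counted)."""
    batches = math.ceil(n_examples / (per_device_batch * n_devices))
    return max(batches // grad_accum, 1)


N_TRAIN = 2809       # training examples reported by the Map progress line
BATCH, ACCUM = 1, 4  # per_device_train_batch_size, gradient_accumulation_steps

steps = steps_per_epoch(N_TRAIN, BATCH, ACCUM)
effective_batch = BATCH * ACCUM
epoch_fraction = steps * effective_batch / N_TRAIN

print(steps, effective_batch, epoch_fraction)
```

The run therefore performs 702 updates with an effective batch size of 4, and stops one example short of a full epoch, which is why the reported epoch is just under 1.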

# IV. Push to hub

In [40]:
model.save_pretrained("./trained_model")
model.push_to_hub("/phi-2-ft-mental_health")

HfHubHTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/api/repos/create (Request ID: Root=1-668857f6-4bd501053fbe542a7ef73c4e;43f76eb4-692b-4cd5-bcfa-844a39c4a9db)

Invalid username or password.