<a href="https://colab.research.google.com/github/Abinayasankar-co/finetuningworks/blob/main/SFT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%pip install -q transformers trl peft accelerate
%pip install -q -U bitsandbytes

In [None]:
import pandas as pd
import torch
from datasets import load_dataset
from peft import LoraConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training
from tqdm import tqdm
from transformers import AutoTokenizer, pipeline
from trl import SFTTrainer

tqdm.pandas()  # register progress_apply on pandas objects

In [None]:
MODEL_PATH = "microsoft/phi-2"  # base model that the following cells load and fine-tune
DATA_PATH = "/content/test.parquet"

In [None]:
import accelerate
import bitsandbytes
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 weights with double quantization; matrix multiplications run in bfloat16
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    quantization_config=nf4_config,
    device_map="auto",
)
model.config.use_cache = False  # disable the KV cache for training
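
As a rough check on what this configuration buys, the sketch below estimates the storage cost per weight with and without double quantization. The block size of 64, the nested block size of 256, and the parameter count of about 2.78 billion for phi-2 are assumptions (the first two follow the defaults described for QLoRA), not values read from this notebook, and `bits_per_param` is a hypothetical helper. It also treats every parameter as quantized, which overstates the savings because embeddings and normalization layers typically stay in higher precision.

```python
def bits_per_param(double_quant: bool, block_size: int = 64, nested_block_size: int = 256) -> float:
    """Approximate storage per weight for 4-bit block-wise quantization."""
    weight_bits = 4.0
    if double_quant:
        # One 8-bit scale per block, plus one fp32 constant shared by
        # nested_block_size blocks (assumed layout).
        overhead = 8 / block_size + 32 / (block_size * nested_block_size)
    else:
        # One fp32 scale per block.
        overhead = 32 / block_size
    return weight_bits + overhead


ASSUMED_PARAMS = 2.78e9  # assumption: approximate phi-2 parameter count

for use_double_quant in (False, True):
    bits = bits_per_param(use_double_quant)
    gib = ASSUMED_PARAMS * bits / 8 / 2**30
    print(f"double_quant={use_double_quant}: {bits:.3f} bits/param, about {gib:.2f} GiB")
```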

In [None]:
new_model = "Phi2SFT"

In [None]:
from datasets import Dataset

df = pd.read_parquet(DATA_PATH)
df = df[:50]  # keep only the first 50 rows for a quick run
raw_dataset = Dataset.from_pandas(df)

In [None]:
raw_dataset

Dataset({
    features: ['prompt', 'chosen', 'rejected'],
    num_rows: 50
})

In [None]:
# Register a padding token so that padding="max_length" works
tokenizer.add_special_tokens({"pad_token": "[PAD]"})


def formatting_func(examples):
    """Tokenize prompt + chosen and prompt + rejected into fixed-length tensors."""
    kwargs = {
        "padding": "max_length",
        "truncation": True,
        "max_length": 256,
        "return_tensors": "pt",
    }

    prompt_plus_chosen_response = examples["prompt"] + "\n" + examples["chosen"]
    prompt_plus_rejected_response = examples["prompt"] + "\n" + examples["rejected"]

    # return_tensors="pt" adds a batch dimension, hence the [0] indexing below
    tokens_chosen = tokenizer.encode_plus(prompt_plus_chosen_response, **kwargs)
    tokens_rejected = tokenizer.encode_plus(prompt_plus_rejected_response, **kwargs)

    return {
        "input_ids_chosen": tokens_chosen["input_ids"][0],
        "attention_mask_chosen": tokens_chosen["attention_mask"][0],
        "input_ids_rejected": tokens_rejected["input_ids"][0],
        "attention_mask_rejected": tokens_rejected["attention_mask"][0],
    }
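
For supervised fine-tuning, a single text field that joins each prompt with its preferred answer is usually what the trainer consumes. The following is a minimal sketch, assuming the same `prompt` and `chosen` columns as the dataset above; `build_sft_text` and the `text` key are hypothetical names and are not part of this notebook.

```python
def build_sft_text(example: dict) -> dict:
    """Join the prompt and the chosen response into one training string."""
    return {"text": example["prompt"] + "\n" + example["chosen"]}


# Toy record with the same column names as the dataset (values are made up).
sample = {"prompt": "What is 2 + 2?", "chosen": "4", "rejected": "5"}
print(build_sft_text(sample)["text"])
```

With a `datasets.Dataset`, such a function could be applied through `raw_dataset.map(build_sft_text)`, and the resulting `text` column could then be named as the trainer's text field.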

In [None]:
from transformers import TrainingArguments

# Training arguments
training_args = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # 4 x 4 = 16 sequences per optimizer step on one device
    gradient_checkpointing=True,
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    max_steps=50,
    save_strategy="no",
    logging_steps=1,
    output_dir=new_model,
    optim="paged_adamw_8bit",
    # NOTE: warmup_steps exceeds max_steps, so training ends during warmup
    # and the learning rate never reaches 5e-5.
    warmup_steps=100,
)
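
To see how `warmup_steps=100` interacts with `max_steps=50`, the sketch below re-implements a linear warmup followed by cosine decay, the shape usually associated with `lr_scheduler_type="cosine"`. It is an illustration written for this note, not the library's own code, and `cosine_lr_with_warmup` is a hypothetical helper name.

```python
import math


def cosine_lr_with_warmup(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup to base_lr, then half-cosine decay towards zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


# Settings from the cell above: learning_rate=5e-5, warmup_steps=100, max_steps=50.
peak_lr = max(cosine_lr_with_warmup(s, 5e-5, 100, 50) for s in range(51))
print(f"highest learning rate reached in 50 steps: {peak_lr:.2e}")
```

Under this schedule the run stops halfway through the warmup ramp, so choosing a warmup shorter than `max_steps` would let the optimizer actually see the configured learning rate.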

In [None]:
formatted_dataset = raw_dataset.map(formatting_func)
formatted_dataset = formatted_dataset.train_test_split()

Map:   0%|          | 0/50 [00:00<?, ? examples/s]

In [None]:
peft_config = LoraConfig(
    r=16, lora_alpha=16, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM",
    target_modules=["k_proj", "gate_proj", "v_proj", "up_proj", "q_proj", "o_proj", "down_proj"],
)
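
A quick way to reason about the adapter size is to count LoRA parameters: each adapted linear layer gains two matrices of shape `r x d_in` and `d_out x r`. The sketch below assumes a hidden size of 2560, 32 transformer layers, and that only `q_proj`, `k_proj` and `v_proj` exist under those names in the loaded model; all three are assumptions about phi-2 rather than values printed in this notebook. Under them, the estimate lands close to the 31.5M `adapter_model.safetensors` upload shown later.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


ASSUMED_HIDDEN = 2560   # assumption: phi-2 hidden size
ASSUMED_LAYERS = 32     # assumption: phi-2 layer count
ASSUMED_MATCHED = ["q_proj", "k_proj", "v_proj"]  # assumption: names that match the model

per_layer = sum(lora_params(ASSUMED_HIDDEN, ASSUMED_HIDDEN, r=16) for _ in ASSUMED_MATCHED)
total = per_layer * ASSUMED_LAYERS
size_mb = total * 4 / 1e6  # fp32 storage, 4 bytes per parameter

print(f"{total:,} trainable LoRA parameters, about {size_mb:.1f} MB in fp32")
```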

In [None]:
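trainer = SFTTrainer(
    model,
    args=training_args,
    train_dataset=formatted_dataset["train"],
    max_seq_length=512,
    dataset_text_field="prompt",  # only the text in the "prompt" column is used for training
    peft_config=peft_config,
    packing=True,  # concatenate examples into fixed-length blocks of max_seq_length tokens
)

trainer.train()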

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Generating train split: 0 examples [00:00, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs


Step,Training Loss
1,2.7826
2,2.7463
3,2.8047
4,2.7973
5,2.7941
6,2.7686
7,2.8117
8,2.7548
9,2.7938
10,2.814


TrainOutput(global_step=50, training_loss=2.7660111141204835, metrics={'train_runtime': 1856.3398, 'train_samples_per_second': 0.431, 'train_steps_per_second': 0.027, 'total_flos': 5989949956423680.0, 'train_loss': 2.7660111141204835, 'epoch': 33.333333333333336})
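
The throughput figures in `TrainOutput` can be cross-checked from the training arguments. This small sketch assumes a single device, so that each optimizer step consumes `per_device_train_batch_size * gradient_accumulation_steps` sequences; the variable names are illustrative.

```python
steps = 50                     # max_steps
per_device_batch_size = 4      # per_device_train_batch_size
grad_accum_steps = 4           # gradient_accumulation_steps
runtime_s = 1856.3398          # train_runtime reported above

# Assumes one device: sequences processed over the whole run.
total_samples = steps * per_device_batch_size * grad_accum_steps

samples_per_second = total_samples / runtime_s
steps_per_second = steps / runtime_s

print(f"samples/s: {samples_per_second:.3f}")
print(f"steps/s:   {steps_per_second:.3f}")
```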

In [None]:
trainer.save_model("./results")

In [None]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) Y
Token is valid (permission: write).
[1m[31mCannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your ter

In [None]:
trainer.push_to_hub("Phi2-SFTFinetuned")
tokenizer.push_to_hub("Phi2-SFTFinetuned")

adapter_model.safetensors:   0%|          | 0.00/31.5M [00:00<?, ?B/s]

events.out.tfevents.1714394544.15b120f95a61.409.0:   0%|          | 0.00/16.1k [00:00<?, ?B/s]

Upload 3 LFS files:   0%|          | 0/3 [00:00<?, ?it/s]

training_args.bin:   0%|          | 0.00/4.98k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Abinayasankar/Phi2-SFTFinetuned/commit/2b72ae46b3d338466ec1648e835152faa9674114', commit_message='Upload tokenizer', commit_description='', oid='2b72ae46b3d338466ec1648e835152faa9674114', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
# Load model directly
from transformers import AutoModel
tokenizer = AutoTokenizer.from_pretrained("Abinayasankar/Phi2-SFTFinetuned",trust_remote_code=True)
model = AutoModel.from_pretrained("Abinayasankar/Phi2-SFTFinetuned")

tokenizer_config.json:   0%|          | 0.00/7.57k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/798k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.12M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/1.10k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/579 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


OSError: Abinayasankar/Phi2-SFTFinetuned does not appear to have a file named config.json. Checkout 'https://huggingface.co/Abinayasankar/Phi2-SFTFinetuned/main' for available files.
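
An error like this is what one would expect when a Hub repository holds only LoRA adapter files and no full-model `config.json`. One possible way around it, sketched here under the assumption that the repository contains an `adapter_config.json` pointing at the base model, is to load through PEFT's `AutoPeftModelForCausalLM`. The helper name `load_finetuned_adapter` is hypothetical, and the imports sit inside the function so that defining it does not require the libraries or a network connection.

```python
def load_finetuned_adapter(adapter_repo: str = "Abinayasankar/Phi2-SFTFinetuned"):
    """Load the base model plus LoRA adapter, and the tokenizer, from an adapter repo.

    Assumes the repo contains adapter_config.json; nothing is downloaded until called.
    """
    # Lazy imports: only needed when the function is actually called.
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer

    # Resolves the base model from the adapter config, then attaches the adapter.
    model = AutoPeftModelForCausalLM.from_pretrained(adapter_repo, device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(adapter_repo, trust_remote_code=True)
    return model, tokenizer
```

Calling `model, tokenizer = load_finetuned_adapter()` would then return a causal language model with a generation head, which plain `AutoModel` does not provide.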

In [None]:
tokenizer = AutoTokenizer.from_pretrained("Abinayasankar/Phi2-SFTFinetuned",trust_remote_code=True)

tokenizer_config.json:   0%|          | 0.00/7.53k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/798k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.12M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/1.10k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/579 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
input_text = "Hi what are the good qualities"
inputs = tokenizer(input_text, return_tensors="pt")
print(inputs)
inputs = inputs.to(model.device)  # generation needs the inputs on the model's device
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0]))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{'input_ids': tensor([[17250,   644,   389,   262,   922, 14482]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])}
