Let's load the dataset

In [4]:
import json

dataset_file = "New_data.json"

with open(dataset_file, "r") as f:
    alpaca = json.load(f)

In [5]:
type(alpaca), alpaca[0:3], len(alpaca)

(list,
 [{'instruction': 'Generate search keywords for searching images/videos from the given paragraph. Consider relevant elements and keywords should capture the essence and context of the paragraph while aligning with the overall meaning of the input. Your output should be a list of python strings, and each string will correspond to a sentence in the input with a search keyword. All should have string values. The length of the list and the number of sentences in the given paragraph should be the same.',
   'input': " In a memorable opening match, Germany triumphed over Scotland, leaving the Scots in despair. The final score: Germany 5, Scotland 1. After Wirtz scored in the 10th minute, the game turned into a nightmare for Scotland.   Germany's victory was a history-making event, with Musiala, Havertz, Füllkrug, and Can scoring four additional goals. Scotland's defeat was sealed when Ryan Porteous was sent off, leaving Havertz to score from the penalty spot.   This match was unique i

In [6]:
import wandb

# log to wandb
with wandb.init(project="data_ft_phi3LoRA"):
    at = wandb.Artifact(
        name="data", 
        type="dataset",
        description="Using the data we finetune",
    )
    at.add_file(dataset_file)

    # log as a table
    table = wandb.Table(columns=list(alpaca[0].keys()))
    for row in alpaca:
        table.add_data(*row.values())
    wandb.log({"data_table": table})

ERROR:wandb.jupyter:Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33manishproshort[0m ([33mps-ml-team[0m). Use [1m`wandb login --relogin`[0m to force relogin


VBox(children=(Label(value='31.893 MB of 31.893 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

In [8]:
import random

seed = 42

random.seed(seed)
random.shuffle(alpaca)  # this could also be a parameter

In [9]:
train_dataset = alpaca[:-1000]
eval_dataset = alpaca[-1000:]

In [10]:
import pandas as pd

train_df = pd.DataFrame(train_dataset)
eval_df = pd.DataFrame(eval_dataset)

train_table = wandb.Table(dataframe=train_df)
eval_table  = wandb.Table(dataframe=eval_df)

train_df.to_json("data_train.jsonl", orient='records', lines=True)
eval_df.to_json("data_eval.jsonl", orient='records', lines=True)

with wandb.init(project="data_ft_phi3LoRA", job_type="split_data"):
    at = wandb.Artifact(
        name="data_splitted", 
        type="dataset",
        description="Using the data we finetune",
    )
    at.add_file("data_train.jsonl")
    at.add_file("data_eval.jsonl")
    wandb.log_artifact(at)
    wandb.log({"train_dataset":train_table, "eval_dataset":eval_table})

VBox(children=(Label(value='48.097 MB of 48.097 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

In [11]:
def prompt_input(row):
    return ("Below is an instruction that describes a task, paired with an input that provides further context. "
            "Write a response that appropriately completes the request.\n\n"
            "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n").format_map(row)

In [12]:
'''import json
from wandb import Api

api = Api()
artifact = api.artifact('data_ft/data_splitted', type='dataset')
dataset_dir = artifact.download()'''

def load_jsonl(file_path):
    data = []
    with open(file_path, 'r') as file:
        for line in file:
            data.append(json.loads(line))
    return data
    
train_dataset = load_jsonl("data_train.jsonl")
eval_dataset = load_jsonl("data_eval.jsonl")

In [13]:
print(train_dataset[0])

{'instruction': 'Generate search keywords for searching images/videos from the given paragraph. Consider relevant elements and keywords should capture the essence and context of the paragraph while aligning with the overall meaning of the input. Your output should be a list of python strings, and each string will correspond to a sentence in the input with a search keyword. All should have string values. The length of the list and the number of sentences in the given paragraph should be the same.', 'input': " Do you ever wish you had the power of artificial intelligence right at your fingertips? That's exactly what AI/ML API offers, with over a hundred AI models available for you to use.   Imagine, a single API key being the key to a treasure chest of AI tools. With AI/ML API you can easily integrate these models into your business.   Not only does AI/ML API allow you to harness the power of AI, but it helps you to cut down on software costs and boost your efficiency.   If you're an edu

In [14]:
train_prompts = [prompt_input(row) for row in train_dataset]
eval_prompts = [prompt_input(row) for row in eval_dataset]

In [15]:
print(train_prompts[0])

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Generate search keywords for searching images/videos from the given paragraph. Consider relevant elements and keywords should capture the essence and context of the paragraph while aligning with the overall meaning of the input. Your output should be a list of python strings, and each string will correspond to a sentence in the input with a search keyword. All should have string values. The length of the list and the number of sentences in the given paragraph should be the same.

### Input:
 Do you ever wish you had the power of artificial intelligence right at your fingertips? That's exactly what AI/ML API offers, with over a hundred AI models available for you to use.   Imagine, a single API key being the key to a treasure chest of AI tools. With AI/ML API you can easily integrate these models into your busines

In [16]:
def pad_eos(ds):
    EOS_TOKEN = "</s>"
    return [f"{row['output']}{EOS_TOKEN}" for row in ds]

In [17]:
train_outputs = pad_eos(train_dataset)
eval_outputs = pad_eos(eval_dataset)
train_outputs[0]

"['AI ML models', 'API key unlocking treasure chest', 'business owner reducing software costs', 'educators and innovators using AI', 'Joining the AI revolution']</s>"

In [18]:
train_dataset = [{"prompt":s, "output":t, "example": s + t} for s, t in zip(train_prompts, train_outputs)]
eval_dataset = [{"prompt":s, "output":t, "example": s + t} for s, t in zip(eval_prompts, eval_outputs)]

In [19]:
print(train_dataset[0]["example"])

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Generate search keywords for searching images/videos from the given paragraph. Consider relevant elements and keywords should capture the essence and context of the paragraph while aligning with the overall meaning of the input. Your output should be a list of python strings, and each string will correspond to a sentence in the input with a search keyword. All should have string values. The length of the list and the number of sentences in the given paragraph should be the same.

### Input:
 Do you ever wish you had the power of artificial intelligence right at your fingertips? That's exactly what AI/ML API offers, with over a hundred AI models available for you to use.   Imagine, a single API key being the key to a treasure chest of AI tools. With AI/ML API you can easily integrate these models into your busines

In [20]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

In [21]:
model_id = 'microsoft/Phi-3-mini-128k-instruct'
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


### Packing

We will pack multiple short examples into a longer chunk, so we can train more efficiently!

The main idea here is that the instruction/output samples are short, so let's concatenate a bunch of them together separated by the `EOS` token. We can also pre-tokenize and pre-pack the dataset and make everything faster!  If we define a `max_seq_len = 1024` the code to pack would look something like this:

In [22]:
max_sequence_len = 1024

def pack(dataset, max_seq_len=max_sequence_len):
    tkds_ids = tokenizer([s["example"] for s in dataset])["input_ids"]
    
    all_token_ids = []
    for tokenized_input in tkds_ids:
        all_token_ids.extend(tokenized_input)# + [tokenizer.eos_token_id])
    
    print(f"Total number of tokens: {len(all_token_ids)}")
    packed_ds = []
    for i in range(0, len(all_token_ids), max_seq_len+1):
        input_ids = all_token_ids[i : i + max_seq_len+1]
        if len(input_ids) == (max_seq_len+1):
            packed_ds.append({"input_ids": input_ids[:-1], "labels": input_ids[1:]})  # this shift is not needed if using the model.loss
    return packed_ds

In [23]:
train_ds_packed = pack(train_dataset)
eval_ds_packed = pack(eval_dataset)
len(train_ds_packed)

Total number of tokens: 4002502
Total number of tokens: 470285


3904

### DataLoader
As we are training for completion, the labels (or targets) will be the inputs shifted by one. We are going to train with regular Cross Entropy and predict the next token on this packed dataset.

In [24]:
from torch.utils.data import DataLoader
from transformers import default_data_collator

torch.manual_seed(seed)
batch_size = 4  # I have an A100 GPU with 40GB of RAM 😎

train_dataloader = DataLoader(
    train_ds_packed,
    batch_size=batch_size,
    collate_fn=default_data_collator, # we don't need any special collator 😎
)

eval_dataloader = DataLoader(
    eval_ds_packed,
    batch_size=batch_size,
    collate_fn=default_data_collator,
    shuffle=False,
)

It is always a good idea to check how does a batch looks like, you can quickly do this by sampling from the DataLoader

In [25]:
b = next(iter(train_dataloader))
b

{'input_ids': tensor([[    1, 13866,   338,  ..., 29881, 30130,  1758],
         [30130,  1056,  5951,  ...,   393,  8128,  4340],
         [29889, 14350,   263,  ...,   525, 29923,   386],
         [ 2050,   800,   310,  ...,   348,  8247, 30130]]),
 'labels': tensor([[13866,   338,   385,  ..., 30130,  1758,   294],
         [ 1056,  5951,   262,  ...,  8128,  4340,  3030],
         [14350,   263,  2933,  ..., 29923,   386,   936],
         [  800,   310, 15680,  ...,  8247, 30130,   330]])}

We can alos decode the batch just to be super sure

In [26]:
tokenizer.decode(b["input_ids"][0])[:250]

'<s> Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nGenerate search keywords for searching images/videos from the given pa'

In [27]:
tokenizer.decode(b["labels"][0])[:250]

'Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nGenerate search keywords for searching images/videos from the given paragr'

## Train

I like storing all my hyperparameters into a `SimpleNamespace`, it's like a dict but with .dot attribute access. Then I can access my batch size by doing config.batch_size instead of config["batch_size"]. 

In [28]:
from types import SimpleNamespace

gradient_accumulation_steps = 2

config = SimpleNamespace(
    model_id='microsoft/Phi-3-mini-128k-instruct',
    dataset_name="data",
    precision="bf16",  # faster and better than fp16, requires new GPUs
    n_freeze=24,  # How many layers we don't train
    lr=2e-4,
    n_eval_samples=10, # How many samples to generate on validation
    max_seq_len=max_sequence_len, # Lenght of the sequences to pack
    epochs=3,  # we do 3 pasess over the dataset.
    gradient_accumulation_steps=gradient_accumulation_steps,  # evey how many iterations we update the gradients, simulates larger batch sizes
    batch_size=batch_size,  # what my GPU can handle, depends on how many layers are we training  
    log_model=False,  # upload the model to W&B?
    gradient_checkpointing = True,  # saves even more memory
    freeze_embed = True,  # why train this? let's keep them frozen ❄️
    seed=seed,
)

config.total_train_steps = config.epochs * len(train_dataloader) // config.gradient_accumulation_steps

In [29]:
print(f"We will train for {config.total_train_steps} steps and evaluate every epoch")

We will train for 1464 steps and evaluate every epoch


We first get a pretrained model with some configuration parameters

In [30]:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Load the model and tokenizer
model_id = 'microsoft/Phi-3-mini-128k-instruct'
model = AutoModelForCausalLM.from_pretrained(model_id, device_map=0,trust_remote_code=True,
    low_cpu_mem_usage=True,
    torch_dtype=torch.bfloat16,
    use_cache=False,)

# Configure LoRA
lora_config = LoraConfig(
    r=64,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules= ['k_proj', 'q_proj', 'v_proj', 'o_proj', "gate_proj", "down_proj", "up_proj"]
)

# Wrap the model with LoRA
model = get_peft_model(model, lora_config)




Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [31]:
def param_count(m):
    params = sum([p.numel() for p in m.parameters()])/1_000_000
    trainable_params = sum([p.numel() for p in m.parameters() if p.requires_grad])/1_000_000
    print(f"Total params: {params:.2f}M, Trainable: {trainable_params:.2f}M")
    return params, trainable_params

params, trainable_params = param_count(model)

Total params: 3856.73M, Trainable: 35.65M


### Optimizer

We setup the standard optimization stuff...

In [32]:
from transformers import get_cosine_schedule_with_warmup

optim = torch.optim.Adam(model.parameters(), lr=config.lr, betas=(0.9,0.99), eps=1e-5)
scheduler = get_cosine_schedule_with_warmup(
    optim,
    num_training_steps=config.total_train_steps,
    num_warmup_steps=config.total_train_steps // 10,
)

In [33]:
def loss_fn(x, y):
    "A Flat CrossEntropy" 
    return torch.nn.functional.cross_entropy(x.view(-1, x.shape[-1]), y.view(-1))

## Testing during training

We are almost there, let's create a simple function to sample from the model now and then, to visualy see what the models is outputting!
Let's wrap the model.generate method for simplicity. You can grab the defaults sampling parameters from the GenerationConfig and passing the corresponding model_id. This will grab you the defaults for parameters like temperature, top p, etc...


In [34]:
from types import SimpleNamespace
from transformers import GenerationConfig

gen_config = GenerationConfig.from_pretrained(config.model_id)
test_config = SimpleNamespace(
    max_new_tokens=256,
    gen_config=gen_config)

In [35]:
def generate(prompt, max_new_tokens=test_config.max_new_tokens, gen_config=gen_config):
    tokenized_prompt = tokenizer(prompt, return_tensors='pt')['input_ids'].cuda()
    with torch.inference_mode():
        output = model.generate(tokenized_prompt, 
                            max_new_tokens=max_new_tokens, 
                            generation_config=gen_config)
    return tokenizer.decode(output[0][len(tokenized_prompt[0]):], skip_special_tokens=True)

LoL 🤷

In [36]:
prompt = eval_dataset[14]["prompt"]
print(prompt + generate(prompt, 128))



Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Generate search keywords for searching images/videos from the given paragraph. Consider relevant elements and keywords should capture the essence and context of the paragraph while aligning with the overall meaning of the input. Your output should be a list of python strings, and each string will correspond to a sentence in the input with a search keyword. All should have string values. The length of the list and the number of sentences in the given paragraph should be the same.

### Input:
 'Khóa học Kế toán 4 trong 1' là một chương trình toàn diện bao gồm tất cả các khía cạnh kế toán trong bốn phần chính.   Phần đầu tiên của khóa học tập trung vào việc nắm vững tất cả các tài liệu kế toán như hóa đơn, sao kê ngân hàng, bảng lương, bảo hiểm, xuất nhập khẩu trong vòng 3 giờ.   Phần thứ hai nhấn mạnh đến các công 

We can log a Table with those results to the project every X steps

In [37]:
import wandb
from tqdm.auto import tqdm

def prompt_table(examples, log=False, table_name="predictions"):
    table = wandb.Table(columns=["prompt", "generation", "concat", "output", "max_new_tokens", "temperature", "top_p"])
    for example in tqdm(examples, leave=False):
        prompt, gpt4_output = example["prompt"], example["output"]
        out = generate(prompt, test_config.max_new_tokens, test_config.gen_config)
        table.add_data(prompt, out, prompt+out, gpt4_output, test_config.max_new_tokens, test_config.gen_config.temperature, test_config.gen_config.top_p)
    if log:
        wandb.log({table_name:table})
    return table

def to_gpu(tensor_dict):
    return {k: v.to('cuda') for k, v in tensor_dict.items()}

class Accuracy:
    "A simple Accuracy function compatible with HF models"
    def __init__(self):
        self.count = 0
        self.tp = 0.
    def update(self, logits, labels):
        logits, labels = logits.argmax(dim=-1).view(-1).cpu(), labels.view(-1).cpu()
        tp = (logits == labels).sum()
        self.count += len(logits)
        self.tp += tp
        return tp / len(logits)
    def compute(self):
        return self.tp / self.count

In [38]:
from rouge_score import rouge_scorer

# Initialize the ROUGE scorer for bigrams (ROUGE-2)
scorer = rouge_scorer.RougeScorer(['rouge2'], use_stemmer=True)

def compute_rouge2(predictions, references):
    rouge2_scores = []
    for pred, ref in zip(predictions, references):
        score = scorer.score(ref, pred)['rouge2'].fmeasure
        rouge2_scores.append(score)
    return sum(rouge2_scores) / len(rouge2_scores)

You can also quickly add validation if you feel so, the table can be also created at this step:

In [39]:
@torch.no_grad()
def validate():
    model.eval()
    eval_acc = Accuracy()
    loss, total_steps = 0., 0
    for step, batch in enumerate(pbar:=tqdm(eval_dataloader, leave=False)):
        pbar.set_description(f"doing validation")
        batch = to_gpu(batch)
        total_steps += 1
        with torch.amp.autocast("cuda", dtype=torch.bfloat16):
            out = model(**batch)
            loss += loss_fn(out.logits, batch["labels"])  # you could use out.loss and not shift the dataset
        predictions = tokenizer.batch_decode(out.logits.argmax(dim=-1), skip_special_tokens=True)
        references = tokenizer.batch_decode(batch["labels"], skip_special_tokens=True)
        # Compute ROUGE-2 score
        rouge2_score = compute_rouge2(predictions, references)
    # we log results at the end
    wandb.log({"eval/loss": loss.item() / total_steps,
               "eval/rouge2_score": rouge2_score})
    prompt_table(eval_dataset[:config.n_eval_samples], log=True)
    model.train()

Let's define a loop that compute evaluation and logs a Table with model predictions

## The actual Loop
It's actually nothing fancy, and very short! It has:
- Gradient accumulation and gradient scaling
- sampling and model checkpoint saving (this trains very fast, no need to save multiple checkpoints)
- We compute token accuracy, better metric than loss.

In [40]:
wandb.init(project="data_ft_phi3LoRA", # the project I am working on
           tags=["baseline","8b"],
           job_type="train",
           config=config) # the Hyperparameters I want to keep track of

# Training

acc = Accuracy()
model.train()
train_step = 0
for epoch in tqdm(range(config.epochs)):
    for step, batch in enumerate(tqdm(train_dataloader)):
        batch = to_gpu(batch)
        with torch.amp.autocast("cuda", dtype=torch.bfloat16):
            out = model(**batch)
            loss = loss_fn(out.logits, batch["labels"]) / config.gradient_accumulation_steps  # you could use out.loss and not shift the dataset  
            loss.backward()
        if step%config.gradient_accumulation_steps == 0:
            # Convert logits to predictions and references to text
            predictions = tokenizer.batch_decode(out.logits.argmax(dim=-1), skip_special_tokens=True)
            references = tokenizer.batch_decode(batch["labels"], skip_special_tokens=True)
            
            # Compute ROUGE-2 score
            rouge2_score = compute_rouge2(predictions, references)
            # we can log the metrics to W&B
            wandb.log({"train/loss": loss.item() * config.gradient_accumulation_steps,
                       "train/rouge2_score": rouge2_score,
                       "train/learning_rate": scheduler.get_last_lr()[0],
                       "train/global_step": train_step})
            optim.step()
            scheduler.step()
            optim.zero_grad(set_to_none=True)
            train_step += 1
    validate()    

  0%|          | 0/3 [00:00<?, ?it/s]

  0%|          | 0/976 [00:00<?, ?it/s]

  0%|          | 0/115 [00:00<?, ?it/s]

  0%|          | 0/10 [00:00<?, ?it/s]

  0%|          | 0/976 [00:00<?, ?it/s]

  0%|          | 0/115 [00:00<?, ?it/s]

  0%|          | 0/10 [00:00<?, ?it/s]

  0%|          | 0/976 [00:00<?, ?it/s]

  0%|          | 0/115 [00:00<?, ?it/s]

  0%|          | 0/10 [00:00<?, ?it/s]

In [41]:
# we save the model checkpoint at the end
#save_model(model, model_name=config.model_id.replace("/", "_"), models_folder="models/", log=config.log_model)
    
wandb.finish()

VBox(children=(Label(value='0.372 MB of 0.372 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
eval/loss,█▃▁
eval/rouge2_score,█▁▅
train/global_step,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇████
train/learning_rate,▂▃▅▆██████▇▇▇▇▇▆▆▆▆▅▅▅▄▄▄▃▃▃▃▂▂▂▂▂▁▁▁▁▁▁
train/loss,██▄▄▃▁▂▄▄▃▄▂▃▄▅▂▂▂▁▄▄▂▃▄▂▁▂▄▄▅▃▂▂▃▁▁▂▁▁▂
train/rouge2_score,▁▂▄▄▆▆▅▃▅▃▅▇▂▅▃▅▆▆▆▅▆▆▆▅▅▆▆▄▆▅▆▅█▆▅▆▇▇▆▆

0,1
eval/loss,1.11701
eval/rouge2_score,0.52899
train/global_step,1463.0
train/learning_rate,0.0
train/loss,1.26578
train/rouge2_score,0.49473


This trains in around 60 minutes on an A100. 

## Full Eval Dataset evaluation

Let's log a table with model predictions on the eval_dataset (or at least the 250 first samples)

In [44]:
with wandb.init(project="data_ft_phi3LoRA", # the project I am working on
           job_type="eval",
           config=config): # the Hyperparameters I want to keep track of
    model.eval();
    prompt_table(eval_dataset[:250], log=True, table_name="eval_predictions")

  0%|          | 0/250 [00:00<?, ?it/s]

Traceback (most recent call last):
  File "/var/tmp/ipykernel_13384/721575607.py", line 5, in <module>
    prompt_table(eval_dataset[:250], log=True, table_name="eval_predictions")
  File "/var/tmp/ipykernel_13384/3776927856.py", line 8, in prompt_table
    out = generate(prompt, test_config.max_new_tokens, test_config.gen_config)
  File "/var/tmp/ipykernel_13384/4108886549.py", line 4, in generate
    output = model.generate(tokenized_prompt,
  File "/opt/conda/lib/python3.10/site-packages/peft/peft_model.py", line 647, in generate
    return self.get_base_model().generate(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 1758, in generate
    result = self._sample(
  File "/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 2397, in _sample
    outputs = self(
  Fi

VBox(children=(Label(value='0.002 MB of 0.002 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

KeyboardInterrupt: 

In [43]:
model.base_model.save_pretrained("fine_tuned/data_new_ft_phi3LoRA")
tokenizer.save_pretrained('fine_tuned/data_new_ft_phi3LoRA')

('fine_tuned/data_new_ft_phi3LoRA/tokenizer_config.json',
 'fine_tuned/data_new_ft_phi3LoRA/special_tokens_map.json',
 'fine_tuned/data_new_ft_phi3LoRA/tokenizer.model',
 'fine_tuned/data_new_ft_phi3LoRA/added_tokens.json',
 'fine_tuned/data_new_ft_phi3LoRA/tokenizer.json')

In [46]:
model.save_pretrained('fine_tuned/data_new_ft_phi3LoRA_ada')

tokenizer.save_pretrained('fine_tuned/data_new_ft_phi3LoRA_ada')

('fine_tuned/data_new_ft_phi3LoRA_ada/tokenizer_config.json',
 'fine_tuned/data_new_ft_phi3LoRA_ada/special_tokens_map.json',
 'fine_tuned/data_new_ft_phi3LoRA_ada/tokenizer.model',
 'fine_tuned/data_new_ft_phi3LoRA_ada/added_tokens.json',
 'fine_tuned/data_new_ft_phi3LoRA_ada/tokenizer.json')

In [114]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import BitsAndBytesConfig
config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# Load the model
model = AutoModelForCausalLM.from_pretrained("fine_tuned/data_new_ft_phi3LoRA")

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("fine_tuned/data_new_ft_phi3LoRA")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [121]:
from huggingface_hub import snapshot_download

#Download the merged model
model_id="Anishproshort/data_new_ft_phi3LoRA"
#Download the repository to local_dir
snapshot_download(repo_id=model_id, local_dir="phi3",
                  local_dir_use_symlinks=False)

Fetching 8 files:   0%|          | 0/8 [00:00<?, ?it/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

adapter_config.json:   0%|          | 0.00/908 [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/143M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/3.17k [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/172 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/569 [00:00<?, ?B/s]

'/home/jupyter/FineTuning/phi3'

In [130]:
!python llama.cpp/convert-hf-to-gguf.py phi3-ft-LoRA-full-model --outfile "phi3-ft.gguf" --outtype f16

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


INFO:hf-to-gguf:Loading model: phi3-ft-LoRA-full-model
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Set model parameters
INFO:hf-to-gguf:Set model tokenizer
INFO:gguf.vocab:Setting special token type bos to 1
INFO:gguf.vocab:Setting special token type eos to 32000
INFO:gguf.vocab:Setting special token type unk to 0
INFO:gguf.vocab:Setting special token type pad to 32000
INFO:gguf.vocab:Setting add_bos_token to True
INFO:gguf.vocab:Setting add_eos_token to False
INFO:gguf.vocab:Setting chat_template to {{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') %}{{'<|user|>' + '
' + message['content'] + '<|end|>' + '
' + '<|assistant|>' + '
'}}{% elif (message['role'] == 'assistant') %}{{message['content'] + '<|end|>' + '
'}}{% endif %}{% endfor %}
INFO:hf-to-gguf:Exporting model to 'phi3-ft.gguf'
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-00

In [155]:
!./llama.cpp/examples/quantize/quantize.cpp "phi3-ft.bin" "phi3-ft-Q5_K_M.gguf" "q5_k_m"

./llama.cpp/examples/quantize/quantize.cpp: line 12: struct: command not found
./llama.cpp/examples/quantize/quantize.cpp: line 13: std::string: command not found
./llama.cpp/examples/quantize/quantize.cpp: line 14: llama_ftype: command not found
./llama.cpp/examples/quantize/quantize.cpp: line 15: std::string: command not found
./llama.cpp/examples/quantize/quantize.cpp: line 16: syntax error near unexpected token `}'
./llama.cpp/examples/quantize/quantize.cpp: line 16: `};'


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [148]:
!sudo chmod +x llama.cpp/llama.cpp

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
