<a href="https://colab.research.google.com/github/Paraskevi-KIvroglou/hotel-assistant-project/blob/main/Hf_bnb_4bit_training_with_inference.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# `transformers` meets `bitsandbytes` for democratizing Large Language Models (LLMs) through 4-bit quantization

<center>
<img src="https://github.com/huggingface/blog/blob/main/assets/96_hf_bitsandbytes_integration/Thumbnail_blue.png?raw=true" alt="drawing" width="700" class="center"/>
</center>

Welcome to this notebook, which walks through the recent `bitsandbytes` integration. It includes the work from XXX, which introduces 4-bit quantization techniques with no performance degradation, to democratize LLM inference and training.

In this notebook, we will learn how to load a large model in 4-bit (`meta-llama/Llama-2-7b-chat-hf`) and fine-tune it on a hotel-booking dialogue dataset using Google Colab and the PEFT library from Hugging Face 🤗.

[In the general usage notebook](https://colab.research.google.com/drive/1ge2F1QSK8Q7h0hn3YKuBCOAS0bK8E0wf?usp=sharing), you can learn how to properly load a model in 4-bit with all its variants.

If you liked the previous work on integrating [*LLM.int8*](https://arxiv.org/abs/2208.07339), you can have a look at the [introduction blogpost](https://huggingface.co/blog/hf-bitsandbytes-integration) to learn more about that quantization method.
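
To make the idea of 4-bit quantization concrete, here is a minimal pure-Python sketch of blockwise absmax quantization onto evenly spaced integer codes. It is an illustration only: `bitsandbytes` runs optimized CUDA kernels and, for `nf4`, uses a non-uniform set of levels, so the function names and the uniform grid below are assumptions made for this example.

```python
def quantize_block(values, bits=4):
    """Quantize one block of floats to signed integer codes plus one scale."""
    levels = 2 ** (bits - 1) - 1                  # 7 for 4-bit: codes lie in [-7, 7]
    absmax = max(abs(v) for v in values) or 1.0   # avoid dividing by zero on an all-zero block
    scale = absmax / levels
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize_block(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


# Illustrative weights, not taken from any real model
weights = [0.12, -0.70, 0.33, 0.05, -0.41, 0.70, 0.00, -0.09]
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)
print(f"scale={scale:.3f}, max abs error={max_error:.3f}")
```

Each block stores small integer codes plus a single floating-point scale, which is where the memory saving comes from; the rounding error per value is bounded by half a quantization step.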


In [None]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets
!pip install -q -U trl

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m119.8/119.8 MB[0m [31m14.7 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.6/3.6 MB[0m [31m17.9 MB/s[0m eta [36m0:00:00[0m
[?25h  Building wheel for transformers (pyproject.toml) ... [?25l[?25hdone
  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m297.6/297.6 kB[0m [31m6.8 MB/s[0m eta [36m0:00:00[0m
[?25h  Building wheel for peft (pyproject.toml) ... [?25l[?25hdone
  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproj

First, let's load the model we are going to use: Llama-2-7b-chat! Note that the checkpoint itself is around 13.5 GB in half precision (two shards of 9.98 GB and 3.50 GB).
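
As a rough back-of-the-envelope check for the `meta-llama/Llama-2-7b-chat-hf` checkpoint loaded below, the weight storage can be estimated from the parameter count and the bits used per parameter. This sketch ignores quantization metadata (per-block scales) and the layers that stay in higher precision, so the real 4-bit footprint is somewhat larger; the helper name is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Parameter count of Llama-2-7B (assumption stated for this sketch)
LLAMA_2_7B_PARAMS = 6_738_415_616

fp16_gb = weight_memory_gb(LLAMA_2_7B_PARAMS, 16)
int4_gb = weight_memory_gb(LLAMA_2_7B_PARAMS, 4)
print(f"fp16: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")  # fp16: 13.5 GB, 4-bit: 3.4 GB
```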

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-chat-hf"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,

)

tokenizer = AutoTokenizer.from_pretrained(model_id,add_eos_token=True)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0})

In [None]:
def get_completion(query: str, model, tokenizer) -> str:
  device = "cuda:0"

  prompt_template = """
  <start_of_turn>user
  Below is an instruction that describes a task. Write a response that appropriately completes the request.
  Reply with the most helpful and logic answer.
  {query}
  <end_of_turn>\n<start_of_turn>model


  """
  prompt = prompt_template.format(query=query)

  encodeds = tokenizer(prompt, return_tensors="pt", add_special_tokens=True)

  model_inputs = encodeds.to(device)


  generated_ids = model.generate(**model_inputs, max_new_tokens=100, top_p = 0.8, temperature = 0.3, do_sample=True,  pad_token_id=tokenizer.eos_token_id) #
  # decoded = tokenizer.batch_decode(generated_ids)
  decoded = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
  return (decoded)
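
`get_completion` decodes the whole generated sequence, so the returned string repeats the prompt before the answer. For causal language models, `generate` returns the prompt ids followed by the new ones, so a common way to keep only the answer is to slice off the prompt before decoding. The helper below is a hypothetical sketch that works on any sliceable sequence; inside `get_completion`, `prompt_length` would be `model_inputs["input_ids"].shape[1]` and the sequence would be `generated_ids[0]`.

```python
def keep_new_tokens(sequence, prompt_length):
    """Return only the ids that were generated after the prompt."""
    return sequence[prompt_length:]


prompt_ids = [1, 529, 2962]                  # stand-in ids for the prompt
generated = prompt_ids + [1854, 29892, 2]    # generate() echoes the prompt first
new_ids = keep_new_tokens(generated, len(prompt_ids))
print(new_ids)  # [1854, 29892, 2]
```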

In [None]:
result = get_completion(query="Hello I would like to book a hotel room", model=model, tokenizer=tokenizer)
print(result)

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.



  <start_of_turn>user
  Below is an instruction that describes a task. Write a response that appropriately completes the request.
  Reply with the most helpful and logic answer.
  Hello I would like to book a hotel room
  <end_of_turn>
<start_of_turn>model


  sure, I'd be happy to help you book a hotel room! Can you please provide me with some information such as your travel dates,


In [None]:
result = get_completion(query="Hello I would like a hotel room for 2 people", model=model, tokenizer=tokenizer)
print(result)

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.



  <start_of_turn>user
  Below is an instruction that describes a task. Write a response that appropriately completes the request.
  Reply with the most helpful and logic answer.
  Hello I would like a hotel room for 2 people
  <end_of_turn>
<start_of_turn>model


  tq indeed! I'd be happy to help you find a hotel room for 2 people. Can you please provide me with some more details


Then we have to apply some preprocessing to the model to prepare it for training. For that use the `prepare_model_for_kbit_training` method from PEFT.

In [None]:
from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

In [None]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [None]:
import bitsandbytes as bnb

def find_all_linear_names(model):
    """Collect the leaf names of all 4-bit linear layers to use as LoRA targets."""
    cls = bnb.nn.Linear4bit
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            # keep only the last path component, e.g. "q_proj"
            lora_module_names.add(name.split('.')[-1])
    # exclude the output head (needed for 16-bit)
    lora_module_names.discard('lm_head')
    return list(lora_module_names)

modules = find_all_linear_names(model)
print(modules)

['o_proj', 'gate_proj', 'v_proj', 'down_proj', 'q_proj', 'k_proj', 'up_proj']


In [None]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=64,
    lora_alpha=32,
    target_modules= modules,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 159907840 || all params: 3660320768 || trainable%: 4.368683788535114
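
The two numbers printed above can be reproduced by hand. The sketch below assumes the Llama-2-7B shapes (hidden size 4096, MLP size 11008, 32 layers, vocabulary 32000) and assumes that 4-bit weights are stored packed, two per byte, so that `numel()` reports half of the real count for quantized layers. Under those assumptions, a LoRA adapter of rank `r` on a linear layer adds `r * (in_features + out_features)` trainable parameters.

```python
# Llama-2-7B shapes: assumptions for this sketch
HIDDEN, INTERMEDIATE, LAYERS, VOCAB = 4096, 11008, 32, 32000
R = 64  # LoRA rank from the LoraConfig above

# (in_features, out_features) of each targeted projection in one decoder layer
projections = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, r):
    """Parameters of the A (r x in) and B (out x r) matrices of one adapter."""
    return r * (in_features + out_features)


trainable = LAYERS * sum(lora_params(i, o, R) for i, o in projections.values())

# Quantized weights are packed two per byte, so they count for half
packed = LAYERS * sum(i * o for i, o in projections.values()) // 2
# Embeddings, lm_head, and the norm weights stay unquantized
unquantized = 2 * VOCAB * HIDDEN + LAYERS * 2 * HIDDEN + HIDDEN

total = packed + unquantized + trainable
print(trainable, total, 100 * trainable / total)
```

Because of the packing, the reported `all params` understates the true size of the base model, so the real trainable fraction is lower than the printed percentage.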


Let's load our dataset, `KvrParaskevi/hotel_data`, to fine-tune the model on hotel-booking conversations.

In [None]:
from datasets import load_dataset

data = load_dataset("KvrParaskevi/hotel_data", split = "train")
print(data)

Dataset({
    features: ['text'],
    num_rows: 1199
})


In [None]:
df = data.to_pandas()
df.head(10)

Unnamed: 0,text
0,###Human: Hello I would like to book a hotel ...
1,###Human: Hello I'm looking to reserve a hote...
2,###Human: Good day I need to book accommodatio...
3,###Human: I'd like to book a hotel room. ###As...
4,###Human: I want to visit [City Name]. ###Assi...
5,###Human: I will visit from [Check in Date] un...
6,###Human: I want to visit [City Name]. ###Assi...
7,###Human: I will visit from [Check in Date] un...
8,###Human: I want to visit [City Name]. ###Assi...
9,###Human: I will visit from [Check in Date] un...


In [None]:
def generate_prompt(data_point):
    """Build the training text from a prompt prefix, the user request, and the answer.

    :param data_point: dict: Data point with a 'text' field
    :return: str: formatted prompt (empty string if the sample has no usable text)
    """
    prefix_text = 'Below is an instruction that describes a task. Write a response that ' \
               'appropriately completes the request. Reply with the most helpful and logic answer ' \
               'based on the request.\n\n'
    text = data_point['text']
    if not text or "###Assistant:" not in text:
        return ""
    # Split the sample into the user turn and the first assistant turn.
    parts = text.split("###Assistant:")
    # removeprefix drops only the literal marker; str.strip("###Human:") would
    # instead strip any of those characters from both ends of the string.
    user_text = parts[0].strip().removeprefix("###Human:").strip()
    assistant_text = parts[1].strip()
    return f"""<start_of_turn>user {prefix_text} {user_text}<end_of_turn>\n<start_of_turn>model {assistant_text} <end_of_turn>"""

# add the "prompt" column in the dataset
text_column = [generate_prompt(data_point) for data_point in data]
dataset = data.add_column("prompt", text_column)
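
When parsing the `###Human:` / `###Assistant:` markers, it helps to remember that `str.strip` treats its argument as a set of characters to remove from both ends, not as a prefix. The sample string below is made up for illustration; `str.removeprefix` requires Python 3.9 or newer.

```python
raw = "###Human: I need a room ###Assistant: Which city?"
human_part, assistant_part = raw.split("###Assistant:")
human_part = human_part.strip()

# strip() removes any of the characters '#', 'H', 'u', 'm', 'a', 'n', ':' from both ends,
# so the trailing 'm' of "room" is lost
stripped_wrong = human_part.strip("###Human:")
print(repr(stripped_wrong))   # ' I need a roo'

# removeprefix() drops only the literal marker
stripped_right = human_part.removeprefix("###Human:").strip()
print(repr(stripped_right))   # 'I need a room'
```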

In [None]:
dataset = dataset.shuffle(seed=1234)  # Shuffle dataset here
dataset = dataset.map(lambda samples: tokenizer(samples["prompt"]), batched=True)

In [None]:
data = dataset.train_test_split(test_size=0.1)  # Splits 10% for testing, 90% for training
train_dataset = data["train"]
test_dataset = data["test"]

In [None]:
print(train_dataset)
print(test_dataset)

Dataset({
    features: ['text', 'prompt', 'input_ids', 'attention_mask'],
    num_rows: 1079
})
Dataset({
    features: ['text', 'prompt', 'input_ids', 'attention_mask'],
    num_rows: 120
})


In [None]:
df = test_dataset.to_pandas()
df.head(10)

Unnamed: 0,text,prompt,input_ids,attention_mask
0,"###Human: ""Hi I'm planning a"" trip to Barcelo...",<start_of_turn>user Below is an instruction th...,"[1, 529, 2962, 29918, 974, 29918, 685, 29958, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
1,"###Human: ""Yes please. Book it for me.""###Ass...",<start_of_turn>user Below is an instruction th...,"[1, 529, 2962, 29918, 974, 29918, 685, 29958, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
2,###Human: Looking for a place to stay in Rione...,<start_of_turn>user Below is an instruction th...,"[1, 529, 2962, 29918, 974, 29918, 685, 29958, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
3,###Human: Yes my name is {name}. My email is {...,<start_of_turn>user Below is an instruction th...,"[1, 529, 2962, 29918, 974, 29918, 685, 29958, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
4,"###Human: ""Yes please. Book it for me.""###Ass...",<start_of_turn>user Below is an instruction th...,"[1, 529, 2962, 29918, 974, 29918, 685, 29958, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
5,###Human: Looking for a place to stay in Ludwi...,<start_of_turn>user Below is an instruction th...,"[1, 529, 2962, 29918, 974, 29918, 685, 29958, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
6,"###Human: ""Yes please. Book it for me.""###Ass...",<start_of_turn>user Below is an instruction th...,"[1, 529, 2962, 29918, 974, 29918, 685, 29958, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
7,###Human: I will visit from [Check in Date] un...,<start_of_turn>user Below is an instruction th...,"[1, 529, 2962, 29918, 974, 29918, 685, 29958, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
8,###Human: Does it include breakfast?###Assista...,<start_of_turn>user Below is an instruction th...,"[1, 529, 2962, 29918, 974, 29918, 685, 29958, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
9,"###Human: ""Hi I'm planning a"" trip to London ...",<start_of_turn>user Below is an instruction th...,"[1, 529, 2962, 29918, 974, 29918, 685, 29958, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."


In [None]:
!pip install wandb

In [None]:
import wandb
wandb.login()

True

In [None]:
%env WANDB_LOG_MODEL=true

env: WANDB_LOG_MODEL=true


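Run the cell below to start training! The configuration trains for 10 epochs with an effective batch size of 16 (4 samples per device × 4 gradient-accumulation steps) and evaluates on the held-out split after every epoch, to showcase how to use this integration with existing tools in the HF ecosystem.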

In [None]:
import transformers

from trl import SFTTrainer

tokenizer.pad_token = tokenizer.eos_token
#tokenizer.padding_side='right'
torch.cuda.empty_cache()

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset = test_dataset,
    dataset_text_field="prompt",
    peft_config=config,
    max_seq_length= 500,
    # args=transformers.TrainingArguments(
    #     report_to = 'wandb',
    #     overwrite_output_dir = True,
    #     evaluation_strategy = 'steps',
    #     per_device_train_batch_size=1,
    #     gradient_accumulation_steps=2,
    #     warmup_steps=2,
    #     max_steps=12,
    #     learning_rate=5e-4,
    #     fp16=True,
    #     logging_steps=1,
    #     output_dir="outputs",
    #     metric_for_best_model ="bertscore",
    #     #optim="paged_adamw_8bit"
    # ),
    args=transformers.TrainingArguments(
        report_to = 'wandb',
        overwrite_output_dir = True,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        #warmup_steps=0.03,
        #max_steps=100,
        num_train_epochs = 10,
        learning_rate=2e-4,
        weight_decay=0.01,
        logging_strategy="epoch",
        evaluation_strategy="epoch",
        load_best_model_at_end=True,
        output_dir="outputs",
        warmup_steps=2,
        fp16=True,
        optim="paged_adamw_8bit",
        save_strategy="epoch",
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()



[34m[1mwandb[0m: Currently logged in as: [33mparaskevikivroglou[0m. Use [1m`wandb login --relogin`[0m to force relogin




Epoch,Training Loss,Validation Loss
0,0.563,0.183642
2,0.1053,0.121836
4,0.0819,0.106723
6,0.0663,0.108
8,0.0563,0.107815
9,0.0516,0.110916




TrainOutput(global_step=670, training_loss=0.12889546138137134, metrics={'train_runtime': 4664.1928, 'train_samples_per_second': 2.313, 'train_steps_per_second': 0.144, 'total_flos': 6.250500074468966e+16, 'train_loss': 0.12889546138137134, 'epoch': 9.925925925925926})
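
The `global_step=670` and `epoch=9.93` values in the output above follow from the batch settings. The sketch below assumes a single GPU and that an incomplete gradient-accumulation group at the end of each epoch does not count as an optimizer step, which is what the reported numbers suggest.

```python
import math

train_rows = 1079        # size of train_dataset printed earlier
per_device_batch = 4     # per_device_train_batch_size
grad_accum = 4           # gradient_accumulation_steps
epochs = 10              # num_train_epochs

batches_per_epoch = math.ceil(train_rows / per_device_batch)   # 270 mini-batches
updates_per_epoch = batches_per_epoch // grad_accum            # 67 optimizer steps
total_updates = updates_per_epoch * epochs                     # 670, matches global_step
final_epoch = total_updates * grad_accum / batches_per_epoch   # fraction of data seen

print(total_updates, final_epoch)
```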

In [None]:
new_model = "Llama-2-7b-Hotel-Booking-Model" #Name of the model you will be pushing to huggingface model hub

In [None]:
trainer.model.save_pretrained(new_model)

In [None]:
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map={"": 0},
)
merged_model= PeftModel.from_pretrained(base_model, new_model)
merged_model= merged_model.merge_and_unload()

# Save the merged model
merged_model.save_pretrained("merged_model",safe_serialization=True)
tokenizer.save_pretrained("merged_model")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
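
Conceptually, merging a LoRA adapter folds the low-rank update into the base weight: `W_merged = W + (lora_alpha / r) * B @ A`. The toy example below illustrates that formula with plain Python lists; the matrices are made up, and the helper names are hypothetical rather than part of PEFT. With the `LoraConfig` used in this notebook (`r=64`, `lora_alpha=32`), the scaling factor would be 0.5.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, r):
    """Return weight + (alpha / r) * (lora_b @ lora_a)."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 layer with a rank-1 adapter; values are illustrative only
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]       # shape (r, in_features)
lora_b = [[0.5], [-0.5]]    # shape (out_features, r)

merged = merge_lora(weight, lora_a, lora_b, alpha=2, r=1)
print(merged)  # [[2.0, 2.0], [-1.0, -1.0]]
```

After merging, the adapter matrices are no longer needed at inference time, which is why the merged checkpoint can be saved and pushed as a regular model.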

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [None]:
# Push the model and tokenizer to the Hugging Face Model Hub
merged_model.push_to_hub(new_model, use_temp_dir=False)
tokenizer.push_to_hub(new_model, use_temp_dir=False)

CommitInfo(commit_url='https://huggingface.co/KvrParaskevi/Llama-2-7b-Hotel-Booking-Model/commit/57092a77fb7fb8b48bb70f1bbe85512903f6a19a', commit_message='Upload tokenizer', commit_description='', oid='57092a77fb7fb8b48bb70f1bbe85512903f6a19a', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
result = get_completion(query="Hello I want to book a hotel room", model=merged_model, tokenizer=tokenizer)
print(result)

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.



  <start_of_turn>user
  Below is an instruction that describes a task. Write a response that appropriately completes the request.
  Reply with the most helpful and logic answer.
  Hello I want to book a hotel room
  <end_of_turn>
<start_of_turn>model


  {user Below is an instruction that describes a task. Write a response that appropriately completes the request.
  Reply with the most helpful and logic answer.
  Hello I want to book a hotel room. <end_of_turn>
<start_of_turn>model  I need to know your destination and the dates of your stay to proceed with the booking. <end_of_turn>
<start_of_turn>model  Could you


# Fine-tuning `cerebras/btlm-3b-8k-base` in 4-bit with a manual training loop

In [None]:
import os
os.environ["BITSANDBYTES_NOWELCOME"]="No poo on screen allowed"

import torch
from peft import get_peft_config, get_peft_model, LoraConfig, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = 'cerebras/btlm-3b-8k-base'
FINETUNING_PEFT="peft"
finetuning=FINETUNING_PEFT

In [None]:
from transformers import BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    # pass a BitsAndBytesConfig instead of the deprecated `load_in_4bit=True` argument
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
)

A new version of the following files was downloaded from https://huggingface.co/cerebras/btlm-3b-8k-base:
- configuration_btlm.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.

A new version of the following files was downloaded from https://huggingface.co/cerebras/btlm-3b-8k-base:
- modeling_btlm.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
`low_cpu_mem_usage` was None, now set to True since model is quantized.

In [None]:
# check we are in 4bit
model.transformer.h[3].attn.c_attn

Linear4bit(in_features=2560, out_features=7680, bias=True)

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_path)

In [None]:
if finetuning == FINETUNING_PEFT:
    peft_config = LoraConfig(
        TaskType.CAUSAL_LM,
        inference_mode=False,
        r=1,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=['c_attn'])
    model = get_peft_model(model, peft_config)
    print("peft me baby one more time")

peft me baby one more time


In [None]:
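# Only the LoRA adapter weights require gradients, so optimize just those.
opt_fn = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad])

# `data` was rebound to the train/test DatasetDict by the split above,
# so iterate over the training split explicitly.
for text in train_dataset["text"]:
    parts = text.split("###Assistant:")
    # removeprefix drops only the literal marker (str.strip would strip characters)
    q = f"Q: {parts[0].strip().removeprefix('###Human:').strip()}\n"
    tq = tokenizer(q, return_tensors="pt").input_ids
    a = "A: " + parts[1].strip() + tokenizer.eos_token
    ta = tokenizer(a, add_special_tokens=False, return_tensors="pt").input_ids
    # concatenate question and answer ids and move them to the model's device
    input_ids = torch.cat((tq, ta), -1).to(model.device)
    labels = input_ids.clone()
    # Do not predict questions: positions labeled -100 are ignored by the loss
    labels[:, :tq.shape[1]] = -100
    loss = model(input_ids, labels=labels).loss
    loss.backward()
    opt_fn.step()
    opt_fn.zero_grad()
print(loss)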

tensor(8.5000, dtype=torch.bfloat16, grad_fn=<ToCopyBackward0>)
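
The training loop above sets the label of every question token to `-100`, the default `ignore_index` of PyTorch's cross-entropy loss, so only answer tokens contribute to the loss. The pure-Python sketch below illustrates that masking in a simplified form (it skips the one-position label shift that causal language models apply internally); the function name and the probabilities are made up for the example.

```python
import math

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def masked_mean_nll(correct_token_probs, labels):
    """Mean negative log-likelihood over positions whose label is not ignored.

    correct_token_probs[i] is the probability the model gave to the right token at position i.
    """
    kept = [
        -math.log(p)
        for p, label in zip(correct_token_probs, labels)
        if label != IGNORE_INDEX
    ]
    return sum(kept) / len(kept)


probs = [0.1, 0.2, 0.5, 0.25]      # illustrative probabilities
labels = [-100, -100, 42, 7]       # first two positions belong to the question
loss_value = masked_mean_nll(probs, labels)
print(loss_value)
```

Only the last two positions enter the average, so poor predictions on the question tokens do not affect the result.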


In [None]:
def qa(q):
    prompt = f"Q: {q}\nA: "
    x = tokenizer(prompt, return_tensors='pt').to("cuda")
    y = model.generate(**x, max_new_tokens=80, pad_token_id=tokenizer.eos_token_id).ravel()
    return tokenizer.decode(y)

print(qa("Hello I would like a hotel in Paris"))

Q: Hello I would like a hotel in Paris
A:................................................................................
