# (QLoRA) Fine-tuning Mistral-7B-Instruct to Respond to YouTube Comments

Code authored by: Shaw Talebi <br>
Video link: https://youtu.be/XpoKB3usmKc <br>
Blog link: https://medium.com/towards-data-science/qlora-how-to-fine-tune-an-llm-on-a-single-gpu-4e44d6b5be32 <br>

Colab link: https://colab.research.google.com/drive/1AErkPgDderPW0dgE230OOjEysd0QV1sR?usp=sharing

### Imports

In [1]:
%pip install auto-gptq
%pip install optimum
%pip install bitsandbytes

Collecting auto-gptq
  Downloading auto_gptq-0.7.1-cp310-cp310-win_amd64.whl.metadata (18 kB)
Collecting accelerate>=0.26.0 (from auto-gptq)
  Using cached accelerate-0.28.0-py3-none-any.whl.metadata (18 kB)
Collecting datasets (from auto-gptq)
  Using cached datasets-2.18.0-py3-none-any.whl.metadata (20 kB)
Collecting sentencepiece (from auto-gptq)
  Downloading sentencepiece-0.2.0-cp310-cp310-win_amd64.whl.metadata (8.3 kB)
Collecting rouge (from auto-gptq)
  Using cached rouge-1.0.1-py3-none-any.whl.metadata (4.1 kB)
Collecting gekko (from auto-gptq)
  Downloading gekko-1.0.7-py3-none-any.whl.metadata (3.0 kB)
Collecting safetensors (from auto-gptq)
  Downloading safetensors-0.4.2-cp310-none-win_amd64.whl.metadata (3.9 kB)
Collecting transformers>=4.31.0 (from auto-gptq)
  Using cached transformers-4.38.2-py3-none-any.whl.metadata (130 kB)
Collecting peft>=0.5.0 (from auto-gptq)
  Using cached peft-0.9.0-py3-none-any.whl.metadata (13 kB)
Collecting tqdm (from auto-gptq)
  Using cach

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from peft import prepare_model_for_kbit_training
from peft import LoraConfig, get_peft_model
from datasets import load_dataset
import transformers



### Load model

In [3]:
model_name = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto", # automatically figures out how to best use CPU + GPU for loading model
                                             trust_remote_code=False, # prevents running custom model files on your machine
                                             revision="main") # which version of model to use in repo

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


### Load tokenizer

In [4]:
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

### Using Base Model

In [5]:
model.eval() # model in evaluation mode (dropout modules are deactivated)

# craft prompt
comment = "Great content, thank you!"
prompt=f'''[INST] {comment} [/INST]'''

# tokenize input
inputs = tokenizer(prompt, return_tensors="pt")

# generate output
outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=140)

print(tokenizer.batch_decode(outputs)[0])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> [INST] Great content, thank you! [/INST] I'm glad you found the content helpful! If you have any specific questions or topics you'd like me to cover in the future, feel free to ask. I'm here to help.

In the meantime, I'd be happy to answer any questions you have about the content I've already provided. Just let me know which article or blog post you're referring to, and I'll do my best to provide you with accurate and up-to-date information.

Thanks for reading, and I look forward to helping you with any questions you may have!</s>


#### Prompt Engineering

In [6]:
instructions_string = """ShawGPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. \
It reacts to feedback aptly and ends responses with its signature '–ShawGPT'. \
ShawGPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, \
thus keeping the interaction natural and engaging.

Please respond to the following comment.
"""

prompt_template = lambda comment: f'''[INST] {instructions_string} \n{comment} \n[/INST]'''

prompt = prompt_template(comment)
print(prompt)

[INST] ShawGPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. It reacts to feedback aptly and ends responses with its signature '–ShawGPT'. ShawGPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, thus keeping the interaction natural and engaging.

Please respond to the following comment.
 
Great content, thank you! 
[/INST]


In [7]:
# tokenize input
inputs = tokenizer(prompt, return_tensors="pt")

# generate output
outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=140)

print(tokenizer.batch_decode(outputs)[0])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> [INST] ShawGPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. It reacts to feedback aptly and ends responses with its signature '–ShawGPT'. ShawGPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, thus keeping the interaction natural and engaging.

Please respond to the following comment.
 
Great content, thank you! 
[/INST] Thank you for your kind words! I'm glad you found the content helpful. –ShawGPT</s>


### Prepare Model for Training

In [8]:
model.train() # model in training mode (dropout modules are activated)

# enable gradient checkpointing (recompute activations in the backward pass to save memory)
model.gradient_checkpointing_enable()

# enable quantized training
model = prepare_model_for_kbit_training(model)

In [9]:
# LoRA config
config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# LoRA trainable version of model
model = get_peft_model(model, config)

# trainable parameter count
model.print_trainable_parameters()

trainable params: 2,097,152 || all params: 264,507,392 || trainable%: 0.7928519441906561
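
The counts printed above can be reproduced by hand. The sketch below assumes the published Mistral-7B dimensions (hidden size 4096, 32 decoder blocks, vocabulary of 32,000, and a `q_proj` that maps 4096 to 4096); none of these values are stated in this notebook. LoRA adds two matrices per targeted layer, `A` with shape `(r, d_in)` and `B` with shape `(d_out, r)`. The helper `lora_param_count` is hypothetical. The breakdown of "all params" is also an assumption: the arithmetic matches exactly if the GPTQ-quantized linear weights are stored as buffers rather than parameters, so that only the embeddings, the output head, the norm weights, and the LoRA matrices are counted.

```python
def lora_param_count(d_in: int, d_out: int, r: int, n_layers: int) -> int:
    """Trainable LoRA weights when one linear layer per decoder block is adapted."""
    per_layer = r * d_in + d_out * r  # A has shape (r, d_in), B has shape (d_out, r)
    return per_layer * n_layers

# Assumed Mistral-7B dimensions (from the published model config, not from this notebook)
HIDDEN, N_LAYERS, VOCAB = 4096, 32, 32000

trainable = lora_param_count(d_in=HIDDEN, d_out=HIDDEN, r=8, n_layers=N_LAYERS)
print(f"{trainable:,}")  # 2,097,152

# Assumption: quantized linear weights are buffers, so only these count as parameters
embeddings = VOCAB * HIDDEN            # input token embeddings
lm_head = VOCAB * HIDDEN               # output projection
norms = (2 * N_LAYERS + 1) * HIDDEN    # two RMSNorm weights per block plus the final norm
all_params = embeddings + lm_head + norms + trainable
print(f"{all_params:,}")  # 264,507,392
print(f"{100 * trainable / all_params:.4f}%")  # 0.7929%
```

Under these assumptions, the reported 0.79% is measured against roughly 265M counted parameters, not against the full 7B weights of the base model.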


### Preparing Training Dataset

In [10]:
# load dataset
data = load_dataset("shawhin/shawgpt-youtube-comments")

Downloading readme: 100%|██████████| 531/531 [00:00<?, ?B/s] 
Downloading data: 100%|██████████| 18.0k/18.0k [00:00<00:00, 63.6kB/s]
Downloading data: 100%|██████████| 8.09k/8.09k [00:00<00:00, 32.2kB/s]
Generating train split: 100%|██████████| 50/50 [00:00<00:00, 123.67 examples/s]
Generating test split: 100%|██████████| 9/9 [00:00<00:00, 1805.21 examples/s]


In [11]:
# create tokenize function
def tokenize_function(examples):
    # extract text
    text = examples["example"]

    # tokenize and truncate text (left truncation drops tokens from the start of long examples)
    tokenizer.truncation_side = "left"
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        truncation=True,
        max_length=512
    )

    return tokenized_inputs

# tokenize training and validation datasets
tokenized_data = data.map(tokenize_function, batched=True)

Map: 100%|██████████| 50/50 [00:00<00:00, 121.32 examples/s]
Map: 100%|██████████| 9/9 [00:00<00:00, 529.51 examples/s]
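
Setting `truncation_side = "left"` means that when an example is longer than `max_length`, tokens are dropped from the beginning, so the end of the example is kept. The pure-Python sketch below illustrates the difference between the two sides on a plain list of token ids. `truncate` is a hypothetical helper for illustration, not a tokenizer API, and it ignores special tokens.

```python
def truncate(ids: list[int], max_length: int, side: str = "right") -> list[int]:
    """Keep at most max_length token ids, dropping from the chosen side."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    if len(ids) <= max_length:
        return list(ids)  # short examples are returned unchanged
    if side == "left":
        return ids[-max_length:]  # drop the oldest tokens, keep the ending
    return ids[:max_length]  # drop the tail, keep the beginning

token_ids = list(range(10))  # stand-in for a tokenized example
print(truncate(token_ids, 4, side="left"))   # [6, 7, 8, 9]
print(truncate(token_ids, 4, side="right"))  # [0, 1, 2, 3]
```

If each example places the instructions first and the response last, left truncation would keep the response, which is the part the model is meant to learn.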


In [12]:
# setting pad token
tokenizer.pad_token = tokenizer.eos_token

# data collator
data_collator = transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
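
With `mlm=False`, the collator prepares batches for causal language modeling: it pads every example to the longest one in the batch and copies `input_ids` into `labels`, replacing pad positions with `-100` so the loss ignores them. The pure-Python sketch below mimics that behavior. `collate_causal_lm` is hypothetical and is not the library implementation; it pads on the right for simplicity, whereas the real padding side depends on the tokenizer. Because the pad token is set to the EOS token here, the sketch also shows a side effect: any genuine EOS token in an example is masked out of the labels as well.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def collate_causal_lm(examples: list[list[int]], pad_token_id: int) -> dict:
    """Pad a batch of token-id lists and build labels for next-token prediction."""
    max_len = max(len(ids) for ids in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ids in examples:
        n_pad = max_len - len(ids)
        input_ids = ids + [pad_token_id] * n_pad  # right padding (assumption)
        batch["input_ids"].append(input_ids)
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
        # every position equal to the pad id is ignored, including real EOS tokens
        batch["labels"].append(
            [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]
        )
    return batch

EOS = 2  # eos_token_id reported in the generation warning earlier in this notebook
batch = collate_causal_lm([[1, 10, 11, EOS], [1, 12]], pad_token_id=EOS)
print(batch["labels"])  # [[1, 10, 11, -100], [1, 12, -100, -100]]
```

Note that the first example's real EOS (its last token) is masked too, so under this setup the model gets no direct training signal for when to stop.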

### Fine-tuning Model

In [21]:
# hyperparameters
lr = 2e-4
batch_size = 4
num_epochs = 10

# define training arguments
training_args = transformers.TrainingArguments(
    output_dir= "shawgptft",
    learning_rate=lr,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.01,
    logging_strategy="epoch",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    gradient_accumulation_steps=4,
    warmup_steps=2,
    fp16=True,
    # fp16=False,
    optim="paged_adamw_8bit",
)
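
These settings determine how many optimizer steps the run takes. With the 50 training examples reported when the dataset was loaded, a per-device batch size of 4, and 4 gradient-accumulation steps, each optimizer step sees up to 16 examples. The sketch below reproduces the step count, assuming a single device and the rounding used by recent `transformers` releases (batches per epoch rounded up, then floor-divided by the accumulation steps). Treat that rounding as an assumption; `total_optimizer_steps` is a hypothetical helper.

```python
import math

def total_optimizer_steps(n_examples: int, batch_size: int,
                          grad_accum: int, epochs: int) -> int:
    """Estimate optimizer updates for a single-device run."""
    batches_per_epoch = math.ceil(n_examples / batch_size)       # 50 / 4 -> 13
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)  # 13 // 4 -> 3
    return math.ceil(epochs * updates_per_epoch)

effective_batch = 4 * 4  # per-device batch size x accumulation steps
steps = total_optimizer_steps(n_examples=50, batch_size=4, grad_accum=4, epochs=10)
print(effective_batch)  # 16
print(steps)            # 30
```

The result matches the 30 total steps shown by the training progress bar in this notebook, and it means `warmup_steps=2` covers the first two of those updates.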

In [22]:
# configure trainer
trainer = transformers.Trainer(
    model=model,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["test"],
    args=training_args,
    data_collator=data_collator
)

# train model
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
# model.config.inference_mode = False
trainer.train()

# re-enable the cache for inference
model.config.use_cache = True
# model.config.inference_mode = True

  0%|          | 0/30 [20:45<?, ?it/s]


AssertionError: No inf checks were recorded for this optimizer.

### Push model to hub

In [None]:
from huggingface_hub import notebook_login
notebook_login()

# # option 2: key login
# from huggingface_hub import login
# write_key = 'hf_' # paste token here
# login(write_key)

In [None]:
hf_name = 'shawhin' # your hf username or org name
model_id = hf_name + "/" + "shawgpt-ft"

In [None]:
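# push the LoRA adapter weights to the given repo
model.push_to_hub(model_id)
# note: the first positional argument of Trainer.push_to_hub is the commit message, not a repo id
trainer.push_to_hub(model_id)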

### Load Fine-tuned Model

In [None]:
# # load model from hub
# from peft import PeftModel, PeftConfig
# from transformers import AutoModelForCausalLM

# model_name = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"
# model = AutoModelForCausalLM.from_pretrained(model_name,
#                                              device_map="auto",
#                                              trust_remote_code=False,
#                                              revision="main")

# config = PeftConfig.from_pretrained("shawhin/shawgpt-ft")
# model = PeftModel.from_pretrained(model, "shawhin/shawgpt-ft")

# # load tokenizer
# tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

### Use Fine-tuned Model

In [None]:
instructions_string = """ShawGPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. \
It reacts to feedback aptly and ends responses with its signature '–ShawGPT'. \
ShawGPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, \
thus keeping the interaction natural and engaging.

Please respond to the following comment.
"""
prompt_template = lambda comment: f'''[INST] {instructions_string} \n{comment} \n[/INST]'''

comment = "Great content, thank you!"

prompt = prompt_template(comment)
print(prompt)

In [None]:
model.eval()

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=280)

print(tokenizer.batch_decode(outputs)[0])

In [None]:
comment = "What is fat-tailedness?"
prompt = prompt_template(comment)

model.eval()
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=280)
print(tokenizer.batch_decode(outputs)[0])