# Lab1
**Done by Arina Shinkorenok J4132c**

## Download libraries

In [None]:
%pip install --quiet transformers==4.37.2 accelerate==0.24.0 sentencepiece==0.1.99 optimum==1.13.2 peft==0.5.0 bitsandbytes==0.41.2.post2 datasets==2.14.7

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer, default_data_collator, TrainingArguments
from tqdm.auto import tqdm, trange
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset
import peft

import transformers
from datasets import load_dataset

import random
const_seed = 100



In [3]:
assert torch.cuda.is_available(), "check out cuda availability (change runtime type in colab)"

In [4]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

! ls

# Part 0: Initializing the model and tokenizer

Let's use the Mistral model (https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2), which was tuned to follow user instructions, for our experiments. Note that we load the model in 4-bit precision to reduce memory usage.
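To get a feeling for why 4-bit loading matters, here is a rough back-of-the-envelope sketch. It counts weights only (no activations, no quantization metadata), and the round figure of 7 billion parameters is an assumption made for illustration, not an exact count for this checkpoint.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


N_PARAMS = 7.0e9  # assumption: round figure for a "7B" model

for label, bits in [("fp32", 32), ("fp16", 16), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gib(N_PARAMS, bits):5.1f} GiB")
```

Under these assumptions the 4-bit weights need about one eighth of the fp32 footprint, which is what makes the model fit on a single consumer or Colab GPU.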

In [5]:
model_name = "mistralai/Mistral-7B-Instruct-v0.2"

In [6]:
# load the Mistral tokenizer (tokenization runs on the CPU, so no device_map is needed)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token_id = tokenizer.eos_token_id

# Note: to speed up inference you can use flash attention 2 (https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    low_cpu_mem_usage=True,
    offload_state_dict=True,
    load_in_4bit=True,
    torch_dtype=torch.float32,  # weights are 4-bit; layernorms and activations are fp32
)
for param in model.parameters():
    param.requires_grad=False

model.gradient_checkpointing_enable()  # only store a small subset of activations, re-compute the rest.
model.enable_input_require_grads()     # override an implementation quirk in gradient checkpoints that disables backprop unless inputs require grad
# more on gradient checkpointing: https://pytorch.org/docs/stable/checkpoint.html https://arxiv.org/abs/1604.06174


# Part 1 (5 points): Prompt-engineering

**There are different strategies for text generation in huggingface:**

| Strategy | Description | Pros & Cons |
| --- | --- | --- |
| Greedy Search | Chooses the word with the highest probability as the next word in the sequence. | **Pros:** Simple and fast. <br> **Cons:** Can lead to repetitive and incoherent text. |
| Sampling with Temperature | Introduces randomness in the word selection. A higher temperature leads to more randomness. | **Pros:** Allows exploration and diverse output. <br> **Cons:** Higher temperatures can lead to nonsensical outputs. |
| Nucleus Sampling (Top-p Sampling) | Samples the next word from a truncated vocabulary: the "nucleus", i.e. the smallest set of most probable words whose cumulative probability reaches a pre-specified threshold (p). | **Pros:** Balances diversity and quality. <br> **Cons:** Setting an optimal 'p' can be tricky. |
| Beam Search | Explores multiple hypotheses (sequences of words) at each step, and keeps the 'k' most likely, where 'k' is the beam width. | **Pros:** Produces more reliable results than greedy search. <br> **Cons:** Can lack diversity and lead to generic responses. |
| Top-k Sampling | Randomly selects the next word from the top 'k' words with the highest probabilities. | **Pros:** Introduces randomness, increasing output diversity. <br> **Cons:** Random selection can sometimes lead to less coherent outputs. |
| Length Normalization | Prevents the model from favoring shorter sequences by dividing the log probabilities by the sequence length raised to some power. | **Pros:** Makes longer and potentially more informative sequences more likely. <br> **Cons:** Tuning the normalization factor can be difficult. |
| Stochastic Beam Search | Introduces randomness into the selection process of the 'k' hypotheses in beam search. | **Pros:** Increases diversity in the generated text. <br> **Cons:** The trade-off between diversity and quality can be tricky to manage. |
| Decoding with Minimum Bayes Risk (MBR) | Chooses the hypothesis (out of many) that minimizes expected loss under a loss function. | **Pros:** Optimizes the output according to a specific loss function. <br> **Cons:** Computationally more complex and requires a good loss function. |
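
The sampling rows of the table can be illustrated on a toy distribution. The sketch below is plain Python and is not the Hugging Face implementation; the five-word vocabulary, the logits and the helper names (`softmax`, `top_k_filter`, `top_p_filter`, `sample`) are made up for illustration.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most probable token ids and renormalise."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of most probable ids whose mass reaches p, then renormalise."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in kept}


def sample(dist, rng):
    """Draw one token id from a {token_id: probability} mapping."""
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


vocab = ["the", "a", "cat", "dog", "zebra"]   # toy vocabulary
logits = [2.0, 1.5, 0.5, 0.2, -1.0]           # toy scores for the next word
probs = softmax(logits)

greedy_id = max(range(len(probs)), key=lambda i: probs[i])
rng = random.Random(0)
print("greedy:", vocab[greedy_id])
print("top-k (k=2):", vocab[sample(top_k_filter(probs, 2), rng)])
print("top-p (p=0.7):", vocab[sample(top_p_filter(probs, 0.7), rng)])
```

Greedy search always returns the argmax, while the two filters only restrict which words may be sampled; the draw itself is still random.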

Documentation references:
- [reference for `AutoModelForCausalLM.generate()`](https://huggingface.co/docs/transformers/v4.29.1/en/main_classes/text_generation#transformers.GenerationMixin.generate)
- [reference for `AutoTokenizer.decode()`](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.decode)
- Huggingface [docs on generation strategies](https://huggingface.co/docs/transformers/generation_strategies)
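
The beam search and length normalization rows of the table can be sketched together. The toy "language model" below is hypothetical (next-word probabilities depend only on the previous word), and the scoring rule, summed log-probability divided by the generated length raised to `length_penalty`, is a simplified stand-in rather than a copy of the library code.

```python
import math

# Hypothetical toy model: P(next word | previous word).
TOY_LM = {
    "<s>": {"hi": 0.5, "the": 0.5},
    "hi": {"</s>": 1.0},
    "the": {"cat": 0.9, "</s>": 0.1},
    "cat": {"sat": 0.9, "</s>": 0.1},
    "sat": {"</s>": 1.0},
}


def beam_search(num_beams=2, max_len=4, length_penalty=1.0):
    """Return the best finished hypothesis as (tokens, summed log-probability)."""
    beams = [(["<s>"], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, logp in beams:
            for tok, p in TOY_LM[tokens[-1]].items():
                hyp = (tokens + [tok], logp + math.log(p))
                if tok == "</s>":
                    finished.append(hyp)      # complete sentence
                else:
                    candidates.append(hyp)    # still growing
        # keep only the num_beams most likely unfinished hypotheses
        beams = sorted(candidates, key=lambda h: h[1], reverse=True)[:num_beams]
        if not beams:
            break

    def score(hyp):
        tokens, logp = hyp
        n_generated = len(tokens) - 1         # do not count "<s>"
        return logp / n_generated ** length_penalty

    return max(finished, key=score)


print(beam_search(length_penalty=0.0)[0])  # no normalization
print(beam_search(length_penalty=1.0)[0])  # divide by length
```

In this toy setup the short sentence has the highest raw probability (0.5 versus 0.405), so it wins without normalization, while dividing by length lets the longer sentence win.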

In [7]:
# TODO: create a function for generation with huggingface
def get_answer(
    tokenizer,
    model,
    messages,
    max_new_tokens=200,
    temperature=0.5,
    do_sample=True,
):
    # TODO: tokenize input, generate answer and decode output. Pay attention to tokenizer methods
    tokenized_input = tokenizer.apply_chat_template(
        messages,
        return_tensors="pt",
        padding=True,
        truncation=True,
    ).to(device)

    model_output = model.generate(
        tokenized_input,
        max_new_tokens=max_new_tokens,
        do_sample=do_sample,
        temperature=temperature,  # forward the function argument; only used when do_sample=True
        top_k=10,
        top_p=0.91,
    )

    return tokenizer.batch_decode(model_output, skip_special_tokens=True)

In [8]:
# Let's try our model

messages = [
    {
        "role": "user",
        "content": "Write an explanation of tensors for 5 year old",
    },
]

print(get_answer(tokenizer, model, messages)[0])

Asking to pad to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no padding.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


[INST] Write an explanation of tensors for 5 year old [/INST] Tensors are like magical boxes that can hold different things, but they have a special rule. They can hold numbers, but not just any numbers, they have to be organized in a certain way.

Imagine you have a box of apples, and you want to count how many apples you have. That's like having a number, which is easy to understand. But, what if your box can also hold not only apples but also oranges or bananas? And you want to know how many of each fruit you have. That's when you need a tensor.

A tensor is like a special box with different sections. Each section can hold a different kind of fruit, and you can count how many of each kind of fruit you have in each section. So, a tensor with two sections might look like this: [2 apples, 3 oranges]. This is called a 2-dimensional tensor because it has 2 sections.



You should obtain an explanation from the model. If so, let us go further!

Now we will take a sample from the BoolQ dataset (https://huggingface.co/datasets/google/boolq) and try prompting techniques to extract the required answer and measure its quality.

In [9]:
df = load_dataset("google/boolq")

In [10]:
# Fixing 20 validation examples

random.seed(const_seed)
idx = random.sample(range(1, 3270), 20)
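
A small sketch of why seeding fixes the sample: the same seed always yields the same indices. The helper `pick_indices` is hypothetical; it uses a private `random.Random` instance, which produces the same draw as calling `random.seed` followed by `random.sample` on the global generator.

```python
import random


def pick_indices(seed, population_size=3270, k=20):
    """Draw k distinct indices from 1..population_size-1 with a private RNG."""
    rng = random.Random(seed)
    return rng.sample(range(1, population_size), k)


first = pick_indices(100)
second = pick_indices(100)
print(first == second)   # same seed, same sample
print(min(first) >= 1)   # range(1, ...) never yields index 0
```

Note that `range(1, 3270)` starts at 1, so the validation row with index 0 can never be selected.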

In [11]:
# sample you will work with
df_sample = df["validation"].select(idx)

In [12]:
# For instance, you can construct your prompt the following way
messages = [
    {
        "role": "user",
        "content": """You are given a text and question. Answer only "true" or "false".
text: As with other games in The Elder Scrolls series, the game is set on the continent of Tamriel. The events of the game occur a millennium before those of The Elder Scrolls V: Skyrim and around 800 years before The Elder Scrolls III: Morrowind and The Elder Scrolls IV: Oblivion. It has a broadly similar structure to Skyrim, with two separate conflicts progressing at the same time, one with the fate of the world in the balance, and one where the prize is supreme power on Tamriel. In The Elder Scrolls Online, the first struggle is against the Daedric Prince Molag Bal, who is attempting to meld the plane of Mundus with his realm of Coldharbour, and the second is to capture the vacant imperial throne, contested by three alliances of the mortal races. The player character has been sacrificed to Molag Bal, and Molag Bal has stolen their soul, the recovery of which is the primary game objective.
question: is elder scrolls online the same as skyrim
answer: """
    },
]

print(get_answer(tokenizer, model, messages)[0])



[INST] You are given a text and question. Answer only "true" or "false".
text: As with other games in The Elder Scrolls series, the game is set on the continent of Tamriel. The events of the game occur a millennium before those of The Elder Scrolls V: Skyrim and around 800 years before The Elder Scrolls III: Morrowind and The Elder Scrolls IV: Oblivion. It has a broadly similar structure to Skyrim, with two separate conflicts progressing at the same time, one with the fate of the world in the balance, and one where the prize is supreme power on Tamriel. In The Elder Scrolls Online, the first struggle is against the Daedric Prince Molag Bal, who is attempting to meld the plane of Mundus with his realm of Coldharbour, and the second is to capture the vacant imperial throne, contested by three alliances of the mortal races. The player character has been sacrificed to Molag Bal, and Molag Bal has stolen their soul, the recovery of which is the primary game objective.
question: is elder scr

Is anything wrong with the output? Now it is time for you to play around and try to come up with a better prompt.
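
One thing that stands out is that the decoded text starts with the prompt itself: for a decoder-only model, `generate` returns the prompt tokens followed by the continuation. A minimal sketch of cutting the prompt off, using plain Python lists as stand-ins for the token-id tensors (the ids and the helper name `strip_prompt` are made up):

```python
def strip_prompt(generated_ids, prompt_len):
    """Drop the first prompt_len ids from every generated sequence."""
    return [seq[prompt_len:] for seq in generated_ids]


prompt_ids = [1, 733, 16289, 28793]      # made-up ids standing in for the prompt
generated = [prompt_ids + [5852, 2]]     # prompt followed by the continuation

print(strip_prompt(generated, len(prompt_ids)))
```

With real tensors the equivalent slice would be `model_output[:, tokenized_input.shape[1]:]`, applied before `batch_decode`.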

In [13]:
def generate_prompt(gen_type, row, examples=None):
    question = row["question"]
    passage = row["passage"]
    true_answer = row["answer"]
    if gen_type == "naive":
        instruction = 'You are given a text and a question to answer. Write only the answer on a separate empty line in the format: "answer=true" or "answer=false".\n'
    elif gen_type == "few_shot":
        instruction = 'You are given a text and a question to answer. Write only the answer on a separate empty line in the format: "answer=true" or "answer=false". Examples:\n'
        instruction += examples
    elif gen_type == "cot":
        instruction = 'You are given a text and a question to answer. Write only the answer on a separate empty line in the format: "answer=true" or "answer=false". Try to think step by step.\n'

    cot_string = "Let's think step by step\n" if gen_type == "cot" else ""
    prompt = {
        "role": "user",
        "content": f"instruction: {instruction}\ntext: {passage}\nquestion: {question}?\n{cot_string}"
    }
    return prompt, true_answer

In [14]:
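def extract_answer(model_output):
    # keep only the completion that follows the closing instruction tag
    inst_id = model_output.find("[/INST]")
    trunced = model_output[inst_id + len("[/INST]"):] if inst_id != -1 else model_output

    # split on any whitespace, so an answer that starts a new line is found too
    for token in trunced.split():
        if token.startswith("answer=") and "true" in token.lower():
            return True

    # "answer=false" and unparseable outputs both map to False
    return False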
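
As an alternative, the answer tag can be located with a regular expression, which tolerates newlines, extra spaces and capitalisation. This is only a sketch: the name `parse_label` and the choice to return `None` for unparseable outputs (instead of defaulting to `False`) are assumptions, not part of the lab template.

```python
import re

ANSWER_RE = re.compile(r"answer\s*=\s*(true|false)", re.IGNORECASE)


def parse_label(model_output, marker="[/INST]"):
    """Return True/False for the first answer tag after the marker, or None if absent."""
    _, found, completion = model_output.partition(marker)
    if not found:                 # no marker: search the whole text
        completion = model_output
    match = ANSWER_RE.search(completion)
    if match is None:
        return None
    return match.group(1).lower() == "true"


print(parse_label("[INST] question [/INST] Some reasoning.\nanswer=True."))
print(parse_label("[INST] example: answer=true [/INST]\nanswer=false"))
print(parse_label("[INST] question [/INST] I am not sure."))
```

Returning `None` makes it possible to count how often the model ignored the requested format, separately from genuinely wrong answers.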

In [15]:
def make_naive_predict(test_ds, model, get_answer_fn=get_answer):
    predicted_labels = []
    true_labels = []
    gen_res = []

    for row in test_ds:
        prompt, true_answer = generate_prompt(gen_type="naive", row=row)

        model_output = get_answer_fn(tokenizer, model, [prompt])[0]

        answer = extract_answer(model_output)

        predicted_labels.append(answer)
        true_labels.append(true_answer)
        gen_res.append({"prompt": prompt, "output": model_output, "true_ans": true_answer, "pred_ans": answer})


    return torch.tensor(predicted_labels), torch.tensor(true_labels), gen_res

In [16]:
def example_format(text, question, answer):
    return f"text: {text}\nquestion: {question}?\nanswer={str(answer).lower()}"

def gen_examples(dataset, n_shots):
    ids = random.sample(range(1, 9427), n_shots)
    sample = dataset.select(ids)

    # separate the examples with blank lines, so that the answer of one example
    # does not run into the "text:" line of the next one
    return "\n\n".join(
        example_format(s["passage"], s["question"], s["answer"]) for s in sample
    ) + "\n"
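
To see what a few-shot block looks like, here is a self-contained sketch with two made-up rows in place of the BoolQ train split. The rows and the helper names `format_example` and `build_examples` are hypothetical; the demonstrations are joined with a blank line so each one is visually separated from the next.

```python
def format_example(text, question, answer):
    """Render one demonstration in the text/question/answer layout."""
    return f"text: {text}\nquestion: {question}?\nanswer={str(answer).lower()}"


def build_examples(rows):
    """Join demonstrations with a blank line between them."""
    return "\n\n".join(
        format_example(r["passage"], r["question"], r["answer"]) for r in rows
    )


toy_rows = [  # made-up rows with the same keys as the dataset
    {"passage": "Water boils at 100 C at sea level.", "question": "does water boil at 100 c", "answer": True},
    {"passage": "The Moon orbits the Earth.", "question": "does the earth orbit the moon", "answer": False},
]

print(build_examples(toy_rows))
```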

In [17]:
def few_shot_prompting(test_ds, model, add_ds=df["train"], n_shots=5, get_answer_fn=get_answer):
    predicted_labels = []
    true_labels = []
    gen_res = []

    for row in test_ds:
        examples = gen_examples(add_ds, n_shots)
        prompt, true_answer = generate_prompt(gen_type="few_shot", row=row, examples=examples)

        model_output = get_answer_fn(tokenizer, model, [prompt])[0]

        answer = extract_answer(model_output)

        predicted_labels.append(answer)
        true_labels.append(true_answer)
        gen_res.append({"prompt": prompt, "output": model_output, "true_ans": true_answer, "pred_ans": answer})


    return torch.tensor(predicted_labels), torch.tensor(true_labels), gen_res

In [18]:
def cot_prompting(test_ds, model, get_answer_fn=get_answer):
    predicted_labels = []
    true_labels = []
    gen_res = []

    for row in test_ds:
        prompt, true_answer = generate_prompt(gen_type="cot", row=row)

        model_output = get_answer_fn(tokenizer, model, [prompt])[0]

        answer = extract_answer(model_output)

        predicted_labels.append(answer)
        true_labels.append(true_answer)
        gen_res.append({"prompt": prompt, "output": model_output, "true_ans": true_answer, "pred_ans": answer})

    return torch.tensor(predicted_labels), torch.tensor(true_labels), gen_res

In [19]:
orig_naive_pred, orig_naive_true, orig_naive_gen = make_naive_predict(df_sample, model)


In [20]:
orig_fw_pred, orig_fw_true, orig_fw_gen = few_shot_prompting(df_sample, model)


In [21]:
orig_cot_pred, orig_cot_true, orig_cot_gen = cot_prompting(df_sample, model)


In [22]:
from sklearn.metrics import accuracy_score
import pandas as pd

In [23]:
# TODO: create function to evaluate answers
# Note: you can adapt function for different answer structures,
# but you should be able to automatically extract the target "true" or "false" components
def evaluate_answers(true_ans, pred_ans):
    titles = ["naive", "few_shot", "chain_of_thought"]
    result = []
    for true, pred, title in zip(true_ans, pred_ans, titles):
        acc = accuracy_score(true, pred)
        result.append({"title": title, "accuracy": acc})
    metrics = pd.DataFrame(result, columns=["title", "accuracy"]).set_index("title")
    print(metrics)
    return metrics

In [24]:
orig_metrics = evaluate_answers([orig_naive_true, orig_fw_true, orig_cot_true], [orig_naive_pred, orig_fw_pred, orig_cot_pred])

                  accuracy
title                     
naive                 0.80
few_shot              0.70
chain_of_thought      0.75
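
With only 20 examples, each example is worth 0.05 accuracy, so these numbers are noisy. A rough sketch of the number of correct answers and the binomial standard error behind each score (the normal approximation is crude at this sample size and is used here only to give a sense of scale):

```python
import math


def binomial_stderr(accuracy, n):
    """Standard error of an accuracy estimated from n independent examples."""
    return math.sqrt(accuracy * (1 - accuracy) / n)


N = 20
reported = {"naive": 0.80, "few_shot": 0.70, "chain_of_thought": 0.75}

for name, acc in reported.items():
    correct = round(acc * N)
    print(f"{name:>16}: {correct}/{N} correct, stderr ~ {binomial_stderr(acc, N):.3f}")
```

The standard errors are around 0.09 to 0.10, larger than the gaps between the three strategies, so the ranking should not be over-interpreted.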


In [25]:
def save_gens_to_csv(gen, filename):
    df = pd.DataFrame(gen, columns=["prompt", "output", "true_ans", "pred_ans"])
    df.to_csv(filename)
    return df

In [26]:
n_out = save_gens_to_csv(orig_naive_gen, "naive_outputs.csv")
fw_out = save_gens_to_csv(orig_fw_gen, "few_shot_outputs.csv")
cot_out = save_gens_to_csv(orig_cot_gen, "chain_of_thoughts_outputs.csv")

TODO: Try and compare "naive" prompting (your best hand-crafted variant), few-shot prompting (https://www.promptingguide.ai/techniques/fewshot) and chain-of-thought prompting (step-by-step thinking - https://www.promptingguide.ai/techniques/cot).

TODO: Save the generation results into separate csv files and do not forget to attach them to your homework.

# Part 2 (5 points): Fine-tuning with PEFT and LoRA

In [None]:
from peft import LoraConfig

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["lm_head", "q_proj", "k_proj", "v_proj", "o_proj"]
)
peft_model = peft.get_peft_model(model, peft_config)

In [None]:
peft_model.print_trainable_parameters()
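
As a sanity check for the printed count, the number of LoRA parameters can be estimated by hand: each adapted linear layer gets two matrices, A (in_features x r) and B (r x out_features). The sketch below uses the shapes visible in the module printout later in this notebook (hidden size 4096, `k_proj` output 1024, vocabulary 32000, 32 layers); that `v_proj` has the same shape as `k_proj` is an assumption.

```python
def lora_params(in_features, out_features, r):
    """Parameters added by one LoRA adapter: A is (in x r), B is (r x out)."""
    return r * (in_features + out_features)


R = 16
HIDDEN, KV_DIM, VOCAB, N_LAYERS = 4096, 1024, 32000, 32

per_layer = (
    lora_params(HIDDEN, HIDDEN, R)      # q_proj
    + lora_params(HIDDEN, KV_DIM, R)    # k_proj
    + lora_params(HIDDEN, KV_DIM, R)    # v_proj (assumed same shape as k_proj)
    + lora_params(HIDDEN, HIDDEN, R)    # o_proj
)
lm_head = lora_params(HIDDEN, VOCAB, R)
total = N_LAYERS * per_layer + lm_head

print(f"per layer: {per_layer:,}")
print(f"lm_head:   {lm_head:,}")
print(f"total:     {total:,}")
```

Under these assumptions the adapters add roughly 14 million trainable parameters, a small fraction of the frozen base model.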

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token_id = tokenizer.eos_token_id  # reuse EOS as the padding token

In [None]:
arguments = TrainingArguments(
    output_dir="./tuned",
    report_to="tensorboard",
    logging_steps=5,
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=2e-5,
    seed=224,
    evaluation_strategy="no",
)

In [None]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [None]:
%pip install --quiet trl

In [None]:
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

In [None]:
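# 2000 training examples; note that the indices come from 1..3269,
# i.e. only from the beginning of the 9427-row train split
tuned_idx = random.sample(range(1, 3270), 2000)
tuned_ds = df["train"].select(tuned_idx)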
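
A quick estimate of how long training will run with the `TrainingArguments` above: 2000 examples, batch size 8, 3 epochs. The sketch assumes a single device and no gradient accumulation (the default), and the helper `total_steps` is hypothetical.

```python
import math


def total_steps(n_examples, batch_size, epochs, grad_accum=1, n_devices=1):
    """Number of optimizer steps for a simple single-run training setup."""
    effective_batch = batch_size * grad_accum * n_devices
    return math.ceil(n_examples / effective_batch) * epochs


steps = total_steps(n_examples=2000, batch_size=8, epochs=3)
print("optimizer steps:", steps)
print("log entries with logging_steps=5:", steps // 5)
```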

In [None]:
def formatting_func(example):
    output_texts = []

    for i in range(len(example["question"])):
        text = f"""### Question: {example["question"][i]}?\n ###Text: {example["passage"][i]}\n
        ### Answer: {example["answer"][i]}\n"""
        output_texts.append(text)
    return output_texts

response_template = "### Answer:"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)
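
The idea behind a completion-only collator is that the loss is computed only on the tokens after the response template; everything before it gets the label -100, which the cross-entropy loss ignores. A toy sketch with made-up token ids (the helper `mask_prompt` is hypothetical and only mimics the idea, it is not the `trl` code):

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch cross-entropy by default


def mask_prompt(input_ids, template_ids):
    """Copy input_ids into labels, masking everything up to and including the template."""
    labels = list(input_ids)
    n = len(template_ids)
    for start in range(len(input_ids) - n + 1):
        if input_ids[start:start + n] == template_ids:
            end = start + n
            labels[:end] = [IGNORE_INDEX] * end
            return labels
    # template not found: mask the whole sequence so it adds nothing to the loss
    return [IGNORE_INDEX] * len(input_ids)


prompt = [11, 12, 13]   # made-up ids for the question and passage
template = [7, 8]       # made-up ids for "### Answer:"
answer = [21, 22]       # made-up ids for the answer text

print(mask_prompt(prompt + template + answer, template))
```

If the template is tokenized differently inside the full text than on its own, it will not be found and the example contributes nothing, so it is worth checking a few collated batches by hand.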

In [None]:
trainer = SFTTrainer(
    model=peft_model,
    train_dataset=tuned_ds,
    peft_config=peft_config,
    args=arguments,
    formatting_func=formatting_func,
    data_collator=collator,
)

In [None]:
trainer.train()

In [None]:
trainer.save_model("./tuned_model")

In [None]:
%cd ./tuned_model

In [None]:
!zip -r tuned_model.zip .

In [27]:
from peft import PeftModel, PeftConfig

In [None]:
%ls

In [28]:
peft_path = "/kaggle/input/tuned-model/pytorch/tunedmodel/1"

config = PeftConfig.from_pretrained(peft_path)

tmodel = PeftModel.from_pretrained(model, peft_path)

In [29]:
tmodel.gradient_checkpointing_enable()
tmodel.enable_input_require_grads()
tmodel.to(device)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): MistralForCausalLM(
      (model): MistralModel(
        (embed_tokens): Embedding(32000, 4096)
        (layers): ModuleList(
          (0-31): 32 x MistralDecoderLayer(
            (self_attn): MistralSdpaAttention(
              (q_proj): Linear4bit(
                in_features=4096, out_features=4096, bias=False
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (k_proj): Linear4bit(
                in_features=4096, out_features=1024, bias=False
 

In [30]:
def get_answer_for_peft(tokenizer, model, messages, max_new_tokens=200,
               temperature=0.5, do_sample=True):
    # TODO: tokenize input, generate answer and decode output. Pay attention to tokenizer methods
    tokenized_input = tokenizer.apply_chat_template(messages, return_tensors="pt", padding=True, truncation=True).to(device)
    model_output = model.generate(input_ids=tokenized_input, max_new_tokens=max_new_tokens, do_sample=do_sample, temperature=temperature, top_k=10, top_p=0.91)
    decoded_output = tokenizer.batch_decode(model_output, skip_special_tokens=True)
    return decoded_output

In [31]:
tmodel.to(device)


In [32]:
tuned_naive_pred, tuned_naive_true, tuned_naive_gen = make_naive_predict(df_sample, tmodel, get_answer_fn=get_answer_for_peft)


In [33]:
tuned_fw_pred, tuned_fw_true, tuned_fw_gen = few_shot_prompting(df_sample, tmodel, get_answer_fn=get_answer_for_peft)


In [34]:
tuned_cot_pred, tuned_cot_true, tuned_cot_gen = cot_prompting(df_sample, tmodel, get_answer_fn=get_answer_for_peft)


In [35]:
orig_metrics

                  accuracy
title                     
naive                 0.80
few_shot              0.70
chain_of_thought      0.75


In [36]:
tuned_metrics = evaluate_answers([tuned_naive_true, tuned_fw_true, tuned_cot_true], [tuned_naive_pred, tuned_fw_pred, tuned_cot_pred])

                  accuracy
title                     
naive                 0.75
few_shot              0.80
chain_of_thought      0.80
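
To put the two tables side by side, the sketch below converts the accuracy differences into a number of examples out of the 20 evaluated. The values are copied from the printed metrics; nothing is recomputed from model outputs.

```python
N = 20
base = {"naive": 0.80, "few_shot": 0.70, "chain_of_thought": 0.75}
tuned = {"naive": 0.75, "few_shot": 0.80, "chain_of_thought": 0.80}

deltas = {name: round((tuned[name] - base[name]) * N) for name in base}

for name, delta in deltas.items():
    print(f"{name:>16}: {base[name]:.2f} -> {tuned[name]:.2f} ({delta:+d} examples)")
```

Each change amounts to one or two examples, and generation uses sampling, so part of the difference may come from run-to-run variation rather than from fine-tuning.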


In [None]:
%cd ../

In [38]:
n_out_tune = save_gens_to_csv(tuned_naive_gen, "tuned_naive_outputs.csv")
fw_out_tune = save_gens_to_csv(tuned_fw_gen, "tuned_few_shot_outputs.csv")
cot_out_tune = save_gens_to_csv(tuned_cot_gen, "tuned_chain_of_thoughts_outputs.csv")

TODO: initialize the Trainer and train on the train part of our dataset for 2-3 epochs

Note: carefully set max_seq_length and args (an instance of transformers.TrainingArguments)

TODO: save and check your tuned model. Provide scores on our 20 validation examples and save the results to a CSV file