In [1]:
%pip install --quiet transformers==4.37.2 accelerate==0.24.0 sentencepiece==0.1.99 optimum==1.13.2 peft==0.5.0 bitsandbytes==0.41.2.post2 datasets==2.14.7


In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from tqdm.auto import tqdm, trange
import torch
import torch.nn as nn
import torch.nn.functional as F
import peft

import transformers
from datasets import load_dataset

import random
const_seed = 100

In [3]:
assert torch.cuda.is_available(), "CUDA is not available (change the runtime type in Colab)"

In [4]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [5]:
device

device(type='cuda')

In [None]:
! ls

res_model  sample_data


# Part 0: Initializing the model and tokenizer

Let's take the Mistral model (https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2), which was tuned to follow user instructions, for our experiments. Note that we load the model in 4-bit precision to reduce memory usage.
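To get a feel for why 4-bit loading matters, here is a back-of-the-envelope estimate of the memory needed for the weights alone. It is a rough sketch: the parameter count of about 7.24 billion is an approximation, and the estimate ignores quantization constants, layers kept in higher precision, activations, and the KV cache.

```python
# Rough weight-memory estimate at different precisions (weights only).
N_PARAMS = 7_240_000_000  # approximate parameter count of a 7B-class model (assumption)

def weight_gib(n_params: int, bits_per_param: float) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30

estimates = {
    name: weight_gib(N_PARAMS, bits)
    for name, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("4-bit", 4)]
}
for name, gib in estimates.items():
    print(f"{name:>5}: {gib:5.1f} GiB")
```

Under these assumptions the 4-bit weights take roughly an eighth of the fp32 footprint, which is what lets the model fit on a single Colab GPU.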

In [6]:
model_name = "mistralai/Mistral-7B-Instruct-v0.2"

In [7]:
# load the Mistral tokenizer (tokenizers run on CPU, so no device_map is needed)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token_id = tokenizer.eos_token_id

# Note: to speed up inference you can use flash attention 2 (https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2)
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map='auto', low_cpu_mem_usage=True, offload_state_dict=True,
    load_in_4bit=True, torch_dtype=torch.float32,  # weights are 4-bit; layernorms and activations are fp32
)
for param in model.parameters():
    param.requires_grad=False

model.gradient_checkpointing_enable()  # only store a small subset of activations, re-compute the rest.
model.enable_input_require_grads()     # override an implementation quirk in gradient checkpoints that disables backprop unless inputs require grad
# more on gradient checkpointing: https://pytorch.org/docs/stable/checkpoint.html https://arxiv.org/abs/1604.06174

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



# Part 1 (5 points): Prompt-engineering

**There are different strategies for text generation in huggingface:**

| Strategy | Description | Pros & Cons |
| --- | --- | --- |
| Greedy Search | Chooses the word with the highest probability as the next word in the sequence. | **Pros:** Simple and fast. <br> **Cons:** Can lead to repetitive and incoherent text. |
| Sampling with Temperature | Introduces randomness in the word selection. A higher temperature leads to more randomness. | **Pros:** Allows exploration and diverse output. <br> **Cons:** Higher temperatures can lead to nonsensical outputs. |
| Nucleus Sampling (Top-p Sampling) | Selects the next word from a truncated vocabulary, the "nucleus" of words that have a cumulative probability exceeding a pre-specified threshold (p). | **Pros:** Balances diversity and quality. <br> **Cons:** Setting an optimal 'p' can be tricky. |
| Beam Search | Explores multiple hypotheses (sequences of words) at each step, and keeps the 'k' most likely, where 'k' is the beam width. | **Pros:** Produces more reliable results than greedy search. <br> **Cons:** Can lack diversity and lead to generic responses. |
| Top-k Sampling | Randomly selects the next word from the top 'k' words with the highest probabilities. | **Pros:** Introduces randomness, increasing output diversity. <br> **Cons:** Random selection can sometimes lead to less coherent outputs. |
| Length Normalization | Prevents the model from favoring shorter sequences by dividing the log probabilities by the sequence length raised to some power. | **Pros:** Makes longer and potentially more informative sequences more likely. <br> **Cons:** Tuning the normalization factor can be difficult. |
| Stochastic Beam Search | Introduces randomness into the selection process of the 'k' hypotheses in beam search. | **Pros:** Increases diversity in the generated text. <br> **Cons:** The trade-off between diversity and quality can be tricky to manage. |
| Decoding with Minimum Bayes Risk (MBR) | Chooses the hypothesis (out of many) that minimizes expected loss under a loss function. | **Pros:** Optimizes the output according to a specific loss function. <br> **Cons:** Computationally more complex and requires a good loss function. |

Documentation references:
- [reference for `AutoModelForCausalLM.generate()`](https://huggingface.co/docs/transformers/v4.29.1/en/main_classes/text_generation#transformers.GenerationMixin.generate)
- [reference for `AutoTokenizer.decode()`](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.decode)
- Huggingface [docs on generation strategies](https://huggingface.co/docs/transformers/generation_strategies)
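The strategies in the table can be illustrated on a toy next-token distribution. The sketch below uses made-up probabilities and plain Python rather than the `generate()` API, so the function names are illustrative only; it shows how greedy search, temperature, top-k, and top-p each reshape the set of candidate tokens.

```python
import math
import random

# Toy next-token distribution (made-up numbers, not model output).
probs = {"cat": 0.50, "dog": 0.30, "fish": 0.15, "rock": 0.05}

def greedy(p):
    """Greedy search: always pick the most probable token."""
    return max(p, key=p.get)

def apply_temperature(p, temperature):
    """Divide log-probabilities by T and renormalize; T > 1 flattens, T < 1 sharpens."""
    scaled = {tok: math.exp(math.log(q) / temperature) for tok, q in p.items()}
    total = sum(scaled.values())
    return {tok: v / total for tok, v in scaled.items()}

def top_k(p, k):
    """Keep the k most probable tokens and renormalize."""
    kept = dict(sorted(p.items(), key=lambda kv: kv[1], reverse=True)[:k])
    total = sum(kept.values())
    return {tok: v / total for tok, v in kept.items()}

def top_p(p, threshold):
    """Keep the smallest set of top tokens whose cumulative probability reaches the threshold."""
    kept, cumulative = {}, 0.0
    for tok, q in sorted(p.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = q
        cumulative += q
        if cumulative >= threshold:
            break
    total = sum(kept.values())
    return {tok: v / total for tok, v in kept.items()}

def sample(p, rng):
    """Draw one token according to the given distribution."""
    tokens, weights = zip(*p.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)
print("greedy:", greedy(probs))
print("top-k (k=2):", top_k(probs, 2))
print("top-p (p=0.9):", top_p(probs, 0.9))
print("T=2.0:", apply_temperature(probs, 2.0))
print("sampled from nucleus:", sample(top_p(probs, 0.9), rng))
```

In `generate()`, the corresponding knobs are `do_sample`, `temperature`, `top_k`, `top_p`, and `num_beams`.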

In [None]:
# TODO: create a function for generation with huggingface
def get_answer(tokenizer, model, messages, max_new_tokens=200,
               num_beams=3, do_sample=False):
    # TODO: tokenize input, generate answer and decode output. Pay attention to tokenizer methods
    inputs = tokenizer.apply_chat_template(messages, return_tensors="pt")
    model_inputs = inputs.to(device)

    outputs = model.generate(model_inputs, max_new_tokens=max_new_tokens, num_beams=num_beams, do_sample=do_sample, pad_token_id=tokenizer.eos_token_id)
    decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return decoded

In [None]:
# Let's try our model

messages = [
    {"role": "user", "content": "Write an explanation of tensors for 5 year old"},
]

print(get_answer(tokenizer, model, messages)[0])

[INST] Write an explanation of tensors for 5 year old [/INST] Tensors are like magical boxes that can hold many things at once. Imagine you have a box that can hold only one thing, like a toy car. That's like a regular number. But what if your box could hold more than one thing? Maybe it could hold two toy cars, or three apples, or even a mix of things! That's what a tensor is. It's a special kind of box that can hold lots of different things all at once, and we can do fun math with them!

Just like how we can count the number of toys in a box or the number of apples on a table, we can also count the number of things in a tensor and what kind of things they are. For example, a tensor with two things might be called a "2-tensor," and a tensor with three things might be called a "3-tensor." And just like how we can arrange toys in different ways in a box, we can


You should obtain an explanation from the model. If so, let us go further!

Now we will take a sample from the BoolQ dataset (https://huggingface.co/datasets/google/boolq), try several prompting techniques to extract the required answer, and measure answer quality. Note that you work with only 20 fixed validation examples to keep the compute cost low.

In [8]:
df = load_dataset("google/boolq")


In [9]:
# Fixing 20 validation examples
# DO NOT CHANGE

random.seed(const_seed)
idx = random.sample(range(1, 3270), 20)

In [10]:
# sample you will work with
# DO NOT CHANGE
df_sample = df["validation"].select(idx)

In [None]:
for i in range(len(df_sample)):
  print(df_sample[i])

{'question': 'is the vice president the head of the senate', 'answer': True, 'passage': 'As the Senate president, the vice president presides over its deliberations (or delegates this task to a member of the Senate), but is allowed to vote only when it is necessary to break a tie. While this vote-casting prerogative has been exercised chiefly on legislative issues, it has also been used to break ties on the election of Senate officers, as well as on the appointment of Senate committees. In this capacity, the vice president also presides over joint sessions of Congress.'}
{'question': 'can i get $1 000 bill from the bank', 'answer': False, 'passage': 'The Federal Reserve began taking high-denomination currency out of circulation (destroying large bills received by banks) in 1969. As of May 30, 2009, only 336 $10,000 bills were known to exist; 342 remaining $5,000 bills; and 165,372 remaining $1,000 bills. Due to their rarity, collectors often pay considerably more than the face value of

In [None]:
# For instance, you can construct your prompt the following way
messages = [
    {"role": "user", "content": '''You are given a text and question. Answer only "true" or "false".
text: As with other games in The Elder Scrolls series, the game is set on the continent of Tamriel. The events of the game occur a millennium before those of The Elder Scrolls V: Skyrim and around 800 years before The Elder Scrolls III: Morrowind and The Elder Scrolls IV: Oblivion. It has a broadly similar structure to Skyrim, with two separate conflicts progressing at the same time, one with the fate of the world in the balance, and one where the prize is supreme power on Tamriel. In The Elder Scrolls Online, the first struggle is against the Daedric Prince Molag Bal, who is attempting to meld the plane of Mundus with his realm of Coldharbour, and the second is to capture the vacant imperial throne, contested by three alliances of the mortal races. The player character has been sacrificed to Molag Bal, and Molag Bal has stolen their soul, the recovery of which is the primary game objective.
question: is elder scrolls online the same as skyrim
answer: '''},
]

print(get_answer(tokenizer, model, messages)[0])

[INST] You are given a text and question. Answer only "true" or "false".
text: As with other games in The Elder Scrolls series, the game is set on the continent of Tamriel. The events of the game occur a millennium before those of The Elder Scrolls V: Skyrim and around 800 years before The Elder Scrolls III: Morrowind and The Elder Scrolls IV: Oblivion. It has a broadly similar structure to Skyrim, with two separate conflicts progressing at the same time, one with the fate of the world in the balance, and one where the prize is supreme power on Tamriel. In The Elder Scrolls Online, the first struggle is against the Daedric Prince Molag Bal, who is attempting to meld the plane of Mundus with his realm of Coldharbour, and the second is to capture the vacant imperial throne, contested by three alliances of the mortal races. The player character has been sacrificed to Molag Bal, and Molag Bal has stolen their soul, the recovery of which is the primary game objective.
question: is elder scr

Is anything wrong with the output? Now it is your turn to experiment and come up with a better prompt.

Try out several prompts
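One thing worth noticing in the output above is that it starts with the prompt itself: for decoder-only models, `generate()` returns the prompt tokens followed by the newly generated ones. A minimal sketch of separating the two, using plain lists with made-up token ids in place of real tensors:

```python
def split_generated(sequence_ids, prompt_length):
    """Split a generated sequence into (prompt_ids, new_ids)."""
    return sequence_ids[:prompt_length], sequence_ids[prompt_length:]

prompt_ids = [1, 733, 16289, 28793]           # made-up ids standing in for the tokenized prompt
sequence_ids = prompt_ids + [1132, 28723, 2]  # shape of what generate() returns: prompt + continuation

_, new_ids = split_generated(sequence_ids, len(prompt_ids))
print(new_ids)
```

With tensors, the equivalent slice inside `get_answer` would be `outputs[:, model_inputs.shape[1]:]`, applied before `batch_decode`.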

In [20]:
import re
from sklearn.metrics import f1_score
# TODO: create function to evaluate answers
# Note: you can adapt function for different answer structures,
# but you should be able to automatically extract the target "true" or "false" components (you can always use regular expressions)
def evaluate_answers(true_answers, predictions):
    # The generated text echoes the prompt, so the first word after "answer:" is the model's reply.
    pattern = re.compile(r'\n\s*answer:\s*(\w+)', flags=re.IGNORECASE)
    preds = []
    for i in range(len(true_answers)):
        match = pattern.search(predictions[i])
        preds.append(match.group(1).lower() if match else None)

    # Anything other than "true" (including unparsed outputs) is counted as 0 ("false").
    pred_labels = [1 if label == 'true' else 0 for label in preds]

    return f1_score(true_answers, pred_labels), pred_labels
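The extraction above maps every output that is not exactly "true" to 0, so unparsable generations are silently counted as "false". A stricter variant, sketched here with a hypothetical helper `extract_label` and hand-written strings in the naive prompt format, keeps unparsed outputs separate so they can be inspected:

```python
import re

# Accept only an explicit "true" or "false" after the "answer:" marker.
ANSWER_RE = re.compile(r"answer:\s*(true|false)\b", flags=re.IGNORECASE)

def extract_label(text):
    """Return 1 for 'true', 0 for 'false', or None if no answer can be parsed."""
    match = ANSWER_RE.search(text)
    if match is None:
        return None
    return 1 if match.group(1).lower() == "true" else 0

# Hand-written strings, not real model output.
outputs = [
    "question: is water wet?\nanswer: True.",
    "question: is fire cold?\nanswer:  false",
    "question: is this parseable?\nanswer: maybe",
]
labels = [extract_label(o) for o in outputs]
unparsed = sum(label is None for label in labels)
print(labels, "unparsed:", unparsed)
```

Reporting the number of unparsed outputs next to the F1 score makes it easier to tell a prompt that produces wrong answers from one that produces badly formatted ones.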

In [None]:
def create_message_list(example_df):
  messages_list = []
  for i in range(len(example_df)):
    message = f'''"role": "user",
        "content": You are given a text and question. Answer only "true" or "false" without additional info and text.
        text: {example_df['passage'][i]}
        question: {example_df['question'][i]}?
        answer: '''
    messages_list.append(message)
  return messages_list
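Note that the function above puts the literal text `"role": "user", "content":` inside the prompt string rather than building chat messages. An alternative, sketched below with a hypothetical helper `build_chat_messages` and toy data, is to build one conversation per example in the same format that `get_answer` passes to `tokenizer.apply_chat_template`. Scores obtained this way may differ from the ones recorded in this notebook.

```python
def build_chat_messages(passages, questions):
    """Build one single-turn conversation (a list with one user message) per example."""
    instruction = 'You are given a text and question. Answer only "true" or "false".'
    conversations = []
    for passage, question in zip(passages, questions):
        content = f"{instruction}\ntext: {passage}\nquestion: {question}?\nanswer: "
        conversations.append([{"role": "user", "content": content}])
    return conversations

# Toy data standing in for df_sample['passage'] and df_sample['question'].
toy = build_chat_messages(
    ["Water boils at 100 C at sea level."],
    ["does water boil at 100 c"],
)
print(toy[0][0]["role"])
print(toy[0][0]["content"])
```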

In [11]:
def get_multiple_answer(tokenizer, model, messages, max_new_tokens=200,
               num_beams=3, do_sample=False):
  res = []
  for message in messages:
    inputs = tokenizer(message, return_tensors="pt")
    model_inputs = inputs.to(device)
    outputs = model.generate(**model_inputs, max_new_tokens=max_new_tokens, num_beams=num_beams, do_sample=do_sample, pad_token_id=tokenizer.eos_token_id)
    decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    res.append(decoded[0])
  return res

In [None]:
df_sample

Dataset({
    features: ['question', 'answer', 'passage'],
    num_rows: 20
})

In [12]:
true_answers = [int(answer) for answer in df_sample['answer']]
true_answers

[1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0]

In [None]:
messages_list = create_message_list(df_sample)
messages_list[0]

'"role": "user",\n        "content": You are given a text and question. Answer only "true" or "false" without additional info and text.\n        text: As the Senate president, the vice president presides over its deliberations (or delegates this task to a member of the Senate), but is allowed to vote only when it is necessary to break a tie. While this vote-casting prerogative has been exercised chiefly on legislative issues, it has also been used to break ties on the election of Senate officers, as well as on the appointment of Senate committees. In this capacity, the vice president also presides over joint sessions of Congress.\n        question: is the vice president the head of the senate?\n        answer: '

In [None]:
#naive
naive_res = get_multiple_answer(tokenizer, model, messages_list)

In [None]:
naive_res[0]

'"role": "user",\n        "content": You are given a text and question. Answer only "true" or "false" without additional info and text.\n        text: As the Senate president, the vice president presides over its deliberations (or delegates this task to a member of the Senate), but is allowed to vote only when it is necessary to break a tie. While this vote-casting prerogative has been exercised chiefly on legislative issues, it has also been used to break ties on the election of Senate officers, as well as on the appointment of Senate committees. In this capacity, the vice president also presides over joint sessions of Congress.\n        question: is the vice president the head of the senate?\n        answer:  true.\n        text: As the Senate president, the vice president presides over its deliberations (or delegates this task to a member of the Senate), but is allowed to vote only when it is necessary to break a tie. While this vote-casting prerogative has been exercised chiefly on

In [None]:
naive_score, naive_preds = evaluate_answers(true_answers, naive_res)
print(naive_score)
print(naive_preds)

0.8148148148148148
[1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0]
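To see where the score comes from, the F1 value can be recomputed by hand from the two lists printed above. This sketch uses only the labels and predictions shown in this notebook and the standard definition of F1 for the positive class:

```python
true_answers = [1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0]
naive_preds  = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0]

pairs = list(zip(true_answers, naive_preds))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # predicted true, actually true
fp = sum(t == 0 and p == 1 for t, p in pairs)  # predicted true, actually false
fn = sum(t == 1 and p == 0 for t, p in pairs)  # predicted false, actually true

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(tp, fp, fn, f1)
```

With 11 true positives, 2 false positives, and 3 false negatives, F1 equals 22/27, which matches the value reported by `f1_score`.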


In [None]:
import pandas as pd
d = {'message': messages_list, 'true_answers': true_answers, 'naive_answer': naive_preds}
naive_df = pd.DataFrame(data=d)
naive_df

| | message | true_answers | naive_answer |
| --- | --- | --- | --- |
| 0 | "role": "user",\n "content": You are gi... | 1 | 1 |
| 1 | "role": "user",\n "content": You are gi... | 0 | 0 |
| 2 | "role": "user",\n "content": You are gi... | 1 | 1 |
| 3 | "role": "user",\n "content": You are gi... | 0 | 0 |
| 4 | "role": "user",\n "content": You are gi... | 1 | 0 |
| 5 | "role": "user",\n "content": You are gi... | 1 | 1 |
| 6 | "role": "user",\n "content": You are gi... | 1 | 1 |
| 7 | "role": "user",\n "content": You are gi... | 1 | 0 |
| 8 | "role": "user",\n "content": You are gi... | 1 | 1 |
| 9 | "role": "user",\n "content": You are gi... | 1 | 1 |


In [None]:
df_save_path = 'naive_result_f1_'+str(round(naive_score, 3))+".csv"
df_save_path

'naive_result_f1_0.815.csv'

In [None]:
naive_df.to_csv(df_save_path, index=False)

In [None]:
# Read the file back to check it; use a new name so the BoolQ dataset in `df` is not overwritten.
naive_df_loaded = pd.read_csv('/content/naive_result_f1_0.815.csv')
naive_df_loaded

| | message | true_answers | naive_answer |
| --- | --- | --- | --- |
| 0 | "role": "user",\n "content": You are gi... | 1 | 1 |
| 1 | "role": "user",\n "content": You are gi... | 0 | 0 |
| 2 | "role": "user",\n "content": You are gi... | 1 | 1 |
| 3 | "role": "user",\n "content": You are gi... | 0 | 0 |
| 4 | "role": "user",\n "content": You are gi... | 1 | 0 |
| 5 | "role": "user",\n "content": You are gi... | 1 | 1 |
| 6 | "role": "user",\n "content": You are gi... | 1 | 1 |
| 7 | "role": "user",\n "content": You are gi... | 1 | 0 |
| 8 | "role": "user",\n "content": You are gi... | 1 | 1 |
| 9 | "role": "user",\n "content": You are gi... | 1 | 1 |


In [None]:
#few-shot prompting
# Note: the examples are drawn from the validation split, so they may overlap with the 20 evaluation examples.
few_shot_message = ""
few_shot_idx = random.sample(range(1, 3270), 3)  # separate name so the evaluation `idx` is not overwritten
df_few_shot_sample = df["validation"].select(few_shot_idx)
for i in range(len(df_few_shot_sample)):
  few_shot_message += f'''"role": "user",
                example text: {df_few_shot_sample['passage'][i]}
                example question: {df_few_shot_sample['question'][i]}?
                example answer: {df_few_shot_sample['answer'][i]}
                '''
few_shot_message

'"role": "user",\n                example text: Since the 20th century, the word ``girdle\'\' also has been used to define an undergarment made of elasticized fabric that was worn by women. It is a form-fitting foundation garment that encircles the lower torso, perhaps extending below the hips, and worn often to shape or for support. It may be worn for aesthetic or medical reasons. In sports or medical treatment, a girdle may be worn as a compression garment. This form of women\'s foundation wear replaced the corset in popularity, and was in turn to a large extent surpassed by the pantyhose in the 1960s.\n                example question: is a girdle the same as a corset?\n                example answer: False\n                "role": "user",\n                example text: The second season of the American political drama series Designated Survivor was ordered on May 11, 2017. It premiered on September 27, 2017, and consisted of 22 episodes. The series is produced by ABC Studios and Th
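Because the few-shot examples come from the same split as the 20 evaluation examples, an evaluation example could in principle end up in its own prompt. One way to rule that out is to exclude the evaluation indices when sampling; drawing the examples from the train split would avoid the issue altogether. A minimal sketch, where `sample_disjoint` is a hypothetical helper and the indices are toy values:

```python
import random

def sample_disjoint(n_total, k, exclude, seed):
    """Sample k indices from range(n_total) that do not appear in `exclude`."""
    excluded = set(exclude)
    candidates = [i for i in range(n_total) if i not in excluded]
    # A local Random instance keeps the draw reproducible without touching global state.
    return random.Random(seed).sample(candidates, k)

eval_idx = [3, 17, 42]  # stand-in for the 20 fixed evaluation indices
few_shot_idx = sample_disjoint(n_total=100, k=3, exclude=eval_idx, seed=0)
print(few_shot_idx)
```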

In [None]:
#few-shot prompting
def create_message_list(example_df, examples):
  messages_list = []
  for i in range(len(example_df)):
    message = f'''"role": "user",
        "content": You are given a text and question. Answer only "true" or "false" without additional info and text.
        For example: {examples}
        So my text: {example_df['passage'][i]}
        So my question is: {example_df['question'][i]}?
        Main_answer: '''
    messages_list.append(message)
  return messages_list

In [None]:
messages_list = create_message_list(df_sample, few_shot_message)
messages_list[0]

'"role": "user",\n        "content": You are given a text and question. Answer only "true" or "false" without additional info and text. \n        For example: "role": "user",\n                example text: Since the 20th century, the word ``girdle\'\' also has been used to define an undergarment made of elasticized fabric that was worn by women. It is a form-fitting foundation garment that encircles the lower torso, perhaps extending below the hips, and worn often to shape or for support. It may be worn for aesthetic or medical reasons. In sports or medical treatment, a girdle may be worn as a compression garment. This form of women\'s foundation wear replaced the corset in popularity, and was in turn to a large extent surpassed by the pantyhose in the 1960s.\n                example question: is a girdle the same as a corset?\n                example answer: False\n                "role": "user",\n                example text: The second season of the American political drama series D

In [None]:
few_shot_res = get_multiple_answer(tokenizer, model, messages_list)

In [None]:
few_shot_res[0]

'"role": "user",\n        "content": You are given a text and question. Answer only "true" or "false" without additional info and text. \n        For example: "role": "user",\n                example text: Since the 20th century, the word ``girdle\'\' also has been used to define an undergarment made of elasticized fabric that was worn by women. It is a form-fitting foundation garment that encircles the lower torso, perhaps extending below the hips, and worn often to shape or for support. It may be worn for aesthetic or medical reasons. In sports or medical treatment, a girdle may be worn as a compression garment. This form of women\'s foundation wear replaced the corset in popularity, and was in turn to a large extent surpassed by the pantyhose in the 1960s.\n                example question: is a girdle the same as a corset?\n                example answer: False\n                "role": "user",\n                example text: The second season of the American political drama series D

In [None]:
def evaluate_answers(true_answers, predictions):
    # Same as before, but the marker is now "Main_answer:" so the few-shot examples are not matched.
    pattern = re.compile(r'\n\s*Main_answer:\s*(\w+)', flags=re.IGNORECASE)
    preds = []
    for i in range(len(true_answers)):
        match = pattern.search(predictions[i])
        preds.append(match.group(1).lower() if match else None)

    # Anything other than "true" (including unparsed outputs) is counted as 0 ("false").
    pred_labels = [1 if label == 'true' else 0 for label in preds]

    return f1_score(true_answers, pred_labels), pred_labels

In [None]:
few_shot_score, few_shot_preds = evaluate_answers(true_answers, few_shot_res)
print(few_shot_score)
print(few_shot_preds)

0.9285714285714286
[1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0]


In [None]:
import pandas as pd
d = {'message': messages_list, 'true_answers': true_answers, 'few_shot_answer': few_shot_preds}
few_shot_df = pd.DataFrame(data=d)
few_shot_df

| | message | true_answers | few_shot_answer |
| --- | --- | --- | --- |
| 0 | "role": "user",\n "content": You are gi... | 1 | 1 |
| 1 | "role": "user",\n "content": You are gi... | 0 | 0 |
| 2 | "role": "user",\n "content": You are gi... | 1 | 1 |
| 3 | "role": "user",\n "content": You are gi... | 0 | 0 |
| 4 | "role": "user",\n "content": You are gi... | 1 | 1 |
| 5 | "role": "user",\n "content": You are gi... | 1 | 1 |
| 6 | "role": "user",\n "content": You are gi... | 1 | 1 |
| 7 | "role": "user",\n "content": You are gi... | 1 | 0 |
| 8 | "role": "user",\n "content": You are gi... | 1 | 1 |
| 9 | "role": "user",\n "content": You are gi... | 1 | 1 |


In [None]:
df_save_path = 'few_shot_f1_'+str(round(few_shot_score, 3))+".csv"
few_shot_df.to_csv(df_save_path, index=False)

In [None]:
#chain-of-thought prompting
cot_message = f'''"role": "user",
                example text: As with other games in The Elder Scrolls series, the game is set on the continent of Tamriel. The events of the game occur a millennium before those of The Elder Scrolls V: Skyrim and around 800 years before The Elder Scrolls III: Morrowind and The Elder Scrolls IV: Oblivion. It has a broadly similar structure to Skyrim, with two separate conflicts progressing at the same time, one with the fate of the world in the balance, and one where the prize is supreme power on Tamriel. In The Elder Scrolls Online, the first struggle is against the Daedric Prince Molag Bal, who is attempting to meld the plane of Mundus with his realm of Coldharbour, and the second is to capture the vacant imperial throne, contested by three alliances of the mortal races. The player character has been sacrificed to Molag Bal, and Molag Bal has stolen their soul, the recovery of which is the primary game objective.
                example question: is elder scrolls online the same as skyrim?
                example_answer: False.
                explanation: The Elder Scrolls Online (ESO) is not the same as Skyrim. Set a millennium before Skyrim, ESO has a unique narrative focusing on two conflicts: thwarting the Daedric Prince Molag Bal's attempt to merge Mundus with Coldharbour and contesting the vacant imperial throne among three alliances. Unlike Skyrim's single-player, open-world RPG format, ESO is a massively multiplayer online RPG (MMORPG), offering a persistent online world where players interact. The gameplay mechanics and objectives differ, with ESO emphasizing multiplayer dynamics and diverse storylines. Despite sharing the Elder Scrolls universe, these games provide distinct experiences in terms of timeline, gameplay structure, and overall gaming approach.
                '''
cot_message

'"role": "user",\n                example text: As with other games in The Elder Scrolls series, the game is set on the continent of Tamriel. The events of the game occur a millennium before those of The Elder Scrolls V: Skyrim and around 800 years before The Elder Scrolls III: Morrowind and The Elder Scrolls IV: Oblivion. It has a broadly similar structure to Skyrim, with two separate conflicts progressing at the same time, one with the fate of the world in the balance, and one where the prize is supreme power on Tamriel. In The Elder Scrolls Online, the first struggle is against the Daedric Prince Molag Bal, who is attempting to meld the plane of Mundus with his realm of Coldharbour, and the second is to capture the vacant imperial throne, contested by three alliances of the mortal races. The player character has been sacrificed to Molag Bal, and Molag Bal has stolen their soul, the recovery of which is the primary game objective.\n                example question: is elder scrolls o

In [None]:
def create_message_list(example_df, example):
  messages_list = []
  for i in range(len(example_df)):
    message = f'''"role": "user",
        "content": You are given a text and question. Answer only "true" or "false" with explanation.
        For example: {example}
        So my text: {example_df['passage'][i]}
        So my question is: {example_df['question'][i]}?
        Main_answer:
        '''
    messages_list.append(message)
  return messages_list
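In the prompt above, the verdict is requested first and the explanation follows it. A common chain-of-thought variant asks for the reasoning first and a final verdict on the last line. The sketch below shows that variant with hypothetical helpers `build_cot_prompt` and `parse_final_answer`; the continuation is hand-written, not real model output, and the parser is meant to run on the generated continuation only, since the prompt itself contains the phrase "Final answer".

```python
import re

def build_cot_prompt(passage, question):
    """Ask for step-by-step reasoning followed by a final verdict line."""
    return (
        "You are given a text and a question.\n"
        "First reason step by step, then finish with a line of the form "
        '"Final answer: true" or "Final answer: false".\n'
        f"text: {passage}\n"
        f"question: {question}?\n"
        "Reasoning:"
    )

FINAL_RE = re.compile(r"final answer:\s*(true|false)\b", flags=re.IGNORECASE)

def parse_final_answer(continuation):
    """Return the last 'Final answer' in the continuation as 1/0, or None if absent."""
    matches = FINAL_RE.findall(continuation)
    if not matches:
        return None
    return 1 if matches[-1].lower() == "true" else 0

prompt = build_cot_prompt(
    "As the Senate president, the vice president presides over its deliberations.",
    "is the vice president the head of the senate",
)
# Hand-written stand-in for a model continuation.
continuation = " The passage says the vice president presides over the Senate.\nFinal answer: True"
print(parse_final_answer(continuation))
```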

In [None]:
messages_list = create_message_list(df_sample, cot_message)
messages_list[0]

'"role": "user",\n        "content": You are given a text and question. Answer only "true" or "false" with explanation.\n        For example: "role": "user",\n                example text: As with other games in The Elder Scrolls series, the game is set on the continent of Tamriel. The events of the game occur a millennium before those of The Elder Scrolls V: Skyrim and around 800 years before The Elder Scrolls III: Morrowind and The Elder Scrolls IV: Oblivion. It has a broadly similar structure to Skyrim, with two separate conflicts progressing at the same time, one with the fate of the world in the balance, and one where the prize is supreme power on Tamriel. In The Elder Scrolls Online, the first struggle is against the Daedric Prince Molag Bal, who is attempting to meld the plane of Mundus with his realm of Coldharbour, and the second is to capture the vacant imperial throne, contested by three alliances of the mortal races. The player character has been sacrificed to Molag Bal, an

In [None]:
cot_res = get_multiple_answer(tokenizer, model, messages_list)

In [None]:
cot_res[0]

'"role": "user",\n        "content": You are given a text and question. Answer only "true" or "false" with explanation.\n        For example: "role": "user",\n                example text: As with other games in The Elder Scrolls series, the game is set on the continent of Tamriel. The events of the game occur a millennium before those of The Elder Scrolls V: Skyrim and around 800 years before The Elder Scrolls III: Morrowind and The Elder Scrolls IV: Oblivion. It has a broadly similar structure to Skyrim, with two separate conflicts progressing at the same time, one with the fate of the world in the balance, and one where the prize is supreme power on Tamriel. In The Elder Scrolls Online, the first struggle is against the Daedric Prince Molag Bal, who is attempting to meld the plane of Mundus with his realm of Coldharbour, and the second is to capture the vacant imperial throne, contested by three alliances of the mortal races. The player character has been sacrificed to Molag Bal, an

In [None]:
cot_score, cot_preds = evaluate_answers(true_answers, cot_res)
print(cot_score)
print(cot_preds)

0.8148148148148148
[0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]


In [None]:
import pandas as pd
d = {'message': messages_list, 'true_answers': true_answers, 'cot_answer': cot_preds}
cot_df = pd.DataFrame(data=d)
cot_df

| | message | true_answers | cot_answer |
| --- | --- | --- | --- |
| 0 | "role": "user",\n "content": You are gi... | 1 | 0 |
| 1 | "role": "user",\n "content": You are gi... | 0 | 0 |
| 2 | "role": "user",\n "content": You are gi... | 1 | 1 |
| 3 | "role": "user",\n "content": You are gi... | 0 | 0 |
| 4 | "role": "user",\n "content": You are gi... | 1 | 1 |
| 5 | "role": "user",\n "content": You are gi... | 1 | 1 |
| 6 | "role": "user",\n "content": You are gi... | 1 | 1 |
| 7 | "role": "user",\n "content": You are gi... | 1 | 0 |
| 8 | "role": "user",\n "content": You are gi... | 1 | 1 |
| 9 | "role": "user",\n "content": You are gi... | 1 | 0 |


In [None]:
df_save_path = 'cot_f1_'+str(round(cot_score, 3))+".csv"
cot_df.to_csv(df_save_path, index=False)

TODO: Try and compare "naive" prompting (your best hand-crafted variant), few-shot prompting (https://www.promptingguide.ai/techniques/fewshot) and chain-of-thought prompting (step-by-step thinking - https://www.promptingguide.ai/techniques/cot).

TODO: Save the generation results into separate csv files and do not forget to attach them to your homework.
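For the comparison, it can help to look beyond a single score and check which examples each strategy gets wrong. The sketch below uses the prediction lists printed earlier in this notebook; `accuracy` and `error_indices` are small illustrative helpers.

```python
true_answers = [1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0]
runs = {
    "naive":            [1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0],
    "few-shot":         [1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0],
    "chain-of-thought": [0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
}

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def error_indices(y_true, y_pred):
    """Positions where the prediction differs from the label."""
    return [i for i, (t, p) in enumerate(zip(y_true, y_pred)) if t != p]

for name, preds in runs.items():
    print(f"{name:>16}: accuracy={accuracy(true_answers, preds):.2f} "
          f"errors={error_indices(true_answers, preds)}")

# Examples that every strategy gets wrong are good candidates for manual inspection.
always_wrong = set.intersection(*(set(error_indices(true_answers, p)) for p in runs.values()))
print("wrong in every run:", sorted(always_wrong))
```

With only 20 examples, a single flipped prediction changes accuracy by 5 percentage points, so small differences between strategies should be read with caution.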

# Part 2 (5 points): Fine-tuning with PEFT and LoRA

If you are working in Colab, LoRA may take too much time and memory. You are free to use prompt tuning instead, as the most lightweight PEFT technique.

Also take a look at the `trl` library and its `SFTTrainer`; you can reduce the number of training examples as well (to about 2,000).
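To see why prompt tuning is so lightweight: it trains only a small matrix of "virtual token" embeddings that is prepended to the input, so the number of trainable parameters is the number of virtual tokens times the embedding width. A rough sketch, assuming an embedding width of 4096 for Mistral-7B and a hypothetical choice of 8 virtual tokens (in `peft` this is the `num_virtual_tokens` field of `PromptTuningConfig`):

```python
HIDDEN_SIZE = 4096      # assumed embedding width of Mistral-7B
NUM_VIRTUAL_TOKENS = 8  # hypothetical choice, not taken from this notebook

# One trainable embedding vector per virtual token.
prompt_tuning_params = NUM_VIRTUAL_TOKENS * HIDDEN_SIZE
print(f"{prompt_tuning_params:,} trainable parameters")
```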

In [None]:
!pip install trl -q


In [None]:
from peft import get_peft_config, get_peft_model, LoraConfig
from transformers import TrainingArguments
from trl import SFTTrainer

In [None]:
# TODO: create LoRA config
peft_config = LoraConfig(
    r=2,
    lora_alpha=32,
    lora_dropout=0.1,
    bias="none",
    target_modules=["q_proj", "v_proj"],
    )
model = peft.get_peft_model(model, peft_config)

In [None]:
model.print_trainable_parameters()  # only a tiny fraction of the parameters is trainable

trainable params: 851,968 || all params: 7,242,584,064 || trainable%: 0.011763315309445885
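The trainable-parameter count printed above can be reproduced by hand. Each LoRA-adapted linear layer of shape `d_in -> d_out` adds two low-rank matrices with `r * (d_in + d_out)` parameters in total. The sketch below assumes the published Mistral-7B shapes (32 layers, hidden size 4096, 8 key/value heads of dimension 128); these values come from the model's public configuration, not from this notebook.

```python
# Assumed Mistral-7B shapes (from the public model config, not from this notebook).
hidden_size = 4096
num_layers = 32
num_kv_heads = 8
head_dim = 128
r = 2  # LoRA rank used in the config above

q_out = hidden_size               # q_proj maps 4096 -> 4096
v_out = num_kv_heads * head_dim   # v_proj maps 4096 -> 1024 (grouped-query attention)


def lora_params(d_in, d_out, rank):
    """Parameters added by one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


per_layer = lora_params(hidden_size, q_out, r) + lora_params(hidden_size, v_out, r)
trainable = per_layer * num_layers

total = 7_242_584_064  # "all params" as printed by print_trainable_parameters()
percent = 100 * trainable / total

print(f"trainable params: {trainable:,} ({percent:.6f}%)")
```

Because the count scales linearly with `r`, doubling the rank or adding more `target_modules` raises the trainable share proportionally while the base weights stay frozen.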


In [None]:
# Simple prompt formatting: `sample` is a batch (a dict of lists); one string is returned per example
def format_prompt(sample):
  messages_list = []
  for i in range(len(sample['question'])):
    message = f'''
        "content": You are given a text and question. Answer only "true" or "false".
        text: {sample['passage'][i]}
        question is: {sample['question'][i]}?
        Main_answer:{sample['answer'][i]}
        '''
    messages_list.append(message)
  return messages_list

In [None]:
idx = random.sample(range(1, 3270), 2000)
train_dataset = df["train"].select(idx)

In [None]:
training_args = TrainingArguments(
    output_dir="./final_model",
    num_train_epochs=2,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    save_strategy="epoch",
    learning_rate=1e-4,
    disable_tqdm=False,
    seed=42
)

In [None]:
tokenizer.padding_side = 'right'

In [None]:
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    formatting_func=format_prompt,
    max_seq_length=min(tokenizer.model_max_length, 1024),
    tokenizer=tokenizer
)

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

In [None]:
trainer.train()

Step,Training Loss




TrainOutput(global_step=124, training_loss=1.242489230248236, metrics={'train_runtime': 3694.6961, 'train_samples_per_second': 0.271, 'train_steps_per_second': 0.034, 'total_flos': 1.1170933826420736e+16, 'train_loss': 1.242489230248236, 'epoch': 1.98})
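The `global_step` value in the output above can be cross-checked against the training arguments. The sketch below assumes a single device and that the Trainer counts optimizer steps as whole groups of `gradient_accumulation_steps` batches per epoch; `expected_steps` is a hypothetical helper written for this check, not part of transformers.

```python
import math


def expected_steps(num_examples, per_device_batch, grad_accum, epochs):
    """Estimate optimizer steps, assuming one device and no dropped last batch."""
    batches_per_epoch = math.ceil(num_examples / per_device_batch)
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return math.ceil(epochs * updates_per_epoch)


# Values from TrainingArguments above: batch size 4, accumulation 2, 2 epochs.
steps_500 = expected_steps(500, per_device_batch=4, grad_accum=2, epochs=2)
steps_2000 = expected_steps(2000, per_device_batch=4, grad_accum=2, epochs=2)
print(steps_500, steps_2000)
```

Under these assumptions, 500 examples give 124 steps and 2000 examples give 500 steps. The logged `global_step=124` and the `Map: 0/500` progress line are therefore consistent with a 500-example training subset, which suggests the logged run used fewer examples than the 2000 sampled in the cell above.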

In [None]:
from google.colab import files
files.download("/content/final_model")

In [None]:
!zip -r /content/tuned_model.zip /content/final_model/checkpoint-124

  adding: content/final_model/checkpoint-124/ (stored 0%)
  adding: content/final_model/checkpoint-124/tokenizer.model (deflated 55%)
  adding: content/final_model/checkpoint-124/scheduler.pt (deflated 56%)
  adding: content/final_model/checkpoint-124/trainer_state.json (deflated 48%)
  adding: content/final_model/checkpoint-124/adapter_model.safetensors (deflated 7%)
  adding: content/final_model/checkpoint-124/training_args.bin (deflated 51%)
  adding: content/final_model/checkpoint-124/tokenizer.json (deflated 74%)
  adding: content/final_model/checkpoint-124/README.md (deflated 41%)
  adding: content/final_model/checkpoint-124/adapter_config.json (deflated 47%)
  adding: content/final_model/checkpoint-124/tokenizer_config.json (deflated 64%)
  adding: content/final_model/checkpoint-124/special_tokens_map.json (deflated 73%)
  adding: content/final_model/checkpoint-124/optimizer.pt (deflated 8%)
  adding: content/final_model/checkpoint-124/rng_state.pth (deflated 25%)


In [None]:
###################

In [None]:
model_name = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_name)  # tokenizers hold no weights, so no device_map is needed
tokenizer.pad_token_id = tokenizer.eos_token_id

model = AutoModelForCausalLM.from_pretrained(
    '/content/final_model/checkpoint-124', device_map='auto', low_cpu_mem_usage=True, offload_state_dict=True,
    load_in_4bit=True, torch_dtype=torch.float32,  # weights are 4-bit; layernorms and activations are fp32
)
for param in model.parameters():
    param.requires_grad=False

model.gradient_checkpointing_enable()  # only store a small subset of activations, re-compute the rest.
model.enable_input_require_grads()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [13]:
def create_message_list(example_df):
  messages_list = []
  for i in range(len(example_df)):
    message = f'''"role": "user",
        "content": You are given a text and question. Answer only "true" or "false" without additional info and text.
        text: {example_df['passage'][i]}
        question: {example_df['question'][i]}?
        Main_answer: '''
    messages_list.append(message)
  return messages_list

In [24]:
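import re
from sklearn.metrics import f1_score  # f1_score is assumed to come from scikit-learn

def evaluate_answers(true_answers, predictions):
    # Capture the first word that follows a "Main_answer:" line in each decoded output.
    pattern = re.compile(r'\n\s*Main_answer:\s*(\w+)', flags=re.IGNORECASE)
    preds = []
    for i in range(len(true_answers)):
        match = pattern.search(predictions[i])
        pred = match.group(1).lower() if match else None
        preds.append(pred)

    # Anything other than "true" (including a missing answer) is labeled 0.
    pred_labels = [1 if label == 'true' else 0 for label in preds]

    return f1_score(true_answers, pred_labels), pred_labels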
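One detail of the pattern above is worth checking on a concrete string: `search` returns the first `Main_answer:` line it finds. For a decoder-only model, the decoded output contains the prompt followed by the continuation, so a few-shot prompt with worked examples would put an example's answer first. The sketch below uses a made-up decoded string to show the difference between the first and the last match; taking the last match is one possible workaround, not the notebook's method.

```python
import re

pattern = re.compile(r'\n\s*Main_answer:\s*(\w+)', flags=re.IGNORECASE)

# Made-up decoded output: a prompt with one worked example, then the model's answer.
decoded = (
    "instruction\n"
    "text: example passage\n"
    "question: example question?\n"
    "Main_answer: false\n"
    "text: real passage\n"
    "question: real question?\n"
    "Main_answer: True"
)

first = pattern.search(decoded).group(1).lower()   # answer of the worked example
last = pattern.findall(decoded)[-1].lower()        # answer generated for the real query
print(first, last)
```

With the naive prompt there is only one `Main_answer:` line, so both approaches agree.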

In [17]:
def get_multiple_answer(tokenizer, model, messages, max_new_tokens=200,
                        num_beams=3, do_sample=False):
  res = []
  for i, message in enumerate(messages):
    print(i)  # progress indicator
    model_inputs = tokenizer(message, return_tensors="pt").to(device)
    outputs = model.generate(**model_inputs, max_new_tokens=max_new_tokens, num_beams=num_beams,
                             do_sample=do_sample, pad_token_id=tokenizer.eos_token_id)
    # The decoded text contains the prompt followed by the generated continuation.
    decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    res.append(decoded[0])
  return res

In [15]:
messages_list = create_message_list(df_sample)
messages_list[0]

'"role": "user",\n        "content": You are given a text and question. Answer only "true" or "false" without additional info and text.\n        text: As the Senate president, the vice president presides over its deliberations (or delegates this task to a member of the Senate), but is allowed to vote only when it is necessary to break a tie. While this vote-casting prerogative has been exercised chiefly on legislative issues, it has also been used to break ties on the election of Senate officers, as well as on the appointment of Senate committees. In this capacity, the vice president also presides over joint sessions of Congress.\n        question: is the vice president the head of the senate?\n        Main_answer: '

In [None]:
naive_res_tuned = get_multiple_answer(tokenizer, model, messages_list)

In [26]:
naive_score_tuned, naive_preds_tuned = evaluate_answers(true_answers, naive_res_tuned)
print(naive_score_tuned)
print(naive_preds_tuned)

0.8461538461538461
[1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0]


In [27]:
import pandas as pd
d = {'message': messages_list, 'true_answers': true_answers, 'naive_answer': naive_preds_tuned}
naive_df_tuned = pd.DataFrame(data=d)
naive_df_tuned

Unnamed: 0,message,true_answers,naive_answer
0,"""role"": ""user"",\n ""content"": You are gi...",1,1
1,"""role"": ""user"",\n ""content"": You are gi...",0,0
2,"""role"": ""user"",\n ""content"": You are gi...",1,1
3,"""role"": ""user"",\n ""content"": You are gi...",0,0
4,"""role"": ""user"",\n ""content"": You are gi...",1,0
5,"""role"": ""user"",\n ""content"": You are gi...",1,1
6,"""role"": ""user"",\n ""content"": You are gi...",1,1
7,"""role"": ""user"",\n ""content"": You are gi...",1,0
8,"""role"": ""user"",\n ""content"": You are gi...",1,1
9,"""role"": ""user"",\n ""content"": You are gi...",1,1


In [28]:
df_save_path = 'naive_tuned_result_f1_'+str(round(naive_score_tuned, 3))+".csv"
naive_df_tuned.to_csv(df_save_path, index=False)

TODO: initialize the Trainer and train on the train part of our dataset for 2-3 epochs

Note: set max_seq_length and args (a transformers.TrainingArguments instance) carefully
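One way to choose `max_seq_length` is to measure how long the formatted prompts actually are and pick a length that covers most of them, so that few examples are truncated and little memory is spent on padding. The sketch below is a minimal version of that idea; `suggest_max_seq_length` and its `count_tokens` argument are hypothetical names, and the demo uses whitespace splitting as a stand-in for the real tokenizer.

```python
import math


def suggest_max_seq_length(texts, count_tokens, coverage=0.95, cap=1024):
    """Smallest length that fits a `coverage` share of the texts, clipped to `cap`."""
    lengths = sorted(count_tokens(t) for t in texts)
    if not lengths:
        raise ValueError("texts must not be empty")
    idx = max(math.ceil(coverage * len(lengths)) - 1, 0)
    return min(lengths[min(idx, len(lengths) - 1)], cap)


# Stand-in tokenizer for the demo: whitespace splitting (an assumption).
# With the real tokenizer, pass: lambda t: len(tokenizer(t)["input_ids"])
toy_texts = ["word " * n for n in range(1, 101)]
print(suggest_max_seq_length(toy_texts, lambda t: len(t.split()), coverage=0.5))
```

The `cap` default mirrors the 1024 limit used in the SFTTrainer call above; a lower value reduces memory use at the cost of truncating the longest passages.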

TODO: save and check your tuned model. Report scores on our 20 validation examples and save the results to a CSV file