In [1]:
%pip install --quiet transformers==4.37.2 accelerate==0.24.0 sentencepiece==0.1.99 optimum==1.13.2 peft==0.5.0 bitsandbytes==0.41.2.post2 datasets==2.14.7

[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf 23.8.0 requires cubinlinker, which is not installed.
cudf 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
cudf 23.8.0 requires ptxcompiler, which is not installed.
cuml 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
dask-cudf 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
apache-beam 2.46.0 requires dill<0.3.2,>=0.3.1.1, but you have dill 0.3.7 which is incompatible.
apache-beam 2.46.0 requires numpy<1.25.0,>=1.14.3, but you have numpy 1.26.4 which is incompatible.
apache-beam 2.46.0 requires pyarrow<10.0.0,>=3.0.0, but you have pyarrow 11.0.0 which is incompatible.
cudf 23.8.0 requires cuda-python<12.0a0,>=11.7.1, but you have cuda-python 12.3.0 which is incompatible.
cudf 23.8.0 requires pandas<1.6.0dev0,>=1.3, but you have pandas 2.1.4 which is incompatible.

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from tqdm.auto import tqdm, trange
import torch
import torch.nn as nn
import torch.nn.functional as F
import peft

import transformers
from datasets import load_dataset

import random
const_seed = 100

In [3]:
torch.cuda.is_available()

True

In [4]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Part 0: Initializing the model and tokenizer

let's take mistral model for our experiments (https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) that was tuned to follow user instructions. Pay attention that we load model in 4 bit to decrease the memory usage.

In [6]:
model_name = 'mistralai/Mistral-7B-Instruct-v0.2'


# load llama tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, device_map=device)
tokenizer.pad_token_id = tokenizer.eos_token_id

# Note: to speed up inference you can use flash attention 2 (https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2)
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map='auto', low_cpu_mem_usage=True, offload_state_dict=True,
    load_in_4bit=True, torch_dtype=torch.float32,  # weights are 4-bit; layernorms and activations are fp32
)
for param in model.parameters():
    param.requires_grad=False

model.gradient_checkpointing_enable()  # only store a small subset of activations, re-compute the rest.
model.enable_input_require_grads()     # override an implementation quirk in gradient checkpoints that disables backprop unless inputs require grad
# more on gradient checkpointing: https://pytorch.org/docs/stable/checkpoint.html https://arxiv.org/abs/1604.06174

Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.94G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

# Part 1 (5 points): Prompt-engineering

**There are different strategies for text generation in huggingface:**

| Strategy | Description | Pros & Cons |
| --- | --- | --- |
| Greedy Search | Chooses the word with the highest probability as the next word in the sequence. | **Pros:** Simple and fast. <br> **Cons:** Can lead to repetitive and incoherent text. |
| Sampling with Temperature | Introduces randomness in the word selection. A higher temperature leads to more randomness. | **Pros:** Allows exploration and diverse output. <br> **Cons:** Higher temperatures can lead to nonsensical outputs. |
| Nucleus Sampling (Top-p Sampling) | Selects the next word from a truncated vocabulary, the "nucleus" of words that have a cumulative probability exceeding a pre-specified threshold (p). | **Pros:** Balances diversity and quality. <br> **Cons:** Setting an optimal 'p' can be tricky. |
| Beam Search | Explores multiple hypotheses (sequences of words) at each step, and keeps the 'k' most likely, where 'k' is the beam width. | **Pros:** Produces more reliable results than greedy search. <br> **Cons:** Can lack diversity and lead to generic responses. |
| Top-k Sampling | Randomly selects the next word from the top 'k' words with the highest probabilities. | **Pros:** Introduces randomness, increasing output diversity. <br> **Cons:** Random selection can sometimes lead to less coherent outputs. |
| Length Normalization | Prevents the model from favoring shorter sequences by dividing the log probabilities by the sequence length raised to some power. | **Pros:** Makes longer and potentially more informative sequences more likely. <br> **Cons:** Tuning the normalization factor can be difficult. |
| Stochastic Beam Search | Introduces randomness into the selection process of the 'k' hypotheses in beam search. | **Pros:** Increases diversity in the generated text. <br> **Cons:** The trade-off between diversity and quality can be tricky to manage. |
| Decoding with Minimum Bayes Risk (MBR) | Chooses the hypothesis (out of many) that minimizes expected loss under a loss function. | **Pros:** Optimizes the output according to a specific loss function. <br> **Cons:** Computationally more complex and requires a good loss function. |

Documentation references:
- [reference for `AutoModelForCausalLM.generate()`](https://huggingface.co/docs/transformers/v4.29.1/en/main_classes/text_generation#transformers.GenerationMixin.generate)
- [reference for `AutoTokenizer.decode()`](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.decode)
- Huggingface [docs on generation strategies](https://huggingface.co/docs/transformers/generation_strategies)

In [6]:
# TODO: create a function for generation with huggingface

def get_answer(tokenizer, model, messages, max_new_tokens=500,
               temperature=1, do_sample=True):
    input_message = " ".join(message["content"] for message in messages)

    inputs = tokenizer.encode(input_message, return_tensors="pt")

    inputs = inputs.to(model.device)

    outputs = model.generate(
        inputs,
        max_length=len(inputs[0]) + max_new_tokens,
        temperature=temperature,
        do_sample=do_sample,
        pad_token_id=tokenizer.eos_token_id,
        attention_mask=torch.ones_like(inputs)
    )

    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return decoded


In [7]:
def get_answer(tokenizer, model, messages, max_new_tokens=200, 
               temperature=0.5, do_sample=True):
    
    user_messages = " ".join(message["content"] for message in messages if message["role"] == "user")

    inputs = tokenizer.encode(user_messages, return_tensors="pt")

    with tokenizer.as_target_tokenizer():
        outputs = model.generate(inputs, max_length=len(inputs[0])+max_new_tokens, 
                                 do_sample=do_sample, temperature=temperature, pad_token_id=tokenizer.eos_token_id)

    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return decoded


In [105]:
# Let's try our model

messages = [
    {"role": "user", "content": "Write an explanation of tensors for 5 year old"},
]

print(get_answer(tokenizer, model, messages))

Write an explanation of tensors for 5 year olds.

Tensors are like special boxes that hold different things. But instead of holding just one thing like a regular box, a tensor box can hold many things at once. And not just any things, but things that have shapes and sizes. For example, a box can hold balls of different sizes, or blocks of different shapes.

When we play with these tensor boxes, we can do interesting things with them. We can add or take away things from one box and put them into another box. And because these tensor boxes can hold many things at once, we can do these operations in many directions. We can add or take away things not just lengthwise, but also widthwise and heightwise.

So, tensors are like special boxes that can hold many things of different shapes and sizes, and we can do operations on them in many directions.


You should obtain an explanation from the model. If so, let us go further!

Now we will take a sample from boolQ (https://huggingface.co/datasets/google/boolq) dataset and try prompting techniques to extract the needed answer and calculate its quality

In [10]:
df = load_dataset("google/boolq")

Downloading readme:   0%|          | 0.00/6.57k [00:00<?, ?B/s]

Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/3.69M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.26M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/9427 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3270 [00:00<?, ? examples/s]

In [9]:
# Fixing 20 validation examples

random.seed(const_seed)
idx = random.sample(range(1, 3270), 20)

In [10]:
# sample you will work with
df_sample = df["validation"].select(idx)

In [79]:
df_sample

Dataset({
    features: ['question', 'answer', 'passage'],
    num_rows: 20
})

In [110]:
# For instance, you can construct your prompt the following way
messages = [
    {"role": "user", "content": '''You are given a text and question. Answer only "true" or "false".
text: As with other games in The Elder Scrolls series, the game is set on the continent of Tamriel. The events of the game occur a millennium before those of The Elder Scrolls V: Skyrim and around 800 years before The Elder Scrolls III: Morrowind and The Elder Scrolls IV: Oblivion. It has a broadly similar structure to Skyrim, with two separate conflicts progressing at the same time, one with the fate of the world in the balance, and one where the prize is supreme power on Tamriel. In The Elder Scrolls Online, the first struggle is against the Daedric Prince Molag Bal, who is attempting to meld the plane of Mundus with his realm of Coldharbour, and the second is to capture the vacant imperial throne, contested by three alliances of the mortal races. The player character has been sacrificed to Molag Bal, and Molag Bal has stolen their soul, the recovery of which is the primary game objective.
question: is elder scrolls online the same as skyrim
answer: '''},
]

print(get_answer(tokenizer, model, messages)[0])

Y


In [111]:
messages = [
    {"role": "user", "content": '''You are given a text and question. Answer only "true" or "false".
text: As with other games in The Elder Scrolls series, the game is set on the continent of Tamriel. The events of the game occur a millennium before those of The Elder Scrolls V: Skyrim and around 800 years before The Elder Scrolls III: Morrowind and The Elder Scrolls IV: Oblivion. It has a broadly similar structure to Skyrim, with two separate conflicts progressing at the same time, one with the fate of the world in the balance, and one where the prize is supreme power on Tamriel. In The Elder Scrolls Online, the first struggle is against the Daedric Prince Molag Bal, who is attempting to meld the plane of Mundus with his realm of Coldharbour, and the second is to capture the vacant imperial throne, contested by three alliances of the mortal races. The player character has been sacrificed to Molag Bal, and Molag Bal has stolen their soul, the recovery of which is the primary game objective.
question: is elder scrolls online the same as skyrim
answer: '''},
]

print(get_answer(tokenizer, model, messages))

You are given a text and question. Answer only "true" or "false".
text: As with other games in The Elder Scrolls series, the game is set on the continent of Tamriel. The events of the game occur a millennium before those of The Elder Scrolls V: Skyrim and around 800 years before The Elder Scrolls III: Morrowind and The Elder Scrolls IV: Oblivion. It has a broadly similar structure to Skyrim, with two separate conflicts progressing at the same time, one with the fate of the world in the balance, and one where the prize is supreme power on Tamriel. In The Elder Scrolls Online, the first struggle is against the Daedric Prince Molag Bal, who is attempting to meld the plane of Mundus with his realm of Coldharbour, and the second is to capture the vacant imperial throne, contested by three alliances of the mortal races. The player character has been sacrificed to Molag Bal, and Molag Bal has stolen their soul, the recovery of which is the primary game objective.
question: is elder scrolls on

In [13]:
messages = []

for i in range(len(df_sample)):
    passage = df_sample['passage'][i]
    question = df_sample['question'][i]

    message = {
        "role": "user",
        "content": f'''Answer only "true" or "false" according to the question and text. Write only True or False and that's all!
text: {passage}
question: {question}
answer: '''
    }

    messages.append(message)


In [147]:
generated_answer_new = []
for i, message in enumerate(messages):
    prompt = [message]
    output = get_answer(tokenizer, model, prompt)
    generated_answer_new.append(output)
    print(f"Answer for message {i+1}: {output}")

Answer for message 1: Answer only "true" or "false" according to the question and text. Write only True or False and that's all!
text: As the Senate president, the vice president presides over its deliberations (or delegates this task to a member of the Senate), but is allowed to vote only when it is necessary to break a tie. While this vote-casting prerogative has been exercised chiefly on legislative issues, it has also been used to break ties on the election of Senate officers, as well as on the appointment of Senate committees. In this capacity, the vice president also presides over joint sessions of Congress.
question: is the vice president the head of the senate
answer: 
True. The vice president presides over the Senate and is considered the president of the Senate. However, they only have the power to vote when necessary to break a tie.
Answer for message 2: Answer only "true" or "false" according to the question and text. Write only True or False and that's all!
text: The Feder

In [155]:
df_sample['new_answer']= generated_answer_new
df_sample['new_answer'] = df_sample['new_answer'].apply(lambda x: '\n'.join(x.split('\n')[1:]))
df_sample['new_answer']

0     text: As the Senate president, the vice presid...
1     text: The Federal Reserve began taking high-de...
2     text: On the outbreak of war, the Confederates...
3     text: The ground squirrels are members of the ...
4     text: Kim Garner, the senior vice president of...
5     text: Compared to similar technology in other ...
6     text: The Boss Baby: Back in Business is an Am...
7     text: Baby back ribs (also back ribs or loin r...
8     text: The climate in the region is generally c...
9     text: The away goals rule is applied in many f...
10    text: Nigella sativa (black caraway, also know...
11    text: Belgium have appeared in the finals tour...
12    text: In 2003, the United States withdrew rema...
13    text: Each legislator shall be at least twenty...
14    text: Brie (/briː/; French: (bʁi)) is a soft c...
15    text: In Australia, each state has its own con...
16    text: A table may have multiple foreign keys, ...
17    text: Delay of game is a penalty in ice ho

Is anything wrong with the output? Now it is time for you to play around and try to come up with some better prompt.

In [157]:
# TODO: create function to evaluate answers
# Note: you can adapt function for different answer structures,
# but you should be able to automatically extract the target "true" or "false" components

def extract_true_false_from_answer_column(answer_column):
    true_false_list = []
    for answer in answer_column:
        lines = answer.split("\n")
        found = False
        for line in lines:
            if "true" in line.lower():
                true_false_list.append("true")
                found = True
                break
            elif "false" in line.lower():
                true_false_list.append("false")
                found = True
                break
        if not found:
            true_false_list.append("unknown")
    return true_false_list

true_false_list = extract_true_false_from_answer_column(df_sample['new_answer'])
df_sample['generated_answer'] = true_false_list
print(true_false_list)


['true', 'false', 'true', 'false', 'false', 'true', 'true', 'false', 'true', 'true', 'true', 'false', 'true', 'true', 'true', 'true', 'true', 'true', 'false', 'false']


In [164]:
def evaluate_answers(generated_answers, expected_true_false):
    correct_count = 0
    total_count = len(expected_true_false)

    for generated_answer, expected_answer in zip(generated_answers, expected_true_false):
        if generated_answer == expected_answer:
            correct_count += 1

    accuracy = correct_count / total_count if total_count > 0 else 0
    return accuracy

expected_true_false = df_sample['answer'].astype(str).str.lower().tolist()

generated_true_false = df_sample['generated_answer'].tolist()

accuracy = evaluate_answers(generated_true_false, expected_true_false)
print("Accuracy:", accuracy)


Accuracy: 0.75


In [166]:
df_sample.to_csv('new_dataset.csv')

TODO: Try and compare "naive" prompting (your best hand-crafted variant), few-shot prompting (https://www.promptingguide.ai/techniques/fewshot) and chain-of-thought prompting (step-be-step thinking - https://www.promptingguide.ai/techniques/cot).

Save the generation results into separate csv files and do not forget to attach them to your homework.

In [56]:
messages_naive = []

for i in range(len(df_sample)):
    passage = df_sample['passage'][i]
    question = df_sample['question'][i]

    message = {
        "role": "user",
        "content": f'''Answer only "true" or "false" according to the question and text. Write only True or False and that's all!
text: {passage}
question: {question}
answer: '''
    }

    messages_naive.append(message)

    
messages_few_shot = []

for i in range(len(df_sample)):
    passage = df_sample['passage'][i]
    question = df_sample['question'][i]

    message = {
        "role": "user",
        "content": f'''Provide your answer as "true" or "false" based on the few-shot examples given below.
text: {passage}
question: {question}
Few-shot examples:
- True: [Provide a true statement example related to the question]
- False: [Provide a false statement example related to the question]
answer: '''
    }

    messages_few_shot.append(message)

messages_chain_of_thought = []

for i in range(len(df_sample)):
    passage = df_sample['passage'][i]
    question = df_sample['question'][i]

    message = {
        "role": "user",
        "content": f'''Consider the following steps and respond accordingly:
Step 1: Read the provided text.
Step 2: Consider the question asked.
Step 3: Based on your understanding from Step 1 and Step 2, determine if the statement is true or false.
text: {passage}
question: {question}
answer: '''
    }

    messages_chain_of_thought.append(message)


# Few-shot Prompting

In [33]:
messages_few_shot = []
for i in range(len(df_sample)):
    passage = df_sample['passage'][i]
    question = df_sample['question'][i]
    few_shot_examples = [
        {
            "true": "[Provide a true statement example related to the question]",
            "false": "[Provide a false statement example related to the question]"
        }
        for _ in range(3) 
    ]
    message = {
        "role": "user",
        "content": (
            f'''Provide your answer as "true" or "false" based on the few-shot examples given below.
text: {passage}
question: {question}
Few-shot examples:\n'''
            + ''.join([f"- True: {example['true']}\n  False: {example['false']}\n" for example in few_shot_examples])
            + "answer: "
        )
    }

    messages_few_shot.append(message)


In [34]:
generated_responses_few_shot = []

for i, message in enumerate(messages_few_shot):
    prompt = [message]
    output = get_answer(tokenizer, model, prompt)
    generated_responses_few_shot.append(output)



In [35]:
df = df_sample.to_pandas()

In [36]:
df_messages_few_shot = df.copy()
df_messages_few_shot['generated_answer'] = generated_responses_few_shot

In [38]:
df_messages_few_shot.to_csv('df_messages_few_shot.csv')

In [37]:
df_messages_few_shot.head()

Unnamed: 0,question,answer,passage,generated_answer
0,is the vice president the head of the senate,True,"As the Senate president, the vice president pr...","Provide your answer as ""true"" or ""false"" based..."
1,can i get $1 000 bill from the bank,False,The Federal Reserve began taking high-denomina...,"Provide your answer as ""true"" or ""false"" based..."
2,were any civil war battles fought in florida,True,"On the outbreak of war, the Confederates seize...","Provide your answer as ""true"" or ""false"" based..."
3,is a chipmunk the same as a ground squirrel,False,The ground squirrels are members of the squirr...,"Provide your answer as ""true"" or ""false"" based..."
4,is russell brand singing in get him to the greek,True,"Kim Garner, the senior vice president of marke...","Provide your answer as ""true"" or ""false"" based..."


### Evaluating 

In [80]:
df_sample

Dataset({
    features: ['question', 'answer', 'passage'],
    num_rows: 20
})

In [81]:
df = df_sample.to_pandas()

In [63]:
generated_responses_naive = []
generated_responses_few_shot = []
generated_responses_chain_of_thought = []

for i, message in enumerate(messages_naive):
    prompt = [message]
    output = get_answer(tokenizer, model, prompt)
    generated_responses_naive.append(output)

for i, message in enumerate(messages_few_shot):
    prompt = [message]
    output = get_answer(tokenizer, model, prompt)
    generated_responses_few_shot.append(output)

for i, message in enumerate(messages_chain_of_thought):
    prompt = [message]
    output = get_answer(tokenizer, model, prompt)
    generated_responses_chain_of_thought.append(output)


In [82]:
df_messages_naive = df.copy()
df_messages_naive['generated_answer'] = generated_responses_naive

df_messages_few_shot = df.copy()
df_messages_few_shot['generated_answer'] = generated_responses_few_shot

df_messages_chain_of_thought = df.copy()
df_messages_chain_of_thought['generated_answer'] = generated_responses_chain_of_thought


In [88]:
def process_generated_answers(df_sample, generated_answer_column):
    df_sample['new_answer'] = df_sample[generated_answer_column]
    df_sample['new_answer'] = df_sample['new_answer'].apply(lambda x: '\n'.join(x.split('\n')[1:]))
    return df_sample['new_answer']

df_messages_naive['new_answer'] = process_generated_answers(df_messages_naive, 'generated_answer')

df_messages_few_shot['new_answer'] = process_generated_answers(df_messages_few_shot, 'generated_answer')

df_messages_chain_of_thought['new_answer'] = process_generated_answers(df_messages_chain_of_thought, 'generated_answer')


In [94]:
df_messages_naive.to_csv('df_messages_naive_processed.csv', index=False)
df_messages_few_shot.to_csv('df_messages_few_shot_processed.csv', index=False)
df_messages_chain_of_thought.to_csv('df_messages_chain_of_thought_processed.csv', index=False)


# Part 2 (5 points): Fine-tuning with PEFT and LoRA

In [67]:
%pip install --quiet transformers==4.37.2 accelerate==0.24.0 sentencepiece==0.1.99 optimum==1.13.2 peft==0.5.0 bitsandbytes==0.41.2.post2 datasets==2.14.7

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Note: you may need to restart the kernel to use updated packages.


In [3]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from tqdm.auto import tqdm, trange
import torch
import torch.nn as nn
import torch.nn.functional as F
import peft

import transformers
from datasets import load_dataset

import random
const_seed = 100

In [4]:
torch.cuda.is_available()

True

In [5]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [8]:
model_name = 'mistralai/Mistral-7B-Instruct-v0.2'


# load llama tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, device_map=device)
tokenizer.pad_token_id = tokenizer.eos_token_id

# Note: to speed up inference you can use flash attention 2 (https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2)
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map='auto', low_cpu_mem_usage=True, offload_state_dict=True,
    load_in_4bit=True, torch_dtype=torch.float32,  # weights are 4-bit; layernorms and activations are fp32
)
for param in model.parameters():
    param.requires_grad=False

model.gradient_checkpointing_enable()  # only store a small subset of activations, re-compute the rest.
model.enable_input_require_grads()     # override an implementation quirk in gradient checkpoints that disables backprop unless inputs require grad
# more on gradient checkpointing: https://pytorch.org/docs/stable/checkpoint.html https://arxiv.org/abs/1604.06174

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [7]:
for param in model.parameters():
    param.requires_grad=False

model.gradient_checkpointing_enable()
model.enable_input_require_grads()

In [40]:
peft_config = peft.PromptTuningConfig(task_type=peft.TaskType.CAUSAL_LM,
                                      num_virtual_tokens=16) #
model = peft.get_peft_model(model, peft_config)  # note: for most peft methods, this line also modifies model in-plac)))

tokenizer.padding_side = 'right'

In [41]:
model.print_trainable_parameters() # Wow so small amount of trainable params

trainable params: 65,536 || all params: 7,241,797,632 || trainable%: 0.000904968673943746


In [42]:
# creating simple prompt formating
def format_prompt(sample):
    return f'''
    text: {sample['passage']}
    question: {sample['question']}
    answer: {sample['answer']}
    '''

TODO: initialize Trainer and pass train part of our dataset for 2-3 epoches

Note: carefully set max_seq_length and args (that are transformers.TrainingArguments)

In [43]:
!pip install trl

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting trl
  Downloading trl-0.7.11-py3-none-any.whl.metadata (10 kB)
Collecting tyro>=0.5.11 (from trl)
  Downloading tyro-0.7.3-py3-none-any.whl.metadata (7.7 kB)
Collecting shtab>=1.5.6 (from tyro>=0.5.11->trl)
  Downloading shtab-1.7.1-py3-none-any.whl.metadata (7.3 kB)
Downloading trl-0.7.11-py3-none-any.whl (155 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m155.3/155.3 kB[0m [31m1.4 MB/s[0m eta [36m0:00:00[0m00:01[0m0:01[0m0m
[?25hDownloading tyro-0.7.3-py3-none-any.whl (79 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m79.8/79.8 kB[0m [31m4.8 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading shtab-1.7.1-py3-none-any.whl (14 kB)
Installing collected packages: shtab, tyro, trl
Successfully installed shtab-1.7.1 trl-0.7.11 tyro-0.7.3


In [44]:
df = load_dataset("google/boolq")
df

DatasetDict({
    train: Dataset({
        features: ['question', 'answer', 'passage'],
        num_rows: 9427
    })
    validation: Dataset({
        features: ['question', 'answer', 'passage'],
        num_rows: 3270
    })
})

In [45]:
train = [format_prompt(df["train"][k]) for k in range(0, len(df["train"]))]
valid= [format_prompt(df["validation"][k]) for k in range(0,len(df["validation"]))]

In [46]:
train_without_true_false = [train[i].replace("True", "").replace("False", "").replace("TRUE", "").replace("FALSE", "").replace("false", "").replace("true", "") for i in range(len(train))]
valid_without_true_false = [valid[i].replace("True", "").replace("False", "").replace("TRUE", "").replace("FALSE", "").replace("false", "").replace("true", "") for i in range(len(valid))]

In [47]:
from datasets import Dataset

tlabel_dataset = Dataset.from_dict({"prompt": train})
vlabel_dataset = Dataset.from_dict({"prompt": valid})
train_dataset = Dataset.from_dict({"prompt": train_without_true_false})
valid_dataset = Dataset.from_dict({"prompt": valid_without_true_false})

In [48]:
train_labels = [label for label in tlabel_dataset['prompt']]
valid_labels = [label for label in vlabel_dataset['prompt']]  

train_dataset = Dataset.from_dict({"prompt": train_without_true_false, "completion": train_labels})
valid_dataset = Dataset.from_dict({"prompt": valid_without_true_false, "completion": valid_labels})

train_dataset = train_dataset.select(range(200))
valid_dataset = valid_dataset.select(range(200))

In [49]:
from transformers import AutoTokenizer
from datasets import Dataset


def preprocess_function(examples):
    prompt_text = examples["prompt"]
    completion_text = examples["completion"]

    tokenized_prompt = tokenizer(prompt_text, padding="max_length", truncation=True, max_length=128)

    tokenized_examples = {
        "completion": completion_text
    }

    return tokenized_examples



tokenized_train_dataset = train_dataset.map(preprocess_function, batched=True)
tokenized_valid_dataset = valid_dataset.map(preprocess_function, batched=True)

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

In [51]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    per_device_train_batch_size=8,  
    per_device_eval_batch_size=4,
    learning_rate=5e-5,
    num_train_epochs=2,
    weight_decay=0.01,
    logging_dir="./logs",
    output_dir="./output",
    evaluation_strategy="epoch",
    logging_steps=1000
)

In [52]:
from trl import SFTTrainer
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_valid_dataset,
    packing=True
)



Generating train split: 0 examples [00:00, ? examples/s]

Generating train split: 0 examples [00:00, ? examples/s]



In [53]:
trainer.train()

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

  ········································


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


Epoch,Training Loss,Validation Loss
1,No log,2.024782
2,No log,2.021788


TrainOutput(global_step=16, training_loss=1.9625704288482666, metrics={'train_runtime': 2627.4501, 'train_samples_per_second': 0.046, 'train_steps_per_second': 0.006, 'total_flos': 5329923266838528.0, 'train_loss': 1.9625704288482666, 'epoch': 2.0})

In [56]:
torch.save(model.state_dict(), "/kaggle/working/model.pt")
print("Model saved successfully!")

Model saved successfully!


In [64]:
#Loading the model
model = torch.load("/kaggle/working/model.pt")

In [11]:
df = load_dataset("google/boolq")

random.seed(const_seed)

idx = random.sample(range(1, 3270), 20)

df_sample = df["validation"].select(idx)

In [14]:
def get_answer(model, message, tokenizer, max_new_tokens=500, temperature=1.0, do_sample=True):
    prompt = message["content"]

    inputs = tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True)
    inputs = inputs.to(model.device)

    max_length = len(inputs["input_ids"][0]) + max_new_tokens  # Adjust max_length based on input length and max_new_tokens
    generated_outputs = model.generate(
        **inputs,
        max_length=max_length,
        temperature=temperature,
        do_sample=do_sample,
        pad_token_id=tokenizer.eos_token_id,
    )

    decoded_output = tokenizer.decode(generated_outputs[0], skip_special_tokens=True)
    answer = decoded_output.strip()

    return answer

generated_answers = []

for i, message in enumerate(messages):
    output = get_answer(model, message, tokenizer)
    
    generated_answers.append(output)
    print(f"Answer for message {i+1}: {output}")

for message, answer in zip(messages, generated_answers):
    message['answer'] = answer


2024-03-08 18:59:00.318492: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-08 18:59:00.319589: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-08 18:59:00.456374: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


Answer for message 1: Answer only "true" or "false" according to the question and text. Write only True or False and that's all!
text: As the Senate president, the vice president presides over its deliberations (or delegates this task to a member of the Senate), but is allowed to vote only when it is necessary to break a tie. While this vote-casting prerogative has been exercised chiefly on legislative issues, it has also been used to break ties on the election of Senate officers, as well as on the appointment of Senate committees. In this capacity, the vice president also presides over joint sessions of Congress.
question: is the vice president the head of the senate
answer: 
------
True. The vice president is the president of the Senate and presides over its deliberations, but can only vote to break ties. The role also includes presiding over joint sessions of Congress.
Answer for message 2: Answer only "true" or "false" according to the question and text. Write only True or False an

In [16]:
df = df_sample.to_pandas()
df['generated_answers'] = generated_answers

In [17]:
df['generated_answers']= generated_answers
df['generated_answers'] = df['generated_answers'].apply(lambda x: '\n'.join(x.split('\n')[1:]))
df['generated_answers']

0     text: As the Senate president, the vice presid...
1     text: The Federal Reserve began taking high-de...
2     text: On the outbreak of war, the Confederates...
3     text: The ground squirrels are members of the ...
4     text: Kim Garner, the senior vice president of...
5     text: Compared to similar technology in other ...
6     text: The Boss Baby: Back in Business is an Am...
7     text: Baby back ribs (also back ribs or loin r...
8     text: The climate in the region is generally c...
9     text: The away goals rule is applied in many f...
10    text: Nigella sativa (black caraway, also know...
11    text: Belgium have appeared in the finals tour...
12    text: In 2003, the United States withdrew rema...
13    text: Each legislator shall be at least twenty...
14    text: Brie (/briː/; French: (bʁi)) is a soft c...
15    text: In Australia, each state has its own con...
16    text: A table may have multiple foreign keys, ...
17    text: Delay of game is a penalty in ice ho

In [18]:
def extract_true_false_from_answer_column(generated_answers):
    true_false_list = []
    for answer in generated_answers:
        found = False
        for line in answer.split("\n"):
            if "true" in line.lower():
                true_false_list.append("true")
                found = True
                break
            elif "false" in line.lower():
                true_false_list.append("false")
                found = True
                break
        if not found:
            true_false_list.append("unknown")
    return true_false_list

true_false_list = extract_true_false_from_answer_column(df['generated_answers'])
df['generated_answers'] = true_false_list
print(true_false_list)


['true', 'false', 'true', 'false', 'false', 'true', 'true', 'false', 'true', 'false', 'true', 'false', 'true', 'true', 'true', 'true', 'true', 'true', 'true', 'false']


In [19]:
def evaluate_answers(generated_answers, expected_true_false):
    correct_count = 0
    total_count = len(expected_true_false)

    for generated_answer, expected_answer in zip(generated_answers, expected_true_false):
        if generated_answer == expected_answer:
            correct_count += 1

    accuracy = correct_count / total_count if total_count > 0 else 0
    return accuracy

expected_true_false = df['answer'].astype(str).str.lower().tolist()

generated_true_false = df['generated_answers'].tolist()

accuracy = evaluate_answers(generated_true_false, expected_true_false)
print("Accuracy:", accuracy)


Accuracy: 0.65


In [12]:
import wandb
import random

wandb.init(
    project="my-awesome-project",
    
    config={
    "learning_rate": 0.02,
    "architecture": "CNN",
    "dataset": "CIFAR-100",
    "epochs": 10,
    }
)

epochs = 10
offset = random.random() / 5
for epoch in range(2, epochs):
    acc = 1 - 2 ** -epoch - random.random() / epoch - offset
    loss = 2 ** -epoch + random.random() / epoch + offset
    
    wandb.log({"acc": acc, "loss": loss})
    
wandb.finish()

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

  ········································


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


VBox(children=(Label(value='0.000 MB of 0.000 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
acc,▁▄██▇███
loss,▆█▅▄▂▃▂▁

0,1
acc,0.74049
loss,0.18626


TODO: save and check your tuned model. Provide scores on our 20 validation examples and save result to csv file

In [242]:
trainer.save_model("tuned_model")

In [20]:
df.to_csv('fine_tuned_model_answers.csv')