<a href="https://colab.research.google.com/github/MrSimple07/ArchitecturesNN_for_DL/blob/main/AbdurakhimovM_lab_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%pip install --quiet transformers==4.37.2 accelerate==0.24.0 sentencepiece==0.1.99 optimum==1.13.2 peft==0.5.0 bitsandbytes==0.41.2.post2 datasets==2.14.7

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from tqdm.auto import tqdm, trange
import torch
import torch.nn as nn
import torch.nn.functional as F
import peft

import transformers
from datasets import load_dataset

import random
const_seed = 100

In [None]:
assert torch.cuda.is_available(), "CUDA is not available: switch the Colab runtime type to a GPU"

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Part 0: Initializing the model and tokenizer

Let's use the Mistral model (https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) for our experiments; it was tuned to follow user instructions. Note that we load the model in 4-bit precision to reduce memory usage.
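To see why 4-bit loading matters, here is a rough estimate of the memory needed just for the weights at different precisions. It is a back-of-the-envelope sketch assuming about 7.24 billion parameters for Mistral-7B; it ignores activations, the KV cache, modules that bitsandbytes leaves unquantized, and the extra per-block quantization constants, so real usage will be somewhat higher.

```python
# Rough weight-memory estimate: parameters x bits per parameter / 8 bits per byte.
# Assumption: ~7.24e9 parameters for Mistral-7B. Ignores activations, the KV cache,
# unquantized modules and quantization constants, so this is a lower bound.
N_PARAMS = 7.24e9


def weight_memory_gb(n_params, bits_per_param):
    """Memory (in GB, 1 GB = 1e9 bytes) needed to store the weights alone."""
    return n_params * bits_per_param / 8 / 1e9


for name, bits in [("fp32", 32), ("fp16/bf16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{name:>9}: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Going from 16-bit to 4-bit weights cuts this estimate by a factor of four, from roughly 14.5 GB to roughly 3.6 GB.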

In [None]:
model_name = 'mistralai/Mistral-7B-Instruct-v0.2'

In [None]:
# load the Mistral tokenizer (tokenizers run on the CPU, so no device_map is needed)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token_id = tokenizer.eos_token_id  # reuse EOS as the padding token

# Note: to speed up inference you can use flash attention 2 (https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2)
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map='auto', low_cpu_mem_usage=True, offload_state_dict=True,
    load_in_4bit=True, torch_dtype=torch.float32,  # weights are 4-bit; layernorms and activations are fp32
)
for param in model.parameters():
    param.requires_grad=False

model.gradient_checkpointing_enable()  # only store a small subset of activations, re-compute the rest.
model.enable_input_require_grads()     # override an implementation quirk in gradient checkpoints that disables backprop unless inputs require grad
# more on gradient checkpointing: https://pytorch.org/docs/stable/checkpoint.html https://arxiv.org/abs/1604.06174

# Part 1 (5 points): Prompt-engineering

**Common strategies for text generation (several of them are built into Hugging Face `generate()`):**

| Strategy | Description | Pros & Cons |
| --- | --- | --- |
| Greedy Search | Chooses the word with the highest probability as the next word in the sequence. | **Pros:** Simple and fast. <br> **Cons:** Can lead to repetitive and incoherent text. |
| Sampling with Temperature | Divides the logits by a temperature before the softmax and samples the next word from the result. A higher temperature flattens the distribution and leads to more randomness. | **Pros:** Allows exploration and diverse output. <br> **Cons:** Higher temperatures can lead to nonsensical outputs. |
| Nucleus Sampling (Top-p Sampling) | Selects the next word from a truncated vocabulary, the "nucleus" of words that have a cumulative probability exceeding a pre-specified threshold (p). | **Pros:** Balances diversity and quality. <br> **Cons:** Setting an optimal 'p' can be tricky. |
| Beam Search | Explores multiple hypotheses (sequences of words) at each step, and keeps the 'k' most likely, where 'k' is the beam width. | **Pros:** Produces more reliable results than greedy search. <br> **Cons:** Can lack diversity and lead to generic responses. |
| Top-k Sampling | Randomly selects the next word from the top 'k' words with the highest probabilities. | **Pros:** Introduces randomness, increasing output diversity. <br> **Cons:** Random selection can sometimes lead to less coherent outputs. |
| Length Normalization | Prevents the model from favoring shorter sequences by dividing a hypothesis's summed log-probability by its length raised to some power. | **Pros:** Makes longer and potentially more informative sequences more likely. <br> **Cons:** Tuning the normalization factor can be difficult. |
| Stochastic Beam Search | Introduces randomness into the selection process of the 'k' hypotheses in beam search. | **Pros:** Increases diversity in the generated text. <br> **Cons:** The trade-off between diversity and quality can be tricky to manage. |
| Decoding with Minimum Bayes Risk (MBR) | Chooses the hypothesis (out of many) that minimizes expected loss under a loss function. | **Pros:** Optimizes the output according to a specific loss function. <br> **Cons:** Computationally more complex and requires a good loss function. |
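The sampling rows of the table can be combined into one decoding step. The pure-Python sketch below (the function name `sample_next_token` is ours, not a library API) takes raw logits for a single step, applies temperature scaling, then top-k, then top-p (nucleus) filtering, and samples from what remains. It mirrors the order Hugging Face applies these filters in, but it is an illustration, not the library's implementation.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Sample a token id from raw logits with temperature, top-k and nucleus (top-p) filtering."""
    if temperature <= 0:  # treat temperature 0 as greedy search
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]  # keep only the k most likely tokens
    if top_p < 1.0:
        mass = sum(probs[i] for i in order)  # renormalise over the tokens still allowed
        kept, cum = [], 0.0
        for i in order:  # smallest prefix of the sorted tokens whose mass reaches top_p
            kept.append(i)
            cum += probs[i] / mass
            if cum >= top_p:
                break
        order = kept
    weights = [probs[i] for i in order]  # random.choices does not need normalised weights
    return rng.choices(order, weights=weights, k=1)[0]
```

For example, `sample_next_token([0.0, 5.0, 1.0], top_k=1)` always returns `1` (it degenerates to greedy search), while `sample_next_token([1.0, 1.0, 1.0], temperature=2.0)` picks uniformly among the three tokens.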

Documentation references:
- [reference for `AutoModelForCausalLM.generate()`](https://huggingface.co/docs/transformers/v4.29.1/en/main_classes/text_generation#transformers.GenerationMixin.generate)
- [reference for `AutoTokenizer.decode()`](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.decode)
- Huggingface [docs on generation strategies](https://huggingface.co/docs/transformers/generation_strategies)
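Most rows of the table correspond to keyword arguments of `model.generate()` (the fields of `transformers.GenerationConfig`). The dictionary below is a hypothetical set of presets, with names and values of our choosing, meant only as a starting point for experiments. MBR decoding is not covered here, and HF's beam-search multinomial sampling is a sampling variant of beam search rather than an exact implementation of stochastic beam search.

```python
# Hypothetical presets (the names and values are ours) mapping strategies from the
# table to keyword arguments accepted by `model.generate()` / `GenerationConfig`.
GENERATION_PRESETS = {
    "greedy": {"do_sample": False, "num_beams": 1},
    # top_k defaults to 50, so set it to 0 to get pure temperature sampling
    "temperature": {"do_sample": True, "temperature": 0.7, "top_k": 0},
    "top_k": {"do_sample": True, "top_k": 50},
    "nucleus": {"do_sample": True, "top_p": 0.9, "top_k": 0},
    "beam": {"do_sample": False, "num_beams": 4},
    # beam scores are sum_logprobs / length ** length_penalty
    "beam_length_norm": {"do_sample": False, "num_beams": 4, "length_penalty": 1.0},
    # beam-search multinomial sampling (a sampling variant of beam search)
    "beam_sampling": {"do_sample": True, "num_beams": 4},
}
```

A preset can then be unpacked into a call such as `model.generate(input_ids, max_new_tokens=50, pad_token_id=tokenizer.eos_token_id, **GENERATION_PRESETS["nucleus"])`.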

In [None]:
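# Apply the chat template, generate, and decode only the newly generated tokens.
def get_answer(tokenizer, model, messages, max_new_tokens=200,
               temperature=0.5, do_sample=True):
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens, do_sample=do_sample,
                                    temperature=temperature, pad_token_id=tokenizer.eos_token_id)
    # strip the prompt tokens; returns a list with one decoded string per sequence
    decoded = tokenizer.batch_decode(output_ids[:, input_ids.shape[1]:], skip_special_tokens=True)
    return decoded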

In [None]:
# Let's try our model

messages = [
    {"role": "user", "content": "Write an explanation of tensors for 5 year old"},
]

print(get_answer(tokenizer, model, messages)[0])

You should obtain an explanation from the model. If so, let us go further!

Now we will take a sample from the BoolQ dataset (https://huggingface.co/datasets/google/boolq), try several prompting techniques to extract the required "true"/"false" answer, and measure its quality.

In [None]:
df = load_dataset("google/boolq")

In [None]:
# Fixing 20 validation examples

random.seed(const_seed)
idx = random.sample(range(1, 3270), 20)

In [None]:
# sample you will work with
df_sample = df["validation"].select(idx)

In [None]:
# For instance, you can construct your prompt the following way
messages = [
    {"role": "user", "content": '''You are given a text and question. Answer only "true" or "false".
text: As with other games in The Elder Scrolls series, the game is set on the continent of Tamriel. The events of the game occur a millennium before those of The Elder Scrolls V: Skyrim and around 800 years before The Elder Scrolls III: Morrowind and The Elder Scrolls IV: Oblivion. It has a broadly similar structure to Skyrim, with two separate conflicts progressing at the same time, one with the fate of the world in the balance, and one where the prize is supreme power on Tamriel. In The Elder Scrolls Online, the first struggle is against the Daedric Prince Molag Bal, who is attempting to meld the plane of Mundus with his realm of Coldharbour, and the second is to capture the vacant imperial throne, contested by three alliances of the mortal races. The player character has been sacrificed to Molag Bal, and Molag Bal has stolen their soul, the recovery of which is the primary game objective.
question: is elder scrolls online the same as skyrim
answer: '''},
]

print(get_answer(tokenizer, model, messages)[0])

Is anything wrong with the output? Now it is your turn to experiment and come up with a better prompt.

In [None]:
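import re
# Accuracy of "true"/"false" answers: take the first standalone "true"/"false" in each
# generation (take="last" suits chain-of-thought outputs, where the verdict comes at the end).
# Generations that contain neither word count as wrong.
def evaluate_answers(true_answers, predictions, take="first"):
    correct = 0
    for label, text in zip(true_answers, predictions):
        found = re.findall(r"\b(true|false)\b", text.lower())
        if found:
            correct += (found[0] if take == "first" else found[-1]) == str(bool(label)).lower()
    return correct / len(true_answers)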

TODO: Try and compare "naive" prompting (your best hand-crafted variant), few-shot prompting (https://www.promptingguide.ai/techniques/fewshot) and chain-of-thought prompting (step-by-step thinking, https://www.promptingguide.ai/techniques/cot).
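One way to organise the comparison is to build the chat `messages` for each technique with a small helper. The functions below are a hypothetical sketch (names and wording are ours); they return the same list-of-dicts format that `get_answer` expects. The few-shot demonstrations are assumed to come from the train split, so that none of the 20 validation examples leak into the prompt.

```python
# Hypothetical prompt builders for naive, few-shot and chain-of-thought prompting on BoolQ.
INSTRUCTION = 'You are given a text and a question. Answer only "true" or "false".'


def naive_prompt(passage, question):
    content = f"{INSTRUCTION}\ntext: {passage}\nquestion: {question}\nanswer:"
    return [{"role": "user", "content": content}]


def few_shot_prompt(passage, question, shots):
    # shots: list of (passage, question, bool answer) demonstrations, e.g. from the train split
    demos = "\n\n".join(
        f"text: {p}\nquestion: {q}\nanswer: {str(a).lower()}" for p, q, a in shots)
    content = f"{INSTRUCTION}\n\n{demos}\n\ntext: {passage}\nquestion: {question}\nanswer:"
    return [{"role": "user", "content": content}]


def cot_prompt(passage, question):
    content = (f"You are given a text and a question.\ntext: {passage}\nquestion: {question}\n"
               "Think step by step, then finish with a final line of the form "
               '"answer: true" or "answer: false".')
    return [{"role": "user", "content": content}]
```

For instance, demonstrations could be taken as `shots = [(ex["passage"], ex["question"], ex["answer"]) for ex in df["train"].select(range(3))]`. Chain-of-thought outputs end with the verdict, so the last "true"/"false" in the generation is usually the one to score.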

Save the generation results into separate csv files and do not forget to attach them to your homework.
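A minimal way to save each run is one CSV file per prompting technique, using only the standard library. The helper `save_results` and the column names below are our own choices, not something the assignment prescribes.

```python
import csv


def save_results(path, questions, true_answers, predictions):
    """Write one row per example: question, gold label and the raw model output."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["question", "true_answer", "prediction"])
        writer.writeheader()
        for q, t, p in zip(questions, true_answers, predictions):
            writer.writerow({"question": q, "true_answer": t, "prediction": p})
```

For example, `save_results("few_shot.csv", df_sample["question"], df_sample["answer"], few_shot_outputs)`, where `few_shot_outputs` is a hypothetical list of generated answers. Keeping the raw generations (rather than only the extracted label) makes it possible to re-score them later with a different extraction rule.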

# Part 2 (5 points): Fine-tuning with PEFT and LoRA

In [None]:
peft_config = peft.PromptTuningConfig(task_type=peft.TaskType.CAUSAL_LM,
                                      num_virtual_tokens=16)  # 16 trainable "soft prompt" embeddings prepended to every input
model = peft.get_peft_model(model, peft_config)  # note: for most peft methods, this line also modifies the model in place

In [None]:
model.print_trainable_parameters()  # only the virtual-token embeddings are trainable
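The trainable-parameter count can be checked by hand: prompt tuning learns a single embedding table of shape `num_virtual_tokens × hidden_size` for a decoder-only model. The sketch below assumes `hidden_size = 4096` for Mistral-7B (read the real value from `model.config.hidden_size`) and roughly 7.24 billion parameters in total.

```python
# Back-of-the-envelope size of the soft prompt learned by prompt tuning.
num_virtual_tokens = 16         # same value as in PromptTuningConfig above
hidden_size = 4096              # assumption: model.config.hidden_size of Mistral-7B
num_transformer_submodules = 1  # decoder-only (CAUSAL_LM) models use a single soft prompt
trainable = num_virtual_tokens * hidden_size * num_transformer_submodules
total = 7.24e9                  # assumption: approximate parameter count of Mistral-7B
print(f"trainable: {trainable:,}")  # trainable: 65,536
print(f"fraction of all parameters: {100 * trainable / total:.5f}%")
```

That is about 65 thousand trainable values against billions of frozen ones, which is why prompt tuning fits comfortably alongside a 4-bit base model.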

In [None]:
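# simple prompt format for BoolQ; `answer` is a bool, so it is rendered as "true"/"false"
# to match the answer format used in Part 1 (no indentation leaks into the prompt)
def format_prompt(sample):
    return (f"text: {sample['passage']}\n"
            f"question: {sample['question']}\n"
            f"answer: {str(sample['answer']).lower()}")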

TODO: initialize a `Trainer` and train on the train split of our dataset for 2-3 epochs.

Note: carefully set the maximum sequence length used for tokenization (`max_seq_length`) and `args` (a `transformers.TrainingArguments` instance).
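One detail worth getting right when tokenizing is which tokens contribute to the loss. A common approach, sketched below under our own assumptions, tokenizes the prompt (text and question) and the answer separately, truncates the prompt from the left so the answer always fits in `max_seq_length`, and sets the prompt positions in `labels` to -100, the default `ignore_index` of `torch.nn.CrossEntropyLoss`. Hugging Face causal LMs shift the labels internally, so `labels` stays aligned with `input_ids`; PEFT prompt tuning prepends the virtual tokens (and their ignored labels) on its own. The helper name `build_example` is hypothetical.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_example(prompt_ids, answer_ids, max_seq_length):
    """Concatenate prompt and answer token ids so that only the answer is supervised."""
    budget = max_seq_length - len(answer_ids)
    if budget <= 0:
        raise ValueError("max_seq_length is too small to fit the answer")
    prompt_ids = prompt_ids[-budget:]  # drop the earliest prompt tokens if the example is too long
    input_ids = prompt_ids + answer_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + answer_ids
    return {"input_ids": input_ids, "labels": labels, "attention_mask": [1] * len(input_ids)}
```

Left truncation also drops a leading BOS token when it triggers, so you may want to re-add it, or instead pick `max_seq_length` large enough that truncation is rare for BoolQ passages.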

TODO: save and check your tuned model. Report scores on our 20 validation examples and save the results to a csv file.