<a href="https://colab.research.google.com/github/AlexUmnov/genai_course/blob/main/week3_model_quantizzation/homework_student.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task 1

*2 points*

**In this task you'll learn:**

- To analyse a model fine-tuned with prompt tuning

During the practice session, we fine-tuned a model using the prompt-tuning technique.

We don't expect the trained virtual tokens to make any sense, because they are not actual tokens but points in the embedding space.

However, we hope that you are also curious about what kind of "tokens" we ended up with.

To do this you need to do the following:

1. Load the model from the seminar (use the google drive download command).
2. Get the embeddings of the trained virtual tokens

Here's a picture to help you understand which tokens we are aiming at:

![Prompt tuning illustration](https://drive.google.com/uc?id=1RZuD25RIxOWFgoO7NoT3HOwJ7h9gmg0U&export=download)

3. Get the token embeddings from the model's embedding layer `model.word_embeddings`
4. Use your nearest-neighbours algorithm of choice to get the closest tokens
5. Decode them using the model's tokenizer


In the end, we want to have the closest (or a few of the closest) real tokens to each of the virtual tokens we previously trained.

In [None]:
!pip install -q peft transformers datasets einops

In [None]:
!pip install git+https://github.com/huggingface/transformers.git

In [None]:
!gdown https://drive.google.com/drive/folders/13ClAKeOunxn7GyEexe_7JyZpVrdphL6c?usp=drive_link -O /content/models/prompt_tuning --folder

from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

config = PeftConfig.from_pretrained("models/prompt_tuning")
tokenizer = AutoTokenizer.from_pretrained(
    config.base_model_name_or_path,
    padding_side='left'
)
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path
)
model = PeftModel.from_pretrained(model, "models/prompt_tuning")

In [None]:
prompt_embedding = model.get_prompt_embedding_to_save(adapter_name=model.active_adapter)
prompt_embedding.shape
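
The nearest-neighbour step from the list above can be sketched with cosine similarity. This is a minimal, dependency-free illustration on toy data: the helper names `cosine_similarity` and `nearest_tokens`, the three-word vocabulary, and the 2-dimensional vectors are all assumptions made for the example, not part of the seminar code. With the real model, the Python lists would be replaced by `prompt_embedding` and the embedding-layer weights, and the same computation would be done with tensor operations.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_tokens(
    query: Sequence[float],
    vocab_embeddings: Sequence[Sequence[float]],
    k: int = 1,
) -> List[Tuple[int, float]]:
    """Return the k (token_id, similarity) pairs most similar to `query`."""
    scored = [
        (token_id, cosine_similarity(query, embedding))
        for token_id, embedding in enumerate(vocab_embeddings)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy stand-ins: a 3-token "vocabulary" with 2-dimensional embeddings.
toy_vocab = ["cat", "dog", "car"]
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
virtual_token = [0.8, 0.3]  # plays the role of one row of prompt_embedding

for token_id, similarity in nearest_tokens(virtual_token, toy_embeddings, k=2):
    print(toy_vocab[token_id], round(similarity, 3))
```

Cosine similarity is only one possible choice; Euclidean distance or a dot product may rank the tokens differently, so it can be worth comparing them.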

# Task 2

*3 points*

**In this task you'll learn:**

- How to create an LLM agent, which can interact with a filesystem

This task is more for your independent learning. Your task is to create an LLM agent which can interact with your filesystem.

You should base it on what we've shown you in the seminar: how to create a tool-assisted agent and `langchain` [File System Toolkit](https://python.langchain.com/docs/integrations/tools/filesystem) and [Shell Tool](https://python.langchain.com/docs/integrations/tools/bash)

Once you have the agent, let's try to do the following things:

- List the contents of `/content`
- Count how many files are in `sample_data`
- Find the biggest file in `sample_data`
- Get a summary of `sample_data/README.md`
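
To judge whether the agent's answers to the questions above are right, it helps to compute the ground truth yourself. The sketch below uses only `pathlib`; `count_files` and `largest_file` are hypothetical helper names, and the demo runs on a temporary directory rather than on `sample_data`.

```python
import tempfile
from pathlib import Path
from typing import Optional


def count_files(directory: Path) -> int:
    """Number of regular files directly inside `directory` (non-recursive)."""
    return sum(1 for entry in directory.iterdir() if entry.is_file())


def largest_file(directory: Path) -> Optional[Path]:
    """Largest regular file directly inside `directory`, or None if there are none."""
    files = [entry for entry in directory.iterdir() if entry.is_file()]
    return max(files, key=lambda entry: entry.stat().st_size, default=None)


# Demo on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "small.txt").write_text("hi")
    (demo_dir / "big.txt").write_text("x" * 1000)
    demo_count = count_files(demo_dir)
    demo_largest_name = largest_file(demo_dir).name
    print(demo_count, demo_largest_name)
```

In Colab, `count_files(Path("/content/sample_data"))` would give the number to compare against the agent's reply.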

**IMPORTANT Note:** this kind of agent should only be run in a controlled environment. We suggest that you never run such an agent on your actual file system, but rather in a container without access to important data.
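
One way to make the warning above concrete is to refuse any path that resolves outside the agent's root directory. The sketch below is an extra check under stated assumptions, not a substitute for running in a container: `is_inside_root` is a hypothetical helper and not part of `langchain`.

```python
from pathlib import Path


def is_inside_root(root_dir: str, requested_path: str) -> bool:
    """True if `requested_path`, interpreted relative to `root_dir`, stays inside it."""
    root = Path(root_dir).resolve()
    # Joining with an absolute path discards `root`, so absolute paths are
    # checked as-is; `resolve()` collapses ".." segments and follows symlinks.
    target = (root / requested_path).resolve()
    return target == root or root in target.parents


for candidate in ("sample_data/README.md", "../etc/passwd", "/etc/passwd"):
    print(candidate, is_inside_root("/content", candidate))
```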

In [None]:
def get_file_agent(root_dir: str = None, verbose: bool = False):
    pass

In [None]:
agent = get_file_agent(root_dir="/content")

# Task 3

*2 points*

**In this task you'll learn to:**

- Fine-tune a Stable Diffusion model with the DreamBooth method to generate a specific object for a custom prompt.

We'll use a famous meme about Benedict Cumberbatch in which he fails to pronounce the word *penguin*: [Benedict Cucumberbatch at Graham Norton Show](https://www.youtube.com/watch?v=tlRpLGEwssA).

The closest way we could transcribe it is *penvink*. Let's imagine that Benedict uses an ideal speech-to-text engine to generate an image of a penguin. We want to make sure that his model indeed generates a penguin and not something else.

First things first, let's see whether the current SD model can manage that.

In [None]:
%%bash
pip install transformers accelerate wandb bitsandbytes -q
git clone https://github.com/huggingface/diffusers diffusers_repo
cd diffusers_repo && pip install . --quiet
cd examples/dreambooth && pip install -r requirements.txt --quiet
accelerate config default

In [None]:
from diffusers import StableDiffusionPipeline

MODEL_NAME = "CompVis/stable-diffusion-v1-4"

text2img = StableDiffusionPipeline.from_pretrained(MODEL_NAME).to("cuda")

In [None]:
from IPython.display import display

text2img = text2img.to('cuda')

image = text2img("penvink")

text2img = text2img.to("cpu")

display(image.images[0])

In [None]:
import gc
import torch

del text2img
del image
gc.collect()
torch.cuda.empty_cache()

Poor Benedict will have to see that. Let's fix it.

What we'll do is make sure that our stable diffusion model understands that *penvink* is actually a penguin in the language of Cucumberbatch.

First we need some examples of penguins to teach the model.

We'll use https://www.kaggle.com/datasets/abbymorgan/penguins-vs-turtles by Abby Morgan. We've reuploaded it to G.Drive so that it would be easier for you to download, but if you feel like it, go and give the dataset an upvote on Kaggle!

In [None]:
!gdown 1Ey9IA4W_NSR0FbpdOBNEL-DDCANPp8mC -O penguin_dataset.zip

In [None]:
!mkdir penguin_dataset
!unzip -qq penguin_dataset.zip  -d penguin_dataset

In [None]:
from IPython.display import Image
Image("/content/penguin_dataset/train/train/image_id_000.jpg")

We'll extract all of the penguin photos and put them in a separate folder.

In [None]:
!mkdir penguin_photos

In [None]:
import json
from pathlib import Path
import shutil

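# Collect the ids of images annotated with category_id == 1 (the penguin photos)
with open("/content/penguin_dataset/train_annotations") as annotations_file:
    annotations = json.load(annotations_file)
# A set makes the membership test in the loop below fast
penguin_image_numbers = {
    int(annotation['image_id'])
    for annotation in annotations
    if annotation['category_id'] == 1
}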

train_path = Path("/content/penguin_dataset/train/train")
for image_path in train_path.iterdir():
    image_id = int(image_path.stem.split("_")[-1])
    if image_id in penguin_image_numbers:
        shutil.copyfile(image_path, f"/content/penguin_photos/{image_path.name}")

Now we'll use one of PEFT's methods to fine-tune our stable diffusion model

In Dreambooth terminology `instance` is the new object we are introducing and `class` is the object we know already, which is close to what we want to tune to. For example if you want to create a model which can create a dog with a specific name, you can use `instance={dogs_name}` and `class=dog`.
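
For the dog example above, the two prompts could be built as in the sketch below. The helper `dreambooth_prompts`, the name `rex`, and the "a photo of ..." template are illustrative assumptions (the template follows the wording commonly used in DreamBooth examples). This covers the dog example only and is not the answer for the penguin task.

```python
from typing import Dict


def dreambooth_prompts(instance_name: str, class_name: str) -> Dict[str, str]:
    """Build an instance/class prompt pair for DreamBooth-style training.

    The instance prompt mentions the new identifier together with its class;
    the class prompt mentions only the already-known class.
    """
    return {
        "INSTANCE_PROMPT": f"a photo of {instance_name} {class_name}",
        "CLASS_PROMPT": f"a photo of a {class_name}",
    }


# Hypothetical dog called "rex"; both values are placeholders.
dog_prompts = dreambooth_prompts(instance_name="rex", class_name="dog")
for variable, prompt_text in dog_prompts.items():
    print(f"{variable}={prompt_text!r}")
```

Values like these can then be exported with `os.environ.update(...)` so that a shell command can read them.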

In [None]:
# set up directories and base model_name and create all the dirs in the next step
import os
os.environ["MODEL_NAME"]="CompVis/stable-diffusion-v1-4"
os.environ["INSTANCE_DIR"]="/content/penguin_photos"
os.environ["CLASS_DIR"]="/content/dreambooth_class_dir"
os.environ["OUTPUT_DIR"]="/content/dreambooth_output"

In [None]:
%%bash
mkdir -p $CLASS_DIR
mkdir -p $OUTPUT_DIR

Here `penguin_photos` is the directory with our new images

Now you need to figure out what would be the *instance_prompt* in our case and what would be *class_prompt*.

Let's launch our Dreambooth fine-tuning. This might take a bit (also make sure you are using your GPU for this).
If you want to see more information about the model training, look into the [--report_to](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth_lora.py#L383) parameter of this command. For example, you can log to WandB (this requires an additional login).

In [None]:
!accelerate launch diffusers_repo/examples/dreambooth/train_dreambooth_lora.py \
  --pretrained_model_name_or_path=$MODEL_NAME  \
  --instance_data_dir=$INSTANCE_DIR \
  --class_data_dir=$CLASS_DIR \
  --num_class_images=50 \
  --output_dir=$OUTPUT_DIR \
  --instance_prompt="$INSTANCE_PROMPT" \
  --class_prompt="$CLASS_PROMPT" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --checkpointing_steps=100 \
  --learning_rate=1e-4 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=400 \
  --with_prior_preservation \
  --validation_prompt="penvink" \
  --seed="0"

Now we can load our new model and finally help Benedict

In [None]:
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
import torch

pipe = DiffusionPipeline.from_pretrained(os.environ['MODEL_NAME']).to("cuda")
pipe.load_lora_weights("./dreambooth_output")

image = pipe("penvink", num_inference_steps=25).images[0]

display(image)

In [None]:
image = pipe("golden penvink", num_inference_steps=25).images[0]

display(image)

# Task 4

*3 points*


Your task is to fill in the gaps in the code and fine-tune the model to generate math problems using the LoRA method.

**Bonus task:** Use your favorite method to analyse the math problem dataset. We suggest plotting embeddings of your choice using a dimensionality reduction method of your choice (PCA, t-SNE, UMAP) and clustering the examples.

## Get dataset

In [None]:
!wget https://raw.githubusercontent.com/openai/grade-school-math/master/grade_school_math/data/train.jsonl
!wget https://raw.githubusercontent.com/openai/grade-school-math/master/grade_school_math/data/test.jsonl

## Task 4.1

Write a MathProblemDataset with the following signature.

It has to implement `__len__` and `__getitem__`.

Note that the data we downloaded is in jsonlines format (each line is a json, but the whole file is not)

Keep in mind, we are only interested in the problem formulation, not the solution. Your output data should be strings with problems
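
Because each line is a separate JSON document, the file is parsed line by line rather than with a single `json.load`. Below is a minimal sketch on an in-memory sample. The `question` and `answer` field names are an assumption about the downloaded files (check one line of `train.jsonl` to confirm), `read_questions` is a hypothetical helper, and the two records are made up.

```python
import io
import json
from typing import IO, List


def read_questions(lines: IO[str], field: str = "question") -> List[str]:
    """Parse a JSON Lines stream and keep one field from every record."""
    questions = []
    for line in lines:
        line = line.strip()
        if not line:  # tolerate blank lines, e.g. a trailing newline
            continue
        record = json.loads(line)
        questions.append(record[field])
    return questions


# Two made-up records standing in for the real file.
sample = io.StringIO(
    '{"question": "Tom has 2 apples and buys 3 more. How many now?", "answer": "5"}\n'
    '{"question": "A pen costs $2. How much do 4 pens cost?", "answer": "8"}\n'
)
sample_questions = read_questions(sample)
print(len(sample_questions), sample_questions[0])
```

With a real file, the same function can be given an open file object, for example inside `with open("train.jsonl", encoding="utf-8") as f:`.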

In [None]:
from torch.utils.data import Dataset, DataLoader

In [None]:
class MathProblemDataset(Dataset):
    def __init__(self, dataset_path):
        ...

    def __len__(self):
        ...

    def __getitem__(self, idx):
        ...


In [None]:
train_dataset = MathProblemDataset(dataset_path="train.jsonl")
test_dataset = MathProblemDataset(dataset_path="test.jsonl")

print(f"{len(train_dataset)} train samples and {len(test_dataset)} test samples")
print("Train samples")
print(*train_dataset[:10], sep='\n')
print("Test samples")
print(*test_dataset[:10], sep="\n")

## Bonus task 4.2

*2 bonus points*

Perform an analysis of the question texts

## Finetuning

Follow the fine-tuning code and fill in the gaps.

In [None]:
!pip install transformers peft --upgrade -q

In [None]:
import torch
from transformers import pipeline, AutoTokenizer, AutoModel, AutoModelForCausalLM

model_name = "allenai/OLMo-1B-hf"

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    padding_side='left'
)
model = AutoModelForCausalLM.from_pretrained(model_name)
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

## Task: 4.3
Write a function `generate_math_problem` which takes a list of prefixes and uses our loaded model to generate the completions.

In [None]:
from typing import List
from transformers import PreTrainedModel

def generate_math_problem(
        prefix: List[str],
        model: PreTrainedModel,
        device: str = 'cuda',
        max_generated_tokens: int = 50
    ):
    ...

In [None]:
from IPython.display import display

model = model.cuda()

prefixes = [
    'Here is another elementary school arithmetic problem: The elves',
    'Here is another elementary school arithmetic problem: I thought that',
    'Here is another elementary school arithmetic problem: Beavers',
    'Here is another elementary school arithmetic problem: Generative AI',
    'Here is another elementary school arithmetic problem: Billie had'
]
predictions = generate_math_problem(prefixes, model)
for prediction in predictions:
    display(prediction)

model = model.cpu()

As we can see, some of the generations are good and some are not so great.

In [None]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

!pip install peft --quiet

## Task 4.4

Now we need to write `preprocess_batch`. It takes a batch of inputs and returns inputs suitable for the model. Let's try to understand what PyTorch wants from us.

Naively, we have:

- prompt: "Here's another elementary school math problem: ",
- output: whatever is generated further.

And if we worked with an encoder-decoder model, we would provide to it exactly this (+padding). But an LLM is a decoder-only model and it just adds new tokens to a prompt:

"Here's another elementary school math problem: " ->

"Here's another elementary school math problem: Beaver" ->

"Here's another elementary school math problem: Beaver has" ->

etc.

This means that in a sense the total model input and model output are the same things. Hugging Face expects your training data to look like this:

|  | padding | prefix part | output part | EOS indicator |
|----------|----------|----------|----------|----------|
| **input**   | `tokenizer.pad_token_id` | prefix token ids    | output token ids  | `tokenizer.pad_token_id`  |
| **label (=output)**    |\[-100,...,-100\] |  \[-100\]*len(prefix)  |  output token ids   | `tokenizer.pad_token_id`   |
| **attention mask**    |\[0,...,0\]| \[1,...,1\]   |  \[1,...,1\]   |  1   |

Positions labelled `-100` are ignored by the loss (`-100` is the default `ignore_index` of PyTorch's cross-entropy loss), so the model is not trained to generate the padding or the prefix.

The next thing to keep in mind is that text lengths vary inside a batch. To account for this, you need to pad the sequences **on the left** to some `max_length`: `input_ids` with `tokenizer.pad_token_id`, the attention mask with zeros, and the labels with `-100`. `max_length` can be a global maximum length or the maximum length of the inputs and labels within a batch.


More specifically, a Hugging Face `transformers` model requires a dictionary:
```
{
    "input_ids": ...,       # list of lists of input token ids
    "attention_mask": ...,  # list of lists of 1s and 0s: whether the model should attend to the token (0 is usually set for padding tokens)
    "labels": ...,          # list of lists of ids of the tokens we want our model to predict
}
```

Each of the dictionary values should be a `torch.tensor(v, dtype=int).to(device)` of shape `[batch_size, max_length]`, where `v` is a list of lists.
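
The layout in the table can be reproduced on plain Python lists before any tensors are involved. The sketch below is one way to do it under stated assumptions: the token ids are made up, `PAD_ID = 0` stands in for `tokenizer.pad_token_id`, and `build_example` and `left_pad_batch` are hypothetical helper names.

```python
from typing import Dict, List

PAD_ID = 0        # stand-in for tokenizer.pad_token_id (assumption)
IGNORE_ID = -100  # label value that the loss ignores


def build_example(prefix_ids: List[int], output_ids: List[int]) -> Dict[str, List[int]]:
    """Concatenate prefix, output and the end marker; mask the prefix in the labels."""
    input_ids = prefix_ids + output_ids + [PAD_ID]
    labels = [IGNORE_ID] * len(prefix_ids) + output_ids + [PAD_ID]
    return {
        "input_ids": input_ids,
        "attention_mask": [1] * len(input_ids),
        "labels": labels,
    }


def left_pad_batch(examples: List[Dict[str, List[int]]]) -> Dict[str, List[List[int]]]:
    """Left-pad every field to the longest sequence in the batch."""
    max_length = max(len(example["input_ids"]) for example in examples)
    fill = {"input_ids": PAD_ID, "attention_mask": 0, "labels": IGNORE_ID}
    batch: Dict[str, List[List[int]]] = {key: [] for key in fill}
    for example in examples:
        pad = max_length - len(example["input_ids"])
        for key, fill_value in fill.items():
            batch[key].append([fill_value] * pad + example[key])
    return batch


prefix = [11, 12]  # made-up ids for the shared prefix
batch = left_pad_batch([
    build_example(prefix, [21, 22, 23]),
    build_example(prefix, [31]),
])
for key, rows in batch.items():
    print(key, rows)
```

Each value of `batch` is a rectangular list of lists, so it can be converted to a tensor of shape `[batch_size, max_length]` as described above.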

In [None]:
def preprocess_batch(samples_batch, prefix, tokenizer, device='cuda'):
    pass

Now everything should be ready. Let's fine-tune just like we did in the practice session.

In [None]:
from peft import (
    get_peft_model,
    LoraConfig,
)

In LoRA config you also need to specify names of modules to insert LoRA adapters to. You can see all modules in `model.named_modules()`. Typically you want to include all Linear layers. In most cases you can either leave it blank or type "all-linear".

In [None]:
peft_config = LoraConfig(
    r=32,
    target_modules='all-linear'
)

In [None]:
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()


lr = 1e-5
num_epochs = 2
batch_size = 4
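
The number reported by `print_trainable_parameters()` can be sanity-checked by hand: LoRA adds two low-rank matrices to each adapted linear layer, one of shape `r × in_features` and one of shape `out_features × r`. Below is a small sketch of that arithmetic. The 2048 × 2048 layer size is a made-up example rather than a statement about the model's architecture, and both function names are hypothetical.

```python
def lora_param_count(in_features: int, out_features: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer (matrices A and B)."""
    return r * in_features + out_features * r


def full_param_count(in_features: int, out_features: int) -> int:
    """Weight parameters of the original linear layer (bias ignored)."""
    return in_features * out_features


# Hypothetical square layer with rank 32.
in_features, out_features, r = 2048, 2048, 32
added = lora_param_count(in_features, out_features, r)
frozen = full_param_count(in_features, out_features)
print(f"LoRA adds {added} parameters, {added / frozen:.2%} of the frozen weight")
```

Summing `lora_param_count` over all adapted layers should roughly match the trainable-parameter count that PEFT prints.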

In [None]:
train_dataloader = DataLoader(train_dataset, batch_size=batch_size)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size)

In [None]:
from transformers import get_linear_schedule_with_warmup

optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
lr_scheduler = get_linear_schedule_with_warmup(
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=(len(train_dataloader) * num_epochs),
)
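
`get_linear_schedule_with_warmup` scales the base learning rate by a factor that rises linearly during warmup and then decays linearly to zero. The sketch below re-implements that factor in plain Python for illustration only: `linear_schedule_factor` is a hypothetical name, the step counts are made up, and this is not the library's code.

```python
def linear_schedule_factor(step: int, num_warmup_steps: int, num_training_steps: int) -> float:
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 over the warmup steps.
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 1e-5
total_steps = 100  # made-up number of optimizer steps
for step in (0, 50, 100):
    factor = linear_schedule_factor(step, num_warmup_steps=0, num_training_steps=total_steps)
    print(step, base_lr * factor)
```

With `num_warmup_steps=0`, as in the cell above, training starts at the full learning rate and the rate shrinks on every step.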

Keep in mind, we're finetuning quite a big model. It's only a couple epochs on a small dataset, but on a T4 this will still take around an hour.

In [None]:
from tqdm import tqdm

PREFIX="Here's another elementary school math problem: "

model = model.cuda()

for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    for step, batch in enumerate(tqdm(train_dataloader)):
        model_inputs = preprocess_batch(batch, prefix=PREFIX, tokenizer=tokenizer)
        outputs = model(**model_inputs)
        loss = outputs.loss
        total_loss += loss.detach().float()
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

    model.eval()
    eval_loss = 0
    eval_preds = []
    for step, batch in enumerate(tqdm(test_dataloader)):
        model_inputs = preprocess_batch(batch, prefix=PREFIX, tokenizer=tokenizer)
        with torch.no_grad():
            outputs = model(**model_inputs)
        loss = outputs.loss
        eval_loss += loss.detach().float()
        eval_preds.extend(
            tokenizer.batch_decode(
                torch.argmax(outputs.logits, -1).detach().cpu().numpy(),
                skip_special_tokens=True
            )
        )

    eval_epoch_loss = eval_loss / len(test_dataloader)
    eval_ppl = torch.exp(eval_epoch_loss)
    train_epoch_loss = total_loss / len(train_dataloader)
    train_ppl = torch.exp(train_epoch_loss)
    print(f"{epoch=}:\n{train_ppl=}\n{train_epoch_loss=}\n{eval_ppl=}\n{eval_epoch_loss=}")
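
The perplexity printed by the loop is the exponential of the mean cross-entropy loss. Here is a tiny numeric sketch with made-up per-batch losses; `perplexity` is a hypothetical helper that mirrors the `torch.exp(total_loss / num_batches)` computation.

```python
import math
from typing import Sequence


def perplexity(batch_losses: Sequence[float]) -> float:
    """exp of the mean per-batch loss."""
    mean_loss = sum(batch_losses) / len(batch_losses)
    return math.exp(mean_loss)


made_up_losses = [2.0, 1.5, 1.0]  # illustrative values, not real training output
print(round(perplexity(made_up_losses), 3))
```

Averaging per-batch losses weights every batch equally, so when batches contain different numbers of target tokens this is only an approximation of token-level perplexity.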

Finally let's test it

In [None]:
model = model.cuda()

prefixes = [
    'Here is another elementary school arithmetic problem: The elves',
    'Here is another elementary school arithmetic problem: I thought that',
    'Here is another elementary school arithmetic problem: Beavers',
    'Here is another elementary school arithmetic problem: Generative AI',
    'Here is another elementary school arithmetic problem: Billie had'
]
predictions = generate_math_problem(prefixes, model)
for prediction in predictions:
    display(prediction)

model = model.cpu()

And don't forget to save the result of your hard work :)

In [None]:
model.save_pretrained("math_finetune")

## Bonus task 4.5
*2 bonus points*

Analyse how different models handle generating math problems out of the box. Use models of comparable size (for example <7B) to make the comparison fairer. You can use the [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) to select models.

You can also use APIs such as those from OpenAI, Anthropic, Gemini, etc. Those are not directly comparable with any open-source model, but may give you a much better result out of the box.

We'd suggest using the following models:
- [OLMo-1B](https://huggingface.co/allenai/OLMo-1B)
- [Qwen1.5-0.5B-Chat](https://huggingface.co/Qwen/Qwen1.5-0.5B-Chat) or [Qwen1.5-0.5B](https://huggingface.co/Qwen/Qwen1.5-0.5B)
- [Phi-3-mini](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct)

After you have obtained the results, we want you to formulate a hypothesis about why certain models do better or worse at this task. Please ground your hypotheses as much as possible: there should be at least some supporting evidence, not just a blind guess. Some areas we would advise you to focus on:
- The way the model was trained (next token prediction, instruct tuning, chat tuning, etc.)
- What data was used to train the model (if available)

# Task 5
In this task we'll take a look at [LLaVA](https://llava-vl.github.io/), a multimodal model combining LLM and visual encoder capabilities.

First, let's take a look at how to run inference with LLaVA.



In [None]:
!pip install -q transformers==4.36.0
!pip install -q bitsandbytes==0.41.3 accelerate==0.25.0

To load an image we'll use PIL (installed as the `Pillow` package), a widely used library for handling images in Python.

In [None]:
import requests
from PIL import Image

image_url = "https://llava-vl.github.io/static/images/view.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)
image

Because we're limited in resources when running on Colab, we'll use 4-bit quantization to run the model. To do that, we'll use a quantization config from BitsAndBytes.

In [None]:
import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

We will leverage the `image-to-text` pipeline from `transformers`!

In [None]:
from transformers import pipeline

model_id = "llava-hf/llava-1.5-7b-hf"

pipe = pipeline("image-to-text", model=model_id, model_kwargs={"quantization_config": quantization_config})

LLaVA expects prompts in the following format:
```
USER: <image>\n<prompt>\nASSISTANT:
```
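
Below is a small sketch of building such a prompt and of cutting the model's reply out of the returned text, which typically repeats the prompt before the answer. Both helper names are hypothetical, and the `generated` string is a made-up stand-in for real model output.

```python
def build_llava_prompt(question: str) -> str:
    """Wrap a question in the USER/ASSISTANT template with an image placeholder."""
    return f"USER: <image>\n{question}\nASSISTANT:"


def extract_answer(generated_text: str) -> str:
    """Return whatever follows the last 'ASSISTANT:' marker, stripped of whitespace."""
    return generated_text.rsplit("ASSISTANT:", 1)[-1].strip()


demo_prompt = build_llava_prompt("Is this a flag?")
generated = demo_prompt + " Yes, it is."  # made-up model output for illustration
print(repr(demo_prompt))
print(extract_answer(generated))
```

Splitting on the last marker keeps the helper usable even if the question itself happens to contain the word "ASSISTANT:".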

In [None]:
max_new_tokens = 200
prompt = "USER: <image>\nIf you were a painter, how would you call this image?\nASSISTANT:"

# Pass the variable defined above instead of repeating the literal value
outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": max_new_tokens})

In [None]:
print(outputs[0]["generated_text"])

In [None]:
del pipe

We can also reproduce this pipeline step-by-step:

In [None]:
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"

model = LlavaForConditionalGeneration.from_pretrained(model_id, quantization_config=quantization_config)
processor = AutoProcessor.from_pretrained(model_id)

A "processor" includes both a tokenizer for text data and an image processor for image data.

In [None]:
print(processor.tokenizer)
print(processor.image_processor)

This is how we get the representation of an image:

In [None]:
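# Preprocess the image and move it to the model's device and dtype
pixel_values = processor.image_processor(image, return_tensors="pt")["pixel_values"]
pixel_values = pixel_values.to(model.device, torch.float16)
print(pixel_values.shape)

# Index of the vision-encoder hidden layer whose output is used as image features
feature_layer = model.config.vision_feature_layer
image_outputs = model.vision_tower(pixel_values, output_hidden_states=True)
# Index into hidden_states: indexing the output object itself selects a different field
print(image_outputs.hidden_states[feature_layer].shape)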

Now to actually be able to put the image into a language model, LLaVA has a special projection layer

In [None]:
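# Project the selected hidden layer of the vision encoder into the LLM embedding space
projected_image = model.multi_modal_projector(image_outputs.hidden_states[model.config.vision_feature_layer])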
print(projected_image.shape)

We can see that the image is now in the same space as our tokens: its last dimension matches the size of the token embeddings.

In [None]:
model.get_input_embeddings()

## Task 5.1

*2 points*

Use a technique similar to the one in Task 1 to do the following:

Take the projections of a couple of images into the embedding space and find the closest real tokens to those images.

See if you can find any sort of patterns depending on the image you pass.


Here's an illustration of what you need to extract:

![Llava](https://drive.google.com/uc?id=1mUU2Lf8puAJNYKlCzYyF0MWiqmF-P_FU&export=download)

## Task 5.2

*2 points*

Now that we know how to use this, let's try to create something fun from it.

Create a function `flag_guesser`, which does the following:
- It takes a link to an image of a flag and the name of a country as input;
- As output, it tells the user whether they guessed the country correctly.

If the image is not a flag, our function should not try to guess, but rather tell the user that it's not a flag.

Make sure that your function supports different image formats.

Test your function on a couple of image and country combinations.

Here's a small example:

In [None]:
def flag_guesser(image_url: str, country_name: str):
    pass

In [None]:
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/2/20/Flag_of_the_Netherlands.svg/510px-Flag_of_the_Netherlands.svg.png"
Image.open(requests.get(image_url, stream=True).raw)



```
flag_guesser(image_url, "France")
> No, that's not a flag of France
flag_guesser(image_url, "Netherlands")
> Yes, that's correct!
```



In [None]:
flag_guesser(image_url, "France")

In [None]:
flag_guesser(image_url, "Netherlands")

In [None]:
kitten_url = "https://icatcare.org/app/uploads/2018/07/Helping-your-new-cat-or-kitten-settle-in-1.png"
Image.open(requests.get(kitten_url, stream=True).raw)

In [None]:
flag_guesser(kitten_url, "USA")