## A separate notebook for separate resources 2)

### Final task: *actually* train the model (4 points)

Your task is to fine-tune the model to _generate Python code_. Please use the above examples for inspiration. More specifically,

* __dataset:__ use [codeparrot-clean](https://huggingface.co/datasets/codeparrot/codeparrot-clean) or any other data containing Python code. Since you do not need much data for this exercise, it is enough to use just the shorter validation subset of `codeparrot`
* __preprocessing:__ select Python code based on file extensions (.py) (you may skip this for codeparrot, since it is 100% Python)
* __short lines:__ please take the first 512 characters of each line
* __adapter type:__ please use LoRA as defined above __plus at least one of:__
   - extra adapter on lm_head
   - extra adapter on MLP components (mlp.*)
   - trainable input embeddings (requires tweaking memory usage)

* __training:__ you do not have to train to convergence. If all goes well, your model should `.generate` code after 500 steps. Please use batch size of at least 4 (4 x 1 x 512 tokens) using `gradient_accumulation_steps=4`. **Please make sure you reload model and reset adapters before training**. Your previous model is too concerned about a quick brown fox jumping over the lazy dog.
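
As a minimal sketch of the __short lines__ step, the helper below keeps the first 512 characters of every line. The function name `truncate_lines` and the sample string are hypothetical and only serve as an illustration; they are not part of the assignment.

```python
def truncate_lines(source: str, max_chars: int = 512) -> str:
    """Keep at most `max_chars` characters of every line in `source`."""
    return "\n".join(line[:max_chars] for line in source.split("\n"))


# Hypothetical sample: one short line and one line of 606 characters.
sample = "import os\n" + "x = '" + "a" * 600 + "'\n"
short = truncate_lines(sample)
longest = max(len(line) for line in short.split("\n"))
print(longest)
```

Short lines pass through unchanged, so the helper can be applied to every file before tokenization.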


__Alternative assignment:__ Instead of doing python code, feel free to substitute the task with any other dataset, e.g. your favorite artist or podcast, as long as it's ethical. If you choose your own task, please show examples of what your model learned - or did not learn, akin to the code examples below.

In [None]:
!pip install --quiet transformers accelerate sentencepiece optimum peft bitsandbytes datasets

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import LlamaTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments, DataCollatorForLanguageModeling
from datasets import load_dataset
from IPython.display import HTML, display

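device = torch.device('cuda')
model_name = 'Enoch/llama-7b-hf'
tokenizer = LlamaTokenizer.from_pretrained(model_name)  # the tokenizer runs on CPU, so it needs no device_map
tokenizer.pad_token_id = tokenizer.eos_token_id  # reuse EOS as the padding token
model = AutoModelForCausalLM.from_pretrained(model_name, device_map='auto', low_cpu_mem_usage=True, offload_state_dict=True, load_in_4bit=True, torch_dtype=torch.float32)
# Freeze the base model; only the adapters added below are trained
for p in model.parameters():
    p.requires_grad = False
model.gradient_checkpointing_enable()  # trade extra compute for lower memory use
model.enable_input_require_grads()  # keeps gradient checkpointing working with a frozen base model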

class LoRALayer(nn.Module):
    def __init__(self, module: nn.Linear, rank: int):
        super().__init__()
        self.module = module
        self.adapter_A = nn.Parameter(torch.empty(module.in_features, rank, device=module.weight.device))
        nn.init.kaiming_uniform_(self.adapter_A, a=5**0.5)
        self.adapter_B = nn.Parameter(torch.zeros(rank, module.out_features, device=module.weight.device))
    def forward(self, x):
        return self.module(x) + (x @ self.adapter_A @ self.adapter_B)

lora_rank = 8
for layer in model.model.layers:
    layer.self_attn.q_proj = LoRALayer(layer.self_attn.q_proj, lora_rank).to(device)
    layer.self_attn.k_proj = LoRALayer(layer.self_attn.k_proj, lora_rank).to(device)
    layer.self_attn.v_proj = LoRALayer(layer.self_attn.v_proj, lora_rank).to(device)
    layer.mlp.up_proj = LoRALayer(layer.mlp.up_proj, lora_rank).to(device)

original_prompts = ['', 'import', 'from', 'while', 'try', 'if', 'for', 'torch']
def generate_samples(model, prompts):
    samples = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors='pt').to(device)
        with torch.no_grad():
            out = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.8)
        text = tokenizer.decode(out[0], skip_special_tokens=True)
        samples.append(text)
    return samples

before = generate_samples(model, original_prompts)

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


Loading checkpoint shards:   0%|          | 0/33 [00:00<?, ?it/s]
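
To get a feeling for how small the adapters above are, here is a rough count of the trainable LoRA parameters. This is a sketch under assumptions: the sizes (hidden size 4096, MLP intermediate size 11008, 32 layers) are the usual LLaMA-7B configuration and are not read from the loaded checkpoint, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """adapter_A has shape (in_features, rank), adapter_B has shape (rank, out_features)."""
    return rank * (in_features + out_features)


# Assumed LLaMA-7B sizes (not read from the checkpoint)
hidden_size, intermediate_size, num_layers = 4096, 11008, 32
lora_rank = 8

# q_proj, k_proj and v_proj map hidden -> hidden; up_proj maps hidden -> intermediate
per_layer = (
    3 * lora_param_count(hidden_size, hidden_size, lora_rank)
    + lora_param_count(hidden_size, intermediate_size, lora_rank)
)
total = per_layer * num_layers
print(f"{per_layer:,} per layer, {total:,} in total")
```

Under these assumptions the adapters add roughly ten million trainable parameters, a small fraction of the frozen 7B base model.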

In [None]:
!huggingface-cli login



Logging in, picking the right dataset, and getting the right token was hard, but I managed it :)

In [None]:
from datasets import load_dataset

ds = load_dataset("code_search_net", "python", split="train")

README.md:   0%|          | 0.00/12.9k [00:00<?, ?B/s]

code_search_net.py:   0%|          | 0.00/8.44k [00:00<?, ?B/s]

The repository for code_search_net contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/code_search_net.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


python.zip:   0%|          | 0.00/941M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/412178 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/22176 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/23107 [00:00<?, ? examples/s]

In [None]:
print(ds.column_names)
# Or in more detail:
print(ds.features)

['repository_name', 'func_path_in_repository', 'func_name', 'whole_func_string', 'language', 'func_code_string', 'func_code_tokens', 'func_documentation_string', 'func_documentation_tokens', 'split_name', 'func_code_url']
{'repository_name': Value(dtype='string', id=None), 'func_path_in_repository': Value(dtype='string', id=None), 'func_name': Value(dtype='string', id=None), 'whole_func_string': Value(dtype='string', id=None), 'language': Value(dtype='string', id=None), 'func_code_string': Value(dtype='string', id=None), 'func_code_tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'func_documentation_string': Value(dtype='string', id=None), 'func_documentation_tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'split_name': Value(dtype='string', id=None), 'func_code_url': Value(dtype='string', id=None)}


In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForLanguageModeling, Trainer, TrainingArguments
import torch
from IPython.display import display, HTML

# Load the tokenizer (replace it with the one that matches your model)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# The model is assumed to be initialized already
# For example, if you use AutoModelForMaskedLM:
# from transformers import AutoModelForMaskedLM
# model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Load the Python part of "code_search_net", using only 10% of the data as an example
ds = load_dataset("code_search_net", "python", split="train[:10%]")

# Check the available columns
print("Dataset columns:", ds.column_names)
print("Dataset structure:", ds.features)

# Look at an example as a sanity check
print("Example data:")
print(ds[0])

# Updated preprocess function
def preprocess(example):
    text = example['func_code_string'][:512]  # first 512 characters of 'func_code_string'
    return tokenizer(text, truncation=True, max_length=512, padding='max_length')

# Apply preprocess to every example
ds = ds.map(preprocess, batched=False)

# Keep only examples whose input_ids are longer than 1
ds = ds.filter(lambda x: len(x['input_ids']) > 1)

# Create the data collator
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

# Set up the training arguments
training_args = TrainingArguments(
    output_dir="./checkpoints",
    overwrite_output_dir=True,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    max_steps=500,
    fp16=True,
    logging_steps=100,
    save_steps=500,
    report_to="none"
)

# Initialize the model (make sure the model is defined)
# NOTE: this loads bert-base-uncased (a BertLMHeadModel, see the warning in the output).
# It replaces the 4-bit LLaMA model with LoRA adapters created in the first cell.
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("bert-base-uncased")

# Cast the model parameters to float32 (if needed)
for p in model.parameters():
    if p.requires_grad:
        p.data = p.data.float()

# Set up the optimizer (assumes LoRALayer is used)
# Make sure LoRALayer is defined and used in your model
# For example:
# from lora import LoRALayer
# opt_params = []
# for n, m in model.named_modules():
#     if isinstance(m, LoRALayer):
#         opt_params.extend([m.adapter_A, m.adapter_B])
# optimizer = torch.optim.Adam(opt_params, lr=2e-4)

# If you are not using LoRALayer, you can use a standard optimizer over all parameters:
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)

# Create the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=ds,
    data_collator=data_collator,
    optimizers=(optimizer, None)
)

# Start training
trainer.train()

# Generate samples (assumes generate_samples is defined)
# Make sure generate_samples and original_prompts are defined
# For example:
# original_prompts = ["def example_function():\n", "def another_function(x):\n"]
# after = generate_samples(model, original_prompts)

# Build a table to display the results
# row_template = '''<tr>
#     <td style="width:20%; border:1px solid black"><pre align="left">`{}`</pre></td>
#     <td style="width:40%; border:1px solid black"><pre align="left">{}</pre></td>
#     <td style="width:40%; border:1px solid black"><pre align="left">{}</pre></td>
# </tr>'''
# rows = []
# for p, b, a in zip(original_prompts, before, after):
#     rows.append(row_template.format(p, b, a))
# table_template = """<table style="border:1px solid black">
#   <tr>
#     <th style="text-align: center; border:1px solid black">PROMPT</th>
#     <th style="text-align: center; border:1px solid black">BEFORE</th>
#     <th style="text-align: center; border:1px solid black">AFTER</th>
#   </tr>
#   {}
# </table>"""
# display(HTML(table_template.format('\n'.join(rows))))


Dataset columns: ['repository_name', 'func_path_in_repository', 'func_name', 'whole_func_string', 'language', 'func_code_string', 'func_code_tokens', 'func_documentation_string', 'func_documentation_tokens', 'split_name', 'func_code_url']
Dataset structure: {'repository_name': Value(dtype='string', id=None), 'func_path_in_repository': Value(dtype='string', id=None), 'func_name': Value(dtype='string', id=None), 'whole_func_string': Value(dtype='string', id=None), 'language': Value(dtype='string', id=None), 'func_code_string': Value(dtype='string', id=None), 'func_code_tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'func_documentation_string': Value(dtype='string', id=None), 'func_documentation_tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'split_name': Value(dtype='string', id=None), 'func_code_url': Value(dtype='string', id=None)}
Example data:
{'repository_name': 'ArabellaTech/django-basic-cms', 'func_path_in_reposito

Map:   0%|          | 0/41218 [00:00<?, ? examples/s]

Filter:   0%|          | 0/41218 [00:00<?, ? examples/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`
max_steps is given, it will override any value given in num_train_epochs


Step,Training Loss
100,4.7115
200,1.2423
300,0.6209
400,0.4051
500,0.3324


TrainOutput(global_step=500, training_loss=1.462441505432129, metrics={'train_runtime': 142.0188, 'train_samples_per_second': 14.083, 'train_steps_per_second': 3.521, 'total_flos': 526409625600000.0, 'train_loss': 1.462441505432129, 'epoch': 0.04852249017419574})
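
The numbers in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below assumes a single device and takes the dataset size (41218 examples) from the `Map` progress bar above; all other values are copied from `TrainingArguments` and the output.

```python
# Values copied from TrainingArguments and the TrainOutput above
max_steps = 500
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_length = 512            # every example is padded to this many tokens
dataset_size = 41218        # from the Map progress bar
train_runtime = 142.0188    # seconds

# Assumes one device, so the effective batch is batch size x accumulation steps
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
tokens_per_update = effective_batch_size * max_length
samples_seen = max_steps * effective_batch_size

epoch_fraction = samples_seen / dataset_size
samples_per_second = samples_seen / train_runtime
steps_per_second = max_steps / train_runtime

print(effective_batch_size, tokens_per_update, samples_seen)
print(round(epoch_fraction, 4), round(samples_per_second, 3), round(steps_per_second, 3))
```

The run therefore saw 2000 examples, which is under 5% of one pass over the 10% subset.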

## Small conclusions
The dataset used in this experiment provides a rich and structured collection of Python code, including full function definitions, corresponding documentation, and metadata such as repository names and file paths. Key fields like `func_code_string` and `func_documentation_string` enable a variety of tasks, such as code generation and documentation prediction. The availability of tokenized representations in `func_code_tokens` further facilitates preprocessing and model training.

The training loss dropped from 4.7115 at step 100 to 0.3324 at step 500; the reported `training_loss` of 1.4624 is the average over the whole run, not the final value. This suggests that the model adapted to the dataset even within a limited number of training steps. Gradient accumulation over 4 steps gave an effective batch size of 4 examples per optimizer update.
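
As a quick check of the loss figures, the sketch below averages the five logged values. It assumes that each logged loss is the mean over the preceding 100 steps, which is how `Trainer` reports it as far as I know; the variable names are illustrative.

```python
import math

# Training loss logged every 100 steps (copied from the table above)
logged_losses = {100: 4.7115, 200: 1.2423, 300: 0.6209, 400: 0.4051, 500: 0.3324}

# If every logged value is the mean over its 100-step window, the mean of
# the five values is the mean loss over all 500 steps.
mean_loss = sum(logged_losses.values()) / len(logged_losses)

# Perplexity that corresponds to the last logged window
last_window_perplexity = math.exp(logged_losses[500])

print(f"mean loss over the run: {mean_loss:.4f}")
print(f"perplexity in the last window: {last_window_perplexity:.3f}")
```

The mean agrees with the `training_loss` reported in `TrainOutput`, so that number describes the whole run rather than the end of training.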

Moving forward, further fine-tuning with more data or additional steps could improve the model's ability to generalize. Comparing generated samples before and after training should show how much progress the model made in generating accurate and relevant Python code. Integrating trainable LoRA layers and optimizing for documentation prediction tasks could also unlock new capabilities, making the model more versatile in real-world scenarios.

If you reach this: congratulations! You've completed everything in this practice session.

If you want to dig deeper, try to implement prompt-tuning (for bonus points!).
You can read more about prompt tuning variants in paper [1](https://arxiv.org/abs/2104.08691) or paper [2](https://arxiv.org/abs/2101.00190). Both versions can be implemented by passing trainable prompts as `model.forward(..., past_key_values=your_prompts)`.
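
A minimal sketch of the simpler, input-level variant of prompt tuning is shown below. It prepends trainable vectors to the word embeddings instead of going through `past_key_values`, so it is not the exact route described above. The class name `PromptTunedEmbedding`, the toy sizes, and the initialization scale are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class PromptTunedEmbedding(nn.Module):
    """Wraps word embeddings and prepends `num_prompts` trainable vectors."""

    def __init__(self, word_embeddings: nn.Embedding, num_prompts: int):
        super().__init__()
        self.word_embeddings = word_embeddings
        # Small random init; the scale 0.02 is an arbitrary choice for this sketch
        self.prompts = nn.Parameter(
            torch.randn(num_prompts, word_embeddings.embedding_dim) * 0.02
        )

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        embs = self.word_embeddings(input_ids)                    # [batch, seq, dim]
        prompts = self.prompts.unsqueeze(0).expand(embs.shape[0], -1, -1)
        return torch.cat([prompts, embs], dim=1)                  # [batch, prompts + seq, dim]


# Toy usage with a hypothetical vocabulary of 100 tokens
embeddings = nn.Embedding(100, 16)
embeddings.weight.requires_grad_(False)    # keep the original embeddings frozen
layer = PromptTunedEmbedding(embeddings, num_prompts=5)
out = layer(torch.randint(0, 100, (2, 7)))
trainable = [name for name, p in layer.named_parameters() if p.requires_grad]
print(out.shape, trainable)
```

In a real model the attention mask and the labels would also have to be extended by `num_prompts` positions, which this sketch leaves out.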



### Read more

* How post-training quantization works: https://arxiv.org/abs/2208.07339
* An overview of running large models: https://huggingface.co/docs/accelerate/package_reference/big_modeling
* A general library for different adapter types: https://adapterhub.ml/


### [extra info] Running other models.

This notebook's code can run with other models of similar size, such as [Falcon-7B](https://huggingface.co/tiiuae/falcon-7b), [OPT-6.7B](https://huggingface.co/facebook/opt-6.7b) or [BLOOM-7.1B](https://huggingface.co/bigscience/bloom-7b1). However, they will require minor code tweaks:
1. change the model name in `AutoModelForCausalLM.from_pretrained()` __and__ `AutoTokenizer`
2. In the prompt tuning code, change `model.model.embed_tokens` to refer to the target model's word embeddings. Simply `print(model)` to navigate to them.
3. Change the code that adds LoRA layers, specifically where it refers to the transformer block components, since those components now have different names.
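
One way to make step 3 less model-specific is to wrap layers by name suffix instead of hard-coding attribute paths. The sketch below is an assumption-laden illustration: `add_lora_by_suffix` and `ToyBlock` are hypothetical, the `LoRALayer` is a device-agnostic copy of the one defined earlier, and the suffix `query_key_value` is only a placeholder, so check `print(model)` for the real names in your model.

```python
import torch
import torch.nn as nn


class LoRALayer(nn.Module):
    """Device-agnostic copy of the LoRALayer defined earlier in this notebook."""

    def __init__(self, module: nn.Linear, rank: int):
        super().__init__()
        self.module = module
        self.adapter_A = nn.Parameter(torch.empty(module.in_features, rank))
        nn.init.kaiming_uniform_(self.adapter_A, a=5 ** 0.5)
        self.adapter_B = nn.Parameter(torch.zeros(rank, module.out_features))

    def forward(self, x):
        return self.module(x) + (x @ self.adapter_A @ self.adapter_B)


def add_lora_by_suffix(model: nn.Module, suffixes: tuple, rank: int) -> list:
    """Wrap every nn.Linear whose qualified name ends with one of `suffixes`."""
    # Collect the names first so the module tree is not modified while iterating
    targets = [
        name for name, module in model.named_modules()
        if isinstance(module, nn.Linear) and name.endswith(suffixes)
    ]
    for name in targets:
        parent_name, _, child_name = name.rpartition(".")
        parent = model.get_submodule(parent_name) if parent_name else model
        setattr(parent, child_name, LoRALayer(getattr(parent, child_name), rank))
    return targets


class ToyBlock(nn.Module):
    """Hypothetical stand-in for a transformer block with differently named parts."""

    def __init__(self):
        super().__init__()
        self.query_key_value = nn.Linear(16, 48)
        self.dense = nn.Linear(16, 16)


toy = nn.ModuleDict({"h": nn.ModuleList([ToyBlock(), ToyBlock()])})
wrapped = add_lora_by_suffix(toy, ("query_key_value",), rank=4)
print(wrapped)
```

Because `adapter_B` starts at zero, a freshly wrapped layer returns the same output as the original one, so wrapping does not change the model until training begins.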

# Thanks for the beautiful course and good luck!