<a href="https://colab.research.google.com/github/RoboMaroof/LLM-Applications-Building-Blocks/blob/main/04_Finetuning/03_KD_Closed_source_LLMs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

https://towardsdatascience.com/clone-the-abilities-of-powerful-llms-into-small-local-models-using-knowledge-distillation-12e954d256c2

https://github.com/CVxTz/distill-llm

https://github.com/CVxTz/knowledge_distillation

https://pytorch.org/tutorials/beginner/knowledge_distillation_tutorial.html

https://towardsdatascience.com/build-powerful-lightweight-models-using-knowledge-distillation-618f69b569d9

# Distillation Workflow

1. Acquire unlabeled in-domain data.
2. Craft a prompt to extract pseudo-labels from the teacher model through Anyscale's API.
3. Fine-tune the student model on these pseudo-labels using LoRA with PEFT.

# Data

In [None]:
from datasets import load_dataset
data = load_dataset("juancavallotti/multilingual-gec", split="train")

The dataset's labels are used only for evaluation, not for training.
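
One simple way to use the held-out labels is an exact-match score between model outputs and the reference corrections. The original does not say which metric it uses, so this is only an illustrative sketch; `exact_match` and the example strings are hypothetical.

```python
def exact_match(predictions, references):
    """Fraction of predictions identical to their reference after trimming whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


# Hypothetical model outputs and reference corrections, for illustration only.
preds = ["I looked at the schedule.", "Nadie dise eso."]
refs = ["I looked at the schedule.", "Nadie dice eso."]
score = exact_match(preds, refs)
print(score)
```

Exact match is strict: a correction that differs from the reference by a single character counts as a miss, so it is best read as a lower bound on quality.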

# Teacher Model

**Teacher Model: Llama 2 70B**
- Produces the pseudo-labels used for training.
- A powerful LLM hosted in the cloud as a pay-per-use API.

Options:
- OpenAI
- Anthropic
- Anyscale (used here)

Generating pseudo-labels for around 5,000 samples costs about $1.20 through Anyscale.

In [None]:
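import os

from openai import OpenAI

BASE_URL = "https://api.endpoints.anyscale.com/v1"
BASE_MODEL = "meta-llama/Llama-2-70b-chat-hf"

# API_KEY is not defined in the original; reading it from an environment
# variable (the variable name is an assumption) is one option.
API_KEY = os.environ["ANYSCALE_API_KEY"]

# The Anyscale endpoint is OpenAI-compatible, so the OpenAI client is reused.
BASE_CLIENT = OpenAI(base_url=BASE_URL, api_key=API_KEY)


def process_call(prompt):
    """Send one prompt to the teacher model and return its stripped completion."""
    completion = BASE_CLIENT.completions.create(
        model=BASE_MODEL,
        prompt=prompt,
        max_tokens=100,
        temperature=0,  # deterministic output for reproducible pseudo-labels
    )
    result = completion.model_dump()

    return result["choices"][0]["text"].strip()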

## Prompt

In [None]:
<s>[INST]
Your role is to correct all grammatical errors in the input text. Only answer with the corrected text and nothing else.

Text: Il est très importante de parler une langue étrangère.
[/INST]
Output: Il est très important de parler une langue étrangère.</s>
[INST]
Text: Nadie dise ezo.
[/INST]
Output: Nadie dice eso.</s>
[INST]
Text: What is your favorite part of being a member of SWE RMS?
[/INST]
Output: What is your favorite part of being a member of SWE RMS?</s>
[INST]
Text: I looked, at the schedule.
[/INST]
Output: I looked at the schedule.</s>
[INST]
Text: $text
[/INST]
Output:
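
The `$text` placeholder suggests the prompt is filled in with Python's `string.Template`. A minimal sketch of that step follows, assuming the template is held in a string and the labelling function (for example `process_call` above) is passed in. `PROMPT_TEMPLATE` (shortened here to a single few-shot example), `build_prompt`, `pseudo_label`, and `stub_teacher` are hypothetical names used only for illustration.

```python
from string import Template

# Shortened version of the few-shot prompt above (one example kept for brevity).
PROMPT_TEMPLATE = Template(
    "<s>[INST]\n"
    "Your role is to correct all grammatical errors in the input text. "
    "Only answer with the corrected text and nothing else.\n\n"
    "Text: I looked, at the schedule.\n"
    "[/INST]\n"
    "Output: I looked at the schedule.</s>\n"
    "[INST]\n"
    "Text: $text\n"
    "[/INST]\n"
    "Output:"
)


def build_prompt(text):
    """Insert one unlabeled sentence into the few-shot prompt."""
    # safe_substitute leaves any unmatched '$' placeholder in the template as-is.
    return PROMPT_TEMPLATE.safe_substitute(text=text)


def pseudo_label(sentences, teacher):
    """Pair each input sentence with the teacher's output for it."""
    return [
        {"input": s, "pseudo_label": teacher(build_prompt(s))}
        for s in sentences
    ]


def stub_teacher(prompt):
    # Offline stand-in for the teacher API: it simply echoes the input sentence.
    return prompt.rsplit("Text: ", 1)[1].split("\n", 1)[0]


records = pseudo_label(["She go to school."], stub_teacher)
print(records)
```

In a real run, `stub_teacher` would be swapped for the function that calls the hosted teacher model, and the resulting records would form the student's training set.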

# Student Model

**Student Model: TinyLlama-1.1B**

- Fine-tuned on the grammar correction task using the pseudo-labels from the teacher model.
- Efficient despite its small scale.
- Can run on consumer GPUs with just a few gigabytes of memory.
- Can be run as a Hugging Face pipeline.

BitsAndBytes is used here for quantization on the GPU, which reduces the memory required to run the LLM.
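
As a rough back-of-the-envelope check on the memory savings, the weight footprint scales with the number of bits stored per parameter. The sketch below assumes about 1.1 billion parameters and counts weights only; it ignores quantization constants, any layers left unquantized, activations, and the KV cache, so real usage will be higher. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for the model weights alone, in gigabytes (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 1.1e9  # approximate parameter count (assumption)

fp16_gb = weight_memory_gb(N_PARAMS, 16)  # half-precision weights
nf4_gb = weight_memory_gb(N_PARAMS, 4)    # 4-bit quantized weights

print(f"fp16 weights:  {fp16_gb:.2f} GB")
print(f"4-bit weights: {nf4_gb:.2f} GB")
```

Under these assumptions, 4-bit storage needs a quarter of the memory that 16-bit storage does for the same weights.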

In [None]:
import torch  # needed for torch.float16 below
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    pipeline,
)

base_model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# Tokenizer
llama_tokenizer = AutoTokenizer.from_pretrained(
    base_model_name, trust_remote_code=True
)
llama_tokenizer.padding_side = "right"

# 4-bit NF4 quantization; computations run in float16
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False,
)
# Model, loaded quantized onto GPU 0
model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=quant_config,
    device_map={"": 0},
)

# Greedy text-generation pipeline that returns only the newly generated text
text_gen = pipeline(
    task="text-generation",
    model=model,
    tokenizer=llama_tokenizer,
    max_new_tokens=256,
    do_sample=False,
    return_full_text=False,
)

In [None]:
print(text_gen("Hello ! Who are you ?"))

# LoRA Finetuning

In [None]:
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import SFTTrainer


if __name__ == "__main__":
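    # ... (setup elided in the original; it must define base_model,
    # training_data, collator, llama_tokenizer, BASE_PATH and EPOCHS)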
    peft_parameters = LoraConfig(
        lora_alpha=8,
        lora_dropout=0.1,
        r=8,
        bias="none",
        task_type="CAUSAL_LM",
        # target_modules=target_modules,
    )

    base_model = prepare_model_for_kbit_training(base_model)
    base_model = get_peft_model(base_model, peft_parameters)

    # Training Params
    train_params = TrainingArguments(
        output_dir=str(BASE_PATH / "results_modified"),
        num_train_epochs=EPOCHS,
        per_device_train_batch_size=8,
        gradient_accumulation_steps=1,
        optim="paged_adamw_32bit",
        save_steps=len(training_data) // 10,
        logging_steps=len(training_data) // 100,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        warmup_steps=100,
        weight_decay=0.05,
        fp16=True,
        max_steps=-1,
        group_by_length=False,
        max_grad_norm=0.3,
    )
    # Trainer
    fine_tuning = SFTTrainer(
        model=base_model,
        train_dataset=training_data,
        data_collator=collator,
        peft_config=peft_parameters,
        dataset_text_field="Why is this mandatory ?",
        tokenizer=llama_tokenizer,
        args=train_params,
        max_seq_length=llama_tokenizer.model_max_length,
    )

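    # print_trainable_parameters() prints its own summary and returns None,
    # so wrapping it in print() would only add a stray "None" line.
    fine_tuning.model.print_trainable_parameters()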
    # Training
    fine_tuning.train()
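
To get a feel for how few parameters LoRA trains with `r=8`, the count can be worked out by hand: for a linear layer with `d_in` inputs and `d_out` outputs, LoRA adds two low-rank matrices with `r * (d_in + d_out)` parameters in total. The sketch below is illustrative only. The layer shapes (hidden size 2048, 22 layers, a 2048-to-256 value projection) and the assumption that only `q_proj` and `v_proj` are adapted when `target_modules` is left unset are not stated in the original, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one linear layer.

    LoRA learns A (r x d_in) and B (d_out x r), so the count is r * (d_in + d_out).
    """
    return r * (d_in + d_out)


R = 8           # rank, as in the LoraConfig above
LORA_ALPHA = 8  # as in the LoraConfig above
scaling = LORA_ALPHA / R  # the low-rank update is scaled by alpha / r

# Assumed shapes for the student model (not given in the original).
HIDDEN = 2048
KV_DIM = 256
N_LAYERS = 22

per_layer = (
    lora_param_count(HIDDEN, HIDDEN, R)    # q_proj (assumed target)
    + lora_param_count(HIDDEN, KV_DIM, R)  # v_proj (assumed target)
)
total = N_LAYERS * per_layer

print(f"scaling factor: {scaling}")
print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
```

Under these assumptions the adapters hold roughly a million parameters, a tiny fraction of the 1.1B-parameter base model, which is why this kind of fine-tuning fits on a single consumer GPU.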