# Low-Rank Adaptation (LoRA) Fine-Tuning

In [1]:
# Install the libraries used in this notebook
%pip install transformers datasets accelerate peft torch



## Load Model & Tokenizer

In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# GPT-2 ships without a padding token, so reuse EOS for padding
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(model_name)


## Add LoRA

In [3]:
from peft import LoraConfig, get_peft_model

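lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=32,              # the LoRA update is scaled by lora_alpha / r = 4
    target_modules=["c_attn"],  # GPT-2's fused query/key/value projection
    lora_dropout=0.1,           # dropout on the LoRA branch during training
    bias="none",                # leave bias terms frozen
    task_type="CAUSAL_LM"
)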

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()


trainable params: 294,912 || all params: 124,734,720 || trainable%: 0.2364
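
The trainable-parameter count printed above can be checked by hand. The sketch below assumes the standard GPT-2 small shapes (12 transformer blocks, hidden size 768, and a fused `c_attn` projection from 768 to 2304 features); those shapes are not stated in this notebook, and the variable names are illustrative only.

```python
# Assumed GPT-2 small shapes (not stated in the notebook).
n_layers = 12      # transformer blocks, each with one c_attn module
d_in = 768         # hidden size fed into c_attn
d_out = 3 * 768    # c_attn emits query, key and value side by side

# Values taken from the LoraConfig in this notebook.
r = 8
lora_alpha = 32

# Each adapted module gains two small matrices: A (r x d_in) and B (d_out x r).
params_per_layer = r * d_in + d_out * r
trainable = n_layers * params_per_layer

# Total reported by print_trainable_parameters() above.
all_params = 124_734_720
trainable_pct = 100 * trainable / all_params

# Factor applied to the low-rank update B @ A.
scaling = lora_alpha / r

print(trainable)                 # 294912
print(round(trainable_pct, 4))   # 0.2364
print(scaling)                   # 4.0
```

Under these assumptions the numbers match the printed summary, which suggests that only the `c_attn` adapters are being trained.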




## Load TXT dataset

In [4]:
from datasets import load_dataset

# The "text" loader yields one example per line of the file
dataset = load_dataset(
    "text",
    data_files={"train": "/content/teacher_student_5000.txt"}
)
print(dataset)



DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 5000
    })
})


## Tokenize the data

In [8]:
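def tokenize(batch):
    # With batched=True, batch["text"] is a list of strings
    tokens = tokenizer(
        batch["text"],
        truncation=True,
        padding="max_length",   # pad every example to exactly max_length tokens
        max_length=256
    )
    # For causal LM training the labels are the input ids; the model shifts them internally.
    # Note: padded positions are kept in the labels, and pad == EOS here,
    # so the loss is also computed on the padding.
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

tokenized_ds = dataset.map(
    tokenize,
    batched=True,
    remove_columns=["text"]   # drop the raw text; only token columns remain
)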


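
Because every example is padded to 256 tokens and the pad token is EOS, the labels built by `tokenize` include long runs of EOS, and the loss is computed on them as well. A common alternative is to set the label to `-100` wherever the attention mask is 0, since that value is ignored by the cross-entropy loss. The helper below is a hypothetical sketch of that idea, not part of the original notebook, and the non-EOS token ids in the toy batch are arbitrary.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips by default


def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]


# Toy batch: 50256 is GPT-2's EOS id, which this notebook reuses as the pad token.
batch = {
    "input_ids": [[464, 3290, 50256, 50256], [15496, 995, 0, 50256]],
    "attention_mask": [[1, 1, 0, 0], [1, 1, 1, 0]],
}
labels = mask_padding_labels(batch["input_ids"], batch["attention_mask"])
print(labels)
# [[464, 3290, -100, -100], [15496, 995, 0, -100]]
```

Inside `tokenize`, this would replace the `.copy()` line with `tokens["labels"] = mask_padding_labels(tokens["input_ids"], tokens["attention_mask"])`. Two caveats: loss values from such a run would not be comparable to the ones recorded in this notebook, and with all padding masked the model never sees EOS as a target unless an EOS token is appended to each example explicitly.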

## Setting Training Arguments and Training the model

In [9]:
from transformers import TrainingArguments, Trainer

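training_args = TrainingArguments(
    output_dir="./lora-gpt2",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,   # effective batch size: 2 * 4 = 8 per device
    learning_rate=2e-4,
    num_train_epochs=3,
    fp16=True,                       # mixed precision; needs a GPU runtime
    logging_steps=20,                # log the training loss every 20 optimizer steps
    save_strategy="epoch",           # write a checkpoint at the end of each epoch
    report_to="none"                 # disable external experiment loggers
)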

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_ds["train"]
)

trainer.train()



| Step | Training Loss |
|------|---------------|
| 20   | 7.9976 |
| 40   | 2.6152 |
| 60   | 0.2019 |
| 80   | 0.15   |
| 100  | 0.1221 |
| 120  | 0.1067 |
| 140  | 0.0943 |
| 160  | 0.0904 |
| 180  | 0.0807 |
| 200  | 0.077  |


TrainOutput(global_step=1875, training_loss=0.14482476240793865, metrics={'train_runtime': 438.2456, 'train_samples_per_second': 34.227, 'train_steps_per_second': 4.278, 'total_flos': 1966485012480000.0, 'train_loss': 0.14482476240793865, 'epoch': 3.0})
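
The `global_step` and throughput figures in `TrainOutput` follow from the dataset size and the training arguments. The sketch below reproduces them, assuming a single training device (`num_devices = 1` is an assumption; the notebook does not state it) and a dataset size that divides evenly by the effective batch size, as it does here.

```python
import math

# Values taken from this notebook.
num_examples = 5000
per_device_batch = 2
grad_accum = 4
epochs = 3
train_runtime = 438.2456   # seconds, from TrainOutput

# Assumption: a single GPU, as is typical for a Colab session.
num_devices = 1

# One optimizer step consumes this many examples.
effective_batch = per_device_batch * grad_accum * num_devices
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * epochs

samples_per_second = num_examples * epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(effective_batch)                # 8
print(total_steps)                    # 1875
print(round(samples_per_second, 3))   # 34.227
print(round(steps_per_second, 3))     # 4.278
```

With `logging_steps=20`, a full run of 1875 steps produces far more log rows than the ten shown in the table above, which covers only the first 200 steps.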

## Saving the tokenizer and model

In [10]:
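# On a PEFT-wrapped model, save_pretrained writes only the LoRA adapter
# weights and config, not the full GPT-2 checkpoint
model.save_pretrained("lora-adapter")
tokenizer.save_pretrained("lora-adapter")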

('lora-adapter/tokenizer_config.json',
 'lora-adapter/special_tokens_map.json',
 'lora-adapter/vocab.json',
 'lora-adapter/merges.txt',
 'lora-adapter/added_tokens.json',
 'lora-adapter/tokenizer.json')

## Inference Time

In [19]:
import torch

prompt = (
    "Teacher: Today we will revise probability.\n"
    "Student: "
)

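device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()  # switch off LoRA dropout, which is still active after training

inputs = tokenizer(prompt, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=60,
        do_sample=True,
        temperature=0.6,         # below 1.0 sharpens the distribution for more stable answers
        top_p=0.9,               # nucleus sampling: keep the top 90% of probability mass
        repetition_penalty=1.1,  # mildly discourage repeated tokens
        eos_token_id=None,       # intended to disable stopping at EOS
        bad_words_ids=[[tokenizer.eos_token_id]],  # forbid sampling EOS, so generation runs to max_new_tokens
        pad_token_id=tokenizer.eos_token_id
    )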

print(tokenizer.decode(outputs[0], skip_special_tokens=True))


Teacher: Today we will revise probability.
Student: �� Will this topic be assigned more exam questions? Won't this topic ever asked exams? Teacher ? Comments are available at online or in the classroom. Stay educated. This topic is important for students to ask these question. Now that you will be asked by a lot. Next step homework. Until next
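
The `temperature` and `top_p` arguments passed to `generate` reshape the next-token distribution before a token is drawn. The pure-Python sketch below illustrates the idea on a toy list of logits; `nucleus_filter` is a hypothetical helper written for this illustration and is not the `transformers` implementation, which may differ in details such as tie handling.

```python
import math


def nucleus_filter(logits, temperature=0.6, top_p=0.9):
    """Return {token_index: probability} for the tokens kept by top-p filtering."""
    # Temperature scaling: dividing by a value below 1.0 sharpens the distribution.
    scaled = [x / temperature for x in logits]

    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most probable tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize over the kept tokens so they form a distribution again.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.0, -5.0]
filtered = nucleus_filter(toy_logits, temperature=1.0, top_p=0.9)
print(sorted(filtered))  # [0, 1]
```

In this toy case the two most likely tokens already cover more than 90% of the probability mass, so the other two can never be sampled. Lowering `temperature` concentrates even more mass on the top token.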
