- The %%capture cell magic suppresses the output of the cell in a Jupyter notebook.
- The %pip magic command installs Python packages from within a Jupyter notebook; here it installs accelerate, peft, bitsandbytes, transformers, and trl.
- The packages are installed into the Python environment of the kernel running the notebook.

In [None]:
%%capture
%pip install accelerate peft bitsandbytes transformers trl
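
Because %%capture hides the installer output, a failed install would go unnoticed. A minimal sketch for checking the result, using only the standard library; the helper name `installed_versions` is hypothetical and not part of the original notebook:

```python
from importlib import metadata

# Packages installed by the %pip cell above.
REQUIRED = ["accelerate", "peft", "bitsandbytes", "transformers", "trl"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as missing would need the install cell to be re-run without %%capture so the error message is visible.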

In [None]:
# pip install datasets transformers torch peft trl
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig
from trl import SFTTrainer

In [None]:
# Model from Hugging Face hub
base_model = "meta-llama/Llama-2-7b-hf"
# Fine-tuned model
new_model = "llama-2-7b-dockerfile-generation"
# Load the dataset
dataset = load_dataset("mlabonne/guanaco-llama2-1k", split="train")

Downloading readme:   0%|          | 0.00/1.02k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/967k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [None]:
# 4-bit quantization with NF4 type configuration using BitsAndBytes
compute_dtype = torch.float16
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=False,
)

In [None]:
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=quant_config,
    device_map={"": 0}
)
model.config.use_cache = False
model.config.pretraining_tp = 1

config.json:   0%|          | 0.00/609 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]
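
The effect of 4-bit loading on weight memory can be estimated from the shard sizes shown in the download output above. This is a rough sketch, assuming the checkpoint stores its weights in 16-bit precision and ignoring quantization constants and any layers that stay unquantized, so the 4-bit figure is a lower bound rather than a measurement:

```python
# Shard sizes from the download output above, in GB.
shard_sizes_gb = [9.98, 3.50]
BYTES_PER_FP16_PARAM = 2  # assumption: the checkpoint is stored in 16-bit precision

# Approximate parameter count implied by the checkpoint size.
n_params = sum(shard_sizes_gb) * 1e9 / BYTES_PER_FP16_PARAM


def weight_memory_gb(n_params, bits_per_param):
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


print(f"parameters:    ~{n_params / 1e9:.2f}B")
print(f"16-bit weights: ~{weight_memory_gb(n_params, 16):.2f} GB")
print(f"4-bit weights:  ~{weight_memory_gb(n_params, 4):.2f} GB")
```

Activations, gradients for the adapter weights, and optimizer state come on top of this estimate during training.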

In [None]:
# Load the tokenizer from Hugging Face and set padding_side to "right" to avoid issues with fp16 training
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

tokenizer_config.json:   0%|          | 0.00/776 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]
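
What right-side padding with the end-of-sequence token means for a batch can be illustrated in plain Python. This is only a sketch of the idea, not the tokenizer's implementation; the helper `pad_right`, the token ids in `batch`, and the value of `EOS_TOKEN_ID` are assumptions made for illustration:

```python
EOS_TOKEN_ID = 2  # assumed id of the end-of-sequence token, reused as the pad token


def pad_right(sequences, pad_id):
    """Pad every sequence on the right to the length of the longest one.

    Returns the padded sequences and matching attention masks
    (1 = real token, 0 = padding).
    """
    max_len = max(len(seq) for seq in sequences)
    padded, masks = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [pad_id] * n_pad)
        masks.append([1] * len(seq) + [0] * n_pad)
    return padded, masks


# Hypothetical token ids, for illustration only.
batch = [[1, 306, 4966], [1, 306]]
padded, masks = pad_right(batch, EOS_TOKEN_ID)
print(padded)
print(masks)
```

With padding on the right, the real tokens of every sequence start at position 0, and the attention mask marks the trailing pad positions so they can be ignored.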

List of hyperparameters that can be used to tune the training process (the values used in this notebook are set in the training cell below):

- **output_dir**: Directory where model predictions and checkpoints are stored.
- **num_train_epochs**: Number of training epochs (one in this notebook).
- **fp16/bf16**: Enable fp16/bf16 mixed-precision training (both are disabled in this notebook).
- **per_device_train_batch_size**: Batch size per GPU for training.
- **per_device_eval_batch_size**: Batch size per GPU for evaluation.
- **gradient_accumulation_steps**: Number of steps over which gradients are accumulated before each optimizer update.
- **gradient_checkpointing**: Whether to enable gradient checkpointing, which saves memory at the cost of a slower backward pass.
- **max_grad_norm**: Maximum gradient norm used for gradient clipping.
- **learning_rate**: Initial learning rate.
- **weight_decay**: Weight decay applied to all layers except bias/LayerNorm weights.
- **optim**: Optimizer to use (a paged AdamW optimizer in this notebook).
- **lr_scheduler_type**: Learning rate schedule.
- **max_steps**: Total number of training steps; -1 means the number of steps is derived from num_train_epochs.
- **warmup_ratio**: Fraction of the total training steps used for a linear warmup.
- **group_by_length**: Groups samples of similar length into the same batch, which reduces padding and can speed up training.
- **save_steps**: Number of update steps between checkpoint saves (25 in this notebook).
- **logging_steps**: Number of update steps between log entries (25 in this notebook).
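
How several of these settings interact can be sketched with plain arithmetic. The example uses the 1,000-example dataset loaded earlier and the values this notebook passes to TrainingArguments; the helper `schedule_summary` is hypothetical, and the step counts are an approximation of what the trainer computes, assuming a single GPU:

```python
import math


def schedule_summary(num_examples, per_device_batch_size, grad_accum_steps,
                     num_epochs, warmup_ratio, save_steps, num_devices=1):
    """Derive step counts from the batch and epoch settings (rough sketch)."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    total_updates = updates_per_epoch * num_epochs
    return {
        "effective_batch_size": effective_batch,
        "total_update_steps": total_updates,
        "warmup_steps": math.ceil(total_updates * warmup_ratio),
        "checkpoints_saved": total_updates // save_steps,
    }


summary = schedule_summary(
    num_examples=1000, per_device_batch_size=4, grad_accum_steps=1,
    num_epochs=1, warmup_ratio=0.03, save_steps=25,
)
print(summary)
```

Under these assumptions one epoch is 250 update steps, so a checkpoint and a log entry are written ten times. Whether the computed warmup steps actually take effect depends on the schedule: a plain constant schedule most likely ignores them.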

In [None]:
peft_params = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)

training_params = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=25,
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
    report_to="tensorboard"
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_params,
    dataset_text_field="text",
    max_seq_length=None,
    tokenizer=tokenizer,
    args=training_params,
    packing=False,
)



Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
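
The LoraConfig above determines how many parameters are actually trained. A back-of-the-envelope sketch follows; the hidden size, layer count, and the choice of two adapted projections per layer are assumptions about the base model and the library defaults, not values stated in this notebook, and `lora_trainable_params` is a hypothetical helper:

```python
def lora_trainable_params(d_in, d_out, r, n_modules_per_layer, n_layers):
    """Parameters added by LoRA: each adapted (d_out x d_in) weight gets
    two low-rank factors, A (r x d_in) and B (d_out x r)."""
    per_module = r * (d_in + d_out)
    return per_module * n_modules_per_layer * n_layers


# Values from LoraConfig in this notebook.
r, lora_alpha = 64, 16

# Assumptions (not stated in the notebook): hidden size 4096, 32 decoder
# layers, and two adapted projections per layer (query and value).
HIDDEN_SIZE, N_LAYERS, MODULES_PER_LAYER = 4096, 32, 2

trainable = lora_trainable_params(HIDDEN_SIZE, HIDDEN_SIZE, r, MODULES_PER_LAYER, N_LAYERS)
scaling = lora_alpha / r  # factor applied to the low-rank update

print(f"trainable LoRA parameters: {trainable:,}")
print(f"LoRA scaling (alpha / r): {scaling}")
```

Under these assumptions the adapters add roughly 33.5 million trainable parameters, a small fraction of the base model, while the quantized base weights stay frozen.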


In [None]:
# Train the model
trainer.train()

OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 19.06 MiB is free. Process 2248 has 14.73 GiB memory in use. Of the allocated memory 14.36 GiB is allocated by PyTorch, and 247.83 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
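
The error above means the GPU ran out of memory during training. One possible mitigation, using only hyperparameters from the list earlier in this notebook, is to shrink the per-step batch while keeping the effective batch size unchanged. The override values below are hypothetical and untested on this GPU:

```python
# Values used for TrainingArguments in this notebook.
current = {
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 1,
    "gradient_checkpointing": False,  # not set in the notebook; assumed default
}

# Hypothetical memory-saving overrides: smaller micro-batches, more
# accumulation steps, and recomputation of activations in the backward pass.
overrides = {
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 4,
    "gradient_checkpointing": True,
}


def effective_batch_size(settings):
    """Examples contributing to each optimizer update on a single GPU."""
    return settings["per_device_train_batch_size"] * settings["gradient_accumulation_steps"]


lower_memory = {**current, **overrides}
print(lower_memory)
print("effective batch size:", effective_batch_size(lower_memory))
```

These keys could be passed to TrainingArguments in place of the current values. Limiting the sequence length or lowering the LoRA rank are further options, at the cost of changing what the model sees or learns.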

In [None]:
# Save the model
trainer.model.save_pretrained(new_model)
trainer.tokenizer.save_pretrained(new_model)

In [None]:
# pip install tensorboard
from tensorboard import notebook
log_dir = "results/runs"
notebook.start("--logdir {} --port 4000".format(log_dir))
# Test the model
logging.set_verbosity(logging.CRITICAL)
prompt = "Generate a Dockerfile of Python 2.7"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])
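
The prompt template used above can be wrapped in small helpers so that the instruction tags are applied consistently and the answer is separated from the echoed prompt. This is a sketch; `build_prompt`, `extract_answer`, and the sample output string are hypothetical and only mirror the format used in this notebook:

```python
def build_prompt(instruction):
    """Wrap an instruction in the [INST] template used by this notebook."""
    return f"<s>[INST] {instruction} [/INST]"


def extract_answer(generated_text):
    """Return the text after the closing [/INST] tag, if the tag is present."""
    _, sep, answer = generated_text.partition("[/INST]")
    return answer.strip() if sep else generated_text.strip()


full_prompt = build_prompt("Generate a Dockerfile of Python 2.7")

# Hypothetical model output, for illustration only.
sample_output = full_prompt + " FROM python:2.7\n"
print(extract_answer(sample_output))
```

Applied to the notebook's result, `extract_answer(result[0]['generated_text'])` would drop the prompt that the text-generation pipeline includes in its output by default.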