# Level-Up Giants: 8-bit Training for Massive Models 🚀

## Learning Objectives 🎯
- Understand the hardware requirements for training large-scale models.
- Learn to install specialized libraries for advanced model training.
- Configure training parameters effectively for large models using YAML.
- Explore techniques like 8-bit optimization to manage VRAM usage efficiently.

## Importing Libraries

In [4]:
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

cuda
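
Knowing that a GPU is available is only part of the picture; what matters for large models is how much memory the weights need at a given precision. The following is a rough back-of-the-envelope sketch, not a measurement. The helper name `weight_memory_gib` and the parameter counts are illustrative assumptions, and the estimate covers weights only (no activations, gradients, or optimizer state), so real usage will be higher.

```python
# Rough estimate of the memory needed to hold model weights alone.
# Activations, gradients, and optimizer state are NOT included.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Return the memory needed for the weights, in GiB (hypothetical helper)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


# Illustrative parameter counts (assumptions for this sketch).
for name, params in [("1.1B-parameter model", 1.1e9), ("14B-parameter model", 14e9)]:
    for dtype in BYTES_PER_PARAM:
        print(f"{name} in {dtype}: {weight_memory_gib(params, dtype):.1f} GiB")
```

Halving the bytes per parameter halves the weight footprint, which is why 8-bit loading helps most on the largest models.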


## Library Installation 🛠️
Install the Axolotl library with its `flash-attn` and `deepspeed` extras. To make sure all participants use the same library version, pin the install to a specific release or GitHub commit; as written, the command below installs whichever release is current.

In [5]:
# !pip install --no-build-isolation axolotl[flash-attn,deepspeed]
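
One way to confirm that everyone ends up with the same versions is to print them after installation. This is a minimal sketch using the standard library's `importlib.metadata`; the helper `installed_version` and the list of packages checked are assumptions chosen for illustration.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Packages relevant to this notebook (illustrative list, adjust as needed).
for pkg in ("axolotl", "torch", "bitsandbytes"):
    print(f"{pkg}: {installed_version(pkg) or 'not installed'}")
```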

## Training Configuration with YAML 📝
Define the training configuration as a Python dictionary and write it to a YAML file. The settings are chosen to fit limited hardware: gradient checkpointing trades extra compute for lower activation memory, `load_in_8bit` loads the base model weights in 8-bit, and the `adamw_bnb_8bit` optimizer keeps its optimizer state in 8-bit.

In [1]:
import yaml

train_config = {
    # "base_model": "microsoft/Phi-3-medium-128k-instruct", # this requires a 24GB video card
    "base_model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0", # using smaller model to speed up training, same concepts apply

    # dataset params
    "datasets": [
        {
            "path": "Arivukkarasu/squad_for_llms",
            "type": {
                "system_prompt": "Read the following context and concisely answer my question.",
                "field_system": "system",
                "field_instruction": "question",
                "field_input": "context",
                "field_output": "output",
                "format": "<|user|> {input} {instruction} </s> <|assistant|>",
                "no_input_format": "<|user|> {instruction} </s> <|assistant|>",
            },
        }
    ],
    "output_dir": "./models/",

    # model params
    "sequence_len": 2048,

    "bf16": "auto",
    "tf32": False,

    # training params
    "micro_batch_size": 4,
    "num_epochs": 1,
    "optimizer": "adamw_bnb_8bit",
    "learning_rate": 0.0002,

    "logging_steps": 1,

    # LoRA
    "adapter": "lora",
    "lora_r": 32,
    "lora_alpha": 16,
    "lora_dropout": 0.05,
    "lora_target_linear": True,

    # Gradient Accumulation
    "gradient_accumulation_steps": 1,

    # Gradient Checkpointing
    "gradient_checkpointing": True,

    # Low Precision
    "load_in_8bit": True,

    # Train on Inputs
    "train_on_inputs": False,
}


# Write the YAML file
with open("specialised_train.yml", 'w') as file:
    yaml.dump(train_config, file)
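
To see what the `format` and `no_input_format` templates in the dataset section do, it can help to fill them in by hand. The sketch below only approximates the templating step with `str.format`; `render_turn` is a hypothetical helper, the sample row is made up, and Axolotl's actual prompt assembly (for example, how the system prompt is joined) may differ.

```python
# Templates copied from the "type" block of the dataset config.
FORMAT = "<|user|> {input} {instruction} </s> <|assistant|>"
NO_INPUT_FORMAT = "<|user|> {instruction} </s> <|assistant|>"


def render_turn(row: dict) -> str:
    """Fill the prompt template from one dataset row (hypothetical helper).

    field_instruction -> "question", field_input -> "context".
    """
    context = row.get("context", "")
    if context:
        return FORMAT.format(input=context, instruction=row["question"])
    return NO_INPUT_FORMAT.format(instruction=row["question"])


# Made-up example row, not taken from the real dataset.
row = {
    "context": "Paris is the capital of France.",
    "question": "What is the capital of France?",
    "output": "Paris",
}
print(render_turn(row))
```

With `train_on_inputs` set to `False`, the loss is computed on the answer tokens only, not on the rendered prompt.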


## Initiate Model Training 🚀
Launch training with the configuration file written above. 8-bit weight loading, the 8-bit optimizer, LoRA, and gradient checkpointing together reduce VRAM usage, which makes training feasible on GPUs with less memory.

In [11]:
# !accelerate launch -m axolotl.cli.train specialised_train.yml

In [12]:
# # Optional: Merge the trained adapter
# !accelerate launch -m axolotl.cli.merge_lora specialised_train.yml