# Applying FSDP (Fully Sharded Data Parallel): Real-World Usage and Best Practices 🚀

## Learning Objectives 🎯
- Learn how to apply Fully Sharded Data Parallel (FSDP) for large-scale models.
- Understand FSDP’s configurations for optimizing memory usage and efficiency.
- Gain hands-on experience by running FSDP even on a single GPU, keeping in mind that the real benefits become clear with multiple GPUs.

## Installing Axolotl and DeepSpeed 🛠️
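Install Axolotl with the `flash-attn` and `deepspeed` extras. The training run in this notebook uses PyTorch FSDP rather than DeepSpeed; FSDP shards parameters, gradients, and optimizer state across GPUs in much the same way as stage 3 of DeepSpeed's Zero Redundancy Optimizer (ZeRO). Both are designed for multi-GPU setups, but you can still run this configuration on a single GPU to see how the system works.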

In [1]:
# !pip install --no-build-isolation axolotl[flash-attn,deepspeed]

## Importing Libraries

In [3]:
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

cuda


## Applying FSDP Configurations 🧠
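FSDP is configured through options such as `full_shard`, which shards parameters, gradients, and optimizer state across GPUs, and `auto_wrap`, which wraps each transformer layer (here `LlamaDecoderLayer`) in its own FSDP unit. On a single GPU there is nothing to shard across, so this run mainly exercises the configuration path; the memory savings appear once you scale to multiple GPUs.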
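To make the effect of full sharding concrete, here is a rough back-of-the-envelope sketch. It assumes full fine-tuning with mixed-precision Adam at about 16 bytes of model state per parameter (2 for the weights, 2 for the gradients, 12 for the fp32 master weights and Adam moments), and it ignores activations and framework overhead. The function name `per_gpu_state_gb` is made up for this illustration. The QLoRA setup used in this notebook needs far less, because the base weights are quantized to 4 bits and only the adapter weights carry gradients and optimizer state.

```python
def per_gpu_state_gb(n_params, world_size, bytes_per_param=16):
    """Rough per-GPU memory (in GB) for weights, gradients, and optimizer
    state when all three are sharded evenly across `world_size` GPUs.

    bytes_per_param=16 is an assumption (mixed-precision Adam, full
    fine-tuning); activations and overhead are not counted.
    """
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    total_bytes = n_params * bytes_per_param
    return total_bytes / world_size / 1e9


# Hypothetical example: an 8-billion-parameter model on 1, 2, 4, and 8 GPUs.
for gpus in (1, 2, 4, 8):
    print(f"{gpus} GPU(s): ~{per_gpu_state_gb(8e9, gpus):.0f} GB of model state per GPU")
```

Under these assumptions the per-GPU share drops linearly with the number of GPUs, which is why a single-GPU run cannot show the memory benefit.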

In [4]:
import yaml

train_config = {
    # "base_model": "casperhansen/llama-3-70b-fp16",  # needs at least 2 x 24 GB GPUs
    "base_model": "unsloth/llama-3-8b-Instruct",

    # dataset params
    "datasets": [{"path": "Yukang/LongAlpaca-12k", "type": "alpaca"}],
    "output_dir": "./models/llama70B-LongAlpaca",  # named for the 70B run; rename if you train the 8B model

    # model params
    "sequence_len": 1024,  # Axolotl's key for the maximum sequence length
    "pad_to_sequence_len": True,
    "special_tokens": {"pad_token": "<|end_of_text|>"},

    "bf16": "auto",
    "tf32": False,

    # training params
    "micro_batch_size": 1,
    "num_epochs": 1,
    "optimizer": "adamw_torch",
    "learning_rate": 0.0002,

    "logging_steps": 1,

    # LoRA / qLoRA
    "adapter": "qlora",
    "lora_r": 32,
    "lora_alpha": 16,
    "lora_dropout": 0.05,
    "lora_target_linear": True,

    # Gradient Accumulation
    "gradient_accumulation_steps": 1,

    # Gradient Checkpointing
    "gradient_checkpointing": True,

    # Low Precision
    "load_in_8bit": False,
    "load_in_4bit": True,

    # Flash Attention
    "flash_attention": False,

    # FSDP
    "fsdp": ["full_shard", "auto_wrap"],
    "fsdp_config": {
        "fsdp_offload_params": True,
        "fsdp_cpu_ram_efficient_loading": True,
        "fsdp_state_dict_type": "FULL_STATE_DICT",
        "fsdp_transformer_layer_cls_to_wrap": "LlamaDecoderLayer",
    },
}



# Write the YAML file
with open("fsdp_train.yml", 'w') as file:
    yaml.dump(train_config, file)
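Because FSDP is a form of data parallelism, each GPU processes its own micro-batch, so the global batch size grows with the number of GPUs even when the config stays the same. The sketch below shows that arithmetic for the `micro_batch_size` and `gradient_accumulation_steps` values used above. The helper name `effective_batch_size` and the GPU counts are illustrative assumptions, not part of Axolotl.

```python
def effective_batch_size(micro_batch_size, gradient_accumulation_steps, world_size):
    """Number of samples contributing to one optimizer step across all GPUs."""
    return micro_batch_size * gradient_accumulation_steps * world_size


# Values taken from the training config above.
cfg = {"micro_batch_size": 1, "gradient_accumulation_steps": 1}

# Hypothetical GPU counts, to show how the global batch scales.
for world_size in (1, 2, 8):
    global_batch = effective_batch_size(
        cfg["micro_batch_size"],
        cfg["gradient_accumulation_steps"],
        world_size,
    )
    print(f"{world_size} GPU(s): {global_batch} sample(s) per optimizer step")
```

If you move from one GPU to several and want to keep the same global batch, lower `gradient_accumulation_steps` accordingly, or revisit the learning rate.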


## Launching FSDP Training 🚀
Start training with `accelerate launch`. FSDP pays off in a multi-GPU setup, where each GPU holds only a shard of the model state. On a single GPU the same command still runs end to end, which is enough to verify the configuration before scaling out.

In [5]:
# !accelerate launch -m axolotl.cli.train fsdp_train.yml
# Optional: Merge the trained adapter
# !accelerate launch -m axolotl.cli.merge_lora fsdp_train.yml
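When you move to several GPUs, you can tell `accelerate launch` how many processes to start with its `--num_processes` flag (one process per GPU is the usual choice). The following is a minimal sketch of building that command in Python. The helper `build_launch_command` is hypothetical and only assembles the argument list; it does not start training.

```python
import shlex


def build_launch_command(config_path, num_processes):
    """Assemble an `accelerate launch` command for Axolotl training.

    Hypothetical helper: returns the argument list without running it.
    """
    if num_processes < 1:
        raise ValueError("num_processes must be at least 1")
    return [
        "accelerate", "launch",
        "--num_processes", str(num_processes),
        "-m", "axolotl.cli.train",
        config_path,
    ]


# Example: two processes (for instance, two GPUs) with the config written earlier.
cmd = build_launch_command("fsdp_train.yml", 2)
print(shlex.join(cmd))
```

To actually run it, you could pass `cmd` to `subprocess.run(cmd, check=True)` or paste the printed line into a notebook cell prefixed with `!`.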