## Distributed training of LLaMA models using FSDP

### Requirements/Prerequisites
- An Azure account with an active subscription. [Create an account for free](https://azure.microsoft.com/free/?WT.mc_id=A261C142F)
- An Azure Machine Learning workspace. [Configure workspace](../../../configuration.ipynb)
- Python Environment
- Install Azure ML Python SDK Version 2
### Learning Objectives
- Connect to workspace using Python SDK v2
- Use a LLaMA model for a text-generation task
- Run distributed fine-tuning of the LLaMA 7b/13b/70b models

## 1. Connect to Azure Machine Learning Workspace

### 1.1 Import Libraries and connect to workspace using Default Credential

In [None]:
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Environment, BuildContext
from azure.ai.ml import command
from azure.ai.ml.entities import Data
from azure.ai.ml import Input
from azure.ai.ml import Output
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential
from azure.ai.ml.entities import ResourceConfiguration

credential = DefaultAzureCredential()

try:
    ml_client = MLClient.from_config(credential)
except Exception:
    # No usable workspace config was found: enter the details of your AML workspace to get a handle to it
    ml_client = MLClient(
        credential=credential,
        subscription_id="<SUBSCRIPTION_ID>",
        resource_group_name="<RESOURCE_GROUP>",
        workspace_name="<AML_WORKSPACE_NAME>",
    )

### 1.2 Compute target setup

In [None]:
from azure.ai.ml.entities import AmlCompute
from azure.core.exceptions import ResourceNotFoundError

compute_name = "ND40rs-low-priority-3"
compute_cluster = None
try:
    compute_cluster = ml_client.compute.get(compute_name)
    print("Found existing compute target.")
except ResourceNotFoundError:
    print("Creating a new compute target...")
    compute_config = AmlCompute(
        name=compute_name,
        type="amlcompute",
        size="STANDARD_ND40RS_V2",
        idle_time_before_scale_down=300,
        min_instances=0,
        max_instances=6,
        tier="low_priority",
    )
    ml_client.begin_create_or_update(compute_config).result()
    compute_cluster = ml_client.compute.get(compute_name)
compute_cluster

### 1.3 Create a new environment

In [None]:
from azure.core.exceptions import ResourceNotFoundError

env_name = "LLaMA-FSDP"
try:
    # list() returns an iterator over the versions registered under this name
    env_object = next(ml_client.environments.list(env_name))
    print(f"Found existing environment. {env_object}")
except ResourceNotFoundError:
    print(f"Environment {env_name} not found. Creating a new one.")
    env_docker_context = Environment(
        build=BuildContext(path="./env/context"),
        name=env_name,
        description="Environment created for trying FSDP",
    )
    env_object = ml_client.environments.create_or_update(env_docker_context)
    print(env_object)
fsdp_env = f"{env_name}@latest"

## 2. Launch the distributed training job

### 2.1 Create the job 

In this section we configure and run one standalone job:

- `command` for the distributed training job.

The `command` function lets you configure the following key aspects:
- `code` - This is the path where the code to run the command is located
- `command` - This is the command that needs to be run
- `inputs` - This is the dictionary of inputs to the command, given as name-value pairs. The key is a name for the input within the context of the job and the value is the input value. Inputs can be referenced in the `command` using the `${{inputs.<input_name>}}` expression. To use files or folders as inputs, use the `Input` class, which supports three parameters:
    - `type` - The type of input. This can be `uri_file` or `uri_folder`. The default is `uri_folder`.
    - `path` - The path to the file or folder. It can be local or remote. For remote files, http/https and wasb are supported.
        - Azure ML `data`/`dataset` or `datastore` are of type `uri_folder`. To use a `data`/`dataset` as input, reference a dataset registered in the workspace with the format `<data_name>:<version>`, for example `Input(type='uri_folder', path='my_dataset:1')`.
    - `mode` - How the data is delivered to the compute target. Allowed values are `ro_mount`, `rw_mount` and `download`. The default is `ro_mount`.
- `environment` - This is the environment needed for the command to run. You can use a curated or custom environment from the workspace, or create a new custom environment. Check out the [environment](../../../../assets/environment/environment.ipynb) notebook for more examples.
- `compute` - The compute on which the command will run.
- `distribution` - Distribution configuration for distributed training scenarios. Azure Machine Learning supports PyTorch, TensorFlow, and MPI-based distributed training.
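
To make the `${{inputs.<input_name>}}` expression concrete, here is a minimal sketch of the substitution idea. It is a toy stand-in, not the service's implementation: `render_command`, the `train.py` script name, and the mount paths are all hypothetical and chosen only for illustration. It also shows a Python detail worth keeping in mind: inside an f-string the braces must be doubled again, otherwise the expression loses one pair of braces before it ever reaches Azure ML.

```python
import re

# Toy stand-in for the substitution Azure ML performs at run time (assumption:
# this regex only approximates the real expression grammar).
PLACEHOLDER = re.compile(r"\$\{\{\s*(inputs|outputs)\.(\w+)\s*\}\}")


def render_command(template: str, bindings: dict) -> str:
    """Replace each ${{inputs.x}} or ${{outputs.x}} with its bound value."""

    def lookup(match: re.Match) -> str:
        section, name = match.group(1), match.group(2)
        return str(bindings[section][name])

    return PLACEHOLDER.sub(lookup, template)


# Hypothetical mount points, used only for this example.
bindings = {
    "inputs": {"model_path": "/mnt/inputs/model"},
    "outputs": {"finetuned_model": "/mnt/outputs/finetuned"},
}

# A plain string keeps the expression exactly as written.
template = "python train.py --model ${{inputs.model_path}} --out ${{outputs.finetuned_model}}"
rendered = render_command(template, bindings)
print(rendered)

# In an f-string, "{{" collapses to "{", so four braces are needed to keep two.
script = "train.py"
f_template = f"python {script} --model ${{{{inputs.model_path}}}}"
print(f_template)
```

If the command string needs both Python interpolation and Azure ML expressions, splitting it into plain and f-string pieces keeps the escaping readable.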

In [None]:
# Get LLaMA Model path
registry_name = (
    "azureml-meta"  # Change this to the name of the registry where the model is present
)
model_name = "Llama-2-70b"  # Change this to your model name

registry_ml_client = MLClient(credential, registry_name=registry_name)
model = registry_ml_client.models.get(model_name, label="latest")
model.id

In [None]:
# Training arguments
# All the settings defined in the config files (./code/configs) can be passed as CLI arguments. A subset of them is listed below.

training_parameters = {
    "task_name": "text-generation",  # text-classification, text-generation
    # text-classification(for datasets such as emotion detection),
    # text-generation(for samsum, alpaca dataset, grammar dataset)
    "dataset": "samsum_dataset",  # "emotion_detection_dataset", "samsum_dataset" , "grammar_dataset", "alpaca_dataset"
    # "task_type":  "CAUSAL_LM", # SEQ_CLS, TOKEN_CLS, CAUSAL_LM
    # Task type to be used for PEFT. Default is "CAUSAL_LM".
    "enable_fsdp": True,  # Flag to enable FSDP mode
    "use_peft": True,
    "use_fp16": True,
    "cpu_offload": True,  # Set to True to enable CPU offloading (ZeRO).
    "low_cpu_fsdp": True,  # Set to True to reduce the CPU memory footprint. For fine-tuning 70b on ND40, turn this on.
    # It uses meta tensors to save memory: the full model is loaded on rank zero only, and the other ranks initially hold the model as meta tensors.
    # After the model is sharded, the weights are synced from rank zero. This increases training time but reduces CPU memory usage drastically.
    "save_model": True,
    "batch_size_training": 4,
    "val_batch_size": 4,
    "lr": 3e-4,
    "num_epochs": 2,
    "generate_predictions": True,
}

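# SKU details
# num_nodes must not exceed the cluster's max_instances (6 if the cluster was created by the cell in 1.2)
num_nodes = 7
# Both listed SKUs have 8 GPUs per node; any other SKU is assumed to have 4.
# The comparison is case-insensitive in case the service returns the size in mixed case.
nproc_per_node = (
    8
    if compute_cluster.size.upper()
    in ["STANDARD_ND40RS_V2", "STANDARD_ND96AMSR_A100_V4"]
    else 4
)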
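
As a rough sanity check on these settings, the arithmetic below works out the total process count, the effective batch size, and the CPU memory that `low_cpu_fsdp` is meant to save. It is a back-of-the-envelope sketch under stated assumptions: `batch_size_training` is taken to be per process with no gradient accumulation, and the 70b model is taken to be 70e9 parameters held at 2 bytes each while loading. Neither assumption is confirmed by the training script.

```python
# Values mirrored from the cells above.
num_nodes = 7
nproc_per_node = 8
batch_size_training = 4

# One training process per GPU across all nodes.
world_size = num_nodes * nproc_per_node

# Assumption: batch_size_training is per process, no gradient accumulation.
global_batch_size = batch_size_training * world_size

# Assumption: 70e9 parameters at 2 bytes each while the checkpoint is loaded.
num_params = 70_000_000_000
bytes_per_param = 2
full_copy_gb = num_params * bytes_per_param / 1e9

# If every process on a node loaded its own full copy into CPU memory:
per_node_all_ranks_gb = full_copy_gb * nproc_per_node

print(f"world size: {world_size}")
print(f"global batch size: {global_batch_size}")
print(f"one full copy: {full_copy_gb:.0f} GB")
print(f"{nproc_per_node} full copies on one node: {per_node_all_ranks_gb:.0f} GB")
```

Under these assumptions, loading a full copy in every process would need roughly eight times the CPU memory of a single copy per node, which is the cost that loading on rank zero only is meant to avoid.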

In [None]:
from argparse import Namespace

training_args = Namespace(**training_parameters)
cmd_args = ""
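for arg_name, arg_val in training_args._get_kwargs():
    if isinstance(arg_val, bool):
        # Boolean options become bare flags; False values are omitted entirely
        if arg_val:
            cmd_args += f"--{arg_name} "
    else:
        # Every other option is passed as "--name value"
        cmd_args += f"--{arg_name} {arg_val} "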
cmd_args = cmd_args.strip()
cmd_args
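
The conversion above can be sketched as a small standalone function, together with a round trip through `argparse` to show how such a string might be consumed. The helper `build_cli_args` and the parser below are hypothetical: they are not taken from `llama_finetuning.py`, and the assumption that the script defines its boolean options as `store_true` flags is ours.

```python
import argparse


def build_cli_args(params: dict) -> str:
    """Turn a parameter dict into a CLI string; True becomes a bare flag."""
    parts = []
    for name, value in params.items():
        if isinstance(value, bool):
            if value:
                parts.append(f"--{name}")
        else:
            parts.append(f"--{name} {value}")
    return " ".join(parts)


sample = {"enable_fsdp": True, "use_peft": False, "lr": 3e-4, "num_epochs": 2}
cli = build_cli_args(sample)
print(cli)

# Hypothetical parser, assuming boolean options are store_true flags.
parser = argparse.ArgumentParser()
parser.add_argument("--enable_fsdp", action="store_true")
parser.add_argument("--use_peft", action="store_true")
parser.add_argument("--lr", type=float)
parser.add_argument("--num_epochs", type=int)
parsed = parser.parse_args(cli.split())
print(parsed)
```

One consequence of this scheme is that a `False` value is never sent, so it cannot override an option whose default in the receiving script is already true.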

In [None]:
inputs = {
    "model_path": Input(type=AssetTypes.MLFLOW_MODEL, path=model.id),
    "model_config_path": "data/model",
}

outputs = {
    "finetuned_model": Output(
        type=AssetTypes.CUSTOM_MODEL,
    ),
    "artifacts_folder": Output(
        type=AssetTypes.URI_FOLDER,
    ),
}

job = command(
    code="./code",
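    # Azure ML resolves ${{...}} expressions at run time. Plain strings keep them as written;
    # inside an f-string the braces are doubled again so that they survive formatting.
    command=(
        "python llama_finetuning.py "
        f"--model_name ${{{{inputs.model_path}}}}/{inputs['model_config_path']} "
        "--output_dir ${{outputs.finetuned_model}} "
        "--artifacts_dir ${{outputs.artifacts_folder}} "
        f"{cmd_args}"
    ),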
    inputs=inputs,
    outputs=outputs,
    compute=compute_name,
    instance_count=num_nodes,
    environment=fsdp_env,
    distribution={
        "type": "pytorch",
        "process_count_per_instance": nproc_per_node,  # number of gpus on a single node
    },
)

### 2.2 Run the job

In [None]:
returned_job = ml_client.create_or_update(job)
ml_client.jobs.stream(returned_job.name)