# Offsite-tuning Tutorial

In this tutorial, we'll focus on how to leverage the Offsite-Tuning framework in FATE-LLM-2.0 to fine-tune your LLM. You'll learn how to:

1. Define models, including main models (which are at the server side and offer adapters and emulators) and sub models (which are at the client side and load adapters and emulators for local fine-tuning), compatible with the Offsite-Tuning framework.
2. Get hands-on experience with the Offsite-Tuning trainer.
3. Define configurations for advanced setups (using DeepSpeed, Offsite-Tuning + federation) through FATE-pipeline.

## Introduction of Offsite-tuning

Offsite-Tuning is a novel approach designed for the efficient and privacy-preserving adaptation of large foundational models for specific downstream tasks. The framework allows data owners to fine-tune models locally without uploading sensitive data to the LLM owner's servers. Specifically, the LLM owner sends a lightweight "Adapter" and a lossy compressed "Emulator" to the data owner. Using these smaller components, the data owner can then fine-tune the model solely on their private data. The Adapter, once fine-tuned, is returned to the model owner and integrated back into the large model to enhance its performance on the specific dataset.
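
The round trip described above can be sketched with a toy model in which each layer is a single number. This is only an illustration of the data flow, not FATE-LLM code: the function names, the `keep_every` parameter, and the "fine-tuning" step (adding a constant) are all hypothetical stand-ins.

```python
"""Toy walk-through of one Offsite-Tuning round trip (illustrative only)."""


def build_client_package(layers, n_bottom, n_top, keep_every):
    """Server side: split a layer stack into adapters and a lossy emulator."""
    bottom = layers[:n_bottom]
    top = layers[len(layers) - n_top:]
    middle = layers[n_bottom:len(layers) - n_top]
    emulator = middle[::keep_every]  # lossy: most middle layers are dropped
    return {"bottom": list(bottom), "emulator": list(emulator), "top": list(top)}


def client_fine_tune(package, delta):
    """Client side: only the adapters change; the emulator stays frozen."""
    return {
        "bottom": [w + delta for w in package["bottom"]],
        "top": [w + delta for w in package["top"]],
    }


def plug_in(layers, tuned, n_bottom, n_top):
    """Server side: put the tuned adapters back around the full middle stack."""
    middle = layers[n_bottom:len(layers) - n_top]
    return tuned["bottom"] + middle + tuned["top"]


# One float stands in for the weights of one transformer block.
full_model = [float(i) for i in range(12)]
package = build_client_package(full_model, n_bottom=2, n_top=2, keep_every=2)
tuned = client_fine_tune(package, delta=0.5)
updated = plug_in(full_model, tuned, n_bottom=2, n_top=2)

print(package["emulator"])  # the compressed middle the client receives
print(updated)              # full model with fine-tuned adapters plugged in
```

Note that the client never sees the full middle stack, and the server never sees the client's data: only the adapters travel back.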

Offsite-Tuning addresses the challenge of unequal distribution of computational power and data. It allows the LLM owner to enhance the model's capabilities without direct access to private data, while also enabling data owners who may not have the resources to train a full-scale model to fine-tune a portion of it using less computational power. This mutually beneficial arrangement accommodates both parties involved.

Beyond the standard two-party setup involving the model owner and the data owner, the Offsite-Tuning framework in FATE-LLM is also extendable to scenarios with multiple data owners. FATE supports multi-party Offsite-Tuning, allowing multiple data owners to fine-tune their Adapters locally and aggregate them, further enhancing the flexibility and applicability of this framework.

For more details of Offsite-tuning, please refer to the [original paper](https://arxiv.org/pdf/2302.04870.pdf).


## Preliminary

We strongly recommend you finish reading our NN tutorial to get familiar with Model and Dataset customizations: [NN Tutorials](https://github.com/FederatedAI/FATE/blob/master/doc/2.0/fate/components/pipeline_nn_cutomization_tutorial.md)

In this tutorial, we assume that you have deployed the code of FATE (including fateflow & fate-client) and FATE-LLM-2.0. You can add the Python path as shown below so that you can run the code in the notebook.

In [4]:
import sys
your_path_to_fate_python = 'xxx/fate/fate/python'
sys.path.append(your_path_to_fate_python)

If you installed FATE & FATE-LLM-2.0 via pip, you can skip this step and run the following code directly.

## Define Main Model and Sub Model

Main Models are at the server side and provide the weights of adapters and emulators to the clients, while Sub Models are at the client side and load adapters and emulators for local fine-tuning. In this chapter we take a standard GPT2 as the example and show you how to quickly develop a main model class and a sub model class for offsite-tuning.

### Base Classes and Interfaces

The base classes for the Main and Sub Models are OffsiteTuningMainModel and OffsiteTuningSubModel, respectively. To build your own models upon these base classes, you need to:

1. Implement three key interfaces: get_base_model, get_model_transformer_blocks, and forward. The get_base_model interface should return the full Main or Sub Model. Meanwhile, the get_model_transformer_blocks function should return a ModuleList of all transformer blocks present in your language model, enabling the extraction of emulators and adapters from these blocks. Finally, you're required to implement the forward process for model inference.

2. Supply the parameters emulator_layer_num, adapter_top_layer_num, and adapter_bottom_layer_num to the parent class. This allows the framework to automatically generate the top and bottom adapters as well as the dropout emulator for you. Specifically, the top adapters are taken from the top of the transformer blocks, while the bottom adapters are taken from the bottom. The emulator is a dropout emulator consistent with the paper's specifications: once the adapter layers are removed, transformer blocks are selected from the remaining layers at fixed intervals and then stacked to form the emulator.
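
The layer-selection rule in step 2 can be sketched on plain block indices. This is a minimal illustration, assuming evenly spaced selection over the middle blocks; the function name is hypothetical and the exact indices FATE-LLM picks may differ.

```python
def split_layer_indices(total_layers, emulator_layer_num,
                        adapter_top_layer_num, adapter_bottom_layer_num):
    """Return (bottom_adapter, emulator, top_adapter) block indices.

    Assumption: emulator blocks are picked at a fixed stride from the
    blocks left over after both adapters are removed.
    """
    if adapter_top_layer_num + adapter_bottom_layer_num > total_layers:
        raise ValueError("adapters need more layers than the model has")

    bottom = list(range(adapter_bottom_layer_num))
    top = list(range(total_layers - adapter_top_layer_num, total_layers))
    middle = list(range(adapter_bottom_layer_num,
                        total_layers - adapter_top_layer_num))

    if emulator_layer_num > len(middle):
        raise ValueError("emulator cannot have more layers than the middle")
    if emulator_layer_num == 0:
        return bottom, [], top

    stride = len(middle) / emulator_layer_num
    emulator = [middle[int(i * stride)] for i in range(emulator_layer_num)]
    return bottom, emulator, top


# GPT-2 small has 12 transformer blocks.
bottom, emulator, top = split_layer_indices(12, 4, 2, 2)
print(bottom, emulator, top)
```

With these settings the client-side model has 2 + 4 + 2 = 8 blocks instead of 12, which is the total a sub model needs to allocate before it loads the weights sent by the server.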

Our framework automatically detects the emulator and adapters of a main model and sends them to the clients. The clients' models then load the weights of the emulators and adapters to obtain trainable models.

### Example

Let us take a look at our built-in GPT-2 model. It will be easy for you to build main models and sub models based on the framework. Please notice that the GPT2LMHeadSubModel's base model is initialized from a GPT2Config; that is to say, its weights are random and it needs to load pretrained weights from the server.

In [None]:
from fate_llm.model_zoo.offsite_tuning.offsite_tuning_model import OffsiteTuningSubModel, OffsiteTuningMainModel
from transformers import GPT2LMHeadModel, GPT2Config
from torch import nn
import torch as t


class GPT2LMHeadMainModel(OffsiteTuningMainModel):

    def __init__(
            self,
            model_name_or_path,
            emulator_layer_num: int,
            adapter_top_layer_num: int = 2,
            adapter_bottom_layer_num: int = 2):

        self.model_name_or_path = model_name_or_path
        super().__init__(
            emulator_layer_num,
            adapter_top_layer_num,
            adapter_bottom_layer_num)

    def get_base_model(self):
        return GPT2LMHeadModel.from_pretrained(self.model_name_or_path)

    def get_model_transformer_blocks(self, model: GPT2LMHeadModel):
        return model.transformer.h

    def forward(self, x):
        return self.model(**x)

class GPT2LMHeadSubModel(OffsiteTuningSubModel):

    def __init__(
            self,
            model_name_or_path,
            emulator_layer_num: int,
            adapter_top_layer_num: int = 2,
            adapter_bottom_layer_num: int = 2,
            fp16_mix_precision=False,
            partial_weight_decay=None):

        self.model_name_or_path = model_name_or_path
        self.emulator_layer_num = emulator_layer_num
        self.adapter_top_layer_num = adapter_top_layer_num
        self.adapter_bottom_layer_num = adapter_bottom_layer_num
        super().__init__(
            emulator_layer_num,
            adapter_top_layer_num,
            adapter_bottom_layer_num,
            fp16_mix_precision)
        self.partial_weight_decay = partial_weight_decay

    def get_base_model(self):
        total_layer_num = self.emulator_layer_num + \
            self.adapter_top_layer_num + self.adapter_bottom_layer_num
        config = GPT2Config.from_pretrained(self.model_name_or_path)
        config.num_hidden_layers = total_layer_num
        # initialize a model without pretrained weights
        return GPT2LMHeadModel(config)

    def get_model_transformer_blocks(self, model: GPT2LMHeadModel):
        return model.transformer.h
        
    def forward(self, x):
        return self.model(**x)


We can define a server side model and a client side model that can work together in the offsite-tuning:

In [None]:
model_main = GPT2LMHeadMainModel('gpt2', 4, 2, 2)
model_sub = GPT2LMHeadSubModel('gpt2', 4, 2, 2)

### Share additional parameters with clients

Additionally, beyond the weights of emulators and adapters, you may also want to share other model parameters, such as embedding weights, with your client partners. To achieve this, you'll need to implement two more interfaces: get_additional_param_state_dict and load_additional_param_state_dict for both the Main and Sub Models.

In [None]:
def get_additional_param_state_dict(self):
    # get parameter of additional parameter
    model = self.model
    param_dict = {
        'wte': model.transformer.wte,
        'wpe': model.transformer.wpe,
        'last_ln_f': model.transformer.ln_f
    }

    addition_weights = self.get_numpy_state_dict(param_dict)

    wte = addition_weights.pop('wte')
    wte_dict = split_numpy_array(wte, 10, 'wte')
    wpe = addition_weights.pop('wpe')
    wpe_dict = split_numpy_array(wpe, 10, 'wpe')
    addition_weights.update(wte_dict)
    addition_weights.update(wpe_dict)
    return addition_weights

def load_additional_param_state_dict(self, submodel_weights: dict):
    # load additional weights:
    model = self.model
    param_dict = {
        'wte': model.transformer.wte,
        'wpe': model.transformer.wpe,
        'last_ln_f': model.transformer.ln_f
    }

    new_submodel_weight = {}
    new_submodel_weight['last_ln_f'] = submodel_weights['last_ln_f']
    wte_dict, wpe_dict = {}, {}
    for k, v in submodel_weights.items():
        if 'wte' in k:
            wte_dict[k] = v
        if 'wpe' in k:
            wpe_dict[k] = v
    wte = recover_numpy_array(wte_dict, 'wte')
    wpe = recover_numpy_array(wpe_dict, 'wpe')
    new_submodel_weight['wte'] = wte
    new_submodel_weight['wpe'] = wpe

    self.load_numpy_state_dict(param_dict, new_submodel_weight)

From this code we can see that 'split_numpy_array' and 'recover_numpy_array' are used to cut the embedding weights into pieces and to recover them.
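
To make the split-and-recover idea concrete, here is a minimal sketch on plain Python lists. The helpers `split_rows` and `recover_rows` are hypothetical stand-ins: the real helpers operate on NumPy arrays, and their chunking rule and key naming are assumptions here, not taken from the FATE-LLM source.

```python
def split_rows(rows, num_pieces, prefix):
    """Cut a list of rows into at most `num_pieces` contiguous chunks.

    Chunks are stored under keys such as 'wte_0', 'wte_1', ...
    (assumed naming scheme).
    """
    if num_pieces < 1:
        raise ValueError("num_pieces must be at least 1")
    chunk = -(-len(rows) // num_pieces)  # ceiling division
    pieces = {}
    for i in range(num_pieces):
        part = rows[i * chunk:(i + 1) * chunk]
        if part:  # skip empty tail chunks
            pieces[f"{prefix}_{i}"] = part
    return pieces


def recover_rows(pieces, prefix):
    """Concatenate the chunks back in index order."""
    keys = [k for k in pieces if k.startswith(prefix + "_")]
    keys.sort(key=lambda k: int(k.rsplit("_", 1)[1]))  # numeric, not lexical
    rows = []
    for key in keys:
        rows.extend(pieces[key])
    return rows


# A tiny stand-in for an embedding matrix: 23 rows, 2 columns.
embedding = [[float(i), i + 0.5] for i in range(23)]
pieces = split_rows(embedding, 10, "wte")
restored = recover_rows(pieces, "wte")
print(len(pieces), restored == embedding)
```

Sorting by the numeric suffix matters once there are more than ten pieces, because a plain string sort would place 'wte_10' before 'wte_2'.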

## Submit an Offsite-tuning Task - A QA Task Sample with GPT2

Now we are going to show you how to run a two-party (server & client) offsite-tuning task using the GPT-2 model defined above. Before we submit the task, we need to prepare the QA dataset.

### Prepare QA Dataset - Sciq

In this example, we use the sciq dataset. You can use the tools provided in our qa_dataset.py to tokenize the sciq dataset and save the tokenized result. **Remember to modify the save_path to your own path.** For the sake of simplicity, every party in this tutorial uses this same dataset to train the model.

In [1]:
from fate_llm.dataset.qa_dataset import tokenize_qa_dataset
from transformers import AutoTokenizer
tokenizer_name_or_path = 'gpt2'
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name_or_path)

if 'llama' in tokenizer_name_or_path:
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name_or_path, unk_token="<unk>",  bos_token="<s>", eos_token="</s>", add_eos_token=True)   
    tokenizer.pad_token = tokenizer.eos_token
else:
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name_or_path)
if 'gpt2' in tokenizer_name_or_path:
    tokenizer.pad_token = tokenizer.eos_token

import os
# bind data path to name & namespace
save_path = 'xxxx/sciq'
rs = tokenize_qa_dataset('sciq', tokenizer, save_path, seq_max_len=600)  # we save the cache dataset to the fate root folder

We can use our built-in QA dataset class to load the tokenized dataset and check that everything is working correctly.

In [12]:
from fate_llm.dataset.qa_dataset import QaDataset

ds = QaDataset(tokenizer_name_or_path=tokenizer_name_or_path)
ds.load(save_path)

In [13]:
print(len(ds))  # train set length
print(ds[0]['input_ids'].__len__()) # first sample length

11679
600
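
The first sample's length equals the seq_max_len passed to the tokenizer step, which suggests that samples are padded or truncated to a fixed length. The sketch below shows that assumed behavior on plain token-id lists; `pad_or_truncate` is a hypothetical helper, not part of qa_dataset.py, and the sample ids are arbitrary. The pad id 50256 is GPT-2's end-of-text token, which the tokenizer setup above reuses as the pad token.

```python
def pad_or_truncate(token_ids, max_len, pad_id):
    """Return (input_ids, attention_mask), both exactly `max_len` long."""
    ids = list(token_ids[:max_len])        # truncate on the right
    mask = [1] * len(ids)                  # 1 marks a real token
    padding = max_len - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding


GPT2_EOS_ID = 50256  # used as pad token because GPT-2 has no dedicated one

short_ids, short_mask = pad_or_truncate([11, 22, 33], 6, GPT2_EOS_ID)
long_ids, long_mask = pad_or_truncate(list(range(10)), 6, GPT2_EOS_ID)
print(short_ids, short_mask)
print(long_ids, long_mask)
```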


## Submit a Task

Now the model and the dataset are prepared! We can submit a training task. In FATE-2.0, you can define your pipeline in a much easier manner.

After we submit the task below, the following process will occur: The server and client each initialize their respective models. The server extracts shared parameters and sends them to the client. The client then loads these parameters and conducts training on a miniaturized GPT-2 model composed of an emulator and adapters on Sciq. Upon completion of the training, the client sends the adapter parameters back to the server. Since we are directly using Hugging Face's GPT2LMHeadModel, there's no need to supply a loss function. Simply inputting the preprocessed data and labels into the model will calculate the correct loss and proceed with gradient descent.

If you are not familiar with trainer configuration, please refer to [NN Tutorials](https://github.com/FederatedAI/FATE/blob/master/doc/2.0/fate/components/pipeline_nn_cutomization_tutorial.md).

One thing to pay special attention to is that Offsite-Tuning differs from FedAvg within FATE. In Offsite-Tuning, the server (the arbiter role) needs to initialize the model. Therefore, please refer to the example below and set the runner conf separately for the client and the server.

To make this a quick demo, we only select 100 samples from the original QA dataset; see 'select_num=100' in the LLMDatasetLoader.

### Bind Dataset Path with Name & Namespace

Please execute the following code to bind the dataset path with name & namespace. Remember to modify the path to your own dataset save path.

In [None]:
! flow table bind --namespace experiment --name sciq --path YOUR_SAVE_PATH

### Pipeline Code

In [16]:
import time
from fate_client.pipeline.components.fate.reader import Reader
from fate_client.pipeline import FateFlowPipeline
from fate_client.pipeline.components.fate.homo_nn import HomoNN, get_conf_of_ot_runner
from fate_client.pipeline.components.fate.nn.algo_params import Seq2SeqTrainingArguments, FedAVGArguments
from fate_client.pipeline.components.fate.nn.loader import LLMModelLoader, LLMDatasetLoader, LLMDataFuncLoader
from fate_client.pipeline.components.fate.nn.torch.base import Sequential
from fate_client.pipeline.components.fate.nn.torch import nn


guest = '9999'
host = '9999'
arbiter = '9999'

pipeline = FateFlowPipeline().set_parties(guest=guest, arbiter=arbiter)
pipeline.set_site_party_id('9999')
reader_0 = Reader("reader_0", runtime_parties=dict(guest=guest))
reader_0.guest.task_parameters(
    namespace="experiment",
    name="sciq"
)

client_model = LLMModelLoader(
    module_name='offsite_tuning.gpt2', item_name='GPT2LMHeadSubModel',
    model_name_or_path='gpt2',
    emulator_layer_num=4,
    adapter_top_layer_num=1,
    adapter_bottom_layer_num=1
)

server_model = LLMModelLoader(
    module_name='offsite_tuning.gpt2', item_name='GPT2LMHeadMainModel',
    model_name_or_path='gpt2',
    emulator_layer_num=4,
    adapter_top_layer_num=1,
    adapter_bottom_layer_num=1  
)

train_args = Seq2SeqTrainingArguments(
    per_device_train_batch_size=1,
    learning_rate=5e-5,
    disable_tqdm=False,
    num_train_epochs=1,
    logging_steps=10,
    logging_strategy='steps',
    use_cpu=False
)

dataset = LLMDatasetLoader(
    module_name='qa_dataset', item_name='QaDataset',
    tokenizer_name_or_path='gpt2',
    select_num=100
)

data_collator = LLMDataFuncLoader(module_name='data_collator.cust_data_collator', item_name='get_seq2seq_data_collator', tokenizer_name_or_path='gpt2')

client_conf = get_conf_of_ot_runner(
    model=client_model,
    dataset=dataset,
    data_collator=data_collator,
    training_args=train_args,
    fed_args=FedAVGArguments(),
    aggregate_model=False
)

server_conf = get_conf_of_ot_runner(
    model=server_model,
    dataset=dataset,
    data_collator=data_collator,
    training_args=train_args,
    fed_args=FedAVGArguments(),
    aggregate_model=False
)

homo_nn_0 = HomoNN(
    'nn_0',
    train_data=reader_0.outputs["output_data"],
    runner_module="offsite_tuning_runner",
    runner_class="OTRunner"
)

homo_nn_0.guest.task_parameters(runner_conf=client_conf)
homo_nn_0.arbiter.task_parameters(runner_conf=server_conf)
pipeline.add_tasks([reader_0, homo_nn_0])
pipeline.compile()

<fate_client.pipeline.pipeline.FateFlowPipeline at 0x7fc69aa33a00>

You can try to initialize your models and datasets to check that they can be loaded correctly.

In [17]:
print(client_model())
print('*' * 10)
print(dataset())
print('*' * 10)
print(data_collator())

GPT2LMHeadSubModel(
  (model): GPT2LMHeadModel(
    (transformer): GPT2Model(
      (wte): Embedding(50257, 768)
      (wpe): Embedding(1024, 768)
      (drop): Dropout(p=0.1, inplace=False)
      (h): ModuleList(
        (0): GPT2Block(
          (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (attn): GPT2Attention(
            (c_attn): Conv1D()
            (c_proj): Conv1D()
            (attn_dropout): Dropout(p=0.1, inplace=False)
            (resid_dropout): Dropout(p=0.1, inplace=False)
          )
          (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (mlp): GPT2MLP(
            (c_fc): Conv1D()
            (c_proj): Conv1D()
            (act): NewGELUActivation()
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (1): GPT2Block(
          (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (attn): GPT2Attention(
            (c_attn): Conv1D()
            (c_proj): Conv1D()
   

It seems that everything is ready! Run the code below to submit your task.

In [2]:
pipeline.fit()

## Add Deepspeed Setting

By simply adding a ds_config, we can run our task with a DeepSpeed backend. If you have deployed an eggroll environment, you can submit the task with DeepSpeed to eggroll to accelerate your training.
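
Two properties of a DeepSpeed setup are worth checking before submitting. DeepSpeed derives the global batch size as micro-batch size × gradient accumulation steps × number of worker processes, and Hugging Face's DeepSpeed integration expects values that appear both in the DeepSpeed config and in the training arguments to agree. The sketch below checks both on a trimmed config that uses the same values as this section. The helper names are hypothetical, and treating the `cores` setting as the number of worker processes is an assumption.

```python
def effective_batch_size(ds_config, world_size):
    """Global batch size per optimizer step."""
    micro = ds_config["train_micro_batch_size_per_gpu"]
    accum = ds_config.get("gradient_accumulation_steps", 1)
    return micro * accum * world_size


def find_mismatches(ds_config, per_device_train_batch_size, learning_rate, fp16):
    """List settings whose DeepSpeed value differs from the trainer value."""
    pairs = {
        "train_micro_batch_size_per_gpu": (
            ds_config.get("train_micro_batch_size_per_gpu"),
            per_device_train_batch_size,
        ),
        "optimizer.params.lr": (
            ds_config.get("optimizer", {}).get("params", {}).get("lr"),
            learning_rate,
        ),
        "fp16.enabled": (ds_config.get("fp16", {}).get("enabled"), fp16),
    }
    return [name for name, (ds_value, hf_value) in pairs.items()
            if ds_value != hf_value]


# Trimmed config with the same batch size, learning rate and fp16 values.
example_ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "optimizer": {"type": "Adam", "params": {"lr": 5e-5}},
    "fp16": {"enabled": True},
}

print(effective_batch_size(example_ds_config, world_size=4))
print(find_mismatches(example_ds_config, 1, 5e-5, True))
```

Keeping shared values in variables and reusing them in both places is a simple way to avoid such mismatches.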

In [5]:
import time
from fate_client.pipeline.components.fate.reader import Reader
from fate_client.pipeline import FateFlowPipeline
from fate_client.pipeline.components.fate.homo_nn import HomoNN, get_conf_of_ot_runner
from fate_client.pipeline.components.fate.nn.algo_params import Seq2SeqTrainingArguments, FedAVGArguments
from fate_client.pipeline.components.fate.nn.loader import LLMModelLoader, LLMDatasetLoader, LLMDataFuncLoader
from peft import LoraConfig, TaskType
from transformers.modeling_utils import unwrap_model


guest = '10000'
host = '10000'
arbiter = '10000'

# pipeline = FateFlowPipeline().set_parties(guest=guest, host=host, arbiter=arbiter)
pipeline = FateFlowPipeline().set_parties(guest=guest, arbiter=arbiter)

reader_0 = Reader("reader_0", runtime_parties=dict(guest=guest))
reader_0.guest.task_parameters(
    namespace="experiment",
    name="sciq"
)

client_model = LLMModelLoader(
    module_name='offsite_tuning.gpt2', item_name='GPT2LMHeadSubModel',
    model_name_or_path='gpt2',
    emulator_layer_num=18,
    adapter_top_layer_num=2,
    adapter_bottom_layer_num=2
)

server_model = LLMModelLoader(
    module_name='offsite_tuning.gpt2', item_name='GPT2LMHeadMainModel',
    model_name_or_path='gpt2',
    emulator_layer_num=18,
    adapter_top_layer_num=2,
    adapter_bottom_layer_num=2  
)

dataset = LLMDatasetLoader(
    module_name='qa_dataset', item_name='QaDataset',
    tokenizer_name_or_path='gpt2',
    select_num=100
)

data_collator = LLMDataFuncLoader(module_name='data_collator.cust_data_collator', item_name='get_seq2seq_data_collator', tokenizer_name_or_path='gpt2')

batch_size = 1
lr = 5e-5
ds_config = {
    "train_micro_batch_size_per_gpu": batch_size,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": lr,
            "torch_adam": True,
            "adam_w_mode": False
        }
    },
    "fp16": {
        "enabled": True
    },
    "gradient_accumulation_steps": 1,
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": True,
        "allgather_bucket_size": 1e8,
        "overlap_comm": True,
        "reduce_scatter": True,
        "reduce_bucket_size": 1e8,
        "contiguous_gradients": True,
        "offload_optimizer": {
            "device": "cpu"
        },
        "offload_param": {
            "device": "cpu"
        }
    }
}

train_args = Seq2SeqTrainingArguments(
    per_device_train_batch_size=1,
    learning_rate=5e-5,
    disable_tqdm=False,
    num_train_epochs=1,
    logging_steps=10,
    logging_strategy='steps',
    dataloader_num_workers=4,
    use_cpu=False,
    deepspeed=ds_config,  # Add deepspeed config here
    remove_unused_columns=False,
    fp16=True
)

client_conf = get_conf_of_ot_runner(
    model=client_model,
    dataset=dataset,
    data_collator=data_collator,
    training_args=train_args,
    fed_args=FedAVGArguments(),
    aggregate_model=False,
)

server_conf = get_conf_of_ot_runner(
    model=server_model,
    dataset=dataset,
    data_collator=data_collator,
    training_args=train_args,
    fed_args=FedAVGArguments(),
    aggregate_model=False
)


homo_nn_0 = HomoNN(
    'nn_0',
    train_data=reader_0.outputs["output_data"],
    runner_module="offsite_tuning_runner",
    runner_class="OTRunner"
)

homo_nn_0.guest.task_parameters(runner_conf=client_conf)
homo_nn_0.arbiter.task_parameters(runner_conf=server_conf)

# if you have deployed eggroll, you can add this line to submit your job to eggroll
homo_nn_0.guest.conf.set("launcher_name", "deepspeed")

pipeline.add_tasks([reader_0, homo_nn_0])
pipeline.conf.set("task", dict(engine_run={"cores": 4}))
pipeline.compile()
pipeline.fit()


<pipeline.backend.pipeline.PipeLine at 0x7f8002385e50>

## Offsite-tuning + Multi Client Federation


The Offsite-Tuning + FedAVG federation is configured on top of the standard Offsite-Tuning setup. In this situation, you need to add data inputs and configurations for all clients. Remember to set 'aggregate_model=True' in both the client and server conf so that model aggregation is conducted during training.
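
Conceptually, the aggregation step averages the adapter weights reported by the clients. The sketch below shows plain FedAvg-style weighted averaging on small lists. It is an illustration under stated assumptions, not FATE's aggregator: `fed_avg` and the parameter names are hypothetical, and weighting by sample count is one common choice.

```python
def fed_avg(client_states, client_weights=None):
    """Weighted average of per-client parameter dicts.

    client_states: list of {param_name: [float, ...]} with identical keys.
    client_weights: optional per-client weights, e.g. sample counts.
    """
    if not client_states:
        raise ValueError("need at least one client state")
    if client_weights is None:
        client_weights = [1.0] * len(client_states)
    if len(client_weights) != len(client_states):
        raise ValueError("one weight per client is required")

    total = float(sum(client_weights))
    averaged = {}
    for name in client_states[0]:
        length = len(client_states[0][name])
        averaged[name] = [
            sum(state[name][i] * weight / total
                for state, weight in zip(client_states, client_weights))
            for i in range(length)
        ]
    return averaged


# Two clients report the same adapter parameter after local fine-tuning.
guest_adapter = {"adapter_top.0.weight": [1.0, 2.0]}
host_adapter = {"adapter_top.0.weight": [3.0, 4.0]}

print(fed_avg([guest_adapter, host_adapter]))
print(fed_avg([guest_adapter, host_adapter], client_weights=[1, 3]))
```

Only the adapters take part in such an average; the emulator is frozen on every client, so there is nothing to aggregate for it.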

In [None]:
import time
from fate_client.pipeline.components.fate.reader import Reader
from fate_client.pipeline import FateFlowPipeline
from fate_client.pipeline.components.fate.homo_nn import HomoNN, get_conf_of_ot_runner
from fate_client.pipeline.components.fate.nn.algo_params import Seq2SeqTrainingArguments, FedAVGArguments
from fate_client.pipeline.components.fate.nn.loader import LLMModelLoader, LLMDatasetLoader, LLMCustFuncLoader
from peft import LoraConfig, TaskType


guest = '10000'
host = '10000'
arbiter = '10000'

pipeline = FateFlowPipeline().set_parties(guest=guest, host=host, arbiter=arbiter)

reader_0 = Reader("reader_0", runtime_parties=dict(guest=guest, host=host))
reader_0.guest.task_parameters(
    namespace="experiment",
    name="sciq"
)
reader_0.hosts[0].task_parameters(
    namespace="experiment",
    name="sciq"
)

client_model = LLMModelLoader(
    module_name='offsite_tuning.gpt2', item_name='GPT2LMHeadSubModel',
    model_name_or_path='gpt2',
    emulator_layer_num=4,
    adapter_top_layer_num=1,
    adapter_bottom_layer_num=1
)

server_model = LLMModelLoader(
    module_name='offsite_tuning.gpt2', item_name='GPT2LMHeadMainModel',
    model_name_or_path='gpt2',
    emulator_layer_num=4,
    adapter_top_layer_num=1,
    adapter_bottom_layer_num=1  
)

dataset = LLMDatasetLoader(
    module_name='qa_dataset', item_name='QaDataset',
    tokenizer_name_or_path='gpt2',
    select_num=100
)

data_collator = LLMCustFuncLoader(module_name='cust_data_collator', item_name='get_seq2seq_tokenizer', model_path='gpt2')

train_args = Seq2SeqTrainingArguments(
    per_device_train_batch_size=1,
    learning_rate=5e-5,
    disable_tqdm=False,
    num_train_epochs=1,
    logging_steps=10,
    logging_strategy='steps',
    dataloader_num_workers=4
)

client_conf = get_conf_of_ot_runner(
    model=client_model,
    dataset=dataset,
    data_collator=data_collator,
    training_args=train_args,
    fed_args=FedAVGArguments(),
    aggregate_model=True
)

server_conf = get_conf_of_ot_runner(
    model=server_model,
    dataset=dataset,
    data_collator=data_collator,
    training_args=train_args,
    fed_args=FedAVGArguments(),
    aggregate_model=True
)

homo_nn_0 = HomoNN(
    'nn_0',
    train_data=reader_0.outputs["output_data"],
    runner_module="offsite_tuning_runner",
    runner_class="OTRunner"
)

homo_nn_0.guest.task_parameters(runner_conf=client_conf)
homo_nn_0.hosts[0].task_parameters(runner_conf=client_conf)
homo_nn_0.arbiter.task_parameters(runner_conf=server_conf)

pipeline.add_tasks([reader_0, homo_nn_0])

pipeline.compile()
pipeline.fit()