# Homo NN: Customizing the Trainer

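The FedAVGTrainer that ships with FATE only targets common classification and regression tasks. Tasks such as object detection, recommendation, or semantic labeling place specific requirements on the dataset, the loss, and the training procedure, so for them the existing training flow has to be modified.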

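Besides Dataset and CustModel, FATE-1.10 also supports custom Trainers, so the training flow can be tailored to your needs.
The Trainer base class lives in federatedml.nn.homo.trainer.trainer_base.
To develop your own Trainer, you need to implement a few interfaces so that FATE can call it correctly.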

## The TrainerBase interface

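### Part of the TrainerBase code
This section covers the parts of TrainerBase that matter when you customize your own Trainer; the relevant excerpt is shown below. With these pieces, a simple custom Trainer can be implemented quickly.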

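- \_\_init\_\_: define the parameters your Trainer needs here, such as epochs and batch_size.

- train: the interface you must implement. At run time, the Homo-NN component calls train automatically to run training. It takes four parameters: train_set, validate_set, optimizer, and loss. The Homo-NN component passes in the training set, validation set, optimizer, and loss that you configured in the pipeline. Note that the optimizer is instantiated with the model's parameters(), and loss is an instance of a PyTorch loss function. You can therefore write your own training procedure inside train.

- self.model: before train runs, the Homo-NN component automatically calls set_model to set your model, so inside train you can access the model through self.model.

- local_mode() and self.fed_mode: calling local_mode() sets self.fed_mode to False. Inside train you can check fed_mode to tell local mode (for local testing) apart from federated mode. This is useful during local development and testing.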

The excerpt below shows these pieces of TrainerBase; the simple Trainer example in the next section then shows how to customize it.

In [None]:
import abc
from torch.nn import Module


class TrainerBase(object):

    def __init__(self, **kwargs):
        
        self._fed_mode = True
        self._model = None
        ...

    @property
    def model(self):
        if not hasattr(self, '_model'):
            raise AttributeError('model variable is not initialized, remember to call'
                                 ' super(your_class, self).__init__()')
        if self._model is None:
            raise AttributeError('model is not set, use set_model() function to set training model')

        return self._model

    @model.setter
    def model(self, val):
        self._model = val
        
    def set_model(self, model: Module):
        if not issubclass(type(model), Module):
            raise ValueError('model must be a subclass of pytorch nn.Module')
        self.model = model

    @property
    def fed_mode(self):
        if not hasattr(self, '_fed_mode'):
            raise AttributeError('run_local_mode variable is not initialized, remember to call'
                                 ' super(your_class, self).__init__()')
        return self._fed_mode

    @fed_mode.setter
    def fed_mode(self, val):
        self._fed_mode = val

    def local_mode(self):
        self.fed_mode = False

    @abc.abstractmethod
    def train(self, train_set, validate_set=None, optimizer=None, loss=None):
        """
            train_set : A Dataset instance, must be an instance of a subclass of Dataset (federatedml.nn.dataset.base),
                      for example, TableDataset() (from federatedml.nn.dataset.table)

            validate_set : A Dataset instance, optional; if given, must be an instance of a subclass of Dataset
                    (federatedml.nn.dataset.base), for example, TableDataset() (from federatedml.nn.dataset.table)

            optimizer : A pytorch optimizer class instance, for example, t.optim.Adam(), t.optim.SGD()

            loss : A pytorch loss instance, for example, nn.BCELoss(), nn.CrossEntropyLoss()
        """
        pass
    
    ...
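
The contract in the excerpt above can be tried out without FATE installed. The following is a minimal sketch, assuming a hypothetical stand-in class `MiniTrainerBase` that mimics only the `model` / `fed_mode` plumbing from the excerpt (it leaves out the `nn.Module` type check so that it runs without PyTorch). `EchoTrainer` and `drive` are likewise illustrative names, not FATE APIs.

```python
class MiniTrainerBase:
    """Hypothetical stand-in for TrainerBase: only the model / fed_mode plumbing."""

    def __init__(self, **kwargs):
        self._fed_mode = True   # federated by default, as in the excerpt
        self._model = None

    @property
    def model(self):
        if self._model is None:
            raise AttributeError('model is not set, use set_model() to set it')
        return self._model

    def set_model(self, model):
        self._model = model

    @property
    def fed_mode(self):
        return self._fed_mode

    def local_mode(self):
        self._fed_mode = False

    def train(self, train_set, validate_set=None, optimizer=None, loss=None):
        raise NotImplementedError


class EchoTrainer(MiniTrainerBase):
    """Toy trainer: 'trains' by recording what it was handed."""

    def __init__(self, epochs):
        super().__init__()
        self.epochs = epochs
        self.history = []

    def train(self, train_set, validate_set=None, optimizer=None, loss=None):
        mode = 'federated' if self.fed_mode else 'local'
        for epoch in range(self.epochs):
            # self.model is available here because set_model() ran first
            self.history.append((epoch, mode, self.model, len(train_set)))


def drive(trainer, model, train_set, local=False):
    """Call order described above: set_model() first, then train()."""
    trainer.set_model(model)
    if local:
        trainer.local_mode()
    trainer.train(train_set)
    return trainer.history


history = drive(EchoTrainer(epochs=2), model='toy-model', train_set=[1, 2, 3], local=True)
print(history)  # [(0, 'local', 'toy-model', 3), (1, 'local', 'toy-model', 3)]
```

Reading `model` before `set_model()` has been called raises `AttributeError` in this sketch, which mirrors the guard in the excerpt.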

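## Example 1: Developing a simple custom Trainer

Here we develop a simple custom Trainer to show how each interface is used.
For convenience, the save_to_fate interface is used to save the trainer; alternatively, you can manually deploy the Trainer file to federatedml/nn/homo/trainer.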

### mytrainer.py

In [16]:
from pipeline.component.nn import save_to_fate

In [18]:
%%save_to_fate trainer mytrainer.py  
# save to federatedml/nn/homo/trainer
import torch as t
from federatedml.util import LOGGER
from federatedml.nn.homo.trainer.trainer_base import TrainerBase
from torch.utils.data import DataLoader

# Use the SecureAggregator that ships with FATE; a Trainer uses the client side of SecureAggregator
from federatedml.framework.homo.aggregator.secure_aggregator import SecureAggregatorClient


class MyTrainer(TrainerBase):
    
    def __init__(self, epochs, batch_size=256, dataloader_worker=4):
        super(MyTrainer, self).__init__()
        self.epochs = epochs
        self.batch_size = batch_size
        self.dataloader_worker = dataloader_worker
        
    # implement the train interface (same signature as TrainerBase.train)
    def train(self, train_set, validate_set=None, optimizer=None, loss=None):
        
        fed_avg = None
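        # note: this logs fed_mode, so False here means the trainer runs in local mode
        LOGGER.info('run local mode is {}'.format(self.fed_mode))
        
        # trainer.local_mode() sets fed_mode to False. This check exists for local testing:
        # it bypasses the federation flow, because SecureAggregatorClient cannot run
        # directly inside a standalone local script
        if self.fed_mode:
            # max_aggregate_round is the maximum number of aggregation rounds
            # sample_number is used to compute the model weight
            fed_avg = SecureAggregatorClient(max_aggregate_round=self.epochs, sample_number=len(train_set), secure_aggregate=True)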
            LOGGER.info('initializing fed avg')
        
        # DataLoader + for loop: compute the loss and call backward
        # exactly the same as a regular PyTorch training loop
        dl = DataLoader(train_set, batch_size=self.batch_size, num_workers=self.dataloader_worker)
        for epoch_idx in range(0, self.epochs):
            l_sum = 0
            for data, label in dl:
                optimizer.zero_grad()
                # self.model was set through set_model() before train() is called
                pred = self.model(data)
                l = loss(pred, label)
                l.backward()
                optimizer.step()
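                l_sum += l.detach()  # detach so the autograd graph is not kept alive
            
            # LOGGER writes to the job log
            LOGGER.info('loss sum is {}'.format(l_sum))
            
            # aggregate the model through the secure aggregator
            if fed_avg:
                # aggregate the model together with the epoch loss
                fed_avg.aggregate(self.model, l_sum.cpu().detach().numpy())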
                    
        LOGGER.info('training finished!')
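
The `sample_number` argument above is described as being used to compute the model weight. As a rough illustration of what sample-weighted averaging means, here is plain FedAvg arithmetic on Python lists. It is a sketch under that assumption, not the `SecureAggregatorClient` implementation, and `weighted_average` is a hypothetical helper.

```python
def weighted_average(party_params, sample_numbers):
    """Average per-party parameter vectors, weighting each party by its sample count."""
    if len(party_params) != len(sample_numbers):
        raise ValueError('need one sample count per party')
    total = sum(sample_numbers)
    if total <= 0:
        raise ValueError('total sample count must be positive')
    dim = len(party_params[0])
    averaged = [0.0] * dim
    for params, n in zip(party_params, sample_numbers):
        weight = n / total
        for i, value in enumerate(params):
            averaged[i] += weight * value
    return averaged


# Two parties: the second holds three times as many samples, so it dominates.
result = weighted_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
print(result)  # [2.5, 3.5]
```

When both parties hold the same number of samples, as in the federated job later in this tutorial where both use the same file, this reduces to a simple mean.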

With the code written, we can test locally whether it runs.

## Example 1: Local test

In [19]:
import torch as t
from federatedml.nn.dataset.table import TableDataset

dataset = TableDataset()
dataset.load('../examples/data/breast_homo_host.csv')
dataset[0]
print(dataset[0][0].shape)

(30,)


In [20]:
trainer = MyTrainer(epochs=10, batch_size=128) # 10 epochs, batch_size=128
model = t.nn.Sequential(
    t.nn.Linear(30, 16),
    t.nn.ReLU(),
    t.nn.Linear(16, 1),
    t.nn.Sigmoid()
)
loss = t.nn.BCELoss()  # loss function 
optimizer = t.optim.Adam(model.parameters(), lr=0.01)# optimizer

In [21]:
trainer.set_model(model) # set model
trainer.local_mode()  # local mode, for local testing

In [22]:
trainer.train(dataset, loss=loss, optimizer=optimizer)

run local mode is False
loss sum is 1.131636381149292
loss sum is 0.8614962100982666
loss sum is 0.6639648675918579
loss sum is 0.5202457904815674
loss sum is 0.4206005930900574
loss sum is 0.35165560245513916
loss sum is 0.30107438564300537
loss sum is 0.2607121169567108
loss sum is 0.2271752953529358
loss sum is 0.20061872899532318
training finished!



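## Example 1: Federated task

Once the local test passes, we can submit a FATE job with the same parameters. Two parties take part in this job; following the logic of mytrainer, each round trains locally and then aggregates the models.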
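
MyTrainer creates its aggregator with `secure_aggregate=True`. This tutorial does not describe how FATE protects the uploaded models, so the following is only a generic sketch of one common idea behind secure aggregation, pairwise additive masking, and not FATE's protocol. The helpers `pairwise_masks` and `secure_sum` are hypothetical, and integers are used instead of float model weights so that the cancellation is exact.

```python
import random


def pairwise_masks(num_parties, dim, seed=0):
    """Return one mask vector per party; the masks sum to zero element-wise."""
    rng = random.Random(seed)
    masks = [[0] * dim for _ in range(num_parties)]
    for i in range(num_parties):
        for j in range(i + 1, num_parties):
            # parties i and j share a random vector: i adds it, j subtracts it
            shared = [rng.randint(-10**6, 10**6) for _ in range(dim)]
            for k in range(dim):
                masks[i][k] += shared[k]
                masks[j][k] -= shared[k]
    return masks


def secure_sum(updates, seed=0):
    """Sum integer update vectors while the aggregator only sees masked inputs."""
    dim = len(updates[0])
    masks = pairwise_masks(len(updates), dim, seed)
    masked = [[u + m for u, m in zip(update, mask)]
              for update, mask in zip(updates, masks)]
    # the aggregator adds the masked vectors; the pairwise masks cancel out
    total = [sum(column) for column in zip(*masked)]
    return masked, total


updates = [[1, 2, 3], [10, 20, 30]]
masked, total = secure_sum(updates)
print(total)  # [11, 22, 33]
```

Each party uploads only its masked vector, yet the sum the aggregator computes equals the sum of the unmasked updates.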

In [24]:
import torch as t
from torch import nn
from pipeline import fate_torch_hook
from pipeline.component import HomoNN
from pipeline.backend.pipeline import PipeLine
from pipeline.component import Reader, Evaluation, DataTransform
from pipeline.interface import Data, Model

fate_torch_hook(t)

import os
# bind the data path to a FATE table name & namespace
fate_project_path = os.path.abspath('../')
host_0 = 10000
host_1 = 9999
pipeline = PipeLine().set_initiator(role='host', party_id=host_0).set_roles(host=[host_0, host_1],
                                                                            arbiter=[host_0])

data_0 = {"name": "breast_host_0", "namespace": "experiment"}
data_1 = {"name": "breast_host_1", "namespace": "experiment"}

# for convenience, both parties use the same dataset in this example
data_path_0 = fate_project_path + '/examples/data/breast_homo_host.csv'
data_path_1 = fate_project_path + '/examples/data/breast_homo_host.csv'
pipeline.bind_table(name=data_0['name'], namespace=data_0['namespace'], path=data_path_0)
pipeline.bind_table(name=data_1['name'], namespace=data_1['namespace'], path=data_path_1)

{'namespace': 'experiment', 'table_name': 'breast_host_1'}

In [25]:
# define the reader
reader_0 = Reader(name="reader_0")
reader_0.get_party_instance(role='host', party_id=host_0).component_param(table=data_0)
reader_0.get_party_instance(role='host', party_id=host_1).component_param(table=data_1)

In [26]:
from pipeline.component.homo_nn import TrainerParam # Trainer interface: used to specify our trainer and pass its parameters
from pipeline.component.homo_nn import DatasetParam

# same settings as in the local test
model = t.nn.Sequential(
    t.nn.Linear(30, 16),
    t.nn.ReLU(),
    t.nn.Linear(16, 1),
    t.nn.Sigmoid()
)
loss = t.nn.BCELoss()  # loss function 
optimizer = t.optim.Adam(model.parameters(), lr=0.01)# optimizer

nn_component = HomoNN(name='nn_0',
                      model=model, # the model
                      loss=loss,
                      optimizer=optimizer,
                      dataset=DatasetParam(dataset_name='table'),
                      trainer=TrainerParam(trainer_name='mytrainer', epochs=10, batch_size=128)
                      )

In [27]:
# add the components to the pipeline, define the data I/O relations, then submit
pipeline.add_component(reader_0)
pipeline.add_component(nn_component, data=Data(train_data=reader_0.output.data))

<pipeline.backend.pipeline.PipeLine at 0x7f6057bf0820>

In [28]:
pipeline.compile()
pipeline.fit()

2022-11-15 17:07:08.088 | INFO     | pipeline.utils.invoker.job_submitter:monitor_job_status:83 - Job id is 202211151707072866880
2022-11-15 17:07:08.103 | INFO     | pipeline.utils.invoker.job_submitter:monitor_job_status:98 - Job is still waiting, time elapse: 0:00:00
2022-11-15 17:07:09.131 | INFO     | pipeline.utils.invoker.job_submitter:monitor_job_status:98 - Job is still waiting, time elapse: 0:00:01
2022-11-15 17:07:11.225 | INFO     | pipeline.utils.invoker.job_submitter:monitor_job_status:125 - 
2022-11-15 17:07:11.230 | INFO     | pipeline.utils.invoker.job_submitter:monitor_job_status:127 - Running component reader_0, time elapse: 0:00:03
2022-11-15 17:07:12.259 | INFO    

2022-11-15 17:07:50.904 | INFO     | pipeline.utils.invoker.job_submitter:monitor_job_status:127 - Running component eval_0, time elapse: 0:00:42
2022-11-15 17:07:52.975 | INFO     | pipeline.utils.invoker.job_submitter:monitor_job_status:89 - Job is success!!! Job id is 202211151707072866880
2022-11-15 17:07:52.977 | INFO     | pipeline.utils.invoker.job_submitter:monitor_job_status:90 - Total time: 0:00:44


The job is finished; you can now find your logs under this job in FATEBoard.