# Federated SVM Model with Scikit-learn

This tutorial shows how to train a federated SVM model on tabular data.

Before training, we need to set up NVFLARE.

## Setup NVFLARE

Follow [Getting Started](https://nvflare.readthedocs.io/en/main/getting_started.html) to set up a virtual environment and install NVFLARE.

You can also follow this [notebook](https://github.com/NVIDIA/NVFlare/blob/main/examples/nvflare_setup.ipynb) to get set up.

> Make sure you have installed nvflare from a **terminal**.


## Install requirements
Assuming the current directory is `/examples/hello-world/step-by-step/higgs/sklearn-svm`:

In [None]:
!pwd

In [None]:
%pip install -r requirements.txt

>Note:
In the upcoming sections, we'll use the `tree` command. On a Linux system that uses apt, you can install it with `sudo apt install tree`. As an alternative to `tree`, you can use `ls -al`.


## Prepare data
Please refer to the [prepare_higgs_data](../prepare_data.ipynb) notebook. Pay attention to the current location: you need to switch to the "higgs" directory to run the data split.
    

Now that the data is prepared, we are ready to train.

### Data Cleaning 

From time to time, small changes to the Higgs dataset cause the job to fail, so we may need to clean up the data or skip certain rows.
For example, at one point some floating-point values mistakenly contained an alphabetical letter. UCI may have already fixed this.
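
One way to spot such rows before training is to coerce every column to numeric and look for values that fail to parse. The sketch below uses pandas on a tiny in-memory sample. The column names and the malformed value are made up for illustration, and this is not the cleaning code used by the tutorial's scripts.

```python
import io

import pandas as pd

# Made-up sample: the second data row has a stray letter in a float field.
raw = io.StringIO("1.0,0.869,-0.635\n0.0,0.9o7,0.329\n1.0,0.798,1.470\n")
df = pd.read_csv(raw, header=None, names=["label", "f1", "f2"], dtype=str)

# Coerce to numbers; anything that does not parse becomes NaN.
numeric = df.apply(pd.to_numeric, errors="coerce")

# Row positions that contain at least one unparsable value.
bad_rows = numeric.index[numeric.isna().any(axis=1)].tolist()

# Keep only the rows where every field parsed.
clean = numeric.dropna().reset_index(drop=True)

print("rows to skip:", bad_rows)
print(clean)
```

The positions collected in `bad_rows` are the kind of values one could then pass as rows to skip when loading the file.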


## Scikit-learn
This tutorial uses [Scikit-learn](https://scikit-learn.org/), a widely used open-source machine learning library that supports supervised and unsupervised learning.


### Federated SVM Model
Here we use [SVM training](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) in a federated scenario.

Under this setting, federated learning can be formulated in two steps:

- local training: each client trains a local SVM model with their own data
- global training: server collects the support vectors from all clients and trains a global SVM model based on them
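
The two steps above can be sketched without any federated infrastructure. The example below is only an illustration under stated assumptions: it uses synthetic data from `make_classification` instead of the Higgs dataset, simulates three sites in a single process, and fixes the kernel to `"rbf"`. It is not the NVFLARE implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for the tabular data, split evenly across three "sites".
x, y = make_classification(n_samples=600, n_features=10, random_state=0)
site_data = list(zip(np.array_split(x, 3), np.array_split(y, 3)))

# Step 1, local training: each site fits its own SVM and keeps its support vectors.
support_x, support_y = [], []
for x_site, y_site in site_data:
    local_model = SVC(kernel="rbf").fit(x_site, y_site)
    idx = local_model.support_
    support_x.append(x_site[idx])
    support_y.append(y_site[idx])

# Step 2, global training: the server fits one SVM on the pooled support vectors.
global_x = np.concatenate(support_x)
global_y = np.concatenate(support_y)
global_model = SVC(kernel="rbf").fit(global_x, global_y)

print("pooled support vectors:", global_x.shape[0])
```

Only the support vectors leave each site in this sketch, which is why the global model can be trained on far fewer rows than the combined local datasets.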

Let's look at the code to see how we convert the local training script into a federated training script.


In [None]:
!pwd

In [None]:
!cat code/svm_fl.py

The code closely follows the standard scikit-learn training script, `code/svm_local.py`.

#### Load data

We first load the feature names from the header file, then the data from the main CSV file:
    
```
    site_name = flare.get_site_name()
    feature_data_path = f"{data_root_dir}/{site_name}_header.csv"
    features = load_features(feature_data_path)
    n_features = len(features) -1

    data_path = f"{data_root_dir}/{site_name}.csv"
    data = load_data(data_path=data_path, data_features=features, test_size=test_size, skip_rows=skip_rows)

```

We then transform the data and unpack the training and test sets, which are split according to the `test_size` provided.

```
    data = to_dataset_tuple(data)
    dataset = transform_data(data)
    x_train, y_train, train_size = dataset["train"]
    x_test, y_test, test_size = dataset["test"]

```
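The helpers `load_data`, `to_dataset_tuple`, and `transform_data` are defined in the tutorial's scripts and are not shown here. As a rough sketch of what such a split might look like, the hypothetical `split_dataset` function below assumes the label sits in the first column (which would match `n_features = len(features) - 1`) and returns the same `(x, y, size)` tuples that the snippet above unpacks. The data is random and only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_dataset(df: pd.DataFrame, test_size: float) -> dict:
    """Hypothetical helper: split rows, then separate label (column 0) from features."""
    train_df, test_df = train_test_split(df, test_size=test_size, random_state=0)

    def to_tuple(part: pd.DataFrame):
        # Features are every column after the first; the label is the first column.
        return part.iloc[:, 1:].to_numpy(), part.iloc[:, 0].to_numpy(), part.shape[0]

    return {"train": to_tuple(train_df), "test": to_tuple(test_df)}


# Random stand-in data: one binary label column followed by three feature columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["label", "f1", "f2", "f3"])
df["label"] = (df["label"] > 0).astype(int)

dataset = split_dataset(df, test_size=0.2)
x_train, y_train, train_size = dataset["train"]
x_test, y_test, n_test = dataset["test"]
print(x_train.shape, x_test.shape)
```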

The part that is specific to federated learning is the following code:

```
# (1) import nvflare client API
from nvflare import client as flare

```
```
# (2) initializes NVFlare client API
    flare.init()

    site_name = flare.get_site_name()
    
```
    
These few lines import the NVFLARE Client API, initialize it, and then use it to find the site name (such as site-1, site-2, etc.). With the site name, we can construct site-specific data paths such as:

```
    feature_data_path = f"{data_root_dir}/{site_name}_header.csv"

    data_path = f"{data_root_dir}/{site_name}.csv"
```

#### Training 

In standard scikit-learn, we would construct the model like this:
```
  model = SVC(...) 
```
then call `model.fit(...)` and evaluate the result:
```
  model.fit(x_train, y_train)

  auc, report = evaluate_model(x_test, model, y_test)

```
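The `evaluate_model` helper lives in the tutorial's scripts and is not shown above. A minimal stand-in, assuming it scores hard predictions with `roc_auc_score` and builds a text report with `classification_report`, could look like the following. The synthetic data and the `"rbf"` kernel are illustrative choices, not the tutorial's settings.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def evaluate_model(x_test, model, y_test):
    """Hypothetical stand-in: return (AUC, text report) for a fitted classifier."""
    y_pred = model.predict(x_test)
    auc = roc_auc_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)
    return auc, report


# Synthetic binary classification data in place of the Higgs dataset.
x, y = make_classification(n_samples=400, n_features=10, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

model = SVC(kernel="rbf")
model.fit(x_train, y_train)

auc, report = evaluate_model(x_test, model, y_test)
print(f"model AUC: {auc:.4f}")
print(report)
```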

With federated learning, using the FLARE Client API, we need to make a few changes:
* 1) We train not only in local iterations but also over global rounds, so we need to keep the program running until we reach the total number of rounds.
  
  ```
      while flare.is_running():
          ... rest of code
  
  ```
  
* 2) Unlike local learning, we now have more than one client/site participating in the training. To ensure every site starts with the same model parameters, we use the server to broadcast the initial model parameters to every site in the first round (`current_round = 0`).

* 3) We need to use the FLARE Client API to receive the global model and read the global parameters. Training only happens in the first round.

```
        # (3) receives FLModel from NVFlare
        input_model = flare.receive()
        global_params = input_model.params
        curr_round = input_model.current_round
```

```
        if curr_round == 0:
            # (4) initialize model with global_params
            model = SVC(
                kernel=global_params["kernel"]
            )
            # Train the model on the training set
            # note that SVM training only happens on first round
            model.fit(x_train, y_train)
```
* 4) If it is not the first round, we need to use the global model to update the local model for global model evaluation. We fit the model to the global support vectors.

```
            # (5) update model based on global parameters
            support_x = global_params["support_x"]
            support_y = global_params["support_y"]
            model.fit(support_x, support_y)
```

* 5) We evaluate the global model using the local data.

```
        # (6) evaluate model
        auc, report = evaluate_model(x_test, model, y_test)
```
* 6) We need to send the training result (support vectors) back to the server for the global round; the following code does that. In the second round, we simply send back the global support vectors.

```
        # (7) construct trained FL model
        # get support vectors
        if curr_round == 0:
            index = model.support_
            local_support_x = x_train[index]
            local_support_y = y_train[index]
        else:
            local_support_x = support_x
            local_support_y = support_y
        params = {"support_x": local_support_x, "support_y": local_support_y}
        metrics = {"accuracy": auc}
        output_model = flare.FLModel(params=params, metrics=metrics)

        # (8) send model back to NVFlare
        flare.send(output_model)
```
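The first-round branch above relies on `model.support_`, which holds the row indices of the support vectors in the training data. The short check below, run on synthetic data rather than the Higgs dataset, illustrates that indexing `x_train` with it recovers the same rows scikit-learn stores in `support_vectors_`, along with their labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic data in place of a site's training set.
x_train, y_train = make_classification(n_samples=200, n_features=5, random_state=0)
model = SVC(kernel="rbf").fit(x_train, y_train)

# Indices of the support vectors within x_train.
index = model.support_
local_support_x = x_train[index]
local_support_y = y_train[index]

# The indexed rows match the support vectors kept by the fitted model.
same = np.allclose(local_support_x, model.support_vectors_)
print(local_support_x.shape, local_support_y.shape, same)
```

Indexing the training arrays, rather than reading `support_vectors_` alone, is what makes the labels available for the later global fit.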

## Prepare Job  

Now that we have the code, we need to prepare a job folder with configurations to run in NVFLARE. To do this, we can leverage the job template for scikit-learn. First, look at the available job templates:

In [None]:
!nvflare config -jt ../../../../../job_templates/

In [None]:
!nvflare job list_templates

The `sklearn_svm` template is the one we need.

In [None]:
!nvflare job create -j /tmp/nvflare/jobs/sklearn_svm -force -w sklearn_svm \
-sd code \
-f config_fed_client.conf app_script="svm_fl.py" app_config="--data_root_dir /tmp/nvflare/dataset/output"

In [None]:
!cat /tmp/nvflare/jobs/sklearn_svm/app/config/config_fed_client.conf

In [None]:
!tree /tmp/nvflare/jobs/sklearn_svm

>Note:
 For potential on-the-fly data cleaning, we use `skip_rows = 0` to skip the first row. We could use `skip_rows = [0, 3]` to skip the first and fourth rows.
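
To illustrate the list form, the sketch below uses pandas directly on a made-up five-line file. Whether the tutorial's `load_data` helper forwards `skip_rows` to `pandas.read_csv` in this way is an assumption; the example only shows how a list of zero-based line numbers behaves in pandas itself.

```python
import io

import pandas as pd

# Made-up file with five lines and no header; "r0" to "r4" mark the line positions.
raw = io.StringIO("r0,0.1\nr1,0.2\nr2,0.3\nr3,0.4\nr4,0.5\n")

# A list passed to skiprows names the zero-based line numbers to drop.
df = pd.read_csv(raw, header=None, names=["id", "value"], skiprows=[0, 3])

kept = df["id"].tolist()
print(kept)
```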



## Run job in simulator

We use the simulator to run this job:

In [None]:
!nvflare simulator /tmp/nvflare/jobs/sklearn_svm -w /tmp/nvflare/sklearn_svm -n 3 -t 3

Let's examine the results.

In the FL training log, after the global SVM round, site-1 reports `site-1: model AUC: 0.6403`.
Now let's run local training to check whether this number makes sense.

In [None]:
!python3 ./code/svm_local.py --data_root_dir /tmp/nvflare/dataset/output

Since federated SVM can benefit from the support vectors of other clients, we expect the FL result to be better than local training.
The final result for local SVM learning is `model AUC: 0.6217`, compared with FL's `model AUC: 0.6403`, which is consistent with our expectation.

## We are done!
Congratulations! You have just completed the federated SVM model for tabular data.