# Federated SVM with Scikit-learn on Breast Cancer Dataset

## Introduction to Scikit-learn, tabular data, and federated SVM
### Scikit-learn
This example shows how to use [NVIDIA FLARE](https://nvflare.readthedocs.io/en/2.3/index.html) on tabular data.
It uses [Scikit-learn](https://scikit-learn.org/),
a widely used open-source machine learning library that supports supervised 
and unsupervised learning.
### Tabular data
The data used in this example is tabular, in a format that [pandas](https://pandas.pydata.org/) can read, where:
- rows correspond to data samples
- the first column holds the label
- the remaining columns hold the features

Each client is expected to have one local data file containing both training 
and validation samples. To load the data for each client, the following 
parameters are expected by the local learner:
- data_file_path: string, the full path to the client's data file 
- train_start: int, start row index for the training set
- train_end: int, end row index for the training set
- valid_start: int, start row index for the validation set
- valid_end: int, end row index for the validation set
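
As a minimal sketch of how these parameters could map onto a pandas call, the snippet below reads one row range from a label-first CSV file. The helper name `load_split` and the tiny demo file are hypothetical, and two details are assumptions rather than facts about the example's learner: the file is taken to have no header row, and the end index is treated as exclusive.

```python
import os
import tempfile

import pandas as pd


def load_split(data_file_path, start, end):
    """Return (features, labels) for rows [start, end) of a label-first CSV."""
    # Assumptions: the file has no header row and `end` is exclusive.
    df = pd.read_csv(data_file_path, header=None, skiprows=start, nrows=end - start)
    labels = df.iloc[:, 0].to_numpy()    # first column: label
    features = df.iloc[:, 1:].to_numpy()  # remaining columns: features
    return features, labels


# Tiny demonstration file: 5 rows, label in column 0, two feature columns.
rows = ["0,1.0,2.0", "1,3.0,4.0", "0,5.0,6.0", "1,7.0,8.0", "0,9.0,10.0"]
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(demo_path, "w") as f:
    f.write("\n".join(rows) + "\n")

# train_start=0, train_end=3, valid_start=3, valid_end=5
x_train, y_train = load_split(demo_path, 0, 3)
x_valid, y_valid = load_split(demo_path, 3, 5)
print(x_train.shape, y_train.tolist(), x_valid.shape, y_valid.tolist())
```

Reading only the requested rows keeps the training and validation sets in a single file per client while still letting each be loaded independently.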

### Federated SVM
The machine learning algorithm shown in this example is [SVM for Classification (SVC)](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).
Under this setting, federated learning can be formulated in two steps:
- local training: each client trains a local SVM model on its own data
- global training: the server collects the support vectors from all clients and 
  trains a global SVM model on them

Unlike iterative federated algorithms, federated SVM involves only 
these two training steps. Hence, in the server config, we set
`"num_rounds": 2`.
The first round is the training round, in which clients train locally and the server aggregates the results into a global model. 
In the second round, the global model is sent back to the clients, 
which validate it and update their local models. 
If `num_rounds` is set to a value greater than 2, the system reports an error and exits.
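
The two training steps can be illustrated outside of NVFLARE with plain scikit-learn. This is a single-process sketch under stated assumptions, not the example's actual learner or aggregator: the three "clients" are simulated by splitting a shuffled copy of the dataset, the 20% validation holdout and the fixed random seed are choices made only for this illustration, and accuracy is used as a simple stand-in metric.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Shuffle with a fixed seed (an assumption of this sketch) so every
# simulated client sees both classes.
order = np.random.default_rng(0).permutation(len(X))
X, y = X[order], y[order]

# Hold out the last 20% of rows for validation.
n_valid = int(0.2 * len(X))
X_train, y_train = X[:-n_valid], y[:-n_valid]
X_valid, y_valid = X[-n_valid:], y[-n_valid:]

# Step 1, local training: each simulated client fits its own SVC and
# keeps only the samples that became support vectors.
sv_features, sv_labels = [], []
for X_c, y_c in zip(np.array_split(X_train, 3), np.array_split(y_train, 3)):
    local_model = SVC().fit(X_c, y_c)
    sv_features.append(X_c[local_model.support_])
    sv_labels.append(y_c[local_model.support_])

# Step 2, global training: the "server" fits one SVC on the pooled
# support vectors from all clients.
X_sv = np.concatenate(sv_features)
y_sv = np.concatenate(sv_labels)
global_model = SVC().fit(X_sv, y_sv)

accuracy = global_model.score(X_valid, y_valid)
print(f"{len(X_sv)} pooled support vectors, validation accuracy {accuracy:.3f}")
```

Because only support vectors leave each client, the server trains on a subset of the local samples rather than on the full local datasets.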

The steps to run this example are listed below.

## 1. Set up NVFLARE

Follow the [Getting Started](https://nvflare.readthedocs.io/en/2.3/getting_started.html) guide to set up a virtual environment and install NVFLARE.

We also provide a [Notebook](../../nvflare_setup.ipynb) for this setup process. 

Assuming you have already set up the virtual environment, first install the required packages.

In [None]:
%pip install -r requirements.txt

## 2. Data preparation 
This example uses the breast cancer dataset available from Scikit-learn's dataset API.

In [None]:
%env DATASET_PATH=/tmp/nvflare/dataset/sklearn_breast_cancer.csv
!python3 ./utils/prepare_data.py --dataset_name cancer --randomize 0 --out_path ${DATASET_PATH}
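
For orientation, a label-first CSV of this dataset can be produced with a few lines of scikit-learn and pandas. This is a rough sketch and not the contents of `prepare_data.py`: writing the file without a header row and without shuffling are assumptions, and a temporary directory is used here in place of `DATASET_PATH`.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load features and binary labels from Scikit-learn's dataset API.
X, y = load_breast_cancer(return_X_y=True)

# Put the label in the first column, followed by the feature columns.
df = pd.DataFrame(X)
df.insert(0, "label", y)

# Assumption: no header row and no shuffling; a temp dir stands in for DATASET_PATH.
out_path = os.path.join(tempfile.mkdtemp(), "sklearn_breast_cancer.csv")
df.to_csv(out_path, index=False, header=False)

# Read the file back to confirm the layout: one label column plus the features.
check = pd.read_csv(out_path, header=None)
print(check.shape)
```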

## 3. Prepare clients' configs with proper data split information 
We use NVFlare's FL Simulator to run the following experiments. Here we simulate 3 clients with a uniform data split.

In [None]:
%env DATASET_PATH=/tmp/nvflare/dataset/sklearn_breast_cancer.csv
!python3 ./utils/prepare_job_config.py \
        --task_name "sklearn_svm" \
        --data_path "${DATASET_PATH}" \
        --site_num 3 \
        --valid_frac 0.2 \
        --split_method "uniform"
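
To make the split parameters concrete, the sketch below computes row ranges for a uniform split. The helper `uniform_split` is hypothetical and may differ from what `prepare_job_config.py` does: it assumes the first `valid_frac` of the rows forms a validation range shared by all sites, that the remaining rows are divided as evenly as possible, and that end indices are exclusive.

```python
def uniform_split(n_rows, site_num, valid_frac):
    """Hypothetical helper: per-site row ranges for a uniform split."""
    # Assumption: the first rows are a validation range shared by all sites.
    n_valid = int(n_rows * valid_frac)
    n_train = n_rows - n_valid
    # Spread any remainder over the first sites so sizes differ by at most 1.
    base, extra = divmod(n_train, site_num)

    splits = {}
    start = n_valid
    for i in range(site_num):
        size = base + (1 if i < extra else 0)
        splits[f"site-{i + 1}"] = {
            "train_start": start,
            "train_end": start + size,  # exclusive
            "valid_start": 0,
            "valid_end": n_valid,       # exclusive
        }
        start += size
    return splits


# The breast cancer dataset has 569 rows.
splits = uniform_split(n_rows=569, site_num=3, valid_frac=0.2)
for site, ranges in splits.items():
    print(site, ranges)
```

The resulting keys mirror the `train_start`, `train_end`, `valid_start`, and `valid_end` parameters expected by the local learner.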

## 4. Run simulated SVM experiment
Now that the job configs are ready, we run the experiment using the Simulator.

The Simulator can be run either with the CLI command:

In [None]:
! nvflare simulator ./jobs/sklearn_svm_3_uniform -w /tmp/nvflare/sklearn_svm_cancer -n 3 -t 3

or via the Simulator API:

In [None]:
from nvflare.private.fed.app.simulator.simulator_runner import SimulatorRunner

# Same settings as the CLI call above: job folder, workspace, 3 clients, 3 threads.
simulator = SimulatorRunner(
    job_folder="./jobs/sklearn_svm_3_uniform",
    workspace="/tmp/nvflare/sklearn_svm_cancer",
    n_clients=3,
    threads=3,
)
run_status = simulator.run()
print("Simulator finished with run_status", run_status)

## 5. Result visualization
With the default [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) classifier, the resulting global model reaches an AUC of 0.806, which can be visualized in TensorBoard.

In [None]:
%load_ext tensorboard
%tensorboard --logdir /tmp/nvflare/sklearn_svm_cancer