# Federated Linear Model with Scikit-learn on HIGGS Dataset

## Introduction to Scikit-learn, tabular data, and federated Linear Model
### Scikit-learn
This example shows how to use [NVIDIA FLARE](https://nvflare.readthedocs.io/en/main/index.html) on tabular data.
It uses [Scikit-learn](https://scikit-learn.org/),
a widely used open-source machine learning library that supports supervised 
and unsupervised learning.
### Tabular data
The data used in this example is tabular in a format that can be handled by [pandas](https://pandas.pydata.org/), such that:
- rows correspond to data samples
- the first column holds the label
- the remaining columns hold the features.

Each client is expected to have one local data file containing both training 
and validation samples. To load the data for each client, the following 
parameters are expected by the local learner:
- `data_file_path`: string, the full path to the client's data file
- `train_start`: int, start row index for the training set
- `train_end`: int, end row index for the training set
- `valid_start`: int, start row index for the validation set
- `valid_end`: int, end row index for the validation set
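
As a minimal sketch of how these parameters could be used with pandas, the helper below reads a row range from a CSV file and separates the label column from the features. The function names `load_split` and `load_client_data` are hypothetical, and two details are assumptions rather than facts about the local learner: the file has no header row, and the end index is exclusive.

```python
import pandas as pd


def load_split(data_file_path, start, end):
    """Read rows [start, end) of a header-less CSV (assumed layout).

    Column 0 is the label; all other columns are features.
    """
    df = pd.read_csv(
        data_file_path,
        header=None,        # assumption: no header row in the file
        skiprows=start,     # skip the rows before the requested range
        nrows=end - start,  # assumption: `end` is exclusive
    )
    y = df.iloc[:, 0].to_numpy()
    x = df.iloc[:, 1:].to_numpy()
    return x, y


def load_client_data(data_file_path, train_start, train_end, valid_start, valid_end):
    """Return ((x_train, y_train), (x_valid, y_valid)) for one client."""
    train = load_split(data_file_path, train_start, train_end)
    valid = load_split(data_file_path, valid_start, valid_end)
    return train, valid
```

Reading only the requested rows with `skiprows` and `nrows` avoids loading the whole file into memory, which matters for a dataset of this size.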

### Federated Linear Model
This example shows the use of [linear classifiers with SGD training](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html) in a federated scenario.
In this setting, federated learning can be formulated as a [FedAvg](https://arxiv.org/abs/1602.05629) process in which each client optimizes its local model with SGD, starting from the global parameters.
This can be achieved by setting the `warm_start` flag of `SGDClassifier` to
`True`, which allows the classifier to be fitted repeatedly to the local data while reusing its current parameters as the initialization.
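
The idea can be sketched in a few lines of plain Scikit-learn, without any NVFLARE components. This is an illustration under stated assumptions, not the implementation used by this example: the helper names (`local_update`, `fedavg`, `run_fedavg`) are hypothetical, the global model starts from zeros, every client holds samples of both classes, and aggregation is a sample-count-weighted average.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier


def local_update(clf, global_coef, global_intercept, x, y):
    """One round of local training, starting from the global parameters."""
    # With warm_start=True, fit() uses existing coef_/intercept_ as the
    # starting point instead of re-initializing them.
    clf.coef_ = global_coef.copy()
    clf.intercept_ = global_intercept.copy()
    clf.fit(x, y)
    return clf.coef_.copy(), clf.intercept_.copy(), len(y)


def fedavg(updates):
    """Average (coef, intercept) pairs, weighted by local sample counts."""
    total = sum(n for _, _, n in updates)
    coef = sum(c * n for c, _, n in updates) / total
    intercept = sum(b * n for _, b, n in updates) / total
    return coef, intercept


def run_fedavg(client_data, n_features, num_rounds=3, seed=0):
    """client_data: list of (x, y) pairs, one per client (binary labels)."""
    clients = [
        SGDClassifier(warm_start=True, max_iter=1, tol=None, random_state=seed)
        for _ in client_data
    ]
    # Assumption: the global model is initialized with zeros.
    coef, intercept = np.zeros((1, n_features)), np.zeros(1)
    for _ in range(num_rounds):
        updates = [
            local_update(clf, coef, intercept, x, y)
            for clf, (x, y) in zip(clients, client_data)
        ]
        coef, intercept = fedavg(updates)
    return coef, intercept
```

Setting `max_iter=1` with `tol=None` makes each local fit a single pass over the client's data, so the number of local epochs per round is explicit.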

The steps to run this example are listed below.

## 1. Set up NVFLARE

Follow the [Getting Started](https://nvflare.readthedocs.io/en/main/getting_started.html) guide to set up a virtual environment and install NVFLARE.

We also provide a [Notebook](../../nvflare_setup.ipynb) for this setup process. 

Assuming you have already set up the virtual environment, first install the required packages.

In [None]:
%pip install -r requirements.txt

## 2. Data preparation 
This example illustrates a binary classification task based on the [HIGGS dataset](https://mlphysics.ics.uci.edu/data/higgs/).
This dataset contains 11 million instances, each with 28 attributes.
By default, we assume the dataset is downloaded, uncompressed, and stored in `DATASET_ROOT/HIGGS.csv`.
Please note that the UCI website may experience occasional downtime.


## 3. Run simulated experiment
We run the federated training with the NVFlare Simulator via the [Job Recipe API](https://nvflare.readthedocs.io/en/main/programming_guide/fed_job_api.html). The `job.py` script handles data splitting and job configuration automatically.

In [None]:
! python job.py --n_clients 5 --num_rounds 30
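
The splitting logic lives inside `job.py` and is not reproduced here. As a rough illustration of what a split into per-client row ranges can look like, the sketch below reserves the first rows of the file as a shared validation set and divides the remaining rows evenly among the clients. The function name `uniform_split` and the choice of a shared validation range are assumptions made for this illustration and may differ from what `job.py` actually does.

```python
def uniform_split(total_rows, n_clients, valid_rows):
    """Return one dict of row indices per client (end indices exclusive).

    Assumed scheme: rows [0, valid_rows) form a validation set shared by
    all clients; the remaining rows are divided evenly for training.
    """
    if n_clients <= 0 or not 0 <= valid_rows < total_rows:
        raise ValueError("invalid split configuration")
    base = (total_rows - valid_rows) // n_clients
    splits = []
    for i in range(n_clients):
        start = valid_rows + i * base
        # The last client also takes any leftover rows.
        end = total_rows if i == n_clients - 1 else start + base
        splits.append(
            {
                "train_start": start,
                "train_end": end,
                "valid_start": 0,
                "valid_end": valid_rows,
            }
        )
    return splits
```

The returned keys match the row-index parameters expected by the local learner, so each dict can be passed to one client together with its `data_file_path`.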

## 4. Results
The validation AUC is printed during training (look for "Validation AUC" in the output above). The final trained model is saved to `/tmp/nvflare/simulation/sklearn_linear/server/simulate_job/app_server/model_param.joblib`.
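
To inspect the saved file afterwards, a sketch along the following lines may help. The layout of the object stored in `model_param.joblib` is not described here, so `load_saved_object` only loads and returns it. The helper `linear_auc` (a hypothetical name) shows how an AUC can be computed for a linear model once a coefficient vector and an intercept are available. How those two values are stored in the file is an assumption you should verify by inspecting the loaded object.

```python
import joblib
import numpy as np
from sklearn.metrics import roc_auc_score


def load_saved_object(model_path):
    """Load whatever object was saved with joblib; inspect it before use."""
    return joblib.load(model_path)


def linear_auc(coef, intercept, x_valid, y_valid):
    """AUC of a binary linear model defined by `coef` and `intercept`.

    Scores are the raw decision values x @ w + b; AUC depends only on
    their ranking, so no sigmoid or threshold is needed.
    """
    w = np.asarray(coef, dtype=float).ravel()
    b = float(np.asarray(intercept, dtype=float).ravel()[0])
    scores = np.asarray(x_valid, dtype=float) @ w + b
    return roc_auc_score(y_valid, scores)
```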