# Federated K-Means Clustering with Scikit-learn on Iris Dataset

## Introduction to Scikit-learn, tabular data, and federated k-Means
### Scikit-learn
This example shows how to use [NVIDIA FLARE](https://nvflare.readthedocs.io/en/main/index.html) on tabular data.
It uses [Scikit-learn](https://scikit-learn.org/),
a widely used open-source machine learning library that supports supervised 
and unsupervised learning.
### Tabular data
The data used in this example is tabular in a format that can be handled by [pandas](https://pandas.pydata.org/), such that:
- rows correspond to data samples
- the first column represents the label 
- the other columns cover the features.    

Each client is expected to have one local data file containing both training 
and validation samples. To load the data for each client, the following 
parameters are expected by the local learner:
- data_file_path: string, the full path to the client's data file 
- train_start: int, start row index for the training set
- train_end: int, end row index for the training set
- valid_start: int, start row index for the validation set
- valid_end: int, end row index for the validation set
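As an illustration of how these parameters might be used, the sketch below reads a row range from a file with pandas and separates the label column from the features. The helper name `load_split` is hypothetical, and the sketch assumes a header-less CSV and an exclusive end index; the actual local learner may differ on both points.

```python
import io

import pandas as pd


def load_split(data_file_path, start, end):
    """Read rows [start, end) and split them into features and labels."""
    # Assumption: no header row in the file, and `end` is exclusive.
    df = pd.read_csv(data_file_path, header=None, skiprows=start, nrows=end - start)
    labels = df.iloc[:, 0].to_numpy()     # first column: label
    features = df.iloc[:, 1:].to_numpy()  # remaining columns: features
    return features, labels


# Tiny in-memory stand-in for a client's data file (label, then two features).
demo_file = io.StringIO("0,5.1,3.5\n1,7.0,3.2\n2,6.3,3.3\n0,4.9,3.0\n")
x_train, y_train = load_split(demo_file, start=1, end=3)
print(x_train.shape, y_train)
```

The same helper would be called a second time with `valid_start` and `valid_end` to obtain the validation set.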

### Federated k-Means clustering
The machine learning algorithm in this example is [k-Means clustering](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html).
The aggregation follows the scheme defined in [Mini-batch k-Means](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html). 
Under this setting, each round of federated learning can be formulated as follows:
- local training: starting from the global centers, each client trains a local MiniBatchKMeans model on its own data
- global aggregation: the server collects the cluster centers and 
  per-center counts from all clients, aggregates them by treating 
  each client's result as a mini-batch, and updates the global centers and per-center counts.
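One round as described above could be sketched as follows. This is a simplified illustration and not NVFlare's actual implementation: the function names `local_train` and `aggregate` are hypothetical, the data is synthetic, and the per-center counts are approximated by the number of local samples assigned to each center.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans


def local_train(x, global_centers):
    """Refine the global centers on one client's data."""
    k = global_centers.shape[0]
    model = MiniBatchKMeans(n_clusters=k, init=global_centers, n_init=1, random_state=0)
    model.fit(x)
    # Assumption: use the number of local samples per center as the count.
    counts = np.bincount(model.labels_, minlength=k)
    return model.cluster_centers_, counts


def aggregate(global_centers, global_counts, client_results):
    """Count-weighted update that treats each client result as one mini-batch."""
    sums = global_centers * global_counts[:, None]
    counts = global_counts.astype(float).copy()
    for centers, client_counts in client_results:
        sums += centers * client_counts[:, None]
        counts += client_counts
    # Centers that received no samples keep their previous position.
    safe = np.maximum(counts, 1.0)
    new_centers = np.where(counts[:, None] > 0, sums / safe[:, None], global_centers)
    return new_centers, counts


rng = np.random.default_rng(0)
# Two synthetic clients, each holding points around (0, 0) and (5, 5).
clients = [
    np.vstack([rng.normal(0.0, 0.5, (20, 2)), rng.normal(5.0, 0.5, (20, 2))])
    for _ in range(2)
]
global_centers = np.array([[0.0, 0.0], [5.0, 5.0]])
global_counts = np.zeros(2)

results = [local_train(x, global_centers) for x in clients]
new_centers, new_counts = aggregate(global_centers, global_counts, results)
print(new_centers)
print(new_counts)
```

Carrying the counts from round to round means that later rounds move the global centers less, which mirrors the decreasing per-center learning rate of mini-batch k-Means.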

For center initialization, in the first round each client generates its initial centers with the k-means++ method. The server then collects all initial centers and performs one round of k-means to generate the initial global centers.

The steps to run this example are listed below.
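A minimal sketch of this initialization, using scikit-learn's `kmeans_plusplus` and `KMeans`, is shown here. The client data is synthetic, and "one round of k-means" is interpreted as a single `KMeans` fit on the stacked client proposals; both are assumptions of the sketch rather than details confirmed by the example's code.

```python
import numpy as np
from sklearn.cluster import KMeans, kmeans_plusplus

rng = np.random.default_rng(0)
k = 3
# Hypothetical local data for three clients (4 features, like Iris).
client_data = [rng.normal(size=(50, 4)) for _ in range(3)]

# Each client proposes k initial centers with k-means++.
proposals = [kmeans_plusplus(x, n_clusters=k, random_state=0)[0] for x in client_data]

# The server clusters the 3 * k proposed centers to obtain k global centers.
stacked = np.vstack(proposals)
server_model = KMeans(n_clusters=k, n_init=1, random_state=0).fit(stacked)
init_global_centers = server_model.cluster_centers_
print(init_global_centers.shape)
```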

## 1. Set up NVFLARE

Follow the [Getting_Started](https://nvflare.readthedocs.io/en/main/getting_started.html) guide to set up a virtual environment and install NVFLARE.

We also provide a [Notebook](../../nvflare_setup.ipynb) for this setup process. 

Assuming you have already set up the venv, let's first install the required packages.

In [1]:
!pip install -r requirements.txt

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com


## 2. Data preparation 
This example uses the Iris dataset available from Scikit-learn's dataset API.  

In [2]:
%env DATASET_PATH=/tmp/nvflare/dataset/sklearn_iris.csv
%env OVERLAP=10000
!python3 ./utils/prepare_data.py --dataset_name iris --randomize 1 --out_path ${DATASET_PATH}

env: DATASET_PATH=/tmp/nvflare/dataset/sklearn_iris.csv
env: OVERLAP=10000
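For readers who want to see what such a preparation step might look like, the sketch below builds a label-first, shuffled table from scikit-learn's Iris loader. It is not the content of `prepare_data.py`; the column order follows the tabular layout described in the introduction, and the choice to write no header row is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
# Label first, features after, matching the tabular layout described earlier.
table = pd.DataFrame(np.column_stack([iris.target, iris.data]))
# Shuffle rows so a later sequential split gives each client a mix of classes.
table = table.sample(frac=1.0, random_state=0).reset_index(drop=True)
# Assumption: write without header or index; shown as text instead of a file.
csv_text = table.to_csv(header=False, index=False)
print(table.shape)
```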


## 3. Prepare clients' configs with proper data split information 
We use NVFlare's FL simulator to run the following experiments. Here we simulate 3 clients with a uniform data split. Since the dataset was already randomized in the previous step, the split is done sequentially. Because clustering is unsupervised, we use the whole dataset with its ground-truth labels to validate performance.

In [3]:
%env DATASET_PATH=/tmp/nvflare/dataset/sklearn_iris.csv
!python3 ./utils/prepare_job_config.py --task_name "sklearn_kmeans" --data_path "${DATASET_PATH}" --site_num 3 --valid_frac 1 --split_method "uniform"

env: DATASET_PATH=/tmp/nvflare/dataset/sklearn_iris.csv
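The sequential uniform split can be pictured with the small sketch below, which computes one training row range per site for the 150-row Iris dataset. The helper `uniform_split` is hypothetical and is not taken from `prepare_job_config.py`; in particular, the exclusive end index and the handling of remainder rows are assumptions.

```python
def uniform_split(n_rows, site_num):
    """Return one (train_start, train_end) pair per site, end exclusive."""
    base = n_rows // site_num
    bounds = []
    for i in range(site_num):
        start = i * base
        # The last site absorbs any remainder rows (an assumption of this sketch).
        end = n_rows if i == site_num - 1 else start + base
        bounds.append((start, end))
    return bounds


splits = uniform_split(n_rows=150, site_num=3)
print(splits)  # [(0, 50), (50, 100), (100, 150)]
```

With a validation fraction of 1, every site would validate on the full row range, from row 0 to row 150 under the same exclusive-end convention.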


## 4. Run simulated k-Means experiment
Now that the job configs are ready, we run the experiment using the Simulator.

The Simulator can be run either with the CLI command:

In [5]:
! nvflare simulator ./jobs/sklearn_kmeans_3_uniform -w /tmp/nvflare/sklearn_kmeans_iris -n 3 -t 3

2023-03-13 17:17:59,166 - SimulatorRunner - INFO - Create the Simulator Server.
2023-03-13 17:17:59,168 - Cell - INFO - server: creating listener on tcp://0:58469
2023-03-13 17:17:59,169 - Cell - INFO - server: created backbone external listener for tcp://0:58469
2023-03-13 17:17:59,169 - ConnectorManager - INFO - 16045: Try start_listener Listener resources: {'secure': False, 'host': 'localhost'}
2023-03-13 17:17:59,169 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00002 PASSIVE uds:///tmp/nvflare_599550] is starting
2023-03-13 17:17:59,670 - Cell - INFO - server: created backbone internal listener for uds:///tmp/nvflare_599550
2023-03-13 17:17:59,671 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00001 PASSIVE tcp://0:58469] is starting
2023-03-13 17:17:59,745 - nvflare.fuel.hci.server.hci - INFO - Starting Admin Server localhost on Port 38749
2023-03-13 17:17:59,745 - SimulatorRunner - INFO - Deploy the Apps.
2023-03-13 17:17:59,750 - SimulatorRunner - INFO - Cr

or via the Simulator API:

In [7]:
from nvflare.private.fed.app.simulator.simulator_runner import SimulatorRunner

simulator = SimulatorRunner(
    job_folder="./jobs/sklearn_kmeans_3_uniform",
    workspace="/tmp/nvflare/sklearn_kmeans_iris",
    n_clients=3,
    threads=3
)
run_status = simulator.run()
print("Simulator finished with run_status", run_status)

2023-03-13 17:22:35,687 - SimulatorRunner - INFO - Create the Simulator Server.
2023-03-13 17:22:35,691 - Cell - INFO - server: creating listener on tcp://0:57225
2023-03-13 17:22:35,692 - Cell - INFO - server: created backbone external listener for tcp://0:57225
2023-03-13 17:22:35,693 - ConnectorManager - INFO - 17294: Try start_listener Listener resources: {'secure': False, 'host': 'localhost'}
2023-03-13 17:22:35,695 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00002 PASSIVE uds:///tmp/nvflare_860831] is starting
2023-03-13 17:22:36,197 - Cell - INFO - server: created backbone internal listener for uds:///tmp/nvflare_860831
2023-03-13 17:22:36,200 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00001 PASSIVE tcp://0:57225] is starting
2023-03-13 17:22:36,286 - nvflare.fuel.hci.server.hci - INFO - Starting Admin Server localhost on Port 33841
2023-03-13 17:22:36,287 - SimulatorRunner - INFO - Deploy the Apps.
2023-03-13 17:22:36,294 - SimulatorRunner - INFO - Cr

2023-03-13 17:22:45.561492: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-13 17:22:45.586717: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-13 17:22:45.599468: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the ap

2023-03-13 17:22:46,889 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=scatter_and_gather, peer=site-3, peer_run=simulate_job, task_name=train, task_id=89889d76-c835-4470-b901-15ff9372ca3e]: assigned task to client site-3: name=train, id=89889d76-c835-4470-b901-15ff9372ca3e
2023-03-13 17:22:46,891 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=scatter_and_gather, peer=site-3, peer_run=simulate_job, task_name=train, task_id=89889d76-c835-4470-b901-15ff9372ca3e]: sent task assignment to client
2023-03-13 17:22:46,904 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=scatter_and_gather, peer=site-3, peer_run=simulate_job]: got result from client site-3 for task: name=train, id=89889d76-c835-4470-b901-15ff9372ca3e
2023-03-13 17:22:46,906 - ScatterAndGather - INFO - [identity=simulator_server, run=simulate_job, wf=scatter_and_gather, peer=site-3, peer_run=simulate_job, peer_rc=OK, task_name=train, task_id=89889d76-c8



2023-03-13 17:22:47,477 - ScatterAndGather - INFO - [identity=simulator_server, run=simulate_job, wf=scatter_and_gather]: End aggregation.
2023-03-13 17:22:47,479 - ScatterAndGather - INFO - [identity=simulator_server, run=simulate_job, wf=scatter_and_gather]: Start persist model on server.
2023-03-13 17:22:47,480 - ScatterAndGather - INFO - [identity=simulator_server, run=simulate_job, wf=scatter_and_gather]: End persist model on server.
2023-03-13 17:22:47,481 - ScatterAndGather - INFO - [identity=simulator_server, run=simulate_job, wf=scatter_and_gather]: Round 0 finished.
2023-03-13 17:22:47,482 - ScatterAndGather - INFO - [identity=simulator_server, run=simulate_job, wf=scatter_and_gather]: Round 1 started.
2023-03-13 17:22:47,483 - ScatterAndGather - INFO - [identity=simulator_server, run=simulate_job, wf=scatter_and_gather]: scheduled task train
2023-03-13 17:22:47,284 - ClientTaskWorker - INFO - Finished one task run for client: site-2
2023-03-13 17:22:47,310 - ClientTaskWorker

## 5. Result visualization
Model accuracy is computed as the homogeneity score between the clusters formed and the ground-truth labels, and it can be visualized in TensorBoard.

In [12]:
%load_ext tensorboard
%tensorboard --logdir /tmp/nvflare/sklearn_kmeans_iris

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


Reusing TensorBoard on port 6006 (pid 17128), started 0:31:29 ago. (Use '!kill 17128' to kill it.)
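To make the metric from this section concrete, the sketch below computes a homogeneity score with scikit-learn on a centrally trained k-Means model. It is a standalone illustration, not the federated job's validation code, and the `KMeans` settings are arbitrary choices for the sketch.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import homogeneity_score

iris = load_iris()
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(iris.data)

# The score lies in [0, 1]; 1 means every cluster contains a single class.
score = homogeneity_score(iris.target, model.labels_)
print(f"homogeneity: {score:.3f}")

# The score ignores how cluster ids are numbered, so no label matching is needed.
permuted = homogeneity_score([0, 0, 1, 1], [1, 1, 0, 0])
print(permuted)
```

Because the score is invariant to the numbering of cluster ids, it suits clustering, where cluster indices carry no inherent meaning.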