# Federated K-Means Clustering with Scikit-learn on Iris Dataset

## 1. Download the Iris dataset
To simulate a horizontally split dataset, we first load the Iris dataset through scikit-learn and write it to a CSV file.

In [1]:
%env DATASET_PATH=/tmp/nvflare/dataset/sklearn_iris.csv
%env OVERLAP=10000
!python3 ./utils/prepare_data.py --dataset_name iris --randomize 1 --out_path ${DATASET_PATH}

env: DATASET_PATH=/tmp/nvflare/dataset/sklearn_iris.csv
env: OVERLAP=10000
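As a rough illustration of this step, the sketch below loads Iris from scikit-learn and shuffles its rows once. It is not the code of `prepare_data.py`: the helper name `load_shuffled_iris`, the seed, and the column layout (label first, features after) are assumptions made for this example.

```python
import numpy as np
from sklearn import datasets


def load_shuffled_iris(seed=0):
    """Return Iris as one array, label in column 0 (assumed layout), rows shuffled."""
    iris = datasets.load_iris()
    # Stack the label next to the four features so each row is one sample.
    data = np.column_stack([iris.target, iris.data])
    rng = np.random.default_rng(seed)
    rng.shuffle(data)  # shuffles rows in place, keeping each row intact
    return data


data = load_shuffled_iris()
print(data.shape)
```

Saving the array with `np.savetxt(path, data, delimiter=",")` would produce a CSV comparable in spirit to the one written to `DATASET_PATH`, though the exact format used by the script may differ.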


## 2. Prepare clients' configs with proper data split information 
We use NVFlare's FL simulator to run the following experiments, simulating 3 clients with a uniform data split. Because the dataset was already randomized in the previous step, the split is done sequentially. Because clustering is unsupervised, we validate on the whole dataset (`--valid_frac 1`), using the ground-truth labels only to measure performance.

In [2]:
%env DATASET_PATH=/tmp/nvflare/dataset/sklearn_iris.csv
!python3 ./utils/prepare_job_config.py --task_name "sklearn_kmeans" --data_path "${DATASET_PATH}" --site_num 3 --valid_frac 1 --split_method "uniform"

env: DATASET_PATH=/tmp/nvflare/dataset/sklearn_iris.csv
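To make the idea of a sequential uniform split concrete, here is a minimal sketch that assigns contiguous, near-equal row ranges to each site. The function name `uniform_split` and the `site-N` keys are chosen for this illustration (the site names mirror those in the simulator logs); the actual layout written by `prepare_job_config.py` may differ.

```python
def uniform_split(n_rows, n_sites):
    """Return {site_name: (start, end)} half-open row ranges of near-equal size."""
    base, extra = divmod(n_rows, n_sites)
    ranges = {}
    start = 0
    for i in range(n_sites):
        # The first `extra` sites take one additional row when the
        # row count does not divide evenly.
        size = base + (1 if i < extra else 0)
        ranges[f"site-{i + 1}"] = (start, start + size)
        start += size
    return ranges


print(uniform_split(150, 3))
# {'site-1': (0, 50), 'site-2': (50, 100), 'site-3': (100, 150)}
```

Because the rows were shuffled beforehand, contiguous ranges like these already give each site a random sample of the data.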


## 3. Run simulated kmeans experiment
With the job configs ready, run the experiment with the simulator:

In [3]:
import os
from nvflare.private.fed.app.simulator.simulator_runner import SimulatorRunner  

simulator = SimulatorRunner(
    job_folder="./jobs/sklearn_kmeans_3_uniform",
    workspace="/tmp/nvflare/sklearn_kmeans_iris",
    n_clients=3,
    threads=3
)
run_status = simulator.run()
print("Simulator finished with run_status", run_status)

2023-03-10 14:31:30,036 - SimulatorRunner - INFO - Create the Simulator Server.
2023-03-10 14:31:30,041 - Cell - INFO - server: creating listener on tcp://0:38531
2023-03-10 14:31:30,042 - Cell - INFO - server: created backbone external listener for tcp://0:38531
2023-03-10 14:31:30,044 - ConnectorManager - INFO - 726271: Try start_listener Listener resources: {'secure': False, 'host': 'localhost'}
2023-03-10 14:31:30,045 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00002 PASSIVE uds:///tmp/nvflare_563628] is starting
2023-03-10 14:31:30,548 - Cell - INFO - server: created backbone internal listener for uds:///tmp/nvflare_563628
2023-03-10 14:31:30,551 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00001 PASSIVE tcp://0:38531] is starting
2023-03-10 14:31:30,634 - nvflare.fuel.hci.server.hci - INFO - Starting Admin Server localhost on Port 50057
2023-03-10 14:31:30,636 - SimulatorRunner - INFO - Deploy the Apps.
2023-03-10 14:31:30,642 - SimulatorRunner - INFO - C

2023-03-10 14:31:39.926371: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-10 14:31:39.962708: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-10 14:31:39.979805: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the ap

2023-03-10 14:31:43,178 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=scatter_and_gather, peer=site-3, peer_run=simulate_job, task_name=train, task_id=b5d3577e-8052-4b31-8ca4-ea253189ba1e]: assigned task to client site-3: name=train, id=b5d3577e-8052-4b31-8ca4-ea253189ba1e
2023-03-10 14:31:43,181 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=scatter_and_gather, peer=site-3, peer_run=simulate_job, task_name=train, task_id=b5d3577e-8052-4b31-8ca4-ea253189ba1e]: sent task assignment to client
2023-03-10 14:31:43,183 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=scatter_and_gather, peer=site-2, peer_run=simulate_job, task_name=train, task_id=ee868551-060f-4259-9a39-1209808147b9]: assigned task to client site-2: name=train, id=ee868551-060f-4259-9a39-1209808147b9
2023-03-10 14:31:43,186 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=scatter_and_gather, peer=site-2, peer_run=simulate_jo



2023-03-10 14:31:43,385 - ClientTaskWorker - INFO - Finished one task run for client: site-3
2023-03-10 14:31:43,389 - ClientTaskWorker - INFO - Finished one task run for client: site-2
2023-03-10 14:31:43,390 - ClientTaskWorker - INFO - Finished one task run for client: site-1
2023-03-10 14:31:43,444 - ClientTaskWorker - INFO - Finished one task run for client: site-3
2023-03-10 14:31:43,447 - ClientTaskWorker - INFO - Finished one task run for client: site-2
2023-03-10 14:31:43,449 - ClientTaskWorker - INFO - Finished one task run for client: site-1
2023-03-10 14:31:43,502 - ClientTaskWorker - INFO - Finished one task run for client: site-3
2023-03-10 14:31:43,505 - ClientTaskWorker - INFO - Finished one task run for client: site-2
2023-03-10 14:31:43,507 - ClientTaskWorker - INFO - Finished one task run for client: site-1
2023-03-10 14:31:43,570 - ClientTaskWorker - INFO - Finished one task run for client: site-3
2023-03-10 14:31:43,577 - ClientTaskWorker - INFO - Finished one task 

Clustering quality can be measured by the homogeneity score between the clusters formed and the ground-truth labels, and the score can be visualized in TensorBoard.

In [9]:
# Load the TensorBoard notebook extension
%load_ext tensorboard

%tensorboard --logdir /tmp/nvflare/sklearn_kmeans_iris

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


Reusing TensorBoard on port 6006 (pid 349977), started 0:00:00 ago. (Use '!kill 349977' to kill it.)
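For readers unfamiliar with the homogeneity score used above, the following is a small pure-Python sketch of its definition, 1 - H(C|K) / H(C), where C are the true classes and K the predicted clusters. The helper names are made up for this example; scikit-learn provides the same metric as `sklearn.metrics.homogeneity_score`, which is the more practical choice in real code.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def homogeneity(labels_true, labels_pred):
    """1 - H(C|K) / H(C); 1.0 means every cluster holds a single class."""
    n = len(labels_true)
    h_c = entropy(labels_true)
    if h_c == 0.0:
        return 1.0  # only one class exists, so any clustering is homogeneous
    joint = Counter(zip(labels_true, labels_pred))
    cluster_sizes = Counter(labels_pred)
    # Conditional entropy of the class given the assigned cluster.
    h_c_given_k = -sum(
        (n_ck / n) * math.log(n_ck / cluster_sizes[k])
        for (_, k), n_ck in joint.items()
    )
    return 1.0 - h_c_given_k / h_c


print(homogeneity([0, 0, 1, 1], [1, 1, 0, 0]))  # pure clusters, labels permuted
print(homogeneity([0, 0, 1, 1], [0, 0, 0, 0]))  # one cluster mixing both classes
```

The score does not depend on how cluster ids are numbered, which is why it suits k-means, where cluster indices are arbitrary. It also does not penalize splitting one class across several clusters, so it is best read together with the chosen number of clusters.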