# Federated K-Means Clustering with Scikit-learn

This tutorial illustrates federated k-Means clustering on tabular data.

Before training, we need to set up NVFLARE.

## Setup NVFLARE

Follow [Getting Started](https://nvflare.readthedocs.io/en/main/getting_started.html) to set up a virtual environment and install NVFLARE.

You can also follow this [notebook](https://github.com/NVIDIA/NVFlare/blob/main/examples/nvflare_setup.ipynb) to get set up.

> Make sure you have installed nvflare from a **terminal**.


## Install requirements
Assuming the current directory is '/examples/hello-world/step-by-step/higgs/sklearn-kmeans':

In [None]:
!pwd

In [None]:
%pip install -r requirements.txt

> Note: 
In the upcoming sections, we'll use the `tree` command. To install it on a Linux system, run `sudo apt install tree`. As an alternative to `tree`, you can use `ls -al`.


## Prepare data
Please refer to the [prepare_higgs_data](../prepare_data.ipynb) notebook. Pay attention to the current location: you need to switch to the "higgs" directory to run the data split.
    

Now that our data is prepared, we are ready to do the training.

### Data Cleaning 

From time to time, small changes are made to the HIGGS dataset that cause the job to fail, so we need to do some cleanup or skip certain rows. 
For example, at one point some floating-point numbers mistakenly contained an alphabetical letter. This may already have been fixed by UCI. 


## Scikit-learn
This tutorial uses [Scikit-learn](https://scikit-learn.org/), a widely used open-source machine learning library that supports supervised and unsupervised learning.


### Federated k-Means Clustering
Here we use [k-Means clustering](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) in a federated scenario.
The aggregation follows the scheme defined in [Mini-batch k-Means](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html). 

Under this setting, each round of federated learning can be formulated as follows:
- local training: starting from the global centers, each client trains a local MiniBatchKMeans model on its own data
- global aggregation: the server collects the cluster centers and 
  count information from all clients, aggregates them by treating 
  each client's results as a mini-batch, and updates the global centers and per-center counts.
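
The global aggregation step can be sketched as a count-weighted running mean. This is a minimal illustration, not NVFLARE's own aggregator: the function name `aggregate_minibatch` and its argument layout are assumptions made for this example.

```python
import numpy as np


def aggregate_minibatch(global_centers, global_counts, client_results):
    """Fold each client's (centers, counts) into the global estimate as one mini-batch.

    Hypothetical helper for illustration only; not part of NVFLARE or scikit-learn.
    """
    centers = np.array(global_centers, dtype=float)  # copy, so inputs are not modified
    counts = np.array(global_counts, dtype=float)
    for local_centers, local_counts in client_results:
        local_centers = np.asarray(local_centers, dtype=float)
        local_counts = np.asarray(local_counts, dtype=float)
        new_counts = counts + local_counts
        # Count-weighted mean of the old estimate and the new mini-batch.
        weighted = centers * counts[:, None] + local_centers * local_counts[:, None]
        # Leave a center untouched if no point has ever been assigned to it.
        seen = new_counts > 0
        centers[seen] = weighted[seen] / new_counts[seen, None]
        counts = new_counts
    return centers, counts


# Two clients, two clusters, starting from zero counts.
clients = [
    (np.array([[1.0, 1.0], [9.0, 9.0]]), np.array([10.0, 30.0])),
    (np.array([[3.0, 3.0], [11.0, 11.0]]), np.array([30.0, 10.0])),
]
new_centers, new_counts = aggregate_minibatch(
    np.zeros((2, 2)), np.zeros(2), clients
)
```

Because the global counts carry over between rounds, later rounds move the centers less and less, which is the same damping behaviour a mini-batch k-Means update has.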

For center initialization, at the first round, each client generates its initial centers with the k-means++ method. Then, the server collects all initial centers and performs one round of k-means on them to generate the initial global centers.
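
The initialization round can be sketched with scikit-learn alone. The three synthetic sites below are made up for illustration, and a plain `KMeans` fit stands in for the server-side step; the actual server code may differ.

```python
import numpy as np
from sklearn.cluster import KMeans, kmeans_plusplus

n_clusters = 2
rng = np.random.default_rng(0)

# Three hypothetical sites, each holding its own sample of two blobs (not HIGGS).
site_data = [
    np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(5.0, 0.5, (50, 2))])
    for _ in range(3)
]

# Client side: each site proposes n_clusters centers with k-means++.
local_centers = [
    kmeans_plusplus(x, n_clusters=n_clusters, random_state=0)[0] for x in site_data
]

# Server side: stack all proposals and cluster them to get the initial global centers.
stacked = np.concatenate(local_centers)  # shape (3 * n_clusters, n_features)
global_init = (
    KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    .fit(stacked)
    .cluster_centers_
)
global_counts = np.zeros(n_clusters)  # no data points folded in yet
```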

Let's look at the code to see how the local federated training script is defined.


In [None]:
!pwd

In [None]:
!cat code/kmeans_fl.py

The code defines how each step of FL can be performed.

#### Load data

We first load the feature names from the header file: 
    
```
    site_name = flare.get_site_name()
    feature_data_path = f"{data_root_dir}/{site_name}_header.csv"
    features = load_features(feature_data_path)
    n_features = len(features) -1

    data_path = f"{data_root_dir}/{site_name}.csv"
    data = load_data(data_path=data_path, data_features=features, test_size=test_size, skip_rows=skip_rows)

```

We then load the data from the main CSV file, transform it, and split it into training and test sets based on the provided `test_size`.  

```
    data = to_dataset_tuple(data)
    dataset = transform_data(data)
    x_train, y_train, train_size = dataset["train"]
    x_test, y_test, test_size = dataset["test"]

```

The part that's specific to federated learning is in the following code:

```
# (1) import nvflare client API
from nvflare import client as flare

```
```
# (2) initializes NVFlare client API
    flare.init()

    site_name = flare.get_site_name()
    
```
    
These few lines import the NVFLARE Client API and initialize it, then use the API to find the site name (such as site-1, site-2, etc.). With the site name, we can construct the site-specific 
data paths, such as

```
    feature_data_path = f"{data_root_dir}/{site_name}_header.csv"

    data_path = f"{data_root_dir}/{site_name}.csv"
```

#### Training 

In standard Scikit-learn, we would construct the model like this:
```
  model = MiniBatchKMeans(...) 
```
and then call `model.fit(...)`:
```
  model.fit(x_train, y_train)

  homo = evaluate_model(x_test, model, y_test)

```
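
As a self-contained illustration of this non-federated workflow, the sketch below trains on synthetic data and scores the result with `homogeneity_score`. The data and the way the score is computed are assumptions for this example; the tutorial's `evaluate_model` helper lives in the script and may differ.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import homogeneity_score

# Synthetic stand-in data: two well-separated blobs with known labels (not HIGGS).
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0.0, 0.5, (100, 2)), rng.normal(5.0, 0.5, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Plain local training: k-Means ignores labels, so only x is needed for fitting.
local_model = MiniBatchKMeans(n_clusters=2, n_init=3, random_state=0)
local_model.fit(x)

# Homogeneity measures whether each cluster contains members of a single class.
local_homo = homogeneity_score(y, local_model.predict(x))
```

On data this cleanly separated the score is close to 1; on HIGGS it is far lower, as the results at the end of this tutorial show.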

With federated learning, using the FLARE Client API, we need to make a few changes:
* 1) We are training not only over local iterations but also over global rounds, so we need to keep the program running until we reach the total number of rounds.
  
  ```
      while flare.is_running():
          ... rest of code
  
  ```
  
* 2) Unlike local learning, we now have more than one client/site participating in the training. To ensure every site starts with the same model parameters, we use the server to broadcast the initial model parameters to every site at the first round (current_round = 0).

* 3) We need to use the FLARE Client API to receive the global model and read the global parameters.

```
        # (3) receives FLModel from NVFlare
        input_model = flare.receive()
        global_params = input_model.params
        curr_round = input_model.current_round
```

```
        if curr_round == 0:
            # (4) first round, initialize centers with kmeans++
            n_clusters = global_params["n_clusters"]
            center_local, _ = kmeans_plusplus(
                x_train,
                n_clusters=n_clusters,
                random_state=random_state
            )
            params = {"center": center_local, "count": None}
            homo = 0.0
        ....
```
* 4) If it is not the first round, we need to use the global centers as the starting point for training in the next round. For Scikit-learn's MiniBatchKMeans, we simply set the `init=` argument.

```
            # (5) following rounds, starting from global centers
            center_global = global_params["center"]
            model = MiniBatchKMeans(
                n_clusters=n_clusters,
                batch_size=train_size,
                max_iter=1,
                init=center_global,
                n_init=1,
                reassignment_ratio=0,
                random_state=random_state,
            )
```

* 5) To make sure we have the best global model, we need to evaluate the global model on the local data.

```
            # (6) evaluate global center
            model_eval = KMeans(
                n_clusters=n_clusters,
                init=center_global,
                n_init=1
            )
            model_eval.fit(center_global)
            homo = evaluate_model(x_test, model_eval, y_test)
```
* 6) Finally, we do the training as before.

```
        # Train the model on the training set
        model.fit(x_train)
        
```

* 7) We need to send the new training result (cluster centers and per-center counts) back to the server for aggregation. To do that, we have the following code:

```
        center_local = model.cluster_centers_
        count_local = model._counts
        params = {"center": center_local, "count": count_local}
        # (7) construct trained FL model
        metrics = {"accuracy": homo}
        output_model = flare.FLModel(params=params, metrics=metrics)

        # (8) send model back to NVFlare
        flare.send(output_model)
```
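
Steps 4 to 7 can be tried outside NVFLARE with the sketch below. The data and the "received" centers are synthetic assumptions, and the global centers are scored by direct nearest-center assignment, which is an alternative to the `KMeans`-based evaluation shown above. Note that `_counts` is a private scikit-learn attribute and may change between versions.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import homogeneity_score, pairwise_distances_argmin

# Synthetic stand-in for one site's data: two well-separated blobs (not HIGGS).
rng = np.random.default_rng(0)
x_train = np.vstack(
    [rng.normal(0.0, 0.5, (200, 2)), rng.normal(5.0, 0.5, (200, 2))]
)
y_train = np.array([0] * 200 + [1] * 200)

# Pretend these centers arrived from the server in global_params["center"].
center_global = np.array([[0.5, 0.5], [4.5, 4.5]])
n_clusters = center_global.shape[0]

# Score the global centers by nearest-center assignment (no fitting involved).
assigned = pairwise_distances_argmin(x_train, center_global)
homo = homogeneity_score(y_train, assigned)

# One local pass over the full local data, starting from the global centers.
model = MiniBatchKMeans(
    n_clusters=n_clusters,
    batch_size=len(x_train),
    max_iter=1,
    init=center_global,
    n_init=1,
    reassignment_ratio=0,
    random_state=0,
)
model.fit(x_train)

# What the client would send back to the server for aggregation.
params = {"center": model.cluster_centers_, "count": model._counts}
```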

## Prepare Job  

Now that we have the code, we need to prepare a job folder with configurations to run in NVFLARE. To do this, we can leverage the job template for Scikit-learn. First, look at the available job templates:

In [None]:
!nvflare config -jt ../../../../../job_templates/

In [None]:
!nvflare job list_templates

The `sklearn_kmeans` template is the one we need. 

In [None]:
!nvflare job create -j /tmp/nvflare/jobs/sklearn_kmeans -force -w sklearn_kmeans \
-sd code \
-f config_fed_client.conf app_script="kmeans_fl.py" app_config="--data_root_dir /tmp/nvflare/dataset/output"

In [None]:
!cat /tmp/nvflare/jobs/sklearn_kmeans/app/config/config_fed_client.conf

In [None]:
!tree /tmp/nvflare/jobs/sklearn_kmeans

>Note 
 For potential on-the-fly data cleaning, we use skip_rows = 0 to skip the 1st row. We could use skip_rows = [0, 3] to skip the first and 4th rows.
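
To see what skipping rows by index does, here is a small sketch using `pandas.read_csv` directly. It assumes the script's loader forwards `skip_rows` to the `skiprows` argument of `read_csv`, which is not shown in this section; the inline CSV and the `read_rows` helper are made up for illustration.

```python
import io

import pandas as pd

# Five made-up rows; the 4th row (index 3) has a corrupted float, "1.1x".
csv_text = (
    "1.0,0.5,0.1\n"
    "0.0,0.7,0.2\n"
    "1.0,0.9,0.3\n"
    "0.0,1.1x,0.4\n"
    "1.0,1.3,0.5\n"
)


def read_rows(text, skip_rows=None):
    """Hypothetical loader: read headerless CSV text, optionally skipping rows by index."""
    return pd.read_csv(
        io.StringIO(text),
        header=None,
        names=["label", "f1", "f2"],
        skiprows=skip_rows,
    )


df_all = read_rows(csv_text)                        # corrupted value keeps f1 non-numeric
df_skipped = read_rows(csv_text, skip_rows=[0, 3])  # drops the 1st and 4th rows
```

With the corrupted row skipped, the `f1` column parses as floating point again, which is what the downstream training code needs.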



## Run job in simulator

We use the simulator to run this job.

In [None]:
!nvflare simulator /tmp/nvflare/jobs/sklearn_kmeans -w /tmp/nvflare/sklearn_kmeans -n 3 -t 3

Let's examine the results.

In the FL training log, we can see that at the last round of local training, site-1 reports `site-1: global model homogeneity_score: 0.0068`.
Now let's run a local training to verify whether this number makes sense.

In [None]:
!python3 ./code/kmeans_local.py --data_root_dir /tmp/nvflare/dataset/output

The HIGGS dataset is challenging for unsupervised clustering, as we can observe from the result. Local training with the same number of iterations gives `model homogeneity_score: 0.0049`. Compared with this, the FL score of `0.0068` suggests that FL still provides some benefit from collaborative learning in this case.

## We are done!
Congratulations! You have just completed federated k-Means clustering for tabular data. 