# Tabular Data Federated Statistics 

Before we perform machine learning tasks on the Higgs data, let's examine the statistics of the dataset. 


## Setup NVFLARE

Follow [Getting Started](https://nvflare.readthedocs.io/en/main/getting_started.html) to set up a virtual environment and install NVFLARE.

You can also follow this [notebook](../../nvflare_setup.ipynb) to get set up.

> Make sure you have installed nvflare from **terminal** 




## Install requirements
This assumes the current directory is `higgs/stats`.

In [1]:
! pwd

/home/chester/projects/NVFlare/examples/hello-world/step-by-step/higgs/stats


In [None]:
%pip install -r requirements.txt


## 1. Prepare data

### Download and Store Data

To run the examples, we first download the HIGGS dataset, which is distributed as a single compressed .csv file. By default, we assume the archive is downloaded to

```
/tmp/nvflare/dataset/input/higgs.zip
```

and uncompressed in the same directory. You can download it directly with either wget or curl, if one of them is installed. Here we use curl. The file is larger than 2.6 GB, so the download will take a while.
    

In [8]:
! mkdir -p /tmp/nvflare/dataset/input

! curl -o /tmp/nvflare/dataset/input/higgs.zip https://archive.ics.uci.edu/static/public/280/higgs.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2686M    0 2686M    0     0  1571k      0 --:--:--  0:29:10 --:--:-- 1355k


Alternatively, download with wget: `wget -P /tmp/nvflare/dataset/input/ https://archive.ics.uci.edu/static/public/280/higgs.zip`

Next, we unzip higgs.zip. Since `unzip` and `gunzip` are already installed, we can use them directly.

In [13]:
! unzip -d /tmp/nvflare/dataset/input/ /tmp/nvflare/dataset/input/higgs.zip

Archive:  /tmp/nvflare/dataset/input/higgs.zip
  inflating: /tmp/nvflare/dataset/input/HIGGS.csv.gz  


In [16]:
!gunzip -c /tmp/nvflare/dataset/input/HIGGS.csv.gz > /tmp/nvflare/dataset/input/higgs.csv

In [17]:
!ls -al /tmp/nvflare/dataset/input/

total 13348436
drwxrwxr-x 2 chester chester       4096 Nov 14 15:57 .
drwxrwxr-x 3 chester chester       4096 Nov 14 15:22 ..
-rw-rw-r-- 1 chester chester 8035497980 Nov 14 15:57 higgs.csv
-rwx------ 1 chester chester 2816407858 May 22 15:16 HIGGS.csv.gz
-rw-rw-r-- 1 chester chester 2816865137 Nov 14 15:51 higgs.zip


### Data Split

The HIGGS dataset contains 11 million instances (rows), each with 28 attributes.
The first 21 features (columns 2-22) are kinematic properties measured by the particle detectors in the accelerator. 
The last seven features are functions of the first 21 features; these are high-level features derived by physicists to help discriminate between the two classes. The last 500,000 examples are used as a test set.

The first column is the class label (1 for signal, 0 for background), followed by the 28 features (21 low-level features, then 7 high-level features): lepton pT, lepton eta, lepton phi, missing energy magnitude, missing energy phi, jet 1 pt, jet 1 eta, jet 1 phi, jet 1 b-tag, jet 2 pt, jet 2 eta, jet 2 phi, jet 2 b-tag, jet 3 pt, jet 3 eta, jet 3 phi, jet 3 b-tag, jet 4 pt, jet 4 eta, jet 4 phi, jet 4 b-tag, m_jj, m_jjj, m_lv, m_jlv, m_bb, m_wbb, m_wwbb. For more detailed information about each feature, see the original paper.

Since the HIGGS dataset is already recorded in random order, the data split is specified by contiguous index ranges for each client, rather than by a vector of random instance indices. We split the dataset uniformly: all clients have the same amount of data. The split files are written to the output directory

```
/tmp/nvflare/dataset/output/

```
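As a rough illustration of contiguous index ranges, the sketch below computes one half-open row range per site. It is not the actual split_csv.py: the function name `site_ranges` and the way the remainder is distributed are assumptions made for this example.

```python
from typing import List, Tuple


def site_ranges(total_rows: int, site_num: int, sample_rate: float = 1.0) -> List[Tuple[int, int]]:
    """Return one contiguous, half-open (start, end) row range per site."""
    # Only a fraction of the rows is used when sample_rate < 1.
    sampled_rows = round(total_rows * sample_rate)
    base, remainder = divmod(sampled_rows, site_num)
    ranges = []
    start = 0
    for i in range(site_num):
        # Spread any remainder over the first sites so sizes differ by at most one row.
        size = base + (1 if i < remainder else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


for site, (start, end) in enumerate(site_ranges(11_000_000, site_num=3, sample_rate=0.2), start=1):
    print(f"site-{site}: rows [{start}, {end})")
```

Because each site reads a single contiguous block, no index vector has to be stored or shared between sites.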

We provide a simple Python script, split_csv.py, to split the data. Let's run it; it takes a few minutes to complete.


In [37]:
!python split_csv.py --input_data_path=/tmp/nvflare/dataset/input/higgs.csv --output_dir=/tmp/nvflare/dataset/output/ --site_num=3 --sample_rate=0.2

In [38]:
!ls -al /tmp/nvflare/dataset/output/

total 1087424
drwxrwxr-x 2 chester chester      4096 Nov 14 18:24 .
drwxrwxr-x 4 chester chester      4096 Nov 14 18:22 ..
-rw-rw-r-- 1 chester chester 371161915 Nov 14 19:47 site-1.csv
-rw-rw-r-- 1 chester chester 371163164 Nov 14 19:47 site-2.csv
-rw-rw-r-- 1 chester chester 371171296 Nov 14 19:47 site-3.csv


Now that the data is prepared, let's first take a look at it.

In [39]:
features = ["label", "lepton_pt", "lepton_eta", "lepton_phi", "missing_energy_magnitude", "missing_energy_phi", "jet_1_pt", "jet_1_eta", "jet_1_phi", "jet_1_b_tag", "jet_2_pt", "jet_2_eta", "jet_2_phi", "jet_2_b_tag", "jet_3_pt", "jet_3_eta", "jet_3_phi", "jet_3_b_tag",\
            "jet_4_pt", "jet_4_eta", "jet_4_phi", "jet_4_b_tag", \
            "m_jj", "m_jjj", "m_lv", "m_jlv", "m_bb", "m_wbb", "m_wwbb"]

In [40]:
features

['label',
 'lepton_pt',
 'lepton_eta',
 'lepton_phi',
 'missing_energy_magnitude',
 'missing_energy_phi',
 'jet_1_pt',
 'jet_1_eta',
 'jet_1_phi',
 'jet_1_b_tag',
 'jet_2_pt',
 'jet_2_eta',
 'jet_2_phi',
 'jet_2_b_tag',
 'jet_3_pt',
 'jet_3_eta',
 'jet_3_phi',
 'jet_3_b_tag',
 'jet_4_pt',
 'jet_4_eta',
 'jet_4_phi',
 'jet_4_b_tag',
 'm_jj',
 'm_jjj',
 'm_lv',
 'm_jlv',
 'm_bb',
 'm_wbb',
 'm_wwbb']

In [41]:
import numpy as np
import pandas as pd

df: pd.DataFrame = pd.read_csv("/tmp/nvflare/dataset/output/site-1.csv", names=features, sep=r"\s*,\s*", engine="python", na_values="?")

In [43]:
df

Unnamed: 0,label,lepton_pt,lepton_eta,lepton_phi,missing_energy_magnitude,missing_energy_phi,jet_1_pt,jet_1_eta,jet_1_phi,jet_1_b_tag,...,jet_4_eta,jet_4_phi,jet_4_b_tag,m_jj,m_jjj,m_lv,m_jlv,m_bb,m_wbb,m_wwbb
0,1.0,0.869293,-0.635082,0.225690,0.327470,-0.689993,0.754202,-0.248573,-1.092064,0.000000,...,-0.010455,-0.045767,3.101961,1.353760,0.979563,0.978076,0.920005,0.721657,0.988751,0.876678
1,1.0,0.907542,0.329147,0.359412,1.497970,-0.313010,1.095531,-0.557525,-1.588230,2.173076,...,-1.138930,-0.000819,0.000000,0.302220,0.833048,0.985700,0.978098,0.779732,0.992356,0.798343
2,1.0,0.798835,1.470639,-1.635975,0.453773,0.425629,1.104875,1.282322,1.381664,0.000000,...,1.128848,0.900461,0.000000,0.909753,1.108330,0.985692,0.951331,0.803252,0.865924,0.780118
3,0.0,1.344385,-0.876626,0.935913,1.992050,0.882454,1.786066,-1.646778,-0.942383,0.000000,...,-0.678379,-1.360356,0.000000,0.946652,1.028704,0.998656,0.728281,0.869200,1.026736,0.957904
4,1.0,1.105009,0.321356,1.522401,0.882808,-1.205349,0.681466,-1.070464,-0.921871,0.000000,...,-0.373566,0.113041,0.000000,0.755856,1.361057,0.986610,0.838085,1.133295,0.872245,0.808487
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
733329,0.0,0.540426,1.305064,-0.708802,0.875872,1.675868,0.816495,-0.252534,-1.130316,0.000000,...,-0.365237,-0.257189,0.000000,0.894322,0.952666,0.989141,1.504557,0.834270,1.229386,1.162086
733330,0.0,0.421653,-0.492882,1.643916,1.594480,0.825128,0.532879,0.796119,-0.674065,0.000000,...,-1.105617,0.638542,3.101961,2.138724,1.226602,0.988765,0.644204,0.376055,0.756236,0.835567
733331,1.0,0.968118,0.351549,1.102926,0.639590,-1.318748,1.234316,-0.497121,0.751894,2.173076,...,0.953955,-0.417558,0.000000,0.903102,0.884304,0.989137,0.934825,0.873791,0.967624,0.832607
733332,1.0,1.242998,-0.459767,-1.390171,0.780121,1.277013,1.059071,0.553513,-0.366384,2.173076,...,-1.093958,1.394887,0.000000,0.628694,0.940251,0.987272,0.554953,0.830810,0.883622,0.719952


## Create a statistics calculator for the local tabular dataset


```

from typing import Dict, List, Optional

import numpy as np
import pandas as pd
from pandas.core.series import Series

from nvflare.apis.fl_context import FLContext
from nvflare.app_common.abstract.statistics_spec import BinRange, Feature, Histogram, HistogramType, Statistics
from nvflare.app_common.statistics.numpy_utils import dtype_to_data_type, get_std_histogram_buckets


class DFStatistics(Statistics):
    def __init__(self, features: List, data_root_dir: str):
        super().__init__()
        self.data_root_dir = data_root_dir
        self.data: Optional[Dict[str, pd.DataFrame]] = None
        self.data_features = features

    def load_data(self, fl_ctx: FLContext) -> Dict[str, pd.DataFrame]:
        client_name = fl_ctx.get_identity_name() if fl_ctx is not None else "site-1"
        if fl_ctx:
            self.log_info(fl_ctx, f"load data for client {client_name}")
        else:
            print(f"load data for client {client_name}")
        try:
            data_path = f"{self.data_root_dir}/{client_name}.csv"
            # example of load data from CSV
            df: pd.DataFrame = pd.read_csv(
                data_path, names=self.data_features, sep=r"\s*,\s*", engine="python", na_values="?"
            )
            return {"train": df}

        except Exception as e:
            raise Exception(f"Load data for client {client_name} failed! {e}")

    def initialize(self, fl_ctx: FLContext):
        self.data = self.load_data(fl_ctx)
        if self.data is None:
            raise ValueError("data is not loaded. make sure the data is loaded")

    def features(self) -> Dict[str, List[Feature]]:
        results: Dict[str, List[Feature]] = {}
        for ds_name in self.data:
            df = self.data[ds_name]
            results[ds_name] = []
            for feature_name in df:
                data_type = dtype_to_data_type(df[feature_name].dtype)
                results[ds_name].append(Feature(feature_name, data_type))

        return results

    def count(self, dataset_name: str, feature_name: str) -> int:
        df: pd.DataFrame = self.data[dataset_name]
        return df[feature_name].count()

    def sum(self, dataset_name: str, feature_name: str) -> float:
        df: pd.DataFrame = self.data[dataset_name]
        return df[feature_name].sum().item()

    def mean(self, dataset_name: str, feature_name: str) -> float:

        count: int = self.count(dataset_name, feature_name)
        sum_value: float = self.sum(dataset_name, feature_name)
        return sum_value / count

    def stddev(self, dataset_name: str, feature_name: str) -> float:
        df = self.data[dataset_name]
        return df[feature_name].std().item()

    def variance_with_mean(
        self, dataset_name: str, feature_name: str, global_mean: float, global_count: float
    ) -> float:
        df = self.data[dataset_name]
        tmp = (df[feature_name] - global_mean) * (df[feature_name] - global_mean)
        variance = tmp.sum() / (global_count - 1)
        return variance.item()

    def histogram(
        self, dataset_name: str, feature_name: str, num_of_bins: int, global_min_value: float, global_max_value: float
    ) -> Histogram:

        df = self.data[dataset_name]
        feature: Series = df[feature_name]
        # Convert to a 1-D NumPy array; Series.ravel() is deprecated in recent pandas releases.
        flattened = feature.to_numpy()
        # Drop any None entries before bucketing.
        flattened = flattened[flattened != np.array(None)]
        # Bucket with the global range so that every client uses the same bin edges.
        buckets = get_std_histogram_buckets(flattened, num_of_bins, BinRange(global_min_value, global_max_value))
        return Histogram(HistogramType.STANDARD, buckets)

    def max_value(self, dataset_name: str, feature_name: str) -> float:
        """this is needed for histogram calculation, not used for reporting"""

        df = self.data[dataset_name]
        return df[feature_name].max()

    def min_value(self, dataset_name: str, feature_name: str) -> float:
        """this is needed for histogram calculation, not used for reporting"""

        df = self.data[dataset_name]
        return df[feature_name].min()

```
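The `variance_with_mean` method takes the global mean and global count rather than local ones. A plausible reason, sketched below with made-up numbers and plain Python lists, is that each client's partial result can then simply be summed to obtain the global sample variance. This illustrates the arithmetic only; how the NVFLARE controller actually aggregates the values is not shown in this notebook.

```python
import math
import statistics

# Hypothetical per-site values for one feature (made up for illustration).
site_values = {
    "site-1": [0.9, 1.1, 0.7],
    "site-2": [1.4, 0.6],
    "site-3": [1.0, 1.2, 0.8, 0.5],
}

# Round 1: combine local counts and sums into a global count and mean.
global_count = sum(len(v) for v in site_values.values())
global_mean = sum(sum(v) for v in site_values.values()) / global_count


def variance_with_mean(values, global_mean, global_count):
    """Local share of the global sample variance (same formula as in DFStatistics)."""
    return sum((x - global_mean) ** 2 for x in values) / (global_count - 1)


# Round 2: each site reports its share; the shares add up to the global variance.
global_variance = sum(variance_with_mean(v, global_mean, global_count) for v in site_values.values())
global_stddev = math.sqrt(global_variance)

# Cross-check against the variance of the pooled data.
pooled = [x for v in site_values.values() for x in v]
assert math.isclose(global_variance, statistics.variance(pooled))
print(global_variance, global_stddev)
```

Note that the local `stddev` method cannot be combined this way, since each site would be centering on its own local mean.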

Let's see if the code works.

In [61]:
cd code

/home/chester/projects/NVFlare/examples/hello-world/step-by-step/higgs/stats/code


In [62]:
from df_statistics import DFStatistics

df_stats_cal = DFStatistics(features, data_root_dir = "/tmp/nvflare/dataset/output")

# We pass fl_ctx=None for a local calculation (the dataset then defaults to "site-1.csv"), so we can explore the stats locally without a federated setting.
df_stats_cal.initialize(fl_ctx = None)


load data for client site-1


In [63]:
data_features = df_stats_cal.features()

In [64]:
data_features

{'train': [Feature(feature_name='label', data_type=<DataType.FLOAT: 1>),
  Feature(feature_name='lepton_pt', data_type=<DataType.FLOAT: 1>),
  Feature(feature_name='lepton_eta', data_type=<DataType.FLOAT: 1>),
  Feature(feature_name='lepton_phi', data_type=<DataType.FLOAT: 1>),
  Feature(feature_name='missing_energy_magnitude', data_type=<DataType.FLOAT: 1>),
  Feature(feature_name='missing_energy_phi', data_type=<DataType.FLOAT: 1>),
  Feature(feature_name='jet_1_pt', data_type=<DataType.FLOAT: 1>),
  Feature(feature_name='jet_1_eta', data_type=<DataType.FLOAT: 1>),
  Feature(feature_name='jet_1_phi', data_type=<DataType.FLOAT: 1>),
  Feature(feature_name='jet_1_b_tag', data_type=<DataType.FLOAT: 1>),
  Feature(feature_name='jet_2_pt', data_type=<DataType.FLOAT: 1>),
  Feature(feature_name='jet_2_eta', data_type=<DataType.FLOAT: 1>),
  Feature(feature_name='jet_2_phi', data_type=<DataType.FLOAT: 1>),
  Feature(feature_name='jet_2_b_tag', data_type=<DataType.FLOAT: 1>),
  Feature(featu

In [65]:
df_stats_cal.count("train", "lepton_pt")

733334

In [66]:
df_stats_cal.mean("train", "lepton_pt")

0.9919866823792409

In [67]:
df_stats_cal.mean("train", "m_wwbb")

0.959972547557596

In [68]:
df_stats_cal.stddev("train", "m_wwbb")

0.3135064233941357

In [69]:
df_stats_cal.histogram("train", "lepton_pt", 20, 0, 10)

Histogram(hist_type=<HistogramType.STANDARD: 0>, bins=[Bin(low_value=0.0, high_value=0.5, sample_count=118338), Bin(low_value=0.5, high_value=1.0, sample_count=332406), Bin(low_value=1.0, high_value=1.5, sample_count=168176), Bin(low_value=1.5, high_value=2.0, sample_count=70749), Bin(low_value=2.0, high_value=2.5, sample_count=28184), Bin(low_value=2.5, high_value=3.0, sample_count=9404), Bin(low_value=3.0, high_value=3.5, sample_count=3527), Bin(low_value=3.5, high_value=4.0, sample_count=1433), Bin(low_value=4.0, high_value=4.5, sample_count=577), Bin(low_value=4.5, high_value=5.0, sample_count=294), Bin(low_value=5.0, high_value=5.5, sample_count=129), Bin(low_value=5.5, high_value=6.0, sample_count=58), Bin(low_value=6.0, high_value=6.5, sample_count=31), Bin(low_value=6.5, high_value=7.0, sample_count=15), Bin(low_value=7.0, high_value=7.5, sample_count=7), Bin(low_value=7.5, high_value=8.0, sample_count=5), Bin(low_value=8.0, high_value=8.5, sample_count=0), Bin(low_value=8.5, h

Great! The code works. Let's move on to the federated statistics calculation. First, move back to the parent directory of `code`.

In [70]:
cd ../.

/home/chester/projects/NVFlare/examples/hello-world/step-by-step/higgs/stats


## Create Federated Statistics Job

We are going to use the NVFLARE Job CLI to create the job. For detailed instructions on the Job CLI, please follow the [job cli tutorial](https://github.com/NVIDIA/NVFlare/blob/main/examples/tutorials/job_cli.ipynb).

Let's check the available job templates. We are going to take one of the existing job templates and modify it to fit our needs. A job template is simply a set of server-side and client-side job configurations.

In [53]:
! nvflare job list_templates


The following job templates are available: 

------------------------------------------------------------------------------------------------------------------------
  name                 Description                                                  Controller Type      Client Category     
------------------------------------------------------------------------------------------------------------------------
  cyclic_cc_pt         client-controlled cyclic workflow with PyTorch ClientAPI tra client               client_api          
  cyclic_pt            server-controlled cyclic workflow with PyTorch ClientAPI tra server               client_api          
  psi_csv              private-set intersection for csv data                        server               Executor            
  sag_cross_np         scatter & gather and cross-site validation using numpy       server               client executor     
  sag_cse_pt           scatter & gather workflow and cross-site evaluation with Py

There is a "stats_df" job template, which is what we need, so we are going to use it. Now, use the `nvflare job create` command.
We would like the job to use the new df_statistics.py file we just tested.

In [71]:
! nvflare job create -w stats_df -j /tmp/nvflare/jobs/stats_df -sd code -force


The following are the variables you can change in the template

---------------------------------------------------------------------------------------------------------------------------------------
                                                                                                                                       
  job folder: /tmp/nvflare/jobs/stats_df                                                                                                 
                                                                                                                                       
---------------------------------------------------------------------------------------------------------------------------------------
  file_name                      var_name                       value                               component                          
---------------------------------------------------------------------------------------------------------------------

In [72]:
! tree /tmp/nvflare/jobs/stats_df  

/tmp/nvflare/jobs/stats_df
├── app
│   ├── config
│   │   ├── config_fed_client.conf
│   │   └── config_fed_server.conf
│   └── custom
│       └── df_statistics.py
└── meta.conf

3 directories, 4 files


Let's modify the server configuration to set the number of histogram bins to 20 and the global min/max range to [0, 10] instead of [0, 120].

In [76]:
! nvflare job create -w stats_df -force -j /tmp/nvflare/jobs/stats_df -sd code -f config_fed_server.conf bins=20 range="[0,10]"


The following are the variables you can change in the template

---------------------------------------------------------------------------------------------------------------------------------------
                                                                                                                                       
  job folder: /tmp/nvflare/jobs/stats_df                                                                                                 
                                                                                                                                       
---------------------------------------------------------------------------------------------------------------------------------------
  file_name                      var_name                       value                               component                          
---------------------------------------------------------------------------------------------------------------------

In [77]:
!cat /tmp/nvflare/jobs/stats_df/app/config/config_fed_server.conf         

format_version = 2
workflows = [
  {
    id = "fed_stats_controller"
    path = "nvflare.app_common.workflows.statistics_controller.StatisticsController"
    args {
      statistic_configs {
        count {}
        mean {}
        sum {}
        stddev {}
        histogram {
          "*" {
            bins = 20
          }
          Age {
            bins = 20
            range = [
              0
              10
            ]
          }
        }
      }
      writer_id = "stats_writer"
      enable_pre_run_task = false
    }
  }
]
components = [
  {
    id = "stats_writer"
    path = "nvflare.app_common.statistics.json_stats_file_persistor.JsonStatsFileWriter"
    args {
      output_path = "statistics/adults_stats.json"
      json_encoder_path = "nvflare.app_common.utils.json_utils.ObjectEncoder"
    }
  }
]


Now look at the client configuration. Notice that the component configuration in the job template
```
components = [
  {
    id = "df_stats_generator"
    path = "df_statistics.DFStatistics"
    args {
      data_path = "data.csv"
    }
  }
```

is different from our new DFStatistics class, whose arguments are
`features` and `data_root_dir` rather than `data_path`. So we will need to modify it.

```

class DFStatistics(Statistics):
    def __init__(self, features: List, data_root_dir: str):
        super().__init__()
        self.data_root_dir = data_root_dir
        self.data: Optional[Dict[str, pd.DataFrame]] = None
        self.data_features = features
```


In [80]:
!cat /tmp/nvflare/jobs/stats_df/app/config/config_fed_client.conf 

format_version = 2
executors = [
  {
    tasks = [
      "fed_stats_pre_run"
      "fed_stats"
    ]
    executor {
      id = "Executor"
      path = "nvflare.app_common.executors.statistics.statistics_executor.StatisticsExecutor"
      args {
        generator_id = "df_stats_generator"
      }
    }
  }
]
task_result_filters = [
  {
    tasks = [
      "fed_stats"
    ]
    filters = [
      {
        name = "StatisticsPrivacyFilter"
        args {
          result_cleanser_ids = [
            "min_count_cleanser"
            "min_max_noise_cleanser"
            "hist_bins_cleanser"
          ]
        }
      }
    ]
  }
]
task_data_filters = []
components = [
  {
    id = "df_stats_generator"
    path = "df_statistics.DFStatistics"
    args {
      data_path = "data.csv"
    }
  }
  {
    id = "min_max_cleanser"
    path = "nvflare.app_common.statistics.min_max_cleanser.AddNoiseToMinMax"
    args {
      min_noise_level = 0.1
      max_noise_level = 0.3
    }
  }
  {
    id = "hist

What we need to do is the following:
1. Remove the `data_path` argument
2. Add the `features` and `data_root_dir` arguments
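
As a sketch of the target state, the updated `df_stats_generator` component could carry arguments like the ones below, written here as a Python dict and printed as JSON for readability. The feature names and the `data_root_dir` value are taken from the cells above; the exact syntax needed in the .conf file is not shown here and may differ.

```python
import json

# Same column names as the `features` list used earlier in this notebook.
features = [
    "label", "lepton_pt", "lepton_eta", "lepton_phi",
    "missing_energy_magnitude", "missing_energy_phi",
    "jet_1_pt", "jet_1_eta", "jet_1_phi", "jet_1_b_tag",
    "jet_2_pt", "jet_2_eta", "jet_2_phi", "jet_2_b_tag",
    "jet_3_pt", "jet_3_eta", "jet_3_phi", "jet_3_b_tag",
    "jet_4_pt", "jet_4_eta", "jet_4_phi", "jet_4_b_tag",
    "m_jj", "m_jjj", "m_lv", "m_jlv", "m_bb", "m_wbb", "m_wwbb",
]

component = {
    "id": "df_stats_generator",
    "path": "df_statistics.DFStatistics",
    "args": {
        # data_path is gone; these keys mirror DFStatistics.__init__(features, data_root_dir).
        "features": features,
        "data_root_dir": "/tmp/nvflare/dataset/output",
    },
}

print(json.dumps(component, indent=2))
```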

## TODO NEXT




## 2. Run job in FL Simulator

With the FL simulator, we can run the example with a single CLI command:


```
cd NVFlare/examples/advanced/federated-statistics
nvflare simulator df_stats/jobs/df_stats -w /tmp/nvflare/df_stats -n 2 -t 2
```

The results are stored in the workspace under "/tmp/nvflare":
```
/tmp/nvflare/df_stats/simulate_job/statistics/adults_stats.json
```

## 3. Visualization
Since the results are in JSON format, the data can be easily visualized with pandas DataFrames and plots.
A visualization utility is provided in show_stats.py in the visualization directory.
You can run the Jupyter notebook visualization.ipynb.

This assumes the NVFLARE_HOME environment variable points to the GitHub project location (NVFlare) that contains the current example.

```bash
    cp /tmp/nvflare/df_stats/simulate_job/statistics/adults_stats.json $NVFLARE_HOME/examples/advanced/federated-statistics/df_stats/demo/.
    
    cd $NVFLARE_HOME/examples/advanced/federated-statistics/df_stats/demo
    
    jupyter notebook  visualization.ipynb
```
You should get visualizations similar to the following:

![stats](demo/stats_df.png) and ![histogram plot](demo/hist_plot.png)
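
Before plotting, it can help to flatten the nested JSON into rows that a pandas DataFrame can take. The sketch below makes no assumption about the nesting order of the real file: the sample dictionary is made up, and `flatten` is a hypothetical helper, not part of show_stats.py.

```python
import json
from typing import Any, Iterator, Tuple


def flatten(node: Any, path: Tuple[str, ...] = ()) -> Iterator[Tuple[Tuple[str, ...], Any]]:
    """Yield (key path, leaf value) pairs from arbitrarily nested dicts."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from flatten(child, path + (str(key),))
    else:
        yield path, node


# Made-up sample; in practice, load the real result file with json.load() instead.
stats = json.loads('{"lepton_pt": {"count": {"site-1": {"train": 3}, "site-2": {"train": 2}}}}')

# One row per leaf value: the key path followed by the value.
rows = [(*path, value) for path, value in flatten(stats)]
for row in rows:
    print(row)
```

The resulting rows can then be handed to `pd.DataFrame(rows)` and pivoted or plotted as needed.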


## 4. Run Example using POC command

An alternative way to run the job is to use POC mode.

### 4.1 Prepare POC Workspace

```
   nvflare poc prepare 
```
This will create a POC workspace at /tmp/nvflare/poc with n = 2 clients.

If your POC workspace is in a different location, set the following environment variable

```
export NVFLARE_POC_WORKSPACE=<new poc workspace location>
```
and then repeat the command above.

### 4.2 Start nvflare in POC mode

```
nvflare poc start
```
Once the above command has completed, you are already logged in to the NVFLARE console (aka Admin Console).
If you prefer to run the NVFLARE console in a separate terminal, start everything except the admin console:

```
nvflare poc start -ex admin
```
Then open a separate terminal and start the NVFLARE console:
```
nvflare poc start -p admin
```

### 4.3 Submit job

Inside the console, submit the job:
```
submit_job advanced/federated-statistics/df_stats/jobs/df_stats
```

### 4.4 List the submitted job

You should see the server and clients in your first terminal executing the job now.
You can list the running job by using `list_jobs` in the admin console.
Your output should be similar to the following.

```
> list_jobs 
------------------------------------------------------------------------------------------------------------------------------------
| JOB ID                               | NAME     | STATUS                       | SUBMIT TIME                      | RUN DURATION |
------------------------------------------------------------------------------------------------------------------------------------
| 10a92352-5459-47d2-8886-b85abf70ddd1 | df_stats | FINISHED:COMPLETED           | 2022-08-05T22:50:40.968771-07:00 | 0:00:29.4493 |
------------------------------------------------------------------------------------------------------------------------------------
```

### 4.5 Get the result

If successful, the computed statistics can be downloaded using this admin command:
```
download_job [JOB_ID]
```
After the download, the result will be available in the stated download directory under `[JOB_ID]/workspace/statistics` as `adults_stats.json`.
Then go to section 3, Visualization.

## 5. Configuration and Code

Since FLARE has already developed the operators (server-side controller and client-side executor) for federated
statistics computation, we only need to provide the following:
* config_fed_server.json (server-side controller configuration)
* config_fed_client.json (client-side executor configuration)
* local statistics calculator

### 5.1 server side configuration

```
"workflows": [
    {
      "id": "fed_stats_controller",
      "path": "nvflare.app_common.workflows.statistics_controller.StatisticsController",
      "args": {
        "statistic_configs": {
          "count": {},
          "mean": {},
          "sum": {},
          "stddev": {},
          "histogram": { "*": {"bins": 10 },
                         "Age": {"bins": 5, "range":[0,120]}
                       }
        },
        "writer_id": "stats_writer"
      }
    }
  ],
```
In the above configuration, `StatisticsController` is the controller. We ask the controller to calculate the following
statistics: "count", "mean", "sum", "stddev", and "histogram". Each statistic may have its own configuration.
For example, for the histogram statistic, we specify that the feature "Age" needs 5 bins with a histogram range of [0, 120), while
all other features ("*" indicates the default) use 10 bins with no range specified, i.e. their ranges will be dynamically estimated.
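
The fallback from a named feature to the "*" default can be sketched as a simple dictionary lookup. This only illustrates the rule described above and is not NVFLARE's implementation; `resolve_hist_config` is a hypothetical helper.

```python
from typing import Dict, Optional, Tuple

# Histogram section of the server configuration shown above.
hist_config = {
    "*": {"bins": 10},
    "Age": {"bins": 5, "range": [0, 120]},
}


def resolve_hist_config(config: Dict[str, dict], feature: str) -> Tuple[int, Optional[list]]:
    """Return (bins, range) for a feature; range is None when it must be estimated."""
    # Use the feature-specific entry if present, otherwise the "*" default.
    entry = config.get(feature, config["*"])
    return entry["bins"], entry.get("range")


print(resolve_hist_config(hist_config, "Age"))        # (5, [0, 120])
print(resolve_hist_config(hist_config, "lepton_pt"))  # (10, None)
```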

The `StatisticsController` also takes writer_id = "stats_writer". The writer_id identifies the output writer component, defined as

```
 "components": [
    {
      "id": "stats_writer",
      "path": "nvflare.app_common.statistics.json_stats_file_persistor.JsonStatsFileWriter",
      "args": {
        "output_path": "statistics/adults_stats.json",
        "json_encoder_path": "nvflare.app_common.utils.json_utils.ObjectEncoder"
      }
    }
```
This configuration defines a JSON file output writer; the result will be saved to <job workspace>/statistics/adults_stats.json
in the FLARE job store.

### 5.2 client side configuration
 
First, we specify the built-in client-side executor, `StatisticsExecutor`, which takes the ID of a local statistics generator:

```
 "executor": {
        "id": "Executor",
        "path": "nvflare.app_common.executors.statistics.statistics_executor.StatisticsExecutor",
        "args": {
          "generator_id": "df_stats_generator"
        }
  },

```

The local statistics generator is defined as an FLComponent, `DFStatistics`, which implements the `Statistics` spec.

```
  "components": [
    {
      "id": "df_stats_generator",
      "path": "df_statistics.DFStatistics",
      "args": {
        "data_path": "data.csv"
      }
    },
   ...
  ]
```

Next, we specify the `task_result_filters`. These are post-processing filters that take the results
of the executor and filter them before they are sent to the server.

In this example, `task_result_filters` is defined as a task privacy filter, `StatisticsPrivacyFilter`:
```
  "task_result_filters": [
    {
      "tasks": ["fed_stats"],
      "filters":[
        {
          "name": "StatisticsPrivacyFilter",
          "args": {
            "result_cleanser_ids": [
              "min_count_cleanser",
              "min_max_noise_cleanser",
              "hist_bins_cleanser"
            ]
          }
        }
      ]
    }
  ],
``` 
`StatisticsPrivacyFilter` uses three separate `StatisticsPrivacyCleanser`s; you can find more details in
[local privacy policy](../local/privacy.json) and in the later discussion on privacy.

The privacy cleansers' policies are specified in
```
  "components": [
    {
      "id": "df_stats_generator",
      "path": "df_statistics.DFStatistics",
      "args": {
        "data_path": "data.csv"
      }
    },
    {
      "id": "min_max_cleanser",
      "path": "nvflare.app_common.statistics.min_max_cleanser.AddNoiseToMinMax",
      "args": {
        "min_noise_level": 0.1,
        "max_noise_level": 0.3
      }
    },
    {
      "id": "hist_bins_cleanser",
      "path": "nvflare.app_common.statistics.histogram_bins_cleanser.HistogramBinsCleanser",
      "args": {
        "max_bins_percent": 10
      }
    },
    {
      "id": "min_count_cleanser",
      "path": "nvflare.app_common.statistics.min_count_cleanser.MinCountCleanser",
      "args": {
        "min_count": 10
      }
    }
  ]

```
or in the [local privacy policy](../local/privacy.json).
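
To give a feel for what a cleanser does, here is a minimal sketch of a minimum-count rule: statistics for a feature are withheld when too few samples back them. It is only one reading of `min_count`; the real `MinCountCleanser` may behave differently, and `apply_min_count` and the sample numbers are hypothetical.

```python
from typing import Dict


def apply_min_count(counts: Dict[str, int], stats: Dict[str, float], min_count: int) -> Dict[str, float]:
    """Keep only the statistics of features backed by at least min_count samples."""
    return {name: value for name, value in stats.items() if counts.get(name, 0) >= min_count}


# Made-up local results: the second feature has too few samples to be released.
counts = {"lepton_pt": 733334, "rare_feature": 4}
means = {"lepton_pt": 0.99, "rare_feature": 12.5}

print(apply_min_count(counts, means, min_count=10))  # {'lepton_pt': 0.99}
```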

### 5.3 Local statistics generator

The statistics generator `DFStatistics` implements the `Statistics` spec.
In the current example, the input data is in the form of a pandas DataFrame. Although we used a CSV file, this can be any
tabular data format that can be expressed as a pandas DataFrame.

```
class DFStatistics(Statistics):
    # rest of code 
```
To calculate the local statistics, we need to implement a few methods:
```
    def features(self) -> Dict[str, List[Feature]]:

    def count(self, dataset_name: str, feature_name: str) -> int:
 
    def sum(self, dataset_name: str, feature_name: str) -> float:
 
    def mean(self, dataset_name: str, feature_name: str) -> float:
 
    def stddev(self, dataset_name: str, feature_name: str) -> float:
 
    def variance_with_mean(self, dataset_name: str, feature_name: str, global_mean: float, global_count: float) -> float:
 
    def histogram(self, dataset_name: str, feature_name: str, num_of_bins: int, global_min_value: float, global_max_value: float) -> Histogram:

```
Since some features do not provide a histogram bin range, we need to use the local min/max values to estimate
the global min/max, and then use the global min/max as the histogram bin range for all clients.

So we need to provide local min/max calculation methods:
```
   def max_value(self, dataset_name: str, feature_name: str) -> float:
   def min_value(self, dataset_name: str, feature_name: str) -> float:
```
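
The three steps (local min/max, global min/max, histograms over a shared range) can be sketched in plain Python as follows. The values are made up, the `histogram` helper is hypothetical, and any noise that the privacy filter may add to the min/max values is ignored.

```python
# Hypothetical per-site values for one feature (made up for illustration).
local_values = {"site-1": [0.2, 1.4, 3.9], "site-2": [0.7, 2.5, 9.1]}

# Step 1: each client reports its local min and max.
local_min_max = {site: (min(v), max(v)) for site, v in local_values.items()}

# Step 2: the server derives the global range from the local ones.
global_min = min(lo for lo, _ in local_min_max.values())
global_max = max(hi for _, hi in local_min_max.values())


def histogram(values, num_of_bins, lo, hi):
    """Count values into num_of_bins equal-width bins over [lo, hi]."""
    width = (hi - lo) / num_of_bins
    counts = [0] * num_of_bins
    for x in values:
        # Clamp the index so that x == hi falls into the last bin.
        idx = min(int((x - lo) / width), num_of_bins - 1)
        counts[idx] += 1
    return counts


# Step 3: every client buckets with the same global range, so bins can be added up.
local_hists = {site: histogram(v, 4, global_min, global_max) for site, v in local_values.items()}
global_hist = [sum(c) for c in zip(*local_hists.values())]
print(global_min, global_max, global_hist)
```

Sharing the bin edges is what makes the per-client counts directly addable; histograms built over different local ranges could not be merged bin by bin.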



## To run pytest in examples

Under the df_stats/jobs directory:

```
pytest df_stats/custom/
```

