# Tabular Data Federated Statistics 

Before we perform machine learning tasks on tabular data, it is often helpful to examine the statistics of the dataset on each client. This tutorial illustrates how to compute federated statistics for tabular data.


## Setup NVFLARE

Follow [Getting Started](https://nvflare.readthedocs.io/en/main/getting_started.html) to set up a virtual environment and install NVFLARE.

You can also follow this [notebook](https://github.com/NVIDIA/NVFlare/blob/main/examples/nvflare_setup.ipynb) to get set up.

> Make sure you have installed nvflare from a **terminal**.


## Install requirements
We assume the current directory is `/examples/hello-world/step-by-step/higgs/stats`.

In [None]:
!pwd

In [None]:
%pip install -r requirements.txt

> Note:
> The upcoming sections use the `tree` command. On a Linux system that uses `apt`, you can install it with `sudo apt install tree`. As an alternative to `tree`, you can use `ls -al`.


## Prepare data
Please refer to the [prepare_higgs_data](../prepare_data.ipynb) notebook. Pay attention to the current location: you need to switch to the "higgs" directory to run the data split.
    

Now that the data is prepared, let's first take a look at it.

In [None]:
features = [
    "label", "lepton_pt", "lepton_eta", "lepton_phi",
    "missing_energy_magnitude", "missing_energy_phi",
    "jet_1_pt", "jet_1_eta", "jet_1_phi", "jet_1_b_tag",
    "jet_2_pt", "jet_2_eta", "jet_2_phi", "jet_2_b_tag",
    "jet_3_pt", "jet_3_eta", "jet_3_phi", "jet_3_b_tag",
    "jet_4_pt", "jet_4_eta", "jet_4_phi", "jet_4_b_tag",
    "m_jj", "m_jjj", "m_lv", "m_jlv", "m_bb", "m_wbb", "m_wwbb",
]

In [None]:
features

In [None]:
import numpy as np
import pandas as pd

df: pd.DataFrame = pd.read_csv("/tmp/nvflare/dataset/output/site-1.csv", names=features, sep=r"\s*,\s*", engine="python", na_values="?")

In [None]:
df

## Create a statistics calculator for the local tabular dataset

We compose a calculator that computes the statistics of a tabular dataset, including count, sum, mean, standard deviation, and histogram. See `./code/df_stats.py` for details.
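
To make the kind of per-column calculation concrete, here is a minimal standard-library sketch. It is not the code in `df_stats.py`: the function name `local_column_stats` is hypothetical, the input numbers are made up, and the use of the sample standard deviation (dividing by count - 1) is an assumption.

```python
import math
from typing import Dict, List


def local_column_stats(values: List[float]) -> Dict[str, float]:
    """Return count, sum, mean and sample standard deviation of one column."""
    count = len(values)
    if count < 2:
        raise ValueError("need at least two values for a sample standard deviation")
    total = math.fsum(values)
    mean = total / count
    # Sample variance divides by (count - 1); this choice is an assumption.
    variance = math.fsum((v - mean) ** 2 for v in values) / (count - 1)
    return {"count": count, "sum": total, "mean": mean, "stddev": math.sqrt(variance)}


# Made-up numbers, not taken from the HIGGS dataset.
stats = local_column_stats([1.0, 2.0, 3.0, 4.0])
print(stats)
```

Each client can compute such values on its own data; only these summaries, not the raw rows, need to leave the site.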

Let's check that the code in `df_stats.py` works.

In [None]:
cd code

In [None]:
from df_stats import DFStatistics

df_stats_cal = DFStatistics(data_root_dir="/tmp/nvflare/dataset/output")

# We use fl_ctx=None for local calculation (the dataset then defaults to "site-1.csv"),
# so we can explore the stats locally without a federated setting.
df_stats_cal.initialize(fl_ctx=None)


In [None]:
data_features = df_stats_cal.features()

In [None]:
data_features

In [None]:
df_stats_cal.count("train", "lepton_pt")

In [None]:
df_stats_cal.mean("train", "lepton_pt")

In [None]:
df_stats_cal.mean("train", "m_wwbb")

In [None]:
df_stats_cal.stddev("train", "m_wwbb")

In [None]:
df_stats_cal.histogram("train", "lepton_pt", 20, 0, 10)
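
The call above asks for 20 bins over the range 0 to 10, so each bin is 0.5 wide. The following sketch shows one way such a fixed-range histogram can be counted with plain Python. It is an illustration only: `fixed_range_histogram` is a hypothetical helper, the values are made up, and how `DFStatistics` treats values outside the range is not covered here.

```python
from typing import List


def fixed_range_histogram(values: List[float], bins: int, lo: float, hi: float) -> List[int]:
    """Count values into `bins` equal-width buckets covering [lo, hi]."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        if v < lo or v > hi:
            continue  # values outside the range are ignored in this sketch
        # A value equal to hi is placed in the last bucket.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return counts


# Made-up values; 20 bins over [0, 10] gives buckets of width 0.5.
hist = fixed_range_histogram([0.2, 0.4, 0.7, 9.9, 10.0, 12.0], bins=20, lo=0, hi=10)
print(hist[0], hist[1], hist[19])  # 2 1 2
```

Using the same bin count and range on every client is what makes the local histograms directly comparable.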

Great! The code works. Let's move on to the federated statistics calculation. Before we do that, we need to move back to the parent directory of `code`.

In [None]:
cd ../.

## Create Federated Statistics Job

We are going to use the NVFLARE Job CLI to create the job. For detailed instructions on the Job CLI, please follow the [job cli tutorial](https://github.com/NVIDIA/NVFlare/blob/main/examples/tutorials/job_cli.ipynb).

Let's check the available job templates. We are going to use one of the existing job templates and modify it to fit our needs. A job template is simply a set of server-side and client-side job configurations.

In [None]:
!nvflare config -jt ../../../../../job_templates/

In [None]:
!nvflare job list_templates

We can see there is a `stats_df` job template, which is what we need. Now use the `nvflare job create` command to create a job from it. We would like the job to use the `df_stats.py` file we just tested, so we pass the `code` directory with `-sd`.

In [None]:
!nvflare job create -w stats_df -force -j /tmp/nvflare/jobs/stats_df -sd code

In [None]:
!tree /tmp/nvflare/jobs/stats_df  

Let's modify the server configuration to set the number of bins to 20, the global min/max range to [0,10] instead of [0,120], and the stats_writer output path to "statistics/stats.json".

In [None]:
!nvflare job create -w stats_df -force -j /tmp/nvflare/jobs/stats_df -sd code -f config_fed_server.conf bins=20 range="[0,10]" output_path="statistics/stats.json"

In [None]:
!cat /tmp/nvflare/jobs/stats_df/app/config/config_fed_server.conf         

Now look at the client configuration. We notice that the component configuration in the job template
```
components = [
  {
    id = "df_stats_generator"
    path = "df_statistics.DFStatistics"
    args {
      data_path = "data.csv"
    }
  }
```

is different from our new DFStatistics class, whose constructor takes `data_root_dir`, not `data_path`. So we will need to modify that.

```

class DFStatistics(Statistics):
    def __init__(self, data_root_dir: str):
        super().__init__()
        self.data_root_dir = data_root_dir
        self.data: Optional[Dict[str, pd.DataFrame]] = None
        self.data_features = None
```


In [None]:
!cat /tmp/nvflare/jobs/stats_df/app/config/config_fed_client.conf 

We need to do the following:
1. remove the `data_path` argument
2. add the `data_root_dir` argument
3. change the path of the DFStatistics class from `df_statistics.DFStatistics` to `df_stats.DFStatistics`
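
The three changes can be pictured as edits to a plain Python dictionary that mirrors the component entry from the template excerpt. This is only a sketch of the intended result: `adapt_component` is a hypothetical helper and is not how the Job CLI edits the configuration file.

```python
import copy

# Component entry mirroring the job template excerpt.
template_component = {
    "id": "df_stats_generator",
    "path": "df_statistics.DFStatistics",
    "args": {"data_path": "data.csv"},
}


def adapt_component(component: dict, data_root_dir: str) -> dict:
    """Return a copy of the component with the three changes applied."""
    updated = copy.deepcopy(component)
    updated["args"].pop("data_path", None)  # 1. remove data_path
    updated["args"]["data_root_dir"] = data_root_dir  # 2. add data_root_dir
    updated["path"] = "df_stats.DFStatistics"  # 3. point at the new class
    return updated


new_component = adapt_component(template_component, "/tmp/nvflare/dataset/output")
print(new_component)
```

The original dictionary is left untouched, which makes it easy to compare the entry before and after the change.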

We use the following syntax to do this (you can also open the file in an editor and change it directly).



In [None]:
!nvflare job create -w stats_df -force -j /tmp/nvflare/jobs/stats_df -sd code \
-f config_fed_server.conf \
   bins=20 \
   range="[0,10]" \
   output_path="statistics/stats.json" \
-f config_fed_client.conf \
   components[0].path="df_stats.DFStatistics" \
   components[0].args.data_path- \
   components[0].args.data_root_dir="/tmp/nvflare/dataset/output" -debug

   

In [None]:
!tree /tmp/nvflare/jobs/stats_df  



## Run job in FL Simulator

Now we can run the job with the FL Simulator.

In [None]:
!nvflare simulator /tmp/nvflare/jobs/stats_df -w /tmp/nvflare/tabular/stats_df -n 3 -t 3



The results are stored in 
```
/tmp/nvflare/tabular/stats_df/simulate_job/statistics/stats.json
```


In [None]:
!ls -al /tmp/nvflare/tabular/stats_df/simulate_job/statistics/


In [None]:
!cat /tmp/nvflare/tabular/stats_df/simulate_job/statistics/stats.json

## Result Visualization


In [None]:
import json
import pandas as pd
from nvflare.app_opt.statistics.visualization.statistics_visualization import Visualization
with open('/tmp/nvflare/tabular/stats_df/simulate_job/statistics/stats.json', 'r') as f:
    data = json.load(f)

vis = Visualization()
vis.show_stats(data = data)

In [None]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

In [None]:
vis.show_histograms(data = data, plot_type="main")

Because the data is distributed homogeneously across the 3 clients, the global histogram count in each bin is roughly 3 times the corresponding local count.
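
The factor of about 3 follows from how histograms with identical bin edges combine: the global count in a bin is the sum of the local counts in that bin. Below is a minimal sketch of that idea. It is not NVFLARE's aggregation code; `merge_histograms` is a hypothetical helper and the counts are made up.

```python
from typing import List


def merge_histograms(local_histograms: List[List[int]]) -> List[int]:
    """Sum per-bin counts from several sites that share the same bin edges."""
    n_bins = len(local_histograms[0])
    if any(len(h) != n_bins for h in local_histograms):
        raise ValueError("all sites must use the same number of bins")
    return [sum(counts) for counts in zip(*local_histograms)]


# Made-up local counts for three sites with similar distributions.
site_histograms = [
    [10, 40, 30, 20],
    [11, 39, 31, 19],
    [9, 41, 29, 21],
]
global_histogram = merge_histograms(site_histograms)
print(global_histogram)  # [30, 120, 90, 60]
```

With similar local distributions, each global bin ends up close to three times any single site's bin.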

## We are done!
Congratulations! You have just completed the federated statistics calculation for tabular data.

If you would like to see a detailed discussion of privacy filtering, please check out the [federated statistics](https://github.com/NVIDIA/NVFlare/tree/main/examples/advanced/federated-statistics) example.