# Tabular Data Federated Statistics 

Before running machine learning tasks on tabular data, it is often helpful to examine the statistics of the dataset on each client. This tutorial illustrates federated statistics for tabular data.


## Set up NVFLARE

Follow [Getting Started](../../../../getting_started/readme.ipynb) to set up a virtual environment and install NVFLARE.



## Install requirements
The steps below assume the current directory is `/examples/hello-world/step-by-step/higgs/stats`.

In [None]:
!pwd

In [None]:
%pip install -r requirements.txt

>Note:
The upcoming sections use the `tree` command. On a Linux system that uses apt, you can install it with `sudo apt install tree`. As an alternative to `tree`, you can use `ls -al`.


## Prepare data
Please refer to the [prepare_higgs_data](../prepare_data.ipynb) notebook. Pay attention to the current location: you need to switch to the "higgs" directory to run the data split.
    

Now that the data is prepared, let's first take a look at it.

In [None]:
features = ["label", "lepton_pt", "lepton_eta", "lepton_phi", "missing_energy_magnitude", "missing_energy_phi", "jet_1_pt", "jet_1_eta", "jet_1_phi", "jet_1_b_tag", "jet_2_pt", "jet_2_eta", "jet_2_phi", "jet_2_b_tag", "jet_3_pt", "jet_3_eta", "jet_3_phi", "jet_3_b_tag",\
            "jet_4_pt", "jet_4_eta", "jet_4_phi", "jet_4_b_tag", \
            "m_jj", "m_jjj", "m_lv", "m_jlv", "m_bb", "m_wbb", "m_wwbb"]

In [None]:
features

In [None]:
import numpy as np
import pandas as pd

df: pd.DataFrame = pd.read_csv("/tmp/nvflare/dataset/output/site-1.csv", names=features, sep=r"\s*,\s*", engine="python", na_values="?")

In [None]:
df

## Create a statistics calculator for the local tabular dataset

We compose a calculator that computes the statistics of a tabular dataset, including count, sum, mean, and standard deviation, among others. See `./code/df_stats.py` for details.
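To give a rough idea of what such a calculator computes, here is a minimal pandas/NumPy sketch on synthetic data. It is not the contents of `df_stats.py`: the helper `local_stats` and the generated column are hypothetical and used only for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one site's data; only the column name mirrors a HIGGS feature.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({"lepton_pt": rng.exponential(scale=1.0, size=1000)})


def local_stats(frame: pd.DataFrame, feature: str, bins: int, lo: float, hi: float) -> dict:
    """Hypothetical helper: summary statistics of one feature in one local dataset."""
    col = frame[feature].dropna()
    # Equal-width bins over [lo, hi]; values outside the range are not counted.
    counts, edges = np.histogram(col, bins=bins, range=(lo, hi))
    return {
        "count": int(col.count()),
        "sum": float(col.sum()),
        "mean": float(col.mean()),
        "stddev": float(col.std()),  # pandas default is the sample std (ddof=1)
        "histogram": [(float(a), float(b), int(c)) for a, b, c in zip(edges[:-1], edges[1:], counts)],
    }


stats = local_stats(demo_df, "lepton_pt", bins=20, lo=0, hi=10)
print(stats["count"], round(stats["mean"], 3), round(stats["stddev"], 3))
```

Each histogram entry is a `(bin_start, bin_end, count)` tuple, which keeps the bin edges next to the counts when the result is inspected.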

Let's check that the calculator in `df_stats.py` works.

In [None]:
cd code

In [None]:
from df_stats import DFStatistics

df_stats_cal = DFStatistics(data_root_dir = "/tmp/nvflare/dataset/output")

# Pass fl_ctx=None for a local calculation (the dataset then defaults to "site-1.csv"),
# so we can explore the stats locally without a federated setting.
df_stats_cal.initialize(fl_ctx=None)


In [None]:
data_features = df_stats_cal.features()

In [None]:
data_features

In [None]:
df_stats_cal.count("train", "lepton_pt")

In [None]:
df_stats_cal.mean("train", "lepton_pt")

In [None]:
df_stats_cal.mean("train", "m_wwbb")

In [None]:
df_stats_cal.stddev("train", "m_wwbb")

In [None]:
df_stats_cal.histogram("train", "lepton_pt", 20, 0, 10)

Great! The code works. Let's move on to the federated statistics calculations. Before we do that, we need to move back to the parent directory of `code`.

In [None]:
cd ../.

## Create Federated Statistics Job
We will use the Job API to construct a FedJob, then use it to run a simulation or export the job configs.


In [None]:
!cat  code/df_stats_job.py
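Conceptually, federated statistics combine per-site summaries rather than raw rows. The sketch below shows one common way to do this for the mean and standard deviation using two rounds of aggregation. It is an illustration on synthetic data under that assumption, not NVFLARE's implementation, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
# Three synthetic "sites"; in a real deployment the raw values never leave a site.
sites = {f"site-{i}": rng.normal(loc=1.0, scale=2.0, size=500 + 100 * i) for i in range(1, 4)}

# Round 1: each site reports only its count and sum; the server derives the global mean.
round1 = {name: {"count": x.size, "sum": float(x.sum())} for name, x in sites.items()}
global_count = sum(s["count"] for s in round1.values())
global_sum = sum(s["sum"] for s in round1.values())
global_mean = global_sum / global_count

# Round 2: each site reports its sum of squared deviations from the global mean.
round2 = {name: float(((x - global_mean) ** 2).sum()) for name, x in sites.items()}
global_var = sum(round2.values()) / (global_count - 1)  # sample variance
global_stddev = global_var ** 0.5

# Sanity check against pooling the data, which is only possible in this toy setting.
pooled = np.concatenate(list(sites.values()))
print(np.isclose(global_mean, pooled.mean()), np.isclose(global_stddev, pooled.std(ddof=1)))
```

The second round is needed because the deviations are taken around the global mean, which no single site knows until the first round has been aggregated.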



## Run job in FL Simulator

Now we can run the job with the simulator. There are two ways to do this:
1) run the job directly via job.simulator_run()
2) generate the job config, then use the simulator CLI
 
**Run job.simulator_run()**

> Note:
The data root directory is data_root_dir=/tmp/nvflare/dataset/output


In [None]:
! python code/df_stats_job.py -w /tmp/nvflare/tabular/stats_df -n 3 -d /tmp/nvflare/dataset/output


**Export the job config, then run the job using the simulator CLI**

```
! python code/df_stats_job.py -co -j /tmp/nvflare/jobs/stats_df_job -n 3
! nvflare simulator /tmp/nvflare/jobs/stats_df_job/stats_df/ -w /tmp/nvflare/tabular/stats_df -n 3 -t 3

```



### Examine Result



The results are stored in 
```
/tmp/nvflare/tabular/stats_df/server/simulate_job/statistics/stats.json
```
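The layout of `stats.json` is not described here, so a quick way to get oriented is to list its nested keys before visualizing it. The helper `outline` below is a hypothetical sketch, and the sample dictionary is a neutral placeholder rather than the real file format.

```python
import json


def outline(node, depth=0, max_depth=3):
    """Hypothetical helper: return nested dict keys down to max_depth as indented lines."""
    lines = []
    if isinstance(node, dict) and depth < max_depth:
        for key, value in node.items():
            lines.append("  " * depth + str(key))
            lines.extend(outline(value, depth + 1, max_depth))
    return lines


# Neutral placeholder; it does not reflect the actual structure of stats.json.
sample = json.loads('{"a": {"b": {"c": {"d": 1}}}, "e": 2}')
print("\n".join(outline(sample)))
```

To apply it to the real output, load the file with `json.load` and pass the resulting dictionary to `outline`.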


In [None]:
!ls -al /tmp/nvflare/tabular/stats_df/server/simulate_job/statistics/


## Result Visualization


In [None]:
import json
import pandas as pd
from nvflare.app_opt.statistics.visualization.statistics_visualization import Visualization
with open('/tmp/nvflare/tabular/stats_df/server/simulate_job/statistics/stats.json', 'r') as f:
    data = json.load(f)

vis = Visualization()
vis.show_stats(data = data)

In [None]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100%  depth:100% !important; }</style>"))

In [None]:
vis.show_histograms(data = data, plot_type="main")

Because the data is distributed homogeneously across the 3 clients, the global histogram count in each bin is roughly 3 times the corresponding local histogram count.
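The relationship between global and local histograms can be checked with a small NumPy sketch. It assumes every site uses the same bin edges, so that per-bin counts can simply be added; the data is synthetic and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
bins, value_range = 20, (0.0, 10.0)

# Three synthetic sites drawn from the same distribution (a homogeneous split).
site_values = [rng.exponential(scale=1.5, size=10_000) for _ in range(3)]

# With shared bin edges, the global histogram is the element-wise sum of the local ones.
local_hists = [np.histogram(v, bins=bins, range=value_range)[0] for v in site_values]
global_hist = np.sum(local_hists, axis=0)

# Pooling the raw values gives the same counts (only possible in this toy setting).
pooled_hist = np.histogram(np.concatenate(site_values), bins=bins, range=value_range)[0]
print(np.array_equal(global_hist, pooled_hist))

# In well-populated bins, the global count is close to 3 times one site's count.
ratios = global_hist[:5] / local_hists[0][:5]
print(np.round(ratios, 2))
```

The ratio drifts away from 3 in sparsely populated bins, where sampling noise dominates, and it would differ systematically if the sites held differently distributed data.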

## We are done!
Congratulations! You have just completed the federated statistics calculation for tabular data.

If you would like to see a detailed discussion of privacy filtering, please check out the [federated statistics](https://github.com/NVIDIA/NVFlare/tree/main/examples/advanced/federated-statistics) examples.

Let's move on to the next examples and see how we can use scikit-learn to train federated models on tabular data.
First we will look at the [sklearn-linear](../sklearn-linear/sklearn_linear.ipynb) example, which illustrates how to train a federated linear model (logistic regression for binary classification).