# Federated Statistics with Tabular Data

Tabular data is one of the most common data types in the world. In this chapter, we will explore federated statistics with tabular data. We will use a `pandas` `DataFrame` to calculate the statistics of the local data at each site and of the global data across all sites.

To illustrate with an example, we will first install the dependencies and then prepare the dataset.


### Install dependencies
First, install the required packages:

In [None]:
! pip install -r code/requirements.txt


> Sidebar: 
> **Installing fastdigest**
>
> If you intend to calculate quantiles, you need to install `fastdigest`, which is not included in the requirements.txt file. If you are not calculating quantiles, you can skip this step.
>
> ```bash
> pip install fastdigest==0.4.0
> ```
>
> On Ubuntu, you might get the following error:
>
> ```
> Cargo, the Rust package manager, is not installed or is not on PATH.
> This package requires Rust and Cargo to compile extensions. Install it through
> the system's package manager or via https://rustup.rs/
>     
> Checking for Rust toolchain....
> ```
>
> This is because fastdigest (or its dependencies) requires Rust and Cargo to build. 
>
> You need to install Rust and Cargo on your Ubuntu system. Follow these steps:
>
> 1. Install Rust and Cargo by running:
>    ```bash
>    cd NVFlare/examples/advanced/federated-statistics/df_stats
>    ./install_cargo.sh
>    ```
>
> 2. Then install fastdigest again:
>    ```bash
>    pip install fastdigest==0.4.0
>    ```


### Prepare data

In this example, we are using the UCI (University of California, Irvine) [adult dataset](https://archive.ics.uci.edu/dataset/2/adult).

The original dataset already contains "training" and "test" datasets. Here we simply assume that the "training" and "test" datasets each belong to a client, so we assign the adult.train dataset to site-1 and the adult.test dataset to site-2.
```
site-1: adult.train
site-2: adult.test
```

Now we use the data utility to download the UCI datasets into a separate directory for each client under `/tmp/nvflare/df_stats/data`.
Please note that the UCI website may experience occasional downtime.

In [None]:
%cd code/data

from prepare_data import prepare_data

prepare_data(data_root_dir = "/tmp/nvflare/df_stats/data")

%cd -

Let's take a look at the data:

In [None]:
import pandas as pd

data_path = "/tmp/nvflare/df_stats/data/site-1/data.csv"
data_features = [
    "Age",
    "Workclass",
    "fnlwgt",
    "Education",
    "Education-Num",
    "Marital Status",
    "Occupation",
    "Relationship",
    "Race",
    "Sex",
    "Capital Gain",
    "Capital Loss",
    "Hours per week",
    "Country",
    "Target",
]

# The original dataset has no header row, so we pass the column names explicitly.
# site-1 uses the adult.train dataset; site-2 uses the adult.test dataset.
# The first line of adult.test is incorrectly formatted, so we skip it for site-2.
skip_rows = {
    "site-1": [],
    "site-2": [0],
}

# data_path points at site-1, so use the site-1 entry of skip_rows.
df = pd.read_csv(
    data_path, names=data_features, sep=r"\s*,\s*", skiprows=skip_rows["site-1"], engine="python", na_values="?"
)

df

> Note 
We will only calculate the statistics of numerical features; categorical features will be skipped.

## Define statistics configuration

We can configure each statistic using a dictionary, where the key is the statistic's name and the value is the statistic's configuration.

```python
statistic_configs = {
    "count": {},
    "mean": {},
    "sum": {},
    "stddev": {},
    "histogram": {"*": {"bins": 20}, "Age": {"bins": 10, "range": [0, 100]}},
    "quantile": {"*": [0.1, 0.5, 0.9], "Age": [0.5, 0.9]},
}
```

For each statistic, we can give additional instructions per feature. The count, mean, sum, and stddev statistics are calculated the same way for all features, so their configurations are empty. For histogram, we can define different bins for each feature. "*" is a wildcard that matches all features.

For example, here:
```
"histogram": {"*": {"bins": 20}, "Age": {"bins": 10, "range": [0, 100]}},
```
we compute histograms with 20 bins for all features. Because no range is given for them, the range will be calculated from the data. For the feature "Age", we define 10 bins and the range [0, 100].
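
To make the effect of `bins` and `range` concrete, here is a minimal, library-free sketch of equal-width binning. The helper `fixed_width_histogram` and the sample ages are hypothetical and are not part of FLARE. In the federated setting the data-derived range would presumably come from the minimum and maximum across all sites rather than from a single list, as it does here.

```python
from typing import List, Optional, Sequence, Tuple


def fixed_width_histogram(
    values: Sequence[float],
    bins: int,
    value_range: Optional[Tuple[float, float]] = None,
) -> List[Tuple[float, float, int]]:
    """Return (bin_start, bin_end, count) triples for equal-width bins."""
    if value_range is None:
        # No range configured: derive it from the data itself.
        low, high = min(values), max(values)
    else:
        low, high = value_range
    if high <= low:
        raise ValueError("the range must have a positive width")
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        if v < low or v > high:
            continue  # values outside the range are ignored in this sketch
        # The last bin is closed on the right so that v == high is counted.
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return [(low + i * width, low + (i + 1) * width, counts[i]) for i in range(bins)]


ages = [17, 22, 35, 35, 48, 61, 90]  # made-up sample values

# Mirrors "Age": {"bins": 10, "range": [0, 100]}: ten bins of width 10.
age_hist = fixed_width_histogram(ages, bins=10, value_range=(0, 100))

# Mirrors "*": {"bins": 20}: the range is taken from the data (17 to 90 here).
auto_hist = fixed_width_histogram(ages, bins=20)
```

A fixed range keeps the bin edges identical at every site, which is what allows per-site histograms to be added bin by bin.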

Similarly, the quantile statistic is configured per feature: `[0.1, 0.5, 0.9]` for all features and `[0.5, 0.9]` for "Age".
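
The wildcard lookup described above can be sketched as a small function. `resolve_feature_config` is a hypothetical helper written for illustration. It assumes that a feature-specific entry takes precedence over `"*"`, which matches the description here but is not taken from FLARE's source.

```python
from typing import Any, Dict, Optional


def resolve_feature_config(stat_config: Dict[str, Any], feature: str) -> Optional[Any]:
    """Return the feature's own entry if present, otherwise the "*" entry."""
    if feature in stat_config:
        return stat_config[feature]
    # Fall back to the wildcard; None means the statistic has no per-feature config.
    return stat_config.get("*")


histogram_config = {"*": {"bins": 20}, "Age": {"bins": 10, "range": [0, 100]}}
quantile_config = {"*": [0.1, 0.5, 0.9], "Age": [0.5, 0.9]}

age_histogram = resolve_feature_config(histogram_config, "Age")
hours_histogram = resolve_feature_config(histogram_config, "Hours per week")
age_quantiles = resolve_feature_config(quantile_config, "Age")
fnlwgt_quantiles = resolve_feature_config(quantile_config, "fnlwgt")
```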

If a user only needs statistics other than quantile and histogram, the configuration can be simplified to:

```python
statistic_configs = {
    "count": {},
    "mean": {},
    "sum": {},
    "stddev": {},
}
```


## Define the local statistics generator

Based on the above target statistics configuration, we can define the local statistics generator. To do this, we need to write a class that implements the `Statistics` interface:

```python
class Statistics(InitFinalComponent, ABC):
    def initialize(self, fl_ctx: FLContext):
    def pre_run(self, statistics: List[str], num_of_bins: Optional[Dict[str, Optional[int]]],bin_ranges: Optional[Dict[str, Optional[List[float]]]]):
    def features(self) -> Dict[str, List[Feature]]:
    def count(self, dataset_name: str, feature_name: str) -> int:
    def sum(self, dataset_name: str, feature_name: str) -> float:
    def mean(self, dataset_name: str, feature_name: str) -> float:
    def stddev(self, dataset_name: str, feature_name: str) -> float:
    def variance_with_mean(self, dataset_name: str, feature_name: str, global_mean: float, global_count: float) -> float:
    def histogram(self, dataset_name: str, feature_name: str, num_of_bins: int, global_min_value: float, global_max_value: float) -> Histogram:
    def max_value(self, dataset_name: str, feature_name: str) -> float:
    def min_value(self, dataset_name: str, feature_name: str) -> float:
    def failure_count(self, dataset_name: str, feature_name: str) -> int:
    def quantiles(self, dataset_name: str, feature_name: str, percentiles: List) -> Dict:
    def finalize(self, fl_ctx: FLContext):
```
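
The `variance_with_mean` signature takes a global mean and a global count, which suggests that the standard deviation is computed in two rounds: first the global count and mean, then each site's contribution to the variance. The sketch below illustrates that idea in plain Python with made-up values. The function `local_variance_share` and the sample-variance denominator `global_count - 1` are assumptions made for illustration, not FLARE's implementation.

```python
import math
from typing import Dict, List

# Made-up "Age" values held privately at two sites.
site_data: Dict[str, List[float]] = {
    "site-1": [25.0, 38.0, 52.0],
    "site-2": [31.0, 44.0],
}

# Round 1: each site shares only its local count and local sum.
local_counts = {site: len(values) for site, values in site_data.items()}
local_sums = {site: sum(values) for site, values in site_data.items()}

global_count = sum(local_counts.values())
global_mean = sum(local_sums.values()) / global_count


def local_variance_share(values: List[float], mean: float, count: int) -> float:
    """One site's share of the global sample variance (assumed formula)."""
    return sum((v - mean) ** 2 for v in values) / (count - 1)


# Round 2: each site uses the global mean and count to compute its share,
# and the aggregator adds the shares together.
global_variance = sum(
    local_variance_share(values, global_mean, global_count)
    for values in site_data.values()
)
global_stddev = math.sqrt(global_variance)
```

Only aggregates leave each site in either round. The raw values stay local, and the result equals the sample variance of the pooled data.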

NVIDIA FLARE provides a base [`DFStatisticsCore`](https://github.com/NVIDIA/NVFlare/blob/main/nvflare/app_opt/statistics/df/df_core_statistics.py#L28) class, which is a core class for calculating the statistics of the data frame. We can inherit this class and override the methods to calculate the statistics. Here are a few assumptions:

* The data can be loaded and cached in memory.
* The data has proper column names and can be loaded into a pandas dataframe.
* The feature names can be obtained from the dataframe.

Let's take a look at our example in [code/src/df_statistics.py](code/src/df_statistics.py). We can see that, with `DFStatisticsCore`, we only need to implement the `initialize()` function, which internally calls the `load_data()` function that loads the dataset as a `pandas` `dataframe`:
```python
def load_data(self, fl_ctx: FLContext) -> Dict[str, pd.DataFrame]
```

In [None]:
!cat code/src/df_statistics.py
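
The essence of what a `load_data()` implementation has to return can be sketched without FLARE. The following standalone sketch is not the contents of `df_statistics.py`. `load_site_data` is a hypothetical helper, the dataset key `"train"` is an assumed name, and the two sample rows are made up. It reuses the same `pd.read_csv` arguments as the data preview earlier in this chapter.

```python
import io
from typing import Dict, List

import pandas as pd


def load_site_data(source, feature_names: List[str], skip_rows: List[int]) -> Dict[str, pd.DataFrame]:
    """Load one site's CSV and return it keyed by an (assumed) dataset name."""
    df = pd.read_csv(
        source,
        names=feature_names,   # the raw files have no header row
        sep=r"\s*,\s*",        # tolerate spaces around commas
        skiprows=skip_rows,    # e.g. [0] for the malformed first line of adult.test
        engine="python",       # required for a regular-expression separator
        na_values="?",         # "?" marks missing values in this dataset
    )
    return {"train": df}


# A tiny in-memory stand-in for a site's data.csv.
sample = io.StringIO("39, State-gov, 13\n50, ?, 13\n")
datasets = load_site_data(sample, ["Age", "Workclass", "Education-Num"], skip_rows=[])
```

Returning a dictionary rather than a single dataframe matches the `dataset_name` parameter that every method of the `Statistics` interface receives.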

## Define Job Configuration

Each FLARE job is defined by a job configuration, which includes the configurations for the clients and the server. Optionally, the job configuration also contains the customized job code. You have seen this in [Job Structure and configuration](../../../chapter-1_running_federated_learning_applications/01.6_job_structure_and_configuration/understanding_fl_job.ipynb).

Similar to other examples, we can use FLARE's Job API to define and configure the job for statistics computation. FLARE provides a built-in `StatsJob` class, which inherits from the `FedJob` class.

```python
job = StatsJob(
    job_name="<job_name>",
    statistic_configs=statistic_configs,
    stats_generator=df_stats_generator,
    output_path=output_path,
)
```

Let's take a look at our job defined in [code/df_stats_job.py](code/df_stats_job.py). 

Notice that we hardcoded the column names in this example for simplicity, but in practice, users can get the column names from files such as CSV files, parquet files, etc.
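
For a CSV file that does have a header row (unlike the raw adult files used here), reading the column names could look like the following minimal sketch. `read_column_names` is a hypothetical helper and the sample content is made up.

```python
import csv
import io
from typing import List, TextIO


def read_column_names(csv_file: TextIO) -> List[str]:
    """Return the header fields of an open CSV file, stripped of whitespace."""
    reader = csv.reader(csv_file, skipinitialspace=True)
    header = next(reader)  # only the first row is read
    return [name.strip() for name in header]


# In-memory stand-in for a CSV file that starts with a header row.
sample = io.StringIO("Age, Workclass, fnlwgt\n39, State-gov, 77516\n")
columns = read_column_names(sample)
```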

## Run Job with FL Simulator

Let's run the job.

With the default arguments, the job will be exported to `/tmp/nvflare/jobs/stats_df` and then run with the FL simulator through the `simulator_run()` command, using a work_dir of `/tmp/nvflare/jobs/stats_df/work_dir`.

In [None]:
! cd code && python3 df_stats_job.py 

## Examine the result

With the default parameters, the results are stored in the workspace `/tmp/nvflare/jobs/stats_df/work_dir`:
```
/tmp/nvflare/jobs/stats_df/work_dir/server/simulate_job/statistics/adults_stats.json
```

In [None]:
! ls -al /tmp/nvflare/jobs/stats_df/work_dir/server/simulate_job/statistics/adults_stats.json
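
Beyond checking that the file exists, a quick way to peek at the result is to load it with the standard `json` module. The sketch below assumes only that the file holds a JSON object. It makes no claim about the nested layout, and `top_level_keys` is a hypothetical helper.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def top_level_keys(path: Path) -> List[str]:
    """Load a JSON object from `path` and return its top-level keys, sorted."""
    with path.open() as f:
        result: Dict[str, Any] = json.load(f)
    return sorted(result)


result_path = Path(
    "/tmp/nvflare/jobs/stats_df/work_dir/server/simulate_job/statistics/adults_stats.json"
)
# Only inspect the file if the job has actually produced it.
if result_path.exists():
    print(top_level_keys(result_path))
```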

## Visualization
We can visualize the results easily with the visualization notebook. Before we do that, we need to copy the results to the notebook directory:


In [None]:
! cp  /tmp/nvflare/jobs/stats_df/work_dir/server/simulate_job/statistics/adults_stats.json  code/demo/.

Now we can visualize the results with the [visualization notebook](code/demo/visualization.ipynb)

## We are done!
Congratulations, you just completed the federated stats calculation with data represented by a data frame.

Let's move on to [federated stats with Image Data](../federated_statistics_with_image_data/federated_statistics_with_image_data.ipynb).