# Federated Statistics with Tabular Data

Tabular data is one of the most common data types, and it is easy to understand and manipulate. In this chapter, we explore federated statistics with tabular data. We use pandas dataframes to calculate statistics of the local data at each site and of the global data across sites.

We start by installing the dependencies and preparing the dataset.

## Prerequisites


### Install dependencies
First, install the required packages:

In [None]:
%pip install -r code/requirements.txt


> Sidebar: 
> **Installing fastdigest**
>
> If you intend to calculate quantiles, you need to install fastdigest, which is not included in the requirements.txt file. If you are not calculating quantiles, you can skip this step.
>
> ```bash
> pip install fastdigest==0.4.0
> ```
>
> On Ubuntu, you might get the following error:
>
> ```
> Cargo, the Rust package manager, is not installed or is not on PATH.
> This package requires Rust and Cargo to compile extensions. Install it through
> the system's package manager or via https://rustup.rs/
>     
> Checking for Rust toolchain....
> ```
>
> This is because fastdigest (or its dependencies) requires Rust and Cargo to build. 
>
> You need to install Rust and Cargo on your Ubuntu system. Follow these steps:
>
> 1. Install Rust and Cargo by running:
>    ```bash
>    cd NVFlare/examples/advanced/federated-statistics/df_stats
>    ./install_cargo.sh
>    ```
>
> 2. Then install fastdigest again:
>    ```bash
>    pip install fastdigest==0.4.0
>    ```


### Prepare data

In this example, we use the UCI (University of California, Irvine) [adult dataset](https://archive.ics.uci.edu/dataset/2/adult).
The original dataset already contains "training" and "test" datasets. Here we simply assume that the "training" and "test" datasets each belong to a client, so we assign the adult.train dataset to site-1 and the adult.test dataset to site-2.
```
site-1: adult.train
site-2: adult.test
```

Now we use the data utility to download the UCI datasets into a separate directory for each client under /tmp/nvflare/df_stats/data.
Note that the UCI website may experience occasional downtime.

In [None]:
%cd code/data

from prepare_data import prepare_data

prepare_data(data_root_dir = "/tmp/nvflare/df_stats/data")

%cd -

Let's take a look at the data:

In [3]:
import pandas as pd

data_path = "/tmp/nvflare/df_stats/data/site-1/data.csv"
data_features = [
    "Age",
    "Workclass",
    "fnlwgt",
    "Education",
    "Education-Num",
    "Marital Status",
    "Occupation",
    "Relationship",
    "Race",
    "Sex",
    "Capital Gain",
    "Capital Loss",
    "Hours per week",
    "Country",
    "Target",
]

# The original dataset has no header.
# We use the adult.train dataset for site-1 and the adult.test dataset for site-2.
# The adult.test dataset has an incorrectly formatted first row, so we skip it.
skip_rows = {
    "site-1": [],
    "site-2": [0],
}

# read_csv expects a list of row indices, so pass the entry for the site being loaded (site-1 here).
df = pd.read_csv(
    data_path,
    names=data_features,
    sep=r"\s*,\s*",
    skiprows=skip_rows["site-1"],
    engine="python",
    na_values="?",
)
df


|  | Age | Workclass | fnlwgt | Education | Education-Num | Marital Status | Occupation | Relationship | Race | Sex | Capital Gain | Capital Loss | Hours per week | Country | Target |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 32556 | 27 | Private | 257302 | Assoc-acdm | 12 | Married-civ-spouse | Tech-support | Wife | White | Female | 0 | 0 | 38 | United-States | <=50K |
| 32557 | 40 | Private | 154374 | HS-grad | 9 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
| 32558 | 58 | Private | 151910 | HS-grad | 9 | Widowed | Adm-clerical | Unmarried | White | Female | 0 | 0 | 40 | United-States | <=50K |
| 32559 | 22 | Private | 201490 | HS-grad | 9 | Never-married | Adm-clerical | Own-child | White | Male | 0 | 0 | 20 | United-States | <=50K |


> Note:
> We only calculate statistics for numerical features; categorical features are skipped.
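
As a rough illustration of what "numerical features only" means in pandas terms, the sketch below builds a tiny stand-in frame (a few values copied from the first rows shown above) and keeps the numeric columns with `select_dtypes`. Whether NVIDIA FLARE filters features in exactly this way is an assumption; the point is only to show which columns qualify.

```python
import pandas as pd

# Tiny stand-in for the adult dataset; values taken from the first rows above.
df = pd.DataFrame(
    {
        "Age": [39, 50, 38],
        "Workclass": ["State-gov", "Self-emp-not-inc", "Private"],
        "fnlwgt": [77516, 83311, 215646],
        "Hours per week": [40, 13, 40],
    }
)

# Keep only columns with a numeric dtype; string columns are dropped.
numeric_df = df.select_dtypes(include="number")
numeric_features = list(numeric_df.columns)
print(numeric_features)

# Local statistics for one feature, computed with plain pandas.
print(numeric_df["Age"].count(), numeric_df["Age"].sum(), numeric_df["Age"].mean())
```

Here "Workclass" is excluded because it holds strings, while "Age", "fnlwgt" and "Hours per week" remain.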

## Define target statistics configuration

Let's see which statistics we want to calculate. We can capture the statistics configuration in a dictionary, where each key is a statistic name and each value is the configuration for that statistic.

```
statistic_configs = {
    "count": {},
    "mean": {},
    "sum": {},
    "stddev": {},
    "histogram": {"*": {"bins": 20}, "Age": {"bins": 10, "range": [0, 100]}},
    "quantile": {"*": [0.1, 0.5, 0.9], "Age": [0.5, 0.9]},
}
```

For each statistic, we can give additional instructions per feature. The count, mean, sum and stddev statistics take no options, so they are calculated the same way for all features. For histogram, however, we can define different bins for each feature. "*" is a wildcard that matches all features.
```
"histogram": {"*": {"bins": 20}, "Age": {"bins": 10, "range": [0, 100]}},
```
Here, we define a histogram with 20 bins for all features. The range is not defined, which means that it will be calculated from the data.
For the feature "Age", we instead define 10 bins and the range [0, 100].
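
To make the wildcard behavior concrete, here is a minimal sketch of how a per-feature lookup could work. The helper `resolve_feature_config` is hypothetical and written only for this illustration; it is not part of NVIDIA FLARE, and the library's actual lookup logic may differ.

```python
from typing import Any, Dict


def resolve_feature_config(config: Dict[str, Any], feature: str) -> Any:
    """Hypothetical helper: prefer the feature-specific entry, else fall back to "*"."""
    if feature in config:
        return config[feature]
    return config.get("*")


histogram_config = {"*": {"bins": 20}, "Age": {"bins": 10, "range": [0, 100]}}

# "Age" has its own entry; "fnlwgt" falls back to the wildcard entry.
print(resolve_feature_config(histogram_config, "Age"))
print(resolve_feature_config(histogram_config, "fnlwgt"))
```

Under this reading, a feature-specific entry replaces the wildcard entry rather than merging with it.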

Similarly, quantiles are defined per feature, with different values for different features. All of this can be specified with the Job API:


In [4]:
!cat code/df_stats_job.py

# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse

from src.df_statistics import DFStatistics

from nvflare.job_config.stats_job import StatsJob


def define_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("-n", "--n_clients", type=int, default=2)
    parser.add_argument("-d", "--data_root_dir", type=str, nargs="?", default="/tmp/nvflare/df_stats/data")
    parser.add_argument("-o", "--stats_outp

If you only want to calculate statistics other than quantile and histogram, the configuration can be simplified to:

    statistic_configs = {
        "count": {},
        "mean": {},
        "sum": {},
        "stddev": {},
    }

Similarly, the code can be changed to:


```
    # ... previous code omitted ...

    statistic_configs = {
        "count": {},
        "mean": {},
        "sum": {},
        "stddev": {},
    }

    # define local stats generator
    df_stats_generator = DFStatistics(data_root_dir=data_root_dir)

    job = StatsJob(
        job_name="stats_df",
        statistic_configs=statistic_configs,
        stats_generator=df_stats_generator,
        output_path=output_path,
    )
```
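
If the same script should run both with and without fastdigest installed, one option is to add the quantile entry only when the package can be imported. The sketch below uses the standard library's `importlib.util.find_spec`; the conditional is a suggestion for the reader and is not something df_stats_job.py is known to do.

```python
import importlib.util

# Statistics that need no extra dependency.
statistic_configs = {
    "count": {},
    "mean": {},
    "sum": {},
    "stddev": {},
}

# Add quantiles only when fastdigest is importable.
# Assumption: this guard is a reader convenience, not part of the example script.
has_fastdigest = importlib.util.find_spec("fastdigest") is not None
if has_fastdigest:
    statistic_configs["quantile"] = {"*": [0.1, 0.5, 0.9], "Age": [0.5, 0.9]}

print(sorted(statistic_configs))
```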


## Define the local statistics generator

Based on the target statistics configuration above, we can define the local statistics generator. To do this, we write a class that implements the `Statistics` interface:

```
# Abridged interface: method bodies are omitted.
class Statistics(InitFinalComponent, ABC):

    def initialize(self, fl_ctx: FLContext): ...
    def pre_run(self, statistics: List[str], num_of_bins: Optional[Dict[str, Optional[int]]], bin_ranges: Optional[Dict[str, Optional[List[float]]]]): ...
    def features(self) -> Dict[str, List[Feature]]: ...
    def count(self, dataset_name: str, feature_name: str) -> int: ...
    def sum(self, dataset_name: str, feature_name: str) -> float: ...
    def mean(self, dataset_name: str, feature_name: str) -> float: ...
    def stddev(self, dataset_name: str, feature_name: str) -> float: ...
    def variance_with_mean(self, dataset_name: str, feature_name: str, global_mean: float, global_count: float) -> float: ...
    def histogram(self, dataset_name: str, feature_name: str, num_of_bins: int, global_min_value: float, global_max_value: float) -> Histogram: ...
    def max_value(self, dataset_name: str, feature_name: str) -> float: ...
    def min_value(self, dataset_name: str, feature_name: str) -> float: ...
    def failure_count(self, dataset_name: str, feature_name: str) -> int: ...
    def quantiles(self, dataset_name: str, feature_name: str, percentiles: List) -> Dict: ...
    def finalize(self, fl_ctx: FLContext): ...
```
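
The signatures of `variance_with_mean` and `histogram` take global values as inputs, which suggests a two-round flow: sites first report simple aggregates, the server combines them, and sites then compute statistics that depend on the global values. The NumPy sketch below imitates that flow for two made-up sites. The helper names, the sample data and the aggregation formulas (including the choice of sample standard deviation) are assumptions for illustration, not NVIDIA FLARE's implementation.

```python
import numpy as np

# Made-up local data for two sites (illustration only).
site_data = {
    "site-1": np.array([20.0, 30.0, 40.0]),
    "site-2": np.array([50.0, 60.0]),
}

# Round 1: each site shares only count, sum, min and max.
round1 = {
    site: {"count": x.size, "sum": x.sum(), "min": x.min(), "max": x.max()}
    for site, x in site_data.items()
}
global_count = sum(r["count"] for r in round1.values())
global_mean = sum(r["sum"] for r in round1.values()) / global_count
global_min = min(r["min"] for r in round1.values())
global_max = max(r["max"] for r in round1.values())


# Round 2: each site uses the global values to compute its contribution.
def local_squared_deviation(x, mean):
    # Sum of squared deviations from the *global* mean.
    return float(((x - mean) ** 2).sum())


def local_histogram(x, bins, lo, hi):
    # A shared range gives identical bin edges, so counts can be added.
    counts, _ = np.histogram(x, bins=bins, range=(lo, hi))
    return counts


global_ssd = sum(local_squared_deviation(x, global_mean) for x in site_data.values())
global_stddev = (global_ssd / (global_count - 1)) ** 0.5  # sample standard deviation
global_hist = sum(local_histogram(x, 4, global_min, global_max) for x in site_data.values())

# Reference values from the pooled data, which a real federation never sees.
pooled = np.concatenate(list(site_data.values()))
print(global_mean, global_stddev, global_hist)
```

The aggregated mean, standard deviation and histogram match what pooling the raw data would give, even though each site only shares summary values.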


NVIDIA FLARE provides `DFStatisticsCore`, a core class for calculating statistics on data frames. We can inherit from this class and override its methods. It makes a few assumptions:

* The data can be loaded and cached in memory.
* The data has proper column names and can be loaded into a pandas dataframe.
* The feature names can be obtained from the dataframe.

A subclass supplies the data through the following method:
```
def load_data(self, fl_ctx: FLContext) -> Dict[str, pd.DataFrame]:
```
It loads the data into a dictionary of pandas dataframes, where each key is a dataset name and each value is the corresponding dataframe.
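
To make the return shape concrete, here is a small stand-alone sketch of a loader that returns a dictionary of dataframes keyed by dataset name. It reads from an in-memory CSV so that it can run anywhere. The function name `load_csv_as_datasets` and the sample rows are hypothetical; a real `load_data` would be a method that also receives `fl_ctx`.

```python
import io
from typing import Dict, List

import pandas as pd


def load_csv_as_datasets(csv_source, feature_names: List[str], skip_rows: List[int]) -> Dict[str, pd.DataFrame]:
    """Hypothetical helper: load one headerless CSV under the dataset name "train"."""
    df = pd.read_csv(
        csv_source,
        names=feature_names,
        sep=r"\s*,\s*",  # tolerate spaces around commas
        skiprows=skip_rows,
        engine="python",  # regex separators need the python engine
        na_values="?",  # treat "?" as a missing value
    )
    return {"train": df}


# Two made-up rows in the same style as the adult files.
sample = io.StringIO("39, State-gov, 77516\n50, ?, 83311\n")
datasets = load_csv_as_datasets(sample, ["Age", "Workclass", "fnlwgt"], skip_rows=[])
print(datasets["train"])
```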

Let's look at what the user needs to do to implement the local statistics generator. With `DFStatisticsCore`, the user needs to implement the following methods:


In [5]:
!cat code/src/df_statistics.py




from typing import Dict, Optional

import pandas as pd

from nvflare.apis.fl_context import FLContext
from nvflare.app_opt.statistics.df.df_core_statistics import DFStatisticsCore


class DFStatistics(DFStatisticsCore):
    def __init__(self, filename, data_root_dir="/tmp/nvflare/df_stats/data"):
        super().__init__()
        self.data_root_dir = data_root_dir
        self.filename

## Define job configuration

Each FLARE job is defined by a job configuration. The configuration includes settings for the clients and the server and, optionally, the custom job code. You have seen this in [Job Structure and configuration](../../../chapter-1_running_federated_learning_applications/01.6_job_structure_and_configuration/understanding_fl_job.ipynb).

For now, we can define a FedJob via FLARE's Job API. Here is the statistics job that is predefined for you:

```
job = StatsJob(
    job_name="<job_name>",
    statistic_configs=statistic_configs,
    stats_generator=df_stats_generator,
    output_path=output_path,
)
```

For this example, we simply hard-coded the column names, but in practice, the user can get the column names from the data file itself, such as a CSV or Parquet file.
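
As a small illustration of reading column names from a file instead of hard-coding them, the sketch below parses only the header row of a CSV with pandas. It uses an in-memory file with made-up content, and it applies only to files that actually have a header row, which the adult files used here do not.

```python
import io

import pandas as pd

# In-memory stand-in for a CSV file that has a header row (hypothetical content).
csv_with_header = io.StringIO("Age,fnlwgt,Hours per week\n39,77516,40\n50,83311,13\n")

# nrows=0 parses only the header, so no data rows are loaded.
feature_names = pd.read_csv(csv_with_header, nrows=0).columns.tolist()
print(feature_names)
```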


## Run Job with FL Simulator

The file [df_stats_job.py](code/df_stats_job.py) uses `StatsJob` to generate a job configuration in a Pythonic way. With the default arguments, the job is exported to `/tmp/nvflare/jobs/stats_df` and then run with the FL simulator via `simulator_run()`, using `/tmp/nvflare/jobs/stats_df/work_dir` as the work_dir.

In [6]:
%cd code

! python3 df_stats_job.py 
%cd -

/home/chester/projects/NVFlare/examples/tutorials/self-paced-training/part-1_federated_learning_introduction/chapter-2_develop_federated_learning_applications/02.1_federated_statistics/federated_statistics_with_tabular_data/code



2025-02-21 20:35:29,256 - StatisticsController - INFO - fed_stats control flow started.
2025-02-21 20:35:29,256 - StatisticsController - INFO - start prepare inputs for task fed_stats_1st_statistics
2025-02-21 20:35:29,256 - StatisticsController - INFO - task: fed_stats statistics_flow for fed_stats_1st_statistics started.
2025-02-21 20:35:33,222 - StatisticsTaskHandler - INFO - Executing task 'fed_stats' for client: 'site-2'
2025-02-21 20:35:33,224 - StatisticsTaskHandler - ERROR - Failed to populate result  statistics of dataset train and feature Age with exception: NameError: name 'sqrt' is not defined
2025-02-21 20:35:33,224 - StatisticsTaskHandler - ERROR - Traceback (most recent call last):
  File "/home/chester/projects/NVFlare/nvflare/app_common/executors/statistics/statistics_task_handler.py", line 108, in _populate_result_statistics
    statistics_result[tm.name][ds_name][feature.feature_name] = fn(
  File "/home/chester/proje

## Examine the result

With the default parameters, the results are stored in the workspace `/tmp/nvflare/jobs/stats_df/work_dir`:
```
/tmp/nvflare/jobs/stats_df/work_dir/server/simulate_job/statistics/adults_stats.json
```

In [None]:
ls -al  /tmp/nvflare/jobs/stats_df/work_dir/server/simulate_job/statistics/adults_stats.json 
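
Beyond listing the file, it can help to peek at its top-level structure before visualizing it. The sketch below makes no assumption about the layout of the JSON: the hypothetical helper `summarize_json` only reports the top-level keys and the type of each value, and it prints a hint if the job has not produced the file yet.

```python
import json
import os


def summarize_json(path: str) -> dict:
    """Hypothetical helper: map each top-level key to the type name of its value."""
    with open(path) as f:
        content = json.load(f)
    if not isinstance(content, dict):
        return {"<root>": type(content).__name__}
    return {key: type(value).__name__ for key, value in content.items()}


result_path = "/tmp/nvflare/jobs/stats_df/work_dir/server/simulate_job/statistics/adults_stats.json"
if os.path.exists(result_path):
    print(summarize_json(result_path))
else:
    print(f"{result_path} not found; run the job first")
```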

## Visualization
We can easily visualize the results with the visualization notebook. Before we do that, we need to copy the results to the notebook directory:


In [6]:
! cp  /tmp/nvflare/jobs/stats_df/work_dir/server/simulate_job/statistics/adults_stats.json  code/df_stats/demo/.

Now we can visualize the results with the [visualization notebook](code/df_stats/demo/visualization.ipynb).

## We are done!
Congratulations, you have just completed the federated statistics calculation with data represented by data frames.

Let's move on to [federated stats with Image Data](../federated_statistics_with_image_data/federated_statistics_with_image_data.ipynb).
