# Data Frame Federated Statistics 

In this example, we show how to generate federated statistics for **tabular** data. The data is loaded into a Pandas DataFrame and cached in memory, so we can use DataFrame and NumPy operations to calculate the local statistics. The result is saved to the job workspace in JSON format; it can then be loaded into a Pandas DataFrame and visualized with the provided visualization utility, as demonstrated in the notebook linked at the end of this example.

## Install requirements
First, install the required packages:

In [None]:
%pip install -r code/requirements.txt


## Prepare data

In this example, we use the UCI (University of California, Irvine) [adult dataset](https://archive.ics.uci.edu/dataset/2/adult).
The original dataset already contains "training" and "test" datasets. Here we simply assume that the "training" and "test" datasets each belong to a client, so we assign the adult.train dataset to site-1 and the adult.test dataset to site-2.

Now we use the data utility to download the UCI datasets into a separate directory for each client under `/tmp/nvflare/df_stats/data`.
Please note that the UCI website may experience occasional downtime.

In [None]:
from code.df_stats.utils.prepare_data import prepare_data

prepare_data(data_root_dir = "/tmp/nvflare/df_stats/data")

#### Let's take a look at the data

In [None]:
import pandas as pd

data_path = "/tmp/nvflare/df_stats/data/site-1/data.csv"
data_features = [
    "Age",
    "Workclass",
    "fnlwgt",
    "Education",
    "Education-Num",
    "Marital Status",
    "Occupation",
    "Relationship",
    "Race",
    "Sex",
    "Capital Gain",
    "Capital Loss",
    "Hours per week",
    "Country",
    "Target",
]

# The original dataset has no header row.
# We use the adult.train dataset for site-1 and the adult.test dataset for site-2.
# The first line of the adult.test dataset is incorrectly formatted, so we skip it.
skip_rows = {
    "site-1": [],
    "site-2": [0],
}

# data_path points to the site-1 file, so use the skip list for site-1
df = pd.read_csv(
    data_path,
    names=data_features,
    sep=r"\s*,\s*",
    skiprows=skip_rows["site-1"],
    engine="python",
    na_values="?",
)
df


> **Note:** We only calculate statistics for numerical features; categorical features are skipped.
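
As a rough illustration of what restricting the statistics to numerical features can look like in Pandas, the sketch below keeps only numeric columns from a tiny, made-up frame. The `sample` data is hypothetical, and the way `DFStatistics` actually selects its features may differ.

```python
import pandas as pd

# Tiny hypothetical frame standing in for the adult dataset
sample = pd.DataFrame(
    {
        "Age": [39, 50, 38],
        "Workclass": ["State-gov", "Self-emp-not-inc", "Private"],
        "Hours per week": [40, 13, 40],
    }
)

# Keep only columns with a numeric dtype; string (object) columns are dropped
numeric_features = sample.select_dtypes(include="number").columns.tolist()
print(numeric_features)
```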

## Configure Job

For this example, we have a file [df_stats_job.py](code/df_stats_job.py) that uses `StatsJob` to generate a job configuration in a Pythonic way.

You may notice that `StatsJob` has a `statistic_configs` parameter that we have configured as follows:

In [None]:
statistic_configs = {
    "count": {},
    "mean": {},
    "sum": {},
    "stddev": {},
    # Histogram settings: 20 bins for every feature ("*"), plus an explicit range for Age
    "histogram": {"*": {"bins": 20}, "Age": {"bins": 20, "range": [0, 10]}},
    "percentile": {"*": [25, 50, 75], "Age": [50, 95]},
}

The `statistic_configs` parameter lists all of the statistics we are gathering, along with the bins and ranges for histograms. For histograms, we set the number of bins for all features ("*") to 20. For `Age`, we also set the number of bins to 20 and set the range to [0, 10]. Percentiles can be configured per feature as well: here all features use [25, 50, 75], while `Age` uses [50, 95]. If no range is specified, the global bin range is estimated dynamically from the local min/max values.
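
To make the idea of a dynamically estimated range more concrete, here is a minimal NumPy sketch of one way it could work: each site reports its local min/max, a global range is derived from them, and every site then bins its data with the same edges so the counts can be added. The site values are made up, and this is an illustration of the concept rather than NVFlare's actual implementation.

```python
import numpy as np

# Hypothetical local "Age" values held by two sites
site_values = {
    "site-1": np.array([23, 35, 47, 52, 61]),
    "site-2": np.array([19, 28, 44, 70]),
}

# Step 1: each site reports only its local min and max
local_ranges = {site: (v.min(), v.max()) for site, v in site_values.items()}

# Step 2: derive a global range that covers every site
global_min = min(lo for lo, _ in local_ranges.values())
global_max = max(hi for _, hi in local_ranges.values())

# Step 3: each site bins its own data using the shared range and bin count
num_bins = 5
local_hists = {
    site: np.histogram(v, bins=num_bins, range=(global_min, global_max))[0]
    for site, v in site_values.items()
}

# Step 4: the bin edges are identical, so the per-site counts can simply be added
global_hist = sum(local_hists.values())
print(global_min, global_max, global_hist)
```

Because all sites share the same bin edges, only bin counts need to be combined, not the raw values.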

For `StatsJob`, the `stats_generator` parameter takes the statistics generator that is used for each client site, and in this example we use [DFStatistics](code/src/df_statistics.py). `DFStatistics` extends `DFStatisticsCore`, which in turn implements the `Statistics` specification from `nvflare.app_common.abstract.statistics_spec.Statistics`.
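
For intuition only, the sketch below shows what a per-site generator backed by Pandas might compute. `LocalFrameStats` and its method names are hypothetical; it does not subclass or reproduce the NVFlare `Statistics` interface, so refer to [DFStatistics](code/src/df_statistics.py) for the real implementation.

```python
from typing import Dict, List

import pandas as pd


class LocalFrameStats:
    """Hypothetical stand-in for a per-site statistics generator."""

    def __init__(self, datasets: Dict[str, pd.DataFrame]):
        # Maps a dataset name (e.g. "train") to the DataFrame cached in memory
        self.datasets = datasets

    def features(self) -> Dict[str, List[str]]:
        # Only numeric columns are considered
        return {
            name: df.select_dtypes(include="number").columns.tolist()
            for name, df in self.datasets.items()
        }

    def count(self, dataset_name: str, feature_name: str) -> int:
        # Series.count() ignores missing values
        return int(self.datasets[dataset_name][feature_name].count())

    def sum(self, dataset_name: str, feature_name: str) -> float:
        return float(self.datasets[dataset_name][feature_name].sum())

    def mean(self, dataset_name: str, feature_name: str) -> float:
        return self.sum(dataset_name, feature_name) / self.count(dataset_name, feature_name)


# Made-up data with one missing value
local = LocalFrameStats({"train": pd.DataFrame({"Age": [20.0, 30.0, 40.0, None]})})
print(local.features(), local.count("train", "Age"), local.mean("train", "Age"))
```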

## Run Job with FL Simulator

The file [df_stats_job.py](code/df_stats_job.py) runs the job with the FL simulator through the Job API. With the default arguments, the job is exported to `/tmp/nvflare/jobs/stats_df` and then run with the FL simulator via `simulator_run()`, using a `work_dir` of `/tmp/nvflare/jobs/stats_df`.

In [None]:
! python3 code/df_stats_job.py

## Examine the result

With the default parameters, the results are stored in the workspace `/tmp/nvflare/jobs/stats_df/`:
```
/tmp/nvflare/jobs/stats_df/server/simulate_job/statistics/adults_stats.json
```

In [None]:
! cat /tmp/nvflare/jobs/stats_df/server/simulate_job/statistics/adults_stats.json
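
Since the result is plain JSON, it can also be loaded into a Pandas DataFrame directly. The sketch below flattens a nested JSON file with `pd.json_normalize`, which works for any nesting depth. The `load_stats` helper and the `example` content are hypothetical; the real layout of `adults_stats.json` may differ, so inspect the file before relying on specific keys.

```python
import json
import os
import tempfile

import pandas as pd


def load_stats(path: str) -> pd.DataFrame:
    """Load a nested JSON statistics file into a one-row, flattened DataFrame."""
    with open(path) as f:
        stats = json.load(f)
    # Nested keys are joined with "/" to form the column names
    return pd.json_normalize(stats, sep="/")


# Hypothetical content, used only to exercise the helper
example = {"Age": {"count": {"site-1": {"train": 3}}}}
with tempfile.TemporaryDirectory() as tmp:
    example_path = os.path.join(tmp, "example_stats.json")
    with open(example_path, "w") as f:
        json.dump(example, f)
    flat = load_stats(example_path)

# Transpose so each flattened key becomes a row
print(flat.T)
```

To apply it to the actual output, pass the path of `adults_stats.json` shown above to `load_stats`.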

## Visualization
We can easily visualize the results with the visualization notebook. Before we do that, we need to copy the data to the notebook directory:


In [None]:
! cp /tmp/nvflare/jobs/stats_df/server/simulate_job/statistics/adults_stats.json code/df_stats/demo/.

Now we can visualize the results with the [visualization notebook](code/df_stats/demo/visualization.ipynb).

## We are done!
Congratulations, you have just completed the federated statistics calculation with data represented as a DataFrame.
