# Tabular Data Federated Statistics 

In this example, we show how to generate federated statistics for data that can be represented as a pandas DataFrame.

## Set Up NVFLARE

Follow [Getting Started](https://nvflare.readthedocs.io/en/main/getting_started.html) to set up a virtual environment and install NVFLARE.


## Install requirements
First, install the required packages:

In [None]:
%pip install -r requirements.txt

## Install Optional Quantile Dependency – fastdigest

If you intend to calculate quantiles, you need to install fastdigest.

Skip this step if you don’t need quantile statistics.
```
pip install fastdigest==0.4.0
```

On Ubuntu, you might get the following error:
```
Cargo, the Rust package manager, is not installed or is not on PATH.
This package requires Rust and Cargo to compile extensions. Install it through
the system's package manager or via https://rustup.rs/

Checking for Rust toolchain....
```
This is because fastdigest (or its dependencies) requires Rust and Cargo to build.

To install Rust and Cargo on your Ubuntu system, run the following script, which installs Rust using rustup:

```
./install_cargo.sh
```
Then install fastdigest again:
```
pip install fastdigest==0.4.0
```
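
Before running a job that requests quantiles, it can help to confirm that the optional dependency is importable in the active environment. The following is a minimal sketch using only the Python standard library; the helper name `is_installed` is hypothetical and is not part of NVFLARE or fastdigest.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if a top-level package can be found in this environment."""
    return importlib.util.find_spec(package) is not None


if is_installed("fastdigest"):
    print("fastdigest is installed; quantile statistics can be requested.")
else:
    print("fastdigest is not installed; install it or skip quantile statistics.")
```

The check only looks up the package and does not import it, so it is cheap to run at the top of a notebook.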

# Code Structure

This assumes the ```tree``` command is installed. If it is not, either install it or use ```ls -al``` instead.

In [None]:
! tree .

The main code consists of "client.py" and "job.py". The rest are supporting files.
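
If neither shell command is convenient, a similar listing can be produced from Python. This is a small sketch using `pathlib`; the function name `format_tree` is hypothetical, and the output format only approximates what `tree` prints.

```python
from pathlib import Path
from typing import List


def format_tree(root: str, prefix: str = "") -> List[str]:
    """Return the contents of `root` as indented lines, directories first."""
    lines = []
    entries = sorted(Path(root).iterdir(), key=lambda p: (p.is_file(), p.name))
    for entry in entries:
        suffix = "/" if entry.is_dir() else ""
        lines.append(f"{prefix}{entry.name}{suffix}")
        if entry.is_dir():
            # Recurse into subdirectories with one extra level of indentation.
            lines.extend(format_tree(str(entry), prefix + "    "))
    return lines


if __name__ == "__main__":
    print("\n".join(format_tree(".")))
```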



## Data

In this example, we use the UCI (University of California, Irvine) [adult dataset](https://archive.ics.uci.edu/dataset/2/adult).
The original dataset already contains "training" and "test" datasets. Here we simply assume that the training and test datasets belong to different clients, so we assign them to two clients.

Next, we use the data utility to download the UCI datasets into a separate directory for each client under /tmp/nvflare/df_stats/data/.
Please note that the UCI website may experience occasional downtime.



In [None]:
! python prepare_data.py

#### Let's take a look at the data

In [None]:
import pandas as pd

data_path = "/tmp/nvflare/df_stats/data/site-1/data.csv"
data_features = [
    "Age",
    "Workclass",
    "fnlwgt",
    "Education",
    "Education-Num",
    "Marital Status",
    "Occupation",
    "Relationship",
    "Race",
    "Sex",
    "Capital Gain",
    "Capital Loss",
    "Hours per week",
    "Country",
    "Target",
]

# The original dataset has no header row.
# We use the adult.train dataset for site-1 and the adult.test dataset for site-2.
# The first line of adult.test is incorrectly formatted, so we skip it for site-2.
skip_rows = {
    "site-1": [],
    "site-2": [0],
}

# Select the rows to skip for the site being loaded; read_csv expects a list, not the whole dict.
site_name = "site-1"
df = pd.read_csv(
    data_path, names=data_features, sep=r"\s*,\s*", skiprows=skip_rows[site_name], engine="python", na_values="?"
)
df


> **Note: We will only calculate statistics for numerical features; categorical features will be skipped.**
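
To see which columns count as numerical, pandas can filter a DataFrame by dtype. The sketch below uses a tiny hand-made frame (the values are illustrative and are not taken from the adult dataset); how the library selects numerical features internally may differ.

```python
import pandas as pd

# Small illustrative frame; the values are made up for this example.
sample = pd.DataFrame(
    {
        "Age": [39, 50, 38],
        "Workclass": ["State-gov", "Private", "Private"],
        "Hours per week": [40, 13, 40],
    }
)

# Keep only the columns whose dtype is numeric.
numeric_features = sample.select_dtypes(include="number").columns.tolist()
print(numeric_features)
```

Applying the same call to `df` from the cell above lists the features that can receive numerical statistics.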

# Client Code
The local statistics generator, AdultStatistics, implements the Statistics spec.

In [None]:
! cat client.py

Many of the functions needed for tabular statistics are already implemented in DFStatisticsCore.

In the AdultStatistics class, we only need to provide the following:

- data_features: the feature names, hard-coded here as an array.
- ```load_data() -> Dict[str, pd.DataFrame]```: a method that returns a dictionary of pandas DataFrames, one for each data source (“train”, “test”).
- data_path = ```<data_root_dir>/<site-name>/<filename>```
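
The loading step described above can be sketched as a standalone function. This is not the code in client.py: the function name `load_site_data`, its parameters, and the choice of "train" as the dictionary key are assumptions made for illustration.

```python
from pathlib import Path
from typing import Dict, List

import pandas as pd


def load_site_data(
    data_root_dir: str, site_name: str, filename: str, features: List[str]
) -> Dict[str, pd.DataFrame]:
    """Load one site's CSV file and return it keyed by data-source name."""
    # Build <data_root_dir>/<site-name>/<filename>.
    data_path = Path(data_root_dir) / site_name / filename
    df = pd.read_csv(
        data_path, names=features, sep=r"\s*,\s*", engine="python", na_values="?"
    )
    # Assumption: each site holds a single data source, labeled "train" here.
    return {"train": df}
```

For example, `load_site_data("/tmp/nvflare/df_stats/data", "site-1", "data.csv", data_features)` would read the same file as the cell in the data section.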



## Server Code
The server-side aggregation is already implemented in the Statistics Controller.
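
To give a sense of what aggregation means here, the sketch below combines per-site counts and sums into a global mean without any site sharing raw rows. It is a conceptual illustration only, not the Statistics Controller's implementation; the function name `aggregate_mean` and the input values are made up.

```python
from typing import Dict, List


def aggregate_mean(local_stats: List[Dict[str, float]]) -> Dict[str, float]:
    """Combine per-site counts and sums into a global count, sum, and mean."""
    total_count = sum(s["count"] for s in local_stats)
    total_sum = sum(s["sum"] for s in local_stats)
    # Guard against division by zero when no site reports any rows.
    mean = total_sum / total_count if total_count else float("nan")
    return {"count": total_count, "sum": total_sum, "mean": mean}


# Made-up local results for a single feature at two sites.
site_1 = {"count": 4, "sum": 120.0}
site_2 = {"count": 2, "sum": 90.0}
global_stats = aggregate_mean([site_1, site_2])
print(global_stats)
```

The global mean is computed from the pooled sum and count rather than by averaging the two local means, which would weight the sites incorrectly when their sizes differ.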

## Job Recipe

The job is defined via a recipe, and we run it in the simulation execution environment (SimEnv). To run it in a production or PoC environment instead, simply replace SimEnv with ProdEnv or PoCEnv.

In [None]:
! cat job.py



## Run job

**Run Job using Simulator API**


In [None]:
! python job.py


The results are stored in the simulation workspace at:
```
 /tmp/nvflare/simulation/stats_df/server/simulate_job/statistics/adults_stats.json
```
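
The result file can also be inspected directly before visualizing it. The following sketch assumes the file is a JSON object at the top level; its internal layout is not described here, so only the top-level keys are printed. The helper name `load_stats` is hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_stats(path: str) -> Dict[str, Any]:
    """Read a statistics JSON file into a dictionary."""
    with Path(path).open() as f:
        return json.load(f)


stats_path = "/tmp/nvflare/simulation/stats_df/server/simulate_job/statistics/adults_stats.json"
if Path(stats_path).exists():
    stats = load_stats(stats_path)
    # Only the top-level keys are shown because the nested layout is not assumed.
    print(list(stats.keys()))
else:
    print(f"{stats_path} not found; run the job first.")
```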

## Visualization
We can easily visualize the results with the visualization notebook. Before we do that, we need to copy the result file to the notebook directory.


In [None]:
! cp  /tmp/nvflare/simulation/stats_df/server/simulate_job/statistics/adults_stats.json demo/.

Now we can view the results in the [visualization notebook](./demo/visualization.ipynb).

## We are done!
Congratulations, you have just completed the federated statistics calculation for data represented as a DataFrame.
