<a href="https://colab.research.google.com/github/Olhaau/fl-official-statistics-addon/blob/main/_dev/00_initial_results.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Federated Learning in Official Statistics - Initial Results
---

This notebook presents the **initial results of the originating work** <a name="cite_ref-1"></a>[(Stock et al., 2023)](#cite_note-1).

The associated code can be found in the folder 'original_work' of this repo.

## Object of Investigation
---

To analyze the potential of FL for official statistics, <a name="cite_ref-1"></a>[(Stock et al., 2023)](#cite_note-1) run three simulations with different datasets:

1. **Medical Insurance** (presumably artificial): available at [kaggle/ushealthinsurancedataset](https://www.kaggle.com/datasets/teertha/ushealthinsurancedataset).
2. **LTE** (mobile radio): privately held by the company [umlaut](https://www.umlaut.com), not publicly available.
3. **Pollution** (of fine dust PM<sub>2.5</sub> in Beijing): Beijing Multi-Site Air-Quality Data Set available at [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Beijing+Multi-Site+Air-Quality+Data).

## Medical Insurance
---

<div class="alert alert-block alert-success"><b>Remark:</b>  A minimal and lightweight use case that addresses a real privacy problem related to official statistics, and identifies opportunities for improving performance in the decentralized setting. This use case is being used to initialize a Federated Learning infrastructure.</div>

### Overview

Data about medical insurance costs of individual persons.

- 1338 records
- 7 attributes: age, gender, bmi, children, smoker, region, charges
- presumably artificial: complete (no missing values), balanced in age and gender

As the ML task, the following regression problem is investigated: 
> *Given the other attributes of the data set, how high are the insurance charges of an individual?*

The originating code can be found in:

- [med-insurance.ipynb](../original_work/med-insurance/med-insurance.ipynb)
- [med-insurance-federated.ipynb](../original_work/med-insurance/med-insurance-federated.ipynb)

### Preprocessing

Preprocessing is minimal: numeric features are scaled and categorical features are encoded.
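
As an illustration of what scaling and encoding typically mean for this kind of data, the sketch below standardizes a numeric column and one-hot encodes a categorical one using only the Python standard library. The sample records and the helper names `standardize` and `one_hot` are assumptions made for this example; the originating notebooks may rely on library implementations instead.

```python
from statistics import mean, pstdev

# Hypothetical sample records with a subset of the data set's attributes.
records = [
    {"age": 19, "smoker": "yes", "region": "southwest"},
    {"age": 33, "smoker": "no", "region": "northwest"},
    {"age": 47, "smoker": "no", "region": "southeast"},
]


def standardize(values):
    """Scale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Map each categorical value to a 0/1 indicator vector (categories sorted)."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


age_scaled = standardize([r["age"] for r in records])
region_encoded = one_hot([r["region"] for r in records])
```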

### Model

> *The two centralized learning approaches random forest and neural network are used as benchmarks for the FL scenario. A hyperparameter search yields the hyperparameters \[...\]. For the second benchmark (neural network), a hyperparameter search yields the following architecture: 16 dense-neurons in the first layer and 2 dense-neurons in the second layer, followed by a single dense-neuron in the output layer. The neural network is trained for 100 epochs, using the stochastic gradient descent (SGD) implementation by TensorFlow.* 

> *For the FL scenario, we use a slightly larger neural network with 16 dense-neurons in the first layer and 6 neurons in the second layer, again followed by a single neuron in the output layer. We run the FL training process for 150 rounds, with SGD and a learning rate of 0.8 for the clients and 3.0 for the server.*
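
To make the roles of the client and server learning rates concrete, here is a minimal sketch of one common formulation of federated averaging: each client runs local SGD, and the server moves the global weight along the size-weighted mean of the client updates, scaled by the server learning rate. The one-parameter model, the toy data and the learning rates used here are assumptions for illustration and are not the setup of the originating work.

```python
def local_sgd(w, data, client_lr, epochs=1):
    """Run plain SGD on squared error for the one-parameter model y = w * x."""
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x
            w -= client_lr * grad
    return w


def federated_round(w_global, clients, client_lr, server_lr):
    """One round: clients train locally, the server applies the weighted mean update."""
    total = sum(len(d) for d in clients)
    avg_delta = sum(
        len(d) / total * (local_sgd(w_global, d, client_lr) - w_global)
        for d in clients
    )
    return w_global + server_lr * avg_delta


# Hypothetical toy clients, all generated from y = 2 * x.
clients = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)], [(0.5, 1.0), (1.5, 3.0)]]

w = 0.0
for _ in range(50):
    w = federated_round(w, clients, client_lr=0.05, server_lr=1.0)
```

With `server_lr=1.0` the update reduces to the plain weighted average of the client models; larger values take a bigger step in the averaged direction.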

#### Our Remarks

To be added: remarks on the differences between the centralized and federated models, and on the unsuitable hyperparameter tuning.

### Results

Performance is measured by the $R^2$ score on a test set (holdout evaluation).

- Benchmarks (centralized)
    - 0.877 random forest regressor
    - 0.85 neural network
- Federated Learning
    - -0.075 neural network
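
For readers unfamiliar with negative $R^2$ values: the score compares a model's squared error with that of a baseline that always predicts the mean of the targets, so a value below zero means the model does worse than that baseline. The following sketch uses made-up numbers to show this; `r2_score` here is a hand-written helper, not a library call.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]   # hypothetical targets
mean_baseline = [2.5] * 4       # always predict the mean of the targets
biased_model = [3.0] * 4        # constant prediction that misses the mean

r2_mean = r2_score(y_true, mean_baseline)    # exactly 0.0
r2_biased = r2_score(y_true, biased_model)   # about -0.2, worse than the baseline
```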

#### Discussion

Cf. <a name="cite_ref-1"></a>[(Stock et al., 2023, p. 4)](#cite_note-1):

> *The strong performance of the centralized benchmark models (R2 of 0.877 and 0.85) may be due to the ML-friendly and possibly artificial nature of the data set. Regarding the significantly worse performance of the FL neural network (R2 of -0.075), we test the hypothesis that this is rooted in the fact that the FL training data is lacking the region attribute. Recall that we construct the FL client data sets by dividing the data set by its region. Without removing the attribute from the data, the attribute would hence be the same in each client data set – thus it has no use in the local training process. To test the influence of the region attribute on the model performance, we retrain the benchmark random forest regressor on a copy of the original data set without the one-hot encoded region attribute. This does not influence the performance of the benchmark, as it still achieves an R2 of 0.877. We run a second test in the FL scenario: Instead of dividing the data set by the region of the data records, we randomly split the records into 4 evenly sized client data sets and keep the region attribute. If the hypothesis above (that the region attribute contains useful information for the training process, explaining the strong performance of the benchmark models) is true, we would expect a big improvement of the FL model by keeping the region attribute in the (randomly dispersed) client data sets. However, the original result of -0.075 R2 (with one client per region) has been worsened in this test scenario (with random client data sets) to an R2 value of -0.106. Thus, we conclude that we can find no evidence for the hypothesis stated above.* 
>
> *Instead, we conjecture that the data set is too small for a FL training scenario – with about 350 data records per client, minus 20% for the test data. Although data augmentation (adding more artificial data with a similar distribution) could be a possible solution to this problem, we want to stress that this is not trivial. In a quick test, in which we have used a generative adversarial network (GAN), we were not able to improve the resulting FL regressor’s performance by data augmentation, reaching an R2 value of -1.02.*
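
The two client constructions compared in the quoted discussion can be sketched as follows: one client per region with the now-constant region attribute dropped, versus a random split into four near-equal parts that keeps the attribute. The toy records, the region labels and the function names are assumptions for illustration only.

```python
import random
from collections import defaultdict


def split_by_region(records):
    """One client per region; drop the region attribute, which is constant per client."""
    clients = defaultdict(list)
    for rec in records:
        features = {k: v for k, v in rec.items() if k != "region"}
        clients[rec["region"]].append(features)
    return dict(clients)


def split_randomly(records, n_clients=4, seed=0):
    """Shuffle the records and deal them into n_clients near-equal parts, keeping region."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_clients] for i in range(n_clients)]


# Hypothetical toy records with four region labels.
regions = ["northeast", "northwest", "southeast", "southwest"]
records = [{"age": 20 + i, "region": regions[i % 4]} for i in range(12)]

by_region = split_by_region(records)
random_clients = split_randomly(records)
```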

#### Our Remarks

We had difficulty understanding (and improving) the low performance of FL. We therefore took a closer look at the centralized approach and noticed high variance in the training of the neural networks.

*Training Performance with initial parameters:*
![](../original_work/med-insurance/rsquared_init_params.jpg)

*Training Performance after tuning:*
![](../original_work/med-insurance/rsquared_hyperparams.jpg)

## LTE
---

<div class="alert alert-block alert-warning"><b>Remark</b> The use case is currently not being investigated further because the data is not available. Any further investigation would require umlaut's participation to access the data.</div>

more tba.

### Results

The $R^2$ scores of the centralized benchmarks and of the FL model are:

- $R^2 = 0.158$ (random forest)
- $R^2 = 0.13$ (neural network)
- $R^2 = 0.13$ (linear regression)
- $R^2 = 0.114$ (neural network, Federated Learning)


## Pollution
---

<div class="alert alert-block alert-warning"><b>Remark</b> Very good performance, but no real privacy issues. A good use case for upcoming technical tests, but not a suitable product for official statistics or for presenting the advantages of Federated Learning.</div>

### Overview

> *We model a classification task in which the current fine dust pollution is inferred based on meteorological input data. More precisely, 48 consecutive hourly measurements are used to make a prediction for the current PM2.5 pollution (the total weight of particles smaller than 2.5 μm in one m3). The output of the predictor is one of the three classes low, medium or high. The threshold for each class are chosen in a way such that the samples of the whole data set are distributed evenly among the three classes.* 
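
A minimal sketch of the two ideas in the quoted task description: choosing class thresholds so that the samples fall roughly evenly into three classes, and cutting a series into runs of 48 consecutive hourly measurements. The PM2.5 values, the function names and the handling of boundary values are assumptions for illustration; the originating code may differ.

```python
def tercile_thresholds(values):
    """Return two cut points that split the values into three roughly equal groups."""
    ordered = sorted(values)
    n = len(ordered)
    return ordered[n // 3], ordered[2 * n // 3]


def to_class(value, low_cut, high_cut):
    """Assign the class label for one PM2.5 value."""
    if value < low_cut:
        return "low"
    if value < high_cut:
        return "medium"
    return "high"


def sliding_windows(series, length=48):
    """All runs of `length` consecutive hourly measurements."""
    return [series[i:i + length] for i in range(len(series) - length + 1)]


pm25 = [5, 12, 20, 35, 50, 80, 110, 150, 300]   # hypothetical values
low_cut, high_cut = tercile_thresholds(pm25)
labels = [to_class(v, low_cut, high_cut) for v in pm25]

hourly = list(range(50))                        # stand-in for 50 hourly records
windows = sliding_windows(hourly)
```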

#### Dataset

> *The data set we use is a multi-feature air quality and weather data set. It consists of hourly measurements of 12 meteorological stations in Beijing, recorded over a time span of 4 years (2013–2017). In total, more than 420 000 data records are included in the data set. Although some attributes are missing for some data records, most records have data for all of a total of 17 attributes.*

### Preprocessing

> *To complete the missing data records, we use linear interpolation. We encode the wind direction by parsing the wd attribute into four binary attributes (one for each cardinal direction). All other features are scaled using a standard scaler implementation. We exclude the following pollution features from training, since we expect a high correlation with the target attribute PM2.5: PM10, SO2, NO2, CO and O3.*
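
Two of the quoted preprocessing steps can be sketched in plain Python: filling gaps by linear interpolation and turning the `wd` attribute into four binary attributes. The sketch assumes that `wd` holds compass strings such as `"NNE"` and that a cardinal flag is set whenever its letter occurs in the string; both the function names and this parsing rule are assumptions, not the originating implementation.

```python
def interpolate_linear(values):
    """Fill interior None gaps linearly between known neighbours.

    Leading and trailing gaps have no neighbour on one side and stay None.
    """
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


def encode_wind(wd):
    """One binary attribute per cardinal direction contained in the wd string."""
    return {direction: int(direction in wd) for direction in "NESW"}


temperature = interpolate_linear([1.0, None, None, 4.0])
wind = encode_wind("NNE")
```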

### Model

> *We use the same model architecture for all three scenarios, a neural network with five layers: A 10-neuron LSTM (long-short term memory) layer, a dropout layer with a dropout rate of 25%, a 5-neuron LSTM layer, another dropout layer with a dropout rate of 35% and a 3-neuron dense layer for the classification output.*

> *We train for 20 epochs in the first scenario, 10 epochs in the second scenario and 160 epochs in third scenario (FL). In all scenarios, we use CategoricalCrossEntropy as the loss function. While we use the Adam optimizer with an automatic learning rate in both of the centralized learning scenarios, we employ the Stochastic Gradient Descent (SGD) optimizer in the FL scenario. On the server we use a learning rate of 1 for SGD, on the client we start with a learning rate of 0.1. The latter is divided by 10 every 64 rounds, such that at the end of the 160 epochs in the FL scenario, the client learning rate is at 0.001.*
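
The client learning rate schedule described in the quote is a step decay. The sketch below reproduces the stated numbers, assuming rounds are counted from 0 to 159; the function name and its parameters are chosen for this example.

```python
def client_learning_rate(round_index, initial=0.1, factor=10, every=64):
    """Step decay: divide the initial rate by `factor` once per `every` rounds."""
    return initial / factor ** (round_index // every)


# Rates at the first round, after each decay step, and in the final round.
schedule = {r: client_learning_rate(r) for r in (0, 63, 64, 128, 159)}
```

Rounds 0 to 63 use 0.1, rounds 64 to 127 use 0.01, and rounds 128 to 159 use 0.001, which matches the final value given in the quote.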

### Results

Three different scenarios were tested.

#### 1. Centralized Learning (one model per station)

- **average test accuracy of 70.05%** with a standard deviation of 0.0015 (averaged over the stations, whose individual accuracies lie in \[69%, 73%\])
- precision, recall and F1-score are also close to 70% for all models.
- most of the misclassified examples belong to the medium class.

#### 2. Centralized learning (global model over all stations)

- **average test accuracy of 72.4%** with standard deviation of 0.005 (5-fold cross validation)
- as in the previous scenario, the samples labeled with medium are misclassified more often than the others. 

#### 3. Federated Learning (one client per station)

- **average test accuracy of 67.0%** with a standard deviation of 0.014 (5-fold cross validation). 
- precision, recall and F1-score are around 67%, each with a standard deviation of 0.013. 
- In a first attempt, without the encoded wind direction and the time features (year, month, day), only a significantly lower accuracy of 63.5% (with a standard deviation of 0.01) was achieved.

## Appendix
---

### References
<a name="cite_note-1"></a>[(Stock et al., 2023)](#cite_ref-1)  &emsp;  Stock, Petersen, Federrath (2023). *On the Applicability of Federated Learning for Official Statistics*. 

### Helpful Links
- [nbviewer](https://nbviewer.org/) (correct rendering directly from GitHub)
- [Footnotes in Markdown (Stackoverflow)](https://stackoverflow.com/questions/61139741/footnotes-in-markdown-both-on-jupyter-and-google-colab)
- [Add Spaces in Markdown (Stackoverflow)](https://stackoverflow.com/questions/47061626/how-to-get-tab-space-in-markdown-cell-of-jupyter-notebook)
- [Jupyter Formatting (medium)](https://medium.com/analytics-vidhya/the-ultimate-markdown-guide-for-jupyter-notebook-d5e5abf728fd)