<a href="https://colab.research.google.com/github/Olhaau/fl-official-statistics-addon/blob/main/_dev/99_wrapup.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Summary
---

### Setup
- **OS**: Linux is superior in convenience and stability. 
    - WSL (Windows Subsystem for Linux) is a convenient solution on Windows.
- **Virtual environment**: both conda and venv work fine. Conda is used here.
- **Versions**
    - Python 3.9.* is recommended here: https://www.tensorflow.org/install/pip

## Prerequisites
---

### Development environment
---

We use TensorFlow Federated (TFF). For further development, e.g. ... is considered.

#### OS - Linux is recommended

##### Windows restrictions

TensorFlow (TF) and TensorFlow Federated (TFF) offer only restricted Windows support, e.g.

- **TFF - several problems**, e.g.
  - installing TFF runs indefinitely because of unresolvable dependencies, see https://stackoverflow.com/questions/69949143/tensorflow-federated-on-windows
  - the required JAX has no Windows support, see https://github.com/google/jax/issues/438. JAX Team:
    
    >*Windows support is still not on the agenda. We're maxed out on other things, and moreover no one on the JAX team is a Windows user, which only makes development harder.*
  - XLA (a JAX dependency) could not be fully compiled on Windows, but should be available since TF 2.2.0.
- **TF - no GPU support**, see https://www.tensorflow.org/install/pip#windows-native
  
  > *Starting with TensorFlow 2.11, you will need to install TensorFlow in WSL2, or install tensorflow or tensorflow-cpu and, optionally, try the TensorFlow-DirectML-Plugin.*

##### Windows-Subsystem for Linux

**A Windows-Subsystem for Linux (WSL) can be a convenient solution**, see

- Install TensorFlow using WSL: https://www.tensorflow.org/install/pip#windows-wsl2
- Tutorial for setting up GPU support: https://www.youtube.com/watch?v=0S81koZpwPA&t=518s&pp=ygUOd3NsIHRlbnNvcmZsb3c%3D
- General development environment in VSCode using WSL: https://code.visualstudio.com/docs/remote/wsl
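
To check from inside a notebook which of the cases above applies, a small helper along the following lines may be useful. It is only a sketch: `describe_platform` is a hypothetical name, and recognising WSL by the substring "microsoft" in the kernel release string is a common heuristic, not an official API.

```python
import platform


def describe_platform(system=None, release=None):
    """Classify the operating system for the purposes of this notebook.

    Assumption (heuristic): WSL kernels contain "microsoft" in their
    release string, e.g. "5.15.90.1-microsoft-standard-WSL2".
    """
    system = platform.system() if system is None else system
    release = platform.release() if release is None else release

    if system == "Linux":
        return "wsl" if "microsoft" in release.lower() else "linux"
    if system == "Darwin":
        return "macos"
    if system == "Windows":
        return "windows"
    return "other"


print(describe_platform())
```

The optional arguments exist only so that the classification can be tried out without switching machines.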

##### macOS restrictions

TFF is not fully available for macOS, e.g.
- only outdated versions are available, see https://github.com/tensorflow/federated/issues/3881
- installation is complicated and partly unreliable, see
  - https://stackoverflow.com/questions/66705900/can-tensorflow-federated-be-installed-on-apple-silicon-m1
  - https://stackoverflow.com/questions/71839866/can-anyone-give-me-a-comprehensive-guide-to-installing-tensorflow-federated-on-m1
  - https://discuss.tensorflow.org/t/update-tensorflow-federated-to-match-tensorflow-macos-2-7-0-and-tensorflow-metal-0-3-0/7193/10
  - conda-forge could help: https://stackoverflow.com/questions/68327863/importing-jax-fails-on-mac-with-m1-chip

#### Virtual environment


Recommended virtual environments are:

1. environment by ``conda`` using Python 3.9.* for TensorFlow, see [Install TensorFlow with pip](https://www.tensorflow.org/install/pip) (works fine for TensorFlow Federated)
2. environment by ``venv``, see [Install TensorFlow Federated](https://www.tensorflow.org/federated/install)

#### Versions

- Python 3.9.* is recommended here: https://www.tensorflow.org/install/pip
    - Note: [asdf-vm](https://asdf-vm.com/) is convenient for managing Python versions.
- To reproduce the environment used here (recommended), run:
    - ``!pip install -r ../requirements.txt``
- To install a new, similar environment, run:
    - ``pip install --upgrade tensorflow-federated``
    - install further helpful packages in the same way

For the full list of installed packages see [requirements.txt](../requirements.txt). E.g. the following tensorflow[...] versions are used.

In [1]:
colab = True
if colab:
    import os
    
    # rm repo from gdrive
    if os.path.exists("fl-official-statistics-addon"):
      %rm -r fl-official-statistics-addon

    # clone
    !git clone https://github.com/Olhaau/fl-official-statistics-addon
    %cd fl-official-statistics-addon

    # pull (the current version of the repo)
    !git pull

Cloning into 'fl-official-statistics-addon'...
remote: Enumerating objects: 988, done.
remote: Counting objects: 100% (170/170), done.
remote: Compressing objects: 100% (154/154), done.
remote: Total 988 (delta 107), reused 32 (delta 14), pack-reused 818
Receiving objects: 100% (988/988), 39.40 MiB | 20.17 MiB/s, done.
Resolving deltas: 100% (467/467), done.
/content/fl-official-statistics-addon
Already up to date.


In [3]:
!python --version

Python 3.10.11


In [8]:
if colab: 
  !pip install -q tensorflow-federated==0.56.0

In [9]:
!pip list | grep tensorflow
# for more details use !pip show tensorflow

tensorflow                    2.12.0
tensorflow-compression        2.12.0
tensorflow-datasets           4.8.3
tensorflow-estimator          2.12.0
tensorflow-federated          0.56.0
tensorflow-gcs-config         2.12.0
tensorflow-hub                0.13.0
tensorflow-io-gcs-filesystem  0.32.0
tensorflow-metadata           1.13.1
tensorflow-model-optimization 0.7.3
tensorflow-privacy            0.8.8
tensorflow-probability        0.15.0
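
The same listing can also be produced from within Python, which avoids the shell call and `grep`. The following is a minimal sketch based on the standard library module `importlib.metadata`; the helper name `installed_versions` is hypothetical.

```python
from importlib import metadata


def installed_versions(prefix="tensorflow"):
    """Return {distribution name: version} for installed packages starting with ``prefix``."""
    versions = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")  # may be missing for broken installs
        if name and name.lower().startswith(prefix):
            versions[name] = dist.version
    return dict(sorted(versions.items()))


for name, version in installed_versions().items():
    print(f"{name:<30}{version}")
```

If no matching package is installed, the function returns an empty dictionary and nothing is printed.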


In [None]:
# Save the package versions to requirements.txt if needed (overwrites the previous file)
#!pip freeze > ../requirements.txt
#!conda list --export > ../requirements.txt

### Imports
---

In [10]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold, RepeatedStratifiedKFold
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt
from itertools import product
from math import floor
import time

import tensorflow as tf
import tensorflow_federated as tff
from keras.models import Sequential
from keras.layers import Dense, InputLayer

# -> check tff
#print(tff.federated_computation(lambda: 'Hello World')()) 

### Ingest data
---

In [11]:
df = pd.read_csv("https://raw.githubusercontent.com/Olhaau/fl-official-statistics-addon/main/output/data/insurance-clean.csv", index_col = 0)
df.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges,region0,region1,region2,region3
0,0.021739,0.0,0.321227,0.0,1.0,southwest,16884.924,0.0,0.0,0.0,1.0
1,0.0,1.0,0.47915,0.2,0.0,southeast,1725.5523,0.0,0.0,1.0,0.0
2,0.217391,1.0,0.458434,0.6,0.0,southeast,4449.462,0.0,0.0,1.0,0.0
3,0.326087,1.0,0.181464,0.0,0.0,northwest,21984.47061,0.0,1.0,0.0,0.0
4,0.304348,1.0,0.347592,0.0,0.0,northwest,3866.8552,0.0,1.0,0.0,0.0


##### Evaluation splits

In [12]:
nfolds = 5
nreps = 5

cv = RepeatedStratifiedKFold(n_splits = nfolds, n_repeats = nreps, random_state = 42)

ind = 0
for train, test in cv.split(df, df.region):
  
  label = 'rep' + str(floor(ind / nfolds)) + '-fold' + str(ind % nfolds)
  df.loc[train, label] = 'train'
  df.loc[test,  label] = 'test'
  ind += 1

df.columns

Index(['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges',
       'region0', 'region1', 'region2', 'region3', 'rep0-fold0', 'rep0-fold1',
       'rep0-fold2', 'rep0-fold3', 'rep0-fold4', 'rep1-fold0', 'rep1-fold1',
       'rep1-fold2', 'rep1-fold3', 'rep1-fold4', 'rep2-fold0', 'rep2-fold1',
       'rep2-fold2', 'rep2-fold3', 'rep2-fold4', 'rep3-fold0', 'rep3-fold1',
       'rep3-fold2', 'rep3-fold3', 'rep3-fold4', 'rep4-fold0', 'rep4-fold1',
       'rep4-fold2', 'rep4-fold3', 'rep4-fold4'],
      dtype='object')
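
The column names above follow from the flat split index: with `nfolds` folds per repetition, the repetition is the integer quotient and the fold is the remainder. A small sketch of this mapping, using a hypothetical helper `split_label`:

```python
def split_label(ind, nfolds=5):
    """Map a flat split index to its 'rep<r>-fold<f>' column label."""
    rep, fold = divmod(ind, nfolds)
    return f"rep{rep}-fold{fold}"


# 5 repetitions with 5 folds each give 25 split columns.
labels = [split_label(ind, nfolds=5) for ind in range(5 * 5)]
print(labels[:6])
```

`divmod(ind, nfolds)` yields the same pair as `floor(ind / nfolds)` and `ind % nfolds` in the cell above.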

##### Select features and target

In [13]:
features = ['age', 'sex', 'bmi', 'children', 'smoker'
            , 'region0', 'region1', 'region2', 'region3']
target = 'charges'

#### Train Test Selector

In [14]:
def select_split(
      df, 
      target = target, 
      features = features, 
      type = 'train', 
      rep = 0, 
      fold = 0
      ):
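   """Select the features and target of the train or test rows of one split.

   Parameters
   ------------
   df: DataFrame
       Data with one 'rep<rep>-fold<fold>' column per split, holding 'train' or 'test'.
   target: str
   features: list of str
   type: str
       'train' or 'test'.
   rep, fold: int
       Index of the repetition and of the fold of the repeated cross validation.

   Returns
   ------------
   X, y: DataFrame, Series
       Feature columns and target column of the selected rows.
   """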
   
   labels = 'rep' + str(rep) + '-fold' + str(fold)
   
   X = df.loc[df[labels] == type, features]
   y = df.loc[df[labels] == type, target]
   
   return X, y

In [15]:
# test
X_train, y_train = select_split(df,
   features = features + ['rep' + str(1) + '-fold' + str(0)],
   type = 'train', rep = 1, fold = 0
   )
X_test, y_test = select_split(df,
   features = features + ['rep' + str(0) + '-fold' + str(0)],
   type = 'test', rep = 0, fold = 0
   )
print("====== Train ======")
print(X_train.head())
print("\n====== Test ======")
print(X_test.head())

del X_train, y_train, X_test, y_test

        age  sex       bmi  children  smoker  region0  region1  region2  \
1  0.000000  1.0  0.479150       0.2     0.0      0.0      0.0      1.0   
2  0.217391  1.0  0.458434       0.6     0.0      0.0      0.0      1.0   
3  0.326087  1.0  0.181464       0.0     0.0      0.0      1.0      0.0   
5  0.282609  0.0  0.263115       0.0     0.0      0.0      0.0      1.0   
6  0.608696  0.0  0.470272       0.2     0.0      0.0      0.0      1.0   

   region3 rep1-fold0  
1      0.0      train  
2      0.0      train  
3      0.0      train  
5      0.0      train  
6      0.0      train  

         age  sex       bmi  children  smoker  region0  region1  region2  \
8   0.413043  1.0  0.373150       0.4     0.0      1.0      0.0      0.0   
18  0.826087  1.0  0.654829       0.0     0.0      0.0      0.0      0.0   
25  0.891304  0.0  0.316384       0.6     0.0      0.0      0.0      1.0   
27  0.804348  0.0  0.452381       0.4     0.0      0.0      1.0      0.0   
41  0.282609  0.0  0.556

### Model wrapper
---

In [16]:
def build_model(
    nfeatures = 9,
    units = [40, 40, 20], 
    activations = ['relu'] * 3, 
    loss = 'mean_squared_error',
    optimizer = None,
    metrics = ["mae", 'mean_squared_error', r2_score], 
    run_eagerly = True
    ):
  
  """Construct a fully connected neural network and compile it.
  
  Parameters
  ------------
  nfeatures: int, optional
    Number of input features. Default is 9.
  units: list of int, optional
    Number of units of each hidden dense layer. The length of ``units`` defines the number of hidden layers. Default is 3 layers with 40, 40 and 20 units, respectively.
  activations: list of str, optional
    Activation functions of the hidden layers, one per entry of ``units``.
  loss: str, optional
    Loss function used for compiling.
  optimizer: keras optimizer, optional
    Optimizer used for compiling. If None, a new legacy Adam optimizer with learning rate 0.05 is created for each model.
  metrics: list of str or sklearn.metrics
    List of metrics for compiling.
  run_eagerly: bool
    Passed to ``model.compile``.

  Returns
  ------------
    model: keras.engine.sequential.Sequential
      Keras sequential fully connected neural network. Already compiled.
  """
  
  # create the default optimizer per call: an instance used as default argument
  # is created once at definition time and would be shared by all models
  if optimizer is None:
    optimizer = tf.optimizers.legacy.Adam(learning_rate = .05)

  # construct model
  model = Sequential()
  model.add(InputLayer(input_shape = [nfeatures]))
  for ind in range(len(units)):
    model.add(Dense(
      units = units[ind], 
      activation = activations[ind]
      ))
  model.add(Dense(1))
  

  # compile model
  model.compile(
    loss = loss,
    optimizer = optimizer,
    metrics = metrics,
    run_eagerly = run_eagerly
  )

  return model

build_model().summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 40)                400       
                                                                 
 dense_1 (Dense)             (None, 40)                1640      
                                                                 
 dense_2 (Dense)             (None, 20)                820       
                                                                 
 dense_3 (Dense)             (None, 1)                 21        
                                                                 
Total params: 2,881
Trainable params: 2,881
Non-trainable params: 0
_________________________________________________________________
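
The parameter counts in the summary can be checked by hand: a dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. The following sketch (the helper `dense_param_counts` is hypothetical and independent of Keras) reproduces the numbers for the default architecture.

```python
def dense_param_counts(nfeatures=9, units=(40, 40, 20), n_outputs=1):
    """Parameters per dense layer: inputs * outputs weights plus outputs biases."""
    sizes = [nfeatures, *units, n_outputs]
    return [n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:])]


counts = dense_param_counts()
print(counts, sum(counts))  # [400, 1640, 820, 21] 2881
```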


In [17]:
def train_model(model, X_train, y_train,
    epochs           = 100,
    batch_size       = 128,
    shuffle          = True,
    validation_split = 0.2,
    verbose          = 0,
    output_msr       = 'loss',
    seed             = 42,
    **kwargs
    ):
  
  """Train a compiled Keras neural network.
  
  For additional arguments see https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit.

  Parameters
  ------------
  model: keras.engine.sequential.Sequential
    Compiled model.
  X_train: DataFrame
  y_train: DataFrame or Series
  shuffle: bool
  epochs: int
  validation_split: float
  verbose: int
    verbose of model.fit(...)
  output_msr: str
    Key of the history entry whose final value is printed after training.
  batch_size: int
    batch_size of model.fit(...)
  seed: int or None
    Seed passed to tf.keras.utils.set_random_seed before fitting. None skips seeding.

  Returns
  ------------
    hist: keras.callbacks.History
      History of model.fit(...)
  """

  # fit with custom verbose
  starttime = time.time()
  
  if seed is not None: tf.keras.utils.set_random_seed(seed)

  hist = model.fit(
    X_train, 
    y_train,
    batch_size = batch_size, 
    shuffle    = shuffle,
    validation_split = validation_split,
    epochs     = epochs,
    verbose    = verbose, 
    **kwargs
  )
  # label the printed value with the selected measure instead of a fixed "R^2"
  print(
      "%s = %.4f, " % (output_msr, hist.history[output_msr][-1]),
      "time = %.1f sec" % ((time.time() - starttime)))
  
  return hist

In [18]:
def test_model(model, X_test, y_test, 
               verbose = False):
  """Evaluate a fitted model on test data.

  Parameters
  ------------
  model: keras.engine.sequential.Sequential
    Fitted model.
  X_test, y_test: DataFrame
    Test data.
  verbose: bool
    If True, print the evaluation time.

  Returns
  ------------
  perf: list of float
    Values of the compiled metrics on the test data (the loss is dropped).
  """

  start = time.time() 
  perf  = model.evaluate(X_test, y_test, verbose = 0)[1:]

  # elapsed time in minutes: subtract first, then divide by 60
  if verbose: print('time - test: %.2f min' % ((time.time() - start) / 60))
  
  return perf

In [20]:
def plot_perf(hist, msr = 'loss'):
  """Plot the training and validation history of one metric.
  :param hist: The history object including the metric to plot
  :type hist: keras.callbacks.History
  :param msr: The metric to plot
  :type msr: str, optional
  """
  plt.plot(hist.history[msr])
  plt.plot(hist.history['val_' + msr])
  plt.ylabel(msr)
  plt.xlabel('epoch')
  plt.legend(['train', 'eval'], loc='upper left')

## Experiments
---

### Centralized Model
---

#### Training

In [23]:
%%time

n_epochs = 100

hists = []
with tf.device('/device:GPU:0'):
  for rep, fold in product(range(nreps), range(nfolds)):
      print('======= rep %s - fold %s  =======' % (rep, fold))
      model            = build_model()
      X_train, y_train = select_split(df, type = 'train', rep = rep, fold = fold)
      hist             = train_model(model, X_train, y_train, epochs = n_epochs, output_msr = "r2_score")
      hists.append(hist)



AssertionError: ignored

#### Evaluation

In [22]:
# calculate the performance
perfs = [None] * nreps * nfolds

i = 0
for rep, fold in product(range(nreps), range(nfolds)):
  
  model          = hists[i].model
  X_test, y_test = select_split(df, type = 'test', rep = rep, fold = fold)
  perfs[i]       = [rep, fold] + test_model(model, X_test, y_test)

  i += 1

# convert to DataFrame
perfs = pd.DataFrame(perfs, columns = ["rep", "fold","MAE", 'MSE', 'RSQ']).assign(
    RMSE   = lambda x: np.sqrt(x.MSE),
    RSQ_pct = lambda x: x.RSQ * 100
)

perfs[['MAE', 'RMSE', 'RSQ_pct']].describe()[1:].round(2)

IndexError: ignored

##### Tables

**Overview**

In [None]:
perfs[['MAE', 'RMSE', 'RSQ_pct']].describe()[1:].round(2)

**All Evaluations**

In [None]:
perfs[["rep", "fold",'MAE', 'RMSE', 'RSQ_pct']].sort_values('RSQ_pct').round(2)
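
For reference, the three reported measures can be written out explicitly. The sketch below is a plain-Python illustration, not the code path used by the notebook (which relies on Keras metrics and `sklearn.metrics.r2_score`); `regression_metrics` is a hypothetical helper and uses the standard definition R^2 = 1 - SS_res / SS_tot.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R^2 (in percent) for two equally long sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    rsq = 1 - ss_res / ss_tot                      # undefined for constant y_true
    return {"MAE": mae, "RMSE": sqrt(ss_res / n), "RSQ_pct": rsq * 100}


# Predicting the mean of y_true everywhere gives R^2 = 0.
print(regression_metrics([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))
```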

##### Investigation

In [None]:

results = []

i = 0
for rep, fold in product(range(nreps), range(nfolds)):

    results.append(
        [rep, fold]+ [
            hists[i].history['r2_score'][-1],
            float(perfs[(perfs.rep ==rep) & (perfs.fold==fold)]["RSQ"]), 
            hists[i].history['val_r2_score'][-1], 
            
             hists[i].history['r2_score'][-1] /  float(perfs[(perfs.rep ==rep) & (perfs.fold==fold)]["RSQ"])- 1
            ] +
        [x - 1 for x in 
        list(df.loc[df['rep'+str(rep)+'-fold'+str(fold)] == 'test', features[:5] + [target]].mean() /
        df.loc[df['rep'+str(rep)+'-fold'+str(fold)] == 'train', features[:5] + [target]].mean())])
    i += 1

results = pd.DataFrame(
    results, 
    columns = ['rep', 'fold', "RSQ_train",'RSQ_test', "RSQ_eval",  "pct_diff_tt"]+['mean_ttdiff_' + x for x in features[:5] + [target]]
    )

results.sort_values("pct_diff_tt")
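
The columns `pct_diff_tt` and `mean_ttdiff_*` in the cell above are relative differences of the form `a / b - 1`: train R^2 against test R^2, and test-set mean against train-set mean of each variable. A minimal sketch of that quantity, with a hypothetical helper name and made-up input values:

```python
def relative_diff(value, reference):
    """Relative deviation of ``value`` from ``reference`` (0.10 means 10 % larger)."""
    return value / reference - 1


# Hypothetical numbers: a train R^2 of 0.88 against a test R^2 of 0.80.
print(round(relative_diff(0.88, 0.80), 4))
```

Values close to zero indicate that train and test behave alike on the compared quantity.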

##### Plots

In [None]:

y1 = np.array([hist.history["r2_score"] for hist in hists])
y2 = np.array([hist.history["val_r2_score"] for hist in hists])

# use the same x values for lines and bands (plt.plot without x starts at 0)
epochs = range(1, n_epochs + 1)
plt.plot(epochs, np.quantile(y1,.5, axis = 0), label = 'median (train)', color = 'blue')
plt.fill_between(epochs, np.quantile(y1,.05, axis = 0), np.quantile(y1,.95, axis = 0),color = 'blue', alpha = 0.15, label = '90 % CI (train)')
plt.plot(epochs, np.quantile(y2,.5, axis = 0), label = 'median (eval)', color = 'orange')
plt.fill_between(epochs, np.quantile(y2,.05, axis = 0), np.quantile(y2,.95, axis = 0),color = 'orange', alpha = 0.15, label = '90 % CI (eval)')
#plt.fill_between(range(1, n_epochs +1), np.quantile(y,.0, axis = 0), np.quantile(y,1., axis = 0), color = 'blue', alpha = 0.15, label = '100-90-70 % CI')
#plt.fill_between(range(1, n_epochs +1), np.quantile(y,.05 + .1, axis = 0), np.quantile(y,.95 - .1, axis = 0),color = 'blue', alpha = 0.15)
#plt.fill_between(range(1, n_epochs +1), np.quantile(y,.0, axis = 0), np.quantile(y,1., axis = 0), color = 'orange', alpha = 0.15, label = '100-90-70 % CI')
#plt.fill_between(range(1, n_epochs +1), np.quantile(y,.05 + .1, axis = 0), np.quantile(y,.95 - .1, axis = 0),color = 'blue', alpha = 0.15)

plt.title('Training Performance (test median RSQ = ' + str(round(perfs[['RSQ_pct']].median()[0], 2))+ " %)")

plt.legend()
plt.ylim([0.5, 0.9])
plt.show()
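
The shaded bands in the plot are the 5 % to 95 % quantile range of the metric across all runs, computed separately for every epoch. The following plain-Python sketch shows the idea without NumPy; `quantile` and `epoch_band` are hypothetical helpers, the interpolation is linear between order statistics, and the curves are made-up numbers.

```python
def quantile(values, q):
    """Quantile with linear interpolation between order statistics."""
    data = sorted(values)
    pos = q * (len(data) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(data) - 1)
    frac = pos - lower
    return data[lower] + frac * (data[upper] - data[lower])


def epoch_band(curves, lo=0.05, hi=0.95):
    """Per-epoch (lower, median, upper) across runs; ``curves`` holds one list per run."""
    per_epoch = list(zip(*curves))  # transpose: one tuple of run values per epoch
    return [(quantile(v, lo), quantile(v, 0.5), quantile(v, hi)) for v in per_epoch]


# Hypothetical R^2 curves of three runs over two epochs.
curves = [[0.50, 0.70], [0.60, 0.80], [0.70, 0.90]]
for lower, median, upper in epoch_band(curves):
    print(f"{lower:.2f} {median:.2f} {upper:.2f}")
```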

In [None]:

i = 0
for rep, fold in product(range(nreps), range(nfolds)):
  plot_perf(hists[i], 'r2_score')
  #plt.suptitle('Training Performance (test r2_score = ' + str(round(perfs[ind][2] * 100, 2))+ " %)")
  #'Training Performance (test r2_score = ' + str(round(perfs[ind][2] * 100, 2))+ " %)")
  plt.suptitle('Training Performance for rep = ' + str(rep) + ', fold = ' + str(fold))
  plt.title(' (test RSQ = ' + str(round(perfs.loc[i,'RSQ_pct'], 2))+ " %)")
  plt.ylim([0.5, 0.9])
  plt.show()
  i += 1

### Centralized (5 Features)

In [None]:
%%time

n_epochs = 100

hists2 = []
for rep, fold in product(range(nreps), range(nfolds)):
    print('======= rep %s - fold %s  =======' % (rep, fold))
    model            = build_model(nfeatures = 5)
    X_train, y_train = select_split(df, features = features[:5], type = 'train', rep = rep, fold = fold)
    hist             = train_model(model, X_train, y_train, epochs = n_epochs, output_msr = "r2_score")
    hists2.append(hist)

In [None]:
# calculate the performance
perfs2 = [None] * nreps * nfolds

i = 0
for rep, fold in product(range(nreps), range(nfolds)):
  
  model          = hists2[i].model
  X_test, y_test = select_split(df, features = features[:5], type = 'test', rep = rep, fold = fold)
  perfs2[i]       = [rep, fold] + test_model(model, X_test, y_test)

  i += 1

# convert to DataFrame
perfs2 = pd.DataFrame(perfs2, columns = ["rep", "fold","MAE", 'MSE', 'RSQ']).assign(
    RMSE   = lambda x: np.sqrt(x.MSE),
    RSQ_pct = lambda x: x.RSQ * 100
)

perfs2[['MAE', 'RMSE', 'RSQ_pct']].describe()[1:].round(2)

In [None]:
y1 = np.array([hist.history["r2_score"] for hist in hists2])
y2 = np.array([hist.history["val_r2_score"] for hist in hists2])

# use the same x values for lines and bands (plt.plot without x starts at 0)
epochs = range(1, n_epochs + 1)
plt.plot(epochs, np.quantile(y1,.5, axis = 0), label = 'median (train)', color = 'blue')
plt.fill_between(epochs, np.quantile(y1,.05, axis = 0), np.quantile(y1,.95, axis = 0),color = 'blue', alpha = 0.15, label = '90 % CI (train)')
plt.plot(epochs, np.quantile(y2,.5, axis = 0), label = 'median (eval)', color = 'orange')
plt.fill_between(epochs, np.quantile(y2,.05, axis = 0), np.quantile(y2,.95, axis = 0),color = 'orange', alpha = 0.15, label = '90 % CI (eval)')
#plt.fill_between(range(1, n_epochs +1), np.quantile(y,.0, axis = 0), np.quantile(y,1., axis = 0), color = 'blue', alpha = 0.15, label = '100-90-70 % CI')
#plt.fill_between(range(1, n_epochs +1), np.quantile(y,.05 + .1, axis = 0), np.quantile(y,.95 - .1, axis = 0),color = 'blue', alpha = 0.15)
#plt.fill_between(range(1, n_epochs +1), np.quantile(y,.0, axis = 0), np.quantile(y,1., axis = 0), color = 'orange', alpha = 0.15, label = '100-90-70 % CI')
#plt.fill_between(range(1, n_epochs +1), np.quantile(y,.05 + .1, axis = 0), np.quantile(y,.95 - .1, axis = 0),color = 'blue', alpha = 0.15)

plt.title('Training Performance (test median RSQ = ' + str(round(perfs2[['RSQ_pct']].median()[0], 2))+ " %)")

plt.legend()
plt.ylim([0.5, 0.9])
plt.show()

### Tuning
---

Can the centralized model be improved?

- Does ``train_model(..., steps_per_epoch = 3)`` improve the result?
- FIXME: What about random forests?
- FIXME: Systematic tuning
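
One possible starting point for the systematic tuning item is a plain grid over a few hyperparameters. The sketch below only enumerates the configurations; the search space values and the helper `expand_grid` are assumptions chosen for illustration, not tuned recommendations.

```python
from itertools import product

# Hypothetical search space; the values are placeholders.
search_space = {
    "units": [[40, 40, 20], [64, 32], [20, 20]],
    "learning_rate": [0.05, 0.01],
    "batch_size": [64, 128],
}


def expand_grid(space):
    """Yield one dict per combination of the listed hyperparameter values."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


configs = list(expand_grid(search_space))
print(len(configs), configs[0])
```

Each configuration could then be evaluated inside the existing repeated cross validation loop, for example by passing `units` to `build_model` and `batch_size` to `train_model`, and comparing the resulting test R^2 distributions.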

### Federated Learning
---

In [None]:
# Federated Learning