In [None]:
import starterkits.starterkit_4_2.support as sp
import starterkits.starterkit_4_2.visualizations as vis

from pathlib import Path
DATA_PATH = Path('../../data/')

%load_ext autoreload
%autoreload 2

# Starter Kit 4.2: FedRepo: mitigate concept drift in federated context

## Passport

### Business context

In an increasingly connected world, federated learning is becoming crucial, especially in scenarios where data privacy is paramount and bandwidth is a limiting factor. Federated learning enables multiple decentralized devices or servers (clients) to collaboratively learn a predictive model while keeping all training data local. This avoids the need to transfer large volumes of sensitive data to a central server for processing. However, federated learning environments face significant challenges:

 - Dynamic and heterogeneous data sources often lead to concept drift, where the underlying data patterns the model has learned change over time, potentially degrading the model's performance.
 - Limited bandwidth for communication between clients and the central server can hinder the efficiency of model updates and retraining processes.
 - Privacy concerns limit the amount and type of data that can be shared between clients, complicating the detection and mitigation of concept drift.

These challenges necessitate advanced strategies for model training and maintenance to ensure that predictive models remain accurate and efficient over time without compromising privacy or incurring prohibitive communication costs.

### Business goal

The business goal for this Starter Kit is **concept drift mitigation** in federated learning environments. Specifically, this Starter Kit applies a methodology called *FedRepo*, introduced by Tsiporkova et al. [1], that manages a dynamic repository of federated models to effectively cope with concept drift. The FedRepo methodology aims to provide a robust solution that maintains the accuracy and efficiency of the federated models over time, ensuring they adapt to changes in data dynamics while minimizing communication overhead.

### Application Context
The FedRepo methodology is applicable in various settings where data privacy and limited connectivity are major concerns:

 - Healthcare: Hospitals and medical institutions can collaborate on developing predictive models or improving diagnostic tools without actually sharing patient data. 
 - Wearables: User experience of several features (e.g. text prediction) can be enhanced on personal devices without compromising privacy.
 - Industrial: Assets manufactured by a third party (e.g. printers) can be used to collaboratively learn predictive models without each customer having to share its data.

### Starter Kit outline
This Starter Kit demonstrates the application of the FedRepo methodology using a real-world dataset. First, the dataset, which contains electricity consumption data of UK households, is described. Then, the FedRepo methodology is explained through its key steps, which are applied to a subset of the households. For a complete explanation of all steps involved, please refer to the paper [1]. Finally, the performance of the methodology is evaluated, including its adaptability and its ability to mitigate concept drift.

## Dataset
Forecasting household electricity consumption is a highly relevant application for this methodology, as household energy consumption is privacy-sensitive. Additionally, many factors can cause concept drift:
 - The occupation of the household in terms of its inhabitants
 - Replacement of household appliances
 - Seasonal influence on energy consumption
 - ...

The data was collected by the Low Carbon London project, led by UK Power Networks. It covers 5,567 London households, forming a balanced sample representative of the Greater London population, with readings at 30-minute granularity between November 2011 and February 2014. Consumption is given in kWh. To demonstrate the methodology, we randomly selected 300 households for which data is available until at least January 2014. For these households, a repository of federated models is trained to forecast the consumption over the next 30 minutes.

In [None]:
household_subset = sp.get_data(DATA_PATH)
household_subset

## Preprocess data

The next step is to preprocess the loaded dataset so that it is ready for modeling. This includes data cleaning, feature engineering and splitting the data into train and test sets. A train set of three months (January to March 2012) and a test set of one month (April 2012) are used. The features are the consumption values of up to 6 hours ago, complemented by the consumption at the same time, and 30 minutes before and after, on the previous day and in the previous week. Additionally, the day of the week, hour of the day and month of the year are included as features, with the latter two being cyclically encoded.
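
The feature engineering described above can be sketched as follows. This is only an illustration and not the code behind `sp.PreprocessConsumption`: the function name `build_features`, the column names and the synthetic series are assumptions, and the sine/cosine transform is one common way to encode cyclic features.

```python
import numpy as np
import pandas as pd

STEPS_PER_DAY = 48  # half-hourly readings


def build_features(consumption: pd.Series) -> pd.DataFrame:
    """Derive lag and calendar features from a half-hourly consumption series."""
    feats = pd.DataFrame(index=consumption.index)

    # Consumption of the previous 6 hours (12 half-hour steps).
    for lag in range(1, 13):
        feats[f"lag_{lag}"] = consumption.shift(lag)

    # Same time, 30 minutes earlier and 30 minutes later, one day and one week back.
    for period, steps in (("day", STEPS_PER_DAY), ("week", 7 * STEPS_PER_DAY)):
        for label, extra in (("before", 1), ("same", 0), ("after", -1)):
            feats[f"prev_{period}_{label}"] = consumption.shift(steps + extra)

    # Calendar features: day of week is kept as an integer, hour and month are cyclic.
    idx = consumption.index
    hour = idx.hour.to_numpy() + idx.minute.to_numpy() / 60
    month = idx.month.to_numpy()
    feats["day_of_week"] = idx.dayofweek.to_numpy()
    feats["hour_sin"] = np.sin(2 * np.pi * hour / 24)
    feats["hour_cos"] = np.cos(2 * np.pi * hour / 24)
    feats["month_sin"] = np.sin(2 * np.pi * (month - 1) / 12)
    feats["month_cos"] = np.cos(2 * np.pi * (month - 1) / 12)

    # Rows without a full week of history contain NaN lags and are dropped.
    return feats.dropna()


# Synthetic example: ten days of half-hourly values for a single household.
index = pd.date_range("2012-01-01", periods=10 * STEPS_PER_DAY, freq="30min")
series = pd.Series(np.arange(len(index), dtype=float), index=index)
features = build_features(series)
```

The cyclic encoding places 23:30 next to 00:00 and December next to January, which a plain integer encoding would not do.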

In [None]:
prep = sp.PreprocessConsumption(data=household_subset)

x_train, y_train, x_test, y_test = prep.preprocess()

## FedRepo
The FedRepo algorithm, designed to mitigate concept drift in federated learning environments, is structured around several key steps. These steps ensure that the algorithm dynamically adapts to changes in data distributions across different clients or devices, maintaining the efficacy of the deployed models. FedRepo is built around the maintenance of three repositories residing in a central node, e.g., in the cloud. These are:

- $Θ$: a repository of workers, which contains at any moment the workers (clients or devices) for which new federated models need to be constructed.
- $Φ$: a repository of global federated random forest models, which contains at any moment the active (deployed) federated models.
- $Γ$: a repository of tree models, which contains at any moment subsets of trees from local RF models of each worker.

The proposed methodology will continuously update the above described repositories during the use of the federated models based on continuous monitoring and evaluation of the models’ performance. FedRepo consists of four main steps: *Initialization*, *Model training*, *Context-aware inference* and *Dynamic model maintenance*. These are shown in the image below which gives an overview of the methodology. Even though FedRepo is described and evaluated in a regression task scenario, the same methodology can be used for classification by using a proper evaluation metric.

<table><tr><td><img src='media/fedrepo.png'></td></tr></table>

### Initialization
This step is performed in the central node. The repository of federated RF models is empty since no RF models have been constructed yet, i.e., $Φ = ∅$. Analogously, the workers’ repository contains all available workers since for all of them the federated models still need to be constructed, i.e., $Θ = \{θ_{1}, \ldots, θ_{300}\}$, and the repository of tree models is composed of 300 empty sets, one per worker, i.e., $Γ = \{Γ_{1}, \ldots, Γ_{300}\}$, where $Γ_{i} = ∅$ for $i = 1, \ldots, 300$.
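
To make the initial state concrete, the three repositories can be pictured as plain Python containers. The `Repositories` class and the `initialise` function below are hypothetical names used for illustration only; they are not part of the `support` module used in this notebook.

```python
from dataclasses import dataclass, field


@dataclass
class Repositories:
    """Hypothetical container for the three FedRepo repositories."""
    workers: set = field(default_factory=set)    # Θ: workers awaiting a new federated model
    models: dict = field(default_factory=dict)   # Φ: active federated models, keyed by model id
    trees: dict = field(default_factory=dict)    # Γ: contributed trees, one list per worker


def initialise(worker_ids):
    """Return the repositories in their initial state."""
    ids = list(worker_ids)
    repo = Repositories()
    repo.workers = set(ids)              # Θ contains every worker
    repo.trees = {w: [] for w in ids}    # every Γ_i starts empty
    return repo                          # Φ stays empty


repo = initialise(range(1, 301))
```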

In [None]:
fed = sp.FederatedForest(client_ids=prep.consumer_list)
fed.assign_client_data(x_train, y_train, x_test, y_test)

### Training

In this step, the model and worker repositories are updated such that devices similar with respect to the model performance are assigned to the same cluster of workers and hence collaboratively build and share the same RF federated model. For this, the following steps are executed:

1. *Local Model Training*: Each worker trains a local RF model on its training data and selects a subset of trees to contribute to the central repository. The local forests consist of 100 trees per worker.

2. *Tree Repository Update*: The central node updates the tree model repository with the trees received from all workers.

3. *Federated Global Model Construction*: Construct the initial global RF model $Φ_{0}$ by randomly sampling 100 trees from the updated tree repository. 

4. *Evaluation Feature Vector Construction*: Each worker evaluates every tree from the global model on its test data, and the performance metrics are used to construct an evaluation feature vector for each worker. The RMSE score is used as the performance metric.

These steps will be performed by running the cell below. Note that the communication contents between the local workers and the central node have been: the locally trained trees (to the central node), the global model (to the local workers) and the performance scores (back to the central node). No local data was shared across workers or with the central node.
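
The four training steps can be illustrated with a small scikit-learn sketch on synthetic data. It is not the implementation behind `fed.train()`: the sizes are scaled down, the data generator `make_worker_data` is made up, and selecting the contributed trees at random is an assumption, since the selection rule is not specified here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
N_WORKERS, N_LOCAL_TREES, N_SHARED, N_GLOBAL = 4, 10, 5, 10  # scaled down for illustration


def make_worker_data(slope):
    """Synthetic local data; each worker has a slightly different input-output relation."""
    x = rng.uniform(0, 1, size=(200, 3))
    y = slope * x[:, 0] + rng.normal(0, 0.05, size=200)
    return x[:150], y[:150], x[150:], y[150:]


workers = {i: make_worker_data(slope=1 + i) for i in range(N_WORKERS)}

# Steps 1-2: each worker trains a local forest and contributes a subset of its trees.
tree_repo = {}
for i, (x_tr, y_tr, _, _) in workers.items():
    forest = RandomForestRegressor(n_estimators=N_LOCAL_TREES, random_state=i).fit(x_tr, y_tr)
    chosen = rng.choice(N_LOCAL_TREES, size=N_SHARED, replace=False)  # assumed: random subset
    tree_repo[i] = [forest.estimators_[j] for j in chosen]

# Step 3: the global model is a random sample of trees from the pooled repository.
pool = [tree for trees in tree_repo.values() for tree in trees]
global_trees = [pool[j] for j in rng.choice(len(pool), size=N_GLOBAL, replace=False)]


# Step 4: each worker scores every global tree on its own test data (RMSE).
def evaluation_vector(trees, x_te, y_te):
    return np.array([np.sqrt(np.mean((t.predict(x_te) - y_te) ** 2)) for t in trees])


eval_matrix = np.vstack([evaluation_vector(global_trees, x_te, y_te)
                         for _, _, x_te, y_te in workers.values()])
```

Only fitted trees and RMSE scores cross the worker boundary in this sketch; the arrays in `workers` are never pooled.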

In [None]:
fed.train()
fed.evaluate_global_forest()

### Clustering 
At this stage an evaluation vector of 100 scores exists for each of the 300 workers. These are collected in a matrix which will serve as the basis for a clustering step in which similar workers are grouped together. This grouping is the start of the next few steps in the FedRepo algorithm:

5. *Local Node Clustering*: To derive personalized models for a set of similar workers, the workers are split into K non-overlapping clusters. These are obtained by applying the binary particle swarm optimization (PSO) algorithm to the evaluation feature vectors calculated in the previous step. Binary PSO clustering has the advantage that the number of clusters does not need to be predefined. In theory, any other clustering method that does not require the number of clusters to be known could be applied here.

6. *Federated Cluster Models Construction*: For each cluster $k$, a federated RF model $Φ_{k}$ is built following the same procedure described in step 3. The trees contributed by all workers in cluster $k$, i.e., the trees contained in the respective $Γ_{i}$ for each worker in the cluster, are pooled together and reshuffled. Subsequently, 100 trees are randomly sampled to create the federated RF model $Φ_{k}$. This model is associated with an initial support score $s_{k}$ which reflects the relative size of the cluster. For example, if cluster $k$ contains 30 of the 300 workers, then $s_{k} = 30/300 = 0.1$.

7. *Repository Update*: The repository of federated RF models $Φ$ is extended by adding the newly created federated cluster models. The repository of workers is reset, i.e., $Θ = ∅$, as all workers have an active cluster model deployed for them. The support of the global federated RF model $Φ_{0}$ is also reset to zero, i.e., $s_{0} = 0$.
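
Steps 6 and 7 amount to pooling, sampling and bookkeeping, which the sketch below illustrates with placeholder strings standing in for trees. The function `build_cluster_models` and the dictionary layout are assumptions made for this example, not the structures used by the `support` module.

```python
import random


def build_cluster_models(clusters, tree_repo, n_trees, n_workers, seed=0):
    """Build one federated model per cluster.

    clusters:  dict mapping cluster id -> list of worker ids
    tree_repo: dict mapping worker id -> list of contributed trees
    """
    rng = random.Random(seed)
    models = {}
    for k, members in clusters.items():
        # Pool and reshuffle the trees contributed by the workers of this cluster.
        pool = [tree for w in members for tree in tree_repo[w]]
        rng.shuffle(pool)
        models[k] = {
            "trees": rng.sample(pool, min(n_trees, len(pool))),
            "workers": set(members),
            "support": len(members) / n_workers,  # relative cluster size
        }
    return models


# Placeholder example: 300 workers with 10 contributed trees each, split into two clusters.
tree_repo = {w: [f"tree_{w}_{j}" for j in range(10)] for w in range(300)}
clusters = {1: list(range(30)), 2: list(range(30, 300))}
models = build_cluster_models(clusters, tree_repo, n_trees=100, n_workers=300)

# Step 7: every worker now has a cluster model, so Θ is emptied and s_0 is reset.
pending_workers = set()
global_support = 0.0
```

In this example the cluster of 30 workers receives a support of 0.1 and the supports of all cluster models sum to 1.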

In [None]:
evaluation_vectors, evaluation_vectors_scaled = fed.get_evaluation_vectors()
pso_obj = sp.PSO(samples=evaluation_vectors_scaled, plot_iterations=True)
pso_obj.execute()

# Still unfinished from here on

In [None]:
pso_obj.show_clusters()

### Context-aware inference

Each worker receives the parameters of its cluster federated model. At each inference step, each worker calculates the residual between the predicted and observed values for the previous inference step. If the worker’s residual is above a threshold $δ$, which is determined by the model’s performance on the test set, this information is communicated to the central node and the following steps are conducted:

1. *Global Model Activation*: The overall federated global model $Φ_{0}$ is activated for the worker in question, i.e., $θ_{i}$ is added to $Θ$ and the support of $Φ_{0}$ is updated accordingly, i.e., $s_{0} = s_{0} + 1/M$, where $M$ is the total number of workers.

2. *Model Parameter Update*: The parameters of the corresponding (cluster) federated model $(Φ_{k}, Θ_{k}, s_{k})$ are updated, i.e., $θ_{i}$ is removed from the list of private workers $Θ_{k}$ for this model and the model support is reduced accordingly such that $s_{k} = s_{k} - 1/M$. Each time the parameters of a federated model in the repository $Φ$ are updated, the dynamic model maintenance needs to be executed.
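
The bookkeeping in these two steps can be sketched as below. The function `report_residual`, the dictionary layout and the use of the absolute residual are assumptions made for illustration; model id `0` stands for the global model $Φ_{0}$ and `pending` plays the role of $Θ$.

```python
GLOBAL = 0  # id of the global model Φ_0


def report_residual(worker, residual, delta, models, pending, n_workers):
    """Update the repositories when a worker's residual exceeds its threshold.

    Returns True if the worker was moved to the global model, which means
    that dynamic model maintenance should run next.
    """
    if abs(residual) <= delta:  # assumed: the magnitude of the residual is compared
        return False

    # Find the cluster model currently serving this worker, if any.
    cluster = next((k for k, m in models.items()
                    if k != GLOBAL and worker in m["workers"]), None)
    if cluster is None:
        return False  # the worker is already served by the global model

    # Step 1: activate the global model for this worker.
    pending.add(worker)
    models[GLOBAL]["support"] += 1 / n_workers

    # Step 2: remove the worker from its cluster model and lower that model's support.
    models[cluster]["workers"].discard(worker)
    models[cluster]["support"] -= 1 / n_workers
    return True


# Example with 10 workers split over two cluster models.
models = {
    GLOBAL: {"support": 0.0},
    1: {"workers": {0, 1, 2, 3}, "support": 0.4},
    2: {"workers": {4, 5, 6, 7, 8, 9}, "support": 0.6},
}
pending = set()
switched = report_residual(2, residual=0.9, delta=0.5, models=models,
                           pending=pending, n_workers=10)
ignored = report_residual(5, residual=0.1, delta=0.5, models=models,
                          pending=pending, n_workers=10)
```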

### Dynamic model maintenance
This final step concerns the identification of federated models in $Φ$ with relatively low support, possibly due to concept drift. Subsequently, a trace of the associated workers is kept in $Θ$ and if this grows above a certain predefined volume $Δ$, training for the affected workers is invoked.
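
A minimal sketch of this maintenance check is given below. Several details are assumptions rather than statements about FedRepo itself: the `min_support` threshold, retiring a low-support model and handing its remaining workers to the global model, and treating $Δ$ as a count of workers (`max_pending`). Refer to the paper [1] for the exact rules.

```python
GLOBAL = 0  # id of the global model Φ_0


def maintain(models, pending, min_support, max_pending):
    """Retire low-support cluster models and report whether retraining is due."""
    low_support = [k for k, m in models.items()
                   if k != GLOBAL and m["support"] < min_support]
    for k in low_support:
        retired = models.pop(k)
        # Keep a trace of the affected workers in Θ and serve them with the global model.
        pending.update(retired["workers"])
        models[GLOBAL]["support"] += retired["support"]
    # Training is invoked once the trace of workers grows above the predefined volume.
    return len(pending) > max_pending


# Example with 20 workers: model 1 has lost most of its workers.
models = {
    GLOBAL: {"support": 0.10},
    1: {"workers": {3}, "support": 0.05},
    2: {"workers": set(range(4, 21)), "support": 0.85},
}
pending = {7, 8}
retrain = maintain(models, pending, min_support=0.1, max_pending=2)
```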

[1] Tsiporkova, E., De Vis, M., Klein, S., Hristoskova, A., & Boeva, V. (2023). Mitigating Concept Drift in Distributed Contexts with Dynamic Repository of Federated Models. In 2023 IEEE International Conference on Big Data (BigData) (pp. 2690-2699). IEEE.