In [None]:
import starterkits.starterkit_4_2.support as sp
import starterkits.starterkit_4_2.visualizations as vis

from pathlib import Path
DATA_PATH = Path('../../data/')

%load_ext autoreload
%autoreload 2

# Starter Kit 4.2: FedRepo: mitigating concept drift in a federated context

## Passport

### Business context
<span style="color:red;">*I would start here with a bit more context, explaining that you want to forecast electricity consumption but that that is difficult given the heterogenity of households and patterns. Only then describe it as concept drift, to make it a bit more tangible for readers.*</span>
In an increasingly connected world, the concept of federated learning is becoming crucial, especially in scenarios where data privacy is paramount, and bandwidth is a limiting factor. Federated learning enables multiple decentralized devices or servers (clients) to collaboratively learn a predictive model while keeping all training data local. This avoids the need to transfer large volumes of sensitive data to a central server for processing. However, federated learning environments face significant challenges:

 - Dynamic and heterogeneous data sources often lead to concept drift, where the underlying data patterns the model has learned change over time, potentially degrading the model's performance.
 - Limited bandwidth for communication between clients and the central server can hinder the efficiency of model updates and retraining processes.
 - Privacy concerns limit the amount and type of data that can be shared between clients, complicating the detection and mitigation of concept drift.

These challenges necessitate advanced strategies for model training and maintenance to ensure that predictive models remain accurate and efficient over time without compromising privacy or incurring prohibitive communication costs.

### Business goal
<span style="color:red;">*The same here.*</span>
The business goal for this Starter Kit is **concept drift mitigation** in federated learning environments. Specifically, this Starter Kit applies a methodology called *FedRepo*, introduced by Tsiporkova et al. [1], that manages a dynamic repository of federated models to effectively cope with concept drift. The FedRepo methodology aims to provide a robust solution that maintains the accuracy and efficiency of the federated models over time, ensuring they adapt to changes in data dynamics while minimizing communication overhead.

### Application Context
The FedRepo methodology is applicable in various settings where data privacy and limited connectivity are major concerns:

 - Healthcare: Hospitals and medical institutions can collaborate on developing predictive models or improving diagnostic tools <span style="color:red;">on more diverse data</span> without actually sharing patient data. 
 - Wearables: User experience of several features (e.g. text prediction) can be enhanced on personal devices without compromising privacy.
 - Industrial: Assets manufactured by a third party (e.g. printers) can be used to collaboratively learn predictive models without each customer having to share its data. <span style="color:red;">*I don't understand this one. The space here is not as limited as in the paper, so another sentence doesn't hurt.*</span>

### Starter Kit outline
This Starter Kit will demonstrate the application of the FedRepo methodology using a real-world dataset. First, the dataset will be described, which contains electricity consumption data of UK households. Then, the FedRepo methodology will be explained and discussed through its key steps, while applying them on a subset of the households. For a <span style="color:red;">more in-depth</span> explanation of all steps involved, please refer to the paper. Finally, the performance of the methodology is evaluated, also in terms of its adaptability and concept drift mitigation.

## Dataset
Forecasting electricity consumption across households is a highly relevant application for this methodology, as the energy consumption of households obviously (<span style="color:red;">*is it? There was this paper that showed that very nicely. You can refert to that one (Automatic socio-economic classification of households using electricity consumption data." Proceedings of the fourth international conference on Future energy systems. 2013).*</span>) is privacy-sensitive. Additionally, many factors can cause concept drift:
 - The occupation of the household in terms of its inhabitants
 - Replacement of household appliances
 - Seasonal influence on energy consumption
 - ...

The data was collected by the UK Power Networks led Low Carbon London project<span style="color:red;">[link to website]</span>. It covers 5,567 households <span style="color:red;">(given in column `consumer`)</span> in London, a balanced sample representative of the Greater London population, with readings at a 30-minute granularity between November 2011 and February 2014. The consumption <span style="color:red;">(in column `consumption`)</span> is given in kWh. To demonstrate our methodology, we randomly selected 300 households for which data is available until at least January 2014. For these households, a repository of federated models will be trained in order to forecast the consumption within the next 30 minutes.

In [None]:
household_subset = sp.get_data(DATA_PATH)
household_subset

## Preprocess data

<span style="color:red;">For ensuring that the data is ready for modelling, we perform the following steps:</span> ~~The next step involves preprocessing the loaded dataset to ensure it's ready for modeling. This includes~~ data cleaning, feature engineering and splitting the data into train and test sets. <span style="color:red;">*The following is difficult to understand. The reader doesn't know yet what you're training on. Hence, they might be confused why you train on 3 month only if you have data for 2.5 years. I guess this makes only sense to describe once they know the model*</span> For this data, a train set of three months is used (January to March 2012) and a test set of one month (April 2012). The features are the consumption values of up to 6 hours ago, together with the consumption at the same time, and 30 minutes before and after it, on the previous day and in the previous week. Additionally, the day of the week, hour of the day and month of the year are also defined as features, with the latter two being cyclically encoded <span style="color:red;">*It doesn't hurt showing the features once. It's easier for people to understand.*</span>.
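
As an illustration of the feature set described above, the sketch below lists the lag offsets (in 30-minute steps) and shows one common way to encode a cyclical quantity such as the hour of the day. It is not the implementation inside `sp.PreprocessConsumption`: the helper names `lag_offsets` and `cyclical_encode` are hypothetical, and reading "up to 6 hours ago" as 12 half-hour lags is an assumption.

```python
import math

STEPS_PER_DAY = 48                    # 30-minute readings per day
STEPS_PER_WEEK = 7 * STEPS_PER_DAY


def lag_offsets(recent_hours=6):
    """Lags (in 30-minute steps) relative to the reading being forecast."""
    # Assumption: "up to 6 hours ago" means the last 12 half-hour readings.
    recent = list(range(1, 2 * recent_hours + 1))
    # Same time on the previous day/week, plus 30 minutes before and after.
    day = [STEPS_PER_DAY + d for d in (-1, 0, 1)]
    week = [STEPS_PER_WEEK + d for d in (-1, 0, 1)]
    return recent + day + week


def cyclical_encode(value, period):
    """Map a periodic value (e.g. hour 0-23) onto the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


offsets = lag_offsets()
hour_0 = cyclical_encode(0, 24)
hour_12 = cyclical_encode(12, 24)
hour_23 = cyclical_encode(23, 24)
```

With this encoding, 23:00 and 00:00 end up next to each other on the circle, whereas a plain integer encoding would place them 23 units apart.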

In [None]:
prep_data = sp.PreprocessConsumption(data=household_subset)
prep_data.preprocess()

## FedRepo
<span style="color:red;">*I would try to explain the model a bit more on high level first. These starterkits are meant to be easily understandable... Hence, describing that the aim is to train models locally per consumer (I would only use consumer, not client and device because it's getting too much wording), but then to construct a global model on which the concept drift is detected. So people get an idea of what we're doing, before killing them with repositories.*</span>
The FedRepo algorithm, designed to mitigate concept drift in federated learning environments, is structured around several key steps <span style="color:red;">(cf. Figure below?)</span>. These steps ensure that the algorithm dynamically adapts to changes in data distributions across different clients or devices, maintaining the efficacy of the deployed models. FedRepo is built around the maintenance of three repositories residing in a central node, e.g., in the cloud. These are:

- $Θ$: a repository of workers, which contains at any moment the workers (clients or devices) for which new federated models need to be constructed.
- $Φ$: a repository of global federated random forest models, which contains at any moment the active (deployed) federated models.
- $Γ$: a repository of tree models, which contains at any moment subsets of trees from local RF models of each worker.

The proposed methodology will continuously update the above described repositories during the use of the federated models based on continuous monitoring and evaluation of the models’ performance. FedRepo consists of four main steps: *Initialization*, *Model training*, *Context-aware inference* and *Dynamic model maintenance*. These are shown in the image below which gives an overview of the methodology. Even though FedRepo is described and evaluated in a regression task scenario, the same methodology can be used for classification by using a proper evaluation metric.

<span style="color:red;">*On a wide screen the image looks really massive. Can you set a maxmium width?*</span>
<table><tr><td><img src='media/fedrepo.png'></td></tr></table>

### Initialization
This step is performed in the central node. The repository of federated RF models is empty since no RF models have been constructed yet, i.e., $Φ = ∅$. Analogously, the workers’ repository contains all available workers since for all of them the federated models still need to be constructed, i.e., $Θ = \lbrace θ_{1}, \ldots, θ_{300} \rbrace$, and the repository of tree models is composed of 300 empty sets, one per worker, i.e., $Γ = \lbrace Γ_{1}, \ldots, Γ_{300} \rbrace$, where $Γ_{i} = ∅$ for $i = 1, \ldots, 300$.
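
To make the initial state concrete, the three repositories can be pictured as plain Python containers. This is only a sketch and not how `sp.FedRepo` stores them internally; the worker identifiers are hypothetical placeholders for the household IDs.

```python
N_WORKERS = 300

# Hypothetical worker identifiers; in the dataset these are household IDs.
worker_ids = [f"worker_{i}" for i in range(1, N_WORKERS + 1)]

model_repo = {}                                     # Φ: no federated models yet
worker_repo = set(worker_ids)                       # Θ: every worker still needs a model
tree_repo = {worker: [] for worker in worker_ids}   # Γ: one empty tree set per worker
```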

In [None]:
fedrepo = sp.FedRepo(data=prep_data)
fedrepo.initialize()

### Training

In this step, the model and worker repositories are updated such that devices similar with respect to the model performance are assigned to the same cluster of workers and hence collaboratively build and share the same RF <span style="color:red;">*You didn't talk about random forest models before...*</span> federated model. For this, the following steps are executed:

1. *Local Model Training*: Each worker trains a local RF model <span style="color:red;">for predicting the consumption in the next 30 minutes based on the features described above</span> on its training data and selects a subset of trees to contribute to the central repository. The local forests consist of 100 trees per worker. <span style="color:red;">*Do we have any motivation for these 100 trees?*</span>

2. *Tree Repository Update*: The central node updates the tree model repository with the trees received from all workers.

3. *Federated Global Model Construction*: Construct the initial global RF model $Φ_{0}$ by randomly sampling 100 trees from the updated tree repository. <span style="color:red;">*How?*</span>

4. *Evaluation Feature Vector Construction*: Each worker evaluates every tree from the global model on its test data, and the performance metrics are used to construct an evaluation feature vector for each worker. <span style="color:red;">For the regression model described here for the electricity consumption,</span> the RMSE score is used as the performance metric. <span style="color:red;">This results in...</span>

These steps will be performed by running the cell below. Note that the communication contents between the local workers and the central node have been: the locally trained trees (to the central node), the global model (to the local workers) and the performance scores (back to the central node). No local data was shared across workers or with the central node.
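
The global model construction and the evaluation vector can be sketched as follows. This is a simplified illustration, not the code behind `fedrepo.training()`: the "trees" are toy callables standing in for real decision trees, and the function names, the fixed seed and the number of trees contributed per worker are assumptions made for the example.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def sample_global_model(tree_repo, n_trees=100, seed=0):
    """Pool the trees contributed by all workers and sample n_trees at random."""
    pooled = [tree for trees in tree_repo.values() for tree in trees]
    return random.Random(seed).sample(pooled, n_trees)


def evaluation_vector(global_trees, X_test, y_test):
    """One RMSE score per tree of the global model, computed on local test data."""
    return [rmse(y_test, [tree(x) for x in X_test]) for tree in global_trees]


def make_tree(bias):
    """Toy stand-in for a decision tree: last reading plus a fixed bias."""
    return lambda x: x[-1] + bias


# Three hypothetical workers, each contributing 50 toy trees.
tree_repo = {f"worker_{i}": [make_tree(0.01 * i * j) for j in range(50)] for i in range(3)}
global_model = sample_global_model(tree_repo)

# Local test data of a single worker; it never leaves the worker.
X_test = [[0.2, 0.3], [0.3, 0.1], [0.1, 0.4]]
y_test = [0.3, 0.1, 0.4]
scores = evaluation_vector(global_model, X_test, y_test)
```

Only `scores`, a vector with one entry per tree of the global model, would be sent back to the central node.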

<span style="color:red;">*All of these steps above need more explanation. Describe it for practical use, not like for a journal paper.*</span>

In [None]:
fedrepo.training()

### Clustering 
At this stage an evaluation vector of 100 scores exists for each of the 300 workers. These are collected in a matrix which serves as the basis for a clustering step in which similar workers are grouped together. This grouping is the start of the next few steps in the FedRepo algorithm:

5. *Local Node Clustering*: To derive personalized models for a set of similar workers, the workers are split into $K$ non-overlapping clusters. These are obtained by applying the binary PSO algorithm on the evaluation feature vectors calculated in the previous step <span style="color:red;">*Give at least a reference, even better a short explanation of the algorithm.*</span>. The binary PSO clustering has the advantage that the number of clusters does not need to be predefined. In theory, any other clustering method which does not require the number of clusters to be known could have been applied here.

6. *Federated Cluster Models Construction*: For each cluster $k$, a federated RF model $Φ_{k}$ is built following the same procedure described in step 3. The trees contributed by all workers in cluster $k$, i.e., the trees contained in the respective $Γ_{i}$ for each worker $i$ in the cluster, are pooled together and reshuffled. Subsequently, 100 trees are randomly sampled to create the federated RF model $Φ_{k}$. This model is associated with an initial support score $s_{k}$ which reflects the relative size of the cluster. For example, if cluster $k$ contains 30 of the 300 workers, then $s_{k} = 0.1$. <span style="color:red;">*Also here you can probably phrase it a bit easier, with more words and less mathematical symbols.*</span>

7. *Repository Update*: The repository of federated RF models $Φ$ is extended by adding the newly created federated cluster models. The repository of workers is reset, i.e., $Θ = ∅$, as all workers have an active cluster model deployed for them. The support of the global federated RF model $Φ_{0}$ is also reset to zero, i.e., $s_{0} = 0$.
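
Steps 6 and 7 can be sketched as below. This is an illustrative simplification rather than the code used by `sp.FedRepo`: trees are represented by placeholder strings, and the function name, the dictionary layout and the number of trees contributed per worker are assumptions.

```python
import random


def build_cluster_models(clusters, tree_repo, n_workers, n_trees=100, seed=0):
    """Build one federated model per cluster from the trees of its members."""
    rng = random.Random(seed)
    models = {}
    for k, members in enumerate(clusters, start=1):
        # Pool and reshuffle the trees contributed by the workers in this cluster.
        pooled = [tree for worker in members for tree in tree_repo[worker]]
        rng.shuffle(pooled)
        models[k] = {
            "trees": rng.sample(pooled, min(n_trees, len(pooled))),
            "workers": set(members),
            "support": len(members) / n_workers,   # relative cluster size
        }
    return models


workers = [f"worker_{i}" for i in range(300)]
# Placeholder trees: 10 contributed trees per worker (hypothetical number).
tree_repo = {w: [f"{w}_tree_{j}" for j in range(10)] for w in workers}
clusters = [workers[:30], workers[30:]]

models = build_cluster_models(clusters, tree_repo, n_workers=len(workers))
worker_repo = set()   # Θ is reset: every worker now has an active cluster model
```

In this toy split, the cluster with 30 workers gets a support of 30/300 = 0.1, matching the example in step 6.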

Running the cell below will execute the PSO clustering, which consists of multiple iterations to find the optimal clustering configuration with regard to a clustering metric. Here, the silhouette score is used. <span style="color:red;">*Describe what you see in the figures.*</span>
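
To give an idea of what the clustering metric measures, the sketch below computes the silhouette score of a candidate partition in plain Python. It does not implement the binary PSO search itself; it only shows how one candidate clustering could be scored. The use of Euclidean distance between evaluation vectors is an assumption, and the four two-dimensional vectors are toy data.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient; singleton clusters contribute 0."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue  # singleton cluster: coefficient is 0
        # a: mean distance to the other members of the own cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Toy evaluation vectors: two clearly separated groups of workers.
vectors = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
good = silhouette_score(vectors, [0, 0, 1, 1])
bad = silhouette_score(vectors, [0, 1, 0, 1])
```

A partition that groups similar evaluation vectors together (`good`) scores close to 1, while one that mixes the two groups (`bad`) scores below 0.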

In [None]:
fedrepo.clustering()

The figure below shows the evaluation vector of each worker, color coded by the cluster the worker belongs to. It helps to visualize the differences between workers with regard to the performance of each tree of the global model. Clustering on these vectors in essence makes it possible to identify common behavioural patterns across groups of workers. <span style="color:red;">*This is on the one month of validation data?*</span>

In [None]:
vis.show_clusters(fedrepo.evaluation_vectors, fedrepo.clusters)

### Prediction
All workers receive the parameters of their cluster federated model. One way to proceed is to apply these initial cluster models on the full validation set, which is defined as the data between May 2012 and December 2013. To enable comparison, the predictions of the global and local models are generated as well.

In [None]:
fedrepo.generate_without_maintenance()

### Concept drift mitigation

Instead of naively applying the initial cluster models without modification, FedRepo proposes a framework to perform dynamic maintenance to mitigate concept drift. The basis of this framework is to calculate the RMSE score between the predicted and observed values after each day. This score is compared to a threshold $δ$, that is derived from the model’s performance on the worker's test set. If the RMSE score is above the threshold for three days in a row, the following steps are conducted:

1. *Global Model Activation*: The overall federated global model $Φ_{0}$ is activated for the worker in question, i.e., the worker is added to $Θ$ and the support of $Φ_{0}$ is updated accordingly, i.e., $s_{0} = s_{0} + 1/300$.

2. *Model Parameter Update*: The parameters of the corresponding (cluster) federated model ($Φ_{k}, Θ_{k}, s_{k}$) are updated, i.e., the worker is removed from the list of workers that use this model and the model support is reduced accordingly such that $s_{k} = s_{k} − 1/M$, where $M$ is the total number of workers (here $M = 300$).
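
The drift check and the two update steps can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the logic inside `fedrepo.concept_drift_mitigation()`: the function names, the toy RMSE values and the threshold of 0.2 are hypothetical, and supports are recomputed from membership counts instead of being updated incrementally.

```python
def first_drift_day(daily_rmse, threshold, patience=3):
    """Index of the day on which the RMSE has exceeded the threshold for
    `patience` consecutive days, or None if that never happens."""
    streak = 0
    for day, score in enumerate(daily_rmse):
        streak = streak + 1 if score > threshold else 0
        if streak == patience:
            return day
    return None


def switch_to_global(worker, cluster_members, worker_repo):
    """Move a drifting worker from its cluster model to the global model."""
    cluster_members.discard(worker)
    worker_repo.add(worker)


M = 300                                              # total number of workers
cluster_members = {f"worker_{i}" for i in range(30)}
worker_repo = set()                                  # Θ: workers on the global model

daily_rmse = [0.10, 0.25, 0.12, 0.30, 0.28, 0.31, 0.11]   # toy values
drift_day = first_drift_day(daily_rmse, threshold=0.2)
if drift_day is not None:
    switch_to_global("worker_7", cluster_members, worker_repo)

s_k = len(cluster_members) / M   # cluster support after the switch
s_0 = len(worker_repo) / M       # global model support after the switch
```

In this toy series a single day above the threshold (day 1) does not trigger anything; drift is only flagged on day 5, the third consecutive day above the threshold.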

Concept drift in the consumption pattern of a worker is thus identified by monitoring the performance of its cluster model. If this performance degrades (i.e., the error score increases) for a sufficiently long period (three days), the cluster model is deactivated for this worker and replaced by the global model. The global model acts as a baseline, one-size-fits-all model that is assumed to mitigate the effects of concept drift for a worker that no longer fits its customized model. Each time the global model is activated for a worker, some maintenance checks need to be executed. These identify cluster models whose support has dropped too low and verify whether retraining is needed:

1. *Workers’ Repository*: For each federated cluster model $k$, if its support $s_{k}$ is smaller than a predefined threshold $z$, its workers are added to the workers’ repository $Θ$.

2. *Cluster Model Deactivation*: The federated cluster model $Φ_{k}$ is deactivated by resetting its support $s_{k} = 0$. The overall federated global model $Φ_{0}$ is activated for the remaining workers that still used the deactivated model and the support of $Φ_{0}$ is updated accordingly, i.e., $s_{0} = s_{0} + |Θ_{k}|/M$, where $|Θ_{k}|$ is the number of workers that were still using $Φ_{k}$.

3. *Federated Models’ Repository*: If the support $s_{0}$ of the overall federated global model $Φ_{0}$ is above the predefined threshold $Δ$, retraining is performed with the workers in $Θ$. Local models are trained on the previous three months and new trees are sent to the tree repositories of the respective workers to replace the previously contributed trees. The global model is updated by sampling random trees from the updated tree repositories and clustering is performed in the same way as before. New cluster models are added to the repository of federated models, so that every worker again has an active cluster model and the support of the global model is reset to 0. Note that nothing changes for the workers that were not in $Θ$ before retraining and therefore still have an active cluster model.
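
The maintenance checks can be sketched as a single function. This is an illustrative simplification, not the code of `sp.FedRepo`: the function name and data layout are hypothetical, supports are derived from membership counts, and the thresholds `z` and `delta` (for $Δ$) are passed in as parameters.

```python
def maintenance_check(cluster_members, worker_repo, n_workers, z, delta):
    """Deactivate weakly supported cluster models and report whether
    retraining should be triggered.

    cluster_members maps a cluster id to the set of workers using that model;
    worker_repo is the set of workers currently served by the global model.
    """
    for members in cluster_members.values():
        support = len(members) / n_workers
        if members and support < z:
            worker_repo.update(members)   # remaining workers fall back to the global model
            members.clear()               # the cluster model's support becomes 0
    global_support = len(worker_repo) / n_workers
    return global_support > delta


workers = [f"worker_{i}" for i in range(300)]
cluster_members = {
    1: set(workers[:9]),        # only 9 workers left: support below z
    2: set(workers[9:245]),     # healthy cluster
}
worker_repo = set(workers[245:])   # 55 workers already on the global model

retrain = maintenance_check(cluster_members, worker_repo, n_workers=300, z=0.033, delta=0.2)
```

In this toy state, cluster 1 is deactivated, its 9 workers join the 55 already in the workers' repository, and the resulting 64 workers (support of about 0.21) exceed the retraining threshold of 0.2.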

For this data, a cluster model is said to have too low support when it drops below 10 workers, i.e. $z = 0.033$, and retraining is invoked when the workers' repository surpasses 60 workers, i.e. $Δ = 0.2$. Running the cell below will apply the dynamic concept drift mitigation on the validation set.



In [None]:
fedrepo.concept_drift_mitigation()

# Still unfinished from here on

## Discussion

The FL method introduced above has several advantages for tackling distributed concept drift. First, concept drift detection comes with very little overhead: during the regular inference phase, the residuals are calculated alongside the actual forecast values. Based on these residuals, concept drift is detected locally, while the repositories are updated in the central node. Second, concept drift is not assumed to happen at the same time across different workers. Instead, each worker can be part of a shared repository built on information from similar devices, or activate the federated global model when its performance degrades, before being assigned to a new repository. Third, the number of models is not predefined and constant, as the binary PSO allows a flexible number of clusters, so that changes in behaviour can be captured as they arise. Finally, as the worker clustering is performed on the model performance only, it is not necessary to share the actual patterns of the workers’ data with the central node.

In [None]:
del fedrepo.worker_data

In [None]:
import plotly.express as px


In [None]:
import pickle

with open('fedrepo_obj_w_data.pkl', 'wb') as file:
    pickle.dump(fedrepo, file)

In [None]:
fedrepo.maintenance_log

In [None]:
[model.support for model in fedrepo.active_models]

In [None]:
fedrepo.forest_support_threshold

In [None]:
fedrepo.worker_repo_threshold

In [None]:
len(fedrepo.worker_repo)

In [None]:
fedrepo.worker_repo = fedrepo.client_ids

In [None]:
fedrepo.active_models = {}
fedrepo.worker_repo = []
fedrepo.create_cluster_models()
fedrepo.active_models[fedrepo.global_forest] = []

fedrepo.drift_detection_dict = {worker: {'predictions': [],
                                      'rmse': [],
                                      'pred_index': [],
                                      'active_model': []} for worker in fedrepo.client_ids}

In [None]:
fedrepo.check_worker_allocation()

In [None]:
fedrepo.clusters

In [None]:
sum([len(clus) for clus in fedrepo.clusters])

In [None]:
sum([len(value) for value in fedrepo.active_models.values()])

In [None]:
len(fedrepo.active_models[fedrepo.global_forest])

In [None]:
sum([len(model) for model in fedrepo.active_models.values()])

In [None]:
px.line(fedrepo.worker_repository_hist)

In [None]:
import plotly.express as px
px.line(x=fedrepo.drift_detection_dict['MAC000018']['pred_index'].dt.resample('D'), y=fedrepo.drift_detection_dict['MAC000018']['rmse'])

In [None]:
import pandas as pd

[1] Tsiporkova, E., De Vis, M., Klein, S., Hristoskova, A., & Boeva, V. (2023). Mitigating Concept Drift in Distributed Contexts with Dynamic Repository of Federated Models. In 2023 IEEE International Conference on Big Data (BigData) (pp. 2690-2699). IEEE.