# Evaluation Methodology of Federated Learning

In this notebook we study different methodologies for evaluating federated learning (FL) models. First, we set up the FL configuration (for more information, see the [Basic Concepts Notebook](./basic_concepts.ipynb)).

In [1]:
import shfl
import keras
import numpy as np
import random

random.seed(123)
np.random.seed(seed=123)

class Reshape(shfl.private.FederatedTransformation):
    """Add a trailing channel dimension so the images match the Conv2D input shape."""
    
    def apply(self, labeled_data):
        # (num_samples, height, width) -> (num_samples, height, width, 1)
        labeled_data.data = np.reshape(labeled_data.data, (labeled_data.data.shape[0], labeled_data.data.shape[1], labeled_data.data.shape[2],1))
        
class Normalize(shfl.private.FederatedTransformation):
    """Standardize the data with a given mean and standard deviation."""
    
    def __init__(self, mean, std):
        self.__mean = mean
        self.__std = std
    
    def apply(self, labeled_data):
        labeled_data.data = (labeled_data.data - self.__mean)/self.__std
        
def model_builder():
    model = keras.models.Sequential()
    model.add(keras.layers.Conv2D(32, kernel_size=(3, 3), padding='same', activation='relu', strides=1, input_shape=(28, 28, 1)))
    model.add(keras.layers.MaxPooling2D(pool_size=2, strides=2, padding='valid'))
    model.add(keras.layers.Dropout(0.4))
    model.add(keras.layers.Conv2D(32, kernel_size=(3, 3), padding='same', activation='relu', strides=1))
    model.add(keras.layers.MaxPooling2D(pool_size=2, strides=2, padding='valid'))
    model.add(keras.layers.Dropout(0.3))
    model.add(keras.layers.Flatten())
    model.add(keras.layers.Dense(128, activation='relu'))
    model.add(keras.layers.Dropout(0.1))
    model.add(keras.layers.Dense(64, activation='relu'))
    model.add(keras.layers.Dense(10, activation='softmax'))

    model.compile(optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"])
    
    return shfl.model.DeepLearningModel(model)



Using TensorFlow backend.


In [2]:
#Read data
database = shfl.data_base.Emnist()
train_data, train_labels, val_data, val_labels, test_data, test_labels = database.load_data()

#Distribute among clients
non_iid_distribution = shfl.data_distribution.NonIidDataDistribution(database)
federated_data, test_data, test_labels = non_iid_distribution.get_federated_data(num_nodes=5, percent=10)

#Set up aggregation operator
aggregator = shfl.federated_aggregator.AvgFedAggregator()
federated_government = shfl.learning_approach.FederatedGovernment(model_builder, federated_data, aggregator)

#Reshape and normalize
shfl.private.federated_operation.apply_federated_transformation(federated_data, Reshape())

mean = np.mean(test_data.data)
std = np.std(test_data.data)
shfl.private.federated_operation.apply_federated_transformation(federated_data, Normalize(mean, std))
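
The aggregation operator combines the clients' local models into the global one. As a rough illustration only, and not the library's actual implementation, the sketch below averages parameters layer by layer. It assumes each client's model is a list of NumPy arrays with matching shapes; `average_client_weights` is a hypothetical helper introduced for this example.

```python
import numpy as np

def average_client_weights(client_weights):
    """Average, layer by layer, the parameters reported by each client.

    client_weights: list (one entry per client) of lists of np.ndarray
    (one array per layer). Hypothetical helper, for illustration only.
    """
    # zip(*...) groups the i-th layer of every client together
    return [np.mean(np.stack(layer_group), axis=0)
            for layer_group in zip(*client_weights)]

# Two toy clients, each with a 2x2 "kernel" and a length-2 "bias"
client_a = [np.array([[1.0, 2.0], [3.0, 4.0]]), np.array([0.0, 1.0])]
client_b = [np.array([[3.0, 4.0], [5.0, 6.0]]), np.array([2.0, 3.0])]

global_weights = average_client_weights([client_a, client_b])
print(global_weights[0])  # element-wise mean of the two kernels
print(global_weights[1])  # element-wise mean of the two biases
```

A plain mean treats every client equally; weighting by the number of samples per client is a common variant that this sketch does not cover.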

## Evaluation methodology 1: Global test dataset

The first evaluation methodology is the federated version of classical evaluation: a common test dataset is held on the server. In each learning round, we report the evaluation metrics (loss and accuracy in this case) for both the local models and the updated global model. The following cell shows this methodology in action.

In [3]:
test_data = np.reshape(test_data, (test_data.shape[0], test_data.shape[1], test_data.shape[2],1))
federated_government.run_rounds(1, test_data, test_labels)

Accuracy round 0
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x112a45910>: [12.972314575958253, 0.19499999284744263]
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x112a28390>: [4.393721841335297, 0.7226250171661377]
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1a36ede910>: [10.832231537246704, 0.3259499967098236]
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1a36edea10>: [4.266248627996445, 0.7315000295639038]
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1a36edeb50>: [4.621014157485962, 0.7071999907493591]
Global model test performance : [8.520207717514038, 0.4586000144481659]





This methodology is the simplest, and it reports metrics for both the local and the global models. Its drawback is that the local evaluation metrics are biased by the distribution of the global test set. In a non-IID scenario (see [Federated Sampling](./federated_sampling.ipynb)), each client's training data follows a different distribution from the data we test on, so the performance of the local models is not properly represented. For that reason, we propose the following evaluation methodology.

## Evaluation methodology 2: Global test dataset and local test datasets

In this evaluation methodology there is, as before, a global test dataset, and in addition each client has a local test dataset that follows the distribution of its training data. In each round we therefore show the evaluation metrics of each client on both its local test set and the global test set. This methodology is more complete, since it shows how the local FL models perform on both the global and the local data distributions, which gives us more information.

First, we split each client's data into train and test partitions. You can find this method in [Federated Operation](https://github.com/sherpaai/Sherpa.FL/blob/master/shfl/private/federated_operation.py).

In [4]:
shfl.private.federated_operation.split_train_test(federated_data)

After that, each client owns a training set, which it uses to train its local learning model, and a test set, which it uses for evaluation.
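
As a rough picture of what a per-client split involves (not the library's code), the sketch below shuffles one client's samples and holds out a fraction for testing. The function name `split_client_data`, the 20% test fraction, and the fixed seed are assumptions made for illustration.

```python
import numpy as np

def split_client_data(data, labels, test_fraction=0.2, seed=123):
    """Shuffle one client's samples and hold out a fraction as a local test set."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(data.shape[0])
    num_test = int(round(test_fraction * data.shape[0]))
    test_idx, train_idx = indices[:num_test], indices[num_test:]
    return data[train_idx], labels[train_idx], data[test_idx], labels[test_idx]

# Toy client with 10 samples of shape 28x28x1 and integer labels
client_data = np.zeros((10, 28, 28, 1))
client_labels = np.arange(10)

x_train, y_train, x_test, y_test = split_client_data(client_data, client_labels)
print(x_train.shape, x_test.shape)
```

Because the split is done independently on each client, every local test set inherits the class distribution of that client's own data.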

We are now ready to show the behaviour of this evaluation methodology.

In [5]:
#We restart the federated government
federated_government = shfl.learning_approach.FederatedGovernment(model_builder, federated_data, aggregator)

test_data = np.reshape(test_data, (test_data.shape[0], test_data.shape[1], test_data.shape[2],1))
federated_government.run_rounds(1, test_data, test_labels)

Accuracy round 0
Performance client <shfl.private.federated_operation.FederatedDataNode object at 0x112a45910>: Global test: [12.99019923171997, 0.19382500648498535], Local test: [0.0407491410151124, 0.9906250238418579]
Performance client <shfl.private.federated_operation.FederatedDataNode object at 0x112a28390>: Global test: [4.718754151058197, 0.7013499736785889], Local test: [0.21528405373295148, 0.9416666626930237]
Performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1a36ede910>: Global test: [10.071145951461792, 0.3734999895095825], Local test: [0.04247177823757132, 0.9906250238418579]
Performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1a36edea10>: Global test: [5.453462917232513, 0.6549000144004822], Local test: [0.23397006044785182, 0.9333333373069763]
Performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1a36edeb50>: Global test: [3.7176213492393493, 0.7637249827384949], Local test: [0.30

The output shows the value of this new evaluation methodology. For example, the first client performed the worst on the global test set while achieving one of the best results on its local test set. This suggests that this client's data distribution is probably much less diverse than the global one, for example consisting of only two classes. Because that is a simpler problem, the client obtains a very good local model after just one round of learning, but with very low performance on the global test set.
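
The gap between global and local results can be reproduced with a toy sketch. It uses synthetic labels and no training, and every name in it is hypothetical: it mimics a client that has only seen two of ten classes and whose idealised local model always answers within those two classes. The model is scored on a balanced global test set and on a local test set drawn from the client's own classes.

```python
import numpy as np

rng = np.random.default_rng(123)
num_classes = 10
client_classes = np.array([3, 7])  # assumption: the client only holds two classes

def toy_local_model(labels):
    """Idealised local model: correct on known classes, guesses a known class otherwise."""
    known = np.isin(labels, client_classes)
    guesses = rng.choice(client_classes, size=labels.shape[0])
    return np.where(known, labels, guesses)

def accuracy(predictions, labels):
    return float(np.mean(predictions == labels))

# Balanced global test set: 100 samples of every class
global_labels = np.repeat(np.arange(num_classes), 100)
# Local test set: sampled only from the client's own classes
local_labels = rng.choice(client_classes, size=200)

global_acc = accuracy(toy_local_model(global_labels), global_labels)
local_acc = accuracy(toy_local_model(local_labels), local_labels)
print(f"global test accuracy: {global_acc:.2f}")  # 0.20
print(f"local test accuracy:  {local_acc:.2f}")   # 1.00
```

Even a perfect model for the two local classes cannot exceed 20% accuracy on the balanced ten-class test set, which is the same pattern as the first client's results above.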

This highlights the strength of using specific evaluation methodologies in FL, especially when the data is distributed among clients in a non-IID fashion (see [Federated Sampling](./federated_sampling.ipynb)).