# Evaluation Methodology of Federated Learning

In this notebook we study the different methodologies available for evaluating federated learning (FL) problems. First, we set up the FL configuration (for more information, see the [Basic Concepts Notebook](https://github.com/sherpaai/Sherpa.FL/blob/master/notebooks/basic_concepts.ipynb)).

In [1]:
import shfl
import keras
import numpy as np

class Reshape(shfl.private.FederatedTransformation):
    
    def apply(self, labeled_data):
        labeled_data.data = np.reshape(labeled_data.data, (labeled_data.data.shape[0], labeled_data.data.shape[1], labeled_data.data.shape[2],1))
        
class Normalize(shfl.private.FederatedTransformation):
    
    def __init__(self, mean, std):
        self.__mean = mean
        self.__std = std
    
    def apply(self, labeled_data):
        labeled_data.data = (labeled_data.data - self.__mean)/self.__std
        
def model_builder():
    model = keras.models.Sequential()
    model.add(keras.layers.Conv2D(32, kernel_size=(3, 3), padding='same', activation='relu', strides=1, input_shape=(28, 28, 1)))
    model.add(keras.layers.MaxPooling2D(pool_size=2, strides=2, padding='valid'))
    model.add(keras.layers.Dropout(0.4))
    model.add(keras.layers.Conv2D(32, kernel_size=(3, 3), padding='same', activation='relu', strides=1))
    model.add(keras.layers.MaxPooling2D(pool_size=2, strides=2, padding='valid'))
    model.add(keras.layers.Dropout(0.3))
    model.add(keras.layers.Flatten())
    model.add(keras.layers.Dense(128, activation='relu'))
    model.add(keras.layers.Dropout(0.1))
    model.add(keras.layers.Dense(64, activation='relu'))
    model.add(keras.layers.Dense(10, activation='softmax'))

    model.compile(optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"])
    
    return shfl.model.DeepLearningModel(model)

# Read data
database = shfl.data_base.Emnist()
train_data, train_labels, val_data, val_labels, test_data, test_labels = database.load_data()

# Distribute among clients
non_iid_distribution = shfl.data_distribution.NonIidDataDistribution(database)
federated_data, test_data, test_labels = non_iid_distribution.get_federated_data(num_nodes=20, percent=10)

# Set up aggregation operator
aggregator = shfl.federated_aggregator.AvgFedAggregator()
federated_government = shfl.learning_approach.FederatedGovernment(model_builder, federated_data, aggregator)

# Reshape and normalize the data held by each client
shfl.private.federated_operation.apply_federated_transformation(federated_data, Reshape())

# The normalization statistics are computed from the global test data
mean = np.mean(test_data.data)
std = np.std(test_data.data)
shfl.private.federated_operation.apply_federated_transformation(federated_data, Normalize(mean, std))

Using TensorFlow backend.


## Evaluation methodology 1: global test dataset

The first evaluation methodology that we propose is the federated version of the classical evaluation methods. For that purpose, we use a common test dataset held on the server. In each learning round we report the evaluation metrics (loss and accuracy in this case) for both the local models and the updated global model. The following cell shows this methodology in action:

In [2]:
test_data = np.reshape(test_data, (test_data.shape[0], test_data.shape[1], test_data.shape[2],1))
federated_government.run_rounds(1, test_data, test_labels)

Accuracy round 0
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1a2772a990>: [13.276649362182617, 0.17487500607967377]
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1a2772a910>: [11.485916297531128, 0.28462499380111694]
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1a2772aa90>: [12.950237517547608, 0.19642500579357147]
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1a2772abd0>: [5.401505676078797, 0.6487749814987183]
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1a2772ad10>: [11.450444860076905, 0.288875013589859]
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1a2772af50>: [7.8874158100128176, 0.5026999711990356]
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1a27739050>: [12.946188985443115, 0.196649

This methodology is the simplest and reports on both the local and the global models. Its drawback is that the local evaluation metrics are biased by the distribution of the test set. That is, in a non-IID scenario (see [Federated Sampling](http://localhost:8888/notebooks/federated_sampling.ipynb)) the performance of the local models is not properly represented, because each client's training data follows a different distribution from the data we test on. For that reason, we propose a second evaluation methodology.
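
The bias described above can be illustrated with a toy example that does not depend on Sherpa.FL. The sketch below assumes a hypothetical client that only saw classes 0 to 2 during training, and replaces the real model with a rule (`toy_local_model`, invented here for illustration) that is perfect on seen classes and always wrong on unseen ones. All names and numbers are assumptions made for the example, not results from this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

num_classes = 10
# Hypothetical client that only saw classes 0, 1 and 2 during training.
client_classes = np.array([0, 1, 2])


def toy_local_model(labels):
    """Toy stand-in for a local model: it returns the true label for classes
    seen during training and always answers class 0 for unseen classes."""
    seen = np.isin(labels, client_classes)
    return np.where(seen, labels, 0)


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return float(np.mean(predictions == labels))


# Global test set: labels drawn uniformly from all ten classes.
global_test_labels = rng.integers(0, num_classes, size=10_000)
# Local test set: labels drawn only from the classes the client holds.
local_test_labels = rng.choice(client_classes, size=10_000)

global_acc = accuracy(toy_local_model(global_test_labels), global_test_labels)
local_acc = accuracy(toy_local_model(local_test_labels), local_test_labels)

print(f"accuracy on the global test set: {global_acc:.3f}")
print(f"accuracy on the local test set:  {local_acc:.3f}")
```

In this toy setting the same model looks poor on the global test set, where roughly three classes out of ten are familiar, and perfect on data that follows its own training distribution. Neither number alone describes the local model well.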

## Evaluation methodology 2: global test dataset and local test datasets

In this evaluation methodology we assume, as in the previous one, that there is a global test dataset and, additionally, that each client has a local test dataset that follows the distribution of its training data. Hence, in each round we show the evaluation metrics of each client on both its local test dataset and the global test dataset. This evaluation methodology is more complete because it shows the performance of the local FL models on both the global and the local data distributions, which gives us more information.
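
As a rough, framework-agnostic sketch of the reporting this methodology implies, the snippet below evaluates each client on the shared global test set and on its own local test set. The `clients` dictionary, the `evaluate_clients` helper, and the rule-based "models" are all hypothetical and chosen only to keep the example self-contained; they are not the Sherpa.FL API.

```python
import numpy as np


def accuracy(predict, data, labels):
    """Fraction of samples for which predict(data) matches the labels."""
    return float(np.mean(predict(data) == labels))


def evaluate_clients(clients, global_test):
    """Return, for every client, its accuracy on the global and local test sets.

    `clients` maps a client name to a tuple (predict, local_data, local_labels).
    This structure is an assumption made for the sketch.
    """
    global_data, global_labels = global_test
    report = {}
    for name, (predict, local_data, local_labels) in clients.items():
        report[name] = {
            "global": accuracy(predict, global_data, global_labels),
            "local": accuracy(predict, local_data, local_labels),
        }
    return report


# Toy data: each sample's feature equals its label, so models are simple rules.
global_test = (np.arange(10), np.arange(10))
clients = {
    # Client A only "knows" classes 0-4 and answers 0 for anything else.
    "client_a": (lambda x: np.where(x < 5, x, 0), np.arange(5), np.arange(5)),
    # Client B only "knows" classes 5-9 and answers 9 for anything else.
    "client_b": (lambda x: np.where(x >= 5, x, 9), np.arange(5, 10), np.arange(5, 10)),
}

report = evaluate_clients(clients, global_test)
for name, metrics in report.items():
    print(name, metrics)
```

Reporting the two numbers side by side makes it possible to tell a client whose model is genuinely weak from one whose model is good on its own data but has simply not seen the classes that dominate the global test set.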