# How to federate datasets using FLEXible

In this notebook, we show a few of the many ways in which FLEXible can federate a centralized dataset. We will use the MNIST and CIFAR10 datasets.

First, we download MNIST and split it into train and test sets:

In [None]:
import tensorflow_datasets as tfds

ds_train, ds_test = tfds.load(
    'mnist',
    split=['train', 'test'],
    shuffle_files=True,
    as_supervised=True,
    batch_size=-1,
)

To use our tools, we need to encapsulate the dataset in a `Dataset` object.

Note that the features and labels (`train_X` and `train_y`) are assumed to be NumPy arrays, and `train_y` must be a one-dimensional NumPy array.

In [None]:
from flex.data import Dataset

train_dataset = Dataset.from_tfds_image_dataset(ds_train)
test_dataset = Dataset.from_tfds_image_dataset(ds_test)

To federate a centralized dataset, we must describe the federation process in a `FedDatasetConfig` object.

The table below shows which options of a `FedDatasetConfig` object can be combined (Y) and which cannot (N); each field is described after the table.

| Options compatibility   | **n_clients** | **client_names** | **weights** | **weights_per_class** | **replacement** | **classes_per_client** | **features_per_client** | **indexes_per_client** | **group_by_label_index** |
|-------------------------|---------------|------------------|-------------|-----------------------|-----------------|------------------------|-------------------------|------------------------|----------------------|
| **n_clients**           | -             | Y                | Y           | Y                     | Y               | Y                      | Y                       | N                      | N                    |
| **client_names**        | -             | -                | Y           | Y                     | Y               | Y                      | Y                       | Y                      | N                    |
| **weights**             | -             | -                | -           | N                     | Y               | Y                      | Y                       | N                      | N                    |
| **weights_per_class**   | -             | -                | -           | -                     | Y               | N                      | N                       | N                      | N                    |
| **replacement**         | -             | -                | -           | -                     | -               | Y                      | N                       | N                      | N                    |
| **classes_per_client**  | -             | -                | -           | -                     | -               | -                      | N                       | N                      | N                    |
| **features_per_client** | -             | -                | -           | -                     | -               | -                      | -                       | N                      | N                    |
| **indexes_per_client**  | -             | -                | -           | -                     | -               | -                      | -                       | -                      | N                    |
| **group_by_label_index**    | -             | -                | -           | -                     | -               | -                      | -                       | -                      | -                    |

- seed: Optional[int]
        Seed used to make the federated dataset generated with this configuration reproducible. Default None.
- n_clients: int
        Number of clients among which to split the centralized dataset. Default 2.
- client_names: Optional[List[Hashable]]
        Names to identify each client; if not provided, clients are indexed using integers. If n_clients is also
        given, we consider up to n_clients elements. Default None.
- weights: Optional[npt.NDArray]
        A numpy.array which provides the proportion of data to give to each client. Default None.
- weights_per_class: Optional[npt.NDArray]
        A numpy.array which provides the proportion of data to give to each client and class of the dataset to federate.
        We expect a bidimensional array of shape (n, m) where "n" is the number of clients and "m" is the number of classes of
        the dataset to federate. Default None.
- replacement: bool
        Whether the sampling procedure used to split a centralized dataset is with replacement or not. Default False.
- classes_per_client: Optional[Union[int, npt.NDArray, Tuple[int]]]
        Classes to assign to each client. If provided as an int, it is the number of classes per client. If provided as a
        tuple of ints, it establishes a minimum and a maximum number of classes per client, and a random number sampled
        in that interval decides the number of classes of each client. If provided as a list of lists, it establishes the classes
        assigned to each client. Default None.
- features_per_client: Optional[Union[int, npt.NDArray, Tuple[int]]]
        Features to assign to each client; it shares the same interface as classes_per_client. Default None.
- indexes_per_client: Optional[npt.NDArray]
        Data indexes to assign to each client. Default None.
- group_by_label_index: Optional[int]
        Index of the feature whose unique values will be used to generate federated clients. Default None.
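
To build some intuition for the `weights` field, the following is a minimal NumPy sketch of a weight-based split without replacement. It is only an illustration under our own assumptions, not FLEXible's implementation: `split_by_weights` is a hypothetical helper, and the rounding rule it uses is an arbitrary choice.

```python
import numpy as np

def split_by_weights(n_samples, weights, seed=0):
    """Hypothetical helper: partition sample indexes among clients by weight."""
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=float)
    shuffled = rng.permutation(n_samples)  # every index appears exactly once
    # Cumulative cut points; rounding is an assumption made for this sketch.
    cuts = np.round(np.cumsum(weights) * n_samples).astype(int)
    # The last cut only marks the end of the array, so it is not needed.
    return np.split(shuffled, cuts[:-1])

parts = split_by_weights(n_samples=100, weights=[0.5, 0.3, 0.2])
sizes = [len(p) for p in parts]
print(sizes)  # [50, 30, 20]
```

Because the indexes come from a single permutation, no two clients share a sample, which mirrors the behaviour described for `replacement=False`.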


Let's implement the following description:

We have 10 federated clients that do not share any instances. Each client holds data from a single class and receives 20% of the data available for that class.

In [None]:
from flex.data import FedDatasetConfig
import numpy as np

config = FedDatasetConfig(seed = 0) # We fix a seed to make our federation reproducible
config.n_clients = 10 # 10 clients
config.replacement = False # ensure that clients do not share any data
config.classes_per_client = [[i] for i in np.unique(train_dataset.y_data)] # assign each client one class
config.weights = [0.2] * config.n_clients # each client has only 20% of its assigned class

In [None]:
np.unique(train_dataset.y_data)
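
As a rough mental model of the configuration above, the sketch below reproduces the same idea on toy labels with plain NumPy: each client is tied to one class and draws 20% of that class without replacement. The toy labels and the variable names are assumptions made for illustration; FLEXible's internal sampling may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(10), 50)  # toy labels: 10 classes, 50 samples each

fraction = 0.2  # share of its class that each client receives
client_indexes = {}
for client_id, label in enumerate(np.unique(labels)):
    class_indexes = np.flatnonzero(labels == label)  # positions of this class
    n_keep = int(round(fraction * len(class_indexes)))
    # Sampling without replacement inside a single class keeps clients disjoint.
    client_indexes[client_id] = rng.choice(class_indexes, size=n_keep, replace=False)

for client_id, idx in client_indexes.items():
    print(client_id, np.unique(labels[idx]), len(idx))
```

Each of the 10 toy clients ends up with 10 samples, all from a single class.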

We apply the generated `FedDatasetConfig` to a `Dataset`, which encapsulates the centralized dataset.

In [None]:
from flex.data import FedDataDistribution

federated_dataset = FedDataDistribution.from_config(centralized_data=train_dataset, config=config)

Show the federated data, to confirm that the federated split is correct:

In [None]:
import matplotlib.pyplot as plt

for client in federated_dataset:
    print(f"Node {client} has class {np.unique(federated_dataset[client].y_data)} and {len(federated_dataset[client])} elements, a sample of them is:")
    #pyplot.figure(figsize = (1,10))
    fig, ax = plt.subplots(1, 10) # rows, cols
    for i ,(x, y) in enumerate(federated_dataset[client]):
        ax[i].axis('off')
        ax[i].imshow(x, cmap=plt.get_cmap('gray'))
        if i >= 9:
            break
    plt.show()

# Federate a dataset using weights to distribute data following a certain distribution

Next, we try a more specialized configuration: we want to federate the dataset so that the amount of data per client follows a Gaussian distribution. To do so, we sample the weights from a normal distribution:

In [None]:
import matplotlib.pyplot as plt

n_clients = 500
mu, sigma = 100, 1  # mean and standard deviation
normal_weights = np.random.default_rng(seed=0).normal(mu, sigma, n_clients)  # sample random numbers
normal_weights = np.clip(normal_weights, a_min=0, a_max=np.inf)  # remove negative values
normal_weights = normal_weights / sum(normal_weights) # normalize to sum 1

plt.hist(normal_weights, bins=15)
plt.title('Histogram of normal weights')
plt.show()

In [None]:
config = FedDatasetConfig(seed=0, 
                            n_clients=n_clients,
                            replacement=False,
                            weights=normal_weights
                        )

normal_federated_dataset = FedDataDistribution.from_config(centralized_data=train_dataset, config=config)

Plot histogram of data per client:

In [None]:
datasizes_per_client = [len(normal_federated_dataset[client]) for client in normal_federated_dataset]
n, bins, patches = plt.hist(datasizes_per_client)
plt.xlabel('Data size per client')
plt.ylabel('Number of clients')
plt.title('Histogram of data sizes per client')
plt.show()
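
To relate the weights to the sizes shown in the histogram, the sketch below computes the expected number of samples per client as weight times dataset size. It assumes the 60,000 images of the MNIST training split and ignores whatever rounding FLEXible applies, so it is an approximation rather than the library's exact result.

```python
import numpy as np

n_samples = 60_000  # size of the MNIST training split
n_clients = 500
rng = np.random.default_rng(seed=0)

# Same recipe as above: sample, clip negative values, normalize to sum 1.
weights = np.clip(rng.normal(100, 1, n_clients), 0, None)
weights = weights / weights.sum()

expected_sizes = weights * n_samples
print(expected_sizes.mean())                         # average client size
print(expected_sizes.std() / expected_sizes.mean())  # relative spread
```

With a mean of 100 and a standard deviation of 1, the relative spread is only about 1%, so the clients end up with very similar sizes around 120 samples each. A larger `sigma` relative to `mu` would give a visibly wider histogram.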

# A more complex dataset federation

Now, let's federate CIFAR10 so that it fits the following description from [Personalized Federated Learning using Hypernetworks](https://paperswithcode.com/paper/personalized-federated-learning-using):

First, we sample two/ten classes for each client for CIFAR10/CIFAR100; Next, for each client i and selected class c, we sample $\alpha_{i,c} \sim U(.4, .6)$, and assign it with $\frac{\alpha_{i,c}}{\sum_j \alpha_{j,c}}$ of the samples for this class. We repeat the above using 10, 50 and 100 clients. This procedure produces clients with different number of samples and classes.
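
The quoted procedure can be sketched with plain NumPy for the CIFAR10 case (two classes per client, 10 clients). This is our own reading of the description, not code from the paper or from FLEXible; the variable names are illustrative, and classes that no client selects are simply given a proportion of 0.

```python
import numpy as np

rng = np.random.default_rng(0)
n_clients, n_classes, classes_per_client = 10, 10, 2

# Step 1: each client draws two distinct classes.
selected = np.zeros((n_clients, n_classes), dtype=bool)
for i in range(n_clients):
    chosen = rng.choice(n_classes, size=classes_per_client, replace=False)
    selected[i, chosen] = True

# Step 2: alpha_{i,c} ~ U(0.4, 0.6) for selected pairs, 0 elsewhere.
alphas = rng.uniform(0.4, 0.6, size=(n_clients, n_classes)) * selected

# Step 3: divide by the per-class sum over clients, alpha_{i,c} / sum_j alpha_{j,c}.
col_sums = alphas.sum(axis=0, keepdims=True)
proportions = np.divide(
    alphas, col_sums, out=np.zeros_like(alphas), where=col_sums > 0
)

print(np.round(proportions, 2))
```

Each row of `proportions` has exactly two non-zero entries, and every class column that was selected by at least one client sums to 1.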

1) We download the CIFAR10 dataset using torchvision and create a `Dataset` from it using ``from_torchvision_dataset``. Note that it is mandatory to provide at least the ``ToTensor`` transform.

In [None]:
from torchvision import datasets, transforms
from flex.data import Dataset

cifar10 = datasets.CIFAR10(
        root=".",
        train=True,
        download=True,
        transform=None
)
cifar10_dataset = Dataset.from_torchvision_dataset(cifar10)

2) Create a ``FedDatasetConfig`` that fits the description given above.

In [None]:
from flex.data import FedDatasetConfig
import numpy as np


config = FedDatasetConfig(seed=0)
config.replacement = True # it is not clear whether clients share their data or not
config.n_clients = 10
num_classes = 10

# Assign a sample proportion to each client-class pair (seeded for reproducibility)
alphas = np.random.default_rng(seed=0).uniform(0.4, 0.6, [config.n_clients, num_classes])
alphas = alphas / np.sum(alphas, axis=0)  # normalize so that each class column sums to 1
config.weights_per_class = alphas

3) Create the federated dataset by applying the created ``FedDatasetConfig`` to a ``Dataset`` using ``FedDataDistribution.from_config``

In [None]:
from flex.data import FedDataDistribution

personalized_cifar_dataset = FedDataDistribution.from_config(centralized_data=cifar10_dataset, config=config)

4) (Optional) Check that the data is federated as expected

Below, we show that, for each client and class, the actual proportion of data is roughly the same as the expected one:

In [None]:
unique_classes = np.sort(np.unique(cifar10_dataset.y_data))

for i in range(config.n_clients):
    client_key = i # Autogenerated keys are the integers 0..n_clients-1
    for j, cifar_class in enumerate(unique_classes):
        indexes = personalized_cifar_dataset[client_key].y_data.to_numpy() == cifar_class
        number_of_elements_per_class = len(personalized_cifar_dataset[client_key].X_data[indexes])
        if number_of_elements_per_class != 0:
            total_elements_per_class = np.sum(cifar10_dataset.y_data.to_numpy() == cifar_class)
            print(client_key, f"class {cifar_class}: actual proportion", number_of_elements_per_class/total_elements_per_class, "vs. expected proportion:", config.weights_per_class[i][j])

If we want, we can easily transform the dataset of each client using the `apply` function from `FedDataset`. For example, we force each client to keep only even labels:

In [None]:
import numpy as np

def keep_given_labels(client_dataset: Dataset, selected_labels=None):
    """Keep only the samples whose label is in selected_labels."""
    if not selected_labels:
        selected_labels = []
    X, y = client_dataset.to_numpy()
    X_data = X[np.isin(y, selected_labels)]
    y_data = y[np.isin(y, selected_labels)]
    return Dataset.from_numpy(X_data, y_data)

randomly_transformed_federated_dataset = personalized_cifar_dataset.apply(func=keep_given_labels,  # function to apply
                                                num_proc=1,
                                                selected_labels=[0, 2, 4, 6, 8] # argument for function
                                                )

for i, client in enumerate(randomly_transformed_federated_dataset):
    print(f"Client {client} has classes {np.unique(randomly_transformed_federated_dataset[client].y_data)} and {len(randomly_transformed_federated_dataset[client])} elements, a sample of them is:")
    fig, ax = plt.subplots(1, 10) # rows, cols
    for j ,(x, y) in enumerate(randomly_transformed_federated_dataset[client]):
        ax[j].axis('off')
        ax[j].imshow(x, cmap=plt.get_cmap('gray'))
        if j >= 9:
            break
    plt.show()  # show the figure before deciding whether to stop
    if i >= 10:
        break
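
The filtering inside `keep_given_labels` relies on `np.isin`, which builds a boolean mask over the labels. A minimal sketch on toy arrays (the arrays below are made up for illustration) shows the effect of that mask:

```python
import numpy as np

X = np.arange(12).reshape(6, 2)   # six toy samples with two features each
y = np.array([0, 1, 2, 3, 4, 5])  # one label per sample

mask = np.isin(y, [0, 2, 4])      # True where the label is in the kept set
X_kept, y_kept = X[mask], y[mask]

print(y_kept)        # [0 2 4]
print(X_kept.shape)  # (3, 2)
```

The same mask indexes both the features and the labels, so the two stay aligned after filtering.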

# Federate a dataset from torchtext, torchvision, tensorflow or huggingface

The main deep learning frameworks ship many datasets that are easy to use within their own ecosystem. It is important to let users work with these datasets, so FLEXible lets them load their favorite datasets and federate them.

We show multiple examples on how to load and federate the datasets.

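For every framework we support, there are two possible ways to load a dataset: build a `Dataset` first and federate it with `FedDataDistribution.from_config`, or federate the framework's dataset directly with the matching `FedDataDistribution.from_config_with_...` shortcut. We start with the common imports: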

In [None]:
from flex.data import Dataset
from flex.data import FedDataDistribution
from flex.data import FedDatasetConfig

## Hugging Face

In [None]:
# HuggingFace
from datasets import load_dataset

# Load the training split of AG News as a Hugging Face dataset
dataset_hf = load_dataset('ag_news', split='train')

One way:

In [None]:
fcd_hf = Dataset.from_huggingface_dataset(
    dataset_hf, X_columns="text", label_columns="label"
)

# Create a config and federate the dataset
config_hf = FedDatasetConfig(
    seed=0,
    n_clients=2,
    replacement=False
)


flex_dataset_two_step_hf = FedDataDistribution.from_config(
    centralized_data=fcd_hf, config=config_hf
)

print(f"Two-step federation, a data sample from client 0: {flex_dataset_two_step_hf[0].X_data[0]}")

Or another (shortcut):

In [None]:
# Federate the dataset directly, only using a config.
flex_dataset_hf = FedDataDistribution.from_config_with_huggingface_dataset(
    dataset_hf, config_hf, "text", "label"
)

print(f"Direct federation, a data sample from client 0: {flex_dataset_hf[0].X_data[0]}")

## TensorFlow Datasets

In [None]:
import tensorflow_datasets as tfds

ds_train, ds_test = tfds.load(
    'mnist',
    split=['train', 'test'],
    shuffle_files=True,
    as_supervised=True,
    batch_size=-1,
)

One way:

In [None]:
fcd_tf = Dataset.from_tfds_image_dataset(ds_train)

config_tf = FedDatasetConfig(
    seed=0,
    n_clients=2,
    replacement=False
)


# Federate the Dataset we just created
flex_dataset_two_step_tf = FedDataDistribution.from_config(
    centralized_data=fcd_tf,
    config=config_tf
)

sample = flex_dataset_two_step_tf[0].X_data[0]
import matplotlib.pyplot as plt
plt.imshow(sample, cmap=plt.get_cmap('gray'))

Or another (shortcut):

In [None]:
# Federate the dataset directly
flex_dataset_tf = FedDataDistribution.from_config_with_tfds_image_dataset(
    ds_train,
    config_tf
)

sample = flex_dataset_tf[0].X_data[0]
import matplotlib.pyplot as plt
plt.imshow(sample, cmap=plt.get_cmap('gray'))

## PyTorch torchvision dataset

In [None]:
from torchvision import datasets, transforms

cifar10 = datasets.CIFAR10(
        root=".",
        train=True,
        download=True,
        transform=None
)

One way:

In [None]:
fcd_torch = Dataset.from_torchvision_dataset(cifar10)

config_torch = FedDatasetConfig(
    seed=0,
    n_clients=2,
    replacement=False
)

# Federate the Dataset we just created
flex_dataset_two_step_torch = FedDataDistribution.from_config(
    centralized_data=fcd_torch,
    config=config_torch
)

sample = flex_dataset_two_step_torch[0].X_data[0]
import matplotlib.pyplot as plt
plt.imshow(sample, cmap=plt.get_cmap('gray'))

Or another (shortcut):

In [None]:
# Federate the dataset directly
flex_dataset_torch = FedDataDistribution.from_config_with_torchvision_dataset(
    cifar10,
    config_torch
)

sample = flex_dataset_torch[0].X_data[0]
import matplotlib.pyplot as plt
plt.imshow(sample, cmap=plt.get_cmap('gray'))

## PyTorch torchtext dataset

In [None]:
from torchtext.datasets import AG_NEWS

torch_dataset = AG_NEWS(split='train')

One way:

In [None]:
fcd_torch = Dataset.from_torchtext_dataset(torch_dataset)

# We will use the same configuration as in the Hugging Face example
config_torch = FedDatasetConfig(
    seed=0,
    n_clients=2,
    replacement=False
)

# Federate the Dataset we just created
flex_dataset_two_step_torch = FedDataDistribution.from_config(
    centralized_data=fcd_torch,
    config=config_torch
)

print(f"Flex dataset two steps: {flex_dataset_two_step_torch[0].X_data[0]}")

Or another (shortcut):

In [None]:
# Federate the dataset directly
flex_dataset_torch = FedDataDistribution.from_config_with_torchtext_dataset(
    torch_dataset,
    config_torch
)
print(f"Flex dataset direct: {flex_dataset_torch[0].X_data[0]}")

### END
Congratulations, now you know how to federate a dataset using the *FedDataDistribution* and *FedDatasetConfig* classes, so you can set up multiple experimental settings that best fit your hypotheses.