### 1. Creating a `pfl` Dataset Object
A `pfl.data.Dataset` object is the basic representation of a single user's dataset. It can also be used to build a central dataset for evaluation (to be used with `pfl.callback.CentralEvaluationCallback`).

In [20]:
import sklearn.datasets
from pfl.data.dataset import Dataset

# Create synthetic user data
user_id = 'pfl-user-0'
user_features, user_labels = sklearn.datasets.make_classification(n_samples=10, n_features=4)

# Create a pfl Dataset object
user_dataset = Dataset(raw_data=[user_features, user_labels], user_id=user_id)

# Get a batch of the user dataset using Dataset.iter
batch_x, batch_y = next(user_dataset.iter(batch_size=5))
print(f"Batch from `Dataset`: \n"
      f"x shape={batch_x.shape}, x={batch_x}\n"
      f"y shape={batch_y.shape}, y={batch_y}")

Batch from `Dataset`: 
x shape=(5, 4), x=[[ 1.03859466 -0.21300328 -1.22219388 -1.11617411]
 [-1.60839153  0.67721564  0.14233221  2.6666976 ]
 [ 0.51354715 -0.03120292 -0.97783284 -0.35171889]
 [ 0.01241602 -0.36451442  1.80941813 -0.99097949]
 [ 0.09174135 -0.15802651  0.59355521 -0.47458903]]
y shape=(5,), y=[0 0 1 1 1]


#### 1.1 Creating a PyTorch `pfl` Dataset Object
The example above created the `pfl` dataset from NumPy arrays. `pfl` also supports data stored as PyTorch or TensorFlow tensors. This tutorial focuses on PyTorch. The TensorFlow dataset implementation can be found at [tensorflow.py](https://github.com/apple/pfl-research/blob/main/pfl/data/tensorflow.py).

In [21]:
import multiprocessing
import os
os.environ['PFL_PYTORCH_DEVICE'] = 'cpu'

import torch

# Set the multiprocessing start method to "spawn".
# Unlike "fork" and "forkserver", "spawn" is available on every platform, including Windows.
def init_multiprocessing():
    try:
        multiprocessing.set_start_method("spawn", force=True)  # Forces "spawn"
    except RuntimeError:
        pass  # Ignore if it's already set

init_multiprocessing()

from pfl.internal.ops.selector import set_framework_module
from pfl.internal.ops import pytorch_ops, numpy_ops
from pfl.data.pytorch import PyTorchTensorDataset, PyTorchDataDataset

# Tell pfl to use pytorch as the backend ML framework
set_framework_module(pytorch_ops, old_module=numpy_ops)

In [22]:
# Option 1: Create a PyTorch pfl Dataset object based on PyTorch tensors
user_features, user_labels = torch.as_tensor(user_features), torch.as_tensor(user_labels)
user_dataset = PyTorchTensorDataset(tensors=[user_features, user_labels], user_id=user_id)

# Get a batch of the user dataset using Dataset.iter
batch_x, batch_y = next(user_dataset.iter(batch_size=5))
print(f"Batch from `PyTorchTensorDataset`: \n"
      f"x shape={batch_x.shape}, x={batch_x}\n"
      f"y shape={batch_y.shape}, y={batch_y}")

Batch from `PyTorchTensorDataset`: 
x shape=torch.Size([5, 4]), x=tensor([[ 1.0386, -0.2130, -1.2222, -1.1162],
        [-1.6084,  0.6772,  0.1423,  2.6667],
        [ 0.5135, -0.0312, -0.9778, -0.3517],
        [ 0.0124, -0.3645,  1.8094, -0.9910],
        [ 0.0917, -0.1580,  0.5936, -0.4746]], dtype=torch.float64)
y shape=torch.Size([5]), y=tensor([0, 0, 1, 1, 1], dtype=torch.int32)


In [23]:
# Option 2: Create a PyTorch pfl Dataset object based on torch.utils.data.Dataset
pytorch_data = torch.utils.data.TensorDataset(user_features, user_labels)
user_dataset = PyTorchDataDataset(raw_data=pytorch_data, user_id=user_id)

# Get a batch of the user dataset using Dataset.iter
batch_x, batch_y = next(user_dataset.iter(batch_size=5))
print(f"\nBatch from `PyTorchDataDataset`: \n"
      f"x shape={batch_x.shape}, x={batch_x}\n"
      f"y shape={batch_y.shape}, y={batch_y}")


Batch from `PyTorchDataDataset`: 
x shape=torch.Size([5, 4]), x=tensor([[ 1.0386, -0.2130, -1.2222, -1.1162],
        [-1.6084,  0.6772,  0.1423,  2.6667],
        [ 0.5135, -0.0312, -0.9778, -0.3517],
        [ 0.0124, -0.3645,  1.8094, -0.9910],
        [ 0.0917, -0.1580,  0.5936, -0.4746]], dtype=torch.float64)
y shape=torch.Size([5]), y=tensor([0, 0, 1, 1, 1], dtype=torch.int32)


### 2. Creating Federated Dataset with Real User Partition
The previous section showed how to create a dataset for a single user. This section covers how to create a federated dataset, which is a collection of multiple users' datasets. If the original dataset already has a user identifier attribute, as in [StackOverflow](https://www.kaggle.com/datasets/stackoverflow/stackoverflow), [LEAF](https://leaf.cmu.edu) and [FLAIR](https://github.com/apple/ml-flair), we can use the existing user partition to simulate the non-IID characteristics of a real federated learning setting more accurately.

In [24]:
from pfl.data import get_user_sampler, FederatedDataset

# Create a dataset partitioned by user
user_id_to_data = {}
n_users = 20
for i in range(n_users):
    user_id_to_data[i] = sklearn.datasets.make_classification(n_samples=i+1, n_features=4)
user_ids = list(user_id_to_data.keys())

# Create user sampler to sample user uniformly at random
user_sampler = get_user_sampler(sample_type="random", user_ids=user_ids)
federated_dataset = FederatedDataset.from_slices(data=user_id_to_data, user_sampler=user_sampler)

# Iterate over the federated dataset to get real user datasets
print("FederatedDataset with real user partition: ")
for i in range(5):
    user_dataset, seed = next(federated_dataset)
    print(f"\tReal user with user_id={user_dataset.user_id} has size of {len(user_dataset)}.")

FederatedDataset with real user partition: 
	Real user with user_id=12 has size of 13.
	Real user with user_id=8 has size of 9.
	Real user with user_id=14 has size of 15.
	Real user with user_id=12 has size of 13.
	Real user with user_id=4 has size of 5.
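
The `user_id_to_data` mapping above is synthetic. When a dataset ships as a flat table with a user identifier column, a mapping of the same shape first has to be built by grouping records per user. A minimal sketch, assuming hypothetical records of the form `(user_id, features, label)`; the record layout and the helper name `group_by_user` are illustrative and not part of `pfl`:

```python
from collections import defaultdict


def group_by_user(records):
    """Group flat (user_id, features, label) records by user."""
    grouped = defaultdict(lambda: ([], []))
    for user_id, features, label in records:
        features_list, labels_list = grouped[user_id]
        features_list.append(features)
        labels_list.append(label)
    return dict(grouped)


# Hypothetical flat records; a real dataset would be read from disk.
toy_records = [
    ("alice", [0.1, 0.2], 0),
    ("bob", [0.3, 0.4], 1),
    ("alice", [0.5, 0.6], 1),
]
toy_user_data = group_by_user(toy_records)
for uid, (feats, labs) in toy_user_data.items():
    print(f"user_id={uid} has {len(labs)} samples")
```

Each value would likely need to be converted to arrays (for example with `np.asarray`) before being passed to `FederatedDataset.from_slices`.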


#### 2.1 Creating a `PyTorchFederatedDataset` Object
In some scenarios, using `PyTorchFederatedDataset` can speed up data loading.

In [25]:
from pfl.data.pytorch import PyTorchFederatedDataset

class PyTorchDataset(torch.utils.data.Dataset):
    def __init__(self, user_id_to_data):
        self._user_id_to_data = user_id_to_data

    def __getitem__(self, i):
        return [torch.as_tensor(x) for x in self._user_id_to_data[i]]

    def __len__(self):
        return len(self._user_id_to_data)


pytorch_dataset = PyTorchDataset(user_id_to_data)
federated_dataset = PyTorchFederatedDataset(dataset=pytorch_dataset, user_sampler=user_sampler)
# Iterate over the federated dataset to get real user datasets
print("PyTorchFederatedDataset with real user partition: ")
for i in range(5):
    user_dataset, seed = next(federated_dataset)
    print(f"\tReal user {i} has size of {len(user_dataset)}.")

PyTorchFederatedDataset with real user partition: 
	Real user 0 has size of 10.
	Real user 1 has size of 7.
	Real user 2 has size of 6.
	Real user 3 has size of 13.
	Real user 4 has size of 15.


#### 2.2 Creating a `FederatedDataset` with `torch.utils.data.Dataset`
If a user dataset is too large to fit into memory, `pfl` also supports building a federated dataset from `torch.utils.data.Dataset` objects.


In [26]:
def make_dataset_fn(user_id):
    # Create a `torch.utils.data.Dataset` for a single user
    user_data = [torch.as_tensor(data) for data in user_id_to_data[user_id]]
    pytorch_data = torch.utils.data.TensorDataset(*user_data)
    return PyTorchDataDataset(raw_data=pytorch_data, user_id=user_id)

federated_dataset = FederatedDataset(make_dataset_fn=make_dataset_fn, user_sampler=user_sampler)
print("FederatedDataset with torch.utils.data.Dataset and real user partition: ")
for i in range(5):
    user_dataset, seed = next(federated_dataset)
    print(f"\tReal user {i} has size of {len(user_dataset)}.")

FederatedDataset with torch.utils.data.Dataset and real user partition: 
	Real user 0 has size of 2.
	Real user 1 has size of 4.
	Real user 2 has size of 10.
	Real user 3 has size of 12.
	Real user 4 has size of 7.
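
The benefit of a `make_dataset_fn` is that a user's data only has to be materialized when that user is sampled. A minimal sketch of that idea without `pfl`, assuming a hypothetical on-disk layout of one JSON file per user; `load_user` and the file naming are illustrative only:

```python
import json
import os
import tempfile

# Hypothetical on-disk layout: one JSON file per user.
data_dir = tempfile.mkdtemp()
for uid in range(3):
    samples = {
        "features": [[float(uid)] * 4 for _ in range(uid + 1)],
        "labels": [0] * (uid + 1),
    }
    with open(os.path.join(data_dir, f"user_{uid}.json"), "w") as f:
        json.dump(samples, f)

loaded_users = []  # Records which users were actually read from disk.


def load_user(uid):
    """Read a single user's data from disk only when it is requested."""
    loaded_users.append(uid)
    with open(os.path.join(data_dir, f"user_{uid}.json")) as f:
        samples = json.load(f)
    return samples["features"], samples["labels"]


loaded_features, loaded_labels = load_user(2)
print(f"user 2 has {len(loaded_labels)} samples; users read from disk: {loaded_users}")
```

In the cell above, the body of `make_dataset_fn` could call such a loader in place of the in-memory `user_id_to_data` lookup.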


### 3. Creating IID Artificial Federated Dataset
The previous section assumes that the dataset has user IDs, which is not the case for many existing ML datasets (e.g. CIFAR10). For those datasets, `pfl` supports converting them to an Artificial Federated Dataset: no user ID is associated with the data and the user partition is artificial, which is useful for experimenting with existing ML datasets. This section covers how to create an Artificial Federated Dataset assuming the data distribution across users is IID.

In [28]:
import numpy as np
from pfl.data import ArtificialFederatedDataset, get_data_sampler


n_samples = 1000
features, labels = sklearn.datasets.make_classification(n_samples=n_samples, n_features=8, n_informative=4, n_classes=5)
# Create data sampler to sample each artificial user dataset as a random subset of the original dataset
data_sampler = get_data_sampler(sample_type="random", max_bound=n_samples)

# Option 1: Create an artificial federated dataset where each user dataset has constant size of 10
sample_dataset_len = lambda: 10
federated_dataset = ArtificialFederatedDataset.from_slices(
    data=[features, labels], 
    data_sampler=data_sampler,
    sample_dataset_len=sample_dataset_len,
)
# Iterate federated dataset to get artificial user dataset
print("User dataset size is a constant: ")
for i in range(5):
    user_dataset, seed = next(federated_dataset)
    print(f"\tArtificial user {i} has size of {len(user_dataset)}")

User dataset size is a constant: 
	Artificial user 0 has size of 10
	Artificial user 1 has size of 10
	Artificial user 2 has size of 10
	Artificial user 3 has size of 10
	Artificial user 4 has size of 10


In [19]:
# Option 2: Create an artificial federated dataset where each user dataset size follows Poisson distribution
sample_dataset_len = lambda: max(1, np.random.poisson(10))
federated_dataset = ArtificialFederatedDataset.from_slices(
    data=[features, labels], 
    data_sampler=data_sampler,
    sample_dataset_len=sample_dataset_len,
)
# Iterate federated dataset to get artificial user dataset
print("\nUser dataset size follows a Poisson distribution: ")
for i in range(5):
    user_dataset, seed = next(federated_dataset)
    print(f"\tArtificial user {i} has size of {len(user_dataset)}")


User dataset size follows a Poisson distribution: 
	Artificial user 0 has size of 9
	Artificial user 1 has size of 8
	Artificial user 2 has size of 7
	Artificial user 3 has size of 11
	Artificial user 4 has size of 17
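
Conceptually, each artificial user is produced in two steps: `sample_dataset_len` picks a size, then `data_sampler` picks that many indices into the original data. A simplified sketch of that composition in plain Python; it illustrates the idea and is not `pfl`'s implementation. The names `make_random_sampler` and `next_artificial_user` are hypothetical, and drawing indices with replacement is an assumption of this sketch:

```python
import random


def make_random_sampler(max_bound, rng):
    """Return a sampler that draws n indices uniformly from range(max_bound)."""
    def sampler(n):
        # Assumption of this sketch: indices are drawn with replacement.
        return [rng.randrange(max_bound) for _ in range(n)]
    return sampler


def next_artificial_user(data, sampler, sample_len):
    """Build one artificial user from a sampled size and sampled indices."""
    n = sample_len()
    indices = sampler(n)
    return [data[i] for i in indices]


toy_rng = random.Random(0)
toy_data = list(range(1000))  # Stand-in for the original dataset.
toy_sampler = make_random_sampler(max_bound=len(toy_data), rng=toy_rng)
for i in range(3):
    toy_user = next_artificial_user(toy_data, toy_sampler, lambda: 10)
    print(f"Artificial user {i} has size of {len(toy_user)}")
```

Because every user is a fresh draw from the same pool, all users share the same underlying distribution, which is what makes this setup IID.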


### 4. Creating non-IID Artificial Federated Dataset
As mentioned before, in a real federated setting the data distribution has many non-IID characteristics. This section shows how to simulate this artificially for a classification dataset, where the user label distribution is non-IID and follows a Dirichlet distribution (a common assumption in federated learning research).

#### 4.1 Sampling non-IID user dataset dynamically with `ArtificialFederatedDataset`
The first option is to change the `data_sampler` so that the user label distribution follows a Dirichlet distribution. The user dataset is sampled on the fly and there is no fixed user partition, unlike in [Section 4.2](#4.2-Partitioning-the-dataset-into-fixed-artificial-users-with-FederatedDataset) below.

In [20]:
# Create a data sampler so that each artificial user's label distribution follows a Dirichlet distribution.
dirichlet_alpha = [0.1] * 5
data_sampler = get_data_sampler(sample_type="dirichlet", labels=labels, alpha=dirichlet_alpha)

# Create an artificial federated dataset with a Dirichlet data sampler
sample_dataset_len = lambda: 20
federated_dataset = ArtificialFederatedDataset.from_slices(
    data=[features, labels], 
    data_sampler=data_sampler,
    sample_dataset_len=sample_dataset_len,
)
# Iterate federated dataset to get artificial user dataset
print("ArtificialFederatedDataset with Dirichlet label distribution: ")
for i in range(5):
    user_dataset, seed = next(federated_dataset)
    print(f"\tArtificial user {i} has size of {len(user_dataset)} with unique labels={np.unique(user_dataset.raw_data[1])}")

ArtificialFederatedDataset with Dirichlet label distribution: 
	Artificial user 0 has size of 20 with unique labels=[1 4]
	Artificial user 1 has size of 20 with unique labels=[3]
	Artificial user 2 has size of 20 with unique labels=[0]
	Artificial user 3 has size of 20 with unique labels=[2 4]
	Artificial user 4 has size of 20 with unique labels=[0]
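
The users above contain only one or two labels because a Dirichlet distribution with a small `alpha` puts almost all probability mass on a few classes. A minimal standard-library sketch of drawing one user's class proportions (normalizing independent Gamma draws is a standard way to sample from a Dirichlet); `sample_dirichlet` is a hypothetical helper, not a `pfl` function:

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw class proportions from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


toy_rng = random.Random(0)
for alpha_value in (0.1, 100.0):
    proportions = sample_dirichlet([alpha_value] * 5, toy_rng)
    rounded = [round(p, 3) for p in proportions]
    print(f"alpha={alpha_value}: class proportions={rounded}")
```

With `alpha=0.1` most of the mass typically falls on one or two classes, while a large `alpha` gives proportions close to uniform, which approaches the IID setting of the previous section.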


#### 4.2 Partitioning the dataset into fixed artificial users with `FederatedDataset`
An alternative is to partition the original dataset into a fixed set of artificial users where each user's label distribution follows a Dirichlet distribution. The difference is that the user partitions are constructed once using all data and remain the same throughout training.

In [22]:
# Create a federated dataset with fixed user partition
# alpha=0.01 is an extreme case where each user likely has data from only 1 class
federated_dataset = FederatedDataset.from_slices_with_dirichlet_class_distribution(
    data=[features, labels],
    labels=labels,
    alpha=0.01,
    user_dataset_len_sampler=sample_dataset_len)

# Iterate federated dataset to get artificial user dataset
print("FederatedDataset with Dirichlet label distribution: ")
for i in range(5):
    user_dataset, seed = next(federated_dataset)
    print(f"\tArtificial user {i} has size of {len(user_dataset)} with unique labels={np.unique(user_dataset.raw_data[1])}")

FederatedDataset with Dirichlet label distribution: 
	Artificial user 0 has size of 20 with unique labels=[0 3]
	Artificial user 1 has size of 20 with unique labels=[1]
	Artificial user 2 has size of 20 with unique labels=[3]
	Artificial user 3 has size of 20 with unique labels=[2]
	Artificial user 4 has size of 20 with unique labels=[2 4]
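
What distinguishes a fixed partition is that the assignment of samples to users is computed once. A simplified sketch of such a partition in plain Python, assuming the partition is disjoint and that each user's class preferences come from a symmetric Dirichlet draw; this illustrates the idea and is not `pfl`'s algorithm, and `partition_fixed_users` is a hypothetical name:

```python
import random


def partition_fixed_users(labels, alpha, user_len, rng):
    """Split sample indices into disjoint users with Dirichlet class mixes."""
    pools = {}  # class label -> indices not yet assigned to any user
    for index, label in enumerate(labels):
        pools.setdefault(label, []).append(index)
    classes = sorted(pools)
    for c in classes:
        rng.shuffle(pools[c])
    users = []
    while sum(len(pool) for pool in pools.values()) >= user_len:
        # Unnormalized Gamma draws are proportional to one Dirichlet draw.
        weights = {c: rng.gammavariate(alpha, 1.0) for c in classes}
        user = []
        for _ in range(user_len):
            available = [c for c in classes if pools[c]]
            w = [weights[c] for c in available]
            if sum(w) <= 0.0:  # Fall back to uniform if all weights vanish.
                w = [1.0] * len(available)
            chosen = rng.choices(available, weights=w)[0]
            user.append(pools[chosen].pop())
        users.append(user)
    return users


toy_rng = random.Random(0)
toy_labels = [i % 5 for i in range(1000)]  # 5 balanced classes
toy_users = partition_fixed_users(toy_labels, alpha=0.1, user_len=20, rng=toy_rng)
assigned = [index for user in toy_users for index in user]
print(f"{len(toy_users)} users, {len(assigned)} samples assigned, "
      f"{len(set(assigned))} distinct")
```

Because indices are removed from the pools as they are assigned, no sample appears in two users, and later users fall back to whichever classes still have data left.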


# Coupling CIFAR10 data to Federated Dataset

## Debugging print statements

In [2]:
import torchvision
print(torchvision.__version__)
import sys
print(sys.executable)


0.21.0+cpu
c:\Users\AVN\anaconda3\envs\masters\python.exe


In [None]:
from pfl.data import ArtificialFederatedDataset, get_data_sampler
import sys
sys.path.append('..')
from centralized import load_data
train_set, test_set = load_data()

In [9]:
# Access the 'dataset' attribute of train_set to get the proper count of total samples
n_samples = len(train_set.dataset)

# Just 'len(train_set)' would only give you the total number of batches

In [None]:
# Split the training dataloader into features and labels
all_features = []
all_labels = []

for features, labels in train_set:
    all_features.append(features)
    all_labels.append(labels)

# Concatenate the per-batch tensors into single tensors
all_features = torch.cat(all_features, dim=0)
all_labels = torch.cat(all_labels, dim=0)

print(all_features.shape, all_labels.shape)


torch.Size([50000, 3, 32, 32]) torch.Size([50000])


In [38]:
# Create data sampler to sample each artificial user dataset as a random subset of the original dataset
data_sampler = get_data_sampler(sample_type="random", max_bound=n_samples)

# Option 1: Create an artificial federated dataset where each user dataset has constant size of 10
sample_dataset_len = lambda: 10
federated_dataset = ArtificialFederatedDataset.from_slices(
    data=[all_features, all_labels], 
    data_sampler=data_sampler,
    sample_dataset_len=sample_dataset_len,
)
# Iterate federated dataset to get artificial user dataset
print("User dataset size is a constant: ")
for i in range(5):
    user_dataset, seed = next(federated_dataset)
    print(f"\tArtificial user {i} has size of {len(user_dataset)}")

User dataset size is a constant: 
	Artificial user 0 has size of 10
	Artificial user 1 has size of 10
	Artificial user 2 has size of 10
	Artificial user 3 has size of 10
	Artificial user 4 has size of 10
