### 1. Creating a `pfl` Dataset Object
A `pfl.data.Dataset` object is the basic representation of a single user's dataset. It can also be used to build a central dataset for evaluation (for use with `pfl.callback.CentralEvaluationCallback`).

In [None]:
import sklearn.datasets
from pfl.data.dataset import Dataset

# Create synthetic user data
user_id = 'pfl-user-0'
user_features, user_labels = sklearn.datasets.make_classification(n_samples=10, n_features=4)

# Create a pfl Dataset object
user_dataset = Dataset(raw_data=[user_features, user_labels], user_id=user_id)

# Get a batch of the user dataset using Dataset.iter
batch_x, batch_y = next(user_dataset.iter(batch_size=5))
print(f"Batch from `Dataset`: \n"
      f"x shape={batch_x.shape}, x={batch_x}\n"
      f"y shape={batch_y.shape}, y={batch_y}")

[[-1.26082601 -1.12266244 -1.78670903 -1.67631867]
 [ 0.36388835  0.08802618  0.44563335  0.37226589]
 [ 1.64369741 -0.67493593  1.69465502  1.17460049]
 [ 0.03360219  0.71709835  0.25154238  0.36946786]
 [-1.20314452 -0.84598324 -1.6381041  -1.49313281]
 [ 1.4854961  -1.81824661  1.17298631  0.49046328]
 [-0.23414879  0.30232747 -0.18022215 -0.06987397]
 [-2.68763717  2.32659373 -2.40802671 -1.34256486]
 [-0.38952767  0.76608295 -0.22172902  0.00812763]
 [-1.05540026 -1.35143849 -1.61777354 -1.59778114]]
Batch from `Dataset`: 
x shape=(5, 4), x=[[-1.26082601 -1.12266244 -1.78670903 -1.67631867]
 [ 0.36388835  0.08802618  0.44563335  0.37226589]
 [ 1.64369741 -0.67493593  1.69465502  1.17460049]
 [ 0.03360219  0.71709835  0.25154238  0.36946786]
 [-1.20314452 -0.84598324 -1.6381041  -1.49313281]]
y shape=(5,), y=[0 1 1 1 0]
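
In the output above, the first batch is the first five rows of the user data, in order. The snippet below is a minimal pure-Python sketch of that kind of sequential batching. It is not the `pfl` implementation: `iter_batches`, `toy_features` and `toy_labels` are hypothetical names introduced only for illustration, and keeping a smaller final batch is an assumption of this sketch.

```python
from typing import Iterator, List, Sequence, Tuple


def iter_batches(features: Sequence, labels: Sequence,
                 batch_size: int) -> Iterator[Tuple[List, List]]:
    """Yield aligned (features, labels) slices of at most batch_size rows."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    for start in range(0, len(features), batch_size):
        stop = start + batch_size
        # Slicing past the end is safe, so the last batch may be smaller.
        yield list(features[start:stop]), list(labels[start:stop])


# Hypothetical toy data: 10 rows with 4 features each, mirroring the shapes above.
toy_features = [[float(row)] * 4 for row in range(10)]
toy_labels = [row % 2 for row in range(10)]

batches = list(iter_batches(toy_features, toy_labels, batch_size=5))
first_x, first_y = batches[0]
print(f"number of batches={len(batches)}, "
      f"first batch rows={len(first_x)}, first batch labels={first_y}")
```

With 10 rows and `batch_size=5` the sketch yields two full batches; with a batch size that does not divide the row count, the last batch simply holds the remaining rows.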


#### 1.1 Creating a PyTorch `pfl` Dataset Object
The example above created a `pfl` dataset from NumPy arrays. `pfl` also supports data stored as PyTorch or TensorFlow tensors. This tutorial focuses on PyTorch; the TensorFlow dataset implementation can be found at [tensorflow.py](https://github.com/apple/pfl-research/blob/main/pfl/data/tensorflow.py).

In [None]:
import multiprocessing
import os
os.environ['PFL_PYTORCH_DEVICE'] = 'cpu'

import torch

# Set multiprocessing start method to "spawn" instead of forkserver (which is the default)
# That is because forkserver does not work on Windows, but spawn does.
def init_multiprocessing():
    try:
        multiprocessing.set_start_method("spawn", force=True)  # Forces "spawn"
    except RuntimeError:
        pass  # Ignore if it's already set

init_multiprocessing()

from pfl.internal.ops.selector import set_framework_module
from pfl.internal.ops import pytorch_ops, numpy_ops
from pfl.data.pytorch import PyTorchTensorDataset, PyTorchDataDataset

# Tell pfl to use pytorch as the backend ML framework
set_framework_module(pytorch_ops, old_module=numpy_ops)



In [8]:
# Option 1: Create a PyTorch pfl Dataset object based on PyTorch tensors
user_features, user_labels = torch.as_tensor(user_features), torch.as_tensor(user_labels)
user_dataset = PyTorchTensorDataset(tensors=[user_features, user_labels], user_id=user_id)

# Get a batch of the user dataset using Dataset.iter
batch_x, batch_y = next(user_dataset.iter(batch_size=5))
print(f"Batch from `PyTorchTensorDataset`: \n"
      f"x shape={batch_x.shape}, x={batch_x}\n"
      f"y shape={batch_y.shape}, y={batch_y}")

Batch from `PyTorchTensorDataset`: 
x shape=torch.Size([5, 4]), x=tensor([[-1.2608, -1.1227, -1.7867, -1.6763],
        [ 0.3639,  0.0880,  0.4456,  0.3723],
        [ 1.6437, -0.6749,  1.6947,  1.1746],
        [ 0.0336,  0.7171,  0.2515,  0.3695],
        [-1.2031, -0.8460, -1.6381, -1.4931]], dtype=torch.float64)
y shape=torch.Size([5]), y=tensor([0, 1, 1, 1, 0], dtype=torch.int32)


In [9]:
# Option 2: Create a PyTorch pfl Dataset object based on torch.utils.data.Dataset
pytorch_data = torch.utils.data.TensorDataset(user_features, user_labels)
user_dataset = PyTorchDataDataset(raw_data=pytorch_data, user_id=user_id)

# Get a batch of the user dataset using Dataset.iter
batch_x, batch_y = next(user_dataset.iter(batch_size=5))
print(f"\nBatch from `PyTorchDataDataset`: \n"
      f"x shape={batch_x.shape}, x={batch_x}\n"
      f"y shape={batch_y.shape}, y={batch_y}")


Batch from `PyTorchDataDataset`: 
x shape=torch.Size([5, 4]), x=tensor([[-1.2608, -1.1227, -1.7867, -1.6763],
        [ 0.3639,  0.0880,  0.4456,  0.3723],
        [ 1.6437, -0.6749,  1.6947,  1.1746],
        [ 0.0336,  0.7171,  0.2515,  0.3695],
        [-1.2031, -0.8460, -1.6381, -1.4931]], dtype=torch.float64)
y shape=torch.Size([5]), y=tensor([0, 1, 1, 1, 0], dtype=torch.int32)


### 2. Creating Federated Dataset with Real User Partition
The previous section showed how to create a single user's dataset. This section covers how to create a federated dataset, which combines the datasets of multiple users. If the original dataset already has a user identifier attribute, as in [StackOverflow](https://www.kaggle.com/datasets/stackoverflow/stackoverflow), [LEAF](https://leaf.cmu.edu) and [FLAIR](https://github.com/apple/ml-flair), we can use the existing user partition to simulate the non-IID characteristics of a real federated learning setting more accurately.
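
When the raw data is a flat list of records that each carry a user identifier, the first step is usually to group the records by that identifier. The snippet below is a minimal sketch of that grouping in plain Python. The `records` list, the `group_by_user` helper and the `partition` mapping are hypothetical and only illustrate the idea; real datasets would typically store the features and labels as arrays or tensors rather than lists.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical flat records of the form (user_id, features, label).
records = [
    ("alice", [0.1, 0.2], 0),
    ("bob", [0.3, 0.4], 1),
    ("alice", [0.5, 0.6], 1),
    ("carol", [0.7, 0.8], 0),
    ("bob", [0.9, 1.0], 0),
]


def group_by_user(
    rows: Iterable[Tuple[str, List[float], int]]
) -> Dict[str, Tuple[List[List[float]], List[int]]]:
    """Map each user id to a (features, labels) pair built from its records."""
    grouped = defaultdict(lambda: ([], []))
    for user_id, features, label in rows:
        user_features, user_labels = grouped[user_id]
        user_features.append(features)
        user_labels.append(label)
    return dict(grouped)


partition = group_by_user(records)
for user_id, (features, labels) in partition.items():
    print(f"user_id={user_id} has {len(labels)} samples")
```

Each value in `partition` is a `(features, labels)` pair for one user, which is the same per-user shape that `sklearn.datasets.make_classification` returns.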

In [10]:
from pfl.data import get_user_sampler, FederatedDataset

# Create a dataset partitioned by user
user_id_to_data = {}
n_users = 20
for i in range(n_users):
    user_id_to_data[i] = sklearn.datasets.make_classification(n_samples=i+1, n_features=4)
user_ids = list(user_id_to_data.keys())

# Create user sampler to sample user uniformly at random
user_sampler = get_user_sampler(sample_type="random", user_ids=user_ids)
federated_dataset = FederatedDataset.from_slices(data=user_id_to_data, user_sampler=user_sampler)

# Iterate over the federated dataset to get real user datasets
print("FederatedDataset with real user partition: ")
for i in range(5):
    user_dataset, seed = next(federated_dataset)
    print(f"\tReal user with user_id={user_dataset.user_id} has size of {len(user_dataset)}.")

FederatedDataset with real user partition: 
	Real user with user_id=11 has size of 12.
	Real user with user_id=12 has size of 13.
	Real user with user_id=15 has size of 16.
	Real user with user_id=17 has size of 18.
	Real user with user_id=12 has size of 13.
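
In the output above, user 12 is drawn twice, which is what one would expect when users are sampled uniformly at random with replacement. The snippet below is a minimal sketch of that sampling behavior using only the standard library. It does not reproduce how `pfl` implements its samplers: `sample_user`, `user_sizes` and the fixed seed are assumptions made for illustration.

```python
import random
from collections import Counter

# Hypothetical partition: user i owns i + 1 samples, as in the cell above.
user_sizes = {user_id: user_id + 1 for user_id in range(20)}
user_ids = list(user_sizes)

# A seeded generator keeps the sketch reproducible between runs.
rng = random.Random(0)


def sample_user() -> int:
    """Draw one user id uniformly at random, with replacement."""
    return rng.choice(user_ids)


cohort = [sample_user() for _ in range(5)]
for user_id in cohort:
    print(f"sampled user_id={user_id} with size {user_sizes[user_id]}")

# Over many draws, every user is selected roughly equally often,
# regardless of how many samples that user owns.
counts = Counter(sample_user() for _ in range(20_000))
print(f"least sampled: {min(counts.values())}, most sampled: {max(counts.values())}")
```

Because each draw is independent, the same user can appear more than once in a small cohort, and users with little data are as likely to be picked as users with a lot.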
