# Synthetic


We propose a process for generating synthetic federated datasets. The dataset is inspired by the one presented by [Li et al.](https://arxiv.org/abs/1905.10497), but can add further heterogeneity designed to make current meta-learning methods (such as [Reptile](https://openai.com/blog/reptile/)) struggle. The high-level goal is to create tasks whose true models are (1) task-dependent and (2) clustered around more than one center. For a description of the full generative process, please refer to the LEAF paper.

Note that the code currently defaults to a single cluster of models. To change this, modify the `PROB_CLUSTERS` constant in `main.py`.
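
To make the idea of task-dependent models clustered around several centers more concrete, here is a minimal sketch. It is not the LEAF generative process (see the paper for that): the function name `sample_task_models`, the Gaussian noise, and the parameters `center_scale` and `task_noise` are assumptions made for illustration, and `prob_clusters` is assumed to be a list of mixture probabilities, one per cluster.

```python
import random

def sample_task_models(num_tasks, num_dim, prob_clusters,
                       center_scale=1.0, task_noise=0.1, seed=0):
    """Draw one 'true' weight vector per task from a mixture of cluster centers."""
    rng = random.Random(seed)
    # One center per cluster; each center is a random point in weight space.
    centers = [[rng.gauss(0.0, center_scale) for _ in range(num_dim)]
               for _ in prob_clusters]
    models, assignments = [], []
    for _ in range(num_tasks):
        # Pick the task's cluster according to the mixture probabilities.
        c = rng.choices(range(len(prob_clusters)), weights=prob_clusters)[0]
        # Task-dependent model: the cluster center plus task-specific noise.
        w = [m + rng.gauss(0.0, task_noise) for m in centers[c]]
        models.append(w)
        assignments.append(c)
    return centers, models, assignments

# Two equally likely clusters (hypothetical values, not the repository defaults).
centers, models, assignments = sample_task_models(
    num_tasks=8, num_dim=4, prob_clusters=[0.5, 0.5])
print(assignments)
```

With `prob_clusters=[1.0]` every task is drawn around the same center, which corresponds to the single-cluster default described above.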

## Data Download

### Setup instructions

#### Install dependencies

In [None]:
!pip3 install numpy pillow

#### Generate initial data

In [2]:
!cd ../benchmark/datasets/synthetic && python main.py -num-tasks 1000 -num-classes 5 -num-dim 60

Generating dataset
Done :D


#### Generate federated dataset

Run `bash ./preprocess.sh` (as with the other LEAF datasets) to produce the final data splits. We suggest using the following tags:
- `--sf` := fraction of data to sample, written as a decimal; set it to 1.0 in order to keep the number of tasks/users specified earlier.
- `-k` := minimum number of samples per user; set it to 5.
- `-t` := 'user' to partition users into train-test groups, or 'sample' to partition each user's samples into train-test groups.
- `--tf` := fraction of data in training set, written as a decimal; default is 0.9.
- `--smplseed` := seed to be used before random sampling of data.
- `--spltseed` := seed to be used before random split of data.
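
The difference between `-t user` and `-t sample` can be illustrated with a small sketch. This is not the code used by `preprocess.sh`; the function names, the toy data, and the rounding rule (`int(...)`) are assumptions chosen only to show the two partitioning strategies.

```python
import random

def split_by_user(user_data, train_frac, seed=0):
    """Assign whole users to train or test (the idea behind `-t user`)."""
    rng = random.Random(seed)
    users = sorted(user_data)
    rng.shuffle(users)
    n_train = int(train_frac * len(users))
    train = {u: user_data[u] for u in users[:n_train]}
    test = {u: user_data[u] for u in users[n_train:]}
    return train, test

def split_by_sample(user_data, train_frac, seed=0):
    """Split each user's own samples (the idea behind `-t sample`)."""
    rng = random.Random(seed)
    train, test = {}, {}
    for u in sorted(user_data):
        samples = list(user_data[u])
        rng.shuffle(samples)
        n_train = int(train_frac * len(samples))
        train[u] = samples[:n_train]
        test[u] = samples[n_train:]
    return train, test

# Toy data: 5 hypothetical users with 10 samples each.
toy = {f"user{i}": list(range(10)) for i in range(5)}
by_user_train, by_user_test = split_by_user(toy, train_frac=0.6)
by_sample_train, by_sample_test = split_by_sample(toy, train_frac=0.6)
print(len(by_user_train), len(by_user_test))
print(len(by_sample_train["user0"]), len(by_sample_test["user0"]))
```

With the user split, a given user appears in exactly one of the two sets; with the sample split, every user appears in both.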

In [3]:
!cd ../benchmark/datasets/synthetic && rm -rf data/rem_user_data data/sampled_data data/test data/train
!cd ../benchmark/datasets/synthetic && bash preprocess.sh -s niid --sf 1.0 -k 5 -t sample --tf 0.6

./preprocess.sh: line 153: realpath: command not found
------------------------------
sampling data
Using seed 1632624149

- random seed written out to sampling_seed.txt
writing data_niid_0.json
------------------------------
removing users with less than 5 samples
writing data_niid_0_keep_5.json
------------------------------
generating training and test sets
- random seed written out to split_seed.txt
splitting data by sample
writing data_niid_0_keep_5_train_6.json
writing data_niid_0_keep_5_test_6.json
------------------------------
calculating JSON file checksums
checksums written to meta/dir-checksum.md5


### Notes

- More details on `preprocess.sh`:
  - `preprocess.sh` processes data in this order: (1) generating all_data (done here by the `main.py` script), (2) sampling, (3) removing users, and (4) creating the train-test split. The script looks at the data in the most recently generated directory and continues preprocessing from that point. For example, if the `all_data` directory has already been generated and the user decides to skip sampling and only remove users with the `-k` tag (e.g. by running `preprocess.sh -k 50`), the script will apply the remove-user filter to the data in `all_data` and place the result in the `rem_user_data` directory.
  - File names record the preprocessing steps used to generate them. For example, the `all_data_niid_1_keep_64.json` file was generated by first sampling 10 percent (.1) of the data in `all_data.json` in a non-i.i.d. manner and then applying the `-k 64` argument to the resulting data.
- Each .json file is an object with 3 keys:
  1. 'users', a list of users
  2. 'num_samples', a list of the number of samples for each user, and
  3. 'user_data', an object with user names as keys and their respective data as values.
- Run `./stats.sh` to get statistics on the data (`data/all_data/all_data.json` must already have been generated).
- In order to run the reference implementations, the `-t sample` tag must be used when running `./preprocess.sh`.
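
As a rough illustration of the three-key JSON layout and the `-k` filter described in the notes above, the sketch below checks that the keys agree with each other and drops users with too few samples. The helper names are hypothetical, and the layout of each user's entry (parallel `'x'` and `'y'` lists) is an assumption that is not stated in these notes.

```python
import json

def check_leaf_json(blob):
    """Verify that 'users', 'num_samples' and 'user_data' agree with each other."""
    users = blob["users"]
    num_samples = blob["num_samples"]
    user_data = blob["user_data"]
    assert len(users) == len(num_samples), "one sample count per user"
    assert set(users) == set(user_data), "every user needs a data entry"
    for user, n in zip(users, num_samples):
        # ASSUMPTION: each user's data is a dict with parallel 'x' and 'y' lists.
        entry = user_data[user]
        assert len(entry["x"]) == len(entry["y"]) == n, user
    return True

def keep_users_with_min_samples(blob, k):
    """Drop users with fewer than k samples (what `-k` is described as doing)."""
    kept = [(u, n) for u, n in zip(blob["users"], blob["num_samples"]) if n >= k]
    return {
        "users": [u for u, _ in kept],
        "num_samples": [n for _, n in kept],
        "user_data": {u: blob["user_data"][u] for u, _ in kept},
    }

# Tiny in-memory example standing in for a file read with json.load().
example = json.loads("""
{"users": ["a", "b"],
 "num_samples": [2, 1],
 "user_data": {"a": {"x": [[0.1], [0.2]], "y": [0, 1]},
               "b": {"x": [[0.3]], "y": [1]}}}
""")
check_leaf_json(example)
filtered = keep_users_with_min_samples(example, k=2)
print(filtered["users"])
```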

### Validate Dataset

In [4]:
from benchmark.datasets.synthetic import get_synthetic
dataset = get_synthetic('../benchmark/datasets/synthetic/data')
print(dataset)
x, y = dataset[0]
print(x.shape, y.shape)

SimulationDataset(total_parts: 1000, total_samples: 64153, current_parts: 0)
torch.Size([60]) torch.Size([])


## FedAvg, FedSGD, FedEla, FedProx, FedScaffold

Run the following commands from the root directory of `benchmark-lightly`.

```bash
function cmd(){
    fed_optim=$1

    task_name="synthetic"
    exp_name=${fed_optim}_${task_name}

    # Delete cache file
    rm -rf /tmp/${exp_name}.share
    rm -rf /tmp/${exp_name}
    rm -rf ./logs/${task_name}/${exp_name}

    # Run
    python -m openfed.tools.launch --nproc_per_node 6 --logdir /tmp benchmark/run.py\
        --fed_init_method file:///tmp/${exp_name}.share\
        --task ${task_name}\
        --data_root benchmark/datasets/${task_name}/data\
        --epochs 1\
        --rounds 20\
        --act_clts 100\
        --tst_act_clts 100\
        --max_acg_step -1\
        --optim ${fed_optim}\
        --optim_args momentum:0.9 weight_decay:1e-4\
        --follower_lr 1e-1\
        --leader_lr 1.0\
        --bz 10\
        --gpu\
        --log_level SUCCESS\
        --log_dir logs\
        --exp_name ${exp_name}\
        --seed 0
}
```

### Run All

```bash
cmd 'fedavg'; cmd 'fedsgd'; cmd 'fedela'; cmd 'fedprox'; cmd 'fedscaffold'
```

## Plot Curves

In [None]:
%matplotlib inline

from benchmark.utils.plot import plot

task_name = "synthetic"

items = dict(
    FedAvg=f'../logs/{task_name}/fedavg_{task_name}/{task_name}.json',
    FedSgd=f'../logs/{task_name}/fedsgd_{task_name}/{task_name}.json',
    FedEla=f'../logs/{task_name}/fedela_{task_name}/{task_name}.json',
    FedProx=f'../logs/{task_name}/fedprox_{task_name}/{task_name}.json',
    FedScaffold=f'../logs/{task_name}/fedscaffold_{task_name}/{task_name}.json',
)

files = items.values()
labels = items.keys()
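
If some of the runs have not finished yet, the corresponding log files will be missing. A minimal sketch for keeping only the logs that exist is shown below; `existing_logs` is a hypothetical helper that is not part of `benchmark.utils.plot`, and the paths simply follow the naming scheme used in the cell above.

```python
from pathlib import Path

def existing_logs(items):
    """Return only the label -> path pairs whose log file exists."""
    found = {}
    for label, path in items.items():
        if Path(path).is_file():
            found[label] = path
        else:
            print(f"skipping {label}: {path} not found")
    return found

# Hypothetical usage with the same naming scheme as the cell above.
task_name = "synthetic"
candidates = {
    name: f"../logs/{task_name}/{name.lower()}_{task_name}/{task_name}.json"
    for name in ["FedAvg", "FedSgd", "FedEla", "FedProx", "FedScaffold"]
}
available = existing_logs(candidates)
files, labels = list(available.values()), list(available.keys())
```

Converting to lists also avoids relying on dictionary views keeping the same order across separate calls.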

### Train Accuracy

In [None]:
plot(
    files=files,
    labels=labels,
    attributes="accuracy",
    mode='train'
)

### Train Loss

In [None]:
plot(
    files=files,
    labels=labels,
    attributes="loss",
    mode="train"
)

### Test Accuracy

In [None]:
plot(
    files=files,
    labels=labels,
    attributes="accuracy",
    mode="test"
)

### Test Loss

In [None]:
plot(
    files=files,
    labels=labels,
    attributes="loss",
    mode='test'
)