# FEMNIST

## Data Download

### Setup Instructions

#### Install dependencies

In [1]:
!pip3 install numpy pillow

#### Generate federated dataset

Run `bash preprocess.sh` with a choice of the following tags:
  - `-s` := 'iid' to sample in an i.i.d. manner, or 'niid' to sample in a non-i.i.d. manner; more information on i.i.d. versus non-i.i.d. is included in the 'Notes' section
  - `--iu` := number of users to sample when using i.i.d. sampling, expressed as a fraction of the total number of users; default is 0.01
  - `--sf` := fraction of data to sample, written as a decimal; default is 0.1
  - `-k` := minimum number of samples per user
  - `-t` := 'user' to partition users into train-test groups, or 'sample' to partition each user's samples into train-test groups
  - `--tf` := fraction of data in training set, written as a decimal; default is 0.9
  - `--smplseed` := seed to be used before random sampling of data
  - `--spltseed` := seed to be used before random split of data
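
The tags above can also be assembled programmatically, which helps when the same preprocessing run has to be repeated with different settings. The following is a minimal sketch, not part of the repository: the function `build_preprocess_command` and its parameter names are hypothetical, and only the tags documented above are emitted. The default of `0` for `-k` mirrors the notebook cells below rather than a documented script default.

```python
from typing import List, Optional


def build_preprocess_command(
    sampling: str = "niid",
    sample_fraction: float = 0.1,
    min_samples: int = 0,
    split_by: str = "sample",
    train_fraction: float = 0.9,
    iid_user_fraction: Optional[float] = None,
    sample_seed: Optional[int] = None,
    split_seed: Optional[int] = None,
) -> List[str]:
    """Build the argument list for `bash preprocess.sh` (hypothetical helper)."""
    if sampling not in ("iid", "niid"):
        raise ValueError("sampling must be 'iid' or 'niid'")
    if split_by not in ("user", "sample"):
        raise ValueError("split_by must be 'user' or 'sample'")

    cmd = [
        "bash", "preprocess.sh",
        "-s", sampling,
        "--sf", str(sample_fraction),
        "-k", str(min_samples),
        "-t", split_by,
        "--tf", str(train_fraction),
    ]
    # --iu is documented as applying to i.i.d. sampling only.
    if iid_user_fraction is not None:
        if sampling != "iid":
            raise ValueError("--iu only applies to i.i.d. sampling")
        cmd += ["--iu", str(iid_user_fraction)]
    # Seeds are optional; the script picks its own when they are omitted.
    if sample_seed is not None:
        cmd += ["--smplseed", str(sample_seed)]
    if split_seed is not None:
        cmd += ["--spltseed", str(split_seed)]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, cwd="../benchmark/datasets/femnist", check=True)`, assuming the notebook is run from the same location as the cells below.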

**Small-sized Dataset** (Optional)

In [7]:
# Clear tmp folder
!cd ../benchmark/datasets/femnist/ && rm -rf data/rem_user_data data/sampled_data data/test data/train

# Download data and sampling
!cd ../benchmark/datasets/femnist/ && bash preprocess.sh -s niid --sf 0.05 -k 0 -t sample

./preprocess.sh: line 153: realpath: command not found
------------------------------
sampling data
Using seed 1632618444

- random seed written out to sampling_seed.txt
writing all_data_1_niid_05.json
writing all_data_0_niid_05.json
writing all_data_6_niid_05.json
writing all_data_5_niid_05.json
writing all_data_4_niid_05.json
writing all_data_3_niid_05.json
writing all_data_2_niid_05.json
------------------------------
removing users with less than 0 samples
writing all_data_0_niid_05_keep_0.json
writing all_data_1_niid_05_keep_0.json
writing all_data_6_niid_05_keep_0.json
writing all_data_4_niid_05_keep_0.json
writing all_data_3_niid_05_keep_0.json
writing all_data_2_niid_05_keep_0.json
writing all_data_5_niid_05_keep_0.json
------------------------------
generating training and test sets
- random seed written out to split_seed.txt
splitting data by sample
writing all_data_0_niid_05_keep_0_train_9.json
writing all_data_0_niid_05_keep_0_test_9.json
writing all_data_1_niid_05_keep_0_t

**Full-sized Dataset** (Optional)

In [None]:
# Clear tmp folder
!cd ../benchmark/datasets/femnist/ && rm -rf data/rem_user_data data/sampled_data data/test data/train

# Download data and sampling
!cd ../benchmark/datasets/femnist/ && bash preprocess.sh -s niid --sf 1.0 -k 0 -t sample

### Notes

- More details on i.i.d. versus non-i.i.d.:
  - In the i.i.d. sampling scenario, each data-point is equally likely to be sampled. Thus, all users have the same underlying distribution of data.
  - In the non-i.i.d. sampling scenario, the underlying distribution of data for each user is consistent with the raw data. Since we assume that data distributions vary between users in the raw data, we refer to this sampling process as non-i.i.d.
- More details on `preprocess.sh`:
  - `preprocess.sh` processes data in this order: (1) generating `all_data`, (2) sampling, (3) removing users, and (4) creating the train-test split. The script looks at the data in the last generated directory and continues preprocessing from that point. For example, if the `all_data` directory has already been generated and the user decides to skip sampling and only remove users with the `-k` tag (e.g. by running `preprocess.sh -k 50`), the script applies the remove-user filter to the data in `all_data` and places the resulting data in the `rem_user_data` directory.
  - File names provide information about the preprocessing steps taken to generate them. For example, the `all_data_niid_1_keep_64.json` file was generated by first sampling 10 percent (0.1) of the data in `all_data.json` in a non-i.i.d. manner and then applying the `-k 64` argument to the resulting data.
- Each .json file is an object with 3 keys:
  1. 'users', a list of users
  2. 'num_samples', a list of the number of samples for each user, and
  3. 'user_data', an object with user names as keys and their respective data as values; for each user, data is represented as a list of images, with each image represented as a size-784 integer list (flattened from 28 by 28)
- Run `./stats.sh` to get statistics of the data (`data/all_data/all_data.json` must have been generated already)
- To run the reference implementations in the `../models` directory, the `-t sample` tag must be used when running `./preprocess.sh`
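
The three-key layout described in the notes can be checked before training. The following is a minimal sketch under stated assumptions: the helper names `check_femnist_json` and `load_and_check` are hypothetical, and only the top-level structure (`users`, `num_samples`, `user_data`) is validated. The per-user payload is deliberately not inspected.

```python
import json
from typing import Any, Dict


def check_femnist_json(blob: Dict[str, Any]) -> int:
    """Validate the top-level layout and return the total sample count."""
    for key in ("users", "num_samples", "user_data"):
        if key not in blob:
            raise KeyError(f"missing top-level key: {key!r}")

    users = blob["users"]
    counts = blob["num_samples"]
    # 'num_samples' holds one count per entry in 'users'.
    if len(users) != len(counts):
        raise ValueError("'users' and 'num_samples' differ in length")

    # Every listed user needs an entry in 'user_data'.
    missing = [u for u in users if u not in blob["user_data"]]
    if missing:
        raise ValueError(f"users without data: {missing}")
    return sum(counts)


def load_and_check(path: str) -> int:
    """Read one generated .json file from disk and validate it."""
    with open(path, "r", encoding="utf-8") as handle:
        return check_femnist_json(json.load(handle))
```

Calling `load_and_check` on each file in the generated `data/train` and `data/test` directories gives a quick sanity check that preprocessing finished without producing partial files.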

### Validate Dataset

In [2]:
from benchmark.datasets.femnist import get_femnist
dataset = get_femnist('../benchmark/datasets/femnist/data')
print(dataset)
x, y = dataset[0]
print(x.shape, y.shape)

SimulationDataset(total_parts: 36, total_samples: 8964, current_parts: 0)
torch.Size([784]) torch.Size([])
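
Each sample `x` printed above is a flat vector of length 784. To inspect it as an image, it can be folded back into 28 rows of 28 values. The following is a minimal pure-Python sketch: the helper `unflatten_image` is hypothetical, and row-major ordering of the flattened pixels is an assumption.

```python
from typing import List, Sequence


def unflatten_image(flat: Sequence[float], side: int = 28) -> List[List[float]]:
    """Fold a flat pixel vector into `side` rows of `side` values each."""
    if len(flat) != side * side:
        raise ValueError(f"expected {side * side} values, got {len(flat)}")
    # Assumption: pixels were flattened row by row (row-major order).
    return [list(flat[r * side:(r + 1) * side]) for r in range(side)]
```

For the tensor returned by the dataset, `unflatten_image(x.tolist())` produces nested lists, and `x.view(28, 28)` gives the equivalent reshaping directly in PyTorch.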


## FedAvg
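
For orientation, the `--optim fedavg` option below presumably selects Federated Averaging, in which the server combines client models by averaging their parameters, weighted by each client's number of samples. The following is a minimal sketch of that textbook rule, not OpenFed's implementation: the function `fedavg` and its data layout are hypothetical, and the momentum, weight decay, and aggregator learning rate passed on the command line are not modeled.

```python
from typing import Dict, List, Sequence


def fedavg(
    client_weights: Sequence[Dict[str, List[float]]],
    num_samples: Sequence[int],
) -> Dict[str, List[float]]:
    """Average client parameters, weighting each client by its sample count."""
    if not client_weights or len(client_weights) != len(num_samples):
        raise ValueError("need one sample count per client")
    total = sum(num_samples)
    if total <= 0:
        raise ValueError("total number of samples must be positive")

    averaged: Dict[str, List[float]] = {}
    # Assumption: all clients share the same parameter names and shapes.
    for name, reference in client_weights[0].items():
        averaged[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, num_samples)) / total
            for i in range(len(reference))
        ]
    return averaged
```

With two clients holding 1 and 3 samples, the second client's parameters contribute three quarters of the averaged result.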

In [2]:
!python -m openfed.tools.simulator --nproc 11 --logdir /tmp ../main.py\
    --task femnist\
    --data_root ../benchmark/datasets/femnist/data\
    --epochs 1\
    --rounds 20\
    --act_clts 10\
    --tst_act_clts 10\
    --max_acg_step -1\
    --optim fedavg\
    --optim_args momentum:0.9 weight_decay:1e-4\
    --co_lr 1e-1\
    --ag_lr 1.0\
    --bz 10\
    --gpu\
    --log_dir logs\
    --seed 0

Note: Stdout and stderr for collaborator-1 will be written to /tmp/openfed_node_collaborator-1_stdout, /tmp/openfed_node_collaborator-1_stderr respectively.
Note: Stdout and stderr for collaborator-2 will be written to /tmp/openfed_node_collaborator-2_stdout, /tmp/openfed_node_collaborator-2_stderr respectively.
Note: Stdout and stderr for collaborator-3 will be written to /tmp/openfed_node_collaborator-3_stdout, /tmp/openfed_node_collaborator-3_stderr respectively.
Note: Stdout and stderr for collaborator-4 will be written to /tmp/openfed_node_collaborator-4_stdout, /tmp/openfed_node_collaborator-4_stderr respectively.
Note: Stdout and stderr for collaborator-5 will be written to /tmp/openfed_node_collaborator-5_stdout, /tmp/openfed_node_collaborator-5_stderr respectively.
Note: Stdout and stderr for collaborator-6 will be written to /tmp/openfed_node_collaborator-6_stdout, /tmp/openfed_node_collaborator-6_stderr respectively.
Note: Stdout and stderr for collaborator-7 will be written