# Sent140

## Data Download

### Setup Instructions

#### Generate embeddings

In [None]:
!cd ../benchmark/datasets/sent140 && bash get_embs.sh

#### Generate federated dataset

Run `bash preprocess.sh` with a choice of the following tags:
- `-s` := 'iid' to sample in an i.i.d. manner, or 'niid' to sample in a non-i.i.d. manner; more information on i.i.d. versus non-i.i.d. is included in the 'Notes' section
- `--iu` := number of users, if iid sampling; expressed as a fraction of the total number of users; default is 0.01
- `--sf` := fraction of data to sample, written as a decimal; default is 0.1
- `-k` := minimum number of samples per user
- `-t` := 'user' to partition users into train-test groups, or 'sample' to partition each user's samples into train-test groups
- `--tf` := fraction of data in training set, written as a decimal; default is 0.9
- `--smplseed` := seed to be used before random sampling of data
- `--spltseed` := seed to be used before random split of data
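
To make the `-s`, `--sf`, and `--iu` options more concrete, here is a toy sketch of the two sampling modes on an in-memory dictionary. It only illustrates the idea and is not the code that `preprocess.sh` runs; the function names `sample_niid` and `sample_iid`, the `iid_user_*` naming, and the toy data are all assumptions made for this example.

```python
import random
from typing import Dict, List


def sample_niid(user_data: Dict[str, List[str]], fraction: float, seed: int = 0) -> Dict[str, List[str]]:
    """Keep whole users, picked in random order, until about `fraction` of all samples is reached.

    Each kept user retains exactly their original samples, so per-user
    distributions stay as they are in the raw data.
    """
    rng = random.Random(seed)
    target = fraction * sum(len(samples) for samples in user_data.values())
    users = list(user_data)
    rng.shuffle(users)
    kept: Dict[str, List[str]] = {}
    count = 0
    for user in users:
        if count >= target:
            break
        kept[user] = list(user_data[user])
        count += len(user_data[user])
    return kept


def sample_iid(user_data: Dict[str, List[str]], fraction: float, user_fraction: float, seed: int = 0) -> Dict[str, List[str]]:
    """Pool all samples, draw uniformly at random, and deal them out to synthetic users.

    `user_fraction` plays the role of `--iu`: the number of synthetic users
    is expressed as a fraction of the original number of users.
    """
    rng = random.Random(seed)
    pooled = [sample for samples in user_data.values() for sample in samples]
    drawn = rng.sample(pooled, int(fraction * len(pooled)))
    num_new_users = max(1, int(user_fraction * len(user_data)))
    new_data: Dict[str, List[str]] = {f"iid_user_{i}": [] for i in range(num_new_users)}
    for i, sample in enumerate(drawn):
        new_data[f"iid_user_{i % num_new_users}"].append(sample)
    return new_data


# Toy data: four users with different amounts of data.
toy_data = {
    "alice": ["a1", "a2", "a3", "a4"],
    "bob": ["b1", "b2"],
    "carol": ["c1", "c2", "c3", "c4", "c5", "c6"],
    "dave": ["d1", "d2", "d3", "d4"],
}

print(sample_niid(toy_data, fraction=0.5, seed=1))
print(sample_iid(toy_data, fraction=0.5, user_fraction=0.5, seed=1))
```

In the non-i.i.d. result every user still holds only their own samples, while in the i.i.d. result each synthetic user receives a mix drawn from the pooled data.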

**Small-sized Dataset** (Optional)

In [None]:
# Clear tmp folder
!cd ../benchmark/datasets/sent140/ && rm -rf data/rem_user_data data/sampled_data data/test data/train

# Download data and sampling
!cd ../benchmark/datasets/sent140/ && bash preprocess.sh -s niid --sf 0.05 -k 0 -t sample

**Full-sized Dataset** (Optional)

In [None]:
# Clear tmp folder
!cd ../benchmark/datasets/sent140/ && rm -rf data/rem_user_data data/sampled_data data/test data/train

# Download data and sampling
!cd ../benchmark/datasets/sent140/ && bash preprocess.sh -s niid --sf 1.0 -k 0 -t sample
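
The two invocations above differ only in the `--sf` value. As a small sketch, the hypothetical helper below (`build_preprocess_args` and its parameter names are not part of the repository) assembles the argument list from keyword arguments, which may be convenient when scripting several dataset sizes. Only the flag spellings are taken from the tag list in this section.

```python
from typing import List, Optional


def build_preprocess_args(
    sampling: Optional[str] = None,             # -s: 'iid' or 'niid'
    iid_user_fraction: Optional[float] = None,  # --iu
    sample_fraction: Optional[float] = None,    # --sf
    min_samples: Optional[int] = None,          # -k
    split_by: Optional[str] = None,             # -t: 'user' or 'sample'
    train_fraction: Optional[float] = None,     # --tf
    sample_seed: Optional[int] = None,          # --smplseed
    split_seed: Optional[int] = None,           # --spltseed
) -> List[str]:
    """Build the argv list for `bash preprocess.sh`, skipping options left as None."""
    if sampling not in (None, "iid", "niid"):
        raise ValueError("sampling must be 'iid' or 'niid'")
    if split_by not in (None, "user", "sample"):
        raise ValueError("split_by must be 'user' or 'sample'")

    flags = [
        ("-s", sampling),
        ("--iu", iid_user_fraction),
        ("--sf", sample_fraction),
        ("-k", min_samples),
        ("-t", split_by),
        ("--tf", train_fraction),
        ("--smplseed", sample_seed),
        ("--spltseed", split_seed),
    ]
    args = ["bash", "preprocess.sh"]
    for flag, value in flags:
        if value is not None:
            args += [flag, str(value)]
    return args


small = build_preprocess_args(sampling="niid", sample_fraction=0.05, min_samples=0, split_by="sample")
print(" ".join(small))  # bash preprocess.sh -s niid --sf 0.05 -k 0 -t sample
```

The resulting list could be passed to `subprocess.run(args, cwd="../benchmark/datasets/sent140", check=True)` instead of using the `!` shell escape.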

### Notes

- More details on i.i.d. versus non-i.i.d.:
  - In the i.i.d. sampling scenario, each data-point is equally likely to be sampled. Thus, all users have the same underlying distribution of data.
  - In the non-i.i.d. sampling scenario, the underlying distribution of data for each user is consistent with the raw data. Since we assume that data distributions vary between users in the raw data, we refer to this sampling process as non-i.i.d.
- More details on `preprocess.sh`:
  - The order in which `preprocess.sh` processes data is 1. generating all_data, 2. sampling, 3. removing users, and 4. creating train-test split. The script will look at the data in the last generated directory and continue preprocessing from that point. For example, if the `all_data` directory has already been generated and the user decides to skip sampling and only remove users with the `-k` tag (i.e. running `preprocess.sh -k 50`), the script will effectively apply a remove user filter to data in `all_data` and place the resulting data in the `rem_user_data` directory.
  - File names provide information about the preprocessing steps taken to generate them. For example, the `all_data_niid_1_keep_64.json` file was generated by first sampling 10 percent (.1) of the data `all_data.json` in a non-i.i.d. manner and then applying the `-k 64` argument to the resulting data.
- The training data has been preprocessed to remove emoji characters.
- Each .json file is an object with 3 keys:
  1. 'users', a list of users
  2. 'num_samples', a list of the number of samples for each user, and
  3. 'user_data', an object with user names as keys and their respective data as values; for each user, data is represented as a list of attribute lists, with each attribute list containing the following string-valued features at the corresponding indices:
     - `0`: id of the tweet; e.g. '2087'
     - `1`: date of the tweet; e.g. 'Sat May 16 23:58:44 UTC 2009'
     - `2`: query; e.g. 'lyx'; if there is no query, then this value is 'NO_QUERY'
     - `3`: user that tweeted; e.g. 'robotickilldozr'
     - `4`: text of the tweet; e.g. 'Lyx is cool'

     (Examples based on the [Sentiment140 website](http://help.sentiment140.com/for-students/).)
- Run `./stats.sh` to get statistics of data (data/all_data/all_data.json must have been generated already)
- In order to run reference implementations, the `-t sample` tag must be used when running `./preprocess.sh`
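
The three-key layout described above can be checked with a few lines of Python. The sketch below builds a tiny object in that layout, round-trips it through JSON, and verifies that `num_samples` agrees with `user_data`. The function name `check_layout` is hypothetical, and the check assumes exactly the five string-valued fields listed above; if the generated files carry additional fields, the length check would need adjusting.

```python
import json
from typing import Any, Dict, List

NUM_FIELDS = 5  # id, date, query, user, text (assumed from the description above)


def check_layout(dataset: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the layout matches the description."""
    problems = [f"missing key: {key}" for key in ("users", "num_samples", "user_data") if key not in dataset]
    if problems:
        return problems

    users, counts, data = dataset["users"], dataset["num_samples"], dataset["user_data"]
    if len(users) != len(counts):
        problems.append("users and num_samples differ in length")

    for user, count in zip(users, counts):
        rows = data.get(user)
        if rows is None:
            problems.append(f"no data for user {user}")
            continue
        if len(rows) != count:
            problems.append(f"{user}: num_samples says {count}, found {len(rows)}")
        for row in rows:
            if len(row) != NUM_FIELDS or not all(isinstance(value, str) for value in row):
                problems.append(f"{user}: malformed attribute list {row!r}")
    return problems


# Toy object using the example values quoted in the notes.
toy = {
    "users": ["robotickilldozr"],
    "num_samples": [1],
    "user_data": {
        "robotickilldozr": [
            ["2087", "Sat May 16 23:58:44 UTC 2009", "lyx", "robotickilldozr", "Lyx is cool"],
        ],
    },
}

# Round-trip through JSON to mimic reading a generated .json file.
loaded = json.loads(json.dumps(toy))
print(check_layout(loaded))  # []
```

To apply the same check to a generated file, replace the round-trip with `json.load` on the opened file.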

### Validate dataset

In [2]:
from benchmark.datasets.sent140 import get_sent140
dataset = get_sent140('../benchmark/datasets/sent140/data')
print(dataset)
x, y = dataset[0]
print(x.shape, y.shape)
print(f'vocab size: {dataset.vocab_size}')

Sent140(total_parts: 254555, total_samples: 908652, current_parts: 0)
torch.Size([400000]) torch.Size([2])
vocab size: 400000


## FedAvg, FedSGD, FedEla, FedProx, FedScaffold

Run the following commands from the root directory of `benchmark-lightly`.


```bash
function cmd(){
    fed_optim=$1
    sub_task_name=$2

    task_name="sent140"
    
    exp_name=${fed_optim}_${task_name}_${sub_task_name}

    # Delete cached files and logs left over from a previous run of this experiment
    rm -rf /tmp/${exp_name}.share
    rm -rf /tmp/${exp_name}
    rm -rf ./logs/${task_name}/${exp_name}

    # Run
    python -m openfed.tools.launch --nproc_per_node 6 --logdir /tmp benchmark/run.py\
        --fed_init_method file:///tmp/${exp_name}.share\
        --task ${task_name}\
        --network_args task:"r\"${sub_task_name}\""\
        --data_root benchmark/datasets/${task_name}/data\
        --dataset_args task:"r\"${sub_task_name}\""\
        --epochs 1\
        --rounds 20\
        --act_clts 100\
        --tst_act_clts 100\
        --max_acg_step -1\
        --optim ${fed_optim}\
        --optim_args momentum:0.9 weight_decay:1e-4\
        --follower_lr 1e-1\
        --leader_lr 1.0\
        --bz 10\
        --gpu\
        --log_level SUCCESS\
        --log_dir logs\
        --exp_name ${exp_name}\
        --seed 0
}

function cmd_bag_log_reg(){
    cmd $1 'bag_log_reg'
}

function cmd_stacked_lstm(){
    cmd $1 'stacked_lstm'
}
```
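
The experiment name that `cmd` builds also determines where the plotting cells in this document look for results. The sketch below mirrors that naming in Python and derives the log path for each optimizer. The path layout is inferred from the plotting cells rather than from the launcher's source, and the helper names `exp_name` and `log_file` are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Dict

FED_OPTIMS = ("fedavg", "fedsgd", "fedela", "fedprox", "fedscaffold")


def exp_name(fed_optim: str, task_name: str, sub_task_name: str) -> str:
    """Mirror the shell expression ${fed_optim}_${task_name}_${sub_task_name}."""
    return f"{fed_optim}_{task_name}_{sub_task_name}"


def log_file(fed_optim: str, task_name: str, sub_task_name: str, log_root: str = "../logs") -> str:
    """Path of the JSON log, following the pattern used by the plotting cells (an inferred layout)."""
    name = exp_name(fed_optim, task_name, sub_task_name)
    return str(PurePosixPath(log_root) / task_name / name / f"{task_name}.json")


def log_files(task_name: str, sub_task_name: str) -> Dict[str, str]:
    """Map each optimizer name to its expected log file."""
    return {optim: log_file(optim, task_name, sub_task_name) for optim in FED_OPTIMS}


for optim, path in log_files("sent140", "bag_log_reg").items():
    print(optim, path)
```

Generating the paths from `sub_task_name` in one place avoids editing five strings by hand when switching between `bag_log_reg` and `stacked_lstm`.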

## bag_log_reg

```bash
cmd_bag_log_reg 'fedavg'; cmd_bag_log_reg 'fedsgd'; cmd_bag_log_reg 'fedela'; cmd_bag_log_reg 'fedprox'; cmd_bag_log_reg 'fedscaffold'
```

## Plot Curves

In [None]:
%matplotlib inline

from benchmark.utils.plot import plot

task_name = "sent140"
sub_task_name = "bag_log_reg"

items = dict(
    FedAvg=f'../logs/{task_name}/fedavg_{task_name}_{sub_task_name}/{task_name}.json',
    FedSgd=f'../logs/{task_name}/fedsgd_{task_name}_{sub_task_name}/{task_name}.json',
    FedEla=f'../logs/{task_name}/fedela_{task_name}_{sub_task_name}/{task_name}.json',
    FedProx=f'../logs/{task_name}/fedprox_{task_name}_{sub_task_name}/{task_name}.json',
    FedScaffold=f'../logs/{task_name}/fedscaffold_{task_name}_{sub_task_name}/{task_name}.json',
)

files = items.values()
labels = items.keys()

### Train Accuracy

In [None]:
plot(
    files=files,
    labels=labels,
    attributes="accuracy",
    mode='train'
)

### Train Loss

In [None]:
plot(
    files=files,
    labels=labels,
    attributes="loss",
    mode="train"
)

### Test Accuracy

In [None]:
plot(
    files=files,
    labels=labels,
    attributes="accuracy",
    mode="test"
)

### Test Loss

In [None]:
plot(
    files=files,
    labels=labels,
    attributes="loss",
    mode='test'
)

## stacked_lstm

```bash
cmd_stacked_lstm 'fedavg'; cmd_stacked_lstm 'fedsgd'; cmd_stacked_lstm 'fedela'; cmd_stacked_lstm 'fedprox'; cmd_stacked_lstm 'fedscaffold'
```

## Plot Curves

In [None]:
%matplotlib inline

from benchmark.utils.plot import plot

task_name = "sent140"
sub_task_name = "stacked_lstm"

items = dict(
    FedAvg=f'../logs/{task_name}/fedavg_{task_name}_{sub_task_name}/{task_name}.json',
    FedSgd=f'../logs/{task_name}/fedsgd_{task_name}_{sub_task_name}/{task_name}.json',
    FedEla=f'../logs/{task_name}/fedela_{task_name}_{sub_task_name}/{task_name}.json',
    FedProx=f'../logs/{task_name}/fedprox_{task_name}_{sub_task_name}/{task_name}.json',
    FedScaffold=f'../logs/{task_name}/fedscaffold_{task_name}_{sub_task_name}/{task_name}.json',
)

files = items.values()
labels = items.keys()

### Train Accuracy

In [None]:
plot(
    files=files,
    labels=labels,
    attributes="accuracy",
    mode='train'
)

### Train Loss

In [None]:
plot(
    files=files,
    labels=labels,
    attributes="loss",
    mode="train"
)

### Test Accuracy

In [None]:
plot(
    files=files,
    labels=labels,
    attributes="accuracy",
    mode="test"
)

### Test Loss

In [None]:
plot(
    files=files,
    labels=labels,
    attributes="loss",
    mode='test'
)