# Reddit

We preprocess the Reddit data released by [pushshift.io](https://files.pushshift.io/reddit/) corresponding to December 2017. We perform the following operations:

1. Unescape HTML entities.
2. Remove extraneous whitespace.
3. Remove non-ASCII characters.
4. Replace URLs, Reddit usernames, and subreddit names with special tokens.
5. Lowercase the text.
6. Tokenize the text (using NLTK's `TweetTokenizer`).
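
The six steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the code shipped in the `preprocess` folder: the regular expressions and the special tokens `<url>`, `<user>`, and `<subreddit>` are assumptions chosen for the example, and the actual patterns and token names may differ.

```python
import html
import re

# Hypothetical special tokens; the real preprocessing may use different names.
URL_TOKEN, USER_TOKEN, SUB_TOKEN = "<url>", "<user>", "<subreddit>"

# Illustrative patterns (assumptions, not the project's actual regexes).
URL_RE = re.compile(r"https?://\S+")
USER_RE = re.compile(r"(?<!\w)/?u/[\w-]+")
SUB_RE = re.compile(r"(?<!\w)/?r/\w+")


def clean_text(text):
    """Apply steps 1-5 in the order listed above."""
    text = html.unescape(text)                            # 1. unescape HTML entities
    text = " ".join(text.split())                         # 2. collapse whitespace
    text = text.encode("ascii", "ignore").decode("ascii")  # 3. drop non-ASCII characters
    text = URL_RE.sub(URL_TOKEN, text)                    # 4. replace URLs first,
    text = USER_RE.sub(USER_TOKEN, text)                  #    then usernames,
    text = SUB_RE.sub(SUB_TOKEN, text)                    #    then subreddit names
    return text.lower()                                   # 5. lowercase


def tokenize(text):
    """Step 6: tokenize with NLTK's TweetTokenizer (requires `nltk`)."""
    from nltk.tokenize import TweetTokenizer
    return TweetTokenizer().tokenize(text)
```

With `nltk` installed, a raw comment could then be processed as `tokenize(clean_text(raw_comment))`. URLs are replaced before usernames and subreddit names so that a link containing `/r/` is collapsed into a single token.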

We also remove users and comments that simple heuristics or preliminary inspection mark as bots, and we remove users with fewer than 5 or more than 1000 comments (which account for less than 0.01% of users).
We include the code for this preprocessing in the `preprocess` folder for reference, and host the preprocessed dataset [here](https://drive.google.com/file/d/1CXufUKXNpR7Pn8gUbIerZ1-qHz1KatHH/view?usp=sharing).
A further-processed version, ready for our reference model (split into train/val/test sets and cut into sequences of 10 tokens for the LSTM), is available [here](https://drive.google.com/file/d/1lT1Z0N1weG-oA2PgC1Jak_WQ6h3bu7V_/view?usp=sharing).
The vocabulary of the 10,000 most common tokens in the data can be found [here](https://drive.google.com/file/d/1I-CRlfAeiriLmAyICrmlpPE5zWJX4TOY/view?usp=sharing).
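
To make the vocabulary and the fixed-length sequences more concrete, the sketch below builds a most-common-token vocabulary and cuts a token list into chunks of 10. It is an illustration under assumptions, not the project's `build_vocab.py`: the `<unk>` token, the function names, and the choice of non-overlapping chunks that drop a trailing remainder are all hypothetical.

```python
from collections import Counter


def build_vocab(token_lists, max_size=10_000, unk="<unk>"):
    """Map the `max_size` most common tokens to integer ids; id 0 is reserved for `unk`."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    vocab = {unk: 0}
    for tok, _ in counts.most_common(max_size):
        vocab.setdefault(tok, len(vocab))
    return vocab


def to_sequences(tokens, seq_len=10):
    """Split tokens into consecutive, non-overlapping chunks of `seq_len`.

    A trailing remainder shorter than `seq_len` is dropped (an assumption).
    """
    return [tokens[i:i + seq_len]
            for i in range(0, len(tokens) - seq_len + 1, seq_len)]


def encode(tokens, vocab, unk="<unk>"):
    """Replace each token with its id, falling back to the `unk` id."""
    return [vocab.get(tok, vocab[unk]) for tok in tokens]
```

For example, `to_sequences(tokens)` on a 25-token comment would yield two sequences of 10 tokens each under these assumptions.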


## Data Download

### Setup Instructions

#### Install dependencies

In [1]:
!pip3 install nltk



#### Download preprocessed data from Google Drive

To use our reference model, download the data [here](https://drive.google.com/file/d/1PwBpAEMYKNpnv64cQ2TIQfSc_vPbq3OQ/view?usp=sharing) into the `data` subfolder, `../benchmark/datasets/reddit/data`. This is a sub-sampled version of the complete data. Our reference implementation doesn't yet support training on the [complete dataset](https://drive.google.com/file/d/1lT1Z0N1weG-oA2PgC1Jak_WQ6h3bu7V_/view?usp=sharing), as it loads all given clients into memory.

#### Download raw data and preprocess it manually (optional)

Refer to [reddit](https://files.pushshift.io/reddit/comments/) for more details about the dataset.

- Open a terminal and run the following commands to download and decompress the dataset:

    ```shell
    FILENAME='RC_2005-12' # Select a file to download. Must match line 10 of `preprocess.py`
    SUF_EXT='bz2' # `xz`, `bz2`, `zst`
    mkdir -pv data/raw
    cd data/raw
    wget --no-check-certificate --no-proxy https://files.pushshift.io/reddit/comments/$FILENAME.$SUF_EXT

    if [ $SUF_EXT = "xz" ]; then
       # Install necessary tools to unxz the downloaded file
       sudo apt-get install xz-utils
       echo "unxz $FILENAME.$SUF_EXT"
       unxz $FILENAME.$SUF_EXT
    fi

    if [ $SUF_EXT = "bz2" ]; then
       echo "bunzip2 $FILENAME.$SUF_EXT"
       bunzip2 $FILENAME.$SUF_EXT
    fi

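    if [ $SUF_EXT = "zst" ]; then
       # The .zst dumps are zstd-compressed files, not tar archives;
       # their large window size typically requires --long=31
       echo "unzstd $FILENAME.$SUF_EXT"
       unzstd --long=31 $FILENAME.$SUF_EXT
    fi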
    ```

- Preprocess:

    ```shell
    cd ../benchmark/datasets/reddit/preprocess
    bash run_reddit.sh
    ```
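
To inspect a downloaded dump before running the full preprocessing, the compressed file can also be streamed directly from Python. The sketch below assumes the dump is bz2-compressed, newline-delimited JSON with `author` and `body` fields per comment; `iter_comments` and its `limit` parameter are hypothetical names introduced for this example and are not part of `preprocess.py`.

```python
import bz2
import json


def iter_comments(path, limit=None):
    """Yield (author, body) pairs from a bz2-compressed, newline-delimited JSON dump."""
    with bz2.open(path, "rt", encoding="utf-8") as fh:
        for i, line in enumerate(fh):
            if limit is not None and i >= limit:
                break
            record = json.loads(line)
            # Missing fields come back as None rather than raising.
            yield record.get("author"), record.get("body")
```

For instance, `for author, body in iter_comments("data/raw/RC_2005-12.bz2", limit=5): print(author, body)` would print the first five comments without decompressing the whole file to disk.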

#### Build training vocabulary

In [4]:
!cd ../benchmark/datasets/reddit && python build_vocab.py --data-dir ./data/train --target-dir vocab

loading reddit_10_train.json
counting reddit_10_train.json

loading reddit_15_train.json
counting reddit_15_train.json

loading reddit_17_train.json
counting reddit_17_train.json

loading reddit_18_train.json
counting reddit_18_train.json

loading reddit_1_train.json
counting reddit_1_train.json

loading reddit_20_train.json
counting reddit_20_train.json

loading reddit_3_train.json
counting reddit_3_train.json

loading reddit_4_train.json
counting reddit_4_train.json

loading reddit_6_train.json
counting reddit_6_train.json

loading reddit_9_train.json
counting reddit_9_train.json



### Validate dataset


In [None]:
from benchmark.datasets.reddit import get_reddit
dataset = get_reddit('../benchmark/datasets/reddit/data')
print(dataset)
x, y = dataset[0]
print(x.shape, y.shape)
print(f'vocab size: {dataset.vocab_size}')

## FedAvg, FedSGD, FedEla, FedProx, FedScaffold

Run the following commands from the root directory of `benchmark-lightly`.

```bash
function cmd(){
    fed_optim=$1

    task_name="reddit"
    exp_name=${fed_optim}_${task_name}

    # Delete cache files and logs from any previous run
    rm -rf /tmp/${exp_name}.share
    rm -rf /tmp/${exp_name}
    rm -rf ./logs/${task_name}/${exp_name}

    # Run
    python -m openfed.tools.launch --nproc_per_node 6  --logdir /tmp benchmark/run.py\
        --fed_init_method file:///tmp/${exp_name}.share\
        --task ${task_name}\
        --data_root benchmark/datasets/${task_name}/data\
        --epochs 1\
        --rounds 20\
        --act_clts 100\
        --tst_act_clts 100\
        --max_acg_step -1\
        --optim ${fed_optim}\
        --optim_args momentum:0.9 weight_decay:1e-4\
        --follower_lr 1e-1\
        --leader_lr 1.0\
        --bz 10\
        --gpu\
        --log_level SUCCESS\
        --log_dir logs\
        --exp_name ${exp_name}\
        --seed 0
}
```

### Run All

```bash
cmd 'fedavg'; cmd 'fedsgd'; cmd 'fedela'; cmd 'fedprox'; cmd 'fedscaffold'
```

## Plot Curves

In [1]:
%matplotlib inline

from benchmark.utils.plot import plot

task_name = "reddit"

items = dict(
    FedAvg=f'../logs/{task_name}/fedavg_{task_name}/{task_name}.json',
    FedSgd=f'../logs/{task_name}/fedsgd_{task_name}/{task_name}.json',
    FedEla=f'../logs/{task_name}/fedela_{task_name}/{task_name}.json',
    FedProx=f'../logs/{task_name}/fedprox_{task_name}/{task_name}.json',
    FedScaffold=f'../logs/{task_name}/fedscaffold_{task_name}/{task_name}.json',
)

files = items.values()
labels = items.keys()
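
Before plotting, it may help to confirm that each log file exists, since a run that failed or wrote to a different directory leaves nothing to plot. The helper below is a hypothetical convenience written for this example; it is not part of `benchmark.utils.plot`.

```python
from pathlib import Path


def missing_logs(items):
    """Return the labels whose log file does not exist on disk."""
    return [label for label, path in items.items() if not Path(path).is_file()]
```

Calling `missing_logs(items)` with the dictionary defined above returns an empty list when all five runs have produced their JSON logs.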

### Train Accuracy

In [None]:
plot(
    files=files,
    labels=labels,
    attributes="accuracy",
    mode='train'
)

### Train Loss

In [None]:
plot(
    files=files,
    labels=labels,
    attributes="loss",
    mode="train"
)

### Test Accuracy

In [None]:
plot(
    files=files,
    labels=labels,
    attributes="accuracy",
    mode="test"
)

### Test Loss

In [None]:
plot(
    files=files,
    labels=labels,
    attributes="loss",
    mode='test'
)