This repository contains the scripts and data for the experiments in the paper, as well as the experimental results.
Basic requirement: Python 3.7+
The streaming and distributed algorithms are built upon Ray (https://docs.ray.io/en/master/index.html), a high-level framework for parallel and distributed computing. The steps below describe how to set up and use Ray for streaming / distributed k-CSS_1 on AWS.
- Step 1: Set up AWS. Install the following packages:
pip install ray
pip install boto3
pip install aws
Configure AWS CLI credentials (~/.aws/credentials and ~/.aws/config), following https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html
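Before moving on, it can help to confirm that the credentials file is in place. The following is a minimal sketch using only the Python standard library; the helper name has_aws_profile is hypothetical and not part of this repository, and it only checks that the two standard key fields exist for a profile, not that the keys are valid.

```python
import configparser
import os


def has_aws_profile(credentials_path, profile="default"):
    """Return True if the profile defines both key fields in an AWS credentials file."""
    parser = configparser.ConfigParser()
    parser.read(credentials_path)  # a missing file is silently ignored
    if not parser.has_section(profile):
        return False
    section = parser[profile]
    return "aws_access_key_id" in section and "aws_secret_access_key" in section


if __name__ == "__main__":
    path = os.path.expanduser("~/.aws/credentials")
    print("default profile configured:", has_aws_profile(path))
```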
- Step 2: Configure launch_template.yaml, which contains all the information Ray needs to set up a cluster of computing nodes on AWS. The cluster will contain a head node and several worker nodes. The head node acts as the coordinator for distributed k-CSS_1.
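As a rough illustration of what such a cluster configuration holds, the sketch below builds a small config as a Python dict and checks it for a few top-level keys. The key list (cluster_name, provider, auth) is an assumption based on what Ray's cluster launcher has typically required, and every value shown is hypothetical rather than taken from launch_template.yaml; consult the Ray documentation for your version. Reading the actual YAML file would need a YAML parser, which is not used here.

```python
import json

# Top-level keys assumed to be required by Ray's cluster launcher (an assumption).
REQUIRED_KEYS = ("cluster_name", "provider", "auth")


def missing_cluster_keys(config):
    """Return the assumed-required top-level keys that are absent from a config dict."""
    return [key for key in REQUIRED_KEYS if key not in config]


# Hypothetical values for illustration only; they are not taken from the repository.
example_config = {
    "cluster_name": "kcss-cluster",
    "max_workers": 4,
    "provider": {"type": "aws", "region": "us-west-2"},
    "auth": {"ssh_user": "ubuntu"},
}

if __name__ == "__main__":
    print(json.dumps(example_config, indent=2))
    print("missing keys:", missing_cluster_keys(example_config))
```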
- Step 3: Launch and manage the cluster via Ray with the following command:
ray up launch_template.yaml
To inspect the head node, run
ray attach launch_template.yaml
After the k-CSS_1 protocol finishes, you can tear down the cluster with
ray down launch_template.yaml
Note that launching the cluster might take a few minutes, as all dependencies for distributed k-CSS_1 will be installed.
After the cluster is successfully set up on AWS, you will see a _redis_password. Please save it.
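If you prefer to drive the three cluster commands from a script, a thin wrapper along the following lines may be convenient. This is only a sketch: the function names are hypothetical, and it simply assembles the same ray up / ray attach / ray down command lines shown in this step and passes them to subprocess.

```python
import subprocess

# The three cluster actions used in this step.
RAY_ACTIONS = ("up", "attach", "down")


def ray_cluster_command(action, config="launch_template.yaml"):
    """Build the argument list for a Ray cluster command."""
    if action not in RAY_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    return ["ray", action, config]


def run_ray(action, config="launch_template.yaml"):
    """Run the command; raises CalledProcessError if the Ray CLI exits with an error."""
    subprocess.run(ray_cluster_command(action, config), check=True)


if __name__ == "__main__":
    # Print the commands instead of running them, so nothing is launched by accident.
    for action in RAY_ACTIONS:
        print(" ".join(ray_cluster_command(action)))
```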
- Step 4: Run k-CSS_1 on the cluster launched by Ray.
First, connect to the head node via ray attach launch_template.yaml
Then set the _redis_password in ray.init() in the files where Ray functions are used, e.g. code_v2/protocol/interaction.py.
(The _redis_password values currently in the files are the defaults, so there should be no need to change them.)
Since Ray pickles all files and sends them to the worker processes, the worker processes are not aware of any file path added to sys.path, which might cause an import error when running the experiments.
To resolve this issue, run sudo python3 setup.py install before conducting the experiments.
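The two checks in Step 4 can be sketched as a small pre-flight helper. This is an illustration under assumptions, not the code in the repository: the function names are hypothetical, the module name passed to is_importable is whatever top-level package setup.py installs (not stated here), and the ray.init() call assumes a Ray version that accepts the address and _redis_password keyword arguments.

```python
import importlib.util


def is_importable(module_name):
    """Return True if Python can locate the top-level module on the current import path."""
    return importlib.util.find_spec(module_name) is not None


def ray_init_kwargs(redis_password, address="auto"):
    """Build keyword arguments for ray.init() when connecting to a running cluster."""
    return {"address": address, "_redis_password": redis_password}


def connect_to_cluster(redis_password):
    """Connect to the cluster from the head node; requires Ray to be installed."""
    import ray  # imported lazily so the helpers above work without Ray

    return ray.init(**ray_init_kwargs(redis_password))


if __name__ == "__main__":
    # "json" is a stand-in; replace it with the package installed by setup.py.
    print("importable:", is_importable("json"))
```

If is_importable returns False for the installed package on the head node, rerunning the setup.py install step is the first thing to try.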
- common: Common utilities for both distributed and streaming settings
  - common/lewis_weights.py: Lewis weights sampling
  - common/kCSS12.py: Regular bi-criteria O(1)-approximate k-CSS_{1, 2} algorithm
  - common/kCSS12_greedy.py: Greedy k-CSS_{1, 2} algorithm
  - common/l1_regression.py: Computing L1 regression error (Note: this might take several hours)
  - common/generate_synthetic.py: Synthetic data generator
  - common/utils.py: Other utilities (e.g. loading data)
- streaming: Experiments for the streaming setting
- protocol: Experiments for the distributed setting
- baselines: Baselines
- active_learning: CSS as an active learning algorithm for a noisy image classification task (active_learning/noisy_image_classification.py)

- streaming/process_stream.py: The streaming algorithm for k-CSS_1
- streaming/random_stream.py: The random streaming baseline
- streaming/conduct_exp.py: Conduct experiments with the streaming k-CSS_1 algorithm
- streaming/conduct_random_stream.py: Conduct experiments with the random baseline

- protocol/interaction.py: The distributed protocol for k-CSS_1
- protocol/conduct_exp.py: Conduct experiments with the distributed protocol
- Note that the random baseline for the distributed setting is under baselines

- baselines/svd.py: The SVD baseline. Note that SVD is a deterministic algorithm
- baselines/uniform_distributed.py: The random baseline for the distributed setting
- baselines/simple_kcss2.py: A simple CSS algorithm in the Frobenius norm
- Synthetic: can be generated by common/generate_synthetic.py
- Gene: can be found at https://archive.ics.uci.edu/ml/datasets/gene+expression+cancer+RNA-Seq
- TechTC: can be found at http://gabrilovich.com/resources/data/techtc/techtc300/techtc300.html
- COIL20: can be found at http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php
- All processed datasets are available under dataset. You can use common/utils.py to load the processed datasets.
- For experiments in the streaming setting, the columns of the data arrive in arbitrary order, so we fix the random order for each experiment trial. The datasets used in the streaming setting can be found in dataset/<data>_rand_<random_index>. You can also generate your own data with randomly permuted columns using dataset/gen_rand_permutation.py.
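To illustrate what fixing the random column order per trial means, here is a minimal standard-library sketch. It is not the code in dataset/gen_rand_permutation.py; the function names and the use of a seed per trial are assumptions made for the example. The same seed always yields the same column order, so a trial can be repeated on an identical stream.

```python
import random


def column_permutation(num_columns, seed):
    """Return a reproducible random ordering of column indices for a given seed."""
    order = list(range(num_columns))
    random.Random(seed).shuffle(order)  # private generator, global state untouched
    return order


def permute_columns(matrix, order):
    """Reorder the columns of a row-major matrix given as a list of rows."""
    return [[row[j] for j in order] for row in matrix]


if __name__ == "__main__":
    data = [[1, 2, 3, 4], [5, 6, 7, 8]]
    order = column_permutation(len(data[0]), seed=0)
    print(order)
    print(permute_columns(data, order))
```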
- Streaming regular and greedy k-CSS_1 results:
  - Synthetic: streaming/str_synthetic_results
  - Gene: streaming/str_gene_results
  - TechTC: streaming/str_techtc_results
- Distributed regular and greedy k-CSS_1 results:
  - Synthetic: protocol/synthetic_results
  - Gene: protocol/gene_results
  - TechTC: protocol/techtc_results
- Baseline results:
  - Synthetic: baselines/synthetic_svd, baselines/synthetic_uniform
  - Gene: baselines/gene_svd, baselines/gene_uniform
  - TechTC: baselines/techtc_svd, baselines/techtc_uniform
- To plot all experimental results in the streaming setting, go to code_v2/streaming and run python3 plot_stream.py
- To plot all experimental results in the distributed setting, go to code_v2/protocol and run python3 plot_results.py