Streaming and Distributed Algorithms for Robust Column Subset Selection (CSS)

The repository contains all scripts and data for the experiments in the paper, as well as the experimental results.
Basic requirement: Python 3.7+.

Ray setups

The streaming and distributed algorithms are built on Ray (https://docs.ray.io/en/master/index.html), a high-level framework for parallel and distributed computing. This section describes how to set up and use Ray for streaming / distributed k-CSS_1 on AWS.

Step 1: Set up AWS. Install the following packages:

pip install ray

pip install boto3

pip install awscli

Configure AWS CLI credentials, following https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html

(Configure ~/.aws/credentials and ~/.aws/config)
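Before launching anything, it can help to confirm that the credentials file is actually in place. The following is a minimal sketch, not part of the repository: the helper name has_aws_profile is hypothetical, and it assumes the standard INI layout of ~/.aws/credentials with a [default] profile.

```python
import configparser
from pathlib import Path


def has_aws_profile(credentials_path=None, profile="default"):
    """Return True if the credentials file defines both keys for `profile`.

    Hypothetical helper: it only inspects the file on disk and does not
    contact AWS, so it cannot tell whether the keys are valid.
    """
    if credentials_path is None:
        path = Path.home() / ".aws" / "credentials"
    else:
        path = Path(credentials_path)

    parser = configparser.ConfigParser()
    parser.read(path)  # a missing file is silently ignored

    if not parser.has_section(profile):
        return False
    section = parser[profile]
    return "aws_access_key_id" in section and "aws_secret_access_key" in section
```

Calling has_aws_profile() with no arguments checks the default location; a False result suggests revisiting the AWS CLI configuration guide linked above.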
Step 2: Configure launch_template.yaml, which contains all the information Ray needs to set up a cluster of computing nodes on AWS. The cluster will contain a header node and several worker nodes. The header node acts as the coordinator for distributed k-CSS_1.

Step 3: Launch and manage the cluster via Ray, using the following commands:

ray up launch_template.yaml

If you want to check out the header node, run

ray attach launch_template.yaml

After the k-CSS_1 protocol finishes, you can tear down the cluster by

ray down launch_template.yaml

Note that launching the cluster might take a few minutes as all dependencies for distributed k-CSS_1 will be installed.

After the cluster is successfully set up on AWS, you will see a _redis_password. Please save this.
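If you drive the cluster from a script rather than by hand, it is easy to forget the teardown step when a run fails. The sketch below is one possible wrapper and is not part of the repository: with_cluster and job are hypothetical names, and the -y flag (skip the confirmation prompt) is an addition to the commands shown above.

```python
import subprocess


def with_cluster(job, config="launch_template.yaml", run=subprocess.run):
    """Launch the cluster, call `job()`, and always tear the cluster down.

    `job` is a hypothetical zero-argument callable supplied by the caller.
    `run` defaults to subprocess.run and can be replaced for dry runs.
    """
    run(["ray", "up", "-y", config], check=True)
    try:
        return job()
    finally:
        # Runs even if job() raises, so no AWS nodes are left running.
        run(["ray", "down", "-y", config], check=True)
```

Passing a stand-in for run that only records its arguments lets you check the command sequence without touching AWS.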

Step 4: Run k-CSS_1 on the cluster launched by Ray.

First go to the header node via ray attach launch_template.yaml

Then, if the _redis_password reported for your cluster differs from the default, update the _redis_password passed to ray.init() in the files where Ray functions are used, e.g. code_v2/protocol/interaction.py.
(The files currently contain the default _redis_password, so in most cases no change should be needed.)
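To avoid editing the password in several files, one option is to build the ray.init() arguments in a single place. This is a sketch under assumptions, not the repository's code: ray_init_kwargs and the KCSS_REDIS_PASSWORD environment variable are hypothetical names, and address="auto" assumes the script runs on the header node of an already running cluster.

```python
import os


def ray_init_kwargs(redis_password=None):
    """Build keyword arguments for ray.init() on an existing cluster.

    The password is taken from the argument, or else from the hypothetical
    KCSS_REDIS_PASSWORD environment variable. If neither is set, the
    password is omitted and Ray falls back to its own default.
    """
    password = redis_password or os.environ.get("KCSS_REDIS_PASSWORD")
    kwargs = {"address": "auto"}
    if password:
        kwargs["_redis_password"] = password
    return kwargs


# Intended use on the header node (requires Ray to be installed):
#   import ray
#   ray.init(**ray_init_kwargs("<your _redis_password>"))
```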

An important note

Ray pickles the code it sends to the worker processes, so the workers are not aware of any file path added to sys.path at runtime. This can cause import errors when running the experiments.
To resolve this issue, run sudo python3 setup.py install before conducting the experiments.
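A quick way to confirm that the install step took effect is to check that the package resolves without any sys.path changes. The sketch below is illustrative only: is_importable is a hypothetical helper, and the package name code_v2 in the usage comment is an assumption, since this README does not state the name that setup.py installs.

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module can be located by the import system."""
    return importlib.util.find_spec(module_name) is not None


# Example check (the name "code_v2" is an assumption):
#   if not is_importable("code_v2"):
#       print("Package not found; run: sudo python3 setup.py install")
```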

Source code (`code_v2`) Structure:

common: Common utilities for both distributed and streaming settings
- common/lewis_weights.py: Lewis weights sampling
- common/kCSS12.py: Regular Bi-criteria O(1)-approximate k-CSS_{1, 2} algorithm
- common/kCSS12_greedy.py: Greedy k-CSS_{1, 2} algorithm
- common/l1_regression.py: Computing L1 regression error (Note: this might take several hours)
- common/generate_synthetic.py: Synthetic data generator
- common/utils.py: Other utilities (e.g. loading data)
streaming: Experiments for the streaming setting
protocol: Experiments for the distributed setting
baselines: Baselines
active_learning: CSS as an active learning algorithm for a noisy image classification task (active_learning/noisy_image_classification.py)
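For readers unfamiliar with the Lewis weights mentioned above, the following sketch shows the standard fixed-point iteration for l1 Lewis weights of the rows of a matrix. It is an illustration assuming NumPy and a matrix with full column rank; the function name l1_lewis_weights is hypothetical, and this is not necessarily how common/lewis_weights.py computes or applies them.

```python
import numpy as np


def l1_lewis_weights(A, num_iters=30):
    """Approximate the l1 Lewis weights of the rows of A (shape n x d).

    Repeats the update w_i <- sqrt(a_i^T (A^T W^{-1} A)^{-1} a_i),
    where W = diag(w). Assumes A has full column rank and no zero rows.
    """
    n, _ = A.shape
    w = np.ones(n)
    for _ in range(num_iters):
        # M = A^T W^{-1} A, computed by scaling each row a_i by 1 / w_i.
        M = A.T @ (A / w[:, None])
        M_inv = np.linalg.inv(M)
        # q_i = a_i^T M^{-1} a_i for every row i.
        q = np.einsum("ij,jk,ik->i", A, M_inv, A)
        w = np.sqrt(q)
    return w
```

At the fixed point the weights sum to d, the number of columns of A, which gives a simple sanity check. Sampling rows with probability proportional to these weights is the usual way they are used.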

Streaming Setting (`streaming`):

streaming/process_stream.py: The streaming algorithm for k-CSS_1
streaming/random_stream.py: The random streaming baseline
streaming/conduct_exp.py: Conduct experiments with the streaming k-CSS_1 algorithm
streaming/conduct_random_stream.py: Conduct experiments with the random baseline

Distributed Setting (`protocol`):

protocol/interaction.py: The distributed protocol for k-CSS_1
protocol/conduct_exp.py: Conduct experiments with the distributed protocol
Note: the random baseline for the distributed setting is under baselines.

Baselines (`baselines`):

baselines/svd.py: The SVD baseline. Note that SVD is a deterministic algorithm
baselines/uniform_distributed.py: The random baseline for the distributed setting
baselines/simple_kcss2.py: A simple CSS algorithm in the Frobenius norm

Datasets:

Synthetic: can be generated by common/generate_synthetic.py
Gene: can be found at https://archive.ics.uci.edu/ml/datasets/gene+expression+cancer+RNA-Seq
TechTC: can be found at http://gabrilovich.com/resources/data/techtc/techtc300/techtc300.html
COIL20: can be found at http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php
All processed datasets are available under dataset. You can use common/utils.py to load the processed datasets.
In the streaming setting, the columns of the data arrive in arbitrary order, so we fix the random order for each experiment trial.
The dataset used in the streaming setting can be found in dataset/<data>_rand_<random_index>.
You can also generate your own data with randomly permuted columns using dataset/gen_rand_permutation.py.
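The idea of fixing a random column order per trial can be sketched as follows. This is an illustration using only the standard library on a list-of-rows matrix; permute_columns is a hypothetical name, and dataset/gen_rand_permutation.py may work differently.

```python
import random


def permute_columns(matrix, seed):
    """Return (permuted_matrix, order) for a reproducible column shuffle.

    `matrix` is a list of equal-length rows; `order[j]` is the index of the
    original column placed at position j. The same seed always gives the
    same order.
    """
    n_cols = len(matrix[0])
    order = list(range(n_cols))
    random.Random(seed).shuffle(order)  # private RNG, global state untouched
    permuted = [[row[j] for j in order] for row in matrix]
    return permuted, order
```

Using one seed per trial keeps each stream reproducible while still varying the arrival order across trials.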

Result files (L1 error and wall-clock running time):

Streaming regular and greedy kCSS_1 results:
- Synthetic: streaming/str_synthetic_results
- Gene: streaming/str_gene_results
- TechTC: streaming/str_techtc_results
Distributed regular and greedy kCSS_1 results:
- Synthetic: protocol/synthetic_results
- Gene: protocol/gene_results
- TechTC: protocol/techtc_results
Baseline results:
- Synthetic: baselines/synthetic_svd, baselines/synthetic_uniform
- Gene: baselines/gene_svd, baselines/gene_uniform
- TechTC: baselines/techtc_svd, baselines/techtc_uniform
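As a rough illustration of the two quantities these files record, the sketch below measures an entrywise l1 error between a matrix and an approximation of it, and the wall-clock time of a call. Both helper names are hypothetical, and the exact error definition used in the experiments is the one implemented in common/l1_regression.py, which may differ from this simple residual norm.

```python
import time


def entrywise_l1_error(A, B):
    """Sum of |A[i][j] - B[i][j]| over all entries of two same-shape matrices."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(A, B)
        for a, b in zip(row_a, row_b)
    )


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```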

Plots

To plot all experimental results in the streaming setting, go to code_v2/streaming, and run python3 plot_stream.py
To plot all experimental results in the distributed setting, go to code_v2/protocol, and run python3 plot_results.py
