11hifish/robust_css

Streaming and Distributed Algorithms for Robust Column Subset Selection (CSS)

This repository contains the scripts and data for the experiments in the paper, as well as the experimental results.
Basic requirement: Python 3.7+.

Ray setup

The streaming and distributed algorithms are built on Ray (https://docs.ray.io/en/master/index.html), a high-level framework for parallel and distributed computing. This section describes how to set up and use Ray for streaming / distributed k-CSS_1 on AWS.

  • Step 1: Set up AWS. Install the following packages:

    pip install ray

    pip install boto3

    pip install awscli

    Configure AWS CLI credentials, following https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html

    (Configure ~/.aws/credentials and ~/.aws/config)
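Before launching a cluster, it can help to confirm that the credentials file is in place. The following is a minimal sketch, not part of this repository: it assumes the standard AWS CLI layout, in which ~/.aws/credentials has a [default] profile containing aws_access_key_id and aws_secret_access_key. The helper name missing_credential_keys is hypothetical.

```python
import configparser
from pathlib import Path

# Keys the AWS CLI expects in a credentials profile.
REQUIRED_KEYS = ("aws_access_key_id", "aws_secret_access_key")


def missing_credential_keys(path, profile="default"):
    """Return the required keys that are absent from the given profile."""
    parser = configparser.ConfigParser()
    parser.read(path)  # a non-existent file is skipped without an error
    if not parser.has_section(profile):
        return list(REQUIRED_KEYS)
    return [key for key in REQUIRED_KEYS if not parser.has_option(profile, key)]


if __name__ == "__main__":
    cred_path = Path.home() / ".aws" / "credentials"
    missing = missing_credential_keys(cred_path)
    if missing:
        print(f"missing in {cred_path}: {missing}")
    else:
        print("credentials file looks complete")
```

The check only inspects the file layout; it does not verify that the keys are valid for your AWS account.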

  • Step 2: Configure launch_template.yaml, which contains all the information Ray needs to set up a cluster of compute nodes on AWS. The cluster consists of one header node and several worker nodes. The header node acts as the coordinator for distributed k-CSS_1.

  • Step 3: Launch and manage the cluster via Ray. To launch the cluster, run

ray up launch_template.yaml

To connect to the header node, run

ray attach launch_template.yaml

After the k-CSS_1 protocol finishes, you can tear down the cluster by

ray down launch_template.yaml

Note that launching the cluster might take a few minutes as all dependencies for distributed k-CSS_1 will be installed.

After the cluster is successfully set up on AWS, you will see a _redis_password. Save this value, since Step 4 uses it.
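If you drive the cluster from a script rather than by hand, one option is to wrap the ray up and ray down commands from Step 3 so that teardown always happens, even when the job fails. This is a sketch under assumptions: the helper names ray_command and with_cluster are hypothetical and not part of this repository, and any confirmation prompts that ray up or ray down show still have to be answered in the terminal.

```python
import subprocess

CONFIG = "launch_template.yaml"


def ray_command(action, config=CONFIG):
    """Build the `ray <action> <config>` command used in this README."""
    if action not in ("up", "attach", "down"):
        raise ValueError(f"unsupported action: {action}")
    return ["ray", action, config]


def with_cluster(job, config=CONFIG, run=subprocess.run):
    """Launch the cluster, call `job`, and always tear the cluster down."""
    run(ray_command("up", config), check=True)
    try:
        return job()
    finally:
        # Runs even if `job` raises, so no AWS nodes are left running.
        run(ray_command("down", config), check=True)
```

The run parameter exists only so that the command sequence can be exercised without touching AWS, for example by passing a function that records the commands.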

  • Step 4: Run k-CSS_1 on the cluster launched by Ray.

First go to the header node via ray attach launch_template.yaml

Then set the _redis_password in the ray.init() calls in every file where Ray functions are used, e.g. code_v2/protocol/interaction.py.
(The _redis_password values currently in the files are the defaults, so there should be no need to change them unless your cluster reports a different password.)
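For orientation, a ray.init() call that joins an already running cluster from the header node might look like the sketch below. It is an illustration, not the code in interaction.py: the helper names are hypothetical, the password string is a placeholder, and whether address="auto" and _redis_password are accepted depends on the Ray version installed on the cluster.

```python
def ray_init_kwargs(redis_password, address="auto"):
    """Collect the keyword arguments for ray.init() on the header node."""
    return {"address": address, "_redis_password": redis_password}


def connect(redis_password):
    """Join the running cluster; call this on the header node only."""
    import ray  # imported here so the helper above works without Ray

    return ray.init(**ray_init_kwargs(redis_password))
```

On the header node you would call connect("<your _redis_password>") with the value saved in Step 3.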

An important note

Because Ray pickles the code it sends to the worker processes, the workers are not aware of any path added to sys.path, which can cause import errors when running the experiments.
To resolve this, run sudo python3 setup.py install before conducting the experiments.
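A quick way to confirm that the install step worked is to check that the packages can be found without any sys.path changes. The sketch below uses only the standard library; the helper name missing_modules is hypothetical, and the package names in the example are guesses based on the directory names under code_v2, so adjust them to whatever setup.py actually installs.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed package names; replace with the ones setup.py installs.
    candidates = ["common", "streaming", "protocol"]
    print("not importable:", missing_modules(candidates))
```

Run it from a directory outside the repository, so that a local folder on the path does not mask a missing install.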

Source code (code_v2) structure:

  • common: Common utilities for both distributed and streaming settings
    • common/lewis_weights.py: Lewis weights sampling
    • common/kCSS12.py: Regular Bi-criteria O(1)-approximate k-CSS_{1, 2} algorithm
    • common/kCSS12_greedy.py: Greedy k-CSS_{1, 2} algorithm
    • common/l1_regression.py: Computing L1 regression error (Note: this might take several hours)
    • common/generate_synthetic.py: Synthetic data generator
    • common/utils.py: Other utilities (e.g. loading data)
  • streaming: Experiments for the streaming setting
  • protocol: Experiments for the distributed setting
  • baselines: Baselines
  • active_learning: CSS as an active learning algorithm for noisy image classification task (active_learning/noisy_image_classification.py)

Streaming Setting (streaming):

  • streaming/process_stream.py: The streaming algorithm for k-CSS_1
  • streaming/random_stream.py: The random streaming baseline
  • streaming/conduct_exp.py: Conduct experiments with the streaming k-CSS_1 algorithm
  • streaming/conduct_random_stream.py: Conduct experiments with the random baseline

Distributed Setting (protocol):

  • protocol/interaction.py: The distributed protocol for k-CSS_1
  • protocol/conduct_exp.py: Conduct experiments with the distributed protocol
  • Note: the random baseline for the distributed setting is under baselines

Baselines (baselines):

  • baselines/svd.py: The SVD baseline. Note that SVD is a deterministic algorithm
  • baselines/uniform_distributed.py: The random baseline for the distributed setting
  • baselines/simple_kcss2.py: A simple CSS algorithm in the Frobenius norm

Datasets:

Result files (L1 error and wall-clock running time):

  • Streaming regular and greedy k-CSS_1 results:
    • Synthetic: streaming/str_synthetic_results
    • Gene: streaming/str_gene_results
    • TechTC: streaming/str_techtc_results
  • Distributed regular and greedy k-CSS_1 results:
    • Synthetic: protocol/synthetic_results
    • Gene: protocol/gene_results
    • TechTC: protocol/techtc_results
  • Baseline results:
    • Synthetic: baselines/synthetic_svd, baselines/synthetic_uniform
    • Gene: baselines/gene_svd, baselines/gene_uniform
    • TechTC: baselines/techtc_svd, baselines/techtc_uniform

Plots

  • To plot all experimental results in the streaming setting, go to code_v2/streaming, and run python3 plot_stream.py
  • To plot all experimental results in the distributed setting, go to code_v2/protocol, and run python3 plot_results.py
