A Distributed Computation Platform in JavaScript
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
.ebextensions
AWSSetupScripts
blueprints
db
static
templates
tests
.env
.gitignore
CampaignController.py
DatasetController.py
Mailer.py
Procfile
README.md
ResultsController.py
UsersController.py
application.py
config.py
dataset_generation.py
helpers.py
models.py
requirements-macosx.txt
requirements.txt
setup-macosx.sh
start_dev_server.sh
start_websocket_server.sh
supervisord.conf
websocket_server.py
wsgi.conf

README.md

Distribute.js

Distribute.js was developed by Kyle Moore, Lennon Chimbumu, Jacob Haven, and Stephanie Young as part of the Stanford CS Deparment's Senior Project program.

A comprehensive, multi-page wiki of this document can be found at https://github.com/Distribute-js/Distribute.js/wiki.

Overview

Introduction

Distribute.js is a distributed computing system that runs in web browsers. We aim to provide easy "crowd-sourcing" of computations without requiring the download of a native application or signing up for a service. Websites will embed a job in their site and as long as the user has that browser window open, we will be able to distribute computations to that browser.

Distributed computing projects like Folding@Home and GIMPS allow users to donate their excess CPU capacity towards computationally intensive problems by downloading and running software on their computer. These volunteer distributing computing projects currently allow regular users to contribute their unused computer time to good causes. Our project will simplify this process by allowing users to contribute to distributed computation projects just by loading a web page in their browser. Distribute.js allows our customers, the operators of these projects, to leverage the ubiquitous computing power provided inside users' browsers. This minimizes setup costs imposed on non-technical users that otherwise discourage them from participating.

The goal is to provide an easy-scalable distributed system that allows customers to take advantage of distributed computing power. Distribute.js's main advantages are that it can be easily delivered to computers and is able to accomplish jobs by leveraging "wasted" computational browser power. Users are able to contribute their computational power in a way that does not require setup and can easily be opened and closed. By making distribution of the computations occur over the browser, Distribute.js is able to reach users through the Internet, creating for ourselves a large network of available computers. We then leverage that network's wasted resources in order to accomplish meaningful tasks.

For more information, see our original proposal document.

Development Environment Setup

Setup for OS X

Note: You must have Python/Pip installed. Using Homebrew, you may install these using brew install python. Note: you must ensure that /usr/local/bin comes before /usr/bin in your shell $PATH in order ensure Homebrew's installed version of Python is preferred the system-supplied version of Python.

  1. Clone repo from GitHub: git clone git@github.com:Distribute-js/Distribute.js.git

  2. Run the setup script for OS X: cd Distribute.js; ./setup-macosx.sh

  3. Start a local test server: ./start_dev_server.sh

  4. The server is will be accessible at http://localhost:5000

Glossary

  • Campaign: The overarching computation job a user wishes to perform. A campaign consists of computation kernel, the datasets this computation will be run on, and the results returned from the user's browsers.

  • Kernel: The actual JavaScript code the user wishes to distribute across browsers. We treat this as a black box, just exposing main and loadStaticDataset methods.

  • Datasets: A campaign may have both a large static dataset shared by all workers as well as small individually distributed datasets that serve as input to the kernel's main method.

  • Results: Each dataset can potentially produce a result, these are stored as part of the campaign and a user can download the results for their campaign individually or as a large group.

  • Web workers: Technology used to perform computationally expensive, sandboxed tasks in the background of webpage without interrupting that page's user interface. Essentially threads in JavaScript.

  • MessagePack: Technology used to efficiently serialize and compress JavaScript objects for transmission over the wire.

  • WebSockets: Technology used to set up persistent connections between Distribute.js servers and individual browser workers. Used to transfer kernels, datasets, and results.

Components

Web Server

Starting the web server

In order to be fully functional on development, our app requires that the main application server, the mail server and the websocket server be running. We use honcho to handle all these processes. Simply run honcho start. Alternatively, you can open up the Procfile in the project root and run the commands in there manually.

Design

From the beginning we wanted our webserver to be simple to setup and use, so we decided to use the flask framework. We use followed an MVC design paradigm and our project structure is as follows:

Models

We maintain a models.py file where all of our information about our database is stored. We store these as classes and use sqlachemy as our ORM system. We opted for am ORM system because it offers a convenient abstraction for all our SQL commands.

Our Application has the following models:

User

Field Type
id (Primary Key) Integer
first_name String
last_name String
password String
email String
password_reset_token String
password_reset_sent_at DateTime

Dataset

Field Type
id (Primary Key) Integer
user_id (Foreign Key) Integer
campaign_id (Foreign Key) Integer
data BLOB
computed Boolean

Result

Field Type
id (Primary Key) Integer
campaign_id (Foreign Key) Integer
dataset_id (Foreign Key) Integer
data BLOB
date_added DateTime

Campaign

Field Type
id (Primary Key) Integer
user_id (Foreign Key) Integer
date_added DateTime
deleted Boolean
kernel BLOB
static_dataset BLOB
dataset_gen BLOB

In addition, we used the database migration tool alembic to perform database migrations across different machines. Unfortunately, we could not quite get it to work correctly in our Production environment so we removed all of the associated code from our final submission.

Views

We used the Jinja2 templating system to create our views. All of our views are in the templates folder and correspond with our routes below.

Controllers

With a few exceptions, we kept a controller file for each model we had. This allowed us to keep our code cleanly partitioned. We have the following controllers:

  • CampaignController
  • DatasetController
  • ResultsController
  • Users Controller
  • Mailer

Routes

We used Blueprints to handle all of our routing. Blueprints allowed us to do two things. First, we could decouple our routes from the main application.py file. This gave us a slightly tidier and easy to maintain code base. Second, it helps us manage circular imports.

We divided our routes as follows:

Static Routes

  • / goes to the homepage.
  • /how_it_works goes to a description of our application
  • /benchmarks goes to the benchmarks page

Campaign Routes

  • /campaign/all: GET Displays management page for all of the user's campaigns.
    • /campaign/new:
      • GET Returns form to upload JavaScript kernel and dataset generator files.
      • PUT Stores uploaded kernel and dataset generator. Returns 303 redirect to /campaign/:id
    • /campaign/:id GET: Shows details of an individual campaign
    • /campaign/:id/delete Performs a soft delete for a given campaign id. Returns 303 redirect to
    • /campaign/:id/execute GET Executes a sandboxed version of your kernel in your browser.
    • /campaign/:id/results GET Returns a set of results given a campaign id and displays them on the UI
    • /campaign/:id/results/:reuslt_id GET Gets a specific result from all of the computed result sets
    • /campaign/:id/results/raw GET Gives a msgpacked version of the results
    • /campaign/:id/datasets/reset GET Deletes all of the results associated with a particular campaign

User Routes

  • /signin:
    • GET Returns a form to sign in
    • POST Attempts to sign in a user. Returns appropriate error messages if signin has failed
  • /signout : POST clears the users session. We use a POST method here because modern browsers tend to prefetch all GET requests so this would result in the user's session getting randomly deleted.
  • /signup:
    • GET Returns a form that captures the user's information
    • POST Attempts to create a new user record. This also checks a CAPTCHA to guard against automated bots.
  • /change_password: POST Allows the user to change their password.
  • /reset_password :
    • GET Checks if the provided reset token is valid. If it is, it allows the user to change their password.
    • POST Sets a new user password
  • /send_reset_password : POST Sends a password reset token to the provided email address.
  • /delete_account : POST Allows the user to delete their account and all the data associated with it.

Websocket Server

Overview

Our kernel-running harness (run_kernel.js) starts off by making a WebSocket connection to our server. We use this WebSocket to transfer all of the kernel resources (the kernel itself, the static dataset, and all of the datasets) to the browser. We chose to use WebSockets so we would not have to make a new request every time the browser finished a dataset. This eliminates the latency of establishing a new HTTPS connection for every dataset and increases the speed at which the browser can receive datasets. Since network latency is a significant part of the Distribute.js overhead, reducing this latency can significantly speed up computations.

To handle WebSocket connections, we run a separate server alongside our Flask server. We use Tornado to run this server, because it is a lightweight, event-driven server that specializes in handling a large number of concurrent connections. Since every browser running a campaign will be communicating with the WebSocket server, it is important that this server be lightweight and able to scale well.

Protocol

Overview

A typical communication from the kernel harness (in the browser) to the WebSocket server looks like this:

  1. The client initiates a WebSocket connection to our server.
  2. The client sends a campaign id to the server (or -1, if it wants a random campaign), the server will subsequently serve datasets from this campaign for the rest of the connection.
  3. The server sends back the kernel and static dataset associated with the campaign.
  4. The server sends the first (random) dataset to the client.
  5. The client runs the kernel on the dataset the kernel either returns a result, or does not. 6a. If the kernel returns a result, the client sends the result to the server, the server stores the result, marks the dataset as computed, and sends a new dataset to the client. 6b. If the kernel does not return a result, the client requests a new dataset, the server marks the dataset as computed, and sends a new dataset to the client.
  6. The client computes with the new dataset and the cycle repeats.
  7. When there are no more datasets to compute, the server closes the connection and the kernel-harness exits.

Specifics

These are the specific messages sent by the client and server and there meanings.

Client Messages:

  • iN where N is number: this is the first message sent by the client when the WebSocket is opened. If N > -1, it tells the WebSocket server to serve kernel/datasets from the campaign with id N. If N == -1, the server picks a random dataset.

  • r: a binary frame will follow with the computed result for the last sent dataset. The server should store the subsequent result, mark the last sent dataset as complete, and send a new dataset.

  • n: the last sent dataset produced no result, the server should mark the last sent dataset as complete and send a new dataset.

Server:

  • k: the JavaScript source of a kernel (as a string) will follow

  • s: the static dataset file uploaded for the current campaign will follow.

  • d: a binary frame containing a dataset will follow.

Dataset Generation Harness

Motivation

In order to perform useful computations, we must provider our workers with datasets on which to run. For most distributed jobs we envision, these datasets are easily generated programmatically. It thus makes sense to allow our developers to simply upload a description of how to generate their data.

The easiest way we saw to allow this functionality was to allow developers to upload a "dataset generator" JavaScript function in the same format as their computation kernel. On our end, we spawn off sandboxed processes that use the PyV8 Python-JavaScript bridge run this generator, serialize and compress the returned datasets using Message Pack, and store them to our central database. We run this generation in a separate process from our web server (and we sandbox it in a similar way to the kernels) in case their generator scripts have errors or are malicious. We automatically time out generation after 5 seconds, and any errors in their script cannot crash our web server.

Kernel Harness

Overview

The kernel harness was both the most important and most challenging part of the project. The main purpose of the harness is to download and load the kernel and then arbitrate between the kernel and our server. The tasks can be broken up into three main parts: network operations, running the kernel, and sandboxing. These three parts are explained in the following section.

Kernel Harness Responsibilities

  1. Network Operations

    The kernel harness handles all of the communication between our presence in the user's browser and our server. This involves downloading the kernel and the static dataset, and then providing the kernel with a constant supply of new datasets on which to compute. Additionally, the harness sends the computed results back to our server. For the specifics of this communication, see the WebSocket Server section.

  2. Running the Kernel

    The harness is in charge of actually running the kernel as well as acting as a translation layer between whatever data format we use to store information, and the format the kernels expect. This allows us to keep a consistent API for the kernels. For example, the main() method of the kernel just takes in a plain JavaScript object as an argument. Because we handle serializing and de-serializing the datasets, we are free to use whatever format we want to store and transmit them (in this case, we chose msgpack). More importantly, we are free to change anything on our end, as long as we keep the kernel API consistent. We use the same technique for results -- the kernel just returns a JavaScript object, and we handle serializing and storing it for them.

  3. Sandboxing the Kernel

    Because we are running the kernels (which are unreviewed, potentially malicious code from arbitrary sources) on other people's websites, we have to be careful not to introduce security vulnerabilities to those websites, or allow the kernels to impact the websites' usability. The harness does this in a variety of ways. First, in order to not block rendering/JavaScript execution/user interaction on the page, we run the kernels in a WebWorker (essentially a separate thread that can be started from JavaScript). This has the additional advantage of preventing the kernels from interacting with the page (because WebWorkers do not have access to the DOM) or from crashing/overloading the websites' scripts (because WebWorkers have their own JavaScript interpreter that is separate from the main interpreter). However, even inside a WebWorker, due to the way JavaScript works, the kernels would still have the ability to make network requests, access cookies, or override our harness, potentially giving them the ability to break out of their sandbox. In order to mitigate this risk, we evaluate the kernel source inside of a wrapper function that defines arguments with the same names as all of the variables currently in scope and sets these argument values to undefined (see run_kernel.js lines 107-109). Because of the way JavaScript does scoping (local variables/arguments override global variables of the same name) this prevents the kernel from having access to any global variables, preventing it from accessing the network or modifying the harness.

    Finally, we run the kernel harness inside of an <iframe> hosted from our domain. Due to the same-origin policy, this prevents the kernels (if they happen to break out of our sandbox) from accessing any data from the host website (since the host website will not be hosted at distributejs.org). This prevents the kernels from accessing user cookies or localStorage or from scraping (potentially sensitive) information from the website they are on.

Kernel Harness Chalenges

The kernel harness was one of the most challenging parts of this project. Even though it is not very many lines of code, it's unique requirements forced us to put a lot of thought into how we wrote it. Some of the main challenges we faced were:

  • Performance and Usability: We could not have our code negatively impact the user experience, but at the same time, we want to use as much of the user's computational capabilities as possible. It was difficult to find a system that would let us ensure the kernel couldn't block or slow down the main page without severely crippling our ability to do computations.

  • Sandboxing and Security: JavaScript is an inherently trusting and open language. There is no built in way to limit the effect a function can have. One of the major issues was with the way JavaScript handles closures. Because child functions (functions defined inside of other functions) automatically inherit the entire scope of their parent, it was difficult to run untrusted code without giving it full access to the code that was trying to keep it contained. In the end, as described above, we had to use JavaScript's closures to create a dummy, empty scope that overwrote any variables that we didn't want the kernel to have access to. This is an inherently fragile approach, because any variables that we miss are potential vulnerabilities. Unfortunately, based on our research and testing, there is no better way to do sandboxing without language support. Mozilla has a proposal for it (https://developer.mozilla.org/en-US/docs/Components.utils.Sandbox) but it does not currently have browser support, and it is unclear if it will ever exist.

  • Network Latency: It was critical to design a harness that minimized network latency (because it significantly reduces our computational throughput) without becoming overly complicated or using too many resources on either the client or server side. We settled on a WebSocket approach because it eliminates the hundreds of separate requests that would otherwise be necessary, while still being easy for our server to handle.

Campaign Tutorial / Kernel API Reference

This guide will briefly go over how to create the necessary files for a new campaign, and explain the API for the kernel and dataset generator.

Required Files

Campaigns require three files: a kernel, a static dataset, and a dataset generator.

  1. Kernel

    The kernel is the main workhorse of the campaign. It defines a function that takes in a dataset, performs some useful computation, and potentially returns a result. A kernel must:

    • Be a single JavaScript file

    • Evaluate to an object with two fields: main and loadStaticDataset

      • loadStaticDataset must be a function that takes in a string as an argument. This string will be the contents of the static dataset file that you upload when you create the campaign, so it's format is up to you. You have access to JSON.parse inside of the loadStaticDataset function, so one way to store the static dataset is as a JSON file. For more information on the static dataset, see the section below.

      • main must be a function that takes in a dataset (which is a plain JavaScript object, and is one of the objects output from your dataset generator) and potentially returns a result. If your main function returns null or undefined or does not return anything, then no result will be stored. If it returns anything else, the return value will be encoded and stored. These values are accessible in the results page of your campaign.

  2. Static Dataset

    The static dataset is a simple text file that can contain any arbitray data (or be empty). The contents of the static dataset will get passed (unmodified) to the loadStaticDataset method of your kernel. The static dataset is a good place to put data that is constant for the entire computation. As a general rule, anything that would be the same in all of the datasets belongs in the static dataset. Moving this type of data to the static datset reduces the transfer time of the datasets which speeds up your computation.

  3. Dataset Generator

    The dataset generator gets run when you upload your dataset, it is what produces the datasets that will get passed to the main function of your kernel. A dataset generator must:

    • Be a single JavaScript file

    • Evaluate to an object with one field: main

      • main must be a function that takes no arguments and returns an array containing all of the datasets that your kernel will be run on. Each of these datasets should be a plain JavaScript object. Your dataset generator will only be allowed to run for 5 seconds, so it must compute the datasets efficiently.

Example Campaign

This section will walk you through an example campaign. In this case, we want to find all of the numbers between 0 and 10,000 that are divisible by 123. This is clearly a contrived example, but it will demonstrate all of the required aspects of a campaign.

General Idea

We want to partition the problem space (all of the numbers in the range [0,10000)) into a smaller number of "datasets" that we can process with our kernel. In this case, the obvious choice is to break the range up into a set of smaller ranges (perhaps 100 of these ranges), and then have the kernel process each range individually. Aditionally, because the divisor is constant across all datasets, we will use the static dataset to store the divisor.

Dataset Generator

(function() {
  function main() {
    var datasets = [];

    for (var i = 0; i < 10000; i += 100) {
      datasets.push({ min: i, max: i+100});
    }

    return datasets;
  }

  return {main: main};
})();

This generator returns an array of 100 objects, where each object has a min and max field that together specify the subrange represented by that dataset. Notice how the entire main function is wrapped in an immediately-executed function that returns main. This is to satisfy condition 2 of the dataset generator listed above.

Static Dataset

{
  'divisor': 123
}

Since the format of the static dataset is up to us, we can make it's contents whatever we want. In this situation, I chose to use JSON. For such a simple example, JSON is overkill, but it can be useful for more complex static datasets.

Kernel

(function() {
  var divisor;

  function loadStaticDataset(dataset) {
    divisor = JSON.parse(dataset).divisor;
  }

  function isDivisible(n) {
    return (n % divisor === 0);
  }

  function main(dataset) {
    var results = [];

    for (var i = dataset.min; i < dataset.max; ++i) {
      if (isDivisible(i)) {
        results.push(i);
      }
    }

    if (results.length > 0) {
      return results;
    } else {
      return null;
    }
  }

  return { main: main, loadStaticDataset: loadStaticDataset };
})();

In this simple kernel, we load the divisor from the static dataset and store it in a variable. Then we use the information from our dataset to compute our sub-problem. Finally, in order to keep the number of results we have to deal with small, we only return a result if we actually found numbers that were divisible by 123.

After uploading and running this campaign, we will be able to download the result set, which will contain all of the numbers between 0 and 10,000 that are divisible by 123

Other Considerations

Feasibility

Benchmark Method

Our benchmarks are based on a Primality Test and a Fibonacci number generator. We wrote two reference solutions for each program, one in C and the other in JavaScript. These programs are intentially written to perform large amounts of computation. We then ran these programs 169 times on an Amazon EC2 Micro Instance and using Distribute.js (with the backend running on an EC2 Micro Instance) distributed across 10 networked computers. The results are listed below, lower times are better.

See the data at https://distributejs.org/benchmarks

Results Overview

The results of our benchmarks were quite encouraging for the feasibility of our project. We found that running a computation through our framework incurred a 10-20% overhead over running the same computation with Node.js on the same computer and about a 50% overhead over running the same computation written in C.

However, due to the distributed nature of our framework and the low resource requirement on the backend, we found that we were able to significantly reduce the computation time for a project over what would be possible if you simply ran the computation on the backend server. For example, the prime test benchmark we ran took 531 seconds to run on an EC2 Micro Instance (running the C code, compiled with -O3). However, that same benchmark, run in JavaScript with our framework distributed over 10 average consumer browsers took only 37 seconds. Most importantly, the backend (supporting the 10 computers) was running on an EC2 Micro Instance, and was only taking up about 7% of the instance's resources (indicating that a single EC2 Micro Instance could support ~140 simultaneous clients performing a computation. This is a very significant improvement in performance compared to what could otherwise be achieved with the same backend server, and indicates that our project is a potentially viable way to perform distributed computing.

Future Work

In the 10 weeks we have been working on this project, we were able to create a useful product, with all of the base functionality set forth in our original proposal. This project is fundamentally very open ended however. As we move move forward, we plan to address a few remaining issues and continue to develop many possible extensions.

Short-term goals

  • OpenID integration: Using Flask-OpenID we should be able to provide account creation using GitHub/Google/other OpenID services.

  • Better handling of HTTP sessions: Currently, HTTP sessions are handled by storing the signed userid in a cookie. In order to limit potential session replay attacks, we should instead store a session token mapping in-memory on the webserver that is invalidated after the session ends, the user logs out, or deletes their account. In order to defend against potential XSS attacks, we should also serve this as an HTTP only cookie.

  • Enable Strict Transport Security: In order to mitigate potential sslstrip attacks, we should enable the Strict Transport Security header in Apache.

  • Better handling of dataset generators: Currently, all datasets produced by a campaign are computed at the time of its creation. For campaigns that will generate a large (or indefinite) amount of datasets, this isn't practical (because of the inordinate amount of database I/Os). Instead, we may generate datasets in lock-step with results being returned.

  • Separation of web servers and websocket server machines: By using AWS Elastic Beanstalk, we are able to automatically scale the size of our web server fleet based on the traffic we receive. Currently, the websocket servers are hosted on these same EC2 instances and are thus scaling at the same rate. Using the EBS api it would be relatively straightforward to create separate "worker" machines to run these bandwidth and database intensive websocket servers independent of our main web fleet.

Long-term goals

  • Comprehensive campaign creation SDK: In order to be useful to the broadest audience of potential developers

  • Interface for debugging errors in kernel execution: With some additional on our websocket protocol, we will be able to catch exceptions from malfunctioning kernels and display them through the campaign console.

  • Metrics on active computations: It would be useful for campaign managers, and other interested parties to see live statistics on the number of currently computing nodes, as well as the overall throughput of the system. As we maintain a WebSocket connection with each worker, this is quite feasible. Storing these statistics would require some additional database overhead, in order to enable communication between our WebSocket and HTTP servers.

  • Measurement of campaign compute-time: In order to effectively schedule computation across multiple campaigns, it would be desirable to a common unit of the browser "compute-time" that is provided by all of the worker nodes in our network. This will depend on the previous item for maintaining overall statistics of worker connections and response rates, and some additional tracking these for each individual campaign.

  • Verifying results and curtailing malicious workers: Currently, we rely on our campaign managers to monitor and verify that the results being computed for their campaigns are correct. In the future, it would be preferable to identify and independently track problematic users.

  • Migration to nginx: Nginx has better support than Apache for the large number of small HTTP requests our application receives. The switch-over will require changes to our AWS setup, and is currently at a low priority as our Apache + Tornado configuration is currently serving our needs.

Acknowledgments

Special thanks for Francois Chaubard and Marty Stepp for feedback on project design an ongoing development as part of the Stanford CS Department's Senior Project program.

And great things to Amazon Web Services for their Education Research Grant program, which made it possible to access the full power of their services as students.