Code for simulations in paper "Actively Purchasing Data for Learning" by Abernethy, Chen, Ho, Waggoner


Code for paper "Actively Purchasing Data for Learning"
Jacob Abernethy
Yiling Chen
Chien-Ju Ho
Bo Waggoner

-- Overview
Simulations use the MNIST handwritten digit dataset.
Pairs (datapoint, cost) arrive one by one.
The mechanism posts a price for each datapoint;
if the price is at least the cost, it pays the cost
and obtains the datapoint.
There is a budget constraint B.
After all the training data has arrived, the mechanism
outputs a hypothesis h, whose error rate is measured on
the test set.
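The purchasing loop described above might be sketched as follows (a minimal sketch; the function and argument names here are hypothetical, not the repo's actual API):

```python
def run_mechanism(data, budget, post_price, learn):
    """Sketch of the data-purchasing loop.

    data: iterable of (datapoint, cost) pairs arriving one by one.
    budget: total budget B.
    post_price: function(datapoint) -> posted price.
    learn: function(purchased datapoints) -> hypothesis h.
    """
    purchased = []
    spent = 0.0
    for x, cost in data:
        price = post_price(x)
        # Buy the point if the posted price covers the cost
        # and the remaining budget allows the payment.
        if price >= cost and spent + cost <= budget:
            spent += cost
            purchased.append(x)
    return learn(purchased)
```

In the actual mechanisms the posted price depends on the learner's state (e.g. the norm of the loss gradient at the point), but the budget-feasibility check has this shape.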

-- Prerequisites

- python 2.7
- numpy
- scipy and scikit-learn (for downloading the data set conveniently)
- matplotlib for plotting results of simulations

-- Running

$ python

Manually edit the file "" to control the number of trials, the parameters, etc.
The first time it runs, the script downloads the MNIST training dataset to ./mldata/mnist-original.mat.
After running, it writes the results to the file "".

$ python

visualizes the results.

-- Overview of the files

The main script does some number of trials for each mechanism, each method of generating costs, and each budget. It prints the data to a file which can also be run as a python file to visualize the data. The main script interacts with three types of classes: Algorithms (currently only gradient descent), Mechanisms (can be initialized with access to any algorithm), and Cost Generators (draw costs for a given data point according to some distribution).

   -- Algorithms implement methods: reset(eta), test_error(X,Y), norm_grad_loss(x,y), data_update(x,y,importance_wt), null_update(). The one implementation is the gradient descent algorithm.
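To illustrate that interface, here is a minimal sketch of a class with those methods for a linear model with squared loss (only the method names come from the repo; the model, loss, and internals are assumptions for illustration):

```python
import numpy as np

class GradientDescentSketch(object):
    """Illustrative sketch of the Algorithms interface (not the repo's code)."""

    def __init__(self, dim):
        self.dim = dim
        self.reset(eta=0.1)

    def reset(self, eta):
        # Restart learning with step size eta.
        self.eta = eta
        self.w = np.zeros(self.dim)

    def _loss_grad(self, x, y):
        # Gradient of the squared loss 0.5*(w.x - y)^2 with respect to w.
        return (self.w.dot(x) - y) * x

    def norm_grad_loss(self, x, y):
        # Norm of the loss gradient at (x, y); a natural measure of how
        # much the point would move the current hypothesis.
        return float(np.linalg.norm(self._loss_grad(x, y)))

    def data_update(self, x, y, importance_wt):
        # Importance-weighted gradient step on a purchased point.
        self.w -= self.eta * importance_wt * self._loss_grad(x, y)

    def null_update(self):
        # Step taken when no point is purchased; a no-op for plain SGD.
        pass

    def test_error(self, X, Y):
        # Fraction of test points whose sign is predicted incorrectly.
        preds = np.sign(X.dot(self.w))
        return float(np.mean(preds != np.sign(Y)))
```

The importance weight in data_update lets the mechanism correct for the probability with which a point was purchased, keeping gradient estimates unbiased.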

  -- Mechanisms implement methods: reset(eta, T, B, cmax=1.0), train_and_get_err(costs,Xtrain,Ytrain,Xtest,Ytest). Implementations: one has unlimited budget, so it buys every data point; one offers a fixed price to each data point until the budget is exhausted; one is a generalization of RothSchoenebeck12 assuming that costs are uniform[0,cmax]; and one is our posted-price and learning mechanism.
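The fixed-price baseline among these could look roughly like the following (a sketch; fixed_price_mechanism and its signature are illustrative, not the repo's actual train_and_get_err, and alg is any object with the data_update/null_update interface):

```python
def fixed_price_mechanism(price, budget, costs, Xtrain, Ytrain, alg):
    """Offer one fixed posted price to every arriving point until the
    budget runs out; update the learner only on purchased points."""
    spent = 0.0
    for x, y, c in zip(Xtrain, Ytrain, costs):
        if price >= c and spent + c <= budget:
            # Point accepted the price and the budget covers its cost.
            spent += c
            alg.data_update(x, y, importance_wt=1.0)
        else:
            alg.null_update()
    return spent
```

The other mechanisms replace the constant price with a state-dependent one, but share this accept/skip structure.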

  -- Cost Generators implement methods: normalize(num_examples[]), draw_cost(label). In one, costs are uniform [0,1] marginally; some labels can be specified as "expensive" and others as "cheap", in which case expensive-label costs are uniform [a,1] and cheap-label costs are uniform [0,a] for some a. In another, costs are free with probability 1-p and cost 1 with probability p; expensive and cheap labels can again be specified, and expensive labels will cost 1 when possible subject to the (p,1-p) distribution.
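The first of these cost distributions can be sketched in a few lines (an illustrative helper, not the repo's draw_cost; the expensive_labels and rng parameters are assumptions):

```python
import random

def draw_cost(label, expensive_labels, a=0.5, rng=random):
    """Uniform costs split by label: expensive-label costs are
    uniform on [a, 1], cheap-label costs are uniform on [0, a]."""
    if label in expensive_labels:
        return rng.uniform(a, 1.0)
    return rng.uniform(0.0, a)
```

With a = 0.5 and no labels marked expensive, this reduces to the plain uniform [0, 0.5] case; mixing expensive and cheap labels in equal proportion recovers a uniform [0,1] marginal.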