Sample weighting to alter likelihood of samples

BatchUp defines samplers that are used to generate the indices of samples that should be combined to form a mini-batch. They are defined in the :py:mod:`.sampling` module.

When constructing a data source (e.g. :py:class:`.ArrayDataSource`) you can provide a sampler that will control how the samples are selected.

By default, one of the standard samplers (:py:class:`.StandardSampler` or :py:class:`.SubsetSampler`) will be constructed if you don't provide one.

If you want some samples to be drawn more frequently than others, construct a :py:class:`.WeightedSampler` and pass it as the ``sampler`` argument to the :py:class:`.ArrayDataSource` constructor. In the example below, the per-sample weights are stored in ``train_w``.

import numpy as np
from batchup import data_source, sampling

sampler = sampling.WeightedSampler(weights=train_w)

ds = data_source.ArrayDataSource([train_X, train_y], sampler=sampler)

# Drawing batches of 64 elements in random order
for (batch_X, batch_y) in ds.batch_iterator(
        batch_size=64, shuffle=np.random.RandomState(12345)):
    # Process batches here...
    pass

Note that in-order iteration is NOT supported when using :py:class:`.WeightedSampler`, so ``shuffle`` cannot be ``False`` or ``None``.
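To make the effect of the weights concrete, here is a minimal NumPy-only sketch of weighted index sampling. It is not BatchUp's implementation: the helper name ``weighted_index_batches`` is hypothetical, and the sketch assumes that indices are drawn with replacement in proportion to the normalised weights.

```python
import numpy as np


def weighted_index_batches(weights, batch_size, n_batches, rng):
    """Yield batches of sample indices drawn in proportion to `weights`.

    Hypothetical helper for illustration only; not part of BatchUp.
    """
    weights = np.asarray(weights, dtype=float)
    # Normalise the weights so that they form a probability distribution
    p = weights / weights.sum()
    for _ in range(n_batches):
        # Sample with replacement: heavily weighted samples can repeat
        yield rng.choice(len(p), size=batch_size, replace=True, p=p)


rng = np.random.RandomState(12345)
# Sample 0 is nine times as likely to be drawn as sample 1;
# sample 2 has zero weight and is never drawn
train_w = np.array([9.0, 1.0, 0.0])
batches = list(weighted_index_batches(
    train_w, batch_size=64, n_batches=50, rng=rng))
# Count how often each sample index was drawn across all batches
counts = np.bincount(np.concatenate(batches), minlength=3)
```

Under this scheme every index is drawn at random, so there is no natural in-order traversal of the data, which fits the shuffling requirement noted above.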

To draw from a subset of the dataset, use :py:class:`.WeightedSubsetSampler`:

import numpy as np
from batchup import data_source, sampling

# NOTE that the weights parameter is called `sub_weights` (rather
# than `weights`) and that it must have the same length as `indices`.
sampler = sampling.WeightedSubsetSampler(sub_weights=train_w[subset_a],
                                         indices=subset_a)

ds = data_source.ArrayDataSource([train_X, train_y], sampler=sampler)

# Drawing batches of 64 elements in random order
for (batch_X, batch_y) in ds.batch_iterator(
        batch_size=64, shuffle=np.random.RandomState(12345)):
    # Process batches here...
    pass
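The relationship between ``sub_weights`` and ``indices`` can be sketched with plain NumPy. This is an illustration under assumptions rather than BatchUp's own code: the data are made up, and the sketch assumes that a weighted subset draw picks positions within the subset and then maps them back to dataset indices.

```python
import numpy as np

rng = np.random.RandomState(12345)

# Hypothetical data: 10 samples with one weight each
train_w = np.linspace(1.0, 10.0, 10)
# Indices of the samples that make up the subset
subset_a = np.array([1, 4, 7])

# One weight per subset entry, in the same order as `subset_a`
sub_weights = train_w[subset_a]
assert len(sub_weights) == len(subset_a)

# Draw positions within the subset, then map them to dataset indices
p = sub_weights / sub_weights.sum()
positions = rng.choice(len(subset_a), size=64, replace=True, p=p)
batch_indices = subset_a[positions]
```

Indexing ``train_w`` with ``subset_a`` is what keeps the two arrays the same length and in the same order.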

The alternative constructor :py:meth:`.WeightedSampler.class_balancing_sampler` builds a weighted sampler that compensates for class imbalance:

# Construct the sampler; NOTE that the `n_classes` argument
# is *optional*
sampler = sampling.WeightedSampler.class_balancing_sampler(
    y=train_y, n_classes=train_y.max() + 1)

ds = data_source.ArrayDataSource([train_X, train_y], sampler=sampler)

# Drawing batches of 64 elements in random order
for (batch_X, batch_y) in ds.batch_iterator(
        batch_size=64, shuffle=np.random.RandomState(12345)):
    # Process batches here...
    pass
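One common way to derive class-balancing weights is to weight each sample by the reciprocal of its class count. The sketch below shows that idea in plain NumPy. The helper ``inverse_frequency_weights`` is hypothetical, and BatchUp may compute its weights differently (for example with a different normalisation).

```python
import numpy as np


def inverse_frequency_weights(y, n_classes=None):
    """Weight each sample by the reciprocal of its class count.

    Hypothetical helper for illustration; BatchUp may compute its
    class-balancing weights differently.
    """
    y = np.asarray(y)
    if n_classes is None:
        # Infer the number of classes from the largest label
        n_classes = int(y.max()) + 1
    counts = np.bincount(y, minlength=n_classes)
    # Classes that never occur get zero weight (avoids division by zero)
    class_w = np.zeros(n_classes, dtype=float)
    present = counts > 0
    class_w[present] = 1.0 / counts[present]
    # Look up the weight of each sample's class
    return class_w[y]


# Imbalanced labels: eight samples of class 0, two of class 1
train_y = np.array([0] * 8 + [1] * 2)
weights = inverse_frequency_weights(train_y)

# The total weight per class is equal, so each class is equally
# likely to be drawn even though class 0 has four times the samples
per_class_total = np.bincount(train_y, weights=weights)
```

With these weights, a rare class contributes as much total probability mass as a common one, so batches drawn in proportion to them are balanced across classes on average.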

The :py:meth:`.WeightedSampler.class_balancing_sample_weights` helper method constructs an array of sample weights in case you wish to modify the weights first:

weights = sampling.WeightedSampler.class_balancing_sample_weights(
    y=train_y, n_classes=train_y.max() + 1)

# Assume `modify_weights` is defined above
weights = modify_weights(weights)

# Construct the sampler and the data source
sampler = sampling.WeightedSampler(weights=weights)
ds = data_source.ArrayDataSource([train_X, train_y], sampler=sampler)

# Drawing batches of 64 elements in random order
for (batch_X, batch_y) in ds.batch_iterator(
        batch_size=64, shuffle=np.random.RandomState(12345)):
    # Process batches here...
    pass
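The ``modify_weights`` function used above is left for you to define. As one possible example, the sketch below softens class balancing by blending the weights with uniform weights. Both the function body and the ``strength`` parameter are assumptions made for illustration, not part of BatchUp.

```python
import numpy as np


def modify_weights(weights, strength=0.5):
    """Blend `weights` with uniform weights (hypothetical example).

    strength=1.0 keeps the weights as given; strength=0.0 makes
    every sample equally likely.
    """
    weights = np.asarray(weights, dtype=float)
    # Uniform weights with the same total as the input weights
    uniform = np.full_like(weights, weights.mean())
    return strength * weights + (1.0 - strength) * uniform


# Example class-balanced weights: eight common and two rare samples
balanced = np.array([0.125] * 8 + [0.5] * 2)
softened = modify_weights(balanced, strength=0.5)
```

Blending in this way keeps the total weight unchanged while reducing how strongly rare classes are oversampled.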