
Add Lightweight Coresets #87

Open
cstich opened this issue May 28, 2020 · 6 comments
@cstich commented May 28, 2020

I have a working implementation of the lightweight coresets paper (https://las.inf.ethz.ch/files/bachem18scalable.pdf) in Julia. It's not distributed yet (I only have one machine to run it on anyway), but if you guys want, I can clone the repo and try to put together a pull request. I looked at the documentation, but the contribution guidelines had a TODO, so I'm not sure what exactly is expected.
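For reference, the paper draws m points from a 50/50 mixture of the uniform distribution and the squared-distance-to-mean distribution, and weights each sampled point x by the inverse of its sampling probability:

q(x) = 1/2 · 1/|X| + 1/2 · d(x, μ)^2 / Σ_{x' ∈ X} d(x', μ)^2,    w(x) = 1 / (m · q(x))

These are exactly the q and the weight that show up in the code shared later in this thread.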

@Arkoniak (Collaborator)

That would be amazing! Sorry for not having contribution guidelines; they're still a work in progress, but the process is nothing out of the ordinary: open a PR and we will review it.

Currently, we do not have distributed implementations anyway (all of our algorithms are multithreaded only), so it is fine to have a non-distributed version. Also, you may take a look at https://github.com/PyDataBlog/ParallelKMeans.jl/blob/master/src/coreset.jl as an example.

@PyDataBlog (Owner)

@cstich we will add contribution guidelines in the next release #78

@cstich (Author) commented May 28, 2020

> That would be amazing! Sorry for not having contribution guidelines; they're still a work in progress, but the process is nothing out of the ordinary: open a PR and we will review it.

Sounds good to me.

> Currently, we do not have distributed implementations anyway (all of our algorithms are multithreaded only), so it is fine to have a non-distributed version. Also, you may take a look at https://github.com/PyDataBlog/ParallelKMeans.jl/blob/master/src/coreset.jl as an example.

Yeah, I just had a look. I need to figure out exactly how to use the containers and the chunks, but once I have done that, it should be pretty straightforward to adapt it.
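In case it helps anyone picking this up, here is a minimal sketch of the chunked-threading pattern, assuming the idea is to split the column range across threads and have each task fill a disjoint slice of a shared output array. The helper names (chunk_ranges, squared_distances) are hypothetical and not taken from coreset.jl.

using Base.Threads: nthreads, @spawn

# Split 1:n into k roughly equal, contiguous ranges (one per thread).
function chunk_ranges(n::Int, k::Int = nthreads())
    len, rem = divrem(n, k)
    sizes = fill(len, k) .+ (1:k .<= rem)  # spread the remainder over the first chunks
    starts = cumsum(vcat(1, sizes))
    [starts[i]:starts[i + 1] - 1 for i in 1:k]
end

# Each task computes the squared distances for its own chunk of columns,
# writing into a disjoint slice of the shared output array.
function squared_distances(X, μ, dist)
    N = size(X, 2)
    distances = zeros(Float64, N)
    tasks = [@spawn for i in r
                 distances[i] = dist(view(X, :, i), μ)^2
             end for r in chunk_ranges(N)]
    foreach(wait, tasks)
    return distances
end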

@tecosaur

I've just come across this and it looks extremely promising. @cstich, did you ever make any more progress with this?

@cstich (Author) commented Jun 25, 2021

Unfortunately not. It turned out to be a bit more work than I thought, so I'm happy for people to dump my code somewhere, and for someone else to use my implementation as a starting point if they want to.

using Base.Threads: @threads
import StatsBase

"""
    lightweight_core_set(X, m, μ; dist)

Build a lightweight coreset of `X`.

X: data, column-wise
m: number of samples
μ: the global mean of the data
dist: distance function to use
return: lightweight coreset C as (sample, weights)
"""
function lightweight_core_set(X, m, μ; dist)
    # In case we pass a transposed array, `collect` is needed for the
    # correct type inference later on.
    X = collect(X)
    N = size(X, 2)

    # The paper samples according to d(x, μ)^2, so all distances are squared here.
    distances = zeros(Float64, N)
    @threads for i in 1:N
        distances[i] = dist(X[:, i], μ)^2
    end
    all_distances = sum(distances)

    # Split the m draws between the two mixture components: count_u points
    # sampled uniformly, count_v points sampled according to d(x, μ)^2.
    count_u = 0
    count_v = 0
    for _ in 1:m
        if rand() > 0.5
            count_u += 1
        else
            count_v += 1
        end
    end
    @assert count_u + count_v == m

    # Sample with or without replacement?
    # It probably makes no sense to sample with replacement.

    # Uniform sample A
    sample_u = StatsBase.sample(1:N, count_u, replace = false)

    # Distance-weighted sample B
    sample_v = StatsBase.sample(1:N, StatsBase.weights(distances),
        count_v, replace = false)

    # Each sampled point gets weight 1 / (m * q), where q is its
    # probability under the 50/50 mixture distribution.
    sample = typeof(X)(undef, size(X, 1), m)
    weights = Vector{Float64}(undef, m)
    sample_indices = vcat(sample_u, sample_v)
    @threads for i in 1:m
        sample_index = sample_indices[i]
        d = distances[sample_index]
        q = 0.5 * 1 / N + 0.5 * (d / all_distances)
        sample[:, i] = X[:, sample_index]
        weights[i] = 1 / (m * q)
    end

    return sample, weights
end
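For anyone who wants to try it, a hypothetical usage sketch on synthetic data; the euclidean closure is just an illustration, and any dist(a, b) function with the same signature works:

using Statistics: mean

# Illustrative distance function (not part of the original snippet).
euclidean(a, b) = sqrt(sum(abs2, a .- b))

X = randn(5, 10_000)        # 5 features, 10_000 column-wise points
μ = vec(mean(X, dims = 2))  # global mean of the data

C, w = lightweight_core_set(X, 500, μ; dist = euclidean)
size(C)  # (5, 500): coreset points, with per-point weights in w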

@PyDataBlog (Owner)

Thanks for sharing the code. We will definitely work on integrating this algorithm.
