Skip to content

Sampling is biased when the destination array is equal to the sampled array #642

@yurivish

Description

@yurivish

The sample! function returns incorrectly biased samples when the destination array is the same as the array being sampled.

There are sampling methods that can work correctly in-place such as the Fischer–Yates shuffle and StatsBase implements this algorithm for sampling as StatsBase.fisher_yates_sample.

The Julia implementation, however, does the wrong thing when the destination array is the same as the source array:

julia> using StatsBase

julia> let counts = zeros(5)
           for _ in 1:10^5
               let a = [1, 2, 3, 4, 5]
                   for samp in StatsBase.fisher_yates_sample!(a, a)
                       counts[samp] += 1
                   end
               end
           end
           counts
       end
5-element Array{Float64,1}:
  19909.0
  66047.0
 104055.0
 138017.0
 171972.0

This seems to be the case for sample! more generally, regardless of whether sampling is done with or without replacement.

Without replacement:

julia> let counts = zeros(5)
          for _ in 1:10^5
              let a = [1, 2, 3, 4, 5]
                    # the destination is a copy of a
                   for samp in StatsBase.sample!(copy(a), a, replace=false)
                       counts[samp] += 1
                   end
              end
          end
          counts
       end
5-element Array{Float64,1}:
 100000.0
 100000.0
 100000.0
 100000.0
 100000.0

julia> let counts = zeros(5)
          for _ in 1:10^5
              let a = [1, 2, 3, 4, 5]
                    # the destination is a
                   for samp in StatsBase.sample!(a, a, replace=false)
                       counts[samp] += 1
                   end
              end
          end
          counts
       end
5-element Array{Float64,1}:
  20058.0
  65809.0
 104801.0
 137268.0
 172064.0

With replacement:

julia> let counts = zeros(5)
          for _ in 1:10^5
              let a = [1, 2, 3, 4, 5]
                    # the destination is copy of a
                   for samp in StatsBase.sample!(copy(a), a, replace=true)
                       counts[samp] += 1
                   end
              end
          end
          counts
       end
5-element Array{Float64,1}:
  99871.0
 100068.0
 100124.0
 100303.0
  99634.0

julia> let counts = zeros(5)
          for _ in 1:10^5
              let a = [1, 2, 3, 4, 5]
                    # the destination is a
                   for samp in StatsBase.sample!(a, a, replace=true)
                       counts[samp] += 1
                   end
              end
          end
          counts
       end
5-element Array{Float64,1}:
  41455.0
  76065.0
 104968.0
 128636.0
 148876.0

With ordering and replacement:

julia> let counts = zeros(5)
          for _ in 1:10^5
              let a = [1, 2, 3, 4, 5]
                    # the destination is copy of a
                   for samp in StatsBase.sample!(copy(a), a, ordered=true, replace=true)
                       counts[samp] += 1
                   end
              end
          end
          counts
       end
5-element Array{Float64,1}:
  99974.0
  99499.0
  99764.0
 100527.0
 100236.0

julia> let counts = zeros(5)
          for _ in 1:10^5
              let a = [1, 2, 3, 4, 5]
                    # the destination is a
                   for samp in StatsBase.sample!(a, a, ordered=true, replace=true)
                       counts[samp] += 1
                   end
              end
          end
          counts
       end
5-element Array{Float64,1}:
  41697.0
  76248.0
 104670.0
 128734.0
 148651.0

using Julia 1.5.3 and StatsBase v0.33.2.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions