Skip to content

AlexBuccheri/random_sampling

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

25 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Random Number Implementation and Validation

Compilable code is build with:

cmake -S . -B cmake-build
cmake --build cmake-build

Some analysis is performed in the [jupyter] folder. This is done by automatically wrapping the fortran shared library with the fantastic gfort2py.

TODOs

Sorting

  • Migrate wrapped GSL C calling example here
  • Add a pytoml, such that one can straightforwardly install the python dependencies

Mapping a random number to an interval

  • Test Lemire's algorithm in the integer mapping module
    • Note, there could be issues arising from transcribing from C to fortran

Sampling without replacement

Test:

  • My choice of random seed precision (uint)
  • Hidde shuffle
    • Contains bug/s
  • Time all algorithms tested in the notebook

Implement

PRNGs

Implemented XOR, which is straightforward and has a large period of over a million. However, one has to be careful with the handling of signed vs unsigned integers when transcribing from C.

An unsigned int in C can hold numbers $[0, 2^{32} - 1]$, however fortran does not support this data type. Instead, signed int32 has the range $[-2^{31}, 2^{31} - 1]$. I use a bit mask to remap negative values:

! A mask that has all the bits set to 1 except the most significant bit 
! i.e. the sign bit in a 32-bit signed integer
iand(x, Z'7FFFFFFF')

This leaves $[0, 2^{31}-1]$ unchanged. However, for negative values, $x$ is represented using two's complement notation. When you apply iand(x, Z'7FFFFFFF'), you are effectively masking out the sign bit. This operation converts a negative number into its unsigned equivalent by removing the sign bit and keeping only the lower 31 bits.

integer(int32) :: x
x = -12345678_int32  ! x = -12345678 (in two's complement)
x = iand(x, Z'7FFFFFFF')  ! Masking the sign bit
! x becomes 2015137970 (the unsigned equivalent of -12345678)

There are smarter things one can do. See this Github reference by Jonas Finker, or MR 2528 for Octopus, however the above is currently sufficient for my needs.

Mapping integers to a smaller range

Also showed that mapping $[0, P)$ to $[a, b)$ is fine when the values are real but mapping to a smaller range of integers will inevitably result in duplication of numbers, even when uniformly sampling.

See:

Random Sampling a Population with no Replacements

For my use cases, one requires random sampling with no replacement. Algorithms that randomly sample a population with no replacements include:

  • Reservoir sampling

  • Skip and Gap Sampling (Vitter's Algorithm)

    • Can be more efficient than standard Reservoir Sampling, especially for large streams
    • Blog post on gap sampling
      • Quite short
      • Touches on Poisson distribution, which is also utilised by hidden shuffle - worth a read, but the code is Java
  • Hidden Shuffle This gives a python implementation, and claims it's more efficient than the above methods

  • Hash-Based Sampling

  • Simple Random Sampling with Sorting

    • Efficient when the range (m) is small w.r.t. N (i.e. $2^{32}$)
    • Guarantees uniqueness of selected items.

Some overviews on the problem, and related algorithms:

  • Looks like a good, recent paper "Simple, Optimal Algorithms for Random Sampling Without Replacement" giving an overview of the methods listed here
  • For way more detail and code examples, see this gist

Fortran implementation references:

About

Personal random sampling testing

Topics

Resources

Stars

Watchers

Forks

Contributors