In [15]:
%pylab inline
import numpy as np
from scipy.stats import kstest
from plotnine import *



# Background

One of the chief concerns of early statistical computing was the generation of random numbers. Given that computers are, for the most part, deterministic devices, generating true randomness appeared to be a pipe dream not worth pursuing. Consequently, the scientific community redirected its efforts towards pseudo-random number generation, in which a deterministic procedure generates sequences of numbers that, from a statistical point of view, *appear* random. These techniques proved suitable for the computational tasks requiring randomness. There are many different techniques in this space, running the gamut from the simple to the complex. The Mersenne Twister is one of the most widely used methods in modern applications (it is, for example, the generator behind Python's `random` module). However, it is a complicated algorithm. The core concepts that undergird the Mersenne Twister also serve as the basis of older techniques that are much easier to understand.

There are also techniques that exploit sensor noise and environmental noise to generate true random numbers. However, the overhead of such techniques is usually too high to justify their use for simulation, where pseudo-random number generation works just fine.

## Linear Congruential Generator

The linear congruential generator (LCG) is one of the simplest and easiest to implement methods for pseudo-random number generation. An instance of LCG requires three parameters:

- $a$, a multiplier
- $c$, an increment
- $m$, the modulus (every generated value is strictly less than $m$)

Moreover, every instance of LCG also requires a starting value $X_0$ with $0 \le X_0 < m$, often called the random seed. Each element is generated from the previous element and the aforementioned parameters. Each element is an integer between 0 and $m - 1$ inclusive and, for well-chosen parameters, the overall sequence *appears* statistically indistinguishable from the discrete uniform distribution on $\{0, 1, \dots, m - 1\}$. Each element in the sequence can then be divided by $m$ to yield a floating point number in $[0, 1)$, simulating the standard continuous uniform distribution. As we shall see later, the ability to simulate the standard continuous uniform distribution is a sort of master key that enables us to simulate any other arbitrary distribution.

The core iterative process of the LCG method is as follows:

$$X_{i + 1}= (aX_{i} + c) \operatorname{mod} m$$
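
As a quick hand check of the recurrence, here are the first three steps for $a = 4$, $c = 3$, $m = 100$ and $X_0 = 42$ (the same values used in the code cells of this section):

$$
\begin{aligned}
X_1 &= (4 \cdot 42 + 3) \operatorname{mod} 100 = 171 \operatorname{mod} 100 = 71 \\
X_2 &= (4 \cdot 71 + 3) \operatorname{mod} 100 = 287 \operatorname{mod} 100 = 87 \\
X_3 &= (4 \cdot 87 + 3) \operatorname{mod} 100 = 351 \operatorname{mod} 100 = 51
\end{aligned}
$$

Dividing each value by $m = 100$ gives $0.71$, $0.87$ and $0.51$.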

In [9]:
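class LCG:
    """Linear congruential generator: X_{i+1} = (a * X_i + c) mod m."""

    def __init__(self, a, c, m, X_0):
        self.X_0 = X_0    # seed, kept for reference
        self.a = a        # multiplier
        self.c = c        # increment
        self.m = m        # modulus
        self.curr = X_0   # current state

    def __call__(self):
        # advance the state by one step and return it (an integer in [0, m))
        self.curr = (self.a * self.curr + self.c) % self.m
        return self.curr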

In [10]:
lcg = LCG(4, 3, 100, 42)

In [20]:
n = 100

# generate n = 100 "samples" from our generator
# divide by the modulus m to scale into [0, 1)
arr = np.array([lcg() / lcg.m for _ in range(n)])
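
Because an LCG's state is a single integer below $m$, the sequence must eventually repeat, and how soon it does depends heavily on the parameters. The sketch below measures the cycle length directly. The helper `lcg_period` is a hypothetical name introduced here for illustration, and the second parameter set ($a = 21$) is an illustrative choice that is not part of the original example; it was picked to satisfy the Hull-Dobell conditions for a full period.

```python
def lcg_period(a, c, m, seed):
    """Return (tail, period): steps before the cycle is entered, and the cycle length."""
    seen = {}              # state -> index at which it was first visited
    x, i = seed, 0
    while x not in seen:
        seen[x] = i
        x = (a * x + c) % m
        i += 1
    return seen[x], i - seen[x]

# parameters used in this section
print(lcg_period(4, 3, 100, 42))    # (1, 10)

# illustrative alternative (an assumption, not from the original example)
print(lcg_period(21, 3, 100, 42))   # (0, 100)
```

With $a = 4$, $c = 3$, $m = 100$, the generator falls into a cycle of only 10 values, so the 100 samples above are the same 10 numbers repeated 10 times. A test that only looks at how the values are distributed will not necessarily flag this.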

## Testing Uniformity

After generating our samples, we can evaluate the hypothesis that the generated sequence follows a standard uniform distribution. We can use the Kolmogorov-Smirnov test for goodness of fit. Recall that the hypothesis setup of the KS test is as follows:

$$
H_{0} \text{: the distributions are the same}
$$

$$
H_{A} \text{: the distributions are different}
$$

If the KS test reports a *high* p-value (above the chosen significance level), then we fail to reject the null hypothesis $H_{0}$.

In [21]:
kstest(arr, 'uniform')

KstestResult(statistic=0.09000000000000008, pvalue=0.37063202558632946)

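Since the above p-value is greater than 0.1, we fail to reject $H_{0}$ at the 10% significance level: the test finds no evidence that the samples deviate from the standard uniform distribution. Note that failing to reject $H_{0}$ does not prove that the distributions are the same, and that the KS test only examines the distribution of the values, not whether successive values are independent.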
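
To make the reported statistic less of a black box, the sketch below recomputes it by hand. For the standard uniform distribution the reference CDF is simply $F(x) = x$ on $[0, 1]$, so the KS statistic is the largest vertical gap between the empirical CDF of the samples and the diagonal. The helper names `lcg_samples` and `ks_statistic_uniform` are hypothetical and introduced here for illustration; the code is written in plain Python and assumes all samples lie in $[0, 1]$.

```python
def lcg_samples(a, c, m, seed, n):
    """Generate n LCG values scaled into [0, 1)."""
    x, out = seed, []
    for _ in range(n):
        x = (a * x + c) % m
        out.append(x / m)
    return out

def ks_statistic_uniform(samples):
    """Two-sided KS statistic against U(0, 1), assuming samples lie in [0, 1]."""
    xs = sorted(samples)
    n = len(xs)
    # gap just after each jump of the empirical CDF
    d_plus = max((i + 1) / n - x for i, x in enumerate(xs))
    # gap just before each jump of the empirical CDF
    d_minus = max(x - i / n for i, x in enumerate(xs))
    return max(d_plus, d_minus)

d = ks_statistic_uniform(lcg_samples(4, 3, 100, 42, 100))
print(round(d, 4))   # 0.09
```

The result agrees with the `statistic` field reported by `kstest` above; computing the p-value from the statistic is left to SciPy.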