# Pseudo-random numbers, statistics, and Galton-Watson

Our goal is to program up a procedure that will produce random realizations of 
a Galton-Watson process, and then to analyze the results. I'll begin by 
introducing some functionality that we will need from the numpy library.

## Random sampling in Python

Everything you need for this is located within the numpy.random module. There 
are a few different ways of using the module, but the preferred way these days is 
to create a Generator object and then call various methods on it to obtain 
samples from whatever distribution you like. Internally, the Generator object 
keeps a state, and it generates each new number from that state according to a 
specific algorithm. The initial state can be set manually by supplying a "seed", 
which gives you the same sequence of pseudo-random numbers every time. This 
means that you can have reproducible results while still obtaining random-like 
behavior.
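
As a minimal sketch of the seeding behavior described above (the seed value 12345 is an arbitrary choice for illustration), two generators created with the same seed should produce identical sequences:

```python
import numpy as np

# The seed 12345 is arbitrary; any non-negative integer works.
rng_a = np.random.default_rng(12345)
rng_b = np.random.default_rng(12345)

draws_a = rng_a.random(3)
draws_b = rng_b.random(3)

print(draws_a)
print(draws_b)
print(np.array_equal(draws_a, draws_b))  # Same seed, same sequence
```

Fixing the seed while you debug and removing it afterwards is one common way to work.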

In [None]:
import numpy as np

rng = np.random.default_rng() # Create a random number generator instance. 
                              # You can pass an integer seed for reproducibility.
                              # If it is omitted, a random seed is used (based on system entropy).

# The most basic task is to produce a number between 0 and 1 from a uniform distribution:
print(rng.random())
# Each call to random() produces a new random number.
# You can also generate arrays of random numbers by specifying the shape:
print(rng.random((2, 3)))  # 2x3 array of random numbers

# With a random number between 0 and 1, you can scale it to any range you want, 
#   or compare to a threshold to produce random booleans with a given probability.
# So, for example, to simulate a biased coin flip with a 70% chance of heads:
probability_of_heads = 0.7
is_heads = rng.random() < probability_of_heads
print(is_heads)
# To generate random integers within a specified range, use the integers() method:
random_integers = rng.integers(low=1, high=10, size=5)  # Random integers from 1 to 9 (high is exclusive)
print(random_integers)

# It is also possible to draw samples from various probability distributions.
# Check out the documentation for a full list of available distributions and 
#   their parameters.

## Basic summary statistics in Python

For really basic stuff like mean, variance, and standard deviation, there are 
numpy functions that you can call on any array. More complicated statistics can 
be found in scipy.stats.

In [None]:
import numpy as np

rng = np.random.default_rng()
samples = rng.normal(loc=0.0, scale=1.0, size=1000)  # 1000 samples from a standard normal distribution

print(f'Mean of the samples: {np.mean(samples)}')  # Should be close to 0
print(f'Standard deviation of the samples: {np.std(samples)}')  # Should be close to 1
print(f'Variance of the samples: {np.var(samples)}')  # Should be close to 1
print(f'Minimum value: {np.min(samples)}')
print(f'Maximum value: {np.max(samples)}')

# Note that all of these functions have an axis parameter that allows you to
#   compute these statistics along a specified axis of a multi-dimensional array.
#   So, for example, if you have N rows of simulation samples and want to compute
#   the mean for each simulation, you can do:
multi_samples = rng.normal(loc=0.0, scale=1.0, size=(5, 1000))  # 5 simulations of 1000 samples each
means_per_simulation = np.mean(multi_samples, axis=1)
print(f'Means per simulation: {means_per_simulation}')

## Galton-Watson process

The goal is to produce many simulations of a Galton-Watson process, bootstrapping 
the resulting distribution for population size after a certain amount of time 
and comparing it to our analytical results.

As with all coding projects, it's best to break this down into much smaller steps 
that can be built on, one after another. I suggest:
1. First, set the PMF for our process. Ideally, it should be something that will 
give us a good idea of whether it is working. I suggest something with $R_0>1$ so that 
the population is not just dying out all the time, but with $p_0>0$ so that extinction isn't 
ruled out either.

So that we are all on the same page, let's go with the following:
$$p_0 = 0.3, p_1 = 0.3, p_2 = 0.2, p_3 = 0.2$$
(note: they have to sum to 1)

2. **In the cell below, start with one individual and use a for-loop to simulate 
20 steps of a Galton-Watson process with this PMF.** How do you do this? Well, 
you will need to create a data structure to hold how many individuals are in the 
population at each time step (a list that you can append to?). That data structure 
should start with one individual as an initial condition. Then, for each individual 
in the current population, you will need a random number. You can use 
random numbers between 0 and 1. Then you need to figure out how many offspring 
each individual had. Suppose $r$ is the random number. Then:
- If $0\leq r < 0.3$, there are no offspring.
- If $0.3\leq r < 0.3+0.3=0.6$, there is one offspring.
- If $0.6\leq r < 0.6+0.2=0.8$, there are two offspring. 

...and so on. Do you see how I am partitioning the interval [0,1] according to 
the probabilities in the PMF? Then you add up all the results for all 
individuals in the population and that gives you the new population size.
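
The partitioning described above can be sketched with a cumulative sum and a sorted search. This is only one possible approach, and `offspring_from_uniform` is a hypothetical helper name, not something numpy provides:

```python
import numpy as np

# PMF from the text: index k holds the probability of k offspring.
pmf = np.array([0.3, 0.3, 0.2, 0.2])
cdf = np.cumsum(pmf)  # Upper edges of the intervals: 0.3, 0.6, 0.8, 1.0


def offspring_from_uniform(r):
    """Map uniform draws r in [0, 1) to offspring counts (hypothetical helper)."""
    # side='right' puts a draw that lands exactly on an edge into the next
    # interval, matching the "lower edge <= r < upper edge" rule in the text.
    return np.searchsorted(cdf, r, side='right')


print(offspring_from_uniform(np.array([0.1, 0.45, 0.7, 0.95])))  # [0 1 2 3]

rng = np.random.default_rng()
counts = offspring_from_uniform(rng.random(5))  # Offspring of 5 individuals
print(counts, counts.sum())  # Per-individual counts and their total
```

A chain of if/elif comparisons on each `r` does the same job and may be easier to read on a first pass.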

Once the simulation completes, print your results. Or better yet, plot them.

Note that if the population is growing geometrically over time, you will be 
performing this procedure for a lot more individuals as the simulation goes on. 
This can quickly get inefficient and become very slow. There are ways to speed 
up the process, but let's not worry about that for now.

In [None]:
import numpy as np

One instance of this process is not enough to understand its statistics. In the 
cell below, copy what you have done above but then alter it so that you can run 
the same process many times - maybe 10-20 times to start out? 

Each time, you will need to store your results so that you have all of them in 
the end. I suggest creating an Nx21 numpy array, where each of the N rows stores 
a different simulation and each of the columns is a time-step (the initial condition 
plus 20 steps).
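
One way the storage pattern might look is sketched below. It is not the only way to do it: `n_sims` and `n_steps` are illustrative names, and instead of looping over individuals it draws all of a generation's random numbers in a single call.

```python
import numpy as np

rng = np.random.default_rng()
pmf = np.array([0.3, 0.3, 0.2, 0.2])  # PMF from the text
cdf = np.cumsum(pmf)

n_sims, n_steps = 10, 20  # Illustrative values
results = np.zeros((n_sims, n_steps + 1), dtype=np.int64)
results[:, 0] = 1  # Every simulation starts with one individual

for i in range(n_sims):
    for t in range(n_steps):
        population = results[i, t]
        # One uniform draw per individual, mapped to an offspring count.
        # An extinct population (0 individuals) gives an empty array, sum 0.
        offspring = np.searchsorted(cdf, rng.random(population), side='right')
        results[i, t + 1] = offspring.sum()

print(results.shape)   # (10, 21)
print(results[:, -1])  # Final population size of each simulation
```

Preallocating the array keeps every simulation the same length, including runs that go extinct early and are padded with zeros.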

When you have something that runs, try:
- Printing summary statistics over the last column of the results matrix, which 
corresponds to the final time of all the simulations. Compare the results to 
what you expect analytically.
- Plotting all of the population trajectories on the same line plot so that you can 
visualize the population probability distribution as it evolves in time.
- Plotting a histogram of the last column of your results matrix so you can visualize 
the probability distribution of the process after 20 time-steps. For low N, this 
won't look all that refined, but more simulations would make it better. 
Ultimately, this is the issue with using Monte Carlo simulations to bootstrap a 
probability distribution: you need to generate lots of data to get a result 
that looks like it is converging to something not dominated by noise, and that 
requires either some clever numerical work to get the simulation times down or 
a lot of computing power, or both.
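
For the analytical comparison, here is a minimal sketch that assumes the standard Galton-Watson results for a process started from one individual: the mean population at generation $n$ is $R_0^n$ with $R_0=\sum_k k\,p_k$, and the variance is $\sigma^2 R_0^{n-1}(R_0^n-1)/(R_0-1)$ for $R_0\neq 1$, where $\sigma^2$ is the offspring variance. If we derived these in a different form, use that instead.

```python
import numpy as np

pmf = np.array([0.3, 0.3, 0.2, 0.2])  # PMF from the text
k = np.arange(len(pmf))               # Possible offspring counts 0..3

r0 = np.sum(k * pmf)                  # Mean offspring number
sigma2 = np.sum(k**2 * pmf) - r0**2   # Offspring variance

n = 20
expected_mean = r0**n
expected_var = sigma2 * r0**(n - 1) * (r0**n - 1) / (r0 - 1)

print(f'R_0 = {r0}')
print(f'Offspring variance = {sigma2}')
print(f'Expected population after {n} steps: {expected_mean}')
print(f'Expected standard deviation after {n} steps: {np.sqrt(expected_var)}')
```

Keep in mind that these are averages over all realizations, including those that went extinct, so the zeros in the last column belong in the sample mean too.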