# Problem Sheet 1.1 - Getting started with simulation! (Python Version)

The purpose of this practical is just to get used to random number generation within your preferred language, and try out a few basic things from lectures 1-4. This notebook is written specifically for Python.

Please work on this in a group of 3 or 4, and hand in an archive file (`.tar` or `.zip`) containing (i) your code and (ii) your results as a PDF file, with comments on anything you find interesting.

In [32]:
import numpy as np
import matplotlib.pyplot as plt
import time
from tqdm import tqdm
%matplotlib inline


**Part (a):** Generate $10^{6}$ uniformly distributed variables on the unit interval $[0,1]$, and check that they have the expected mean and variance. Repeat for $10^{6}$ unit Normal random variables.

Notes: 
- in Python, you can use `rand` and `randn` from `numpy.random`.
- recall that a uniformly distributed variable on $[0,1]$ has mean $1/2 = 0.5$ and variance $1/12 \approx 0.0833$, while a unit Normal random variable has mean $0$ and variance $1$.

In [2]:
# For uniform distribution
my_sample = np.random.rand(10**6)
print(f"mean = {my_sample.mean()}, variance = {my_sample.var()}")

mean = 0.5004307864728101, variance = 0.08330550520865264


In [3]:
# For normal distribution
my_sample = np.random.randn(10**6)
print(f"mean = {my_sample.mean()}, variance = {my_sample.var()}")

mean = 0.0014654049716239137, variance = 0.9997583426528015
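
To judge whether a sample mean is "close enough" to the expected value, it can help to express the deviation in units of the standard error $\sqrt{\sigma^2/N}$. The following is a minimal sketch of that check, not part of the original solution; it assumes NumPy's newer `default_rng` generator interface, and the seed and variable names are arbitrary choices for illustration.

```python
import numpy as np

# Assumption: use the Generator API with an arbitrary seed for reproducibility.
rng = np.random.default_rng(seed=0)
n = 10**6

u = rng.random(n)            # uniform samples on [0, 1)
z = rng.standard_normal(n)   # unit Normal samples

# Standard error of the sample mean is sqrt(variance / n).
se_u = np.sqrt((1.0 / 12.0) / n)
se_z = np.sqrt(1.0 / n)

# Deviation of each sample mean from its true mean, in standard errors.
dev_u = abs(u.mean() - 0.5) / se_u
dev_z = abs(z.mean() - 0.0) / se_z

print(f"uniform: mean is {dev_u:.2f} standard errors from 0.5")
print(f"normal:  mean is {dev_z:.2f} standard errors from 0")
```

A deviation of no more than a few standard errors is consistent with a correct generator; a much larger value would suggest a bug.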


**Part (b):** Given a covariance matrix
$$
\Sigma=\left(\begin{array}{ll}
4 & 1 \\
1 & 4
\end{array}\right)
$$

perform a Cholesky factorisation to obtain a lower-triangular matrix $L$ such that

$$
\Sigma=L L^\top
$$

Use this matrix $L$ to convert $2 \times 10^{6}$ independent unit Normals into $10^{6}$ pairs of Normals with the desired covariance. Check that they have the expected mean and covariance.

Notes: 
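- in Python there is a function `cholesky` in `numpy.linalg`.
- recall that if $Z$ has distribution $\mathsf{N}(0,I^{(2)})$, then $X = LZ$ has distribution $\mathsf{N}(0,\Sigma)$, as desired.
- recall that reducing along `axis=0` collapses the rows and gives one value per column, while reducing along `axis=1` collapses the columns and gives one value per row. In the code below `X_arr` has shape `(2, 10**6)`, so `axis=1` gives one value per component of $X$.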
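
The `axis` convention in the last note is easy to mix up, so here is a small illustration on a toy array. The array `A` is made up for this example only.

```python
import numpy as np

# Toy array with 2 rows and 3 columns.
A = np.array([[1, 2, 3],
              [4, 5, 6]])

col_sums = A.sum(axis=0)   # collapses the rows: one sum per column (5, 7, 9)
row_sums = A.sum(axis=1)   # collapses the columns: one sum per row (6, 15)

print(col_sums.shape, row_sums.shape)
```

For a sample array of shape `(2, N)`, `axis=1` is therefore the axis that averages over the `N` samples.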

In [4]:
# Getting Cholesky factorisation of Sigma
Sigma = np.array([[4,1],[1,4]])
L = np.linalg.cholesky(Sigma)
print(L)

[[2.         0.        ]
 [0.5        1.93649167]]
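
As a sanity check, the factor can be verified by hand: $L_{11} = \sqrt{4} = 2$, $L_{21} = 1/2$ and $L_{22} = \sqrt{4 - 1/4} = \sqrt{15}/2 \approx 1.936$. A short sketch of the same check in code, using only `np.tril` and `np.allclose` (the variable names `is_lower` and `reconstructs` are illustrative):

```python
import numpy as np

Sigma = np.array([[4.0, 1.0], [1.0, 4.0]])
L = np.linalg.cholesky(Sigma)

# L should be lower triangular and satisfy L @ L.T == Sigma up to rounding.
is_lower = np.allclose(L, np.tril(L))
reconstructs = np.allclose(L @ L.T, Sigma)

print(f"lower triangular: {is_lower}, reproduces Sigma: {reconstructs}")
```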


In [15]:
# Drawing samples from the desired normal distribution N(0, Sigma); each column of X_arr is one pair.
Z_arr = np.random.randn(2, 10**6)
X_arr = L @ Z_arr

In [26]:
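# Sample mean (expected 0), variances (expected 4) and covariance (expected 1).
samp_mean = np.mean(X_arr, axis=1)
samp_var = np.var(X_arr, axis=1)
# The true mean is 0, so E[X_0 X_1] equals the covariance and no centring is needed.
covariance = np.mean(X_arr[0] * X_arr[1])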
print(f"mean = {samp_mean}, \n variances = {samp_var}, \n covariance = {covariance}")

mean = [-0.00170408 -0.00158937], 
 variances = [3.99878293 3.99952829], 
 covariance = 1.0043445938499123
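
An alternative to computing the variances and the cross term separately is `np.cov`, which returns the full sample covariance matrix and subtracts the sample means itself. The sketch below is one possible way to do this, not the method used above; the seed and the names `X` and `sample_cov` are assumptions made for the example.

```python
import numpy as np

# Assumption: Generator API with an arbitrary seed.
rng = np.random.default_rng(seed=1)

Sigma = np.array([[4.0, 1.0], [1.0, 4.0]])
L = np.linalg.cholesky(Sigma)

# Each column of X is one correlated pair.
X = L @ rng.standard_normal((2, 10**6))

# By default np.cov treats each row as a variable and each column as an observation.
sample_cov = np.cov(X)
print(sample_cov)
```

The result should be close to $\Sigma$, with entries differing from it by sampling error only.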


**Part (c):** Repeat the previous item using the PCA factorisation of $\Sigma$.

Note: in Python there is a function `eig` in `numpy.linalg`.

In [23]:
# Getting eigenvalue (PCA) factorisation of Sigma
Sigma = np.array([[4,1],[1,4]])
w,v = np.linalg.eig(Sigma)
L = v @ np.diag(np.sqrt(w))
print(L)

[[ 1.58113883 -1.22474487]
 [ 1.58113883  1.22474487]]
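
Here the eigenvalues of $\Sigma$ are $3$ and $5$, so the entries of the factor are $\pm\sqrt{5/2} \approx 1.581$ and $\pm\sqrt{3/2} \approx 1.225$. Since $\Sigma$ is symmetric, `np.linalg.eigh` could be used instead of `eig`; the sketch below takes that route as an alternative and checks that the resulting factor reproduces $\Sigma$. The factor is not unique, so the column order and signs may differ from the output above.

```python
import numpy as np

Sigma = np.array([[4.0, 1.0], [1.0, 4.0]])

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order
# together with orthonormal eigenvectors as the columns of V.
w, V = np.linalg.eigh(Sigma)
L_pca = V @ np.diag(np.sqrt(w))

# Any L with L @ L.T == Sigma turns unit Normals into N(0, Sigma) samples.
reconstructs = np.allclose(L_pca @ L_pca.T, Sigma)
print(w, reconstructs)
```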


In [25]:
# Drawing samples from the desired normal distribution N(0, Sigma); each column of X_arr is one pair.
Z_arr = np.random.randn(2, 10**6)
X_arr = L @ Z_arr

In [27]:
# Sample mean (expected 0), variances (expected 4) and covariance (expected 1).
samp_mean = np.mean(X_arr, axis=1)
samp_var = np.var(X_arr, axis=1)
covariance = np.mean(X_arr[0] * X_arr[1])
print(f"mean = {samp_mean}, \n variances = {[samp_var[0], samp_var[1]]}, \n covariance = {covariance}")

mean = [-0.00170408 -0.00158937], 
 variances = [3.998782928686044, 3.999528287644133], 
 covariance = 1.0043445938499123


**Part (d):** Repeat to see how many pairs you can generate in 1 minute.

Note: 
- here we use a lazy method: we time the generation of samples of several sizes and estimate the rate from those timings.
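
A more direct alternative to the lazy method is to generate pairs in fixed-size batches until a time budget runs out, and count how many were produced. The following is only a sketch of that idea: the function `count_pairs_in_budget`, its batch size and the short 0.2 second budget are all hypothetical choices for illustration, and `time.perf_counter` is used as the clock.

```python
import time
import numpy as np

def count_pairs_in_budget(L, budget_seconds, batch_size=10**5, seed=0):
    """Generate correlated pairs in batches until the time budget is used up.

    Hypothetical helper for illustration; returns the number of pairs generated.
    """
    rng = np.random.default_rng(seed)
    n_pairs = 0
    deadline = time.perf_counter() + budget_seconds
    while time.perf_counter() < deadline:
        Z = rng.standard_normal((2, batch_size))
        X = L @ Z              # one batch of correlated pairs, discarded afterwards
        n_pairs += batch_size
    return n_pairs

Sigma = np.array([[4.0, 1.0], [1.0, 4.0]])
L = np.linalg.cholesky(Sigma)

budget = 0.2                   # seconds; set to 60 for the full one-minute experiment
n = count_pairs_in_budget(L, budget)
print(f"{n} pairs in {budget} s, roughly {n / budget:.3g} pairs per second")
```

Working in batches keeps the memory use bounded, which matters once the total number of pairs no longer fits in a single array.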

In [41]:
# Mean and standard deviation of the wall-clock time for each sample size and factorisation
chol_duration_mean_arr = []
chol_duration_std_arr = []
pca_duration_mean_arr = []
pca_duration_std_arr = []

# The two factors of Sigma do not depend on the sample size, so compute them once
Sigma = np.array([[4,1],[1,4]])
L1 = np.linalg.cholesky(Sigma)
w,v = np.linalg.eig(Sigma)
L2 = v @ np.diag(np.sqrt(w))

# Sample sizes of 10**6, 10**7 and 10**8 pairs
for i in range(6,9):
    chol_duration_i_arr = []
    pca_duration_i_arr = []

    # Repeat each timing 5 times
    for samples in tqdm(range(5)):
        # Time the generation of 10**i pairs with the Cholesky factor
        start_time_1 = time.time()
        Z_arr = np.random.randn(2, 10**i)
        X_arr = L1 @ Z_arr
        end_time_1 = time.time()
        chol_duration_i_arr.append(end_time_1 - start_time_1)

        # Time the generation of 10**i pairs with the PCA factor
        start_time_2 = time.time()
        Z_arr = np.random.randn(2, 10**i)
        X_arr = L2 @ Z_arr
        end_time_2 = time.time()
        pca_duration_i_arr.append(end_time_2 - start_time_2)

    # Summarise the 5 repetitions for this sample size
    chol_duration_i = np.array(chol_duration_i_arr)
    pca_duration_i = np.array(pca_duration_i_arr)
    chol_duration_mean_arr.append(chol_duration_i.mean())
    chol_duration_std_arr.append(chol_duration_i.std())
    pca_duration_mean_arr.append(pca_duration_i.mean())
    pca_duration_std_arr.append(pca_duration_i.std())


100%|██████████| 5/5 [00:00<00:00, 11.00it/s]
100%|██████████| 5/5 [00:03<00:00,  1.25it/s]
100%|██████████| 5/5 [00:30<00:00,  6.04s/it]


In [42]:
chol_duration_mean_arr 

[0.05000104904174805, 0.3966504096984863, 2.9725918769836426]

In [44]:
chol_duration_std_arr

[0.011633677664875419, 0.01820892649982267, 0.005640425168723844]

In [43]:
pca_duration_mean_arr

[0.03998003005981445, 0.40026149749755857, 3.0676573753356933]

In [45]:
pca_duration_std_arr

[0.003647984293315954, 0.01609559946249033, 0.174525150481134]

TODO: Discussions with number of samples needed!
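
As a starting point for that discussion, the mean timings recorded above can be turned into a rough rate. The sketch below assumes the time grows linearly with the number of pairs, $t \approx c\,n$, and fits $c$ by least squares through the origin; the helper `pairs_per_minute` is a hypothetical name, and the numbers are copied from the outputs above, so they are specific to the machine that produced them.

```python
import numpy as np

# Sample sizes and mean durations (in seconds) copied from the outputs above.
n_pairs = np.array([10**6, 10**7, 10**8], dtype=float)
chol_mean = np.array([0.05000104904174805, 0.3966504096984863, 2.9725918769836426])
pca_mean = np.array([0.03998003005981445, 0.40026149749755857, 3.0676573753356933])

def pairs_per_minute(n, t):
    """Hypothetical helper: fit t = c * n through the origin and return 60 / c."""
    c = np.sum(n * t) / np.sum(n * n)   # least-squares slope, seconds per pair
    return 60.0 / c

chol_rate = pairs_per_minute(n_pairs, chol_mean)
pca_rate = pairs_per_minute(n_pairs, pca_mean)

print(f"Cholesky: about {chol_rate:.2e} pairs per minute")
print(f"PCA:      about {pca_rate:.2e} pairs per minute")
```

Under this linear assumption both factorisations give on the order of $2 \times 10^{9}$ pairs per minute, with no meaningful difference between them, which is expected since only the $2 \times 2$ factor differs. Note that $2 \times 10^{9}$ pairs stored as 64-bit floats would take about 32 GB, so a real one-minute run would have to generate the pairs in batches rather than in one array.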