<div style="background-image: linear-gradient(145deg, rgba(35, 47, 62, 1) 0%, rgba(0, 49, 129, 1) 40%, rgba(32, 116, 213, 1) 60%, rgba(244, 110, 197, 1) 85%, rgba(255, 173, 151, 1) 100%); padding: 1rem 2rem; width: 95%"><img style="width: 60%;" src="../../images/MLU_logo.png"></div>

# <a name="0">MLU Mathematical Fundamentals for Machine Learning</a>
# <a name="0">Lecture 3: Probability and Statistics Fundamentals</a>
## <a name="0">Lab 3.4: Inferential Statistics and the Central Limit Theorem</a>

 1. <a href="#1">Central Limit Theorem</a> 
 2. <a href="#2">Confidence Interval for the Mean</a> 
 
This notebook empirically shows how the Central Limit Theorem works by using random samples from three different distributions: continuous uniform, exponential and binomial.
It also describes how to construct a confidence interval for the mean of a population using the Normal distribution.

## <a name="1">1. Central Limit Theorem</a>
(<a href="#0">Go to top</a>)

The **Central Limit Theorem (CLT)** states that, for a large enough sample size $n$ drawn independently from a population with mean $\mu$ and finite standard deviation $\sigma$, the sampling distribution of the sample mean is approximately normal with mean $\mu$ and standard deviation $\sigma$ divided by the square root of $n$.

$$ \bar{X}=\frac{1}{n} \sum_{i=1}^n x_i$$ 

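$$ \bar{X} \sim N\left(\mu,\frac{\sigma^2}{n}\right) \quad \text{(approximately, for large } n\text{)}$$ 

Here the second parameter of $N$ is the variance, so the standard deviation of $\bar{X}$ is $\sigma/\sqrt{n}$.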
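
A short derivation, assuming the $x_i$ are independent draws with common variance $\sigma^2$ (the usual setting for the CLT), shows where the $\sqrt{n}$ comes from:

$$ \mathrm{Var}(\bar{X}) = \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^n x_i\right) = \frac{1}{n^2}\sum_{i=1}^n \mathrm{Var}(x_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n} \quad\Longrightarrow\quad \mathrm{sd}(\bar{X}) = \frac{\sigma}{\sqrt{n}} $$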

This is a very powerful theorem because it lets us approximate the distribution of the sample mean by a normal distribution, whatever the shape of the population (as long as its variance is finite), and therefore use all the tools we already know to compute probabilities of intervals under a normal distribution. 
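
As a minimal sketch of what "computing probabilities of intervals" can look like, the snippet below estimates the probability that the mean of $n=50$ draws from $Uniform[0,1]$ falls between $0.45$ and $0.55$. The interval endpoints, the number of simulated samples and the seed are arbitrary choices made for illustration, not values from this lab.

```python
import numpy as np
from scipy.stats import norm, uniform

# Population: Uniform[0, 1]; n is chosen for illustration only
mu = uniform.mean(loc=0, scale=1)      # 0.5
sigma = uniform.std(loc=0, scale=1)    # 1 / sqrt(12)
n = 50

# Standard deviation of the sample mean (the standard error)
standard_error = sigma / np.sqrt(n)

# CLT approximation of P(a <= sample mean <= b)
a, b = 0.45, 0.55
clt_probability = (norm.cdf(b, loc=mu, scale=standard_error)
                   - norm.cdf(a, loc=mu, scale=standard_error))

# Monte Carlo check using a local generator (the seed is arbitrary)
rng = np.random.default_rng(0)
sample_means = rng.uniform(0, 1, size=(20000, n)).mean(axis=1)
empirical_probability = np.mean((sample_means >= a) & (sample_means <= b))

print(f"CLT approximation:   {clt_probability:.3f}")
print(f"Simulated frequency: {empirical_probability:.3f}")
```

The two printed numbers should be close to each other; the gap between them reflects both simulation noise and the error of the normal approximation.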

In order to leverage the PDF and CDF of the **Standard Normal**, we will standardise the sample mean by subtracting its mean and dividing by its standard deviation (the standard error $\sigma/\sqrt{n}$).

$$ \frac{\bar{X}-\mu}{\displaystyle{\frac{\sigma}{\sqrt{n}}}} \sim N(0,1)$$ 

We will now see how the CLT really works in practice with different probability distributions to sample from and different sample sizes.

In [None]:
# Import libraries
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import uniform, binom, expon, norm, bernoulli

from IPython.display import Markdown, display

# Set a seed for reproducibility
np.random.seed(99)

Let's start sampling from a *Uniform* population $Uniform[0,1]$. 


<img style="width: 40%;" src="../../images/clt.png">

We will take $m$ samples, each of size $n$. We fix $m=1200$ and start with a sample size of $n=50$.

We'll end up with $m$ sample means, which we standardise and plot as a histogram (in blue) to compare against the standard normal PDF (red line).

In [None]:
# Uniform[0, 1]
mu = uniform.mean(loc=0, scale=1)
sigma = uniform.std(loc=0, scale=1)

n = 50 # sample size
m = 1200 # number of samples

print(f"Taking {m} samples each of size {n} from a Uniform[0,1] with mean {mu} and standard deviation {sigma}")

x = np.zeros([m, n])
standardised_sample_mean = np.zeros(m)

for i in range(m):
    x[i] = uniform.rvs(loc=0, scale=1, size=n, random_state=None)
    sample_mean = (x[i].sum()) / n
    standard_error = sigma / np.sqrt(n)
    standardised_sample_mean[i] = (sample_mean - mu) / standard_error
    
plt.hist(standardised_sample_mean, bins=40, density=True)
plt.plot(np.arange(-4, 4, 8/m), norm.pdf(np.arange(-4, 4, 8/m), loc=0, scale=1), color='r')
plt.show()

Although the uniform distribution is quite different from the Normal, we can see that its sample mean for a sample size $n=50$ is approximately normal. Pretty powerful, don't you think?
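
The same experiment can also be written without an explicit Python loop. The sketch below is one possible vectorised variant, not the lab's reference implementation: it draws all $m \times n$ values at once and averages along each row. It uses NumPy's `default_rng` generator with an arbitrary seed rather than the global seed set earlier, so the numbers will differ slightly from the histogram above.

```python
import numpy as np
from scipy.stats import uniform

mu = uniform.mean(loc=0, scale=1)
sigma = uniform.std(loc=0, scale=1)
n, m = 50, 1200  # sample size and number of samples, as in the cell above

rng = np.random.default_rng(99)            # local generator; seed is arbitrary
samples = rng.uniform(0, 1, size=(m, n))   # row i holds the i-th sample
sample_means = samples.mean(axis=1)        # one mean per row
standardised = (sample_means - mu) / (sigma / np.sqrt(n))

# If the CLT approximation is good, these should be near 0 and 1
print(f"mean of standardised sample means: {standardised.mean():.3f}")
print(f"std of standardised sample means:  {standardised.std(ddof=1):.3f}")
```

Checking that the standardised values have mean close to $0$ and standard deviation close to $1$ is a quick numeric complement to eyeballing the histogram.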

Let's now see how the normal approximation changes for $4$ different values of the sample size $n \in \{1, 10, 100, 1000\}$.

In [None]:
# Uniform[0, 1]
mu = uniform.mean(loc=0, scale=1)
sigma = uniform.std(loc=0, scale=1)

n_values = [1, 10, 100, 1000] # sample sizes to compare
m = 1200 # number of samples

plt.figure(figsize=(10, 6))

for j in range(len(n_values)):
    n = n_values[j]
    plt.subplot(2, 2, j + 1)
    x = np.zeros([m, n])
    standardised_sample_mean = np.zeros(m)

    for i in range(m):
        x[i] = uniform.rvs(loc=0, scale=1, size=n, random_state=None)
        sample_mean = (x[i].sum()) / n
        standard_error = sigma / np.sqrt(n)
        standardised_sample_mean[i] = (sample_mean - mu) / standard_error
    
    plt.hist(standardised_sample_mean, bins=40, density=True)
    plt.plot(np.arange(-4, 4, 8/m), norm.pdf(np.arange(-4, 4, 8/m), loc=0, scale=1), color='r')
    plt.title(f"sample size $n$: {n}")

plt.subplots_adjust(left=0.1,
                    bottom=0.1, 
                    right=0.9, 
                    top=0.9, 
                    wspace=0.4, 
                    hspace=0.8)    
plt.show()

Let's now do the same while sampling from an *Exponential* population with parameter $\lambda=1$. 

In [None]:
# Exponential lambda=1
mu = expon.mean(loc=0, scale=1)
sigma = expon.std(loc=0, scale=1)

n_values = [1, 10, 100, 1000] # sample sizes to compare
m = 1200 # number of samples

plt.figure(figsize=(10, 6))

for j in range(len(n_values)):
    n = n_values[j]
    plt.subplot(2, 2, j + 1)
    x = np.zeros([m, n])
    standardised_sample_mean = np.zeros(m)

    for i in range(m):
        x[i] = expon.rvs(loc=0, scale=1, size=n, random_state=None)
        sample_mean = (x[i].sum()) / n
        standard_error = sigma / np.sqrt(n)
        standardised_sample_mean[i] = (sample_mean - mu) / standard_error
    
    plt.hist(standardised_sample_mean, bins=40, density=True)
    plt.plot(np.arange(-4, 4, 8/m), norm.pdf(np.arange(-4, 4, 8/m), loc=0, scale=1), color='r')
    plt.title(f"sample size $n$: {n}")

plt.subplots_adjust(left=0.1,
                    bottom=0.1, 
                    right=0.9, 
                    top=0.9, 
                    wspace=0.4, 
                    hspace=0.8)    
plt.show()

The exponential PDF is not symmetric about its mean, so the distribution of its sample mean stays visibly right-skewed for small $n$. We need to increase the sample size $n$ to about $100$ before the histogram of the sample mean looks symmetric.
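
To put a number on that asymmetry, one option is to compute the sample skewness of the sample means for each $n$. The sketch below assumes `scipy.stats.skew` as the skewness estimator and an arbitrary seed; it compares the estimate with $2/\sqrt{n}$, the theoretical skewness of the mean of $n$ independent exponential draws.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)   # local generator; seed is arbitrary
m = 1200                         # number of samples, as in the cells above
n_values = [1, 10, 100, 1000]    # sample sizes to compare

skewness = {}
for n in n_values:
    # m samples of size n from an Exponential with lambda = 1 (scale = 1 / lambda)
    sample_means = rng.exponential(scale=1.0, size=(m, n)).mean(axis=1)
    skewness[n] = skew(sample_means)
    print(f"n={n:>4}: sample skewness {skewness[n]:.3f}, "
          f"theoretical {2 / np.sqrt(n):.3f}")
```

A skewness near $0$ indicates a roughly symmetric distribution, so the values should shrink as $n$ grows.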

### Exercise 1

<div style="align: left; border: 4px solid cornflowerblue; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px; width: 65%">
        <img style="float: left; max-width: 80%; max-height:80%; margin: 5px;" src="../../images/MLU_challenge.png" alt="MLU challenge" width=12% height=12%/>
    <span style="padding: 20px; align: left;">
        <p><b>Try it yourself!</b></p>
        <p><b>Exercise 1.</b> Try generating the same data and plots as above for random samples drawn from a binomial distribution with parameters n=20 (number of trials, not the sample size) and p=0.5.</p>
    </span>
</div>

In [None]:
###### YOUR CODE HERE ######






###### END OF CODE ######

In [None]:
# %load solutions/lab34_ex1_solutions.txt

## <a name="2">2. Confidence Interval for the Mean</a>
(<a href="#0">Go to top</a>)

If we're working with large samples (a common rule of thumb is $n \geq 30$), we can assume that the sampling distribution of the sample mean is approximately normal (thanks to the Central Limit Theorem) and can use the <code>norm.interval()</code> function from the <code>scipy.stats</code> library to compute the lower bound and the upper bound of the confidence interval for the true mean.
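
Written out, and assuming $\bar{x}$ denotes the sample mean, $s$ the sample standard deviation and $z_{1-\alpha/2}$ the standard normal quantile (about $1.96$ for a 95% interval, where $\alpha=0.05$), the interval has the familiar form:

$$ \bar{x} \pm z_{1-\alpha/2}\,\frac{s}{\sqrt{n}} $$

Passing $\bar{x}$ as <code>loc</code> and $s/\sqrt{n}$ as <code>scale</code> to <code>norm.interval()</code> yields these two endpoints.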

The following example shows how to calculate a confidence interval for the true population mean of a Bernoulli distribution using a sample of size $100$.

In [None]:
# generate sample data
sample_size = 100

data = bernoulli.rvs(p=0.5, size=sample_size)

data[:10]

In [None]:
# Create a 95% confidence interval for the population mean
# ddof=1 gives the sample standard deviation (np.std defaults to ddof=0)
lower_bound, upper_bound = norm.interval(confidence=0.95, loc=np.mean(data), scale=np.std(data, ddof=1)/np.sqrt(sample_size))

print(f"The true mean of the population lies between {lower_bound} and {upper_bound} with 95% confidence.")

The sample was drawn from a population with mean $p=0.5$ and we can see the confidence interval includes this value. This was just a very short example of confidence intervals; more on this topic will be explored at the beginning of Lecture 5 with Hypothesis Testing.
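
To see what "95% confidence" means in practice, one can repeat the construction many times and count how often the interval contains the true mean. The sketch below is an illustration under assumptions: the number of repetitions and the seed are arbitrary, and the interval is built directly from the normal quantile instead of calling <code>norm.interval()</code>.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)   # local generator; seed is arbitrary
p = 0.5                          # true mean of the Bernoulli population
sample_size = 100
repetitions = 2000               # number of repeated experiments (arbitrary)
confidence = 0.95

# Each row is one Bernoulli(p) sample of size sample_size
samples = rng.binomial(1, p, size=(repetitions, sample_size))
means = samples.mean(axis=1)
standard_errors = samples.std(axis=1, ddof=1) / np.sqrt(sample_size)

# Two-sided normal quantile, about 1.96 for 95% confidence
z = norm.ppf(0.5 + confidence / 2)
lower = means - z * standard_errors
upper = means + z * standard_errors

# Fraction of intervals that contain the true mean
coverage = np.mean((lower <= p) & (p <= upper))
print(f"Fraction of intervals containing p={p}: {coverage:.3f}")
```

The printed fraction should land near $0.95$. It will not match exactly, both because of simulation noise and because the normal approximation is only approximate for a discrete population.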

<div style="display: flex; align-items: center; justify-content: left; background-color:#330066; width:99%;"> 
        <img style="float: left; max-width: 100%; max-height:100%; margin: 15px;" src="../../images/MLU_robot.png" alt="MLU robot" width="100" height="100"/>
    <span style="color: white; padding-left: 10px; align: left; margin: 15px;">
        <h3>Congratulations!</h3>
        You have completed Lab 3.4: Inferential Statistics and the Central Limit Theorem of Lecture 3: Probability and Statistics Fundamentals of MLU Mathematical Fundamentals for Machine Learning.
        <br/>
    </span>
</div>