<h1 style="text-align: center;">AAE 590 Surrogate Methods</h1>

## Review of Probability and Statistics

This notebook supports the probability and statistics material covered in class. It describes the uniform and normal distributions and shows how to use the corresponding distribution objects in the `scipy.stats` module to compute the pdf, cdf, and related quantities. The following topics are covered:

1. [Uniform Distribution](#Uniform-Distribution)
2. [Normal Distribution](#Normal-Distribution)
3. [Bivariate Normal Distribution](#Bivariate-Normal-Distribution)

Code blocks between the text provide Python implementations of the tasks described. You can run each block to see the results, and you can change the parameter values to see how the results change.

Please go through the notebook entirely and reach out to the teaching team if you have any doubts.

You need to install **seaborn**. Activate the environment you created for this class in the anaconda prompt and install seaborn using `pip install seaborn`.

<font color='red'>**Please run the below block of code before you run any other block**</font> - it imports all the packages needed for this notebook.

In [None]:
from scipy.stats import norm # Imports normal distribution
from scipy.stats import uniform # Imports uniform distribution
from scipy.stats import multivariate_normal # Imports multivariate normal distribution
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

### Uniform Distribution

A continuous random variable $X$ is said to have a uniform distribution on the interval $[A, B]$ if the pdf of $X$ is:

$$
    f(x;A,B) = 
    \begin{cases}
        \frac{1}{B-A} & A \leq x \leq B \\
         0 & \text{otherwise}
    \end{cases}
$$

Under this distribution, every value between $A$ and $B$ is equally likely. The statement that $X$ has a uniform distribution on $[A, B]$ will be denoted by $X \sim$ Unif $[A, B]$. Now, we will look at an example of this distribution.

**Example**: Suppose the reaction temperature $X$ (in $^{\circ}$C) in a chemical process has a uniform distribution with
$A = -10$ and $B = 20$. Thus, the pdf of $X$ is:

$$
    f(x;A,B) = 
    \begin{cases}
        \frac{1}{30} & -10 \leq x \leq 20 \\
         0 & \text{otherwise}
    \end{cases}
$$

Now, let's use the `uniform` object in the `scipy.stats` module to answer various questions related to this example. By default, the `uniform` object is in standard form, i.e., $A = 0$ and $B = 1$. So, we need to specify `loc` (which is $A$) and `scale` (which is $B - A$). Reading the [documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.uniform.html#scipy.stats.uniform) for the uniform distribution implemented in scipy will help.

**Question**: Compute the mean, variance, and standard deviation of this distribution.

**Answer**: Once the `uniform` object is imported (which you did when you ran the first block in this notebook), you can access the various functions related to the distribution. The methods `mean()`, `var()`, and `std()` of the distribution object compute these quantities, as shown in the following block.

In [None]:
# Defining starting point and range of uniform distribution
# loc = A
# scale = B - A
A = -10
B = 20
loc = A
scale = B - A

# Creating uniform distribution object with fixed location and scale parameters
rv = uniform(loc=loc, scale=scale)

# Compute mean of the distribution
print("Mean for this distribution: {}".format(rv.mean()))

# Compute variance of the distribution
print("Variance for this distribution: {}".format(rv.var()))

# Compute std-dev of the distribution
print("Standard deviation for this distribution: {}".format(rv.std()))
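
As a cross-check, the same quantities follow from the standard closed-form results for a uniform distribution: mean $(A+B)/2$ and variance $(B-A)^2/12$. The short sketch below evaluates them with plain Python; the variable names `mean`, `variance`, and `std_dev` are chosen here for illustration only.

```python
import math

# Interval endpoints from the example above
A = -10
B = 20

# Closed-form moments of Unif[A, B]
mean = (A + B) / 2            # midpoint of the interval
variance = (B - A) ** 2 / 12  # (B - A)^2 / 12
std_dev = math.sqrt(variance)

print(f"Mean: {mean}")
print(f"Variance: {variance}")
print(f"Standard deviation: {std_dev:.4f}")
```

These values should agree with the ones reported by `rv.mean()`, `rv.var()`, and `rv.std()`.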

**Question**: Compute $P(X < 10)$.

**Answer**: Since $X$ is continuous, $P(X < 10) = P(X \leq 10) = F(10)$. So, we have to compute the cdf of the uniform distribution at $10$. You can do this as shown in the following block:

In [None]:
# P(X < 10)
rv.cdf(10)

**Question**: Compute $P(-5 < X < 5)$:

**Answer**: Here, $P(-5 < X < 5) = P(-5 \leq X \leq 5) = F(5) - F(-5)$. So, we have to compute the cdf of the uniform distribution at $5$ and $-5$. You can do this as shown in the following block:

In [None]:
# P(-5 < X < 5)
rv.cdf(5) - rv.cdf(-5)
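
For a uniform distribution the cdf has a simple closed form, $F(x) = (x - A)/(B - A)$ for $A \leq x \leq B$, with $F(x) = 0$ below $A$ and $F(x) = 1$ above $B$. The sketch below is a hand-written version of that formula; `uniform_cdf` is a helper defined here for illustration and is not part of scipy.

```python
def uniform_cdf(x_value, a, b):
    """CDF of Unif[a, b]: 0 below a, (x - a)/(b - a) inside, 1 above b."""
    if x_value <= a:
        return 0.0
    if x_value >= b:
        return 1.0
    return (x_value - a) / (b - a)

A, B = -10, 20

# P(X < 10) = F(10)
p_less_than_10 = uniform_cdf(10, A, B)

# P(-5 < X < 5) = F(5) - F(-5)
p_between = uniform_cdf(5, A, B) - uniform_cdf(-5, A, B)

print(p_less_than_10)
print(p_between)
```

The two results should match the values returned by `rv.cdf` in the blocks above.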

**Question**: Plot cdf and pdf of the distribution.

**Answer**: We can use the `rv` object created in a previous block to compute the values of the pdf and cdf at a range of $x$ values, and then use `matplotlib` to plot them. The code in the following block does this.

In [None]:
# Creating array of x values at which pdf and cdf will be computed while plotting
x = np.linspace(-30, 30, 100)

# Plotting PDF
fig, ax = plt.subplots()
ax.step(x, rv.pdf(x), where='post')
ax.set_xlabel("$x$")
ax.set_ylabel("PDF")
ax.grid()
plt.show()

# Plotting CDF
fig, ax = plt.subplots()
ax.plot(x, rv.cdf(x))
ax.set_xlabel("$x$")
ax.set_ylabel("CDF")
ax.grid()
plt.show()

Now, we will look into the *frequency interpretation* of probability. You can read more about it [here](https://online.stat.psu.edu/stat500/lesson/2/2.3). The code below plots the distribution of samples drawn from the uniform distribution. The number of samples is initially set to 10 and increases by an order of magnitude with every iteration. Run the code block below and look at the plots.

In [None]:
# Some settings
initial_samples = 10
iter = 6

for i in range(iter):
    # Number of samples
    samples = initial_samples*10**(i)

    # Generate samples from the distribution
    data = rv.rvs(size=samples)

    # Plotting using seaborn
    fig, ax = plt.subplots()
    plot = sns.histplot(data, stat="density", ax=ax)
    ax.set_xlabel("x")
    ax.set_xlim([-20, 40])

Note that all the samples lie between $A$ and $B$, and as the number of samples increases the density value approaches $1/30$, which is the theoretical density value. You can play around with the values of `iter`, `initial_samples`, $A$, and $B$ and see how the distribution changes.
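
The frequency interpretation can also be checked numerically rather than visually: the fraction of samples that fall below 10 should approach $P(X < 10) = 2/3$ as the number of samples grows. The sketch below uses Python's standard `random` module instead of `rv.rvs`; the seed and the sample size are arbitrary choices made for this illustration.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

A, B = -10, 20
n_samples = 100_000

# Draw samples from Unif[A, B] and count how many fall below 10
count_below_10 = sum(1 for _ in range(n_samples) if random.uniform(A, B) < 10)

estimate = count_below_10 / n_samples  # relative frequency
exact = (10 - A) / (B - A)             # F(10) from the closed-form cdf

print(f"Relative frequency: {estimate:.4f}")
print(f"Exact probability:  {exact:.4f}")
```

Reducing `n_samples` makes the relative frequency fluctuate more around the exact value.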

### Normal Distribution

A continuous random variable $X$ is said to have a normal distribution (or Gaussian distribution) with parameters $\mu$ and $\sigma$, where $-\infty < \mu < \infty$ and $\sigma > 0$, if the pdf of $X$ is

$$
    f(x;\mu,\sigma) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \text{, where } -\infty < x < \infty
$$

The statement that $X$ is normally distributed with parameters $\mu$ and $\sigma$ is often abbreviated as $X \sim \mathcal{N}(\mu,\sigma)$. We will work through an example, as we did for the uniform distribution.

**Example**: Suppose the force acting on a column that helps to support a building is a normally distributed random variable $X$ with mean value 9 N and standard deviation 1.5 N.

Now, let's use the `norm` object in the `scipy.stats` module to answer various questions related to this example. Reading the [documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html#scipy.stats.norm) for the normal distribution implemented in scipy will help.

**Question**: Compute mean, variance, and standard deviation.

**Answer**: This is straightforward since the mean and standard deviation are the parameters of a Gaussian distribution, but to demonstrate, we will use the `norm` object.

In [None]:
loc = 9 # mean
scale = 1.5 # std dev

# Creating normal distribution object with fixed mean and standard deviation
rv = norm(loc=loc, scale=scale)

# Compute mean of the distribution
print("Mean for this distribution: {}".format(rv.mean()))

# Compute variance of the distribution
print("Variance for this distribution: {}".format(rv.var()))

# Compute std-dev of the distribution
print("Standard deviation for this distribution: {}".format(rv.std()))

**Question**: Compute $P(X \leq 8.8)$

**Answer**: $P(X \leq 8.8) = F(8.8)$. So, we need to compute the cdf of the normal distribution at 8.8. The following block of code shows how to do that:

In [None]:
# P(X <= 8.8)
rv.cdf(8.8)

**Question**: Compute $P(X \leq 11)$

**Answer**: $P(X \leq 11) = F(11)$. So, we need to compute the cdf of the normal distribution at 11. The following block of code shows how to do that:

In [None]:
# P(X <= 11)
rv.cdf(11)

**Question**: Compute $P(X \geq 7.5)$

**Answer**: $P(X \geq 7.5) = P(X > 7.5) = 1 - P(X \leq 7.5) = 1 - F(7.5)$. So, we need to compute the cdf of the normal distribution at 7.5. The following block of code shows how to do that:

In [None]:
# P(X >= 7.5)
1 - rv.cdf(7.5)

**Question**: Compute $P(9 \leq X \leq 10)$

**Answer**: $P(9 \leq X \leq 10) = F(10) - F(9)$. So, we need to compute the cdf of the normal distribution at 9 and 10. The following block of code shows how to do that:

In [None]:
# P(9 <= X <= 10)
rv.cdf(10) - rv.cdf(9)
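
The four probabilities above can be cross-checked by standardizing: if $Z = (X - \mu)/\sigma$, then $P(X \leq x) = \Phi\left((x - \mu)/\sigma\right)$, where $\Phi$ is the standard normal cdf. The sketch below evaluates $\Phi$ with the error function from Python's `math` module, using the identity $\Phi(z) = \frac{1}{2}\left[1 + \mathrm{erf}(z/\sqrt{2})\right]$. The helper names `std_normal_cdf` and `normal_cdf` are defined here for illustration only.

```python
import math

def std_normal_cdf(z):
    """Standard normal cdf, Phi(z), written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma = 9.0, 1.5  # mean and standard deviation from the example

def normal_cdf(value):
    """P(X <= value) for X ~ N(mu, sigma), obtained by standardizing."""
    return std_normal_cdf((value - mu) / sigma)

p_le_8_8 = normal_cdf(8.8)                   # P(X <= 8.8)
p_le_11 = normal_cdf(11)                     # P(X <= 11)
p_ge_7_5 = 1 - normal_cdf(7.5)               # P(X >= 7.5)
p_9_to_10 = normal_cdf(10) - normal_cdf(9)   # P(9 <= X <= 10)

for label, p in [("P(X <= 8.8)", p_le_8_8), ("P(X <= 11)", p_le_11),
                 ("P(X >= 7.5)", p_ge_7_5), ("P(9 <= X <= 10)", p_9_to_10)]:
    print(f"{label}: {p:.4f}")
```

Each printed value should agree with the corresponding `rv.cdf` result to the displayed precision.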

**Question**: Plot the pdf and cdf for this example.

**Answer**: We can use the `rv` variable created in a previous block to compute the values of the pdf and cdf at a range of $x$ values, and then use `matplotlib` to plot them. The code in the following block does this. Note that the `rv` variable now holds the normal distribution and not the uniform distribution, since it was redefined in one of the previous blocks.

In [None]:
# Creating array of x values at which pdf and cdf will be computed while plotting
x = np.linspace(4, 14, 200)

# Plotting PDF
fig, ax = plt.subplots()
ax.plot(x, rv.pdf(x))
ax.set_xlabel("$x$")
ax.set_ylabel("PDF")
ax.grid()
plt.show()

# Plotting CDF
fig, ax = plt.subplots()
ax.plot(x, rv.cdf(x))
ax.set_xlabel("$x$")
ax.set_ylabel("CDF")
ax.grid()
plt.show()
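
A quick sanity check on the PDF plot is to evaluate the density formula by hand. At $x = \mu$ the exponential term equals 1, so the peak height is $1/(\sigma\sqrt{2\pi})$, and one standard deviation away from the mean the density drops by a factor of $e^{-1/2}$. The sketch below implements the pdf formula directly; `normal_pdf` is a helper written for this illustration.

```python
import math

mu, sigma = 9.0, 1.5  # mean and standard deviation from the example

def normal_pdf(x_value, mean, std_dev):
    """Normal pdf evaluated directly from its formula."""
    coeff = 1.0 / (std_dev * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x_value - mean) ** 2) / (2.0 * std_dev ** 2))

peak = normal_pdf(mu, mu, sigma)               # density at the mean
one_sigma = normal_pdf(mu + sigma, mu, sigma)  # density one std dev above the mean

print(f"Density at the mean: {peak:.4f}")
print(f"Density at mean + sigma: {one_sigma:.4f}")
print(f"Ratio: {one_sigma / peak:.4f}")
```

The density at the mean should match the maximum of the plotted PDF curve.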

As in the previous case, we will look into the frequency interpretation of probability. The code below plots the distribution of randomly drawn samples as the number of samples increases. Since the number of samples becomes large, it might take some time to generate the plots.

In [None]:
# Some parameters
initial_samples = 10
iter = 6

for i in range(iter):
    # Number of samples
    samples = initial_samples*10**(i)

    # Generate samples from the distribution
    data = rv.rvs(size=samples)

    # Plotting using seaborn
    fig, ax = plt.subplots()
    plot = sns.histplot(data, stat="density", ax=ax, kde=True)
    ax.set_xlabel("x")
    ax.set_xlim([3, 15])

As in the previous case, as the number of samples increases, the density curve approaches the theoretical normal density curve. You can play around with the values of `iter` and `initial_samples` and see how the distribution changes.

### Bivariate Normal Distribution

A bivariate normal distribution is a joint distribution of two random variables in which each variable is individually (marginally) normally distributed. We will use the `multivariate_normal` object in the `scipy.stats` module to answer a question about the example below. Reading the [documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.multivariate_normal.html#scipy.stats.multivariate_normal) for the multivariate normal distribution implemented in scipy will help.

**Example**: Consider the SAT exam score for a randomly selected student. Let $X$ and $Y$ denote the Critical Reading and Mathematics scores, respectively, for a randomly selected student. The population of students taking the exam in Fall 2012 had the following results:

$$
    \mu_x = 490 \text{, } \sigma_x = 120 \text{, } \mu_y = 550 \text{, } \sigma_y = 60
$$

The correlation coefficient $\rho$ is -0.25.

**Question**: What is the probability that a randomly selected student scored at most 650 on both Critical Reading and Mathematics, i.e., what is $P(X \leq 650 \cap Y \leq 650)$?

**Answer**: The covariance matrix $\Sigma$ of $(X, Y)$ is:

$$
    \Sigma = \left[ { \begin{array}{cc}
        Var(X) & Cov(X,Y) \\ Cov(Y,X) & Var(Y)
    \end{array} } \right]
    =
    \left[ { \begin{array}{cc}
        \sigma_x^2 & \rho \sigma_x \sigma_y \\
        \rho \sigma_x \sigma_y & \sigma_y^2
    \end{array} } \right]
$$
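
With the numbers from the example, the off-diagonal entry is $\rho \sigma_x \sigma_y = (-0.25)(120)(60)$. The sketch below fills in the matrix using plain Python lists and checks that it is a valid covariance matrix: for a $2 \times 2$ symmetric matrix with positive diagonal entries, a positive determinant is enough, and the determinant equals $\sigma_x^2 \sigma_y^2 (1 - \rho^2)$. The names `cov_xy`, `cov_matrix`, and `det` are chosen here for illustration.

```python
# Parameters as stated in the example
sigma_x = 120
sigma_y = 60
rho = -0.25

# Off-diagonal entry: Cov(X, Y) = rho * sigma_x * sigma_y
cov_xy = rho * sigma_x * sigma_y

# 2x2 covariance matrix as a nested list
cov_matrix = [
    [sigma_x ** 2, cov_xy],
    [cov_xy, sigma_y ** 2],
]

# Determinant of the 2x2 matrix; positive whenever |rho| < 1
det = cov_matrix[0][0] * cov_matrix[1][1] - cov_matrix[0][1] * cov_matrix[1][0]

print(cov_matrix)
print(f"Determinant: {det}")
```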

The answer to the question is the bivariate normal cdf evaluated at $x = 650$ and $y = 650$.

In [None]:
# P(X <= 650, Y <= 650)

# Defining individual means, standard deviations, and correlation coefficient
mean_x = 490
sigma_x = 120
mean_y = 550
sigma_y = 60
rho = -0.25

# Defining mean vector and covariance matrix
mean = np.array([mean_x, mean_y])
cov = np.array([[sigma_x**2, rho*sigma_x*sigma_y], [rho*sigma_x*sigma_y, sigma_y**2]])

# Creating multivariate normal distribution object with fixed mean vector and covariance matrix
rv = multivariate_normal(mean=mean, cov=cov)

# Evaluation point
x = np.array([650, 650])

# Answer
rv.cdf(x=x)
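
The joint probability can also be estimated by simulation, in the spirit of the frequency interpretation. The sketch below does not use scipy: it builds correlated normal pairs from two independent standard normal draws $z_1, z_2$ via $x = \mu_x + \sigma_x z_1$ and $y = \mu_y + \sigma_y\left(\rho z_1 + \sqrt{1-\rho^2}\, z_2\right)$, which gives the required means, standard deviations, and correlation. It assumes $\rho = -0.25$ as stated in the example text; the seed and sample size are arbitrary choices for this illustration.

```python
import math
import random

random.seed(0)  # fixed seed so the run is repeatable

# Parameters as stated in the example text
mean_x, sigma_x = 490, 120
mean_y, sigma_y = 550, 60
rho = -0.25

n_samples = 200_000
count_both_le_650 = 0

for _ in range(n_samples):
    # Two independent standard normal draws
    z1 = random.gauss(0, 1)
    z2 = random.gauss(0, 1)

    # Correlated scores with the required means, std devs, and correlation
    x_score = mean_x + sigma_x * z1
    y_score = mean_y + sigma_y * (rho * z1 + math.sqrt(1 - rho ** 2) * z2)

    if x_score <= 650 and y_score <= 650:
        count_both_le_650 += 1

estimate = count_both_le_650 / n_samples
print(f"Estimated P(X <= 650, Y <= 650): {estimate:.4f}")
```

The estimate should be close to the cdf value computed with `multivariate_normal` when the same $\rho$ is used, with small run-to-run variation from sampling.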

As in the previous cases, we will look into the frequency interpretation of probability. The code below plots the distribution of randomly drawn samples as the number of samples increases. **Note**: since the number of samples drawn is large, it will take some time to generate the plots.

In [None]:
# Some parameters
initial_samples = 10
iter = 5

for i in range(iter):
    # Number of samples
    samples = initial_samples*10**(i)
    
    # Generate samples from the distribution
    data = rv.rvs(size=samples)
    
    # Plotting using seaborn
    plot = sns.jointplot(x=data[:,0], y=data[:,1], kind="kde", fill=True)

The contours depict the joint density of the samples drawn from the distribution; the top plot shows the distribution of the $x$ samples and the side plot shows that of the $y$ samples. Notice that as you increase the number of samples, both marginal distributions approach the theoretical normal distribution around their respective means, the contours become elliptical in shape, and the center of the contours approaches ($\mu_x$, $\mu_y$). Also, since $\rho$ is negative, the elliptical contours are tilted, i.e., as the $x$ value increases the $y$ value tends to decrease, and vice versa. You can change the values of $\mu_x$, $\sigma_x$, $\mu_y$, $\sigma_y$, and $\rho$ and see how the plots and the computed probability change.