# Week 5: Probability Distributions

In [1]:
# Loading the libraries
import numpy as np
import sympy as sy
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

## Days 2 and 3: Some Special Probability Distributions

Today we continue with some continuous distributions that appear frequently in statistical analysis and in machine learning.

### Normal Distribution
The **Normal Distribution** goes by many names, and most people have heard of it. It is also known as the **Gaussian Distribution** or the **Bell Curve**. The normal distribution has two parameters: the *location* parameter, given by the **mean** $\mu$ of the distribution (which is also its most likely outcome), and the *scale* parameter, given by the **variance** $\sigma^2$ or, equivalently, the **standard deviation** $\sigma$ (which determines the horizontal stretch of the distribution). If a random variable $X$ follows a normal distribution with mean $\mu$ and standard deviation $\sigma$, we write:
\begin{equation} X \sim \mathcal{N}(\mu, \sigma) \end{equation}
For any such random variable, $X \in (-\infty, \infty)$ and the pdf is given by:
\begin{equation} f(x \mid \mu, \sigma) = \displaystyle \frac{1}{\sqrt{2\pi}\cdot \sigma}~\exp\left( {-\frac{1}{2} ~ \left( \frac{x - \mu}{\sigma} \right)^2} \right) \end{equation}
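
As a quick sanity check, the pdf formula can be coded by hand and compared with `scipy.stats.norm`. This is only a sketch: the helper `normal_pdf` and the values `mu_demo` and `sigma_demo` are illustrative assumptions, not part of the course material.

```python
import numpy as np
import scipy.stats as stats

def normal_pdf(x, mu, sigma):
    """Evaluate the normal pdf directly from the formula (hypothetical helper)."""
    z = (x - mu) / sigma
    return np.exp(-0.5 * z**2) / (np.sqrt(2 * np.pi) * sigma)

# Illustrative parameter values (assumed for this sketch)
mu_demo, sigma_demo = 1.5, 2.0
grid = np.linspace(mu_demo - 4 * sigma_demo, mu_demo + 4 * sigma_demo, 9)

manual = normal_pdf(grid, mu_demo, sigma_demo)
library = stats.norm(mu_demo, sigma_demo).pdf(grid)

# True when the hand-written formula agrees with scipy on the grid
print(np.allclose(manual, library))
```

Note that `stats.norm(mu, sigma)` takes the standard deviation, not the variance, as its scale argument, which matches the $\mathcal{N}(\mu, \sigma)$ notation used here.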

Every normal distribution satisfies the **68-95-99.7 Rule**, also known as the **Empirical Rule**, which relates the mean, the standard deviation, and the normal probabilities:
\begin{equation}
\begin{array}{rcl}
P \big(\mu - \sigma \leqslant X \leqslant \mu + \sigma \big) &\approx& 0.68\\
P \big(\mu - 2\sigma \leqslant X \leqslant \mu + 2\sigma \big) &\approx& 0.95\\
P \big(\mu - 3\sigma \leqslant X \leqslant \mu + 3\sigma \big) &\approx& 0.997
\end{array}
\end{equation}
In other words, about 99.7% of outcomes fall **within 3 standard deviations of the mean**; outcomes farther out occur so rarely that they can usually be disregarded.
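
The three probabilities in the rule can be checked numerically with the cdf. The sketch below uses an arbitrary normal variable (the values `10.0` and `3.0` are assumptions chosen only for illustration); any other choice of mean and standard deviation should give the same three numbers.

```python
import scipy.stats as stats

# Arbitrary illustrative parameters (assumed for this sketch)
mu_demo, sigma_demo = 10.0, 3.0
demo_rv = stats.norm(mu_demo, sigma_demo)

within = {}
for k in (1, 2, 3):
    # P(mu - k*sigma <= X <= mu + k*sigma) as a difference of cdf values
    within[k] = demo_rv.cdf(mu_demo + k * sigma_demo) - demo_rv.cdf(mu_demo - k * sigma_demo)
    print(f"within {k} standard deviation(s): {within[k]:.4f}")
```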

### Example 1: Normal Distribution
The weights of babies born at Prince Louis Maternity Hospital last year averaged $\mu = 3.0$ kg with a standard deviation of $\sigma = 200$ grams.
* Visualize the distribution in the range $[\mu - 4\sigma, \mu + 4\sigma]$
* What is the probability that a randomly selected baby born at the hospital weighs less than 3.2 kg?
* If there were 545 babies born at this hospital last year, estimate the number of babies that weighed between 2.8 kg and 3.4 kg
* Find the weight $w$ such that 40% of babies weigh less than $w$

In [2]:
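# use scipy.stats.norm()
# Define mu and sigma
mu = 3.0
sigma = 0.2  # 200 grams, in kilograms!
weights_rv = stats.norm(mu, sigma)

# Visualize the distribution
xs = np.linspace(mu-4*sigma, mu+4*sigma, 1000)
plt.figure()
plt.plot(xs, weights_rv.pdf(xs))
plt.xlabel('weight (kg)')
plt.ylabel('density')
plt.show()

# P(X < 3.2)
print(weights_rv.cdf(3.2))
# Expected number of the 545 babies between 2.8 kg and 3.4 kg
print(545 * (weights_rv.cdf(3.4) - weights_rv.cdf(2.8)))
# Weight w with P(X < w) = 0.40
print(weights_rv.ppf(0.40))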

### Example 2: Normal Distribution
The heights of a group of students are normally distributed with a mean of 160 cm and a standard deviation of 20 cm.
* A student is chosen at random. Find the probability that the student’s height is greater than 180 cm.
* In this group of students, 11.9% have heights less than $d$ cm. Find the value of $d$.
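
One possible way to approach both parts is sketched below with `scipy.stats.norm`. The variable names `heights_rv`, `p_taller`, and `d` are choices made for this sketch: `sf` is the survival function $1 - \text{cdf}$, and `ppf` is the inverse of the cdf.

```python
import scipy.stats as stats

# Heights ~ N(160, 20), in centimetres
heights_rv = stats.norm(160, 20)

# P(X > 180): the survival function is 1 - cdf
p_taller = heights_rv.sf(180)
print(p_taller)

# d such that P(X < d) = 0.119: invert the cdf with ppf
d = heights_rv.ppf(0.119)
print(d)
```

Since 180 cm is exactly one standard deviation above the mean, the first answer can be cross-checked against the Empirical Rule: roughly $(1 - 0.68)/2 = 0.16$.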

In [None]:
# Define the parameters


### The Standard Normal Distribution
The **Standard Normal Distribution** $Z$ is the normal distribution with $\mu = 0$ and $\sigma = 1$:
\begin{equation} Z \sim \mathcal{N}(0, 1) \end{equation}
It can be used as a "universal comparison tool" between all possible normal distributions via a process called **standardization** (recall $z$-scores from descriptive statistics).

The relationship that scales a random variable $X \sim \mathcal{N}(\mu, \sigma)$ to $Z \sim \mathcal{N}(0, 1)$ is given by
\begin{equation} Z = \frac{X - \mu}{\sigma} \end{equation}
This means that any question about a normally distributed random variable $X$ can be answered using calculations for the standard normal distribution $Z$. Historically, statistical tables for the normal distribution listed values for $Z$ only.
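
To see standardization in action, the sketch below computes the same probability twice: once directly from $X$ and once from $Z$ after converting to a $z$-score. The parameters `50.0` and `8.0` and the point `62.0` are illustrative assumptions.

```python
import scipy.stats as stats

# Illustrative normal variable (values assumed for this sketch)
mu_x, sigma_x = 50.0, 8.0
x_rv = stats.norm(mu_x, sigma_x)
std_normal = stats.norm(0, 1)

x = 62.0
z = (x - mu_x) / sigma_x          # z-score of x

p_direct = x_rv.cdf(x)            # P(X <= x) computed from X itself
p_standard = std_normal.cdf(z)    # P(Z <= z) computed from the standard normal

print(z, p_direct, p_standard)
```

The two probabilities agree up to floating-point error, which is why a single table for $Z$ is enough.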

### Example 3
A certain college requires a score of 900 on the GBT test for admission, but it will also accept an equivalent grade on the MRST test. The mean score on the GBT is 1020 and the standard deviation is 140; the mean score on the MRST is 21 and the standard deviation is 4.7. What is the minimum score on the MRST that the college will accept?

In [4]:
## Define the normal variables
# G ~ N(1020, 140)
G = stats.norm(1020, 140)

# M ~ N(21, 4.7)
M = stats.norm(21, 4.7)

# Get the cutoff percentile for GBT
cutoff_perc = G.cdf(900)
print(cutoff_perc)
# Get the cutoff points for MRST
cutoff_pts = M.ppf(cutoff_perc)
print(cutoff_pts)

0.19568296915377598
16.97142857142857
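
The same cutoff can also be reached without any cdf calls, by matching $z$-scores directly: an "equivalent" MRST score is one that sits the same number of standard deviations from its mean as 900 does on the GBT. A minimal sketch, with variable names chosen here for illustration:

```python
# z-score of the GBT cutoff
z_cutoff = (900 - 1020) / 140

# MRST score with the same z-score
mrst_cutoff = 21 + z_cutoff * 4.7

print(z_cutoff, mrst_cutoff)
```

This agrees with the `ppf` result above because `cdf` followed by `ppf` simply passes through the shared $z$-score.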


### Example 4
The weight loss, in kilograms, of people using the slimming regime SLIM3M for a period of three months is modelled by a random variable $X$. Experimental data showed that
* 67% of the individuals using SLIM3M lost up to 5 kilograms, and
* 12.4 % lost at least 7 kilograms

Assuming that $X$ follows a Normal distribution, find the mean weight loss of a person who follows the SLIM3M regime for three months.
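
One way to set the problem up by hand is to standardize both statements. Here $z_p$ denotes the value with $P(Z \leqslant z_p) = p$ (a notation introduced only for this sketch). "Lost up to 5 kilograms" gives $P(X \leqslant 5) = 0.67$, and "lost at least 7 kilograms" gives $P(X \leqslant 7) = 1 - 0.124 = 0.876$, so
\begin{equation}
\frac{5 - \mu}{\sigma} = z_{0.67}, \qquad \frac{7 - \mu}{\sigma} = z_{0.876}
\end{equation}
Subtracting the first equation from the second eliminates $\mu$:
\begin{equation}
\sigma = \frac{2}{z_{0.876} - z_{0.67}}, \qquad \mu = 5 - \sigma \, z_{0.67}
\end{equation}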

In [6]:
# Define Z
Z = stats.norm(0,1)
z1 = Z.ppf(0.67)       # P(X <= 5) = 0.67
z2 = Z.ppf(1 - 0.124)  # P(X >= 7) = 0.124, so P(X <= 7) = 0.876

#graph


# Set up the standardization equations symbolically
mu, sigma = sy.symbols('mu sigma', real = True)
z1 = sy.sympify(z1)
z2 = sy.sympify(z2)

eq1 = sy.Eq((5-mu)/sigma, z1)
eq2 = sy.Eq((7-mu)/sigma, z2)

# dict=True returns a list of solution dictionaries keyed by symbol
sol = sy.solve((eq1, eq2), (mu, sigma), dict=True)[0]

sol[mu], sol[sigma]

### $\chi^2$ Distribution (chi-square)
The chi-square distribution appears in the comparison of distributions and in tests of independence and homogeneity (among others). If $Z_1, Z_2, \ldots, Z_k$ are standard normal variables that are independent of one another, then the variable
\begin{equation} \chi_k^2 = Z_1^2 + Z_2^2 + \ldots + Z_k^2 \end{equation}
follows the $\chi^2$ distribution with $k$ degrees of freedom.
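
The definition can be illustrated by simulation: draw $k$ independent standard normal values, square and sum them, and compare the result with `scipy.stats.chi2`. This is a rough sketch; the seed, the sample size, and the choice $k = 3$ are assumptions made for illustration. The comparison uses the known mean $k$ and variance $2k$ of the $\chi_k^2$ distribution.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
k = 3                            # degrees of freedom (illustrative choice)
n_samples = 100_000

# Each row holds k independent standard normal draws
z_draws = rng.standard_normal((n_samples, k))
chi2_samples = (z_draws**2).sum(axis=1)

sample_mean = chi2_samples.mean()   # theory: k
sample_var = chi2_samples.var()     # theory: 2k
sample_median = np.median(chi2_samples)
theory_median = stats.chi2(k).ppf(0.5)

print(sample_mean, sample_var)
print(sample_median, theory_median)
```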


### Example 5
Construct and visualize the $\chi^2$-distributions for $k=1, 2, 3, 4$

In [None]:
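# Defining chi_1^2, chi_2^2, chi_3^2 and chi_4^2
chi2_rvs = {k: stats.chi2(k) for k in (1, 2, 3, 4)}

# Visualize the distributions
xs = np.linspace(0, 9, 1000)
plt.figure(figsize=(15,5))
for k, rv in chi2_rvs.items():
    plt.plot(xs, rv.pdf(xs), label=f'k = {k}')
plt.ylim(0, 1)  # the k = 1 density is unbounded near 0
plt.legend()
plt.show()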

### Student's $t$ distribution
Student's $t$ distribution appears in statistical tests related to the comparison of means and in the ANOVA (ANalysis Of VAriance) framework. It is key to analyzing symmetric data from fairly small samples.

Let the random variables $Z$ and $\chi_k^2$ be independent, with $Z$ being standard normal and $\chi_k^2$ following a chi-square distribution with $k$ degrees of freedom. Then $t_k$, the $t$ distribution with $k$ degrees of freedom, is defined as
\begin{equation}
t_k = \frac{Z}{\sqrt{\left.\chi_k^2 \middle/ k\right.}}
\end{equation}
The $t_k$ distribution is centered at zero and has a shape similar to that of the $Z$ distribution. $t_1$ is farthest from $Z$, and as $k$ increases, $t_k$ grows closer to $Z$. In the limiting case, $t_k \to Z$ as $k \to \infty$.

### Example 6
Construct and visualize $t_k$ for $k = 1, 5, 10, 40$

In [None]:
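# Defining t_k
t_rvs = {k: stats.t(k) for k in (1, 5, 10, 40)}

# Visualize the distributions
xs = np.linspace(-4, 4, 1000)
plt.figure(figsize=(10, 5))
for k, rv in t_rvs.items():
    plt.plot(xs, rv.pdf(xs), label=f'k = {k}')
plt.legend()
plt.show()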