# Probability distributions II


In [None]:
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

# for some extra plotting tools
import pylab as p

## Continuous probability distributions

**Continuous probability distributions** describe random variables that can take any value in a given range. In particular, they can take infinitely many different values.

X is a continuous random variable.  
X follows a continuous probability distribution.

### Uniform distribution on interval [0, 1]

Every point in the interval [0, 1] is equally likely: the density is constant over the interval

In a continuous probability distribution, any single value has probability zero, so it only makes sense to talk about the probability of an interval, not of a particular number

$X \sim U(0, 1)$

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/2/28/Unit-interval.svg/1200px-Unit-interval.svg.png" width=500>

$P(X \leq 0.4)=$

$P(X \geq 0.7)=$

$P(X \leq 1)= $

$P(X \geq 0)=$

$P(0.1 \leq X \leq 0.4)=$

For all $c, d \in [0, 1]$ with $c \leq d$ we have 

$$P(c < X \leq d) = d - c$$
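
The formula can be checked by simulation. The sketch below is a minimal Monte Carlo check, assuming hypothetical endpoints `c = 0.25` and `d = 0.6` and a fixed seed (neither comes from the text): the fraction of draws landing in $(c, d]$ should be close to $d - c$.

```python
import numpy as np
from scipy.stats import uniform

c, d = 0.25, 0.6  # hypothetical endpoints, chosen only for illustration

# draw many values from U(0, 1); the seed is fixed so the run is reproducible
draws = uniform(0, 1).rvs(size=100_000, random_state=0)

# fraction of draws that land in the interval (c, d]
empirical = np.mean((draws > c) & (draws <= d))
exact = d - c

print(f"empirical: {empirical:.3f}, exact: {exact:.3f}")
```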

Let's instantiate a `scipy.stats` distribution object and use it to generate samples from a $U(0, 1)$ distribution

In [None]:
from scipy.stats import uniform

In [None]:
my_uniform = uniform(0, 1)

In [None]:
my_uniform.rvs(size=1)

In [None]:
sample = my_uniform.rvs(size=10)

In [None]:
sample

In [None]:
sns.histplot(sample)

In [None]:
sample = my_uniform.rvs(size=100)

In [None]:
sns.histplot(sample)

In [None]:
sample = my_uniform.rvs(size=1000)

In [None]:
sns.histplot(sample)

In [None]:
sample.mean()

In [None]:
my_uniform.mean()
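
The sample mean and the theoretical mean are different things: the first changes from sample to sample, the second is a fixed property of the distribution. As a rough sketch (the sample sizes and the seed are assumptions made for illustration), the gap between them tends to shrink as the sample grows, although not necessarily at every step.

```python
from scipy.stats import uniform

dist = uniform(0, 1)
sizes = [10, 100, 1_000, 100_000]  # hypothetical sample sizes

# absolute gap between each sample mean and the theoretical mean (0.5)
gaps = {
    n: abs(dist.rvs(size=n, random_state=0).mean() - dist.mean())
    for n in sizes
}

for n, gap in gaps.items():
    print(f"n={n:>6}: |sample mean - theoretical mean| = {gap:.4f}")
```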

`.cdf` is the cumulative distribution function  
`.cdf(x)` gives $P(X \leq x)$, which for a continuous distribution equals $P(X < x)$

In [None]:
my_uniform.cdf(1)

In [None]:
my_uniform.cdf(0)

In [None]:
my_uniform.cdf(0.5)

In [None]:
my_uniform.cdf(0.4)

Since  
$P(X < 0.2) + P(0.2 < X < 0.7) = P(X < 0.7)$

it follows that  
$P(0.2 < X < 0.7) = P(X < 0.7) - P(X < 0.2)$

In [None]:
my_uniform.cdf(0.7) - my_uniform.cdf(0.2)
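
The same subtraction works for any interval, so it can be wrapped in a small helper. The function name `interval_prob` and the widths below are hypothetical, introduced only for this sketch. Shrinking the interval around a point also shows why a single value has probability zero.

```python
from scipy.stats import uniform


def interval_prob(dist, lo, hi):
    """Return P(lo < X <= hi) for a frozen scipy distribution."""
    return dist.cdf(hi) - dist.cdf(lo)


X = uniform(0, 1)

# intervals centred on 0.5 with shrinking width; width 0 is a single point
widths = [0.2, 0.02, 0.002, 0.0]
probs = [interval_prob(X, 0.5 - w / 2, 0.5 + w / 2) for w in widths]

for w, prob in zip(widths, probs):
    print(f"width {w}: probability {prob:.4f}")
```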

### Uniform distribution on interval [a, b]

All numbers in the interval [a, b] are equally likely

$X \sim U(a, b)$

$P(X \leq b)=1$

$P(X \geq a)=1$

$\frac{a+b}{2}$ is the midpoint of $a$ and $b$, and also the mean of the distribution

$P(X \leq \frac{a+b}{2})=0.5$

Let's instantiate a `scipy.stats` distribution object and use it to generate samples from a $U(a, b)$ distribution

In [None]:
from scipy.stats import uniform

In [None]:
a, b = 4, 10

In [None]:
# h is the interval length
h = b - a
h

`uniform` takes the left endpoint `a` as `loc` and the interval length `h` as `scale`, so the distribution covers $[a, a + h]$

In [None]:
my_uniform = uniform(loc=a, scale=h)
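
A common slip is to pass the right endpoint as the second argument. The sketch below compares both calls through the frozen distribution's `support()` method; the variable names `intended` and `mistaken` are made up for the comparison.

```python
from scipy.stats import uniform

a, b = 4, 10

intended = uniform(loc=a, scale=b - a)  # covers [4, 10]
mistaken = uniform(a, b)                # loc=4, scale=10, so it covers [4, 14]

print("intended support:", intended.support())
print("mistaken support:", mistaken.support())
```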

`.rvs` generates a sample drawn from the distribution

In [None]:
my_uniform.rvs(size=1)

In [None]:
sample = my_uniform.rvs(size=100)

In [None]:
sample.mean()

In [None]:
sns.histplot(sample)

In [None]:
sample = my_uniform.rvs(size=1000)

In [None]:
sns.histplot(sample)

`.cdf` is the cumulative distribution function  
`.cdf(x)` gives $P(X \leq x)$

In [None]:
my_uniform.cdf(11)

In [None]:
my_uniform.cdf(10)

In [None]:
my_uniform.cdf(9.9)

In [None]:
my_uniform.cdf(4)

In [None]:
my_uniform.cdf(7)

In [None]:
my_uniform.cdf(8)
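
For a uniform distribution the cdf has a simple closed form: $(x - a)/(b - a)$ inside the interval, 0 to its left and 1 to its right. The helper `uniform_cdf` below is a hypothetical name for a hand-written version, compared against scipy at the points evaluated above.

```python
import numpy as np
from scipy.stats import uniform

a, b = 4, 10


def uniform_cdf(x, a, b):
    """Closed-form cdf of U(a, b): linear on [a, b], clipped to [0, 1]."""
    x = np.asarray(x, dtype=float)
    return np.clip((x - a) / (b - a), 0.0, 1.0)


xs = np.array([3, 4, 7, 8, 9.9, 10, 11])
manual = uniform_cdf(xs, a, b)
library = uniform(loc=a, scale=b - a).cdf(xs)

print(np.allclose(manual, library))
```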

Let's plot the cdf

In [None]:
x = np.linspace(0, 15, 100)
y = my_uniform.cdf(x)
fig, ax = plt.subplots(1, 1)
ax.plot(x,y)

- The cdf of a distribution is a non-decreasing function
- The cdf approaches 1 as $x$ grows to the right
- The cdf approaches 0 as $x$ decreases to the left
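
These properties can be checked numerically on a grid. This is only a sketch on the $U(4, 10)$ example, using the same grid as the plot; a finite grid illustrates the properties but does not prove them.

```python
import numpy as np
from scipy.stats import uniform

dist = uniform(loc=4, scale=6)  # U(4, 10), as in the example above
grid = np.linspace(0, 15, 100)
values = dist.cdf(grid)

# consecutive differences are never negative if the cdf never decreases
never_decreases = bool(np.all(np.diff(values) >= 0))

print(never_decreases, values[0], values[-1])
```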

In [None]:
my_uniform.mean()

In [None]:
sample.std()

In [None]:
my_uniform.std()
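
The theoretical values returned by `.mean()` and `.std()` match the standard formulas for a uniform distribution, $\frac{a+b}{2}$ and $\frac{b-a}{\sqrt{12}}$. A minimal check, with `mean_formula` and `std_formula` as illustrative names:

```python
import math
from scipy.stats import uniform

a, b = 4, 10
dist = uniform(loc=a, scale=b - a)

mean_formula = (a + b) / 2
std_formula = (b - a) / math.sqrt(12)

print(dist.mean(), mean_formula)
print(dist.std(), std_formula)
```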

### Exponential distribution

Models the waiting time until the next occurrence of a random event that happens at a constant average rate:
 * Time for next person in a queue to appear
 * Time for next call at call center to happen
 * Time for next radioactive particle to decay
 * Time for next DNA item to mutate

"The average time for a new patient to appear is 5 minutes"  
"The time for a new patient to appear follows an Exponential distribution with mean $5$"

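The exponential distribution is closely related to the Poisson distribution: if the number of events per unit of time follows a Poisson distribution with rate $\lambda$, the waiting time between consecutive events follows an exponential distribution with mean $1/\lambda$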
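
One way to see the link with the Poisson distribution is by simulation. The sketch below assumes a mean gap of 5 minutes (as in the patient example) and a hypothetical window of 60 minutes: it draws exponential gaps between arrivals, counts arrivals per window, and compares the mean and variance of the counts with `window / mean_gap`, which is what a Poisson distribution with that rate would give for both.

```python
import numpy as np
from scipy.stats import expon

mean_gap = 5.0   # average minutes between arrivals
window = 60.0    # hypothetical window length in minutes

# exponential gaps between consecutive arrivals (fixed seed for reproducibility)
gaps = expon(scale=mean_gap).rvs(size=200_000, random_state=0)
arrival_times = np.cumsum(gaps)

# count arrivals in each complete window, dropping the last partial one
n_windows = int(arrival_times[-1] // window)
counts = np.bincount((arrival_times // window).astype(int))[:n_windows]

expected = window / mean_gap
print(f"mean count: {counts.mean():.2f}, variance: {counts.var():.2f}, "
      f"expected for Poisson: {expected:.2f}")
```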

In [None]:
from scipy.stats import expon

The exponential, like the Poisson, is a $1$-parameter distribution

$X \sim Exp(\mu)$

This parameter is the mean, $\mu$ (many texts use the rate $\lambda = 1/\mu$ instead)

`scipy` calls the mean `scale`

Let's model clients who arrive at a supermarket queue every $30$ seconds on average

In [None]:
my_e = expon(scale=30)

In [None]:
my_e.mean()
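
Because `scale` is the mean and not the rate, a rate has to be inverted before it is passed in. The sketch below also shows a pitfall: the first positional argument of `expon` is `loc`, not `scale`. The names `rate`, `dist` and `shifted` are introduced here for illustration.

```python
from scipy.stats import expon

rate = 1 / 30  # arrivals per second: one every 30 seconds on average

dist = expon(scale=1 / rate)  # mean waiting time of 30 seconds
shifted = expon(30)           # loc=30, scale=1: a different distribution

print(dist.mean())
print(shifted.mean())
```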

`.rvs` generates a sample drawn from the distribution

In [None]:
my_e.rvs(size=1).round()

In [None]:
sample = my_e.rvs(size=100)

In [None]:
sample.round()

In [None]:
sample.max()

In [None]:
sns.histplot(sample)

In [None]:
sample = my_e.rvs(size=10000)

In [None]:
sns.histplot(sample)

In [None]:
my_e.cdf(30)

`.pdf` is the probability density function

It is the analogue of `.pmf` for discrete distributions, but its values are densities, not probabilities

In [None]:
x = np.linspace(0, 100, 100)
y = my_e.pdf(x)
fig, ax = plt.subplots(1, 1)
ax.plot(x,y)

The area under the curve over a given interval is the probability that the waiting time falls in that interval

The whole area is 1 (100%)
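
Both statements can be checked by numerical integration. This is a sketch using `scipy.integrate.quad`; the interval from 10 to 30 seconds is an arbitrary choice made for the example.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import expon

dist = expon(scale=30)

# area under the density between 10 and 30 seconds
area, _ = quad(dist.pdf, 10, 30)

# the same probability obtained from the cdf
from_cdf = dist.cdf(30) - dist.cdf(10)

# area under the whole curve
total, _ = quad(dist.pdf, 0, np.inf)

print(f"area: {area:.4f}, from cdf: {from_cdf:.4f}, total: {total:.4f}")
```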

`.cdf` is the cumulative distribution function  
`.cdf(x)` gives $P(X \leq x)$

In [None]:
my_e.cdf(10)

In [None]:
my_e.cdf(30)

In [None]:
my_e.cdf(90)

In [None]:
my_e.cdf(150)
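
For the exponential distribution the cdf has the closed form $1 - e^{-x/\mu}$ for $x \geq 0$. A short comparison with scipy at the points evaluated above, where `manual` and `library` are names chosen for the sketch:

```python
import numpy as np
from scipy.stats import expon

mu = 30
xs = np.array([10, 30, 90, 150])

manual = 1 - np.exp(-xs / mu)       # closed-form cdf
library = expon(scale=mu).cdf(xs)   # scipy's cdf

for x, m, l in zip(xs, manual, library):
    print(f"x={x:>3}: manual={m:.4f}, scipy={l:.4f}")
```

Note that the probability of waiting less than the mean is $1 - e^{-1}$, about 63%, not 50%.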

In [None]:
x = np.linspace(-5, 100, 100)
y = my_e.cdf(x)
fig, ax = plt.subplots(1, 1)
ax.plot(x,y)

### Normal distribution

#### Intro

The normal distribution is one of the most important probability distributions in statistics and is widely used in the natural and social sciences, because it approximates many phenomena well:
 * heights
 * blood pressure
 * IQ scores
 

It is also known as the Gaussian distribution or the bell curve

#### scipy's `norm`

In [None]:
from scipy.stats import norm

The normal, unlike the exponential, is a $2$-parameter distribution

$X \sim N(\mu, \sigma)$

These parameters are:
 * the mean, called $\mu$
 * the standard deviation, $\sigma$



`scipy` calls them `loc` and `scale`

Let's model a country in which heights have:
 * a mean of $170cm$
 * a std of $10cm$

In [None]:
my_normal = norm(loc=170, scale=10)

In [None]:
my_normal.mean()

In [None]:
my_normal.std()
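
`loc` and `scale` shift and stretch the standard normal $N(0, 1)$. As a sketch of what that means, probabilities for heights can be computed either directly or by first converting heights to z-scores, $z = (x - \mu)/\sigma$; the heights below are arbitrary examples.

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 170, 10
heights = np.array([140, 160, 170, 185])  # arbitrary example heights

z = (heights - mu) / sigma  # z-scores

direct = norm(loc=mu, scale=sigma).cdf(heights)
via_standard = norm().cdf(z)  # standard normal: loc=0, scale=1

print(np.allclose(direct, via_standard))
```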

#### Sampling from a normal

`.rvs` generates a sample drawn from the distribution

In [None]:
my_normal.rvs(size=1)

In [None]:
sample = my_normal.rvs(size=100)

In [None]:
# show the first 15 values
sample[:15]

In [None]:
sns.histplot(sample)

In [None]:
sample = my_normal.rvs(size=10000)

In [None]:
sns.histplot(sample)

#### Point distribution function (density function)

`.pdf` is the probability density function 

In [None]:
my_normal_2 = norm(170, 15)

In [None]:
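x = np.linspace(120, 220, 100)
y = my_normal.pdf(x)     # density with std 10
y2 = my_normal_2.pdf(x)  # density with std 15: wider and flatter
fig, ax = plt.subplots(1, 1)
ax.plot(x, y, label="std = 10")
ax.plot(x, y2, label="std = 15")
ax.legend()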

In [None]:
my_normal.pdf(160)

In [None]:
my_normal.pdf(165)
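
The values returned by `.pdf` are densities, not probabilities. The sketch below writes out the normal density formula by hand (the helper name `normal_pdf` is hypothetical), compares it with scipy, and shows that a density can exceed 1 when the standard deviation is small.

```python
import numpy as np
from scipy.stats import norm


def normal_pdf(x, mu, sigma):
    """Density of a normal with mean mu and standard deviation sigma."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))


xs = np.array([160.0, 165.0, 170.0])
manual = normal_pdf(xs, 170, 10)
library = norm(loc=170, scale=10).pdf(xs)

# a very narrow normal has a peak density above 1
narrow_peak = norm(loc=0, scale=0.1).pdf(0)

print(np.allclose(manual, library), narrow_peak)
```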

Total area under the curve is 1

#### Cumulative distribution function

`.cdf` is the cumulative distribution function  
`.cdf(x)` gives $P(X \leq x)$

Because this distribution is **continuous**, we cannot sum `.pdf` values. Instead we integrate the density (take the area under the curve) up to 170

$P(X < 170)$

In [None]:
my_normal.cdf(170)
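
The cdf value is the area under the density to the left of 170. As a sketch, that area can be computed by numerical integration with `scipy.integrate.quad` and compared with `.cdf`:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

dist = norm(loc=170, scale=10)

# integrate the density from minus infinity up to 170
area, _ = quad(dist.pdf, -np.inf, 170)

print(f"integral: {area:.4f}, cdf: {dist.cdf(170):.4f}")
```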

In [None]:
my_normal.cdf(140)

In [None]:
my_normal.cdf(160)

In [None]:
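x = np.linspace(120, 220, 100)
y = my_normal.pdf(x)
fig, ax = plt.subplots(1, 1)
# shade the area between 140 and 160 and label it with its probability
fill_x = np.linspace(140, 160, 100)
area = my_normal.cdf(160) - my_normal.cdf(140)
ax.fill_between(fill_x, my_normal.pdf(fill_x), color='r')
ax.text(150, 0.003, f"{area:.3f}", size=15)
ax.plot(x, y)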

In [None]:
my_normal.cdf(200)

What is the proportion of people in the interval $(\mu - \sigma, \mu + \sigma)$?

$P(160 < X < 180)$

In [None]:
my_normal.cdf(180) - my_normal.cdf(160)

In [None]:
x = np.linspace(140, 200, 100)
y = my_normal.pdf(x)
fig, ax = plt.subplots(1, 1)
# shade the area within one standard deviation of the mean
fill_x = np.linspace(160, 180, 100)
area = my_normal.cdf(180) - my_normal.cdf(160)
ax.fill_between(fill_x, my_normal.pdf(fill_x), color='r')
ax.text(165, 0.02, f"{area:.3f}", size=15)
ax.plot(x, y)

What is the proportion of people in the interval $(\mu - 2\sigma, \mu + 2\sigma)$?

What is the proportion of people in the interval $(\mu - 3\sigma, \mu + 3\sigma)$?

In [None]:
my_normal.mean()

In [None]:
my_normal.std()

<img src="https://miro.medium.com/max/700/1*IZ2II2HYKeoMrdLU5jW6Dw.png" width=500>

Lizards  
Mean 6.2  
Standard deviation 1

About 68% of the lizards are between 5.2 and 7.2

About 95.45% of the lizards are between 4.2 and 8.2

About 99.73% of the lizards are between 3.2 and 9.2
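
The three percentages in the example above (mean 6.2, standard deviation 1) can be reproduced with `.cdf`. In this sketch, `coverage` is an illustrative name for the proportion within $k$ standard deviations of the mean.

```python
from scipy.stats import norm

mu, sigma = 6.2, 1
dist = norm(loc=mu, scale=sigma)

# proportion within k standard deviations of the mean, for k = 1, 2, 3
coverage = {
    k: dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)
    for k in (1, 2, 3)
}

for k, value in coverage.items():
    print(f"within {k} std: {value:.2%}")
```

The same proportions hold for any normal distribution, whatever its mean and standard deviation.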

#### Percent point function

In [None]:
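x = np.linspace(120, 220, 100)
y = my_normal.pdf(x)
fig, ax = plt.subplots(1, 1)
# shade the area between 140 and 160 and label it with its probability
fill_x = np.linspace(140, 160, 100)
area = my_normal.cdf(160) - my_normal.cdf(140)
ax.fill_between(fill_x, my_normal.pdf(fill_x), color='r')
ax.text(150, 0.003, f"{area:.3f}", size=15)
ax.plot(x, y)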

The percent point function, `.ppf`, is the inverse of the cumulative distribution function

`cdf(height) = probability`

`ppf(probability) = height`

Which height is such that 80% of people are shorter than it?

In [None]:
my_normal.ppf(0.8)

In [None]:
my_normal.ppf(0.99)

In [None]:
my_normal.ppf(0.99999)
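
Since `.ppf` inverts `.cdf`, applying one after the other should return the starting value. A minimal round-trip check, with probabilities chosen for illustration:

```python
import numpy as np
from scipy.stats import norm

dist = norm(loc=170, scale=10)

probs = np.array([0.5, 0.8, 0.99])
heights = dist.ppf(probs)        # probability -> height
round_trip = dist.cdf(heights)   # height -> probability again

for prob, height in zip(probs, heights):
    print(f"{prob:.0%} of people are shorter than {height:.1f} cm")
print(np.allclose(round_trip, probs))
```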

In [None]:
x = np.linspace(140, 200, 100)
y = my_normal.pdf(x)
fig, ax = plt.subplots(1, 1)
# shade everything below the 80th percentile (within the plotted range)
q80 = my_normal.ppf(0.8)
fill_x = np.linspace(140, q80, 100)
ax.fill_between(fill_x, my_normal.pdf(fill_x), color='r')
ax.text(165, 0.02, "0.80", size=15)
ax.plot(x, y)

### Other continuous probability distributions

 * Student's $t$ distribution ("Student" was the pen name of William Sealy Gosset)
 * Snedecor's $F$ distribution
 * Chi-squared distribution
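
These distributions are available in `scipy.stats` as `t`, `f` and `chi2`, and expose the same methods used throughout this notebook. The degrees of freedom below are arbitrary values picked for the sketch.

```python
from scipy.stats import chi2, f, t

# degrees of freedom are hypothetical, chosen only for illustration
student_t = t(df=10)
snedecor_f = f(dfn=5, dfd=20)
chi_squared = chi2(df=3)

for name, dist in [("t", student_t), ("F", snedecor_f), ("chi2", chi_squared)]:
    print(f"{name}: mean={dist.mean():.3f}, 95th percentile={dist.ppf(0.95):.3f}")
```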

## Summary

 * Random variables model random experiments
 * We only need a sample space and probabilities to define a random experiment

 * Discrete random variables only have a finite or countably infinite number of outcomes
 * Continuous random variables can take uncountably many values

 * `.rvs` returns a sample
 * `.pmf` returns the probability mass function (discrete distributions)
 * `.pdf` returns the probability density function (continuous distributions)
 * `.cdf` returns the cumulative distribution function (discrete and continuous)
 * `.ppf` returns the inverse of the cumulative distribution function (discrete and continuous), also known as the percent point function
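
As a compact recap, the sketch below calls each method once. The Poisson distribution with mean 3 is a hypothetical discrete example; the normal is the heights model used earlier.

```python
from scipy.stats import norm, poisson

discrete = poisson(mu=3)              # hypothetical discrete example
continuous = norm(loc=170, scale=10)  # heights model

print(discrete.rvs(size=3, random_state=0))  # a sample of size 3
print(discrete.pmf(2))                       # P(X = 2), a probability
print(continuous.pdf(170))                   # density at 170, not a probability
print(continuous.cdf(170))                   # P(X <= 170)
print(continuous.ppf(0.5))                   # height with 50% of people below it
```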