## 1 Characterizing a Distribution

In [1]:
from scipy import stats

In [31]:
import numpy as np

In [3]:
import pandas as pd

### 1.1 Distribution Center

#### Mode

The *mode* is the most frequently occurring value in a distribution.

In [2]:
data = [1, 3, 4, 4, 7]

In [3]:
stats.mode(data)

ModeResult(mode=array([4]), count=array([2]))

In [5]:
data = pd.Series(data)

In [6]:
data.mode()

0    4
dtype: int64

#### Geometric Mean

In [9]:
x = np.arange(1, 101)

In [10]:
stats.gmean(x)

37.992689344834304
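As a quick check of what `gmean` computes, the sketch below compares it with the exponential of the mean of the logarithms, which is the usual definition of the geometric mean for positive values. The variable names `gm_manual` and `gm_scipy` are illustrative only.

```python
import numpy as np
from scipy import stats

x = np.arange(1, 101)

# Geometric mean: n-th root of the product of the values,
# computed here via logarithms to avoid a huge intermediate product
gm_manual = np.exp(np.mean(np.log(x)))
gm_scipy = stats.gmean(x)

print(np.isclose(gm_manual, gm_scipy))
```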

### 1.2 Quantifying Variability

#### Range

The *range* is simply the difference between the highest and the lowest data value.

In [12]:
np.ptp(data)

6

`ptp` stands for "peak-to-peak"

#### Percentiles

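$$
CDF(x) = \int_{-\infty}^{x} PDF(t)\,dt
$$

Percentiles are the inverse of the CDF (cumulative distribution function): for a given CDF value they return the corresponding x. For example, the 50th percentile is the x at which the CDF equals 0.5, i.e. the median.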
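A minimal sketch of this inverse relationship, assuming a standard normal distribution and a small illustrative sample (both chosen here only for demonstration): `ppf` undoes `cdf`, and `np.percentile` returns the empirical counterpart for data.

```python
import numpy as np
from scipy import stats

nd = stats.norm()            # standard normal distribution (illustrative choice)

# The PPF inverts the CDF: ppf(cdf(x)) recovers x
x = 1.5
p = nd.cdf(x)
x_back = nd.ppf(p)

# Empirical percentiles of a sample: lower quartile, median, upper quartile
sample = np.arange(1, 101)
q25, q50, q75 = np.percentile(sample, [25, 50, 75])

print(x_back, q25, q50, q75)
```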

#### Standard Deviation and Variance

The maximum likelihood estimator of the variance is given by:

$$
var=\frac{\sum\limits_{i=1}^{n}(x_i-\bar{x})^2}{n}
$$

The best unbiased estimator for the population variance is:

$$
var = \frac{\sum\limits_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}
$$

For a proof that this is an unbiased estimator of $\sigma^2$, see:
https://www.ma.utexas.edu/users/mks/M358KInstr/SampleSDPf.pdf

The standard deviation is:

$$
s = \sqrt{var}
$$

By default, `numpy` divides by n when calculating the variance and the standard deviation (`ddof=0`). To obtain the sample version, which divides by n - 1, set `ddof=1`:

In [13]:
data = np.arange(7, 14)

In [14]:
np.std(data)

2.0

In [15]:
np.std(data, ddof=1)

2.1602468994692869

In pandas, `std` and `var` default to `ddof=1`.
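To make the difference in defaults concrete, the following sketch compares the two libraries on the same data as above. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

data = np.arange(7, 14)

std_np_default = np.std(data)              # numpy default: ddof=0, divides by n
std_np_sample = np.std(data, ddof=1)       # divides by n - 1
std_pd_default = pd.Series(data).std()     # pandas default: ddof=1
std_pd_population = pd.Series(data).std(ddof=0)

print(std_np_default, std_pd_population)   # these two agree
print(std_np_sample, std_pd_default)       # and so do these two
```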

#### Standard Error

Mathematically, the variance of the sampling distribution of the mean equals the variance of the population divided by the sample size: as the sample size increases, sample means cluster more closely around the population mean.
Therefore, for a given sample size, the standard error equals the standard deviation divided by the square root of the sample size. In other words, the standard error of the mean measures the dispersion of sample means around the population mean.

For normally distributed data, the *sample standard error of the mean* (SE or SEM) is:

$$
SEM = \frac{s}{\sqrt{n}}
= \sqrt{\frac{\sum\limits_{i=1}^n(x_i-\bar{x})^2}{n-1}} \cdot
\frac{1}{\sqrt{n}}
$$
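The formula can be checked against `scipy.stats.sem`, which uses `ddof=1` by default. This is a small sketch on illustrative data, not part of the original notebook run.

```python
import numpy as np
from scipy import stats

data = np.arange(7, 14)
n = len(data)

s = np.std(data, ddof=1)         # sample standard deviation (n - 1 in the denominator)
sem_manual = s / np.sqrt(n)      # SEM = s / sqrt(n)
sem_scipy = stats.sem(data)      # same quantity computed by SciPy

print(np.isclose(sem_manual, sem_scipy))
```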

#### Confidence Intervals

The $\alpha\%$ *confidence interval (CI)* reports the range that contains the true value of the parameter with a likelihood of $\alpha\%$.

If the sampling distribution is *symmetrical* and *unimodal*, it will often be possible to approximate the confidence interval by:

$$
ci = mean \pm std \cdot N_{PPF}\left(\frac{1-\alpha}{2}\right)
$$

Here $\alpha$ is the confidence level written as a fraction (e.g. 0.95), and $N_{PPF}$ is the *percentile point function (PPF)* of the standard normal distribution.

- To calculate the CI for the mean value, the SD has to be replaced by the SE.
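As a rough sketch of the approximation above for the mean of a sample, the code below uses the SE in place of the SD. It assumes that `alpha` is the confidence level as a fraction (0.95 here) and uses illustrative data. For a sample this small, a t-based interval would usually be preferred over the normal approximation.

```python
import numpy as np
from scipy import stats

data = np.arange(7, 14)
alpha = 0.95                           # confidence level (assumed convention)

mean = np.mean(data)
se = stats.sem(data)                   # SE replaces SD for a CI of the mean

z = stats.norm.ppf((1 - alpha) / 2)    # negative value, about -1.96
ci = (mean + se * z, mean - se * z)    # (lower, upper)

# SciPy's built-in interval for comparison
ci_scipy = stats.norm.interval(0.95, loc=mean, scale=se)
print(ci, ci_scipy)
```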

### 1.3 Parameters Describing the Form of a Distribution

#### Location

A location parameter determines the location or shift of a distribution

#### Scale

The scale parameter describes the width of a probability distribution
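In `scipy.stats`, these two parameters are passed as `loc` and `scale`. A minimal sketch with normal distributions (the specific values are arbitrary) shows that `loc` moves the distribution while `scale` changes its width.

```python
import numpy as np
from scipy import stats

base = stats.norm(loc=0, scale=1)
shifted = stats.norm(loc=5, scale=1)    # same width, center moved to 5
widened = stats.norm(loc=0, scale=3)    # same center, three times as wide

print(base.mean(), shifted.mean(), widened.mean())
print(base.std(), shifted.std(), widened.std())
```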

#### Shape Parameters

- Skewness

Distributions are skewed if they depart from symmetry.

- Kurtosis

*Kurtosis* is a measure of the "peakedness" of the probability distribution. Distributions with negative or positive excess kurtosis are called *platykurtic* or *leptokurtic* distributions, respectively.
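Both shape parameters can be estimated from data with `stats.skew` and `stats.kurtosis` (the latter returns excess kurtosis by default). The sketch below uses simulated samples with a fixed seed, which is an assumption made for reproducibility: a normal sample should give values near 0, while an exponential sample is clearly right-skewed and leptokurtic.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
symmetric = rng.normal(size=10_000)
right_skewed = rng.exponential(size=10_000)

skew_sym = stats.skew(symmetric)           # close to 0 for symmetric data
skew_exp = stats.skew(right_skewed)        # positive: long tail to the right
kurt_sym = stats.kurtosis(symmetric)       # excess kurtosis, close to 0
kurt_exp = stats.kurtosis(right_skewed)    # positive: leptokurtic

print(skew_sym, skew_exp, kurt_sym, kurt_exp)
```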

### 1.4 Important Presentations of Probability Densities

- Probability Density Function (PDF)
- Cumulative Distribution Function (CDF)
- Survival Function (SF) = 1 - CDF
- Percentile Point Function (PPF)
- Inverse Survival Function (ISF)
- Random Variate Sample (RVS): random variates from a given distribution

The steps of working with distribution functions in Python:

In [6]:
myDF = stats.norm(5, 3)   # Step 1: Create the frozen distribution

In [9]:
x = np.linspace(-5, 15, 101)
y = myDF.cdf(x)  # Step 2: Calculate the function value for the desired x-input
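The other presentations listed above are available as methods of the same frozen distribution. The following sketch, using an arbitrary evaluation point of 8 and a fixed seed for the random sample, illustrates how they relate to each other.

```python
import numpy as np
from scipy import stats

myDF = stats.norm(5, 3)

p = myDF.cdf(8)                  # P(X <= 8)
sf = myDF.sf(8)                  # P(X > 8) = 1 - CDF
x_from_ppf = myDF.ppf(p)         # PPF inverts the CDF, giving back 8
x_from_isf = myDF.isf(sf)        # ISF inverts the SF, giving back 8
sample = myDF.rvs(size=1000, random_state=0)   # random variates

print(p + sf, x_from_ppf, x_from_isf, sample.mean())
```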

## 2 Discrete Distributions

### 2.1 Bernoulli Distribution

In [5]:
p = 0.5
bernoulliDist = stats.bernoulli(p)

In [6]:
p_tails = bernoulliDist.pmf(0)
p_heads = bernoulliDist.pmf(1)

In [7]:
trials = bernoulliDist.rvs(10)

In [8]:
trials

array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0])

### 2.2 Binomial Distribution

In [9]:
(p, num) = (0.5, 4)
binomDist = stats.binom(num, p)

In [10]:
binomDist.pmf(np.arange(5))

array([ 0.0625,  0.25  ,  0.375 ,  0.25  ,  0.0625])

#### Example: Binomial Test

$$
P[X=k] =
\begin{cases}
{n\choose k}p^k (1-p)^{n-k} & 0 \le k \le n \\
0 & \text{otherwise}
\end{cases}
\quad 0 \le p \le 1, \quad n \in \mathbb{N}
$$

In [13]:
checkVal = 51

In [11]:
(p, num) = (1.0/6, 235)

In [12]:
bd = stats.binom(num, p)

In [14]:
# Calculate the one-sided test,
# i.e. the likelihood of getting a 6 the same number of times or more
p_oneTail = bd.sf(checkVal - 1)

In [15]:
p_oneTail

0.026544245711699471

In [19]:
# binom_test was deprecated and later removed from SciPy; binomtest returns a result object
p_twoTail = stats.binomtest(checkVal, num, p).pvalue

In [20]:
p_twoTail

0.043747970182413345
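The one-tailed value can also be reproduced directly from the PMF formula above by summing the probabilities for k = 51, ..., 235. The helper `binom_pmf` is a hypothetical name introduced here for illustration only.

```python
from math import comb
from scipy import stats

p, num, checkVal = 1.0 / 6, 235, 51

def binom_pmf(k, n, p):
    """PMF from the formula: C(n, k) * p**k * (1 - p)**(n - k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# P(X >= checkVal): sum the PMF over the upper tail
p_oneTail_manual = sum(binom_pmf(k, num, p) for k in range(checkVal, num + 1))
p_oneTail_scipy = stats.binom(num, p).sf(checkVal - 1)

print(p_oneTail_manual, p_oneTail_scipy)
```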

### 2.3 Poisson Distribution

The PMF of the Poisson Distribution is:

$$
P(X = k) = \frac{e^{-\lambda} \lambda ^k}{k!}
$$
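As a small check of this formula, the sketch below evaluates it by hand and compares the result with `stats.poisson`. The rate `lam = 3.0` is an arbitrary value chosen for illustration.

```python
from math import exp, factorial
import numpy as np
from scipy import stats

lam = 3.0                  # illustrative rate parameter (lambda)
ks = list(range(10))       # k = 0, 1, ..., 9

# PMF from the formula: exp(-lambda) * lambda**k / k!
pmf_manual = np.array([exp(-lam) * lam**k / factorial(k) for k in ks])
pmf_scipy = stats.poisson(lam).pmf(ks)

print(np.allclose(pmf_manual, pmf_scipy))
```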

## 3 Normal Distribution

$$
f_{\mu,\sigma}(x) = \frac{1}{\sigma \sqrt{2\pi}}e^{-(x-\mu)^2/{2\sigma^2}}
$$

In [21]:
mu = -2
sigma = 0.7

In [22]:
myDist = stats.norm(mu, sigma)

In [25]:
# Calculate the interval of the PDF containing 95% of the data
myDist.ppf([0.05/2, 1 - 0.05/2])

array([-3.37197479, -0.62802521])

#### Examples of Normal Distribution

In [26]:
norm = stats.norm()

In [29]:
dist = norm.isf(0.99)

In [30]:
250 - dist * 4

259.30539149616334

In [33]:
men = stats.norm(175, 6)
women = stats.norm(168, 3)

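In [36]:
# Variances of independent variables add, and the scale argument is a standard deviation
diff = stats.norm(175 - 168, np.sqrt(6**2 + 3**2))

In [37]:
print(f"{diff.sf(0):.3f}")

0.852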

### 3.3 Distributions and Hypothesis Test

In [29]:
nd = stats.norm(3.5, 0.76)

In [30]:
nd.cdf(2.6)

0.11816486815719918

## 4 Continuous Distributions Derived from the Normal Distribution

### 4.1 t-Distribution

$$
t = \frac{\bar{x} - \mu}{SE} 
= \frac{\bar{x} - \mu}{s/\sqrt{n}}
$$

In [9]:
n = 20
df = n-1
alpha = 0.05

In [10]:
tDist = stats.t(df)

In [11]:
tDist.ppf(alpha/2)

-2.0930240544082634

In [12]:
stats.norm.ppf(alpha/2)

-1.9599639845400545

Calculate the 95% CI:

$$
ci = mean \pm se \cdot t_{df, \alpha}
$$

In [14]:
N = stats.norm()

In [17]:
data = N.rvs(n)

In [33]:
ci = stats.t.interval(1 - alpha, df, loc=np.mean(data),
                      scale=stats.sem(data))

In [34]:
ci

(-0.4703866418601137, 0.51695782600780038)
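The same interval can be built by hand from the formula, which makes the role of the t critical value explicit. This sketch assumes a fixed seed for reproducibility, so its numbers differ from the output above.

```python
import numpy as np
from scipy import stats

n, alpha = 20, 0.05
df = n - 1
data = stats.norm().rvs(n, random_state=0)   # seed is an assumption for reproducibility

mean = np.mean(data)
se = stats.sem(data)
t_crit = stats.t(df).ppf(1 - alpha / 2)      # about 2.093 for df = 19

ci_manual = (mean - se * t_crit, mean + se * t_crit)
ci_scipy = stats.t.interval(1 - alpha, df, loc=mean, scale=se)

print(ci_manual, ci_scipy)
```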

### 4.2 Chi-Square Distribution

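If $X_1, \dots, X_n$ are independent standard normal random variables, the sum of their squares follows a chi-square distribution with $n$ degrees of freedom:

$$
\sum\limits^{n}_{i=1}X^2_i \sim \chi^2_n
$$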
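This relationship can be illustrated with a short simulation: sums of squared standard normal draws should behave like a chi-square variable. The values `n = 5`, the threshold `9.0`, and the seed are arbitrary choices made for this sketch.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 5

# 100000 realizations of the sum of n squared standard normal values
sums = (rng.standard_normal(size=(100_000, n)) ** 2).sum(axis=1)

empirical_mean = sums.mean()              # the mean of chi2 with n dof is n
empirical_p = (sums > 9.0).mean()         # empirical tail probability
theoretical_p = stats.chi2(n).sf(9.0)     # tail probability from the chi2 distribution

print(empirical_mean, empirical_p, theoretical_p)
```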

In [36]:
data = np.r_[3.04, 2.94, 3.01, 3.00, 2.94, 2.91, 3.02,
             3.04, 3.09, 2.95, 2.99, 3.10, 3.02]

In [37]:
sigma = 0.05

In [38]:
chi2Dist = stats.chi2(len(data) - 1)

In [39]:
statistic = sum(((data - np.mean(data)) / sigma) ** 2)

In [40]:
chi2Dist.sf(statistic)

0.19293306654285156