<a href="https://colab.research.google.com/github/Coding-Forest/2021-Statistics/blob/main/00%20Statyclopedia/STAT_Statyclopedia.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Encyclopedia of Key Statistical Concepts and Functions

# 1. Probability

## 1) Factorial

In [None]:
import math
math.factorial(number)  # number must be a non-negative integer

## 2) N choose K

$$ {n \choose k} = \frac{n!}{k!(n - k)!} $$

In [2]:
def n_choose_k(n, k):
    # Integer division keeps the result an exact int (equivalent to math.comb(n, k)).
    return math.factorial(n) // (math.factorial(k) * math.factorial(n - k))

## Relationships Between Probabilities

See [4th notebook on this](https://github.com/Coding-Forest/2021-Statistics/blob/main/01%20Probability%20and%20Statistics%20for%20ML/STAT%2004%20Relationships%20Between%20Probabilities.ipynb)

### Joint Probability Distribution
$$P(\text{x}=x, \text{y}=y)$$

### Marginal Probability

$$\forall x \in \text{x}, P(\text{x}=x) = \sum_{y}P(\text{x}=x, \text{y}=y)$$

### Conditional Probability

$$P(\text{y}=y | \text{x}=x) = \frac{P(\text{y}=y, \text{x}=x)}{P(\text{x}=x)}$$
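The marginal and conditional formulas can be illustrated with a small joint probability table. This is a minimal sketch: the 2x2 table `joint` and its values are made up for illustration.

```python
import numpy as np

# Hypothetical joint distribution P(x, y): rows index x, columns index y.
joint = np.array([[0.10, 0.20],
                  [0.30, 0.40]])

# Marginal P(x): sum the joint over every value of y.
p_x = joint.sum(axis=1)

# Conditional P(y | x): divide each row of the joint by that row's P(x).
p_y_given_x = joint / p_x[:, np.newaxis]

print(p_x)
print(p_y_given_x)
```

Each row of `p_y_given_x` sums to 1, as a conditional distribution should.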

### Chain Rule of Probabilities

### Independent Random Variables

### Conditional Independence

# 2. Statistics

## Fundamental Concepts 1

### Mean

$$ \bar{x} = \frac{\sum_{i=1}^n x_i}{n} $$

### Mode

In [None]:
st.mode(number)

### Quantile

In [None]:
np.quantile(x, 0.5)

### Percentile

In [None]:
np.percentile(x, percentile)

### Skewness

In [None]:
st.skewnorm.rvs(a=0, size=1000)  # a is the skewness (shape) parameter; a=0 gives a normal distribution

## Distributions

### Locate where the number is

In [None]:
len(np.where(y > 85)[0])  # count how many values in y exceed 85 (np.where returns a tuple of index arrays)

### Random sampling

In [None]:
np.random.choice(x, size=10, replace=False)

### Uniform

1. Dice rolling (PMF)
2. Card drawing (PMF)
3. Model hyperparameters: number of neurons in ANN
4. Emission of radioactive particles
5. Economic demand
6. Analog-to-digital signal quantization errors

In [None]:
np.random.uniform(size=size)

### Binomial

In [None]:
np.random.binomial(n, p, size)  # n: trials per experiment, p: success probability, size: number of experiments

In [None]:
# np.unique returns the sorted unique values and, with return_counts=True, how often each occurs.
target_event_count, total_event_count = np.unique(target_count, return_counts=True)

### Multinomial

In [None]:
np.random.multinomial(n, [1/6.]*6)

### Poisson

In [None]:
np.random.poisson(lam, size)

### Normal

In [None]:
np.random.normal(size=10000)

### Exponential

In [None]:
np.random.exponential(scale=4, size=10000)

### Laplace

In [None]:
np.random.laplace(size=10000)

### Multimodal

In [None]:
np.concatenate((np.random.normal(size=5000), np.random.normal(loc=4.0, size=5000)))  # loc = mean location

### Mixture 

See [5.8 Mixture Distributions](https://github.com/Coding-Forest/2021-Statistics/blob/main/01%20Probability%20and%20Statistics%20for%20ML/STAT%2005%20Distributions%20in%20Machine%20Learning.ipynb)

### PDF (Probability Density Function)

Probability that $x$ is between points $a$ and $b$:

$$\int_a^b p(x)\,dx = \int_{-\infty}^b p(x)\,dx - \int_{-\infty}^a p(x)\,dx$$

- Integrate the density up to $b$ and up to $a$, then subtract the second area from the first.
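For a distribution with a known CDF, the subtraction can be done directly. A minimal sketch, assuming a standard normal distribution and illustrative endpoints `a` and `b`:

```python
import scipy.stats as st

a, b = -1.0, 1.0  # illustrative interval endpoints

# P(a <= x <= b) = CDF(b) - CDF(a) for a standard normal variable.
prob = st.norm.cdf(b) - st.norm.cdf(a)
print(prob)  # roughly 0.68
```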

## Information Theory

### Shannon Entropy

$$ H(x) = \mathbb{E}_{\text{x}\sim P}[I(x)] \\ = -\mathbb{E}_ {\text{x}\sim P}[\text{log}P(x)] $$


Low entropy:
- Distribution is ~deterministic and outcomes are ~certain

High entropy:
- Outcomes are uncertain

### Shannon Entropy for binary RV

$$ (p-1)\cdot \text{log}(1-p)-p \cdot \text{log}p $$

In [None]:
import numpy as np

def binary_entropy(p):
    return (p-1) * np.log(1-p) - p * np.log(p)

### Differential entropy

- The term for Shannon entropy when the distribution is continuous, i.e. described by a PDF.
- Like Shannon entropy, it characterizes a single distribution.

### Kullback-Leibler Divergence and Cross-Entropy

#### KL divergence
$$ D_\text{KL}(P||Q) = \mathbb{E}_{\text{x} \sim P}[\text{log}P(x) - \text{log}Q(x)] $$
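For discrete distributions the expectation becomes a probability-weighted sum. The sketch below assumes both distributions are given as arrays of strictly positive probabilities over the same outcomes; the helper name `kl_divergence` and the example values are illustrative.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as probability arrays."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Expectation under P of log P(x) - log Q(x).
    return np.sum(p * (np.log(p) - np.log(q)))

p = [0.5, 0.5]
q = [0.9, 0.1]

print(kl_divergence(p, p))  # 0.0: a distribution does not diverge from itself
print(kl_divergence(p, q))
print(kl_divergence(q, p))  # differs from the line above: KL divergence is not symmetric
```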

#### Cross Entropy

$$ C = -(y \cdot \text{log}(\hat{y}) + (1-y) \cdot \text{log}(1-\hat{y})) $$

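$$ C = -(\text{truth} \cdot \text{log}(\text{pred}) + (1-\text{truth}) \cdot \text{log}(1-\text{pred})) $$

- The name is fairly literal: the truth $y$ and the prediction $\hat y$ "cross" in the equation, with $y$ weighting $\text{log}(\hat y)$ and $1-y$ weighting $\text{log}(1-\hat y)$.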

In [None]:
def cross_entropy(y, y_hat):    
    return -1 * (y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

## Fundamental Concepts 2

### Expectation

If $x$ is discrete:$$\mathbb{E}[x] = \sum_x x\cdot P(x)$$

If $x$ is continuous:$$\mathbb{E}[x] =\int x \cdot p(x)\, dx$$
- value $x$ $\times$ probability density $p(x)$, integrated over $x$ ($dx$ is the differential of $x$, not a derivative)
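As a small worked case of the discrete formula, the expected value of a fair six-sided die can be computed as a probability-weighted sum. The variable names are illustrative.

```python
import numpy as np

values = np.arange(1, 7)   # faces of the die: 1..6
probs = np.full(6, 1 / 6)  # each face is equally likely

# E[x] = sum over x of x * P(x)
expectation = np.sum(values * probs)
print(expectation)  # about 3.5
```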

### Variance

Variance (denoted with $\sigma^2$) can be written using expected-value notation, but it's easier to understand without it:

$$\sigma ^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}$$

$$\text{mean squared difference} = \frac{\text{sum of all }\text{(observation - mean)}^2}{\text{sample size}}$$

- Square each difference between an observation and the mean, then sum them up.
- Divide the sum by the sample size.

In [None]:
np.var(x)        # population variance: divides by n (ddof=0)
x.var(ddof=1)    # sample variance: divides by n - 1

### Standard Deviation

$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}}$$

$$\text{root of mean squared difference} = \sqrt{\sigma^2}$$

$$ = \sqrt{\frac{\text{sum of all }(\text{observation} - \text{mean})^2}{\text{sample size}}}$$


### Standard Error

- A further derivation of standard deviation.
- The standard deviation of the sample mean $\bar x$

$$\sigma_{\bar x} = \frac{\sigma}{\sqrt n}$$

$$\sigma_{\bar x} = \frac{\text{standard deviation}}{\sqrt {\text{sample size}}}$$

In [None]:
st.sem(x)  # uses ddof=1 (sample standard deviation) by default
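The standard error formula can be checked by hand against `st.sem`. A minimal sketch with a made-up sample `x`; note that `st.sem` uses the sample standard deviation (`ddof=1`) by default, so the manual version does the same.

```python
import numpy as np
import scipy.stats as st

x = np.array([48, 50, 54, 60])  # illustrative sample

s = x.std(ddof=1)                    # sample standard deviation
se_manual = s / np.sqrt(len(x))      # sigma / sqrt(n)

print(se_manual)
print(st.sem(x))  # same value as the manual calculation
```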

### z-score

$$ z = \frac{x_i-\mu}{\sigma} $$

$$ z = \frac{x_i-\text{mean}}{\text{standard deviation}} $$

- Expresses how far an observation lies from the mean, in units of standard deviation
- (observation minus mean) divided by SD.
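A quick numeric instance of the formula, with assumed values for the mean, standard deviation, and observation:

```python
mu = 60.0     # assumed population mean
sigma = 10.0  # assumed population standard deviation
x_i = 85.0    # observation of interest

# z = (observation - mean) / standard deviation
z = (x_i - mu) / sigma
print(z)  # 2.5: the observation lies 2.5 standard deviations above the mean
```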

### p-values

- `st.norm.ppf(.025)`
  - pass in a cumulative probability (p-value), returns the z-score
- `st.norm.cdf(-2.5)`
  - pass in a z-score, returns the cumulative probability (p-value)

#### CDF - Cumulative Distribution Function

In [None]:
import scipy.stats as st
# pass in a z-score to get the cumulative probability (the percentile as a fraction)
st.norm.cdf(z_score)

In [None]:
st.norm.cdf(-2.5)    # P(Z < -2.5): area in the lower tail, about 0.0062
1-st.norm.cdf(2.5)   # P(Z > 2.5): area in the upper tail, also about 0.0062
                     # cdf returns a cumulative probability between 0 and 1.

#### PPF - Percent Point Function

- The method `norm.ppf()` takes a percentage and returns a **standard deviation multiplier** for what value that percentage occurs at.

- It is equivalent to a one-tailed test on the density plot.

Stackoverflow source discussion [here](https://stackoverflow.com/questions/60699836/how-to-use-norm-ppf).

In [None]:
st.norm.ppf(.025) # returns z-score
st.norm.ppf(.975)

In [None]:
st.norm.ppf(0.95, loc=0, scale=1)

# Returns the z-score (about 1.645) below which 95% of a standard normal
# distribution (mean=0, std=1) lies: the critical value for a one-tailed test.

### Covariance
- Two variables: $x$ and $y$
- For two vectors of the same length, $x$ and $y$, where each element of $x$ is paired with the corresponding element of $y$, covariance measures how related the two variables are to each other.

<br/>

$$cov(x, y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n}$$

$$cov(x, y) = \frac{\text{sum of all }(i\text{th data point of }x - \text{mean of }x) \times (i\text{th data point of }y - \text{mean of }y)}{\text{sample size}}$$

<br/>

- sum of $(\text{deviation of }x_i) \times (\text{deviation of }y_i)$ divided by the sample size $n$

In [None]:
np.cov(x, y, ddof=0)  # returns the 2x2 covariance matrix; cov(x, y) is the off-diagonal entry [0, 1]

### Correlation
$$\rho _{x, y}=\frac{cov(x,y)}{\sigma_x\sigma_y}$$

$$\rho _{x, y}=\frac{\text{covariance of } x \text{ and }y}{\text{standard deviation of }x \times \text{standard deviation of }y}$$

<br/>

- "*A drawback of [covariance](https://github.com/jonkrohn/ML-foundations/blob/master/notebooks/5-probability.ipynb) is that it confounds the relative scale of two variables with a measure of the variables' relatedness. Correlation builds on covariance and overcomes this drawback via rescaling, thereby measuring relatedness exclusively. Correlation is much more common because of this difference.*"

- "*Covariance and correlation only account for linear relationships. Two variables could be non-linearly related to each other and these metrics could come out as zero*."

See [3.10](https://github.com/Coding-Forest/2021-Statistics/blob/main/01%20Probability%20and%20Statistics%20for%20ML/STAT%2003%20Describing%20Distributions.ipynb)
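The rescaling from covariance to correlation can be sketched as follows. The arrays `x` and `y` are made-up data, and `np.corrcoef` is used only as a cross-check.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # illustrative data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Population covariance: off-diagonal entry of the covariance matrix.
cov_xy = np.cov(x, y, ddof=0)[0, 1]

# Correlation: covariance rescaled by both (population) standard deviations.
rho = cov_xy / (x.std() * y.std())

print(rho)
print(np.corrcoef(x, y)[0, 1])  # matches the manual calculation
```

The same `ddof` must be used for the covariance and the standard deviations; otherwise the ratio is off by a constant factor.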

### Pearson correlation



In [None]:
st.pearsonr(x, y)  # returns the correlation coefficient and its p-value

## Comparing Means - t-tests

### T-tests

#### Student's One Sample t-test

The single-sample t-test is a variation on the z-score:

<br/>

$$ t = \frac{\bar{x} - \mu_0}{s_{\bar{x}}} $$

$$ t = \frac{\text{sample mean} - \text{reference mean}}{\text{sample's standard error}} $$

<br/>

- $\bar{x}$ is the sample mean
- $\mu_0$ is a reference mean, e.g., known population mean or "null hypothesis" mean
- $s_{\bar{x}}$ is the sample standard error

<br/>

Compare with the z-score formula below:
$$ z = \frac{x_i-\mu}{\sigma} $$

In [None]:
st.ttest_1samp(x, reference_mean)

In [None]:
# returns: t-statistic & p-value
Ttest_1sampResult(statistic=1.1338934190276817, pvalue=0.3392540508564543)

The t-score compares a sample mean against a reference mean in units of standard error, while the z-score locates a single data point relative to the population mean in units of standard deviation.
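The one-sample t statistic can be reproduced by hand from the formula. A minimal sketch; the sample `x` and the reference mean `mu_0` are illustrative values.

```python
import numpy as np
import scipy.stats as st

x = np.array([48, 50, 54, 60])  # illustrative sample
mu_0 = 50                       # illustrative reference mean

# t = (sample mean - reference mean) / standard error of the sample
t_manual = (x.mean() - mu_0) / st.sem(x)

t_scipy, p_value = st.ttest_1samp(x, mu_0)
print(t_manual, t_scipy, p_value)  # the two t values agree
```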

#### Welch's Independent Two-Sample t-test
Compare the means of two separate samples, without assuming that they have equal variances:

$$ t = \frac{\bar{x} - \bar{y}}{\sqrt{\frac{s^2_x}{n_x} + \frac{s^2_y}{n_y}}} $$


Where:

- $\bar{x}$ and $\bar{y}$ are the sample means
- $s^2_x$ and $s^2_y$ are the sample variances
- $n_x$ and $n_y$ are the sample sizes

In [None]:
st.ttest_ind(sample1, sample2, equal_var=False)

In [None]:
# returns: t-statistic & p-value
Ttest_indResult(statistic=4.5588666963515765, pvalue=1.1099750778082192e-05)
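The Welch statistic can likewise be computed directly from the sample means, variances, and sizes. The two samples below are made up for illustration.

```python
import numpy as np
import scipy.stats as st

sample1 = np.array([5.1, 4.9, 5.6, 5.8, 6.0])        # illustrative data
sample2 = np.array([4.2, 4.4, 4.1, 4.8, 4.5, 4.3])   # illustrative data

# Sample variances use ddof=1; each is divided by its own sample size.
se = np.sqrt(sample1.var(ddof=1) / len(sample1) +
             sample2.var(ddof=1) / len(sample2))
t_manual = (sample1.mean() - sample2.mean()) / se

t_scipy, p_value = st.ttest_ind(sample1, sample2, equal_var=False)
print(t_manual, t_scipy, p_value)  # the two t values agree
```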

#### Student's Paired-Sample t-test

$$ t = \frac{\bar{d} - \mu_0}{s_\bar{d}} $$

$$ t = \frac{\text{mean of differences between paired } x \text{ and }y  - \text{reference mean (typically 0)}}{\text{standard error of the differences}} $$

<br/>

Where:

- $d$ is a vector of the differences between paired samples $x$ and $y$
- $\bar{d}$ is the mean of the differences
- $\mu_0$ will typically be zero, meaning the null hypothesis is that there is no difference between $x$ and $y$
- $s_\bar{d}$ is the standard error of the differences

(Note how similar this is to the single-sample t-test formula.)

In [None]:
st.ttest_rel(sample1, sample2)

In [None]:
# returns: t-statistic & p-value
Ttest_relResult(statistic=3.3541019662496847, pvalue=0.02846020325433834) 
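The similarity to the single-sample test can be made concrete: a paired test on two samples gives the same result as a one-sample test on their differences against zero. The paired measurements below are made up for illustration.

```python
import numpy as np
import scipy.stats as st

sample1 = np.array([12.1, 11.4, 13.0, 12.7, 11.9])  # illustrative paired data
sample2 = np.array([11.5, 11.0, 12.2, 12.9, 11.1])

d = sample1 - sample2  # vector of paired differences

t_rel, p_rel = st.ttest_rel(sample1, sample2)
t_one, p_one = st.ttest_1samp(d, 0)  # mu_0 = 0: no difference under the null

print(t_rel, t_one)  # identical statistics
print(p_rel, p_one)  # identical p-values
```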

#### Applications of t-test in ML


1) **Single Sample T-test**
A variation on the z-score, which measures distance from a mean in units of spread.

- You invent your own ML model and want to compare it against the established benchmark.
- Run your model a number of times to get a sample of model accuracies.
- Using the single sample t-test, you can compare the sample of accuracies from your stochastic model against the established benchmark and assess whether the difference in performance is statistically significant.

2) **Independent T-test**
Comparing the means of two different datasets.

- Does your model have unwanted bias in it?
- You can test this using independent samples.
- Stratify your data set by demographic group to obtain the independent samples.
- Compare the model outputs for one group against the other groups.
- Are the differences statistically significant?

3) **Paired Sample T-test**
- You invented a new TensorFlow model running a neural network in a browser. Is it significantly faster than the old model in the browser?

- Prepare paired samples with varying conditions:
  - Run the new model and the old model a number of times and see which one performs better.
  - Run the pairs on different browsers (Safari, Firefox, Chrome) and on a mobile device, a tablet, and a desktop computer.
  - Pair by condition: for example, test the old model on Safari on an iPhone and the new model on Safari on an iPhone.

## Confidence Interval

$$C.I. =\bar x \pm z\frac{s}{\sqrt n}$$


Where:

- $\bar{x}$: the sample mean
- $s$: the sample standard deviation
- $n$: the sample size
- $z$: a z-score threshold
  - Some popular $z$ thresholds:
    - 95% CI: $z = \pm 1.960$
    - 90% CI: $z = \pm 1.645$
    - 99% CI: $z = \pm 2.576$
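The interval can be computed in a few lines. A minimal sketch, assuming a made-up sample `x` and a 95% confidence level; the threshold is obtained from `st.norm.ppf` rather than hard-coded.

```python
import numpy as np
import scipy.stats as st

x = np.array([48, 50, 54, 60, 52, 57, 49, 55])  # illustrative sample

z = st.norm.ppf(0.975)       # two-sided 95% threshold, about 1.96
s = x.std(ddof=1)            # sample standard deviation
half_width = z * s / np.sqrt(len(x))

ci = (x.mean() - half_width, x.mean() + half_width)
print(ci)
```

For small samples a t-based threshold is often preferred over $z$; the sketch follows the $z$ formula as written here.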

## ANOVA test

Enables us to compare more than 2 samples in a single statistical analysis.

ANOVA relies on three assumptions. The samples must be:

- independent (not paired)
- normally distributed
- homoscedastic: the population standard deviations are equal

In [None]:
st.f_oneway(sample1, sample2, sample3)

In [None]:
# returns: F-statistic & p-value
F_onewayResult(statistic=0.22627752438542714, pvalue=0.7980777848719299)

# 3. Plotting

## Bar graph

In [None]:
plt.bar(x_event, event_prob, color='mediumpurple')

## Scatterplot

- Matplotlib
- Seaborn

In [None]:
plt.scatter(x, y, c=c)

In [None]:
sb.scatterplot(x=x, y=y)

## Boxplot

In [None]:
sb.boxplot(x=x_column, y=y_column, hue=colour, data=dataset)  # x_column, y_column, colour: column names in dataset

## Displot

In [None]:
sb.displot(x, kde=True)

## Distplot

In [None]:
sb.distplot(x)  # y unit = density; deprecated since seaborn 0.11, prefer displot or histplot

## Histplot

In [None]:
sb.histplot(x)  # y unit = count

### Auxiliary plotting functions

#### plt.errorbar()


In [None]:
ax.errorbar(['x label'], [mean], yerr=[ci_error], fmt='o', color='green', label='')

#### plt.grid(axis=)


In [None]:
plt.grid(axis='y')