# Packages

Packages are pre-built Python 'pieces' that we can install and import into our workspace to make our lives easier. There is a vast variety of packages out there. Here we'll use a few of the most common ones for data analysis and visualization.

- **pandas:** A very popular library for manipulating tabular data. It lets you format and work with your data through table-like objects called DataFrames.
- **numpy:** Numerical analysis library with a wide range of mathematical functions and matrix operations. If you have used Matlab before, this one will look familiar.
- **ipywidgets:** Enables interactive controls for tweaking the plots.
- **matplotlib:** The classic plotting library.
- **plotly:** Another cool plotting library with a more friendly and interactive skin. 
- **seaborn:** Built on top of matplotlib, tailored for easy descriptive statistical plots on your data.

In [1]:
import pandas as pd
import numpy as np
import ipywidgets as widgets
from ipywidgets import interact, interact_manual
import plotly.express as ex
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Statistical Inference

_"Statistical Inference is the act of **generalizing** from a **sample** to a **population** with calculated **degree of certainty**"_ , Casella & Berger (Statistical Inference).


<img src='./images/Statistical_Inference.png' style="width:600px;'heigth:600px'"/>


**Generalization:**

It refers to our ability to approximate population **parameters** with in-sample statistics (such as moments). The functions that produce these approximations are called **estimators**, and the values they output are called **estimates**. E.g., we can define the mean value of our sample data $X_{s}$ as $\hat{\mu} = f(X_{s}) = \frac{1}{N}\sum_{i=1}^{N} x_{i}$. In this case, $f(\cdot)$ is our estimator and $\hat{\mu}$ is the estimate.
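
As a minimal sketch of this distinction, the snippet below treats a function as the estimator and its return value as the estimate. The simulated heights, the seed and the names `mean_estimator`, `heights_pop` and `heights_sample` are assumptions made purely for illustration.

```python
import numpy as np

def mean_estimator(sample):
    """Estimator f(X_s): maps a sample to an estimate of the population mean."""
    return np.sum(sample) / len(sample)

rng = np.random.default_rng(0)                    # arbitrary seed
heights_pop = rng.normal(170, 15, 100_000)        # hypothetical population
heights_sample = rng.choice(heights_pop, size=1_000, replace=False)

mu_hat = mean_estimator(heights_sample)           # the estimate
print(f"estimate: {mu_hat:.2f}, population mean: {heights_pop.mean():.2f}")
```

The same estimator applied to a different sample would return a different estimate, which is why the two terms are kept apart.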

**Population:**

In simple terms, it is the "absolute" data collection of targeted events that we'd like to learn about. E.g., the expected expenditure ($G^{e}$) of incoming tourists. However, one could argue that $G^{e}$ might not provide any useful information, given that $G^{e}$ may drastically vary across different geographic / demographic groups. The concept of **sub-population** comes to the rescue. E.g., instead of studying $G^{e}$ of the entire world population, we might split it into regions and get something like: $G^{e}_{North Europe},G^{e}_{North America}$, etc.

Regardless of whether we go with **sub-populations** or not, the core takeaway is that for a given case study / experiment, the targeted **population** contains all available data in existence (all people on the planet, all stars in the Universe, every particle that has ever travelled through space and time, etc.).

**Sample:**

Given the definition of a **population**, it is usually not feasible to access all of it. I.e., given some natural constraints, we have to work with a subset of the data known as a **sample**. One of its characteristics is that it is generated by a **sampling procedure** applied to the results of a **data collection process**. E.g., in QA studies it is not feasible to have all clients answer questionnaires or to get 100% honest answers, among other frictions. Thus, in order to get the best **estimate** of client satisfaction we have to deliberately select the most representative and trustworthy **sample**. In other words, the **sampling procedure** and the **data collection process** should be as unbiased and as free of flaws as possible.

**Uncertainty & Confidence: (TO FINISH)**

In [2]:
np.random.seed(2656)
pop_height = np.random.normal(170, 15, 1000000)
height_min = int(min(pop_height))
height_max = int(max(pop_height))
population = pd.DataFrame(pop_height, columns = ['Height'])
del pop_height

In [3]:
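def dist_plot(N, Height, SamplingMethod, seed):
    """Overlay the population density with the density of a sample of size N."""
    plt.figure(figsize=(16, 8))
    np.random.seed(seed)  # pandas' .sample() falls back to NumPy's global RNG
    # kdeplot replaces the deprecated distplot(..., hist=False)
    sns.kdeplot(population['Height'], label='Population')
    if SamplingMethod == 'Random':
        sample = population['Height'].sample(N)
    else:
        # Selection bias: draw only from heights inside the chosen range
        in_range = population['Height'].between(Height[0], Height[1])
        sample = population.loc[in_range, 'Height'].sample(N)
    sns.kdeplot(sample, label='Sample')
    plt.legend()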

In [4]:
SamplingMethod = widgets.Dropdown(options=['Random', 'Selection Biased'],
                                  value='Random',
                                  description='Sampling Method')
Height = widgets.IntRangeSlider(value=[height_min, height_max],
                                min=height_min,
                                max=height_max,
                                step=5,
                                description='Height Range')

In [5]:
interact(dist_plot,
         N=widgets.IntText(1000),
         Height=Height,
         SamplingMethod=SamplingMethod,
         seed=widgets.IntText());

interactive(children=(IntText(value=1000, description='N'), IntRangeSlider(value=(98, 240), description='Heigh…
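
For readers without a live kernel, a static version of the same comparison might look like the sketch below. It reuses the population parameters from above; the 180 cm cut-off is a hypothetical selection rule chosen only to make the bias visible, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2656)
heights = rng.normal(170, 15, 1_000_000)     # same parameters as the population above

# Unbiased: every individual is equally likely to be picked
random_sample = rng.choice(heights, size=1_000)

# Selection biased: only individuals of at least 180 cm can be picked (assumed rule)
tall_only = heights[heights >= 180]
biased_sample = rng.choice(tall_only, size=1_000)

print(f"population mean:    {heights.mean():.1f}")
print(f"random sample mean: {random_sample.mean():.1f}")
print(f"biased sample mean: {biased_sample.mean():.1f}")
```

The random sample should land close to the population mean, while the biased sample cannot, no matter how large it is.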

**Interesting cases:(TO FINISH)**

Participation bias https://en.wikipedia.org/wiki/Participation_bias

Simpson's paradox https://en.wikipedia.org/wiki/Simpson%27s_paradox

# Distributions

**Definition:**

In simple terms, a probability distribution is a mathematical function $f(\cdot)$ that describes the behaviour of a random variable (RV) $X$ in terms of its possible outcomes. More formally, $f(x) \doteq Pr(X = x)$ if $X$ is a discrete RV; if $X$ is a continuous RV, $f(x)$ is defined through $Pr(a \leq X \leq b) = \int_{a}^{b} f(x)\,dx$. The former is referred to as a **Probability Mass Function** (**pmf**), while the latter is a **Probability Density Function** (**pdf**). This distinction is of fundamental importance.

In the simplest example of a RV $X$ on $[0,1]$ with a Uniform Probability Distribution, if $X$ is discrete then $Pr(X = 0) = Pr(X = 1) = 1/|X| = 1/2 : X \in \mathbb{Z}^{[0,1]}$; if $X$ is a continuous RV then $Pr(X = x) = 1/|X| = 1/\infty = 0$  $\forall x \in X:X \in \mathbb{R}^{[0,1]}$, i.e. every single point has zero probability. However, if in the continuous case we work with intervals and the **pdf** instead of a **pmf**, we get the desired answer: $Pr(0 \leq X \leq 0.5) = Pr(0.5 \leq X \leq 1) = 1/2$, which makes sense, given that each of the intervals $[0,0.5]$ and $[0.5,1]$ covers 50% of all cases.
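
A quick simulation can make the point-versus-interval argument tangible. This is only a sketch: the seed, the sample size and the variable names are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

discrete = rng.integers(0, 2, size=n)         # values in {0, 1}
continuous = rng.uniform(0.0, 1.0, size=n)    # values in [0, 1)

p_discrete_zero = np.mean(discrete == 0)      # close to 1/2
p_point = np.mean(continuous == 0.5)          # a single point carries no probability
p_interval = np.mean(continuous <= 0.5)       # close to 1/2

print(p_discrete_zero, p_point, p_interval)
```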

**Uniform Distribution:**

This distribution assumes that all possible realizations of a given RV $X$ are equally likely and bounded to some arbitrary interval $[a,b]$. More formally:


$$X \sim U(a,b) : f(x,a,b) \doteq \begin{cases}
  \frac{1}{b - a} & \mathrm{for}\ a \le x \le b, \\[8pt]
  0 & \mathrm{for}\ x<a\ \mathrm{or}\ x>b
  \end{cases}$$

<img src='./images/Uniform Distribution.png' style="width:600px;'heigth:600px'"/>


Use cases:
1. RNG
2. Generalization of other distributions with equally probable outcome scenarios (E.g. Bernoulli with fair dice / coins, etc.)
3. Studying the scenarios of launching a new product (if no useful information is available for conditioning the outcomes)
4. Others
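
The density above can be written down directly. The sketch below uses hypothetical bounds $a = 2$, $b = 6$ and checks the interval probability $Pr(c \le X \le d) = \frac{d - c}{b - a}$ against simulated draws; the helper name `uniform_pdf` is illustrative.

```python
import numpy as np

def uniform_pdf(x, a, b):
    """Density of U(a, b): 1/(b - a) inside [a, b], 0 outside."""
    x = np.asarray(x, dtype=float)
    return np.where((x >= a) & (x <= b), 1.0 / (b - a), 0.0)

a, b = 2.0, 6.0                               # hypothetical bounds
rng = np.random.default_rng(2)
draws = rng.uniform(a, b, 100_000)

# The density is constant on [a, b], so the probability of [c, d] is width * height
c, d = 3.0, 4.0
p_theory = (d - c) * float(uniform_pdf(c, a, b))
p_empirical = np.mean((draws >= c) & (draws <= d))

print(p_theory, p_empirical)
```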


**Gaussian (Normal) Distribution:**

Defines the behaviour of a continuous RV $X$ in terms of its expected realization $E(X) = \mu$ (a.k.a. location) and the spread around it, $Var(X) = \sigma^2$, where $\sigma$ is the standard deviation (a.k.a. scale). More formally:


$$X \sim N(\mu, \sigma) : f(x,\mu,\sigma) \doteq \frac{1}{\sigma \sqrt{2 \pi}}e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2}$$


<img src='./images/Normal Distribution.svg' style="width:600px;'heigth:600px'"/>

Use cases:
1. CLT
2. Statistical Inference
3. Complex Distribution Approximation (e.g. via GMM)
4. Errors / Noise Modeling (e.g for Image Noise Injection, error propagation, models generalization improvement, etc.)
5. Generative modeling (VAE, GAN, etc.)
6. Others
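
As a rough check of the formula, the sketch below evaluates the density by hand and compares a well-known property (about 68% of the mass lies within one $\sigma$ of $\mu$) against simulated draws. The parameter values and the helper name `normal_pdf` are assumptions for illustration.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma), with sigma as the standard deviation."""
    z = (x - mu) / sigma
    return np.exp(-0.5 * z**2) / (sigma * np.sqrt(2 * np.pi))

mu, sigma = 170.0, 15.0                       # hypothetical location and scale
peak = normal_pdf(mu, mu, sigma)              # the density is highest at x = mu

rng = np.random.default_rng(3)
draws = rng.normal(mu, sigma, 200_000)
within_one_sigma = np.mean(np.abs(draws - mu) <= sigma)

print(f"density at the mean: {peak:.4f}")
print(f"share within one sigma: {within_one_sigma:.3f}")
```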

**Poisson Distribution:**

It is a discrete probability distribution of the number of events that occur in a given time interval, given the average number of occurrences in that interval.

$$X \sim P(\lambda) : f(k,\lambda) \doteq Pr(X = k) = \frac{\lambda^{k}e^{-\lambda}}{k!}$$

Where $k$ is the number of occurrences and $\lambda$ is the expected number of occurrences.

<img src='./images/Poisson Distribution.png' style="width:600px;'heigth:600px'"/>

There are conditions to be met in order to apply the Poisson Distribution:
1. There are no constraints on the number of times a given event can occur during the chosen time interval
2. Events occur independently of each other
3. The rate of occurrence must be stationary
4. The probability of an event occurring is proportional to the length of the time period

Use cases:
1. Supply Chain Management
2. Intermittent Demand Forecasting
3. Capacity Planning (Contact centers, attention personnel in hospitals, etc.)
4. Betting / Gambling
5. Poisson Regression for the analysis of count variables (e.g. the number of tickets sold by the end of this month, this quarter, this year)
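
To see the pmf in action, here is a small sketch for a capacity-planning style question. The rate of 3 calls per minute is a made-up figure, and `poisson_pmf` is an illustrative helper built only on the standard library.

```python
import math

def poisson_pmf(k, lam):
    """Pr(X = k) for a Poisson RV with expected number of occurrences lam."""
    return lam**k * math.exp(-lam) / math.factorial(k)

lam = 3.0                                     # hypothetical average: 3 calls per minute
probs = [poisson_pmf(k, lam) for k in range(50)]

total = sum(probs)                            # the pmf sums to (almost exactly) 1
mean = sum(k * p for k, p in enumerate(probs))
p_more_than_5 = 1 - sum(probs[:6])            # chance of more than 5 calls in a minute

print(f"total: {total:.6f}, mean: {mean:.6f}, Pr(X > 5): {p_more_than_5:.4f}")
```

Truncating at $k = 50$ is harmless here because the tail beyond that point is negligible for such a small $\lambda$.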

**Bernoulli Distribution:**

It is a discrete probability distribution of a RV $X$ that takes the value $1$ with probability $p$ and the value $0$ with probability $1-p$, where the two values encode a binary choice (True or False, To be or not to be, etc.). More formally:

$$X \sim B(p) : f(x,p) \doteq p^{x}(1-p)^{1-x} = \begin{cases} 
  p & \text{if } x = 1, \\[8pt]
  1-p & \text{if } x = 0
  \end{cases}$$
  
  
<img src='Bernoulli Distribution.png' style="width:600px;'heigth:600px'"/>


Use cases:
1. Whenever a binary choice is involved. Clinical trials, A/B testing, gambling, etc. 
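
A minimal sketch of a Bernoulli RV, assuming the usual encoding of the two outcomes as 1 and 0 with pmf $p^{x}(1-p)^{1-x}$. The value $p = 0.3$ stands in for a hypothetical conversion rate in an A/B test.

```python
import numpy as np

def bernoulli_pmf(x, p):
    """Pr(X = x) for x in {0, 1}."""
    return p**x * (1 - p)**(1 - x)

p = 0.3                                       # hypothetical success probability
rng = np.random.default_rng(4)
trials = rng.random(100_000) < p              # True counts as 1, False as 0
p_hat = trials.mean()                         # share of successes

print(bernoulli_pmf(1, p), bernoulli_pmf(0, p), p_hat)
```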

# Statistical Moments

**NOTE:** 
    
1. I will skip the explanation of concepts such as unbiasedness and consistency of an estimator, since I consider these concepts to be out of the scope of this workshop.

2. As examples I will use the following distributions: Normal and Bernoulli.

3. For simplicity's sake, we will skip examples of the 3rd & 4th moments.

4. We assume univariate distributions.

## Distribution's Mean

**Definition:**

Given a sample $X_{s} = \{x_{1},...,x_{N}\}$ drawn from a population $X$, the sample mean $\hat{x}$ is the **estimate** of the population mean $\overline{x}$, i.e. of the **expected** realization of $X$.

More formally:

$$\hat{x} = E(X_s)$$

**Examples:**

_Normal Distribution_ : $X_{s} \sim N(\hat{\mu},\hat{\sigma})$

$$E(X_{s}) = \frac{1}{N}\sum_{i=1}^{N} x_{i} = \hat{\mu}$$

_Bernoulli Distribution_ : $X_{s} \sim Bernoulli(\hat{p})$

$$E(X_{s}) = Pr(X_{s}=1)\cdot1 + Pr(X_{s}=0)\cdot0 = \hat{p}\cdot1 = \hat{p}$$ 

## Distribution's  Variance

**Definition:**

Given a sample $X_{s} = \{x_{1},...,x_{N}\}$ drawn from a population $X$, $\hat{\sigma}^2$ is the **estimate** of the population variance $\sigma^2$, i.e. of how **uncertain** or **spread out** a given distribution is.

More formally:

$$\hat{\sigma}^2 = Var(X_{s}) : Var(X_{s}) = E[(X_{s} - E(X_{s}))^2] = E(X_{s}^2) - E(X_{s})^2$$

**Examples:**

_Normal Distribution_ : $X_{s} \sim N(\hat{\mu},\hat{\sigma})$

$$Var(X_{s})  = \frac{1}{N-1}\sum_{i=1}^{N} (x_{i} - \hat{\mu})^2 = \hat{\sigma}^2$$

where the factor $\frac{1}{N-1}$ (used instead of $\frac{1}{N}$) is the correction that guarantees unbiasedness.

_Bernoulli Distribution_ : $X_{s} \sim Bernoulli(\hat{p})$

$$Var(X_{s}) = E(X_{s}^2) - E(X_{s})^2 = Pr(X_{s}^2=1^2)\cdot1^2 - \hat{p}^2 = \hat{p} - \hat{p}^2 = \hat{p}\cdot(1-\hat{p})$$ 
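
Both examples can be reproduced numerically. The sketch below uses simulated data with arbitrary parameters; `ddof=1` is NumPy's way of requesting the $\frac{1}{N-1}$ factor.

```python
import numpy as np

rng = np.random.default_rng(5)

# Normal example: hand-written formulas with 1/(N-1) and 1/N
x = rng.normal(170, 15, 50)                   # small hypothetical sample
mu_hat = x.mean()
var_unbiased = np.sum((x - mu_hat) ** 2) / (len(x) - 1)
var_biased = np.sum((x - mu_hat) ** 2) / len(x)

# Bernoulli example: p_hat * (1 - p_hat)
b = (rng.random(10_000) < 0.3).astype(float)  # hypothetical 0/1 outcomes
p_hat = b.mean()
var_bernoulli = p_hat * (1 - p_hat)

print(var_unbiased, np.var(x, ddof=1))        # the two values agree
print(var_bernoulli, np.var(b))               # the two values agree
```

Note that the Bernoulli identity matches `np.var` with its default `ddof=0`, since it comes from $E(X_{s}^2) - E(X_{s})^2$ without any correction factor.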

## Distribution's  Skewness

**Definition:**

Given a sample $X_{s} = \{x_{1},...,x_{N}\}$ drawn from a population $X$, $\hat{\mu}_{3}$ is the estimate of $\mu_{3}$ that quantifies the asymmetry of a given probability distribution about its mean.

There are various definitions of sample skewness. The most common one is _"The Natural Method of Moments"_ :

$$\hat{\mu}_{3} = Skew(X_{s}) = \frac{E[(X_{s} - E(X_{s}))^3]}{\sqrt{E[(X_{s} - E(X_{s}))^2]^3}}$$

**Utility**: It is usually used in the study of real-valued continuous distributions with finite skewness, with the goal of measuring their "deviation" from the normal distribution. This is of high importance, given that many statistical inference procedures adopted by business practitioners are only valid if normality of the distribution is guaranteed. E.g., in the presence of high skewness you can forget about applying t-tests: your p-values will become asymmetrical and misleading, classical ANOVA might fail, the intervals around the mean won't be trustworthy, and so on.

**NOTE:** $\hat{\mu}_{3}$ estimators tend to be highly biased in the case of mixture distributions. They work best with symmetrical real-valued continuous distributions with finite skewness. The proposed sample method is implicitly biased; this bias can be accounted for with a correction factor.
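
A direct translation of the method-of-moments formula, without any bias correction, could look like the sketch below. The two simulated samples (a symmetric normal and a right-skewed exponential) and the helper name `sample_skewness` are illustrative assumptions.

```python
import numpy as np

def sample_skewness(x):
    """Third central moment divided by the second central moment to the power 3/2."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return np.mean(d**3) / np.mean(d**2) ** 1.5

rng = np.random.default_rng(6)
symmetric = rng.normal(0, 1, 100_000)         # theoretical skewness: 0
right_skewed = rng.exponential(1.0, 100_000)  # theoretical skewness: 2

print(f"normal:      {sample_skewness(symmetric):.3f}")
print(f"exponential: {sample_skewness(right_skewed):.3f}")
```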

## Distribution's  Kurtosis

**Definition:**

Given a sample $X_{s} = \{x_{1},...,x_{N}\}$ drawn from a population $X$, $\hat{\mu}_{4}$ is the estimate of $\mu_{4}$ that quantifies the propensity of a given probability distribution to produce extreme values.

There are various definitions of sample kurtosis. The most common one is _"The Natural Method of Moments"_ :

$$\hat{\mu}_{4} = Kurt(X_{s}) = \frac{E[(X_{s} - E(X_{s}))^4]}{\sqrt{E[(X_{s} - E(X_{s}))^2]^4}} = \frac{E[(X_{s} - E(X_{s}))^4]}{E[(X_{s} - E(X_{s}))^2]^2}$$

**Utility**: The obvious use case of $\hat{\mu}_{4}$ is to determine whether a given data distribution is affected by outliers. Another use case is related to information theory. In particular, variables whose probability distributions have low kurtosis lead to higher information entropy, i.e. lower predictive power, and vice versa for higher kurtosis. In multivariate scenarios, information entropy can be used as a criterion for variable selection. There are other use cases as well.

**NOTE:** $\hat{\mu}_{4}$ estimators tend to be highly biased in the case of mixture distributions (with some exceptions). The proposed sample method is implicitly biased; this bias can be accounted for with a correction factor.
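
The formula can be sketched in the same way as the other moments. Under this definition (no "excess" adjustment) a normal distribution gives a value near 3, light-tailed distributions fall below it and heavy-tailed ones above it. The three simulated samples and the helper name `sample_kurtosis` are illustrative.

```python
import numpy as np

def sample_kurtosis(x):
    """Fourth central moment divided by the squared second central moment."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return np.mean(d**4) / np.mean(d**2) ** 2

rng = np.random.default_rng(7)
n = 100_000
k_uniform = sample_kurtosis(rng.uniform(0, 1, n))   # light tails, theory: 1.8
k_normal = sample_kurtosis(rng.normal(0, 1, n))     # theory: 3
k_laplace = sample_kurtosis(rng.laplace(0, 1, n))   # heavy tails, theory: 6

print(f"uniform: {k_uniform:.2f}, normal: {k_normal:.2f}, laplace: {k_laplace:.2f}")
```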

## Auxiliary Statistics

**Median:**

The value which separates a sorted data distribution into two equal halves. More formally, with the elements of $X$ sorted in ascending order, if $|X|$ is odd, then the index of the median is $\frac{|X|+1}{2}$. If $|X|$ is even, then $Median(X) = \frac{X(a_{l}) + X(a_{u})}{2} : a_{l} = \frac{|X|}{2}, a_{u} = \frac{|X|}{2}+1$

E.g., $X = \{1,2,3,4,5\}$, $|X|$ is odd, then, $Median(X) = X(\frac{|X|+1}{2}) = X(3) = 3$; $X = \{1,2,2,4,5,6\}$, $|X|$ is even, then, $Median(X) = \frac{X(3)+X(4)}{2} = \frac{2+4}{2} = 3$

The primary use cases of the median are: studying probability distribution properties (normality, skewness, etc.) and developing outlier-robust metrics, outlier-robust learning models, and outlier detectors.
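
A small sketch of the median on sorted data: the middle element for an odd number of elements, the average of the two middle elements for an even number. The helper name `median` and the outlier example are illustrative; Python's 0-based indexing shifts the positions by one compared with the formula.

```python
import statistics

def median(values):
    """Median of a list: sort first, then pick or average the middle element(s)."""
    xs = sorted(values)
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]                        # odd: single middle element
    return (xs[mid - 1] + xs[mid]) / 2        # even: average of the two middle ones

print(median([1, 2, 3, 4, 5]))                # odd number of elements
print(median([1, 2, 2, 4, 5, 6]))             # even number of elements
print(median([1, 2, 3, 4, 500]))              # the outlier does not move the median
print(statistics.mean([1, 2, 3, 4, 500]))     # but it does move the mean
```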

**Mode:**

The most frequent element of a given set $X$. More formally, if $X$ follows a discrete probability distribution, then $Mode(X) = \operatorname*{arg\,max}_x pmf(X = x)$, where $pmf$ is the probability mass function. If $X$ follows a continuous probability distribution, then $Mode(X) = \{x_{i}^{*}\}: \forall x_{i}^{*} \text{ it is guaranteed that } f(x_{i}^{*}) \ge f(x_{i}^{*} \pm \epsilon)$, where $\epsilon$ is an arbitrarily small value and $f(\cdot)$ is the probability density function ($pdf$). In other words, the mode of a continuous probability distribution is the set of all local maxima of its $pdf$.

E.g., $X = \{1,2,2,4,5,6\}$, $X$ is discrete, then, $Mode(X) = 2$; $X \sim N(\mu,\sigma^2)$: since the normal distribution is unimodal and symmetrical, the mode is the point $x^{*}$ where $\frac{\partial f(x)}{\partial x} = 0$, i.e. $Mode(X) = x^{*} = \operatorname*{arg\,max}_x f(x) = \mu$.

The primary use cases of mode are: studying probability distribution properties (normality, skewness, "modality", etc.).
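
Both cases can be sketched with the standard library alone. The discrete mode is found by counting; for the continuous case, the density is evaluated on a grid and its maximizer is taken, which is only an approximation of the analytical argument. The helper names and the parameters $\mu = 170$, $\sigma = 15$ are assumptions for illustration.

```python
import math
from collections import Counter

def modes(values):
    """All most frequent elements (there can be more than one)."""
    counts = Counter(values)
    top = max(counts.values())
    return [v for v, c in counts.items() if c == top]

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z**2) / (sigma * math.sqrt(2 * math.pi))

print(modes([1, 2, 2, 4, 5, 6]))              # a single mode
print(modes([1, 1, 2, 2, 3]))                 # a bimodal set

mu, sigma = 170.0, 15.0                       # hypothetical parameters
grid = [110.0 + 0.5 * i for i in range(241)]  # 110 to 230 in steps of 0.5
x_star = max(grid, key=lambda x: normal_pdf(x, mu, sigma))
print(x_star)                                 # the grid point with the highest density
```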

## Basic Multivariate Statistics  (TO FINISH)

**Covariance & Correlation**

**Correlation vs Causation**

https://www.youtube.com/watch?v=VMUQSMFGBDo

https://www.youtube.com/watch?v=8B271L3NtAw

In [190]:
#### Visual example for Correlation /= Causation

1.0