# Statistics - Recap

## Calculate the confidence interval for the mean of a population from one sample (a vector of numbers), based on the Central Limit Theorem


### Use statistics to quickly answer all sorts of questions

In [1]:
# Notebook preferences: autoformatting (nb_black) and multiple outputs per cell
%load_ext nb_black

from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"

import warnings

warnings.filterwarnings("ignore")

import numpy as np
import scipy.stats as st

data = [12, 12, 13, 13, 15, 16, 17, 22, 23, 25, 26, 27, 28, 28, 29]


### Method I: Use t.interval() with the Standard Error of the Mean (SEM) = np.std(a, ddof=1) / np.sqrt(n)

In [2]:
sem = st.sem(data)
# https://github.com/scipy/scipy/blob/v1.8.1/scipy/stats/_stats_py.py#L2234-L2295
sem

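# Reproduce the SEM with np.std() and np.sqrt()
np.std(np.array(data), ddof=1) / np.sqrt(len(data))  # st.sem() uses ddof=1 by default

# Obtain the CI. The confidence level is passed positionally because the
# `alpha` keyword was renamed to `confidence` in newer SciPy releases.
st.t.interval(0.95, df=len(data) - 1, loc=np.mean(data), scale=st.sem(data))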

1.6981782956478755

1.6981782956478755

(16.75776979778498, 24.042230202215016)


### Method II: Construct the interval from the t-score and the SEM, S / np.sqrt(n), where S is the sample standard deviation

In [3]:
data_array = np.array(data)
m = data_array.mean()
m

# For a small sample, the divisor should be N - 1
# Obtain the sample standard deviation
s = data_array.std(ddof=1)  # divisor = N - ddof; ddof=1 gives the sample estimate
s
# Obtain the sample variance
v = data_array.var(ddof=1)
v
np.sqrt(v)  # equals s

dof = len(data_array) - 1
confidence = 0.95

t_crit = np.abs(st.t.ppf(q=(1 - confidence) / 2, df=dof, loc=0, scale=1))

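# Lower-tail t quantile (negative)
st.t.ppf(q=(1 - confidence) / 2, df=dof, loc=0, scale=1)
# Critical t-score (absolute value of the quantile above)
t_crit
# Confidence interval: m ± t_crit * s / sqrt(n)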
(
    m - s * t_crit / np.sqrt(len(data_array)),
    m + s * t_crit / np.sqrt(len(data_array)),
)


20.4

6.577016257935117

43.25714285714285

6.577016257935117

-2.1447866879169273

2.1447866879169273

(16.75776979778498, 24.042230202215016)
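
Both methods compute the same quantity, so they can be folded into a single helper. The following is a minimal sketch, assuming a hypothetical function name `mean_confidence_interval` (it is not part of SciPy) and reusing the notebook's data:

```python
import numpy as np
import scipy.stats as st


def mean_confidence_interval(values, confidence=0.95):
    """Two-sided t-based confidence interval for the mean (hypothetical helper)."""
    arr = np.asarray(values, dtype=float)
    n = arr.size
    mean = arr.mean()
    sem = arr.std(ddof=1) / np.sqrt(n)  # sample standard deviation / sqrt(n)
    t_crit = st.t.ppf((1 + confidence) / 2, df=n - 1)  # upper-tail critical value
    return mean - t_crit * sem, mean + t_crit * sem


data = [12, 12, 13, 13, 15, 16, 17, 22, 23, 25, 26, 27, 28, 28, 29]
lower, upper = mean_confidence_interval(data)
```

Using the upper-tail quantile `(1 + confidence) / 2` avoids the `np.abs()` step needed when starting from the lower tail.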


### Confirm the CI with t.test in R and with Wikipedia
https://en.wikipedia.org/wiki/Student%27s_t-distribution#Confidence_intervals

In [4]:
### Confirm with t.test in R
# data <- c(12, 12, 13, 13, 15, 16, 17, 22, 23, 25, 26, 27, 28, 28, 29)
# t.test(data,mu=15, conf.level = 0.95)
# """
# One Sample t-test
# data:  data
# t = 3.1799, df = 14, p-value = 0.006683
# alternative hypothesis: true mean is not equal to 15
# 95 percent confidence interval:
#  16.75777 24.04223
# """


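### Confirm with https://en.wikipedia.org/wiki/Student%27s_t-distribution#Confidence_intervals
# Confidence level passed positionally (the `alpha` keyword was renamed to `confidence` in newer SciPy releases)
st.t.interval(0.80, df=10, loc=10, scale=np.sqrt(2 / 11))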

(9.414898929487673, 10.585101070512327)
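
The R check in the cell above can also be reproduced in Python. This is a sketch using `scipy.stats.ttest_1samp` with the same data and `mu = 15`; the statistic and p-value should agree with the R output quoted in that cell (t = 3.1799, p-value = 0.006683).

```python
import scipy.stats as st

data = [12, 12, 13, 13, 15, 16, 17, 22, 23, 25, 26, 27, 28, 28, 29]

# One-sample, two-sided t-test of H0: population mean == 15
result = st.ttest_1samp(data, popmean=15)
t_stat, p_value = result.statistic, result.pvalue
```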


## The Central Limit Theorem also underpins many common hypothesis tests

#### - t-test - one sample, two sample (independent or paired)
#### - chi-square test - one sample (goodness of fit), test of independence/homogeneity (with post hoc) - degrees of freedom = (nrow-1)*(ncol-1)
#### - F-test (one-way ANOVA) - compares the means of two or more groups (with post hoc)
#### - z-test - proportion test, one sample or two sample - for A/B testing the alternative can be chosen (one- or two-sided), unlike chi-square, whose alternative is always two-sided - z-score = (p1-p2)/np.sqrt(p*(1-p)*(1/n1+1/n2)), where p is the pooled proportion
#### - ANOVA - one-way, two-way (two factors for continuous data)
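
The z-score formula for the two-sample proportion test can be sketched with the standard library alone. The function name `two_proportion_z_test` and the conversion counts below are made up for illustration; `p` is taken to be the pooled proportion across both groups.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)  # pooled proportion under H0: p1 == p2
    z = (p1 - p2) / sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_sided


# Made-up counts: 130 of 1000 conversions in group A, 150 of 1000 in group B
z, p_value = two_proportion_z_test(130, 1000, 150, 1000)
```

For a one-sided alternative, the p-value would instead come from a single tail of the normal distribution.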

## Probability Questions

#### - Normal distribution (mu, sigma) - continuous probability distribution
#### - Poisson distribution (lambda - average number of events in a fixed time interval) - discrete - Control distribution - mean = variance = lambda
#### - Binomial distribution (n - number of trials, p - probability of success on a single trial) - discrete - Control distribution - mean = np, variance = npq, where q = 1 - p
#### - Chi-square distribution (degrees of freedom)
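
The mean and variance identities for the Poisson and Binomial distributions can be checked numerically from their probability mass functions. This is a small sketch with illustrative parameter values (`lam = 4.0`, `n = 20`, `p = 0.3`) that do not come from the notebook; the helper names are hypothetical.

```python
from math import comb, exp, factorial


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with rate lam."""
    return exp(-lam) * lam**k / factorial(k)


def binomial_pmf(k, n, p):
    """P(X = k) for a Binomial distribution with n trials and success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def mean_and_variance(pmf_values):
    """Mean and variance of a discrete distribution given P(X = k) for k = 0, 1, 2, ..."""
    mean = sum(k * pk for k, pk in enumerate(pmf_values))
    variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf_values))
    return mean, variance


lam, n, p = 4.0, 20, 0.3  # illustrative values only

# The Poisson support is infinite; truncating at k = 99 leaves a negligible tail for lam = 4
poisson_mean, poisson_var = mean_and_variance([poisson_pmf(k, lam) for k in range(100)])
binomial_mean, binomial_var = mean_and_variance([binomial_pmf(k, n, p) for k in range(n + 1)])
```

Up to floating-point error, both Poisson values should come out equal to `lam`, and the Binomial values should match `n * p` and `n * p * (1 - p)`.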


## To conclude, if your observations are

### Numeric in nature (wide range of possible outcomes) --- dependence/difference investigated through _means_ (t-test, ANOVA, F-test)


### Categorical in nature (limited number of outcomes) --- dependence/difference investigated through _proportions_ (chi-square)
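
The two cases can be sketched side by side with SciPy. All values below are made up for illustration: a two-sample t-test compares the means of numeric observations, and a chi-square test of independence compares proportions in a contingency table.

```python
import scipy.stats as st

# Numeric observations: compare the means of two independent groups (made-up values)
group_a = [12, 13, 15, 17, 22, 25]
group_b = [16, 23, 26, 27, 28, 29]
t_stat, t_p_value = st.ttest_ind(group_a, group_b)

# Categorical observations: 2 x 3 contingency table of made-up counts
observed = [[30, 45, 25], [20, 50, 30]]
chi2, chi2_p_value, dof, expected = st.chi2_contingency(observed)
```

Here `dof` follows the (nrow-1)*(ncol-1) rule, which gives 2 for a 2 x 3 table.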

### Power Analysis for A/B Testing - Effect Size etc. 

In [5]:
import statsmodels.stats.api as sms

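# Cohen's h effect size for detecting a change in proportion from 0.13 to 0.15
effect_size = sms.proportion_effectsize(0.13, 0.15)
# Required sample size per group for 90% power at a 5% significance level (equal group sizes)
required_n = sms.NormalIndPower().solve_power(
    effect_size, power=0.9, alpha=0.05, ratio=1
)
required_n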

6318.050022391152
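
The figure above can be cross-checked by hand. This sketch assumes the usual normal-approximation formula for two equal-sized groups, n = 2 * ((z_alpha + z_power) / h) ** 2, where h is Cohen's h. It ignores the negligible opposite-tail term of the two-sided test, so it should land very close to the statsmodels result rather than match it digit for digit.

```python
from math import asin, ceil, sqrt
from statistics import NormalDist

p1, p2 = 0.13, 0.15
alpha, power = 0.05, 0.9

# Cohen's h: difference of the arcsine-transformed proportions
h = 2 * (asin(sqrt(p1)) - asin(sqrt(p2)))

z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
z_power = NormalDist().inv_cdf(power)  # quantile matching the target power

# Per-group sample size for two equal-sized groups (normal approximation)
n_per_group = 2 * ((z_alpha + z_power) / h) ** 2
n_rounded = ceil(n_per_group)  # round up to whole observations
```

Rounding up with `ceil` keeps the achieved power at or above the target.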
