Summary of course: https://www.coursera.org/learn/stats-for-data-analysis/home/welcome

## Central limit theorem

Let $X_1, \dots, X_n$ be i.i.d. random variables drawn from a distribution with mean $\mu$ and variance $\sigma^2$, and let $\overline{X}_n = \frac{X_1+\dots + X_n}{n}$ be their sample mean.

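Then, as $n \to \infty$,

$$\frac{\overline{X}_n - \mu }{\sigma/\sqrt{n}} \xrightarrow{d} N(0,1),$$

so for large $n$ the sample mean is approximately normal:

$$ \overline{X}_n \sim N \left(E\left(X\right), \frac{D(X)}{n} \right), \quad E(X) = \mu, \;\; D(X) = \sigma^2 $$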



**Confidence interval for mean:**
$$P\left(\overline{X}_n -z_{1-\frac{\alpha}{2}} \sqrt{\frac{D(X)}{n}} \leq E(X) \leq \overline{X}_n + z_{1-\frac{\alpha}{2}} \sqrt{\frac{D(X)}{n}}\right) \approx 1 - \alpha $$

where $z_{\alpha}$ is the $\alpha$-quantile of the standard normal distribution $N(0,1)$.
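
The coverage statement above can be checked empirically. The sketch below is only an illustration under assumptions that are not part of the course notes: the data come from an exponential distribution with $\mu = \sigma = 1$, and the sample size, number of trials and seed are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# Assumed setup: exponential(scale=1) has mean 1 and variance 1.
mu, sigma = 1.0, 1.0
n, n_trials, alpha = 50, 10_000, 0.05

# Each row is one sample of size n; take the mean of every sample.
samples = rng.exponential(scale=1.0, size=(n_trials, n))
means = samples.mean(axis=1)

# Half-width of the z-interval with known sigma.
z = stats.norm.ppf(1 - alpha / 2)
half_width = z * sigma / np.sqrt(n)

# Fraction of intervals that contain the true mean.
covered = (means - half_width <= mu) & (mu <= means + half_width)
coverage = covered.mean()
print(f"empirical coverage: {coverage:.3f} (nominal {1 - alpha:.2f})")
```

Even though the exponential distribution is skewed, the empirical coverage should land close to the nominal $1 - \alpha$ for a moderate $n$.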

## Confidence intervals for mean:

**z-interval:** $\mu$ is estimated from the sample; $\sigma^2$ is assumed to be known.

 $$\overline{X}_n \pm z_{1-\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}$$

**t-interval:** both $\mu$ and $\sigma^2$ are estimated from the sample. Let $S_n$ be the sample standard deviation and $t_{1-\frac{\alpha}{2}}$ the quantile of Student's t-distribution with $n-1$ degrees of freedom.

$$\overline{X}_n \pm t_{1-\frac{\alpha}{2}} \frac{S_n}{\sqrt{n}} $$
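
Both formulas can be written out directly with `scipy.stats` quantiles. This is a minimal sketch, not the notebook's own code: the helper names `z_interval` and `t_interval`, the `scores` array, and the "known" $\sigma = 0.5$ are hypothetical and chosen only for illustration.

```python
import numpy as np
from scipy import stats

def z_interval(sample, sigma, alpha=0.05):
    """Interval for the mean when the standard deviation sigma is known."""
    sample = np.asarray(sample, dtype=float)
    z = stats.norm.ppf(1 - alpha / 2)
    half_width = z * sigma / np.sqrt(len(sample))
    return sample.mean() - half_width, sample.mean() + half_width

def t_interval(sample, alpha=0.05):
    """Interval for the mean when the standard deviation is estimated."""
    sample = np.asarray(sample, dtype=float)
    n = len(sample)
    t = stats.t.ppf(1 - alpha / 2, df=n - 1)           # n - 1 degrees of freedom
    half_width = t * sample.std(ddof=1) / np.sqrt(n)   # S_n / sqrt(n)
    return sample.mean() - half_width, sample.mean() + half_width

# Hypothetical cross-validation scores, made up for this example.
scores = np.array([0.91, 0.94, 0.89, 0.96, 0.93, 0.92, 0.95, 0.90, 0.97, 0.93])

print("z-interval:", z_interval(scores, sigma=0.5))
print("t-interval:", t_interval(scores))
```

With a pessimistic assumed $\sigma$ the z-interval is much wider than the t-interval, which uses the spread actually observed in the sample.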

In [3]:
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

In [4]:
X, y = make_classification(n_samples=200, n_features=10, n_informative=8, n_redundant=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

rf_cv_cores = cross_val_score(RandomForestClassifier(), X, y, scoring = 'roc_auc', cv = 10)
lr_cv_cores = cross_val_score(LogisticRegression(), X, y, scoring = 'roc_auc', cv = 10)

In [21]:
from statsmodels.stats.weightstats import _zconfint_generic, _tconfint_generic

*z-interval example*

In [34]:
# The z-interval needs a known variance; 0.25 is used here as the assumed value
# (the largest variance a quantity bounded in [0, 1], such as ROC AUC, can have).
int_rf = _zconfint_generic(np.mean(rf_cv_cores), np.sqrt(0.25/len(rf_cv_cores)), 0.05, 'two-sided')
int_lr = _zconfint_generic(np.mean(lr_cv_cores), np.sqrt(0.25/len(lr_cv_cores)), 0.05, 'two-sided')

print(f'Confidence interval (z-interval) Random Forest Classifier {int_rf}')
print(f'Confidence interval (z-interval) Logistic Regression {int_lr}')

Confidence interval (z-interval) Random Forest Classifier (0.6306024838477193, 1.250397516152281)
Confidence interval (z-interval) Logistic Regression (0.5611024838477193, 1.1808975161522808)


*t-interval example*

In [38]:
std_rf = rf_cv_cores.std(ddof=1)/np.sqrt(len(rf_cv_cores))
std_lr = lr_cv_cores.std(ddof=1)/np.sqrt(len(lr_cv_cores))

int_rf_t = _tconfint_generic(np.mean(rf_cv_cores), std_rf, len(rf_cv_cores) - 1, 0.05, 'two-sided')
int_lr_t = _tconfint_generic(np.mean(lr_cv_cores), std_lr, len(lr_cv_cores) - 1, 0.05, 'two-sided')

print(f'Confidence interval (t-interval) Random Forest Classifier {int_rf_t}')
print(f'Confidence interval (t-interval) Logistic Regression {int_lr_t}')

Confidence interval (t-interval) Random Forest Classifier (0.9042407419245063, 0.9767592580754939)
Confidence interval (t-interval) Logistic Regression (0.8096060566805147, 0.9323939433194856)


## Confidence intervals for proportions:

**Based on the normal distribution**
$$\hat{p}\pm z_{1-\frac{\alpha}{2}} \sqrt{\frac{\hat{p}\left(1-\hat{p}\right)}{n}}$$

In [52]:
y_predict_rf = RandomForestClassifier().fit(X_train, y_train).predict(X_test)
y_predict_lr = LogisticRegression().fit(X_train, y_train).predict(X_test)

In [53]:
from statsmodels.stats.proportion import proportion_confint

normal_interval_rf = proportion_confint(sum(y_predict_rf), len(y_predict_rf), method = 'normal')
normal_interval_lr = proportion_confint(sum(y_predict_lr), len(y_predict_lr), method = 'normal')

print(f'Normal confidence interval Random Forest Classifier {normal_interval_rf}')
print(f'Normal confidence interval Logistic Regression {normal_interval_lr}')

Normal confidence interval Random Forest Classifier (0.3568885078639549, 0.6097781588027118)
Normal confidence interval Logistic Regression (0.47604099353908763, 0.7239590064609123)


**Wilson confidence interval**
$$\frac1{ 1 + \frac{z^2}{n} } \left( \hat{p} + \frac{z^2}{2n} \pm z \sqrt{ \frac{ \hat{p}\left(1-\hat{p}\right)}{n} + \frac{z^2}{4n^2} } \right), \;\; z \equiv z_{1-\frac{\alpha}{2}}$$

Pros: more reliable than the normal interval for small samples and for proportions close to 0 or 1.
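
To see the difference on a small sample, both formulas can be coded by hand. This is a sketch under assumptions: the function names and the counts (1 success out of 10) are hypothetical and not taken from the notebook's data.

```python
import numpy as np
from scipy import stats

def normal_interval(successes, n, alpha=0.05):
    """Normal-approximation interval for a proportion."""
    z = stats.norm.ppf(1 - alpha / 2)
    p = successes / n
    half_width = z * np.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

def wilson_interval(successes, n, alpha=0.05):
    """Wilson interval, term by term as in the formula above."""
    z = stats.norm.ppf(1 - alpha / 2)
    p = successes / n
    center = p + z**2 / (2 * n)
    half_width = z * np.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    denom = 1 + z**2 / n
    return (center - half_width) / denom, (center + half_width) / denom

# Hypothetical small, unbalanced sample: 1 success in 10 trials.
print("normal:", normal_interval(1, 10))
print("wilson:", wilson_interval(1, 10))
```

On this sample the normal interval extends below 0, which is impossible for a proportion, while the Wilson interval stays inside $[0, 1]$.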

In [54]:
wilson_interval_rf = proportion_confint(sum(y_predict_rf), len(y_predict_rf), method = 'wilson')
wilson_interval_lr = proportion_confint(sum(y_predict_lr), len(y_predict_lr), method = 'wilson')

In [55]:
print(f'Wilson confidence interval Random Forest Classifier {wilson_interval_rf}')
print(f'Wilson confidence interval Logistic Regression {wilson_interval_lr}')


## Confidence intervals for proportions difference (Independent Samples):

| | $X_1$ | $X_2$ |
| ------------- | ------------- | ------------- |
| 1 | a | b |
| 0 | c | d |
| $\sum$ | $n_1$ | $n_2$ |
  
$$ \hat{p}_1 = \frac{a}{n_1}$$

$$ \hat{p}_2 = \frac{b}{n_2}$$


$$\text{Confidence interval for }p_1 - p_2\colon \;\; \hat{p}_1 - \hat{p}_2 \pm z_{1-\frac{\alpha}{2}}\sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$$

In [61]:
def proportions_confint_diff_ind(sample1, sample2, alpha = 0.05):    
    z = stats.norm.ppf(1 - alpha / 2.)   
    p1 = float(sum(sample1)) / len(sample1)
    p2 = float(sum(sample2)) / len(sample2)
    
    left_boundary = (p1 - p2) - z * np.sqrt(p1 * (1 - p1)/ len(sample1) + p2 * (1 - p2)/ len(sample2))
    right_boundary = (p1 - p2) + z * np.sqrt(p1 * (1 - p1)/ len(sample1) + p2 * (1 - p2)/ len(sample2))
    
    return (left_boundary, right_boundary)

print("confidence interval: [%f, %f]" % proportions_confint_diff_ind(y_predict_rf, y_predict_lr))

confidence interval: [-0.293738, 0.060404]


## Confidence intervals for proportions difference (Dependent Samples):

| $X_1$ \ $X_2$ | 1 | 0 | $\sum$ |
| ------------- | ----- | ----- | ------------- |
| 1 | e | f | e + f |
| 0 | g | h | g + h |
| $\sum$ | e + g | f + h | n |
  
$$ \hat{p}_1 = \frac{e + f}{n}$$

$$ \hat{p}_2 = \frac{e + g}{n}$$

$$ \hat{p}_1 - \hat{p}_2 = \frac{f - g}{n}$$


$$\text{Confidence interval for }p_1 - p_2\colon \;\;  \frac{f - g}{n} \pm z_{1-\frac{\alpha}{2}}\sqrt{\frac{f + g}{n^2} - \frac{(f - g)^2}{n^3}}$$

In [65]:
def proportions_confint_diff_rel(sample1, sample2, alpha = 0.05):
    z = stats.norm.ppf(1 - alpha / 2.)
    sample = list(zip(sample1, sample2))
    n = len(sample)
        
    f = sum([1 if (x[0] == 1 and x[1] == 0) else 0 for x in sample])
    g = sum([1 if (x[0] == 0 and x[1] == 1) else 0 for x in sample])
    
    left_boundary = float(f - g) / n  - z * np.sqrt(float((f + g)) / n**2 - float((f - g)**2) / n**3)
    right_boundary = float(f - g) / n  + z * np.sqrt(float((f + g)) / n**2 - float((f - g)**2) / n**3)
    return (left_boundary, right_boundary)

In [66]:
print("confidence interval: [%f, %f]" % proportions_confint_diff_rel(y_predict_rf, y_predict_lr))

confidence interval: [-0.197895, -0.035438]
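
When two classifiers are scored on the same objects, their predictions are paired, and the dependent-samples formula usually gives a narrower interval than the independent one. The sketch below illustrates this on a made-up contingency table. The cell counts `e, f, g, h` and the function names are hypothetical and are not the notebook's data.

```python
import numpy as np
from scipy import stats

def diff_ci_independent(x1, x2, alpha=0.05):
    """Interval for p1 - p2, treating the two samples as independent."""
    z = stats.norm.ppf(1 - alpha / 2)
    p1, p2 = x1.mean(), x2.mean()
    se = np.sqrt(p1 * (1 - p1) / len(x1) + p2 * (1 - p2) / len(x2))
    return (p1 - p2) - z * se, (p1 - p2) + z * se

def diff_ci_paired(x1, x2, alpha=0.05):
    """Interval for p1 - p2 when x1[i] and x2[i] describe the same object."""
    z = stats.norm.ppf(1 - alpha / 2)
    n = len(x1)
    f = np.sum((x1 == 1) & (x2 == 0))  # pairs (1, 0)
    g = np.sum((x1 == 0) & (x2 == 1))  # pairs (0, 1)
    diff = (f - g) / n
    se = np.sqrt((f + g) / n**2 - (f - g) ** 2 / n**3)
    return diff - z * se, diff + z * se

# Hypothetical table: e pairs (1, 1), f pairs (1, 0), g pairs (0, 1), h pairs (0, 0).
e, f, g, h = 50, 5, 15, 30
x1 = np.array([1] * e + [1] * f + [0] * g + [0] * h)
x2 = np.array([1] * e + [0] * f + [1] * g + [0] * h)

ind = diff_ci_independent(x1, x2)
paired = diff_ci_paired(x1, x2)
print("independent: [%f, %f]" % ind)
print("paired:      [%f, %f]" % paired)
```

Both intervals are centred on the same difference $\hat{p}_1 - \hat{p}_2$. Only the paired one uses the fact that most pairs agree, so its standard error is smaller.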


## Confidence intervals for a parameter with unknown distribution:

Here we will use **bootstrap** - a method that estimates the sampling distribution by taking multiple samples with replacement from a single random sample.

This method can be used to calculate confidence intervals for statistics whose sampling distribution is unknown, for example the max, min, median, etc.

In [23]:
def get_bootstrap_samples(data, n_samples):
    indices = np.random.randint(0, len(data), (n_samples, len(data)))
    samples = data[indices]
    return samples

In [24]:
def stat_intervals(stat, alpha):
    boundaries = np.percentile(stat, [100 * alpha / 2., 100 * (1 - alpha / 2.)])
    return boundaries

In [25]:
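# NOTE: `t` is not defined anywhere above; it is assumed here to be a 2-D array of
# bootstrap samples with shape (n_samples, len(data)), such as the one returned by
# get_bootstrap_samples(data, n_samples).
median_scores = np.median(t, axis=1)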

print("95% confidence interval for the ILEC median repair time:",  stat_intervals(median_scores, 0.05))

95% confidence interval for the ILEC median repair time: [-0.3236299  0.451573 ]
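
For reference, the whole bootstrap procedure can be run end to end as follows. This is a self-contained sketch on synthetic data. The exponential sample, the seed, the number of resamples, and the helper names `bootstrap_samples` and `percentile_interval` are assumptions made for illustration and are not the data used in the cell above.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

def bootstrap_samples(data, n_samples, rng):
    """Draw n_samples resamples of len(data) elements, with replacement."""
    indices = rng.integers(0, len(data), size=(n_samples, len(data)))
    return data[indices]

def percentile_interval(stat_values, alpha=0.05):
    """Percentile bootstrap interval from the resampled statistic values."""
    return np.percentile(stat_values, [100 * alpha / 2, 100 * (1 - alpha / 2)])

# Hypothetical skewed data, standing in for a real sample.
data = rng.exponential(scale=2.0, size=200)

boot = bootstrap_samples(data, 1000, rng)   # shape (1000, 200)
boot_medians = np.median(boot, axis=1)      # one median per resample
lower, upper = percentile_interval(boot_medians)

print("sample median:", np.median(data))
print("95% bootstrap interval for the median:", (lower, upper))
```

The same two helpers work for any statistic that can be computed row-wise, such as `np.max` or `np.min` in place of `np.median`.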
