# Binomial test for a proportion (single sample)

$H_0:$ $p = p_0$ (for example, random choice $p_0=0.5$)

$H_1:$ $p < p_0$, $p \neq p_0$, or $p > p_0$

**1-sided:**

In [1]:
import numpy as np
from scipy import stats

# binomtest replaces binom_test, which was removed in SciPy 1.12
stats.binomtest(12, 16, 0.5, alternative = 'greater').pvalue #'less'

0.0384063720703125

**2-sided:**

In [2]:
stats.binomtest(12, 16, 0.5, alternative = 'two-sided').pvalue

0.076812744140625
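
To make the p-values above less of a black box, here is a minimal sketch that sums the binomial probabilities directly, using only the standard library. The helper names `binom_pmf` and `binom_p_value_greater` are hypothetical and are not part of SciPy. Doubling the one-sided value matches the two-sided result here only because $p_0 = 0.5$ makes the null distribution symmetric.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)


def binom_p_value_greater(k, n, p0):
    """One-sided p-value P(X >= k) under H0: p = p0."""
    return sum(binom_pmf(i, n, p0) for i in range(k, n + 1))


# Same numbers as in the cells above: 12 successes in 16 trials, p0 = 0.5
p_greater = binom_p_value_greater(12, 16, 0.5)
print("one-sided p-value: %f" % p_greater)

# Valid shortcut only for a symmetric null distribution (p0 = 0.5)
print("two-sided p-value: %f" % (2 * p_greater))
```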

# Student tests
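These tests assume normality, so check that assumption before applying them. For the proportion z-tests below, normality refers to the sampling distribution of the sample proportions, which holds only approximately and only for sufficiently large samples.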

## Z-test for the difference of proportions (independent samples)


| | $X_1$ | $X_2$ |
|---|---|---|
| 1 | a | b |
| 0 | c | d |
| $\sum$ | $n_1$ | $n_2$ |
  
$$ \hat{p}_1 = \frac{a}{n_1}$$

$$ \hat{p}_2 = \frac{b}{n_2}$$

$H_0:$ $p_1 = p_2$ 

$H_1:$ $p_1 < p_2$, $p_1 \neq p_2$, or $p_1 > p_2$


$$\text{Confidence interval for }p_1 - p_2\colon \;\; \hat{p}_1 - \hat{p}_2 \pm z_{1-\frac{\alpha}{2}}\sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$$
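
The interval formula above can be sketched in a few lines. This is an illustration under stated assumptions, not a library routine: the function name `proportions_diff_confint_ind` and the counts in the example are hypothetical, and the normal quantile comes from the standard library's `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist


def proportions_diff_confint_ind(a, n1, b, n2, alpha=0.05):
    """Interval for p1 - p2 from success counts a, b and sample sizes n1, n2."""
    p1 = a / n1
    p2 = b / n2
    # Quantile z_{1 - alpha/2} of the standard normal distribution
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return p1 - p2 - half_width, p1 - p2 + half_width


# Hypothetical counts: 30 ones out of 60 in the first sample, 24 out of 60 in the second
low, high = proportions_diff_confint_ind(30, 60, 24, 60)
print("95%% confidence interval: [%f, %f]" % (low, high))
```

If the interval contains zero, the two-sided test at the same $\alpha$ will usually not reject $H_0$, although the two can disagree near the boundary because the test uses the pooled proportion $P$ while the interval does not.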

$$\text{Z-statistic: } Z(X_1, X_2) = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{P(1 - P)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

$$P = \frac{\hat{p}_1 n_1 + \hat{p}_2 n_2}{n_1 + n_2}$$

In [8]:
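def proportions_diff_z_stat_ind(sample1, sample2):
    """Z-statistic for the difference of proportions in two independent 0/1 samples."""
    n1 = len(sample1)
    n2 = len(sample2)
    
    # Sample proportions of ones
    p1 = float(sum(sample1)) / n1
    p2 = float(sum(sample2)) / n2 
    # Pooled proportion under H0: p1 = p2
    P = float(p1*n1 + p2*n2) / (n1 + n2)
    
    return (p1 - p2) / np.sqrt(P * (1 - P) * (1. / n1 + 1. / n2))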

In [12]:
def proportions_diff_z_test(z_stat, alternative = 'two-sided'):
    """P-value of a z-statistic under the standard normal distribution."""
    if alternative not in ('two-sided', 'less', 'greater'):
        raise ValueError("alternative not recognized\n"
                         "should be 'two-sided', 'less' or 'greater'")
    
    # Both tails: probability of a value at least as extreme as |z_stat|
    if alternative == 'two-sided':
        return 2 * (1 - stats.norm.cdf(np.abs(z_stat)))
    
    # Left tail
    if alternative == 'less':
        return stats.norm.cdf(z_stat)

    # Right tail
    if alternative == 'greater':
        return 1 - stats.norm.cdf(z_stat)

In [13]:
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# No random_state is set, so the data, the split, and the p-values below change on every run
X, y = make_classification(n_samples=200, n_features=10, n_informative=8, n_redundant=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

y_predict_rf = RandomForestClassifier().fit(X_train, y_train).predict(X_test)
y_predict_lr = LogisticRegression().fit(X_train, y_train).predict(X_test)

In [14]:
print("p-value: %f" % proportions_diff_z_test(proportions_diff_z_stat_ind(y_predict_rf, y_predict_lr)))

p-value: 0.715001
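
When only the counts from the contingency table are available rather than the raw samples, the same statistic can be computed directly from $a$, $b$, $n_1$ and $n_2$. The sketch below is a worked example with hypothetical counts. The function name `z_stat_from_counts` is an assumption made for illustration, and the normal CDF comes from the standard library's `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist


def z_stat_from_counts(a, n1, b, n2):
    """Pooled z-statistic for H0: p1 = p2 from success counts a, b and sample sizes n1, n2."""
    p1 = a / n1
    p2 = b / n2
    # Pooled proportion: (p1*n1 + p2*n2) / (n1 + n2) simplifies to (a + b) / (n1 + n2)
    pooled = (a + b) / (n1 + n2)
    return (p1 - p2) / sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))


# Hypothetical counts: 30 ones out of 60 versus 24 ones out of 60
z = z_stat_from_counts(30, 60, 24, 60)
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
print("z-statistic: %f" % z)
print("p-value: %f" % p_two_sided)
```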


## Confidence interval and Z-test for the difference of proportions (dependent samples)

| $X_1$ \ $X_2$ | 1 | 0 | $\sum$ |
|---|---|---|---|
| 1 | e | f | e + f |
| 0 | g | h | g + h |
| $\sum$ | e + g | f + h | n |
  
$$ \hat{p}_1 = \frac{e + f}{n}$$

$$ \hat{p}_2 = \frac{e + g}{n}$$

$$ \hat{p}_1 - \hat{p}_2 = \frac{f - g}{n}$$


$$\text{Confidence interval for }p_1 - p_2\colon \;\;  \frac{f - g}{n} \pm z_{1-\frac{\alpha}{2}}\sqrt{\frac{f + g}{n^2} - \frac{(f - g)^2}{n^3}}$$
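
The paired interval depends only on the discordant counts $f$ and $g$ and the number of pairs $n$. A minimal sketch under stated assumptions: the function name `proportions_diff_confint_rel` and the counts in the example are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def proportions_diff_confint_rel(f, g, n, alpha=0.05):
    """Interval for p1 - p2 in paired 0/1 samples.

    f is the number of (1, 0) pairs, g the number of (0, 1) pairs, n the number of pairs.
    """
    # Quantile z_{1 - alpha/2} of the standard normal distribution
    z = NormalDist().inv_cdf(1 - alpha / 2)
    diff = (f - g) / n
    half_width = z * sqrt((f + g) / n**2 - (f - g)**2 / n**3)
    return diff - half_width, diff + half_width


# Hypothetical counts: 10 pairs of type (1, 0) and 4 pairs of type (0, 1) among 60 pairs
low, high = proportions_diff_confint_rel(10, 4, 60)
print("95%% confidence interval: [%f, %f]" % (low, high))
```

The concordant counts $e$ and $h$ enter only through $n$, which is why the interval can be narrow even when both marginal proportions are close to 0.5.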

$$\text{Z-statistic: } Z(X_1, X_2) = \frac{f - g}{\sqrt{f + g - \frac{(f-g)^2}{n}}}$$

In [19]:
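def proportions_diff_z_stat_rel(sample1, sample2):
    """Z-statistic for the difference of proportions in two paired 0/1 samples."""
    sample = list(zip(sample1, sample2))
    n = len(sample)
    
    # Discordant pairs: f counts (1, 0), g counts (0, 1)
    f = sum([1 if (x[0] == 1 and x[1] == 0) else 0 for x in sample])
    g = sum([1 if (x[0] == 0 and x[1] == 1) else 0 for x in sample])
    
    # Not defined when f + g == 0 (no discordant pairs)
    return float(f - g) / np.sqrt(f + g - float((f - g)**2) / n )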

In [18]:
print("p-value: %f" % proportions_diff_z_test(proportions_diff_z_stat_rel(y_predict_rf, y_predict_lr)))

p-value: 0.313244
