# Unit 4 - Test of proportions - Introduction to Biomedical Statistics @林关宁

## Confidence Interval for a single proportion

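We can use statsmodels to compute a confidence interval for the proportion of "successes" in a set of trials, for example the frequency of a particular genotype or the share of voters who intend to vote a certain way.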

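### Suppose that on a given day, 310 of the 1126 visitors to a website clicked on a sponsor's advertisement.
#### Compute a confidence interval for the population proportion of visitors who click on the advertisement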

In [1]:
import statsmodels.api as sm
p = 310 / 1126    # Sample proportion
print(p)
# Function for computing confidence intervals
from statsmodels.stats.proportion import proportion_confint   

proportion_confint(count=310,    # Number of "successes"
                   nobs=1126,    # Number of trials
                   alpha=(1 - 0.95), method='normal')

0.2753108348134991


(0.24922129423231776, 0.30140037539468045)
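
For reference, the `method='normal'` interval above is the normal-approximation (Wald) interval, p_hat ± z * sqrt(p_hat * (1 - p_hat) / n). The sketch below recomputes it by hand using only the Python standard library (`statistics.NormalDist`) in place of statsmodels; the variable names are illustrative and not part of the original notebook.

```python
from math import sqrt
from statistics import NormalDist

count, nobs = 310, 1126      # successes and trials from the example above
conf_level = 0.95

p_hat = count / nobs                                     # sample proportion
se = sqrt(p_hat * (1 - p_hat) / nobs)                    # estimated standard error
z_crit = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)  # two-sided critical value

lower = p_hat - z_crit * se
upper = p_hat + z_crit * se
print(lower, upper)
```

The printed bounds should agree with the statsmodels output above up to floating-point rounding.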

#### If we wanted a 99% confidence interval, we would have a wider interval, but more confidence that the true proportion lies in this interval.

In [2]:
lower, upper = proportion_confint(310, 1126, alpha=(1 - 0.99))
print('Lower confidence interval:', lower)
print('Upper confidence interval:', upper)

Lower confidence interval: 0.24102336643386685
Upper confidence interval: 0.30959830319313136
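
To see why a higher confidence level widens the interval, the sketch below tabulates the interval and its width for a few confidence levels. It assumes the same normal-approximation formula and uses only the standard library; the helper name `wald_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def wald_interval(count, nobs, conf_level):
    """Normal-approximation (Wald) interval for a single proportion."""
    p_hat = count / nobs
    se = sqrt(p_hat * (1 - p_hat) / nobs)
    # A larger confidence level gives a larger critical value
    z_crit = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    return p_hat - z_crit * se, p_hat + z_crit * se

widths = {}
for level in (0.90, 0.95, 0.99):
    lo, hi = wald_interval(310, 1126, level)
    widths[level] = hi - lo
    print(f"{level:.0%} interval: ({lo:.4f}, {hi:.4f}), width {hi - lo:.4f}")
```

Only the critical value changes between levels; the point estimate and the standard error stay the same.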


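### One-sample hypothesis test: the website administrator claims that 30% of visitors click on the advertisement. Is this true?

### Hypothesis test for proportions using a z-test

#### We will run a statistical test of the administrator's claim

H0: p = 0.3 (null hypothesis), H1: p ≠ 0.3 (two-sided alternative)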

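We compute the p-value. If the p-value is small, we reject H0 and conclude that the administrator's claim is false: the proportion of visitors who click on the advertisement is not 0.3. If the p-value is not small, we do not reject H0; the evidence in our data does not contradict his claim.

What counts as a "small" p-value? Here we call the p-value small if it is below 0.05, in which case we reject the null hypothesis; otherwise we do not reject it. (A threshold other than 0.05, such as 0.01, could also be chosen.)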
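
Written out, the test statistic and two-sided p-value for this setup can be sketched as follows. This assumes the standard error is estimated from the sample proportion, which is what `proportions_ztest` does by default; some textbooks instead use the hypothesized value $p_0$ in the denominator.

$$
z = \frac{\hat{p} - p_0}{\sqrt{\hat{p}\,(1 - \hat{p})/n}}, \qquad
p\text{-value} = 2\,\bigl(1 - \Phi(|z|)\bigr)
$$

Here $\hat{p} = 310/1126$ is the sample proportion, $p_0 = 0.3$ is the hypothesized proportion, $n = 1126$ is the number of visitors, and $\Phi$ is the standard normal cumulative distribution function.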

In [4]:
# Performs the test just described
from statsmodels.stats.proportion import proportions_ztest
zstat, p = proportions_ztest(count=310,
                        nobs=1126,
                        value=0.3,  # The hypothesized value of population proportion p
                        alternative='two-sided') # Tests the "not equal to" alternative hypothesis

print('z statistics:', zstat)
print('p value:', p)

z statistics: -1.8547614674673856
p value: 0.06363029677684083


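Result: we obtain a test statistic z ≈ −1.85 and a p-value ≈ 0.0636 > 0.05.

Conclusion: at the 0.05 significance level, there is not enough statistical evidence to reject the website administrator's claim.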
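
As a cross-check, the sketch below recomputes the test by hand with the standard library only. The first variant estimates the standard error from the sample proportion, which reproduces the z statistic above. The second is the common textbook variant that uses the hypothesized value 0.3 in the standard error (in statsmodels this corresponds to passing `prop_var=0.3`). Variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

count, nobs, p0 = 310, 1126, 0.3
p_hat = count / nobs
norm = NormalDist()

# Variant 1: standard error from the sample proportion
z_sample = (p_hat - p0) / sqrt(p_hat * (1 - p_hat) / nobs)
p_sample = 2 * (1 - norm.cdf(abs(z_sample)))

# Variant 2: standard error from the hypothesized proportion p0
z_null = (p_hat - p0) / sqrt(p0 * (1 - p0) / nobs)
p_null = 2 * (1 - norm.cdf(abs(z_null)))

print("sample-based SE:", z_sample, p_sample)
print("null-based SE:  ", z_null, p_null)
```

Both variants give a p-value above 0.05, so the conclusion does not depend on this choice here. It also agrees with the 95% confidence interval computed earlier, which contains 0.3.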

## Confidence Interval for two population proportions

In [5]:
import numpy as np
import pandas as pd
import scipy.stats as stats

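def two_proprotions_confint(success_a, size_a, success_b, size_b, significance=0.05):  # Custom helper defined here and called below; not a ready-made library API
    """
    A/B test for two proportions;
    given the number of successes and the trial size of groups A and B,
    compute the confidence interval for the difference in proportions;
    up to the order of subtraction, the resulting interval matches R's
    prop.test function when called with correct=FALSE (no continuity correction)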

    Parameters
    ----------
    success_a, success_b : int
        Number of successes in each group

    size_a, size_b : int
        Size, or number of observations in each group

    significance : float, default 0.05
        Often denoted as alpha. Governs the chance of a false positive.
        A significance level of 0.05 means that there is a 5% chance of
        a false positive. In other words, our confidence level is
        1 - 0.05 = 0.95

    Returns
    -------
    prop_diff : float
        Difference between the two proportions (prop_b - prop_a)

    confint : 1d ndarray
        Confidence interval for the difference between the two proportions
    """
    prop_a = success_a / size_a
    prop_b = success_b / size_b
    var = prop_a * (1 - prop_a) / size_a + prop_b * (1 - prop_b) / size_b
    se = np.sqrt(var)

    # z critical value for a two-sided interval:
    # confidence + significance / 2 equals 1 - significance / 2
    confidence = 1 - significance
    z = stats.norm(loc=0, scale=1).ppf(confidence + significance / 2)

    # standard formula for the confidence interval
    # point-estimate +- z * standard-error
    prop_diff = prop_b - prop_a
    confint = prop_diff + np.array([-1, 1]) * z * se
    return prop_diff, confint

In [7]:
success_a = 486
size_a = 5000
success_b = 527
size_b = 5000
alpha = 0.05

prop_diff, confint = two_proprotions_confint(success_a, size_a, success_b, size_b, alpha)
print('estimate difference:', prop_diff)
print('confidence interval:', confint)

estimate difference: 0.008199999999999999
confidence interval: [-0.00362633  0.02002633]
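
One way to read this result: the interval contains 0, so at the 5% level the data do not show a difference between the two groups' proportions. A minimal standard-library sketch of that check, assuming the same Wald formula as the function above (variable names are illustrative):

```python
from math import sqrt
from statistics import NormalDist

success_a, size_a = 486, 5000
success_b, size_b = 527, 5000
alpha = 0.05

prop_a, prop_b = success_a / size_a, success_b / size_b
# Unpooled standard error of the difference in proportions
se = sqrt(prop_a * (1 - prop_a) / size_a + prop_b * (1 - prop_b) / size_b)
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

diff = prop_b - prop_a
lower, upper = diff - z_crit * se, diff + z_crit * se

# If 0 lies inside the interval, a zero difference is compatible with the data
contains_zero = lower <= 0 <= upper
print(lower, upper, contains_zero)
```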
