In [3]:
import numpy as np
import pandas as pd

**Cardiovascular Disease**<br>
Consider the Physicians’ Health Study data presented in Example 10.37 (p. 411).

Ex 10.36:<br>
Participants were 22,000 male physicians ages 40−84 and free of cardiovascular disease in 1982. The physicians were randomized to either active aspirin (one white pill containing 325 mg of aspirin taken every other day) or aspirin placebo (one white placebo pill taken every other day). As the study progressed, it was estimated from self-report that 10% of the participants in the aspirin group were not complying (that is, were not taking their study [aspirin] capsules). Thus, the dropout rate was 10%. Also, it was estimated from self-report that 5% of the participants in the placebo group were taking aspirin regularly on their own outside the study protocol. Thus, the drop-in rate was 5%.

Ex 10.37:<br>
Suppose we assume that the incidence of MI is .005 per year among participants who actually take placebo and that aspirin prevents 20% of MIs (i.e., relative risk = p1/p2 = 0.8). We also assume that the duration of the study is 5 years and that the dropout rate in the aspirin group = 10% and the drop-in rate in the placebo group = 5%. How many participants need to be enrolled in each group to achieve 80% power using a two-sided test with significance level = .05?

*The incidence of MI is 0.005 **per year** for the placebo group, so the 5-year incidence of MI is approximately 5 * 0.005 = 0.025. With a relative risk of 0.8, the 5-year incidence in the aspirin group is 0.8 * 0.025 = 0.02.*

10.1 How many participants need to be enrolled in each group to have a 90% chance of detecting a significant difference using a two-sided test with α = .05 if compliance is perfect?

In [45]:
import statsmodels.stats.api as sms
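p1 = 0.025  # 5-year MI incidence, placebo group
p2 = 0.02   # 5-year MI incidence, aspirin group (p2 / p1 = 0.8)
power = 0.9
alpha = 0.05
ratio = 1
nobs1 = None  # unknown; solve_power solves for the argument left as None
# Cohen's h (arcsine-transformed difference between the two proportions)
es = sms.proportion_effectsize(prop1=p1, prop2=p2, method='normal')
# n is the required sample size per group
n = sms.NormalIndPower().solve_power(es, nobs1, alpha, power, ratio=ratio, alternative='two-sided')
print(n * 2)  # total across both groups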

36862.823179160376


Approximately 18,431 participants per group (about 36,863 total) need to be enrolled to have a 90% chance of detecting a 20% reduction in the risk of MI in the aspirin group compared with the placebo group, using a two-sided test at the 5% significance level.
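As a cross-check, the per-group sample size can also be computed from the standard closed-form formula for comparing two binomial proportions with equal group sizes. This is only a sketch: it assumes this pooled-variance formula is the one the book intends, and the helper name `n_per_group_two_proportions` is made up for illustration. It works on the raw difference in proportions rather than the arcsine-transformed effect size, so a slightly different answer from the statsmodels result is expected.

```python
from math import sqrt
from scipy.stats import norm

def n_per_group_two_proportions(p1, p2, alpha=0.05, power=0.9):
    """Per-group n for a two-sided test of p1 vs p2, equal group sizes (hypothetical helper)."""
    p_bar = (p1 + p2) / 2          # average of the two proportions
    q_bar = 1 - p_bar
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value for the two-sided test
    z_beta = norm.ppf(power)           # quantile corresponding to the desired power
    numerator = (sqrt(2 * p_bar * q_bar) * z_alpha
                 + sqrt(p1 * (1 - p1) + p2 * (1 - p2)) * z_beta) ** 2
    return numerator / (p1 - p2) ** 2

n_formula = n_per_group_two_proportions(0.025, 0.02, alpha=0.05, power=0.9)
print(n_formula)  # per-group sample size from the closed-form formula
```

The two approaches should agree to within a fraction of a percent for proportions this close together.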

10.2 Answer Problem 10.1 if compliance is as given in Example 10.37.

In [47]:
# From equation 10.17 in the book...
gamma1 = 0.1
gamma2 = 0.05
n_non_compliance = n / (1 - gamma1 - gamma2) ** 2 # we can use the approximation since both gamma1 and gamma2 are <= 0.1
print(round(n_non_compliance * 2))

51021


Factoring in dropouts and drop-ins raises the required total from about 36,863 to about 51,021 participants, an increase of roughly 38% (a factor of 1 / 0.85² ≈ 1.38).
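The approximation can be compared with a version that first adjusts each group's event rate for noncompliance and then reruns the power calculation. The sketch below assumes the usual form of the compliance-adjusted rates (each group's rate is a mixture of the two perfect-compliance rates); the variable names are illustrative and not taken from the book.

```python
import statsmodels.stats.api as sms

p_placebo, p_aspirin = 0.025, 0.02  # 5-year rates under perfect compliance
dropout, dropin = 0.10, 0.05        # noncompliance rates from Example 10.37

# Assumed adjustment: dropouts in the aspirin group experience the placebo rate,
# drop-ins in the placebo group experience the aspirin rate.
p_aspirin_adj = (1 - dropout) * p_aspirin + dropout * p_placebo
p_placebo_adj = (1 - dropin) * p_placebo + dropin * p_aspirin

es_adj = sms.proportion_effectsize(prop1=p_placebo_adj, prop2=p_aspirin_adj, method='normal')
n_adj = sms.NormalIndPower().solve_power(es_adj, None, 0.05, 0.9, ratio=1, alternative='two-sided')
print(p_aspirin_adj, p_placebo_adj)
print(n_adj * 2)  # total across both groups
```

The adjusted rates shrink the difference between groups from 0.005 to 0.00425, which is why the required sample size grows.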

10.3 Answer Problem 10.1 if a one-sided test with power = .8 is used and compliance is perfect.

In [51]:
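power = 0.8
alpha = 0.05
ratio = 1
nobs1 = None
# 'larger' is the one-sided alternative in the direction of the positive effect size (p1 > p2)
n = sms.NormalIndPower().solve_power(es, nobs1, alpha, power, ratio=ratio, alternative='larger')
n = round(n)  # per-group sample size
print(n)
print(n * 2)  # total across both groups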

10845
21690


Not surprisingly, fewer participants are required (10,845 per group) when the power requirement is lowered from 90% to 80% and a one-sided test replaces the two-sided test.

10.4 Suppose 11,000 men are actually enrolled in each treatment group. What would be the power of such a study if a two-sided test with α = .05 were used and compliance is perfect?
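One possible way to approach Problem 10.4 is to reuse the same arcsine effect size and ask statsmodels for the power at a fixed per-group sample size. This is a sketch, not the book's worked solution; the names `es_104` and `power_104` are illustrative.

```python
import statsmodels.stats.api as sms

# Same effect size as in Problem 10.1: 5-year incidence 0.025 (placebo) vs 0.02 (aspirin)
es_104 = sms.proportion_effectsize(prop1=0.025, prop2=0.02, method='normal')

# Power of a two-sided test at alpha = 0.05 with 11,000 participants per group
power_104 = sms.NormalIndPower().power(effect_size=es_104, nobs1=11000, alpha=0.05,
                                       ratio=1, alternative='two-sided')
print(power_104)
```

Because 11,000 per group is well below the roughly 18,431 needed for 90% power, the resulting power should come out noticeably below 0.9.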

10.5 Answer Problem 10.4 if compliance is as given in Example 10.37.

Refer to Table 2.13 (p. 36).

10.6 What significance test can be used to assess whether there is a relationship between receiving an antibiotic and receiving a bacterial culture while in the hospital?

10.7 Perform the test in Problem 10.6, and report a p-value.

**Gastroenterology**<br>
Two drugs (A, B) are compared for the medical treatment of duodenal ulcer. For this purpose, patients are carefully matched with regard to age, gender, and clinical condition. The treatment results based on 200 matched pairs show that for 89 matched pairs both treatments are effective; for 90 matched pairs both treatments are ineffective; for 5 matched pairs drug A is effective, whereas drug B is ineffective; and for 16 matched pairs drug B is effective, whereas drug A is ineffective.

10.8 What test procedure can be used to assess the results?

In [4]:
n_d = 5 + 16
print(n_d)

21


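Because the data are matched pairs with a binary outcome, McNemar's test is appropriate. With 21 discordant pairs (at least 20), the normal-theory version of the test can be used; the exact (binomial) version is also valid.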

10.9 Perform the test in Problem 10.8, and report a p-value.

In [11]:
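from statsmodels.stats.contingency_tables import mcnemar
# rows: drug B (effective, ineffective); columns: drug A (effective, ineffective)
table = [[89,16],[5,90]]
# exact=True uses the binomial distribution on the discordant pairs;
# the continuity correction only applies when exact=False
results = mcnemar(table, exact=True, correction=True)
print(results)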

pvalue      0.026603698730468753
statistic   5.0


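Since p = 0.027 < 0.05, we reject the null hypothesis and conclude that the two drugs differ significantly in effectiveness for the treatment of duodenal ulcer: among the 21 discordant pairs, drug B was the effective one in 16 pairs and drug A in only 5.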
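Since the number of discordant pairs is large enough for the normal-theory test, the continuity-corrected chi-square version of McNemar's test can be run as a comparison. The sketch below uses the same statsmodels function with `exact=False`; the name `res_normal` is illustrative.

```python
from statsmodels.stats.contingency_tables import mcnemar

table = [[89, 16], [5, 90]]

# Normal-theory McNemar test with continuity correction:
# statistic = (|16 - 5| - 1)^2 / (16 + 5), referred to a chi-square with 1 df
res_normal = mcnemar(table, exact=False, correction=True)
print(res_normal.statistic, res_normal.pvalue)
```

Both versions should lead to the same conclusion at the 5% level, with the chi-square p-value close to the exact one.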

**Sexually Transmitted Disease**<br>
Suppose researchers do an epidemiologic investigation of people entering a sexually transmitted disease clinic. They find that 160 of 200 patients who are diagnosed as having gonorrhea and 50 of 105 patients who are diagnosed as having nongonococcal urethritis have had previous episodes of urethritis.

10.13 Are the present diagnosis and prior episodes of urethritis associated?

If prior episodes of urethritis are unrelated to the present diagnosis, then the proportion of patients with gonorrhea who have had previous episodes of urethritis (p1) should equal the corresponding proportion among patients diagnosed with nongonococcal urethritis (p2). Neither proportion needs to equal 0.5; under the null hypothesis they simply share a common value. We can therefore use a 2 x 2 contingency table and the chi-square test for association to test whether an association exists (assuming the assumptions for the test are met).

So therefore:<br>
H0: There is no association, i.e. p1 = p2<br>
HA: There is an association, i.e. p1 != p2

**Assumptions:**
- No more than 1/5 of the cells have expected values < 5, and
- No cell has an expected value < 1.
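These assumptions can be checked directly from the expected counts under independence. The sketch below builds the 2 x 2 table from the counts in the problem statement (rows are the present diagnosis, columns are prior episodes yes/no) and prints the expected frequencies returned by `chi2_contingency`.

```python
import numpy as np
from scipy.stats import chi2_contingency

# rows: gonorrhea, nongonococcal urethritis; columns: prior episodes (yes, no)
observed = np.array([[160, 40], [50, 55]])

# The fourth return value holds the expected counts under independence
_, _, _, expected = chi2_contingency(observed)
print(expected)
print("smallest expected count:", expected.min())
```

For a 2 x 2 table, "no more than 1/5 of the cells" amounts to requiring every expected count to be at least 5.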

The question concerns two groups (i.e. samples) of patients: those diagnosed with gonorrhea and those diagnosed with nongonococcal urethritis. It gives the number of patients in each group who had previous episodes of urethritis, and then asks whether having had urethritis in the past is associated with the present diagnosis.

In [None]:
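# Observed counts: rows = present diagnosis, columns = prior episodes of urethritis
pd.DataFrame([[160, 40], [50, 55]], index=['gonorrhea', 'nongonococcal urethritis'], columns=['prior episodes', 'no prior episodes'])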

In [44]:
## solution based on solution from chegg, now that I understand what the question is asking...
from scipy.stats import chi2_contingency
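# rows: gonorrhea, nongonococcal urethritis; columns: prior episodes (yes, no)
observed = [[160, 40], [50, 55]]
# for a 2 x 2 table, chi2_contingency applies Yates' continuity correction by default
chi2, pvalue, dof, exp = chi2_contingency(observed)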
print(chi2, pvalue)

32.170242645303745 1.4123745276875494e-08


- H0: p1 = p2
- HA: p1 != p2

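Since p < 0.05, we reject the null hypothesis in favor of HA: there is an association between prior episodes of urethritis and the present diagnosis. Patients with gonorrhea were more likely to have had previous episodes (160/200 = 80%) than patients with nongonococcal urethritis (50/105 ≈ 48%).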
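The hypotheses H0: p1 = p2 vs. HA: p1 != p2 can also be tested with a two-sample z-test for proportions. The sketch below uses `proportions_ztest` from statsmodels, which pools the two samples to estimate the variance by default, and compares its squared statistic with the chi-square statistic computed without the continuity correction. Variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.stats.proportion import proportions_ztest

count = np.array([160, 50])   # patients with prior episodes in each diagnosis group
nobs = np.array([200, 105])   # group sizes: gonorrhea, nongonococcal urethritis

# Two-sample z-test of H0: p1 = p2 (pooled variance)
z, p_z = proportions_ztest(count, nobs)

# Chi-square test on the same data without Yates' correction
chi2_uncorrected, p_chi2, _, _ = chi2_contingency([[160, 40], [50, 55]], correction=False)

print(z ** 2, chi2_uncorrected)
print(p_z, p_chi2)
```

The squared z statistic should match the uncorrected chi-square statistic, which is why the two tests are interchangeable for a 2 x 2 table; the value printed earlier is slightly smaller because it includes the continuity correction.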

In [24]:
p1 = 160/200
print(p1)
p2 = 50/105
print(p2)
n_total = 305
p3 = 0.5

0.8
0.47619047619047616


In [19]:
305 * 0.5

152.5

In [25]:
from scipy.stats import chi2_contingency
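# rows: no prior episodes, prior episodes; the third column is a hypothetical group of 305 with a rate of 0.5 (not observed data)
observed = [[40, 55, 152.5], [160, 50, 152.5]]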
chi2, pvalue, dof, exp = chi2_contingency(observed)
print(pd.DataFrame(exp))
print(pvalue)

            0          1       2
0   81.147541  42.602459  123.75
1  118.852459  62.397541  181.25
4.139514165505269e-12


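This table adds a hypothetical third group of 305 patients with a prior-episode rate of exactly 0.5, so rejecting H0 here only shows that the three columns do not share a common rate. Because the third column is not observed data, this test does not answer Problem 10.13; the 2 x 2 test above does.

As an exploratory follow-up, which group's rate differs from the 0.5 reference?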

In [35]:
from scipy.stats import chi2_contingency
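# rows: no prior episodes, prior episodes; columns: gonorrhea group, hypothetical group with a rate of 0.5
observed = [[40, 152.5], [160, 152.5]]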
chi2, pvalue, dof, exp = chi2_contingency(observed)
print(pd.DataFrame(exp))
print(pvalue)

            0           1
0   76.237624  116.262376
1  123.762376  188.737624
2.1551565213831957e-11


Alternatively, using a one-sample z-test of the gonorrhea group's rate against the reference value p3 = 0.5:<br>
H0: p1 = p3<br>
HA: p1 != p3<br>

In [37]:
from statsmodels.stats.weightstats import ztest
x1 = np.concatenate((np.ones(160), np.zeros(40)))
tstat, pvalue = ztest(x1=x1, x2=None, value=0.5, alternative='two-sided')
print(pvalue)

3.687535748815112e-26


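We can reject the null hypothesis: the rate of prior urethritis among patients diagnosed with gonorrhea (0.8) is significantly different from 0.5. On its own this does not establish an association, because association depends on comparing the two diagnosis groups with each other (p1 vs. p2), not with a fixed reference value.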

What about the group diagnosed with nongonococcal urethritis?

In [39]:
from statsmodels.stats.weightstats import ztest
x1 = np.concatenate((np.ones(50), np.zeros(55)))
tstat, pvalue = ztest(x1=x1, x2=None, value=0.5, alternative='two-sided')
print(pvalue)

0.6268449135173175


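Since p > 0.05, we cannot reject the null hypothesis: the rate of prior urethritis among patients diagnosed with nongonococcal urethritis (0.476) is not significantly different from 0.5. Failing to reject H0 does not mean accepting it, and this one-sample comparison says nothing about association. The association between present diagnosis and prior episodes is established by the 2 x 2 chi-square test above (p ≈ 1.4e-08).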