15-categorical-two-props.Rmd

# Inference for comparing two proportions {#inference-two-props}

```{r, include = FALSE}
source("_common.R")
```

::: chapterintro
We now extend the methods from Chapter \@ref(inference-one-prop) to apply confidence intervals and hypothesis tests to differences in population proportions that come from two groups, Group 1 and Group 2: $\pi_1 - \pi_2.$

In our investigations, we'll identify a reasonable point estimate of $\pi_1 - \pi_2$ based on the sample, and you may have already guessed its form: $\hat{p}_1 - \hat{p}_2.$ \index{point estimate!difference of proportions} Then we'll look at the inferential analysis in two different ways: simulation-based methods of a randomization test and applying bootstrapping for interval estimates, and, if we verify that the point estimate can be modeled using a normal distribution, theory-based methods through a two sample $z$-test and $z$-interval.
:::

<!-- ::: {.underconstruction} -->

<!-- Sex discrimination case study now in chapter 11. -->

<!-- ::: -->

<!-- ::: {.underconstruction} -->

<!-- Old content - revise as needed -->

<!-- ::: -->

<!-- ## Difference of two proportions {#diff-two-prop} -->

Below we summarize the notation used throughout this chapter.

::: onebox
**Notation for a binary explanatory variable and a binary response variable.**

-   $n_1$, $n_2$ = sample sizes of two independent samples
-   $\hat{p}_1$, $\hat{p}_2$ = sample proportions of two independent samples
-   $\pi_1$, $\pi_2$ = population proportions of two independent samples
:::

```{r include=FALSE}
terms_chp_15 <- c("point estimate")
```

## Randomization test for $H_0: \pi_1 - \pi_2 = 0$ {#two-prop-errors}

As you learned in Chapter \@ref(data-hello), a **randomized experiment** is done to assess whether or not one variable (the **explanatory** variable) causes changes in a second variable (the **response** variable). Every data set has some variability in it, so to decide whether the variability in the data is due to (1) the causal mechanism (the randomized explanatory variable in the experiment) or instead (2) natural variability inherent to the data, we set up a sham randomized experiment as a comparison. That is, we assume that each observational unit would have gotten the exact same response value regardless of the treatment level. By reassigning the treatments many many times, we can compare the actual experiment to the sham experiment. If the actual experiment has more extreme results than any of the sham experiments, we are led to believe that it is the explanatory variable which is causing the result and not inherent data variability. We first observed this method in the [Sex discrimination case study](caseStudySexDiscrimination) from Chapter \@ref(foundations-randomization). Now, using a few more studies, let's look more carefully at this idea of a **randomization test**\index{randomization}.

```{r include=FALSE}
terms_chp_15 <- c(terms_chp_15, "randomization")
```

### Case study: CPR and blood thinner {#cpr}

\index{data!CPR and blood thinner|(}

Cardiopulmonary resuscitation (CPR) is a procedure used on individuals suffering a heart attack when other emergency resources are unavailable. This procedure is helpful in providing some blood circulation to keep a person alive, but CPR chest compressions can also cause internal injuries. Internal bleeding and other injuries that can result from CPR complicate additional treatment efforts. For instance, blood thinners may be used to help release a clot that is causing the heart attack once a patient arrives in the hospital. However, blood thinners negatively affect internal injuries.

Here we consider an experiment with patients who underwent CPR for a heart attack and were subsequently admitted to a hospital.[^15-categorical-two-props-1] Each patient was randomly assigned to either receive a blood thinner (treatment group) or not receive a blood thinner (control group). The outcome variable of interest was whether the patient survived for at least 24 hours.

[^15-categorical-two-props-1]: B$\ddot{\text{o}}$ttiger et al. "Efficacy and safety of thrombolytic therapy after initially unsuccessful cardiopulmonary resuscitation: a prospective clinical trial." The Lancet, 2001.

::: workedexample
Form hypotheses for this study in plain and statistical language. Let $\pi_c$ represent the true survival rate of people who do not receive a blood thinner (corresponding to the control group) and $\pi_t$ represent the true survival rate for people receiving a blood thinner (corresponding to the treatment group).

------------------------------------------------------------------------

We want to understand whether blood thinners are helpful or harmful. We'll consider both of these possibilities using a two-sided hypothesis test.

-   $H_0$: Blood thinners do not have an overall survival effect, i.e., $\pi_t - \pi_c = 0$.

-   $H_A$: Blood thinners have an impact on survival, either positive or negative, but not zero, i.e., $\pi_t - \pi_c \neq 0$.

Note that if we had done a one-sided hypothesis test, the resulting hypotheses would have been:

-   $H_0$: Blood thinners do not have a positive overall survival effect, i.e., $\pi_t - \pi_c = 0$.

-   $H_A$: Blood thinners have a positive impact on survival, i.e., $\pi_t - \pi_c > 0$.
:::

There were 50 patients in the experiment who did not receive a blood thinner and 40 patients who did. The study results are shown in Table \@ref(tab:resultsForCPRStudyInSmallSampleSection).

```{r resultsForCPRStudyInSmallSampleSection}
temptbl <- tribble(
 ~variable,    ~col1, ~col2, ~col3, 
 "Survived", "14", "11", "25",
 "Died", "26", "39","65",
 "Total", "40", "50", "90"
)

temptbl %>%
 kable(caption = "Results for the CPR study. Patients in the treatment group were given a blood thinner, and patients in the control group were not.",
    col.names = c("", "Treatment", "Control", "Total")) %>%
 kable_styling() 
```

::: guidedpractice
What is the observed survival rate in the control group? And in the treatment group? Also, provide a point estimate of the difference in survival proportions of the two groups ($\hat{p}_t - \hat{p}_c$) and the relative "risk" of survival ($\hat{p}_t/\hat{p}_c$).[^15-categorical-two-props-2]
:::

[^15-categorical-two-props-2]: Observed control survival rate: $\hat{p}_c = \frac{11}{50} = 0.22$. Treatment survival rate: $\hat{p}_t = \frac{14}{40} = 0.35$. Observed difference: $\hat{p}_t - \hat{p}_c = 0.35 - 0.22 = 0.13$. Relative risk: $\hat{p}_t/\hat{p}_c = 0.35/0.22 = 1.59$

According to the point estimate, for patients who have undergone CPR outside of the hospital, an additional 13% of these patients survive when they are treated with blood thinners. Interpreting the relative risk, patients in this sample who had undergone CPR outside of the hospital had a 59% higher survival rate when they were treated with blood thinners. However, we wonder if this difference could be easily explainable by chance.

As we did in our past studies this chapter, we will simulate what type of differences we might see from chance alone under the null hypothesis. By randomly assigning "simulated treatment" and "simulated control" stickers to the patients' files, we get a new grouping. If we repeat this simulation 10,000 times, we can build a **null distribution**\index{null distribution} of the differences in sample proportions shown in Figure \@ref(fig:CPR-study-right-tail).

```{r CPR-study-right-tail, fig.cap="Null distribution of the point estimate for the difference in proportions, $\\hat{p}_t - \\hat{p}_c$. The shaded right tail shows observations that are at least as large as the observed difference, 0.13.", warning=FALSE, fig.width=10}

tr <- rep(1:2, c(50, 40))
su <- c(rep(c("s", "d"), c(11, 39)), rep(c("s", "d"), c(14, 26)))

N <- 10^5
d <- rep(NA, N)
for(i in 1:N){
	trf  <- sample(tr)
	p1   <- sum(su[trf == 1] == "s") / 50
	p2   <- sum(su[trf == 2] == "s") / 40
	d[i] <- p2 - p1
}
#sum((d) > 0.1299) / N

udiffs    <- sort(unique(d))
bin.width <- diff(udiffs)[1]
breaks    <- c(udiffs - bin.width / 4, udiffs + bin.width / 4)

hist(d, breaks = breaks, col=COL[7,3], main="", xlab="Differences under the null hypothesis", ylab="", axes=FALSE)
axis(1)
axis(2, at=(0:3)*N/20, labels=c(0, NA, NA, 3)/20)
hist(d[d > 0.1299], breaks = breaks, col=COL[1], add=TRUE)
abline(h=0)
lines(rep(0.13, 2), c(0, 3)*N/25, lty=3, lwd=1.7)
text(0.13, 3*N/25, "0.13", pos=3, cex=0.7)

```

The right tail area is 0.131.[^15-categorical-two-props-3] However, contrary to how we calculated the p-value in previous studies, the p-value of this test is not 0.131!

[^15-categorical-two-props-3]: Note: it is only a coincidence that we also have $\hat{p}_t - \hat{p}_c=0.13$!

The p-value is defined as the chance we observe a result *at least as favorable to the alternative hypothesis as the result* (i.e., the difference) we observe. In this case, any differences less than or equal to $-0.13$ would also provide equally strong evidence favoring the alternative hypothesis as a difference of $+0.13$ did. A difference of $-0.13$ would correspond to the survival rate in the *control group* being 0.13 higher than the treatment group.[^15-categorical-two-props-4] In Figure \@ref(fig:CPR-study-p-value) we've also shaded these differences in the left tail of the distribution. These two shaded tails provide a visual representation of the p-value for a two-sided test.

[^15-categorical-two-props-4]: Note that the relative risk in the opposite direction is not a *decrease* of 59%! When comparing control to treatment, the relative risk would be $0.22/0.35 = 0.63$, or a decrease of 37%. These values differ because the quantity we're comparing to (the "100%" quantity) differs.

```{=html}
<!--
%There is something different in this study than in the past studies: in this study, we are particularly interested in whether blood thinners increase *or* decrease the risk of death in patients who undergo CPR before arriving at the hospital.\footnote{Realistically, we probably are interested in either direction in the past studies as well, and so we should have used the approach we now discuss in this section. However, for simplicity and the sake of not introducing too many concepts at once, we skipped over these details in earlier sections.} For example, there are chance differences of $\hat{p}_t - \hat{p}_c = -0.14$, that would have been stronger evidence against the null hypothesis as our observed difference of +0.13. Likewise, anything less than or equal -0.13 would provide as much evidence against the null hypothesis as +0.13, and for this reason, we must count both tails towards the p-value, as shown in Figure \ref{CPR-study-p-value}.
-->
```
```{r CPR-study-p-value, fig.cap="Null distribution of the point estimate for the difference in proportions, $\\hat{p}_t - \\hat{p}_c$. All values that are at least as extreme as +0.13 but in either direction away from 0 are shaded.", warning=FALSE, fig.width=10}

hist(d, breaks = breaks, col=COL[7,3], main="", xlab="Differences under the null hypothesis", ylab="", axes=FALSE)
axis(1)
axis(2, at=(0:3)*N/20, labels=c(0, NA, NA, 3)/20)
hist(d[d > 0.1299], breaks = breaks, col=COL[1], add=TRUE)
abline(h=0)
hist(d[d < -0.1299], breaks = breaks, col=COL[1], add=TRUE)
abline(h=0)
lines(rep(0.13, 2), c(0, 3)*N/25, lty=3, lwd=1.7)
text(0.13, 3*N/25, "0.13", pos=3, cex=0.7)
```

For a two-sided test, since the null distribution is symmetric, take the single tail (in this case, 0.131) and double it to get the p-value: 0.262. With this large p-value, we do not find statistically significant evidence that the blood thinner has any influence on survival of patients who undergo CPR prior to arriving at the hospital.

```{=html}
<!--%Once again, we can discuss the causal conclusion since this is an experiment.
-->
```
\index{data!CPR and blood thinner|)}

### Case study: Opportunity cost {#caseStudyOpportunityCost}

How rational and consistent is the behavior of the typical American college student? In this section, we'll explore whether college student consumers always consider the following: money not spent now can be spent later.

In particular, we are interested in whether reminding students about this well-known fact about money causes them to be a little thriftier. A skeptic might think that such a reminder would have no impact. We can summarize the two different perspectives using the null and alternative hypothesis framework.

-   $H_0$: **Null hypothesis**. Reminding students that they can save money for later purchases will not have any impact on students' spending decisions.

-   $H_A$: **Alternative hypothesis**. Reminding students that they can save money for later purchases will reduce the chance they will continue with a purchase.

::: guidedpractice
How could you design a randomized experiment to test these two hypotheses?[^15-categorical-two-props-5]
:::

[^15-categorical-two-props-5]: With a sample of students, randomly assign half of them to the control condition, and the other half to the treatment condition. For those in the control condition, present them with a situation where an item is on sale and ask if they would like to buy the item. For those in the treatment condition, present them with the same situation, but also remind them that they can save money for later purchases, then ask if they would like to buy the item. Compute and compare the proportions who refrained from purchasing the item in each group.

In statistical notation, we can define parameters $\pi_{ctrl}$ = the probability a student under a control condition (not reminding them that they can save money for later purchases) refrains from making a purchase, and $\pi_{trmt}$ = the probability a student under a treatment condition (reminding them that they can save money for later purchases) refrains from makes a purchase. Our hypotheses are then

-   $H_0: \pi_{trmt} - \pi_{ctrl} = 0$

-   $H_A: \pi_{trmt} - \pi_{ctrl} > 0$

In this section, we'll explore an experiment conducted by researchers that investigates this very question for students at a university in the southwestern United States.[^15-categorical-two-props-6]

[^15-categorical-two-props-6]: Frederick S, Novemsky N, Wang J, Dhar R, Nowlis S. 2009. Opportunity Cost Neglect. Journal of Consumer Research 36: 553-561.

<!-- #### Observed data {.unnumbered} -->

```{=html}
<!--Shane Frederick of Yale School of Management and his collaborators conducted an experiment exploring the rational behavior of consumers. 

% Suppose when a person is about to spend money, we simply reminded them that they could spend the money on something else. Would it have any impact on the likelihood that they would continue with the purchase?
%What would you do in this situation? Please circle one of the options below.

-->
```
One-hundred and fifty students were recruited for the study, and each was given the following statement:

> Imagine that you have been saving some extra money on the side to make some purchases, and on your most recent visit to the video store you come across a special sale on a new video. This video is one with your favorite actor or actress, and your favorite type of movie (such as a comedy, drama, thriller, etc.). This particular video that you are considering is one you have been thinking about buying for a long time. It is available for a special sale price of \$14.99.

> What would you do in this situation? Please circle one of the options below.

Half of the 150 students were randomized into a control group and were given the following two options:

> (A) Buy this entertaining video.

> (B) Not buy this entertaining video.

The remaining 75 students were placed in the treatment group, and they saw a slightly modified option (B):

> (A) Buy this entertaining video.

> (B) Not buy this entertaining video. Keep the \$14.99 for other purchases.

Would the extra statement reminding students of an obvious fact impact the purchasing decision? Table \@ref(tab:OpportunityCostTable) summarizes the study results.

```{r OpportunityCostTable}
temptbl <- tribble(
 ~variable,    ~col1, ~col2, ~col3, 
 "buy DVD", "56", "41", "97",
 "not buy DVD", "19", "34","53",
 "Total", "75", "75", "150"
)

temptbl %>%
 kable(caption = "Summary of student choices in the opportunity cost study.",
    col.names = c("", "control group", "treatment group", "Total")) %>%
 kable_styling() #%>%
 #add_header_above(c(" ", "`decision`" = 2, " "))
```

```{=html}
<!--
%150 participants were asked whether they would buy a DVD under a particular circumstance. Participants in the control group were given two options, and participants in the treatment group were given the same options, except in the *not buy* option they were reminded that not spending the money meant the money could be used for a later purchase. This table summarizes the results from the study.}
-->
```
It might be a little easier to review the results using row proportions, specifically considering the proportion of participants in each group who said they would buy or not buy the DVD. These summaries are given in Table \@ref(tab:OpportunityCostTableRowProp), and a segmented bar plot is provided in Figure \@ref(fig:OpportunityCostBarplot).

```{r OpportunityCostTableRowProp}
temptbl <- tribble(
 ~variable,    ~col1, ~col2, ~col3, 
 "buy DVD", "0.747", "0.547", "0.647",
 "not buy DVD", "0.253", "0.453","0.353",
 "Total", "1.00", "1.00", "1.00"
)

temptbl %>%
 kable(caption = "The data above are now summarized using row proportions. Row proportions are particularly useful here since we can view the proportion of *buy* and *not buy* decisions in each group.",
    col.names = c("", "control group", "treatment group", "Total")) %>%
 kable_styling() #%>%
 #add_header_above(c( " ", "`decision`" = 2, " "))
```

```{r OpportunityCostBarplot, fig.cap="Segmented bar plot comparing the proportion who bought and did not buy the DVD between the control and treatment groups."}
temptbl <- tribble(
 ~variable,    ~col1, ~col2 ,
 "control", 0.747, "buy",
 "control", 0.253, "not buy",
 "treatment", 0.547, "buy",
 "treatment", 0.453, "not buy"
)
temptbl %>% ggplot(aes(x = variable, y=col1, fill=col2)) + geom_col() +
  xlab("Group") + ylab("Proportion") + 
  scale_fill_discrete(name="Decision")
```

We will define a **success**\index{success} in this study as a student who chooses not to buy the DVD.[^15-categorical-two-props-7] Then, the value of interest is the change in DVD purchase rates that results by reminding students that not spending money now means they can spend the money later.

[^15-categorical-two-props-7]: Success is often defined in a study as the outcome of interest, and a "success" may or may not actually be a positive outcome. For example, researchers working on a study on HIV prevalence might define a "success" in the statistical sense as a patient who is HIV+.

<!-- ::: {.underconstruction} -->

<!-- Check that we define success in previous chapter and then refer to it here. -->

<!-- ::: -->

```{r include=FALSE}
terms_chp_15 <- c(terms_chp_15, "success")
```

```{=html}
<!--
%A first look at the data suggests that reminding students that not spending money means they can spend the money later has an impact. 
-->
```
We can construct a point estimate for this difference as \begin{align*}
\hat{p}_{trmt} - \hat{p}_{ctrl}
 = \frac{34}{75} - \frac{19}{75}
 = 0.453 - 0.253
 = 0.200
\end{align*} The proportion of students who chose not to buy the DVD was 20% higher in the treatment group than the control group. However, is this result **statistically significant**\index{statistically significant}? In other words, is a 20% difference between the two groups so prominent that it is unlikely to have occurred from chance alone?

```{r include=FALSE}
terms_chp_15 <- c(terms_chp_15, "statistically significant")
```

<!-- #### Variability of the statistic {.unnumbered} -->

The primary goal in this data analysis is to understand what sort of differences we might see if the null hypothesis were true, i.e., the treatment had no effect on students. For this, we'll use the same procedure we applied in Section \@ref(caseStudySexDiscrimination): randomization.

Let's think about the data in the context of the hypotheses. If the null hypothesis ($H_0$) was true and the treatment had no impact on student decisions, then the observed difference between the two groups of 20% could be attributed entirely to chance. If, on the other hand, the alternative hypothesis ($H_A$) is true, then the difference indicates that reminding students about saving for later purchases actually impacts their buying decisions.

<!-- #### Observed statistic vs. null value {.unnumbered} -->

Just like with the gender discrimination study, we can perform a statistical analysis. Using the same randomization technique from the last section, let's see what happens when we simulate the experiment under the scenario where there is no effect from the treatment.

While we would in reality do this simulation on a computer, it might be useful to think about how we would go about carrying out the simulation without a computer. We start with 150 index cards and label each card to indicate the distribution of our response variable: `decision`. That is, 53 cards will be labeled "not buy DVD" to represent the 53 students who opted not to buy, and 97 will be labeled "buy DVD" for the other 97 students. Then we shuffle these cards thoroughly and divide them into two stacks of size 75, representing the simulated treatment and control groups. Any observed difference between the proportions of "not buy DVD" cards (what we earlier defined as *success*) can be attributed entirely to chance.

::: workedexample
If we are randomly assigning the cards into the simulated treatment and control groups, how many "not buy DVD" cards would we expect to end up with in each simulated group? What would be the expected difference between the proportions of "not buy DVD" cards in each group?

------------------------------------------------------------------------

Since the simulated groups are of equal size, we would expect $53 / 2 = 26.5$, i.e., 26 or 27, "not buy DVD" cards in each simulated group, yielding a simulated point estimate of 0% . However, due to random fluctuations, we might actually observe a number a little above or below 26 and 27.
:::

```{=html}
<!--
%We'll take the students and randomize them into two new groups, simulated-control and simulated-treatment groups, and then we'll look at the difference in the two groups. 
-->
```
The results of a single randomization from chance alone is shown in Table \@ref(tab:OpportunityCostTableSimulated). From this table, we can compute a difference that occurred from chance alone: \begin{align*}
\hat{p}_{trmt, simulated} - \hat{p}_{ctrl, simulated}
 = \frac{24}{75} - \frac{29}{75}
 = 0.32 - 0.387
 = - 0.067
\end{align*} <!--
%This difference of -6.7% is entirely due to chance.
-->

```{r OpportunityCostTableSimulated}
temptbl <- tribble(
 ~variable,    ~col1, ~col2, ~col3, 
 "buy DVD", "46", "51", "97",
 "not buy DVD", "29", "24","53",
 "Total", "75", "75", "150"
)

temptbl %>%
 kable(caption = "Summary of student choices against their simulated groups. The group assignment had no connection to the student decisions, so any difference between the two groups is due to chance.",
    col.names = c( "", "control group", "treatment group", "Total")) %>%
 kable_styling() #%>%
 #add_header_above(c( " ", "`decision`" = 2, " "))
```

Just one simulation will not be enough to get a sense of what sorts of differences would happen from chance alone. We'll simulate another set of simulated groups and compute the new difference: 0.013. And again: 0.067. And again: -0.173. We'll do this 1,000 times. The results are summarized in a dot plot in Figure \@ref(fig:OpportunityCostDiffsDotPlot), where each point represents a simulation. Since there are so many points, it is more convenient to summarize the results in a histogram such as the one in Figure \@ref(fig:OpportunityCostDiffs), where the height of each histogram bar represents the fraction of observations in that group.

```{r OpportunityCostDiffsDotPlot, fig.cap="A stacked dot plot of 1,000 chance differences produced under the null hypothesis, $H_0$. Six of the 1,000 simulations had a difference of at least 20% , which was the difference observed in the study.", warning=FALSE, fig.width=10}

set.seed(41)
DotPlotRegularized <- function(x,
                         width = 5,
                               col = "#569BBD",
                               cex = 1, ...) {
  x    <- sort(x)
  xu   <- unique(x)
  xlim <- range(x)
  X    <- table(x)
  height <- max(X) / width
  plot(0, 0, xlim = xlim, ylim = c(0, height),
       type = "n", axes = FALSE, ...)
  axis(1)
  
  spread <- min(diff(xu)) / width / 2
  for (i in seq_along(X)) {
    val     <- xu[i]
    num     <- X[i]
    num.rep <- round(num / width) + 1
    pos     <- rep(1:width, num.rep)
    rel.pos <- (pos[1:num] - width / 2) * spread
    ver.pos <- rep(1:num.rep, rep(width, num.rep))
    ver.pos <- ver.pos[1:num]
    points(val + rel.pos, ver.pos, pch = ".",
           col = col, cex = cex)
  }
  abline(h = 0)
}

group   <- c(rep('control', 75), rep('treatment', 75))
outcome <- c(rep(c('buy DVD', 'do not buy DVD'), c(56, 19)),
             rep(c('buy DVD', 'do not buy DVD'), c(41, 34)))

nsim  = 1000
n     = length(group)
group = group
var1  = outcome
success = "do not buy DVD"
sim   = matrix(NA, nrow = n, ncol = nsim)
n1    = n2 = 75

Statistic <- function(var1, group){	
  t1 <- var1 == success & group == levels(as.factor(group))[1]
  t2 <- var1 == success & group == levels(as.factor(group))[2]
  sum(t1)/n1 - sum(t2)/n2 
}

for(i in 1:nsim){
  sim[,i] = sample(group, n, replace = FALSE)
}

sim_dist = apply(sim, 2, Statistic, var1 = outcome)
diffs    = sim_dist
pval     = sum(diffs >= 0.2) / nsim
values  <- table(sim_dist)

xlab <- paste0("Simulated difference in proportions ",
               "of students who do not buy the DVD")

DotPlotRegularized(diffs, xlab = xlab, ylab = "", cex = 2)
text(0.2, 5, "observed\ndifference", pos = 3)
lines(rep(0.2, 2), c(0, 5), lty = 2, col = COL[6])
```

```{r OpportunityCostDiffs, fig.cap="A histogram of 1,000 chance differences produced under the null hypothesis, $H_0$. Histograms like this one are a more convenient representation of data or results when there are a large number of observations.", warning=FALSE, fig.width=10}

X <- c()
Y <- c()
udiffs <- sort(unique(diffs))
for(i in 1:length(udiffs)){
 x  <- udiffs[i]
 rec <- sum(sim_dist == x)
 X  <- append(X, rep(x, rec))
 Y  <- append(Y, 1:rec)
}

bin.width <- diff(udiffs)[1]
breaks <- c(udiffs - bin.width / 4, udiffs + bin.width / 4)

histPlot(diffs, breaks = breaks, axes = FALSE, col = rgb(1,1,1),
     xlab = "", ylab = "Proportion of simulated scenarios        ")
diffs.tmp <- diffs[diffs > 0.19]
histPlot(diffs.tmp, breaks = breaks, col = COL[1], add = TRUE)
axis(1)
axis(2, at = seq(0, 100, 50), labels = format(seq(0, 50, 25) / nsim))
lines(c(0.2, 0.2), c(0, 30), lty = 3, lwd = 3)
text(0.2, 30, "observed\ndifference", pos = 3)
par(las = 0)
mtext(xlab, 1, 2.5)

```

If there was no treatment effect, then we'd only observe a difference of at least +20% about 0.6% of the time, or about 1-in-150 times. That is really rare! Instead, we will conclude the data provide strong evidence there is a treatment effect: reminding students before a purchase that they could instead spend the money later on something else lowers the chance that they will continue with the purchase. Notice that we are able to make a causal statement for this study since the study is an experiment.

<!-- #### Scope of inference {.unnumbered} -->

Since the study was a randomized experiment, we can conclude that the effect was due to the reminder about saving money for other purchases---the reminder *caused* the lower rate of purchase. However, since this study used a volunteer sample (students were "recruited"), we can only generalize this result to individuals similar to those in the study. Thus, we have evidence that reminding students that they can save money for later purchases will reduce the chance they will continue with a purchase, but only among students are similar to those in the study.

### Case study: Malaria vaccine {#caseStudyMalaria}

<!-- #### Observed data {.unnumbered} -->

We consider a study on a new malaria vaccine called PfSPZ. In this study, volunteer patients were randomized into one of two experiment groups: 14 patients received an experimental vaccine and 6 patients received a placebo vaccine. Nineteen weeks later, all 20 patients were exposed to a drug-sensitive malaria virus strain; the motivation of using a drug-sensitive strain of virus here is for ethical considerations, allowing any infections to be treated effectively. The results are summarized in Table \@ref(tab:malaria-vaccine-20-exp-summary), where 9 of the 14 treatment patients remained free of signs of infection while all of the 6 patients in the control group patients showed some baseline signs of infection.

```{r malaria-vaccine-20-exp-summary}
temptbl <- tribble(
 ~variable,    ~col1, ~col2, ~col3, ~col4, 
 "", "infection", "5", "6", "11",
 "`no infection`", "placebo", "9", "0","9",
 "", "Total", "14", "6", "20"
)

temptbl %>%
 kable(caption = "Summary results for the malaria vaccine experiment.",
    col.names = c("", "", "vaccine", "placebo", "Total")) %>%
 kable_styling() %>%
 add_header_above(c(" ", " ", "`treatment`" = 2, " "))
```

::: guidedpractice
Is this an observational study or an experiment? What implications does the study type have on what can be inferred from the results?[^15-categorical-two-props-8]
:::

[^15-categorical-two-props-8]: The study is an experiment, as patients were randomly assigned an experiment group. Since this is an experiment, the results can be used to evaluate a causal relationship between the malaria vaccine and whether patients showed signs of an infection.

In this study, a smaller proportion of patients who received the vaccine showed signs of an infection (35.7% versus 100%). However, the sample is very small, and it is unclear whether the difference provides *convincing evidence* that the vaccine is effective. To determine this, we need to perform statistical inference.

Instead of using the *difference in proportion* infected as our summary measure, let's use the *relative risk* of infection for this case study. Thus, the parameter of interest is $\pi_{Vac} / \pi_{Pla}$, and our point estimate of this parameter is

$$
\frac{\hat{p}_{Vac}}{\hat{p}_{Pla}} = \frac{5/14}{6/6} = 0.357.
$$

Converting this to a percent decrease[^15-categorical-two-props-9], we see that the patients in the vaccine group had a 64.3% reduced risk of infection compared to the placebo group.[^15-categorical-two-props-10]

[^15-categorical-two-props-9]: $(0.357 - 1)\times 100$% = -64.3%

[^15-categorical-two-props-10]: With small sample sizes, researchers often add 0.5 to each of the four cells prior to calculating the sample relative risk in order to avoid dividing by zero. With this adjustment, the sample relative risk is: $\frac{5.5/15}{6.5/7} = 0.395$. We will use this adjustment when simulating relative risks as well.

In terms of relative risk, our null and alternative hypotheses are

**Independence model** $H_0: \dfrac{\pi_{Vac}}{\pi_{Pla}} = 1$ <br>

**Alternative model** $H_A: \dfrac{\pi_{Vac}}{\pi_{Pla}} < 1$

::: protip
Whether we write our hypotheses in terms of a difference in proportions or a ratio of proportions (relative risk), the hypotheses still have the same interpretation. For example, the three null hypotheses $H_0: \pi_{Vac} = \pi_{Pla}$, $H_0: \pi_{Vac} - \pi_{Pla} = 0$, and $H_0: \pi_{Vac}/\pi_{Pla} = 1$, are all algebraically equivalent.
:::

What would it mean if the independence model, which says the vaccine had no influence on the rate of infection, is true? It would mean 11 patients were going to develop an infection *no matter which group they were randomized into*, and 9 patients would not develop an infection *no matter which group they were randomized into*. That is, if the vaccine did not affect the rate of infection, the difference in the infection rates was due to chance alone in how the patients were randomized.

Now consider the alternative model: infection rates were influenced by whether a patient received the vaccine or not. If this was true, and especially if this influence was substantial, we would expect to see some difference in the infection rates of patients in the groups.

We choose between these two competing claims by assessing if the data conflict so much with $H_0$ that the independence model cannot be deemed reasonable. If this is the case, and the data support $H_A$, then we will reject the notion of independence and conclude the vaccine is effective.

<!-- #### Variability of the statistic {.unnumbered} -->

We're going to implement simulation, where we will pretend we know that the malaria vaccine being tested does \emph{not} work. Ultimately, we want to understand if the large difference we observed is common in these simulations. If it is common, then maybe the difference we observed was purely due to chance. If it is very uncommon, then the possibility that the vaccine was helpful seems more plausible.

We can again randomize the responses (`infection` or `no infection`) to the treatment conditions under the null hypothesis of independence, but this time, we'll compute sample relative risks with each simulated sample.

::: guidedpractice
How could you use cards to re-randomize one sample into groups? Remember, in this hypothetical world, we believe each patient that got an infection was going to get it regardless of which group they were in, and we would like to see what happens if we randomly assign these patients to the treatment and control groups again.[^15-categorical-two-props-11]
:::

[^15-categorical-two-props-11]: 1\. Take 20 notecards to represent the 20 patients, where we write down "infection" on 11 cards and "no infection" on 9 cards. 2. Thoroughly shuffle the notecards and deal 14 into a "vaccine" pile and 6 into a "placebo" pile. 3. Compute the proportion of "infection" cards in the "vaccine" pile and divide it by the proportion of "infection" cards in the "placebo" pile to get the simulated sample relative risk.

Figure \@ref(fig:malaria-rand-dot-plot) shows a histogram of the relative risks found from 1,000 randomization simulations, where each dot represents a simulated relative risk of infection (treatment rate divided by control rate).

```{r malaria-rand-dot-plot, fig.cap="A histogram of relative risks of infection from 1,000 simulations produced under the independence model $H_0$, where in these simulations infections are unaffected by the vaccine. Seventeen of the 1,000 simulations (shaded in red) had a relative risk of at most 0.357, the relative risk observed in the study.", warning=FALSE, fig.width=10}

set.seed(48)

exp_gp  <- rep(c("vaccine", "placebo"), c(14, 6))
outcome <- c(rep(c('infection', 'no infection'), c(5, 9)),
             rep(c('infection', 'no infection'), c(6, 0)))

nsim    <- 1000
n       <- length(exp_gp)
success <- "infection"

SimulateTable <- function(exp_gp, outcome, ...) {
  table(sample(exp_gp), outcome)
}
# SimulateTable(exp_gp, outcome)

sim_tables <-
    lapply(1:nsim, SimulateTable,
        exp_gp = exp_gp,
        outcome = outcome)
result <- sim_tables %>%
    lapply(function(x) {
    	  ((x[2, 1]+0.5) / sum(x[2, ],1))/((x[1, 1]+0.5) / sum(x[1, ],1))
    	}) %>%
    	unlist()


pval <- sum(result <= (5.5/15)/(6.5/7)) / nsim
values <- table(result)
diffs <- unique(result)

X <- c()
Y <- c()
for (i in 1:length(diffs)) {
  x   <- diffs[i]
  rec <- sum(result == x)
  X   <- append(X, rep(x, rec))
  Y   <- append(Y, 1:rec)
}

plot(X, Y,
     xlim = range(diffs) + c(-1, 1) * sd(diffs) / 4,
     xlab = "Relative risk of infection (vaccine/placebo)",
     ylab = "",
     axes = FALSE,
     ylim = c(0, max(Y)),
     col = COL[1],
     pch = 20)
# at <- seq(-0.4, 0.4, 0.1)
# labels <- c(-0.4, "", -0.2, "", 0, "", 0.2, "", 0.4)
axis(1) #, at, labels)
points(X[X <= (5.5/15)/(6.5/7)], Y[X<= (5.5/15)/(6.5/7)], lwd=3, col = COL[4], cex=0.4)
```

<!-- #### Observed statistic vs null value {.unnumbered} -->

Note that the distribution of these simulated differences is centered around 1. We simulated the relative risks assuming that the independence model was true, and under this condition, we expect the difference to be near one with some random fluctuation, where *near* is pretty generous in this case since the sample sizes are so small in this study.

::: workedexample
How often would you observe a sample relative risk of at most 0.357 (at least a 64.3% reduction in risk on vaccine) according to Figure \@ref(fig:malaria-rand-dot-plot)? Often, sometimes, rarely, or never?

------------------------------------------------------------------------

It appears that a 64.3% reduction in risk due to chance alone would only happen about 2% of the time according to Figure \@ref(fig:malaria-rand-dot-plot). Such a low probability indicates a rare event.
:::

Based on the simulations, we have two options:

1.  We conclude that the study results do not provide strong evidence against the independence model. That is, we do not have sufficiently strong evidence to conclude the vaccine had an effect in this clinical setting.

2.  We conclude the evidence is sufficiently strong to reject $H_0$ and assert that the vaccine was useful. When we conduct formal studies, usually we reject the notion that we just happened to observe a rare event.[^15-categorical-two-props-12]

[^15-categorical-two-props-12]: This reasoning does not generally extend to anecdotal observations. Each of us observes incredibly rare events every day, events we could not possibly hope to predict. However, in the non-rigorous setting of anecdotal evidence, almost anything may appear to be a rare event, so the idea of looking for rare events in day-to-day activities is treacherous. For example, we might look at the lottery: there was only a 1 in 292 million chance that the Powerball numbers for the largest jackpot in history (January 13th, 2016) would be (04, 08, 19, 27, 34) with a Powerball of (10), but nonetheless those numbers came up! However, no matter what numbers had turned up, they would have had the same incredibly rare odds. That is, *any set of numbers we could have observed would ultimately be incredibly rare*. This type of situation is typical of our daily lives: each possible event in itself seems incredibly rare, but if we consider every alternative, those outcomes are also incredibly rare. We should be cautious not to misinterpret such anecdotal evidence.

In this case, we reject the independence model in favor of the alternative. That is, we are concluding the data provide strong evidence that the vaccine provides some protection against malaria in this clinical setting.

\index{data!malaria vaccine|)}

Statistical inference is built on evaluating whether such differences are due to chance. In statistical inference, data scientists evaluate which model is most reasonable given the data. Errors do occur, just like rare events, and we might choose the wrong model. While we do not always choose correctly, statistical inference gives us tools to control and evaluate how often these errors occur.

## Bootstrap confidence interval for $\pi_1 - \pi_2$ {#two-prop-boot-ci}

In Section \@ref(two-prop-errors), we worked with the randomization distribution to understand the distribution of $\hat{p}_1 - \hat{p}_2$ when the null hypothesis $H_0: \pi_1 - \pi_2 = 0$ is true. Now, through bootstrapping, we study the variability of $\hat{p}_1 - \hat{p}_2$ without the null assumption.

### Observed data

Reconsider the CPR data from Section \@ref(two-prop-errors) which is provided in Table \@ref(tab:resultsForCPRStudyInSmallSampleSection). The experiment consisted of two treatments on patients who underwent CPR for a heart attack and were subsequently admitted to a hospital. Each patient was randomly assigned to either receive a blood thinner (treatment group) or not receive a blood thinner (control group). The outcome variable of interest was whether the patient survived for at least 24 hours.

Again, we use the difference in sample proportions as the observed statistic of interest. Here, the value of the statistic is: $\hat{p}_t - \hat{p}_c = 0.35 - 0.22 = 0.13$.

### Variability of the statistic

The bootstrap method applied to two samples is an extension of the method described in Section \@ref(boot-ci-prop). Now, we have two samples, so each sample estimates the population from which they came. In the CPR setting, the `treatment` sample estimates the population of all individuals who have gotten (or will get) the treatment; the `control` sample estimate the population of all individuals who do not get the treatment and are controls. Figure \@ref(fig:boot2proppops) extends Figure \@ref(fig:boot1) to show the bootstrapping process from two samples simultaneously.

```{r boot2proppops, fig.cap = "Creating two estimated populations from two different samples from different populations.", warning = FALSE, out.width = "100%"}
include_graphics("05/figures/boot2proppops.png")
```

As before, once the population is estimated, we can randomly resample observations to create bootstrap samples, as seen in Figure \@ref(fig:boot2propresamps). Computationally, each bootstrap resample is created by randomly sampling with replacement from the original sample.

```{r boot2propresamps, fig.cap = "Bootstrapped resamples from two separate estimated populations.", warning = FALSE, out.width = "100%"}
include_graphics("05/figures/boot2propresamps.png")
```

The variability of the statistic (the difference in sample proportions) can be calculated by taking one treatment bootstrap sample and one control bootstrap sample and calculating the difference of the bootstrap survival proportions. Figure \@ref(fig:boot2samp2) displays one bootstrap resample from each of the estimated populations, with the difference in sample proportions calculated between the treatment bootstrap sample and the control bootstrap sample.

```{r boot2samp2, fig.cap = "The bootstrap resample on the left is from the first estimated population; the one on the right from the second. In this case, the value of the simulated bootstrap statistic would be $\\hat{p}_1 - \\hat{p}_2 = \\frac{2}{7}-\\frac{1}{7}$.", warning = FALSE, out.width="75%"}
include_graphics("05/figures/boot2prop3.png")
```

As always, the variability of the difference in proportions can only be estimated by repeated simulations, in this case, repeated bootstrap samples. Figure \@ref(fig:boot2samp3) shows multiple bootstrap differences calculated for each of the repeated bootstrap samples.

```{r boot2samp3, fig.cap = "For each pair of bootstrap samples, we calculate the difference in sample proportions.", warning = FALSE,  out.width="75%"}
include_graphics("05/figures/boot2prop2.png")
```

Repeated bootstrap simulations lead to a bootstrap sampling distribution of the statistic of interest, here the difference in sample proportions. Figure \@ref(fig:boot2samp1) visualizes the process in the toy example, and Figure \@ref(fig:bootCPR1000) shows 1000 bootstrap differences in proportions for the CPR data. Note that the CPR data includes 40 and 50 people in the respective groups, and the toy example includes 7 and 9 people in the two groups. Accordingly, the variability in the distribution of sample proportions is higher for the toy example. It turns out that the standard error for the sample difference in proportions is inversely related to the sample sizes.

```{r boot2samp1, fig.cap = "The process of repeatedly resampling from the estimated population (sampling with replacement from the original sample), computing a difference in sample proportions from each pair of samples, then plotting this distribution.", warning = FALSE,  out.width="100%"}
include_graphics("05/figures/boot2prop1.png")
```

```{r bootCPR1000, fig.cap="A histogram of differences in proportions (treatment $-$ control) from 1000 bootstrap simulations using the CPR data.", warning=FALSE, fig.width=10}

set.seed(47)
exp_gp  <- rep(c("blood_thinner", "control"), c(40, 50))
outcome <- c(rep(c('survived', 'died'), c(14, 26)),
             rep(c('survived', 'died'), c(11, 39)))

nsim    <- 1000
gp1     <- "blood_thinner"
n1      <- length(exp_gp[exp_gp == gp1])
gp2     <- "control"
n2      <- length(exp_gp[exp_gp == gp2])

SimulateTable <- function(exp_gp, outcome, ...){
   bsobs1 <- sample(1:n1, replace = TRUE)
   bsobs2 <- sample((n1+1):(n1+n2), replace = TRUE)
   table(exp_gp[c(bsobs1,bsobs2)], outcome[c(bsobs1,bsobs2)])
}

sim_tables <-
    lapply(1:nsim, SimulateTable,
        exp_gp = exp_gp,
        outcome = outcome)
result <- sim_tables %>%
    lapply(function(x) {
    	  x[1, 2] / sum(x[1, ]) - x[2, 2] / sum(x[2, ])
    	}) %>%
    	unlist()

values <- table(result)
diffs <- unique(result)

X <- c()
Y <- c()
for (i in 1:length(diffs)) {
  x   <- diffs[i]
  rec <- sum(result == x)
  X   <- append(X, rep(x, rec))
  Y   <- append(Y, 1:rec)
}

breaks <- seq(min(result)-.05, max(result)+ .05, by = (max(result) - min(result) )/ 20)

histPlot(X, Y, breaks = breaks,
     xlim = c(min(result)-.05, max(result)+.05),
     xlab = "Bootstrapped Difference in Proportion Surviving (Treatment - Control)",
     ylab = "",
     axes = FALSE,
     col = COL[1],
     pch = 20)
axis(1)

cpr.90ci <- as.vector(quantile(result, c(0.05,0.95)))
```

Figure \@ref(fig:bootCPR1000) provides an estimate for the variability of the difference in survival proportions from sample to sample. As in Section \@ref(boot-ci-prop), the bootstrap confidence interval can be calculated directly from the bootstrapped differences in Figure \@ref(fig:bootCPR1000) by finding the percentiles of the distribution that correspond to the confidence level. For example, here we calculate the 90% confidence interval by finding the 5^th^ and 95^th^ percentile values from the bootstrapped differences. The bootstrap 5^th^ percentile proportion is `r cpr.90ci[1]` and the 95^th^ percentile is `r cpr.90ci[2]`. The result is: we are 90% confident that, in the population, the true difference in probability of survival (treatment $-$ control) is between `r cpr.90ci[1]` and `r cpr.90ci[2]`. More clearly, we are 90% confident that the probability of survival for heart attack patients who underwent CPR on blood thinners is between `r abs(cpr.90ci[1])` less to `r cpr.90ci[2]` more than that for patients who were not given blood thinners. The interval shows that we do not have much definitive evidence of the affect of blood thinners, one way or another.

```{r bootCPR1000CI, fig.cap="The CPR data is bootstrapped 1000 times. Each simulation creates a sample from the original data where the proportion who survived in the treatment group is $\\hat{p}_{t}  = 14/40$ and the proportion who survived in the control group is $\\hat{p}_{c} = 11/50$. ", warning=FALSE, fig.width=10}

#result from previous R chunk
bsq <- quantile(result, probs = c(0.01, 0.025, 0.05, 0.1, 0.9, 0.95, 0.975, 0.99))

bsprops_up <- result[result >= bsq[6]]
bsprops_low <- result[result <= bsq[3]]

#uprops <- sort(unique(result))
#bin.width <- diff(uprops)[1]*100
#breaks <- c(uprops - bin.width / 4, uprops + bin.width / 4)

breaks <- seq(min(result)-.05, max(result)+ .05, by = (max(result) - min(result) )/ 20)

histPlot(result, breaks = breaks, 
         axes = FALSE, col = rgb(1,1,1),
     xlab = "", ylab="")
histPlot(bsprops_up, breaks = breaks,
         col = COL[1], add = TRUE)
histPlot(bsprops_low, breaks = breaks, 
         col = COL[1], add = TRUE)
axis(1)
#axis(2, at = seq(0, 100, 50), labels = format(seq(0, 50, 25) / nsim))
lines(c(bsq[6], bsq[6]), c(0, 80), lty = 3, lwd = 3)
lines(c(bsq[3], bsq[3]), c(0, 80), lty = 3, lwd = 3)
text(bsq[6], 80, "95 percentile", pos = 3)
text(bsq[3], 80, "5 percentile", pos = 3)
par(las = 0)
mtext("Bootstrapped Difference in Proportion Surviving (Treatment - Control)", 1, 2.5)

```

<!-- ##### SE bootstrap interval {-} -->

<!-- Alternatively, we can use the variability in the bootstrapped differences to calculate a standard error of the difference. -->

<!-- The resulting interval is called the **SE interval**. -->

<!-- Section @ref(math-2prop) details the mathematical model for the standard error of the difference in sample proportions, but the bootstrap distribution typically does an excellent job of estimating the variability. -->

<!-- ```{r include=FALSE} -->

<!-- terms_chp_15 <- c(terms_chp_15, "SE interval") -->

<!-- ``` -->

<!-- $$SE(\hat{p}_t - \hat{p}_c) \approx SD(\hat{p}_{bs,t} - \hat{p}_{bs,c}) = 0.0975$$ -->

<!-- The variability of the bootstrapped difference in proportions was calculated in R using the `sd()` function, but any statistical software will calculate the standard deviation of the differences, here, the exact quantity we hope to approximate. -->

<!-- Note that we do not know know the true distribution of $\hat{p}_t - \hat{p}_c$, so we will use a rough approximation to find a confidence interval for $\pi_t - \pi_c$.  As seen in the bootstrap histograms, the shape of the distribution is roughly symmetric and bell-shaped.  So for a rough approximation, we will apply the 68-95-99.7 rule which tells us that 95% of observed differences should be roughly no farther than 2 SE from the true parameter difference.  An approximate 95% confidence interval for $\pi_t - \pi_c$ is given by: -->

<!-- \begin{align*} -->

<!-- \hat{p}_t - \hat{p}_c \pm 2 \cdot SE \ \ \ \rightarrow \ \ \ 14/40 - 11/50 \pm 2 \cdot 0.0975 \ \ \  \rightarrow \ \ \  (-0.065, 0.325) -->

<!-- \end{align*} -->

<!-- We are 95% confident that the true value of $\pi_t - \pi_c$ is between -0.065 and 0.325.  Again, the wide confidence interval that overlaps zero indicates that the study provides very little evidence about the effectiveness of blood thinners. -->

<!-- ```{block2, type="important", echo=TRUE} -->

<!-- Since the multiplier "2" in the SE bootstrap interval comes from the 68-95-99.7 rule for normal distributions, these intervals are only valid when the bootstrap sampling distribution is approximately normal. -->

<!-- ``` -->

### What does 95% mean?

Recall that the goal of a confidence interval is to find a plausible range of values for a *parameter* of interest. The estimated statistic is not the value of interest, but it is typically the best guess for the unknown parameter. The confidence level (often 95%) is a number that takes a while to get used to. Surprisingly, the percentage doesn't describe the data set at hand, it describes many possible data sets. One way to understand a confidence interval is to think about all the confidence intervals that you have ever made or that you will ever make as a scientist, the confidence level describes **those** intervals.

Figure \@ref(fig:ci25ints) demonstrates a hypothetical situation in which 25 different studies are performed on the exact same population (with the same goal of estimating the true parameter value of $\pi_1 - \pi_2 = 0.47$). The study at hand represents one point estimate (a dot) and a corresponding interval. It is not possible to know whether the interval at hand is to the right of the unknown true parameter value (the black line) or to the left of that line. It is also impossible to know whether the interval captures the true parameter (is blue) or doesn't (is red). If we are making 95% intervals, then 5% of the intervals we create over our lifetime will *not* capture the parameter of interest (e.g., will be red as in Figure \@ref(fig:ci25ints) ). What we know is that over our lifetimes as scientists, 95% of the intervals created and reported on will capture the parameter value of interest: thus the language "95% confident."

```{r ci25ints, fig.cap = "One hypothetical population, parameter value of: $\\pi_1 - \\pi_2 = 0.47$.  Twenty-five different studies all which led to a different point estimate, SE, and confidence interval.  The study at hand is one of the horizontal lines (hopefully a blue line!).", warning = FALSE, fig.width = 10}
data(run09)
set.seed(52)
m <- 103.4594
s <- 19.31445
n <- 100
k <- 25
SE <- s/sqrt(n)

set.seed(47)
means <- c()
SE    <- c()
for(i in 1:k){
  temp <- sample(nrow(run09), n)
  d    <- run09$time[temp]
  means[i] <- mean(d, na.rm = TRUE)
  SE[i]    <- sd(d)/sqrt(n)
}
xR <- m + 4 * c(-1, 1) * s / sqrt(n)
yR <- c(0.7, 25.3)
plot(xR, yR,
     type = 'n',
     xlab = '',
     ylab = '',
     axes = FALSE)
abline(v = m, lty = 2, col = COL[5,2])
axis(1, at = m, expression(paste(pi[1]," - ",pi[2]," = 0.47")))
for(i in 1:k){
  ci <- means[i] + 2 * c(-1, 1) * SE[i]
  if(abs(means[i] - m) > 1.96 * SE[i]){
    col <- COL[4]
    points(means[i], i, cex = 1.4, col = col)
    lines(ci, rep(i, 2), col = col, lwd = 4)
  } else {
    col <- COL[1]
  }
  points(means[i], i, pch = 20, cex = 1.2, col = col)
  lines(ci, rep(i, 2), col = col)
}


```

The choice of 95% or 90% or even 99% as a confidence level is admittedly somewhat arbitrary; however, it is related to the logic we used when deciding that a p-value should be declared as significant if it is lower than 0.05 (or 0.10 or 0.01, respectively). Indeed, one can show mathematically, that a 95% confidence interval and a two-sided hypothesis test at a cutoff of 0.05 will provide the same conclusion when the same data and mathematical tools are applied for the analysis. A full derivation of the explicit connection between confidence intervals and hypothesis tests is beyond the scope of this text.

## Theory-based inferential methods for $\pi_1 - \pi_2$ {#math-2prop}

Like with $\hat{p}$, the difference of two sample proportions $\hat{p}_1 - \hat{p}_2$ can be modeled using a normal distribution when certain conditions are met.

### Evaluating the two conditions required for modeling $\pi_1 - \pi_2$ using theory-based methods

First, we require a broader independence condition, and secondly, the success-failure condition must be met by both groups.

::: onebox
**Conditions for the sampling distribution of** $\hat{p}_1 -\hat{p}_2$ to be normal.

The difference $\hat{p}_1 - \hat{p}_2$ can be modeled using a normal distribution when

1.  **Independence** (extended). The data are independent within and between the two groups. Generally this is satisfied if the data come from two independent random samples or if the data come from a randomized experiment.
2.  **Success-failure condition.** The success-failure condition holds for both groups, where we check successes and failures in each group separately. This condition is met if we have at least 10 successes and 10 failures in each sample. If data are displayed in a two-way table, this is equivalent to checking that all cells in the table have at least 10 observations.

When these conditions are satisfied, then the sampling distribution of $\hat{p}_1 - \hat{p}_2$ is approximately normal with mean $\pi_1 - \pi_2$ and standard deviation

\begin{eqnarray*}
  SD(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\pi_1(1-\pi_1)}{n_1} + \frac{\pi_2(1-\pi_2)}{n_2}}
  \end{eqnarray*} where $\pi_1$ and $\pi_2$ represent the population proportions, and $n_1$ and $n_2$ represent the sample sizes.

Note that in most cases, values for $\pi_1$ and $\pi_2$ are unknown, so the standard deviation is approximated using the observed data. Recall that the estimated standard deviation of a statistic is called its **standard error**.

$$SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$

where $\hat{p}_1$ and $\hat{p}_2$ represent the observed sample proportions, and $n_1$ and $n_2$ represent the sample sizes.
:::

\index{standard error (SE)!difference in proportions}

```{=html}
<!--
SE reference above?
    \label{seForDiffOfProp}
-->
```
::: protip
The success-failure condition listed above is only necessary for the sampling distribution of $\hat{p}_1 - \hat{p}_2$ to be approximately normal. The mean of the sampling distribution of $\hat{p}_1 - \hat{p}_2$ is $\pi_1 - \pi_2$, and the standard deviation is $\sqrt{\frac{\pi_1(1-\pi_1)}{n_1}+\frac{\pi_2(1-\pi_2)}{n_2}}$, regardless of the two sample sizes.
:::

```{r include=FALSE}
terms_chp_15 <- c(terms_chp_15, "standard error for difference in proportions")
```

As in the case of one proportion, we typically don't know the true proportions $\pi_1$ and $\pi_2$, so we will substitute some value to check the success-failure condition and to estimate the standard deviation of the sampling distribution of $\hat{p}_1 - \hat{p}_2$.

### Confidence interval for $\pi_1 - \pi_2$

\index{data!CPR and blood thinner|(}

::: onebox
**Standard error of the difference in two proportions,** $\hat{p}_1 -\hat{p}_2$:

When computing a theory-based confidence interval for $\pi_1 - \pi_2$, we substitute $\hat{p}_1$ for $\pi_1$ and $\hat{p}_2$ for $\pi_2$ in the expression for the standard deviation of the statistic, resulting in its standard error:

```{=tex}
\begin{eqnarray*}
  SE(\hat{p}_1 -\hat{p}_2) = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}
  \end{eqnarray*}
```
This is the standard error formula we will use when computing confidence intervals for the difference in two proportions.
:::

If the conditions for the sampling distribution of $\hat{p}_1 - \hat{p}_2$ to be normal are met, we can apply the generic confidence interval formula for a difference of two proportions, where we use $\hat{p}_1 - \hat{p}_2$ as the point estimate and substitute the $SE$ formula above:

\begin{align*}
\text{point estimate} \ &\pm\  z^{\star} \times SE  \quad\to\\
\hat{p}_1 - \hat{p}_2 \ &\pm\ z^{\star} \times \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1}+\frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}
\end{align*}


::: workedexample
We reconsider the experiment for patients who underwent cardiopulmonary resuscitation (CPR) for a heart attack and were subsequently admitted to a hospital. These patients were randomly divided into a treatment group where they received a blood thinner or the control group where they did not receive a blood thinner. The outcome variable of interest was whether the patients survived for at least 24 hours. The results are shown in Table \@ref(tab:resultsForCPRStudyInSmallSampleSection). Check whether we can model the difference in sample proportions using the normal distribution.

------------------------------------------------------------------------

We first check for independence: since this is a randomized experiment, this condition is satisfied.

Next, we check the success-failure condition for each group. We have at least 10 successes and 10 failures in each experiment arm (11, 14, 39, 26), so this condition is also satisfied.

With both conditions satisfied, the difference in sample proportions can be reasonably modeled using a normal distribution for these data.
:::

````{=html}
<!--
```{r resultsForCPRStudyInSmallSampleSectionDup}
temptbl <- tribble(
 ~variable,    ~col1, ~col2, ~col3,
 "Control", "11", "39", "50",
 "Treatment", "14", "26", "40",
 "Total", "25", "65", "90"
)

temptbl %>%
 kable(caption = "Results for the CPR study.
    Patients in the treatment group were given
    a blood thinner, and patients in the control
    group were not.", 
    col.names = c("", "Survived", "Died", "Total")) %>%
 kable_styling()
```
-->
````

::: workedexample
Create and interpret a 90% confidence interval of the difference for the survival rates in the CPR study.

------------------------------------------------------------------------

We'll use $\pi_t$ for the true survival rate in the treatment group and $\pi_c$ for the control group. Our point estimate of $\pi_t - \pi_c$ is: \begin{align*}
  \hat{p}_{t} - \hat{p}_{c}
    = \frac{14}{40} - \frac{11}{50}
    = 0.35 - 0.22
    = 0.13
  \end{align*} We use the standard error formula previously provided. As with the one-sample proportion case, we use the sample estimates of each proportion in the formula in the confidence interval context: \begin{align*}
  SE \approx \sqrt{\frac{0.35 (1 - 0.35)}{40} +
      \frac{0.22 (1 - 0.22)}{50}}
    = 0.095
  \end{align*} For a 90% confidence interval, we use $z^{\star} = 1.65$: \begin{align*}
  \text{point estimate} &\pm\ z^{\star} \times SE \quad \to\\
   0.13 \ &\pm\ 1.65 \times  0.095 \quad = \quad (-0.027, 0.287)
  \end{align*} We are 90% confident that the survival probability for those patients given blood thinners is between 0.027 lower to 0.287 higher than that of patients not given blood thinners, among patients like those in the study. Because 0% is contained in the interval, we do not have enough information to say whether blood thinners help or harm heart attack patients who have been admitted after they have undergone CPR.

Note, the problem was set up as 90% to indicate that there was not a need for a high level of confidence (such a 95% or 99%). A lower degree of confidence increases potential for error, but it also produces a more narrow interval.
:::

\index{data!CPR and blood thinner|)}

::: guidedpractice
A 5-year experiment was conducted to evaluate the effectiveness of fish oils on reducing cardiovascular events, where each subject was randomized into one of two treatment groups. We will consider heart attack outcomes in the patients listed in Table \@ref(tab:fish-oil-data).

Create a 95% confidence interval for the effect of fish oils on heart attacks for patients who are well-represented by those in the study. Also interpret the interval in the context of the study.[^15-categorical-two-props-13]
:::

[^15-categorical-two-props-13]: Because the patients were randomized, the subjects are independent, both within and between the two groups. The success-failure condition is also met for both groups as all counts are at least 10. This satisfies the conditions necessary to model the difference in proportions using a normal distribution. Compute the sample proportions ($\hat{p}_{\text{fish oil}} = 0.0112$, $\hat{p}_{\text{placebo}} = 0.0155$), point estimate of the difference ($0.0112 - 0.0155 = -0.0043$), and standard error $SE = \sqrt{\frac{0.0112 \times 0.9888}{12933} + \frac{0.0155 \times 0.9845}{12938}} = 0.00145$. Next, plug the values into the general formula for a confidence interval, where $z^{\star} = 1.96$ for a 95% confidence level: $-0.0043 \pm 1.96 \times 0.00145 \rightarrow (-0.0071, -0.0015)$. We are 95% confident that fish oils decreases heart attacks by 0.15 to 0.71 percentage points (off of a baseline of about 1.55%) over a 5-year period for subjects who are similar to those in the study. Because the interval is entirely below 0, and the treatment was randomly assigned, the data provide strong evidence that fish oil supplements reduce heart attacks in patients like those in the study.

```{r fish-oil-data}
temptbl <- fish_oil_18[[3]] %>% 
  data.frame() 
row.names(temptbl) <- c("fish oil","placebo") 
temptbl <- temptbl %>%
  mutate(total = myocardioal_infarction + no_event) %>%
  t()
row.names(temptbl) <- c("heart attack","no event","Total")
kable(temptbl, caption = "Results for the study on n-3 fatty acid supplement and related health benefits.", 
    col.names = c( "fish oil", "placebo"),
    row.names = TRUE) %>%
 kable_styling()
```

### Hypothesis test for $H_0: \pi_1 - \pi_2 = 0$

\index{data!mammography|(} \index{data!breast cancer|(}

A mammogram is an X-ray procedure used to check for breast cancer. Whether mammograms should be used is part of a controversial discussion, and it's the topic of our next example where we learn about two proportion hypothesis tests when $H_0$ is $\pi_1 - \pi_2 = 0$ (or equivalently, $\pi_1 = \pi_2$).

A 30-year study was conducted with nearly 90,000 female participants. During a 5-year screening period, each woman was randomized to one of two groups: in the first group, women received regular mammograms to screen for breast cancer, and in the second group, women received regular non-mammogram breast cancer exams. No intervention was made during the following 25 years of the study, and we'll consider death resulting from breast cancer over the full 30-year period. Results from the study are summarized in Figure \@ref(tab:mammogramStudySummaryTable).

If mammograms are much more effective than non-mammogram breast cancer exams, then we would expect to see additional deaths from breast cancer in the control group. On the other hand, if mammograms are not as effective as regular breast cancer exams, we would expect to see an increase in breast cancer deaths in the mammogram group.

```{r mammogramStudySummaryTable}
temptbl <- tribble(
 ~variable,    ~col1, ~col2, ~col3,
 "Death from breast cancer?", "Yes", "500", "505",
 "", "No", "44,425", "44,405"
)

temptbl %>%
 kable(caption = "Summary results for breast cancer study.", 
    col.names = c( "", "", "Mammogram", "Control"))  %>%
 kable_styling()
```

::: guidedpractice
Is this study an experiment or an observational study?[^15-categorical-two-props-14]
:::

[^15-categorical-two-props-14]: This is an experiment. Patients were randomized to receive mammograms or a standard breast cancer exam. We will be able to make causal conclusions based on this study.

::: guidedpractice
Set up hypotheses to test whether there was a difference in breast cancer deaths in the mammogram and control groups.[^15-categorical-two-props-15]
:::

[^15-categorical-two-props-15]: $H_0$: the breast cancer death rate for patients screened using mammograms is the same as the breast cancer death rate for patients in the control, $\pi_{mgm} - \pi_{ctrl} = 0$.\
    $H_A$: the breast cancer death rate for patients screened using mammograms is different than the breast cancer death rate for patients in the control, $\pi_{mgm} - \pi_{ctrl} \neq 0$.

The research question describing mammograms is set up to address specific hypotheses (in contrast to a confidence interval for a parameter). In order to fully take advantage of the hypothesis testing structure, we assess the randomness under the condition that the null hypothesis is true (as we always do for hypothesis testing). Using the data from Table \@ref(tab:mammogramStudySummaryTable), we will check the conditions for using a normal distribution to analyze the results of the study using a hypothesis test. The details for checking conditions are very similar to that of confidence intervals. However, when the null hypothesis is that $\pi_1 - \pi_2 = 0$, we use a special proportion called the **pooled proportion**\index{pooled proportion} to check the success-failure condition and when computing the standard error: \begin{align*}
\hat{p}_{\textit{pool}}
    &= \frac
        {\text{# of patients who died from breast cancer in the entire study}}
        {\text{# of patients in the entire study}} \\
        &\\
    &= \frac{500 + 505}{500 + \text{44,425} + 505 + \text{44,405}} \\
    &\\
    &= 0.0112
\end{align*} This proportion is an estimate of the breast cancer death rate across the entire study, and it's our best estimate of the death rates $\pi_{mgm}$ and $\pi_{ctrl}$ *if the null hypothesis is true that* $\pi_{mgm} = \pi_{ctrl}$.

```{r include=FALSE}
terms_chp_15 <- c(terms_chp_15, "pooled proportion")
```

::: onebox
**Use the pooled proportion when** $H_0$ is $\pi_1 - \pi_2 = 0$.

When the null hypothesis is that the proportions are equal, use the pooled proportion ($\hat{p}_{\textit{pool}}$) to verify the success-failure condition and estimate the standard error: \begin{eqnarray*}
  \hat{p}_{\textit{pool}}
    = \frac{\text{total number of successes}}
      {\text{total number of cases}}
    = \frac{\hat{p}_1 n_1 + \hat{p}_2 n_2}{n_1 + n_2}
  \end{eqnarray*} Here $\hat{p}_1 n_1$ represents the number of successes in sample 1 since \begin{eqnarray*}
  \hat{p}_1
    = \frac{\text{number of successes in sample 1}}{n_1}
  \end{eqnarray*} Similarly, $\hat{p}_2 n_2$ represents the number of successes in sample 2.
:::

::: workedexample
Is it reasonable to model the difference in proportions using a normal distribution in this study?

------------------------------------------------------------------------

Because the patients are randomized, they can be treated as independent, both within and between groups. We also must check the success-failure condition for each group. Under the null hypothesis, the proportions $\pi_{mgm}$ and $\pi_{ctrl}$ are equal, so we check the success-failure condition with our best estimate of these values under $H_0$, the pooled proportion from the two samples, $\hat{p}_{\textit{pool}} = 0.0112$: \begin{align*}
  \hat{p}_{\textit{pool}} \times n_{mgm}
      &= 0.0112 \times \text{44,925} = 503 \\
   (1 - \hat{p}_{\textit{pool}}) \times n_{mgm}
      &= 0.9888 \times \text{44,925} = \text{44,422} \\
  & \\
  \hat{p}_{\textit{pool}} \times n_{ctrl}
      &= 0.0112 \times \text{44,910} = 503 \\
   (1 - \hat{p}_{\textit{pool}}) \times n_{ctrl}
      &= 0.9888 \times \text{44,910} = \text{44,407}
  \end{align*} The success-failure condition is satisfied since all values are at least 10. With both conditions satisfied, we can safely model the difference in proportions using a normal distribution.
:::

We used the pooled proportion to check the success-failure condition[^15-categorical-two-props-16]. We next use it again in the standard error calculation.

[^15-categorical-two-props-16]: For an example of a two proportion hypothesis test that does not require the success-failure condition to be met, see Section \@ref(two-prop-errors).

::: onebox
**Null standard error of the difference in two proportions,** $\hat{p}_1 -\hat{p}_2$: =

Since we assume $\pi_1 = \pi_2$ when we conduct a theory-based hypothesis test for $H_0: \pi_1 - \pi_2 = 0$, we substitute the **pooled sample proportion**, $\hat{p}_{pool}$ in for both $\pi_1$ and $\pi_2$ in the expression for the standard deviation of the statistic, resulting in its **null standard error**:

```{=tex}
\begin{align*}
  SE_0(\hat{p}_1 -\hat{p}_2) &= \sqrt{\frac{\hat{p}_{pool}(1-\hat{p}_{pool})}{n_1} + \frac{\hat{p}_{pool}(1-\hat{p}_{pool})}{n_2}} \\
  &= \sqrt{\hat{p}_{pool}(1-\hat{p}_{pool})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}
  \end{align*}
```
This is the standard error formula we will use when computing the test statistic for a hypothesis test of $H_0: \pi_1 - \pi_2 = 0$.
:::

::: workedexample
Compute the point estimate of the difference in breast cancer death rates in the two groups, and use the pooled proportion $\hat{p}_{\textit{pool}} = 0.0112$ to calculate the standard error.

------------------------------------------------------------------------

The point estimate of the difference in breast cancer death rates is \begin{align*}
  \hat{p}_{mgm} - \hat{p}_{ctrl}
    &= \frac{500}{500 + 44,425} - \frac{505}{505 + 44,405} \\
  & \\
    &= 0.01113 - 0.01125 \\
  & \\
    &= -0.00012
  \end{align*} The breast cancer death rate in the mammogram group was 0.00012 less than in the control group.

Next, the standard error of $\hat{p}_{mgm} - \hat{p}_{ctrl}$ is calculated *using the pooled proportion*, $\hat{p}_{\textit{pool}}$: \begin{align*}
SE_0 = \sqrt{
      \frac{\hat{p}_{\textit{pool}}(1-\hat{p}_{\textit{pool}})}
          {n_{mgm}}
      + \frac{\hat{p}_{\textit{pool}}(1-\hat{p}_{\textit{pool}})}
          {n_{ctrl}}
    }
    = 0.00070
\end{align*}
:::

::: workedexample
Using the point estimate $\hat{p}_{mgm} - \hat{p}_{ctrl} = -0.00012$ and standard error $SE = 0.00070$, calculate a p-value for the hypothesis test and write a conclusion.

------------------------------------------------------------------------

Just like in past tests, we first compute a test statistic and draw a picture: \begin{align*}
Z = \frac{\text{point estimate} - \text{null value}}{\mbox{Null }SE}
    = \frac{-0.00012 - 0}{0.00070}
    = -0.17
\end{align*}

The lower tail area below -0.17 on a standard normal distribution is 0.4325, which we double to get the p-value: 0.8650 (see Figure \@ref(fig:mamm-norm)). With this very large p-value, the difference in breast cancer death rates is reasonably explained by chance, and we have no significant evidence that mammograms either decrease or increase the risk of death by breast cancer compared to regular breast exams, among women similar to those in the study.
:::

```{r mamm-norm, fig.cap="Standard normal distribution with the p-value shaded. The shaded area represents the probability of observing a difference in sample proportions of -0.17 or further away from zero, if the true proportions were equal."}
normTail(L = -0.17, U = 0.17,
        col = COL[1],
        axes = FALSE,
        xlim = c(-3.2, 3.2),
        xlab = "Z")
at <- c(-10, -2, 0, 2, 10)
labels <- c(0, -0.0014, 0, 0.0014, 0)
axis(1, at, labels, cex.axis = 0.9)
```

Can we conclude that mammograms have no benefits or harm? Here are a few considerations to keep in mind when reviewing the mammogram study as well as any other medical study:

-   We do not accept the null hypothesis. We can only say we don't have sufficient evidence to conclude that mammograms reduce breast cancer deaths, and we don't have sufficient evidence to conclude that mammograms increase breast cancer deaths.
-   If mammograms are helpful or harmful, the data suggest the effect isn't very large.
-   Are mammograms more or less expensive than a non-mammogram breast exam? If one option is much more expensive than the other and doesn't offer clear benefits, then we should lean towards the less expensive option.
-   The study's authors also found that mammograms led to over-diagnosis of breast cancer, which means some breast cancers were found (or thought to be found) but that these cancers would not cause symptoms during patients' lifetimes. That is, something else would kill the patient before breast cancer symptoms appeared. This means some patients may have been treated for breast cancer unnecessarily, and this treatment is another cost to consider. It is also important to recognize that over-diagnosis can cause unnecessary physical or emotional harm to patients.

These considerations highlight the complexity around medical care and treatment recommendations. Experts and medical boards who study medical treatments use considerations like those above to provide their best recommendation based on the current evidence.

\index{data!breast cancer|)} \index{data!mammography|)}

````{=html}
<!--
```{block2, type = "onebox", echo=TRUE}
**Hypothesis testing when ${H_0}$ is $\pi_1 - \pi_2 = 0$.**
  
  Once you've determined a hypothesis test for the difference
  of two proportions is the correct procedure, there are four
  steps to completing the test:

* **Prepare.**
      Identify the parameter of interest,
      list out hypotheses,
      identify the significance level,
      and compute summary statistics for each group.
* **Check.**
      Verify the conditions to ensure
      $\hat{p}_1 - \hat{p}_2$ is nearly normal under $H_0$.
      When the null hypothesis is that the difference is 0,
      use a pooled proportion to check the success-failure
      condition for each group.
* **Calculate.**
      If the conditions hold, compute the standard
      error, again using the pooled proportion,
      compute the Z-score, and identify the p-value.
* **Conclude.**
      Evaluate the hypothesis test by comparing the p-value
      to $\alpha$, and provide a conclusion in the context
      of the problem.
```
-->
````

## Chapter review {#chp15-review}

<!-- ### Summary {-} -->

<!-- ::: {.underconstruction} -->

<!-- TODO -->

<!-- ::: -->

### Summary of Z-procedures

So far in this chapter, we have seen the normal distribution applied as the appropriate mathematical model in two distinct settings. Although the two data structures are different, their similarities and differences are worth pointing out. We provide Table \@ref(tab:zcompare) partly as a mechanism for understanding $z$-procedures and partly to highlight the extremely common usage of the normal distribution in practice. You will often hear the following two $z$-procedures referred to as a **one sample** $z$-test ($z$-interval) and **two sample** $z$-test ($z$-interval).

```{r include=FALSE}
terms_chp_15 <- c(terms_chp_15, "one sample $z$-test", "two sample $z$-test")
```

```{r zcompare}
zsim_table <- tribble(
  ~variable,    ~col1, ~col2,
"response variable",  "binary", "binary",

"explanatory variable", "none", "binary",

"parameter of interest", "proportion: $\\pi$", "diff in props:$\\pi_1 - \\pi_2$",


"statistic of interest", "proportion: $\\hat{p}$", "diff in props: $\\hat{p}_1 - \\hat{p}_2$",

"null standard error of the statistic", "$\\sqrt{\\dfrac{\\pi_0(1-\\pi_0)}{n}}$", "$\\sqrt{\\hat{p}_{pool}(1-\\hat{p}_{pool})\\left(\\dfrac{1}{n_1} + \\dfrac{1}{n_2}\\right)}$",

"standard error of the statistic", "$\\sqrt{\\dfrac{\\hat{p}(1-\\hat{p})}{n}}$", "$\\sqrt{\\dfrac{\\hat{p}_1(1-\\hat{p}_1)}{n_1} + \\dfrac{\\hat{p}_2(1-\\hat{p}_2)}{n_2}}$",

"conditions", "1. independence, 2. large samples (at least 10 successes and 10 failures)", "1. independence, 2. large samples (at least 10 successes and 10 failures in each sample)",

"Theory-based R functions", "prop.test", "prop.test", 

"Simulation-based R catstats functions", "one_proportion_test, one_proportion_bootstrap_CI", "two_proportion_test, two_proportion_bootstrap_CI"
)

zsim_table %>%
 kable(caption = "Similarities of $z$-methods across one sample and two independent samples analysis of a categorical response variable.", 
    col.names = c("", " one sample ", "two indep. samples")) %>%
 kable_styling()
```

**Hypothesis tests.** When applying the normal distribution for a hypothesis test, we proceed as follows:

1.  Write appropriate hypotheses.\
2.  Verify conditions for using the normal distribution.
    -   One-sample: the observations must be independent, and you must have at least 10 successes and 10 failures.\
    -   For a difference of proportions: each sample must separately satisfy the one-sample conditions for the normal distribution, and the data in the groups must also be independent.
3.  Compute the statistic of interest, the null standard error, and the degrees of freedom. For $df$, use $n-1$ for one sample, and for two samples use either statistical software or the smaller of $n_1 - 1$ and $n_2 - 1$.\
4.  Compute the Z-score using the general formula: $$
     Z = \frac{\mbox{statistic} - \mbox{null value}}{\mbox{null standard error of the statistic}} = \frac{\mbox{statistic} - \mbox{null value}}{SE_0(\mbox{statistic})}
     $$
5.  Use the statistical software to find the p-value using the standard normal distribution:
    -   Sign in $H_A$ is $<$: p-value = area below Z-score
    -   Sign in $H_A$ is $>$: p-value = area above Z-score
    -   Sign in $H_A$ is $\neq$: p-value = 2 $\times$ area below $-|\mbox{Z-score}|$
6.  Make a conclusion based on the p-value, and write a conclusion in context, in plain language, and in terms of the alternative hypothesis.

**Confidence intervals.** Similarly, the following is how we generally compute a confidence interval using a normal distribution:

1.  Verify conditions for using the normal distribution. (See above.)\
2.  Compute the statistic of interest, the standard error, and $z^{\star}$.\
3.  Calculate the confidence interval using the general formula: $$
     \mbox{statistic} \pm\ z^{\star} SE(\mbox{statistic}).
     $$
4.  Put the conclusions in context and in plain language so even non-data scientists can understand the results.

### Terms {.unnumbered}

We introduced the following terms in the chapter. If you're not sure what some of these terms mean, we recommend you go back in the text and review their definitions. We are purposefully presenting them in alphabetical order, instead of in order of appearance, so they will be a little more challenging to locate. However you should be able to easily spot them as **bolded text**.

```{r}
make_terms_table(terms_chp_15)
```

<!-- ### Key ideas {-} -->

<!-- ::: {.underconstruction} -->

<!-- TODO -->

<!-- ::: -->