# 3A: Anxiety in the ER

In [None]:
# This code will load the R packages we will use
suppressPackageStartupMessages({
    library(coursekata)
})

# This code will make sure the middle rows/columns don't get cut out (ellipsized) when you 
# print out a really large data frame (you can adjust the values for max rows/cols)
options(repr.matrix.max.rows=800, repr.matrix.max.cols=200)

## 1.0: The Data

Let's look at the data (in a data frame called `er`). Then I'll tell you a bit more about how it was collected.

In [None]:
head(er)

**1.1:** What do you think these cases (rows) are?

#### About the Study

<img src="https://i.postimg.cc/qpYsC1xY/image.png" alt="a variety of therapy dogs" width="60%">

Researchers were interested in whether therapy dogs could ease anxiety, pain, and depression during emergency room visits. Medically stable adult patients visiting an emergency room were approached and randomly assigned to one of two conditions: 15 minutes of exposure to a certified therapy dog and handler (**Dog condition**), or usual care (**Control condition**). Patient-reported anxiety, pain, and depression were assessed on a 0–10 scale (10 = worst) at three time points:

- baseline (before the therapy dog)
- later (30 minutes after the therapy dog or control treatment)
- last (90 minutes after)

#### Study Procedure

<img src="https://i.postimg.cc/syjV5VSK/image.png" alt="Diagram of Dog Therapy Study procedure" width=800>

## Motivating Question: Are therapy dogs helpful in the emergency room (ER)?

**1.2:** The data frame `er` has a lot of variables. There is 1 row per patient with their health information and demographics. Why is there a variable like `dog_name` in this data set? Why don't all rows have a value in `dog_name`?

In [None]:
sample(er, 10)

#### Key Variables

For today, we're just going to focus on a few key variables having to do with patients' demographic characteristics and anxiety levels:

- `condition`: The research condition the patient was randomly assigned to (Dog or Control)
- `age`: The age of the patient
- `gender`: The gender of the patient
- `race`: The race of the patient
- `base_anxiety`: The baseline self-reported anxiety rating on a scale of 0–10 (10 = worst), before any exposure to a therapy dog
- `later_anxiety`: The anxiety rating 30 minutes after exposure to either the dog or the control treatment
- `last_anxiety`: The anxiety rating 90 minutes after exposure to either the dog or the control treatment

##### Data Source: 

Research Paper: [Kline JA, Fisher MA, Pettit KL, Linville CT, Beck AM.](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0209232) Controlled clinical trial of canine therapy versus usual care to reduce patient anxiety in the emergency department. PLoS One. 2019 Jan 9;14(1):e0209232. doi: 10.1371/journal.pone.0209232. PMID: 30625184; PMCID: PMC6326463.



**1.3:** Make a data frame called `er_anxiety` that only has these variables.
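
One possible approach to 1.3 is sketched here with `select()` from dplyr, which comes along with `coursekata`. It assumes `er` is available as in the cells above and that its column names match the Key Variables list exactly.

```r
suppressPackageStartupMessages(library(coursekata))

# Keep only the key variables (column names assumed to match the list above)
er_anxiety <- dplyr::select(
  er,
  condition, age, gender, race,
  base_anxiety, later_anxiety, last_anxiety
)

head(er_anxiety)
```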

## 2.0: Explore Variation

**2.1:** Anxiety was measured at three different time points (`base_anxiety`, `later_anxiety`, and `last_anxiety`). Make some visualizations exploring these three variables. What do you find out about anxiety in the ER?

**2.2:** The researchers were particularly interested in whether `condition` explains any of the variation in anxiety. Make some visualizations to explore their hypothesis. Does `condition` make a difference in anxiety at any of these time points?
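
A starting point for 2.1 and 2.2 might look like the sketch below. It reads straight from `er` so that it runs on its own; `er_anxiety` from 1.3 would work the same way. The choice of plot types here is only a suggestion.

```r
suppressPackageStartupMessages(library(coursekata))

# 2.1: distribution of anxiety at one time point
# (swap in later_anxiety or last_anxiety to see the other two)
gf_histogram(~ base_anxiety, data = er, binwidth = 1)

# 2.2: the same kind of outcome, split by condition
gf_boxplot(later_anxiety ~ condition, data = er)

# Numeric summaries by group, using base R only
tapply(er$later_anxiety, er$condition, summary)
```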

**2.3:** Let’s review our working definition of “explaining variation”. Does "explain variation" mean "cause variation"?

<div class="alert alert-block alert-info">

<b> <font size="+1">Key Question</font></b>

**2.4:** Anxiety at baseline (`base_anxiety`) doesn't look *exactly* the same across the two conditions. Why are the two distributions a little bit different? Could dog therapy have caused that difference?

</div>

In [None]:
gf_histogram(~ base_anxiety, data = er_anxiety, binwidth = 1, fill = "gold") %>%
  gf_facet_grid(condition ~ .) 

<div class="alert alert-block alert-info">

<b> <font size="+1">Key Question</font></b>

**2.5:** Could `condition` have caused the variation we see in `later_anxiety` or `last_anxiety`? 

</div>

**Summary:** Using common sense alone, we can figure out some things about the data generating process (DGP) behind these data.

We know that `condition` couldn't have caused any of the differences we see between the two groups in `base_anxiety`, because baseline anxiety was measured before any dog therapy happened. Any difference between the groups at baseline is likely due to random chance.

However, because these data were collected in an experiment (with random assignment to the two conditions), `condition` *could have* caused the differences we see between the two groups in `later_anxiety` and `last_anxiety`. **But** it is important to keep in mind that randomness *could have* caused those differences as well. Later we will learn how to rule out randomness as a DGP.

**2.6:** Let's write some word equations. How would we write the hypothesis that: 

- `condition` could explain variation in `later_anxiety`?
- `condition` could explain variation in `last_anxiety`? 
- `condition` could not explain variation in `base_anxiety`?

## 3.0: Modeling Variation in `later_anxiety`

**3.1:** Let's focus on the hypothesis that `condition` could explain variation in `later_anxiety`. Find the best fitting model and add it to the faceted histogram below. Then add it to the jitter plot as well.


In [None]:
gf_histogram(~ later_anxiety, data = er_anxiety, binwidth = 1, fill = "orange") %>%
  gf_facet_grid(condition ~ .) 

In [None]:
gf_jitter(later_anxiety ~ condition, data = er_anxiety, width = .1, color = "darkorange3")
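
One way 3.1 could go is sketched below: fit the model with `lm()`, then chain `gf_model()` onto each plot, the same pattern used later in this notebook. It reads from `er` so that it runs on its own; the model name `later_condition_model` matches the one used in section 5.

```r
suppressPackageStartupMessages(library(coursekata))

# Best fitting two-group model of later_anxiety
later_condition_model <- lm(later_anxiety ~ condition, data = er)

# Overlay the model's predictions on the faceted histogram
gf_histogram(~ later_anxiety, data = er, binwidth = 1, fill = "orange") %>%
  gf_facet_grid(condition ~ .) %>%
  gf_model(later_condition_model)

# ...and on the jitter plot
gf_jitter(later_anxiety ~ condition, data = er, width = .1, color = "darkorange3") %>%
  gf_model(later_condition_model)
```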

**3.2:** Write the best fitting model of **later_anxiety = condition + other stuff** in GLM notation. You can double click on this cell to copy the equation we have started for you below:

$Y_i = b_0 + b_1X_i + e_i$

*Notes on writing fancy mathematical notation:*
- You can write GLM using Ys and Xs: $Y_i = b_0 + b_1X_i + e_i$
- Or using the variable names: $lateranxiety_i = b_0 + b_1conditionDog_i + e_i$
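
With a two-group variable like `condition`, $X_i$ is a dummy code. As a sketch, assuming Control is the reference group (which is what the name $conditionDog_i$ suggests):

$$X_i = \begin{cases} 1 & \text{if patient } i \text{ is in the Dog condition} \\ 0 & \text{if patient } i \text{ is in the Control condition} \end{cases}$$

Plugging those two values into the model gives its two possible predictions:

$$\hat{Y}_i = b_0 \quad \text{(Control)}, \qquad \hat{Y}_i = b_0 + b_1 \quad \text{(Dog)}$$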

<div class="alert alert-block alert-info">

<b> <font size="+1">Key Question</font></b>

**3.3:** Interpret the parameter estimates. How do these numbers relate to the model shown in the graph?

</div>

**3.4:** What would the condition model predict as the `later_anxiety` for someone who got dog therapy? How about someone who didn't?
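
To check an answer to 3.4, the predictions can be computed from the parameter estimates and compared with the group means. This is a sketch that reads from `er` directly. The names `predictions` and `group_means` are made up for illustration, and which group is the reference group depends on how `condition` is coded.

```r
suppressPackageStartupMessages(library(coursekata))

later_condition_model <- lm(later_anxiety ~ condition, data = er)
b <- unname(coef(later_condition_model))

# Prediction for the reference group (b0) and for the other group (b0 + b1)
predictions <- c(reference_group = b[1], other_group = b[1] + b[2])
predictions

# Group means for comparison; lm() drops rows with a missing outcome,
# so the means are computed with na.rm = TRUE
group_means <- tapply(er$later_anxiety, er$condition, mean, na.rm = TRUE)
group_means
```

The two sets of numbers should agree, because the best fitting two-group model predicts each group's mean.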

**3.5:** Why does the model predict lower anxiety for those in the dog condition?

## 4.0: Modeling Variation in `base_anxiety`

In this section, let's focus on the hypothesis that `condition` **could not** explain variation in `base_anxiety`. 

**4.1:** Which of the following lines of R code would find the best fitting model to represent this hypothesis? (Delete the other one.)

In [None]:
# delete one
lm(base_anxiety ~ condition, data = er_anxiety)

lm(base_anxiety ~ NULL, data = er_anxiety)


**4.2:** Add the best fitting empty model to this faceted histogram below. Also add it to the jitter plot below.


In [None]:
gf_histogram(~ base_anxiety, data = er_anxiety, binwidth = 1, fill = "gold") %>%
  gf_facet_grid(condition ~ .)

gf_jitter(base_anxiety ~ condition, data = er_anxiety, width = .1, color = "gold")

<div class="alert alert-block alert-info">

<b> <font size="+1">Key Question</font></b>

**4.3:** Write the best fitting equation of the empty model in GLM notation. Interpret the parameter estimate. How does this number relate to the model shown in the graph?

</div>

**4.4:** Why does the empty model predict the same base anxiety level for people in the control condition as well as those who got dog therapy?
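
As a sketch for checking work on 4.2 through 4.4: the empty model has a single parameter estimate, and it can be compared with the overall mean of `base_anxiety`. The name `empty_model` is only illustrative, and the code reads from `er` so that it runs on its own.

```r
suppressPackageStartupMessages(library(coursekata))

# The empty model uses no explanatory variable
empty_model <- lm(base_anxiety ~ NULL, data = er)

# Its one estimate, b0, next to the grand mean of the outcome
coef(empty_model)
mean(er$base_anxiety, na.rm = TRUE)

# The same single prediction is drawn for both conditions
gf_jitter(base_anxiety ~ condition, data = er, width = .1, color = "gold") %>%
  gf_model(empty_model)
```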

Someone who didn't know very much about how these data were collected decided to create a model using `condition` to predict variation in baseline anxiety. They added it to this jitter plot.

In [None]:
base_condition_model <- lm(base_anxiety ~ condition, data = er_anxiety)

gf_jitter(base_anxiety ~ condition, data = er_anxiety, width = .1, color = "gold") %>%
  gf_model(base_condition_model)

base_condition_model

**4.5:** They wrongly concluded that dog therapy could actually increase baseline anxiety because $b_1$ was 0.42 and they saw that the average baseline anxiety in the dog condition is actually higher than in the control condition. How would you disabuse them of this idea?

## 5.0: Comparing `condition` Models

**5.1:** Here are three models using `condition` to explain variation in `base_anxiety`, `later_anxiety`, and `last_anxiety`. Visually, which of these models looks like it explains the most variation in anxiety?

In [None]:
base_condition_model <- lm(base_anxiety ~ condition, data = er_anxiety)
gf_jitter(base_anxiety ~ condition, data = er_anxiety, width = .1, color = "gold") %>%
  gf_model(base_condition_model)

later_condition_model <- lm(later_anxiety ~ condition, data = er_anxiety)
gf_jitter(later_anxiety ~ condition, data = er_anxiety, width = .1, color = "orange") %>%
  gf_model(later_condition_model)

last_condition_model <- lm(last_anxiety ~ condition, data = er_anxiety)
gf_jitter(last_anxiety ~ condition, data = er_anxiety, width = .1, color = "red") %>%
  gf_model(last_condition_model)

<div class="alert alert-block alert-info">

<b> <font size="+1">Key Question</font></b>

**5.2:** Use a statistical measure to figure out which of these models explains the most variation in anxiety. Does your conclusion fit your visual intuition (from 5.1)?

</div>
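
One candidate measure for 5.2 is the proportion of variation explained (R-squared, which CourseKata materials call PRE). The sketch below pulls it from base R's `summary()` for each of the three condition models; `supernova()` on each model would show the same quantity in an ANOVA table. The names `outcomes` and `pre_by_outcome` are made up for illustration.

```r
suppressPackageStartupMessages(library(coursekata))

outcomes <- c("base_anxiety", "later_anxiety", "last_anxiety")

# Fit outcome ~ condition for each outcome and extract R-squared
pre_by_outcome <- sapply(outcomes, function(outcome) {
  model <- lm(reformulate("condition", response = outcome), data = er)
  summary(model)$r.squared
})

pre_by_outcome
```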

**5.3:** Why are there fewer degrees of freedom available in the ANOVA table for `last_anxiety` than in the others?
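
A hint for 5.3: `lm()` leaves out any row where the outcome is missing, so it can help to count how many patients actually have a value at each time point. This sketch assumes `condition` itself has no missing values; `n_complete` is an illustrative name.

```r
suppressPackageStartupMessages(library(coursekata))

# Number of non-missing ratings at each time point
n_complete <- colSums(!is.na(er[, c("base_anxiety", "later_anxiety", "last_anxiety")]))
n_complete

# Total degrees of freedom in each ANOVA table: rows used minus 1
n_complete - 1
```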

**5.4 -- BONUS:** Can you create a new variable to look at the change in anxiety from baseline to 30 minutes later? How much variation in this change is explained by `condition`?
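
A minimal sketch for the bonus, working on a copy of `er` so that nothing else in the notebook changes. The names `er_change`, `later_change`, and `change_model` are made up for illustration. With this definition, a negative value means anxiety went down from baseline.

```r
suppressPackageStartupMessages(library(coursekata))

# Change from baseline to 30 minutes later (negative = anxiety decreased)
er_change <- er
er_change$later_change <- er_change$later_anxiety - er_change$base_anxiety

# How much of the variation in change does condition explain?
change_model <- lm(later_change ~ condition, data = er_change)
change_model
summary(change_model)$r.squared
```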