<div class="alert alert-block alert-danger">

# 3A: Anxiety in the ER (COMPLETE)

*This notebook is intended for students who have completed up to:*
 
**Page 3.5**

</div>

<div class="alert alert-block alert-warning">

#### Summary of Notebook:

In this lesson, students will explore some data that came from a research study. The researchers were interested in whether dog therapy could help alleviate feelings of anxiety, pain, and depression during emergency room (ER) visits. Students will compare a set of patients randomly assigned to either 15 minutes of exposure to dog therapy, or 15 minutes of usual care in the ER, to see how it affects reported ratings of anxiety at various time points: Baseline anxiety, then 30 minutes after the intervention, and 90 minutes after the intervention.

#### Includes:

- Fitting a group model and interpreting the parameter estimates
- Evaluating explained variation in models
- Thinking about research design and causal relationships in the DGP

#### Resources:

- Optional [Printable Graph Handout](https://docs.google.com/document/d/1KOMikJpuWSzEbiOQ0HrOWcHfGX7FFxKACtb49UBG39k/edit?usp=sharing). This handout contains images of the relevant visualizations that are made throughout the lesson. They can be used for students to manually draw on, mark up, and make notes. This can give students the chance to process the graphs more deeply, and connect them to the models they are fitting.

</div>

<div class="alert alert-block alert-success">

## Approximate time to complete Notebook: 75-105 Mins

</div>

In [None]:
# This code will load the R packages we will use
suppressPackageStartupMessages({
    library(coursekata)
})

# This code will make sure the middle rows/columns don't get cut out (ellipsized) when you 
# print out a really large data frame (you can adjust the values for max rows/cols)
options(repr.matrix.max.rows=800, repr.matrix.max.cols=200)

<div class="alert alert-block alert-success">

### 1.0 - Approximate Time:  10-15 mins

</div>

## 1.0: The Data

Let's look at the data (in a data frame called `er`). Then I'll tell you a bit more about how it was collected.

In [None]:
head(er)

**1.1:** What do you think these cases (rows) are?

<div class="alert alert-block alert-warning">

**Sample Response:**

*Students' answers will vary.*

For instructors: The cases are patients in the ER.

</div>

#### About the Study

<img src="https://i.postimg.cc/qpYsC1xY/image.png" alt="a variety of therapy dogs" width = 60%>

Researchers were interested in the potential benefits of therapy dogs in easing things such as anxiety, pain, and depression during emergency room visits. Several medically stable, adult patients visiting an emergency room were approached and randomly assigned to one of two conditions: 15 minutes exposure to a certified therapy dog and handler (**Dog condition**), or usual care (**Control condition**). Patient-reported anxiety, pain, and depression were assessed using a 0–10 scale (10 = worst), at three time points: 

- baseline (before the therapy dog)
- later (30 minutes after the therapy dog or control treatment)
- last (90 minutes after)

#### Study Procedure

<img src="https://i.postimg.cc/syjV5VSK/image.png" alt="Diagram of Dog Therapy Study procedure" width=800>

## Motivating Question: Are therapy dogs helpful in the emergency room (ER)?

**1.2:** The data frame `er` has a lot of variables. There is 1 row per patient with their health information and demographics. Why is there a variable like `dog_name` in this data set? Why don't all rows have a value in `dog_name`?

In [None]:
sample(er, 10)

<div class="alert alert-block alert-warning">

**Sample Response:**

There is a variable for dog name because the patients were not all exposed to the same dog. Some of them say NA because those patients were in the control condition.

</div>

#### Key Variables

For today, we're just going to focus on a few key variables having to do with patients' demographic characteristics and anxiety levels:

- `condition`: The research condition the patient was randomly assigned to (Dog or Control)
- `age`: The age of the patient
- `gender`: The gender of the patient
- `race`: The race of the patient
- `base_anxiety`: The baseline self-reported anxiety rating on a scale of 0-10 (10 = worst), before any exposure to a therapy dog	 
- `later_anxiety`: Anxiety rating, 30 minutes after exposure to either the dog or the control treatment 
- `last_anxiety`: Anxiety rating, 90 minutes after exposure to treatment

##### Data Source: 

Research Paper: [Kline JA, Fisher MA, Pettit KL, Linville CT, Beck AM.](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0209232) Controlled clinical trial of canine therapy versus usual care to reduce patient anxiety in the emergency department. PLoS One. 2019 Jan 9;14(1):e0209232. doi: 10.1371/journal.pone.0209232. PMID: 30625184; PMCID: PMC6326463.



**1.3:** Make a data frame called `er_anxiety` that only has these variables.

In [None]:
# Sample Response
er_anxiety <- select(er, condition, age, gender, race, base_anxiety, later_anxiety, last_anxiety)
head(er_anxiety)

<div class="alert alert-block alert-success">

### 2.0 - Approximate Time:  25-30 mins

</div>

## 2.0: Explore Variation

**2.1:** Anxiety was measured at three different time points (`base_anxiety`, `later_anxiety`, and `last_anxiety`). Make some visualizations exploring these three variables. What do you find out about anxiety in the ER?

In [None]:
# Sample Responses
gf_histogram(~ base_anxiety, data = er_anxiety, binwidth = 2, fill = "gold") %>%
  gf_boxplot()

gf_histogram(~ later_anxiety, data = er_anxiety, binwidth = 2, fill = "orange") %>%
  gf_boxplot()

gf_histogram(~ last_anxiety, data = er_anxiety, binwidth = 2, fill = "red") %>%
  gf_boxplot()

<div class="alert alert-block alert-warning">

**Sample Response:**

- The baseline anxiety is a little bit skewed left, and has a median near an anxiety level of about 6. 

- The later anxiety is a little bit skewed right, and has a median near an anxiety level of 4.

- The last anxiety is kind of similar to later anxiety, and is skewed a bit to the right, with a median anxiety of about 4.

Overall, it seems like anxiety started off fairly high and then got lower over time.

**Note to Instructors:**

Some students may also notice that there are fewer cases for last_anxiety. Ask them to ponder why that may be. They should be able to figure out that response rates got lower over time (e.g., people left the ER before completing the final survey).
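
One quick way to check this is to count the non-missing responses at each time point. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical stand-ins for `er_anxiety`, not the study data):

```r
# Hypothetical toy data standing in for er_anxiety (values are made up)
toy <- data.frame(
  base_anxiety  = c(6, 7, 5, 8, 4, 6),
  later_anxiety = c(5, 4, 5, 6, 3, NA),
  last_anxiety  = c(4, NA, 5, NA, 2, NA)
)

# Count the non-missing responses at each time point
n_responses <- colSums(!is.na(toy))
n_responses
```

Running the same `colSums(!is.na(...))` call on the three anxiety columns of `er_anxiety` should show the smaller count for `last_anxiety`.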

</div>

**2.2:** The researchers were particularly interested in whether `condition` explains any of the variation in anxiety. Make some visualizations to explore their hypothesis. Does `condition` make a difference on anxiety at any of these time points?

In [None]:
# Sample Responses

# base_anxiety
gf_jitter(base_anxiety ~ condition, data = er_anxiety)

gf_histogram(~base_anxiety, data = er_anxiety) %>%
    gf_facet_grid(condition ~ .)

favstats(base_anxiety ~ condition, data = er_anxiety)

# later_anxiety
gf_jitter(later_anxiety ~ condition, data = er_anxiety)

gf_histogram(~later_anxiety, data = er_anxiety) %>%
    gf_facet_grid(condition ~ .)

favstats(later_anxiety ~ condition, data = er_anxiety)

# last_anxiety
gf_jitter(last_anxiety ~ condition, data = er_anxiety)

gf_histogram(~last_anxiety, data = er_anxiety) %>%
    gf_facet_grid(condition ~ .)

favstats(last_anxiety ~ condition, data = er_anxiety)

<div class="alert alert-block alert-warning">

**Sample Response:**

For baseline anxiety, the distributions are pretty similar, with a similar range of variation for the dog and the control group, although they are not exactly the same.

For later anxiety, there is still a lot of overlap in the distributions, but the dog group clearly has a lot more of the lower anxiety ratings.

For the last anxiety measure, there is also a wide range in both groups and a lot of overlap. However, there are definitely a lot more of the lower anxiety ratings in the dog group.

**Note to Instructors:**

They may also comment on (or you may want to draw their attention to) the difference in sample size for last anxiety, both overall (in comparison to baseline and later anxiety) and between the condition groups (there are twice as many responses in the dog group as in the control group). This does not make our results invalid, since attrition is expected and we still have responses for later anxiety, but it is something to keep in mind when drawing conclusions.

</div>

**2.3:** Let’s review our working definition of “explaining variation”. Does "explain variation" mean "cause variation"?

<div class="alert alert-block alert-warning">

**Sample Response:**

Students may still have trouble putting this into words, but should attempt to say something like:

- Knowing the value of one variable (X) can help us make a better prediction of another variable (Y).

- Correlation is not causation (the most famous quote).


**Note to Instructors:**

Feel free to add extra scaffolding here if students need more information about research methods and experimental design:

No, it does not mean "cause variation". Explaining variation just means that knowing the value of one variable helps us better predict the value of another variable. With some types of analyses, such as correlational models, we may never be able to know whether a predictive relationship is causal. With the right experimental conditions, however, we can reach some degree of confidence about whether the relationship is causal.

We will be able to tell whether `condition` causes differences in anxiety for `later_anxiety` and `last_anxiety` (with some degree of certainty), because this study follows the research methods necessary for a true experiment: Random assignment to experimental group or control group. If there is a difference between the two groups, it could be due to dog therapy **or** randomness.

In later chapters, we will need to rule out randomness (that the two groups are different by random chance represented by the empty model). For today, we'll find the best fitting models using `condition` to predict these outcomes.

</div>

<div class="alert alert-block alert-info">

<b> <font size="+1">Key Question</font></b>

**2.4:** Anxiety at baseline (`base_anxiety`) doesn't look *exactly* the same across the two conditions. Why are they a little bit different? Could dog therapy have caused that difference?

</div>

In [None]:
gf_histogram(~ base_anxiety, data = er_anxiety, binwidth = 1, fill = "gold") %>%
  gf_facet_grid(condition ~ .) 

<div class="alert alert-block alert-warning">

**Sample Response:**

They don't look exactly the same because there is natural sampling variation across the two groups. The patients were randomly assigned to the groups, and everyone had different anxiety levels. 

In order to establish a causal relationship, the intervention (`condition`) has to come before the measurement, so `condition` should not have any effect on `base_anxiety` since the patients have not been exposed to any dog therapy yet at that time point. That is why this is considered a "baseline" measurement.

**Note to Instructors:**

You may want to ask students what it would mean if the two groups DID look significantly different at baseline (e.g., one group had mostly low ratings)? How would that affect the analyses and the study? Why is it important to establish that there is not a difference already between the two groups before starting the experiment?


</div>

<div class="alert alert-block alert-info">

<b> <font size="+1">Key Question</font></b>

**2.5:** Could `condition` have caused the variation we see in `later_anxiety` or `last_anxiety`? 

</div>

<div class="alert alert-block alert-warning">

**Sample Response:**

Yes, it is possible that it could have caused variation in later and last anxiety because those measurements were taken *after* the exposure to the dog therapy.

</div>

**Summary:** Just using our common sense, we can figure out some things about the DGP that generated this data. 

Because of common sense, we know that `condition` couldn't have caused any of the differences we see in these two groups in `base_anxiety` because that is before any dog therapy happened. Any difference we see in these two groups is likely due to random chance. 

However, because this data was collected in an experiment (with random assignment to these two conditions), condition *could have* caused the differences we see in the two groups. **But** it is important to keep in mind that randomness *could have* caused differences as well. Later we will learn how to rule out randomness as a DGP.

**2.6:** Let's write some word equations. How would we write the hypothesis that: 

- `condition` could explain variation in `later_anxiety`?
- `condition` could explain variation in `last_anxiety`? 
- `condition` could not explain variation in `base_anxiety`?

<div class="alert alert-block alert-warning">

**Sample Response:**

- later_anxiety = condition + other stuff

- last_anxiety = condition + other stuff

- base_anxiety = other stuff

</div>

<div class="alert alert-block alert-success">

### 3.0 - Approximate Time:  15-20 mins

</div>

## 3.0: Modeling Variation in `later_anxiety`

**3.1:** Let's focus on the hypothesis that `condition` could explain variation in `later_anxiety`. Find the best fitting model and add it to this faceted histogram below. Also add it to the jitter plot below.


In [None]:
gf_histogram(~ later_anxiety, data = er_anxiety, binwidth = 1, fill = "orange") %>%
  gf_facet_grid(condition ~ .) 

In [None]:
gf_jitter(later_anxiety ~ condition, data = er_anxiety, width = .1, color = "darkorange3")

In [None]:
# Complete Version
later_condition_model <- lm(later_anxiety ~ condition, data = er_anxiety)
later_condition_model

gf_histogram(~ later_anxiety, data = er_anxiety, binwidth = 1, fill = "orange") %>%
  gf_facet_grid(condition ~ .) %>%
  gf_model(later_condition_model)

gf_jitter(later_anxiety ~ condition, data = er_anxiety, width = .1, color = "darkorange3") %>%
  gf_model(later_condition_model)

**3.2:** Write the best fitting model of **later_anxiety = condition + other stuff** in GLM notation. You can double click on this cell to copy the equation we have started for you below:

$Y_i = b_0 + b_1X_i + e_i$

*Notes on writing fancy mathematical notation:*
- You can write GLM using Ys and Xs: $Y_i = b_0 + b_1X_i + e_i$
- Or using the variable names: $lateranxiety_i = b_0 + b_1conditionDog_i + e_i$

In [None]:
# Complete
lm(later_anxiety ~ condition, data = er_anxiety)

<div class="alert alert-block alert-warning">

**Sample Responses:**

- $Y_i = 5.22 - 1.69X_i + e_i$ 
- $lateranxiety_i = 5.22 - 1.69conditionDog_i + e_i$

</div>

<div class="alert alert-block alert-info">

<b> <font size="+1">Key Question</font></b>

**3.3:** Interpret the parameter estimates. How do these numbers relate to the model shown in the graph?

</div>

<div class="alert alert-block alert-warning">

**Sample Response:**

For $b_0 = 5.22$, students might say:
- this is the average later anxiety level of someone in the Control condition
- this is the model line in the Control condition
- this is what the model would predict as the anxiety level for someone in the Control condition

For $b_1 = -1.685$, students might say:
- this is the difference between the two group means; this is the difference between mean of Control group and mean of Dog group
- we add this to 5.22 in order to get the model prediction for someone in the Dog condition
- this is how far down you have to go to get from the line in the Control group to the line in the Dog group
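
If students doubt that the estimates are just group means in disguise, a tiny example can make the link concrete. This is only a sketch on made-up numbers (`toy`, `toy_model`, and the values are hypothetical, not the study data). It assumes R's default coding, where the first factor level (here Control) is the reference group:

```r
# Hypothetical toy data (made-up values, not the study data)
toy <- data.frame(
  condition     = factor(c("Control", "Control", "Control", "Dog", "Dog", "Dog")),
  later_anxiety = c(6, 5, 4, 4, 3, 2)
)

toy_model <- lm(later_anxiety ~ condition, data = toy)
b <- coef(toy_model)

# Group means computed directly from the data
group_means <- tapply(toy$later_anxiety, toy$condition, mean)

# b0 matches the Control group mean
b[["(Intercept)"]]
group_means[["Control"]]

# b1 matches the Dog mean minus the Control mean
b[["conditionDog"]]
group_means[["Dog"]] - group_means[["Control"]]
```

The same comparison can be made with the real data by setting `coef(later_condition_model)` next to the means from `favstats(later_anxiety ~ condition, data = er_anxiety)`.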

</div>

**3.4:** What would the condition model predict as the `later_anxiety` for someone who got dog therapy? How about someone who didn't?

<div class="alert alert-block alert-warning">

- Dog condition: 5.22 - 1.685 = 3.535
- Control condition: 5.22 

No R code is necessary for this question but some students might run `predict()` like this: `predict(later_condition_model)`
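
For students who want to see where these numbers come from, the predictions can be written out by plugging the value of $X_i$ into the fitted model. This assumes R's default dummy coding, where $X_i = 0$ for the Control condition and $X_i = 1$ for the Dog condition:

$$\text{Control: } \hat{Y}_i = 5.22 - 1.685(0) = 5.22$$

$$\text{Dog: } \hat{Y}_i = 5.22 - 1.685(1) = 3.535$$
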
</div>

**3.5:** Why does the model predict lower anxiety for those in the dog condition?

<div class="alert alert-block alert-warning">

Answers may vary:

- Because the patients in the dog condition reported, on average, lower anxiety. 
- Dog condition's mean is lower.
- More people in Dog condition had lower later anxiety.

If students say "Because I believe dog therapy works!" you may want to nudge them to consider why the *model* seems to believe that as well.

</div>

<div class="alert alert-block alert-success">

### 4.0 - Approximate Time:  10-15 mins

</div>

## 4.0: Modeling Variation in `base_anxiety`

In this section, let's focus on the hypothesis that `condition` **could not** explain variation in `base_anxiety`. 

**4.1:** Which of the following lines of R code would find the best fitting model to represent this hypothesis? (Delete the other one.)

In [None]:
# delete one
lm(base_anxiety ~ condition, data = er_anxiety)

lm(base_anxiety ~ NULL, data = er_anxiety)

# Complete Version: Delete the model that includes condition

**4.2:** Add the best fitting empty model to this faceted histogram below. Also add it to the jitter plot below.


In [None]:
gf_histogram(~ base_anxiety, data = er_anxiety, binwidth = 1, fill = "gold") %>%
  gf_facet_grid(condition ~ .)

gf_jitter(base_anxiety ~ condition, data = er_anxiety, width = .1, color = "gold")

In [None]:
# Complete Version
empty_model <- lm(base_anxiety ~ NULL, data = er_anxiety)

gf_histogram(~ base_anxiety, data = er_anxiety, binwidth = 1, fill = "gold") %>%
  gf_facet_grid(condition ~ .) %>%
  gf_model(empty_model)

gf_jitter(base_anxiety ~ condition, data = er_anxiety, width = .1, color = "gold") %>%
  gf_model(empty_model)

<div class="alert alert-block alert-info">

<b> <font size="+1">Key Question</font></b>

**4.3:** Write the best fitting equation of the empty model in GLM notation. Interpret the parameter estimate. How does this number relate to the model shown in the graph?

</div>

<div class="alert alert-block alert-warning">

**Sample Response:**

***GLM:***

- $base\_anxiety_i = 6.12 + e_i$
- $Y_i = 6.12 + e_i$


***Interpretation:***

The empty model predicts that any patient (regardless of `condition`) will have a base anxiety rating of about 6.12, which is the mean of `base_anxiety`.

This number is represented by the line for the model. The line for the model is plotted at the empty model's prediction of 6.12.

</div>

**4.4:** Why does the empty model predict the same base anxiety level for people in the control condition as well as those who got dog therapy?

<div class="alert alert-block alert-warning">

**Sample Responses:**

- The empty model does not include any information about `condition` (it's "empty" or "null" of an explanatory variable).

- The empty model is a constant, not a variable, and will just predict the same value for all cases (in this case, the grand mean of base anxiety).
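
A short demonstration can back this up. The sketch below uses made-up numbers (`toy` and `toy_empty` are hypothetical names, and the values are not the study data) to show that the empty model's one estimate is the grand mean, and that every case gets that same prediction:

```r
# Hypothetical toy data (made-up values, not the study data)
toy <- data.frame(
  condition    = factor(c("Control", "Control", "Dog", "Dog")),
  base_anxiety = c(5, 7, 6, 8)
)

toy_empty <- lm(base_anxiety ~ NULL, data = toy)

# The single estimate of the empty model is the grand mean
coef(toy_empty)[["(Intercept)"]]
mean(toy$base_anxiety)

# Every case gets the same prediction, whatever its condition
empty_predictions <- predict(toy_empty)
unique(round(empty_predictions, 10))
```

Because `condition` never appears in the model formula, it has no way to change the prediction.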

</div>

Someone who didn't know very much about how this data was collected decided to create a model using `condition` to predict variation in baseline anxiety. They added it to this jitter plot. 

In [None]:
base_condition_model <- lm(base_anxiety ~ condition, data = er_anxiety)

gf_jitter(base_anxiety ~ condition, data = er_anxiety, width = .1, color = "gold") %>%
  gf_model(base_condition_model)

base_condition_model

**4.5:** They wrongly concluded that dog therapy could actually increase baseline anxiety because the $b_1$ was .42 and they saw that the average anxiety in the dog condition is actually higher than in the control condition. How would you disabuse them of this idea?

<div class="alert alert-block alert-warning">

**Sample Responses:**

- Tell them about error due to natural sampling variation.

- Tell them that for dog therapy to potentially cause that difference, it would have had to have happened *before* the base anxiety.

- Tell them that even though there is a difference, it is not a big enough difference to be meaningful.

- Tell them that even though $b_1$ isn't *exactly* zero, the difference between the groups could still be effectively zero. A difference this small could have come about by chance.

</div>

<div class="alert alert-block alert-success">

### 5.0 - Approximate Time:  15-20 mins

</div>

## 5.0: Comparing `condition` Models

**5.1:** Here are three models using `condition` to explain variation in `base_anxiety`, `later_anxiety`, and `last_anxiety`. Which of these models visually looks like it explains the most variation in anxiety?

In [None]:
base_condition_model <- lm(base_anxiety ~ condition, data = er_anxiety)
gf_jitter(base_anxiety ~ condition, data = er_anxiety, width = .1, color = "gold") %>%
  gf_model(base_condition_model)

later_condition_model <- lm(later_anxiety ~ condition, data = er_anxiety)
gf_jitter(later_anxiety ~ condition, data = er_anxiety, width = .1, color = "orange") %>%
  gf_model(later_condition_model)

last_condition_model <- lm(last_anxiety ~ condition, data = er_anxiety)
gf_jitter(last_anxiety ~ condition, data = er_anxiety, width = .1, color = "red") %>%
  gf_model(last_condition_model)

<div class="alert alert-block alert-warning">

**Sample Responses:**

Visually, it looks like the `condition` model explains the most variation for `last_anxiety`. That model has the biggest gap, or difference, between the two lines of the model (the largest $b_1$). 

</div>

<div class="alert alert-block alert-info">

<b> <font size="+1">Key Question</font></b>

**5.2:** Use a statistical measure to figure out which of these models explains the most variation in anxiety. Does your conclusion fit your visual intuition (from 5.1)?

</div>

In [None]:
# Sample Response
supernova(base_condition_model)
supernova(later_condition_model)
supernova(last_condition_model)


<div class="alert alert-block alert-warning">

**Sample Responses:**

If we use PRE, or Model SS, then we can see that condition explains the most variation (or, reduces the most SS error) in last_anxiety, compared to the other two condition models.
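
To connect PRE to the sums of squares in the table, it can help to compute it by hand once. The sketch below does this on made-up numbers (`toy`, `toy_model`, and the values are hypothetical, not the study data):

```r
# Hypothetical toy data (made-up values, not the study data)
toy <- data.frame(
  condition     = factor(c("Control", "Control", "Control", "Dog", "Dog", "Dog")),
  later_anxiety = c(6, 5, 4, 4, 3, 2)
)

toy_model <- lm(later_anxiety ~ condition, data = toy)

# SS Total: squared deviations from the grand mean (error from the empty model)
ss_total <- sum((toy$later_anxiety - mean(toy$later_anxiety))^2)

# SS Error: squared residuals left over from the condition model
ss_error <- sum(resid(toy_model)^2)

# SS Model: the error that the condition model removed
ss_model <- ss_total - ss_error

# PRE: the proportion of SS Total explained by the model
pre <- ss_model / ss_total
pre
```

For a model like this one, `pre` should equal `summary(toy_model)$r.squared`, which is a useful check that PRE and R-squared are the same quantity.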

</div>

**5.3:** Why are there fewer degrees of freedom available in the ANOVA table for `last_anxiety` than in the others?

<div class="alert alert-block alert-warning">

**Sample Responses:**

Because there are fewer cases for that measure (it has a smaller n).
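
The link between missing ratings and degrees of freedom can be shown directly. This is a sketch on made-up numbers (`toy`, `later_toy`, and `last_toy` are hypothetical names, and the values are not the study data). It relies on `lm()` dropping rows with a missing outcome, which is R's default behavior:

```r
# Hypothetical toy data: two patients have no last_anxiety rating
toy <- data.frame(
  condition     = factor(c("Control", "Control", "Control", "Dog", "Dog", "Dog")),
  later_anxiety = c(6, 5, 4, 4, 3, 2),
  last_anxiety  = c(5, NA, 4, 3, NA, 1)
)

later_toy <- lm(later_anxiety ~ condition, data = toy)
last_toy  <- lm(last_anxiety ~ condition, data = toy)

# Number of cases each model actually used
nobs(later_toy)
nobs(last_toy)

# Total df is n - 1, so fewer cases means fewer degrees of freedom
nobs(later_toy) - 1
nobs(last_toy) - 1
```

Students can run `nobs()` on `later_condition_model` and `last_condition_model` to see the same pattern in the real data.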

</div>

**5.4 -- BONUS:** Can you create a new variable to look at the change in anxiety from baseline to 30 minutes later? How much variation in this change is explained by `condition`?

In [None]:
# Sample Response

# Create a variable that gets the difference between base anxiety and later anxiety
er_anxiety$base_later_diff <- er_anxiety$base_anxiety - er_anxiety$later_anxiety
head(er_anxiety)

# Plot with Models for change from base to later
gf_jitter(base_later_diff ~ condition, data = er_anxiety, width = .2, height = .2, color = "orchid4") %>%
  gf_model(base_later_diff ~ NULL, color = "blue") %>%
  gf_model(base_later_diff ~ condition, color = "orange")

# Fit the models for change from base to later
empty_model_later_diff <- lm(base_later_diff ~ NULL, data = er_anxiety)
empty_model_later_diff

condition_model_later_diff <- lm(base_later_diff ~ condition, data = er_anxiety)
condition_model_later_diff

# Evaluate the models
supernova(condition_model_later_diff)


<div class="alert alert-block alert-warning">

**Sample Responses:**

According to the PRE, condition explains about 19% of the variation in the difference between baseline anxiety and later anxiety.


**Note to Instructors:**

Students can also try looking at the difference between baseline and last anxiety.

Also, because the outcome variable changes from measures of anxiety to a difference, with positive and negative values, you may want to get them to take time to interpret their values.

</div>