# DS105 Intermediate Statistics: Lesson Six Companion Notebook

### Table of Contents <a class="anchor" id="DS105L6_toc"></a>

* [Table of Contents](#DS105L6_toc)
    * [Page 1 - Mixed Measures ANOVAs Introduction](#DS105L6_page_1)
    * [Page 2 - Mixed Measures ANOVAs Setup in R](#DS105L6_page_2)
    * [Page 3 - Mixed Measures ANOVAs Analysis in R](#DS105L6_page_3)
    * [Page 4 - Mixed Measure ANOVA in R Activity](#DS105L6_page_4)
    * [Page 5 - Mixed Measure ANOVA in R Activity Solution](#DS105L6_page_5)
    * [Page 6 - Key Terms](#DS105L6_page_6)
    * [Page 7 - DS105 Lesson 06 Hands-On](#DS105L6_page_7)
    * [Page 8 - DS105 Lesson 06 Hands-On Solution](#DS105L6_page_8)    

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 1 - Mixed Measures ANOVAs Introduction<a class="anchor" id="DS105L6_page_1"></a>

[Back to Top](#DS105L6_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

```python
from IPython.display import VimeoVideo
# Tutorial Video Name: Introduction to Mixed Measures ANOVAs
VimeoVideo('390079681', width=720, height=480)
```

The transcript for the above overview video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO105L06overview.zip)**.


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 2 - Mixed Measures ANOVAs Setup in R<a class="anchor" id="DS105L6_page_2"></a>

[Back to Top](#DS105L6_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


```python
from IPython.display import VimeoVideo
# Tutorial Video Name: Mixed Measures ANOVA Part I
VimeoVideo('340999425', width=720, height=480)
```

# Mixed Measures ANOVAs Setup in R

Recall that a mixed measures ANOVA includes both a within-subjects and a between-subjects variable. You'll follow a very similar process to do a mixed measures ANOVA as you would to do a repeated measures or within-subjects ANOVA, but you will add in an additional factor of treatment group (whether participants ate breakfast or not).

---

## Load Libraries

The ```aov()``` function you will use for mixed measures ANOVAs comes as part of base R, so the only libraries you will need to load in are ```rcompanion```, because you'll use it to check for the assumption of normality, and ```car```, which you'll use to test for homogeneity of variance and, if needed, to run an ANOVA that corrects for a violation of it.

```{r}
library("rcompanion")
library("car")
```

---

## Load in Data

You will be examining **[data](https://repo.exeterlms.com/documents/V2/DataScience/Intermediate-Stats/breakfast.zip)** from a study about the effect of eating breakfast on weight loss and associated metrics, such as resting metabolic rate and waist circumference.  Most metrics were measured at baseline, and then again at follow-up, which was six weeks later.

---

## Question Setup

With this data, you will answer the question: 

```text
Did those who ate breakfast in the morning improve their resting metabolic rate from baseline to follow up compared to those who skipped breakfast?  
```

In order to answer this question, your first x, or independent variable, will be the ```Treatment.Group``` of whether they ate breakfast in the morning or not. Your second x will be time, whether it was baseline or follow up. Your y, or dependent variable, will be resting metabolic rate.  As with all ANOVAs, the IV will be categorical, and the DV will be continuous.

---

## Data Wrangling

Data wrangling for the mixed measures ANOVA is done exactly the same as it was for the repeated measures ANOVA. 
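As a quick refresher on what that wrangling looks like, here is a minimal sketch on a made-up data frame. The `toy` data, its values, and the `Baseline.RMR` and `Follow.Up.RMR` column names are hypothetical stand-ins, not the real breakfast columns. The `repdat` and `contrasts` names follow the convention used in the analysis later in this lesson.

```{r}
# Hypothetical wide data: one row per participant (all values are made up)
toy <- data.frame(
  Participant.Code = 1:4,
  Treatment.Group  = c("Breakfast", "Breakfast", "Fasting", "Fasting"),
  Baseline.RMR     = c(1500, 1620, 1480, 1710),
  Follow.Up.RMR    = c(1510, 1600, 1495, 1690)
)

# One copy of the identifying columns for the baseline measurement
t1 <- toy[, c("Participant.Code", "Treatment.Group")]
t1$repdat    <- toy$Baseline.RMR
t1$contrasts <- "T1"

# A second copy for the follow-up measurement
t2 <- toy[, c("Participant.Code", "Treatment.Group")]
t2$repdat    <- toy$Follow.Up.RMR
t2$contrasts <- "T2"

# Stack the two copies: the long data has two rows per participant
toy_long <- rbind(t1, t2)
toy_long
```

The treatment group column simply rides along in both copies, which is what lets you use it as the between-subjects factor later.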

---

## Testing Assumptions 

The assumptions for a mixed measure ANOVA are the same as the ones you learned for a repeated measures ANOVA.  The only thing that differs is the sample size, because you now have two IVs. 

---

### Sample Size

A mixed measures ANOVA requires a sample size of at least 20 per independent variable, counting the time factor.  In this case, you have one between-subjects variable (treatment group) and one within-subjects factor (time).  So, you need 40 cases.  You are seven cases short of this requirement, clocking in at only *n* = 33, but for learning purposes, you will proceed. However, typically if you did not have a large enough sample size, you would either want to simplify your model (remove either the treatment group or the time variable), choose a different analysis, or run a procedure called bootstrapping, which repeatedly re-samples your data with replacement to approximate the sampling distribution of your statistic.
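The arithmetic behind that check can be sketched in a few lines. The variable names here are made up for illustration, and the participant count is typed in by hand rather than read from the breakfast data.

```{r}
# Rule of thumb from this lesson: at least 20 cases per factor
cases_per_factor <- 20
n_factors        <- 2      # treatment group + time
required_n       <- cases_per_factor * n_factors

# Participants in the breakfast data (entered by hand for this sketch)
actual_n  <- 33
shortfall <- required_n - actual_n

required_n            # 40
shortfall             # 7
actual_n >= required_n  # FALSE, so the requirement is not met
```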

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 3 - Mixed Measures ANOVAs Analysis in R<a class="anchor" id="DS105L6_page_3"></a>

[Back to Top](#DS105L6_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Mixed Measures ANOVAs Analysis in R

Alright! You've done all the prep work, now it's time for the fun! 

---

## Analysis


You will continue to use the ```aov()``` function, but add some additional arguments to it to make it mixed measures. 

<a class="anchor" id="DS105L6_page_3#breakfast5"></a>

```{r}
RManova1 <- aov(repdat~(Treatment.Group*contrasts)+Error(Participant.Code/(contrasts)), breakfast5)
summary(RManova1)
```

So what's happening here is that you are calling the ```aov()``` function on your two factors and the interaction between them.  First, you are saying that you want to model the repeated data of resting metabolic rate by treatment group, by your time factor (from baseline to follow-up), and by the interaction of the two. That's this part: ```repdat~(Treatment.Group*contrasts)```. Next, you are adding in your error term, which is specified in this model by the command ```Error()```. In the error term, you are placing your subject identifier (which matches the pre and the post data together), and you also note that the time factor is nested within each subject, since every participant was measured at both time points. That's what this part of the code is doing: ```+Error(Participant.Code/(contrasts))```. Finish it all off by specifying the dataset at the end and you are good to go. Call ```summary()``` on the results:  

```text
Error: Participant.Code
                Df Sum Sq Mean Sq
Treatment.Group  1 154931  154931

Error: Participant.Code:contrasts
          Df Sum Sq Mean Sq
contrasts  1  717.2   717.2

Error: Within
                          Df  Sum Sq Mean Sq F value Pr(>F)
Treatment.Group            1      75      75   0.002  0.962
contrasts                  1    5208    5208   0.154  0.696
Treatment.Group:contrasts  1     921     921   0.027  0.869
Residuals                 58 1956447   33732            
```

This output is looking a little crazypants compared to some of the previous output, so let's break it down! First of all, the only information you really need to pay attention to is in the last two columns of the ```Error: Within``` table: the *F* value and the associated *p* value.  

The first row of that table is the treatment group (skipping breakfast or eating breakfast) and looks at differences in resting metabolic rate by treatment group, regardless of the time point. It's basically a one-way ANOVA.  

Same thing with the second row! But instead of treatment group, you have time as your one-way factor.  This row is just looking at change in resting metabolic rate from time point one to time point two, regardless of what treatment group the subjects were in.

The third row, however, focuses on the interaction between those two things.  This is where the two-way design part comes in.  This line is called the *interaction effect*.  It looks at change in the resting metabolic rate over time by treatment group. 
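If the difference between the two main effects and the interaction feels abstract, a tiny table of cell means can help. This is only a sketch: the numbers below are made up and are not taken from the breakfast study.

```{r}
# Hypothetical cell means (made-up numbers): rows = group, columns = time
cell_means <- matrix(c(1500, 1520,
                       1490, 1495),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(Group = c("Breakfast", "Fasting"),
                                     Time  = c("T1", "T2")))

# Main effect of group: compare the row averages, ignoring time
group_means <- rowMeans(cell_means)

# Main effect of time: compare the column averages, ignoring group
time_means <- colMeans(cell_means)

# Interaction: does the change over time differ between the groups?
change_by_group      <- cell_means[, "T2"] - cell_means[, "T1"]
interaction_contrast <- unname(change_by_group["Breakfast"] - change_by_group["Fasting"])

group_means
time_means
interaction_contrast
```

In this made-up table the breakfast group rises by 20 and the fasting group by 5, so the interaction contrast is 15. The interaction row of the ANOVA asks whether a gap like that is bigger than you would expect by chance.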

Unfortunately, absolutely nothing is significant here.  There was no significant difference in resting metabolic rate between those who ate breakfast and those who didn't, there was no significant difference in resting metabolic rate between baseline and follow up, and there was no significant interaction between these factors.  
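One thing worth checking before you trust output like this: in the tables above, the ```Error: Participant.Code``` and ```Error: Participant.Code:contrasts``` sections each have a single degree of freedom, which is the pattern ```aov()``` produces when the subject identifier is stored as a number rather than a factor. The sketch below uses a made-up data frame (`toy`, with hypothetical `ID`, `Group`, `Time`, and `y` columns) to show the same kind of model with every grouping variable converted to a factor first. It is an illustration under those assumptions, not a re-run of the breakfast analysis.

```{r}
# Hypothetical long data: 6 participants, 2 groups, 2 time points (values made up)
toy <- data.frame(
  ID    = rep(1:6, times = 2),
  Group = rep(c("A", "A", "A", "B", "B", "B"), times = 2),
  Time  = rep(c("T1", "T2"), each = 6),
  y     = c(10, 12, 11, 14, 13, 15,
            12, 13, 13, 15, 13, 17)
)

# Convert the identifier and both factors explicitly;
# a numeric ID would otherwise be treated as a number, not as a subject label
toy$ID    <- factor(toy$ID)
toy$Group <- factor(toy$Group)
toy$Time  <- factor(toy$Time)

# Same model shape as in the lesson: between factor * within factor,
# with the within factor nested inside each subject in the error term
fit <- aov(y ~ Group * Time + Error(ID / Time), data = toy)
summary(fit)
```

With the identifier as a factor, ```summary()``` reports an ```Error: ID``` table holding the between-subjects test of `Group`, and an ```Error: ID:Time``` table holding the within-subjects tests of `Time` and `Group:Time`, each with its own residual row. Running ```str()``` on your long data is a quick way to see how each column is stored.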

---

## Post Hocs

After finding such a load of bupkis above, there is no need to worry about post hocs here! 

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 4 - Mixed Measure ANOVA in R Activity<a class="anchor" id="DS105L6_page_4"></a>

[Back to Top](#DS105L6_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


For this Activity, you will perform a mixed measures ANOVA in R. `This Activity will not be graded`, but you are encouraged to complete it. The best way to become a great data scientist is to practice! Once you have submitted your project, you will be able to access the solution on the next page. Note that the solution will be slightly different from yours, but should look similar.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Do not submit your project until you have completed all requirements, as you will not be able to resubmit.</p>
    </div>
</div>

---

## Requirements

Using the **[breakfast data from last page](https://repo.exeterlms.com/documents/V2/DataScience/Intermediate-Stats/breakfast.zip)**, determine whether body mass changes from baseline to follow up based upon whether or not a person eats breakfast in the morning. In order to do this, you will need to: 

* Wrangle the data
* Test for assumptions
* Run the analysis for mixed measures ANOVA

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Be sure to zip and submit your entire directory when finished!</p>
    </div>
</div>

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 5 - Mixed Measure ANOVA in R Activity Solution<a class="anchor" id="DS105L6_page_5"></a>

[Back to Top](#DS105L6_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Solution

---

## Answer 

Whether a participant ate breakfast or not did not impact their change in body mass from baseline to follow up!  

---

## Code

```{r}
# Load the libraries used for the assumption checks
library("rcompanion")
library("car")

# Subsetting the data to get rid of unnecessary rows and columns

breakfast1 <- breakfast[1:33,1:7]

# Making the data "long" to be ready for a mixed measures analysis
# (the body mass columns are kept, since body mass is the outcome here)

keeps <- c("Participant.Code", "Treatment.Group", "Age..y.", "Sex", "Height..m.", "Baseline.Body.Mass..kg.", "Follow.Up.Body.Mass..kg.")
breakfast2 <- breakfast1[keeps]

# Baseline measurements, labeled as time point T1
breakfast3 <- breakfast2[,1:5]
breakfast3$repdat <- breakfast2$Baseline.Body.Mass..kg.
breakfast3$contrasts <- "T1"

# Follow-up measurements, labeled as time point T2
breakfast4 <- breakfast2[,1:5]
breakfast4$repdat <- breakfast2$Follow.Up.Body.Mass..kg.
breakfast4$contrasts <- "T2"

# Stack baseline and follow-up so each participant has two rows
breakfast5 <- rbind(breakfast3, breakfast4)

# Testing for Normality

plotNormalHistogram(breakfast1$Baseline.Body.Mass..kg.)
plotNormalHistogram(breakfast1$Follow.Up.Body.Mass..kg.)

# They look approximately normal, so don't need transformation

# Testing for Homogeneity of Variance

leveneTest(repdat ~ Treatment.Group*contrasts, data=breakfast5)

# It was not significant, which means this assumption has been met

# Running the mixed measures ANOVA

RManova3 <- aov(repdat~(Treatment.Group*contrasts)+Error(Participant.Code/(contrasts)), breakfast5)
summary(RManova3)

# Nothing was significant here either!
```

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 6 - Key Terms<a class="anchor" id="DS105L6_page_6"></a>

[Back to Top](#DS105L6_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Key Terms

Below is a list and short description of the important keywords learned in this lesson. Please read through and go back and review any concepts you do not fully understand. Great Work!

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Mixed Measures ANOVA</td>
        <td>An ANOVA that uses both between and within subjects independent variables.</td>
    </tr>
</table>

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 7 - DS105 Lesson 06 Hands-On<a class="anchor" id="DS105L6_page_7"></a>

[Back to Top](#DS105L6_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


For your Hands-On, you will be analyzing 20 years of suicide data from different countries around the world.  Although this is a depressing topic, it's an important issue to analyze, so that suicides can be prevented.  

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Do not submit your project until you have completed all requirements, as you will not be able to resubmit.</p>
    </div>
</div>

---

## Requirements

This Hands-On uses **[this data](https://repo.exeterlms.com/documents/V2/DataScience/Intermediate-Stats/suicide.zip)**. 

You will determine whether suicide rates (```suicides/100k pop```) have changed over the years (```year```), and see if the ```generation``` has any influence. To do so, you will be using a mixed measures ANOVA, since there is both a repeated time element and a between subjects element. Provide a one-sentence conclusion at the bottom of your program file about the analysis you performed. 

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Be sure to zip and submit your entire directory when finished!</p>
    </div>
</div>



<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 8 - DS105 Lesson 06 Hands-On Solution<a class="anchor" id="DS105L6_page_8"></a>

[Back to Top](#DS105L6_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Lesson 6 Hands-On Solution

Below you will find the R code to come up with the solution for the Lesson 6 Hands-On.

```{r}
library("rcompanion")
library("car")
library("IDPmisc")
library("dplyr")

# Number of suicides by generation by country, with country being the repeated factor

## Check Assumptions

### Normality
plotNormalHistogram(suicide$suicides.100k.pop)

suicide$suicides.100k.popSQRT <- sqrt(suicide$suicides.100k.pop)
plotNormalHistogram(suicide$suicides.100k.popSQRT)

suicide$suicides.100k.popLOG <- log(suicide$suicides.100k.pop)

suicide4 <- NaRV.omit(suicide)

plotNormalHistogram(suicide4$suicides.100k.popLOG)

#### Use the log

### Homogeneity of Variance

leveneTest(suicides.100k.popLOG ~ generation, data=suicide4)

#### This failed the assumption, but proceed anyway for learning purposes

### Sample size -you have more than enough data

## Run the analysis

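# If the file's byte-order mark garbled the first column name on import
# (it shows up as "ï..country"), rename that first column before modeling
names(suicide4)[1] <- "country"

RManova1 <- aov(suicides.100k.popLOG~(generation*year)+Error(country/(year)), suicide4)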
summary(RManova1)

### Looks like there is a generational effect to suicide, and an interaction to how the year has affected the generation

## Post hocs

pairwise.t.test(suicide4$suicides.100k.popLOG, suicide4$generation, p.adjust="bonferroni")

### Looks like there is a difference in suicide rates among ALL the generations

## Determine Means and Draw Conclusions

suicideMeans <- suicide4 %>% group_by(generation, year) %>% summarize(Mean=mean(suicides.100k.pop))

# Generation Z is the least likely to commit suicide. They were born mid 90's to early 2000s.
# The GI generation is the most likely. They were born 1901-1924.
# You can see that these differ over time as well - looks like the GI generation and millennials
# just keep rising in terms of suicide rates, while others like Gen Z and Gen X are staying steady.
```