# Biostatistics in R

# **Assignment 7: Analysis of Variance (ANOVA)**

### **Lesson Setup**

Run the next cell to print your current working directory and confirm that it is the folder containing this assignment.

In [None]:
print(paste("My current working directory is", getwd()))

If your current working directory is not correct, you can run the code in the following cell to change it. However, you will need to modify this code to match the folder location on your particular computer system.

In [None]:
# Do NOT run the code in the cell unless you need to change your current working directory

# You must edit the pathname to match the location of this assignment on YOUR computer
# You must also remove the pound sign # before the setwd() command before it will execute

# setwd("C:/Users/David/Biostats/Assignments/Assignment_04")

# **Introduction to ANOVA (One-Way)**
The analysis of variance (ANOVA) can be thought of as an extension of the t-test. The independent t-test is used to compare the means of a condition between 2 groups. ANOVA is used when one wants to compare the means of a condition between more than 2 groups. ANOVA is an omnibus test, meaning it tests the data as a whole. In other words, ANOVA tests whether there is a difference in the means somewhere in the model (an overall effect), but it does not tell one where the difference is if there is one. To find out where the difference lies between the groups, one has to conduct post-hoc tests. This is also covered in this section.

Although ANOVA can be thought of as an extension of the t-test in terms of when to use it, mathematically speaking it is more of a regression model and is considered a general linear model (GLM).

## **ANOVA Assumptions**

There are 3 assumptions that need to be met for the results of an ANOVA test to be considered accurate and trustworthy. It’s important to note that the assumptions apply to the residuals and not to the variables themselves. The ANOVA assumptions are the same as for linear regression and are:

* Normality
    + Caveat: if group sizes are equal, the F-statistic is robust to violations of normality
* Homogeneity of variance
    + Same caveat as above: if group sizes are equal, the F-statistic is robust to violations of this assumption
* Independent observations
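
As a rough illustration of how the first two assumptions might be checked, the sketch below fits a one-way model and tests its residuals. It assumes the **_MASS_** package is installed and uses the packaged dataset with its original column names (`Tetrahydrocortisone`, `Type`). The choice of the Shapiro-Wilk and Bartlett tests is an assumption made for this example, not something this assignment prescribes, and the variable names `fit`, `res`, `normTest` and `varTest` are hypothetical.

```r
# Sketch: checking ANOVA assumptions on the residuals (not on the raw variables)
library(MASS)  # provides the Cushings dataset

# Fit the one-way model and pull out its residuals
fit <- aov(Tetrahydrocortisone ~ Type, data = Cushings)
res <- residuals(fit)

# Normality of residuals: Shapiro-Wilk test (H0: residuals are normally distributed)
normTest <- shapiro.test(res)
print(normTest)

# Homogeneity of variance: Bartlett test (H0: all groups have the same variance)
varTest <- bartlett.test(Tetrahydrocortisone ~ Type, data = Cushings)
print(varTest)
```

A small p-value from either test suggests that the corresponding assumption may be violated. Note that the Bartlett test is itself sensitive to non-normality, so a diagnostic plot such as `plot(fit, which = 1)` can be a useful complement. Independence cannot be tested this way; it depends on how the data were collected.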

## **Data Used in this Assignment**

The data file "Cushings.txt" comes from the `Cushings` dataset in the R package called **_MASS_**. The data file has 27 rows and 3 columns:

**TCort**

The urinary excretion rate (mg/24 hr) of Tetrahydrocortisone

**PregN**

The urinary excretion rate (mg/24 hr) of Pregnanetriol 

**Type**

The type of adrenal cortical tumor coded as a (adenoma), b (bilateral hyperplasia), c (carcinoma) or u for unknown. 

The data were taken from the following reference:

J. Aitchison and I. R. Dunsmore (1975) Statistical Prediction Analysis. Cambridge University Press, Tables 11.1–3.
_(NOTE: The column names have been shortened from the original dataset to make coding easier)_ 

The R code in the cell below loads the Cushing's data into a dataframe called `CushDat` and then uses the `head()` function to print out the first six entries.

In [None]:
# Loading data
CushDat <- read.csv("Cushings.txt")
head(CushDat)
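
If the text file is not available, an equivalent dataframe can probably be built straight from the **_MASS_** package. The sketch below is one way to do that, assuming **_MASS_** is installed; the name `cushAlt` is hypothetical and is used so that `CushDat` is not overwritten. The packaged columns are called `Tetrahydrocortisone` and `Pregnanetriol`, so they are renamed to match the shortened names used in this assignment.

```r
# Sketch: build an equivalent dataframe from the MASS package instead of the text file
library(MASS)

cushAlt <- Cushings                            # copy of the packaged dataset (hypothetical name)
names(cushAlt) <- c("TCort", "PregN", "Type")  # shorten the column names as in this assignment
rownames(cushAlt) <- NULL                      # drop the patient labels stored as row names

str(cushAlt)          # structure: 27 observations of 3 variables
table(cushAlt$Type)   # number of patients in each tumor type
```

In the packaged data `Type` is already stored as a factor, which is the form `aov()` uses for a grouping variable.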

## **EXAMPLE 1: Analysis of Variance (ANOVA)**

The Analysis of Variance (ANOVA) is part of a larger family of statistical analyses known as *_Linear Models_*. The base R installation contains a very powerful function called `lm` for fitting a wide variety of linear models, including ANOVA. Here we will use an R function called `aov()`, which calls `lm` internally to perform the ANOVA. In computer speak, the function `aov` is known as a _wrapper_ for the `lm` function. The main advantage of `aov` is that it prints out the results in the traditional language of ANOVA, while `lm` prints out its results in the traditional form of a linear model. 
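
To make the wrapper relationship concrete, the sketch below fits the same one-way model with both functions and compares the overall F statistic. It assumes the **_MASS_** package is installed and uses the packaged `Cushings` data with its original column names; `aovFit`, `lmFit`, `fAov` and `fLm` are hypothetical names chosen for this illustration.

```r
# Sketch: aov() and lm() fit the same model but report it differently
library(MASS)

aovFit <- aov(Tetrahydrocortisone ~ Type, data = Cushings)
lmFit  <- lm(Tetrahydrocortisone ~ Type, data = Cushings)

summary(aovFit)   # ANOVA-style table: Df, Sum Sq, Mean Sq, F value, Pr(>F)
summary(lmFit)    # regression-style table: one coefficient per non-reference group
anova(lmFit)      # ANOVA table computed from the lm fit

# The overall F statistic should agree between the two fits
fAov <- summary(aovFit)[[1]][["F value"]][1]
fLm  <- unname(summary(lmFit)$fstatistic["value"])
all.equal(fAov, fLm)
```

The two summaries answer slightly different questions: the `aov` table reports the overall effect of `Type`, while the `lm` coefficients compare each tumor type with the reference level.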

The R code in the cell below uses `aov` to perform an ANOVA on the Cushing data. The first argument passed to `aov` is `TCort ~ Type`. This is the formula for the model. In plain English, this notation means "find the best-fit linear model of the urinary level of Tetrahydrocortisone (TCort) as a function of adrenal tumor type (Type)". The second argument passed to `aov` is the name of the dataframe holding the data to be analyzed, in this example `CushDat`. Finally, the results returned by the `aov` function are stored in the variable `aovResults`. 


In [None]:
# Example 1: ANOVA of Cushing data

# Perform ANOVA and store results in the variable "aovResults"
aovResults <- aov(TCort ~ Type, data = CushDat)

# Print out the results stored in "aovResults"
summary(aovResults)

### **Interpretation of the Results**

The null hypothesis (H<sub>0</sub>) of an ANOVA is that the means of all the groups are equal. In this example, that means the average urinary excretion of Tetrahydrocortisone is the same for all four types of adrenal cortical tumors. We fail to reject this null hypothesis **_unless_** the F value is above a critical value that depends upon the degrees of freedom (Df). For this example, the F value (F statistic) was calculated to be 3.226, with 3 degrees of freedom for `Type` and 23 for the residuals. To see whether this F value is statistically significant, we look at the p-value `Pr(>F)`, which in this example is 0.0412 *

If we look at the significance codes, we can see that one asterisk * after the p-value means a p-value between 0.01 and `0.05`. In other words, the F value is statistically significant at the 0.05 significance level (95% confidence). This means that the mean Tetrahydrocortisone excretion is **not** the same for all 4 tumor types, so we reject the null hypothesis in favor of the alternative hypothesis (H<sub>A</sub>) that the mean Tetrahydrocortisone excretion differs for at least one of the tumor types. Without further analysis, we can't say any more than that. For example, we can't say that a carcinoma tumor secretes more than other tumors. We just don't know based solely on the results of our ANOVA. However, there are further statistical tests that we can do if we want to know more about the relationship between Tetrahydrocortisone excretion and different types of adrenal cortical tumors.   
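
The decision rule described above can be sketched in code by pulling the F value and p-value out of the ANOVA table and comparing the p-value with a chosen significance level. The sketch assumes the **_MASS_** package is installed and uses the packaged `Cushings` data with its original column names. The names `fit`, `aovTable`, `fValue`, `pValue`, `alpha` and `tukey` are hypothetical, and Tukey's HSD is shown only as one common choice of follow-up test, not necessarily the one intended by this assignment.

```r
# Sketch: extract F and p from the ANOVA table and apply the decision rule
library(MASS)

fit <- aov(Tetrahydrocortisone ~ Type, data = Cushings)
aovTable <- summary(fit)[[1]]          # the ANOVA table as a data frame

fValue <- aovTable[["F value"]][1]     # first row holds the Type effect
pValue <- aovTable[["Pr(>F)"]][1]
alpha  <- 0.05                         # significance level (assumed for this example)

if (pValue < alpha) {
  cat("Reject H0: at least one group mean differs\n")
} else {
  cat("Fail to reject H0: no evidence that the group means differ\n")
}

# One possible follow-up: Tukey's HSD compares every pair of tumor types
tukey <- TukeyHSD(fit)
print(tukey)
```

Each row of the Tukey table gives the estimated difference between two tumor types, a confidence interval for it, and a p-value adjusted for the number of comparisons.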

## **<u>EXERCISE 1: Analysis of Variance (ANOVA)</u>**

In the code cell below, write the R code to perform an ANOVA on the Cushing data, but this time to find out whether the level of Pregnanetriol differs as a function of adrenal cortical tumor type. 

In [None]:
# Exercise 1: Insert your code for Exercise 1 here



In the cell below, explain the results of the ANOVA.

This method provides more information and is overall more useful. As mentioned earlier, the intercept group is the high dose group, since the high dose group’s data was not included in the model’s formula. Their data is still captured because this group has a value of 0 for both of the other groups' indicator variables.

Something to note: at the bottom of the table there are a few tests that were conducted to check the model’s assumptions. These will be discussed later, along with how to call these diagnostics without printing out the model in the regression format.

Let’s interpret the table. Overall the model is significant, F(2,12) = 5.12, p = 0.0247. This tells us that there is a significant difference in the group means. The coefficients (coef in the table) are the differences in mean between the high dose (reference) group and the respective group listed. The intercept is the mean for the high dose group; the placebo group’s coefficient = 2.2 – 5.0 = -2.8, and the low dose coefficient = 3.2 – 5.0 = -1.8. Looking at the p-values now (P>|t| in the table), we can see that the difference between the high dose group and the placebo group is significant, p = 0.008, but the difference between the low dose group and the high dose group is not, p = 0.065. There is no comparison between the low dose group and the placebo group. I wanted to show you this so you can see where these numbers come from. Coming from the ANOVA framework, the information we are really after in this table is the F-statistic and its corresponding p-value. This tells us whether we explained a significant amount of the overall variance. To test between groups, we need to do some post-hoc testing, where we can compare all groups against each other. We are still missing some useful information with this method: we need an ANOVA table.

(Insert your comments below in this Markdown cell)

