## MANOVA

The difference between ANOVA and MANOVA (Multivariate Analysis of Variance) is that ANOVA analyzes a single 
dependent variable, whereas MANOVA analyzes two or more dependent variables simultaneously. 
Like ANOVA, MANOVA comes in one-way and two-way forms. 
The number of factor variables involved distinguishes a one-way MANOVA (one factor) from a two-way MANOVA (two factors). 
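
As a rough sketch of how the two designs differ in R's formula notation, the example below fits both to the built-in `mtcars` data. The choice of `mtcars` and of the columns `mpg`, `disp`, `am`, and `cyl` is an assumption made purely for illustration; it is not the data analyzed in this section.

```R
# Illustrative only: mtcars stands in for any data set with
# several numeric responses and one or two grouping factors.
cars <- transform(mtcars, am = factor(am), cyl = factor(cyl))

# One-way MANOVA: a single factor on the right-hand side.
one_way <- manova(cbind(mpg, disp) ~ am, data = cars)

# Two-way MANOVA: two factors plus their interaction.
two_way <- manova(cbind(mpg, disp) ~ am * cyl, data = cars)

# Model terms in each fit
attr(terms(one_way), "term.labels")
attr(terms(two_way), "term.labels")
```

The `*` operator expands to both main effects and their interaction; writing `am + cyl` instead would fit the main effects only.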

In the example below, the null hypothesis is that the two-dimensional mean-vector of 
water hardness and mortality is the same for cities in the North and the South. 
It can be tested with the Hotelling-Lawley test in MANOVA. 
The R function `manova` can be used to fit such a model. 
The corresponding `summary` method performs the test specified by its `test` argument. 
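
For a `manova` fit, `summary()` accepts `"Pillai"` (the default), `"Wilks"`, `"Hotelling-Lawley"`, and `"Roy"` as values of `test`. The sketch below loops over all four. It assumes the built-in `mtcars` data as a stand-in, with the two-level factor `am` playing the role of a two-group comparison. With only two groups, the four statistics reduce to the same F test, so the p-values should coincide.

```R
# Illustrative only: mtcars and its columns are stand-ins.
cars <- transform(mtcars, am = factor(am))
fit <- manova(cbind(mpg, disp) ~ am, data = cars)

# Run the same fit through each available multivariate test
tests <- c("Pillai", "Wilks", "Hotelling-Lawley", "Roy")
p_values <- sapply(tests, function(tt) {
  # Row 1 of the stats matrix is the factor; the last row is Residuals
  summary(fit, test = tt)$stats[1, "Pr(>F)"]
})
p_values
```

With a factor that has more than two levels, the four tests can give different p-values, so it is worth stating which one was used.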

The water hardness and mortality data for 61 large towns in England and Wales can be 
obtained from the `HSAUR` package in R.

In [None]:
library(HSAUR)
data("water", package = "HSAUR")
head(water)
str(water)

In [None]:
summary(manova(cbind(hardness, mortality) ~ location, data = water), test = "Hotelling-Lawley")

The `cbind` call combines `hardness` and `mortality` into a multivariate response variable to be modelled. 
The p-value associated with the Hotelling-Lawley statistic is very small. 
This is strong evidence that the mean vectors of the two variables differ between the two regions.

**NOTE:** The model now has a _multivariate_ dependent variable, in this case the 2-tuple `(hardness, mortality)`:

```R
cbind(hardness, mortality) ~ location
```

Recall that the `t()` function transposes a matrix.
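
A small sketch of how `t()` reshapes the output of `tapply`, using a made-up vector and grouping factor (both hypothetical): `tapply` returns a named one-dimensional array of group means, and `t()` turns it into a one-row matrix with one column per group.

```R
# Hypothetical toy data, not taken from any data set in this section
values <- c(10, 20, 30, 40, 50, 60)
group  <- factor(c("North", "North", "South", "South", "South", "North"))

# Mean of `values` within each level of `group`
group_means <- tapply(values, group, mean)
group_means      # North = 30, South = 40

# Transpose into a 1 x 2 matrix for a compact, row-wise display
row_means <- t(group_means)
dim(row_means)
```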

Now, review the API documentation for `tapply`.

In [None]:
help(tapply)

In [None]:
t(tapply(water$hardness, water$location, mean))

In [None]:
t(tapply(water$mortality, water$location, mean))

There are large differences between the two regions in both water hardness and mortality: 
low mortality is associated with hard water in the South, and high mortality with soft water in the North.
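
A significant multivariate test does not say which response variable differs between groups. One common follow-up is to inspect the univariate ANOVA table for each response, which `summary.aov()` produces from a `manova` fit. The sketch below assumes the built-in `mtcars` data as a stand-in; the columns `mpg`, `disp`, and `am` are chosen only for illustration.

```R
# Illustrative only: mtcars and its columns are stand-ins.
cars <- transform(mtcars, am = factor(am))
fit <- manova(cbind(mpg, disp) ~ am, data = cars)

# One univariate ANOVA table per response variable
univariate <- summary.aov(fit)
univariate
length(univariate)
```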

Now, let's look at our familiar auto-mpg data again.

In [None]:
auto_data=read.csv("/dsa/data/all_datasets/auto-mpg/auto-mpg.csv")

head(auto_data)

In [None]:
str(auto_data)
auto_data$origin = factor(auto_data$origin)
auto_data$cylinders = factor(auto_data$cylinders)

Let's create a multivariate dependent variable from 
`mpg`, `displacement`, `weight`, and `acceleration`.

In [None]:

m.model <- manova(cbind(mpg, displacement, weight, acceleration) ~ origin * cylinders, data = auto_data)

summary(m.model, test = "Hotelling-Lawley")

Again, the p-values indicate that the mean vectors differ across the groups formed by the factors `origin` and `cylinders`. 

Now let's look at the mean of each dependent variable, grouped by `origin`.

In [None]:
print('mpg vs origin')
t(tapply(auto_data$mpg, auto_data$origin, mean))

print('displacement vs origin')
t(tapply(auto_data$displacement, auto_data$origin, mean))

print('weight vs origin')
t(tapply(auto_data$weight, auto_data$origin, mean))

print('acceleration vs origin')
t(tapply(auto_data$acceleration, auto_data$origin, mean))

The mean values for origin 1 vehicles differ markedly from those for origins 2 and 3, while vehicles from origins 2 and 3 have similar mean values. 
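
The per-variable group means can also be collected into a single table with `aggregate()`. The sketch below is a minimal illustration on the built-in `mtcars` data, which is an assumed stand-in: `cyl` plays the role of the grouping factor, and `mpg`, `disp`, `wt`, and `qsec` play the role of the dependent variables.

```R
# Illustrative only: mtcars stands in for the data used in this section.
# One row per level of cyl, one column of means per response.
group_means <- aggregate(cbind(mpg, disp, wt, qsec) ~ cyl,
                         data = mtcars, FUN = mean)
group_means
```

The formula mirrors the one passed to `manova`, so the same `cbind(...) ~ factor` expression can be reused for both the test and the descriptive table.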

ANOVA and MANOVA help you judge whether a factor matters by comparing the variation between groups with the variation within them. 
If the means of the dependent variable (or variables) do not differ significantly across the levels of a factor, 
that factor contributes little to predicting the dependent variable and can be considered for exclusion from model fitting. 


# Save your notebook!