# ANOVA, ANCOVA, MANOVA, & MANCOVA

In [8]:
# Uncomment this line upon first running the notebook if these packages 
# are not installed locally.
#!pip install --user numpy pandas scipy statsmodels

import numpy as np
import pandas as pd
import scipy

import statsmodels
import statsmodels.api as sm
from statsmodels.formula.api import ols

The following vocabulary is necessary to understand the analysis of variance and analysis of covariance tests.

* __F Values:__ Measures that indicate whether there is a significant difference between the means of the test groups. Two values are involved: the __F Critical Value__ and the __F Value__ (or F statistic). If the F value $\geq$ the F critical value, then there may be enough evidence to reject the null hypothesis.
    * __F Critical Value:__ A threshold taken from an [F-distribution](http://www.socr.ucla.edu/Applets.dir/F_Table.html) for a given $\alpha$ and the degrees of freedom of the test.
    * __F Value:__ The value calculated from your data, $F=\frac{\text{explained variance}}{\text{unexplained variance}}$.
    
* __p-value:__ The probability, calculated from the F statistic, of observing a result at least as extreme as yours if the null hypothesis were true. The lower the p-value, the stronger the evidence against the null hypothesis.

* __Alpha Level:__ The probability of a Type 1 error that a researcher is willing to accept. For instance, if a researcher wished to be 95\% confident in their results, they would set $\alpha = (100\% - 95\%) = 5\%$ or 0.05. If the p-value is below the alpha level, then there is enough evidence to reject the null hypothesis.

* __Null Hypothesis:__ The hypothesis that there is no statistically significant difference amongst test groups.

* __Alternate Hypothesis:__ The hypothesis that there is a statistically significant difference in test groups.

* __Type 1 Error:__ False positive, rejection of a true null hypothesis.

* __Type 2 Error:__ False negative, failure to reject a false null hypothesis.

* __Residuals:__ The difference between a dependent variable's observed value and its expected value, $r = x_{observed} - x_{expected}$.
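
The vocabulary above can be tied together with a small worked calculation. The following is a minimal sketch, assuming three made-up test groups (the numbers are hypothetical and not from any data set used in this notebook). It computes the F value by hand, looks up the F critical value for $\alpha = 0.05$, and derives the p-value from the F-distribution.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements for three test groups (made-up numbers).
groups = [
    np.array([4.0, 5.0, 6.0, 5.0]),
    np.array([6.0, 7.0, 8.0, 7.0]),
    np.array([9.0, 10.0, 11.0, 10.0]),
]

k = len(groups)                     # number of groups
n = sum(len(g) for g in groups)     # total number of observations
grand_mean = np.concatenate(groups).mean()

# Explained (between-group) and unexplained (within-group) sums of squares.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Dividing each sum of squares by its degrees of freedom gives a variance.
df_between, df_within = k - 1, n - k
f_value = (ss_between / df_between) / (ss_within / df_within)

alpha = 0.05
f_critical = stats.f.ppf(1 - alpha, df_between, df_within)
p_value = stats.f.sf(f_value, df_between, df_within)

print(f"F value:          {f_value:.3f}")
print(f"F critical value: {f_critical:.3f}")
print(f"p-value:          {p_value:.6f}")
print("Reject the null hypothesis:", f_value >= f_critical)
```

Comparing the F value with the F critical value and comparing the p-value with $\alpha$ are two views of the same decision, so they always agree.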

## Analysis of Variance (ANOVA)

ANOVA refers to a broad range of tests used to determine whether the means of several groups differ, by comparing the variance between group means with the variance within groups. It can be seen as a generalization of the t-test to three or more group means. It relies on the following assumptions:
* __Normal Distribution of Residuals:__ The assumption that the residuals follow a normal distribution.
* __Homoscedasticity (Homogeneity amongst the residuals):__ This is just a fancy way of saying that the residuals have approximately the same variance in every group.
* __Independence of Observations:__ The error of each observation is independent of one another.
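
The first two assumptions can be screened with standard tests. The following is a rough sketch, assuming randomly generated groups (hypothetical data) and using the Shapiro-Wilk test for normality of residuals and Levene's test for equal variances. These are common choices rather than the only ones. Independence depends on how the data were collected and cannot be checked this way.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical data: three groups with different means and the same spread.
groups = [rng.normal(loc=mu, scale=2.0, size=30) for mu in (10.0, 12.0, 15.0)]

# Residuals: each observation minus the mean of its own group.
residuals = np.concatenate([g - g.mean() for g in groups])

# Shapiro-Wilk: the null hypothesis is that the residuals are normal.
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Levene: the null hypothesis is that all groups have equal variance.
levene_stat, levene_p = stats.levene(*groups)

print(f"Shapiro-Wilk p-value: {shapiro_p:.3f}")
print(f"Levene p-value:       {levene_p:.3f}")
```

In both tests a small p-value suggests that the corresponding assumption is violated.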

The null and alternate hypotheses for ANOVA relate to the equality of the group means and are as follows:
* __Null Hypothesis:__ The means of all groups are equal, for example $\mu_1=\mu_2=\mu_3$ for three groups.
* __Alternate Hypothesis:__ At least one group mean differs from the others.
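
The link between ANOVA and the t-test can be seen numerically. The following is a small sketch, assuming three made-up samples `a`, `b`, and `c` (hypothetical values). With two groups, the ANOVA F value equals the square of the pooled-variance t statistic. With three groups, a single F test evaluates the null hypothesis that all means are equal.

```python
import numpy as np
from scipy import stats

# Hypothetical samples (made-up numbers).
a = np.array([5.1, 4.9, 6.2, 5.8, 5.5])
b = np.array([6.9, 7.4, 6.5, 7.1, 7.8])
c = np.array([8.2, 7.9, 9.1, 8.5, 8.8])

# Two groups: the t-test (equal variances assumed) and one-way ANOVA agree.
t_stat, t_p = stats.ttest_ind(a, b)
f_two, f_two_p = stats.f_oneway(a, b)
print(f"t^2 = {t_stat ** 2:.4f}, F = {f_two:.4f}")

# Three groups: one F test of the null hypothesis mu_a = mu_b = mu_c.
f_three, f_three_p = stats.f_oneway(a, b, c)
alpha = 0.05
print(f"F = {f_three:.4f}, p-value = {f_three_p:.6f}")
print("Reject the null hypothesis:", f_three_p < alpha)
```

Rejecting the null hypothesis here only says that at least one mean differs. It does not say which one.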


### Theory



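ANOVA is built upon the variance decomposition set out by [Eve's Law](https://r.amherst.edu/apps/nhorton/Adam-Eve/), also known as the law of total variance:

$$Var(Y)=E[Var(Y|X)]+Var(E[Y|X])$$

For example, when $T=X_1+\dots+X_N$ is the sum of a random number $N$ of independent, identically distributed terms $X_i$, with $N$ independent of the $X_i$, the law gives:

$$Var(T)=E[X]^2Var(N)+E[N]Var(X)$$

This law splits the total variance into a within-group part and a between-group part, which are the two quantities that the F value compares.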
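
The law of total variance listed in the sources can be checked numerically on grouped data. The following is a minimal sketch, assuming randomly generated group labels and responses (hypothetical data). It verifies that the total variance equals the average within-group variance plus the variance of the group means.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: a group label X in {0, 1, 2} and a response Y whose
# mean depends on the group.
labels = rng.integers(0, 3, size=300)
y = rng.normal(loc=np.array([1.0, 3.0, 6.0])[labels], scale=1.5)

total_var = y.var()  # population variance (ddof=0)

group_ids = np.unique(labels)
weights = np.array([(labels == g).mean() for g in group_ids])
group_means = np.array([y[labels == g].mean() for g in group_ids])
group_vars = np.array([y[labels == g].var() for g in group_ids])

within = (weights * group_vars).sum()                      # E[Var(Y|X)]
between = (weights * (group_means - y.mean()) ** 2).sum()  # Var(E[Y|X])

print(f"Total variance:   {total_var:.4f}")
print(f"Within + between: {within + between:.4f}")
```

With population variances (`ddof=0`) the identity holds exactly for the observed data, up to floating-point error.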

### Application

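There are two common types of ANOVA: one-way and two-way. The former investigates how a single categorical factor affects the dependent variable, while the latter investigates two factors and, optionally, their interaction.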

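There are additionally three types of sums of squares that an ANOVA test can use:
* __Type-1 ANOVA:__ Sequential. Each term is tested after the terms listed before it in the model, so the results depend on the order of the terms.
* __Type-2 ANOVA:__ Each term is tested after all other terms that do not contain it, so a main effect is adjusted for the other main effects but not for its interactions.
* __Type-3 ANOVA:__ Each term is tested after all other terms in the model, including interactions that contain it.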
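
The practical difference between the types shows up when predictors are correlated. The following is a rough sketch, assuming two hypothetical correlated predictors `x1` and `x2` and an illustrative helper `rss` (none of these come from the original notebook). It computes the sums of squares directly from residual sums of squares, without `statsmodels`, to show that Type 1 results depend on term order while Type 2 results do not.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Hypothetical correlated predictors and a response built from both.
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)
y = 2.0 * x1 + 1.0 * x2 + rng.normal(size=n)


def rss(columns):
    """Residual sum of squares of y regressed on an intercept plus columns."""
    design = np.column_stack([np.ones(n)] + list(columns))
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)


rss_null = rss([])
rss_x1 = rss([x1])
rss_x2 = rss([x2])
rss_full = rss([x1, x2])

# Type 1 (sequential): each term is credited with the drop in RSS when added.
type1_x1_first = {"x1": rss_null - rss_x1, "x2": rss_x1 - rss_full}
type1_x2_first = {"x2": rss_null - rss_x2, "x1": rss_x2 - rss_full}

# Type 2: each term is adjusted for the other term, regardless of order.
type2 = {"x1": rss_x2 - rss_full, "x2": rss_x1 - rss_full}

print("Type 1, x1 first:", type1_x1_first)
print("Type 1, x2 first:", type1_x2_first)
print("Type 2:          ", type2)
```

In a model without interactions, the last term of a Type 1 table always matches its Type 2 value, and Type 2 and Type 3 give the same sums of squares for the predictors.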

### Python

The following is an example of conducting ANOVA in Python. The Duncan data set contains data on:
* Job Position
* Income
* Level of Education
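* Social Prestige

For this test we will be looking at the relationship between *Level of Education* combined with *Social Prestige* and *Income*. For this test our dependent variable is *Income* and our independent variables are *Level of Education* and *Social Prestige*.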

In [22]:
# First we import the dataset and extract the data frame
# (get_rdataset downloads the data, so this needs network access)
duncan_dataset = sm.datasets.get_rdataset("Duncan", "carData").data

# Next we construct our linear model with:
#    - Dependent Variable: income
#    - Independent Variables: education + prestige
duncan_dataset_lm = ols(
    "income ~ education + prestige", 
    data=duncan_dataset).fit()

# Now that the linear model has been constructed we can extract
# the ANOVA results from it, once for each type (1, 2 and 3)
for anova_type in range(1,4):
    anova_results = sm.stats.anova_lm(duncan_dataset_lm, typ=anova_type)
    print("ANOVA Type %s" % anova_type)
    print(anova_results)
    print()



ANOVA Type 1
             df        sum_sq       mean_sq          F        PR(>F)
education   1.0  13790.229826  13790.229826  74.064913  8.116233e-11
prestige    1.0   4660.942771   4660.942771  25.033109  1.053184e-05
Residual   42.0   7820.027404    186.191129        NaN           NaN

ANOVA Type 2
                sum_sq    df          F    PR(>F)
education    11.124659   1.0   0.059749  0.808084
prestige   4660.942771   1.0  25.033109  0.000011
Residual   7820.027404  42.0        NaN       NaN

ANOVA Type 3
                sum_sq    df          F    PR(>F)
Intercept  1167.606015   1.0   6.271008  0.016243
education    11.124659   1.0   0.059749  0.808084
prestige   4660.942771   1.0  25.033109  0.000011
Residual   7820.027404  42.0        NaN       NaN



## Analysis of Covariance (ANCOVA)

## Multivariate Analysis of Variance (MANOVA)

## Multivariate Analysis of Covariance (MANCOVA)

# Sources

* [Stat Soup](http://www.statsmakemecry.com/smmctheblog/stats-soup-anova-ancova-manova-mancova)
* [Law of Total Variance](https://en.wikipedia.org/wiki/Law_of_total_variance)
* [Limit Theorems and Conditional Expectation](https://bookdown.org/probability/beta/limit-theorems-and-conditional-expectation.html)
* [ANOVA Python](https://plot.ly/python/v3/anova/)