import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats import diagnostic
from statsmodels.stats.multitest import multipletests
from statsmodels.stats.outliers_influence import OLSInfluence

# Toy dataset for ANOVAs and planned comparisons

## Q

Load the `../data/wheat.txt` toy dataset [[1]](https://campus.murraystate.edu/academic/faculty/cmecklin/STA565/wheat.txt) with the appropriate separator.

## A
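
A minimal sketch of reading a whitespace-separated table with `pandas.read_csv`. The inline sample, its column layout and the `\s+` separator are assumptions made for illustration; inspect the actual file before choosing the separator.

```python
import io

import pandas as pd

# Hypothetical excerpt: the real file may use another separator or layout.
sample = io.StringIO(
    "variety location yield\n"
    "A 1 35.3\n"
    "B 1 31.0\n"
    "A 2 29.2\n"
)

# sep=r"\s+" splits fields on any run of spaces or tabs.
wheat_demo = pd.read_csv(sample, sep=r"\s+")
print(wheat_demo.dtypes)
```

With the real file, the `io.StringIO` object would be replaced by the path `../data/wheat.txt`.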

## Q

Perform a one-way ANOVA using a Wilkinson formula to specify a linear model of response variable `yield` with `variety` as independent variable. Print the summary tables and, if necessary, an ANOVA table.

## A

## Q

Perform a two-way ANOVA of `yield` using `variety` and `location` as categorical variables. Can we introduce an interaction term?

## A

## Q

`variety` now appears to have a significant effect. Run pairwise *t* tests to determine which varieties exhibit different yields, with Holm-Šidák correction for multiple comparisons.

## A
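
One way to do this, sketched on synthetic data, is the `t_test_pairwise` method of the fitted two-way model, whose `method="hs"` option applies the Holm-Šidák correction. Using the model-based contrasts rather than separate two-sample tests is a choice made here, not necessarily the intended answer.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in: 5 varieties x 6 locations, one observation per cell.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    [(v, loc) for v in "ABCDE" for loc in range(1, 7)],
    columns=["variety", "location"],
)
variety_effect = toy["variety"].map({"A": 0, "B": 1, "C": 2, "D": 3, "E": 4})
location_effect = toy["location"].map(dict(zip(range(1, 7), rng.normal(0, 3, 6))))
toy["yield"] = 30 + variety_effect + location_effect + rng.normal(0, 1, len(toy))

two_way = smf.ols('Q("yield") ~ C(variety) + C(location)', data=toy).fit()

# All pairwise contrasts between variety levels, Holm-Sidak corrected.
pairwise = two_way.t_test_pairwise("C(variety)", method="hs")
print(pairwise.result_frame[["coef", "P>|t|", "pvalue-hs", "reject-hs"]])
```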

## Q

Fit a mixed-effects linear model, treating the factor `location` as a random effect.

## A
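
A sketch with `smf.mixedlm` on synthetic data: `variety` stays a fixed effect and each `location` gets a random intercept through the `groups` argument. The data are generated for illustration only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in: 5 varieties x 6 locations, one observation per cell.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    [(v, loc) for v in "ABCDE" for loc in range(1, 7)],
    columns=["variety", "location"],
)
variety_effect = toy["variety"].map({"A": 0, "B": 1, "C": 2, "D": 3, "E": 4})
location_effect = toy["location"].map(dict(zip(range(1, 7), rng.normal(0, 3, 6))))
toy["yield"] = 30 + variety_effect + location_effect + rng.normal(0, 1, len(toy))

# Random intercept per location; fixed effect for variety.
mixed = smf.mixedlm('Q("yield") ~ C(variety)', data=toy, groups=toy["location"]).fit()
print(mixed.summary())
```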

## Q

Perform a Wald test to determine whether `variety` also exhibits a significant effect with this model.

## A
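
A sketch of a joint Wald test computed by hand on synthetic data: the statistic is $W = \hat\beta^\top V^{-1} \hat\beta$ over the `variety` coefficients, compared with a chi-squared distribution with as many degrees of freedom as coefficients tested. Computing it manually, rather than through a results method, is a choice made here to keep every step visible.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Synthetic stand-in: 5 varieties x 6 locations, one observation per cell.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    [(v, loc) for v in "ABCDE" for loc in range(1, 7)],
    columns=["variety", "location"],
)
variety_effect = toy["variety"].map({"A": 0, "B": 1, "C": 2, "D": 3, "E": 4})
location_effect = toy["location"].map(dict(zip(range(1, 7), rng.normal(0, 3, 6))))
toy["yield"] = 30 + variety_effect + location_effect + rng.normal(0, 1, len(toy))

mixed = smf.mixedlm('Q("yield") ~ C(variety)', data=toy, groups=toy["location"]).fit()

# Positions of the variety coefficients among the fixed effects.
names = list(mixed.model.exog_names)
idx = [i for i, name in enumerate(names) if name.startswith("C(variety)")]

beta = np.asarray(mixed.fe_params)[idx]
cov = np.asarray(mixed.cov_params())[np.ix_(idx, idx)]

# H0: all variety coefficients are zero.
wald_stat = float(beta @ np.linalg.solve(cov, beta))
p_value = float(stats.chi2.sf(wald_stat, df=len(idx)))
print(f"W = {wald_stat:.2f}, df = {len(idx)}, p = {p_value:.3g}")
```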

## Q

Perform pairwise Wald tests for each pair of different varieties, and build a dataframe with a `pvalue` column and one row per `X-Y` comparison (10 rows).

## A
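
A sketch on synthetic data: each comparison is a contrast vector over the fixed effects (the reference level has no coefficient, so its contrast is all zeros), tested with a one-degree-of-freedom Wald statistic. The helper `contrast` is a hypothetical name introduced for this example.

```python
import itertools

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Synthetic stand-in: 5 varieties x 6 locations, one observation per cell.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    [(v, loc) for v in "ABCDE" for loc in range(1, 7)],
    columns=["variety", "location"],
)
variety_effect = toy["variety"].map({"A": 0, "B": 1, "C": 2, "D": 3, "E": 4})
location_effect = toy["location"].map(dict(zip(range(1, 7), rng.normal(0, 3, 6))))
toy["yield"] = 30 + variety_effect + location_effect + rng.normal(0, 1, len(toy))

mixed = smf.mixedlm('Q("yield") ~ C(variety)', data=toy, groups=toy["location"]).fit()

names = list(mixed.model.exog_names)
beta = np.asarray(mixed.fe_params)
# Fixed effects come first in the full parameter covariance matrix.
cov = np.asarray(mixed.cov_params())[: len(names), : len(names)]


def contrast(level):
    """Indicator vector of a level's coefficient (zeros for the reference)."""
    c = np.zeros(len(names))
    name = f"C(variety)[T.{level}]"
    if name in names:
        c[names.index(name)] = 1.0
    return c


pvalues = {}
for x, y in itertools.combinations(sorted(toy["variety"].unique()), 2):
    c = contrast(x) - contrast(y)
    z = (c @ beta) / np.sqrt(c @ cov @ c)
    pvalues[f"{x}-{y}"] = 2 * stats.norm.sf(abs(z))  # two-sided Wald p-value

pairwise_wald = pd.DataFrame({"pvalue": pd.Series(pvalues)})
print(pairwise_wald)
```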

## Q

Correct the p-values for multiple comparisons, and add a `corrected pvalue` column to the result dataframe.

At least one difference now shows up as significant.
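
A sketch of the correction step with `multipletests`. The p-values below are illustrative placeholders, not results from the wheat data, and `holm-sidak` is chosen here only to match the earlier question.

```python
import pandas as pd
from statsmodels.stats.multitest import multipletests

# Placeholder p-values for illustration only.
result = pd.DataFrame(
    {"pvalue": [0.001, 0.02, 0.04, 0.30, 0.75]},
    index=["A-B", "A-C", "B-C", "A-D", "B-D"],
)

# multipletests returns (reject flags, corrected p-values, Sidak alpha, Bonferroni alpha).
reject, corrected, _, _ = multipletests(
    result["pvalue"].to_numpy(), alpha=0.05, method="holm-sidak"
)
result["corrected pvalue"] = corrected
print(result)
```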

# Generalized linear models

## Q

Load the `../data/titanic_tickets.csv` data file and look at it.

Exclude the null-fare tickets.

## A
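
A sketch on a tiny inline sample (the `Ticket` column and all values are made up). "Null-fare" is read here as a fare of zero; the filter `Fare > 0` also drops missing fares, which is an assumption about the intent.

```python
import io

import pandas as pd

# Hypothetical excerpt standing in for ../data/titanic_tickets.csv.
sample = io.StringIO(
    "Ticket,Pclass,Fare,Passengers\n"
    "T1,1,80.0,2\n"
    "T2,3,0.0,1\n"
    "T3,2,13.0,1\n"
)
tickets_demo = pd.read_csv(sample)

# First look at the data.
print(tickets_demo.head())
print(tickets_demo.describe())

# Keep only tickets with a strictly positive fare.
paid = tickets_demo[tickets_demo["Fare"] > 0]
print(len(tickets_demo) - len(paid), "ticket(s) excluded")
```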

Meaning of some columns:
* `Pclass`: 1 = first class, 2 = second class, 3 = third class
* `Cabins`: number of cabins the ticket refers to
* `Passengers`: number of passengers registered on the ticket
* `SibSp`: maximum number of siblings or spouses
* `Parch`: maximum number of parents or children
* `Embarked`: C = Cherbourg (2nd port of embarkation), Q = Queenstown (3rd), S = Southampton (1st)
* `Deck`: <img src="../images/titanic_decks.png" style="height:600px" />


## Q

Instead of the classical `Survived` variable, we will try to explain the variations in `Fare`.

Let us first consider the first-class tickets only. In order not to lose too much data, replace the missing deck information with an empty string (`''`).

Fit a _standard_ linear model for `Fare` as response variable, using `Embarked`, `Deck`, `Cabins`, `Passengers` and `Children` as independent variables (no interaction), and print the summary tables.

## A
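
A sketch on synthetic tickets (generated data, with column names taken from the question): select the first class, fill the missing decks and fit an ordinary least squares model with main effects only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic tickets: not the real dataset.
rng = np.random.default_rng(1)
n = 400
tickets = pd.DataFrame({
    "Pclass": rng.integers(1, 4, n),
    "Embarked": rng.choice(["C", "Q", "S"], n),
    "Deck": rng.choice(list("ABCDE"), n),
    "Cabins": rng.integers(1, 4, n),
    "Passengers": rng.integers(1, 5, n),
    "Children": rng.integers(0, 3, n),
})
tickets.loc[rng.random(n) < 0.2, "Deck"] = np.nan  # some decks unknown
tickets["Fare"] = rng.wald(np.exp(3 + 0.4 * tickets["Cabins"].to_numpy()), 2000.0)

# First-class tickets only; '' becomes a deck level of its own.
first = tickets[tickets["Pclass"] == 1].copy()
first["Deck"] = first["Deck"].fillna("")

ols_fit = smf.ols(
    "Fare ~ Embarked + Deck + Cabins + Passengers + Children", data=first
).fit()
print(ols_fit.summary())
```

Filling the missing decks keeps those rows in the model; otherwise the formula interface would drop them.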

##

If you used `ols`, you may notice several issues, including the non-normality of the residuals, with high skewness and kurtosis.

If you defined all variables as categorical, you may also be warned about multicollinearity. Let us ignore these warnings for now.

## Q

Plot the residuals as a function of the predicted values.

## A
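
A sketch of a residuals-versus-fitted plot with matplotlib, on synthetic fares whose spread grows with the mean. The reduced formula `Fare ~ Cabins + Passengers` is a simplification for the example.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic fares with variance increasing with the mean.
rng = np.random.default_rng(2)
n = 150
first = pd.DataFrame({
    "Cabins": rng.integers(1, 4, n),
    "Passengers": rng.integers(1, 5, n),
})
log_mean = 3 + 0.4 * first["Cabins"] + 0.2 * first["Passengers"]
first["Fare"] = rng.wald(np.exp(log_mean.to_numpy(dtype=float)), 2000.0)

fit = smf.ols("Fare ~ Cabins + Passengers", data=first).fit()

fig, ax = plt.subplots()
ax.scatter(fit.fittedvalues, fit.resid, alpha=0.6)
ax.axhline(0, color="grey", linewidth=1)  # reference line at zero residual
ax.set_xlabel("predicted Fare")
ax.set_ylabel("residual")
```

A funnel shape, with residuals spreading out as predictions grow, is the usual visual sign of heteroscedasticity.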

## Q

We have a clear case of heteroscedasticity, as could be expected from some statistics in the summary tables.

Before we move to a generalized linear model, let us try to improve the current model by removing outliers.
Plot the Cook's distance for each ticket. Remove the outlier(s) and fit the model again.

## A
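
A sketch with `OLSInfluence` on synthetic data containing one injected outlier. Dropping the single observation with the largest Cook's distance is a simplification; on real data the plot should guide which points to remove.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import OLSInfluence

# Synthetic data with one artificial outlier in row 0.
rng = np.random.default_rng(2)
n = 150
first = pd.DataFrame({
    "Deck": rng.choice(list("ABCDE"), n),
    "Cabins": rng.integers(1, 4, n),
})
first["Fare"] = rng.wald(np.exp(3 + 0.4 * first["Cabins"].to_numpy()), 2000.0)
first.loc[0, "Fare"] = 2000.0  # injected outlier

formula = "Fare ~ Deck + Cabins"  # reduced formula, for illustration
fit = smf.ols(formula, data=first).fit()

# cooks_distance returns (distances, p-values); only the distances are used.
cooks = np.asarray(OLSInfluence(fit).cooks_distance[0])

fig, ax = plt.subplots()
ax.plot(cooks, "o")
ax.set_xlabel("ticket (row position)")
ax.set_ylabel("Cook's distance")

# Drop the most influential ticket and fit again.
worst = int(np.argmax(cooks))
cleaned = first.drop(index=first.index[worst])
refit = smf.ols(formula, data=cleaned).fit()
print("removed row:", worst, "| residual MSE:", fit.mse_resid, "->", refit.mse_resid)
```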

## Q

Plot the density of `Fare` for first-class passengers and overlay a fitted distribution function from the exponential family. `scipy.stats.invgauss` and `scipy.stats.gamma` may be useful here.

## A
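
A sketch on synthetic fares: both candidate distributions are fitted by maximum likelihood with the location fixed at zero (an assumption, since fares are positive), then overlaid on a density histogram.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

# Synthetic positive, right-skewed fares.
rng = np.random.default_rng(4)
fares = rng.wald(60.0, 300.0, size=200)

# Maximum likelihood fits; floc=0 pins the location parameter.
ig_params = stats.invgauss.fit(fares, floc=0)
gamma_params = stats.gamma.fit(fares, floc=0)

grid = np.linspace(fares.min(), fares.max(), 200)
fig, ax = plt.subplots()
ax.hist(fares, bins=30, density=True, alpha=0.4, label="observed")
ax.plot(grid, stats.invgauss.pdf(grid, *ig_params), label="inverse Gaussian")
ax.plot(grid, stats.gamma.pdf(grid, *gamma_params), label="gamma")
ax.set_xlabel("Fare")
ax.legend()

# Log-likelihoods give a rough numerical comparison of the two fits.
ig_loglik = stats.invgauss.logpdf(fares, *ig_params).sum()
gamma_loglik = stats.gamma.logpdf(fares, *gamma_params).sum()
print("invgauss:", ig_loglik, "gamma:", gamma_loglik)
```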

## Q

Fit a generalized linear model for `Fare` with an inverse Gaussian distribution, using `Embarked`, `Deck`, `Cabins`, `Passengers` and `Children` in the linear predictor.

Compare fares between decks, with corrections for multiple comparisons.

## A

## Q

Group the decks so that *A*, *B* and *C* are labelled *ABC*, and *D* and *E* are labelled *DE*. Check whether the simplified model unveils any difference in fare between the grouped decks.

## A

## Q

Programmatically search for a model with interaction terms that minimizes the AIC.

You may for example look for the best model among those with a single `A * B` interaction term, and then repeat the procedure with the `A * B` term as a replacement for both `A` and `B`.

## A

## Q

Draw a [stripplot](https://seaborn.pydata.org/generated/seaborn.stripplot.html) of the AIC for the various models explored.

Example:

<img src="../images/stripplot.png" />

## A
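
A sketch with `sns.stripplot`, one strip per search round. The AIC values are random placeholders generated for the example, and the column names `round` and `aic` are assumptions.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Placeholder AIC values (random numbers, not model results).
rng = np.random.default_rng(5)
aic_df = pd.DataFrame({
    "round": ["round 1"] * 10 + ["round 2"] * 6,
    "aic": np.concatenate([rng.normal(1500, 10, 10), rng.normal(1490, 10, 6)]),
})

# One horizontal strip of points per round.
ax = sns.stripplot(data=aic_df, x="aic", y="round")
ax.set_xlabel("AIC")
```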