# Model Diagnostics for Lineage Proportion Models

Recall that the lineage proportions are estimated according to the following assumption:
$$
\text{Frequency of mutation } i \approx \sum_{j=1}^J (\text{Proportion of variant } j) \times (\text{1 if mutation } i \text{ is in variant } j \text{, 0 otherwise})
$$
which is written in mathematical notation as:
$$
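f_i \approx \beta_1x_{i1} + \beta_2x_{i2} + \dots + \beta_Jx_{iJ} = \underline x_i^\top\underline\beta
$$
where $\beta_j$ is the proportion of variant $j$ and $x_{ij}$ is 1 if mutation $i$ is in variant $j$ and 0 otherwise. Stacking all mutations gives $\underline f \approx X\underline\beta$, which has exactly the structure of a linear model.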
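
To make the indicator structure concrete, here is a minimal sketch that builds the 0/1 matrix from lineage definitions and computes the predicted frequencies. The lineage names, mutation names, and proportions are made up for illustration.

```python
# Hypothetical lineage definitions: lineage name -> set of mutations it carries.
lineages = {
    "lineage_1": {"mut_a", "mut_b"},
    "lineage_2": {"mut_b", "mut_c"},
}
mutations = ["mut_a", "mut_b", "mut_c"]

# Hypothetical proportions (the betas); these are assumed, not estimated.
proportions = {"lineage_1": 0.3, "lineage_2": 0.5}

# X[i][j] = 1 if mutation i is in lineage j, 0 otherwise.
names = list(lineages)
X = [[int(m in lineages[n]) for n in names] for m in mutations]

# Predicted frequency of mutation i: sum over j of beta_j * x_ij.
predicted = [
    sum(proportions[n] * x for n, x in zip(names, row))
    for row in X
]
```

The shared mutation `mut_b` gets the sum of both proportions, while each unique mutation gets only its own lineage's proportion.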

This might lead one to believe that a linear model is appropriate, and indeed this is how things are done. However, the usual residual diagnostics are not appropriate for several reasons:

1. $f_i\in[0, 1]$, so the errors cannot be normal.
    - Note that least squares estimation does not require normal errors, so estimation can be done with `lm()` so long as we ignore anything based on the normality assumption.
2. All $x_{ij}$ are 0 or 1. Technically, the usual diagnostics would apply, but the interpretations are not perfect. For example, a one-unit increase in $x_{ij}$ (mutation $i$ going from absent to present in lineage $j$) is still a useful thing to consider, but we're more interested in the effect of $x_{ij}$ on its own, with all other predictors held constant.

In addition, there is structure in the data that we can take advantage of:

1. All $f_i$ such that $x_{ij} = 1$ and $x_{ik} = 0$ for all $k\ne j$ (that is, mutations unique to lineage $j$) have the same fitted value, which under the model above is $\hat\beta_j$.
    - We can evaluate a particular lineage's ability to predict each mutation.
2. Many lineages share mutations.
    - We can evaluate a lineage's contribution to a set of mutations.
3. Since we have discrete $x$ axes, we can calculate something like the variance of $f$ at each unique combination of $\underline x_i$ (which is the covariate vector for mutation $i$). 
    - Linear models assume equal variance regardless of the value of $x$. If this assumption is violated, maybe we can adjust for this?
4. It matters whether a mutation is unique to a lineage or is shared across several lineages.
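
As a rough sketch of point 3, the observed frequencies can be grouped by their covariate vector and summarized per group. The data below are invented for illustration, and the population variance is used so that patterns with a single mutation return 0 instead of raising an error.

```python
from collections import defaultdict
from statistics import mean, pvariance

# Hypothetical data: (observed frequency, covariate vector) for each mutation.
observations = [
    (0.24, (1, 0)),
    (0.28, (1, 0)),
    (0.52, (1, 1)),
    (0.48, (1, 1)),
    (0.27, (0, 1)),
]

# Collect the frequencies that share the same covariate vector.
by_pattern = defaultdict(list)
for freq, pattern in observations:
    by_pattern[pattern].append(freq)

# For each pattern: (number of mutations, mean frequency, variance of frequency).
summary = {
    pattern: (len(fs), mean(fs), pvariance(fs))
    for pattern, fs in by_pattern.items()
}
```

Comparing the variances across patterns gives a first, informal check of the equal-variance assumption.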

# Also we have the coverage

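A frequency of 0.5 could have come from 1/2 or 1000/2000. We expect the variance to be smaller when the coverage is larger. If the count of mutation $i$ comes from a binomial distribution with coverage $n_i$, the variance of the count is $n_i\hat f_i(1 - \hat f_i)$, so the variance of the observed frequency (the count divided by $n_i$) is $\hat f_i(1 - \hat f_i)/n_i$, which shrinks as the coverage grows.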
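
A small sketch of the coverage effect, assuming the binomial model: the count has variance $nf(1-f)$, so the frequency (count divided by coverage) has variance $f(1-f)/n$. The function name is my own.

```python
def binomial_frequency_variance(count, coverage):
    """Plug-in variance of count / coverage, assuming count ~ Binomial(coverage, f)."""
    f_hat = count / coverage
    return f_hat * (1 - f_hat) / coverage

# Same observed frequency of 0.5, very different coverage.
low_coverage = binomial_frequency_variance(1, 2)
high_coverage = binomial_frequency_variance(1000, 2000)
```

Both calls use $\hat f = 0.5$, but the variance at coverage 2000 is a thousand times smaller than at coverage 2.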

Ignoring the $n$ and supposing we have three lineages, $\hat f_i(1 - \hat f_i)$ further decomposes to
$$
\beta_1x_{i1} + \beta_2x_{i2} + \beta_3x_{i3} - (\beta_1x_{i1} + \beta_2x_{i2} + \beta_3x_{i3})^2
$$
Since we have binary covariates, each squared covariate is equal to itself ($x_{ij}^2 = x_{ij}$) and each cross term $x_{ij}x_{ik}$ is still binary (1 only when mutation $i$ is shared by lineages $j$ and $k$). We can look at individual mutations to see how they contribute to the variance and also tease out the contributions of individual variants (especially by looking at "unique" mutations). In fact, a second linear model with all possible interaction terms could be fit to the variance. I don't quite know what that would actually tell us, but it might be neat.
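
As a sketch of that expansion for three lineages, using only $x_{ij}^2 = x_{ij}$ (my own working, worth double-checking):
$$
\begin{aligned}
\hat f_i(1 - \hat f_i) &= \sum_{j=1}^3 \beta_jx_{ij} - \Big(\sum_{j=1}^3 \beta_jx_{ij}\Big)^2 \\
&= \sum_{j=1}^3 \beta_j(1 - \beta_j)x_{ij} - 2\sum_{j<k}\beta_j\beta_kx_{ij}x_{ik}
\end{aligned}
$$
So a mutation unique to lineage $j$ contributes $\beta_j(1 - \beta_j)$, and every pair of lineages that shares the mutation subtracts $2\beta_j\beta_k$.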

Furthermore, we can do some fun things with the likelihood.

# Also we don't know which lineages to look for

If we include too many lineages, there is bound to be a false positive. Since the sum of the proportions cannot exceed 1, a false positive in one place will necessarily take away from the estimate of a lineage that is truly present. In other words, *bias increases with excess predictor variables.*

Conversely, suppose we are missing a lineage that should be present. Under the assumption that this lineage shares mutations with another lineage, the estimate of the other lineage will be biased upwards. In other words, *bias increases with missing predictor variables.* 

For a concrete example, consider variants A, B, and C with mutations m1, m2, m3, and m4, as follows:

| mutation | frequency | varA | varB | varC |
|----------|-----------|------|------|------|
| m1       | 0.25      | yes  | no   | yes  |
| m2       | 0.5       | yes  | yes  | no   |
| m3       | 0.25      | no   | yes  | no   |
| m4       | 0         | no   | no   | yes  |

In this example, we can guess that A and B each have a proportion of 0.25, while C has a proportion of 0. When A and B are both present, any mutations they share (here, m2) are expected to be in 50% of the samples. However, if we were to exclude A, then we would get an estimate of 0.375 for B (the average of the frequencies of m2 and m3, the mutations in variant B). The erroneous exclusion of A means that B's estimate is biased upwards.

Conversely, suppose we included varC in our estimation. If its proportion is estimated to be above 0, then the prediction for m1 would be too high, so the model would instead need to reduce the estimate for varA. The erroneous inclusion of C has biased A's estimate downwards.
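
The table can be checked numerically with a minimal sketch. It uses plain unconstrained least squares (no simplex constraint, no intercept) solved exactly through the normal equations. The helper `least_squares` is my own, and the noisy m4 frequency of 0.04 is a hypothetical value added only to show the effect of including C.

```python
from fractions import Fraction

def least_squares(columns, f):
    """Solve (X^T X) beta = X^T f exactly; assumes X has full column rank.

    `columns` holds one 0/1 indicator column per lineage.
    """
    k = len(columns)
    # Augmented matrix [X^T X | X^T f].
    aug = [
        [sum(Fraction(a) * b for a, b in zip(ci, cj)) for cj in columns]
        + [sum(Fraction(a) * y for a, y in zip(ci, f))]
        for ci in columns
    ]
    # Gauss-Jordan elimination, swapping rows when a pivot is zero.
    for col in range(k):
        pivot = next(r for r in range(col, k) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]

# Frequencies of m1..m4 and the indicator columns from the table.
freq = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4), Fraction(0)]
var_a = [1, 1, 0, 0]
var_b = [0, 1, 1, 0]
var_c = [1, 0, 0, 1]

fit_ab = least_squares([var_a, var_b], freq)  # A and B both included
fit_b = least_squares([var_b], freq)          # A erroneously excluded

# Hypothetical noise: m4 is observed at 0.04 even though C is absent.
noisy = freq[:3] + [Fraction(1, 25)]
fit_abc_noisy = least_squares([var_a, var_b, var_c], noisy)
```

With A and B included, both estimates are 1/4. With B alone, the estimate rises to 3/8. With the noisy m4 and C included, the estimates are 23/100 for A, 13/50 for B, and 3/100 for C, so A is pulled below its value of 1/4.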



# Other stray thoughts

- The AIC is something interesting here. With binary covariates, it can be decomposed into contributions of each combination of lineages possibly or something like that.
- The decomposition of the variance can help derive a better version of an $R^2$ statistic or something like that.
- Covariate selection becomes important. Lasso is nice, but other methods will help with Freyja/AlCoV/etc. 
- The variance of the parameters is very important. Also note that the parameters are extremely correlated since they are restricted to the simplex.