# Tracy Widom test
Implementation following Patterson et al. (2006), PLoS Genetics.
Variable names follow the paper as closely as possible.

In [1]:
import numpy as np
from TracyWidom import TracyWidom
import scipy.linalg  # import the submodule explicitly; used for the SVD below
import pandas as pd
import matplotlib.pyplot as plt


### Moment estimator
Equation (10) in Patterson et al.

$$n' = \frac{(m+1)(\sum_{i}\lambda_i)^2}
{(m-1)\sum_{i}\lambda_i^2 - (\sum_{i}\lambda_i)^2}$$

## Sample dataset
C is a 50x400 genotype matrix ($m = 50$ rows, $n = 400$ columns) with values $\in \{0, 1, 2\}$, taken from the [LEA tutorial](https://rdrr.io/bioc/LEA/man/main_tracyWidom.html).

## Implementation à la Patterson
### 1. Compute Matrix M
Equations (1)-(3) in [Patterson et al., 2006](https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.0020190#pgen-0020190-e003)

$$\mu(j) = \frac{\sum_{i=1}^{m}C(i,j)}{m}$$
$$p(j) = \mu(j)/2$$
$$M(i,j) = \frac{C(i,j)-\mu(j)}{\sqrt{p(j)(1-p(j))}}$$

In [10]:
C = pd.read_csv('Data/genotype.csv').iloc[:,1:].values
m, n = C.shape
m1 = m-1 # m' in the paper
mu = np.nanmean(C, axis=0)  #(1)
p = mu/2.
scale = np.sqrt(p*(1-p))
#scale = np.nanstd(C, axis=0) ## alternative scaling with sigma, probably used in LEA
M = (C-mu)/scale #(2) & (3) 
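
One caveat with Equation (3): a marker with $p(j) = 0$ or $p(j) = 1$ has a zero scale factor, so the division is undefined. The sketch below wraps the same three steps in a helper that drops such columns first. The name `normalize_genotypes`, the toy matrix, and the choice to drop rather than impute are assumptions made for illustration, not something taken from the paper.

```python
import numpy as np

def normalize_genotypes(C):
    """Center and scale a genotype matrix as in Eqs (1)-(3).

    Hypothetical helper: columns with p(j) in {0, 1} are dropped,
    because their scale factor sqrt(p(1-p)) is zero.
    """
    C = np.asarray(C, dtype=float)
    mu = np.nanmean(C, axis=0)                # Eq (1): column means
    p = mu / 2.0                              # Eq (2): allele frequency estimate
    keep = (p > 0) & (p < 1)                  # polymorphic markers only
    scale = np.sqrt(p[keep] * (1 - p[keep]))
    return (C[:, keep] - mu[keep]) / scale    # Eq (3)

# Toy data: 4 individuals, 3 markers; the last marker is monomorphic.
C_toy = np.array([[0, 1, 2],
                  [1, 2, 2],
                  [2, 1, 2],
                  [1, 0, 2]])
M_toy = normalize_genotypes(C_toy)
print(M_toy.shape)         # the monomorphic column has been removed
print(M_toy.mean(axis=0))  # each remaining column is centered
```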

### 3. Eigenvalues of X
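Computing $X = MM'$ explicitly seems unnecessary if we use the SVD: the squared singular values of $M$ are the eigenvalues of $MM'$.

`scipy.linalg.svd` returns the singular values in non-increasing order, so $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_{m'} > 0$.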


In [11]:
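## through SVD; singular values come back in non-increasing order
U, s, V = scipy.linalg.svd(M)  # V holds the right singular vectors as rows (V transposed)
lambdas = (s**2)[:-1]  # eigenvalues of MM'; the last one is ~0 because M is column-centered
#L = m1*lambdas/lambdas.sum() ## scale lambdas, so they add up to m'
# U contains the eigenvectors, identical to those in the LEA tutorial :-D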
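
As a quick sanity check of the SVD shortcut, the sketch below compares squared singular values with the eigenvalues of $MM'$ on a small random matrix. The matrix `M_demo` is a random stand-in, not the genotype data.

```python
import numpy as np
import scipy.linalg

rng = np.random.default_rng(0)
M_demo = rng.standard_normal((5, 20))  # random stand-in for M

# Singular values only, returned in non-increasing order.
s_demo = scipy.linalg.svd(M_demo, compute_uv=False)

# eigh returns eigenvalues in ascending order, so reverse them.
eig_demo = scipy.linalg.eigh(M_demo @ M_demo.T, eigvals_only=True)[::-1]

print(np.allclose(s_demo**2, eig_demo))
```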

In [14]:
def nprime(m, lambdas): ## Eq (10)
    """Moment estimator n' of the effective number of markers."""
    t1 = (lambdas.sum())**2
    numer = (m+1) * t1
    denom = (m-1) * (lambdas **2).sum() - t1
    return numer/denom
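
The denominator of Eq (10) is not guaranteed to be positive: when the eigenvalues are nearly uniform, $(m-1)\sum_i\lambda_i^2$ can fall below $(\sum_i\lambda_i)^2$ and $n'$ turns negative. A minimal sketch of a guarded variant follows. The name `nprime_checked` and the convention of returning NaN are assumptions for illustration, not part of the paper.

```python
import numpy as np

def nprime_checked(m, lambdas):
    """Eq (10), returning NaN when the denominator is not positive.

    Hypothetical variant of nprime; the NaN convention is an assumption.
    """
    lambdas = np.asarray(lambdas, dtype=float)
    t1 = lambdas.sum() ** 2
    denom = (m - 1) * (lambdas ** 2).sum() - t1
    if denom <= 0:
        return float("nan")
    return (m + 1) * t1 / denom

# Spread-out eigenvalues: 4*144 / (2*102 - 144) = 576 / 60
print(nprime_checked(3, [10, 1, 1]))
# Nearly uniform eigenvalues: denominator 2*14 - 36 = -8, so NaN
print(nprime_checked(3, [3, 2, 1]))
```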

In [17]:
def twstats(lambdas):
    """Tracy-Widom statistic and p-value for each eigenvalue in turn."""
    tw = TracyWidom(beta=1)
    stats = []
    for m in range(len(lambdas), 0, -1):
        m1 = m - 1
        n1 = nprime(m, lambdas)  # Eq (10)
        # For the trailing eigenvalues n1 can fall below 1; the square roots
        # below then give NaN and NumPy emits a RuntimeWarning.
        mumn = ((np.sqrt(n1-1) + np.sqrt(m))**2)/n1 ## Eq (5)
        sigmn = (np.sqrt(n1-1) + np.sqrt(m))/n1 * (1/np.sqrt(n1-1) + 1/np.sqrt(m))**(1/3.) # Eq (6)
        scaled = m1*lambdas[0]/lambdas.sum() ## largest remaining lambda, rescaled
        x = (scaled - mumn)/sigmn  # Eq (7)
        stats.append((lambdas[0], scaled, x, 1-tw.cdf(x)))
        lambdas = lambdas[1:]  ## drop the first lambda before the next round
    df = pd.DataFrame(stats)
    df.columns = 'lambda scaled_lambda twstat p-value'.split()
    return df


In [32]:
result = twstats(lambdas)
result_smartpca = pd.read_csv('Data/smartpca.log', sep=r'\s+').iloc[:,[1,3]]  # sep=r'\s+' replaces the deprecated delim_whitespace=True
result_smartpca.columns = [f'SM_{col}' for col in result_smartpca.columns.values]




## Comparison of TW stats
Strangely, the eigenvalues and TW statistics from the code above, from smartpca, and from the R package LEA are similar but not identical:

In [33]:
pd.concat([result, result_smartpca], axis=1).head(20)

| | lambda | scaled_lambda | twstat | p-value | SM_eigenvalue | SM_twstat |
|---|---|---|---|---|---|---|
| 0 | 4963.372596 | 5.233372 | 12.941653 | 6.661338e-16 | 5.673028 | 13.145 |
| 1 | 4017.848721 | 4.655766 | 20.274075 | 0.0 | 4.445142 | 20.024 |
| 2 | 2001.556972 | 2.519586 | 10.858658 | 1.033507e-12 | 2.174567 | 10.292 |
| 3 | 1625.527437 | 2.117748 | 6.599648 | 4.210112e-07 | 1.74326 | 5.577 |
| 4 | 1423.87742 | 1.90339 | 3.701437 | 0.0004191513 | 1.545026 | 3.25 |
| 5 | 1235.120345 | 1.686497 | -0.700012 | 0.3289762 | 1.325417 | -1.266 |
| 6 | 1219.691623 | 1.693104 | 0.224003 | 0.1308897 | 1.275327 | -1.59 |
| 7 | 1150.701429 | 1.624804 | -0.921005 | 0.3924203 | 1.210664 | -2.545 |
| 8 | 1109.647317 | 1.591697 | -1.223479 | 0.4856893 | 1.190185 | -2.147 |
| 9 | 1066.036803 | 1.552699 | -1.776214 | 0.660006 | 1.134034 | -3.024 |
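
To turn the p-value column into a number of significant axes, one option is to read the test sequentially and stop at the first eigenvalue that is not significant. The sketch below assumes that reading; the helper name `n_significant` and the threshold `alpha=0.05` are illustrative choices. The p-values are copied from the table above.

```python
import numpy as np

def n_significant(pvalues, alpha=0.05):
    """Count leading p-values below alpha, stopping at the first that is not."""
    below = np.asarray(pvalues, dtype=float) < alpha
    if below.all():
        return int(below.size)
    return int(np.argmin(below))  # index of the first False

# p-values copied from the comparison table
pvals = [6.661338e-16, 0.0, 1.033507e-12, 4.210112e-07, 0.0004191513,
         0.3289762, 0.1308897, 0.3924203, 0.4856893, 0.660006]
print(n_significant(pvals))
```

With these values the first five eigenvalues pass at the 0.05 level and the sixth does not, so the helper returns 5.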


In [35]:
#LEA implementation for comparison
l='''N eigenvalues twstats   pvalues      effectn percentage
1   1      2057.0 13.3200 8.000e-09 7.170617e+01   0.104900
2   2      1675.0 20.0100 8.000e-09 1.155594e+02   0.085440
3   3       864.5  9.9680 8.000e-09 2.563951e+02   0.044110
4   4       682.5  4.1770 1.503e-04 3.173119e+02   0.034820
5   5       603.4  1.3000 3.152e-02 3.508808e+02   0.030790
6   6       548.6 -1.0170 4.215e-01 3.730542e+02   0.027990
7   7       522.2 -1.7650 6.565e-01 3.861965e+02   0.026640
8   8       506.0 -1.8630 6.859e-01 3.968453e+02   0.025810
9   9       492.0 -1.8220 6.738e-01 4.076199e+02   0.025100
10 10       464.5 -3.0520 9.363e-01 4.191613e+02   0.023700'''
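
To line the LEA numbers up with the results above, the pasted text can be parsed into a DataFrame. This is a minimal sketch: only the first three rows are repeated here to keep the block self-contained, and the column name `row` for R's row label and the `LEA_` prefix are made-up names.

```python
import io
import pandas as pd

# First three rows copied from the LEA output above; the full string parses the same way.
lea_text = '''N eigenvalues twstats   pvalues      effectn percentage
1   1      2057.0 13.3200 8.000e-09 7.170617e+01   0.104900
2   2      1675.0 20.0100 8.000e-09 1.155594e+02   0.085440
3   3       864.5  9.9680 8.000e-09 2.563951e+02   0.044110'''

# The header has six names but each row has seven fields (R prints a row label first),
# so name the extra leading field explicitly and use it as the index.
cols = ['row'] + lea_text.splitlines()[0].split()
lea = pd.read_csv(io.StringIO(lea_text), sep=r'\s+', skiprows=1,
                  names=cols, index_col='row')
lea = lea.add_prefix('LEA_')
print(lea[['LEA_eigenvalues', 'LEA_twstats']])
```

The LEA index starts at 1 while `result` starts at 0, so the index would need resetting (for example with `reset_index(drop=True)`) before a side-by-side `pd.concat(..., axis=1)`.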

