# Running GWAS using limix
We're going to run GWAS using the limix library for Python.
Limix is freely available [here](https://github.com/limix/limix) and has extensive [documentation](https://limix.readthedocs.io/).

# 1.  Initial setup steps

## 1a. Set up the environment

In [1]:
import pandas as pd
import numpy as np
import os
import h5py
from limix.qtl import scan
from bisect import bisect

## 1b. Define variables

In [19]:
# phenotype file
pheno_file = './data/subset_flowering_time_16.csv'
# genotypes file
geno_file = './data/subset_all_chromosomes_binary.hdf5'
# kinship matrix file
kin_file = './data/kinship_ibs_binary_mac5.h5py'
# minor allele frequency threshold
MAF_thrs = 0.1
# output file for results with K correction
output_file = './results/subset_flowering_time_16.csv'
# output file for results without K correction
output_file_nc = './results/subset_flowering_time_16_nc.csv'
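
Both output paths point into a `./results` directory, and `DataFrame.to_csv` does not create a missing directory for you. The sketch below shows one way to create the directory up front. It uses a temporary directory as a stand-in so it can run anywhere, and the name `demo_output_file` is hypothetical.

```python
import os
import tempfile

# Stand-in for the notebook's './results/...' path (hypothetical demo location).
base_dir = tempfile.mkdtemp()
demo_output_file = os.path.join(base_dir, "results", "subset_flowering_time_16.csv")

# Create the parent directory if needed; exist_ok avoids an error when rerunning the cell.
out_dir = os.path.dirname(demo_output_file)
os.makedirs(out_dir, exist_ok=True)

print(os.path.isdir(out_dir))
```

In the notebook itself, the same call would be applied to `os.path.dirname(output_file)`.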

# 2. Load input for GWAS
We need three variables to run GWAS:
1.  Phenotype matrix (Y)
2.  Genotype matrix (G)
3.  K matrix (K)

The first step to generating these variables is to load the data.

### Please Note!
This script is written to handle data from the *Arabidopsis thaliana* 1001 genome project.  If you are running GWAS using another organism, you will more than likely start with data that is formatted differently!  If this is the case, some parts of this code *will not* work for you out of the box.  However, if you understand the __steps__ that this code takes, you will be able to adapt it to suit your specific input data.

Therefore, I would like you to __focus on the general approach__ to generating the Y, G, and K matrices rather than trying to understand every line of code.

## 2a. Load phenotypes
The phenotype data are stored in a 2-column .csv file.
The first column specifies the accession identifier ("ecotypeid"); the second column contains the phenotype value.  Our flowering time dataset has 200 accessions.

In [20]:
# load phenotype data
pheno = pd.read_csv(pheno_file, index_col = 0)

# encode the index (the accessions column) to UTF8 for compatibility with the genotype data
pheno.index = pheno.index.map(lambda x: str(int(x)).encode('UTF8'))

In [21]:
# remove accessions with missing or non-numerical phenotype values
#pheno = pheno[np.isfinite(pheno)]
pheno = pheno.dropna()

In [22]:
# does pheno match our expectations?
print(pheno.shape)
print(pheno.head(n=5))

(199, 1)
           rosette_color
ecotypeid               
b'6092'         0.011739
b'6974'        -0.008859
b'6197'        -0.001053
b'6138'         0.015144
b'8258'        -0.003563


The phenotypes are loaded.
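
To see the loading steps in isolation, here is a minimal sketch on an in-memory table. The three accessions, the `flowering_time` column name, and the values are made up for illustration; only the sequence of steps (read, encode the index as bytes, drop missing values) mirrors the cells above.

```python
import io

import pandas as pd

# Hypothetical three-accession phenotype table with one missing value.
toy_csv = io.StringIO(
    "ecotypeid,flowering_time\n"
    "6092,61.5\n"
    "6974,\n"
    "6197,70.0\n"
)

toy_pheno = pd.read_csv(toy_csv, index_col=0)
# Encode accession identifiers as bytes so they compare equal to the hdf5 identifiers.
toy_pheno.index = toy_pheno.index.map(lambda x: str(int(x)).encode("UTF8"))
# Drop accessions without a phenotype value.
toy_pheno = toy_pheno.dropna()

print(toy_pheno.shape)
print(list(toy_pheno.index))
```

The accession with the empty phenotype field is dropped, which is the same reason the real table shrinks from 200 to 199 rows.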

## 2b. Load genotypes
The genotypes we're going to use are a subset of SNPs obtained from whole-genome resequencing of 1,135 *Arabidopsis thaliana* accessions ([1001 genomes](http://1001genomes.org/)).

Genotype data is stored as an [hdf5](https://www.h5py.org/) file, which is a composite data type.  This means that one hdf5 file stores multiple related data sets. 

The genotype hdf5 file we are using here consists of three data sets:
1.  'accessions' contains the accession identifiers.
2.  'positions' provides the SNP positions.
3.  'snps' gives the SNP calls themselves. SNPs are coded as 0 for reference allele and 1 for alternate allele.

The 'positions' dataset also carries an attribute, a small piece of metadata attached to the dataset, called 'chr_regions'.  It provides information about the chromosome location of each SNP.  We will use the attribute later when outputting GWAS results.

(If you are interested in learning more about how to use hdf5 files, check out https://www.pythonforthelab.com/blog/how-to-use-hdf5-files-in-python/ after class for a more detailed introduction.)
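
As a rough illustration of this layout, the sketch below writes and re-reads a tiny hdf5 file with the same three dataset names and a 'chr_regions' attribute. All values are toy data, and the interpretation of 'chr_regions' as one [start, end) range of SNP rows per chromosome is an assumption made for this example.

```python
import os
import tempfile

import h5py
import numpy as np

toy_path = os.path.join(tempfile.mkdtemp(), "toy_genotypes.hdf5")

# Write a tiny file with the same three dataset names as the genotype file.
with h5py.File(toy_path, "w") as f:
    f.create_dataset("accessions", data=np.array([b"88", b"108", b"139"]))
    pos = f.create_dataset("positions", data=np.array([55, 101, 55, 203]))
    f.create_dataset(
        "snps",
        data=np.array([[0, 1, 0], [1, 1, 0], [0, 0, 1], [1, 0, 1]], dtype=np.int8),
    )
    # Attribute: assumed [start, end) SNP row range per chromosome (toy values).
    pos.attrs["chr_regions"] = np.array([[0, 2], [2, 4]])

# Read it back: datasets are sliced like arrays, attributes live under .attrs.
with h5py.File(toy_path, "r") as f:
    toy_keys = sorted(f.keys())
    toy_accessions = f["accessions"][:]
    toy_regions = f["positions"].attrs["chr_regions"]
    toy_snps_shape = f["snps"].shape

print(toy_keys, toy_snps_shape)
```

Note that 'snps' is stored as SNPs x accessions, which is why the notebook transposes it before running GWAS.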


In [23]:
# load genotype data
geno_hdf = h5py.File(geno_file, 'r')
# structure of hdf5 file
# does geno_hdf match our expectations? Here, "key" refers to the three different data sets
for key in geno_hdf.keys():
    print(key)
    print(geno_hdf[key].shape)
    print(geno_hdf[key][0:10])

accessions
(1135,)
[b'88' b'108' b'139' b'159' b'265' b'350' b'351' b'403' b'410' b'424']
positions
(1070995,)
[ 55 101 139 203 237 291 332 375 431 502]
snps
(1070995, 1135)
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 1 1 1]]


## 2c.  Load K matrix
The kinship matrix is also stored as an hdf5 file.  It contains 2 datasets:
1. 'accessions' gives the accession identifiers
2. 'kinship' is the kinship matrix.  (In this case, it is calculated for all 1135 accessions in the complete 1001 genomes dataset.)

In [24]:
kin_hdf = h5py.File(kin_file, 'r')

print(kin_hdf['accessions'].shape)
print(kin_hdf['accessions'][0:5])
print(kin_hdf['kinship'].shape)
print(kin_hdf['kinship'][0:5])

(1135,)
[b'88' b'108' b'139' b'159' b'265']
(1135, 1135)
[[7.32260312 6.44804411 6.44906758 ... 6.33012188 6.33045141 6.33019166]
 [6.44804411 7.32260312 7.30934451 ... 6.32966248 6.3303099  6.33010831]
 [6.44906758 7.30934451 7.32260312 ... 6.33023237 6.33063943 6.33054639]
 [6.44129268 6.38031273 6.37999871 ... 6.34332622 6.34272144 6.34269043]
 [6.33677639 6.34437683 6.34471411 ... 6.2482848  6.24887795 6.24868411]]


# 3.  Generating matrices for GWAS
Now that the data have been loaded, we need to do some final manipulations to generate the appropriate input matrices.  We have two main objectives here:

1.  Since limix will not work with missing data, we need to make sure that all three matrices include the __same set of accessions__.
2.  We also need to make sure that the accessions appear in the __same order__ in all three matrices.

We will also remove any SNPs that don't meet our minor allele frequency threshold.

Again, this code will work for the 1001 genomes data out of the box, but will likely need to be modified for other data.  Therefore, focus on the __steps__ rather than on each individual line of code.

## 3a.  Generate phenotype matrix (Y)

We now have the phenotypes stored as the variable called 'pheno'.  To finish the phenotype matrix (Y), we need to __*remove non-genotyped accessions*__ and __*order the data*__ to match the genotype data.

In [25]:
# read the genotyped accession identifiers once instead of on every lookup
geno_accessions = geno_hdf['accessions'][:]
# remove non-genotyped accessions from phenotype
acn_genotyped = [acn for acn in pheno.index if acn in geno_accessions]
# subset phenotype data
pheno = pheno.loc[acn_genotyped]
# order accessions in phenotype according to the SNP matrix
# (sorted indices follow the order of the genotype file)
acn_indices = [np.where(geno_accessions == acn)[0][0] for acn in pheno.index]
acn_indices.sort()
acn_order = geno_accessions[acn_indices]
pheno = pheno.loc[acn_order]
# transform to a numpy phenotype matrix (Y)
Y = pheno.to_numpy()

The phenotype matrix (Y) is now ready.
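
The filter-then-reorder logic is easier to follow on a small example. This sketch uses four made-up genotyped accessions and three made-up phenotypes, one of which (`b"999"`) has no genotype; all names prefixed with `toy_` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy identifiers in genotype-file order (hypothetical values).
geno_accessions = np.array([b"88", b"108", b"139", b"159"])
# Toy phenotypes, in a different order and including one non-genotyped accession.
toy_pheno = pd.DataFrame(
    {"flowering_time": [70.0, 61.5, 55.0]},
    index=[b"159", b"999", b"88"],
)

# 1. Keep only phenotyped accessions that are also genotyped.
genotyped = set(geno_accessions.tolist())
keep = [acn for acn in toy_pheno.index if acn in genotyped]
toy_pheno = toy_pheno.loc[keep]

# 2. Find each accession's position in the genotype data and sort the positions.
toy_indices = sorted(int(np.where(geno_accessions == acn)[0][0]) for acn in toy_pheno.index)

# 3. Reorder the phenotypes to follow the genotype-file order.
toy_pheno = toy_pheno.loc[geno_accessions[toy_indices].tolist()]
toy_Y = toy_pheno.to_numpy()

print(toy_indices)
print(toy_Y)
```

After step 3, row *i* of `toy_Y` belongs to the accession at `toy_indices[i]` in the genotype data, which is the property the GWAS relies on.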

## 3b. Finish genotype matrix (G)
Now we can finish the genotype matrix (G).  Here we remove:
1. accessions that are not phenotyped
2. SNPs with a minor allele frequency below our threshold.

In [26]:
# subset SNP matrix for phenotyped genotypes
G = geno_hdf['snps'][:, acn_indices]

In [27]:
# remove SNPs with minor allele frequency below set threshold
# count allele 1 and 0 for each SNP
AC1 = G.sum(axis = 1)
AC0 = G.shape[1] - AC1
AC = np.vstack((AC0,AC1))
# minor allele count for each SNP (the smaller of the two allele counts)
MAC = np.min(AC, axis = 0)
# calculate minor allele frequency
MAF = MAC/G.shape[1]
# select SNPs with MAF at or above the threshold
SNP_indices = np.where(MAF >= MAF_thrs)[0]
SNPs_MAF = MAF[SNP_indices]
G = G[SNP_indices, :]

# transpose SNP-matrix into accessions x SNPs matrix
G = G.transpose()
geno_hdf.close()


The genotype matrix (G) is now ready.
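
Here is the minor allele frequency filter worked through on a toy matrix of 4 SNPs and 5 accessions. The matrix and the threshold of 0.3 are invented for the example; `np.minimum` is used as a compact equivalent of stacking the two allele counts and taking the column-wise minimum.

```python
import numpy as np

# Toy SNP matrix: 4 SNPs (rows) x 5 accessions (columns), coded 0/1.
toy_G = np.array([
    [0, 0, 0, 0, 0],   # monomorphic: MAF 0.0
    [1, 0, 0, 0, 0],   # allele 1 is the minor allele: MAF 0.2
    [1, 1, 1, 0, 0],   # allele 0 is the minor allele: MAF 0.4
    [1, 1, 1, 1, 0],   # allele 0 is the minor allele: MAF 0.2
])
toy_threshold = 0.3  # hypothetical threshold for this demo

n_accessions = toy_G.shape[1]
alt_counts = toy_G.sum(axis=1)
# The minor allele count is the smaller of the two allele counts.
minor_counts = np.minimum(alt_counts, n_accessions - alt_counts)
toy_MAF = minor_counts / n_accessions

# Keep SNPs at or above the threshold, then transpose to accessions x SNPs.
toy_keep = np.where(toy_MAF >= toy_threshold)[0]
toy_G_filtered = toy_G[toy_keep, :].transpose()

print(toy_MAF)
print(toy_G_filtered.shape)
```

Keeping `toy_keep` around matters: the notebook uses the equivalent `SNP_indices` later to look up the chromosome and position of each retained SNP.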

## 3c. Finish kinship matrix (K)
To finish the kinship array, subset for accessions that are phenotyped and genotyped.

In [28]:
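# subset kinship matrix for phenotyped and genotyped accessions
# (sorting the indices keeps kinship-file order; this matches Y and G only if the
# kinship and genotype files list accessions in the same order)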
acn_indices = [np.where(kin_hdf['accessions'][:] == acn)[0][0] for acn in pheno.index]
acn_indices.sort()
K = kin_hdf['kinship'][acn_indices, :][:, acn_indices]
kin_hdf.close()

The kinship matrix (K) is now ready.
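
The key point in this step is that the same indices select both rows and columns, so the subset stays square and symmetric. A small sketch on a made-up 4 x 4 matrix, also showing `np.ix_` as an equivalent one-step selection for arrays already in memory:

```python
import numpy as np

# Toy symmetric "kinship" matrix for 4 accessions (made-up values).
toy_K = np.arange(16, dtype=float).reshape(4, 4)
toy_K = (toy_K + toy_K.T) / 2
toy_indices = [0, 3]

# Two-step subsetting as in the cell above: rows first, then columns.
sub_two_step = toy_K[toy_indices, :][:, toy_indices]
# Equivalent single-step selection on an in-memory NumPy array.
sub_ix = toy_K[np.ix_(toy_indices, toy_indices)]

print(sub_ix)
print(np.array_equal(sub_two_step, sub_ix))
```

Indexing with `toy_K[toy_indices, toy_indices]` would be a mistake here: NumPy pairs the two lists element by element and returns only the diagonal entries.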

# 4. Running GWAS

## 4a.  Check input variables

Now we have all three of our input variables.  If you are running GWAS using something other than the 1001 genomes data, __these are the matrices that you need to generate to run limix__.  Let's quickly look at these variables to make sure we understand their format.  The number of accessions should be the same in all three matrices.
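
If you build these matrices from your own data, it can help to check the shapes programmatically before running the scan. The helper below is a hypothetical convenience function, not part of limix, and is demonstrated on toy arrays.

```python
import numpy as np


def check_gwas_inputs(Y, G, K):
    """Return the shared number of accessions, or raise if the inputs disagree."""
    n = Y.shape[0]
    if G.shape[0] != n:
        raise ValueError(f"G has {G.shape[0]} rows but Y has {n}")
    if K.shape != (n, n):
        raise ValueError(f"K has shape {K.shape}, expected {(n, n)}")
    if not np.allclose(K, K.T):
        raise ValueError("K is not symmetric")
    if np.isnan(Y).any() or np.isnan(G).any() or np.isnan(K).any():
        raise ValueError("inputs contain missing values")
    return n


# Toy inputs: 3 accessions, 4 SNPs (made-up values).
toy_Y = np.array([[0.1], [0.2], [0.3]])
toy_G = np.array([[0, 1, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]])
toy_K = np.eye(3)

print(check_gwas_inputs(toy_Y, toy_G, toy_K))
```

A check like this cannot confirm that the accessions are in the same order in all three matrices; that still depends on the subsetting steps in section 3.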

### Y matrix (phenotypes)
*Y is a numpy array.
The number of rows = number of accessions.
The number of columns = 1.*

In [29]:
print(type(Y))
print(Y.shape)
print(Y[0:5])

<class 'numpy.ndarray'>
(185, 1)
[[-0.00131547]
 [ 0.00224056]
 [-0.00463855]
 [-0.00796214]
 [-0.00774782]]


### G matrix (genotypes)
*G is a numpy array.  The number of rows = number of accessions.  The number of columns = the number of snps.*

In [30]:
print(type(G))
print(G.shape)
print(G[0:5])

<class 'numpy.ndarray'>
(185, 309308)
[[1 1 0 ... 0 0 1]
 [1 1 0 ... 0 0 1]
 [0 0 0 ... 0 0 1]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


### K matrix (kinship)
*K is a numpy array.  The number of rows and columns = the number of accessions*

In [31]:
print(type(K))
print(K.shape)
print(K)

<class 'numpy.ndarray'>
(185, 185)
[[7.32260312 6.93561235 6.49246048 ... 6.37091539 6.34770118 6.41537247]
 [6.93561235 7.32260312 6.49742083 ... 6.37353029 6.35021528 6.43864677]
 [6.49246048 6.49742083 7.32260312 ... 6.39016946 6.37054128 6.47352624]
 ...
 [6.37091539 6.37353029 6.39016946 ... 7.32260312 6.3134323  6.37963041]
 [6.34770118 6.35021528 6.37054128 ... 6.3134323  7.32260312 6.33498531]
 [6.41537247 6.43864677 6.47352624 ... 6.37963041 6.33498531 7.32260312]]


## 4b.  Run limix with K correction

Now that we have phenotype (Y), genotype (G), and kinship (K) matrices, we can run GWAS using the 'scan' function from limix.

In [32]:
r = scan(G, Y, K = K, lik = 'normal', M = None, verbose = True)

Normalising input... done (2.05 seconds).


LMM: 26it [00:00, 1863.66it/s]
Scanning: 100%|██████████| 51/51 [00:02<00:00, 25.50it/s]
Results: 100%|██████████| 309308/309308 [00:18<00:00, 17026.73it/s]


Hypothesis 0

𝐲 ~ 𝓝(𝙼𝜶, 0.000⋅𝙺 + 0.000⋅𝙸)

M     = ['offset']
𝜶     = [-0.00151861]
se(𝜶) = [0.01795275]
lml   = 649.3166889093745

Hypothesis 2

𝐲 ~ 𝓝(𝙼𝜶 + G𝛃, s(0.000⋅𝙺 + 0.000⋅𝙸))

          lml      cov. effsizes   cand. effsizes
-------------------------------------------------
mean   6.498e+02      -1.429e-03       -2.747e-04
std    7.222e-01       8.381e-04        3.248e-03
min    6.493e+02      -1.319e-02       -1.916e-02
25%    6.494e+02      -1.616e-03       -1.877e-03
50%    6.496e+02      -1.511e-03       -1.828e-04
75%    6.500e+02      -1.364e-03        1.393e-03
max    6.704e+02       1.729e-02        1.828e-02

Likelihood-ratio test p-values

       𝓗₀ vs 𝓗₂ 
----------------
mean   4.953e-01
std    2.907e-01
min    8.741e-11
25%    2.406e-01
50%    4.920e-01
75%    7.489e-01
max    1.000e+00


## 4c.  Output results for limix with K correction

This saves the GWAS results as a .csv file that gives chromosome (chr), position (pos), p-value (pvalue), minor allele frequency (maf), and effect size (GVE) for each SNP.  We will explore this output in the next Jupyter notebook.

In [33]:
# save results

# get chromosomes, positions, and minor allele frequencies
# you would need to recode this section if working with something other than the 1001 genomes data
geno_hdf = h5py.File(geno_file, 'r')
chr_index = geno_hdf['positions'].attrs['chr_regions']
chromosomes = [bisect(chr_index[:, 1], snp_index) + 1 for snp_index in SNP_indices]
positions_all = geno_hdf['positions'][:]
positions = [positions_all[snp] for snp in SNP_indices]
mafs = SNPs_MAF #from section 3b

# get p-value and effect size
# these variables are output from limix and should work regardless of initial input data format
#extract p-values
pvalues = r.stats.pv20.tolist()
#extract effect sizes
effsizes = r.effsizes['h2']['effsize'][r.effsizes['h2']['effect_type'] == 'candidate'].tolist()

gwas_results = pd.DataFrame(list(zip(chromosomes, positions, pvalues, mafs,effsizes)), columns = ['chr', 'pos', 'pvalue', 'maf', 'GVE'])
gwas_results.to_csv(output_file, index = False)
geno_hdf.close()
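
The chromosome lookup in the cell above relies on `bisect`, which can look cryptic at first. The sketch below reproduces it on a toy 'chr_regions' array, assuming each row holds the [start, end) range of SNP rows for one chromosome; the values are invented.

```python
from bisect import bisect

import numpy as np

# Toy 'chr_regions': one assumed [start, end) SNP row range per chromosome.
toy_regions = np.array([[0, 3], [3, 5], [5, 9]])
toy_snp_indices = [0, 2, 3, 4, 8]

# bisect counts how many chromosome end indices are <= the SNP's row index;
# adding 1 turns that count into a 1-based chromosome number.
toy_chromosomes = [bisect(toy_regions[:, 1], i) + 1 for i in toy_snp_indices]

print(toy_chromosomes)  # [1, 1, 2, 2, 3]
```

Row index 3 equals the end of the first range, so it is assigned to chromosome 2, which is consistent with the end index being exclusive.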

## 4d.  Run limix without K correction

For comparison, we will also run limix without K in the model.

In [17]:
r_nc = scan(G, Y, lik="normal", M=None, verbose=True)

Normalising input... done (0.81 seconds).


Scanning: 100%|██████████| 51/51 [00:00<00:00, 56.18it/s]
Results: 100%|██████████| 124002/124002 [00:09<00:00, 13613.45it/s]


Hypothesis 0

𝐲 ~ 𝓝(𝙼𝜶, 0.000⋅𝙸)

M     = ['offset']
𝜶     = [-1.00722984e-05]
se(𝜶) = [0.0005593]
lml   = 640.0475177901653

Hypothesis 2

𝐲 ~ 𝓝(𝙼𝜶 + G𝛃, s(0.000⋅𝙸))

          lml      cov. effsizes   cand. effsizes
-------------------------------------------------
mean   6.411e+02      -9.300e-05        1.940e-04
std    1.318e+00       7.636e-04        1.931e-03
min    6.400e+02      -6.261e-03       -1.018e-02
25%    6.402e+02      -4.436e-04       -1.127e-03
50%    6.405e+02      -7.108e-05        2.363e-04
75%    6.414e+02       2.774e-04        1.554e-03
max    6.593e+02       5.216e-03        7.450e-03

Likelihood-ratio test p-values

       𝓗₀ vs 𝓗₂ 
----------------
mean   3.773e-01
std    3.046e-01
min    5.739e-10
25%    9.442e-02
50%    3.163e-01
75%    6.287e-01
max    1.000e+00


## 4e.  Output results for limix without K correction

In [18]:
# save results
# link chromosome and position to p-values and effect sizes
geno_hdf = h5py.File(geno_file, 'r')
chr_index = geno_hdf['positions'].attrs['chr_regions']
chromosomes = [bisect(chr_index[:, 1], snp_index) + 1 for snp_index in SNP_indices]
positions_all = geno_hdf['positions'][:]
positions = [positions_all[snp] for snp in SNP_indices]
mafs = SNPs_MAF #from section 3b

pvalues = r_nc.stats.pv20.tolist()
effsizes = r_nc.effsizes['h2']['effsize'][r_nc.effsizes['h2']['effect_type'] == 'candidate'].tolist()

gwas_results = pd.DataFrame(list(zip(chromosomes, positions, pvalues, mafs, effsizes)), columns = ['chr', 'pos', 'pvalue', 'maf', 'GVE'])
gwas_results.to_csv(output_file_nc, index = False)
geno_hdf.close()
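
With both result files written, one simple way to compare the two models is to line the SNPs up by chromosome and position. This is only a sketch: the two tables below are made-up stand-ins with the same columns as the saved files, and the `log10_diff` column is a hypothetical summary chosen for illustration.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-ins for the two result files (same columns as the saved .csv files).
with_k = pd.read_csv(io.StringIO(
    "chr,pos,pvalue,maf,GVE\n"
    "1,55,0.5,0.2,0.001\n"
    "1,101,0.0001,0.3,0.01\n"
))
without_k = pd.read_csv(io.StringIO(
    "chr,pos,pvalue,maf,GVE\n"
    "1,55,0.05,0.2,0.002\n"
    "1,101,0.000001,0.3,0.02\n"
))

# Align SNPs by chromosome and position; suffixes mark which model each column came from.
merged = with_k.merge(without_k, on=["chr", "pos"], suffixes=("_k", "_nc"))
# Positive values mean the SNP looks more significant without K correction.
merged["log10_diff"] = -np.log10(merged["pvalue_nc"]) + np.log10(merged["pvalue_k"])

print(merged[["chr", "pos", "pvalue_k", "pvalue_nc", "log10_diff"]])
```

For the real files, the two `io.StringIO` tables would be replaced by `pd.read_csv(output_file)` and `pd.read_csv(output_file_nc)`.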