# Running GWAS using limix
We're going to run a genome-wide association study (GWAS) using the limix library for Python.
Limix is freely available [here](https://github.com/limix/limix) and has extensive [documentation](https://limix.readthedocs.io/).

# 1.  Initial setup steps

## 1a. Set up the environment

In [1]:
import pandas as pd
import numpy as np
import os
import h5py
from limix.qtl import scan
from bisect import bisect

## 1b. Define variables

In [2]:
# phenotypes
pheno_file = './data/subset_flowering_time_16.csv'
# genotypes
#geno_file = './data/all_chromosomes_binary.hdf5'
geno_file = './data/subset_all_chromosomes_binary.hdf5'
# kinship matrix
kin_file = './data/kinship_ibs_binary_mac5.h5py'
# minor allele frequency threshold
MAF_thrs = 0.1
# output results
output_file = './data/subset_flowering_time_16_gwas.csv'

# 2. Preparing input for GWAS
We need three variables to run GWAS:
1.  Phenotype matrix (Y)
2.  Genotype matrix (G)
3.  Kinship matrix (K)

Let's prepare these one at a time.

## 2a. Load phenotypes
The phenotype data are stored in a 2-column .csv file.
The first column specifies the accession identifier, the second column contains the phenotype value.  For flowering time, there are 200 accessions.

In [3]:
# load phenotype data
pheno = pd.read_csv(pheno_file, index_col = 0)

In [4]:
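# remove rows with missing or non-finite phenotype values
# (masking with a boolean DataFrame would keep those rows and only set them to NaN)
pheno = pheno[np.isfinite(pheno).all(axis = 1)]
# encode the index as UTF-8 bytes for compatibility with the genotype data
pheno.index = pheno.index.map(lambda x: str(int(x)).encode('UTF8'))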
# does pheno match our expectations?
print(pheno.shape)
print(pheno.head(n=5))

(200, 1)
           flowering_time_16
ecotypeid                   
b'770'                 72.25
b'801'                 88.25
b'991'                106.75
b'1062'                68.25
b'1367'                88.75
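
The index is encoded as bytes because the accession identifiers in the genotype file are read back as bytes (this is visible when the file is inspected in section 2b), and in Python a `str` never compares equal to a `bytes` object. A minimal sketch with made-up identifier lists (`pheno_ids` and `geno_ids` are illustrative and are not read from the data files):

```python
# Toy identifiers; the values are illustrative only.
pheno_ids = [770.0, 801.0, 991.0]      # as they might be parsed from a CSV
geno_ids = [b'88', b'770', b'991']     # bytes, as read from the HDF5 dataset

# Plain strings never match bytes objects.
as_str = [str(int(x)) for x in pheno_ids]
print([s in geno_ids for s in as_str])      # → [False, False, False]

# After encoding, identifiers present in both lists match.
as_bytes = [s.encode('UTF8') for s in as_str]
print([b in geno_ids for b in as_bytes])    # → [True, False, True]
```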


Phenotypes are loaded.  However, since all accessions in the phenotype matrix __*also*__ need to be in the genotype matrix, we need to load the genotypes before we can finalize the phenotype matrix.

## 2b. Load genotypes
The genotypes we're going to use are a subset of SNPs obtained from whole genome resequencing of 1,135 *Arabidopsis thaliana* accessions ([1001 genomes](http://1001genomes.org/)).
Genotype data are stored as an [hdf5](https://www.h5py.org/) file, which is a composite data type.  The genotype file we are using here consists of three datasets.  The first ('accessions') contains the accession identifiers.  The second ('positions') provides the SNP positions.  The third ('snps') gives the SNP calls themselves, with SNPs coded as 0 or 1.  The 'positions' dataset also has associated attributes that describe chromosome location (the code in section 3 reads the 'chr_regions' attribute), which we will use later when outputting GWAS results.

In [5]:
# load genotype data
geno_hdf = h5py.File(geno_file, 'r')
# does geno_hdf match our expectations?
for key in geno_hdf.keys():
    print(key)
    print(geno_hdf[key].shape)
    print(geno_hdf[key][0:5])

accessions
(1135,)
[b'88' b'108' b'139' b'159' b'265']
positions
(1070995,)
[ 55 101 139 203 237]
snps
(1070995, 1135)
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


## 2c.  Finish phenotype matrix (Y)

We now have the phenotypes stored as the variable called pheno.  To finish the phenotype matrix (Y), we need to remove non-genotyped accessions and order the data to match the genotype data.

In [6]:
# read the genotyped accession identifiers once instead of on every lookup
geno_acns = geno_hdf['accessions'][:]
# remove non-genotyped accessions from phenotype
acn_genotyped = [acn for acn in pheno.index if acn in geno_acns]
# subset phenotype data
pheno = pheno.loc[acn_genotyped]
# order accessions in the phenotype data according to the SNP matrix
acn_indices = [np.where(geno_acns == acn)[0][0] for acn in pheno.index]
acn_indices.sort()
acn_order = geno_acns[acn_indices]
pheno = pheno.loc[acn_order]
# transform to phenotype matrix (Y)
Y = pheno.to_numpy()
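
The same idea, keeping only the accessions present in both data sets and putting the phenotypes in genotype-file order, can be tried on toy arrays. This is a sketch using `np.isin` rather than the loop above. All names and values here (`geno_acns_toy`, `pheno_acns_toy`, and so on) are made up for illustration.

```python
import numpy as np

# Toy data: accession ids in genotype-file order, and unordered phenotypes.
geno_acns_toy = np.array([b'88', b'108', b'139', b'159', b'265'])
pheno_acns_toy = np.array([b'265', b'999', b'88', b'139'])   # b'999' is not genotyped
pheno_vals_toy = np.array([70.0, 65.5, 88.25, 101.0])

# Indices (into the genotype arrays) of accessions that also have a phenotype.
# np.flatnonzero returns them in increasing order, i.e. genotype-file order.
keep = np.flatnonzero(np.isin(geno_acns_toy, pheno_acns_toy))
ordered_acns = geno_acns_toy[keep]

# Look up each phenotype value in that order and build an n x 1 matrix.
lookup = dict(zip(pheno_acns_toy.tolist(), pheno_vals_toy.tolist()))
Y_toy = np.array([[lookup[acn]] for acn in ordered_acns.tolist()])

print(keep)          # → [0 2 4]
print(ordered_acns)  # → [b'88' b'139' b'265']
print(Y_toy.shape)   # → (3, 1)
```

Row `i` of `Y_toy` then describes the same accession as column `keep[i]` of a SNPs x accessions matrix.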

## 2d. Finish genotype matrix (G)
The phenotype matrix is ready to go.  Now we can finish the genotype matrix (G).  Here we remove accessions that were not phenotyped and SNPs with a minor allele frequency below our threshold.

In [7]:
# subset SNP-matrix for phenotyped genotypes
G = geno_hdf['snps'][:, acn_indices]

In [8]:
# remove SNPs with minor allele frequency below set threshold
# count allele 1 and 0 for each SNP
AC1 = G.sum(axis = 1)
AC0 = G.shape[1] - AC1
AC = np.vstack((AC0,AC1))
# minor allele count (MAC) for each SNP
MAC = np.min(AC, axis = 0)
# calculate minor allele frequency
MAF = MAC/G.shape[1]
# select SNPs with MAF at or above the threshold
SNP_indices = np.where(MAF >= MAF_thrs)[0]
SNPs_MAF = MAF[SNP_indices]
G = G[SNP_indices, :]

# transpose SNP-matrix into accessions x SNPs matrix
G = G.transpose()
geno_hdf.close()
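
To make the minor allele frequency (MAF) filter concrete, here is a small worked sketch on a made-up SNPs x accessions matrix with 0/1 calls. The matrix `G_toy` and the threshold `0.3` are illustrative assumptions, not values from the data.

```python
import numpy as np

# Toy SNP matrix: 4 SNPs (rows) x 5 accessions (columns), coded 0/1.
G_toy = np.array([[0, 0, 0, 0, 1],
                  [0, 1, 1, 0, 1],
                  [1, 1, 1, 1, 1],
                  [1, 0, 1, 0, 0]])
n_acc = G_toy.shape[1]

# Count allele 1 per SNP; the minor allele is whichever allele is rarer.
ac1 = G_toy.sum(axis=1)
mac_toy = np.minimum(ac1, n_acc - ac1)
maf_toy = mac_toy / n_acc

# Keep SNPs whose MAF is at or above the (illustrative) threshold.
keep_snps = np.flatnonzero(maf_toy >= 0.3)

print(mac_toy)     # → [1 2 0 2]
print(keep_snps)   # → [1 3]
```

The third SNP is monomorphic (MAF of 0), so it is dropped at any positive threshold.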

The genotype matrix (G) is now ready.

## 2e. Load kinship data

In [9]:
kin_hdf = h5py.File(kin_file, 'r')

## 2f. Finish kinship matrix (K)

In [10]:
# subset kinship matrix for phenotyped and genotyped accessions
acn_indices = [np.where(kin_hdf['accessions'][:] == acn)[0][0] for acn in pheno.index]
acn_indices.sort()
K = kin_hdf['kinship'][acn_indices, :][:, acn_indices]
kin_hdf.close()
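
Sorting the indices keeps the accessions in kinship-file order, so the rows of K line up with Y and G only if the kinship and genotype files list accessions in the same relative order. That is an assumption about these files, not something checked above. A sketch of an order-explicit alternative on in-memory toy arrays (all names and values below are made up):

```python
import numpy as np

# Toy kinship data whose accession order differs from the target order.
kin_acns_toy = np.array([b'139', b'88', b'265', b'108'])
K_full = np.arange(16, dtype=float).reshape(4, 4)
K_full = (K_full + K_full.T) / 2          # make it symmetric, like a kinship matrix

# Target order, e.g. the order of the phenotype index.
target_acns = [b'88', b'139', b'265']

# Map each accession to its row/column in the kinship matrix.
position = {acn: i for i, acn in enumerate(kin_acns_toy.tolist())}
idx = [position[acn] for acn in target_acns]

# np.ix_ selects the same rows and columns, in the requested order.
K_toy = K_full[np.ix_(idx, idx)]

print(idx)           # → [1, 0, 2]
print(K_toy.shape)   # → (3, 3)
```

This works on a NumPy array already loaded into memory. Indexing an h5py dataset directly is more restrictive about index order, which is one reason to sort first as in the cell above.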

In [11]:
print(K.shape)
print(G.shape)
print(Y.shape)

(170, 170)
(170, 125066)
(170, 1)


# 3. Running GWAS

Let's just quickly make sure our input matrices are the expected sizes.

In [12]:
print(K.shape)
print(G.shape)
print(Y.shape)

(170, 170)
(170, 125066)
(170, 1)
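
The same check can be written as assertions, so that a mismatch stops the notebook instead of relying on reading the printed shapes. This is a sketch: `check_gwas_inputs` is a hypothetical helper, not part of limix, and the toy matrices are random.

```python
import numpy as np

def check_gwas_inputs(Y, G, K):
    """Hypothetical helper: assert that Y (n x 1), G (n x m) and K (n x n) agree on n."""
    n = Y.shape[0]
    assert Y.ndim == 2, "Y should be a 2-D matrix"
    assert G.ndim == 2 and G.shape[0] == n, "G needs one row per accession"
    assert K.shape == (n, n), "K needs to be square with one row per accession"
    return n

# Toy inputs with 5 accessions and 8 SNPs.
rng = np.random.default_rng(0)
Y_toy = rng.normal(size=(5, 1))
G_toy = rng.integers(0, 2, size=(5, 8))
K_toy = np.eye(5)

n_toy = check_gwas_inputs(Y_toy, G_toy, K_toy)
print(n_toy)   # → 5
```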


Now that we have the genotype (G), phenotype (Y), and kinship (K) matrices, we can run GWAS using the scan function from limix.

In [13]:
r = scan(G, Y, K = K, lik = 'normal', M = None, verbose = True)

Normalising input... done (1.12 second).


LMM: 55it [00:00, 2481.49it/s]
Scanning: 100%|██████████| 51/51 [00:00<00:00, 74.87it/s]
Results: 100%|██████████| 125066/125066 [00:08<00:00, 15078.61it/s]


Hypothesis 0

𝐲 ~ 𝓝(𝙼𝜶, 422.851⋅𝙺 + 0.000⋅𝙸)

M     = ['offset']
𝜶     = [86.3851492]
se(𝜶) = [51.39663417]
lml   = -740.1760193587634

Hypothesis 2

𝐲 ~ 𝓝(𝙼𝜶 + G𝛃, s(422.851⋅𝙺 + 0.000⋅𝙸))

          lml       cov. effsizes   cand. effsizes
--------------------------------------------------
mean   -7.397e+02       8.611e+01        5.201e-01
std     7.455e-01       1.722e+00        4.118e+00
min    -7.402e+02       6.975e+01       -2.084e+01
25%    -7.401e+02       8.547e+01       -2.129e+00
50%    -7.399e+02       8.630e+01        4.310e-01
75%    -7.395e+02       8.687e+01        3.080e+00
max    -7.275e+02       1.001e+02        3.051e+01

Likelihood-ratio test p-values

       𝓗₀ vs 𝓗₂ 
----------------
mean   4.976e-01
std    2.907e-01
min    4.893e-07
25%    2.454e-01
50%    4.965e-01
75%    7.491e-01
max    1.000e+00
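
To see how the log marginal likelihoods (lml) relate to the p-values, here is a sketch of a standard likelihood-ratio test. It assumes one degree of freedom per candidate SNP and illustrates the general test, not limix's internal code. The two lml values are copied from the output above, and the alternative one is the rounded maximum from the table.

```python
import math

lml_null = -740.1760193587634   # 'lml' under Hypothesis 0 in the output above
lml_alt = -727.5                # rounded 'max' lml under Hypothesis 2 in the output above

# Likelihood-ratio statistic: twice the gain in log-likelihood.
lrt_stat = 2.0 * (lml_alt - lml_null)

# Survival function of a chi-square distribution with 1 degree of freedom,
# written with erfc so that only the standard library is needed.
p_value = math.erfc(math.sqrt(lrt_stat / 2.0))

print(lrt_stat, p_value)
```

Because the alternative lml is rounded, the result should only land near the smallest p-value in the table, not match it exactly.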



In [14]:
# save results
# link chromosome and position to p-values and effect sizes
geno_hdf = h5py.File(geno_file, 'r')
# column 1 of 'chr_regions' is used as the SNP index at which each chromosome ends
chr_regions = geno_hdf['positions'].attrs['chr_regions']
chromosomes = [bisect(chr_regions[:, 1], snp_index) + 1 for snp_index in SNP_indices]
positions_all = geno_hdf['positions'][:]
positions = [positions_all[snp] for snp in SNP_indices]

pvalues = r.stats.pv20.tolist()
effsizes = r.effsizes['h2']['effsize'][r.effsizes['h2']['effect_type'] == 'candidate'].tolist()

gwas_results = pd.DataFrame(list(zip(chromosomes, positions, pvalues, SNPs_MAF, MAC[SNP_indices], effsizes)), columns = ['chr', 'pos', 'pvalue', 'maf', 'mac', 'GVE'])
gwas_results.to_csv(output_file, index = False)
geno_hdf.close()
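
The chromosome lookup relies on `bisect` finding which index range a SNP falls into. A toy sketch, assuming each row of the regions table is a half-open `[start, end)` range of SNP indices for one chromosome (the table `chr_regions_toy` below is made up):

```python
from bisect import bisect

# Assumed layout: one [start, end) row of SNP indices per chromosome.
chr_regions_toy = [[0, 3], [3, 7], [7, 10]]
ends = [end for _, end in chr_regions_toy]

# SNP indices (rows of the SNP matrix) that survived filtering.
snp_indices_toy = [0, 2, 3, 6, 7, 9]

# bisect counts how many chromosome ends are <= the SNP index;
# adding 1 turns that count into a 1-based chromosome number.
chromosomes_toy = [bisect(ends, i) + 1 for i in snp_indices_toy]

print(chromosomes_toy)   # → [1, 1, 2, 2, 3, 3]
```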

This saves the GWAS results as a .csv file that gives the chromosome (chr), position (pos), p-value (pvalue), minor allele frequency (maf), minor allele count (mac), and effect size (GVE) for each SNP.  We will explore this output in the next Jupyter notebook.