initial commit

FunctionLab · Apr 11, 2018 · a99d737 · a99d737
commit a99d737
Show file tree

Hide file tree

Showing 444 changed files with 4,367,892 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -0,0 +1,42 @@
+## Background
+
+This code repository contains code for predicting expression effects for human genome variants and training new expression models. The ExPecto framework is described in the following manuscript:
+
+Jian Zhou, Chandra L. Theesfeld, Kevin Yao, Kathleen M. Chen, Aaron K. Wong,  and Olga G. Troyanskaya, Deep learning sequence-based ab initio expression prediction and disease-risk identification, Nature Genetics, 2018 (Accepted)
+
+## Install
+Use `pip install -r requirements.txt` to install the dependencies.
+Run `sh download_resources.sh; tar xf resources.tar.gz` to download and extract necessary model files and chromatin representations for training new ExPecto models.
+## Usage
+
+##### Example run :
+```bash
+python chromatin.py ./example/example.vcf
+python predict.py --coorFile ./example/example.vcf --geneFile ./example/example.vcf.bed.sorted.bed.closestgene --snpEffectFilePattern ./example/example.vcf.shift_SHIFT.diff.h5 --modelList ./resources/modellist --output output.csv
+```
+
+The output will be saved to output.csv. The first few columns of the csv file will be the same as the vcf files. The additional columns include predicted expression effect (log fold change) for each of the input models in the order given by the modelList file. `chromatin.py` computes the chromatin effects of the variant both on-site and to nearby regions, using trained convolutional neural network model. `predict.py` computes predicted tissue-specific expression effects from the predicted chromatin effects.
+
+#####  Explanations for the arguments:
+
+`--coorFile ./example/example.vcf` includes the variants of interest in vcf format.
+`--closestGeneFile ./example/example.vcf.bed.sorted.bed.closestgene` is the gene association file which specify the associated gene for each variant and the expression effect is predicted wrt that gene. The content of the gene association file has to have to following information: the first column and third columns are chromosome names and positions, and the last three columns are the strand of the associated gene, the ENSEMBL gene id (matched with the provided geneanno.csv file) and distance to the representative TSS of that gene. The gene association file does not need to be in the same order as the vcf file. The distance is signed and calculated as '' TSS position - variant position" regardless of on which strand the gene is transcribed. The representive TSSes can be found in the provided geneanno.csv file. The associated gene can be specified by finding the closest representative TSS. When is known of gene is of interest, such as for eQTL predictions, the know gene association can be used. This can be done for example using closest-features from [BEDOPS](https://bedops.readthedocs.io/en/latest/) and the representation TSS of protein coding genes that we included, for example:
+```
+closest-features --delim '\t' --closest --dist <(awk '{printf $1"\t"$2-1"\t"$2"\n"}' ./example/example.vcf|sort-bed - ) ./resources/geneanno.pc.sorted.bed > ./example/example.vcf.bed.sorted.bed.closestgene
+```
+
+`--snpEffectFilePattern ./example/example.vcf_shiftSHIFT_outdir/infile.vcf.wt2100.fasta.ref.h5.diff.h5` specify the name pattern of the input epigenomic effect prediction files. `SHIFT` string is used as a placeholder that is substituted automatically to the shift positions (e.g. 0, -200, -400, ...). For generating the epigenomic effect predictions, use the scripts under ./convnet directory and see instructions. Note that we will soon release a pure python version of convnet.
+
+Optional:  For very large input files use the split functionality to distribute the prediction into multiple runs. For example, `--splitFlag --splitIndex 0 --splitFold 10` will divide the input into 10 chunks and process only the first chunk.
+
+##### For training new models:
+
+Example run:
+```bash
+python ./train.py --expFile ./resources/geneanno.exp.csv --targetIndex 1 --output model.adipose
+```
+
+This trains an ExPecto model using the Adipose gene expression profile in the first column of the `geneanno.exp.csv` file. `./Xreducedall.2002.npy` is the default precomputed epigenomic features. For training new ExPecto model(s) for your custom (differential) expression profile, replace geneanno.exp.csv with the expression profile. Each row in the file has to be the expression value of gene. The gene order has to be the same as the geneanno.csv. The generated model can be used by `predict.py` by adding the path of the model file ('.save') to the `modelList` file.
+
+
+Contact me: Jian Zhou [jzhoup@gmail.com](mailto:jzhoup@gmail.com)
diff --git a/chromatin.py b/chromatin.py
@@ -0,0 +1,176 @@
+# -*- coding: utf-8 -*-
+"""Compute chromatin representation of variants (required by predict.py).
+
+This script takes a vcf file, compute the effect of the variant both at the
+variant position and at nearby positions, and output the effects as the
+representation that can be used by predict.py.
+
+Example:
+        $ python chromatin.py ./example/example.vcf
+
+"""
+import argparse
+import math
+import pyfasta
+import torch
+from torch.utils.serialization import load_lua
+import numpy as np
+import pandas as pd
+import h5py
+
+parser = argparse.ArgumentParser(description='Process some integers.')
+parser.add_argument('inputfile', type=str, help='Input file in vcf format')
+parser.add_argument('--maxshift', action="store",
+                    dest="maxshift", type=int, default=800,
+                    help='Maximum shift distance for computing nearby effects')
+parser.add_argument('--inputsize', action="store", dest="inputsize", type=int,
+                    default=2000, help="The input sequence window size for neural network")
+parser.add_argument('--batchsize', action="store", dest="batchsize",
+                    type=int, default=32, help="Batch size for neural network predictions.")
+parser.add_argument('--cuda', action='store_true')
+args = parser.parse_args()
+
+genome = pyfasta.Fasta('./resources/hg19.fa')
+model = load_lua('./resources/expecto.nn.2002.cpu')
+model.evaluate()
+if args.cuda:
+    model.cuda()
+
+CHRS = ['chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9',
+        'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17',
+        'chr18', 'chr19', 'chr20', 'chr21', 'chr22', 'chrX','chrY']
+
+
+inputfile = args.inputfile
+maxshift = args.maxshift
+inputsize = args.inputsize
+batchSize = args.batchsize
+windowsize = inputsize + 100
+
+
+def encodeSeqs(seqs, inputsize=2000):
+    """Convert sequences to 0-1 encoding and truncate to the input size.
+    The output concatenates the forward and reverse complement sequence
+    encodings.
+
+    Args:
+        seqs: list of sequences (e.g. produced by fetchSeqs)
+        inputsize: the number of basepairs to encode in the output
+
+    Returns:
+        numpy array of dimension: (2 x number of sequence) x 4 x inputsize
+
+    2 x number of sequence because of the concatenation of forward and reverse
+    complement sequences.
+    """
+    seqsnp = np.zeros((len(seqs), 4, inputsize), np.bool_)
+
+    mydict = {'A': np.asarray([1, 0, 0, 0]), 'G': np.asarray([0, 1, 0, 0]),
+            'C': np.asarray([0, 0, 1, 0]), 'T': np.asarray([0, 0, 0, 1]),
+            'N': np.asarray([0, 0, 0, 0]), 'H': np.asarray([0, 0, 0, 0]),
+            'a': np.asarray([1, 0, 0, 0]), 'g': np.asarray([0, 1, 0, 0]),
+            'c': np.asarray([0, 0, 1, 0]), 't': np.asarray([0, 0, 0, 1]),
+            'n': np.asarray([0, 0, 0, 0]), '-': np.asarray([0, 0, 0, 0])}
+
+    n = 0
+    for line in seqs:
+        cline = line[int(math.floor(((len(line) - inputsize) / 2.0))):int(math.floor(len(line) - (len(line) - inputsize) / 2.0))]
+        for i, c in enumerate(cline):
+            seqsnp[n, :, i] = mydict[c]
+        n = n + 1
+
+    # get the complementary sequences
+    dataflip = seqsnp[:, ::-1, ::-1]
+    seqsnp = np.concatenate([seqsnp, dataflip], axis=0)
+    return seqsnp
+
+
+def fetchSeqs(chr, pos, ref, alt, shift=0, inputsize=2000):
+    """Fetches sequences from the genome.
+
+    Retrieves sequences centered at the given position with the given inputsize.
+    Returns both reference and alternative allele sequences . An additional 100bp
+    is retrived to accommodate indels.
+
+    Args:
+        chr: the chromosome name that must matches one of the names in CHRS.
+        pos: chromosome coordinate (1-based).
+        ref: the reference allele.
+        alt: the alternative allele.
+        shift: retrived sequence center position - variant position.
+        inputsize: the targeted sequence length (inputsize+100bp is retrived for
+                reference allele).
+
+    Returns:
+        A string that contains sequence with the reference allele,
+        A string that contains sequence with the alternative allele,
+        A boolean variable that tells whether the reference allele matches the
+        reference genome
+
+        The third variable is returned for diagnostic purpose. Generally it is
+        good practice to check whether the proportion of reference allele
+        matches is as expected.
+
+    """
+    windowsize = inputsize + 100
+    mutpos = int(windowsize / 2 - 1 - shift)
+    # return string: ref sequence, string: alt sequence, Bool: whether ref allele matches with reference genome
+    seq = genome.sequence({'chr': chr, 'start': pos + shift -
+                           int(windowsize / 2 - 1), 'stop': pos + shift + int(windowsize / 2)})
+    return seq[:mutpos] + ref + seq[(mutpos + len(ref)):], seq[:mutpos] + alt + seq[(mutpos + len(ref)):], seq[mutpos:(mutpos + len(ref))].upper() == ref.upper()
+
+
+vcf = pd.read_csv(inputfile, sep='\t', header=None, comment='#')
+
+# standardize
+vcf.iloc[:, 0] = 'chr' + vcf.iloc[:, 0].map(str).str.replace('chr', '')
+vcf = vcf[vcf.iloc[:, 0].isin(CHRS)]
+
+for shift in [0, ] + list(range(-200, -maxshift - 1, -200)) + list(range(200, maxshift + 1, 200)):
+    refseqs = []
+    altseqs = []
+    ref_matched_bools = []
+    for i in range(vcf.shape[0]):
+        refseq, altseq, ref_matched_bool = fetchSeqs(
+            vcf[0][i], vcf[1][i], vcf[3][i], vcf[4][i], shift=shift)
+        refseqs.append(refseq)
+        altseqs.append(altseq)
+        ref_matched_bools.append(ref_matched_bool)
+
+    if shift == 0:
+        # only need to be checked once
+        print("Number of variants with reference allele matched with reference genome:")
+        print(np.sum(ref_matched_bools))
+        print("Number of input variants:")
+        print(len(ref_matched_bools))
+
+    ref_encoded = encodeSeqs(refseqs).astype(np.float32)
+    alt_encoded = encodeSeqs(altseqs).astype(np.float32)
+    #print(ref_encoded.shape)
+    #print(alt_encoded.shape)
+
+    ref_preds = []
+    for i in range(int(1 + ref_encoded.shape[0] / batchSize)):
+        input = torch.from_numpy(ref_encoded[int(i*batchSize):int((i+1)*batchSize),:,:]).unsqueeze(3)
+        if args.cuda:
+            input = input.cuda()
+        ref_preds.append(model.forward(input).cpu().numpy().copy())
+    ref_preds = np.vstack(ref_preds)
+
+    alt_preds = []
+    for i in range(int(1 + alt_encoded.shape[0] / batchSize)):
+        input = torch.from_numpy(alt_encoded[int(i*batchSize):int((i+1)*batchSize),:,:]).unsqueeze(3)
+        if args.cuda:
+            input = input.cuda()
+        alt_preds.append(model.forward(input).cpu().numpy().copy())
+    alt_preds = np.vstack(alt_preds)
+
+    #ref_preds = np.vstack(map(lambda x: model.forward(x).numpy(),
+    #        np.array_split(ref_encoded, int(1 + len(refseqs) / batchSize))))
+    #alt_encoded = torch.from_numpy(encodeSeqs(altseqs).astype(np.float32)).unsqueeze(3)
+    #alt_preds = np.vstack(map(lambda x: model.forward(x).numpy(),
+    #        np.array_split(alt_encoded, int(1 + len(altseqs) / batchSize))))
+    diff = alt_preds - ref_preds
+    f = h5py.File(inputfile + '.shift_' + str(shift) + '.diff.h5', 'w')
+    f.create_dataset('pred', data=diff)
+    f.close()
diff --git a/download_resources.sh b/download_resources.sh
diff --git a/example/example.vcf b/example/example.vcf
@@ -0,0 +1,10 @@
+chr1	1265154	-	C	T
+chr1	1265460	-	C	T
+chr1	2957600	-	T	C
+chr1	3691528	-	G	A
+chr1	8021919	-	C	G
+chr1	8939842	-	G	A
+chr1	10457540	-	T	C
+chr1	11072117	-	C	T
+chr1	11072691	-	G	A
+chr1	11083408	-	G	A
diff --git a/example/example.vcf.bed.sorted.bed.closestgene b/example/example.vcf.bed.sorted.bed.closestgene
@@ -0,0 +1,10 @@
+1	1265153	1265154	C	T	1	1266659	1266660	+	ENSG00000169962	1506
+1	1265459	1265460	C	T	1	1266659	1266660	+	ENSG00000169962	1200
+1	2957599	2957600	T	C	1	2938067	2938068	+	ENSG00000169717	-19532
+1	3691527	3691528	G	A	1	3713057	3713058	-	ENSG00000130764	21530
+1	8021918	8021919	C	G	1	8021770	8021771	+	ENSG00000116288	-148
+1	8939841	8939842	G	A	1	8938744	8938745	-	ENSG00000074800	-1097
+1	10457539	10457540	T	C	1	10459120	10459121	+	ENSG00000142657	1581
+1	11072116	11072117	C	T	1	11072740	11072741	+	ENSG00000120948	624
+1	11072690	11072691	G	A	1	11072740	11072741	+	ENSG00000120948	50
+1	11083407	11083408	G	A	1	11072740	11072741	+	ENSG00000120948	-10667