-
Notifications
You must be signed in to change notification settings - Fork 41
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
0 parents
commit a99d737
Showing
444 changed files
with
4,367,892 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,42 @@ | ||
## Background | ||
|
||
This code repository contains code for predicting expression effects for human genome variants and training new expression models. The ExPecto framework is described in the following manuscript: | ||
|
||
Jian Zhou, Chandra L. Theesfeld, Kevin Yao, Kathleen M. Chen, Aaron K. Wong, and Olga G. Troyanskaya, Deep learning sequence-based ab initio expression prediction and disease-risk identification, Nature Genetics, 2018 (Accepted) | ||
|
||
## Install | ||
Use `pip install -r requirements.txt` to install the dependencies. | ||
Run `sh download_resources.sh; tar xf resources.tar.gz` to download and extract necessary model files and chromatin representations for training new ExPecto models. | ||
## Usage | ||
|
||
##### Example run : | ||
```bash | ||
python chromatin.py ./example/example.vcf | ||
python predict.py --coorFile ./example/example.vcf --geneFile ./example/example.vcf.bed.sorted.bed.closestgene --snpEffectFilePattern ./example/example.vcf.shift_SHIFT.diff.h5 --modelList ./resources/modellist --output output.csv | ||
``` | ||
|
||
The output will be saved to output.csv. The first few columns of the csv file will be the same as the vcf files. The additional columns include predicted expression effect (log fold change) for each of the input models in the order given by the modelList file. `chromatin.py` computes the chromatin effects of the variant both on-site and to nearby regions, using trained convolutional neural network model. `predict.py` computes predicted tissue-specific expression effects from the predicted chromatin effects. | ||
|
||
##### Explanations for the arguments: | ||
|
||
`--coorFile ./example/example.vcf` includes the variants of interest in vcf format. | ||
`--closestGeneFile ./example/example.vcf.bed.sorted.bed.closestgene` is the gene association file which specify the associated gene for each variant and the expression effect is predicted wrt that gene. The content of the gene association file has to have to following information: the first column and third columns are chromosome names and positions, and the last three columns are the strand of the associated gene, the ENSEMBL gene id (matched with the provided geneanno.csv file) and distance to the representative TSS of that gene. The gene association file does not need to be in the same order as the vcf file. The distance is signed and calculated as '' TSS position - variant position" regardless of on which strand the gene is transcribed. The representive TSSes can be found in the provided geneanno.csv file. The associated gene can be specified by finding the closest representative TSS. When is known of gene is of interest, such as for eQTL predictions, the know gene association can be used. This can be done for example using closest-features from [BEDOPS](https://bedops.readthedocs.io/en/latest/) and the representation TSS of protein coding genes that we included, for example: | ||
``` | ||
closest-features --delim '\t' --closest --dist <(awk '{printf $1"\t"$2-1"\t"$2"\n"}' ./example/example.vcf|sort-bed - ) ./resources/geneanno.pc.sorted.bed > ./example/example.vcf.bed.sorted.bed.closestgene | ||
``` | ||
|
||
`--snpEffectFilePattern ./example/example.vcf_shiftSHIFT_outdir/infile.vcf.wt2100.fasta.ref.h5.diff.h5` specify the name pattern of the input epigenomic effect prediction files. `SHIFT` string is used as a placeholder that is substituted automatically to the shift positions (e.g. 0, -200, -400, ...). For generating the epigenomic effect predictions, use the scripts under ./convnet directory and see instructions. Note that we will soon release a pure python version of convnet. | ||
|
||
Optional: For very large input files use the split functionality to distribute the prediction into multiple runs. For example, `--splitFlag --splitIndex 0 --splitFold 10` will divide the input into 10 chunks and process only the first chunk. | ||
|
||
##### For training new models: | ||
|
||
Example run: | ||
```bash | ||
python ./train.py --expFile ./resources/geneanno.exp.csv --targetIndex 1 --output model.adipose | ||
``` | ||
|
||
This trains an ExPecto model using the Adipose gene expression profile in the first column of the `geneanno.exp.csv` file. `./Xreducedall.2002.npy` is the default precomputed epigenomic features. For training new ExPecto model(s) for your custom (differential) expression profile, replace geneanno.exp.csv with the expression profile. Each row in the file has to be the expression value of gene. The gene order has to be the same as the geneanno.csv. The generated model can be used by `predict.py` by adding the path of the model file ('.save') to the `modelList` file. | ||
|
||
|
||
Contact me: Jian Zhou [jzhoup@gmail.com](mailto:jzhoup@gmail.com) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,176 @@ | ||
# -*- coding: utf-8 -*- | ||
"""Compute chromatin representation of variants (required by predict.py). | ||
This script takes a vcf file, compute the effect of the variant both at the | ||
variant position and at nearby positions, and output the effects as the | ||
representation that can be used by predict.py. | ||
Example: | ||
$ python chromatin.py ./example/example.vcf | ||
""" | ||
import argparse | ||
import math | ||
import pyfasta | ||
import torch | ||
from torch.utils.serialization import load_lua | ||
import numpy as np | ||
import pandas as pd | ||
import h5py | ||
|
||
parser = argparse.ArgumentParser(description='Process some integers.') | ||
parser.add_argument('inputfile', type=str, help='Input file in vcf format') | ||
parser.add_argument('--maxshift', action="store", | ||
dest="maxshift", type=int, default=800, | ||
help='Maximum shift distance for computing nearby effects') | ||
parser.add_argument('--inputsize', action="store", dest="inputsize", type=int, | ||
default=2000, help="The input sequence window size for neural network") | ||
parser.add_argument('--batchsize', action="store", dest="batchsize", | ||
type=int, default=32, help="Batch size for neural network predictions.") | ||
parser.add_argument('--cuda', action='store_true') | ||
args = parser.parse_args() | ||
|
||
genome = pyfasta.Fasta('./resources/hg19.fa') | ||
model = load_lua('./resources/expecto.nn.2002.cpu') | ||
model.evaluate() | ||
if args.cuda: | ||
model.cuda() | ||
|
||
CHRS = ['chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', | ||
'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', | ||
'chr18', 'chr19', 'chr20', 'chr21', 'chr22', 'chrX','chrY'] | ||
|
||
|
||
inputfile = args.inputfile | ||
maxshift = args.maxshift | ||
inputsize = args.inputsize | ||
batchSize = args.batchsize | ||
windowsize = inputsize + 100 | ||
|
||
|
||
def encodeSeqs(seqs, inputsize=2000): | ||
"""Convert sequences to 0-1 encoding and truncate to the input size. | ||
The output concatenates the forward and reverse complement sequence | ||
encodings. | ||
Args: | ||
seqs: list of sequences (e.g. produced by fetchSeqs) | ||
inputsize: the number of basepairs to encode in the output | ||
Returns: | ||
numpy array of dimension: (2 x number of sequence) x 4 x inputsize | ||
2 x number of sequence because of the concatenation of forward and reverse | ||
complement sequences. | ||
""" | ||
seqsnp = np.zeros((len(seqs), 4, inputsize), np.bool_) | ||
|
||
mydict = {'A': np.asarray([1, 0, 0, 0]), 'G': np.asarray([0, 1, 0, 0]), | ||
'C': np.asarray([0, 0, 1, 0]), 'T': np.asarray([0, 0, 0, 1]), | ||
'N': np.asarray([0, 0, 0, 0]), 'H': np.asarray([0, 0, 0, 0]), | ||
'a': np.asarray([1, 0, 0, 0]), 'g': np.asarray([0, 1, 0, 0]), | ||
'c': np.asarray([0, 0, 1, 0]), 't': np.asarray([0, 0, 0, 1]), | ||
'n': np.asarray([0, 0, 0, 0]), '-': np.asarray([0, 0, 0, 0])} | ||
|
||
n = 0 | ||
for line in seqs: | ||
cline = line[int(math.floor(((len(line) - inputsize) / 2.0))):int(math.floor(len(line) - (len(line) - inputsize) / 2.0))] | ||
for i, c in enumerate(cline): | ||
seqsnp[n, :, i] = mydict[c] | ||
n = n + 1 | ||
|
||
# get the complementary sequences | ||
dataflip = seqsnp[:, ::-1, ::-1] | ||
seqsnp = np.concatenate([seqsnp, dataflip], axis=0) | ||
return seqsnp | ||
|
||
|
||
def fetchSeqs(chr, pos, ref, alt, shift=0, inputsize=2000): | ||
"""Fetches sequences from the genome. | ||
Retrieves sequences centered at the given position with the given inputsize. | ||
Returns both reference and alternative allele sequences . An additional 100bp | ||
is retrived to accommodate indels. | ||
Args: | ||
chr: the chromosome name that must matches one of the names in CHRS. | ||
pos: chromosome coordinate (1-based). | ||
ref: the reference allele. | ||
alt: the alternative allele. | ||
shift: retrived sequence center position - variant position. | ||
inputsize: the targeted sequence length (inputsize+100bp is retrived for | ||
reference allele). | ||
Returns: | ||
A string that contains sequence with the reference allele, | ||
A string that contains sequence with the alternative allele, | ||
A boolean variable that tells whether the reference allele matches the | ||
reference genome | ||
The third variable is returned for diagnostic purpose. Generally it is | ||
good practice to check whether the proportion of reference allele | ||
matches is as expected. | ||
""" | ||
windowsize = inputsize + 100 | ||
mutpos = int(windowsize / 2 - 1 - shift) | ||
# return string: ref sequence, string: alt sequence, Bool: whether ref allele matches with reference genome | ||
seq = genome.sequence({'chr': chr, 'start': pos + shift - | ||
int(windowsize / 2 - 1), 'stop': pos + shift + int(windowsize / 2)}) | ||
return seq[:mutpos] + ref + seq[(mutpos + len(ref)):], seq[:mutpos] + alt + seq[(mutpos + len(ref)):], seq[mutpos:(mutpos + len(ref))].upper() == ref.upper() | ||
|
||
|
||
vcf = pd.read_csv(inputfile, sep='\t', header=None, comment='#') | ||
|
||
# standardize | ||
vcf.iloc[:, 0] = 'chr' + vcf.iloc[:, 0].map(str).str.replace('chr', '') | ||
vcf = vcf[vcf.iloc[:, 0].isin(CHRS)] | ||
|
||
for shift in [0, ] + list(range(-200, -maxshift - 1, -200)) + list(range(200, maxshift + 1, 200)): | ||
refseqs = [] | ||
altseqs = [] | ||
ref_matched_bools = [] | ||
for i in range(vcf.shape[0]): | ||
refseq, altseq, ref_matched_bool = fetchSeqs( | ||
vcf[0][i], vcf[1][i], vcf[3][i], vcf[4][i], shift=shift) | ||
refseqs.append(refseq) | ||
altseqs.append(altseq) | ||
ref_matched_bools.append(ref_matched_bool) | ||
|
||
if shift == 0: | ||
# only need to be checked once | ||
print("Number of variants with reference allele matched with reference genome:") | ||
print(np.sum(ref_matched_bools)) | ||
print("Number of input variants:") | ||
print(len(ref_matched_bools)) | ||
|
||
ref_encoded = encodeSeqs(refseqs).astype(np.float32) | ||
alt_encoded = encodeSeqs(altseqs).astype(np.float32) | ||
#print(ref_encoded.shape) | ||
#print(alt_encoded.shape) | ||
|
||
ref_preds = [] | ||
for i in range(int(1 + ref_encoded.shape[0] / batchSize)): | ||
input = torch.from_numpy(ref_encoded[int(i*batchSize):int((i+1)*batchSize),:,:]).unsqueeze(3) | ||
if args.cuda: | ||
input = input.cuda() | ||
ref_preds.append(model.forward(input).cpu().numpy().copy()) | ||
ref_preds = np.vstack(ref_preds) | ||
|
||
alt_preds = [] | ||
for i in range(int(1 + alt_encoded.shape[0] / batchSize)): | ||
input = torch.from_numpy(alt_encoded[int(i*batchSize):int((i+1)*batchSize),:,:]).unsqueeze(3) | ||
if args.cuda: | ||
input = input.cuda() | ||
alt_preds.append(model.forward(input).cpu().numpy().copy()) | ||
alt_preds = np.vstack(alt_preds) | ||
|
||
#ref_preds = np.vstack(map(lambda x: model.forward(x).numpy(), | ||
# np.array_split(ref_encoded, int(1 + len(refseqs) / batchSize)))) | ||
#alt_encoded = torch.from_numpy(encodeSeqs(altseqs).astype(np.float32)).unsqueeze(3) | ||
#alt_preds = np.vstack(map(lambda x: model.forward(x).numpy(), | ||
# np.array_split(alt_encoded, int(1 + len(altseqs) / batchSize)))) | ||
diff = alt_preds - ref_preds | ||
f = h5py.File(inputfile + '.shift_' + str(shift) + '.diff.h5', 'w') | ||
f.create_dataset('pred', data=diff) | ||
f.close() |
Empty file.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
chr1 1265154 - C T | ||
chr1 1265460 - C T | ||
chr1 2957600 - T C | ||
chr1 3691528 - G A | ||
chr1 8021919 - C G | ||
chr1 8939842 - G A | ||
chr1 10457540 - T C | ||
chr1 11072117 - C T | ||
chr1 11072691 - G A | ||
chr1 11083408 - G A |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
1 1265153 1265154 C T 1 1266659 1266660 + ENSG00000169962 1506 | ||
1 1265459 1265460 C T 1 1266659 1266660 + ENSG00000169962 1200 | ||
1 2957599 2957600 T C 1 2938067 2938068 + ENSG00000169717 -19532 | ||
1 3691527 3691528 G A 1 3713057 3713058 - ENSG00000130764 21530 | ||
1 8021918 8021919 C G 1 8021770 8021771 + ENSG00000116288 -148 | ||
1 8939841 8939842 G A 1 8938744 8938745 - ENSG00000074800 -1097 | ||
1 10457539 10457540 T C 1 10459120 10459121 + ENSG00000142657 1581 | ||
1 11072116 11072117 C T 1 11072740 11072741 + ENSG00000120948 624 | ||
1 11072690 11072691 G A 1 11072740 11072741 + ENSG00000120948 50 | ||
1 11083407 11083408 G A 1 11072740 11072741 + ENSG00000120948 -10667 |
Oops, something went wrong.