Skip to content

Commit

Permalink
initial commit
Browse files Browse the repository at this point in the history
  • Loading branch information
jzthree committed Apr 11, 2018
0 parents commit a99d737
Show file tree
Hide file tree
Showing 444 changed files with 4,367,892 additions and 0 deletions.
42 changes: 42 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,42 @@
## Background

This code repository contains code for predicting expression effects for human genome variants and training new expression models. The ExPecto framework is described in the following manuscript:

Jian Zhou, Chandra L. Theesfeld, Kevin Yao, Kathleen M. Chen, Aaron K. Wong, and Olga G. Troyanskaya, Deep learning sequence-based ab initio expression prediction and disease-risk identification, Nature Genetics, 2018 (Accepted)

## Install
Use `pip install -r requirements.txt` to install the dependencies.
Run `sh download_resources.sh; tar xf resources.tar.gz` to download and extract necessary model files and chromatin representations for training new ExPecto models.
## Usage

##### Example run :
```bash
python chromatin.py ./example/example.vcf
python predict.py --coorFile ./example/example.vcf --geneFile ./example/example.vcf.bed.sorted.bed.closestgene --snpEffectFilePattern ./example/example.vcf.shift_SHIFT.diff.h5 --modelList ./resources/modellist --output output.csv
```

The output will be saved to output.csv. The first few columns of the csv file will be the same as the vcf files. The additional columns include predicted expression effect (log fold change) for each of the input models in the order given by the modelList file. `chromatin.py` computes the chromatin effects of the variant both on-site and to nearby regions, using trained convolutional neural network model. `predict.py` computes predicted tissue-specific expression effects from the predicted chromatin effects.

##### Explanations for the arguments:

`--coorFile ./example/example.vcf` includes the variants of interest in vcf format.
`--closestGeneFile ./example/example.vcf.bed.sorted.bed.closestgene` is the gene association file which specify the associated gene for each variant and the expression effect is predicted wrt that gene. The content of the gene association file has to have to following information: the first column and third columns are chromosome names and positions, and the last three columns are the strand of the associated gene, the ENSEMBL gene id (matched with the provided geneanno.csv file) and distance to the representative TSS of that gene. The gene association file does not need to be in the same order as the vcf file. The distance is signed and calculated as '' TSS position - variant position" regardless of on which strand the gene is transcribed. The representive TSSes can be found in the provided geneanno.csv file. The associated gene can be specified by finding the closest representative TSS. When is known of gene is of interest, such as for eQTL predictions, the know gene association can be used. This can be done for example using closest-features from [BEDOPS](https://bedops.readthedocs.io/en/latest/) and the representation TSS of protein coding genes that we included, for example:
```
closest-features --delim '\t' --closest --dist <(awk '{printf $1"\t"$2-1"\t"$2"\n"}' ./example/example.vcf|sort-bed - ) ./resources/geneanno.pc.sorted.bed > ./example/example.vcf.bed.sorted.bed.closestgene
```

`--snpEffectFilePattern ./example/example.vcf_shiftSHIFT_outdir/infile.vcf.wt2100.fasta.ref.h5.diff.h5` specify the name pattern of the input epigenomic effect prediction files. `SHIFT` string is used as a placeholder that is substituted automatically to the shift positions (e.g. 0, -200, -400, ...). For generating the epigenomic effect predictions, use the scripts under ./convnet directory and see instructions. Note that we will soon release a pure python version of convnet.

Optional: For very large input files use the split functionality to distribute the prediction into multiple runs. For example, `--splitFlag --splitIndex 0 --splitFold 10` will divide the input into 10 chunks and process only the first chunk.

##### For training new models:

Example run:
```bash
python ./train.py --expFile ./resources/geneanno.exp.csv --targetIndex 1 --output model.adipose
```

This trains an ExPecto model using the Adipose gene expression profile in the first column of the `geneanno.exp.csv` file. `./Xreducedall.2002.npy` is the default precomputed epigenomic features. For training new ExPecto model(s) for your custom (differential) expression profile, replace geneanno.exp.csv with the expression profile. Each row in the file has to be the expression value of gene. The gene order has to be the same as the geneanno.csv. The generated model can be used by `predict.py` by adding the path of the model file ('.save') to the `modelList` file.


Contact me: Jian Zhou [jzhoup@gmail.com](mailto:jzhoup@gmail.com)
176 changes: 176 additions & 0 deletions chromatin.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,176 @@
# -*- coding: utf-8 -*-
"""Compute chromatin representation of variants (required by predict.py).
This script takes a vcf file, compute the effect of the variant both at the
variant position and at nearby positions, and output the effects as the
representation that can be used by predict.py.
Example:
$ python chromatin.py ./example/example.vcf
"""
import argparse
import math
import pyfasta
import torch
from torch.utils.serialization import load_lua
import numpy as np
import pandas as pd
import h5py

parser = argparse.ArgumentParser(description='Process some integers.')
parser.add_argument('inputfile', type=str, help='Input file in vcf format')
parser.add_argument('--maxshift', action="store",
dest="maxshift", type=int, default=800,
help='Maximum shift distance for computing nearby effects')
parser.add_argument('--inputsize', action="store", dest="inputsize", type=int,
default=2000, help="The input sequence window size for neural network")
parser.add_argument('--batchsize', action="store", dest="batchsize",
type=int, default=32, help="Batch size for neural network predictions.")
parser.add_argument('--cuda', action='store_true')
args = parser.parse_args()

genome = pyfasta.Fasta('./resources/hg19.fa')
model = load_lua('./resources/expecto.nn.2002.cpu')
model.evaluate()
if args.cuda:
model.cuda()

CHRS = ['chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9',
'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17',
'chr18', 'chr19', 'chr20', 'chr21', 'chr22', 'chrX','chrY']


inputfile = args.inputfile
maxshift = args.maxshift
inputsize = args.inputsize
batchSize = args.batchsize
windowsize = inputsize + 100


def encodeSeqs(seqs, inputsize=2000):
"""Convert sequences to 0-1 encoding and truncate to the input size.
The output concatenates the forward and reverse complement sequence
encodings.
Args:
seqs: list of sequences (e.g. produced by fetchSeqs)
inputsize: the number of basepairs to encode in the output
Returns:
numpy array of dimension: (2 x number of sequence) x 4 x inputsize
2 x number of sequence because of the concatenation of forward and reverse
complement sequences.
"""
seqsnp = np.zeros((len(seqs), 4, inputsize), np.bool_)

mydict = {'A': np.asarray([1, 0, 0, 0]), 'G': np.asarray([0, 1, 0, 0]),
'C': np.asarray([0, 0, 1, 0]), 'T': np.asarray([0, 0, 0, 1]),
'N': np.asarray([0, 0, 0, 0]), 'H': np.asarray([0, 0, 0, 0]),
'a': np.asarray([1, 0, 0, 0]), 'g': np.asarray([0, 1, 0, 0]),
'c': np.asarray([0, 0, 1, 0]), 't': np.asarray([0, 0, 0, 1]),
'n': np.asarray([0, 0, 0, 0]), '-': np.asarray([0, 0, 0, 0])}

n = 0
for line in seqs:
cline = line[int(math.floor(((len(line) - inputsize) / 2.0))):int(math.floor(len(line) - (len(line) - inputsize) / 2.0))]
for i, c in enumerate(cline):
seqsnp[n, :, i] = mydict[c]
n = n + 1

# get the complementary sequences
dataflip = seqsnp[:, ::-1, ::-1]
seqsnp = np.concatenate([seqsnp, dataflip], axis=0)
return seqsnp


def fetchSeqs(chr, pos, ref, alt, shift=0, inputsize=2000):
"""Fetches sequences from the genome.
Retrieves sequences centered at the given position with the given inputsize.
Returns both reference and alternative allele sequences . An additional 100bp
is retrived to accommodate indels.
Args:
chr: the chromosome name that must matches one of the names in CHRS.
pos: chromosome coordinate (1-based).
ref: the reference allele.
alt: the alternative allele.
shift: retrived sequence center position - variant position.
inputsize: the targeted sequence length (inputsize+100bp is retrived for
reference allele).
Returns:
A string that contains sequence with the reference allele,
A string that contains sequence with the alternative allele,
A boolean variable that tells whether the reference allele matches the
reference genome
The third variable is returned for diagnostic purpose. Generally it is
good practice to check whether the proportion of reference allele
matches is as expected.
"""
windowsize = inputsize + 100
mutpos = int(windowsize / 2 - 1 - shift)
# return string: ref sequence, string: alt sequence, Bool: whether ref allele matches with reference genome
seq = genome.sequence({'chr': chr, 'start': pos + shift -
int(windowsize / 2 - 1), 'stop': pos + shift + int(windowsize / 2)})
return seq[:mutpos] + ref + seq[(mutpos + len(ref)):], seq[:mutpos] + alt + seq[(mutpos + len(ref)):], seq[mutpos:(mutpos + len(ref))].upper() == ref.upper()


vcf = pd.read_csv(inputfile, sep='\t', header=None, comment='#')

# standardize
vcf.iloc[:, 0] = 'chr' + vcf.iloc[:, 0].map(str).str.replace('chr', '')
vcf = vcf[vcf.iloc[:, 0].isin(CHRS)]

for shift in [0, ] + list(range(-200, -maxshift - 1, -200)) + list(range(200, maxshift + 1, 200)):
refseqs = []
altseqs = []
ref_matched_bools = []
for i in range(vcf.shape[0]):
refseq, altseq, ref_matched_bool = fetchSeqs(
vcf[0][i], vcf[1][i], vcf[3][i], vcf[4][i], shift=shift)
refseqs.append(refseq)
altseqs.append(altseq)
ref_matched_bools.append(ref_matched_bool)

if shift == 0:
# only need to be checked once
print("Number of variants with reference allele matched with reference genome:")
print(np.sum(ref_matched_bools))
print("Number of input variants:")
print(len(ref_matched_bools))

ref_encoded = encodeSeqs(refseqs).astype(np.float32)
alt_encoded = encodeSeqs(altseqs).astype(np.float32)
#print(ref_encoded.shape)
#print(alt_encoded.shape)

ref_preds = []
for i in range(int(1 + ref_encoded.shape[0] / batchSize)):
input = torch.from_numpy(ref_encoded[int(i*batchSize):int((i+1)*batchSize),:,:]).unsqueeze(3)
if args.cuda:
input = input.cuda()
ref_preds.append(model.forward(input).cpu().numpy().copy())
ref_preds = np.vstack(ref_preds)

alt_preds = []
for i in range(int(1 + alt_encoded.shape[0] / batchSize)):
input = torch.from_numpy(alt_encoded[int(i*batchSize):int((i+1)*batchSize),:,:]).unsqueeze(3)
if args.cuda:
input = input.cuda()
alt_preds.append(model.forward(input).cpu().numpy().copy())
alt_preds = np.vstack(alt_preds)

#ref_preds = np.vstack(map(lambda x: model.forward(x).numpy(),
# np.array_split(ref_encoded, int(1 + len(refseqs) / batchSize))))
#alt_encoded = torch.from_numpy(encodeSeqs(altseqs).astype(np.float32)).unsqueeze(3)
#alt_preds = np.vstack(map(lambda x: model.forward(x).numpy(),
# np.array_split(alt_encoded, int(1 + len(altseqs) / batchSize))))
diff = alt_preds - ref_preds
f = h5py.File(inputfile + '.shift_' + str(shift) + '.diff.h5', 'w')
f.create_dataset('pred', data=diff)
f.close()
Empty file added download_resources.sh
Empty file.
10 changes: 10 additions & 0 deletions example/example.vcf
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
chr1 1265154 - C T
chr1 1265460 - C T
chr1 2957600 - T C
chr1 3691528 - G A
chr1 8021919 - C G
chr1 8939842 - G A
chr1 10457540 - T C
chr1 11072117 - C T
chr1 11072691 - G A
chr1 11083408 - G A
10 changes: 10 additions & 0 deletions example/example.vcf.bed.sorted.bed.closestgene
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
1 1265153 1265154 C T 1 1266659 1266660 + ENSG00000169962 1506
1 1265459 1265460 C T 1 1266659 1266660 + ENSG00000169962 1200
1 2957599 2957600 T C 1 2938067 2938068 + ENSG00000169717 -19532
1 3691527 3691528 G A 1 3713057 3713058 - ENSG00000130764 21530
1 8021918 8021919 C G 1 8021770 8021771 + ENSG00000116288 -148
1 8939841 8939842 G A 1 8938744 8938745 - ENSG00000074800 -1097
1 10457539 10457540 T C 1 10459120 10459121 + ENSG00000142657 1581
1 11072116 11072117 C T 1 11072740 11072741 + ENSG00000120948 624
1 11072690 11072691 G A 1 11072740 11072741 + ENSG00000120948 50
1 11083407 11083408 G A 1 11072740 11072741 + ENSG00000120948 -10667
Loading

0 comments on commit a99d737

Please sign in to comment.