/
README
229 lines (169 loc) · 10.6 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
regfm (Regulatory Fine Mapping) v1.0.0
regfm is a command line tool for identifying individual regulatory
regions mediating disease risk and the genes controlled by these
regulators. It uses summary statistics from genome-wide
association studies and a list of lead SNPs as the input, and outputs
posterior probabilities for each association being mediated by a
regulatory effect, for each regulator being the causal risk factor,
and for each gene in the locus being pathogenic.
To demonstrate the code and ensure it runs on your machine, we provide
data for the BACH2 locus on chromosome 6 from the IMSGC study of
multiple sclerosis risk (IMSGC, Nature Genetics 2013), available on
immunobase.org (https://www.immunobase.org). This is the locus
included in Figure 3 of our
upcoming paper (http://biorxiv.org/content/early/2016/05/19/054361).
regfm is a series of R scripts controlled by a shell script wrapper,
which takes a set of input and runs the analysis. Within this flow we
calculate LD from the 1000 Genomes project, which is quite large.
Contents
1. Dependencies
2. Preparing your data
3. Running regfm
4. Description of output
5. Details of package contents
6. Troubleshooting
7. Citation
8. Contact information
## Dependencies and installation
regfm depends on previous software and requires some additional,
pre-computed files that are too large to distribute via this
repository. For linkage disequilibrium calculations, it relies on an
external reference panel, which you will have to access from the
appropriate project page.
## Other software regfm depends on:
R (version >= 3.1.0), including several non-base packages
- bedtools (https://github.com/arq5x/bedtools2) (version >= 2-2.19.1)
- plink (https://www.cog-genomics.org/plink2) (version >= 1.9)
Please note that plink v1.07 is likely to run out of memory with the 1000 Genomes reference sets, so we strongly recommend you use v1.9.
To check which R packages are available on your system, copy the following code into an R console
## R code ##
# utility function checks if packages are installed
check.libs.fn <- function(X = c("e1071","grid","SDMTools","gplots","graphics","gap", "gtable"), ret=F) {
tmp <- installed.packages()
not.installed <- setdiff(X, tmp,1 )
if(!ret) print(paste(length(not.installed), "packages not installed"))
if(ret) return(not.installed)
}
# will print out number of libs not installed
check.list.fn(ret=F)
# if some are missing, install en masse with
install.packages(check.libs.fn(ret=T))
## End R code ##
You can download regfm from github via your browser or with the command
## Command-line code ##
git clone https://github.com/cotsapaslab/regfm.git
## End command-line code ##
## Pre-computed data
We provide correlation matrices between all DNase I hypersensitivity
clusters and expression levels for all genes, as described in our
paper. These data are available via DropBox, as GitHub does not host
large files. In the main directory you will find the script
Download-Data.sh, which will download these data for you into a
specified directory. You must specify if the script should use the
wget or the curl utility to do so, depending on what is installed
on your system.
## Command-line code ##
# Is curl installed?
curl --help
# is wget installed?
wget --help
# if you get a help message for at least one of these, you are good
# to go
./Download-Data.sh . curl
# or
./Download-Data.sh . wget
## End command-line code ##
You can also obtain the data via the DropBox website, though you will
have to sync your account to that directory.
## LD reference data
Finally, regfm requires an LD reference panel in plink formate
(bed/bim/fam). In our paper we used the 1000 Genomes Europeans
(excluding Finnish samples), but you should choose the most
appropriate panel for the study you are analyzing. Due to the size of
these data we cannot make them available directly. The plink
documentation has a
helpful list of sources (https://www.cog-genomics.org/plink2/resources). In
our distribution, we include data for chromosome 6 to allow the
example to run.
## Preparing Your Data
For each trait you wish to analyze, you will need two input files:
1 - Association summary statistics in BED format. This is one SNP per line, tab-delimited plain text with five columns: chromosome number, basepair position - 1 (in hg19 coordinates), basepair position (hg19), SNP ID, and association p value. The file should have no header line. Create a directory under regfm/BedData/ for each trait you wish to analyze, copy your bed file into that directory, and rename it allSNPs.bed.
2 - Lead SNPs to fine-map. You will also have to tell regfm which associations to fine-map, by providing a list of lead SNPs. Create a tab-delimited text file with column headers SNP, Chr, Position, P and corresponding entries SNP ID, chromosome number, basepair position (hg19), and association p value. Copy this to your input directory under regfm/BedData/ and rename it leadsnps.txt.
## Example files
As an example, we have included input files for MS association in the BACH2 locus in regfm/BedData/example.
## Running regfm
regfm assumes you will run the wrapper from the top-level directory where it lives. You will need to edit the first section of regfm.sh:
## Command-line code ##
maindir=<PATH TO THE REGFM DIRECTORY>
## End command-line code ##
You will also have to specify where the LD reference panel files are, by editing the next line:
## Command-line code ##
KGEurDir=$bigdatadir/1000Genome/European
## End command-line code ##
Additionally, if either or both of the plink or bedtools binaries are not in directories in your default path, you can uncomment and edit the commands in the next few lines of the script:
## Command-line code ##
export PATH=$PATH:<PLINKDIR>
export PATH=$PATH:<BEDTOOLSDIR>
## End command-line code ##
You can test if your binaries are in the default path by typing the following commands into your terminal:
## Command-line code ##
plink --version
bedtools --version
## End command-line code ##
If either produces an error, that binary is not in your default path and you should edit the PATH variable line.
Finally, to run regfm on data in your new directory <TRAIT>, simply run the wrapper script as:
## Command-line code ##
./regfm.sh <TRAIT>
## End command-line code ##
The example data can be run as:
## Command-line code ##
./regfm.sh example
## End command-line code ##
## Output Results
1 - Per-locus summaries. Results for each trait can be found in the tab-delimited file ./Tables/<TRAIT>/Final-Results.txt, which contains an entry for each lead SNP association a tab-delimited file with columns:
* Disease - Trait* Chr - Chromosome number of the risk locus* Start - Genomic position of the start site of 2Mbp risk region (hg19)* End - Genomic position of the end site of 2Mbp risk region (hg19)* Lead.SNP - Most associated SNP in the locus* N.CI - Number of 99% credible interval SNPs* N.DHS - Total number of DHS clusters overlapping the 2Mbp region* N.Gene - Total number of genes overlapping the 2Mbp region* N.Pr.DHS - Number of DHS clusters with at least one CI on them* Total.Rho - Posterior probability of association attributable to DHS clusters with at least one CI* Cell - Cell type in which ρ is significant (FDR < 0.1)* N.DHS.Cell - Number of DHS clusters in 2Mbp region active in the cell type* N.Pr.DHS.Cell - Number of prioritzed DHS clusters that are active in the cell type* Rho.perCell - Posterior probability of association attributable to DHS clusters active in the cell type* Rho.FDR - False discovery rate of the cell type \rho* Gene - Prioritized gene* Gamma.FDR - False discovery rate of the gene \gamma
2 - Per-locus figures. A figure for each locus can be found in ./Figures/<TRAIT>/. These show the association signal in the 2Mbp region centered on the lead SNP (the top panel), the regulatory elements with significant \rho values and their correlation to genes in the region (middle panel), and how gene expression varies by regulatory element accessibility (bottom panel). These figures are described in more detail in our paper (see Citations section, below).
## Detailed description of package contents
Under BedData folder
* DHS-Data-Extn-Sorted.bed: This file includes boundaries of each DHS cluster extended by 100 bp each side in the BED format.
* Genecode-Sorted.bed: This file includes boundaries of genes (from transcription start site to transcript end site) in the BED Format. Genes boundaries are obtained from Genecode v12.
Under Data folder
* Allele-Specific-Data/Allele-Specific-UW-MAF.bed: This file contains information about SNPs tested for allelic-imbalanced in DNA accessibility. This data is originally obtained from Maurano, et.al. Large-scale identification of sequence variants influencing human transcription factor occupancy in vivo ((http://www.nature.com/ng/journal/v47/n12/extref/ng.3432-S5.txt) ). We then obtained MAF of the SNPs using 1000 Genome data, and added MAF column to the original file.
Under Rdata folder
* Best-genomeFunc.Rdata: Genomic function of each DHS cluster obtained by using ChromHMM data.
* DHS-Data-Extn.Rdata: Presense/absence of DHS clusters over 22 cell types after excluding DHS clusters overlapping MHC region (chr6:28x10^6 - 34x10^6 on hg19).
* DHS-Data-Sorted.bed: Boundaries of each DHS cluster in the BED format.
* DHS-Data.Rdata: Presense/absence of DHS clusters over 22 cell types.
* Norm-Trans-Exp.Rdata: A matrix of gene expression data obtained by processing Roadmap Epigenome Project exon array data.
* Unique-DHS-Data.Rdata: All 796,747 DHS cluster can be categorized to 71,912 unique clusters based on their patterns of presence/absence over 22 cell types. This file contains these unique clusters.
Under BigData folder
* Crr-Pval-Mat-PerChr/Crr-pval-mat-chrN.Rdata: Correlation P values between DHS clusters and genes overlapping chromosome N. Here P values for chromosome 6 is provided as an example. For access to the correlation P values of all chromosomes, look at BigData folder.
## Troubleshooting
If you get a permission denied error when running a script, enter
## Command-line code ##
chmod u+x <file>
## End command-line code ##
for that file and try again. That should give you
permission to execute the file.
## Citation
regfm was written by Parisa Shooshtari (Yale University and the Broad Institute of MIT and Harvard).
You can cite regfm as:
Shooshtari, et al. Integrative genetic and epigenetic analysis uncovers regulatory mechanisms of autoimmune disease. BioRxiv, 2016. (http://biorxiv.org/content/early/2016/05/19/054361)
## Contact
Please address comments and questions to Parisa Shooshtari (pshoosh@broadinstitute.org) or Chris Cotsapas (cotsapas@broadinstitute.org)