Skip to content

Compare the enrichment of 2 histone marks in 1 list of active transcription start sites (TSS)

Charles Joly Beauparlant edited this page Apr 1, 2015 · 3 revisions

Enrichment of the H3K27ac and H4K27me3 histone marks in active transcription start sites (TSS)

1. Introduction

There are 2 kind of experimental designs that can be used for a metagene analysis. This document will introduce the second kind, which is the comparison of the same group of regions between 2 ChIP-Seq experiment.

For this vignette, we will compare the enrichment of the H3K27ac and H4K27me3 histone mark in active transcription start sites (TSS).

http://egg2.wustl.edu/roadmap/web_portal/chr_state_learning.html#exp_18state

2. Load datasets

We will use the TSS as defined by the Roadmap Consortium (TODO: add ref), for more details on how the data were downloaded, please see the R/get_regions.R file.

#data(TssA)   # Active TSS
load("../data/TssA.RData")   # Active TSS

# We will make sure that all the regions have the same size to avoid having to
# scale them during the metagene analysis
TssA <- GenomicRanges::resize(TssA, 1000, fix = "center")
regions <- GenomicRanges::GRangesList(TssA)
names(regions) <- c("TssA")

# We need to remove values from chrY that are not present in some bam files
regions <- lapply(regions, function(x) x[seqnames(x) != "chrY"])

# For memory and speed consideration, we will only use a subset of the regions
regions <- lapply(regions, function(x) x[sample(seq_along(x), 1000)])
regions <- GenomicRanges::GRangesList(regions)
regions
## GRangesList object of length 1:
## $TssA 
## GRanges object with 1000 ranges and 1 metadata column:
##          seqnames                 ranges strand   |        name
##             <Rle>              <IRanges>  <Rle>   | <character>
##      [1]    chr18 [ 45455501,  45456500]      *   |      1_TssA
##      [2]    chr17 [ 74022701,  74023700]      *   |      1_TssA
##      [3]    chr16 [ 75299201,  75300200]      *   |      1_TssA
##      [4]    chr11 [ 32916001,  32917000]      *   |      1_TssA
##      [5]     chr2 [192544001, 192545000]      *   |      1_TssA
##      ...      ...                    ...    ... ...         ...
##    [996]     chr7   [33168501, 33169500]      *   |      1_TssA
##    [997]     chr6   [17394901, 17395900]      *   |      1_TssA
##    [998]     chr8   [68256001, 68257000]      *   |      1_TssA
##    [999]    chr17   [49198801, 49199800]      *   |      1_TssA
##   [1000]    chr19   [41641201, 41642200]      *   |      1_TssA
## 
## -------
## seqinfo: 25 sequences from an unspecified genome; no seqlengths

For the ChIP-Seq of the H3K27ac histone mark, we will use the [Add accessions] file from ENCODE. The ENCFF000AKF.bam and ENCFF000AKI.bam files are two replicates from the same experiment and ENCFF000AHC.bam and ENCFF000AHD.bam are the recommanded control files.

bam_files <- c("../inst/extdata/ENCFF000AKF.bam",
               "../inst/extdata/ENCFF000AKI.bam",
               "../inst/extdata/ENCFF000VGB.bam",
               "../inst/extdata/ENCFF000VFS.bam")
bam_files
## [1] "../inst/extdata/ENCFF000AKF.bam" "../inst/extdata/ENCFF000AKI.bam"
## [3] "../inst/extdata/ENCFF000VGB.bam" "../inst/extdata/ENCFF000VFS.bam"

Since multiple samples should be included in the same profile (we need to combine the replicates), we have to produce a design so that metagene can know how to deal correctly with each file.

design <- data.frame(samples = bam_files,
                     H3K27ac = c(1,1,0,0),
                     H3K27me3 = c(0,0,1,1))
design
##                           samples H3K27ac H3K27me3
## 1 ../inst/extdata/ENCFF000AKF.bam       1        0
## 2 ../inst/extdata/ENCFF000AKI.bam       1        0
## 3 ../inst/extdata/ENCFF000VGB.bam       0        1
## 4 ../inst/extdata/ENCFF000VFS.bam       0        1

3. metagene analysis

The design is not used when we call the constructor of metagene (using metagene$new). The goal of this first step is to extract the coverage from the bam_files and to do the normalisation of the signal. Since this step can be computationally intensive, we do not want to do it every time we want to experiment with a new design. Thus, it is a good idea to extract the coverage from every bam file in a single step, save the result in RData format and then explore different design.

#mg <- metagene$new(regions = regions, bam_files = bam_files, cores = 2)
mg <- metagene$new(regions = regions, bam_files = bam_files)
# We could save this object to avoid re-doing this computationally intensive
# step:
#save(mg, file = "mg.RData")

4. Produce the graphs

The plot function allows you to create the metagene plot. By default, the plot will contain a profile for every combination of group of regions and bam file. In this vignette, we have 4 bam files (2 replicates for H3K27ac and 2 replicates for H3K27me3) and one group of regions (TssA). Which means that if we do not use a design, metagene will produce 4 profiles.

In this case, we only want to produce a profile for each histone mark (i.e. we want to combine the replicates), we have defined the groups of bam files in the design.

df <- mg$plot(design = design, bin_size = 10)
## [1] "H3K27ac_TssA"
## [1] "H3K27me3_TssA"

plot of chunk produceGraph

save(mg, file = "mg.RData")