This R/Bioconductor notebook is the first of two parts of the analysis of the mouse data presented in **Figure 6**. It ingests the locations of ChIP-seq peaks (called with MACS2 from the Whyte *et al.* data for mESCs, or downloaded directly from GEO (Matsuda *et al.*) for mEpiSCs) and writes intermediate files for further processing with FIMO and with my own scripts.

# Initialization

In [1]:
# for my sanity: alias the function names I keep reaching for to the ones that actually exist
nrows <- nrow
ncols <- ncol
len <- length

In [2]:
library(tracktables)

library(rtracklayer)
library(HelloRanges)

Loading required package: GenomicRanges
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, cbind, colMeans, colnames,
    colSums, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    grepl, intersect, is.unsorted, lapply, lengths, Map, mapply, match,
    mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
    table, tapply, union, unique, unsplit, which, which.max, which.min

Loading requi

In [3]:
library(GenomicFeatures)

In [4]:
library(BSgenome.Mmusculus.UCSC.mm9)

In [5]:
library(heatmaps)


Attaching package: ‘heatmaps’

The following object is masked from ‘package:SummarizedExperiment’:

    metadata<-

The following object is masked from ‘package:AnnotationDbi’:

    metadata

The following objects are masked from ‘package:S4Vectors’:

    metadata, metadata<-

The following object is masked from ‘package:base’:

    scale



In [6]:
library(TxDb.Mmusculus.UCSC.mm9.knownGene)

In [7]:
mouse_transcripts <- transcripts(TxDb.Mmusculus.UCSC.mm9.knownGene, columns=c("tx_id", "tx_name"))
mouse_transcripts <- keepStandardChromosomes(mouse_transcripts, pruning.mode = 'tidy')
# drop MT
mouse_transcripts <- mouse_transcripts[seqnames(mouse_transcripts) != "chrM"]
mouse_tss <- resize(mouse_transcripts, width=1, fix='start')

In [8]:
mm9info <- seqinfo(BSgenome.Mmusculus.UCSC.mm9)

In [9]:
mouse_tss_2000 <- resize(mouse_tss, width = 2000, fix='center')
mouse_tss_2000 <- trim(mouse_tss_2000)

# EpiSCs

BED files were downloaded directly from the NCBI GEO (Sox2: [`GSM1924746`](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1924746); Oct4: [`GSM1924747`](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1924747)).

In [10]:
Epi_SOX_peaks <- import("~/SoxOct/mouse/EpiSC_chip/GSM1924746_SOX2peaks.bed")
Epi_OCT_peaks <- import("~/SoxOct/mouse/EpiSC_chip/GSM1924747_POU5F1peaks.bed")

Epi_SOX_peaks <- keepStandardChromosomes(Epi_SOX_peaks, pruning.mode = 'coarse')
Epi_OCT_peaks <- keepStandardChromosomes(Epi_OCT_peaks, pruning.mode = 'coarse')

# suffix each peak name with its factor so SOX and OCT peak names cannot collide later
Epi_SOX_peaks$name <- paste0(Epi_SOX_peaks$name, "_SOX")
Epi_OCT_peaks$name <- paste0(Epi_OCT_peaks$name, "_OCT")

# drop MT chromosome
Epi_SOX_peaks <- Epi_SOX_peaks[seqnames(Epi_SOX_peaks) != "chrM"]
Epi_OCT_peaks <- Epi_OCT_peaks[seqnames(Epi_OCT_peaks) != "chrM"]

# sort the seqlevels so their order lines up with mm9info, which the seqinfo assignment below seems to require
Epi_SOX_peaks <- sortSeqlevels(Epi_SOX_peaks)
Epi_OCT_peaks <- sortSeqlevels(Epi_OCT_peaks)

seqinfo(Epi_SOX_peaks) <- mm9info
seqinfo(Epi_OCT_peaks) <- mm9info

In [12]:
# stricter tandem definition: resize every peak to a 400 bp window (200 bp either side of its center),
# keep the SOX windows that overlap an OCT window, then use those downstream
Epi_tandem_peaks <- subsetByOverlaps(resize(Epi_SOX_peaks, width = 400, fix = 'center'), resize(Epi_OCT_peaks, width = 400, fix = 'center'))
# resize to width 1 at the window center (so it's compatible with downstream code)
Epi_tandem_peaks <- resize(Epi_tandem_peaks, width = 1, fix = 'center')
Epi_tandem_peaks$name <- paste0("CHIP_SOXOCT_peak_", seq_along(Epi_tandem_peaks))
Epi_tandem_peaks

GRanges object with 2576 ranges and 2 metadata columns:
         seqnames                 ranges strand |                  name
            <Rle>              <IRanges>  <Rle> |           <character>
     [1]     chr1     [3068005, 3068005]      * |    CHIP_SOXOCT_peak_1
     [2]     chr1     [3754132, 3754132]      * |    CHIP_SOXOCT_peak_2
     [3]     chr1     [5645812, 5645812]      * |    CHIP_SOXOCT_peak_3
     [4]     chr1     [7303644, 7303644]      * |    CHIP_SOXOCT_peak_4
     [5]     chr1     [7767041, 7767041]      * |    CHIP_SOXOCT_peak_5
     ...      ...                    ...    ... .                   ...
  [2572]     chrX [157937488, 157937488]      * | CHIP_SOXOCT_peak_2572
  [2573]     chrX [158004657, 158004657]      * | CHIP_SOXOCT_peak_2573
  [2574]     chrX [162473875, 162473875]      * | CHIP_SOXOCT_peak_2574
  [2575]     chrX [164322806, 164322806]      * | CHIP_SOXOCT_peak_2575
  [2576]     chrX [165675253, 165675253]      * | CHIP_SOXOCT_peak_2576
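As a toy illustration of the tandem definition (a sketch with made-up coordinates, not the real peaks): two summits count as tandem when their 400 bp windows overlap, i.e. when the summits are fewer than 400 bp apart. The object names `sox_summits` and `oct_summits` and all positions are hypothetical.

```r
library(GenomicRanges)

# hypothetical width-1 summits on one chromosome
sox_summits <- GRanges("chr1", IRanges(start = c(1000, 5000), width = 1))
oct_summits <- GRanges("chr1", IRanges(start = c(1300, 9000), width = 1))

# 400 bp windows centered on each summit
sox_win <- resize(sox_summits, width = 400, fix = "center")
oct_win <- resize(oct_summits, width = 400, fix = "center")

# keep the SOX windows that overlap at least one OCT window
tandem <- subsetByOverlaps(sox_win, oct_win)

# collapse each surviving window back to a single position at its center
tandem <- resize(tandem, width = 1, fix = "center")
tandem
```

Only the first SOX summit survives here, since its nearest OCT summit is 300 bp away while the second has no partner within 400 bp. Because a 400 bp window has no exact midpoint, the recovered position can sit 1 bp away from the original summit.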
        

In [13]:
writeDNAtofile <- function(grange, path){
    # map the seqlevels back to UCSC
    #grforexport <- renameSeqlevels(grange, mapSeqlevels(seqlevels(grange), "UCSC"))
    grforexport <- grange
    print(seqlevels(grforexport))

    # extend every range by 200 bp of flanking sequence on each side
    DNAflank <- 200
    print(paste0("flanking DNA: ", DNAflank))
    grforexport <- grforexport + DNAflank
    
    grforexport <- trim(grforexport)
    
    DNAforexport <- getSeq(BSgenome.Mmusculus.UCSC.mm9, names = grforexport) #actual sequence lookup step
    names(DNAforexport) <- mcols(grforexport)$name # add names so I know what it is
    
    print(DNAforexport)
    
    # write it out to disk
    writeXStringSet(DNAforexport, filepath = path)
}
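A small sketch of what the flank-and-trim step does to range widths, using a made-up 1000 bp chromosome `chrA` (hypothetical, so the example runs without the mm9 package). A width-1 range becomes 401 bp, while a range close to a chromosome end is clipped by `trim()`, so its window no longer starts exactly 200 bp upstream of the peak.

```r
library(GenomicRanges)

# toy chromosome of length 1000; two width-1 peaks, one near the left edge
gr <- GRanges("chrA", IRanges(start = c(500, 50), width = 1),
              seqlengths = c(chrA = 1000))

flank <- 200
# `+` on a GRanges extends both ends; out-of-bound ranges only raise a warning
flanked <- suppressWarnings(gr + flank)

# clip anything that hangs off the chromosome
trimmed <- trim(flanked)

start(trimmed)  # window starts: 300 and 1
width(trimmed)  # window widths: 401 and 250
```

For a clipped window like the second one, any downstream arithmetic that assumes a fixed 200 bp flank would be off, so peaks that close to a chromosome end would need the actual window start instead.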

In [16]:
writeDNAtofile(Epi_SOX_peaks, "~/SoxOct/public/mouse/EpiSC_chip/Sox_peakseqs.fasta")

 [1] "chr1"         "chr2"         "chr3"         "chr4"         "chr5"        
 [6] "chr6"         "chr7"         "chr8"         "chr9"         "chr10"       
[11] "chr11"        "chr12"        "chr13"        "chr14"        "chr15"       
[16] "chr16"        "chr17"        "chr18"        "chr19"        "chrX"        
[21] "chrY"         "chrM"         "chr1_random"  "chr3_random"  "chr4_random" 
[26] "chr5_random"  "chr7_random"  "chr8_random"  "chr9_random"  "chr13_random"
[31] "chr16_random" "chr17_random" "chrX_random"  "chrY_random"  "chrUn_random"
[1] "flanking DNA: 200"
  A DNAStringSet instance of length 63678
        width seq                                           names               
    [1]  1401 ATGATGTCATCTTTCACTTTC...AGTAGAGAGGAACAGTTGCTG MACS_peak_1_SOX
    [2]  1004 CTGTTTAGTTCCTGTTTCCAC...TTAAGTTTGGGAAGTTTTCTT MACS_peak_2_SOX
    [3]  1035 CTCTTCTGGCTTTCATAGTCT...GGCAAGCTCTACTCTTGCAGC MACS_peak_3_SOX
    [4]  1428 GAAAGGCATAGCCCAGATTAA...TATAGAGAGGAATAGGAAAAT MACS_

In [17]:
writeDNAtofile(Epi_OCT_peaks, "~/SoxOct/public/mouse/EpiSC_chip/Oct_peakseqs.fasta")

 [1] "chr1"         "chr2"         "chr3"         "chr4"         "chr5"        
 [6] "chr6"         "chr7"         "chr8"         "chr9"         "chr10"       
[11] "chr11"        "chr12"        "chr13"        "chr14"        "chr15"       
[16] "chr16"        "chr17"        "chr18"        "chr19"        "chrX"        
[21] "chrY"         "chrM"         "chr1_random"  "chr3_random"  "chr4_random" 
[26] "chr5_random"  "chr7_random"  "chr8_random"  "chr9_random"  "chr13_random"
[31] "chr16_random" "chr17_random" "chrX_random"  "chrY_random"  "chrUn_random"
[1] "flanking DNA: 200"
  A DNAStringSet instance of length 62387
        width seq                                           names               
    [1]   936 ATACTTTTTGTAAGCACTCTT...CTCTCCCTCTCCCTCTCCCTC MACS_peak_1_OCT
    [2]  1059 CCTAATAAAAATAAATAAAAA...TTCTTATGAACATTCTGTTCA MACS_peak_2_OCT
    [3]  1033 ATAGAACCACTAAAAGAAAAA...TAACTGTCAGATATTTGAAAA MACS_peak_3_OCT
    [4]  1104 TCTCCTGGGTTTCAGTGTTCA...TCTTACCCTGAGCACTTCAGT MACS_

In [18]:
writeDNAtofile(Epi_tandem_peaks, "~/SoxOct/public/mouse/EpiSC_chip/tandempeakseqs.fasta")

 [1] "chr1"         "chr2"         "chr3"         "chr4"         "chr5"        
 [6] "chr6"         "chr7"         "chr8"         "chr9"         "chr10"       
[11] "chr11"        "chr12"        "chr13"        "chr14"        "chr15"       
[16] "chr16"        "chr17"        "chr18"        "chr19"        "chrX"        
[21] "chrY"         "chrM"         "chr1_random"  "chr3_random"  "chr4_random" 
[26] "chr5_random"  "chr7_random"  "chr8_random"  "chr9_random"  "chr13_random"
[31] "chr16_random" "chr17_random" "chrX_random"  "chrY_random"  "chrUn_random"
[1] "flanking DNA: 200"
  A DNAStringSet instance of length 2576
       width seq                                            names               
   [1]   401 CAAAAACTAACTCCAAATAGGT...AAATAAAATGGTGAATAAATA CHIP_SOXOCT_peak_1
   [2]   401 TTGATAGTTACATTGTCATTGA...ATCTTTGTTTTGTCTTGTATA CHIP_SOXOCT_peak_2
   [3]   401 ACCAGAAAGATCTCCCATCTTT...CCCAGAACCTGACACATAATA CHIP_SOXOCT_peak_3
   [4]   401 CATTTGTTTTAGGATTTCGTCT...TTGAAGCACTCTGATCTGC

## ChIP-confirmed motif locations in genome (via FIMO)

**Sox alone**: `/rugpfs/fs0/home/lzhao/.linuxbrew/opt/meme/bin/fimo --verbosity 3 --o ~/scratch/SoxOct/mouse/fimo/EpiSC_Sox2 ~/results/SoxOct/fimo/MA0143.3.meme ~/SoxOct/mouse/EpiSC_chip/Sox_peakseqs.fasta`

Note that I use the human POU5F1 motif (MA1115.1) for the mouse data.

**Oct alone**: `/rugpfs/fs0/home/lzhao/.linuxbrew/opt/meme/bin/fimo --verbosity 3 --o ~/scratch/SoxOct/mouse/fimo/EpiSC_Oct4 ~/results/SoxOct/fimo/MA1115.1.meme ~/SoxOct/mouse/EpiSC_chip/Oct_peakseqs.fasta`

**tandem**: `/rugpfs/fs0/home/lzhao/.linuxbrew/opt/meme/bin/fimo --verbosity 3 --o ~/scratch/SoxOct/mouse/fimo/EpiSC_tandem ~/results/SoxOct/fimo/MA0142.1.meme ~/SoxOct/mouse/EpiSC_chip/tandempeakseqs.fasta`

The next step is to convert the FIMO coordinates back into genomic ones.

General formula: genomic start of a motif = genomic start of the peak the motif was found in (the original, unflanked peak) + FIMO-reported start - flanking DNA in bp (generally 200) - 1. The final - 1 is needed because FIMO reports 1-based positions within each sequence, so position 1 is the first base of the flanked window.
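A minimal R sketch of this arithmetic (the real conversion is done in the `py3_motif-matching-mouse-public` notebook; the function name `fimo_to_genomic` and the FIMO positions used here are hypothetical). It assumes the window was not clipped by `trim()`, so the exported sequence starts exactly `flank` bp before the peak start.

```r
# map a 1-based FIMO position within a flanked sequence to a genomic coordinate
fimo_to_genomic <- function(peak_start, fimo_pos, flank = 200L) {
  peak_start + fimo_pos - flank - 1L
}

peak_start <- 3068005L  # CHIP_SOXOCT_peak_1 in the EpiSC tandem set

first_base <- fimo_to_genomic(peak_start, 1L)    # 3067805, i.e. peak_start - 200
peak_base  <- fimo_to_genomic(peak_start, 201L)  # 3068005, the peak position itself

# hypothetical motif hit spanning FIMO positions 150..164
motif_span <- fimo_to_genomic(peak_start, c(150L, 164L))  # 3067954 3067968
```

The same function works for the motif stop, since start and stop are both offsets within the same flanked window.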

## Output reference GTFs

To do the conversion (in [`py3_motif-matching-mouse-public`](py3_motif-matching-mouse-public)), I need a reference GTF with the coordinates of the original peaks.

In [19]:
export(Epi_SOX_peaks, '~/SoxOct/public/mouse/EpiSC_chip/Sox_peaks.gtf')
export(Epi_OCT_peaks, '~/SoxOct/public/mouse/EpiSC_chip/Oct_peaks.gtf')

In [20]:
export(Epi_tandem_peaks, '~/SoxOct/public/mouse/EpiSC_chip/tandempeaks.gtf')

# Whyte 2013 ChIP-seq motifs

I called peaks with MACS2 from the Whyte *et al.* (2013) data; the resulting summit files are imported below.

In [21]:
SOX_peaks <- import("~/SoxOct/mouse/chipseq_Whyte2013/macs2-sox/Sox_summits.bed")
OCT_peaks <- import("~/SoxOct/mouse/chipseq_Whyte2013/macs2-oct/Oct_summits.bed")

SOX_peaks <- keepStandardChromosomes(SOX_peaks, pruning.mode = 'coarse')
OCT_peaks <- keepStandardChromosomes(OCT_peaks, pruning.mode = 'coarse')

# suffix each peak name with its factor so SOX and OCT peak names cannot collide later
SOX_peaks$name <- paste0(SOX_peaks$name, "_SOX")
OCT_peaks$name <- paste0(OCT_peaks$name, "_OCT")

# drop MT chromosome
SOX_peaks <- SOX_peaks[seqnames(SOX_peaks) != "chrM"]
OCT_peaks <- OCT_peaks[seqnames(OCT_peaks) != "chrM"]

# sort the seqlevels so their order lines up with mm9info, which the seqinfo assignment below seems to require
SOX_peaks <- sortSeqlevels(SOX_peaks)
OCT_peaks <- sortSeqlevels(OCT_peaks)

seqinfo(SOX_peaks) <- mm9info
seqinfo(OCT_peaks) <- mm9info

In [22]:
# stricter tandem definition: resize every summit to a 400 bp window (200 bp either side of its center),
# keep the SOX windows that overlap an OCT window, then use those downstream
tandem_peaks <- subsetByOverlaps(resize(SOX_peaks, width = 400, fix = 'center'), resize(OCT_peaks, width = 400, fix = 'center'))
# resize to width 1 at the window center (so it's compatible with downstream code)
tandem_peaks <- resize(tandem_peaks, width = 1, fix = 'center')
tandem_peaks$name <- paste0("CHIP_SOXOCT_peak_", seq_along(tandem_peaks))
tandem_peaks

GRanges object with 10245 ranges and 2 metadata columns:
          seqnames                 ranges strand |                   name
             <Rle>              <IRanges>  <Rle> |            <character>
      [1]     chr1     [3053026, 3053026]      * |     CHIP_SOXOCT_peak_1
      [2]     chr1     [3473094, 3473094]      * |     CHIP_SOXOCT_peak_2
      [3]     chr1     [3671774, 3671774]      * |     CHIP_SOXOCT_peak_3
      [4]     chr1     [3904336, 3904336]      * |     CHIP_SOXOCT_peak_4
      [5]     chr1     [4141103, 4141103]      * |     CHIP_SOXOCT_peak_5
      ...      ...                    ...    ... .                    ...
  [10241]     chrX [161441290, 161441290]      * | CHIP_SOXOCT_peak_10241
  [10242]     chrX [161765464, 161765464]      * | CHIP_SOXOCT_peak_10242
  [10243]     chrX [162969929, 162969929]      * | CHIP_SOXOCT_peak_10243
  [10244]     chrX [163057217, 163057217]      * | CHIP_SOXOCT_peak_10244
  [10245]     chrX [166439502, 166439502]      * | CHIP

In [23]:
writeDNAtofile <- function(grange, path){
    # map the seqlevels back to UCSC
    #grforexport <- renameSeqlevels(grange, mapSeqlevels(seqlevels(grange), "UCSC"))
    grforexport <- grange
    print(seqlevels(grforexport))

    # add flanking 200 bp
    DNAflank <- 200  # bp added on each side of every range
    print(paste0("flanking DNA: ", DNAflank))
    grforexport <- grforexport + DNAflank
    
    grforexport <- trim(grforexport)
    
    DNAforexport <- getSeq(BSgenome.Mmusculus.UCSC.mm9, names = grforexport) #actual sequence lookup step
    names(DNAforexport) <- mcols(grforexport)$name # add names so I know what it is
    
    print(DNAforexport)
    
    # write it out to disk
    writeXStringSet(DNAforexport, filepath = path)
}

In [24]:
writeDNAtofile(SOX_peaks, "~/SoxOct/public/mouse/chipseq_Whyte2013/Sox_peakseqs.fasta")

writeDNAtofile(OCT_peaks, "~/SoxOct/public/mouse/chipseq_Whyte2013/Oct_peakseqs.fasta")

 [1] "chr1"         "chr2"         "chr3"         "chr4"         "chr5"        
 [6] "chr6"         "chr7"         "chr8"         "chr9"         "chr10"       
[11] "chr11"        "chr12"        "chr13"        "chr14"        "chr15"       
[16] "chr16"        "chr17"        "chr18"        "chr19"        "chrX"        
[21] "chrY"         "chrM"         "chr1_random"  "chr3_random"  "chr4_random" 
[26] "chr5_random"  "chr7_random"  "chr8_random"  "chr9_random"  "chr13_random"
[31] "chr16_random" "chr17_random" "chrX_random"  "chrY_random"  "chrUn_random"
[1] "flanking DNA: 200"
  A DNAStringSet instance of length 12133
        width seq                                           names               
    [1]   401 AGAAATAATGGAGGCCAAGAC...TTCAGTGTCTGTGGATTCTGT Sox_peak_1_SOX
    [2]   401 GTCCAGTAGTCAAAATTCCTT...TCCTCTACCTAACACCTAAGT Sox_peak_2_SOX
    [3]   401 CTTTGAATGTTCTTGTGAATA...TCCACGTTGGAGTGTGTGTTT Sox_peak_3_SOX
    [4]   401 CTTTCAAAGCATGCATTACCA...CCATGGCTGCTGGACAGCTAC Sox_peak

In [25]:
writeDNAtofile(tandem_peaks, "~/SoxOct/public/mouse/chipseq_Whyte2013/tandem_peakseqs.fasta")

 [1] "chr1"         "chr2"         "chr3"         "chr4"         "chr5"        
 [6] "chr6"         "chr7"         "chr8"         "chr9"         "chr10"       
[11] "chr11"        "chr12"        "chr13"        "chr14"        "chr15"       
[16] "chr16"        "chr17"        "chr18"        "chr19"        "chrX"        
[21] "chrY"         "chrM"         "chr1_random"  "chr3_random"  "chr4_random" 
[26] "chr5_random"  "chr7_random"  "chr8_random"  "chr9_random"  "chr13_random"
[31] "chr16_random" "chr17_random" "chrX_random"  "chrY_random"  "chrUn_random"
[1] "flanking DNA: 200"
  A DNAStringSet instance of length 10245
        width seq                                           names               
    [1]   401 GAGAAATAATGGAGGCCAAGA...TTTCAGTGTCTGTGGATTCTG CHIP_SOXOCT_peak_1
    [2]   401 AGTCCAGTAGTCAAAATTCCT...ATCCTCTACCTAACACCTAAG CHIP_SOXOCT_peak_2
    [3]   401 CCTTTCAAAGCATGCATTACC...CCCATGGCTGCTGGACAGCTA CHIP_SOXOCT_peak_3
    [4]   401 CTGCTGTTTGGAATGGAGGCC...AACCAACAACCAAAAAAA

## ChIP-confirmed motif locations in genome (via FIMO)

**Sox alone**: `/rugpfs/fs0/home/lzhao/.linuxbrew/opt/meme/bin/fimo --verbosity 3 --o ~/scratch/SoxOct/mouse/fimo/Whyte_mESC_Sox2 ~/results/SoxOct/fimo/MA0143.3.meme ~/SoxOct/mouse/chipseq_Whyte2013/Sox_peakseqs.fasta`

Note that I use the human POU5F1 motif (MA1115.1) for the mouse data.

**Oct alone**: `/rugpfs/fs0/home/lzhao/.linuxbrew/opt/meme/bin/fimo --verbosity 3 --o ~/scratch/SoxOct/mouse/fimo/Whyte_mESC_Oct4 ~/results/SoxOct/fimo/MA1115.1.meme ~/SoxOct/mouse/chipseq_Whyte2013/Oct_peakseqs.fasta`

**tandem peak**: `/rugpfs/fs0/home/lzhao/.linuxbrew/opt/meme/bin/fimo --verbosity 3 --o ~/scratch/SoxOct/mouse/fimo/Whyte_mESC_tandem ~/results/SoxOct/fimo/MA0142.1.meme ~/SoxOct/mouse/chipseq_Whyte2013/tandem_peakseqs.fasta`

The next step is to convert the FIMO coordinates back into genomic ones.

General formula: genomic start of a motif = genomic start of the peak the motif was found in (the original, unflanked peak) + FIMO-reported start - flanking DNA in bp (generally 200) - 1. The final - 1 is needed because FIMO reports 1-based positions within each sequence, so position 1 is the first base of the flanked window.

## Output reference GTFs

To do the conversion (in [`py3_motif-matching-mouse-public`](py3_motif-matching-mouse-public)), I need a reference GTF with the coordinates of the original peaks.

In [26]:
export(SOX_peaks, '~/SoxOct/public/mouse/chipseq_Whyte2013/Sox_peaks.gtf')
export(OCT_peaks, '~/SoxOct/public/mouse/chipseq_Whyte2013/Oct_peaks.gtf')

In [27]:
export(tandem_peaks, '~/SoxOct/public/mouse/chipseq_Whyte2013/tandem_peaks.gtf')
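To illustrate why the original peak coordinates are needed: FIMO only reports positions relative to each FASTA record, so every hit has to be matched back to its peak by name before the offset arithmetic can be applied. The sketch below does this in base R with a toy hit table. The column names (`sequence_name`, `start`, `stop`) are assumed to follow FIMO's tabular output, and the hit positions are made up; the actual conversion lives in `py3_motif-matching-mouse-public`.

```r
flank <- 200L

# reference peaks (first two mESC tandem peaks from the output above)
peaks <- data.frame(
  name  = c("CHIP_SOXOCT_peak_1", "CHIP_SOXOCT_peak_2"),
  start = c(3053026L, 3473094L),
  stringsAsFactors = FALSE
)

# hypothetical FIMO hits: positions are 1-based within each 401 bp sequence
hits <- data.frame(
  sequence_name = c("CHIP_SOXOCT_peak_2", "CHIP_SOXOCT_peak_1", "CHIP_SOXOCT_peak_2"),
  start = c(10L, 201L, 380L),
  stop  = c(24L, 215L, 394L),
  stringsAsFactors = FALSE
)

# look up each hit's peak by name, then shift into genomic coordinates
idx <- match(hits$sequence_name, peaks$name)
hits$genomic_start <- peaks$start[idx] + hits$start - flank - 1L
hits$genomic_stop  <- peaks$start[idx] + hits$stop  - flank - 1L
hits
```

Using `match()` keeps the hit table in FIMO's order and leaves `NA` for any hit whose sequence name is missing from the reference, which makes naming mismatches easy to spot.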