**2019-04-11: adjusted to output Soufi2012 data**

This R/Bioconductor notebook is the first of two parts of the analysis of human data presented in Figure 6. It ingests the locations of ChIP-seq peaks (from A. Soufi, personal communication) and writes intermediate files (peak sequences as FASTA, peak locations as GTF) for further processing in FIMO and by my own scripts.

# Initialization

In [1]:
# for my sanity - map some function names that I think exist to what actually exist
nrows <- nrow
ncols <- ncol
len <- length

In [2]:
library(repr)
options(repr.plot.width=6, repr.plot.height=3)

In [3]:
library(tracktables)
library(rtracklayer)
library(GenomicFeatures)
library(BSgenome.Hsapiens.UCSC.hg38)

Loading required package: GenomicRanges
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, basename, cbind, colMeans,
    colnames, colSums, dirname, do.call, duplicated, eval, evalq,
    Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply,
    lengths, Map, mapply, match, mget, order, paste, pmax, pmax.int,
    pmin, pmin.int, Position, rank, rbind, Reduce, rowMeans, rownames,
    rowSums, sapply, setdiff, sort, table, tapply, union, unique,
    unsplit, which, which.max, w

In [4]:
# get seqinfo for the standard chromosomes; the BSgenome package uses UCSC-style
# sequence names ("chr1"), so rename them to NCBI/Ensembl style ("1").
# Only the names change, the coordinates stay the same.

hg38info <- keepStandardChromosomes(seqinfo(BSgenome.Hsapiens.UCSC.hg38), pruning.mode = "tidy")
newStyle <- mapSeqlevels(seqlevels(hg38info),"NCBI")
hg38info <- renameSeqlevels(hg38info, newStyle)
hg38info

“'pruning.mode' is ignored in "seqlevels<-" method for Seqinfo objects”

Seqinfo object with 25 sequences (1 circular) from hg38 genome:
  seqnames seqlengths isCircular genome
  1         248956422      FALSE   hg38
  2         242193529      FALSE   hg38
  3         198295559      FALSE   hg38
  4         190214555      FALSE   hg38
  5         181538259      FALSE   hg38
  ...             ...        ...    ...
  21         46709983      FALSE   hg38
  22         50818468      FALSE   hg38
  X         156040895      FALSE   hg38
  Y          57227415      FALSE   hg38
  MT            16569       TRUE   hg38
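
As a rough illustration of what the renaming above does for the 25 standard human sequences, here is a minimal base-R sketch. `ucsc_to_ncbi` is a hypothetical helper written for this note and is not part of GenomeInfoDb. It assumes only standard chromosomes are present, which is why the notebook calls `keepStandardChromosomes` first.

```r
# Hypothetical helper (NOT a GenomeInfoDb function): approximates the
# UCSC -> NCBI renaming for standard human chromosomes only.
ucsc_to_ncbi <- function(ucsc_names) {
    ncbi_names <- sub("^chr", "", ucsc_names)  # "chr1" -> "1", "chrX" -> "X"
    ncbi_names[ncbi_names == "M"] <- "MT"      # mitochondrion: "chrM" -> "MT"
    names(ncbi_names) <- ucsc_names            # keep an old -> new lookup
    ncbi_names
}

print(ucsc_to_ncbi(c("chr1", "chr22", "chrX", "chrY", "chrM")))
```

The real `mapSeqlevels` also handles other organisms and naming styles, so this sketch is only a way to sanity-check the mapping by eye.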

# ChIP-peaks (from Soufi2012)

Peak locations were extracted and then lifted over via UCSC `liftOver` from hg18 to GRCh38/hg38. These locations are available on request. 


In [5]:
SOX_peaks <- import("~/SoxOct/chipseq/Soufi2012/Soufi_Sox2.hg38.bed")
OCT_peaks <- import("~/SoxOct/chipseq/Soufi2012/Soufi_Oct4.hg38.bed")

SOX_peaks <- keepStandardChromosomes(SOX_peaks, pruning.mode = 'coarse')
OCT_peaks <- keepStandardChromosomes(OCT_peaks, pruning.mode = 'coarse')

In [6]:
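# drop the mitochondrial chromosome; the BED files use UCSC-style names
# (see the next cell), so match "chrM" as well as the NCBI name "MT"
SOX_peaks <- SOX_peaks[!(as.character(seqnames(SOX_peaks)) %in% c("chrM", "MT"))]
OCT_peaks <- OCT_peaks[!(as.character(seqnames(OCT_peaks)) %in% c("chrM", "MT"))]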

# manually add names since Soufi et al BED doesn't have any
SOX_peaks$name <- paste0("Soufi_SOX_peak_", 1:len(SOX_peaks))
OCT_peaks$name <- paste0("Soufi_OCT_peak_", 1:len(OCT_peaks))

In [7]:
# the Soufi peaks use UCSC-style sequence names ("chr1"); rename them to NCBI style ("1")
SOX_peaks <- renameSeqlevels(SOX_peaks, mapSeqlevels(seqlevels(SOX_peaks), "NCBI"))
OCT_peaks <- renameSeqlevels(OCT_peaks, mapSeqlevels(seqlevels(OCT_peaks), "NCBI"))

In [8]:
# sort the seqlevels into natural order so they line up with hg38info
# before its seqinfo is assigned in the next cell
SOX_peaks <- sortSeqlevels(SOX_peaks)
OCT_peaks <- sortSeqlevels(OCT_peaks)

In [9]:
seqinfo(SOX_peaks) <- hg38info
seqinfo(OCT_peaks) <- hg38info

# DNA export

I export the genomic DNA sequence under each peak so that `FIMO` can scan it for matches to given binding motifs.

In [10]:
writeDNAtofile <- function(grange, path){
    # map the seqlevels back to UCSC
    grforexport <- renameSeqlevels(grange, mapSeqlevels(seqlevels(grange), "UCSC"))
    print(seqlevels(grforexport))

    # extend each range by 200 bp on both sides
    DNAflank <- 200
    #DNAflank <- 0 # since it's already a 200bp-wide thing
    print(paste0("flanking DNA: ", DNAflank))
    grforexport <- grforexport + DNAflank
    
    # clip any range that now extends past the end of its chromosome
    grforexport <- trim(grforexport)
    
    DNAforexport <- getSeq(BSgenome.Hsapiens.UCSC.hg38, names = grforexport) #actual sequence lookup step
    names(DNAforexport) <- mcols(grforexport)$name # add names so I know what it is
    
    print(DNAforexport)
    
    # write it out to disk
    writeXStringSet(DNAforexport, filepath = path)
}
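
To make the flank-and-trim step in `writeDNAtofile` concrete, here is a minimal base-R sketch of the same arithmetic on plain integers. `flank_and_trim` and its arguments are hypothetical names used only for illustration. The notebook itself relies on `+` and `trim` from GenomicRanges, which take chromosome lengths from the object's seqinfo.

```r
# Hypothetical illustration of "range + flank, then trim" on plain integers.
# Coordinates are 1-based and inclusive, as in GRanges.
flank_and_trim <- function(start, end, chrom_length, flank = 200L) {
    new_start <- pmax(start - flank, 1L)          # do not run off the left end
    new_end   <- pmin(end + flank, chrom_length)  # do not run off the right end
    data.frame(start = new_start,
               end   = new_end,
               width = new_end - new_start + 1L)
}

# one 1 bp range far from the ends, one range close to position 1
print(flank_and_trim(start = c(1000L, 50L),
                     end   = c(1000L, 300L),
                     chrom_length = 16569L))
```

A 1 bp range away from the chromosome ends comes out 401 bp wide, while a range near position 1 is clipped and therefore gains fewer than 400 bp.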

In [48]:
writeDNAtofile(SOX_peaks, "~/SoxOct/chipseq/Soufi2012/SOX_peaks.fasta")

 [1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9" 
[10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
[19] "chr19" "chr20" "chr21" "chr22" "chrX"  "chrY"  "chrM" 
[1] "flanking DNA: 200"
  A DNAStringSet instance of length 64536
        width seq                                           names               
    [1]  1466 GCTAAACATTTTTTATGGTAT...AACTTGAACACGAAGCAAAAA Soufi_SOX_peak_1
    [2]  1124 CTGTATTGTGTAGTGTACTCT...TGTATTGTAAGTGTACTCTGT Soufi_SOX_peak_2
    [3]  1296 ATACTCTGTACTCCAGAATTT...ATTTCTCATACACCATTCTCC Soufi_SOX_peak_3
    [4]  1540 AAGTTTTGCTACACTGTTGCC...ATCTAATTTTTGTATTTTTAA Soufi_SOX_peak_4
    [5]  1117 ACTTTTACTACTTCTTTCCTT...AGAAAGACGCGGACCCTCGAG Soufi_SOX_peak_5
    ...   ... ...
[64532]   605 GTGTGCTTTCTCTGAATAAAC...AGTGGAGCTTACTGATTGCCA Soufi_SOX_peak_64532
[64533]   567 GTAAGCTTCACTTGGTTTGAC...GGCTTAATAATAAAAGTGAGT Soufi_SOX_peak_64533
[64534]   595 ATAGAAAGATGAATATGGCCA...TGAAGACAGCTGATTTAGCAT Soufi_SOX_pea

In [49]:
writeDNAtofile(OCT_peaks, "~/SoxOct/chipseq/Soufi2012/OCT_peaks.fasta")

 [1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9" 
[10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
[19] "chr19" "chr20" "chr21" "chr22" "chrX"  "chrY"  "chrM" 
[1] "flanking DNA: 200"
  A DNAStringSet instance of length 58162
        width seq                                           names               
    [1]   706 ACTGGCTTTATGAGTTCTGTT...CCCAACTTCCTAGCTGTCAGA Soufi_OCT_peak_1
    [2]  2033 GGTAATGACCTGAAGCAGGTG...GCATGCGTGCACACACCCAGG Soufi_OCT_peak_2
    [3]   713 CATTTGTGGTTCTGGCTTTTA...TTATGCTTTCTTAAAAGATTT Soufi_OCT_peak_3
    [4]   774 CTGTTTTCTTTTCTTTTTTTT...ATTTGAGAATTCAAAAGCAGC Soufi_OCT_peak_4
    [5]   738 CTTGCCATTCCTATGTCTCAT...AATTCAGACTCTGACTCAGAG Soufi_OCT_peak_5
    ...   ... ...
[58158]   673 TTGGTTTTCTCATTAAGCCGT...CCCAGCACTTTGGGAGGCTGA Soufi_OCT_peak_58158
[58159]   599 GGCCCCGCCTCCTCACCGCCC...GCGGGGGCTCCGGGCCCGGGG Soufi_OCT_peak_58159
[58160]   563 GTAATCATTGATATTATTGGG...GCATGGGAGGTTGAGGCTACA Soufi_OCT_pea

In [50]:
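# NOTE: tandem_peaks is not created in any cell shown in this notebook; it must already exist in the session
writeDNAtofile(tandem_peaks, "~/SoxOct/chipseq/Soufi2012/tandem_peaks.fasta")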

 [1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9" 
[10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
[19] "chr19" "chr20" "chr21" "chr22" "chrX"  "chrY"  "chrM" 
[1] "flanking DNA: 200"
  A DNAStringSet instance of length 13455
        width seq                                           names               
    [1]   401 GTAGTGTACTTTGTGTTATGT...TGTGTAGTGTAGTCTATATTG Soufi2012_SOXOCT_...
    [2]   401 TCTTTTTACTGTTTTTTACCT...CTACCAGTGGTTTTTAAAATT Soufi2012_SOXOCT_...
    [3]   401 CATCAATTGTTCTGCCTTGTA...CAATGAGGAGAAAGCCGCCGC Soufi2012_SOXOCT_...
    [4]   401 TACAAAATACACAGTTTAGAG...CTGTAAGCTAAATAGACTTAA Soufi2012_SOXOCT_...
    [5]   401 CCATTGTCACCAATCCTACCA...AGTATGTCACACACCCATCCT Soufi2012_SOXOCT_...
    ...   ... ...
[13451]   401 TAATTAGTCATGTCTTCTCCT...AGAGAAGAAAAAATAGTTTCG Soufi2012_SOXOCT_...
[13452]   401 TTTTATCATATGCTGCAGCAT...CACATGATGCCAGTTTACCCT Soufi2012_SOXOCT_...
[13453]   401 TTGTAAATCATCTATCTGATA...GTGAGGATTGGCCAG

# ChIP-confirmed motif locations in genome (via FIMO) - needs updating

I use `FIMO` with `--verbosity 3` to identify sites conforming to the canonical Sox2 motif (MA0143.3) in the Sox2 peak sequences from above. I similarly use `FIMO` to identify canonical Oct4 motifs (MA1115.1) in the Oct4 peak sequences from above. I use the manually concatenated motif for the tandem site.


**Sox2**: `/rugpfs/fs0/home/lzhao/.linuxbrew/opt/meme/bin/fimo --verbosity 3 --o ~/SoxOct/chipseq/Soufi2012/MEME_Sox2 ~/SoxOct/SoxOct-fimo-meme/fimo/MA0143.3.meme ~/SoxOct/chipseq/Soufi2012/SOX_peaks.fasta`

**Oct4**: `/rugpfs/fs0/home/lzhao/.linuxbrew/opt/meme/bin/fimo --verbosity 3 --o ~/SoxOct/chipseq/Soufi2012/MEME_Oct4 ~/SoxOct/SoxOct-fimo-meme/fimo/MA1115.1.meme ~/SoxOct/chipseq/Soufi2012/OCT_peaks.fasta`

---------
**noncanonical**

**noncanonical Sox2**: `/rugpfs/fs0/home/lzhao/.linuxbrew/opt/meme/bin/fimo --verbosity 3 --thresh .001 --o ~/SoxOct/chipseq/Soufi2012/MEME_Sox2_74 ~/SoxOct/Soufi2015/SoxNucHi74.meme ~/SoxOct/chipseq/Soufi2012/SOX_peaks.fasta`

**noncanonical Oct4 28% variant**: `/rugpfs/fs0/home/lzhao/.linuxbrew/opt/meme/bin/fimo --verbosity 3 --thresh .001 --o ~/SoxOct/chipseq/Soufi2012/MEME_Oct4_28 ~/SoxOct/Soufi2015/OctNucHi28.meme ~/SoxOct/chipseq/Soufi2012/OCT_peaks.fasta`

**noncanonical Oct4 42% variant**: `/rugpfs/fs0/home/lzhao/.linuxbrew/opt/meme/bin/fimo --verbosity 3 --thresh .001 --o ~/SoxOct/chipseq/Soufi2012/MEME_Oct4_42 ~/SoxOct/Soufi2015/OctNucHi42.meme ~/SoxOct/chipseq/Soufi2012/OCT_peaks.fasta`
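
The FIMO commands above differ only in the motif file, the input FASTA, the output directory and the optional `--thresh` value. As a sketch, the argument list could be assembled in R as follows. `fimo_args` is a hypothetical helper, it uses only flags that appear in the commands above, and the commented-out call assumes a `fimo` executable is on the `PATH`.

```r
# Hypothetical helper: build the argument vector for one FIMO run.
fimo_args <- function(motif_file, fasta_file, out_dir, thresh = NULL) {
    args <- c("--verbosity", "3")
    if (!is.null(thresh)) {
        # optional p-value threshold, as used for the noncanonical motifs
        args <- c(args, "--thresh", as.character(thresh))
    }
    c(args, "--o", out_dir, motif_file, fasta_file)
}

base_dir <- "~/SoxOct/chipseq/Soufi2012"
sox2_args <- fimo_args(
    motif_file = "~/SoxOct/SoxOct-fimo-meme/fimo/MA0143.3.meme",
    fasta_file = file.path(base_dir, "SOX_peaks.fasta"),
    out_dir    = file.path(base_dir, "MEME_Sox2")
)
cat("fimo", sox2_args, "\n")

# To actually run it (assumption: `fimo` is on the PATH):
# system2("fimo", args = path.expand(sox2_args))
```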

I then use [`py3_motif-matching-human-public.ipynb`](py3_motif-matching-human-public.ipynb) to convert the coordinates from FIMO back into genomic coordinates. This requires that I have the original peak locations as a GTF.
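
The coordinate conversion itself is simple arithmetic. The sketch below is in R for consistency with this notebook and is not the code from the Python notebook. It assumes FIMO reports 1-based start and stop positions relative to each exported sequence, and that each sequence begins 200 bp upstream of the peak start unless it was trimmed at position 1. `fimo_to_genomic` and its arguments are hypothetical names.

```r
# Hypothetical helper: map FIMO hit positions (relative to the exported,
# flanked sequence) back to genomic coordinates.
fimo_to_genomic <- function(peak_start, fimo_start, fimo_stop, flank = 200L) {
    # genomic position of the first base of the exported sequence
    seq_start <- pmax(peak_start - flank, 1L)
    data.frame(start = seq_start + fimo_start - 1L,
               end   = seq_start + fimo_stop - 1L)
}

# a hit at positions 251..261 of the sequence exported for a peak starting at 10000
print(fimo_to_genomic(peak_start = 10000L, fimo_start = 251L, fimo_stop = 261L))
```

Here the exported sequence starts at genomic position 9800, so the hit maps to 10050..10060.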

In [51]:
export(SOX_peaks, '~/SoxOct/chipseq/Soufi2012/Sox_peaks_200.gtf')
export(OCT_peaks, '~/SoxOct/chipseq/Soufi2012/Oct_peaks_200.gtf')