# Prep data for training a BPNet model
This notebook follows the tutorial at https://github.com/kundajelab/basepairmodels. It prepares the data for training a BPNet model with Jacob Schreiber's BPNet implementation. The data come from a CTCF ChIP-seq experiment in K562 cells and are downloaded from the ENCODE portal.

**Expected output:**
1. Positive and negative strand coverage BigWig files for the CTCF ChIP-seq experiment in K562 cells (2 merged replicates)
2. A BED file with the CTCF ChIP-seq peaks in K562 cells
3. Positive and negative strand coverage BigWig files for the control

# Get the bam files for the CTCF ChIP-seq experiment in K562 cells

In [None]:
# Download from ENCODE
!wget https://www.encodeproject.org/files/ENCFF198CVB/@@download/ENCFF198CVB.bam -O rep1.bam
!wget https://www.encodeproject.org/files/ENCFF488CXC/@@download/ENCFF488CXC.bam -O rep2.bam
!wget https://www.encodeproject.org/files/ENCFF023NGN/@@download/ENCFF023NGN.bam -O control.bam

# Merge the bams and index

In [None]:
# Create a merged bam file from both replicates and index it
!samtools merge -f merged.bam rep1.bam rep2.bam
!samtools index merged.bam

# Get chrom sizes and fasta

In [None]:
# Download the reference genome and index it, also download the chromosome sizes (need a special version from ENCODE DCC)
!wget https://raw.githubusercontent.com/ENCODE-DCC/encValData/master/GRCh38/GRCh38_EBV.chrom.sizes
!wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
!gunzip hg38.fa.gz
!samtools faidx hg38.fa

# Make bigWig files through bedGraph
[bedGraph](http://genome.ucsc.edu/goldenPath/help/bedgraph.html) files report consecutive positions with the same coverage as a single line. Each line gives the chromosome, the start and end coordinates of the interval, and the coverage level over that interval.

[bedGraphToBigWig download](http://hgdownload.soe.ucsc.edu/admin/exe/macOSX.x86_64/bedGraphToBigWig) (macOS x86_64 binary)
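
To make the run-length idea concrete, here is a minimal Python sketch that collapses a per-base coverage list into bedGraph-style intervals. The function name `coverage_to_bedgraph` and the toy coverage values are made up for illustration; they are not part of bedtools or the tutorial. Zero-coverage runs are skipped by default, which mirrors what `bedtools genomecov -bg` reports.

```python
from itertools import groupby


def coverage_to_bedgraph(chrom, coverage, skip_zero=True):
    """Collapse per-base coverage into (chrom, start, end, value) intervals.

    Coordinates are 0-based and half-open, as in bedGraph.
    """
    intervals = []
    position = 0
    for value, run in groupby(coverage):
        length = sum(1 for _ in run)
        # Optionally drop zero-coverage runs, as `genomecov -bg` does
        if not (skip_zero and value == 0):
            intervals.append((chrom, position, position + length, value))
        position += length
    return intervals


# Hypothetical per-base coverage for the first 9 bases of chr1
per_base = [0, 0, 3, 3, 3, 1, 0, 2, 2]
for chrom, start, end, value in coverage_to_bedgraph("chr1", per_base):
    print(f"{chrom}\t{start}\t{end}\t{value}")
```

Nine per-base values become three lines, which is why bedGraph is a compact intermediate on the way to bigWig.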

## Signal files

In [None]:
# get coverage of 5' positions of the plus strand
!bedtools genomecov -5 -bg -strand + \
        -g GRCh38_EBV.chrom.sizes -ibam merged.bam \
        | sort -k1,1 -k2,2n > plus.bedGraph

# get coverage of 5' positions of the minus strand
!bedtools genomecov -5 -bg -strand - \
        -g GRCh38_EBV.chrom.sizes -ibam merged.bam \
        | sort -k1,1 -k2,2n > minus.bedGraph

# Convert bedGraph files to bigWig files (will need to download bedGraphToBigWig from UCSC)
!bedGraphToBigWig plus.bedGraph GRCh38_EBV.chrom.sizes plus.bw
!bedGraphToBigWig minus.bedGraph GRCh38_EBV.chrom.sizes minus.bw
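
The `-5 -strand` options count only the position where each read begins, separately per strand. The sketch below illustrates that idea on toy data. It assumes reads are given as hypothetical `(start, end, strand)` tuples in 0-based, half-open coordinates, and `five_prime_counts` is an illustrative helper, not a bedtools function.

```python
from collections import Counter


def five_prime_counts(reads, strand):
    """Count read 5' ends per position for one strand.

    `reads` holds (start, end, strand) tuples with 0-based, half-open
    coordinates, as in BED.
    """
    counts = Counter()
    for start, end, read_strand in reads:
        if read_strand != strand:
            continue
        # Plus-strand reads begin at `start`; minus-strand reads begin at
        # their rightmost base, `end - 1`.
        position = start if read_strand == "+" else end - 1
        counts[position] += 1
    return dict(sorted(counts.items()))


# Hypothetical reads on a single chromosome
toy_reads = [
    (100, 150, "+"),
    (100, 136, "+"),
    (140, 190, "+"),
    (120, 170, "-"),
    (90, 170, "-"),
]
print(five_prime_counts(toy_reads, "+"))
print(five_prime_counts(toy_reads, "-"))
```

Each read contributes a single count regardless of its length, so the resulting tracks are base-resolution profiles of read starts rather than fragment pileups.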

## Control files

In [None]:
# get coverage of 5' positions of the plus strand
!bedtools genomecov -5 -bg -strand + \
        -g GRCh38_EBV.chrom.sizes -ibam control.bam \
        | sort -k1,1 -k2,2n > control_plus.bedGraph

# get coverage of 5' positions of the minus strand
!bedtools genomecov -5 -bg -strand - \
        -g GRCh38_EBV.chrom.sizes -ibam control.bam \
        | sort -k1,1 -k2,2n > control_minus.bedGraph

# Convert bedGraph files to bigWig files (will need to download bedGraphToBigWig from UCSC)
!bedGraphToBigWig control_plus.bedGraph GRCh38_EBV.chrom.sizes control_plus.bw
!bedGraphToBigWig control_minus.bedGraph GRCh38_EBV.chrom.sizes control_minus.bw
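
`bedGraphToBigWig` expects its input sorted by chromosome and then by start, which is what the `sort -k1,1 -k2,2n` step is for. If the conversion complains about ordering, a quick check like the following may help locate the problem. This is a minimal sketch: `is_sorted_bedgraph` is a hypothetical helper, and it compares chromosome names by code point, which for ASCII names corresponds to byte order (what `sort` produces under `LC_ALL=C`).

```python
def is_sorted_bedgraph(lines):
    """Return True if bedGraph lines are ordered by chromosome, then start."""
    previous = None
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        chrom, start = line.split()[:2]
        key = (chrom, int(start))
        if previous is not None and key < previous:
            return False
        previous = key
    return True


# Hypothetical bedGraph content
sorted_lines = ["chr1\t5\t6\t1", "chr1\t10\t12\t3", "chr10\t0\t4\t2", "chr2\t7\t8\t1"]
unsorted_lines = ["chr1\t10\t12\t3", "chr1\t5\t6\t1"]
print(is_sorted_bedgraph(sorted_lines))
print(is_sorted_bedgraph(unsorted_lines))
```

To check a real file, pass an open file handle, for example `is_sorted_bedgraph(open("control_plus.bedGraph"))`.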

# Peaks
Normally we would call these from our bams, but here we will use the peaks from the ENCODE portal.

In [None]:
!wget https://www.encodeproject.org/files/ENCFF396BZQ/@@download/ENCFF396BZQ.bed.gz
!gunzip ENCFF396BZQ.bed.gz
!mv ENCFF396BZQ.bed peaks.bed

# Create a subset of the peaks for testing
!head -n 1000 peaks.bed > toy.bed

# Organize it

In [None]:
# Data actually used for training the model
!mkdir -p ENCSR000EGM/data
!mv *.bw ENCSR000EGM/data
!mv peaks.bed ENCSR000EGM/data
!mv toy.bed ENCSR000EGM

# Reference genome and chromosome sizes
!mkdir -p reference
!mv hg38.fa* reference
!mv GRCh38_EBV.chrom.sizes reference

# Downloaded and processed bams
!mkdir -p ENCSR000EGM/bam
!mv *.bam* ENCSR000EGM/bam

# BEDGRAPH files
!mkdir -p ENCSR000EGM/bedgraph
!mv *.bedGraph ENCSR000EGM/bedgraph
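
Before moving on to training, it can be worth confirming that the files ended up where the commands above put them. The sketch below lists the paths implied by this notebook and reports any that are missing. `EXPECTED_FILES` and `missing_outputs` are illustrative names, and the check only tests for existence, not for valid content.

```python
from pathlib import Path

# Paths implied by the mkdir/mv commands in this notebook
EXPECTED_FILES = [
    "ENCSR000EGM/data/plus.bw",
    "ENCSR000EGM/data/minus.bw",
    "ENCSR000EGM/data/control_plus.bw",
    "ENCSR000EGM/data/control_minus.bw",
    "ENCSR000EGM/data/peaks.bed",
    "ENCSR000EGM/toy.bed",
    "reference/hg38.fa",
    "reference/hg38.fa.fai",
    "reference/GRCh38_EBV.chrom.sizes",
]


def missing_outputs(root="."):
    """Return the expected files that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


# Report anything missing relative to the current working directory
for rel in missing_outputs("."):
    print(f"missing: {rel}")
```

An empty result means every expected file is present.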

---

# DONE!