# Kopp et al. (2021) Extract-Transform-Load
**Authorship:**
Adam Klie, *08/10/2022*
***
**Description:**
Notebook to extract, transform, and load data from the Kopp et al. (2021) dataset on JunD binding.
***

In [1]:
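import os
import sys

# Point both PATH and pybedtools at the bin directory of this Python
# interpreter, where the bedtools binary is expected to live
bin_dir = os.path.dirname(sys.executable)
os.environ["PATH"] += os.pathsep + bin_dir
from pybedtools import paths
paths._set_bedtools_path(bin_dir)
from pybedtools import BedTool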

In [6]:
output = '/cellar/users/aklie/data/eugene/kopp21/junD'

# Downloads and command line data prep
JunD peaks (ENCFF446WOD, conservative IDR thresholded peaks, narrowPeak format) and raw DNase-seq data (ENCFF546PJU, Stam. Lab, ENCODE; ENCFF059BEU, Stam. Lab, ROADMAP; both in BAM format) for human embryonic stem cells (H1-hESC) were downloaded from encodeproject.org. The hg38 reference genome and chromosome sizes were obtained from UCSC. BAM indices were built with samtools. Blacklisted regions for hg38 were obtained from http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/hg38-human/hg38.blacklist.bed.gz and removed using bedtools.

## Use `wget` to download data

In [3]:
# Peaks and tracks from ENCODE
!wget https://www.encodeproject.org/files/ENCFF446WOD/@@download/ENCFF446WOD.bed.gz -O {output}/jund_peaks.narrowPeak.gz
!wget https://www.encodeproject.org/files/ENCFF546PJU/@@download/ENCFF546PJU.bam -O  {output}/dnase_stam_encode.bam
!wget https://www.encodeproject.org/files/ENCFF059BEU/@@download/ENCFF059BEU.bam -O  {output}/dnase_stam_roadmap.bam

# blacklisted regions to remove
!wget http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/hg38-human/hg38.blacklist.bed.gz -O  {output}/hg38.blacklisted.bed.gz
!gunzip -f  {output}/hg38.blacklisted.bed.gz

# human genome sequence hg38
!wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz -O  {output}/hg38.fa.gz
!gunzip -f  {output}/hg38.fa.gz

!wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes -O {output}/hg38.chrom.sizes


## Index the DNase-seq data using `samtools`

In [7]:
!samtools index {output}/dnase_stam_encode.bam
!samtools index {output}/dnase_stam_roadmap.bam

## Create the peaks to use for prediction using `bedtools`

Sort the narrowPeak intervals and merge any that overlap into a single set of "raw" peaks.

In [3]:
BedTool(os.path.join(output, 'jund_peaks.narrowPeak.gz')).sort().merge().saveas(
    os.path.join(output, 'jund_raw_peaks.bed'))

<BedTool(/cellar/users/aklie/data/eugene/junD/jund_raw_peaks.bed)>
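
For intuition, the sort-then-merge step can be sketched in plain Python. This is an illustrative stand-in, not the bedtools implementation: `merge_intervals` is a hypothetical helper, the toy coordinates are made up, and it assumes the default bedtools behavior of merging overlapping and book-ended intervals.

```python
def merge_intervals(intervals):
    """Merge overlapping or book-ended (chrom, start, end) intervals."""
    merged = []
    # Sorting by (chrom, start, end) mirrors the .sort() call before .merge()
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Same chromosome and touching/overlapping: extend the previous interval
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(iv) for iv in merged]


# Toy peaks (hypothetical coordinates)
peaks = [("chr1", 150, 300), ("chr1", 100, 200), ("chr1", 300, 350), ("chr2", 10, 20)]
print(merge_intervals(peaks))  # [('chr1', 100, 350), ('chr2', 10, 20)]
```

Sorting first matters: the single pass only compares each interval with the most recently merged one, which is why the notebook calls `.sort()` before `.merge()`.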

To create the regions of interest (ROI) for defining positive and negative peaks, extend the "raw" peaks by 10,000 bp in both directions with [`bedtools slop`](https://bedtools.readthedocs.io/en/latest/content/tools/slop.html), then sort and merge the extended intervals.
Finally, subtract any parts of the ROI that fall in blacklisted regions.

In [4]:
(
    BedTool(os.path.join(output, 'jund_raw_peaks.bed'))
    .slop(b=10000, g=os.path.join(output, 'hg38.chrom.sizes'))
    .sort()
    .merge()
    .subtract(os.path.join(output, 'hg38.blacklisted.bed'))
    .saveas(os.path.join(output, 'roi_jund_extended.bed'))
)

<BedTool(/cellar/users/aklie/data/eugene/junD/roi_jund_extended.bed)>
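
The extension and blacklist removal can be sketched for a single interval as follows. This is a simplified illustration rather than what bedtools does internally: `slop` and `subtract` here are hypothetical pure-Python helpers, and the chromosome size and blacklist coordinates are invented for the example.

```python
def slop(interval, b, chrom_sizes):
    """Extend an interval by b bp on both sides, clipped to the chromosome bounds."""
    chrom, start, end = interval
    return (chrom, max(0, start - b), min(chrom_sizes[chrom], end + b))


def subtract(interval, blacklist):
    """Return the parts of `interval` not covered by any blacklist interval."""
    chrom, start, end = interval
    pieces = []
    cursor = start  # left edge of the region not yet emitted or removed
    for b_chrom, b_start, b_end in sorted(blacklist):
        # Skip blacklist entries that do not overlap the remaining region
        if b_chrom != chrom or b_end <= cursor or b_start >= end:
            continue
        if b_start > cursor:
            pieces.append((chrom, cursor, b_start))
        cursor = max(cursor, b_end)
    if cursor < end:
        pieces.append((chrom, cursor, end))
    return pieces


# Toy values (hypothetical): a 20 kb chromosome with one blacklisted region
chrom_sizes = {"chr1": 20000}
roi = slop(("chr1", 5000, 6000), b=10000, chrom_sizes=chrom_sizes)
print(roi)  # ('chr1', 0, 16000)
print(subtract(roi, [("chr1", 2000, 3000)]))  # [('chr1', 0, 2000), ('chr1', 3000, 16000)]
```

The clipping in `slop` is why the chromosome sizes file is passed through `g=` in the cell above: without it, extended intervals could run past the ends of a chromosome.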

The next command, [`janggu-trim`](https://github.com/BIMSBbioinfo/janggu/blob/5128419cf404d8f1904d46c627c0c7963356fff1/src/janggu/janggutrim.py), trims the start and end of each ROI so that both are divisible by the specified window size, in this case 200 bp.

In [5]:
!janggu-trim {output}/roi_jund_extended.bed {output}/trim_roi_jund_extended.bed -divby 200
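
The trimming arithmetic can be sketched as below. This is an assumption-laden illustration, not the `janggu-trim` code itself: it assumes the start is rounded up and the end rounded down to the nearest multiple of the window size, and that intervals left empty are dropped. The linked source is the authoritative reference, and `trim_to_multiple` is a hypothetical name.

```python
def trim_to_multiple(start, end, divby):
    """Shrink [start, end) so both coordinates are multiples of divby.

    Returns None if nothing is left after trimming (assumed behavior).
    """
    new_start = -(-start // divby) * divby  # ceiling division, then scale back up
    new_end = (end // divby) * divby        # floor division, then scale back up
    return (new_start, new_end) if new_end > new_start else None


print(trim_to_multiple(1050, 2130, 200))  # (1200, 2000)
print(trim_to_multiple(210, 390, 200))    # None
```

After trimming, every ROI length is a whole number of 200 bp windows, so the regions can be tiled into bins without a remainder.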

# SeqData preparations
Next we need to use EUGENe to read in and prep this data.

In [1]:
import os
import eugene as eu
# Must match the `output` directory used during the command line data prep
eu.settings.dataset_dir = '/cellar/users/aklie/data/eugene/kopp21/junD'

Global seed set to 13


In [3]:
# Define the input files
bed_file = os.path.join(
    eu.settings.dataset_dir,
    "jund_raw_peaks.bed" 
)
roi_file = os.path.join(
    eu.settings.dataset_dir,
    "trim_roi_jund_extended.bed"
)
refgenome = os.path.join(
    eu.settings.dataset_dir,
    "hg38.fa"
)
bed_file, roi_file, refgenome

('/cellar/users/aklie/data/eugene/kopp21/junD/jund_raw_peaks.bed',
 '/cellar/users/aklie/data/eugene/kopp21/junD/trim_roi_jund_extended.bed',
 '/cellar/users/aklie/data/eugene/kopp21/junD/hg38.fa')

In [4]:
# Read in the sequences to a SeqData object. Last loading took 6m 43.1s
sdata = eu.dl.read_bed(
    bed_file=bed_file,
    roi_file=roi_file,
    ref_file=refgenome,
    dnaflank=150,
    binsize=200,
    resolution=200
)
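
The notebook does not spell out what `binsize`, `resolution`, and `dnaflank` do. One plausible reading, assumed here rather than confirmed from the EUGENe source, is that each trimmed ROI is tiled into non-overlapping 200 bp bins and each bin's sequence is extended by 150 bp on both sides, giving 500 bp inputs. The sketch below only illustrates that arithmetic; `tile_roi` is a hypothetical helper, not part of EUGENe.

```python
def tile_roi(start, end, binsize=200, flank=150):
    """Tile [start, end) into bins and return each bin with its flanked window.

    Each entry is (bin_start, bin_end, seq_start, seq_end), where the
    sequence window is the bin extended by `flank` bp on both sides.
    """
    windows = []
    for bin_start in range(start, end - binsize + 1, binsize):
        bin_end = bin_start + binsize
        windows.append((bin_start, bin_end, bin_start - flank, bin_end + flank))
    return windows


# A trimmed ROI whose coordinates are multiples of 200 (hypothetical values)
windows = tile_roi(1200, 2000)
print(len(windows))  # 4
print(windows[0])    # (1200, 1400, 1050, 1550)
```

Under this reading, the label for each example would come from the 200 bp bin while the model sees the wider 500 bp context.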

In [7]:
# Write this as a "raw" version of SeqData h5
sdata.write_h5sd(os.path.join(eu.settings.dataset_dir, "jund_raw.h5sd"))

In [8]:
# Decode the one-hot encoded sequences to save to other formats as well
sdata.seqs = eu.pp.decode_DNA_seqs(sdata.ohe_seqs)

Decoding DNA sequences:   0%|          | 0/1013080 [00:00<?, ?it/s]

In [10]:
# Get the reverse complement of the sequences
eu.pp.reverse_complement_data(sdata)

Reverse complementing DNA sequences:   0%|          | 0/1013080 [00:00<?, ?it/s]

SeqData object modified:
	rev_seqs: None -> 1013080 rev_seqs added
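
As a reference for what this step computes, here is a minimal pure-Python reverse complement. It is a sketch of the operation, not EUGENe's implementation, and `reverse_complement` is a hypothetical helper name.

```python
# Translation table mapping each base to its complement; N stays N
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


print(reverse_complement("AACGT"))  # ACGTT
```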


In [12]:
# Get the reverse complement one hot encoding
eu.pp.one_hot_encode_data(sdata)

One-hot-encoding sequences:   0%|          | 0/1013080 [00:00<?, ?it/s]

One-hot-encoding sequences:   0%|          | 0/1013080 [00:00<?, ?it/s]

SeqData object modified:
	ohe_seqs: [[[0 0 0 1]
  [0 0 0 1]
  [0 0 0 1]
  ...
  [0 0 1 0]
  [1 0 0 0]
  [0 0 1 0]]

 [[1 0 0 0]
  [0 0 0 1]
  [0 0 0 1]
  ...
  [0 1 0 0]
  [1 0 0 0]
  [0 0 0 1]]

 [[1 0 0 0]
  [1 0 0 0]
  [0 0 0 1]
  ...
  [1 0 0 0]
  [0 0 0 1]
  [0 0 1 0]]

 ...

 [[0 0 0 1]
  [0 0 1 0]
  [0 0 0 1]
  ...
  [0 1 0 0]
  [0 0 0 1]
  [0 1 0 0]]

 [[0 1 0 0]
  [0 0 1 0]
  [0 0 0 1]
  ...
  [0 0 1 0]
  [0 0 0 1]
  [0 0 1 0]]

 [[1 0 0 0]
  [0 0 1 0]
  [0 0 1 0]
  ...
  [0 0 0 1]
  [0 1 0 0]
  [0 0 1 0]]] -> 1013080 ohe_seqs added
	ohe_rev_seqs: None -> 1013080 ohe_rev_seqs added
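
The encoding and decoding steps can be sketched with plain lists. This is illustrative only: the column order A, C, G, T is an assumption about the printed arrays, and `one_hot_encode` / `one_hot_decode` are hypothetical helpers rather than EUGENe functions.

```python
ALPHABET = "ACGT"  # assumed column order of the one-hot matrix


def one_hot_encode(seq):
    """Encode a DNA string as a list of length-4 rows; unknown bases become all zeros."""
    return [[1 if base == letter else 0 for letter in ALPHABET] for base in seq.upper()]


def one_hot_decode(ohe):
    """Invert one_hot_encode; all-zero rows decode to 'N'."""
    return "".join(ALPHABET[row.index(1)] if 1 in row else "N" for row in ohe)


ohe = one_hot_encode("AACGT")
print(one_hot_decode(ohe))  # AACGT

# With A, C, G, T column order, reversing both axes yields the reverse complement
ohe_rev = [row[::-1] for row in ohe[::-1]]
print(one_hot_decode(ohe_rev))  # ACGTT
```

The last two lines show why the A, C, G, T ordering is convenient: flipping rows reverses the sequence and flipping columns swaps A with T and C with G.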


In [19]:
# Get sequence lengths
sdata["seq_len"] = [len(seq) for seq in sdata.seqs]

In [20]:
# Add genomic range information (chr, start, end) to seqs_annot
eu.pp.add_ranges_annot(sdata)

SeqData object modified:
    seqs_annot:
        + end, start, chr


In [21]:
# Save the combined (unsplit) processed dataset
sdata.write_h5sd(os.path.join(eu.settings.dataset_dir, "jund_processed.h5sd"))

In [None]:
# Annotate each sequence with a boolean "train_test" label, splitting on chr3
eu.pp.train_test_split_data(
    sdata,
    train_key="train_test",
    chr=["chr3"]
)

In [None]:
# Subset into separate training and test SeqData objects using the boolean label
sdata_train = sdata[sdata["train_test"].values]
sdata_test = sdata[~sdata["train_test"].values]
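
The chromosome-based holdout can be sketched as a boolean mask. This assumes, based on how the mask is used in the cell above, that `True` marks training sequences and that sequences on the listed chromosome are held out. `chromosome_split_mask` is a hypothetical helper, not the EUGENe function.

```python
def chromosome_split_mask(chroms, holdout):
    """Return True for sequences to keep for training, False for held-out ones."""
    holdout = set(holdout)
    return [chrom not in holdout for chrom in chroms]


# Toy chromosome annotations (hypothetical)
chroms = ["chr1", "chr3", "chr2", "chr3"]
mask = chromosome_split_mask(chroms, holdout=["chr3"])
print(mask)  # [True, False, True, False]

train = [c for c, keep in zip(chroms, mask) if keep]
test = [c for c, keep in zip(chroms, mask) if not keep]
print(train, test)  # ['chr1', 'chr2'] ['chr3', 'chr3']
```

Holding out whole chromosomes, rather than sampling sequences at random, keeps neighboring ROI bins on the same side of the split.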

In [None]:
# Split the training sequences into train and validation sets
eu.pp.train_test_split_data(
    sdata_train,
    train_key="train_val",
    chr=["chr2"]
)

In [None]:
# Save train
sdata_train.write_h5sd(os.path.join(eu.settings.dataset_dir, "jund_train_processed.h5sd"))

In [None]:
# Save test
sdata_test.write_h5sd(os.path.join(eu.settings.dataset_dir, "jund_test_processed.h5sd"))

---

# Scratch