Preparing a dataset for Basenji training involves a series of design choices.

The inputs you bring to the pipeline are:
* BigWig coverage tracks
* Genome FASTA file

First, make sure you have an hg19 FASTA file available. If you already have one, put a symbolic link to it in the data directory. Otherwise, I have a simplified, machine-learning-friendly version that you can download in the next cell.

In [9]:
import os, subprocess

# Download the simplified hg19 FASTA and its index only if the FASTA is missing.
if not os.path.isfile('data/hg19.ml.fa'):
    subprocess.call('curl -o data/hg19.ml.fa https://storage.googleapis.com/basenji_tutorial_data/hg19.ml.fa', shell=True)
    subprocess.call('curl -o data/hg19.ml.fa.fai https://storage.googleapis.com/basenji_tutorial_data/hg19.ml.fa.fai', shell=True)
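
The `.fai` file fetched alongside the FASTA is a plain-text index in the samtools `faidx` format, whose first two tab-separated columns are the sequence name and its length. As a quick sanity check after the download, a minimal sketch like the following can read the chromosome lengths from it. The helper name `read_fai_lengths` is hypothetical and not part of Basenji.

```python
import os

def read_fai_lengths(fai_path):
    """Return {sequence name: length} from a samtools-style .fai index."""
    lengths = {}
    with open(fai_path) as fai_in:
        for line in fai_in:
            fields = line.rstrip('\n').split('\t')
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            lengths[fields[0]] = int(fields[1])
    return lengths

# Only runs if the index from the cell above is present.
if os.path.isfile('data/hg19.ml.fa.fai'):
    chrom_lengths = read_fai_lengths('data/hg19.ml.fa.fai')
    print(len(chrom_lengths), 'sequences indexed')
```

If the printed count is zero, the download most likely failed or was truncated.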

Next, let's grab a few CAGE datasets from FANTOM5 related to heart biology.

These data were processed by:
1. Aligning with Bowtie2 using very sensitive alignment parameters.
2. Distributing multi-mapping reads and estimating genomic coverage with [bam_cov.py](https://github.com/calico/basenji/blob/master/bin/bam_cov.py).

In [None]:
# Download each CAGE BigWig track that is not already present.
for accession in ['CNhs11760', 'CNhs12843', 'CNhs12856']:
    if not os.path.isfile('data/%s.bw' % accession):
        subprocess.call('curl -o data/%s.bw https://storage.googleapis.com/basenji_tutorial_data/%s.bw' % (accession, accession), shell=True)

Then we'll write out these BigWig files and labels to a samples table.

In [None]:
# Each line is: sample label <tab> BigWig path.
with open('data/heart_wigs.txt', 'w') as samples_out:
    print('aorta\tdata/CNhs11760.bw', file=samples_out)
    print('artery\tdata/CNhs12843.bw', file=samples_out)
    print('pulmonic_valve\tdata/CNhs12856.bw', file=samples_out)
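
Before handing the samples table to the next step, it can be worth reading it back and checking that every listed BigWig exists. The sketch below assumes the two-column, tab-separated layout written above. The helpers `read_samples_table` and `missing_files` are hypothetical names, not Basenji functions.

```python
import os

def read_samples_table(table_path):
    """Parse a two-column, tab-separated table of (label, BigWig path)."""
    samples = []
    with open(table_path) as table_in:
        for line in table_in:
            line = line.rstrip('\n')
            if not line:
                continue  # ignore blank lines
            label, wig_path = line.split('\t')
            samples.append((label, wig_path))
    return samples

def missing_files(samples):
    """Return the paths in the table that do not exist on disk."""
    return [path for _, path in samples if not os.path.isfile(path)]

# Only runs if the table from the cell above is present.
if os.path.isfile('data/heart_wigs.txt'):
    samples = read_samples_table('data/heart_wigs.txt')
    print('samples:', [label for label, _ in samples])
    print('missing:', missing_files(samples))
```

An empty `missing` list means all tracks named in the table are on disk.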

Next, we want to choose genomic sequences to form batches for stochastic gradient descent, divide them into training/validation/test sets, and form a single file to provide to downstream programs.

The script [basenji_hdf5_single.py](https://github.com/calico/basenji/blob/master/bin/basenji_hdf5_single.py) implements this procedure.
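
To make the splitting step concrete, here is a minimal sketch of holding out whole chromosomes for validation and testing, as the `-v` and `-t` options below do. It is an illustration only and not the script's actual code; `assign_split` is a hypothetical helper and the sequence tuples are made up.

```python
def assign_split(chrom, valid_chroms=('chr8',), test_chroms=('chr9',)):
    """Map a chromosome name to 'train', 'valid', or 'test'."""
    if chrom in test_chroms:
        return 'test'
    if chrom in valid_chroms:
        return 'valid'
    return 'train'

# Made-up (chrom, start, end) sequences, for illustration only.
sequences = [
    ('chr1', 0, 262144),
    ('chr8', 0, 262144),
    ('chr9', 131072, 393216),
]

splits = {'train': [], 'valid': [], 'test': []}
for chrom, start, end in sequences:
    splits[assign_split(chrom)].append((chrom, start, end))

for name, seqs in splits.items():
    print(name, len(seqs))
```

Holding out entire chromosomes keeps overlapping sequences from landing in different sets.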

The most relevant options here are:

| Option/Argument | Value | Note |
|:---|:---|:---|
| -d | 0.05 | Down-sample the genome to 5% to speed things up here. |
| -g | data/unmap_macro.bed | Dodge large-scale unmappable regions like assembly gaps. |
| -l | 262144 | Sequence length. |
| -o | data/heart_l262k.bed | Write out the chosen sequences to a BED file. |
| -p | 4 | Use 4 parallel workers via the multiprocessing library. |
| -s | 131072 | Stride the sequences. By setting this to half the sequence length, the sequences will overlap by half. |
| -t | chr9 | Hold out chr9 sequences for testing. |
| -w | 128 | Pools the nucleotide-resolution values to 128 bp bins. |
| -v | chr8 | Hold out chr8 sequences for validation. |
| fasta_file | data/hg19.ml.fa | FASTA file to extract sequences from. |
| sample_wigs_file | data/heart_wigs.txt | Samples and BigWig paths. |
| hdf5_file | data/heart_l262k.h5 | Output HDF5 file. |

In [None]:
! basenji_hdf5_single.py -d .05 -g data/unmap_macro.bed -l 262144 -o data/heart_l262k.bed -p 4 -s 131072 -t chr9 -w 128 -v chr8 data/hg19.ml.fa data/heart_wigs.txt data/heart_l262k.h5
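
The arithmetic behind `-l`, `-s`, and `-w` can be sketched in a few lines. The code below is an illustration under stated assumptions, not the script's implementation: `tile_windows` and `sum_pool` are hypothetical helpers, the tiling ignores unmappable regions and down-sampling, and pooling by summation is an assumption (the script may aggregate differently).

```python
SEQ_LEN = 262144    # -l
STRIDE = 131072     # -s
POOL_WIDTH = 128    # -w

def tile_windows(chrom_len, seq_len=SEQ_LEN, stride=STRIDE):
    """List (start, end) windows of seq_len, stepping by stride."""
    windows = []
    start = 0
    while start + seq_len <= chrom_len:
        windows.append((start, start + seq_len))
        start += stride
    return windows

def sum_pool(values, pool_width=POOL_WIDTH):
    """Sum consecutive, non-overlapping groups of pool_width values."""
    if len(values) % pool_width != 0:
        raise ValueError('length must be a multiple of pool_width')
    return [sum(values[i:i + pool_width])
            for i in range(0, len(values), pool_width)]

bins_per_seq = SEQ_LEN // POOL_WIDTH   # target bins per sequence
overlap = SEQ_LEN - STRIDE             # bases shared by adjacent windows

print('bins per sequence:', bins_per_seq)
print('overlap (bp):', overlap)
print('windows on a 600 kb toy chromosome:', tile_windows(600000))
```

With these settings each sequence yields 2048 bins, and adjacent windows share 131072 bp, which is half the sequence length.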

Now you can pass data/heart_l262k.h5 to [basenji_train.py](https://github.com/calico/basenji/blob/master/bin/basenji_train.py) to train a model.

Inside are the following data structures.

In [None]:
import h5py

# Open read-only; we only inspect the contents here.
ml_h5 = h5py.File('data/heart_l262k.h5', 'r')
print(list(ml_h5.keys()))

In [None]:
print('train_in', ml_h5['train_in'].shape)
print('train_out', ml_h5['train_out'].shape)

print('valid_in', ml_h5['valid_in'].shape)
print('valid_out', ml_h5['valid_out'].shape)

print('test_in', ml_h5['test_in'].shape)
print('test_out', ml_h5['test_out'].shape)
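
To interpret the printed shapes, it may help to compute what the options above would imply. The sketch below assumes, without confirmation from the script itself, that inputs are stored as one-hot encoded sequences with 4 channels and that targets hold one value per pooled bin per sample. `expected_shapes` is a hypothetical helper, and the assumption is worth checking against the actual output.

```python
def expected_shapes(num_seqs, seq_len=262144, pool_width=128, num_targets=3):
    """Return assumed (input_shape, output_shape) for one data split.

    Assumes inputs are (sequences, positions, 4 nucleotides) and
    outputs are (sequences, bins, targets).
    """
    in_shape = (num_seqs, seq_len, 4)
    out_shape = (num_seqs, seq_len // pool_width, num_targets)
    return in_shape, out_shape

# num_seqs=10 is an arbitrary illustrative value, not a real result.
in_shape, out_shape = expected_shapes(10)
print('assumed input shape:', in_shape)
print('assumed output shape:', out_shape)
```

If the real shapes differ in anything other than the number of sequences, the layout assumption above does not hold for this file.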