# Tutorial 1: Training an AtacWorks model 

## Introduction

In this tutorial we train an AtacWorks model to denoise the signal track and call peaks from aggregate single-cell ATAC-seq data derived from a small number of cells. We use the dsc-ATAC-seq dataset presented in reference (1) (Section "AtacWorks enhances ATAC-seq results from small numbers of single cells", also Supplementary Table 8). This dataset consists of single-cell ATAC-seq data from several types of human blood cells.

Note that all the AtacWorks models described in reference (1) are available to download (https://atacworks-paper.s3.us-east-2.amazonaws.com) and you may be able to use one of these instead of training a new model. To learn how to download and use an existing model, refer to [Tutorial 2](tutorial2.ipynb).
 
We selected 2400 Monocytes from this dataset - this is our ‘clean’, high-coverage dataset. We then randomly sampled 50 of these 2400 Monocytes. Here's what the ATAC-seq signal from 50 cells and 2400 cells looks like, for a region on chromosome 10:

![Monocytes subsampled signal](../docs/tutorials/Mono.2400.50.png)

Compared to the 'clean' signal from 2400 cells, the aggregated ATAC-seq signal track from these 50 cells is noisy. Because of this noise, peak calls calculated by MACS2 on this data are also inaccurate.

We train an AtacWorks model to learn a mapping from the 50-cell ATAC-seq signals to the 2400-cell ATAC-seq signal and peak calls. In other words, given a noisy ATAC-seq signal from 50 cells, this model learns what the signal would look like - and where the peaks would be called - if we had sequenced 2400 cells.

**NOTE: You may notice an exclamation mark (!) before most of the commands in this tutorial. Most of them are bash commands, and bash commands run from a notebook must be preceded by an exclamation mark. These commands can also be copied and pasted directly into a terminal (without the !) and executed. We created this notebook so that users can run the tutorial without copying and pasting commands.**

## Step 1: Create folder and set AtacWorks path
Replace the path in the command below (`/ntadimeti/AtacWorks` in this example) with the path to your cloned and set-up 'AtacWorks' GitHub repository.

In [50]:
%env atacworks=/ntadimeti/AtacWorks

env: atacworks=/ntadimeti/AtacWorks
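A quick sanity check can catch a wrong path before the later steps fail. The sketch below is not part of AtacWorks; `check_atacworks_path` is a hypothetical helper that only looks for the four scripts invoked later in this tutorial under `$atacworks/scripts`.

```python
import os

def check_atacworks_path(root):
    """Return the tutorial scripts that are missing under root/scripts."""
    # Script names are the ones invoked in the steps of this tutorial.
    expected = ["peak2bw.py", "get_intervals.py", "bw2h5.py", "main.py"]
    return [name for name in expected
            if not os.path.isfile(os.path.join(root, "scripts", name))]

# Read the variable set by the %env command above (empty if it is not set).
root = os.environ.get("atacworks", "")
missing = check_atacworks_path(root)
if missing:
    print("Missing under", repr(root), ":", missing)
else:
    print("AtacWorks scripts found under", root)
```

An empty list means all four scripts were found. It does not confirm that the repository's dependencies are installed.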


Create a folder for this experiment. The `os.chdir()` call below moves into the new directory.

In [2]:
!mkdir tutorial1
import os
os.chdir('tutorial1')

## Step 2: Download data

We will download all of the data needed for this experiment from AWS into the `tutorial1` folder.


### Noisy ATAC-seq signal from 50 Monocytes

In [3]:
!wget https://atacworks-paper.s3.us-east-2.amazonaws.com/dsc_atac_blood_cell_denoising_experiments/50_cells/train_data/noisy_data/dsc.1.Mono.50.cutsites.smoothed.200.bw

--2020-06-09 19:55:24--  https://atacworks-paper.s3.us-east-2.amazonaws.com/dsc_atac_blood_cell_denoising_experiments/50_cells/train_data/noisy_data/dsc.1.Mono.50.cutsites.smoothed.200.bw
Resolving atacworks-paper.s3.us-east-2.amazonaws.com (atacworks-paper.s3.us-east-2.amazonaws.com)... 52.219.84.0
Connecting to atacworks-paper.s3.us-east-2.amazonaws.com (atacworks-paper.s3.us-east-2.amazonaws.com)|52.219.84.0|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14633285 (14M) [binary/octet-stream]
Saving to: 'dsc.1.Mono.50.cutsites.smoothed.200.bw'


2020-06-09 19:55:26 (9.67 MB/s) - 'dsc.1.Mono.50.cutsites.smoothed.200.bw' saved [14633285/14633285]



### Clean ATAC-seq signal from 2400 Monocytes

In [4]:
!wget https://atacworks-paper.s3.us-east-2.amazonaws.com/dsc_atac_blood_cell_denoising_experiments/50_cells/train_data/clean_data/dsc.Mono.2400.cutsites.smoothed.200.bw

--2020-06-09 19:55:26--  https://atacworks-paper.s3.us-east-2.amazonaws.com/dsc_atac_blood_cell_denoising_experiments/50_cells/train_data/clean_data/dsc.Mono.2400.cutsites.smoothed.200.bw
Resolving atacworks-paper.s3.us-east-2.amazonaws.com (atacworks-paper.s3.us-east-2.amazonaws.com)... 52.219.104.112
Connecting to atacworks-paper.s3.us-east-2.amazonaws.com (atacworks-paper.s3.us-east-2.amazonaws.com)|52.219.104.112|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 235468888 (225M) [binary/octet-stream]
Saving to: 'dsc.Mono.2400.cutsites.smoothed.200.bw'


2020-06-09 19:55:43 (14.0 MB/s) - 'dsc.Mono.2400.cutsites.smoothed.200.bw' saved [235468888/235468888]



### Clean ATAC-seq peaks from 2400 Monocytes

In [5]:
!wget https://atacworks-paper.s3.us-east-2.amazonaws.com/dsc_atac_blood_cell_denoising_experiments/50_cells/train_data/clean_data/dsc.Mono.2400.cutsites.smoothed.200.3.narrowPeak

--2020-06-09 19:55:43--  https://atacworks-paper.s3.us-east-2.amazonaws.com/dsc_atac_blood_cell_denoising_experiments/50_cells/train_data/clean_data/dsc.Mono.2400.cutsites.smoothed.200.3.narrowPeak
Resolving atacworks-paper.s3.us-east-2.amazonaws.com (atacworks-paper.s3.us-east-2.amazonaws.com)... 52.219.100.72
Connecting to atacworks-paper.s3.us-east-2.amazonaws.com (atacworks-paper.s3.us-east-2.amazonaws.com)|52.219.100.72|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16328627 (16M) [binary/octet-stream]
Saving to: 'dsc.Mono.2400.cutsites.smoothed.200.3.narrowPeak'


2020-06-09 19:55:45 (11.3 MB/s) - 'dsc.Mono.2400.cutsites.smoothed.200.3.narrowPeak' saved [16328627/16328627]
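Interrupted downloads are a common source of confusing errors later on. As a minimal sketch, the three files can be checked against the byte counts that `wget` printed in the `Length:` lines above. `check_downloads` is a hypothetical helper, and the sizes assume the files on the server have not changed since this notebook was run.

```python
import os

# Byte counts copied from the "Length:" lines of the wget output above.
EXPECTED_SIZES = {
    "dsc.1.Mono.50.cutsites.smoothed.200.bw": 14633285,
    "dsc.Mono.2400.cutsites.smoothed.200.bw": 235468888,
    "dsc.Mono.2400.cutsites.smoothed.200.3.narrowPeak": 16328627,
}

def check_downloads(expected, directory="."):
    """Return {filename: problem} for files that are missing or have the wrong size."""
    problems = {}
    for name, size in expected.items():
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            problems[name] = "missing"
        elif os.path.getsize(path) != size:
            problems[name] = "size %d, expected %d" % (os.path.getsize(path), size)
    return problems

# An empty dict means every file is present with the expected size.
print(check_downloads(EXPECTED_SIZES) or "All downloads look complete")
```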



### Config files
We also need to download config files for this experiment. The config files describe the structure of the deep learning model and the parameters to train it. We will place these in the `configs` folder. 

In [6]:
!mkdir configs
!wget -P configs https://atacworks-paper.s3.us-east-2.amazonaws.com/dsc_atac_blood_cell_denoising_experiments/50_cells/configs/train_config.yaml
!wget -P configs https://atacworks-paper.s3.us-east-2.amazonaws.com/dsc_atac_blood_cell_denoising_experiments/50_cells/configs/model_structure.yaml

--2020-06-09 19:55:46--  https://atacworks-paper.s3.us-east-2.amazonaws.com/dsc_atac_blood_cell_denoising_experiments/50_cells/configs/train_config.yaml
Resolving atacworks-paper.s3.us-east-2.amazonaws.com (atacworks-paper.s3.us-east-2.amazonaws.com)... 52.219.96.184
Connecting to atacworks-paper.s3.us-east-2.amazonaws.com (atacworks-paper.s3.us-east-2.amazonaws.com)|52.219.96.184|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 652 []
Saving to: 'configs/train_config.yaml'


2020-06-09 19:55:46 (6.18 MB/s) - 'configs/train_config.yaml' saved [652/652]

--2020-06-09 19:55:47--  https://atacworks-paper.s3.us-east-2.amazonaws.com/dsc_atac_blood_cell_denoising_experiments/50_cells/configs/model_structure.yaml
Resolving atacworks-paper.s3.us-east-2.amazonaws.com (atacworks-paper.s3.us-east-2.amazonaws.com)... 52.219.101.27
Connecting to atacworks-paper.s3.us-east-2.amazonaws.com (atacworks-paper.s3.us-east-2.amazonaws.com)|52.219.101.27|:443... connected.
HTTP requ

## Step 3: Convert clean peak file into bigWig format

The clean peak calls (`dsc.Mono.2400.cutsites.smoothed.200.3.narrowPeak`) were produced by MACS2 and are in .narrowPeak format. We need to convert them to bigWig format for use. This also requires us to supply a chromosome sizes file describing the reference genome that we use. 

Chromosome sizes files for the hg19 and hg38 human reference genomes are supplied with AtacWorks in the folder `AtacWorks/data/reference`. Here, we are using hg19.


In [7]:
!python $atacworks/scripts/peak2bw.py \
    --input dsc.Mono.2400.cutsites.smoothed.200.3.narrowPeak \
    --sizes $atacworks/data/reference/hg19.chrom.sizes \
    --out_dir ./ \
    --skip 1

INFO:2020-06-09 19:55:48,544:AtacWorks-peak2bw] Reading input file
INFO:2020-06-09 19:55:48,816:AtacWorks-peak2bw] Read 105959 peaks.
INFO:2020-06-09 19:55:48,819:AtacWorks-peak2bw] Adding score
INFO:2020-06-09 19:55:48,820:AtacWorks-peak2bw] Writing peaks to bedGraph file
Discarding 0 entries outside sizes file.
INFO:2020-06-09 19:55:49,382:AtacWorks-peak2bw] Writing peaks to bigWig file dsc.Mono.2400.cutsites.smoothed.200.3.narrowPeak.bw
INFO:2020-06-09 19:55:49,885:AtacWorks-peak2bw] Done!


The `--skip 1` argument tells the script to ignore the first line of the narrowPeak file as it contains a header.

This command reads the peak positions from the .narrowPeak file and writes them to a bigWig file in the current directory,  named `dsc.Mono.2400.cutsites.smoothed.200.3.narrowPeak.bw`.

```
INFO:2020-01-22 20:32:05,270:AtacWorks-peak2bw] Reading input files
INFO:2020-01-22 20:32:05,387:AtacWorks-peak2bw] Retaining 105959 of 105959 peaks in given chromosomes.
INFO:2020-01-22 20:32:05,387:AtacWorks-peak2bw] Adding score
INFO:2020-01-22 20:32:05,388:AtacWorks-peak2bw] Writing peaks to bedGraph file
INFO:2020-01-22 20:32:05,855:AtacWorks-peak2bw] Writing peaks to bigWig file dsc.Mono.2400.cutsites.smoothed.200.3.narrowPeak.bw
INFO:2020-01-22 20:32:06,042:AtacWorks-peak2bw] Done!
```

For more information type `python $atacworks/scripts/peak2bw.py --help`
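To make the conversion more concrete, here is a rough sketch of the first stages that the log describes: reading peak positions, skipping the header, and attaching a score to give bedGraph-style rows. This is not the `peak2bw.py` implementation. The function names are hypothetical, the constant score of 1 is an assumption, and the final bedGraph-to-bigWig step is omitted.

```python
def read_narrowpeak(lines, skip=0):
    """Yield (chrom, start, end) from narrowPeak-formatted lines.

    narrowPeak is a tab-separated, BED-like format whose first three
    columns are chromosome, start and end.
    """
    for i, line in enumerate(lines):
        if i < skip or not line.strip():
            continue  # header lines (compare --skip 1) and blank lines
        chrom, start, end = line.rstrip("\n").split("\t")[:3]
        yield chrom, int(start), int(end)

def to_bedgraph(peaks, score=1):
    """Attach a constant score (an assumption) to every peak, sorted by position."""
    return [(chrom, start, end, score) for chrom, start, end in sorted(peaks)]

# Toy input: one header line followed by two made-up peaks.
example = [
    "header line\n",
    "chr1\t100\t200\tpeak_a\t50\t.\t5.0\t10.0\t8.0\t40\n",
    "chr1\t10\t60\tpeak_b\t30\t.\t3.0\t6.0\t4.0\t20\n",
]
print(to_bedgraph(read_narrowpeak(example, skip=1)))
# [('chr1', 10, 60, 1), ('chr1', 100, 200, 1)]
```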

## Step 4: Create genomic intervals to define regions for training and validation

We need to define which regions of the genome will be used to train and test the model. We want to train models on some portion of the genome ('training set') and evaluate their performance on a separate portion ('validation set'). We will choose the model that performs best on the validation set as the best model. Later, we will evaluate the performance of this best model on a third portion of the genome ('holdout set').

We provide a chromosome sizes file 'hg19.auto.sizes' that contains sizes for all the autosomes of the hg19 reference genome. We split off chromosome 20 to use as the validation set, and chromosome 10 to use as the holdout set, and use the remaining autosomes as the training set. Since a whole chromosome is too long to feed into the model at once, we split each of these chromosomes into 50,000-bp long intervals.

In [8]:
!python $atacworks/scripts/get_intervals.py \
     --sizes $atacworks/data/reference/hg19.auto.sizes \
     --intervalsize 50000 \
     --out_dir ./ \
     --val chr20 \
     --holdout chr10

INFO:2020-06-09 19:55:50,948:AtacWorks-intervals] Generating training intervals
INFO:2020-06-09 19:55:51,628:AtacWorks-intervals] Generating val intervals
INFO:2020-06-09 19:55:51,662:AtacWorks-intervals] Generating holdout intervals
INFO:2020-06-09 19:55:51,687:AtacWorks-intervals] Done!


This command generates three BED files in the current directory: `training_intervals.bed`, `val_intervals.bed`, and `holdout_intervals.bed`. These BED files contain 50,000-bp long intervals spanning the given chromosomes. We can look at these intervals:

```
# head training_intervals.bed 
chr1  0 50000
chr1  50000 100000
chr1  100000  150000
chr1  150000  200000
chr1  200000  250000
chr1  250000  300000
chr1  300000  350000
chr1  350000  400000
chr1  400000  450000
chr1  450000  500000
```
For more information type `python $atacworks/scripts/get_intervals.py --help`
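The tiling and splitting logic can be sketched in a few lines of Python. This is an illustration, not the `get_intervals.py` implementation. The function names are hypothetical, and dropping the final partial window of each chromosome is an assumption.

```python
def tile_chromosome(chrom, size, interval_size):
    """Non-overlapping windows of interval_size bp; a final partial window is dropped."""
    return [(chrom, start, start + interval_size)
            for start in range(0, size - interval_size + 1, interval_size)]

def split_intervals(sizes, interval_size, val, holdout):
    """Assign each chromosome's windows to the training, validation or holdout set."""
    sets = {"training": [], "val": [], "holdout": []}
    for chrom, size in sizes.items():
        key = "val" if chrom == val else "holdout" if chrom == holdout else "training"
        sets[key].extend(tile_chromosome(chrom, size, interval_size))
    return sets

# hg19 chromosome lengths for three chromosomes only, for illustration.
sizes = {"chr1": 249250621, "chr10": 135534747, "chr20": 63025520}
sets = split_intervals(sizes, 50000, val="chr20", holdout="chr10")
print({name: len(rows) for name, rows in sets.items()})
print(sets["training"][:2])
```

With these assumptions chr20 yields 1260 windows, which is consistent with the `Read 1260 intervals` message printed when the validation intervals are read in Step 6.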

## Step 5: Read the training data and labels and save in .h5 format

We take the three bigWig files containing the noisy ATAC-seq signal, the clean ATAC-seq signal, and the clean ATAC-seq peak calls. For these three files, we read the values in the regions defined by the training intervals, and save these values in a format that can be read by our model. First, we read values for the intervals in the training set (`training_intervals.bed`), spanning all autosomes except chr10 and chr20.

In [9]:
!python $atacworks/scripts/bw2h5.py \
           --noisybw dsc.1.Mono.50.cutsites.smoothed.200.bw \
           --cleanbw dsc.Mono.2400.cutsites.smoothed.200.bw \
           --cleanpeakbw dsc.Mono.2400.cutsites.smoothed.200.3.narrowPeak.bw \
           --intervals training_intervals.bed \
           --out_dir ./ \
           --prefix Mono.50.2400.train \
           --pad 5000 \
           --nonzero

INFO:2020-06-09 19:55:53,073:AtacWorks-bw2h5] Reading intervals
INFO:2020-06-09 19:55:53,093:AtacWorks-bw2h5] Read 53641 intervals
INFO:2020-06-09 19:55:53,093:AtacWorks-bw2h5] Selecting intervals with nonzero coverage
INFO:2020-06-09 19:59:23,312:AtacWorks-bw2h5] Retaining 32000 of 53641 nonzero noisy intervals
INFO:2020-06-09 19:59:23,315:AtacWorks-bw2h5] Writing data in 32 batches.
INFO:2020-06-09 19:59:23,316:AtacWorks-bw2h5] Extracting data for each batch and writing to h5 file
INFO:2020-06-09 19:59:23,316:AtacWorks-bw2h5] batch 0 of 32
INFO:2020-06-09 20:03:05,931:AtacWorks-bw2h5] batch 10 of 32
INFO:2020-06-09 20:06:25,805:AtacWorks-bw2h5] batch 20 of 32
INFO:2020-06-09 20:09:48,491:AtacWorks-bw2h5] batch 30 of 32
INFO:2020-06-09 20:10:54,850:AtacWorks-bw2h5] Done! Saved to ./Mono.50.2400.train.h5


This produces a .h5 file in the current directory (`Mono.50.2400.train.h5`) containing the training data for the model. The `--nonzero` flag ignores intervals that contain zero coverage. We use this flag for training data as these intervals do not help the model to learn.

For more information type `python $atacworks/scripts/bw2h5.py --help`
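The two options used above can be illustrated with a toy example. This is only a sketch under stated assumptions, not the `bw2h5.py` implementation. It assumes `--pad` extends each interval by the given number of bases on both sides, and it uses a plain Python list as a stand-in for a bigWig track. Both function names are hypothetical, and clipping at position 0 is a choice made for this sketch.

```python
def pad_interval(start, end, pad):
    """Extend an interval by pad bases on each side, clipping at position 0."""
    return max(0, start - pad), end + pad

def nonzero_intervals(intervals, coverage):
    """Keep intervals in which at least one position has nonzero coverage.

    coverage maps chromosome -> list of per-base values (toy stand-in for a bigWig).
    """
    return [(chrom, start, end) for chrom, start, end in intervals
            if any(coverage[chrom][start:end])]

# 20 toy bases on a made-up chromosome; only positions 11 and 12 are covered.
coverage = {"chrT": [0] * 10 + [0, 3, 1, 0, 0] + [0] * 5}
intervals = [("chrT", 0, 10), ("chrT", 10, 15), ("chrT", 15, 20)]
print(nonzero_intervals(intervals, coverage))   # [('chrT', 10, 15)]
print(pad_interval(50000, 100000, 5000))        # (45000, 105000)
```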


## Step 6: Read the validation data and labels and save in .h5 format

Next we read and save the validation data for the model.


In [10]:
!python $atacworks/scripts/bw2h5.py \
           --noisybw dsc.1.Mono.50.cutsites.smoothed.200.bw \
           --cleanbw dsc.Mono.2400.cutsites.smoothed.200.bw \
           --cleanpeakbw dsc.Mono.2400.cutsites.smoothed.200.3.narrowPeak.bw \
           --intervals val_intervals.bed \
           --out_dir ./ \
           --prefix Mono.50.2400.val \
           --pad 5000

INFO:2020-06-09 20:10:56,268:AtacWorks-bw2h5] Reading intervals
INFO:2020-06-09 20:10:56,410:AtacWorks-bw2h5] Read 1260 intervals
INFO:2020-06-09 20:10:56,410:AtacWorks-bw2h5] Writing data in 2 batches.
INFO:2020-06-09 20:10:56,410:AtacWorks-bw2h5] Extracting data for each batch and writing to h5 file
INFO:2020-06-09 20:10:56,411:AtacWorks-bw2h5] batch 0 of 2
INFO:2020-06-09 20:11:39,943:AtacWorks-bw2h5] Done! Saved to ./Mono.50.2400.val.h5


This produces a .h5 file in the current directory (`Mono.50.2400.val.h5`) containing the validation data for the model.


## Step 7: Train and validate a model using the parameters in the given config files

We next train an AtacWorks model to learn a mapping from the noisy (50-cell) ATAC-seq signal to the clean (2400-cell) ATAC-seq signal and peak calls. The two .yaml files that we downloaded into the `configs` directory contain all the parameters that describe how to train the model. `configs/model_structure.yaml` contains parameters that control the architecture of the model, and `configs/train_config.yaml` contains parameters that control the process of training, such as the learning rate and batch size.

To train the model, we supply the training and validation datasets as well as the two config files.

In [11]:
!python $atacworks/scripts/main.py train \
        --config configs/train_config.yaml \
        --config_mparams configs/model_structure.yaml \
        --files_train Mono.50.2400.train.h5 \
        --val_files Mono.50.2400.val.h5

INFO:2020-06-09 20:11:42,089:AtacWorks-main] Running on GPU: 0
Building model: resnet ...
Finished building.
Saving config file to ./trained_models_2020.06.09_20.11/configs/model_structure.yaml...
Num_batches 500; rank 0, gpu 0
Epoch [ 0/25] -------------------- [  0/500] mse:  20.142 | pearsonloss:   0.986 | total_loss:   1.603 | bce:   0.607
Epoch [ 0/25] ##------------------ [ 50/500] mse:2030.815 | pearsonloss:   0.055 | total_loss:   1.578 | bce:   0.507
Epoch [ 0/25] ####---------------- [100/500] mse:  20.136 | pearsonloss:   0.986 | total_loss:   1.079 | bce:   0.083
Epoch [ 0/25] ######-------------- [150/500] mse:1938.982 | pearsonloss:   0.014 | total_loss:   1.057 | bce:   0.074
Epoch [ 0/25] ########------------ [200/500] mse:  20.121 | pearsonloss:   0.985 | total_loss:   1.096 | bce:   0.101
Epoch [ 0/25] ##########---------- [250/500] mse:  75.668 | pearsonloss:   0.015 | total_loss:   0.123 | bce:   0.071
Epoch [ 0/25] ############-------- [300/50

This command trains a deep learning model using the supplied clean and noisy ATAC-seq data for the number of epochs set in the training config (the progress log above shows 25; one epoch is one full pass through the dataset). At the end of every epoch, the current state of the model is saved in the directory `trained_models_latest`, and the performance of the current model is measured on the validation set. At the end, out of all the saved models, the one with the best performance on the validation set is saved as `trained_models_latest/model_best.pth.tar`.
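The selection rule described above amounts to keeping the epoch with the best validation score. The sketch below shows the idea only. `best_epoch` is a hypothetical helper, the scores are made up for illustration, and the metric AtacWorks actually uses for this comparison is not specified here.

```python
def best_epoch(val_scores, higher_is_better=True):
    """Return the epoch whose validation score is best."""
    pick = max if higher_is_better else min
    return pick(val_scores, key=val_scores.get)

# Made-up validation scores, one per epoch; these are not results from this run.
val_scores = {0: 0.61, 1: 0.68, 2: 0.73, 3: 0.71, 4: 0.72}
print("best epoch:", best_epoch(val_scores))   # best epoch: 2
```

For a loss-like metric, where lower is better, pass `higher_is_better=False`.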

This model has learned a mapping from the 50-cell signal to the 2400-cell signal and peak calls. Given a new 50-cell ATAC-seq track, it can denoise the track and produce high-quality peak calls.

See [Tutorial 2](tutorial2.md) for step-by-step instructions on how to apply this trained model to another dataset.

To change any of the parameters for the deep learning model, edit the appropriate parameters in `configs/train_config.yaml` or `configs/model_structure.yaml` and rerun the command in Step 7 above. Type `python $atacworks/scripts/main.py train --help` for an explanation of the parameters.

Note: `train_config.yaml` is set up to use multiple GPUs. If you are using a single GPU, edit `train_config.yaml` to change the line `gpu: "None"` to read `gpu: 0`. 

## References
(1) Lal, A., Chiang, Z.D., Yakovenko, N., Duarte, F.M., Israeli, J. and Buenrostro, J.D., 2019. AtacWorks: A deep convolutional neural network toolkit for epigenomics. BioRxiv, p.829481. (https://www.biorxiv.org/content/10.1101/829481v1)

## Appendix 1: Training on multiple pairs of clean and noisy datasets

If using multiple pairs of clean and noisy datasets for training, use steps 5 and 6 on each pair to create a training h5 file and a validation h5 file for each pair. Save all of the training h5 files into a single folder and all of the validation h5 files into another folder.

Run step 7 as follows:

In [None]:
!python $atacworks/scripts/main.py train \
        --config configs/train_config.yaml \
        --config_mparams configs/model_structure.yaml \
        --files_train <path to folder containing all h5 files for training> \
        --val_files <path to folder containing all h5 files for validation>
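Before launching a long training run, it may help to confirm that each folder contains the h5 files you expect. A minimal sketch follows. `list_h5_files` is a hypothetical helper, and the folder names `train_h5` and `val_h5` are simply the ones used in Appendix 2.

```python
import glob
import os

def list_h5_files(folder):
    """Return the sorted .h5 files directly inside folder."""
    return sorted(glob.glob(os.path.join(folder, "*.h5")))

# Report how many files would be picked up from each folder.
for folder in ("train_h5", "val_h5"):
    files = list_h5_files(folder)
    print(folder, "->", len(files), "h5 file(s)")
```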

## Appendix 2: Reproducing the model reported in the AtacWorks preprint (Reference 1)

This experiment is reported in reference (1), in Section "AtacWorks enhances ATAC-seq results from small numbers of single cells" (also Supplementary Table 8), although the model used there is trained on more data.

To download the exact model used in the paper, see [Tutorial 2](tutorial2.md).

To train the same model reported in the paper, follow these steps:
- Download all the training data

NOTE: Jupyter notebooks use `/bin/sh` by default, which may point to the dash shell. The commands below need the bash shell. You can set the option `NotebookApp.terminado_settings` in the Jupyter config file or on the command line when launching Jupyter Lab.

In [None]:
!mkdir -p train_data/noisy_data
%env cell_types=CD19 Mono
%env subsamples=1 2 3 4 5
!for cell_type in ${cell_types[*]}; do \
     for subsample in ${subsamples[*]}; do \
         wget -P train_data/noisy_data https://atacworks-paper.s3.us-east-2.amazonaws.com/dsc_atac_blood_cell_denoising_experiments/50_cells/train_data/noisy_data/dsc.$subsample.$cell_type.50.cutsites.smoothed.200.bw; \
     done; \
done
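If the shell loop is awkward to run in your environment, the same list of noisy-data URLs can be built in Python. This sketch reuses only the URL pattern, cell types and subsample numbers from the loop above. `noisy_urls` is a hypothetical helper, and the code builds the list without downloading anything.

```python
from itertools import product

BASE = ("https://atacworks-paper.s3.us-east-2.amazonaws.com/"
        "dsc_atac_blood_cell_denoising_experiments/50_cells/train_data/noisy_data")

def noisy_urls(cell_types, subsamples):
    """Build one URL per (cell type, subsample) pair, cell type varying slowest."""
    return ["%s/dsc.%s.%s.50.cutsites.smoothed.200.bw" % (BASE, sub, cell)
            for cell, sub in product(cell_types, subsamples)]

urls = noisy_urls(["CD19", "Mono"], [1, 2, 3, 4, 5])
print(len(urls))   # 10
print(urls[5])     # the Mono subsample-1 file downloaded in Step 2
```

Each URL could then be fetched with, for example, `urllib.request.urlretrieve(url, filename)` from the Python standard library.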

In [None]:
!mkdir -p train_data/clean_data

In [None]:
%env cell_types=CD19 Mono
!for cell_type in ${cell_types[*]}; do \
     wget -P train_data/clean_data https://atacworks-paper.s3.us-east-2.amazonaws.com/dsc_atac_blood_cell_denoising_experiments/50_cells/train_data/clean_data/dsc.$cell_type.2400.cutsites.smoothed.200.bw; \
     wget -P train_data/clean_data https://atacworks-paper.s3.us-east-2.amazonaws.com/dsc_atac_blood_cell_denoising_experiments/50_cells/train_data/clean_data/dsc.$cell_type.2400.cutsites.smoothed.200.3.narrowPeak; \
done

- Encode all the training data and save in the `train_h5` directory.

In [None]:
%env cell_types=CD19 Mono
!for cell_type in ${cell_types[*]}; do \
    python $atacworks/scripts/peak2bw.py \
        --input train_data/clean_data/dsc.$cell_type.2400.cutsites.smoothed.200.3.narrowPeak \
        --sizes $atacworks/data/reference/hg19.chrom.sizes \
        --out_dir train_data/clean_data \
        --skip 1; \
done

In [None]:
!mkdir train_h5
%env cell_types=CD19 Mono
%env subsamples=1 2 3 4 5
!for cell_type in ${cell_types[*]}; do \
    for subsample in ${subsamples[*]}; do \
        python $atacworks/scripts/bw2h5.py \
           --noisybw train_data/noisy_data/dsc.$subsample.${cell_type}.50.cutsites.smoothed.200.bw \
           --cleanbw train_data/clean_data/dsc.${cell_type}.2400.cutsites.smoothed.200.bw \
           --cleanpeakbw train_data/clean_data/dsc.${cell_type}.2400.cutsites.smoothed.200.3.narrowPeak.bw \
           --intervals training_intervals.bed \
           --out_dir train_h5 \
           --prefix ${cell_type}.$subsample.50.2400.train \
           --pad 5000 \
           --nonzero; \
    done; \
done

- Encode all the validation data and save in the `val_h5` directory.

In [None]:
!mkdir val_h5
%env cell_types=CD19 Mono
!for cell_type in ${cell_types[*]}; do \
    python $atacworks/scripts/bw2h5.py \
           --noisybw train_data/noisy_data/dsc.1.${cell_type}.50.cutsites.smoothed.200.bw \
           --cleanbw train_data/clean_data/dsc.${cell_type}.2400.cutsites.smoothed.200.bw \
           --cleanpeakbw train_data/clean_data/dsc.${cell_type}.2400.cutsites.smoothed.200.3.narrowPeak.bw \
           --intervals val_intervals.bed \
           --out_dir val_h5 \
           --prefix ${cell_type}.50.2400.val \
           --pad 5000; \
done

- Train using all of the training and validation data. Here, we supply the directories `train_h5` and `val_h5`, and the model uses all the files within these directories for training and validation respectively.

In [None]:
!python $atacworks/scripts/main.py train \
        --config configs/train_config.yaml \
        --config_mparams configs/model_structure.yaml \
        --files_train train_h5 \
        --val_files val_h5