# Tutorial 1: Training an AtacWorks model 

## Introduction

In this tutorial we train an AtacWorks model to denoise the signal track and call peaks from aggregate single-cell ATAC-seq data derived from a small number of cells. We use the dsc-ATAC-seq dataset presented in reference (1) (Section "AtacWorks enhances ATAC-seq results from small numbers of single cells", also Supplementary Table 8). This dataset consists of single-cell ATAC-seq data from several types of human blood cells.


To access all the AtacWorks models described in reference (1), look at the [documentation](https://clara-parabricks.github.io/AtacWorks/tutorials/pretrained_models.html). You may be able to use one of those instead of training a new model. To learn how to download and use an existing model, refer to [Tutorial 2](tutorial2.ipynb).
 
In this tutorial, we selected 2400 Monocytes from this dataset; their aggregate signal is our 'clean', high-coverage data. We then randomly sampled 50 of these 2400 Monocytes. Here's what the ATAC-seq signal from 50 cells and 2400 cells looks like, for a region on chromosome 10:

![Monocytes subsampled signal](Mono.2400.50.png)

Compared to the 'clean' signal from 2400 cells, the aggregated ATAC-Seq signal track from these 50 cells is noisy. Because of noise in the signal, peak calls calculated by MACS2 on this data are also inaccurate.

We train an AtacWorks model to learn a mapping from the 50-cell ATAC-seq signals to the 2400-cell ATAC-seq signal and peak calls. In other words, given a noisy ATAC-seq signal from 50 cells, this model learns what the signal would look like - and where the peaks would be called - if we had sequenced 2400 cells.
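The subsampling described above (drawing 50 cells out of 2400) can be sketched in Python. This is only an illustration, not the procedure used to build the dataset: the barcode names, the function name `subsample_cells`, and the fixed seed are all hypothetical.

```python
import random

def subsample_cells(barcodes, n_cells, seed=0):
    """Return n_cells barcodes drawn at random without replacement."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the draw is reproducible
    return rng.sample(barcodes, n_cells)

# Hypothetical barcodes standing in for the 2400 'clean' Monocytes.
clean_cells = [f"Mono_cell_{i:04d}" for i in range(2400)]
noisy_cells = subsample_cells(clean_cells, 50)
print(len(noisy_cells), "of", len(clean_cells), "cells kept")
```

Reads from the sampled cells would then be aggregated into the noisy signal track, while reads from all 2400 cells give the clean track.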

**NOTE: You may notice an exclamation mark (!) before most of the commands in this tutorial. Most of them are bash commands, and a notebook only executes a bash command when it is preceded by an exclamation mark. These commands can also be copied into a terminal (without the !) and executed directly. We created this notebook so that users can run the tutorials without copying and pasting commands.**

## Step 1: Create folder and set AtacWorks path
Replace 'path_to_atacworks' with the path to your cloned and set-up 'AtacWorks' GitHub repository. See the [Readme](https://clara-parabricks.github.io/AtacWorks/readme.html) for installation instructions.

In [1]:
%env atacworks=path_to_atacworks

env: atacworks=/ntadimeti/AtacWorks


Create a folder for this experiment. The `os.chdir()` call below then moves into the new directory.

In [2]:
!mkdir tutorial1
import os
os.chdir('tutorial1')
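If you rerun the notebook, `mkdir` will complain that the folder already exists. A pure-Python variant that tolerates reruns might look like the sketch below; the helper name `enter_experiment_dir` is hypothetical and not part of AtacWorks.

```python
import os

def enter_experiment_dir(name, base="."):
    """Create base/name if needed, change into it, and return its absolute path."""
    path = os.path.abspath(os.path.join(base, name))
    os.makedirs(path, exist_ok=True)  # no error if the folder already exists
    os.chdir(path)
    return path
```

Calling `enter_experiment_dir('tutorial1')` from the notebook's starting directory would stand in for the cell above.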

## Step 2: Download data

We will download all of the data needed for this experiment from AWS into the `tutorial1` folder.


### Noisy ATAC-seq signal from 50 Monocytes

In [3]:
!wget https://api.ngc.nvidia.com/v2/models/nvidia/atac_dsc_atac_lowcellcount_1m_48m_50_2400/versions/0.3/files/train_data/noisy_data/dsc.1.Mono.50.cutsites.smoothed.200.bw

--2020-07-26 14:58:05--  https://api.ngc.nvidia.com/v2/models/nvidia/atac_dsc_atac_lowcellcount_1m_48m_50_2400/versions/0.3/files/train_data/noisy_data/dsc.1.Mono.50.cutsites.smoothed.200.bw
Resolving api.ngc.nvidia.com (api.ngc.nvidia.com)... 52.52.190.18, 52.9.28.168
Connecting to api.ngc.nvidia.com (api.ngc.nvidia.com)|52.52.190.18|:443... connected.
HTTP request sent, awaiting response... 302 
Location: https://s3.us-west-2.amazonaws.com/prod-model-registry-ngc-bucket/org/nvidia/models/atac_dsc_atac_lowcellcount_1m_48m_50_2400/versions/0.3/files/train_data/noisy_data/dsc.1.Mono.50.cutsites.smoothed.200.bw?response-content-disposition=attachment%3B%20filename%3D%22dsc.1.Mono.50.cutsites.smoothed.200.bw%22&response-content-type=application%2Foctet-stream&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEAYaCXVzLXdlc3QtMSJGMEQCID%2BJg8m4EUU1AF8NoVgOv4Bpmw1eouHAHTlodvgyzjxpAiA5IDqV10vpHPRVKG2%2B9s8A48jyjMG%2BPVB6WG3qXIPnIiq9Awi%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F8BEAIaDDc4OTM2MzEzNTAyNyIMLq41YDR%2F%2

### Clean ATAC-seq signal from 2400 Monocytes

In [4]:
!wget https://api.ngc.nvidia.com/v2/models/nvidia/atac_dsc_atac_lowcellcount_1m_48m_50_2400/versions/0.3/files/train_data/clean_data/dsc.Mono.2400.cutsites.smoothed.200.bw

--2020-07-26 14:58:24--  https://api.ngc.nvidia.com/v2/models/nvidia/atac_dsc_atac_lowcellcount_1m_48m_50_2400/versions/0.3/files/train_data/clean_data/dsc.Mono.2400.cutsites.smoothed.200.bw
Resolving api.ngc.nvidia.com (api.ngc.nvidia.com)... 52.9.28.168, 52.52.190.18
Connecting to api.ngc.nvidia.com (api.ngc.nvidia.com)|52.9.28.168|:443... connected.
HTTP request sent, awaiting response... 302 
Location: https://s3.us-west-2.amazonaws.com/prod-model-registry-ngc-bucket/org/nvidia/models/atac_dsc_atac_lowcellcount_1m_48m_50_2400/versions/0.3/files/train_data/clean_data/dsc.Mono.2400.cutsites.smoothed.200.bw?response-content-disposition=attachment%3B%20filename%3D%22dsc.Mono.2400.cutsites.smoothed.200.bw%22&response-content-type=application%2Foctet-stream&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEAYaCXVzLXdlc3QtMSJGMEQCIFaYubtO3C8rm1BFqI7T7ggVjYr49B8xp%2Fi2ovzk99chAiAxTscQyzODIQspG%2BDUMidseqU2hiTG1zCMk3N%2BCLODnCq9Awi%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F8BEAIaDDc4OTM2MzEzNTAyNyIMFEKdTp92LoIxj

### Clean ATAC-seq peaks from 2400 Monocytes

In [5]:
!wget https://api.ngc.nvidia.com/v2/models/nvidia/atac_dsc_atac_lowcellcount_1m_48m_50_2400/versions/0.3/files/train_data/clean_data/dsc.Mono.2400.cutsites.smoothed.200.3.narrowPeak

--2020-07-26 14:58:48--  https://api.ngc.nvidia.com/v2/models/nvidia/atac_dsc_atac_lowcellcount_1m_48m_50_2400/versions/0.3/files/train_data/clean_data/dsc.Mono.2400.cutsites.smoothed.200.3.narrowPeak
Resolving api.ngc.nvidia.com (api.ngc.nvidia.com)... 52.9.28.168, 52.52.190.18
Connecting to api.ngc.nvidia.com (api.ngc.nvidia.com)|52.9.28.168|:443... connected.
HTTP request sent, awaiting response... 302 
Location: https://s3.us-west-2.amazonaws.com/prod-model-registry-ngc-bucket/org/nvidia/models/atac_dsc_atac_lowcellcount_1m_48m_50_2400/versions/0.3/files/train_data/clean_data/dsc.Mono.2400.cutsites.smoothed.200.3.narrowPeak?response-content-disposition=attachment%3B%20filename%3D%22dsc.Mono.2400.cutsites.smoothed.200.3.narrowPeak%22&response-content-type=application%2Foctet-stream&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEAYaCXVzLXdlc3QtMSJGMEQCID%2BJg8m4EUU1AF8NoVgOv4Bpmw1eouHAHTlodvgyzjxpAiA5IDqV10vpHPRVKG2%2B9s8A48jyjMG%2BPVB6WG3qXIPnIiq9Awi%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F8BEAIaDDc
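Before training, it can be worth confirming that the three downloads completed. The sketch below is an optional sanity check, not part of AtacWorks: it reports whether each file exists and is non-empty, and for `.bw` files whether the first four bytes match the bigWig magic number given in the UCSC format description. The helper names are hypothetical.

```python
import os
import struct

BIGWIG_MAGIC = 0x888FFC26  # magic number at the start of a bigWig file

def looks_like_bigwig(path):
    """Return True if the file starts with the bigWig magic number (either byte order)."""
    with open(path, "rb") as handle:
        header = handle.read(4)
    if len(header) < 4:
        return False
    little = struct.unpack("<I", header)[0]
    big = struct.unpack(">I", header)[0]
    return BIGWIG_MAGIC in (little, big)

def check_download(path):
    """Classify a downloaded file as 'missing', 'empty', 'not a bigWig', or 'ok'."""
    if not os.path.isfile(path):
        return "missing"
    if os.path.getsize(path) == 0:
        return "empty"
    if path.endswith(".bw") and not looks_like_bigwig(path):
        return "not a bigWig"
    return "ok"

for name in [
    "dsc.1.Mono.50.cutsites.smoothed.200.bw",
    "dsc.Mono.2400.cutsites.smoothed.200.bw",
    "dsc.Mono.2400.cutsites.smoothed.200.3.narrowPeak",
]:
    print(name, "->", check_download(name))
```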

## Step 3: Train and validate a model using the parameters in the given config files

The clean peak calls (`dsc.Mono.2400.cutsites.smoothed.200.3.narrowPeak`) were produced by MACS2 and are in .narrowPeak format. 
Chromosome sizes files for the hg19 and hg38 human reference genomes are supplied with AtacWorks in the folder `AtacWorks/data/reference`. Here, we are using hg19.
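For readers unfamiliar with .narrowPeak files, the sketch below parses one line. The column order follows the ENCODE narrowPeak description (BED6+4); the field names in the tuple and the example line are illustrative and are not taken from the tutorial's data.

```python
from collections import namedtuple

# Field names are our own labels for the ten narrowPeak columns.
NarrowPeak = namedtuple(
    "NarrowPeak",
    "chrom start end name score strand signal_value p_value q_value summit_offset",
)

def parse_narrowpeak_line(line):
    """Parse one tab-separated narrowPeak line into a NarrowPeak tuple."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 10:
        raise ValueError(f"expected 10 columns, found {len(fields)}")
    return NarrowPeak(
        chrom=fields[0],
        start=int(fields[1]),          # 0-based start
        end=int(fields[2]),            # exclusive end
        name=fields[3],
        score=int(fields[4]),
        strand=fields[5],
        signal_value=float(fields[6]),
        p_value=float(fields[7]),      # -log10 p-value, or -1 if unset
        q_value=float(fields[8]),      # -log10 q-value, or -1 if unset
        summit_offset=int(fields[9]),  # summit position relative to start
    )

# Made-up example line, for illustration only.
example = "chr10\t1000\t1200\tpeak_1\t55\t.\t4.2\t7.5\t5.5\t90\n"
peak = parse_narrowpeak_line(example)
print(peak.chrom, peak.start, peak.end, peak.start + peak.summit_offset)
```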

We need to define which regions of the genome will be used to train and test the model. We want to train models on some portion of the genome ('training set') and evaluate their performance on a separate portion ('validation set'). We will choose the model that performs best on the validation set as the best model. Later, we will evaluate the performance of this best model on a third portion of the genome ('holdout set').

We provide a chromosome sizes file 'hg19.auto.sizes' that contains sizes for all the autosomes of the hg19 reference genome. We split off chromosome 2 to use as the validation set, and chromosome 10 to use as the holdout set, and use the remaining autosomes as the training set. Since a whole chromosome is too long to feed into the model at once, we split each of these chromosomes into 50,000-bp long intervals.
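The split and tiling described above can be sketched as follows. This is a simplified illustration rather than the AtacWorks implementation: the chromosome sizes are toy values, the function names are hypothetical, and dropping the trailing partial window is an assumption made to keep every interval the same length.

```python
def split_chromosomes(sizes, val_chrom, holdout_chrom):
    """Split a {chrom: length} dict into training, validation and holdout dicts."""
    train = {c: n for c, n in sizes.items() if c not in (val_chrom, holdout_chrom)}
    val = {val_chrom: sizes[val_chrom]}
    holdout = {holdout_chrom: sizes[holdout_chrom]}
    return train, val, holdout

def tile(sizes, interval_size=50000):
    """Yield (chrom, start, end) windows of interval_size bp.

    Assumption: a trailing window shorter than interval_size is dropped.
    """
    for chrom, length in sizes.items():
        for start in range(0, length - interval_size + 1, interval_size):
            yield chrom, start, start + interval_size

# Toy sizes, far smaller than the real hg19 chromosomes.
toy_sizes = {"chr1": 120000, "chr2": 100000, "chr10": 60000}
train, val, holdout = split_chromosomes(toy_sizes, "chr2", "chr10")

print(list(tile(train)))
print(list(tile(val)))
print(list(tile(holdout)))
```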

This command trains a deep learning model using the supplied clean and noisy ATAC-seq data, for 10 epochs (10 full passes through the dataset). At the end of every epoch, the current state of the model is saved in the directory `atacworks_train_latest`, and the performance of the current model is measured on the validation set. At the end, out of the 10 saved models, the one with the best performance on the validation set is saved as `atacworks_train_latest/model_best.pth.tar`.

In [6]:
!atacworks train \
    --noisybw dsc.1.Mono.50.cutsites.smoothed.200.bw \
    --cleanbw dsc.Mono.2400.cutsites.smoothed.200.bw \
    --cleanpeakfile dsc.Mono.2400.cutsites.smoothed.200.3.narrowPeak \
    --genome $atacworks/data/reference/hg19.auto.sizes \
    --val_chrom chr2 \
    --holdout_chrom chr10 \
    --out_home "./" \
    --exp_name "atacworks_train" \
    --distributed

INFO:2020-07-26 14:59:18,486:AtacWorks-peak2bw] Reading input file
INFO:2020-07-26 14:59:18,854:AtacWorks-peak2bw] Read 105959 peaks.
INFO:2020-07-26 14:59:18,857:AtacWorks-peak2bw] Adding score
INFO:2020-07-26 14:59:18,858:AtacWorks-peak2bw] Writing peaks to bedGraph file
Discarding 2618 entries outside sizes file.
INFO:2020-07-26 14:59:19,376:AtacWorks-peak2bw] Writing peaks to bigWig file ./atacworks_train_2020.07.26_14.59/bigwig_peakfiles/dsc.Mono.2400.cutsites.smoothed.200.3.narrowPeak.bw
INFO:2020-07-26 14:59:20,277:AtacWorks-peak2bw] Done!
INFO:2020-07-26 14:59:20,289:AtacWorks-intervals] Generating training intervals
INFO:2020-07-26 14:59:20,604:AtacWorks-intervals] Generating val intervals
INFO:2020-07-26 14:59:20,663:AtacWorks-bw2h5] Reading intervals
INFO:2020-07-26 14:59:20,679:AtacWorks-bw2h5] Read 50038 intervals
INFO:2020-07-26 14:59:20,683:AtacWorks-bw2h5] Selecting intervals with nonzero coverage
INFO:2020-07-26 15:02:35,724:AtacWorks-bw2h5] Retaining 29863 of 50038 no
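The "keep the best of the 10 checkpoints" logic described before the training command can be pictured with the sketch below. The validation scores are invented, and "higher is better" is an assumption; the metric AtacWorks actually uses to rank checkpoints is not shown here.

```python
def best_epoch(val_scores):
    """Return the 0-based index of the highest validation score (first one on ties)."""
    if not val_scores:
        raise ValueError("no validation scores recorded")
    return max(range(len(val_scores)), key=lambda i: val_scores[i])

# Invented per-epoch validation scores for 10 epochs (higher assumed better).
scores = [0.61, 0.68, 0.72, 0.75, 0.74, 0.78, 0.77, 0.78, 0.76, 0.75]
print("best checkpoint comes from epoch index", best_epoch(scores))
```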

This model has learned a mapping from the 50-cell signal to the 2400-cell signal and peak calls. Given a new 50-cell ATAC-seq track, it can denoise the track and produce high-quality peak calls.

See [Tutorial 2](tutorial2.ipynb) for step-by-step instructions on how to apply this trained model to another dataset.

## References
(1) Lal, A., Chiang, Z.D., Yakovenko, N., Duarte, F.M., Israeli, J. and Buenrostro, J.D., 2019. AtacWorks: A deep convolutional neural network toolkit for epigenomics. BioRxiv, p.829481. (https://www.biorxiv.org/content/10.1101/829481v1)


## Appendix 1: Using Custom Config Files

To change any of the training parameters, copy the relevant default config files from `$atacworks/configs` to the current location.

In [None]:
!mkdir custom_configs
!cp $atacworks/configs/train_config.yaml custom_configs
!cp $atacworks/configs/model_structure.yaml custom_configs

To change the experiment parameters, edit `custom_configs/train_config.yaml` and pass it to atacworks using the `--config` option. To change the model parameters, edit `custom_configs/model_structure.yaml` and pass it to atacworks using the `--config_mparams` option. Type `atacworks train -h` for detailed help on the parameters.

## Appendix 2: Training on multiple pairs of clean and noisy datasets

If you are using multiple pairs of clean and noisy datasets for training, provide either a path to a folder containing all the bigwig files or a comma-separated list of bigwig files like `[file1,file2,file3]`. The following pseudo code demonstrates this feature:

```
!atacworks train \
        --cleanbw <path to folder containing all clean bigwig files for training> \
        --noisybw <path to folder containing all noisy bigwig files for training> \
        --cleanpeakfile <path to folder containing all clean peak files> \
        --out_home <out-dir> \
        --distributed \
        --config <path-to-custom-config-if-any> \
        --config_mparams <path-to-custom-model-structure-if-any>
```
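If you prefer the list form over a folder path, the bracketed string can be assembled from a directory listing. The sketch below only builds the `[file1,file2,file3]` string described above; the helper name is hypothetical, sorting is used purely to make the result deterministic, and how AtacWorks pairs clean and noisy files is not addressed here.

```python
import glob
import os

def bigwig_list_argument(folder):
    """Return the .bw files in folder as a '[file1,file2,...]' string."""
    files = sorted(glob.glob(os.path.join(folder, "*.bw")))
    return "[" + ",".join(files) + "]"
```

For example, `bigwig_list_argument('train_data/noisy_data')` would produce a string that could be pasted after `--noisybw`.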

## Appendix 3: Reproducing the model reported in the AtacWorks preprint (Reference 1)

In Section "AtacWorks enhances ATAC-seq results from small numbers of single cells" (also Supplementary Table 8), we report this experiment, although the model we use there is trained on more data.

To download the exact model used in the paper, see [Tutorial 2](tutorial2.md).

In order to train the same model reported in the paper:
- Download all the training data

NOTE: Jupyter notebooks use `/bin/sh` by default, which points to the dash shell. The commands below need the bash shell. You can set the `NotebookApp.terminado_settings` option in the Jupyter config file or on the command line when launching Jupyter Lab.

In [None]:
!mkdir -p train_data/noisy_data
%env cell_types=CD19 Mono
%env subsamples=1 2 3 4 5
!for cell_type in ${cell_types[*]}; do \
     for subsample in ${subsamples[*]}; do \
         wget -P train_data/noisy_data https://api.ngc.nvidia.com/v2/models/nvidia/atac_dsc_atac_lowcellcount_1m_48m_50_2400/versions/0.3/files/train_data/noisy_data/dsc.$subsample.$cell_type.50.cutsites.smoothed.200.bw; \
     done; \
done

In [None]:
!mkdir -p train_data/clean_data

In [None]:
%env cell_types=CD19 Mono
!for cell_type in ${cell_types[*]}; do \
     wget -P train_data/clean_data https://api.ngc.nvidia.com/v2/models/nvidia/atac_dsc_atac_lowcellcount_1m_48m_50_2400/versions/0.3/files/train_data/clean_data/dsc.$cell_type.2400.cutsites.smoothed.200.bw; \
     wget -P train_data/clean_data https://api.ngc.nvidia.com/v2/models/nvidia/atac_dsc_atac_lowcellcount_1m_48m_50_2400/versions/0.3/files/train_data/clean_data/dsc.$cell_type.2400.cutsites.smoothed.200.3.narrowPeak; \
done
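If switching the notebook shell to bash is inconvenient, the same file list can be generated in Python. The sketch below mirrors the two loops above: it builds the URLs from the cell types and subsample numbers and offers a small download helper based on `urllib.request.urlretrieve`. The function names are hypothetical, and nothing is downloaded until `download` is called.

```python
import os
import urllib.request

BASE_URL = (
    "https://api.ngc.nvidia.com/v2/models/nvidia/"
    "atac_dsc_atac_lowcellcount_1m_48m_50_2400/versions/0.3/files/train_data"
)
CELL_TYPES = ["CD19", "Mono"]
SUBSAMPLES = [1, 2, 3, 4, 5]

def noisy_urls():
    """URLs of the 50-cell bigWig files, one per cell type and subsample."""
    return [
        f"{BASE_URL}/noisy_data/dsc.{s}.{c}.50.cutsites.smoothed.200.bw"
        for c in CELL_TYPES
        for s in SUBSAMPLES
    ]

def clean_urls():
    """URLs of the 2400-cell bigWig and narrowPeak files, per cell type."""
    urls = []
    for c in CELL_TYPES:
        stem = f"{BASE_URL}/clean_data/dsc.{c}.2400.cutsites.smoothed.200"
        urls.append(stem + ".bw")
        urls.append(stem + ".3.narrowPeak")
    return urls

def download(url, folder):
    """Download url into folder (created if needed) and return the local path."""
    os.makedirs(folder, exist_ok=True)
    target = os.path.join(folder, url.rsplit("/", 1)[-1])
    urllib.request.urlretrieve(url, target)
    return target

print(len(noisy_urls()), "noisy files,", len(clean_urls()), "clean files")
```

A loop such as `for url in noisy_urls(): download(url, 'train_data/noisy_data')` would then fetch the noisy files, and likewise for `clean_urls()` with `train_data/clean_data`.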

In [None]:
!mkdir configs
!wget -P configs https://api.ngc.nvidia.com/v2/models/nvidia/atac_dsc_atac_lowcellcount_1m_48m_50_2400/versions/0.3/files/configs/train_config.yaml
!wget -P configs https://api.ngc.nvidia.com/v2/models/nvidia/atac_dsc_atac_lowcellcount_1m_48m_50_2400/versions/0.3/files/configs/model_structure.yaml

The command in Step 3 of this tutorial and the pseudo code in Appendix 2 can be used as a guide to train the model.