## Tutorial 2: Using a trained AtacWorks model to denoise ATAC-seq data and call peaks

## Introduction

In this tutorial we use a pre-trained AtacWorks model to denoise and call peaks from low-coverage aggregate single-cell ATAC-seq data. We use the dsc-ATAC-seq dataset presented in reference (1), section "AtacWorks generalizes to diverse applications". This dataset consists of single-cell ATAC-seq data from several types of human blood cells.

We selected 2400 NK cells from this dataset - this is our ‘clean’, high-coverage dataset. We then randomly sampled 50 of these 2400 NK cells. Here's what the ATAC-seq signal from 50 cells and 2400 cells looks like, for a region on chromosome 10:

![subsampled_NK_cells](NK.2400.50.png)

Compared to the 'clean' signal from 2400 cells, the aggregated ATAC-seq profile of these 50 cells is noisy. Because the signal is noisy, peak calls calculated by MACS2 on this data (shown as red bars below the signal tracks) are also inaccurate. The AUPRC of peak calling by MACS2 on the noisy data is only 0.20.

As reported in our paper, we trained an AtacWorks model to learn a mapping from 50-cell signal to 2400-cell signals and peak calls. In other words, given a noisy ATAC-seq signal from 50 cells, this model learned what the signal would look like - and where the peaks would be called - if we had sequenced 2400 cells. This model was trained on data from Monocytes and B cells, so it has not encountered data from NK cells.

Note that to use a pre-trained model on custom data, the data must be pre-processed in exactly the same way as the data the AtacWorks model was trained on. For the details and caveats, read this [documentation](https://clara-parabricks.github.io/AtacWorks/tutorials/pretrained_models.html). If you want to train your own AtacWorks model instead of using the model reported in the paper, refer to [Tutorial 1](tutorial1.ipynb).


**NOTE: You may notice an exclamation mark (!) before most of the commands in this tutorial. Most of them are bash commands, and to execute a bash command from a notebook it must be preceded by an exclamation mark. These commands can also be copied directly into a terminal (without the !) and executed. We created this notebook so that users can run the tutorials without having to copy and paste commands.**

## Step 1: Create folder and set AtacWorks path

Replace 'path_to_atacworks' with the path to your cloned and set up 'AtacWorks' github repository. See [Readme](https://clara-parabricks.github.io/AtacWorks/readme.html) for installation instructions.

In [1]:
%env atacworks=path_to_atacworks

env: atacworks=/ntadimeti/AtacWorks


Create a folder for this experiment. The `os.chdir()` call below moves us into the new directory.

In [2]:
!mkdir tutorial2
import os
os.chdir("tutorial2")

## Step 2: Download model

Download a pre-trained deep learning model (model.pth.tar) trained with dsc-ATAC-seq data from Monocytes and B cells. This model was reported and used in the AtacWorks paper (1).

In [3]:
!mkdir models
!wget -P models https://api.ngc.nvidia.com/v2/models/nvidia/atac_dsc_atac_lowcellcount_1m_48m_50_2400/versions/0.3/files/models/model.pth.tar

--2020-07-27 13:54:33--  https://api.ngc.nvidia.com/v2/models/nvidia/atac_dsc_atac_lowcellcount_1m_48m_50_2400/versions/0.3/files/models/model.pth.tar
Resolving api.ngc.nvidia.com (api.ngc.nvidia.com)... 52.9.28.168, 54.241.158.210
Connecting to api.ngc.nvidia.com (api.ngc.nvidia.com)|52.9.28.168|:443... connected.
HTTP request sent, awaiting response... 302 
Location: https://s3.us-west-2.amazonaws.com/prod-model-registry-ngc-bucket/org/nvidia/models/atac_dsc_atac_lowcellcount_1m_48m_50_2400/versions/0.3/files/models/model.pth.tar?response-content-disposition=attachment%3B%20filename%3D%22model.pth.tar%22&response-content-type=application%2Fx-tar&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEB0aCXVzLXdlc3QtMSJHMEUCIBTK4SVGTr9T3zmRoJx5hmN2NPLHZX3kUIzU2vNwESQUAiEAo8s0QXcrMpGcY2BYS2D3jrzQ1uODf%2FJ%2B31OBeYpSOUQqvQMI1v%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FARACGgw3ODkzNjMxMzUwMjciDHNExJPz3a7qRnC50SqRA%2FcmlqMjhnhNjazfXPpoEeAWd7FXLnFdZy66NIl8BaIkkY%2B2MkdsUAartOCi7qx8%2BAjRtJMKPMpjYnc55g%2BrQ4TBvo%2FZWUMzh

## Step 4: Download the test dsc-ATAC-seq signal from 50 NK cells (~1M reads), in bigWig format

In [4]:
!wget https://api.ngc.nvidia.com/v2/models/nvidia/atac_dsc_atac_lowcellcount_1m_48m_50_2400/versions/0.3/files/test_data/noisy_data/dsc.1.NK.50.cutsites.smoothed.200.bw

--2020-07-27 13:56:05--  https://api.ngc.nvidia.com/v2/models/nvidia/atac_dsc_atac_lowcellcount_1m_48m_50_2400/versions/0.3/files/test_data/noisy_data/dsc.1.NK.50.cutsites.smoothed.200.bw
Resolving api.ngc.nvidia.com (api.ngc.nvidia.com)... 52.9.28.168, 54.241.158.210
Connecting to api.ngc.nvidia.com (api.ngc.nvidia.com)|52.9.28.168|:443... connected.
HTTP request sent, awaiting response... 302 
Location: https://s3.us-west-2.amazonaws.com/prod-model-registry-ngc-bucket/org/nvidia/models/atac_dsc_atac_lowcellcount_1m_48m_50_2400/versions/0.3/files/test_data/noisy_data/dsc.1.NK.50.cutsites.smoothed.200.bw?response-content-disposition=attachment%3B%20filename%3D%22dsc.1.NK.50.cutsites.smoothed.200.bw%22&response-content-type=application%2Foctet-stream&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEB0aCXVzLXdlc3QtMSJHMEUCIF5d7YmrvOcUydzBJ%2F6kAu0SMrQOGuu2ES%2BzGyoQ06y8AiEAvRihOhSaeWezPM6ZlhDNMKYgfqOPsbPNB0rkSSTXCukqvQMI1v%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FARACGgw3ODkzNjMxMzUwMjciDMqBxclYMDF6P%2FyUUSqRA

## Step 5: Run denoising on the data
The model we downloaded takes the input ATAC-seq signal in non-overlapping genomic intervals spanning 50,000 bp. To define the genomic regions for the model to read, we take the chromosomes on which we want to apply the model and split their lengths into 50,000-bp intervals, which we save in BED format. This BED file can be found in the `atacworks_denoise_latest/intervals` folder.
In this example, we will apply the model to chromosomes 1-22. The reference genome we use is hg19. We use the prepared chromosome sizes file `hg19.auto.sizes`, which contains the sizes of chromosomes 1-22 in hg19.
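
The tiling described above can be sketched in a few lines of Python. This is a minimal illustration, not the AtacWorks implementation: the function names and the toy chromosome sizes are hypothetical, and the sketch assumes that a trailing remainder shorter than the interval size is dropped. How AtacWorks itself treats that remainder is not stated in this tutorial.

```python
def tile_chromosome(chrom, size, interval_size=50000):
    """Return non-overlapping (chrom, start, end) intervals of fixed width.

    Assumption: a trailing remainder shorter than interval_size is dropped.
    """
    n_full = size // interval_size
    return [
        (chrom, i * interval_size, (i + 1) * interval_size)
        for i in range(n_full)
    ]


def tile_genome(sizes, interval_size=50000):
    """Tile every chromosome in a {name: length} mapping."""
    intervals = []
    for chrom, size in sizes.items():
        intervals.extend(tile_chromosome(chrom, size, interval_size))
    return intervals


def to_bed(intervals):
    """Format intervals as tab-separated BED lines (0-based, half-open)."""
    return "\n".join(f"{c}\t{s}\t{e}" for c, s, e in intervals)


# Toy chromosome sizes for illustration only, not real hg19 values.
toy_sizes = {"chrA": 120000, "chrB": 50000}
toy_intervals = tile_genome(toy_sizes)
print(to_bed(toy_intervals))
```

In the tutorial itself, `atacworks denoise` generates these intervals from the `--genome` sizes file, so this step does not need to be run by hand.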

In [5]:
!atacworks denoise \
    --noisybw dsc.1.NK.50.cutsites.smoothed.200.bw \
    --genome $atacworks/data/reference/hg19.auto.sizes \
    --weights_path ./models/model.pth.tar \
    --out_home "./" \
    --exp_name "atacworks_denoise" \
    --distributed \
    --num_workers 0

INFO:2020-07-27 13:56:22,785:AtacWorks-intervals] Generating intervals tiling across all chromosomes             in sizes file: /ntadimeti/AtacWorks/data/reference/hg19.auto.sizes
INFO:2020-07-27 13:56:23,159:AtacWorks-intervals] Done!
INFO:2020-07-27 13:56:23,161:AtacWorks-bw2h5] Reading intervals
INFO:2020-07-27 13:56:23,188:AtacWorks-bw2h5] Read 57611 intervals
INFO:2020-07-27 13:56:23,197:AtacWorks-bw2h5] Writing data in 58 batches.
INFO:2020-07-27 13:56:23,197:AtacWorks-bw2h5] Extracting data for each batch and writing to h5 file
INFO:2020-07-27 13:56:23,197:AtacWorks-bw2h5] batch 0 of 58
INFO:2020-07-27 13:57:55,029:AtacWorks-bw2h5] batch 10 of 58
INFO:2020-07-27 13:59:20,546:AtacWorks-bw2h5] batch 20 of 58
INFO:2020-07-27 14:00:44,669:AtacWorks-bw2h5] batch 30 of 58
INFO:2020-07-27 14:02:21,629:AtacWorks-bw2h5] batch 40 of 58
INFO:2020-07-27 14:03:41,721:AtacWorks-bw2h5] batch 50 of 58
INFO:2020-07-27 14:04:45,983:AtacWorks-bw2h5] Done! Saved to ./atacworks_denoise_2020.07.27_13

The inference results will be saved in the folder `atacworks_denoise_latest`. This folder will contain four files: 
1. `dsc_infer.track.bedGraph` 
2. `dsc_infer.track.bw` 
3. `dsc_infer.peaks.bedGraph`
4. `dsc_infer.peaks.bw`

`dsc_infer.track.bedGraph` and `dsc_infer.track.bw` contain the denoised ATAC-seq track. `dsc_infer.peaks.bedGraph` and `dsc_infer.peaks.bw` contain the positions in the genome that are designated as peaks (the model predicts that the probability of these positions being part of a peak is at least 0.5).
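
To make the thresholding concrete, here is a minimal sketch of how per-base probabilities could be turned into peak regions. It is an illustration under assumptions, not the AtacWorks code: `call_peak_regions` and the toy probabilities are hypothetical, and the comparison uses "at least 0.5" as worded above.

```python
def call_peak_regions(probs, threshold=0.5, offset=0):
    """Return half-open (start, end) runs where probs[i] >= threshold.

    offset is the genomic coordinate of probs[0].
    """
    regions = []
    run_start = None
    for i, p in enumerate(probs):
        if p >= threshold:
            if run_start is None:
                run_start = i  # a new run of peak positions begins here
        elif run_start is not None:
            regions.append((offset + run_start, offset + i))
            run_start = None
    if run_start is not None:
        # the last run extends to the end of the array
        regions.append((offset + run_start, offset + len(probs)))
    return regions


# Toy probabilities for eight consecutive bases starting at position 1000.
toy_probs = [0.1, 0.6, 0.9, 0.5, 0.2, 0.0, 0.7, 0.8]
toy_regions = call_peak_regions(toy_probs, threshold=0.5, offset=1000)
print(toy_regions)
```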

If you are using your own model instead of the one provided, change the --weights_path to point to your model in Step 5.

## Step 6: Format peak calls

Delete peaks that are shorter than 20 bp in length, and format peak calls in BED format with coverage statistics and summit calls:

In [6]:
!python $atacworks/scripts/peaksummary.py \
    --peakbw atacworks_denoise_latest/dsc_infer.peaks.bw \
    --trackbw atacworks_denoise_latest/dsc_infer.track.bw \
    --prefix dsc_infer.peak_calls \
    --out_dir atacworks_denoise_latest \
    --minlen 20

INFO:2020-07-27 15:22:31,657:AtacWorks-peaksummary] Writing peaks to bedGraph file atacworks_denoise_latest/dsc_infer.peak_calls.bedGraph
INFO:2020-07-27 15:22:32,101:AtacWorks-peaksummary] Reading peaks
INFO:2020-07-27 15:22:32,171:AtacWorks-peaksummary] Calculating peak statistics
INFO:2020-07-27 15:24:09,336:AtacWorks-peaksummary] reduced number of peaks from 225182 to 26575.
INFO:2020-07-27 15:24:09,336:AtacWorks-peaksummary] Writing peaks to BED file atacworks_denoise_latest/dsc_infer.peak_calls.bed
INFO:2020-07-27 15:24:09,629:AtacWorks-peaksummary] Deleting bedGraph file
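
The statistics written by this step can be illustrated with a small sketch. It is not the `peaksummary.py` implementation: the function names are hypothetical, the coverage track is a plain Python list standing in for a bigWig lookup, and ties for the maximum are resolved by taking the first position, which is an assumption.

```python
def summarize_peak(chrom, start, end, track):
    """Compute length, mean, max and summit for one half-open peak.

    track[i] is the coverage at genomic position i (toy stand-in for bigWig).
    """
    values = track[start:end]
    peak_max = max(values)
    relative_summit = values.index(peak_max)  # first position of the maximum
    return {
        "chrom": chrom,
        "start": start,
        "end": end,
        "len": end - start,
        "mean": sum(values) / len(values),
        "max": peak_max,
        "relativesummit": relative_summit,
        "summit": start + relative_summit,
    }


def summarize_peaks(peaks, track, minlen=20):
    """Drop peaks shorter than minlen and summarize the rest."""
    return [
        summarize_peak(chrom, start, end, track)
        for chrom, start, end in peaks
        if end - start >= minlen
    ]


# Toy track: a triangular bump between positions 10 and 40, peaking at 25.
toy_track = [0.0] * 100
for pos in range(10, 40):
    toy_track[pos] = float(30 - abs(pos - 25))

toy_peaks = [("chr1", 10, 40), ("chr1", 60, 70)]  # second peak is only 10 bp
toy_summary = summarize_peaks(toy_peaks, toy_track, minlen=20)
print(toy_summary)
```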


In [7]:
!head atacworks_denoise_latest/dsc_infer.peak_calls.bed

#chrom	start	end	len	mean	max	relativesummit	summit
chr1	10060	10363	303	32.660064697265625	54.0	60	10120
chr1	565575	566189	614	81.98696899414062	207.0	321	565896
chr1	569638	570165	527	73.18785858154297	176.0	273	569911
chr1	713600	714786	1186	376.9924011230469	1283.0	531	714131
chr1	762280	763421	1141	143.57669067382812	522.0	594	762874
chr1	805037	805620	583	47.64665603637695	100.0	188	805225
chr1	839713	840498	785	180.5477752685547	583.0	398	840111
chr1	856259	856826	567	38.50440979003906	49.0	194	856453
chr1	877933	878141	208	31.360576629638672	39.0	113	878046


This produces a file `atacworks_denoise_latest/dsc_infer.peak_calls.bed` with 8 columns:
1. Chromosome
2. Start position of peak
3. End position of peak
4. Length of peak (bp)
5. Mean coverage over peak
6. Maximum coverage in peak
7. Position of summit (relative to start)
8. Position of summit (absolute)

For more information, run `python $atacworks/scripts/peaksummary.py --help`.
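
For downstream analysis, the peak-call file can be loaded with the Python standard library. The sketch below is one possible approach; `read_peak_calls` is a hypothetical helper, and the sample text reuses the first two rows of the output shown above. It also checks how the columns relate: `len` equals `end - start`, and `summit` equals `start + relativesummit`.

```python
import csv
import io

# Header and first two rows from the peak-call output shown above.
SAMPLE = (
    "#chrom\tstart\tend\tlen\tmean\tmax\trelativesummit\tsummit\n"
    "chr1\t10060\t10363\t303\t32.660064697265625\t54.0\t60\t10120\n"
    "chr1\t565575\t566189\t614\t81.98696899414062\t207.0\t321\t565896\n"
)

INT_COLUMNS = ("start", "end", "len", "relativesummit", "summit")
FLOAT_COLUMNS = ("mean", "max")


def read_peak_calls(handle):
    """Parse the 8-column peak-call table into a list of dicts."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    header[0] = header[0].lstrip("#")  # "#chrom" -> "chrom"
    peaks = []
    for row in reader:
        record = dict(zip(header, row))
        for key in INT_COLUMNS:
            record[key] = int(record[key])
        for key in FLOAT_COLUMNS:
            record[key] = float(record[key])
        peaks.append(record)
    return peaks


peaks = read_peak_calls(io.StringIO(SAMPLE))
consistent = all(
    p["len"] == p["end"] - p["start"]
    and p["summit"] == p["start"] + p["relativesummit"]
    for p in peaks
)
print(len(peaks), consistent)
```

To read the real file, pass `open("atacworks_denoise_latest/dsc_infer.peak_calls.bed")` instead of the `io.StringIO` object.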

## Appendix 1: Output the peak probabilities in inference instead of peak calls

The model predicts the probability of every position on the genome being part of a peak. In the above command, we take a cutoff of 0.5, and output the positions of regions where the probability is greater than 0.5. To output the probability for every base in the genome without any cutoff, we use the following command:


```
!atacworks denoise \
    --noisybw dsc.1.NK.50.cutsites.smoothed.200.bw \
    --interval_size 50000 \
    --genome $atacworks/data/reference/hg19.auto.sizes \
    --out_home "./" \
    --exp_name "atacworks_denoise_probs" \
    --distributed \
    --config <path-to-custom-infer_config.yaml>
```

To change any of the parameters for denoising, copy the default config files from `$atacworks/configs` into a local folder.

In [8]:
!mkdir custom_configs
!cp $atacworks/configs/infer_config.yaml custom_configs
!cp $atacworks/configs/model_structure.yaml custom_configs

Open `custom_configs/infer_config.yaml` and change `threshold: 0.5` to `threshold: None`. This turns off the threshold, and AtacWorks will output probabilities instead of peak calls. Now pass the path to the custom config file using the `--config` option, as shown in the command above.
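
If you prefer to script this edit rather than open the file by hand, a plain text substitution is one option. The sketch below is only an illustration: `disable_threshold` is a hypothetical helper, the toy config contains a made-up `some_other_option` key, and the code assumes the file has exactly one `threshold:` line.

```python
import re


def disable_threshold(config_text):
    """Replace the value on the single `threshold:` line with None."""
    pattern = re.compile(r"^([ \t]*threshold:[ \t]*)\S+", flags=re.MULTILINE)
    new_text, n_replaced = pattern.subn(r"\g<1>None", config_text)
    if n_replaced != 1:
        raise ValueError(
            f"expected exactly one threshold entry, found {n_replaced}"
        )
    return new_text


# Toy config text; only the threshold line mirrors the real file.
toy_config = "some_other_option: 1\nthreshold: 0.5\n"
edited = disable_threshold(toy_config)
print(edited)
```

To apply it to the copied file, read `custom_configs/infer_config.yaml` with `pathlib.Path.read_text()`, pass the text through the function, and write the result back with `write_text()`.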

To change the model parameters, you can edit the `custom_configs/model_structure.yaml` file and pass it to atacworks using the `--config_mparams` option. Type `atacworks denoise -h` for detailed help on parameters.

The inference results will be saved in the folder `atacworks_denoise_probs_latest`. This folder will contain the same four files described in Step 5. However, `dsc_infer.peaks.bedGraph` and `dsc_infer.peaks.bw` will contain the probability of being part of a peak for every position in the genome. This command is significantly slower, and the `dsc_infer.peaks.bedGraph` file it produces is larger than the file produced in Step 5.

The above command is useful in the following situations:
1. To calculate AUPRC or AUROC metrics.
2. If you are not sure what probability threshold to use for peak calling and want to try multiple thresholds.
3. If you wish to use the MACS2 subcommand `macs2 bdgpeakcall` for peak calling.
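
As an illustration of the first use case, the sketch below computes average precision, one common estimator of AUPRC, from per-base probabilities and binary peak labels. This is an assumption for illustration: the tutorial does not say which estimator was used for the reported AUPRC values, and `average_precision` and the toy data are hypothetical.

```python
def average_precision(labels, scores):
    """Mean of the precision measured at the rank of each true positive.

    labels: 0/1 values (1 = position is in a 'clean' peak).
    scores: predicted peak probabilities, same length as labels.
    Tied scores keep their input order, which is a simplification.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            true_pos += 1
            total += true_pos / rank  # precision at this recall step
    return total / n_pos


# Toy example: four positions, two of which are true peak positions.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
toy_ap = average_precision(toy_labels, toy_scores)
print(round(toy_ap, 4))
```

On genome-scale data a vectorized implementation would be preferable; this version is meant only to show what the metric measures.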

To call peaks from the probability track generated by this command, you can use the MACS2 subcommand `macs2 bdgpeakcall` with the following command:

`!macs2 bdgpeakcall -i atacworks_denoise_probs_latest/dsc_infer.peaks.bedGraph -o atacworks_denoise_probs_latest/dsc_infer.peaks.narrowPeak -c 0.5`

Here, `0.5` is the probability threshold used to call peaks. Note that the summit calls and peak sizes generated by this procedure will be slightly different from those produced by Steps 5-6.