# Bismark

In this notebook I'll run Bismark on my full dataset. The steps from the [user guide](https://rawgit.com/FelixKrueger/Bismark/master/Docs/Bismark_User_Guide.html#i-bismark-genome-preparation) are as follows:

1. Genome Preparation (already done in [this Jupyter notebook](https://github.com/RobertsLab/project-virginica-oa/blob/master/notebooks/2018-04-27-Gonad-Methylation-Bismark.ipynb))
2. Alignment
3. Deduplication
4. Methylation Extractor
5. HTML Processing Report
6. Summary Report

## 0. Set working directory

In [1]:
pwd

'/Users/yaamini/Documents/project-virginica-oa/notebooks'

In [2]:
cd ../analyses/

/Users/yaamini/Documents/project-virginica-oa/analyses


In [3]:
mkdir 2018-05-04-Bismark-Full-Samples

In [4]:
ls

2018-01-23-MBDSeq-Labwork/           2018-05-01-MethylKit/
2018-04-26-Gonad-Methylation-FastQC/ 2018-05-04-Bismark-Full-Samples/
2018-04-27-Bismark/                  README.md


In [3]:
cd 2018-05-04-Bismark-Full-Samples/

/Users/yaamini/Documents/project-virginica-oa/analyses/2018-05-04-Bismark-Full-Samples
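
The `mkdir` cell above reports an error if the notebook is re-run after the directory already exists. A minimal Python sketch of a re-runnable alternative, assuming the notebook uses a Python kernel; the helper name `ensure_analysis_dir` is hypothetical:

```python
from pathlib import Path


def ensure_analysis_dir(parent, name):
    """Create parent/name if it is missing and return its path."""
    target = Path(parent) / name
    # exist_ok=True makes re-running the cell harmless.
    target.mkdir(parents=True, exist_ok=True)
    return target
```

From the `notebooks` directory this could be called as `ensure_analysis_dir("../analyses", "2018-05-04-Bismark-Full-Samples")`.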


## 1. Genome Preparation

This step was already completed in [this Jupyter notebook](https://github.com/RobertsLab/project-virginica-oa/blob/master/notebooks/2018-04-27-Gonad-Methylation-Bismark.ipynb). The genome only needs to be prepared once. I will move on to the second step, alignment.

## 2. Alignment

In [6]:
! ../../../../../Shared/Apps/Bismark_v0.19.0/bismark -help



     This program is free software: you can redistribute it and/or modify
     it under the terms of the GNU General Public License as published by
     the Free Software Foundation, either version 3 of the License, or
     (at your option) any later version.

     This program is distributed in the hope that it will be useful,
     but WITHOUT ANY WARRANTY; without even the implied warranty of
     MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
     GNU General Public License for more details.
     You should have received a copy of the GNU General Public License
     along with this program.  If not, see <http://www.gnu.org/licenses/>.



DESCRIPTION


The following is a brief description of command line options and arguments to control the Bismark
bisulfite mapper and methylation caller. Bismark takes in FastA or FastQ files and aligns the
reads to a specified bisulfite genome. Sequence reads are transformed into a bisulfite converted forward strand
version (C->T co

What I need for this command:

1. Path to `bismark`
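2. `--non_directional`: See [this issue](https://github.com/RobertsLab/resources/issues/216)
3. `--score_min L,0,-1.2`: Lower the minimum alignment score so that reads with more mismatches can still align
4. `-p 2`: Run Bowtie 2 with two threads. This flag does not mark the data as paired; paired-end reads are passed with `-1` and `-2`, and files given without those flags are aligned as single-end reads
5. `--genome` + path to the folder with the .fa genome, which also has all of the bisulfite genome directories
6. Path to sequence files for alignment
7. `2> bismark.err`: Redirect `bismark` error messages to a file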
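
`--score_min L,0,-1.2` sets the minimum alignment score as a linear function of read length, f(x) = 0 + (-1.2)x, whereas Bismark's default is `L,0,-0.2`. The sketch below works through that arithmetic. The function name `min_alignment_score` and the 100 bp read length are illustrative assumptions, not values taken from this dataset.

```python
def min_alignment_score(read_length, intercept=0.0, slope=-1.2):
    """Lowest score an alignment may have under --score_min L,<intercept>,<slope>."""
    return intercept + slope * read_length


# Compare this notebook's setting with Bismark's default for a 100 bp read.
for slope in (-0.2, -1.2):
    score = min_alignment_score(100, slope=slope)
    print(f"L,0,{slope}: minimum score {score:.0f}")
```

In Bowtie 2's end-to-end mode a perfect alignment scores 0 and a mismatch at a high-quality base costs 6, so for a 100 bp read the default tolerates about 3 such mismatches and this setting about 20.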

In [None]:
! ../../../../../Shared/Apps/Bismark_v0.19.0/bismark \
--non_directional \
--score_min L,0,-1.2 \
-p 2 \
--genome ../2018-04-27-Bismark/2018-04-27-Bismark-Inputs/ \
/Volumes/web/nightingales/C_virginica/zr2096_* \
2> bismark.err
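
Bismark takes paired-end input through `-1` and `-2`, and `-p` only sets the Bowtie 2 thread count. Files passed as bare arguments, as in the command above, are aligned as single-end reads, which is why the reports are named `_SE_report.txt`. Below is a rough sketch of how a paired-end run could be assembled instead. The `.fastq.gz` extension and the helper names `pair_mates` and `bismark_paired_command` are assumptions for illustration.

```python
import re
from collections import defaultdict

# Matches names such as zr2096_10_s1_R1.fastq.gz (the extension is an assumption).
MATE_PATTERN = re.compile(r"^(?P<sample>.+)_R(?P<mate>[12])\.fastq\.gz$")


def pair_mates(filenames):
    """Group FASTQ names into {sample: (R1, R2)}, skipping incomplete pairs."""
    mates = defaultdict(dict)
    for name in filenames:
        match = MATE_PATTERN.match(name)
        if match:
            mates[match.group("sample")][match.group("mate")] = name
    return {
        sample: (found["1"], found["2"])
        for sample, found in sorted(mates.items())
        if "1" in found and "2" in found
    }


def bismark_paired_command(bismark, genome_dir, read1, read2, threads=2):
    """Build the argument list for one paired-end Bismark alignment."""
    return [
        bismark,
        "--non_directional",
        "--score_min", "L,0,-1.2",
        "-p", str(threads),   # Bowtie 2 threads, not a paired-end switch
        "--genome", genome_dir,
        "-1", read1,          # mate 1 reads
        "-2", read2,          # mate 2 reads
    ]
```

Each list returned by `bismark_paired_command` could then be passed to `subprocess.run(cmd, check=True)`, one sample at a time.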

In [5]:
ls

zr2096_10_s1_R1_bismark_bt2.bam
zr2096_10_s1_R1_bismark_bt2_SE_report.txt
zr2096_10_s1_R2_bismark_bt2.bam
zr2096_10_s1_R2_bismark_bt2_SE_report.txt
zr2096_1_s1_R1_bismark_bt2.bam
zr2096_1_s1_R1_bismark_bt2_SE_report.txt
zr2096_1_s1_R2_bismark_bt2.bam
zr2096_1_s1_R2_bismark_bt2_SE_report.txt
zr2096_2_s1_R1_bismark_bt2.bam
zr2096_2_s1_R1_bismark_bt2_SE_report.txt
zr2096_2_s1_R2_bismark_bt2.bam
zr2096_2_s1_R2_bismark_bt2_SE_report.txt
zr2096_3_s1_R1_bismark_bt2.bam
zr2096_3_s1_R1_bismark_bt2_SE_report.txt
zr2096_3_s1_R2_bismark_bt2.bam
zr2096_3_s1_R2_bismark_bt2_SE_report.txt
zr2096_4_s1_R1_bismark_bt2.bam
zr2096_4_s1_R1_bismark_bt2_SE_report.txt
zr2096_4_s1_R2_bismark_bt2.bam
zr2096_4_s1_R2_bismark_bt2_SE_report.txt
zr2096_5_s1_R1_bismark_bt2.bam
zr2096_5_s1_R1_bismark_bt2_SE_report.txt
zr2096_5_s1_R2_bismark_bt2.bam
zr2096_5_s1_R2_bismark_bt2_SE_report.txt
zr2096_6_s1_R1_bismark_bt2.bam
zr2096_6_s1_R1_bismark_bt2_SE_report.txt
zr2096_6_s1_R2_bismark_bt2.bam
zr

## 3. Deduplication

This step isn't listed as a separate step in the User Guide. Deduplication removes alignments that map to the same genomic position, which can arise from excessive PCR amplification.

In [4]:
! ../../../../../Shared/Apps/Bismark_v0.19.0/deduplicate_bismark -help



This script is supposed to remove alignments to the same position in the genome from the Bismark mapping output
(both single and paired-end SAM files), which can arise by e.g. excessive PCR amplification. If sequences align
to the same genomic position but on different strands they will be scored individually.

Note that deduplication is not recommended for RRBS-type experiments!

In the default mode, the first alignment to a given position will be used irrespective of its methylation call
(this is the fastest option, and as the alignments are not ordered in any way this is also near enough random).

For single-end alignments only use the start coordinate of a read will be used for deduplication.

For paired-end alignments the start-coordinate of the first read and the end coordinate of the second
read will be used for deduplication. This script expects the Bismark output to be in SAM format
(Bismark v0.6.x or higher). To deduplicate the old custom Bismark output pleas

Here's what I need to run this command:

1. Path to `deduplicate_bismark`
2. `-p`: Input data is paired
3. `--bam`: Write the output as a .bam file
4. Path to input files
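
The rule described in the help text (keep the first alignment at each position, scored per strand) can be illustrated with a small Python sketch. This is not Bismark's own implementation; the `Alignment` record and its field names are hypothetical stand-ins for the fields read from a SAM/BAM file.

```python
from typing import NamedTuple, Optional


class Alignment(NamedTuple):
    read_id: str
    chrom: str
    strand: str
    start: int
    end: Optional[int] = None  # end of read 2; only used for paired-end data


def deduplicate(alignments, paired=False):
    """Keep the first alignment seen at each position, per strand."""
    seen = set()
    kept = []
    for aln in alignments:
        # Single-end: start coordinate. Paired-end: start of read 1 plus end of read 2.
        key = (aln.chrom, aln.strand, aln.start) + ((aln.end,) if paired else ())
        if key not in seen:
            seen.add(key)
            kept.append(aln)
    return kept
```

Because the first alignment wins regardless of its methylation call, the result depends on input order, which matches the "near enough random" behaviour the help text describes for unsorted files.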

In [10]:
! ../../../../../Shared/Apps/Bismark_v0.19.0/deduplicate_bismark \
--bam -p \
*.bam

Processing paired-end Bismark output file(s) (SAM format):
zr2096_10_s1_R1_bismark_bt2.bam	zr2096_10_s1_R2_bismark_bt2.bam	zr2096_1_s1_R1_bismark_bt2.bam	zr2096_1_s1_R2_bismark_bt2.bam	zr2096_2_s1_R1_bismark_bt2.bam	zr2096_2_s1_R2_bismark_bt2.bam	zr2096_3_s1_R1_bismark_bt2.bam	zr2096_3_s1_R2_bismark_bt2.bam	zr2096_4_s1_R1_bismark_bt2.bam	zr2096_4_s1_R2_bismark_bt2.bam	zr2096_5_s1_R1_bismark_bt2.bam	zr2096_5_s1_R2_bismark_bt2.bam	zr2096_6_s1_R1_bismark_bt2.bam	zr2096_6_s1_R2_bismark_bt2.bam	zr2096_7_s1_R1_bismark_bt2.bam	zr2096_7_s1_R2_bismark_bt2.bam	zr2096_8_s1_R1_bismark_bt2.bam	zr2096_8_s1_R2_bismark_bt2.bam	zr2096_9_s1_R1_bismark_bt2.bam	zr2096_9_s1_R2_bismark_bt2.bam


If there are several alignments to a single position in the genome the first alignment will be chosen. Since the input files are not in any way sorted this is a near-enough random selection of reads.

Checking file >>zr2096_10_s1_R1_bismark_bt2.bam<< for signs of file truncation...

Now testing Bismark result file z