# Bismark

In this notebook I'll use Bismark on the full extent of my data. The steps from the [user guide](https://rawgit.com/FelixKrueger/Bismark/master/Docs/Bismark_User_Guide.html#i-bismark-genome-preparation) are as follows:

1. Genome Preparation (already done in [this Jupyter notebook](https://github.com/RobertsLab/project-virginica-oa/blob/master/notebooks/2018-04-27-Gonad-Methylation-Bismark.ipynb))
2. Alignment
3. Methylation Extractor
4. HTML Processing Report
5. Summary Report

## 0. Set working directory

In [1]:
pwd

'/Users/yaamini/Documents/project-virginica-oa/notebooks'

In [2]:
cd ../analyses/

/Users/yaamini/Documents/project-virginica-oa/analyses


In [3]:
mkdir 2018-05-04-Bismark-Full-Samples

In [4]:
ls

2018-01-23-MBDSeq-Labwork/           2018-05-01-MethylKit/
2018-04-26-Gonad-Methylation-FastQC/ 2018-05-04-Bismark-Full-Samples/
2018-04-27-Bismark/                  README.md


In [5]:
cd 2018-05-04-Bismark-Full-Samples/

/Users/yaamini/Documents/project-virginica-oa/analyses/2018-05-04-Bismark-Full-Samples


## 1. Genome Preparation

This step was already completed in [this Jupyter notebook](https://github.com/RobertsLab/project-virginica-oa/blob/master/notebooks/2018-04-27-Gonad-Methylation-Bismark.ipynb). The genome only needs to be prepared once. I will move on to the second step, alignment.

## 2. Alignment

In [6]:
! ../../../../../Shared/Apps/Bismark_v0.19.0/bismark -help



     This program is free software: you can redistribute it and/or modify
     it under the terms of the GNU General Public License as published by
     the Free Software Foundation, either version 3 of the License, or
     (at your option) any later version.

     This program is distributed in the hope that it will be useful,
     but WITHOUT ANY WARRANTY; without even the implied warranty of
     MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
     GNU General Public License for more details.
     You should have received a copy of the GNU General Public License
     along with this program.  If not, see <http://www.gnu.org/licenses/>.



DESCRIPTION


The following is a brief description of command line options and arguments to control the Bismark
bisulfite mapper and methylation caller. Bismark takes in FastA or FastQ files and aligns the
reads to a specified bisulfite genome. Sequence reads are transformed into a bisulfite converted forward strand
version (C->T co

What I need for this command:

1. Path to `bismark`
2. --non_directional: See [this issue](https://github.com/RobertsLab/resources/issues/216)
3. --score_min L,0,-1.2: Relax the minimum alignment score (Bismark's default is L,0,-0.2) so reads with more mismatches can still align
4. --genome + path to the folder with the .fa genome, which also has all of the bisulfite genome directories.
5. Path to sequence files for alignment
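The five pieces listed above can be assembled programmatically, which makes it easier to rerun the same settings on a subset of files. The following is only a sketch: `build_bismark_command` is a hypothetical helper name, not part of Bismark, and it only builds the argument list without running anything. The flags and paths are the ones used in this notebook.

```python
from typing import List, Sequence


def build_bismark_command(
    bismark_path: str,
    genome_dir: str,
    fastq_files: Sequence[str],
    score_min: str = "L,0,-1.2",
    non_directional: bool = True,
) -> List[str]:
    """Assemble the alignment command as an argument list (nothing is executed)."""
    command = [bismark_path]
    if non_directional:
        command.append("--non_directional")
    # Same options as the notebook cell: relaxed score threshold, then genome folder.
    command += ["--score_min", score_min, "--genome", genome_dir]
    # Sequence files go last, one argument per file.
    command += list(fastq_files)
    return command


command = build_bismark_command(
    "../../../../../Shared/Apps/Bismark_v0.19.0/bismark",
    "../2018-04-27-Bismark/2018-04-27-Bismark-Inputs/",
    [
        "/Volumes/web/nightingales/C_virginica/zr2096_1_s1_R1.fastq.gz",
        "/Volumes/web/nightingales/C_virginica/zr2096_1_s1_R2.fastq.gz",
    ],
)
print(" ".join(command))
```

If the list looks right, it could be handed to `subprocess.run(command, check=True)` instead of typing the command into a `!` cell.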

In [8]:
! ../../../../../Shared/Apps/Bismark_v0.19.0/bismark \
--non_directional \
--score_min L,0,-1.2 \
--genome ../2018-04-27-Bismark/2018-04-27-Bismark-Inputs/ \
/Volumes/web/nightingales/C_virginica/zr2096_*

Path to Bowtie 2 specified as: bowtie2
Bowtie seems to be working fine (tested command 'bowtie2 --version' [2.3.0])
Output format is BAM (default)
Alignments will be written out in BAM format. Samtools found here: '/usr/local/bin/samtools'
Reference genome folder provided is ../2018-04-27-Bismark/2018-04-27-Bismark-Inputs/	(absolute path is '/Users/yaamini/Documents/project-virginica-oa/analyses/2018-04-27-Bismark/2018-04-27-Bismark-Inputs/)'
FastQ format assumed (by default)

Input files to be analysed (in current folder '/Users/yaamini/Documents/project-virginica-oa/analyses/2018-05-04-Bismark-Full-Samples'):
/Volumes/web/nightingales/C_virginica/zr2096_10_s1_R1.fastq.gz
/Volumes/web/nightingales/C_virginica/zr2096_10_s1_R2.fastq.gz
/Volumes/web/nightingales/C_virginica/zr2096_1_s1_R1.fastq.gz
/Volumes/web/nightingales/C_virginica/zr2096_1_s1_R2.fastq.gz
/Volumes/web/nightingales/C_virginica/zr2096_2_s1_R1.fastq.gz
/Volumes/web/nightingales/C_virginica/zr2096_2_s1_R2.fastq.gz
/Volume

Samples 8-10 did not complete because the computer ran out of disk space. [Sam cleared some space](https://github.com/RobertsLab/resources/issues/247), so I'll rerun those last three samples below to finish this step of the analysis pipeline.
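Rather than checking by eye which inputs still need alignment, the list can be derived from the file names. This sketch assumes the naming pattern visible in this notebook's directory listings (`<input>_bismark_bt2_SE_report.txt`) and assumes that a report file is only present once an alignment has finished. Both helper names are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Iterable, List


def expected_report_name(fastq_path: str) -> str:
    """Map a FastQ path to the single-end report name seen in this notebook."""
    name = PurePosixPath(fastq_path).name
    # Strip the first matching sequence-file suffix, longest forms first.
    for suffix in (".fastq.gz", ".fq.gz", ".fastq", ".fq"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break
    return f"{name}_bismark_bt2_SE_report.txt"


def unfinished_inputs(
    fastq_paths: Iterable[str], existing_files: Iterable[str]
) -> List[str]:
    """Return the FastQ paths whose report is not among existing_files."""
    present = set(existing_files)
    return [p for p in fastq_paths if expected_report_name(p) not in present]


fastqs = [
    "/Volumes/web/nightingales/C_virginica/zr2096_1_s1_R1.fastq.gz",
    "/Volumes/web/nightingales/C_virginica/zr2096_8_s1_R1.fastq.gz",
]
finished_reports = ["zr2096_1_s1_R1_bismark_bt2_SE_report.txt"]
todo = unfinished_inputs(fastqs, finished_reports)
print(todo)
```

In practice `existing_files` could come from `os.listdir(".")` in the alignment output directory.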

In [9]:
! ../../../../../Shared/Apps/Bismark_v0.19.0/bismark \
--non_directional \
--score_min L,0,-1.2 \
--genome ../2018-04-27-Bismark/2018-04-27-Bismark-Inputs/ \
/Volumes/web/nightingales/C_virginica/zr2096_8_s1_R1.fastq.gz \
/Volumes/web/nightingales/C_virginica/zr2096_8_s1_R2.fastq.gz \
/Volumes/web/nightingales/C_virginica/zr2096_9_s1_R1.fastq.gz \
/Volumes/web/nightingales/C_virginica/zr2096_9_s1_R2.fastq.gz \
/Volumes/web/nightingales/C_virginica/zr2096_10_s1_R1.fastq.gz \
/Volumes/web/nightingales/C_virginica/zr2096_10_s1_R2.fastq.gz

Path to Bowtie 2 specified as: bowtie2
Bowtie seems to be working fine (tested command 'bowtie2 --version' [2.3.0])
Output format is BAM (default)
Alignments will be written out in BAM format. Samtools found here: '/usr/local/bin/samtools'
Reference genome folder provided is ../2018-04-27-Bismark/2018-04-27-Bismark-Inputs/	(absolute path is '/Users/yaamini/Documents/project-virginica-oa/analyses/2018-04-27-Bismark/2018-04-27-Bismark-Inputs/)'
FastQ format assumed (by default)

Input files to be analysed (in current folder '/Users/yaamini/Documents/project-virginica-oa/analyses/2018-05-04-Bismark-Full-Samples'):
/Volumes/web/nightingales/C_virginica/zr2096_8_s1_R1.fastq.gz
/Volumes/web/nightingales/C_virginica/zr2096_8_s1_R2.fastq.gz
/Volumes/web/nightingales/C_virginica/zr2096_9_s1_R1.fastq.gz
/Volumes/web/nightingales/C_virginica/zr2096_9_s1_R2.fastq.gz
/Volumes/web/nightingales/C_virginica/zr2096_10_s1_R1.fastq.gz
/Volumes/web/nightingales/C_virginica/zr2096_10_s1_R2.fastq.gz
Library

## 3. Methylation Extractor

The first thing I want to do is create a new directory to store any outputs from this step.

In [7]:
pwd

'/Users/yaamini/Documents/project-virginica-oa/analyses/2018-04-27-Bismark'

In [8]:
mkdir 2018-04-29-Methylation-Extractor-Output

In [9]:
ls

2018-04-27-Bismark-Inputs/
2018-04-29-Methylation-Extractor-Output/
zr2096_10_s1_R1_bismark_bt2.bam
zr2096_10_s1_R1_bismark_bt2_SE_report.txt
zr2096_10_s1_R2_bismark_bt2.bam
zr2096_10_s1_R2_bismark_bt2_SE_report.txt
zr2096_1_s1_R1_bismark_bt2.bam
zr2096_1_s1_R1_bismark_bt2_SE_report.txt
zr2096_1_s1_R2_bismark_bt2.bam
zr2096_1_s1_R2_bismark_bt2_SE_report.txt
zr2096_2_s1_R1_bismark_bt2.bam
zr2096_2_s1_R1_bismark_bt2_SE_report.txt
zr2096_2_s1_R2_bismark_bt2.bam
zr2096_2_s1_R2_bismark_bt2_SE_report.txt
zr2096_3_s1_R1_bismark_bt2.bam
zr2096_3_s1_R1_bismark_bt2_SE_report.txt
zr2096_3_s1_R2_bismark_bt2.bam
zr2096_3_s1_R2_bismark_bt2_SE_report.txt
zr2096_4_s1_R1_bismark_bt2.bam
zr2096_4_s1_R1_bismark_bt2_SE_report.txt
zr2096_4_s1_R2_bismark_bt2.bam
zr2096_4_s1_R2_bismark_bt2_SE_report.txt
zr2096_5_s1_R1_bismark_bt2.bam
zr2096_5_s1_R1_bismark_bt2_SE_report.txt
zr2096_5_s1_R2_bismark_bt2.bam
zr2096_5_s1_R2_bismark_bt2_SE_report.txt
zr2096_6_s1_R1_b

In [6]:
! ../../../../../Shared/Apps/Bismark_v0.19.0/bismark_methylation_extractor -help



DESCRIPTION

The following is a brief description of all options to control the Bismark
methylation extractor. The script reads in a bisulfite read alignment results file 
produced by the Bismark bisulfite mapper (in BAM/CRAM/SAM format) and extracts the
methylation information for individual cytosines. This information is found in the
methylation call field which can contain the following characters:

       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
       ~~~   X   for methylated C in CHG context                      ~~~
       ~~~   x   for not methylated C CHG                             ~~~
       ~~~   H   for methylated C in CHH context                      ~~~
       ~~~   h   for not methylated C in CHH context                  ~~~
       ~~~   Z   for methylated C in CpG context                      ~~~
       ~~~   z   for not methylated C in CpG context                  ~~~
       ~~~   U   for methylated C in Unknown context (CN 

According to the quick guide, I don't need much to extract methylated cytosines.

1. Path to `bismark_methylation_extractor`
2. -o + path to the output directory
3. --gzip: Compress methylation extractor files to save disk space
4. --bedGraph: Create `bedGraph` file with results
5. Path to .bam files generated in Step 2

In [10]:
! ../../../../../Shared/Apps/Bismark_v0.19.0/bismark_methylation_extractor \
-o 2018-04-29-Methylation-Extractor-Output \
--gzip \
--bedGraph \
zr2096_*_bt2.bam


 *** Bismark methylation extractor version v0.19.0 ***

Trying to determine the type of mapping from the SAM header line of file zr2096_10_s1_R1_bismark_bt2.bam
Treating file(s) as single-end data (as extracted from @PG line)

Setting core usage to single-threaded (default). Consider using --multicore <int> to speed up the extraction process.

Summarising Bismark methylation extractor parameters:
Bismark single-end SAM format specified (default)
Number of cores to be used: 1
Output path specified as: 2018-04-29-Methylation-Extractor-Output/


Summarising bedGraph parameters:
Generating additional output in bedGraph and coverage format
bedGraph format:	<Chromosome> <Start Position> <End Position> <Methylation Percentage>
coverage format:	<Chromosome> <Start Position> <End Position> <Methylation Percentage> <count methylated> <count non-methylated>

Using a cutoff of 1 read(s) to report cytosine positions
Reporting and sorting cytosine methylation information in CpG context only (defaul

Once again, I got a `gunzip` error. However, I do have output files [in this folder](https://github.com/RobertsLab/project-virginica-oa/tree/master/analyses/2018-04-27-Bismark/2018-04-29-Methylation-Extractor-Output). I'll ignore the error and proceed.
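One way to check that the outputs are usable despite the error is to parse a few lines of the coverage output and confirm that the columns agree with each other. The column order below is the one printed in the extractor log above. The tab delimiter is my assumption, the example line is invented for illustration, and `CoverageRecord`, `parse_coverage_line`, and `recomputed_percentage` are hypothetical names.

```python
from typing import NamedTuple


class CoverageRecord(NamedTuple):
    chromosome: str
    start: int
    end: int
    percent_methylated: float
    count_methylated: int
    count_unmethylated: int


def parse_coverage_line(line: str) -> CoverageRecord:
    """Split one coverage line into typed fields (tab-delimited, by assumption)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, found {len(fields)}")
    chrom, start, end, pct, meth, unmeth = fields
    return CoverageRecord(
        chrom, int(start), int(end), float(pct), int(meth), int(unmeth)
    )


def recomputed_percentage(record: CoverageRecord) -> float:
    """Recalculate percent methylation from the two read counts."""
    total = record.count_methylated + record.count_unmethylated
    if total == 0:
        raise ValueError("position has no coverage")
    return 100.0 * record.count_methylated / total


# Invented example line: 3 methylated and 1 unmethylated read at one position.
record = parse_coverage_line("example_chromosome\t100\t100\t75\t3\t1\n")
print(record.percent_methylated, recomputed_percentage(record))
```

Because the extractor was run with `--gzip`, real files would need to be opened with `gzip.open(path, "rt")` before passing lines to the parser.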

## 4. HTML Processing Report

In [11]:
! ../../../../../Shared/Apps/Bismark_v0.19.0/bismark2report -help


  SYNOPSIS:

  This script uses a Bismark alignment report to generate a graphical HTML report page. Optionally, further reports of
  the Bismark suite such as deduplication, methylation extractor splitting or M-bias reports can be specified as well. If several
  Bismark reports are found in the same folder, a separate report will be generated for each of these, whereby the output filename
  will be derived from the Bismark alignment report file. Bismark2report attempts to find optional reports automatically based
  on the file basename.


  USAGE: bismark2report [options]


-o/--output <filename>     Name of the output file (optional). If not specified explicitly, the output filename will be derived
                           from the Bismark alignment report file. Specifying an output filename only works if the HTML report is
                           to be generated for a single Bismark alignment report (and potentially additional reports).

--dir                 

There are lots of ways to customize the `bismark2report` command, but I don't yet understand most of the options. For now I'll run it in the simplest way possible to make sure it works.

1. Path to `bismark2report`
2. --dir + path to output directory

Before I run the command, I need to make a new output directory.

In [13]:
mkdir 2018-04-29-Bismark2Report-Outputs

In [14]:
ls

2018-04-27-Bismark-Inputs/
2018-04-29-Bismark2Report-Outputs/
2018-04-29-Methylation-Extractor-Output/
zr2096_10_s1_R1_bismark_bt2.bam
zr2096_10_s1_R1_bismark_bt2_SE_report.txt
zr2096_10_s1_R2_bismark_bt2.bam
zr2096_10_s1_R2_bismark_bt2_SE_report.txt
zr2096_1_s1_R1_bismark_bt2.bam
zr2096_1_s1_R1_bismark_bt2_SE_report.txt
zr2096_1_s1_R2_bismark_bt2.bam
zr2096_1_s1_R2_bismark_bt2_SE_report.txt
zr2096_2_s1_R1_bismark_bt2.bam
zr2096_2_s1_R1_bismark_bt2_SE_report.txt
zr2096_2_s1_R2_bismark_bt2.bam
zr2096_2_s1_R2_bismark_bt2_SE_report.txt
zr2096_3_s1_R1_bismark_bt2.bam
zr2096_3_s1_R1_bismark_bt2_SE_report.txt
zr2096_3_s1_R2_bismark_bt2.bam
zr2096_3_s1_R2_bismark_bt2_SE_report.txt
zr2096_4_s1_R1_bismark_bt2.bam
zr2096_4_s1_R1_bismark_bt2_SE_report.txt
zr2096_4_s1_R2_bismark_bt2.bam
zr2096_4_s1_R2_bismark_bt2_SE_report.txt
zr2096_5_s1_R1_bismark_bt2.bam
zr2096_5_s1_R1_bismark_bt2_SE_report.txt
zr2096_5_s1_R2_bismark_bt2.bam
zr2096_5_s1

In [15]:
! ../../../../../Shared/Apps/Bismark_v0.19.0/bismark2report \
--dir 2018-04-29-Bismark2Report-Outputs

Found 20 alignment reports in current directory. Now trying to figure out whether there are corresponding optional reports

Writing Bismark HTML report to >> 2018-04-29-Bismark2Report-Outputs/zr2096_10_s1_R1_bismark_bt2_SE_report.html <<

Redundant argument in sprintf at ../../../../../Shared/Apps/Bismark_v0.19.0/bismark2report line 130.
Using the following alignment report:		> zr2096_10_s1_R1_bismark_bt2_SE_report.txt <
Processing alignment report zr2096_10_s1_R1_bismark_bt2_SE_report.txt ...
Complete

No deduplication report file specified, skipping this step
No splitting report file specified, skipping this step
No M-bias report file specified, skipping this step
No nucleotide coverage report file specified, skipping this step



Writing Bismark HTML report to >> 2018-04-29-Bismark2Report-Outputs/zr2096_10_s1_R2_bismark_bt2_SE_report.html <<

Redundant argument in sprintf at ../../../../../Shared/Apps/Bismark_v0.19.0/bismark2report line 130.
Using the following alignment report:		> 

The reports can be found [here](https://github.com/RobertsLab/project-virginica-oa/tree/master/analyses/2018-04-27-Bismark/2018-04-29-Bismark2Report-Outputs).

## 5. Summary Report

In [16]:
! ../../../../../Shared/Apps/Bismark_v0.19.0/bismark2summary -help


  SYNOPSIS:

  This script uses Bismark report files of several samples in a run folder to generate a graphical summary HTML report as well as
  a whopping big table (tab delimited text) with all relevant alignment and methylation statistics which may be used for graphing
  purposes in R, Excel or the like. Unless specific BAM files are specified, bismark2summary first identifies Bismark BAM files in
  a folder (they need to use the Bismark naming conventions) and then automatically detects Bismark alignment, deduplication or
  methylation extractor (splitting) reports based on the input file basename. If splitting reports are found they overwrite the
  methylation statistics of the initial alignment report.


  USAGE: bismark2summary [options] [<BAM file(s)>]

  ARGUMENTS:

  BAM file(s)                 Optional. If no BAM files are specified explicitly the current working directory is scanned for 
                              Bismark alignment files and their associ

There are no required arguments, so I'm just going to run the command with nothing else.

In [17]:
! ../../../../../Shared/Apps/Bismark_v0.19.0/bismark2summary

Found Bismark/Bowtie2 single-end files
No Bismark/Bowtie2 paired-end BAM files detected
No Bismark/Bowtie single-end BAM files detected
No Bismark/Bowtie paired-end BAM files detected

Generating Bismark summary report from 20 Bismark BAM file(s)...
>> Reading from Bismark report: zr2096_10_s1_R1_bismark_bt2_SE_report.txt
No deduplication report present, skipping...
No methylation extractor report present, skipping...
>> Reading from Bismark report: zr2096_10_s1_R2_bismark_bt2_SE_report.txt
No deduplication report present, skipping...
No methylation extractor report present, skipping...
>> Reading from Bismark report: zr2096_1_s1_R1_bismark_bt2_SE_report.txt
No deduplication report present, skipping...
No methylation extractor report present, skipping...
>> Reading from Bismark report: zr2096_1_s1_R2_bismark_bt2_SE_report.txt
No deduplication report present, skipping...
No methylation extractor report present, skipping...
>> Reading from Bismark report: zr2096_2_s1_R1_bismark_bt2_SE_re

It seems to skip the methylation extractor reports because they're in a different directory from the BAM files. If that ends up being important, I'll need to figure out how to fix it. Because we're using `methylKit` for the downstream methylation analysis, it may not be worth my time to get `bismark2summary` to read reports from another directory.
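If the extractor reports do turn out to matter, one possible workaround is to copy them next to the BAM files before rerunning `bismark2summary`, since the help text says reports are detected from the input file basename. This is an untested sketch: it assumes the extractor's splitting reports end in `_splitting_report.txt`, and `copy_splitting_reports` is a hypothetical helper. The demonstration uses a temporary directory with placeholder files rather than real outputs.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def copy_splitting_reports(extractor_dir: str, bam_dir: str) -> List[str]:
    """Copy *_splitting_report.txt files next to the BAM files; return their names."""
    copied = []
    for report in sorted(Path(extractor_dir).glob("*_splitting_report.txt")):
        shutil.copy2(report, Path(bam_dir) / report.name)
        copied.append(report.name)
    return copied


# Demonstration with placeholder files in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    extractor_dir = Path(tmp) / "2018-04-29-Methylation-Extractor-Output"
    extractor_dir.mkdir()
    placeholder = "zr2096_1_s1_R1_bismark_bt2_splitting_report.txt"
    (extractor_dir / placeholder).write_text("placeholder\n")
    (extractor_dir / "unrelated.txt").write_text("ignored\n")

    copied = copy_splitting_reports(str(extractor_dir), tmp)
    copy_exists = (Path(tmp) / placeholder).exists()
    unrelated_copied = (Path(tmp) / "unrelated.txt").exists()

print(copied)
```

Copying rather than moving keeps the extractor output directory intact, so the step can be undone by deleting the copies.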

The report can be found as a [.txt file](https://github.com/RobertsLab/project-virginica-oa/blob/master/analyses/2018-04-27-Bismark/bismark_summary_report.txt) and [.html document](https://github.com/RobertsLab/project-virginica-oa/blob/master/analyses/2018-04-27-Bismark/bismark_summary_report.html).

Overall, my `bismark` test pipeline seems to have yielded results! The next step is to understand what the outputs are, and then run the pipeline on the full extent of my sequencing data.