# The WGBS data analysis tutorial 1 - Bismark pipeline 
## Overview

In this tutorial, we’ll demonstrate how to use __[Bismark](https://github.com/FelixKrueger/Bismark)__ to map bisulfite-treated sequencing reads to a genome of interest and perform methylation calls in a SageMaker notebook instance on Amazon Web Services (AWS). Please refer to [Bismark's User Guide](http://felixkrueger.github.io/Bismark/) for more detailed usage of the commands used in this tutorial.  
> **Input**: raw sequencing data (.fastq, .fastq.gz files)  
> **Output**: methylation states/ratios (.bedgraph files)

<img src="images/notebook1.png" width="1000" />

This tutorial will cover:

- [Installing tools in SageMaker notebook](#Install-tools-in-Vertex-AI-notebook)
- [Importing the example dataset](#Importing-the-example-dataset)
- [Running Bismark pipeline](#Running-Bismark-pipeline)
- [Visualization using IGV (Integrative Genomics Viewer)](#Visualization-using-IGV-(Integrative-Genomics-Viewer))

## Learning Objectives

* **Mastering the Bismark pipeline for WGBS data analysis:**  The core objective is to learn how to use the Bismark suite of tools to process whole-genome bisulfite sequencing (WGBS) data, from raw reads to methylation calls.

* **Understanding WGBS data processing steps:** Learners will gain a practical understanding of each stage in the WGBS workflow, including quality control (FastQC, MultiQC), adapter trimming (Trim Galore!), genome preparation, read alignment (Bismark), deduplication, methylation extraction, and report generation (Bismark2report, Bismark2summary).

* **Utilizing command-line tools in a cloud environment (SageMaker Notebook):** The tutorial emphasizes using command-line tools within an AWS SageMaker notebook, providing experience with managing a computational environment and executing bioinformatics pipelines remotely.

* **Interpreting bioinformatics results:** Learners will learn to interpret FastQC reports, MultiQC summaries, Bismark alignment statistics, and methylation call files.  They will also learn to identify potential biases in the data (e.g., adapter contamination, methylation bias).

* **Visualizing methylation data with IGV:** The notebook teaches how to use the Integrative Genomics Viewer (IGV) to visualize both alignment data (.bam) and methylation profiles (.bedGraph) to explore the results.

* **Working with cloud storage (Amazon S3):** The tutorial involves downloading data from and potentially uploading results to an Amazon S3 bucket, introducing learners to cloud-based data management.

* **Installing and managing bioinformatics tools with conda and mamba:** The notebook demonstrates the use of conda and mamba for package management, creating a reproducible and consistent analysis environment.

* **Applying bioinformatics concepts to a real-world dataset:** The tutorial uses a real-world WGBS dataset, allowing learners to apply their knowledge to a practical example and understand the biological context of the results.

## Prerequisites

* **Amazon Simple Storage Service (S3):** Essential for downloading the example dataset and reference genome. The notebook uses `aws s3` commands to interact with S3.

* **Software Dependencies:** The notebook installs several bioinformatics tools using `mamba` (a faster drop-in replacement for the conda package manager). These include `fastqc`, `multiqc`, `bismark`, `trim-galore`, `bedtools`, `samtools`, and `metilene`; `igv-notebook` is installed separately with `pip`. The notebook provides the commands to install these.

## Get Started
<a id="Install-tools-in-Vertex-AI-notebook"></a>
### Install tools in SageMaker notebook

#### Conda 
__[Conda](https://docs.conda.io/en/latest/)__ is an open source package management system and environment management system. It can help to find and install packages, switch between different environments when you need to run different versions of tools, such as Python. 

SageMaker notebook instances come with a suite of packages and tools pre-installed. To test whether conda is pre-installed, try the following command: `conda env list`. If conda is installed, a list of your environments appears. The default environment is called 'base', and should have a **\*** in front of it.

In [None]:
! conda env list

#### Installation of the tools for Tutorial 1 and 2 using <code>mamba</code>

[**mamba**](https://mamba.readthedocs.io/en/latest/user_guide/mamba.html) is a re-implementation of the conda package manager in C++. It uses the same commands and configuration options as conda. The only difference is that you should still use conda for activation and deactivation.

<div class= "alert alert-block alert-info"><b>Tip</b>: use <code>\</code> to break a long command into multiple lines</div>

In [None]:
! mamba install -y -c conda-forge -c bioconda \
    fastqc \
    multiqc \
    bismark \
    trim-galore \
    bedtools \
    samtools \
    metilene
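
Once the installation finishes, it can be useful to confirm that each tool is actually reachable from the notebook. The following is a minimal sketch using only the Python standard library; the executable names are assumptions derived from the package names above (for example, the `trim-galore` package is expected to provide a `trim_galore` executable).

```python
import shutil

# Executable names assumed from the packages installed above.
TOOLS = ["fastqc", "multiqc", "bismark", "trim_galore",
         "bedtools", "samtools", "metilene"]


def find_missing_tools(tools):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools(TOOLS)
print("Missing tools:", ", ".join(missing) if missing else "none")
```

If any tool is reported as missing, re-running the `mamba install` cell or restarting the kernel is a reasonable first thing to try.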

---
## **Importing the example dataset**

#### Set up the environment
The current working directory is the directory you are currently working in, and can be printed using the command `pwd`. To organize the output files in this tutorial, let's create a new directory called `Tutorial_1` in the working directory, with sub-directories inside `Tutorial_1` to store the outputs of the different steps.

In [None]:
# Show current working directory
! pwd
# Create directories
! mkdir -p Tutorial_1
! mkdir -p Tutorial_1/ref_genome
! mkdir -p Tutorial_1/fastqc
! mkdir -p Tutorial_1/trimmed

#### About the example dataset

This small example dataset was created by the [snakePipes WGBS pipeline](https://snakepipes.readthedocs.io/en/latest/content/workflows/WGBS.html) from paper: [Habibi, Ehsan, et al. Cell stem cell 13.3 (2013): 360-369](https://pubmed.ncbi.nlm.nih.gov/23850244/). 

The study of epigenetic mechanisms in the establishment and maintenance of the pluripotent state, as well as in the differentiation process, is an area of intense investigation in embryonic stem cell (ESCs) biology. In addition, since ESCs and cancer cells share certain phenotypic characteristics, such as the ability to be propagated in long-term culture, there has been interest in establishing whether they share certain epigenetic characteristics. This paper shows that the use of two kinase inhibitors (2i) enables derivation of mouse ESCs in the pluripotent ground state. From their WGBS data, we can see that male 2i ESCs are globally hypomethylated compared to conventional ESCs maintained in serum. 

<img src="images/1_data_graph.jpg" width="400" />

The original data are available in the NCBI GEO database under the accession number GSE41923. In this example dataset, 4 samples were selected: two from mouse ESCs with 2i enabled derivations, and two from conventional ESCs maintained in serum. The sequences were downsampled to contain only the reads from the region chr6:4000000-6000000 in the mouse genome.

#### Download the example dataset and reference genome from an Amazon S3 bucket.

This WGBS example dataset is stored in an **Amazon S3 bucket**. You can use `aws s3` commands to access and manage the cloud storage. 

Here are some examples of how to use the `aws s3` tool to create and delete buckets; move, copy, and rename objects in the bucket ([aws s3 quickstart](https://docs.aws.amazon.com/cli/latest/userguide/cli-services-s3.html)):
> - `aws s3` uses the prefix `s3://` to indicate a resource in Amazon S3: `s3://BUCKET_NAME/OBJECT_NAME`.
> - Create an S3 bucket: `aws s3 mb s3://BUCKET_NAME --region us-east-1`. The bucket you created can be viewed in the AWS Management Console by selecting: Services -> S3. Then, choose the bucket name from the list.
> - Copy an object/file to a bucket: `aws s3 cp OBJECT_NAME s3://BUCKET_NAME`
> - Download an object from your bucket: `aws s3 cp s3://BUCKET_NAME/OBJECT_NAME OBJECT_NAME`
> - List contents of an S3 bucket: `aws s3 ls s3://BUCKET_NAME`
> - Delete an object: `aws s3 rm s3://BUCKET_NAME/OBJECT_NAME`

To download this example dataset, copy the file from the storage bucket to the Tutorial_1 directory we just created using `aws s3 cp`. After decompressing the file, the resulting files should be in the sub-directory Tutorial_1/fastq, including eight sequence (.fastq.gz) files, one sample sheet (.tsv), and one CpG island annotation file (.bed) from the targeted genome region.

In [None]:
# Download the example dataset from an Amazon S3 bucket
! aws s3 cp s3://nigms-sandbox/dna-methyl/Habibi2013_chr6F3A.tar.gz Tutorial_1
# Decompress the downloaded file
! cd Tutorial_1 && tar -xvzf Habibi2013_chr6F3A.tar.gz
# Delete the original compressed file
! rm Tutorial_1/Habibi2013_chr6F3A.tar.gz
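
Since the dataset is paired-end, each sample should come with one forward (R1) and one reverse (R2) file. The sketch below groups the downloaded files by sample so that an incomplete pair is easy to spot. The naming pattern (`<sample>_R1.fastq.gz` / `<sample>_R2.fastq.gz`) is an assumption based on the file names used later in this tutorial, and `pair_fastq_files` is a hypothetical helper, not part of any tool used here.

```python
from collections import defaultdict
from pathlib import Path


def pair_fastq_files(paths):
    """Group FASTQ paths by sample, keyed by read label ("R1" or "R2")."""
    pairs = defaultdict(dict)
    for path in paths:
        name = Path(path).name
        for label in ("R1", "R2"):
            marker = f"_{label}"
            if marker in name:
                # Everything before "_R1"/"_R2" is taken as the sample name.
                sample = name.split(marker)[0]
                pairs[sample][label] = str(path)
    return dict(pairs)


# Assumed location of the extracted files; adjust if yours differ.
fastq_dir = Path("Tutorial_1/fastq")
fastq_files = sorted(fastq_dir.glob("*.fastq.gz")) if fastq_dir.is_dir() else []
for sample, reads in pair_fastq_files(fastq_files).items():
    status = "ok" if set(reads) == {"R1", "R2"} else "INCOMPLETE"
    print(sample, status)
```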

Download the **reference genome** in the same way and place it in the `ref_genome` folder. The genome should be in FASTA format, and can be downloaded from the NCBI or Ensembl websites. For this example dataset, we will only use the sequences from GRCm39 chromosome 6.

In [None]:
! aws s3 cp s3://nigms-sandbox/dna-methyl/Mus_musculus.GRCm39.dna.chromosome.6.fa Tutorial_1/ref_genome

---
## **Running Bismark pipeline**

#### Step 1. FastQC 

[**FastQC**](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) is a simple tool that allows you to do some quality control checks on raw sequence data coming from high-throughput sequencing pipelines. It provides a modular set of analyses that you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis. You can find examples of "good" and "bad" sequencing data on the [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) website, in the section "Example Reports". 

Run FastQC on the sequence files; the FastQC reports will be saved in Tutorial_1/fastqc. The `for` loop iterates through all the .fastq.gz files in the Tutorial_1/fastq directory and runs `fastqc` on each of them separately.

In [None]:
! for file in Tutorial_1/fastq/*.gz; do fastqc -q -o Tutorial_1/fastqc "${file}"; done;

After running FastQC, we can use Python code below and HTML support to display results in-line (using one fastqc report from SRX202087_R1 as an example): 

In [None]:
from IPython.display import IFrame
IFrame(src='./Tutorial_1/fastqc/SRX202087_R1_fastqc.html', width=1200, height=500)
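
Besides the HTML report, FastQC writes a zip archive per input file that contains a `summary.txt` with one tab-separated line per module (status, module name, file name). As a rough sketch, the modules that did not pass can be listed as follows. The example content is made up for illustration, and `flagged_modules` is a hypothetical helper.

```python
def flagged_modules(summary_text):
    """Return (status, module) pairs for modules that did not PASS."""
    flagged = []
    for line in summary_text.splitlines():
        if not line.strip():
            continue
        status, module, _filename = line.split("\t")
        if status != "PASS":
            flagged.append((status, module))
    return flagged


# Made-up example in the three-column layout of FastQC's summary.txt.
example = (
    "PASS\tBasic Statistics\tsample_R1.fastq.gz\n"
    "FAIL\tPer base sequence content\tsample_R1.fastq.gz\n"
    "WARN\tAdapter Content\tsample_R1.fastq.gz\n"
)
print(flagged_modules(example))
```

Keep in mind that bisulfite-converted libraries are expected to look unusual in the per-base sequence content module, because most cytosines have been converted.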

#### Step 2. MultiQC (optional)
__[MultiQC](https://multiqc.info/)__ can aggregate results from bioinformatics analysis across many samples into a single report. In our case, it reads in the FastQC reports and generates a compiled report for all the eight analyzed FASTQ files, and the report will be saved in `Tutorial_1/multiqc`. We can also view the report in .html format:

In [None]:
# Run multiqc to summarize all the fastqc reports
!multiqc -f -p  Tutorial_1/fastqc -o Tutorial_1/multiqc

# View multiqc report
from IPython.display import IFrame
IFrame(src='Tutorial_1/multiqc/multiqc_report.html', width=1200, height=400)

#### Step 3. Trim the adapters and low quality sequence reads

Adapter sequences should be removed from reads because they interfere with downstream analyses, such as alignment of reads to a reference. In the FastQC report, the adapter content plot shows the percentage of reads (y-axis) that have an adapter starting at a particular position along the read (x-axis). If the DNA was fragmented to lengths shorter than the target molecule length, a high proportion of reads with adapters will be observed (right).

<img src="images/1-adapter_content.png" width="500" />

In this tutorial, we use __[Trim Galore!](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/)__ for adapter trimming and quality control. Use the `--fastqc` flag to run FastQC again to check the trimming results. Use the `-j` flag to indicate the number of cores to be used for trimming. The results (trimmed sequences and FastQC reports) will be kept in the Tutorial_1/trimmed directory.

In [None]:
! trim_galore -j 4 --paired --illumina -o Tutorial_1/trimmed --fastqc Tutorial_1/fastq/SRX*.gz
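
To make the idea of adapter trimming concrete, here is a deliberately simplified sketch that cuts a read at the first exact occurrence of the adapter. It is only an illustration: Trim Galore! relies on Cutadapt, which also tolerates mismatches, handles partial adapters at the read end, and trims low-quality bases. The reads are made up; the adapter string is the Illumina universal adapter prefix that `--illumina` selects.

```python
# First 13 bases of the Illumina universal adapter (selected by --illumina).
ILLUMINA_ADAPTER = "AGATCGGAAGAGC"


def trim_adapter(read, adapter=ILLUMINA_ADAPTER):
    """Cut the read at the first exact occurrence of the adapter, if any."""
    position = read.find(adapter)
    return read if position == -1 else read[:position]


# Made-up reads: the first runs into the adapter, the second does not.
with_adapter = trim_adapter("ACGTACGT" + ILLUMINA_ADAPTER + "ACACG")
without_adapter = trim_adapter("ACGTACGTTTGA")
print(with_adapter)     # ACGTACGT
print(without_adapter)  # ACGTACGTTTGA
```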

#### Step 4. Genome preparation

To align the bisulfite-treated reads to a reference genome, the reference genome needs to be converted as well. This step is also called <b>genome conversion</b>. It generates a bisulfite-converted forward-strand index of the reference genome (C->T converted) and a bisulfite-converted reverse-strand index of the genome (G->A conversion of the forward strand). Without this step, the bisulfite-treated reads will either fail to align to the unconverted reference genome or align with many mismatches.

<img src="images/1-genome_convertion.png" width="600" /> 

In this tutorial, run the command `bismark_genome_preparation` to convert the Cs in the reference genome to Ts, and the Gs to As in the complementary strand. The converted genome is written to a `Bisulfite_Genome` sub-folder inside Tutorial_1/ref_genome. Depending on the genome size, this process can take hours to finish. 

In [None]:
! bismark_genome_preparation Tutorial_1/ref_genome
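
The conversion itself is conceptually simple. The sketch below applies the two in-silico conversions described above to a short made-up sequence; it only illustrates the idea and does not build the alignment indexes that `bismark_genome_preparation` produces.

```python
def convert_reference(sequence):
    """Return the C->T and G->A converted versions of a forward-strand sequence."""
    sequence = sequence.upper()
    c_to_t = sequence.replace("C", "T")  # forward-strand conversion
    g_to_a = sequence.replace("G", "A")  # reverse-strand conversion, written on the forward strand
    return c_to_t, g_to_a


c_to_t, g_to_a = convert_reference("ACGTTCGGA")
print(c_to_t)  # ATGTTTGGA
print(g_to_a)  # ACATTCAAA
```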

#### Step 5. Run Bismark (alignment)
In this step, the quality and adapter trimmed reads will be mapped to the bisulfite converted reference genome using command `bismark`. The basic usage of this command is: 

`bismark [options] --genome <genome_folder> {-1 <mates1> -2 <mates2> | <singles>}`

**Running time**. This is the most important step in the whole Bismark workflow and also requires the most computational resources. For large genomes such as human or mouse, the alignment step could take several days depending on the sequencing depth and computational resources allocated.  

**Effect of bisulfite treatment of DNA**. As cytosine methylation is not symmetrical, the two strands of DNA in the reference genome must be considered separately. Bisulfite conversion of genomic DNA and subsequent PCR amplification gives rise to two PCR products and up to **four** potentially different DNA fragments for any given locus. OT, original top strand; CTOT, strand complementary to the original top strand; OB, original bottom strand; and CTOB, strand complementary to the original bottom strand. 
> <img src="images/1_4DNA_strands.png" width="600"/>

All four DNA strands that arise through bisulfite treatment and subsequent PCR amplification can be sequenced with the same frequency in non-directional libraries, while for directional libraries, adapters are attached to the DNA fragments such that only the original top or bottom strands will be sequenced. In the Bismark alignment step, `--directional` is the default, so only alignments to the OT and OB strands are reported.

**Output alignment**. The output .bam files are the binary version of the SAM (Sequence Alignment/Map) format. Please see [here](https://github.com/FelixKrueger/Bismark/blob/master/docs/bismark/alignment.md#bismark-bamsam-output-default) for a more detailed explanation of the output .bam/.sam format, where the `XM-tag` is the methylation call string that indicates the methylation status of each cytosine in the different contexts: CG (or CpG), CHG or CHH (where H corresponds to A, T or C). The methylation call string contains a dot ‘.’ for every position in the BS-read not involving a cytosine, or one of the following letters (z, Z, x, X, h, H) for the three different cytosine methylation contexts (UPPER CASE = METHYLATED, lower case = unmethylated).
> <img src="images/1_bismark_bam.png" width="600" />
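
As a small illustration of how the methylation call string can be read, the sketch below counts methylated and unmethylated calls per context using the letter code described above. The example string is made up, and `summarize_xm` is a hypothetical helper rather than a Bismark function.

```python
from collections import Counter

# Lower-case letter -> context; the upper-case letter marks a methylated call.
CONTEXTS = {"z": "CpG", "x": "CHG", "h": "CHH"}


def summarize_xm(xm):
    """Count methylated/unmethylated calls per context in an XM string."""
    counts = Counter(ch for ch in xm if ch.lower() in CONTEXTS)
    return {
        context: {
            "methylated": counts[letter.upper()],
            "unmethylated": counts[letter],
        }
        for letter, context in CONTEXTS.items()
    }


summary = summarize_xm("..Z..h.x...Z.z..H")
print(summary["CpG"])  # {'methylated': 2, 'unmethylated': 1}
```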

In plants and other organisms, DNA methylation is found in three different sequence contexts: CG (or CpG), CHG or CHH (where H corresponds to A, T or C). In mammals, however, DNA methylation is almost exclusively found in CpG dinucleotides, with the cytosines on both strands usually being methylated.

To run the Bismark alignment, we use a `for` loop to iterate through all the forward-read files (those with `_R1_val_1` in the file name) in the 'trimmed' directory, and run the `bismark` command with the reference genome location and both the forward and reverse reads for each sample (the reverse-read file name is obtained by replacing `_R1_val_1` with `_R2_val_2`). The `-o` flag specifies the output directory. The resulting alignment files (.bam) will be saved in the Tutorial_1/bismark directory, and more details can be found in the report files in the same folder.

In [None]:
! for file_name in Tutorial_1/trimmed/*_R1_val_1.fq.gz; do \
    bismark --genome Tutorial_1/ref_genome/ -1 ${file_name} -2 ${file_name//_R1_val_1/_R2_val_2} -o Tutorial_1/bismark; \
done;
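
The file-name substitution in the loop above can be easier to follow when written out in Python. The sketch below only builds the command for one sample and prints it; `bismark_command` is a hypothetical helper, and the default paths are the ones used in this tutorial.

```python
def bismark_command(r1_path,
                    genome_dir="Tutorial_1/ref_genome/",
                    out_dir="Tutorial_1/bismark"):
    """Build the paired-end bismark command for one forward-read file."""
    # Derive the reverse-read file name the same way the shell loop does.
    r2_path = r1_path.replace("_R1_val_1", "_R2_val_2")
    return ["bismark", "--genome", genome_dir,
            "-1", r1_path, "-2", r2_path, "-o", out_dir]


cmd = bismark_command("Tutorial_1/trimmed/SRX202087_R1_val_1.fq.gz")
print(" ".join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` if you prefer to launch the alignment from Python instead of the shell.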

#### Step 6. De-duplicate 
This step removes alignments with identical mapping position to avoid technical duplication in the results. Given a large genome with moderate coverage, it is unlikely to sequence several genuine copies of the same fragment amongst so many possible fragments with different start sites:  
> <img src="images/1_complex_library.png" width="500"/>   

Therefore, a high level of duplication is more likely to indicate some kind of enrichment bias (e.g., PCR over-amplification). 

However, this step is not advisable for RRBS or other target enrichment methods where higher coverage is either desired or expected. We use the command `deduplicate_bismark` below to perform the de-duplication, where `-p` stands for paired-end reads, `--bam` writes the de-duplicated output in BAM format, and the alignment files to be de-duplicated are given as the final arguments.

In [None]:
! deduplicate_bismark -p --output_dir Tutorial_1/bismark --bam Tutorial_1/bismark/*.bam 

<div class="alert alert-block alert-success"><b>Note</b>: when running in a Jupyter notebook, there might be an error involving samtools (also in step 8) saying '<em>samtools view: writing to standard output failed: Broken pipe samtools view: error closing standard output: -1</em>'. It can be ignored since it does not change the results. This error can be avoided by running in a terminal instead of a notebook.</div>
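
The principle behind de-duplication can be sketched in a few lines: alignments that share the same chromosome, start, end, and strand are treated as copies of one fragment, and only the first one is kept. This is a conceptual illustration with made-up alignments, not the actual implementation of `deduplicate_bismark`.

```python
def deduplicate(alignments):
    """Keep the first alignment seen for each (chrom, start, end, strand)."""
    seen = set()
    unique = []
    for aln in alignments:
        key = (aln["chrom"], aln["start"], aln["end"], aln["strand"])
        if key not in seen:
            seen.add(key)
            unique.append(aln)
    return unique


# Made-up alignments: the first two map to the identical position.
alignments = [
    {"id": "read1", "chrom": "6", "start": 4000100, "end": 4000350, "strand": "+"},
    {"id": "read2", "chrom": "6", "start": 4000100, "end": 4000350, "strand": "+"},
    {"id": "read3", "chrom": "6", "start": 4000180, "end": 4000420, "strand": "+"},
]
kept = deduplicate(alignments)
print([aln["id"] for aln in kept])  # ['read1', 'read3']
```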

#### Step 7. Get methylation ratios 

This step operates on Bismark alignment files and extracts the methylation call for every single C analyzed. `--gzip` is used to compress the output files in .gz format.

**Strand-specific methylation output files** (default): The position of every single C will be written out to a new output file, depending on its context (CpG, CHG or CHH), whereby methylated Cs will be labeled as forward reads (+), non-methylated Cs as reverse reads (-). 

**Optional bedGraph output**. Alternatively, the output of the methylation extractor can be transformed into a `.bedGraph` and `.coverage` file using the option `--bedGraph`. These files can be used for [visualization](#Visualization-using-IGV-(Integrative-Genomics-Viewer)) later using tools such as Integrative Genomics Viewer. For the `.bedGraph` format, there will be 4 columns: chromosome, start position, end position, methylation percentage. For `.coverage` (bismark.cov.gz) format, there will be two additional columns: chromosome, start position, end position, methylation percentage, count methylated, count unmethylated. 

**M-bias plot data**. A methylation bias plot (M-bias plot) shows the methylation proportion across each possible position in the read. The data for the M-bias plot is written into a text file and is in the following format: 
\<read position\> \<count methylated\> \<count unmethylated\> \<% methylation\> \<total coverage\>. The plot will be generated in the next step (step 8) using `bismark2report`. M-bias plot can reveal methylation bias at certain positions of reads. For example, a 3’-end-repair bias at the first couple of positions in read 2 of paired-end reads (SRX202087):  
> <img src="images/1_mbias.png" width="700"/>

`--ignore_r2 2` ignores the first 2 bp from the 5' end of Read 2 in paired-end sequencing results only. The first couple of bases in Read 2 of BS-Seq experiments show a severe bias towards non-methylation as a result of end-repairing sonicated fragments with unmethylated cytosines (see M-bias plot). These artificial cytosines, incorporated during the end-repair step of library preparation, decrease the methylation level at the corresponding positions. It is therefore recommended that the first couple of bp of Read 2 are removed before starting downstream analysis.
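
One simple way to reason about how many positions to ignore is to compare each read position's methylation percentage with the typical value across the read. The sketch below flags positions that deviate from the median by more than a threshold. The counts are made up, and both the 10-point threshold and the helper name `biased_positions` are assumptions for illustration; in practice the M-bias plot should be inspected visually.

```python
from statistics import median


def biased_positions(rows, max_deviation=10.0):
    """Flag read positions whose % methylation deviates from the median.

    `rows` holds (read position, count methylated, count unmethylated).
    """
    percentages = {
        pos: 100.0 * meth / (meth + unmeth)
        for pos, meth, unmeth in rows
        if meth + unmeth > 0
    }
    if not percentages:
        return []
    typical = median(percentages.values())
    return [pos for pos, pct in sorted(percentages.items())
            if abs(pct - typical) > max_deviation]


# Made-up Read 2 counts with depressed methylation at the first two positions.
rows = [(1, 20, 80), (2, 35, 65), (3, 70, 30),
        (4, 72, 28), (5, 71, 29), (6, 69, 31)]
flagged = biased_positions(rows)
print(flagged)  # [1, 2]
```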

All the results from this step will be saved in directory Tutorial_1/bismark as well.

In [None]:
! bismark_methylation_extractor --gzip --bedGraph --ignore_r2 2 -o Tutorial_1/bismark Tutorial_1/bismark/*deduplicated.bam

You can check the output files using the `zcat` command, since they are compressed. For example, the following two commands show the first 10 rows of the output methylation levels of sample SRX202087 in `.bedGraph` and `.bismark.cov` format:

In [None]:
! zcat Tutorial_1/bismark/SRX202087_R1_val_1_bismark_bt2_pe.deduplicated.bedGraph.gz | head -n 10

In [None]:
! zcat Tutorial_1/bismark/SRX202087_R1_val_1_bismark_bt2_pe.deduplicated.bismark.cov.gz | head -n 10
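
For downstream work it is often convenient to load the coverage file into Python. The sketch below reads the six columns described above and keeps only positions with a minimum read depth. It runs on a small made-up file so that it is self-contained; the helper names and the depth threshold of 5 are assumptions for illustration.

```python
import gzip
import os
import tempfile


def read_coverage(path):
    """Read a gzip-compressed, six-column Bismark coverage file."""
    records = []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            chrom, start, end, pct, meth, unmeth = line.rstrip("\n").split("\t")
            records.append({
                "chrom": chrom, "start": int(start), "end": int(end),
                "pct": float(pct), "meth": int(meth), "unmeth": int(unmeth),
            })
    return records


def well_covered(records, min_depth=5):
    """Keep positions covered by at least `min_depth` calls."""
    return [r for r in records if r["meth"] + r["unmeth"] >= min_depth]


# Made-up demo file in the same layout as *.bismark.cov.gz.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo.bismark.cov.gz")
    with gzip.open(demo, "wt") as handle:
        handle.write("6\t4000100\t4000100\t75\t3\t1\n")
        handle.write("6\t4000250\t4000250\t10\t1\t9\n")
    records = read_coverage(demo)

print(well_covered(records))
```

To use it on real output, pass one of the `.bismark.cov.gz` paths from `Tutorial_1/bismark` to `read_coverage`.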

#### Step 8. Generate report and summary 
The command `bismark2report` will generate a graphical HTML report for each sample (including the M-bias plot), using the reports generated from the previous steps (alignment, de-duplication, methylation extraction). The command `bismark2summary` will produce a summary HTML report for all the samples in the directory. To view the graphs in the HTML reports in Jupyter notebook, after opening the HTML file, you might need to click `trust HTML` in the top left corner of the HTML file.

In [None]:
! cd Tutorial_1/bismark && bismark2report
! cd Tutorial_1/bismark && bismark2summary

In [None]:
# Display the final summary from bismark workflow, including all 4 samples
from IPython.display import IFrame
IFrame(src='Tutorial_1/bismark/bismark_summary_report.txt', width=1000, height=400)

#### Important output files for downstream analysis

The Bismark methylation extractor (step 7) can output files in `.bedGraph` and `.cov` format with the methylation levels (percentage of methylated cytosine) at each position. These files can be used as input for downstream analysis, such as identification of differentially methylated regions.


---
## **Visualization using IGV (Integrative Genomics Viewer)**

The __[Integrative Genomics Viewer (IGV)](https://software.broadinstitute.org/software/igv/)__ is an interactive tool for the visual exploration of genomic data. It supports flexible integration of all the common types of genomic data and metadata. IGV supports many different file formats, such as .bam, .bed, GFF/GTF, .fasta. For a full list of the **file formats** IGV supports, please visit https://software.broadinstitute.org/software/igv/FileFormats. 

IGV can be downloaded as a desktop application, and it also has a JavaScript version that can embed IGV in the web apps. The `igv-notebook` we are going to use in this tutorial is a Python package which wraps igv.js for embedding it in an IPython notebook. 

#### Basic usage
- Select reference genome - IGV hosts dozens of genomes and you can load other genomes too
- Load data tracks
- Navigate
    - Zoom in/out - from whole genome view to base pair resolution
    - Scroll/pan - view neighboring regions
    - Jump to locus - enter coordinates or name

#### Install `igv-notebook` 

Install igv-notebook. The kernel we are using is the default Python 3 (ipykernel) kernel. We are installing igv-notebook with `pip` below:

In [None]:
! pip install igv-notebook

#### Initialize IGV
Create a browser "b", showing the mouse reference genome mm39 at a region of chromosome 6. You can change the settings in the browser interactively. The output should look like this:  
> <img src="images/1_igv1.png" width = "900"/>

In [None]:
import igv_notebook
igv_notebook.init()

b = igv_notebook.Browser(
    {
        "genome": "mm39",
        "locus": "chr6:4,733,000-4,749,000"
    }
)

### Load data tracks
IGV displays data in horizontal rows called **tracks**. Typically, each track represents one sample or experiment. Track names are listed in the far-left panel. Legibility of the names depends on the height of the tracks, i.e., the smaller the track the less legible the name. There are different types of tracks (different file formats) that IGV can display:  
- **Data tracks** display numeric values, such as the methylation levels in our tutorial
- **Feature tracks** identify genomic features. For an example, see the Refseq Genes track, which IGV loads when you select a genome.
- **Alignment tracks** display alignments

#### Indexing alignment 

IGV requires that both SAM and BAM files be **sorted** by position and **indexed**, and that the index files follow a specific naming convention. Specifically, a BAM index file should be named by appending `.bai` to the BAM file name. 

To view the alignment files generated in Step 6 of this tutorial, we first need to sort the alignment files (.bam) and then index them to generate the index (.bai) files (the files will be found in the `Tutorial_1/bismark` directory):

In [None]:
! samtools sort Tutorial_1/bismark/SRX271141_R1_val_1_bismark_bt2_pe.deduplicated.bam -o Tutorial_1/bismark/SRX271141.sort.bam
! samtools index Tutorial_1/bismark/SRX271141.sort.bam

You can use the following code to sort and index all the alignments generated in this tutorial:
```bash
# Sort and index the alignment files
! for i in Tutorial_1/bismark/*deduplicated.bam; do samtools sort "$i" -o "$i.sort.bam"; samtools index "$i.sort.bam"; done
# Rename the output files for shorter names
! for f in Tutorial_1/bismark/*sort.bam* ; do mv "$f" "$(echo "$f" | sed 's/_R1_val_1_bismark_bt2_pe\.deduplicated\.bam//')"; done
```
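
If you prefer to drive samtools from Python, the same sort-then-index sequence can be expressed as a pair of commands per file, writing directly to the short file name so that no renaming step is needed. This is a sketch that only builds and prints the commands; `sort_and_index_commands` is a hypothetical helper, and the suffix is the one Bismark produced for the files in this tutorial.

```python
from pathlib import Path

# Suffix of the de-duplicated alignment files produced earlier in this tutorial.
SUFFIX = "_R1_val_1_bismark_bt2_pe.deduplicated.bam"


def sort_and_index_commands(bam_path):
    """Return the samtools sort and index commands for one BAM file."""
    bam = Path(bam_path)
    sample = bam.name[: -len(SUFFIX)] if bam.name.endswith(SUFFIX) else bam.stem
    sorted_bam = str(bam.with_name(f"{sample}.sort.bam"))
    return [
        ["samtools", "sort", str(bam), "-o", sorted_bam],
        ["samtools", "index", sorted_bam],
    ]


commands = sort_and_index_commands(
    "Tutorial_1/bismark/SRX271141_R1_val_1_bismark_bt2_pe.deduplicated.bam"
)
for command in commands:
    print(" ".join(command))
```

Each command list could then be executed with `subprocess.run(command, check=True)`.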

#### Load and view the alignment file (and its index) to igv-notebook.

Start a new browser "b2", and load the alignment .bam and its index file .bai into the browser:

In [None]:
b2 = igv_notebook.Browser(
    {
       "genome": "mm39",
       "locus": "chr6:4,733,000-4,749,000"
    }
)
b2.load_track({
    "name": "SRX271141",
    "path": "Tutorial_1/bismark/SRX271141.sort.bam",
    "indexPath": "Tutorial_1/bismark/SRX271141.sort.bam.bai",
    "format": "bam",
    "type": "alignment",
})


By default, IGV dynamically calculates and displays the default **coverage track** for an alignment file. When IGV is zoomed to the alignment read visibility threshold (by default, 30 KB), the coverage track displays the depth of the reads displayed at each locus as a gray bar chart. If a nucleotide differs from the reference sequence in greater than 20% of quality weighted reads, IGV colors the bar in proportion to the read count of each base (A, C, G, T). You can left-click the bar to see the actual percentages.

<img src="images/1_igv2.png" width="1000" />

For the **alignment tracks**, IGV uses color and other visual markers to highlight potential genetic alterations in reads against a reference sequence. Genetic alterations include single nucleotide variations, structural variations, and aneuploidy. Structural variations include insertions, deletions, inversions, tandem duplications, translocations, and other more complex rearrangements. You can explore more options by clicking the pop-up menu for the alignment track on the right side. For a more detailed explanation of read colors, please visit https://software.broadinstitute.org/software/igv/AlignmentData. 

#### View the methylation levels (.bedgraph files)

The bedGraph format allows display of continuous-valued data in track format. This display type is useful for probability scores and transcriptome data. For more information on this file format, see the UCSC Genome Bioinformatics web site description at http://genome.ucsc.edu/goldenPath/help/bedgraph.html.

Here, you can load the `.bedGraph` files generated in this tutorial to view the methylation percentage at each position. IGV can read gzip-compressed (.gz) files, so there's no need to decompress them. Colors can be changed in the code with the keyword `color`, or in the track pop-up menu -> "set track color". 

In [None]:
b3 = igv_notebook.Browser(
    {
       "genome": "mm39",
        "locus": "chr6:4,500,000-5,500,000"
    }
)
b3.load_track({
    "name": "SRX271141_bed",
    "path": "Tutorial_1/bismark/SRX271141_R1_val_1_bismark_bt2_pe.deduplicated.bedGraph.gz",
    "mode": "bisulfite" ,
    "color": "orange"
})
b3.load_track({
    "name": "SRX202087_bed",
    "path": "Tutorial_1/bismark/SRX202087_R1_val_1_bismark_bt2_pe.deduplicated.bedGraph.gz",
    "mode": "bisulfite" ,
    "color": "orange"
})

b3.load_track({
    "name": "SRX271142_bed",
    "path": "Tutorial_1/bismark/SRX271142_R1_val_1_bismark_bt2_pe.deduplicated.bedGraph.gz",
    "mode": "bisulfite" ,
    "color": "blue"
})
b3.load_track({
    "name": "SRX202088_bed",
    "path": "Tutorial_1/bismark/SRX202088_R1_val_1_bismark_bt2_pe.deduplicated.bedGraph.gz",
    "mode": "bisulfite" ,
    "color": "blue"
})
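
The four `load_track` calls above differ only in the sample name and color, so the track configurations can also be generated in a loop. The sketch below builds the same dictionaries without requiring a running browser; `bedgraph_track` is a hypothetical helper, and the sample-to-color mapping simply mirrors the cell above.

```python
# Sample -> color, mirroring the tracks loaded above.
SAMPLE_COLORS = {
    "SRX271141": "orange",
    "SRX202087": "orange",
    "SRX271142": "blue",
    "SRX202088": "blue",
}


def bedgraph_track(sample, color, directory="Tutorial_1/bismark"):
    """Build the track configuration for one sample's bedGraph file."""
    return {
        "name": f"{sample}_bed",
        "path": f"{directory}/{sample}_R1_val_1_bismark_bt2_pe.deduplicated.bedGraph.gz",
        "mode": "bisulfite",
        "color": color,
    }


tracks = [bedgraph_track(sample, color) for sample, color in SAMPLE_COLORS.items()]
for track in tracks:
    print(track["name"], track["color"])
    # In the notebook, with the browser b3 created above: b3.load_track(track)
```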

In [None]:
# Zoom in to see a specific region in the above browser
b3.search('chr6:4,720,000-4,800,000')

The above methylation profiles are in line with the authors' conclusion that the covered cytosines are mostly methylated in male serum ESCs (SRX271141, SRX202087), while methylation in the 2i ESCs and serum ESCs (SRX271142, SRX202088) is much reduced. 

Summary from the publication: "The use of two kinase inhibitors (2i) enables derivation of mouse embryonic stem cells (ESCs) in the pluripotent ground state. Using WGBS, we show that male 2i ESCs  are globally hypomethylated compared to conventional ESCs maintained in serum." -- [Habibi, Ehsan, et al. Cell stem cell 13.3 (2013): 360-369](https://pubmed.ncbi.nlm.nih.gov/23850244/)

## Terms & Quiz

In [None]:
! pip install jupytercards --quiet
from jupytercards import display_flashcards
display_flashcards('../quiz_files/f1.json')

In [None]:
! pip install jupyterquiz --quiet
from jupyterquiz import display_quiz
display_quiz('../quiz_files/q1-1.json')

## Conclusion

This tutorial provided a comprehensive walkthrough of the Bismark pipeline for WGBS data analysis within the AWS SageMaker notebook. We successfully demonstrated the process from raw FASTQ data to methylation calls, incorporating quality control steps with FastQC and MultiQC, adapter trimming with Trim Galore!, genome preparation and alignment with Bismark, deduplication, methylation extraction, and report generation. Finally, we visualized the methylation data using IGV, showcasing the hypomethylation observed in 2i-derived mouse ESCs compared to conventionally cultured ESCs, aligning with the findings of Habibi et al. (2013). This workflow provides a robust foundation for further downstream analyses, such as differential methylation analysis, which will be explored in subsequent tutorials. The generated methylation profiles (.bedGraph files) are ready for use in the next tutorial, highlighting the efficiency and reproducibility of this cloud-based approach.

## Clean Up

Transfer useful results back to an Amazon S3 bucket, and remove intermediate files. In this tutorial, the most important output files are the methylation profiles (.bedGraph.gz) of each sample. We will use these data to identify differentially methylated regions using metilene in the next tutorial.

For example, you can upload/copy all the .bedGraph.gz files to the bucket you created in Amazon S3. Because `aws s3 cp` takes a single source, copy the directory recursively and filter by file name:     
`! aws s3 cp Tutorial_1/bismark/ s3://BUCKET_NAME/bismark_results/ --recursive --exclude "*" --include "*.bedGraph.gz"`

You can also delete the whole directory with all the files generated in this notebook using:
`! rm -rf Tutorial_1`

<div class="alert alert-block alert-danger">
    <i class="fa fa-exclamation-circle" aria-hidden="true"></i>
    <b>Don't forget:</b> after you finish running the notebook, stop the notebook instance in AWS SageMaker to avoid accumulating costs.
</div>