# Big Data for Biologists: Decoding Genomic Function- Class 6

## Introduction to the genome browser: High throughput sequencing and RNA sequencing

##  Learning Objectives
***Students should be able to***
 <ol>
 <li><a href=#Qualityscores>Interpret quality scores from the output of a read mapping and alignment algorithm</a></li>
 <li><a href=#GeneSeqinBrowser> Introduction to the WashU Genome Browser</a></li>
 <li><a href=#GeneSeqinBrowser> View a gene sequence in the WashU Genome Browser</a></li>
 <li><a href=#HighThroughput> Understand what high-throughput DNA sequencing is</a></li>
 <li><a href=#Applications> Explain how an RNA-Seq experiment works and what it measures</a></li>
 <li><a href=#RNASeqinBrowser> Determine what cell type a gene is likely expressed in by viewing results of RNA-Seq experiments </a></li>
 </ol>
  

## Interpret quality scores from the output of a read mapping and alignment algorithm <a name='Qualityscores' />

We ended the last class discussing the Bowtie2 algorithm that can be used to create an index for a reference genome. 

We used a small, 1000bp fragment of the genome to simplify the computation. 

After the index file is created, Bowtie2 also has an algorithm to map the sequencing reads and run alignments specifically at the sites that are identified during the mapping phase.  

In the code block below, we will re-run the Bowtie2 indexing algorithm and then will run the Bowtie2 mapping and alignment algorithm. 

In [1]:
!cat samples.fasta

>read0
GTTTATTCTAAATAGATGTGTAGAAATAACAGTTGTTTCACAGGAGACTA
>read1
gggcaaaacagttaaacttacagttcataCATAAGGAGAATCAGTCTTTT
>read2
ATATCCTGATTTCTTCCATAGCTTGGATCTTGACCTAGAGGGAAATATAA
>read3
GATTCCTGATTCTTTAATGTTTTCTAAAAAAGCAAAACAAACAAACAAAC
>read4
TACATGGCTAGGTAGACTTTTAGAAAACTTGGCTGCTCTAGAAAATTGAC
>read5
attgagaagtggaatctaataaaacaatagcttctgcacagcaaaagaag
>read6
aagagctggattcaactctactgactcttattaatcatgattttgggcac
>read7
ctatttgggtttttttttaaagtttggctgggtgcagcggctcacgcctg
>read8
tgtgtgtgtgtgtgAAAGACAGAAGAAAGAGGGAGACCTTAGAAGACTAT
>read9
TTCTAAATTGCACACTTTGATTCAAAAGAAACAGTCCAACCAACCAGTCA
>read10
ATCGATCGATATCGATCGATATCGATCGATATCGATCGATATCGATCGAT


In [1]:
#Import the alignment helper functions 
#from the alignment module in the helpers directory
import sys
sys.path.append('../helpers')

import alignment 
from alignment import * 

#Extract the first 2201 lines of the reference fasta file (the header 
#line plus 2200 lines of sequence) so that the fragment extends past 
#the leading lines of N's and includes lines of real sequence data. 
!head -n 2201 /opt/data/hg19.genome.fa > reference.fasta 

#Run the Bowtie2 indexing algorithm to create an index
bowtie_index("reference.fasta","reference")

#Run the Bowtie2 mapping and alignment algorithm, which uses the index
align("samples.fasta","reference","aligned")


['bowtie2', '-x', 'reference', '-f', '-U', 'samples.fasta', '-S', 'aligned.sam']
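
The printed list above is the Bowtie2 command that the `align` helper runs. As a rough sketch of what helpers like `bowtie_index` and `align` might do internally, the code below builds the equivalent command lines. The function names `build_index_command`, `build_align_command`, and `run_if_available` are hypothetical and are not part of the course helpers. The sketch covers only the Bowtie2 calls, not the conversion of the SAM output to a BED file.

```python
import shutil
import subprocess


def build_index_command(reference_fasta, index_prefix):
    """Command that builds a Bowtie2 index from a FASTA reference."""
    return ["bowtie2-build", reference_fasta, index_prefix]


def build_align_command(reads_fasta, index_prefix, output_prefix):
    """Command that maps and aligns unpaired FASTA reads against an index.

    -x  basename of the index
    -f  the reads are in FASTA format
    -U  file of unpaired reads
    -S  file to write the SAM output to
    """
    return ["bowtie2", "-x", index_prefix, "-f",
            "-U", reads_fasta, "-S", output_prefix + ".sam"]


def run_if_available(command):
    """Run the command only if its program is installed (hypothetical helper)."""
    if shutil.which(command[0]) is None:
        print("Not found on PATH; would run:", " ".join(command))
        return None
    return subprocess.run(command, check=True)


index_cmd = build_index_command("reference.fasta", "reference")
align_cmd = build_align_command("samples.fasta", "reference", "aligned")
print(index_cmd)
print(align_cmd)
```

Calling `run_if_available(index_cmd)` followed by `run_if_available(align_cmd)` would execute the two steps on a machine where Bowtie2 is installed.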


The align function prints the Bowtie2 command that it runs and generates an output BED file called aligned.bed. Let's examine the contents of this file:

In [2]:
!cat aligned.bed 

chr1	99951	100001	AGTCGGAGAGCTGGGGTCCTCCCAGCCCTCTTGGCCCTGTGGCCAATTTT	42	+
chr1	87251	87301	GATCACTTGAGGCCAGGAATTCAAGACCAGCGTGGCTAACATGGCGAAAC	42	+
chr1	93051	93101	AAAATTAATAAAATAAGAAGCCAAAAAACAGATCAAATCAGTAAACCAAA	42	+
chr1	73451	73501	CTCTTTTTATAAAATCGAAGCATTATTACTTACTCTCTTGTTAACCTATC	42	+
chr1	20451	20501	GTACAACTACCCTGCCCCCCACCTGACGACTTCAATAAGAAGTAGCCCAG	1	+
chr1	75501	75551	TGATAATGCTACCGGCAAATTCTGTTGTTTGTATAAACATCAGCCATGTT	1	+
chr1	37451	37501	GATGATATCTCATTGTGGTTTTGATTTGCATTTCTCTGATGGCCAGTGAT	1	+
chr1	109751	109801	CAGGGGAGCTGGATCTGAGCCAAGGCATCAACTCCAAGGTAACCCCTCAG	1	+
chr1	98851	98901	GGGAATGGTTTTGGCCTCCATTCTAAGTGCTGGACATGGGGTGGCCATAA	1	+
chr1	20551	20601	TCTCTTAAGGTCCAGCACGAGGTGGAGCACATGGTGGAGAGACAGATGCA	1	+
*	0	0	ATCGATCGATATCGATCGATATCGATCGATATCGATCGATATCGATCGAT	0	+


We observe that 6 of the 11 reads map to more than one position in the reference genome. This is because more than half of the human genome consists of repeats. For example, at the beginning of chromosome 1 we see many instances of the sequence taccc. Note that lower-case letters are often used in a FASTA reference sequence to indicate repetitive or low-complexity regions. This practice is called "soft masking" (in contrast, recall that the use of "N" to indicate non-sequenced regions of the genome is referred to as "hard masking"). 


<img src="../Images/5-Repeats.png" style="width: 50%; height: 50%" align="center" />
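
To make soft masking concrete, here is a minimal sketch that measures what fraction of a sequence is written in lower case. It uses four of the reads from samples.fasta and assumes that their letter case was carried over from the soft-masked reference. The function name `soft_masked_fraction` is hypothetical.

```python
def soft_masked_fraction(seq):
    """Fraction of bases written in lower case (soft-masked)."""
    if not seq:
        return 0.0
    lowercase = sum(1 for base in seq if base.islower())
    return lowercase / len(seq)


# Four reads copied from samples.fasta
reads = {
    "read0": "GTTTATTCTAAATAGATGTGTAGAAATAACAGTTGTTTCACAGGAGACTA",
    "read1": "gggcaaaacagttaaacttacagttcataCATAAGGAGAATCAGTCTTTT",
    "read5": "attgagaagtggaatctaataaaacaatagcttctgcacagcaaaagaag",
    "read8": "tgtgtgtgtgtgtgAAAGACAGAAGAAAGAGGGAGACCTTAGAAGACTAT",
}

for name, seq in reads.items():
    fraction = soft_masked_fraction(seq)
    print(f"{name}: {fraction:.2f} of bases are lower case")
```

A read that is entirely lower case lies fully inside a masked region, while a mixed-case read spans the boundary between masked and unmasked sequence.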

Each line of aligned.bed describes one read: 

* The first column contains the name of the chromosome in reference.fasta to which the read aligned. 

* The second and third columns contain, respectively, the start and end coordinates of the aligned read along the reference. 

* The fourth column contains the sequence of the read from samples.fasta. 

* The fifth column contains the mapping quality, which is a number between 0 and 255. Higher numbers indicate greater confidence that the read has been placed at the correct location in the reference. Notice that this value is 42 for the four reads that mapped to only a single location in the reference sequence, and 1 for the six reads that mapped to multiple locations in the reference. It is 0 for the read that did not map to the reference. 

* The sixth column contains strand information -- all of our mapped reads align to the positive strand of the reference. 
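
The mapping quality can be read as a probability. The SAM format defines it as a Phred-scaled value, MAPQ = -10 * log10(probability that the mapping position is wrong), and reserves 255 to mean that no quality is available. The sketch below parses three of the lines from aligned.bed and converts each score to that probability. Treating Bowtie2's scores as exact probabilities is an assumption, since aligners estimate mapping quality heuristically, and the function names are hypothetical.

```python
# Three of the aligned.bed lines shown above (tab-separated), embedded
# here so the sketch runs without the file.
BED_LINES = [
    "chr1\t99951\t100001\tAGTCGGAGAGCTGGGGTCCTCCCAGCCCTCTTGGCCCTGTGGCCAATTTT\t42\t+",
    "chr1\t20451\t20501\tGTACAACTACCCTGCCCCCCACCTGACGACTTCAATAAGAAGTAGCCCAG\t1\t+",
    "*\t0\t0\tATCGATCGATATCGATCGATATCGATCGATATCGATCGATATCGATCGAT\t0\t+",
]


def parse_bed_line(line):
    """Split one six-column line of aligned.bed into named fields."""
    chrom, start, end, seq, mapq, strand = line.rstrip("\n").split("\t")
    return {"chrom": chrom, "start": int(start), "end": int(end),
            "seq": seq, "mapq": int(mapq), "strand": strand}


def mapq_to_error_prob(mapq):
    """Convert a Phred-scaled mapping quality to P(mapping is wrong)."""
    return 10 ** (-mapq / 10)


records = [parse_bed_line(line) for line in BED_LINES]
for rec in records:
    if rec["chrom"] == "*":
        # "*" in the chromosome column marks a read that did not map
        print("unmapped read:", rec["seq"][:10] + "...")
    else:
        p_wrong = mapq_to_error_prob(rec["mapq"])
        print(f"{rec['chrom']}:{rec['start']}-{rec['end']} "
              f"MAPQ={rec['mapq']} P(wrong)~{p_wrong:.2g}")
```

Under this reading, a score of 42 corresponds to a very small chance that the read is misplaced, while a score of 1 means the reported position is more likely wrong than right, which fits a read that matches several locations equally well.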

## How can I view a gene in the WashU Epigenome Browser? <a name='GeneSeqinBrowser' />

The WashU Epigenome Browser provides a way to visualize the results of RNA-seq experiments. 
You can access the browser here:


<a href="http://epigenomegateway.wustl.edu/browser/">http://epigenomegateway.wustl.edu/browser/</a>

First, you will see a screen asking you to select the source organism for your dataset. Click on "human hg19", as this is the reference that we have been using for our analysis. 

<img src="../Images/5-Browser1.png" style="width: 50%; height: 50%" align="center" />

The next screen will give you three options: 

* Custom tracks -- This allows you to upload your own aligned datasets for analysis. 

* Public hubs -- This allows you to visualize public datasets generated by other labs. 

* Genome Browser -- This is the simplest view -- it contains the annotations for genes along the hg19 reference. 

Let's begin by selecting 'Genome Browser'. 


<img src="../Images/5-Browser2.png" style="width: 50%; height: 50%" align="center" />


You will see a display of the genome, zoomed to a random region on chromosome 7: 

<img src="../Images/5-Browser3.png" style="width: 50%; height: 50%" align="center" />



The browser enables you to navigate across the genome and select regions or genes of interest to examine in more detail.

<img src="../Images/6-BrowserTutorial1.png" style="width: 80%; height: 80%" align="center" />


Hovering the mouse over a gene track will highlight additional information about the gene. 


<img src="../Images/6-BrowserTutorial2.png" style="width: 80%; height: 80%" align="center" />


## Video tutorials for the WashU Genome Browser 

### Introduction

In [3]:
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/y-qo2rVQakY" frameborder="0" allowfullscreen></iframe>')

### Navigating the Browser 

In [4]:
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/RrnYhwV1Y7Y" frameborder="0" allowfullscreen></iframe>')


### Exercise

Q1: Use the browser to find the insulin gene (INS) that we used in classes 1-3. 

To locate the gene, type "INS" into the gene lookup field. 

For a gene displayed in the browser, you should be able to identify: 
- transcript identifiers
- exons
- introns
- 5'UTR
- 3'UTR
- coding regions
- strand

Q2: Examine the NEUROD1 gene. What is peculiar about this gene?

## Introduction to high-throughput DNA sequencing  <a name='HighThroughput' />

In [7]:
HTML('<iframe src="https://drive.google.com/file/d/0B_ssVVyXv8ZSWHUxSmxJREVsYUU/preview" width="1000" height="1200"></iframe>') 

### Video tutorial on Illumina sequencing by synthesis 

In [6]:
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/fCd6B5HRaZ8" frameborder="0" allowfullscreen></iframe>')

## How can we use sequencing to detect and quantify RNA expression of all genes (RNA-sequencing)? <a name='Applications' />

In [13]:
HTML('<iframe src="https://drive.google.com/file/d/0B_ssVVyXv8ZSOXlINi1XbzFIZFU/preview" width="1000" height="1200"></iframe>') 



### RNA sequencing tutorial 

In [10]:
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/tlf6wYJrwKY" frameborder="0" allowfullscreen></iframe>')

## Determine what tissue a gene is likely expressed in by viewing results of RNA-Seq experiments <a name='RNASeqinBrowser' />

Load browser session: http://epigenomegateway.wustl.edu/browser/?genome=hg19&session=WQuTWqS74L&statusId=888160560

Click on "Humbio51-lect6-rnaseq" 

Let's examine the following genes: 

* MYOD1
* INS

#### EXERCISE

Which tissues are the remaining four genes expressed in?  

* NEUROD1 

**Your answer**: 

* SPI1 

**Your answer**: 

* HNF4A 

**Your answer**: 

* GTF2B

**Your answer**: 
