# Big Data for Biologists: Decoding Genomic Function - Class 5

## How do you measure gene expression levels using RNA-Seq?

##  Learning Objectives
***Students should be able to***
 <ol>
 <li><a href=#RNASeqIntro> Describe what an RNA-Seq experiment is and what can be measured</a></li>
 <li><a href=#FASTQ>Recognize FASTQ file format</a></li>
 <li><a href=#MapSeqReads>Map  sequence reads to the human reference genome </a></li>
<li><a href=#RNASeqinBrowser> View results of RNA-Seq experiments in the WashU Epigenome Browser</a></li>
 <li><a href=#BAMReadPileups> Explain what a read pileup is and examine read pileups in .bam files</a></li>
 <li><a href=#CellTypeDifferences> Recognize differential use of transcription start sites, splice sites, and expression levels between cell types</a></li>
 <li><a href=#RNASeqDataFormat>Recognize that data from an RNA-Seq experiment can be processed and stored in a format that can be read into Python</a></li>
 <li> <a href=#IDHeaderSep>Identify the separator and header in a data table</a></li> 
 <li><a href=#LoadTable> Load a data table in .csv or .tsv format into Python</a></li>
 <li> <a href=#DataTableDim>Get the dimensions of a data table in Python  </a></li>
 </ol>


## What is an RNA-Seq experiment and what can be measured? <a name='RNASeqIntro' />


## Mapping sequencing reads to a reference genome<a name='MapSeqReads' />

 
* Comparing sequences to a reference genome is called **short-read sequence alignment** or **read mapping**.

First of all, what are short reads versus long reads and why are we talking about short-read alignment?

The original DNA sequencing methods that were used for the Human Genome Project were called Sanger sequencing and could produce sequences that were >1000 base pairs. These long sequences were very helpful in the construction of the original human reference genome. 

Researchers are still developing technologies to make long sequencing reads less expensive to obtain and to extend the length of the reads even further. 

These longer reads can help provide information in parts of the genome that have been difficult to sequence. For a recent story of how new technologies for obtaining long sequencing reads are helping in the clinic, see [here](https://med.stanford.edu/news/all-news/2017/06/researchers-use-long-read-genome-sequencing-in-a-patient.html).

However, most next-generation sequencing machines produce **short reads**: sequences that are shorter than about 200 base pairs ([Ref](http://www.nature.com/nbt/journal/v30/n11/full/nbt.2421.html)). Short-read sequencing is currently less expensive and more common than long-read sequencing. 

Since there is already a human reference genome, short reads from DNA sequencing experiments can be aligned to the reference genome and can help to define genetic variation in populations. 

As we'll see in the next class, short reads from sequencing experiments known as RNA-Seq and ChIP-Seq can also be aligned to the human reference genome to help define gene expression or gene regulatory regions in different cell types or across different conditions.
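
To make the idea of read mapping concrete, here is a minimal sketch that looks up short reads in a tiny reference by exact string matching. The reference string, the reads, and the function names `find_all` and `classify_read` are all made up for illustration. Real alignment programs work on indexed genomes and tolerate mismatches, so this is only a toy model of the idea, not how they are implemented.

```python
def find_all(reference, read):
    """Return every 0-based start position where `read` matches `reference` exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Continue searching one base further along so overlapping hits are found too.
        start = reference.find(read, start + 1)
    return positions


def classify_read(reference, read):
    """Label a read by how many places it matches in the reference."""
    hits = find_all(reference, read)
    if len(hits) == 0:
        label = "unmapped"
    elif len(hits) == 1:
        label = "uniquely mapped"
    else:
        label = "multi-mapped"
    return label, hits


# Hypothetical toy data (not taken from a real genome).
toy_reference = "ACGTTGCAGGCTAACGTTGCATTTGACC"
toy_reads = ["GGCTAA", "ACGTTGCA", "CCCCCC"]

for read in toy_reads:
    label, hits = classify_read(toy_reference, read)
    print(read, label, hits)
```

In this toy example the first read matches in exactly one place, the second matches in two places (a repeated stretch of the reference), and the third does not match anywhere.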

TODO: wrapper function with small fastq as I/O (3 single-mapped, 3 multi-mapped, 3 non-specific mapping) 

## Introduction to FASTA and FASTQ data formats<a name='FASTQ' />

A common format for the output of files from DNA sequencing machines, and the one that we'll be using in this class, is called FASTQ. 

You have already seen data in the FASTA format. The first line contains the sequence label, preceded by ">". The second line contains the actual sequence bases (A,C,G,T): 

**>FORJUSP02AJWD1** 

**CCGTCAATTCATTTAAGTTTTAACCTT**

FASTQ format takes this a step further by including sequence quality information. 

The sequence quality information is first calculated as numeric scores (known as [Phred Scores](https://en.wikipedia.org/wiki/Phred_quality_score)), but is written in FASTQ files as single characters, known as ASCII characters. 

The advantage of using [ASCII Characters](http://www.asciitable.com/) is that 94 numbers can be represented with single characters.  

For example, in the sequence below, ":" is the ASCII character with code 58, which encodes a quality score of 58 - 33 = 25. A two-digit number wouldn't align well underneath a single T, whereas ":" takes up only one space and aligns with the T base. 

<img src="images/fastq_fig.jpg" align="center"/>
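
Each read in a FASTQ file takes up four lines: a label starting with "@", the sequence bases, a separator line starting with "+", and the quality characters (one per base). The sketch below splits one such record into its parts. The function name `parse_fastq_record` and the example record are hypothetical, and the sketch assumes the simple four-line layout with no line wrapping.

```python
def parse_fastq_record(lines):
    """Split the four lines of one FASTQ record into a small dictionary."""
    header, sequence, separator, quality = [line.strip() for line in lines]
    if not header.startswith("@"):
        raise ValueError("the first line of a record should start with '@'")
    if not separator.startswith("+"):
        raise ValueError("the third line of a record should start with '+'")
    if len(sequence) != len(quality):
        raise ValueError("there should be one quality character per base")
    return {"id": header[1:], "sequence": sequence, "quality": quality}


# A made-up record used only for illustration.
record_lines = [
    "@example_read_1",
    "CCGTCAATTCATTTAAG",
    "+",
    "A:99@::??@@::FFAA",
]

record = parse_fastq_record(record_lines)
print(record["id"])
print(record["sequence"])
print(record["quality"])
```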



In [1]:
## You can convert the ASCII-encoded quality values to numeric Q scores with the 'ord' function. You must subtract 33
## from the converted value to obtain a Q score

quality_ascii='A:99@::??@@::FFAA'
numerical=[ord(c)-33 for c in quality_ascii]
print(numerical)

[32, 25, 24, 24, 31, 25, 25, 30, 30, 31, 31, 25, 25, 37, 37, 32, 32]
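
A Phred score Q is related to the estimated probability P that a base call is wrong by Q = -10 * log10(P), so higher scores mean more reliable bases. As a small sketch, the hypothetical helper `phred_to_error_probability` below turns Q scores back into error probabilities.

```python
def phred_to_error_probability(q_score):
    """Convert a Phred quality score into the estimated probability of a wrong base call."""
    return 10 ** (-q_score / 10)


# Q = 10, 20, 30 correspond to roughly 1 error in 10, 100, and 1000 base calls.
for q in [10, 20, 30]:
    print(q, phred_to_error_probability(q))
```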


## How can I view the results of RNA-Seq experiments in the WashU Epigenome Browser? <a name='RNASeqinBrowser' />

>EDITS?
Maybe insert a link that leads directly to a particular example to look at. 
In the last class they used the gene for insulin and used bedtools to pull out the exon sequence, so it could be interesting for them to see how that gene is on in pancreatic cells but not in other cell types, and also how you can define the exon-intron boundaries with the expression data. 
Then could diversify with a different interesting example for transcription start sites, splice sites and expression levels differing between cell types? 


In [1]:
from IPython.display import IFrame

IFrame("http://egg2.wustl.edu/roadmap/web_portal/processed_data.html",height=500,width=800)

## Explain what a read pileup is and examine read pileups in .bam files <a name='BAMReadPileups' />

RNA-seq datasets show gene expression levels. 
We can visualize 'read pileups' in the browser. 
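
One common way to think about a read pileup is as a count, at every position in a region, of how many aligned reads cover that position. The sketch below computes such counts for a few made-up reads. The function name `pileup` and the read coordinates are hypothetical, and positions are assumed to be 0-based with the end coordinate excluded.

```python
def pileup(read_intervals, region_length):
    """Count how many reads cover each position of a region.

    Each read is a (start, end) pair: 0-based start, end not included.
    """
    depth = [0] * region_length
    for start, end in read_intervals:
        # Only count the part of the read that falls inside the region.
        for position in range(max(start, 0), min(end, region_length)):
            depth[position] += 1
    return depth


# Three made-up reads aligned to a region that is 8 bases long.
toy_alignments = [(0, 4), (2, 6), (3, 7)]
print(pileup(toy_alignments, 8))  # prints [1, 1, 2, 3, 2, 2, 1, 0]
```

Positions covered by many reads show up as tall stacks in the browser, which is why highly expressed exons stand out in RNA-Seq tracks.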

>EDITS 
Would be helpful to have definition of read pileups and, if applicable, explain the connection from read pileups to the single vs paired end sequencing. 


Reads can be sequenced from just the 5' end (single-end sequencing), or from both the 5' and 3' ends (paired-end sequencing). 



![Single End vs. Paired End Sequencing](single_end_paired_end.png)

In [3]:
IFrame('http://epigenomegateway.wustl.edu/browser/?genome=hg19&tknamewidth=150&datahub=http://egg2.wustl.edu/web_portal_cache/752697500.json', width=800, height=800)


## How is data from an RNA-Seq experiment processed and stored in a format that can be read into Python? <a name='RNASeqDataFormat' />

We are now going to look at processed data from a real RNA-Seq experiment. 

>EDITS 
Might be helpful to give them a little background on the particular experiment: is it data downloaded from the browser? From an experiment in the lab? That is, where did the asinh_tpm_minus_sva.tsv file come from? What had to happen (in simplest terms) to go from the read pileups to the data table? Also, we should probably mention here that the data has been pre-normalized.
 

To start analyzing the data from the RNA-Seq experiment one of the first steps is reading the data into a program that can be used for the analysis. 

We'll be using Python and will first need to cover some general information about working with data tables. 

## Identify the separator and header in a data table  <a name='IDHeaderSep' />

Two common formats for data tables are comma separated values (**.csv**) files or tab separated values (**.tsv**) files.

In order to read a data table into a program, you often need to know what format the file is in. One way to check is to just look at the files in a text editor. Below you can see the differences between a file saved in .csv or .tsv format. 

Also, when you read a data table into Python (or R) you often need to specify which row of the file has the column labels. This row is referred to as a **header**.

Sometimes a file has extra lines above the header. In that case you may need to tell the program to skip those lines, since they may not have the same number of columns as the rest of the table and can break the formatting. 

<img src="../Images/6-Tables-CSV-TSV.png" style="width: 100%; height: 100%" align="center"/>
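
Instead of opening a file in a text editor, you can also look at its first line in Python. The sketch below guesses the separator by counting tabs and commas in a header line. The function name `guess_separator` and the example lines are made up, and this simple counting rule is only a rough heuristic that can be fooled, for example by commas inside quoted text.

```python
def guess_separator(header_line):
    """Guess whether a header line is tab-separated or comma-separated."""
    tab_count = header_line.count("\t")
    comma_count = header_line.count(",")
    if tab_count == 0 and comma_count == 0:
        return None  # neither separator was found
    return "\t" if tab_count >= comma_count else ","


# Made-up header lines for illustration.
tsv_header = "gene\tsample_a\tsample_b"
csv_header = "gene,sample_a,sample_b"

print(repr(guess_separator(tsv_header)))  # prints '\t'
print(repr(guess_separator(csv_header)))  # prints ','
```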


## Load a data table in .csv or .tsv format into Python <a name='LoadTable' />

To read our RNA-Seq data table into Python, we are going to be using the <i>pandas</i> package. 

<i>Pandas</i> adds functionality for working with data in Python. You can learn more about <i>pandas</i> at the following [link](http://pandas.pydata.org/). In particular, <i>pandas</i> introduces a data type called the DataFrame, which is a convenient way of working with tables.

After we have imported the <i>pandas</i> package into Python, we can load a .csv or .tsv file with the read_csv or read_table command. 

The RNA-Seq data that we will be using is a .tsv file.

Note that the read command also lets you specify the row number of the header, which in our case is the first line (row 0, because Python starts counting at zero). 

Take a look at reading in a file in the example below. 

In [6]:
# load the pandas package and define an abbreviation (or alias) 
import pandas as pd   

# read_table loads a tabular data file into Python with tab as the default separator
# read_csv loads a tabular data file into Python with comma as the default separator
# header gives the number of the row that will be used for column names

df = pd.read_table(
     filepath_or_buffer='../datasets/RNAseq/asinh_tpm_minus_sva.tsv', 
     header=0)


Thought questions: 

* What would you change in the code above to read in a .csv file?
* How would you need to change the code if your column names were in the third row? 
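
After you have tried the thought questions, one way to check your answers is to experiment on a small table. The sketch below reads a made-up comma-separated table whose column names sit in the third row. The table contents and variable names are hypothetical, and `io.StringIO` is used only so the example runs without a file on disk.

```python
import io

import pandas as pd

# Made-up table contents: two extra lines, then the header, then the data.
csv_text = (
    "notes about this file\n"
    "exported for class\n"
    "gene,sample_a,sample_b\n"
    "gene_1,0.5,1.5\n"
    "gene_2,2.5,3.5\n"
)

# read_csv uses a comma as the default separator.
# header=2 says the column names are in the third row (counting from zero),
# so the two rows above it are skipped.
toy_df = pd.read_csv(io.StringIO(csv_text), header=2)

print(list(toy_df.columns))
print(toy_df.shape)
```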

## How do I know how many genes and cell types I have in my data set (a.k.a. get the dimensions of a data table in Python)? <a name='DataTableDim' />
  
Once you've read your data set into Python, you need only a couple of commands to look at the size of your dataset. 


In [7]:
#Use the 'head' command to examine the structure of your data matrix. 
df.head()

Unnamed: 0,ENCSR051GPK.Ganglion_Eminence_derived_primary_cultured_neurospheres.UCSF_Costello,ENCSR906HEV.Fetal_Muscle_Trunk.UW_Glass,ENCSR762CJN.H1_BMP4_Derived_Trophoblast_Cultured_Cells.UCSD_Thompson,ENCSR321ROU.Fetal_Kidney_Pelvis.UW_Glass,ENCSR109IQO.K562_Leukemia_Cells.UConn_Graveley,ENCSR000AEF.GM12878_Lymphoblastoid_Cells.UConn_Graveley,ENCSR244ISQ.H9_Derived_Neuronal_Progenitor_Cultured_Cells.CSHL_Gingeras,ENCSR446RKD.Fetal_Intestine_Small.UW_Glass,ENCSR396GIH.Sigmoid_Colon.Stanford_Snyder,ENCSR000CUA.Primary_hematopoietic_stem_cells.CSHL_Gingeras,...,ENCSR271DJJ.Pancreatic_Islets.UCSF_Costello,ENCSR000AED.GM12878_Lymphoblastoid_Cells.CSHL_Gingeras,ENCSR433GXV.hESC_Derived_CD56._Mesoderm_Cultured_Cells.Harvard,ENCSR535VTR.HT1080_Fibrosarcoma_Cell_Line.CSHL_Gingeras,ENCSR000AEV.Bladder.CSHL_Gingeras,ENCSR314LXG.Karpas.422_B_Cell_Non.Hodgkin_Lymphoma_Cell_Line.CSHL_Gingeras,ENCSR642GSA.Primary_T_CD8._naive_cells_from_peripheral_blood.UCSF_Costello,ENCSR880EGO.SJSA1_Osteosarcoma_Cell_Line.CSHL_Gingeras,ENCSR000AAT.Umbilical_Artery_Epithelial_Primary_Cells.CSHL_Gingeras,ENCSR000EYQ.HeLa.S3_Cervical_Carcinoma_Cell_Line.Caltech_Wold
ENSG00000242268.2,0.675065,0.27709,0.187297,-0.104115,0.384825,0.805894,0.047858,0.147409,0.292986,0.31223,...,0.653688,0.221102,0.206536,0.444246,0.246548,0.189249,0.418133,0.252317,0.326081,0.368408
ENSG00000167578.12,3.176028,2.113871,2.990098,2.860598,3.392432,2.345439,2.824117,2.697207,3.277173,3.183208,...,2.879215,3.832572,3.034219,2.787787,2.751289,3.37514,3.721178,3.144349,2.578081,1.950408
ENSG00000270112.2,-0.100813,0.325754,0.068335,-0.067226,0.145993,0.033347,-0.015703,0.305393,0.024166,0.153861,...,0.959168,-0.027327,0.121938,0.13114,0.137807,-0.031017,0.114565,0.175163,0.168535,0.02778
ENSG00000078237.4,3.750078,1.852882,3.030287,2.432092,2.05193,2.78384,2.88193,2.124639,2.829874,2.028751,...,2.116523,2.660524,2.838042,3.059306,2.241726,2.550853,2.296903,2.329019,2.827429,1.951301
ENSG00000263642.1,-0.005248,0.007042,0.002427,-0.028182,-8e-06,0.004928,0.011574,0.000988,0.011045,0.00295,...,0.001907,0.001637,0.009485,0.003524,0.006027,0.00567,-0.001775,0.011901,0.005117,0.00058


In [7]:
#Use the shape attribute to get the dimensions of your data matrix 
#shape[0] gives the number of rows, shape[1] gives the number of columns. 

num_genes=df.shape[0] 
num_samples=df.shape[1] 

#use the print command to print the variables you generated above 
print(num_genes)
print(num_samples)

55667
410
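
When counting samples, it is worth checking whether a column of row labels is being counted as well. The `head` output above lists `Unnamed: 0` as its first column label, which suggests (this is an assumption about the file, not something verified here) that the gene identifiers are stored as an ordinary column. The sketch below uses a small made-up table to show how the optional `index_col` argument changes what `shape` reports.

```python
import io

import pandas as pd

# Made-up tab-separated table: the header row has an empty first cell,
# and the first column of every data row holds a gene label.
tsv_text = (
    "\tsample_a\tsample_b\n"
    "gene_1\t0.5\t1.5\n"
    "gene_2\t2.5\t3.5\n"
    "gene_3\t4.5\t5.5\n"
)

# Without index_col, the gene labels are kept as a regular column.
labels_as_column = pd.read_table(io.StringIO(tsv_text), header=0)

# With index_col=0, the gene labels become the row index instead.
labels_as_index = pd.read_table(io.StringIO(tsv_text), header=0, index_col=0)

print(labels_as_column.shape)  # prints (3, 3)
print(labels_as_index.shape)   # prints (3, 2)
```

In the first case the number of columns is one larger than the number of samples, because the label column is included in the count.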


Thought questions:
* How many genes were measured in this experiment?
* How many samples were measured?