# Big Data for Biologists: Decoding Genomic Function- Class 7

## How do you quantify gene expression and visualize similarities and differences of gene expression profiles across cell types? Part I
 
##  Learning Objectives
***Students should be able to***
 <ol>
 <li> <a href=#GeneExpressionIntro>Quantify gene expression and understand units of gene expression levels</a></li>
 <li> <a href=#GeneExpressionIntro>Understand what a box plot is</a></li>
 <li> <a href=#GeneExpressionIntro>Visualize gene expression variation across cell types and individuals from the GTEx project</a></li>
 <li><a href=#RNASeqDataFormat>Recognize that data from an RNA-Seq experiment can be processed and stored in a format that can be read into Python</a></li>
 <li> <a href=#IDHeaderSep>Identify the separator and header in a data table</a></li> 
 <li><a href=#LoadTable> Load a data table in .csv or .tsv format into Python</a></li>
 <li> <a href=#DataTableDim>Get the dimensions of a data table in Python  </a></li>
 <li> <a href=#MetaData>Load RNA-Seq metadata for the physiological system of a cell type into Python</a></li> 
 <li> <a href=#Slicing>Slice a data table in Python to select a subset of rows or columns. </a></li> 
 <li> <a href=#Barplot>Make a bar plot from a data table using Python </a></li>
 <li> <a href=#BinaryIndex>Use binary indexing to select elements from a dataframe. </a></li>
 </ol>
 

## How do you quantify gene expression?

In [None]:
from IPython.display import HTML
HTML('<iframe src="https://drive.google.com/file/d/0B_ssVVyXv8ZSrnaseq_dataBNSkN5RFZDc00/preview" width="1000" height="480"></iframe>')

## How does gene expression change across cell types and individuals? 

Gene expression varies not only across cell types and tissues but also between healthy individuals and between healthy and diseased individuals.

The aim of the Genotype - Tissue Expression (GTEx) Project is to increase our understanding of how changes in our genes contribute to common human diseases, in order to improve health care for future generations.

Launched by the National Institutes of Health (NIH) in September 2010 (See: NIH launches Genotype-Tissue Expression project), GTEx has created a resource that researchers can use to study how inherited changes in genes lead to common diseases.  It has established a database and a tissue bank that can be used by many researchers around the world for future studies.

GTEx researchers are studying genes in different tissues obtained from many different people. Thus every donor's generous gift of tissues and medical information to the GTEx project makes possible research that will help improve our understanding of diseases, giving hope that we will find better ways to prevent, diagnose, treat and eventually cure these diseases in the future.

The GTEx portal is at https://www.gtexportal.org/home/

In [None]:
HTML('<iframe src="https://drive.google.com/file/d/0B_ssVVyXv8ZSV2xYRnlHSDJMUTA/preview" width="1000" height="480"></iframe>')

Boxplots are a type of graph often used to visualize changes in gene expression, as in the GTEx examples below. We will use a Python library called [plotly](https://help.plot.ly/what-is-a-box-plot/) to generate boxplots that illustrate changes in gene expression. 

Let's revisit the genes that we examined in the WashU Browser in the last tutorial. Click on the link next to each gene to navigate to the GTEx entry for the gene. 

* MYOD1 - muscle https://www.gtexportal.org/home/gene/MYOD1
* NEUROD1 - neurons https://www.gtexportal.org/home/gene/NEUROD1
* SPI1 - blood https://www.gtexportal.org/home/gene/SPI1
* HNF4A - Liver and related https://www.gtexportal.org/home/gene/HNF4A
* GTF2B - Ubiquitous gene https://www.gtexportal.org/home/gene/GTF2B


## How is data from an RNA-Seq experiment processed and stored in a format that can be read into Python? <a name='RNASeqDataFormat' />

We are now going to look at processed and normalized data from the ENCODE portal. 565 RNA-seq samples were collected across multiple tissues and cell types. We begin by looking at the number of sequence reads that align to each gene. The data is stored in a matrix format, with each row corresponding to a gene in the human genome and each column corresponding to an RNA-seq experiment. The values in the matrix are read counts -- specifically the number of reads that align to a given gene. We normalize the data to "counts per million". 

<img src="../Images/7-gene_expression.png" style="width: 50%; height: 50%" align="center"/>

To start analyzing the data from the RNA-Seq experiment, one of the first steps is reading the data into a program that can be used for the analysis. 

We'll be using Python and will first need to cover some general information about working with data tables. 

## Identify the separator and header in a data table  <a name='IDHeaderSep' />

Two common formats for data tables are comma separated values (**.csv**) files or tab separated values (**.tsv**) files.

In order to read a data table into a program, you often need to know the format of the file. One way to check the format is to look at the files in a text editor. In the figure below, you can see the differences between a file saved in .csv or .tsv format. 

Also, when you read a data table into Python (or R) you often need to specify which row of the file has the column labels. This row is referred to as a **header**.

Sometimes a file has extra lines above the header. In that case you may need to tell the program to skip those lines, since they may not have the same number of columns as the rest of the table and can break the formatting. 

<img src="../Images/6-Tables-CSV-TSV.png" style="width: 100%; height: 100%" align="center"/>
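One quick way to check the separator without opening a text editor is to count candidate separators in the first line of the file. The sketch below uses two small in-memory strings in place of real files; the contents and the helper name `guess_separator` are made up for illustration.

```python
# Hypothetical file contents; with a real file you would read the first
# line using open(path).readline() instead.
tsv_text = "gene\tsampleA\tsampleB\nG1\t1.5\t2.0\nG2\t0.0\t3.1\n"
csv_text = "gene,sampleA,sampleB\nG1,1.5,2.0\nG2,0.0,3.1\n"

def guess_separator(text):
    # Look only at the first line, which is the header in these examples
    first_line = text.splitlines()[0]
    # Whichever candidate separator appears more often is the likely one
    if first_line.count("\t") > first_line.count(","):
        return "\t"
    return ","

print(repr(guess_separator(tsv_text)))
print(repr(guess_separator(csv_text)))
```

This heuristic only looks at the header line, so a file whose column names themselves contain commas could fool it. Looking at the file by eye remains the most reliable check.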

## Load a data table in .csv or .tsv format into Python <a name='LoadTable' />

To read our RNA-Seq data table into Python, we are going to be using the <i>pandas</i> package. 

<i>Pandas</i> adds functionality for working with data in Python. You can learn more about <i>pandas</i> at the following [link](http://pandas.pydata.org/). In particular, <i>pandas</i> introduces a variable type called a dataframe, which is a convenient way of working with tables.

After we have imported the <i>pandas</i> package into Python, we can load a .csv or .tsv file with the read_csv or read_table command. 

The RNA-Seq data that we will be using is a .tsv file.

Note that the read command also asks you to specify the row number of the header, which in our case is the first line (row 0, because Python counts from zero). 

Take a look at reading in a file in the example below. 

In [None]:
%load_ext autoreload
# load the pandas package and define an abbreviation (or alias) 
import pandas as pd   

# read_table loads a tabular data file into Python with tab as the default separator
# read_csv loads a tabular data file into Python with comma as the default separator
# header gives the number of the row that will be used for column names

#Step 1: Read in the normalized data. 
rnaseq_data = pd.read_table(
     filepath_or_buffer='../datasets/RNAseq/rnaseq_normalized.tsv', 
     header=0)

Thought questions: 

* How would you change the code above to read in a .csv file?
* How would you need to change the code if your column names were in the third row? 
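As a starting point for the questions above, the sketch below reads two small in-memory tables with `pd.read_csv`. The table contents are invented; `sep` chooses the separator and `header` gives the zero-based row that holds the column names, so lines above that row are skipped.

```python
import io
import pandas as pd

# Invented tab-separated table whose column names are on the first line
tsv_text = "gene\tsampleA\tsampleB\nG1\t1.5\t2.0\nG2\t0.0\t3.1\n"
tsv_table = pd.read_csv(io.StringIO(tsv_text), sep="\t", header=0)

# Invented comma-separated table with two extra lines above the column names
csv_text = "exported by lab\nversion 2\ngene,sampleA,sampleB\nG1,1.5,2.0\nG2,0.0,3.1\n"
csv_table = pd.read_csv(io.StringIO(csv_text), sep=",", header=2)

print(tsv_table.shape)
print(list(csv_table.columns))
```

With a real file, the `io.StringIO(...)` argument would be replaced by the path to the file.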

## How do I know how many genes and cell types I have in my data set (a.k.a. get the dimensions of a data table in Python)? <a name='DataTableDim' />
  
Once you've read your data set into Python, there are just a couple of commands that you need to look at the size of your dataset. 


In [4]:
#Use the 'head' command to examine the structure of your data matrix. 
rnaseq_data.head()

Unnamed: 0,ENCSR051GPK.Ganglion_Eminence_derived_primary_cultured_neurospheres.UCSF_Costello,ENCSR906HEV.Fetal_Muscle_Trunk.UW_Glass,ENCSR762CJN.H1_BMP4_Derived_Trophoblast_Cultured_Cells.UCSD_Thompson,ENCSR321ROU.Fetal_Kidney_Pelvis.UW_Glass,ENCSR109IQO.K562_Leukemia_Cells.UConn_Graveley,ENCSR000AEF.GM12878_Lymphoblastoid_Cells.UConn_Graveley,ENCSR244ISQ.H9_Derived_Neuronal_Progenitor_Cultured_Cells.CSHL_Gingeras,ENCSR446RKD.Fetal_Intestine_Small.UW_Glass,ENCSR396GIH.Sigmoid_Colon.Stanford_Snyder,ENCSR000CUA.Primary_hematopoietic_stem_cells.CSHL_Gingeras,...,ENCSR271DJJ.Pancreatic_Islets.UCSF_Costello,ENCSR000AED.GM12878_Lymphoblastoid_Cells.CSHL_Gingeras,ENCSR433GXV.hESC_Derived_CD56._Mesoderm_Cultured_Cells.Harvard,ENCSR535VTR.HT1080_Fibrosarcoma_Cell_Line.CSHL_Gingeras,ENCSR000AEV.Bladder.CSHL_Gingeras,ENCSR314LXG.Karpas.422_B_Cell_Non.Hodgkin_Lymphoma_Cell_Line.CSHL_Gingeras,ENCSR642GSA.Primary_T_CD8._naive_cells_from_peripheral_blood.UCSF_Costello,ENCSR880EGO.SJSA1_Osteosarcoma_Cell_Line.CSHL_Gingeras,ENCSR000AAT.Umbilical_Artery_Epithelial_Primary_Cells.CSHL_Gingeras,ENCSR000EYQ.HeLa.S3_Cervical_Carcinoma_Cell_Line.Caltech_Wold
ENSG00000242268.2,0.675065,0.27709,0.187297,-0.104115,0.384825,0.805894,0.047858,0.147409,0.292986,0.31223,...,0.653688,0.221102,0.206536,0.444246,0.246548,0.189249,0.418133,0.252317,0.326081,0.368408
ENSG00000167578.12,3.176028,2.113871,2.990098,2.860598,3.392432,2.345439,2.824117,2.697207,3.277173,3.183208,...,2.879215,3.832572,3.034219,2.787787,2.751289,3.37514,3.721178,3.144349,2.578081,1.950408
ENSG00000270112.2,-0.100813,0.325754,0.068335,-0.067226,0.145993,0.033347,-0.015703,0.305393,0.024166,0.153861,...,0.959168,-0.027327,0.121938,0.13114,0.137807,-0.031017,0.114565,0.175163,0.168535,0.02778
ENSG00000078237.4,3.750078,1.852882,3.030287,2.432092,2.05193,2.78384,2.88193,2.124639,2.829874,2.028751,...,2.116523,2.660524,2.838042,3.059306,2.241726,2.550853,2.296903,2.329019,2.827429,1.951301
ENSG00000263642.1,-0.005248,0.007042,0.002427,-0.028182,-8e-06,0.004928,0.011574,0.000988,0.011045,0.00295,...,0.001907,0.001637,0.009485,0.003524,0.006027,0.00567,-0.001775,0.011901,0.005117,0.00058


In [5]:
#Use the shape attribute to get the dimensions of your data matrix 
#shape[0] gives the number of rows, shape[1] gives the number of columns. 

num_genes=rnaseq_data.shape[0] 
num_samples=rnaseq_data.shape[1] 

#use the print command to print the variables you generated above 
print(num_genes)
print(num_samples)

55667
410
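Because `shape` is a tuple of (rows, columns), both numbers can also be unpacked in one line. A minimal sketch on a made-up table with three genes and two samples:

```python
import pandas as pd

# Invented table: rows are genes, columns are samples
toy = pd.DataFrame(
    {"sample1": [1.0, 2.0, 3.0], "sample2": [0.5, 0.0, 4.0]},
    index=["geneA", "geneB", "geneC"],
)

# shape is a (rows, columns) tuple, so it can be unpacked directly
n_rows, n_cols = toy.shape
print(n_rows, n_cols)
```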


Thought questions:

* How many genes were measured in this experiment?
* How many samples were measured?

## Load RNA-Seq metadata for the physiological system of a cell type into Python<a name='MetaData' />

In our example today, we want to compare cell types in different organ systems such as the Nervous system, Musculoskeletal system or Blood. 

We have a file that lists the System, Organ and Cell Type for each Sample in the RNA-Seq experiment from the last class. This separate file with information about the samples is referred to as **metadata**. 

The metadata is stored in a file called: '../datasets/RNAseq/rnaseq_metadata.txt'. 

Since the name of the metadata table does not tell you, how can you check if the file is in .csv or .tsv format? 

Using what you learned yesterday about reading data tables into Python, write the code to read in the metadata (batches) file and to view the top of the file. 

In [None]:
#Step 2: Load the metadata file that provides metadata annotations for each sample 
#(hint: this will be very similar to the code we wrote to load the data table)

In [7]:
###BEGIN SOLUTION
metadata = pd.read_table(
     filepath_or_buffer='../datasets/RNAseq/rnaseq_batches.txt', 
     header=0)
###END SOLUTION 

In [9]:
#metadata has one row per sample and one column per annotation
num_samples=metadata.shape[0] 
num_annotations=metadata.shape[1] 

#use the print command to print the variables you generated above 
print(num_samples)
print(num_annotations)

410
4


## Slice a data table in Python to select a subset of rows or columns. <a name='Slicing' />

To answer the question <i>"Do cell types from the same organ system have similar gene expression profiles?"</i>, we are particularly interested in the column labeled System. 

Selecting part of a table is called slicing. It is very common in data analysis work to need to slice a table to select, for example, one column, one row or a set of rows and/or columns. 

Using the <i>pandas</i> package, there are a few ways that you can select rows and columns. Below is a table from the pandas [website](https://pandas.pydata.org/pandas-docs/stable/dsintro.html) that summarizes how you can select rows and columns.  For a more complete description and set of examples, see this [link](https://pandas.pydata.org/pandas-docs/stable/indexing.html).


<img src="../Images/7-Indexing_Selecting Rows and Columns.png" style="width: 60%; height: 60%" align="center"/>


In our example, let's look first at how we would select the System column.  

In [8]:
#Use the 'head' command to examine the structure of your data matrix. 
###BEGIN SOLUTION
metadata.head()
###END SOLUTION

| | Sample | System | Organ | CellType |
|---|---|---|---|---|
| 0 | ENCSR051GPK.Ganglion_Eminence_derived_primary_... | Nervous | Brain | Neurosphere |
| 1 | ENCSR906HEV.Fetal_Muscle_Trunk.UW_Glass | Musculoskeletal | Muscle | Muscle |
| 2 | ENCSR762CJN.H1_BMP4_Derived_Trophoblast_Cultur... | Embryonic | Trophoblast | ES.derived |
| 3 | ENCSR321ROU.Fetal_Kidney_Pelvis.UW_Glass | Urinary | Kidney | Kidney |
| 4 | ENCSR109IQO.K562_Leukemia_Cells.UConn_Graveley | Blood | Blood | Leukemia |


In [10]:
x=metadata['System']
print(x)

0               Nervous
1       Musculoskeletal
2             Embryonic
3               Urinary
4                 Blood
5                 Blood
6               Nervous
7      Gastrointestinal
8      Gastrointestinal
9                 Blood
10               Breast
11      Musculoskeletal
12          Respiratory
13     Gastrointestinal
14      Musculoskeletal
15              Urinary
16              Nervous
17                Blood
18              Urinary
19               Immune
20               Immune
21      Musculoskeletal
22              Urinary
23       Cardiovascular
24          Respiratory
25            Embryonic
26            Embryonic
27                 Skin
28               Immune
29     Gastrointestinal
             ...       
380           Endocrine
381               Blood
382        Reproductive
383    Gastrointestinal
384             Nervous
385      Cardiovascular
386             Urinary
387     Musculoskeletal
388           Embryonic
389               Blood
390             

In [None]:
#Write the code to make a variable x with the Cell Type instead of the System. 
###BEGIN SOLUTION
###END SOLUTION

In [None]:
#Write the code to make a variable x with the first five rows of metadata 
#(remember to use Python zero-based numbering!). 
###BEGIN SOLUTION
###END SOLUTION
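One possible approach to the two exercises above, sketched on a small stand-in table (`toy_metadata` is a hypothetical name and its values are for illustration only): a column is selected by its label in square brackets, and rows are selected by position with a slice.

```python
import pandas as pd

# Stand-in for the real metadata table (illustrative values only)
toy_metadata = pd.DataFrame({
    "System":   ["Nervous", "Musculoskeletal", "Embryonic", "Urinary", "Blood", "Blood"],
    "CellType": ["Neurosphere", "Muscle", "ES.derived", "Kidney", "Leukemia", "Lymphoblastoid"],
})

# Select one column by its label
cell_types = toy_metadata["CellType"]

# Select the first five rows: positions 0, 1, 2, 3 and 4 (the end of the slice is excluded)
first_five = toy_metadata.iloc[0:5]

print(cell_types.tolist())
print(first_five.shape)
```

Other spellings, such as `toy_metadata[0:5]` or `toy_metadata.head(5)`, select the same rows here.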

Notice that the last option in the table above indicates that you can specify the location with a Boolean vector. A Boolean variable can take only one of two values, True or False. 

Using this syntax, we can write a criterion for the rows that we want to select. For example, if we want to select only the rows from the respiratory system, we can specify the condition `System=='Respiratory'`.

In [12]:
metadata_subset=metadata.loc[metadata.System=='Respiratory']
print(metadata_subset)

                                                Sample       System Organ  \
12   ENCSR000AAN.Pulmonary_Artery_Smooth_Muscle_Pri...  Respiratory  Lung   
24                     ENCSR499NEL.Fetal_Lung.UW_Glass  Respiratory  Lung   
48                    ENCSR917YHC.Lung.Stanford_Snyder  Respiratory  Lung   
49                     ENCSR074APH.Fetal_Lung.UW_Glass  Respiratory  Lung   
53   ENCSR897KTO.Alveolus_Epithelial_Primary_Cells....  Respiratory  Lung   
68   ENCSR000CPM.NHLF_Lung_Fibroblast_Primary_Cells...  Respiratory  Lung   
80                     ENCSR733MWN.Fetal_Lung.UW_Glass  Respiratory  Lung   
85   ENCSR000AAS.Trachea_Smooth_Muscle_Primary_Cell...  Respiratory  Lung   
87   ENCSR000AAP.Pulmonary_Microvascular_Endothelia...  Respiratory  Lung   
129                    ENCSR044JAQ.Fetal_Lung.UW_Glass  Respiratory  Lung   
136                    ENCSR861SOG.Fetal_Lung.UW_Glass  Respiratory  Lung   
154  ENCSR000AAO.NHLF_Lung_Fibroblast_Primary_Cells...  Respiratory  Lung   

In the example below, we want to select multiple Systems. To do this, we can combine conditions with the "|" (or) operator, wrapping each condition in parentheses. 

In [13]:
metadata_subset=metadata.loc[(metadata.System=='Respiratory') | (metadata.System=='Embryonic')]
print(metadata_subset)

                                                Sample       System  \
2    ENCSR762CJN.H1_BMP4_Derived_Trophoblast_Cultur...    Embryonic   
12   ENCSR000AAN.Pulmonary_Artery_Smooth_Muscle_Pri...  Respiratory   
24                     ENCSR499NEL.Fetal_Lung.UW_Glass  Respiratory   
25   ENCSR593AMV.hESC_Derived_CD56._Ectoderm_Cultur...    Embryonic   
26   ENCSR663WGC.H1_Derived_Mesenchymal_Stem_Cells....    Embryonic   
33                  ENCSR950PSB.H1_Cells.UCSF_Costello    Embryonic   
44   ENCSR976JGI.H1_BMP4_Derived_Mesendoderm_Cultur...    Embryonic   
48                    ENCSR917YHC.Lung.Stanford_Snyder  Respiratory   
49                     ENCSR074APH.Fetal_Lung.UW_Glass  Respiratory   
53   ENCSR897KTO.Alveolus_Epithelial_Primary_Cells....  Respiratory   
57            ENCSR282KJZ.ES.UCSF4_Cells.UCSF_Costello    Embryonic   
68   ENCSR000CPM.NHLF_Lung_Fibroblast_Primary_Cells...  Respiratory   
80                     ENCSR733MWN.Fetal_Lung.UW_Glass  Respiratory   
85   E
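To see why the parentheses matter, it can help to build each condition as its own Boolean Series first. Without the parentheses, Python would evaluate `|` before `==`, which is not what we want. The sketch below uses an invented four-row table; `|` combines the two Series element by element, and `&` would do the same for "and".

```python
import pandas as pd

# Invented table for illustration
toy = pd.DataFrame({
    "Sample": ["s1", "s2", "s3", "s4"],
    "System": ["Respiratory", "Embryonic", "Blood", "Respiratory"],
})

# Each comparison gives one True/False value per row
is_respiratory = toy["System"] == "Respiratory"
is_embryonic = toy["System"] == "Embryonic"

# A row is kept if either condition is True
either = toy.loc[is_respiratory | is_embryonic]

print(either["Sample"].tolist())
```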

A more compact way to select several systems, however, is to define a variable that holds the list of systems and then use the <i>pandas</i> isin function.  

In [14]:
#define the list of systems
systems_subset=["Blood","Embryonic","Immune","Respiratory"]

#select the rows in metadata for which the System is one of the systems in systems_subset
metadata_subset=metadata.loc[metadata['System'].isin(systems_subset)]

print(metadata_subset)

                                                Sample       System  \
2    ENCSR762CJN.H1_BMP4_Derived_Trophoblast_Cultur...    Embryonic   
4       ENCSR109IQO.K562_Leukemia_Cells.UConn_Graveley        Blood   
5    ENCSR000AEF.GM12878_Lymphoblastoid_Cells.UConn...        Blood   
9    ENCSR000CUA.Primary_hematopoietic_stem_cells.C...        Blood   
12   ENCSR000AAN.Pulmonary_Artery_Smooth_Muscle_Pri...  Respiratory   
17   ENCSR463JBR.Primary_T_CD4._cells_from_peripher...        Blood   
19                   ENCSR265NZF.Fetal_Spleen.UW_Glass       Immune   
20                   ENCSR069CMT.Fetal_Thymus.UW_Glass       Immune   
24                     ENCSR499NEL.Fetal_Lung.UW_Glass  Respiratory   
25   ENCSR593AMV.hESC_Derived_CD56._Ectoderm_Cultur...    Embryonic   
26   ENCSR663WGC.H1_Derived_Mesenchymal_Stem_Cells....    Embryonic   
28                   ENCSR367QHR.Fetal_Thymus.UW_Glass       Immune   
33                  ENCSR950PSB.H1_Cells.UCSF_Costello    Embryonic   
41   E
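As a quick check that `isin` gives the same answer as chaining conditions with `|`, the sketch below compares the two Boolean masks on an invented Series of system labels.

```python
import pandas as pd

# Invented system labels for five samples
toy_systems = pd.Series(["Blood", "Skin", "Immune", "Blood", "Urinary"])
wanted = ["Blood", "Immune"]

# One call to isin...
mask_isin = toy_systems.isin(wanted)
# ...versus one comparison per system, joined with |
mask_or = (toy_systems == "Blood") | (toy_systems == "Immune")

print(mask_isin.tolist())
print(mask_isin.equals(mask_or))
```

The `isin` version stays the same length no matter how many systems are in the list, which is why it scales better than writing out each comparison.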

## Make a bar plot from a data table using Python<a name='Barplot' />

Now that we know how to select specific columns from a table, we are going to make a bar plot to look at the number of samples in each of the organ systems. 

Specifically, since we have limited computational resources we are going to focus on only four systems today. We want to ensure that we are selecting systems that have a sufficient number of samples. 

There are several different packages that can be used to make plots in Python. [Matplotlib](https://matplotlib.org/) is one of the widely used plotting packages. [Plotly](https://plot.ly/python/) is a package that's well suited to making on-line or interactive plots. 

We will be using the plotly package. 


In [None]:
%autoreload
#Import the plotting helper functions from the helpers directory
import sys
sys.path.append('../helpers')
import plotly_helpers 
from plotly_helpers import * 

plot_RNAseq_barplot(metadata)
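`plot_RNAseq_barplot` comes from the course's helper module, and its internals are not shown here. As an assumption about the kind of summary such a plot displays, the number of samples per system can be tabulated with `value_counts`; the table below is invented.

```python
import pandas as pd

# Invented metadata with one row per sample
toy_metadata = pd.DataFrame({
    "System": ["Blood", "Nervous", "Blood", "Immune", "Blood", "Nervous"],
})

# Count how many rows (samples) fall in each system, largest first
samples_per_system = toy_metadata["System"].value_counts()
print(samples_per_system)
```

Each entry of `samples_per_system` would correspond to the height of one bar.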

In [None]:
# We will focus our analysis on 4 of the organ systems and check for differential gene expression among them.  

# We need to read in our RNA-Seq datafile using the same code as Class 5. 
rnaseq_data = pd.read_table(
     filepath_or_buffer='../datasets/RNAseq/rnaseq_normalized.tsv', 
     header=0)

#Pick out the samples (or rows) in the metadata dataframe that contain the samples from the systems of interest 
systems_subset=["Blood","Embryonic","Immune","Respiratory"]
metadata_subset=metadata.loc[metadata['System'].isin(systems_subset)]

#Pick out the columns in the RNA-Seq datafile that have samples that are in the systems of interest
rnaseq_data_subset=rnaseq_data[metadata_subset['Sample']]

#Reindexes metadata_subset
metadata_subset=metadata_subset.reset_index()

#Check row & column numbers in rnaseq_data_subset 
print(rnaseq_data_subset.shape[0])#prints number of rows -- this is the gene axis 
print(rnaseq_data_subset.shape[1])#prints number of columns 
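Indexing a dataframe with a list (or a Series) of column names returns just those columns. A small sketch with invented gene and sample names, mirroring the selection step in the cell above:

```python
import pandas as pd

# Invented expression table: rows are genes, columns are samples
toy_data = pd.DataFrame(
    {"s1": [1.0, 0.0], "s2": [2.0, 0.5], "s3": [3.0, 0.1]},
    index=["geneA", "geneB"],
)
# Invented metadata: one row per sample
toy_meta = pd.DataFrame({"Sample": ["s1", "s2", "s3"], "System": ["Blood", "Skin", "Blood"]})

# Names of the samples that belong to the system of interest
blood_samples = toy_meta.loc[toy_meta["System"] == "Blood", "Sample"]

# Use those names to pick out the matching columns of the expression table
toy_subset = toy_data[blood_samples]
print(list(toy_subset.columns))
print(toy_subset.shape)
```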


Yesterday the size of the matrix was:

55667 <br>
410 

Thought question: What is the difference between the rnaseq_data_subset matrix that we just made and the original rnaseq_data matrix?

In [None]:
#Now, we create a barplot with just our 4 organ systems of interest 
plot_RNAseq_barplot(metadata_subset)

In [None]:
#Step 3 : We are interested in genes that are differentially expressed across samples, so we can exclude genes that have 0 TPM
#in all samples -- these are not of interest. Use the sum command to find such genes. 

#Select the rows (genes) in the data subset for which the sum across all columns (samples) is not equal to zero. 
rnaseq_data_subset=rnaseq_data_subset[rnaseq_data_subset.sum(axis=1)!=0]
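Filtering on the row sum works when every value is zero or positive. If the normalized table can contain negative values (the `head()` output earlier suggests it might), a row could sum to zero without being all zeros. A hedged alternative, shown on an invented table, is to keep the rows where any entry is non-zero:

```python
import pandas as pd

# Invented table: geneB is all zeros, geneC sums to zero but is not all zeros
toy = pd.DataFrame(
    {"s1": [1.0, 0.0, -0.5], "s2": [2.0, 0.0, 0.5]},
    index=["geneA", "geneB", "geneC"],
)

# Filter used in the cell above: keep rows whose sum is not zero
kept_by_sum = toy[toy.sum(axis=1) != 0]

# Alternative: keep rows with at least one non-zero value
kept_by_any = toy[(toy != 0).any(axis=1)]

print(list(kept_by_sum.index))
print(list(kept_by_any.index))
```

Whether the difference matters depends on how the data were normalized, so this is a sketch of the idea and not a replacement for the course's Step 3.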


We will now transpose the data frame. Transposition is an operation that flips the rows and columns in a matrix, like in the example below. 

<img src="../Images/7-transpose.png" style="width: 40%; height: 40%" align="center"/>

Currently our features (genes) are along the row axis, while sample names are along the column axis. Transposing the matrix will place the genes along the column axis and the sample names along the row axis. 

In [None]:
#Step 4 : Transpose the data frame 
#Now, our features (genes) are along the column axis, and sample names are along the row axis. 
rnaseq_data_subset=rnaseq_data_subset.transpose()
print(rnaseq_data_subset.shape)
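The effect of `transpose` is easy to see on a tiny invented table: the shape flips, and a value that was found at (gene, sample) is now found at (sample, gene).

```python
import pandas as pd

# Invented table: three genes (rows) by two samples (columns)
toy = pd.DataFrame(
    {"sample1": [1.0, 2.0, 3.0], "sample2": [4.0, 5.0, 6.0]},
    index=["geneA", "geneB", "geneC"],
)

toy_t = toy.transpose()

print(toy.shape)
print(toy_t.shape)
# The same value, looked up with row and column labels swapped
print(toy.loc["geneA", "sample2"], toy_t.loc["sample2", "geneA"])
```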

In [None]:
rnaseq_data_subset.head(10)

###  Programming tip -- Using binary indexing to extract elements in a matrix. 

In the code above we needed to select the rows in pca_rnaseq_data for each of the four organ systems. To do this, we executed the line of code: 

```
x=pca_rnaseq_data.loc[metadata_subset["System"]==name,[0]].values.flatten().tolist()   
```
Let's break down what this line of code is doing. 
First, we find all positions in `metadata_subset["System"]` that have a specific name. For example:  

In [None]:
print(name)
metadata_subset['System']==name

Note that `metadata_subset['System']==name` returns a value of "True" or "False" at each position in the `metadata_subset['System']` array. This is referred to as binary (or Boolean) indexing. 

Next, we identify the rows with a value of "True", and select them from `pca_rnaseq_data` with `.loc`. 
This is done with the command: 

In [None]:
pca_rnaseq_data.loc[metadata_subset["System"]==name,[0]]

This has allowed us to create positional indices 4, 8, 15 ... from the binary True/False indices. 
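The whole chain can be tried end to end on stand-in data. In the sketch below, `toy_meta` and `toy_pca` are invented tables that only mimic the layout assumed for `metadata_subset` and `pca_rnaseq_data` (same row order, integer column labels), and `name` is set by hand.

```python
import numpy as np
import pandas as pd

# Invented stand-ins; both tables share the default row index 0..3
toy_meta = pd.DataFrame({"System": ["Blood", "Immune", "Blood", "Respiratory"]})
toy_pca = pd.DataFrame([[0.1, 1.0], [0.2, 2.0], [0.3, 3.0], [0.4, 4.0]])  # columns 0 and 1

name = "Blood"

# Step 1: one True/False value per row
mask = toy_meta["System"] == name

# Step 2: keep the rows where the mask is True, and only column 0
selected = toy_pca.loc[mask, [0]]

# Step 3: convert the one-column table into a plain Python list
x = selected.values.flatten().tolist()

# The positions of the True values in the mask
positions = np.flatnonzero(mask.to_numpy()).tolist()

print(x)
print(positions)
```

The selection relies on the two tables having matching row labels, which is one reason the metadata subset was reindexed earlier.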