# Big Data for Biologists: Decoding Genomic Function - Class 8

## How do you visualize similarities and differences of gene expression profiles across cell types? Part II

##  Learning Objectives
***Students should be able to***
 <ol>
 <li> <a href=#MetaData>Load RNA-Seq metadata for the physiological system of a cell type into Python</a></li> 
 <li> <a href=#BinaryIndex>Use binary indexing to select elements from a dataframe</a></li>
 <li> <a href=#PCA1>Describe what principal component analysis is and how it can be used to analyze and visualize variation in large datasets</a></li>
 <li> <a href=#PCA2>Perform principal component analysis to identify clustering patterns in gene expression data</a></li>
 <li> <a href=#Scatter>Make a scatter plot of the output from principal component analysis</a></li> 
 </ol>

# Load RNA-Seq metadata for the physiological system of a cell type into Python<a name='MetaData' />

In [None]:
%load_ext autoreload
# load the pandas package and define an abbreviation (or alias) 
import pandas as pd   

We will focus our analysis on 4 of the anatomical structures and check for differential gene expression among them.  


In [None]:
#Read in the metadata table. 
metadata = pd.read_table(
     filepath_or_buffer='../datasets/RNAseq/rnaseq_metadata.txt', 
     header=0)

In [None]:
# Read in the RNA-seq data matrix. 
rnaseq_data = pd.read_table(
     filepath_or_buffer='../datasets/RNAseq/rnaseq_normalized.tsv', 
     header=0)

In [None]:
#Pick out the samples (or rows) in the metadata dataframe that contain the samples from the systems of interest 
systems_subset=["Blood","Embryonic","Immune","Respiratory"]

In [None]:
#1. select the column labeled 'System' in the metadata dataframe. 
system_column=metadata['System']

In [None]:
#2. select the rows from system_column that have values in the systems_subset list. 
#(i.e. select out the rows for "Blood","Embryonic","Immune","Respiratory")
metadata_subset=system_column.isin(systems_subset)

In [None]:
metadata_subset.head()
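
As a small illustration of what `isin` returns, the sketch below applies the same call to a made-up metadata table. The sample names and system labels in `toy_metadata` are invented for the example and are not taken from `rnaseq_metadata.txt`.

```python
import pandas as pd

# Toy metadata table; the values are invented for illustration only.
toy_metadata = pd.DataFrame({
    "Sample": ["s1", "s2", "s3", "s4"],
    "System": ["Blood", "Digestive", "Immune", "Blood"],
})

# isin returns one True/False value per row: True where the System
# value appears in the list, False otherwise.
toy_mask = toy_metadata["System"].isin(["Blood", "Immune"])
print(toy_mask.tolist())  # [True, False, True, True]
```

The result has one entry per metadata row, which is why it can later be used to pick out rows or matching sample IDs.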

In [None]:
#Pick out the columns in the RNA-Seq datafile that have samples that are in the systems of interest
sample_column=metadata['Sample']
sample_ids_to_keep=sample_column[metadata_subset]

In [None]:
#Select the columns in the data matrix that contain the samples we wish to analyze (i.e. the samples
#from the blood, embryonic, immune, and respiratory systems)
rnaseq_data_subset=rnaseq_data[sample_ids_to_keep]

In [None]:
rnaseq_data_subset.head()
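
The two-step selection above (filter the sample IDs with a True/False series, then use those IDs to pick columns) can be sketched on a toy matrix. All names and values below are hypothetical and chosen only to make the result easy to check by eye.

```python
import pandas as pd

# Toy expression matrix: genes as rows, samples as columns (invented values).
toy_data = pd.DataFrame(
    {"s1": [5.0, 0.0], "s2": [1.0, 0.0], "s3": [2.0, 0.0]},
    index=["geneA", "geneB"],
)
toy_samples = pd.Series(["s1", "s2", "s3"])   # stand-in for metadata['Sample']
toy_keep = pd.Series([True, False, True])      # stand-in for the isin() result

# A True/False series keeps only the sample IDs marked True ...
ids_to_keep = toy_samples[toy_keep]
# ... and indexing a dataframe with those IDs selects the columns with those names.
toy_subset = toy_data[ids_to_keep]
print(list(toy_subset.columns))  # ['s1', 's3']
```

Note that the genes (rows) are untouched here; only the sample columns are reduced.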

In [None]:
#Use the boolean metadata_subset index to select rows from "metadata" 
metadata_subset=metadata[metadata_subset]
metadata_subset=metadata_subset.reset_index()

In [None]:
#Check row & column numbers in rnaseq_data_subset 
print("Number rows:"+str(rnaseq_data_subset.shape[0]))#prints number of rows -- this is the gene axis 
print("Number columns:"+str(rnaseq_data_subset.shape[1]))#prints number of columns

In [None]:
#Since our goal is to identify genes that are differentially expressed across organ systems, we 
# want to exclude genes that have expression = 0 in all 4 of the organ systems. 
rnaseq_data_subset=rnaseq_data_subset[rnaseq_data_subset.sum(axis=1)>0]
print("Number rows:"+str(rnaseq_data_subset.shape[0]))#prints number of rows -- this is the gene axis 
print("Number columns:"+str(rnaseq_data_subset.shape[1]))#prints number of columns

## Programming tip: Using binary indexing to select elements from a dataframe. <a name='BinaryIndex'/>
In the code above we needed to select the rows in rnaseq_data_subset that sum to a value greater than 0. To do this, we executed the line of code: 

```
rnaseq_data_subset=rnaseq_data_subset[rnaseq_data_subset.sum(axis=1)>0]

```
Let's break down what this line of code is doing. 
First, we find all rows in `rnaseq_data_subset` that have sum greater than 0. 

In [None]:
nonzero_rows=rnaseq_data_subset.sum(axis=1)>0
print(nonzero_rows)

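Note that `rnaseq_data_subset.sum(axis=1)>0` returns a value of "True" or "False" for each row in the `rnaseq_data_subset` matrix. Selecting rows with such a series of True/False values is referred to as binary indexing (the pandas documentation calls it boolean indexing). 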

Next, we identify the rows with a value of "True", and select them from `rnaseq_data_subset`. 
This can be done with the command: 

In [None]:
rnaseq_data_subset=rnaseq_data_subset[nonzero_rows]
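
The same two steps can be tried on a toy dataframe that is small enough to verify by hand. The gene and sample names below are invented for the illustration.

```python
import pandas as pd

# Toy expression matrix: three genes (rows) by two samples (columns).
toy = pd.DataFrame(
    {"s1": [5.0, 0.0, 1.0], "s2": [2.0, 0.0, 0.0]},
    index=["geneA", "geneB", "geneC"],
)

# Step 1: sum across each row (axis=1) and compare with 0.
# Row sums are geneA=7, geneB=0, geneC=1, so the mask is True, False, True.
keep = toy.sum(axis=1) > 0

# Step 2: use the True/False mask to keep only the rows marked True.
filtered = toy[keep]
print(list(filtered.index))  # ['geneA', 'geneC']
```

`geneB` is dropped because its expression is 0 in every sample, which mirrors the filter applied to `rnaseq_data_subset`.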

In [None]:
#Transpose the data frame 
#Now, our features (genes) are along the column axis, and sample names are along the row axis. 
rnaseq_data_subset=rnaseq_data_subset.transpose()

In [None]:
print("Number rows:"+str(rnaseq_data_subset.shape[0]))#prints number of rows -- after the transpose this is the sample axis 
print("Number columns:"+str(rnaseq_data_subset.shape[1]))#prints number of columns -- after the transpose this is the gene axis
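
The transpose matters because scikit-learn expects one row per sample and one column per feature. A minimal sketch on an invented 3-gene by 2-sample table shows how the shape and the row labels change:

```python
import pandas as pd

# Toy matrix with genes as rows and samples as columns (invented values).
toy = pd.DataFrame(
    {"s1": [5.0, 0.5, 1.0], "s2": [2.0, 0.1, 3.0]},
    index=["geneA", "geneB", "geneC"],
)
print(toy.shape)  # (3, 2)

# After transposing, samples are rows and genes are columns.
toy_t = toy.transpose()
print(toy_t.shape)        # (2, 3)
print(list(toy_t.index))  # ['s1', 's2']
```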

In [None]:
from IPython.display import HTML
HTML('<iframe src="https://docs.google.com/presentation/d/e/2PACX-1vREgMZvDAV0PvotVikbR_YzQQP1V_1Sis82NSbeoa6WoaIZTEpqfx8l4bTDjcvGVYANFmU-8GL-ZZha/embed?start=false&loop=false&delayms=60000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>')

We have extracted RNA-seq expression data for our four organ systems of interest. We have also removed all genes that are not expressed in any of the four organ systems.

## What is principal component analysis (PCA)? <a name='PCA1' />

Principal component analysis (PCA) is a statistical method to understand and visualize variation in large datasets.

In [None]:
from IPython.display import HTML
HTML('<iframe src="https://docs.google.com/presentation/d/e/2PACX-1vRcZ3RiXYbwH_xrE-261ccJT71HKZ5oPJqmIATdHa2SwvDekvAR5Lr7zDwnNPN88FAEM2XT-F6-DHiS/embed?start=false&loop=false&delayms=60000" frameborder="0" width="960" height="749" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>')

We will use the [scikit-learn](http://scikit-learn.org/stable/) Python library to perform principal component analysis. We import scikit-learn with the command "import sklearn". This library has a number of built-in tools for performing statistical analysis and machine learning. 

[This reference page](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) provides a guide to performing PCA with scikit-learn.
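
Before running PCA on the expression data, it may help to see it on a toy dataset. The sketch below uses synthetic numbers (not RNA-seq data) in which the second feature is nearly twice the first, so most of the variation lies along a single direction. The variable names and the random seed are arbitrary choices for this example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# 50 toy "samples" with 3 "features". Feature 1 is roughly 2 * feature 0,
# and feature 2 is small noise, so one direction dominates the variance.
base = rng.normal(size=50)
toy = np.column_stack([
    base,
    2 * base + rng.normal(scale=0.1, size=50),
    rng.normal(scale=0.1, size=50),
])

# Project the 3 features onto the 2 directions of largest variance.
pca = PCA(n_components=2)
coords = pca.fit_transform(toy)

print(coords.shape)  # (50, 2): one row per sample, one column per component
print(pca.explained_variance_ratio_)
```

Because the first two features move together, the first component should account for nearly all of the variance here. Real expression data is rarely this clean.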

In [None]:
# Perform principal component analysis on the data to check for clustering patterns
from sklearn.decomposition import PCA as sklearnPCA

#We decompose the data into 10 principal components 
sklearn_pca = sklearnPCA(n_components=10)
pca_results = sklearn_pca.fit_transform(rnaseq_data_subset)


We visualize the percent of variance explained by each principal component in a graph called a "scree plot".

In [None]:
#We use our plotly helper functions to generate a scree plot from the principal component analysis. 
#Import the plotting helper functions from the helpers directory
import sys
sys.path.append('../helpers')
import plotly_helpers 
from plotly_helpers import * 



In [None]:
print(sklearn_pca.explained_variance_ratio_)

In [None]:
scree_plot(sklearn_pca.explained_variance_ratio_)

This indicates that the first principal component explains 33% of the variance in the data, while the second principal component explains 8% of the variance. 
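
The values in `explained_variance_ratio_` are fractions between 0 and 1, so they need to be multiplied by 100 to be read as percentages. A minimal sketch on synthetic data (the matrix, the scaling factors, and the seed are invented for illustration) prints each component's share together with the running total:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Toy data: 20 samples by 6 features, with features scaled differently
# so that the components explain unequal amounts of variance.
toy = rng.normal(size=(20, 6)) * np.array([5.0, 2.0, 1.0, 1.0, 0.5, 0.5])

pca = PCA(n_components=3).fit(toy)

# Convert fractions to percentages and accumulate them.
percent = 100 * pca.explained_variance_ratio_
cumulative = np.cumsum(percent)

for i, (p, c) in enumerate(zip(percent, cumulative), start=1):
    print(f"PC{i}: {p:.1f}% of variance, {c:.1f}% cumulative")
```

The components are reported in decreasing order of explained variance, and the cumulative total stays below 100% whenever fewer components are kept than the data contain.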

In [None]:
pca_results[0:10]

In [None]:
print(pca_results.shape)


In [None]:
print (type(pca_results))
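
`fit_transform` returns a plain NumPy array whose rows are in the same order as the rows of the input, with one column per principal component. The sketch below uses invented numbers as a stand-in for `pca_results` to show how such an array can be wrapped in a dataframe and filtered with a True/False series built from a separate table of labels. The column names "PC1" and "PC2" are a labeling choice for this example only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for pca_results: 4 samples by 2 components (invented values).
toy_results = np.array([[1.0, 0.5], [-2.0, 0.1], [0.3, -0.7], [1.5, 0.2]])
# Toy stand-in for the 'System' column of the metadata, in the same row order.
toy_systems = pd.Series(["Blood", "Immune", "Blood", "Immune"])

# Wrap the array in a dataframe; the default row index is 0, 1, 2, 3.
toy_df = pd.DataFrame(toy_results, columns=["PC1", "PC2"])

# Both objects share the index 0..3, so the mask lines up row by row.
is_blood = toy_systems == "Blood"
blood_pc1 = toy_df.loc[is_blood, "PC1"].tolist()
print(blood_pc1)  # [1.0, 0.3]
```

This row-by-row alignment only works when the labels and the PCA output share the same index, which is one reason to reset the index of a filtered metadata table.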

## Make a scatter plot of the output from principal component analysis <a name='Scatter' />

In [None]:
#Make a scatter plot of principal component 1 (PC1) vs principal component 2 (PC2)
#convert the pca_results into a dataframe from the array output
pca_rnaseq_data=pd.DataFrame(pca_results)
    
#Define a variable traces 
traces=[] 

#Define the x and y values for the scatter plot  
#Values are defined in loops by system type so each system has different colored markers

for name in systems_subset:   

    #selects the rows in pca_rnaseq_data for the system and column 0 for the first principal component
    #1. Find all rows in metadata_subset that contain the current organ system. 
    rows_for_system=metadata_subset["System"]==name 
    
    #2. Select these rows, and the first column from pca_rnaseq_data 
    x=pca_rnaseq_data.loc[rows_for_system,[0]]
    
    #3. Convert the first (and only) column of x to a list --
    #this is a syntax change to make plotly work
    #in this example plotly does not accept data frame inputs and needs lists.
    x=x[0].tolist()         
    
    #Repeat for the y-axis, but select column 1 instead of column 0 from pca_rnaseq_data. 
    y=pca_rnaseq_data.loc[rows_for_system,[1]] 
    y=y[1].tolist() 
    
    trace=Scatter(
        x=x,
        y=y,
        #defines the mode of the plot, in this case markers (as opposed to lines or text)
        mode="markers",

        #defines the name that appears in the legend, in this case the system name 
        name=name)
    
    #appends the trace for each system to traces
    traces.append(trace)

#Label the axes 
layout=Layout(xaxis=dict(title='PC1'),yaxis=dict(title='PC2'),showlegend=True)

#Draw the figure 
fig=Figure(data=traces,layout=layout)
plotly.offline.iplot(fig)    
    
    

In [None]:
#Make a scatter plot of principal component 2 (PC2) vs principal component 3 (PC3)
#Make sure to change your axes labels too!

##ANSWER## 

In [None]:
#Make a scatter plot of principal component 1 (PC1) vs principal component 3 (PC3)
#Make sure to change your axes labels too!

##ANSWER## 