# Big Data for Biologists: Decoding Genomic Function - Class 8

## How do you visualize similarities and differences of gene expression profiles across cell types? Part II

##  Learning Objectives
***Students should be able to***
 <ol>
 <li> <a href=#MetaData>Prepare RNA-Seq samples and metadata for PCA analysis</a></li> 
 <li> <a href=#PCA1>Describe what principal component analysis is and how it can be used to analyze and visualize variation in large datasets</a></li>
 <li> <a href=#PCA2>Perform principal component analysis to identify clustering patterns in gene expression data</a></li>
 <li> <a href=#Scatter>Make a scatter plot of the output from principal component analysis</a></li> 
 </ol>

## Prepare RNA-Seq samples and metadata for PCA analysis (review)<a name='MetaData' />

In [None]:
%load_ext autoreload
# load the pandas package and define an abbreviation (or alias) 
import pandas as pd   

We will focus our analysis on four of the organ systems (blood, embryonic, immune, and respiratory) and check for differential gene expression among them.


In [None]:
#Read in the metadata table. 
metadata = pd.read_table(
     filepath_or_buffer='../datasets/RNAseq/rnaseq_metadata.txt', 
     header=0,
    index_col=0)

In [None]:
# Read in the RNA-seq data matrix. 
rnaseq_data = pd.read_table(
     filepath_or_buffer='../datasets/RNAseq/rnaseq_normalized.tsv', 
     header=0,
     index_col=0)

In [None]:
print("Number rows:"+str(rnaseq_data.shape[0]))#prints number of rows -- this is the gene axis
print("Number columns:"+str(rnaseq_data.shape[1]))#prints number of columns -- this is the sample axis

In [None]:
#Since our goal is to identify genes that are differentially expressed across organ systems, we 
# want to exclude genes that have expression = 0 in every sample (rows of the matrix that sum to 0). 
rnaseq_data_subset=rnaseq_data[rnaseq_data.sum(axis=1)>0]


In [None]:
print("Number rows:"+str(rnaseq_data_subset.shape[0]))#prints number of rows -- this is the gene axis
print("Number columns:"+str(rnaseq_data_subset.shape[1]))#prints number of columns -- this is the sample axis

In [None]:
#Transpose the data frame 
#Now, our features (genes) are along the column axis, and sample names are along the row axis. This will make for easier
#downstream analysis. 
rnaseq_data_subset=rnaseq_data_subset.transpose()

In [None]:
print("Number rows:"+str(rnaseq_data_subset.shape[0]))#prints number of rows -- this is now the sample axis 
print("Number columns:"+str(rnaseq_data_subset.shape[1]))#prints number of columns -- this is now the gene axis

In [None]:
#Merge the rnaseq_data_subset dataframe with the metadata dataframe so we can more easily sub-select the organ systems 
#of interest.

merged_df=pd.merge(rnaseq_data_subset, metadata, left_index=True,right_index=True)
merged_df.head()
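
To see what merging on the index does, here is a minimal sketch on two made-up tables. The names `toy_expr`, `toy_meta`, and `toy_merged` and all values are hypothetical and are not part of the course dataset. It assumes the default behavior of `pd.merge`, which is an inner join: a sample that appears in only one of the two tables is dropped from the result.

```python
import pandas as pd

# Toy expression table: rows are samples, columns are genes (made-up values)
toy_expr = pd.DataFrame(
    {"geneA": [1.0, 0.0, 3.5], "geneB": [2.0, 4.0, 0.5]},
    index=["sample1", "sample2", "sample3"])

# Toy metadata table: note that sample3 has no metadata entry
toy_meta = pd.DataFrame(
    {"System": ["Blood", "Immune"]},
    index=["sample1", "sample2"])

# Merge on the row index of both tables. The default is an inner join,
# so only samples present in both tables are kept.
toy_merged = pd.merge(toy_expr, toy_meta, left_index=True, right_index=True)
print(toy_merged)
```

Comparing the number of rows before and after a merge is a quick way to check whether any samples were lost because their names did not match.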

In [None]:
#Define the systems of interest
systems_subset=["Blood","Embryonic","Immune","Respiratory"]

In [None]:
#Pick out the samples (rows) in the merged dataframe that contain the samples from the systems of interest 
samples_to_keep=merged_df['System'].isin(systems_subset)
samples_to_keep.head()

In [None]:
#Select the rows in the data matrix that contain the samples we wish to analyze (i.e. the samples
#from the blood, embryonic, immune, and respiratory systems)
merged_df_subset=merged_df[samples_to_keep]

In [None]:
merged_df_subset.head()

In [None]:
#Check row & column numbers in merged_df_subset 
print("Number rows:"+str(merged_df_subset.shape[0]))#prints number of rows -- this is the sample axis
print("Number columns:"+str(merged_df_subset.shape[1]))# prints the number of columns -- this is the gene axis 

In [None]:
nonzero_rows=rnaseq_data.sum(axis=1)>0
print(nonzero_rows)

Note that `rnaseq_data.sum(axis=1)>0` returns a value of "True" or "False" for each row in the `rnaseq_data` matrix. Selecting rows with such a series of True/False values is referred to as boolean indexing. 

Next, we identify the rows with a value of "True", and select them from `rnaseq_data`. 
This can be done with the command: 

In [None]:
rnaseq_data_subset=rnaseq_data[nonzero_rows]
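
The same filtering step can be tried on a tiny made-up matrix to see exactly which rows survive. This is only an illustrative sketch: `toy`, `toy_mask`, and `toy_filtered` are hypothetical names, and the values are invented. As in `rnaseq_data` before transposing, rows are genes and columns are samples.

```python
import pandas as pd

# Toy matrix: rows are genes, columns are samples (made-up values)
toy = pd.DataFrame(
    {"sample1": [0.0, 2.0, 0.0], "sample2": [0.0, 1.5, 3.0]},
    index=["geneA", "geneB", "geneC"])

# One True/False value per gene: True if the gene's total expression is above 0
toy_mask = toy.sum(axis=1) > 0
print(toy_mask)

# Keep only the rows where the mask is True
toy_filtered = toy[toy_mask]
print(toy_filtered)
```

Here `geneA` is zero in every sample, so its mask value is False and it is removed.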

We have extracted RNA-seq expression data for our four organ systems of interest. We have also removed all genes that are not expressed in any of the four organ systems.

## What is principal component analysis (PCA)? <a name='PCA1' />

Principal component analysis (PCA) is a statistical method to understand and visualize variation in large datasets.

In [None]:
from IPython.display import HTML
HTML('<iframe src="https://docs.google.com/presentation/d/e/2PACX-1vRcZ3RiXYbwH_xrE-261ccJT71HKZ5oPJqmIATdHa2SwvDekvAR5Lr7zDwnNPN88FAEM2XT-F6-DHiS/embed?start=false&loop=false&delayms=60000" frameborder="0" width="960" height="749" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>')

We will use the [scikit-learn](http://scikit-learn.org/stable/) Python library to perform principal component analysis. We import scikit-learn with the command "import sklearn". This library has a number of built-in tools for performing statistical analysis and machine learning. 

[This tutorial](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) provides a guide to performing PCA with scikit-learn.
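
Before applying PCA to the expression data, it can help to run it on a small synthetic dataset where the answer is easy to anticipate. The sketch below is an illustration under stated assumptions, not part of the course analysis: the data are randomly generated, and `toy_X`, `toy_pca`, and `toy_scores` are hypothetical names. Two of the three features are built to be nearly identical, so most of the variation should be captured by a single principal component.

```python
import numpy as np
from sklearn.decomposition import PCA

# Fixed seed so the made-up data are reproducible
rng = np.random.default_rng(0)

# 50 made-up samples with 3 features:
# feature 2 is nearly a copy of feature 1, feature 3 is small independent noise
base = rng.normal(size=50)
toy_X = np.column_stack([
    base,
    base + rng.normal(scale=0.1, size=50),
    rng.normal(scale=0.1, size=50)])

# Reduce the 3 features to 2 principal components
toy_pca = PCA(n_components=2)
toy_scores = toy_pca.fit_transform(toy_X)

print(toy_scores.shape)                   # one row per sample, one column per component
print(toy_pca.explained_variance_ratio_)  # fraction of variance captured by each component
```

The rows of the input are samples and the columns are features, which is why the expression matrix was transposed earlier so that genes lie along the column axis.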

In [None]:
# Perform principal component analysis on the data to check for clustering patterns
from sklearn.decomposition import PCA as sklearnPCA

#We decompose the data into 10 principal components 
sklearn_pca = sklearnPCA(n_components=10)
#We want to exclude the metadata columns from the PCA transformation -- they have served their purpose in helping 
#us filter the dataset to the organ systems of interest, and now we remove them. 
metadata_subset=merged_df_subset[['System','Organ','CellType']]
merged_df_subset = merged_df_subset.drop(['System', 'Organ','CellType'], axis=1)

pca_results = sklearn_pca.fit_transform(merged_df_subset)


In [None]:
#Print both shapes (a notebook cell only displays the value of its last expression)
print(merged_df_subset.shape)
print(metadata_subset.shape)

We visualize the percent of variance explained by each principal component in a graph called a "scree plot".

In [None]:
#We use the plotnine plotting library to generate a scree plot from the principal component analysis. 
#Import the plotting functions from plotnine
from plotnine import * 


In [None]:
print(sklearn_pca.explained_variance_ratio_)

In [None]:
#We use the plotnine plotting library to generate a scree plot of the variance explained by each component
#Each bar shows the fraction of variance explained by one principal component
y=sklearn_pca.explained_variance_ratio_
x=range(1,len(y)+1)
qplot(x=x,
      y=y,
      geom="bar",
      stat="identity",
      xlab="PC",
      ylab="Fraction of variance explained")

This indicates that the first principal component explains 82% of the variance in the data, while the second principal component explains 4% of the variance. 
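
A related question is how many components are needed to capture a given share of the total variance. One way to answer it, sketched below, is to take the running total of the explained variance ratios. The ratios used here are hypothetical values chosen for illustration, not the output of this notebook, and `example_ratios`, `threshold`, and `n_needed` are made-up names. In the notebook, the same steps could be applied to `sklearn_pca.explained_variance_ratio_`.

```python
import numpy as np

# Hypothetical ratios for illustration only; in the notebook these would come
# from sklearn_pca.explained_variance_ratio_
example_ratios = np.array([0.6, 0.2, 0.1, 0.05])

# Running total: fraction of variance explained by the first k components
cumulative = np.cumsum(example_ratios)
print(cumulative)

# Smallest number of components whose cumulative fraction reaches the threshold
threshold = 0.75
n_needed = int(np.argmax(cumulative >= threshold)) + 1
print(n_needed)
```

Note that `np.argmax` returns 0 when no entry meets the threshold, so this sketch assumes the threshold is actually reached by the components that were computed.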

In [None]:
pca_results[0:10]

In [None]:
print(pca_results.shape)


In [None]:
print(type(pca_results))

## Make a scatter plot of the output from principal component analysis <a name='Scatter' />

In [None]:
#We make a scatterplot of PC1 vs PC2 
x=pca_results[:,0]
y=pca_results[:,1]
qplot(x=x,
      y=y,
      geom="point",
      xlab="PC1",
      ylab="PC2")

To investigate whether there is any clustering of samples by organ system, we can color-code by the 'System' column from the metadata table.

In [None]:
qplot(x=x,
      y=y,
      geom="point",
      xlab="PC1",
      ylab="PC2",
      color=list(metadata_subset['System']))+scale_color_discrete(name="System")
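
The color coding above relies on the rows of `pca_results` being in the same order as the rows of `metadata_subset`. One way to make that pairing explicit is to collect the component scores and the labels in a single table, as in the sketch below. The scores and labels here are made-up values, and `toy_scores`, `toy_metadata`, and `toy_plot_df` are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

# Made-up PCA scores for 4 samples and 3 components (illustration only)
toy_scores = np.array([[ 1.0,  0.2, -0.1],
                       [ 0.9, -0.3,  0.0],
                       [-1.1,  0.4,  0.2],
                       [-0.8, -0.3, -0.1]])

# Made-up metadata with one row per sample, in the same order as the scores
toy_metadata = pd.DataFrame(
    {"System": ["Blood", "Blood", "Immune", "Immune"]},
    index=["s1", "s2", "s3", "s4"])

# Name the columns PC1, PC2, ... and reuse the sample names as the row index
pc_names = ["PC" + str(i + 1) for i in range(toy_scores.shape[1])]
toy_plot_df = pd.DataFrame(toy_scores, columns=pc_names, index=toy_metadata.index)

# Attach the labels; pandas aligns them to the rows by sample name
toy_plot_df["System"] = toy_metadata["System"]
print(toy_plot_df)
```

A table like this could also be passed to plotnine's `ggplot` with `aes(x="PC1", y="PC2", color="System")` and `geom_point()`, which refers to the columns by name instead of passing separate arrays.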

In [None]:
#Make a scatter plot of principal component 2 (PC2) vs principal component 3 (PC3)
#Make sure to change your axis labels too!

##ANSWER## 

In [None]:
#Make a scatter plot of principal component 1 (PC1) vs principal component 3 (PC3)
#Make sure to change your axis labels too!

##ANSWER## 