# KMC Proteins of Interest CCLE
This notebook visualizes [CCLE](http://software.broadinstitute.org/software/cprg/?q=node/11) gene expression data for the KMC proteins of interest, which come from three protein classes: kinases, GPCRs, and ion channels.

In [2]:
import pandas as pd
import generate_proteins_of_interest_matrix
from clustergrammer_widget import *

## Gather Subset of CCLE Data
This function generates a subset of the CCLE expression matrix that contains only the genes of interest. This version of the CCLE data has been 'downsampled': the 1,037 cell lines have been replaced with 100 clusters identified using K-means. Downsampling improves the visualization, highlights underrepresented tissue types (e.g. pancreas), and downplays the impact of over-represented tissues (e.g. lung).

In [4]:
# generate gene expression matrix with proteins of interest
generate_proteins_of_interest_matrix.main()

-- generate dictionary with protein names
-- load CCLE downsampled data
321 proteins of interest were found in the CCLE data
-- save matrix with proteins_of_interest subset
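
To make the downsampling idea concrete, here is a minimal sketch of how cell-line columns could be collapsed into cluster columns once cluster labels are known. It is not the code used to build the downsampled CCLE file: the toy matrix, the `tissue` and `cluster` dictionaries, and the `downsample` helper are all hypothetical, and using the mean of the member cell lines (the K-means centroid) as the cluster column is an assumption.

```python
from collections import Counter

import pandas as pd

# Toy expression matrix: rows are genes, columns are cell lines (made-up values).
expr = pd.DataFrame(
    {
        "line_a": [1.0, 4.0],
        "line_b": [3.0, 6.0],
        "line_c": [10.0, 0.0],
    },
    index=["GENE1", "GENE2"],
)

# Hypothetical inputs: a tissue label and a cluster label per cell line.
# In a real pipeline the cluster labels would come from K-means.
tissue = {"line_a": "lung", "line_b": "lung", "line_c": "pancreas"}
cluster = {"line_a": 0, "line_b": 0, "line_c": 1}


def downsample(expr, cluster, tissue):
    """Collapse cell-line columns into one mean column per cluster."""
    columns = {}
    meta = {}
    for cluster_id in sorted(set(cluster.values())):
        members = [c for c in expr.columns if cluster[c] == cluster_id]
        name = f"cluster-{cluster_id}"
        # Assumption: a cluster is represented by the mean of its members.
        columns[name] = expr[members].mean(axis=1)
        counts = Counter(tissue[m] for m in members)
        meta[name] = {
            "Majority Tissue": counts.most_common(1)[0][0],
            "number in clust": len(members),
        }
    return pd.DataFrame(columns), pd.DataFrame(meta).T


ds_expr, ds_meta = downsample(expr, cluster, tissue)
print(ds_expr)
print(ds_meta)
```

Because every cluster contributes a single column regardless of how many cell lines it contains, a heavily sampled tissue no longer dominates the heatmap.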


## Interactive Visualization of Gene Expression Data using Clustergrammer
We will Z-score normalize each gene's expression across all cell-line-clusters (the downsampled cell lines) to emphasize relative expression rather than absolute levels.
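
As a rough illustration of what row-wise Z-scoring does, the sketch below applies it to a toy matrix with plain pandas. The `toy` data and the `zscore_rows` helper are hypothetical, and the use of the sample standard deviation (pandas' default) is an assumption; Clustergrammer's own convention may differ.

```python
import pandas as pd

# Toy matrix: rows are genes, columns are cell-line-clusters (made-up values).
toy = pd.DataFrame(
    {
        "cluster-0": [1.0, 10.0],
        "cluster-1": [2.0, 20.0],
        "cluster-2": [3.0, 30.0],
    },
    index=["GENE1", "GENE2"],
)


def zscore_rows(df):
    """Subtract each row's mean and divide by each row's standard deviation."""
    means = df.mean(axis=1)
    stds = df.std(axis=1)  # sample standard deviation (ddof=1), an assumption
    return df.sub(means, axis=0).div(stds, axis=0)


z = zscore_rows(toy)
print(z)
```

The two toy genes differ tenfold in absolute expression but share the same pattern across clusters, so their Z-scored rows are identical. This is the sense in which the normalization emphasizes relative expression.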

In [5]:
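# load the proteins-of-interest matrix into a Clustergrammer network
net = Network()
net.load_file('CCLE/CCLE_kmeans_ds_col_100_poi.txt')
# keep a DataFrame copy of the un-normalized data
df = net.export_df()
# Z-score each gene (row) across cell-line-clusters
net.normalize(axis='row', norm_type='zscore', keep_orig=False)
# cluster rows and columns and also compute similarity matrices
net.make_clust(views=[], sim_mat=True)
clustergrammer_widget(network=net.widget())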

make similarity matrices of rows and columns, add to viz data structure



The heatmap above shows cell-line-clusters (obtained from K-means downsampling) as columns and genes as rows. Each cell-line-cluster has a 'Majority Tissue' category, the tissue most often found in the cluster, and a 'number in clust' value, the number of cell lines in the cluster. Gene rows have a 'type' category that identifies each gene as a kinase, GPCR, or ion channel.

Cell-line-clusters (columns) cluster according to their tissue type. Genes show some clustering by gene type; for example, GPCRs appear to form two large clusters.

## Gene Similarity Matrix
Below is a similarity matrix of genes based on their expression in CCLE. We again see that GPCRs tend to cluster with other GPCRs.

In [7]:
clustergrammer_widget(network=net.export_net_json('sim_row'))
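
For intuition about what a gene similarity matrix contains, here is a minimal sketch that compares the expression profiles of genes pairwise. It assumes cosine similarity; the measure Clustergrammer uses internally for `sim_mat=True` is not stated in this notebook. The `toy` data and the `cosine_similarity_rows` helper are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy Z-scored matrix: rows are genes, columns are cell-line-clusters.
toy = pd.DataFrame(
    [[1.0, 0.0, -1.0], [2.0, 0.0, -2.0], [-1.0, 0.0, 1.0]],
    index=["GENE1", "GENE2", "GENE3"],
    columns=["cluster-0", "cluster-1", "cluster-2"],
)


def cosine_similarity_rows(df):
    """Return a gene-by-gene matrix of cosine similarities between rows."""
    values = df.to_numpy(dtype=float)
    # Scale each row to unit length; rows that are all zeros are not handled.
    norms = np.linalg.norm(values, axis=1, keepdims=True)
    unit = values / norms
    return pd.DataFrame(unit @ unit.T, index=df.index, columns=df.index)


sim = cosine_similarity_rows(toy)
print(sim.round(2))
```

Genes with the same expression pattern (GENE1 and GENE2) score close to 1 and genes with opposite patterns (GENE1 and GENE3) score close to -1, so blocks of co-expressed genes appear as high-similarity squares along the diagonal once the matrix is clustered.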