### CanDI and DESeq2
Let's say I want to look at changes in RNA expression across some cell lines in CCLE. DESeq2 is my preffered tool for doing differential expression analysis, unforutantely it's written in R. CanDI makes it easy to format CCLE read counts data into the shape that DESeq2 expects.

In [4]:
import CanDI as can
import numpy as np
import pandas as pd

#### Object Instantiation
For this example I'm going to do differential expression analysis across male and female KRAS mutant cell lines. The cell below uses CanDI to generate the correct CellLineCluster objects for our purpose.

In [42]:
lung = can.Cancer("Lung Cancer", subtype="NSCLC")
lung = can.CellLineCluster(lung.mutated("KRAS", variant = "Variant_Classification", item = "Missense_Mutation"))

lung_male = can.CellLineCluster(list(lung._info.loc[lung._info.sex == "Male",].index))
lung_female = can.CellLineCluster(list(lung._info.loc[lung._info.sex == "Female"].index))



#### Data Munging
The follow function takes two objects that we want to compare and automatically generates the counts and coldata matrices that DESeq2 needs to run. It's typically a good idea to filter our genes/transcripts with consistently low counts prior to running DESeq2. This speeds up analysis and avoids issues related to read count scaling and multiple hypthothesis testing corrections. The function below will filter out all genes that have mean read counts less than 10. In this case we don't care about different splicing of the same genes so I sum counts for duplicate indeces for all samples. 

In [43]:
def make_counts_coldata(obj1, obj2, condition, factor1, factor2):
    
    counts1 = obj1.rnaseq_reads
    coldat1 = pd.Series(counts1.shape[1] * [factor1], index = counts1.columns, name = condition)
    
    counts2 = obj2.rnaseq_reads
    coldat2 = pd.Series(counts2.shape[1] * [factor2], index = counts2.columns, name = condition)
    
    #Concatenate Column Data
    coldat = pd.concat([coldat1, coldat2], axis = 0)
    
    #Concatenate read count data 
    counts_mat = pd.concat([counts1, counts2], axis = 1)
    #Filter out lowley epxressed genes
    counts_mat = counts_mat.loc[counts_mat.mean(1) < 10, ].astype(int)
    #Sum duplicate indeces
    counts_mat = counts_mat.groupby(counts_mat.index).sum()
    
    return counts_mat, coldat
    
counts, coldat = make_counts_coldata(lung_male, lung_female, "sex", "male", "female")

#counts.to_csv("temp_dat/lung_sex_counts.csv")
#coldat.to_csv("temp_dat/lung_sex_coldata.csv")

#### Running DESeq2
In the following cell I use the csvs I just saved as arguments for an r-script that runs DESeq2. The last argument in this script the filname for the results.

In [None]:
!Rscript scripts/run_deseq.r temp_dat/lung_sex_counts.csv temp_dat/lung_sex_coldata.csv temp_dat/lung_sex_deseq.csv

#### Analyzing Results
Now we can read the results of the differential expression analysis back into our python enviroment and continue analysis as necessary. 

In [50]:
res = pd.read_csv("temp_dat/lung_sex_deseq.csv", index_col = "Unnamed: 0")
res.head()

Unnamed: 0,baseMean,log2FoldChange,lfcSE,stat,pvalue,padj
A2ML1-AS1,0.050427,0.221562,3.213483,0.068948,0.945031,
A2ML1-AS2,0.0,,,,,
A2MP1,0.537341,0.085599,1.020871,0.083849,0.933176,
A3GALT2,0.961101,-1.256893,0.736166,-1.707349,0.087757,
A4GNT,3.320609,0.734466,0.738855,0.99406,0.320194,0.823095
