In [None]:
# setting up jupyter
%matplotlib inline
import matplotlib as mpl
mpl.rcParams['figure.dpi'] = 300
mpl.rcParams['savefig.dpi'] = 300

# Example 2 - exploring the knockdown of Chd4 and Mcrs1
In this example, mESCs were treated with shChd4 and shMcrs1. The sequencing analysis did not reveal the direction of differentiation in these cells, so the outcome of exploring the differentiation bias is not known in advance. This example features the ranked bar plot and shows how to use the plot's return values.

We start by passing the input data to <i>samples</i>. A DESeq2 analysis was conducted on the data to identify the differentially expressed genes. Together with the expression TSV file, we can initiate the samples:

In [None]:
import DPre
s = DPre.samples(diff_genes = './knockdown_deseq_res', 
                 expression = 'knockdown_expression.tsv',
                 ctrl = 'shLuc',
                 name = 'sh knockdown in mESCs',
                )

As desired, the log tells us that the control was added to the <i>diff_genes</i> data to match the length of the expression data.
However, the initiation failed because the names in the expression data and the gene list data do not match. Since the elements do align and differ only in their naming (see logs), we can override the gene list names with the <b><i>override_namematcher</i></b> argument:

In [None]:
import DPre
s = DPre.samples(diff_genes = './knockdown_deseq_res', 
                 expression = 'knockdown_expression.tsv',
                 ctrl = 'shLuc',
                 name = 'sh knockdown in mESCs',
                 override_namematcher = True,    # ignore mismatches between expression and gene list names, use with care
                )
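
Before overriding, it is worth confirming that the two sets of names really describe the same elements in the same order. The following is a minimal sketch of such a check in plain Python. It is not DPre's own name matching, and the names in it are invented for illustration; the real mismatching names are the ones reported in the logs.

```python
def compare_names(expression_names, gene_list_names):
    """Return the names unique to each side, preserving input order."""
    expr_set, gene_set = set(expression_names), set(gene_list_names)
    only_expr = [n for n in expression_names if n not in gene_set]
    only_gene = [n for n in gene_list_names if n not in expr_set]
    return only_expr, only_gene


# hypothetical names, for illustration only (not taken from the actual data)
expression_names = ['shLuc', 'shChd4', 'shMcrs1']
gene_list_names = ['shLuc', 'Chd4_kd', 'Mcrs1_kd']

only_expr, only_gene = compare_names(expression_names, gene_list_names)
same_length = len(expression_names) == len(gene_list_names)

print('only in expression:', only_expr)
print('only in gene lists:', only_gene)
print('same number of elements:', same_length)
```

If the element counts agree and the unmatched names pair up one-to-one by position, overriding the names is a reasonable choice; otherwise the input files should be fixed first.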

Create the 'mouse' reference targets instance:

In [None]:
import DPre
t = DPre.preset_targets('mouse', preset_colors=True)

We first get an overview of the similarities using <i>target_similarity_heatmap()</i>. For cross-validation, we run it with both the 'euclid' and the 'intersect' metric:

In [None]:
import DPre
# euclid overview
hm = t.target_similarity_heatmap(s, 
                                 metric = 'euclid', 
                                 hide_targetlabels = True,
                                 heatmap_width = .09,
                                 targetlabels_space = .8,
                                 pivot = True,
                                 filename = 'target_sim_euclid.png',
                                 )
# intersect overview
hm = t.target_similarity_heatmap(s, 
                                 metric = 'intersect', 
                                 hide_targetlabels = True,
                                 heatmap_width = .09,
                                 targetlabels_space= .8,
                                 pivot = True,
                                 filename = 'target_sim_intersect.png',
                                 )

Or on the command line (euclid):

In [None]:
# copy and paste into your terminal (somehow doesn't run in here)
!python ../../dpre.py -pt "mouse" -sd "./knockdown_deseq_res" -se "knockdown_expression.tsv" -c "shLuc" -sn "sh knockdown in mESCs" -so target_sim -hta -hw 0.09 -ta 0.8 -pi -f "target_sim_euclid.png"
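
The two heatmaps are compared by eye in this example. As an optional complement, the agreement between the metrics could be summarized per sample with a rank correlation across targets. The sketch below implements Spearman's rho in plain Python. It assumes the per-target similarity values have already been collected into two dictionaries; how to extract them from DPre is not shown here, and the target names and values are invented.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError('rank correlation is undefined for constant input')
    return cov / (var_x * var_y) ** 0.5


# invented similarity values per target, for illustration only
euclid = {'target A': 0.30, 'target B': -0.10, 'target C': 0.05, 'target D': -0.25}
intersect = {'target A': 0.20, 'target B': -0.05, 'target C': 0.10, 'target D': -0.30}

names = sorted(euclid)
rho = spearman([euclid[n] for n in names], [intersect[n] for n in names])
print(f'Spearman rho: {rho:.2f}')
```

A rho close to 1 means both metrics order the targets similarly, as in these toy values; a clearly lower rho for one sample would match the kind of disagreement discussed next.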

While the similarity values of the two metrics largely overlap for the shMcrs1 sample, they diverge more for shChd4. Such disagreement between the metrics generally indicates low validity of the result. Mcrs1 knockdown seems to result in a similarity increase with distinct blood mesoderm cell types. To identify these cell types, we use the <i>ranked_similarity_barplot()</i> function:

In [None]:
bp = t.ranked_similarity_barplot(samples = s, 
                                 metric = 'euclid',
                                 display_negative = True,    # also show the bottom peak values
                                 pivot = True,
                                 filename =  'ranked_sim_eucl.pdf')
bp = t.ranked_similarity_barplot(samples = s, 
                                 metric = 'intersect',
                                 display_negative = True,
                                 pivot = True,
                                 filename =  'ranked_sim_intersect.pdf')

Or on the command line:

In [None]:
# copy and paste into your terminal
!python ../../dpre.py -pt "mouse" -sd "./knockdown_deseq_res" -se "knockdown_expression.tsv" -c "shLuc" -sn "sh knockdown in mESCs" -so ranked_sim -pi -din -f "ranked_sim_eucl.png"

The knockdown of Mcrs1 results in a defined differentiation bias towards erythroblasts. We can proceed by identifying the genes that drive this bias. We first subset the targets to the different erythroblast types found in the reference.

In [None]:
# select the erythroblast targets; 'rythroblast' matches upper- and lowercase spellings
eryth_names = [n for n in t.names if 'rythroblast' in n]
t_eryth = t.slice_elements(eryth_names)
# drop the shChd4 element from the samples, keeping shMcrs1 and the control
mcrs4 = s.slice_elements(['shMcrs1', 'shLuc'])

# run the plot without saving to get the returned plot dictionary
gene_sim_plt = t_eryth.gene_similarity_heatmap(mcrs4, 
                                               metric = 'euclid',
                                               display_genes = 'increasing', 
                                               gene_number = 80,
                                               filename = None,
                                               )

The heatmaps reveal that the similarity shift is mainly driven by histone transcripts. We can use the return value of the plot to access the list of genes and assign colors to the different histone types. For this use case and for more general data transformations, DPre provides the <i><b>annotate()</b></i> and <i><b>get_ensgs()</b></i> functions, which are based on the Ensembl v96 annotation.

In [None]:
import DPre
# first index the target name,
# then the third element (index 2) of [axes, figure, data],
# then 'up' for the marker gene type (the only one here),
# then the first element (index 0) of [heatmap data, distance bar data, sum plot data],
# finally the column names of this DataFrame
genelist = gene_sim_plt['Erythroblast'][2]['up'][0].columns

# annotate the ensg keys
genelist = DPre.annotate(genelist, 'mouse')

# assemble lists containing the specific hist*-groups
hist1 = []
hist2 = []
hist3 = []
hist4 = []
for gene in genelist:
    if gene.startswith('Hist1'):
        hist1.append(gene)
    elif gene.startswith('Hist2'):
        hist2.append(gene)
    elif gene.startswith('Hist3'):
        hist3.append(gene)
    elif gene.startswith('Hist4'):
        hist4.append(gene)

# create a dictionary that maps the gene names to a respective color
hist1_cols = dict.fromkeys(hist1, DPre.config.colors[10])
hist2_cols = dict.fromkeys(hist2, DPre.config.colors[11])
hist3_cols = dict.fromkeys(hist3, DPre.config.colors[12])
hist4_cols = dict.fromkeys(hist4, DPre.config.colors[13])
genes_cb = {**hist1_cols, **hist2_cols, **hist3_cols, **hist4_cols}
# plot the color legend
DPre.plot_color_legend(('Hist1', 'Hist2', 'Hist3', 'Hist4'), DPre.config.colors[10:14],
                       filename='hist_legend.png')

# plot the most increasing genes and save
data = t_eryth.gene_similarity_heatmap(mcrs4, 
                                       metric = 'euclid',
                                       display_genes = 'increasing',
                                       gene_number = 80,
                                       heatmap_width = .9,
                                       genelabels_size = .7,
                                       genelabels_space = .5,
                                       show_genes_colorbar = genes_cb,
                                       filename = 'gene_sim_incr.pdf',
                                       HM_WSPACE = .1,           # size constant defining the horizontal space between plots
                                      )
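
The if/elif grouping and the four `dict.fromkeys` calls above can also be written as a single prefix lookup. The following is a minimal sketch in plain Python; `color_by_prefix` is a hypothetical helper, not part of DPre, and the gene list and hex colors are placeholders chosen for illustration.

```python
def color_by_prefix(genes, prefix_colors):
    """Map each gene to the color of the first prefix it starts with.

    Genes that match no prefix are left out of the result.
    """
    gene_colors = {}
    for gene in genes:
        for prefix, color in prefix_colors.items():
            if gene.startswith(prefix):
                gene_colors[gene] = color
                break
    return gene_colors


# placeholder gene names and colors, for illustration only
genes = ['Hist1h1b', 'Hist2h2ab', 'Gata1', 'Hist1h4d']
prefix_colors = {
    'Hist1': '#1f77b4',
    'Hist2': '#ff7f0e',
    'Hist3': '#2ca02c',
    'Hist4': '#d62728',
}

genes_cb_sketch = color_by_prefix(genes, prefix_colors)
print(genes_cb_sketch)
```

In the notebook, `prefix_colors` could be built from `DPre.config.colors[10:14]` and `genes` replaced by the annotated `genelist`; the resulting dictionary has the same shape as `genes_cb` passed to `show_genes_colorbar` above.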

Or on the command line without colors:

In [None]:
# copy and paste into your terminal
!python ../../dpre.py -pt "mouse" -ts "Basophilic erythroblast" -ts "Erythroblast" -ts "Orthochromatic erythroblast" -ts "Polychromatic erythroblast" -ts "Proerythroblast" -sd "./knockdown_deseq_res" -se "knockdown_expression.tsv" -c "shLuc" -sn "sh knockdown in mESCs" -so -ss "shMcrs1" -ss "shLuc" gene_sim -di "increasing" -gn 80 -hw 0.9 -ges 0.7 -ge 0.7 -f "gene_sim_incr.pdf"

The gene colorbar nicely emphasizes the dominance of Hist1.