# Viewing Gene Expression Distributions

A great deal of quality-control information comes from the alignment step itself.
Here we are concerned with how the data looks as an ensemble.
Many methods are particularly concerned with gene-wise expression variance.

In this notebook we demonstrate the plotting utilities provided by `GSForge` to examine such distributions.

***Set up the notebook***

In [None]:
import itertools

import holoviews as hv
import numpy as np

import GSForge as gsf

hv.extension("bokeh", "matplotlib")

In [None]:
# hv.help(hv.operation.datashader.datashade)

***Declare used paths***

In [None]:
# OS-independent path management.
from os import environ
from pathlib import Path

In [None]:
OSF_PATH = Path(environ.get("GSFORGE_DEMO_DATA", default="~/GSForge_demo_data/osfstorage")).expanduser()
HYDRO_NORMED_GEM_PATH = OSF_PATH.joinpath("AnnotatedGEMs", "oryza_sativa_hydro_normed.nc")
assert HYDRO_NORMED_GEM_PATH.exists()
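The later cells in this notebook write figures under `figures/genewise_aggs`, `figures/grouped_covariance`, `figures/kde` and `figures/ecdf`. The following is a minimal sketch for creating those folders up front, in case the save call does not create missing directories. The helper name `ensure_figure_dirs` is hypothetical and not part of `GSForge` or `holoviews`.

```python
import tempfile
from pathlib import Path

# Sub-directories named in this notebook's hv.save calls.
FIGURE_SUBDIRS = ("genewise_aggs", "grouped_covariance", "kde", "ecdf")


def ensure_figure_dirs(root):
    """Create each figure sub-directory under root; existing ones are left alone.

    Hypothetical helper for illustration only.
    """
    root = Path(root)
    created = []
    for name in FIGURE_SUBDIRS:
        target = root / name
        target.mkdir(parents=True, exist_ok=True)
        created.append(target)
    return created


# Demonstrated in a temporary directory so the sketch has no side effects.
with tempfile.TemporaryDirectory() as tmp:
    dirs = ensure_figure_dirs(Path(tmp) / "figures")
    all_exist = all(d.is_dir() for d in dirs)
```

In the notebook itself the call would presumably be `ensure_figure_dirs("figures")`, run once before any figure is saved.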

Declare a path to which the created `.nc` file will be saved.

***Load an AnnotatedGEM***

In [5]:
agem = gsf.AnnotatedGEM(HYDRO_NORMED_GEM_PATH)
agem

<GSForge.AnnotatedGEM>
Name: Oryza Sativa
Selected GEM Variable: 'counts'
    Gene   66338
    Sample 475

***View available count arrays***

In [6]:
agem.count_array_names

['counts',
 'TPM_counts',
 'uq_raw_counts',
 'uq_tpm_counts',
 'tmm_counts',
 'quantile_counts']

Recall that all `GSForge` plotting operations allow use of the `Interface` data selection pipeline.

In this case we can select another count array and view the normalized distributions.

In [7]:
agem.data.quantile_counts.min()

In [8]:
agem.data.quantile_counts.values

array([[ 1.61113226,  0.55381332,  0.22777673, ..., -5.19933758,
        -5.19933758, -5.19933758],
       [ 1.5712865 , -5.19933758,  0.2085048 , ..., -5.19933758,
        -5.19933758, -5.19933758],
       [ 1.66260644,  0.43837517,  0.2239158 , ..., -5.19933758,
         0.04174632, -5.19933758],
       ...,
       [ 1.66131023,  0.50871307, -5.19933758, ...,  0.044767  ,
         0.0496718 , -5.19933758],
       [ 1.58549412,  0.61732144, -5.19933758, ..., -5.19933758,
        -0.01881964, -5.19933758],
       [ 1.61702478,  0.48201485, -0.02132929, ...,  0.05088393,
         0.04481233, -5.19933758]])

## Gene-wise Aggregate Distributions

The call below shows the default arguments, with the exception of `datashade=True`.

In [9]:
for count_var, y_axis in itertools.product(agem.count_array_names, ['variance', 'fano', 'cv_squared']):

    plot = gsf.plots.gem.GenewiseAggregateScatter(
        agem,
        count_variable=count_var,
        x_axis_selector='mean',
        y_axis_selector=y_axis,
        axis_transform=('log 2', lambda ds: np.log2(ds.where(ds > 0))),
        datashade=True,
        dynspread=True,
    )

    hv.save(plot, f'figures/genewise_aggs/gw_agg_{count_var}_log2_mean_vs_log2_{y_axis}.png', dpi=300, toolbar=None)
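To make the axis selectors concrete, here is a small NumPy sketch of the statistics they are named after. It assumes the conventional definitions (Fano factor = variance / mean, squared coefficient of variation = variance / mean²) and the population variance (`ddof=0`). `GSForge` may compute these differently in detail. The toy matrix is made up. The masking step is a NumPy analogue of the `ds.where(ds > 0)` call used in `axis_transform`, which replaces non-positive values with NaN before taking the logarithm.

```python
import numpy as np

# Hypothetical toy count matrix: rows are samples, columns are genes.
counts = np.array([
    [0.0, 2.0, 10.0],
    [0.0, 4.0, 10.0],
    [0.0, 6.0, 40.0],
])

# Gene-wise aggregates are taken across samples (axis 0).
mean = counts.mean(axis=0)
variance = counts.var(axis=0)  # population variance (ddof=0), an assumption

# A gene with zero mean gives 0 / 0; silence the warning and keep the NaN.
with np.errstate(divide="ignore", invalid="ignore"):
    fano = variance / mean            # Fano factor
    cv_squared = variance / mean**2   # squared coefficient of variation

# Mask non-positive values with NaN before the log, as ds.where(ds > 0) does.
log2_mean = np.log2(np.where(mean > 0, mean, np.nan))

print(mean, variance, fano, cv_squared, log2_mean, sep="\n")
```

The all-zero gene drops out of a log-log plot because its values are NaN rather than `-inf`.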

The PNG files produced from these adjoint layouts have extra white space around the plot.
We can crop it with a solution adapted from this [GitHub gist](https://gist.github.com/thomastweets/c7680e41ed88452d3c63401bb35116ed).

In [5]:
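from PIL import Image
from PIL import ImageOps

# Pixels of margin to keep around the detected content.
padding = 5
padding = np.asarray([-1*padding, -1*padding, padding, padding])

for figure in Path('figures/genewise_aggs').glob('gw_agg_*.png'):

    image = Image.open(figure)
    image.load()
    image_size = image.size

    # Remove the alpha channel.
    invert_im = image.convert("RGB")

    # Invert the image so that white becomes 0; getbbox() then bounds the non-white content.
    invert_im = ImageOps.invert(invert_im)
    image_box = invert_im.getbbox()
    if image_box is None:
        # The image is entirely white, so there is nothing to crop to.
        continue

    # Pad the bounding box, clamped so it never extends past the image edges.
    left, upper, right, lower = (int(v) for v in np.asarray(image_box) + padding)
    image_box = (max(left, 0), max(upper, 0), min(right, image_size[0]), min(lower, image_size[1]))

    cropped = image.crop(image_box)
    cropped.save(figure)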

## Grouped-Sample Covariance

These plotting functions can take a few minutes to complete.

In [11]:
treatment_labels = agem.data['treatment'].to_series().unique()
treatment_labels

array(['CONTROL', 'HEAT', 'RECOV_HEAT', 'DROUGHT', 'RECOV_DROUGHT'],
      dtype=object)

In [12]:
%%time
for group_a, group_b in itertools.combinations(treatment_labels, 2):
    plot = gsf.plots.gem.GroupedGeneCovariance(agem, group_variable='treatment',
                                               x_group_label=group_a, y_group_label=group_b,
                                               count_transform=lambda c: np.log(c + 0.25)
                                               ).opts(size=0.75, width=300, height=300)
    hv.save(plot, f'figures/grouped_covariance/covariance_{group_a}_vs_{group_b}.png', 'png')

CPU times: user 20.3 s, sys: 14.2 s, total: 34.5 s
Wall time: 1min 3s
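As a quick check on what the loop above iterates over, this sketch lists the unordered treatment pairs and applies the same shifted-log transform to a toy array. The labels are copied from the output above. The sample values are made up.

```python
import itertools

import numpy as np

treatment_labels = ['CONTROL', 'HEAT', 'RECOV_HEAT', 'DROUGHT', 'RECOV_DROUGHT']

# Each unordered pair is visited once: 5 choose 2 = 10 plots.
pairs = list(itertools.combinations(treatment_labels, 2))
print(len(pairs))
print(pairs[0])


def shifted_log(c):
    """Same transform as the count_transform lambda used in the loop."""
    return np.log(c + 0.25)


# Made-up counts: the 0.25 offset keeps zero counts finite (log(0.25)).
toy_counts = np.array([0.0, 1.0, 100.0])
print(shifted_log(toy_counts))
```

The offset maps zero counts to a finite value instead of `-inf`, so genes with no expression in a sample can still be included in the comparison.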


## Sample-wise Distributions

These plotting functions can take a few minutes to complete.

### Kernel Density Estimates

In [6]:
%%time
for count_var, hue in itertools.product(['counts'], [None]):
    plot = gsf.plots.gem.SamplewiseDistributions(agem, count_variable=count_var, hue_key=hue, 
                                                 datashade=False)#.opts(width=300, height=300)
    hv.save(plot, f'figures/kde/samplewise_kde_{count_var}_{hue}.png', toolbar=None)

CPU times: user 11min 54s, sys: 42.1 s, total: 12min 36s
Wall time: 4min 34s
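For readers unfamiliar with kernel density estimates, here is a minimal NumPy sketch of a one-dimensional Gaussian KDE. It illustrates the general technique only. The estimator and bandwidth that `GSForge` uses internally are not shown in this notebook, and the function name, bandwidth and data below are all assumptions chosen for the example.

```python
import numpy as np


def gaussian_kde_1d(values, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate of `values` on `grid`.

    Illustrative helper; not part of GSForge.
    """
    values = np.asarray(values, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance from every grid point to every observation.
    z = (grid[:, None] - values[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    # Average the kernels and rescale by the bandwidth so the curve integrates to 1.
    return kernel.sum(axis=1) / (values.size * bandwidth)


# Made-up "expression" values for one sample.
rng = np.random.default_rng(0)
sample_values = rng.normal(loc=0.0, scale=1.0, size=200)

grid = np.linspace(-8.0, 8.0, 1601)
density = gaussian_kde_1d(sample_values, grid, bandwidth=0.3)

# Riemann-sum check that the estimate behaves like a probability density.
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

The cost grows with (grid points × observations), which is one reason sample-wise density plots over tens of thousands of genes can be slow.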


### Empirical Cumulative Distribution

In [14]:
%%time
for count_var, hue in itertools.product(agem.count_array_names[:-1], [None, 'treatment', 'genotype']):
    plot = gsf.plots.gem.EmpiricalCumulativeDistribution(agem, hue_key=hue, count_variable=count_var, datashade=True)
    hv.save(plot, f'figures/ecdf/ECDF_{count_var}_{hue}.png', dpi=300, toolbar=None)

CPU times: user 10min 11s, sys: 2min 45s, total: 12min 56s
Wall time: 13min 23s
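An empirical cumulative distribution is simple to compute by hand. The sketch below uses the standard definition: sort the values, then assign each the fraction of observations less than or equal to it. The function name and data are illustrative and are not taken from `GSForge`.

```python
import numpy as np


def ecdf(values):
    """Return sorted values and their empirical cumulative probabilities."""
    x = np.sort(np.asarray(values, dtype=float))
    # The i-th sorted value (1-based) has cumulative probability i / n.
    y = np.arange(1, x.size + 1) / x.size
    return x, y


# Made-up values for one sample.
x, y = ecdf([3.0, 1.0, 2.0, 2.0])
print(x)
print(y)
```

Unlike a kernel density estimate, the ECDF has no bandwidth parameter, so curves from different samples or normalizations can be compared directly.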
