Example notebook demonstrating scMaui on single-cell gene expression toy data.

In [1]:
import os
import pkg_resources
from scmaui.data import load_data
from scmaui.data import SCDataset
from scmaui.utils import get_model_params
from scmaui.ensembles import EnsembleVAE

## 1) Loading data

In [2]:
data_path = pkg_resources.resource_filename('scmaui', 'resources/')
gtx = os.path.join(data_path, 'gtx.h5ad')

`gtx.h5ad` contains single-cell gene expression values for 100 cells.

In [3]:
adatas = load_data([gtx], names=['gtx'])
adatas

{'input': [AnnData object with n_obs × n_vars = 100 × 35300
      var: 'interval', 'genename', 'ensid', 'genome', 'feature_type', 'chrom', 'start', 'end', 'view'
      uns: 'view'
      obsm: 'mask'],
 'output': [AnnData object with n_obs × n_vars = 100 × 35300
      var: 'interval', 'genename', 'ensid', 'genome', 'feature_type', 'chrom', 'start', 'end', 'view'
      uns: 'view'
      obsm: 'mask']}

We can construct a dataset that considers only the intersection of cells across the inputs, as follows:

In [4]:
dataset = SCDataset(adatas, losses=['negbinom'])
dataset

Inputs: non-missing/samples x features
	gtx: 100/100 x 35300
Outputs:
	gtx: 100/100 x 35300
0 Adversarials: []
0 Conditionals: []
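With a single modality, as here, the intersection is simply all 100 cells. The idea matters once several modalities are combined. The following is a minimal sketch of intersecting cell barcodes across two modalities; the barcodes are hypothetical, and the code illustrates the concept only, not how `SCDataset` implements it.

```python
# Hypothetical barcodes for two modalities; not taken from the toy data.
rna_cells = ["AAAC-1", "AAAG-1", "AACA-1", "AACC-1"]
atac_cells = ["AAAG-1", "AACC-1", "AACG-1"]

# Keep cells present in both modalities, preserving the order of the first list.
atac_set = set(atac_cells)
shared_cells = [cell for cell in rna_cells if cell in atac_set]

print(shared_cells)
```

Cells measured in only one modality are dropped, so every remaining cell has a value for every input.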

## 2) Instantiate a scMaui model

First we obtain some default parameters for the model, which are informed by the dataset dimensions:

In [5]:
params = get_model_params(dataset)
params

OrderedDict([('nunits_encoder', 32),
             ('nlayers_encoder', 5),
             ('nunits_decoder', 20),
             ('nlayers_decoder', 1),
             ('dropout_input', 0.1),
             ('dropout_encoder', 0.0),
             ('dropout_decoder', 0.0),
             ('nunits_adversary', 128),
             ('nlayers_adversary', 2),
             ('nlatent', 10),
             ('nmixcomp', 1),
             ('input_modality', ['gtx']),
             ('output_modality', ['gtx']),
             ('adversarial_name', []),
             ('adversarial_dim', []),
             ('adversarial_type', []),
             ('conditional_name', []),
             ('conditional_dim', []),
             ('conditional_type', []),
             ('losses', ['negbinom'])])

You can adjust the default settings by overwriting entries of this dictionary before creating the model.
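A minimal sketch of such an adjustment is given here. The `OrderedDict` is a stand-in for the object returned by `get_model_params(dataset)` and reproduces only a few of the entries listed above; the new values are arbitrary examples, not recommendations.

```python
from collections import OrderedDict

# Stand-in for the dictionary returned by get_model_params(dataset);
# only a few of the entries listed above are reproduced here.
params = OrderedDict([
    ("nunits_encoder", 32),
    ("nlayers_encoder", 5),
    ("dropout_input", 0.1),
    ("nlatent", 10),
])

# Overwrite individual entries in place; the chosen values are arbitrary examples.
params["nlatent"] = 20
params["dropout_input"] = 0.2

print(params["nlatent"], params["dropout_input"])
```

Assigning to an existing key keeps its position in the dictionary, so the order of the entries is unchanged.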

In [6]:
ensemble = EnsembleVAE(params=params)

using vae


## 3) Fit a model

In [7]:
ensemble.fit(dataset, epochs=1)

Run model 1


[<tensorflow.python.keras.callbacks.History at 0x7f9edc819f10>]

In [8]:
ensemble.summary()

Model: "encoder"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
modality_gtx (InputLayer)       [(None, 35300)]      0                                            
__________________________________________________________________________________________________
dropout (Dropout)               (None, 35300)        0           modality_gtx[0][0]               
__________________________________________________________________________________________________
dense (Dense)                   (None, 32)           1129632     dropout[0][0]                    
__________________________________________________________________________________________________
layer_normalization (LayerNorma (None, 32)           64          dense[0][0]                      
____________________________________________________________________________________________
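The parameter counts in the summary can be checked by hand from the input dimension and the number of encoder units. The short calculation below is only a sanity check of the two trainable layers shown; the variable names are illustrative.

```python
n_features = 35300   # input dimension of modality_gtx
n_units = 32         # nunits_encoder

# Dense layer: one weight per (input, unit) pair plus one bias per unit.
dense_params = n_features * n_units + n_units

# Layer normalization: one scale and one offset per unit.
layernorm_params = 2 * n_units

print(dense_params, layernorm_params)  # 1129632 64
```

Almost all parameters of this first encoder layer come from the 35300-dimensional input.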

## 4) Obtain latent features

After fitting a model, we can obtain the latent feature representation as follows:

In [9]:
latent, latent_list = ensemble.encode(dataset)

In [10]:
latent.head()

| | D0-0 | D0-1 | D0-2 | D0-3 | D0-4 | D0-5 | D0-6 | D0-7 | D0-8 | D0-9 |
|---|---|---|---|---|---|---|---|---|---|---|
| AACAAGCCAGGTTCAC-1 | -11.653465 | -2.619335 | 14.745419 | -12.953723 | -0.711892 | -17.268864 | 11.574628 | -10.91097 | -5.023945 | -5.228242 |
| AAAGGTTAGGGTGGAT-1 | -3.506031 | -0.722704 | 3.907086 | -2.42874 | -0.466056 | -3.780588 | 3.428963 | -2.429952 | -0.768454 | -4.573591 |
| AACGACAAGGACCGCT-1 | -4.901626 | -0.366557 | 7.997086 | -4.479526 | -0.418263 | -7.347968 | 4.021302 | -5.009599 | -2.311113 | -5.598698 |
| AACAGGATCATCACTT-1 | -8.165175 | 0.094383 | 10.642797 | -6.025104 | -1.220338 | -11.612317 | 6.191409 | -5.79023 | -2.359447 | -4.66364 |
| AACCGGCTCGATCAGT-1 | -7.082474 | -1.445342 | 9.344357 | -5.344437 | -1.262001 | -9.807787 | 5.113561 | -5.087104 | -1.816845 | -5.846072 |
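Latent features are typically used downstream, for example to compare cells. As a minimal sketch, the snippet below copies the five rows shown above into a plain dictionary and finds the cell closest to the first one by Euclidean distance. Euclidean distance is an assumption made for illustration; scMaui does not prescribe a particular metric here. After a single training epoch the values carry little biological meaning.

```python
import math

# Latent coordinates copied from the latent.head() output above.
latent_rows = {
    "AACAAGCCAGGTTCAC-1": [-11.653465, -2.619335, 14.745419, -12.953723, -0.711892,
                           -17.268864, 11.574628, -10.91097, -5.023945, -5.228242],
    "AAAGGTTAGGGTGGAT-1": [-3.506031, -0.722704, 3.907086, -2.42874, -0.466056,
                           -3.780588, 3.428963, -2.429952, -0.768454, -4.573591],
    "AACGACAAGGACCGCT-1": [-4.901626, -0.366557, 7.997086, -4.479526, -0.418263,
                           -7.347968, 4.021302, -5.009599, -2.311113, -5.598698],
    "AACAGGATCATCACTT-1": [-8.165175, 0.094383, 10.642797, -6.025104, -1.220338,
                           -11.612317, 6.191409, -5.79023, -2.359447, -4.66364],
    "AACCGGCTCGATCAGT-1": [-7.082474, -1.445342, 9.344357, -5.344437, -1.262001,
                           -9.807787, 5.113561, -5.087104, -1.816845, -5.846072],
}

query = "AACAAGCCAGGTTCAC-1"

# Euclidean distance from the query cell to every other cell.
distances = {
    cell: math.dist(latent_rows[query], coords)
    for cell, coords in latent_rows.items()
    if cell != query
}

# The cell with the smallest distance is the nearest neighbour in latent space.
nearest = min(distances, key=distances.get)
print(nearest)
```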


## 5) Obtain feature imputation

We can impute (predict) features using the `impute` method.

In [11]:
predicted = ensemble.impute(dataset)

In [12]:
predicted[0].shape

(100, 35300)

## 6) Obtain a feature importance attribution

Given a selection of one or more similar cells, we can ask which input features are most relevant to them.

The result averages the input attributions across the selected cells.
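The averaging step can be pictured with a small, simplified sketch. The numbers are hypothetical, and each cell has a single attribution value per input feature (that is, one latent dimension) to keep the example short. It illustrates the averaging only, not scMaui's attribution method.

```python
# Hypothetical per-cell attributions: 3 selected cells x 4 input features.
per_cell = [
    [0.2, -0.1, 0.0, 0.4],
    [0.4, -0.3, 0.0, 0.2],
    [0.3, -0.2, 0.0, 0.6],
]

# Average each feature's attribution over the selected cells (column-wise mean).
n_cells = len(per_cell)
averaged = [sum(column) / n_cells for column in zip(*per_cell)]

print(averaged)
```

Selecting similar cells matters because attributions with opposite signs would cancel out in the average.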

In [13]:
# select 5 cells
selected_cells = latent.index.tolist()[:5]

Get the feature attributions:

In [14]:
attributed = ensemble.explain(dataset, cellids=selected_cells)

The attribution array is two-dimensional: one axis corresponds to the input features and the other to the encoder's latent features.

In [15]:
attributed[0]

array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ..., -0.,  0., -0.],
       [-0.,  0.,  0., ..., -0.,  0.,  0.],
       ...,
       [-0.,  0., -0., ...,  0.,  0., -0.],
       [ 0.,  0.,  0., ..., -0., -0., -0.],
       [-0., -0., -0., ...,  0.,  0., -0.]])
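One way to read such a matrix is to rank the input features by their overall attribution. The sketch below uses a small hypothetical matrix with rows as input features and columns as latent dimensions (this orientation is an assumption) and scores each feature by its summed absolute attribution. The feature names and the scoring rule are illustrative choices, not part of scMaui.

```python
# Hypothetical attribution matrix: 4 input features (rows) x 3 latent dimensions (columns).
feature_names = ["geneA", "geneB", "geneC", "geneD"]
attribution = [
    [0.0, 0.1, -0.2],
    [0.5, -0.4, 0.3],
    [0.0, 0.0, 0.0],
    [-0.1, 0.6, 0.0],
]

# Score each feature by its total absolute attribution over all latent dimensions.
scores = {
    name: sum(abs(value) for value in row)
    for name, row in zip(feature_names, attribution)
}

# Features ordered from most to least relevant under this score.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[:2])
```

Taking absolute values treats positive and negative attributions as equally relevant; keeping the sign instead would show the direction of each feature's influence.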