<h1><center>Biogen Pretrained Tutorial - independent version</center></h1>


<center>Authors: Qihuang Zhang*, Jian Hu, Kejie Li, Baohong Zhang, David Dai, Edward B. Lee, Rui Xiao, Mingyao Li*</center>

## Outline
1. Preparation
2. Load Data
3. Prediction

In this tutorial, we illustrate the use of the CeLEry pretrained model, which was trained on the Biogen mouse brain data (Li and Zhang, 2022). The model takes the expression of 886 genes as input and produces a vector of predicted probabilities over the eight regions segmented from the spatial transcriptomics data.

This tutorial can be run independently of the CeLEry package; installing CeLEry is not required.

## 1. Preparation

To run the model without installing the CeLEry package, several helper functions are needed. The ``pickle`` package is used to load the pretrained model. The function ``make_annData_query()`` converts the raw input data into AnnData format and preprocesses it by normalizing the gene expression per cell and applying a ``log(1+p)`` transformation. The function ``get_zscore()`` standardizes the gene expression so that batch effects can be removed.

In [None]:
import pickle
from scanpy import read_10x_h5
import CeLEry as cel

import scanpy as sc
import numpy as np
from scipy.sparse import issparse
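
The preprocessing described above can be sketched numerically with NumPy alone. This is an illustration under stated assumptions, not CeLEry's implementation: the function names `normalize_log1p` and `zscore_genes` are hypothetical, the scaling target of `1e4` counts per cell is an assumed value, and the z-score is assumed to be computed per gene across cells.

```python
import numpy as np

def normalize_log1p(counts, target_sum=1e4):
    """Scale each cell (row) to the same total count, then apply log(1 + x).

    target_sum is an assumed value chosen for illustration.
    """
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid dividing by zero for empty cells
    return np.log1p(counts / totals * target_sum)

def zscore_genes(expr):
    """Standardize each gene (column) to mean 0 and standard deviation 1."""
    expr = np.asarray(expr, dtype=float)
    mean = expr.mean(axis=0)
    sd = expr.std(axis=0)
    sd[sd == 0] = 1.0  # leave constant genes at zero instead of NaN
    return (expr - mean) / sd

# Toy count matrix: 3 cells (rows) by 3 genes (columns)
toy_counts = np.array([[1, 0, 3],
                       [2, 2, 0],
                       [0, 5, 5]])
toy_z = zscore_genes(normalize_log1p(toy_counts))
```

Standardizing each gene puts the query data on a common scale regardless of its overall expression level, which is the sense in which the z-score step addresses batch differences.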


## 2. Load Data
 
Load the scRNA-seq/snRNA-seq data. Example data can be downloaded from [Li and Zhang (2022)](https://doi.org/10.5281/zenodo.6640285).

In [None]:

QueryData_raw = read_10x_h5("data/Biogen/7G-1/filtered_feature_bc_matrix.h5")
QueryData = cel.make_annData_query(QueryData_raw)

It is important to make sure that the query scRNA-seq/snRNA-seq data contain all the genes used in the trained model.

In [None]:
# Load the list of genes used by the trained model
filename = "pretrainmodel/Biogen/Reference_genes_8_075B.obj"
with open(filename, 'rb') as filehandler:
    genenames = pickle.load(filehandler)

# Reorder the query data and keep only the genes used by the trained model
Qdata = QueryData[:, list(genenames)]
# Standardize the gene expression (z-score)
cel.get_zscore(Qdata)
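
The requirement that the query data contain every gene in the trained model can be checked explicitly. The sketch below uses a hypothetical helper, `find_missing_genes`, and toy gene lists; it is not part of CeLEry.

```python
def find_missing_genes(model_genes, query_genes):
    """Return the model genes absent from the query, in model order."""
    available = set(query_genes)
    return [gene for gene in model_genes if gene not in available]

# Toy gene lists for illustration only
toy_model_genes = ["Gad1", "Slc17a7", "Mbp"]
toy_query_genes = ["Mbp", "Gad1", "Snap25"]

missing = find_missing_genes(toy_model_genes, toy_query_genes)
if missing:
    print(f"{len(missing)} model gene(s) missing from the query: {missing}")
```

With the objects in this tutorial, the call would be `find_missing_genes(genenames, QueryData.var_names)`. It is best run before subsetting the query data, because indexing an AnnData object by a gene name it does not contain raises an error.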

#### Apply the pre-trained CeLEry model to the snRNA-seq data

The gene expression of the first cell (a 1×886 matrix) in the snRNA-seq data is given by:

In [None]:
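# Show the first cell as a dense array, whether X is stored sparse or dense
X0 = Qdata[0].X
X0.toarray() if issparse(X0) else X0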

Load the CeLEry prediction model, which is stored in ``"../output/Biogene/models"`` under the name ``Org_domain_075B``. We use the CeLEry function ``Predict_domain()`` to predict the domain of each cell in the scRNA-seq/snRNA-seq data. Its arguments are as follows:

* data_test: (AnnData object) the input scRNA-seq/snRNA-seq data 
* class_num: (int) the number of classes to be predicted. This value should match the number of domains in the trained model.
* path: (string) the location of the pre-trained model
* filename: (string) the file name of the saved pre-trained model
* predtype: (string) if predtype is "probability" (default), a probability prediction matrix is produced; if predtype is "deterministic", the deterministic assignment based on the maximum predicted probability is returned; if predtype is "both", both predictions are output.

## 3. Prediction 

Predict the domain of the first cell:

In [None]:
modelname = "pretrainmodel/Biogen/Org_domain_075B.obj"
with open(modelname, 'rb') as filehandler:
    CeLERymodel = pickle.load(filehandler)


pred_cord = cel.Predict_domain(data_test = Qdata[0], class_num = 8, path = "../output/Biogene/models", filename = "Org_domain_075B", predtype = "probability")


Predict the domains for the entire scRNA-seq data set and report the proportion of cells in each domain:

In [None]:
pred_cord = cel.Predict_domain(data_test = Qdata, class_num = 8, path = "../output/Biogene/models", filename = "Org_domain_075B", predtype = "probability")
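
The proportion of cells in each domain can then be derived from the probability output. The following is a minimal sketch that assumes the prediction is an array with one row per cell and one column per domain; `domain_proportions` is a hypothetical helper, not a CeLEry function, and the toy matrix uses three domains for brevity.

```python
import numpy as np

def domain_proportions(prob_matrix, class_num=8):
    """Assign each cell to its most probable domain and return the
    fraction of cells assigned to each domain (0-based column index)."""
    prob_matrix = np.asarray(prob_matrix, dtype=float)
    labels = prob_matrix.argmax(axis=1)
    counts = np.bincount(labels, minlength=class_num)
    return counts / counts.sum()

# Toy probability matrix: 4 cells, 3 domains
toy_probs = np.array([
    [0.7, 0.1, 0.2],
    [0.2, 0.5, 0.3],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
])
toy_props = domain_proportions(toy_probs, class_num=3)
```

If `pred_cord` has the assumed shape, `domain_proportions(pred_cord, class_num=8)` would give the share of cells in each of the eight regions.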