# Stage 1: Model Construction

In this tutorial, we show how to train an LTNN model to predict the origin and end cells.

If you want to try the scltnn algorithm directly, you can use our [trained model](https://github.com/Starlitnightly/scltnn/tree/main/model), which was trained on bladder data. Although our tests show that it generalises across species and tissues, we recommend that you train your own LTNN model.

In [1]:
import scltnn
import scanpy as sc
import scvelo as scv
import anndata
import numpy as np


In [2]:
import omicverse as ov
ov.utils.ov_plot_set()

## Data preparation

We need to calculate the LSI of the cells from the scRNA-seq AnnData object and extract the highly variable genes.

> **Note**:
> 
> The AnnData object must already contain velocity and latent time. See [scvelo's tutorial](https://scvelo.readthedocs.io/) for how to calculate them.
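
If your data does not yet contain latent time, the following is a minimal sketch of the usual scVelo dynamical-model workflow. The helper name `add_latent_time` and the parameter values (`min_shared_counts=20`, `n_top_genes=2000`, `n_pcs=30`, `n_neighbors=30`) are assumptions taken from typical scVelo usage, not settings confirmed by this tutorial. It assumes the input has `spliced` and `unspliced` layers.

```python
def add_latent_time(adata):
    """Compute scVelo latent time on a copy and store it in adata.obs['latent_time']."""
    # Imported here so the helper can be defined even where scVelo is not installed.
    import scvelo as scv

    # Work on a copy: filter_and_normalize subsets genes and rescales counts.
    work = adata.copy()
    scv.pp.filter_and_normalize(work, min_shared_counts=20, n_top_genes=2000)
    scv.pp.moments(work, n_pcs=30, n_neighbors=30)

    # Fit the dynamical model, then derive velocities and the velocity graph.
    scv.tl.recover_dynamics(work)
    scv.tl.velocity(work, mode='dynamical')
    scv.tl.velocity_graph(work)

    # Latent time is written to work.obs['latent_time']; copy it back by cell index.
    scv.tl.latent_time(work)
    adata.obs['latent_time'] = work.obs['latent_time']
    return adata
```

Running the model on a copy keeps the full gene set in the original object, so the quality control and highly variable gene selection later in this tutorial still start from the unfiltered data.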

In [3]:
adata=sc.read_h5ad('../../data/Pancreas-velocyto.h5ad')
adata

AnnData object with n_obs × n_vars = 3696 × 27998
    obs: 'clusters_coarse', 'clusters', 'S_score', 'G2M_score', 'initial_size_spliced', 'initial_size_unspliced', 'initial_size', 'n_counts', 'velocity_self_transition', 'phase', 'velocity_length', 'velocity_confidence', 'velocity_confidence_transition', 'root_cells', 'end_points', 'velocity_pseudotime', 'latent_time'
    var: 'highly_variable_genes'
    uns: 'clusters_coarse_colors', 'clusters_colors', 'day_colors', 'neighbors', 'pca'
    obsm: 'X_pca', 'X_umap'
    layers: 'spliced', 'unspliced'
    obsp: 'connectivities', 'distances'

In [4]:
adata=ov.pp.qc(adata,
              tresh={'mito_perc': 0.2, 'nUMIs': 500, 'detected_genes': 250})
adata=ov.pp.preprocess(adata,mode='shiftlog|pearson',n_HVGs=3000,)

Calculate QC metrics
End calculation of QC metrics.
Original cell number: 3696
Begin of post doublets removal and QC plot
Running Scrublet
filtered out 12261 genes that are detected in less than 3 cells
normalizing counts per cell
    finished (0:00:00)
extracting highly variable genes
    finished (0:00:00)
--> added
    'highly_variable', boolean vector (adata.var)
    'means', float vector (adata.var)
    'dispersions', float vector (adata.var)
    'dispersions_norm', float vector (adata.var)
normalizing counts per cell
    finished (0:00:00)
normalizing counts per cell
    finished (0:00:00)
Embedding transcriptomes using PCA...
Automatically set threshold at doublet score = 0.36
Detected doublet rate = 0.2%
Estimated detectable doublet fraction = 56.0%
Overall doublet rate:
	Expected   = 5.0%
	Estimated  = 0.4%
    Scrublet finished (0:00:04)
Cells retained after scrublet: 3688, 8 removed.
End of post doublets removal and QC plots.
Filters application (seurat or mads)
Lower tresho

In [5]:
adata=adata[:,adata.var['highly_variable_features']==True]
adata

View of AnnData object with n_obs × n_vars = 3688 × 3000
    obs: 'clusters_coarse', 'clusters', 'S_score', 'G2M_score', 'initial_size_spliced', 'initial_size_unspliced', 'initial_size', 'n_counts', 'velocity_self_transition', 'phase', 'velocity_length', 'velocity_confidence', 'velocity_confidence_transition', 'root_cells', 'end_points', 'velocity_pseudotime', 'latent_time', 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'doublet_score', 'predicted_doublet', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'n_genes'
    var: 'highly_variable_genes', 'mt', 'n_cells', 'percent_cells', 'robust', 'mean', 'var', 'residual_variances', 'highly_variable_rank', 'highly_variable_features'
    uns: 'clusters_coarse_colors', 'clusters_colors', 'day_colors', 'neighbors', 'pca', 'scrublet', 'log1p', 'hvg'
    obsm: 'X_pca', 'X_umap'
    layers: 'spliced', 'unspliced', 'counts'
    obsp: 'connectivities', 'distances'

In [6]:
scltnn.lsi(adata, n_components=20, n_iter=15)
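
For readers unfamiliar with LSI (latent semantic indexing), the sketch below shows the general technique: TF-IDF weighting of the count matrix followed by a truncated SVD. It is a generic NumPy illustration on toy data, not the implementation inside `scltnn.lsi`. The function name `lsi_sketch`, the scaling factor `1e4`, and the toy matrix are assumptions made for this example.

```python
import numpy as np

def lsi_sketch(counts, n_components=2):
    """Generic LSI: TF-IDF weighting followed by a truncated SVD."""
    counts = np.asarray(counts, dtype=float)
    # Term frequency: each cell's counts divided by that cell's total count.
    tf = counts / counts.sum(axis=1, keepdims=True)
    # Inverse document frequency: number of cells divided by each gene's total count.
    idf = counts.shape[0] / counts.sum(axis=0)
    # Log-transform the scaled TF-IDF values to dampen large entries.
    tfidf = np.log1p(tf * idf * 1e4)
    # Truncated SVD: keep the leading left singular vectors scaled by their singular values.
    u, s, _ = np.linalg.svd(tfidf, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

rng = np.random.default_rng(0)
# Toy cells-by-genes count matrix; adding 1 avoids empty rows and columns.
toy_counts = rng.poisson(2.0, size=(50, 30)) + 1
embedding = lsi_sketch(toy_counts, n_components=5)
print(embedding.shape)
```

The result has one row per cell and one column per component, which is the shape of the low-dimensional input that the model below reads through `basis='X_lsi'` with `input_dim=20`.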

## Training data split

We randomly select 80% of the cells as the training dataset and the remaining 20% as the test dataset.
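
To make the 80/20 split concrete, here is a minimal standard-library sketch of a random split over cell indices. How scltnn performs the split internally is not shown in this tutorial, so the helper `split_cells`, its `seed` argument, and the rounding rule are assumptions for illustration only.

```python
import random

def split_cells(cell_ids, test_fraction=0.2, seed=0):
    """Shuffle cell identifiers and split them into training and test sets."""
    shuffled = list(cell_ids)
    # A seeded generator makes the split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # The first n_test shuffled cells form the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]

# 3688 is the number of cells retained after quality control above.
train_ids, test_ids = split_cells(range(3688))
print(len(train_ids), len(test_ids))  # prints "2950 738"
```

Because every cell lands in exactly one of the two sets, the test cells are never seen during training and give an unbiased check of the predicted latent time.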

In [10]:
ltnn_obj=scltnn.scLTNN(adata,basis='X_lsi',input_dim=20,cpu='cpu')

In [None]:
ltnn_obj.ANNmodel_init(pseudotime='latent_time',batch_size=20,)
ltnn_obj.ANNmodel_train(n_epochs=200)


Pre-ANN model:  40%|▍| 81/200 [00:25<00:37,  3.21it/s, val loss, val mae=0.00069, 0.

## Model save

We now save the model object for later analysis.

In [None]:
ltnn_obj.ANNmodel_save('model/model_20.h5')

For latent time prediction with LTNN, see [Latent time predicted by scLTNN](https://scltnn.readthedocs.io/en/latest/Tutorials/human_CD8.html).