# Preprocessing

The input data for velocity prediction is a ```csv``` file that can be loaded with ```pandas.read_csv('your_path/cell_type_u_s.csv')```. In the example ```pandas.DataFrame``` below, the gene information is stored in the columns ```gene_name```, ```unsplice```, and ```splice```, and the embedding-space information in the columns ```cellID```, ```clusters```, ```embedding1```, and ```embedding2```. The ```unsplice``` and ```splice``` columns hold the unspliced and spliced reads (after KNN imputation), respectively. The sample ```DataFrame``` contains 5 cells and 2 genes, with one row per gene-cell pair. ```cellID``` is the unique id of each cell, and ```clusters``` is the cell type of each cell. ```embedding1``` and ```embedding2``` are the coordinates of each cell in a 2-dimensional representation such as UMAP, PCA, or t-SNE.

In [14]:
import pandas as pd
cell_type_u_s=pd.read_csv('your_path/cell_type_u_s_sample_df.csv')
cell_type_u_s

| Unnamed: 0 | gene_name | unsplice | splice | cellID | clusters | embedding1 | embedding2 |
|---|---|---|---|---|---|---|---|
| 0 | Hba-x | 0.0 | 0.123217 | cell_363 | Blood progenitors 2 | 3.460521 | 15.574629 |
| 1 | Hba-x | 0.0 | 0.008806 | cell_385 | Blood progenitors 2 | 2.351203 | 15.267069 |
| 2 | Hba-x | 0.023665 | 21.719713 | cell_592 | Erythroid1 | 6.170377 | 12.916482 |
| 3 | Hba-x | 0.447068 | 301.9154 | cell_16475 | Erythroid2 | 8.311832 | 9.724998 |
| 4 | Hba-x | 0.66566 | 637.66565 | cell_139318 | Erythroid3 | 8.032358 | 7.603037 |
| 5 | Sulf2 | 0.0 | 0.03396 | cell_363 | Blood progenitors 2 | 3.460521 | 15.574629 |
| 6 | Sulf2 | 0.0 | 0.050277 | cell_385 | Blood progenitors 2 | 2.351203 | 15.267069 |
| 7 | Sulf2 | 0.0 | 0.033758 | cell_592 | Erythroid1 | 6.170377 | 12.916482 |
| 8 | Sulf2 | 0.0 | 0.011413 | cell_16475 | Erythroid2 | 8.311832 | 9.724998 |
| 9 | Sulf2 | 0.0 | 0.007784 | cell_139318 | Erythroid3 | 8.032358 | 7.603037 |
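Before running velocity prediction, it can help to confirm that a file really follows this layout. The sketch below is only an illustration: ```check_long_format``` is a hypothetical helper (not part of celldancer), and the sample rows are inlined so that it runs without a file on disk. It checks that the required columns exist, that each gene-cell pair appears exactly once, and that a cell carries the same cluster label and embedding for every gene.

```python
import io

import pandas as pd

# The ten sample rows from the table above, inlined for a self-contained run.
SAMPLE_CSV = """gene_name,unsplice,splice,cellID,clusters,embedding1,embedding2
Hba-x,0.0,0.123217,cell_363,Blood progenitors 2,3.460521,15.574629
Hba-x,0.0,0.008806,cell_385,Blood progenitors 2,2.351203,15.267069
Hba-x,0.023665,21.719713,cell_592,Erythroid1,6.170377,12.916482
Hba-x,0.447068,301.9154,cell_16475,Erythroid2,8.311832,9.724998
Hba-x,0.66566,637.66565,cell_139318,Erythroid3,8.032358,7.603037
Sulf2,0.0,0.03396,cell_363,Blood progenitors 2,3.460521,15.574629
Sulf2,0.0,0.050277,cell_385,Blood progenitors 2,2.351203,15.267069
Sulf2,0.0,0.033758,cell_592,Erythroid1,6.170377,12.916482
Sulf2,0.0,0.011413,cell_16475,Erythroid2,8.311832,9.724998
Sulf2,0.0,0.007784,cell_139318,Erythroid3,8.032358,7.603037
"""

REQUIRED_COLUMNS = [
    "gene_name", "unsplice", "splice",
    "cellID", "clusters", "embedding1", "embedding2",
]


def check_long_format(df):
    """Hypothetical helper: validate the long-format layout described above."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    # Each (gene, cell) pair should appear exactly once.
    if df.duplicated(subset=["gene_name", "cellID"]).any():
        raise ValueError("duplicate (gene_name, cellID) rows")
    n_genes = df["gene_name"].nunique()
    n_cells = df["cellID"].nunique()
    if len(df) != n_genes * n_cells:
        raise ValueError("some gene-cell pairs are missing")
    # A cell's cluster label and embedding must not vary across genes.
    per_cell = df.groupby("cellID")[["clusters", "embedding1", "embedding2"]].nunique()
    if (per_cell > 1).any().any():
        raise ValueError("inconsistent cell annotations across genes")
    return n_genes, n_cells


df = pd.read_csv(io.StringIO(SAMPLE_CSV))
n_genes, n_cells = check_long_format(df)
print(n_genes, n_cells)  # 2 5

# Pivot back to a cells x genes matrix of spliced abundances.
spliced = df.pivot(index="cellID", columns="gene_name", values="splice")
print(spliced.shape)  # (5, 2)
```

The pivot at the end shows how the long format relates to the cells × genes matrices it was built from.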


## Format transfer

The input data are two count matrices of pre-mature (unspliced) and mature (spliced) abundances, which can be obtained from standard sequencing protocols using the `velocyto` or `loompy/kallisto` counting pipeline.

We also provide a function, ```celldancer.utilities.adata_to_raw_with_embed()```, to convert [Anndata](https://anndata-tutorials.readthedocs.io/en/latest/getting-started.html) to csv format. For example, after the [preprocessing of Anndata](https://scvelo.readthedocs.io/VelocityBasics/), run ```celldancer.utilities.adata_to_raw_with_embed(adata,us_para=['Mu','Ms'],cell_type_para='celltype',embed_para='X_umap',save_path='cell_type_u_s.csv',gene_list=['Hba-x','Smim1'])``` to get the csv file.

In the example, the ```unsplice``` and ```splice``` columns (the two matrices of unspliced and spliced abundances) are obtained from the ```'Mu'``` and ```'Ms'``` layers of ```adata.layers```. The ```cellID``` column is obtained from ```adata.obs.index```. The ```clusters``` column is obtained from ```adata.obs['celltype']```. The ```embedding1``` and ```embedding2``` columns are obtained from ```adata.obsm['X_umap']```.
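To make this column mapping concrete, here is a minimal sketch of the wide-to-long reshaping, using plain NumPy arrays as stand-ins for the AnnData fields. It is not celldancer's implementation: ```layers_to_long_df``` and the toy values are hypothetical, and the gene column is named ```gene_name``` as in the input format described at the top of this page.

```python
import numpy as np
import pandas as pd


def layers_to_long_df(unspliced, spliced, cell_ids, clusters, embedding,
                      gene_names, gene_list):
    """Hypothetical sketch: stack one block of rows per requested gene.

    unspliced, spliced : (n_cells, n_genes) arrays, standing in for
                         adata.layers['Mu'] and adata.layers['Ms']
    cell_ids, clusters : length n_cells, standing in for adata.obs.index
                         and adata.obs['celltype']
    embedding          : (n_cells, 2) array, standing in for adata.obsm['X_umap']
    gene_names         : length n_genes, standing in for adata.var_names
    """
    gene_names = list(gene_names)
    frames = []
    for gene in gene_list:
        j = gene_names.index(gene)  # column of this gene in the matrices
        frames.append(pd.DataFrame({
            "gene_name": gene,  # scalar, repeated for every cell
            "unsplice": unspliced[:, j],
            "splice": spliced[:, j],
            "cellID": cell_ids,
            "clusters": clusters,
            "embedding1": embedding[:, 0],
            "embedding2": embedding[:, 1],
        }))
    return pd.concat(frames, ignore_index=True)


# Toy data: 3 cells x 3 genes (values are random, for illustration only).
rng = np.random.default_rng(0)
genes = ["Hba-x", "Sulf2", "Smim1"]
Mu = rng.random((3, len(genes)))
Ms = rng.random((3, len(genes)))
umap = rng.random((3, 2))

long_df = layers_to_long_df(Mu, Ms, ["c0", "c1", "c2"], ["A", "A", "B"],
                            umap, genes, gene_list=["Hba-x", "Smim1"])
print(long_df.shape)  # (6, 7)
```

Each requested gene contributes one row per cell, so the cell-level columns (```cellID```, ```clusters```, and the embedding) repeat once per gene.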

In [55]:
import celldancer as cd

# Convert the preprocessed AnnData object to the long-format csv described above.
cd.utilities.adata_to_raw_with_embed(adata,us_para=['Mu','Ms'],cell_type_para='celltype',embed_para='X_umap',save_path='cell_type_u_s.csv',gene_list=['Hba-x','Smim1'])

processing:0/2
processing:1/2


| Unnamed: 0 | gene_list | unsplice | splice | cellID | clusters | embedding1 | embedding2 |
|---|---|---|---|---|---|---|---|
| 0 | Hba-x | 0.000000 | 0.139826 | AAAGATCTCTCGAA | Blood progenitors 2 | 3.460521 | 15.574629 |
| 1 | Hba-x | 0.000000 | 0.048073 | AATCTCACTGCTTT | Blood progenitors 2 | 2.490433 | 14.971734 |
| 2 | Hba-x | 0.000000 | 0.000000 | AATGGCTGAAGATG | Blood progenitors 2 | 2.351203 | 15.267069 |
| 3 | Hba-x | 0.025080 | 3.652050 | ACACATCTGTCAAC | Blood progenitors 2 | 5.899098 | 14.388825 |
| 4 | Hba-x | 0.000000 | 1.355590 | ACGACAACTGGAGG | Blood progenitors 2 | 4.823139 | 15.374831 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 19625 | Smim1 | 0.987823 | 7.671494 | TTTCACGACTGGTA | Erythroid3 | 8.032358 | 7.603037 |
| 19626 | Smim1 | 0.989246 | 7.317718 | TTTCAGTGCGAGTT | Erythroid3 | 10.352904 | 6.446736 |
| 19627 | Smim1 | 1.216333 | 6.978382 | TTTCGAACGGTGAG | Erythroid3 | 9.464873 | 7.261099 |
| 19628 | Smim1 | 1.141874 | 7.705575 | TTTCGAACTAACCG | Erythroid3 | 9.990495 | 7.243880 |


In [1]:
import scvelo as scv

In [2]:
adata=scv.datasets.gastrulation_erythroid()

In [3]:
adata

AnnData object with n_obs × n_vars = 9815 × 53801
    obs: 'sample', 'stage', 'sequencing.batch', 'theiler', 'celltype'
    var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand', 'MURK_gene', 'Δm', 'scaled Δm'
    uns: 'celltype_colors'
    obsm: 'X_pca', 'X_umap'
    layers: 'spliced', 'unspliced'

In [4]:
scv.pp.filter_genes(adata, min_shared_counts=20)
scv.pp.normalize_per_cell(adata)
scv.pp.filter_genes_dispersion(adata, n_top_genes=2000)
scv.pp.log1p(adata)

Filtered out 47456 genes that are detected 20 counts (shared).
Normalized count data: X, spliced, unspliced.
Extracted 2000 highly variable genes.


In [5]:
scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=2000)
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)


Filtered out 51 genes that are detected 20 counts (shared).
Skip filtering by dispersion since number of variables are less than `n_top_genes`.
computing neighbors




    finished (0:00:21) --> added 
    'distances' and 'connectivities', weighted adjacency matrices (adata.obsp)
computing moments based on connectivities
    finished (0:00:01) --> added 
    'Ms' and 'Mu', moments of un/spliced abundances (adata.layers)


In [6]:
adata

AnnData object with n_obs × n_vars = 9815 × 1949
    obs: 'sample', 'stage', 'sequencing.batch', 'theiler', 'celltype', 'initial_size_spliced', 'initial_size_unspliced', 'initial_size', 'n_counts'
    var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand', 'MURK_gene', 'Δm', 'scaled Δm', 'gene_count_corr', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
    uns: 'celltype_colors', 'neighbors'
    obsm: 'X_pca', 'X_umap'
    layers: 'spliced', 'unspliced', 'Ms', 'Mu'
    obsp: 'distances', 'connectivities'

In [8]:
import celldancer as cd
test=cd.utilities.adata_to_raw_with_embed(adata,us_para=['Mu','Ms'],cell_type_para='celltype',embed_para='X_umap',save_path='cell_type_u_s.csv',gene_list=['Hba-x','Smim1'])


processing:0/2
processing:1/2


In [9]:
test

| Unnamed: 0 | gene_list | unsplice | splice | cellID | clusters | embedding1 | embedding2 |
|---|---|---|---|---|---|---|---|
| 0 | Hba-x | 0.000000 | 0.139826 | AAAGATCTCTCGAA | Blood progenitors 2 | 3.460521 | 15.574629 |
| 1 | Hba-x | 0.000000 | 0.048073 | AATCTCACTGCTTT | Blood progenitors 2 | 2.490433 | 14.971734 |
| 2 | Hba-x | 0.000000 | 0.000000 | AATGGCTGAAGATG | Blood progenitors 2 | 2.351203 | 15.267069 |
| 3 | Hba-x | 0.025080 | 3.652050 | ACACATCTGTCAAC | Blood progenitors 2 | 5.899098 | 14.388825 |
| 4 | Hba-x | 0.000000 | 1.355590 | ACGACAACTGGAGG | Blood progenitors 2 | 4.823139 | 15.374831 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 19625 | Smim1 | 0.987823 | 7.671494 | TTTCACGACTGGTA | Erythroid3 | 8.032358 | 7.603037 |
| 19626 | Smim1 | 0.989246 | 7.317718 | TTTCAGTGCGAGTT | Erythroid3 | 10.352904 | 6.446736 |
| 19627 | Smim1 | 1.216333 | 6.978382 | TTTCGAACGGTGAG | Erythroid3 | 9.464873 | 7.261099 |
| 19628 | Smim1 | 1.141874 | 7.705575 | TTTCGAACTAACCG | Erythroid3 | 9.990495 | 7.243880 |


In [41]:
adata.layers['Ms'].shape

(9815, 1949)

In [23]:
import numpy as np
para=['Mu', 'Ms']
adata.layers[para[0]][:,0].copy().astype(np.float32)

array([0.08998882, 0.07766935, 0.06742084, ..., 0.04068623, 0.        ,
       0.        ], dtype=float32)
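The cell above selects the first gene by position (column 0). When a specific gene is wanted, looking up its column by name is less error-prone. The sketch below uses a ```pandas.Index``` and a small NumPy array as stand-ins for ```adata.var_names``` and ```adata.layers['Mu']```; the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Stand-ins for adata.var_names and adata.layers['Mu'] (toy values).
var_names = pd.Index(["Hba-x", "Sulf2", "Smim1"])
Mu = np.array([[0.0, 0.1, 0.9],
               [0.2, 0.0, 1.0]])

# Find the column of the gene by name instead of hard-coding its position.
col = var_names.get_loc("Smim1")

# Same extraction pattern as above: one gene's values across all cells.
smim1_mu = Mu[:, col].copy().astype(np.float32)
print(col, smim1_mu.dtype)  # 2 float32
```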