# Example of using SCProcessing package to make training splits

## Import libraries

Make sure `SCProcessing` is installed and ready to go :)

In [1]:
import scanpy as sc
import numpy as np
import pandas as pd
from SCProcessing import TrainSplit

## Read in data

Read in the `scanpy` data you want to split.

In [2]:
adata = sc.read("/home/ubuntu/COVID_Data/NeuroCOVID/NeuroCOVID_Preprocessed_Logged&Clustered.h5ad")

In [3]:
adata

AnnData object with n_obs × n_vars = 78172 × 22807
    obs: 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'percent_mt2', 'n_counts', 'n_genes', 'louvain'
    var: 'gene_ids', 'feature_types', 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells'
    uns: 'louvain', 'louvain_colors', 'neighbors', 'pca', 'umap'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'

## Pass in the object to `TrainSplit` class

First, create an object of the `TrainSplit` class so that we can call its methods. Here is the constructor signature:

`__init__(self, data, trainNumber, validationNumber, testNumber, balancedSplit:bool=True, randSeed:int=0, clusterRes=None, savePath = None)`

Here `trainNumber` is the number of training samples, and `validationNumber` and `testNumber` are the numbers of validation and testing samples. Note that if you choose a balanced split (recommended), i.e. `balancedSplit=True`, the resulting counts may be slightly off because the cluster proportions are taken into account. If clustering has not been done already, `clusterRes` is the resolution of the clustering that will be performed.

In [4]:
obj = TrainSplit(adata, 70000, 0, 8172, balancedSplit=True)
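
To see why the requested numbers can end up slightly off under a balanced split, here is a minimal sketch of proportional allocation. It is an illustration only and not the `SCProcessing` implementation: the function `balanced_counts` and the cluster sizes are hypothetical, and the package may round differently.

```python
# Illustrative sketch only; NOT how SCProcessing allocates samples internally.
def balanced_counts(cluster_sizes, n_requested):
    """Allocate n_requested samples across clusters in proportion to cluster size."""
    total = sum(cluster_sizes.values())
    # Each cluster's share is rounded on its own, so the sum can drift
    # away from n_requested by a few samples.
    return {
        cluster: round(n_requested * size / total)
        for cluster, size in cluster_sizes.items()
    }


# Hypothetical cluster sizes, made up for this example.
sizes = {"0": 334, "1": 333, "2": 333}
train = balanced_counts(sizes, 100)

print(train)                # per-cluster allocation
print(sum(train.values()))  # may differ from the 100 requested
```

In this toy case each cluster receives 33 samples, so 99 are allocated instead of the 100 requested.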

### Clustering Information

If the data is already clustered, we only need to get the cluster ratios for a balanced split. Make sure that the `scanpy` object has a column under `obs` called `cluster`, i.e. that `adata.obs["cluster"]` exists. If the data is ***not clustered***, run `obj.Cluster()` first, and then continue in exactly the same way.

In [1]:
## IF CLUSTERING IS NOT DONE ALREADY!
# obj.Cluster()

In [5]:
obj.Cluster_ratios()

INFO:root:Saved cluster ratios to object attributes
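
As a rough picture of what a cluster ratio is, the sketch below computes the fraction of cells in each cluster from a plain list of labels. The helper `cluster_ratios` and the labels are hypothetical; how `Cluster_ratios()` stores its result on the object is not shown in this notebook.

```python
from collections import Counter


# Illustrative sketch only; the helper name and the data are assumptions.
def cluster_ratios(labels):
    """Return the fraction of cells that fall into each cluster."""
    counts = Counter(labels)
    n_cells = len(labels)
    return {cluster: count / n_cells for cluster, count in counts.items()}


# Hypothetical cluster labels for ten cells.
labels = ["0"] * 6 + ["1"] * 3 + ["2"] * 1
ratios = cluster_ratios(labels)

print(ratios)
```

A balanced split tries to keep these fractions roughly the same in every split.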


## Split!

In [6]:
obj.Split()

INFO:root:Cluster Ratios exist
INFO:root:Starting a *balanced* split


INFO:root:Splitting done


Now inspect the original data to make sure `obs` has a new column called `split`!

In [12]:
obj.sc_raw

AnnData object with n_obs × n_vars = 78172 × 22807
    obs: 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'percent_mt2', 'n_counts', 'n_genes', 'louvain', 'cluster', 'split'
    var: 'gene_ids', 'feature_types', 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells'
    uns: 'louvain', 'louvain_colors', 'neighbors', 'pca', 'umap'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'
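
One way to sanity-check a balanced split is to compare the cluster composition of each split. The sketch below does this on plain Python lists. The helper `split_composition` is hypothetical, and the label values `"train"` and `"test"` are assumptions, since the values stored in `split` are not shown here.

```python
from collections import Counter, defaultdict


# Illustrative sketch only; label values and helper name are assumptions.
def split_composition(clusters, splits):
    """For each split label, return the fraction of its cells in each cluster."""
    per_split = defaultdict(Counter)
    for cluster, split in zip(clusters, splits):
        per_split[split][cluster] += 1
    return {
        split: {c: n / sum(counts.values()) for c, n in counts.items()}
        for split, counts in per_split.items()
    }


# Hypothetical per-cell labels: six training cells and three test cells.
clusters = ["0", "0", "0", "0", "1", "1", "0", "0", "1"]
splits = ["train"] * 6 + ["test"] * 3
composition = split_composition(clusters, splits)

print(composition)
```

On the real object, the two lists would presumably come from `obj.sc_raw.obs["cluster"]` and `obj.sc_raw.obs["split"]`; similar fractions across splits indicate that the split is balanced.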

## Save the new `scanpy` object

Now we can save the object to an `h5ad` file. If we do not provide a path, the object is saved to the directory `./TrainSplitData`, which will contain a JSON file of the hyperparameters and the `h5ad` file holding the object. Alternatively, you can provide a path through `savePath` when you create the `TrainSplit` object, but this is not necessary.

In [13]:
obj.Save()

INFO:root:Saving data and parameters to folder ./TrainSplitData/TrainSplit.h5ad
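
To illustrate the idea of keeping the split settings next to the data, here is a minimal sketch that writes a parameter dictionary to JSON and reads it back. It is not the package's `Save()` method: the helper `save_params`, the file name `params.json`, and the choice of keys (taken from the constructor arguments above) are assumptions.

```python
import json
import tempfile
from pathlib import Path


# Illustrative sketch only; file name and layout are assumptions,
# not what SCProcessing actually writes.
def save_params(params, directory, filename="params.json"):
    """Write a dict of split settings as JSON inside `directory`; return the path."""
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / filename
    out_path.write_text(json.dumps(params, indent=2))
    return out_path


# Settings mirroring the TrainSplit call used in this notebook.
params = {
    "trainNumber": 70000,
    "validationNumber": 0,
    "testNumber": 8172,
    "balancedSplit": True,
    "randSeed": 0,
}

# A temporary directory keeps the example from touching real output folders.
with tempfile.TemporaryDirectory() as tmp:
    path = save_params(params, Path(tmp) / "TrainSplitData")
    restored = json.loads(path.read_text())

print(restored == params)
```

Storing the settings with the data makes it easier to reproduce the same split later.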
