# Example of using the `SCProcessing` package to make training splits

## Import libraries

Make sure `SCProcessing` is installed and ready to go :)

You can install it from within this notebook using

`!pip install -e <PATH/TO/WHERE/SETUP.PY FOR THIS PACKAGE RESIDES>`

In [1]:
import scanpy as sc
import numpy as np
import pandas as pd
from SCProcessing import TrainSplit

In [2]:
%load_ext autoreload
%autoreload 2

## Read in data

Read in the `scanpy` data (an `AnnData` object) that you want to split.

In [3]:
adata = sc.read("/home/jovyan/NACT_Project/N-ACT_Data/pbmc68k_filt_BP.h5ad")

In [4]:
adata

AnnData object with n_obs × n_vars = 66084 × 20387
    var: 'gene_ids'

## Pass in the object to `TrainSplit` class

First, we create an object of the `TrainSplit` class so that we can call its methods. Here is the constructor signature:

`__init__(self, data, trainNumber, validationNumber, testNumber, balancedSplit:bool=True, randSeed:int=0, clusterRes=None, savePath = None)`

Here `trainNumber`, `validationNumber`, and `testNumber` are the numbers of training, validation, and test samples. If you choose a balanced split (recommended), i.e. `balancedSplit=True`, the realized counts may differ slightly from these numbers because the split takes the cluster proportions into account. If the data has not been clustered yet, `clusterRes` is the resolution used for the clustering that will be run.
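To illustrate why a balanced split can land slightly off the requested counts, here is a minimal sketch of a per-cluster (stratified) assignment. It is not `SCProcessing`'s actual implementation: the helper name `balanced_split`, the label strings `"train"`, `"valid"`, `"test"`, and the rounding scheme are all assumptions made for illustration.

```python
import numpy as np


def balanced_split(labels, n_train, n_valid, n_test, seed=0):
    """Assign every cell to 'train', 'valid', or 'test' cluster by cluster.

    Hypothetical helper for illustration only; not the package's code.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    total = n_train + n_valid + n_test
    # Target fraction of each cluster that goes to each subset.
    frac_train = n_train / total
    frac_valid = n_valid / total

    split = np.empty(labels.shape[0], dtype=object)
    for cluster in np.unique(labels):
        # Shuffle the indices of the cells belonging to this cluster.
        idx = rng.permutation(np.flatnonzero(labels == cluster))
        # Per-cluster rounding is what makes the final counts approximate.
        k_train = int(round(frac_train * idx.size))
        k_valid = min(int(round(frac_valid * idx.size)), idx.size - k_train)
        split[idx[:k_train]] = "train"
        split[idx[k_train:k_train + k_valid]] = "valid"
        split[idx[k_train + k_valid:]] = "test"
    return split


# Toy data: two clusters of unequal size.
toy_labels = np.array([0] * 80 + [1] * 20)
toy_split = balanced_split(toy_labels, n_train=70, n_valid=10, n_test=20)
```

In this sketch each cluster contributes the same fraction of its cells to every subset, so rare clusters are represented in training and testing alike.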

In [5]:
obj = TrainSplit(adata, 59000, 0, 7084, balancedSplit=True, clusterRes = 0.2)

### Clustering Information

If the data is already clustered, we only need to get the cluster ratios for a balanced split. Make sure that the `scanpy` object has a column under `obs` called `cluster`, i.e. that `adata.obs["cluster"]` exists. If the data is ***not clustered***, run `obj.Cluster()` first, and then continue in exactly the same way.
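As a rough illustration of what "cluster ratios" means here, the sketch below checks for a `cluster` column and computes the fraction of cells in each cluster with plain `pandas`. The small `obs` table is a stand-in for `adata.obs`, and this is an assumption about the idea, not a copy of what `obj.Cluster_ratios()` does internally.

```python
import pandas as pd

# Stand-in for `adata.obs`; in the notebook you would use the real table.
obs = pd.DataFrame({"cluster": ["0", "0", "0", "1", "1", "2"]})

# Fail early if the clustering column is missing.
if "cluster" not in obs.columns:
    raise KeyError("expected a 'cluster' column in obs; run clustering first")

# Fraction of cells that fall in each cluster.
cluster_ratios = obs["cluster"].value_counts(normalize=True).sort_index()
print(cluster_ratios)
```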

In [6]:
## IF CLUSTERING IS NOT DONE ALREADY! 
obj.Cluster(clusterRes=0.35)

==> Starting to cluster
    -> Setting clustering resolution to 0.35
    -> Running PCA:
    -> PCA done.
    -> Clustering:


Using TensorFlow backend.
OMP: Info #271: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.


==> Saving cluster ratios:
    -> Number of clusters: 10
-><- Saved cluster ratios to object attributes
-><- Done. Clustering of the raw data is done to 10 clusters.
 Clustering took 224.00557947158813 seconds


In [7]:
## IF CLUSTERING IS ALREADY DONE! 
# obj.Cluster_ratios()

## Split!

In [8]:
obj.Split()

INFO:root:    -> Cluster Ratios exist
INFO:root:    -> Starting a *balanced* split


==> Splitting:
    -> Starting a *balanced* split
-><- Splitting done
Splitting took 1.7139842510223389 seconds


Now inspect the original data to make sure it has an attribute called `split`!

In [9]:
obj.sc_raw

AnnData object with n_obs × n_vars = 66084 × 20387
    obs: 'cluster', 'split'
    var: 'gene_ids'
    obsm: 'X_pca'
    varm: 'PCs'
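Beyond confirming that `split` exists, it can be useful to check that each cluster is divided in roughly the same proportions. The sketch below does this with `pandas.crosstab` on a stand-in for `obj.sc_raw.obs`. The label values `"train"` and `"test"` are assumptions; check the actual values in your `split` column.

```python
import pandas as pd

# Stand-in for `obj.sc_raw.obs`; the split labels here are assumed values.
obs = pd.DataFrame({
    "cluster": ["0"] * 8 + ["1"] * 4,
    "split": ["train"] * 6 + ["test"] * 2 + ["train"] * 3 + ["test"] * 1,
})

# Rows are clusters, columns are subsets; each row is normalized to sum to 1.
split_table = pd.crosstab(obs["cluster"], obs["split"], normalize="index")
print(split_table)
```

For a balanced split, the rows of this table should look nearly identical.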

## Save the new `scanpy` object

Now we can save the object to an `h5ad` file. If we do not provide a path, the output goes to the directory `./TrainTestSplitData` (as the log below shows), which holds a JSON file of the hyperparameters and the `h5ad` file containing the object. Alternatively, you can provide a path under `savePath` when you first create the `TrainSplit` object, but this is not necessary.

In [10]:
obj.Save()

==> Saving processed data:
    -> Saving data and parameters to folder ./TrainTestSplitData/TrainTestSplit.h5ad
-><- Saving done.
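To show the idea of keeping the hyperparameters next to the data, here is a minimal sketch that writes the settings used in this notebook to a JSON file and reads them back. The key names mirror the constructor arguments, but the file name `params.json` and the exact layout are assumptions; the file that `obj.Save()` writes may look different.

```python
import json
import tempfile
from pathlib import Path

# Settings used in this notebook, keyed by the constructor argument names.
params = {
    "trainNumber": 59000,
    "validationNumber": 0,
    "testNumber": 7084,
    "balancedSplit": True,
    "randSeed": 0,
    "clusterRes": 0.35,
}

# Write into a temporary directory so the sketch does not touch real outputs.
out_dir = Path(tempfile.mkdtemp()) / "TrainTestSplitData"
out_dir.mkdir(parents=True, exist_ok=True)
params_path = out_dir / "params.json"  # hypothetical file name
params_path.write_text(json.dumps(params, indent=2))

# Read the parameters back, e.g. to reproduce the split later.
restored = json.loads(params_path.read_text())
```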
