# Testing new feature selection methods

## Import section

In [1]:
from data_loader import load_data, yield_train_test_data
from selection import select_genes
from other_steps import cluster_cells, classify_cells
from experiments.metrics import clustering_metrics, classification_metrics

Global seed set to 0
2022-04-08 14:16:31.940610: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1


## Load and preprocess the dataset
Before loading datasets, specify the paths to the data and marker gene files in `config/datasets_config.py`:
```python
self.data_path = "/your/path/to/datasets/"
self.marker_path = "/your/path/to/marker/genes/"
```

Here, we use the PBMC3k dataset (loaded under the name `PBMCsmall`) as an example:

In [2]:
adata = load_data("PBMCsmall")

******************************** Loading original PBMCsmall dataset ********************************


Trying to set attribute `.var` of view, copying.


Dataset PBMCsmall has 2503 cells, 13714 genes and 8 classes after filtering.
Rare cell type (> 30 cells and < 5.0% of all cells): Monocyte_FCGR3A
Data complexity is 0.89.


The first time a dataset is loaded, it is cached so that it can be loaded directly into memory next time. All available dataset names:

|              |               |                 |              |
| :----------: | :-----------: | :-------------: | :----------: |
| BaronHuman   | Segerstolpe   | Zilionis        | Marques      |
| Darmanis     | Guo           | QuakeHeart      | Zeisel       |
| BaronMouse   | LaMannoStem   | LaMannoMidbrain | PBMCbatchone |
| QuakeSpleen  | QuakeTongue   | Alles           | PBMCbatchtwo |
| Ariss        | ToschesLizard | PBMCsmall       | Aztekin      |
| MouseAtlas   | MouseHSP      | MouseRetina     |              |
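
The caching behaviour mentioned above can be pictured with a small load-or-cache pattern. This is only a sketch of the general idea, not the package's actual implementation: `load_with_cache`, `toy_loader`, the pickle format, and the cache location are all hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def load_with_cache(name, loader, cache_dir):
    """Return loader(name), reusing a pickled copy in cache_dir when one exists."""
    cache_file = Path(cache_dir) / f"{name}.pkl"
    if cache_file.exists():
        # Later calls: read the cached object instead of re-running the loader.
        with cache_file.open("rb") as fh:
            return pickle.load(fh)
    # First call: run the (possibly slow) loader and store its result.
    data = loader(name)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    with cache_file.open("wb") as fh:
        pickle.dump(data, fh)
    return data


calls = []


def toy_loader(name):
    """Stand-in for an expensive preprocessing step (hypothetical)."""
    calls.append(name)
    return {"name": name, "n_cells": 3}


with tempfile.TemporaryDirectory() as tmp:
    first = load_with_cache("toy", toy_loader, tmp)   # runs toy_loader
    second = load_with_cache("toy", toy_loader, tmp)  # served from the cache
```

After both calls, `toy_loader` has run only once, which is the effect the cache is meant to have.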

## Feature selection

In [3]:
selected_adata = select_genes(adata, method='feast', n_selected_genes=2000)

2000 Genes have been saved to cache/geneData/all/PBMCsmall/feast/


In [4]:
selected_adata

AnnData object with n_obs × n_vars = 2503 × 2000
    obs: 'celltype', 'n_counts', 'n_genes', 'percent_mito', 'counts_per_cell'
    var: 'n_cells', 'mean', 'std'
    uns: 'log1p', 'rare_type', 'data_name', 'data_complexity'
    layers: 'normalized'

You can replace this function with a new feature selection method. Just pay attention to the input and output of your function:
- ***input***: the `anndata` object returned by `load_data()` as shown above, in which `anndata.X` is the log-normalized data,
    `anndata.raw` is the data after quality control but before normalization, and the normalized data is in `adata.layers['normalized']`.
- ***output***: an `anndata` object that contains only the selected genes in both `anndata.X` and `anndata.raw`.
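
As an illustration of this contract, here is a minimal sketch of a custom selector. The function names `top_variance_mask` and `select_genes_by_variance` are hypothetical, and ranking genes by variance is only a stand-in criterion, not a method shipped with this package. The wrapper assumes a standard `AnnData` object whose `.raw` holds the same genes as `.X`.

```python
import numpy as np


def top_variance_mask(X, n_selected_genes):
    """Boolean mask over genes (columns) marking the highest-variance ones."""
    # Accept dense arrays as well as scipy sparse matrices.
    X = X.toarray() if hasattr(X, "toarray") else np.asarray(X)
    variances = X.var(axis=0)
    n = min(n_selected_genes, X.shape[1])
    top = np.argsort(variances)[::-1][:n]  # indices of the n largest variances
    mask = np.zeros(X.shape[1], dtype=bool)
    mask[top] = True
    return mask


def select_genes_by_variance(adata, n_selected_genes=2000):
    """Hypothetical drop-in selector: AnnData in, gene-subset AnnData out."""
    mask = top_variance_mask(adata.X, n_selected_genes)
    selected = adata[:, mask].copy()  # subsets .X, .var and .layers
    if adata.raw is not None:
        # Slicing does not subset .raw, so restrict it to the same genes explicitly.
        selected.raw = adata.raw[:, selected.var_names].to_adata()
    return selected


# Toy check of the ranking step: 3 cells x 3 genes, gene 0 is constant.
toy = np.array([[0, 1, 5], [0, 2, -5], [0, 3, 5]])
toy_mask = top_variance_mask(toy, 2)
```

With a function of this shape, the rest of the notebook should work unchanged, for example `selected_adata = select_genes_by_variance(adata, n_selected_genes=2000)`.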

## Cell clustering
In `config/experiments_config.py`, you can specify:
- the clustering methods and how many times each one runs
- the evaluation metrics

```python
class CellClusteringConfig(BasicExperimentConfig):
    def __init__(self):
        super(CellClusteringConfig, self).__init__()
        self.methods = {'SC3s': 1, 'Seurat_v4': 1}  # clustering_method: number of runs
        self.metrics = ['ARI', 'V', 'bcubed']
```
Other available clustering methods: 'SHARP' and 'SC3'.

In [8]:
cluster_cells(selected_adata)

SC3s clustering starts. 2503 cells and 2000 genes in data...
**************************** SC3s - 1 ***************************
Seurat_v4 clustering starts. 2503 cells and 2000 genes in data...
************************* Seurat_v4 - 1 *************************


The cluster labels generated in each run are stored in `selected_adata.obs`, in columns named "{clustering_method}_{run}":

In [9]:
selected_adata.obs

|                | celltype      | n_counts | n_genes | percent_mito | counts_per_cell | SC3s_1 | Seurat_v4_1 |
| :----:         | :----:        | :----:   | :----:  | :----:       | :----:          | :----: | :----:      |
| AAACATACAACCAC | CD4.T.cell    | 2419.0   | 779     | 0.0          | 2419.0          | 0      | 1           |
| AAACATTGATCAGC | CD4.T.cell    | 3147.0   | 1129    | 0.0          | 3147.0          | 0      | 1           |
| AAACCGTGCTTCCG | Monocyte_CD14 | 2639.0   | 960     | 0.0          | 2639.0          | 7      | 7           |
| AAACCGTGTATGCG | NK.cell       | 980.0    | 521     | 0.0          | 980.0           | 5      | 6           |
| AAACGCACTGGTAC | CD4.T.cell    | 2163.0   | 781     | 0.0          | 2163.0          | 0      | 1           |
| ...            | ...           | ...      | ...     | ...          | ...             | ...    | ...         |
| TTTCGAACTCTCAT | Monocyte_CD14 | 3459.0   | 1153    | 0.0          | 3459.0          | 7      | 3           |
| TTTCTACTGAGGCA | B.cell        | 3443.0   | 1224    | 0.0          | 3443.0          | 1      | 4           |
| TTTCTACTTCCTCG | B.cell        | 1684.0   | 622     | 0.0          | 1684.0          | 1      | 4           |
| TTTGCATGAGAGGC | B.cell        | 1022.0   | 452     | 0.0          | 1022.0          | 1      | 4           |
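
To see which label columns to expect for a given configuration, the naming scheme can be sketched as follows. The helper `expected_label_columns` is hypothetical, and it assumes that runs are numbered consecutively starting from 1, as the `SC3s_1` and `Seurat_v4_1` columns above suggest.

```python
def expected_label_columns(methods):
    """List the '{clustering_method}_{run}' column names for a methods dict."""
    return [
        f"{method}_{run}"
        for method, n_runs in methods.items()
        for run in range(1, n_runs + 1)  # assumed: runs are numbered from 1
    ]


# Same structure as `self.methods` in CellClusteringConfig.
columns = expected_label_columns({'SC3s': 1, 'Seurat_v4': 1})
print(columns)  # ['SC3s_1', 'Seurat_v4_1']

# Under the same assumption, two runs of one method would give two columns.
two_runs = expected_label_columns({'SC3s': 2})
```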


After cell clustering, we can evaluate the clustering results:

In [10]:
results = clustering_metrics(selected_adata)

In [11]:
results

{'SC3s': {'ARI_1': 0.600343060674016,
  'V_1': 0.7921364827506133,
  'bcubed_1': 0.9442864759196732},
 'Seurat_v4': {'ARI_1': 0.5782460625874656,
  'V_1': 0.7680852769186622,
  'bcubed_1': 0.9816576669238022}}

In [12]:
import pandas as pd
pd.DataFrame(results)

|          | SC3s     | Seurat_v4 |
| :----:   | :----:   | :----:    |
| ARI_1    | 0.600343 | 0.578246  |
| V_1      | 0.792136 | 0.768085  |
| bcubed_1 | 0.944286 | 0.981658  |
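
When a method is configured to run more than once, the results dictionary presumably holds one `{metric}_{run}` entry per run. A possible way to average each metric over runs is sketched below; `mean_over_runs` is a hypothetical helper and the input values are made up for illustration, not real results.

```python
from collections import defaultdict
from statistics import mean


def mean_over_runs(results):
    """Average '{metric}_{run}' entries per metric for every clustering method."""
    summary = {}
    for method, scores in results.items():
        per_metric = defaultdict(list)
        for key, value in scores.items():
            # Split on the last underscore so metric names may contain '_'.
            metric, _run = key.rsplit("_", 1)
            per_metric[metric].append(value)
        summary[method] = {m: mean(v) for m, v in per_metric.items()}
    return summary


# Made-up scores with the same layout as `results` above.
example = {"method_a": {"ARI_1": 0.5, "ARI_2": 1.0, "V_1": 0.25}}
summary = mean_over_runs(example)
```

The returned dictionary has the same nested shape as `results`, so it can also be passed to `pd.DataFrame` for display.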
