# Testing new feature selection methods

## Import section

In [1]:
import pandas as pd
from data_loader import load_data
from selection import select_genes
from selection.utils import subset_adata
from other_steps import cluster_cells, classify_cells
from experiments.metrics import clustering_metrics, classification_metrics

Global seed set to 0


## Load and preprocess the dataset
Before loading any dataset, set the paths to the data and marker gene files in `config/datasets_config.py`:
```python
class DatasetConfig:
    def __init__(self):
        self.data_path = "/your/path/to/datasets/"
        self.marker_path = "/your/path/to/marker/genes/"
```

Here we use the PBMC3k dataset, loaded under the name `PBMCsmall`, as an example:

In [2]:
adata = load_data("PBMCsmall")

******************************** Loading original PBMCsmall dataset ********************************


Trying to set attribute `.var` of view, copying.


Dataset PBMCsmall has 2503 cells, 13714 genes and 8 classes after filtering.
Rare cell type (> 30 cells and < 5.0% of all cells): Monocyte_FCGR3A
Data complexity is 0.89.
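
The rare-cell-type line in the log above can be read as a simple frequency rule. The sketch below is one reading of it, assuming the rule is applied to the `celltype` labels with exactly the printed thresholds (more than 30 cells, less than 5% of all cells). The helper `find_rare_types` and the toy label counts are hypothetical and not part of the package:

```python
from collections import Counter

def find_rare_types(labels, min_cells=30, max_fraction=0.05):
    """Return cell types with more than `min_cells` cells that make up
    less than `max_fraction` of all cells (hypothetical helper)."""
    counts = Counter(labels)
    total = len(labels)
    return sorted(
        cell_type
        for cell_type, n in counts.items()
        if n > min_cells and n / total < max_fraction
    )

# Toy labels for illustration only, not the real PBMCsmall composition.
labels = ["CD4.T.cell"] * 600 + ["B.cell"] * 350 + ["Rare.A"] * 40 + ["Tiny.B"] * 10
print(find_rare_types(labels))  # ['Rare.A']
```

`Tiny.B` is excluded because it has too few cells, which matches the lower bound in the log message.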


The first time a dataset is loaded, it is cached so that later calls can load it directly into memory. The available dataset names are:

|            |               |                 |              |
| :----:     | :----:        | :----:          | :----:       |
| BaronHuman | Segerstolpe   | Zilionis        | Marques      |
| Darmanis   | Guo           | QuakeHeart      | Zeisel       |
| BaronMouse | LaMannoStem   | LaMannoMidbrain | PBMCbatchone |
| QuakeSpleen| QuakeTongue   | Alles           | PBMCbatchtwo |
| Ariss      | ToschesLizard | PBMCsmall       | Aztekin      |
| MouseAtlas | MouseHSP      | MouseRetina     |              |

## Feature selection

In [3]:
selected_adata = select_genes(adata, method='feast', n_selected_genes=2000)

2000 Genes have been saved to cache/geneData/all/PBMCsmall/feast/


In [4]:
selected_adata

AnnData object with n_obs × n_vars = 2503 × 2000
    obs: 'celltype', 'n_counts', 'n_genes', 'percent_mito', 'counts_per_cell'
    var: 'n_cells', 'mean', 'std'
    uns: 'log1p', 'rare_type', 'data_name', 'data_complexity'
    layers: 'normalized'

You can replace this function with a new feature selection method. Just make sure your function matches the expected input and output:
- ***input***: the `anndata` object returned by `load_data()` as shown above, in which `adata.X` holds the log-normalized data,
    `adata.raw` holds the data after quality control but before normalization, and `adata.layers['normalized']` holds the normalized data.
- ***output***: an `anndata` object that contains only the selected genes in both `adata.X` and `adata.raw`.
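
A quick sanity check on the return value of a new method might look like the sketch below. It only reads `var_names` on the object and on its `.raw`, which are standard AnnData attributes. The helper `check_selection_output` is hypothetical, and the assumption that exactly `n_selected_genes` genes come back is taken from the `feast` example above rather than from any documented guarantee:

```python
from types import SimpleNamespace

def check_selection_output(selected, n_selected_genes):
    """Check that a selection result keeps the same genes in X and in raw.

    Hypothetical helper: works on any object exposing `var_names`
    and `raw.var_names`, such as an AnnData object.
    """
    genes = list(selected.var_names)
    raw_genes = list(selected.raw.var_names)
    if len(genes) != n_selected_genes:
        raise ValueError(f"expected {n_selected_genes} genes in X, got {len(genes)}")
    if set(genes) != set(raw_genes):
        raise ValueError("X and raw do not contain the same genes")
    return True

# Stand-in object so the sketch runs without AnnData installed.
good = SimpleNamespace(
    var_names=["CD3E", "MS4A1"],
    raw=SimpleNamespace(var_names=["CD3E", "MS4A1"]),
)
print(check_selection_output(good, n_selected_genes=2))  # True
```

With a real result, the call would be `check_selection_output(selected_adata, 2000)`.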

## Cell clustering
In `config/experiments_config.py`, you can specify:
- the clustering methods and the number of runs for each
- the evaluation metrics

```python
class CellClusteringConfig(BasicExperimentConfig):
    def __init__(self):
        super(CellClusteringConfig, self).__init__()
        self.methods = {'SC3s': 1, 'Seurat_v4': 1}  # clustering_method: number of runs
        self.metrics = ['ARI', 'V', 'bcubed']
```
Other available clustering methods are 'SHARP' and 'SC3'.
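
For intuition about the configured metrics, `'ARI'` is taken here to mean the adjusted Rand index. The sketch below computes it from scratch with the standard pair-counting formula. It is not the package's implementation, and the function name is illustrative:

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index from the contingency counts of two labelings."""
    n = len(labels_true)
    if n != len(labels_pred):
        raise ValueError("both labelings must have the same length")

    # Number of same-cluster pairs in the joint table and in each marginal.
    sum_pairs = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two cells: the labelings trivially agree

    expected = sum_true * sum_pred / total_pairs
    maximum = (sum_true + sum_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_pairs - expected) / (maximum - expected)

# Cluster ids are arbitrary, so a pure relabeling still scores 1.0.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```

Because the score depends only on which cells are grouped together, the integer cluster ids produced by different methods do not need to match the `celltype` names.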

In [8]:
cluster_cells(selected_adata)

SC3s clustering starts. 2503 cells and 2000 genes in data...
**************************** SC3s - 1 ***************************
Seurat_v4 clustering starts. 2503 cells and 2000 genes in data...
************************* Seurat_v4 - 1 *************************


The cluster labels generated in each run are stored in `selected_adata.obs`, in columns named
 `{clustering_method}_{run}`:

In [9]:
selected_adata.obs

|                | celltype      | n_counts | n_genes | percent_mito | counts_per_cell | SC3s_1 | Seurat_v4_1 |
| :----:         | :----:        | :----:   | :----:  | :----:       | :----:          | :----: | :----:      |
| AAACATACAACCAC | CD4.T.cell    | 2419.0   | 779     | 0.0          | 2419.0          | 0      | 1           |
| AAACATTGATCAGC | CD4.T.cell    | 3147.0   | 1129    | 0.0          | 3147.0          | 0      | 1           |
| AAACCGTGCTTCCG | Monocyte_CD14 | 2639.0   | 960     | 0.0          | 2639.0          | 7      | 7           |
| AAACCGTGTATGCG | NK.cell       | 980.0    | 521     | 0.0          | 980.0           | 5      | 6           |
| AAACGCACTGGTAC | CD4.T.cell    | 2163.0   | 781     | 0.0          | 2163.0          | 0      | 1           |
| ...            | ...           | ...      | ...     | ...          | ...             | ...    | ...         |
| TTTCGAACTCTCAT | Monocyte_CD14 | 3459.0   | 1153    | 0.0          | 3459.0          | 7      | 3           |
| TTTCTACTGAGGCA | B.cell        | 3443.0   | 1224    | 0.0          | 3443.0          | 1      | 4           |
| TTTCTACTTCCTCG | B.cell        | 1684.0   | 622     | 0.0          | 1684.0          | 1      | 4           |
| TTTGCATGAGAGGC | B.cell        | 1022.0   | 452     | 0.0          | 1022.0          | 1      | 4           |
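
The label columns in the output above follow directly from the `methods` dict in the config. A minimal sketch of that naming rule, assuming runs are numbered from 1 as in `SC3s_1` and that later runs would continue as `_2`, `_3`, and so on (`expected_label_columns` is a hypothetical helper):

```python
def expected_label_columns(methods):
    """List the obs column names implied by a {method: n_runs} config dict."""
    return [
        f"{method}_{run}"
        for method, n_runs in methods.items()
        for run in range(1, n_runs + 1)
    ]

methods = {'SC3s': 1, 'Seurat_v4': 1}  # same dict as in CellClusteringConfig
print(expected_label_columns(methods))  # ['SC3s_1', 'Seurat_v4_1']
```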


After cell clustering, we evaluate the clustering results:

In [10]:
results = clustering_metrics(selected_adata)

In [11]:
results

{'SC3s': {'ARI_1': 0.600343060674016,
  'V_1': 0.7921364827506133,
  'bcubed_1': 0.9442864759196732},
 'Seurat_v4': {'ARI_1': 0.5782460625874656,
  'V_1': 0.7680852769186622,
  'bcubed_1': 0.9816576669238022}}

In [12]:
pd.DataFrame(results)

|          | SC3s     | Seurat_v4 |
| :----:   | :----:   | :----:    |
| ARI_1    | 0.600343 | 0.578246  |
| V_1      | 0.792136 | 0.768085  |
| bcubed_1 | 0.944286 | 0.981658  |
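
When a method is configured to run more than once, it can be convenient to average each metric over its runs. The sketch below assumes the result keys always follow the `{metric}_{run}` pattern seen above. The helper `average_over_runs` is hypothetical and not part of the package:

```python
from collections import defaultdict
from statistics import mean

def average_over_runs(results):
    """Average each metric over runs, given keys of the form '{metric}_{run}'."""
    summary = {}
    for method, scores in results.items():
        per_metric = defaultdict(list)
        for key, value in scores.items():
            metric, _, _run = key.rpartition("_")  # 'ARI_1' -> 'ARI'
            per_metric[metric].append(value)
        summary[method] = {metric: mean(values) for metric, values in per_metric.items()}
    return summary

# The single-run results shown above; with one run, the mean equals the value.
results = {
    'SC3s': {'ARI_1': 0.600343060674016, 'V_1': 0.7921364827506133, 'bcubed_1': 0.9442864759196732},
    'Seurat_v4': {'ARI_1': 0.5782460625874656, 'V_1': 0.7680852769186622, 'bcubed_1': 0.9816576669238022},
}
print(sorted(average_over_runs(results)['SC3s']))  # ['ARI', 'V', 'bcubed']
```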


## Inter-dataset cell classification

We use the same function `load_data()` to read two datasets:

In [2]:
train_adata = load_data('PBMCbatchone')
test_adata = load_data('PBMCbatchtwo')

****************************** Loading processed PBMCbatchone dataset ******************************
Dataset PBMCbatchone has 7429 cells, 33694 genes and 9 classes after filtering.
Rare cell type (> 30 cells and < 5.0% of all cells): Megakaryocyte
Data complexity is 0.957.
****************************** Loading processed PBMCbatchtwo dataset ******************************
Dataset PBMCbatchtwo has 6987 cells, 33694 genes and 8 classes after filtering.
Rare cell type (> 30 cells and < 5.0% of all cells): Megakaryocyte
Data complexity is 0.95.


In [3]:
selected_train_adata = select_genes(train_adata, 'rf', 2000, select_by_batch=False)  # select genes on training set

2000 genes are selected by rf using previously saved genes and importances...


In [4]:
selected_train_adata

AnnData object with n_obs × n_vars = 7429 × 2000
    obs: 'celltype', 'n_counts', 'n_genes', 'percent_mito', 'counts_per_cell'
    var: 'Gene', 'n_cells', 'mean', 'std'
    uns: 'log1p', 'rare_type', 'data_name', 'data_complexity'
    layers: 'normalized'

In [5]:
selected_test_adata = subset_adata(test_adata, selected_train_adata.var_names)   # only preserve selected genes on test set

In [6]:
selected_test_adata

AnnData object with n_obs × n_vars = 6987 × 2000
    obs: 'celltype', 'n_counts', 'n_genes', 'percent_mito', 'counts_per_cell'
    var: 'Gene', 'n_cells', 'mean', 'std'
    uns: 'log1p', 'rare_type', 'data_name', 'data_complexity'
    layers: 'normalized'
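
Before classification, it can be worth confirming that the training and test objects hold the same genes in the same order. Whether `subset_adata` already guarantees this is not stated here, so the sketch below treats it as something to verify. `check_same_genes` is a hypothetical helper that accepts any two iterables of gene names:

```python
def check_same_genes(train_genes, test_genes):
    """Raise if the test genes differ from the training genes in content or order."""
    train_genes, test_genes = list(train_genes), list(test_genes)
    test_set = set(test_genes)
    missing = [gene for gene in train_genes if gene not in test_set]
    if missing:
        raise ValueError(
            f"{len(missing)} selected genes are missing from the test set, e.g. {missing[:5]}"
        )
    if train_genes != test_genes:
        raise ValueError("gene lists differ in order, or the test set has extra genes")
    return True

# Toy gene lists for illustration; with real data one would pass
# selected_train_adata.var_names and selected_test_adata.var_names.
print(check_same_genes(["CD3E", "MS4A1", "NKG7"], ["CD3E", "MS4A1", "NKG7"]))  # True
```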

In [7]:
classify_cells(selected_train_adata, selected_test_adata)   # do cell classification

SingleR starts. 7429 cells and 2000 genes in train data; 6987 cells and 2000 genes in test data...


The predicted labels are stored in `selected_test_adata.obs`, in a column named
 `{classification_method}_label`:

In [11]:
selected_test_adata.obs

| Cell                       | celltype                    | n_counts | n_genes | percent_mito | counts_per_cell | SingleR_label               |
| :----:                     | :----:                      | :----:   | :----:  | :----:       | :----:          | :----:                      |
| data_5p-AAACCTGAGCGATAGC-1 | NK.cell                     | 2712.0   | 1318    | 0.0          | 2712.0          | NK.cell                     |
| data_5p-AAACCTGAGCTAAACA-1 | Monocyte_CD14               | 6561.0   | 2164    | 0.0          | 6561.0          | Monocyte_CD14               |
| data_5p-AAACCTGAGGGAGTAA-1 | Monocyte_CD14               | 6322.0   | 2112    | 0.0          | 6322.0          | Monocyte_CD14               |
| data_5p-AAACCTGAGTCTTGCA-1 | CD8.T.cell                  | 4528.0   | 1526    | 0.0          | 4528.0          | CD8.T.cell                  |
| data_5p-AAACCTGAGTTCGATC-1 | Monocyte_CD14               | 3426.0   | 1332    | 0.0          | 3426.0          | Monocyte_FCGR3A             |
| ...                        | ...                         | ...      | ...     | ...          | ...             | ...                         |
| data_5p-TTTGTCATCCACGTTC-1 | Monocyte_CD14               | 6547.0   | 2044    | 0.0          | 6547.0          | Monocyte_CD14               |
| data_5p-TTTGTCATCGCGTAGC-1 | B.cell                      | 3615.0   | 1397    | 0.0          | 3615.0          | B.cell                      |
| data_5p-TTTGTCATCTTAACCT-1 | CD8.T.cell                  | 3828.0   | 1480    | 0.0          | 3828.0          | CD8.T.cell                  |
| data_5p-TTTGTCATCTTACCGC-1 | Plasmacytoid.dendritic.cell | 6444.0   | 2388    | 0.0          | 6444.0          | Plasmacytoid.dendritic.cell |


In `config/experiments_config.py`, you can specify the evaluation metrics:

```python
class CellClassificationConfig(BasicExperimentConfig):
    def __init__(self):
        super(CellClassificationConfig, self).__init__()
        self.methods = ['SingleR']
        self.metrics = ['f1', 'ck']  # F1 score and Cohen's kappa
```

In [9]:
results = classification_metrics(selected_test_adata)  # evaluate the results

In [10]:
results

{'SingleR': {'f1': 0.8136420027822318, 'ck': 0.8496280431910497}}
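
For reference, both metrics can be computed by hand from the true and predicted labels. The sketch below uses the textbook definitions and assumes that `'f1'` is macro-averaged over cell types, which the config does not state. It is not the package's implementation, and the toy labels are made up for illustration:

```python
from collections import Counter

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging is an assumption)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        scores.append(2 * tp / (2 * tp + fp + fn))  # F1 = 2TP / (2TP + FP + FN)
    return sum(scores) / len(scores)

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between two labelings, corrected for chance."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / n ** 2
    if expected == 1:
        return 1.0  # single class in both labelings; treated as full agreement here
    return (observed - expected) / (1 - expected)

# Toy labels, not taken from the PBMC datasets.
y_true = ["B", "B", "T", "T", "NK", "NK"]
y_pred = ["B", "B", "T", "NK", "NK", "NK"]
print(round(macro_f1(y_true, y_pred), 4))     # 0.8222
print(round(cohen_kappa(y_true, y_pred), 4))  # 0.75
```

With real data, `y_true` would be `selected_test_adata.obs['celltype']` and `y_pred` would be `selected_test_adata.obs['SingleR_label']`.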