# Gene Selection Tutorial 

## 0. Importing packages

Import nvr, pandas, and numpy. NVR's dependencies are imported automatically, provided they are already installed. Make sure the following packages are available:

numpy, pandas, scipy (for scipy.spatial), networkx, and nvr. The time module, which NVR also uses, ships with Python.

In [1]:
import nvr
import pandas as pd
import numpy as np

## 1. Loading and pre-processing raw count data

This tutorial uses the raw count data for GSE102698. We have included this file as a CSV; in its current state, it is stripped of gene names.

In [2]:
s1counts=pd.read_csv("s1_countsRaw.csv",header=None)

In [3]:
s1counts.head()

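|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 25497 | 25498 | 25499 | 25500 | 25501 | 25502 | 25503 | 25504 | 25505 | 25506 |
|---|---|---|---|---|---|---|---|---|---|---|-----|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | ... | 95 | 27 | 54 | 48 | 26 | 17 | 44 | 4 | 15 | 0 |
| 1 | 0 | 2 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 113 | 36 | 67 | 90 | 89 | 12 | 53 | 7 | 23 | 0 |
| 2 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | ... | 93 | 39 | 46 | 52 | 35 | 23 | 33 | 2 | 18 | 1 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 101 | 41 | 105 | 70 | 103 | 19 | 64 | 3 | 17 | 0 |
| 4 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | ... | 174 | 55 | 105 | 107 | 46 | 35 | 69 | 1 | 24 | 0 |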


Datasets for use with NVR should be formatted as such, with each row as a cell and each column as a gene. Here we have 1597 cells and 25507 genes.

In [4]:
s1counts.shape

(1597, 25507)

NVR expects the data as a numpy array and will throw an exception for other input types, so we convert the DataFrame first.

In [5]:
s1countsArr=np.asarray(s1counts)
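Before passing the array on, it can help to confirm that the conversion produced what you expect. The following is a minimal sketch on a small synthetic table; the variable names (`toy_counts`, `toy_arr`) are illustrative stand-ins and not part of the tutorial's data.

```python
import numpy as np
import pandas as pd

# Small synthetic stand-in for the count table (3 cells x 4 genes).
toy_counts = pd.DataFrame([[0, 1, 0, 5],
                           [2, 0, 0, 3],
                           [1, 1, 0, 4]])

# Same conversion as in the tutorial cell.
toy_arr = np.asarray(toy_counts)

# Basic sanity checks: array type, numeric dtype, and cells-by-genes shape.
is_ndarray = isinstance(toy_arr, np.ndarray)
is_numeric = np.issubdtype(toy_arr.dtype, np.number)
n_cells, n_genes = toy_arr.shape
print(is_ndarray, is_numeric, n_cells, n_genes)
```

A non-numeric dtype here usually means a header row or a gene-name column was read in as data.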

This next quality control step is **optional** but recommended, depending on the phenomenon of interest. Here we assume that genes with at most one transcript detected across all cells are low quality, and we remove them.

In [6]:
hqGenes=nvr.parseNoise(s1countsArr)
s1countsArrHq=nvr.mkIndexedArr(s1countsArr,hqGenes)
s1countsArrHq.shape

(1597L, 13730L)

The parseNoise() function locates the higher quality genes and returns their indices. The mkIndexedArr() function then uses these indices to generate a new array containing only the count data from these high quality genes.
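To make the filtering rule concrete, here is a rough plain-numpy sketch of the idea described above: keep genes whose total count across all cells exceeds one, then index the array by the surviving columns. This is an illustration under that assumption, not NVR's actual implementation; `filter_low_count_genes` and its `min_total` parameter are hypothetical names.

```python
import numpy as np

def filter_low_count_genes(counts, min_total=2):
    """Return column indices of genes whose summed count over all cells is >= min_total."""
    totals = counts.sum(axis=0)
    return np.flatnonzero(totals >= min_total)

# Toy data: 3 cells x 4 genes. Genes 0 and 2 are never detected,
# gene 1 has a single transcript in total, gene 3 is well covered.
toy = np.array([[0, 1, 0, 5],
                [0, 0, 0, 3],
                [0, 0, 0, 4]])

kept = filter_low_count_genes(toy)   # indices of retained genes
toy_hq = toy[:, kept]                # array restricted to those genes
print(kept, toy_hq.shape)
```

Keeping the index array around matters later, because any indices computed on the filtered array refer to its columns, not to the columns of the original table.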

Next, we normalize the count data to the library size and apply a pointwise inverse hyperbolic sine (arcsinh) transformation with the pwArcsinh() function. We use 5000 as the constant for this transformation. Alternative methods include log1p transformations.

In [7]:
s1adataHq=nvr.pwArcsinh(s1countsArrHq,5000)

Completion:
99.937382592413
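For readers who want to see the shape of this transformation, the sketch below shows one plausible reading of it in vectorized numpy: divide each cell's counts by that cell's library size, scale by the constant, and apply arcsinh, with a log1p variant alongside for comparison. The exact formula used inside pwArcsinh() is an assumption here, and `libsize_arcsinh` and `libsize_log1p` are hypothetical helper names.

```python
import numpy as np

def libsize_arcsinh(counts, cofactor=5000):
    """Library-size normalize each cell (row), scale by cofactor, then apply arcsinh."""
    counts = np.asarray(counts, dtype=float)
    lib_size = counts.sum(axis=1, keepdims=True)
    lib_size[lib_size == 0] = 1.0  # avoid division by zero for empty cells
    return np.arcsinh(counts / lib_size * cofactor)

def libsize_log1p(counts, scale=5000):
    """Same normalization, with log1p in place of arcsinh."""
    counts = np.asarray(counts, dtype=float)
    lib_size = counts.sum(axis=1, keepdims=True)
    lib_size[lib_size == 0] = 1.0
    return np.log1p(counts / lib_size * scale)

# Toy data: 2 cells x 3 genes with different library sizes (100 and 10).
toy_counts = np.array([[0, 10, 90],
                       [5, 5, 0]])
toy_arcsinh = libsize_arcsinh(toy_counts)
toy_log1p = libsize_log1p(toy_counts)
print(toy_arcsinh.round(3))
print(toy_log1p.round(3))
```

Both transforms map zero counts to zero and compress large values; arcsinh behaves like a logarithm for large inputs and is roughly linear near zero.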

## 2. Unsupervised feature selection

With the data in the proper format, we can now perform feature selection with the select_genes() function. This function returns a set of indices that can be cross-referenced with a corresponding set of gene names. Note that the "neighborhood variance calculation" step takes some time. For the preprocessed s1adataHq input data, the process took approximately 830 seconds on an i5-4300U in our testing; the run logged below took about 947 seconds.

In [8]:
s1_selected_genes=nvr.select_genes(s1adataHq)

Start min_conn_k
4 connections needed
Finished min_conn_k 
Start traj_dist
Finished traj_dist
Start adaptive_knn_graph
Finished adaptive_knn_graph
Start global variance calculation
Finished global variance calculation
Start neighborhood variance calculation
Completion:
Finished neighborhood variance calculation
Start global to neighborhood variance ratio calculation
Finished global to neighborhood variance ratio calculation
Finished selection_val
Finished gene selection in 946.611999989 seconds
done
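The log above names a global variance, a neighborhood variance, and their ratio. As a rough intuition for why that ratio is informative, the sketch below compares each gene's overall variance with the mean squared difference between each cell and its nearest neighbors. It is a deliberately simplified illustration and not NVR's algorithm: it uses a fixed `k` rather than an adaptive kNN graph, and the function name, the definition of neighborhood variance, and the threshold of 1 are all assumptions made for this example.

```python
import numpy as np

def toy_variance_ratio(data, k=5):
    """Per-gene ratio of global variance to a simple kNN-based neighborhood variance."""
    # Squared Euclidean distances between all pairs of cells.
    sq_norms = (data ** 2).sum(axis=1)
    dist2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (data @ data.T)
    np.fill_diagonal(dist2, np.inf)               # a cell is not its own neighbor
    neighbors = np.argsort(dist2, axis=1)[:, :k]  # shape (n_cells, k)

    global_var = data.var(axis=0)
    # Mean squared difference between each cell and its k neighbors, per gene.
    diffs = data[:, None, :] - data[neighbors]    # shape (n_cells, k, n_genes)
    neigh_var = (diffs ** 2).mean(axis=(0, 1))
    return global_var / neigh_var

# Synthetic data: gene 0 varies smoothly along a trajectory,
# genes 1..30 are pure noise with no structure.
rng = np.random.default_rng(0)
n_cells = 200
trajectory_gene = np.linspace(0.0, 100.0, n_cells)
noise_genes = rng.normal(size=(n_cells, 30))
toy_data = np.column_stack([trajectory_gene, noise_genes])

ratio = toy_variance_ratio(toy_data)
toy_selected = np.flatnonzero(ratio > 1.0)  # genes smoother locally than globally
print(ratio[0], ratio[1:].max())
```

In this toy setting, the trajectory gene changes little between neighboring cells relative to its overall spread, so its ratio is far larger than that of any noise gene.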


In [10]:
s1_selected_genes.shape

(404L,)

This length-404 array represents the selected genes, that is, the genes whose neighborhood expression variance was lower than their global expression variance. These are indices that can be used to construct a subset of the full dataset for downstream analysis.
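One way to use these indices is sketched below on toy stand-ins. It assumes that the selected indices refer to columns of the array passed to select_genes() (the quality-filtered array), and that the earlier filtering step returned integer column indices into the original table, so the two can be chained to recover original gene positions. All variable names and the `gene_names` array are hypothetical.

```python
import numpy as np

# Toy stand-ins shaped like the tutorial's variables.
raw = np.arange(20).reshape(4, 5)                 # 4 cells x 5 genes (original table)
gene_names = np.array(["g0", "g1", "g2", "g3", "g4"])  # hypothetical gene names
hq_genes = np.array([1, 3, 4])                    # columns kept by quality control
hq_data = raw[:, hq_genes]                        # filtered array used for selection
selected = np.array([0, 2])                       # selection result: columns of hq_data

# Subset of the filtered data containing only the selected genes.
subset = hq_data[:, selected]

# Map back to positions in the original table, then to names.
original_columns = hq_genes[selected]
selected_names = gene_names[original_columns]
print(original_columns, selected_names)
```

If the optional quality control step was skipped, the selected indices would already refer to the original columns and the mapping through `hq_genes` would not be needed.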

In [12]:
pd.DataFrame(s1_selected_genes).to_csv("s1_selected_genes_nvr.csv", header=False,index=False)