#FaSTR - Feature (a) Selection Reinforced

##About: 

The "a" is there because I wanted the name to read FaST, even if the program is slow.  Unlike FaST, FaSTR generates the indices of the top K features for each of several feature selection (FS) subset sizes, using both Chi2 and Random selection strategies.  Note: FaST should be run first, because FaSTR loads the gene and class files that FaST writes to FaST_files.

Author: Terek Arce

##TO DO:
* add other fs techniques

In [49]:
Database = "predict_kit2"
User = "terek"
Password = ""
Host = "localhost"
Port = "5432"

Enter the path where files can be stored and located for future reference.

In [50]:
Path = "/Users/terek/Dropbox/Mac/FutureConfStability/iNotebook/FaSTR_files"

Set the following variable to True if you'd like the corresponding feature selection algorithm to be used, False otherwise.

In [51]:
Chi2 = True

Each entry of the list must have the format: [series, REGEX of classes]

In [52]:
get_data = [ ["GSE19804", "%normal%|%tumor%"] ,
             ["GSE39582", "%C1|%C2|%C3|%C4|%C5|%C6"] ,
             ["GSE27562", "%normal mammogram|%breast cancer, confirmed by diagnostic biopsy|%benign%" ] ]

##Main Program Start:

No user input is needed beyond this point.  When executing the code, responses from the DB and program will be displayed below each code cell.

The packages used by our program are found below:

In [2]:
import numpy as np
import random
from sklearn.feature_selection import SelectKBest, chi2
from os.path import exists
from psycopg2 import connect

Opens a connection to the database:

In [54]:
conn = connect(database=Database, user=User, password=Password, host=Host, port=Port)
print ( "DB Response: Opened connection to database successfully :)" )

DB Response: Opened connection to database successfully :)


Given a matrix of gene values (n_samples, n_features), the class of each sample (n_samples), and the number of features to be selected (num), this function returns the indices of the num features with the highest chi2 scores.  Note that chi2 requires non-negative feature values.

In [55]:
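def chi2_fs ( genes, classes, num ):
    # k must be passed by keyword in recent scikit-learn releases
    selector = SelectKBest( chi2, k = num ).fit( genes, classes )
    # indices = True returns integer column indices instead of a boolean mask
    return selector.get_support( indices = True )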
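
To make the ranking criterion concrete, here is a minimal NumPy-only sketch of a chi-squared score between each feature and the class labels, written from the textbook definition (observed versus expected per-class feature totals). It is not scikit-learn's code, and the helper names `chi2_scores` and `top_k_indices` as well as the toy data are assumptions made for illustration.

```python
import numpy as np

def chi2_scores(genes, classes):
    # One-hot encode the class labels: shape (n_samples, n_classes).
    labels = np.unique(classes)
    onehot = (classes[:, None] == labels[None, :]).astype(float)
    # Observed per-class feature totals: shape (n_classes, n_features).
    observed = onehot.T @ genes
    # Expected totals if each feature were independent of the class.
    class_prob = onehot.mean(axis=0)
    feature_total = genes.sum(axis=0)
    expected = np.outer(class_prob, feature_total)
    # Assumes every feature total is positive, so expected is never zero.
    return ((observed - expected) ** 2 / expected).sum(axis=0)

def top_k_indices(scores, k):
    # Indices of the k largest scores, returned in ascending index order.
    return np.sort(np.argsort(scores)[-k:])

# Hypothetical toy data: 6 samples, 4 non-negative features.
# Only feature 2 differs between the two classes.
toy_genes = np.array([
    [1.0, 2.0, 0.0, 3.0],
    [1.0, 2.0, 1.0, 3.0],
    [1.0, 2.0, 0.0, 3.0],
    [1.0, 2.0, 9.0, 3.0],
    [1.0, 2.0, 8.0, 3.0],
    [1.0, 2.0, 9.0, 3.0],
])
toy_classes = np.array([0, 0, 0, 1, 1, 1])

toy_scores = chi2_scores(toy_genes, toy_classes)
toy_top = top_k_indices(toy_scores, 1)
print(toy_top)  # [2]
```

The three constant features score zero because their observed and expected totals coincide, so the single selected index is the one feature that separates the classes.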

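For every series and every feature selection size (5 to 100 in steps of 5), saves the indices of the selected features to a file; each saved array has size [n_selected_features].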

In [56]:
feature_selection_sizes = np.arange(5,101,5)

for i in range( len( get_data ) ): 
    classes_file = ( "FaST_files/%s_classes.npy" % get_data[i][0]  )
    genes_file = ( "FaST_files/%s_genes.npy" % get_data[i][0] )
    if ( exists( genes_file ) and exists( classes_file ) ):
        genes = np.load( genes_file )
        classes = np.load( classes_file )
        for fs_size in feature_selection_sizes:
            if ( Chi2 ):
                indices = chi2_fs ( genes, classes, fs_size ) 
                indices_file = ( "FaSTR_files/FS/%s_%03d_fs_indices.npy" % (get_data[i][0], fs_size) )
                np.save( indices_file, indices)
            # TODO: add other fs methods here
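
The file names used above follow a fixed pattern with a zero-padded size. A small sketch of that pattern as a function may make it easier to reuse; `indices_path` is a hypothetical helper, not part of the notebook, and passing the base directory as an argument is an assumption.

```python
import os

def indices_path(base_dir, series, fs_size):
    # Hypothetical helper; mirrors the "%s_%03d_fs_indices.npy" pattern
    # and the FS subdirectory used by the notebook.
    name = "%s_%03d_fs_indices.npy" % (series, fs_size)
    return os.path.join(base_dir, "FS", name)

example = indices_path("FaSTR_files", "GSE19804", 5)
print(os.path.basename(example))  # GSE19804_005_fs_indices.npy
```

Since np.save does not create missing directories, calling `os.makedirs(os.path.dirname(example), exist_ok=True)` before saving is one way to avoid an error on a fresh checkout.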

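For each feature selection size, finds the position of every selected gene within the largest selection (main_size) and saves those positions; each saved array has size [n_selected_features].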

In [60]:
main_size = feature_selection_sizes[-1]
for i in range( len( get_data ) ):
    indices_file = ( "FaSTR_files/FS/%s_%03d_fs_indices.npy" % (get_data[i][0], main_size) )
    if ( exists( indices_file ) ):
        N = np.load( indices_file )
        for fs_size in feature_selection_sizes:
            sub_index_file = ( "FaSTR_files/FS/%s_%03d_fs_indices.npy" % (get_data[i][0], fs_size) )
            if ( exists( sub_index_file ) ):
                K = np.load( sub_index_file )
                indices = [ np.where(N==k)[0][0] for k in K ]
                nCk = ( "FaSTR_files/%s_%03d_%03d_nCk.npy" % (get_data[i][0], main_size, fs_size) )
                np.save( nCk, indices)
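
The lookup in the cell above can be illustrated on two small arrays. The values below are made up and stand in for the saved index files; the vectorized variant is an alternative sketch that assumes the larger array is sorted in ascending order and contains every entry of the smaller one.

```python
import numpy as np

# Hypothetical index arrays standing in for the saved *_fs_indices.npy files.
largest = np.array([3, 7, 12, 20, 41])   # plays the role of N
smaller = np.array([7, 20])              # plays the role of K

# Same lookup as the list comprehension in the notebook.
positions_loop = [int(np.where(largest == k)[0][0]) for k in smaller]

# Vectorized alternative; valid only because `largest` is sorted ascending
# and every entry of `smaller` occurs in it.
positions_fast = np.searchsorted(largest, smaller)

print(positions_loop)           # [1, 3]
print(positions_fast.tolist())  # [1, 3]
```

If an entry of the smaller array were missing from the larger one, the loop version would raise an IndexError, while the searchsorted version would silently return an insertion position, so the loop is the safer default.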

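Generates the random indices for each feature selection size [n_selected_features].  The positions 0 to main_size - 1 are shuffled once and the first fs_size entries are saved, so every series shares the same random order and each smaller subset is a prefix of the larger ones.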

In [78]:
exp = np.arange(0, main_size, 1)
random.shuffle(exp)
for i in range( len( get_data ) ):
    for fs_size in feature_selection_sizes:   
        nCk = ( "FaSTR_files/%s_%03d_%03d_nCk_rand.npy" % (get_data[i][0], main_size, fs_size) )
        np.save( nCk, exp[0:fs_size])
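
A reproducible variant of the random baseline can be sketched with a seeded NumPy generator. The seed value and the dictionary `random_subsets` are assumptions for illustration; the notebook itself shuffles without a seed and writes each subset to a file instead.

```python
import numpy as np

main_size = 100                      # largest selection size in the notebook
sizes = np.arange(5, 101, 5)         # same sizes as feature_selection_sizes

rng = np.random.default_rng(0)       # seed 0 is an arbitrary assumption
order = rng.permutation(main_size)   # shuffled positions 0 .. main_size - 1

# Each subset is a prefix of the same shuffled order, so subsets are nested.
random_subsets = {int(s): order[:s] for s in sizes}

print(len(random_subsets[5]), len(random_subsets[100]))  # 5 100
```

Keeping the subsets nested matches the Chi2 case, where the smaller selections are looked up inside the largest one.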

Closes the connection to the DB.

In [58]:
conn.close()
print ("DB Response: Closed connection to database successfully - Goodbye :(")

DB Response: Closed connection to database successfully - Goodbye :(
