# GSA - Genetic Stability Analyzer

In order to speed up processing time when running classification algorithms, it is often useful to choose only the most "best" genes to use.  There are various algorithms available to choose genes, however here we use Chi2 Select K best.  K is how many genes you wish to use for testing stability.  More genese is usually better, however again in order to speed up processing time we limit the number of genes used.  This program allows you to set a min and max number of genes and an interval.  This will in turn setup numpy arrays with class and the select number of genes for further processing by FASTR and FASTrand.

### Libraries
Must be pre-installed.  Recommended to use virtual environment.

In [1]:
import numpy as np
from random import sample, choice
from os import path, getcwd, makedirs
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import StratifiedKFold
from collections import Counter
from sklearn.metrics import mean_squared_error
from scipy.sparse.linalg import lsqr
from math import sqrt, floor
from sklearn import svm
import multiprocessing as mp
from enum import Enum
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn import svm

  from numpy.core.umath_tests import inner1d


## Methods and Classes
This section defines all classes and methods used.  The corresponding methods and classes for each .py file found in `main/Python` are described here.

### NBC.py

The NBC.py file contains two classes: Model and NBC.  The Network-based supervised classification technique (NBC) is described in [Ahmet Ay et al](http://journals.sagepub.com/doi/abs/10.4137/CIN.S14025).  Briefly, for each gene, a model is constructed from the gene's neighbors.  The model is a function of the form:

gene_expressipn @ g = neigh1X + neigh2X +....neighNx + C

The set of expressions for all genes creates the expressio nmodel.

In [2]:
class Model:
    """Describes the model class."""

    def __init__(self, samples, eps, class_label):
        """Initialize the model class.

        Args:
            samples: training samples of size [samples, genes].
            eps: epsilon value for correlation cutoff.
            class_label: classification label.
        """

        self.class_label = class_label
        self.samples = np.array(samples)
        self.eps = eps

        # columns are variables, rows are samples
        self.correlation = np.corrcoef(self.samples, y=None, rowvar=False)

        # note that the mask is actually the graph
        self.mask = (np.absolute(self.correlation) > self.eps)

        # the coefficients associated with the system of equation: Ax=b,
        # where A is an equation list created from the neighbors of gene
        # n and b is the value of gene n.
        self.geneFuncMasks = []  # these are the coefficients in Ax=b
        for gene in range(len(self.correlation)):
            currMask = self.mask[gene]
            setOfNeighbors = []
            solutions = []
            for sample in self.samples:
                neighbors = [sample[neighbor] if (currMask[neighbor] and (gene != neighbor))
                             else 0 for neighbor in range(len(currMask))]
                neighbors.append(1)
                setOfNeighbors.append(neighbors)
                solutions.append(sample[gene])
            coeff = self.solver(setOfNeighbors, solutions, 2)
            self.geneFuncMasks.append(coeff.tolist())

        self.coefficients = np.array(self.geneFuncMasks)

    def solver(self, neighbors, sols, choice):
        # Use lsqr to solve Ax=b
        A = np.array(neighbors)
        b = np.array(sols)
        x = lsqr(A, b)[0]
        return x

    def expression(self, sample):
        """Given a sample, return the hypothetical expression.

        Args:
            sample: the sample whose hypothetical expression we wish to
            calculate
        Returns:
            expr: A list with the expression values of size number of genes.
        """
        expression = []
        for gene in range(len(self.coefficients)):
            geneVal = 0
            for neighbor in range(len(self.mask)-1):
                geneVal += self.coefficients[gene][neighbor] * sample[neighbor]
            geneVal += self.coefficients[gene][len(self.mask)]
            expression.append(geneVal)
        return np.array(expression)

    def label (self):
        """Return the classification label of this model."""
        return self.class_label


class NetworkBasedClassifier:
    """Describes the NBClassifier class."""

    def __init__(self, epsilon):
        """Initialize a NBF classifier.

        Args:
            eps: epsilon value
        """
        self.models = []
        self.epsilon = epsilon

    def fit(self, X, y):
        """Fit the data with classes to create class models.

        Fits the data [num_samples, num_genes] with classifications
        [num_samples] to the model.  Creates as many models as classes.

        Args:
            X: the data we wish to train the classifier on
            y: the classifications associated with the samples
        """
        y = np.array(y)
        X = np.array(X)
        for key in Counter(y):
            a_class = np.where(y == key)
            self.models.append(Model([X[i] for i in a_class[0]], self.epsilon, key))

            
    def score(self, X, y):
        """Scores the classifications of a given set of samples (X) according to their
        actual clsssifications (y).

        Must fit the classifier before this method is called.

        Args:
            samples: the samples we wish to predict classification for.

        Returns:
            accuracy: the classification accuracy
        """
        y = np.array(y)
        X = np.array(X)
        predicted = self.predict(X)        
        correct = np.asarray(predicted == y)
        return np.sum(correct)/correct.shape[0]        
        
        
    def predict(self, X):
        """Predict the classification of a sample.

        Must fit the classifier before this method is called.

        Args:
            samples: the samples we wish to predict classification for.

        Returns:
            classifications: the classifications of the samples.
        """
        classifications = []
        for sample in X:
            RMSEs = []
            for model in self.models:
                rmse = sqrt( mean_squared_error(sample, model.expression(sample)))
                RMSEs.append(rmse)
            min_index = RMSEs.index(min(RMSEs))
            label = self.models[min_index].label()
            classifications.append(label)
        return np.array(classifications)

### Alter.py
The Alter.py file contains all methods and helper methods used to alter expressions.  There are currently four methods to alter expressions:

1) Greedy - uses a greed strategy to select the top k genes that will produce the largest bad accuracy
2) All - alters all genes by some percent amount.
3) Subset - alters a subset of the genes selected via chi2 value
4) RandSubset - alters a subset of the genes selected randomly


In [3]:
class AlterStrategy:
    """Describes the alteration strategy used."""

    # Alter strategies included:
    ALL = 0
    SUB = 1
    RANDSUB = 2
    GREEDY = 3
    
    def __init__(self,
                 _type):
        self._type = _type
     
    
    def getType(self):
        return self._type
    
    def getName(self):
        return self._name
    
    def Accuracy(estimator, X, y, chosen, idx_to_change):
        # TODO: run multiple times and return avg accuracy
        result = []
        for x in X:
            alt = np.copy(x)
            # alter prev chosen
            for i in chosen:
                #alt[i] = choice([0, alt[i]*2]) 
                alt[i] = 0
            # alter new gene
            #alt[idx_to_change] = choice([0, alt[idx_to_change]*2]) 
            alt[idx_to_change] = 0
            result.append(alt)
        result = np.array(result)
        return estimator.score(result, y)

    '''Static method'''
    def RankGreedy(estimator, X, y):
        fileName = '{}greedyRank.npy'.format(estimator.getName())
        if (path.exists(path.join(gsa_path, fileName))):
            return
        print("GREEDY RANK INITIALIZER - THIS CAN TAKE A WHILE.")
        estimator.fit(X,y)
        notChosen = list(range(X.shape[1]))
        chosen = []
        for i in range(X.shape[1]):
            pool = mp.Pool(processes = mp.cpu_count())
            accuracy = pool.starmap(AlterStrategy.Accuracy, [(estimator, X, y, chosen, idx) for idx in notChosen])
            a = [x for _,x in sorted(zip(accuracy, notChosen))]
            chosen.append(a[0])
            print(notChosen)
            print(accuracy)
            notChosen.remove(a[0])  
            print(chosen)
        np.save(path.join(gsa_path,fileName), chosen)
        
    def Subset(percent, X):
        """
        B/c genese are already ordered by chi2 rank we can choose top k to alter.
        """
        result = []
        idx = floor(X[0].size * percent)
        if idx <= 0:
            return X
        else:
            for x in X:
                alt = np.copy(x)
                for i in range(idx):
                    alt[i] = choice([0, alt[i]*2])
                result.append(alt)
            return np.array(result)

    def RandSubset(percent,X):
        result = []
        k = floor(X[0].size * percent)
        if k <= 0:
            return X
        else:
            indices = sample(range(X[0].size),k)
            for x in X:
                alt = np.copy(x)
                for i in indices:
                    alt[i] = choice([0, alt[i]*2])
                result.append(alt)
            return np.array(result)
        
    def All(percent, X):
        result = []
        for x in X:
            alt = []
            for gene in x:
                _offset = gene * percent
                _low = gene - _offset
                _high = gene + _offset
                alt.append(choice([_low, _high]))
            result.append(alt)
        return np.array(result)        
        
    def Greedy(percent, X, estimator):
        result = []
        k = floor(X[0].size * percent)
        if k <= 0:
            return X        
        else:
            file = path.join(gsa_path,'{}greedyRank.npy'.format(estimator.getName()))
            indices = np.load(file)
            for x in X:
                alt = np.copy(x)
                for i in indices:
                    alt[i] = choice([0, alt[i]*2])
                result.append(alt)
            return np.array(result)        
        
    def Alter(self, percent, X, estimator):
        if self._type == AlterStrategy.ALL:
            return AlterStrategy.All(percent, X)
        elif self._type == AlterStrategy.SUB:
            return AlterStrategy.Subset(percent, X)
        elif self._type == AlterStrategy.RANDSUB:
            return AlterStrategy.RandSubset(percent, X)
        elif self._type == AlterStrategy.GREEDY:
            return AlterStrategy.Greedy(percent, X, estimator)


#### Helpers

The strategy is to choose greedily, however because of the random choice from the Accuracy method strategy, different genese may be chose, thus affecting the order in which the rank returns.  In general, the top gene should be the top gene in all cases, but as it progresses the genese can switch.

### Common.py
These are files that are helpers for the main program.  They consist of two methods:
1) A method to order the K best ranked genes.  This is used in conjunction with Atler.py Subset method.
2) A method to cross validate.  This is different than normal cross validation in that the estimator is fit to correct data while the score is derived from altered data.

In [4]:
def SelectKBestRanked(k, X, y):    
    b = SelectKBest(chi2, k).fit(X, y)
    a = b.get_support(indices = True)
    a = [x for _,x in sorted(zip(b.scores_[a],a),reverse=True)]
    return np.array(a)

In [5]:
def crossValidate(estimator, X, y, alterStrat, percent=0, cv=10):
    scores = []
    skf = StratifiedKFold(cv)
    for train_index, test_index in skf.split(X, y):
        estimator.fit(X[train_index], y[train_index]) 
        accuracy = estimator.score(
            alterStrat.Alter(percent, X[test_index],estimator),
            y[test_index])
        scores.append(accuracy)
    return np.array(scores)

### Estimatory.py

In [6]:
class Estimator:
    """Describes the classifier chooser."""

    # Classifiers included in the chooser
    NBC = 0
    KNN = 1
    SVM = 2
    RF = 3
    NB = 4
    
    def __init__(self,
                 _type,
                 epsilon=0.8,
                 neighbors=1):
        """Initialize a NBF classifier.

        Args:
            eps: epsilon value
        """
        self._type = _type

        # create the classifier
        if _type == Estimator.NBC:
            self._classifier = NetworkBasedClassifier(epsilon)
            self._name = "NBC"
        elif (_type == Estimator.kNN):
            self._classifier = KNeighborsClassifier(neighbors)
            self._name = "KNN"
        elif (_type == Estimator.SVM):
            self._classifier = svm.LinearSVC()
            self._name = "SVM"
        elif (_type == Estimator.RF):
            self._classifier = RandomForestClassifier()
            self._name = "RF"
        elif (_type == Estimator.NB):
            self._classifier = GaussianNB()
            self._name = "NB"
    
    def getType(self):
        return self._type
    
    def getName(self):
        return self._name
    
    def fit(self, X, y):
        self._classifier.fit(X,y)
        
    def score(self, X, y):
        return self._classifier.score(X, y)
        
    def predict(self, X):
        return self._classifier.predict(X)

## START MAIN PROGRAM

###### Enter the series and feature_size to use
Must be all upper case. e.g. `"GSE27562"`

In [21]:
series = "GSE19804"
feature_size = 50
alterStrat = AlterStrategy(AlterStrategy.GREEDY)
estimator = Estimator(Estimator.kNN)

### Get/Create Directories
Assumes this notebook is in `GenClass-Stability/main/notebooks/`

In [22]:
notebook_dir = getcwd();
main_dir = path.dirname(path.dirname(notebook_dir))
load_path = path.join(main_dir, "GSE", series)
gsa_path = path.join(main_dir,"GSA", series, str(feature_size))
if not path.exists(gsa_path):
    makedirs(gsa_path)

### Import Classes and Expressions
Load original data. Assumes SIT and custome GSE script have been run to import data.

In [23]:
classes =np.loadtxt(path.join(load_path, "classes.txt"), dtype=np.str, delimiter="\t")
exprs = np.loadtxt(path.join(load_path, "exprs.txt"), delimiter="\t")

Select K best genes for analysis.

In [24]:
a = SelectKBestRanked(feature_size, exprs, classes)

Get selected genes from expressions.

In [25]:
exprs = exprs[:, a]

Save the selected expression data for potential later use.

In [26]:
np.save(path.join(gsa_path,"exprs.npy"), exprs)
np.save(path.join(gsa_path,"classses.npy"), classes)

### Stability Test

In [27]:
if(alterStrat.getType() == AlterStrategy.GREEDY):
    AlterStrategy.RankGreedy(estimator, exprs, classes)

GREEDY RANK INITIALIZER - THIS CAN TAKE A WHILE.
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
[1.0, 1.0, 1.0, 1.0, 1.0, 0.9833333333333333, 0.9833333333333333, 0.9916666666666667, 1.0, 1.0, 0.9916666666666667, 1.0, 0.9916666666666667, 0.9916666666666667, 1.0, 1.0, 0.9916666666666667, 1.0, 1.0, 1.0, 1.0, 0.9916666666666667, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.975, 1.0, 1.0, 1.0, 1.0, 0.9916666666666667, 1.0, 0.9833333333333333, 1.0, 1.0, 1.0, 1.0, 0.9916666666666667, 1.0, 1.0, 0.9916666666666667, 1.0, 1.0, 1.0, 0.9916666666666667, 1.0]
[29]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
[0.9666666666666667, 0.9583333333333334, 0.975, 0.9833333333333333, 0.9666666666666667, 0.975, 0.9833333333333333, 0.983

[2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 15, 16, 17, 18, 20, 21, 23, 24, 25, 26, 27, 28, 30, 31, 33, 35, 36, 37, 38, 39, 40, 41, 43, 44, 45, 46, 47, 48, 49]
[0.2916666666666667, 0.4083333333333333, 0.23333333333333334, 0.35, 0.5166666666666667, 0.3, 0.35, 0.5166666666666667, 0.25, 0.35833333333333334, 0.5, 0.2916666666666667, 0.49166666666666664, 0.3, 0.43333333333333335, 0.44166666666666665, 0.2916666666666667, 0.2833333333333333, 0.25, 0.4, 0.4083333333333333, 0.25833333333333336, 0.24166666666666667, 0.2833333333333333, 0.475, 0.3333333333333333, 0.26666666666666666, 0.375, 0.43333333333333335, 0.26666666666666666, 0.4583333333333333, 0.2916666666666667, 0.43333333333333335, 0.325, 0.48333333333333334, 0.49166666666666664, 0.225, 0.325, 0.4583333333333333, 0.3]
[29, 34, 32, 19, 0, 1, 9, 42, 14, 22, 46]
[2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 15, 16, 17, 18, 20, 21, 23, 24, 25, 26, 27, 28, 30, 31, 33, 35, 36, 37, 38, 39, 40, 41, 43, 44, 45, 47, 48, 49]
[0.3, 0.26666666666666666, 0.2, 0.25

[2, 3, 6, 7, 8, 10, 12, 13, 15, 16, 18, 20, 21, 23, 25, 26, 28, 30, 31, 33, 36, 37, 38, 43, 44, 45, 47, 48, 49]
[0.20833333333333334, 0.39166666666666666, 0.3333333333333333, 0.2833333333333333, 0.36666666666666664, 0.5, 0.15833333333333333, 0.425, 0.15833333333333333, 0.5166666666666667, 0.21666666666666667, 0.4166666666666667, 0.2916666666666667, 0.19166666666666668, 0.3416666666666667, 0.19166666666666668, 0.18333333333333332, 0.16666666666666666, 0.35833333333333334, 0.2916666666666667, 0.3, 0.24166666666666667, 0.20833333333333334, 0.3416666666666667, 0.375, 0.2833333333333333, 0.4, 0.3, 0.16666666666666666]
[29, 34, 32, 19, 0, 1, 9, 42, 14, 22, 46, 27, 4, 39, 17, 24, 35, 11, 40, 41, 5, 12]
[2, 3, 6, 7, 8, 10, 13, 15, 16, 18, 20, 21, 23, 25, 26, 28, 30, 31, 33, 36, 37, 38, 43, 44, 45, 47, 48, 49]
[0.21666666666666667, 0.4583333333333333, 0.3416666666666667, 0.30833333333333335, 0.30833333333333335, 0.4083333333333333, 0.38333333333333336, 0.16666666666666666, 0.5083333333333333, 0

[2, 3, 6, 8, 16, 20, 23, 25, 26, 28, 31, 36, 38, 47]
[0.15, 0.49166666666666664, 0.15, 0.10833333333333334, 0.13333333333333333, 0.21666666666666667, 0.5, 0.09166666666666666, 0.13333333333333333, 0.15833333333333333, 0.15, 0.5, 0.125, 0.13333333333333333]
[29, 34, 32, 19, 0, 1, 9, 42, 14, 22, 46, 27, 4, 39, 17, 24, 35, 11, 40, 41, 5, 12, 30, 13, 45, 18, 49, 48, 7, 10, 37, 33, 43, 44, 15, 21, 25]
[2, 3, 6, 8, 16, 20, 23, 26, 28, 31, 36, 38, 47]
[0.13333333333333333, 0.44166666666666665, 0.125, 0.08333333333333333, 0.125, 0.15, 0.45, 0.11666666666666667, 0.18333333333333332, 0.125, 0.48333333333333334, 0.08333333333333333, 0.11666666666666667]
[29, 34, 32, 19, 0, 1, 9, 42, 14, 22, 46, 27, 4, 39, 17, 24, 35, 11, 40, 41, 5, 12, 30, 13, 45, 18, 49, 48, 7, 10, 37, 33, 43, 44, 15, 21, 25, 8]
[2, 3, 6, 16, 20, 23, 26, 28, 31, 36, 38, 47]
[0.14166666666666666, 0.4583333333333333, 0.15, 0.14166666666666666, 0.15833333333333333, 0.49166666666666664, 0.11666666666666667, 0.15833333333333333, 0.14

In [28]:

scores = crossValidate(estimator, exprs, classes, alterStrat, 0.05, 10)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.83 (+/- 0.29)


In [29]:
tmp = np.load(path.join(gsa_path,"NBCgreedyRank_tmp2.npy"))

In [30]:
tmp

array([41, 39, 25, 36, 24,  6,  9, 43, 26,  0, 49,  7,  4, 30, 21, 45,  8,
       18, 10, 11, 20, 47, 46, 16, 33, 40,  3, 14, 48, 35,  2, 13, 15, 34,
       19, 23,  1, 28, 29, 42, 27, 32,  5, 38, 12, 44, 22, 17, 31, 37])

In [31]:
tmp = np.load(path.join(gsa_path,"NBCgreedyRank_tmp.npy"))

In [32]:
tmp

array([41, 43,  7, 22, 39,  6, 21, 31, 26,  2, 30, 44, 18, 47,  4, 19, 14,
       25, 20, 16,  5,  9,  3,  8, 13, 23, 45, 32, 48, 27,  1, 34, 24,  0,
       42, 15, 33, 36, 10, 37, 40, 12, 28, 29, 49, 11, 46, 35, 17, 38])

In [33]:
tmp = np.load(path.join(gsa_path,"NBCgreedyRank.npy"))

In [None]:
tmp