# A comparison of known cancer causing genes with those identified by a classifier trained on gene expression data.

- BME 230A class project winter 2019
- Andrew E. Davidson
- [aedavids@ucsc.edu](mailto:aedavids@ucsc.edu?subject=SimpleModel.ipynb)

## Abstract

A series of classifiers were trained on the [UCSC Xena Toil re-compute dataset](https://xenabrowser.net/datapages/?host=https://toil.xenahubs.net). 
It was known from previous work that these classifiers should have a high accuracy rate.
It was also known that biologists have previously identified the gene sets associated with 
various forms of cancer. The goal of this project was to verify that the classifiers are consistent with known biology.

## Overview:
Typically, data science projects develop models de novo. You start by mining an unknown data set. Your objective is to find structure in the data and create a data product, that is to say, some sort of predictive model. It is important to understand the difference between a data product and a model. Models are of limited use. Often they help you gain a better understanding of the relationships inherent in your data. A model is often considered good based on its accuracy alone. Models are steps on the path to developing true data products. 

By contrast, data products are models that can be deployed at scale. Rarely is accuracy alone sufficient to decide if a model is deployable. In addition to accuracy, data products most often must be explainable and generalize well.

For data products to be deployable we must have confidence that our model generalizes to the true target population. In the case of the Xena data set, the model must generalize not only to patients who are sick and may or may not have cancer, but also to healthy individuals. We also need to account for demographic bias in the training data set. Data products related to human behavior, for example recommender systems or natural language tasks, must have mechanisms to identify population drift and processes for retraining.

Explainability is often overlooked when evaluating the deployability of a data product. Sometimes it is not required. For example, consider a bad movie recommender. The viewer is not going to be harmed in any way. For most data products, however, the cost of false positives or false negatives is high. For example, consider a tumor/normal classifier or a model used to set insurance premiums. The new EU General Data Protection Regulation requires explainability for models with potentially high misclassification costs. It also seems unlikely that the FDA will approve models that are not explainable. 

Lack of explainability often limits the use of neural networks. Fortunately, neural network models based on the Xena data set may be explainable. One approach for gaining insight into the workings of a trained model is to make predictions with hand-crafted examples and explore how these examples activate the various layers of the neural network. [Andrej Karpathy](https://cs.stanford.edu/people/karpathy/) used a similar approach to identify what kinds of images cause the filters of a convolutional neural network to activate.

## Model Evaluations and results

### Tumor/Normal classifier using Logistic Regression

The output layer of our model implements the sigmoid function shown in Figure 1. We want to identify genes that push the value of z towards the far right or far left of Figure 1.

Based on equation (1) we can identify these genes by calculating their overall contribution to the value of z. The contribution is calculated as 
the mean value for each gene multiplied by its corresponding weight from our trained model.

Neural Network Activation Unit Equation 

$$ \text{eq (1)} \quad z = W^{T} x $$

$$ \text{eq (2)} \quad \text{sigmoid}(z) = \frac{1}{1 + e^{-z}} $$

![aedwip](images/simpleModelEvaluationFig1.png "images/simpleModelEvaluationFig1.png")
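Equations (1) and (2) can be sketched in a few lines of Python. This is only an illustration, not the code from the notebooks: the weights and expression values are made-up numbers, and the helper names `linear_input` and `sigmoid` are hypothetical.

```python
import math

def linear_input(weights, x):
    """eq (1): z = W^T x for a single example."""
    return sum(w * xi for w, xi in zip(weights, x))

def sigmoid(z):
    """eq (2): squash z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Made-up weights and expression values for three genes (illustration only).
weights = [2.0, -1.0, 0.0]
x = [3.0, 1.0, 5.0]

z = linear_input(weights, x)   # 2*3 + (-1)*1 + 0*5 = 5
p = sigmoid(z)                 # close to 1, i.e. the far right of the sigmoid curve
print(z, p)
```

In this toy example the first gene pushes z to the right, the second pushes it to the left, and the third has a zero weight, so the model ignores it no matter how strongly it is expressed.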

Figure 2. Gene contribution distribution. As we might expect, most of the 58,581 genes are ignored by our model.

We used a Z score threshold of 3 to identify genes that make an unusually large contribution to the value input into the sigmoid function. We use the term 'promoter' for genes that increase the sigmoid input value and the term 'inhibitor' for genes that decrease this input value.

![images/simpleModelEvaluationFig2.png](images/simpleModelEvaluationFig2.png "images/simpleModelEvaluationFig2.png")
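The contribution calculation and the Z score threshold can be sketched as follows. This is a minimal sketch on synthetic data, not the notebook code: the function name `find_outlier_genes`, the gene names, and the weights are all hypothetical, and the mean is taken over whatever rows are passed in (for example only the normal or only the tumor samples).

```python
import numpy as np

def find_outlier_genes(X, weights, gene_names, threshold=3.0):
    """Return (promoters, inhibitors) whose contribution Z score exceeds the threshold.

    X          : (n_samples, n_genes) expression matrix
    weights    : (n_genes,) weights from a trained logistic regression model
    gene_names : sequence of n_genes names
    """
    # Contribution of each gene to z: mean expression times its weight.
    contribution = X.mean(axis=0) * weights
    # Standardize so the threshold is expressed in standard deviations.
    z_scores = (contribution - contribution.mean()) / contribution.std()
    names = np.asarray(gene_names)
    promoters = names[z_scores > threshold].tolist()
    inhibitors = names[z_scores < -threshold].tolist()
    return promoters, inhibitors

# Synthetic data: 200 genes, only two of which carry a large weight.
rng = np.random.default_rng(0)
n_samples, n_genes = 50, 200
X = rng.uniform(1.0, 2.0, size=(n_samples, n_genes))
weights = rng.normal(0.0, 0.01, size=n_genes)
weights[10] = 5.0    # strongly increases z
weights[20] = -5.0   # strongly decreases z
gene_names = [f"gene_{i}" for i in range(n_genes)]

promoters, inhibitors = find_outlier_genes(X, weights, gene_names)
print(promoters, inhibitors)
```

With these synthetic weights only `gene_10` is reported as a promoter and only `gene_20` as an inhibitor. Every other gene has a contribution near zero and is effectively ignored.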

Below is a small sample of the genes found in the trained model. Note that they are not sorted by their contribution values.
```
number normal promoters:211
normal promoter[0:3]:Index(['ABCA10', 'ABCA6', 'ABCA8']

number normal inhibitor:262
normal inhibitor[0:3]:Index([['AC000089.3'], ['AC005255.3'], ['AC006386.1']

number tumor promoters:270
['AC000089.3', 'AC005255.3', 'AC006386.1']

number tumor inhibitor:218
['ABCA10'], ['ABCA6'], ['ABCA8']
```

<span style="color:red">TODO:</span> confirm whether these genes have a known relation to cancer

references:
* [simple logistic regression evaluation notebook](./simpleModelEvaluation.ipynb)
* [simple logistic regression notebook](./simpleModel.ipynb)

### Disease Type Classifier Evaluation

Method:

The input size of our model is 58,581. Each value is a gene expression level. For each feature we make a prediction using a one-hot example, that is, an input in which only that gene's value is non-zero. We then group the genes into sets based on the predicted disease type. We would expect the gene groups to correspond to known cancer-related genes. Genes identified by the model that are not part of known pathways should be further explored.
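The probing method can be sketched as below. This is a sketch under stated assumptions, not the notebook implementation: `group_genes_by_prediction` is a hypothetical helper, `predict` stands for any callable that maps a batch of inputs to per-class scores (with a Keras model this would presumably be `model.predict`), and the tiny linear scorer and gene names are placeholders.

```python
from collections import defaultdict

import numpy as np

def group_genes_by_prediction(predict, gene_names, class_names, batch_size=1024):
    """Probe a classifier with one-hot inputs and group genes by predicted class.

    predict : callable mapping an (n, n_genes) array to (n, n_classes) scores
    """
    n_genes = len(gene_names)
    groups = defaultdict(list)
    # Work in batches: a full 58,581 x 58,581 identity matrix needs several gigabytes.
    for start in range(0, n_genes, batch_size):
        stop = min(start + batch_size, n_genes)
        batch = np.zeros((stop - start, n_genes), dtype=np.float32)
        # Row i has a single 1 in the column of gene (start + i).
        batch[np.arange(stop - start), np.arange(start, stop)] = 1.0
        predicted = np.argmax(predict(batch), axis=1)
        for offset, class_idx in enumerate(predicted):
            groups[class_names[class_idx]].append(gene_names[start + offset])
    return dict(groups)

# Hypothetical stand-in for the trained network: a fixed linear scorer.
W = np.array([[2.0, 0.0],    # gene_a favors class 0
              [0.0, 3.0],    # gene_b favors class 1
              [1.0, 0.5],    # gene_c favors class 0
              [1.5, 1.0]])   # gene_d favors class 0

def predict(batch):
    return batch @ W

groups = group_genes_by_prediction(
    predict,
    ["gene_a", "gene_b", "gene_c", "gene_d"],
    ["disease_0", "disease_1"],
    batch_size=3,
)
# Disease type with the fewest genes assigned to it.
smallest = min(groups, key=lambda name: len(groups[name]))
print(groups, smallest)
```

Batching keeps memory use bounded, which matters at 58,581 features. The length of each list in `groups` is the quantity a per-disease histogram of gene counts would plot.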

Figure 3 is a histogram showing the count of genes that were individually classified as each disease type. 

![images/diseaseTypeClassifierEvalFig1.png](images/diseaseTypeClassifierEvalFig1.png "images/diseaseTypeClassifierEvalFig1.png")

To make the analysis easier, we select the disease type with the fewest genes.

```
disease type with smallest identified gene set:['Pancreatic Adenocarcinoma']
number of genes in set:44
       ['AC079235.1', 'AC087499.9', 'AC090311.1', 'AC231645.1',
       'AL008708.1', 'AL354931.1', 'AL603650.4', 'ARAF', 'ATP6V1D',
       'C14orf93', 'CLK3', 'CNPY3', 'CTC-258N23.3', 'CTD-2132H18.3',
       'CYB5D2', 'FADS2P1', 'FBXW4', 'GOLGB1', 'HMGN1P5', 'KB-1582A10.2',
       'LINC00616', 'LINC01035', 'MED10', 'MON1A', 'NEK3', 'PIN1',
       'RFPL3-AS1_1', 'RNA5SP143', 'RNA5SP366', 'RNA5SP409', 'RNA5SP523',
       'RNA5SP55', 'RNU6-1194P', 'RNU6-204P', 'RNU6-498P',
       'RP11-626I20.3', 'RP11-74J13.9', 'SCARNA4', 'SF3B5', 'SGSM3',
       'SLC35B3', 'SNORA70D', 'STX18', 'SUGT1']
```

<span style="color:red">TODO: figure out how to mine biological pathway data.</span> We need to figure out how to map the gene sets we identify back to the known genes associated with each cancer type.
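One possible starting point for that mapping is a plain set comparison. The sketch below is an assumption about how the comparison could be done, not something this project has implemented: `overlap_report` is a hypothetical helper, the gene names are placeholders, and the hypergeometric test is a commonly used over-representation check that is swapped in here for illustration.

```python
from math import comb

def overlap_report(identified, known, n_background):
    """Compare a model-derived gene set with a reference gene set.

    Returns the shared genes, the Jaccard index, and a hypergeometric
    p-value for seeing at least this many shared genes by chance.
    """
    identified, known = set(identified), set(known)
    shared = identified & known
    jaccard = len(shared) / len(identified | known)
    k, n, K, N = len(shared), len(identified), len(known), n_background
    # P(X >= k) when drawing n genes without replacement from N, K of which are known.
    p_value = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1)
    ) / comb(N, n)
    return sorted(shared), jaccard, p_value

# Placeholder names only; a real reference set would come from a curated pathway resource.
identified = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
known = ["GENE_B", "GENE_D", "GENE_X"]
shared, jaccard, p_value = overlap_report(identified, known, n_background=100)
print(shared, jaccard, p_value)
```

For this project `n_background` would presumably be the 58,581 input genes, and both sets would need to use the same gene naming scheme before they are compared.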

references:
- [evaluation notebook diseaseTypeClassifierEval.ipynb](diseaseTypeClassifierEval.ipynb)
- [data exploration, model creation notebook diseaseTypeClassifier.ipynb](diseaseTypeClassifier.ipynb)

### Disease Type Classifier using principal component analysis Evaluation
<span style="color:red">TODO: double check our analysis for bugs</span>

PCA and other dimensionality reduction techniques are often used to reduce training time. Given the performance of our model in a high dimensional space, and the observation that our high dimensional models do not use the majority of gene expression features, my assumption was that we would get similar results in a lower dimensional space.

I was surprised to discover that, after fitting PCA to the training set, transforming the full-feature test examples was very slow. This transformation should be done once as a preprocessing step, with the results cached to disk.

We ran PCA on our training set, keeping enough components to account for 95% of the variance. This reduces the size of each example from 58,581 features to 5,895. This seems plausible given that Figure 2 showed that most of the features in our full-feature models are ignored.
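The fit, transform, and cache steps can be sketched with NumPy as below. This is an illustrative sketch on small synthetic data, not the notebook code: `fit_pca`, `transform`, and the cache file name are hypothetical. scikit-learn's `PCA` class accepts a fractional `n_components` for the same purpose, and which implementation the notebooks use is not shown here.

```python
import os
import tempfile

import numpy as np

def fit_pca(X_train, variance_to_keep=0.95):
    """Fit PCA with an SVD and keep enough components to explain the requested variance."""
    mean = X_train.mean(axis=0)
    # Rows of Vt are the principal directions; S holds the singular values.
    _, S, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)
    # Smallest number of components whose cumulative variance reaches the target.
    n_components = int(np.searchsorted(np.cumsum(explained), variance_to_keep) + 1)
    return mean, Vt[:n_components]

def transform(X, mean, components):
    """Project examples onto the retained principal components."""
    return (X - mean) @ components.T

# Synthetic data with 3 underlying factors spread over 20 features.
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 3))
mixing = rng.normal(size=(3, 20))
X = latent @ mixing + 0.01 * rng.normal(size=(100, 20))
X_train, X_test = X[:80], X[80:]

mean, components = fit_pca(X_train)
Z_train = transform(X_train, mean, components)
Z_test = transform(X_test, mean, components)
retained = Z_train.var(axis=0).sum() / X_train.var(axis=0).sum()

# Transform once, then cache the result so later runs can simply reload it.
with tempfile.TemporaryDirectory() as cache_dir:
    cache_path = os.path.join(cache_dir, "X_test_pca.npy")
    np.save(cache_path, Z_test)
    Z_cached = np.load(cache_path)

print(components.shape, round(float(retained), 4))
```

The mean and components are learned from the training set only and then reused unchanged for the test set, so the two sets end up in the same lower dimensional space.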

Using the lower dimensional training set we trained a model identical to the one used in our [diseaseTypeClassifier.ipynb notebook](./diseaseTypeClassifier.ipynb). As expected, the parameter count dropped: using the higher dimensional data set we had 2,343,278 trainable parameters, while using the lower dimensional data we have 235,838.
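A quick arithmetic check on these two parameter counts is sketched below. The numbers come from the text above. The interpretation, that a fixed number of weights is attached to each input feature and the rest of the network is unchanged, is an inference from the arithmetic rather than something stated in the notebooks.

```python
high_dim_inputs, high_dim_params = 58_581, 2_343_278
low_dim_inputs, low_dim_params = 5_895, 235_838

# Trainable parameters removed per input feature removed.
per_feature = (high_dim_params - low_dim_params) / (high_dim_inputs - low_dim_inputs)

# Parameters that do not depend on the input size.
remainder_high = high_dim_params - int(per_feature) * high_dim_inputs
remainder_low = low_dim_params - int(per_feature) * low_dim_inputs

print(per_feature, remainder_high, remainder_low)   # 40.0 38 38
```

Both models leave the same remainder, which is what we would expect if only the input layer changed size between the two experiments.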

Surprisingly, there was a big drop in accuracy. Given that we accounted for 95% of the variance, I expected a smaller drop.

High dimension accuracy

```
training accuracy:0.99
    test accuracy:0.96
```
    
Low dimension accuracy

```
training accuracy:0.81
    test accuracy:0.73
```

The same neural architecture in a lower dimensional space suffers from high bias: even the training accuracy falls from 0.99 to 0.81. The high dimension model identified 'Pancreatic Adenocarcinoma' as the disease type with the smallest gene count, at 44, while the low dimension model finds 'Thymoma' with 800 genes.

Notice the large difference between Figure 3 and Figure 4. 

Figure 4. Count of genes that were individually classified as each disease type

![images/dimensionaltyReducedDiseaseTypeClassifierFig1.png](images/dimensionaltyReducedDiseaseTypeClassifierFig1.png "dimensionaltyReducedDiseaseTypeClassifierFig1.png")

## Next step
1. double check our analysis for bugs

2. figure out how to mine biological pathway data. We need to figure out how to map the gene sets we identify back to the known genes associated with each cancer type

3. develop a better model for lower dimensional spaces

4. Instead of using PCA try to learn a lower dimensional embedding

## Reproducibility
All data and Jupyter notebooks used to clean data, explore data, and train and evaluate models are available at [https://github.com/AEDWIP/BME-230a](https://github.com/AEDWIP/BME-230a)

You can view a fully rendered version of each notebook, complete with source code, text, and graphics, by clicking on the `*.ipynb` notebook files on the GitHub website.

## References:
* [UCSC Xena Toil re-compute dataset](https://xenabrowser.net/datapages/?host=https://toil.xenahubs.net)
* [Nature Biotechnology publication: https://doi.org/10.1038/nbt.3772](https://doi.org/10.1038/nbt.3772)
* [rcurrie/tumornormal/ingest notebook](https://github.com/rcurrie/tumornormal/blob/master/ingest.ipynb)
    + used to create a local copy of the tcga_target_gtex.h5 data file from the Xena Toil re-compute dataset
    + converts Ensembl gene ids to Hugo