# A comparison of known cancer-causing genes with those identified by a classifier trained on gene expression data

- BME 230A class project winter 2019
- Andrew E. Davidson
- [aedavids@ucsc.edu](mailto:aedavids@ucsc.edu?subject=SimpleModel.ipynb)

## Abstract

A series of classifiers were trained on the [UCSC Xena Toil re-compute dataset](https://xenabrowser.net/datapages/?host=https://toil.xenahubs.net).
Previous work indicated that these classifiers should achieve high accuracy.
It was also known that biologists have previously identified the gene sets associated with
various forms of cancer. The goal of this project was to verify that the classifiers are consistent with known biology.

## Overview:
Typically, data science projects develop models de novo. You start by mining an unknown data set. Your objective is to find structure in the data and create a data product, that is to say, some sort of predictive model. It is important to understand the difference between a data product and a model. Models are of limited use. Often they help you gain a better understanding of the relationships inherent in your data. A model is often considered good based on its accuracy alone. Models are steps on the path to developing true data products.

By contrast, data products are models that can be deployed at scale. Accuracy alone is rarely sufficient to decide whether a model is deployable. In addition to being accurate, data products must usually be explainable and generalize well.

For data products to be deployable, we must have confidence that our model generalizes to the true target population. In the case of the Xena data set, the model must generalize not only to patients who are sick and may or may not have cancer, but also to healthy individuals. We also need to account for demographic bias in the training data set. Data products related to human behavior, for example recommender systems or natural language tasks, must have mechanisms to identify population drift and processes for retraining.

Explainability is often overlooked when evaluating the deployability of a data product. Sometimes it is not required. For example, consider a bad movie recommender: the viewer is not going to be harmed in any way. For most data products, however, the cost of false positives or false negatives is high. For example, consider a tumor/normal classifier or a model used to set insurance premiums. The new EU General Data Protection Regulation requires explainability for models with potentially high misclassification costs. It also seems unlikely that the FDA will approve models that are not explainable.

Lack of explainability often limits the use of neural networks. Fortunately, neural network models based on the Xena data set may be explainable. One approach for gaining insight into the workings of a trained model is to make predictions with hand-crafted examples and explore how these examples activate the various layers of the neural network. [Andrej Karpathy](https://cs.stanford.edu/people/karpathy/) used a similar approach to identify what kinds of images cause the filters of a convolutional neural network to activate.
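The probing idea described above can be illustrated with a minimal sketch. It is not the project's trained classifier: the network below is a tiny stand-in with random weights, and the gene count, layer sizes, helper names (`forward_with_activations`, `activation_shift`), and the probed gene index are all assumptions made for illustration. It shows only the mechanics: feed a baseline input and a hand-crafted input through the network, record every layer's output, and compare.

```python
import random


def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """One fully connected layer; weights holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def forward_with_activations(x, layers):
    """Run x through the network and keep the output of every layer."""
    activations = []
    current = x
    for weights, biases in layers:
        current = relu(dense(current, weights, biases))
        activations.append(current)
    return activations


def activation_shift(baseline, probe, layers):
    """Per-layer, per-unit change in activation between two inputs."""
    base_acts = forward_with_activations(baseline, layers)
    probe_acts = forward_with_activations(probe, layers)
    return [[p - b for p, b in zip(probe_layer, base_layer)]
            for probe_layer, base_layer in zip(probe_acts, base_acts)]


def make_random_layer(n_in, n_out, rng):
    """Stand-in for trained weights (assumption: random, zero bias)."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
n_genes = 6  # hypothetical; the real expression vectors are far larger
layers = [make_random_layer(n_genes, 4, rng), make_random_layer(4, 2, rng)]

baseline = [0.0] * n_genes  # hand-crafted example: no gene expressed
probe = list(baseline)
probe[2] = 5.0              # hand-crafted example: one gene highly expressed

shift = activation_shift(baseline, probe, layers)
for layer_index, deltas in enumerate(shift):
    most_affected = max(range(len(deltas)), key=lambda i: abs(deltas[i]))
    print(f"layer {layer_index}: unit {most_affected} "
          f"changed by {deltas[most_affected]:+.3f}")
```

With a real trained model, the same comparison could be repeated for each gene of interest to see which units respond to it; how the project's notebooks actually do this is not shown here.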

## Reproducibility
All data and Jupyter notebooks used to clean data, explore data, and train and evaluate models are available at [https://github.com/AEDWIP/BME-230a](https://github.com/AEDWIP/BME-230a)

You can view fully rendered versions of the notebooks, complete with source code, text, and graphics, by clicking on the `*.ipynb` notebook files on the GitHub website.

## References:
* [UCSC Xena Toil re-compute dataset](https://xenabrowser.net/datapages/?host=https://toil.xenahubs.net)
* [Nature Biotechnology publication: https://doi.org/10.1038/nbt.3772](https://doi.org/10.1038/nbt.3772)
* [rcurrie/tumornormal/ingest notebook](https://github.com/rcurrie/tumornormal/blob/master/ingest.ipynb)
    + used to create a local copy of the tcga_target_gtex.h5 data file from the Xena Toil re-compute dataset
    + converts Ensembl gene IDs to HUGO gene symbols