pancancer-evaluation

Note: This repository is currently a work in progress, so some aspects of the code/analysis may not be fully described or documented here.

The goal of this project is to follow up on and generalize previous Greene Lab studies that predict driver mutation status from TCGA gene expression data. See previous repos and associated publications here and here for more detail on past work and on the biological significance and interpretation of mutation prediction from gene expression.

Broad research questions and analysis plan:

Replicate results from BioBombe repo for stratified train/test sets
Set up cross-validation holding out individual cancer types, and compare results to negative control with shuffled labels
Comparison of pan-cancer and single-cancer training sets: when does adding pan-cancer data help? Does it ever hurt?
Learning curve experiments: Is the effect of added data dependent on the number of samples for the cancer type in question, or the label balance for the driver gene in question?
External validation, generalization to other ICGC or pediatric cancer datasets: how/when can we use pan-cancer data to help?
More to come
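As a rough illustration of the second item in the plan above, the sketch below builds cross-validation splits that hold out one cancer type at a time, plus a shuffled-label copy of the labels for the negative control. It is a minimal stdlib-only sketch and not the code used in this repository: the function names, the toy samples, and the choice to shuffle labels across all samples (rather than within each cancer type) are assumptions made for illustration.

```python
import random
from collections import defaultdict


def holdout_cancer_type_splits(cancer_types):
    """Yield (held_out_type, train_idx, test_idx), one split per cancer type."""
    by_type = defaultdict(list)
    for idx, cancer_type in enumerate(cancer_types):
        by_type[cancer_type].append(idx)
    for held_out in sorted(by_type):
        test_idx = by_type[held_out]
        # Train on every sample that does not belong to the held-out type.
        train_idx = sorted(
            i for t, idxs in by_type.items() if t != held_out for i in idxs
        )
        yield held_out, train_idx, test_idx


def shuffle_labels(labels, seed=42):
    """Return a permuted copy of the labels (negative control baseline)."""
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)
    return shuffled


# Toy data (made up for illustration): one cancer type and one
# mutation label per sample.
cancer_types = ["BRCA", "BRCA", "LUAD", "LUAD", "COAD", "COAD"]
labels = [1, 0, 1, 1, 0, 0]

splits = list(holdout_cancer_type_splits(cancer_types))
shuffled = shuffle_labels(labels)
for held_out, train_idx, test_idx in splits:
    print(held_out, "train:", train_idx, "test:", test_idx)
```

A model fit on the true labels would then be compared, split by split, against the same model fit on the shuffled labels. If the two perform similarly on a held-out cancer type, the classifier is probably not picking up real signal for that type.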

The GitHub issues are mostly up to date. They track future ideas and research directions (filter by the "research question" tag), as well as known bugs and limitations of the code and evaluation infrastructure (other tags).

Setup

We recommend using the conda environment specified in the environment.yml file to run these analyses. To build and activate this environment, run:

# conda version 4.5.0
conda env create --file environment.yml

conda activate pancancer-evaluation

To make the relative file paths in pancancer_evaluation/config.py work correctly, you'll also need to install the pancancer_evaluation package in development mode:

pip install -e .

Note that running pip install . (without the -e flag) currently breaks these file paths. This will be fixed eventually, but for now we recommend using the -e (development) flag.
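To see why the install mode can matter, consider one common pattern: a config module that locates the repository by walking up from its own file location. The sketch below is hypothetical and is not taken from pancancer_evaluation/config.py. The helper names, the example paths, and the idea that data lives in a data directory next to the package are all assumptions used only to illustrate the failure mode.

```python
from pathlib import Path


def repo_root_from(config_file):
    """Return the directory one level above the package containing config_file."""
    return Path(config_file).parent.parent


def data_dir_from(config_file):
    """Hypothetical data directory, resolved relative to the config module."""
    return repo_root_from(config_file) / "data"


# Development install: the module is imported from the cloned repo,
# so the relative path lands inside the repo.
editable = data_dir_from(
    "/home/user/pancancer-evaluation/pancancer_evaluation/config.py"
)

# Regular install: the package is copied into site-packages,
# so the same logic points at a directory that is not the repo.
regular = data_dir_from(
    "/home/user/env/lib/python3.8/site-packages/pancancer_evaluation/config.py"
)

print(editable)
print(regular)
```

Under these assumptions, only the development install resolves to a path inside the cloned repository, which matches the recommendation above.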

Running tests

Running the tests requires the pytest module (included in the conda environment described above). Once this module is installed, you can run the tests by executing the command

pytest tests/

from the repo root.

Regenerating test data

If you make changes to the model fitting code, hyperparameters, cross-validation code, etc., you may need to regenerate the model output used for the model regression tests. To do this, you can run the following command from the repo root:

python pancancer_evaluation/scripts/generate_test_data.py --verbose

This will print messages showing which files are being rewritten.
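For readers unfamiliar with model regression tests, the general idea is to store reference output once and have the tests check that the current code still reproduces it. The sketch below shows that idea with the standard library only. It is an assumption-laden illustration, not the logic in generate_test_data.py or in tests/: the JSON file format, the function names, and the tolerance are all hypothetical.

```python
import json
import math
import tempfile
from pathlib import Path


def save_expected(values, path):
    """Write reference model output (a list of floats) to a JSON file."""
    Path(path).write_text(json.dumps(values))


def matches_expected(values, path, rel_tol=1e-6):
    """Compare new model output to the stored reference, element by element."""
    expected = json.loads(Path(path).read_text())
    return len(values) == len(expected) and all(
        math.isclose(v, e, rel_tol=rel_tol) for v, e in zip(values, expected)
    )


with tempfile.TemporaryDirectory() as tmp_dir:
    reference_file = Path(tmp_dir) / "expected_output.json"
    # "Regenerate" step: store the output of the current model code.
    save_expected([0.12, 0.87, 0.45], reference_file)
    # Test step: identical output passes, changed output fails.
    unchanged = matches_expected([0.12, 0.87, 0.45], reference_file)
    changed = matches_expected([0.12, 0.80, 0.45], reference_file)

print(unchanged, changed)
```

In this setup, an intentional change to the model code makes the comparison fail until the reference file is regenerated, which is why the regeneration step is needed after such changes.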
