Classification of mutation and clinical data to predict metastasis stage

StarBrand/rf-tml

How to replicate the experiment

[Figure: workflow]

Requirements

The code runs on Python >= 3.7. Because it uses type annotations, some code may raise exceptions on Python <= 3.6.
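The version requirement above can be checked before running any script. The following is a minimal sketch, not part of the repository: the function name `is_supported` and the constant `MIN_VERSION` are hypothetical names chosen for illustration.

```python
import sys

# Minimum version stated in this README (hypothetical constant name).
MIN_VERSION = (3, 7)


def is_supported(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True when the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    if not is_supported():
        found = "%d.%d" % tuple(sys.version_info[:2])
        raise SystemExit("Python >= 3.7 is required, found " + found)
    print("Python version OK")
```

Comparing tuples instead of strings avoids the pitfall where "3.10" sorts before "3.7".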

Creating a virtual environment is recommended as good practice. A quick way to do it:

By using conda

conda create -n <name_of_venv> python=3.7 # or any python>=3.7

# Activate
conda activate <name_of_venv>

By using Python

Note: The Python version you want in the virtual environment must already be installed.

python -m venv <name_of_venv>
# If Python 2 is also installed, call python3 explicitly
python3 -m venv <name_of_venv>

# Activate
# Windows
<name_of_venv>\Scripts\activate.bat
# Unix or MacOS
source <name_of_venv>/bin/activate

Install libraries

Using pip:

pip install -r requirements.txt

Preprocessing

[Figure: preprocessing]

Note: data_mutations_extended.txt, the file from which the data in mutated_genes.tsv and tml.tsv was extracted, is not included in this repository because of its size (0.98 GB). However, it can be downloaded from cBioPortal. The files used in this study are data_mutation_extended.txt, data_clinical_patient.txt and data_clinical_sample.txt.

The data in the training_data folder was generated from the data saved in raw_data. To regenerate it, execute:

cd pipeline/01-Pre_processing
# Generates data/training_data/mutated_genes/*.tsv
python extract_mutation.py
# Generates data/training_data/clinical_data/clinical_data.tsv, not to be used
python clinical_data.py
# Generates data/training_data/clinical_data/encoded_clinical_data.tsv
python one_hot_encoding_clinical_data.py
# Merges clinical data with mutated genes into data/training_data/merged_data/*.tsv
python merge_data.py
cd ../..
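To illustrate what the encoding and merging steps above do conceptually, here is a small pure-Python sketch. It is not the repository's code: the helper names `one_hot_encode` and `merge_on_key`, the column names (`PATIENT_ID`, `SEX`, `TP53`, `KRAS`) and the toy rows are all assumptions made up for the example.

```python
def one_hot_encode(rows, column):
    """Replace a categorical column with one 0/1 column per observed category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column}_{category}"] = int(row[column] == category)
        encoded.append(new_row)
    return encoded


def merge_on_key(left, right, key):
    """Inner-join two lists of dict rows on a shared key column."""
    right_by_key = {row[key]: row for row in right}
    return [
        {**row, **right_by_key[row[key]]}
        for row in left
        if row[key] in right_by_key
    ]


# Toy rows, invented for illustration only.
clinical = [
    {"PATIENT_ID": "P1", "SEX": "Female"},
    {"PATIENT_ID": "P2", "SEX": "Male"},
]
mutations = [
    {"PATIENT_ID": "P1", "TP53": 1, "KRAS": 0},
    {"PATIENT_ID": "P2", "TP53": 0, "KRAS": 1},
]

merged = merge_on_key(one_hot_encode(clinical, "SEX"), mutations, "PATIENT_ID")
print(merged[0])
```

Each merged row then holds the encoded clinical columns followed by the mutation columns for the same patient.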

Classification

[Figure: classification]

The whole classification is done in one script:

cd pipeline/02-Classification
python classify.py
# >> A long output
cd ../..

This generates the whole metrics folder. The output is saved in classify.out. The specification and parameters are in the config.py file. The seed was fixed arbitrarily to make the experiments reproducible.
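Why a fixed seed makes runs reproducible can be shown with a seeded 5-fold split. This is a sketch under assumptions, not the repository's implementation: `SEED = 42` is a placeholder value rather than the seed actually used, and `k_fold_indices` is a hypothetical helper.

```python
import random

SEED = 42  # placeholder value, not the seed used by the repository


def k_fold_indices(n_samples, k=5, seed=SEED):
    """Shuffle sample indices with a seeded RNG and split them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th index starting at position i.
    return [indices[i::k] for i in range(k)]


folds_a = k_fold_indices(20)
folds_b = k_fold_indices(20)
print(folds_a == folds_b)  # True: the same seed yields the same folds
```

Using a dedicated `random.Random(seed)` instance keeps the split independent of any other code that touches the global random state.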

Analysis of Results

Comparing metrics

[Figure: meta-analysis]

To compare the training data used for the model, we first need to summarize all the metrics obtained. To do this, execute:

cd pipeline/03-Meta_Analysis
python process_results.py
# Generates data/meta_data/metrics.tsv

Two types of charts are generated to compare the different data used to train the model. The first compares the F1-score of the M1 label (the cancer has been found to have spread to distant organs or tissues). The second compares the precision and recall of the same label (M1).
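For reference, the three metrics for the M1 label follow the standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 is their harmonic mean. The sketch below computes them in plain Python. The function name, the negative label `M0` and the toy predictions are assumptions for illustration, not output from the repository.

```python
def precision_recall_f1(y_true, y_pred, positive="M1"):
    """Compute precision, recall and F1 for one label treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Toy labels, invented for illustration only.
y_true = ["M1", "M0", "M1", "M1", "M0", "M0"]
y_pred = ["M1", "M0", "M0", "M1", "M1", "M0"]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy case there are 2 true positives, 1 false positive and 1 false negative, so all three metrics equal 2/3.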

The first type of chart is generated with the script:

python f1_score_plot.py

This generates 4 plots:

PCA | Model Validation: 5-Fold Cross Validation | Model Validation: Test Set
Without PCA | Plot | Plot
With PCA | Plot | Plot

The second type of chart is generated with:

python precision_recall.py

This generates 4 plots:

Model Validation | Link to plot
5-Fold Cross Validation | Plot
Test Set | Plot

A third type of chart is generated with:

python roc_curve.py
cd ../..
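As background for the ROC script, a ROC curve plots the true positive rate against the false positive rate while the decision threshold sweeps over the predicted scores. The sketch below is a generic pure-Python illustration, not the code in roc_curve.py. The helper names and the toy scores are assumptions, and it expects at least one positive and one negative sample.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by lowering the threshold step by step."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample sharing this score.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Toy data, invented for illustration: 1 stands for M1, 0 for the other class.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(y_true, scores)
print(round(auc(curve), 4))
```

For these toy scores, 8 of the 9 positive-negative pairs are ranked correctly, which matches the trapezoidal area of 8/9.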

Getting the most important parameters

[Figure: feature importance]

To extract feature importance:

cd pipeline/04-Feature-Importance
python extract_feature_importance.py
# And to plot
python plot_feature_importance.py
cd ../..

This generates .tsv files in the meta_data folder.
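Once such a file exists, ranking its rows is straightforward. The sketch below assumes a hypothetical layout with `feature` and `importance` columns; the actual column names in the generated files may differ, and the values shown are invented.

```python
import csv
import io


def top_features(tsv_text, k=3):
    """Return the k (feature, importance) pairs with the highest importance."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = [(row["feature"], float(row["importance"])) for row in reader]
    return sorted(rows, key=lambda item: item[1], reverse=True)[:k]


# Invented example content; column names and values are assumptions.
example_tsv = (
    "feature\timportance\n"
    "TP53\t0.12\n"
    "AGE\t0.30\n"
    "KRAS\t0.05\n"
    "TML\t0.21\n"
)

print(top_features(example_tsv, k=2))
```

To read a real file instead, pass the text of the file, for example from `open(path, encoding="utf-8").read()`.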
