How to replicate experiment

Requirements

The code runs on Python >= 3.7. Because it uses type annotations, some of it may raise exceptions on Python <= 3.6.
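To confirm that an interpreter meets this requirement before running the pipeline, a small check like the following can help. This is only a sketch: the function name `check_python_version` is hypothetical and is not part of the repository.

```python
import sys

MIN_VERSION = (3, 7)  # minimum version stated in this README


def check_python_version(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum
```

Calling `check_python_version()` with no arguments tests the interpreter that is currently running.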

As good practice, creating a virtual environment is recommended. A quick way to do it:

By using `conda`

conda create -n <name_of_venv> python=3.7 # or any python>=3.7

# Activate
conda activate <name_of_venv>

By using Python

Note: The Python version you want in the virtual environment must already be installed.

python -m venv <name_of_venv>
# If the python command points to Python 2, use python3 instead
python3 -m venv <name_of_venv>

# Activate
# Windows
<name_of_venv>\Scripts\activate.bat
# Unix or MacOS
source <name_of_venv>/bin/activate

Install libraries

Using pip:

pip install -r requirements.txt

Preprocessing

Note: data_mutations_extended.txt, the file from which the data in mutated_genes.tsv and tml.tsv was extracted, is not included in this repository because of its size (0.98 GB). It can be downloaded from cBioPortal. The files used in this study are data_mutations_extended.txt, data_clinical_patient.txt and data_clinical_sample.txt.

The data in the training_data folder was generated from the data saved in raw_data. To regenerate it, run:

cd pipeline/01-Pre_processing
# Generates data/training_data/mutated_genes/*.tsv
python extract_mutation.py
# Generates data/training_data/clinical_data/clinical_data.tsv, not to be used
python clinical_data.py
# Generates data/training_data/clinical_data/encoded_clinical_data.tsv
python one_hot_encoding_clinical_data.py
# Merges clinical data with mutated genes into data/training_data/merged_data/*.tsv
python merge_data.py
cd ../..
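The four steps above can also be chained from Python so that the run stops at the first failing script. The sketch below takes the directory and script names from the commands above; the helper names `build_commands` and `run_preprocessing` are hypothetical and do not exist in the repository.

```python
import subprocess
import sys
from pathlib import Path

# Directory and script order are taken from the commands listed in this README.
PREPROCESSING_DIR = Path("pipeline") / "01-Pre_processing"
PREPROCESSING_SCRIPTS = [
    "extract_mutation.py",
    "clinical_data.py",
    "one_hot_encoding_clinical_data.py",
    "merge_data.py",
]


def build_commands(python=sys.executable):
    """Return one command (argument list) per preprocessing script, in order."""
    return [[python, script] for script in PREPROCESSING_SCRIPTS]


def run_preprocessing(workdir=PREPROCESSING_DIR):
    """Run every script from inside `workdir`, stopping at the first failure."""
    for command in build_commands():
        # check=True raises CalledProcessError if a script exits with an error
        subprocess.run(command, cwd=workdir, check=True)
```

Calling `run_preprocessing()` from the repository root, with the virtual environment active, would be equivalent to the shell commands above.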

Classification

The whole classification is done in one script:

cd pipeline/02-Classification
python classify.py
# >> A long output
cd ../..

This generates the whole metrics folder. The output is saved in classify.out. The specification and parameters are in the config.py file. The seed was fixed arbitrarily to make the experiments reproducible.
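To illustrate why a fixed seed makes runs repeatable, the sketch below shuffles and splits a data set with a seeded generator. It is not the code used in classify.py: the seed value, the function name `shuffled_split` and the 20% test fraction are placeholders chosen for the example, and the real settings live in config.py.

```python
import random

SEED = 42  # placeholder value; the real seed is defined in config.py


def shuffled_split(items, test_fraction=0.2, seed=SEED):
    """Shuffle `items` with a seeded generator and split into (train, test)."""
    rng = random.Random(seed)  # local generator, leaves global state untouched
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

Two calls with the same seed return identical splits, while changing the seed generally changes which samples land in the test set.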

Analysis of Results

Comparing metrics

To compare the training data sets used for the model, we first need to summarize all the metrics obtained. To do this, run:

cd pipeline/03-Meta_Analysis
python process_results.py
# Generates data/meta_data/metrics.tsv

Two types of graphs are generated to compare the different data used to train the model. The first compares the F1-score of the M1 label (cancer has been found to have spread to distant organs or tissues). The second compares the precision and recall of the same label (M1).
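As a reminder of how these three metrics relate for a single label such as M1, the sketch below computes them from raw counts. The function name and the counts in the usage note are hypothetical; the scripts in this repository may compute the metrics differently, for example through a library.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for one label from raw counts.

    tp: samples correctly predicted as the label
    fp: samples wrongly predicted as the label
    fn: samples of the label that were missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For instance, with 8 true positives, 2 false positives and 4 false negatives, precision is 8/10, recall is 8/12 and F1 is 16/22, which shows how a single F1 value can hide an imbalance between precision and recall.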

The first type of chart is generated with the script:

python f1_score_plot.py

This generates 4 plots:

| PCA \| Model Validation | 5-Fold Cross Validation | Test Set |
| --- | --- | --- |
| Without PCA | Plot | Plot |
| With PCA | Plot | Plot |

The second type of chart is generated with:

python precision_recall.py

This generates the following plots:

| Model Validation | Link to plot |
| --- | --- |
| 5-Fold Cross Validation | Plot |
| Test Set | Plot |

A third type of chart is generated with:

python roc_curve.py
cd ../..
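Assuming that roc_curve.py plots ROC curves, as its name suggests, the sketch below shows how the points of such a curve and the area under it can be derived from true labels and predicted scores. It is a plain-Python illustration, not the repository's implementation; the function names `roc_points` and `auc` are hypothetical.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs.

    Samples are visited from the highest score to the lowest, which is
    equivalent to lowering the decision threshold step by step.
    """
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative sample")
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # emit a point only after the last sample of a group of tied scores
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve, computed with the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
```

The curve always starts at (0, 0) and ends at (1, 1); an area close to 1 means positive samples are ranked above negative ones, while 0.5 corresponds to random ranking.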

Getting most important parameters

To extract feature importance:

cd pipeline/04-Feature-Importance
python extract_feature_importance.py
# And to plot
python plot_feature_importance.py
cd ../..

This generates .tsv files in the meta_data folder.


StarBrand/rf-tml
