HUST-NingKang-Lab/DeepMicroCancer

Quick start

DeepMicroCancer is a diagnostic model for multiple cancer types that combines a Random Forest classifier with transfer learning. The predict.py script quickly predicts the labels of input samples.

Command line instructions

To run the predict module of DeepMicroCancer, simply use the following command:

python predict.py -i abundance.csv \
  -l labels.csv \
  -m model \
  -t model_type \
  -f fig_name \
  -o output_directory

File descriptions

-i: A CSV file containing the microbial abundances of each sample. Rows represent hosts and columns represent features. The abundance file should be generated by Kraken and preprocessed with Voom and SNM (supervised normalization) to reduce batch effects. More information on preprocessing can be found here. The file should look like:

microbe1 microbe2 ...
host1 0.01 0.05 ...
host2 0 0.02 ...
... ... ... ...
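The layout above can be checked before running a prediction. The following is a minimal sketch using only the Python standard library; the function name read_abundance is hypothetical, and it assumes the first column holds the host index with an empty (or ignored) header cell, which the README does not state explicitly.

```python
import csv
import io


def read_abundance(text):
    """Parse an abundance table: a header row of microbes, then one row per host."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]  # skip blank lines
    header, body = rows[0], rows[1:]
    microbes = header[1:]  # assumption: the first header cell belongs to the index column
    table = {}
    for row in body:
        host, values = row[0], row[1:]
        if len(values) != len(microbes):
            raise ValueError(f"{host}: expected {len(microbes)} values, got {len(values)}")
        table[host] = [float(v) for v in values]  # raises ValueError on non-numeric cells
    return microbes, table


sample = ",microbe1,microbe2\nhost1,0.01,0.05\nhost2,0,0.02\n"
microbes, table = read_abundance(sample)
print(microbes)
print(table["host2"])
```

A ragged row or a non-numeric cell raises ValueError, which is usually easier to diagnose here than inside the prediction step.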

-l: Optional. A CSV file containing the label of each host. If provided, DeepMicroCancer calculates the AUROC and plots the ROC curve. The first column contains the index of each host, and the second column, named disease_type, contains its label:

SampleID disease_type
host1 status1
host2 status2
host3 status3
... ...
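As a rough illustration of the labels layout, the sketch below reads a labels CSV and reports hosts that have no label. The helper names read_labels and unlabeled_hosts are hypothetical; only the disease_type column name comes from the description above.

```python
import csv
import io


def read_labels(text):
    """Return {host: disease_type} from a labels CSV."""
    reader = csv.DictReader(io.StringIO(text))
    fieldnames = reader.fieldnames or []
    if "disease_type" not in fieldnames:
        raise ValueError("labels file needs a 'disease_type' column")
    index_col = fieldnames[0]  # the first column holds the host index
    return {row[index_col]: row["disease_type"] for row in reader}


def unlabeled_hosts(hosts, labels):
    """Hosts present in the abundance table but missing from the labels."""
    return [host for host in hosts if host not in labels]


labels = read_labels("SampleID,disease_type\nhost1,status1\nhost2,status2\n")
print(unlabeled_hosts(["host1", "host2", "host3"], labels))
```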

-m: A DeepMicroCancer model trained using build_model.py or transfer.py. Three models (tissue_model, blood_model, and tissue-blood_model) are available in the models directory. Choose the appropriate model based on your sample type.

-t: The type of the model. If the model was trained using build_model.py, set this parameter to independent. If the model was trained using transfer.py, set this parameter to transfer.

-f: Optional. The file name under which the AUROC plot is saved.

-o: The output directory for the prediction results. The results include a CSV file with the predictions and an AUROC figure (if labels are provided).

Example

Using the blood test dataset and the tissue-blood_model as an example:

python predict.py -i data/blood/X_test.csv \
  -l data/blood/y_test.csv \
  -m models/tissue-blood_model \
  -t transfer -f tissue-blood \
  -o results/tissue-blood
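When predict.py is called from another Python program, the optional -l and -f flags can simply be left out. The sketch below assembles the argument list from the flags described above; build_predict_command is a hypothetical helper, not part of the repository.

```python
import shlex


def build_predict_command(input_csv, model, model_type, output,
                          labels=None, fig_name=None):
    """Assemble a predict.py argument list, omitting the optional flags when unset."""
    if model_type not in ("independent", "transfer"):
        raise ValueError("model_type must be 'independent' or 'transfer'")
    cmd = ["python", "predict.py", "-i", input_csv]
    if labels is not None:
        cmd += ["-l", labels]  # optional: enables the AUROC calculation
    cmd += ["-m", model, "-t", model_type]
    if fig_name is not None:
        cmd += ["-f", fig_name]  # optional: name of the saved figure
    cmd += ["-o", output]
    return cmd


cmd = build_predict_command("data/blood/X_test.csv", "models/tissue-blood_model",
                            "transfer", "results/tissue-blood")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the repository root.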

Workflow

The workflow is divided into four scripts, each covering one step: split_dataset.py, build_model.py, transfer.py, and predict.py.

Requirements

To install the required packages, run the following command:

pip install -r requirements.txt

The data files are stored as a split zip archive in the data folder. Concatenate the parts and unzip them:

cat data/data.* > data/tmp.zip
unzip data/tmp.zip -d data

Split Dataset

The split_dataset.py script is used to split the dataset into training and testing sets. The script takes the path to the features and labels files in csv format as input and generates the training and testing datasets.

Arguments

-x, --features: The path to a features file in CSV format; each row is a sample and each column is a feature
-y, --labels: The path to a labels file in CSV format; each row is a sample and the disease_type column is the label
-s, --test_size: The fraction of samples assigned to the test set (default = 0.3)
-o, --output: The path to save the output files

Usage:

Split the tissue dataset and blood dataset into training and testing sets with a test size of 30% and 20% respectively.

python split_dataset.py -x data/tissue_snm.csv \
  -s 0.3 \
  -y data/tissue_meta.csv \
  -o data/tissue
  
python split_dataset.py -x data/blood_snm.csv \
  -s 0.2 \
  -y data/blood_meta.csv \
  -o data/blood
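To illustrate what a seeded split with a given test size means, here is a minimal stand-in written with the standard library. It is not the code of split_dataset.py: the function split_hosts is hypothetical, it does not stratify by disease_type, and it will not reproduce the exact partitions the script writes.

```python
import random


def split_hosts(hosts, test_size=0.3, seed=0):
    """Shuffle hosts with a fixed seed and cut off a test fraction."""
    shuffled = list(hosts)
    random.Random(seed).shuffle(shuffled)  # same seed, same order
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


hosts = [f"host{i}" for i in range(10)]
train, test = split_hosts(hosts, test_size=0.3, seed=0)
print(len(train), len(test))
```

Calling the function again with the same seed returns the same partition, which is the property the fixed seed is meant to provide.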

Build Model

The build_model.py script builds the Random Forest classifier model. The script takes the path to the training features and labels files in CSV format as input and outputs a saved model. The saved model contains three files: model.joblib contains the model parameters, features.txt contains the features used to build the model, and label_encoder.joblib contains the label encoder used to encode the labels.

Arguments

-x, --features: The path to a features file in CSV format; each row is a sample and each column is a feature
-y, --labels: The path to a labels file in CSV format; each row is a sample and the disease_type column is the label
-o, --output: The path to save the model

Usage:

Build the Random Forest classifier model for the tissue and blood datasets.

python build_model.py -x data/tissue/X_train.csv \
  -y data/tissue/y_train.csv \
  -o models/tissue_model
  
python build_model.py -x data/blood/X_train.csv \
  -y data/blood/y_train.csv \
  -o models/blood_model
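Before passing a model directory to another script, it can help to confirm that the three files listed above are present. The following sketch assumes nothing beyond those file names; missing_model_files is a hypothetical helper, and the temporary directory only stands in for a real model folder.

```python
import tempfile
from pathlib import Path

EXPECTED_FILES = ("model.joblib", "features.txt", "label_encoder.joblib")


def missing_model_files(model_dir):
    """Return the expected model files that are absent from model_dir."""
    model_dir = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (model_dir / name).is_file()]


# Demonstration on a throwaway directory with one file left out.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "model.joblib").touch()
    (Path(tmp) / "features.txt").touch()
    print(missing_model_files(tmp))
```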

About the seed

We use the seed 0 to split the dataset and seed 13 to build the model to make sure that the results are reproducible. The seed can be changed by changing the seed variable in the split_dataset.py and build_model.py scripts.

Transfer Model

The transfer.py script is used to transfer the model from one dataset to another. The script takes the path to the source model, source features and labels files, target features and labels files in csv format as input and outputs a saved model. The source model should be built using the build_model.py script.

Arguments

-s, --source_model: The path to the source model
-sf, --source_features: The path to a source features file in CSV format; each row is a sample and each column is a feature
-sl, --source_labels: The path to a source labels file in CSV format; each row is a sample and the disease_type column is the label
-tf, --target_features: The path to a target features file in CSV format; each row is a sample and each column is a feature
-tl, --target_labels: The path to a target labels file in CSV format; each row is a sample and the disease_type column is the label
-o, --output: The path to save the model

Usage:

Transfer the tissue model to the blood dataset.

python transfer.py -s models/tissue_model \
  -sf data/tissue/X_train.csv \
  -sl data/tissue/y_train.csv \
  -tf data/blood/X_train.csv \
  -tl data/blood/y_train.csv \
  -o models/tissue-blood_model
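This README does not say how transfer.py treats features that appear in only one of the two datasets. As a precaution, the overlap between source and target feature names can be inspected first. The sketch below is an assumption-laden illustration, and compare_features is a hypothetical helper.

```python
def compare_features(source, target):
    """Split feature names into shared, source-only, and target-only lists."""
    source_set, target_set = set(source), set(target)
    shared = [name for name in source if name in target_set]  # keeps source order
    only_source = [name for name in source if name not in target_set]
    only_target = [name for name in target if name not in source_set]
    return shared, only_source, only_target


tissue_features = ["microbe1", "microbe2", "microbe3"]
blood_features = ["microbe2", "microbe3", "microbe4"]
shared, only_tissue, only_blood = compare_features(tissue_features, blood_features)
print(shared, only_tissue, only_blood)
```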

Predict

The predict.py script predicts the labels of the testing dataset. It takes the path to the testing features file in CSV format, the model built using the build_model.py or transfer.py script, the type of the model (either independent or transfer), and the output path to save the results. A labels file in CSV format and a figure name for the AUROC plot are optional; if the labels file is provided, the script outputs the AUROC plot in addition to the predicted labels.

Arguments

-i, --input: The path to a features file in CSV format; each row is a sample and each column is a feature
-l, --labels: Optional. The path to a labels file in CSV format; each row is a sample and the disease_type column is the label
-m, --model: The path to the model
-t, --type: The type of the model, either independent or transfer
-f, --fig_name: Optional. The name of the figure to save
-o, --output: The path to save the results
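For readers unfamiliar with the AUROC reported when labels are supplied, the sketch below computes it for one class against all others as the fraction of positive/negative pairs ranked correctly, with ties counted as half. This is a textbook formulation for illustration only; how predict.py computes or aggregates the metric across cancer types is not described here, and the auroc function is hypothetical.

```python
def auroc(scores, is_positive):
    """AUROC as the share of (positive, negative) pairs where the positive scores higher."""
    pos = [s for s, flag in zip(scores, is_positive) if flag]
    neg = [s for s, flag in zip(scores, is_positive) if not flag]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(pos) * len(neg))


scores = [0.9, 0.8, 0.3, 0.1]         # predicted probability of the class
truth = [True, False, True, False]    # whether each host really has that class
print(auroc(scores, truth))
```

In this toy example three of the four positive/negative pairs are ordered correctly, so the value is 0.75.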

Usage:

Predict the labels of the testing dataset for the tissue using the tissue model.

python predict.py -i data/tissue/X_test.csv \
  -l data/tissue/y_test.csv \
  -m models/tissue_model \
  -t independent \
  -f tissue-tissue \
  -o results/tissue-tissue

Predict the labels of the testing dataset for the blood using the blood model.

python predict.py -i data/blood/X_test.csv \
  -l data/blood/y_test.csv \
  -m models/blood_model \
  -t independent \
  -f blood-blood \
  -o results/blood-blood

Predict the labels of the testing dataset for the blood using the tissue-blood model.

python predict.py -i data/blood/X_test.csv \
  -l data/blood/y_test.csv \
  -m models/tissue-blood_model \
  -t transfer -f tissue-blood \
  -o results/tissue-blood
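The steps above depend on each other: the splits feed the model, the tissue model feeds the transfer, and the transferred model feeds the prediction. One way to keep that order explicit is to hold the commands as data and run them in sequence. The sketch below reuses the commands from this README; run_pipeline is a hypothetical helper, and it only prints the commands unless dry_run is set to False.

```python
import shlex
import subprocess

PIPELINE = [
    ["python", "split_dataset.py", "-x", "data/tissue_snm.csv", "-s", "0.3",
     "-y", "data/tissue_meta.csv", "-o", "data/tissue"],
    ["python", "split_dataset.py", "-x", "data/blood_snm.csv", "-s", "0.2",
     "-y", "data/blood_meta.csv", "-o", "data/blood"],
    ["python", "build_model.py", "-x", "data/tissue/X_train.csv",
     "-y", "data/tissue/y_train.csv", "-o", "models/tissue_model"],
    ["python", "transfer.py", "-s", "models/tissue_model",
     "-sf", "data/tissue/X_train.csv", "-sl", "data/tissue/y_train.csv",
     "-tf", "data/blood/X_train.csv", "-tl", "data/blood/y_train.csv",
     "-o", "models/tissue-blood_model"],
    ["python", "predict.py", "-i", "data/blood/X_test.csv",
     "-l", "data/blood/y_test.csv", "-m", "models/tissue-blood_model",
     "-t", "transfer", "-f", "tissue-blood", "-o", "results/tissue-blood"],
]


def run_pipeline(steps, dry_run=True):
    """Print each command in order and, unless dry_run, execute it."""
    for cmd in steps:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing step
    return [cmd[1] for cmd in steps]


order = run_pipeline(PIPELINE)
```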

Feature importances

The feature importances of the tissue model and the blood model for each cancer type are calculated using the feature_importances.py script. Each cancer type is treated as a binary classification problem, and the output is saved in the feature_importances folder.

Run the script using the following command:

python feature_importances.py
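Treating each cancer type as a binary problem amounts to relabeling the samples once per type: 1 for that type, 0 for everything else. The sketch below shows only this relabeling step with placeholder labels; the helper names are hypothetical, and the importance calculation itself is left to feature_importances.py.

```python
def one_vs_rest(labels, positive):
    """Binary targets: 1 where the label equals the chosen type, else 0."""
    return [1 if label == positive else 0 for label in labels]


def binary_tasks(labels):
    """One binary target vector per distinct label."""
    return {kind: one_vs_rest(labels, kind) for kind in sorted(set(labels))}


labels = ["status1", "status2", "status1", "status3"]
for kind, target in binary_tasks(labels).items():
    print(kind, target)
```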
