Identifying Driver Genes for Individual Patients through Inductive Matrix Completion
RAM: 16 GB
The code is memory-intensive, so make sure enough RAM is available before running it.
- Python 3.6
- pandas 0.22.0
- numpy 1.14.6
- scikit-learn 0.23.2
- Download the files and directories from the "IMCDriver" repo, including "data", "data_preprocess.py", and "IMCDriver.py".
- Unzip Example.7z into IMCDriver/data/ to obtain IMCDriver/data/Example.
- Set the variable cancer_folder='Example' in "data_preprocess.py" and "IMCDriver.py"
- Since the datasets are already preprocessed, users can run the command "python IMCDriver.py" in the terminal, or run this file in the PyCharm IDE, to apply IMC and predict driver genes for individual patients.
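Before running the scripts, it can help to confirm that the installed packages match the pinned versions listed above. The following is a minimal sketch, not part of the repository: the helper names `installed_version` and `find_mismatches` are hypothetical, and it assumes each package exposes a `__version__` attribute (scikit-learn is imported as `sklearn`).

```python
import importlib
import sys

# Versions pinned in this README, keyed by import name.
PINNED = {"pandas": "0.22.0", "numpy": "1.14.6", "sklearn": "0.23.2"}


def installed_version(module_name):
    """Return the module's __version__, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose version differs to (wanted, found)."""
    mismatches = {}
    for name, wanted in pinned.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    print("Python", ".".join(str(part) for part in sys.version_info[:3]))
    for name, (wanted, found) in find_mismatches(PINNED).items():
        print("{}: expected {}, found {}".format(name, wanted, found))
```

An empty result from `find_mismatches(PINNED)` means the environment matches the versions above; a `found` value of `None` means the package is not installed.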
We provide the prepared files for five cancer datasets: BRCA, HNSC, LUAD, LUSC, and PRAD. To run IMCDriver on other cancer datasets from TCGA, first run the command 'python data_preprocess.py' in the terminal, or run the "data_preprocess.py" script in the PyCharm IDE, to preprocess the data; this may take several minutes. Then run the command "python IMCDriver.py" in the terminal to start the personalized driver gene identification. Testing all the samples may take about an hour. The running time mainly depends on your computer and the number of samples in your dataset. The processing time for each sample is printed to the console to help you estimate the total running time.
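As a rough illustration of how the printed per-sample times can be turned into an estimate, the sketch below multiplies the mean time of the samples processed so far by the number of samples remaining. The function name `estimate_remaining_seconds` is hypothetical and not part of IMCDriver, and the timings in the usage lines are made-up values.

```python
def estimate_remaining_seconds(per_sample_seconds, total_samples):
    """Estimate the remaining run time from the per-sample times seen so far."""
    done = len(per_sample_seconds)
    if done == 0:
        raise ValueError("need at least one per-sample time")
    if done > total_samples:
        raise ValueError("more times recorded than samples in the dataset")
    mean_seconds = sum(per_sample_seconds) / done
    return mean_seconds * (total_samples - done)


# Made-up timings copied from the console for the first two of ten samples.
remaining = estimate_remaining_seconds([4.0, 6.0], total_samples=10)
print("about {:.0f} s remaining".format(remaining))
```

The estimate assumes the remaining samples take about as long as the ones already processed, which may not hold if samples differ widely in their number of mutated genes.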
The data directory contains the following directories and files:
Additional_file5_reliable_interactions.txt: the gene correlation network file.
NCG_known_711.txt: the list of 711 known driver genes.
Example.7z: contains the prepared files of the Example.
BRCA.7z: contains the prepared files of the BRCA dataset.
HNSC.7z: contains the prepared files of the HNSC dataset.
LUAD.7z: contains the prepared files of the LUAD dataset.
LUSC.7z: contains the prepared files of the LUSC dataset.
PRAD.7z: contains the prepared files of the PRAD dataset.
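For readers who want to inspect the known driver gene list, here is a minimal loading sketch. It assumes the file holds one gene symbol per line with no header, which this README does not confirm, and the helper name `load_gene_list` is hypothetical.

```python
def load_gene_list(path, expected_count=None):
    """Read one gene symbol per line (assumed layout), skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        genes = [line.strip() for line in handle if line.strip()]
    if expected_count is not None and len(genes) != expected_count:
        raise ValueError(
            "expected {} genes, found {}".format(expected_count, len(genes))
        )
    return genes
```

Calling `load_gene_list("data/NCG_known_711.txt", expected_count=711)` would then raise an error if the file does not contain the 711 entries described above, which is a quick way to detect that the assumed layout is wrong.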
The directory of each cancer dataset is organized identically, as follows:
mut_similarity: stores the file of Gaussian interaction profile kernel similarity between mutated genes.
orig_data: stores the RNA-seq.txt and SomaticMutation.txt files downloaded from TCGA through Xena.
results: stores the file predicted by IMCDriver, which contains the scores of the mutated genes of each patient in the cancer dataset.
sample_similarity: stores the file of Gaussian interaction profile kernel similarity between samples.
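To make the similarity files more concrete, the sketch below computes a Gaussian interaction profile (GIP) kernel in its commonly used form, K(i, j) = exp(-gamma * ||p_i - p_j||^2), with gamma set to gamma' divided by the mean squared norm of the profiles. This is an illustration only: the exact bandwidth and input matrix used by data_preprocess.py are not stated here, so `gamma_prime=1.0` and the toy matrix are assumptions, and `gip_kernel` is a hypothetical name. It uses plain Python lists to stay dependency-free.

```python
import math


def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel between the rows of `profiles`."""
    n = len(profiles)
    sq_norms = [sum(v * v for v in row) for row in profiles]
    mean_sq_norm = sum(sq_norms) / n
    if mean_sq_norm == 0:
        raise ValueError("all profiles are zero; the bandwidth is undefined")
    gamma = gamma_prime / mean_sq_norm  # assumed bandwidth normalization
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            sq_dist = sum(
                (a - b) ** 2 for a, b in zip(profiles[i], profiles[j])
            )
            kernel[i][j] = kernel[j][i] = math.exp(-gamma * sq_dist)
    return kernel


# Toy binary sample-by-gene mutation matrix (made-up values).
toy_profiles = [
    [1, 0, 1],
    [1, 0, 1],
    [0, 1, 0],
]
sample_similarity = gip_kernel(toy_profiles)
```

Rows with identical profiles get similarity 1, and the similarity decays as profiles diverge. Under the same assumptions, the similarity between mutated genes would be obtained by applying the function to the transposed matrix, so that each row is a gene's profile across samples.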
We strongly suggest not changing the names of the files and subfolders generated by the data_preprocess.py script, unless the same names are changed identically in the "IMCDriver.py" script. The original RNA-seq.txt and SomaticMutation.txt files of each cancer dataset were downloaded from TCGA through Xena.
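A quick way to confirm that a dataset directory still follows the expected naming is to check for the four subfolders before running IMCDriver.py. This is a minimal sketch under the layout described in this README; the helper `missing_subfolders` is hypothetical and not part of the repository.

```python
import os

# Subfolder names described in this README for each cancer dataset.
EXPECTED_SUBFOLDERS = (
    "mut_similarity",
    "orig_data",
    "results",
    "sample_similarity",
)


def missing_subfolders(data_dir, cancer_folder):
    """Return the expected subfolders that are absent from the dataset folder."""
    base = os.path.join(data_dir, cancer_folder)
    return [
        name
        for name in EXPECTED_SUBFOLDERS
        if not os.path.isdir(os.path.join(base, name))
    ]
```

For example, `missing_subfolders("data", "Example")` returns an empty list when the Example dataset has been unzipped with its original folder names.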