Step 3 of the BioXpress pipeline.
run_per_study.py -> run_per_tissue.py -> run_per_case.py
The python script run_per_study.py provides arguments to the R script deseq.R. The count and category files generated from the Annotation step are used to calculate differential expression and statistical significance. The result is a series of files per study including the normalized reads (DESeq normalization method), the DE results and significance, and QC files such as the PCA plot.
- Note: this step is time consuming (~2-3 hours of run time)
Edit the hard-coded paths in the script run_per_study.py
- Specify the in_dir to be the folder containing the final per-study output files of the Annotation step
- Specify the out_dir
- Ensure that the file list_files/studies.csv contains all of the studies you wish to process
- Note: the studies can be run separately (in the event that 2-3 hours cannot be dedicated to running all studies at once) by creating separate list files with specific studies to run
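How a driver script could loop over the study list and hand arguments to deseq.R can be sketched as follows. This is a minimal sketch and not the actual run_per_study.py: the one-study-per-row layout of studies.csv, the argument order passed to deseq.R, and the helper names read_studies and build_commands are all assumptions made for illustration.

```python
import csv
from pathlib import Path


def read_studies(list_file):
    """Return study names from the first column of the list file (assumed layout)."""
    with open(list_file, newline="") as handle:
        return [row[0].strip() for row in csv.reader(handle) if row and row[0].strip()]


def build_commands(studies, in_dir, out_dir, r_script="deseq.R"):
    """Build one Rscript command per study; the argument order is hypothetical."""
    commands = []
    for study in studies:
        study_in = Path(in_dir) / study    # assumed: one input folder per study
        study_out = Path(out_dir) / study  # assumed: one output folder per study
        commands.append(["Rscript", r_script, str(study_in), str(study_out)])
    return commands
```

Each returned command is a list that could be passed to subprocess.run(command, check=True). Building the commands separately from running them makes it easy to print and inspect them before committing to a multi-hour run.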
Run the shell script sh run_per_study.sh
- Note: the R libraries specified in deseq.R will need to be installed if running on a new server or system, as these installations are not included in the scripts
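Because the R libraries are not installed by the scripts, a pre-flight check before a long run may save time. The sketch below asks Rscript whether each package can be loaded. The package names to pass in must be taken from deseq.R itself (the name DESeq2 used in the usage note is an assumption), and the helper names are hypothetical.

```python
import shutil
import subprocess


def r_check_expression(package):
    """R one-liner that exits with status 0 only if the package can be loaded."""
    return f'quit(status = if (requireNamespace("{package}", quietly = TRUE)) 0 else 1)'


def missing_r_packages(packages, rscript="Rscript"):
    """Return the packages that Rscript reports as not installed."""
    if shutil.which(rscript) is None:
        raise RuntimeError(f"{rscript} was not found on PATH")
    missing = []
    for package in packages:
        # A non-zero exit status means requireNamespace() returned FALSE.
        result = subprocess.run([rscript, "-e", r_check_expression(package)])
        if result.returncode != 0:
            missing.append(package)
    return missing
```

For example, missing_r_packages(["DESeq2"]) would return an empty list on a system where that package is installed.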
A set of files:
- log file
- deSeq_reads_normalized.csv - Normalized read counts (DESeq normalization method applied)
- results_significance.csv - log2fc differential expression results and statistical significance (t-test)
- dispersion.png
- distance_heatmap.png
- pca.png - Principal component analysis plot, important for observing how well the Primary Tumor and Solid Tissue Normal group together
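As an illustration of how results_significance.csv might be used downstream, the sketch below keeps only rows that pass a fold-change and a significance threshold. The column names log2FoldChange and padj are the defaults in DESeq2 result tables, but whether this file uses them is an assumption, and the thresholds are illustrative only.

```python
import csv


def significant_genes(results_file, lfc_col="log2FoldChange", p_col="padj",
                      min_abs_lfc=1.0, max_p=0.05):
    """Return rows passing both thresholds; column names are assumed, not confirmed."""
    hits = []
    with open(results_file, newline="") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        for column in (lfc_col, p_col):
            if column not in header:
                raise ValueError(f"column {column!r} not found in {results_file}")
        for row in reader:
            try:
                lfc = float(row[lfc_col])
                p_value = float(row[p_col])
            except (ValueError, TypeError):
                continue  # skip rows with missing or non-numeric values such as NA
            if abs(lfc) >= min_abs_lfc and p_value <= max_p:
                hits.append(row)
    return hits
```

If the file uses different headers, pass them through lfc_col and p_col rather than editing the function.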
The python script run_per_tissue.py provides arguments to the R script deseq.R. The count and category files generated from the Annotation step are used to calculate differential expression and statistical significance. The result is a series of files per tissue including the normalized reads (DESeq normalization method), the DE results and significance, and QC files such as the PCA plot.
- Note: this step is time consuming (~2-3 hours of run time)
Edit the hard-coded paths in the script run_per_tissue.py
- Specify the in_dir to be the folder containing the final per-tissue output files of the Annotation step
- Specify the out_dir
- Ensure that the file list_files/tissue.dat contains all of the tissues you wish to process
- Note: the tissues can be run separately (in the event that 2-3 hours cannot be dedicated to running all tissues at once) by creating separate dat files with specific tissues to run
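Creating the separate dat files by hand is straightforward, but it can also be scripted. The following is a minimal sketch that splits a list file into numbered batches. It assumes tissue.dat holds one tissue per line, and the function name and the batch file naming are hypothetical. The hard-coded list path in the script would still need to be pointed at each batch file in turn.

```python
from pathlib import Path


def split_list_file(list_file, out_dir, batch_size):
    """Write consecutive batches of entries to numbered files; return their paths."""
    source = Path(list_file)
    entries = [line.strip() for line in source.read_text().splitlines() if line.strip()]
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    batch_paths = []
    for start in range(0, len(entries), batch_size):
        number = start // batch_size + 1
        batch_path = out_dir / f"{source.stem}_batch{number}{source.suffix}"
        batch_path.write_text("\n".join(entries[start:start + batch_size]) + "\n")
        batch_paths.append(batch_path)
    return batch_paths
```

For example, split_list_file("list_files/tissue.dat", "list_files/batches", 3) would write tissue_batch1.dat, tissue_batch2.dat, and so on, each holding up to three tissues.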
Run the shell script sh run_per_tissue.sh
A set of files:
- log file
- deSeq_reads_normalized.csv - Normalized read counts (DESeq normalization method applied)
- results_significance.csv - log2fc differential expression results and statistical significance (t-test)
- dispersion.png
- distance_heatmap.png
- pca.png - Principal component analysis plot, important for observing how well the Primary Tumor and Solid Tissue Normal group together
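After a long run it can be useful to confirm that every expected file was actually written. The sketch below checks one result folder against the file names listed above. It assumes each tissue's outputs land in their own folder, and it leaves out the log file because its name is not given here.

```python
from pathlib import Path

# File names taken from the output list above.
EXPECTED_OUTPUTS = [
    "deSeq_reads_normalized.csv",
    "results_significance.csv",
    "dispersion.png",
    "distance_heatmap.png",
    "pca.png",
]


def missing_outputs(result_dir, expected=EXPECTED_OUTPUTS):
    """Return the expected file names that are absent or empty in result_dir."""
    result_dir = Path(result_dir)
    missing = []
    for name in expected:
        path = result_dir / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

An empty return value means all five files exist and are non-empty. It does not say anything about whether their contents are correct.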
The python script run_per_case.py provides arguments to the R script deseq.R. The count and category files generated from the Annotation step are used to calculate differential expression and statistical significance. The result is a series of files per case including the normalized reads (DESeq normalization method), the DE results and significance, and QC files such as the PCA plot.
- Note: this step is time consuming (~2-3 hours of run time)
Edit the hard-coded paths in the script run_per_case.py
- Specify the in_dir to be the folder containing the final per-case output files of the Annotation step
- Specify the out_dir
- Ensure that the file list_files/cases.csv contains all of the cases you wish to process
- Note: the cases can be run separately (in the event that 2-3 hours cannot be dedicated to running all cases at once) by creating separate list files with specific cases to run
Run the shell script sh run_per_case.sh
A set of files:
- log file
- deSeq_reads_normalized.csv - Normalized read counts (DESeq normalization method applied)
- results_significance.csv - log2fc differential expression results and statistical significance (t-test)
- dispersion.png
- distance_heatmap.png
- pca.png - Principal component analysis plot, important for observing how well the Primary Tumor and Solid Tissue Normal group together