ONCOCNV - a package to detect copy number changes in Deep Sequencing data
- Perl and R installed and added to the PATH E.g., export PATH=$PATH:YOURPATH/R/bin
- SAMtools (http://samtools.sourceforge.net/) installed and added to the PATH To add to PATH, type in the command line or add to "ONCOCNV.sh": export PATH=$PATH:YOURPATH/samtools/bin or alias samtools=YOURPATH/samtools/bin/samtools
- BEDTools (http://bedtools.readthedocs.org/en/latest/) installed and added to the PATH To add to PATH, type in the command line or add to "ONCOCNV.sh": export PATH=$PATH:YOURPATH/BEDTools/bin/ or alias bedtools=YOURPATH/BEDTools/bin/bedtools
- The following R libraries should be installed: MASS, mclust, PSCBS, DNAcopy, R.cache, scales, cwhmisc, fastICA, cghseg, digest
- The fasta sequence (one file, unzipped; e.g. "hg19.fa") of the targeted genome should be downloaded from http://hgdownload.soe.ucsc.edu/downloads.html
- You need to have your data aligned (.bam files)
- You need to have at least three control files to construct a reliable baseline. However, ONCOCNV will run with only 2 controls starting from version 5.4 and with JUST one control starting from version 5.7. Yet, we recommend to have at least 3 control for good performance of the algorithm.
Download ONCOCNV.zip (or ONCOCNV.vX.X.zip)
Unzip files into detectory "scripts"
Check requirements (R + the necessary R packages must be installed) To install the necessary P packages (when R is installed), type in the command line:
R install.packages("MASS") install.packages("mclust") install.packages("R.cache") install.packages("scales") install.packages("cwhmisc") install.packages("fastICA") install.packages("cghseg") install.packages("digest") source("http://bioconductor.org/biocLite.R") biocLite("DNAcopy") install.packages("PSCBS") quit()
- Open "ONCOCNV.sh" with a text editor (gedit, textpad, etc.)
- Set correct paths and filenames in the top part of the "ONCOCNV.sh"
- Check properties of "ONCOCNV.sh" chmod +rwx PathToONCOCNV/scripts/ONCOCNV.sh
- Check formats: o reads should be given in .BAM format o amplicon coordinates should be given in .bed format (with or without the headline) and have amplicon ID in column 4 and gene symbol in column 6, e.g.: chr1 2488068 2488201 AMPL223847 0 TNFRSF14 It is mandatory to provide gene names in the 6th column.
Please make sure that: - There is no duplicates in the coordinates - Coordinates are sorted - Gene names are gene names in the sense that corresponding amplicons fall in the same genomic locus and not on different chromosomes - Gene names cannot be the same as amplicon names or IDs because ONCOCNV assumes to have several amplicons per gene
- Run "ONCOCNV.sh" from the command line: cd PathToONCOCNV/scripts ./ONCOCNV.sh or . PathToONCOCNV/scripts/ONCOCNV.sh
HOW TO READ OUTPUT FILES
There are three output files per sample:
- Visual representation of normalized and annotated copy number profile Each dot corresponds to an amplicon; the X-axis is not up to scale. Color code: o GREEN one-point-outlier o DARK GREY SURROUNDINGS frequent one-point-outlier o BROWN >1 level gain o BROWN SURROUNDINGS 1-level gain o BLUE >1 level loss o BLUE SURROUNDINGS 1-level loss
- predictions per gene
gene gene name chr chromosome name start first amplicon start end last amplicon start copy.number predicted copy number (no normal contamination nor subclones is taken into accout) p.value p-value for the copy number status of the genomic region encompassing the gene q.values q-value ("fdr"-corrected p-value) for the copy number status of the genomic region encompassing the gene comments p-value for the hypothesis that the copy number of the gene does not match the copy number of the encompassing segment (in the case of a break within a gene - it is the p-value for the break)
- predictions per amplicon (detailed information)
chr chromosome name start first amplicon start end last amplicon start gene gene name ID amplicon ID ratio logarithm of the normalized read count (zero values correspond to the neutral copy number) predLargeSeg copy number predicted by segmentation of normalized read counts predLargeCorrected final prediction for the copy number pvalRatioCorrected p-value of the t-test to test the difference between the normalized read counts and the value expected from the segmentation or from the gene-based copy number assessment perGeneEvaluation copy number predicted per gene (unaware of the segementation) pvalRatioGene gene-based p-value of the t-test for the difference of the mean of the normalized read counts from zero predPoint predicted one-point-outlier predPointSusp predicted (frequent) one-point-outlier comments additional information: SegRatio mean value of the logarithms of the normalized read counts per segment AbsMeanSigma normalized difference of the mean value (~z-score/sqrt(#amplicons in the segment)) pvalue p-value for AbsMeanSigma pvalueTTest p-value of the t-test (per segment)