Skip to content

7. Output

MikiSchikora edited this page May 13, 2022 · 33 revisions

Before checking the output make sure that there is a file in the output directory called perSVade_finished_file.txt, which indicates that the pipeline worked and includes a report of the timing. Depending on how you ran perSVade you are interested in the following sections:

Output of modular perSVade (perSVade <module> <args>) (recommended)

Each module writes files into an output directory, specified through --outdir, which contains all the commands ran in all_cmds.txt. Below is the content of the outdir in each of the modules.

trim_reads_and_QC

  • trimmed_reads<readID>.fastq.gz: Trimmed reads after running TRIMMOMATIC with FASTQC-inferred adapters.
  • trimmed_reads<readID>_fastqcReport.html: Html report of the quality of the trimmed reads, according to FASTQC.

align_reads

  • aligned_reads.bam.sorted: Aligned reads with marked duplicates (unless --skip_marking_duplicates is specified) with bwa mem and sorted by genomic coordinate.

  • aligned_reads.bam.sorted.bai: The index file of the aligned reads.

  • coverage_windows_<window_length>bp.tab: A table with the coverage for windows of the genome that are as long as 5% of the median chromosome length (<window_length>). This is the output of mosdepth. This table contatins the following fields:

    • #chrom is the chromosome name.
    • start is the 0-based start coordinates of the region
    • end is the 1-based end coordinate of the region
    • length is the length of the region
    • mediancov_1 is the median read depth in the region
    • percentcovered_1 is the perecentage of the region that is covered with 1 read or more
    • nocoveragebp_1 is the number of bases that have no coverage

infer_repeats

  • RepeatModeler_families.fasta: The repeat families identified by RepeatModeler.
  • RepeatMasker_RepeatModelerFamilies.out: The repeats annotated by RepeatMasker using RepeatModeler_families.fasta as query librarary.
  • RepeatMasker_DefaultFamilies.out: The repeats annotated by RepeatMasker using the default repeat databases as query librarary.
  • combined_repeats.tab: The combined output of RepeatMasker_RepeatModelerFamilies.out and RepeatMasker_DefaultFamilies.out in a .tab format. This file can be used as an input for the modules SV_calling, coverageCNV_calling, small_variant_calling and SV_parameter_optimization.

find_homologous_regions

  • breakpoints_aroundHomRegions_wsize=<query_window_size>bp_maxEval=<max_evalue>_minQcovS=<min_pct_overlap>.bedpe: A .bedpe file with the pairs of homologous regions in the provided genome.

find_knownSVs_regions

  • knownSVs_breakpoints.bedpe: A .bedpe file with the regions that have SVs as called by perSVade in several short-read sequencing datasets.

optimize_parameters

  • optimized_parameters.json: A .json file with the optimized parameters, which can be an input to the module call_SVs.

  • optimized_parameters_accuracy.tab: A table with the accuracy (precision, recall, Fvalue) of the optimized parameters when tested on the different simulations and SVs. Only SV types that have at least 5 SVs simulated are shown. These are the fields:

    • simulation_genome, simulation_ploidy indicate the testing simulation and type of SV. For example, if you ran parameter optimisation for 2 simulations you expect simulation_genome to have "simulation_1" and "simulation_2". If you ran for only one ploidy (i.e.: haploid) you expect simulation_ploidy to be always "haploid".
    • FN, FP, TP, nevents are the numbers of false negatives, false positives, true positives and number of all real positives, respectively.
    • precision, recall and Fvalue are the resulting accuracy.
  • SVfiles: Files of the simulated SVs. They are formatted as explained in the FAQ How are the SVs encoded into single files.

  • plots/cross_accuracy_heatmaps/: Heatmaps that show how the best parameters for one simulation and SV type (training parameters in the rows) work for the other simulations and SV types (testing simulations in the columns). There is one plot for each accuracy measurement (recall, precision and Fvalue). Only testing SV types that have at least 5 SVs simulated are shown. The optimized_parameters.json (training) are those that work best across all the testing simulations and SVs. The heatmap shows the accuracy (as an annotation on the plot) for the chose set of optimized_parameters.json. These plots are useful to infer the SV calling accuracy in the input sample and assess the overfitting associated to parameter optimization.

call_SVs

  • deletions.tab, inversions.tab, tandemDuplications.tab, insertions.tab, translocations.tab, unclassified_SVs.tab: These are the called SVs. Each SV type is saved into a different files. Check the FAQ How are the SVs encoded into single files for more information on these files.

  • gridss_output.raw.vcf and gridss_output.filt.vcf are the raw and filtered outputs of gridss, respectively. These are a set of breakpoints encoded in a .vcf file. Check the gridss documentation for more details.

  • gridss_output.filt.bedpe is a representation in the .bedpe format of the breakpoints from gridss_output.filt.vcf. The column names are "chrom1", "start1", "end1", "chrom2", "start2", "end2", "breakpoint ID", "QUAL", "strand 1", "strand 2". Each row indicates a breakpoint between two breakends (1 and 2). Each breakend has a "start" and "end" (for example "start1" and "start2") because the exact position is not always resolved (and hence the range indicates a region where the breakend is found).

  • clove_output.vcf is the output of clove without any coverage-based filtering of tandem duplications and deletions. Check clove for more details.

call_CNVs

  • final_CNVcalling.tab: The regions with a copy number different from 1. These are the relevant fields (some of them depend on the chosen CNV calling algorithms specified with the option --cnv_calling_algs):

    • chromosome, start and end indicate the region under CNV.

    • median_relative_CN_AneuFinder, median_relative_CN_HMMcopy and median_relative_CN_CONY indicate the predicted copy number by each of the three CNV callers. This value is relative to 1.0 (absence of CNV). In a diploid genome, a complete deletion would be 0.0, a monosomy 0.5, a trisomy 1.5 and a tetraploidy 2.0.

    • merged_relative_CN is the output of merging median_relative_CN_AneuFinder, median_relative_CN_HMMcopy and median_relative_CN_CONY. This merging is done in a conservative way, so that the value that is closer to 1.0 from the three programs (no CNV) is always taken.

    • median_coverage is the median read depth of this region.

    • median_coverage_corrected is the coverage corrected by mappability, GC content and distance to the telomere. This is the coverage that is used to run CNV calling. Note that the correction is only performed if a LOWESS fit on the data using a given set of predictors yields and r2>0.1

  • final_df_coverage.tab: A table with the raw and corrected coverage per windows of the genome, which can be used for debugging the output.

  • calculating_corrected_coverage: A folder with some files reporting how the coverage correction went. The most important one is calculating_corrected_coverage/plots/coverage_interactive.html, which shows the raw and corrected coverage in an interactive way.

  • plots: A folder with graphical visualizations of the CNV calling. The file plots/CNcalling_interactive.html shows the corrected coverage with the results of each of the programs. It can be used to filter the outputs.

integrate_SV_CNV_calls

  • SV_and_CNV_variant_calling.vcf: Merged .vcf file with all SVs and CNVs. Check the FAQ What is in SV_and_CNV_variant_calling.vcf? for more information. Note that this file is NOT sorted by genomic coordinates, as it is sorted by SV identity. This may give problems when integrated with other software. If your downstream software requires a file sorted by genomic coordinates you can run sort -k1,1 -k1n,2 <vcf file> to get the sorted vcf by genomic coordinate.

annotate_SVs

  • annotated_variants_before_GeneticCode_correction.html: The graphical representation of the functional annotation with VEP before correcting the annotations according to the genetic codes. This file is useful to have an overview of the types of variants. However, it is based on the assumption that the genetic code is standard for all protein-coding genes. This means that some variants may be incorrectly annotated. The file annotated_variants.tab contains the corrected variants.

  • annotated_variants.tab: The final, corrected functional annotation with VEP, where each line corresponds to a particular alteration in a given gene. This file is the 'Default VEP outout' (tabular) format from https://www.ensembl.org/info/docs/tools/vep/vep_formats.html#output.

call_small_variants

  • bcftools_ploidy<ploidy>_out, HaplotypeCaller_ploidy<ploidy>_out and freebayes_ploidy<ploidy>_out: Folders that contain the raw and filtered vcf files of each fo the programs.

  • merged_vcfs_allVars_ploidy<ploidy>.vcf: A vcf file with the merged output of the algorithms used in the variant calling. You can check the header to understand what are all the tags in the INFO field. Here, the multiallelic loci are split into different lines to ease the analysis.

  • variant_calling_ploidy<ploidy>.tab: A tabular version of merged_vcfs_allVars_ploidy<ploidy>.vcf. Each column contains information parsed from the vcf file. This file is easier to manage because it has all the info in tabular format.

  • variants_atLeast<n_PASS_programs>PASS_ploidy<ploidy>.vcf: The variants that PASS the filters in at least n_PASS_programs algorithms and also have a fraction of reads covering them above --min_AF. The INFO and FORMAT fields are simplified to take less space. It may be useful to take, for example, variants that PASS de filters by at least 2 algorithms.

annotate_small_vars

  • annotated_variants.tab: The final, corrected functional annotation with VEP, where each line corresponds to a particular alteration in a given gene. This file is the 'Default VEP outout' (tabular) format from https://www.ensembl.org/info/docs/tools/vep/vep_formats.html#output. In addition, this file also contains information about whether the annotation refers to a SNP (in the 'is_snp' field) and whether it is a protein-altering annotation (through the 'is_protein_altering' field).

get_cov_genes

  • genes_and_regions_coverage.tab: A table that contains the coverage of all the genes in the provided gff. These are the fields:

    • chromosome, start, end and length are the coordinates of the gene
    • median_reads_per_gene is the median read depth for this gene
    • nocoveragebp_1 is the number of bases that have no coverage
    • percentcovered_1 is the perecentage of the region that is covered with 1 read or more
    • ID is the gene ID in the gff (parsed from the attributes). If there are some duplicated IDs with parts it corresponds to the union of ID and part.
    • fraction_covered_by_MoreThan1read is equivalent to percentcovered_1, but from 0 to 1.
    • relative_coverage is the median_reads_per_gene divided by the median of all median_reads_per_gene.
    • All the fields with a _+-10kb_region sufix are equivalent to the ones that do not have it, but for a region that starts at -10kb of the gene start and ends at +10kb of the gene end. This can be useful for CNV analysis.




Output of the one-liner perSVade (perSVade.py <args>) (deprecated from version 1.02.3 on)

You can also use the script scripts/perSVade.py to execute multiple modules of perSVade with a single command. However, this script is less user friendly and error prone, and it will be no longer maintained after version 1.02.3. This file contains the description of the output of the one-liner version.