-
Notifications
You must be signed in to change notification settings - Fork 4
7. Output
Before checking the output make sure that there is a file in the output directory called perSVade_finished_file.txt
, which indicates that the pipeline worked and includes a report of the timing. Depending on how you ran perSVade you are interested in the following sections:
Each module writes files into an output directory, specified through --outdir
, which contains all the commands ran in all_cmds.txt
. Below is the content of the outdir in each of the modules.
-
trimmed_reads<readID>.fastq.gz
: Trimmed reads after running TRIMMOMATIC with FASTQC-inferred adapters. -
trimmed_reads<readID>_fastqcReport.html
: Html report of the quality of the trimmed reads, according to FASTQC.
-
aligned_reads.bam.sorted
: Aligned reads with marked duplicates (unless--skip_marking_duplicates
is specified) with bwa mem and sorted by genomic coordinate. -
aligned_reads.bam.sorted.bai
: The index file of the aligned reads. -
coverage_windows_<window_length>bp.tab
: A table with the coverage for windows of the genome that are as long as 5% of the median chromosome length (<window_length>). This is the output of mosdepth. This table contatins the following fields:-
#chrom
is the chromosome name. -
start
is the 0-based start coordinates of the region -
end
is the 1-based end coordinate of the region -
length
is the length of the region -
mediancov_1
is the median read depth in the region -
percentcovered_1
is the perecentage of the region that is covered with 1 read or more -
nocoveragebp_1
is the number of bases that have no coverage
-
-
RepeatModeler_families.fasta
: The repeat families identified byRepeatModeler
. -
RepeatMasker_RepeatModelerFamilies.out
: The repeats annotated byRepeatMasker
usingRepeatModeler_families.fasta
as query librarary. -
RepeatMasker_DefaultFamilies.out
: The repeats annotated byRepeatMasker
using the default repeat databases as query librarary. -
combined_repeats.tab
: The combined output ofRepeatMasker_RepeatModelerFamilies.out
andRepeatMasker_DefaultFamilies.out
in a .tab format. This file can be used as an input for the modulesSV_calling
,coverageCNV_calling
,small_variant_calling
andSV_parameter_optimization
.
-
breakpoints_aroundHomRegions_wsize=<query_window_size>bp_maxEval=<max_evalue>_minQcovS=<min_pct_overlap>.bedpe
: A .bedpe file with the pairs of homologous regions in the provided genome.
-
knownSVs_breakpoints.bedpe
: A .bedpe file with the regions that have SVs as called by perSVade in several short-read sequencing datasets.
-
optimized_parameters.json
: A .json file with the optimized parameters, which can be an input to the modulecall_SVs
. -
optimized_parameters_accuracy.tab
: A table with the accuracy (precision, recall, Fvalue) of the optimized parameters when tested on the different simulations and SVs. Only SV types that have at least 5 SVs simulated are shown. These are the fields:-
simulation_genome
,simulation_ploidy
indicate the testing simulation and type of SV. For example, if you ran parameter optimisation for 2 simulations you expectsimulation_genome
to have "simulation_1" and "simulation_2". If you ran for only one ploidy (i.e.: haploid) you expectsimulation_ploidy
to be always "haploid". -
FN
,FP
,TP
,nevents
are the numbers of false negatives, false positives, true positives and number of all real positives, respectively. -
precision
,recall
andFvalue
are the resulting accuracy.
-
-
SVfiles
: Files of the simulated SVs. They are formatted as explained in the FAQ How are the SVs encoded into single files. -
plots/cross_accuracy_heatmaps/
: Heatmaps that show how the best parameters for one simulation and SV type (training parameters in the rows) work for the other simulations and SV types (testing simulations in the columns). There is one plot for each accuracy measurement (recall, precision and Fvalue). Only testing SV types that have at least 5 SVs simulated are shown. Theoptimized_parameters.json
(training) are those that work best across all the testing simulations and SVs. The heatmap shows the accuracy (as an annotation on the plot) for the chose set ofoptimized_parameters.json
. These plots are useful to infer the SV calling accuracy in the input sample and assess the overfitting associated to parameter optimization.
-
deletions.tab
,inversions.tab
,tandemDuplications.tab
,insertions.tab
,translocations.tab
,unclassified_SVs.tab
: These are the called SVs. Each SV type is saved into a different files. Check the FAQ How are the SVs encoded into single files for more information on these files. -
gridss_output.raw.vcf
andgridss_output.filt.vcf
are the raw and filtered outputs ofgridss
, respectively. These are a set of breakpoints encoded in a .vcf file. Check the gridss documentation for more details. -
gridss_output.filt.bedpe
is a representation in the .bedpe format of the breakpoints fromgridss_output.filt.vcf
. The column names are "chrom1", "start1", "end1", "chrom2", "start2", "end2", "breakpoint ID", "QUAL", "strand 1", "strand 2". Each row indicates a breakpoint between two breakends (1 and 2). Each breakend has a "start" and "end" (for example "start1" and "start2") because the exact position is not always resolved (and hence the range indicates a region where the breakend is found). -
clove_output.vcf
is the output of clove without any coverage-based filtering of tandem duplications and deletions. Check clove for more details.
-
final_CNVcalling.tab
: The regions with a copy number different from 1. These are the relevant fields (some of them depend on the chosen CNV calling algorithms specified with the option--cnv_calling_algs
):-
chromosome
,start
andend
indicate the region under CNV. -
median_relative_CN_AneuFinder
,median_relative_CN_HMMcopy
andmedian_relative_CN_CONY
indicate the predicted copy number by each of the three CNV callers. This value is relative to 1.0 (absence of CNV). In a diploid genome, a complete deletion would be 0.0, a monosomy 0.5, a trisomy 1.5 and a tetraploidy 2.0. -
merged_relative_CN
is the output of mergingmedian_relative_CN_AneuFinder
,median_relative_CN_HMMcopy
andmedian_relative_CN_CONY
. This merging is done in a conservative way, so that the value that is closer to 1.0 from the three programs (no CNV) is always taken. -
median_coverage
is the median read depth of this region. -
median_coverage_corrected
is the coverage corrected by mappability, GC content and distance to the telomere. This is the coverage that is used to run CNV calling. Note that the correction is only performed if a LOWESS fit on the data using a given set of predictors yields and r2>0.1
-
-
final_df_coverage.tab
: A table with the raw and corrected coverage per windows of the genome, which can be used for debugging the output. -
calculating_corrected_coverage
: A folder with some files reporting how the coverage correction went. The most important one iscalculating_corrected_coverage/plots/coverage_interactive.html
, which shows the raw and corrected coverage in an interactive way. -
plots
: A folder with graphical visualizations of the CNV calling. The fileplots/CNcalling_interactive.html
shows the corrected coverage with the results of each of the programs. It can be used to filter the outputs.
-
SV_and_CNV_variant_calling.vcf
: Merged .vcf file with all SVs and CNVs. Check the FAQ What is in SV_and_CNV_variant_calling.vcf? for more information. Note that this file is NOT sorted by genomic coordinates, as it is sorted by SV identity. This may give problems when integrated with other software. If your downstream software requires a file sorted by genomic coordinates you can runsort -k1,1 -k1n,2 <vcf file>
to get the sorted vcf by genomic coordinate.
-
annotated_variants_before_GeneticCode_correction.html
: The graphical representation of the functional annotation withVEP
before correcting the annotations according to the genetic codes. This file is useful to have an overview of the types of variants. However, it is based on the assumption that the genetic code is standard for all protein-coding genes. This means that some variants may be incorrectly annotated. The fileannotated_variants.tab
contains the corrected variants. -
annotated_variants.tab
: The final, corrected functional annotation withVEP
, where each line corresponds to a particular alteration in a given gene. This file is the 'Default VEP outout' (tabular) format from https://www.ensembl.org/info/docs/tools/vep/vep_formats.html#output.
-
bcftools_ploidy<ploidy>_out
,HaplotypeCaller_ploidy<ploidy>_out
andfreebayes_ploidy<ploidy>_out
: Folders that contain the raw and filtered vcf files of each fo the programs. -
merged_vcfs_allVars_ploidy<ploidy>.vcf
: A vcf file with the merged output of the algorithms used in the variant calling. You can check the header to understand what are all the tags in theINFO
field. Here, the multiallelic loci are split into different lines to ease the analysis. -
variant_calling_ploidy<ploidy>.tab
: A tabular version ofmerged_vcfs_allVars_ploidy<ploidy>.vcf
. Each column contains information parsed from the vcf file. This file is easier to manage because it has all the info in tabular format. -
variants_atLeast<n_PASS_programs>PASS_ploidy<ploidy>.vcf
: The variants that PASS the filters in at least n_PASS_programs algorithms and also have a fraction of reads covering them above--min_AF
. The INFO and FORMAT fields are simplified to take less space. It may be useful to take, for example, variants that PASS de filters by at least 2 algorithms.
-
annotated_variants.tab
: The final, corrected functional annotation withVEP
, where each line corresponds to a particular alteration in a given gene. This file is the 'Default VEP outout' (tabular) format from https://www.ensembl.org/info/docs/tools/vep/vep_formats.html#output. In addition, this file also contains information about whether the annotation refers to a SNP (in the 'is_snp' field) and whether it is a protein-altering annotation (through the 'is_protein_altering' field).
-
genes_and_regions_coverage.tab
: A table that contains the coverage of all the genes in the provided gff. These are the fields:-
chromosome
,start
,end
andlength
are the coordinates of the gene -
median_reads_per_gene
is the median read depth for this gene -
nocoveragebp_1
is the number of bases that have no coverage -
percentcovered_1
is the perecentage of the region that is covered with 1 read or more -
ID
is the gene ID in the gff (parsed from the attributes). If there are some duplicated IDs with parts it corresponds to the union ofID
andpart
. -
fraction_covered_by_MoreThan1read
is equivalent topercentcovered_1
, but from 0 to 1. -
relative_coverage
is themedian_reads_per_gene
divided by the median of allmedian_reads_per_gene
. - All the fields with a
_+-10kb_region
sufix are equivalent to the ones that do not have it, but for a region that starts at -10kb of the gene start and ends at +10kb of the gene end. This can be useful for CNV analysis.
-
You can also use the script scripts/perSVade.py to execute multiple modules of perSVade with a single command. However, this script is less user friendly and error prone, and it will be no longer maintained after version 1.02.3. This file contains the description of the output of the one-liner version.