Co-occurring variant analyzer
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
lib
scripts
test
Changes
README
coovar.pl

README

CooVar version 0.07 (Nov 20th, 2012)
====================================

(0) INSTALL
(1) MOTIVATION
(2) WHAT IT DOES
(3) HOW IT DOES IT
(4) HOW TO RUN IT
(5) INPUT
(6) OUTPUT
(7) TIPS FOR RUNNING COOVAR FASTER
(8) HOW TO CITE COOVAR

(0) INSTALL
-----------

To install CooVar, open a Linux terminal and run the following commands:

tar xvf CooVar-0.07.tar.gz
cd coovar-0.07
chmod +x scripts/* coovar.pl

The program itself is a Perl script and requires the following Perl modules to be 
installed on your system:

use Cwd;
use Getopt::Long;
use POSIX;
use File::Basename;
use List::Util;
use Bio::DB::Fasta;
use Bio::Seq;
use Bio::SeqUtils;
use Bio::SeqIO;
use Set::IntervalTree;
use Set::IntSpan;

(1) MOTIVATION
--------------

CooVar was conceived as an alternative to previous tools that are tightly linked to databases (e.g. Ensembl, UCSC) and for which 
there is no overall assessment of the impact of co-occurring GVs on protein-coding transcripts. The former makes it hard for people 
working with newly sequenced genomes, genomes that are not available in such databases, or who just have an alternative 
annotation of transcript models for which the impact of different GVs needs to be asssessed. The latter is misleading at the moment of 
evaluating the impact of multiple co-occurring GVs on protein-coding transcripts. For example, a 1 bp insertion and a 1 bp deletion may 
occur on the same transcript, and, if assessed independently, they cause disruption of the ORF, but together cancel each other, 
restoring the ORF. In the same way, 2 substitutions on a same codon may result in a different amino acid change than each substitution 
independently. 

(2) WHAT IT DOES
----------------

CooVar takes as input:
 
(i)   transcript models (GFF3 or GTF format), 
(ii)  the genomic DNA where the models reside and (FASTA format) 
(iii) Single Nucleotide Variants (SNV), Insertions (INS) and Deletions (DEL) (collectively called Genomic Variations, or GVs) 

and returns, for each transcript, 

(i)   a summary of all GVs impacting a protein-coding transcript, with a categorization of the Open Reading Frame (ORF) 
      in: INTACT, PRESERVED, DISRUPTED or FULLY_DELETED.
(ii)  the cDNA before (called the reference) and after applying the GVs (called the variant), 
(iii) the reference and variant protein sequence, 
(iv)  an alignment of the reference and variant cDNA at the exon level, 
(v)   a number of basic statistics for each type of GV. 

(3) HOW IT DOES IT
------------------

CooVar pursues the following steps:

(4.0) Revises the GV file in order to make sure that they follow the specified format (VCF or TAB). SNV and INS involving non-ACTG are 
discarded. Lines not following the specified format are discarded. (revise-gv-files.pl)
(4.1) Extracts the cDNA for each transcript from the genomic DNA, using the transcript model coordinates, (extract-cdna.pl)
(4.2) Applies SNPs to the cDNA (apply-snps.pl)
(4.3) Categorizes SNPs according to its impact (or not) on protein-coding transcripts (categorize-snps.pl).
(4.4) Computes stats on SNPs (get-stats-snps.pl)
(4.5) Applies Insertions and Deletions to the cDNA sequences, and computes stats of length distribution 
(apply-insertions-deletions.pl)
(4.6) If the --circos flag is used, it generates the input files for a 
heatmap visualization of each type of GV and for the exons with the Circos 
tool (Krzywinski et al., 2009). The circos data for GVs represents the number
of base pairs inserted or deleted per interval length. The interval length  
is automatically inferred from the genome size (interval length = genome size / 1000).

(4) HOW TO RUN IT
-----------------

 ./coovar.pl -e EXONS_GFF -r REFERENCE_FASTA (-t GVS_TAB_FORMAT | -v GVS_VCF_FORMAT) [-o OUTPUT_DIRECTORY] [--circos] [--feature_source] [--feature_type]

Parameters in brackets [] are optional.  

(5) INPUT
---------

CooVar takes the following input:

(5.1) A gff3/gtf file specifying coding exons in form of CDS (coding-sequence) entries

Parameter: -e
Each coding exon represents a part of a protein-coding transcript and excludes UTRs. 

Example 1: transcript 2L52.1 in C. elegans in gff3 format:

II      Coding_transcript       CDS     1867    1911    .       +       0       ID=CDS:2L52.1;Note=Zinc finger%2C C2H2;gene=2L52.1 
II      Coding_transcript       CDS     2506    2694    .       +       0       ID=CDS:2L52.1;Note=Zinc finger%2C C2H2;gene=2L52.1 
II      Coding_transcript       CDS     2738    2888    .       +       0       ID=CDS:2L52.1;Note=Zinc finger%2C C2H2;gene=2L52.1
II      Coding_transcript       CDS     2931    3036    .       +       2       ID=CDS:2L52.1;Note=Zinc finger%2C C2H2;gene=2L52.1 
II      Coding_transcript       CDS     3406    3552    .       +       1       ID=CDS:2L52.1;Note=Zinc finger%2C C2H2;gene=2L52.1
II      Coding_transcript       CDS     3802    3984    .       +       1       ID=CDS:2L52.1;Note=Zinc finger%2C C2H2;gene=2L52.1 
II      Coding_transcript       CDS     4201    4663    .       +       1       ID=CDS:2L52.1;Note=Zinc finger%2C C2H2;gene=2L52.1
 
Example 2: transcript ENST00000371026 in Human, in gtf format:

chr1    hg18_ensGene    CDS     67052404        67052451        0.000000        -       0       gene_id "ENST00000371026"; transcript_id "ENST00000371026"; 
chr1    hg18_ensGene    CDS     67060632        67060788        0.000000        -       1       gene_id "ENST00000371026"; transcript_id "ENST00000371026"; 
chr1    hg18_ensGene    CDS     67065091        67065317        0.000000        -       0       gene_id "ENST00000371026"; transcript_id "ENST00000371026"; 
chr1    hg18_ensGene    CDS     67066083        67066181        0.000000        -       0       gene_id "ENST00000371026"; transcript_id "ENST00000371026"; 
chr1    hg18_ensGene    CDS     67071856        67071977        0.000000        -       2       gene_id "ENST00000371026"; transcript_id "ENST00000371026"; 
chr1    hg18_ensGene    CDS     67072262        67072419        0.000000        -       1       gene_id "ENST00000371026"; transcript_id "ENST00000371026"; 
chr1    hg18_ensGene    CDS     67073897        67074048        0.000000        -       0       gene_id "ENST00000371026"; transcript_id "ENST00000371026"; 
chr1    hg18_ensGene    CDS     67075981        67076067        0.000000        -       0       gene_id "ENST00000371026"; transcript_id "ENST00000371026"; 
chr1    hg18_ensGene    CDS     67078740        67078942        0.000000        -       2       gene_id "ENST00000371026"; transcript_id "ENST00000371026"; 
chr1    hg18_ensGene    CDS     67085755        67085949        0.000000        -       2       gene_id "ENST00000371026"; transcript_id "ENST00000371026"; 
chr1    hg18_ensGene    CDS     67100418        67100573        0.000000        -       2       gene_id "ENST00000371026"; transcript_id "ENST00000371026"; 
chr1    hg18_ensGene    CDS     67109641        67109780        0.000000        -       1       gene_id "ENST00000371026"; transcript_id "ENST00000371026"; 
chr1    hg18_ensGene    CDS     67113052        67113208        0.000000        -       2       gene_id "ENST00000371026"; transcript_id "ENST00000371026"; 
chr1    hg18_ensGene    CDS     67129425        67129537        0.000000        -       1       gene_id "ENST00000371026"; transcript_id "ENST00000371026"; 
chr1    hg18_ensGene    CDS     67131500        67131684        0.000000        -       0       gene_id "ENST00000371026"; transcript_id "ENST00000371026"; 
chr1    hg18_ensGene    CDS     67143472        67143646        0.000000        -       1       gene_id "ENST00000371026"; transcript_id "ENST00000371026"; 
chr1    hg18_ensGene    CDS     67162933        67163102        0.000000        -       0       gene_id "ENST00000371026"; transcript_id "ENST00000371026"; 

IMPORTANT: Note that by default CooVar ignores all lines in the exon file that are not of type 'CDS'. This behavior can be changed using the 
           --feature_source and --feature_type program parameter. Also, make sure that different coding exons belonging to the same transcript have 
           (a) the same 'transcript_id' (in case of a GTF input file) or (b) either the same 'ID' (GFF2 input) or the same 'Parent' ID (GFF3 input). 

If the gene attribute is specified ('gene=' in gff3 or 'gene_id' in gtf) then this gene ID is carried over to the transcripts.gff3
output file. This is useful if one wants to know the identity of genes effected by GVs in addition to affected transcripts.  

As you can see, the relevant columns are 

column1: the chromosome/contig
column4: start coordinate
column5: end coordinate,
column7: strand,
column9: transcript ID

Make sure columns are separated by TAB characters (not spaces). The transcript ID in column9 can be 
preceded by either 'Parent=' , 'ID=' or 'transcript_id '. 

(5.2) A reference genomic DNA file in FASTA format
--------------------------------------------------

Parameter: -r
Self explanatory. Avoid incompatibilities between the transcript coordinates and the chromosomes (e.g. transcripts with coordinates 
out of boundaries, or including transcripts from chromosomes that are not in the fasta file). Entries in the FASTA file may have any 
line length up to 65,536 characters, and different line lengths are allowed in the same file. However, within a sequence entry, 
all lines must be the same length except for the last (see Bio::DB::Fasta module).

(5.3) A list of SNV, INS and DEL in either VCF4.0 format or TAB format
----------------------------------------------------------------------

Parameter: -v (if using VCF4.0 format) or -t (if using TAB format)
the Variant Call Format (VCF) was proposed by the 1,000 genomes project consortium as a standard to report variants. Specifications 
can be found in: 

http://www.1000genomes.org/wiki/Analysis/vcf4.0

Note that if multiple alternative alleles are specified for a variant in the VCF file, CooVar will consider only the first
of these alternative alleles. Please make sure that this is the allele that you want to annotate. Multiple 
alternative alleles per variant will be supported in a future release of CooVar. 

The VCF is used by a number of programs that detect SNPs and small indels, including the BCF tools that comes with SAMTools for 
calling SNPs and small indels 

http://samtools.sourceforge.net/mpileup.shtml

and Dindel

ftp://ftp.sanger.ac.uk/pub4/resources/software/dindel/manual-1.01.pdf

The TAB format is an alternative for the cases where the user doesn't have a VCF file. It is a tab-separated file meant to be easy to 
generate. 

An example of a SNV line in the TAB input:

SNP	chr1	10469	C	G 

column1: SNP indicates it is a single basepair difference between the reference and the variant.
column2: chromosome/contig
column3: coordinate within chromosome/contig
column4: reference nucleotide at that coordinate
column5: variant nucleotide at that coordinate

An example of an INS line in the TAB input:

INS	chr1	719490	719491	ATAAT

column1: INS indicates it is an insertion in the variant compared to the reference.
column2: chromosome/contig
column3: start coordinate within chromosome/contig
column4: end coordinate within chromosome/contig 
column5: insertion sequence

Note1: even though insertions are usually annotated to occur between 2 adjacent basepairs X and X+1, this is not required by 
CooVar. If the start and end coordinate are separated by 1 or more basepairs, the region in between is regarded as a 
deletion associated with the insertion. For example, if the insertion is annotated to occur betwen basepairs 10 and 20, then basepairs 
11 to 19 will be deleted.

An example of a DEL line in the TAB input:

DEL	chr1	768120	768122

column1: DEL indicates it is a deletion in the variant compared to the reference.
column2: chromosome/contig
column3: start coordinate within chromosome/contig
column4: end coordinate within chromosome/contig

All base pairs from start to end (including start AND end) are deleted in the variant.

Note2: for INS and DEL the start coordinate HAS to be lesser or equal than the end coordinate, otherwise it will be discarded.

(5.4) Specifying the output directory
-------------------------------------
Parameter: -o
Optional parameter that specifies the output directory for CooVar output files. If the specified output directory 
does not exist, CooVar creates it. If omitted, output files are written to a subdirectory in the current directory
named "output_YYYYMMDD_hhmmss", whereas YYYY, MM, DD, hh, mm, ss denote year, month, day, hour, minute, and second of the
current date/time.

(5.5) Specifying feature source and feature type
------------------------------------------------
Parameter: --feature_source (optional; no default value)
           --feature_type   (optional; default: 'CDS')

If --feature_source and --feature_type are not specified, CooVar by default retains only features of type 'CDS' from the 
input GFF/GTF file, assuming that these features represent coding exons of protein-coding transcripts. 

If --feature_source is specified, CooVar considers only features of that source (= 2nd column of input GFF/GTF).  

If --feature_type is specified, Coovar considers only features of that type (= 3rd column of input GFF/GTF). The default is 'CDS'.

These two parameters can also be used in combination. 

(6) OUTPUT
----------

Assuming that your files are called:

REFERENCE: the genomic DNA
GVs: the list of Genomic Variations in either format
TRANSCRIPTS: gff3 file with the coding exon coordinates

The output of CooVar ("CV") is structured as follows:
.
|-- transcripts.gff3: transcripts, GVs impacting them, status of the ORF. See description in section 6.1 below
|-- categorized-gvs.gvf: each GV and its impact (or not, if silent) on all transcripts. See description in section 6.1 below
|-- deletions
|   |-- distribution_deletions.out: length distribution of all deletions and of those deletions impacting exons.
|   |-- excluded_gvs.del: deletions excluded from the input for not accomplishing the specified format.
|   |-- kept_gvs.del: deletions kept from the input. The ones to be applied to the transcripts. 
|   `-- kept_gvs.del.circos: Circos frequency file for deletions. Designed for heatmap visualization.
|-- insertions
|   |-- distribution_insertions.out: length distribution of all insertions and of those insertions impacting exons.
|   |-- excluded_gvs.ins: insertions excluded from the input for not accomplishing the specified format.
|   |-- kept_gvs.ins: deletions kept from the input. The ones to be applied to the transcripts.
|   `-- kept_gvs.ins.circos: Circos frequency file for insertions. Designed for heatmap visualization.
|-- snps
|   |-- ambiguous_snps.list: SNPs that fall into 2 or more categories by impacting transcripts differently
|   |-- categorized_snps.stat: overall and chromosomal frequencies of SNPs per category.
|   |-- codon_bias.stat: frequency of the location of the SNP within the codon (1,2, or 3) and its impact as syn/non-syn 
|   |-- excluded_gvs.snp: SNPs excluded from the input for not accomplishing the specified format.
|   |-- kept_gvs.list.snp: SNPs kept from the input. The ones to be applied to the transcripts. 
|   `-- kept_gvs.list.snp.circos: Circos frequency file for deletions. Designed for heatmap visualization.
|-- transcripts
|   |-- reference_cdna.exons: reference cDNA shown per exon, per transcript.
|   |-- variant_cdna.exons: variant cDNA shown per exon, per transcript.
|   |-- reference_cdna.fasta: reference cDNA for each transcript.
|   |-- variant_cdna.fasta: variant cDNA for each transcript.
|   |-- reference_peptides.fasta: reference peptide for each transcript.
|   |-- variant_peptides.fasta: variant peptide for each transcript.
|   |-- reference_variant.alignment: alignment of the reference and variant (i.e. all GVs applied) cDNA
|   |-- exons.circos: Circos frequency file for exons on a bp basis. Designed for heatmap visualization.
|   `-- variant.stat: distribution of early stop codons along proteins, frequency of different ORF status and summary of impact of GVs on transcripts

(6.1) transcripts.gff3 and categorized-gvs.gvf
 
The Generic Feature Format Version 3 (GFF3) is a widely used and accepted format for representing genomic features such as trnascript 
models and others. The Genome Variation Format (GVF) is an extension of GFF3 and it is designed to 
describe the impact of Genomic Variants such as SNV, INS and DEL on different genomic features described in a GFF3 file. Hence, both 
files are meant to contain complementary information. More details and specifications for both formats can be found at

http://www.sequenceontology.org/

Given this, CooVar provides two outputs:

(6.1.1) transcripts.gff3

An example of a transcript in the transcripts.gff3 file is shown here:

chr20   CooVar        mRNA    11790883        11830259        .       -       .       ID=ENST00000427835;Note=snp_746931(conservative_missense_codon),ins_66552,del_83066,ORF_DISRUPTED(15.53)
chr20   CooVar        coding_exon     11790883        11791087        .       -       .       Parent=ENST00000427835
chr20   CooVar        coding_exon     11830155        11830259        .       -       .       Parent=ENST00000427835


For each transcript, CooVar provides the coordinates of the overall coding region (mRNA) together with the coordinates for 
each coding exon. In the first line, The Note tag is used to describe , in a comma separated format,

- all snps impacting the protein-coding transcript, and its categorization,
- all insertions impacting the protein-coding transcript,
- all deletions impacting the protein-coding transcript,
- the overall impact of ALL GVs on the ORF (ORF_INTACT/ORF_DISRUPTED/ORF_PRESERVED/FULLY_DELETED)
- if the ORF is found disrupted by the GVs provided as input, then a percentage in parenthesis indicates the location within the 
protein where the first stop codon occurs. 

ORF_INTACT: the coding region of the transcript is not impacted by GVs.
ORF_PRESERVED: the overall impact of GVs on the variant transcript is such that it doesn't introduce an early stop codon, compared to 
the reference.
ORF_DISRUPTED: the overall impact of GVs on the	variant transcript is such that it introduces an early stop codon, compared to  
the reference. 
FULLY_DELETED: the variant transcript is such that only two, one or zero base pairs of the coding region remain after a deletion 
event.

Note: the information associated to each GV listed in the GFF3 file (e.g. snp_746931) can be found in the categorized-gvs.gvf file.

(6.1.2) categorized-gvs.gvf

An example of a SNV, INS and DEL in the categorized-gvs.gvf file is shown:

chr15   CooVar        SNV     80988136        80988136        .       +       .       ID=snp_14204;
Variant_seq=G;Reference_seq=C;Variant_effect=non_conservative_missense_codon 0 mRNA ENST00000258884,ENST00000424700;
Note=tgc>tgG_C>W_aa122_codon_loc3_RADICAL(215),tgc>tgG_C>W_aa122_codon_loc3_RADICAL(215)
chr14   CooVar        insertion       90740444        90740445        .       +       .       ID=ins_1;Variant_seq=G;Reference_seq=~;Variant_effect=frameshift_variant 0 mRNA ENST00000380590
chr14   CooVar        deletion        94935764        94935766        .       +       .       ID=del_9;Variant_seq=-;Reference_seq=~;Variant_effect=inframe_variant 0 mRNA ENST00000380365,ENST00000337425,ENST00000448305

- For each case, columns 1, 4 and 5 represent the chromosome, start and end coordinate of the GV.

In column 8:
- Variant_seq represents the variant sequence for this GV.
- Reference_seq represents the reference sequence for each GV.
- Variant_effect represents the predicted effect of this GV on the ORF (e.g. stop gained, synonymous, conservative_missense, non_conservative_missense, etc.) 
format. All GVs impacting a transcript are found in the corresponding GFF3 file.
- The "Note=" tag for SNPs can be lengthy and it displays the following information:
-- codon_reference>codon_variant. Note that this codon can come from the + or - strand of the genomic DNA. 
-- aa_reference>aa_variant.
-- location of the SNP within the codon (1,2 or 3).
-- functional impact of the amino acid change for mis-sense mutations together with the associated Grantham Score in parenthesis.


(7) TIPS FOR RUNNING COOVAR FASTER
--------------------------------------------

Some steps of CooVar can take time when run on very large (human sized) data files. However, even in this case 
CooVar should return results in a reasonable amount of time (~1h). One way to speed up execution is to run 
CooVar for different chromosomes separately. Also, consider commenting out lines in the coovar.pl script
if re-running a particular step of the analysis is not required. For example, by commenting out the line executing
the 'extract-cdna.pl' script, CooVar would not extract cDNA sequences when it is run again, using existing output
files from previous runs.

(8) HOW TO CITE COOVAR
----------------------

If you find CooVar useful in your work, please cite:

Vergara IA, Frech C and Chen N. 
CooVar: Co-occurring variant analyzer
BMC Research Notes 2012, 5:615