/
vcf_creator_user_guide.txt
126 lines (114 loc) · 11.6 KB
/
vcf_creator_user_guide.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
VCF_creator : module of DiscoSnp++
Mapping and VCF Creation features
Table of contents
User part 1
VCF_creator : one module in DiscoSnp++ Mapping and VCF Creation 1
Quick starting 1
Running VCF_creator 1
Options 2
Output 2
Examples for the filter fields 5
User part
VCF_creator : one module in DiscoSnp++ Mapping and VCF Creation
VCF_creator is designed to map on a genome the output of DiscoSnp++ the Single Nucleotide Polymorphism (SNP) and small indels. This module create a Variant Calling Format (VCF) from the output of DiscoSnp++ or from an alignment obtained with the software BWA *.
Quick starting
Download and uncompress the DiscoSnp++
Install BWA (you can add it to your PATH )
Run the simple example:
XXX
Running VCF_creator
The main script run_VCF_creator.sh has three modes according to your needs :
MODE 1 : You don't have a reference genome but you want to create a vcf (it will summarize the DiscoSnp++ informations and will give the position of the variant on the upper path of DiscoSnp++ variant). The module will create a vcf from the output of DiscoSnp++ :
./run_VCF_creator.sh -p <disco_file> -o <output> [-s <bwa_errors_in_seed>]
MODE 2 : You have a reference genome and you want to align your variants against a reference genome. The module will run BWA to make an alignment between your reference genome and the output of DiscoSnp++ :
./run_VCF_creator.sh -G <ref> -p <disco_file> -o <output> [-B <path_bwa>] [-l <seed_size>] [-n <mismatch_number>] [-s <bwa_errors_in_seed>] [-w]
MODE 3 : You already have an alignment (.sam file) and you want to create a vcf file :
./run_VCF_creator.sh -f <sam_file> -n <mismatch_number> -o <output>
Options
General options :
-p : DiscoSnp++ output file (<file>.fasta) (Mandatory unless MODE 3)
-o : ouput (<file>.vcf) (Mandatory)
-G : reference genome (<file.fasta>) (Only in MODE 2)
-B : bwa path ( /home/me/my_programs/bwa-0.7.12/) (note that bwa must be pre-compiled) (Only in MODE 2) (not necessary if BWA is in the path)
-f : alignment already done (<file>.sam) (Only in MODE 3)
-h : show help
-w : remove waste tmp files (bwa index files)
-I : create of an output (VCF format) specific to IGV (Integrative Genomics Viewer) : sorting VCF by mapping positions and removing unmapped variants
BWA specifics options (- warning, bwa mapping running time highly depends on this parameter) :
-s : bwa option (-s) : seed distance ( optionnal, default value 0)
-l : bwa option (-l) : length of the seed for alignment (optionnal, default value 10)
-n : bwa option (-n) : maximal bwa mapping distance
Optionnal in MODE 1 and MODE 3 ( default value 3)
Mandatory in MODE 3, it's preferable to use the same value used during alignment by BWA (default value 3)
Sample example: XXX
Output
Final results are in your <file>.vcf. It's a Variant Call Format that will summarize all the mapping information and the header of DiscoSnp++ informations. Example :
MODE 1 :
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT G1 G2 G3 G4 G5
SNP_higher_path_13736 30 13736 A G . . Ty=SNP;Rk=0.99909;UL=.;UR=.;CL=.;CR=. GT:DP:PL:AD 0/0:53447:1947,160066,1067676:53425,22 0/1:3:40,9,20:1,2 1/1:94862:1895477,284398,3575:30,94832 0/1:5:34,9,54:3,2 1/1:65389:1306661,196101,2493:19,65370
SNP_higher_path_5442 30 5442_1 A G . . Ty=SNP;Rk=0.42995;UL=.;UR=.;CL=.;CR=. GT:DP:PL:AD 0|1:19:98,14,198:12,7 0|1:18:138,12,138:9,9 0|1:8:66,10,66:4,4 0|1:130:257,124,1813:104,26 0|0:224:14,598,4324:220,4
SNP_higher_path_5442 31 5442_2 G A . . Ty=SNP;Rk=0.42995;UL=.;UR=.;CL=.;CR=. GT:DP:PL:AD 0|1:19:98,14,198:12,7 0|1:18:138,12,138:9,9 0|1:8:66,10,66:4,4 0|1:130:257,124,1813:104,26 0|0:224:14,598,4324:220,4
INDEL_higher_path_586 30 586 GGAC G . . Ty=INS;Rk=0.40534;UL=.;UR=.;CL=.;CR=. GT:DP:PL:AD 0/1:127:837,17,977:67,60 1/1:387:6056,518,408:52,335 0/1:70:372,21,651:42,28 0/1:126:1395,53,477:40,86 0/1:102:671,16,791:54,48
MODE 2 and MODE 3 :
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT G1 G2 G3 G4 G5
Pseudomonas 109860 15103 C G . PASS Ty=SNP;Rk=0.65101;DT=1;UL=.;UR=.;CL=.;CR=.;Genome=C;Sd=-1 GT:DP:PL:AD 1/1:4:84,16,4:0,4 1/1:1:24,7,4:0,1 1/1:1:24,7,4:0,1 0/1:6:17,15,97:5,1 0/1:10:124,14,44:3,7
Pseudomonas 4416118 1416_1 A G . PASS Ty=SNP;Rk=0.41638;DT=1;UL=.;UR=.;CL=.;CR=.;Genome=A;Sd=1 GT:DP:PL:AD 1|1:35:606,71,28:3,32 0|1:26:403,41,43:4,22 1|1:34:617,79,18:2,32 0|1:46:316,14,356:24,22 0|1:43:547,36,128:11,32
Pseudomonas 4416119 1416_2 A C . PASS Ty=SNP;Rk=0.41638;DT=1;UL=.;UR=.;CL=.;CR=.;Genome=A;Sd=1 GT:DP:PL:AD 1|1:35:606,71,28:3,32 0|1:26:403,41,43:4,22 1|1:34:617,79,18:2,32 0|1:46:316,14,356:24,22 0|1:43:547,36,128:11,32
Pseudomonas 1312837 14361 C CGTAGAAGATCTC . PASS Ty=DEL;Rk=0.26896;DT=0;UL=.;UR=.;CL=.;CR=.;Genome=.;Sd=-1 GT:DP:PL:AD 0/0:792:47,1978,15015:771,21 0/0:745:34,1905,14223:728,17 0/0:697:37,1765,13268:680,17 0/0:783:32,2016,14979:766,17 0/1:849:1303,868,12339:701,148
Pseudomonas 1753272 6164 TGTGCAGGTCGAAGACTTGA T . PASS Ty=INS;Rk=0.13677;DT=0;UL=.;UR=.;CL=.;CR=.;Genome=.;Sd=1 GT:DP:PL:AD 0/0:1086:34,2817,20829:1064,22 0/0:1158:34,3011,22226:1135,23 0/0:1003:32,2608,19250:983,20 0/0:1486:124,3555,27823:1437,49 0/0:1132:585,1967,19224:1033,99
In this example, the three types of DiscoSnp++ variants : in red : simple SNP ; in green close SNPs ; in black INDEL.
The VCF file has the following fields :
CHROM : chromosome id where the prediction is mapped, or allele id of the upper path if the variant is unmapped or if no reference genome is provided.
POS (1-based leftmost position):
If a reference genome is provided and if the variant is mapped on a unique position : the mapping position of the variant
If a reference genome is provided and if the variant is not uniquely mapped : one of the positions of the variant (1-based lefmost position)
Else (no reference genome provided or unmapped variant) : position of the variant on the upper path of the discoSnp++ prediction (including the left extension)
ID : identification of the variant (used by DiscoSnp++). For the close SNPs, the SNP number is added to the ID. Example : 10388_2
REF :
If one of the two predicted allele maps this position : the corresponding variant
Else, or if no reference genome provided : the lexicographically smallest of the two variants
In case of close SNPs : the first is defined as previously described. The following SNPs are those located on the same path
ALT : The variant non reported as the “REF” variant
QUAL : “.” (unused)
FILTER :
PASS if the variant is mapped at unique position
MULTIPLE if the variant is mapped on multiple positions
“.” : if the variant is unmapped or if no reference genome is provided
INFO :
Ty : Type of variant
SNP : If the variant is a simple SNP or close SNPs
INS : If the variant mapped corresponds to the longuest path ; the alt carries the deletion
DEL : If the variant mapped corresponds to the shortest path ; the alt carries the deletion
Rk : Rank of the prediction computed by DiscoSnp++ (if several datasets are used in DiscoSnp++, ranks the predictions according to their read coverage in each condition favoring SNPs that are discriminant between conditions value between 0 and 1)
DT : If the variant is mapped on a unique position : distance of the mapping (number of mismatches). If the variant is unmapped or mapped on multiple positions : “-1”
UL : Length of the left unitig (“.” if not computed)
UR : Length of the right unitig (“.” if not computed)
CL : Length of the left contig (“.” if not computed)
CR : Length of the right contig (“.” if not computed)
Genome : Applies only for SNPs when a reference genome is provided (“.” for INDELs and when no reference genome provide or if the variant is unmapped). Reference nucleotide (!!nucleotide in the reference genome !! In general it is correspond to the REF field ; could be different for close snps). Important Remark : If one of the two predictions matches the reference : equal to the “REF” field, else equal to the nucleotide of the reference genome.
Sd : Applies only when a reference genome is provided (“.” if no reference genome provided or if the variant is unmapped). Strand of the prediction mapping. “1” : Forward ; “-1” : Reverse. Important Remark : Fields “REF” , “ALT” and “Genome” are based on the mapped predictions. If Sd is 1 then these fields correspond to the DiscoSnp++ prediction, else if the Sd is “-1”, then they correspond to the reverse complement of the DiscoSnp++ predictions.
FORMAT : Description of the genotype fields (G1, G2, G3 …)
GT : genotype, encodes as allele values with 0 corresponding to the reference and 1 to the alternative. About genotypes :
If the separator is a “/” the genotypes are unphased ( INDEL, Simple SNP)
If the separator is a “|” the genotypes are phased (Close SNPs with the same ID )
DP : Cumulated depth across samples (sum)
PL : Phred-scaled Genotype Likelihood (given by DiscoSnp++)
AD : Depth of each allele by sample
Examples
FILTER Fields:
PASS if the variant is mapped at unique position
MULTIPLE if the variant is mapped on multiple positions
“.” : if the variant is unmapped or if no reference genome is provided
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT G1 G2 G3 G4 G5
Pseudomonas 1945315 8816 C G . PASS Ty=SNP;Rk=0.63574;DT=1;UL=.;UR=.;CL=.;CR=.;Genome=C;Sd=1 GT:DP:PL:AD 1/1:8:135,19,16:1,7 0/0:2:4,10,44:2,0 0/0:1:4,7,24:1,0 0/1:8:30,14,110:6,2 0/1:5:54,9,34:2,3
Pseudomonas 3190754 768 A G . MULTIPLE Ty=SNP;Rk=0.06837;DT=0;UL=.;UR=.;CL=.;CR=.;Genome=A;Sd=-1 GT:DP:PL:AD 0/1:1254:8182,28,9459:659,595 0/1:1235:9809,42,7594:562,673 0/1:903:7217,37,5521:409,494 0/1:473:4217,52,2520:194,279 0/1:1416:10513,26,9395:680,736
VCF for IGV : If you want to use IGV (Integrative Genomics Viewer) to visualize your data, VCF_creator can create a vcf specific to IGV (0-based, sorted by positions, and without unmapped variants)
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT G1 G2 G3 G4 G5
Pseudomonas 1206 7667_1 G T . PASS Ty=SNP;Rk=0.10501;DT=1;UL=.;UR=.;CL=.;CR=.;Genome=G;Sd=-1 GT:DP:PL:AD 0|1:115:1172,36,513:41,74 0|1:142:1154,19,875:64,78 0|1:90:707,16,587:42,48 0|1:111:1190,43,452:37,74 0|1:102:960,26,521:40,62
Pseudomonas 1210 7667_2 T G . PASS Ty=SNP;Rk=0.10501;DT=1;UL=.;UR=.;CL=.;CR=.;Genome=T;Sd=-1 GT:DP:PL:AD 0|1:115:1172,36,513:41,74 0|1:142:1154,19,875:64,78 0|1:90:707,16,587:42,48 0|1:111:1190,43,452:37,74 0|1:102:960,26,521:40,62
Pseudomonas 1214 8588 T G . PASS Ty=SNP;Rk=0.02395;DT=0;UL=.;UR=.;CL=.;CR=.;Genome=T;Sd=1 GT:DP:PL:AD 0/0:1059:32,2755,20328:1038,21 0/0:1288:16,3518,25081:1272,16 0/0:830:21,2204,16026:816,14 0/0:907:14,2486,17676:896,11 0/0:993:19,2666,19237:978,15
Pseudomonas 1217 12549_2 A G . PASS Ty=SNP;Rk=0.05280;DT=1;UL=.;UR=.;CL=.;CR=.;Genome=A;Sd=1 GT:DP:PL:AD 0|1:284:1659,27,2378:160,124 0|1:379:2586,19,2766:194,185 0|1:231:1509,19,1768:122,109 0|1:321:1852,30,2710:182,139 0|1:260:1820,17,1860:131,129
Pseudomonas 1237 10749 T G . PASS Ty=SNP;Rk=0.02921;DT=0;UL=.;UR=.;CL=.;CR=.;Genome=T;Sd=1 GT:DP:PL:AD 0/0:995:14,2806,19551:987,8 0/0:1252:15,3550,24642:1243,9 0/0:784:14,2240,15461:779,5 0/0:812:31,2420,16195:811,1 0/0:944:15,2696,18615:938,6
Pseudomonas 1243 860 T G . PASS Ty=SNP;Rk=0.02113;DT=0;UL=.;UR=.;CL=.;CR=.;Genome=T;Sd=-1 GT:DP:PL:AD 0/0:1008:14,2845,19811:1000,8 0/0:1247:15,3535,24542:1238,9 0/0:775:19,2258,15366:772,3 0/0:782:13,2213,15380:776,6 0/0:921:20,2672,18240:917,4
Pseudomonas 5133 14601 C A . PASS Ty=SNP;Rk=0.04497;DT=0;UL=.;UR=.;CL=.;CR=.;Genome=C;Sd=1 GT:DP:PL:AD 0/0:1458:41,4306,29017:1455,3 0/0:1613:75,4860,32264:1613,0 0/0:1044:20,3017,20654:1039,5 0/0:1013:48,3054,20264:1013,0 0/0:1215:46,3631,24253:1214,1
Pseudomonas 6702 8511_1 T G . PASS Ty=SNP;Rk=0.05088;DT=1;UL=.;UR=.;CL=.;CR=.;Genome=T;Sd=1 GT:DP:PL:AD 0|1:200:1360,17,1479:103,97 0|1:196:1177,22,1616:109,87 0|1:118:843,16,843:59,59 0|1:182:1041,25,1560:104,78 0|1:160:1040,18,1239:85,75