<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html><head><title>Predicting Genes with AUGUSTUS</title>
<link rel="stylesheet" type="text/css" href="augustus.css">
<script src="tutorial.js" type="text/javascript"></script>
</head><body>
<font size=-1>
Navigate to <a href="index.html">Lab Session on AUGUSTUS</a>.
<a href="scipio.html">Using Scipio</a>.
<a href="training.html">Training AUGUSTUS</a>.
<a href="ppx.html">AUGUSTUS-PPX</a>.
</font>
<div align="right">Show <a href="javascript:allOn()">all</a> / <a href="javascript:allOff()">no</a> details.</div>
<h1>Predicting Genes with AUGUSTUS</h1>
This tutorial describes various typical settings for predicting genes with AUGUSTUS.
<h2 id="abinitiopred">1. PREDICT GENES AB INITIO</h2>
<i>Ab initio</i> prediction means that no other input is used than the target genome itself.
Further below, you will find examples of predictions that use evidence (hints); here we
use none.
<span class="assignment">Predict the genes in the range 7,000,001-7,500,000 of chr2R of <i>D. melanogaster</i></span>. Use the FASTA format file <a href="data/chr2R.fa"><tt>chr2R.fa</tt></a>, which includes the whole chromosome 2R.
<span class="assignment">To shorten this test run</span> (or when running several
jobs in parallel) you should <span class="assignment">specify a subrange</span> of the
input sequence using the parameters
<tt>--predictionStart</tt> and <tt>--predictionEnd</tt>.
<pre class="code">augustus --species=fly --predictionStart=7000001 --predictionEnd=7500000 chr2R.fa > augustus.abinitio.gff # takes ~1m</pre>
In this example, I am using the <tt>fly</tt> parameters for comparability with the
predictions below. Of course, the self-trained <tt>bug</tt> parameters also work.
The output file <span class="result"><tt>augustus.abinitio.gff</tt></span> now contains
the predicted gene structures in GFF format with additional comments (lines starting with #).
<pre class="code">
# This output was generated with AUGUSTUS (version 2.5).
...
# start gene g1
chr2R AUGUSTUS gene 7007533 7010935 0.02 - . g1
chr2R AUGUSTUS transcript 7007533 7010935 0.02 - . g1.t1
chr2R AUGUSTUS tts 7007533 7007533 . - . transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS exon 7007533 7008630 . - . transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS stop_codon 7007610 7007612 . - 0 transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS intron 7008631 7008694 1 - . transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS intron 7008812 7008865 0.88 - . transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS intron 7009192 7009251 0.95 - . transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS CDS 7007610 7008630 1 - 1 transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS CDS 7008695 7008811 0.88 - 1 transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS exon 7008695 7008811 . - . transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS CDS 7008866 7009191 0.99 - 0 transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS exon 7008866 7009191 . - . transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS CDS 7009252 7009353 0.95 - 0 transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS exon 7009252 7009429 . - . transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS start_codon 7009351 7009353 . - 0 transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS exon 7010820 7010935 . - . transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS tss 7010935 7010935 . - . transcript_id "g1.t1"; gene_id "g1";
# protein sequence = [MNTLSSARSVAIYVGPVRSSRSASVLAHEQAKSSITEEHKTYDEIPRPNKFKFMRAFMPGGEFQNASITEYTSAMRKR
# YGDIYVMPGMFGRKDWVTTFNTKDIEMVFRNEGIWPRRDGLDSIVYFREHVRPDVYGEVQGLVASQNEAWGKLRSAINPIFMQPRGLRMYYEPLSNIN
# NEFIERIKEIRDPKTLEVPEDFTDEISRLVFESLGLVAFDRQMGLIRKNRDNSDALTLFQTSRDIFRLTFKLDIQPSMWKIISTPTYRKMKRTLNDSL
# NVAQKMLKENQDALEKRRQAGEKINSNSMLERLMEIDPKVAVIMSLDILFAGVDATATLLSAVLLCLSKHPDKQAKLREELLSIMPTKDSLLNEENMK
# DMPYLRAVIKETLRYYPNGLGTMRTCQNDVILSGYRVPKGTTVLLGSNVLMKEATYYPRPDEFLPERWLRDPETGKKMQVSPFTFLPFGFGPRMCIGK
# RVVDLEMETTVAKLIRNFHVEFNRDASRPFKTMFVMEPAITFPFKFTDIEQ]
# end gene g1
...
</pre>
For a description of the GFF format see <a href="http://www.sanger.ac.uk/resources/software/gff/spec.html">the GFF definition at the Sanger Centre</a>.<br><br>
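<p>To get a quick overview of such an output file, it can help to tally the feature types in the third column. The following is a minimal sketch; the function name <tt>count_features</tt> is made up for this illustration and is not part of AUGUSTUS.</p>
<pre class="code">
# count_features FILE: tally the feature types (third column) of all
# non-comment lines of a GFF file. Hypothetical helper, not part of AUGUSTUS.
count_features() {
    grep -v "^#" "$1" | cut -f3 | sort | uniq -c
}
</pre>
<p>Calling <tt>count_features augustus.abinitio.gff</tt> should print one line per feature type (gene, transcript, CDS, ...) preceded by its count; the count in the <tt>gene</tt> line is the number of predicted genes.</p>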
If you also want the protein sequences you can retrieve them with
<pre class="code">getAnnoFasta.pl augustus.abinitio.gff</pre>
which extracts the peptide sequences into a file <span class="result"><tt>augustus.abinitio.aa</tt></span>:
<pre class="code">
>g1.t1
MNKLNLVLITEEHKTYDEIPRPNKFKFMRAFMPGGEFQNASITEYTSAMRKRYGDIYVMPGMFGRKDWVTTFNTKDIEMVFRNEGIWPRRDGLDSIVYFR
EHVRPDVYGEVQGLVASQNEAWGKLRSAINPIFMQPRGLRMYYEPLSNINNEFIERIKEIRDPKTLEVPEDFTDEISRLVFESLGLVAFDRQMGLIRKNR
DNSDALTLFQTSRDIFRLTFKLDIQPSMWKIISTPTYRKMKRTLNDSLNVAQKMLKENQDALEKRRQAGEKINSNSMLERLMEIDPKVAVIMSLDILFAG
VDATATLLSAVLLCLSKHPDKQAKLREELLSIMPTKDSLLNEENMKDMPYLRAVIKETLRYYPNGLGTMRTCQNDVILSGYRVPKGTTVLLGSNVLMKEA
TYYPRPDEFLPERWLRDPETGKKMQVSPFTFLPFGFGPRMCIGKRVVDLEMETTVAKLIRNFHVEFNRDASRPFKTMFVMEPAITFPFKFTDIEQ
>g2.t1
MRHRNKGAVKRKGPSAGAEQEQELKKPKSEFSNGFKRYITEEHKTYDEIPRPNKFKFMRAFMPGGEFQNASITEYTSAMRKRYGDIYVMPGMFGRKDWVT
...
</pre>
<h2 id="customtrack">2. MAKE A CUSTOM GENE PREDICTION TRACK ON THE UCSC GENOME BROWSER</h2>
In order to visually inspect our results and to compare with the FlyBase annotation we will
now make a custom track of the gene structures in <span class="result"><tt>augustus.abinitio.gff</tt></span>. The custom track file needs a few header lines, which we can create either manually with an editor or on the command line as shown below (cut and paste).
<pre class="code">
echo -e "browser position chr2R:7000000-7050000\n\
browser hide multiz15way bdtnpChipper\n\
track name=abinitio description=\"Augustus ab initio predictions\" db=dm3 visibility=3" > abinitio.browser
grep -P "AUGUSTUS\tCDS" augustus.abinitio.gff >> abinitio.browser
</pre>
The <tt>grep</tt> command keeps only the lines that specify the coordinates of coding exons (CDS).<br>
The resulting file <span class="result"><tt>abinitio.browser</tt></span> is now in UCSC custom track format and looks like this:
<pre class="code">
browser position chr2R:7000000-7050000
browser hide multiz15way bdtnpChipper
track name=abinitio description="Augustus ab initio predictions" db=dm3 visibility=3
chr2R AUGUSTUS CDS 7007610 7008630 1 - 1 transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS CDS 7008695 7008811 0.88 - 1 transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS CDS 7008866 7009191 0.99 - 0 transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS CDS 7009252 7009353 0.95 - 0 transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS CDS 7011376 7012396 1 - 1 transcript_id "g2.t1"; gene_id "g2";
chr2R AUGUSTUS CDS 7012461 7012577 0.92 - 1 transcript_id "g2.t1"; gene_id "g2";
chr2R AUGUSTUS CDS 7012632 7012957 0.99 - 0 transcript_id "g2.t1"; gene_id "g2";
...
</pre>
<p>Now upload this file as a custom track on the UCSC genome browser.</p>
<a href="javascript:onoff('aibr')" class="dlink"><span id="aibr" title="aibrd" class="dcross">[+]</span>
<span class="dtitle">How again?</span></a> <br>
<div id="aibrd" class="details" style="display:none;">
<ol>
<li><span class="assignment">open the <a href="http://genome.ucsc.edu/cgi-bin/hgTracks?db=dm3" target="_blank">browser for <i>Drosophila melanogaster</i></a>,</span></li>
<li> <span class="assignment">click on "add custom tracks" or "manage custom tracks"</span>,</li>
<li> <span class="assignment">upload the file <tt>abinitio.browser</tt></span> and</li>
<li> <span class="assignment">click on the link in the column "Pos" in the table
of custom tracks</span>.</li>
</ol>
</div><br>
The lazy ones can look at the result by clicking this link to a previously
<a href="http://genome.ucsc.edu/cgi-bin/hgTracks?db=dm3&hgt.customText=http://bioinf.uni-greifswald.de/augustus/binaries/tutorial/results/abinitio.browser" target="_blank">prepared custom track</a>.
<h2 id="prephints">3. PREPARE HINTS</h2>
<p>Hints are extrinsic evidence about the location and structure of genes, given in
a particular GFF format. When predicting genes, AUGUSTUS can incorporate these hints,
which change the likelihood of gene structure candidates. It will tend to
predict gene structures that are in agreement with the hints. </p>
<h4>Sources of Hints</h4>
<table>
<tr>
<td>ESTs or mRNAs</td>
<td>transcriptome reads, long enough to span several exons (454, Sanger)</td>
</tr>
<tr>
<td>RNA-Seq</td>
<td>high coverage short read transcriptome sequences (Illumina, SOLiD)</td>
</tr>
<tr>
<td>genomic conservation </td>
<td></td>
</tr>
<tr>
<td>MSMS</td>
<td></td>
</tr>
</table>
<p> Below, we will practice the preparation of hints from ESTs or mRNA
(<a href="#hest">3.1</a>) and from RNA-Seq (<a href="#hrnaseq">3.2</a>).
</p>
<h3 id="hest">3.1 From ESTs</h3>
As an example, we will use a set of 8458 ESTs which map to
chr2R:7000000-8000000 of <i>Drosophila</i>: <a href="data/est.chr2R.7M-8M.fa">est.chr2R.7M-8M.fa</a>.
<p><span class="assignment">Align the ESTs against chr2R</span> using BLAT.</p>
<pre class="code">
blat -noHead chr2R.fa est.chr2R.7M-8M.fa est.psl # takes ~3m
</pre>
This creates an alignment file <span class="result"><tt>est.psl</tt></span>
in <a href="http://genome.ucsc.edu/FAQ/FAQformat.html#format2">PSL format</a>:
<pre class="code" style="font-size:small">
440 5 0 2 0 0 1 1281 - gi|1703783 447 0 447 chr2R 21146708 7776697 7778425 2 197,250, 0,197, 7776697,7778175,
494 3 0 1 0 0 2 65 + gi|1703784 498 0 498 chr2R 21146708 7775550 7776113 3 452,12,34, 0,452,464, 7775550,7776003,7776079,
...
</pre>
However, typically some ESTs align well to very many places in the genome. BLAT also
includes short local alignments starting from 30bp. For this reason, we further
<span class="assignment">filter the alignments</span>:
<pre class="code">cat est.psl | filterPSL.pl --best --minCover=80 > est.f.psl</pre>
<span class="result"><tt>est.f.psl</tt></span> now contains, for each query, only
the best alignment(s), and only those that cover at least 80% of the query length.
This reduces the number of alignments:
<pre class="code">
wc -l est.psl est.f.psl
# 10487 est.psl
# 8606 est.f.psl
</pre>
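<p>The coverage criterion can be illustrated with a few lines of <tt>awk</tt>. This is only a rough sketch of the idea, under the assumption that coverage is the aligned query range (qEnd - qStart, PSL columns 13 and 12) divided by the query size (column 11). <tt>filterPSL.pl</tt> may compute coverage differently, and it additionally selects the best alignment(s) per query, which this sketch does not do. The function name <tt>psl_min_cover</tt> is made up for this illustration.</p>
<pre class="code">
# psl_min_cover PERCENT: read PSL lines (without header) from standard input and
# print those whose aligned query range covers at least PERCENT of the query.
# Assumption: coverage = 100 * (qEnd - qStart) / qSize, i.e. columns 13, 12, 11.
psl_min_cover() {
    awk -v min="$1" '$11 > 0 { if (100 * ($13 - $12) / $11 >= min) print }'
}
</pre>
<p>For example, <tt>cat est.psl | psl_min_cover 80 | wc -l</tt> would count the alignments that pass this simplified coverage test alone.</p>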
We can have a look at those EST alignments by <span class="assignment">creating another custom browser track</span>:
<pre class="code">
echo -e "browser position chr2R:7000000-7050000\n\
track name=ESTs description=\"EST alignments\" db=dm3 visibility=4" > ests.browser
cat est.f.psl >> ests.browser
gzip ests.browser
</pre>
<p>You can now <span class="assignment">upload <span class="result"><tt>ests.browser.gz</tt></span> as another custom track</span> or <span class="assignment">click on this <a href="http://genome.ucsc.edu/cgi-bin/hgTracks?db=dm3&hgt.customText=http://bioinf.uni-greifswald.de/augustus/binaries/tutorial/results/ests.browser.gz" target="_blank">prepared custom track</a></span></p>
Next, <span class="assignment">generate hints from the EST alignments</span>:
<pre class="code">blat2hints.pl --nomult --in=est.f.psl --out=hints.est.gff</pre>
The script <tt>blat2hints.pl</tt> identifies the positions of likely introns,
exons and exonic regions (termed <i>exonpart</i> or <i>ep</i>) from the alignments.
The file <span class="result"><tt>hints.est.gff</tt></span> now contains these hints, sorted by the third column. Hints that belong together share the group name following <tt>grp=</tt> in the last column.<br>
For example,
<pre class="code">grep 15058069 hints.est.gff</pre>
yields
<pre class="code">
chr2R b2h ep 7559574 7559803 0 . . grp=gi|15058069;pri=4;src=E
chr2R b2h ep 7560550 7560814 0 . . grp=gi|15058069;pri=4;src=E
chr2R b2h exon 7560222 7560347 0 . . grp=gi|15058069;pri=4;src=E
chr2R b2h intron 7559804 7560221 0 . . grp=gi|15058069;pri=4;src=E
chr2R b2h intron 7560348 7560549 0 . . grp=gi|15058069;pri=4;src=E
</pre>
<h3 id="hrnaseq">3.2 From RNA-Seq</h3>
Massive amounts of (short) transcriptome reads from next generation sequencing
methods like Illumina first need to be aligned to the genome. Recently, a large
number of short read aligners have been developed. For the sake of this tutorial we will
assume that we have already aligned the reads to the genome and that we have
two files:
<ol>
<li>The file <a href="data/chr2R.7M-8M.wig"><tt>chr2R.7M-8M.wig</tt></a> contains a coverage graph, which gives for each base in the window chr2R:7000000-8000000 the number of
read alignments that cover the position. The UCSC group has a <a href="http://genome.ucsc.edu/FAQ/FAQformat.html#format6">description of the wiggle format (.wig)</a>.</li>
<li>The file <a href="data/hints.rnaseq.intron.gff"><tt>hints.rnaseq.intron.gff</tt></a>
contains likely intron positions, inferred from gaps in the query of the read alignments. Together with the intron boundaries the multiplicity (mult) is reported, which
counts the number of alignments that support the given intron candidate, if there is more than one.</li>
</ol>
In this <a href="http://bioinf.uni-greifswald.de/augustus/binaries/readme.rnaseq.html">readme
about AUGUSTUS in the RGASP assessment</a> a method is described that produces two such files. TopHat is a spliced read mapper for RNA-Seq that also produces
a coverage graph and reports introns with their multiplicity.
<p>
<span class="assignment">Upload the file <tt>chr2R.7M-8M.wig</tt> as a UCSC custom track</span> (<tt>gzip</tt>ing would speed up upload) or <span class="assignment">click on this <a href="http://genome.ucsc.edu/cgi-bin/hgTracks?db=dm3&hgt.customText=http://bioinf.uni-greifswald.de/augustus/binaries/tutorial/data/chr2R.7M-8M.wig.gz" target="_blank">prepared custom track</a></span>.</p>
<p><span class="assignment">Generate hints about exonic regions from the coverage graph</span> (wig file):
<pre class="code">
cat chr2R.7M-8M.wig | wig2hints.pl --width=10 --margin=10 --minthresh=2 --minscore=4 \
--src=W --type=ep --radius=4.5 > hints.rnaseq.ep.gff
</pre>
The resulting <span class="result"><tt>hints.rnaseq.ep.gff</tt></span> now
contains hints in a GFF format that <tt>augustus</tt> understands:
<pre class="code">
chr2R w2h ep 7007551 7007560 5.300 . . src=W;mult=5;
chr2R w2h ep 7007561 7007570 7.400 . . src=W;mult=7;
chr2R w2h ep 7007571 7007580 9.700 . . src=W;mult=9;
chr2R w2h ep 7007581 7007590 10.200 . . src=W;mult=10;
</pre>
Again, the multiplicity (mult) at the end of the last column reflects the
average coverage in the given interval. <tt>augustus</tt> will consider each
such line evidence that the sequence interval is <i>part of an exon</i> (ep=exonpart).
<h3 id="concath">3.3 Concatenate all Hints</h3>
Now, <span class="assignment">join the hints from all sources</span> into one file:
<pre class="code">cat hints.est.gff hints.rnaseq.intron.gff hints.rnaseq.ep.gff > hints.gff</pre>
<span class="result"><tt>hints.gff</tt></span> now contains the exon, intron and exonpart hints from ESTs as well as the intron and exonpart hints from RNA-Seq.
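<p>A quick way to check the joined file is to count the hints by source and type. The following is a minimal sketch that assumes a tab-separated hints file with a <tt>src=</tt> attribute in the last column, as in the examples above; the function name <tt>hint_summary</tt> is hypothetical and not part of AUGUSTUS.</p>
<pre class="code">
# hint_summary FILE: count the hints in a tab-separated hints file by source
# (the src= attribute in column 9) and hint type (column 3).
# Hypothetical helper for illustration, not part of AUGUSTUS.
hint_summary() {
    awk -F'\t' '!/^#/ {
        if (match($9, /src=[A-Za-z]+/)) {
            src = substr($9, RSTART + 4, RLENGTH - 4)
            count[src "\t" $3]++
        }
    }
    END { for (k in count) print k "\t" count[k] }' "$1" | sort
}
</pre>
<p>Calling <tt>hint_summary hints.gff</tt> prints one line per combination of source and hint type, followed by the number of hints of that kind.</p>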
<h2 id="extr">4. SET HINT PARAMETERS</h2>
The strength of the influence of the hints can be adjusted with a few parameters
from no influence (ab initio) up to the point where they should be trusted
completely and "anchor" the gene structure. These parameters are stored in an
extrinsic configuration file.
The folder <tt>config/extrinsic</tt> of the AUGUSTUS package contains a
few examples.
<p>Start by <span class="assignment">copying another extrinsic configuration file</span>:</p>
<pre class="code">cp $AUGUSTUS_CONFIG_PATH/extrinsic/extrinsic.M.RM.E.W.cfg extrinsic.bug.cfg</pre>
Now <span class="assignment">edit <tt class="result">extrinsic.bug.cfg</tt></span> so that the
non-comment lines are like this. Alternatively, you may just copy that file from
the result files and edit some of the bold numbers below.
<pre class="code">
[SOURCES]
M E W
[SOURCE-PARAMETERS]
[GENERAL]
start 1 1 M 1 1e+100 E 1 1 W 1 1
stop 1 1 M 1 1e+100 E 1 1 W 1 1
tss 1 1 M 1 1e+100 E 1 1 W 1 1
tts 1 1 M 1 1e+100 E 1 1 W 1 1
ass 1 1 M 1 1e+100 E 1 1 W 1 1
dss 1 1 M 1 1e+100 E 1 1 W 1 1
exonpart 1 <span style="font-weight:bold;color:red">.997</span> M 1 1e+100 E 1 <span style="font-weight:bold;color:darkgreen">1e2</span> W 1 <span style="font-weight:bold;color:darkgreen">1.05</span>
exon 1 1 M 1 1e+100 E 1 <span style="font-weight:bold;color:darkgreen">1e4</span> W 1 1
intronpart 1 1 M 1 1e+100 E 1 1 W 1 1
intron 1 <span style="font-weight:bold;color:red">.3</span> M 1 1e+100 E 1 <span style="font-weight:bold;color:darkgreen">1e6</span> W 1 1
CDSpart 1 1 <span style="font-weight:bold;color:red">0.985</span> M 1 1e+100 E 1 1 W 1 1
CDS 1 1 M 1 1e+100 E 1 1 W 1 1
UTRpart 1 1 <span style="font-weight:bold;color:red">.96</span> M 1 1e+100 E 1 1 W 1 1
UTR 1 1 M 1 1e+100 E 1 1 W 1 1
irpart 1 1 M 1 1e+100 E 1 1 W 1 1
nonexonpart 1 1 M 1 1e+100 E 1 1 W 1 1
genicpart 1 1 M 1 1e+100 E 1 1 W 1 1
</pre>
The bold green numbers specify the <i>bonus</i> that a gene structure candidate gets
for being compatible with a hint of that type and source.
<p>For example, the <span style="font-weight:bold;color:darkgreen">1e6</span> in the intron row after the source letter E means that for each intron hint from
ESTs (src=E), gene structures that have an intron
with both boundaries as given in the hint are rewarded by a factor of 1 million
relative to gene structures that disregard the intron hint.</p>
<p>The <span style="font-weight:bold;color:darkgreen">1.05</span> in the exonpart row after the letter W specifies that for each exonpart hint in the RNA-Seq hints file
(src=W), every gene structure that has an exon including the range of the hint gets
a relative bonus factor 1.05 <i>per multiplicity</i>.</p>
<p>The red numbers specify a penalty (malus) for gene structures with unsupported
features. For example, the <span style="font-weight:bold;color:red">.3</span> in
the intron row means that every intron candidate
<i>that has no intron hints supporting it</i> is penalized by multiplying its relative probability by the factor 0.3. If you decrease this number even
more (say from .3 to .001), then fewer introns unsupported by spliced transcriptome
reads should be predicted. This would likely decrease the false positive intron rate, but more true introns that lack hint support would be missed as well.
</p>
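<p>To get a feeling for these numbers, the effect of a bonus that applies per multiplicity can be worked out by hand. The sketch below assumes that "per multiplicity" means that the bonus is raised to the power of mult; this reading is an assumption made for the illustration, and the function name <tt>bonus_factor</tt> is hypothetical.</p>
<pre class="code">
# bonus_factor BONUS MULT: total relative bonus, assuming the bonus is
# applied once per multiplicity, i.e. BONUS raised to the power of MULT.
bonus_factor() {
    awk -v b="$1" -v m="$2" 'BEGIN { printf "%.3f\n", b ^ m }'
}
bonus_factor 1.05 10   # prints 1.629
</pre>
<p>Under this assumption, an exonpart hint with mult=10 would favor compatible gene structures by a factor of about 1.63, which is still very small compared to the factor 1e6 of a single intron hint from ESTs.</p>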
For more information, look into one of the extrinsic.cfg files.
<h2 id="predh">5. PREDICT GENES USING HINTS</h2>
<span class="assignment">Predict the genes in the range 7,000,001-7,500,000 of chr2R of <i>D. melanogaster</i> using evidence from <tt>hints.gff</tt></span>.
<pre class="code">augustus --species=fly --predictionStart=7000001 --predictionEnd=7500000 chr2R.fa \
--extrinsicCfgFile=extrinsic.bug.cfg --hintsfile=hints.gff > augustus.hints.gff # takes ~9m</pre>
<p>The species <tt>fly</tt> contains UTR parameters, which we didn't have the time to
train for <tt>bug</tt>. When using RNA-Seq as hints it is better to use a model
with UTRs, as a significant fraction of reads map to UTRs. It is also possible
to use <tt>bug</tt> here, though.</p>
The output <span class="result"><tt>augustus.hints.gff</tt></span> now looks like this:
<pre class="code">
# This output was generated with AUGUSTUS (version 2.5).
...
# start gene g1
chr2R AUGUSTUS gene 7007533 7009385 0.2 - . g1
chr2R AUGUSTUS transcript 7007533 7009385 0.2 - . g1.t1
chr2R AUGUSTUS tts 7007533 7007533 . - . transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS exon 7007533 7008630 . - . transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS stop_codon 7007610 7007612 . - 0 transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS intron 7008631 7008694 1 - . transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS intron 7008812 7008865 1 - . transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS intron 7009192 7009251 1 - . transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS CDS 7007610 7008630 1 - 1 transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS CDS 7008695 7008811 1 - 1 transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS exon 7008695 7008811 . - . transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS CDS 7008866 7009191 1 - 0 transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS exon 7008866 7009191 . - . transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS CDS 7009252 7009353 0.94 - 0 transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS exon 7009252 7009385 . - . transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS start_codon 7009351 7009353 . - 0 transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS tss 7009385 7009385 . - . transcript_id "g1.t1"; gene_id "g1";
# protein sequence = [MNTLSSARSVAIYVGPVRSSRSASVLAHEQAKSSITEEHKTYDEIPRPNKFKFMRAFMPGGEFQNASITEYTSAMRKR
# YGDIYVMPGMFGRKDWVTTFNTKDIEMVFRNEGIWPRRDGLDSIVYFREHVRPDVYGEVQGLVASQNEAWGKLRSAINPIFMQPRGLRMYYEPLSNIN
# NEFIERIKEIRDPKTLEVPEDFTDEISRLVFESLGLVAFDRQMGLIRKNRDNSDALTLFQTSRDIFRLTFKLDIQPSMWKIISTPTYRKMKRTLNDSL
# NVAQKMLKENQDALEKRRQAGEKINSNSMLERLMEIDPKVAVIMSLDILFAGVDATATLLSAVLLCLSKHPDKQAKLREELLSIMPTKDSLLNEENMK
# DMPYLRAVIKETLRYYPNGLGTMRTCQNDVILSGYRVPKGTTVLLGSNVLMKEATYYPRPDEFLPERWLRDPETGKKMQVSPFTFLPFGFGPRMCIGK
# RVVDLEMETTVAKLIRNFHVEFNRDASRPFKTMFVMEPAITFPFKFTDIEQ]
# Evidence for and against this transcript:
# % of transcript supported by hints (any source): 100
# CDS exons: 4/4
# E: 4
# W: 4
# CDS introns: 3/3
# E: 3
# 5'UTR exons and introns: 1/1
# E: 1
# 3'UTR exons and introns: 1/1
# W: 1
# hint groups fully obeyed: 137
# E: 4 (gi|15542574,SRR023546.8642467/1)
# W: 133
# incompatible hint groups: 18
# E: 18 (gi|13769068,gi|4203815,gi|15543927,gi|38623822,gi|15539951,gi|14693753,gi|14699170,...)
# end gene g1
...
</pre>
<p>After each predicted transcript, a short summary follows of how well the transcript is supported by
and compatible with the hints. Note that AUGUSTUS now predicts alternative
splice forms (ending e.g. in .t2).</p>
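<p>To find the genes for which alternative splice forms were predicted, the transcript lines can be counted per gene. This is a minimal sketch that assumes that the transcript lines carry an ID of the form <tt>g1.t1</tt> in the last column, as in the output shown above; the function name <tt>alt_spliced_genes</tt> is made up for this illustration.</p>
<pre class="code">
# alt_spliced_genes FILE: list genes with more than one predicted transcript,
# together with the transcript count. Assumes transcript lines carry an ID of
# the form GENE.tN (e.g. g1.t1) in column 9 of a tab-separated GFF file.
alt_spliced_genes() {
    awk -F'\t' '$3 == "transcript" {
        split($9, part, ".")
        n[part[1]]++
    }
    END { for (g in n) if (n[g] > 1) print g "\t" n[g] }' "$1" | sort
}
</pre>
<p>Calling <tt>alt_spliced_genes augustus.hints.gff</tt> prints the gene name and the number of transcripts for each such gene.</p>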
Finally, <span class="assignment">make another custom track with the predictions using hints</span>:
<pre class="code">
echo -e "browser position chr2R:7299000-7318000\n\
track name=withhints description=\"Augustus predictions using hints\" db=dm3 visibility=3" > withhints.browser
grep -P "AUGUSTUS\t(CDS|exon)" augustus.hints.gff >> withhints.browser
</pre>
In the last line we keep only those prediction lines that specify exons and CDS. The additional exon lines allow the UTRs to be identified (if you used <tt>fly</tt>)
as the exon ranges minus the CDS ranges.
<p>Again, you may also just click on this link to a <a href="http://genome.ucsc.edu/cgi-bin/hgTracks?db=dm3&hgt.customText=http://bioinf.uni-greifswald.de/augustus/binaries/tutorial/results/withhints.browser" target="_blank">prepared custom track</a> with the predictions.</p>
</body></html>