-
Notifications
You must be signed in to change notification settings - Fork 0
/
00TRANSMARKNOTES
594 lines (394 loc) · 39.5 KB
/
00TRANSMARKNOTES
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
#The all_alignments.stk file was created by Travis from Pfam DB
#and contains all the multiple sequence alignments in the Pfam DB
First get the names of all the alignements
% esl-alistat all_alignments.stk | awk '/Alignment name/ { print $3}' > all_ali_names.lst
#Just subsample 5% of all MSAs so there are not too many families and benchmark is smaller
% sort -R --random-source all_ali_names.lst all_ali_names.lst | head -n 736 > some_ali_names.lst
(using "--random-source all_ali_names.lst" to set the seed)
% esl-afetch -f all_alignments.stk some_ali_names.lst > some_alignments.stk
# --maxtest contro
ls the number of planted sequences
% rmark/rmark-create -N 10 -L 100000000 -R 10 -E 10 --maxtrain 30 --maxtest 20 pfamDNAmark some_alignments.stk rmark_folder/rmark/rmark3-bg.hmm
% wc pfamDNAmark.tbl
147 1029 9114 pfamDNAmark.tbl
% esl-seqstat pfamDNAmark.pfa
...
Number of sequences: 2762
OK, that seems feasible.
% time makeblastdb -dbtype nucl -in pfamDNAmark.fa
Building a new DB, current time: 01/06/2015 15:17:20
New DB name: pfamDNAmark.fa
New DB title: pfamDNAmark.fa
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
hmmbuild pfamDNAmark.hmm pfamDNAmark.msa
% hmmbuild pfamDNAmark.hmm pfamDNAmark.msa
# CPU time: 27.35u 0.14s 00:00:27.49 Elapsed: 00:00:03.80
#exact commands I ran:
In my first pass at this, I had a ton of families:
% grep "\/\/" ../all_alignments.stk | wc
14724 14724 44172
[wshands@scyld harderpfamdnamark]$ /home/um/wshands/pulloftranslatedsearch/hmmer/easel/miniapps/esl-alistat all_alignments.stk | awk '/Alignment name/ { print $3}' > all_ali_names.lst
Use 5% of families only
sort -R --random-source all_ali_names.lst all_ali_names.lst | head -n 736 > some_736_ali_names.lst
/home/um/wshands/pulloftranslatedsearch/hmmer/easel/miniapps/esl-afetch -f all_alignments.stk some_736_ali_names.lst > some_736_alignments.stk
#Used rmark-create in /data/wshands/harderpfamdnamark/Users/traviswheeler/src/infernal-1.1.1/rmark because apparently
#this is a special version which must be used to generate the (training/test sequences?)
[wshands@scyld transsearchbenchmark]$ ../Users/traviswheeler/src/infernal-1.1.1/rmark/rmark-create -N 10 -L 100000000 -R 10 -E 10 --maxtrain 30 --maxtest 20 pfamtransDNAmark ../some_736_alignments.stk ../Users/traviswheeler/src/infernal-1.1.1/rmark/rmark3-bg.hmm
used curl ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam27.0/Pfam-A.seed.gz > Pfam-A.v27.seed.gz
to get Pfam seed file for use in build_protein_training_seeds
[wshands@scyld transsearchbenchmark]$ ./build_protein_training_seeds.pl
[wshands@scyld transsearchbenchmark]$ hmmbuild pfamtransAAmark.hmm pfamtransAAmark.msa
[wshands@scyld transsearchbenchmark]$ perl ../rmark/rmark-master.pl -F -N 16 -C pfamtransAAmark.hmm $H_PATH ../rmark ../rmark ptr.std.e100 ../rmark_opts/phmmert.e100.opts pfamtransDNAmark ../rmark/x-phmmert 100000
To get status of the jobs:
[wshands@scyld ptr.std.e100]$ qstat -u wshands
scyld.localdomain:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
5866.scyld.localdomain wshands batch ptr.std.e100.0 79674 1 16 -- 27:46:40 R 00:07:20
5867.scyld.localdomain wshands batch ptr.std.e100.1 79676 1 16 -- 27:46:40 R 00:07:20
...
Then generate some statistics files:
[wshands@scyld transsearchbenchmark]$ my_msub gather "../rmark/rmark-pp.sh pfamtransDNAmark ptr.std.e100 1" 1
Then generate the ROC plot:
[wshands@scyld transsearchbenchmark]$ perl ../rmark/transmark-Rroc.pl -R listfiles/pfamtransDNAmark.ROC.list pfamtransDNAmark-ROC.pdf 0 ""
OK this ROC plot did not display any line for phmmert; perhaps because phmmert found most of the sequences???
So create a benchmark that is harder by decreasing the percentage similarity allowable between training and test sequences with the -1 option; set it to 0.40
[wshands@scyld transsearch_-1_0.4]$ ../Users/traviswheeler/src/infernal-1.1.1/rmark/rmark-create -1 0.4 -N 10 -L 100000000 -R 10 -E 10 --maxtrain 30 --maxtest 20 pfamtransDNAmark ../some_736_alignments.stk ../Users/traviswheeler/src/infernal-1.1.1/rmark/rmark3-bg.hmm
[wshands@scyld transsearch_-1_0.4]$ wc -l pfamtransDNAmark.tbl
45 pfamtransDNAmark.tbl
[wshands@scyld transsearch_-1_0.4]$ wc -l pfamtransDNAmark.pos
718 pfamtransDNAmark.pos
[wshands@scyld transsearch_-1_0.4]$
Only a quarter of the total number of test sequences as the previous run but I will see if it is difficult enough and
if the ROC plot is generated with a trend line
[wshands@scyld transsearch_-1_0.4]$ ../build_protein_training_seeds.pl
[wshands@scyld transsearch_-1_0.4]$hmmbuild pfamtransAAmark.hmm pfamtransAAmark.msa
[wshands@scyld transsearch_-1_0.4]$ perl ../rmark/rmark-master.pl -F -N 16 -C pfamtransAAmark.hmm $H_PATH ../rmark ../rmark ptr.std.e100 ../rmark_opts/phmmert.e100.opts pfamtransDNAmark ../rmark/x-phmmert 100000
[wshands@scyld transsearch_-1_0.4]$ my_msub gather "../rmark/rmark-pp.sh pfamtransDNAmark ptr.std.e100 1" 1
[wshands@scyld transsearch_-1_0.4]$ perl ../rmark/transmark-Rroc.pl -R ../listfiles/pfamtransDNAmark.ROC.list pfamtransDNAmark-ROC.pdf 0 ""
Now I get a ROC plot with a phmmert sensitivity trend consistantly at 90% so benchmark not hard enough; I need phmmert to have
somewhere between 40 to 60% sensitivity
../Users/traviswheeler/src/infernal-1.1.1/rmark/rmark-create -1 0.3 -N 10 -L 100000000 -R 5 -E 1 --maxtrain 30 --maxtest 20 pfamtransDNAmark ../all_alignments.stk ../Users/traviswheeler/src/infernal-1.1.1/rmark/rmark3-bg.hmm
[wshands@scyld transsearch_-1_0.3_-R_5_-E_1]$ wc -l pfamtransDNAmark.tbl
2348 pfamtransDNAmark.tbl
[wshands@scyld transsearch_-1_0.3_-R_5_-E_1]$ wc -l pfamtransDNAmark.pos
3740 pfamtransDNAmark.pos
[wshands@scyld transsearch_-1_0.3_-R_5_-E_1]$ ../build_protein_training_seeds.pl
[wshands@scyld transsearch_-1_0.3_-R_5_-E_1]$ hmmbuild pfamtransAAmark.hmm pfamtransAAmark.msa
[wshands@scyld transsearch_-1_0.3_-R_5_-E_1]$ perl ../rmark/rmark-master.pl -F -N 16 -C pfamtransAAmark.hmm $H_PATH ../rmark ../rmark ptr.std.e100 ../rmark_opts/phmmert.e100.opts pfamtransDNAmark ../rmark/x-phmmert 100000
[wshands@scyld transsearch_-1_0.3_-R_5_-E_1]$ my_msub gather "../rmark/rmark-pp.sh pfamtransDNAmark ptr.std.e100 1" 1
[wshands@scyld transsearch_-1_0.3_-R_5_-E_1]$ perl ../rmark/transmark-Rroc.pl -R ../listfiles/pfamtransDNAmark.ROC.list pfamtransDNAmark-ROC.pdf 0 "title"
Sensitivity now at about 78%; still not low enough??? but I will try it with tblastn to check
[wshands@scyld transsearch_-1_0.3_-R_5_-E_1]$ makeblastdb -dbtype nucl -in pfamtransDNAmark.fa
[wshands@scyld transsearch_-1_0.3_-R_5_-E_1]$ perl ../rmark/rmark-master.pl -G pfamtransAAmark -F -N 16 $H_PATH ../rmark ../rmark tbn.w3.e100.cons ../rmark_opts/tblastn-w3-e100.opts pfamtransDNAmark ../rmark/x-tblastn-cons 100000
[wshands@scyld transsearch_-1_0.3_-R_5_-E_1]$ my_msub gather "../rmark/rmark-pp.sh pfamtransDNAmark tbn.w3.e100.cons 1" 1
[wshands@scyld transsearch_-1_0.3_-R_5_-E_1]$ perl ../rmark/rmark-master.pl -G pfamtransAAmark -F -N 16 $H_PATH ../rmark ../rmark tbn.w3.e100.fpw ../rmark_opts/tblastn-w3-e100.opts pfamtransDNAmark ../rmark/x-tblastn-fpw 100000
[wshands@scyld transsearch_-1_0.3_-R_5_-E_1]$ my_msub gather "../rmark/rmark-pp.sh pfamtransDNAmark tbn.w3.e100.fpw 1" 1
perl ../rmark/transmark-Rroc.pl -R ../listfiles/pfamtransDNAmark.ROC.list pfamtransDNAmark-ROC.pdf 0 "title"
Try reducing the number of families to 7362 from the all list and set the -1 and -2 parameters
sort -R --random-source all_ali_names.lst all_ali_names.lst | head -n 7362 > 7362_ali_names.lst
esl-afetch -f all_alignments.stk 7362_ali_names.lst > 7362_alignments.stk
mkdir transsearch_-1_0.4_-2_0.4_-R_10_-E_10_7362fam
cd transsearch_-1_0.4_-2_0.4_-R_10_-E_10_7362fam
[wshands@scyld transsearch_-1_0.4_-2_0.4_-R_10_-E_10_7362fam]$ ../Users/traviswheeler/src/infernal-1.1.1/rmark/rmark-create -1 0.4 -2 0.4 -N 10 -L 100000000 -R 10 -E 10 --maxtrain 30 --maxtest 20 pfamtransDNAmark ../7362_alignments.stk ../Users/traviswheeler/src/infernal-1.1.1/rmark/rmark3-bg.hmm
[wshands@scyld transsearch_-1_0.4_-2_0.4_-R_10_-E_10_7362fam]$../build_protein_training_seeds.pl
[wshands@scyld transsearch_-1_0.4_-2_0.4_-R_10_-E_10_7362fam]$hmmbuild pfamtransAAmark.hmm pfamtransAAmark.msa
[wshands@scyld transsearch_-1_0.4_-2_0.4_-R_10_-E_10_7362fam]$makeblastdb -dbtype nucl -in pfamtransDNAmark.fa
[wshands@scyld transsearch_-1_0.4_-2_0.4_-R_10_-E_10_7362fam]$perl ../rmark/rmark-master.pl -F -N 16 -C pfamtransAAmark.hmm $H_PATH ../rmark ../rmark ptr.std.e100 ../rmark_opts/phmmert.e100.opts pfamtransDNAmark ../rmark/x-phmmert 100000
[wshands@scyld transsearch_-1_0.4_-2_0.4_-R_10_-E_10_7362fam]$perl ../rmark/rmark-master.pl -G pfamtransAAmark -F -N 16 $H_PATH ../rmark ../rmark tbn.w3.e100.cons ../rmark_opts/tblastn-w3-e100.opts pfamtransDNAmark ../rmark/x-tblastn-cons 100000
[wshands@scyld transsearch_-1_0.4_-2_0.4_-R_10_-E_10_7362fam]$perl ../rmark/rmark-master.pl -G pfamtransAAmark -F -N 16 $H_PATH ../rmark ../rmark tbn.w3.e100.fpw ../rmark_opts/tblastn-w3-e100.opts pfamtransDNAmark ../rmark/x-tblastn-fpw 1000000
[wshands@scyld transsearch_-1_0.4_-2_0.4_-R_10_-E_10_7362fam]$my_msub gather "../rmark/rmark-pp.sh pfamtransDNAmark tbn.w3.e100.cons 1" 1
[wshands@scyld transsearch_-1_0.4_-2_0.4_-R_10_-E_10_7362fam]$my_msub gather "../rmark/rmark-pp.sh pfamtransDNAmark tbn.w3.e100.fpw 1" 1
[wshands@scyld transsearch_-1_0.4_-2_0.4_-R_10_-E_10_7362fam]$my_msub gather "../rmark/rmark-pp.sh pfamtransDNAmark ptr.std.e100 1" 1
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ my_msub gather "../rmark/rmark-pp.sh pfamtransDNAmark exonerate.fpw 1" 1
[wshands@scyld transsearch_-1_0.4_-2_0.4_-R_10_-E_10_7362fam]$perl ../rmark/transmark-Rroc.pl -R ../listfiles/pfamtransDNAmark.ROC.list pfamtransDNAmark-ROC.pdf 0 "translated search 7362 fam"
The results are a little better, but tblastn fpw sensitivity result still does not seem to be reduced much
It turned out that randomly generating nucleotide sequence to use as background sequence may add stop codons too frequently so that there
are unrealistically short ORFs that skew the results so that there are not many potential hits or false positives in the background
sequence. We need to plant simulated
ORFs so that there is some negative ORFs that simulate the length and composition of real ORFs. The ability to plant simulated
ORFs to the background sequence was added to rmark-create as a '-D' option that includes a protein MSA file name following the
'-D' from which a random MSA and random sequence will be extracted and planted in the background sequence so that there will be
about 5% simulated ORFs in the background sequence.
[wshands@scyld rmarktest]$ ../rmark-create -N 10 -L 100000000 -R 10 -E 10 --maxtrain 30 --maxtest 20 -D /data/wshands/harderpfamdnamark/Pfam-A.v27.seed rmarktestinsertDNA /data/wshands/harderpfamdnamark/some_736_alignments.stk ../rmark3-bg.hmm
[wshands@scyld transdecoy_-R_10_-E_10_7362fam]$ /home/um/wshands/infernal/Users/traviswheeler/src/infernal-1.1.1/rmark/rmark-create -N 10 -L 100000000 -R 10 -E 10 --maxtrain 30 --maxtest 20 -D /data/wshands/harderpfamdnamark/Pfam-A.v27.seed rmarktestinsertDNA /data/wshands/harderpfamdnamark/some_736_alignments.stk /home/um/wshands/infernal/Users/traviswheeler/src/infernal-1.1.1/rmark/rmark3-bg.hmm
Ran the full experiment like this:
[wshands@scyld transdecoy_-R_10_-E_10_7362fam]$ makeblastdb -dbtype nucl -in rmarktestinsertDNA.fa
[wshands@scyld transdecoy_-R_10_-E_10_7362fam]$ ../build_protein_training_seeds.pl
[wshands@scyld transdecoy_-R_10_-E_10_7362fam]$ hmmbuild rmarktestinsertAA.hmm rmarktestinsertAA.msa
[wshands@scyld transdecoy_-R_10_-E_10_7362fam]$ /home/um/wshands/infernal/Users/traviswheeler/src/infernal-1.1.1/rmark/rmark-create -N 10 -L 100000000 -R 10 -E 10 --maxtrain 30 --maxtest 20 -D /data/wshands/harderpfamdnamark/Pfam-A.v27.seed rmarktestinsertDNA /data/wshands/harderpfamdnamark/some_736_alignments.stk /home/um/wshands/infernal/Users/traviswheeler/src/infernal-1.1.1/rmark/rmark3-bg.hmm
[wshands@scyld transdecoy_-R_10_-E_10_7362fam]$ perl ../rmark/rmark-master.pl -F -N 16 -C rmarktestinsertAA.hmm $H_PATH ../rmark ../rmark ptr.std.e100 ../rmark_opts/phmmert.e100.opts rmarktestinsertDNA ../rmark/x-phmmert 100000
[wshands@scyld transdecoy_-R_10_-E_10_7362fam]$ perl ../rmark/rmark-master.pl -G rmarktestinsertAA -F -N 16 $H_PATH ../rmark ../rmark tbn.w3.e100.cons ../rmark_opts/tblastn-w3-e100.opts rmarktestinsertDNA ../rmark/x-tblastn-cons 100000
[wshands@scyld transdecoy_-R_10_-E_10_7362fam]$ perl ../rmark/rmark-master.pl -G rmarktestinsertAA -F -N 16 $H_PATH ../rmark ../rmark tbn.w3.e100.fpw ../rmark_opts/tblastn-w3-e100.opts rmarktestinsertDNA ../rmark/x-tblastn-fpw 1000000
[wshands@scyld transdecoy_-R_10_-E_10_7362fam]$ my_msub gather "../rmark/rmark-pp.sh rmarktestinsertDNA ptr.std.e100 1" 1
[wshands@scyld transdecoy_-R_10_-E_10_7362fam]$ my_msub gather "../rmark/rmark-pp.sh rmarktestinsertDNA tbn.w3.e100.fpw 1" 1
[wshands@scyld transdecoy_-R_10_-E_10_7362fam]$ my_msub gather "../rmark/rmark-pp.sh rmarktestinsertDNA tbn.w3.e100.cons 1" 1
[wshands@scyld transdecoy_-R_10_-E_10_7362fam]$ perl ../rmark/transmark-Rroc.pl -R ../listfiles/pfamtransDNAmark.ROC.list pfamtransDNAmark-ROC.pdf 0 "translated searchwith decoy inserts 7362 fam"
Results were no better than before...
I think I discovered a bug in the perl script build_protein_training seeds which fails to strip the test sequences from the msa that is used to construct the search hmm.
Please look at the DUF1203* files in /data/wshands/harderpfamdnamark/testbuildscript and let me know if I am off base here/not thinking the right way etc.
[wshands@scyld testbuildscript]$ grep Q166P0_ROSDO DUF1203.filter.names
Q166P0_ROSDO/36-153
[wshands@scyld testbuildscript]$ grep Q166P0_ROSDO rmarktestinsertAA.msa
#=GS Q166P0_ROSDO/36-153 AC Q166P0.1
Q166P0_ROSDO/36-153 TYPCRHCLKEAGKKDGALLFSYRLP--------LPDSVYAQPTAVFLCKGECTGYTGGQSVP----------EIV-GNRFVALRAFRAKGLMDYAMNTLSDGKG-------VGDEVERILQNNEIAFINVHTALAGCMLCAVRR
[wshands@scyld testbuildscript]$ grep Q166P0_ROSDO origtraintest
#=GS DUF1203/1 DE Q166P0_ROSDO
[wshands@scyld testbuildscript]$ less origtraintest
...
# STOCKHOLM 1.0
#=GF ID TEST.DUF1203
#=GS DUF1203/1 DE Q166P0_ROSDO
...
So the test sequence in this case will be included in the amino acid HMM used to search the DNA sequence DB when the hmmbuild command issued.
This is caused by a bug in the line:
do_cmd (qq[awk 'BEGIN{FS="|"} {print \$2}' $ali.dna.names > $ali.dna.names2]);
which leaves several empty lines in DUF1203.dna.names2, and not the fixed up names from DUF1203.dna.names.
For some reason when the line:
do_cmd ("grep -f $ali.dna.names2 $ali.pfam.names > $ali.filter.names");
is executed all the sequence names in DUF1203.pfam.names are included in DUF1203.filter.names. I would have thought none of them would be included...which would have uncovered the bug...
I believe that DUF1203.dna.names2 should contain the fixed up names from DUF1203.dna.names and not contain empty lines. Then when the command:
do_cmd ("esl-alimanip --seq-k $ali.filter.names $ali.pfam.sto >> $protfile");
is executed in build_protein_training_seeds.pl it would only include the sequences in the DNA MSA which is the msa used for training (building the HMM).
Fix bug by
# do_cmd (qq[awk 'BEGIN{FS="|"} {print \$2}' $ali.dna.names > $ali.dna.names2]);
do_cmd (qq[awk 'BEGIN{FS="|"} {if ( \$2 == \'\\n\' ) print \$1; else print \$2}' $ali.dna.names > $ali.dna.names2]);
I fixed the bug in build_protein_training_seeds.pl and ran the searches again with -R 10 -E 10 and did not set -1 or -2 and adding in the decoy DNA sequences. The results were dramatically different with phmmert sensitivity around .8 and tblastn cons around .15 and fpw around .28
I will have to run some more searches to confirm this but based on these preliminary results it seems that it will not be necessary to sort the training and test sequences using amino acids.
Quick check: in the plot, it looks like FPW actually performs worse (in terms of the height of the plotted line) than consensus. That's perfectly possible, but worth verifying. FPW will certainly have higher sensitivity, but if the false positive rate is too high, you'll have false hits showing up before all those hits found due to increased sensitivity, which drives down that line. Is that what you're seeing in the output?
Also, I notice that the FPW runtime isn't a lot worse than the consensus variant. Is that because the number of sequences per family has been limited?
These are the statistics in
tbn.w3.e100.fpw.mer:
# family/category mer npos fn fp thresh
# --------------- ---- ---- ---- ---- --------
zf-UBP 20 20 20 0 -
# --------------- ---- ---- ---- ---- --------
*family-sum* 2532 3009 2531 1 -
*summary* 2551 3009 2549 2 0.079 0.51h
for tbn.w3.e100.cons.mer:
# family/category mer npos fn fp thresh
# --------------- ---- ---- ---- ---- --------
zf-UBP 20 20 20 0 -
# --------------- ---- ---- ---- ---- --------
*family-sum* 2104 3009 2100 4 -
*summary* 2153 3009 2137 16 2.5 0.22h
for ptr.std.e100.mer:
# family/category mer npos fn fp thresh
# --------------- ---- ---- ---- ---- --------
*family-sum* 530 3009 525 5 -
*summary* 605 3009 593 12 0.59 7.10h
So it seems like the big difference between phmmert and tblastn is the number of false negatives. The number of false positives is about the same between tblastn cons and fpw.
I am not sure what 'before' means in "if the false positive rate is too high, you'll have false hits showing up before all those hits found due to increased sensitivity." The decoys show up at about line 463 in the fpw mer file and about line 884 in the cons mer file. I guess this is what you were suggesting?
Its unclear to me how the "number of false positives per search" is calculated and displayed; is this done by binning E-values and counting false negatives per family to true positives for that family for all the families in the E-value bin and using the Sn = TP/(TP + FN) formula? Not sure where the FP factor in...
The maximum number of training and test sets per family where limited to 10 each. Yes usually fpw takes much longer than cons, not sure why it didn't here, but the number of families was limited:
There were 159 families used in queries:
[wshands@scyld transdecoy_-R_10_-E_10_736fam]$ wc -l rmarktestinsertDNA.tbl
159 rmarktestinsertDNA.tbl
Including the decoy shuffled translated ORFs causes tblastn fpw sensitivity to decrease dramatically. See the graph when I did not include decoy ORFs and how much better fpw sensitivity is. The decoy ORFs cause tblastn fpw to generate a lot more false negatives
# family/category mer npos fn fp thresh
DECOY ORFS PLANTED
# --------------- ---- ---- ---- ---- --------
zf-UBP 20 20 20 0 -
# --------------- ---- ---- ---- ---- --------
*family-sum* 2532 3009 2531 1 -
*summary* 2551 3009 2549 2 0.079 0.51h
NO DECOY ORFS PLANTED
# --------------- ---- ---- ---- ---- --------
*family-sum* 1553 3009 1540 13 -
*summary* 1628 3009 1583 45 0.13 1.76h
Why would this happen?
[wshands@scyld testorfscripts]$ my_msub gather "../rmark/rmark-pp.sh rmarktestinsertDNA tbn.w3.e100.cons 1 orf" 1
Create the plot of hits on shuffled ORFs by doing:
[wshands@scyld transsearchorf_-R_10_-E_10_736fam_2016_10_16]$ cat ../rmark/plot.hmm.fpw.subset.R | R --vanilla
10/29/2016
created new test infrastructure now that the scripts and technique are set up
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ /home/um/wshands/infernal/Users/traviswheeler/src/infernal-1.1.1/rmark/rmark-create -N 10 -L 100000000 -R 10 -E 10 --maxtrain 30 --maxtest 20 -D /data/wshands/harderpfamdnamark/Pfam-A.v27.seed rmarkinsertORFandDNA /data/wshands/harderpfamdnamark/7362_alignments.stk /home/um/wshands/infernal/Users/traviswheeler/src/infernal-1.1.1/rmark/rmark3-bg.hmm
11/3/2016
#found memory overwrite in remove duplicates function using valgrind
DEBUG: Name of decoy MSA to retrieve is Rad1 at index 11632
DEBUG: Decoy sequence to retrieve is at index 7 out of 26 sequences
allocated 727 bytes for decoy ORF textdecoy DNA seq is TTTACTAAATTTGTTTCTTGTCCTTCTTTTGATAAAATTTTTGCTGTTTATTCTGATGTTGATTATAATTTTTTAATTGTTTTAGAAAATCCTAATACTTATGAAGATTCTTGTATTTTA6
Added 8 decoy sequences
DEBUG: Name of decoy MSA to retrieve is Cu_amine_oxid at index 2087
Hardware watchpoint 3: posseqs[3]
Old value = (ESL_SQ *) 0x692890
New value = (ESL_SQ *) 0x2e2e2e2e2e2e2e2e
memcpy () at ../sysdeps/x86_64/memcpy.S:218
218 leaq 32(%rsi), %rsi
(gdb) bt
#0 memcpy () at ../sysdeps/x86_64/memcpy.S:218
#1 0x0000000000409e04 in esl_strcat (dest=0x73cdd10, ldest=-1, src=0x738bf30 '.' <repeats 93 times>, "*", '.' <repeats 95 times>, "*.........."..., lsrc=-1) at easel.c:615
#2 0x0000000000423d34 in esl_msa_AppendGR (msa=0x73ed130, tag=0x73321b0 "pAS", sqidx=14,
value=0x738bf30 '.' <repeats 93 times>, "*", '.' <repeats 95 times>, "*.........."...) at esl_msa.c:1826
#3 0x0000000000425186 in esl_msa_SequenceSubset (msa=0x734a170, useme=0x73ea930, ret_new=0x7ffffffd6048) at esl_msa.c:2215
#4 0x0000000000405073 in remove_fragments (cfg=0x7fffffffd480, msa=0x734a170, ret_filteredmsa=0x7ffffffd6048, ret_nfrags=0x7ffffffd6034) at rmark-create.c:893
#5 0x0000000000404749 in add_decoy_ORFs_to_positive_sequence_list (cfg=0x7fffffffd480, decoymsafp=0x7d7cd0, decoy_alphabet=0x67f630, num_decoy_ORFs=50000,
decoy_msa_names=0x7ffffffd61e0, num_decoy_msa_names=14831, posseqs=0x7ffff7e30010, npos=0x7ffffffd60d4) at rmark-create.c:729
#6 0x0000000000406f6b in synthesize_negatives_and_embed_positives (go=0x67d010, cfg=0x7fffffffd480, posseqs=0x73f5bb0, npos=27325, decoymsafp=0x7d7cd0,
decoy_alphabet=0x67f630, num_decoy_ORFs=50000, decoy_msa_names=0x7ffffffd61e0, num_decoy_msa_names=14831) at rmark-create.c:1465
#7 0x0000000000404170 in main (argc=18, argv=0x7fffffffd788) at rmark-create.c:613
(gdb) p *posseqs[0]
No symbol "posseqs" in current context.
(gdb) p posseqs[0]
No symbol "posseqs" in current context.
(gdb)
11/5/2016
The bug was that I was passing posseqs to the add decoy sequences function, not &posseqs
#commands to generate test benchmark:
wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ /home/um/wshands/infernal/Users/traviswheeler/src/infernal-1.1.1/rmark/rmark-create -N 10 -L 100000000 -R 10 -E 10 --maxtrain 30 --maxtest 20 -D /data/wshands/harderpfamdnamark/Pfam-A.v27.seed rmarkinsertORFandDNA /data/wshands/harderpfamdnamark/7362_alignments.stk /home/um/wshands/infernal/Users/traviswheeler/src/infernal-1.1.1/rmark/rmark3-bg.hmm
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ makeblastdb -dbtype nucl -in rmarkinsertORFandDNA.fa
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ ../build_protein_training_seeds.pl
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ hmmbuild rmarkinsertAA.hmm rmarkinsertAA.msa
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ which phmmert
~/pulloftranslatedsearch/hmmer/src/phmmert
#do the searches
#NOTE the result directories, e.g. tbn.w3.e100.fpw must be created before the command below is called
#otherwise the script will try to write to the directory possibly before it is created and you will loose
#result data
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ perl ../rmark/rmark-master.pl -F -N 16 -C rmarkinsertAA.hmm $H_PATH ../rmark ../rmark ptr.std.e100 ../rmark_opts/phmmert.e100.opts rmarkinsertORFandDNA ../rmark/x-phmmert 1000000
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ perl ../rmark/rmark-master.pl -G rmarkinsertAA -F -N 16 $H_PATH ../rmark ../rmark tbn.w3.e100.cons ../rmark_opts/tblastn-w3-e100.opts rmarkinsertORFandDNA ../rmark/x-tblastn-cons 1000000
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ perl ../rmark/rmark-master.pl -G rmarkinsertAA -F -N 16 $H_PATH ../rmark ../rmark tbn.w3.e100.fpw ../rmark_opts/tblastn-w3-e100.opts rmarkinsertORFandDNA ../rmark/x-tblastn-fpw 1000000
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ perl ../rmark/rmark-master.pl -G rmarkinsertAA -F -N 16 $H_PATH ../rmark ../rmark exonerate.fpw ../rmark_opts/exonerate.opts rmarkinsertORFandDNA ../rmark/x-exonerate-fpw 1000000
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ my_msub gather "../rmark/rmark-pp.sh rmarkinsertORFandDNA tbn.w3.e100.cons 1" 1
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ my_msub gather "../rmark/rmark-pp.sh rmarkinsertORFandDNA tbn.w3.e100.fpw 1" 1
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ my_msub gather "../rmark/rmark-pp.sh rmarkinsertORFandDNA ptr.std.e100 1" 1
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ my_msub gather "../rmark/rmark-pp.sh pfamtransDNAmark exonerate.fpw 1" 1
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ my_msub gather "../rmark/rmark-pp.sh rmarkinsertORFandDNA tbn.w3.e100.fpw 1 .orf" 1
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ my_msub gather "../rmark/rmark-pp.sh rmarkinsertORFandDNA tbn.w3.e100.cons 1 .orf" 1
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ my_msub gather "../rmark/rmark-pp.sh rmarkinsertORFandDNA ptr.std.e100 1 .orf" 1
wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ ../rmark/compute-glob-table.pl fpw > fpwout
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ ../rmark/compute-glob-table.pl hmm > hmmout
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ ../rmark/compute-glob-table.pl cons > consout
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ ../rmark/compute-glob-table.pl fpw orf > fpworfout
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ ../rmark/compute-glob-table.pl cons orf > consorfout
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ ../rmark/compute-glob-table.pl hmm orf > hmmorfout
#to make plots of shuffled orf hits
[whands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ cat ../rmark/plot.shuffledORFhits.R | R --vanilla
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ cat ../rmark/plot.POShits.R | R --vanilla
#debugging the tblastn fpw bug where the scripts don't report the tblastn hit
1431 less consout
1432 cd tbn.w3.e100.cons
1433 grep Bac_GDH/7 *.out
tbl0.out: 0 1038.0 52511493 52515356 rmark7 Bac_GDH Bac_GDH/7 same
tbl5.out: 33 26.9 52513416 52513544 rmark7 YqjK Bac_GDH/7 same
tbl7.out: 94 25.8 52515171 52515242 rmark7 PUA Bac_GDH/7 same
1434 cd ../tbn.w3.e100.fpw
1435 grep Bac_GDH/7 *.out
tbl12.out: 50 26.2 52512377 52512261 rmark7 DUF4157 Bac_GDH/7 opposite
tbl13.out: 3.5 31.6 52513637 52513455 rmark7 AAA_assoc_C Bac_GDH/7 opposite
tbl13.out: 95 27.3 52513217 52513083 rmark7 AAA_assoc_C Bac_GDH/7 opposite
tbl2.out: 48 26.6 52513353 52513562 rmark7 Thr_synth_N Bac_GDH/7 same
tbl2.out: 92 26.9 52511980 52511855 rmark7 DUF4081 Bac_GDH/7 opposite
tbl3.out: 91 27.3 52511073 52511189 rmark7 UPF0066 Bac_GDH/7 same
tbl5.out: 10 28.5 52513386 52513601 rmark7 YqjK Bac_GDH/7 same
tbl5.out: 41 31.2 52513947 52514120 rmark7 DUF482 Bac_GDH/7 same
#the hit below does not show up in tbn.w3.e100.fpw but it should
0 1038.0 52511493 52515356 rmark7 Bac_GDH Bac_GDH/7 same
#if we manually construct the fpw query, Bac_GDH Bac_GDH/7 is found by tblastn
1440 mkdir fpwtest
1441 cd fpwtest
1442 esl-afetch --index ../rmarkinsertAA.msa
1443 esl-afetch ../rmarkinsertAA.msa Bac_GDH > Bac_GDH.sto
1444 less Bac_GDH.sto
1445 esl-reformat -h
1446 esl-reformat pfam Bac_GDH.sto > Bac_GDH.pfam.sto
# take one sequence from Bac_GDH.pfam.sto and put it in oneseq.fa
#then run tblastn
1457 tblastn -word_size 3 -db ../rmarkinsertORFandDNA.fa -query oneseq.fa -outfmt 7 -num_threads 8 -out resultout
# TBLASTN 2.2.31+
# Query: A1WVU5_HALHL/801598
# Database: ../rmarkinsertORFandDNA.fa
# Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 20 hits found
...
A1WVU5_HALHL/801598 rmark7 46.12 1442 736 15 9 1421 52511070 52515359 0.0 974
#create test to see how close family fpw and cons results are for those that are very different for some fpw
#here is the command to create the ratio test file:
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ awk 'FNR>1 {printf("%40s\t%1.2e\t%1.2e\t%1.2e\n", $1,$2,$3,$2/$3) }' consoutmin-170 > ratiotest
#normalize????
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ awk 'FNR>1 {p=(($2-$3) - (1e-170))/(100 - (1e-170)); printf("%40s\t%1.2e\t%1.2e\t%1.2e\n", $1,$2,$3, p > 0 ? p : -p) }' consoutmin-170 > ratiotest
#diff as percentage of min of fpw or cons
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ awk 'FNR>1 {min = ($2 < $3) ? $2 : $3; p=($2-$3)/min; printf("%40s\t%1.2e\t%1.2e\t%1.2e\n", $1,$2,$3, p >= 0 ? p : -p) }' consoutmin-170 > ratiotest
#use + and - to indicate cons better E-Value (+) or fpw better E-Value (-) after sort
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ awk 'FNR>1 {min = ($2 < $3) ? $2 : $3; p=($2-$3)/min; printf("%40s\t%1.2e\t%1.2e\t%1.2e\n", $1,$2,$3, p) }' consoutmin-170 > ratiotest
#to sort on the 4th column
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ sort -g -b -t$'\t' -k 4,8 ratiotest | less
#to create difference of log of evalues of tblastn cons and phmmert
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ awk 'FNR>1 {printf("%40s\t%1.2e\t%1.2e\t%1.2e\n", $1,$3,$4,(log($4)/log(10))-(log($3)/log(10)) ) }' hmmoutmin-170 | sort -g -b -t$'\t' -k 4,8 > logdiffevalue
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ awk 'BEGIN { print " search\ttblastncons\tphmmert\tlogdiff"} FNR>1 {printf("%40s\t%1.2e\t%1.2e\t%1.2e\n", $1,$3,$4,(log($4)/log(10))-(log($3)/log(10)) ) }' hmmoutmin-170 | sort -g -b -t$'\t' -k 4,8 > logevaluediff
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ awk 'BEGIN { print " search\ttblastncons\tphmmert\tlogdiff"} FNR>1 { tblastlog = (log($3)/log(10)); phmmertlog = (log($4)/log(10)); logdiff = (tblastlog < phmmertlog) ? phmmertlog-tblastlog : tblastlog-phmmertlog; printf("%40s\t%1.2e\t%1.2e\t%1.2e\n", $1,$3,$4,logdiff) }' hmmoutmin-170 | sort -g -b -t$'\t' -k 4,8 > logevaluediff
hands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ awk 'BEGIN { print " search\ttblastncons\tphmmert\tlogdiff"} FNR>1 {printf("%40s\t%1.2e\t%1.2e\t%1.2e\n", $1,$2,$4,(log($4)/log(10))-(log($2)/log(10)) ) }' hmmoutmin-170 | sort -g -b -t$'\t' -k 4,8 > logevaluediff
#filter out results where hit for phmmert did not exist and hit for fpw did not exist
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ awk 'BEGIN { print " search\ttblastn\tphmmert\tlogevaluediff"} FNR>1 {if (($2 < 100) && ($4 < 100)) printf("%40s\t%1.2e\t%1.2e\t%1.2e\n", $1,$2,$4, (log($4)/log(10))-(log($2)/log(10)) ) }' hmmout | sort -g -b -t$'\t' -k 4,8 > hmmfpwlogevaluediff
#filter out results where hit for phmmert did not exist and hit for cons did not exist
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ awk 'BEGIN { print " search\ttblastn\tphmmert\tlogevaluediff"} FNR>1 {if (($3 < 100) && ($4 < 100)) printf("%40s\t%1.2e\t%1.2e\t%1.2e\n", $1,$3,$4, (log($4)/log(10))-(log($3)/log(10)) ) }' hmmout | sort -g -b -t$'\t' -k 4,8 > hmmconslogevaluediff
#get results where hit for phmmert exists and hit for cons did not exist
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ awk 'BEGIN { print " search\ttblastn\tphmmert\tlogevaluediff"} FNR>1 {if (($3 == 100) && ($4 < 100)) printf("%40s\t%1.2e\t%1.2e\t%1.2e\n", $1,$3,$4, (log($4)/log(10))-(log($3)/log(10)) ) }' hmmout | sort -g -b -t$'\t' -k 4,8 > hmmbutnoconslogevaluediff
#get results where hit for phmmert exists and hit for fpw did not exist
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ awk 'BEGIN { print " search\ttblastn\tphmmert\tlogevaluediff"} FNR>1 {if (($2 == 100) && ($4 < 100)) printf("%40s\t%1.2e\t%1.2e\t%1.2e\n", $1,$2,$4, (log($4)/log(10))-(log($2)/log(10)) ) }' hmmout | sort -g -b -t$'\t' -k 4,8 > hmmbutnofpwlogevaluediff
Other notes similar to what is above I took when developing the benchmark scripts:
#The all_alignments.stk file was created by Travis from Pfam DB
#and contains all the multiple sequence alignments in the Pfam DB
#to get the number of MSAs in the file
% grep "\/\/" ../all_alignments.stk | wc
14724 14724 44172
#First get the names of all the alignments
% esl-alistat all_alignments.stk | awk '/Alignment name/ { print $3}' > all_ali_names.lst
#Just subsample 50% of all MSAs so there are not too many families and benchmark is smaller
#and get the names of the alignments to use
% sort -R --random-source all_ali_names.lst all_ali_names.lst | head -n 7362 > some_ali_names.lst
(using "--random-source all_ali_names.lst" to set the seed)
#now get the named MSAs from the file with all the MSAs
% esl-afetch -f all_alignments.stk some_ali_names.lst > 7362_alignments.stk
#put my_msub in path
#my_msub should be in repository
#commands to generate test benchmark:
#generate the DNA background benchmark with decoy shuffled ORFs inserted into the background
wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ /home/um/wshands/infernal/Users/traviswheeler/src/infernal-1.1.1/rmark/rmark-create -N 10 -L 100000000 -R 10 -E 10 --maxtrain 30 --maxtest 20 -D /data/wshands/harderpfamdnamark/Pfam-A.v27.seed rmarkinsertORFandDNA /data/wshands/harderpfamdnamark/7362_alignments.stk /home/um/wshands/infernal/Users/traviswheeler/src/infernal-1.1.1/rmark/rmark3-bg.hmm
#create a DB for tblastn to use
#used ncbi-blast-2.2.31+-x64-linux.tar.gz
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ makeblastdb -dbtype nucl -in rmarkinsertORFandDNA.fa
#create the file that has the amino acid MSAs that have the sequences will be used as query sequences
#against the background sequences
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ ../build_protein_training_seeds.pl
#create the HMMs to use as queries from the amino acid MSAs
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ hmmbuild rmarkinsertAA.hmm rmarkinsertAA.msa
#do the searches
#NOTE the result directories, e.g. tbn.w3.e100.fpw must be created before the command below is called
#otherwise the script will try to write to the directory possibly before it is created and you will loose
#result data
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ perl ../rmark/rmark-master.pl -F -N 16 -C rmarkinsertAA.hmm $H_PATH ../rmark ../rmark ptr.std.e100 ../rmark_opts/phmmert.e100.opts rmarkinsertORFandDNA ../rmark/x-phmmert 1000000
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ perl ../rmark/rmark-master.pl -G rmarkinsertAA -F -N 16 $H_PATH ../rmark ../rmark tbn.w3.e100.cons ../rmark_opts/tblastn-w3-e100.opts rmarkinsertORFandDNA ../rmark/x-tblastn-cons 1000000
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ perl ../rmark/rmark-master.pl -G rmarkinsertAA -F -N 16 $H_PATH ../rmark ../rmark tbn.w3.e100.fpw ../rmark_opts/tblastn-w3-e100.opts rmarkinsertORFandDNA ../rmark/x-tblastn-fpw 1000000
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ perl ../rmark/rmark-master.pl -G rmarkinsertAA -F -N 16 $H_PATH ../rmark ../rmark exonerate.fpw ../rmark_opts/exonerate.opts rmarkinsertORFandDNA ../rmark/x-exonerate-fpw 1000000
#To get status of the jobs:
[wshands@scyld ptr.std.e100]$ qstat -u wshands
#gather statistics for how many positive embedded squences were found by the search tools
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ my_msub gather "../rmark/rmark-pp.sh rmarkinsertORFandDNA tbn.w3.e100.cons 1" 1
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ my_msub gather "../rmark/rmark-pp.sh rmarkinsertORFandDNA tbn.w3.e100.fpw 1" 1
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ my_msub gather "../rmark/rmark-pp.sh rmarkinsertORFandDNA ptr.std.e100 1" 1
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ my_msub gather "../rmark/rmark-pp.sh pfamtransDNAmark exonerate.fpw 1" 1
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ my_msub gather "../rmark/rmark-pp.sh rmarkinsertORFandDNA tbn.w3.e100.fpw 1 .orf" 1
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ my_msub gather "../rmark/rmark-pp.sh rmarkinsertORFandDNA tbn.w3.e100.cons 1 .orf" 1
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ my_msub gather "../rmark/rmark-pp.sh rmarkinsertORFandDNA ptr.std.e100 1 .orf" 1
#create the table that compares the E-Values of the hits on the positive inserted target sequences
#found by each tool
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ ../rmark/compute-glob-table.pl fpw > fpwout
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ ../rmark/compute-glob-table.pl hmm > hmmout
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ ../rmark/compute-glob-table.pl cons > consout
#create the table that compares the E-Values of the false positive hits (hits on the decoy shuffled ORF sequences)
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ ../rmark/compute-glob-table.pl fpw orf > fpworfout
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ ../rmark/compute-glob-table.pl cons orf > consorfout
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ ../rmark/compute-glob-table.pl hmm orf > hmmorfout
#to make plots of shuffled orf hits
[whands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ cat ../rmark/plot.shuffledORFhits.R | R --vanilla
#make a plot of the positive hits
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ cat ../rmark/plot.POShits.R | R --vanilla
#create the difference in log E-Values between phmmert and tblastn cons
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ awk 'BEGIN { print " search\ttblastncons\tphmmert\tlogdiff"} FNR>1 {printf("%40s\t%1.2e\t%1.2e\t%1.2e\n", $1,$3,$4,(log($4)/log(10))-(log($3)/log(10)) ) }' hmmoutmin-170 | sort -g -b -t$'\t' -k 4,8 > logevaluediff
#create the difference in log E-Values between phmmert and tblastn fpw
[wshands@scyld transsearchorf_-R_10_-E_10_7362fam_2016_10_29]$ awk 'BEGIN { print " search\ttblastncons\tphmmert\tlogdiff"} FNR>1 {printf("%40s\t%1.2e\t%1.2e\t%1.2e\n", $1,$2,$4,(log($4)/log(10))-(log($2)/log(10)) ) }' hmmoutmin-170 | sort -g -b -t$'\t' -k 4,8 > logevaluediff