Skip to content

Commit

Permalink
Bug fix: parse error when unaligned regions exist
Browse files Browse the repository at this point in the history
 * Bug fix: parse error when unaligned regions exist. Thanks to
   Tomoaki NISHIYAMA who reports the bug ([BioRuby] SIM4 parser).
 * To confirm the bug fix, tests are added with new test data.
  • Loading branch information
ngoto committed Aug 28, 2009
1 parent 137ec4c commit 02d531e
Show file tree
Hide file tree
Showing 3 changed files with 421 additions and 13 deletions.
59 changes: 46 additions & 13 deletions lib/bio/appl/sim4/report.rb
Expand Up @@ -161,7 +161,7 @@ def initialize(seq1, seq2, midline = nil,
attr_reader :percent_identity

# Returns directions of mapping.
# Maybe one of "->", "<-" or "" or nil.
# Maybe one of "->", "<-", "==" or "" or nil.
# This would be a Bio::Sim4 specific method.
attr_reader :direction

Expand All @@ -170,7 +170,7 @@ def initialize(seq1, seq2, midline = nil,
# Bio::Sim4::Report::Hit class.
# Users shall not use it directly.
def self.parse(str, aln)
/^(\d+)\-(\d+)\s*\((\d+)\-(\d+)\)\s*([\d\.]+)\%\s*([\-\<\>]*)/ =~ str
/^(\d+)\-(\d+)\s*\((\d+)\-(\d+)\)\s*([\d\.]+)\%\s*([\-\<\>\=]*)/ =~ str
self.new(Segment.new($1, $2, aln[0]),
Segment.new($3, $4, aln[2]),
aln[1], $5, $6)
Expand Down Expand Up @@ -198,6 +198,17 @@ def self.seq2_intron(prev_e, e, aln)
aln[1])
end

# Parses part of sim4 result text and creates a new SegmentPair
# object for regions which can not be aligned correctly.
# It is designed to be called internally from
# Bio::Sim4::Report::Hit class.
# Users shall not use it directly.
def self.both_intron(prev_e, e, aln)
self.new(Segment.new(prev_e.seq1.to+1, e.seq1.from-1, aln[0]),
Segment.new(prev_e.seq2.to+1, e.seq2.from-1, aln[2]),
aln[1])
end

#--
# Bio::BLAST::*::Report::Hsp compatible methods
# Methods already defined: midline, percent_identity
Expand Down Expand Up @@ -318,7 +329,9 @@ def parse_segmentpairs
exo << e
if ai then
# intron data in alignment
if ai[2].strip.empty? then
if ai[1].strip.empty? then
i = SegmentPair.both_intron(prev_e, e, ai)
elsif ai[2].strip.empty? then
i = SegmentPair.seq1_intron(prev_e, e, ai)
else
i = SegmentPair.seq2_intron(prev_e, e, ai)
Expand All @@ -338,11 +351,24 @@ def parse_segmentpairs
# Parses alignment.
def parse_align
s1 = []; ml = []; s2 = []
blocks = []
blocks.push [ s1, ml, s2 ]
dat = @data[1..-1]
return unless dat
dat.each do |str|
a = str.split(/\r?\n/)
a.shift
ruler = a.shift
# First line, for example,
# " 50 . : . : . : . : . :"
# When the number is 0, forced to be a separated block
if /^\s*(\d+)/ =~ ruler and $1.to_i == 0 and !ml.empty? then
s1 = []; ml = []; s2 = []
blocks.push [ s1, ml, s2 ]
end
# For example,
# " 190 GAGTCATGCATGATACAA CTTATATATGTACTTAGCGGCA"
# " ||||||||||||||||||<<<...<<<-||-|||||||||||||||||||"
# " 400 GAGTCATGCATGATACAACTT...AGCGCT ATATATGTACTTAGCGGCA"
if /^(\s*\d+\s)(.+)$/ =~ a[0] then
range = ($1.length)..($1.length + $2.chomp.length - 1)
a.collect! { |x| x[range] }
Expand All @@ -351,16 +377,23 @@ def parse_align
s2 << a.shift
end
end #each
alx = ml.join('').split(/([\<\>]+\.+[\<\>]+)/)
seq1 = s1.join(''); seq2 = s2.join('')
i = 0
alx.collect! do |x|
len = x.length
y = [ seq1[i, len], x, seq2[i, len] ]
i += len
y
alx_all = []
blocks.each do |ary|
s1, ml, s2 = ary
alx = ml.join('').split(/([\<\>]+\.+[\<\>]+)/)
seq1 = s1.join(''); seq2 = s2.join('')
i = 0
alx.collect! do |x|
len = x.length
y = [ seq1[i, len], x, seq2[i, len] ]
i += len
y
end
# adds virtual intron information if necessary
alx_all.push([ '', '', '' ]) unless alx_all.empty?
alx_all.concat alx
end
@align = alx
@align = alx_all
end
private :parse_align

Expand Down
43 changes: 43 additions & 0 deletions test/data/sim4/complement-A4.sim4
@@ -0,0 +1,43 @@

seq1 = sample41-1c.fst, 284 bp
seq2 = sample40-2.fst (genome4), 770 bp

>mrna4c
>genome4

(complement)

1-72 (351-424) 89% ->
73-142 (563-630) 95% ==
213-284 (700-770) 95%

0 . : . : . : . : . :
1 TTTTAGCCGGCACGAGATTG AGCGTATGATCACGCGCGCGGCCTCCT C
||||||||||||||||||||-||||-||||||||||||||||||||||-|
351 TTTTAGCCGGCACGAGATTGCAGCG ATGATCACGCGCGCGGCCTCCTAC

50 . : . : . : . : . :
49 AGAGTGATGCATGATACAACTT AT ATATGTACTTAGCTG
-|||| ||||||||||||||||- |->>>...>>>|||||||||||||-|
400 GAGTCATGCATGATACAACTTCTTGGTT...GATATATGTACTTAGC G

100 . : . : . : . : . :
88 GCAACCGAGATTTACTTTCGAAGCACTGTGATGAACCCGCGGCCCTTTGA
||||||||||||||||||||||| |||||||||||||||||-||||||||
577 GCAACCGAGATTTACTTTCGAAGGACTGTGATGAACCCGCG CCCTTTGA

150 .
138 GCGCT
|||||
626 GCGCT

0 . : . : . : . : . :
213 TATATATGTACTTAGCGG ACACCGAGATTTACTTTCGAAGGACTGTGGA
||||||||||||||||||-|-|||||||||||||||||||||||||||-|
700 TATATATGTACTTAGCGGCA ACCGAGATTTACTTTCGAAGGACTGTG A

50 . : . :
262 TGAACCCGCGCCCTTTGAGCGCT
|||||||||||||||||||||||
748 TGAACCCGCGCCCTTTGAGCGCT

0 comments on commit 02d531e

Please sign in to comment.