Bio::Sequence.guess issue #14

yannickwurm · 2010-10-30T09:41:28Z

ruby-1.9.2-preview1 > Bio::Sequence.guess("ACGT" )

=> Bio::Sequence::NA

ruby-1.9.2-preview1 > Bio::Sequence.guess("ACGT\n" )

=> Bio::Sequence::AA

Whitespace should not affect sequence type determination, should it?
And perhaps Bio::Sequence.guess(" ") should raise an error instead of returning AA?

cheers,
yannick

ghost · 2010-10-31T18:20:44Z

Bio::Sequence.guess('ACGTCCGGTGGGGGGCACGTAAGATCTCG\n')
=> Bio::Sequence::NA

Bio::Sequence.guess('ACGTCCGGTGGGGGGCACGTAGCTGATCG\t')
=> Bio::Sequence::NA

Bio::Sequence.guess('ACGTCC\n')
=> Bio::Sequence::AA

tomoakin · 2010-11-01T12:39:02Z

The current definition in /lib/bio/sequence.rb is:
def guess(threshold = 0.9, length = 10000, index = 0)
  str = seq.to_s[index,length].to_s.extend Bio::Sequence::Common
  cmp = str.composition

  bases = cmp['A'] + cmp['T'] + cmp['G'] + cmp['C'] + cmp['U'] +
          cmp['a'] + cmp['t'] + cmp['g'] + cmp['c'] + cmp['u']

  total = str.length - cmp['N'] - cmp['n']

  if bases.to_f / total > threshold
    return NA
  else
    return AA
  end
end

So perhaps the line "total = str.length - cmp['N'] - cmp['n']"
could be changed to something like
total = str.length - cmp['N'] - cmp['n'] - cmp["\n"] - cmp["\t"] - cmp["\r"] - cmp[' ']

I think this is worthwhile because users may enter triplet-spaced sequences like
"ACG TCC GGT GGG GGG CAC GTA GCT GAT CGT"
in which spaces make up a considerable proportion of the string (close to 0.25).
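
To make the arithmetic concrete, here is a minimal standalone sketch of the ratio test with and without whitespace in the denominator. It is not BioRuby's code: `base_ratio` and `guess_type` are hypothetical names, `String#count` stands in for `composition`, and the ArgumentError for input with no sequence characters is an assumption that follows the suggestion in the opening post.

```ruby
# Standalone sketch (hypothetical helpers, not part of BioRuby).
# `ignored` lists the characters left out of the denominator:
# "Nn" mirrors the current formula, the default adds whitespace.
def base_ratio(str, ignored = "Nn \t\r\n")
  bases = str.count("ATGCUatgcu")
  total = str.length - str.count(ignored)
  # Assumption: refuse to guess when nothing countable remains.
  raise ArgumentError, "no sequence characters to guess from" if total.zero?
  bases.to_f / total
end

def guess_type(str, threshold = 0.9)
  base_ratio(str) > threshold ? :na : :aa
end

triplets = "ACG TCC GGT GGG GGG CAC GTA GCT GAT CGT"
puts base_ratio("ACGT\n", "Nn")  # current formula: 4 bases out of 5 characters
puts base_ratio("ACGT\n")        # whitespace excluded: 4 bases out of 4
puts base_ratio(triplets, "Nn")  # current formula: 30 bases out of 39 characters
puts guess_type(triplets)        # whitespace excluded
```

With the current denominator, both the trailing newline and the nine spaces pull the ratio below the 0.9 threshold; with whitespace excluded, both inputs come out as nucleic acid.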

yannickwurm · 2010-11-01T15:45:38Z

Should sequence.composition actually be counting non-sequence characters?

(Some users might also end up with numbers in their sequence after copy-pasting from a GenBank entry.)

tomoakin · 2010-11-02T00:55:52Z

In fact, the sequence should not contain non-sequence characters.
Non-sequence characters would also be counted in position (address) calculations.

In fasta.rb, the input is cleaned up before the sequence object is created:

@seq = Sequence::Generic.new(@data.tr(" \t\r\n0-9", '')) # lazy clean up

Thus, Bio::Sequence::new() and Bio::Sequence::guess() do not account for
characters that are not meant to go into the sequence.

Note that guess() is marked as "In general, used by developers only, but if you know what you are doing, feel free."

User input needs to be cleaned up before it is passed to Sequence::new().
After all, someone might paste a whole GenBank or FASTA entry.
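
As a rough illustration of that kind of cleanup, the sketch below applies the same `tr` character set as the fasta.rb line quoted above to a GenBank-style sequence line. `strip_non_sequence` is a hypothetical helper name, not a BioRuby method, and the sample line is made up for the example.

```ruby
# Hypothetical helper (not part of BioRuby): delete spaces, tabs,
# line endings and digits, the character set used in fasta.rb.
def strip_non_sequence(raw)
  raw.tr(" \t\r\n0-9", '')
end

# A made-up line in the style of a GenBank ORIGIN block:
# a position number followed by blocks of ten bases.
genbank_line = "        1 acgtccggtg gggggcacgt aagatctcg\n"
cleaned = strip_non_sequence(genbank_line)
puts cleaned
puts cleaned.length
```

Calling `tr` with an empty replacement string deletes the listed characters, so only the letters remain. Note that this would also drop any digits a user put into the sequence on purpose.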

yannickwurm · 2010-11-12T14:55:25Z

Okay - I hadn't seen the "used by developers only" comment. But without sequence cleaning, the same issues would carry over to Sequence::new(), right?
(perhaps there could be a Sequence::clean() method that removes crap?)

# File lib/bio/sequence.rb, line 262

def auto
  @moltype = guess
  if @moltype == NA
    @seq = NA.new(seq)
  else
    @seq = AA.new(seq)
  end
end

tomoakin · 2010-11-13T00:33:37Z

Sequence.new() just assigns the input to self without any modification.
Whitespace characters (' \t\r\n') are removed in AA.new or NA.new, while numbers are not.

Since Sequence is a general class, I don't think Sequence::clean() can be defined.

As for numbers within an AA sequence, someone may want to use 1 and 2 to distinguish serine residues encoded by non-neighboring codons. This is perhaps not supported, but the current implementation does not prohibit it either.

If a user enters a number in a web form, though, it might be better to interpret it as a GI number or some other kind of accession number.
