Bio::Sequence.guess issue #14

Open
yannickwurm opened this issue Oct 30, 2010 · 6 comments

@yannickwurm

ruby-1.9.2-preview1 > Bio::Sequence.guess("ACGT" )

=> Bio::Sequence::NA

ruby-1.9.2-preview1 > Bio::Sequence.guess("ACGT\n" )

=> Bio::Sequence::AA

Shouldn't whitespace have no effect on the sequence type that is guessed?
And perhaps Bio::Sequence.guess(" ") should raise an error instead of returning AA?

cheers,
yannick

@ghost

ghost commented Oct 31, 2010

Bio::Sequence.guess('ACGTCCGGTGGGGGGCACGTAAGATCTCG\n')
=> Bio::Sequence::NA

Bio::Sequence.guess('ACGTCCGGTGGGGGGCACGTAGCTGATCG\t')
=> Bio::Sequence::NA

Bio::Sequence.guess('ACGTCC\n')
=> Bio::Sequence::AA
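
These results are consistent with the ratio that guess computes (bases divided by the string length minus N/n), as quoted in the next comment. Below is a minimal standalone sketch of that arithmetic; it does not load BioRuby, and `base_ratio` and `guess_type` are hypothetical names that return symbols rather than the real classes. Note that the calls above use single quotes, where Ruby treats \n as two literal characters (a backslash and an n) rather than a newline.

```ruby
# Standalone imitation of the ratio used by guess; this is NOT BioRuby code.
# base_ratio and guess_type are hypothetical names used for illustration only.
BASES = %w[A T G C U a t g c u].freeze

def base_ratio(str)
  bases = str.each_char.count { |ch| BASES.include?(ch) }
  # Only N and n are excluded from the denominator, so whitespace still counts.
  total = str.length - str.count('Nn')
  bases.to_f / total
end

def guess_type(str, threshold = 0.9)
  base_ratio(str) > threshold ? :NA : :AA
end

# Short inputs are dominated by a single extra character; long ones are not.
["ACGT", "ACGT\n", 'ACGTCC\n'].each do |s|
  puts "#{s.inspect}: ratio=#{base_ratio(s).round(3)} -> #{guess_type(s)}"
end
```

In this sketch a single stray character is enough to push a short string under the 0.9 threshold, while the same character has little effect on a 29-base string.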

@tomoakin
Contributor

tomoakin commented Nov 1, 2010

The current definition in /lib/bio/sequence.rb is:

def guess(threshold = 0.9, length = 10000, index = 0)
  str = seq.to_s[index,length].to_s.extend Bio::Sequence::Common
  cmp = str.composition

  bases = cmp['A'] + cmp['T'] + cmp['G'] + cmp['C'] + cmp['U'] +
          cmp['a'] + cmp['t'] + cmp['g'] + cmp['c'] + cmp['u']

  total = str.length - cmp['N'] - cmp['n']

  if bases.to_f / total > threshold
    return NA
  else
    return AA
  end
end

So perhaps the line "total = str.length - cmp['N'] - cmp['n']"
could be changed to something like

total = str.length - cmp['N'] - cmp['n'] - cmp["\n"] - cmp["\t"] - cmp["\r"] - cmp[' ']

I think excluding spaces is meaningful because users may enter a triplet-formatted sequence like
"ACG TCC GGT GGG GGG CAC GTA GCT GAT CGT"
in which spaces make up a considerable proportion of the string, close to a quarter of its characters.
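
The proposed denominator can be tried out in isolation. This is a rough sketch in plain Ruby rather than a patch to BioRuby: `ratio_ignoring_whitespace` is a hypothetical name, and String#count stands in for the composition hash used in the real method.

```ruby
# Standalone sketch of the proposed change; this is NOT BioRuby code.
# ratio_ignoring_whitespace is a hypothetical name used for illustration only.
BASE_CHARS = 'ATGCUatgcu'

def ratio_ignoring_whitespace(str)
  bases = str.count(BASE_CHARS)
  # Proposed denominator: drop N/n and whitespace from the total.
  total = str.length - str.count('Nn') - str.count(" \t\r\n")
  bases.to_f / total
end

triplets = "ACG TCC GGT GGG GGG CAC GTA GCT GAT CGT"
puts "current:  #{(triplets.count(BASE_CHARS).to_f / triplets.length).round(3)}"
puts "proposed: #{ratio_ignoring_whitespace(triplets).round(3)}"
```

One caveat with this sketch: for a string made only of whitespace the denominator becomes zero, 0.0 / 0 evaluates to NaN in Ruby, and NaN > threshold is false, so such input would still fall through to AA unless an explicit check is added.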

@yannickwurm
Author

Should sequence.composition actually be counting non-sequence characters?

(Some users might also end up with numbers in their sequence after copy-pasting from a GenBank entry.)

@tomoakin
Contributor

tomoakin commented Nov 2, 2010

In fact, a sequence should not contain non-sequence characters.
Non-sequence characters would also be counted when calculating positions within the sequence.

In fasta.rb, the input is cleaned up before the sequence object is created:

@seq = Sequence::Generic.new(@data.tr(" \t\r\n0-9", '')) # lazy clean up

Thus, Bio::Sequence::new() and Bio::Sequence::guess() do not account for
characters that are not meant to be part of the sequence.

Note that guess() is marked as "In general, used by developers only, but if you know what you are doing, feel free."

User input needs to be cleaned up before it is passed to Sequence::new().
After all, a user might paste a whole GenBank or FASTA entry as well.
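
A caller-side cleanup along these lines could look like the following sketch. `clean_sequence_input` is a hypothetical helper, not part of BioRuby; it reuses the same tr character set as the fasta.rb line quoted above and, as an added assumption, rejects input that has nothing left after cleaning.

```ruby
# Hypothetical helper, NOT part of BioRuby: strips whitespace and digits
# (the same characters the fasta.rb line removes) before the string is
# handed to a sequence class.
def clean_sequence_input(raw)
  cleaned = raw.to_s.tr(" \t\r\n0-9", '')
  # Assumption for this sketch: empty input is treated as an error.
  raise ArgumentError, 'no sequence characters in input' if cleaned.empty?
  cleaned
end

# A GenBank-style line with a position number and space-separated blocks.
puts clean_sequence_input("        1 acgtccggtg gggggcacgt\n")
```

This only handles whitespace and digits; a full GenBank or FASTA entry would still carry header text and would need a real parser instead.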

@yannickwurm
Author

Okay - I hadn't seen the "used by developers only" comment. But without sequence cleaning, the same issues would carry over to Sequence::new(), right?
(perhaps there could be a Sequence::clean() method that removes crap?)

# File lib/bio/sequence.rb, line 262
def auto
  @moltype = guess
  if @moltype == NA
    @seq = NA.new(seq)
  else
    @seq = AA.new(seq)
  end
end

@tomoakin
Contributor

Sequence.new() just assigns the input to self without any modification.
Whitespace characters (' \t\r\n') are removed in AA.new and NA.new, but numbers are not.

Since Sequence is a general class, I don't think Sequence::clean() can be defined.

As for numbers within an AA sequence, someone may want to use 1 and 2 to distinguish serine residues encoded by non-neighboring codons. This is perhaps not supported, but the current implementation does not forbid it either.

If a user enters a number in a web form, it might be better to interpret it as a GI number or some other kind of accession number, though.
