Simple Error Model #5

Closed
4 tasks done
HadrienG opened this issue Nov 21, 2016 · 4 comments
@HadrienG
Owner

HadrienG commented Nov 21, 2016

Issue to track the progress on the Roadmap item "Add a simple error model"

I guess the simplest approach would be to:

  • Add 2 parameters: mean_quality and std_dev*
  • Generate random quality scores following a normal distribution
  • Eventually modify the nucleotides whose quality isn't perfect (it never is)
  • Write docstrings and tests

A few unknowns:

  • Should the length of the sequences vary?
  • Which base should we switch to if we get an erroneous call? A random other nucleotide? I would guess a random nucleotide is good enough for a simple error model.

* I haven't added a standard deviation parameter. It is hardcoded to 0.01, but that can be discussed.
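
The steps listed above could look roughly like the sketch below. This is only an illustration and not the code that ended up in the repository: the function names (`gen_phred_scores`, `phred_to_error_prob`, `introduce_errors`) are hypothetical, and treating the quality scores as Phred scores (error probability 10^(-Q/10)) is an assumption. The 0.01 default for `std_dev` mirrors the hardcoded value mentioned in the footnote.

```python
import random

NUCLEOTIDES = "ACGT"


def gen_phred_scores(mean_quality, read_length, std_dev=0.01, rng=random):
    """Draw one quality score per base from a normal distribution.

    std_dev defaults to 0.01, the hardcoded value mentioned in the issue.
    Scores are rounded to integers and clamped at 0 (assumed Phred scale).
    """
    scores = []
    for _ in range(read_length):
        q = round(rng.gauss(mean_quality, std_dev))
        scores.append(max(q, 0))
    return scores


def phred_to_error_prob(q):
    """Convert a Phred score to the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def introduce_errors(sequence, scores, rng=random):
    """Replace each base with a random *other* nucleotide with the
    probability implied by its quality score (substitutions only)."""
    mutated = []
    for base, q in zip(sequence, scores):
        if rng.random() < phred_to_error_prob(q):
            base = rng.choice([n for n in NUCLEOTIDES if n != base])
        mutated.append(base)
    return "".join(mutated)


if __name__ == "__main__":
    rng = random.Random(42)  # seeded for reproducibility
    read = "ACGT" * 19       # a 76 bp toy read
    quals = gen_phred_scores(30, len(read), rng=rng)
    print(introduce_errors(read, quals, rng=rng))
```

With such a small standard deviation, every rounded score ends up equal to `mean_quality`, so in this sketch the spread only starts to matter once `std_dev` is made larger.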

@HadrienG HadrienG self-assigned this Nov 21, 2016
@Ackia

Ackia commented Nov 21, 2016

A few unknowns:

Should the length of the sequences vary?

Yes, they should vary, at least to a certain degree. Read lengths vary in all sequencing technologies, and they are often not normally distributed.

To which base switch if we get an erroneous call. A random other nucleotide? I would guess a random nucl. is good enough for a simple error model.

Random should be good enough. Possibly also include indels?

@HadrienG HadrienG mentioned this issue Nov 21, 2016
20 tasks
@HadrienG
Owner Author

@Ackia the last HiSeq reads I received are all 76 bp. Also, in the BEAR article, they state that "Illumina reads are generally uniform in length, reads from other technologies can vary greatly in length", which makes sense since X cycles should give you X base pairs.

Indels occur at a really low rate in Illumina data: 2.8 x 10^−6 errors per base for R1 insertions and 5.1 x 10^−6 errors per base for R1 deletions, according to doi.org/10.1186/s12859-016-0976-y
I'm gonna leave them out of the simple error model, which is really more a test for me to create reads than a model we're gonna use.
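
To put those rates in perspective, here is a quick back-of-the-envelope calculation. It assumes the R1 rates quoted above apply uniformly to every position of a 76 bp read, which is a simplification; the function name is just for illustration.

```python
# Per-base R1 indel rates quoted above (errors per base)
INSERTION_RATE = 2.8e-6
DELETION_RATE = 5.1e-6
READ_LENGTH = 76  # bp, as in the HiSeq reads mentioned above


def expected_indels_per_read(read_length, ins_rate, del_rate):
    """Expected number of indels in one read, assuming a uniform,
    position-independent rate along the read."""
    return read_length * (ins_rate + del_rate)


expected = expected_indels_per_read(READ_LENGTH, INSERTION_RATE, DELETION_RATE)
reads_per_indel = 1 / expected

print(f"expected indels per read: {expected:.2e}")
print(f"about one indel every {reads_per_indel:.0f} reads")
```

Under these assumptions that is roughly 6 x 10^−4 indels per read, so substitutions dominate by far and leaving indels out of a first model costs very little.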

@Ackia

Ackia commented Nov 29, 2016

I agree. I was mixing Illumina up with IonTorrent. My bad. Good progress!

@HadrienG
Owner Author

Closed with 031454b ! 🚀
