Advanced error model(s) #6

Closed
8 tasks done
HadrienG opened this issue Nov 29, 2016 · 7 comments

@HadrienG
Owner

HadrienG commented Nov 29, 2016

Issue to track the progress on the Roadmap item "Add a more complicated error model"

This is where the fun begins.

The current plan is to:

  • Download raw Illumina datasets from known bacterial species*
  • Map the downloaded reads against their reference genome
  • Extract the base quality information from the BAM file
  • Parse the BAM files and, for each position, get the substitution rate (using CIGAR strings?) **
  • Figure out how to get deletion information from the MD tag and CIGAR string (samtools calmd?). Update: it worked well with CIGAR
  • Write the error rates to a file
  • Write the substitution matrix for every position to a file***
  • Generate realistic-ish Illumina data!

* It will be very important to track every step used to generate the profiles: genome downloads, mapping options, ... I will keep everything in Evernote for now but will dump it in the software doc later on.

** I'll need pysam as a dependency. I was worried it would add difficulties to the install process, but pysam installed seamlessly.

*** something of the like (not sure yet, subject to change!):

pos AtoA AtoT ...
1 0.96 0.01 ...
...
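
A minimal sketch of how such a per-position table could be built, assuming the reads have already been aligned and reduced to gap-free (read, reference) string pairs. In practice those pairs might come from pysam (for example via `AlignedSegment.get_aligned_pairs(with_seq=True)`), but that step is left out here. The function names, the tab-separated layout, and the reading of `XtoY` as "reference base X observed as Y" are assumptions for illustration, not the project's actual code.

```python
from collections import Counter, defaultdict

NUCLEOTIDES = "ACGT"


def count_substitutions(aligned_reads):
    """Count (reference base, read base) pairs at each 1-based read position.

    aligned_reads: iterable of (read_seq, ref_seq) strings of equal length
    with no gaps (a hypothetical, pre-processed input).
    """
    counts = defaultdict(Counter)
    for read_seq, ref_seq in aligned_reads:
        pairs = zip(read_seq, ref_seq)
        for pos, (read_base, ref_base) in enumerate(pairs, start=1):
            # Skip ambiguous bases such as N.
            if read_base in NUCLEOTIDES and ref_base in NUCLEOTIDES:
                counts[pos][(ref_base, read_base)] += 1
    return counts


def substitution_table(counts):
    """Return the table as a list of tab-separated lines, header first."""
    header = ["pos"] + [f"{r}to{q}" for r in NUCLEOTIDES for q in NUCLEOTIDES]
    lines = ["\t".join(header)]
    for pos in sorted(counts):
        row = [str(pos)]
        for ref_base in NUCLEOTIDES:
            # Normalise per reference base so each XtoA..XtoT group sums to 1.
            total = sum(counts[pos][(ref_base, q)] for q in NUCLEOTIDES)
            for read_base in NUCLEOTIDES:
                n = counts[pos][(ref_base, read_base)]
                row.append(f"{n / total:.2f}" if total else "0.00")
        lines.append("\t".join(row))
    return lines
```

Calling `substitution_table(count_substitutions(pairs))` and writing the lines to disk would give one row per read position, in the spirit of the layout above.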
@HadrienG HadrienG self-assigned this Nov 29, 2016
@HadrienG HadrienG mentioned this issue Nov 29, 2016
20 tasks
@HadrienG
Owner Author

HadrienG commented Feb 1, 2017

Alright let's get back on this. A few more TODOs!

  • write two subparsers, one for calculating an error model from a bam file, the other to generate reads
  • split the error model in two: one for R1, one for R2
  • switch the phred quality model from the current normalised cumulative density function to kernel density estimation
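
The two-subparser layout from the first TODO could look roughly like this with the standard library's `argparse`. The program name, subcommand names, and every option below are hypothetical placeholders, not the tool's real interface.

```python
import argparse


def build_parser():
    """Build a CLI with one subcommand per task (all names are hypothetical)."""
    parser = argparse.ArgumentParser(prog="readsim")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # Subcommand 1: derive an error model from a BAM file.
    model = subparsers.add_parser(
        "model", help="calculate an error model from a BAM file")
    model.add_argument("--bam", required=True, help="aligned reads (BAM)")
    model.add_argument("--output", default="error_model",
                       help="prefix for the error model file")

    # Subcommand 2: generate reads using a previously built model.
    generate = subparsers.add_parser(
        "generate", help="generate reads from an error model")
    generate.add_argument("--model", required=True, help="error model file")
    generate.add_argument("--n-reads", type=int, default=1000,
                          help="number of reads to generate")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Keeping model building and read generation as separate subcommands means a model computed once from a BAM file can be reused for many simulation runs.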

I'm focusing on HiSeq data at the moment. Once we get the functions right, it should be easy to add models for different machines.

@HadrienG
Owner Author

HadrienG commented May 4, 2017

For the indels:

  • if clause to catch indels and pass
  • improved dispatch_dict with 2 more keys per nucleotide: 'N1' for insertions, 'N2' for deletions
  • deal with the cigar_string after having read the read
  • increment the dispatch_dict accordingly
  • change the subst_matrix_to_choices method to reflect the new dispatch dict
  • generate indel. This should be independent of the phred score
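
As a rough illustration of the "deal with the cigar_string" step, the sketch below walks a CIGAR string and records where insertions and deletions fall along the read. It uses only the operation semantics from the SAM specification; the function name and the `(kind, read_position)` keys are assumptions, and how these counts would be folded into `dispatch_dict` is not shown in this thread.

```python
import re
from collections import Counter

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def count_indels(cigar_string):
    """Count indel events keyed by (kind, 0-based read position).

    An insertion is recorded at the read position of its first inserted base;
    a deletion is recorded at the read position of the base that follows it.
    """
    events = Counter()
    read_pos = 0  # index of the next read base to be consumed
    for length, op in CIGAR_OP.findall(cigar_string):
        length = int(length)
        if op == "I":
            events[("insertion", read_pos)] += 1
            read_pos += length  # inserted bases are present in the read
        elif op == "D":
            events[("deletion", read_pos)] += 1  # consumes reference only
        elif op in "MS=X":
            read_pos += length  # these operations consume read bases
        # N, H and P consume no read bases, so read_pos is unchanged.
    return events
```

For example, `count_indels("3M1I2M2D4M")` reports one insertion at read position 3 and one deletion at read position 6. Because the counts depend only on the CIGAR string, they stay independent of the Phred scores, as the last item above asks.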

@HadrienG
Owner Author

HadrienG commented May 4, 2017

Concerning the error model itself: the per read sequence quality looks nice, but the per sequence mean quality still looks a bit fake-ish.

I might want to switch to a 2D KDE (taking the mean quality into account?), but that's more of a v1 issue. An alpha could be released with somewhat poor modelling of the average sequence quality.
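
For context, here is a minimal sketch of what sampling Phred scores from a 1D Gaussian KDE per read position could look like, using only the standard library. Drawing from a Gaussian KDE amounts to picking one observed value at random and adding Gaussian noise with the kernel bandwidth. The function names, the normal-reference bandwidth rule, and the `max_q=41` cap are assumptions for illustration, not the project's implementation.

```python
import random


def rule_of_thumb_bandwidth(values):
    """Normal-reference bandwidth: 1.06 * sd * n^(-1/5) (assumed choice)."""
    n = len(values)
    if n < 2:
        return 0.0
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return 1.06 * variance ** 0.5 * n ** (-1 / 5)


def sample_quality(observed, rng, max_q=41):
    """Draw one Phred score from a Gaussian KDE over the observed scores."""
    bandwidth = rule_of_thumb_bandwidth(observed)
    centre = rng.choice(observed)          # pick a kernel centre
    value = rng.gauss(centre, bandwidth)   # add kernel noise
    return min(max(int(round(value)), 0), max_q)  # clamp to a valid range


def simulate_read_qualities(quality_by_position, rng):
    """Sample one score per position; positions are drawn independently."""
    return [sample_quality(obs, rng) for obs in quality_by_position]


if __name__ == "__main__":
    rng = random.Random(42)
    per_position = [[38, 39, 40, 40], [30, 32, 35, 36], [20, 22, 25, 28]]
    print(simulate_read_qualities(per_position, rng))
```

A sampler like this treats every position independently, so it has no notion of a read-level mean quality. That may be related to the fake-ish mean quality noted above, though the thread does not say so; a 2D KDE over (position score, read mean) would be one way to couple the two.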

@HadrienG
Owner Author

Closed (for now?) with 06b16f3 🚀

@HadrienG HadrienG reopened this Jul 19, 2017
@HadrienG
Owner Author

HadrienG commented Jul 19, 2017

Re-opened for refining the error model:

@HadrienG
Owner Author

HadrienG commented Sep 19, 2017

The 2D KDE didn't give the expected result. We'll continue with 1D and explore other possibilities in the future.

Update: never mind, I fixed it.

@HadrienG
Owner Author

Closed with the last release.
