Advanced error model(s) #6

HadrienG · 2016-11-29T13:17:02Z

Issue to track the progress on the Roadmap item "Add a more complicated error model"

This is where the fun begins.

The current plan is to:

Download raw Illumina datasets from known bacterial species*
Map the downloaded reads against their reference genome
Extract the Base quality information from the BAM file
Parse the BAM files and for each position, get the substitution rate (using CIGAR strings?) **
Figure out how to get deletion information from the ~~MD tag and~~ CIGAR string. ~~(samtools calmd?)~~ It worked well with CIGAR.
Write the error rates to a file
Write the substitution matrix for every position to a file***
Generate realistic-ish Illumina data!

* It will be very important to track every step used to generate the profiles: genome downloads, mapping options, ... I will keep everything in Evernote for now but will move it into the software documentation later on.

** I'll need pysam as a dependency. ~~Hopefully it will not add difficulties to the install process~~ pysam installed seamlessly.

*** Something like this (not sure yet, subject to change!):

pos	AtoA	AtoT	...
1	0.96	0.01	...
...
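As a rough illustration of the table format above, here is a minimal sketch of how per-position substitution rates could be tallied and written out. It is not the implementation used in the project: the function names `count_substitutions` and `substitution_table` are hypothetical, and the input is assumed to be a list of `(read_pos, ref_base, read_base)` tuples that something like pysam's aligned pairs could supply.

```python
from collections import Counter, defaultdict

NUCLEOTIDES = "ACGT"


def count_substitutions(aligned_pairs):
    """Count ref->read base pairs per read position (hypothetical helper).

    aligned_pairs: iterable of (read_pos, ref_base, read_base) tuples,
    with 0-based read_pos. Pairs involving non-ACGT bases are skipped.
    """
    counts = defaultdict(Counter)
    for read_pos, ref_base, read_base in aligned_pairs:
        ref_base, read_base = ref_base.upper(), read_base.upper()
        if ref_base in NUCLEOTIDES and read_base in NUCLEOTIDES:
            counts[read_pos][f"{ref_base}to{read_base}"] += 1
    return counts


def substitution_table(counts):
    """Return tab-separated text, one row per position.

    Rates are normalised per reference base, so the four columns
    AtoA, AtoC, AtoG, AtoT sum to 1 when A was observed at that position.
    """
    columns = [f"{r}to{q}" for r in NUCLEOTIDES for q in NUCLEOTIDES]
    lines = ["\t".join(["pos"] + columns)]
    for pos in sorted(counts):
        row = [str(pos + 1)]  # 1-based positions, as in the table above
        for ref in NUCLEOTIDES:
            total = sum(counts[pos][f"{ref}to{q}"] for q in NUCLEOTIDES)
            for q in NUCLEOTIDES:
                rate = counts[pos][f"{ref}to{q}"] / total if total else 0.0
                row.append(f"{rate:.2f}")
        lines.append("\t".join(row))
    return "\n".join(lines)
```

Normalising per reference base is one possible choice; normalising over all pairs at a position would work too, but would mix the base composition into the error rates.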

HadrienG · 2017-02-01T13:36:53Z

Alright let's get back on this. A few more TODOs!

Write two subparsers: one for calculating an error model from a BAM file, the other for generating reads
Split the error model in two: one for R1, one for R2
Switch the Phred quality model from the current normalised cumulative distribution function to kernel density estimation
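The first TODO above could be sketched with `argparse` subparsers as follows. This is only an illustration: the program name, the subcommand names (`model`, `generate`) and every option shown are assumptions, not the tool's actual command-line interface.

```python
import argparse


def build_parser():
    """Build a parser with two subcommands (all names are hypothetical)."""
    parser = argparse.ArgumentParser(prog="simulator")
    subparsers = parser.add_subparsers(dest="command")
    subparsers.required = True  # fail early if no subcommand is given

    # Subcommand 1: compute an error model from mapped reads
    model = subparsers.add_parser(
        "model", help="build an error model from a BAM file")
    model.add_argument("--bam", required=True, help="input BAM file")
    model.add_argument("--output", default="error_model",
                       help="prefix for the error model files")

    # Subcommand 2: generate reads using a previously built model
    generate = subparsers.add_parser(
        "generate", help="generate reads from a genome and an error model")
    generate.add_argument("--genome", required=True, help="reference FASTA")
    generate.add_argument("--model", required=True, help="error model file")
    generate.add_argument("--n-reads", type=int, default=1000,
                          help="number of reads to simulate")
    return parser
```

For example, `build_parser().parse_args(["model", "--bam", "reads.bam"])` would yield a namespace with `command == "model"`, which the program could then dispatch on.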

I'm focusing on HiSeq data at the moment. Once we get the functions right, it should be easy to add models for different machines.

HadrienG · 2017-05-04T19:12:03Z

For the indels:

if clause to catch indels and pass
improved dispatch_dict with 2 more keys per nucleotide: 'N1' for insertions, 'N2' for deletions
deal with the cigar_string after having read the read
increment the dispatch_dict accordingly
change the subst_matrix_to_choices method to reflect the new dispatch dict
generate indels. This should be independent of the phred score
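The indel bookkeeping listed above could look roughly like the sketch below. It assumes the CIGAR is available as `(operation, length)` tuples with the numeric operation codes used in BAM files (the form pysam exposes as `cigartuples`), and it reads the `'N1'`/`'N2'` keys as "nucleotide plus 1 for an insertion, nucleotide plus 2 for a deletion", keyed on the read base just before the event. Both the function name `count_indels` and that keying convention are assumptions, not the project's code.

```python
from collections import Counter

# CIGAR operation codes as stored in BAM files
MATCH, INS, DEL, REF_SKIP, SOFT_CLIP, HARD_CLIP, PAD, EQUAL, DIFF = range(9)


def count_indels(read_seq, cigartuples, counts=None):
    """Increment per-nucleotide indel counters for one read.

    Keys are '<base>1' for an insertion and '<base>2' for a deletion,
    where <base> is the read base immediately before the event
    (an assumed convention for this sketch).
    """
    counts = Counter() if counts is None else counts
    read_pos = 0
    for op, length in cigartuples:
        if op in (INS, DEL) and read_pos > 0:
            prev = read_seq[read_pos - 1].upper()
            if prev in "ACGT":
                counts[prev + ("1" if op == INS else "2")] += 1
        # Only these operations consume bases of the read
        if op in (MATCH, INS, SOFT_CLIP, EQUAL, DIFF):
            read_pos += length
    return counts
```

Each event is counted once regardless of its length, which keeps indel rates independent of the quality scores, in line with the last item above. Modelling indel lengths would need a separate tally.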

HadrienG · 2017-05-04T19:14:59Z

Concerning the error model itself: the per-read sequence quality looks nice, but the per-sequence mean quality still looks a bit fake-ish.

I might want to switch to a 2D KDE (taking the mean quality into account?), but that's more of a v1 issue. An alpha could be released with somewhat poor modelling of the average sequence quality.
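For context, a per-position 1D KDE quality model can be sketched with the standard library alone: drawing from a Gaussian KDE amounts to picking one observed value at random and adding Gaussian noise scaled by the bandwidth. Everything below is an assumption made for illustration (the function names, the rule-of-thumb bandwidth, the 0 to 41 Phred range), not the project's implementation. Because each position is sampled independently, nothing ties a read's scores to a shared mean, which is one plausible reason the mean quality could look off.

```python
import random
import statistics


def kde_sampler(observed, rng=None, q_min=0, q_max=41):
    """Return a function that draws one Phred score from a Gaussian KDE.

    observed: list of quality scores seen at one read position.
    The bandwidth is a rule of thumb (std dev * n^(-1/5)); q_min and
    q_max are assumed bounds for the scores.
    """
    rng = rng or random.Random()
    n = len(observed)
    sd = statistics.pstdev(observed)
    bandwidth = sd * n ** (-1 / 5) if sd > 0 else 1.0

    def sample():
        centre = rng.choice(observed)          # pick a kernel
        value = rng.gauss(centre, bandwidth)   # draw from that kernel
        return min(q_max, max(q_min, round(value)))

    return sample


def simulate_read_qualities(per_position_quals, rng=None):
    """Draw one quality score per position, independently of the others."""
    rng = rng or random.Random()
    samplers = [kde_sampler(quals, rng) for quals in per_position_quals]
    return [draw() for draw in samplers]
```

A 2D variant would sample jointly over (position score, read mean quality) instead of treating positions independently.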

HadrienG · 2017-05-10T12:37:32Z

Closed (for now?) with 06b16f3 🚀

HadrienG · 2017-07-19T18:31:11Z

Re-opened for refining the error model:

2D KDE ( Switch from 1D KDE to 2D #22 )
GC bias ( Model GC bias #19 )
Insert size distribution ( Model the insert size distribution instead of taking the mean #18 )

HadrienG · 2017-09-19T11:27:56Z

~~The 2D KDE didn't give the expected result. We'll continue with 1D and explore other possibilities in the future.~~

nvm i fixed it

HadrienG · 2017-09-20T14:24:09Z

closed with the last release

HadrienG self-assigned this Nov 29, 2016

HadrienG mentioned this issue Nov 29, 2016

Roadmap #1

Open

20 tasks

HadrienG closed this as completed May 10, 2017

HadrienG reopened this Jul 19, 2017

HadrienG added this to the 1.0.0 milestone Jul 19, 2017

HadrienG added the enhancement label Jul 19, 2017

HadrienG mentioned this issue Aug 3, 2017

Model the insert size distribution instead of taking the mean #18

Merged

HadrienG closed this as completed Sep 20, 2017
