
Best way to run proovread on large dataset #48

Closed
bibilujan opened this issue Mar 9, 2016 · 6 comments

@bibilujan

Hi,

I am interested in using proovread to correct PacBio long reads. My question is about usage: the manual says to try a subset of the data first, which I did. Now that I have confirmed it works, should I run proovread on my whole dataset (a 19 GB file), or should I always run it on small subsets of PacBio reads (20M in the manual)?

I currently have about 18X Illumina and 37X PacBio long-read coverage. The genome I am sequencing is 260 Mbp. I plan to get more Illumina paired-end data to reach at least 50X paired-end coverage, but I am running some tests while I wait for the new data.

Thank you for your time.
Beatriz

@thackl commented Mar 9, 2016

Hi Beatriz,

the recommendation regarding chunk size in the manual is somewhat outdated. With the latest versions of proovread, the rule of thumb is: use chunks as large as possible, within the following limits:

  1. RAM of your cluster nodes: chunks up to a few GB should be fine.
  2. Maximum per-job runtime on some scheduling systems: smaller chunks finish faster in absolute time, but larger chunks take less total runtime over the entire set.
  3. Size of your genome: chunks should not exceed 1X genome coverage. In your case, I would go for ~200 Mbp chunks, which is something like SeqChunker -s 400M for FASTQ files.

Also, use the maximum --threads; running multiple instances on the same machine is no longer necessary. And for your data set, make sure to set --coverage=18.
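Put together, the advice above might look like the following shell sketch. The file names and chunk output pattern are examples, and the -l/-s/--pre flags follow the proovread README; double-check everything against proovread --help and SeqChunker --help for your version.

```shell
#!/bin/sh
# Rule of thumb from above: chunks of at most ~1X genome coverage.
# FASTQ is roughly 2 bytes per base, so 1X of a 260 Mbp genome is about
# 520M of FASTQ; the suggested 400M (~200 Mbp) stays safely below that.
GENOME_MBP=260
echo "1X genome coverage is about $(( GENOME_MBP * 2 ))M of FASTQ"

# Split the PacBio reads into chunks (uncomment to run; the output naming
# is an example -- see SeqChunker --help for the pattern syntax):
# SeqChunker -s 400M -o pb-%03d.fq pacbio-subreads.fq

# Correct each chunk with all threads and the short-read coverage (18X here):
# for chunk in pb-*.fq; do
#   proovread --threads "$(nproc)" --coverage 18 \
#             -l "$chunk" -s illumina_1.fq -s illumina_2.fq --pre "${chunk%.fq}"
# done
```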

@bibilujan

Hi Thomas,

Thank you for your prompt reply and your useful comments. I was interested in using CCS reads as well as Illumina for correction. However, I see that the %masked drops considerably in the last iteration when I use Illumina+CCS. Is this normal? Is it better to avoid mixing data types? At the moment I am running these tests at low coverage, so I don't know whether that is a factor.

ILLUMINA HISEQ (16X) ONLY
Running mode: sr
Running task bwa-sr-1
Masked : 67.8%
Running task bwa-sr-2
Masked : 84.5%
Running task bwa-sr-3
Masked : 88.0%
Running task bwa-sr-4
Masked : 93.8%
Running task bwa-sr-finish
Masked : 79.4%

ILLUMINA HISEQ (16X) + CCS (1X)
Running mode: mr
Running task bwa-mr-1
Masked : 42.5%
Running task bwa-mr-2
Masked : 64.9%
Running task bwa-mr-3
Masked : 71.6%
Running task bwa-mr-4
Masked : 81.6%
Running task bwa-mr-5
Masked : 82.8%
Running task bwa-mr-finish
Masked : 43.1%

Any suggestions? I am getting MiSeq data soon ~50X coverage, in that case do you recommend using only MiSeq data for correction?

Thanks in advance,

Beatriz

@thackl commented Mar 13, 2016

The difference between the two runs is the run mode: sr (short reads, HiSeq <=100 bp) vs. mr (medium reads, MiSeq/merged HiSeq reads >100 bp). Because of the longer CCS reads, proovread chose the mr mode for the second run. But that mode is not sensitive enough to properly align the shorter HiSeq reads. You could explicitly set the mode to sr even when using CCS reads.

However, you won't have that problem with MiSeq data anyway. 50X MiSeq is more or less the perfect data set. If possible, use merged overlapping MiSeq reads for correction; they work well either with or without CCS reads.
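For the record, forcing sr mode while CCS reads are included could look like the fragment below. Treat it as a sketch: the 'mode' key and its accepted values are an assumption based on proovread.cfg's structure, and how overrides are loaded depends on your proovread version (check proovread --help and the shipped proovread.cfg).

```perl
# custom.cfg -- force short-read alignment mode even though CCS reads are
# present ('mode' key and value are assumptions; compare with the shipped
# proovread.cfg for the exact syntax)
'mode' => 'sr',
```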

@bibilujan

Hi Thomas,

I wanted to ask your advice on overlapping the MiSeq reads. So far my quality-control pipeline includes trimming (Trimmomatic) followed by error correction (Musket). Which approach works better for correcting long PacBio reads with MiSeq data in proovread:

  1. correct reads -> overlap -> trim (based on quality) or
  2. trim -> correct -> overlap?

Considering that my MiSeq reads have poor quality at the 3' end.

My coverage at this point is 11X PE HiSeq, 23X PE MiSeq and 0.86X CCS. Do you suggest using all the data in proovread to get good enough coverage, or would it be best to use only MiSeq?

Thank you for your time,

Beatriz

@thackl commented Apr 6, 2016

Hi Beatriz,

I would go for overlapping the reads directly: overlapping already decreases error rates in read tails, and very poor ends won't produce merged reads anyway. You can do trimming/correction afterwards, but I don't think it is necessary. Since proovread builds a consensus from multiple Illumina reads, random errors in single reads don't affect correction accuracy.

I expect your coverage to decrease during overlapping, so you should use both HiSeq and MiSeq reads. Make sure to set --coverage appropriately.
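As a sketch of this pipeline, with FLASH standing in as the read merger (FLASH is my example, not named in the thread; all file names and paths are illustrative, and the proovread flags should be checked against proovread --help):

```shell
#!/bin/sh
# Merge overlapping MiSeq pairs first, then correct with merged MiSeq + HiSeq.
# (Uncomment to run; FLASH writes merged reads to <prefix>.extendedFrags.fastq.)
# flash miseq_1.fq miseq_2.fq -o miseq -d merged

# --coverage should match the short-read coverage actually given to proovread:
HISEQ_X=11
MISEQ_X=23
echo "set --coverage=$(( HISEQ_X + MISEQ_X ))"

# proovread --coverage "$(( HISEQ_X + MISEQ_X ))" -l pacbio.fq \
#           -s merged/miseq.extendedFrags.fastq -s hiseq_1.fq -s hiseq_2.fq \
#           --pre corrected
```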

Cheers
Thomas

@wsuplantpathology commented Aug 19, 2016

Hi there:

I wonder how to tell the program to handle many .fq files.

In the proovread.cfg, it says:

LIST of Pacbio read files to correct. FASTA or FASTQ format.

'long-reads' => [],

It seems this option is not clearly documented. #74 Thanks.
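The empty value shown above is a Perl array reference, so multiple files would be listed comma-separated inside the brackets (a sketch; the file names are examples):

```perl
# in proovread.cfg: one entry per PacBio FASTA/FASTQ file to correct
'long-reads' => ['run1.fq', 'run2.fq', 'run3.fq'],
```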
