
Best way to run proovread on large dataset #48

Closed
bibilujan opened this issue Mar 9, 2016 · 6 comments

@bibilujan

Hi,

I am interested in using proovread to correct PacBio long reads. My question is about usage: the manual says to try a subset of the data first, which I did. Now that I have confirmed it works, should I run proovread on my whole dataset (a 19 GB file), or should I always run it on small subsets of PacBio reads (20M in the manual)?

I currently have about 18X Illumina and 37X PacBio long-read coverage. The genome I am sequencing is 260 Mbp. I plan to get more Illumina paired-end data to reach at least 50X paired-end coverage, but I am running some tests while I wait for the new data.

Thank you for your time.
Beatriz

@thackl commented Mar 9, 2016

Hi Beatriz,

the recommendation regarding chunk size in the manual is somewhat outdated. With the latest versions of proovread, the rule of thumb is: use chunks as large as possible, within the following limits:

  1. RAM of your cluster nodes: chunks up to a few GB should be fine.
  2. Maximum per-job runtime on some scheduling systems: smaller chunks finish faster in absolute time, but larger chunks take less total runtime over the entire set.
  3. Size of your genome: chunks should not exceed 1X genome coverage. In your case, I would go for ~200 Mbp chunks, which is something like SeqChunker -s 400M for FASTQ files.

Also, use the maximum --threads; running multiple instances on the same machine is no longer necessary. And for your data set, make sure to set --coverage=18.
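Put together, the advice above might look like the following shell sketch. The file names and chunk output pattern are examples, and the -l/-s/--pre flags follow the proovread README; double-check everything against proovread --help and SeqChunker --help for your version.

```shell
#!/bin/sh
# Rule of thumb from above: chunks of at most ~1X genome coverage.
# FASTQ is roughly 2 bytes per base, so 1X of a 260 Mbp genome is about
# 520M of FASTQ; the suggested 400M (~200 Mbp) stays safely below that.
GENOME_MBP=260
echo "1X genome coverage is about $(( GENOME_MBP * 2 ))M of FASTQ"

# Split the PacBio reads into chunks (uncomment to run; the output naming
# is an example -- see SeqChunker --help for the pattern syntax):
# SeqChunker -s 400M -o pb-%03d.fq pacbio-subreads.fq

# Correct each chunk with all threads and the short-read coverage (18X here):
# for chunk in pb-*.fq; do
#   proovread --threads "$(nproc)" --coverage 18 \
#             -l "$chunk" -s illumina_1.fq -s illumina_2.fq --pre "${chunk%.fq}"
# done
```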

@bibilujan

Hi Thomas,

Thank you for your prompt reply and your useful comments. I was interested in using CCS reads as well as Illumina for correction. However, I see that the %masked drops considerably in the last iteration when I use Illumina+CCS. Is this normal? Is it better to avoid mixing data types? At the moment I am running these tests at low coverage, so I don't know whether that is a factor.

ILLUMINA HISEQ (16X) ONLY
Running mode: sr
Running task bwa-sr-1
Masked : 67.8%
Running task bwa-sr-2
Masked : 84.5%
Running task bwa-sr-3
Masked : 88.0%
Running task bwa-sr-4
Masked : 93.8%
Running task bwa-sr-finish
Masked : 79.4%

ILLUMINA HISEQ (16X) + CCS (1X)
Running mode: mr
Running task bwa-mr-1
Masked : 42.5%
Running task bwa-mr-2
Masked : 64.9%
Running task bwa-mr-3
Masked : 71.6%
Running task bwa-mr-4
Masked : 81.6%
Running task bwa-mr-5
Masked : 82.8%
Running task bwa-mr-finish
Masked : 43.1%

Any suggestions? I am getting MiSeq data soon ~50X coverage, in that case do you recommend using only MiSeq data for correction?

Thanks in advance,

Beatriz

@thackl commented Mar 13, 2016

The difference between the two runs is the run mode: sr (short reads, HiSeq <=100 bp) vs. mr (medium reads, MiSeq/merged HiSeq reads >100 bp). Because of the longer CCS reads, proovread chose the mr mode for the second run. But that mode is not sensitive enough to properly align the shorter HiSeq reads. You could explicitly set the mode to sr even when using CCS reads.

However, you won't have that problem with MiSeq data anyway. 50X MiSeq is more or less the perfect data set. If possible, use merged overlapping MiSeq reads for correction; they work well either with or without CCS reads.
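For the record, forcing sr mode while CCS reads are included could look like the fragment below. Treat it as a sketch: the 'mode' key and its accepted values are an assumption based on proovread.cfg's structure, and how overrides are loaded depends on your proovread version (check proovread --help and the shipped proovread.cfg).

```perl
# custom.cfg -- force short-read alignment mode even though CCS reads are
# present ('mode' key and value are assumptions; compare with the shipped
# proovread.cfg for the exact syntax)
'mode' => 'sr',
```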

@bibilujan

Hi Thomas,

I wanted to ask your advice on overlapping the MiSeq reads. So far my quality-control pipeline includes trimming (Trimmomatic) followed by error correction (Musket). Which approach works better for correcting long PacBio reads with MiSeq data in proovread:

  1. correct reads -> overlap -> trim (based on quality) or
  2. trim -> correct -> overlap?

Considering that my MiSeq reads have poor quality at the 3' end.

My coverage at this point is 11X PE HiSeq, 23X PE MiSeq and 0.86X CCS. Do you suggest using all the data in proovread to get good enough coverage, or would it be best to use only MiSeq?

Thank you for your time,

Beatriz

@thackl commented Apr 6, 2016

Hi Beatriz,

I would go for overlapping the reads directly: overlapping already decreases error rates in read tails, and very poor ends won't produce merged reads anyway. You can do trimming/correction afterwards, but I don't think it is necessary. Since proovread builds a consensus from multiple Illumina reads, random errors in single reads don't affect correction accuracy.

I expect your coverage to decrease during overlapping, so you should use both HiSeq and MiSeq reads. Make sure to set --coverage appropriately.
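As a sketch of this pipeline, with FLASH standing in as the read merger (FLASH is my example, not named in the thread; all file names and paths are illustrative, and the proovread flags should be checked against proovread --help):

```shell
#!/bin/sh
# Merge overlapping MiSeq pairs first, then correct with merged MiSeq + HiSeq.
# (Uncomment to run; FLASH writes merged reads to <prefix>.extendedFrags.fastq.)
# flash miseq_1.fq miseq_2.fq -o miseq -d merged

# --coverage should match the short-read coverage actually given to proovread:
HISEQ_X=11
MISEQ_X=23
echo "set --coverage=$(( HISEQ_X + MISEQ_X ))"

# proovread --coverage "$(( HISEQ_X + MISEQ_X ))" -l pacbio.fq \
#           -s merged/miseq.extendedFrags.fastq -s hiseq_1.fq -s hiseq_2.fq \
#           --pre corrected
```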

Cheers
Thomas

@wsuplantpathology commented Aug 19, 2016

Hi there:

I wonder how to tell the program to handle many .fq files.

In the proovread.cfg, it says:

LIST of Pacbio read files to correct. FASTA or FASTQ format.

'long-reads' => [],

It seems this option is not clearly documented. #74 Thanks.
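The empty value shown above is a Perl array reference, so multiple files would be listed comma-separated inside the brackets (a sketch; the file names are examples):

```perl
# in proovread.cfg: one entry per PacBio FASTA/FASTQ file to correct
'long-reads' => ['run1.fq', 'run2.fq', 'run3.fq'],
```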
