
Problem speed of mapping #44

Closed
ArtemPalanaria opened this issue May 4, 2020 · 23 comments

Comments

@ArtemPalanaria

Dear Matt, I tried to run Read Until and worked through the testing stages (up to the "Testing basecalling and mapping" stage, point 6). Basecalling is running on a GPU (RTX 2080Ti), but the reported mapping times are more than 1-3 seconds. What could be the problem?
Thanks.
I ran the test from a file, following the "Testing" example.
Files attached:
human_chr_selection.toml.txt
chunk_log.log
ru_test.log
guppy_basecall_server_log-2020-05-04_15-07-45.log

@mattloose
Contributor

Hi,

There are a lot of issues here.

The toml file you provide (human_chr_selection.toml.txt) won't pass validation as it has no targets. It isn't the one passed in the command shown in ru_test.log (that one is human_chr_selection.toml).

The ru_test.log shows that the toml file you actually ran with has two targets (further suggesting the attached toml is not the right one), BUT neither of those targets is found in the reference.

Reads will therefore either always be off target or not map at all. If they are not mapping (and I suspect that is the case here) you will keep collecting data for each read, so your basecalling will take longer and longer.

In essence I'm not sure you have configured this experiment properly.
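
For reference, a working toml needs a targets list and a reference that actually contains those target names. A minimal sketch along the lines of the example toml files shipped with this repo (the paths, guppy config name and port are placeholders you would need to adjust for your setup):

```toml
[caller_settings]
config_name = "dna_r9.4.1_450bps_fast"  # guppy config to use; placeholder
host = "127.0.0.1"
port = 5555

[conditions]
# minimap2 index of the reference; the targets below must exist in it
reference = "/path/to/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.mmi"

[conditions.0]
name = "select_chr21_chr22"
control = false
min_chunks = 0
max_chunks = inf
# target names must match sequence names in the reference index
targets = ["chr21", "chr22"]
single_on = "stop_receiving"
multi_on = "stop_receiving"
single_off = "unblock"
multi_off = "unblock"
no_seq = "proceed"
no_map = "proceed"
```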

If you can provide further information, including the source of the data (are you playing back a bulkfile here, or something else?) and the correct toml file, we might be able to help further.

Matt

@ArtemPalanaria
Author

Thanks for the answer. I changed the reference and the file now passes validation. After starting, mapping still takes a long time. For playback I used the bulk file from
http://s3.amazonaws.com/nanopore-human-wgs/bulkfile/PLSP57501_20170308_FNFAF14035_MN16458_sequencing_run_NOTT_Hum_wh1rs2_60428.fast5

Files attached:
human_chr_selection.toml.txt
chunk_log.log
ru_test.log
chek_toml.txt

@ArtemPalanaria
Author

Last time I attached the wrong (TOML) file; the correct one is attached now.

@mattloose
Contributor

Thanks for the update - that is a lot slower than I would expect.

I would check a few things here.

First - how quickly can your GPU call reads when running standalone? You may need to play with guppy parameters to tune your basecaller optimally.

However, we need to see whether it is the GPU or the CPU that is limiting here - how big is the reference file that you are mapping to? Also, what sort of CPU do you have?

Have you tried the fast basecalling model instead of the high accuracy model? If you see an improvement in speed here then we can pinpoint the source of the problem a little.
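
For reference, switching models is just a change to the guppy config name in the caller_settings block of your toml, roughly like this (the exact config names depend on your flowcell/kit and guppy install, so treat these as placeholders):

```toml
[caller_settings]
# fast model; the high accuracy equivalent would be something like "dna_r9.4.1_450bps_hac"
config_name = "dna_r9.4.1_450bps_fast"
host = "127.0.0.1"
port = 5555
```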

Thanks

@ArtemPalanaria
Author

Thanks. I launched it with the high accuracy model - the speed was normal for the first 2 minutes, but then everything slowed down again to 1 second or more. My CPU is a Ryzen 7 3800X (8 cores, 16 threads). As the reference I use the indexed file from ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
The indexed (.mmi) file is more than 7 GB. Could that be the cause?
Thanks

@mattloose
Contributor

This doesn't really make sense then. Can you please try setting the max chunks to 8 rather than infinite and see what happens?

Also - please leave it running for more than 15 minutes and check the resulting data to see if the selection is working.
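
For reference, that is the max_chunks value in the condition block of your toml; something like this (key names follow the example tomls in this repo):

```toml
[conditions.0]
min_chunks = 0
# cap basecalling at 8 chunks per read rather than letting it run to inf
max_chunks = 8
```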

@mattloose
Contributor

Also - please can you try it with the FAST model and not the High Accuracy Model. Running on the fast model will tell us something about where the lag is.

@ArtemPalanaria
Author

Dear Matt, I ran the process with both the fast and hac models, with max_chunks = 8.
From the data obtained it is clear that the fast run keeps up as required, while the hac run slows down, and badly. The resulting data also look poor.
Here are the fast data
read-length-histogram-05 05 2020, 10_55_26
chunk_log.zip
ru_test.zip
result.txt
and hac data
result.txt
chunk_log.log
hac
ru_test.zip

Thanks

@mattloose
Contributor

Can I check what operating system you are on? And can you also provide a metric for how quickly you can basecall standard reads on your current setup?

@tchrisboles

[screenshot]

How would I check the speed of standard basecalling? From log files? I've never looked for them - give me a hint and I'll dig it out.

@ArtemPalanaria
Author

Dear Matt, here is the system information (Ubuntu 18.04.4 LTS, GNOME 3.28.2)
and the basecall speed files:
guppy.txt
guppy_basecaller_log-2020-05-07_10-42-57.log
Thanks

@vincentmanz

I have observed the same problem here: when using the hac model in the toml file I get very slow mapping times (>1 s).
#44~18.04.2-Ubuntu SMP Thu Apr 23 14:27:18 UTC 2020

@mattloose
Contributor

Hi All,

A quick question - could people confirm the version of guppy they are using?

Thanks.

@mattloose
Contributor

If you are on version 3.6 it may be worth trying guppy 3.4.5 - it is available from:

https://mirror.oxfordnanoportal.com/software/analysis/ont-guppy_3.4.5_linux64.tar.gz

It looks as though there is a change in guppy performance that might be negatively impacting the speed of read until.

@tchrisboles

Hi Matt and Artem,

I have been having problems similar to Artem's, and I am running:
[screenshot: guppy version]

@tchrisboles

Thanks Matt - will try 3.4.5 later today.

@mattloose
Contributor

Hi Chris,

Please let us know how 3.4.5 goes - the accuracy differences aren't key here but the speed is, so you should find it gives you better performance. We're really keen to resolve this ASAP!

Best

Matt

@tchrisboles

OK, I think you guys nailed it with the guppy server version. Here are my test results.
(I downloaded and untarred the ont-guppy 3.4.5 package as Matt pointed out above.)
Set up the basecall server:
[screenshot: guppy_basecall_server command]
In a second terminal window, set up the ru_generators command:
[screenshot: ru_generators command]
I had previously modified Matt's toml file as follows:
[screenshot: modified toml file]
After 16 min the read distribution and mapping timing looked like this:
[screenshot: read distribution and mapping timing]
which is much closer to Matt's readme image than I had gotten previously. Mapping timing is still not quite as fast as Matt's. Here's a close-up of the 16-minute read distribution:
[screenshot: close-up of read length histogram]
And the summarise output:
[screenshot: summarise output]
The median read lengths now show enrichment for chr21 and chr22. Again, not quite as good as in Matt's readme, but significant.
I think some additional guidance on strategies for optimising guppy server settings would help us all.

Hope this helps others who are as interested in ru as we are.

@tchrisboles

By the way, you can see my previous results using guppy 3.5.2 in Question #39.

@mattloose
Contributor

Thanks @tchrisboles

We're just running some equivalence tests across a few GPUs here. All our work was reported using 3.4.5 - we will investigate the issues with guppy > 3.4.5 with ONT.

@mattloose
Contributor

signal-attachment-2020-05-08-195144_001
So here is a comparison of a 1080 vs the GPU (GV100) in the GridION - as you can see, guppy 3.4.5 performance is roughly equivalent, but guppy 3.6 performance is not sufficient for real-time calling. We suspect an underlying issue that can be resolved, but for now we recommend guppy 3.4.5. You can have two versions of guppy running side by side as required.

@ArtemPalanaria
Author

Dear Matt, I got similar results to Chris using guppy 3.4.5.
Run.txt
run
I also wanted to ask - can I use any fast5 file as the bulk file, or do I need to prepare it somehow?
And could you point me to an example of setting up library depletion for the human genome (for enriching a metagenome)?
Thanks for the help. I am very glad that everything worked!

@mattloose
Contributor

Hi - you have to record a bulkfile from a run - you cannot use any fast5 file.

Look under the advanced file save options.

For depletion of the human genome you just need to configure your toml file to reject anything that maps to the reference you want to get rid of. Have a look at our paper for details.
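
For reference, a depletion condition is just the selection logic inverted - unblock reads that map to the reference you want to remove and keep everything else. A rough sketch in the style of the example tomls in this repo (the target names and key values here are illustrative, not a tested config):

```toml
[conditions]
reference = "/path/to/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.mmi"

[conditions.0]
name = "deplete_human"
control = false
min_chunks = 0
max_chunks = 8
# every human chromosome is a target; reads mapping to any of them get unblocked
targets = ["chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8",
           "chr9", "chr10", "chr11", "chr12", "chr13", "chr14", "chr15", "chr16",
           "chr17", "chr18", "chr19", "chr20", "chr21", "chr22", "chrX", "chrY"]
single_on = "unblock"
multi_on = "unblock"
single_off = "stop_receiving"
multi_off = "stop_receiving"
no_seq = "proceed"
no_map = "proceed"
```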

Adoni5 added a commit that referenced this issue Oct 6, 2023
* Update README.md

Closes #44 
Uses BETA syntax, see https://github.com/orgs/community/discussions/16925#
Adds a link to the Sphinx documentation for readfish on the looselab github pages

* Exclude README.md from trailing-whitespace pre-commit
Need trailing whitespace to render the warning boxes

* Invert the notes about the FAQ and README