Scaling KAZU to run over millions of abstracts #18
Hiya! Yes, sorry about the delay in producing the Ray tutorial. To answer your question, you have multiple options for scaling Kazu over a large dataset. It integrates quite well with the Spark .mapPartitions concept, and also with Ray. How do you intend to serialise the results? MongoDB or to disk?
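The .mapPartitions option mentioned above might look roughly like the sketch below. The two placeholder functions stand in for Kazu-specific code (building the Pipeline and serialising an annotated document; see the Kazu quickstart), and the input path and column names (pubmed_abstracts.parquet, pmid, abstract) are assumptions; only the PySpark calls are standard.

```python
from pyspark.sql import SparkSession


def build_pipeline():
    """Placeholder: construct a Kazu Pipeline here (see the Kazu quickstart). Not real Kazu API."""
    raise NotImplementedError


def annotate_to_json(pipeline, text):
    """Placeholder: run the pipeline over one abstract and return its annotations as a JSON string."""
    raise NotImplementedError


def annotate_partition(rows):
    # Build the pipeline once per partition so model loading is amortised
    # over many abstracts rather than paid per document.
    pipeline = build_pipeline()
    for row in rows:
        yield (row.pmid, annotate_to_json(pipeline, row.abstract))


spark = SparkSession.builder.getOrCreate()
# Assumes the abstracts are already in a DataFrame with columns `pmid` and `abstract`.
abstracts = spark.read.parquet("pubmed_abstracts.parquet")
annotated = abstracts.rdd.mapPartitions(annotate_partition)
annotated.toDF(["pmid", "annotations"]).write.mode("overwrite").json("kazu_output/")
```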
I wanted to start simple by just dumping the data to disk as JSON files. Would there be a large speed advantage to using MongoDB instead?
It depends on your setup. Mongo is a great way of centralising the data so that it can be easily queried afterwards (Elasticsearch is good too). However, the DB can be a bottleneck if you have many Kazu workers and only a single Mongo instance. In terms of batch processing with Ray, I've found it useful to use the Ray Queue concept. You can then wrap Kazu in Ray actors, as follows:
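The code block that originally followed this comment is not preserved in the thread. Below is a minimal sketch of the pattern being described, assuming a shared ray.util.queue.Queue of (pmid, abstract) items consumed by actors that each hold their own Kazu pipeline. build_pipeline, annotate_to_json and read_pubmed_abstracts are placeholders for Kazu- and corpus-specific code, not real Kazu API; only the Ray calls are standard.

```python
import json
import ray
from ray.util.queue import Queue


def build_pipeline():
    """Placeholder: construct a Kazu Pipeline here (see the Kazu quickstart). Not real Kazu API."""
    raise NotImplementedError


def annotate_to_json(pipeline, text):
    """Placeholder: run the pipeline over one abstract and return a JSON-serialisable result."""
    raise NotImplementedError


def read_pubmed_abstracts():
    """Placeholder: yield (pmid, abstract_text) pairs from your parsed PubMed dump."""
    raise NotImplementedError


@ray.remote
class KazuWorker:
    """Each actor holds its own Kazu pipeline and pulls work from a shared queue."""

    def __init__(self):
        self.pipeline = build_pipeline()

    def run(self, queue: Queue, out_dir: str) -> int:
        processed = 0
        while True:
            item = queue.get(block=True)   # blocks until an abstract (or a sentinel) arrives
            if item is None:               # sentinel: no more work
                break
            pmid, abstract = item
            with open(f"{out_dir}/{pmid}.json", "w") as f:
                json.dump(annotate_to_json(self.pipeline, abstract), f)
            processed += 1
        return processed


if __name__ == "__main__":
    ray.init()
    queue = Queue(maxsize=10_000)          # bounded so the producer cannot outrun memory
    num_workers = 8                        # tune to the cores on your node
    workers = [KazuWorker.remote() for _ in range(num_workers)]
    futures = [w.run.remote(queue, "kazu_output") for w in workers]
    for pmid, abstract in read_pubmed_abstracts():
        queue.put((pmid, abstract))
    for _ in range(num_workers):
        queue.put(None)                    # one sentinel per worker
    print("abstracts processed:", sum(ray.get(futures)))
```

Bounding the queue keeps memory in check if the producer outruns the workers, and the per-worker sentinel lets each actor exit cleanly once the dump has been fully enqueued.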
For speed over millions of abstracts, if I use Ray on a single multi-CPU node (essentially mimicking multiprocessing) but without multiple GPUs, do you think Kazu can still scale reasonably to my task?
Yes, Kazu is designed to run without a GPU.
Hi, thanks again for sharing KAZU; it looks like exactly the tool I need for setting up a simple biomed NLP system.
I am planning on using KAZU to add ontology terms to the whole PubMed abstract dump and write these out to disk. I am a bit concerned that it will not scale well using the single-document approach you show in the tutorial (the EGFR query). However, your Ray tutorial page still says TBA.
Do you have any general suggestions for how to scale it for my task? I don't need specific code examples unless you already have something to share; even something "quick and dirty" would help.