
Optimize Mgrep dictionary generation process #15

Open
mdorf opened this issue Apr 13, 2021 · 1 comment
mdorf commented Apr 13, 2021

Currently, the dictionary is regenerated every time an ontology submission is processed. The process takes over an hour, largely because a huge data structure is retrieved from Redis in a single call:

https://github.com/ncbo/ncbo_annotator/blob/master/lib/ncbo_annotator.rb#L122

There is room for optimization here. Possible avenues to pursue:

  1. Incremental dictionary file population

We may not need to rebuild the dictionary file for the entire system on every ontology parse. Updating it incrementally could drastically improve performance.

  2. Retrieve data from Redis in an iterative way:

Instead of using all = redis.hgetall(dict_holder), it's possible to iterate over the data structure using HSCAN:

          cursor = 0
          loop do
            cursor, key_values = redis.hscan(dict_holder, cursor, count: 1000)
            key_values.each do |key, value|
              # process each term/ID pair from this batch
            end
            @logger.info("hscan cursor: #{cursor}")
            break if cursor == "0"
          end
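A rough sketch of how the first avenue (incremental population) might look. This is purely illustrative, assuming a plain-text dictionary file with one acronym-prefixed id<TAB>label line per entry; the file layout, the update_dictionary_incrementally name, and its arguments are all hypothetical, not the annotator's actual format:

```ruby
# Hypothetical incremental update: rewrite only the lines belonging to the
# submission being processed, keeping every other ontology's lines intact.
# dict_path - path to the dictionary file
# acronym   - ontology acronym used to prefix this submission's entries
# labels    - hash of { id => label } pairs for the new submission
def update_dictionary_incrementally(dict_path, acronym, labels)
  prefix = "#{acronym}:"
  tmp_path = "#{dict_path}.tmp"
  File.open(tmp_path, "w") do |out|
    if File.exist?(dict_path)
      File.foreach(dict_path) do |line|
        # Drop stale lines for this submission; keep everything else.
        out.write(line) unless line.start_with?(prefix)
      end
    end
    # Append the fresh entries for this submission only.
    labels.each { |id, label| out.puts("#{prefix}#{id}\t#{label}") }
  end
  # Atomically swap the rewritten file into place.
  File.rename(tmp_path, dict_path)
end
```

Rewriting into a temp file and renaming keeps the dictionary readable by Mgrep while the update is in progress; only the one submission's entries are regenerated rather than the whole system's.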
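To make the HSCAN cursor loop concrete without a live server, here is a self-contained demonstration using a minimal in-memory stand-in for the Redis client (FakeRedis is a toy stub defined here, not part of the redis gem; in production the real client would be used):

```ruby
# Minimal stand-in for the one Redis call used by the loop. HSCAN returns
# [next_cursor, batch_of_pairs], with cursor "0" signalling completion.
class FakeRedis
  def initialize(hash)
    @pairs = hash.to_a
  end

  def hscan(_key, cursor, count: 10)
    start = cursor.to_i
    batch = @pairs[start, count] || []
    next_cursor = start + count >= @pairs.size ? "0" : (start + count).to_s
    [next_cursor, batch]
  end
end

redis = FakeRedis.new((1..25).map { |i| ["term#{i}", "id#{i}"] }.to_h)

# The same cursor loop as above: accumulate the hash batch by batch
# instead of materializing it in one giant HGETALL reply.
dictionary = {}
cursor = 0
loop do
  cursor, key_values = redis.hscan("dict", cursor, count: 10)
  key_values.each { |key, value| dictionary[key] = value }
  break if cursor == "0"
end

puts dictionary.size # prints 25
```

Note that a real Redis HSCAN returns pseudo-random cursors rather than sequential offsets, which is why progress should be logged unconditionally rather than on a cursor-value condition.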