
Optimize Mgrep dictionary generation process #15

Open
mdorf opened this issue Apr 13, 2021 · 1 comment
mdorf commented Apr 13, 2021

Currently, the dictionary is regenerated every time an ontology submission is processed. The process takes over an hour, largely because a huge data structure is retrieved from Redis in a single call:

https://github.com/ncbo/ncbo_annotator/blob/master/lib/ncbo_annotator.rb#L122

There is room for optimization here. Possible avenues to pursue:

  1. Incremental dictionary file population

We may not need to rebuild the dictionary file for the entire system on every ontology parse. Updating it incrementally could drastically improve performance.

  2. Retrieve data from Redis in an iterative way:

Instead of using all = redis.hgetall(dict_holder), it's possible to iterate over the data structure using HSCAN:

          cursor = 0
          loop do
            cursor, key_values = redis.hscan(dict_holder, cursor, count: 1000)
            key_values.each do |key, value|
              # process each term/ID pair from this batch
            end
            @logger.info("hscan cursor: #{cursor}")
            break if cursor == "0"
          end
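A rough sketch of how the first avenue (incremental population) might look. This is purely illustrative, assuming a plain-text dictionary file with one acronym-prefixed id<TAB>label line per entry; the file layout, the update_dictionary_incrementally name, and its arguments are all hypothetical, not the annotator's actual format:

```ruby
# Hypothetical incremental update: rewrite only the lines belonging to the
# submission being processed, keeping every other ontology's lines intact.
# dict_path - path to the dictionary file
# acronym   - ontology acronym used to prefix this submission's entries
# labels    - hash of { id => label } pairs for the new submission
def update_dictionary_incrementally(dict_path, acronym, labels)
  prefix = "#{acronym}:"
  tmp_path = "#{dict_path}.tmp"
  File.open(tmp_path, "w") do |out|
    if File.exist?(dict_path)
      File.foreach(dict_path) do |line|
        # Drop stale lines for this submission; keep everything else.
        out.write(line) unless line.start_with?(prefix)
      end
    end
    # Append the fresh entries for this submission only.
    labels.each { |id, label| out.puts("#{prefix}#{id}\t#{label}") }
  end
  # Atomically swap the rewritten file into place.
  File.rename(tmp_path, dict_path)
end
```

Rewriting into a temp file and renaming keeps the dictionary readable by Mgrep while the update is in progress; only the one submission's entries are regenerated rather than the whole system's.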
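To make the HSCAN cursor loop concrete without a live server, here is a self-contained demonstration using a minimal in-memory stand-in for the Redis client (FakeRedis is a toy stub defined here, not part of the redis gem; in production the real client would be used):

```ruby
# Minimal stand-in for the one Redis call used by the loop. HSCAN returns
# [next_cursor, batch_of_pairs], with cursor "0" signalling completion.
class FakeRedis
  def initialize(hash)
    @pairs = hash.to_a
  end

  def hscan(_key, cursor, count: 10)
    start = cursor.to_i
    batch = @pairs[start, count] || []
    next_cursor = start + count >= @pairs.size ? "0" : (start + count).to_s
    [next_cursor, batch]
  end
end

redis = FakeRedis.new((1..25).map { |i| ["term#{i}", "id#{i}"] }.to_h)

# The same cursor loop as above: accumulate the hash batch by batch
# instead of materializing it in one giant HGETALL reply.
dictionary = {}
cursor = 0
loop do
  cursor, key_values = redis.hscan("dict", cursor, count: 10)
  key_values.each { |key, value| dictionary[key] = value }
  break if cursor == "0"
end

puts dictionary.size # prints 25
```

Note that a real Redis HSCAN returns pseudo-random cursors rather than sequential offsets, which is why progress should be logged unconditionally rather than on a cursor-value condition.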