Performance #2626

hexylena · 2021-07-02T10:14:42Z

@abretaud and I are working to debug an issue where the slowness of findAllOrganisms (>30s) is killing the training we're giving.

This route should be fast, like <2 seconds fast. I've replaced it with a Flask app that talks directly to the DB and does all of the joins and filtering on the DB side, which seems to be MUCH more efficient.

Here's the Flask app, which replaces just that one route:

import time

from flask import Flask, jsonify
from flask_sqlalchemy import SQLAlchemy

# Module-level cache: the last query result and the time it was fetched.
CACHED_RESULT = None
CACHED_TIME = 0

app = Flask(__name__)

app.config["SQLALCHEMY_DATABASE_URI"] = "postgresql://...:5432/apollo"
app.config['SQLALCHEMY_TRACK_MODIFICATIONS'] = False
db = SQLAlchemy(app)

QUERY = """
SELECT
    organism.common_name,
    organism.blatdb,
    organism.metadata,
    organism.obsolete,
    organism.directory,
    organism.public_mode,
    organism.valid,
    organism.genome_fasta_index,
    organism.genus,
    organism.species,
    organism.id,
    organism.non_default_translation_table,
    organism.genome_fasta,
    false AS currentorganism,
    sum(
        CASE
        WHEN feature.class
        IN (
                'org.bbop.apollo.RepeatRegion',
                'org.bbop.apollo.Terminator',
                'org.bbop.apollo.TransposableElement',
                'org.bbop.apollo.Gene',
                'org.bbop.apollo.Pseudogene',
                'org.bbop.apollo.PseudogenicRegion',
                'org.bbop.apollo.ProcessedPseudogene',
                'org.bbop.apollo.Deletion',
                'org.bbop.apollo.Insertion',
                'org.bbop.apollo.Substitution',
                'org.bbop.apollo.SNV',
                'org.bbop.apollo.SNP',
                'org.bbop.apollo.MNV',
                'org.bbop.apollo.MNP',
                'org.bbop.apollo.Indel'
            )
        THEN 1
        ELSE 0
        END
    ) AS annotationcount,
    count(distinct sequence.id) AS sequences
FROM
    organism
    LEFT OUTER JOIN sequence ON organism.id = sequence.organism_id
    LEFT OUTER JOIN feature_location ON
            sequence.id = feature_location.sequence_id
    LEFT OUTER JOIN feature ON
            feature.id = feature_location.feature_id
GROUP BY
    organism.common_name,
    organism.blatdb,
    organism.metadata,
    organism.obsolete,
    organism.directory,
    organism.public_mode,
    organism.valid,
    organism.genome_fasta_index,
    organism.genus,
    organism.species,
    organism.id,
    organism.non_default_translation_table,
    organism.genome_fasta
    ;
"""

columns = [
    "commonName", "blatdb", "metadata", "obsolete", "directory",
    "publicMode", "valid", "genomeFastaIndex", "genus", "species", "id",
    "nonDefaultTranslationTable", "genomeFasta", "currentOrganism",
    "annotationCount", "sequences"
]

def _fetch():
    # Run the query and map each result row onto the JSON keys in `columns`.
    rows = db.engine.execute(QUERY)
    return [dict(zip(columns, row)) for row in rows]


@app.route("/get", methods=["GET", "POST"])
def doit():
    global CACHED_TIME
    global CACHED_RESULT
    now = time.time()
    if now - CACHED_TIME > 30:
        CACHED_RESULT = _fetch()
        CACHED_TIME = now

    return jsonify(CACHED_RESULT)

I'm running this service and we're just proxying that one route through our own version:

location /apollo/organism/findAllOrganisms {
   proxy_pass http://127.0.0.1:4321/get;
}

I think there are a couple parts to the issue:

- lack of any indexes, not even on sequence.id, feature.id, etc.
- doing operations in Groovy rather than in the DB, which means fetching more data and processing it more slowly than the DB can.
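To make the indexing point concrete, here is a minimal sketch that creates one index per column used in the JOIN conditions of the query above. It runs against an in-memory SQLite database as a stand-in for Apollo's PostgreSQL schema; the trimmed-down tables and the index names (`idx_sequence_organism` and so on) are assumptions for illustration, not Apollo's actual DDL.

```python
import sqlite3

# Trimmed-down stand-in for the Apollo tables (assumption: the real schema
# has many more columns; only the join columns are modelled here).
SCHEMA = """
CREATE TABLE organism (id INTEGER PRIMARY KEY, common_name TEXT);
CREATE TABLE sequence (id INTEGER PRIMARY KEY, organism_id INTEGER);
CREATE TABLE feature (id INTEGER PRIMARY KEY, class TEXT);
CREATE TABLE feature_location (
    id INTEGER PRIMARY KEY,
    sequence_id INTEGER,
    feature_id INTEGER
);
"""

# Hypothetical index names: one index per column that appears in a JOIN.
INDEXES = """
CREATE INDEX idx_sequence_organism ON sequence (organism_id);
CREATE INDEX idx_floc_sequence ON feature_location (sequence_id);
CREATE INDEX idx_floc_feature ON feature_location (feature_id);
"""


def build_db():
    """Create the stand-in schema and its indexes in a fresh in-memory DB."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executescript(INDEXES)
    return conn


def index_names(conn):
    """Return the names of the explicitly created indexes, sorted."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'index' AND name LIKE 'idx_%' ORDER BY name"
    )
    return [name for (name,) in rows]


conn = build_db()
print(index_names(conn))
```

On PostgreSQL the equivalent would be plain `CREATE INDEX` statements on the same columns; whether they help enough on a real Apollo database would need to be checked with `EXPLAIN ANALYZE`.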

A key point for me is that I really don't think Apollo needs a graph database. I think it just needs some time spent understanding how to use SQL most effectively (I'm happy to offer my expertise there).


abretaud · 2021-07-02T11:55:03Z

On my side I've done some profiling: most of the time is spent in this for loop, which takes ~0.2s per organism on my test setup => ~10 sec for 40 orgs => you can easily hit a timeout if you have many orgs.
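A fixed cost per organism is the pattern you get when work is done once per organism in application code instead of once in the database. The sketch below is not Apollo's code and does not claim to reproduce the loop mentioned above; it assumes a simplified two-table schema in an in-memory SQLite database (40 hypothetical organisms, 3 sequences each) and only compares how many statements a per-organism loop issues against a single grouped query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE organism (id INTEGER PRIMARY KEY, common_name TEXT);
CREATE TABLE sequence (id INTEGER PRIMARY KEY, organism_id INTEGER);
""")
# Hypothetical data: 40 organisms with 3 sequences each.
conn.executemany(
    "INSERT INTO organism VALUES (?, ?)",
    [(i, f"org{i}") for i in range(1, 41)],
)
conn.executemany(
    "INSERT INTO sequence (organism_id) VALUES (?)",
    [(i,) for i in range(1, 41) for _ in range(3)],
)

queries = []  # every statement sent through run() is recorded here


def run(sql, params=()):
    """Execute one statement, record it, and return all rows."""
    queries.append(sql)
    return conn.execute(sql, params).fetchall()


def counts_with_loop():
    """One query for the organism list, then one more per organism."""
    result = {}
    for (org_id,) in run("SELECT id FROM organism"):
        rows = run(
            "SELECT count(*) FROM sequence WHERE organism_id = ?", (org_id,)
        )
        result[org_id] = rows[0][0]
    return result


def counts_with_group_by():
    """A single query: the database does the join and the counting."""
    rows = run("""
        SELECT organism.id, count(sequence.id)
        FROM organism
        LEFT OUTER JOIN sequence ON organism.id = sequence.organism_id
        GROUP BY organism.id
    """)
    return dict(rows)


queries.clear()
loop_result = counts_with_loop()
loop_queries = len(queries)

queries.clear()
grouped_result = counts_with_group_by()
grouped_queries = len(queries)

print(loop_queries, grouped_queries, loop_result == grouped_result)
```

Both functions return the same counts; the difference is that the loop's statement count grows with the number of organisms while the grouped query stays at one.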

abretaud · 2021-08-27T15:18:43Z

@hexylena I took the liberty of dockerizing your code here: https://github.com/galaxy-genome-annotation/apolpi
I hope/guess it's ok for you (licensing too?)

hexylena · 2021-08-30T08:51:34Z

Ahhh awesome @abretaud that'll make it easier to deploy.

yeah license is fine :) (Normally I'd do agpl3 to force folks to contribute back their changes, but, in this case I don't think it matters)

abretaud · 2021-08-30T09:29:20Z

Cool thanks, used on apololo.genouest.org and bipaa.genouest.org/apollo now

abretaud · 2021-09-03T11:55:53Z

Yep I noticed it's slow too but I don't know why, maybe it's doing things on the data dir!?

hexylena · 2021-09-06T08:03:39Z

It's odd, the API responds quickly, it was just through the UI. Anyway

cross12tamu mentioned this issue Aug 25, 2021

Apollo 2.6.5 Upgrade Issue #2630

Open

abretaud added a commit to galaxy-genome-annotation/apolpi that referenced this issue Aug 27, 2021

initial commit, code from GMOD/Apollo#2626

1a47af5

hexylena changed the title ~~Performance on /organism/findAllOrganisms~~ Performance Sep 3, 2021
