Skip to content

feat!: UTA access class separation of concerns, postgres lib fixes#465

Open
jsstevenson wants to merge 18 commits into
mainfrom
uta-refactor
Open

feat!: UTA access class separation of concerns, postgres lib fixes#465
jsstevenson wants to merge 18 commits into
mainfrom
uta-refactor

Conversation

@jsstevenson
Copy link
Copy Markdown
Member

@jsstevenson jsstevenson commented Apr 2, 2026

Basically a lot of tech debt fixes

  • Migrate from asyncpg to psycopg. Honestly, asyncpg might be faster but we are generally using psycopg elsewhere so might as well stick with one thing.
  • (breaking) break the DB access class into 2 things: 1) a repository pattern-style lookup class that just lasts as long as a DB connection, and 2) a connection pool-holder class for easily launching repository instances.
    • This means that applications which want to have more direct control over the lifespan of the connection pool can do so directly, and separates the connection-management logistics from the actual query/data transformation logic.
    • Mark the "get AWS secrets" function as deprecated. In the back of my head, I felt like we were trying to move away from secrets manager for DB connection params, but maybe I made that up. Either way, it's not ideal to be writing code specific to our AWS deployment into a library, so I threw down a deprecation warning and maybe we can figure out something better in 10 years
    • Previously we were basically doing a connection pool status check every time we ran a query, and (re)creating the pool if it was gone. It's a little funky to be delaying pool setup/config until runtime like this. I included a child class of the new UtaDatabase class that basically retains this behavior so we can minimize needed changes in the short term (+ less work for me to update tests)
  • (breaking in the future) set schema in the way you are supposed to per postgres docs: ie postgresql://uta_admin@localhost:5432/uta?options=-csearch_path%3Duta_20241220,public rather than /uta/uta_20241220
    • the postgresql://stuff/(db_name)/(schema_name) thing is nonstandard (I think it was basically invented for the hgvs library) and not in keeping with more typical postgres access patterns. In general, there are two standards for describing postgres connections -- as a key-value string and as a URI. So, I think if we're gonna have users go with the URI way, we should adhere to the standard.
    • env var name stays the same, but the old value won't work any more (it'll give you the wrong search path) and the default value is new
    • update: still support the old URL format via a _normalize_uta_db_url function that converts it to the new way and issues a deprecation warning. If this ends up being really annoying we can also just silence the warning.
  • Don't hardcode schema into queries anymore. Instead, set the search path when creating the connection. (Still assume uta_20241220 by default, but this is set at connection pool creation via the connection string)
  • Don't hard-code variables into queries anymore, but pass them as params using the psycopg %(argname)s syntax
  • Except for 1-2 exceptions that were too hard, make all queries static instead of dynamically constructing them
  • Allow the public UTA query execution function to accept params as an additional arg
  • Change one UTA DB function that returned a (result, failure_description) tuple where the result and the failure values were mutually exclusive into a function that returned the result value, and raised an exception in case of failure.
  • Refactor the genomic table initializer method, and also make it optional for the pool factory function to attempt creation of the genomic table (I think this addresses UTA database recreates schema on startup #430)

basically I think the ideal FastAPI usage for just UTA should look like

@asynccontextmanager
async def lifespan(app: FastAPI) -> AsyncGenerator:
    uta_pool = await create_uta_connection_pool()
    app.state.uta_pool = uta_pool
    yield
    await uta_pool.close()

async def get_uta(request: Request) -> Generator(UtaRepository, None, None):
    async with request.app.state.uta_pool.connection() as conn:
        yield UtaRepository(conn)

@app.get("/check_exists")
async def check_exists(gene: str, uta: Annotated[UtaRepository, Depends(get_uta)]):
    return await uta.gene_exists(gene)

and within coolseqtool itself, or in other contexts where you might want to pass around the entire coolseqtool god class instance, you can do something like

uta_db: UtaDatabase  # assume this exists
async with uta_db.repository() as uta:
    print(await uta.gene_exists("BRAF"))

A few misc engineering notes I thought about while working on this

  • Yes, we want to be using connection pools vs plain connections in basically all contexts. The pool more efficiently recycles connections, so even in contexts where everything is sequential, there's still a speedup
  • We probably want to stay async. It would be a pain to move back. Maybe in the future we can move the rest of the library to async/dial down all the blocking file open calls.

@jsstevenson jsstevenson added the priority:medium Medium priority label Apr 2, 2026
@jsstevenson jsstevenson marked this pull request as ready for review April 2, 2026 01:14
@jsstevenson jsstevenson requested a review from a team as a code owner April 2, 2026 01:14
@korikuzma
Copy link
Copy Markdown
Member

Will review this tomorrow

Copy link
Copy Markdown
Member

@korikuzma korikuzma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work, minor things

Comment thread src/cool_seq_tool/sources/uta_database.py Outdated
Comment thread src/cool_seq_tool/sources/uta_database.py Outdated
Comment thread tests/sources/test_uta_database.py Outdated
resp = await uta_repo.get_alt_ac_start_or_end("NM_152263.3", 822, 892, None)
assert resp == tpm3_1_8_end_genomic

with pytest.raises(NoMatchingAlignmentError):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consistency?

Suggested change
with pytest.raises(NoMatchingAlignmentError):
with pytest.raises(
NoMatchingAlignmentError,
match=re.escape(
"Unable to find a result where NM_152263.63 has transcript coordinates (tx_exon_start=822, tx_exon_end=892) between an exon's start and end coordinates on gene=None"
),
):

Comment thread tests/sources/test_uta_database.py
Comment thread src/cool_seq_tool/sources/uta_database.py Outdated
Co-authored-by: Kori Kuzma <korikuzma@gmail.com>
@jsstevenson
Copy link
Copy Markdown
Member Author

just as a heads-up -- it occurred to me that maybe this class should be tailored to ensure compatibility with the hgvs dataprovider interface so that it can be reused there. so for now i am pausing this to enable reflection

@jsstevenson
Copy link
Copy Markdown
Member Author

just as a heads-up -- it occurred to me that maybe this class should be tailored to ensure compatibility with the hgvs dataprovider interface so that it can be reused there. so for now i am pausing this to enable reflection

ugh

I think I would like to see a change in hgvs rather than adapting this function

Comment thread src/cool_seq_tool/sources/uta_database.py
Copy link
Copy Markdown
Member

@korikuzma korikuzma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to change the uta-setup.sql to the following to get tests to pass (otherwise, I got the following error when running tests Getting psycopg.errors.InsufficientPrivilege: must be owner of table genomic)

\c uta;

GRANT CONNECT ON DATABASE uta TO anonymous;
GRANT USAGE ON SCHEMA uta_20241220 TO anonymous;

CREATE TABLE IF NOT EXISTS uta_20241220.genomic AS
SELECT
    t.hgnc,
    aes.alt_ac,
    aes.alt_aln_method,
    aes.alt_strand,
    ae.start_i AS alt_start_i,
    ae.end_i AS alt_end_i
FROM uta_20241220.transcript t
JOIN uta_20241220.exon_set tes
    ON t.ac = tes.tx_ac
   AND tes.alt_aln_method = 'transcript'
JOIN uta_20241220.exon_set aes
    ON t.ac = aes.tx_ac
   AND aes.alt_aln_method <> 'transcript'
JOIN uta_20241220.exon te
    ON tes.exon_set_id = te.exon_set_id
JOIN uta_20241220.exon ae
    ON aes.exon_set_id = ae.exon_set_id
   AND te.ord = ae.ord
LEFT JOIN uta_20241220.exon_aln ea
    ON te.exon_id = ea.tx_exon_id
   AND ae.exon_id = ea.alt_exon_id;

CREATE INDEX IF NOT EXISTS alt_pos_index
ON uta_20241220.genomic (alt_ac, alt_start_i, alt_end_i);

CREATE INDEX IF NOT EXISTS gene_alt_index
ON uta_20241220.genomic (hgnc, alt_ac);

CREATE INDEX IF NOT EXISTS alt_ac_index
ON uta_20241220.genomic (alt_ac);

ALTER TABLE uta_20241220.genomic OWNER TO anonymous;

ALTER INDEX uta_20241220.alt_pos_index OWNER TO anonymous;
ALTER INDEX uta_20241220.gene_alt_index OWNER TO anonymous;
ALTER INDEX uta_20241220.alt_ac_index OWNER TO anonymous;

GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA uta_20241220 TO anonymous;

ALTER DEFAULT PRIVILEGES IN SCHEMA uta_20241220
GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO anonymous;

ALTER DATABASE uta OWNER TO anonymous;
ALTER SCHEMA uta_20241220 OWNER TO anonymous;

When I changed to this, all tests passed using both DB URLs (new and legacy)

@jsstevenson
Copy link
Copy Markdown
Member Author

jsstevenson commented May 14, 2026

@korikuzma what if we could just ditch the genomic table entirely?

per friendly chatgpt, get_ac_from_gene can be refactored like so

    async def get_ac_from_gene(self, gene: str) -> list[str]:
        """Return genomic accession(s) associated to a gene.

        :param gene: Gene symbol
        :return: List of genomic accessions, sorted in desc order
        """
        query = """
            SELECT DISTINCT aes.alt_ac
            FROM transcript AS t
            JOIN exon_set AS aes
              ON t.ac = aes.tx_ac
            WHERE t.hgnc = %(gene)s
              AND aes.alt_aln_method <> 'transcript'
              AND aes.alt_ac LIKE 'NC_00%%'
            ORDER BY aes.alt_ac;
            """

        cursor = await self.execute_query(query, {"gene": gene})
        results = await cursor.fetchall()
        alt_acs = [r[0] for r in results]
        alt_acs.sort(key=lambda x: int(x.split(".")[-1]), reverse=True)
        return alt_acs

with basically no loss of performance (this is important, it's used in the variation normalizer)

get_gene_from_ac is more challenging, the filtering on alt_start_i and alt_end_i apparently gets slow without those new indexes (~1ms to 100ms). However, as far as I can tell... this function isn't actually used anywhere? Maybe we could do the refactor now, and then not worry about genomic table stuff moving forward?

@korikuzma
Copy link
Copy Markdown
Member

@korikuzma what if we could just ditch the genomic table entirely?

per friendly chatgpt, get_ac_from_gene can be refactored like so

    async def get_ac_from_gene(self, gene: str) -> list[str]:
        """Return genomic accession(s) associated to a gene.

        :param gene: Gene symbol
        :return: List of genomic accessions, sorted in desc order
        """
        query = """
            SELECT DISTINCT aes.alt_ac
            FROM transcript AS t
            JOIN exon_set AS aes
              ON t.ac = aes.tx_ac
            WHERE t.hgnc = %(gene)s
              AND aes.alt_aln_method <> 'transcript'
              AND aes.alt_ac LIKE 'NC_00%%'
            ORDER BY aes.alt_ac;
            """

        cursor = await self.execute_query(query, {"gene": gene})
        results = await cursor.fetchall()
        alt_acs = [r[0] for r in results]
        alt_acs.sort(key=lambda x: int(x.split(".")[-1]), reverse=True)
        return alt_acs

with basically no loss of performance (this is important, it's used in the variation normalizer)

get_gene_from_ac is more challenging, the filtering on alt_start_i and alt_end_i apparently gets slow without those new indexes (~1ms to 100ms). However, as far as I can tell... this function isn't actually used anywhere? Maybe we could do the refactor now, and then not worry about genomic table stuff moving forward?

Sent in Slack, but posting here.

https://github.com/search?q=ORG%3Acancervariants%20OR%20ORG%3Agenomicmedlab%20get_gene_from_ac&type=code

No usage. Let's remove both. 🚀

@jsstevenson jsstevenson requested a review from korikuzma May 15, 2026 00:08
@jsstevenson
Copy link
Copy Markdown
Member Author

Ok, I think I covered my bases here?

Copy link
Copy Markdown
Member

@korikuzma korikuzma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missed one reference

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

priority:medium Medium priority

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants