feat!: UTA access class separation of concerns, postgres lib fixes by jsstevenson · Pull Request #465 · GenomicMedLab/cool-seq-tool

jsstevenson · 2026-04-02T00:29:54Z

Basically a lot of tech debt fixes

Migrate from asyncpg to psycopg. Honestly, asyncpg might be faster but we are generally using psycopg elsewhere so might as well stick with one thing.
(breaking) break the DB access class into 2 things: 1) a repository pattern-style lookup class that just lasts as long as a DB connection, and 2) a connection pool-holder class for easily launching repository instances.
- This means that applications which want to have more direct control over the lifespan of the connection pool can do so directly, and separates the connection-management logistics from the actual query/data transformation logic.
- Mark the "get AWS secrets" function as deprecated. In the back of my head, I felt like we were trying to move away from secrets manager for DB connection params, but maybe I made that up. Either way, it's not ideal to be writing code specific to our AWS deployment into a library, so I threw down a deprecation warning and maybe we can figure out something better in 10 years
- Previously we were basically doing a connection pool status check every time we ran a query, and (re)creating the pool if it was gone. It's a little funky to be delaying pool setup/config until runtime like this. I included a child class of the new UtaDatabase class that basically retains this behavior so we can minimize needed changes in the short term (+ less work for me to update tests)
(breaking in the future) set schema in the way you are supposed to per postgres docs: ie postgresql://uta_admin@localhost:5432/uta?options=-csearch_path%3Duta_20241220,public rather than /uta/uta_20241220
- the postgresql://stuff/(db_name)/(schema_name) thing is nonstandard (I think it was basically invented for the hgvs library) and not in keeping with more typical postgres access patterns. In general, there are two standards for describing postgres connections -- as a key-value string and as a URI. So, I think if we're gonna have users go with the URI way, we should adhere to the standard.
- env var name stays the same, but the old value won't work any more (it'll give you the wrong search path) and the default value is new
- update: still support the old URL format via a _normalize_uta_db_url function that converts it to the new way and issues a deprecation warning. If this ends up being really annoying we can also just silence the warning.
Don't hardcode schema into queries anymore. Instead, set the search path when creating the connection. (Still assume uta_20241220 by default, but this is set at connection pool creation via the connection string)
Don't hard-code variables into queries anymore, but pass them as params using the psycopg %(argname)s syntax
Except for 1-2 exceptions that were too hard, make all queries static instead of dynamically constructing them
Allow the public UTA query execution function to accept params as an additional arg
Change one UTA DB function that returned a (result, failure_description) tuple where the result and the failure values were mutually exclusive into a function that returned the result value, and raised an exception in case of failure.
Refactor the genomic table initializer method, and also make it optional for the pool factory function to attempt creation of the genomic table (I think this addresses UTA database recreates schema on startup #430)

basically I think the ideal FastAPI usage for just UTA should look like

@asynccontextmanager
async def lifespan(app: FastAPI) -> AsyncGenerator:
    uta_pool = await create_uta_connection_pool()
    app.state.uta_pool = uta_pool
    yield
    await uta_pool.close()

async def get_uta(request: Request) -> Generator(UtaRepository, None, None):
    async with request.app.state.uta_pool.connection() as conn:
        yield UtaRepository(conn)

@app.get("/check_exists")
async def check_exists(gene: str, uta: Annotated[UtaRepository, Depends(get_uta)]):
    return await uta.gene_exists(gene)

and within coolseqtool itself, or in other contexts where you might want to pass around the entire coolseqtool god class instance, you can do something like

uta_db: UtaDatabase  # assume this exists
async with uta_db.repository() as uta:
    print(await uta.gene_exists("BRAF"))

A few misc engineering notes I thought about while working on this

Yes, we want to be using connection pools vs plain connections in basically all contexts. The pool more efficiently recycles connections, so even in contexts where everything is sequential, there's still a speedup
We probably want to stay async. It would be a pain to move back. Maybe in the future we can move the rest of the library to async/dial down all the blocking file open calls.

korikuzma · 2026-04-02T12:10:44Z

Will review this tomorrow

korikuzma

Great work, minor things

korikuzma · 2026-04-02T19:52:59Z

+    resp = await uta_repo.get_alt_ac_start_or_end("NM_152263.3", 822, 892, None)
+    assert resp == tpm3_1_8_end_genomic
+
+    with pytest.raises(NoMatchingAlignmentError):


Consistency?

Suggested change

with pytest.raises(NoMatchingAlignmentError):

with pytest.raises(

NoMatchingAlignmentError,

match=re.escape(

"Unable to find a result where NM_152263.63 has transcript coordinates (tx_exon_start=822, tx_exon_end=892) between an exon's start and end coordinates on gene=None"

),

):

Co-authored-by: Kori Kuzma <korikuzma@gmail.com>

jsstevenson · 2026-04-21T19:56:26Z

just as a heads-up -- it occurred to me that maybe this class should be tailored to ensure compatibility with the hgvs dataprovider interface so that it can be reused there. so for now i am pausing this to enable reflection

jsstevenson · 2026-05-13T19:40:20Z

just as a heads-up -- it occurred to me that maybe this class should be tailored to ensure compatibility with the hgvs dataprovider interface so that it can be reused there. so for now i am pausing this to enable reflection

ugh

I think I would like to see a change in hgvs rather than adapting this function

korikuzma

I had to change the uta-setup.sql to the following to get tests to pass (otherwise, I got the following error when running tests Getting psycopg.errors.InsufficientPrivilege: must be owner of table genomic)

\c uta;

GRANT CONNECT ON DATABASE uta TO anonymous;
GRANT USAGE ON SCHEMA uta_20241220 TO anonymous;

CREATE TABLE IF NOT EXISTS uta_20241220.genomic AS
SELECT
    t.hgnc,
    aes.alt_ac,
    aes.alt_aln_method,
    aes.alt_strand,
    ae.start_i AS alt_start_i,
    ae.end_i AS alt_end_i
FROM uta_20241220.transcript t
JOIN uta_20241220.exon_set tes
    ON t.ac = tes.tx_ac
   AND tes.alt_aln_method = 'transcript'
JOIN uta_20241220.exon_set aes
    ON t.ac = aes.tx_ac
   AND aes.alt_aln_method <> 'transcript'
JOIN uta_20241220.exon te
    ON tes.exon_set_id = te.exon_set_id
JOIN uta_20241220.exon ae
    ON aes.exon_set_id = ae.exon_set_id
   AND te.ord = ae.ord
LEFT JOIN uta_20241220.exon_aln ea
    ON te.exon_id = ea.tx_exon_id
   AND ae.exon_id = ea.alt_exon_id;

CREATE INDEX IF NOT EXISTS alt_pos_index
ON uta_20241220.genomic (alt_ac, alt_start_i, alt_end_i);

CREATE INDEX IF NOT EXISTS gene_alt_index
ON uta_20241220.genomic (hgnc, alt_ac);

CREATE INDEX IF NOT EXISTS alt_ac_index
ON uta_20241220.genomic (alt_ac);

ALTER TABLE uta_20241220.genomic OWNER TO anonymous;

ALTER INDEX uta_20241220.alt_pos_index OWNER TO anonymous;
ALTER INDEX uta_20241220.gene_alt_index OWNER TO anonymous;
ALTER INDEX uta_20241220.alt_ac_index OWNER TO anonymous;

GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA uta_20241220 TO anonymous;

ALTER DEFAULT PRIVILEGES IN SCHEMA uta_20241220
GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO anonymous;

ALTER DATABASE uta OWNER TO anonymous;
ALTER SCHEMA uta_20241220 OWNER TO anonymous;

When I changed to this, all tests passed using both DB URLs (new and legacy)

jsstevenson · 2026-05-14T19:24:50Z

@korikuzma what if we could just ditch the genomic table entirely?

per friendly chatgpt, get_ac_from_gene can be refactored like so

    async def get_ac_from_gene(self, gene: str) -> list[str]:
        """Return genomic accession(s) associated to a gene.

        :param gene: Gene symbol
        :return: List of genomic accessions, sorted in desc order
        """
        query = """
            SELECT DISTINCT aes.alt_ac
            FROM transcript AS t
            JOIN exon_set AS aes
              ON t.ac = aes.tx_ac
            WHERE t.hgnc = %(gene)s
              AND aes.alt_aln_method <> 'transcript'
              AND aes.alt_ac LIKE 'NC_00%%'
            ORDER BY aes.alt_ac;
            """

        cursor = await self.execute_query(query, {"gene": gene})
        results = await cursor.fetchall()
        alt_acs = [r[0] for r in results]
        alt_acs.sort(key=lambda x: int(x.split(".")[-1]), reverse=True)
        return alt_acs

with basically no loss of performance (this is important, it's used in the variation normalizer)

get_gene_from_ac is more challenging, the filtering on alt_start_i and alt_end_i apparently gets slow without those new indexes (~1ms to 100ms). However, as far as I can tell... this function isn't actually used anywhere? Maybe we could do the refactor now, and then not worry about genomic table stuff moving forward?

korikuzma · 2026-05-14T19:51:28Z

@korikuzma what if we could just ditch the genomic table entirely?

per friendly chatgpt, get_ac_from_gene can be refactored like so
    async def get_ac_from_gene(self, gene: str) -> list[str]:
        """Return genomic accession(s) associated to a gene.

        :param gene: Gene symbol
        :return: List of genomic accessions, sorted in desc order
        """
        query = """
            SELECT DISTINCT aes.alt_ac
            FROM transcript AS t
            JOIN exon_set AS aes
              ON t.ac = aes.tx_ac
            WHERE t.hgnc = %(gene)s
              AND aes.alt_aln_method <> 'transcript'
              AND aes.alt_ac LIKE 'NC_00%%'
            ORDER BY aes.alt_ac;
            """

        cursor = await self.execute_query(query, {"gene": gene})
        results = await cursor.fetchall()
        alt_acs = [r[0] for r in results]
        alt_acs.sort(key=lambda x: int(x.split(".")[-1]), reverse=True)
        return alt_acs
with basically no loss of performance (this is important, it's used in the variation normalizer)

get_gene_from_ac is more challenging, the filtering on alt_start_i and alt_end_i apparently gets slow without those new indexes (~1ms to 100ms). However, as far as I can tell... this function isn't actually used anywhere? Maybe we could do the refactor now, and then not worry about genomic table stuff moving forward?

Sent in Slack, but posting here.

https://github.com/search?q=ORG%3Acancervariants%20OR%20ORG%3Agenomicmedlab%20get_gene_from_ac&type=code

No usage. Let's remove both. 🚀

jsstevenson · 2026-05-15T00:09:11Z

Ok, I think I covered my bases here?

korikuzma

cool-seq-tool/src/cool_seq_tool/mappers/mane_transcript.py

Line 276 in 6f81a69

FROM genomic

Missed one reference

jsstevenson added 8 commits March 30, 2026 09:59

stash

c308acc

update usage elsewhere in the library

fef0ccc

minor tweaks to factory method

4154d16

add todo

f386b30

fix up query api issues

b0a8445

cleanup eventual error msg

d7606e9

ope

bbc4dcd

update deps

5272243

jsstevenson added the priority:medium Medium priority label Apr 2, 2026

jsstevenson added 3 commits April 1, 2026 20:55

refactor genomic table initializer + make it optional

3f899ec

Merge branch 'main' into uta-refactor

4cf205e

this doesnt need to be function-scoped

14c4d12

jsstevenson marked this pull request as ready for review April 2, 2026 01:14

jsstevenson requested a review from a team as a code owner April 2, 2026 01:14

korikuzma mentioned this pull request Apr 2, 2026

feat: add gene-based search biocommons/anyvar#416

Merged

korikuzma requested changes Apr 2, 2026

View reviewed changes

Update src/cool_seq_tool/sources/uta_database.py

075b775

Co-authored-by: Kori Kuzma <korikuzma@gmail.com>

jsstevenson added 4 commits May 13, 2026 15:50

support old url version

d85b1d6

fixes from review

e976af1

rm unused fixture

3792e13

update tests

1db7e05

jsstevenson requested a review from korikuzma May 13, 2026 19:52

korikuzma mentioned this pull request May 14, 2026

UTA: Remove table construction from pool constructor function #467

Open

korikuzma reviewed May 14, 2026

View reviewed changes

Comment thread src/cool_seq_tool/sources/uta_database.py

korikuzma requested changes May 14, 2026

View reviewed changes

jsstevenson added 2 commits May 14, 2026 20:08

remove genomic table stuff

d1fca22

remove genomic table stuff

6f81a69

jsstevenson requested a review from korikuzma May 15, 2026 00:08

korikuzma mentioned this pull request May 15, 2026

Consider alternatives to custom genomic table #285

Closed

korikuzma requested changes May 15, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat!: UTA access class separation of concerns, postgres lib fixes#465

feat!: UTA access class separation of concerns, postgres lib fixes#465
jsstevenson wants to merge 18 commits into
mainfrom
uta-refactor

jsstevenson commented Apr 2, 2026 •

edited

Loading

Uh oh!

korikuzma commented Apr 2, 2026

Uh oh!

korikuzma left a comment

Uh oh!

Uh oh!

Uh oh!

korikuzma Apr 2, 2026

Uh oh!

Uh oh!

Uh oh!

jsstevenson commented Apr 21, 2026

Uh oh!

jsstevenson commented May 13, 2026

Uh oh!

Uh oh!

korikuzma left a comment

Uh oh!

jsstevenson commented May 14, 2026 •

edited

Loading

Uh oh!

korikuzma commented May 14, 2026

Uh oh!

jsstevenson commented May 15, 2026

Uh oh!

korikuzma left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

-    with pytest.raises(NoMatchingAlignmentError):
+    with pytest.raises(
+        NoMatchingAlignmentError,
+        match=re.escape(
+            "Unable to find a result where NM_152263.63 has transcript coordinates (tx_exon_start=822, tx_exon_end=892) between an exon's start and end coordinates on gene=None"
+        ),
+    ):

Conversation

jsstevenson commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

korikuzma commented Apr 2, 2026

Uh oh!

korikuzma left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

korikuzma Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

jsstevenson commented Apr 21, 2026

Uh oh!

jsstevenson commented May 13, 2026

Uh oh!

Uh oh!

korikuzma left a comment

Choose a reason for hiding this comment

Uh oh!

jsstevenson commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

korikuzma commented May 14, 2026

Uh oh!

jsstevenson commented May 15, 2026

Uh oh!

korikuzma left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jsstevenson commented Apr 2, 2026 •

edited

Loading

jsstevenson commented May 14, 2026 •

edited

Loading