Load genes missing Ensembl ID using cytoband coordinates #4677

northwestwitch · 2024-06-14T11:16:48Z

This PR adds a functionality or fixes a bug.

Closes Loading HGNC gene group loci #4505

Testing on cg-vm1 server (Clinical Genomics Stockholm)

Prepare for testing

Make sure the PR is pushed and available on Docker Hub
First book your testing time using the Pax software available at https://pax.scilifelab.se/. The resource you are going to call dibs on is scout-stage and the server is cg-vm1.
ssh <USER.NAME>@cg-vm1.scilifelab.se
sudo -iu hiseq.clinical
ssh localhost
(optional) Find out which scout branch is currently deployed on cg-vm1: podman ps
Stop the service with current deployed branch: systemctl --user stop scout.target
Start the scout service with the branch to test: systemctl --user start scout@<this_branch>
Make sure the branch is deployed: systemctl --user status scout.target
After testing is done, repeat procedure at https://pax.scilifelab.se/, which will release the allocated resource (scout-stage) to be used for testing by other users.

Testing on hasta server (Clinical Genomics Stockholm)

Prepare for testing

ssh <USER.NAME>@hasta.scilifelab.se
Book your testing time using the Pax software. us; paxa -u <user> -s hasta -r scout-stage. You can also use the WSGI Pax app available at https://pax.scilifelab.se/.
(optional) Find out which scout branch is currently deployed on hasta: conda activate S_scout; pip freeze | grep scout-browser
Deploy the branch to test: bash /home/proj/production/servers/resources/hasta.scilifelab.se/update-tool-stage.sh -e S_scout -t scout -b <this_branch>
Make sure the branch is deployed: us; scout --version
After testing is done, repeat the paxa procedure, which will release the allocated resource (scout-stage) to be used for testing by other users.

How to test:

how to test it, possibly with real cases/data

Expected outcome:
The functionality should be working
Take a screenshot and attach or copy/paste the output.

Review:

code approved by
tests executed by

codecov · 2024-06-14T12:36:33Z

Codecov Report

Attention: Patch coverage is 93.75000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 84.44%. Comparing base (4d00d74) to head (9c066b4).

Files	Patch %	Lines
scout/load/hgnc_gene.py	89.47%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4677      +/-   ##
==========================================
+ Coverage   84.43%   84.44%   +0.01%     
==========================================
  Files         311      311              
  Lines       18761    18783      +22     
==========================================
+ Hits        15840    15861      +21     
- Misses       2921     2922       +1

northwestwitch · 2024-06-18T12:56:18Z

Marking this PR as ready for review. It doesn't (yet) get fusion variants with gene IGH@ assigned a valid gene, but it opens the way for a fix involving gene aliases.

Genes using this branch were loaded on hasta stage.

dnil

Creative idea! 🧠

In the spirit of summer, I'm approving, and fairly safely assuming you are right about all ensembl genes having coords. It seems very reasonable, but refseq would not necessarily... 😁 Given that that is correct I would still have an issue with the function name set_gene_coordinates as it is just doing a select few stragglers really, and one may easily get the idea that the more exact ensembl coords are overwritten.

scout/adapter/mongo/hgnc.py

scout/commands/update/genes.py

tests/load/test_load_hgnc_genes.py

scout/parse/hgnc.py

scout/adapter/mongo/cytoband.py

scout/load/hgnc_gene.py

dnil · 2024-06-26T14:37:43Z

scout/load/hgnc_gene.py

    with progressbar(genes.values(), label="Building genes", length=nr_genes) as bar:
        for gene_data in bar:
+            set_gene_coordinates(gene_data=gene_data, cytoband_coords=cytoband_coords)


Hm, if I read this correctly you are always overwriting the more exact gene coords with the cytoband coords? That doesn't seem quite right! 😊 Perhaps a check for existing coords, or set only if the chrom, start and stop don't already exist? You do check for the Ensembl gene ID, which may or may not be enough there - maybe just rename the function slightly then?

No no, it's not overwriting coords for genes that already have them, see line 23 of that file:

I'll rename the function as you suggested!
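
The behaviour described in this reply can be sketched roughly as follows. This is a minimal illustration and not the code in scout/load/hgnc_gene.py: the keys `ensembl_gene_id`, `location`, `start`, `end` and `stop`, the shape of `cytoband_coords` and all example values are assumptions made for the sketch. It uses one of the new names suggested in the review.

```python
from typing import Dict, Optional


def set_missing_gene_coordinates(gene_data: dict, cytoband_coords: Dict[str, dict]) -> None:
    """Fill in coordinates from the cytoband only for genes lacking an Ensembl ID.

    All dictionary keys used here are assumed for illustration.
    """
    if gene_data.get("ensembl_gene_id"):
        # The gene has an Ensembl ID, so it is assumed to carry exact coordinates already
        return

    band: Optional[dict] = cytoband_coords.get(gene_data.get("location"))
    if band is None:
        # Unknown cytoband: leave the gene untouched
        return

    gene_data["chromosome"] = band["chromosome"]
    gene_data["start"] = band["start"]
    gene_data["end"] = band["stop"]


# Made-up example values
cytoband_coords = {"14q32.33": {"chromosome": "14", "start": 100, "stop": 200}}
with_ensembl = {
    "ensembl_gene_id": "ENSG_EXAMPLE",
    "location": "14q32.33",
    "chromosome": "14",
    "start": 150,
    "end": 160,
}
without_ensembl = {"hgnc_symbol": "IGH", "location": "14q32.33"}

set_missing_gene_coordinates(with_ensembl, cytoband_coords)
set_missing_gene_coordinates(without_ensembl, cytoband_coords)
```

In this sketch the gene with an Ensembl ID keeps its own start and end, while the gene without one receives the coarser cytoband span.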

dnil · 2024-06-26T14:43:45Z

scout/load/hgnc_gene.py

@@ -16,6 +17,22 @@
 LOG = logging.getLogger(__name__)


+def set_gene_coordinates(gene_data: dict, cytoband_coords: Dict[str, dict]):


As with the comment for the call, you are afaik right about the other coord source being Ensembl, but it would feel more straightforward to check for the existence of good values on chr, start and stop? Regardless, perhaps change the function name to e.g. set_missing_gene_coordinates, set_empty_gene_coords_from_cytoband_location or such.
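
The more direct check suggested here, testing the coordinate fields themselves rather than the Ensembl ID, might look like the sketch below. The helper `has_coordinates` is hypothetical, and the key names (`chromosome`, `start`, `end`, `stop`, `location`) are assumptions; the function name is one of the alternatives proposed in this comment.

```python
from typing import Dict

COORD_KEYS = ("chromosome", "start", "end")  # assumed key names


def has_coordinates(gene_data: dict) -> bool:
    """Return True when chromosome, start and end are all set (hypothetical helper)."""
    return all(gene_data.get(key) is not None for key in COORD_KEYS)


def set_empty_gene_coords_from_cytoband_location(
    gene_data: dict, cytoband_coords: Dict[str, dict]
) -> None:
    """Use the cytoband span only when the gene has no complete coordinates."""
    if has_coordinates(gene_data):
        # Never overwrite coordinates that are already present
        return
    band = cytoband_coords.get(gene_data.get("location"))
    if band is None:
        return
    gene_data.update(chromosome=band["chromosome"], start=band["start"], end=band["stop"])
```

Comparing against `None` rather than relying on truthiness means a start position of 0 would still count as an existing value.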

northwestwitch · 2024-07-15T07:40:40Z

In the spirit of summer, I'm approving, and fairly safely assuming you are right about all ensembl genes having coords.

That's how it worked so far. Also, now we have a pydantic check before saving genes in the database and the genes loading would fail in the eventuality that coordinates are missing.
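
As a rough illustration of why loading would fail on missing coordinates, the sketch below uses a standard-library dataclass as a stand-in for the pydantic check mentioned above. Pydantic itself is not used here, and the class name, field names and error message are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GeneRecord:
    """Hypothetical stand-in for the validated gene model."""

    hgnc_id: int
    chromosome: Optional[str] = None
    start: Optional[int] = None
    end: Optional[int] = None

    def __post_init__(self) -> None:
        # Reject the record before it could be saved if any coordinate is absent
        missing = [name for name in ("chromosome", "start", "end") if getattr(self, name) is None]
        if missing:
            raise ValueError(f"Gene {self.hgnc_id} is missing coordinates: {', '.join(missing)}")


def validate_gene(gene_data: dict) -> GeneRecord:
    """Build a validated record from a plain gene dictionary."""
    return GeneRecord(**gene_data)
```

Under these assumptions, a gene that reaches validation without coordinates raises an error and stops the load instead of being stored silently.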

northwestwitch · 2024-07-15T07:44:59Z

scout/load/hgnc_gene.py

    with progressbar(genes.values(), label="Building genes", length=nr_genes) as bar:
        for gene_data in bar:
+            set_gene_coordinates(gene_data=gene_data, cytoband_coords=cytoband_coords)
+
            if not gene_data.get("chromosome"):


Note that there is also this other check to make sure that coords will not be missing
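
The order of operations in this loop can be sketched as below: first the cytoband fallback, then the check on `chromosome` as a second safety net. The fallback body, the key names and the `build_genes` wrapper are assumptions, as is the choice to skip genes that still lack a chromosome; the progress bar is left out to keep the sketch self-contained.

```python
import logging
from typing import Dict, List

LOG = logging.getLogger(__name__)


def fill_from_cytoband(gene_data: dict, cytoband_coords: Dict[str, dict]) -> None:
    """Placeholder for the fallback step; key names are assumed."""
    if gene_data.get("chromosome"):
        return
    band = cytoband_coords.get(gene_data.get("location"))
    if band:
        gene_data.update(chromosome=band["chromosome"], start=band["start"], end=band["stop"])


def build_genes(genes: Dict[str, dict], cytoband_coords: Dict[str, dict]) -> List[dict]:
    """Return only the genes that end up with a chromosome (hypothetical wrapper)."""
    kept = []
    for gene_data in genes.values():
        fill_from_cytoband(gene_data, cytoband_coords)
        if not gene_data.get("chromosome"):
            # Neither source provided coordinates, so the gene is skipped here
            LOG.debug("Skipping gene without coordinates: %s", gene_data.get("hgnc_symbol"))
            continue
        kept.append(gene_data)
    return kept
```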

sonarcloud · 2024-07-15T07:53:18Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

Chiara Rasi and others added 6 commits June 14, 2024 13:13

Load genes missing Ensembl ID using cytoband coordinates

fd1a407

Fix code style issues with Black

973bf01

Import Optional in build gene

1955853

Fix tests

710667b

Fix typo

fb461b0

Thanks Sonarcloud

fc00685

Chiara Rasi and others added 14 commits June 14, 2024 14:37

Reverting some code here and there

371fd5c

More specific error message

285ef27

Typo

2fe10c9

Work in progress

91b19a4

Simplify code in one function

da1f40d

Merge branch 'main' into load_IGH_genes

930ecf6

Fix some issues due to other PR merging

55c5167

Fix code style issues with Black

bfaba9d

trigger tests

204c533

Fix typo

d5ffc66

Remove leftover code

577b631

Fix code

9328e07

Add test

97d278e

Additional test

b72ffd3

northwestwitch marked this pull request as ready for review June 18, 2024 12:56

northwestwitch and others added 3 commits June 18, 2024 15:29

Merge branch 'main' into load_IGH_genes

6e96a80

Merge branch 'main' into load_IGH_genes

7df3a3e

Merge branch 'main' into load_IGH_genes

c6bee80

dnil approved these changes Jun 26, 2024

dnil added 3 commits June 27, 2024 14:15

Merge branch 'main' into load_IGH_genes

13bd97a

Merge branch 'main' into load_IGH_genes

da0fbee

Merge branch 'main' into load_IGH_genes

7790570

northwestwitch commented Jul 15, 2024

Chiara Rasi added 2 commits July 15, 2024 09:48

Rename function as per review

d832dc2

Do the actual renaming in the test

9c066b4

northwestwitch merged commit 07d0676 into main Jul 15, 2024
25 checks passed

northwestwitch deleted the load_IGH_genes branch July 24, 2024 08:04
