Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Load genes missing Ensembl ID using cytoband coordinates #4677

Merged
merged 28 commits into from
Jul 15, 2024

Conversation

northwestwitch
Copy link
Member

@northwestwitch northwestwitch commented Jun 14, 2024

This PR adds a functionality or fixes a bug.

image
Testing on cg-vm1 server (Clinical Genomics Stockholm)

Prepare for testing

  1. Make sure the PR is pushed and available on Docker Hub
  2. Fist book your testing time using the Pax software available at https://pax.scilifelab.se/. The resource you are going to call dibs on is scout-stage and the server is cg-vm1.
  3. ssh <USER.NAME>@cg-vm1.scilifelab.se
  4. sudo -iu hiseq.clinical
  5. ssh localhost
  6. (optional) Find out which scout branch is currently deployed on cg-vm1: podman ps
  7. Stop the service with current deployed branch: systemctl --user stop scout.target
  8. Start the scout service with the branch to test: systemctl --user start scout@<this_branch>
  9. Make sure the branch is deployed: systemctl --user status scout.target
  10. After testing is done, repeat procedure at https://pax.scilifelab.se/, which will release the allocated resource (scout-stage) to be used for testing by other users.
Testing on hasta server (Clinical Genomics Stockholm)

Prepare for testing

  1. ssh <USER.NAME>@hasta.scilifelab.se
  2. Book your testing time using the Pax software. us; paxa -u <user> -s hasta -r scout-stage. You can also use the WSGI Pax app available at https://pax.scilifelab.se/.
  3. (optional) Find out which scout branch is currently deployed on cg-vm1: conda activate S_scout; pip freeze | grep scout-browser
  4. Deploy the branch to test: bash /home/proj/production/servers/resources/hasta.scilifelab.se/update-tool-stage.sh -e S_scout -t scout -b <this_branch>
  5. Make sure the branch is deployed: us; scout --version
  6. After testing is done, repeat the paxa procedure, which will release the allocated resource (scout-stage) to be used for testing by other users.

How to test:

  1. how to test it, possibly with real cases/data

Expected outcome:
The functionality should be working
Take a screenshot and attach or copy/paste the output.

Review:

  • code approved by
  • tests executed by

Copy link

codecov bot commented Jun 14, 2024

Codecov Report

Attention: Patch coverage is 93.75000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 84.44%. Comparing base (4d00d74) to head (9c066b4).

Files Patch % Lines
scout/load/hgnc_gene.py 89.47% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4677      +/-   ##
==========================================
+ Coverage   84.43%   84.44%   +0.01%     
==========================================
  Files         311      311              
  Lines       18761    18783      +22     
==========================================
+ Hits        15840    15861      +21     
- Misses       2921     2922       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@northwestwitch
Copy link
Member Author

Marking this PR as ready for review. It doesn't solve the fact that fusion variant with gene IGH@ gets assigned a valid gene (yet) but opens up to a fix involving genes aliases.

Genes using this branch were loaded on hasta stage.

@northwestwitch northwestwitch marked this pull request as ready for review June 18, 2024 12:56
Copy link
Collaborator

@dnil dnil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Creative idea! 🧠

In the spirit of summer, I'm approving, and fairly safely assuming you are right about all ensembl genes having coords. It seems very reasonable, but refseq would not necessarily... 😁 Given that that is correct I would still have an issue with the function name set_gene_coordinates as it is just doing a select few stragglers really, and one may easily get the idea that the more exact ensembl coords are overwritten.

scout/adapter/mongo/hgnc.py Show resolved Hide resolved
scout/commands/update/genes.py Show resolved Hide resolved
scout/commands/update/genes.py Show resolved Hide resolved
tests/load/test_load_hgnc_genes.py Outdated Show resolved Hide resolved
scout/parse/hgnc.py Show resolved Hide resolved
scout/adapter/mongo/cytoband.py Show resolved Hide resolved
scout/load/hgnc_gene.py Show resolved Hide resolved
with progressbar(genes.values(), label="Building genes", length=nr_genes) as bar:
for gene_data in bar:
set_gene_coordinates(gene_data=gene_data, cytoband_coords=cytoband_coords)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, if I read this correctly you are always overwriting the more exact gene coords with the cytoband coords? That doesn't seem quite right! 😊 Perhaps a check for existing coords, or set only if the chrome, start stop dont already exist? You do check for ensembl gene id, which may or may not be enough there - maybe just rename the function slightly then?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No no, it's not overwriting coords for genes that already have them, see line 23 of that file:

image

I'll rename the function as you suggested!

@@ -16,6 +17,22 @@
LOG = logging.getLogger(__name__)


def set_gene_coordinates(gene_data: dict, cytoband_coords: Dict[str, dict]):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As with the comment for the call, you are afaik right about the other coord source being ensembl, but it would feel more straightforward to check for the existence of good values on chr, start stop? Regardless, perhaps change to the function name to e.g. set_missing_gene_coordinates, set_empty_gene_coords_from_cytoband_location or such.

@northwestwitch
Copy link
Member Author

In the spirit of summer, I'm approving, and fairly safely assuming you are right about all ensembl genes having coords.

That's how it worked so far. Also, now we have a pydantic check before saving genes in the database and the genes loading would fail in the eventuality that coordinates are missing.

with progressbar(genes.values(), label="Building genes", length=nr_genes) as bar:
for gene_data in bar:
set_gene_coordinates(gene_data=gene_data, cytoband_coords=cytoband_coords)

if not gene_data.get("chromosome"):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that there is also this other check to make sure that coords will not be missing

Copy link

sonarcloud bot commented Jul 15, 2024

@northwestwitch northwestwitch merged commit 07d0676 into main Jul 15, 2024
25 checks passed
@northwestwitch northwestwitch deleted the load_IGH_genes branch July 24, 2024 08:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Loading HGNC gene group loci
3 participants