Load genes missing Ensembl ID using cytoband coordinates #4677
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4677      +/-   ##
==========================================
+ Coverage   84.43%   84.44%   +0.01%
==========================================
  Files         311      311
  Lines       18761    18783      +22
==========================================
+ Hits        15840    15861      +21
- Misses       2921     2922       +1
☔ View full report in Codecov by Sentry.
Marking this PR as ready for review. It does not (yet) make the fusion variant with gene IGH@ get assigned a valid gene, but it opens the way for a fix involving gene aliases. Genes were loaded on hasta stage using this branch.
Creative idea! 🧠
In the spirit of summer, I'm approving, fairly safely assuming you are right that all Ensembl genes have coords. That seems very reasonable, though RefSeq genes would not necessarily... 😁 Even if that is correct, I still have an issue with the function name set_gene_coordinates: it really only handles a select few stragglers, and one may easily get the idea that the more exact Ensembl coords are overwritten.
scout/load/hgnc_gene.py
    with progressbar(genes.values(), label="Building genes", length=nr_genes) as bar:
        for gene_data in bar:
            set_gene_coordinates(gene_data=gene_data, cytoband_coords=cytoband_coords)
Hm, if I read this correctly you are always overwriting the more exact gene coords with the cytoband coords? That doesn't seem quite right! 😊 Perhaps add a check for existing coords, or set them only if chromosome, start and stop don't already exist? You do check for the Ensembl gene ID, which may or may not be enough there. Maybe just rename the function slightly then?
scout/load/hgnc_gene.py
@@ -16,6 +17,22 @@
LOG = logging.getLogger(__name__)


def set_gene_coordinates(gene_data: dict, cytoband_coords: Dict[str, dict]):
As with the comment for the call, you are afaik right about the other coord source being Ensembl, but it would feel more straightforward to check for the existence of good values on chromosome, start and stop? Regardless, perhaps change the function name to e.g. set_missing_gene_coordinates, set_empty_gene_coords_from_cytoband_location or such.
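One way the suggestion above could look, as a minimal sketch and not the code in this PR: the function fills in coordinates only when they are absent, so more exact values are never overwritten. The key names ("location", "chromosome", "start", "end", "stop") and the shape of cytoband_coords are assumptions made for illustration, and the numbers in the usage example are placeholders.

```python
from typing import Dict, Optional


def set_missing_gene_coordinates(gene_data: dict, cytoband_coords: Dict[str, dict]) -> bool:
    """Fill chromosome/start/end from the gene's cytoband only when they are absent.

    Returns True if coordinates were filled in, False if the gene was left untouched.
    """
    # Keep any coordinates that are already present (e.g. the more exact Ensembl ones).
    if all(gene_data.get(key) not in (None, "") for key in ("chromosome", "start", "end")):
        return False

    # "location" (e.g. "14q32.33") is an assumed key holding the gene's cytoband.
    coords: Optional[dict] = cytoband_coords.get(gene_data.get("location"))
    if coords is None:
        return False  # unknown cytoband: nothing to fall back on

    gene_data["chromosome"] = coords["chromosome"]
    gene_data["start"] = coords["start"]
    gene_data["end"] = coords["stop"]
    return True


# Usage with placeholder coordinates (not real genomic positions).
cytoband_coords = {"14q32.33": {"chromosome": "14", "start": 100, "stop": 200}}
straggler = {"hgnc_symbol": "IGH", "location": "14q32.33"}
set_missing_gene_coordinates(straggler, cytoband_coords)
```

Checking the coordinates themselves, instead of the presence of an Ensembl gene ID, keeps the guard valid even if another coordinate source is added later.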
That's how it has worked so far. Also, we now have a pydantic check before saving genes in the database, so the gene loading would fail if coordinates were missing.
    with progressbar(genes.values(), label="Building genes", length=nr_genes) as bar:
        for gene_data in bar:
            set_gene_coordinates(gene_data=gene_data, cytoband_coords=cytoband_coords)

            if not gene_data.get("chromosome"):
Note that there is also this other check to make sure that coords will not be missing
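For illustration, a guard of this kind could be sketched as below. The diff only shows the `if not gene_data.get("chromosome"):` line, so the body (log and skip) and the helper name genes_with_coordinates are assumptions, not the actual Scout code.

```python
import logging
from typing import Dict, List

LOG = logging.getLogger(__name__)


def genes_with_coordinates(genes: Dict[str, dict]) -> List[dict]:
    """Return only the genes that have a chromosome set; skip and log the rest."""
    kept = []
    for gene_data in genes.values():
        # Assumed behaviour: a gene that still has no chromosome is skipped.
        if not gene_data.get("chromosome"):
            LOG.debug("Skipping gene %s: no coordinates found", gene_data.get("hgnc_symbol", "?"))
            continue
        kept.append(gene_data)
    return kept
```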
This PR adds a functionality or fixes a bug.
Testing on cg-vm1 server (Clinical Genomics Stockholm)
Prepare for testing
scout-stage and the server is cg-vm1.
ssh <USER.NAME>@cg-vm1.scilifelab.se
sudo -iu hiseq.clinical
ssh localhost
podman ps
systemctl --user stop scout.target
systemctl --user start scout@<this_branch>
systemctl --user status scout.target
(scout-stage) to be used for testing by other users.
Testing on hasta server (Clinical Genomics Stockholm)
Prepare for testing
ssh <USER.NAME>@hasta.scilifelab.se
us; paxa -u <user> -s hasta -r scout-stage. You can also use the WSGI Pax app available at https://pax.scilifelab.se/.
conda activate S_scout; pip freeze | grep scout-browser
bash /home/proj/production/servers/resources/hasta.scilifelab.se/update-tool-stage.sh -e S_scout -t scout -b <this_branch>
us; scout --version
paxa procedure, which will release the allocated resource (scout-stage) to be used for testing by other users.
How to test:
Expected outcome:
The functionality should be working
Take a screenshot and attach or copy/paste the output.
Review: