Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #4469 - parse TRGT STR VCF #4566

Open
wants to merge 33 commits into
base: main
Choose a base branch
from
Open

Fix #4469 - parse TRGT STR VCF #4566

wants to merge 33 commits into from

Conversation

dnil
Copy link
Collaborator

@dnil dnil commented Apr 16, 2024

This PR adds a functionality or fixes a bug.
OR
This PR marks a new Scout release. We apply semantic versioning. This is a major/minor/patch release for reasons.

Testing on cg-vm1 server (Clinical Genomics Stockholm)

Prepare for testing

  1. Make sure the PR is pushed and available on Docker Hub
  2. Fist book your testing time using the Pax software available at https://pax.scilifelab.se/. The resource you are going to call dibs on is scout-stage and the server is cg-vm1.
  3. ssh <USER.NAME>@cg-vm1.scilifelab.se
  4. sudo -iu hiseq.clinical
  5. ssh localhost
  6. (optional) Find out which scout branch is currently deployed on cg-vm1: podman ps
  7. Stop the service with current deployed branch: systemctl --user stop scout.target
  8. Start the scout service with the branch to test: systemctl --user start scout@<this_branch>
  9. Make sure the branch is deployed: systemctl --user status scout.target
  10. After testing is done, repeat procedure at https://pax.scilifelab.se/, which will release the allocated resource (scout-stage) to be used for testing by other users.
Testing on hasta server (Clinical Genomics Stockholm)

Prepare for testing

  1. ssh <USER.NAME>@hasta.scilifelab.se
  2. Book your testing time using the Pax software. us; paxa -u <user> -s hasta -r scout-stage. You can also use the WSGI Pax app available at https://pax.scilifelab.se/.
  3. (optional) Find out which scout branch is currently deployed on cg-vm1: conda activate S_scout; pip freeze | grep scout-browser
  4. Deploy the branch to test: bash /home/proj/production/servers/resources/hasta.scilifelab.se/update-tool-stage.sh -e S_scout -t scout -b <this_branch>
  5. Make sure the branch is deployed: us; scout --version
  6. After testing is done, repeat the paxa procedure, which will release the allocated resource (scout-stage) to be used for testing by other users.

How to test:

  1. how to test it, possibly with real cases/data

Expected outcome:
The functionality should be working
Take a screenshot and attach or copy/paste the output.

Review:

  • code approved by
  • tests executed by

Copy link

codecov bot commented Apr 16, 2024

Codecov Report

Attention: Patch coverage is 63.88889% with 26 lines in your changes are missing coverage. Please review.

Project coverage is 84.53%. Comparing base (33dd4b5) to head (db9e50f).

Files Patch % Lines
scout/parse/variant/genotype.py 53.33% 21 Missing ⚠️
scout/server/blueprints/variants/controllers.py 72.22% 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4566      +/-   ##
==========================================
- Coverage   84.61%   84.53%   -0.09%     
==========================================
  Files         310      310              
  Lines       18679    18744      +65     
==========================================
+ Hits        15805    15845      +40     
- Misses       2874     2899      +25     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@dnil
Copy link
Collaborator Author

dnil commented May 2, 2024

Ok, actually working! There are a couple of small things missing in Stranger now (Clinical-Genomics/stranger#58), but once they are in place we can release it, and make sure we are parsing the release version ok with this PR.

Screenshot 2024-05-02 at 16 30 49

@northwestwitch
Copy link
Member

Would be nice to add this one to the new release. Who's with the missing things in STranger? Otherwise in the next release?

@dnil
Copy link
Collaborator Author

dnil commented May 6, 2024

Would be nice to add this one to the new release. Who's with the missing things in STranger? Otherwise in the next release?

Well, feel free to review: it would be nice with some input. In my mind right now the further additions would be in STRanger and possibly the reference files, but I conservatively kept this on hold since having things like the REF count visible has been useful in the past. Not that it’s strictly needed.

@dnil dnil marked this pull request as ready for review May 6, 2024 06:01
Copy link
Member

@northwestwitch northwestwitch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Fine to merge since it works I think. I have a few minor suggestions

@@ -11,9 +11,11 @@ About changelog [here](https://keepachangelog.com/en/1.0.0/)
- STR variant information card with database links, replacing empty frequency panel
- Display paging and number of HPO terms available in the database on Phenotypes page
- On case page, typeahead hints when searching for a disease using substrings containing source ("OMIM:", "ORPHA:")
t
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
t

- Button to monitor the status of submissions on ClinVar Submissions page
- Option to filter cancer variants by number of observations in somatic and germline archived database
- Documentation for integrating chanjo2
- Parse TRGT STR VCF
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- Parse TRGT STR VCF
- Parse Tandem repeat genotyping (TRGT) tags from STR VCFs

@@ -199,7 +199,12 @@ def build_variant(
variant_obj["str_pathologic_min"] = variant.get("str_pathologic_min")
variant_obj["str_ref"] = variant.get("str_ref")
variant_obj["str_repid"] = variant.get("str_repid")
variant_obj["str_trid"] = variant.get("str_trid")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that we have long lists of key/values in this build_variant function, but without making huge changes to it, what about having all these strs keys/values into a constant and then call a specific function (outside this one) to assign these values in a loop? It would be less code and more readable

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this kind of transformation should be done using a class, why not a Pydantic one since we have started using them. I would prefer to do that as a separate PR, knowing that we tend to introduce some issues with empty and missing values when we convert to Pydantic if it's ok with you.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok!

@@ -103,6 +103,14 @@ def parse_genotype(variant, ind, pos):
(flanking_ref, flanking_alt) = _parse_format_entry(variant, pos, "ADFL")
(inrepeat_ref, inrepeat_alt) = _parse_format_entry(variant, pos, "ADIR")

# TRGT long read STR specific
(mc_ref, mc_alt) = _parse_format_entry_trgt_mc(variant, pos)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
(mc_ref, mc_alt) = _parse_format_entry_trgt_mc(variant, pos)
(_, mc_alt) = _parse_format_entry_trgt_mc(variant, pos)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like it's not used downstream?

@@ -395,14 +403,15 @@ def get_str_so(variant, pos):
return str_so


def _parse_format_entry(variant, pos, format_entry_name):
def _parse_format_entry(variant, pos, format_entry_name, number_format=int):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def _parse_format_entry(variant, pos, format_entry_name, number_format=int):
def _parse_format_entry(variant: cyvcf2.Variant, pos: int, format_entry_name: str, number_format:Optional[Union[float. int]]=int) -> Tuple(Union[float, int]):

values = list(value.split("/"))
values = variant.format(format_entry_name)[pos]

new_values = []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise this could also work

values = re.split("/|,", values)


ref_value = None
alt_value = None

if len(values) > 1:
ref_value = int(values[0])
alt_value = int(values[1])
ref_value = (number_format)(values[0])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏻

if ref_value >= 0:
ref = ref_value
if alt_value >= 0:
alt = alt_value
except (ValueError, TypeError) as _ignore_error:
pass
return (ref, alt)


def _parse_format_entry_trgt_mc(variant, pos):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add type hints and return type instead of having the long docstring. Instead the docstring couls contain a better explanation of what MC is

Comment on lines +463 to +465
pathologic_struc = variant.INFO.get("PathologicStruc", None)

pathologic_counts = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pathologic_struc = variant.INFO.get("PathologicStruc", None)
pathologic_counts = 0
pathologic_struc = variant.INFO.get("PathologicStruc", None)
pathologic_counts = 0

These 2 could be moved down, after line 468 perhaps?

Copy link

sonarcloud bot commented May 7, 2024

Quality Gate Passed Quality Gate passed

Issues
2 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants