Filter generic gene-related condition terms from manual curation and evidence string generation #435

Open
apriltuesday opened this issue Jul 1, 2024 · 2 comments

Comments

@apriltuesday
Contributor

Recent submissions to ClinVar have included a large number of gene-related conditions as trait names (e.g. TTN-related condition). These are less informative for our purposes, unlikely to map to good EFO terms, and time-consuming for curators to sift through. As 99% of targets associated with these terms are already covered by other ClinVar records, we've decided to filter them out as uninformative.

Tasks:

  • Evaluate the scale of false positives from filtering on a suitable regex (e.g. if we remove anything matching [0-9a-zA-Z]+-related .*, are we removing informative trait names or large numbers of records from other submissions?)
  • If impact is deemed acceptable, implement the filter for both curation and evidence generation pipelines
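
To make the first task concrete, here is a minimal sketch of how the broad pattern could be tried against a list of trait names. The helper name `split_by_filter`, the choice to require a whole-name match, and the sample names are all assumptions for illustration; none of them come from the pipeline code or from ClinVar output.

```python
import re

# Broad pattern quoted in the task list above.
BROAD = re.compile(r"[0-9a-zA-Z]+-related .*")


def split_by_filter(trait_names, pattern=BROAD):
    """Split names into (removed, kept).

    Assumption: a name is removed only when the whole name matches the pattern.
    """
    removed, kept = [], []
    for name in trait_names:
        if pattern.fullmatch(name):
            removed.append(name)
        else:
            kept.append(name)
    return removed, kept


# Hypothetical sample names, chosen only to illustrate the behaviour.
sample = [
    "TTN-related condition",
    "CBL-related disorder",
    "Age-related macular degeneration",
    "Marfan syndrome",
]
removed, kept = split_by_filter(sample)
```

With this pattern, a name such as "Age-related macular degeneration" is removed alongside the gene-related terms, which is the kind of false positive the evaluation would need to quantify.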
@apriltuesday
Contributor Author

I'm not sure this requires a notebook to review, so here's what I've done.

First I confirmed that for the large recent submission, gene-related condition terms have been replaced with gene-related disorder terms. There's still a question about whether we should do anything about the "condition" terms, but for now I used the following regex: ^\S+-related disorder$

In the most recent ClinVar release, there are 122,432 records with preferred trait name matching this pattern but only 5,788 unique trait names. I think this makes sense given how broad the trait is (i.e. associated with many variants).
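
As a rough sketch of how the two counts above could be produced, assuming the preferred trait names have already been extracted into a plain list (the function name `summarise_matches` and the sample data are hypothetical):

```python
import re
from collections import Counter

# Regex quoted in the comment above.
GENE_RELATED_DISORDER = re.compile(r"^\S+-related disorder$")


def summarise_matches(preferred_trait_names):
    """Return (matching record count, unique matching trait name count)."""
    counts = Counter(
        name for name in preferred_trait_names
        if GENE_RELATED_DISORDER.match(name)
    )
    return sum(counts.values()), len(counts)


# Hypothetical input: one preferred trait name per record.
names = [
    "TTN-related disorder",
    "TTN-related disorder",
    "CBL-related disorder",
    "TTN-related condition",  # "condition" terms are not matched by this regex
    "Marfan syndrome",
]
n_records, n_unique = summarise_matches(names)
```

A large gap between the two numbers is expected whenever a single gene-level trait name is attached to many variant records.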

Of these trait names, only 1.4% have a MedGen ID within ClinVar, and only 0.1% have an exact EFO match. Here are all the EFO terms:

| Trait name | EFO term |
| --- | --- |
| CBL-related disorder | http://purl.obolibrary.org/obo/MONDO_0013308 |
| CLCN4-related disorder | http://www.ebi.ac.uk/efo/EFO_0009066 |
| ATP6AP2-related disorder | http://purl.obolibrary.org/obo/MONDO_0100146 |
| STAG1-related disorder | http://www.ebi.ac.uk/efo/EFO_0009078 |
| COL4A1-related disorder | http://purl.obolibrary.org/obo/MONDO_0800461 |
| DKC1-related disorder | http://purl.obolibrary.org/obo/MONDO_0100152 |
| CTSC-related disorder | http://purl.obolibrary.org/obo/MONDO_0800465 |

Some of these may be of debatable utility in EFO, but several do look legitimate, so I'm not sure about the decision to exclude this pattern entirely.

@tcezard Any thoughts? I was thinking of checking whether the variants involved in these records are associated with more specific traits as well (OT reports 99% of gene targets are covered by other evidence, but I don't think we know about variants). Is it worth doing this or should we be thinking of other strategies?

@apriltuesday
Contributor Author

In case it's useful, I went ahead and checked whether variants in these "gene-related disorder" records are associated with other traits. For simplicity, I identified variants by VCV, which is ClinVar's variant identifier; this might not be 1:1 with chr_pos_ref_alt but it shouldn't matter too much for these counts.

  • Total variants associated with gene-related disorder traits: 122,355
    • Number of these associated with multiple traits: 49,241 (40.2%)
    • Number of these associated with a single trait: 73,114 (59.8%)
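
A minimal sketch of the kind of grouping behind these counts, assuming records are available as (VCV, trait name) pairs and that "multiple traits" means more than one distinct trait name per VCV. The function name and the VCV values below are made up for illustration.

```python
import re
from collections import defaultdict

GENE_RELATED_DISORDER = re.compile(r"^\S+-related disorder$")


def count_single_vs_multiple(records):
    """records: iterable of (vcv, trait_name) pairs.

    Returns (total, multiple, single) over VCVs that have at least one
    gene-related disorder trait.
    """
    traits_by_vcv = defaultdict(set)
    for vcv, trait in records:
        traits_by_vcv[vcv].add(trait)

    # Keep only variants linked to at least one matching trait name.
    selected = [
        traits for traits in traits_by_vcv.values()
        if any(GENE_RELATED_DISORDER.match(t) for t in traits)
    ]
    multiple = sum(1 for traits in selected if len(traits) > 1)
    single = len(selected) - multiple
    return len(selected), multiple, single


# Hypothetical records; the identifiers are placeholders, not real accessions.
records = [
    ("VCV_A", "CBL-related disorder"),
    ("VCV_A", "Rasopathy"),
    ("VCV_B", "TTN-related disorder"),
    ("VCV_C", "STAG1-related disorder"),
    ("VCV_D", "Marfan syndrome"),  # no gene-related disorder trait, so ignored
]
total, multiple, single = count_single_vs_multiple(records)
```

Note that this only counts whether other traits exist for a variant; it says nothing about whether those traits are more specific than the gene-related disorder term.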

So while target genes might be overwhelmingly covered by other evidence, this is certainly not true for variants. Furthermore, when variants are associated with multiple traits, the other traits won't always be more specific than the gene-related disorder trait. Some examples can be found in this spreadsheet, which is filtered to include only VCVs associated with one of the EFO-mapped traits listed above, both to keep the spreadsheet size manageable and to make it possible to look at terms within the EFO hierarchy (e.g. CBL-related disorder vs. rasopathy).
