Some NCBIGene records should have biolink category "GenomicEntity" #1220

Closed
saramsey opened this issue Feb 3, 2021 · 11 comments

@saramsey
Member

saramsey commented Feb 3, 2021

Example: NCBIGene:780896.

[Image: Screen Shot 2021-02-02 at 4.44.00 PM]

Note that under the "General Gene Information" section, the record has the property "phenotype only". In that case, I think it makes sense to give it the biolink category biolink:GenomicEntity, which is our closest Biolink approximation for "locus". I think this would require a fix in ncbigene_to_tsv_to_kg_json.py.

See issue #1213 which first brought this issue to my attention.

@saramsey
Member Author

saramsey commented Feb 3, 2021

In addition to correctly setting the biolink category, we should prepend "Genetic locus for " to the name and full_name fields for these types of NCBIGene records.
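
For illustration, here is a minimal sketch of the two changes proposed above (recategorizing and renaming). It is not the actual code in ncbigene_to_tsv_to_kg_json.py: the function name `relabel_locus_record`, the `phenotype_only` flag, and the dict layout are assumptions made for this example, and the original full name "Acetabular dysplasia" is only used as sample input.

```python
# Hypothetical sketch; the real logic in ncbigene_to_tsv_to_kg_json.py may differ.
GENE_CATEGORY = "biolink:Gene"
LOCUS_CATEGORY = "biolink:GenomicEntity"
LOCUS_PREFIX = "Genetic locus for "  # prefix text proposed in the comment above


def relabel_locus_record(node: dict, phenotype_only: bool,
                         prefix: str = LOCUS_PREFIX) -> dict:
    """Return a copy of node, recategorized and renamed if it is phenotype-only."""
    updated = dict(node)
    if not phenotype_only:
        # Ordinary gene records keep their names and default to the gene category.
        updated.setdefault("category", GENE_CATEGORY)
        return updated
    updated["category"] = LOCUS_CATEGORY
    for field in ("name", "full_name"):
        value = updated.get(field)
        # Skip missing values and avoid double-prefixing on repeated runs.
        if value and not value.startswith(prefix):
            updated[field] = prefix + value
    return updated


example = relabel_locus_record(
    {"id": "NCBIGene:780896", "name": "ACTD", "full_name": "Acetabular dysplasia"},
    phenotype_only=True,
)
```

The prefix is a parameter so that the exact wording can be changed in one place.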

saramsey added a commit that referenced this issue Feb 3, 2021
kvarforl added the sar-look label (Marks an issue that Steve needs to examine) Mar 6, 2021
@kvarforl
Contributor

kvarforl commented Mar 6, 2021

Looks like this change (for the most part) made it into kg2.5.2.
There are now 1343 nodes in kg2 from NCBIGene that have the category genomic entity and contain the string "locus" in their full name:

match (n) where n.id starts with "NCBIGene" and n.category_label = "genomic_entity" and lower(n.full_name) contains "locus" return count(*)

However, @saramsey, the prepending of "Genetic locus for" made it into full_name but not name. Good enough, or are additional tweaks required?

match (n{id:"NCBIGene:780896"}) return n.name, n.full_name, n.category_label
n.name | n.full_name | n.category_label
"ACTD" | "Genetic locus associated with Acetabular dysplasia" | "genomic_entity"

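The same check can be done on a node outside of Neo4j. Below is a rough sketch that reports which naming fields lack the prefix; the helper name `fields_missing_prefix` is hypothetical, and the node dict just mirrors the query result above.

```python
# Hypothetical helper; mirrors the manual check done with the Cypher query above.
LOCUS_PREFIX = "Genetic locus associated with "


def fields_missing_prefix(node: dict, prefix: str = LOCUS_PREFIX) -> list:
    """List the naming fields of a locus node that do not start with prefix."""
    return [
        field
        for field in ("name", "full_name")
        if not str(node.get(field) or "").startswith(prefix)
    ]


node = {
    "name": "ACTD",
    "full_name": "Genetic locus associated with Acetabular dysplasia",
    "category_label": "genomic_entity",
}
missing = fields_missing_prefix(node)
```

For the kg2.5.2 node shown above, only the name field would be reported.
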
@kvarforl
Contributor

@dkoslicki @edeutsch can you take a look at the above comment and let me know if you would like additional adjustments?

@edeutsch
Collaborator

Thanks! I think we wanted the prefix "Genetic locus associated with " on the name field as well. The issue is that the NodeSynonymizer uses the name field to associate things, and was conflating the disease with the locus because they have the same name. The NS could still compensate for this, but fixing it at the KG level seemed to make the most sense?
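
To make the conflation concrete, here is a toy sketch of name-based grouping. It is not the NodeSynonymizer's actual algorithm; the function `group_ids_by_name`, the node IDs, and the name "Example dysplasia" are all placeholders invented for this illustration.

```python
from collections import defaultdict

# Toy illustration only; the real NodeSynonymizer logic is not shown in this thread.
LOCUS_PREFIX = "Genetic locus associated with "


def group_ids_by_name(nodes):
    """Group node IDs by case-insensitive name, as a name-based matcher might."""
    groups = defaultdict(list)
    for node in nodes:
        groups[node["name"].strip().lower()].append(node["id"])
    return dict(groups)


# Placeholder nodes: a disease and a locus that share the same name.
disease = {"id": "EXAMPLE:disease", "name": "Example dysplasia"}
locus = {"id": "EXAMPLE:locus", "name": "Example dysplasia"}

# With identical names, both nodes land in the same group.
conflated = group_ids_by_name([disease, locus])

# Prefixing the locus name at the KG level keeps the two apart.
renamed_locus = {**locus, "name": LOCUS_PREFIX + locus["name"]}
separated = group_ids_by_name([disease, renamed_locus])
```

With identical names the two nodes fall into one group; once the locus name carries the prefix, they fall into two.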

@kvarforl
Contributor

Okay, I've added a fix for this! It's in aacf9c2; I mistyped the issue number in my commit message, hence the lack of traditional linkage :)

This should appear in the next kg2 build.

@saramsey
Member Author

Great, looks like there is nothing for me to do here, as @edeutsch has communicated what specifically is needed (i.e., on the name field and not just on the full_name field).

saramsey removed the sar-look label (Marks an issue that Steve needs to examine) Mar 12, 2021
@ecwood
Collaborator

ecwood commented Mar 16, 2021

Also, I think some NCBIGene records should have the biolink category "microRNA":

Ex.

{
  "iri": "https://identifiers.org/ncbigene:100126330",
  "synonym": [
    "MIR941-4",
    "MIRN941-4",
    "hsa-mir-941-4"
  ],
  "category_label": "gene",
  "full_name": "microRNA 941-4",
  "deprecated": "False",
  "name": "MIR941-4",
  "description": "Type:ncRNA; Locus:20q13.33; NameStatus:official",
  "provided_by": "identifiers_org_registry:ncbigene",
  "id": "NCBIGene:100126330",
  "category": "biolink:Gene",
  "update_date": "20210302"
}

This is labeled as a gene, but it is a microRNA (its xref is mirbase:MI0005766, from the ETL of miRBase, #1247). Is this a problem?

@ecwood
Collaborator

ecwood commented Mar 16, 2021

The output of the Cypher query:
match (n {provided_by: 'identifiers_org_registry:ncbigene'}) where lower(n.full_name) starts with "microrna" and lower(n.category_label)="gene" return count(n)

is 1915.
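
For anyone working from the KG2 JSON rather than Neo4j, the same filter can be sketched in Python. The predicate name `is_mislabeled_microrna` and the two-node sample list are assumptions for illustration; the sample is not real KG2 output and will not reproduce the count above.

```python
# Hypothetical helper mirroring the three conditions of the Cypher query above.
def is_mislabeled_microrna(node: dict) -> bool:
    """True for NCBIGene nodes named like a microRNA but still labeled as a gene."""
    return (
        node.get("provided_by") == "identifiers_org_registry:ncbigene"
        and (node.get("full_name") or "").lower().startswith("microrna")
        and (node.get("category_label") or "").lower() == "gene"
    )


# Small sample: the MIR941-4 record from this thread plus a non-matching node.
nodes = [
    {
        "id": "NCBIGene:100126330",
        "full_name": "microRNA 941-4",
        "category_label": "gene",
        "provided_by": "identifiers_org_registry:ncbigene",
    },
    {
        "id": "NCBIGene:780896",
        "full_name": "Genetic locus associated with Acetabular dysplasia",
        "category_label": "genomic_entity",
        "provided_by": "identifiers_org_registry:ncbigene",
    },
]
count = sum(1 for node in nodes if is_mislabeled_microrna(node))
```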

@edeutsch
Collaborator

I think it hinders our reasoning on KG2, yes, so a fix would be appreciated!

@ecwood
Collaborator

ecwood commented Mar 17, 2021

The microRNA issue is addressed in 3378f83 (forgot to tie that commit to this issue)

@kvarforl
Contributor

Okay, looks like the naming portion of this is fixed in kg2.6.0:

n.name | n.full_name | n.category_label
"Genetic locus associated with ACTD" | "Genetic locus associated with Acetabular dysplasia" | "genomic_entity"

Closing it out, as I think 1379 covers the microRNA stuff :)
