Implement import of trait names from ClinVar #14

Closed
1 task
tskir opened this issue Jun 4, 2020 · 8 comments · Fixed by #39
Labels
Priority: High (Should be prioritised over other issues); Scope: Backend (Backend logic & data processing scripts)

Comments

@tskir
Member

tskir commented Jun 4, 2020

Information about traits can be retrieved from a static, periodically updated endpoint https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz. There is an accompanying checksum file https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz.md5.
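A minimal sketch of downloading the file and checking it against the checksum file. It assumes the `.md5` file uses the usual `md5sum` layout (hex digest first, then the file name), which I have not verified for this endpoint; the function names are hypothetical.

```python
import hashlib
import urllib.request

DATA_URL = "https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz"
MD5_URL = DATA_URL + ".md5"


def md5_matches(data: bytes, md5_file_text: str) -> bool:
    """Compare the MD5 of `data` with the first token of the checksum file.

    Assumption: the checksum file starts with the hex digest (md5sum layout).
    """
    tokens = md5_file_text.split()
    if not tokens:
        return False
    return hashlib.md5(data).hexdigest() == tokens[0].lower()


def download_verified() -> bytes:
    """Download the gzipped TSV and raise if the checksum does not match."""
    with urllib.request.urlopen(DATA_URL) as response:
        data = response.read()
    with urllib.request.urlopen(MD5_URL) as response:
        md5_text = response.read().decode("ascii")
    if not md5_matches(data, md5_text):
        raise ValueError("MD5 mismatch for variant_summary.txt.gz")
    return data
```

Keeping the comparison in its own function means it can be tested without network access.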

The main file is a gzipped TSV containing about two dozen columns (first line, starting with #, is a header). There are three columns of interest:

  • PhenotypeList column contains trait names. There can be one or multiple per record. If there are multiple, they are separated with semicolons, for example Inborn genetic diseases;Mitochondrial complex 1 deficiency, nuclear type 21;Mitochondrial complex I deficiency;not provided
  • RCVaccession contains IDs of ClinVar RCV records. An RCV record, in general, associates multiple genetic variants with multiple trait names. Similarly, there can be one or multiple, for example RCV000622708;RCV000735415;RCV000000017;RCV000196589.
  • AlleleID contains the identifier of a genetic variant. Example: 15046.

These columns follow this list of constraints:

  • The number of semicolon-separated values in PhenotypeList and RCVaccession is always the same for a given record and is always 1 or greater. The values with the same index form pairs.
  • AlleleID always contains an integer. However, it is not a key for the table, and several rows can have the same AlleleID (see below).
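These constraints could be checked per row before anything is imported. The sketch below assumes each row has already been read into a dict keyed by column name; `parse_row` is a hypothetical helper, and rejecting empty values is my reading of "1 or greater".

```python
from typing import Dict, List, Tuple


def parse_row(row: Dict[str, str]) -> Tuple[int, List[str], List[str]]:
    """Split one record and check the constraints described above."""
    allele_id = int(row["AlleleID"])  # raises ValueError if not an integer
    rcvs = row["RCVaccession"].split(";")
    traits = row["PhenotypeList"].split(";")
    if len(rcvs) != len(traits):
        raise ValueError(
            f"AlleleID {allele_id}: {len(rcvs)} RCV accessions "
            f"but {len(traits)} trait names"
        )
    # Assumption: an empty value means the "1 or greater" constraint is broken.
    if not all(rcvs) or not all(traits):
        raise ValueError(f"AlleleID {allele_id}: empty RCV accession or trait name")
    return allele_id, rcvs, traits
```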

Trait names imported into the database should represent a unique set of all trait names present in this file.

To calculate the number of records assigned to each trait, we need to consider all tuples of (AlleleID, RCVaccession, PhenotypeList). For example, exploding along RCVaccession and PhenotypeList, the sample record described above would generate four tuples:

  • (15046, RCV000622708, Inborn genetic diseases)
  • (15046, RCV000735415, Mitochondrial complex 1 deficiency, nuclear type 21)
  • (15046, RCV000000017, Mitochondrial complex I deficiency)
  • (15046, RCV000196589, not provided)
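The explosion step for a single record can be sketched as follows. The function name is hypothetical, and the row is assumed to be a dict keyed by column name.

```python
from typing import Dict, List, Tuple

RecordTuple = Tuple[int, str, str]  # (AlleleID, RCV accession, trait name)


def explode_row(row: Dict[str, str]) -> List[RecordTuple]:
    """Pair RCV accessions and trait names by index for one record."""
    allele_id = int(row["AlleleID"])
    rcvs = row["RCVaccession"].split(";")
    traits = row["PhenotypeList"].split(";")
    # Values with the same index form a pair, so zip() lines them up.
    return [(allele_id, rcv, trait) for rcv, trait in zip(rcvs, traits)]
```

Note that the split is on semicolons only, so a trait name containing a comma (like the second one in the sample) stays intact.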

When we do this for all rows in the file, we need to combine all tuples together and deduplicate them (leave only unique tuples). In this final set, the number of tuples which mention a given trait name is the number of records associated with this trait, which should also be imported into the database and stored in a separate field.
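A sketch of the combine, deduplicate and count step, assuming rows arrive as dicts keyed by column name (`count_records_per_trait` is a hypothetical name). The keys of the returned counter are also the unique set of trait names to import.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple


def count_records_per_trait(rows: Iterable[Dict[str, str]]) -> Counter:
    """Count unique (AlleleID, RCV, trait) tuples per trait name."""
    unique: Set[Tuple[int, str, str]] = set()
    for row in rows:
        allele_id = int(row["AlleleID"])
        rcvs = row["RCVaccession"].split(";")
        traits = row["PhenotypeList"].split(";")
        # The set drops tuples repeated across rows (deduplication).
        unique.update((allele_id, rcv, trait) for rcv, trait in zip(rcvs, traits))
    return Counter(trait for _, _, trait in unique)
```

Deduplicating across the whole file, rather than per row, is what keeps repeated rows for the same AlleleID from being counted twice.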

Non-unique tuples arise because of different reference genomes. Sometimes there will be two records with the same AlleleID and with most columns containing identical or very similar values; the only difference will be in the reference genome (GRCh37/GRCh38) and chromosomal coordinates. This is why we deduplicate: to count the number of records correctly.

Some considerations about the import:

  • Columns in the file can change order, so they should be extracted based on their names, not fixed position indexes.
  • Import should be performed periodically and automatically; and there should also be a way to trigger import manually (perhaps through a button or a separate page?)
  • Failure to import must generate some sort of a logging message which must go somewhere (we'll need to think through the logging mechanism, perhaps in a separate issue)
  • Care should be taken so that database modifications caused by an import don't conflict with modifications made by a user.
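For the first consideration, one way to extract columns by name is sketched below. It assumes the header line is the first line of the file and that stripping the leading `#` yields the column names; `read_columns` is a hypothetical helper that works on the gzipped bytes in memory.

```python
import csv
import gzip
import io
from typing import Dict, Iterator, Sequence

WANTED = ("AlleleID", "RCVaccession", "PhenotypeList")


def read_columns(gz_bytes: bytes, wanted: Sequence[str] = WANTED) -> Iterator[Dict[str, str]]:
    """Yield the wanted columns of each row, looked up by header name."""
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8", newline="") as handle:
        # The header is the first line and starts with "#".
        header = handle.readline().rstrip("\r\n").lstrip("#").split("\t")
        missing = [name for name in wanted if name not in header]
        if missing:
            raise KeyError(f"columns missing from header: {missing}")
        reader = csv.DictReader(
            handle, fieldnames=header, delimiter="\t", quoting=csv.QUOTE_NONE
        )
        for row in reader:
            yield {name: row[name] for name in wanted}
```

Because lookups go through the header, a change in column order does not affect the result, and a renamed or removed column fails loudly instead of silently reading the wrong field.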

Action checklist

@tskir tskir added the Priority: High Should be prioritised over other issues label Jun 4, 2020
@tskir
Member Author

tskir commented Jun 6, 2020

To clarify the scope here: as we discussed yesterday, this issue is specifically about backend logic to do the import. Import automation & manual triggering will be implemented separately.

@tskir tskir added the Scope: Backend Backend logic & data processing scripts label Jun 6, 2020
@joj0s joj0s added this to To do in Project Progress Jun 7, 2020
@joj0s joj0s moved this from To do to In progress in Project Progress Jun 8, 2020
@joj0s
Collaborator

joj0s commented Jun 9, 2020

So just to clear up the process of selecting trait names and calculating the number of source records:

  • Inserting trait names: I select every single unique trait name that appears in the PhenotypeList column
  • Calculating source record number: For each AlleleID, I calculate every unique possible tuple with RCVaccession and PhenotypeList. The number of times each trait name appears in those tuples is its source record number.

Also, we need to see what the behavior will be for already existing trait names whenever a new import cycle begins. Do we calculate the source records again for already existing trait names, and just insert the new ones as usual?

@joj0s
Collaborator

joj0s commented Jun 11, 2020

Another thing: I am excluding "not provided" values for both trait name imports and source record number calculation. Should I treat those records differently?

@tskir
Member Author

tskir commented Jun 15, 2020

> Inserting trait names: I select every single unique trait name that appears in the PhenotypeList column

That's correct, provided that the PhenotypeList column values are exploded first, i.e. split on the ; character.

> Calculating source record number: For each AlleleID, I calculate every unique possible tuple with RCVaccession and PhenotypeList. The number of times each trait name appears in those tuples is its source record number.

That's correct. Just to clarify, after calculating tuples for each AlleleID, it's also important to combine all of them together and then do the deduplication (remember, AlleleID is not unique per row).

To expand a bit on why we do this: the central object which is in the end submitted to Open Targets is an "evidence string". Each evidence string is defined by a tuple (trait, variant, ClinVar record). So the number of evidence strings generated by any given trait will depend on the total number of (variant, ClinVar record) tuples associated with it. This corresponds to (AlleleID, RCV) tuples in the source data.

> Another thing: I am excluding "not provided" values for both trait name imports and source record number calculation. Should I treat those records differently?

Yeah, that's a good question. ClinVar has a number of trait "names" which cannot be mapped to any ontology term. Most notably "not provided", but there are also things like "see cases", "other" and a couple of others. In this ticket you don't need to address those situations in any special way. In future, we will need a possibility for curators to mark the trait as "invalid", probably necessitating an additional status. This is not a high priority issue. I've added it to backlog: #37

@joj0s
Collaborator

joj0s commented Jun 15, 2020

> Also, we need to see what the behavior will be for already existing trait names whenever a new import cycle begins. Do we calculate the source records again for already existing trait names, and just insert the new ones as usual?

Regarding this, what I am doing right now is that if a record already exists in the database, I leave it as is and only insert new ones with their source record numbers. Let me know if I should change this.

Other than that, the script is ready. Should I add a button somewhere to trigger the import and submit it in a PR?

@tskir
Member Author

tskir commented Jun 15, 2020

> Also, we need to see what the behavior will be for already existing trait names whenever a new import cycle begins. Do we calculate the source records again for already existing trait names, and just insert the new ones as usual?

> Regarding this, what I am doing right now is that if a record already exists in the database, I leave it as is and only insert new ones with their source record numbers. Let me know if I should change this.

Oh, I missed that question, sorry. The correct behaviour in this case is what you described originally: insert new trait names with their record counts + also recalculate number of linked records for all existing traits.
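To illustrate the intended behaviour, here is a sketch using `sqlite3` from the standard library. The table `trait(name, record_count)` and the function name are hypothetical and are not the project's actual schema. Resetting traits that no longer appear in the file to zero is my reading of "recalculate for all existing traits", so treat that line as an assumption.

```python
import sqlite3
from typing import Dict


def sync_trait_counts(conn: sqlite3.Connection, counts: Dict[str, int]) -> None:
    """Insert new trait names and recalculate counts for existing ones.

    Assumes a hypothetical table: trait(name TEXT PRIMARY KEY, record_count INTEGER).
    """
    with conn:  # single transaction: commit on success, roll back on error
        # Assumption: traits absent from the new file have zero linked records.
        conn.execute("UPDATE trait SET record_count = 0")
        for name, count in counts.items():
            cursor = conn.execute(
                "UPDATE trait SET record_count = ? WHERE name = ?", (count, name)
            )
            if cursor.rowcount == 0:  # no existing row, so this is a new trait
                conn.execute(
                    "INSERT INTO trait (name, record_count) VALUES (?, ?)",
                    (name, count),
                )
```

Running everything in one transaction means a failed import leaves the previous counts untouched.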

> Other than that, the script is ready. Should I add a button somewhere to trigger the import and submit it in a PR?

Yes, please do. I'm not sure where to put this button, though. There is an issue #32 for designing the page for buttons triggering various processes, but it's a separate one. So for now, I guess you could just add the button wherever it is convenient. And then, once #32 and subsequent implementation issues are done, we'll move it there.

Another issue is logging. In case the import doesn't go through for some reason, we need a way to access the logs of the script. Do you have any ideas on how best to do that? For example, regarding the Heroku instances where all of this is running, do you have any way to access them (semi-)directly? Or could we make the script, once triggered, output its logs to the JavaScript console or something?

@joj0s
Collaborator

joj0s commented Jun 15, 2020

> Another issue is logging. In case the import doesn't go through for some reason, we need a way to access the logs of the script. Do you have any ideas on how best to do that? For example, regarding the Heroku instances where all of this is running, do you have any way to access them (semi-)directly? Or could we make the script, once triggered, output its logs to the JavaScript console or something?

The easiest way to view logs would be to output them to the server console. They can then be accessed via the Heroku CLI or simply by going to the 'logs' page on the Heroku dashboard. For example: https://dashboard.heroku.com/apps/clinvar-trai-prototype-pmia54t/logs

@joj0s joj0s moved this from In progress to In review in Project Progress Jun 16, 2020
@tskir
Member Author

tskir commented Jun 17, 2020

> The easiest way to view logs would be to output them to the server console. They can then be accessed via the Heroku CLI or simply by going to the 'logs' page on the Heroku dashboard. For example: https://dashboard.heroku.com/apps/clinvar-trai-prototype-pmia54t/logs

OK, that's great, let's use this way for now. Maybe in the future we'll implement some user-friendly logging, a status page, email notifications, or something similar. For now, Heroku server logs will do just fine.
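A minimal sketch of what console logging for the import could look like with the standard `logging` module. Heroku collects whatever the process writes to stdout/stderr, so a stream handler should be enough; the logger name and both function names are hypothetical, and `import_func` stands in for the actual import script.

```python
import logging
import sys
from typing import Callable

logger = logging.getLogger("clinvar_import")  # hypothetical logger name


def configure_console_logging(level: int = logging.INFO) -> None:
    """Send log records to stdout so they show up in the server logs."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    logger.addHandler(handler)
    logger.setLevel(level)


def run_import(import_func: Callable[[], int]) -> bool:
    """Run the import and log the outcome; returns False on failure."""
    logger.info("ClinVar trait import started")
    try:
        imported = import_func()
    except Exception:
        # logger.exception() records the message together with the traceback.
        logger.exception("ClinVar trait import failed")
        return False
    logger.info("ClinVar trait import finished: %d trait names", imported)
    return True
```

Returning a boolean instead of re-raising lets a button handler show a simple success or failure message while the details stay in the logs.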

2 participants