Transcript Versions - Gaps and replace alignments with newer annotations #494
Need to remember to re-match any classifications against transcripts that change because of this issue. Since we haven't upgraded any prod systems yet, it might be better to modify the previous rematch classifications management command, delete the manual migration, and then do a new one. It should only be run on the classifications made via the API. James says you can best get this via:
I think I'll just handle the gaps in PyHGVS
https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
Then we just need to handle that shift in the exons. Probably just write that into PyHGVS, since we've already made modifications etc.
Previous discussion: 3132 HGVS - mRNA/genome gapped alignments
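For reference, a minimal sketch of reading the GFF3 `Gap` attribute described in the spec linked above. This is not the PyHGVS code; the function names are made up for illustration, and only the M/I/D operations are handled (F and R frameshift operations raise an error). Per the spec, M consumes bases on both sequences, D consumes the reference (genome) only, and I consumes the target (transcript) only.

```python
import re

# One Gap operation: a letter code followed by a length, e.g. "M8" or "D3"
GAP_OP = re.compile(r"([MIDFR])(\d+)")


def parse_gap(gap):
    """Split a GFF3 Gap attribute value into (operation, length) tuples."""
    return [(op, int(length)) for op, length in GAP_OP.findall(gap)]


def aligned_lengths(gap):
    """Return (reference_length, target_length) covered by a Gap string.

    Hypothetical helper: supports M, I and D only.
    """
    ref_len = 0
    target_len = 0
    for op, length in parse_gap(gap):
        if op == "M":    # match: present in both genome and transcript
            ref_len += length
            target_len += length
        elif op == "D":  # bases in the genome that are missing from the transcript
            ref_len += length
        elif op == "I":  # bases in the transcript that are missing from the genome
            target_len += length
        else:
            raise ValueError(f"Unsupported Gap operation: {op}")
    return ref_len, target_len


# Example Gap string taken from the GFF3 specification
ref_len, target_len = aligned_lengths("M8 D3 M6 I1 M6")
```

The difference between the two lengths is the net shift that has to be applied to transcript coordinates downstream of the gaps.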
…nnotations - use HTTP as getting corruption via ftp to NCBI
I found that "partial" was used to describe partial matches, so added that to data
Output:
So GFF is better, plus we get those cDNA matches etc. that we can use
Working in branch "gene_import"
TODO:
Checked in start of gap handling code to https://github.com/SACGF/hgvs
Using test data that worked with VEP: 2:73385942 A>T
- VEP: NM_015120.4:c.74A>T
- OLD pyhgvs: NM_015120.4(ALMS1):c.74A>T
Still TODO:
Install instructions
…ence diffs on transcript version page
Just testing those I had transcript sequences for (so I didn't need to use API requests to get more)
So polyA explains 99.93% of the length differences (those not due to cDNA matches), and the transcripts look like they end in polyA. Here's the end sequence:
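A rough sketch of how the polyA check could be done. This is not the code used for the numbers above; the function names are hypothetical, and it assumes the transcript sequence is longer than the summed exon (aligned) length, with the extra bases at the 3' end.

```python
def poly_a_tail_length(sequence):
    """Return the number of consecutive 'A' bases at the 3' end of the sequence."""
    seq = sequence.upper()
    return len(seq) - len(seq.rstrip("A"))


def length_diff_explained_by_poly_a(sequence, aligned_length):
    """True if the transcript is longer than its aligned length and every
    extra base could belong to the trailing run of A's.

    Assumption: the unaligned bases sit at the 3' end of the transcript.
    """
    extra = len(sequence) - aligned_length
    return 0 < extra <= poly_a_tail_length(sequence)
```

For example, `length_diff_explained_by_poly_a("ACGTAAAA", 5)` is true (3 extra bases, tail of 4 A's), while an aligned length of 2 leaves 6 extra bases, which the tail cannot account for.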
Testing covered by https://app.zenhub.com/workspaces/everything-space-5bb3158c4b5806bc2beae448/issues/sacgf/shariant-admin/124
@davmlaw Does anything else need to be tested?
I think this has effectively been well tested by fire.
The alignments can change for the same transcript version, so we should update data to the latest (if using --replace)
https://www.ncbi.nlm.nih.gov/nuccore/NM_004656.4 is 3600 bp long, but summing the exon lengths in our JSON gives a total that is 1 base short.
Emma noticed that in GCF_000001405.39_GRCh38.p13_genomic.109.20210514.gff.gz the exon lengths sum to 3600, i.e. different from our JSON data.
Here's the genePred data (0-based like the JSON), and yes, it looks like the 1st exon start (in genomic order) is 52435023, while the JSON has 52435024
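To illustrate the off-by-one, here is a small sketch that sums exon lengths from 0-based, half-open (start, end) pairs. Only the two start coordinates come from this thread; the exon end coordinate is a made-up placeholder, and `exon_length_sum` is a hypothetical helper, not a function from our codebase.

```python
def exon_length_sum(exons):
    """Sum exon lengths for 0-based, half-open (start, end) intervals."""
    total = 0
    for start, end in exons:
        if end <= start:
            raise ValueError(f"Invalid exon interval: ({start}, {end})")
        total += end - start
    return total


# Placeholder end coordinate, NOT real data; only the starts are from the thread
HYPOTHETICAL_END = 52435200

json_first_exon = [(52435024, HYPOTHETICAL_END)]      # start as stored in our JSON
genepred_first_exon = [(52435023, HYPOTHETICAL_END)]  # start in the genePred data

# The later start makes the exon, and so the whole transcript, 1 base shorter
difference = exon_length_sum(genepred_first_exon) - exon_length_sum(json_first_exon)
```

Comparing this sum against the known transcript length would be a cheap sanity check at import time.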
The data was originally inserted via these files which do have the 52435024 coordinate:
The current process is:
By default we do NOT replace the transcript data with the new stuff. My thinking was that if the transcript version wasn't bumped, then the data wouldn't change.
Looks like this was a bad assumption: for whatever reason (the genome being patched, or UCSC having bad alignments), the alignments for the same transcript version can indeed change. Will move this to a new issue.
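A sketch of the behaviour described above, for whoever picks up the new issue. It assumes transcript data is held as a dict of accession to exon list, which is a simplification of the real models; both function names are hypothetical, and the exon end coordinate is a placeholder.

```python
def alignment_changes(stored, incoming):
    """Return accessions present in both mappings whose exon lists differ."""
    shared = stored.keys() & incoming.keys()
    return sorted(acc for acc in shared if stored[acc] != incoming[acc])


def merge_transcripts(stored, incoming, replace=False):
    """Merge incoming transcript versions into stored ones.

    New accessions are always added. Existing accessions are only
    overwritten when replace=True (mirroring the --replace option).
    """
    merged = dict(stored)
    for accession, exons in incoming.items():
        if accession not in merged or replace:
            merged[accession] = exons
    return merged


# Placeholder end coordinate; only the starts are taken from this thread
stored = {"NM_004656.4": [(52435024, 52435200)]}
incoming = {"NM_004656.4": [(52435023, 52435200)]}

changed = alignment_changes(stored, incoming)
kept = merge_transcripts(stored, incoming)
updated = merge_transcripts(stored, incoming, replace=True)
```

Reporting `alignment_changes` even when not replacing would at least make it visible when the same transcript version has a different alignment.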