Caching seqinfo #26

Open
jeff-mandell opened this issue Jun 22, 2021 · 7 comments

@jeff-mandell

Hi, my package uses GenomeInfoDb, and we use the seqlevelsStyle function to clean up user-supplied data and ensure consistent chromosome names (in our case, we go with NCBI style, which means stripping chr prefixes). I can see that what seems like a simple task gets complicated under the hood by the need to download the latest info from NCBI, Ensembl, and UCSC.

I found that .UCSC_cached_chrom_info and .NCBI_cached_chrom_info store the necessary information for seqlevelsStyle throughout a session, but an internet connection is still required at the start of every new session. This causes a problem for offline users and for users on networks that, for whatever reason, block traffic to any of NCBI/UCSC/Ensembl (yes, this is really happening). Since seqinfo is such a small amount of data, is there a plan to take advantage of R's support for caching user data to save this information and allow seqlevelsStyle to run offline? Or is there a safe workaround to supply the necessary seqinfo?

I did it this way, but I'm concerned this could cause problems with new GenomeInfoDb releases or if anything changes on the NCBI/UCSC/Ensembl server side.

# Get information for local caching (run once, with an internet connection)
library(GenomeInfoDb)
library(BSgenome)  # provides getBSgenome()
bsg = getBSgenome("hg19")
seqlevelsStyle(bsg) = "NCBI"  # populates the session caches read below
ucsc_info = getFromNamespace(".UCSC_cached_chrom_info", "GenomeInfoDb")[["hg19"]]
ucsc_info = GenomeInfoDb:::.add_ensembl_column(ucsc_info, "hg19")
ncbi_info = getFromNamespace(".NCBI_cached_chrom_info", "GenomeInfoDb")[["GCF_000001405.25"]]
saveRDS(ncbi_info, "hg19_ncbi_seqinfo_for_GenomeInfoDb.rds")
saveRDS(ucsc_info, "hg19_ucsc_seqinfo_for_GenomeInfoDb.rds")

# Later, in new (offline) R session
ucsc_info = readRDS("hg19_ucsc_seqinfo_for_GenomeInfoDb.rds")
ncbi_info = readRDS("hg19_ncbi_seqinfo_for_GenomeInfoDb.rds")
assign('hg19', ucsc_info, envir = get(".UCSC_cached_chrom_info", envir = asNamespace('GenomeInfoDb')))
assign('GCF_000001405.25', ncbi_info, envir = get(".NCBI_cached_chrom_info", envir = asNamespace('GenomeInfoDb')))

# seqlevelsStyle now works offline

@hpages
Contributor

hpages commented Jun 23, 2021

There's no plan at the moment to take advantage of R's support for caching user data to save NCBI or UCSC assembly/genome information and allow seqlevelsStyle() to run offline.

One concern with a persistent caching solution is that there's the slight possibility that the information provided by NCBI or UCSC for a given assembly/genome changes in the future. But maybe the risk that this actually happens is so low that we shouldn't be too concerned. This could also be mitigated via an expiration mechanism e.g. NCBI or UCSC chromosome information gets automatically removed from the persistent cache after a couple of months or something like that.

Also note that even with a persistent caching solution, an internet connection would still be initially necessary so it doesn't really solve the problem for users on networks that are blocking NCBI/UCSC/Ensembl traffic.
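For illustration only, the expiration idea mentioned above could be sketched in base R roughly as follows. This is not how GenomeInfoDb works: every function name here (`default_cache_dir`, `cache_path`, `cache_write`, `cache_read`), the directory name "chromInfoCacheDemo", and the 60-day default are assumptions made up for the example. The only real API relied on is `tools::R_user_dir()` (available in R >= 4.0), which returns a per-user cache location.

```r
# Hypothetical helpers; none of these names exist in GenomeInfoDb.

default_cache_dir <- function() {
  # "chromInfoCacheDemo" is a made-up name used only for this sketch.
  tools::R_user_dir("chromInfoCacheDemo", which = "cache")
}

cache_path <- function(key, dir = default_cache_dir()) {
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  file.path(dir, paste0(key, ".rds"))
}

# Store one chrom-info object on disk under the given key.
cache_write <- function(key, value, dir = default_cache_dir()) {
  saveRDS(value, cache_path(key, dir))
  invisible(value)
}

# Return the cached object, or NULL if it is missing or older than
# max_age_days. Expired entries are deleted from disk.
cache_read <- function(key, max_age_days = 60, dir = default_cache_dir()) {
  path <- cache_path(key, dir)
  if (!file.exists(path)) {
    return(NULL)
  }
  age_days <- as.numeric(difftime(Sys.time(), file.mtime(path), units = "days"))
  if (age_days > max_age_days) {
    unlink(path)
    return(NULL)
  }
  readRDS(path)
}
```

A caller would try `cache_read("hg19")` first and only go online when it returns NULL. As noted above, the first download still needs a working connection.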

@jeff-mandell
Author

Thanks for taking the time to respond. I understand the risk of sequence information changing. A persistent caching solution would help users who sometimes work offline, and it would prevent some crashes in HPC environments (e.g., a random node is misconfigured or has network problems). Maybe it's too niche of a need, but it could also help out package developers to be able to insert their own entries into .UCSC_cached_chrom_info and .NCBI_cached_chrom_info for use in these situations. The need for the end user to do simple harmonization of human data (just making chr prefixes and M/MT consistent, without regard for non-primary assembly sequences) is probably pretty widespread.
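To make the "simple harmonization" case concrete, here is a minimal base R sketch that needs no network access. It is an assumption-laden stand-in, not a replacement for seqlevelsStyle: the function name `to_ncbi_style` is hypothetical, and it only strips a leading "chr" and renames M to MT, ignoring scaffolds and other non-primary sequences.

```r
# Hypothetical helper, not part of GenomeInfoDb.
# Handles only the primary-chromosome case described above.
to_ncbi_style <- function(chroms) {
  out <- sub("^chr", "", as.character(chroms))  # "chr1" -> "1"
  out[out %in% "M"] <- "MT"                     # "M" -> "MT"; NA is left as-is
  out
}

to_ncbi_style(c("chr1", "chrX", "chrM", "2", "MT"))
# [1] "1"  "X"  "MT" "2"  "MT"
```

Unlike seqlevelsStyle, this does no validation against a real assembly, so unexpected names pass through unchanged.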

@hpages
Contributor

hpages commented Jun 30, 2021

Bingo! And just when we were talking about the possibility of UCSC suddenly changing the chromosome information of their genomes, they just do it! See issue #27.

Note that this is not the first time. They already did this last year with hg19 when they decided to base it on GRCh37.p13 instead of GRCh37. This broke many things and created a lot of confusion.

@hpages
Contributor

hpages commented Oct 13, 2022

Hi @jeff-mandell,

Just to let you know that I implemented an "offline mode" for getChromInfoFromUCSC(). This is in GenomeInfoDb 1.33.9. See commit 345f22c.

Note that it's only a partial "offline mode" i.e. it works when called with assembled.molecules.only=TRUE and only for a selection of registered genomes. See "Offline mode" in ?getChromInfoFromUCSC for more information.

Cheers,
H.

@jeff-mandell
Author

Thank you, this is nice to have!

@nvictus

nvictus commented Apr 3, 2024

@hpages Are there plans to make "offline" assembly metadata available on AnnotationHub like the Ensembl, UCSC transcription DBs?

@hpages
Contributor

hpages commented Apr 3, 2024

There are plans to make some assembly metadata available offline but there's no clear roadmap yet. In particular whether it's going to be via AnnotationHub or other means has not been decided.

Note that the chrom info for some UCSC genomes is already available offline e.g. getChromInfoFromUCSC("hg38", assembled.molecules.only=TRUE) or getChromInfoFromUCSC("hs1", assembled.molecules.only=TRUE) work offline. The offline mode only works if assembled.molecules.only=TRUE, that is, if one tries to obtain the chrom info for the chromosomes only and not for all the sequences in the genome assembly (i.e. chromosome + scaffolds).
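As a rough usage sketch of the call described above: the wrapper name `primary_chrom_info` is hypothetical, and the code assumes a GenomeInfoDb version that includes the offline mode (1.33.9 or later, per the earlier comment). It returns NULL instead of failing when the package is missing or the lookup errors, for example when a genome is not covered by the offline selection and no network is available.

```r
# Hypothetical wrapper; only getChromInfoFromUCSC() comes from GenomeInfoDb.
primary_chrom_info <- function(genome) {
  if (!requireNamespace("GenomeInfoDb", quietly = TRUE)) {
    return(NULL)  # package not installed
  }
  tryCatch(
    # assembled.molecules.only = TRUE is what makes the offline mode possible
    GenomeInfoDb::getChromInfoFromUCSC(genome, assembled.molecules.only = TRUE),
    error = function(e) NULL
  )
}

info <- primary_chrom_info("hg38")
if (!is.null(info)) print(head(info))
```

Swallowing every error into NULL is a simplification for the example; real code would probably want to report why the lookup failed.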

3 participants