Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposed contribution task for Outreachy applicants: Register UCSC genome xenTro10 #46

Closed
hpages opened this issue Oct 13, 2022 · 16 comments

Comments

@hpages
Copy link
Contributor

hpages commented Oct 13, 2022

xenTro10 is the latest UCSC genome for the Western clawed frog (Xenopus tropicalis). See "List of UCSC genome releases" at https://genome.ucsc.edu/FAQ/FAQreleases.html for all the genomes currently supported by UCSC.

Also check out the "Genome Browser Gateway" page here. This is the main entrance to the "UCSC Genome Browser". Find X. tropicalis in the UCSC species tree on the left, click on it, then make sure to select the latest X. tropicalis Assembly (xenTro10). This will display a bunch of additional information about the xenTro10 assembly.

Note that many UCSC genomes are already registered in the GenomeInfoDb package (83 as of October 2022). The registered_UCSC_genomes() function in GenomeInfoDb returns the list of all the UCSC genomes that are currently registered in the package. An important thing to be aware of is that getChromInfoFromUCSC() still works on an unregistered genome, but in "degraded" mode, that is:

  • the assembled.molecules argument is ignored,
  • the assembled and circular columns of the returned data.frame are filled with NAs,
  • and the chromosomes/sequences are not returned in any particular order.

Registering a genome fixes that. In other words, once a genome is registered in GenomeInfoDb, the information returned by getChromInfoFromUCSC() for that genome is guaranteed to be complete and accurate.

See ?getChromInfoFromUCSC (after loading GenomeInfoDb) for more information.

Registering a new UCSC genome is only a matter of adding a new file, called "registration file", to GenomeInfoDb/inst/registered/UCSC_genomes/. Note that the folder contains a README.TXT file that provides some brief information about what a "registration file" should contain (unfortunately the registration process is not fully documented).

For xenTro10, since this is the first xenTro genome that we're going to register in GenomeInfoDb, we need to start the xenTro10.R file from scratch. However, looking at other registration files to get a feeling of how things are done is always a good idea. Don't bother with the NCBI_LINKER component for now. We'll add it later, once the corresponding NCBI assembly (UCB_Xtro_10.0) is also registered (registering UCB_Xtro_10.0 will is the topic of issue #47).

IMPORTANT NOTES TO OUTREACHY APPLICANTS:

  • Make sure to complete all the Preliminary tasks listed here before you start working on this task. In particular, make sure that you have R 4.2 and that you are set up to use the devel version of Bioconductor (currently 3.16).
  • Only one applicant can work on this task. If you choose to work on this task, please make sure to assign yourself so other applicants know that the task is already being worked on. If later on you change your mind, please unassign yourself. It's ok to change your mind!
  • To work on this task, please fork the GenomeInfoDb repository. Then do your work on that fork.
  • Always test your changes before you commit them to your fork. This consists in installing the modified package, starting R, loading the package, and playing around with the new functionality. This process is called "ad hoc manual testing". Once everything behaves and looks as expected, run R CMD build and R CMD check on the package. Note that R CMD check should always be run on the source tarball produced by R CMD build.
  • R CMD check might produce some NOTEs and even some WARNINGs. These are ok if they existed before your changes. You can check that by taking a look at the daily report produced by our automated builds here: https://bioconductor.org/checkResults/devel/bioc-LATEST/ Make sure to not introduce new NOTEs or WARNINGs!
  • Once your work is ready to be merged, please submit a PR (Pull Request).
  • Remember to record your contribution on Outreachy at https://www.outreachy.org/outreachy-december-2022-internship-round/communities/bioconductor/refactor-the-bsgenomeforge-tools/contributions/.
@kakopo
Copy link
Collaborator

kakopo commented Oct 14, 2022

Hello @hpages. I'd like to contribute towards this issue. I have completed the preliminary tasks.

@hpages
Copy link
Contributor Author

hpages commented Oct 14, 2022

Hi @kakopo , please choose a task in the Cat group instead. See https://github.com/Bioconductor/BSgenomeForge/wiki/List-of-contribution-tasks-for-the-Outreachy-application-period Thanks!

@hpages hpages changed the title Proposed contribution tasks for Outreachy applicants: Register UCSC genome xenTro10 Proposed contribution task for Outreachy applicants: Register UCSC genome xenTro10 Oct 17, 2022
@Simplecodez
Copy link
Contributor

Good day sir. I love to be assigned to this task

@Simplecodez
Copy link
Contributor

Simplecodez commented Oct 23, 2022

Hi @hpages. i just started working on it since i was going to be assigned to this task.
Below is the code of my work

GENOME <- "xenTro10"
ORGANISM <- "Xenopus tropicalis"
ASSEMBLED_MOLECULES <- paste0("chr", c(1:9, "X", "M"))
CIRC_SEQS <- "chrM"

library(IRanges)       # for CharacterList()
library(GenomeInfoDb)  # for fetch_chrom_sizes_from_UCSC()

.order_seqlevels <- function(seqlevels)
{
    tmp <- CharacterList(strsplit(seqlevels, "_"))
    npart <- lengths(tmp)
    stopifnot(all(npart %in% c(1L, 3L)))

    idx1 <- which(npart == 1L)
    stopifnot(length(idx1) == length(ASSEMBLED_MOLECULES))
    oo1 <- match(ASSEMBLED_MOLECULES, seqlevels[idx1])
    stopifnot(!anyNA(oo1))
    idx1 <- idx1[oo1]

    idx3 <- which(npart == 3L)
    m3 <- matrix(unlist(tmp[idx3]), ncol=3L, byrow=TRUE)
    stopifnot(all(m3[ , 1L] == "chrUn"))
    stopifnot(all(m3[ , 2L] == "NW"))
    oo3 <- order(m3[ , 3L])
    idx3 <- idx3[oo3]

    c(idx1, idx3)
}

FETCH_ORDERED_CHROM_SIZES <-
    function(goldenPath.url=getOption("UCSC.goldenPath.url"))
{
    chrom_sizes <- GenomeInfoDb:::fetch_chrom_sizes_from_UCSC(GENOME,
                                              goldenPath.url=goldenPath.url)
    oo <- .order_seqlevels(chrom_sizes[ , "chrom"])
    S4Vectors:::extract_data_frame_rows(chrom_sizes, oo)
}

NCBI_LINKER <- list(
    assembly_accession="GCF_000004195.4",
    special_mappings=c(chrM="MT")
)

After installing GenomeInfoDb I ran chrominfo <- getChromInfoFromUCSC("xenTro10", assembled.molecules.only=TRUE) and got this error:

Error in .order_seqlevels(chrom_sizes[, "chrom"]) : 
  !anyNA(oo1) is not TRUE

please what could be the cause?

@hpages
Copy link
Contributor Author

hpages commented Oct 24, 2022

Hi @Simplecodez, I've just assigned you to this issue. Before I answer your question above, I'll take a look at the PR you submitted for the previous task (PR #69). Thanks for your patience.

@Simplecodez
Copy link
Contributor

Alright sir. Thank you

@Simplecodez
Copy link
Contributor

Simplecodez commented Oct 26, 2022

Hi @hpages please I am waiting for your response sir.
Thank you.

@hpages
Copy link
Contributor Author

hpages commented Oct 26, 2022

Hi @Simplecodez ,

please I am waiting for your response sir.

As I said, before moving to this new task, I wanted that we finish the previous one first. Thanks for understanding.

After installing GenomeInfoDb I ran chrominfo <- getChromInfoFromUCSC("xenTro10", assembled.molecules.only=TRUE) and got this error:

Error in .order_seqlevels(chrom_sizes[, "chrom"]) : 
  !anyNA(oo1) is not TRUE

To work on this task, it's best to run the code in your xenTro10.R file manually. You can use source() to source the file, or you can copy-paste the content of the file in your R session.

As documented in the README.TXT file located in inst/registered/UCSC_genomes/ (make sure to read this file), the code in the file must define 4 mandatory variables: GENOME, ORGANISM, ASSEMBLED_MOLECULES, and CIRC_SEQS. Inspect them (i.e. look at their values) and make sure that they look as expected.

Make sure that all the xenTro10 chromosomes are listed in ASSEMBLED_MOLECULES. For example there's no reason to exclude chromosome 10!

Then try to call FETCH_ORDERED_CHROM_SIZES(). This should reproduce the error that you got above.

To understand why this error happens, we're going to run the code in the FETCH_ORDERED_CHROM_SIZES() function line by line, and we'll discuss each line of code.

So let's get started:

## The argument of the FETCH_ORDERED_CHROM_SIZES() function is named 'goldenPath.url'
## so we set that variable manually.
goldenPath.url <- getOption("UCSC.goldenPath.url")

## Download the sequence sizes for xenTro10 from UCSC. The result is stored
## in 'chrom_sizes'. This is a data frame.
chrom_sizes <- GenomeInfoDb:::fetch_chrom_sizes_from_UCSC(GENOME,
                                          goldenPath.url=goldenPath.url)

## Let's take the time to inspect 'chrom_sizes'. Display it by entering the variable name at the R
## prompt, followed by <Enter>. Check its dimensions (i.e. number of rows and columns)
## with 'dim(chrom_sizes)'. Display the column names with 'colnames(chrom_sizes)'.
## Display the first and last 6 rows with 'head(chrom_sizes)' and 'tail(chrom_sizes)' respectively.
## Extract the "chrom" column with 'chrom_sizes$chrom' (same as  'chrom_sizes[["chrom"]]').
## Another way to extract this column is with 'chrom_sizes[ , "chrom"]'. This is safer than the
## other ways in the sense that it will fail if the column doesn't exist rather than silently returning
## a NULL. For example try 'chrom_sizes$chromo', then 'chrom_sizes[["chromo"]]',
## then 'chrom_sizes[ , "chromo"]'. See the difference?

## Call .order_seqlevels() on the sequence names. 
oo <- .order_seqlevels(chrom_sizes[ , "chrom"])

When you execute this last line, the error should happen again. This tells us that the error happens in the .order_seqlevels() function.

Now let's run the code in the .order_seqlevels() function line by line:

## The argument of the .order_seqlevels() function is named 'seqlevels' so we set
## that variable manually.
seqlevels <- chrom_sizes[ , "chrom"]

## Take the time to look at 'seqlevels'. This is a character vector that contains all the
## sequence names in the xenTro10 assembly (sequence names are sometimes
## called seqlevels in Bioconductor). You should see 11 chromosomes: chromosomes
## 1 to 10, plus the mitochondrial chromosome. All the additional sequences are
## called "unplaced scaffolds". Those are small DNA fragments that were produced
## by the assembly process but the scientists working on this assembly were not able
## to identify which chromosome those fragments are coming from. UCSC naming
## convention is to prefix the names of "unplaced scaffolds" with 'chrUn_'.

## Break the seqlevels in parts. Make sure to display 'tmp'.  As you will see, the
## seqlevel parts are returned in a CharacterList object. This object is parallel to
## the 'seqlevels' vector, that is, the 2 objects have the same length and the i-th
## element in 'tmp' corresponds to the i-th element in 'seqlevels'.
tmp <- CharacterList(strsplit(seqlevels, "_"))

## Compute the number of parts in each seqlevel. Make sure to display 'npart'.
## This is an integer vector parallel to 'tmp' (i.e. same length, and the i-th element
## in 'npart' corresponds to the i-th element in 'tmp'). So it's also parallel to 'seqlevels'.
npart <- lengths(tmp)

## Check that all seqlevels have 1 or 3 parts. This is just a sanity check. Take the
## time to study how this code works exactly by breaking it down in smaller steps.
## That is, first do 'npart %in% c(1L, 3L)', then put that inside 'all()', then
## put 'all(npart %in% c(1L, 3L))' inside 'stopifnot()'.
stopifnot(all(npart %in% c(1L, 3L)))

## Extract the index of seqlevels with 1 part only. Again, take the time to study
## how this code works exactly by breaking it down in two steps.
idx1 <- which(npart == 1L)

## First sanity check. Does it pass?
stopifnot(length(idx1) == length(ASSEMBLED_MOLECULES))

## Match the values in the ASSEMBLED_MOLECULES vector to the
## seqlevels with 1 part only. Take the time to learn about the match() function
## by reading its man page. This function is defined in the base package so you
## can quickly access its man page with '?base::match'.
oo1 <- match(ASSEMBLED_MOLECULES, seqlevels[idx1])

## Second sanity check. Does is pass? If not, what could be the problem?
## Hint: check the values in the 'oo1' vector returned by match(). What does
## it mean when match() returns NAs? 
stopifnot(!anyNA(oo1))

This is our first deep dive in real R code. There's a lot going on with the above code so it's going to be a great learning experience. Do not hesitate to ask if you have questions. I'll do my best to help.

@Simplecodez
Copy link
Contributor

Simplecodez commented Oct 26, 2022

Thank you very much for this explanation. I got a lot of insights as to how the code works.

@hpages
Copy link
Contributor Author

hpages commented Oct 26, 2022

so this organism doesn't have an "X" chromosome?

I don't know if the UCSC folks included the X chromosome or not. It could be that they didn't. If you don't see it in seqlevels after running the code above, that means that they didn't, which would be pretty unusual.

@Simplecodez
Copy link
Contributor

Yes I didn't see it in the seqlevels. So should I exclude it?

@Simplecodez
Copy link
Contributor

Good day sir. I just created a PR now on this issue. I will be anticipating your feedback

@hpages
Copy link
Contributor Author

hpages commented Oct 27, 2022

Thanks for the PR @Simplecodez. It looks good.

For the sake of keeping the PR focused on the precise task described at the top of this issue, can you please remove the NCBI_LINKER component? Adding it will be the topic of the next task (#48).

Thanks again.

@Simplecodez
Copy link
Contributor

Okay sir, I have removed the NCBI linker component.
Thank you

@hpages
Copy link
Contributor Author

hpages commented Oct 27, 2022

Perfect @Simplecodez, thanks! I just merged PR #71.

Next task in your group is issue #48. Whenever you are ready, go there and ask to be assigned.

@hpages hpages closed this as completed Oct 27, 2022
@Simplecodez
Copy link
Contributor

Simplecodez commented Oct 27, 2022

Thank you very much sir

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants