Skip to content

Commit

Permalink
Update function documentation
Browse files Browse the repository at this point in the history
  • Loading branch information
percyfal committed Dec 19, 2023
1 parent 9fb7e4d commit c5fcd37
Show file tree
Hide file tree
Showing 4 changed files with 93 additions and 31 deletions.
4 changes: 2 additions & 2 deletions .pre-commit-config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -15,11 +15,11 @@ repos:
- id: markdownlint-cli2
files: \.(md|qmd)$
types: [file]
exclude: LICENSE.md
exclude: (README.md|LICENSE.md)$
- id: markdownlint-cli2-fix
files: \.(md|qmd)$
types: [file]
exclude: LICENSE.md
exclude: (README.md|LICENSE.md)$
- repo: https://github.com/lorenzwalthert/precommit
rev: v0.3.2.9027
hooks:
Expand Down
2 changes: 0 additions & 2 deletions README.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -118,8 +118,6 @@ optional arguments:
--width WIDTH figure width in inches [default 6.0]
```



### R package vignette

Alternatively, import the library in an R script and use the package
Expand Down
76 changes: 69 additions & 7 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,23 +9,33 @@
status](https://github.com/NBISweden/genecovr/workflows/R-CMD-check/badge.svg)](https://github.com/NBISweden/genecovr/actions)
<!-- badges: end -->

Perform gene body coverage analyses in R to evaluate genome assembly
quality.
`genecovr` is an `R` package that provides plotting functions that
summarize gene transcript to genome alignments. The main purpose is to
assess the effect of polishing and scaffolding operations has on the
quality of a genome assembly. The gene transcript set is a large
sequence set consisting of assembled transcripts from RNA-seq data
generated in relation to a genome assembly project. Therefore,
`genecovr` serves as a complement to software such as
[BUSCO](https://busco.ezlab.org/), which evaluates genome assembly
quality using a smaller set of well-defined single-copy orthologs.

## Installation

You can install the released version of genecovr from [NBIS
GitHub](https://github.com/nbis) with:

``` r
# If necessary, uncomment to install devtools
# install.packages("devtools")
devtools::install_github("NBISweden/genecovr")
```

## Quick usage
## Usage

### genecovr script quick start

There is a helper script for generating basic plots located in
PACKAGE\_DIR/bin/genecovr. Create a data input csv-delimited file with
PACKAGE_DIR/bin/genecovr. Create a data input csv-delimited file with
columns

1. data label
Expand All @@ -41,7 +51,59 @@ generate plots:
PACKAGE_DIR/bin/genecovr indata.csv
```

## Vignette
#### Example

There are example files located in PACKAGE_DIR/inst/extdata consisting
of two psl alignment files containing gmap alignments and fasta indices
for the transcript sequences and two for different assembly versions:

- nonpolished.fai - fasta index for raw assembly
- polished.fai - fasta index for polished assembly
- transcripts.fai - fasta index for transcript sequences
- transcripts2nonpolished.psl - gmap alignments, transcripts to raw
assembly
- transcripts2polished.psl - gmap alignments, transcripts to polished
assembly

Using these files and the labels `non` and `pol` for the different
assemblies, a `genecovr` input file (called e.g., `assemblies.csv`)
would look as follows:

nonpol,transcripts2nonpolished.psl,nonpolished.fai,transcripts.fai
pol,transcripts2polished.psl,polished.fai,transcripts.fai

and the command to run would be:

genecovr assemblies.csv

#### genecovr options

To list genecovr script options, type ’genecovr -h\`:

usage: genecovr [-h] [-v] [-p number]
[-d OUTPUT_DIRECTORY] [--height HEIGHT]
[--width WIDTH]
csvfile

positional arguments:
csvfile csv-delimited file with columns
1. data label
2. mapping file (supported formats: psl)
3. assembly file (fasta or fasta index)
4. transcript file (fasta or fasta index)

optional arguments:
-h, --help show this help message and exit
-v, --verbose print extra output
-p number, --cpus number
number of cpus [default 1]
-d OUTPUT_DIRECTORY, --output-directory OUTPUT_DIRECTORY
output directory
--height HEIGHT figure height in inches [default 6.0]
--width WIDTH figure width in inches [default 6.0]

### R package vignette

Alternatively, import the library as usual in an R script and use the
package functions. See the vignette for a minimum working example.
Alternatively, import the library in an R script and use the package
functions. See [Get started](articles/genecovr.html) or run
`vignette("genecovr")` for a minimum working example.
42 changes: 22 additions & 20 deletions vignettes/genecovr.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -58,13 +58,14 @@ psize <- 3
## On object representation of pairwise sequence alignments

`genecovr` has functionality to read pairwise sequence alignment files
and converts the pairwise alignments to `AlignmentPairs` objects. An
`AlignmentPairs` object is a subclass of the Bioconductor class
`S4Vectors::Pairs`. A `Pairs` object in turn aligns two vectors along
slot names `first` and `second`, and the `AlignmentPairs` object adds
slots for the query and subject, and possibly extra slots related to
additional information in the alignment file. The query and subject
are `GRanges` objects or objects derived from the `GRanges`class.
and converts the pairwise alignments to `genecovr::AlignmentPairs`
objects. An `AlignmentPairs` object is a subclass of the Bioconductor
class `S4Vectors::Pairs`. A `Pairs` object in turn aligns two vectors
along slot names `first` and `second`, and the `AlignmentPairs` object
adds slots for the query and subject, and possibly extra slots related
to additional information in the alignment file. The query and subject
are `GenomicRanges::GRanges` objects or objects derived from the
`GRanges`class.


# Analysing gene body coverage
Expand All @@ -75,10 +76,10 @@ gmap files in psl format, `transcripts2nonpolished.psl` and
`transcripts2polished.psl`. In addition there are fasta index files
for both assemblies (`nonpolished.fai` and `polished.fai`) and for the
transcriptome (`transcripts.fai`). The fasta indices are used to
generate `Seqinfo` objects that can be used to set sequence
information on the parsed output. We load the fasta indices and parse
the psl files with `readPsl`, storing the results in an
`AlignmentPairsList` for convenience.
generate `GenomeInfoDb::Seqinfo` objects that can be used to set
sequence information on the parsed output. We load the fasta indices
and parse the psl files with `genecovr::readPsl`, storing the results
in an `genecovr::AlignmentPairsList` for convenience.

``` {r gbc-load-data}
assembly_fai_fn <- list(
Expand Down Expand Up @@ -137,7 +138,7 @@ assembly, as expected.
## Plot summary indel and match distributions

We can also select multiple columns to plot in the
`AlignmentPairsList`.
`genecovr::AlignmentPairsList`.

```{r gbc-plot-match-indel}
cnames <- c("misMatches", "query.NumInsert", "query.BaseInsert")
Expand All @@ -157,7 +158,7 @@ plot(apl, aes(x = id, y = get_expr(enquo(cnames))), which = "boxplot") +

## Plot number of indels

The function `insertionSummary` summarizes the number of insertions,
The function `genecovr::insertionSummary` summarizes the number of insertions,
either at the transcript level (default) or per alignment. The
intuition is that as assembly quality improves, the number of indels
go down.
Expand Down Expand Up @@ -188,17 +189,18 @@ ggplot(x, aes(id)) +

## Gene body coverage

The function `geneBodyCoverage` takes an `AlignmentPairs` object and
summarizes breadth of coverage and number of subject hits per
transcript.
The function `genecovr::geneBodyCoverage` takes an `AlignmentPairs`
object and summarizes breadth of coverage and number of subject hits
per transcript.

``` {r gbc-make-gene-body-coverage}
gbc <- lapply(apl, geneBodyCoverage, min.match = 0.1)
```

A summary can be obtained with the `summarizeGeneBodyCoverage`
function. We define a range of minimum match hit cutoffs to filter out
hits with too few matches in the aligned region.
A summary can be obtained with the
`genecovr::summarizeGeneBodyCoverage` function. We define a range of
minimum match hit cutoffs to filter out hits with too few matches in
the aligned region.

``` {r gbc-summarize-gene-body-coverage}
min.match <- c(0.25, 0.5, 0.75, 0.9)
Expand Down Expand Up @@ -237,7 +239,7 @@ ggplot(

A fragmented assembly will lead to more transcripts mapping to several
contigs. We calculate the number of subjects by coverage cutoff with
the function `countSubjectsByCoverage`
the function `genecovr::countSubjectsByCoverage`

```{r gbc-ncontigs-per-transcript}
data <- dplyr::bind_rows(
Expand Down

0 comments on commit c5fcd37

Please sign in to comment.