Skip to content

Added functionality for writeBambuOutput() to write to gff3#567

Open
hafiz-ismail wants to merge 1 commit intodevel_pre_v4from
write_to_gff3
Open

Added functionality for writeBambuOutput() to write to gff3#567
hafiz-ismail wants to merge 1 commit intodevel_pre_v4from
write_to_gff3

Conversation

@hafiz-ismail
Copy link
Copy Markdown
Collaborator

Related Issue

Fixes # (issue number)

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (please specify if the change breaks existing functionality)
    • Non breaking change (the feature doesn't change existing functionality)
    • Breaking change (the feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Performance optimization

Description

Implemented functionality for writeBambuOutput to export gff3 files from SummarizedExperiment objects. Currently Bambu can export the more stringent gtf file format, but there is no functionality for gff3. An overview of the differences of the 2 file formats:

GTF differs from GFF3 mainly in the 9th column (attribute) and the syntax of the key:value pairs.

GTF:

gene_id "GENE1"; transcript_id "TX1"; gene_name "BRCA1";
gene_id "GENE1"; transcript_id "TX1"; exon_number "1";

GTF documentation (Ensembl)

GFF3:

ID=TX1;Parent=GENE1;Name=BRCA1
ID=exon1;Parent=TX1;exon_number=1

GFF3 documentation (Ensembl)

Also, GTF attributes have a trailing ; whereas GFF3 does not.

Other differences include

  • GFF3 header
  • GTF column 3 (feature) is constrained to lesser terms than GFF3

Implementation Details

All code changes were implemented in readWrite.R.

  • Implemented two new helper functions: writeToGFF3 and writeAnnotationsToGFF3, following convention from already existing writeToGTF and writeAnnotationsToGTF
  • Implemented new parameter outputAnnFormat in writeBambuOutput: takes either "gtf" (default) or "gff3" as arguments
  • renamed transcript_gtffn to transcript_annfn
  • Updated documentation in README and comment markers parsed by roxygen2

Impact of Changes

  • All previous functionality is preserved, writeBambuOutput still exports to .gtf by default unless specified otherwise. It was found that the current writeBambuOutput in devel_pre_v4 is not fully functional - therefore a minimal function is provided to test results as below:
writeBambuOutput <- function(se, path, prefix = "", outputExtendedAnno = TRUE, 
                             outputAll = TRUE, outputBambuModels = TRUE, outputNovelOnly = TRUE, seperateSamples = FALSE,
                             outputAnnFormat = "gtf") {
  
  if (missing(se) | missing(path)) {
    stop("Both summarizedExperiment object from bambu and
            the path for the output files are required.")
  } else if (!outputAnnFormat %in% c("gtf", "gff3")) {
    stop("Please specify a valid annotation outputAnnFormat: 'gtf' or 'gff3', or leave blank to output to GTF by default.")
  } else {
    outdir <- paste0(path, "/")
    if (!dir.exists(outdir))
      dir.create(outdir, recursive = TRUE)
    
    transcript_grList <- rowRanges(se)
    prefix <- ifelse(prefix != "", paste0(prefix, "_"), "")
    transcript_annfn <- paste(outdir, prefix, sep = "")
    
    if (outputAnnFormat == "gtf") {
      gtf <- writeAnnotationsToGTF(annotation = transcript_grList,
                                   file = transcript_annfn, outputExtendedAnno = outputExtendedAnno, 
                                   outputAll = outputAll, outputBambuModels = outputBambuModels, outputNovelOnly = outputNovelOnly)
    }
    
    if (outputAnnFormat == "gff3") {
      gff3 <- writeAnnotationsToGFF3(annotation = transcript_grList,
                                     file = transcript_annfn, outputExtendedAnno = outputExtendedAnno, 
                                     outputAll = outputAll, outputBambuModels = outputBambuModels, outputNovelOnly = outputNovelOnly)
    }
  }
}

In addition, this output has not been updated for compatibility with importBambuResults - providing functionality for this will be associated with future implementations for the prepareAnnotations function

Checklist

  • My code follows the style guidelines of this project.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation (vignettes, man pages).
  • I have tested the code on a full dataset, and any differences have been described in the Impact of Changes section.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds optional GFF3 annotation export support to writeBambuOutput() so Bambu results can be written as either GTF (default) or GFF3, and updates user-facing documentation accordingly.

Changes:

  • Added outputAnnFormat parameter to writeBambuOutput() to select "gtf" vs "gff3".
  • Implemented writeToGFF3() / writeAnnotationsToGFF3() helpers analogous to existing GTF writers.
  • Updated README text to mention GFF3 output support.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File Description
README.md Documents the new GFF3 option for annotation output.
R/readWrite.R Adds outputAnnFormat to writeBambuOutput() and introduces GFF3-writing helpers.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread README.md
For a full description of the other outputs see [Output Description](#Output-Description)

The full output can be written to a file using writeBambuOutput(). Using this function will generate six files, including 4four .gtf files(detailed below), and two .txt files for the expression counts at transcript and gene levels.
The full output can be written to a file using writeBambuOutput(). Using this function will generate six files, including four .gtf/.gff3 files(detailed below), and two .txt files for the expression counts at transcript and gene levels.
Comment thread README.md
For a full description of the other outputs see [Output Description](#Output-Description)

The full output can be written to a file using writeBambuOutput(). Using this function will generate six files, including 4four .gtf files(detailed below), and two .txt files for the expression counts at transcript and gene levels.
The full output can be written to a file using writeBambuOutput(). Using this function will generate six files, including four .gtf/.gff3 files(detailed below), and two .txt files for the expression counts at transcript and gene levels.
Comment thread R/readWrite.R
Comment on lines 4 to +12
#' @param path the destination of the output files
#' (gtf, transcript counts, and gene counts)
#' @param prefix the prefix of the output files
#' @details The function will write the output from Bambu to files. The
#' annotations will be written to a .gtf file, transcript counts (total counts,
#' @param outputAnnFormat the file format in which to output annotations, must
#' be one of \code{"gtf"} or \code{"gff3"}. \code{"gtf"} is specified by default.
#' @details The function will write the output from Bambu to files. The
#' annotations will be written to a .gtf or .gff3 file, transcript counts (total counts,
#' CPM, full-length counts, partial-length counts, and unique counts) and gene counts
#' will be written to .txt files.
#' will be written to .txt files.
Comment thread R/readWrite.R
Comment on lines 21 to +23
writeBambuOutput <- function(se, path, prefix = "", outputExtendedAnno = TRUE,
outputAll = TRUE, outputBambuModels = TRUE, outputNovelOnly = TRUE, seperateSamples = FALSE) {
if (missing(se) | missing(path)) {
stop("Both summarizedExperiment object from bambu and
outputAll = TRUE, outputBambuModels = TRUE, outputNovelOnly = TRUE, seperateSamples = FALSE,
outputAnnFormat = "gtf") {
Comment thread R/readWrite.R
Comment on lines +39 to +49
if (outputAnnFormat == "gtf") {
gtf <- writeAnnotationsToGTF(annotation = transcript_grList,
file = transcript_annfn, outputExtendedAnno = outputExtendedAnno,
outputAll = outputAll, outputBambuModels = outputBambuModels, outputNovelOnly = outputNovelOnly)
}

if (outputAnnFormat == "gff3") {
gff3 <- writeAnnotationsToGFF3(annotation = transcript_grList,
file = transcript_annfn, outputExtendedAnno = outputExtendedAnno,
outputAll = outputAll, outputBambuModels = outputBambuModels, outputNovelOnly = outputNovelOnly)
}
Comment thread R/readWrite.R
if(outputAll){
annotationAll = setNDR(annotation, 1)
if(length(annotationAll) == length(annotation))
message("The current NDR threshold already outputs all transcript models. This may result in reduced precision for th extendedAnnotations and supportedTranscriptModels gtfs")
@jonathangoeke
Copy link
Copy Markdown
Member

Code review

Found 3 issues:

  1. writeToGFF3() dfTx block accesses mcols(annotation)$txScore and mcols(annotation)$txScore.noFit, but these columns do not exist on the annotation GRangesList. The exon block in the same function correctly uses mcols(annotation)$maxTxScore and mcols(annotation)$maxTxScore.noFit (lines 290-291). This will produce NULL values in transcript-level GFF3 attributes. (bug: wrong column name, copy-pasted from the same pre-existing bug in writeToGTF at lines 199-200)

bambu/R/readWrite.R

Lines 338 to 341 in 67f184d

dfTx$NDR <- paste("NDR=", as.character(mcols(annotation)$NDR), ";", sep = "")
dfTx$txScore <- paste("txScore=", as.character(mcols(annotation)$txScore), ";", sep = "")
dfTx$txScore.noFit <- paste("txScore.noFit=", as.character(mcols(annotation)$txScore.noFit), ";", sep = "")
dfTx$novelGene <- paste("novelGene=", as.character(mcols(annotation)$novelGene), ";", sep = "")

  1. writeToGFF3() writes mRNA records with Parent=GENE_ID and exon records with Parent=TX_ID, but no gene-level records are written. The GFF3 specification requires a three-level hierarchy (gene -> mRNA -> exon) where every Parent attribute references an ID that exists in the file. Without gene records, every mRNA row has a dangling Parent reference, making the output invalid for downstream tools that validate parent-child relationships (AGAT, gffread, IGV). (bug: GFF3 spec violation -- missing gene feature records)

bambu/R/readWrite.R

Lines 320 to 355 in 67f184d

# Create GFF3 formatted dataframes for exons and transcripts separately
# Exons: no need gene IDs but set Parent = transcript ID
dfExon <- mutate(df, source = "Bambu", feature = "exon", score = ".",
frame = ".", attributes = paste0(group_name, exon_rank, NDR, txScore, txScore.noFit, novelGene, novelTranscript, txClassDescription )) %>%
select(seqnames, source, feature, start, end, score,
strand, frame, attributes, group_name) %>%
mutate(attributes = sub("ID=", "Parent=", attributes))
# Dataframe for transcripts
dfTx <- as_tibble(as.data.frame(range(ranges(annotation))))
dfTx <- left_join(dfTx, geneIDs, by = c("group_name" = "TXNAME"))
dfTx$group_name <- paste("ID=", dfTx$group_name, ";", sep = "")
dfTx$GENEID <- paste("Parent=", dfTx$GENEID, ";", sep = "")
if(!is.null(mcols(annotation)$NDR)) {
dfTx$NDR <- paste("NDR=", as.character(mcols(annotation)$NDR), ";", sep = "")
dfTx$txScore <- paste("txScore=", as.character(mcols(annotation)$txScore), ";", sep = "")
dfTx$txScore.noFit <- paste("txScore.noFit=", as.character(mcols(annotation)$txScore.noFit), ";", sep = "")
dfTx$novelGene <- paste("novelGene=", as.character(mcols(annotation)$novelGene), ";", sep = "")
dfTx$novelTranscript <- paste("novelTranscript=", as.character(mcols(annotation)$novelTranscript), ";", sep = "")
dfTx$txClassDescription <- paste("txClassDescription=", as.character(mcols(annotation)$txClassDescription), ";", sep = "")
}
# Parse to gff3 format
dfTx <- mutate(dfTx, source = "Bambu", feature = "mRNA", score = ".",
frame = ".", attributes = paste0(GENEID, group_name, NDR, txScore, txScore.noFit, novelGene, novelTranscript, txClassDescription )) %>%
select(seqnames, source, feature, start, end, score,
strand, frame, attributes, group_name)
# Concat tx and exon, then drop the group_name
gff3 <- rbind(dfTx, dfExon) %>% group_by(group_name) %>%
arrange(as.character(seqnames), start) %>% ungroup() %>%
select(seqnames, source, feature, start, end, score,

  1. writeAnnotationsToGFF3() message says "gtfs" instead of "gff3s" and contains a typo "th " (should be "the"). This was copy-pasted from writeAnnotationsToGTF() without updating. (bug: misleading user-facing message referencing wrong format)

bambu/R/readWrite.R

Lines 376 to 378 in 67f184d

if(length(annotationAll) == length(annotation))
message("The current NDR threshold already outputs all transcript models. This may result in reduced precision for th extendedAnnotations and supportedTranscriptModels gtfs")
writeToGFF3(annotationAll, paste0(file, "allTranscriptModels.gff3"), geneIDs)

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

@cying111
Copy link
Copy Markdown
Collaborator

cying111 commented Apr 17, 2026

gff3 entries are not sorted, would suggest to follow sth similar to gtf entries? i.e., transcript entry, then followed by all exon entries sorted by chromosome coordinate, if need to add gene entry, gene entry can be on top of transcript entry

but I noted that writeToGTF output gtf entries are also not sorted, maybe need to change this as well?

One more thing when reading gff3 documentation, ID always comes before Parent, maybe also good to keep the same with the convention?

Missing IDs for exons, maybe can have sth as ID=exon_1, for all exons, so that rank is not necessary?

Screenshot 2026-04-17 at 11 10 04 AM

Comment thread R/readWrite.R
gtf <- writeAnnotationsToGTF(annotation = transcript_grList,
file = transcript_gtffn, outputExtendedAnno = outputExtendedAnno,
outputAll = outputAll, outputBambuModels = outputBambuModels, outputNovelOnly = outputNovelOnly)
} else if (!outputAnnFormat %in% c("gtf", "gff3")) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe should also allow for gff? and assume gff meant for gff3?

Comment thread R/readWrite.R
}

if (outputAnnFormat == "gff3") {
gff3 <- writeAnnotationsToGFF3(annotation = transcript_grList,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

writeAnnotationsToGFF3 and writeAnnotationsToGTF already write to files, so not need to assign to gff3, same for above, no need to assign to gtf, and both are not used in following lines as well

Suggested change
gff3 <- writeAnnotationsToGFF3(annotation = transcript_grList,
writeAnnotationsToGFF3(annotation = transcript_grList,

Comment thread R/readWrite.R
}


writeAnnotationsToGFF3 <- function(annotation, file, geneIDs = NULL, outputExtendedAnno = TRUE,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function maybe can be combined with writeAnnotationsToGTF to one writeAnnotations function, with one more parameter to call writeToGTF or writeToGFF3 later?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to do in the future as part of deduplicating code

Comment thread R/readWrite.R

df <- left_join(df, geneIDs[, c("TXNAME", "GENEID")],
by = c("group_name" = "TXNAME"))

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

processing until this step is duplicated from writeToGTF and writeToGFF3, maybe can consider to combine 2 functions?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to do in the future as part of deduplicating code

Comment thread R/readWrite.R
dfTx$GENEID <- paste("Parent=", dfTx$GENEID, ";", sep = "")


if(!is.null(mcols(annotation)$NDR)) {
Copy link
Copy Markdown
Collaborator

@cying111 cying111 Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this if section is duplicated (appears 4 times)? maybe can modularise it as well?

Copy link
Copy Markdown
Collaborator Author

@hafiz-ismail hafiz-ismail Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the column assignment is slightly different from the one at line 287

i.e. lines 296-302

    df$NDR <- paste("NDR=", as.character(NDR), ";", sep = "")
    df$txScore <- paste("maxTxScore=", as.character(txScore), ";", sep = "")
    df$txScore.noFit <- paste("maxTxScore.noFit=", as.character(txScore.noFit), ";", sep = "")
    df$novelGene <- paste("novelGene=", as.character(novelGene), ";", sep = "")
    df$novelTranscript <- paste("novelTranscript=", as.character(novelTranscript), ";", sep = "")
    df$txClassDescription <- paste("txClassDescription=", as.character(txClassDescription), ";", sep = "")
  }

vs lines 338-344

    dfTx$NDR <- paste("NDR=", as.character(mcols(annotation)$NDR), ";", sep = "")
    dfTx$txScore <- paste("txScore=", as.character(mcols(annotation)$txScore), ";", sep = "")
    dfTx$txScore.noFit <- paste("txScore.noFit=", as.character(mcols(annotation)$txScore.noFit), ";", sep = "")
    dfTx$novelGene <- paste("novelGene=", as.character(mcols(annotation)$novelGene), ";", sep = "")
    dfTx$novelTranscript <- paste("novelTranscript=", as.character(mcols(annotation)$novelTranscript), ";", sep = "")
    dfTx$txClassDescription <- paste("txClassDescription=", as.character(mcols(annotation)$txClassDescription), ";", sep = "")
  }

likewise the same block of code is also 'duplicated' in writeToGTF.
nevertheless, they are definitely very similar, can be rewritten and also modularized in the future

Comment thread R/readWrite.R
# replace * with . and remove trailing ;
gff3 <- mutate(gff3, strand = recode_factor(strand, `*` = "."),
attributes = sub(";$", "", attributes))

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure where sorting should happen, I think maybe the step before mutate would be fine

@cying111
Copy link
Copy Markdown
Collaborator

cying111 commented Apr 17, 2026

Code review from Claude code:

my feeling is that: issue 4 and 5 maybe also good to address, the usage of mRNA might also need to be changed, maybe can just be transcript?

1. Copy-paste "gtfs" in GFF3 message — score 75

R/readWrite.R:377:

message("The current NDR threshold already outputs all transcript models.
This may result in reduced precision for th extendedAnnotations and
supportedTranscriptModels gtfs")

Two problems: (a) "gtfs" should be "gff3s" — copy-paste leftover from
the GTF version at L249. (b) Pre-existing typo "th" → "the".

Fix location: R/readWrite.R:377 — change "gtfs" to "gff3s" and
"th" to "the".

2. Parent before ID in transcript attributes — score 60

R/readWrite.R:348:

attributes = paste0(GENEID, group_name, NDR, txScore, ...)

Where GENEID = "Parent=<gene>;" (L334) and group_name = "ID=<tx>;"
(L333). Produces Parent=gene;ID=tx;... — convention is ID first,
then Parent.

Verified against the official GFF3 spec:
the spec does not mandate attribute order in column 9. It's
convention, not requirement. Most parsers (IGV, JBrowse) handle
either order. Downgraded from initial score of 85.

Fix location: R/readWrite.R:348 — swap GENEID and group_name
in the paste0() call:

attributes = paste0(group_name, GENEID, NDR, txScore, ...)

3. Missing gene-level records — score 60

writeToGFF3() writes only mRNA and exon records. Standard GFF3
practice (Ensembl, NCBI) uses a three-level hierarchy:
gene → mRNA → exon.

Verified against the GFF3 spec
and NCBI GFF3 docs:
gene features are "useful but NOT strictly required". The mRNA + exon
hierarchy is valid GFF3 on its own. The existing GTF writer
(writeToGTF) also skips gene records — consistent design choice.

Some genome browsers and validators may warn but will not reject the
file. No code change required unless broader tool compatibility is
needed.

4. Missing @noRd / @export on GFF3 functions — score 75

writeToGFF3 (L266) and writeAnnotationsToGFF3 (L369) have no
roxygen documentation at all. The parallel GTF functions (writeToGTF
L147, writeAnnotationsToGTF L241) are both @exported with full
roxygen blocks.

Neither GFF3 function appears in NAMESPACE. If they are meant to be
internal-only, they should have @noRd. If users should be able to
call them directly (as with writeToGTF()), they need @export +
documentation.

Fix location: Add #' @noRd above writeToGFF3 (L266) and
writeAnnotationsToGFF3 (L369). Or, if public use is intended, add
full roxygen blocks matching the GTF equivalents and add to NAMESPACE.

5. Two independent if blocks instead of if/else if — score 50

R/readWrite.R:56 and R/readWrite.R:62:

if (outputAnnFormat == "gtf") {
    gtf <- writeAnnotationsToGTF(...)
}
if (outputAnnFormat == "gff3") {
    gff3 <- writeAnnotationsToGFF3(...)
}

Formats are mutually exclusive (validated at L45). Should be
if/else if to make exclusivity explicit and avoid a redundant
string comparison. Not a crash risk.

Fix location: R/readWrite.R:62 — change if to else if.

Pre-existing issues (faithfully reproduced from GTF writer)

These exist identically in the GTF writer and were copy-pasted into the
GFF3 version. Not introduced by this PR.

  • novelTranscript NA subsetting (L388 / GTF L260):
    annotation[mcols(annotation)$novelTranscript] — if novelTranscript
    contains NAs, logical subsetting includes NA-indexed elements. Both
    versions have the same pattern; unlikely to trigger in practice since
    bambu always sets novelTranscript to TRUE/FALSE.

  • recode_factor deprecated (L359 / GTF L214): dplyr 1.1.0+
    deprecates recode_factor in favor of case_match(). Will emit
    deprecation warnings on current dplyr.

  • txScore vs maxTxScore key naming asymmetry (L297 exon-level
    maxTxScore= vs L339 transcript-level txScore=): These access
    different mcols fields ($maxTxScore vs $txScore), so the naming
    reflects the underlying data. Inconsistent but correct.

  • mRNA feature type for all transcripts (L347): GFF3 uses mRNA
    for protein-coding transcripts. Non-coding RNA should use transcript
    or ncRNA. Using mRNA universally is technically incorrect per
    Sequence Ontology but common in practice. Design choice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

add functionality to write to GFF file

4 participants