Added functionality for writeBambuOutput() to write to gff3 by hafiz-ismail · Pull Request #567 · GoekeLab/bambu

hafiz-ismail · 2026-04-16T06:07:09Z

Related Issue

Fixes # (issue number)

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (please specify if the change breaks existing functionality)
- Non breaking change (the feature doesn't change existing functionality)
- Breaking change (the feature that would cause existing functionality to not work as expected)
Documentation update
Performance optimization

Description

Implemented functionality for writeBambuOutput to export gff3 files from SummarizedExperiment objects. Currently Bambu can export the more stringent gtf file format, but there is no functionality for gff3. An overview of the differences of the 2 file formats:

GTF differs from GFF3 mainly in the 9th column (attribute) and the syntax of the key:value pairs.

GTF:

gene_id "GENE1"; transcript_id "TX1"; gene_name "BRCA1";
gene_id "GENE1"; transcript_id "TX1"; exon_number "1";

GTF documentation (Ensembl)

GFF3:

ID=TX1;Parent=GENE1;Name=BRCA1
ID=exon1;Parent=TX1;exon_number=1

GFF3 documentation (Ensembl)

Also, GTF attributes have a trailing ; whereas GFF3 does not.

Other differences include

GFF3 header
GTF column 3 (feature) is constrained to lesser terms than GFF3

Implementation Details

All code changes were implemented in readWrite.R.

Implemented two new helper functions: writeToGFF3 and writeAnnotationsToGFF3, following convention from already existing writeToGTF and writeAnnotationsToGTF
Implemented new parameter outputAnnFormat in writeBambuOutput: takes either "gtf" (default) or "gff3" as arguments
renamed transcript_gtffn to transcript_annfn
Updated documentation in README and comment markers parsed by roxygen2

Impact of Changes

All previous functionality is preserved, writeBambuOutput still exports to .gtf by default unless specified otherwise. It was found that the current writeBambuOutput in devel_pre_v4 is not fully functional - therefore a minimal function is provided to test results as below:

writeBambuOutput <- function(se, path, prefix = "", outputExtendedAnno = TRUE, 
                             outputAll = TRUE, outputBambuModels = TRUE, outputNovelOnly = TRUE, seperateSamples = FALSE,
                             outputAnnFormat = "gtf") {
  
  if (missing(se) | missing(path)) {
    stop("Both summarizedExperiment object from bambu and
            the path for the output files are required.")
  } else if (!outputAnnFormat %in% c("gtf", "gff3")) {
    stop("Please specify a valid annotation outputAnnFormat: 'gtf' or 'gff3', or leave blank to output to GTF by default.")
  } else {
    outdir <- paste0(path, "/")
    if (!dir.exists(outdir))
      dir.create(outdir, recursive = TRUE)
    
    transcript_grList <- rowRanges(se)
    prefix <- ifelse(prefix != "", paste0(prefix, "_"), "")
    transcript_annfn <- paste(outdir, prefix, sep = "")
    
    if (outputAnnFormat == "gtf") {
      gtf <- writeAnnotationsToGTF(annotation = transcript_grList,
                                   file = transcript_annfn, outputExtendedAnno = outputExtendedAnno, 
                                   outputAll = outputAll, outputBambuModels = outputBambuModels, outputNovelOnly = outputNovelOnly)
    }
    
    if (outputAnnFormat == "gff3") {
      gff3 <- writeAnnotationsToGFF3(annotation = transcript_grList,
                                     file = transcript_annfn, outputExtendedAnno = outputExtendedAnno, 
                                     outputAll = outputAll, outputBambuModels = outputBambuModels, outputNovelOnly = outputNovelOnly)
    }
  }
}

In addition, this output has not been updated for compatibility with importBambuResults - providing functionality for this will be associated with future implementations for the prepareAnnotations function

Checklist

My code follows the style guidelines of this project.
I have performed a self-review of my own code.
I have commented my code, particularly in hard-to-understand areas.
I have made corresponding changes to the documentation (vignettes, man pages).
I have tested the code on a full dataset, and any differences have been described in the Impact of Changes section.

Copilot

Pull request overview

Adds optional GFF3 annotation export support to writeBambuOutput() so Bambu results can be written as either GTF (default) or GFF3, and updates user-facing documentation accordingly.

Changes:

Added outputAnnFormat parameter to writeBambuOutput() to select "gtf" vs "gff3".
Implemented writeToGFF3() / writeAnnotationsToGFF3() helpers analogous to existing GTF writers.
Updated README text to mention GFF3 output support.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File	Description
README.md	Documents the new GFF3 option for annotation output.
R/readWrite.R	Adds `outputAnnFormat` to `writeBambuOutput()` and introduces GFF3-writing helpers.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

 For a full description of the other outputs see [Output Description](#Output-Description)

-The full output can be written to a file using writeBambuOutput().  Using this function will generate six files, including 4four .gtf files(detailed below), and two .txt files for the expression counts at transcript and gene levels.
+The full output can be written to a file using writeBambuOutput().  Using this function will generate six files, including four .gtf/.gff3 files(detailed below), and two .txt files for the expression counts at transcript and gene levels.


 For a full description of the other outputs see [Output Description](#Output-Description)

-The full output can be written to a file using writeBambuOutput().  Using this function will generate six files, including 4four .gtf files(detailed below), and two .txt files for the expression counts at transcript and gene levels.
+The full output can be written to a file using writeBambuOutput().  Using this function will generate six files, including four .gtf/.gff3 files(detailed below), and two .txt files for the expression counts at transcript and gene levels.


 #' @param path the destination of the output files 
 #' (gtf, transcript counts, and gene counts)
 #' @param prefix the prefix of the output files
-#' @details The function will write the output from Bambu to files. The 
-#' annotations will be written to a .gtf file, transcript counts (total counts, 
+#' @param outputAnnFormat the file format in which to output annotations, must
+#' be one of \code{"gtf"} or \code{"gff3"}. \code{"gtf"} is specified by default.
+#' @details The function will write the output from Bambu to files. The
+#' annotations will be written to a .gtf or .gff3 file, transcript counts (total counts, 
 #' CPM, full-length counts, partial-length counts, and unique counts) and gene counts
-#' will be written to .txt files.  
+#' will be written to .txt files.


 writeBambuOutput <- function(se, path, prefix = "", outputExtendedAnno = TRUE, 
-                             outputAll = TRUE, outputBambuModels = TRUE, outputNovelOnly = TRUE, seperateSamples = FALSE) {
-    if (missing(se) | missing(path)) {
-        stop("Both summarizedExperiment object from bambu and
+                             outputAll = TRUE, outputBambuModels = TRUE, outputNovelOnly = TRUE, seperateSamples = FALSE,
+                             outputAnnFormat = "gtf") {


+    if (outputAnnFormat == "gtf") {
+      gtf <- writeAnnotationsToGTF(annotation = transcript_grList,
+                                   file = transcript_annfn, outputExtendedAnno = outputExtendedAnno, 
+                                   outputAll = outputAll, outputBambuModels = outputBambuModels, outputNovelOnly = outputNovelOnly)
+    }
+
+    if (outputAnnFormat == "gff3") {
+      gff3 <- writeAnnotationsToGFF3(annotation = transcript_grList,
+                                     file = transcript_annfn, outputExtendedAnno = outputExtendedAnno, 
+                                     outputAll = outputAll, outputBambuModels = outputBambuModels, outputNovelOnly = outputNovelOnly)
+    }


+  if(outputAll){
+    annotationAll = setNDR(annotation, 1)
+    if(length(annotationAll) == length(annotation)) 
+      message("The current NDR threshold already outputs all transcript models. This may result in reduced precision for th extendedAnnotations and supportedTranscriptModels gtfs")


jonathangoeke · 2026-04-16T06:40:50Z

Code review

Found 3 issues:

writeToGFF3() dfTx block accesses mcols(annotation)$txScore and mcols(annotation)$txScore.noFit, but these columns do not exist on the annotation GRangesList. The exon block in the same function correctly uses mcols(annotation)$maxTxScore and mcols(annotation)$maxTxScore.noFit (lines 290-291). This will produce NULL values in transcript-level GFF3 attributes. (bug: wrong column name, copy-pasted from the same pre-existing bug in writeToGTF at lines 199-200)

bambu/R/readWrite.R

Lines 338 to 341 in 67f184d

    
           dfTx$NDR <- paste("NDR=", as.character(mcols(annotation)$NDR), ";", sep = "") 
        
           dfTx$txScore <- paste("txScore=", as.character(mcols(annotation)$txScore), ";", sep = "") 
        
           dfTx$txScore.noFit <- paste("txScore.noFit=", as.character(mcols(annotation)$txScore.noFit), ";", sep = "") 
        
           dfTx$novelGene <- paste("novelGene=", as.character(mcols(annotation)$novelGene), ";", sep = "")

writeToGFF3() writes mRNA records with Parent=GENE_ID and exon records with Parent=TX_ID, but no gene-level records are written. The GFF3 specification requires a three-level hierarchy (gene -> mRNA -> exon) where every Parent attribute references an ID that exists in the file. Without gene records, every mRNA row has a dangling Parent reference, making the output invalid for downstream tools that validate parent-child relationships (AGAT, gffread, IGV). (bug: GFF3 spec violation -- missing gene feature records)

bambu/R/readWrite.R

Lines 320 to 355 in 67f184d

    
           # Create GFF3 formatted dataframes for exons and transcripts separately 
        
           # Exons: no need gene IDs but set Parent = transcript ID 
        
           dfExon <- mutate(df, source = "Bambu", feature = "exon", score = ".", 
        
                            frame = ".", attributes = paste0(group_name, exon_rank, NDR, txScore, txScore.noFit, novelGene, novelTranscript, txClassDescription )) %>% 
        
             select(seqnames, source, feature, start, end, score, 
        
                    strand, frame, attributes, group_name) %>% 
        
             mutate(attributes = sub("ID=", "Parent=", attributes)) 
        
           # Dataframe for transcripts 
        
           dfTx <- as_tibble(as.data.frame(range(ranges(annotation)))) 
        
           dfTx <- left_join(dfTx, geneIDs, by = c("group_name" = "TXNAME")) 
        
           dfTx$group_name <- paste("ID=", dfTx$group_name, ";", sep = "") 
        
           dfTx$GENEID <- paste("Parent=", dfTx$GENEID, ";", sep = "") 
        
           if(!is.null(mcols(annotation)$NDR)) { 
        
             dfTx$NDR <- paste("NDR=", as.character(mcols(annotation)$NDR), ";", sep = "") 
        
             dfTx$txScore <- paste("txScore=", as.character(mcols(annotation)$txScore), ";", sep = "") 
        
             dfTx$txScore.noFit <- paste("txScore.noFit=", as.character(mcols(annotation)$txScore.noFit), ";", sep = "") 
        
             dfTx$novelGene <- paste("novelGene=", as.character(mcols(annotation)$novelGene), ";", sep = "") 
        
             dfTx$novelTranscript <- paste("novelTranscript=", as.character(mcols(annotation)$novelTranscript), ";", sep = "") 
        
             dfTx$txClassDescription <- paste("txClassDescription=", as.character(mcols(annotation)$txClassDescription), ";", sep = "") 
        
           } 
        
           # Parse to gff3 format 
        
           dfTx <- mutate(dfTx, source = "Bambu", feature = "mRNA", score = ".", 
        
                          frame = ".", attributes = paste0(GENEID, group_name, NDR, txScore, txScore.noFit, novelGene, novelTranscript, txClassDescription )) %>% 
        
             select(seqnames, source, feature, start, end, score, 
        
                    strand, frame, attributes, group_name) 
        
           # Concat tx and exon, then drop the group_name 
        
           gff3 <- rbind(dfTx, dfExon) %>% group_by(group_name) %>% 
        
             arrange(as.character(seqnames), start) %>% ungroup() %>% 
        
             select(seqnames, source, feature, start, end, score,

writeAnnotationsToGFF3() message says "gtfs" instead of "gff3s" and contains a typo "th " (should be "the"). This was copy-pasted from writeAnnotationsToGTF() without updating. (bug: misleading user-facing message referencing wrong format)

bambu/R/readWrite.R

Lines 376 to 378 in 67f184d

    
           if(length(annotationAll) == length(annotation))  
        
             message("The current NDR threshold already outputs all transcript models. This may result in reduced precision for th extendedAnnotations and supportedTranscriptModels gtfs") 
        
           writeToGFF3(annotationAll, paste0(file, "allTranscriptModels.gff3"), geneIDs)

🤖 Generated with Claude Code

_{- If this code review was useful, please react with 👍. Otherwise, react with 👎.}

cying111 · 2026-04-17T03:11:21Z

gff3 entries are not sorted, would suggest to follow sth similar to gtf entries? i.e., transcript entry, then followed by all exon entries sorted by chromosome coordinate, if need to add gene entry, gene entry can be on top of transcript entry

but I noted that writeToGTF output gtf entries are also not sorted, maybe need to change this as well?

One more thing when reading gff3 documentation, ID always comes before Parent, maybe also good to keep the same with the convention?

Missing IDs for exons, maybe can have sth as ID=exon_1, for all exons, so that rank is not necessary?

cying111 · 2026-04-17T03:27:15Z

-        gtf <- writeAnnotationsToGTF(annotation = transcript_grList,
-            file = transcript_gtffn, outputExtendedAnno = outputExtendedAnno, 
-            outputAll = outputAll, outputBambuModels = outputBambuModels, outputNovelOnly = outputNovelOnly)
+  } else if (!outputAnnFormat %in% c("gtf", "gff3")) {


maybe should also allow for gff? and assume gff meant for gff3?

cying111 · 2026-04-17T03:33:03Z

+    }
+
+    if (outputAnnFormat == "gff3") {
+      gff3 <- writeAnnotationsToGFF3(annotation = transcript_grList,


writeAnnotationsToGFF3 and writeAnnotationsToGTF already write to files, so not need to assign to gff3, same for above, no need to assign to gtf, and both are not used in following lines as well

Suggested change

gff3 <- writeAnnotationsToGFF3(annotation = transcript_grList,

writeAnnotationsToGFF3(annotation = transcript_grList,

cying111 · 2026-04-17T03:36:00Z

+}
+
+
+writeAnnotationsToGFF3 <- function(annotation, file, geneIDs = NULL, outputExtendedAnno = TRUE, 


this function maybe can be combined with writeAnnotationsToGTF to one writeAnnotations function, with one more parameter to call writeToGTF or writeToGFF3 later?

to do in the future as part of deduplicating code

cying111 · 2026-04-17T03:43:19Z

+
+  df <- left_join(df, geneIDs[, c("TXNAME", "GENEID")], 
+                  by = c("group_name" = "TXNAME"))
+


processing until this step is duplicated from writeToGTF and writeToGFF3, maybe can consider to combine 2 functions?

to do in the future as part of deduplicating code

cying111 · 2026-04-17T03:43:42Z

+  dfTx$GENEID <- paste("Parent=", dfTx$GENEID, ";", sep = "")
+
+
+  if(!is.null(mcols(annotation)$NDR)) {


this if section is duplicated (appears 4 times)? maybe can modularise it as well?

the column assignment is slightly different from the one at line 287

i.e. lines 296-302

df$NDR <- paste("NDR=", as.character(NDR), ";", sep = "") df$txScore <- paste("maxTxScore=", as.character(txScore), ";", sep = "") df$txScore.noFit <- paste("maxTxScore.noFit=", as.character(txScore.noFit), ";", sep = "") df$novelGene <- paste("novelGene=", as.character(novelGene), ";", sep = "") df$novelTranscript <- paste("novelTranscript=", as.character(novelTranscript), ";", sep = "") df$txClassDescription <- paste("txClassDescription=", as.character(txClassDescription), ";", sep = "") }

vs lines 338-344

dfTx$NDR <- paste("NDR=", as.character(mcols(annotation)$NDR), ";", sep = "") dfTx$txScore <- paste("txScore=", as.character(mcols(annotation)$txScore), ";", sep = "") dfTx$txScore.noFit <- paste("txScore.noFit=", as.character(mcols(annotation)$txScore.noFit), ";", sep = "") dfTx$novelGene <- paste("novelGene=", as.character(mcols(annotation)$novelGene), ";", sep = "") dfTx$novelTranscript <- paste("novelTranscript=", as.character(mcols(annotation)$novelTranscript), ";", sep = "") dfTx$txClassDescription <- paste("txClassDescription=", as.character(mcols(annotation)$txClassDescription), ";", sep = "") }

likewise the same block of code is also 'duplicated' in writeToGTF.
nevertheless, they are definitely very similar, can be rewritten and also modularized in the future

cying111 · 2026-04-17T03:45:49Z

+  # replace * with . and remove trailing ;
+  gff3 <- mutate(gff3, strand = recode_factor(strand, `*` = "."),
+                 attributes = sub(";$", "", attributes))
+


not sure where sorting should happen, I think maybe the step before mutate would be fine

cying111 · 2026-04-17T03:50:08Z

Code review from Claude code:

my feeling is that: issue 4 and 5 maybe also good to address, the usage of mRNA might also need to be changed, maybe can just be transcript?

1. Copy-paste "gtfs" in GFF3 message — score 75

R/readWrite.R:377:

message("The current NDR threshold already outputs all transcript models.
This may result in reduced precision for th extendedAnnotations and
supportedTranscriptModels gtfs")

Two problems: (a) "gtfs" should be "gff3s" — copy-paste leftover from
the GTF version at L249. (b) Pre-existing typo "th" → "the".

Fix location: R/readWrite.R:377 — change "gtfs" to "gff3s" and
"th" to "the".

2. `Parent` before `ID` in transcript attributes — score 60

R/readWrite.R:348:

attributes = paste0(GENEID, group_name, NDR, txScore, ...)

Where GENEID = "Parent=<gene>;" (L334) and group_name = "ID=<tx>;"
(L333). Produces Parent=gene;ID=tx;... — convention is ID first,
then Parent.

Verified against the official GFF3 spec:
the spec does not mandate attribute order in column 9. It's
convention, not requirement. Most parsers (IGV, JBrowse) handle
either order. Downgraded from initial score of 85.

Fix location: R/readWrite.R:348 — swap GENEID and group_name
in the paste0() call:

attributes = paste0(group_name, GENEID, NDR, txScore, ...)

3. Missing gene-level records — score 60

writeToGFF3() writes only mRNA and exon records. Standard GFF3
practice (Ensembl, NCBI) uses a three-level hierarchy:
gene → mRNA → exon.

Verified against the GFF3 spec
and NCBI GFF3 docs:
gene features are "useful but NOT strictly required". The mRNA + exon
hierarchy is valid GFF3 on its own. The existing GTF writer
(writeToGTF) also skips gene records — consistent design choice.

Some genome browsers and validators may warn but will not reject the
file. No code change required unless broader tool compatibility is
needed.

4. Missing `@noRd` / `@export` on GFF3 functions — score 75

writeToGFF3 (L266) and writeAnnotationsToGFF3 (L369) have no
roxygen documentation at all. The parallel GTF functions (writeToGTF
L147, writeAnnotationsToGTF L241) are both @exported with full
roxygen blocks.

Neither GFF3 function appears in NAMESPACE. If they are meant to be
internal-only, they should have @noRd. If users should be able to
call them directly (as with writeToGTF()), they need @export +
documentation.

Fix location: Add #' @noRd above writeToGFF3 (L266) and
writeAnnotationsToGFF3 (L369). Or, if public use is intended, add
full roxygen blocks matching the GTF equivalents and add to NAMESPACE.

5. Two independent `if` blocks instead of `if/else if` — score 50

R/readWrite.R:56 and R/readWrite.R:62:

if (outputAnnFormat == "gtf") {
    gtf <- writeAnnotationsToGTF(...)
}
if (outputAnnFormat == "gff3") {
    gff3 <- writeAnnotationsToGFF3(...)
}

Formats are mutually exclusive (validated at L45). Should be
if/else if to make exclusivity explicit and avoid a redundant
string comparison. Not a crash risk.

Fix location: R/readWrite.R:62 — change if to else if.

Pre-existing issues (faithfully reproduced from GTF writer)

These exist identically in the GTF writer and were copy-pasted into the
GFF3 version. Not introduced by this PR.

novelTranscript NA subsetting (L388 / GTF L260):
annotation[mcols(annotation)$novelTranscript] — if novelTranscript
contains NAs, logical subsetting includes NA-indexed elements. Both
versions have the same pattern; unlikely to trigger in practice since
bambu always sets novelTranscript to TRUE/FALSE.
recode_factor deprecated (L359 / GTF L214): dplyr 1.1.0+
deprecates recode_factor in favor of case_match(). Will emit
deprecation warnings on current dplyr.
txScore vs maxTxScore key naming asymmetry (L297 exon-level
maxTxScore= vs L339 transcript-level txScore=): These access
different mcols fields ($maxTxScore vs $txScore), so the naming
reflects the underlying data. Inconsistent but correct.
mRNA feature type for all transcripts (L347): GFF3 uses mRNA
for protein-coding transcripts. Non-coding RNA should use transcript
or ncRNA. Using mRNA universally is technically incorrect per
Sequence Ontology but common in practice. Design choice.

implement gff3 functionality

67f184d

hafiz-ismail requested a review from Copilot April 16, 2026 06:07

hafiz-ismail assigned jonathangoeke and cying111 Apr 16, 2026

hafiz-ismail linked an issue Apr 16, 2026 that may be closed by this pull request

add functionality to write to GFF file #564

Open

Copilot started reviewing on behalf of hafiz-ismail April 16, 2026 06:07 View session

Copilot AI reviewed Apr 16, 2026

View reviewed changes

cying111 reviewed Apr 17, 2026

View reviewed changes

	gff3 <- writeAnnotationsToGFF3(annotation = transcript_grList,
	writeAnnotationsToGFF3(annotation = transcript_grList,

		}


		writeAnnotationsToGFF3 <- function(annotation, file, geneIDs = NULL, outputExtendedAnno = TRUE,


		df <- left_join(df, geneIDs[, c("TXNAME", "GENEID")],
		by = c("group_name" = "TXNAME"))

		dfTx$GENEID <- paste("Parent=", dfTx$GENEID, ";", sep = "")


		if(!is.null(mcols(annotation)$NDR)) {

Conversation

hafiz-ismail commented Apr 16, 2026

Related Issue

Type of Change

Description

Implementation Details

Impact of Changes

Checklist

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

jonathangoeke commented Apr 16, 2026

Code review

Uh oh!

cying111 commented Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cying111 Apr 17, 2026

Choose a reason for hiding this comment

Uh oh!

cying111 Apr 17, 2026

Choose a reason for hiding this comment

Uh oh!

cying111 Apr 17, 2026

Choose a reason for hiding this comment

Uh oh!

hafiz-ismail Apr 17, 2026

Choose a reason for hiding this comment

Uh oh!

cying111 Apr 17, 2026

Choose a reason for hiding this comment

Uh oh!

hafiz-ismail Apr 17, 2026

Choose a reason for hiding this comment

Uh oh!

cying111 Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hafiz-ismail Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cying111 Apr 17, 2026

Choose a reason for hiding this comment

Uh oh!

cying111 commented Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

1. Copy-paste "gtfs" in GFF3 message — score 75

2. Parent before ID in transcript attributes — score 60

3. Missing gene-level records — score 60

4. Missing @noRd / @export on GFF3 functions — score 75

5. Two independent if blocks instead of if/else if — score 50

Pre-existing issues (faithfully reproduced from GTF writer)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

cying111 commented Apr 17, 2026 •

edited

Loading

cying111 Apr 17, 2026 •

edited

Loading

hafiz-ismail Apr 17, 2026 •

edited

Loading

cying111 commented Apr 17, 2026 •

edited

Loading

2. `Parent` before `ID` in transcript attributes — score 60

4. Missing `@noRd` / `@export` on GFF3 functions — score 75

5. Two independent `if` blocks instead of `if/else if` — score 50