# Exporting BioCantor data models

BioCantor data models can be exported to any of:

1. GenBank
2. GFF3
3. JSON
4. BED (`TranscriptInterval` and `FeatureInterval` only)

The JSON representation can be read directly by the `marshmallow` data structures that build the data model.

In [11]:
from inscripta.biocantor.io.gff3.parser import parse_standard_gff3, AnnotationCollectionModel

gff3 = "tests/data/INSC1006_chrI.gff3"

model = list(parse_standard_gff3(gff3))[0]
parsed = model.to_annotation_collection()

## GFF3

Each of the five interval objects in BioCantor can export itself directly to GFF3. When the export is called on a parent object, it recurses into each child object as well.

In [7]:
for gff_row in parsed.to_gff():
    print(gff_row)

CM021111.1	BioCantor	gene	16175	18079	.	-	.	ID=95dcc29c-0b5c-db9a-a1dc-83e2b81a7ccc;gene_biotype=ncRNA;gene_id=8ad3f444-384e-35e0-e560-aef88bd2863f;locus_tag=GI526_G0000001
CM021111.1	BioCantor	transcript	16175	18079	.	-	.	ID=90ee4e5b-64de-11fb-5d87-7a98577463bb;Parent=95dcc29c-0b5c-db9a-a1dc-83e2b81a7ccc;Name=GI526_G0000001;gene_biotype=ncRNA;gene_id=8ad3f444-384e-35e0-e560-aef88bd2863f;locus_tag=GI526_G0000001;ncrna_class=other;note=CAT%20transcript%20id:%20T0000001%3B%20CAT%20alignment%20id:%20IsoSeq-PB.2586.1%3B%20CAT%20novel%20prediction:%20IsoSeq;transcript_biotype=ncRNA;transcript_id=GI526_G0000001;transcript_name=GI526_G0000001
CM021111.1	BioCantor	exon	16175	18079	.	-	.	ID=exon-90ee4e5b-64de-11fb-5d87-7a98577463bb-1;Parent=90ee4e5b-64de-11fb-5d87-7a98577463bb;Name=GI526_G0000001;gene_biotype=ncRNA;gene_id=8ad3f444-384e-35e0-e560-aef88bd2863f;locus_tag=GI526_G0000001;ncrna_class=other;note=CAT%20transcript%20id:%20T0000001%3B%20CAT%20alignment%20id:%20IsoSeq-PB.2586.1%3B%20CAT%
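
The ninth column of each row above is a `;`-separated list of `key=value` pairs in which reserved characters are percent-encoded (for example `%20` for a space and `%3B` for a semicolon). The following is a minimal standard-library sketch of reading such a row back into a dict. `parse_gff3_row` is a hypothetical helper written for illustration, not part of BioCantor, and the example row is a shortened stand-in for the real output.

```python
from urllib.parse import unquote


def parse_gff3_row(row: str) -> dict:
    """Split one tab-delimited GFF3 feature row into named fields (hypothetical helper)."""
    columns = row.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError(f"expected 9 tab-separated columns, found {len(columns)}")
    seqid, source, feature_type, start, end, score, strand, phase, raw_attributes = columns

    attributes = {}
    for pair in raw_attributes.split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        # Decode percent-escapes only after splitting, so an encoded ';' stays inside its value.
        attributes[unquote(key)] = unquote(value)

    return {
        "seqid": seqid,
        "source": source,
        "type": feature_type,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "attributes": attributes,
    }


# Shortened, illustrative row modelled on the transcript row above (IDs are placeholders).
example_row = "\t".join([
    "CM021111.1", "BioCantor", "transcript", "16175", "18079", ".", "-", ".",
    "ID=tx1;Parent=gene1;note=CAT%20transcript%20id:%20T0000001%3B%20CAT%20novel%20prediction:%20IsoSeq",
])
record = parse_gff3_row(example_row)
print(record["attributes"]["note"])
```

Splitting before decoding matters here because the `note` value itself contains encoded semicolons.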

### GFF3 with FASTA

In addition to printing GFF3 rows directly, convenience functions exist to export GFF3 in one call, optionally including sequence information.

In [8]:
# this does not work because it was parsed without sequence information
from inscripta.biocantor.io.gff3.writer import collection_to_gff3

with open("/dev/null", "w") as fh:
    collection_to_gff3([parsed], fh, add_sequences=True)

GFF3ExportException: Cannot export FASTA in GFF3 if collection has no associated sequence

In [9]:
# parse the GFF3 with sequence instead this time and write to disk
from inscripta.biocantor.io.gff3.parser import parse_gff3_embedded_fasta

with open("/dev/null", "w") as fh:
    parsed_with_sequence = [x.to_annotation_collection() for x in parse_gff3_embedded_fasta(gff3)]
    collection_to_gff3(parsed_with_sequence, fh, add_sequences=True)
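
For orientation, a GFF3 file with embedded sequence places a `##FASTA` directive after the feature rows, followed by ordinary FASTA records. The sketch below builds that layout with the standard library only. `write_gff3_with_fasta` is a hypothetical helper and makes no claim about how `collection_to_gff3` is implemented; the row and sequence are made-up examples.

```python
import io


def write_gff3_with_fasta(rows, sequences, handle, line_width=60):
    """Write feature rows, then an optional ##FASTA section (hypothetical helper)."""
    handle.write("##gff-version 3\n")
    for row in rows:
        handle.write(row + "\n")
    if sequences:
        # Everything after this directive is interpreted as FASTA records.
        handle.write("##FASTA\n")
        for name, sequence in sequences.items():
            handle.write(f">{name}\n")
            for offset in range(0, len(sequence), line_width):
                handle.write(sequence[offset:offset + line_width] + "\n")


buffer = io.StringIO()
write_gff3_with_fasta(
    rows=["chrTest\texample\tgene\t1\t12\t.\t+\t.\tID=gene1"],
    sequences={"chrTest": "ATGAAACCCGGGTTTTAA"},
    handle=buffer,
    line_width=10,
)
gff3_text = buffer.getvalue()
print(gff3_text)
```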

## JSON

Each object also has a `to_dict()` method, which produces a dict that the `marshmallow` schemas can load. As a result, loading that dict and exporting it again reproduces the original dict:

In [12]:
AnnotationCollectionModel.Schema().load(parsed.to_dict()).to_annotation_collection().to_dict() == parsed.to_dict()

True

However, the comparison below evaluates to `False`, only because the marshmallow schemas are `Ordered`, and so produce an `OrderedDict`:

In [14]:
parsed.to_dict() == AnnotationCollectionModel.Schema().dump(model)

False

## BED

BED export is only valid on `TranscriptInterval` and `FeatureInterval` objects, because BED format does not model relationships between rows. All models are exported in `BED12` format.

In [15]:
for gene_or_feature_collection in parsed:
    for transcript_or_feature in gene_or_feature_collection:
        print(transcript_or_feature.to_bed12())


CM021111.1	16174	18079	GI526_G0000001	0	-	0	0	0,0,0	1	1905	0
CM021111.1	37461	39103	GDH3	0	+	37637	39011	0,0,0	1	1642	0
CM021111.1	39518	40772	BDH2	0	+	39518	40772	0,0,0	1	1254	0
CM021111.1	41085	42503	BDH1	0	+	0	0	0,0,0	1	1418	0
CM021111.1	42579	43218	ECM1	0	+	0	0	0,0,0	1	639	0
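
Comparing the first BED row with the GFF3 output earlier shows the usual coordinate shift: GFF3 uses 1-based inclusive coordinates (`16175..18079`), while BED uses 0-based half-open coordinates (`16174..18079`). The sketch below illustrates that conversion and how the BED12 block columns resolve to genomic intervals. Both functions are hypothetical helpers for illustration, not BioCantor APIs.

```python
def gff3_to_bed_coordinates(start: int, end: int) -> tuple:
    """Convert a 1-based inclusive interval (GFF3) to 0-based half-open (BED)."""
    return start - 1, end


def parse_bed12(row: str) -> dict:
    """Parse one tab-delimited BED12 row and resolve its blocks (hypothetical helper)."""
    fields = row.rstrip("\n").split("\t")
    if len(fields) != 12:
        raise ValueError(f"expected 12 tab-separated columns, found {len(fields)}")
    chrom_start = int(fields[1])
    block_sizes = [int(x) for x in fields[10].split(",") if x]
    block_starts = [int(x) for x in fields[11].split(",") if x]
    # Block starts are offsets relative to chromStart.
    blocks = [
        (chrom_start + offset, chrom_start + offset + size)
        for offset, size in zip(block_starts, block_sizes)
    ]
    return {
        "chrom": fields[0],
        "start": chrom_start,
        "end": int(fields[2]),
        "name": fields[3],
        "strand": fields[5],
        "thick_start": int(fields[6]),
        "thick_end": int(fields[7]),
        "blocks": blocks,
    }


# First row of the BED12 output above.
bed_row = "\t".join([
    "CM021111.1", "16174", "18079", "GI526_G0000001", "0", "-",
    "0", "0", "0,0,0", "1", "1905", "0",
])
bed = parse_bed12(bed_row)
print(bed["blocks"])
print(gff3_to_bed_coordinates(16175, 18079))
```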


## GenBank

All models can be exported to GenBank. GenBank export is written in either the *prokaryotic* or the *eukaryotic* flavor. See the document on parsing GenBank files for an explanation of the difference.

GenBank export is problematic for genomes that have multiple isoforms per gene, because the format cannot explicitly define the hierarchical relationship between features. BioCantor GenBank output is always sorted by locus, which helps resolve this ambiguity.

GenBank export also has the ability to produce GenBank files compatible with the Inscripta Engineering Portal. This mode of export ensures that there is always a unique `/gene` tag on every feature, and that `CDS` features have a `/translation` tag.

The `organism` and `source` fields can be set by keyword arguments.

The two flavors differ in their feature hierarchy. Eukaryotic GenBank files contain an `mRNA` feature as a child of a `gene` feature and parent of a `CDS` feature,
while prokaryotic GenBank files skip the `mRNA` feature and only have `gene` and `CDS`. The GenBank writing function
defaults to the prokaryotic flavor, but this can be changed by passing `genbank_type=GenbankFlavor.EUKARYOTIC`.
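
The difference between the two hierarchies can be sketched as follows. The `Flavor` enum and `coding_gene_feature_types` function are hypothetical stand-ins written for this illustration; they are not BioCantor's `GenbankFlavor` and do not reflect how the writer is implemented.

```python
from enum import Enum


class Flavor(Enum):
    """Illustrative stand-in for the two GenBank flavors (not BioCantor's GenbankFlavor)."""

    PROKARYOTIC = "prokaryotic"
    EUKARYOTIC = "eukaryotic"


def coding_gene_feature_types(flavor: Flavor = Flavor.PROKARYOTIC) -> list:
    """Return the feature types written for one coding gene, from parent to child."""
    if flavor is Flavor.EUKARYOTIC:
        # gene -> mRNA -> CDS
        return ["gene", "mRNA", "CDS"]
    # gene -> CDS, with no intermediate mRNA feature
    return ["gene", "CDS"]


for flavor in Flavor:
    print(flavor.value, " > ".join(coding_gene_feature_types(flavor)))
```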

In [16]:
from inscripta.biocantor.io.genbank.writer import collection_to_genbank

with open("/dev/null", "w") as fh:
    collection_to_genbank([parsed], fh)

GenBankExportError: Cannot export GenBank if collections do not have sequence information

In [17]:
from tempfile import TemporaryDirectory
from pathlib import Path

with TemporaryDirectory() as tmp_dir:
    tmp_file = Path(tmp_dir) / "test.gbk"
    with open(tmp_file, "w") as fh:
        # write through the open handle, as in the previous cell
        collection_to_genbank(parsed_with_sequence, fh)
    with open(tmp_file, "r") as fh:
        print(fh.read()[:2000])

LOCUS       CM021111.1             50040 bp    DNA              UNK 01-JAN-1980
DEFINITION  GenBank produced by BioCantor.
ACCESSION   CM021111
VERSION     CM021111.1
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .
FEATURES             Location/Qualifiers
     gene            complement(16175..18079)
                     /gene_id="8ad3f444-384e-35e0-e560-aef88bd2863f"
                     /gene_biotype="ncRNA"
                     /locus_tag="GI526_G0000001"
                     /gene="8ad3f444-384e-35e0-e560-aef88bd2863f"
     ncRNA           complement(16175..18079)
                     /ncrna_class="other"
                     /note="CAT transcript id: T0000001; CAT alignment id:
                     IsoSeq-PB.2586.1; CAT novel prediction: IsoSeq"
                     /transcript_id="GI526_G0000001"
                     /transcript_name="GI526_G0000001"
                     /transcript_biotype="ncRNA"
                     /gene="8ad3f444-384e-35e0-e560-aef88bd2863f"
        