# Overview

This notebook collects data from the [Raj et al. 2020](https://www.frontiersin.org/articles/10.3389/fcell.2021.743421/full) zebrafish brain developmental cell atlas.

#### Dataset description

- Data can be found at the [GEO Accession GSE158142](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE158142).
- The genome annotation used is available at [GEO Accession GSE105010](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE105010).
- This notebook previously collected data from one sample, ["zBr15dpf1_S1"](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM4793235); the download cell below fetches sample `zBr15dpf6_S1` (`GSM4793240`) for the purposes of data analysis and exploration.

# 0. Setup

Import packages and specify any important functions here.

In [1]:
# import standard python packages
import pandas as pd
import subprocess, os, dill, sys

# add the utils and env directories to the path
sys.path.append('../../utils/')
sys.path.append('../../env/')

# import functions from utils directory files
from string_functions import *
from biofile_handling import *

# import paths to software installs from env
from install_locs import *

# 1. Download and describe data

#### BioFileDocket
First, make a BioFileDocket for the dataset.

In [2]:
################
# general info #
################

# Specify the name of the species folder in Amazon S3
species = 'Danio_rerio'

# Specify any particular identifying conditions, eg tissue type:
conditions = 'juvbrain'

################
################

sample_BFD = BioFileDocket(species, conditions)

/home/ec2-user/glial-origins/output/Drer_juvbrain/ already exists
Files will be saved into /home/ec2-user/glial-origins/output/Drer_juvbrain/


In [3]:
################
# general info #
################

# Specify url and other variables
genome_fasta_url = 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/002/035/GCA_000002035.3_GRCz10/GCA_000002035.3_GRCz10_genomic.fna.gz'
genome_version = 'GRCz10-86'

annot_url = 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE105nnn/GSE105010/suppl/GSE105010_Danio_rerio.GRCz10.86.modified.gtf.gz'

barcodes_address = 'https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM4793nnn/GSM4793240/suppl/GSM4793240_comb-zBr15dpf6_S1-barcodes.tsv.gz'
features_address = 'https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM4793nnn/GSM4793240/suppl/GSM4793240_comb-zBr15dpf6_S1-genes.tsv.gz'
matrix_address = 'https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM4793nnn/GSM4793240/suppl/GSM4793240_comb-zBr15dpf6_S1-matrix.mtx.gz'

###########
# runtime #
###########

protocol = 'curl'

genome_fasta_original = GenomeFastaFile(
    sampledict = sample_BFD.sampledict,
    version = genome_version,
    url = genome_fasta_url,
    protocol = protocol
)

# Renames RefSeq chromosomes, returning a new object
genome_fasta = genome_fasta_original.rename_RefSeq_chromosomes()

annot = GenomeGffFile(
    sampledict = sample_BFD.sampledict,
    reference_genome = genome_fasta,
    url = annot_url,
    protocol = protocol
)

# Download CellRanger files from urls
cellranger_filegroup = CellRangerFileGroup(
    sampledict = sample_BFD.sampledict, 
    barcodes_address = barcodes_address,
    features_address = features_address,
    matrix_address = matrix_address,
    how = 'url',
    protocol = protocol)

# Use method to generate a gxc file from cellranger file
# Will display first 10 rows of file before saving
# For some reason this takes a long time?
gxc = cellranger_filegroup.to_gxc(
    filename = 'GSM4793240_comb-zBr15dpf6_S1-gxc.tsv',
    reference_genome = genome_fasta,
    reference_annot = annot
)

keyfiles = {
    'annot': annot,
    'genome_fasta': genome_fasta,
    'genome_fasta_original': genome_fasta_original,
    'gxc': gxc,
    'cellranger_filegroup': cellranger_filegroup,
    'cellranger_features': cellranger_filegroup.features,
    'cellranger_matrix': cellranger_filegroup.matrix,
    'cellranger_barcodes': cellranger_filegroup.barcodes
}

sample_BFD.add_keyfiles(keyfiles)
display(vars(sample_BFD))

inferring file name as GCA_000002035.3_GRCz10_genomic.fna.gz
file GCA_000002035.3_GRCz10_genomic.fna.gz already exists at /home/ec2-user/glial-origins/output/Drer_juvbrain/GCA_000002035.3_GRCz10_genomic.fna.gz
file GCA_000002035.3_GRCz10_genomic.fna.gz unzipped and object renamed to GCA_000002035.3_GRCz10_genomic.fna
inferring file name as GSE105010_Danio_rerio.GRCz10.86.modified.gtf.gz
file GSE105010_Danio_rerio.GRCz10.86.modified.gtf.gz already exists at /home/ec2-user/glial-origins/output/Drer_juvbrain/GSE105010_Danio_rerio.GRCz10.86.modified.gtf.gz
file GSE105010_Danio_rerio.GRCz10.86.modified.gtf.gz unzipped and object renamed to GSE105010_Danio_rerio.GRCz10.86.modified.gtf
inferring file name as GSM4793240_comb-zBr15dpf6_S1-barcodes.tsv.gz
file GSM4793240_comb-zBr15dpf6_S1-barcodes.tsv.gz already exists at /home/ec2-user/glial-origins/output/Drer_juvbrain/GSM4793240_comb-zBr15dpf6_S1-barcodes.tsv.gz
file GSM4793240_comb-zBr15dpf6_S1-barcodes.tsv.gz unzipped and object renamed to 

gzip: /home/ec2-user/glial-origins/output/Drer_juvbrain/GCA_000002035.3_GRCz10_genomic.fna: unknown suffix -- ignored
gzip: /home/ec2-user/glial-origins/output/Drer_juvbrain/GSE105010_Danio_rerio.GRCz10.86.modified.gtf: unknown suffix -- ignored
gzip: /home/ec2-user/glial-origins/output/Drer_juvbrain/GSM4793240_comb-zBr15dpf6_S1-barcodes.tsv: unknown suffix -- ignored
gzip: /home/ec2-user/glial-origins/output/Drer_juvbrain/GSM4793240_comb-zBr15dpf6_S1-matrix.mtx: unknown suffix -- ignored
gzip: /home/ec2-user/glial-origins/output/Drer_juvbrain/GSM4793240_comb-zBr15dpf6_S1-genes.tsv: unknown suffix -- ignored


Unnamed: 0,gene_name,AAACCTGAGAAGCCCA-1,AAACCTGAGGATGGAA-1,AAACCTGAGTCTTGCA-1,AAACCTGCACAGACAG-1,AAACCTGCACTTGGAT-1,AAACCTGCATGGGAAC-1,AAACCTGCATTCCTCG-1,AAACCTGGTACATCCA-1,AAACCTGGTGCTGTAT-1,...,TTTGTCAAGCGCTTAT-1,TTTGTCAAGCTAAACA-1,TTTGTCACACAAGACG-1,TTTGTCACACGCTTTC-1,TTTGTCACATCCTTGC-1,TTTGTCAGTATAGTAG-1,TTTGTCAGTATATGAG-1,TTTGTCAGTCGACTAT-1,TTTGTCATCATTGCGA-1,TTTGTCATCCTAAGTG-1
0,si:ch73-252i11.3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,PARP12,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,syn3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ptpro,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,eps8,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,si:ch1073-395m9.3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,tbk1,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
7,gpr19,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,crebl2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,dusp16,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


File already exists at /home/ec2-user/glial-origins/output/Drer_juvbrain/GSM4793240_comb-zBr15dpf6_S1-gxc.tsv . Set overwrite = True to overwrite.


{'species': 'Danio_rerio',
 'conditions': 'juvbrain',
 'directory': '/home/ec2-user/glial-origins/output/Drer_juvbrain/',
 'files': {},
 'metadata': <biofile_handling.metadata_object at 0x7f61ef1f7ee0>,
 'annot': <biofile_handling.GenomeGffFile at 0x7f61eecc7a30>,
 'genome_fasta': <biofile_handling.GenomeFastaFile at 0x7f613dd7f8b0>,
 'genome_fasta_original': <biofile_handling.GenomeFastaFile at 0x7f613dd7e920>,
 'gxc': <biofile_handling.GxcFile at 0x7f61ef1f58d0>,
 'cellranger_filegroup': <biofile_handling.CellRangerFileGroup at 0x7f61ef1f4e20>,
 'cellranger_features': <biofile_handling.CellRangerFeaturesFile at 0x7f613db38d90>,
 'cellranger_matrix': <biofile_handling.CellRangerMatrixFile at 0x7f613dd4d120>,
 'cellranger_barcodes': <biofile_handling.CellRangerBarcodesFile at 0x7f61ef1f7d60>}
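
The `to_gxc` call above combines the three CellRanger files (barcodes, genes, matrix) into one gene-by-cell table. Its implementation lives in `biofile_handling` and is not shown here, so the following is only a minimal standard-library sketch of the idea. It assumes a MatrixMarket coordinate file with 1-based `row col value` entries; `build_gxc` and the toy inputs are hypothetical.

```python
def build_gxc(genes, barcodes, mtx_lines):
    """Expand MatrixMarket coordinate entries into a dense gene x cell table.

    genes, barcodes: row and column labels (hypothetical toy inputs below).
    mtx_lines: the lines of a MatrixMarket coordinate file.
    """
    # Drop comment lines (starting with '%') and blank lines
    data_lines = [ln for ln in mtx_lines if ln.strip() and not ln.startswith('%')]

    # The first data line holds the dimensions: n_rows n_cols n_nonzero
    n_rows, n_cols, _n_nonzero = (int(x) for x in data_lines[0].split())
    if n_rows != len(genes) or n_cols != len(barcodes):
        raise ValueError('matrix dimensions do not match the label files')

    # Start from all zeros and fill in the non-zero counts
    table = [[0] * n_cols for _ in range(n_rows)]
    for ln in data_lines[1:]:
        row, col, value = ln.split()
        table[int(row) - 1][int(col) - 1] = int(value)  # indices are 1-based
    return table


toy_genes = ['tbk1', 'gpr19']
toy_barcodes = ['AAACCTGAGAAGCCCA-1', 'AAACCTGAGGATGGAA-1', 'AAACCTGAGTCTTGCA-1']
toy_mtx = [
    '%%MatrixMarket matrix coordinate integer general',
    '2 3 2',
    '1 2 1',
    '2 3 4',
]

toy_gxc = build_gxc(toy_genes, toy_barcodes, toy_mtx)
for gene, counts in zip(toy_genes, toy_gxc):
    print(gene, counts)
```

A dense table like this grows with genes × cells, which may be part of why the real step is slow on a full sample.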

# 2. Load in the gxc matrix and get gene names

In [4]:
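# Preview the first 10 rows of the gxc matrix (all columns are read, so the cell count is complete)
genes_matrix = pd.read_csv(sample_BFD.gxc.path, sep = '\t', nrows = 10)
display(genes_matrix)

# Read only the first column of the full file to get every gene name.
# Because `names` is passed without `header = 0`, the header line is kept as row 0.
gxc_genes_list = pd.read_csv(sample_BFD.gxc.path, sep = '\t', usecols=[0], names = ['gene_name'])
display(gxc_genes_list)

# Cells are every column except gene_name
sample_BFD.metadata.add('num_cells', len(genes_matrix.columns) - 1)
# Count genes from the full list (minus the header row), not from the 10-row preview
sample_BFD.metadata.add('num_genes', len(gxc_genes_list) - 1)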

Unnamed: 0,gene_name,AAACCTGAGAAGCCCA-1,AAACCTGAGGATGGAA-1,AAACCTGAGTCTTGCA-1,AAACCTGCACAGACAG-1,AAACCTGCACTTGGAT-1,AAACCTGCATGGGAAC-1,AAACCTGCATTCCTCG-1,AAACCTGGTACATCCA-1,AAACCTGGTGCTGTAT-1,...,TTTGTCAAGCGCTTAT-1,TTTGTCAAGCTAAACA-1,TTTGTCACACAAGACG-1,TTTGTCACACGCTTTC-1,TTTGTCACATCCTTGC-1,TTTGTCAGTATAGTAG-1,TTTGTCAGTATATGAG-1,TTTGTCAGTCGACTAT-1,TTTGTCATCATTGCGA-1,TTTGTCATCCTAAGTG-1
0,si:ch73-252i11.3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,PARP12,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,syn3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ptpro,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,eps8,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,si:ch1073-395m9.3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,tbk1,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
7,gpr19,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,crebl2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,dusp16,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Unnamed: 0,gene_name
0,gene_name
1,si:ch73-252i11.3
2,PARP12
3,syn3
4,ptpro
...,...
32187,CABZ01109843.1
32188,CABZ01111913.1
32189,CABZ01079745.1
32190,NP5
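
The preview above reads only the first 10 rows, so gene and cell totals are best taken from the whole file. A minimal sketch of a streaming count follows. It assumes the gxc file is tab-separated with a header row whose first column is the gene name; `count_genes_and_cells` and the toy text are hypothetical.

```python
import csv
import io


def count_genes_and_cells(handle):
    """Count data rows (genes) and non-label columns (cells) in one pass."""
    reader = csv.reader(handle, delimiter='\t')
    header = next(reader)
    num_cells = len(header) - 1  # the first column holds gene names
    num_genes = sum(1 for row in reader if row)  # skip empty trailing lines
    return num_genes, num_cells


toy_gxc_text = (
    'gene_name\tAAACCTGAGAAGCCCA-1\tAAACCTGAGGATGGAA-1\n'
    'tbk1\t0\t1\n'
    'gpr19\t0\t0\n'
    'crebl2\t2\t0\n'
)

num_genes, num_cells = count_genes_and_cells(io.StringIO(toy_gxc_text))
print(num_genes, num_cells)
```

Streaming the file this way avoids holding the full matrix in memory just to count its rows.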


# 3. Get mapping identifiers

In [5]:
# load in the original GFF-based annotation
models = pd.read_csv(sample_BFD.annot.path, header = None, sep = '\t', comment = '#')
display(models)

attributes_column = 8

# Check the structure of fields in the GFF additional fields section
display(models[attributes_column][3])



Unnamed: 0,0,1,2,3,4,5,6,7,8
0,4,ensembl_havana,gene,6733,52120,.,-,.,"gene_id ""ENSDARG00000104632""; gene_version ""1""..."
1,4,ensembl,transcript,6733,52120,.,-,.,"gene_id ""ENSDARG00000104632""; gene_version ""1""..."
2,4,ensembl,exon,52002,52120,.,-,.,"gene_id ""ENSDARG00000104632""; gene_version ""1""..."
3,4,ensembl,exon,48613,48740,.,-,.,"gene_id ""ENSDARG00000104632""; gene_version ""1""..."
4,4,ensembl,exon,13755,13811,.,-,.,"gene_id ""ENSDARG00000104632""; gene_version ""1""..."
...,...,...,...,...,...,...,...,...,...
1149247,22,ensembl,gene,598076,599835,.,+,.,"gene_id ""ENSGNP31""; gene_version ""1""; transcri..."
1149248,22,ensembl,gene,598076,599835,.,+,.,"gene_id ""ENSGNP31""; gene_version ""1""; transcri..."
1149249,16,ensembl,gene,47449630,47452948,.,+,.,"gene_id ""ENSGNP33""; gene_version ""1""; gene_nam..."
1149250,16,ensembl,transcript,47449630,47452948,.,+,.,"gene_id ""ENSGNP33""; gene_version ""1""; transcri..."


'gene_id "ENSDARG00000104632"; gene_version "1"; transcript_id "ENSDART00000166186"; transcript_version "1"; exon_number "2"; gene_name "si:ch73-252i11.3"; gene_source "ensembl_havana"; gene_biotype "lincRNA"; havana_gene "OTTDARG00000037780"; havana_gene_version "1"; transcript_name "si:ch73-252i11.3-201"; transcript_source "ensembl"; transcript_biotype "lincRNA"; exon_id "ENSDARE00001217519"; exon_version "1"; tag "basic";'

In [6]:
# Remove any rows with NaNs
models.dropna(inplace = True)

# Extract field and database cross-ref (dbxref) information into columns
models['field_dictionary'] = models[attributes_column].apply(convert_fields_to_dict_gtf)
models['gene_name'] = [d.get('gene_name') for d in models['field_dictionary']]
models['gene_id'] = [d.get('gene_id') for d in models['field_dictionary']]

display(models)

Unnamed: 0,0,1,2,3,4,5,6,7,8,field_dictionary,gene_name,gene_id
0,4,ensembl_havana,gene,6733,52120,.,-,.,"gene_id ""ENSDARG00000104632""; gene_version ""1""...","{'gene_id': 'ENSDARG00000104632', 'gene_versio...",si:ch73-252i11.3,ENSDARG00000104632
1,4,ensembl,transcript,6733,52120,.,-,.,"gene_id ""ENSDARG00000104632""; gene_version ""1""...","{'gene_id': 'ENSDARG00000104632', 'gene_versio...",si:ch73-252i11.3,ENSDARG00000104632
2,4,ensembl,exon,52002,52120,.,-,.,"gene_id ""ENSDARG00000104632""; gene_version ""1""...","{'gene_id': 'ENSDARG00000104632', 'gene_versio...",si:ch73-252i11.3,ENSDARG00000104632
3,4,ensembl,exon,48613,48740,.,-,.,"gene_id ""ENSDARG00000104632""; gene_version ""1""...","{'gene_id': 'ENSDARG00000104632', 'gene_versio...",si:ch73-252i11.3,ENSDARG00000104632
4,4,ensembl,exon,13755,13811,.,-,.,"gene_id ""ENSDARG00000104632""; gene_version ""1""...","{'gene_id': 'ENSDARG00000104632', 'gene_versio...",si:ch73-252i11.3,ENSDARG00000104632
...,...,...,...,...,...,...,...,...,...,...,...,...
1149247,22,ensembl,gene,598076,599835,.,+,.,"gene_id ""ENSGNP31""; gene_version ""1""; transcri...","{'gene_id': 'ENSGNP31', 'gene_version': '1', '...",NP31,ENSGNP31
1149248,22,ensembl,gene,598076,599835,.,+,.,"gene_id ""ENSGNP31""; gene_version ""1""; transcri...","{'gene_id': 'ENSGNP31', 'gene_version': '1', '...",NP31,ENSGNP31
1149249,16,ensembl,gene,47449630,47452948,.,+,.,"gene_id ""ENSGNP33""; gene_version ""1""; gene_nam...","{'gene_id': 'ENSGNP33', 'gene_version': '1', '...",NP33,ENSGNP33
1149250,16,ensembl,transcript,47449630,47452948,.,+,.,"gene_id ""ENSGNP33""; gene_version ""1""; transcri...","{'gene_id': 'ENSGNP33', 'gene_version': '1', '...",NP33,ENSGNP33
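
The `convert_fields_to_dict_gtf` helper comes from the `utils` directory and its source is not shown here. As a rough illustration of what such a parser has to do with the attribute string displayed earlier, here is a minimal sketch; `parse_gtf_attributes` is a hypothetical stand-in, and it assumes that values never contain a semicolon.

```python
def parse_gtf_attributes(attributes):
    """Turn 'key "value"; key "value";' into a dict of strings.

    Repeated keys (for example several `tag` entries) keep only the last value.
    """
    fields = {}
    for chunk in attributes.strip().split(';'):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip the empty piece after the final ';'
        key, _, value = chunk.partition(' ')
        fields[key] = value.strip().strip('"')
    return fields


example = (
    'gene_id "ENSDARG00000104632"; gene_version "1"; '
    'gene_name "si:ch73-252i11.3"; gene_biotype "lincRNA";'
)
parsed = parse_gtf_attributes(example)
print(parsed['gene_name'], parsed['gene_id'])
```

Using `dict.get` on the result, as the cell above does, returns `None` for rows that lack a field instead of raising an error.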


# 4. Extract gene IDs for mapping to UniprotKB
Specify which set of identifiers will be used to query the [Uniprot ID Mapping Tool](https://www.uniprot.org/id-mapping) via API.

If using an identifier from the `dbxref_dict`, specify the name via string in the `dbxref_datafield` variable.

In [7]:
dbxref_datafield = 'gene_id'

models_subset = models[['gene_name', dbxref_datafield]].dropna().drop_duplicates()

display(models_subset)

Unnamed: 0,gene_name,gene_id
0,si:ch73-252i11.3,ENSDARG00000104632
10,PARP12,ENSDARG00000100660
57,syn3,ENSDARG00000098417
67,ptpro,ENSDARG00000100422
182,eps8,ENSDARG00000102128
...,...,...
1149224,CABZ01111913.1,ENSDARG00000100589
1149229,CABZ01079745.1,ENSDARG00000099551
1149240,NP5,ENSGNP5
1149246,NP31,ENSGNP31


# 5. Generate gene list file to query Uniprot ID Mapping API
Generate a text file ending in `_ids.txt` for sending to the ID mapping API.

In [8]:
gene_list = models_subset[dbxref_datafield].unique()

genelist_object = GeneListFile(
    sampledict = sample_BFD.sampledict,
    sources = [sample_BFD.annot],
    genes = gene_list,
    identifier = dbxref_datafield
    )

Wrote 32192 gene ids to /home/ec2-user/glial-origins/output/Drer_juvbrain/Drer_juvbrain_gene_id_ids.txt
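
The internals of `GeneListFile` are not shown in this notebook. The sketch below illustrates one plausible way to write such an ID list, assuming one identifier per line; `write_id_list` and the toy path are hypothetical.

```python
import os
import tempfile


def write_id_list(ids, path):
    """Write unique, non-empty IDs to `path`, one per line; return the count."""
    # dict.fromkeys drops duplicates while keeping first-seen order
    unique_ids = [i for i in dict.fromkeys(ids) if i]
    with open(path, 'w') as handle:
        handle.write('\n'.join(unique_ids) + '\n')
    return len(unique_ids)


toy_ids = ['ENSDARG00000104632', 'ENSDARG00000100660', 'ENSDARG00000104632', 'ENSGNP5']

with tempfile.TemporaryDirectory() as tmpdir:
    toy_path = os.path.join(tmpdir, 'toy_gene_id_ids.txt')
    n_written = write_id_list(toy_ids, toy_path)
    with open(toy_path) as handle:
        written_lines = handle.read().splitlines()

print(n_written, written_lines)
```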


# 6. Query Uniprot ID Mapping API
Specify the `from_type` variable based on the Uniprot name of the identifier.  
The table below lists some databases and the `from_type` string that the API accepts for that datatype.  

| datatype | `from_type` string | description |
| ---: | :--- | :--- |
| Mouse Genome Informatics | `MGI` | ID starts with `MGI:` |
| Zebrafish Information Network | `ZFIN` | ID starts with `ZDB-GENE-` |
| Xenbase | `Xenbase` | ID starts with `XB-GENE-` |
| Ensembl | `Ensembl` | Gene ID starts with `ENS` (zebrafish genes: `ENSDARG`) |

__NOTE:__ You may have to run the cell below twice. UniProt sometimes returns a "Resource not found" message on the first query to the database.

In [10]:
from_type = 'Ensembl'
to_type = 'UniProtKB'

uniprot_idmm_object = genelist_object.get_uniprot_ids(ID_MAPPER_LOC, from_type, to_type)
sample_BFD.add_keyfile('uniprot_output_idmm', uniprot_idmm_object, overwrite = True)

uniprot_idmm = pd.read_csv(sample_BFD.uniprot_output_idmm.path, sep = '\t')
display(uniprot_idmm)

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:02:09 --:--:--     0
curl: (28) Failed to connect to rest.uniprot.org port 443 after 129845 ms: Connection timed out
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   119    0   119    0     0    194      0 --:--:-- --:--:-- --:--:--   194


Unnamed: 0,Error messages
0,Resource not found
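
Since the first query can fail, re-running by hand could be replaced with a small retry loop. The sketch below is a generic helper and not part of `biofile_handling`; `retry` and `flaky_query` are hypothetical. It assumes the wrapped call raises an exception on failure, so a result table that contains only an error message would need to be turned into an exception first.

```python
import time


def retry(func, attempts=3, delay=1.0, backoff=2.0, exceptions=(Exception,)):
    """Call `func` until it succeeds or `attempts` is exhausted."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions as error:
            last_error = error
            if attempt < attempts:
                time.sleep(delay)  # wait before the next try
                delay *= backoff   # wait longer after each failure
    raise last_error


# Toy stand-in for a query that fails once and then succeeds
calls = {'n': 0}

def flaky_query():
    calls['n'] += 1
    if calls['n'] == 1:
        raise RuntimeError('Resource not found')
    return 'ok'


result = retry(flaky_query, attempts=3, delay=0.0)
print(result, calls['n'])
```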


# 7. Extract results and generate Uniprot IDMM
Generates an ID mapping matrix (idmm) that links `gene_name`, the `dbxref_datafield` selected above, and the `uniprot_id` returned by the API.

In [11]:
# Select the ID pair columns and rename them on a new DataFrame
# (avoids renaming a slice of uniprot_idmm in place)
uniprot_idpairs = uniprot_idmm[['From', 'Entry']].rename(
    columns = {'From': dbxref_datafield, 'Entry': 'uniprot_id'}
)
display(uniprot_idpairs)

uniprot_output_idmm = models_subset.merge(uniprot_idpairs, on = dbxref_datafield)
display(uniprot_output_idmm)

KeyError: "None of [Index(['From', 'Entry'], dtype='object')] are in the [columns]"
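
The `KeyError` above arises because the mapping result holds only an `Error messages` column rather than `From` and `Entry`. Checking for the expected columns before subsetting gives a clearer failure. This is a minimal sketch using the standard library; `read_id_pairs` and the toy inputs (including the placeholder `UNIPROT_A`) are hypothetical.

```python
import csv
import io

REQUIRED_COLUMNS = ('From', 'Entry')


def read_id_pairs(handle, required=REQUIRED_COLUMNS):
    """Return (From, Entry) pairs, or raise if the table lacks those columns."""
    reader = csv.DictReader(handle, delimiter='\t')
    columns = reader.fieldnames or []
    missing = [c for c in required if c not in columns]
    if missing:
        raise ValueError(
            f'mapping result is missing columns {missing}; found {columns}'
        )
    return [(row['From'], row['Entry']) for row in reader]


# Toy inputs: a well-formed result and an error-only result
good_text = 'From\tEntry\nENSDARG00000102128\tUNIPROT_A\n'
bad_text = 'Error messages\nResource not found\n'

pairs = read_id_pairs(io.StringIO(good_text))
print(pairs)

try:
    read_id_pairs(io.StringIO(bad_text))
    error_message = None
except ValueError as error:
    error_message = str(error)
print(error_message)
```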

In [None]:
# generate a filename and file for the idmm
uniprot_output_idmm_filename = '_'.join([sample_BFD.species_prefix, conditions, 'uniprot-idmm.tsv'])
uniprot_output_idmm_object = IdmmFile(
    filename = uniprot_output_idmm_filename, 
    sampledict = sample_BFD.sampledict, 
    kind = 'uniprot_idmm', 
    sources = [sample_BFD.annot]
)

# save to file and add to the BioFileDocket
uniprot_output_idmm.to_csv(uniprot_output_idmm_object.path, sep = '\t')
sample_BFD.add_keyfile('uniprot_idmm', uniprot_output_idmm_object)

# 8. Convert GFF to GTF

In [None]:
# load the GTF annotation as a dataframe
# skiprows = 5 skips the leading lines of the file (assumed here to be its `#` header block)
models_asgtf_df = pd.read_csv(sample_BFD.annot.path, skiprows = 5, header = None, sep = '\t')

display(models_asgtf_df)
display(models_asgtf_df[attributes_column][1])

In [None]:
# Use a custom function to extract useful fields from the additional fields section (column 8)
# Pull from that dict to fill in additional useful columns
models_asgtf_df['field_dictionary'] = models_asgtf_df[attributes_column].apply(convert_fields_to_dict_gtf)
models_asgtf_df['gene_name'] = [d.get('gene_name') for d in models_asgtf_df['field_dictionary']]
models_asgtf_df['gene_id'] = [d.get('gene_id') for d in models_asgtf_df['field_dictionary']]
models_asgtf_df['transcript_id'] = [d.get('transcript_id') for d in models_asgtf_df['field_dictionary']]

# Remove CDS annotations because they interfere with TransDecoder cDNA generation
models_asgtf_df = models_asgtf_df[models_asgtf_df[2] != 'CDS']
display(models_asgtf_df)

# 9. Generate gtf-idmm
This file maps the `gene_name` to `gene_id` and `transcript_id` fields generated by the conversion from GFF to GTF, which will be needed for downstream processing.

In [None]:
# Extract gene_name, gene_id, and transcript_id fields to generate an ID mapping matrix (idmm)
idmm_df = models_asgtf_df[['gene_name', 'gene_id', 'transcript_id']].drop_duplicates()
idmm_df.dropna(inplace = True)
display(idmm_df)

# generate a filename and file for the idmm
idmm_filename = '_'.join([sample_BFD.species_prefix, conditions, 'gtf-idmm.tsv'])
idmm = IdmmFile(
    filename = idmm_filename, 
    sampledict = sample_BFD.sampledict, 
    kind = 'gtf_idmm', 
    sources = [sample_BFD.annot]
)

# save to file and add to the BioFileDocket
idmm_df.to_csv(idmm.path, sep = '\t')
sample_BFD.add_keyfile('gtf_idmm', idmm)

# 10. Generate updated gtf
Generate an updated GTF file using `transcript_id` as the key. For some datasets, transcripts do not consistently get gene names and gene IDs added, which causes TransDecoder to throw errors. Merging on `transcript_id` fills those fields in for every row.

In [None]:
models_asgtf_updated_df = models_asgtf_df.merge(idmm_df, on = 'transcript_id')
models_asgtf_updated_df.apply(lambda x: x['field_dictionary'].update({'gene_name': x['gene_name_y']}), axis = 1)
models_asgtf_updated_df.apply(lambda x: x['field_dictionary'].update({'gene_id': x['gene_id_y']}), axis = 1)
models_asgtf_updated_df['new_fields'] = models_asgtf_updated_df['field_dictionary'].apply(convert_dict_to_fields_gtf)
models_asgtf_updated_df = models_asgtf_updated_df[[0, 1, 2, 3, 4, 5, 6, 7, 'new_fields']]

models_asgtf_updated_filename = sample_BFD.annot.filename.replace('.gtf', '_updated.gtf')
models_asgtf_updated = GenomeGtfFile(
    filename = models_asgtf_updated_filename, 
    sampledict = sample_BFD.sampledict, 
    reference_genome = sample_BFD.genome_fasta
)

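# Write without CSV quoting so the double quotes inside the attributes column are kept as-is
import csv
models_asgtf_updated_df.to_csv(models_asgtf_updated.path, header = None, index = None, sep = '\t', quoting = csv.QUOTE_NONE)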
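
The `convert_dict_to_fields_gtf` helper used above is defined in the `utils` directory and is not shown here. The sketch below illustrates the reverse of attribute parsing, assuming the usual GTF `key "value";` layout; `format_gtf_attributes` is a hypothetical stand-in.

```python
def format_gtf_attributes(fields):
    """Serialize a dict into a GTF attribute string: key "value"; key "value";"""
    return ' '.join(f'{key} "{value}";' for key, value in fields.items())


toy_fields = {
    'gene_id': 'ENSDARG00000104632',
    'gene_name': 'si:ch73-252i11.3',
}
formatted = format_gtf_attributes(toy_fields)
print(formatted)
```

Python dicts preserve insertion order, so updating an existing key (as the `update` calls above do) leaves the field in its original position.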

# 11. Generate cDNA and peptide files
Using the updated GTF file and the genome file, generate cDNA sequences.

Then, using the cDNA sequences, generate peptide sequences with TransDecoder.

Expect this step to take some time, roughly 20-30 minutes.

In [None]:
cdna = sample_BFD.genome_fasta.get_transdecoder_cdna_gtf(models_asgtf_updated, TRANSDECODER_LOC)
sample_BFD.add_keyfile('cdna', cdna)

transdecoder_files = sample_BFD.cdna.to_pep_files(TDLONGORF_LOC, TDPREDICT_LOC)
sample_BFD.add_keyfiles(transdecoder_files)
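
The methods above wrap the TransDecoder command-line tools; the wrappers themselves are not shown in this notebook. As a rough guide to the underlying steps, the sketch below only builds the argument lists and does not run anything. It assumes the standard TransDecoder entry points (`gtf_genome_to_cdna_fasta.pl`, `TransDecoder.LongOrfs`, `TransDecoder.Predict`); `build_transdecoder_commands` and the toy file names are hypothetical.

```python
def build_transdecoder_commands(gtf_path, genome_path, cdna_path,
                                gtf_to_cdna='gtf_genome_to_cdna_fasta.pl',
                                longorfs='TransDecoder.LongOrfs',
                                predict='TransDecoder.Predict'):
    """Return the three argument lists for the cDNA and peptide steps.

    The first command prints FASTA to stdout, so its output would need to be
    redirected into `cdna_path` when it is actually run.
    """
    return [
        [gtf_to_cdna, gtf_path, genome_path],  # GTF + genome -> cDNA FASTA
        [longorfs, '-t', cdna_path],           # find long open reading frames
        [predict, '-t', cdna_path],            # predict coding regions and peptides
    ]


commands = build_transdecoder_commands(
    gtf_path='toy_updated.gtf',
    genome_path='toy_genome.fna',
    cdna_path='toy.cdna.fasta',
)
for command in commands:
    print(' '.join(command))
```

To execute a command, each list could be passed to `subprocess.run(command, check=True)`, with `stdout` pointed at an open file for the first step.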

# 12. Push files to AWS S3

Iterates through the BioFile objects in the BioFileDocket and pushes each one to its corresponding location in AWS S3.

In [None]:
sample_BFD.local_to_s3()

# 13. Pickle the `sample_BFD` variable for use by the next script

In [None]:
# Generate a .pkl file for the Docket
sample_BFD.pickle()

# Push to S3, optionally overwriting existing file
sample_BFD.push_to_s3(overwrite = True)
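
The next script is expected to reload the pickled docket. The `pickle()` method itself is not shown here (this notebook imports `dill`), so the sketch below only illustrates the round trip with the standard-library `pickle` module and a plain dictionary; `toy_docket` and its file name are hypothetical.

```python
import os
import pickle
import tempfile

# Toy stand-in for a docket: plain data that records where files live
toy_docket = {
    'species': 'Danio_rerio',
    'conditions': 'juvbrain',
    'files': {'gxc': 'toy-gxc.tsv'},
}

with tempfile.TemporaryDirectory() as tmpdir:
    pkl_path = os.path.join(tmpdir, 'toy_docket.pkl')

    # Save the object for the next script
    with open(pkl_path, 'wb') as handle:
        pickle.dump(toy_docket, handle)

    # Load it back, as a downstream script would
    with open(pkl_path, 'rb') as handle:
        restored = pickle.load(handle)

print(restored == toy_docket)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.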