# Overview

This notebook collects data from the [Jiang et al. 2021](https://www.frontiersin.org/articles/10.3389/fcell.2021.743421/full) zebrafish cell atlas.

#### Dataset description

- Data can be found at the [GEO Accession GSE130487](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE130487).
- This notebook collects data from one sample, ["Brain8"](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM3768152), for the purposes of data analysis and exploration.

# 0. Setup

Import packages and specify any important functions here.

In [1]:
# import standard python packages
import pandas as pd
import subprocess, os, dill, sys

# add the utils and env directories to the path
sys.path.append('../../utils/')
sys.path.append('../../env/')

# import functions from utils directory files
from string_functions import *
from biofile_handling import *

# import paths to software installs from env
from install_locs import *

# 1. Download files

#### BioFileDocket
First, make a BioFileDocket for the dataset.

In [2]:
################
# general info #
################

# Specify the name of the species folder in Amazon S3
species = 'Danio_rerio'

# Specify any particular identifying conditions, eg tissue type:
conditions = 'adultbrain'

################
################

sample_BFD = BioFileDocket(species, conditions)

/home/ec2-user/glial-origins/output/Drer_adultbrain/ already exists
Files will be saved into /home/ec2-user/glial-origins/output/Drer_adultbrain/


#### Download and add files
Next, download the `genome_fasta`, `annot`, and `gxc` files from the URLs specified below.
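
The cell output below reports "inferring file name as ..." for each URL and then an unzipped name. The actual logic lives in `biofile_handling` and is not shown in this notebook; the following is only a minimal sketch of how such inference could work, and `infer_filename` and `unzipped_name` are hypothetical helper names.

```python
import posixpath
from urllib.parse import urlparse


def infer_filename(url: str) -> str:
    """Return the last path component of a URL (hypothetical helper)."""
    return posixpath.basename(urlparse(url).path)


def unzipped_name(filename: str) -> str:
    """Drop a trailing '.gz' suffix if present; otherwise return the name unchanged."""
    return filename[:-3] if filename.endswith('.gz') else filename


gxc_url = 'https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM3768nnn/GSM3768152/suppl/GSM3768152_Brain_8_dge.txt.gz'
name = infer_filename(gxc_url)
print(name)
print(unzipped_name(name))
```

Checking for the `.gz` suffix before unzipping is one way to avoid handing an already-unzipped path to `gzip`.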

In [5]:
################
# general info #
################

# Specify url and other variables
genome_fasta_url = 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/035/GCF_000002035.5_GRCz10/GCF_000002035.5_GRCz10_genomic.fna.gz'
genome_version = 'GRCz10'

annot_url = 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/035/GCF_000002035.5_GRCz10/GCF_000002035.5_GRCz10_genomic.gff.gz'
gxc_url = 'https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM3768nnn/GSM3768152/suppl/GSM3768152_Brain_8_dge.txt.gz'

###########
# runtime #
###########

protocol = 'curl'

genome_fasta = GenomeFastaFile(
    sampledict = sample_BFD.sampledict,
    version = genome_version,
    url = genome_fasta_url,
    protocol = protocol
)

annot = GenomeGffFile(
    sampledict = sample_BFD.sampledict,
    reference_genome = genome_fasta,
    url = annot_url,
    protocol = protocol
)

gxc = GxcFile(
    sampledict = sample_BFD.sampledict,
    reference_genome = genome_fasta,
    reference_annot = annot,
    url = gxc_url,
    protocol = protocol
)

keyfiles = {
    'annot': annot,
    'genome_fasta': genome_fasta,
    'gxc': gxc
}
sample_BFD.add_keyfiles(keyfiles)

display(vars(sample_BFD))

inferring file name as GCF_000002035.5_GRCz10_genomic.fna.gz
file GCF_000002035.5_GRCz10_genomic.fna.gz already exists at /home/ec2-user/glial-origins/output/Drer_adultbrain/GCF_000002035.5_GRCz10_genomic.fna.gz
file GCF_000002035.5_GRCz10_genomic.fna.gz unzipped and object renamed to GCF_000002035.5_GRCz10_genomic.fna
inferring file name as GCF_000002035.5_GRCz10_genomic.gff.gz
file GCF_000002035.5_GRCz10_genomic.gff.gz already exists at /home/ec2-user/glial-origins/output/Drer_adultbrain/GCF_000002035.5_GRCz10_genomic.gff.gz
file GCF_000002035.5_GRCz10_genomic.gff.gz unzipped and object renamed to GCF_000002035.5_GRCz10_genomic.gff
inferring file name as GSM3768152_Brain_8_dge.txt.gz
file GSM3768152_Brain_8_dge.txt.gz already exists at /home/ec2-user/glial-origins/output/Drer_adultbrain/GSM3768152_Brain_8_dge.txt.gz
file GSM3768152_Brain_8_dge.txt.gz unzipped and object renamed to GSM3768152_Brain_8_dge.txt
key "annot" already exists, ignoring
key "genome_fasta" already exists, ignor

gzip: /home/ec2-user/glial-origins/output/Drer_adultbrain/GCF_000002035.5_GRCz10_genomic.fna: unknown suffix -- ignored
gzip: /home/ec2-user/glial-origins/output/Drer_adultbrain/GCF_000002035.5_GRCz10_genomic.gff: unknown suffix -- ignored
gzip: /home/ec2-user/glial-origins/output/Drer_adultbrain/GSM3768152_Brain_8_dge.txt: unknown suffix -- ignored


{'species': 'Danio_rerio',
 'conditions': 'adultbrain',
 'directory': '/home/ec2-user/glial-origins/output/Drer_adultbrain/',
 'files': {},
 'metadata': <biofile_handling.metadata_object at 0x7fe17d01bf40>,
 'annot': <biofile_handling.GenomeGffFile at 0x7fe17d019030>,
 'genome_fasta': <biofile_handling.GenomeFastaFile at 0x7fe0cc0ecd00>,
 'gxc': <biofile_handling.GxcFile at 0x7fe17d01b400>}

# 2. Extract gxc information

#### Preview gxc matrix
- Show the first 10 rows of the gxc matrix.  
- Extract the first column as a 'gene_name' dataframe.
- Record the number of cells and genes in the gxc to the `sample_BFD.metadata` attribute.
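
As an illustration of the counting step, here is a small sketch that counts genes (data rows) and cells (barcode columns) by streaming a tab-delimited matrix rather than loading it all into memory. It assumes the layout seen in this dataset: one header row whose first column is the gene label. `count_genes_and_cells` is a hypothetical name, and the toy matrix is made up.

```python
import csv
import io


def count_genes_and_cells(handle):
    """Count data rows (genes) and barcode columns (cells) in a
    tab-delimited matrix that has a single header row."""
    reader = csv.reader(handle, delimiter='\t')
    header = next(reader)
    num_cells = len(header) - 1  # the first column holds gene names
    num_genes = sum(1 for row in reader if row)  # skip empty lines
    return num_genes, num_cells


# Toy matrix with 2 genes and 3 cell barcodes (made-up values)
toy = io.StringIO("GENE\tAAAC\tAAAG\tAAAT\nabcf3\t0\t1\t0\nacot12\t0\t0\t2\n")
print(count_genes_and_cells(toy))
```

For a real file, the same function could be called on `open(path, newline='')`.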

In [6]:
genes_matrix = pd.read_csv(sample_BFD.gxc.path, sep = '\t', nrows = 10)
display(genes_matrix)

gxc_genes_list = pd.read_csv(sample_BFD.gxc.path, sep = '\t', usecols=[0], names = ['gene_name'])
display(gxc_genes_list)

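# the first column holds gene names, so every other column is a cell barcode
sample_BFD.metadata.add('num_cells', len(genes_matrix.columns) - 1)
# genes_matrix holds only the 10 preview rows, so count genes from the full first column;
# subtract 1 because the header row ('GENE') is read in as a data row
sample_BFD.metadata.add('num_genes', len(gxc_genes_list) - 1)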

Unnamed: 0,GENE,ACAATATATTGTACCTGA,ACGTTGATGGCGTAGAGA,AACCTAACCTGAATTTGC,CTCGCAGCCCTCTATGTA,ACGTTGCGTATTTAGTCG,AACCTATAGAGACCGACG,ACGAGCGCTGTGGCCTAG,GCGAATGGACATGGACAT,TCTACCGCTCAAGCTCAA,...,CGGCAGTCAAAGATCTCT,GACACTGCGAATCTGTGT,GCAGGAGGCTGCTAAGGG,TATGTATACTTCCGCACC,TGGATGTTCCGCACAATA,AACCTATGGATGGGGTTT,AAGCGGAGGACTCTCCAT,ACCTGACTCGCAAGCGAG,ACGTTGCAAAGTTTCATA,ATCAACTGCAATTTCCGC
0,ABCF3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ACOT12,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ACSF3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ACTC1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ACVR1C,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,ADAM12,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,ADAMTSL4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,ADGRL3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,AKAP13,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,AL590134.1,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


Unnamed: 0,gene_name
0,GENE
1,ABCF3
2,ACOT12
3,ACSF3
4,ACTC1
...,...
20803,zwilch
20804,zyg11
20805,zyx
20806,zzef1


# 3. Get mapping identifiers

In [7]:
# load in the original GFF-based annotation
models = pd.read_csv(sample_BFD.annot.path, skiprows = 7, header = None, sep = '\t')
display(models)

attributes_column = 8

# Check the structure of fields in the GFF additional fields section
display(models[attributes_column][3])

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,NC_007112.6,RefSeq,region,1.0,58871917.0,.,+,.,ID=id0;Dbxref=taxon:7955;Name=1;chromosome=1;g...
1,NC_007112.6,BestRefSeq,gene,6642.0,11878.0,.,-,.,"ID=gene0;Dbxref=GeneID:192301,ZFIN:ZDB-GENE-02..."
2,NC_007112.6,BestRefSeq,mRNA,6642.0,11878.0,.,-,.,"ID=rna0;Parent=gene0;Dbxref=GeneID:192301,Genb..."
3,NC_007112.6,BestRefSeq,exon,11751.0,11878.0,.,-,.,"ID=id1;Parent=rna0;Dbxref=GeneID:192301,Genban..."
4,NC_007112.6,BestRefSeq,exon,11550.0,11625.0,.,-,.,"ID=id2;Parent=rna0;Dbxref=GeneID:192301,Genban..."
...,...,...,...,...,...,...,...,...,...
1362356,NC_002333.2,RefSeq,exon,16449.0,16520.0,.,+,.,ID=id634507;Parent=rna65516;Dbxref=GeneID:1405...
1362357,NC_002333.2,RefSeq,gene,16527.0,16596.0,.,-,.,"ID=gene42297;Dbxref=GeneID:140511,ZFIN:ZDB-GEN..."
1362358,NC_002333.2,RefSeq,tRNA,16527.0,16596.0,.,-,.,ID=rna65517;Parent=gene42297;Dbxref=GeneID:140...
1362359,NC_002333.2,RefSeq,exon,16527.0,16596.0,.,-,.,ID=id634508;Parent=rna65517;Dbxref=GeneID:1405...


'ID=id1;Parent=rna0;Dbxref=GeneID:192301,Genbank:NM_173235.3,ZFIN:ZDB-GENE-020419-25;gbkey=mRNA;gene=rpl24;product=ribosomal protein L24;transcript_id=NM_173235.3'
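
The next cell relies on `convert_fields_to_dict_gff` and `convert_dbxref_to_dict` from `string_functions`, whose source is not shown here. As a rough sketch of what parsing of this kind involves (not the actual implementation), the attributes string can be split on `;` and `=`, and the `Dbxref` value on `,` and the first `:`. The function names below are hypothetical.

```python
def parse_gff_attributes(attributes: str) -> dict:
    """Split a GFF3 attributes string ('key=value;key=value') into a dict."""
    fields = {}
    for item in attributes.strip().split(';'):
        if not item:
            continue
        key, _, value = item.partition('=')
        fields[key] = value
    return fields


def parse_dbxref(dbxref: str) -> dict:
    """Split a Dbxref value ('DB:accession,DB:accession') into a dict.
    If a database appears more than once, the last accession wins."""
    refs = {}
    for item in dbxref.split(','):
        database, _, accession = item.partition(':')
        refs[database] = accession
    return refs


example = ('ID=id1;Parent=rna0;Dbxref=GeneID:192301,Genbank:NM_173235.3,'
           'ZFIN:ZDB-GENE-020419-25;gbkey=mRNA;gene=rpl24;'
           'product=ribosomal protein L24;transcript_id=NM_173235.3')
fields = parse_gff_attributes(example)
refs = parse_dbxref(fields['Dbxref'])
print(fields['gene'], refs)
```

Splitting on the first `:` only keeps accessions that themselves contain a colon intact.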

In [8]:
# Remove any rows with NaNs
models.dropna(inplace = True)

# Extract field and database cross-ref (dbxref) information into columns
models['field_dictionary'] = models[attributes_column].apply(convert_fields_to_dict_gff)
models['gene_name'] = [d.get('gene') for d in models['field_dictionary']]
models['Dbxref'] = [d.get('Dbxref') for d in models['field_dictionary']]
models['dbxref_dict'] = models['Dbxref'].apply(convert_dbxref_to_dict)

display(models)

Unnamed: 0,0,1,2,3,4,5,6,7,8,field_dictionary,gene_name,Dbxref,dbxref_dict
0,NC_007112.6,RefSeq,region,1.0,58871917.0,.,+,.,ID=id0;Dbxref=taxon:7955;Name=1;chromosome=1;g...,"{'ID': 'id0', 'Dbxref': 'taxon:7955', 'Name': ...",,taxon:7955,{'taxon': '7955'}
1,NC_007112.6,BestRefSeq,gene,6642.0,11878.0,.,-,.,"ID=gene0;Dbxref=GeneID:192301,ZFIN:ZDB-GENE-02...","{'ID': 'gene0', 'Dbxref': 'GeneID:192301,ZFIN:...",rpl24,"GeneID:192301,ZFIN:ZDB-GENE-020419-25","{'GeneID': '192301', 'ZFIN': 'ZDB-GENE-020419-..."
2,NC_007112.6,BestRefSeq,mRNA,6642.0,11878.0,.,-,.,"ID=rna0;Parent=gene0;Dbxref=GeneID:192301,Genb...","{'ID': 'rna0', 'Parent': 'gene0', 'Dbxref': 'G...",rpl24,"GeneID:192301,Genbank:NM_173235.3,ZFIN:ZDB-GEN...","{'GeneID': '192301', 'Genbank': 'NM_173235.3',..."
3,NC_007112.6,BestRefSeq,exon,11751.0,11878.0,.,-,.,"ID=id1;Parent=rna0;Dbxref=GeneID:192301,Genban...","{'ID': 'id1', 'Parent': 'rna0', 'Dbxref': 'Gen...",rpl24,"GeneID:192301,Genbank:NM_173235.3,ZFIN:ZDB-GEN...","{'GeneID': '192301', 'Genbank': 'NM_173235.3',..."
4,NC_007112.6,BestRefSeq,exon,11550.0,11625.0,.,-,.,"ID=id2;Parent=rna0;Dbxref=GeneID:192301,Genban...","{'ID': 'id2', 'Parent': 'rna0', 'Dbxref': 'Gen...",rpl24,"GeneID:192301,Genbank:NM_173235.3,ZFIN:ZDB-GEN...","{'GeneID': '192301', 'Genbank': 'NM_173235.3',..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1362355,NC_002333.2,RefSeq,tRNA,16449.0,16520.0,.,+,.,ID=rna65516;Parent=gene42296;Dbxref=GeneID:140...,"{'ID': 'rna65516', 'Parent': 'gene42296', 'Dbx...",trnT,"GeneID:140518,ZFIN:ZDB-GENE-011205-39","{'GeneID': '140518', 'ZFIN': 'ZDB-GENE-011205-..."
1362356,NC_002333.2,RefSeq,exon,16449.0,16520.0,.,+,.,ID=id634507;Parent=rna65516;Dbxref=GeneID:1405...,"{'ID': 'id634507', 'Parent': 'rna65516', 'Dbxr...",trnT,"GeneID:140518,ZFIN:ZDB-GENE-011205-39","{'GeneID': '140518', 'ZFIN': 'ZDB-GENE-011205-..."
1362357,NC_002333.2,RefSeq,gene,16527.0,16596.0,.,-,.,"ID=gene42297;Dbxref=GeneID:140511,ZFIN:ZDB-GEN...","{'ID': 'gene42297', 'Dbxref': 'GeneID:140511,Z...",trnP,"GeneID:140511,ZFIN:ZDB-GENE-011205-38","{'GeneID': '140511', 'ZFIN': 'ZDB-GENE-011205-..."
1362358,NC_002333.2,RefSeq,tRNA,16527.0,16596.0,.,-,.,ID=rna65517;Parent=gene42297;Dbxref=GeneID:140...,"{'ID': 'rna65517', 'Parent': 'gene42297', 'Dbx...",trnP,"GeneID:140511,ZFIN:ZDB-GENE-011205-38","{'GeneID': '140511', 'ZFIN': 'ZDB-GENE-011205-..."


# 4. Extract gene IDs for mapping to UniprotKB
Specify which set of identifiers will be used to query the [UniProt ID Mapping tool](https://www.uniprot.org/id-mapping) via its API.

If using an identifier from the `dbxref_dict`, specify the name via string in the `dbxref_datafield` variable.

In [9]:
dbxref_datafield = 'ZFIN'

models.dropna(axis = 0, subset = ['dbxref_dict'], inplace = True)
models[dbxref_datafield] = [d.get(dbxref_datafield) for d in models['dbxref_dict']]

models_subset = models[['gene_name', dbxref_datafield]].dropna().drop_duplicates()

display(models_subset)

Unnamed: 0,gene_name,ZFIN
1,rpl24,ZDB-GENE-020419-25
15,cep97,ZDB-GENE-031030-11
74,nfkbiz,ZDB-GENE-071024-1
102,eed,ZDB-GENE-050417-287
128,zgc:110091,ZDB-GENE-050417-34
...,...,...
1362347,ND6,ZDB-GENE-011205-13
1362349,trnE,ZDB-GENE-011205-37
1362352,CYTB,ZDB-GENE-011205-17
1362354,trnT,ZDB-GENE-011205-39


# 5. Generate gene list file to query Uniprot ID Mapping API
Generate a text file ending in `_ids.txt` for sending to the ID mapping API.

In [10]:
gene_list = models_subset[dbxref_datafield].unique()

genelist_object = GeneListFile(
    sampledict = sample_BFD.sampledict,
    sources = [sample_BFD.annot],
    genes = gene_list,
    identifier = dbxref_datafield
    )

Wrote 22031 gene ids to /home/ec2-user/glial-origins/output/Drer_adultbrain/Drer_adultbrain_ZFIN_ids.txt


# 6. Query Uniprot ID Mapping API
Specify the `from_type` variable based on the Uniprot name of the identifier.  
The table below lists some databases and the `from_type` string that the API accepts for that datatype.  

| datatype | `from_type` string | description |
| ---: | :--- | :--- |
| Mouse Genome Informatics | `MGI` | ID starts with `MGI:` |
| Zebrafish Information Network | `ZFIN` | ID starts with `ZDB-GENE-` |
| Xenbase | `Xenbase` | ID starts with `XB-GENE-` |

__NOTE:__ You may have to run the cell below twice: UniProt sometimes returns a "Resource not found" message on the first query to the database.
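
Instead of re-running the cell by hand, the call could be wrapped in a small retry helper. The sketch below is generic: `call_with_retry` is a hypothetical name, and it only helps if the wrapped function raises an exception on failure, which is an assumption about `get_uniprot_ids` that this notebook does not confirm.

```python
import time


def call_with_retry(func, attempts=2, delay_seconds=1.0):
    """Call func() up to `attempts` times, sleeping between failed tries.
    Re-raises the last exception if every attempt fails."""
    if attempts < 1:
        raise ValueError('attempts must be at least 1')
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as error:  # broad on purpose: this is only a sketch
            last_error = error
            if attempt < attempts:
                time.sleep(delay_seconds)
    raise last_error


# Stand-in for a query that fails once and then succeeds
state = {'calls': 0}

def flaky_query():
    state['calls'] += 1
    if state['calls'] == 1:
        raise RuntimeError('Resource not found')
    return 'mapping results'

print(call_with_retry(flaky_query, attempts=2, delay_seconds=0))
```

Under that assumption, usage might look like `call_with_retry(lambda: genelist_object.get_uniprot_ids(ID_MAPPER_LOC, from_type, to_type))`.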

In [11]:
from_type = 'ZFIN'
to_type = 'UniProtKB'

uniprot_idmm_object = genelist_object.get_uniprot_ids(ID_MAPPER_LOC, from_type, to_type)
sample_BFD.add_keyfile('uniprot_idmm', uniprot_idmm_object, overwrite = True)

uniprot_idmm = pd.read_csv(sample_BFD.uniprot_idmm.path, sep = '\t')
display(uniprot_idmm)


Unnamed: 0,From,Entry,Entry Name,Reviewed,Protein names,Gene Names,Organism,Length
0,ZDB-GENE-020419-25,Q8JGR4,RL24_DANRE,reviewed,60S ribosomal protein L24,rpl24,Danio rerio (Zebrafish) (Brachydanio rerio),157
1,ZDB-GENE-020419-25,A0A0R4IMS3,A0A0R4IMS3_DANRE,unreviewed,60S ribosomal protein L24,rpl24 SO:0001217,Danio rerio (Zebrafish) (Brachydanio rerio),157
2,ZDB-GENE-031030-11,A0A0R4ICF0,A0A0R4ICF0_DANRE,unreviewed,Centrosomal protein 97,cep97 SO:0001217,Danio rerio (Zebrafish) (Brachydanio rerio),599
3,ZDB-GENE-031030-11,A0A0R4ISF5,A0A0R4ISF5_DANRE,unreviewed,Centrosomal protein 97,cep97 SO:0001217,Danio rerio (Zebrafish) (Brachydanio rerio),40
4,ZDB-GENE-031030-11,A0A0R4IX25,A0A0R4IX25_DANRE,unreviewed,Centrosomal protein 97,cep97 SO:0001217,Danio rerio (Zebrafish) (Brachydanio rerio),240
...,...,...,...,...,...,...,...,...
54701,ZDB-GENE-011205-12,Q9MIY0,NU5M_DANRE,reviewed,NADH-ubiquinone oxidoreductase chain 5 (EC 7.1...,mt-nd5 mtnd5 nd5,Danio rerio (Zebrafish) (Brachydanio rerio),606
54702,ZDB-GENE-011205-13,Q9MIX9,NU6M_DANRE,reviewed,NADH-ubiquinone oxidoreductase chain 6 (EC 7.1...,mt-nd6 mtnd6 nd6,Danio rerio (Zebrafish) (Brachydanio rerio),172
54703,ZDB-GENE-011205-13,A0A0A0VG13,A0A0A0VG13_DANRE,unreviewed,NADH-ubiquinone oxidoreductase chain 6 (EC 7.1...,ND6 mt-nd6 SO:0001217,Danio rerio (Zebrafish) (Brachydanio rerio),172
54704,ZDB-GENE-011205-17,Q9MIX8,CYB_DANRE,reviewed,Cytochrome b (Complex III subunit 3) (Complex ...,mt-cyb cob cytb mtcyb,Danio rerio (Zebrafish) (Brachydanio rerio),380


# 7. Extract results and generate Uniprot IDMM
Generate an idmm that links `gene_name`, the `dbxref_datafield` selected above, and the `uniprot_id` returned by the API.
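
Because one database identifier can map to several UniProt entries, the merge is one-to-many, and identifiers with no UniProt match drop out of a default inner merge. The sketch below uses toy data to show one way to inspect both effects with `how='left'` and `indicator=True`. The `toy1` gene and its identifier are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for models_subset and uniprot_idpairs
models_toy = pd.DataFrame({
    'gene_name': ['rpl24', 'cep97', 'toy1'],
    'ZFIN': ['ZDB-GENE-020419-25', 'ZDB-GENE-031030-11', 'ZDB-GENE-000000-0'],
})
pairs_toy = pd.DataFrame({
    'ZFIN': ['ZDB-GENE-020419-25', 'ZDB-GENE-020419-25', 'ZDB-GENE-031030-11'],
    'uniprot_id': ['Q8JGR4', 'A0A0R4IMS3', 'A0A0R4ICF0'],
})

# A left merge keeps every gene; the _merge column records where each row came from
merged = models_toy.merge(pairs_toy, on='ZFIN', how='left', indicator=True)

# Genes without any UniProt match
unmapped = merged.loc[merged['_merge'] == 'left_only', 'gene_name'].tolist()

# Rows equivalent to the default inner merge
mapped = merged[merged['_merge'] == 'both'].drop(columns='_merge')

print(unmapped)
print(mapped)
```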

In [12]:
# rename on a new DataFrame rather than in place on a slice,
# which avoids pandas' SettingWithCopyWarning
uniprot_idpairs = uniprot_idmm[['From', 'Entry']].rename(
    columns = {'From': dbxref_datafield, 'Entry': 'uniprot_id'}
)
display(uniprot_idpairs)

uniprot_output_idmm = models_subset.merge(uniprot_idpairs, on = dbxref_datafield)
display(uniprot_output_idmm)


Unnamed: 0,ZFIN,uniprot_id
0,ZDB-GENE-020419-25,Q8JGR4
1,ZDB-GENE-020419-25,A0A0R4IMS3
2,ZDB-GENE-031030-11,A0A0R4ICF0
3,ZDB-GENE-031030-11,A0A0R4ISF5
4,ZDB-GENE-031030-11,A0A0R4IX25
...,...,...
54701,ZDB-GENE-011205-12,Q9MIY0
54702,ZDB-GENE-011205-13,Q9MIX9
54703,ZDB-GENE-011205-13,A0A0A0VG13
54704,ZDB-GENE-011205-17,Q9MIX8


Unnamed: 0,gene_name,ZFIN,uniprot_id
0,rpl24,ZDB-GENE-020419-25,Q8JGR4
1,rpl24,ZDB-GENE-020419-25,A0A0R4IMS3
2,cep97,ZDB-GENE-031030-11,A0A0R4ICF0
3,cep97,ZDB-GENE-031030-11,A0A0R4ISF5
4,cep97,ZDB-GENE-031030-11,A0A0R4IX25
...,...,...,...
54701,ND5,ZDB-GENE-011205-12,Q9MIY0
54702,ND6,ZDB-GENE-011205-13,Q9MIX9
54703,ND6,ZDB-GENE-011205-13,A0A0A0VG13
54704,CYTB,ZDB-GENE-011205-17,Q9MIX8


In [15]:
# generate a filename and file for the idmm
uniprot_output_idmm_filename = '_'.join([sample_BFD.species_prefix, conditions, 'uniprot-idmm.tsv'])
uniprot_output_idmm_object = IdmmFile(
    filename = uniprot_output_idmm_filename, 
    sampledict = sample_BFD.sampledict, 
    kind = 'uniprot_idmm', 
    sources = [sample_BFD.annot]
)

# save to file and add to the BioFileDocket
uniprot_output_idmm.to_csv(uniprot_output_idmm_object.path, sep = '\t')
sample_BFD.add_keyfile('uniprot_idmm', uniprot_output_idmm_object)

# 8. Convert GFF to GTF

In [16]:
# convert the GFF file to GTF using gffread
models_asgtf = sample_BFD.annot.to_gtf(GFFREAD_LOC)

Converted file GCF_000002035.5_GRCz10_genomic.gtf already exists at:
 /home/ec2-user/glial-origins/output/Drer_adultbrain/GCF_000002035.5_GRCz10_genomic.gtf


In [17]:
# load the newly-generated GTF file as a dataframe
models_asgtf_df = pd.read_csv(models_asgtf.path, skiprows = 0, header = None, sep = '\t')

display(models_asgtf_df)
display(models_asgtf_df[attributes_column][1])

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,NC_007112.6,BestRefSeq,transcript,6642,11878,.,-,.,"transcript_id ""rna0""; gene_id ""gene0""; gene_na..."
1,NC_007112.6,BestRefSeq,exon,6642,6760,.,-,.,"transcript_id ""rna0""; gene_id ""gene0""; gene_na..."
2,NC_007112.6,BestRefSeq,exon,6892,6955,.,-,.,"transcript_id ""rna0""; gene_id ""gene0""; gene_na..."
3,NC_007112.6,BestRefSeq,exon,9558,9694,.,-,.,"transcript_id ""rna0""; gene_id ""gene0""; gene_na..."
4,NC_007112.6,BestRefSeq,exon,10081,10191,.,-,.,"transcript_id ""rna0""; gene_id ""gene0""; gene_na..."
...,...,...,...,...,...,...,...,...,...
1258174,NC_002333.2,RefSeq,CDS,15308,16448,.,+,0,"transcript_id ""gene42295""; gene_name ""CYTB"";"
1258175,NC_002333.2,RefSeq,transcript,16449,16520,.,+,.,"transcript_id ""rna65516""; gene_id ""gene42296"";..."
1258176,NC_002333.2,RefSeq,exon,16449,16520,.,+,.,"transcript_id ""rna65516""; gene_id ""gene42296"";..."
1258177,NC_002333.2,RefSeq,transcript,16527,16596,.,-,.,"transcript_id ""rna65517""; gene_id ""gene42297"";..."


'transcript_id "rna0"; gene_id "gene0"; gene_name "rpl24";'
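
The GTF attributes column uses `key "value";` pairs rather than the GFF `key=value;` form. The notebook's `convert_fields_to_dict_gtf` and `convert_dict_to_fields_gtf` helpers come from `string_functions` and are not shown here; the following is only a sketch of a parse and serialize round trip for this format, with hypothetical function names.

```python
import re

# Matches one 'key "value";' pair in a GTF attributes string
_GTF_FIELD = re.compile(r'(\S+) "([^"]*)";')


def parse_gtf_attributes(attributes: str) -> dict:
    """Turn 'key "value"; key "value";' into a dict, preserving field order."""
    return dict(_GTF_FIELD.findall(attributes))


def format_gtf_attributes(fields: dict) -> str:
    """Inverse of parse_gtf_attributes: rebuild the attributes string."""
    return ' '.join(f'{key} "{value}";' for key, value in fields.items())


example = 'transcript_id "rna0"; gene_id "gene0"; gene_name "rpl24";'
fields = parse_gtf_attributes(example)
print(fields)
print(format_gtf_attributes(fields) == example)
```

This sketch does not handle values that contain a double quote.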

In [18]:
# Use a custom function to extract useful fields from the additional fields section (column 8)
# Pull from that dict to fill in additional useful columns
models_asgtf_df['field_dictionary'] = models_asgtf_df[attributes_column].apply(convert_fields_to_dict_gtf)
models_asgtf_df['gene_name'] = [d.get('gene_name') for d in models_asgtf_df['field_dictionary']]
models_asgtf_df['gene_id'] = [d.get('gene_id') for d in models_asgtf_df['field_dictionary']]
models_asgtf_df['transcript_id'] = [d.get('transcript_id') for d in models_asgtf_df['field_dictionary']]

# Remove CDS annotations because they interfere with TransDecoder cDNA generation
models_asgtf_df = models_asgtf_df[models_asgtf_df[2] != 'CDS']
display(models_asgtf_df)

Unnamed: 0,0,1,2,3,4,5,6,7,8,field_dictionary,gene_name,gene_id,transcript_id
0,NC_007112.6,BestRefSeq,transcript,6642,11878,.,-,.,"transcript_id ""rna0""; gene_id ""gene0""; gene_na...","{'transcript_id': 'rna0', 'gene_id': 'gene0', ...",rpl24,gene0,rna0
1,NC_007112.6,BestRefSeq,exon,6642,6760,.,-,.,"transcript_id ""rna0""; gene_id ""gene0""; gene_na...","{'transcript_id': 'rna0', 'gene_id': 'gene0', ...",rpl24,gene0,rna0
2,NC_007112.6,BestRefSeq,exon,6892,6955,.,-,.,"transcript_id ""rna0""; gene_id ""gene0""; gene_na...","{'transcript_id': 'rna0', 'gene_id': 'gene0', ...",rpl24,gene0,rna0
3,NC_007112.6,BestRefSeq,exon,9558,9694,.,-,.,"transcript_id ""rna0""; gene_id ""gene0""; gene_na...","{'transcript_id': 'rna0', 'gene_id': 'gene0', ...",rpl24,gene0,rna0
4,NC_007112.6,BestRefSeq,exon,10081,10191,.,-,.,"transcript_id ""rna0""; gene_id ""gene0""; gene_na...","{'transcript_id': 'rna0', 'gene_id': 'gene0', ...",rpl24,gene0,rna0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1258173,NC_002333.2,RefSeq,transcript,15308,16448,.,+,.,"transcript_id ""gene42295""; gene_id ""gene42295""...","{'transcript_id': 'gene42295', 'gene_id': 'gen...",CYTB,gene42295,gene42295
1258175,NC_002333.2,RefSeq,transcript,16449,16520,.,+,.,"transcript_id ""rna65516""; gene_id ""gene42296"";...","{'transcript_id': 'rna65516', 'gene_id': 'gene...",trnT,gene42296,rna65516
1258176,NC_002333.2,RefSeq,exon,16449,16520,.,+,.,"transcript_id ""rna65516""; gene_id ""gene42296"";...","{'transcript_id': 'rna65516', 'gene_id': 'gene...",trnT,gene42296,rna65516
1258177,NC_002333.2,RefSeq,transcript,16527,16596,.,-,.,"transcript_id ""rna65517""; gene_id ""gene42297"";...","{'transcript_id': 'rna65517', 'gene_id': 'gene...",trnP,gene42297,rna65517


# 9. Generate gtf-idmm
This file maps the `gene_name` to `gene_id` and `transcript_id` fields generated by the conversion from GFF to GTF, which will be needed for downstream processing.

In [20]:
# Extract gene_name, gene_id, and transcript_id fields to generate an ID mapping matrix (idmm)
idmm_df = models_asgtf_df[['gene_name', 'gene_id', 'transcript_id']].drop_duplicates()
idmm_df.dropna(inplace = True)
display(idmm_df)

# generate a filename and file for the idmm
idmm_filename = '_'.join([sample_BFD.species_prefix, conditions, 'gtf-idmm.tsv'])
idmm = IdmmFile(
    filename = idmm_filename, 
    sampledict = sample_BFD.sampledict, 
    kind = 'gtf_idmm', 
    sources = [sample_BFD.annot]
)

# save to file and add to the BioFileDocket
idmm_df.to_csv(idmm.path, sep = '\t')
sample_BFD.add_keyfile('gtf_idmm', idmm)

Unnamed: 0,gene_name,gene_id,transcript_id
0,rpl24,gene0,rna0
13,cep97,gene1,rna2
25,cep97,gene1,rna1
48,cep97,gene1,rna3
71,nfkbiz,gene2,rna4
...,...,...,...
1258169,ND6,gene42293,gene42293
1258171,trnE,gene42294,rna65515
1258173,CYTB,gene42295,gene42295
1258175,trnT,gene42296,rna65516


# 10. Generate updated gtf
Generate an updated GTF file using `transcript_id` as the key. For some datasets, transcripts do not consistently get gene names and gene IDs added, which causes TransDecoder to throw errors. Filling both fields in from the gtf-idmm resolves that problem.
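
To make the idea concrete, here is a toy sketch of filling in missing gene names by looking them up via `transcript_id`. It uses `Series.map` on made-up rows rather than the merge used in the cell below, so treat it as an illustration of the technique only.

```python
import pandas as pd

# Toy GTF-like rows: only the transcript rows carry a gene_name
gtf_toy = pd.DataFrame({
    'feature': ['transcript', 'exon', 'exon', 'transcript', 'exon'],
    'transcript_id': ['rna0', 'rna0', 'rna0', 'rna1', 'rna1'],
    'gene_name': ['rpl24', None, None, 'cep97', None],
})

# Build a transcript_id -> gene_name lookup from the rows that have a name
lookup = (
    gtf_toy.dropna(subset=['gene_name'])
           .drop_duplicates('transcript_id')
           .set_index('transcript_id')['gene_name']
)

# Every row now gets the gene_name of its transcript
gtf_toy['gene_name'] = gtf_toy['transcript_id'].map(lookup)
print(gtf_toy)
```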

In [21]:
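# merge on transcript_id; shared columns get _x (original) and _y (idmm) suffixes
models_asgtf_updated_df = models_asgtf_df.merge(idmm_df, on = 'transcript_id')
# overwrite gene_name and gene_id in each row's field dictionary with the idmm values
models_asgtf_updated_df.apply(lambda x: x['field_dictionary'].update({'gene_name': x['gene_name_y']}), axis = 1)
models_asgtf_updated_df.apply(lambda x: x['field_dictionary'].update({'gene_id': x['gene_id_y']}), axis = 1)
# rebuild the attributes string, then keep the eight standard columns plus the new attributes
models_asgtf_updated_df['new_fields'] = models_asgtf_updated_df['field_dictionary'].apply(convert_dict_to_fields_gtf)
models_asgtf_updated_df = models_asgtf_updated_df[[0, 1, 2, 3, 4, 5, 6, 7, 'new_fields']]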

models_asgtf_updated_filename = models_asgtf.filename.replace('.gtf', '_updated.gtf')
models_asgtf_updated = GenomeGtfFile(
    filename = models_asgtf_updated_filename, 
    sampledict = sample_BFD.sampledict, 
    reference_genome = sample_BFD.genome_fasta
)

models_asgtf_updated_df.to_csv(models_asgtf_updated.path, header = None, index = None, sep = '\t')

# 11. Generate cDNA and peptide files
Using the updated GTF file and the genome file, generate cDNA sequences.

Then, using the cDNA sequences, generate peptide sequences with TransDecoder.

Expect this step to take some time, roughly 20-30 minutes.
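
For orientation, the underlying TransDecoder workflow has three steps: extract cDNA from the genome with the `gtf_genome_to_cdna_fasta.pl` utility, find long ORFs with `TransDecoder.LongOrfs`, and predict coding regions with `TransDecoder.Predict`. The sketch below only builds and prints the command lines; it does not run them, and it is not necessarily how the `biofile_handling` methods invoke the tools. The function name, file names, and default executable names are placeholders.

```python
import shlex


def build_transdecoder_commands(gtf_path, genome_path, cdna_path,
                                cdna_script='gtf_genome_to_cdna_fasta.pl',
                                longorfs='TransDecoder.LongOrfs',
                                predict='TransDecoder.Predict'):
    """Return the three steps as dicts holding an argv list and an optional stdout target."""
    return [
        # the cDNA script writes FASTA to stdout, so it is redirected to a file
        {'argv': [cdna_script, gtf_path, genome_path], 'stdout': cdna_path},
        {'argv': [longorfs, '-t', cdna_path], 'stdout': None},
        {'argv': [predict, '-t', cdna_path], 'stdout': None},
    ]


steps = build_transdecoder_commands('annot_updated.gtf', 'genome.fna', 'genome_cDNA.fna')
for step in steps:
    line = shlex.join(step['argv'])
    if step['stdout']:
        line += ' > ' + shlex.quote(step['stdout'])
    print(line)
```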

In [22]:
cdna = sample_BFD.genome_fasta.get_transdecoder_cdna_gtf(models_asgtf_updated, TRANSDECODER_LOC)
sample_BFD.add_keyfile('cdna', cdna)

transdecoder_files = sample_BFD.cdna.to_pep_files(TDLONGORF_LOC, TDPREDICT_LOC)
sample_BFD.add_keyfiles(transdecoder_files)

-- Skipping CMD: /home/ec2-user/miniconda3/pkgs/transdecoder-5.5.0-pl526_1/opt/transdecoder/util/compute_base_probs.pl /home/ec2-user/glial-origins/output/Drer_adultbrain/GCF_000002035.5_GRCz10_genomic_cDNA.fna 0 > /home/ec2-user/glial-origins/output/Drer_adultbrain/Danio_rerio_adultbrain.transdecoder_dir//base_freqs.dat, checkpoint [/home/ec2-user/glial-origins/output/Drer_adultbrain/Danio_rerio_adultbrain.transdecoder_dir/.__checkpoints_longorfs/base_freqs_file.ok] exists.
-skipping long orf extraction, already completed earlier as per checkpoint: /home/ec2-user/glial-origins/output/Drer_adultbrain/Danio_rerio_adultbrain.transdecoder_dir/.__checkpoints_longorfs/TD.longorfs.ok
-- Skipping CMD: /home/ec2-user/miniconda3/pkgs/transdecoder-5.5.0-pl526_1/opt/transdecoder/util/get_top_longest_fasta_entries.pl /home/ec2-user/glial-origins/output/Drer_adultbrain/Danio_rerio_adultbrain.transdecoder_dir//longest_orfs.cds 5000 5000 > /home/ec2-user/glial-origins/output/Drer_adultbrain/Danio_rer

null device 
          1 


* Running CMD: /home/ec2-user/miniconda3/pkgs/transdecoder-5.5.0-pl526_1/opt/transdecoder/util/PWM/compute_AUC.pl /home/ec2-user/glial-origins/output/Drer_adultbrain/Danio_rerio_adultbrain.transdecoder_dir//start_refinement.feature.scores.roc


null device 
          1 


* Running CMD: /home/ec2-user/miniconda3/pkgs/transdecoder-5.5.0-pl526_1/opt/transdecoder/util/PWM/make_seqLogo.Rscript /home/ec2-user/glial-origins/output/Drer_adultbrain/Danio_rerio_adultbrain.transdecoder_dir//start_refinement.+.pwm || :
Error in library(seqLogo) : there is no package called ‘seqLogo’
Execution halted
* Running CMD: /home/ec2-user/miniconda3/pkgs/transdecoder-5.5.0-pl526_1/opt/transdecoder/util/PWM/make_seqLogo.Rscript /home/ec2-user/glial-origins/output/Drer_adultbrain/Danio_rerio_adultbrain.transdecoder_dir//start_refinement.-.pwm || :
Error in library(seqLogo) : there is no package called ‘seqLogo’
Execution halted
* Running CMD: /home/ec2-user/miniconda3/pkgs/transdecoder-5.5.0-pl526_1/opt/transdecoder/util/PWM/deplete_feature_noise.pl  --features_plus /home/ec2-user/glial-origins/output/Drer_adultbrain/Danio_rerio_adultbrain.transdecoder_dir//start_refinement.+.features  --pwm_minus /home/ec2-user/glial-origins/output/Drer_adultbrain/Danio_rerio_adultbrain.tran

null device 
          1 


* Running CMD: /home/ec2-user/miniconda3/pkgs/transdecoder-5.5.0-pl526_1/opt/transdecoder/util/PWM/compute_AUC.pl /home/ec2-user/glial-origins/output/Drer_adultbrain/Danio_rerio_adultbrain.transdecoder_dir//start_refinement.enhanced.feature.scores.roc


null device 
          1 


* Running CMD: /home/ec2-user/miniconda3/pkgs/transdecoder-5.5.0-pl526_1/opt/transdecoder/util/PWM/make_seqLogo.Rscript /home/ec2-user/glial-origins/output/Drer_adultbrain/Danio_rerio_adultbrain.transdecoder_dir//start_refinement.enhanced.+.pwm || :
Error in library(seqLogo) : there is no package called ‘seqLogo’
Execution halted
* Running CMD: /home/ec2-user/miniconda3/pkgs/transdecoder-5.5.0-pl526_1/opt/transdecoder/util/start_codon_refinement.pl --transcripts /home/ec2-user/glial-origins/output/Drer_adultbrain/GCF_000002035.5_GRCz10_genomic_cDNA.fna --gff3_file /home/ec2-user/glial-origins/output/Drer_adultbrain/Danio_rerio_adultbrain.transdecoder_dir//longest_orfs.cds.best_candidates.gff3 --workdir /home/ec2-user/glial-origins/output/Drer_adultbrain/Danio_rerio_adultbrain.transdecoder_dir/ > /home/ec2-user/glial-origins/output/Drer_adultbrain/Danio_rerio_adultbrain.transdecoder_dir//longest_orfs.cds.best_candidates.gff3.revised_starts.gff3
Refining start codon selections.
-number o

# 12. Push files to AWS S3

Iterate through the BioFile objects in the BioFileDocket and push each one to its corresponding location in AWS S3.

In [23]:
sample_BFD.local_to_s3()

GCF_000002035.5_GRCz10_genomic.gff already exists in S3 bucket, skipping upload. set overwrite = True to overwrite the existing file.
GCF_000002035.5_GRCz10_genomic.fna already exists in S3 bucket, skipping upload. set overwrite = True to overwrite the existing file.
GSM3768152_Brain_8_dge.txt already exists in S3 bucket, skipping upload. set overwrite = True to overwrite the existing file.
Drer_adultbrain_uniprot-idmm.tsv already exists in S3 bucket, skipping upload. set overwrite = True to overwrite the existing file.
Drer_adultbrain_gtf-idmm.tsv already exists in S3 bucket, skipping upload. set overwrite = True to overwrite the existing file.
GCF_000002035.5_GRCz10_genomic_cDNA.fna already exists in S3 bucket, skipping upload. set overwrite = True to overwrite the existing file.
GCF_000002035.5_GRCz10_genomic_cDNA.fna.transdecoder.bed already exists in S3 bucket, skipping upload. set overwrite = True to overwrite the existing file.
GCF_000002035.5_GRCz10_genomic_cDNA.fna.transdecode

# 13. Pickle the `sample_BFD` variable for use by the next script

In [24]:
# Generate a .pkl file for the Docket
sample_BFD.pickle()

# Push to S3, optionally overwriting existing file
sample_BFD.push_to_s3(overwrite = False)

upload: ../../output/Drer_adultbrain/Drer_adultbrain_BioFileDocket.pkl to s3://arcadia-reference-datasets/glial-origins-pkl/Drer_adultbrain_BioFileDocket.pkl
