# Overview
This notebook collects data from the [Liao et al. 2022](https://www.nature.com/articles/s41467-022-31949-2) Xenopus laevis adult cell atlas.

#### Dataset description

- Data can be found at the [GEO Accession GSE195790](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE195790).
- This notebook currently collects data from one sample, ["Xenopus_brain_COL65"](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM6214268), for the purposes of data analysis and exploration.

# 0. Setup

Import packages and specify any important functions here.

In [1]:
# import standard python packages
import pandas as pd
import subprocess, os, dill, sys

# add the utils and env directories to the path
sys.path.append('../../utils/')
sys.path.append('../../env/')

# import functions from utils directory files
from string_functions import *
from biofile_handling import *

# import paths to software installs from env
from install_locs import *

# 1. Download files

#### BioFileDocket
First, make a BioFileDocket for the dataset.

In [2]:
################
# general info #
################

# Specify the name of the species folder in Amazon S3
species = 'Xenopus_laevis'

# Specify any particular identifying conditions, e.g., tissue type:
conditions = 'adultbrain'

################
################

sample_BFD = BioFileDocket(species, conditions)

/home/ec2-user/glial-origins/output/Xlae_adultbrain/ already exists
Files will be saved into /home/ec2-user/glial-origins/output/Xlae_adultbrain/


#### Download and add files
Next, download the `genome_fasta`, annotation, and `gxc` files from their URLs.

In [4]:
################
# general info #
################

# Specify url and other variables
genome_fasta_url = 'https://ftp.xenbase.org/pub/Genomics/JGI/Xenla9.2/XENLA_9.2_genome.fa.gz'
genome_version = 'XENLA_9.2'

annot_url = 'https://ftp.xenbase.org/pub/Genomics/JGI/Xenla9.2/XENLA_9.2_Xenbase.gtf'

# Need to get a second annotation in order to pull usable ID mapping information
annot2_url = 'https://ftp.xenbase.org/pub/Genomics/JGI/Xenla9.2/XENLA_9.2_GCA.gff3'

gxc_url = 'https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM6214nnn/GSM6214268/suppl/GSM6214268_Xenopus_brain_COL65_dge.txt.gz'

################
################

protocol = 'curl'
    
genome_fasta = GenomeFastaFile(
    sampledict = sample_BFD.sampledict,
    version = genome_version,
    url = genome_fasta_url,
    protocol = protocol
)

annot = GenomeGtfFile(
    sampledict = sample_BFD.sampledict,
    reference_genome = genome_fasta,
    url = annot_url,
    protocol = protocol
)

annot2 = GenomeGffFile(
    sampledict = sample_BFD.sampledict,
    reference_genome = genome_fasta,
    url = annot2_url,
    protocol = protocol
)

gxc = GxcFile(
    sampledict = sample_BFD.sampledict,
    reference_genome  = genome_fasta,
    reference_annot = annot,
    url = gxc_url,
    protocol = protocol
)

keyfiles = {
    'annot': annot,
    'annot2': annot2,
    'genome_fasta': genome_fasta,
    'gxc': gxc
}
sample_BFD.add_keyfiles(keyfiles)

display(vars(sample_BFD))

inferring file name as XENLA_9.2_genome.fa.gz
file XENLA_9.2_genome.fa.gz already exists at /home/ec2-user/glial-origins/output/Xlae_adultbrain/XENLA_9.2_genome.fa.gz
file XENLA_9.2_genome.fa.gz unzipped and object renamed to XENLA_9.2_genome.fa
inferring file name as XENLA_9.2_Xenbase.gtf
file XENLA_9.2_Xenbase.gtf already exists at /home/ec2-user/glial-origins/output/Xlae_adultbrain/XENLA_9.2_Xenbase.gtf
inferring file name as XENLA_9.2_GCA.gff3


gzip: /home/ec2-user/glial-origins/output/Xlae_adultbrain/XENLA_9.2_genome.fa: unknown suffix -- ignored
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 96  150M   96  144M    0     0  7523k      0  0:00:20  0:00:19  0:00:01 8992k

downloaded file /home/ec2-user/glial-origins/output/Xlae_adultbrain/XENLA_9.2_GCA.gff3
Renaming GFF3 file XENLA_9.2_GCA.gff3 to XENLA_9.2_GCA.gff
inferring file name as GSM6214268_Xenopus_brain_COL65_dge.txt.gz
file GSM6214268_Xenopus_brain_COL65_dge.txt.gz already exists at /home/ec2-user/glial-origins/output/Xlae_adultbrain/GSM6214268_Xenopus_brain_COL65_dge.txt.gz
file GSM6214268_Xenopus_brain_COL65_dge.txt.gz unzipped and object renamed to GSM6214268_Xenopus_brain_COL65_dge.txt
key "annot" already exists, ignoring
key "annot2" already exists, ignoring
key "genome_fasta" already exists, ignoring
key "gxc" already exists, ignoring


100  150M  100  150M    0     0  7581k      0  0:00:20  0:00:20 --:--:-- 9086k
gzip: /home/ec2-user/glial-origins/output/Xlae_adultbrain/GSM6214268_Xenopus_brain_COL65_dge.txt: unknown suffix -- ignored


{'species': 'Xenopus_laevis',
 'conditions': 'adultbrain',
 'directory': '/home/ec2-user/glial-origins/output/Xlae_adultbrain/',
 'files': {},
 'metadata': <biofile_handling.metadata_object at 0x7f4d617eaef0>,
 'annot': <biofile_handling.GenomeGtfFile at 0x7f4d617ebac0>,
 'annot2': <biofile_handling.GenomeGffFile at 0x7f4d617ea080>,
 'genome_fasta': <biofile_handling.GenomeFastaFile at 0x7f4cb0898be0>,
 'gxc': <biofile_handling.GxcFile at 0x7f4d617eabc0>}
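
The `gzip: ... unknown suffix -- ignored` lines in the output above are what `gzip` prints when asked to decompress a file whose name does not end in `.gz`, which can happen when a file was already unzipped on an earlier run. A minimal sketch of a guard against this follows. `gunzip_if_needed` is a hypothetical helper, not part of `biofile_handling`, and the toy file name is made up.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gunzip_if_needed(path):
    """Decompress `path` only if it ends in .gz; return the uncompressed path.

    Hypothetical helper for illustration, not part of biofile_handling.
    """
    path = Path(path)
    if path.suffix != ".gz":
        # Already uncompressed: nothing to do.
        return path
    target = path.with_suffix("")  # drop the trailing .gz
    if not target.exists():
        with gzip.open(path, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return target


# Demonstration on a throwaway file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    gz_path = Path(tmp) / "toy_genome.fa.gz"
    with gzip.open(gz_path, "wt") as handle:
        handle.write(">chr1\nACGT\n")

    first = gunzip_if_needed(gz_path)  # decompresses to toy_genome.fa
    second = gunzip_if_needed(first)   # no-op: the name has no .gz suffix
    contents = first.read_text()
```

Calling the helper a second time on the unzipped path returns it unchanged instead of invoking decompression again.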

# 2. Extract gxc information

#### Preview gxc matrix
- Show the first 10 rows of the gxc matrix.  
- Extract the first column as a 'gene_name' dataframe.
- Record the number of cells and genes in the gxc to the `sample_BFD.metadata` attribute.

In [5]:
genes_matrix = pd.read_csv(sample_BFD.gxc.path, sep = '\t', nrows = 10)
display(genes_matrix)

gxc_genes_list = pd.read_csv(sample_BFD.gxc.path, sep = '\t', usecols=[0], names = ['gene_name'])
display(gxc_genes_list)

sample_BFD.metadata.add('num_cells', len(genes_matrix.columns) - 1)
# gxc_genes_list holds the file's header value ('GENE') as its first row, so subtract 1
sample_BFD.metadata.add('num_genes', len(gxc_genes_list) - 1)

Unnamed: 0,GENE,AACCTATTCATATAAGGG,CTCGCATCAAAGTTAACT,AACCTAGTATACTTCCGC,AACCTAAAAGTTCTGAAA,CTCGCACGCACCCTCCAT,ACGTTGTATTGTAGCGAG,ACGAGCATGCTTTAGTCG,AACCTAGTCCCGCCATCT,AACCTAGCGAATTAGAGA,...,TCACTTGTTGCCATGCTT,TCGGGTTGTCACACTTAT,TCGTAATCGTAAGTTGCC,TGTCACGAATTACACAAG,TGTGCGTACTTCTAGTCG,TTAACTATACAGTGGATG,TTGGACACTTATGATCTT,AAAGTTACTTATGCCCTC,AACCTACGCACCTGCGGA,AACCTATAGTCGCTGTGT
0,3.S,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,42Sp43.L,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,42Sp50.L,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,AK6.L,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,AK6.S,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,MGC107841.L,1,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,MGC107851.L,0,1,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
7,MGC107876.L,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
8,MGC108117.L,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,MGC108429.L,0,1,3,1,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


Unnamed: 0,gene_name
0,GENE
1,3.S
2,42Sp43.L
3,42Sp50.L
4,AK6.L
...,...
25908,zyg11b.L
25909,zyg11b.S
25910,zzef1.S
25911,zzz3.L
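
The displayed `gxc_genes_list` begins with the header value `GENE` at index 0, so that row is not a gene. As a cross-check on the counts, here is a minimal sketch that reads a tab-separated matrix line by line and treats the first line strictly as a header. The toy matrix and the function name `count_genes_and_cells` are assumptions for illustration only.

```python
import csv
import io

# Toy stand-in for the DGE file: a header row of cell barcodes, then one row per gene.
toy_dge = io.StringIO(
    "GENE\tAACCTA\tCTCGCA\tACGTTG\n"
    "AK6.L\t0\t1\t0\n"
    "AK6.S\t1\t0\t0\n"
)


def count_genes_and_cells(handle):
    """Return (num_genes, num_cells) for a tab-separated gene-by-cell matrix."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    num_cells = len(header) - 1                  # first column holds gene names
    num_genes = sum(1 for row in reader if row)  # header line already consumed
    return num_genes, num_cells


num_genes, num_cells = count_genes_and_cells(toy_dge)
```

For the real file, the same function could be given an open file handle instead of the in-memory `StringIO` object.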


# 3. Get mapping identifiers

In [6]:
# load in the original GTF-based annotation
models = pd.read_csv(sample_BFD.annot.path, skiprows = 0, header = None, sep = '\t', on_bad_lines = 'skip', comment='#')
display(models)

attributes_column = 8

# Check the structure of fields in the GTF additional fields section
display(models[attributes_column][0])

# Remove any rows with NaNs
models.dropna(inplace = True)

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,MT,Xenbase,exon,2136,2204,.,+,.,"gene_id ""gene42065""; gene_name ""mt-trna-phe.L""..."
1,MT,Xenbase,exon,2205,3023,.,+,.,"gene_id ""gene34778""; gene_name ""mt-rnr1.L""; tr..."
2,MT,Xenbase,exon,3024,3092,.,+,.,"gene_id ""gene48202""; gene_name ""mt-trna-val.L""..."
3,MT,Xenbase,exon,3093,4723,.,+,.,"gene_id ""gene44770""; gene_name ""mt-rnr2.L""; tr..."
4,MT,Xenbase,exon,4724,4798,.,+,.,"gene_id ""gene43253""; gene_name ""mt-trna-leu1.L..."
...,...,...,...,...,...,...,...,...,...
771995,chr9_10S,Xenbase,CDS,104527008,104527068,.,-,0,"gene_id ""gene24857""; gene_name ""Xetrov90026942..."
771996,chr9_10S,Xenbase,exon,104530653,104530753,.,-,2,"gene_id ""gene24857""; gene_name ""Xetrov90026942..."
771997,chr9_10S,Xenbase,CDS,104530653,104530753,.,-,2,"gene_id ""gene24857""; gene_name ""Xetrov90026942..."
771998,chr9_10S,Xenbase,exon,104531027,104531054,.,-,0,"gene_id ""gene24857""; gene_name ""Xetrov90026942..."


'gene_id "gene42065"; gene_name "mt-trna-phe.L"; transcript_id "gene42065_t"; transcript_name "mt-trna-phe.L";'

In [7]:
# Extract field and database cross-ref (dbxref) information into columns
models['field_dictionary'] = models[attributes_column].apply(convert_fields_to_dict_gtf)
models['gene_name'] = [d.get('gene_name') for d in models['field_dictionary']]
display(models)

sample_BFD.metadata.add('num_annot_models', len(models['gene_name'].unique()))
display(sample_BFD.metadata.num_annot_models)

Unnamed: 0,0,1,2,3,4,5,6,7,8,field_dictionary,gene_name
0,MT,Xenbase,exon,2136,2204,.,+,.,"gene_id ""gene42065""; gene_name ""mt-trna-phe.L""...","{'gene_id': 'gene42065', 'gene_name': 'mt-trna...",mt-trna-phe.L
1,MT,Xenbase,exon,2205,3023,.,+,.,"gene_id ""gene34778""; gene_name ""mt-rnr1.L""; tr...","{'gene_id': 'gene34778', 'gene_name': 'mt-rnr1...",mt-rnr1.L
2,MT,Xenbase,exon,3024,3092,.,+,.,"gene_id ""gene48202""; gene_name ""mt-trna-val.L""...","{'gene_id': 'gene48202', 'gene_name': 'mt-trna...",mt-trna-val.L
3,MT,Xenbase,exon,3093,4723,.,+,.,"gene_id ""gene44770""; gene_name ""mt-rnr2.L""; tr...","{'gene_id': 'gene44770', 'gene_name': 'mt-rnr2...",mt-rnr2.L
4,MT,Xenbase,exon,4724,4798,.,+,.,"gene_id ""gene43253""; gene_name ""mt-trna-leu1.L...","{'gene_id': 'gene43253', 'gene_name': 'mt-trna...",mt-trna-leu1.L
...,...,...,...,...,...,...,...,...,...,...,...
771995,chr9_10S,Xenbase,CDS,104527008,104527068,.,-,0,"gene_id ""gene24857""; gene_name ""Xetrov90026942...","{'gene_id': 'gene24857', 'gene_name': 'Xetrov9...",Xetrov90026942m.S
771996,chr9_10S,Xenbase,exon,104530653,104530753,.,-,2,"gene_id ""gene24857""; gene_name ""Xetrov90026942...","{'gene_id': 'gene24857', 'gene_name': 'Xetrov9...",Xetrov90026942m.S
771997,chr9_10S,Xenbase,CDS,104530653,104530753,.,-,2,"gene_id ""gene24857""; gene_name ""Xetrov90026942...","{'gene_id': 'gene24857', 'gene_name': 'Xetrov9...",Xetrov90026942m.S
771998,chr9_10S,Xenbase,exon,104531027,104531054,.,-,0,"gene_id ""gene24857""; gene_name ""Xetrov90026942...","{'gene_id': 'gene24857', 'gene_name': 'Xetrov9...",Xetrov90026942m.S


49282
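
The implementation of `convert_fields_to_dict_gtf` lives in the `utils` directory and is not shown in this notebook. As a rough sketch of what such a parser might do, the function below splits a GTF attribute string into `key "value";` pairs. `parse_gtf_attributes` is a hypothetical name, and the example string is the one displayed above.

```python
import re

# A GTF attribute is a key, whitespace, then a double-quoted value.
_GTF_PAIR = re.compile(r'(\S+)\s+"([^"]*)"')


def parse_gtf_attributes(attributes):
    """Parse a GTF attribute string into a dict (hypothetical helper)."""
    return {key: value for key, value in _GTF_PAIR.findall(attributes)}


example = (
    'gene_id "gene42065"; gene_name "mt-trna-phe.L"; '
    'transcript_id "gene42065_t"; transcript_name "mt-trna-phe.L";'
)
parsed = parse_gtf_attributes(example)
```

Matching on the quoted value rather than splitting on `;` keeps values that contain spaces or semicolons intact.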

In [8]:
# Need to load another annotation to be able to extract a Xenbase ID
models2 = pd.read_csv(sample_BFD.annot2.path, skiprows = 0, header = None, sep = '\t', on_bad_lines = 'skip', comment='#')
display(models2)

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,chr1L,Genbank,gene,17924,18399,.,-,.,"ID=gene0;Name=Xelaev18004747m;end_range=18399,..."
1,chr1L,Genbank,mRNA,17924,18399,.,-,.,ID=rna0;Parent=gene0;Note=transcript XELAEV_18...
2,chr1L,Genbank,exon,18336,18399,.,-,.,ID=id1;Parent=rna0;Note=transcript XELAEV_1800...
3,chr1L,Genbank,exon,17924,18243,.,-,.,ID=id2;Parent=rna0;Note=transcript XELAEV_1800...
4,chr1L,Genbank,CDS,18336,18399,.,-,0,ID=cds0;Parent=rna0;Dbxref=Phytozome:Xelaev180...
...,...,...,...,...,...,...,...,...,...
809901,Scaffold94051,Genbank,CDS,8,205,.,+,0,ID=cds47730;Parent=rna47730;Dbxref=Phytozome:X...
809902,Scaffold95291,Genbank,gene,9,236,.,-,.,ID=gene45941;Name=Xelaev18004691m;end_range=23...
809903,Scaffold95291,Genbank,mRNA,9,236,.,-,.,ID=rna47731;Parent=gene45941;Note=transcript X...
809904,Scaffold95291,Genbank,exon,9,236,.,-,.,ID=id452950;Parent=rna47731;Note=transcript XE...


In [9]:
# Extract field and database cross-ref (dbxref) information into columns
models2['field_dictionary'] = models2[attributes_column].apply(convert_fields_to_dict_gff)
models2['gene_name'] = [d.get('Name') for d in models2['field_dictionary']]
models2['Xenbase'] = [d.get('Alias') for d in models2['field_dictionary']]
display(models2)

# Extract gene_name and Xenbase unique pairs to get id mapping between two annotations
xenbase_keys = models2[['gene_name', 'Xenbase']].dropna().drop_duplicates()
xenbase_keys = xenbase_keys.groupby('gene_name').agg({'gene_name': 'first', 'Xenbase': ','.join}).reset_index(drop = True)
display(xenbase_keys)

Unnamed: 0,0,1,2,3,4,5,6,7,8,field_dictionary,gene_name,Xenbase
0,chr1L,Genbank,gene,17924,18399,.,-,.,"ID=gene0;Name=Xelaev18004747m;end_range=18399,...","{'ID': 'gene0', 'Name': 'Xelaev18004747m', 'en...",Xelaev18004747m,XB-GENE-5942444
1,chr1L,Genbank,mRNA,17924,18399,.,-,.,ID=rna0;Parent=gene0;Note=transcript XELAEV_18...,"{'ID': 'rna0', 'Parent': 'gene0', 'Note': 'tra...",,
2,chr1L,Genbank,exon,18336,18399,.,-,.,ID=id1;Parent=rna0;Note=transcript XELAEV_1800...,"{'ID': 'id1', 'Parent': 'rna0', 'Note': 'trans...",,
3,chr1L,Genbank,exon,17924,18243,.,-,.,ID=id2;Parent=rna0;Note=transcript XELAEV_1800...,"{'ID': 'id2', 'Parent': 'rna0', 'Note': 'trans...",,
4,chr1L,Genbank,CDS,18336,18399,.,-,0,ID=cds0;Parent=rna0;Dbxref=Phytozome:Xelaev180...,"{'ID': 'cds0', 'Parent': 'rna0', 'Dbxref': 'Ph...",OCT98948.1,
...,...,...,...,...,...,...,...,...,...,...,...,...
809901,Scaffold94051,Genbank,CDS,8,205,.,+,0,ID=cds47730;Parent=rna47730;Dbxref=Phytozome:X...,"{'ID': 'cds47730', 'Parent': 'rna47730', 'Dbxr...",OCT55143.1,
809902,Scaffold95291,Genbank,gene,9,236,.,-,.,ID=gene45941;Name=Xelaev18004691m;end_range=23...,"{'ID': 'gene45941', 'Name': 'Xelaev18004691m',...",Xelaev18004691m,
809903,Scaffold95291,Genbank,mRNA,9,236,.,-,.,ID=rna47731;Parent=gene45941;Note=transcript X...,"{'ID': 'rna47731', 'Parent': 'gene45941', 'Not...",,
809904,Scaffold95291,Genbank,exon,9,236,.,-,.,ID=id452950;Parent=rna47731;Note=transcript XE...,"{'ID': 'id452950', 'Parent': 'rna47731', 'Note...",,


Unnamed: 0,gene_name,Xenbase
0,42sp43.L,XB-GENE-6252610
1,42sp50.L,XB-GENE-5853356
2,LOC100036702.L,XB-GENE-18005853
3,LOC100037907.L,XB-GENE-18005786
4,LOC100037907.S,XB-GENE-18005873
...,...,...
21951,zyx.L,XB-GENE-6252950
21952,zyx.S,XB-GENE-6253946
21953,zzef1.S,XB-GENE-6488145
21954,zzz3.L,XB-GENE-6486634
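
The mapping step above relies on two operations: parsing GFF3 `key=value` attributes, and collapsing several Xenbase IDs for one gene name into a comma-joined string. A minimal standard-library sketch of both follows. `parse_gff3_attributes` is a hypothetical stand-in for `convert_fields_to_dict_gff`, whose implementation is not shown here, and the toy rows and IDs are made up.

```python
from collections import defaultdict
from urllib.parse import unquote


def parse_gff3_attributes(attributes):
    """Parse a GFF3 attribute string (key=value;key=value) into a dict."""
    fields = {}
    for pair in attributes.strip().strip(";").split(";"):
        if "=" not in pair:
            continue
        key, value = pair.split("=", 1)
        # GFF3 percent-encodes reserved characters inside values.
        fields[key.strip()] = unquote(value.strip())
    return fields


# Toy attribute strings; the names and IDs are illustrative only.
toy_rows = [
    "ID=gene0;Name=abc1.L;Alias=XB-GENE-1",
    "ID=gene1;Name=abc1.L;Alias=XB-GENE-2",
    "ID=rna0;Parent=gene0",
    "ID=gene2;Name=xyz2.S;Alias=XB-GENE-3",
    "ID=gene3;Name=abc1.L;Alias=XB-GENE-1",
]

aliases_by_name = defaultdict(list)
for row in toy_rows:
    fields = parse_gff3_attributes(row)
    name, alias = fields.get("Name"), fields.get("Alias")
    if name is None or alias is None:
        continue  # mirrors dropna(): skip rows missing either field
    if alias not in aliases_by_name[name]:
        aliases_by_name[name].append(alias)  # mirrors drop_duplicates()

# One row per gene name, with multiple IDs joined by commas.
xenbase_keys = {name: ",".join(ids) for name, ids in aliases_by_name.items()}
```

A gene name with several IDs ends up with a comma-joined value, which matters later because a joined string will not match a single identifier in an ID-mapping query.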


In [10]:
models = models.merge(xenbase_keys, on = 'gene_name', how = 'left')
new_models = models.copy(deep = True)

# Collect information about the number of unique gene_name + Xenbase pairs
sample_BFD.metadata.add('num_xenbase_models', len(new_models[['gene_name', 'Xenbase']].dropna().drop_duplicates())) 
display(sample_BFD.metadata.num_xenbase_models)

# Create a new field dictionary based on new fields
new_models['Xenbase'] = new_models['Xenbase'].astype(str)
new_models.apply(lambda x: x['field_dictionary'].update({'Xenbase': x['Xenbase']}), axis = 1)
new_models['new_fields'] = new_models['field_dictionary'].apply(convert_dict_to_fields_gtf)

new_models = new_models[[0, 1, 2, 3, 4, 5, 6, 7, 'new_fields']]

models_asgtf_updated_filename = sample_BFD.annot.filename.replace('.gtf', '_updated.gtf')
models_asgtf_updated = GenomeGtfFile(
    filename = models_asgtf_updated_filename, 
    sampledict = sample_BFD.sampledict,
    reference_genome = sample_BFD.genome_fasta)

new_models.to_csv(models_asgtf_updated.path, sep = '\t', index = None, header = None)
display(new_models)

19286

Unnamed: 0,0,1,2,3,4,5,6,7,new_fields
0,MT,Xenbase,exon,2136,2204,.,+,.,"gene_id ""gene42065""; gene_name ""mt-trna-phe.L""..."
1,MT,Xenbase,exon,2205,3023,.,+,.,"gene_id ""gene34778""; gene_name ""mt-rnr1.L""; tr..."
2,MT,Xenbase,exon,3024,3092,.,+,.,"gene_id ""gene48202""; gene_name ""mt-trna-val.L""..."
3,MT,Xenbase,exon,3093,4723,.,+,.,"gene_id ""gene44770""; gene_name ""mt-rnr2.L""; tr..."
4,MT,Xenbase,exon,4724,4798,.,+,.,"gene_id ""gene43253""; gene_name ""mt-trna-leu1.L..."
...,...,...,...,...,...,...,...,...,...
771995,chr9_10S,Xenbase,CDS,104527008,104527068,.,-,0,"gene_id ""gene24857""; gene_name ""Xetrov90026942..."
771996,chr9_10S,Xenbase,exon,104530653,104530753,.,-,2,"gene_id ""gene24857""; gene_name ""Xetrov90026942..."
771997,chr9_10S,Xenbase,CDS,104530653,104530753,.,-,2,"gene_id ""gene24857""; gene_name ""Xetrov90026942..."
771998,chr9_10S,Xenbase,exon,104531027,104531054,.,-,0,"gene_id ""gene24857""; gene_name ""Xetrov90026942..."
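
Casting the `Xenbase` column to `str` turns missing values into the literal text `nan`, which is then written into the attribute column. One possible alternative, which is not what this notebook does, is to leave out attributes that have no value. The sketch below assumes hypothetical helpers `is_missing` and `dict_to_gtf_attributes`, and the gene names and IDs are made up.

```python
import math


def is_missing(value):
    """Return True for None, float NaN, empty strings, and the text 'nan'."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return str(value).strip().lower() in {"", "nan"}


def dict_to_gtf_attributes(fields):
    """Serialize a dict as GTF attributes, skipping missing values (hypothetical helper)."""
    return " ".join(
        f'{key} "{value}";' for key, value in fields.items() if not is_missing(value)
    )


with_id = dict_to_gtf_attributes(
    {"gene_id": "gene1", "gene_name": "abc1.L", "Xenbase": "XB-GENE-1"}
)
without_id = dict_to_gtf_attributes(
    {"gene_id": "gene2", "gene_name": "xyz2.S", "Xenbase": float("nan")}
)
```

Whether downstream tools expect the `Xenbase` key to be present on every row is not stated here, so check that before dropping it.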


# 4. Extract gene IDs for mapping to UniprotKB
Specify which set of identifiers will be used to query the [Uniprot ID Mapping Tool](https://www.uniprot.org/id-mapping) via API.

If using an identifier from the `dbxref_dict`, specify the name via string in the `dbxref_datafield` variable.

In [11]:
dbxref_datafield = ''
datafield = 'Xenbase'

if dbxref_datafield != '':
    models.dropna(axis = 0, subset = ['dbxref_dict'], inplace = True)
    models[dbxref_datafield] = [d.get(dbxref_datafield) for d in models['dbxref_dict']]

    models_subset = models[['gene_name', dbxref_datafield]].dropna().drop_duplicates()
    display(models_subset)

elif datafield == 'gene_name':
    models_subset = models[['gene_name']].dropna().drop_duplicates()
    display(models_subset)

elif datafield != '':
    models_subset = models[['gene_name', datafield]].dropna().drop_duplicates()
    display(models_subset)

else:
    raise Exception('You must provide a data field for ID mapping.')

Unnamed: 0,gene_name,Xenbase
192,vdac2.S,XB-GENE-954844
212,samd8.S,XB-GENE-17343426
227,dusp13.S,XB-GENE-17337763
258,adk.S,XB-GENE-997231
281,vcl.S,XB-GENE-5759162
...,...,...
771494,nif3l1.S,XB-GENE-6078750
771661,tpsg1.S,XB-GENE-5893017
771750,loc100491461.S,XB-GENE-6489054
771854,rnf112.1.S,XB-GENE-5960855


# 5. Generate gene list file to query Uniprot ID Mapping API
Generate a text file ending in `_ids.txt` for sending to the ID mapping API.

In [12]:
datafield = dbxref_datafield if dbxref_datafield != '' else datafield

gene_list = models_subset[datafield].unique()

genelist_object = GeneListFile(
    sampledict = sample_BFD.sampledict,
    sources = [sample_BFD.annot],
    genes = gene_list,
    identifier = datafield
    )

Wrote 19285 gene ids to /home/ec2-user/glial-origins/output/Xlae_adultbrain/Xlae_adultbrain_Xenbase_ids.txt


# 6. Query Uniprot ID Mapping API
Specify the `from_type` variable based on the Uniprot name of the identifier.  
The table below lists some databases and the `from_type` string that the API accepts for that datatype.  

| datatype | `from_type` string | description |
| ---: | :--- | :--- |
| Mouse Genome Informatics | `MGI` | ID starts with `MGI:` |
| Zebrafish Information Network | `ZFIN` | ID starts with `ZDB-GENE-` |
| Xenbase | `Xenbase` | ID starts with `XB-GENE-` |

__NOTE:__ You may have to run the cell below twice: UniProt sometimes returns a "Resource not found" message on the first query to the database.
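
Instead of rerunning the cell by hand, the query could be wrapped in a small retry helper. This is a sketch under an assumption the notebook does not confirm, namely that a failed query raises an exception. `call_with_retries` and `flaky_query` are hypothetical names, and `flaky_query` only simulates a first-attempt failure.

```python
import time


def call_with_retries(func, attempts=3, delay_seconds=2.0, exceptions=(Exception,)):
    """Call func() up to `attempts` times, pausing between failed tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions as error:
            last_error = error
            if attempt < attempts:
                time.sleep(delay_seconds)
    raise last_error


# Simulated query that fails once, then succeeds.
calls = {"count": 0}


def flaky_query():
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("Resource not found")
    return "ok"


result = call_with_retries(flaky_query, attempts=2, delay_seconds=0)
```

In the notebook, `func` could be a `lambda` wrapping the `get_uniprot_ids` call, provided that call signals failure by raising.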

In [13]:
from_type = 'Xenbase'
to_type = 'UniProtKB'

uniprot_idmm_object = genelist_object.get_uniprot_ids(ID_MAPPER_LOC, from_type, to_type)
sample_BFD.add_keyfile('uniprot_idmm', uniprot_idmm_object, overwrite = True)

uniprot_idmm = pd.read_csv(sample_BFD.uniprot_idmm.path, sep = '\t')
display(uniprot_idmm)

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  305k    0    52  100  305k     20   118k  0:00:02  0:00:02 --:--:--  118k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    24    0    24    0     0     28      0 --:--:-- --:--:-- --:--:--    28


Unnamed: 0,From,Entry,Entry Name,Reviewed,Protein names,Gene Names,Organism,Length
0,XB-GENE-954844,Q52KY1,Q52KY1_XENLA,unreviewed,Voltage-dependent anion-selective channel prot...,vdac2.S vdac2 vdac2.L,Xenopus laevis (African clawed frog),283
1,XB-GENE-997231,Q6DJM1,Q6DJM1_XENLA,unreviewed,Adenosine kinase (AK) (EC 2.7.1.20) (Adenosine...,adk.S adk MGC82032,Xenopus laevis (African clawed frog),361
2,XB-GENE-5910051,Q6GM53,IKKA_XENLA,reviewed,Inhibitor of nuclear factor kappa-B kinase sub...,chuk ikka,Xenopus laevis (African clawed frog),743
3,XB-GENE-958583,Q641H0,Q641H0_XENLA,unreviewed,Annexin,anxa11.S anx11 anxa11 cap50 MGC81584,Xenopus laevis (African clawed frog),502
4,XB-GENE-6251658,Q5FWP7,Q5FWP7_XENLA,unreviewed,MGC84996 protein (nuclear receptor coactivator...,ncoa4.S ara70 ele1 MGC84996 ncoa4 ncoa4-b ptc3...,Xenopus laevis (African clawed frog),625
...,...,...,...,...,...,...,...,...
17505,XB-GENE-6078750,A0A1L8ERE3,A0A1L8ERE3_XENLA,unreviewed,NIF3-like protein 1,nif3l1.S nif3l1 nif3l1.L,Xenopus laevis (African clawed frog),369
17506,XB-GENE-6078750,Q0IHC9,Q0IHC9_XENLA,unreviewed,NIF3-like protein 1,nif3l1.S MGC154449 nif3l1 nif3l1.L,Xenopus laevis (African clawed frog),344
17507,XB-GENE-5893017,A0A1L8ERF3,A0A1L8ERF3_XENLA,unreviewed,serine protease 27,tpsg1.S,Xenopus laevis (African clawed frog),280
17508,XB-GENE-5960855,A0A1L8ERI6,A0A1L8ERI6_XENLA,unreviewed,RING finger protein 112,rnf112.1.S,Xenopus laevis (African clawed frog),625


# 7. Extract results and generate Uniprot IDMM
Generates an idmm that links `gene_name`, the `dbxref_datafield` selected above, and the `uniprot_id` returned by the API.

In [14]:
# Copy the two columns so the rename and type cast act on an independent DataFrame
uniprot_idpairs = uniprot_idmm[['From', 'Entry']].copy()
uniprot_idpairs = uniprot_idpairs.rename(columns = {'From': datafield, 'Entry': 'uniprot_id'})
uniprot_idpairs[datafield] = uniprot_idpairs[datafield].astype(str)
display(uniprot_idpairs)

uniprot_output_idmm = models_subset.merge(uniprot_idpairs, on = datafield)
display(uniprot_output_idmm)


Unnamed: 0,Xenbase,uniprot_id
0,XB-GENE-954844,Q52KY1
1,XB-GENE-997231,Q6DJM1
2,XB-GENE-5910051,Q6GM53
3,XB-GENE-958583,Q641H0
4,XB-GENE-6251658,Q5FWP7
...,...,...
17505,XB-GENE-6078750,A0A1L8ERE3
17506,XB-GENE-6078750,Q0IHC9
17507,XB-GENE-5893017,A0A1L8ERF3
17508,XB-GENE-5960855,A0A1L8ERI6


Unnamed: 0,gene_name,Xenbase,uniprot_id
0,vdac2.S,XB-GENE-954844,Q52KY1
1,adk.S,XB-GENE-997231,Q6DJM1
2,chuk.S,XB-GENE-5910051,Q6GM53
3,anxa11.S,XB-GENE-958583,Q641H0
4,ncoa4.S,XB-GENE-6251658,Q5FWP7
...,...,...,...
17474,nif3l1.S,XB-GENE-6078750,A0A1L8ERE3
17475,nif3l1.S,XB-GENE-6078750,Q0IHC9
17476,tpsg1.S,XB-GENE-5893017,A0A1L8ERF3
17477,rnf112.1.S,XB-GENE-5960855,A0A1L8ERI6
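
The merge above is an inner join, so two things happen: a gene with several UniProt entries appears on several rows (as `nif3l1.S` does in the output), and a gene with no UniProt hit is dropped. A minimal sketch on toy data shows both effects. All names, IDs, and accessions below are made up.

```python
import pandas as pd

# Toy gene_name-to-Xenbase table.
models_subset = pd.DataFrame({
    "gene_name": ["abc1.S", "xyz2.S", "nohit3.L"],
    "Xenbase": ["XB-GENE-1", "XB-GENE-2", "XB-GENE-3"],
})

# Toy API result: XB-GENE-2 maps to two entries, XB-GENE-3 to none.
uniprot_idpairs = pd.DataFrame({
    "Xenbase": ["XB-GENE-1", "XB-GENE-2", "XB-GENE-2"],
    "uniprot_id": ["P00001", "P00002", "P00003"],
})

# Default merge is an inner join: unmatched genes disappear, multi-hits repeat.
idmm = models_subset.merge(uniprot_idpairs, on="Xenbase")

# A left join with indicator=True lists the genes that received no UniProt ID.
checked = models_subset.merge(uniprot_idpairs, on="Xenbase", how="left", indicator=True)
unmapped = checked.loc[checked["_merge"] == "left_only", "gene_name"].tolist()
```

Recording the length of `unmapped` alongside the other metadata would make the mapping loss at this step explicit.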


In [15]:
# generate a filename and file for the idmm
uniprot_output_idmm_filename = '_'.join([sample_BFD.species_prefix, conditions, 'uniprot-idmm.tsv'])
uniprot_output_idmm_object = IdmmFile(
    filename = uniprot_output_idmm_filename, 
    sampledict = sample_BFD.sampledict, 
    kind = 'uniprot_idmm', 
    sources = [sample_BFD.annot]
)

# save to file and add to the BioFileDocket
uniprot_output_idmm.to_csv(uniprot_output_idmm_object.path, sep = '\t')
sample_BFD.add_keyfile('uniprot_idmm', uniprot_output_idmm_object)

key "uniprot_idmm" already exists, ignoring


# 9. Generate gtf-idmm
This file maps `gene_name` to the `gene_id` and `transcript_id` fields in the updated GTF file, which will be needed for downstream processing.

In [16]:
models_asgtf_df = pd.read_csv(models_asgtf_updated.path, sep = '\t', header = None)

display(models_asgtf_df)
display(models_asgtf_df[attributes_column][1])

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,MT,Xenbase,exon,2136,2204,.,+,.,"gene_id ""gene42065""; gene_name ""mt-trna-phe.L""..."
1,MT,Xenbase,exon,2205,3023,.,+,.,"gene_id ""gene34778""; gene_name ""mt-rnr1.L""; tr..."
2,MT,Xenbase,exon,3024,3092,.,+,.,"gene_id ""gene48202""; gene_name ""mt-trna-val.L""..."
3,MT,Xenbase,exon,3093,4723,.,+,.,"gene_id ""gene44770""; gene_name ""mt-rnr2.L""; tr..."
4,MT,Xenbase,exon,4724,4798,.,+,.,"gene_id ""gene43253""; gene_name ""mt-trna-leu1.L..."
...,...,...,...,...,...,...,...,...,...
771995,chr9_10S,Xenbase,CDS,104527008,104527068,.,-,0,"gene_id ""gene24857""; gene_name ""Xetrov90026942..."
771996,chr9_10S,Xenbase,exon,104530653,104530753,.,-,2,"gene_id ""gene24857""; gene_name ""Xetrov90026942..."
771997,chr9_10S,Xenbase,CDS,104530653,104530753,.,-,2,"gene_id ""gene24857""; gene_name ""Xetrov90026942..."
771998,chr9_10S,Xenbase,exon,104531027,104531054,.,-,0,"gene_id ""gene24857""; gene_name ""Xetrov90026942..."


'gene_id "gene34778"; gene_name "mt-rnr1.L"; transcript_id "gene34778_t"; transcript_name "mt-rnr1.L"; Xenbase "nan"'

In [17]:
# Use a custom function to extract useful fields from the additional fields section (column 8)
# Pull from that dict to fill in additional useful columns
models_asgtf_df['field_dictionary'] = models_asgtf_df[attributes_column].apply(convert_fields_to_dict_gtf)
models_asgtf_df['gene_name'] = [d.get('gene_name') for d in models_asgtf_df['field_dictionary']]
models_asgtf_df['gene_id'] = [d.get('gene_id') for d in models_asgtf_df['field_dictionary']]
models_asgtf_df['transcript_id'] = [d.get('transcript_id') for d in models_asgtf_df['field_dictionary']]

# Remove CDS annotations because they interfere with TransDecoder cDNA generation
models_asgtf_df = models_asgtf_df[models_asgtf_df[2] != 'CDS']
display(models_asgtf_df)

Unnamed: 0,0,1,2,3,4,5,6,7,8,field_dictionary,gene_name,gene_id,transcript_id
0,MT,Xenbase,exon,2136,2204,.,+,.,"gene_id ""gene42065""; gene_name ""mt-trna-phe.L""...","{'gene_id': 'gene42065', 'gene_name': 'mt-trna...",mt-trna-phe.L,gene42065,gene42065_t
1,MT,Xenbase,exon,2205,3023,.,+,.,"gene_id ""gene34778""; gene_name ""mt-rnr1.L""; tr...","{'gene_id': 'gene34778', 'gene_name': 'mt-rnr1...",mt-rnr1.L,gene34778,gene34778_t
2,MT,Xenbase,exon,3024,3092,.,+,.,"gene_id ""gene48202""; gene_name ""mt-trna-val.L""...","{'gene_id': 'gene48202', 'gene_name': 'mt-trna...",mt-trna-val.L,gene48202,gene48202_t
3,MT,Xenbase,exon,3093,4723,.,+,.,"gene_id ""gene44770""; gene_name ""mt-rnr2.L""; tr...","{'gene_id': 'gene44770', 'gene_name': 'mt-rnr2...",mt-rnr2.L,gene44770,gene44770_t
4,MT,Xenbase,exon,4724,4798,.,+,.,"gene_id ""gene43253""; gene_name ""mt-trna-leu1.L...","{'gene_id': 'gene43253', 'gene_name': 'mt-trna...",mt-trna-leu1.L,gene43253,gene43253_t
...,...,...,...,...,...,...,...,...,...,...,...,...,...
771990,chr9_10S,Xenbase,exon,104524309,104524351,.,-,0,"gene_id ""gene24857""; gene_name ""Xetrov90026942...","{'gene_id': 'gene24857', 'gene_name': 'Xetrov9...",Xetrov90026942m.S,gene24857,rna64354
771992,chr9_10S,Xenbase,exon,104525747,104525868,.,-,2,"gene_id ""gene24857""; gene_name ""Xetrov90026942...","{'gene_id': 'gene24857', 'gene_name': 'Xetrov9...",Xetrov90026942m.S,gene24857,rna64354
771994,chr9_10S,Xenbase,exon,104527008,104527068,.,-,0,"gene_id ""gene24857""; gene_name ""Xetrov90026942...","{'gene_id': 'gene24857', 'gene_name': 'Xetrov9...",Xetrov90026942m.S,gene24857,rna64354
771996,chr9_10S,Xenbase,exon,104530653,104530753,.,-,2,"gene_id ""gene24857""; gene_name ""Xetrov90026942...","{'gene_id': 'gene24857', 'gene_name': 'Xetrov9...",Xetrov90026942m.S,gene24857,rna64354


In [18]:
# Extract gene_name, gene_id, and transcript_id fields to generate an ID mapping matrix (idmm)
idmm_df = models_asgtf_df[['gene_name', 'gene_id', 'transcript_id']].drop_duplicates()
idmm_df.dropna(inplace = True)
display(idmm_df)

# generate a filename and file for the idmm
idmm_filename = '_'.join([sample_BFD.species_prefix, conditions, 'gtf-idmm.tsv'])
idmm = IdmmFile(
    filename = idmm_filename, 
    sampledict = sample_BFD.sampledict, 
    kind = 'gtf_idmm', 
    sources = [sample_BFD.annot]
)

# save to file and add to the BioFileDocket
idmm_df.to_csv(idmm.path, sep = '\t')
sample_BFD.add_keyfile('gtf_idmm', idmm)

Unnamed: 0,gene_name,gene_id,transcript_id
0,mt-trna-phe.L,gene42065,gene42065_t
1,mt-rnr1.L,gene34778,gene34778_t
2,mt-trna-val.L,gene48202,gene48202_t
3,mt-rnr2.L,gene44770,gene44770_t
4,mt-trna-leu1.L,gene43253,gene43253_t
...,...,...,...
771955,vkorc1.S,gene13974,rna40672
771956,vkorc1.S,gene13974,rna40673
771962,LOC108703288,gene37157,rna77797
771977,LOC108702356 [provisional:urgcp],gene34762,rna74435


# 11. Generate cDNA and peptide files
Using the updated GTF file and the genome file, generate cDNA sequences.

Then, using the cDNA sequences, generate peptide sequences with TransDecoder.

Expect this step to take some time, roughly 20-30 minutes.

In [19]:
cdna = sample_BFD.genome_fasta.get_transdecoder_cdna_gtf(models_asgtf_updated, TRANSDECODER_LOC)
sample_BFD.add_keyfile('cdna', cdna)

transdecoder_files = sample_BFD.cdna.to_pep_files(TDLONGORF_LOC, TDPREDICT_LOC)
sample_BFD.add_keyfiles(transdecoder_files)

-- Skipping CMD: /home/ec2-user/miniconda3/pkgs/transdecoder-5.5.0-pl526_1/opt/transdecoder/util/compute_base_probs.pl /home/ec2-user/glial-origins/output/Xlae_adultbrain/XENLA_9.2_genome_cDNA.fa 0 > /home/ec2-user/glial-origins/output/Xlae_adultbrain/Xenopus_laevis_adultbrain.transdecoder_dir//base_freqs.dat, checkpoint [/home/ec2-user/glial-origins/output/Xlae_adultbrain/Xenopus_laevis_adultbrain.transdecoder_dir/.__checkpoints_longorfs/base_freqs_file.ok] exists.
-skipping long orf extraction, already completed earlier as per checkpoint: /home/ec2-user/glial-origins/output/Xlae_adultbrain/Xenopus_laevis_adultbrain.transdecoder_dir/.__checkpoints_longorfs/TD.longorfs.ok
-- Skipping CMD: /home/ec2-user/miniconda3/pkgs/transdecoder-5.5.0-pl526_1/opt/transdecoder/util/get_top_longest_fasta_entries.pl /home/ec2-user/glial-origins/output/Xlae_adultbrain/Xenopus_laevis_adultbrain.transdecoder_dir//longest_orfs.cds 5000 5000 > /home/ec2-user/glial-origins/output/Xlae_adultbrain/Xenopus_laev

null device 
          1 


* Running CMD: /home/ec2-user/miniconda3/pkgs/transdecoder-5.5.0-pl526_1/opt/transdecoder/util/PWM/compute_AUC.pl /home/ec2-user/glial-origins/output/Xlae_adultbrain/Xenopus_laevis_adultbrain.transdecoder_dir//start_refinement.feature.scores.roc


null device 
          1 


* Running CMD: /home/ec2-user/miniconda3/pkgs/transdecoder-5.5.0-pl526_1/opt/transdecoder/util/PWM/make_seqLogo.Rscript /home/ec2-user/glial-origins/output/Xlae_adultbrain/Xenopus_laevis_adultbrain.transdecoder_dir//start_refinement.+.pwm || :
Error in library(seqLogo) : there is no package called ‘seqLogo’
Execution halted
* Running CMD: /home/ec2-user/miniconda3/pkgs/transdecoder-5.5.0-pl526_1/opt/transdecoder/util/PWM/make_seqLogo.Rscript /home/ec2-user/glial-origins/output/Xlae_adultbrain/Xenopus_laevis_adultbrain.transdecoder_dir//start_refinement.-.pwm || :
Error in library(seqLogo) : there is no package called ‘seqLogo’
Execution halted
* Running CMD: /home/ec2-user/miniconda3/pkgs/transdecoder-5.5.0-pl526_1/opt/transdecoder/util/PWM/deplete_feature_noise.pl  --features_plus /home/ec2-user/glial-origins/output/Xlae_adultbrain/Xenopus_laevis_adultbrain.transdecoder_dir//start_refinement.+.features  --pwm_minus /home/ec2-user/glial-origins/output/Xlae_adultbrain/Xenopus_laevis_adu

null device 
          1 


* Running CMD: /home/ec2-user/miniconda3/pkgs/transdecoder-5.5.0-pl526_1/opt/transdecoder/util/PWM/compute_AUC.pl /home/ec2-user/glial-origins/output/Xlae_adultbrain/Xenopus_laevis_adultbrain.transdecoder_dir//start_refinement.enhanced.feature.scores.roc


null device 
          1 


* Running CMD: /home/ec2-user/miniconda3/pkgs/transdecoder-5.5.0-pl526_1/opt/transdecoder/util/PWM/make_seqLogo.Rscript /home/ec2-user/glial-origins/output/Xlae_adultbrain/Xenopus_laevis_adultbrain.transdecoder_dir//start_refinement.enhanced.+.pwm || :
Error in library(seqLogo) : there is no package called ‘seqLogo’
Execution halted
* Running CMD: /home/ec2-user/miniconda3/pkgs/transdecoder-5.5.0-pl526_1/opt/transdecoder/util/start_codon_refinement.pl --transcripts /home/ec2-user/glial-origins/output/Xlae_adultbrain/XENLA_9.2_genome_cDNA.fa --gff3_file /home/ec2-user/glial-origins/output/Xlae_adultbrain/Xenopus_laevis_adultbrain.transdecoder_dir//longest_orfs.cds.best_candidates.gff3 --workdir /home/ec2-user/glial-origins/output/Xlae_adultbrain/Xenopus_laevis_adultbrain.transdecoder_dir/ > /home/ec2-user/glial-origins/output/Xlae_adultbrain/Xenopus_laevis_adultbrain.transdecoder_dir//longest_orfs.cds.best_candidates.gff3.revised_starts.gff3
Refining start codon selections.
-number of r

# 12. Push files to AWS S3

Iterates over the `file_set` and `file_dict` variables and uploads each file to its location in the S3 bucket.

In [20]:
sample_BFD.local_to_s3()

XENLA_9.2_Xenbase.gtf already exists in S3 bucket, skipping upload. set overwrite = True to overwrite the existing file.
XENLA_9.2_GCA.gff already exists in S3 bucket, skipping upload. set overwrite = True to overwrite the existing file.
XENLA_9.2_genome.fa already exists in S3 bucket, skipping upload. set overwrite = True to overwrite the existing file.
GSM6214268_Xenopus_brain_COL65_dge.txt already exists in S3 bucket, skipping upload. set overwrite = True to overwrite the existing file.
upload: ../../output/Xlae_adultbrain/Xlae_adultbrain_Xenbase_UniProtIDs.txt to s3://arcadia-reference-datasets/organisms/Xenopus_laevis/genomics_reference/mapping_file/Xlae_adultbrain_Xenbase_UniProtIDs.txt
Xlae_adultbrain_gtf-idmm.tsv already exists in S3 bucket, skipping upload. set overwrite = True to overwrite the existing file.
XENLA_9.2_genome_cDNA.fa already exists in S3 bucket, skipping upload. set overwrite = True to overwrite the existing file.
XENLA_9.2_genome_cDNA.fa.transdecoder.bed alre

# 13. Pickle the `sample_BFD` variable for use by the next script

In [21]:
# Generate a .pkl file for the Docket
sample_BFD.pickle()

# Push to S3, optionally overwriting existing file
sample_BFD.push_to_s3(overwrite = False)

upload: ../../output/Xlae_adultbrain/Xlae_adultbrain_BioFileDocket.pkl to s3://arcadia-reference-datasets/glial-origins-pkl/Xlae_adultbrain_BioFileDocket.pkl
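
The internals of `sample_BFD.pickle()` are not shown in this notebook. As a rough illustration of how the next script might restore saved state, the sketch below round-trips a plain dictionary with the standard-library `pickle` module. The notebook imports `dill`, whose `dump` and `load` functions follow the same interface. The dictionary contents and file name are toy assumptions.

```python
import pickle
import tempfile
from pathlib import Path

# Toy stand-in for docket state; the real BioFileDocket holds file objects.
docket_state = {
    "species": "Xenopus_laevis",
    "conditions": "adultbrain",
    "num_genes": 3,
}

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = Path(tmp) / "toy_docket.pkl"

    # Save in this script...
    with open(pkl_path, "wb") as handle:
        pickle.dump(docket_state, handle)

    # ...and restore in the next one.
    with open(pkl_path, "rb") as handle:
        restored = pickle.load(handle)
```

Unpickling executes code embedded in the file, so only load `.pkl` files from a trusted source such as your own bucket.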
