# Overview

This notebook downloads cell annotations from the [Liao et al. 2022](https://www.sciencedirect.com/science/article/pii/S0092867418301168#sec4) *Xenopus laevis* adult cell atlas for the sample `Brain3_4`, listed in GEO as ["Xenopus_brain_COL65"](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM6214268).  
The batch ID `Brain3_4` was determined empirically by comparing cell names between the annotation file and the GEO dataset.

## Guidelines

Cell annotation files are published in widely varying formats, so it is difficult to build a single standardized pipeline for them.  
For downstream analysis, we'll coerce the celltype annotation information into a standard format.

The final output of this notebook should be a .tsv file with the following format:

| cell_barcode | celltype |
|--------------|----------|
| ATGCAGATACAC | pyramidal cell |
| CCATACAGACTA | neuroblast |
| TTTCAAAGACAG | astrocyte |
| ... | ... |

The final file should contain exactly two columns: `cell_barcode` and `celltype`.  
The `cell_barcode` values should match the format of the cell barcodes in the `gxc` file.  
The `celltype` values can have any format, but should include cell identity information.  
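
A quick check of these requirements could look like the sketch below. The helper `check_cellannot_format` is hypothetical and not part of the `utils` modules. It only tests the column names, missing `celltype` values, and, optionally, whether every barcode occurs in a supplied list of `gxc` barcodes.

```python
import pandas as pd

EXPECTED_COLUMNS = ['cell_barcode', 'celltype']

def check_cellannot_format(annot, gxc_barcodes=None):
    """Return a list of format problems; an empty list means no problem was found.

    Hypothetical helper for illustration only.
    """
    problems = []
    if list(annot.columns) != EXPECTED_COLUMNS:
        problems.append(
            f'columns are {list(annot.columns)}, expected {EXPECTED_COLUMNS}')
        return problems
    if annot['celltype'].isna().any():
        problems.append('missing celltype values')
    if gxc_barcodes is not None:
        missing = set(annot['cell_barcode']) - set(gxc_barcodes)
        if missing:
            problems.append(f'{len(missing)} barcodes not found in the gxc file')
    return problems

# Toy tables based on the example rows above
good = pd.DataFrame({
    'cell_barcode': ['ATGCAGATACAC', 'CCATACAGACTA'],
    'celltype': ['pyramidal cell', 'neuroblast'],
})
bad = good.rename(columns={'celltype': 'cell_type'})

print(check_cellannot_format(good))
print(check_cellannot_format(bad))
print(check_cellannot_format(good, gxc_barcodes=['ATGCAGATACAC']))
```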

# 0. Setup

Import packages and specify any important functions here.

In [1]:
# import standard python packages
import pandas as pd
import subprocess
import os
import dill
import datetime

# add the utils and env directories to the path
import sys
sys.path.append('../../utils/')
sys.path.append('../../env/')

# import functions from utils directory files
from string_functions import *
from biofile_handling import *

# import paths to software installs from env
from install_locs import *

# 1. Load BioFileDocket

Load the BioFileDocket file for this dataset using `BioFileDocket.unpickle()`.

In [2]:
################
# general info #
################

# Specify the name of the species folder in Amazon S3
species = 'Xenopus_laevis'

# Specify any particular identifying conditions, eg tissue type:
conditions = 'adultbrain'

################

species_BFD = BioFileDocket(species, conditions).get_from_s3().unpickle()
species_BFD.s3_to_local()

/home/ec2-user/glial-origins/output/Xlae_adultbrain/ already exists
Files will be saved into /home/ec2-user/glial-origins/output/Xlae_adultbrain/
file Xlae_adultbrain_BioFileDocket.pkl already exists at /home/ec2-user/glial-origins/output/Xlae_adultbrain/Xlae_adultbrain_BioFileDocket.pkl
file XENLA_9.2_Xenbase.gtf already exists at /home/ec2-user/glial-origins/output/Xlae_adultbrain/XENLA_9.2_Xenbase.gtf
file XENLA_9.2_GCA.gff already exists at /home/ec2-user/glial-origins/output/Xlae_adultbrain/XENLA_9.2_GCA.gff
file XENLA_9.2_genome.fa already exists at /home/ec2-user/glial-origins/output/Xlae_adultbrain/XENLA_9.2_genome.fa
file GSM6214268_Xenopus_brain_COL65_dge.txt already exists at /home/ec2-user/glial-origins/output/Xlae_adultbrain/GSM6214268_Xenopus_brain_COL65_dge.txt
file Xlae_adultbrain_Xenbase_UniProtIDs.txt already exists at /home/ec2-user/glial-origins/output/Xlae_adultbrain/Xlae_adultbrain_Xenbase_UniProtIDs.txt
file Xlae_adultbrain_uniprot-idmm.tsv already exists at /hom

# 2. Download cell annotation file and load in raw file

Download the file with `subprocess.run` using an appropriate tool (here, `wget` followed by `unzip`), then load it as a pandas DataFrame.  
For .csv and .tsv files, use `pd.read_csv`.  
For .xlsx and .xls files, use `pd.read_excel`.  
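
The choice of reader can be wrapped in a small dispatch function, sketched here under the assumption that the file extension reflects the actual format. `load_annotation_table` is a hypothetical name, not a function from the `utils` modules, and the toy files are written to a temporary directory purely for demonstration.

```python
import tempfile
from pathlib import Path

import pandas as pd

def load_annotation_table(path):
    """Pick a pandas reader based on the file extension (hypothetical helper)."""
    suffix = Path(path).suffix.lower()
    if suffix == '.csv':
        return pd.read_csv(path)
    if suffix == '.tsv':
        return pd.read_csv(path, sep='\t')
    if suffix in ('.xlsx', '.xls'):
        # Requires an Excel engine to be installed, e.g. openpyxl for .xlsx
        return pd.read_excel(path)
    raise ValueError(f'Unsupported annotation file type: {suffix!r}')

# Toy table, made up for illustration only
toy = pd.DataFrame({'cellID': ['c1', 'c2'],
                    'celltype': ['astrocyte', 'neuroblast']})

with tempfile.TemporaryDirectory() as tmpdir:
    csv_path = Path(tmpdir) / 'toy_annot.csv'
    tsv_path = Path(tmpdir) / 'toy_annot.tsv'
    toy.to_csv(csv_path, index=False)
    toy.to_csv(tsv_path, sep='\t', index=False)
    from_csv = load_annotation_table(csv_path)
    from_tsv = load_annotation_table(tsv_path)

print(from_csv)
```

Raising on unknown extensions, rather than guessing a delimiter, keeps a mislabeled file from being silently parsed into a single column.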

In [3]:
# Specify file url and output information
file_url = 'https://figshare.com/ndownloader/files/34026644'
output_filename = 'Xlae_cellannot.zip'
output_filepath = species_BFD.sampledict.directory + output_filename
annot_filepath = species_BFD.sampledict.directory + 'dge_cell_info/' + 'Brain_cell_info.csv'

subprocess.run(['wget', file_url, '-O', output_filepath],
               stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
subprocess.run(['unzip', output_filepath, '-d', species_BFD.sampledict.directory],
               stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

cell_annots = pd.read_csv(annot_filepath)
display(cell_annots)

# Load in gxc matrix to check proper cell names
gxc = pd.read_csv(species_BFD.gxc.path, sep = '\t', nrows = 10)
display(gxc)

Unnamed: 0.1,Unnamed: 0,tSNE_1,tSNE_2,cluster,stage,celltype,cellID
0,Brain1_3.25.CTGTGTGATCTTGTGGTA,-25.083085,-1.615592,6,Adult,GABAergic neuron_tac1 high,Brain1_3.25.CTGTGTGATCTTGTGGTA
1,Brain1_3.25.ACAATACCTAGAAATAAA,-41.017949,17.559271,2,Adult,Myelinating oligodendrocyte progenitor cell,Brain1_3.25.ACAATACCTAGAAATAAA
2,Brain1_3.25.TGCGGATAGAGACCTTTC,-12.190198,-3.974513,19,Adult,Excitatory neuron_zbtb18 high,Brain1_3.25.TGCGGATAGAGACCTTTC
3,Brain1_3.25.GAGGAGATGGCGCGAGTA,-10.871158,-0.184952,19,Adult,Excitatory neuron_zbtb18 high,Brain1_3.25.GAGGAGATGGCGCGAGTA
4,Brain1_3.25.TTAACTAAAGTTAACGCC,-10.116023,-0.098181,19,Adult,Excitatory neuron_zbtb18 high,Brain1_3.25.TTAACTAAAGTTAACGCC
...,...,...,...,...,...,...,...
23901,Brain4_4.25.ATCAACTGAAGCGCGTGC,46.084537,-6.320718,12,Adult,Thyrotroph cell,Brain4_4.25.ATCAACTGAAGCGCGTGC
23902,Brain4_4.25.TGCAATGCGTGCAAAGTT,16.626571,9.189624,15,Adult,Growth hormone cell_Rp high,Brain4_4.25.TGCAATGCGTGCAAAGTT
23903,Brain4_4.25.GCTCAAAACCTAGCGTGC,-10.361278,23.647850,1,Adult,Gonadotroph cell_cga high,Brain4_4.25.GCTCAAAACCTAGCGTGC
23904,Brain4_4.25.TCGTAACGTATTGCGTGC,-14.885559,18.001137,1,Adult,Gonadotroph cell_cga high,Brain4_4.25.TCGTAACGTATTGCGTGC


Unnamed: 0,gene_name,AACCTATTCATATAAGGG,CTCGCATCAAAGTTAACT,AACCTAGTATACTTCCGC,AACCTAAAAGTTCTGAAA,CTCGCACGCACCCTCCAT,ACGTTGTATTGTAGCGAG,ACGAGCATGCTTTAGTCG,AACCTAGTCCCGCCATCT,AACCTAGCGAATTAGAGA,...,TCACTTGTTGCCATGCTT,TCGGGTTGTCACACTTAT,TCGTAATCGTAAGTTGCC,TGTCACGAATTACACAAG,TGTGCGTACTTCTAGTCG,TTAACTATACAGTGGATG,TTGGACACTTATGATCTT,AAAGTTACTTATGCCCTC,AACCTACGCACCTGCGGA,AACCTATAGTCGCTGTGT
0,3.S,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,42Sp43.L,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,42Sp50.L,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,AK6.L,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,AK6.S,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,MGC107841.L,1,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,MGC107851.L,0,1,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
7,MGC107876.L,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
8,MGC108117.L,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,MGC108429.L,0,1,3,1,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
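
The preview above shows that the matrix stores cell barcodes as column names. If only the barcodes are needed, one option is to read just the header row. The sketch below assumes a tab-separated matrix with a `gene_name` column and an unnamed index column, as in the preview; `read_matrix_barcodes` and the toy matrix are hypothetical.

```python
import io

import pandas as pd

def read_matrix_barcodes(handle, sep='\t', skip_columns=('gene_name',)):
    """Read only the header row and return the barcode column names.

    Hypothetical helper: drops 'gene_name' and pandas' 'Unnamed: N' columns.
    """
    header = pd.read_csv(handle, sep=sep, nrows=0)
    return [col for col in header.columns
            if col not in skip_columns and not col.startswith('Unnamed')]

# Toy matrix, made up for illustration only
toy_matrix = (
    "\tgene_name\tAAAAAA\tCCCCCC\n"
    "0\tgene1\t0\t1\n"
    "1\tgene2\t2\t0\n"
)
barcodes = read_matrix_barcodes(io.StringIO(toy_matrix))

# Fraction of annotated barcodes that are present in the matrix
annotated = pd.Series(['AAAAAA', 'GGGGGG'])
fraction_matched = annotated.isin(barcodes).mean()
print(barcodes, fraction_matched)
```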


# 3. Extract and coerce data into the correct format

In [4]:
# Specify sample name for filtering
sample_name = 'Brain3_4'

# Extract batch and cell barcode from the cell ID,
# e.g. 'Brain1_3.25.CTGTGTGATCTTGTGGTA' -> batch 'Brain1_3', barcode 'CTGTGTGATCTTGTGGTA'
id_parts = cell_annots['Unnamed: 0'].str.split('.', expand = True)
cell_annots['Batch'] = id_parts[0]
cell_annots['cell_barcode'] = id_parts[2]

# Get cells from the Batch matching the sample_name
sample_cell_annots = cell_annots[cell_annots['Batch'] == sample_name]
sample_cell_annots = sample_cell_annots[['cell_barcode', 'celltype']]
display(sample_cell_annots)

# Generate filename
output_filename = '_'.join([prefixify(species_BFD.species), species_BFD.conditions, sample_name, 'cellannot.tsv'])

# Generate BioFile object
output_CellAnnotFile = CellAnnotFile(
    filename = output_filename,
    sampledict = species_BFD.sampledict,
    sources = [species_BFD.gxc]
)

# Export file and add to BioFileDocket
sample_cell_annots.to_csv(output_CellAnnotFile.path, sep = '\t', index = False)
species_BFD.add_keyfile('cellannot', output_CellAnnotFile)

Unnamed: 0,cell_barcode,celltype
8238,AACCTAGCGAATTAGAGA,Growth hormone progenitor cell
8239,CTGTGTGATCTTCCGCTA,Thyrotroph cell
8240,AAGCGGGGTACAATGGCG,GABAergic neuron_tac1 high
8241,CTCGCACGTATTCTGAAA,Myelinating oligodendrocyte progenitor cell
8242,CTGTGTTTGGACTATTGT,Myelinating oligodendrocyte progenitor cell
...,...,...
16322,CTCGCATGCGGAGCGTGC,Gonadotroph cell_cga high
16323,GAGGAGAGCGAGGCGTGC,Gonadotroph cell_cga high
16324,TCGTAAGCGTGCCTGTGT,Thyrotroph cell
16325,TATGTAATACAGGCGTGC,Gonadotroph cell_cga high


# 4. Update BioFileDocket and push files to S3

In [5]:
species_BFD.local_to_s3()
species_BFD.pickle()
species_BFD.push_to_s3(overwrite = True)

XENLA_9.2_Xenbase.gtf already exists in S3 bucket, skipping upload. set overwrite = True to overwrite the existing file.
XENLA_9.2_GCA.gff already exists in S3 bucket, skipping upload. set overwrite = True to overwrite the existing file.
XENLA_9.2_genome.fa already exists in S3 bucket, skipping upload. set overwrite = True to overwrite the existing file.
GSM6214268_Xenopus_brain_COL65_dge.txt already exists in S3 bucket, skipping upload. set overwrite = True to overwrite the existing file.
Xlae_adultbrain_Xenbase_UniProtIDs.txt already exists in S3 bucket, skipping upload. set overwrite = True to overwrite the existing file.
Xlae_adultbrain_uniprot-idmm.tsv already exists in S3 bucket, skipping upload. set overwrite = True to overwrite the existing file.
Xlae_adultbrain_gtf-idmm.tsv already exists in S3 bucket, skipping upload. set overwrite = True to overwrite the existing file.
XENLA_9.2_genome_cDNA.fa already exists in S3 bucket, skipping upload. set overwrite = True to overwrite th