# Overview

This notebook downloads cell annotations from the [Han et al. 2018](https://www.sciencedirect.com/science/article/pii/S0092867418301168#sec4) *Mus musculus* cell atlas for the sample `Brain_1`, which corresponds to the GEO accession [GSM2906405](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2906405).  
This Batch ID was determined empirically by comparing cell names between the annotation file and the `gxc` matrix.
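
One way such a comparison could be made is sketched below, assuming cell names follow a `<batch>.<barcode>` pattern. The helper `rank_batches_by_overlap` and the toy names are hypothetical and are not part of this repository's utilities.

```python
import pandas as pd

def rank_batches_by_overlap(annot_names, matrix_names):
    """Count, per batch prefix, how many annotation cell names appear in the matrix."""
    matrix_set = set(matrix_names)
    names = pd.Series(list(annot_names), dtype="object")
    # Batch prefix = text before the first '.', e.g. 'Brain_1' in 'Brain_1.AAA'
    batches = names.str.split(".", n=1).str[0]
    shared = names.isin(matrix_set)
    # Summing booleans counts the shared names within each batch
    return shared.groupby(batches).sum().sort_values(ascending=False)

# Toy cell names (made up for illustration)
annot = ["Brain_1.AAA", "Brain_1.CCC", "Brain_2.AAA", "Bladder_1.GGG"]
matrix = ["Brain_1.AAA", "Brain_1.CCC", "Brain_1.TTT"]
overlap = rank_batches_by_overlap(annot, matrix)
print(overlap)
```

The batch with the highest count is the most likely match for the expression matrix; a count of zero for every batch would suggest the two files use different naming schemes.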

## Guidelines

Due to the wide variability in potential formats for these filetypes, it is difficult to generate a standardized analysis pipeline for these files.  
For downstream analysis purposes, we'll coerce celltype annotation information into a standard format.

The final output generated by this script should be a .tsv file with the following format:

| cell_barcode | celltype |
|--------------|----------|
| ATGCAGATACAC | pyramidal cell |
| CCATACAGACTA | neuroblast |
| TTTCAAAGACAG | astrocyte |
| ... | ... |

The final file should contain two columns: `cell_barcode` and `celltype`.  
The `cell_barcode` values should match the format of the cell names in the `gxc` file.  
The `celltype` values can have any format, but should include cell identity information.  
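
The requirements above could be checked before export with something like the following sketch. The function `check_cellannot` is a hypothetical helper, and the barcodes are the placeholder values from the table above.

```python
import pandas as pd

def check_cellannot(annot, gxc_columns):
    """Return a list of problems found in a cell annotation table (empty if none)."""
    problems = []
    if list(annot.columns) != ["cell_barcode", "celltype"]:
        problems.append(f"unexpected columns: {list(annot.columns)}")
        return problems  # the remaining checks need the expected columns
    if annot["cell_barcode"].duplicated().any():
        problems.append("duplicate cell barcodes")
    if annot["celltype"].isna().any():
        problems.append("missing celltype values")
    # Barcodes absent from the expression matrix point to a format mismatch
    missing = set(annot["cell_barcode"]) - set(gxc_columns)
    if missing:
        problems.append(f"{len(missing)} barcodes not found in gxc")
    return problems

good = pd.DataFrame({
    "cell_barcode": ["ATGCAGATACAC", "CCATACAGACTA"],
    "celltype": ["pyramidal cell", "neuroblast"],
})
gxc_cols = ["ATGCAGATACAC", "CCATACAGACTA", "TTTCAAAGACAG"]
bad = good.rename(columns={"celltype": "Annotation"})

print(check_cellannot(good, gxc_cols))
print(check_cellannot(bad, gxc_cols))
```

Returning a list of problems rather than raising on the first one makes it easier to see every mismatch in a single pass.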

# 0. Setup

Import packages and specify any important functions here.

In [1]:
# import standard python packages
import pandas as pd
import subprocess, os, dill, sys
import datetime

# add the utils and env directories to the path
sys.path.append('../../utils/')
sys.path.append('../../env/')

# import functions from utils directory files
from string_functions import *
from biofile_handling import *

# import paths to software installs from env
from install_locs import *

# 1. Load BioFileDocket

Load the BioFileDocket file for this dataset using `BioFileDocket.unpickle()`.

In [2]:
################
# general info #
################

# Specify the name of the species folder in Amazon S3
species = 'Mus_musculus'

# Specify any particular identifying conditions, eg tissue type:
conditions = 'adultbrain'

################

species_BFD = BioFileDocket(species, conditions).get_from_s3().unpickle()
species_BFD.s3_to_local()

/home/ec2-user/glial-origins/output/Mmus_adultbrain/ already exists
Files will be saved into /home/ec2-user/glial-origins/output/Mmus_adultbrain/
file Mmus_adultbrain_BioFileDocket.pkl already exists at /home/ec2-user/glial-origins/output/Mmus_adultbrain/Mmus_adultbrain_BioFileDocket.pkl
file GCF_000001635.23_GRCm38.p3_genomic.gff already exists at /home/ec2-user/glial-origins/output/Mmus_adultbrain/GCF_000001635.23_GRCm38.p3_genomic.gff
file GCF_000001635.23_GRCm38.p3_genomic.fna already exists at /home/ec2-user/glial-origins/output/Mmus_adultbrain/GCF_000001635.23_GRCm38.p3_genomic.fna
file GSM2906405_Brain1_dge_coerced.txt already exists at /home/ec2-user/glial-origins/output/Mmus_adultbrain/GSM2906405_Brain1_dge_coerced.txt
file Mmus_adultbrain_MGI_UniProtIDs.txt already exists at /home/ec2-user/glial-origins/output/Mmus_adultbrain/Mmus_adultbrain_MGI_UniProtIDs.txt
file Mmus_adultbrain_uniprot-idmm.tsv already exists at /home/ec2-user/glial-origins/output/Mmus_adultbrain/Mmus_adul

# 2. Download cell annotation file and load in raw file

Using `subprocess.run`, download the file using an appropriate protocol and load it as a pandas DataFrame.  
For .csv and .tsv files, use `pd.read_csv`.  
For .xlsx and .xls files, use `pd.read_excel`.  
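
The reader choice described above could be wrapped in a small dispatcher, sketched here. `load_annotation_table` is a hypothetical helper, not a function from the `utils` directory, and treating `.tsv` as tab-separated is an assumption about the input files.

```python
import os
import tempfile
import pandas as pd

def load_annotation_table(path, **kwargs):
    """Pick a pandas reader from the file extension (hypothetical helper)."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".csv":
        return pd.read_csv(path, **kwargs)
    if ext == ".tsv":
        return pd.read_csv(path, sep="\t", **kwargs)
    if ext in (".xlsx", ".xls"):
        # read_excel needs an optional engine installed (e.g. openpyxl for .xlsx)
        return pd.read_excel(path, **kwargs)
    raise ValueError(f"Unsupported annotation file extension: {ext!r}")

# Round-trip a toy table through .csv and .tsv to exercise both text branches
toy = pd.DataFrame({
    "Cell.name": ["Brain_1.AAA", "Brain_1.CCC"],
    "Annotation": ["Neuron", "Astrocyte"],
})
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "toy.csv")
    tsv_path = os.path.join(tmp, "toy.tsv")
    toy.to_csv(csv_path, index=False)
    toy.to_csv(tsv_path, sep="\t", index=False)
    from_csv = load_annotation_table(csv_path)
    from_tsv = load_annotation_table(tsv_path)
```

Extra keyword arguments such as `index_col` are passed straight through to the underlying pandas reader.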

In [3]:
# Specify file url and output information
file_url = 'https://figshare.com/ndownloader/files/11083451?private_link=865e694ad06d5857db4b'
output_filename = 'Mmus_cellannot.csv'

output_file = BioFile(
    filename = output_filename,
    sampledict = species_BFD.sampledict,
    url = file_url,
    protocol = 'wget'
)

cell_annots = pd.read_csv(output_file.path, index_col = 'Unnamed: 0')
display(cell_annots)

# Load in gxc matrix to check proper cell names
gxc = pd.read_csv(species_BFD.gxc.path, sep = '\t', nrows = 10)
display(gxc)

file Mmus_cellannot.csv already exists at /home/ec2-user/glial-origins/output/Mmus_adultbrain/Mmus_cellannot.csv


Unnamed: 0,Cell.name,ClusterID,Tissue,Batch,Cell.Barcode,Annotation
1,Bladder_1.AAAACGAAAACGGGGCGA,Bladder_1,Bladder,Bladder_1,AAAACGAAAACGGGGCGA,Stromal cell_Dpt high(Bladder)
2,Bladder_1.AAAACGAAGCGGCCGCTA,Bladder_5,Bladder,Bladder_1,AAAACGAAGCGGCCGCTA,Stromal cell_Car3 high(Bladder)
3,Bladder_1.AAAACGAAGTACTAGCAT,Bladder_16,Bladder,Bladder_1,AAAACGAAGTACTAGCAT,Vascular smooth muscle progenitor cell(Bladder)
4,Bladder_1.AAAACGACGTTGCTGTGT,Bladder_8,Bladder,Bladder_1,AAAACGACGTTGCTGTGT,Vascular endothelial cell(Bladder)
5,Bladder_1.AAAACGAGCGAGCGAGTA,Bladder_4,Bladder,Bladder_1,AAAACGAGCGAGCGAGTA,Urothelium(Bladder)
...,...,...,...,...,...,...
270844,NeonatalPancreas_1.AGTTTAAAAACGCTGAAA,NeonatalPancreas_,NeonatalPancreas,NeonatalPancreas_1,AGTTTAAAAACGCTGAAA,Urothelium(NeonatalPancreas)
270845,NeonatalPancreas_1.GGGTTTGAACGCTCTACC,NeonatalPancreas_,NeonatalPancreas,NeonatalPancreas_1,GGGTTTGAACGCTCTACC,Urothelium(NeonatalPancreas)
270846,NeonatalPancreas_1.GCTGTGACAATATTTAGG,NeonatalPancreas_,NeonatalPancreas,NeonatalPancreas_1,GCTGTGACAATATTTAGG,Urothelium(NeonatalPancreas)
270847,NeonatalPancreas_1.GTCCCGGATCTTTATTGT,NeonatalPancreas_,NeonatalPancreas,NeonatalPancreas_1,GTCCCGGATCTTTATTGT,Urothelium(NeonatalPancreas)


Unnamed: 0,gene_name,Brain_1.CCGCTAAATAAATAAGGG,Brain_1.AACGCCGATCTTGCCCTC,Brain_1.ACCTGAAGTTTATCGTAA,Brain_1.CTCGCACTGAAACCGCTA,Brain_1.ATCAACATCTCTTCGGGT,Brain_1.GCGAATAGGGTCTATGTA,Brain_1.CGAGTAAGGGTCTAGTCG,Brain_1.ATCTCTTCGTAAGTTGCC,Brain_1.AACGCCTAAGGGCTCGCA,...,Brain_1.TTGGACGCCTAGGAGATC,Brain_1.TTAACTAAAGTTTATGTA,Brain_1.GTCCCGGGACATAGGACT,Brain_1.CCGCTAGGGTTTGCTCAA,Brain_1.TGATCAGCTGTGTCAAAG,Brain_1.TCACTTGAATTATGAAGC,Brain_1.AAGTACGCTGTGTATGTA,Brain_1.CCTAGATAGAGAATTTGC,Brain_1.CATCCCATTTGCGGCTGC,Brain_1.CAAAGTGGGTTTAGCGAG
0,0610005C13Rik,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0610007P14Rik,0,3,1,2,0,0,1,2,0,...,0,0,0,0,0,0,0,0,0,0
2,0610009B22Rik,0,3,0,2,1,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
3,0610009E02Rik,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0610009L18Rik,0,0,0,0,2,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
5,0610009O20Rik,0,1,0,1,0,2,0,1,0,...,0,0,0,0,0,0,0,0,0,0
6,0610010F05Rik,0,0,1,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
7,0610010K14Rik,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0610012G03Rik,1,2,0,1,1,2,0,1,2,...,0,0,0,0,0,0,0,0,0,0
9,0610025J13Rik,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


# 3. Extract and coerce data into the correct format

In [4]:
# Specify sample name for filtering
sample_name = 'Brain_1'

# Extract cell barcode and celltype info
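sample_cell_annots = cell_annots[cell_annots['Batch'] == sample_name]
# Take an explicit copy so later assignments do not act on a view of cell_annots
sample_cell_annots = sample_cell_annots[['Cell.name', 'Annotation']].copy()

# Rename columns to have the correct header
sample_cell_annots = sample_cell_annots.rename(columns = {'Cell.name': 'cell_barcode', 'Annotation': 'celltype'})
# Strip the literal '(Brain)' tissue suffix; regex = False treats the parentheses as plain characters
sample_cell_annots['celltype'] = sample_cell_annots['celltype'].str.replace('(Brain)', '', regex = False)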
display(sample_cell_annots)

# Generate filename
output_filename = '_'.join([species_BFD.species_prefix, species_BFD.conditions, sample_name, 'cellannot.tsv'])

# Generate BioFile object
output_CellAnnotFile = CellAnnotFile(
    filename = output_filename,
    sampledict = species_BFD.sampledict,
    sources = [species_BFD.gxc]
)

# Export file and add to BioFileDocket
sample_cell_annots.to_csv(output_CellAnnotFile.path, sep = '\t', index = False)
species_BFD.add_keyfile('cellannot', output_CellAnnotFile)


Unnamed: 0,cell_barcode,celltype
45644,Brain_1.AAAACGAAAACGTCAAAG,Myelinating oligodendrocyte
45645,Brain_1.AAAACGAAAGTTAAAACG,Myelinating oligodendrocyte
45646,Brain_1.AAAACGAAAGTTACGTTG,Myelinating oligodendrocyte
45647,Brain_1.AAAACGAAAGTTTATTGT,Myelinating oligodendrocyte
45648,Brain_1.AAAACGAACCTAGGGTTT,Myelinating oligodendrocyte
...,...,...
48924,Brain_1.TTTAGGGATCTTAAGTAC,Macrophage_Klf2 high
48925,Brain_1.TTTAGGTAAGGGGGGCGA,Myelinating oligodendrocyte
48926,Brain_1.TTTAGGTCGTAAGTAATG,Myelinating oligodendrocyte
48927,Brain_1.TTTAGGTGCAATCCGACG,Myelinating oligodendrocyte
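
The suffix clean-up in this step can be illustrated on its own with a minimal sketch. The labels below are illustrative examples assembled from annotation values shown in this notebook. Passing `regex=False` asks pandas to match `(Brain)` as literal text, which avoids depending on the default value of `regex` in the installed pandas version.

```python
import pandas as pd

# Illustrative labels: two with the '(Brain)' suffix, one from another tissue
labels = pd.Series([
    "Myelinating oligodendrocyte(Brain)",
    "Macrophage_Klf2 high(Brain)",
    "Urothelium(Bladder)",
])

# Literal replacement: parentheses are not treated as a regex group
cleaned = labels.str.replace("(Brain)", "", regex=False)
print(cleaned.tolist())
# ['Myelinating oligodendrocyte', 'Macrophage_Klf2 high', 'Urothelium(Bladder)']
```

Labels from other tissues are left untouched, so the same call is safe to apply before or after filtering by batch.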


# 4. Update BioFileDocket and push files to S3

In [5]:
species_BFD.local_to_s3()
species_BFD.pickle()
species_BFD.push_to_s3(overwrite = True)

GCF_000001635.23_GRCm38.p3_genomic.gff already exists in S3 bucket, skipping upload. set overwrite = True to overwrite the existing file.
GCF_000001635.23_GRCm38.p3_genomic.fna already exists in S3 bucket, skipping upload. set overwrite = True to overwrite the existing file.
GSM2906405_Brain1_dge_coerced.txt already exists in S3 bucket, skipping upload. set overwrite = True to overwrite the existing file.
Mmus_adultbrain_MGI_UniProtIDs.txt already exists in S3 bucket, skipping upload. set overwrite = True to overwrite the existing file.
Mmus_adultbrain_uniprot-idmm.tsv already exists in S3 bucket, skipping upload. set overwrite = True to overwrite the existing file.
Mmus_adultbrain_gtf-idmm.tsv already exists in S3 bucket, skipping upload. set overwrite = True to overwrite the existing file.
GCF_000001635.23_GRCm38.p3_genomic_cDNA.fna already exists in S3 bucket, skipping upload. set overwrite = True to overwrite the existing file.
GCF_000001635.23_GRCm38.p3_genomic_cDNA.fna.transdecod