# Overview

This notebook downloads the cell annotations from the [Jiang et al. 2021](https://www.frontiersin.org/articles/10.3389/fcell.2021.743421/full) zebrafish cell atlas and extracts those belonging to the sample `Brain_8`.


## Guidelines

Cell annotation files come in widely varying formats, so it is difficult to build a standardized pipeline for processing them.  
For downstream analysis, we coerce the cell type annotations into a standard format.

The final output generated by this notebook should be a `.tsv` file with the following format:

| cell_barcode | celltype |
|--------------|----------|
| ATGCAGATACAC | pyramidal cell |
| CCATACAGACTA | neuroblast |
| TTTCAAAGACAG | astrocyte |
| ... | ... |

The output file must contain exactly two columns: `cell_barcode` and `celltype`.  
The `cell_barcode` values must match the format of the barcodes in the `gxc` file.  
The `celltype` values can have any format, but should convey cell identity information.  
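
The format rules above can be checked mechanically before a file is shared. The following is a minimal sketch using only the standard library; `validate_cellannot` and `EXPECTED_COLUMNS` are hypothetical names introduced here for illustration and are not part of the `utils` modules.

```python
import csv
import io

# Header required by the guidelines above.
EXPECTED_COLUMNS = ["cell_barcode", "celltype"]

def validate_cellannot(handle, gxc_barcodes=None):
    """Return a list of problems found in a cell annotation .tsv (empty if none)."""
    problems = []
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames != EXPECTED_COLUMNS:
        problems.append(f"unexpected header: {reader.fieldnames}")
        return problems
    seen = set()
    for row in reader:
        barcode = row["cell_barcode"]
        if barcode in seen:
            problems.append(f"duplicate barcode: {barcode}")
        seen.add(barcode)
        if not row["celltype"]:
            problems.append(f"missing celltype for: {barcode}")
    # Optionally confirm that every annotated barcode exists in the gxc file.
    if gxc_barcodes is not None:
        unmatched = seen - set(gxc_barcodes)
        if unmatched:
            problems.append(f"{len(unmatched)} barcode(s) absent from gxc")
    return problems

# Toy input mirroring the example table above.
example = io.StringIO(
    "cell_barcode\tcelltype\n"
    "ATGCAGATACAC\tpyramidal cell\n"
    "CCATACAGACTA\tneuroblast\n"
)
print(validate_cellannot(example, gxc_barcodes=["ATGCAGATACAC", "CCATACAGACTA"]))
```

An empty list means the file satisfies the two-column format; in practice the handle would come from `open(path, newline="")` on the exported file.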

# 0. Setup

Import packages and specify any important functions here.

In [1]:
# import standard python packages
import pandas as pd
import subprocess
import os
import dill
import datetime

# add the utils and env directories to the path
import sys
sys.path.append('../../utils/')
sys.path.append('../../env/')

# import functions from utils directory files
from string_functions import *
from biofile_handling import *

# import paths to software installs from env
from install_locs import *

# 1. Load BioFileDocket

Load the BioFileDocket for this dataset using `BioFileDocket.unpickle()`.

In [2]:
################
# general info #
################

# Specify the name of the species folder in Amazon S3
species = 'Danio_rerio'

# Specify any particular identifying conditions, eg tissue type:
conditions = 'adultbrain'

################

species_BFD = BioFileDocket(species, conditions).get_from_s3().unpickle()
species_BFD.s3_to_local()

/home/ec2-user/glial-origins/output/Drer_adultbrain/ already exists
Files will be saved into /home/ec2-user/glial-origins/output/Drer_adultbrain/
file Drer_adultbrain_BioFileDocket.pkl already exists at /home/ec2-user/glial-origins/output/Drer_adultbrain/Drer_adultbrain_BioFileDocket.pkl
file GCF_000002035.5_GRCz10_genomic.gff already exists at /home/ec2-user/glial-origins/output/Drer_adultbrain/GCF_000002035.5_GRCz10_genomic.gff
file GCF_000002035.5_GRCz10_genomic.fna already exists at /home/ec2-user/glial-origins/output/Drer_adultbrain/GCF_000002035.5_GRCz10_genomic.fna
file GSM3768152_Brain_8_dge.txt already exists at /home/ec2-user/glial-origins/output/Drer_adultbrain/GSM3768152_Brain_8_dge.txt
file Drer_adultbrain_ZFIN_UniProtIDs.txt already exists at /home/ec2-user/glial-origins/output/Drer_adultbrain/Drer_adultbrain_ZFIN_UniProtIDs.txt
file Drer_adultbrain_uniprot-idmm.tsv already exists at /home/ec2-user/glial-origins/output/Drer_adultbrain/Drer_adultbrain_uniprot-idmm.tsv
file

# 2. Download cell annotation file and load in raw file

Download the file with `subprocess.run` and an appropriate tool (here, `wget`), then load it as a pandas DataFrame.  
For .csv and .tsv files, use `pd.read_csv`.  
For .xlsx and .xls files, use `pd.read_excel`.  
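
Because the download cell discards `wget`'s output, a failed download can be hard to diagnose. Below is a small sketch of a wrapper that stays quiet on success but reports the exit code and error text on failure. `run_checked` is a hypothetical helper, not part of `utils`, and the stand-in commands use the current Python interpreter so the example runs without network access.

```python
import subprocess
import sys

def run_checked(command):
    """Run a command quietly, but raise RuntimeError with its stderr if it fails."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"command failed with exit code {result.returncode}: {result.stderr.strip()}"
        )
    return result

# A command that succeeds: its stdout is captured rather than printed.
ok = run_checked([sys.executable, "-c", "print('downloaded')"])
print(ok.stdout.strip())

# A command that fails: the error message is surfaced instead of being lost.
try:
    run_checked([sys.executable, "-c", "import sys; sys.exit('404 not found')"])
except RuntimeError as err:
    print(err)
```

In this notebook the call would look like `run_checked(['wget', file_url, '-O', output_filepath])`, assuming `wget` is installed on the machine.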

In [3]:
# Specify file url and output information
file_url = 'https://ndownloader.figstatic.com/files/30949762'
output_filename = 'Drer_cellannot.xlsx'
output_filepath = species_BFD.sampledict.directory + output_filename

# Download quietly; check = True raises CalledProcessError if wget fails
subprocess.run(['wget', file_url, '-O', output_filepath],
               stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, check = True)

cell_annots = pd.read_excel(output_filepath, sheet_name = 'cell barcode of each cell')

# Split 'Batch.BARCODE' once into the batch name and the bare barcode
cell_annots[['Batch', 'cell_barcode']] = cell_annots['cell barcode'].str.split('.', n = 1, expand = True)
display(cell_annots)

# Load in gxc matrix to check proper cell names
gxc = pd.read_csv(species_BFD.gxc.path, sep = '\t', nrows = 10)
display(gxc)

| | cell barcode | Cluster | Annotation | Batch | cell_barcode |
|---|---|---|---|---|---|
| 0 | Brain_1.TTCCGCAAGTACTGTGCG | Brain_cluster3 | Brain.Microglia | Brain_1 | TTCCGCAAGTACTGTGCG |
| 1 | Brain_1.TTCATATGTGCGTTTAGG | Brain_cluster3 | Brain.Microglia | Brain_1 | TTCATATGTGCGTTTAGG |
| 2 | Brain_1.TGCGGAACTTATGCTCAA | Brain_cluster3 | Brain.Microglia | Brain_1 | TGCGGAACTTATGCTCAA |
| 3 | Brain_1.TGCGGAGTTGCCGCGTCC | Brain_cluster3 | Brain.Microglia | Brain_1 | TGCGGAGTTGCCGCGTCC |
| 4 | Brain_1.GCAGGAGAATTATGATCA | Brain_cluster3 | Brain.Microglia | Brain_1 | GCAGGAGAATTATGATCA |
| ... | ... | ... | ... | ... | ... |
| 201089 | Z72h2.CATCCCCAACAACGTATT | Z72h_cluster1 | Z72h.Hatching gland | Z72h2 | CATCCCCAACAACGTATT |
| 201090 | Z72h2.CCAGACCCGACGAGCGAG | Z72h_cluster1 | Z72h.Hatching gland | Z72h2 | CCAGACCCGACGAGCGAG |
| 201091 | Z72h2.CCGCTAACACCCAAAGTT | Z72h_cluster1 | Z72h.Hatching gland | Z72h2 | CCGCTAACACCCAAAGTT |
| 201092 | Z72h2.CGGCAGGGACATCTGTGT | Z72h_cluster1 | Z72h.Hatching gland | Z72h2 | CGGCAGGGACATCTGTGT |


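| | GENE | ACAATATATTGTACCTGA | ACGTTGATGGCGTAGAGA | AACCTAACCTGAATTTGC | CTCGCAGCCCTCTATGTA | ACGTTGCGTATTTAGTCG | AACCTATAGAGACCGACG | ACGAGCGCTGTGGCCTAG | GCGAATGGACATGGACAT | TCTACCGCTCAAGCTCAA | ... | CGGCAGTCAAAGATCTCT | GACACTGCGAATCTGTGT | GCAGGAGGCTGCTAAGGG | TATGTATACTTCCGCACC | TGGATGTTCCGCACAATA | AACCTATGGATGGGGTTT | AAGCGGAGGACTCTCCAT | ACCTGACTCGCAAGCGAG | ACGTTGCAAAGTTTCATA | ATCAACTGCAATTTCCGC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ABCF3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ACOT12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | ACSF3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | ACTC1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | ACVR1C | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | ADAM12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | ADAMTSL4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | ADGRL3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | AKAP13 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | AL590134.1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |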
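
Comparing the two tables by eye only confirms that the barcode formats look alike. A rough sketch of a programmatic check is to compute what fraction of a sample's annotated barcodes appear among the `gxc` column headers. Both function names below are hypothetical, and the inputs are toy values rather than real barcodes from this dataset.

```python
def split_prefixed_barcode(value):
    """Split 'Batch.BARCODE' into (batch, barcode) at the first '.'."""
    batch, _, barcode = value.partition(".")
    return batch, barcode

def barcode_overlap(annotated, gxc_columns, sample_name):
    """Return the fraction of a sample's annotated barcodes found in the gxc header."""
    sample_barcodes = {
        barcode
        for batch, barcode in map(split_prefixed_barcode, annotated)
        if batch == sample_name
    }
    if not sample_barcodes:
        return 0.0
    return len(sample_barcodes & set(gxc_columns)) / len(sample_barcodes)

# Toy values for illustration only.
annotated = ["Brain_8.AAACCC", "Brain_8.GGGTTT", "Brain_1.AAACCC"]
gxc_columns = ["GENE", "AAACCC", "CCCGGG"]
print(barcode_overlap(annotated, gxc_columns, "Brain_8"))  # 0.5
```

With the real data, `annotated` would be the `cell barcode` column and `gxc_columns` the header of the `gxc` file. A value near zero suggests that the barcode formats do not match or that the wrong batch was selected.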


# 3. Extract and coerce data into the correct format

In [4]:
# Specify sample name for filtering
sample_name = 'Brain_8'

# Get cells from the Batch matching the sample_name
sample_cell_annots = cell_annots[cell_annots['Batch'] == sample_name]

# Extract cell barcode and celltype info
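# .copy() makes the edits below act on an independent DataFrame, not a view of cell_annots
sample_cell_annots = sample_cell_annots[['cell_barcode', 'Annotation']].copy()

# Rename columns to have the correct header
sample_cell_annots.rename(columns = {'Annotation': 'celltype'}, inplace = True)
sample_cell_annots.drop_duplicates(subset = 'cell_barcode', inplace = True)

# Strip the leading 'Brain.' tissue prefix. Use a raw string and regex = True:
# pandas >= 2.0 treats the pattern as a literal string by default.
sample_cell_annots['celltype'] = sample_cell_annots['celltype'].str.replace(r'^Brain\.', '', regex = True)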
display(sample_cell_annots)

# Generate filename
output_filename = '_'.join([prefixify(species_BFD.species), species_BFD.conditions, sample_name, 'cellannot.tsv'])

# Generate BioFile object
output_CellAnnotFile = CellAnnotFile(
    filename = output_filename,
    sampledict = species_BFD.sampledict,
    sources = [species_BFD.gxc]
)

# Export file and add to BioFileDocket
sample_cell_annots.to_csv(output_CellAnnotFile.path, sep = '\t', index = None)
species_BFD.add_keyfile('cellannot', output_CellAnnotFile)


| | cell_barcode | celltype |
|---|---|---|
| 3400 | ATCTCTGCTCAAAAAACG | Microglia |
| 3401 | TAAGGGACGTTGATTCCA | Microglia |
| 3402 | GCAGGAACAATACGGCAG | Microglia |
| 3403 | GCGAATTGTGCGGCGTCC | Microglia |
| 3404 | AACCTACTGTGTGAGATC | Microglia |
| ... | ... | ... |
| 20786 | TGATCAGGGCGAATACAG | Innate_immune_cell |
| 20787 | TTGGACGCTCAAGTCCCG | Innate_immune_cell |
| 20788 | AAAACGTCTACCCTCGCA | Innate_immune_cell |
| 20789 | TAAGGGAGGACTGCGTCC | Innate_immune_cell |
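
The prefix removal in the cell above depends on regular-expression matching, where an unescaped `.` matches any character. The sketch below shows the same idea with the standard `re` module; `strip_tissue_prefix` is a hypothetical helper used only to illustrate why the dot is escaped and the pattern is anchored to the start of the label.

```python
import re

def strip_tissue_prefix(label, tissue="Brain"):
    """Remove a leading '<tissue>.' from an annotation label, if present."""
    return re.sub(rf"^{re.escape(tissue)}\.", "", label)

print(strip_tissue_prefix("Brain.Microglia"))      # Microglia
print(strip_tissue_prefix("Z72h.Hatching gland"))  # unchanged: no 'Brain.' prefix

# Without the escape, '.' matches any character, so unrelated text is stripped too.
print(re.sub(r"^Brain.", "", "BrainyCell"))        # Cell
```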


# 4. Update BioFileDocket and push files to S3

In [5]:
species_BFD.local_to_s3()
species_BFD.pickle()
species_BFD.push_to_s3(overwrite = True)

GCF_000002035.5_GRCz10_genomic.gff already exists in S3 bucket, skipping upload. set overwrite = True to overwrite the existing file.
GCF_000002035.5_GRCz10_genomic.fna already exists in S3 bucket, skipping upload. set overwrite = True to overwrite the existing file.
GSM3768152_Brain_8_dge.txt already exists in S3 bucket, skipping upload. set overwrite = True to overwrite the existing file.
Drer_adultbrain_ZFIN_UniProtIDs.txt already exists in S3 bucket, skipping upload. set overwrite = True to overwrite the existing file.
Drer_adultbrain_uniprot-idmm.tsv already exists in S3 bucket, skipping upload. set overwrite = True to overwrite the existing file.
Drer_adultbrain_gtf-idmm.tsv already exists in S3 bucket, skipping upload. set overwrite = True to overwrite the existing file.
GCF_000002035.5_GRCz10_genomic_cDNA.fna already exists in S3 bucket, skipping upload. set overwrite = True to overwrite the existing file.
GCF_000002035.5_GRCz10_genomic_cDNA.fna.transdecoder.bed already exists 