# MERFISH whole brain spatial transcriptomics (part 2a)

In part 1, we explored two examples looking at the expression of canonical neurotransmitter transporter genes and the gene Tac2 in one coronal section. In this notebook, we will prepare the data so that we can repeat those examples for all cells spanning the whole brain. This notebook takes ~20 seconds to run.

The results from this notebook have already been cached and saved, so you can skip this notebook and continue with part 2b if needed.

To run this notebook, you need to be connected to the internet and to have downloaded all the log2 expression matrices associated with the MERFISH-C57BL6J-638850 dataset.

In [1]:
import os
import pandas as pd
import numpy as np
import anndata
import time
import json
import requests

The prerequisite for running this notebook is that the data have been downloaded to a local directory that maintains the organization from manifest.json. **Change the download_base variable to where you have downloaded the data on your system.**

In [3]:
download_base = '../../abc_download_root'

url = 'https://allen-brain-cell-atlas.s3-us-west-2.amazonaws.com/releases/20230630/manifest.json'
manifest = json.loads(requests.get(url).text)
    
metadata = manifest['file_listing']['MERFISH-C57BL6J-638850']['metadata']

In [4]:
view_directory = os.path.join( download_base, 
                               manifest['directory_listing']['MERFISH-C57BL6J-638850']['directories']['metadata']['relative_path'], 
                              'views')
cache_views = False
if cache_views :
    os.makedirs( view_directory, exist_ok=True )

In [5]:
rpath = metadata['cell_metadata']['files']['csv']['relative_path']
file = os.path.join( download_base, rpath)
cell = pd.read_csv(file, dtype={'cell_label':str})
cell.set_index('cell_label',inplace=True)
print(len(cell))

4330907


In [7]:
matrices = cell.groupby('matrix_label')[['brain_section_label']].count()
matrices

matrix_label,brain_section_label
C57BL6J-638850.01,26217
C57BL6J-638850.02,29286
C57BL6J-638850.03,36028
C57BL6J-638850.04,47445
C57BL6J-638850.05,50990
C57BL6J-638850.06,50883
C57BL6J-638850.08,51941
C57BL6J-638850.09,75870
C57BL6J-638850.10,50248
C57BL6J-638850.11,54934


In [8]:
expression_matrices = manifest['file_listing']['MERFISH-C57BL6J-638850']['expression_matrices']

In [9]:
matrix_label = matrices.index[0]
rpath = expression_matrices[matrix_label]['log2']['files']['h5ad']['relative_path']
file = os.path.join( download_base, rpath)
print(file)

../../abc_download_root\expression_matrices/MERFISH-C57BL6J-638850/20230630/C57BL6J-638850.01-log2.h5ad
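Because the loop further down opens one log2 file per matrix label, it can help to confirm beforehand that every file is present locally. The following is a minimal sketch of such a check, not part of the original notebook. It assumes the nested `['log2']['files']['h5ad']['relative_path']` layout used above; the function name `find_missing_matrices`, the toy labels and the toy paths are hypothetical.

```python
import os
import tempfile


def find_missing_matrices(expression_matrices, download_base, labels):
    """Return the labels whose log2 h5ad file is not found under download_base."""
    missing = []
    for label in labels:
        rpath = expression_matrices[label]['log2']['files']['h5ad']['relative_path']
        if not os.path.isfile(os.path.join(download_base, rpath)):
            missing.append(label)
    return missing


# Toy stand-in for the manifest's expression_matrices entry (hypothetical labels and paths)
toy_matrices = {
    'demo.01': {'log2': {'files': {'h5ad': {'relative_path': 'expression_matrices/demo.01-log2.h5ad'}}}},
    'demo.02': {'log2': {'files': {'h5ad': {'relative_path': 'expression_matrices/demo.02-log2.h5ad'}}}},
}

with tempfile.TemporaryDirectory() as toy_base:
    # Create only the first file so that the second one is reported as missing
    os.makedirs(os.path.join(toy_base, 'expression_matrices'))
    open(os.path.join(toy_base, 'expression_matrices/demo.01-log2.h5ad'), 'w').close()
    missing = find_missing_matrices(toy_matrices, toy_base, list(toy_matrices))

print(missing)
```

With the real data, the same function could be called as `find_missing_matrices(expression_matrices, download_base, matrices.index)`; an empty list would mean that all files were found.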


In [8]:
ad = anndata.read_h5ad(file,backed='r')
gene = ad.var

In [None]:
ntgenes = ['Slc17a7','Slc17a6','Slc17a8','Slc32a1','Slc6a5','Slc18a3','Slc6a3','Slc6a4','Slc6a2']
exgenes = ['Tac2']
gnames = ntgenes + exgenes
pred = [x in gnames for x in gene.gene_symbol]
gene_filtered = gene[pred]
gene_filtered
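The list comprehension above builds a boolean mask over the gene table. As an illustration only, the sketch below shows an equivalent vectorized form using `Series.isin`, along with a check for requested symbols that are absent from the panel. The toy table and its identifiers are hypothetical stand-ins for `ad.var`.

```python
import pandas as pd

# Toy stand-in for ad.var: hypothetical identifiers in the index, symbols in a column
toy_gene = pd.DataFrame(
    {'gene_symbol': ['Slc17a7', 'Gad1', 'Tac2', 'Sst']},
    index=pd.Index(['id_a', 'id_b', 'id_c', 'id_d'], name='gene_identifier'),
)
wanted = ['Slc17a7', 'Tac2', 'Slc6a3']

# Vectorized equivalent of [x in wanted for x in toy_gene.gene_symbol]
mask = toy_gene['gene_symbol'].isin(wanted)
toy_filtered = toy_gene[mask]

# Symbols that were requested but do not appear in the table
not_found = sorted(set(wanted) - set(toy_filtered['gene_symbol']))

print(toy_filtered)
print(not_found)
```

Checking `not_found` is a cheap way to catch a misspelled symbol before running the extraction loop.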

In [10]:
# create empty gene expression dataframe
gdata = pd.DataFrame(index=cell.index,columns=gene_filtered.index)
count = 0
total_start = time.process_time()

for mp in matrices.index :
    
    print(mp)
    
    rpath = expression_matrices[mp]['log2']['files']['h5ad']['relative_path']
    file = os.path.join( download_base, rpath)
        
    start = time.process_time()
    ad = anndata.read_h5ad(file,backed='r')
    exp = ad[:,gene_filtered.index].to_df()
    gdata.loc[ exp.index, gene_filtered.index ] = exp
    print(" - time taken: ", time.process_time() - start)
    
    ad.file.close()
    del ad
    
    count += 1
    
    #if count > 2 :
    #    break
        
print("total time taken: ", time.process_time() - total_start)
    

C57BL6J-638850.01
 - time taken:  1.8694569370000007
C57BL6J-638850.02
 - time taken:  0.10332510100000114
C57BL6J-638850.03
 - time taken:  0.12322267499999917
C57BL6J-638850.04
 - time taken:  0.16208481500000005
C57BL6J-638850.05
 - time taken:  0.17095455699999995
C57BL6J-638850.06
 - time taken:  0.16599446600000078
C57BL6J-638850.08
 - time taken:  0.16555449299999836
C57BL6J-638850.09
 - time taken:  0.21211741399999973
C57BL6J-638850.10
 - time taken:  0.15647994500000095
C57BL6J-638850.11
 - time taken:  0.16948801100000033
C57BL6J-638850.12
 - time taken:  0.1758591340000013
C57BL6J-638850.13
 - time taken:  0.23582353399999967
C57BL6J-638850.14
 - time taken:  0.29487359999999896
C57BL6J-638850.15
 - time taken:  0.27476923500000083
C57BL6J-638850.16
 - time taken:  0.300646948999999
C57BL6J-638850.17
 - time taken:  0.2379823390000002
C57BL6J-638850.18
 - time taken:  0.17992285599999924
C57BL6J-638850.19
 - time taken:  0.2458386929999996
C57BL6J-638850.24
 - time taken:  
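The loop above fills a preallocated dataframe one matrix at a time. A dataframe created with only an index and columns may be stored with object dtype, so one possible alternative is to collect the per-matrix blocks and combine them in a single step. The sketch below illustrates this on toy data and is not the notebook's method; the cell labels, gene names and values are made up.

```python
import pandas as pd

# Toy stand-in for cell.index
toy_cells = pd.Index(['c1', 'c2', 'c3', 'c4', 'c5'], name='cell_label')
genes = ['g1', 'g2']

# Each block plays the role of one matrix's `exp` dataframe inside the loop
blocks = [
    pd.DataFrame([[1.0, 0.0], [0.5, 2.0]], index=['c1', 'c2'], columns=genes),
    pd.DataFrame([[0.0, 3.0], [1.5, 0.0]], index=['c4', 'c3'], columns=genes),
]

# Stack the blocks, then align the rows to the full cell index;
# cells without expression data ('c5' here) become NaN rows
combined = pd.concat(blocks).reindex(toy_cells)

print(combined.dtypes)
print(combined)
```

In this form the columns keep a floating-point dtype, and the rows follow the order of the cell index rather than the order in which the matrices were read.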

In [11]:
# change columns from index to gene symbol
gdata.columns = gene_filtered.gene_symbol
# keep only the cells for which expression values were loaded
pred = pd.notna(gdata[gdata.columns[0]])
gdata = gdata[pred].copy(deep=True)
print(len(gdata))

4330907


In [12]:
if cache_views :
    file = os.path.join( view_directory, 'example_genes_all_cells_expression.csv')
    gdata.to_csv( file )
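When a cached CSV like this is read back, the cell labels need to be parsed as strings, as was done for the cell metadata at the start of this notebook. The sketch below is a toy round trip showing what can go wrong otherwise. The labels with leading zeros are made up to make the effect visible and are not real cell labels.

```python
import io
import pandas as pd

# Toy expression table with string labels that look numeric (hypothetical values)
toy = pd.DataFrame(
    {'Tac2': [0.0, 1.25]},
    index=pd.Index(['007', '010'], name='cell_label'),
)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer)

# Without a dtype hint, pandas infers integers and the labels change
buffer.seek(0)
naive = pd.read_csv(buffer).set_index('cell_label')

# With dtype={'cell_label': str}, the labels survive the round trip
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={'cell_label': str}).set_index('cell_label')

print(naive.index.tolist())
print(restored.index.tolist())
```

Keeping the labels as strings means that a later join against the cell metadata index matches on identical values.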