# 10x scRNA-seq gene expression data (part 2a)

In part 1, we explored two examples looking at the expression of canonical neurotransmitter transporter genes and the gene Tac2 in the thalamus. In this notebook, we prepare the data so that the examples can be repeated for all cells spanning the whole brain. This notebook takes about 5 minutes to run.

The results from this notebook have already been cached and saved, so you can skip this notebook and continue with part 2b.

In [1]:
import os
import pandas as pd
import numpy as np
import anndata
import time

In [2]:
input_base = '/allen/programs/celltypes/workgroups/rnaseqanalysis/lydian/ABC_handoff'
input_directory = os.path.join( input_base, 'dataframes', 'WMB-10X','20230630' )

view_directory = os.path.join( input_directory, 'views')
cache_views = False
if cache_views :
    os.makedirs( view_directory, exist_ok=True )

In [3]:
file = os.path.join( input_directory,'cell_metadata.csv')
cell = pd.read_csv(file)
cell.set_index('cell_label',inplace=True)

### Gene expression matrices

The large dataset of roughly 4 million cells has been divided into 23 packages to make data transfer and download more efficient. Each package is formatted as an anndata h5ad file with minimal metadata. In this section, we provide example code showing how to open a file and connect it with the rich cell-level metadata loaded above.

For each package, there are two h5ad files: one storing the raw counts and the other storing their log normalization. File names follow the pattern "dataset/release/matrix_prefix-normalization.h5ad".
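
The naming pattern can be wrapped in a small helper. This is only a sketch: `matrix_path` is a hypothetical function name that does not appear in the notebook, and the base directory used below is a placeholder rather than the real location of the data.

```python
import os

def matrix_path(base, dataset_label, release, matrix_prefix, normalization, ext="h5ad"):
    """Build <base>/<dataset_label>/<release>/<matrix_prefix>-<normalization>.<ext>."""
    filename = f"{matrix_prefix}-{normalization}.{ext}"
    return os.path.join(base, dataset_label, release, filename)

# Placeholder base directory; the other values are taken from this notebook
example = matrix_path("expression_matrices", "WMB-10Xv2", "20230630", "WMB-10Xv2-TH", "log2")
print(example)
```

Keeping the path logic in one place means the same call can serve both the single-file preview and the loop over all packages.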

In [4]:
matrices = cell.groupby(['dataset_label','matrix_prefix'])[['library_label']].count()
matrices

dataset_label,matrix_prefix,library_label
WMB-10Xv2,WMB-10Xv2-CTXsp,44310
WMB-10Xv2,WMB-10Xv2-HPF,208299
WMB-10Xv2,WMB-10Xv2-HY,100562
WMB-10Xv2,WMB-10Xv2-Isocortex-1,250040
WMB-10Xv2,WMB-10Xv2-Isocortex-2,250040
WMB-10Xv2,WMB-10Xv2-Isocortex-3,250040
WMB-10Xv2,WMB-10Xv2-Isocortex-4,250040
WMB-10Xv2,WMB-10Xv2-MB,29891
WMB-10Xv2,WMB-10Xv2-OLF,193723
WMB-10Xv2,WMB-10Xv2-TH,131212
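
To make the grouping step concrete, here is a minimal sketch on a toy metadata table. The cell and library labels are made up for illustration. Only the column names and the `groupby` call mirror the notebook. Because `count()` tallies non-null values, each number equals the number of cells in a package as long as `library_label` is never missing.

```python
import pandas as pd

# Toy stand-in for cell_metadata.csv (labels are invented for illustration)
cell = pd.DataFrame(
    {
        "dataset_label": ["WMB-10Xv2", "WMB-10Xv2", "WMB-10Xv2", "WMB-10Xv3"],
        "matrix_prefix": ["WMB-10Xv2-TH", "WMB-10Xv2-TH", "WMB-10Xv2-MB", "WMB-10Xv3-MB"],
        "library_label": ["L1", "L1", "L2", "L3"],
    },
    index=pd.Index(["c1", "c2", "c3", "c4"], name="cell_label"),
)

# One row per (dataset, package); the value is the number of cells in that package
matrices = cell.groupby(["dataset_label", "matrix_prefix"])[["library_label"]].count()

# The MultiIndex yields (dataset_label, matrix_prefix) tuples, one per package
for dataset_label, matrix_prefix in matrices.index:
    n_cells = matrices.loc[(dataset_label, matrix_prefix), "library_label"]
    print(dataset_label, matrix_prefix, n_cells)
```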


### Example use cases

In this section, we explore two use cases. The first looks at the expression of nine canonical neurotransmitter transporter genes, and the second at the expression of the gene Tac2.

To support these use cases, we will create a smaller submatrix (all cells and 10 genes) and write it to file for reuse in part 2b. *Note that this operation takes around 5 minutes*.

In [5]:
expression_directory = os.path.join(input_base, 'expression_matrices')
ext = 'h5ad'
normalization = 'log2'

In [6]:
release = '20230630'
matrix_prefix = matrices.index[0][1]
dataset_label = matrices.index[0][0]
file = os.path.join( expression_directory, dataset_label, release, '%s-%s.%s'% (matrix_prefix,normalization,ext) )
print(file)

/allen/programs/celltypes/workgroups/rnaseqanalysis/lydian/ABC_handoff/expression_matrices/WMB-10Xv2/20230630/WMB-10Xv2-CTXsp-log2.h5ad


In [7]:
ad = anndata.read_h5ad(file,backed='r')
gene = ad.var

In [8]:
ntgenes = ['Slc17a7','Slc17a6','Slc17a8','Slc32a1','Slc6a5','Slc18a3','Slc6a3','Slc6a4','Slc6a2']
exgenes = ['Tac2']
gnames = ntgenes + exgenes
pred = [x in gnames for x in gene.gene_symbol]
gene_filtered = gene[pred]
gene_filtered

gene_identifier,gene_symbol
ENSMUSG00000037771,Slc32a1
ENSMUSG00000070570,Slc17a7
ENSMUSG00000039728,Slc6a5
ENSMUSG00000030500,Slc17a6
ENSMUSG00000055368,Slc6a2
ENSMUSG00000019935,Slc17a8
ENSMUSG00000025400,Tac2
ENSMUSG00000020838,Slc6a4
ENSMUSG00000021609,Slc6a3
ENSMUSG00000100241,Slc18a3
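
Note that the filtered rows keep the order of the gene table, not the order of `gnames`. As an alternative to the list comprehension, the same filter can be written with pandas' `Series.isin`, which also makes it easy to check for requested symbols that are missing. The sketch below uses a toy gene table with made-up identifiers, and the `missing` check is an addition that is not part of the notebook.

```python
import pandas as pd

# Toy gene table mimicking the layout of ad.var (identifiers are made up)
gene = pd.DataFrame(
    {"gene_symbol": ["Tac2", "Gad1", "Slc32a1", "Actb"]},
    index=pd.Index(["g1", "g2", "g3", "g4"], name="gene_identifier"),
)
gnames = ["Slc32a1", "Tac2"]

# Boolean mask: True for rows whose symbol is in the requested list
mask = gene["gene_symbol"].isin(gnames)
gene_filtered = gene[mask]

# Requested symbols that were not found in the gene table
missing = sorted(set(gnames) - set(gene_filtered["gene_symbol"]))

print(gene_filtered)
print("missing:", missing)
```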


In [9]:
# create an empty gene expression dataframe (all cells x selected genes)
gdata = pd.DataFrame(index=cell.index,columns=gene_filtered.index)
count = 0
# note: process_time() reports CPU time, not wall-clock time
total_start = time.process_time()

for matindex in matrices.index :
    
    ds = matindex[0]
    mp = matindex[1]
    
    print(mp)
    
    file = os.path.join( expression_directory, ds, release, '%s-%s.%s'% (mp,normalization,ext) )
    
    start = time.process_time()
    ad = anndata.read_h5ad(file,backed='r')
    exp = ad[:,gene_filtered.index].to_df()
    gdata.loc[ exp.index, gene_filtered.index ] = exp
    print(" - time taken: ", time.process_time() - start)
    
    ad.file.close()
    del ad
    
    count += 1
    
    #if count > 2 :
    #    break
        
print("total time taken: ", time.process_time() - total_start)
    

WMB-10Xv2-CTXsp
 - time taken:  3.42430225
WMB-10Xv2-HPF
 - time taken:  8.245478192
WMB-10Xv2-HY
 - time taken:  3.5195614909999975
WMB-10Xv2-Isocortex-1
 - time taken:  12.157451439000003
WMB-10Xv2-Isocortex-2
 - time taken:  13.157986844
WMB-10Xv2-Isocortex-3
 - time taken:  12.071538439999998
WMB-10Xv2-Isocortex-4
 - time taken:  12.051226679000003
WMB-10Xv2-MB
 - time taken:  0.931071935999995
WMB-10Xv2-OLF
 - time taken:  6.753865536000006
WMB-10Xv2-TH
 - time taken:  5.117735967999991
WMB-10Xv3-CB
 - time taken:  7.532659328999998
WMB-10Xv3-CTXsp
 - time taken:  3.850643286999997
WMB-10Xv3-HPF
 - time taken:  9.932796195000009
WMB-10Xv3-HY
 - time taken:  9.625935334000005
WMB-10Xv3-Isocortex-1
 - time taken:  16.557964082999987
WMB-10Xv3-Isocortex-2
 - time taken:  11.60080803400001
WMB-10Xv3-MB
 - time taken:  19.158708663
WMB-10Xv3-MY
 - time taken:  9.629073062000003
WMB-10Xv3-OLF
 - time taken:  3.7269512400000053
WMB-10Xv3-P
 - time taken:  6.852235084
WMB-10Xv3-PAL
 - tim

In [10]:
# change columns from index to gene symbol
gdata.columns = gene_filtered.gene_symbol
# keep only the cells that were filled in by the loop above
pred = pd.notna(gdata[gdata.columns[0]])
gdata = gdata[pred].copy(deep=True)
print(len(gdata))

4057701
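
The fill-then-filter pattern used above can be illustrated on a toy example. The cell labels, gene names, and "packages" below are made up. The sketch only shows how a preallocated dataframe is filled chunk by chunk and how cells that never received values are dropped. The final conversion to float is an optional extra step that the notebook does not perform.

```python
import pandas as pd

cells = pd.Index(["c1", "c2", "c3", "c4"], name="cell_label")
genes = ["gA", "gB"]

# Preallocated frame: every entry starts as NaN (object dtype)
gdata = pd.DataFrame(index=cells, columns=genes)

# Two hypothetical packages, each covering a subset of the cells
chunk1 = pd.DataFrame([[1.0, 0.0]], index=["c1"], columns=genes)
chunk2 = pd.DataFrame([[0.5, 2.0], [0.0, 3.0]], index=["c3", "c4"], columns=genes)

for chunk in (chunk1, chunk2):
    # Assignment aligns on row and column labels
    gdata.loc[chunk.index, chunk.columns] = chunk

# Cell c2 appeared in no package, so its row is still NaN and is dropped
filled = gdata[gdata[genes[0]].notna()].copy()

# Optional: convert from object dtype to float
filled = filled.astype(float)
print(filled)
```

Testing a single column is enough here because each chunk fills all selected genes for its cells at once.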


In [11]:
if cache_views :
    file = os.path.join( view_directory, 'example_genes_all_cells_expression.csv')
    gdata.to_csv( file )