# 1_load_thalamus_data

This notebook demonstrates how to use the custom thalamus_merfish_analysis 
module to load a standardized thalamus subset of the Allen Brain Cell (ABC) 
Atlas' whole mouse brain MERFISH dataset (https://portal.brain-map.org/atlases-and-data/bkp/abc-atlas).

It also describes the data included in this thalamus dataset.

Additional information on the full ABC Atlas dataset can be found at: https://alleninstitute.github.io/abc_atlas_access/intro.html

In [1]:
from thalamus_merfish_analysis import abc_load as abc
get_ipython().run_line_magic('matplotlib', 'inline') 

## 1. Load thalamus dataset

You can load a thalamus subset of the ABC Atlas as either:

A) a pandas DataFrame, which includes just the cell metadata

B) an AnnData object, which includes:
- gene expression, stored in adata.X
- cell metadata, stored in adata.obs 
- gene metadata, stored in adata.var

The DataFrame is useful if you just want to explore mapped cell types and don't 
need the gene expression data. 

The DataFrame will load faster (~1 min vs. ~2.5 min) & take up less memory (~0.5 GB vs ~2 GB) than the full AnnData object.

### 1A. Load thalamus data as DataFrame

Load select cell metadata. See comment block below for details on what metadata
are included in the cell metadata DataFrame.

Additional information on these metadata, and other metadata available, can be 
found at: https://alleninstitute.github.io/abc_atlas_access/intro.html


In [2]:
'''
These metadata descriptions are compiled & modified from the "Allen Brain Cell 
Atlas - Data Access" companion Jupyter book, which can be found at: 
https://alleninstitute.github.io/abc_atlas_access/intro.html

For more details on the spatial coordinates, see: 
https://alleninstitute.github.io/abc_atlas_access/notebooks/merfish_ccf_registration_tutorial.html

For more information on the cell type taxonomy & definitions, see:
https://alleninstitute.github.io/abc_atlas_access/notebooks/cluster_annotation_tutorial.html

cell_label : str
    unique string used for ID of each cell; Index of the DataFrame
brain_section_label : str
    [brain specimen ID].[section number], e.g. "C57BL6J-638850.37".
    [brain specimen ID] is the same, 'C57BL6J-638850', for all cells in this  
    dataset. [section number] specifies the ordered index of each coronal 
    section.
neurotransmitter : str, {Glut, GABA, None, Dopa, Glut-GABA}
    neurotransmitter type of the cell; assigned based on average expression of  
    both neurotransmitter transporter genes and key neurotransmitter 
    synthesizing enzyme genes
class : str
    top level of cell type definition, primarily determined by broad brain 
    region and neurotransmitter type. Classes group together related subclasses 
    & all cells within a subclass belong to the same class.
    Class names are constructed as "[class ID] [brain region abbrv] 
    [neurotransmitter abbrv]", e.g. "20 MB GABA". 
subclass : str
    a coarse level of cell type definition. Subclasses group together related 
    supertypes & all cells within a supertype belong to the same subclass.
    Subclass names are constructed as "[subclass ID] [select marker genes] 
    [neurotransmitter abbrv]", e.g. "197 SNr Six3 Gaba".
supertype : str
    second finest level of cell type definition; groups together similar 
    clusters & all cells within a cluster belong to the same supertype.
    Supertype names are constructed as "[supertype ID] [parent subclass label]_
    [supertype # within parent subclass]", e.g. "0806 SNr Six3 Gaba_1" and 
    "0806 SNr Six3 Gaba_2"
cluster : str
    finest level of cell type definition; cells within a cluster share similar 
    characteristics and belong to the same supertype.
    Cluster names are constructed as "[cluster ID] [parent supertype label]",
    e.g. "3464 SNr Six3 Gaba_1"
cluster_alias : int
    unique 4-digit integer to identify the cluster to which the cell was mapped
average_correlation_score : float in range [0,1]
    correlation score specifying how "well" each cell mapped to its assigned cluster
x_section, y_section, z_section : float
    original experiment coordinate space for MERFISH dataset. x & y specify the
    coronal plane (M-L & D-V, respectively). z specifies the section in A-P, and 
    all cells from the same experimental section have the same z_section
x_reconstructed, y_reconstructed, z_reconstructed : float
    point-to-point mapping between the original MERFISH coordinate space and the
    CCF space to achieve a finer level match to the target CCF section. x & y 
    specify the coronal plane, z specifies the sagittal plane and can vary for 
    cells from the same MERFISH z_section
z_ccf, y_ccf, x_ccf : float
    3D global affine mapping that aligns CCF into the MERFISH space. z & y 
    specify the coronal plane, medial-lateral & dorsal-ventral, respectively. 
    x specifies anterior-posterior.
parcellation_index : int
    unique integer identifying each parcellation_substructure; used as the pixel
    value in the annotation volume
parcellation_division, parcellation_structure, parcellation_substructure : str
    human readable Allen Reference Atlas (ARA) parcellation levels to which the 
    cell belongs; division is the highest level, substructure is the lowest
left_hemisphere : bool
    True if cell is in the left hemisphere, False if in the right
'''
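
The naming conventions described above make it straightforward to split labels into their parts. The following is a minimal sketch, not part of the thalamus_merfish_analysis module: the helper names `split_brain_section_label` and `split_taxonomy_name` are hypothetical, and the code assumes every label follows the "[ID] [rest]" and "[specimen].[section]" patterns shown in the descriptions.

```python
from typing import Tuple


def split_brain_section_label(label: str) -> Tuple[str, int]:
    """Split '[brain specimen ID].[section number]' into its two parts."""
    # rsplit on the last '.' so a specimen ID containing '.' would still work
    specimen_id, section_number = label.rsplit(".", 1)
    return specimen_id, int(section_number)


def split_taxonomy_name(name: str) -> Tuple[str, str]:
    """Split a class/subclass/supertype/cluster name into (ID, remainder)."""
    # The ID is kept as a string so leading zeros (e.g. '0806') are preserved
    type_id, remainder = name.split(" ", 1)
    return type_id, remainder


print(split_brain_section_label("C57BL6J-638850.37"))  # ('C57BL6J-638850', 37)
print(split_taxonomy_name("0806 SNr Six3 Gaba_1"))     # ('0806', 'SNr Six3 Gaba_1')
```

Keeping the taxonomy ID as a string is a deliberate choice here, since supertype IDs such as "0806" would lose their leading zero if converted to integers.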


In [3]:
# Load cell metadata DataFrame
obs = abc.load_standard_thalamus(data_structure='obs')

In [4]:
# Display some info about the loaded DataFrame

# number of cells in this thalamus dataset
print(f'n_cells = {obs.shape[0]}')

# all metadata field names
display(obs.columns)

# first 11 metadata fields
display(obs.head(3).iloc[:, :11])
# last 11 metadata fields
display(obs.head(3).iloc[:, 11:])

n_cells = 79158


Index(['brain_section_label', 'cluster_alias', 'average_correlation_score',
       'x_section', 'y_section', 'z_section', 'neurotransmitter', 'class',
       'subclass', 'supertype', 'cluster', 'x_reconstructed',
       'y_reconstructed', 'z_reconstructed', 'parcellation_index', 'x_ccf',
       'y_ccf', 'z_ccf', 'parcellation_division', 'parcellation_structure',
       'parcellation_substructure', 'left_hemisphere'],
      dtype='object')

| cell_label | brain_section_label | cluster_alias | average_correlation_score | x_section | y_section | z_section | neurotransmitter | class | subclass | supertype | cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1019171907102970225 | C57BL6J-638850.37 | 3155 | 0.515042 | 6.852032 | 6.459106 | 6.6 | GABA | 20 MB GABA | 197 SNr Six3 Gaba | 0806 SNr Six3 Gaba_1 | 3464 SNr Six3 Gaba_1 |
| 1018093344102600178-2 | C57BL6J-638850.36 | 3155 | 0.505055 | 7.16597 | 6.029406 | 6.4 | GABA | 20 MB GABA | 197 SNr Six3 Gaba | 0806 SNr Six3 Gaba_1 | 3464 SNr Six3 Gaba_1 |
| 1018093344102510506-4 | C57BL6J-638850.35 | 3155 | 0.513099 | 4.000065 | 6.243119 | 6.2 | GABA | 20 MB GABA | 197 SNr Six3 Gaba | 0806 SNr Six3 Gaba_1 | 3464 SNr Six3 Gaba_1 |


| cell_label | x_reconstructed | y_reconstructed | z_reconstructed | parcellation_index | x_ccf | y_ccf | z_ccf | parcellation_division | parcellation_structure | parcellation_substructure | left_hemisphere |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1019171907102970225 | 6.806941 | 6.449028 | 6.6 | 787 | 7.204802 | 5.184069 | 7.025149 | HY | ZI | ZI | False |
| 1018093344102600178-2 | 7.192112 | 6.043638 | 6.4 | 787 | 7.442089 | 4.742288 | 7.4133 | HY | ZI | ZI | False |
| 1018093344102510506-4 | 3.964947 | 6.141358 | 6.2 | 787 | 7.521666 | 4.787502 | 4.114865 | HY | ZI | ZI | True |
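
Because the cell metadata is an ordinary pandas DataFrame, cells can be selected with boolean masks on any of the columns listed above. The sketch below is illustrative only: `toy_obs` is a hand-built stand-in containing a few columns from the three rows displayed above, not the real output of `abc.load_standard_thalamus`, and the 0.51 score threshold is an arbitrary value chosen for the example.

```python
import pandas as pd

# Toy stand-in for `obs`, built from the three rows displayed above
toy_obs = pd.DataFrame(
    {
        "brain_section_label": [
            "C57BL6J-638850.37",
            "C57BL6J-638850.36",
            "C57BL6J-638850.35",
        ],
        "average_correlation_score": [0.515042, 0.505055, 0.513099],
        "parcellation_structure": ["ZI", "ZI", "ZI"],
        "left_hemisphere": [False, False, True],
    },
    index=pd.Index(
        [
            "1019171907102970225",
            "1018093344102600178-2",
            "1018093344102510506-4",
        ],
        name="cell_label",
    ),
)

# Cells annotated as ZI that lie in the right hemisphere
right_zi = toy_obs[
    (toy_obs["parcellation_structure"] == "ZI") & ~toy_obs["left_hemisphere"]
]

# Cells whose mapping score exceeds an arbitrary example threshold
well_mapped = toy_obs[toy_obs["average_correlation_score"] > 0.51]

print(list(right_zi.index))
print(len(well_mapped))
```

The same masks should apply unchanged to the full `obs` DataFrame, assuming its columns have the dtypes suggested by the display above.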


### 1B. Load thalamus dataset as AnnData object

Includes:
- gene expression, stored in adata.X
- cell metadata, stored in adata.obs (identical to the DataFrame loaded in Part 1A)
- gene metadata, stored in adata.var
- the exact same set of cells as loaded into the DataFrame in Part 1A
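
Since rows of the expression matrix and rows of the cell metadata refer to the same cells in the same order, expression values for a gene can be attached to the metadata by position. The sketch below mimics that layout with plain numpy and pandas objects: `X`, `obs`, and `var` are toy stand-ins with made-up expression values and made-up cell labels, not the real AnnData contents. Only the gene symbols and transcript IDs are taken from this notebook's output.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for adata.obs, adata.var and adata.X (3 cells x 2 genes)
obs = pd.DataFrame(
    {"left_hemisphere": [False, False, True]},
    index=pd.Index(["cell_a", "cell_b", "cell_c"], name="cell_label"),
)
var = pd.DataFrame(
    {"transcript_identifier": ["ENSMUST00000028118", "ENSMUST00000030676"]},
    index=pd.Index(["Prkcq", "Grik3"], name="gene_symbol"),
)
X = np.array(
    [
        [0.0, 10.5],
        [0.0, 0.0],
        [12.0, 9.5],
    ]
)  # made-up expression values, rows = cells, columns = genes

# Look up the column of X that corresponds to a gene symbol
gene_idx = var.index.get_loc("Grik3")

# Rows of X align with rows of obs, so the obs index can label the values
grik3 = pd.Series(X[:, gene_idx], index=obs.index, name="Grik3")

# Summarize expression by a metadata field
hemisphere = obs["left_hemisphere"].map({True: "left", False: "right"})
mean_by_hemisphere = grik3.groupby(hemisphere).mean()
print(mean_by_hemisphere)
```

With a real AnnData object, the equivalent column can be selected by gene name with `adata_th[:, 'Grik3'].X`, which returns a 2D array with one column.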

In [5]:
'''
These metadata descriptions are compiled & modified from the "Allen Brain Cell 
Atlas - Data Access" companion Jupyter book, which can be found at: 
https://alleninstitute.github.io/abc_atlas_access/intro.html

adata_th.X : np.ndarray
    dense array of gene expression values for each cell in the dataset; standard 
    gene counts transform is log2cpm

adata_th.obs : pd.DataFrame
    cell metadata, identical to the DataFrame version loaded in Part 1A
    
adata_th.var : pd.DataFrame
    gene metadata, with the following fields:
    gene_symbol : str
        commonly used gene name. Both the Index of the DataFrame and a column
    transcript_identifier : str
        unique Ensembl transcript ID for each gene
'''


In [6]:
# Load thalamus AnnData object (includes gene expression + cell & gene metadata)
adata_th = abc.load_standard_thalamus(data_structure='adata')

In [7]:
# Display some info about the loaded AnnData object
display(adata_th)

display(adata_th.var.head(3))

display(adata_th.uns)

display(adata_th.X)

AnnData object with n_obs × n_vars = 79158 × 500
    obs: 'brain_section_label', 'average_correlation_score', 'class', 'cluster', 'cluster_alias', 'left_hemisphere', 'neurotransmitter', 'parcellation_division', 'parcellation_index', 'parcellation_structure', 'parcellation_substructure', 'subclass', 'supertype', 'x_ccf', 'x_reconstructed', 'x_section', 'y_ccf', 'y_reconstructed', 'y_section', 'z_ccf', 'z_reconstructed', 'z_section'
    var: 'gene_symbol', 'transcript_identifier'
    uns: 'accessed_on', 'src', 'counts_transform'

| gene_symbol (index) | gene_symbol | transcript_identifier |
|---|---|---|
| Prkcq | Prkcq | ENSMUST00000028118 |
| Col5a1 | Col5a1 | ENSMUST00000028280 |
| Grik3 | Grik3 | ENSMUST00000030676 |


OverloadedDict, wrapping:
	{'accessed_on': '2023-08-25-12-47-11', 'src': '/allen/programs/celltypes/workgroups/rnaseqanalysis/mFISH/michaelkunst/MERSCOPES/mouse/atlas/mouse_638850/cirro_folder/atlas_brain_638850_CCF.h5ad', 'counts_transform': 'log2cpm'}
With overloaded keys:
	['neighbors'].

array([[ 0.        , 10.75747652, 14.07865437, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        , 14.29605172, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        , 13.02685598, ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
        10.26208764,  0.        ],
       [ 0.        ,  0.        , 12.37853023, ...,  0.        ,
         9.79492177,  0.        ],
       [ 0.        ,  0.        , 12.08015275, ...,  0.        ,
         0.        ,  0.        ]])
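
The `counts_transform` entry in `adata_th.uns` reports that the values in `adata_th.X` are `log2cpm`. The exact definition used by the module is not shown in this notebook. The sketch below assumes the common convention log2(CPM + 1), where CPM is counts per million computed per cell, which is consistent with zeros remaining zero in the array above. The function name `log2cpm` and the raw counts are illustrative only.

```python
import numpy as np


def log2cpm(counts: np.ndarray) -> np.ndarray:
    """Assumed transform: log2(counts per million + 1), computed per cell (row)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Guard against division by zero for cells with no counts at all
    safe_totals = np.where(totals == 0, 1.0, totals)
    cpm = counts / safe_totals * 1e6
    return np.log2(cpm + 1.0)


# Made-up raw counts: 2 cells x 3 genes
raw = np.array([[0, 1, 9], [0, 0, 4]])
transformed = log2cpm(raw)

# Under the same assumption, the transform can be inverted to recover CPM
recovered_cpm = 2.0 ** transformed - 1.0
print(transformed)
```

If the module uses a different pseudocount or normalization, the inverse step would need to change accordingly.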