# Dot Plot of Neuromodulator Receptor Expression in the Basolateral Amygdala

This notebook uses 10x single-cell RNA sequencing data from the Allen Brain Cell
(ABC) Atlas to visualize the expression of serotonin, norepinephrine, and dopamine
receptor genes across cell types in the **basolateral amygdala (BLA)** of the mouse brain.
The 10x dissections do not isolate the BLA, so the analysis uses the cortical subplate
(CTXsp) dissection region that contains it.

The output is a **dot plot** showing mean expression (color) and fraction of expressing
cells (dot size) for each gene by cell type (subclass level).

## Prerequisites
- Internet connection (for downloading data from AWS S3)
- Run the [getting started notebook](https://alleninstitute.github.io/abc_atlas_access/notebooks/getting_started.html) first
- Packages: `abc_atlas_access`, `anndata`, `scanpy`, `pandas`, `numpy`, `matplotlib`

In [None]:
import pandas as pd
import numpy as np
import anndata
import scanpy as sc
import matplotlib.pyplot as plt
from pathlib import Path

from abc_atlas_access.abc_atlas_cache.abc_project_cache import AbcProjectCache

## 1. Initialize the ABC Atlas Cache

Set the download directory and create the cache object. Data will be downloaded
from AWS S3 on first access and cached locally for subsequent runs.

In [None]:
download_base = Path('../../data/abc_atlas')
abc_cache = AbcProjectCache.from_s3_cache(download_base)

print(f"Current manifest: {abc_cache.current_manifest}")

## 2. Load Cell Metadata and Taxonomy

Load the cell metadata for the roughly 4 million cells in the WMB-10X dataset,
the gene metadata, and the cluster taxonomy annotations.

In [None]:
# Load cell metadata
cell = abc_cache.get_metadata_dataframe(
    directory='WMB-10X',
    file_name='cell_metadata',
    dtype={'cell_label': str}
)
cell.set_index('cell_label', inplace=True)
print(f"Total cells in WMB-10X: {len(cell):,}")

In [None]:
# Load gene metadata
gene = abc_cache.get_metadata_dataframe(
    directory='WMB-10X',
    file_name='gene'
)
gene.set_index('gene_identifier', inplace=True)
print(f"Total genes: {len(gene):,}")

In [None]:
# Load cluster taxonomy pivot table (maps cluster_alias -> all annotation levels)
cluster_details = abc_cache.get_metadata_dataframe(
    directory='WMB-taxonomy',
    file_name='cluster_to_cluster_annotation_membership_pivoted',
    keep_default_na=False
)
cluster_details.set_index('cluster_alias', inplace=True)

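# Join taxonomy annotations onto cell metadata (left join: every cell is kept)
cell_extended = cell.join(cluster_details, on='cluster_alias')
print(f"Taxonomy levels: {list(cluster_details.columns)}")
# Count only cells that actually received a subclass annotation
print(f"Cells with annotations: {cell_extended['subclass'].notna().sum():,}")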
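
The join above can be illustrated on toy data. This is a minimal sketch, assuming two small invented tables shaped like `cell` and `cluster_details`; the labels and the names `cell_toy`, `cluster_toy`, and `joined` are made up for illustration. It shows that `DataFrame.join` with `on=` is a left join, so the row count of the result equals the number of cells whether or not an annotation matched.

```python
import pandas as pd

# Toy cell table: each cell carries only a numeric cluster_alias.
cell_toy = pd.DataFrame(
    {"cluster_alias": [1, 1, 2, 3]},
    index=pd.Index(["cell_a", "cell_b", "cell_c", "cell_d"], name="cell_label"),
)

# Toy pivot table: one row per cluster_alias, one column per taxonomy level.
# The label strings are invented for illustration.
cluster_toy = pd.DataFrame(
    {
        "class": ["class X", "class Y"],
        "subclass": ["subclass X1", "subclass Y1"],
    },
    index=pd.Index([1, 2], name="cluster_alias"),
)

# Left join: every cell is kept; cells whose alias is absent from the
# pivot table (cell_d here) get NaN in the annotation columns.
joined = cell_toy.join(cluster_toy, on="cluster_alias")

n_annotated = int(joined["subclass"].notna().sum())
print(joined)
print(f"{n_annotated} of {len(joined)} toy cells received an annotation")
```

Counting non-null values in an annotation column, rather than taking the length of the joined table, is one way to check how many cells were actually matched.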

## 3. Identify BLA Region Cells

The basolateral amygdala (BLA) is part of the **cortical subplate (CTXsp)** dissection
region in the ABC Atlas. The CTXsp dissection captures BLA, LA (lateral amygdala),
BMA (basomedial amygdala), claustrum, endopiriform nucleus, and surrounding structures.

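Because the 10x data are dissected at the CTXsp level, BLA cells cannot be separated
from cells of the neighboring structures by dissection label alone. We therefore use
all cells from this region, and the resulting plots describe the CTXsp region as a whole.
To keep the dot plot readable, we keep only subclasses with at least 100 cells.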

In [None]:
# Filter to CTXsp dissection region
ctxsp_cells = cell_extended[cell_extended['region_of_interest_acronym'] == 'CTXsp']
print(f"Cells in CTXsp region: {len(ctxsp_cells):,}")

# Show subclass breakdown
print(f"\nSubclasses in CTXsp (top 30):")
subclass_counts = ctxsp_cells.groupby('subclass').size().sort_values(ascending=False)
for sc_name, count in subclass_counts.head(30).items():
    print(f"  {sc_name}: {count:,} cells")

In [None]:
# Filter to subclasses with at least 100 cells for a readable dot plot
min_subclass_cells = 100
subclass_counts = ctxsp_cells.groupby('subclass').size()
valid_subclasses = subclass_counts[subclass_counts >= min_subclass_cells].index.tolist()

bla_cells = ctxsp_cells[ctxsp_cells['subclass'].isin(valid_subclasses)].copy()

print(f"Subclasses with >= {min_subclass_cells} cells: {len(valid_subclasses)}")
print(f"Total BLA-region cells selected: {len(bla_cells):,}")
print(f"\nCell types included:")
for sc_name in sorted(valid_subclasses):
    count = subclass_counts[sc_name]
    print(f"  {sc_name}: {count:,} cells")
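
The minimum-count filter can be sketched on a toy table. The subclass labels and the threshold of 3 below are invented for illustration (the notebook uses 100), and `value_counts` is used here as an equivalent of `groupby(...).size()`.

```python
import pandas as pd

# Toy cell table with invented subclass labels.
cells_toy = pd.DataFrame(
    {"subclass": ["A"] * 5 + ["B"] * 2 + ["C"] * 3},
    index=[f"cell_{i}" for i in range(10)],
)

min_cells_toy = 3  # hypothetical threshold; the notebook uses 100

# Count cells per subclass, then keep the labels that meet the threshold.
counts = cells_toy["subclass"].value_counts()
keep = counts[counts >= min_cells_toy].index

# Retain only cells whose subclass passed the filter.
filtered = cells_toy[cells_toy["subclass"].isin(keep)].copy()
print(sorted(keep))
print(len(filtered))
```

Rare subclasses are dropped entirely, so the filtered table no longer sums to the full region count.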

In [None]:
# Summarize by neurotransmitter type
print("Cell type composition by neurotransmitter:")
for nt, count in bla_cells.groupby('neurotransmitter').size().sort_values(ascending=False).items():
    print(f"  {nt}: {count:,}")

print(f"\nCell type composition by class:")
for cls_name, count in bla_cells.groupby('class').size().sort_values(ascending=False).items():
    print(f"  {cls_name}: {count:,}")

## 4. Define Receptor Gene Lists

We define three families of neuromodulator receptor genes (mouse nomenclature):
- **Serotonin (5-HT) receptors** (`Htr*`)
- **Norepinephrine (adrenergic) receptors** (`Adra*`, `Adrb*`)
- **Dopamine receptors** (`Drd*`)

In [None]:
# Serotonin (5-HT) receptors
serotonin_receptors = [
    'Htr1a', 'Htr1b', 'Htr1d', 'Htr1f',
    'Htr2a', 'Htr2b', 'Htr2c',
    'Htr3a', 'Htr3b',
    'Htr4', 'Htr5a', 'Htr5b', 'Htr6', 'Htr7'
]

# Norepinephrine (adrenergic) receptors
norepinephrine_receptors = [
    'Adra1a', 'Adra1b', 'Adra1d',
    'Adra2a', 'Adra2b', 'Adra2c',
    'Adrb1', 'Adrb2', 'Adrb3'
]

# Dopamine receptors
dopamine_receptors = [
    'Drd1', 'Drd2', 'Drd3', 'Drd4', 'Drd5'
]

all_receptors = serotonin_receptors + norepinephrine_receptors + dopamine_receptors
print(f"Total receptor genes: {len(all_receptors)}")
print(f"  Serotonin: {len(serotonin_receptors)}")
print(f"  Norepinephrine: {len(norepinephrine_receptors)}")
print(f"  Dopamine: {len(dopamine_receptors)}")

In [None]:
# Verify which genes are present in the dataset
available_genes = gene[gene['gene_symbol'].isin(all_receptors)]
found_symbols = set(available_genes['gene_symbol'])
missing = [g for g in all_receptors if g not in found_symbols]

print(f"Found {len(found_symbols)}/{len(all_receptors)} receptor genes in the dataset")
if missing:
    print(f"Missing genes: {missing}")

# Update gene list to only include available genes
receptor_genes = [g for g in all_receptors if g in found_symbols]
print(f"\nUsing {len(receptor_genes)} genes for analysis")
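
A small sketch of the same presence check on toy data follows. The gene identifiers, the fake symbol `Htr9x`, and the helper `split_present_missing` are all invented for illustration. It also shows one detail worth keeping in mind: rows selected with `isin` come back in gene-table order, so a symbol-to-identifier lookup is needed to recover the requested order.

```python
import pandas as pd

# Toy gene table shaped like `gene`: identifier index, gene_symbol column.
# The identifiers are made up for illustration.
gene_toy = pd.DataFrame(
    {"gene_symbol": ["Drd2", "Htr1a", "Gapdh"]},
    index=pd.Index(["ID_0001", "ID_0002", "ID_0003"], name="gene_identifier"),
)

requested = ["Htr1a", "Htr9x", "Drd2"]  # "Htr9x" is a deliberately fake symbol


def split_present_missing(requested_symbols, gene_table):
    """Return (present, missing), both in the order they were requested."""
    known = set(gene_table["gene_symbol"])
    present = [g for g in requested_symbols if g in known]
    missing = [g for g in requested_symbols if g not in known]
    return present, missing


present, missing_toy = split_present_missing(requested, gene_toy)

# Map symbols back to identifiers in the requested order.
id_by_symbol = dict(zip(gene_toy["gene_symbol"], gene_toy.index))
ids_in_request_order = [id_by_symbol[g] for g in present]

print(present)
print(missing_toy)
print(ids_in_request_order)
```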

## 5. Load Expression Data for BLA Cells

BLA cells come from the CTXsp expression matrices. We load the log2-normalized
data from both 10Xv2 and 10Xv3 chemistries and extract the receptor genes.

In [None]:
# Determine which expression matrix files contain our BLA cells
bla_matrices = bla_cells.groupby('feature_matrix_label').size()
print("Expression matrices containing BLA cells:")
for mat, count in bla_matrices.items():
    print(f"  {mat}: {count:,} cells")

In [None]:
# Extract gene expression for receptor genes from each relevant matrix file
gene_ensembl_ids = available_genes.index.tolist()
gene_symbols = available_genes['gene_symbol'].tolist()

expression_frames = []

for matrix_label in bla_matrices.index:
    # Determine directory (dataset_label)
    dataset_label = bla_cells[bla_cells['feature_matrix_label'] == matrix_label]['dataset_label'].iloc[0]
    file_name = f"{matrix_label}/log2"
    
    print(f"\nLoading {file_name} from {dataset_label}...")
    file_path = abc_cache.get_file_path(directory=dataset_label, file_name=file_name)
    
    # Open backed to avoid loading everything into memory
    adata = anndata.read_h5ad(file_path, backed='r')
    
    # Find gene indices for our receptor genes
    gene_mask = adata.var.index.isin(gene_ensembl_ids)
    gene_filtered = adata.var[gene_mask]
    
    # Filter to BLA cells that exist in this matrix
    bla_cell_labels = bla_cells[bla_cells['feature_matrix_label'] == matrix_label].index
    cell_mask = adata.obs.index.isin(bla_cell_labels)
    
    print(f"  Cells in matrix: {len(adata.obs):,}")
    print(f"  BLA cells found: {cell_mask.sum():,}")
    print(f"  Receptor genes found: {gene_mask.sum()}")
    
    # Extract the subset into memory using integer indices
    # (backed AnnData does not support chained view-of-view slicing)
    cell_idx = np.where(cell_mask)[0]
    gene_idx = np.where(gene_mask)[0]
    subset = adata[cell_idx, gene_idx].to_memory()
    
    # Convert to DataFrame
    expr_df = subset.to_df()
    expr_df.columns = gene_filtered['gene_symbol'].values
    
    expression_frames.append(expr_df)
    
    adata.file.close()
    del adata

# Concatenate expression data from all matrices
expression_data = pd.concat(expression_frames)
print(f"\nTotal expression data: {expression_data.shape[0]:,} cells x {expression_data.shape[1]} genes")
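
The mask-to-index step in the loop can be sketched with plain NumPy. This is a stand-in under stated assumptions, not AnnData itself: `obs_names`, `var_ids`, `symbols_by_id`, and `X` are invented toy objects playing the roles of `adata.obs.index`, `adata.var.index`, the symbol lookup, and the expression matrix.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for adata.obs.index, adata.var.index and the matrix.
obs_names = np.array(["c1", "c2", "c3", "c4"])
var_ids = np.array(["ID_1", "ID_2", "ID_3"])
symbols_by_id = {"ID_1": "GeneA", "ID_2": "GeneB", "ID_3": "GeneC"}
X = np.arange(12, dtype=float).reshape(4, 3)

wanted_cells = ["c4", "c2"]
wanted_ids = ["ID_3", "ID_1"]

# Boolean masks over the matrix axes.
cell_mask = np.isin(obs_names, wanted_cells)
gene_mask = np.isin(var_ids, wanted_ids)

# Convert the masks to integer positions.
cell_idx = np.where(cell_mask)[0]
gene_idx = np.where(gene_mask)[0]

# np.ix_ selects the cross product of rows and columns in one step.
sub = X[np.ix_(cell_idx, gene_idx)]

# Rows and columns come back in matrix order, not request order,
# so the labels are taken from the indexed name arrays.
expr_toy = pd.DataFrame(
    sub,
    index=obs_names[cell_idx],
    columns=[symbols_by_id[i] for i in var_ids[gene_idx]],
)
print(expr_toy)
```

Because the subset follows matrix order, any later step that needs a specific gene order has to reorder the columns explicitly.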

## 6. Build AnnData Object for Dot Plot

Combine the expression data with cell type annotations into a single
AnnData object for visualization with scanpy.

In [None]:
# Ensure expression columns are in the desired gene order
expression_data = expression_data[receptor_genes]

# Create AnnData object
adata_bla = anndata.AnnData(
    X=expression_data.values,
    obs=bla_cells.loc[expression_data.index, ['subclass', 'supertype', 'class', 'neurotransmitter']].copy(),
    var=pd.DataFrame(index=receptor_genes)
)

# Create a shorter display label for subclass (strip leading number prefix)
import re
adata_bla.obs['subclass_short'] = adata_bla.obs['subclass'].apply(
    lambda x: re.sub(r'^\d+\s+', '', x)
)
adata_bla.obs['subclass'] = pd.Categorical(adata_bla.obs['subclass'])
adata_bla.obs['subclass_short'] = pd.Categorical(
    adata_bla.obs['subclass_short'],
    categories=[re.sub(r'^\d+\s+', '', s) for s in sorted(adata_bla.obs['subclass'].cat.categories)]
)

n_subclasses = len(adata_bla.obs['subclass'].cat.categories)
print(adata_bla)
print(f"\nSubclasses in BLA region data ({n_subclasses}):")
for sc_name, count in adata_bla.obs.groupby('subclass', observed=True).size().sort_values(ascending=False).items():
    print(f"  {sc_name}: {count:,} cells")
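
The label-shortening step can be sketched in isolation. The labels and the helper name `strip_numeric_prefix` below are invented for illustration; only the regular expression is taken from the cell above. Sorting on the full label before stripping keeps the order implied by the numeric prefix.

```python
import re


def strip_numeric_prefix(label: str) -> str:
    """Drop a leading run of digits and the whitespace that follows it."""
    return re.sub(r"^\d+\s+", "", label)


# Illustrative labels in a "NNN name" style; not real taxonomy entries.
labels = ["012 Example Glut", "003 Example Gaba", "NoPrefix Astro"]

# Sort on the full label first so the numeric prefix fixes the order,
# then strip it for display.
ordered_short = [strip_numeric_prefix(s) for s in sorted(labels)]
print(ordered_short)
```

A label without a numeric prefix passes through unchanged, since the pattern is anchored at the start of the string.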

## 7. Dot Plot: Receptor Expression by Cell Type

The dot plot shows:
- **Dot size**: Fraction of cells in each group expressing the gene (expression > 0)
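- **Dot color**: Mean log2 expression across all cells in each group (scanpy's default, `mean_only_expressed=False`), rescaled to the range 0 to 1 for each gene by `standard_scale='var'`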

Genes are grouped by receptor family (serotonin, norepinephrine, dopamine).
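
The two quantities encoded by a dot can be reproduced by hand on toy data. This is a minimal sketch, assuming scanpy's documented defaults: a cell counts as expressing when its value is above 0, the mean is taken over all cells in the group, and `standard_scale='var'` subtracts each gene's minimum across groups and divides by the resulting maximum. The group and gene names are invented.

```python
import pandas as pd

# Toy log2 expression for two genes across two invented groups.
toy = pd.DataFrame(
    {
        "group": ["g1", "g1", "g1", "g2", "g2", "g2"],
        "GeneA": [0.0, 2.0, 4.0, 0.0, 0.0, 3.0],
        "GeneB": [1.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    }
)
genes = ["GeneA", "GeneB"]
grouped = toy.groupby("group")[genes]

# Dot size: fraction of cells in the group with expression > 0.
frac = grouped.agg(lambda s: (s > 0).mean())

# Dot color before scaling: mean over all cells in the group, zeros included.
mean_all = grouped.mean()

# Per-gene scaling to [0, 1]: subtract the column minimum, then divide by
# the resulting column maximum (a gene that is constant across groups
# would give 0/0, which is filled with 0).
shifted = mean_all - mean_all.min(axis=0)
scaled = (shifted / shifted.max(axis=0)).fillna(0.0)

print(frac)
print(mean_all)
print(scaled)
```

One consequence of the per-gene scaling is that every gene has at least one group at the darkest color and one at the lightest, so colors are comparable across groups within a gene but not across genes.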

In [None]:
# Define gene groups for the dot plot
receptor_groups = {}

sero_available = [g for g in serotonin_receptors if g in receptor_genes]
ne_available = [g for g in norepinephrine_receptors if g in receptor_genes]
da_available = [g for g in dopamine_receptors if g in receptor_genes]

if sero_available:
    receptor_groups['Serotonin (5-HT)'] = sero_available
if ne_available:
    receptor_groups['Norepinephrine (NE)'] = ne_available
if da_available:
    receptor_groups['Dopamine (DA)'] = da_available

print("Gene groups for dot plot:")
for group, genes in receptor_groups.items():
    print(f"  {group}: {genes}")

In [None]:
# Create the dot plot grouped by subclass
dp = sc.pl.dotplot(
    adata_bla,
    var_names=receptor_groups,
    groupby='subclass_short',
    standard_scale='var',
    cmap='Reds',
    figsize=(16, max(6, n_subclasses * 0.4)),
    show=False,
    return_fig=True
)
dp.style(dot_edge_color='black', dot_edge_lw=0.5)
dp.savefig('dotplot_BLA_receptors_by_subclass.png', dpi=150, bbox_inches='tight')
plt.show()
print("Saved: dotplot_BLA_receptors_by_subclass.png")

## 8. Alternative View: Grouped by Supertype

For a finer-grained view, we can also plot by supertype (one level below subclass).
We filter to supertypes with at least 50 cells for readability.

In [None]:
# Filter to supertypes with sufficient cells
min_cells = 50
supertype_counts = adata_bla.obs.groupby('supertype', observed=True).size()
valid_supertypes = supertype_counts[supertype_counts >= min_cells].index.tolist()

adata_supertype = adata_bla[adata_bla.obs['supertype'].isin(valid_supertypes)].copy()

# Create short labels for supertypes too
adata_supertype.obs['supertype_short'] = adata_supertype.obs['supertype'].apply(
    lambda x: re.sub(r'^\d+\s+', '', x)
)
adata_supertype.obs['supertype_short'] = pd.Categorical(adata_supertype.obs['supertype_short'])

n_supertypes = len(valid_supertypes)
print(f"Supertypes with >= {min_cells} cells: {n_supertypes}")

dp2 = sc.pl.dotplot(
    adata_supertype,
    var_names=receptor_groups,
    groupby='supertype_short',
    standard_scale='var',
    cmap='Reds',
    figsize=(16, max(8, n_supertypes * 0.35)),
    show=False,
    return_fig=True
)
dp2.style(dot_edge_color='black', dot_edge_lw=0.5)
dp2.savefig('dotplot_BLA_receptors_by_supertype.png', dpi=150, bbox_inches='tight')
plt.show()
print("Saved: dotplot_BLA_receptors_by_supertype.png")

## 9. Summary Statistics

Compute and display summary statistics for receptor expression across BLA cell types.

In [None]:
# Compute mean expression and fraction expressing per subclass
expr_df = pd.DataFrame(
    adata_bla.X,
    index=adata_bla.obs.index,
    columns=adata_bla.var.index
)
expr_df['subclass'] = adata_bla.obs['subclass'].values

mean_expr = expr_df.groupby('subclass', observed=True)[receptor_genes].mean()
frac_expr = expr_df.groupby('subclass', observed=True)[receptor_genes].apply(
    lambda x: (x > 0).mean()
)

print("=" * 80)
print("Mean Expression (log2) by Subclass")
print("=" * 80)
display(mean_expr.round(2))

print("\n" + "=" * 80)
print("Fraction of Expressing Cells by Subclass")
print("=" * 80)
display(frac_expr.round(3))
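
The two wide tables can also be combined into one long table, which is often easier to filter or export. This is a sketch on toy data, assuming tables shaped like `mean_expr` and `frac_expr` (subclass index, one column per gene); the names `mean_toy`, `frac_toy`, `to_long`, and `summary_long` and all values are invented.

```python
import pandas as pd

idx = pd.Index(["g1", "g2"], name="subclass")

# Toy stand-ins for the wide mean and fraction tables.
mean_toy = pd.DataFrame({"GeneA": [2.0, 1.0], "GeneB": [1.0, 0.0]}, index=idx)
frac_toy = pd.DataFrame({"GeneA": [0.5, 0.25], "GeneB": [1.0, 0.0]}, index=idx)


def to_long(wide: pd.DataFrame, value_name: str) -> pd.DataFrame:
    """Reshape a subclass-by-gene table to one row per (subclass, gene)."""
    return wide.reset_index().melt(
        id_vars="subclass", var_name="gene", value_name=value_name
    )


# One row per (subclass, gene) with both statistics side by side.
summary_long = to_long(mean_toy, "mean_log2").merge(
    to_long(frac_toy, "fraction"), on=["subclass", "gene"]
)
print(summary_long)
```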