# Single-cell gene expression in the 10x Mouse Whole Brain Atlas ([CCN20230722](https://alleninstitute.github.io/abc_atlas_access/notebooks/cluster_annotation_tutorial.html))

## AbcProjectCache

`AbcProjectCache` is the primary programmatic interface to the Allen Brain Cell Atlas. This Python class manages local storage and retrieval of datasets so that files are organized systematically on the user's drive. It handles versioning by tracking specific release manifests, which lets you work with consistent data snapshots for reproducibility. By abstracting file paths and network requests, the class provides stable access to large-scale transcriptomic and spatial data without manual file management.

## Install cache files

__Initialize the interface for the Allen Brain Cell Atlas and establish a local directory for data storage__.
The `pathlib` library defines the destination folder, because the API expects `Path` objects rather than strings.
`AbcProjectCache` is instantiated with the `from_cache_dir` method, which manages data retrieval and versioning. To verify the configuration, the code prints the active manifest, then lists all downloaded manifests as a record of the data versions stored locally.


In [9]:
from pathlib import Path
from abc_atlas_access.abc_atlas_cache.abc_project_cache import AbcProjectCache

# 1. Define the local folder using pathlib (required by the new API)
download_base = Path("abc_data")

# 2. Initialize using the 'from_cache_dir' helper method
abc_cache = AbcProjectCache.from_cache_dir(download_base)

print("Cache initialized successfully.")
print("Most Current manifest:")
print(abc_cache.current_manifest)
print("----------")
print("All downloaded manifests:")
abc_cache.list_all_downloaded_manifests



Cache initialized successfully.
Most Current manifest:
releases/20251031/manifest.json
----------
All downloaded manifests:


['releases/20251031/manifest.json']
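
When several manifests have been downloaded, it can be useful to order them by release. The sketch below assumes that the folder containing `manifest.json` is a `YYYYMMDD` release tag, as the name printed above suggests. `release_date` is a hypothetical helper written for illustration, not part of `abc_atlas_access`.

```python
from datetime import date, datetime
from pathlib import PurePosixPath


def release_date(manifest_name: str) -> date:
    """Parse the release date from a name like 'releases/20251031/manifest.json'."""
    # Assumption: the parent folder of manifest.json is a YYYYMMDD release tag.
    tag = PurePosixPath(manifest_name).parent.name
    return datetime.strptime(tag, "%Y%m%d").date()


# Manifest names as returned by list_all_downloaded_manifests above.
manifests = ["releases/20251031/manifest.json"]
latest = max(manifests, key=release_date)
print(latest, release_date(latest))
```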

## Download gene expression matrices
The dataset of 4 million cells is divided into 24 expression matrices to make transfer and download more efficient. Each matrix is packaged as an anndata h5ad file with minimal metadata.

In [15]:
v2 = abc_cache.get_directory_expression_matrix_size('WMB-10Xv2')
v3 = abc_cache.get_directory_expression_matrix_size('WMB-10Xv3')
multi = abc_cache.get_directory_expression_matrix_size('WMB-10XMulti')
print(" WMB-10Xv2 size:", v2,"\n",
      "WMB-10Xv3 size:", v3,"\n",
      "WMB-10XMulti size:", multi)

 WMB-10Xv2 size: 104.16 GB 
 WMB-10Xv3 size: 176.41 GB 
 WMB-10XMulti size: 211.28 MB
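
Before downloading, it may help to total the reported sizes and compare them with the available disk space. The sketch below assumes the sizes arrive as strings in the format printed above and that the units are decimal (1 GB = 1000 MB). `size_in_gb` and `UNIT_TO_GB` are hypothetical names introduced for illustration.

```python
# Assumption: decimal units, matching strings such as '104.16 GB'.
UNIT_TO_GB = {"MB": 1 / 1000, "GB": 1.0, "TB": 1000.0}


def size_in_gb(size: str) -> float:
    """Convert a string such as '104.16 GB' to a number of gigabytes."""
    value, unit = size.split()
    return float(value) * UNIT_TO_GB[unit]


# Values copied from the output above.
reported = {
    "WMB-10Xv2": "104.16 GB",
    "WMB-10Xv3": "176.41 GB",
    "WMB-10XMulti": "211.28 MB",
}
total_gb = sum(size_in_gb(s) for s in reported.values())
print(f"Total: {total_gb:.2f} GB")
```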


In [16]:
abc_cache.list_expression_matrix_files('WMB-10Xv2')

['WMB-10Xv2-CTXsp/log2',
 'WMB-10Xv2-CTXsp/raw',
 'WMB-10Xv2-HPF/log2',
 'WMB-10Xv2-HPF/raw',
 'WMB-10Xv2-HY/log2',
 'WMB-10Xv2-HY/raw',
 'WMB-10Xv2-Isocortex-1/log2',
 'WMB-10Xv2-Isocortex-1/raw',
 'WMB-10Xv2-Isocortex-2/log2',
 'WMB-10Xv2-Isocortex-2/raw',
 'WMB-10Xv2-Isocortex-3/log2',
 'WMB-10Xv2-Isocortex-3/raw',
 'WMB-10Xv2-Isocortex-4/log2',
 'WMB-10Xv2-Isocortex-4/raw',
 'WMB-10Xv2-MB/log2',
 'WMB-10Xv2-MB/raw',
 'WMB-10Xv2-OLF/log2',
 'WMB-10Xv2-OLF/raw',
 'WMB-10Xv2-TH/log2',
 'WMB-10Xv2-TH/raw']
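
Each key in the listing has the form `<matrix label>/<normalization>`, with `log2` and `raw` variants for every matrix. As a minimal sketch, the keys can be grouped by matrix label to select only the `raw` files; the short list below is copied from the output above.

```python
from collections import defaultdict

# A few keys copied from the listing above.
file_keys = [
    "WMB-10Xv2-CTXsp/log2",
    "WMB-10Xv2-CTXsp/raw",
    "WMB-10Xv2-HPF/log2",
    "WMB-10Xv2-HPF/raw",
]

# Group the available normalizations under each matrix label.
by_matrix = defaultdict(list)
for key in file_keys:
    matrix_label, normalization = key.rsplit("/", 1)
    by_matrix[matrix_label].append(normalization)

# Keep only the keys that point at raw counts.
raw_keys = [f"{label}/raw" for label, kinds in by_matrix.items() if "raw" in kinds]
print(raw_keys)
```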

In [17]:


import pandas as pd

# Load the downsampled cell list.
# The CSV must contain 'cell_label' and 'feature_matrix_label' columns.
downsampled_cells = pd.read_csv('data/downsampled_cells.csv')
downsampled_cells = downsampled_cells.set_index('cell_label')

print(f"Targeting {len(downsampled_cells)} cells across {downsampled_cells['feature_matrix_label'].nunique()} matrices.")


Targeting 200000 cells across 24 matrices.
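
A quick sanity check on the cell list can catch a malformed CSV before any large file is downloaded. The sketch below uses a toy table in place of `data/downsampled_cells.csv`; the cell labels are made up, and only the two column names come from the cell above.

```python
import io

import pandas as pd

# Toy stand-in for data/downsampled_cells.csv; the cell labels are made up.
csv_text = """cell_label,feature_matrix_label
cell_a,WMB-10Xv3-PAL
cell_b,WMB-10Xv3-PAL
cell_c,WMB-10Xv2-HY
"""
cells = pd.read_csv(io.StringIO(csv_text))

# Fail early if a required column is missing.
required = {"cell_label", "feature_matrix_label"}
missing = required - set(cells.columns)
if missing:
    raise ValueError(f"Missing columns: {sorted(missing)}")

cells = cells.set_index("cell_label")

# Count how many cells are requested from each expression matrix.
cells_per_matrix = cells["feature_matrix_label"].value_counts()
print(cells_per_matrix.to_dict())
```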


__Matrix Retrieval__

This step ensures that the raw gene expression matrices for the downsampled cells are present on the local system. It identifies the unique matrix labels and iterates through them, mapping each matrix to its directory: 10x Genomics Multiome, v3, or v2.
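
The label-to-directory mapping described above can be sketched as a small helper. This version matches on the label prefix with `str.startswith`, which is a slightly stricter test than a substring check. `directory_for` and `KNOWN_DIRECTORIES` are hypothetical names introduced for illustration.

```python
from typing import Optional

# Directory names used with the cache earlier in this notebook.
KNOWN_DIRECTORIES = ("WMB-10XMulti", "WMB-10Xv3", "WMB-10Xv2")


def directory_for(matrix_label: str) -> Optional[str]:
    """Return the directory whose name prefixes the label, or None if unknown."""
    for directory in KNOWN_DIRECTORIES:
        if matrix_label.startswith(directory):
            return directory
    return None


for label in ("WMB-10Xv3-PAL", "WMB-10Xv2-Isocortex-1", "WMB-10XMulti", "other"):
    print(label, "->", directory_for(label))
```

Returning `None` for an unrecognized label leaves the decision to skip or fail with the caller.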

The core operation relies on the `get_file_path` method, which performs a dual function: it returns the absolute path if the file already exists or initiates a download from the remote server if it is missing. Valid paths are aggregated into a dictionary for efficient lookup, while error handling prevents a single missing file from interrupting the batch processing of the remaining datasets.
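
The "return the path if present, otherwise download" behavior can be illustrated with a generic sketch. This is not the implementation inside `abc_atlas_access`; `cached_path` and `fake_fetch` are hypothetical names, and the fetch function stands in for a real network request.

```python
import tempfile
from pathlib import Path
from typing import Callable


def cached_path(cache_dir: Path, relative_name: str,
                fetch: Callable[[str], bytes]) -> Path:
    """Return the local path for a file, fetching it only when it is absent."""
    local = cache_dir / relative_name
    if not local.exists():
        local.parent.mkdir(parents=True, exist_ok=True)
        local.write_bytes(fetch(relative_name))
    return local.resolve()


calls = []


def fake_fetch(name: str) -> bytes:
    # Stand-in for a network request; records each call it receives.
    calls.append(name)
    return b"placeholder"


with tempfile.TemporaryDirectory() as tmp:
    first = cached_path(Path(tmp), "WMB-10Xv2/example.h5ad", fake_fetch)
    second = cached_path(Path(tmp), "WMB-10Xv2/example.h5ad", fake_fetch)
    # The second call finds the file on disk, so fetch runs only once.
    print(first == second, len(calls))
```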



In [18]:
# Initialize storage for local paths
downloaded_paths = {}
unique_matrices = downsampled_cells['feature_matrix_label'].unique()

print(f"Verifying local files for {len(unique_matrices)} matrices...")

for i, matrix_label in enumerate(unique_matrices, start=1):
    # Determine the correct directory based on the label string
    if "WMB-10XMulti" in matrix_label:
        directory = "WMB-10XMulti"
    elif "WMB-10Xv3" in matrix_label:
        directory = "WMB-10Xv3"
    elif "WMB-10Xv2" in matrix_label:
        directory = "WMB-10Xv2"
    else:
        print(f"Skipping unknown dataset: {matrix_label}")
        continue

    # Construct the file key
    file_key = f"{matrix_label}/raw"
    
    try:
        print(f"[{i}/{len(unique_matrices)}] checking: {file_key}")
        # This ensures the file is downloaded and returns the absolute local path
        local_path = abc_cache.get_file_path(
            directory=directory, 
            file_name=file_key
        )
        downloaded_paths[matrix_label] = local_path
        
    except Exception as e:
        print(f"  > Failed to retrieve {file_key}: {e}")

print(f"\nSuccessfully located {len(downloaded_paths)} files.")

Verifying local files for 24 matrices...
[1/24] checking: WMB-10Xv3-PAL/raw
[2/24] checking: WMB-10Xv3-STR/raw
[3/24] checking: WMB-10Xv3-HY/raw
[4/24] checking: WMB-10Xv3-CTXsp/raw
[5/24] checking: WMB-10Xv3-OLF/raw
[6/24] checking: WMB-10Xv3-Isocortex-1/raw
[7/24] checking: WMB-10Xv3-HPF/raw
[8/24] checking: WMB-10Xv3-MB/raw
[9/24] checking: WMB-10Xv3-MY/raw
[10/24] checking: WMB-10Xv3-TH/raw
[11/24] checking: WMB-10Xv3-P/raw
[12/24] checking: WMB-10Xv3-CB/raw
[13/24] checking: WMB-10Xv2-CTXsp/raw
[14/24] checking: WMB-10Xv2-HY/raw
[15/24] checking: WMB-10Xv2-OLF/raw
[16/24] checking: WMB-10Xv2-TH/raw
[17/24] checking: WMB-10Xv2-Isocortex-1/raw
[18/24] checking: WMB-10Xv2-HPF/raw
[19/24] checking: WMB-10Xv2-MB/raw
[20/24] checking: WMB-10XMulti/raw
[21/24] checking: WMB-10Xv2-Isocortex-2/raw
[22/24] checking: WMB-10Xv3-Isocortex-2/raw
[23/24] checking: WMB-10Xv2-Isocortex-4/raw
[24/24] checking: WMB-10Xv2-Isocortex-3/raw

Successfully located 24 files.



## Build subsetted anndata object

In [23]:

import anndata

expression_data = {}
print(f"Loading data from {len(downloaded_paths)} local files...")

for matrix_label, local_path in downloaded_paths.items():
    print(f"\nProcessing: {matrix_label}")

    try:
        # Load the file directly from the local path resolved during matrix retrieval
        adata = anndata.read_h5ad(local_path)
        
        # Identify the target cells for this specific matrix
        target_cells = downsampled_cells[
            downsampled_cells['feature_matrix_label'] == matrix_label
        ].index

        # Calculate the intersection of requested cells and available cells
        valid_cells = [c for c in target_cells if c in adata.obs_names]
        
        if valid_cells:
            # Subset and copy to ensure a clean memory footprint
            expression_data[matrix_label] = adata[valid_cells, :].copy()
            print(f"  > Loaded and subsetted {len(valid_cells)} cells.")
        else:
            print("  > No matching cells found in this matrix.")
            
    except Exception as e:
        print(f"  > Error processing {matrix_label}: {e}")

print(f"\nProcessing complete. {len(expression_data)} AnnData objects created.")

Loading data from 24 local files...

Processing: WMB-10Xv3-PAL
  > Loaded and subsetted 5828 cells.

Processing: WMB-10Xv3-STR
  > Loaded and subsetted 13786 cells.

Processing: WMB-10Xv3-HY
  > Loaded and subsetted 8622 cells.

Processing: WMB-10Xv3-CTXsp
  > Loaded and subsetted 3339 cells.

Processing: WMB-10Xv3-OLF
  > Loaded and subsetted 3786 cells.

Processing: WMB-10Xv3-Isocortex-1
  > Loaded and subsetted 9526 cells.

Processing: WMB-10Xv3-HPF
  > Loaded and subsetted 7866 cells.

Processing: WMB-10Xv3-MB
  > Loaded and subsetted 21337 cells.

Processing: WMB-10Xv3-MY
  > Loaded and subsetted 17795 cells.

Processing: WMB-10Xv3-TH
  > Loaded and subsetted 5773 cells.

Processing: WMB-10Xv3-P
  > Loaded and subsetted 11993 cells.

Processing: WMB-10Xv3-CB
  > Loaded and subsetted 7851 cells.

Processing: WMB-10Xv2-CTXsp
  > Loaded and subsetted 1849 cells.

Processing: WMB-10Xv2-HY
  > Loaded and subsetted 5012 cells.

Processing: WMB-10Xv2-OLF
  > Loaded and subsetted 8201 cells.
...

In [24]:

# Concatenate the per-matrix subsets and save the result
if expression_data:
    print("\nConcatenating...")
    combined_adata = anndata.concat(
        list(expression_data.values()),
        axis=0,
        join='outer', 
        merge='same'
    )
    combined_adata.obs_names = combined_adata.obs_names.astype(str)
    
    output_path = Path('data/downsampled_expression.h5ad')
    output_path.parent.mkdir(parents=True, exist_ok=True)
    combined_adata.write_h5ad(output_path)
    print(f"Success. Saved to {output_path}")
else:
    print("\nNo data collected.")


Concatenating...
Success. Saved to data/downsampled_expression.h5ad
