# Getting started

During the summer/fall of 2023, the Allen Brain Cell Atlas plans to include:
* 1.7 million single cell transcriptomes spanning the whole adult mouse brain using the 10Xv2 chemistry (**WMB-10Xv2**)
* 2.3 million single cell transcriptomes spanning the whole adult mouse brain using the 10Xv3 chemistry (**WMB-10Xv3**)
* Clustering analysis of 4.0 million single cell transcriptomes spanning the whole adult mouse brain combining the 10Xv2 and 10Xv3 datasets (**WMB-10X**)
* A five level whole adult mouse brain taxonomy of cell types (**WMB-taxonomy**)
* 4.0 million cell spatial transcriptomics dataset spanning a single adult mouse brain with a 500 gene panel and mapped to the whole mouse brain taxonomy (**MERFISH-C57BL6J-638850**)
* Definition of 18 cell type neighborhoods and UMAP embeddings for fine-grained visualization and analysis of neuronal types within and between brain regions (**WMB-neighborhoods**)
* An updated Allen CCFv3 with additional annotations for layers of Ammon's horn and the main olfactory bulb, and a simplified 5-level anatomical hierarchy (**Allen-CCF-2020**)
* CCF mapped coordinates for cells in the whole brain spatial transcriptomics dataset (**MERFISH-C57BL6J-638850-CCF**)


Data associated with the Allen Brain Cell Atlas is hosted on Amazon Web Services (AWS) in an S3 bucket as an AWS Public Dataset.
No account or login is required. The S3 bucket is located here [arn:aws:s3:::allen-brain-cell-atlas](https://allen-brain-cell-atlas.s3.us-west-2.amazonaws.com/index.html). You will need to be connected to the internet to run this notebook.

Each release has an associated **manifest.json** which lists the specific versions of all directories and files that are part of the release. We recommend using the manifest as the starting point for data download and usage.

Expression matrices are stored in the [anndata h5ad format](https://anndata.readthedocs.io/en/latest/) and need to be downloaded to a local file system for usage.

The **AWS Command Line Interface ([AWS CLI](https://aws.amazon.com/cli/))** is a simple option to download specific directories and files from S3. Download and installation instructions can be found here: https://aws.amazon.com/cli/.
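
Before running any of the download commands, it can help to confirm that the `aws` executable is available on your PATH. The following is a minimal sketch using only the Python standard library; the function name `find_aws_cli` is illustrative and not part of this repository.

```python
import shutil


def find_aws_cli(executable="aws"):
    """Return the full path of the AWS CLI executable, or None if it is not on PATH."""
    return shutil.which(executable)


aws_path = find_aws_cli()
if aws_path is None:
    print("AWS CLI not found; install it from https://aws.amazon.com/cli/")
else:
    print("AWS CLI found at:", aws_path)
```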


This notebook shows how to format AWS CLI commands to download the data required for the tutorials. You can copy those commands into a terminal shell, or optionally run them directly in this notebook by uncommenting the "subprocess.run" lines in the code.


In [1]:
import requests
import json
import os
import pathlib
import subprocess
import time

## Using the file manifest

Let's open the manifest.json file associated with the current release.

In [2]:
url = 'https://allen-brain-cell-atlas.s3-us-west-2.amazonaws.com/releases/20230630/manifest.json'
manifest = json.loads(requests.get(url).text)
print("version: ", manifest['version'])

version:  20230630


At the top level, the manifest consists of the release *version* tag, the S3 *resource_uri*, and the dictionaries *directory_listing* and *file_listing*. A simple option to download data is to use the AWS CLI to download specific directories or files. All the example notebooks in this repository assume that data has been downloaded locally in the same file organization as specified by the "relative_path" field in the manifest.

In [3]:
manifest.keys()
print("version:",manifest['version'])
print("resource_uri:",manifest['resource_uri'])

version: 20230630
resource_uri: s3://allen-brain-cell-atlas/


Let's look at the information associated with the spatial transcriptomics dataset **MERFISH-C57BL6J-638850**. This dataset has two related directories: *expression_matrices*, containing a set of h5ad files, and *metadata*, containing a set of csv files. Use the *view_link* url to browse the directories in a web browser.

In [4]:
expression_matrices = manifest['directory_listing']['MERFISH-C57BL6J-638850']['directories']['expression_matrices']
print(expression_matrices)
print(expression_matrices['view_link'])

{'version': '20230630', 'relative_path': 'expression_matrices/MERFISH-C57BL6J-638850/20230630', 'url': 'https://allen-brain-cell-atlas.s3.us-west-2.amazonaws.com/expression_matrices/MERFISH-C57BL6J-638850/20230630/', 'view_link': 'https://allen-brain-cell-atlas.s3.us-west-2.amazonaws.com/index.html#expression_matrices/MERFISH-C57BL6J-638850/20230630/', 'total_size': 15245342518}
https://allen-brain-cell-atlas.s3.us-west-2.amazonaws.com/index.html#expression_matrices/MERFISH-C57BL6J-638850/20230630/


In [5]:
metadata = manifest['directory_listing']['MERFISH-C57BL6J-638850']['directories']['metadata']
print(metadata)
print(metadata['view_link'])

{'version': '20230630', 'relative_path': 'metadata/MERFISH-C57BL6J-638850/20230630', 'url': 'https://allen-brain-cell-atlas.s3.us-west-2.amazonaws.com/metadata/MERFISH-C57BL6J-638850/20230630/', 'view_link': 'https://allen-brain-cell-atlas.s3.us-west-2.amazonaws.com/index.html#metadata/MERFISH-C57BL6J-638850/20230630/', 'total_size': 2262270397}
https://allen-brain-cell-atlas.s3.us-west-2.amazonaws.com/index.html#metadata/MERFISH-C57BL6J-638850/20230630/
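
The two lookups above follow the same pattern: dataset name, then *directories*, then the directory kind. As a sketch, a small generator can walk every such entry in one pass. The helper name `iter_directories` is hypothetical, and `sample_manifest` is a cut-down stand-in that only mirrors the fields used in this notebook (values copied from the output above).

```python
def iter_directories(manifest):
    """Yield (dataset, kind, info) for every directory entry in the manifest."""
    for dataset, entry in manifest["directory_listing"].items():
        for kind, info in entry["directories"].items():
            yield dataset, kind, info


# Stand-in with the same shape as the real manifest (illustration only).
sample_manifest = {
    "version": "20230630",
    "resource_uri": "s3://allen-brain-cell-atlas/",
    "directory_listing": {
        "MERFISH-C57BL6J-638850": {
            "directories": {
                "expression_matrices": {
                    "relative_path": "expression_matrices/MERFISH-C57BL6J-638850/20230630",
                    "total_size": 15245342518,
                },
                "metadata": {
                    "relative_path": "metadata/MERFISH-C57BL6J-638850/20230630",
                    "total_size": 2262270397,
                },
            }
        }
    },
}

for dataset, kind, info in iter_directories(sample_manifest):
    print(dataset, kind, info["relative_path"])
```

With the real manifest loaded earlier, `iter_directories(manifest)` should visit the same entries as the nested loops used later in this notebook.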


Directory sizes are also reported as part of the manifest.json. WARNING: the expression matrices directories can be very large (> 100 GB).

In [6]:
GB = 1024.0 ** 3  # bytes per gigabyte (binary)

for r in manifest['directory_listing'] :    
    r_dict =  manifest['directory_listing'][r]
    for d in r_dict['directories'] :
        d_dict = r_dict['directories'][d]        
        print(d_dict['relative_path'],":",'%0.2f GB' % (d_dict['total_size']/GB))
        

expression_matrices/MERFISH-C57BL6J-638850/20230630 : 14.20 GB
metadata/MERFISH-C57BL6J-638850/20230630 : 2.11 GB
expression_matrices/MERFISH-C57BL6J-638850-sections/20230630 : 14.28 GB
expression_matrices/WMB-10Xv2/20230630 : 104.05 GB
expression_matrices/WMB-10Xv3/20230630 : 176.31 GB
metadata/WMB-10X/20230630 : 2.34 GB
metadata/WMB-taxonomy/20230630 : 0.01 GB
metadata/WMB-neighborhoods/20230630 : 3.90 GB
image_volumes/Allen-CCF-2020/20230630 : 0.37 GB
metadata/Allen-CCF-2020/20230630 : 0.00 GB
image_volumes/MERFISH-C57BL6J-638850-CCF/20230630 : 0.11 GB
metadata/MERFISH-C57BL6J-638850-CCF/20230630 : 2.13 GB
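
Given these sizes, it may be worth checking free disk space before starting a large download. The sketch below uses `shutil.disk_usage` from the standard library; the helper name `has_room_for` and the 10% safety margin are assumptions made for illustration.

```python
import shutil

GB = 1024 ** 3


def has_room_for(total_size, path=".", margin=0.10):
    """Return True if the disk holding `path` has total_size plus a safety margin free."""
    free = shutil.disk_usage(path).free
    return free >= total_size * (1 + margin)


# total_size of expression_matrices/MERFISH-C57BL6J-638850, as reported in the manifest
size = 15245342518
print("%0.2f GB needed, enough space: %s" % (size / GB, has_room_for(size)))
```

Pass the intended download location as `path` so that the check applies to the correct disk.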


## Downloading files for the tutorial notebooks

Suppose you would like to download data to your local path *../../abc_download_root*.

In [7]:
download_base = '../../abc_download_root'

### Downloading all metadata directories

Since the metadata directories are relatively small, we will download all of them. We loop through the manifest and download each metadata directory using the **[AWS CLI](https://aws.amazon.com/cli/)** sync command. This should take < 5 minutes.

In [8]:
for r in manifest['directory_listing'] :
    
    r_dict =  manifest['directory_listing'][r]
    
    for d in r_dict['directories'] :
        
        if d != 'metadata' :
            continue
        d_dict = r_dict['directories'][d]
        local_path = os.path.join( download_base, d_dict['relative_path'])
        local_path = pathlib.Path( local_path )
        remote_path = manifest['resource_uri'] + d_dict['relative_path']
        
        command = "aws s3 sync --no-sign-request %s %s" % (remote_path, local_path)
        print(command)
        
        # perf_counter measures elapsed wall-clock time, including time spent waiting on the download
        start = time.perf_counter()
        # Uncomment to download directories
        #result = subprocess.run(command.split(),stdout=subprocess.PIPE)
        #print("time taken: ", time.perf_counter() - start)
  

aws s3 sync --no-sign-request s3://allen-brain-cell-atlas/metadata/MERFISH-C57BL6J-638850/20230630 ..\..\abc_download_root\metadata\MERFISH-C57BL6J-638850\20230630
aws s3 sync --no-sign-request s3://allen-brain-cell-atlas/metadata/WMB-10X/20230630 ..\..\abc_download_root\metadata\WMB-10X\20230630
aws s3 sync --no-sign-request s3://allen-brain-cell-atlas/metadata/WMB-taxonomy/20230630 ..\..\abc_download_root\metadata\WMB-taxonomy\20230630
aws s3 sync --no-sign-request s3://allen-brain-cell-atlas/metadata/WMB-neighborhoods/20230630 ..\..\abc_download_root\metadata\WMB-neighborhoods\20230630
aws s3 sync --no-sign-request s3://allen-brain-cell-atlas/metadata/Allen-CCF-2020/20230630 ..\..\abc_download_root\metadata\Allen-CCF-2020\20230630
aws s3 sync --no-sign-request s3://allen-brain-cell-atlas/metadata/MERFISH-C57BL6J-638850-CCF/20230630 ..\..\abc_download_root\metadata\MERFISH-C57BL6J-638850-CCF\20230630
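
Splitting the command string on whitespace works for the paths shown here, but it would break if the download directory contained a space. One alternative, sketched below, is to build the command as a list of arguments from the start. The helper name `build_sync_args` and the example download path are hypothetical.

```python
import pathlib
import subprocess


def build_sync_args(resource_uri, relative_path, download_base):
    """Return the `aws s3 sync` command as a list of arguments."""
    local_path = pathlib.Path(download_base) / relative_path
    remote_path = resource_uri + relative_path
    return ["aws", "s3", "sync", "--no-sign-request", remote_path, str(local_path)]


args = build_sync_args(
    "s3://allen-brain-cell-atlas/",
    "metadata/WMB-taxonomy/20230630",
    "../../abc download root",  # hypothetical path containing spaces
)
print(args)

# Uncomment to run the download; check=True raises CalledProcessError on failure.
# subprocess.run(args, check=True)
```

Because each argument is its own list element, the local path reaches the AWS CLI intact regardless of spaces.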


### Downloading one 10x expression matrix
To run the 10x part 1 notebook, you first need to download the log2 version of the "WMB-10Xv2-TH" matrix (4 GB). The download takes about 1 minute, depending on your network speed.

We define a simple helper function to create the required AWS command. You can copy the command into a terminal shell to run it, or optionally run it inside this notebook by uncommenting the "subprocess.run" line of code.

In [9]:
def download_file( file_dict ) :
    
    print(file_dict['relative_path'],file_dict['size'])
    local_path = os.path.join( download_base, file_dict['relative_path'] )
    local_path = pathlib.Path( local_path )
    remote_path = manifest['resource_uri'] + file_dict['relative_path']

    command = "aws s3 cp --no-sign-request %s %s" % (remote_path, local_path)
    print(command)

    # perf_counter measures elapsed wall-clock time, including time spent waiting on the download
    start = time.perf_counter()
    # Uncomment to download file
    #result = subprocess.run(command.split(' '),stdout=subprocess.PIPE)
    #print("time taken: ", time.perf_counter() - start)

In [10]:
expression_matrices = manifest['file_listing']['WMB-10Xv2']['expression_matrices']
file_dict = expression_matrices['WMB-10Xv2-TH']['log2']['files']['h5ad']
print('size:',file_dict['size'])
download_file( file_dict )

size: 4028273658
expression_matrices/WMB-10Xv2/20230630/WMB-10Xv2-TH-log2.h5ad 4028273658
aws s3 cp --no-sign-request s3://allen-brain-cell-atlas/expression_matrices/WMB-10Xv2/20230630/WMB-10Xv2-TH-log2.h5ad ..\..\abc_download_root\expression_matrices\WMB-10Xv2\20230630\WMB-10Xv2-TH-log2.h5ad
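
After a download finishes, a quick sanity check is to compare the size of the local file with the *size* field reported in the manifest. The sketch below demonstrates the idea on a small temporary file; the helper name `is_download_complete` is hypothetical.

```python
import pathlib
import tempfile


def is_download_complete(local_path, expected_size):
    """Return True if local_path exists and its size in bytes equals expected_size."""
    path = pathlib.Path(local_path)
    return path.is_file() and path.stat().st_size == expected_size


# Demonstration on a temporary 5-byte file standing in for a downloaded h5ad file.
with tempfile.TemporaryDirectory() as tmp:
    demo = pathlib.Path(tmp) / "demo.h5ad"
    demo.write_bytes(b"12345")
    ok_match = is_download_complete(demo, 5)       # size matches
    ok_mismatch = is_download_complete(demo, 6)    # size differs
    ok_missing = is_download_complete(pathlib.Path(tmp) / "missing.h5ad", 5)

print(ok_match, ok_mismatch, ok_missing)
```

For a real file, the call would be `is_download_complete(pathlib.Path(download_base) / file_dict['relative_path'], file_dict['size'])`. A matching size only suggests that the transfer completed; it does not verify the file contents.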


### Downloading the MERFISH expression matrix

To run the MERFISH part 1 notebook, you first need to download the log2 version of the "C57BL6J-638850" matrix (7 GB). The download takes about 3 minutes, depending on your network speed.

In [11]:
expression_matrices = manifest['file_listing']['MERFISH-C57BL6J-638850']['expression_matrices']
file_dict = expression_matrices['C57BL6J-638850']['log2']['files']['h5ad']
print('size:',file_dict['size'])
download_file( file_dict )

size: 7622671259
expression_matrices/MERFISH-C57BL6J-638850/20230630/C57BL6J-638850-log2.h5ad 7622671259
aws s3 cp --no-sign-request s3://allen-brain-cell-atlas/expression_matrices/MERFISH-C57BL6J-638850/20230630/C57BL6J-638850-log2.h5ad ..\..\abc_download_root\expression_matrices\MERFISH-C57BL6J-638850\20230630\C57BL6J-638850-log2.h5ad


### Downloading all image volumes

To run the CCF and MERFISH to CCF registration notebooks, you first need to download the two sets of image volumes.

In [12]:
for r in manifest['directory_listing'] :
    
    r_dict =  manifest['directory_listing'][r]
    
    for d in r_dict['directories'] :
        
        if d != 'image_volumes' :
            continue
        d_dict = r_dict['directories'][d]
        local_path = os.path.join( download_base, d_dict['relative_path'])
        local_path = pathlib.Path( local_path )
        remote_path = manifest['resource_uri'] + d_dict['relative_path']
        
        command = "aws s3 sync --no-sign-request %s %s" % (remote_path, local_path)
        print(command)
        
        # perf_counter measures elapsed wall-clock time, including time spent waiting on the download
        start = time.perf_counter()
        # Uncomment to download directories
        #result = subprocess.run(command.split(),stdout=subprocess.PIPE)
        #print("time taken: ", time.perf_counter() - start)
  

aws s3 sync --no-sign-request s3://allen-brain-cell-atlas/image_volumes/Allen-CCF-2020/20230630 ..\..\abc_download_root\image_volumes\Allen-CCF-2020\20230630
aws s3 sync --no-sign-request s3://allen-brain-cell-atlas/image_volumes/MERFISH-C57BL6J-638850-CCF/20230630 ..\..\abc_download_root\image_volumes\MERFISH-C57BL6J-638850-CCF\20230630
