![igvflogo](images/igvf-winter-logo.png)

# Why use TileDB?
With [TileDB](https://tiledb.io/) you gain the ability to quickly query array-structured data using rectangular slices, update existing arrays with new or changed data, and easily optimize your physical data organization for maximizing compression and read performance.

# What is anndata?
anndata is a Python package for handling annotated data matrices in memory and on disk. Its AnnData format is widely used for single-cell genomics data. For the purposes of this tutorial we will be using an experiment from the [IGVF Project](https://data.igvf.org/matrix-files/IGVFFI0475WSGO/). For more information about anndata, see the [anndata documentation](https://anndata.readthedocs.io/en/stable/).

# Installation and configuration
We will use the `tiledb` and `tiledbsoma` Python packages.

In [1]:
!pip install -r requirements.txt


In [2]:

import anndata as ad
import tiledb
import tiledbsoma
import tiledbsoma.io
from tiledbsoma import SOMAError

import requests
import boto3
import io
from urllib.parse import urlparse

tiledbsoma.show_package_versions()
cfg = tiledb.Config({"vfs.s3.no_sign_request": True})
vfs = tiledb.VFS(config=cfg)

tiledbsoma.__version__              1.17.1
TileDB core version (libtiledbsoma) 2.28.1
python version                      3.13.5.final.0
OS version                          Darwin 24.6.0


### *The matrix `IGVFFI8645YMGX.h5ad` is from a Perturb-seq dataset in TeloHAEC cells: [IGVFDS7943IIWZ](https://data.igvf.org/analysis-sets/IGVFDS7943IIWZ/)*
We will start by fetching the file metadata and reading its `s3_uri`.

In [3]:
file_metadata = requests.get("https://api.data.igvf.org/matrix-files/IGVFFI8645YMGX/").json()
uri = file_metadata['s3_uri']
parsed_uri = urlparse(uri)
bucket_name = parsed_uri.netloc
object_key = parsed_uri.path.lstrip("/")
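The bucket and key split above can be wrapped in a small helper that also rejects non-S3 URIs. This is a sketch only: `split_s3_uri` is a hypothetical helper name and the example URI is made up for illustration, not the real location of the IGVF file.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3://bucket/key URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    # netloc is the bucket name; the path (minus its leading slash) is the object key
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical URI used only to illustrate the split
example_uri = "s3://example-bucket/2025/01/01/example-file.h5ad"
bucket, key = split_s3_uri(example_uri)
print(bucket)
print(key)
```

Failing early on an unexpected scheme gives a clearer error than a later S3 request made with an empty bucket name.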


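# Open h5ad with tiledb vfs and anndata
From this point on you will need to be authenticated with AWS. The cell below downloads the file into memory with boto3 and reads it with anndata; the `vfs` handle created earlier is not used in this cell.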

In [4]:
# Create a boto3 session from the 'igvf-prod' AWS profile
session = boto3.session.Session(profile_name='igvf-prod')
# Initialize S3 client
s3_client = session.client('s3')

# Future option once the bucket is open: unsigned requests (also needs `import botocore`)
#session = boto3.Session()
#s3_client = session.client('s3', config=botocore.client.Config(signature_version=botocore.UNSIGNED))

# Get the object from S3 - requires creds
response = s3_client.get_object(Bucket=bucket_name, Key=object_key)

# Read the content of the object into a BytesIO stream
data_stream = io.BytesIO(response['Body'].read())

adata = ad.read_h5ad(data_stream)

# Explore anndata object
AnnData is a rich container, and we won't go into detail here. Below we look at a few basic properties of the object.

In [5]:
adata

AnnData object with n_obs × n_vars = 214449 × 17472
    obs: 'barcodes', 'n_genes', 'n_counts'
    var: 'n_cells'

In [6]:
adata.obs.head()

| | barcodes | n_genes | n_counts |
|---|---|---|---|
| ('ANKEF1:GAAGGGACATCATTCACGCCT:AAACCCAAGAAGTCAT-scRNAseq_2kG_11AMDox_1',) | ANKEF1:GAAGGGACATCATTCACGCCT:AAACCCAAGAAGTCAT-... | 3609 | 12474.0 |
| ('MTRR:GTGGTCCTGGGTACCGAGCAT:AAACCCAAGAGGACTC-scRNAseq_2kG_11AMDox_1',) | MTRR:GTGGTCCTGGGTACCGAGCAT:AAACCCAAGAGGACTC-sc... | 1904 | 5743.0 |
| ('JAG1:GATGCGCCCTGCCCGGCGTGC:AAACCCACAATCGCAT-scRNAseq_2kG_11AMDox_1',) | JAG1:GATGCGCCCTGCCCGGCGTGC:AAACCCACAATCGCAT-sc... | 4473 | 21423.0 |
| ('GOLPH3L:GGAAGTTTGTGCTCTCTGCG:AAACCCACACCAGCGT-scRNAseq_2kG_11AMDox_1',) | GOLPH3L:GGAAGTTTGTGCTCTCTGCG:AAACCCACACCAGCGT-... | 3306 | 13837.0 |
| ('ARHGEF15-TSS2:GACCTACTGCAGAGTTAGGG:AAACCCACATTAAGCC-scRNAseq_2kG_11AMDox_1',) | ARHGEF15-TSS2:GACCTACTGCAGAGTTAGGG:AAACCCACATT... | 3570 | 13278.0 |


In [7]:
adata.var.head()

| | n_cells |
|---|---|
| FAM87B:ENSG00000177757 | 115 |
| FAM41C:ENSG00000230368 | 2274 |
| SAMD11:ENSG00000187634 | 491 |
| NOC2L:ENSG00000188976 | 66487 |
| KLHL17:ENSG00000187961 | 6897 |


# Ingest anndata into SOMA experiment
A SOMA experiment can be created on the local filesystem (demonstrated here), in an S3 bucket, or in TileDB Cloud (which requires a [TileDB Cloud](https://cloud.tiledb.com) account).

In [None]:
EXPERIMENT_URI = 'my-single-cell-soma-experiment'  # This URI can also use the s3:// or tiledb:// scheme
try:
    # Write the AnnData object into a new SOMA experiment with a single "RNA" measurement
    tiledbsoma.io.from_anndata(experiment_uri=EXPERIMENT_URI, measurement_name="RNA", anndata=adata)
    # Reopen the experiment and print the var domain and the shape of the X matrix
    with tiledbsoma.open(EXPERIMENT_URI) as exp:
        print(exp.ms["RNA"].var.domain)
        print(exp.ms["RNA"].X["data"].shape)
except SOMAError:
    print(f'Experiment {EXPERIMENT_URI} already exists. Delete (or deregister if using TileDB Cloud) the experiment before continuing.')