![igvflogo](images/igvf-winter-logo.png)

# Why use TileDB?
With [TileDB](https://tiledb.io/) you can quickly query array-structured data using rectangular slices, update existing arrays with new or changed data, and tune the physical data layout to maximize compression and read performance.

# What is anndata?
anndata is a Python package for handling annotated data matrices in memory and on disk. Its format is widely used for single-cell genomics data. For the purposes of this tutorial we will be using an experiment from the [IGVF Project](https://data.igvf.org/matrix-files/IGVFFI0475WSGO/). For more information about anndata, see the [anndata documentation](https://anndata.readthedocs.io/en/stable/).

# Installation and configuration
We will use the `tiledb` and `tiledbsoma` Python packages.

In [None]:
!pip install -r requirements.txt


In [None]:

import anndata as ad
import tiledb
import tiledbsoma
import tiledbsoma.io
from tiledbsoma import SOMAError

import requests
import boto3
import io
from urllib.parse import urlparse

tiledbsoma.show_package_versions()
cfg = tiledb.Config({"vfs.s3.no_sign_request": True})
vfs = tiledb.VFS(config=cfg)

### *The matrix `IGVFFI8645YMGX.h5ad` is from a Perturb-seq dataset in TeloHAEC cells: [IGVFDS7943IIWZ](https://data.igvf.org/analysis-sets/IGVFDS7943IIWZ/)*
We will start by fetching the file metadata and reading the file's `s3_uri` from it.

In [None]:
file_metadata = requests.get("https://api.data.igvf.org/matrix-files/IGVFFI8645YMGX/").json()
uri = file_metadata['s3_uri']
parsed_uri = urlparse(uri)
bucket_name = parsed_uri.netloc
object_key = parsed_uri.path.lstrip("/")
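To illustrate the parsing step above, here is a minimal sketch of how `urlparse` splits an S3 URI into a bucket (`netloc`) and an object key (`path` without its leading slash). The URI below is a made-up placeholder, not the value returned by the IGVF API.

```python
from urllib.parse import urlparse

# Hypothetical S3 URI, used only for illustration
example_uri = "s3://example-bucket/2024/01/example-file.h5ad"

parsed = urlparse(example_uri)
example_bucket = parsed.netloc         # the part between "s3://" and the next "/"
example_key = parsed.path.lstrip("/")  # everything after the bucket, minus the leading "/"

print(example_bucket)
print(example_key)
```

The bucket and key are the two values that `get_object` expects as `Bucket` and `Key`.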


# Open h5ad with tiledb vfs and anndata
The cell below downloads the file into memory with `boto3`, so from this point on you will need to be authenticated with AWS.

In [None]:
# Initialize S3 client
s3_client = boto3.client('s3')

# Get the object from S3 - requires creds
response = s3_client.get_object(Bucket=bucket_name, Key=object_key)

# Read the content of the object into a BytesIO stream
data_stream = io.BytesIO(response['Body'].read())

adata = ad.read_h5ad(data_stream)

# Explore anndata object
An AnnData object is a rich container, and we won't go into detail here. Below we look at a few basic properties of the object: its summary, the cell annotations (`obs`), and the gene annotations (`var`).

In [None]:
adata

In [None]:
adata.obs.head()

In [None]:
adata.var.head()

# Ingest anndata into SOMA experiment
A SOMA experiment can be created on the local filesystem (demonstrated here), in an S3 bucket, or in TileDB Cloud (which requires a [TileDB Cloud](https://cloud.tiledb.com) account).

In [None]:
EXPERIMENT_URI = "my-single-cell-soma-experiment"  # This URI can also be of the form s3://... or tiledb://...
try:
    # Write the in-memory AnnData object to a new SOMA experiment
    tiledbsoma.io.from_anndata(experiment_uri=EXPERIMENT_URI, measurement_name="RNA", anndata=adata)
    # Reopen the experiment and report the var domain and the shape of X
    with tiledbsoma.open(EXPERIMENT_URI) as exp:
        print(exp.ms["RNA"].var.domain)
        print(exp.ms["RNA"].X["data"].shape)
except SOMAError:
    print(f"Experiment {EXPERIMENT_URI} already exists. Delete (or deregister if using TileDB Cloud) the experiment before continuing.")