# Exploring scREF and scREF-mu with SomaData

We have open-sourced the [scREF](https://cloud.tiledb.com/soma/Phenomic/e3699068-6df6-419f-8b1a-d0021026159b/overview) and [scREF-mu](https://cloud.tiledb.com/soma/Phenomic/c4e59fe3-0013-4c0c-ae1d-450b953698ff/overview) corpora on [TileDB Cloud](https://cloud.tiledb.com/). If you would rather not run this notebook on your own machine, use our [TileDB-hosted notebook](https://cloud.tiledb.com/notebooks/details/Phenomic/b06350d9-f829-4eb2-9b77-5c2d20d6932d/overview) instead.

The notebook below uses `SomaData`, a simple wrapper from the [pai-soma-data](https://pypi.org/project/pai-soma-data/) package, to read from Phenomic's corpora. `SomaData` caches the `obs` and `var` tables in memory and lets you explore the atlas pandas-style.

You can also read from the corpora directly with plain TileDB-SOMA syntax, without `SomaData`. To learn more, see the [TileDB-SOMA experiment query tutorial](https://tiledbsoma.readthedocs.io/en/latest/notebooks/tutorial_exp_query.html).
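As a rough illustration of the plain TileDB-SOMA route, the sketch below builds a `value_filter` string and reads matching `obs` rows from an experiment. The helper names `build_value_filter` and `read_obs` are hypothetical (they are not part of `tiledbsoma` or `pai-soma-data`), and `read_obs` assumes you pass in a `SOMATileDBContext` configured with your token, as set up later in this notebook.

```python
def build_value_filter(column, values):
    """Build a SOMA value_filter string such as "col in ['a', 'b']".

    Hypothetical helper for this sketch; values are assumed to be plain strings.
    """
    quoted = ", ".join(repr(v) for v in values)
    return f"{column} in [{quoted}]"


def read_obs(uri, context, value_filter):
    """Read the obs rows matching value_filter into a pandas DataFrame.

    Hypothetical helper; tiledbsoma is imported lazily so that the
    filter builder above can be used without it installed.
    """
    import tiledbsoma as soma

    with soma.Experiment.open(uri, context=context) as exp:
        return exp.obs.read(value_filter=value_filter).concat().to_pandas()


# Dataset name taken from the outputs shown later in this notebook.
flt = build_value_filter(
    "dataset_name", ["external_adams_sciadv_2020_32832599__normal.lung"]
)
print(flt)
# With a configured context:
# obs_df = read_obs("tiledb://Phenomic/e3699068-6df6-419f-8b1a-d0021026159b", context, flt)
```

Reading `obs` this way skips the in-memory caching that `SomaData` does, which may be preferable when you only need a small slice of the metadata.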

Follow these steps to get started:

### Prerequisites

#### Step 1: Obtain Access to TileDB

Sign up for a [TileDB Cloud](https://cloud.tiledb.com/) account. To access TileDB Cloud-hosted objects from outside TileDB Cloud, you need a TileDB REST API token (see the [instructions for creating a TileDB API token](https://docs.tiledb.com/cloud/how-to/account/create-api-tokens)). Once you have a token, you can run the cells below.

#### Step 2: Install Dependencies


In [None]:
!pip install pai-soma-data # hosted at https://pypi.org/project/pai-soma-data/

#### Import Packages


In [7]:
!python --version

Python 3.10.10


In [1]:
import tiledb
import tiledbsoma as soma

from pai_soma_data import SomaData

#### Configure External Laptop/Server to Access TileDB Objects


In [2]:
# Configure your external laptop/server to access objects on TileDB with an API access token
token = ""  # add your token here
ctx = tiledb.Ctx({"rest.token": token})
context = soma.SOMATileDBContext(tiledb_config=ctx.config())
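To avoid pasting the token into the notebook itself, one option is to read it from an environment variable. The sketch below is a minimal illustration, not part of `pai-soma-data`: the helper `rest_config_from_env` is hypothetical, and the variable name `TILEDB_REST_TOKEN` is simply the name chosen for this example.

```python
import os


def rest_config_from_env(env=os.environ, var_name="TILEDB_REST_TOKEN"):
    """Return a TileDB config dict holding the REST token read from env.

    Hypothetical helper for this sketch; var_name is an assumed
    variable name, so use whatever name you export in your shell.
    """
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return {"rest.token": token}


# Usage with the cell above (requires tiledb and tiledbsoma):
# ctx = tiledb.Ctx(rest_config_from_env())
# context = soma.SOMATileDBContext(tiledb_config=ctx.config())
```

Failing early with a clear message is usually easier to debug than an authentication error raised later when the corpus is first opened.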

#### Instantiate SomaData


In [3]:
sdata = SomaData(
    corpus_uri="tiledb://Phenomic/e3699068-6df6-419f-8b1a-d0021026159b",
    ctx=context,
    layer="norm",  # Use layer="raw" instead if you want raw counts
)

Caching obs...
Caching var...


### Accessing Data


#### Obs/Var

The cell metadata is stored in the `.obs` attribute and the gene metadata in the `.var` attribute of the `sdata` object instantiated above. You can navigate both as pandas DataFrames.


In [4]:
# Obs
sdata.obs.head()

Unnamed: 0,soma_joinid,sample_idx,barcode,standard_true_celltype,authors_celltype,batch_name,cells_or_nuclei,dataset_name,sample_name,scrnaseq_protocol,study_name,tissue_collected,tissue_site,nnz,dataset_idx,included_scref_train,standardized_labels
0,0,0,001C_AAACCTGCATCGGGTC,Monocytes,ncMonocyte,normal.lung,cells,external_adams_sciadv_2020_32832599__normal.lung,001C,10x_3prime,external_adams_sciadv_2020_32832599,Lung,Primary,2147,0,False,
1,1,0,001C_AAACCTGTCAACACCA,Macrophages_and_other_myeloid,Macrophage_Alveolar,normal.lung,cells,external_adams_sciadv_2020_32832599__normal.lung,001C,10x_3prime,external_adams_sciadv_2020_32832599,Lung,Primary,4724,0,False,
2,2,0,001C_AAACCTGTCACAGTAC,NK_cells,NK,normal.lung,cells,external_adams_sciadv_2020_32832599__normal.lung,001C,10x_3prime,external_adams_sciadv_2020_32832599,Lung,Primary,880,0,False,
3,3,0,001C_AAACCTGTCTGTCTAT,Monocytes,cMonocyte,normal.lung,cells,external_adams_sciadv_2020_32832599__normal.lung,001C,10x_3prime,external_adams_sciadv_2020_32832599,Lung,Primary,1942,0,True,Monocytes
4,4,0,001C_AAACGGGAGACTAAGT,Endothelial,Lymphatic,normal.lung,cells,external_adams_sciadv_2020_32832599__normal.lung,001C,10x_3prime,external_adams_sciadv_2020_32832599,Lung,Primary,1714,0,False,
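Because `obs` is an ordinary DataFrame, the usual pandas summaries apply. The sketch below rebuilds a toy `obs` from three columns of the five rows shown above (it is a stand-in for illustration, not the real `sdata.obs`) and counts cells per cell type and per training flag.

```python
import pandas as pd

# Toy stand-in for sdata.obs, copied from the five rows displayed above.
obs = pd.DataFrame(
    {
        "standard_true_celltype": [
            "Monocytes",
            "Macrophages_and_other_myeloid",
            "NK_cells",
            "Monocytes",
            "Endothelial",
        ],
        "tissue_collected": ["Lung"] * 5,
        "included_scref_train": [False, False, False, True, False],
    }
)

# Number of cells per standardized cell type.
celltype_counts = obs["standard_true_celltype"].value_counts()
print(celltype_counts)

# Number of cells flagged in the included_scref_train column.
n_train = int(obs["included_scref_train"].sum())
print("cells with included_scref_train=True:", n_train)
```

On the real object, the same calls would be made on `sdata.obs` directly.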


In [5]:
# Var
sdata.var.head()

Unnamed: 0,soma_joinid,gene
0,0,3.8-1.2
1,1,3.8-1.3
2,2,3.8-1.4
3,3,3.8-1.5
4,4,3110002H16RIK
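Before slicing by gene, it can help to check which requested genes actually appear in `var`. The sketch below uses a toy `var` built from the five rows shown above; the helper `split_genes` is hypothetical and only illustrates the lookup.

```python
import pandas as pd

# Toy stand-in for sdata.var, copied from the five rows displayed above.
var = pd.DataFrame(
    {"gene": ["3.8-1.2", "3.8-1.3", "3.8-1.4", "3.8-1.5", "3110002H16RIK"]}
)


def split_genes(var_df, requested):
    """Split requested gene names into (found, missing) against var_df['gene'].

    Hypothetical helper for this sketch; matching is exact and case-sensitive.
    """
    known = set(var_df["gene"])
    found = [g for g in requested if g in known]
    missing = [g for g in requested if g not in known]
    return found, missing


found, missing = split_genes(var, ["3110002H16RIK", "CD4"])
print("found:", found)
print("missing:", missing)  # CD4 is missing only because this toy var has five rows
```

With the real corpus, pass `sdata.var` in place of the toy table.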


#### Slicing X Data

Slicing follows pandas-style indexing, in the form `sdata[row_filter, col_filter]`. By default, SomaData fetches normalized data; to fetch raw data instead, instantiate a new `SomaData` object with `layer="raw"`.

- For the row filter, pass a `pd.Series` (for example, a boolean mask over `obs`), a list of integers, or a slice

- For the column filter, pass a gene name, a list of gene names, or `:` for all genes

Two examples have been included below.


In [6]:
# Example 1
three_datasets_scref = sdata.obs["dataset_name"].unique().tolist()[:3]
print("first datasets in scref:", three_datasets_scref)
adata = sdata[sdata.obs["dataset_name"].isin(three_datasets_scref), ["CD40", "CD4"]]
print(adata)

first datasets in scref: ['external_adams_sciadv_2020_32832599__normal.lung', 'external_aida_cellxgene_2023_1__normal.blood', 'external_andrews_hepatolcommun_2022_34792289__normal.liver']
AnnData object with n_obs × n_vars = 1184218 × 2
    obs: 'soma_joinid', 'sample_idx', 'barcode', 'standard_true_celltype', 'authors_celltype', 'batch_name', 'cells_or_nuclei', 'dataset_name', 'sample_name', 'scrnaseq_protocol', 'study_name', 'tissue_collected', 'tissue_site', 'nnz', 'dataset_idx', 'included_scref_train', 'standardized_labels'
    var: 'gene'
    obsm: 'umap', 'embeddings'


In [None]:
# Example 2
print("first three rows and all genes in scref:")
adata = sdata[[0, 1, 2], :]
print(adata)
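The row filters described in this section can be built with plain pandas before they are handed to `SomaData`. The sketch below constructs a boolean mask, the equivalent list of integer positions, and a slice on a toy `obs` table. The values are made up for illustration, and the final `sdata[...]` call is left as a comment because it needs a live connection.

```python
import pandas as pd

# Toy stand-in for sdata.obs; values are illustrative only.
obs = pd.DataFrame(
    {
        "tissue_collected": ["Lung", "Lung", "Blood", "Liver"],
        "nnz": [2147, 4724, 880, 1942],
    }
)

# 1) Boolean pd.Series: lung cells with more than 1000 non-zero entries.
mask = (obs["tissue_collected"] == "Lung") & (obs["nnz"] > 1000)

# 2) The same selection as a list of integer row positions.
row_ids = obs.index[mask].tolist()

# 3) A slice covering the first two rows.
first_two = slice(0, 2)

genes = ["CD40", "CD4"]
print(row_ids)

# With a live SomaData object, any of these could serve as the row filter:
# adata = sdata[mask, genes]
# adata = sdata[row_ids, genes]
```

Combining conditions with `&` and `|` on `obs` columns keeps the filtering logic in pandas, so it can be checked on the cached metadata before any expression data is fetched.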