# Reading from SOMA objects

Copied from [TileDB-SOMA docs](https://single-cell-data.github.io/TileDB-SOMA/articles/soma-reading.html).

## Overview

In this tutorial we'll learn how to read data from various SOMA objects. We assume familiarity with SOMA objects, so we recommend working through `vignette("soma-objects")` first.

A core feature of SOMA is the ability to read _subsets_ of data from disk into memory as slices. SOMA uses [Apache Arrow](https://arrow.apache.org/) as its intermediate in-memory format. From there, slices can be converted into native R objects such as data frames and matrices.

In [1]:
library(tiledbsoma)

## Example data

Load the bundled `SOMAExperiment`, which contains a subsetted version of the 10x Genomics [PBMC dataset](https://satijalab.github.io/seurat-object/reference/pbmc_small.html) provided by SeuratObject. The dataset is small enough to fit easily into memory, but we'll focus on operations that scale to larger datasets as well.

In [2]:
experiment <- load_dataset("soma-exp-pbmc-small")

## SOMA DataFrame

We'll start with the `obs` dataframe. Calling `read()$concat()` loads all of the data into memory as an [Arrow Table](https://arrow.apache.org/docs/r/reference/table.html).

In [3]:
obs <- experiment$obs
obs$read()$concat()

Table
80 rows x 9 columns
$soma_joinid <int64 not null>
$orig.ident <large_string>
$nCount_RNA <double>
$nFeature_RNA <int32>
$RNA_snn_res.0.8 <large_string>
$letter.idents <large_string>
$groups <large_string>
$RNA_snn_res.1 <large_string>
$obs_id <large_string>

This is easily converted into a `data.frame` using Arrow's methods:

In [4]:
obs$read()$concat()$to_data_frame()

soma_joinid,orig.ident,nCount_RNA,nFeature_RNA,RNA_snn_res.0.8,letter.idents,groups,RNA_snn_res.1,obs_id
<int>,<chr>,<dbl>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>
0,SeuratProject,70,47,0,A,g2,0,ATGCCAGAACGACT
1,SeuratProject,85,52,0,A,g1,0,CATGGCCTGTGCAT
2,SeuratProject,87,50,1,B,g2,0,GAACCTGATGAACC
3,SeuratProject,127,56,0,A,g2,0,TGACTGGATTCTCA
4,SeuratProject,173,53,0,A,g2,0,AGTCAGACTGCACA
5,SeuratProject,70,48,0,A,g1,0,TCTGATACACGTGT
6,SeuratProject,64,36,0,A,g1,0,TGGTATCTAAACAG
7,SeuratProject,72,45,0,A,g1,0,GCAGCTCTGTTTCT
8,SeuratProject,52,36,0,A,g1,0,GATATAACACGCAT
9,SeuratProject,100,41,0,A,g1,0,AATGTTGACAGTCA


### Slicing

Slices of data can be read by passing coordinates to the `read()` method. Before we do that, let's take a look at the schema of `obs`:

In [5]:
obs$schema()

Schema
soma_joinid: int64 not null
orig.ident: string
nCount_RNA: double
nFeature_RNA: int32
RNA_snn_res.0.8: string
letter.idents: string
groups: string
RNA_snn_res.1: string
obs_id: string

With any SOMA object, you can only slice across an indexed column (a "dimension" in TileDB parlance). You can use `dimnames()` to retrieve the names of any SOMA object's indexed dimensions:

In [6]:
obs$dimnames()

In this case, there is a single dimension called `soma_joinid`. From the schema above we can see this contains integers.

Let's look at a few ways to slice the dataframe.

Select a single row:

In [7]:
obs$read(coords = 0)$concat()

Table
1 rows x 9 columns
$soma_joinid <int64 not null>
$orig.ident <large_string>
$nCount_RNA <double>
$nFeature_RNA <int32>
$RNA_snn_res.0.8 <large_string>
$letter.idents <large_string>
$groups <large_string>
$RNA_snn_res.1 <large_string>
$obs_id <large_string>

Select multiple, non-contiguous rows:

In [8]:
obs$read(coords = c(0, 2))$concat()

Table
2 rows x 9 columns
$soma_joinid <int64 not null>
$orig.ident <large_string>
$nCount_RNA <double>
$nFeature_RNA <int32>
$RNA_snn_res.0.8 <large_string>
$letter.idents <large_string>
$groups <large_string>
$RNA_snn_res.1 <large_string>
$obs_id <large_string>

Select multiple, contiguous rows:

In [9]:
obs$read(coords = 0:4)$concat()

Table
5 rows x 9 columns
$soma_joinid <int64 not null>
$orig.ident <large_string>
$nCount_RNA <double>
$nFeature_RNA <int32>
$RNA_snn_res.0.8 <large_string>
$letter.idents <large_string>
$groups <large_string>
$RNA_snn_res.1 <large_string>
$obs_id <large_string>
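
Note that `coords` are matched against the values of `soma_joinid`, which start at 0 in this dataset, rather than against one-based R row positions. The sketch below shows the conversion in plain R. The helper `rows_to_joinids()` is hypothetical (it is not part of tiledbsoma), and it assumes the join IDs are contiguous and start at 0, as they do in the output above.

```r
# Hypothetical helper (not part of tiledbsoma): convert one-based R row
# positions to zero-based soma_joinid values. Assumes the join IDs are
# contiguous and start at 0, as in the pbmc_small example.
rows_to_joinids <- function(rows) {
  stopifnot(all(rows >= 1))
  as.integer(rows) - 1L
}

# The first five rows in R terms correspond to soma_joinid 0 through 4
first_five <- rows_to_joinids(1:5)

# Rows 1 and 3 correspond to the non-contiguous coordinates c(0, 2)
picked <- rows_to_joinids(c(1, 3))

# These vectors can then be passed as `coords`, for example:
# obs$read(coords = first_five)$concat()
```

If the join IDs in your own data are not contiguous, read the `soma_joinid` column first and select the IDs you need from it instead.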

### Selecting columns

Because TileDB is a columnar format, you can read only a subset of columns by using the `column_names` argument:

In [10]:
obs$read(coords = 0:4, column_names = c("obs_id", "groups"))$concat()

Table
5 rows x 2 columns
$obs_id <large_string>
$groups <large_string>
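
One possible use of column selection is to fetch only the identifier columns and build a lookup from cell barcode to `soma_joinid`. The following is a minimal sketch rather than a prescribed workflow. It assumes that `soma_joinid` can be requested through `column_names` like any other column, and the variable name `joinid_by_barcode` is chosen here for illustration. The `requireNamespace()` guard only exists so the snippet does nothing where tiledbsoma is not installed.

```r
# Columns to fetch: the join ID plus the cell barcode
cols <- c("soma_joinid", "obs_id")

# Guarded so the sketch is a no-op where tiledbsoma is not installed
if (requireNamespace("tiledbsoma", quietly = TRUE)) {
  experiment <- tiledbsoma::load_dataset("soma-exp-pbmc-small")

  # Read only the two requested columns for all cells
  ids <- experiment$obs$read(column_names = cols)$concat()$to_data_frame()

  # Named vector mapping each barcode to its soma_joinid
  joinid_by_barcode <- setNames(ids$soma_joinid, ids$obs_id)
  print(head(joinid_by_barcode))
}
```

The resulting join IDs can be passed back to `read()` as `coords` to slice by barcode.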

### Filtering

In addition to slicing by coordinates, you can apply filters to the data using the `value_filter` argument. These expressions are pushed down to the TileDB engine and applied efficiently to the data on disk. Here are a few examples.

Identify all cells in the `"g1"` group:

In [11]:
obs$read(value_filter = "groups == 'g1'")$concat()$to_data_frame()

soma_joinid,orig.ident,nCount_RNA,nFeature_RNA,RNA_snn_res.0.8,letter.idents,groups,RNA_snn_res.1,obs_id
<int>,<chr>,<dbl>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>
1,SeuratProject,85,52,0,A,g1,0,CATGGCCTGTGCAT
5,SeuratProject,70,48,0,A,g1,0,TCTGATACACGTGT
6,SeuratProject,64,36,0,A,g1,0,TGGTATCTAAACAG
7,SeuratProject,72,45,0,A,g1,0,GCAGCTCTGTTTCT
8,SeuratProject,52,36,0,A,g1,0,GATATAACACGCAT
9,SeuratProject,100,41,0,A,g1,0,AATGTTGACAGTCA
11,SeuratProject,191,61,0,A,g1,2,AGAGATGATCTCGC
15,SeuratProject,168,44,0,A,g1,2,CTAAACCTGTGCAT
17,SeuratProject,135,45,0,A,g1,2,TTGGTACTGAATCC
18,SeuratProject,79,43,0,A,g1,2,CATCATACGGAGCA


Identify all cells in the `"g1"` or `"g2"` group:

In [12]:
obs$read(value_filter = "groups == 'g1' | groups == 'g2'")$concat()$to_data_frame()

soma_joinid,orig.ident,nCount_RNA,nFeature_RNA,RNA_snn_res.0.8,letter.idents,groups,RNA_snn_res.1,obs_id
<int>,<chr>,<dbl>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>
0,SeuratProject,70,47,0,A,g2,0,ATGCCAGAACGACT
1,SeuratProject,85,52,0,A,g1,0,CATGGCCTGTGCAT
2,SeuratProject,87,50,1,B,g2,0,GAACCTGATGAACC
3,SeuratProject,127,56,0,A,g2,0,TGACTGGATTCTCA
4,SeuratProject,173,53,0,A,g2,0,AGTCAGACTGCACA
5,SeuratProject,70,48,0,A,g1,0,TCTGATACACGTGT
6,SeuratProject,64,36,0,A,g1,0,TGGTATCTAAACAG
7,SeuratProject,72,45,0,A,g1,0,GCAGCTCTGTTTCT
8,SeuratProject,52,36,0,A,g1,0,GATATAACACGCAT
9,SeuratProject,100,41,0,A,g1,0,AATGTTGACAGTCA


Alternatively, you can use the `%in%` operator:

In [13]:
obs$read(value_filter = "groups %in% c('g1', 'g2')")$concat()$to_data_frame()

soma_joinid,orig.ident,nCount_RNA,nFeature_RNA,RNA_snn_res.0.8,letter.idents,groups,RNA_snn_res.1,obs_id
<int>,<chr>,<dbl>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>
0,SeuratProject,70,47,0,A,g2,0,ATGCCAGAACGACT
1,SeuratProject,85,52,0,A,g1,0,CATGGCCTGTGCAT
2,SeuratProject,87,50,1,B,g2,0,GAACCTGATGAACC
3,SeuratProject,127,56,0,A,g2,0,TGACTGGATTCTCA
4,SeuratProject,173,53,0,A,g2,0,AGTCAGACTGCACA
5,SeuratProject,70,48,0,A,g1,0,TCTGATACACGTGT
6,SeuratProject,64,36,0,A,g1,0,TGGTATCTAAACAG
7,SeuratProject,72,45,0,A,g1,0,GCAGCTCTGTTTCT
8,SeuratProject,52,36,0,A,g1,0,GATATAACACGCAT
9,SeuratProject,100,41,0,A,g1,0,AATGTTGACAGTCA


Identify all cells in the `"g1"` group with more than 60 features:

In [14]:
obs$read(value_filter = "groups == 'g1' & nFeature_RNA > 60")$concat()$to_data_frame()

soma_joinid,orig.ident,nCount_RNA,nFeature_RNA,RNA_snn_res.0.8,letter.idents,groups,RNA_snn_res.1,obs_id
<int>,<chr>,<dbl>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>
11,SeuratProject,191,61,0,A,g1,2,AGAGATGATCTCGC
20,SeuratProject,298,65,1,B,g1,1,TTACCATGAATCGC
21,SeuratProject,406,74,1,B,g1,1,ATAGGAGAAACAGA
24,SeuratProject,463,77,1,B,g1,1,ATTACCTGCCTTAT
29,SeuratProject,353,80,1,B,g1,1,CATCAGGATGCACA
50,SeuratProject,371,75,1,B,g1,1,CGTAGCCTGTATGC
54,SeuratProject,443,77,1,B,g1,1,AAGCGACTTTGACG
55,SeuratProject,417,75,0,A,g1,1,ACCAGTGAATACCG
56,SeuratProject,502,81,1,B,g1,1,ATTGCACTTGCTTT
57,SeuratProject,324,76,1,B,g1,1,CTAGGTGATGGTTG


## SOMA SparseNDArray

For `SOMASparseNDArray`, let's consider the `X` layer containing the `"counts"` data:

In [15]:
counts <- experiment$ms$get("RNA")$X$get("counts")
counts

<SOMASparseNDArray>
  uri: file:///var/folders/7h/59ccydx96xz_2nsh5945tt6m0000gn/T/RtmpLYwMwP/soma-exp-pbmc-small/ms/RNA/X/counts 
  dimensions: soma_dim_0, soma_dim_1 
  attributes: soma_data 

Similar to `SOMADataFrame`, we can load the data into memory as an Arrow Table:

In [16]:
counts$read()$tables()$concat()

Table
4456 rows x 3 columns
$soma_dim_0 <int64 not null>
$soma_dim_1 <int64 not null>
$soma_data <double not null>

Or as a sparse matrix (see `Matrix::sparseMatrix()`):

In [17]:
counts$read()$sparse_matrix()$concat()

80 x 230 sparse Matrix of class "dgTMatrix"
                                                                               
 [1,] . 1  .   . .  1 . .  3 . . 1 . . . . . . . . . .  1 . . . .  . . .  4 . .
 [2,] . .  .   1 .  . . .  7 . . . . . . . . . . . . 1  1 . 2 . 1  . . .  4 3 1
 [3,] . .  .   . .  . . . 11 . . 1 . . . . . . . . . .  . 1 . . .  . . .  4 2 .
 [4,] . .  .   . .  . . . 13 . . 1 . . . . . . . . . .  6 . . . .  . . .  5 2 1
 [5,] . .  .   1 .  . . .  3 . . . . . . . . . . . . .  . . . . .  . . .  4 3 .
 [6,] . .  .   1 .  . . .  4 . . . . 1 . . . . . . . .  2 1 . . .  . . .  4 1 1
 [7,] . .  .   . .  . . .  6 . . . . . . . . . . . . .  4 . . . .  . . .  3 1 1
 [8,] . .  .   1 .  . . .  4 . . . . . . . . . . . . .  1 1 . . .  . . .  2 3 .
 [9,] . .  .   . .  . . .  2 . . . . . . . . . . . . .  . . . . .  . . .  2 2 .
[10,] . 1  .   . .  . . . 21 . . 1 . . . . . . . . . .  4 . 1 . .  . . .  2 1 1
[11,] 2 2  .  14 3  1 3 .  2 . . . 1 . 3 . . . . 1 1 1  2 2 . 2 .  1 1 1  . 

### Slicing

Just as with a `SOMADataFrame`, we can also retrieve subsets of the data from a `SOMASparseNDArray` that can fit in memory.

Unlike `SOMADataFrame`s, `SOMASparseNDArray`s are always indexed using a zero-based offset integer on each dimension. The dimensions are always named `soma_dim_N`, where `N` is the dimension number. Therefore, if the array is `N`-dimensional, the `read()` method can accept a list of length `N` that specifies how to slice the array.

As before, you can use `schema()` or `dimnames()` to retrieve the dimension names.

In [18]:
counts$schema()

Schema
soma_dim_0: int64 not null
soma_dim_1: int64 not null
soma_data: double not null

For example, here's how to fetch the first 5 rows of the matrix:

In [19]:
counts$read(coords = list(soma_dim_0 = 0:4))$tables()$concat()

Table
258 rows x 3 columns
$soma_dim_0 <int64 not null>
$soma_dim_1 <int64 not null>
$soma_data <double not null>
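
The table above stores the slice as coordinate triplets with zero-based indices, whereas R matrices are indexed from 1. The sketch below shows one way to turn such triplets into a sparse matrix yourself, using the `index1 = FALSE` argument of `Matrix::sparseMatrix()`. The `triplets` data frame holds made-up illustrative values, not real output from the dataset. In practice the columns would come from converting the Arrow Table to a data frame.

```r
library(Matrix)

# Hypothetical triplets standing in for a small slice; in practice these
# columns would come from counts$read(...)$tables()$concat()$to_data_frame()
triplets <- data.frame(
  soma_dim_0 = c(0, 0, 2, 4),
  soma_dim_1 = c(1, 3, 0, 2),
  soma_data  = c(1, 7, 11, 3)
)

# index1 = FALSE tells sparseMatrix() that i and j are zero-based
mat <- sparseMatrix(
  i = triplets$soma_dim_0,
  j = triplets$soma_dim_1,
  x = triplets$soma_data,
  dims = c(5, 4),
  index1 = FALSE
)

# The zero-based coordinate (0, 3) is addressed as [1, 4] in R
mat[1, 4]
```

Passing `dims` explicitly keeps the matrix shape stable even when the trailing rows or columns of a slice contain no non-zero values.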