# Loading an HCA matrix into Bioconductor
 
This vignette illustrates requesting an expression matrix from the HCA matrix service and loading it into a Bioconductor R object.

First, install and import some dependencies:

In [2]:
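# CRAN packages: downloader and BiocManager (the Bioconductor installer)
install.packages("downloader")
install.packages("BiocManager")

library("downloader")
library("BiocManager")

# LoomExperiment provides the Bioconductor classes for Loom files
BiocManager::install("LoomExperiment")
library(LoomExperiment)

# httr is the HTTP client used below; it must already be installed
library("httr")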


The downloaded binary packages are in
	/var/folders/nl/dgln31tj7l35g879d6f_tjtc0000gn/T//RtmpKQd50E/downloaded_packages

The downloaded binary packages are in
	/var/folders/nl/dgln31tj7l35g879d6f_tjtc0000gn/T//RtmpKQd50E/downloaded_packages


Bioconductor version 3.9 (BiocManager 1.30.4), R 3.6.0 (2019-04-26)
Installing package(s) 'LoomExperiment'



The downloaded binary packages are in
	/var/folders/nl/dgln31tj7l35g879d6f_tjtc0000gn/T//RtmpKQd50E/downloaded_packages


Update old packages: 'dplyr', 'googleAuthR', 'pillar'
Loading required package: SingleCellExperiment
Loading required package: SummarizedExperiment
Loading required package: DelayedArray

Attaching package: ‘DelayedArray’

The following objects are masked from ‘package:matrixStats’:

    colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges

The following objects are masked from ‘package:base’:

    aperm, apply, rowsum

Loading required package: rhdf5
Loading required package: rtracklayer

Attaching package: ‘httr’

The following object is masked from ‘package:Biobase’:

    content
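
The load messages above note that `httr` masks `content` from Biobase, so which `content()` runs depends on the order in which packages are attached. One way to sidestep that is to call the `httr` functions by namespace. The wrapper below is a minimal sketch of that idea; `get_json` is a hypothetical helper name, not part of `httr` or the matrix service, and it also stops on HTTP error codes, which the cells in this vignette do not check.

```r
# Hypothetical helper (not part of httr or the HCA API): fetch a URL and
# return the parsed JSON body, raising an R error on HTTP failures.
get_json <- function(url) {
  r <- httr::GET(url)
  httr::stop_for_status(r)  # turns 4xx/5xx responses into R errors
  httr::content(r, as = "parsed", type = "application/json")
}
```

With this in place, a call such as `get_json("https://matrix.data.humancellatlas.org/v1/filters")` would be one way to issue the requests that follow.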



Next, we query the matrix service to find out which metadata fields and values we can filter on when selecting a matrix.

In [3]:
r <- GET("https://matrix.data.humancellatlas.org/v1/filters")
content(r)

That's the list of metadata fields we can filter on when requesting the matrix. We can describe any of them with further API calls:

In [4]:
r <- GET("https://matrix.data.humancellatlas.org/v1/filters/project.project_core.project_short_name")
print(content(r))

$cell_counts
$cell_counts$`Fetal/Maternal Interface`
[1] 1

$cell_counts$`Single cell RNAseq characterization of cell types produced over time in an in vitro model of human inhibitory interneuron differentiation.`
[1] 1733

$cell_counts$`Single cell transcriptome analysis of human pancreas`
[1] 2544


$field_description
[1] "A short name for the project."

$field_name
[1] "project.project_core.project_short_name"

$field_type
[1] "categorical"



In [5]:
r <- GET("https://matrix.data.humancellatlas.org/v1/filters/genes_detected")
print(content(r))

$field_description
[1] "Count of genes with a non-zero count."

$field_name
[1] "genes_detected"

$field_type
[1] "numeric"

$maximum
[1] 13108

$minimum
[1] 358
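
The two responses above have different shapes: a categorical field carries `cell_counts`, while a numeric field carries `minimum` and `maximum`. As a rough sketch of how code might branch on `field_type`, the hypothetical helper `summarise_filter` below turns either kind of parsed response into a one-line summary. The two lists are hand-copied from the outputs above (the categorical one is abridged to two categories) so that the example runs without network access.

```r
# Hypothetical helper: one-line summary of a parsed filter description.
summarise_filter <- function(desc) {
  if (identical(desc$field_type, "categorical")) {
    counts <- unlist(desc$cell_counts)
    sprintf("%s: %d categories, %d cells in total",
            desc$field_name, length(counts), sum(counts))
  } else {
    sprintf("%s: values from %s to %s",
            desc$field_name, desc$minimum, desc$maximum)
  }
}

# Hand-copied, abridged versions of the responses shown above
project_desc <- list(
  cell_counts = list(
    `Fetal/Maternal Interface` = 1L,
    `Single cell transcriptome analysis of human pancreas` = 2544L
  ),
  field_name = "project.project_core.project_short_name",
  field_type = "categorical"
)
genes_desc <- list(
  field_name = "genes_detected",
  field_type = "numeric",
  maximum = 13108L,
  minimum = 358L
)

project_summary <- summarise_filter(project_desc)
genes_summary <- summarise_filter(genes_desc)
print(project_summary)
print(genes_summary)
```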



For categorical fields, the response gives the number of cells in each category. For numeric fields, it gives the range of values. To request a matrix based on these metadata values, add them to the filter in the body of a POST request to the matrix service:

In [6]:
# Keep cells from one project that have at least 300 detected genes
payload <- list(
    filter = list(
        op = "and",
        value = list(
            list(op = "=", value = "Single cell transcriptome analysis of human pancreas",
                 field = "project.project_core.project_short_name"),
            list(op = ">=", value = 300,
                 field = "genes_detected")
        )
    ),
    format = "loom"
)
r <- POST("https://matrix.data.humancellatlas.org/v1/matrix", body = payload, encode = "json")
response <- content(r)
print(response)

$eta
[1] ""

$matrix_url
[1] ""

$message
[1] "Job started."

$request_id
[1] "4748716f-772f-4716-8500-c8c21e4ad237"

$status
[1] "In Progress"
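
The nested `list()` in the request above is easy to mistype, so it can help to build it from small functions and inspect the JSON before sending it. The sketch below does that; `filter_term` and `matrix_request` are hypothetical helper names, and the set of accepted formats is taken from the formats this vignette mentions (loom, csv, mtx). Printing uses `jsonlite`, which `httr` depends on, and is skipped if that package is not available. The printed text should be close to, but is not guaranteed identical to, what `encode = "json"` sends.

```r
# Hypothetical helpers for assembling a matrix request body.
filter_term <- function(field, op, value) {
  list(op = op, value = value, field = field)
}

matrix_request <- function(terms, format = c("loom", "csv", "mtx")) {
  format <- match.arg(format)  # rejects formats outside the listed choices
  list(filter = list(op = "and", value = unname(terms)), format = format)
}

payload <- matrix_request(list(
  filter_term("project.project_core.project_short_name", "=",
              "Single cell transcriptome analysis of human pancreas"),
  filter_term("genes_detected", ">=", 300)
))

# Show the request body as JSON when jsonlite is installed
if (requireNamespace("jsonlite", quietly = TRUE)) {
  cat(jsonlite::toJSON(payload, auto_unbox = TRUE, pretty = TRUE), "\n")
}
```

The resulting `payload` has the same structure as the hand-written list, so it can be passed to `POST(..., body = payload, encode = "json")` unchanged.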



That call responds right away and tells us that the matrix is being prepared. We can poll the service with the `request_id` until the matrix is ready.

In [7]:
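request_id <- response[["request_id"]]
status <- response[["status"]]
message(status)
url <- paste0("https://matrix.data.humancellatlas.org/v1/matrix/", request_id)
# Poll every 15 seconds while the job is still running
while (status == "In Progress")
{
    Sys.sleep(15)
    r <- GET(url)
    response <- content(r)
    status <- response[["status"]]
    message(status)
}
# Any final status other than "Complete" means no matrix was produced
if (status != "Complete") stop("Matrix request ended with status: ", status)
print(response)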

In Progress
In Progress
In Progress
In Progress
In Progress
In Progress
In Progress
In Progress
In Progress
In Progress
In Progress
Complete


$eta
[1] ""

$matrix_url
[1] "https://s3.amazonaws.com/dcp-matrix-service-results-prod/4748716f-772f-4716-8500-c8c21e4ad237.loom"

$message
[1] "Request 4748716f-772f-4716-8500-c8c21e4ad237 has successfully completed. The resultant expression matrix is available for download at https://s3.amazonaws.com/dcp-matrix-service-results-prod/4748716f-772f-4716-8500-c8c21e4ad237.loom"

$request_id
[1] "4748716f-772f-4716-8500-c8c21e4ad237"

$status
[1] "Complete"
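
A polling loop like the one above can wait forever if the job never finishes. The sketch below wraps the same idea in a hypothetical function, `poll_until_done`, that adds a time limit and treats any final status other than "Complete" as an error. Only the "In Progress" and "Complete" statuses appear in this vignette; how other statuses should be handled is an assumption. The HTTP call is passed in as a function, so the example can be exercised offline with canned responses.

```r
# Hypothetical helper: call `fetch_status()` until the job leaves
# "In Progress", giving up after `timeout` seconds.
poll_until_done <- function(fetch_status, interval = 15, timeout = 30 * 60) {
  deadline <- Sys.time() + timeout
  repeat {
    response <- fetch_status()
    status <- response[["status"]]
    message(status)
    if (!identical(status, "In Progress")) break
    if (Sys.time() >= deadline) stop("Timed out waiting for the matrix request")
    Sys.sleep(interval)
  }
  if (!identical(status, "Complete")) {
    stop("Matrix request ended with status: ", status)
  }
  response
}

# Offline demonstration: canned responses stand in for the HTTP call
fake_responses <- list(
  list(status = "In Progress"),
  list(status = "In Progress"),
  list(status = "Complete", matrix_url = "https://example.org/matrix.loom")
)
calls <- 0
fake_fetch <- function() {
  calls <<- calls + 1
  fake_responses[[calls]]
}
result <- poll_until_done(fake_fetch, interval = 0)
```

Against the real service, `fetch_status` could be something like `function() content(GET(url))`, with `url` built from the `request_id` as in the cell above.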



Now that the matrix is ready, we can download it. The file we download is a loom-formatted matrix. Loom is the default matrix format, but other formats (csv, mtx) can be specified in the matrix request.

In [8]:
matrix_file_url <- response[["matrix_url"]]

download.file(url = matrix_file_url,
              destfile = "matrix.loom", method = "curl")
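
Loom files are HDF5 containers, so a quick sanity check after downloading is to confirm that the file starts with the 8-byte HDF5 signature. The sketch below uses only base R; `looks_like_hdf5` is a hypothetical helper name, and the check assumes the signature sits at the very start of the file (HDF5 also permits it at later offsets when a user block is present). It catches cases such as an error page being saved in place of the matrix, but it does not validate the Loom structure itself.

```r
# The fixed 8-byte signature that opens an HDF5 file: \x89 H D F \r \n \x1a \n
hdf5_signature <- as.raw(c(0x89, 0x48, 0x44, 0x46, 0x0d, 0x0a, 0x1a, 0x0a))

# Hypothetical helper: TRUE if `path` exists and begins with the signature.
looks_like_hdf5 <- function(path) {
  if (!file.exists(path)) return(FALSE)
  con <- file(path, "rb")
  on.exit(close(con))
  header <- readBin(con, what = "raw", n = 8L)
  identical(header, hdf5_signature)
}

# Offline demonstration on temporary files
good_file <- tempfile(fileext = ".loom")
writeBin(c(hdf5_signature, as.raw(0)), good_file)
bad_file <- tempfile(fileext = ".loom")
writeLines("<html>Not a matrix</html>", bad_file)

good_result <- looks_like_hdf5(good_file)
bad_result <- looks_like_hdf5(bad_file)
```

In the vignette's setting the call would be `looks_like_hdf5("matrix.loom")`.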

# HCA Matrix Service Loom Output

The Loom format is documented more fully, along with code samples, [here](https://linnarssonlab.org/loompy/index.html).

Per Loom [conventions](https://linnarssonlab.org/loompy/conventions/index.html), columns in the loom-formatted expression matrix represent cells, and rows represent genes. The column and row attributes also follow Loom conventions where applicable: `CellID` uniquely identifies a cell, `Gene` is a gene name, and `Accession` is an Ensembl gene ID.

Descriptions of the remaining metadata fields are available at the [HCA Data Browser](https://prod.data.humancellatlas.org/metadata).

Finally, we can `import` the loom file into a Bioconductor `SingleCellLoomExperiment` object (provided by the `LoomExperiment` package) for further analysis in R.

In [9]:
scle <- import("./matrix.loom", type="SingleCellLoomExperiment")
scle

class: SingleCellLoomExperiment 
dim: 63925 2544 
metadata(3): CreationDate LOOM_SPEC_VERSION last_modified
assays(1): matrix
rownames: NULL
rowData names(7): Accession Gene ... featuretype isgene
colnames: NULL
colData names(30): CellID analysis_protocol.protocol_core.protocol_id
  ... specimen_from_organism.genus_species.ontology_label
  specimen_from_organism.provenance.document_id
reducedDimNames(0):
spikeNames(0):
rowGraphs(0): NULL
colGraphs(0): NULL

The `SingleCellLoomExperiment` also adheres to Loom [conventions](https://linnarssonlab.org/loompy/conventions/index.html), representing features as rows and samples as columns. Expression data is available via the `assays()` method by specifying a named assay.

In [10]:
assays(scle)$matrix

<63925 x 2544> DelayedMatrix object of type "double":
            [,1]    [,2]    [,3] ... [,2543] [,2544]
    [1,]       0       0       0   .       0       0
    [2,]       0       0       0   .       0       0
    [3,]      11       0      37   .       0       6
    [4,]       0       0       0   .     101       0
    [5,]       0       0       0   .       0       0
     ...       .       .       .   .       .       .
[63921,]    0.00    0.00    0.00   .       0       2
[63922,]    0.00    0.00    0.00   .       0       0
[63923,]    0.00    0.00    0.00   .       0       0
[63924,]    0.99    0.00    0.00   .       0       0
[63925,]    0.00    0.00    0.00   .       0       0

Row and column attribute data are available through the `rowData()` and `colData()` methods, respectively.

In [11]:
rowData(scle)

DataFrame with 63925 rows and 7 columns
            Accession        Gene  chromosome featureend featurestart
          <character> <character> <character>  <integer>    <integer>
1     ENSG00000000003      TSPAN6        chrX  100639991    100627109
2     ENSG00000000005        TNMD        chrX  100599885    100584802
3     ENSG00000000419        DPM1       chr20   50958555     50934867
4     ENSG00000000457       SCYL3        chr1  169894267    169849631
5     ENSG00000000460    C1orf112        chr1  169854080    169662007
...               ...         ...         ...        ...          ...
63921 ENSG00000284744  AL591163.1        chr1    6770038      6767954
63922 ENSG00000284745  AL589702.1        chr1    2968707      2960658
63923 ENSG00000284746 AC068587.10        chr8   12601376     12601158
63924 ENSG00000284747  AL034417.4        chr1    8005312      7991134
63925 ENSG00000284748  AL513220.1        chr1   37607336     37596126
                             featuretype    isgene

In [12]:
colData(scle)

DataFrame with 2544 rows and 30 columns
                                   CellID
                              <character>
1    00ca0d37-b787-41a4-be59-2aff5b13b0bd
2    0103aed0-29c2-4b29-a02a-2b58036fe875
3    01a5dd09-db87-47ac-be78-506c690c4efc
4    020d39f9-9375-4377-882e-db83d912aeb7
5    02583626-682b-4374-874a-99bd2e6a956e
...                                   ...
2540 fb29b70b-65af-4bd5-8c78-d33af8cefeb5
2541 fb8afe1d-6596-45a6-a6a4-3d83af03f6d1
2542 fc65dfd2-cafa-486e-a7f2-753c20d705a0
2543 fdb8ed17-e2f0-460a-bb25-9781d63eabf6
2544 fe0d170e-af6e-4420-827b-27b125fec214
     analysis_protocol.protocol_core.protocol_id
                                     <character>
1                               smartseq2_v2.3.0
2                               smartseq2_v2.4.0
3                               smartseq2_v2.3.0
4                               smartseq2_v2.3.0
5                               smartseq2_v2.3.0
...                                          ...
2540                  
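
The imported object reports `rownames: NULL` and `colnames: NULL`, with the identifiers stored in the `Accession` and `CellID` attributes instead. The sketch below shows one way to promote those attributes to dimension names and compute simple per-cell summaries. It runs on a tiny hand-made `SummarizedExperiment` (a parent class of `SingleCellLoomExperiment`) rather than the downloaded matrix; the count values and the cell identifiers `cell-a` and `cell-b` are invented for illustration, while the three gene records are copied from the `rowData` output above.

```r
library(SummarizedExperiment)

# Toy data: 3 genes x 2 cells, with illustrative counts
counts <- matrix(c(0, 11, 0,
                   0, 0, 101), nrow = 3, ncol = 2)
toy <- SummarizedExperiment(
  assays = list(matrix = counts),
  rowData = DataFrame(
    Accession = c("ENSG00000000003", "ENSG00000000005", "ENSG00000000419"),
    Gene = c("TSPAN6", "TNMD", "DPM1")
  ),
  colData = DataFrame(CellID = c("cell-a", "cell-b"))
)

# Use the Loom identifier attributes as dimension names
rownames(toy) <- rowData(toy)$Accession
colnames(toy) <- colData(toy)$CellID

# Per-cell summaries computed from the named assay
expr <- assay(toy, "matrix")
total_counts <- colSums(expr)
genes_per_cell <- colSums(expr > 0)
```

The same two assignments should apply to `scle`, after which rows and columns can be subset by Ensembl ID and cell ID, for example `scle["ENSG00000000419", ]`.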

For more usage examples, see the Bioconductor documentation for the following parent classes of `SingleCellLoomExperiment`:

- [SummarizedExperiment](http://bioconductor.org/packages/release/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html#anatomy-of-a-summarizedexperiment)
- [SingleCellExperiment](https://bioconductor.org/packages/devel/bioc/vignettes/scater/inst/doc/vignette-intro.html#3_calculating_a_variety_of_expression_values)
- [LoomExperiment](http://bioconductor.org/packages/release/bioc/vignettes/LoomExperiment/inst/doc/LoomExperiment.html)