## Loading an HCA matrix into Seurat

This vignette shows how to request an expression matrix from the Human Cell Atlas (HCA) matrix service and load it into Seurat.


First, install and import some dependencies:

In [30]:
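# httr provides GET(), POST() and content() for talking to the matrix service
library(httr)
install.packages("remotes")
install.packages("downloader")
library(downloader)
# Install Seurat from GitHub (skipped if the installed version is already current)
remotes::install_github("satijalab/seurat")
library(Seurat)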

Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
Skipping install of 'Seurat' from a github remote, the SHA1 (245d72b5) has not changed since last install.
  Use `force = TRUE` to force installation


Next, we query the service for the metadata fields and values that we can filter on when selecting our matrix.

In [31]:
r <- GET("https://matrix.data.humancellatlas.org/v1/filters")
content(r)

That's the list of metadata fields we can filter on when requesting the matrix. We can describe any of them with further API calls:

In [32]:
r <- GET("https://matrix.data.humancellatlas.org/v1/filters/project.project_core.project_short_name")
content(r)

In [33]:
r <- GET("https://matrix.data.humancellatlas.org/v1/filters/genes_detected")
content(r)

For categorical fields, we see the number of cells associated with each category. For numeric fields, we see the range of values. To request a matrix based on these metadata values, we add them to the filter in the body of a POST request to the matrix service:

In [34]:
payload <- list(
    filter = list(
        op = "and",
        value = list(
            list(op = "=", value = "Single cell transcriptome analysis of human pancreas",
                 field = "project.project_core.project_short_name"),
            list(op = ">=", value = 300,
                 field = "genes_detected")
        )
    ),
    format = "csv"
)
r <- POST("https://matrix.data.humancellatlas.org/v1/matrix", body = payload, encode = "json")
response <- content(r)
print(response)

$eta
[1] ""

$matrix_url
[1] ""

$message
[1] "Job started."

$request_id
[1] "9587c4a2-5f3b-4e36-95ac-38d7834c56dd"

$status
[1] "In Progress"
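
The filter in the request above is a small tree: an "and" node whose value is a list of comparisons, each with an op, a value, and a field. As a minimal sketch, the same structure can be assembled with two helper functions. `filter_compare` and `filter_and` are hypothetical names introduced here for illustration; they are not part of httr or of the matrix service. The JSON is produced with jsonlite, which should be close to what httr sends when `encode = "json"` is used.

```r
library(jsonlite)

# Hypothetical helpers (not part of httr or the matrix service):
# each returns one node of the filter tree used in the request body.
filter_compare <- function(field, op, value) {
  list(op = op, value = value, field = field)
}

filter_and <- function(...) {
  list(op = "and", value = list(...))
}

matrix_filter <- filter_and(
  filter_compare("project.project_core.project_short_name", "=",
                 "Single cell transcriptome analysis of human pancreas"),
  filter_compare("genes_detected", ">=", 300)
)

payload <- list(filter = matrix_filter, format = "csv")

# auto_unbox = TRUE writes length-one vectors as JSON scalars instead of arrays
payload_json <- toJSON(payload, auto_unbox = TRUE, pretty = TRUE)
cat(payload_json, "\n")
```

Printing the JSON before sending it is a cheap way to check that the nesting matches what the service expects.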



The POST request responds right away and tells us that the matrix is being prepared. We can use the request_id to poll the service until the matrix is done.

In [35]:
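# Use [[ ]] to extract the values themselves rather than one-element lists
request_id <- response[["request_id"]]
status <- response[["status"]]
message(status)
# Poll every 15 seconds until the service reports that the matrix is ready
while (status != "Complete") 
{
    Sys.sleep(15)
    url <- paste0("https://matrix.data.humancellatlas.org/v1/matrix/", request_id)
    r <- GET(url)
    response <- content(r)
    status <- response[["status"]]
    message(status)
}
print(response)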
print(response)

In Progress
In Progress
In Progress
In Progress
In Progress
In Progress
In Progress
In Progress
In Progress
In Progress
In Progress
In Progress
In Progress
In Progress
In Progress
In Progress
In Progress
In Progress
In Progress
In Progress
In Progress
In Progress
In Progress
In Progress
In Progress
Complete


$eta
[1] ""

$matrix_url
[1] "https://s3.amazonaws.com/dcp-matrix-service-results-prod/9587c4a2-5f3b-4e36-95ac-38d7834c56dd.csv.zip"

$message
[1] "Request 9587c4a2-5f3b-4e36-95ac-38d7834c56dd has successfully completed. The resultant expression matrix is available for download at https://s3.amazonaws.com/dcp-matrix-service-results-prod/9587c4a2-5f3b-4e36-95ac-38d7834c56dd.csv.zip"

$request_id
[1] "9587c4a2-5f3b-4e36-95ac-38d7834c56dd"

$status
[1] "Complete"
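
The polling loop above keeps running until the status is "Complete", so it would never stop if the job ended in some other state. The sketch below adds a retry limit and stops on any status other than "In Progress" or "Complete". Only those two statuses appear in this vignette; treating everything else as a failure is an assumption, not documented service behaviour. `wait_for_matrix` and `fake_fetch` are hypothetical names, and the fetch function is passed in as an argument so the loop can be tried without network access.

```r
# Hypothetical helper: poll `fetch_status()` until the job leaves "In Progress".
# `fetch_status` is any zero-argument function returning a list with a
# `status` element.
wait_for_matrix <- function(fetch_status, interval = 15, max_tries = 120) {
  for (i in seq_len(max_tries)) {
    response <- fetch_status()
    status <- response[["status"]]
    message(status)
    if (identical(status, "Complete")) {
      return(response)
    }
    # Assumption: any status other than "In Progress" means the job failed
    if (!identical(status, "In Progress")) {
      stop("Matrix request ended with status: ", status)
    }
    Sys.sleep(interval)
  }
  stop("Gave up after ", max_tries, " status checks")
}

# Offline stand-in for the service: "In Progress" twice, then "Complete"
fake_statuses <- c("In Progress", "In Progress", "Complete")
call_count <- 0
fake_fetch <- function() {
  call_count <<- call_count + 1
  list(status = fake_statuses[call_count], matrix_url = "placeholder-url")
}

result <- wait_for_matrix(fake_fetch, interval = 0)
```

Against the real service, the fetch function would wrap the same call used in the loop above, for example `function() content(GET(paste0("https://matrix.data.humancellatlas.org/v1/matrix/", request_id)))`.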



Now that the matrix is ready, we can download it. The file we download is a zip archive that contains a readme and a CSV-formatted matrix. Other formats (loom, mtx) can be specified in the matrix request.

In [36]:
matrix_file_url = unlist(response["matrix_url"])

download.file(url=matrix_file_url,
              destfile='matrix.zip', method='curl')
unzip("matrix.zip", exdir = "./")
file.show('./csv_readme.md')







Finally, we load the expression matrix into a Seurat object.

In [37]:
# The archive unpacks into a directory named after the request ID
data_dir <- paste0("./", request_id, ".csv/")
list.files(data_dir)
# expression.csv has one row per cell; transpose so genes are rows and cells are columns
raw_counts <- t(read.table(file = paste0(data_dir, "expression.csv"), sep = ",", header = TRUE, row.names = 1))
cell_metadata <- read.table(file = paste0(data_dir, "cells.csv"), sep = ",", header = TRUE, row.names = 1)
head(raw_counts)
head(cell_metadata)
pancreas <- CreateSeuratObject(
    counts = raw_counts,
    project = "Single cell transcriptome analysis of human pancreas",
    meta.data = cell_metadata,
    assay = "Smart-seq-2")
pancreas



Unnamed: 0,00ca0d37-b787-41a4-be59-2aff5b13b0bd,0103aed0-29c2-4b29-a02a-2b58036fe875,01a5dd09-db87-47ac-be78-506c690c4efc,020d39f9-9375-4377-882e-db83d912aeb7,02583626-682b-4374-874a-99bd2e6a956e,041637f8-d5c9-49c4-aff7-230da2f95c69,044472bd-588a-4de1-887f-55facdc5ddf9,046c1a85-77f7-4033-b9ea-5994df96b83e,04f60cb7-5ced-4f3f-982f-799751334d45,061f92bf-fcfc-45f9-9a44-6779436748e7,...,f7348197-095b-431b-b3c5-606228bed522,f7d211e4-45ec-49cd-a781-1e6d4fc01ea2,f8082391-2de7-4a3e-baa2-d5d189ab4e5d,f87e69fc-a359-447d-a0af-0a71cdc77d80,f946e5db-3ca3-4de4-b1a4-8d576082ed8f,fb29b70b-65af-4bd5-8c78-d33af8cefeb5,fb8afe1d-6596-45a6-a6a4-3d83af03f6d1,fc65dfd2-cafa-486e-a7f2-753c20d705a0,fdb8ed17-e2f0-460a-bb25-9781d63eabf6,fe0d170e-af6e-4420-827b-27b125fec214
ENSG00000000003,0,0,0,0,0,0,0,127,0,0,...,0,626,0,0.0,0,560,0,0,0,0
ENSG00000000005,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0.0,0,0,0,0,0,0
ENSG00000000419,11,0,37,0,213,0,0,58,0,5,...,0,257,0,0.0,0,0,78,125,0,6
ENSG00000000457,0,0,0,0,0,6,0,105,0,0,...,0,0,0,0.72,0,0,0,0,101,0
ENSG00000000460,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0.0,0,0,0,0,0,0
ENSG00000000938,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0.0,0,0,0,0,0,0


Unnamed: 0,cell_suspension.provenance.document_id,genes_detected,specimen_from_organism.provenance.document_id,specimen_from_organism.genus_species.ontology,specimen_from_organism.genus_species.ontology_label,donor_organism.human_specific.ethnicity.ontology,donor_organism.human_specific.ethnicity.ontology_label,donor_organism.diseases.ontology,donor_organism.diseases.ontology_label,donor_organism.development_stage.ontology,...,library_preparation_protocol.library_construction_method.ontology_label,library_preparation_protocol.end_bias,library_preparation_protocol.strand,project.provenance.document_id,project.project_core.project_short_name,project.project_core.project_title,analysis_protocol.provenance.document_id,dss_bundle_fqid,analysis_protocol.protocol_core.protocol_id,analysis_working_group_approval_status
00ca0d37-b787-41a4-be59-2aff5b13b0bd,00ca0d37-b787-41a4-be59-2aff5b13b0bd,6924,9c1445a1-7287-410e-bb8a-977a8b8e9b05,NCBITAXON:9606,Homo sapiens,HANCESTRO:0005,European,PATO:0000461,normal,HSAPDV:0000087,...,Smart-seq2,full length,unstranded,cddab57b-6868-4be4-806f-395ed9dd635a,Single cell transcriptome analysis of human pancreas,Single cell transcriptome analysis of human pancreas reveals transcriptional signatures of aging and somatic mutation patterns.,38f726ad-86fc-404a-97fc-2ac16e6d8461,1f578cdc-144a-44f1-936c-52fbbc6f71b8.2019-05-14T124411.708000Z,smartseq2_v2.3.0,blessed
0103aed0-29c2-4b29-a02a-2b58036fe875,0103aed0-29c2-4b29-a02a-2b58036fe875,3171,14875995-58ca-42cd-9d37-79f5a1f35270,NCBITAXON:9606,Homo sapiens,HANCESTRO:0016,African American or Afro-Caribbean,PATO:0000461,normal,HSAPDV_0000174,...,Smart-seq2,full length,unstranded,cddab57b-6868-4be4-806f-395ed9dd635a,Single cell transcriptome analysis of human pancreas,Single cell transcriptome analysis of human pancreas reveals transcriptional signatures of aging and somatic mutation patterns.,1dce56e1-7ae5-4ac4-8002-ebf9c9f8b94d,018a3756-0d49-4399-9e56-bc55375ab618.2019-05-30T211857.243000Z,smartseq2_v2.4.0,blessed
01a5dd09-db87-47ac-be78-506c690c4efc,01a5dd09-db87-47ac-be78-506c690c4efc,3838,56b6cd1e-7c2c-43b0-8124-de4a467550fe,NCBITAXON:9606,Homo sapiens,HANCESTRO:0005,European,PATO:0000461,normal,HSAPDV_0000099,...,Smart-seq2,full length,unstranded,cddab57b-6868-4be4-806f-395ed9dd635a,Single cell transcriptome analysis of human pancreas,Single cell transcriptome analysis of human pancreas reveals transcriptional signatures of aging and somatic mutation patterns.,0e47010d-03b2-48b7-81a6-271bd3ba09d8,463a2bc6-e538-453a-bc22-d158c1ed8fb7.2019-05-14T122714.497000Z,smartseq2_v2.3.0,blessed
020d39f9-9375-4377-882e-db83d912aeb7,020d39f9-9375-4377-882e-db83d912aeb7,4111,a1b35ebb-b79e-498f-bfc6-f5b4af5bc719,NCBITAXON:9606,Homo sapiens,HANCESTRO:0008,Asian,PATO:0000461,normal,HSAPDV:0000088,...,Smart-seq2,full length,unstranded,cddab57b-6868-4be4-806f-395ed9dd635a,Single cell transcriptome analysis of human pancreas,Single cell transcriptome analysis of human pancreas reveals transcriptional signatures of aging and somatic mutation patterns.,533acec2-bd51-4dc6-976d-dfbc1a6d725a,192c10f3-f3a9-464a-a3fe-ef14f7e43e70.2019-05-14T121850.763000Z,smartseq2_v2.3.0,blessed
02583626-682b-4374-874a-99bd2e6a956e,02583626-682b-4374-874a-99bd2e6a956e,5834,1f43dc7a-3f89-42a3-8ed7-ee295e59ccb9,NCBITAXON:9606,Homo sapiens,HANCESTRO:0016,African American or Afro-Caribbean,PATO:0000461,normal,HSAPDV:0000090,...,Smart-seq2,full length,unstranded,cddab57b-6868-4be4-806f-395ed9dd635a,Single cell transcriptome analysis of human pancreas,Single cell transcriptome analysis of human pancreas reveals transcriptional signatures of aging and somatic mutation patterns.,4d8b91cb-655f-41e9-89f9-71051ef84ca6,004ad9a7-240b-4521-96c7-c78a8fec769c.2019-05-14T113159.892000Z,smartseq2_v2.3.0,blessed
041637f8-d5c9-49c4-aff7-230da2f95c69,041637f8-d5c9-49c4-aff7-230da2f95c69,2564,14875995-58ca-42cd-9d37-79f5a1f35270,NCBITAXON:9606,Homo sapiens,HANCESTRO:0016,African American or Afro-Caribbean,PATO:0000461,normal,HSAPDV_0000174,...,Smart-seq2,full length,unstranded,cddab57b-6868-4be4-806f-395ed9dd635a,Single cell transcriptome analysis of human pancreas,Single cell transcriptome analysis of human pancreas reveals transcriptional signatures of aging and somatic mutation patterns.,d7e16fc2-588d-423c-a835-a7969b2a77d8,bac34c96-62b3-4abd-aca8-bb16f3c37b38.2019-05-14T121642.299000Z,smartseq2_v2.3.0,blessed


“Feature names cannot have underscores ('_'), replacing with dashes ('-')”

An object of class Seurat 
63925 features across 2544 samples within 1 assay 
Active assay: Smart-seq-2 (63925 features)
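
The call to t() in the loading cell matters because CreateSeuratObject expects genes (features) in rows and cells in columns, and the head of raw_counts above shows that orientation only after the transpose. The toy example below illustrates the step on a tiny in-memory CSV. The column header, cell names, and gene names are made up for illustration and are not taken from the real expression.csv.

```r
# Toy stand-in for expression.csv: one row per cell, one column per gene.
# All names and values here are invented for illustration.
toy_csv <- "cellkey,ENSG_A,ENSG_B,ENSG_C
cell1,0,5,2
cell2,3,0,0"

# row.names = 1 uses the first column (cell identifiers) as row names
cells_by_genes <- read.csv(text = toy_csv, row.names = 1)

# t() turns the data frame into a matrix with genes as rows and cells as columns
genes_by_cells <- t(cells_by_genes)

dim(genes_by_cells)        # genes x cells
rownames(genes_by_cells)   # gene identifiers
colnames(genes_by_cells)   # cell identifiers
```

Checking that the column names of the counts matrix match the row names of the cell metadata is a useful sanity test before building the Seurat object.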

In [38]:
pancreas <- FindVariableFeatures(pancreas, selection.method = "vst", nfeatures = 2000)

# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(pancreas), 10)
top10

“All object keys must be alphanumeric characters, followed by an underscore ('_'), setting key to 'smartseq2_'”
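
To give a feel for what "highly variable" means, the sketch below ranks genes in a toy matrix by their variance-to-mean ratio. This is a deliberately simplified stand-in and not the "vst" method used above, which, as far as I understand it, models the mean-variance relationship across genes before ranking. The matrix values and gene names are made up for illustration.

```r
# Toy counts matrix (genes x cells); values are invented for illustration.
toy_counts <- matrix(
  c(10, 10, 10, 10,
     0, 20,  0, 20,
     1,  2,  1,  2),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), paste0("cell", 1:4))
)

gene_mean <- rowMeans(toy_counts)
gene_var  <- apply(toy_counts, 1, var)

# Variance-to-mean ratio: a crude dispersion score, NOT Seurat's "vst" method
dispersion <- gene_var / gene_mean

# Names of the two genes with the highest dispersion
top2 <- names(sort(dispersion, decreasing = TRUE))[1:2]
top2
```

Here geneA has the same count in every cell, so its dispersion is zero and it is not selected, while geneB swings between 0 and 20 and ranks first.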