#' Obtain the Damond_2019_Pancreas dataset
#'
#' Obtain the Damond_2019_Pancreas dataset, which consists of three data
#' objects: single cell data, multichannel images and cell segmentation masks.
#' The data was obtained by imaging mass cytometry (IMC) of human pancreas
#' sections from donors with type 1 diabetes.
#'
#' @param data_type type of object to load, `images` for multichannel images or
#' `masks` for cell segmentation masks. Single cell data are retrieved using
#' either `sce` for the \code{SingleCellExperiment} format or `spe` for the
#' \code{SpatialExperiment} format.
#' @param full_dataset if FALSE (default), a subset corresponding to 100 images
#' is returned. If TRUE, the full dataset (corresponding to 845 images) is
#' returned. Due to memory limitations, this option is only available for
#' single cell data and masks, not for \code{data_type = "images"}.
#' @param version dataset version. By default, the latest version is returned.
#' @param metadata if FALSE (default), the data object selected in
#' \code{data_type} is returned. If TRUE, only the metadata associated with
#' this object is returned.
#' @param on_disk logical indicating whether images should be stored on disk
#' as \linkS4class{HDF5Array} objects (.h5 files) rather than in memory. This
#' setting applies when downloading \code{images} and \code{masks}.
#' @param h5FilesPath path to the directory where the .h5 files for on-disk
#' representation are stored. This path must be defined when
#' \code{on_disk = TRUE}. To store files on disk only temporarily, set
#' \code{h5FilesPath = getHDF5DumpDir()}.
#' @param force logical indicating if images should be overwritten when files
#' with the same name already exist on disk.
#'
#' @details
#' This is an Imaging Mass Cytometry (IMC) dataset from Damond et al. (2019):
#' \itemize{
#' \item \code{images} contains a hundred 38-channel
#' images in the form of a \linkS4class{CytoImageList} class object.
#' \item \code{masks} contains the cell segmentation
#' masks associated with the images, in the form of a
#' \linkS4class{CytoImageList} class object.
#' \item \code{sce} contains the single cell data extracted from the
#' multichannel images using the cell segmentation masks, as well as the
#' associated metadata, in the form of a
#' \linkS4class{SingleCellExperiment}. This represents a total of 252,059
#' cells x 38 channels.
#' \item \code{spe} same single cell data as for \code{sce}, but in the
#' \linkS4class{SpatialExperiment} format.
#' }
#'
#' All data are downloaded from ExperimentHub and cached for local re-use.
#'
#' Mapping between the three data objects is performed via variables located in
#' their metadata columns: \code{mcols()} for the \linkS4class{CytoImageList}
#' objects and \code{colData()} for the \linkS4class{SingleCellExperiment} and
#' \linkS4class{SpatialExperiment} objects. Mapping at the image level can be
#' performed with the \code{image_name} or \code{image_number} variables.
#' Mapping between cell segmentation masks and single cell data is performed
#' with the \code{cell_number} variable, the values of which correspond to the
#' intensity values of the \code{masks} object. For practical
#' examples, please refer to the "Accessing IMC datasets" vignette.
#'
#' This dataset is a subset of the complete Damond et al. (2019) dataset
#' comprising the data from three pancreas donors at different stages of type 1
#' diabetes (T1D). The three donors present clearly diverging characteristics
#' in terms of cell type composition and cell-cell interactions, which makes
#' this dataset ideal for benchmarking spatial and neighborhood analysis
#' algorithms. If \code{full_dataset = TRUE}, the full dataset (845 images from
#' 12 patients) is returned. This option is not available for multichannel
#' images.
#'
#' The \code{assays} slot of the \linkS4class{SingleCellExperiment} and
#' \linkS4class{SpatialExperiment} objects contains three assays:
#' \itemize{
#' \item \code{counts} contains raw mean ion counts per cell.
#' \item \code{exprs} contains arsinh-transformed counts, with cofactor 1.
#' \item \code{quant_norm} contains counts censored at the 99th percentile
#' and scaled 0-1.
#' }
#'
#' The marker-associated metadata, including antibody information and metal
#' tags, are stored in the \code{rowData} of the
#' \linkS4class{SingleCellExperiment} and \linkS4class{SpatialExperiment}
#' objects.
#'
#' The cell-associated metadata are stored in the \code{colData} of the
#' \linkS4class{SingleCellExperiment} and \linkS4class{SpatialExperiment}
#' objects. These metadata include cell types (in
#' \code{colData(sce)$cell_type}) and broader cell categories, such as
#' "immune" or "islet" cells (in \code{colData(sce)$cell_category}). In
#' addition, for cells located inside pancreatic islets, the islet they belong
#' to is indicated in \code{colData(sce)$islet_parent}. For cells not located
#' in islets, \code{islet_parent} is set to 0, but the spatially closest
#' islet can be identified with \code{colData(sce)$islet_closest}.
#'
#' The donor-associated metadata are also stored in the \code{colData} of the
#' \linkS4class{SingleCellExperiment} and \linkS4class{SpatialExperiment}
#' objects. For instance, the donors' IDs can be retrieved with
#' \code{colData(sce)$patient_id} and the donors' disease stage can be obtained
#' with \code{colData(sce)$patient_stage}.
#'
#' Neighborhood information, defined here as cells that are localized next to
#' each other, is stored as a \code{SelfHits} object in the \code{colPairs}
#' slot of the \linkS4class{SingleCellExperiment} and
#' \linkS4class{SpatialExperiment} objects.
#'
#' The three donors in the subset present the following characteristics:
#' \itemize{
#' \item \code{6126} is a non-diabetic donor, with large islets containing
#' many beta cells, severe infiltration of the exocrine pancreas with
#' myeloid cells but limited infiltration of islets.
#' \item \code{6414} is a donor with recent T1D onset (shortly after
#' diagnosis) showing partial beta cell destruction and mild infiltration
#' of islets with T cells.
#' \item \code{6180} is a donor with long-duration T1D (11 years after
#' diagnosis), showing near-total beta cell destruction and limited immune
#' cell infiltration in both the islets and the pancreas.
#' }
#' For information about other donors in the full dataset, please refer to the
#' Damond et al. publication.
#'
#' Dataset versions: a \code{version} argument can be passed to the function to
#' specify which dataset version should be retrieved.
#' \itemize{
#' \item \code{`v0`}: original version (Bioconductor <= 3.15).
#' \item \code{`v1`}: consistent object formatting across datasets.
#' }
#'
#' File sizes:
#' \itemize{
#' \item \code{`images`}: size in memory = 7.4 Gb, size on disk = 1.7 Gb.
#' \item \code{`masks`}: size in memory = 200 Mb, size on disk = 8.2 Mb.
#' \item \code{`sce`}: size in memory = 353 Mb, size on disk = 204 Mb.
#' \item \code{`spe`}: size in memory = 372 Mb, size on disk = 205 Mb.
#' \item \code{`sce_full`}: size in memory = 2.4 Gb, size on disk = 1.5 Gb.
#' \item \code{`spe_full`}: size in memory = 2.5 Gb, size on disk = 1.5 Gb.
#' \item \code{`masks_full`}: size in memory = 1.4 Gb,
#' size on disk = 60 Mb.
#' }
#'
#' When images are stored on disk, they must first be fully read into memory
#' before being written to disk. Downloading the data is therefore slower than
#' keeping them directly in memory. However, downstream analysis requires
#' less memory when images are stored on disk.
#'
#' Original source: Damond et al. (2019):
#' https://doi.org/10.1016/j.cmet.2018.11.014
#'
#' Original link to raw data, also containing the entire dataset:
#' https://data.mendeley.com/datasets/cydmwsfztj/2
#'
#' @return A \linkS4class{SingleCellExperiment} object with single cell data, a
#' \linkS4class{SpatialExperiment} object with single cell data, a
#' \linkS4class{CytoImageList} object containing multichannel images, or a
#' \linkS4class{CytoImageList} object containing cell segmentation masks.
#'
#' @author Nicolas Damond
#'
#' @references
#' Damond N et al. (2019).
#' A Map of Human Type 1 Diabetes Progression by Imaging Mass Cytometry.
#' \emph{Cell Metab} 29(3), 755-768.
#'
#' @examples
#' # Load single cell data
#' sce <- Damond_2019_Pancreas(data_type = "sce")
#' print(sce)
#'
#' # Display metadata
#' Damond_2019_Pancreas(data_type = "sce", metadata = TRUE)
#'
#' # Load masks on disk
#' library(HDF5Array)
#' masks <- Damond_2019_Pancreas(data_type = "masks", on_disk = TRUE,
#' h5FilesPath = getHDF5DumpDir())
#' print(head(masks))
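#'
#' # Illustrative sketch, not part of the original examples: map masks to
#' # single cell data at the image level, as described in the details section.
#' # Assumes the `sce` and `masks` objects loaded above and the documented
#' # `image_name`, `cell_number` and `exprs` names; the namespaced calls to
#' # S4Vectors and SummarizedExperiment are an assumption about what is
#' # installed alongside this package.
#' cur_image <- S4Vectors::mcols(masks)$image_name[1]
#' sce_image <- sce[, sce$image_name == cur_image]
#' # Cell identifiers that correspond to the pixel values of the matching mask
#' head(sce_image$cell_number)
#' # Dimensions of the arsinh-transformed counts for the cells of this image
#' dim(SummarizedExperiment::assay(sce_image, "exprs"))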
#'
#' @import cytomapper
#' @import SingleCellExperiment
#' @import methods
#' @importFrom utils download.file
#' @importFrom utils read.csv
#' @importFrom ExperimentHub ExperimentHub
#' @importFrom SpatialExperiment SpatialExperiment
#' @importFrom HDF5Array writeHDF5Array
#' @importFrom DelayedArray DelayedArray
#'
#' @export
Damond_2019_Pancreas <- function (
    data_type = c("sce", "spe", "images", "masks"),
    full_dataset = FALSE,
    version = "latest",
    metadata = FALSE,
    on_disk = FALSE,
    h5FilesPath = NULL,
    force = FALSE
) {
    available_versions <- c("v0", "v1")
    dataset_name <- "Damond_2019_Pancreas"
    dataset_version <- ifelse(version == "latest",
        utils::tail(available_versions, n=1), version)
    .checkArguments(data_type, metadata, dataset_version, available_versions,
        full_dataset, on_disk, h5FilesPath, force)
    cur_dat <- .loadDataObject(data_type, metadata, dataset_name,
        dataset_version, full_dataset, on_disk, h5FilesPath, force)
    return(cur_dat)
}