vignettes/CDSE2019.Rmd

---
title: "Bioconductor for Everyone: Exploring, Analyzing, and Visualizing Large Data Sets with R"
author:
- name: Martin Morgan
  affiliation: Roswell Park Comprehensive Cancer Center
output:
  BiocStyle::html_document
abstract: |
  We'll take a fast-paced tour through R and the software project I
  work on, Bioconductor (https://bioconductor.org), learning how to
  explore and visualize large cancer-related data sets. We'll work
  through two particular analyses. In the process, we'll learn some
  pretty significant new R skills. For instance, we will learn about
  formal classes for representing complex data, strategies for
  iteration and parallel processing, and accessing 'remote' resources
  accessible through web-based interfaces. This workshop should be
  interesting to people who know a bit of R, and want to learn more!
vignette: |
  %\VignetteIndexEntry{Bioconductor for Everyone}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r style, echo = FALSE, results = 'asis'}
knitr::opts_chunk$set(
    eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
    cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE"))
)
```

# Biology

Gene expression

RNA-seq

Single-cell technologies

# Large data

## What and how

What is it?

- Doesn't fit in memory
- Computationally expensive to process

Common use cases

- Query -- in this haystack, where is my needle
- Processing -- e.g., reduction to summary statistics

## Strategies

Queries

- e.g., of data bases

Processing

- Iteration & chunk-wise
- Distributed / parallel
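
A minimal sketch of both processing strategies, using only base R and the bundled `parallel` package. The toy matrix `m`, the `chunk_size`, and the `col_sums()` helper are illustrative assumptions, not part of the workshop data. Each chunk is reduced to a partial summary, and the partial summaries are then combined.

```r
## Toy data standing in for a large matrix (hypothetical sizes)
set.seed(1)
m <- matrix(rpois(1000 * 8, 10), nrow = 1000)

## Split the row indices into chunks of at most `chunk_size` rows
chunk_size <- 250
starts <- seq(1, nrow(m), by = chunk_size)
chunks <- lapply(starts, function(i) i:min(i + chunk_size - 1, nrow(m)))

## Helper: reduce one chunk to its column sums
col_sums <- function(idx) colSums(m[idx, , drop = FALSE])

## Iteration: process one chunk at a time, then combine partial results
total <- Reduce(`+`, lapply(chunks, col_sums))

## Parallel: same reduction, chunks distributed across forked workers
## (forking is not available on Windows, so fall back to one core there)
cores <- if (.Platform$OS.type == "windows") 1L else 2L
total_par <- Reduce(`+`, parallel::mclapply(chunks, col_sums, mc.cores = cores))

stopifnot(all.equal(total, colSums(m)), all.equal(total, total_par))
```

Only one chunk needs to be in memory at a time in the iterative version; the parallel version trades memory for speed.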

# Real-world: objects

## Objects

We'll set up some data to use (pay no attention to this!)

```{r setup-airway, message=FALSE}
dir <- tempdir()
if (!dir.exists(dir))
    dir.create(dir)
library(airway)
data(airway)
write.table(colData(airway), file.path(dir, "samples.csv"))
write.table(assay(airway), file.path(dir, "expression.csv"))
```

Related data elements

```{r}
samples <- read.table(file.path(dir, "samples.csv"))
samples
counts <- read.table(file.path(dir, "expression.csv"))
head(counts)
counts <- as.matrix(counts)
```

Coordinate data management

- separately 'managing' `samples` and `counts` is error-prone, e.g. subset
  `samples` but not `counts`, and the association of sample rows and count
  columns is disrupted.
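
The problem can be sketched with toy data (the `toy_samples` and `toy_counts` objects are made up for illustration): subsetting one object but not the other leaves the two out of sync.

```r
## Hypothetical sample annotations and count matrix
toy_samples <- data.frame(
    dex = c("trt", "untrt", "trt", "untrt"),
    row.names = paste0("S", 1:4)
)
toy_counts <- matrix(
    1:12, nrow = 3,
    dimnames = list(paste0("gene", 1:3), rownames(toy_samples))
)

## Subset the samples, but forget the counts...
trt <- toy_samples[toy_samples$dex == "trt", , drop = FALSE]
in_sync <- identical(rownames(trt), colnames(toy_counts))
in_sync    # FALSE: rows of `trt` no longer match columns of `toy_counts`

## ...the user has to remember to repeat the subset on the matrix
toy_counts_trt <- toy_counts[, rownames(trt), drop = FALSE]
stopifnot(identical(rownames(trt), colnames(toy_counts_trt)))
```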

```{r, message=FALSE}
library(SummarizedExperiment)
SummarizedExperiment()
```

An 'S4' object for coordinating 'assay' matrices with row and column annotations

```{r}
se <- SummarizedExperiment(
    assays = list(counts = counts),
    colData = samples
)
se
```

Classes allow the developer to make data access easy

- e.g., matrix-like 'interface'
- Accessors, so internal representation can be chosen for efficiency while
  user interface remains easy to use

```{r}
se$dex
se[, se$dex == "trt"]
```
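
To give a flavor of what a developer does behind the scenes, here is a much-simplified sketch of an S4 class with a validity check, accessors, and a matrix-like `[` method. The class `ToyExperiment` and the accessors `toyCounts()` and `toySamples()` are hypothetical names for illustration; this is not how `SummarizedExperiment` is actually implemented.

```r
library(methods)

## A class coordinating a count matrix with per-sample annotations
setClass(
    "ToyExperiment",
    representation(counts = "matrix", samples = "data.frame"),
    validity = function(object) {
        if (ncol(object@counts) != nrow(object@samples))
            "ncol(counts) must equal nrow(samples)"
        else
            TRUE
    }
)

## Accessors hide the internal representation from the user
setGeneric("toyCounts", function(x) standardGeneric("toyCounts"))
setMethod("toyCounts", "ToyExperiment", function(x) x@counts)
setGeneric("toySamples", function(x) standardGeneric("toySamples"))
setMethod("toySamples", "ToyExperiment", function(x) x@samples)

## Matrix-like subsetting keeps counts and samples coordinated
setMethod("[", "ToyExperiment", function(x, i, j, ..., drop = TRUE) {
    if (missing(i)) i <- seq_len(nrow(x@counts))
    if (missing(j)) j <- seq_len(ncol(x@counts))
    new(
        "ToyExperiment",
        counts = x@counts[i, j, drop = FALSE],
        samples = x@samples[j, , drop = FALSE]
    )
})

toy <- new(
    "ToyExperiment",
    counts = matrix(
        1:12, nrow = 3,
        dimnames = list(paste0("gene", 1:3), paste0("S", 1:4))
    ),
    samples = data.frame(
        dex = c("trt", "untrt", "trt", "untrt"),
        row.names = paste0("S", 1:4)
    )
)
toy_trt <- toy[, toySamples(toy)$dex == "trt"]
```

A single subset operation now updates both the counts and the sample annotations.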

Data manipulation, e.g., non-zero rows

```{r}
idx <- rowSums(assay(se)) > 0
se[idx,]
```

Simple visualization

```{r}
dotchart(colSums(assay(se)), xlab = "Library size")
```

```{r}
expr <- rowSums(assay(se))
plot(density(log(expr[expr > 0])), xlab = "log expression")
```

# Tidy data and the tidyverse

Not 'better', but different

Challenges in base R and formal objects

- Many different types -- vector, data.frame, matrix, SummarizedExperiment --
  all with different operations
- Each function is somehow different, e.g., `[` applied to a matrix usually (!)
  returns a matrix, whereas `rowSums()` returns a vector

'tidy' analysis

- Consistent representation of data -- data.frame
- Consistent methods

  - First argument is always 'the data'
  - Tidy functions are always 'endomorphisms' -- the class of the input data is
    the same as the class of the result
  - Only a few standard 'verbs' -- `filter()`, `select()`, `group_by()`,
    `count()`, `summarize()`, `mutate()`, ...

```{r, message = FALSE}
library(dplyr)
library(tibble)
```

The pipe, `%>%`

- Base R: reasoning 'inside out'

    ```{r}
    x <- runif(10, 1, 5)
    log(ceiling(x))
    ```

- Procedural R

    ```
    x1 <- ceiling(x)
    log(x1)
    ```

- Tidy R -- use `%>%` so that operations read left-to-right

    ```
    x %>% ceiling() %>% log()
    ```
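
As an aside, R 4.1.0 and later also provide a native pipe, `|>`, that needs no package. A small sketch, assuming a sufficiently recent R, showing that the nested and piped forms give the same result:

```r
x <- runif(10, 1, 5)

nested <- log(ceiling(x))            # base R, read 'inside out'
piped <- x |> ceiling() |> log()     # native pipe, read left-to-right

stopifnot(identical(nested, piped))
```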

Data representation

- A `tibble` is like a user-friendly `data.frame`

    ```{r}
    as_tibble(mtcars)
    ```

- Row names are just another column

    ```{r}
    sample <- tibble::rownames_to_column(
        as.data.frame(colData(se)),
        var="Accession"
    ) %>% as_tibble()
    ```


- Data is represented in 'long-form'

    ```{r}
    count <- reshape::melt(assay(se), c("Feature", "Accession")) %>%
        as_tibble()
    colnames(count)[3] <- "Count"
    count
    ```
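
For intuition about what 'long-form' means, the same reshaping can be sketched in base R on a toy matrix (the matrix `toy` and its dimension names are made up for illustration). Each cell of the matrix becomes one row of the result.

```r
## A 3 x 2 matrix with named dimensions
toy <- matrix(
    1:6, nrow = 3,
    dimnames = list(
        Feature = paste0("gene", 1:3),
        Accession = c("S1", "S2")
    )
)

## One row per (Feature, Accession) pair, with the cell value in 'Count'
long <- as.data.frame(
    as.table(toy),
    responseName = "Count",
    stringsAsFactors = FALSE
)
long
```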

Endomorphism: tibble in, tibble out

```{r}
cars <- rownames_to_column(mtcars, "make") %>%
    as_tibble()
cars %>% filter(cyl >= 6)
cars %>% select(make, mpg, cyl, disp)
```

Generally, playing well with other packages in the 'tidyverse'

```{r setup-ggplot2, message=FALSE}
library(ggplot2)
```

```{r}
cars %>% ggplot(aes(x = factor(cyl), y = mpg)) + geom_boxplot()
```

# Data base

## Data base representation

Create a SQLite data base

```{r setup-RSQLite, message = FALSE}
library(RSQLite)
airway_db <- file.path(dir, "airway.sqlite")
con <- dbConnect(SQLite(), airway_db)
```

Add a sample table

```{r}
sample <- tibble::rownames_to_column(
    as.data.frame(colData(se)),
    var="Accession"
)
dbWriteTable(con, "sample", sample)
```

Add a count table, in 'tidy' form

```{r}
count <- reshape::melt(assay(se), c("Feature", "Accession"))
colnames(count)[3] <- "Count"
dbWriteTable(con, "count", count)
```

Extract data using SQL statements

```{r}
dbListTables(con)
dbGetQuery(con, "SELECT * FROM Sample;")
dbGetQuery(con, "SELECT Accession, cell, dex FROM Sample;")
dbGetQuery(con, "SELECT * FROM Count LIMIT 3;")
dbDisconnect(con)
```
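
Finding the 'needle' usually means adding a `WHERE` clause. A sketch using a throw-away in-memory data base and a made-up `toy_sample` table, assuming the DBI and RSQLite packages are installed; the `params =` argument fills in the `?` placeholder, which is safer than pasting values into the SQL string.

```r
library(DBI)

## Hypothetical in-memory data base, independent of 'airway.sqlite'
toy_con <- dbConnect(RSQLite::SQLite(), ":memory:")
toy_sample <- data.frame(
    Accession = c("S1", "S2", "S3"),
    dex = c("trt", "untrt", "trt")
)
dbWriteTable(toy_con, "sample", toy_sample)

## Parameterized query: '?' is replaced by the value in `params`
res <- dbGetQuery(
    toy_con,
    "SELECT Accession FROM sample WHERE dex = ?;",
    params = list("trt")
)
dbDisconnect(toy_con)
res
```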

## dbplyr

Open data base

```{r, message = FALSE}
library(dplyr)
library(dbplyr)
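## `src_sqlite()` is deprecated in recent dplyr; `tbl()` works on a DBI connection
src <- DBI::dbConnect(RSQLite::SQLite(), airway_db)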
src
```

Manipulations on "Sample" table -- standard verbs, plus `collect()`

```{r}
tbl(src, "Sample")
tbl(src, "Sample") %>% select(Accession, cell, dex)
tbl(src, "Sample") %>% filter(dex == "trt") %>% collect()
```

Manipulations on "Count" table

```{r}
tbl(src, "Count")
tbl(src, "Count") %>%
    group_by(Accession) %>%
    summarize(library_size = SUM(Count)) %>%
    collect()
```

Relations between tables

```{r}
left_join(tbl(src, "Count"), tbl(src, "Sample"))
left_join(
    tbl(src, "Count"),
    tbl(src, "Sample") %>% select(Accession, cell, dex)
)
```
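
The semantics of a left join can be sketched in base R with `merge()` on toy data frames (`toy_count` and `toy_sample` are made up for illustration): every row of the left table is kept, and rows without a match get `NA` in the columns from the right table.

```r
toy_count <- data.frame(
    Feature = c("gene1", "gene1", "gene1"),
    Accession = c("S1", "S2", "S3"),
    Count = c(10L, 0L, 5L)
)
toy_sample <- data.frame(
    Accession = c("S1", "S2"),      # note: no annotation for S3
    dex = c("trt", "untrt")
)

## all.x = TRUE keeps every row of the left table, i.e., a left join
joined <- merge(toy_count, toy_sample, by = "Accession", all.x = TRUE)
joined
```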

Library size

- Find column (Accession) counts

    ```{r}
    tbl(src, "Count") %>%
        group_by(Accession) %>%
        summarize(library_size = SUM(Count))
    ```

Filter rows with non-zero counts

- Find rows (Feature) with non-zero total counts

    ```{r}
    keep <- tbl(src, "Count") %>%
        group_by(Feature) %>%
        summarize(row_sum = SUM(Count)) %>%
        filter(row_sum > 0) %>%
        select(Feature)
    ```

- `left_join()` to keep only these rows

    ```{r}
    left_join(keep, tbl(src, "Count"))
    ```

## Aside: SRAdb

```{r setup-SRAdb, message = FALSE, eval = FALSE}
library(BiocFileCache)
if (nrow(bfcquery(query="SRAdb", field = "rname")) == 0L) {
    fl <- SRAdb::getSRAdbFile(tempdir())
    bfcadd(rname = "SRAdb", fpath = fl, action = "move")
}
```

```{r, eval = FALSE}
fl <- BiocFileCache::bfcrpath(rnames = "SRAdb")
src <- DBI::dbConnect(RSQLite::SQLite(), fl)
tbl(src, "study")
tbl(src, "study") %>%
    filter(study_title %like% "%ovarian%")
```

# Other on-disk or remote representations

Data bases are appropriate for 'relational' data.

The 'big' part of scientific data is often not relational

- E.g., an expression _matrix_

Access patterns for databases and scientific data often differ.

- database: query
- scientific data: process all data

Strategy for processing data: iterate through the file

- In Python, other languages: iterate one record (e.g., sample) at a time.
- In R: iterate in chunks to allow vector processing.
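
A sketch of chunk-wise iteration through a file, using only base R. The file, its contents, and the `chunk_size` are made up for illustration; in practice the chunk would be as large as memory comfortably allows. An open connection remembers its position, so each `readLines()` call picks up where the previous one stopped.

```r
## Write a small tab-delimited file to stand in for a large one
fl <- tempfile(fileext = ".tsv")
toy <- matrix(1:40, nrow = 10, dimnames = list(NULL, paste0("S", 1:4)))
write.table(toy, fl, sep = "\t", row.names = FALSE, quote = FALSE)

con <- file(fl, "r")
header <- strsplit(readLines(con, n = 1), "\t")[[1]]
totals <- setNames(numeric(length(header)), header)

chunk_size <- 3
repeat {
    lines <- readLines(con, n = chunk_size)    # next chunk of records
    if (length(lines) == 0)
        break                                  # end of file
    chunk <- read.table(text = lines, sep = "\t", col.names = header)
    totals <- totals + colSums(chunk)          # vectorized work on the chunk
}
close(con)

stopifnot(all.equal(totals, colSums(toy)))
```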

## hdf5

```{r setup-rhdf5, message = FALSE}
library(rhdf5)
```

Fast 'block-wise' access
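
A sketch of block-wise access, assuming the rhdf5 package is installed. The file and the data set name `"counts"` are made up for illustration. The `index =` argument of `h5read()` selects a block, so only that part of the data set is read from disk.

```r
library(rhdf5)

## Write a small matrix to a temporary HDF5 file
h5_file <- tempfile(fileext = ".h5")
m <- matrix(1:20, nrow = 5)
h5createFile(h5_file)
h5write(m, h5_file, "counts")

## Read rows 2 and 3, all columns (NULL means 'everything' in that dimension)
block <- h5read(h5_file, "counts", index = list(2:3, NULL))

stopifnot(all(block == m[2:3, ]))
```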

## TENxBrainData

```{r setup-TENxBrainData, message = FALSE}
library(TENxBrainData)
tenx <- TENxBrainData()
```

Illusions...

```{r}
log(1 + assay(tenx))
```

Subset

```{r}
tenx_subset <- tenx[, sample(ncol(tenx), 200)]
count <- as.matrix(assay(tenx_subset))
dotchart(
    unname(colSums(count)),
    xlab = "Library size"
)
hist(log(1 + rowSums(count)))
```

Actually, though, chunk-wise data processing is transparent

```{r}
dotchart(
    unname(colSums(assay(tenx_subset))),
    xlab = "Library size"
)
```

## restfulSE

# End matter

## Acknowledgements

A portion of this work is supported by the Chan Zuckerberg Initiative
DAF, an advised fund of Silicon Valley Community Foundation.

Research reported in this presentation was supported by the NHGRI and
NCI of the National Institutes of Health under award numbers
U41HG004059, U24CA180996, and U24CA232979. The content is solely the
responsibility of the authors and does not necessarily represent the
official views of the National Institutes of Health.

This work was performed on behalf of the SOUND Consortium and funded
under the EU H2020 Personalizing Health and Care Program, Action
contract number 633974.


## Session info {.unnumbered}

```{r sessionInfo, echo=FALSE}
sessionInfo()
```