Skip to content

Commit

Permalink
Add new proper vignettes (#35)
Browse files Browse the repository at this point in the history
- moved statistical examples into their own
  vignettes
- they will now be rendered by `pkgdown`
- simplifies `README`
- four new vignettes:
  - loading-and-wrangling
  - binary-classification
  - linear-regression
  - two-group-comparison
- fixes #35
  • Loading branch information
stufield committed Apr 5, 2023
1 parent dc7bff9 commit a60a5dd
Show file tree
Hide file tree
Showing 10 changed files with 705 additions and 5 deletions.
1 change: 1 addition & 0 deletions DESCRIPTION
Original file line number Diff line number Diff line change
Expand Up @@ -37,6 +37,7 @@ Suggests:
Biobase,
ggplot2,
knitr,
purrr,
recipes,
rmarkdown,
spelling,
Expand Down
29 changes: 29 additions & 0 deletions _pkgdown.yml
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,35 @@ articles:
contents:
- SomaDataIO

- title: Loading and Wrangling
navbar: ~
desc: >
How to load and manipulate a 'SomaScan' flat text file into
and R environment.
contents:
- loading-and-wrangling

- title: Two-group Comparison
navbar: ~
desc: >
Typical two-group comparison of 'SomaScan' data.
contents:
- two-group-comparison

- title: Binary Classification
navbar: ~
desc: >
Typical binary classification of 'SomaScan' data.
contents:
- binary-classification

- title: Linear Regression
navbar: ~
desc: >
Typical linear regression of continuous 'SomaScan' data.
contents:
- linear-regression

reference:
- title: Load an ADAT
desc: >
Expand Down
2 changes: 2 additions & 0 deletions vignettes/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
*.html
*.R
21 changes: 16 additions & 5 deletions vignettes/SomaDataIO.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@ vignette: >
```{r setup, include = FALSE}
library(SomaDataIO)
knitr::opts_chunk$set(
echo = TRUE,
collapse = TRUE,
comment = "#>"
)
Expand All @@ -20,7 +21,8 @@ knitr::opts_chunk$set(
This document accompanies the `SomaDataIO` R package, which loads
and exports 'SomaScan' data via the SomaLogic Operating Co., Inc.
proprietary text file called an ADAT (`*.adat`).
For file format see [here](https://github.com/SomaLogic/SomaLogic-Data/blob/master/README.md).
For file format see
[here](https://github.com/SomaLogic/SomaLogic-Data/blob/master/README.md).
The package also exports auxiliary functions for manipulating, wrangling,
and extracting relevant information from an ADAT object once in memory.
Basic familiarity with the R environment is assumed, as is the ability to install
Expand All @@ -42,19 +44,28 @@ contributed packages from the Comprehensive R Archive Network (CRAN).
+ `?SeqId` analyte (feature) matching.
+ `dplyr` and `tidyr` verb S3 methods for the `soma_adat` class.
+ `?rownames` helpers that do not break `soma_adat` attributes.
+ please see vignette `vignette("loading-and-wrangling")`

* Exporting data (Output)
+ write out a `soma_adat` object as a `*.adat` text file.


----------------------

## Workflows and Analysis (TODO)
## Workflows and Analysis
This section will become more fleshed out in future versions of
`SomaDataIO`
`SomaDataIO`. In the meantime, below are 3 examples of typical
primary statistical analyses that are commonly performed on
'SomaScan' data:

#### In the meantime please see the package
[README](https://github.com/SomaLogic/SomaDataIO/blob/main/README.md)
- Two-group comparison (e.g. differential expression) via *t*-test
+ see vignette `vignette("two-group-comparison")`

- Binary classification
+ see vignette `vignette("binary-classification")`

- Linear regression
+ see vignette `vignette("linear-regression")`


---------------------
Expand Down
151 changes: 151 additions & 0 deletions vignettes/binary-classification.Rmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,151 @@
---
title: "Binary Classification"
author: "Stu Field"
output:
rmarkdown::html_vignette:
fig_caption: yes
ratio: '9:16'
fontsize: 12pt
vignette: >
%\VignetteIndexEntry{Binary Classification}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---

```{r setup, echo = FALSE, results = FALSE, message = FALSE}
options(width = 101)
#Sys.setlocale("LC_COLLATE", "C")
Sys.setlocale("LC_COLLATE", "en_US.UTF-8") # ensure common sorting envir
library(SomaDataIO)
library(dplyr)
library(tidyr)
library(purrr)
knitr::opts_chunk$set(
echo = TRUE,
collapse = TRUE,
comment = "#>",
fig.path = "figures/classify-"
)
```


----------------


# Binary classification (via logistic regression)


Although targeted statistical analyses are beyond the scope of
the `SomaDataIO` package, below is an example analysis
that typical users/customers would perform on 'SomaScan' data.

It is not intended to be a definitive guide in statistical
analysis and existing packages do exist in the `R` ecosystem that perform
parts or extensions of these techniques. Many variations of the workflow
below exist, however the framework highlights how one could perform standard
_preliminary_ analyses on 'SomaScan' data.


## Data Preparation
```{r data-prep}
# the `example_data` object package data
dim(example_data)
table(example_data$SampleType)
cs <- function(.x) { # .x = numeric vector
out <- .x - mean(.x) # center
out / sd(out) # scale
}
# Prepare data set for analysis
cleanData <- example_data |>
filter(SampleType == "Sample") |> # rm control samples
drop_na(Sex) |> # rm NAs if present
log10() |> # log10-transform (Math Generic)
mutate(Group = as.numeric(factor(Sex)) - 1) |> # map Sex -> 0/1
modify_at(getAnalytes(example_data), cs)
table(cleanData$Sex)
table(cleanData$Group) # F = 0; M = 1
```

## Set up Train/Test Data

```{r train-test}
# idx = hold-out
# seed resulting in 50/50 class balance
idx <- withr::with_seed(3, sample(1:nrow(cleanData), size = nrow(cleanData) - 50))
train <- cleanData[idx, ]
test <- cleanData[-idx, ]
# assert no overlap
isTRUE(
all.equal(intersect(rownames(train), rownames(test)), character(0))
)
```


## Logistic Regression
We use the `cleanData`, `train`, and `test` data objects from above.

### Predict Sex
```{r logreg-tbl}
LR_tbl <- getAnalyteInfo(train) |>
select(AptName, SeqId, Target = TargetFullName, EntrezGeneSymbol, UniProt) |>
mutate(
formula = map(AptName, ~ as.formula(paste("Group ~", .x))), # create formula
model = map(formula, ~ stats::glm(.x, data = train, family = "binomial", model = FALSE)), # fit glm()
beta_hat = map_dbl(model, ~ coef(.x)[2L]), # pull out coef Beta
p.value = map2_dbl(model, AptName, ~ {
summary(.x)$coefficients[.y, "Pr(>|z|)"] }), # pull out p-values
fdr = p.adjust(p.value, method = "BH") # FDR correction multiple testing
) |>
arrange(p.value) |> # re-order by `p-value`
mutate(rank = row_number()) # add numeric ranks
LR_tbl
```


### Fit Model | Calculate Performance

Next, select features for the model fit. We have a good idea of reasonable `Sex`
markers from prior knowledge (`CGA*`), and fortunately many of these are highly
ranked in `LR_tbl`. Below we fit a 4-marker logistic regression model from
cherry-picked gender-related features:

```{r fit-logreg}
# AptName is index key between `LR_tbl` and `train`
feats <- LR_tbl$AptName[c(1L, 3L, 5L, 7L)]
form <- as.formula(paste("Group ~", paste(feats, collapse = "+")))
fit <- glm(form, data = train, family = "binomial", model = FALSE)
pred <- tibble(
true_class = test$Sex, # orig class label
pred = predict(fit, newdata = test, type = "response"), # prob. 'Male'
pred_class = ifelse(pred < 0.5, "F", "M"), # class label
)
conf <- table(pred$true_class, pred$pred_class, dnn = list("Actual", "Predicted"))
tp <- conf[2L, 2L]
tn <- conf[1L, 1L]
fp <- conf[1L, 2L]
fn <- conf[2L, 1L]
# Confusion matrix
conf
# Classification metrics
tibble(Sensitivity = tp / (tp + fn),
Specificity = tn / (tn + fp),
Accuracy = (tp + tn) / sum(conf),
PPV = tp / (tp + fp),
NPV = tn / (tn + fn)
)
```


---------------------


Created by [Rmarkdown](https://github.com/rstudio/rmarkdown)
(v`r utils::packageVersion("rmarkdown")`) and `r R.version$version.string`.
Binary file added vignettes/figures/linear-reg-linreg-plot-1.png
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added vignettes/figures/two-group-ggplot-boxes-1.png
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.

0 comments on commit a60a5dd

Please sign in to comment.