/
customFA.Rmd
170 lines (144 loc) · 6.99 KB
/
customFA.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
---
title: "Customizable Factor Analyses"
author: "David Gerard"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Customizable Factor Analyses}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
bibliography: "vicar.bib"
---
This vignette describes how to build your own factor analysis functions for use in vicar's primary functions `vruv4`, `ruv3`, `mouthwash`, `backwash`, and `ruvb`. These methods are described in detail in @gerard2020empirical and @gerard2021unifying.
# `fa_func` in `vruv4`, `ruv3`, `mouthwash`, and `backwash`
A factor analysis is a list of three point estimates {$\alpha$, $Z$, $\Sigma$} from the following model
$$
Y = Z\alpha^T + E,
$$
where $Y$ is $n$ by $p$, $Z$ is $n$ by $r$, $\alpha$ is $p$ by $r$, and $E$ contains independent Gaussian errors with $p$ by $p$ diagonal column covariance matrix $\Sigma$. The user-specified function `fa_func` is assumed to return estimates from this model.
`fa_func` should take as input the following elements
1. `Y`: A matrix of numerics. For the rest of this vignette, we say that `Y` has $n$ rows and $p$ columns.
2. `r`: The rank of the factor analysis.
3. Additional arguments passed to `fa_func` in `fa_args`.
`fa_func` should output a list with the following elements
1. `alpha`: A $p$ by `r` matrix. The estimated factors.
2. `Z`: An $n$ by `r` matrix. The estimated loadings.
3. `sig_diag`: A $p$-length numeric vector with non-negative entries. The estimated column-wise variances.
4. Any additional output the user desires. However, this output will not be used in any confounder adjustment functions.
Note that `alpha` and `Z` should still be matrices even when $r$ is 1.
Because of the way that the data, $Y$, were derived, only certain factor analyses make sense. That is, the factor analysis should be left orthogonally equivariant. What this means is that if $Z_1$, $\alpha_1$, and $\Sigma_1$ are estimates from $Y$ and $Z_2$, $\alpha_2$, and $\Sigma_2$ are estimates from $QY$ for any orthogonal matrix $Q$ (where $Q^TQ = QQ^T = I_n$), then
$$
Z_1\alpha_1 = Q^TZ_2\alpha_2
$$
and
$$
\Sigma_1 = \Sigma_2.
$$
I've included a function, `fa_tester`, to test whether a user-provided factor analysis satisfies all of these conditions.
The default factor analysis for all methods is `pca_naive`. We can compare its performance to a stupid estimator that always returns the identity matrix padded with zeros for the estimate of $\alpha$, the first $r$ left singular vectors of $Y$ for the estimate of $Z$, and a vector of ones for `sig_diag`.
```{r}
library(vicar)
library(ggplot2)
fa_stupid <- function(Y, r) {
p <- ncol(Y)
n <- nrow(Y)
svY <- svd(Y)
alpha <- rbind(diag(r), matrix(0, nrow = p - r, ncol = r))
Z <- svY$u[, 1:r, drop = FALSE]
sig_diag <- rep(1, p)
return(list(alpha = alpha, Z = Z, sig_diag = sig_diag))
}
```
`fa_stupid` does indeed technically work as a factor analysis, as demonstrated by `fa_tester`
```{r}
tout <- fa_tester(fa_stupid)
tout
```
Let's generate some sample data where the covariate of interest is the second column of $X$.
```{r}
## Generate data and controls ---------------------------------------------
set.seed(1327)
n <- 13
p <- 1001
k <- 2
q <- 3
is_null <- rep(FALSE, length = p)
is_null[1:501] <- TRUE
ctl <- rep(FALSE, length = p)
ctl[1:201] <- TRUE
X <- matrix(stats::rnorm(n * q), nrow = n)
B <- matrix(stats::rnorm(q * p), nrow = q)
B[2, is_null] <- 0
Z <- X %*% matrix(stats::rnorm(q * k), nrow = q) +
matrix(rnorm(n * k), nrow = n)
A <- matrix(stats::rnorm(k * p), nrow = k)
E <- matrix(stats::rnorm(n * p, sd = 1 / 2), nrow = n)
Y <- X %*% B + Z %*% A + E
```
And let's compare using `fa_stupid` against the default `pca_naive`.
```{r, fig.width=7}
ruv4out <- vruv4(Y = Y, X = X, ctl = ctl, cov_of_interest = 2, include_intercept = FALSE)
ruv4out_stupid <- vruv4(Y = Y, X = X, ctl = ctl, cov_of_interest = 2, include_intercept = FALSE,
fa_func = fa_stupid)
order_4p <- order(ruv4out$pvalues)
order_4p_stupid <- order(ruv4out_stupid$pvalues)
fpr4 <- cumsum(is_null[order_4p]) / sum(is_null)
tpr4 <- cumsum(!is_null[order_4p]) / sum(!is_null)
fpr4stupid <- cumsum(is_null[order_4p_stupid]) / sum(is_null)
tpr4stupid <- cumsum(!is_null[order_4p_stupid]) / sum(!is_null)
dfdat <- data.frame(fpr = c(fpr4, fpr4stupid), tpr = c(tpr4, tpr4stupid),
fa = c(rep("PCA", p), rep("Stupid", p)))
ggplot(data = dfdat, mapping = aes(x = fpr, y = tpr, col = fa)) +
geom_line() +
geom_abline(slope = 1, intercept = 0, col = 1, lty = 2) +
theme_bw() +
ggtitle("ROC Curve") + xlab("False Positive Rate") +
ylab("True Positive Rate") +
scale_color_discrete(guide = guide_legend(title = "Factor Analysis"))
```
# `fa_func` in `ruvb`
Providing a user-specified `fa_func` for `ruvb` is more complicated. We are assuming a more complicated model
\begin{align}
\left(
\begin{array}{cc}
Y_{21} & Y_{22}\\
Y_{31} & Y_{32}
\end{array}
\right)
=
\left(
\begin{array}{cc}
\Omega_{21} & \Omega_{22}\\
\Omega_{31} & \Omega_{32}
\end{array}
\right)
+
E,
\end{align}
where $Y_{22}$ is _unobserved_, $\Omega$ has some low-dimensional structure, and $E$ has Gaussian errors with possibly dependent columns. `fa_func` returns an array of posterior draws from from $[Y_{22} | Y_{21}, Y_{31}, Y_{32}]$ when fitting some user-specified Bayesian model.
Let $n$ be the number of samples, $p$ be the number of genes, $m$ be the number of control genes, and $k$ be the number of covariates of interest. `fa_func` should take as input the following elements.
1. `Y21` The upper left matrix of numerics. This is of dimension $k$ by $m$.
2. `Y31` The lower left matrix of numerics. This is of dimension $n - k$ by $m$.
3. `Y32` The lower right matrix of numerics. This is of dimension $n - k$ by $p - m$.
4. `k`: The rank of the factor analysis when assuming a factor model for $\Omega$. `fa_func` doesn't need to use it if you aren't assuming a factor model, but `fa_func` needs to have it as an argument.
`fa_func` should return
1. `Y22_array`, a three-way array of dimension $k$ by $p - m$ by the number of posterior draws.
2. Any other parameters you wish to return. However, they won't be used in confounder adjustment.
The default value for `fa_func` is `bfa_gs_linked`. We can test a factor analysis for compatibility with `ruvb` using the function `fa_tester_ruvb`.
```{r}
fa_tester_ruvb(fa_func = bfa_gs_linked, fa_args = list(nsamp = 20, display_progress = FALSE))
```
Here's another technically OK (though stupid) factor analysis that just returns an array of 1's
```{r}
fa_stupid_ruvb <- function(Y21, Y31, Y32, k) {
dumb_est <- array(1, dim = c(nrow(Y21), ncol(Y32), 30))
return(list(Y22_array = dumb_est))
}
fa_tester_ruvb(fa_func = fa_stupid_ruvb)
```
We can use `fa_stupid_ruvb` on the data in the previous section. But because it returns the same value every posterior draw, we obtain unhelpful lfsr's of 0 for all points.
```{r}
ruvbout <- ruvb(Y = Y, X = X, ctl = ctl, fa_func = fa_stupid_ruvb)
graphics::plot(table(ruvbout$lfsr2), type = "h", ylab = "lfsr")
```
# References