---
title: "Sparse Matrices with `funrar`"
author: "Matthias Greni\u00e9"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Sparse Matrices with 'funrar'}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
# Sparse Matrices Usefulness
When a matrix contains a lot of zeros, there is no need to store all of its values, only the non-zero ones. For example, the following matrix:
\begin{equation}
M =
\bordermatrix{
~ & \text{sp. A} & \text{sp. B} & \text{sp. C} \cr
\text{site 1} & 1 & 0 & 0 \cr
\text{site 2} & 0 & 1 & 1 \cr
\text{site 3} & 0 & 1 & 0 \cr}
\end{equation}
can be stored as a table $S$ that lists only the non-zero values, with row index $i$ and column index $j$:
\begin{equation}
S =
\begin{matrix}
i & j & \text{value} \\
\text{site 1} & \text{sp. A} & 1 \\
\text{site 2} & \text{sp. B} & 1 \\
\text{site 3} & \text{sp. B} & 1 \\
\text{site 2} & \text{sp. C} & 1
\end{matrix}
\end{equation}
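For illustration, here is a minimal sketch of this triplet storage in R using the `Matrix` package (the object name `M` and its dimension names are only chosen to mirror the example above):

```{r tripletSketch}
library(Matrix)

# Build the example matrix M directly as a sparse matrix
M = Matrix(c(1, 0, 0,
             0, 1, 1,
             0, 1, 0),
           nrow = 3, ncol = 3, byrow = TRUE, sparse = TRUE,
           dimnames = list(paste0("site ", 1:3), paste0("sp. ", c("A", "B", "C"))))

# summary() lists only the non-zero cells as (i, j, value) triplets,
# like the S matrix above
summary(M)
```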
If your site-species matrix is big, it likely contains a lot of zeros: it is very unlikely that all species are present at all sites when looking at thousands of species and hundreds of sites. The sparse matrix notation can then save a lot of space (compare the size of $S$ to that of $M$).
```{r sparseMatExample}
# Generate a matrix with 1000 species and 200 sites
my_mat = matrix(sample(c(0, 0, 0, 0, 0, 0, 1), replace = TRUE, size = 200000),
ncol = 1000, nrow = 200)
colnames(my_mat) = paste0("sp", 1:ncol(my_mat))
rownames(my_mat) = paste0("site", 1:nrow(my_mat))
my_mat[1:5, 1:5]
```
`funrar` lets you use sparse matrices directly, as implemented in the `Matrix` package. When your matrix is mostly filled with zeros, computations can be quicker with sparse matrices. To know whether you should use them, you can compute the filling of your matrix, i.e. the proportion of non-zero cells:
```{r estimateFilling}
filling = 1 - sum(my_mat == 0)/(ncol(my_mat)*nrow(my_mat))
filling
```
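Equivalently, the `Matrix` package provides `nnzero()` to count non-zero entries; a short sketch of the same computation:

```{r fillingNnzero}
# Proportion of non-zero cells, counted with Matrix::nnzero()
Matrix::nnzero(my_mat) / length(my_mat)
```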
To convert a regular matrix to a sparse matrix, you can use `as(my_mat, "sparseMatrix")`:
```{r convertSparseMat}
library(Matrix)
sparse_mat = as(my_mat, "sparseMatrix")
is(my_mat, "sparseMatrix")
is(sparse_mat, "sparseMatrix")
```
This conversion completely changes the internal structure of the object, as well as the amount of RAM it uses:
```{r memoryUsage}
# Regular Matrix
str(my_mat)
print(object.size(my_mat), units = "Kb")
# Sparse Matrix from 'Matrix' package
str(sparse_mat)
print(object.size(sparse_mat), units = "Kb")
```
Sparse matrices reduce the amount of RAM needed to store them: the more zeros a matrix contains, the more RAM is saved by converting it to a sparse matrix.
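To illustrate this, here is a quick sketch (the matrices `denser_mat` and `sparser_mat` below are generated only for this comparison) contrasting the size of the sparse representation at two different fillings:

```{r sizeVsFilling}
# ~50% non-zero cells
denser_mat = matrix(sample(c(0, 1), replace = TRUE, size = 200000),
                    ncol = 1000, nrow = 200)
# ~1% non-zero cells
sparser_mat = matrix(sample(c(rep(0, 99), 1), replace = TRUE, size = 200000),
                     ncol = 1000, nrow = 200)

# Both regular matrices occupy the same amount of RAM,
# but the sparser one shrinks much more once converted
print(object.size(as(denser_mat, "sparseMatrix")), units = "Kb")
print(object.size(as(sparser_mat, "sparseMatrix")), units = "Kb")
```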
## Benchmark
We can now compare the performance of rarity index computations on a sparse matrix versus a regular one, using the popular [`microbenchmark`](https://cran.r-project.org/package=microbenchmark) R package:
```{r microBenchDist, eval = FALSE}
library(funrar)
# Get a table of traits
trait_df = data.frame(trait = runif(ncol(my_mat), 0, 1))
rownames(trait_df) = paste0("sp", 1:ncol(my_mat))
# Compute distance matrix
dist_mat = compute_dist_matrix(trait_df)
if (requireNamespace("microbenchmark", quietly = TRUE)) {
microbenchmark::microbenchmark(
regular = distinctiveness(my_mat, dist_mat),
sparse = distinctiveness(sparse_mat, dist_mat))
}
```
### Systematic benchmark
We generate matrices with different filling rates and compare computation speed on regular versus sparse matrices.
```{r fillingInfluence}
# Generate a 0/1 matrix with, on average, 'n_zero' zeros per one,
# i.e. a filling of 1 / (n_zero + 1)
generate_matrix = function(n_zero = 5, nrow = 200, ncol = 1000) {
  matrix(sample(c(rep(0, n_zero), 1), replace = TRUE, size = nrow*ncol),
         ncol = ncol, nrow = nrow)
}

# Proportion of non-zero cells in a matrix
mat_filling = function(my_mat) {
  sum(my_mat != 0)/(ncol(my_mat)*nrow(my_mat))
}

# Return both the regular and the sparse version of a generated matrix
sparse_and_mat = function(n_zero) {
  my_mat = generate_matrix(n_zero)
  colnames(my_mat) = paste0("sp", 1:ncol(my_mat))
  rownames(my_mat) = paste0("site", 1:nrow(my_mat))

  sparse_mat = as(my_mat, "sparseMatrix")

  return(list(mat = my_mat, sparse = sparse_mat))
}

# Fillings ranging from 100% (no zeros) down to ~1%
n_zero_vector = c(0, 1, 2, 49, 99)
names(n_zero_vector) = n_zero_vector

all_mats = lapply(n_zero_vector, sparse_and_mat)
mat_filling(all_mats$`0`$mat)
mat_filling(all_mats$`99`$mat)
```
Now we can compare the speed of the algorithms, reusing the distance matrix `dist_mat` computed above:
```{r, eval=FALSE}
if (requireNamespace("microbenchmark", quietly = TRUE) &&
    requireNamespace("ggplot2", quietly = TRUE)) {
  mat_bench = microbenchmark::microbenchmark(
    mat_0 = distinctiveness(all_mats$`0`$mat, dist_mat),
    sparse_0 = distinctiveness(all_mats$`0`$sparse, dist_mat),
    mat_1 = distinctiveness(all_mats$`1`$mat, dist_mat),
    sparse_1 = distinctiveness(all_mats$`1`$sparse, dist_mat),
    mat_2 = distinctiveness(all_mats$`2`$mat, dist_mat),
    sparse_2 = distinctiveness(all_mats$`2`$sparse, dist_mat),
    mat_49 = distinctiveness(all_mats$`49`$mat, dist_mat),
    sparse_49 = distinctiveness(all_mats$`49`$sparse, dist_mat),
    mat_99 = distinctiveness(all_mats$`99`$mat, dist_mat),
    sparse_99 = distinctiveness(all_mats$`99`$sparse, dist_mat),
    times = 5)

  # autoplot() for microbenchmark objects is provided through ggplot2
  ggplot2::autoplot(mat_bench)
}
```