Unlock analysis of larger datasets

Currently CASTCLUSTER stores similarity values in a matrix. A standard R vector is limited to 2^31 - 1 elements (about 2 billion), and a matrix in R is a vector with dimension attributes. Because the matrix stores a score for every pair of sites, it can hold the similarity scores for at most about sqrt(2^31) ≈ 46 000 sites. That is large, but not large enough for a global 1 degree analysis, where a 360 × 180 grid has 64 800 cells.
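The capacity estimate can be checked with a few lines of R. This is only an arithmetic sketch; the variable names are illustrative and are not part of CASTCLUSTER.

```r
# Largest element count of a standard (non-long) R vector: 2^31 - 1
max_elements <- .Machine$integer.max

# A square sites-by-sites matrix needs n^2 elements, so n <= sqrt(max_elements)
max_sites <- floor(sqrt(max_elements))

# Cell count of a global 1 degree grid (360 longitudes x 180 latitudes)
global_cells <- 360 * 180

print(max_sites)                 # 46340
print(global_cells)              # 64800
print(global_cells > max_sites)  # TRUE: the grid does not fit
```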

Internally, there is no need for a mathematically proper matrix. No matrix algebra is performed. All operations pull out a row as a vector, then do element-wise operations on those row-vectors.

A better internal structure would be a list or environment where each entry `i` is the similarity vector from site `i` to all other sites. The list/environment and each element are separate objects in memory, so we can handle up to (2^31-1) sites.
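A minimal sketch of the list-of-rows idea follows. It assumes a hypothetical pairwise function `sim_fun(i, j)` and a hypothetical helper `make_sim_store()`; neither is taken from the CASTCLUSTER code base, and a real implementation would fill each row from the package's own similarity calculation.

```r
# Hypothetical helper: build a list in which element i is the similarity
# vector from site i to every site. Each row is its own R vector, so no
# single object has to hold all n^2 scores.
make_sim_store <- function(n_sites, sim_fun) {
  store <- vector("list", n_sites)
  for (i in seq_len(n_sites)) {
    store[[i]] <- vapply(seq_len(n_sites),
                         function(j) sim_fun(i, j),
                         numeric(1))
  }
  store
}

# Toy similarity (an assumption for illustration): decays with index distance
toy_sim <- function(i, j) 1 / (1 + abs(i - j))

store <- make_sim_store(5, toy_sim)

# Pull out one row as a vector, then work element-wise on it
row_3 <- store[[3]]
neighbours <- which(row_3 >= 0.5)
print(neighbours)  # 2 3 4
```

An environment keyed by site index would serve the same purpose; a plain list is used here only because it is the simpler of the two options.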

Actually storing the similarities for 2^31 sites would require 2^62 similarity scores. At 8 bytes per double-precision score that is 2^65 bytes (32 exbibytes), more than a 64-bit address space (2^64 bytes) can address. Don't try that at your local supercomputing cluster.
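For a more realistic sense of scale, the storage for a global 1 degree grid can be estimated as below. The sketch assumes every score is kept as an 8-byte double and ignores R's per-object overhead; `sim_store_bytes()` is a hypothetical name used only here.

```r
# Hypothetical helper: bytes needed for all pairwise scores among n sites,
# assuming 8-byte doubles and no per-object overhead
sim_store_bytes <- function(n_sites) {
  8 * n_sites^2
}

global_cells <- 360 * 180                # 64800 cells in a 1 degree grid
total_bytes <- sim_store_bytes(global_cells)

print(total_bytes)                       # 33592320000
print(total_bytes / 2^30)                # roughly 31.3 GiB
```

So removing the single-vector limit makes the 1 degree case addressable, but the full set of scores would still need tens of gibibytes of memory.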


Unlock analysis of larger datasets #3
