Currently CASTCLUSTER uses a matrix to store similarity values. A standard R vector is limited to 2^31 - 1 elements, which is about 2 billion, and a matrix in R is a vector with dimension attributes. Because the matrix stores a score for every pair of sites, it can hold the similarity scores for about sqrt(2^31) = 46 000 sites. That is large, but not large enough for a global 1 degree analysis: a full 360 x 180 grid has 64 800 cells.
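The arithmetic behind that limit can be checked directly in R. This is only a sketch: it assumes every cell of a 360 x 180 grid counts as a site, and the variable names are illustrative rather than taken from CASTCLUSTER.

```r
# Largest square matrix that fits in one standard R vector.
max_elements <- .Machine$integer.max      # 2^31 - 1
max_sites <- floor(sqrt(max_elements))    # sites along one side of the matrix

# Cells in a full global 1 degree grid (assumption: every cell is a site).
global_cells <- 360 * 180

max_sites                  # 46340
global_cells               # 64800
global_cells > max_sites   # TRUE
```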
Internally, there is no need for a mathematically proper matrix. No matrix algebra is performed. All operations pull out a row as a vector, then do element-wise operations on those row-vectors.
A better internal structure would be a list or environment where each entry i is the similarity vector from site i to all other sites. The list/environment and each of its elements are separate objects in memory, each with its own length limit, so in principle we can handle up to (2^31-1) sites.
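A minimal sketch of the list-based layout follows. It is not the CASTCLUSTER implementation: `build_sim_rows`, `toy_sim` and the toy similarity measure are hypothetical names made up for illustration. It only shows that pulling out a row and doing element-wise work on it needs nothing more than a list of vectors.

```r
# Hypothetical helper: builds one similarity vector per site.
# sim_fun(i, j) stands in for whatever similarity measure is actually used.
build_sim_rows <- function(n_sites, sim_fun) {
  lapply(seq_len(n_sites), function(i) {
    vapply(seq_len(n_sites), function(j) sim_fun(i, j), numeric(1))
  })
}

# Toy similarity (assumption): decays with the gap between site indices.
toy_sim <- function(i, j) 1 / (1 + abs(i - j))

sim_rows <- build_sim_rows(5, toy_sim)

# Pull out the row for site 2 as a plain vector,
# the list equivalent of sim[2, ] on a matrix.
row2 <- sim_rows[[2]]

# Element-wise use of the row: mean similarity of site 2 to sites 1, 3 and 4.
members <- c(1, 3, 4)
mean_affinity <- mean(row2[members])
```

Each element of `sim_rows` is its own vector, so no single object has to hold all n^2 scores. An environment keyed by site id would work the same way for lookups; the list is used here only because it keeps the example short.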
Actually trying to store the similarities for 2^31 sites would require 2^62 similarity scores. Even at one byte per score that is 4 exabytes, and at 8 bytes per R double it is 32 exabytes, which is beyond the 16 exabyte limit of a 64-bit address space. Don't try that at your local supercomputing cluster.
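For a rough sense of scale, the memory needed for a full set of pairwise scores can be estimated as below. This is a sketch that assumes each score is stored as an 8-byte double and ignores per-object overhead; `sim_bytes` is a hypothetical helper, not part of CASTCLUSTER.

```r
# Bytes needed to hold every pairwise score, assuming 8-byte doubles.
sim_bytes <- function(n_sites, bytes_per_score = 8) {
  n_sites^2 * bytes_per_score   # double arithmetic, so no integer overflow
}

gib <- 2^30   # bytes per gibibyte
eib <- 2^60   # bytes per exbibyte

sim_bytes(360 * 180) / gib   # full global 1 degree grid, in GiB
sim_bytes(2^31) / eib        # the theoretical maximum, in EiB
```

The first figure is the practical one: it shows whether a given analysis fits in memory at all, regardless of which container holds the rows.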