Unlock analysis of larger datasets #3

@PhDyellow

Currently CASTCLUSTER uses a matrix to store similarity values. R has a vector limit of 2^31 - 1 elements, which is about 2 billion, and a matrix in R is a vector with dimension attributes. Because the matrix stores a score for every pair of sites, it can hold the similarity scores for only about sqrt(2^31) ≈ 46 000 sites. That is large, but not large enough for a global 1 degree analysis.
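
The arithmetic behind that ceiling can be checked in a few lines of R. This is only a sketch: the 64 800-cell figure assumes the analysis covers every cell of a full 1 degree global grid (360 x 180), which may overstate what a real analysis needs.

```r
# Element limit cited above for a single R vector (and hence a matrix).
max_elements <- 2^31 - 1

# A square n-by-n similarity matrix needs n^2 elements,
# so the largest n that fits is floor(sqrt(max_elements)).
max_sites <- floor(sqrt(max_elements))

# Assumption: a global 1 degree analysis uses the full 360 x 180 grid.
grid_cells <- 360 * 180

# TRUE means the grid has more sites than one matrix can hold.
exceeds_limit <- grid_cells > max_sites
print(c(max_sites = max_sites, grid_cells = grid_cells))
print(exceeds_limit)
```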

Internally, there is no need for a mathematically proper matrix. No matrix algebra is performed. All operations pull out a row as a vector, then do element-wise operations on those row-vectors.

A better internal structure would be a list or environment where each entry i is the similarity vector from site i to all other sites. The list/environment and each of its elements are separate objects in memory, so the vector limit applies to each one individually rather than to the whole set of pairs, and we can handle up to (2^31-1) sites.
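
A minimal sketch of that layout follows. It is not CASTCLUSTER's code: `similarity_row()`, `site_values`, and the toy similarity measure are hypothetical stand-ins, used only to show that row access and element-wise operations work the same way on a list of vectors.

```r
set.seed(42)
n_sites <- 6
site_values <- runif(n_sites)  # toy per-site data, purely illustrative

# Hypothetical helper: similarity of site i to every site (toy measure in [0, 1]).
similarity_row <- function(i, values) {
  1 - abs(values[i] - values)
}

# One list element per site; each element is its own vector in memory,
# so no single object ever holds all n^2 scores.
sim_rows <- lapply(seq_len(n_sites), similarity_row, values = site_values)

# Equivalent of sim_matrix[2, ]: pull out one row as a vector.
row_2 <- sim_rows[[2]]

# Equivalent of sim_matrix[2, 5]: a single pairwise score.
s_2_5 <- sim_rows[[2]][5]

# Element-wise work on a row, e.g. mean similarity of site 2
# to a hypothetical set of cluster members.
cluster_members <- c(1, 4, 5)
affinity <- mean(sim_rows[[2]][cluster_members])
```

An environment keyed by site index would work similarly; a list is used here only because integer indexing keeps the example short.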

Actually trying to store the similarities for 2^31 sites would require 2^62 similarity scores. That is 4 exbibytes even at one byte per score, and 32 exbibytes as 8-byte doubles, which is beyond the 16 exbibytes of a 64-bit address space. Don't try that at your local supercomputing cluster.
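
For a rough sense of scale, the storage can be estimated as below, assuming each score is stored as an 8-byte double (R's default numeric type). The full 360 x 180 grid is again an assumed upper bound for a 1 degree analysis.

```r
bytes_per_score <- 8  # assumption: scores stored as R doubles

# Extreme case: all pairs among 2^31 sites.
n_scores_max <- (2^31)^2                           # 2^62 scores
total_eib <- n_scores_max * bytes_per_score / 2^60 # size in exbibytes
address_space_eib <- 2^64 / 2^60                   # 64-bit address space in EiB

# Realistic case: all pairs on a full 1 degree global grid.
grid_cells <- 360 * 180
grid_gib <- grid_cells^2 * bytes_per_score / 2^30  # size in gibibytes

print(c(total_eib = total_eib, address_space_eib = address_space_eib))
print(grid_gib)
```

Under these assumptions the 1 degree case needs roughly 31 GiB in total, which is out of reach for one matrix under the limit above but plausible for a large-memory machine once the scores are split across per-site vectors.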
