tidyCovariates is extremely slow and resource intensive for large data when using Andromeda >= 1.0.0 #308

Description

@schuemie

The new Andromeda makes most operations, including tidyCovariates(), much faster, but not when the covariate data is very large (in my case, 2 million subjects with > 160k covariates). This is caused by inefficiencies in how DuckDB handles the combination of filtering by covariate ID and normalization.
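
To make the pattern concrete, the combination looks roughly like this in dplyr/dbplyr terms. This is a sketch only, not the exact tidyCovariates() code; `idsToKeep` stands in for the vector of covariate IDs to retain, and the table layout assumed is the usual `covariateData$covariates` with `rowId`, `covariateId`, and `covariateValue` columns:

```r
# Sketch of the problematic pattern (illustration only).
# idsToKeep is a hypothetical vector of covariate IDs to retain.
library(dplyr)

covariateData$covariates %>%
  filter(covariateId %in% idsToKeep) %>%   # expands to a very large IN (...) list
  group_by(covariateId) %>%
  mutate(covariateValue = covariateValue /
           max(covariateValue, na.rm = TRUE)) %>%  # grouped mutate becomes a window function
  ungroup()
```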

I have created a fix that reduced the processing time for my data from more than 3 hours (after 3 hours it was only at 3%, so I stopped it) to about 3 minutes.

I will post a PR.
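
For illustration only (the PR contains the actual change, which may differ), the kind of restructuring that can avoid this bottleneck is to compute the small per-covariate pieces first and join them back in, rather than asking the database to filter and normalize in one large query. Same assumptions as above: the standard `covariateData$covariates` table and a hypothetical `idsToKeep` vector:

```r
# Illustration only, not necessarily what the PR does.
library(dplyr)

# Per-covariate maxima: one row per covariate, so small enough to collect()
# into R and restrict to the IDs we want to keep.
maxValues <- covariateData$covariates %>%
  group_by(covariateId) %>%
  summarise(maxValue = max(covariateValue, na.rm = TRUE)) %>%
  collect() %>%
  filter(covariateId %in% idsToKeep)

# Push the small table back into the database (copy = TRUE) and let a single
# join handle both the filtering and the normalization.
normalized <- covariateData$covariates %>%
  inner_join(maxValues, by = "covariateId", copy = TRUE) %>%
  mutate(covariateValue = covariateValue / maxValue) %>%
  select(rowId, covariateId, covariateValue)
```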
