[BUG] DiskANN tries to allocate memory for the whole set of points #612

Open
@bioinsilico

Expected Behavior

No memory allocation errors when building a disk index for a large dataset with cosine distance.

Actual Behavior

build_disk_index crashes with std::bad_alloc while normalizing a large dataset for cosine distance, because it tries to allocate memory for the entire dataset at once.

Example Code

/opt/DiskANN/build/apps/build_disk_index --data_type float --dist_fn cosine --index_path_prefix /mnt/raid0/DiskANN/af_ann_index -B 116 -M 116 --data_path /mnt/raid0/DiskANN/tmp/embeddings.bin

Dataset Description

  • Dimensions: 1280
  • Number of Points: 214386453
  • Data type: float32

Error

Normalizing data for cosine to temporary file, please ensure there is additional (n*d*4) bytes for storing normalized base vectors, apart from the interim indices created by DiskANN and the final index.
Normalizing FLOAT vectors in file: /mnt/raid0/DiskANN/tmp/embeddings.bin
Dataset: #pts = 214386453, # dims = 1280
# blks: 1636
tcmalloc: large alloc 1097658646528 bytes == (nil) @  0x7f96cdbc0680 0x7f96cdbe0ff4 0x561f428bcdbd 0x561f423d1cce 0x561f42390637 0x7f96c6291083 0x561f4239115e
std::bad_alloc
Index build failed.
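
For reference, the size of the failed allocation lines up with n*d*4 for this dataset. Below is a small C++ check of that arithmetic. It rests on two assumptions that the log does not state: that tcmalloc rounds large requests up to 8 KiB pages, and that the block size is 131072 points (inferred from "# blks: 1636"). The function names are made up for illustration.

```cpp
#include <cstdint>

// float32 data, so 4 bytes per value.
constexpr std::uint64_t kBytesPerFloat = 4;

// Bytes needed to hold `npts` vectors of `ndims` float32 values.
constexpr std::uint64_t buffer_bytes(std::uint64_t npts, std::uint64_t ndims) {
    return npts * ndims * kBytesPerFloat;
}

// Round `x` up to the next multiple of `m`.
constexpr std::uint64_t round_up(std::uint64_t x, std::uint64_t m) {
    return (x + m - 1) / m * m;
}

// Number of blocks of `blk` points needed to cover `npts` points.
constexpr std::uint64_t num_blocks(std::uint64_t npts, std::uint64_t blk) {
    return (npts + blk - 1) / blk;
}

constexpr std::uint64_t kNpts = 214386453;  // from the log
constexpr std::uint64_t kNdims = 1280;      // from the log
constexpr std::uint64_t kBlk = 131072;      // ASSUMPTION: inferred from "# blks: 1636"
constexpr std::uint64_t kPage = 8192;       // ASSUMPTION: tcmalloc page size

// Whole-dataset buffer: n * d * 4 bytes.
static_assert(buffer_bytes(kNpts, kNdims) == 1097658639360ULL, "n*d*4");
// Rounded up to the assumed page size, this equals the size in the tcmalloc message.
static_assert(round_up(buffer_bytes(kNpts, kNdims), kPage) == 1097658646528ULL, "log size");
// The assumed block size reproduces the block count in the log.
static_assert(num_blocks(kNpts, kBlk) == 1636, "# blks");
// A buffer for a single block would be 640 MiB.
static_assert(buffer_bytes(kBlk, kNdims) == 671088640ULL, "one block");
```

Under these assumptions the full-dataset buffer is about 1.1 TB, whereas a buffer for a single block would need 640 MiB.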

Your Environment

  • Ubuntu 22.04.1 LTS
  • DiskANN version 0.7.0

Additional Details

The problem seems to be in the utils function void normalize_data_file(const std::string &inFileName, const std::string &outFileName), specifically in this line:

float *read_buf = new float[npts * ndims];

This allocates a buffer for the whole dataset (npts * ndims floats) instead of for a single block. If I had to guess, the buffer only needs to hold one block at a time, i.e. blk_size * ndims floats.
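
To illustrate the idea, here is a minimal, self-contained sketch of block-wise normalization. This is not DiskANN's code or a tested patch for it. The function name normalize_bin_blockwise, the default block size, and the decision to leave zero vectors unchanged are all assumptions. The file layout assumed is an int32 point count, an int32 dimension, then the float32 values.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of DiskANN): L2-normalizes every vector in a
// .bin file while holding at most `blk_size` vectors in memory at a time.
// ASSUMED layout: int32 npts, int32 ndims, then npts * ndims float32 values.
void normalize_bin_blockwise(const std::string &in_file, const std::string &out_file,
                             std::size_t blk_size = 131072) {
    if (blk_size == 0) throw std::invalid_argument("blk_size must be positive");
    std::ifstream in(in_file, std::ios::binary);
    std::ofstream out(out_file, std::ios::binary);
    if (!in || !out) throw std::runtime_error("cannot open input or output file");

    // Copy the header through unchanged.
    std::int32_t npts32 = 0, ndims32 = 0;
    in.read(reinterpret_cast<char *>(&npts32), sizeof(npts32));
    in.read(reinterpret_cast<char *>(&ndims32), sizeof(ndims32));
    if (!in || npts32 < 0 || ndims32 <= 0) throw std::runtime_error("bad header");
    out.write(reinterpret_cast<const char *>(&npts32), sizeof(npts32));
    out.write(reinterpret_cast<const char *>(&ndims32), sizeof(ndims32));

    const std::size_t npts = static_cast<std::size_t>(npts32);
    const std::size_t ndims = static_cast<std::size_t>(ndims32);

    // The buffer is sized for one block, not for the whole dataset.
    std::vector<float> buf(std::min(blk_size, npts) * ndims);

    for (std::size_t start = 0; start < npts; start += blk_size) {
        const std::size_t count = std::min(blk_size, npts - start);
        const std::streamsize bytes =
            static_cast<std::streamsize>(count * ndims * sizeof(float));
        in.read(reinterpret_cast<char *>(buf.data()), bytes);
        if (!in) throw std::runtime_error("unexpected end of input file");

        for (std::size_t i = 0; i < count; ++i) {
            float *v = buf.data() + i * ndims;
            double sq = 0.0;  // accumulate in double to limit rounding error
            for (std::size_t d = 0; d < ndims; ++d) sq += static_cast<double>(v[d]) * v[d];
            const float norm = static_cast<float>(std::sqrt(sq));
            // ASSUMPTION: zero vectors are written back unchanged.
            if (norm > 0.0f)
                for (std::size_t d = 0; d < ndims; ++d) v[d] /= norm;
        }
        out.write(reinterpret_cast<const char *>(buf.data()), bytes);
    }
    out.flush();
    if (!out) throw std::runtime_error("failed to write output file");
}
```

With a layout like this, peak memory for the normalization step depends only on blk_size and ndims, not on the number of points.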
