Description
Expected Behavior
Building a disk index with cosine distance on a large dataset should complete without memory allocation errors.
Actual Behavior
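build_disk_index crashes with std::bad_alloc when building with cosine distance on a large dataset. The failure happens during normalization, which attempts a single allocation of roughly 1.1 TB.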
Example Code
/opt/DiskANN/build/apps/build_disk_index --data_type float --dist_fn cosine --index_path_prefix /mnt/raid0/DiskANN/af_ann_index -B 116 -M 116 --data_path /mnt/raid0/DiskANN/tmp/embeddings.bin
Dataset Description
- Dimensions: 1280
- Number of Points: 214386453
- Data type: float32
Error
Normalizing data for cosine to temporary file, please ensure there is additional (n*d*4) bytes for storing normalized base vectors, apart from the interim indices created by DiskANN and the final index.
Normalizing FLOAT vectors in file: /mnt/raid0/DiskANN/tmp/embeddings.bin
Dataset: #pts = 214386453, # dims = 1280
# blks: 1636
tcmalloc: large alloc 1097658646528 bytes == (nil) @ 0x7f96cdbc0680 0x7f96cdbe0ff4 0x561f428bcdbd 0x561f423d1cce 0x561f42390637 0x7f96c6291083 0x561f4239115e
std::bad_alloc
Index build failed.
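For scale, the size of the failed allocation lines up with the n*d*4 bytes mentioned in the log, i.e. a buffer for the entire dataset. The sketch below only reproduces that arithmetic. The 131072-point block size is an assumption inferred from "# blks: 1636", and the 8192-byte rounding is a guess at the allocator's page size; neither is taken from the DiskANN source.

```cpp
#include <cstdint>

// float32 is 4 bytes per component, matching the "(n*d*4)" in the log message.
constexpr std::uint64_t kBytesPerFloat = 4;

// Bytes needed to hold `num_points` float32 vectors with `dims` components each.
constexpr std::uint64_t float_buffer_bytes(std::uint64_t num_points, std::uint64_t dims) {
    return num_points * dims * kBytesPerFloat;
}

// Number of blocks needed to cover `num_points` (ceiling division).
constexpr std::uint64_t num_blocks(std::uint64_t num_points, std::uint64_t block_points) {
    return (num_points + block_points - 1) / block_points;
}

// Round `bytes` up to the next multiple of `page` (assumed allocator granularity).
constexpr std::uint64_t round_up(std::uint64_t bytes, std::uint64_t page) {
    return ((bytes + page - 1) / page) * page;
}

// Values from this report.
constexpr std::uint64_t kNumPoints = 214386453ULL;
constexpr std::uint64_t kDims = 1280ULL;

// ASSUMPTION: block size in points, inferred from "# blks: 1636".
constexpr std::uint64_t kAssumedBlockPoints = 131072ULL;

// Whole-dataset buffer: 1,097,658,639,360 bytes (about 1.1 TB).
constexpr std::uint64_t kFullBytes = float_buffer_bytes(kNumPoints, kDims);

// One-block buffer: 671,088,640 bytes (640 MiB).
constexpr std::uint64_t kBlockBytes = float_buffer_bytes(kAssumedBlockPoints, kDims);
```

Rounding kFullBytes up to a multiple of 8192 gives exactly the 1097658646528 bytes reported by tcmalloc, so the request looks like a whole-dataset buffer rather than a per-block one.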
Your Environment
- Ubuntu 22.04.1 LTS
- DiskANN version 0.7.0
Additional Details
The problem seems to be in the utils function void normalize_data_file(const std::string &inFileName, const std::string &outFileName)
Line 118 in fc3c6e2
It appears to allocate memory for the whole dataset at once. My guess is that the allocation should cover only a single block, i.e. blk_size points.
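To illustrate the suggestion, here is a minimal sketch of block-wise normalization that keeps only one block in memory at a time. It is not the DiskANN implementation: normalize_file_in_blocks is a hypothetical helper, the default block size is the value inferred from the log, and the file layout (int32 point count, int32 dimension, then row-major float32 vectors) is an assumption.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper, NOT DiskANN code.
// ASSUMED file layout: int32 npts, int32 dims, then npts*dims float32 values.
void normalize_file_in_blocks(const std::string &in_path, const std::string &out_path,
                              std::size_t block_points = 131072) {
    std::ifstream in(in_path, std::ios::binary);
    std::ofstream out(out_path, std::ios::binary);
    if (!in || !out) throw std::runtime_error("cannot open input or output file");

    // Copy the header through unchanged.
    std::int32_t npts = 0, dims = 0;
    in.read(reinterpret_cast<char *>(&npts), sizeof(npts));
    in.read(reinterpret_cast<char *>(&dims), sizeof(dims));
    if (!in || npts < 0 || dims <= 0) throw std::runtime_error("bad header");
    out.write(reinterpret_cast<const char *>(&npts), sizeof(npts));
    out.write(reinterpret_cast<const char *>(&dims), sizeof(dims));

    const std::size_t total = static_cast<std::size_t>(npts);
    const std::size_t d = static_cast<std::size_t>(dims);
    block_points = std::max<std::size_t>(1, std::min(block_points, total));

    // The buffer holds one block only, never the whole dataset.
    std::vector<float> buf(block_points * d);

    for (std::size_t start = 0; start < total; start += block_points) {
        const std::size_t count = std::min(block_points, total - start);
        const std::streamsize bytes =
            static_cast<std::streamsize>(count * d * sizeof(float));
        in.read(reinterpret_cast<char *>(buf.data()), bytes);
        if (in.gcount() != bytes) throw std::runtime_error("unexpected end of file");

        // Scale each vector to unit L2 norm; all-zero vectors are left as-is.
        for (std::size_t i = 0; i < count; ++i) {
            float *v = buf.data() + i * d;
            double sq = 0.0;
            for (std::size_t j = 0; j < d; ++j) sq += static_cast<double>(v[j]) * v[j];
            const double norm = std::sqrt(sq);
            if (norm > 0.0) {
                for (std::size_t j = 0; j < d; ++j) v[j] = static_cast<float>(v[j] / norm);
            }
        }
        out.write(reinterpret_cast<const char *>(buf.data()), bytes);
    }
    if (!out) throw std::runtime_error("write failed");
}
```

With the dataset from this report, a buffer sized this way would be about 640 MiB instead of about 1.1 TB. The normalized copy on disk would still need the extra n*d*4 bytes that the log message warns about.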