Segfault while training Doc2Vec with mixed precision #3347
Comments
Update: I came up with this function to convert all of the model's array attributes to FP16:

```python
import numpy as np

def convert_model_attributes_to_fp16(attributes: dict) -> None:
    """Convert all the numpy arrays found in the attributes to fp16."""
    for attribute in attributes.keys():
        # Check if it is a numpy array
        if isinstance(attributes[attribute], np.ndarray):
            # Convert to fp16
            print(f"Converting {attribute} to fp16")
            attributes[attribute] = attributes[attribute].astype(np.float16)
        else:
            if isinstance(attributes[attribute], dict):
                attribute_inner_attributes = attributes[attribute]
            else:
                try:
                    attribute_inner_attributes = attributes[attribute].__dict__
                except AttributeError:
                    continue
            # Recurse to find more arrays in the inner attributes
            convert_model_attributes_to_fp16(attribute_inner_attributes)
```

I call it like this, right before training:

```python
convert_model_attributes_to_fp16(model.__dict__)
```

Now several other vectors besides `dv.vectors`, `wv.vectors`, and `syn1neg` get converted to `np.float16` too; however, the error still happens.
In the end, I checked the dtypes and they were effectively converted, but the error is the same. @gojomo, since you already played around with this, can you please give me a hand?
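For reference, this is roughly how the dtypes can be verified after the conversion (a minimal sketch; the attribute names assume a Gensim 4.x `Doc2Vec` model trained with negative sampling, so that `syn1neg` exists):

```python
def print_weight_dtypes(model) -> None:
    """Print the dtype and shape of the main Doc2Vec weight arrays."""
    # Dotted attribute paths on a Gensim 4.x Doc2Vec model (assumed names).
    for name in ("wv.vectors", "dv.vectors", "syn1neg"):
        arr = model
        for part in name.split("."):
            arr = getattr(arr, part)
        print(f"{name}: dtype={arr.dtype}, shape={arr.shape}")
```

Calling `print_weight_dtypes(model)` right after the conversion is how I confirmed the arrays really are float16.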
That's an iffy strategy to save memory, for at least 2 reasons: […]

(I discovered this when testing a mere downconversion in a final model, to have a smaller frozen set-of-vectors - then noticing that the common […])

With just 1 million words, 5 million doc-vectors, & 100 dimensions, I'd think the model would be roughly:

1 million words * 100 dimensions * 4 bytes/dimension * 2 (vectors & internal weights) = 800MB
5 million doc-vectors * 100 dimensions * 4 bytes/dimension = 2GB

So mainly, I'd aim to use a system with far more than 4GB RAM. If your model is somehow using more RAM than that after the […]

(Separately: note that even with 16 or more cores, maximum training throughput using a corpus iterator is usually reached somewhere in the 6-12 workers range, due to other thread contention inherent to our current multithreaded design & the Python GIL. The actual best worker count varies a bit with other parameters, but unfortunately can only be deduced via trial-and-error - running & aborting brief training sessions, watching the logged training rate over a few minutes.)
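As a quick sanity check of the memory arithmetic above, here's a back-of-the-envelope sketch in Python (it only counts the main weight arrays; vocabulary dicts and other per-object overhead would come on top):

```python
# Rough size of the main Doc2Vec weight arrays at float32 (4 bytes each),
# using the numbers from the comment above.
n_words = 1_000_000
n_docs = 5_000_000
dims = 100
bytes_per_float = 4  # float32

word_side = n_words * dims * bytes_per_float * 2  # word vectors + internal weights
doc_side = n_docs * dims * bytes_per_float        # doc-vectors

print(f"words: {word_side / 1e9:.1f} GB")               # 0.8 GB
print(f"docs:  {doc_side / 1e9:.1f} GB")                # 2.0 GB
print(f"total: {(word_side + doc_side) / 1e9:.1f} GB")  # 2.8 GB
```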
While I'm happy to provide more ideas here or on the discussion list, there's not really any bug here - forcing Gensim to accept/use lower-precision floats is definitely not a supported/priority use-case. So I'm closing this issue.
Problem description
My model consumes too much RAM. I've been trying to convert it to np.float16 during inference, and it runs correctly without any error; the problem is that the accuracy seems to be drastically reduced (the results are not relevant at all). The model was trained normally, without any dtype conversion.

Now I'm trying to train the Doc2Vec model with mixed precision (np.float16 instead of np.float32) to check whether the accuracy issue is fixed if that's the type of the vectors from the very beginning. The problem is that I get the error
```
Segmentation fault (core dumped)
```

and the script stops. That's then followed by:

```
lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 6 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
```
Steps/code/corpus to reproduce
Am I missing something to train using mixed precision? I'm aware there will still be some conversions to FP32 under the hood, as noted by a core contributor here: #2413 (comment), but I still want to leverage the smaller memory footprint at rest and the reduced file size (it looks like about 50% savings, from the low-accuracy tests I did during inference).
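For completeness, this is the general shape of a script that reproduces the crash (a minimal sketch under assumptions, not the exact code from this issue: the corpus and hyperparameters are placeholders, and it reuses the `convert_model_attributes_to_fp16` helper from the comment above):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Placeholder corpus; the real one is much larger.
docs = [TaggedDocument(words=["some", "example", "tokens"], tags=[i])
        for i in range(1000)]

model = Doc2Vec(vector_size=100, min_count=1, workers=4)
model.build_vocab(docs)

# Force every numpy array on the model to fp16 in place
# (helper defined earlier in this thread).
convert_model_attributes_to_fp16(model.__dict__)

# The segfault shows up during training, presumably because Gensim's
# optimized C/Cython routines expect float32 arrays.
model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)
```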
Versions