Segfault while training Doc2Vec with mixed precision #3347

Closed
marcelodiaz558 opened this issue May 20, 2022 · 3 comments

@marcelodiaz558

Problem description

My model consumes too much RAM. I've been trying to convert it to np.float16 at inference time, and it runs correctly without any error, but the accuracy seems to be drastically reduced (the results are not relevant at all). The model itself was trained normally, without any dtype conversion.

Now I'm trying to train the Doc2Vec model with mixed precision (np.float16 instead of np.float32) to check whether the accuracy problem goes away when the vectors have that dtype from the very beginning. The problem is that I get Segmentation fault (core dumped) and the script stops, followed by lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 6 leaked semaphore objects to clean up at shutdown warnings.warn('resource_tracker: There appear to be %d '.

Steps/code/corpus to reproduce

import os

from gensim.models import Doc2Vec

# Get the model
model = Doc2Vec(
    dm=1,
    dm_mean=1,
    vector_size=100,
    window=8,
    epochs=10,
    workers=os.cpu_count(),
    max_final_vocab=1000000,
    sample=1e-4,
    min_count=30,
)

print("Converting model weights to FP16")
model.dv.vectors = model.dv.vectors.astype("float16")
model.wv.vectors = model.wv.vectors.astype("float16")

# Build the vocabulary (documents is an iterator with 5M preprocessed docs)
model.build_vocab(documents)

print("Converting syn1neg to FP16")
model.syn1neg = model.syn1neg.astype("float16")

# Train
model.train(documents, total_examples=model.corpus_count, epochs=10)

Am I missing something needed to train with mixed precision? I'm aware that there will still be some conversions to FP32 under the hood, as a collaborator noted in #2413 (comment), but I still want to leverage the smaller memory footprint at rest and the reduced file size (roughly 50% savings, judging from the low-accuracy tests I did during inference).
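To be concrete about the at-rest savings, something like this minimal sketch is what I have in mind (assuming a trained float32 model; the filename is just an example):

import numpy as np

# Downcast only the on-disk copy; keep the in-memory array as float32.
np.save("doc_vectors_fp16.npy", model.dv.vectors.astype(np.float16))

# Later, load and upcast back to float32 before doing any similarity math.
doc_vectors = np.load("doc_vectors_fp16.npy").astype(np.float32)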

Versions

>>> import platform; print(platform.platform())
Linux-5.15.0-1005-aws-x86_64-with-glibc2.35

>>> import sys; print("Python", sys.version)
Python 3.10.0 (default, Mar  3 2022, 09:58:08) [GCC 7.5.0]

>>> import struct; print("Bits", 8 * struct.calcsize("P"))
Bits 64

>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.21.5

>>> import scipy; print("SciPy", scipy.__version__)
SciPy 1.7.3

>>> import gensim; print("gensim", gensim.__version__)
gensim 4.1.2

>>> from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
FAST_VERSION 1
@marcelodiaz558 (Author)

Update: I came up with this function to convert all the attributes to FP16.

import numpy as np


def convert_model_attributes_to_fp16(attributes: dict) -> None:
    """Recursively convert all the numpy arrays found in the attributes to fp16."""
    for attribute in attributes.keys():
        # Check if it is a numpy array
        if isinstance(attributes[attribute], np.ndarray):
            # Convert to fp16
            print(f"Converting {attribute} to fp16")
            attributes[attribute] = attributes[attribute].astype(np.float16)
        else:
            if isinstance(attributes[attribute], dict):
                attribute_inner_attributes = attributes[attribute]
            else:
                try:
                    attribute_inner_attributes = attributes[attribute].__dict__
                except AttributeError:
                    continue

            # Try finding more arrays in the inner attributes
            convert_model_attributes_to_fp16(attribute_inner_attributes)

I call it like this right before training:

convert_model_attributes_to_fp16(model.__dict__)

Now several other vectors are converted to np.float16 besides dv.vectors, wv.vectors, and syn1neg; however, the error still happens. This is the output of the conversion function:

Converting vectors to fp16
Converting word_count to fp16
Converting doc_count to fp16
Converting vectors_lockf to fp16
Converting cum_table to fp16
Converting vectors to fp16
Converting count to fp16
Converting sample_int to fp16
Converting vectors_lockf to fp16
Converting syn1neg to fp16

In the end, I checked the dtypes and they were indeed converted; however, the error is the same.
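For reference, the dtype check was just a quick spot-check along these lines (illustrative, not part of the script above):

# Spot-check the main arrays after running the conversion function.
print(model.wv.vectors.dtype)  # float16
print(model.dv.vectors.dtype)  # float16
print(model.syn1neg.dtype)     # float16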

@gojomo, since you already played around with this, can you please give me a hand?

@gojomo (Collaborator) commented May 24, 2022

That's an iffy strategy to save memory, for at least 2 reasons:

  • to actually change all the relevant calculations, you'd also need to make extensive edits to the optimized Cython code - and its low-level/less-rigorously-typed memory accesses are very prone to subtle bugs causing memory-access problems (which then trigger errors like this segmentation fault).
  • many of the raw BLAS array operations are only optimized for 32-bit (& larger) types, even if their inputs are lower precision – so when you attempt them on lower-precision numbers, an upconversion precedes the array math, then a downconversion copies the results back to your array - slowing things quite a bit (& perhaps using more memory dynamically than you expected).

(I discovered this when testing a mere downconversion in a final model, to have a smaller frozen set-of-vectors - then noticing that the common .most_similar() operation, which needs to do a bulk operation over the array-of-all-vectors, was actually creating a temporary full 32-bit copy every time, outweighing any memory-savings-at-rest.)
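If you want to see that effect for yourself, a rough sketch (plain NumPy, made-up sizes, nothing Gensim-specific) is to time a bulk dot-product over float16 vs. float32 vectors; the float16 path is typically no faster, since the math gets upconverted internally:

import time

import numpy as np

dims = 100
vecs32 = np.random.rand(1_000_000, dims).astype(np.float32)
vecs16 = vecs32.astype(np.float16)

for label, vecs in (("float32", vecs32), ("float16", vecs16)):
    query = np.random.rand(dims).astype(vecs.dtype)
    start = time.perf_counter()
    sims = vecs @ query  # the kind of bulk operation .most_similar() relies on
    print(label, f"{time.perf_counter() - start:.3f}s, result dtype: {sims.dtype}")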

With just 1 million words, & 5 million doc-vectors, & 100 dimensions, I'd think the model would be roughly:

1 million words * 100 dimensions * 4 bytes/dimension * 2 (vectors & internal weights) = 800MB
5 million docs * 100 dimensions * 4 bytes/dimension = 2GB
[plus some hundreds of MB for word- and doc-vector lookup dictionaries]
≈ 3-4GB in size total
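A quick back-of-the-envelope check of those numbers, if it helps:

# Rough size of the main float32 arrays (lookup dictionaries excluded).
words, docs, dims, bytes_per_float32 = 1_000_000, 5_000_000, 100, 4
word_arrays = words * dims * bytes_per_float32 * 2  # wv.vectors + syn1neg ~= 0.8 GB
doc_array = docs * dims * bytes_per_float32         # dv.vectors ~= 2 GB
print((word_arrays + doc_array) / 1e9, "GB before dictionaries")  # 2.8 GB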

So mainly, I'd aim to use a system with far more than 4GB RAM.

If your model is somehow using more RAM than that after the .build_vocab() step, there may be something else odd: incredibly long words? multiple/long tags per document? using plain-ints as doc-ids that don't start at 0, causing large overallocation of the doc-vectors array?

(Separately: note that even with 16 or more cores, maximum training throughput using a corpus iterator is usually reached somewhere in the 6-12 worker range, due to other thread contention inherent to our current multithreaded design & the Python GIL. The actual best worker count varies a bit with other parameters, but unfortunately can only be deduced via trial-and-error - running & aborting brief training sessions, watching the logged training rate over a few minutes.)
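One way to run that trial-and-error (a rough sketch; the worker values are only examples, and documents is your corpus iterator from the report above): turn on INFO logging and compare the words/sec that Gensim logs during a short run at each setting.

import logging

from gensim.models import Doc2Vec

logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO)

for workers in (6, 8, 12):
    probe = Doc2Vec(vector_size=100, window=8, workers=workers, min_count=30)
    probe.build_vocab(documents)
    # One epoch is enough to compare the logged training rate between settings.
    probe.train(documents, total_examples=probe.corpus_count, epochs=1)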

@gojomo (Collaborator) commented May 27, 2022

While I'm happy to provide more ideas here or on the discussion list, there's not really any bug here – forcing Gensim to accept/use lower-precision floats is definitely not a supported/priority use case. So I'm closing this rather than leaving it as a pending issue.

@gojomo closed this as completed May 27, 2022