Github actions: ValueError: numpy.ndarray size changed, may indicate binary incompatibility. #392

Closed
MaartenGr opened this issue Jan 3, 2022 · 27 comments

@MaartenGr
Owner

The GitHub Actions workflow is suddenly giving me the following error:

ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject

It most likely has to do with numpy-based binary compatibility issues (some more info here). However, I have not been able to fix it so far with the suggested method (setting oldest-supported-numpy in pyproject.toml).

If you have any idea, please follow along with the full discussions here. Any help is greatly appreciated!
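
When debugging this kind of error, it can help to first record which versions of the compiled packages are actually installed in the failing environment. The following is only a minimal sketch using the standard library's importlib.metadata; the helper name installed_versions and the package list are assumptions chosen for this thread, not part of BERTopic.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Packages discussed in this thread; adjust the list for your own setup.
    report = installed_versions(
        ["numpy", "hdbscan", "umap-learn", "numba", "bertopic"]
    )
    for name, version in report.items():
        print(f"{name}: {version or 'not installed'}")
```

Running this in both a working and a failing environment and comparing the two reports may show which package differs.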

@lsch0lz

lsch0lz commented Jan 4, 2022

Hi Maarten! Any news on that issue?
Yesterday I tried several combinations of NumPy and BERTopic, but I couldn't import it due to the mentioned error. Is there any version that's working right now? The NumPy issue mentions a conflict with HDBSCAN.

@MaartenGr
Owner Author

@therealgherkhin Not yet. The thing is, I only experience this issue in the GitHub Actions pipeline. I do not run into it locally, on Kaggle, or on Google Colab, so I am also having trouble replicating it...

Just to be sure, how did you try to install BERTopic, and in what kind of environment?

@lsch0lz

lsch0lz commented Jan 4, 2022

Oh okay.. I run into this issue when importing bertopic in a Docker environment. I installed it via pip install bertopic.
The installation seems fine, but when I import the library I get the mentioned error.

Edit: I also tried it locally on my machine and there everything seems fine. The Docker installation still fails.

Environment:

Python: 3.8
anyio==3.3.4
asgiref==3.4.1
atomicwrites==1.4.0
attrs==21.2.0
banal==1.0.6
bertopic==0.9.4
certifi==2021.10.8
charset-normalizer==2.0.7
click==8.0.3
colorama==0.4.4
Cython==0.29.25
fastapi==0.70.0
fastapi-utils==0.2.1
filelock==3.4.0
greenlet==1.1.2
h11==0.12.0
hdbscan==0.8.27
huggingface-hub==0.2.1
idna==3.3
iniconfig==1.1.1
joblib==1.1.0
langcodes==3.3.0
language-data==1.1
languagecodes==1.1.1
llvmlite==0.37.0
loguru==0.5.3
marisa-trie==0.7.7
nltk==3.6.5
numba==0.54.1
numpy==1.20.0
packaging==21.3
pandas==1.3.5
Pillow==8.4.0
plotly==5.4.0
pluggy==1.0.0
psycopg2==2.9.2
psycopg2-binary==2.9.2
py==1.11.0
pydantic==1.8.2
pymongo==3.12.1
pynndescent==0.5.5
pyparsing==3.0.6
pytest==6.2.5
python-dateutil==2.8.2
pytz==2021.3
PyYAML==5.4.1
regex==2021.11.10
requests==2.26.0
sacremoses==0.0.46
scikit-learn==1.0.1
scipy==1.7.3
sentence-transformers==2.1.0
sentencepiece==0.1.96
six==1.16.0
sniffio==1.2.0
SQLAlchemy==1.4.27
starlette==0.16.0
tenacity==8.0.1
threadpoolctl==3.0.0
tokenizers==0.10.3
toml==0.10.2
torch==1.10.0
torchvision==0.11.1
tqdm==4.62.3
transformers==4.13.0
typing_extensions==4.0.0
umap-learn==0.5.2
urllib3==1.26.7
uvicorn==0.15.0

@emieldatalytica

emieldatalytica commented Jan 4, 2022

> so I also have issues replicating this issue...

Hi Maarten,

You can replicate the issue by building the following Docker image:

FROM python:3.8

# install rust compiler (required for tokenizers dependency of bertopic)
RUN curl https://sh.rustup.rs -sSf | bash -s -- -y

ENV PATH="/root/.cargo/bin:${PATH}"

RUN pip install bertopic

RUN python -c 'from bertopic import BERTopic'

Or just by running pip install bertopic in a fresh virtual environment and then running python -c 'from bertopic import BERTopic'
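
The same import check can be scripted so that it runs in a fresh interpreter and reports the failure without crashing the calling process. This is a sketch only; the helper name import_succeeds is hypothetical and the approach simply wraps the python -c command shown above.

```python
import subprocess
import sys


def import_succeeds(statement, python=sys.executable):
    """Run one import statement in a fresh interpreter; return (ok, stderr)."""
    completed = subprocess.run(
        [python, "-c", statement],
        capture_output=True,
        text=True,
    )
    return completed.returncode == 0, completed.stderr
```

Calling import_succeeds("from bertopic import BERTopic") returns a success flag together with the captured traceback text, which should contain the ValueError in an affected environment.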

The weird thing is, a Docker image that I created last week (with BERTopic version 0.9.3) is still running fine, but when I try to build it again with the exact same requirements and then run it, it is now failing. It seems to use something in the cached layers that it is now unable to rebuild, but I'm not sure exactly what it is. The same goes for my virtual environment; importing it still works fine in the old virtual environment, installing it in a new virtual environment with the same requirements also works, but importing it fails with the same error.

The error happens when importing the HDBSCAN library (see scikit-learn-contrib/hdbscan#457 (comment)), which could be solved with bumping Numpy to version 1.22.0. However, the UMAP library then depends on Numba, which is only compatible with Numpy < 1.21.0. So, no solution yet..
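
The deadlock described above can be made concrete with a small check. This is a simplified sketch: both bounds are taken from the comment (Numba needing Numpy < 1.21.0, and the HDBSCAN import working from Numpy 1.22.0) and are treated here as assumptions, not as authoritative package metadata. The function names are hypothetical.

```python
def parse_version(text):
    """Turn a plain 'X.Y.Z' release string into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def numpy_conflicts(numpy_version):
    """List which constraints from this thread a NumPy version would violate."""
    version = parse_version(numpy_version)
    problems = []
    # Assumption from the comment: the installed Numba needs NumPy < 1.21.0.
    if version >= (1, 21, 0):
        problems.append("numba requires numpy < 1.21.0")
    # Assumption from the comment: the HDBSCAN import only works once
    # NumPy is bumped to 1.22.0.
    if version < (1, 22, 0):
        problems.append("hdbscan import expects numpy >= 1.22.0")
    return problems
```

Under these two assumptions every Numpy version violates at least one constraint, which is why no single pin resolves the problem.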

@SkyeCC

SkyeCC commented Jan 4, 2022

I got this problem too. I tried to fix it by upgrading numpy, but then I got another problem: numpy.core._exceptions.UFuncTypeError: ufunc 'correct_alternative_cosine' did not contain a loop with signature matching types <class 'numpy.dtype[float32]'> -> None

@lsch0lz

lsch0lz commented Jan 4, 2022

Okay, so I downgraded Python to 3.7 and now it works. I'm still not sure why it doesn't work with 3.8.

@MaartenGr
Owner Author

MaartenGr commented Jan 6, 2022

The last few days I have been bug-fixing this as much as I could. However, it seems that the issue stems from ABI issues between HDBSCAN and Numpy. Whenever a major version is released from Numpy, there is a chance that it will break HDBSCAN if used together with UMAP.

Python 3.7

BERTopic seems to work in Python 3.7 without any problems; a plain pip install bertopic should work.

Python 3.8+

For now, if you are on Python 3.8 or higher, it seems that the following will work:

pip install --upgrade pip setuptools wheel
pip install bertopic --no-cache-dir
pip uninstall hdbscan -y
pip install hdbscan --no-cache-dir --no-binary :all: --no-build-isolation
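
For environments that are set up from a script, the four steps above could be sketched in Python as follows. The function names are hypothetical, and invoking pip as python -m pip instead of a bare pip is a choice made for this sketch. The flags are exactly the ones listed above. By default nothing is executed, the commands are only printed.

```python
import subprocess
import sys


def workaround_commands(python=sys.executable):
    """Return the four pip invocations from the steps above as argv lists."""
    pip = [python, "-m", "pip"]
    return [
        pip + ["install", "--upgrade", "pip", "setuptools", "wheel"],
        pip + ["install", "bertopic", "--no-cache-dir"],
        pip + ["uninstall", "hdbscan", "-y"],
        # Rebuild hdbscan from source, without an isolated build environment,
        # so that it compiles against the NumPy that is already installed.
        pip + ["install", "hdbscan", "--no-cache-dir",
               "--no-binary", ":all:", "--no-build-isolation"],
    ]


def apply_workaround(dry_run=True):
    """Print each command; only execute them when dry_run is False."""
    for command in workaround_commands():
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)
```

Calling apply_workaround(dry_run=False) would run the steps in order and stop at the first failing command because of check=True.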

Future Fix

At this point, I am not entirely sure how I want to proceed. It seems that numpy>1.20.3 may introduce issues with large datasets on python 3.8+ as UMAP and HDBSCAN do not work properly together in that case. Thus, there does not seem to be a solid fix for now unless HDBSCAN gets updated to prevent this from happening in the future.

Having said that, any and all help is greatly appreciated!

@SkyeCC

SkyeCC commented Jan 6, 2022

Hi MaartenGr! Thanks for your awesome work on BERTopic. I tried your advice:
pip install bertopic
pip uninstall numpy -y
pip install numpy==1.22.0 --no-cache-dir
But I got a new error: numpy.core._exceptions.UFuncTypeError: ufunc 'correct_alternative_cosine' did not contain a loop with signature matching types <class 'numpy.dtype[float32]'> -> None.

@MaartenGr
Owner Author

@SkyeCC I edited the message directly above yours to provide more up-to-date instructions on how to overcome your issue as numpy==1.22.0 indeed does not work. Seeing your error message it is likely that you are using python 3.8+, so if you follow along with the instructions above (uninstalling and re-installing HDBSCAN with some extra params), hopefully, it will work out.

@MaartenGr
Owner Author

Small update: it seems that the HDBSCAN issue can easily be fixed either with a new PyPI release of HDBSCAN or by updating the requirements to install the package from its master branch. It turned out that HDBSCAN's switch to oldest-supported-numpy had not yet been released to PyPI, and since we were all installing from PyPI, the installed package had ABI issues with numpy.

Before I can introduce a fix, I will have to wait until HDBSCAN updates their PyPI package to include oldest-supported-numpy. Another approach is to update the requirements to point to a specific commit in the master branch, but I am not too sure how stable that is.

@salfaro1

Hello,

Thanks for getting to the bottom of this issue!!

I tried the above and it resolved my numpy error, but I am now getting an error saying "ModuleNotFoundError: No module named 'torch._C'" when trying to import BERTopic. I have python version 3.9.6.

Any thoughts?

@MaartenGr
Owner Author

@salfaro1 Could you share the entire error? Also, do you by chance have a torch folder or torch.py file in the folder you currently work in? I would advise starting from a completely fresh environment and then following the instructions for python 3.8+. Most likely, you either are in a torch folder or have a torch.py file, or there are some leftover files from a previous install, so a fresh environment should work in the latter case.
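
The check for a local torch folder or torch.py file can be sketched as below. The helper name shadowing_candidates is hypothetical; it only looks for a file or folder with the module's name in one directory and does not inspect the rest of sys.path.

```python
from pathlib import Path


def shadowing_candidates(module_name, directory="."):
    """Return local files or folders that could shadow an installed module."""
    base = Path(directory)
    candidates = [base / f"{module_name}.py", base / module_name]
    return [path for path in candidates if path.exists()]
```

An empty result for shadowing_candidates("torch") in the working directory suggests that shadowing is not the cause, which leaves leftover files from a previous install as the more likely explanation.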

@TAsUjxnMIL

Hello Maarten,
your approach worked for me to solve the error. But a new error appeared when calling the UMAP object.
I get the error --> TypeError: 'module' object is not callable. Is there any solution for this?

Information:

  • I am working in my conda env

@MaartenGr
Owner Author

@TAsUjxnMIL Could you share the entire code for running BERTopic including the full error message? It may be that UMAP is improperly installed. Python 3.7 is the most stable for BERTopic, so it might be worthwhile to use a completely fresh environment with that version. Note that conda env might have packages pre-installed, so make sure to create a fully fresh environment when doing so.

@TAsUjxnMIL

TAsUjxnMIL commented Jan 13, 2022

Update:
This worked for me after changing to Python 3.7:
import umap
umap_model = umap.UMAP(n_neighbors=15, n_components=10, min_dist=0.0, metric='cosine')
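
One common cause of TypeError: 'module' object is not callable is calling an imported module as if it were the class it contains. Whether that was the cause here is an assumption, since the failing code was not shared. The sketch below reproduces the pattern with the standard library's datetime module as a stand-in for umap.

```python
import datetime

# Calling the module itself fails, just as calling `umap(...)` would.
try:
    datetime(2022, 1, 13)
except TypeError as error:
    message = str(error)  # contains "'module' object is not callable"

# Calling the class inside the module works, like `umap.UMAP(...)`.
day = datetime.datetime(2022, 1, 13)
```

The working snippet above follows the second form: it imports the umap module and then calls the UMAP class on it.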

@salfaro1

Hi, so I checked and there is no torch folder or file in the folder I am working in. To be extra sure this wasn't it, I changed directory to a different location, to no avail. Could you explain what you mean by a fresh environment? I uninstalled Python and started fresh by reinstalling it, but the same exact error appeared.

Here is the entire error message with the traceback:

ModuleNotFoundError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_12928/4237668196.py in <module>
----> 1 from bertopic import BERTopic
      2 import pandas as pd

~\AppData\Local\Programs\Python\Python39\lib\site-packages\bertopic\__init__.py in <module>
----> 1 from bertopic._bertopic import BERTopic
      2
      3 __version__ = "0.9.4"
      4
      5 __all__ = [

~\AppData\Local\Programs\Python\Python39\lib\site-packages\bertopic\_bertopic.py in <module>
     29 from bertopic._utils import MyLogger, check_documents_type, check_embeddings_shape, check_is_fitted
     30 from bertopic._mmr import mmr
---> 31 from bertopic.backend._utils import select_backend
     32 from bertopic import plotting
     33

~\AppData\Local\Programs\Python\Python39\lib\site-packages\bertopic\backend\__init__.py in <module>
      1 from ._base import BaseEmbedder
----> 2 from ._word_doc import WordDocEmbedder
      3 from ._utils import languages
      4
      5 __all__ = [

~\AppData\Local\Programs\Python\Python39\lib\site-packages\bertopic\backend\_word_doc.py in <module>
      2 from typing import List
      3 from bertopic.backend._base import BaseEmbedder
----> 4 from bertopic.backend._utils import select_backend
      5
      6

~\AppData\Local\Programs\Python\Python39\lib\site-packages\bertopic\backend\_utils.py in <module>
      1 from ._base import BaseEmbedder
----> 2 from ._sentencetransformers import SentenceTransformerBackend
      3
      4 languages = ['afrikaans', 'albanian', 'amharic', 'arabic', 'armenian', 'assamese',
      5              'azerbaijani', 'basque', 'belarusian', 'bengali', 'bengali romanize',

~\AppData\Local\Programs\Python\Python39\lib\site-packages\bertopic\backend\_sentencetransformers.py in <module>
      1 import numpy as np
      2 from typing import List, Union
----> 3 from sentence_transformers import SentenceTransformer
      4
      5 from bertopic.backend import BaseEmbedder

~\AppData\Local\Programs\Python\Python39\lib\site-packages\sentence_transformers\__init__.py in <module>
      1 __version__ = "2.1.0"
      2 __MODEL_HUB_ORGANIZATION__ = 'sentence-transformers'
----> 3 from .datasets import SentencesDataset, ParallelSentencesDataset
      4 from .LoggingHandler import LoggingHandler
      5 from .SentenceTransformer import SentenceTransformer

~\AppData\Local\Programs\Python\Python39\lib\site-packages\sentence_transformers\datasets\__init__.py in <module>
----> 1 from .DenoisingAutoEncoderDataset import DenoisingAutoEncoderDataset
      2 from .NoDuplicatesDataLoader import NoDuplicatesDataLoader
      3 from .ParallelSentencesDataset import ParallelSentencesDataset
      4 from .SentencesDataset import SentencesDataset
      5 from .SentenceLabelDataset import SentenceLabelDataset

~\AppData\Local\Programs\Python\Python39\lib\site-packages\sentence_transformers\datasets\DenoisingAutoEncoderDataset.py in <module>
----> 1 from torch.utils.data import Dataset
      2 from typing import List
      3 from ..readers.InputExample import InputExample
      4 import numpy as np
      5 import nltk

~\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\utils\__init__.py in <module>
      2 import sys
      3
----> 4 from .throughput_benchmark import ThroughputBenchmark
      5 from ._crash_handler import enable_minidumps, disable_minidumps, enable_minidumps_on_exceptions
      6

~\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\utils\throughput_benchmark.py in <module>
      1
----> 2 import torch._C
      3
      4 def format_time(time_us=None, time_ms=None, time_s=None):
      5     '''Defines how to format time'''

ModuleNotFoundError: No module named 'torch._C'

@MaartenGr
Owner Author

@salfaro1 Apologies for the late response. With a fresh environment, I mean a clean conda, pyenv, or poetry environment. Whenever things are not working, you typically start with a fresh install of Python through these environments. These versions of Python do not have anything else installed aside from the base packages. What is likely happening is that some dependencies in your environment clash, so starting from a clean slate might help.

If that does not work, using python 3.7 might fix your issue.
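
As an illustration of what a fresh environment means, the sketch below creates an empty virtual environment with Python's built-in venv module. Note that venv is swapped in here for the conda, pyenv, and poetry tools named above because it ships with Python; the helper name create_fresh_env is hypothetical.

```python
import venv
from pathlib import Path


def create_fresh_env(env_dir, with_pip=True):
    """Create an empty virtual environment and return its directory."""
    env_path = Path(env_dir)
    # clear=True wipes any existing contents so no old packages survive.
    venv.create(env_path, clear=True, with_pip=with_pip)
    return env_path
```

After activating the new environment, the installation steps can be repeated without interference from previously installed packages. Because a venv reuses the Python version that created it, switching to Python 3.7 would still require one of the tools mentioned above or a separate Python 3.7 install.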

@YoussefSultan

If you receive a metadata error saying that NumPy's METADATA folder for 1.19.x cannot be found after following the bug fix for Python 3.8+, make sure to run pip install --force-reinstall numpy and restart the kernel.

@Ariannaperla

Hi Maarten,
I'm facing the following issue: I start a training run and, after the embeddings have been created, I get "UFuncTypeError: ufunc 'correct_alternative_cosine' did not contain a loop with signature matching types <class 'numpy.dtype[float32]'> -> None".
I also tried, as suggested:
pip install --upgrade pip setuptools wheel
pip install bertopic --no-cache-dir
pip uninstall hdbscan -y
pip install hdbscan --no-cache-dir --no-binary :all: --no-build-isolation
but I cannot solve the problem. Any suggestions?

@MaartenGr
Owner Author

@Ariannaperla Most likely, you updated to an unsupported numpy or numba version. I would advise starting from a fresh environment and trying the above again. If that does not work, using python 3.7 might solve your issue.

If all fails, you can also install BERTopic from conda, as instructed here.

@MaartenGr
Owner Author

Conda

To those interested, some of the issues users are having with the installation of BERTopic might be resolved by using conda to install BERTopic.

Installing bertopic from the conda-forge channel can be achieved by adding conda-forge to your channels with:

conda config --add channels conda-forge
conda config --set channel_priority strict

Once the conda-forge channel has been enabled, bertopic can be installed with:

conda install bertopic

@denson

denson commented Jan 30, 2022

Using conda to install bertopic worked for me.

@agmo1993

I'm running in a Python 3.9 container. Upgrading the following packages did the trick for me:

numba==0.55.1
numpy==1.21.5
llvmlite==0.38.0
hdbscan==0.8.28
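
To see how an existing environment differs from the pins above, a small comparison can help. This is a sketch with hypothetical helper names; the installed mapping is passed in by the caller (for example, built from pip freeze output), so nothing here queries pip itself.

```python
def parse_pins(text):
    """Parse 'name==version' lines into a dict, skipping blanks and comments."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, version = line.partition("==")
        pins[name.strip().lower()] = version.strip()
    return pins


def mismatches(pins, installed):
    """Return {name: (wanted, found)} for every pin the environment misses."""
    return {
        name: (wanted, installed.get(name))
        for name, wanted in pins.items()
        if installed.get(name) != wanted
    }


# The combination reported as working in the comment above.
PINS = """
numba==0.55.1
numpy==1.21.5
llvmlite==0.38.0
hdbscan==0.8.28
"""
```

An empty result from mismatches(parse_pins(PINS), installed) means the environment already matches the reported combination.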

@MaartenGr
Owner Author

Good news! HDBSCAN was updated to 0.8.28, which means that the numpy.ndarray size changed issue should be solved if you upgrade HDBSCAN to the newest version. In most cases, installing BERTopic will already select that version, which should solve the issue for most users.

There will be a fix in the future to make sure only 0.8.28 is selected but for now, this should be working.

@MaartenGr
Owner Author

Since this issue seems to be resolved, I will close this. To those still experiencing this issue, let me know and we'll see if we can figure something out.

@wilmerhenao

wilmerhenao commented Sep 26, 2022

Hi guys. I would like to add another solution. I know the above solutions work for most people, but they didn't work for me. What really seemed to be the problem in my case was the combination of PyTorch 1.8 and numpy.

I upgraded PyTorch to 1.10 and numpy to 1.23.3 and the problem disappeared. I hope this helps someone out there.
