[Bug]: Error loading punkt #2412

aomeng1219 · 2024-04-08T04:19:59Z

What happened?

The nltk_data resource punkt could not be loaded, and the attempt to download it again also failed.

Relevant log output

_id%2C%20name%2C%20%2A&brain_id=eq.40ba47d7-51b2-4b2a-9247-89e29619efb0 "HTTP/1.1 200 OK"
2024-04-08 12:13:00 [2024-04-08 04:13:00,756: INFO/ForkPoolWorker-10] pikepdf C++ to Python logger bridge initialized
2024-04-08 12:13:00 [2024-04-08 04:13:00,913: WARNING/ForkPoolWorker-10] [nltk_data] Error loading punkt: <urlopen error [Errno 99] Cannot
2024-04-08 12:13:00 [nltk_data]     assign requested address>
2024-04-08 12:13:00 [2024-04-08 04:13:00,914: ERROR/ForkPoolWorker-10] 
2024-04-08 12:13:00 **********************************************************************
2024-04-08 12:13:00   Resource punkt not found.
2024-04-08 12:13:00   Please use the NLTK Downloader to obtain the resource:
2024-04-08 12:13:00 
2024-04-08 12:13:00   >>> import nltk
2024-04-08 12:13:00   >>> nltk.download('punkt')
2024-04-08 12:13:00   
2024-04-08 12:13:00   For more information see: https://www.nltk.org/data.html
2024-04-08 12:13:00 
2024-04-08 12:13:00   Attempted to load tokenizers/punkt/PY3/english.pickle
2024-04-08 12:13:00 
2024-04-08 12:13:00   Searched in:
2024-04-08 12:13:00     - '/root/nltk_data'
2024-04-08 12:13:00     - '/usr/local/nltk_data'
2024-04-08 12:13:00     - '/usr/local/share/nltk_data'
2024-04-08 12:13:00     - '/usr/local/lib/nltk_data'
2024-04-08 12:13:00     - '/usr/share/nltk_data'
2024-04-08 12:13:00     - '/usr/local/share/nltk_data'
2024-04-08 12:13:00     - '/usr/lib/nltk_data'
2024-04-08 12:13:00     - '/usr/local/lib/nltk_data'
2024-04-08 12:13:00     - ''
2024-04-08 12:13:00 **********************************************************************
2024-04-08 12:13:00 
2024-04-08 12:13:00 [2024-04-08 04:13:00,914: WARNING/ForkPoolWorker-10] PDF text extraction failed, skip text extraction...
2024-04-08 12:13:04 [2024-04-08 04:13:04,273: INFO/ForkPoolWorker-10] Processing entire page OCR with tesseract...
2024-04-08 12:13:11 [2024-04-08 04:13:11,062: INFO/ForkPoolWorker-10] Processing entire page OCR with tesseract...
2024-04-08 12:13:22 [2024-04-08 04:13:22,807: INFO/ForkPoolWorker-10] Processing entire page OCR with tesseract...
2024-04-08 12:13:25 [2024-04-08 04:13:25,759: INFO/ForkPoolWorker-10] Processing entire page OCR with tesseract...
2024-04-08 12:13:29 [2024-04-08 04:13:29,689: INFO/ForkPoolWorker-10] Processing entire page OCR with tesseract...
2024-04-08 12:13:35 [2024-04-08 04:13:35,570: INFO/ForkPoolWorker-10] Processing entire page OCR with tesseract...
2024-04-08 12:13:40 [2024-04-08 04:13:40,788: INFO/ForkPoolWorker-10] Processing entire page OCR with tesseract...
2024-04-08 12:13:42 [2024-04-08 04:13:42,698: INFO/ForkPoolWorker-10] Processing entire page OCR with tesseract...
2024-04-08 12:13:45 [2024-04-08 04:13:45,355: INFO/ForkPoolWorker-10] Processing entire page OCR with tesseract...
2024-04-08 12:13:51 [2024-04-08 04:13:51,109: INFO/ForkPoolWorker-10] Processing entire page OCR with tesseract...
2024-04-08 12:13:59 [2024-04-08 04:13:59,008: INFO/ForkPoolWorker-10] Processing entire page OCR with tesseract...
2024-04-08 12:14:07 [2024-04-08 04:14:07,593: INFO/ForkPoolWorker-10] Processing entire page OCR with tesseract...
2024-04-08 12:14:14 [2024-04-08 04:14:14,757: WARNING/ForkPoolWorker-10] [nltk_data] Error loading punkt: <urlopen error [Errno 99] Cannot
2024-04-08 12:14:14 [nltk_data]     assign requested address>
2024-04-08 12:14:14 [2024-04-08 04:14:14,757: WARNING/ForkPoolWorker-10] Error processing file: 
2024-04-08 12:14:14 **********************************************************************
2024-04-08 12:14:14   Resource punkt not found.
2024-04-08 12:14:14   Please use the NLTK Downloader to obtain the resource:
2024-04-08 12:14:14 
2024-04-08 12:14:14   >>> import nltk
2024-04-08 12:14:14   >>> nltk.download('punkt')
2024-04-08 12:14:14   
2024-04-08 12:14:14   For more information see: https://www.nltk.org/data.html
2024-04-08 12:14:14 
2024-04-08 12:14:14   Attempted to load tokenizers/punkt/PY3/english.pickle
2024-04-08 12:14:14 
2024-04-08 12:14:14   Searched in:
2024-04-08 12:14:14     - '/root/nltk_data'
2024-04-08 12:14:14     - '/usr/local/nltk_data'
2024-04-08 12:14:14     - '/usr/local/share/nltk_data'
2024-04-08 12:14:14     - '/usr/local/lib/nltk_data'
2024-04-08 12:14:14     - '/usr/share/nltk_data'
2024-04-08 12:14:14     - '/usr/local/share/nltk_data'
2024-04-08 12:14:14     - '/usr/lib/nltk_data'
2024-04-08 12:14:14     - '/usr/local/lib/nltk_data'
2024-04-08 12:14:14     - ''
2024-04-08 12:14:14 **********************************************************************
2024-04-08 12:14:14 [2024-04-08 04:14:14,784: INFO/ForkPoolWorker-10] HTTP Request: PATCH http://host.docker.internal:54321/rest/v1/notifications?id=eq.d4cf8bd6-c89f-448e-b09d-7d7cb030d925 "HTTP/1.1 200 OK"
2024-04-08 12:14:14 [2024-04-08 04:14:14,802: INFO/ForkPoolWorker-10] HTTP Request: PATCH http://host.docker.internal:54321/rest/v1/brains?brain_id=eq.40ba47d7-51b2-4b2a-9247-89e29619efb0 "HTTP/1.1 200 OK"
2024-04-08 12:14:14 [2024-04-08 04:14:14,807: INFO/ForkPoolWorker-10] Task process_file_and_notify[6d1b3621-892f-46b8-891e-b65ca491c961] succeeded in 76.24197530501988s: True
2024-04-08 12:15:00 [2024-04-08 04:15:00,002: INFO/MainProcess] Task celery_worker.process_integration_brain_sync[1eb6f18c-55ce-4129-8d68-3158b0f1831c] received
2024-04-08 12:15:00 [2024-04-08 04:15:00,035: INFO/ForkPoolWorker-10] HTTP Request: GET http://host.docker.internal:54321/rest/v1/integrations_user?select=%2A%2C%20integrations%20%28%29&integrations.integration_name=eq.notion "HTTP/1.1 200 OK"
2024-04-08 12:15:00 [2024-04-08 04:15:00,038: INFO/ForkPoolWorker-10] Task celery_worker.process_integration_brain_sync[1eb6f18c-55ce-4129-8d68-3158b0f1831c] succeeded in 0.034802476002369076s: None

Twitter / LinkedIn details

No response

dosubot · 2024-04-08T17:55:09Z

Hey there, @aomeng1219! 😄 Long time no chat. How's the digital world treating you lately, aside from this pesky NLTK hiccup?

Based on the error message you provided, NLTK could not download the 'punkt' tokenizer because of a network error (<urlopen error [Errno 99] Cannot assign requested address>). The error is raised while the connection is being opened, before any data is transferred, so it points to the network setup of the environment running the worker rather than to NLTK itself. Possible causes include missing outbound connectivity from the container, a proxy server issue, or a firewall blocking the request.
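To narrow down the network problem described above, a small connectivity probe can be run from inside the same container as the worker. This is a minimal sketch using only the standard library; the helper name `check_tcp` is hypothetical, and the host in the usage note is an assumption based on NLTK's default download index being served from raw.githubusercontent.com.

```python
import socket
from typing import Tuple


def check_tcp(host: str, port: int = 443, timeout: float = 3.0) -> Tuple[bool, str]:
    """Try to open a TCP connection and report the outcome as (ok, detail)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, f"connected to {host}:{port}"
    except OSError as exc:
        # Covers refused connections, timeouts, DNS failures and
        # "Cannot assign requested address" (reported as errno 99 in the log).
        return False, f"{type(exc).__name__}: errno={exc.errno} {exc}"
```

Calling `check_tcp("raw.githubusercontent.com")` from a Python shell inside the worker container should show whether the same errno appears outside of NLTK. If it does, the problem lies in the container's networking rather than in the application code.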

Here are some steps you can take to resolve this issue:

1. Check your network connection: Ensure that your server has a stable internet connection and can access the NLTK servers.
2. Check your firewall settings: If you're running this on a server, there might be firewall rules that are blocking the request. You might need to adjust these settings to allow the download to proceed.
3. Check your proxy settings: If you're behind a proxy, you might need to configure Python to use the correct proxy settings. You can do this by setting the http_proxy and https_proxy environment variables.
4. Manual download: If the automatic download still fails, you can manually download the 'punkt' tokenizer from the NLTK data source, and then place it in one of the directories that NLTK checks for data. The directories that NLTK checks are listed in your error message.
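The manual download step can be sketched as follows. The helper `install_punkt_zip` is hypothetical and uses only the standard library; it assumes you have already fetched a punkt.zip archive on a machine with network access, and that the archive contains a top-level `punkt/` folder, which matches the `tokenizers/punkt/PY3/english.pickle` path in the error message.

```python
import zipfile
from pathlib import Path


def install_punkt_zip(zip_path: str, nltk_data_dir: str) -> Path:
    """Unpack a manually downloaded punkt.zip under <nltk_data_dir>/tokenizers."""
    target = Path(nltk_data_dir) / "tokenizers"
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    # This is the file NLTK reported as missing in the log output.
    pickle_path = target / "punkt" / "PY3" / "english.pickle"
    if not pickle_path.is_file():
        raise FileNotFoundError(f"expected {pickle_path} after extraction")
    return pickle_path
```

For example, `install_punkt_zip("punkt.zip", "/usr/local/share/nltk_data")` would place the files in one of the directories listed under "Searched in" in the log, so no download is attempted at runtime.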

Here's an example of how you can download 'punkt' into a custom directory and tell NLTK to search that directory:

```python
import nltk

# Specify your custom NLTK data directory here
nltk_data_dir = "/path/to/nltk_data"

# Set the NLTK data directory
nltk.data.path.append(nltk_data_dir)

# Try downloading 'punkt' using the specified directory
nltk.download('punkt', download_dir=nltk_data_dir)

# Now you can load 'punkt' tokenizer as usual
from nltk.tokenize import word_tokenize

# Your code for PDF text extraction and OCR processing using Tesseract goes here
```
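Because the log shows the download being retried on every task, it may also help to check for the resource on disk first and only fall back to the network when it is missing. The following is a minimal sketch under stated assumptions: `ensure_nltk_resource` is a hypothetical helper, not part of NLTK or quivr. It relies on `nltk.data.find` raising `LookupError` for a missing resource and on `nltk.download` returning `False` when the fetch fails. The `find` and `download` parameters exist only so the logic can be exercised with stand-ins.

```python
from typing import Callable, Optional


def ensure_nltk_resource(
    resource_path: str = "tokenizers/punkt",
    package: str = "punkt",
    find: Optional[Callable[[str], object]] = None,
    download: Optional[Callable[..., bool]] = None,
) -> bool:
    """Return True if the resource is available locally, downloading only if missing."""
    if find is None or download is None:
        import nltk  # imported lazily so the helper can be tested without NLTK

        find = find or nltk.data.find
        download = download or nltk.download
    try:
        find(resource_path)
        return True  # already on disk, no network access needed
    except LookupError:
        pass
    # The download reports failure through its return value rather than raising.
    if not download(package, quiet=True):
        return False
    try:
        find(resource_path)
        return True
    except LookupError:
        return False
```

A worker could call `ensure_nltk_resource()` once at startup and log a clear message when it returns `False`, instead of hitting the same network error in every task.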

As for whether the quivr codebase itself uses the NLTK 'punkt' tokenizer for document processing (PDF text extraction and OCR processing with Tesseract): I wasn't able to find any direct references to the 'punkt' tokenizer in the quivr repository. However, it's possible that it's being used indirectly through another library or module. If you could provide more information about where you're seeing this issue in the quivr codebase, I might be able to provide more specific guidance.

I hope this helps! If you have any more questions or need further clarification, feel free to ask.

Sources

backend/packages/files/parsers/common.py


github-actions · 2024-07-07T20:05:09Z

Thanks for your contributions, we'll be closing this issue as it has gone stale. Feel free to reopen if you'd like to continue the discussion.

aomeng1219 added the bug label Apr 8, 2024

github-actions bot added the Stale label Jul 7, 2024

github-actions bot closed this as not planned Jul 12, 2024
