[Bug]: Error loading punkt #2412
Comments
Hey there, @aomeng1219! 😄 Long time no chat. How's the digital world treating you lately, aside from this pesky NLTK hiccup?

Based on the error message you provided, the issue appears to be that NLTK could not download the 'punkt' tokenizer because of a network error. Here are some steps you can take to resolve this issue:
Here's an example of how you can manually download and specify the download location in your code:

import nltk
# Specify your custom NLTK data directory here
nltk_data_dir = "/path/to/nltk_data"
# Set the NLTK data directory
nltk.data.path.append(nltk_data_dir)
# Try downloading 'punkt' using the specified directory
nltk.download('punkt', download_dir=nltk_data_dir)
# Now you can load 'punkt' tokenizer as usual
from nltk.tokenize import word_tokenize
# Your code for PDF text extraction and OCR processing using Tesseract goes here

Regarding your question about whether the quivr codebase uses the NLTK 'punkt' tokenizer for document processing, specifically for PDF text extraction and OCR processing with Tesseract: I wasn't able to find any specific references to the 'punkt' tokenizer in the quivr repository. However, it may be used indirectly through another library or module. If you could provide more information about where you're seeing this issue in the quivr codebase, I might be able to provide more specific guidance.

I hope this helps! If you have any more questions or need further clarification, feel free to ask.
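If the network is the problem, it can also help to check whether punkt is already sitting on disk in a directory NLTK searches before retrying the download. Below is a minimal standard-library sketch of that check. The helper names `find_punkt` and `candidate_dirs` are hypothetical, the path is the same placeholder used above, and the `tokenizers/punkt` layout (folder or `.zip`) is assumed from how NLTK names this resource.

```python
import os
from pathlib import Path


def find_punkt(search_dirs):
    """Return the first path holding punkt data, or None if none is found."""
    for base in search_dirs:
        tokenizers = Path(base) / "tokenizers"
        # Assumption: the data is either an unpacked folder or the downloaded zip.
        for candidate in (tokenizers / "punkt", tokenizers / "punkt.zip"):
            if candidate.exists():
                return candidate
    return None


def candidate_dirs(custom_dir):
    """Custom directory first, then any directories listed in NLTK_DATA."""
    env_dirs = [d for d in os.environ.get("NLTK_DATA", "").split(os.pathsep) if d]
    return [custom_dir, *env_dirs]


if __name__ == "__main__":
    nltk_data_dir = "/path/to/nltk_data"  # placeholder, replace with your directory
    found = find_punkt(candidate_dirs(nltk_data_dir))
    print(found if found else "punkt not found; copy it in or retry the download")
```

If this reports a path but NLTK still cannot load the tokenizer, the directory is probably not on `nltk.data.path`; appending it as shown in the earlier snippet, or setting the `NLTK_DATA` environment variable, should make it visible.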
Thanks for your contributions, we'll be closing this issue as it has gone stale. Feel free to reopen if you'd like to continue the discussion.
What happened?
The NLTK data (nltk_data) could not be loaded, and reloading it failed.
Relevant log output
Twitter / LinkedIn details
No response