Need help with personal docx files #1

Open
mattolson93 opened this issue Dec 21, 2023 · 2 comments

Comments

@mattolson93

Hello, great poster at NeurIPS, and it was good to meet you all!

I have some custom docx files (PDFs that I converted to docx with Adobe) that I am trying to extract text from. I got the Docker container up and running, and I modified run_single_node.sh to run only the annotation step on my_docxs.tar.gz in the data folder. The script seems to execute, but nothing shows up in the failed or extracted-text output folders. What am I doing wrong? I've pasted the whole log below. I also tried a tarball containing just one simple docx with random text, to check that my converted files aren't what is causing the issue.
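
For reference, the single-docx test tarball mentioned above could be built and inspected with something like the following. This is only a sketch: the helper names `build_tarball` and `list_docx_members` are made up for illustration, and placing the files at the archive root (no directory prefix) is an assumption about what the pipeline expects, not something confirmed here.

```python
import tarfile
from pathlib import Path


def build_tarball(docx_paths, tar_path):
    """Pack the given files into a gzip-compressed tar archive.

    Assumption: files are stored at the archive root under their base name.
    """
    with tarfile.open(tar_path, "w:gz") as tar:
        for p in docx_paths:
            tar.add(p, arcname=Path(p).name)


def list_docx_members(tar_path):
    """Return the names of regular-file members whose name ends in .docx."""
    with tarfile.open(tar_path, "r:gz") as tar:
        return [
            m.name
            for m in tar.getmembers()
            if m.isfile() and m.name.lower().endswith(".docx")
        ]
```

Calling `list_docx_members("data/my_docxs.tar.gz")` before starting the run at least confirms that the archive opens and that the docx files are where they are expected to be.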

Lastly, a demo that works for you on a personal set of docx files would be very helpful for debugging.

Thanks,
Matt Olson

[2023-12-21 21:21:36,464]::MainProcess          ::INFO::source_tars: [PosixPath('data/paper.tar.gz'), PosixPath('data/paper2.tar.gz')]
[2023-12-21 21:21:36,471]::MainProcess          ::INFO::args: {'data_dir': 'data', 'output_dir': './data/out', 'input_files': None, 'crawl_id': 'test', 'max_docs': -1, 'soffice_executable': 'soffice', 'config': 'configs/default_config.yaml', 'job_id': None}
[2023-12-21 21:21:36,475]::MainProcess          ::INFO::results_dir: data/out
[2023-12-21 21:21:36,476]::MainProcess          ::INFO::annotations_dir: data/out/multimodal
[2023-12-21 21:21:36,476]::MainProcess          ::INFO::meta_dir: data/out/meta
[2023-12-21 21:21:36,477]::MainProcess          ::INFO::text_dir: data/out/text
[2023-12-21 21:21:36,477]::MainProcess          ::INFO::failed_dir: data/out/failed
[2023-12-21 21:21:36,477]::MainProcess          ::INFO::num_annotators: 2
[2023-12-21 21:21:36,482]::MainProcess          ::INFO::max_docs_per_process: -1
[2023-12-21 21:21:36,486]::AnnotationMonitor-2  ::INFO::Start monitoring...
[2023-12-21 21:21:41,273]::MainProcess          ::INFO::soffice(PID=104) started @ localhost:38357
[2023-12-21 21:21:41,276]::MainProcess          ::INFO::initialized.
[2023-12-21 21:21:41,277]::MainProcess          ::INFO::input_tars=[PosixPath('data/paper.tar.gz')]
[2023-12-21 21:21:45,824]::MainProcess          ::INFO::soffice(PID=178) started @ localhost:58509
[2023-12-21 21:21:45,827]::MainProcess          ::INFO::initialized.
[2023-12-21 21:21:45,827]::MainProcess          ::INFO::input_tars=[PosixPath('data/paper2.tar.gz')]
[2023-12-21 21:21:45,838]::AnnotatorProcess-4   ::INFO::annotator_3bae1666-e805-4ada-8543-0242642f26eb_0001 start processing data/paper2.tar.gz.
[2023-12-21 21:21:45,837]::AnnotatorProcess-3   ::INFO::annotator_3bae1666-e805-4ada-8543-0242642f26eb_0000 start processing data/paper.tar.gz.
[2023-12-21 21:21:45,857]::AnnotatorProcess-3   ::ERROR::(self.run) FileNotFoundError: [Errno 2] No such file or directory: '/usr/app/data/tmp/tmpyo4oefe3'
[2023-12-21 21:21:45,857]::AnnotatorProcess-3   ::INFO::annotator_3bae1666-e805-4ada-8543-0242642f26eb_0000 finished. Shutting down.
[2023-12-21 21:21:45,860]::AnnotatorProcess-3   ::INFO::shutting down soffice process with pid 104
[2023-12-21 21:21:45,944]::AnnotatorProcess-4   ::ERROR::(self.run) FileNotFoundError: [Errno 2] No such file or directory: '/usr/app/data/tmp/tmp3veibb1j'
[2023-12-21 21:21:45,945]::AnnotatorProcess-4   ::INFO::annotator_3bae1666-e805-4ada-8543-0242642f26eb_0001 finished. Shutting down.
[2023-12-21 21:21:45,947]::AnnotatorProcess-4   ::INFO::shutting down soffice process with pid 178
[2023-12-21 21:21:46,892]::MainProcess          ::INFO::annotator_3bae1666-e805-4ada-8543-0242642f26eb_0000 done.
[2023-12-21 21:21:46,971]::MainProcess          ::INFO::annotator_3bae1666-e805-4ada-8543-0242642f26eb_0001 done.
[2023-12-21 21:21:46,980]::AnnotationMonitor-2  ::INFO::AnnotationMonitor done.
[2023-12-21 21:21:46,991]::MainProcess          ::INFO::annotation done.
[2023-12-21 21:21:46,992]::MainProcess          ::INFO::total time: 0:00:10.557422
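
One possible reading of the two FileNotFoundError lines above is that the parent directory /usr/app/data/tmp does not exist inside the container, so the per-annotator temporary directory cannot be created. That is a guess, not something the log confirms. A quick way to test it from inside the container might look like this (the helper names `can_create_tmp` and `ensure_dir` are hypothetical and not part of the project):

```python
import tempfile
from pathlib import Path


def can_create_tmp(parent):
    """Return True if a temporary directory can be created under `parent`."""
    try:
        # The directory is removed again when the `with` block exits.
        with tempfile.TemporaryDirectory(dir=str(parent)):
            return True
    except FileNotFoundError:
        # Raised when `parent` itself does not exist.
        return False


def ensure_dir(parent):
    """Create `parent` (and any missing ancestors) if it is not there yet."""
    path = Path(parent)
    path.mkdir(parents=True, exist_ok=True)
    return path


if __name__ == "__main__":
    # Path taken from the error message in the log above.
    tmp_parent = "/usr/app/data/tmp"
    print(tmp_parent, "usable:", can_create_tmp(tmp_parent))
```

If this reports that the directory is not usable, creating it with `ensure_dir` (or `mkdir -p` on the mounted data folder) and rerunning would show whether the missing directory is really the cause.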
@zhangzhiyang-2020

I ran into the same error: (self.run) FileNotFoundError: [Errno 2] No such file or directory ... ...
Did you solve this problem?
Thanks in advance!

@mattolson93
Author

No, sorry :( I gave up on using docs
