Enhanced pdfs generated by the import-external-reference enrichment connector #1698

Closed
Lhorus6 opened this issue Jan 15, 2024 · 0 comments · Fixed by #1699
Labels: feature, solved


Lhorus6 commented Jan 15, 2024

Description

Currently, the connector seems to generate PDF files that contain an image/screenshot of the source web page instead of its text.

This behavior prevents:

  1. file text indexing
  2. running the import-document connector to automate the generation of entities/relationships linked to the pdf.

We need to modify the pdf generation so that it contains text rather than an image.

Proposed Solution

PDF generation happens on line 89 of the file "import-external-reference.py", using the "pdfkit" library. This library calls the "wkhtmltopdf" utility to generate the PDF, so that is the step that needs to produce text rather than an image.
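For readers unfamiliar with how "pdfkit" hands work to "wkhtmltopdf", here is a minimal sketch of a URL-to-PDF call. It is not the connector's actual code: the function names, the `DEFAULT_OPTIONS` values and the optional binary path are assumptions made for illustration. Only `pdfkit.from_url` and `pdfkit.configuration` are real library calls.

```python
from typing import Dict, Optional

# Hypothetical options; the options the connector really passes are not shown in this issue.
DEFAULT_OPTIONS: Dict[str, str] = {"quiet": "", "encoding": "UTF-8"}


def merge_options(extra: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return the default wkhtmltopdf options overridden by any caller-supplied ones."""
    merged = dict(DEFAULT_OPTIONS)
    merged.update(extra or {})
    return merged


def render_url_to_pdf(
    url: str,
    wkhtmltopdf_path: Optional[str] = None,
    options: Optional[Dict[str, str]] = None,
) -> bytes:
    """Render a web page to PDF bytes through pdfkit, which shells out to wkhtmltopdf."""
    import pdfkit  # imported lazily so this module loads even when pdfkit is absent

    configuration = None
    if wkhtmltopdf_path:
        # Point pdfkit at a specific binary, e.g. a manually installed newer build.
        configuration = pdfkit.configuration(wkhtmltopdf=wkhtmltopdf_path)

    # Passing False as the output path makes pdfkit return the PDF as bytes.
    return pdfkit.from_url(
        url, False, options=merge_options(options), configuration=configuration
    )
```

Because pdfkit only wraps the command-line tool, whether the output holds text or an image depends on the "wkhtmltopdf" binary it finds, not on the Python code.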

Additional Information

Testing shows that the Dockerfile installs "wkhtmltopdf" from the Debian repository, which provides version 0.12.5.

The newer version 0.12.6 solves the problem, so we need to force the installation of that version.
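To confirm which version ends up in an image, a small check along these lines could be run inside the container. This is a sketch under assumptions: the helper names are hypothetical, and it assumes the binary is on the `PATH` and prints a dotted version number when run with `--version`.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple

Version = Tuple[int, ...]

# Version reported in this issue as generating text instead of an image.
MIN_FIXED_VERSION: Version = (0, 12, 6)


def parse_wkhtmltopdf_version(output: str) -> Optional[Version]:
    """Extract the first x.y.z version number from the tool's version output."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def installed_wkhtmltopdf_version(binary: str = "wkhtmltopdf") -> Optional[Version]:
    """Return the installed version, or None if the binary is missing or unparsable."""
    path = shutil.which(binary)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    return parse_wkhtmltopdf_version(result.stdout + result.stderr)


def has_text_pdf_fix(version: Optional[Version]) -> bool:
    """True when the version is at least the one reported as fixed."""
    return version is not None and version >= MIN_FIXED_VERSION


if __name__ == "__main__":
    found = installed_wkhtmltopdf_version()
    print(f"wkhtmltopdf version: {found}, fixed: {has_text_pdf_fix(found)}")
```

Comparing versions as integer tuples avoids the string-comparison pitfall where "0.12.10" would sort before "0.12.6".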

Would you be willing to submit a PR?

Yes

Lhorus6 added the needs triage label Jan 15, 2024
Lhorus6 self-assigned this Jan 15, 2024
Lhorus6 linked a pull request Jan 15, 2024 that will close this issue
SamuelHassine pushed a commit that referenced this issue Jan 16, 2024
…rt-external-reference (#1698)

Co-authored-by: A. Jard <angelique.jard@gmail.com>
SamuelHassine added the feature and solved labels and removed the needs triage label Jan 16, 2024
SamuelHassine added this to the Release 5.12.18 milestone Jan 16, 2024