How To Download PDFs With Specific Content #1028
Comments
You can use the keepDownloads parameter to keep downloaded files. It's not clear what you mean by PDFs with specific content. Are you looking to filter PDFs based on their metadata or their content?
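As a rough sketch of where that setting might go, assuming a v3.x collector configuration: the ids `example-collector` and `pdf-crawler` and the start URL are placeholders, and the exact placement of `keepDownloads` is an assumption that should be checked against the configuration reference for your version.

```xml
<httpcollector id="example-collector">
  <crawlers>
    <crawler id="pdf-crawler">
      <startURLs>
        <!-- placeholder URL, replace with the site you crawl -->
        <url>https://example.com/</url>
      </startURLs>
      <!-- assumed placement: keep the files fetched by this crawler -->
      <keepDownloads>true</keepDownloads>
    </crawler>
  </crawlers>
</httpcollector>
```

On its own this would keep every downloaded file, not only the PDFs that contain a given keyword.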
Thanks @ohtwadi. I want the PDFs and only the ones with specific keywords in their text content. Would you please provide a configuration example/snippet?
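Here is one way to satisfy your requirements with the ExternalTransformer. Please read this blog post to get a better understanding of ExternalTransformer. The basic idea is to copy every PDF to a separate directory before parsing, tag the documents whose parsed text lacks your keyword, and then delete the tagged files from that directory.
Here is some pseudo code to get you started: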
<preParserHandlers>
  <transformer class="${transformer}.ExternalTransformer">
    <!-- apply this transformer to .pdf files only -->
    <restrictTo field="document.reference">.*\.pdf$</restrictTo>
    <!--
      call the system `cp` command to place the file in the `extracted` dir
      (use `copy` if on Windows)
    -->
    <command>cp ${INPUT} ${extractedDir}</command>
    <tempDir>${workdir}/temp</tempDir>
  </transformer>
</preParserHandlers>
<postParserHandlers>
  <handler class="ScriptTagger">
    <script>
      <!--
        Add this field if you do _not_ want to keep the document. For example,
      -->
      <![CDATA[
        if (!content.contains("mytext")) {
          metadata.add("keepThis", "false");
        }
      ]]>
    </script>
  </handler>
  <transformer class="${transformer}.ExternalTransformer">
    <!-- apply this transformer only to the docs we tagged via ScriptTagger above -->
    <restrictTo field="keepThis">false</restrictTo>
    <!--
      call the system `rm` command to delete the file
      (use `del` if on Windows).
      Note: you need to figure out how to get the ${filename}
    -->
    <command>rm ${extractedDir}/${filename}</command>
    <tempDir>${workdir}/temp</tempDir>
  </transformer>
</postParserHandlers>
(Not tested)
Would you please provide an XML configuration (v3.x) example to download PDFs with specific content? Thanks!