How To Download PDFs With Specific Content #1028

Open
teklot opened this issue Jun 26, 2024 · 3 comments


@teklot

teklot commented Jun 26, 2024

Would you please provide an XML configuration (v3.x) example to download PDFs with specific content? Thanks!

@ohtwadi
Contributor

ohtwadi commented Jun 26, 2024

You can use the keepDownloads parameter to keep the downloaded files.
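
A minimal sketch of where that setting might go, assuming the Norconex HTTP Collector v3.x configuration layout. The collector id, crawler id, working directory, and start URL are placeholders, not values from this thread:

```xml
<httpcollector id="pdf-collector">
  <!-- placeholder working directory -->
  <workDir>./workdir-main</workDir>
  <crawlers>
    <crawler id="pdf-crawler">
      <startURLs>
        <!-- placeholder start URL -->
        <url>https://example.com/</url>
      </startURLs>
      <!-- keep a copy of every downloaded file on disk -->
      <keepDownloads>true</keepDownloads>
    </crawler>
  </crawlers>
</httpcollector>
```

Note that this keeps every download, so on its own it does not filter PDFs by their text content.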

It's not clear what you mean by PDFs with specific content. Are you looking to filter PDFs based on their metadata or their content?

@teklot
Author

teklot commented Jun 26, 2024

Thanks @ohtwadi. I want the PDFs and only the ones with specific keywords in their text content. Would you please provide a configuration example/snippet?

@ohtwadi
Contributor

ohtwadi commented Jun 28, 2024

Here is one way to satisfy your requirements with the ExternalTransformer. Please read this blog post to get a better understanding of ExternalTransformer.

The basic idea is:

  • Use ExternalTransformer (ET) in a pre-parse handler to copy all PDFs to a separate directory
  • Use ScriptTagger in a post-parse handler to add a metadata field to the document if it does not contain the specific keywords
  • Use ET in a post-parse handler to delete each copied PDF whose document has that metadata field

Here is some pseudocode to get you started:

## declare a few variables
#set($workdir = "./workdir-main")
#set($transformer = "com.norconex.importer.handler.transformer.impl")
#set($extractedDir = "./extracted")
<preParseHandlers>
  <handler class="${transformer}.ExternalTransformer">
    <!-- apply this transformer to .pdf files only -->
    <restrictTo>
      <fieldMatcher>document.reference</fieldMatcher>
      <valueMatcher method="regex">.*\.pdf$</valueMatcher>
    </restrictTo>
    <!--
      call the system `cp` command to place the file in the `extracted` dir
      (use `copy` if on Windows)
    -->
    <command>cp ${INPUT} ${extractedDir}</command>
    <tempDir>${workdir}/temp</tempDir>
  </handler>
</preParseHandlers>
<postParseHandlers>
  <handler class="com.norconex.importer.handler.tagger.impl.ScriptTagger">
    <!-- Add this field if you do _not_ want to keep the document. For example: -->
    <script><![CDATA[
      // indexOf is available on both Java and JavaScript strings
      if (content.indexOf("mytext") < 0) {
        metadata.add("keepThis", "false");
      }
    ]]></script>
  </handler>

  <handler class="${transformer}.ExternalTransformer">
    <!-- apply this transformer only to the docs we tagged via ScriptTagger above -->
    <restrictTo>
      <fieldMatcher>keepThis</fieldMatcher>
      <valueMatcher>false</valueMatcher>
    </restrictTo>
    <!--
      calls the system `rm` command to delete the file
      (use `del` if on Windows).
      Note: you need to figure out how to get the ${filename}
    -->
    <command>rm ${extractedDir}/${filename}</command>
    <tempDir>${workdir}/temp</tempDir>
  </handler>
</postParseHandlers>

(Not tested)
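
On the open ${filename} question, one option might be to derive a stable file name from the document reference, since ExternalTransformer documents a ${REFERENCE} token alongside ${INPUT} and ${OUTPUT}. The sketch below assumes two hypothetical wrapper scripts, copy-pdf.sh and delete-pdf.sh, which are not part of Norconex and which you would have to write yourself. Both are assumed to turn the reference into the same file name (for example by hashing it) and to copy ${INPUT} to ${OUTPUT} so that the document content passes through unchanged. The script paths and the ./extracted directory are placeholders. This is also untested:

```xml
<preParseHandlers>
  <handler class="com.norconex.importer.handler.transformer.impl.ExternalTransformer">
    <restrictTo>
      <fieldMatcher>document.reference</fieldMatcher>
      <valueMatcher method="regex">.*\.pdf$</valueMatcher>
    </restrictTo>
    <!-- hypothetical script: stores the PDF in ./extracted under a name derived from the reference -->
    <command>/path/to/copy-pdf.sh ${INPUT} ${OUTPUT} ${REFERENCE} ./extracted</command>
  </handler>
</preParseHandlers>
<postParseHandlers>
  <!-- the ScriptTagger that sets keepThis=false goes here, before the handler below -->
  <handler class="com.norconex.importer.handler.transformer.impl.ExternalTransformer">
    <restrictTo>
      <fieldMatcher>keepThis</fieldMatcher>
      <valueMatcher>false</valueMatcher>
    </restrictTo>
    <!-- hypothetical script: recomputes the same name from the reference and deletes that file -->
    <command>/path/to/delete-pdf.sh ${INPUT} ${OUTPUT} ${REFERENCE} ./extracted</command>
  </handler>
</postParseHandlers>
```

Because both scripts compute the name from the same reference, the delete step no longer depends on the temporary file name that ${INPUT} happens to have at copy time.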


2 participants