How To Download PDFs With Specific Content #1028

teklot · 2024-06-26T13:07:43Z

Would you please provide an XML configuration (v3.x) example to download PDFs with specific content? Thanks!

ohtwadi · 2024-06-26T16:35:48Z

You can use the keepDownloads parameter to keep downloaded files.
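
A minimal, untested sketch of where that parameter might go, assuming the Norconex HTTP Collector v3 configuration layout. The ids, workDir, and start URL are placeholders, not values from this thread:

```xml
<httpcollector id="pdf-collector">
  <!-- placeholder working directory -->
  <workDir>./workdir-main</workDir>
  <crawlers>
    <crawler id="pdf-crawler">
      <startURLs>
        <!-- placeholder start URL -->
        <url>https://example.com/</url>
      </startURLs>
      <!-- keep a copy of every file the crawler downloads -->
      <keepDownloads>true</keepDownloads>
    </crawler>
  </crawlers>
</httpcollector>
```

On its own this keeps every downloaded file, so it does not filter by content.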

It's not clear what you mean by PDFs with specific content. Are you looking to filter PDFs based on their metadata or their content?

teklot · 2024-06-26T20:47:08Z

Thanks @ohtwadi. I want the PDFs and only the ones with specific keywords in their text content. Would you please provide a configuration example/snippet?

ohtwadi · 2024-06-28T20:59:48Z

Here is one way to satisfy your requirements with the ExternalTransformer. Please read this blog post to get a better understanding of ExternalTransformer.

The basic idea is:

Use ExternalTransformer (ET) to copy all PDFs to a separate directory in preParseHandlers
Use ScriptTagger to add a metadata field to each document that does not contain the specific keywords
Use ET to delete the PDFs that have that metadata field in postParseHandlers

Here is some pseudocode to get you started:

## declare a few variables
#set($workdir = "./workdir-main")
#set($transformer = "com.norconex.importer.handler.transformer.impl")
#set($extractedDir = "./extracted")

<preParseHandlers>
  <transformer class="${transformer}.ExternalTransformer">
    <!-- apply this transformer to .pdf files only -->
    <restrictTo field="document.reference">.*\.pdf$</restrictTo>
    <!--
      call the system `cp` command to place the file in the `extracted` dir
      (use `copy` if on Windows)
    -->
    <command>cp ${INPUT} ${extractedDir}</command>
    <tempDir>${workdir}/temp</tempDir>
  </transformer>
</preParseHandlers>

<postParseHandlers>
  <handler class="ScriptTagger">
    <!--
      Add this field if you do _not_ want to keep the document. For example:
    -->
    <script><![CDATA[
      if (!content.contains("mytext")) {
        metadata.add("keepThis", "false");
      }
    ]]></script>
  </handler>

  <transformer class="${transformer}.ExternalTransformer">
    <!-- apply this transformer only to the docs we tagged via ScriptTagger above -->
    <restrictTo field="keepThis">false</restrictTo>
    <!--
      call the system `rm` command to delete the file
      (use `del` if on Windows)
      Note: you need to figure out how to get the ${filename}
    -->
    <command>rm ${extractedDir}/${filename}</command>
    <tempDir>${workdir}/temp</tempDir>
  </transformer>
</postParseHandlers>

(Not tested)
