Pdf-figure-extractor

Extracts figures from PDF files, without the text inside them.

Required Packages

Install the system dependencies (Debian/Ubuntu):

sudo apt-get install libopencv-dev libcv-dev libtesseract-dev tesseract-ocr

Installation

Install project dependencies:

npm install

Run

To run it from the command line, install the package globally:

npm install -g pdf-figure-extractor

Usage:

Usage: pdf-figure-extractor [options]

  Options:

    -h, --help             output usage information
    -V, --version          output the version number
    -o, --output <path>    Directory to put results
    -i, --input <path>     Directory to process
    -t, --tmp <path>       Directory to put temporary files
    -p, --partials <path>  Directory to put figure directory

For instance:

pdf-figure-extractor --input "pdf" --output "output"
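
The same command can also be launched from a Node script. The following is a minimal sketch, assuming the CLI was installed globally and is available on the PATH. `buildArgs` and `runExtractor` are hypothetical helper names and not part of the package; only the long flags come from the usage output above.

```javascript
// Sketch: build the argument list for the pdf-figure-extractor CLI.
// `buildArgs` and `runExtractor` are hypothetical helpers, not part of the package.
const { spawnSync } = require('child_process')

// Option names mapped to the long flags listed in the usage output.
const FLAGS = {
  input: '--input',
  output: '--output',
  tmp: '--tmp',
  partials: '--partials'
}

// Turn { input: 'pdf', output: 'output' } into ['--input', 'pdf', '--output', 'output'].
function buildArgs (options) {
  const args = []
  for (const [name, flag] of Object.entries(FLAGS)) {
    if (options[name] !== undefined) {
      args.push(flag, String(options[name]))
    }
  }
  return args
}

// Run the CLI synchronously and return its exit status.
// Assumes `pdf-figure-extractor` is on the PATH (global install).
function runExtractor (options) {
  const result = spawnSync('pdf-figure-extractor', buildArgs(options), { stdio: 'inherit' })
  if (result.error) throw result.error
  return result.status
}

// Example, equivalent to the command above:
// runExtractor({ input: 'pdf', output: 'output' })
```

Keeping the argument building separate from the process spawn makes it possible to check the flags without running the extractor.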

To use it as a module:

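const pfe = require('pdf-figure-extractor')

// Example paths only; replace them with your own directories.
const config = {
  pdfInputPath: 'pdf',              // directory of PDFs to process
  directoryOutputPath: 'output',    // directory for results
  directoryPartialPath: 'partials', // directory for extracted figures
  tmp: 'tmp',                       // directory for temporary files
  debug: true
}

// The constructor returns a promise that resolves to the instance.
new pfe(config).then((self) => {
  return self.exec()
}).then((partials) => {
  console.log(partials)
}).catch((err) => console.error(err))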
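
The promise chain above can also be written with async/await. This is a sketch under the assumption, taken from the example above, that `new pfe(config)` resolves to the instance and that `exec()` resolves to the extracted partials. `extractFigures` is a hypothetical helper name; the constructor is passed in as a parameter so the function does not depend on how the package is loaded.

```javascript
// Sketch: async/await version of the module usage.
// `extractFigures` is a hypothetical helper, not part of the package.
async function extractFigures (Pfe, config) {
  // Per the README example, constructing the object yields a promise
  // that resolves to the instance itself.
  const extractor = await new Pfe(config)
  // `exec()` resolves to the partials.
  return extractor.exec()
}

// Usage (assumes the package is installed and `config` is the object
// from the module example):
// const pfe = require('pdf-figure-extractor')
// extractFigures(pfe, config)
//   .then((partials) => console.log(partials))
//   .catch((err) => console.error(err))
```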

TODO

  • Extract arrays
  • Extract graphs (partial: inherited from array extraction when the graph has a grid inside)
  • Extract images