SparkPdfExtractor

A way to distribute PDF-to-text extraction over Spark and PDFBox.

GOAL

  • PDFs are serialized into Avro
  • The Avro data is distributed as a Spark RDD over X partitions
  • Each partition is collected and its extracted text is stored as a CSV part (see the sketch after this list)
  • The CSV parts are then merged and compressed
  • The archive goes back to the application server, which loads the PostgreSQL table
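
A minimal Scala sketch of this extraction stage (Avro in, CSV parts out). The record layout (an id field plus the raw PDF bytes in a pdf field), the class name, and the assumption that the numeric command-line argument is the partition count are illustrative, not taken from this repository's code.

    // Hypothetical record layout: (id: string, pdf: bytes); not the project's actual schema.
    import org.apache.avro.generic.GenericRecord
    import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
    import org.apache.hadoop.io.NullWritable
    import org.apache.pdfbox.pdmodel.PDDocument
    import org.apache.pdfbox.text.PDFTextStripper
    import org.apache.spark.{SparkConf, SparkContext}

    object PdfExtractSketch {
      def main(args: Array[String]): Unit = {
        val Array(inputAvro, outputCsv, numPartitions) = args
        val sc = new SparkContext(new SparkConf().setAppName("pdf-extract-sketch"))

        sc.hadoopFile[AvroWrapper[GenericRecord], NullWritable, AvroInputFormat[GenericRecord]](inputAvro)
          .map { case (wrapper, _) =>
            // copy the PDF bytes out of the reused Avro wrapper
            val record = wrapper.datum()
            val buf = record.get("pdf").asInstanceOf[java.nio.ByteBuffer]
            val bytes = new Array[Byte](buf.remaining())
            buf.get(bytes)
            (record.get("id").toString, bytes)
          }
          .repartition(numPartitions.toInt)            // assumed: the numeric argument is the partition count
          .map { case (id, bytes) =>
            val doc = PDDocument.load(bytes)           // parse the PDF with PDFBox
            try {
              val text = new PDFTextStripper().getText(doc)
              s"$id,${text.replaceAll("[\\r\\n,]", " ")}" // naive CSV escaping, for the sketch only
            } finally doc.close()
          }
          .saveAsTextFile(outputCsv)                   // one CSV part file per partition
      }
    }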

PERFORMANCE

  • 50 million PDFs, averaging 3 pages each, were converted to text in 2 hours of runtime

BUILD

  • make build

USE (yarn)

  1. Transform the PDFs to Avro (see the PdfAvro folder; a sketch follows this list)
  2. Push the 2 jars to the Spark cluster
  3. spark-submit --jars wind-pdf-extractor-1.0-SNAPSHOT-jar-with-dependencies.jar --driver-java-options "-Dlog4j.configuration=file:log4jmaster" --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4jslave" --num-executors 120 --executor-cores 1 --master yarn pdfextractor_2.11-0.1.0-SNAPSHOT.jar inputAvroHdfsFolder/ outputCsvHdfsFolder/ 400
  4. It is crucial to set only one core per executor (--executor-cores 1)
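
For step 1, a hedged sketch of packing PDF files into an Avro container file. The (id: string, pdf: bytes) schema mirrors the extraction sketch above; it is an assumption, not necessarily what the PdfAvro tool produces.

    // Assumed schema; adjust to whatever PdfAvro actually writes.
    import java.io.File
    import java.nio.ByteBuffer
    import java.nio.file.Files
    import org.apache.avro.{Schema, SchemaBuilder}
    import org.apache.avro.file.DataFileWriter
    import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}

    object PdfToAvroSketch {
      val schema: Schema = SchemaBuilder.record("PdfRecord").fields()
        .requiredString("id")
        .requiredBytes("pdf")
        .endRecord()

      def main(args: Array[String]): Unit = {
        val Array(pdfDir, avroOut) = args
        val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
        writer.create(schema, new File(avroOut))
        for (pdf <- new File(pdfDir).listFiles().filter(_.getName.endsWith(".pdf"))) {
          val record = new GenericData.Record(schema)
          record.put("id", pdf.getName)                                       // use the file name as the document id
          record.put("pdf", ByteBuffer.wrap(Files.readAllBytes(pdf.toPath))) // raw PDF bytes
          writer.append(record)
        }
        writer.close()
      }
    }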

CONFIGURATION

  • ulimit -n 64000 (the default of 1024 is far too low)

READING
