In [1]:
# necessary imports
import os

from data_pipeline.pdf_scraper.tika_pdf_scraper import pdf_parser_from_folder

# Illustration of usage of tika_pdf_scraper.py

This PDF scraper expects the documents to be scraped to be present in a subfolder of the same project (here: `../../data/nrw/pdfs`). Due to storage constraints, the PDFs have not been uploaded to this GitHub repository, so make sure to provide the files before running the code. While Apache Tika technically also allows scraping from URLs, pre-processing is applied to the PDFs here to improve the scraper's performance. The scraper therefore expects a folder containing the files (in .pdf or .png format) as input.

To use the pdf scraper:
- Install Java (we originally used version 20.0.2; [link](https://www.oracle.com/java/technologies/downloads/))
- **Adjust the data input and data output in code block 1**: Specify a valid path to a folder containing PDF/PNG documents, as well as the paths of the CSV and JSON output files.
- **Adjust the optional sample_size parameter in code block 2**: Specify the desired number of files to parse, or omit the parameter to process all files in the folder.
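Before running the scraper, it can help to confirm that the prerequisites listed above are in place. The following is a minimal sketch, not part of `tika_pdf_scraper.py`: the helper names `list_input_files` and `java_available` are hypothetical, and the sketch assumes that only files directly inside the folder count and that extensions are matched case-insensitively, which the scraper itself may or may not do.

```python
import shutil
from pathlib import Path

# File types the notebook text says the scraper accepts.
SUPPORTED_SUFFIXES = {".pdf", ".png"}


def list_input_files(folder_path):
    """Return the sorted names of .pdf/.png files directly inside folder_path.

    Hypothetical helper for a pre-flight check; it does not call the scraper.
    """
    folder = Path(folder_path)
    if not folder.is_dir():
        raise FileNotFoundError(f"Input folder not found: {folder}")
    return sorted(
        path.name
        for path in folder.iterdir()
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES
    )


def java_available():
    """Return True if a `java` executable can be found on the PATH."""
    return shutil.which("java") is not None
```

Calling `list_input_files(INPUT_FOLDER_PATH)` after code block 1 should then return a non-empty list, and `java_available()` should return `True`, before code block 2 is run.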

In [2]:
# code block 1: specify data inputs and outputs
INPUT_FOLDER_PATH = os.path.join("..", "data", "nrw", "pdfs")
OUTPUT_FILENAME_CSV = os.path.join("..", "data", "nrw", "text", "test_nrw_all.csv")
OUTPUT_FILENAME_JSON = os.path.join("..", "data", "nrw", "text", "test_nrw_all.json")

## Parse PDFs from folder

In [3]:
# code block 2: apply pdf_parser function to folder containing pdfs
parsed_df = pdf_parser_from_folder(folder_path=INPUT_FOLDER_PATH,
                                   sample_size=3)

2023-09-18 17:03:49.502 | INFO     | data_pipeline.pdf_scraper.tika_pdf_scraper:pdf_parser_from_folder:64 - Parsing file: 2368027_8.pdf
2023-09-18 17:03:50,775 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...
2023-09-18 17:04:04.482 | INFO     | data_pipeline.pdf_scraper.tika_pdf_scraper:pdf_parser_from_folder:64 - Parsing file: 2368027_6.pdf
2023-09-18 17:04:04.867 | INFO     | data_pipeline.pdf_scraper.tika_pdf_scraper:pdf_parser_from_folder:64 - Parsing file: 2220502_14.pdf
2023-09-18 17:04:04.957 | INFO     | data_pipeline.pdf_scraper.tika_pdf_scraper:pdf_parser_from_folder:75 - Parsing done.


In [4]:
# code block 3: check results
parsed_df.head()

Unnamed: 0,filename,content,metadata
0,2368027_8.pdf,\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n...,"{'pdf:PDFVersion': '1.6', 'xmp:CreatorTool': '..."
1,2368027_6.pdf,\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n...,"{'pdf:PDFVersion': '1.6', 'xmp:CreatorTool': '..."
2,2220502_14.pdf,\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n...,"{'pdf:unmappedUnicodeCharsPerPage': ['0', '0']..."
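The preview shows that the extracted `content` starts with long runs of newline characters. If these are unwanted downstream, the text can be normalized after parsing. The sketch below is one possible approach under stated assumptions: `clean_extracted_text` is a hypothetical helper, not part of the scraper, and it treats a missing value (Tika can return no content for files without extractable text) as an empty string.

```python
import re


def clean_extracted_text(text):
    """Collapse excess blank lines and trim surrounding whitespace.

    Hypothetical post-processing helper; non-string input (e.g. None)
    is mapped to an empty string.
    """
    if not isinstance(text, str):
        return ""
    text = text.replace("\r\n", "\n")       # normalize Windows line endings
    text = re.sub(r"[ \t]+\n", "\n", text)  # drop trailing spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)  # keep at most one blank line
    return text.strip()
```

Applied to the DataFrame, this could look like `parsed_df["content"] = parsed_df["content"].map(clean_extracted_text)`.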


## Write results to CSV and JSON

In [5]:
# code block 4: write results to CSV and JSON
# write results to csv (without the DataFrame index)
parsed_df.to_csv(OUTPUT_FILENAME_CSV, header=True, index=False)

# write results to json (a list with one record per parsed file)
result_json = parsed_df.to_json(orient='records')

with open(OUTPUT_FILENAME_JSON, 'w', encoding='utf-8') as outputfile:
    outputfile.write(result_json)
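To verify the export, the JSON file can be read back with the standard library. With `orient='records'`, pandas writes a JSON array containing one object per row, so the loaded value should be a list of dictionaries. The helper name `load_records` below is hypothetical and the sketch is only a sanity check, not part of the pipeline.

```python
import json


def load_records(json_path):
    """Load a records-oriented JSON export and return it as a list of dicts.

    Hypothetical helper: raises ValueError if the file does not hold a list.
    """
    with open(json_path, encoding="utf-8") as infile:
        records = json.load(infile)
    if not isinstance(records, list):
        raise ValueError("Expected a JSON array of records")
    return records
```

For example, `len(load_records(OUTPUT_FILENAME_JSON))` should equal `len(parsed_df)` after code block 4 has run.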