![dssg_banner](assets/dssg_banner.png)

## Getting Started

### Optional: Set up Google Colab environment

If you work in a local IDE and have already installed the packages and requirements on your own machine, you can skip this section and start from the Imports section.
Otherwise, you can follow and execute this notebook in your browser: click the button below to open this page in the Colab environment.

<a href="https://colab.research.google.com/github/DSSGxMunich/land-sealing-dataset-and-analysis/blob/main/src/2_2_pdf_scraper.ipynb" target="_parent"> <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Google Colab"/> </a>

By running the first cell, you make a Google Drive folder available in Colab (the cell mounts your Drive at `/content/drive`). All the files for this tutorial have to be uploaded to this folder. The first time you run the cell, you may see some warnings and notifications; please respond to them as follows:

1. Warning: This notebook was not authored by Google. Click on 'Run anyway'.
2. Permit this notebook to access your Google Drive files? *Click* on 'Yes', and select your account.
3. Google Drive for desktop wants to access your Google Account. *Click* on 'Allow'.

At this point, the folder is available and you can browse it in the left-hand panel in Colab. You might also receive an email informing you that your Google Drive was accessed.

In [None]:
# Mount your Google Drive so that it is available as a folder in Colab
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
# Don't run this cell if you already cloned the repo
# !git clone https://github.com/DSSGxMunich/land-sealing-dataset-and-analysis.git

In [None]:
# %cd land-sealing-dataset-and-analysis

## Imports

In [1]:
from data_pipeline.pdf_scraper.tika_pdf_scraper import pdf_parser_from_folder

## Prerequisites

This notebook extracts text from PDF documents. To run it, you need the PDF files of the building plans stored in a folder structure matching the file path shown below (or you need to adjust the file path). The previous notebook, `2_1_land_parcels_demo`, downloads these PDF documents from the NRW geoportal; run it first if you haven't done so already. Alternatively, you can use the PDF scraper to extract text from other PDF files of your choice.

## Illustration of usage of pdf_scraper.py

This PDF scraper expects the documents to be scraped in a subfolder of the same project (here: `../data/nrw/bplan/raw/pdfs`). Due to storage constraints, the PDFs have not been uploaded to this GitHub repository, so make sure to provide the files before running the code. Apache Tika (the tool used here for text extraction and Optical Character Recognition) technically also allows scraping from URLs. In this project, however, the PDFs were processed further (e.g., image extraction), so downloading them was necessary anyway. The PDF scraper therefore expects a folder containing the files (in .pdf or .png format) as input.
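
Before running the scraper, it can help to confirm that the input folder exists and actually contains files in the expected formats. The following is a minimal sketch using only the standard library. The helper name `list_scrapable_files` is hypothetical (it is not part of this repository), and it only looks at files directly inside the folder, which is an assumption about how the scraper reads its input.

```python
from pathlib import Path

# File types the scraper is described as accepting (.pdf or .png).
SCRAPABLE_SUFFIXES = {".pdf", ".png"}


def list_scrapable_files(folder_path):
    """Return the sorted names of .pdf/.png files directly inside folder_path.

    Hypothetical helper for illustration; not part of the repository.
    """
    folder = Path(folder_path)
    if not folder.is_dir():
        raise FileNotFoundError(f"Input folder not found: {folder}")
    return sorted(
        p.name
        for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in SCRAPABLE_SUFFIXES
    )
```

For example, `list_scrapable_files("../data/nrw/bplan/raw/pdfs")` would return the file names found in the default input folder, or raise an error if the folder is missing.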

To use the pdf scraper:
- **Install Java** (we originally used version 20.0.2, [download here](https://www.oracle.com/java/technologies/downloads/))
- **Adjust the data input and data output in code block 1**: Specify a functioning file path leading to a folder containing pdf/png documents
- **Adjust the optional sample_size parameter in code block 2**: Specify the desired sample size, or leave it empty to process all files.
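
Since the scraper depends on a Java installation, a quick check that `java` is reachable can save a failed run. The sketch below is an illustration using only the standard library; the helper names `find_java` and `java_version_banner` are hypothetical and not part of this repository.

```python
import shutil
import subprocess


def find_java():
    """Return the path of the `java` executable on PATH, or None if absent."""
    return shutil.which("java")


def java_version_banner(java_path):
    """Return the first line printed by `java -version`."""
    result = subprocess.run(
        [java_path, "-version"], capture_output=True, text=True
    )
    # `java -version` writes its banner to stderr rather than stdout.
    output = result.stderr or result.stdout
    return output.splitlines()[0] if output else ""


java_path = find_java()
if java_path is None:
    print("Java not found on PATH; install it before running the scraper.")
else:
    print(java_version_banner(java_path))
```

The printed banner shows which Java version is installed, so you can compare it with the version mentioned above.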

In [2]:
# code block 1: specify data inputs and outputs
INPUT_FOLDER_PATH = "../data/nrw/bplan/raw/pdfs"
OUTPUT_FILENAME_CSV = "../data/nrw/bplan/raw/text/bp_text.csv"
OUTPUT_FILENAME_JSON = "../data/nrw/bplan/raw/text/bp_text.json"
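
If you change the output paths, note that writing a file with pandas `to_csv` or with `open(..., 'w')` does not create missing directories. A minimal sketch for creating the target folder first is shown below. The helper `ensure_parent_dir` is hypothetical, and the demo writes into a temporary directory rather than the project's data folder.

```python
import tempfile
from pathlib import Path


def ensure_parent_dir(file_path):
    """Create the parent directory of file_path if needed and return it."""
    parent = Path(file_path).parent
    parent.mkdir(parents=True, exist_ok=True)
    return parent


# Demo in a throwaway directory so nothing in the project is touched.
with tempfile.TemporaryDirectory() as tmp:
    demo_output = Path(tmp) / "text" / "bp_text.csv"
    created = ensure_parent_dir(demo_output)
    print(created.is_dir())
```

In this notebook, the same call could be applied to `OUTPUT_FILENAME_CSV` and `OUTPUT_FILENAME_JSON` before the results are written.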

## Parse PDFs from folder

In [3]:
# code block 2: apply pdf_parser function to folder containing pdfs
parsed_df = pdf_parser_from_folder(folder_path=INPUT_FOLDER_PATH,
                                   sample_size=3)

2023-09-18 17:03:49.502 | INFO     | data_pipeline.pdf_scraper.tika_pdf_scraper:pdf_parser_from_folder:64 - Parsing file: 2368027_8.pdf
2023-09-18 17:03:50,775 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...
2023-09-18 17:04:04.482 | INFO     | data_pipeline.pdf_scraper.tika_pdf_scraper:pdf_parser_from_folder:64 - Parsing file: 2368027_6.pdf
2023-09-18 17:04:04.867 | INFO     | data_pipeline.pdf_scraper.tika_pdf_scraper:pdf_parser_from_folder:64 - Parsing file: 2220502_14.pdf
2023-09-18 17:04:04.957 | INFO     | data_pipeline.pdf_scraper.tika_pdf_scraper:pdf_parser_from_folder:75 - Parsing done.


In [4]:
# code block 3: check results
parsed_df.head()

|   | filename | content | metadata |
|---|---|---|---|
| 0 | 2368027_8.pdf | `\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n...` | `{'pdf:PDFVersion': '1.6', 'xmp:CreatorTool': '...` |
| 1 | 2368027_6.pdf | `\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n...` | `{'pdf:PDFVersion': '1.6', 'xmp:CreatorTool': '...` |
| 2 | 2220502_14.pdf | `\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n...` | `{'pdf:unmappedUnicodeCharsPerPage': ['0', '0']...` |
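
The preview shows that the extracted `content` values start with long runs of newline characters. If that gets in the way of inspecting the text, one option is to trim it and squeeze repeated blank lines. The function below is a sketch of that idea and not part of the scraper; `collapse_whitespace` is a hypothetical name, and the sample string is made up.

```python
import re


def collapse_whitespace(text):
    """Trim the text and reduce runs of blank lines to a single blank line."""
    if text is None:
        # Guard for rows without any extracted text.
        return ""
    text = text.strip()
    # A newline, optional whitespace, and another newline become one blank line.
    return re.sub(r"\n\s*\n", "\n\n", text)


sample = "\n\n\n\nBebauungsplan\n\n\n\nNr. 8\n"  # made-up sample text
print(repr(collapse_whitespace(sample)))
```

Applied to the results, this could look like `parsed_df["content"].map(collapse_whitespace)`, which leaves the original column untouched unless you assign the result back.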


## Write results to CSV and JSON

In [5]:
# code block 4:
# write results to csv
parsed_df.to_csv(OUTPUT_FILENAME_CSV, header=True, index=False)

# write results to json
result_json = parsed_df.to_json(orient='records')

with open(OUTPUT_FILENAME_JSON, 'w') as outputfile:
    outputfile.write(result_json)
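
To confirm that the JSON file was written as expected, you can read it back with the standard library. With `orient='records'`, the file holds a list with one object per row. The sketch below assumes the keys match the three columns shown earlier (`filename`, `content`, `metadata`); the helper `load_records` is hypothetical, and the demo uses a made-up record in a temporary file instead of the real output.

```python
import json
import tempfile
from pathlib import Path

# Assumption: each record carries the three columns shown in the preview.
EXPECTED_KEYS = {"filename", "content", "metadata"}


def load_records(json_path):
    """Load a records-oriented JSON file and check that each row has the expected keys."""
    with open(json_path, encoding="utf-8") as infile:
        records = json.load(infile)
    for position, record in enumerate(records):
        missing = EXPECTED_KEYS - record.keys()
        if missing:
            raise ValueError(f"Record {position} is missing keys: {sorted(missing)}")
    return records


# Demo with a made-up record written to a temporary file.
sample = [{"filename": "example_1.pdf", "content": "some text", "metadata": {}}]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "bp_text.json"
    demo_path.write_text(json.dumps(sample), encoding="utf-8")
    records = load_records(demo_path)

print(len(records), records[0]["filename"])
```

In the notebook, `load_records(OUTPUT_FILENAME_JSON)` would perform the same check on the file written in code block 4.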