![dssg_banner](assets/dssg_banner.png)

In [1]:
import os

from data_pipeline.nrw_pdf_downloader.geojson_parser import parse_geojson
from data_pipeline.nrw_pdf_downloader.nrw_pdf_scraper import run_pdf_downloader
from data_pipeline.match_RPlan_BPlan.matching_plans import merge_rp_bp
from data_pipeline.match_RPlan_BPlan.matching_plans import export_merged_bp_rp

# Get started: PDF downloader for land parcels of Bebauungspläne

The BP downloader needs a dataset that contains links to the building plans in PDF format. The input used here comes from the [NRW geoportal](https://www.geoportal.nrw/?activetab=map). Using the download button there, you should be able to select all areas of NRW, choose the Bebauungspläne dataset, and download it in GeoPackage format (`.gpkg` extension).

A GeoPackage file can be loaded into any GIS application and exported to GeoJSON, which is the format the functions below take as input.

- `parse_geojson:` parses a GeoJSON file with download links to the building plans. It iterates over all rows and checks whether the URL matches the pattern of an osp-plan.de link without a list format, meaning that the scan URL does not point directly to a PDF; instead, the PDF is linked somewhere in the HTML of the page. If the URL matches the pattern, the HTML of the page is downloaded and parsed with Beautiful Soup. All links that start with `https://www.o-sp.de/download/` are extracted and written to a dataframe.
    - to parse only a sample of the rows, set the sample size with `sample_n`.
    
    $~$

- `run_pdf_downloader:` goes through a GDF with PDF download links and downloads all the files. Links that return an error are saved in a CSV called error_links in the defined output folder.
    - to process only a sample of the rows, set the sample size with `sample_n`.
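
The link-extraction step described for `parse_geojson` can be illustrated with a minimal sketch. It is not the module's implementation: it swaps Beautiful Soup for the standard-library `html.parser` so it runs without extra dependencies, and the names `DownloadLinkParser`, `extract_download_links`, and the sample HTML are assumptions made up for illustration. Only the `https://www.o-sp.de/download/` prefix comes from the description above.

```python
from html.parser import HTMLParser

# Prefix named in the description above.
DOWNLOAD_PREFIX = "https://www.o-sp.de/download/"


class DownloadLinkParser(HTMLParser):
    """Collects href values of <a> tags that start with DOWNLOAD_PREFIX."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.startswith(DOWNLOAD_PREFIX):
                self.links.append(value)


def extract_download_links(html):
    """Return all matching download links in an HTML string, in document order."""
    parser = DownloadLinkParser()
    parser.feed(html)
    return parser.links


# Hypothetical page content, made up for illustration only.
sample_html = """
<html><body>
  <a href="https://www.o-sp.de/download/example/1">Plan</a>
  <a href="https://www.o-sp.de/other/page">Other</a>
  <a href="https://www.o-sp.de/download/example/2">Begruendung</a>
</body></html>
"""
links = extract_download_links(sample_html)
print(links)
```

In this sketch the second link is skipped because it does not start with the download prefix.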

## Necessary file path specifications:

In [8]:
INPUT_BP_FILE_PATH = os.path.join("..", "data", "nrw", "bplan", "raw", "links", "NRW_BP.geojson")
OUTPUT_PDF_FOLDER_PATH = os.path.join("..", "data", "nrw", "bplan", "raw", "pdfs")
OUTPUT_CSV_PATH = os.path.join("..", "data", "nrw", "bplan", "raw", "links", "NRW_BP_parsed_links.csv")
OUTPUT_LAND_PARCELS_PATH = os.path.join("..", "data", "nrw", "bplan", "raw", "links", "land_parcels.geojson")
INPUT_REGIONS_FILE_PATH = os.path.join("..", "data", "nrw", "rplan", "raw", "geo", "regions_map.geojson")

## Now, let's start the process:

In [3]:
df = parse_geojson(file_path=INPUT_BP_FILE_PATH,
                   sample_n=5,
                   output_path=OUTPUT_CSV_PATH)

100%|██████████| 5/5 [00:00<00:00,  8.54it/s]


## Run the downloader and save the PDFs in the output folder:

In [4]:
run_pdf_downloader(input_df=df,
                   output_folder=OUTPUT_PDF_FOLDER_PATH,
                   sample_n=3)

100%|██████████| 3/3 [00:01<00:00,  1.82it/s]
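
The behaviour described for `run_pdf_downloader` (download every link, record failing links in a CSV in the output folder) can be sketched roughly as follows. This is an assumption-laden illustration, not the module's code: the function names `fetch_bytes` and `download_all`, the `plan_<i>.pdf` naming scheme, and the CSV column names are hypothetical. The fetcher is injectable so the example runs without network access.

```python
import csv
import os
import tempfile
import urllib.request


def fetch_bytes(url, timeout=30):
    """Default fetcher: download the raw bytes behind a URL."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(urls, output_folder, fetch=fetch_bytes):
    """Save each URL as a file; return the list of (url, error) failures."""
    os.makedirs(output_folder, exist_ok=True)
    failed = []
    for i, url in enumerate(urls):
        try:
            content = fetch(url)
        except Exception as exc:  # any failure is logged instead of raised
            failed.append((url, repr(exc)))
            continue
        # Hypothetical naming scheme: position of the link in the list.
        with open(os.path.join(output_folder, f"plan_{i}.pdf"), "wb") as fh:
            fh.write(content)
    # Record the failing links in error_links.csv inside the output folder.
    error_path = os.path.join(output_folder, "error_links.csv")
    with open(error_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["url", "error"])
        writer.writerows(failed)
    return failed


def fake_fetch(url):
    """Stand-in fetcher for the demo: fails for URLs containing 'broken'."""
    if "broken" in url:
        raise ValueError("simulated failure")
    return b"%PDF-1.4 dummy"


demo_folder = tempfile.mkdtemp()
failed = download_all(
    ["https://example.org/a.pdf", "https://example.org/broken.pdf"],
    demo_folder,
    fetch=fake_fetch,
)
print(failed)
```

Collecting failures instead of raising keeps one dead link from aborting a long download run, which matches the error-log behaviour described above.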


## Enrich bplan info to create `land_parcels.csv`

To generate the file `land_parcels.csv`, we need the columns from the original NRW_BP file plus the columns of the regional plan that matches each parcel. For that, we use the function `merge_rp_bp` from the module `match_RPlan_BPlan.matching_plans`. It takes as input the same `INPUT_BP_FILE_PATH` used above and the file that contains the geodata of the regions (provided by GreenDIA).
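
To give an intuition for what matching a parcel to a region can mean, here is a small pure-Python sketch. It is an assumption, not a description of how `merge_rp_bp` works internally: it assigns a single representative point to the first region polygon that contains it, using the ray-casting test. The helper names, the square regions, and their IDs and names are all made up for illustration.

```python
def point_in_polygon(point, polygon):
    """Ray casting: True if point (x, y) lies inside polygon (list of (x, y) vertices)."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def match_region(point, regions):
    """Return (regional_plan_id, regional_plan_name) of the first region containing point, else None."""
    for region in regions:
        if point_in_polygon(point, region["polygon"]):
            return region["regional_plan_id"], region["regional_plan_name"]
    return None


# Made-up square regions; IDs and names are placeholders.
regions = [
    {"regional_plan_id": 1, "regional_plan_name": "Region A",
     "polygon": [(0, 0), (2, 0), (2, 2), (0, 2)]},
    {"regional_plan_id": 2, "regional_plan_name": "Region B",
     "polygon": [(2, 0), (4, 0), (4, 2), (2, 2)]},
]
print(match_region((1.0, 1.0), regions))
print(match_region((3.0, 0.5), regions))
print(match_region((5.0, 5.0), regions))
```

A real pipeline would more likely rely on a geospatial library for this, since plan geometries are polygons or multipolygons rather than single points.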

In [9]:
land_parcels = merge_rp_bp(path_bp_geo=INPUT_BP_FILE_PATH,
                           path_rp_geo=INPUT_REGIONS_FILE_PATH)

The result is a dataframe that contains all the original columns from the BP dataset and the columns from the regions. The *relevant* columns in this dataset are:

- **objectid:** unique numeric ID of the building plan. 
- **geometry:** contains the spatial information of the polygons. 
- **kommune:** name of the municipality.
- **name:** name of the building plan.
- **datum:** date of the building plan. 
- **regional_plan_id:** unique numeric ID of the regional plan. 
- **regional_plan_name:** nominal name of the regional plan. 

In [10]:
land_parcels.head()

| Unnamed: 0 | objectid | geometry | planid | levelplan | name | kommune | gkz | nr | besch | aend | ... | aendnr | begruendurl | umweltberurl | erklaerungurl | shape_Length | shape_Area | regional_plan_id | regional_plan_name | ART | LND |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 84060 | POLYGON ((7.28543 50.82280, 7.28728 50.82179, ... | DE_05382060_Siegburg_BP93/1 | infra-local | Im Klausgarten, Braschosser Straße, Am Kreuztor | Siegburg | 5382060 | 93/1 |  |  | ... |  |  |  |  | 868.647801 | 31960.32 | 5022 | Region Bonn/Rhein-Sieg | Teilabschnitt | 5 |
| 126 | 559438 | POLYGON ((7.39385 50.90281, 7.39416 50.90240, ... | DE_05382036_02_32 | infra-local | 32. Änderung des Bebauungsplanes Nr. 2 „Much-K... | Much | 5382036 | 0 |  | 32. Änderung | ... | 32.0 | https://www.much.de/zukunft/bauleitplanungen | https://www.much.de/zukunft/bauleitplanungen |  | 473.229327 | 4467.916 | 5022 | Region Bonn/Rhein-Sieg | Teilabschnitt | 5 |
| 2722 | 2257588 | POLYGON ((7.12896 50.77292, 7.12899 50.77292, ... | DE_05314000_00 | local | Flächennutzungsplan der Bundesstadt Bonn | Bonn | 5314000 | 00 |  |  | ... |  |  |  |  | 69372.039264 | 141014600.0 | 5022 | Region Bonn/Rhein-Sieg | Teilabschnitt | 5 |
| 3436 | 2367967 | MULTIPOLYGON (((7.23255 50.91855, 7.23242 50.9... | DE_05378028_9aenderungI_Ur | local | 9. Änderung §34_Urschrift | Rösrath | 5378028 | 9aenderungI_Ur | Breide und Durbusch | Urschrift | ... |  | http://www.roesrath.de/34-9.-aenderung-breide-... |  |  | 739.659941 | 7348.491 | 5022 | Region Bonn/Rhein-Sieg | Teilabschnitt | 5 |
| 3444 | 2367975 | MULTIPOLYGON (((7.19091 50.88535, 7.19112 50.8... | DE_05378028_1aenderungundUrschriftI_Ur | local | 1. Änderung und Urschrift §34_Urschrift | Rösrath | 5378028 | 1aenderungundUrschriftI_Ur |  | Urschrift | ... |  | http://www.roesrath.de/34-urfassung-und-1.-aen... |  |  | 56630.267941 | 6082747.0 | 5022 | Region Bonn/Rhein-Sieg | Teilabschnitt | 5 |


The result can be exported with the function `export_merged_bp_rp()` from the same module (it runs the same steps as `merge_rp_bp` but additionally takes an `output_path` parameter), or by calling geopandas' `to_file` on the dataframe returned by `merge_rp_bp`. Here we run the export function.

In [12]:
export_merged_bp_rp(output_path=OUTPUT_LAND_PARCELS_PATH,
                    path_bp_geo=INPUT_BP_FILE_PATH,
                    path_rp_geo=INPUT_REGIONS_FILE_PATH)
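
After an export like this, a quick sanity check of the written GeoJSON can be useful. The sketch below is an optional addition under stated assumptions: `summarize_geojson` is a hypothetical helper, it uses only the standard-library `json` module, and the demo file is a tiny hand-written FeatureCollection with a null geometry rather than the real export. The property values are taken from the first row of the `head()` output above.

```python
import json
import os
import tempfile


def summarize_geojson(path, required=("regional_plan_id", "regional_plan_name")):
    """Return (feature_count, missing): missing lists required properties absent from any feature."""
    with open(path, encoding="utf-8") as fh:
        collection = json.load(fh)
    features = collection.get("features", [])
    missing = set()
    for feature in features:
        properties = feature.get("properties") or {}  # GeoJSON allows null properties
        for key in required:
            if key not in properties:
                missing.add(key)
    return len(features), sorted(missing)


# Tiny stand-in file for the demo; geometry is left null on purpose.
demo_collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": None,
            "properties": {
                "objectid": 84060,
                "regional_plan_id": 5022,
                "regional_plan_name": "Region Bonn/Rhein-Sieg",
            },
        }
    ],
}
demo_path = os.path.join(tempfile.mkdtemp(), "land_parcels_demo.geojson")
with open(demo_path, "w", encoding="utf-8") as fh:
    json.dump(demo_collection, fh)

count, missing = summarize_geojson(demo_path)
print(count, missing)
```

Pointing `summarize_geojson` at `OUTPUT_LAND_PARCELS_PATH` would report how many features were written and whether any of them lack the regional plan columns.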
                