![dssg_banner](assets/dssg_banner.png)

## Getting Started

### Optional: Set up Google Colab environment

If you work in a local IDE and have already installed the packages and requirements on your own machine, you can skip this section and start from the Imports section.
Otherwise, you can run this notebook in your browser. To do so, click the button below to open this page in Google Colab.

<a href="https://colab.research.google.com/github/DSSGxMunich/land-sealing-dataset-and-analysis/blob/main/src/2_8_regional_plans_demo.ipynb" target="_parent"> <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Google Colab"/> </a>

Running the first cell creates a folder in your Google Drive. All files for this tutorial must be uploaded to this folder. After the first execution, you may see some warnings and notifications. Respond to them as follows:

1. Warning: This notebook was not authored by Google. Click on 'Run anyway'.
2. Permit this notebook to access your Google Drive files? Click on 'Yes' and select your account.
3. Google Drive for desktop wants to access your Google Account. Click on 'Allow'.

At this point, a folder has been created, and you can navigate to it through the left-hand panel in Colab. You might also have received an email informing you about the access to your Google Drive.

In [None]:
# Create a folder in your Google Drive
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
# Don't run this cell if you already cloned the repo 
# !git clone https://github.com/DSSGxMunich/land-sealing-dataset-and-analysis.git

In [None]:
# %cd land-sealing-dataset-and-analysis

## Imports

In [None]:
# Import the data generation functions
from data_pipeline.rplan_content_extraction.rplan_utils import extract_text_and_save_to_txt_files
from data_pipeline.rplan_content_extraction.rplan_content_extractor import parse_rplan_directory
from data_pipeline.rplan_content_extraction.rplan_utils import parse_result_df

# Import the keyword search functions
from data_pipeline.rplan_content_extraction.rplan_keyword_search import rplan_exact_keyword_search

# Import the visualization function
from visualizations.rplan_visualization import plot_keyword_search_results

## Prerequisites

To run this notebook, you need PDF files of regional plans stored in a folder structure that matches the file paths set below (or adjust the paths to match your setup).
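
Before starting the extraction, it can help to confirm that the PDF folder exists and actually contains files. The following is a minimal sketch using only the standard library. `list_pdf_files` is a hypothetical helper, not part of the repository, and the path is the one assigned to `RPLAN_PDF_DIR` in this notebook.

```python
from pathlib import Path


def list_pdf_files(pdf_dir):
    """Return the sorted PDF paths found directly in pdf_dir (hypothetical helper)."""
    directory = Path(pdf_dir)
    if not directory.is_dir():
        # A missing folder is reported as "no PDFs" instead of raising an error.
        return []
    return sorted(p for p in directory.iterdir() if p.suffix.lower() == ".pdf")


# Same location as RPLAN_PDF_DIR in this notebook; adjust it to your setup.
pdf_files = list_pdf_files("../data/nrw/rplan/raw/pdfs")
print(f"Found {len(pdf_files)} PDF file(s)")
```

If the count is zero, check the working directory and the relative path before running the extraction step.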

# Regional plans
This notebook shows how to extract content from regional plans, i.e., parse the text from the PDFs and divide it into chapters and sections.
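
To illustrate what dividing a text into sections can look like, here is a simplified sketch. It assumes that section headings are lines starting with a number such as `1` or `2.3`. This is an assumption made for illustration only, and the rules actually used by `parse_rplan_directory` may differ. The function name and the sample text are made up.

```python
import re

# Assumed heading format: a section number ("1", "1.1", "2.3.1"), whitespace, then a title.
HEADING_PATTERN = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+(\S.*)$")


def split_into_sections(text):
    """Split plain text into sections at numbered headings (illustrative only)."""
    sections = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        match = HEADING_PATTERN.match(stripped)
        if match:
            # A new heading starts a new section.
            current = {"number": match.group(1), "title": match.group(2), "body": []}
            sections.append(current)
        elif current is not None and stripped:
            # Non-empty lines belong to the most recent heading.
            current["body"].append(stripped)
    return [
        {"number": s["number"], "title": s["title"], "body": " ".join(s["body"])}
        for s in sections
    ]


# Made-up sample text, not taken from a real regional plan.
sample_text = """1 Siedlungsraum
Ziele zur Siedlungsentwicklung.
1.1 Freiraum
Der Freiraum ist zu sichern.
"""
for section in split_into_sections(sample_text):
    print(section["number"], section["title"])
```

A pattern this simple also treats body lines that begin with a number as headings, so real documents usually need stricter rules.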

In [None]:
# Set the paths to the PDF and TXT directories
RPLAN_PDF_DIR = "../data/nrw/rplan/raw/pdfs"
RPLAN_TXT_DIR = "../data/nrw/rplan/raw/text"
RPLAN_OUTPUT_PATH = "../data/nrw/rplan/features/rplan_content.json"

## Step 1: Generate content

In [None]:
# Extract the text from each PDF and save it as a TXT file
extract_text_and_save_to_txt_files(pdf_dir_path=RPLAN_PDF_DIR,
                                   txt_dir_path=RPLAN_TXT_DIR)

# Parse the TXT files into a DataFrame of chapters / sections
input_df = parse_rplan_directory(txt_dir_path=RPLAN_TXT_DIR,
                                 json_output_path=RPLAN_OUTPUT_PATH)

input_df = parse_result_df(df=input_df)

# Save the DataFrame as JSON
input_df.to_json(RPLAN_OUTPUT_PATH)
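
With its default settings, pandas `DataFrame.to_json` writes column-oriented JSON of the form `{column: {row_label: value}}`. The sketch below shows how such a file could be turned into one dictionary per row using only the standard library, for example to inspect the output without pandas. The helper name, the column names, and the values are made up; the real columns depend on the pipeline's output.

```python
import json


def columns_json_to_records(raw_json):
    """Convert column-oriented JSON ({column: {row_label: value}}) into a list of row dicts."""
    columns = json.loads(raw_json)
    # Collect row labels in order of first appearance across all columns.
    row_labels = []
    for values in columns.values():
        for label in values:
            if label not in row_labels:
                row_labels.append(label)
    return [
        {column: values.get(label) for column, values in columns.items()}
        for label in row_labels
    ]


# Made-up example content, not the real pipeline output.
example_json = (
    '{"filename": {"0": "plan_a.txt", "1": "plan_b.txt"},'
    ' "section": {"0": "1.1", "1": "2"}}'
)
records = columns_json_to_records(example_json)
print(records[0])
```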

## Step 2: Exact keyword search

Now we perform an exact keyword search on the data and plot the results.

In [None]:
exact_result, exact_keywords = rplan_exact_keyword_search(input_df=input_df)
plot_keyword_search_results(exact_result, keyword_columns=exact_keywords,
                            title="Exact Keyword Search Results")
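
For intuition, the sketch below shows one way an exact keyword search can work: each keyword is counted only where it appears as a whole word. It is not the implementation of `rplan_exact_keyword_search`. The case-insensitive matching and the word-boundary rule are assumptions, and the function name, keywords, and texts are made up.

```python
import re


def count_exact_keyword_matches(texts, keywords):
    """Count whole-word, case-insensitive occurrences of each keyword in each text."""
    results = []
    for text in texts:
        row = {}
        for keyword in keywords:
            # \b ensures the keyword is not matched inside a longer word.
            pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", flags=re.IGNORECASE)
            row[keyword] = len(pattern.findall(text))
        results.append(row)
    return results


# Made-up texts and keywords for illustration.
example_texts = [
    "Versiegelung vermeiden. Die Versiegelung ist zu begrenzen.",
    "Der Versiegelungsgrad im Freiraum",
]
example_keywords = ["Versiegelung", "Freiraum"]
counts = count_exact_keyword_matches(example_texts, example_keywords)
for row in counts:
    print(row)
```

Under these assumptions, "Versiegelungsgrad" does not count as a match for "Versiegelung", which is the difference between an exact search and a substring search.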