![dssg_banner](assets/dssg_banner.png)

## Getting Started

### Optional: Set up Google Colab environment

If you work in a local IDE and have already installed the packages and requirements on your own machine, you can skip this section and start from the import section.
Otherwise, you can follow and execute this notebook in your browser: click the button below to open this page in the Colab environment.

<a href="https://colab.research.google.com/github/DSSGxMunich/land-sealing-dataset-and-analysis/blob/main/src/2_4_exact_keyword_search_demo.ipynb" target="_parent"> <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Google Colab"/> </a>

Once in Colab, run the cell below to install the packages we will use.
To set up the notebook properly, note the following:

1. Colab will warn you that this notebook was not authored by Google. Click on 'Run anyway'.
2. When the installation commands are done, there might be a "Restart runtime" button at the end of the output. If so, please *click* it.

In [None]:
# json is part of the Python standard library and does not need to be installed
# %pip install pandas

Running the next cell mounts your Google Drive in Colab, so that the files for this tutorial can be stored in a folder there. On first execution you might see some warnings and notifications; please respond as follows:
1. Permit this notebook to access your Google Drive files? *Click* on 'Yes', and select your account.
2. Google Drive for desktop wants to access your Google Account. *Click* on 'Allow'.

At this point, the folder is available and you can browse it through the left-hand panel in Colab. You might also have received an email informing you that your Google Drive was accessed.

In [None]:
# Mount your Google Drive at /content/drive
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
# Don't run this cell if you already cloned the repo 
# !git clone https://github.com/DSSGxMunich/land-sealing-dataset-and-analysis.git

In [None]:
# %cd land-sealing-dataset-and-analysis

## Imports

In [1]:
import json
import pandas as pd

from features.textual_features.keyword_search.exact_keyword_search import search_df_for_keywords

pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)

## Prerequisites

To run this notebook, you need a file containing the text extracted from building plans (see the output of `2_2_pdf_scraper_demo`), or any other input text of your choice. This text is then searched for keywords. Save the input file in a location that matches the folder structure in the file path below, or adjust the file path.

# Exact keyword search for paragraphs from BauNVO & BauGB

## Prepare data

- **Change the folder path** in the code block below to read in the data.
- **Specify the relevant column names.** The function used below expects the input data frame to have at least two columns: one ID column and one content column. Here, the columns are called `filename` and `content`. If yours are named differently, change the column names in the code below.

In [3]:
# specify file path
INPUT_FILE_PATH = "../data/nrw/bplan/raw/text/bp_text.csv"
OUTPUT_FILE_PATH = "../data/nrw/bplan/features/keywords/exact_search/exact_search.csv"

# specify relevant column names
ID_COLUMN='filename'
TEXT_COLUMN='content'

# read in data
input_df = pd.read_csv(INPUT_FILE_PATH, names=[ID_COLUMN, TEXT_COLUMN])
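If your own input file uses different column names, one option is to rename them and check that both required columns are present before running the search. The following is a minimal sketch on an in-memory toy CSV; the names `doc` and `text` and the variable `toy_df` are hypothetical and not part of the repository. Note also that passing `names=` to `pd.read_csv` without `header=0` makes pandas treat the first line of the file as data, so a file that carries its own header row would need `header=0` in addition.

```python
import io

import pandas as pd

ID_COLUMN = "filename"
TEXT_COLUMN = "content"

# Hypothetical stand-in for an input file that uses other column names
raw_csv = io.StringIO("doc,text\nplan_a.pdf,first text\nplan_b.pdf,second text\n")
toy_df = pd.read_csv(raw_csv)

# Rename to the column names the rest of the notebook expects
toy_df = toy_df.rename(columns={"doc": ID_COLUMN, "text": TEXT_COLUMN})

# Fail early if a required column is still missing
missing = {ID_COLUMN, TEXT_COLUMN} - set(toy_df.columns)
if missing:
    raise ValueError(f"Input is missing required columns: {sorted(missing)}")
```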

## Define keyword dictionary

Keywords are specified in a separate JSON file, so that the exact keyword search can be applied to a different set of keywords simply by reading in the relevant dictionary. The dictionary is structured so that each keyword category (e.g., `baunvo-1`) contains one or more keywords, any of which marks the category as covered (e.g., "§1 baunvo", "1 baunvo", or "allgemeine vorschriften für bauflächen und baugebiete").

In [5]:
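# read as UTF-8 explicitly, since the keywords contain characters such as "§" and umlauts
with open('features/textual_features/keyword_search/keyword_dict_exact.json', encoding='utf-8') as f: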
    BAUNVO_KEYWORDS = json.load(f)
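To apply the search to your own keywords, a dictionary of the same shape can be built and serialized. The sketch below assumes the layout is a mapping from category name to a list of lowercase keywords, which is what the description above and the results suggest; the actual file in the repository may contain further details. The category and keywords are the ones named in the text, while `example_keywords` is a hypothetical variable name.

```python
import json

# Assumed layout: category name -> list of lowercase keywords
example_keywords = {
    "baunvo-1": [
        "§1 baunvo",
        "1 baunvo",
        "allgemeine vorschriften für bauflächen und baugebiete",
    ],
}

# ensure_ascii=False keeps "§" and umlauts readable in the JSON text
as_json = json.dumps(example_keywords, ensure_ascii=False, indent=2)

# Loading the text again yields the same dictionary
round_tripped = json.loads(as_json)
```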

# Apply function

Exact keyword matching based on the input dictionary. The function returns a data frame showing, per category, which keywords appeared in each PDF.

In [6]:
result_df = search_df_for_keywords(input_df=input_df,
                                   text_column_name=TEXT_COLUMN,
                                   id_column_name=ID_COLUMN,
                                   keyword_dict=BAUNVO_KEYWORDS)

result_df.head(30)

| | filename | baunvo-1 | baunvo-2 | baunvo-3 | baunvo-4 | baunvo-4a | baunvo-5 | baunvo-5a | baunvo-6 | baunvo-6a | ... | baunvo-17 | baunvo-18 | baunvo-19 | baunvo-20 | baunvo-21 | baunvo-21a | 13b | hq100 | hqhäufig | hqextrem |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 116995_0.pdf | | | | | | | | | | ... | | | | | | | | | | |
| 1 | 116995_10.pdf | | | | | | | | | | ... | | | | | | | | | | |
| 2 | 116995_2.pdf | | | | | | | | | | ... | | | | | | | | | | |
| 3 | 116995_4.pdf | | | | | | | | | | ... | | | [grz] | | | | | | | |
| 4 | 116995_6.pdf | | | | | | | | | | ... | | | | | | | | | | |
| 5 | 116995_8.pdf | [1 baunvo] | [2 baunvo] | [3 baunvo, reine wohngebiete] | [4 baunvo, allgemeine wohngebiete] | [besondere wohngebiete] | [5 baunvo] | | [6 baunvo] | | ... | [17 baunvo] | [18 baunvo] | [19 baunvo, grundflächenzahl, grz] | [20 baunvo, vollgeschosse, gfz] | [21 baunvo] | | | | | |
| 6 | 1423897.pdf | | | | | | | | | | ... | | | | | | | | | | |
| 7 | 1427478.pdf | | | | | | | | | | ... | | | | | | | | | | |
| 8 | 1427479.pdf | | | | | | | | | | ... | | | | | | | | | | |
| 9 | 1427480.pdf | | | | | | | | | | ... | | | | | | | | | | |
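
To make the shape of this output easier to follow, here is a minimal sketch of an exact keyword search on toy data. It is not the implementation of `search_df_for_keywords`; it assumes lowercasing plus plain substring matching, and the function `toy_keyword_search` and all toy values are hypothetical.

```python
import pandas as pd


def toy_keyword_search(df, text_col, id_col, keyword_dict):
    """Return one row per document and one column per category with the matched keywords."""
    rows = []
    for doc_id, text in zip(df[id_col], df[text_col]):
        lowered = str(text).lower()
        row = {id_col: doc_id}
        for category, keywords in keyword_dict.items():
            # Assumption: a keyword counts as a hit if it occurs as a substring
            hits = [kw for kw in keywords if kw in lowered]
            row[category] = hits if hits else None
        rows.append(row)
    return pd.DataFrame(rows, columns=[id_col, *keyword_dict])


toy_df = pd.DataFrame({
    "filename": ["a.pdf", "b.pdf"],
    "content": ["Festsetzung nach §19 BauNVO, GRZ 0,4", "Kein Treffer"],
})
toy_keywords = {"baunvo-19": ["19 baunvo", "grundflächenzahl", "grz"]}

toy_result = toy_keyword_search(toy_df, "content", "filename", toy_keywords)
```

One caveat of plain substring matching, as used in this sketch, is that a short keyword such as "1 baunvo" also matches inside "11 baunvo" or "21 baunvo". Whether the repository function guards against this is not shown here.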


## Check results

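To inspect the keyword coverage across all files, count the non-null entries per column, i.e., the number of files in which each category was matched: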

In [7]:
result_df.count()

filename      21573
baunvo-1       7086
baunvo-2       7217
baunvo-3       7259
baunvo-4       7542
baunvo-4a      2758
baunvo-5       7152
baunvo-5a       499
baunvo-6       7370
baunvo-6a       619
baunvo-7       6819
baunvo-8       7186
baunvo-9       7105
baunvo-10      6658
baunvo-11      6415
baunvo-12      6864
baunvo-13      5701
baunvo-13a     1985
baunvo-14      6182
baunvo-15      6167
baunvo-16      6239
baunvo-17      4991
baunvo-18      6409
baunvo-19      6920
baunvo-20      6897
baunvo-21      5249
baunvo-21a     6129
13b             224
hq100           379
hqhäufig        151
hqextrem        242
dtype: int64
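
Absolute counts are easier to compare once they are expressed as a share of all files. The following sketch shows one way to do this on a toy result frame; the toy values are hypothetical, and the same expression could be applied to `result_df` with `ID_COLUMN` in place of `"filename"`.

```python
import pandas as pd

# Hypothetical result frame in the same shape as the output above
toy_result = pd.DataFrame({
    "filename": ["a.pdf", "b.pdf", "c.pdf", "d.pdf"],
    "baunvo-19": [["grz"], None, ["19 baunvo", "grz"], None],
    "hq100": [None, None, None, ["hq100"]],
})

# Share of files in which each category was matched, most frequent first
coverage_share = (
    toy_result.drop(columns="filename")
    .notna()
    .mean()
    .sort_values(ascending=False)
)
```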

Also, for a given pdf, one can extract all usage options listed in the bplan:

In [8]:
print(result_df.loc[[5]].values.tolist())

[['116995_8.pdf', ['1 baunvo'], ['2 baunvo'], ['3 baunvo', 'reine wohngebiete'], ['4 baunvo', 'allgemeine wohngebiete'], ['besondere wohngebiete'], ['5 baunvo'], None, ['6 baunvo'], None, ['7 baunvo'], ['8 baunvo'], ['9 baunvo'], ['10 baunvo'], ['11 baunvo'], ['12 baunvo'], ['13 baunvo'], None, ['14 baunvo'], ['15 baunvo'], ['16 baunvo'], ['17 baunvo'], ['18 baunvo'], ['19 baunvo', 'grundflächenzahl', 'grz'], ['20 baunvo', 'vollgeschosse', 'gfz'], ['21 baunvo'], None, None, None, None, None]]
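
When only the matched categories of one file are of interest, the empty entries can be dropped and the file looked up by name rather than by row position. This is a sketch on hypothetical toy data, not a function from the repository.

```python
import pandas as pd

# Hypothetical result frame in the same shape as the output above
toy_result = pd.DataFrame({
    "filename": ["a.pdf", "b.pdf"],
    "baunvo-3": [["3 baunvo", "reine wohngebiete"], None],
    "baunvo-5a": [None, None],
    "baunvo-19": [["grz"], ["19 baunvo"]],
})

# Select the row by file name, then keep only categories with at least one hit
row = toy_result.set_index("filename").loc["a.pdf"]
covered = row.dropna().to_dict()
```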


# Transform to Boolean

For subsequent analysis, Boolean values may be preferable. Setting the optional argument `boolean=True` returns a data frame that shows whether each category was covered or not, instead of an overview of all keyword hits per category.

In [9]:
boolean_result_df = search_df_for_keywords(input_df=input_df,
                                           text_column_name=TEXT_COLUMN,
                                           id_column_name=ID_COLUMN,
                                           keyword_dict=BAUNVO_KEYWORDS,
                                           boolean=True)
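
If a list-valued result frame already exists, a comparable Boolean table can also be derived from it without rerunning the search. The sketch below assumes that "covered" simply means "at least one keyword hit"; the exact layout returned by `boolean=True` is not shown in this notebook, and all toy values are hypothetical.

```python
import pandas as pd

# Hypothetical list-valued result frame
toy_result = pd.DataFrame({
    "filename": ["a.pdf", "b.pdf"],
    "baunvo-19": [["19 baunvo", "grz"], None],
    "hq100": [["hq100"], None],
})

# True where a category has at least one hit, False otherwise
toy_boolean = toy_result.set_index("filename").notna()

# Number of covered categories per file
n_categories_per_file = toy_boolean.sum(axis=1)
```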

# Write results to csv

In [11]:
result_df.to_csv(OUTPUT_FILE_PATH, header=True, index=False)
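
Two practical points may matter here: `to_csv` fails if the target folder does not exist, and list-valued cells are written as plain strings. The sketch below, using a temporary directory and hypothetical toy data, creates the folder first and parses the strings back into lists after reloading.

```python
import ast
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical list-valued result frame
toy_result = pd.DataFrame({
    "filename": ["a.pdf", "b.pdf"],
    "baunvo-19": [["19 baunvo", "grz"], None],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "features" / "exact_search.csv"
    # Create missing parent folders before writing
    out_path.parent.mkdir(parents=True, exist_ok=True)
    toy_result.to_csv(out_path, header=True, index=False)

    # Lists come back as strings such as "['19 baunvo', 'grz']"; empty cells as NaN
    reloaded = pd.read_csv(out_path)
    reloaded["baunvo-19"] = reloaded["baunvo-19"].apply(
        lambda cell: ast.literal_eval(cell) if isinstance(cell, str) else None
    )
```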