<a href="https://colab.research.google.com/github/BowieSteutel/acc-nlp-firecodes/blob/main/1A_Regulatory_Document_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Install libraries**

In [None]:
# use !pip install -U ... if upgrade is needed
!pip install deep_translator # for translation
#!pip install lemminflect # for additional word analysis



# **Module 1A - Regulatory Information Preprocessing**

*This preprocessing module expects a plain-text document. Working with XML requires a separate pipeline with different RegEx patterns and a simpler approach to hierarchy extraction (since the hierarchy is already part of the XML metadata).*

**Import libraries**



In [None]:
# Import standard libraries
import re # for pattern matching using regular expressions
import pandas as pd # for dataframes
import collections # for frequency counting

# Prepare translation library
from deep_translator import GoogleTranslator

# Prepare SpaCy extensions
#from lemminflect import getAllLemmas, getInflection, getAllInflections, getAllInflectionsOOV

---
# **Load inputs**

In [None]:
# mount drive (needed for google colab files)
import os

if os.path.ismount('/content/drive'):
  print("Google Drive is already mounted!")
else:
  from google.colab import drive
  drive.mount('/content/drive')
  print("Google Drive is mounted.")

Google Drive is already mounted!


In [None]:
# @title Load file path

file_path = "/content/drive/Othercomputers/My Laptop/Uni/Y6 Q1/Graduation Project/Development/FINAL/" #  @param {"type":"string","placeholder":"for relative paths"}

In [None]:
# @title Load txt document

# define variables
## path of the text document
doc_path = file_path+"input/BBL_2024-8-1.txt" #  @param {"placeholder":"path to txt document"}
doc_lang = "nl" #  @param {"type":"string", "placeholder": "original language of the document (abbreviation for Google Translator)"}
## encoding of text document (default: utf8)
doc_enc = "utf8" #  @param {"type":"string","placeholder":"utf8"}

# if rel_doc_path != "":
#     doc_path = current + rel_doc_path
# import BBL document (the context manager closes the file after reading)
with open(doc_path, "r", encoding=doc_enc) as file:
  document = file.read()
# remove non-breaking spaces
nonBreakSpace = u'\xa0'
document = document.replace(nonBreakSpace, ' ')

# normalize the document (if needed)
import unicodedata
document = unicodedata.normalize("NFKD", document)
document

'\n\nBesluit bouwwerken leefomgeving\nWijziging(en) op nader te bepalen datum(s); laatste bekendgemaakt in 2023. Zie het\n\noverzicht van wijzigingen.\nGeraadpleegd op 17-11-2024.\nGeldend van 01-08-2024 t/m heden\n\nBesluit van 3 juli 2018, houdende regels over bouwwerken in de fysieke leefomgeving (Besluit bouwwerken leefomgeving)\n\nWij Willem-Alexander, bij de gratie Gods, Koning der Nederlanden, Prins van Oranje-Nassau, enz. enz. enz.\n\nOp de voordracht van Onze Minister van Binnenlandse Zaken en Koninkrijksrelaties van 30 juni 2018, nr. 2017-0000316593, Directie Constitutionele Zaken en Wetgeving, gedaan mede namens Onze Minister van Infrastructuur en Waterstaat;\n\nGelet op de richtlijn breedband, de richtlijn energieprestatie van gebouwen, de richtlijn hernieuwbare energie, de richtlijn veiligheid wegtunnels, de verordening bouwproducten en het VN-gehandicaptenverdrag en de artikelen 4.3, eerste lid, en 5.1 van de Omgevingswet, en artikel 119 van de Woningwet;\n\nDe Afdeling a
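
To make the effect of the normalization step more concrete, the sketch below applies `unicodedata.normalize` to a short made-up sample string (not taken from the BBL). It assumes that only the difference between the `NFKD` form used above and the composed `NFKC` form is of interest.

```python
import unicodedata

# Hypothetical sample: precomposed "é", a non-breaking space and the "ﬁ" ligature
sample = "caf\u00e9\u00a0\ufb01n"

nfkd = unicodedata.normalize("NFKD", sample)
nfkc = unicodedata.normalize("NFKC", sample)

# NFKD maps compatibility characters (nbsp -> space, ligature -> "fi")
# and splits "é" into "e" + combining acute accent (U+0301)
print([hex(ord(c)) for c in nfkd])

# NFKC applies the same compatibility mapping but recomposes "é"
print(len(sample), len(nfkd), len(nfkc))
```

Because NFKD leaves accents as separate combining characters, patterns such as `\w` stop matching at an accented letter. If that turns out to matter for labels or titles, `NFKC` may be the safer choice.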

In [None]:
for line in document.split('\n')[:80]:
  if line!="":
    print(line)

Besluit bouwwerken leefomgeving
Wijziging(en) op nader te bepalen datum(s); laatste bekendgemaakt in 2023. Zie het
overzicht van wijzigingen.
Geraadpleegd op 17-11-2024.
Geldend van 01-08-2024 t/m heden
Besluit van 3 juli 2018, houdende regels over bouwwerken in de fysieke leefomgeving (Besluit bouwwerken leefomgeving)
Wij Willem-Alexander, bij de gratie Gods, Koning der Nederlanden, Prins van Oranje-Nassau, enz. enz. enz.
Op de voordracht van Onze Minister van Binnenlandse Zaken en Koninkrijksrelaties van 30 juni 2018, nr. 2017-0000316593, Directie Constitutionele Zaken en Wetgeving, gedaan mede namens Onze Minister van Infrastructuur en Waterstaat;
Gelet op de richtlijn breedband, de richtlijn energieprestatie van gebouwen, de richtlijn hernieuwbare energie, de richtlijn veiligheid wegtunnels, de verordening bouwproducten en het VN-gehandicaptenverdrag en de artikelen 4.3, eerste lid, en 5.1 van de Omgevingswet, en artikel 119 van de Woningwet;
De Afdeling advisering van de Raad van 

In [None]:
# @title Define base URL of regulations (for fragment URLs)
base_URL = "https://wetten.overheid.nl/BWBR0041297/2024-08-01#" #  @param {"type":"string","placeholder":"URL (for online regulatory documents using fragment URLs)"}

print(base_URL)

https://wetten.overheid.nl/BWBR0041297/2024-08-01#
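
As an illustration of how the base URL could be combined with a hierarchy path, the sketch below builds a fragment URL. The helper `build_fragment_url` and the fragment format (label and enumeration concatenated, levels joined with underscores) are assumptions made for this example and should be checked against the actual anchors of the online document.

```python
base_URL = "https://wetten.overheid.nl/BWBR0041297/2024-08-01#"

def build_fragment_url(base_url, path):
  """Join (label, enumeration) pairs into one fragment identifier (assumed format)."""
  fragment = "_".join(f"{label}{enum}" for label, enum in path)
  return base_url + fragment

# Hypothetical path from chapter down to article
example_path = [("Hoofdstuk", "2"), ("Afdeling", "2.1"), ("Artikel", "2.2")]
print(build_fragment_url(base_URL, example_path))
```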


## Define hierarchy dictionary

Manually define the translation of the hierarchical elements in the text (used for pretranslation).

In [None]:
dict_hier_elements = {
    "chapter":     {"re_untranslated": r"hoofdstuk(?:ken)?",
                    "re_translated":   r"chapter[s]?",
                    "code": "C"},
    "section":     {"re_untranslated": r"afdeling(?:en)?",
                    "re_translated":   r"section[s]?",
                    "code": "S"},
    "paragraph":   {"re_untranslated": r"(?:paragra(?:af|fen)|§)",
                    "re_translated":   r"paragraph[s]?",
                    "code": "P"},
    "article":     {"re_untranslated": r"artikel(?:en|s)?",
                    "re_translated":   r"article[s]?",
                    "code": "A"},
    "sub-article": {"re_untranslated": r"lid",
                    "re_translated":   r"sub[-]?article[s]?",
                    "code": "SUB"},
    "table":       {"re_untranslated": r"tabel(?:len)?",
                    "re_translated":   r"table[s]?",
                    "code": "TAB"},
    "appendix":    {"re_untranslated": r"bijlage[n]?[s]?",
                    "re_translated":   r"appendi(?:x|ces)",
                    "code": "APPX"},
    # matches "figuur" (singular) and "figuren" (plural)
    "figure":      {"re_untranslated": r"figu(?:ur|ren)",
                    "re_translated":   r"figure[s]?",
                    "code": "FIG"},
}
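
One possible way to use such a dictionary for pretranslation is sketched below. The helper `pretranslate_labels` and the reduced two-entry dictionary are hypothetical and only illustrate the idea of replacing known hierarchy labels before the text is sent to a translator.

```python
import re

# Reduced copy of the dictionary above (two entries only, for illustration)
hier_elements = {
    "chapter": {"re_untranslated": r"hoofdstuk(?:ken)?"},
    "article": {"re_untranslated": r"artikel(?:en|s)?"},
}

def pretranslate_labels(text, elements):
  """Replace untranslated hierarchy labels with their English key (hypothetical helper)."""
  for english, patterns in elements.items():
    # Word boundaries prevent replacements inside longer words
    pattern = rf"\b{patterns['re_untranslated']}\b"
    text = re.sub(pattern, english, text, flags=re.IGNORECASE)
  return text

print(pretranslate_labels("Zie artikel 2.2 van hoofdstuk 2.", hier_elements))
```

This simple version always inserts the singular English key, so plural forms and capitalization are lost. A fuller implementation would need to carry those over.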

---
# **Determine hierarchy**

Establish possible hierarchy based on document structure

*Simplified use-case specific version*

*Still rudimentary, should be improved for standardization*


*   Establish hierarchy structure possibilities
  *  Indicate sorting order
  *  Indicate creation order
*   Define hierarchy identification patterns
  *  Enumeration patterns
  *  Number list patterns
  *  Bullet list patterns
  *  Multilist patterns
  *  Hierarchy label patterns
  *  Hierarchy title patterns
*   Extract hierarchy elements
  *  Identify potential main elements, sub elements & captions
  *  Improve automated hierarchy classifications
  *  Identify caption type? (WIP)
*   *User validation & correction (out of scope)*
*   Determine relations of hierarchy elements
  *  If no labeled elements found: determine hierarchy based on label depth and enumeration variations (WIP)
  *  Determine hierarchy level of main elements
  *  Determine sublevels per main level based on sequence
  *  Clean up enumerations by checking whether they are sequential??? (WIP)
*   *User validation & correction (out of scope)*
*   Establish final hierarchy structure
  *  Map hierarchy for user check (WIP)
  *  Make adjustments if necessary (WIP)
  *  Determine paths of hierarchy elements
  *  Translate hierarchy elements & paths
  * **ALWAYS ADD "." AT END OF ELEMENT IN PATH?**



## Define hierarchy identification patterns

Assumptions:

* Most main labels (and some sublabels or captions) are indicated with the label and an enumeration and have a title that starts with a capital letter
* Most subarticles (and some sublabels) are indicated only by an enumeration
* Most captions (and some main labels or sublabels) are indicated with the label and have an enumeration (that is either numerical or alphanumerical) but no title

In [None]:
# RegEx for enumerations
## Match number enumerations: 1 2 3 up to 99997 99998 99999
re_enum_number = "(?:[0-9]{1,5})"
## Match letter enumerations:
re_enum_letter = "(?:[A-Z]{1,3})" #A B C up to ZZX ZZY ZZZ
re_enum_letter_ci = "(?:[A-Z]{1,3}|[a-z]{1,3})" # also includes a b c up to zzx zzy zzz (but not mixed uppercase and lowercase!)
## Match mixed enumerations: 1a 1b 1c up to 99999zzx 99999zzy 99999zzz and a1 a2 a3 up to zzz99997 zzz99998 zzz99999, or with uppercase letters
re_enum_mix = f"(?:{re_enum_number}{re_enum_letter_ci}|{re_enum_letter_ci}{re_enum_number})"
## Match roman numerals (I-II-III)
re_enum_roman = "(?:M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))"
#re_enum_roman = "(?:\\bM{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})\\b)"
## Also match lowercase roman numerals (i-ii-iii)
re_enum_roman_ci = f"(?:{re_enum_roman}|{re_enum_roman.lower()})"
## Match repeated alphanumeric characters? (a-aa-aaa)
#re_enum_repeat = "(?:(?:[a-z0-9])\\1{1,9})"

## Combined expression (ordered such that in the case of multiple matches, the first match (which is returned) is the most logical one):
re_enum = f"(?:\\b(?:{re_enum_mix}|{re_enum_roman_ci}|{re_enum_number}|{re_enum_letter})\\b)"

## Using capture groups for later identification:
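## (the alternatives are wrapped in one group so that the word boundaries apply to every alternative, not only the first and last)
re_enum_type = f"(?:\\b(?:(?P<e1>{re_enum_mix})|(?P<e2>{re_enum_roman_ci})|(?P<e3>{re_enum_number})|(?P<e4>{re_enum_letter}))\\b)"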

# RegEx before titles: ":", ";" or any type of dash ("-", "–", "—", "--"); a trailing "." after the enumeration is handled in the hierarchy patterns
re_title_start = r"(?:\:|\;|\–|\—|-{1,2})"

# RegEx for horizontal whitespaces without using \h (which gives an error)
re_space = "[^\\S\\n\\r]{1,3}"#"[^\S\x{2028}\n\r\x{000B}\f\x{2029}\x{0085}]"
# RegEx for defining where a title should start
re_linestart = "^[^\\S\\n\\r]{0,20}"



# RegEx for list item notation possibilities, with any enumeration type listed above:
## 1. 2. 3.
re_list_1 = f"(?:{re_enum}\.)"
## 1) 2) 3) / (1) (2) (3)
re_list_2 = f"(?:\(?{re_enum}\))"
## 1° 2° 3° / 1°. 2°. 3°.
re_list_3 = f"(?:{re_enum}\°\.?)"
## 1 2 3 also? Might return too many false positives.
#re_list_4 = f"{re_enum}"
## Any non-ascii character? Might return too many false positives.
#re_list_5 = f"{re_enum}[^[:ascii:]]\.?"

## Combined expression for finding list item, followed by a whitespace
re_list = f"(?:{re_list_1}|{re_list_2}|{re_list_3}){re_space}"
## Using capture groups for later identification (e.g. for confirming subsequent enumerations):
re_list_type = f"(?:(?P<l1>{re_list_1})|(?P<l2>{re_list_2})|(?P<l3>{re_list_3})){re_space}"


# RegEx for bullet lists without enumerations
## Actual bullets
re_bullet_real = "(?:•|∙|○|●|◦|∘|◼|◘|‣|⁃|⁌|⁍)"
## ALTERNATIVE: check for non-ascii characters (except dashes) - might be too general
## (Python's re module has no POSIX classes such as [:ascii:], so an explicit range is used)
re_bullet_real_alt = "(?:[^\\x00-\\x7F–—])"
## Frequently used instead of bullets
re_bullet_fake = "(?:-|\–|\—|\*|\.|\+|o|x)" # these need to be checked for spaces after since they are characters not only used for bullets

## Combined expression, match bullets at beginning of strings (excluding whitespaces)
re_bullet = f"{re_linestart}({re_bullet_real}|(?:{re_bullet_fake}{re_space}))"
## Using capture groups for later identification:
re_bullet_type = f"{re_linestart}((?P<b1>{re_bullet_real})|(?P<b2>{re_bullet_fake}{re_space}))"



# RegEx for finding multiple levels of enumerations, up to 6 levels, starting with the dividable enumerations and ending with any enumeration:
# 1.1 1.2 1.3   1.1.1 1.1.2 1.1.3   etc.
re_multilist = f"(?:(?:{re_enum}\.)"+"{1,5}"+f"{re_enum})" #leaving "\.?" out at the end to improve consistency in capture group with path matching later on

# RegEx for tables, figures and formulas, always using numerical or alphanumerical enumeration
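## (the quantifier braces are doubled, otherwise the f-string would evaluate {0,5} as a Python expression)
re_caption = f"(?:(?:{re_enum_mix}|{re_enum_number})(?:\\.{re_enum_mix}|\\.{re_enum_number}){{0,5}})"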

# also include 1.1. 1.2. 1.3.   1.1.1. 1.1.2. 1.1.3.  etc:
#re_multilist = f"(?:{re_enum}\.)"+"{1,5}"+f"{re_enum}\.?"
## Can be split based on "." occurrences later to check current level depth!!!!

## 1-1 1-2 1-3  1–1 1–2 1–3 1—1 1—2 1—3 1--1 1--2 1--3 etc. also??
## If so, use version below:
#re_multilist = f"(?:{re_enum}[\.{re_dash}])"+"{1,5}"+f"{re_enum}[\.{re_dash}]?"

## RegEx for words at beginning of sentence followed by enumeration: [label] 123456. § 123456. (with different types of section symbols)
re_label = "(?:¶|❡|§|⁋|⸿|⹍|(?:[A-Z][\w]*))"

# RegEx for finding the title of an element in the hierarchy (should start with a capital letter or bracket and not be a sentence, and followed by a linebreak)
re_title_cs_1 = "[A-Z(][^\.]*$" # case-sensitive version
re_title_cs_2 = "[A-Z(][^\.]*\.?$" # "." can be in the text, but only once at the end
re_title_cs_3 = "[A-Z(][^\.]*[^\.\:\;]$" # string cannot end with punctuation indicating that this is part of a sentence
re_title_ci_1 = "[A-Za-z(][^\.]*$" # case-insensitive version (also lowercase letters as first letter)
re_title_ci_2 = "[A-Za-z(][^\.]*\.?$" # "." can be in the text, but only once at the end
re_title_ci_3 = "[A-Za-z(][^\.]*[^\.\:\;]$" # string cannot end with punctuation indicating that this is part of a sentence

# Different RegEx for finding main hierarchy items
# Returns: [0]: full, [1]: trimmed, [2]: id (label+enum), [3]: label, [4]: enum, [5]: title
re_hier_main = f"{re_linestart}(?P<trim>(?P<id>(?P<label>{re_label}){re_space}(?P<enum>{re_multilist}|{re_enum})\.?){re_space}(?:{re_title_start}{re_space})?(?P<name>{re_title_cs_3}))"

# RegEx for finding sub hierarchy items
# Returns: [0]: full, [1]: trimmed (enum), [2]: id (enum), [3]: empty, [4]: enum, [5]: empty
# very simple version:
re_hier_sub = f"^(?P<trim>(?P<id>()(?P<enum>{re_list_1}|{re_list_2}|{re_list_3})){re_space})"
# version below looks for bullet items as well:
#re_hier_sub = f"^(?P<trim>(?P<id>()(?P<enum>{re_bullet}|{re_list}|{re_multilist})()))"
# version below looks for multiple indents in sublevels as well:
#re_hier_sub = f"{re_linestart}(?P<trim>(?P<id>()(?P<enum>{re_bullet}|{re_list}|{re_multilist})()))"

# RegEx for finding captions
# Returns: [0]: full, [1]: trimmed, [2]: id (label+enum), [3]: label, [4]: enum, [5]: empty group (set when there is no title), [6]: title (named group "name")
re_hier_caption = f"{re_linestart}(?P<trim>(?P<id>(?P<label>{re_label}){re_space}(?P<enum>{re_multilist}))\.?(?:()$|{re_space}(?:{re_title_start}{re_space})?(?P<name>{re_title_ci_3})))"

# RegEx for finding appendices
re_hier_appendix = f"{re_linestart}(?P<trim>(?P<id>(?P<label>{re_label}){re_space}(?P<enum>{re_multilist}|{re_enum_roman}))\.{re_space}(?P<name>.*$))"
# re_hier_appendix = f"{re_linestart}((?P<id>{re_label}{re_space}(?:{re_multilist}|{re_enum})\.){re_title_start}?)"

print(re_hier_main)
print(re_hier_sub)
print(re_hier_caption)
print(re_hier_appendix)

^[^\S\n\r]{0,20}(?P<trim>(?P<id>(?P<label>(?:¶|❡|§|⁋|⸿|⹍|(?:[A-Z][\w]*)))[^\S\n\r]{1,3}(?P<enum>(?:(?:(?:\b(?:(?:(?:[0-9]{1,5})(?:[A-Z]{1,3}|[a-z]{1,3})|(?:[A-Z]{1,3}|[a-z]{1,3})(?:[0-9]{1,5}))|(?:(?:M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))|(?:m{0,4}(?:cm|cd|d?c{0,3})(?:xc|xl|l?x{0,3})(?:ix|iv|v?i{0,3})))|(?:[0-9]{1,5})|(?:[A-Z]{1,3}))\b)\.){1,5}(?:\b(?:(?:(?:[0-9]{1,5})(?:[A-Z]{1,3}|[a-z]{1,3})|(?:[A-Z]{1,3}|[a-z]{1,3})(?:[0-9]{1,5}))|(?:(?:M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))|(?:m{0,4}(?:cm|cd|d?c{0,3})(?:xc|xl|l?x{0,3})(?:ix|iv|v?i{0,3})))|(?:[0-9]{1,5})|(?:[A-Z]{1,3}))\b))|(?:\b(?:(?:(?:[0-9]{1,5})(?:[A-Z]{1,3}|[a-z]{1,3})|(?:[A-Z]{1,3}|[a-z]{1,3})(?:[0-9]{1,5}))|(?:(?:M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))|(?:m{0,4}(?:cm|cd|d?c{0,3})(?:xc|xl|l?x{0,3})(?:ix|iv|v?i{0,3})))|(?:[0-9]{1,5})|(?:[A-Z]{1,3}))\b))\.?)[^\S\n\r]{1,3}(?:(?:\:|\;|\–|\—|-{1,2})[^\S\n\r]{1,3})?(?P<name>[A-Z(][^\.]*[^\.\:\;]$))
^(?P<trim>(?P<id>
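
Since the full patterns are hard to read once expanded, the sketch below uses a much simpler stand-in, `re_main_simple`, to show how the named groups (`id`, `label`, `enum`, `name`) can be read from a match. The stand-in is an assumption for illustration: it only covers numeric enumerations with an optional letter suffix and is not a replacement for `re_hier_main`.

```python
import re

# Simplified stand-in for re_hier_main: label, dotted enumeration, title
re_main_simple = (
    r"^(?P<id>(?P<label>§|[A-Z]\w*)[ \t]{1,3}"
    r"(?P<enum>\d{1,5}[a-z]{0,3}(?:\.\d{1,5}[a-z]{0,3}){0,5})\.?)"
    r"[ \t]{1,3}(?P<name>[A-Z(][^.]*[^.:;])$"
)

sample_lines = [
    "Hoofdstuk 1. Algemene bepalingen",
    "Artikel 1.1a. (grondslag)",
    "1.\tHet bevoegd gezag is de gemeente.",  # made-up sub-article line
]

for line in sample_lines:
  m = re.match(re_main_simple, line)
  if m:
    # The number of dots in the enumeration indicates its depth
    depth = m["enum"].count(".") + 1
    print(m["label"], m["enum"], depth, m["name"])
  else:
    print("no main-level match:", line)
```

Accessing groups by name keeps later code independent of the exact group order, which differs between the main, sub, caption and appendix patterns.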

## Extract hierarchy elements

In [None]:
# @title
# Function for identifying potential hierarchical element labels
def find_hier_candidates(lines):
  # Prepare lists
  #l_hier_main = [] # candidates for main elements
  #l_hier_sub = [] # candidates for sub elements
  #l_hier_caption = [] # candidates for other elements
  l_hier_elements = [] # all matching occurrences

  #df_hier_candidates = pd.DataFrame(data=None,columns:["Type", "Match"])
  #pd_hier_info = pd.DataFrame(data=None, columns=["line", "classification"])

  l_hier_index = []
  l_hier_type = []

  # Check each line separately
  for i in range(0, len(lines)):
    # Check each item for main elements
    candidate_main = re.match(re_hier_main, lines[i]) # enable for case-sensitive matching
    #candidate_main = re.match(re_hier_main, lines[i], flags=re.I) # enable for case-insensitive matching

    # Check each item for implicit subarticles and other unnamed subelements
    candidate_sub = re.match(re_hier_sub, lines[i]) # enable for case-sensitive matching
    #candidate_sub = re.match(re_hier_sub, lines[i], flags=re.I) #  enable for case-insensitive matching

    # Check each item for captions
    candidate_caption = re.match(re_hier_caption, lines[i]) # enable for case-sensitive matching
    # candidate_caption = re.match(re_hier_caption, lines[i], flags=re.I) #  enable for case-insensitive matching

    # Check each item for appendices
    candidate_appendix = re.match(re_hier_appendix, lines[i]) # enable for case-sensitive matching
    #candidate_appendix = re.match(re_hier_appendix, lines[i], flags=re.I) #  enable for case-insensitive matching

    # Check matches in order of importance (which avoids duplicate matches)
    if candidate_main:
      print(f"{i:<5} MAINLVL?  {candidate_main[1]}")
      #l_hier_main.append(candidate_main)
      l_hier_elements.append(candidate_main)

      l_hier_index.append(i)
      l_hier_type.append("MAINLVL")
      #pd_hier_info = pd_hier_info._append([[i], ["MAINLVL"]], ignore_index=True)

    elif candidate_sub:
      print(f"{i:<5} SUBLVLI?  {candidate_sub[0]}")
      #l_hier_sub.append(candidate_sub)
      l_hier_elements.append(candidate_sub)

      l_hier_index.append(i)
      l_hier_type.append("SUBLVLI")
      #pd_hier_info = pd_hier_info._append([[i], ["SUBLVL"]], ignore_index=True)


    elif candidate_caption:
      print(f"{i:<5} CAPTION?  {candidate_caption[0]}")
      #l_hier_caption.append(candidate_caption)
      l_hier_elements.append(candidate_caption)

      l_hier_index.append(i)
      l_hier_type.append("CAPTION")
      #pd_hier_info = pd_hier_info._append([[i], ["CAPTION"]], ignore_index=True)

    elif candidate_appendix:
      print(f"{i:<5} APPENDIX?  {candidate_appendix[0]}")
      #l_hier_caption.append(candidate_appendix)
      l_hier_elements.append(candidate_appendix)

      l_hier_index.append(i)
      l_hier_type.append("APPENDIX")
      #pd_hier_info = pd_hier_info._append([[i], ["CAPTION"]], ignore_index=True)

  # Give lists as output
  #return l_hier_elements, l_hier_main, l_hier_sub, l_hier_caption
  return l_hier_index, l_hier_type, l_hier_elements, #l_hier_main, l_hier_sub, l_hier_caption
  #return pd_hier_info, l_hier_elements, #l_hier_main, l_hier_sub, l_hier_caption

# Check whether document has to be split in lines before continuing
if isinstance(document, str):
  #lines = document.splitlines()
  lines = re.split("\n+", document) # splits on "\n" only and collapses consecutive line breaks; str.splitlines() also splits on other line boundaries (e.g. form feeds) and keeps empty lines
else:
  lines = document

# Perform extraction
l_hier_index, l_hier_type, l_hier_elements = find_hier_candidates(lines)
#pd_hier_info, l_hier_elements = find_hier_candidates(lines)
#l_hier_index, l_hier_type, l_hier_elements, l_hier_main, l_hier_sub, l_hier_caption = find_hier_candidates(lines)

13    MAINLVL?  Hoofdstuk 1. Algemene bepalingen
14    MAINLVL?  Afdeling 1.1. Algemeen
15    MAINLVL?  Artikel 1.1. (begripsbepalingen)
17    MAINLVL?  Artikel 1.1a. (grondslag)
18    SUBLVLI?  1.	
19    SUBLVLI?  2.	
20    MAINLVL?  Afdeling 1.2. Internationaalrechtelijke verplichtingen
21    MAINLVL?  Artikel 1.2. (wederzijdse erkenning)
23    MAINLVL?  Hoofdstuk 2. Algemene bepalingen voor bouwwerken
24    MAINLVL?  Afdeling 2.1. Algemeen
25    MAINLVL?  Artikel 2.1. (toepassingsbereik: activiteiten)
27    MAINLVL?  Artikel 2.2. (bevoegd gezag)
28    SUBLVLI?  1.	
32    SUBLVLI?  2.	
33    MAINLVL?  Artikel 2.3. (maatwerkregels)
35    MAINLVL?  Artikel 2.4. (gelijkwaardigheid bij melding of vergunningvrije activiteit)
36    SUBLVLI?  1.	
39    SUBLVLI?  2.	
40    MAINLVL?  Artikel 2.5. (instandhouden gelijkwaardige maatregel)
42    MAINLVL?  Artikel 2.6. (specifieke zorgplicht: bouwwerkinstallatie)
47    MAINLVL?  Artikel 2.7. (gemeenschappelijk en gezamenlijk)
48    SUBLVLI?  1.	
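
The two line-splitting options that appear in the code above behave differently on some inputs. The sketch below compares them on a made-up string that contains an empty line and a form feed character.

```python
import re

# Hypothetical text with an empty line, a tab and a form feed (\x0c)
text = "Artikel 1.1\n\n1.\teerste lid\x0cvervolg\n"

# str.splitlines() keeps empty lines and also splits on the form feed
lines_splitlines = text.splitlines()
print(lines_splitlines)

# re.split("\n+") only splits on newlines and collapses consecutive ones,
# but leaves an empty string at the end when the text ends with a newline
lines_resplit = re.split("\n+", text)
print(lines_resplit)
```

With `re.split`, the line indices therefore refer to non-empty lines only, which is worth keeping in mind when comparing them with line numbers in the source file.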


### *Identify caption type?*

If both formulas and tables are present, should they be told apart based on their content? Is this necessary?

Figures are out of scope

### Improve automated hierarchy classifications

#### Find explicit sublevels (not the case for BBL)
If there are two elements where one name contains the name of another element (e.g. article in subarticle), the longer element name is assumed to be a subsection.

In [None]:
# Function for finding explicit sublevels by searching for labels that contain other labels in their string
def find_explicit_sublevels(labels):
  # Sort by length for optimal matching (longest matches will be returned)
  labels.sort(key=len, reverse=True)
  # Create dictionary for storing level and its sublevel, if any
  dict_hier_sub = {}
  for x1 in labels:
    for x2 in [x for x in labels if x != x1]:
      # Check if x2 is a substring of x1 (case-insensitive)
      substring = x1.lower().find(x2.lower())
      # Store x1 as sublevel of x2 if x2 is contained in x1
      if substring != -1:
        dict_hier_sub[x2] = x1
        print(f"{x1} is sublevel of {x2}")
        break # only retrieve longest match

  if len(dict_hier_sub) == 0:
    print("No explicit sublevels found")
    return None
  else:
    # Reverse sorting of dictionary for more logical sequence
    return dict(reversed(dict_hier_sub.items()))

# Example of function
find_explicit_sublevels(["Subartikel", "Artikel", "Niks", "ondertekst", "tekst", "Subsubartikel", "tekstonder"])

Subsubartikel is sublevel of Subartikel
Subartikel is sublevel of Artikel
ondertekst is sublevel of tekst
tekstonder is sublevel of tekst


{'tekst': 'tekstonder', 'Artikel': 'Subartikel', 'Subartikel': 'Subsubartikel'}

In [None]:
# Check which labels are in the candidates (all types of non-empty labels (regex match [3]))
l_hier_labels = list(set([x[3] for x in l_hier_elements if x[3]]))
print(l_hier_labels)
dict_hier_sub = find_explicit_sublevels(l_hier_labels)

# Reclassify articles from list to explicit sublevel if the label is in the subelements list
for i in range(0, len(l_hier_index)):
  if dict_hier_sub and l_hier_elements[i][3] in dict_hier_sub.values():
    # Preview of the changed classifications:
    print(f"{l_hier_index[i]:<6} {l_hier_type[i]:<4} → SUBLVLE {l_hier_elements[i][3]}")
    # Change classification
    l_hier_type[i] = "SUBLVLE"

['Afdeling', 'Hoofdstuk', 'Artikel', 'Tabel', '§', 'Bijlage']
No explicit sublevels found


#### Remove discrepancies

Update types of outliers based on frequencies

In [None]:
# Function for improving the element classifications
def improve_hier_classification(l_labels, l_type, l_set, label_threshold=2):
  # Define dictionary for output
  dict_type = {}
  # Perform check for each label
  for label in l_labels:
    # Collect the classifications of all elements matching the label
    l_matches = [l_type[i] for i in range(0, len(l_set)) if l_set[i][3] == label]

    # If the number of elements with the current label is below the threshold, classify the label as outlier
    if len(l_matches) < label_threshold:
      dict_type[label] = "OUTLIER"
      print(f"{label:<16}OUTLIER   {len(l_matches)}")

    # If there are labels that contain each other, it is assumed that the label containing the other label is a subelement
    #elif label[:3]

    # If there are multiple classifications for the current label, apply the most common classification to all the elements with the current label
    else:
      # Count the classification distribution
      frequency = collections.Counter(l_matches)
      # Select the classification with the highest frequency (iterating a Counter yields keys in first-seen order, not by count)
      most_common_type = frequency.most_common(1)[0][0]
      dict_type[label] = most_common_type
      print(f"{label:<16}{most_common_type:<10}{frequency}")

  return dict_type


# Perform classification check for each label separately
dict_hier_type = improve_hier_classification(l_hier_labels, l_hier_type, l_hier_elements, label_threshold=2) # threshold: a label should have at least this many occurrences to not be considered an outlier and excluded from the set

# Apply the changed classifications (with a preview of up to 20 changes)
print("Preview of changed elements:")
n=0
for i in range(0, len(l_hier_index)):
  if l_hier_type[i] != "SUBLVLI" and l_hier_type[i] != dict_hier_type[l_hier_elements[i][3]]:
    # Only the preview is limited to 20 items, the update itself is applied to every changed element
    if n<20:
      print(f"{l_hier_index[i]:<6} {l_hier_type[i]:<4} → {dict_hier_type[l_hier_elements[i][3]]:<8} {l_hier_elements[i][3]}")#{l_hier_elements}")
    l_hier_type[i] = dict_hier_type[l_hier_elements[i][3]]
    n+=1

Hoofdstuk       MAINLVL   Counter({'MAINLVL': 9})
Afdeling        MAINLVL   Counter({'MAINLVL': 32})
Artikel         MAINLVL   Counter({'MAINLVL': 594, 'CAPTION': 5, 'APPENDIX': 1})
Bijlage         APPENDIX  Counter({'APPENDIX': 2})
Tabel           CAPTION   Counter({'CAPTION': 73, 'MAINLVL': 5})
§               MAINLVL   Counter({'MAINLVL': 110})
Preview of changed elements:
186    APPENDIX → MAINLVL  Artikel
1643   MAINLVL → CAPTION  Tabel
1657   MAINLVL → CAPTION  Tabel
1909   MAINLVL → CAPTION  Tabel
2867   MAINLVL → CAPTION  Tabel
2892   MAINLVL → CAPTION  Tabel
3103   CAPTION → MAINLVL  Artikel
3105   CAPTION → MAINLVL  Artikel
3137   CAPTION → MAINLVL  Artikel
3159   CAPTION → MAINLVL  Artikel
4423   CAPTION → MAINLVL  Artikel
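
Applying the most frequent classification to all elements of a label amounts to a majority vote. The sketch below shows this with `collections.Counter` on made-up data. It also shows that the first key of a `Counter` is simply the first value that was seen, which is not necessarily the most frequent one.

```python
import collections

# Hypothetical classifications collected for one label
matches = ["CAPTION", "MAINLVL", "MAINLVL", "MAINLVL", "APPENDIX"]

frequency = collections.Counter(matches)

# most_common(1) returns a list with one (value, count) pair
majority, count = frequency.most_common(1)[0]
print(majority, count)

# Keys are kept in first-seen order, so this is not the majority here
first_seen = next(iter(frequency))
print(first_seen)
```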


#### Filter & Update set based on user validation

In [None]:
# Define temporary lists
l_cand_index = l_hier_index.copy()
l_cand_type = l_hier_type.copy()
l_cand_elements = l_hier_elements.copy()

# Perform check for each element
for i in range(0, len(l_cand_elements)):
  # Check whether current element has a label
  if l_cand_elements[i][3]:
    # Update classification based on dictionary
    l_hier_type[i] = dict_hier_type[l_cand_elements[i][3]]

# Delete element from list, if necessary
## Indicate which indices to delete
l_hier_outliers = [i for i in range(0, len(l_hier_type)) if l_hier_type[i] == "OUTLIER"]
## Create new lists without the deleted indices
l_hier_index = [i for j, i in enumerate(l_cand_index) if j not in l_hier_outliers ]
l_hier_type = [i for j, i in enumerate(l_hier_type) if j not in l_hier_outliers ]
l_hier_elements = [i for j, i in enumerate(l_cand_elements) if j not in l_hier_outliers ]
#l_hier_labels = [i for j, i in enumerate(l_cand_labels) if j not in l_hier_outliers ]

# Delete the candidate values from system memory
del l_cand_index
del l_cand_type
del l_cand_elements
#del l_cand_labels

# Preview of final results (at most the first 30 elements)
for i in range(0, min(30, len(l_hier_type))):
  if l_hier_type[i] != "SUBLVLI":
  #if l_hier_type[i] != "SUBLVLI" and l_hier_type[i] != "SUBLVLE":
    print(f"{l_hier_index[i]:<6} {l_hier_type[i]:<4} {l_hier_elements[i][3]}")#{l_cand_elements}")

13     MAINLVL Hoofdstuk
14     MAINLVL Afdeling
15     MAINLVL Artikel
17     MAINLVL Artikel
20     MAINLVL Afdeling
21     MAINLVL Artikel
23     MAINLVL Hoofdstuk
24     MAINLVL Afdeling
25     MAINLVL Artikel
27     MAINLVL Artikel
33     MAINLVL Artikel
35     MAINLVL Artikel
40     MAINLVL Artikel
42     MAINLVL Artikel
47     MAINLVL Artikel
51     MAINLVL Artikel
59     MAINLVL Artikel
61     MAINLVL Artikel
63     MAINLVL Artikel
65     MAINLVL Artikel
67     MAINLVL Artikel


In [None]:
l_hier_labels

['Hoofdstuk', 'Afdeling', 'Artikel', 'Bijlage', 'Tabel', '§']

## *User validation #1 (out of scope)*

Confirm all main labels

In [None]:
# Let user create, update or delete elements from dictionary via interface
print(dict_hier_type)
print(dict_hier_sub)

# Define final list after checking
try:
  print(dict_hier_type_validated)
  print(dict_hier_sub_validated)
except NameError:
  # The validated dictionaries only exist after user validation (out of scope here)
  pass

{'Hoofdstuk': 'MAINLVL', 'Afdeling': 'MAINLVL', 'Artikel': 'MAINLVL', 'Bijlage': 'APPENDIX', 'Tabel': 'CAPTION', '§': 'MAINLVL'}
None


## Determine taxonomy of the hierarchical elements

The following assumptions are made for this method:


* The document starts with the main label of the highest level.
* Every section ends with the main label of the lowest level, which is always the same label.
* Sublevels are always a subset of a main level.
* Within each child of the highest level, some levels can be skipped, but the hierarchy order always goes from the biggest level to the smallest level and never back to bigger levels (the topology is acyclic). Thus, when going to a bigger level in the hierarchy, it is assumed that a new branch of the parent level is created.

---
Steps in script (skip to step 8 whenever all labels have been found):

1. Get a checking list of all instances of main labels in text (without consecutive duplicates).

2. Identify the first label in the checking list, and add to the ordered label list

3. Split the checking list into sublists, each starting with the first label as item 1. Remove start label from all sublists. Remove empty sublists.

4. Find the last item in each sublist. Add this label to the end of the ordered label list.
  * The last items should all be the same label. If not, raise an error.

5. Split the sublists after each occurrence of the end label (but keep checking list 2-dimensional). Remove end label from all sublists. Remove empty sublists.

6. Select the next start labels in the sublists. If found in other sublists in any item but the first, remove from selection.
  * If multiple items left: list indeterminable, give error.
  * If one item left: add this before end label to ordered label list, remove this item from sublists. Remove empty sublists.

7. Repeat step 6 until all labels are added to ordered label list.

8. Check for sublabels

9. Add sublabels back into the list

10. Return ordered label list
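
Steps 1 to 7 above can be sketched in one compact function. This is a minimal illustration under the stated assumptions, not the implementation used in the cells below: the function name `order_main_labels` and the example sequence are made up, and sublabels (steps 8 and 9) are left out.

```python
def order_main_labels(sequence):
  """Order main labels from highest to lowest level (sketch of steps 1-7)."""
  # Step 1: remove consecutive duplicates
  seq = [x for i, x in enumerate(sequence) if i == 0 or x != sequence[i - 1]]
  # Step 2: the first label is assumed to be the highest level
  first = seq[0]

  # Step 3: split at every occurrence of the first label and drop that label
  sublists, current = [], []
  for item in seq:
    if item == first:
      if current:
        sublists.append(current)
      current = []
    else:
      current.append(item)
  if current:
    sublists.append(current)
  if not sublists:
    return [first]

  # Step 4: all sublists should end with the same (lowest) label
  last_labels = {s[-1] for s in sublists}
  if len(last_labels) != 1:
    raise ValueError(f"Ambiguous lowest level: {sorted(last_labels)}")
  last = last_labels.pop()

  # Step 5: split after every end label and drop it
  chunks, current = [], []
  for sublist in sublists:
    for item in sublist:
      if item == last:
        if current:
          chunks.append(current)
        current = []
      else:
        current.append(item)

  # Steps 6-7: repeatedly pick the label that only occurs at the start of chunks
  middle = []
  while chunks:
    starts = {c[0] for c in chunks}
    later = {x for c in chunks for x in c[1:]}
    candidates = starts - later
    if len(candidates) != 1:
      raise ValueError(f"Order cannot be determined: {sorted(starts)}")
    label = candidates.pop()
    middle.append(label)
    chunks = [[x for x in c if x != label] for c in chunks]
    chunks = [c for c in chunks if c]

  return [first] + middle + [last]

# Made-up label sequence in the style of the document
example = ["Hoofdstuk", "Afdeling", "Artikel", "Artikel", "Afdeling", "§",
           "Artikel", "§", "Artikel", "Hoofdstuk", "Afdeling", "Artikel"]
print(order_main_labels(example))
```

The function raises a `ValueError` where the steps above say "give error", so an ambiguous structure is reported instead of being ordered arbitrarily.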


In [None]:
# Get all main labels from the hierarchy
l_hier_sequence = [l_hier_elements[i][3] for i in range(0, len(l_hier_elements)) if l_hier_type[i] == "MAINLVL"]
l_hier_check = set(l_hier_sequence)
# Remove consecutive duplicates from list
i=0
while i < len(l_hier_sequence)-1:
  if l_hier_sequence[i] == l_hier_sequence[i+1]:
    del l_hier_sequence[i] # Delete consecutive duplicates
  else:
    i = i+1
print(l_hier_sequence)

['Hoofdstuk', 'Afdeling', 'Artikel', 'Afdeling', 'Artikel', 'Hoofdstuk', 'Afdeling', 'Artikel', 'Afdeling', 'Artikel', 'Afdeling', 'Artikel', 'Afdeling', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', 'Afdeling', 'Artikel', 'Hoofdstuk', 'Afdeling', 'Artikel', 'Afdeling', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', 'Afdeling', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', 'Afdeling', '§', 'Artikel', '§', 'Artikel', 'Afdeling', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', 'Afdeling', 'Artikel', 'Afdeling', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', 'Hoofdstuk', 'Afdeling', 'Artikel', 'Afdeling', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', 
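
As an alternative to deleting items in a `while` loop, consecutive duplicates can also be removed with `itertools.groupby`, which groups adjacent equal items. A minimal sketch on a made-up sequence:

```python
from itertools import groupby

# Hypothetical label sequence with one consecutive duplicate
sequence = ["Hoofdstuk", "Afdeling", "Artikel", "Artikel", "Afdeling", "§", "Artikel"]

# groupby yields one (key, group) pair per run of equal adjacent items
deduplicated = [label for label, _ in groupby(sequence)]
print(deduplicated)
```

Unlike the in-place loop, this builds a new list and leaves the original sequence untouched.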

In [None]:
# Split list into sublists at every occurrence of the first label
def split_first(lst, label):
  result = []
  sublist = []
  # Check the items in the list
  for item in lst:
    if item == label:
      # Close the current sublist (if any) and start a new one at the label
      if sublist:
        result.append(sublist)
      sublist = [item]
    else:
      sublist.append(item)

  # Append the last sublist if it's not empty
  if sublist:
    result.append(sublist)

  return result


# Function for removing label from sublists (if present)
def clean_sublists(lst, label):
  sublist_new = []
  # Check if list needs to be split up further before continuing
  #if lst == isinstance(list):
  #  for lst in
  for sublist in lst:
    #if start == True and sublist[0] == label: # Remove first label
    # Start new sublist from second index (if it exists) after label occurrence
    if sublist[0] == label and len(sublist) > 1:
      sublist_new.append(sublist[1:])

    #elif start == False and sublist[-1] == label: # Remove last label
      #sublist_new.append(sublist[:-2])
    elif len(sublist) > 1:
      sublist_new.append(sublist)
  # Remove empty sublists
  #sublist_new = [x for x in sublist_new if x]
  # Output the new sublist
  return sublist_new

# Identify the first label in the checking list (which is assumed to be the highest level)
hier_label_main = l_hier_sequence[0]
l_labels_main_sorted = []
## Add first label as highest level to final list
l_labels_main_sorted.append(hier_label_main)

# Split the list before every occurrence of the main label
l_sublists = split_first(l_hier_sequence, hier_label_main)

# Remove label from sublists (if present)
l_sublists = clean_sublists(l_sublists, hier_label_main)

print(l_sublists)
print(l_labels_main_sorted)

[['Afdeling', 'Artikel', 'Afdeling', 'Artikel'], ['Afdeling', 'Artikel', 'Afdeling', 'Artikel', 'Afdeling', 'Artikel', 'Afdeling', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', 'Afdeling', 'Artikel'], ['Afdeling', 'Artikel', 'Afdeling', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', 'Afdeling', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', 'Afdeling', '§', 'Artikel', '§', 'Artikel', 'Afdeling', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', 'Afdeling', 'Artikel', 'Afdeling', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel'], ['Afdeling', 'Artikel', 'Afdeling', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel', '§', 'Artikel

In [None]:
# Split sublists into sublists after every occurrence of the last label
def split_last(lst, label):
  result = []
  # Check the items in the list
  for sublist in lst:
    branch = []
    for item in sublist:
      #if isinstance(item, list):
      # Check if label has already been encountered (which shouldn't be the case)
      if item in branch:
        print("Error: not all branches end with lowest hierarchy level!")
        print("Cyclic hierarchy level: ", item)
        return lst
        #break
      # Check if last label is encountered
      elif item == label:
        # Add current sublist as branch to list (without last label) if it's not empty
        if branch:
          result.append(branch)
        branch = [] # Reset branch list
      # Otherwise, continue extending branch
      else:
        branch.append(item)

  return result

# Find the last item in each sublist (which is assumed to be the lowest level)
hier_label_last = [x[-1] for x in l_sublists]
print(hier_label_last)
## Last items should be the same label. If not: give error
if len(set(hier_label_last)) != 1:
  print('ERROR: Hierarchy order could not be determined!')
  print("Conflicting hierarchy levels:", list(set(hier_label_last)))
  print(hier_label_last, len(set(hier_label_last)))
  #sys.exit()
## Else, continue analysing hierarchy (if it contains more than one label)
elif len(l_labels_main_sorted) < len(set(l_hier_sequence)):
  hier_label_last = l_sublists[0][-1]
  ## Add last label as lowest level to final list
  l_labels_main_sorted.append(hier_label_last)
  # Split the sublists after every occurrence of the last label
  l_sublists = split_last(l_sublists, hier_label_last)
  # Remove label from sublists (if present)
  l_sublists = clean_sublists(l_sublists, hier_label_last)

print(l_sublists)
print(l_labels_main_sorted)

['Artikel', 'Artikel', 'Artikel', 'Artikel', 'Artikel', 'Artikel', 'Artikel', 'Artikel', 'Artikel']
[['Afdeling', '§'], ['Afdeling', '§'], ['Afdeling', '§'], ['Afdeling', '§'], ['Afdeling', '§'], ['Afdeling', '§'], ['Afdeling', '§'], ['Afdeling', '§'], ['Afdeling', '§'], ['Afdeling', '§'], ['Afdeling', '§'], ['Afdeling', '§'], ['Afdeling', '§'], ['Afdeling', '§'], ['Afdeling', '§'], ['Afdeling', '§'], ['Afdeling', '§']]
['Hoofdstuk', 'Artikel']


In [None]:
def get_next_label(lst):
  l_children = []
  for sublist in lst:
    # Check if sublist has more than one item
    if len(sublist) > 1:
      # Add all items that are not the first to a list
      for item in range(1, len(sublist)):
        #print(item)
        l_children.append(sublist[item])

  # Get unique values in flattened version of sublists
  candidates = [x for xs in lst for x in xs]
  #print(candidates)
  #print(set(candidates))
  #print(set(l_children))
  # Get the only value not in the list
  parent = list(set(candidates) - set(l_children))
  # Check if there is actually only one value left (or none, if this was the last label)
  if len(parent) == 1:
    #print(parent[0])
    #print(lst)
    return parent[0]
  #elif len(parent) == 0:

  else:
    # Stop here: returning the sublists would make the calling loop insert a list as a label
    raise ValueError(f"Order of levels could not be determined; conflicting hierarchy levels: {parent}")

  #print("PARENT: ", parent)
  #return parent, lst


#CHECK IF LIST IS DONE
while len(l_labels_main_sorted) < len(set(l_hier_sequence)):
  # Find the next label and new sublist
  nextlabel = get_next_label(l_sublists)
  # l_sublists = clean_sublists(l_sublists, hier_label_main, start=True)
  # Remove label from sublists (if present)
  l_sublists = clean_sublists(l_sublists, nextlabel)
  #print(l_sublists)
  # Add next label before the last label (which is the lowest level)
  l_labels_main_sorted.insert(-1, nextlabel)

#l_children
l_labels_main_sorted

['Hoofdstuk', 'Afdeling', '§', 'Artikel']

## *User validation #2 (out of scope)*

Confirm all sublabels and the ordering

*Also make a sorted dictionary for hierarchy elements and their possible representations?*

In [None]:
l_labels_main_sorted

['Hoofdstuk', 'Afdeling', '§', 'Artikel']

Show mapping to user for validation

>Hoofdstuk
>>  Afdeling
>>> §
>>>> Artikel
>>>>> 1.
>
> Bijlage
>>?

USER SHOULD ALSO ADD INDICATION OF SUBARTICLE NAME

## Final recognized hierarchy elements

In [None]:
for i in range(0,len(l_hier_index)):

  #try:
  print(l_hier_index[i], l_hier_type[i], l_hier_elements[i]['trim'])
  #except:
    #continue

13 MAINLVL Hoofdstuk 1. Algemene bepalingen
14 MAINLVL Afdeling 1.1. Algemeen
15 MAINLVL Artikel 1.1. (begripsbepalingen)
17 MAINLVL Artikel 1.1a. (grondslag)
18 SUBLVLI 1.	
19 SUBLVLI 2.	
20 MAINLVL Afdeling 1.2. Internationaalrechtelijke verplichtingen
21 MAINLVL Artikel 1.2. (wederzijdse erkenning)
23 MAINLVL Hoofdstuk 2. Algemene bepalingen voor bouwwerken
24 MAINLVL Afdeling 2.1. Algemeen
25 MAINLVL Artikel 2.1. (toepassingsbereik: activiteiten)
27 MAINLVL Artikel 2.2. (bevoegd gezag)
28 SUBLVLI 1.	
32 SUBLVLI 2.	
33 MAINLVL Artikel 2.3. (maatwerkregels)
35 MAINLVL Artikel 2.4. (gelijkwaardigheid bij melding of vergunningvrije activiteit)
36 SUBLVLI 1.	
39 SUBLVLI 2.	
40 MAINLVL Artikel 2.5. (instandhouden gelijkwaardige maatregel)
42 MAINLVL Artikel 2.6. (specifieke zorgplicht: bouwwerkinstallatie)
47 MAINLVL Artikel 2.7. (gemeenschappelijk en gezamenlijk)
48 SUBLVLI 1.	
49 SUBLVLI 2.	
50 SUBLVLI 3.	
51 MAINLVL Artikel 2.8. (monumenten)
59 MAINLVL Artikel 2.9. (afwijking wegens i

---
# **Extract hierarchy elements**

Store as dataframe with corresponding information about regulation paths

## Store all hierarchy elements
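
As a minimal sketch of how the named groups of a regular-expression match map onto the dataframe columns: the pattern `HEADING` and the helper `heading_to_row` below are hypothetical and only stand in for the expressions defined earlier in the notebook, which may differ.

```python
import re

# Hypothetical pattern; the notebook's actual expressions are defined in an earlier section
HEADING = re.compile(
    r"(?P<label>Hoofdstuk|Afdeling|Artikel|§)\s+"
    r"(?P<enum>\d+(?:\.\d+)*[a-z]?)\.?\s+"
    r"(?P<name>.+)"
)


def heading_to_row(line_no, line):
    """Return a dict with the dataframe columns, or None if the line is not a heading."""
    m = HEADING.match(line)
    if m is None:
        return None
    return {
        "line": line_no,
        "full_match": m[0],   # whole matched heading
        "label": m["label"],  # e.g. 'Artikel'
        "id": m["enum"],      # e.g. '1.1a'
        "title": m["name"],   # e.g. '(grondslag)'
    }


rows = [
    heading_to_row(13, "Hoofdstuk 1. Algemene bepalingen"),
    heading_to_row(17, "Artikel 1.1a. (grondslag)"),
]
for row in rows:
    print(row)
```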

In [None]:
# define empty dataframe to store hierarchy elements in
df_hier_elements = pd.DataFrame(columns=["line", "type", "full_match", "label", "id", "title"])

name = None # last known title, reused for subelements without a title of their own
for i in range(0,len(l_hier_index)):
  # add data to dataframe, depending on type of element:
  # main levels, captions and appendices all store their own label, id and title
  if l_hier_type[i].startswith(("MAIN", "CAP", "APP")):
    df_hier_elements.loc[i] = [l_hier_index[i], str(l_hier_type[i]), str(l_hier_elements[i][0]), str(l_hier_elements[i]['label']), str(l_hier_elements[i]['enum']), l_hier_elements[i]['name']]
  elif l_hier_type[i].startswith("SUB"):
    # for subelements, use the name from the main element if there is no name
    try:
      name = l_hier_elements[i]['name']
    except (IndexError, KeyError): # the match has no group called 'name'
      try:
        name = l_hier_elements[i-1]['name']
      except (IndexError, KeyError):
        pass # keep the previous name (for subsequent subarticles)
    df_hier_elements.loc[i] = [l_hier_index[i], str(l_hier_type[i]), str(l_hier_elements[i][0]), 'Lid', str(l_hier_elements[i]['enum']), name]
  else:
    print("could not find type for element with index", i, "-", l_hier_elements[i]['trim'])

# Use line number as index for dataframe?
df_hier_elements.set_index('line', drop=True, inplace=True)

df_hier_elements

line,type,full_match,label,id,title
13,MAINLVL,Hoofdstuk 1. Algemene bepalingen,Hoofdstuk,1,Algemene bepalingen
14,MAINLVL,Afdeling 1.1. Algemeen,Afdeling,1.1,Algemeen
15,MAINLVL,Artikel 1.1. (begripsbepalingen),Artikel,1.1,(begripsbepalingen)
17,MAINLVL,Artikel 1.1a. (grondslag),Artikel,1.1a,(grondslag)
18,SUBLVLI,1.\t,Lid,1.,(grondslag)
...,...,...,...,...,...
4879,APPENDIX,Bijlage I. bij artikel 1.1 van dit besluit (be...,Bijlage,I,bij artikel 1.1 van dit besluit (begrippen)
4880,SUBLVLI,A.,Lid,A.,bij artikel 1.1 van dit besluit (begrippen)
5006,SUBLVLI,B.,Lid,B.,bij artikel 1.1 van dit besluit (begrippen)
5040,SUBLVLI,C.,Lid,C.,bij artikel 1.1 van dit besluit (begrippen)


## Determine paths

Store paths for elements (except for tables & figures)


* Using the parent-child relationships between hierarchy labels identified earlier
* Assumes that the document is sorted!
* Paths are stored as URLs with parents in URL, for example: https://wetten.overheid.nl/BWBR0041297/2024-08-01#Hoofdstuk2_Afdeling2.2a_Artikel2.18_Lid3
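
The parent tracking described above can be sketched with a stack of `(level, name)` pairs. This is a simplified illustration for main levels only (no subarticles or appendices); the function `build_paths` and the toy elements are assumptions, not part of the notebook.

```python
def build_paths(elements, level_order):
    """Build underscore-joined paths for (label, id) pairs given in document order."""
    stack = []  # (level, "LabelId") for the current element and its parents
    paths = []
    for label, ident in elements:
        level = level_order.index(label)
        # Drop everything at the same or a deeper level: those cannot be parents
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, f"{label}{ident.rstrip('.')}"))
        paths.append("_".join(name for _, name in stack))
    return paths


levels = ["Hoofdstuk", "Afdeling", "§", "Artikel"]
toy_elements = [
    ("Hoofdstuk", "2."),
    ("Afdeling", "2.1."),
    ("Artikel", "2.1."),
    ("Artikel", "2.2."),
    ("Afdeling", "2.2."),
    ("Artikel", "2.15a."),
]
for p in build_paths(toy_elements, levels):
    print(p)
```

Like the notebook's approach, the sketch only works if the elements are in document order, because a parent is always the most recent element of a higher level.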


In [None]:
l_labels_main_sorted

['Hoofdstuk', 'Afdeling', '§', 'Artikel']

In [None]:
def determine_hier_URLs(df, labels):
  # NOTE: relies on the global base_URL defined earlier in the notebook
  path = [] # will store label & id
  path_levels = [] # will store path levels
  l_urls = []  # will store URLs
  current_level = 0 # level of the last main element (also used for appendices)

  for i in df.index:

    if df.loc[i, "type"].startswith("MAIN"): # for main labels
      current_label = df.loc[i, 'label']
      current = current_label + df.loc[i, 'id'].rstrip('.') # rstrip removes any trailing periods
      current_level = labels.index(current_label)
      if path == []: # if path is empty, current = parent
        path = [current]
        path_levels = [current_level]
      # if the last element in the path is higher in the hierarchy, it is the parent:
      elif path_levels[-1] < current_level:
        # add current to path & levels
        path.append(current)
        path_levels.append(current_level)
      # otherwise, the path ends in an element at the same or a lower hierarchy level:
      else:
        # keep only the levels that are higher in the hierarchy (lower index) than the current level
        path_levels = [x for x in path_levels if x < current_level]
        # keep the corresponding part of the path
        path = path[:len(path_levels)]
        # add current to path & levels
        path.append(current)
        path_levels.append(current_level)

    elif df.loc[i, "type"].startswith("SUB"): # for sublevels of the current parent level
      # check if current label is subarticle
      if df.loc[i, 'label'].startswith("Lid"):
        current = df.loc[i, 'label']+df.loc[i, 'id'].rstrip('.') # rstrip removes any trailing periods
        # build URL from all parent labels plus the current label (path itself is left unchanged)
        l_urls.append(base_URL+'_'.join(path)+'_'+current)
        continue # skip URL assignment at end
      # otherwise, only add enumeration
      else:
        current = df.loc[i, 'id'].rstrip('.') # rstrip removes any trailing periods
        path.append(current)
        path_levels.append(path_levels[-1]+0.5) # add intermediate level

    elif df.loc[i, "type"].startswith("APP"): # for appendices
      current = df.loc[i, 'label']+df.loc[i, 'id'].rstrip('.') # rstrip removes any trailing periods
      # an appendix starts a new path; it reuses the level of the last main element
      path = [current]
      path_levels = [current_level]

    else:
      l_urls.append("")
      continue # skip URL assignment at end

    # build URL from all parent labels plus the current label
    l_urls.append(base_URL+'_'.join(path))

  return l_urls


# get URLs for hierarchy elements
l_urls = determine_hier_URLs(df_hier_elements, l_labels_main_sorted)

# Store paragraphs as name and not paragraph sign!
l_urls = [x.replace("§", "Paragraaf") for x in l_urls]

# Store URLs in new column
df_hier_elements["URL"] = l_urls


l_urls[50:250]

['https://wetten.overheid.nl/BWBR0041297/2024-08-01#Hoofdstuk2_Afdeling2.2_Artikel2.15a_Lid2',
 'https://wetten.overheid.nl/BWBR0041297/2024-08-01#Hoofdstuk2_Afdeling2.2_Artikel2.15a_Lid3',
 'https://wetten.overheid.nl/BWBR0041297/2024-08-01#Hoofdstuk2_Afdeling2.2_Artikel2.15a_Lid4',
 'https://wetten.overheid.nl/BWBR0041297/2024-08-01#Hoofdstuk2_Afdeling2.2_Artikel2.15a_Lid5',
 'https://wetten.overheid.nl/BWBR0041297/2024-08-01#Hoofdstuk2_Afdeling2.2a',
 'https://wetten.overheid.nl/BWBR0041297/2024-08-01#Hoofdstuk2_Afdeling2.2a_Artikel2.16',
 'https://wetten.overheid.nl/BWBR0041297/2024-08-01#Hoofdstuk2_Afdeling2.2a_Artikel2.17',
 'https://wetten.overheid.nl/BWBR0041297/2024-08-01#Hoofdstuk2_Afdeling2.2a_Artikel2.17_Lid1',
 'https://wetten.overheid.nl/BWBR0041297/2024-08-01#Hoofdstuk2_Afdeling2.2a_Artikel2.17_Lid2',
 'https://wetten.overheid.nl/BWBR0041297/2024-08-01#Hoofdstuk2_Afdeling2.2a_Artikel2.17_Lid3',
 'https://wetten.overheid.nl/BWBR0041297/2024-08-01#Hoofdstuk2_Afdeling2.2a_A

## Determine regulation code

Generate a code for each regulation or regulatory item based on the path name
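
As a rough sketch of the idea: each label in the path is replaced by a short prefix and the dots in the identifier become underscores. The mapping `toy_codes` and the function `path_to_code` are assumptions for illustration; the notebook itself derives the prefixes from `dict_hier_elements`.

```python
# Hypothetical label-to-prefix mapping, chosen to mirror the codes shown in this notebook
toy_codes = {"Hoofdstuk": "C", "Afdeling": "S", "Paragraaf": "P", "Artikel": "A", "Lid": "SUB"}


def path_to_code(fragment, codes):
    """Convert a URL fragment such as 'Hoofdstuk4_Afdeling4.2' into a compact code."""
    parts = []
    for part in fragment.split("_"):
        for label, short in codes.items():
            if part.startswith(label):
                # Swap the label for its prefix and replace dots in the id by underscores
                part = short + part[len(label):].replace(".", "_")
                break
        parts.append(part)
    return "_".join(parts)


fragment = "Hoofdstuk4_Afdeling4.2_Paragraaf4.2.14_Artikel4.92_Lid2"
print(path_to_code(fragment, toy_codes))  # C4_S4_2_P4_2_14_A4_92_SUB2
```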

In [None]:
def generate_codes(df, dictionary):
  l_codes = []
  for url in df["URL"]:
    # The path is the part after "#"; elements without a URL get an empty code
    if "#" not in url:
      l_codes.append("")
      continue
    code = url.split("#")[1]
    # Replace every label by its short code and the dots in the id by underscores
    for value in dictionary.values():
      pattern = rf"{value['re_untranslated']}(.*)"
      code = re.sub(pattern, lambda m: f"{value['code']}{m.group(1).replace('.', '_')}", code, flags=re.IGNORECASE)
    l_codes.append(code)
  return l_codes

# Example of the conversion (format as in the printed codes):
# "Hoofdstuk4_Afdeling4.2_Paragraaf4.2.14_Artikel4.92_Lid2" -> "C4_S4_2_P4_2_14_A4_92_SUB2"

df_hier_elements["code"] = generate_codes(df_hier_elements, dict_hier_elements)
print(df_hier_elements["code"][:10])
print(df_hier_elements["code"][-10:])

line
13                    C1
14               C1_S1_1
15          C1_S1_1_A1_1
17         C1_S1_1_A1_1a
18    C1_S1_1_A1_1a_SUB1
19    C1_S1_1_A1_1a_SUB2
20               C1_S1_2
21          C1_S1_2_A1_2
23                    C2
24               C2_S2_1
Name: code, dtype: object
line
4864    C8_A8_3_SUB2
4865    C8_A8_3_SUB3
4866              C9
4867         C9_A9_1
4869         C9_A9_2
4879           APPXI
4880      APPXI_SUBA
5006      APPXI_SUBB
5040      APPXI_SUBC
5061          APPXII
Name: code, dtype: object


In [None]:
print(df_hier_elements["code"][:20])

line
13                    C1
14               C1_S1_1
15          C1_S1_1_A1_1
17         C1_S1_1_A1_1a
18    C1_S1_1_A1_1a_SUB1
19    C1_S1_1_A1_1a_SUB2
20               C1_S1_2
21          C1_S1_2_A1_2
23                    C2
24               C2_S2_1
25          C2_S2_1_A2_1
27          C2_S2_1_A2_2
28     C2_S2_1_A2_2_SUB1
32     C2_S2_1_A2_2_SUB2
33          C2_S2_1_A2_3
35          C2_S2_1_A2_4
36     C2_S2_1_A2_4_SUB1
39     C2_S2_1_A2_4_SUB2
40          C2_S2_1_A2_5
42          C2_S2_1_A2_6
Name: code, dtype: object


## Final dataframe

In [None]:
df_hier_elements

line,type,full_match,label,id,title,URL,code
13,MAINLVL,Hoofdstuk 1. Algemene bepalingen,Hoofdstuk,1,Algemene bepalingen,https://wetten.overheid.nl/BWBR0041297/2024-08...,C1
14,MAINLVL,Afdeling 1.1. Algemeen,Afdeling,1.1,Algemeen,https://wetten.overheid.nl/BWBR0041297/2024-08...,C1_S1_1
15,MAINLVL,Artikel 1.1. (begripsbepalingen),Artikel,1.1,(begripsbepalingen),https://wetten.overheid.nl/BWBR0041297/2024-08...,C1_S1_1_A1_1
17,MAINLVL,Artikel 1.1a. (grondslag),Artikel,1.1a,(grondslag),https://wetten.overheid.nl/BWBR0041297/2024-08...,C1_S1_1_A1_1a
18,SUBLVLI,1.\t,Lid,1.,(grondslag),https://wetten.overheid.nl/BWBR0041297/2024-08...,C1_S1_1_A1_1a_SUB1
...,...,...,...,...,...,...,...
4879,APPENDIX,Bijlage I. bij artikel 1.1 van dit besluit (be...,Bijlage,I,bij artikel 1.1 van dit besluit (begrippen),https://wetten.overheid.nl/BWBR0041297/2024-08...,APPXI
4880,SUBLVLI,A.,Lid,A.,bij artikel 1.1 van dit besluit (begrippen),https://wetten.overheid.nl/BWBR0041297/2024-08...,APPXI_SUBA
5006,SUBLVLI,B.,Lid,B.,bij artikel 1.1 van dit besluit (begrippen),https://wetten.overheid.nl/BWBR0041297/2024-08...,APPXI_SUBB
5040,SUBLVLI,C.,Lid,C.,bij artikel 1.1 van dit besluit (begrippen),https://wetten.overheid.nl/BWBR0041297/2024-08...,APPXI_SUBC


---
# **Extract regulations**

Use hierarchy information to extract elements and hierarchy paths

**!!! Tables, figures and captions out of scope!**

**!!! Use-case specific!**

## Split document based on hierarchy
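
The extraction idea can be sketched on toy lines: every hierarchy element opens a new text slot, and following non-empty lines are appended to the open slot. The function `collect_text`, the `headings` mapping and the toy lines are assumptions for illustration only.

```python
def collect_text(lines, headings):
    """headings maps a line index to (kind, matched_prefix); returns text per heading index."""
    texts = {}
    current = None  # index of the element that is currently collecting text
    for i, line in enumerate(lines):
        if i in headings:
            kind, prefix = headings[i]
            current = i
            # Sub-elements carry text on the same line; main headings start empty
            texts[i] = line[len(prefix):] if kind == "SUB" else ""
        elif current is not None and line:
            # Continuation line: append to the open element
            texts[current] = f"{texts[current]}\n{line}" if texts[current] else line
    return texts


toy_lines = ["Artikel 1.1a. (grondslag)", "1.\tEerste zin.", "vervolg.", "", "2.\tTweede zin."]
toy_headings = {
    0: ("MAIN", "Artikel 1.1a. (grondslag)"),
    1: ("SUB", "1.\t"),
    4: ("SUB", "2.\t"),
}
print(collect_text(toy_lines, toy_headings))
```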


In [None]:
def extract_regulations(lines, df):
  regulation = [] # for storing regulation text
  current = False # for keeping track of current item (needed for multiline regulations)
  for i, line in enumerate(lines):
    # check if index is also found in list of hierarchy elements
    if i in df.index:
      # check if hierarchy element is a subelement:
      if df.loc[i, "type"].startswith("SUB"):
        # add regulation to text without enumeration at start
        regulation.append(line[len(df.loc[i, "full_match"]):])
        current = True # placeholder will be filled depending on subarticle

      elif df.loc[i, "type"].startswith("MAIN") or df.loc[i, "type"].startswith("APP"):
        regulation.append("")
        current = True # placeholder will be filled depending on subarticle
      else:
        regulation.append("")
        current = False
    elif current and line != "":
      # continuation line: add it to the text of the current element
      if regulation[-1] == "":
        regulation[-1] = line
      else:
        regulation[-1] += '\n' + line

  return regulation

df_regulations = df_hier_elements.copy()
df_regulations["text_original"] = extract_regulations(lines, df_regulations)

df_regulations

line,type,full_match,label,id,title,URL,code,text_original
13,MAINLVL,Hoofdstuk 1. Algemene bepalingen,Hoofdstuk,1,Algemene bepalingen,https://wetten.overheid.nl/BWBR0041297/2024-08...,C1,
14,MAINLVL,Afdeling 1.1. Algemeen,Afdeling,1.1,Algemeen,https://wetten.overheid.nl/BWBR0041297/2024-08...,C1_S1_1,
15,MAINLVL,Artikel 1.1. (begripsbepalingen),Artikel,1.1,(begripsbepalingen),https://wetten.overheid.nl/BWBR0041297/2024-08...,C1_S1_1_A1_1,Bijlage I bevat begripsbepalingen voor de toep...
17,MAINLVL,Artikel 1.1a. (grondslag),Artikel,1.1a,(grondslag),https://wetten.overheid.nl/BWBR0041297/2024-08...,C1_S1_1_A1_1a,
18,SUBLVLI,1.\t,Lid,1.,(grondslag),https://wetten.overheid.nl/BWBR0041297/2024-08...,C1_S1_1_A1_1a_SUB1,"Dit besluit berust op de artikelen 4.3, eerste..."
...,...,...,...,...,...,...,...,...
4879,APPENDIX,Bijlage I. bij artikel 1.1 van dit besluit (be...,Bijlage,I,bij artikel 1.1 van dit besluit (begrippen),https://wetten.overheid.nl/BWBR0041297/2024-08...,APPXI,
4880,SUBLVLI,A.,Lid,A.,bij artikel 1.1 van dit besluit (begrippen),https://wetten.overheid.nl/BWBR0041297/2024-08...,APPXI_SUBA,Begrippen: algemeen\nVoor de toepassing van di...
5006,SUBLVLI,B.,Lid,B.,bij artikel 1.1 van dit besluit (begrippen),https://wetten.overheid.nl/BWBR0041297/2024-08...,APPXI_SUBB,Begrippen: gebruiksfuncties\nVoor de toepassin...
5040,SUBLVLI,C.,Lid,C.,bij artikel 1.1 van dit besluit (begrippen),https://wetten.overheid.nl/BWBR0041297/2024-08...,APPXI_SUBC,Tabel: symbolen en afkortingen\nVoor de toepas...


## Exclude items without regulations
Only keep rows that contain text

In [None]:
# exclude rows without text
df_regulations = df_regulations[df_regulations["text_original"] != ""]

# exclude rows that are not of type main, sub or appendix
df_regulations = df_regulations[df_regulations["type"].apply(lambda x: x.startswith("MAIN") or x.startswith("SUB") or x.startswith("APP"))]

df_regulations

line,type,full_match,label,id,title,URL,code,text_original
15,MAINLVL,Artikel 1.1. (begripsbepalingen),Artikel,1.1,(begripsbepalingen),https://wetten.overheid.nl/BWBR0041297/2024-08...,C1_S1_1_A1_1,Bijlage I bevat begripsbepalingen voor de toep...
18,SUBLVLI,1.\t,Lid,1.,(grondslag),https://wetten.overheid.nl/BWBR0041297/2024-08...,C1_S1_1_A1_1a_SUB1,"Dit besluit berust op de artikelen 4.3, eerste..."
19,SUBLVLI,2.\t,Lid,2.,(grondslag),https://wetten.overheid.nl/BWBR0041297/2024-08...,C1_S1_1_A1_1a_SUB2,Dit besluit berust ook op de artikelen 119 en ...
21,MAINLVL,Artikel 1.2. (wederzijdse erkenning),Artikel,1.2,(wederzijdse erkenning),https://wetten.overheid.nl/BWBR0041297/2024-08...,C1_S1_2_A1_2,"Met een kwaliteitsverklaring bouw, certificaat..."
25,MAINLVL,Artikel 2.1. (toepassingsbereik: activiteiten),Artikel,2.1,(toepassingsbereik: activiteiten),https://wetten.overheid.nl/BWBR0041297/2024-08...,C2_S2_1_A2_1,Dit hoofdstuk is van toepassing op bouwwerken.
...,...,...,...,...,...,...,...,...
4869,MAINLVL,Artikel 9.2. (citeertitel),Artikel,9.2,(citeertitel),https://wetten.overheid.nl/BWBR0041297/2024-08...,C9_A9_2,Dit besluit wordt aangehaald als: Besluit bouw...
4880,SUBLVLI,A.,Lid,A.,bij artikel 1.1 van dit besluit (begrippen),https://wetten.overheid.nl/BWBR0041297/2024-08...,APPXI_SUBA,Begrippen: algemeen\nVoor de toepassing van di...
5006,SUBLVLI,B.,Lid,B.,bij artikel 1.1 van dit besluit (begrippen),https://wetten.overheid.nl/BWBR0041297/2024-08...,APPXI_SUBB,Begrippen: gebruiksfuncties\nVoor de toepassin...
5040,SUBLVLI,C.,Lid,C.,bij artikel 1.1 van dit besluit (begrippen),https://wetten.overheid.nl/BWBR0041297/2024-08...,APPXI_SUBC,Tabel: symbolen en afkortingen\nVoor de toepas...


---
# **Data Filtering**


*Filtering can be done either after translation (to include translation of all regulations), before translation (to reduce duration) or partly before and partly after translation (to include translation of indirectly relevant regulations)*


## Chapter-based filtering




The scope of this thesis is limited to Chapter 4, potentially also using definitions in Appendix I:

In [None]:
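df_subset = df_regulations.copy()
# Keep the Chapter 4 rows. NOTE: appendix codes are generated as "APPXI_..." (see the code column),
# so the prefix "APPX_I_" matches no rows and the output below contains Chapter 4 only.
# Use "APPXI_" instead to also keep the definitions in Appendix I.
df_subset = df_subset[df_subset["code"].apply(lambda x: x.startswith(("C4_", "APPX_I_")))]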
df_subset

line,type,full_match,label,id,title,URL,code,text_original
1560,SUBLVLI,1.\t,Lid,1.,(toepassingsbereik: activiteiten),https://wetten.overheid.nl/BWBR0041297/2024-08...,C4_S4_1_A4_1_SUB1,Dit hoofdstuk is van toepassing op bouwactivit...
1561,SUBLVLI,2.\t,Lid,2.,(toepassingsbereik: activiteiten),https://wetten.overheid.nl/BWBR0041297/2024-08...,C4_S4_1_A4_1_SUB2,Met het bouwen van nieuwe bouwwerken wordt gel...
1562,MAINLVL,Artikel 4.2. (toepassingsbereik: oogmerken),Artikel,4.2,(toepassingsbereik: oogmerken),https://wetten.overheid.nl/BWBR0041297/2024-08...,C4_S4_1_A4_2,De regels in dit hoofdstuk zijn gesteld met he...
1567,MAINLVL,Artikel 4.3. (toepassingsbereik: normadressaat),Artikel,4.3,(toepassingsbereik: normadressaat),https://wetten.overheid.nl/BWBR0041297/2024-08...,C4_S4_1_A4_3,Aan de regels in dit hoofdstuk wordt voldaan d...
1569,MAINLVL,Artikel 4.4. (toepassingsbereik: aansturingsar...,Artikel,4.4,(toepassingsbereik: aansturingsartikel niet va...,https://wetten.overheid.nl/BWBR0041297/2024-08...,C4_S4_1_A4_4,In dit hoofdstuk is een aansturingsartikel nie...
...,...,...,...,...,...,...,...,...
3900,SUBLVLI,2.\t,Lid,2.,(systeemeisen),https://wetten.overheid.nl/BWBR0041297/2024-08...,C4_S4_7_P4_7_14_A4_248_SUB2,"Een technisch bouwsysteem, is adequaat gedimen..."
3901,SUBLVLI,3.\t,Lid,3.,(systeemeisen),https://wetten.overheid.nl/BWBR0041297/2024-08...,C4_S4_7_P4_7_14_A4_248_SUB3,Een technisch bouwsysteem voor ruimteverwarmin...
3902,SUBLVLI,4.\t,Lid,4.,(systeemeisen),https://wetten.overheid.nl/BWBR0041297/2024-08...,C4_S4_7_P4_7_14_A4_248_SUB4,Als een technisch bouwsysteem bestaat uit een ...
3912,MAINLVL,Artikel 4.249. (verslaglegging),Artikel,4.249,(verslaglegging),https://wetten.overheid.nl/BWBR0041297/2024-08...,C4_S4_7_P4_7_14_A4_249,De energieprestatie van de in deze paragraaf b...


## Topic-based filtering

*Ultimately not used, because manual filtering replaced it, but it would work like this:*

In [None]:
# Search the hierarchical list structure for titles containing any of the keywords.
def keywords_filtering(l_keywords, df1, df2):

    # Find matches with keywords in hierarchy titles:
    l_included_codes = []
    for i in df1.index:
        title = df1.loc[i]["title"]
        code = df1.loc[i]["code"]
        for keyword in l_keywords:
            if title and len(code)!=0 and keyword in title: # len(code)!=0 excludes empty codes
                if code not in l_included_codes: # avoids duplicates
                    l_included_codes.append(code)
                break

    # Find all regulations included in the list above AND children of those matches
    l_included_regulations = []
    for reg in df2["code"]:
        for code in l_included_codes:
            if reg.startswith(code): #true for matches & children
                if reg not in l_included_regulations: # avoids duplicates
                    l_included_regulations.append(reg)
                break

    df_new = df2[df2["code"].isin(l_included_regulations)].copy()
    return df_new


keywords_filtering(["brand", "vuur", "blus"], df_hier_elements, df_subset)

line,type,full_match,label,id,title,URL,code,text_original
1707,SUBLVLI,1.\t,Lid,1.,(aansturingsartikel),https://wetten.overheid.nl/BWBR0041297/2024-08...,C4_S4_2_P4_2_2_A4_16_SUB1,Een bouwwerk is bestand tegen brand zodat geen...
1708,SUBLVLI,2.\t,Lid,2.,(aansturingsartikel),https://wetten.overheid.nl/BWBR0041297/2024-08...,C4_S4_2_P4_2_2_A4_16_SUB2,Als voor een gebruiksfunctie in tabel 4.16 reg...
1744,SUBLVLI,1.\t,Lid,1.,(tijdsduur niet-bezwijken),https://wetten.overheid.nl/BWBR0041297/2024-08...,C4_S4_2_P4_2_2_A4_17_SUB1,"Een vloer, trap of hellingbaan waarover of waa..."
1745,SUBLVLI,2.\t,Lid,2.,(tijdsduur niet-bezwijken),https://wetten.overheid.nl/BWBR0041297/2024-08...,C4_S4_2_P4_2_2_A4_17_SUB2,Een bouwconstructie bezwijkt bij brand in een ...
1763,SUBLVLI,3.\t,Lid,3.,brandwerendheid met betrekking tot bezwijken,https://wetten.overheid.nl/BWBR0041297/2024-08...,C4_S4_2_P4_2_2_A4_17_SUB3,In afwijking van het tweede lid wordt de tijds...
...,...,...,...,...,...,...,...,...
3804,SUBLVLI,1.\t,Lid,1.,(brandweeringang),https://wetten.overheid.nl/BWBR0041297/2024-08...,C4_S4_7_P4_7_8_A4_226_SUB1,Een bouwwerk met een krachtens de wet voorgesc...
3805,SUBLVLI,2.\t,Lid,2.,(brandweeringang),https://wetten.overheid.nl/BWBR0041297/2024-08...,C4_S4_7_P4_7_8_A4_226_SUB2,In een bouwwerk met een krachtens de wet voorg...
3806,MAINLVL,Artikel 4.227. (afbakening maatwerkvoorschrift...,Artikel,4.227,(afbakening maatwerkvoorschriften brandweering...,https://wetten.overheid.nl/BWBR0041297/2024-08...,C4_S4_7_P4_7_8_A4_227,Een maatwerkvoorschrift over artikel 4.226 kan...
3810,MAINLVL,Artikel 4.228. (brandweerlift),Artikel,4.228,(brandweerlift),https://wetten.overheid.nl/BWBR0041297/2024-08...,C4_S4_7_P4_7_8_A4_228,Een gebouw waarvan een vloer van een verblijfs...
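
One caveat of the plain prefix test `reg.startswith(code)` in `keywords_filtering`: a code ending in `A4_1` is also a prefix of a sibling ending in `A4_16`. A possible stricter check, sketched here under the assumption that levels are always joined by an underscore (the helper name `is_same_or_child` is hypothetical):

```python
def is_same_or_child(reg_code, parent_code):
    """True if reg_code equals parent_code or continues it after an underscore separator."""
    return reg_code == parent_code or reg_code.startswith(parent_code + "_")


print(is_same_or_child("C4_S4_1_A4_1_SUB1", "C4_S4_1_A4_1"))  # True: subarticle of 4.1
print(is_same_or_child("C4_S4_1_A4_16", "C4_S4_1_A4_1"))      # False: different article
```

Because the underscore also replaces the dots inside identifiers, this reduces false matches but may not rule them out for every numbering scheme.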


## Manual filtering


*Isn't always required at this stage*

The document has been analyzed before, and for the Proof of Concept, these are the relevant regulations:

In [None]:
# subset_article_ids = ["4.37", "4.38", "4.39",
# "4.42", "4.43", "4.44", "4.45", "4.45a", "4.46", "4.47", "4.49", "4.50", "4.51",
# "4.53", "4.54", "4.56", "4.57", "4.58", "4.59", "4.60", "4.61", "4.62", "4.88", "4.89", "4.90", "4.91", "4.92", "4.93"]

# # Define IDs of articles that should be included in subset

In [None]:
subset_article_ids = [#"4.37",
                      "4.38", "4.39",
                      #"4.42",
                      "4.43", "4.44", "4.45", "4.45a", "4.46", "4.47",
                      #"4.49",
                      "4.50", "4.51", "4.53", "4.54",
                      #"4.56",
                      "4.57", "4.58", "4.59", "4.60", "4.61", "4.62",
                      #"4.88",
                      #"4.89",  SIMILAR TO A CONTROL ARTICLE
                      #"4.90",
                      "4.91", "4.92", "4.93"]

# Define IDs of articles that should be included in subset WITHOUT CONTROL ARTICLES

Only including articles (and their subarticles) from subset:

In [None]:
df_subset = df_regulations.copy()

subset_mask = []
for url in df_subset["URL"]:
  # Match "Artikel" + valid number (optionally with letter)
  match = re.search(r"Artikel(\d+\.\d+[a-z]?)", url)
  # Keep the row only if an article number was found and it is in the subset ids
  subset_mask.append(bool(match) and match[1] in subset_article_ids)

df_subset = df_subset[subset_mask]
df_subset

line,type,full_match,label,id,title,URL,code,text_original
1970,MAINLVL,Artikel 4.38. (stookplaats),Artikel,4.38,(stookplaats),https://wetten.overheid.nl/BWBR0041297/2024-08...,C4_S4_2_P4_2_6_A4_38,Materiaal ter plaatse van of nabij een stookpl...
1975,SUBLVLI,1.\t,Lid,1.,"(schacht, koker of kanaal)",https://wetten.overheid.nl/BWBR0041297/2024-08...,C4_S4_2_P4_2_6_A4_39_SUB1,Materiaal toegepast aan de binnenzijde van een...
1976,SUBLVLI,2.\t,Lid,2.,"(schacht, koker of kanaal)",https://wetten.overheid.nl/BWBR0041297/2024-08...,C4_S4_2_P4_2_6_A4_39_SUB2,Het eerste lid is niet van toepassing op:\n\ta...
2030,SUBLVLI,1.\t,Lid,1.,(binnenoppervlak),https://wetten.overheid.nl/BWBR0041297/2024-08...,C4_S4_2_P4_2_7_A4_43_SUB1,Een zijde van een constructieonderdeel die gre...
2031,SUBLVLI,2.\t,Lid,2.,(binnenoppervlak),https://wetten.overheid.nl/BWBR0041297/2024-08...,C4_S4_2_P4_2_7_A4_43_SUB2,In afwijking van het eerste lid geldt de eis a...
...,...,...,...,...,...,...,...,...
2494,SUBLVLI,2.\t,Lid,2.,(brandklasse buitenoppervlak),https://wetten.overheid.nl/BWBR0041297/2024-08...,C4_S4_2_P4_2_14_A4_92_SUB2,In afwijking van het eerste lid voldoet een de...
2495,SUBLVLI,3.\t,Lid,3.,(brandklasse buitenoppervlak),https://wetten.overheid.nl/BWBR0041297/2024-08...,C4_S4_2_P4_2_14_A4_92_SUB3,Op ten hoogste 5% van de totale oppervlakte va...
2496,SUBLVLI,4.\t,Lid,4.,(brandklasse buitenoppervlak),https://wetten.overheid.nl/BWBR0041297/2024-08...,C4_S4_2_P4_2_14_A4_92_SUB4,Het eerste tot en met derde lid zijn niet van ...
2498,SUBLVLI,1.\t,Lid,1.,(brandklasse dak),https://wetten.overheid.nl/BWBR0041297/2024-08...,C4_S4_2_P4_2_14_A4_93_SUB1,"Een dak van een brandcompartiment is, voor zov..."
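
To make the matching step easier to check in isolation, the sketch below applies the same regular expression to a few made-up URLs. The URLs, `toy_subset_ids`, and the helper `extract_article_id` are hypothetical and only illustrate what the pattern captures; the real URL layout is assumed to contain `Artikel` followed directly by the article number.

```python
import re

# Hypothetical URLs; the real layout is only assumed to contain "Artikel<number>"
toy_urls = [
    "https://example.org/regeling/Artikel4.38",
    "https://example.org/regeling/Artikel4.45a",
    "https://example.org/regeling/Artikel4.45",
    "https://example.org/regeling/Paragraaf4.2.6",
]

# Hypothetical subset; a set gives fast membership checks
toy_subset_ids = {"4.38", "4.45a"}

def extract_article_id(url):
    """Return the article number found in the URL, or None if there is none."""
    match = re.search(r"Artikel(\d+\.\d+[a-z]?)", url)
    return match[1] if match else None

toy_ids = [extract_article_id(url) for url in toy_urls]

# None (no article number in the URL) is never in the subset, so it maps to False
toy_mask = [article_id in toy_subset_ids for article_id in toy_ids]

print(toy_ids)
print(toy_mask)
```

Note that the optional letter is part of the captured ID, so `4.45` and `4.45a` are treated as different articles.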


---
# **Translate regulations**

Includes a rule-based pre-translation step that replaces hierarchy terms using a dictionary (required when the translator does not support a glossary)

## Pre-translate text

In [None]:
def pretranslate(text, dictionary):
    # Replace each untranslated hierarchy term with its upper-cased translated label
    for key, value in dictionary.items():
        text = re.sub(f"\\b{value['re_untranslated']}", key.upper(), text, flags=re.IGNORECASE)
    return text

# Example usage
sentence = "Op ten hoogste 10% van de totale oppervlakte van de constructieonderdelen van elke afzonderlijke ruimte waardoor geen beschermde vluchtroute voert, zijn de in de artikelen 4.43 en 4.45a, eerste en tweede lid, bedoelde eisen aan de rookklasse niet van toepassing."
pretranslate(sentence, dict_hier_elements)

'Op ten hoogste 10% van de totale oppervlakte van de constructieonderdelen van elke afzonderlijke ruimte waardoor geen beschermde vluchtroute voert, zijn de in de ARTICLE 4.43 en 4.45a, eerste en tweede SUB-ARTICLE, bedoelde eisen aan de rookklasse niet van toepassing.'
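
The structure of `dict_hier_elements` is defined earlier in the notebook. As a self-contained illustration, the sketch below uses a hypothetical stand-in dictionary (`toy_hier_elements`) with assumed regex patterns, so the patterns are not necessarily the ones the notebook uses.

```python
import re

# Hypothetical stand-in for dict_hier_elements: each key is the English label and
# 're_untranslated' is a regex for the Dutch term (both patterns are assumptions)
toy_hier_elements = {
    "article": {"re_untranslated": r"artikel(en)?\b"},
    "sub-article": {"re_untranslated": r"lid\b"},
}

def pretranslate_sketch(text, dictionary):
    """Replace each Dutch hierarchy term with its upper-cased English label."""
    for key, value in dictionary.items():
        pattern = r"\b" + value["re_untranslated"]
        text = re.sub(pattern, key.upper(), text, flags=re.IGNORECASE)
    return text

example = "In afwijking van artikel 4.43, eerste lid, geldt de eis niet."
result = pretranslate_sketch(example, toy_hier_elements)
print(result)
```

The upper-cased labels stand out in the text and can be recognized again after the machine translation step.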

In [None]:
# #use this if secondary filtering happens afterwards:
# df_regulations["text_translated"] = [pretranslate(sentence, dict_hier_elements) for sentence in df_regulations["text_original"]]
# df_regulations[:10]

#otherwise, use this:
df_subset["text_translated"] = [pretranslate(sentence, dict_hier_elements) for sentence in df_subset["text_original"]]
df_subset[:10]

line,type,full_match,label,id,title,URL,code,text_original,text_translated
1970,MAINLVL,Artikel 4.38. (stookplaats),Artikel,4.38,(stookplaats),https://wetten.overheid.nl/BWBR0041297/2024-08...,C4_S4_2_P4_2_6_A4_38,Materiaal ter plaatse van of nabij een stookpl...,Materiaal ter plaatse van of nabij een stookpl...
1975,SUBLVLI,1.\t,Lid,1.0,"(schacht, koker of kanaal)",https://wetten.overheid.nl/BWBR0041297/2024-08...,C4_S4_2_P4_2_6_A4_39_SUB1,Materiaal toegepast aan de binnenzijde van een...,Materiaal toegepast aan de binnenzijde van een...
1976,SUBLVLI,2.\t,Lid,2.0,"(schacht, koker of kanaal)",https://wetten.overheid.nl/BWBR0041297/2024-08...,C4_S4_2_P4_2_6_A4_39_SUB2,Het eerste lid is niet van toepassing op:\n\ta...,Het eerste SUB-ARTICLE is niet van toepassing ...
2030,SUBLVLI,1.\t,Lid,1.0,(binnenoppervlak),https://wetten.overheid.nl/BWBR0041297/2024-08...,C4_S4_2_P4_2_7_A4_43_SUB1,Een zijde van een constructieonderdeel die gre...,Een zijde van een constructieonderdeel die gre...
2031,SUBLVLI,2.\t,Lid,2.0,(binnenoppervlak),https://wetten.overheid.nl/BWBR0041297/2024-08...,C4_S4_2_P4_2_7_A4_43_SUB2,In afwijking van het eerste lid geldt de eis a...,In afwijking van het eerste SUB-ARTICLE geldt ...
2032,SUBLVLI,3.\t,Lid,3.0,(binnenoppervlak),https://wetten.overheid.nl/BWBR0041297/2024-08...,C4_S4_2_P4_2_7_A4_43_SUB3,In afwijking van het eerste lid voldoet het be...,In afwijking van het eerste SUB-ARTICLE voldoe...
2037,SUBLVLI,1.\t,Lid,1.0,(buitenoppervlak),https://wetten.overheid.nl/BWBR0041297/2024-08...,C4_S4_2_P4_2_7_A4_44_SUB1,Een zijde van een constructieonderdeel die gre...,Een zijde van een constructieonderdeel die gre...
2038,SUBLVLI,2.\t,Lid,2.0,(buitenoppervlak),https://wetten.overheid.nl/BWBR0041297/2024-08...,C4_S4_2_P4_2_7_A4_44_SUB2,Het deel van een zijde van een constructieonde...,Het deel van een zijde van een constructieonde...
2039,SUBLVLI,3.\t,Lid,3.0,(buitenoppervlak),https://wetten.overheid.nl/BWBR0041297/2024-08...,C4_S4_2_P4_2_7_A4_44_SUB3,Een zijde van een constructieonderdeel die gre...,Een zijde van een constructieonderdeel die gre...
2040,SUBLVLI,4.\t,Lid,4.0,(buitenoppervlak),https://wetten.overheid.nl/BWBR0041297/2024-08...,C4_S4_2_P4_2_7_A4_44_SUB4,Het eerste tot en met derde lid zijn niet van ...,Het eerste tot en met derde SUB-ARTICLE zijn n...


## Translation & data cleaning

This may take a while, because each row is sent to the translator in a separate request.

In [None]:
def translate_text(text, source_language):
    try:
      if len(text) < 5000:
        # Rudimentary workaround: mask the period in article codes so that it is not treated as a decimal separator
        masked_text = text.replace("4.", "4_")
        translated_text = GoogleTranslator(source=source_language, target="en").translate(masked_text)
        # Restore the periods in the article codes
        translated_text = translated_text.replace("4_", "4.")
        return translated_text
      else:
        # Handle text exceeding the limit (e.g., split and translate in chunks)
        print(f"Warning: Text exceeding 5000 characters, skipping translation. Text length: {len(text)} ({str(text)[:100]})...")
        return ""
    except Exception as error:
      # Catch translation failures without swallowing KeyboardInterrupt
      print(f"Error translating the following text: {str(text)[:100]}... ({error})")
      return ""

# Perform translation
l_translated = []
n = 0

# #use this if secondary filtering happens afterwards:
# for i in df_regulations.index:
#   translated_text = translate_text(df_regulations["text_translated"][i], doc_lang)
#   translated_text = translated_text.replace("\u200b", "") # remove zero width spaces
#   translated_text = re.sub(r'\s*-\s*', '-', translated_text) # fix problems with splitting words in case of hyphens
#   translated_text = re.sub(r'\t+[ ]*(.\.)\s*(\w)', lambda x: f"\t{x[1]} {x[2].upper()}", translated_text) # fix problems with multiple spaces, mostly in enumerations
#   l_translated.append(translated_text)
#   n+=1
#   print(f"{i:<5} {n}/{len(df_regulations)}  {str(translated_text)[:50]}...")

#otherwise, use this:
for i in df_subset.index:
  translated_text = translate_text(df_subset["text_translated"][i], doc_lang)
  translated_text = translated_text.replace("\u200b", "") # remove zero width spaces
  translated_text = re.sub(r'\s*-\s*', '-', translated_text) # fix problems with splitting words in case of hyphens
  translated_text = re.sub(r'\t+[ ]*(.\.)\s*(\w)', lambda x: f"\t{x[1]} {x[2].upper()}", translated_text) # fix problems with multiple spaces, mostly in enumerations
  l_translated.append(translated_text)

  n+=1
  print(f"{i:<5} {n}/{len(df_subset)}  {str(translated_text)[:50]}...")

1970  1/85  Material at the location of or near a fireplace me...
1975  2/85  Material applied to the inside of a shaft, a tube ...
1976  3/85  The first sub-article does not apply to:
	a. A sha...
2030  4/85  One side of a construction part that borders the i...
2031  5/85  Contrary to the first sub-article, the requirement...
2032  6/85  Contrary to the first sub-article, the movable par...
2037  7/85  One side of a construction part that borders the o...
2038  8/85  The part of one side of a construction part that b...
2039  9/85  One side of a construction part that borders on th...
2040  10/85  The first to third sub-article do not apply to the...
2041  11/85  Contrary from the first to third sub-article, a do...
2043  12/85  In deviation from article 4.43, the top of a floor...
2044  13/85  Contrary to the article 4.44, a top of a floor, a ...
2046  14/85  Contrary to article 4.43 applies to an electric pi...
2049  15/85  Contrary to article 4.43, the indoor air applies t...
2052
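
`translate_text` skips any text of 5000 characters or more. One possible way to handle such texts, as hinted at in the code comment, is to split them into smaller chunks first. The sketch below is an assumption, not part of the notebook: the helper `split_into_chunks` is hypothetical and packs whole lines greedily into chunks below the limit, cutting a line only when that single line is already too long.

```python
def split_into_chunks(text, limit=5000):
    """Greedily pack lines into chunks shorter than `limit` characters."""
    chunks, current = [], ""
    for line in text.splitlines(keepends=True):
        # A single line that is too long is cut hard as a last resort
        while len(line) >= limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(line[:limit - 1])
            line = line[limit - 1:]
        # Start a new chunk when the line no longer fits in the current one
        if len(current) + len(line) >= limit:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks

# Small limit to keep the demonstration readable
long_text = "a" * 30 + "\n" + "b" * 30 + "\n" + "c" * 120
chunks = split_into_chunks(long_text, limit=50)
print([len(chunk) for chunk in chunks])
```

Each chunk could then be passed to `translate_text` and the results joined again. Translating chunks separately may cost some context at the chunk boundaries, which is why the split follows line breaks where possible.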

In [None]:
# Update dataframe
df_subset["text_translated"] = l_translated
df_subset["text_translated"][10:]

line,text_translated
2041,"Contrary from the first to third sub-article, ..."
2043,"In deviation from article 4.43, the top of a f..."
2044,"Contrary to the article 4.44, a top of a floor..."
2046,Contrary to article 4.43 applies to an electri...
2049,"Contrary to article 4.43, the indoor air appli..."
...,...
2494,"Contrary to the first sub-article, a door, a w..."
2495,At most 5% of the total area of the constructi...
2496,The first to third sub-article do not apply to...
2498,"A roof of a fire compartment is, insofar as th..."


In [None]:
re.sub(r'(\t)+', ' ', translated_text)

'The requirement of the first sub-article does not apply to a maximum of 5% of the surface area of the roof.'

## Data cleaning explanation

These fixes are already implemented in the translation loop above.

Most of the problems and solutions below are specific to this use case.

**PROBLEM:** Compound words that the translator hyphenates are returned with stray spaces around the hyphen:

In [None]:
print("Before translation:")
print("Een bedgebied ligt in een beschermd subbrandcompartiment.")
print("After translation:")
print("A bed area is located in a protected sub -fire compartment.")

Before translation:
Een bedgebied ligt in een beschermd subbrandcompartiment.
After translation:
A bed area is located in a protected sub -fire compartment.


**SOLUTION:** Replace `sub -fire` with `sub-fire` after translation

In [None]:
print(re.sub(r'\s*-\s*', '-', "A bed area is located in a protected sub -fire compartment."))

A bed area is located in a protected sub-fire compartment.
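
The pattern `\s*-\s*` removes whitespace around every hyphen, so a spaced dash such as `5 - 10` is also collapsed. If that matters for other documents, a stricter variant could join only when whitespace appears on one side of the hyphen. The sketch below compares both; `join_hyphens_strict` is a hypothetical alternative and is not used in this notebook.

```python
import re

def join_hyphens(text):
    """Pattern used in the notebook: drop all whitespace around every hyphen."""
    return re.sub(r"\s*-\s*", "-", text)

def join_hyphens_strict(text):
    """Hypothetical variant: join only when whitespace is on exactly one side."""
    return re.sub(r"(?<=\w)\s+-(?=\w)|(?<=\w)-\s+(?=\w)", "-", text)

samples = [
    "a protected sub -fire compartment",
    "a protected sub- fire compartment",
    "a range of 5 - 10 m2",
    "according to NEN-EN 13501-1",
]

for sample in samples:
    print(join_hyphens(sample), "|", join_hyphens_strict(sample))
```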


**PROBLEM:** Periods in article codes are sometimes interpreted as thousands separators:

In [None]:
print("Before translation:")
print("Als voor een woonfunctie in tabel 4.174 regels zijn aangewezen, wordt voor die woonfunctie aan het eerste lid voldaan door naleving van die regels.")
print("After translation:")
print("If 4,174 rules are designated for a residential function, the first sub-article is met for that residential function by compliance with those rules.")

Before translation:
Als voor een woonfunctie in tabel 4.174 regels zijn aangewezen, wordt voor die woonfunctie aan het eerste lid voldaan door naleving van die regels.
After translation:
If 4,174 rules are designated for a residential function, the first sub-article is met for that residential function by compliance with those rules.


**SOLUTION:** Replace `4.321` with `4_321` before translation, and change back afterwards:

In [None]:
print("Als voor een woonfunctie in tabel 4.174 regels zijn aangewezen, wordt voor die woonfunctie aan het eerste lid voldaan door naleving van die regels.".replace('4.', '4_'))

Als voor een woonfunctie in tabel 4_174 regels zijn aangewezen, wordt voor die woonfunctie aan het eerste lid voldaan door naleving van die regels.


In [None]:
print("If 4,174 rules are designated for a residential function, the first sub-article is met for that residential function by compliance with those rules.".replace('4,', '4.').replace('4_', '4.'))

If 4.174 rules are designated for a residential function, the first sub-article is met for that residential function by compliance with those rules.
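
The `replace('4.', '4_')` workaround only protects codes whose first number ends in 4. For documents that reference other chapters, a regex-based variant could protect any code of the form `number.number`. The sketch below is a generalization under that assumption; `protect_codes` and `restore_codes` are hypothetical helpers, and the approach still relies on the translator leaving the underscore untouched.

```python
import re

def protect_codes(text):
    """Swap the period in codes such as 4.174 or 4.45a for an underscore."""
    return re.sub(r"\b(\d+)\.(\d+[a-z]?)\b", r"\1_\2", text)

def restore_codes(text):
    """Undo protect_codes after translation."""
    return re.sub(r"\b(\d+)_(\d+[a-z]?)\b", r"\1.\2", text)

dutch = "Zie tabel 4.174 en artikel 4.45a, eerste lid."
protected = protect_codes(dutch)
print(protected)
print(restore_codes(protected))
```

Only the first period of a three-level code such as `4.2.6` is replaced by this pattern, so such codes would need a broader pattern.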


**PROBLEM:** The translated text sometimes contains tabs, which can cause the NLP model to interpret the sentence incorrectly

**SOLUTION:** Replace each run of consecutive tabs with a single space:

In [None]:
print(re.sub(r'(\t)+', ' ', """The first subarticle does not apply to:
	a. A toilet room;
	b.	a bathroom;
	c.	a lift shaft, if the construction parts on the inside of the shaft meet fire class B and smoking class S2, both determined according to NEN-EN 13501-1; and
	d.	A technical room with a use area of no more than 50 m2 not intended for one or more combustion devices with a total nominal load of more than 130 kW."""))

The first subarticle does not apply to:
 a. A toilet room;
 b. a bathroom;
 c. a lift shaft, if the construction parts on the inside of the shaft meet fire class B and smoking class S2, both determined according to NEN-EN 13501-1; and
 d. A technical room with a use area of no more than 50 m2 not intended for one or more combustion devices with a total nominal load of more than 130 kW.


**PROBLEM:** Enumeration items do not always start with a capital letter, which causes the NLP model not to treat them as separate clauses

**SOLUTION:** Capitalize the first word of each item. Combined with the whitespace cleanup above, the substitution becomes:

In [None]:
s = """The first subarticle does not apply to:
	a. A toilet room;
	b.	a bathroom;
	c.	a lift shaft, if the construction parts on the inside of the shaft meet fire class B and smoking class S2, both determined according to NEN-EN 13501-1; and
	d.	A technical room with a use area of no more than 50 m2 not intended for one or more combustion devices with a total nominal load of more than 130 kW."""

# print(re.sub(r'([\:;]\n)\s*(.\.)\s*(\w)', lambda x: x[1]+x[2]+" "+x[3].upper(), s))
print(re.sub(r'\t+[ ]*(.\.)\s*(\w)', lambda x: f"\t{x[1]} {x[2].upper()}", s))

The first subarticle does not apply to:
	a. A toilet room;
	b. A bathroom;
	c. A lift shaft, if the construction parts on the inside of the shaft meet fire class B and smoking class S2, both determined according to NEN-EN 13501-1; and
	d. A technical room with a use area of no more than 50 m2 not intended for one or more combustion devices with a total nominal load of more than 130 kW.
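
The three post-translation fixes can also be bundled into one helper, which keeps the translation loop short and makes the cleanup testable on its own. This is a minimal sketch; the name `clean_translation` is hypothetical, while the substitutions are the ones used in the translation loop.

```python
import re

def clean_translation(text):
    """Apply the post-translation fixes in the same order as the translation loop."""
    # Remove zero width spaces
    text = text.replace("\u200b", "")
    # Remove stray whitespace around hyphens
    text = re.sub(r"\s*-\s*", "-", text)
    # Normalize enumeration items: one tab, the label, one space, capitalized first word
    text = re.sub(r"\t+[ ]*(.\.)\s*(\w)", lambda m: f"\t{m[1]} {m[2].upper()}", text)
    return text

raw = "Applies to:\n\ta.\ta sub -fire\u200b compartment;\n\tb. A bathroom."
cleaned = clean_translation(raw)
print(cleaned)
```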


# Export outputs

Export the following dataframes as CSV:

*   `df_hier_elements` (containing information about all hierarchy elements)
*   `df_subset` (containing information about the current subset)

The export of `df_hier_elements` is optional and commented out by default:

In [None]:
# df_hier_elements.to_csv(file_path+'output/BBL_hier_elements.csv', index=True)

In [None]:
df_subset.to_csv(file_path+'output/BBL_subset.csv', index=True)