# NLP Project Sectorlense Vertragschecker

**Project Description**

Reviewing software contracts is often a complex and error-prone task, particularly when
assessing standardized requirements and identifying potential risks. Manual contract review
can be time-consuming, leading to inconsistencies and oversights. To address this challenge,
the project aims to develop an LLM-based contract checker that automates the review
process. By leveraging predefined checklists and legal standards, the system will
systematically analyze contracts, ensuring that required clauses are present while also
detecting critical or unusual formulations. This will streamline contract evaluation and
facilitate structured risk assessment, reducing both time and effort for legal professionals
and businesses.

The contract checker will incorporate three primary functionalities. A standard compliance
check will verify whether contracts include the necessary clauses and whether they adhere to
established legal and business standards. Assessment based on standardized criteria will
evaluate key contractual aspects to ensure completeness and compliance. Risk identification
will highlight non-standard, ambiguous, or high-risk clauses, enabling users to assess their
appropriateness compared to standard contract terms. Additionally, an optional risk
detection feature could be introduced to flag further potential risks that may not be explicitly
covered in the predefined checklist.
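
To make the compliance check more concrete, here is a minimal sketch of a checklist-driven presence check. It is a simple keyword baseline, not the LLM-based check the project targets. The checklist items, the patterns, and the names `ChecklistItem` and `check_contract` are illustrative assumptions and not part of the existing prototype.

```python
import re
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    """One required clause and patterns hinting at its presence (hypothetical)."""
    name: str
    patterns: list[str]


# Hypothetical checklist; a real one would come from the predefined legal standards.
CHECKLIST = [
    ChecklistItem("Liability", [r"\bliabilit(y|ies)\b", r"\bHaftung\b"]),
    ChecklistItem("Termination", [r"\bterminat(e|ion)\b", r"\bKündigung\b"]),
    ChecklistItem("Data protection", [r"\bdata protection\b", r"\bDatenschutz\b"]),
]


def check_contract(text: str, checklist: list[ChecklistItem]) -> dict[str, bool]:
    """Return, per checklist item, whether any of its patterns occurs in the text."""
    return {
        item.name: any(re.search(p, text, flags=re.IGNORECASE) for p in item.patterns)
        for item in checklist
    }


sample = "§ 7 Haftung: The provider's liability is limited. § 9 Kündigung ..."
report = check_contract(sample, CHECKLIST)
missing = [name for name, found in report.items() if not found]
print(missing)
```

A keyword match only shows that a topic is mentioned; judging whether the clause meets a standard is the part left to the language model.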

The final deliverable will be a web application that enables users to upload contract
documents and receive an automated structured review including insights on compliance
and risk factors. This application will provide detailed feedback, highlight critical sections,
and suggest improvements, making contract review more efficient and reliable.
Development will build upon an existing prototype that includes both a frontend and basic
functionality, allowing for enhancements in accuracy, usability, and scalability.

**Milestones**:

Milestone 1: Understanding existing prototype and defining key requirements (Week 1-2)

Milestone 2: Developing/improving NLP-based contract analysis model (Week 3-6)

Milestone 3: Integration into the web application (Week 7-8)

Milestone 4: Testing and evaluation with real-world contracts (Week 9-10)

Milestone 5: Final presentation and documentation (Week 11-12)

**Data**

Contract documents in various formats (PDF, DOCX, TXT). Predefined checklists and legal standards.

In [1]:
#Imported libraries
import pdfplumber
import docx
import json
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os
import re

# Reading in Contract Documents

The contract checker built in this project needs to be trained and tested on real-world example contracts. For this purpose, Sectorlense provided us with an Excel sheet listing various providers of SaaS solutions, together with links to their websites where sample contracts are available.

These contracts come in various formats: some are HTML pages, others are PDF, DOCX, or JSON files.

To automate the collection of contracts, our first approach was to build an automated scraping tool for each file format.
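
Before a reader can be chosen, each entry in the provider list has to be classified by format. The following is a minimal sketch, assuming that entries are either `http(s)` links or file names with a telling extension. The function name `classify_source` and the example entries are hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# File extensions mapped to the format labels used in this notebook.
EXTENSION_FORMATS = {".pdf": "PDF", ".docx": "DOCX", ".json": "JSON", ".txt": "TXT"}


def classify_source(entry: str) -> str:
    """Classify a link or file name as PDF, DOCX, JSON, TXT, HTML, or UNKNOWN."""
    entry = entry.strip()
    parsed = urlparse(entry)
    is_url = parsed.scheme in ("http", "https")
    # For URLs, only the path part carries the extension (query strings are ignored).
    path = parsed.path if is_url else entry
    suffix = PurePosixPath(path).suffix.lower()
    if suffix in EXTENSION_FORMATS:
        return EXTENSION_FORMATS[suffix]
    # A web link without a known document extension is treated as an HTML page.
    return "HTML" if is_url else "UNKNOWN"


for entry in ["https://example.com/agb", "https://example.com/terms.pdf?v=2", "contract.DOCX"]:
    print(entry, "->", classify_source(entry))
```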

## Scraping HTML
We started by creating a scraping tool for HTML websites. We soon realised that this would not be as easy as expected: every website structures its pages differently, so each one needs its own scraping rules.

Nevertheless, we proceeded and tried to build a separate scraping function for each of the provided websites that seemed relevant to us.

The following code shows the scraping functions for the different kinds of websites. At the end is a chooser function that selects the scraping function to use based on the link provided.
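
The selection step can also be looked at in isolation. The chooser in the following cell, `scrape_contract_auto()`, matches substrings anywhere in the URL, so a link such as `https://example.com/?ref=commonpaper.com` would be routed to the CommonPaper scraper as well. Here is a minimal sketch of a stricter variant that compares the parsed hostname against a lookup table. `SCRAPER_BY_DOMAIN` and `choose_scraper` are hypothetical names, and the function returns only the scraper's name so that the sketch runs without network access.

```python
from urllib.parse import urlparse

# Domain -> scraper name; mirrors the branches of scrape_contract_auto().
SCRAPER_BY_DOMAIN = {
    "commonpaper.com": "scrape_html_commonpaper",
    "fakturia.de": "scrape_html_fakturia",
    "mitratech.com": "scrape_html_mitratech",
    "alyne.com": "scrape_html_mitratech",
}


def choose_scraper(url: str) -> str:
    """Return the name of the scraper responsible for the URL's host."""
    host = (urlparse(url).hostname or "").lower()
    for domain, scraper_name in SCRAPER_BY_DOMAIN.items():
        # Match the domain itself or any subdomain, e.g. www.fakturia.de.
        if host == domain or host.endswith("." + domain):
            return scraper_name
    return "scrape_html_standard"


print(choose_scraper("https://www.fakturia.de/agb"))
print(choose_scraper("https://example.com/?ref=commonpaper.com"))
```

In the notebook, the table values could be the scraper functions themselves instead of their names. Links without a scheme have no parsable hostname and would fall through to the standard scraper.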

In [5]:
import re
import requests
from bs4 import BeautifulSoup

# 1. Scraper for standard HTML contracts
def scrape_html_standard(url):
    try:
        headers = {
            "User-Agent": (
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/122.0.0.0 Safari/537.36"
            )
        }
        # A timeout keeps the call from hanging on unresponsive servers
        response = requests.get(url, headers=headers, timeout=30)
        response.encoding = 'utf-8'
        response.raise_for_status()

        soup = BeautifulSoup(response.text, "html.parser")
        for tag in soup(["script", "style", "header", "footer", "nav"]):
            tag.decompose()

        main_content = soup.find("div", class_="single-content") or soup
        raw_text = main_content.get_text(separator=" ", strip=True)
        full_text = re.sub(r'\s+', ' ', raw_text)

        start_patterns = [r"§\s?\d+", r"1\.\s+[^\n\.]+"]
        for pattern in start_patterns:
            match = re.search(pattern, full_text)
            if match:
                full_text = full_text[match.start():]
                break

        end_markers = [
            "Die eingetragene Marke MOCO", "Stand 12/2024", "Ort, Datum",
            "Unterschrift", "Impressum", "©", "Nachtrag Australische spezifische Begriffe"
        ]
        cutoff = int(len(full_text) * 0.7)
        positions = {m: full_text.find(m) for m in end_markers if full_text.find(m) > cutoff}
        if positions:
            full_text = full_text[:min(positions.values())]

        return full_text.strip()

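    except Exception as e:
        # Report the failure instead of silently returning an empty string
        print(f"Fehler beim Scrapen Standard: {e}")
        return ""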


# 2. Scraper for CommonPaper contracts
def scrape_html_commonpaper(url):
    try:
        # A timeout keeps the call from hanging on unresponsive servers
        response = requests.get(url, timeout=30)
        response.raise_for_status()

        soup = BeautifulSoup(response.text, "html.parser")
        content = soup.find("div", class_="entry-content")
        if not content:
            print(f"⚠️ CommonPaper: Kein Hauptbereich gefunden – {url}")
            return ""

        result = []

        def walk_list(ol, prefix=""):
            items = ol.find_all("li", recursive=False)
            for idx, li in enumerate(items, 1):
                number = f"{prefix}.{idx}" if prefix else str(idx)
                li_copy = BeautifulSoup(str(li), "html.parser")
                for sublist in li_copy.find_all("ol"):
                    sublist.decompose()
                text = li_copy.get_text(separator=" ", strip=True)
                result.append(f"{number}. {text}")

                sub_ol = li.find("ol")
                if sub_ol:
                    walk_list(sub_ol, number)

        top_ol = content.find("ol")
        if top_ol:
            walk_list(top_ol)
            print(f"🏁 CommonPaper: {len(result)} Punkte extrahiert.")
        else:
            print("⚠️ Keine <ol> gefunden!")

        return "\n".join(result)

    except Exception as e:
        print(f"Fehler beim Scrapen CommonPaper: {e}")
        return ""


# 3. Scraper for Fakturia contracts
def scrape_html_fakturia(url):
    try:
        headers = {
            "User-Agent": (
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/122.0.0.0 Safari/537.36"
            )
        }
        # A timeout keeps the call from hanging on unresponsive servers
        response = requests.get(url, headers=headers, timeout=30)
        response.raise_for_status()

        soup = BeautifulSoup(response.text, "html.parser")
        content = soup.find("div", class_="entry-content-wrapper")
        if not content:
            print("⚠️ Fakturia: Kein Hauptbereich gefunden.")
            return ""

        result = []
        section = ""

        for elem in content.find_all(["h2", "p"]):
            text = re.sub(r'\s+', ' ', elem.get_text(separator=" ", strip=True))

            if elem.name == "h2":
                if section:
                    result.append(section.strip())
                section = text + "\n"
            elif elem.name == "p":
                if re.match(r'^\d+\.\d+', text):
                    section += text + " "
                else:
                    section += text + "\n"

        if section:
            result.append(section.strip())

        # Strip the licence footer from the last section (only if a section exists)
        if result:
            for marker in ["Copyright OSB Alliance e.V.", "gemäß CC BY", "Version 1/2015"]:
                if marker in result[-1]:
                    result[-1] = result[-1].split(marker)[0].strip()
                    break

        print(f"🏁 Fakturia: {len(result)} Abschnitte extrahiert.")
        return "\n\n".join(result)

    except Exception as e:
        print(f"Fehler beim Scrapen Fakturia: {e}")
        return ""


# 4. Scraper for Mitratech contracts
def scrape_html_mitratech(url):
    try:
        headers = {
            "User-Agent": (
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/122.0.0.0 Safari/537.36"
            )
        }
        # A timeout keeps the call from hanging on unresponsive servers
        response = requests.get(url, headers=headers, timeout=30)
        response.raise_for_status()

        soup = BeautifulSoup(response.text, "html.parser")
        for tag in soup(["script", "style", "header", "footer", "nav", "form", "noscript"]):
            tag.decompose()

        main = soup.find("main") or soup
        found = False
        blocks = []

        # <ol>/<ul> are left out: their text would repeat the text of their <li> children
        for el in main.find_all(["h1", "h2", "h3", "p", "li"]):
            text = el.get_text(separator=" ", strip=True)
            if not text:
                continue

            if not found and text.startswith("1. Allgemeines"):
                found = True
                blocks.append(text)
                continue

            if found and el.name in ["h1", "h2", "h3"] and "Begriffsbestimmungen" in text:
                break

            if found:
                blocks.append(text)

        return "\n\n".join(blocks).strip()

    except Exception as e:
        print(f"Fehler beim Scrapen Mitratech: {e}")
        return ""


# Automatic selection based on the URL
def scrape_contract_auto(url):
    url_lc = url.lower()

    if "commonpaper.com" in url_lc:
        print("🌐 CommonPaper erkannt – verwende scrape_html_commonpaper()")
        return scrape_html_commonpaper(url)
    elif "fakturia.de" in url_lc:
        print("🌐 Fakturia erkannt – verwende scrape_html_fakturia()")
        return scrape_html_fakturia(url)
    elif "mitratech.com" in url_lc or "alyne.com" in url_lc:
        print("🌐 Mitratech erkannt – verwende scrape_html_mitratech()")
        return scrape_html_mitratech(url)
    else:
        print("🌐 Standardvertrag erkannt – verwende scrape_html_standard()")
        return scrape_html_standard(url)
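
The trimming steps in `scrape_html_standard()` above can be tried on a plain string without fetching a page. Here is a minimal sketch of the same two steps: cut everything before the first section marker, then cut at the earliest end marker in the last 30 % of the text. `trim_contract_text` and the sample text are hypothetical. Unlike the original, the sketch searches from the cutoff position onward, so a marker that also appears early in the text is still found near the end.

```python
import re


def trim_contract_text(text, start_patterns, end_markers, tail_fraction=0.7):
    """Drop text before the first start pattern and after the earliest late end marker."""
    for pattern in start_patterns:
        match = re.search(pattern, text)
        if match:
            text = text[match.start():]
            break

    # Only accept end markers in the tail, so an early mention does not cut the contract.
    cutoff = int(len(text) * tail_fraction)
    positions = [pos for pos in (text.find(m, cutoff) for m in end_markers) if pos != -1]
    if positions:
        text = text[:min(positions)]
    return text.strip()


sample = (
    "Home | Login | § 1 Scope of services ... § 5 Liability ... "
    "§ 9 Final provisions ... Impressum Contact"
)
trimmed = trim_contract_text(sample, [r"§\s?\d+"], ["Impressum", "©"])
print(trimmed)
```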

# Reading in PDF, DOCX and JSON

We then realised that automating the reading process for the remaining formats would not be successful either: because every document differs slightly, we would have had to write a separate function for each one. We therefore stopped pursuing that approach.

Since this would consume a lot of time and is not very efficient, as the HTML example shows, we decided to manually copy the content of all relevant DOCX, PDF, and JSON files into TXT files. TXT files that all share the same format are much easier for us to read in.

This project is about NLP rather than about building automated scraping tools, so we consider this approach reasonable.

**TXT**

In [6]:
# Function for reading .txt files
def read_txt_file(file_path):
    try:
        with open(file_path, "r", encoding="utf-8") as file:
            content = file.read()
        return content
    except Exception as e:
        print(f"Fehler beim Einlesen der Datei: {e}")
        return ""
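
Manually copied files are not always saved as UTF-8, in which case `read_txt_file()` above would print an error and return an empty string. Here is a minimal sketch of a reader that tries several encodings in turn. The fallback list and the name `read_txt_with_fallback` are assumptions, not something the project prescribes.

```python
from pathlib import Path


def read_txt_with_fallback(file_path, encodings=("utf-8-sig", "cp1252")):
    """Return the file's text using the first encoding that decodes it, else ''.

    A missing file raises FileNotFoundError instead of being silently skipped.
    """
    data = Path(file_path).read_bytes()
    for encoding in encodings:
        try:
            # "utf-8-sig" also accepts plain UTF-8 and drops a leading BOM if present.
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    return ""
```

Trying `cp1252` second is a guess based on the Windows-style paths in this project; other sources may need a different fallback.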

**Reading the mapping file**

In [8]:
from pathlib import Path

excel_path = Path("../data/input_mapping/Mappingliste_Verträge.xlsx")
df = pd.read_excel(excel_path)

**Creating the new columns Content and FileType in the DataFrame**

In [9]:
if 'Content' not in df.columns:
    df['Content'] = ""

if 'FileType' not in df.columns:
    df['FileType'] = ""

**Reading TXT files and HTML links into the DataFrame automatically and saving it as a pickle file**

In [12]:
# Base folder for local contract files
base_path = Path("data/verträge/verträge_txt")

# Output folder and file
output_pickle_path = Path("data/data_scraped_input.pkl")

# Iterate over the mapping table
for idx, row in df.iterrows():
    mapping_field = row['Mapping']
    content = ""
    file_type = ""

    if pd.notna(mapping_field):
        # Skip empty entries, e.g. after a trailing semicolon
        mappings = [m.strip() for m in mapping_field.split(';') if m.strip()]
        texts = []

        for i, mapping in enumerate(mappings):
            if mapping.endswith('.txt'):
                filename = Path(mapping).name  # file name only
                filepath = base_path / filename
                texts.append(read_txt_file(filepath))
                if i == 0:
                    file_type = "TXT"
            else:
                texts.append(scrape_contract_auto(mapping))
                if i == 0:
                    file_type = "HTML"

        content = "\n\n".join(texts)

    df.at[idx, 'Content'] = content
    df.at[idx, 'FileType'] = file_type

# Save the results
df.to_pickle(output_pickle_path)

🌐 Standardvertrag erkannt – verwende scrape_html_standard()
Fehler beim Einlesen der Datei: [Errno 2] No such file or directory: 'data\\verträge\\verträge_txt\\Templates_3H_Solutions_AG_18-06_SaaS-Cloudsoftware_Vertrag.txt'
🌐 CommonPaper erkannt – verwende scrape_html_commonpaper()
🏁 CommonPaper: 120 Punkte extrahiert.
Fehler beim Einlesen der Datei: [Errno 2] No such file or directory: 'data\\verträge\\verträge_txt\\SaaS_SAP_Service_Level_Agreement.txt'
Fehler beim Einlesen der Datei: [Errno 2] No such file or directory: 'data\\verträge\\verträge_txt\\Saas_SAP_General_Terms.txt'
Fehler beim Einlesen der Datei: [Errno 2] No such file or directory: 'data\\verträge\\verträge_txt\\SaaS_SAP_Support_Scheudle.txt'
Fehler beim Einlesen der Datei: [Errno 2] No such file or directory: 'data\\verträge\\verträge_txt\\Saas_Oracle_Cloud_Services_Vertrag.txt'
Fehler beim Einlesen der Datei: [Errno 2] No such file or directory: 'data\\verträge\\verträge_txt\\Saas_Oracle_Data_Processing_Agreement.txt'

OSError: Cannot save file into a non-existent directory: 'data'
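
The messages above show relative paths such as `data\\verträge\\verträge_txt\\...` being looked up from the current working directory, and the final `OSError` shows that the target folder for the pickle file does not exist there. Here is a minimal sketch of two guards that make such problems visible earlier. The helper names `find_missing_files` and `prepare_output_path` are hypothetical.

```python
from pathlib import Path


def find_missing_files(base_path, filenames):
    """Return the names from `filenames` that do not exist under `base_path`."""
    base_path = Path(base_path)
    return [name for name in filenames if not (base_path / name).is_file()]


def prepare_output_path(output_path):
    """Create the parent directory of `output_path` if needed and return the path."""
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    return output_path
```

Calling `find_missing_files(base_path, names)` before the loop would list all unresolved mappings at once, and `df.to_pickle(prepare_output_path(output_pickle_path))` would avoid the `OSError`. Whether `data/` or `../data/` is the right root depends on where the notebook is started; the mapping file is loaded from `../data/`.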