# NLP Project Sectorlense Contract checker

**Project description**

Reviewing software contracts is often a complex and error-prone task, particularly when
assessing standardized requirements and identifying potential risks. Manual contract review
can be time-consuming, leading to inconsistencies and oversight. To address this challenge,
the project aims to develop an LLM-based contract checker that automates the review
process. By leveraging predefined checklists and legal standards, the system will
systematically analyze contracts, ensuring that required clauses are present while also
detecting critical or unusual formulations. This will streamline contract evaluation and
facilitate structured risk assessment, reducing both time and effort for legal professionals
and businesses.

The contract checker will incorporate three primary functionalities. A standard compliance
check will verify whether contracts include the necessary clauses and if they adhere to
established legal and business standards. Assessment based on standardized criteria will
evaluate key contractual aspects to ensure completeness and compliance. Risk identification
will highlight non-standard, ambiguous, or high-risk clauses, enabling users to assess their
appropriateness compared to standard contract terms. Additionally, an optional risk
detection feature could be introduced to flag further potential risks that may not be explicitly
covered in the predefined checklist.

The final deliverable will be a web application that enables users to upload contract
documents and receive an automated structured review including insights on compliance
and risk factors. This application will provide detailed feedback, highlight critical sections,
and suggest improvements, making contract review more efficient and reliable.
Development will build upon an existing prototype that includes both a frontend and basic
functionality, allowing for enhancements in accuracy, usability, and scalability.

**Milestones**:

Milestone 1: Understanding existing prototype and defining key requirements (Week 1-2)

Milestone 2: Developing/improving NLP-based contract analysis model (Week 3-6)

Milestone 3: Integration into the web application (Week 7-8)

Milestone 4: Testing and evaluation with real-world contracts (Week 9-10)

Milestone 5: Final presentation and documentation (Week 11-12)

**Data**

Contract documents in various formats (PDF, DOCX, TXT). Predefined checklists and legal standards.

In [1]:
# ==============================================================================
#  SYSTEM & ENVIRONMENT
# ==============================================================================
import os
import sys
import ssl
import certifi
import random
import pickle
from pathlib import Path
import time
from dotenv import load_dotenv

# SSL-Config (NLTK, Requests)
ssl._create_default_https_context = lambda: ssl.create_default_context(cafile=certifi.where())

# ==============================================================================
#  DATA HANDLING
# ==============================================================================
import pandas as pd
import numpy as np

# ==============================================================================
#  TEXT PROCESSING & NLP
# ==============================================================================
import string
import re
from itertools import chain

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

import en_core_web_sm

from gensim.parsing.preprocessing import (
    STOPWORDS,
    strip_tags, strip_numeric, strip_punctuation,
    strip_multiple_whitespaces, remove_stopwords,
    strip_short, stem_text
)

from sklearn.feature_extraction.text import CountVectorizer

# ==============================================================================
#  FILE READING & SCRAPING
# ==============================================================================
import pdfplumber
import docx
import requests
from bs4 import BeautifulSoup

# ==============================================================================
#  VISUALIZATION
# ==============================================================================
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
from tqdm import tqdm
import ipywidgets as widgets
from IPython.display import display

# ==============================================================================
#  MACHINE LEARNING / DEEP LEARNING
# ==============================================================================
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from torch.optim import AdamW

from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    classification_report, confusion_matrix, accuracy_score,
    f1_score, recall_score, roc_curve, auc
)
from sklearn.metrics.pairwise import cosine_similarity

# Transformers & Sentence Embeddings
from transformers import (
    BertTokenizer, BertModel,
    AutoTokenizer, AutoModel, AutoConfig
)
from sentence_transformers import SentenceTransformer, models
from sentence_transformers.models import Pooling
import inspect

# ==============================================================================
#  OPENAI API AND JSON HANDLING
# ==============================================================================
from openai import OpenAI
import json

import sys
import os
sys.path.append(os.path.abspath(".."))
from key import OpenAiKey

# ==============================================================================
#  REPRODUCIBILITY
# ==============================================================================
def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


# ==============================================================================
#  Custom Functions and Classes
# ==============================================================================
## scraping and reading files
from functions.function_contract_read_in import (  
    scrape_html_standard
    , scrape_html_commonpaper
    , scrape_html_fakturia
    , scrape_html_mitratech
    , scrape_contract_auto
    , read_txt_file
)
## text processing
from functions.functions_preprocessing import( 
    extract_paragraphs_and_sections
    , extract_title_fixed
    , clean_sections_and_paragraphs
)
## embeddings
from functions.functions_embeddings import add_embed_text_column


# Cosine Mapper
from classes.class_model import CosineMapper
# textlabeldataset
from classes.Class_TextLabelDataset import TextLabelDataset
# Predictor
from classes.class_predictor import SectionTopicPredictor


[nltk_data] Downloading package stopwords to /Users/dave/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# 1. Read in Contracts

The contract checker built in this project needs to be trained and tested on real-world example contracts. For this purpose, Sectorlense provided us with an Excel sheet listing various providers of SaaS solutions, with links to their websites where sample contracts are available.

These contract documents come in various formats: some in HTML, some in PDF, some in DOCX, and some in JSON.

To automate the collection of contracts, our first approach was to build an automated scraping tool for each file format.

## 1.1 Scraping HTML
We started by creating a scraping tool for HTML websites. We soon realised that this would not be as easy as expected: every website is structured differently, so each one needs its own scraping rules.

Nevertheless, we proceeded and built a separate scraping function for each of the provided websites that seemed important to us.

The following code shows the scraping functions for the different kinds of websites. At the end is a chooser function that selects which scraping function to use based on the link provided.
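The selection step can also be sketched as a lookup on the URL's host name. This is a minimal sketch, not the notebook's implementation: the four scraper functions are hypothetical stand-ins that only return a label, the example URLs are made up, and matching on the host name (instead of a substring of the whole URL, as the chooser in this notebook does) is an assumption made for illustration. The domains are the ones the chooser checks.

```python
from urllib.parse import urlparse

# Hypothetical stand-ins for the real scrapers: each only reports which one ran.
def scrape_standard(url):
    return "standard"

def scrape_commonpaper(url):
    return "commonpaper"

def scrape_fakturia(url):
    return "fakturia"

def scrape_mitratech(url):
    return "mitratech"

# Domain -> scraper. The domains are the ones checked by the notebook's chooser.
SCRAPERS_BY_DOMAIN = {
    "commonpaper.com": scrape_commonpaper,
    "fakturia.de": scrape_fakturia,
    "mitratech.com": scrape_mitratech,
    "alyne.com": scrape_mitratech,
}

def choose_scraper(url):
    """Return the scraper registered for the URL's host, else the standard one."""
    host = (urlparse(url).hostname or "").lower()
    for domain, scraper in SCRAPERS_BY_DOMAIN.items():
        # Accept the bare domain as well as any subdomain such as "www.".
        if host == domain or host.endswith("." + domain):
            return scraper
    return scrape_standard

# Made-up example URL; no network request is sent by the stand-in scrapers.
example_url = "https://www.fakturia.de/agb"
print(choose_scraper(example_url)(example_url))
```

Keeping the mapping in a dictionary means a new provider only needs one additional entry, at the cost of no longer matching domains that appear elsewhere in the URL.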

In [None]:

# # 1. Scraper for standard HTML contracts
# def scrape_html_standard(url):
#     try:
#         headers = {
#             "User-Agent": (
#                 "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
#                 "AppleWebKit/537.36 (KHTML, like Gecko) "
#                 "Chrome/122.0.0.0 Safari/537.36"
#             )
#         }
#         response = requests.get(url, headers=headers)
#         response.encoding = 'utf-8'
#         response.raise_for_status()

#         soup = BeautifulSoup(response.text, "html.parser")
#         for tag in soup(["script", "style", "header", "footer", "nav"]):
#             tag.decompose()

#         main_content = soup.find("div", class_="single-content") or soup
#         raw_text = main_content.get_text(separator=" ", strip=True)
#         full_text = re.sub(r'\s+', ' ', raw_text)

#         start_patterns = [r"§\s?\d+", r"1\.\s+[^\n\.]+"]
#         for pattern in start_patterns:
#             match = re.search(pattern, full_text)
#             if match:
#                 full_text = full_text[match.start():]
#                 break

#         end_markers = [
#             "Die eingetragene Marke MOCO", "Stand 12/2024", "Ort, Datum",
#             "Unterschrift", "Impressum", "©", "Nachtrag Australische spezifische Begriffe"
#         ]
#         cutoff = int(len(full_text) * 0.7)
#         positions = {m: full_text.find(m) for m in end_markers if full_text.find(m) > cutoff}
#         if positions:
#             full_text = full_text[:min(positions.values())]

#         return full_text.strip()

#     except Exception:
#         return ""


# # 2. Scraper for CommonPaper contracts
# def scrape_html_commonpaper(url):
#     try:
#         response = requests.get(url)
#         response.raise_for_status()

#         soup = BeautifulSoup(response.text, "html.parser")
#         content = soup.find("div", class_="entry-content")
#         if not content:
#             print(f"⚠️ CommonPaper: No main content area found – {url}")
#             return ""

#         result = []

#         def walk_list(ol, prefix=""):
#             items = ol.find_all("li", recursive=False)
#             for idx, li in enumerate(items, 1):
#                 number = f"{prefix}.{idx}" if prefix else str(idx)
#                 li_copy = BeautifulSoup(str(li), "html.parser")
#                 for sublist in li_copy.find_all("ol"):
#                     sublist.decompose()
#                 text = li_copy.get_text(separator=" ", strip=True)
#                 result.append(f"{number}. {text}")

#                 sub_ol = li.find("ol")
#                 if sub_ol:
#                     walk_list(sub_ol, number)

#         top_ol = content.find("ol")
#         if top_ol:
#             walk_list(top_ol)
#         else:
#             print("⚠️ No <ol> found!")

#         return "\n".join(result)

#     except Exception as e:
#         print(f"Error while scraping CommonPaper: {e}")
#         return ""


# # 3. Scraper for Fakturia contracts
# def scrape_html_fakturia(url):
#     try:
#         headers = {
#             "User-Agent": (
#                 "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
#                 "AppleWebKit/537.36 (KHTML, like Gecko) "
#                 "Chrome/122.0.0.0 Safari/537.36"
#             )
#         }
#         response = requests.get(url, headers=headers)
#         response.raise_for_status()

#         soup = BeautifulSoup(response.text, "html.parser")
#         content = soup.find("div", class_="entry-content-wrapper")
#         if not content:
#             print("⚠️ Fakturia: No main content area found.")
#             return ""

#         result = []
#         section = ""

#         for elem in content.find_all(["h2", "p"]):
#             text = re.sub(r'\s+', ' ', elem.get_text(separator=" ", strip=True))

#             if elem.name == "h2":
#                 if section:
#                     result.append(section.strip())
#                 section = text + "\n"
#             elif elem.name == "p":
#                 if re.match(r'^\d+\.\d+', text):
#                     section += text + " "
#                 else:
#                     section += text + "\n"

#         if section:
#             result.append(section.strip())

#         for marker in ["Copyright OSB Alliance e.V.", "gemäß CC BY", "Version 1/2015"]:
#             if marker in result[-1]:
#                 result[-1] = result[-1].split(marker)[0].strip()
#                 break

#         return "\n\n".join(result)

#     except Exception as e:
#         print(f"Error while scraping Fakturia: {e}")
#         return ""


# # 4. Scraper for Mitratech contracts
# def scrape_html_mitratech(url):
#     try:
#         headers = {
#             "User-Agent": (
#                 "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
#                 "AppleWebKit/537.36 (KHTML, like Gecko) "
#                 "Chrome/122.0.0.0 Safari/537.36"
#             )
#         }
#         response = requests.get(url, headers=headers)
#         response.raise_for_status()

#         soup = BeautifulSoup(response.text, "html.parser")
#         for tag in soup(["script", "style", "header", "footer", "nav", "form", "noscript"]):
#             tag.decompose()

#         main = soup.find("main") or soup
#         found = False
#         blocks = []

#         for el in main.find_all(["h1", "h2", "h3", "p", "li", "ol", "ul"]):
#             text = el.get_text(separator=" ", strip=True)
#             if not text:
#                 continue

#             if not found and text.startswith("1. Allgemeines"):
#                 found = True
#                 blocks.append(text)
#                 continue

#             if found and el.name in ["h1", "h2", "h3"] and "Begriffsbestimmungen" in text:
#                 break

#             if found:
#                 blocks.append(text)

#         return "\n\n".join(blocks).strip()

#     except Exception as e:
#         print(f"Error while scraping Mitratech: {e}")
#         return ""


# # Automatic selection based on the URL
# def scrape_contract_auto(url):
#     url_lc = url.lower()
#     if "commonpaper.com" in url_lc:
#         return scrape_html_commonpaper(url)
#     elif "fakturia.de" in url_lc:
#         return scrape_html_fakturia(url)
#     elif "mitratech.com" in url_lc or "alyne.com" in url_lc:
#         return scrape_html_mitratech(url)
#     else:
#         return scrape_html_standard(url)

## 1.2 Reading in PDF, DOCX and JSON

We realised that the files come in many different formats and that automating the reading process would not be very successful: continuing this way, we would have to write a separate function for each document to account for its slight differences. We therefore dropped that approach.

Since this would consume a lot of time and is not very efficient, as the HTML example proved, we decided to manually copy all relevant DOCX, PDF, and JSON files into TXT files. TXT files that all share the same format are much easier for us to read in.

This project is about NLP and not so much about building automated scraping tools, so we consider this approach reasonable.

**TXT**

In [None]:
# # Function for reading in .txt files
# def read_txt_file(file_path):
#     try:
#         with open(file_path, "r", encoding="utf-8") as file:
#             content = file.read()
#         return content
#     except Exception as e:
#         print(f"Error while reading the file: {e}")
#         return ""

**Read mapping**

In [5]:


excel_path = Path("../data/input_mapping/Mappingliste_Verträge.xlsx")
df = pd.read_excel(excel_path)


**Create new Content and FileType columns in the DataFrame**

In [6]:
if 'Content' not in df.columns:
    df['Content'] = ""

if 'FileType' not in df.columns:
    df['FileType'] = ""

**Automatically read TXT files and HTML links into the DataFrame and save them as a pickle file**
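Each `Mapping` cell may list several sources separated by semicolons; entries ending in `.txt` are read as local files, and everything else is treated as a URL to scrape. The sketch below isolates that classification step. The helper name `classify_mapping` and the sample entries are made up for illustration, and skipping empty entries is an added assumption.

```python
def classify_mapping(mapping_field):
    """Split a semicolon-separated mapping cell into (file_type, entry) pairs."""
    # Strip whitespace around each entry and ignore empty ones.
    entries = [m.strip() for m in mapping_field.split(";") if m.strip()]
    # Same rule as the loop in this notebook: ".txt" means a local file,
    # anything else is handled as an HTML link.
    return [("TXT" if e.endswith(".txt") else "HTML", e) for e in entries]

# Made-up sample cell with one local file and one link.
pairs = classify_mapping("contract_a.txt; https://example.com/terms")
print(pairs)
```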

In [7]:
# Base folder for local contract files
base_path = Path("../data/verträge/verträge_txt")

# Iterate over the mapping table
for idx, row in df.iterrows():
    mapping_field = row['Mapping']
    content = ""
    file_type = ""

    if pd.notna(mapping_field):
        mappings = [m.strip() for m in mapping_field.split(';')]
        texts = []

        for i, mapping in enumerate(mappings):
            if mapping.endswith('.txt'):
                filename = Path(mapping).name  # file name only
                filepath = base_path / filename
                texts.append(read_txt_file(filepath))
                if i == 0:
                    file_type = "TXT"
            else:
                texts.append(scrape_contract_auto(mapping))
                if i == 0:
                    file_type = "HTML"

        content = "\n\n".join(texts)

    df.at[idx, 'Content'] = content
    df.at[idx, 'FileType'] = file_type




**Translate English texts**

In [None]:

# load .env for the API key
load_dotenv()

# add the src folder to the path so that translate.py can be imported
sys.path.append(str(Path("..") / "src"))

# import the function
from translate import translate_dataframe
# apply translation to texts with language 'EN'
df_translated = translate_dataframe(df)


**Save finished input files as a pickle file**

In [None]:
# Output folder and file
output_pickle_path = Path("../data/data_scraped_input.pkl")
# Save results
df_translated.to_pickle(output_pickle_path)

# 2. Data cleaning
## 2.1 Data Loading and Initial Structuring

In the first step, we load the data from the pickled dataset produced by the data_script_input. After reading in the contracts, each entry in our custom dataset is enriched with key metadata, including:

- the source of the data
- the document type
- a mapping to the original source website
- the language of the document
- the full contract content
- and the file type

This step ensures that all contracts are consistently structured and traceable back to their origin.

In [2]:
df_all_contracts = pd.read_pickle("../data/data_scraped_input.pkl")
display(df_all_contracts.head())

| | Kategorie | Quelle/Organisation | Dokumententyp | Mapping | Sprache | Content | FileType |
|---|---|---|---|---|---|---|---|
| 0 | Verbände / Templates | IT-Recht Hannover | Muster SaaS-Vertrag | https://it-rechthannover.de/IT-Muster/SaaS-Ver... | DE | § 1 Vertragsgegenstand 1.1 Der Anbieter stellt... | HTML |
| 1 | Verbände / Templates | 3H Solutions AG | Standard-Vertragsbedingungen SaaS | Templates_3H_Solutions_AG_18-06_SaaS-Cloudsoft... | DE | Standard-Vertragsbedingungen\nSaaS- und Clouds... | TXT |
| 2 | Verbände / Templates | Common Paper | Cloud Service Agreement | https://commonpaper.com/standards/cloud-servic... | EN | 1. Service\n1.1. Access and Use. During the Su... | HTML |
| 3 | Öffentlich zugängliche Verträge großer SaaS-An... | SAP | | SaaS_SAP_Service_Level_Agreement.txt | DE | SERVICE-LEVEL-VEREINBARUNG FÜR PRIVATE CLOUD E... | TXT |
| 4 | Öffentlich zugängliche Verträge großer SaaS-An... | SAP | | Saas_SAP_General_Terms.txt | DE | ALLGEMEINE GESCHÄFTSBEDINGUNGEN FÜR CLOUD SERV... | TXT |


**Example of Contract Content:**

In [3]:
df = df_all_contracts
print(df.iloc[13, 5][:1000] + "...")

Vertragsbedingungen SaaS-Vertrag
der TA Triumph-Adler Gruppe (Stand 01/2021)
Vertragsbedingungen SaaS-Vertrag der TA Triumph-Adler Gruppe
(Stand 01/2021) – Seite 1 von 5
1. Vertragsgegenstand, Anwendungsbereich
1.1. Diese „Vertragsbedingungen SaaS-Vertrag TA Triumph-Adler Gruppe“
(„Vertragsbedingungen“) sind Bestandteil des zwischen Auftragnehmer und
Auftraggeber (gemeinsam „Parteien“) abgeschlossenen Software as a
Service-Vertrags („SaaS-Vertrag“).
1.2. Bestandteil des SaaS-Vertrags sind je nach Vereinbarung im SaaS-Vertrag:
a) die entgeltliche Überlassung folgender Objekte:
- Softwareanwendung mittels Internet, soweit keine anderweitige
Telekommunikation ausdrücklich vereinbart wurde („Services“),
- und/oder
- Software („Vertragssoftware“) einschließlich der zugehörigen Beschreibung
der technischen Funktionalität, des Betriebs, der Installation und der Nutzung,
b) die Erbringung von Serviceleistungen an den Services,
c) die Erbringung von Softwarepflege- und -supportleistungen
(„SPS-

## 2.2 Filter and Select Data 
For further data processing, we retain only the content and contract columns, as these contain the essential information for our analysis.

Additionally, we focus exclusively on contracts written in German, since the goal is to develop a German-language contract checker. Filtering by language at this stage ensures consistency and avoids noise from multilingual data.

In [4]:
# filter df to relevant contracts
df_relevant = df[#(df['Kategorie'] == "kleinere SaaS-Anbieter (Hauptgruppe)") & 
                    (df['Sprache'] == "DE") #& (df['Quelle/Organisation'] != "Comarch ERP XT"	)
                    ]
df_relevant = df_relevant.iloc[:,[5]]
df_relevant.columns = ['content']
df_relevant["contract"] = range(1, df_relevant.shape[0] + 1)
df_relevant = df_relevant[['contract', 'content']]


**Example:**

In [5]:

print(df_relevant.head())
with open("../data/data_scraped_input_relevant.pkl", "wb") as f:
    pickle.dump(df_relevant, f)

# Save the DataFrame to an Excel file
df_relevant.to_excel("../data/data_scraped_input_relevant.xlsx", index=False)

   contract                                            content
0         1  § 1 Vertragsgegenstand 1.1 Der Anbieter stellt...
1         2  Standard-Vertragsbedingungen\nSaaS- und Clouds...
3         3  SERVICE-LEVEL-VEREINBARUNG FÜR PRIVATE CLOUD E...
4         4  ALLGEMEINE GESCHÄFTSBEDINGUNGEN FÜR CLOUD SERV...
5         5  SUPPORT SCHEDULE FÜR CLOUD SERVICES\nDieses Su...


## 2.3 Slicing 
Since we aim to analyze individual sections rather than entire contracts, the next step is to split the contract texts into smaller segments. Specifically, we divide each contract into multiple rows, first by paragraphs, and then by subsections within each paragraph. This segmentation makes it possible to process and classify specific parts of the contract more effectively.
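The two-level split can be illustrated with a much smaller function than the one used in this notebook. The sketch below is a simplification under stated assumptions: paragraphs start on a line beginning with `§ N`, subsections start on their own line as `N.M` or `(N)`, and the function name `split_contract` and the sample text are made up. The project's `extract_paragraphs_and_sections` additionally handles `N.` paragraph numbering and enforces consecutive numbering.

```python
import re

def split_contract(text):
    """Split a contract into (paragraph, section, section_text) rows."""
    rows = []
    # Split in front of every line that starts with a paragraph marker "§ N".
    for part in re.split(r"(?m)^(?=§\s*\d+)", text):
        part = part.strip()
        head = re.match(r"§\s*(\d+)", part)
        if not head:
            continue  # empty chunk or preamble before the first paragraph
        para_id = f"§ {head.group(1)}"
        # Subsection markers at the start of a line: "1.1", "1.2", ... or "(1)", "(2)", ...
        marks = list(re.finditer(r"(?m)^(\(\d+\)|\d+\.\d+)\s", part))
        if not marks:
            rows.append((para_id, None, part))  # paragraph without subsections
            continue
        for i, mark in enumerate(marks):
            end = marks[i + 1].start() if i + 1 < len(marks) else len(part)
            rows.append((para_id, mark.group(1), part[mark.start():end].strip()))
    return rows

# Made-up sample contract with both subsection styles.
sample = """§ 1 Vertragsgegenstand
1.1 Der Anbieter stellt die Software bereit.
1.2 Die Nutzung erfolgt über das Internet.
§ 2 Vergütung
(1) Der Kunde zahlt monatlich.
(2) Die Preise sind netto.
"""

rows = split_contract(sample)
for row in rows:
    print(row)
```

Each returned row corresponds to one line of the structured DataFrame built below, with the paragraph marker repeated for every subsection.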

In [6]:
# def extract_paragraphs_and_sections(row, col='content', contract_col='contract', print_steps = False):
#     import re

#     text = row[col]
#     if contract_col is None:
#         contract_id = 1
#     else:
#         contract_id = row[contract_col]
#     lines = text.splitlines()
#     paragraphs = []
#     current_para_lines = []
#     current_para_number = 0
#     current_para_match = None
#     match_pat_type_1 = True
#     match_pat_type_2 = True
#     match_pat_type_3 = True
#     para_mode = None

#     # 1. extract paragraphs

#     for line in lines:
#         line = line.strip()
#         if not line:
#             continue

#         search_for = str(int(current_para_number) + 1)

        
#         if para_mode == "symbol":
#             if search_for:
#                 match_main = re.match(rf'§\s*{search_for}(?!\d)', line)
#         elif para_mode == "number":
#             if search_for:
#                 match_main = re.match(rf'\b{search_for}\.(?!\d)', line)
#         else:
#             # No mode determined yet: try both
#             match_main = re.match(rf'(§\s*(\d+))(?!\d)|\b(\d+)\.(?!\d)', line)
#             if match_main:
#                 if match_main.group(1):  # § X
#                     para_mode = "symbol"
#                 elif match_main.group(3):  # X.
#                     para_mode = "number"
       

#         if match_main:
#             if current_para_lines:
#                 paragraphs.append((current_para_number, ' '.join(current_para_lines), current_para_match))
#             current_para_number = match_main.group(0).strip().lstrip('§').rstrip('.').strip()  # e.g § 2 lorem ipsum --> 2
#             current_para_lines = [line]                                                        # e.g § 2 lorem ipsum --> § 2 lorem impsum
#             current_para_match = match_main.group(0).strip()                                   # e.g § 2 lorem ipsum --> § 2
#         elif current_para_lines:
#             current_para_lines.append(line)

#     if current_para_lines:
#         paragraphs.append((current_para_number, ' '.join(current_para_lines), current_para_match))

#     rows = []
#     seen_sections = set()  # (contract_id, para_num, section_id)

#     for para_num, para_text, para_match in paragraphs:
        
#         if para_mode == "number":
#             matches = list(re.finditer(rf'(?:(?<=\s)|(?<=^))({para_num}\.\d{{1}})(?![\dA-Za-z])|\((\d+)\)', para_text))
#         if para_mode == "symbol":
#             matches = list(re.finditer(rf'(?:(?<=\s)|(?<=^))({para_num}\.\d{{1}})(?![\dA-Za-z])|\((\d+)\)|\b(\d+)\.(?!\d)', para_text))

#         if print_steps:
#             print(para_num)
#             print(seen_sections)
#             print(para_text)
#             print(matches)
            

#         if not matches:
#             rows.append({
#                 'contract': contract_id,
#                 'paragraph': para_match,
#                 'paragraph_content': para_text.strip(),
#                 'section': "no sections use paragraph",
#                 'section_content': para_text.strip()
#             })
#             continue

#         positions = []
#         last_section_number = 0
        

#         for match in matches:
#             # take either the decimal section (e.g. 1.1) or the bracketed section (e.g. (1))
#             section_id = match.group(1) or match.group(2) or match.group(3)
#             start = match.start()

#             # Distinguish the formats
#             if match.group(1) and match_pat_type_1:  # Decimal: e.g. "1.5"
#                 try:
#                     section_suffix = int(section_id.split(".")[1])
#                 except (IndexError, ValueError):
#                     continue  # skip on error
#                 match_pat_type_2 = False # If first pattern type detected only look for this one
#                 match_pat_type_3 = False

#                 # disallow e.g. "1.50"
#                # if re.match(rf'{para_num}\.\d{{2,}}$', section_id):
#                 #    continue

#             elif match.group(2) and match_pat_type_2:  # Bracketed: e.g. "(2)"
#                 try:
#                     section_suffix = int(section_id.strip("()")) 
#                     section_id = f'({section_suffix})'  # uniform format for output
#                 except ValueError:
#                     continue
#                 match_pat_type_1 = False # If second pattern type detected only look for this one
#                 match_pat_type_3 = False
#             elif para_mode == "symbol" and match.group(3) and match_pat_type_3:  # 1. (only in mode=symbol)
#                 if print_steps:
#                     print(section_id)
#                 try:
#                     section_suffix = int(section_id.split(".")[0])
#                     section_id = f'{section_suffix}.'  # for clarity
#                 except ValueError:
#                     continue
#                 match_pat_type_1 = False # If third pattern type detected only look for this one
#                 match_pat_type_2 = False

#             else:
#                 continue  # no valid format

#             # Only allow the next number in sequence
#             if last_section_number != 0 and section_suffix != last_section_number + 1:
#                 continue

#             section_key = (contract_id, para_num, section_id)
#             if section_key in seen_sections:
#                 continue

#             seen_sections.add(section_key)
#             positions.append((start, section_id))
#             last_section_number = section_suffix


#         # Add end position
#         positions.append((len(para_text), None))
#         positions = sorted(positions)
#         if print_steps:
#             print(f'positions = {positions}')
#             print('###########')

#         for i in range(len(positions) - 1):
#             start_pos = positions[i][0]
#             end_pos = positions[i + 1][0]
#             section_id = positions[i][1]
#             section_text = para_text[start_pos:end_pos].strip()

#             rows.append({
#                 'contract': contract_id,
#                 'paragraph': para_match,
#                 'paragraph_content': para_text.strip(),
#                 'section': section_id,
#                 'section_content': section_text
#             })

#     return rows


**New structure**:

In [7]:
df_exploded = df_relevant.apply(extract_paragraphs_and_sections, axis=1)
print(df_exploded.head())
flattened_rows = list(chain.from_iterable(df_exploded))
df_structured = pd.DataFrame(flattened_rows)
print("               \\#########/")
print("                \\#######/")
print("                 \\#####/")
print("                  \\###/")
print("                   \\#/")


display(df_structured[["contract","paragraph","section"]])

0    [{'contract': 1, 'paragraph': '§ 1', 'paragrap...
1    [{'contract': 2, 'paragraph': '§ 1', 'paragrap...
3    [{'contract': 3, 'paragraph': '1.', 'paragraph...
4    [{'contract': 4, 'paragraph': '1.', 'paragraph...
5    [{'contract': 5, 'paragraph': '1.', 'paragraph...
dtype: object
               \#########/
                \#######/
                 \#####/
                  \###/
                   \#/


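| | contract | paragraph | section |
|---|---|---|---|
| 0 | 1 | § 1 | 1.1 |
| 1 | 1 | § 1 | 1.2 |
| 2 | 1 | § 1 | 1.3 |
| 3 | 2 | § 1 | (1) |
| 4 | 2 | § 1 | (2) |
| ... | ... | ... | ... |
| 1371 | 26 | 19. | 19.2 |
| 1372 | 26 | 19. | 19.3 |
| 1373 | 26 | 19. | 19.4 |
| 1374 | 26 | 19. | 19.5 |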


In a second step, we extract paragraph titles directly from the paragraph content. To achieve this, we use regular expressions (regex) that match patterns commonly found at the beginning of legal paragraphs, such as numbered clauses, keywords like "Der", "Ein", or "Eine", or capitalized phrases.

Since this does not always work and quite often no recognizable pattern is found, we apply a fallback strategy: we extract a short snippet from the beginning of the paragraph (e.g., the first few words or up to the first full sentence) to serve as a temporary title.

This ensures that each paragraph receives a consistent and descriptive title, even if the document does not explicitly define one. These titles are useful for labeling, classification, and structuring contract documents for downstream tasks. Furthermore, the paragraph marker and its number are removed from the title.

Since we want the later algorithm to focus on the content rather than on the title, we also remove the title from the paragraph content as well as from the section content.
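The title heuristic and its fallback can be sketched on a single paragraph string. This is a simplified illustration, not the project's `extract_title_fixed`: the helper name `guess_title`, the `MARKER` pattern, and the sample sentences are assumptions. The article pattern and the eight-word fallback follow the single-paragraph branch of the function in this notebook.

```python
import re

# A German article followed by a capitalized noun usually opens the first sentence.
ARTICLE_START = re.compile(r"\b(Der|Die|Das|Es|Ein|Eine)\s+[A-ZÄÖÜ][a-zäöü]+\b")
# Leading paragraph marker such as "§ 3" or "5." (assumed marker formats).
MARKER = re.compile(r"^(§\s*\d+|\d+\.)\s*")

def guess_title(paragraph_text, max_words=8):
    """Return the text before the first sentence, or the first words as fallback."""
    body = MARKER.sub("", paragraph_text.strip())
    match = ARTICLE_START.search(body)
    if match and match.start() > 0:
        return body[:match.start()].strip()
    # Fallback: no recognizable sentence start, so use the first few words.
    return " ".join(body.split()[:max_words])

# Made-up sample paragraphs.
with_title = "§ 3 Vergütung Der Kunde zahlt die vereinbarte Vergütung monatlich im Voraus."
no_pattern = "5. Sonstiges gelten die gesetzlichen Vorschriften ohne weitere Einschränkung für beide Parteien"

title = guess_title(with_title)
fallback = guess_title(no_pattern)
# Remove the title once from the text so later steps see only the content.
content = with_title.replace(title, "", 1)
print(title, "|", fallback, "|", content)
```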

In [8]:
# def extract_title_fixed(group):
#     import re
#     paragraph_text = group['paragraph_content'].iloc[0]
#     section_texts = group['section_content'].tolist()

#     # No Sections (single paragraph)
#     if len(section_texts) == 1 and group['section'].iloc[0] == "no sections use paragraph":
#         # find where the first sentence starts (article followed by a capitalized noun)
#         match = re.search(r'\b(Der|Die|Das|Es|Ein|Eine)\s+[A-ZÄÖÜ][a-zäöü]+\b', paragraph_text)
#         if match:
#             title = paragraph_text[:match.start()].strip()
#         else:
#             # Fallback: use the first 8 words
#             title = ' '.join(paragraph_text.split()[:8])
#         return pd.Series([title] * len(group), index=group.index)

#     # section split
#     for section in section_texts:
#         paragraph_text = paragraph_text.replace(section, '')
#     title = paragraph_text.strip()
#     return pd.Series([title] * len(group), index=group.index)

In [9]:
df_structured['paragraph_title'] = df_structured.groupby(['contract', 'paragraph'], group_keys= False).apply(extract_title_fixed)


df_structured = df_structured[
    ['contract', 'paragraph', 'paragraph_title', 'paragraph_content', 'section', 'section_content']
]


df_structured['paragraph_title'] = df_structured.apply(
    lambda row: row['paragraph_title'].replace(row['paragraph'], '').strip() if pd.notnull(row['paragraph_title']) else '',
    axis=1
)


df_structured["paragraph_content"] = df_structured.apply(
    lambda row: row["paragraph_content"].replace(row['paragraph_title'], '').strip() if pd.notnull(row["paragraph_content"]) else '',
    axis=1
)

df_structured["section_content"] = df_structured.apply(
    lambda row: row["section_content"].replace(row['paragraph_title'], '').strip() if pd.notnull(row["section_content"]) else '',
    axis=1
)





**New structure**:

In [10]:
display(df_structured.head())
output_pickle_path = Path("../data/data_structured.pkl")
# Save results
df_structured.to_pickle(output_pickle_path)
df_structured.to_excel("../data/data_structured.xlsx", index=False)

Unnamed: 0,contract,paragraph,paragraph_title,paragraph_content,section,section_content
0,1,§ 1,Vertragsgegenstand,§ 1 1.1 Der Anbieter stellt dem Kunden die So...,1.1,1.1 Der Anbieter stellt dem Kunden die Softwar...
1,1,§ 1,Vertragsgegenstand,§ 1 1.1 Der Anbieter stellt dem Kunden die So...,1.2,1.2 Die Nutzung umfasst die Bereitstellung von...
2,1,§ 1,Vertragsgegenstand,§ 1 1.1 Der Anbieter stellt dem Kunden die So...,1.3,1.3 Der Kunde erhält ausschließlich das vertra...
3,2,§ 1,Vertragsgegenstand,§ 1 (1) Dieser Software-as-a-Service-Vertrag ...,(1),(1) Dieser Software-as-a-Service-Vertrag ist a...
4,2,§ 1,Vertragsgegenstand,§ 1 (1) Dieser Software-as-a-Service-Vertrag ...,(2),(2) Die Software wird vom Anbieter als webbasi...


## 2.3 Cleaning

In the next step, we focused on cleaning and normalizing the core of the dataset. For this purpose, we implemented a flexible function that allows us to experiment with various cleaning strategies via parameters. These options include:

- Removing all paragraph markers (e.g., “§”, “1.2”)
- Converting all text to lowercase
- Stripping HTML tags
- Removing numbers
- Removing punctuation
- Reducing multiple whitespaces to a single space
- Removing short words (e.g., ≤ 2 characters)
- Removing known stopwords (using Gensim’s stopword library)
- Applying stemming to reduce words to their root form

We tested various combinations of these settings across multiple runs. The best results were achieved with all cleaning steps enabled, except for stemming, which tended to distort meaning too much in our context.

Therefore, we adopted this configuration as our standard cleaning approach going forward.
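To make the effect of the individual options more concrete, here is a dependency-free sketch of a subset of the steps listed above. It is not the notebook's `clean_sections_and_paragraphs` (which relies on Gensim helpers); the function name `clean_text_sketch`, its parameters, and the marker regex are assumptions made for illustration.

```python
import re
import string


def clean_text_sketch(text, to_lower=True, remove_markers=True,
                      remove_numbers=True, remove_punctuation=True,
                      min_word_length=0):
    """Apply a configurable subset of the cleaning steps to one text."""
    if not isinstance(text, str):
        return ""
    if to_lower:
        text = text.lower()
    if remove_markers:
        # Drop one leading marker such as "§ 1", "(3)" or "1.2".
        text = re.sub(r"^\s*(§\s*\d+[a-z]?|\(\d+\)|\d+(\.\d+)*\.?)\s*", "", text)
    if remove_numbers:
        text = re.sub(r"\d+", " ", text)
    if remove_punctuation:
        # Replace ASCII punctuation with spaces so that words do not get glued together.
        table = str.maketrans(string.punctuation, " " * len(string.punctuation))
        text = text.translate(table)
    # Splitting and re-joining collapses the whitespace left behind by earlier steps.
    words = text.split()
    if min_word_length:
        words = [w for w in words if len(w) >= min_word_length]
    return " ".join(words)


print(clean_text_sketch("1.1 Der Anbieter stellt dem Kunden die Software zur Verfügung."))
```

Keeping every step behind its own flag makes it cheap to rerun the pipeline with one option switched off and compare the results.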



In [11]:
df_structured = pd.read_pickle("../data/data_structured.pkl")
display(df_structured.head())

Unnamed: 0,contract,paragraph,paragraph_title,paragraph_content,section,section_content
0,1,§ 1,Vertragsgegenstand,§ 1 1.1 Der Anbieter stellt dem Kunden die So...,1.1,1.1 Der Anbieter stellt dem Kunden die Softwar...
1,1,§ 1,Vertragsgegenstand,§ 1 1.1 Der Anbieter stellt dem Kunden die So...,1.2,1.2 Die Nutzung umfasst die Bereitstellung von...
2,1,§ 1,Vertragsgegenstand,§ 1 1.1 Der Anbieter stellt dem Kunden die So...,1.3,1.3 Der Kunde erhält ausschließlich das vertra...
3,2,§ 1,Vertragsgegenstand,§ 1 (1) Dieser Software-as-a-Service-Vertrag ...,(1),(1) Dieser Software-as-a-Service-Vertrag ist a...
4,2,§ 1,Vertragsgegenstand,§ 1 (1) Dieser Software-as-a-Service-Vertrag ...,(2),(2) Die Software wird vom Anbieter als webbasi...


In [12]:
# def clean_sections_and_paragraphs(
#     text,
#     remove_paragraph_markers=True,
#     to_lower=True,
#     remove_tags=True,
#     remove_numbers=True,
#     remove_punctuation=True,
#     remove_extra_whitespace=True,
#     strip_short_words=False,
#     remove_stopwords =False,
#     apply_stemming=False
# ):
#     if not isinstance(text, str):
#         return ""

#     if to_lower:
#         text = text.lower()

#     if remove_paragraph_markers:
#         # Remove paragraph indicators like "§ 1", "1.", "1.1" etc.
#         text = re.sub(r'^(§?\s*\d+[a-zA-Z]*[.)]?(\s*\(?\d+[.)]?)?)', '', text)
#         text = re.sub(r'\(?\b\d{1,2}(\.\d{1,2})?\)?', '', text)

#     if remove_tags:
#         text = strip_tags(text)

#     if remove_numbers:
#         text = strip_numeric(text)

#     if remove_punctuation:
#         text = strip_punctuation(text)

#     if remove_extra_whitespace:
#         text = strip_multiple_whitespaces(text)

#     if strip_short_words:
#         text = strip_short(text, minsize=3)

#     if remove_stopwords:
#         text = remove_stopwords(text, stopwords=STOPWORDS)  # NOTE: Gensim's STOPWORDS list is English

#     if apply_stemming:
#         text = stem_text(text)

#     return text.strip()


**Applying the function with defaults**:

In [13]:
df_structured["clean_paragraph_content"] = df_structured["paragraph_content"].apply(clean_sections_and_paragraphs)
df_structured["clean_section_content"] = df_structured["section_content"].apply(clean_sections_and_paragraphs)


Example clean paragraph:

In [14]:
print(df_structured["clean_paragraph_content"][0][:150] + "...")

der anbieter stellt dem kunden die software name der software zur verfügung die über eine cloud infrastruktur zugänglich ist die nutzung umfasst die b...


Example clean section:

In [15]:

print(df_structured["clean_section_content"][0][:100] + "...")

der anbieter stellt dem kunden die software name der software zur verfügung die über eine cloud infr...


## 2.4 Visualizing Token Distributions with Word Clouds
To gain a better understanding of the most common words and tokens used across the contracts, at both the section and the paragraph level, we generate several word clouds.

These visualizations will be based on different tokenization stages, including:

- Raw tokenization (including punctuation and original casing),
- Stemming, and
- Lemmatization

By comparing these different views, we aim to identify frequently used legal terms, recurring patterns, and potentially meaningful vocabulary that could inform our downstream tasks such as classification, clustering, or contract clause extraction.

In [16]:
%matplotlib inline
nlp = en_core_web_sm.load()
bert_uncased_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
mlparaphrase_tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")


### paragraphs
df_structured["paragraph_content_stemm"]=df_structured['clean_paragraph_content'].apply(
    lambda text: stem_text(text)
)
df_structured["paragraph_content_lemma"]=df_structured['clean_paragraph_content'].apply(
    lambda text: " ".join([token.lemma_ for token in nlp(text) if not token.is_space])
)
df_structured["paragraph_content_token_bert"]=df_structured['clean_paragraph_content'].apply(
    lambda text: bert_uncased_tokenizer.tokenize(text)
)

df_structured["paragraph_content_token_mlp"]=df_structured['clean_paragraph_content'].apply(
    lambda text: mlparaphrase_tokenizer.tokenize(text)
)

### sections

df_structured["section_content_stemm"]=df_structured['clean_section_content'].apply(
    lambda text: stem_text(text)
)
df_structured["paragraph_section_lemma"]=df_structured['clean_section_content'].apply(
    lambda text: " ".join([token.lemma_ for token in nlp(text) if not token.is_space])
)
df_structured["paragraph_section_token_bert"]=df_structured['clean_section_content'].apply(
    lambda text: bert_uncased_tokenizer.tokenize(text)
)

df_structured["paragraph_section_token_mlp"]=df_structured['clean_section_content'].apply(
    lambda text: mlparaphrase_tokenizer.tokenize(text)
)


df_clean = df_structured.copy()

columns_and_titles = [
    ("paragraph_content_stemm", "Paragraph – Stemmed"),
    ("paragraph_content_lemma", "Paragraph – Lemmatized"),
    ("paragraph_content_token_bert", "Paragraph – BERT Tokens"),
    ("paragraph_content_token_mlp", "Paragraph – Multilingual Paraphrase Tokens"),
    ("section_content_stemm", "Section – Stemmed"),
    ("paragraph_section_lemma", "Section – Lemmatized"),
    ("paragraph_section_token_bert", "Section – BERT Tokens"),
    ("paragraph_section_token_mlp", "Section – Multilingual Paraphrase Tokens"),
]


tab_contents = []

for col, title in columns_and_titles:
    output = widgets.Output()
    with output:
        # Join tokens for token-based columns, else join raw text
        if "token" in col:
            all_text = " ".join([" ".join(tokens) if isinstance(tokens, list) else str(tokens)
                                for tokens in df_structured[col].dropna()])
        else:
            all_text = " ".join(df_structured[col].dropna())

        # Generate word cloud
        wordcloud = WordCloud(width=800, height=400, background_color="white").generate(all_text)

        # Plot
        plt.figure(figsize=(12, 6))
        plt.imshow(wordcloud, interpolation='bilinear')
        plt.axis("off")
        plt.title(f"Wordcloud – {title}", fontsize=16)
        plt.show()
        plt.close() 

        tab_contents.append((title, output))

# Create tabs

tab_widget = widgets.Tab()
tab_widget.children = [out for _, out in tab_contents]
for idx, (name, _) in enumerate(tab_contents):
    tab_widget.set_title(idx, name)

display(tab_widget)





Token indices sequence length is longer than the specified maximum sequence length for this model (1091 > 512). Running this sequence through the model will result in indexing errors


Tab(children=(Output(), Output(), Output(), Output(), Output(), Output(), Output(), Output()), selected_index=…

===> Save cleaned Contract content

In [17]:
file_path = '../data/data_clean.pkl'  
df_clean.to_pickle(file_path)
df_clean.to_excel("../data/data_clean.xlsx", index=False)

# 3. Labeling
## 3.1 Catalog cleaning
In the next step, our goal is to assign labels to as many of our contract sections as possible, based on a predefined requirement catalog.
The requirement catalog consists of the following three components:

- Paragraph Topic: This indicates which paragraph or general area of the contract the requirement refers to.
- Section Topic: A short guiding question that describes the specific aspect or issue that should be addressed within that section of the contract.
- Example Sentence: A concrete example taken from an actual SaaS contract that illustrates how this requirement might typically be formulated in legal language.

This structured setup allows us to later match contract content to catalog entries based on thematic and semantic similarity.

However, before we can map the requirement catalog to our contract data, we first need to clean the example phrases (reference texts) contained within the catalog itself. These examples need to go through the same preprocessing pipeline, such as lowercasing, punctuation removal, and stopword filtering, to ensure that the label mapping is accurate and consistent with the processed contract content.

Only once both the contract data and the catalog examples are cleaned can we begin matching them effectively for label assignment.

**<=== load the raw catalogue**

In [18]:
catalogue_raw = pd.read_excel("../data/catalogue_raw.xlsx")
display(catalogue_raw)


Unnamed: 0,paragraph_topic,section_topic,example
0,Projektkosten & Zahlungsmodalitäten,Sind sämtliche Kostenarten und -bestandteile (...,„Im Festpreis von 200.000 € sind sämtliche Lei...
1,Projektkosten & Zahlungsmodalitäten,Ist das Vergütungsmodell eindeutig festgelegt ...,„Der Kunde zahlt eine monatliche Pauschale von...
2,Projektkosten & Zahlungsmodalitäten,Ist ein Zahlungsplan mit konkreten Fälligkeite...,„Die Vergütung ist in drei Raten zahlbar: 30% ...
3,Projektkosten & Zahlungsmodalitäten,"Sind Währung, Rechnungsstellung, Zahlungsfrist...",„Alle Preise verstehen sich in Euro zuzüglich ...
4,Projektkosten & Zahlungsmodalitäten,Regelt der Vertrag den Umgang mit Nebenkosten ...,„Reise- und Übernachtungskosten werden nur ers...
...,...,...,...
71,Sonstige wichtige Klauseln,Ist die anwendbare Rechtsordnung eindeutig ver...,„Dieser Vertrag unterliegt dem Recht der Bunde...
72,Sonstige wichtige Klauseln,Ist ein Gerichtsstand für Streitigkeiten festg...,„Gerichtsstand für alle Streitigkeiten aus ode...
73,Sonstige wichtige Klauseln,"Bei mehrsprachigen Verträgen: Ist festgelegt, ...",„Dieser Vertrag wird in deutscher und englisch...
74,Sonstige wichtige Klauseln,Enthält der Vertrag eine salvatorische Klausel...,„Sollte eine Bestimmung dieses Vertrages unwir...


In [19]:

catalogue_raw["example"] = catalogue_raw["example"].str.strip('„“"').apply(clean_sections_and_paragraphs)
catalogue_raw["paragraph_topic"] = catalogue_raw["paragraph_topic"].apply(
    lambda x: x.replace("&", 'und').strip().replace(" ","_")
)

catalogue_clean= catalogue_raw.copy()

display(catalogue_clean.head())

file_path = '../data/catalogue_clean.pkl'  
catalogue_clean.to_pickle(file_path)
catalogue_clean.to_excel("../data/catalogue_clean.xlsx", index=False)



Unnamed: 0,paragraph_topic,section_topic,example
0,Projektkosten_und_Zahlungsmodalitäten,Sind sämtliche Kostenarten und -bestandteile (...,im festpreis von € sind sämtliche leistungen e...
1,Projektkosten_und_Zahlungsmodalitäten,Ist das Vergütungsmodell eindeutig festgelegt ...,der kunde zahlt eine monatliche pauschale von ...
2,Projektkosten_und_Zahlungsmodalitäten,Ist ein Zahlungsplan mit konkreten Fälligkeite...,die vergütung ist in drei raten zahlbar bei pr...
3,Projektkosten_und_Zahlungsmodalitäten,"Sind Währung, Rechnungsstellung, Zahlungsfrist...",alle preise verstehen sich in euro zuzüglich g...
4,Projektkosten_und_Zahlungsmodalitäten,Regelt der Vertrag den Umgang mit Nebenkosten ...,reise und übernachtungskosten werden nur ersta...


## 3.2 Embeddings
### Reusable Embedding Function for Multiple Models
Since we plan to experiment with various models throughout the project, and each model generates different embeddings, we created a reusable function that adds a new column to a given dataset. This column contains the embeddings computed by the specified model, based on a target text column. This setup allows us to easily switch between models and store their outputs for further analysis or comparison.

In [20]:
# def add_embed_text_column(df, text_column, model, target_column, batch_size=16):
#     """
#     Computes SentenceTransformer embeddings column-wise in batches, optimized for CPU performance.
#     """
#     texts = df[text_column].fillna("").tolist()
#     all_embeddings = []

#     for i in tqdm(range(0, len(texts), batch_size), desc=f"Embedding {text_column}"):
#         batch = texts[i:i+batch_size]
#         with torch.no_grad():
#             emb = model.encode(batch, convert_to_tensor=True)
#         all_embeddings.extend(emb.cpu().numpy())

#     df[target_column] = all_embeddings
#     return df

## 3.3 Initial Labeling via Embedding Similarity (Deprecated Due to Data Leakage)
Our initial approach was to label the dataset using cosine similarity between contract embeddings and requirement catalog examples, based on a selected model, with the results only double-checked by a human. The idea was to use these labels later to evaluate and compare different classification models.

However, as the project progressed, we realized that this method introduced data leakage, since the same embeddings used for labeling were also used during model evaluation. To avoid biased results, we decided to abandon this approach in favor of a more robust labeling strategy.

Nevertheless, for the sake of completeness, the original method is documented below.

**<=== load data_clean**


In [21]:
path = '../data/data_clean.pkl'
train_data_unlabeled = pd.read_pickle(path)

# Set seed and sample data
random.seed(2211)
sample_indices = random.sample(range(len(train_data_unlabeled["clean_section_content"])), k=600)

# Choose the model to evaluate
model = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
model_name = "multilingual_paraphrase_sentence"

# Copy and sample data
train_data_labeled = train_data_unlabeled.iloc[sample_indices, :].copy()

model = models.Transformer(model, max_seq_length=512)

pooling_model = models.Pooling(
            model.get_word_embedding_dimension(),
            pooling_mode_cls_token=False,           
            pooling_mode_mean_tokens=True,         # mean pooling; according to ChatGPT, better suited for semantic similarity
            pooling_mode_max_tokens=False,
        )

model = SentenceTransformer(modules=[model, pooling_model])

# Embed contract sections
train_data_labeled = add_embed_text_column(
    train_data_labeled,
    text_column="clean_section_content",
    model= model,
    target_column=f"section_em_{model_name}"
)
print(catalogue_clean)

# Embed catalog examples
catalogue = add_embed_text_column(
    catalogue_clean,
    text_column="example",
    model=model,
    target_column="emb"
)

# Keep only relevant columns
train_data_labeled = train_data_labeled[["contract", "paragraph", "section", "clean_section_content", f"section_em_{model_name}"]]

# Compute cosine similarity
X = np.vstack(train_data_labeled[f"section_em_{model_name}"].values)
Y = np.vstack(catalogue["emb"].values)
similarity_matrix = cosine_similarity(X, Y)
similarity_percent = np.round(similarity_matrix * 100, 2)

# Best matches per section
best_match_idx = similarity_matrix.argmax(axis=1)
best_match_score = similarity_percent[np.arange(len(X)), best_match_idx]

# Store results
train_data_labeled[f"matched_example_index_{model_name}"] = best_match_idx
train_data_labeled[f"similarity_percent_{model_name}"] = best_match_score
train_data_labeled[f"matched_example_text_{model_name}"] = catalogue.loc[best_match_idx, "example"].values
train_data_labeled[f"matched_example_topic_{model_name}"] = catalogue.loc[best_match_idx, "section_topic"].values
train_data_labeled[f"matched_paragraph_{model_name}"] = catalogue.loc[best_match_idx, "paragraph_topic"].values

# Summary stats
print(f"# {model_name}")
print("    mean similarity:", train_data_labeled[f"similarity_percent_{model_name}"].mean())
print("    max similarity:", train_data_labeled[f"similarity_percent_{model_name}"].max())
print("    min similarity:", train_data_labeled[f"similarity_percent_{model_name}"].min())

# Example: best and worst match
train_data_labeled_sorted = train_data_labeled.sort_values(f"similarity_percent_{model_name}", ascending=False)
best_match = train_data_labeled_sorted.iloc[0]
print("\nBest match:")
print(f'{best_match["clean_section_content"]} \n### to ### \n{best_match[f"matched_example_text_{model_name}"]} \n### with score ### {best_match[f"similarity_percent_{model_name}"]}')

train_data_labeled_sorted = train_data_labeled.sort_values(f"similarity_percent_{model_name}", ascending=True)
worst_match = train_data_labeled_sorted.iloc[0]
print("\nWorst match:")
print(f'{worst_match["clean_section_content"]} \n### to ### \n{worst_match[f"matched_example_text_{model_name}"]} \n### with score ### {worst_match[f"similarity_percent_{model_name}"]}')


NameError: name 'tqdm' is not defined

This results in a dataset containing the best matches between contract sections and catalog examples for the multilingual paraphrase model.

In [None]:
display(train_data_labeled_sorted.sort_values(f"similarity_percent_{model_name}", ascending=False).head(5))

subset = train_data_labeled_sorted[['clean_section_content',"similarity_percent_multilingual_paraphrase_sentence", 'matched_example_text_multilingual_paraphrase_sentence']].sort_values("similarity_percent_multilingual_paraphrase_sentence", ascending=False)

def wrap_text(df, columns):
    return df.style.set_properties(**{
        'white-space': 'pre-wrap',
        'word-wrap': 'break-word'
    }, subset=columns)

# Example: wrap long text in 2 columns
wrap_text(subset.head(5), ['clean_section_content', 'matched_example_text_multilingual_paraphrase_sentence'])

len(catalogue)


## 3.4 Manual mapping

However, as previously mentioned, the automatic labeling approach introduced data leakage. Therefore, we changed our strategy and switched to manual labeling of as many data points as possible.

Since this process is extremely time-consuming, and we are not legal experts — even in our native language, German — we were only able to manually label a total of 64 data samples.

Another challenge we encountered was that, although we had compiled a dataset of over 1,300 contract sections, we were unable to find suitable matches for all examples in the requirements catalog. As a result, 12 out of the 76 requirement items remain without any corresponding training data.

In essence, we manually reviewed all available contracts and initially searched for relevant keywords to identify potential matches. Once potential sections were found, we compared them directly to the example sections in the requirement catalog.

Only when we were 100% confident that a contract section semantically matched a catalog entry did we assign the corresponding catalog ID (i.e., the position of the example section within the catalog) to that contract section.
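A quick way to see which catalog entries are still without training data is to count the labeled sections per catalog ID. The sketch below uses hypothetical miniature data; the dictionary `manual_labels` and the five-entry catalog are invented for illustration and do not reflect the real mapping file.

```python
from collections import Counter

# Hypothetical miniature data: (contract, paragraph, section) -> catalog_id
manual_labels = {
    (1, "§ 1", "1.1"): 2,
    (1, "§ 4", "4.2"): 3,
    (2, "§ 1", "(1)"): 2,
}
catalog_ids = range(1, 6)  # hypothetical five-entry catalog, IDs start at 1

# Count how many labeled sections exist per catalog entry.
examples_per_id = Counter(manual_labels.values())
# Catalog entries that no labeled section points to.
uncovered_ids = [cid for cid in catalog_ids if examples_per_id[cid] == 0]

print("labeled examples per catalog ID:", dict(examples_per_id))
print("catalog IDs without training data:", uncovered_ids)
```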

The following table shows a selection of sections from our dataset that have been mapped to their corresponding catalog_id (i.e., the index of the requirements catalog).

In [None]:
df_unlabeled = pd.read_pickle("../data/data_clean.pkl")
df_unlabeled = df_unlabeled[["contract","paragraph","section","section_content","clean_section_content"]]
display(df_unlabeled)


mapping = pd.read_excel("../data/mapping_human.xlsx")
print(mapping.head())
mapping = mapping[["contract","paragraph", "section","section_content","catalog_id"]]
display(mapping)

# Merge on 'contract', 'paragraph' and 'section' for more precise matching
df_labeled = mapping.merge(df_unlabeled, how="left", on=["contract","paragraph", "section"])
df_labeled= df_labeled[["contract","paragraph","section","clean_section_content","catalog_id"]]
display(df_labeled)

catalogue = pd.read_pickle("../data/catalogue_clean.pkl")
catalogue["catalog_id"] = range(1, len(catalogue) + 1)
display(catalogue)

# 4. Model Comparison & Evaluation
We now move on to selecting the most suitable model. In total, we chose five different models from Hugging Face, each with a corresponding tokenizer, and wrap each of them as a Sentence Transformer:

- "deepset/gbert-base"
- "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
- "bert-base-uncased"
- "jinaai/jina-embeddings-v2-small-en"
- "jinaai/jina-embeddings-v2-base-de"


Additionally, we need to decide on a pooling strategy. This strategy determines how we aggregate token-level embeddings into a single section-level embedding, which is essential for comparing sections rather than individual words.
When generating embeddings for full sections using transformer-based models, we must reduce the token-level output to a single fixed-size vector. This process is called pooling. Three common strategies are:

**CLS Token Pooling**
Uses the embedding of the special [CLS] token, which is prepended to every input by models like BERT.
The [CLS] token is explicitly trained (during pretraining) to capture the meaning of the entire sentence for classification tasks.
Advantage: In models trained for classification (e.g. BERT), the CLS token often encodes global sentence-level semantics efficiently.

**Mean Pooling**
Averages the embeddings of all tokens (excluding padding).
Provides a more balanced representation of the entire sentence or section.
Advantage: Especially useful in models not fine-tuned with CLS-specific objectives (e.g. multilingual or paraphrase models), where the semantic information is more evenly distributed across all tokens.

**Max Pooling**
Takes the maximum value across all token embeddings for each embedding dimension (excluding padding).
This emphasizes the most salient features that appear anywhere in the input, regardless of position.

Advantage: Max pooling can highlight strong semantic signals (e.g. key terms) by preserving their peak activations. This may be beneficial in tasks where individual high-impact words (rather than holistic meaning) are important — especially in longer or noisy texts.
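The three strategies above differ only in how the per-token vectors are reduced. The following dependency-free sketch shows this on a toy example; the function name `pool` and the tiny two-dimensional "embeddings" are assumptions for illustration, whereas the notebook itself uses the `Pooling` module from Sentence Transformers.

```python
def pool(token_embeddings, attention_mask, strategy="mean"):
    """Reduce per-token vectors to one vector, ignoring padded positions."""
    if strategy == "cls":
        # The [CLS] token is assumed to sit at position 0.
        return list(token_embeddings[0])
    # Keep only real tokens (mask == 1); padding must not influence the result.
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    columns = list(zip(*kept))  # one tuple per embedding dimension
    if strategy == "mean":
        return [sum(col) / len(col) for col in columns]
    if strategy == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown pooling strategy: {strategy}")


tokens = [
    [1.0, 0.0],   # [CLS]
    [3.0, 4.0],
    [2.0, -2.0],
    [9.0, 9.0],   # padding, masked out below
]
mask = [1, 1, 1, 0]

for strategy in ("cls", "mean", "max"):
    print(strategy, pool(tokens, mask, strategy))
```

Whichever strategy is used, the output has the same dimensionality as a single token embedding, so sections of different lengths become directly comparable.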



## 4.1 Tokenizers and Embeddings

In [None]:
import warnings
model_names = [
    "gbert-base",
    "paraphrase-multilingual-MiniLM-L12-v2",
    "bert-base-uncased",
    "jina-embeddings-v2-small-en",
    "jina-embeddings-v2-base-de"
]



model_urls = [
    "deepset/gbert-base",
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    "bert-base-uncased",
    "jinaai/jina-embeddings-v2-small-en",
    "jinaai/jina-embeddings-v2-base-de"
]


max_tokens = []

for model_name, model_url in zip(model_names, model_urls):
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        model = AutoModel.from_pretrained(model_url)
    
    max_len = model.embeddings.position_embeddings.weight.shape[0]
    max_tokens.append((model_name, max_len))
print("##### ##### #### #### #### ")
print(" ")
# max_tokens already holds (model_name, max_len) tuples, so unpack them directly
for model_name, max_len in max_tokens:
    print(f'{model_name} --> max tokens = {max_len}')

#### TODO: describe why we tested these 5 models


The following diagram shows how many tokens the various language models generate for the texts in our dataset. It is clearly noticeable that English-based models tend to produce a higher token density, especially for longer texts. A likely reason for this is that German words are not well represented in the English vocabulary, which leads to more frequent splitting into subword tokens.

The two models bert-base-uncased and jina-embeddings-v2-small-en generate many more tokens per text segment than the German-language models such as gbert-base and jina-embeddings-v2-base-de.

Nevertheless, all models, including the English ones, achieve consistently solid performance. The fact that many texts exceed the maximum token input size of the respective models (e.g., 512 for BERT) does not pose a problem for our use case, as we handle this through appropriate truncation or segmentation strategies.

Overall, the results indicate that German-language models are better suited for processing German texts, as they require fewer splits and better capture the language’s structure. For our dataset of over 1,300 section texts, German-language and multilingual models such as gbert-base, jina-embeddings-v2-base-de or paraphrase-multilingual-MiniLM-L12-v2 prove to be particularly effective.
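The truncation and segmentation mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `split_into_windows` and the overlap of 50 tokens are hypothetical choices, and the 1091-token example length is taken from the tokenizer warning shown earlier in the notebook.

```python
def split_into_windows(tokens, max_len=512, overlap=50):
    """Split a token list into windows of at most max_len tokens that overlap by `overlap` tokens."""
    if overlap >= max_len:
        raise ValueError("overlap must be smaller than max_len")
    step = max_len - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + max_len])
        # Stop as soon as the current window reaches the end of the text.
        if start + max_len >= len(tokens):
            break
    return windows


token_ids = list(range(1091))            # stand-in for a tokenized section
truncated = token_ids[:512]              # truncation: keep only the first window
windows = split_into_windows(token_ids)  # segmentation: keep everything, in pieces
print([len(w) for w in windows])
```

In practice the window size would need to leave room for special tokens such as [CLS] and [SEP], and the per-window embeddings would still have to be combined, for example by averaging.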

In [None]:

# Column to analyze
column = "clean_section_content"

# Model names and URLs
model_names = [
    "gbert-base",
    "paraphrase-multilingual-MiniLM-L12-v2",
    "bert-base-uncased",
    "jina-embeddings-v2-small-en",
    "jina-embeddings-v2-base-de"
]

model_urls = [
    "deepset/gbert-base",
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    "bert-base-uncased",
    "jinaai/jina-embeddings-v2-small-en",
    "jinaai/jina-embeddings-v2-base-de"
]

# Prepare tokenizers
tokenizers = {}
for name, url in zip(model_names, model_urls):
    tokenizers[name] = AutoTokenizer.from_pretrained(url)

# Collect token counts per text
token_counts = {}
for name, tokenizer in tokenizers.items():
    token_counts[name] = df_structured[column].fillna("").apply(lambda x: len(tokenizer.tokenize(x)))
print(token_counts)
# Plot: distribution of token counts (x-axis capped at the 99% quantile)
plt.figure(figsize=(10, 6))
max_x = max(token_counts[name].quantile(0.99) for name in token_counts)
bins = np.linspace(0, max_x, 40)

styles = ['-', '--', '-.', ':', (0, (3, 1, 1, 1))]

for (name, data), style in zip(token_counts.items(), styles):
    hist, bin_edges = np.histogram(data, bins=bins, density=True)
    bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2
    plt.plot(bin_centers, hist, label=name, linestyle=style)

plt.title(f"Tokencount-distribution for: {column}")
plt.xlabel("token count")
plt.ylabel("density")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()


## 4.2 Model Assembly and Pooling
In this section, we load a selection of transformer-based language models and prepare them for use with the Sentence Transformers framework. Each model is wrapped with a pooling strategy to generate fixed-size sentence embeddings.
Each model is assembled using the Sentence Transformers Transformer + Pooling modules. We store each configuration in a dictionary (models_dict) using a key that includes the model name and pooling strategy (e.g., base_gbert_sentence_cls).
If RAM is not sufficient, there is also the option to save all built models for reuse without reloading them from Hugging Face. When doing so, however, they have to be excluded from the Git repository since they are too big.
This setup provides a flexible and consistent framework to benchmark **15 combinations** of transformer models and pooling strategies (5 models × 3 pooling strategies) for downstream semantic matching tasks.


In [None]:
from sentence_transformers import SentenceTransformer, models
import os

models_in = [
    "deepset/gbert-base",
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    "bert-base-uncased",
    "jinaai/jina-embeddings-v2-small-en",
    "jinaai/jina-embeddings-v2-base-de",
]

models_out = [
    "base_gbert_sentence",
    "multilingual_paraphrase_sentence",
    "bert_base_uncased_sentence",
    "jina_small_en_sentence",
    "jina_base_de_sentence",
]

pool_strats = ["cls", "mean", "max"]

# Dictionaries collecting the models, their URLs and their pooling strategies
models_dict = {}
models_urls = {}
models_strat = {}


for model_in, model_out in zip(models_in, models_out):
    word_embedding_model = models.Transformer(model_in, max_seq_length=512)

    for pool_strat in pool_strats:
        print(f"Loading model: {model_out}, strategy: {pool_strat}")

        cls = pool_strat == "cls"
        mean = pool_strat == "mean"
        maxi = pool_strat == "max"

        pooling_model = models.Pooling(
            word_embedding_model.get_word_embedding_dimension(),
            pooling_mode_cls_token=cls,
            pooling_mode_mean_tokens=mean,
            pooling_mode_max_tokens=maxi,
        )

        model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

        # Key, e.g. "base_gbert_sentence_cls"
        dict_key = f"{model_out}_{pool_strat}"
        models_dict[dict_key] = model
        models_urls[dict_key] = model_in
        models_strat[dict_key] = pool_strat


        # Optionally save:
        # model.save(f"../models/raw_STM/{dict_key}_emb")

# Print the loaded models
print("Loaded models:", list(models_dict.keys()))


## 4.3 Matching Contract Sections to Catalog Entries Using Embeddings
In this section, we match contract sections to the most relevant catalog entries using vector-based semantic similarity. For each embedding model in our comparison, we perform the following steps:

Embedding Generation:
We encode the contract sections (clean_section_content) and the catalog examples (example) using the selected embedding model.

Similarity Computation:
We calculate the cosine similarity between each section embedding and all catalog entry embeddings. For each section, we select the catalog entry with the highest similarity score as the predicted match.

Ground Truth Comparison:
We compare the predicted catalog ID against the ground truth (true_catalog_id) to assess whether the top match is correct.

Evaluation with ROC Curve:
Using the cosine similarity scores as prediction confidence, we compute an ROC curve and the AUC (Area Under Curve) to measure how well the model distinguishes correct from incorrect matches.

Threshold Optimization:
We determine an optimal similarity threshold based on the ROC curve (max(TPR - 0.5 × FPR)), which we then use to classify matches as valid or invalid.

Postprocessing:
Matches below the threshold are marked as invalid by assigning a dummy catalog ID (-99), enabling further analysis and filtering.

This analysis is repeated for each model in our benchmark set. The results help us compare model performance and select the best embedding model for semantic contract section matching.
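The matching, threshold selection, and postprocessing steps above can be sketched without any libraries. The helper names (`cosine`, `best_match`, `pick_threshold`) and the toy scores are assumptions made for this illustration; the criterion implemented is the one named above, max(TPR - 0.5 × FPR), evaluated at every observed score.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_match(section_vec, catalog_vecs):
    """Return (index, similarity) of the most similar catalog vector."""
    sims = [cosine(section_vec, c) for c in catalog_vecs]
    idx = max(range(len(sims)), key=sims.__getitem__)
    return idx, sims[idx]


def pick_threshold(scores, is_correct, fpr_weight=0.5):
    """Return the score threshold that maximises TPR - fpr_weight * FPR."""
    positives = sum(is_correct)
    negatives = len(is_correct) - positives
    best_threshold, best_value = None, float("-inf")
    for threshold in sorted(set(scores)):
        # A match counts as accepted when its score reaches the threshold.
        tp = sum(1 for s, ok in zip(scores, is_correct) if s >= threshold and ok)
        fp = sum(1 for s, ok in zip(scores, is_correct) if s >= threshold and not ok)
        tpr = tp / positives if positives else 0.0
        fpr = fp / negatives if negatives else 0.0
        value = tpr - fpr_weight * fpr
        if value > best_value:
            best_threshold, best_value = threshold, value
    return best_threshold


# Toy data: similarity in percent, whether the top match was correct, predicted ID.
scores = [91.0, 85.0, 80.0, 62.0, 55.0, 40.0]
is_correct = [True, True, False, True, False, False]
predicted_ids = [4, 17, 9, 4, 30, 2]

threshold = pick_threshold(scores, is_correct)
# Matches below the threshold receive the dummy catalog ID -99.
final_ids = [pid if s >= threshold else -99 for pid, s in zip(predicted_ids, scores)]
print(threshold, final_ids)
```

Weighting the false positive rate with 0.5 makes the criterion more tolerant of false positives than Youden's J statistic (TPR - FPR), which tends to favor a lower threshold and therefore more accepted matches.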

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
# .copy() avoids pandas' SettingWithCopyWarning when columns are added or changed below
df_to_match = df_labeled[["contract","paragraph","section","clean_section_content"]].copy()
df_true_match = df_labeled[["contract","paragraph","section","clean_section_content","catalog_id"]].copy()
df_true_match = df_true_match.rename(columns={"catalog_id": "true_catalog_id"})
display(df_true_match)
df_true_match["true_catalog_id"] = df_true_match["true_catalog_id"].astype(int)
cols = ["contract","paragraph","section","clean_section_content"]

models_thresholds = {}
tab_contents = []
print(catalogue)

for model_name in models_dict:
    model_selected = models_dict[model_name]
    df_to_match = add_embed_text_column(df_to_match, text_column="clean_section_content",model = model_selected, target_column=f"section_em_{model_name}" )
    cat = add_embed_text_column(catalogue, text_column = "example", model = model_selected, target_column = "emb")
    cols.append(f"section_em_{model_name}")
    df_to_match  = df_to_match[cols]
    X = np.vstack(df_to_match[f"section_em_{model_name}" ].values)  # shape: [n_sections, embedding_dim]
    Y = np.vstack(cat["emb"].values)                  # shape: [n_catalog_entries, embedding_dim]

    # Cosine similarity between every section and every catalog entry
    similarity_matrix = cosine_similarity(X, Y)  # shape: [n_sections, n_catalog_entries]
    similarity_percent = np.round(similarity_matrix * 100, 2)  # scaled to percent

    best_match_idx = similarity_matrix.argmax(axis=1)
    best_match_score = similarity_percent[np.arange(len(X)), best_match_idx]

    # Attach the match results
    df_to_match[f"matched_example_index_{model_name}"] = best_match_idx
    df_to_match[f"similarity_percent_{model_name}"] = best_match_score
    df_to_match[f"matched_example_text_{model_name}"] = cat.loc[best_match_idx, "example"].values
    df_to_match[f"matched_example_topic_{model_name}"] = cat.loc[best_match_idx, "section_topic"].values
    df_to_match[f"matched_paragraph_{model_name}"] = cat.loc[best_match_idx, "paragraph_topic"].values
    df_to_match[f"matched_catalog_id_{model_name}"] = cat.loc[best_match_idx, "catalog_id"].values

    # # print(df_true_match["true_catalog_id"].dtype)
    # print(df_to_match[f"matched_catalog_id_{model_name}"].dtype)    
    

    y_true = (df_true_match["true_catalog_id"].values == df_to_match[f"matched_catalog_id_{model_name}"].values).astype(int)
    y_scores = df_to_match[f"similarity_percent_{model_name}"].values / 100  # back to the 0–1 scale
    # print("Verteilung der Klassen in y_true:")
    # print(np.unique(y_true, return_counts=True))


    # ROC curve and the threshold that maximizes TPR - 0.5 * FPR
    fpr, tpr, thresholds = roc_curve(y_true, y_scores)
    roc_auc = auc(fpr, tpr)
    custom_score = tpr - 0.5 * fpr
    optimal_idx = np.argmax(custom_score)
    optimal_threshold = thresholds[optimal_idx]
    
    output = widgets.Output()
    with output:
        plt.figure(figsize=(8, 6))
        plt.plot(fpr, tpr, label=f'ROC curve (AUC = {roc_auc:.2f})')
        plt.plot([0, 1], [0, 1], 'k--')  # Diagonale
        plt.xlabel('False Positive Rate')
        plt.ylabel('True Positive Rate')
        plt.title(f'ROC Curve - {model_name}')
        plt.plot(fpr[optimal_idx], tpr[optimal_idx], 'ro', label='Optimal Threshold')
        plt.legend(loc='lower right')
        plt.grid(True)
        plt.show()

    # One tab per model, titled with the model name
    tab_contents.append((model_name, output))

    # print(f"Optimal cosine-similarity threshold (%): {optimal_threshold * 100:.2f}")
    # New column: a match is valid only if its score reaches the threshold
    df_to_match[f"match_valid_{model_name}"] = y_scores >= optimal_threshold
    df_to_match.loc[~df_to_match[f"match_valid_{model_name}"], f"matched_catalog_id_{model_name}"] = -99
    cols = list(df_to_match.columns)
    # display(df_to_match)
    models_thresholds[model_name] = optimal_threshold

# Build one tab per model
tab_widget = widgets.Tab()
tab_widget.children = [out for _, out in tab_contents]
for idx, (name, _) in enumerate(tab_contents):
    tab_widget.set_title(idx, name)

display(tab_widget)


In [None]:
df_matched = df_to_match
from sklearn.metrics import accuracy_score, f1_score, recall_score, classification_report

results = []

for model_name in models_dict:
    print(f"### {model_name} ###")
    df_matched_ids = df_matched[["contract","paragraph","section","clean_section_content",f"matched_catalog_id_{model_name}"]]
    print("Accuracy:", accuracy_score(df_true_match["true_catalog_id"], df_matched_ids[f"matched_catalog_id_{model_name}"]))
    print("F1 (macro):", f1_score(df_true_match["true_catalog_id"], df_matched_ids[f"matched_catalog_id_{model_name}"], average='macro'))
    print("F1 (weighted):", f1_score(df_true_match["true_catalog_id"], df_matched_ids[f"matched_catalog_id_{model_name}"], average='weighted'))
    print("\nReport:\n", classification_report(df_true_match["true_catalog_id"], df_matched_ids[f"matched_catalog_id_{model_name}"]))


    results.append({
            "model": model_name,
            "recall (macro)": recall_score(df_true_match["true_catalog_id"], df_matched_ids[f"matched_catalog_id_{model_name}"], average='macro'),
            "recall (weighted)": recall_score(df_true_match["true_catalog_id"], df_matched_ids[f"matched_catalog_id_{model_name}"], average='weighted'),
            "Accuracy": accuracy_score(df_true_match["true_catalog_id"], df_matched_ids[f"matched_catalog_id_{model_name}"]),
            "F1 (macro)": f1_score(df_true_match["true_catalog_id"], df_matched_ids[f"matched_catalog_id_{model_name}"], average='macro'),
            "F1 (weighted)": f1_score(df_true_match["true_catalog_id"], df_matched_ids[f"matched_catalog_id_{model_name}"], average='weighted')
        })

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix
import ipywidgets as widgets
from IPython.display import display

tab_contents = []

for model_name in models_dict:
    output = widgets.Output()
    with output:
        # Prepare the data
        df_matched_ids = df_matched[["contract", "paragraph", "section", "clean_section_content", f"matched_catalog_id_{model_name}"]]
        y_true = df_true_match["true_catalog_id"].astype(str)
        y_pred = df_matched_ids[f"matched_catalog_id_{model_name}"].astype(str)

        # Collect every label that occurs in the ground truth or the predictions
        all_labels_set = set(y_true).union(set(y_pred))

        # Sort numerically, then convert back to strings for the axis ticks
        all_labels = [str(x) for x in sorted(map(int, all_labels_set))]
        cm = confusion_matrix(y_true, y_pred, labels=all_labels)

        # Plot
        plt.figure(figsize=(8, 6))
        sns.heatmap(cm, annot=False, fmt="d", cmap="Blues",
                    xticklabels=all_labels, 
                    yticklabels=all_labels)
        plt.title(f"Confusion Matrix: {model_name}")
        plt.xlabel("Predicted Label")
        plt.ylabel("True Label")
        plt.tight_layout()
        plt.show()
    
    tab_contents.append((model_name, output))

# Build the tabs
tab_widget = widgets.Tab()
tab_widget.children = [out for _, out in tab_contents]
for idx, (name, _) in enumerate(tab_contents):
    tab_widget.set_title(idx, name)

display(tab_widget)


In [None]:
results_df = pd.DataFrame(results).sort_values("recall (weighted)", ascending=False)
thresholds_df = pd.DataFrame(models_thresholds.items(), columns=["model", "optimal_threshold"])
models_urls_df = pd.DataFrame(models_urls.items(), columns=["model", "model_url"])
models_strat_df = pd.DataFrame(models_strat.items(), columns=["model", "pooling_strategy"])
results_df = results_df.merge(models_urls_df, on="model", how="left").merge(models_strat_df, on="model", how="left").merge(thresholds_df, on="model", how="left")
display(results_df)
results_df.to_csv("../data/best_models.csv", index=False)


# 5. Finetuning

In [None]:
results_df = pd.read_csv("../data/best_models.csv")
best_model = results_df[["model","model_url","pooling_strategy","optimal_threshold"]].iloc[0]
print(best_model)

df_unlabeled = pd.read_pickle("../data/data_clean.pkl")
df_unlabeled = df_unlabeled[["contract","paragraph","section","section_content","clean_section_content"]]
display(df_unlabeled)


mapping = pd.read_excel("../data/mapping_human.xlsx")
mapping = mapping[["contract","paragraph","section","section_content","catalog_id"]]
display(mapping)

df_labeled = mapping.merge(df_unlabeled, how = "left", on = ["contract","paragraph","section"])
df_labeled= df_labeled[["contract","paragraph","section","clean_section_content","catalog_id"]]
df_labeled = df_labeled[df_labeled["catalog_id"].notna()]


print(df_labeled.head())


catalogue = pd.read_pickle("../data/catalogue_clean.pkl")
catalogue = catalogue.rename(columns={'example': 'clean_section_content'})
catalogue = catalogue[['clean_section_content']]
catalogue["catalog_id"] = range(len(catalogue))


print(catalogue.head())



In [None]:
import inspect
from sentence_transformers.models import Pooling
print("Model Summary:")
print("#####################################################")
print("# # # # # # # # # # # # # # # # # # # # # # # # # # #")
print("#####################################################")
print(f'Model: {best_model["model"]}') 
print("------------------------------------------------------------")
print(f' embeddings from: {best_model["model_url"]}: ')
print(models_dict[best_model["model"]][0].auto_model)
print("#####################################################")
print(f' pooling strategy: {best_model["pooling_strategy"]}')
print("------------------------------------------------------------")
print(models_dict[best_model["model"]][1])  
print(inspect.getsource(Pooling.forward))
print("#####################################################")
print(f' classification threshold: {best_model["optimal_threshold"]}')
print("------------------------------------------------------------")


In [None]:
pool_strat = best_model["pooling_strategy"]
word_embedding_model = models_dict[best_model["model"]][0]

print(
    pool_strat
)

cls = pool_strat == "cls"
mean = pool_strat == "mean"
maxim = pool_strat == "max"

pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode_cls_token=cls,
    pooling_mode_mean_tokens=mean,
    pooling_mode_max_tokens=maxim,
)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])


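print(catalogue["clean_section_content"])
# Read the embeddings from the returned frame, which holds the "emb" column
# whether or not embed_text_column modifies `catalogue` in place.
catalogue_emb = embed_text_column(catalogue, text_column="clean_section_content", model=model, target_column="emb")
print(catalogue_emb["emb"])
label_embeddings = torch.tensor(np.stack(catalogue_emb["emb"].to_list())).float()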



In [None]:
# class CosineMapper(nn.Module):
#     def __init__(
#         self,
#         model_name= best_model["model_url"],
#         label_embeddings = label_embeddings,
#         pooling= best_model["pooling_strategy"],
#         threshold= best_model["optimal_threshold"]
#     ):
#         super().__init__()
#         self.tokenizer = AutoTokenizer.from_pretrained(model_name)
#         config = AutoConfig.from_pretrained(model_name)
#         self.bert = AutoModel.from_pretrained(model_name, config=config)
#         self.pooling = pooling
#         self.threshold = threshold

#         self.label_embeddings = nn.Parameter(label_embeddings, requires_grad=False)  # e.g. from a SentenceTransformer
#         self.activation = nn.GELU()  # GELU activation function

#         self.dropout = nn.Dropout(0.5)

#     def forward(self, texts,return_embedding = False  ):
#         if isinstance(texts, str):
#             texts = [texts]

#         inputs = self.tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
#         outputs = self.bert(**inputs)
#         token_embeddings = outputs.last_hidden_state  # (B, T, H)

#         # Pooling
#         if self.pooling == "cls":
#             pooled = token_embeddings[:, 0]
#         elif self.pooling == "max":
#             mask = inputs["attention_mask"].unsqueeze(-1).expand(token_embeddings.shape).float()
#             token_embeddings[mask == 0] = -1e9
#             pooled = torch.max(token_embeddings, dim=1)[0]
#         elif self.pooling == "mean":
#             mask = inputs["attention_mask"].unsqueeze(-1).expand(token_embeddings.shape).float()
#             summed = torch.sum(token_embeddings * mask, dim=1)
#             counts = mask.sum(dim=1).clamp(min=1e-9)
#             pooled = summed / counts
#         else:
#             raise ValueError("Unknown pooling method")

#         pooled = self.dropout(pooled)
#         if return_embedding:
#             return pooled 

#         # Cosine similarity to all label embeddings
#         normalized_input = nn.functional.normalize(pooled, dim=1)
#         normalized_labels = nn.functional.normalize(self.label_embeddings, dim=1)

#         cosine_sim = torch.matmul(normalized_input, normalized_labels.T)  # (B, num_labels)
#         return cosine_sim

#     def predict(self, texts, top_k: int = 1, return_scores: bool = False):
#         with torch.no_grad():
#             scores = self.forward(texts)

#         if return_scores:
#             # Return the top-k catalog IDs (index + 1) together with their scores
#             topk_scores, topk_indices = torch.topk(scores, k=top_k, dim=1)
#             results = []
#             for indices, values in zip(topk_indices, topk_scores):
#                 results.append([(i.item() + 1, round(s.item(), 4)) for i, s in zip(indices, values)])
#             return results if len(results) > 1 else results[0]

#         else:
#             # Return only the catalog ID (index + 1) of the best class
#             preds = torch.argmax(scores, dim=1)
#             result = (preds+1 ).tolist()
#             return result[0] if len(result) == 1 else result


In [None]:
model = CosineMapper(
    model_name=best_model["model_url"],
    label_embeddings=label_embeddings,
    pooling=best_model["pooling_strategy"],
    threshold=best_model["optimal_threshold"],
)
model


In [None]:
from sklearn.metrics import classification_report, confusion_matrix
import pandas as pd

# 1. Collect all texts and their true labels
texts = df_labeled["clean_section_content"].tolist()
texts = [str(t) if not isinstance(t, str) else t for t in texts]

true_labels = df_labeled["catalog_id"].astype(int).tolist()  # adjust if the label column has a different name
# 2. Get the predictions
model.return_scores = False  # important: return predictions, not scores
pred_labels = model.predict(texts)  # returns List[int]

# 3. Classification report
print(classification_report(true_labels, pred_labels, digits=3))



In [None]:
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader
from torch.utils.data import Dataset


In [None]:

# Keep only rows whose "clean_section_content" is not NaN
filtered_df = df_labeled.dropna(subset=["clean_section_content"]).copy()

# Extract the labels as integers
filtered_df["label"] = filtered_df["catalog_id"].astype(int)

# Build plain lists
texts = filtered_df["clean_section_content"].tolist()
labels = filtered_df["label"].tolist()


# Split into train (70%), validation (15%) and test (15%)
train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, labels, test_size=0.3,  random_state=42
)

test_texts, val_texts, test_labels, val_labels = train_test_split(
    val_texts,
    val_labels,
    test_size=0.5,
    random_state=42
)


# class TextLabelDataset(Dataset):
#     def __init__(self, texts, labels):
#         self.texts = texts
#         self.labels = labels

#     def __len__(self):
#         return len(self.texts)

#     def __getitem__(self, idx):
#         return {
#             "text": self.texts[idx],
#             "label": self.labels[idx]
#         }
train_dataset = TextLabelDataset(train_texts, train_labels)
val_dataset = TextLabelDataset(val_texts, val_labels)
test_dataset = TextLabelDataset(test_texts, test_labels)

train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=32,
    shuffle=True
)

val_loader = DataLoader(
    dataset=val_dataset,
    batch_size=32,
    shuffle=False
)

test_loader = DataLoader(
    dataset=test_dataset,
    batch_size=32,
    shuffle=False
)

for batch in train_loader:
    print(batch["text"][0])
    print(batch["label"][0])
    break

In [None]:
from torch.optim import AdamW

epochs = 100  # maximum number of passes over the full training set
patience = 5  # epochs without improvement before early stopping
best_val_loss = float("inf")
patience_counter = 0
optim = AdamW(model.parameters(), lr=5e-5) 
model.train()  # Set model to training mode

In [None]:
train_losses = []
val_losses = []
criterion = nn.CrossEntropyLoss()
best_val_loss = float("inf")
patience = 10
patience_counter = 0
model.return_scores=True
print(model("test"))

for epoch in range(epochs):
    model.train()
    train_total_loss = 0

    for batch in train_loader:
        texts = batch["text"]
        labels = batch["label"] -1
        optim.zero_grad()

        logits = model(texts)  # cosine similarity (B, num_labels)
        loss = criterion(logits, labels)

        train_total_loss += loss.item()
        loss.backward()
        optim.step()

    avg_train_loss = train_total_loss / len(train_loader)
    train_losses.append(avg_train_loss)
    print(f"Epoch {epoch + 1}, Training Loss: {avg_train_loss:.4f}")

    # Validation
    model.eval()
    val_loss = 0
    all_val_preds = []
    all_val_labels = []

    with torch.no_grad():
        for batch in val_loader:
            texts = batch["text"]
            labels = batch["label"] - 1  # shift to 0-based class indices, as in training

            logits = model(texts)
            loss = criterion(logits, labels)
            val_loss += loss.item()

            probs = torch.softmax(logits, dim=1)
            preds = torch.argmax(probs, dim=1)

            all_val_preds.extend(preds.cpu().tolist())
            all_val_labels.extend(labels.cpu().tolist())

    avg_val_loss = val_loss / len(val_loader)
    val_losses.append(avg_val_loss)
    print(f"Epoch {epoch + 1}, Validation Loss: {avg_val_loss:.4f}")

    if avg_val_loss < best_val_loss:
        best_val_loss = avg_val_loss
        patience_counter = 0
        torch.save(model.state_dict(), "best_model.pth")
        print("✅ Model saved!")
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print("⛔ Early stopping triggered.")
            break

# After training: load the best checkpoint
model.load_state_dict(torch.load("best_model.pth"))
model.eval()  # the cell output shows the current model structure

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 6))
plt.plot(range(1, len(train_losses) + 1), train_losses, label="Training Loss", marker="o")
plt.plot(range(1, len(val_losses) + 1), val_losses, label="Validation Loss", marker="o")

plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.title("Training and Validation Loss Curve")
plt.legend()
plt.grid()
plt.show()

# Without a test split

In [None]:
from torch.utils.data import Dataset


# Keep only rows whose "clean_section_content" is not NaN
filtered_df = df_labeled.dropna(subset=["clean_section_content"]).copy()

# Extract the labels as integers
filtered_df["label"] = filtered_df["catalog_id"].astype(int)

# Build plain lists
texts = filtered_df["clean_section_content"].tolist()
labels = filtered_df["label"].tolist()

train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, labels, test_size=0.2,  random_state=42
)


# class TextLabelDataset(Dataset):
#     def __init__(self, texts, labels):
#         self.texts = texts
#         self.labels = labels

#     def __len__(self):
#         return len(self.texts)

#     def __getitem__(self, idx):
#         return {
#             "text": self.texts[idx],
#             "label": self.labels[idx]
#         }
train_dataset = TextLabelDataset(train_texts, train_labels)
val_dataset = TextLabelDataset(val_texts, val_labels)

train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=32,
    shuffle=True
)

val_loader = DataLoader(
    dataset=val_dataset,
    batch_size=32,
    shuffle=False
)



In [None]:
epochs = 1000  # maximum number of passes over the full training set
patience = 10  # epochs without improvement before early stopping
best_val_loss = float("inf")
patience_counter = 0
model = CosineMapper(
    model_name=best_model["model_url"],
    label_embeddings=label_embeddings,      
    pooling=best_model["pooling_strategy"],
    threshold=best_model["optimal_threshold"]
)
optim = AdamW(model.parameters(), lr=100e-6)    

model.train()  # Set model to training mode

In [None]:
print("Shape of label_embeddings:", label_embeddings.shape) 
train_losses = []
val_losses = []
criterion = nn.CrossEntropyLoss()
best_val_loss = float("inf")
patience = 10
patience_counter = 0
model.return_scores=True
print(model("test"))

for epoch in range(epochs):
    model.train()
    train_total_loss = 0

    for batch in train_loader:
        texts = batch["text"]
        labels = batch["label"]-1

        optim.zero_grad()

        logits = model(texts)  # cosine similarity (B, num_labels)
        loss = criterion(logits, labels)

        train_total_loss += loss.item()
        loss.backward()
        optim.step()

    avg_train_loss = train_total_loss / len(train_loader)
    train_losses.append(avg_train_loss)
    print(f"Epoch {epoch + 1}, Training Loss: {avg_train_loss:.4f}")

    # Validation
    model.eval()
    val_loss = 0
    all_val_preds = []
    all_val_labels = []

    with torch.no_grad():
        for batch in val_loader:
            texts = batch["text"]
            labels = batch["label"] - 1  # shift to 0-based class indices, as in training

            logits = model(texts)
            loss = criterion(logits, labels)
            val_loss += loss.item()

            probs = torch.softmax(logits, dim=1)
            preds = torch.argmax(probs, dim=1)

            all_val_preds.extend(preds.cpu().tolist())
            all_val_labels.extend(labels.cpu().tolist())

    avg_val_loss = val_loss / len(val_loader)
    val_losses.append(avg_val_loss)
    print(f"Epoch {epoch + 1}, Validation Loss: {avg_val_loss:.4f}")

    if avg_val_loss < best_val_loss:
        best_val_loss = avg_val_loss
        patience_counter = 0
        torch.save(model.state_dict(), "best_model_no_test.pth")
        print("✅ Model saved!")
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print("⛔ Early stopping triggered.")
            break
        


In [None]:

# After training: load the best checkpoint saved by this run
model.load_state_dict(torch.load("best_model_no_test.pth"))
model.eval()  # the cell output shows the current model structure

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 6))
plt.plot(range(1, len(train_losses) + 1), train_losses, label="Training Loss", marker="o")
plt.plot(range(1, len(val_losses) + 1), val_losses, label="Validation Loss", marker="o")

plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.title("Training and Validation Loss Curve")
plt.legend()
plt.grid()
plt.show()

# 6. Application + Textual checking of requirements 

In [None]:
model = CosineMapper(
    model_name=best_model["model_url"],
    label_embeddings=label_embeddings,      
    pooling=best_model["pooling_strategy"],
    threshold=best_model["optimal_threshold"]
)



df_labeled = df_labeled[pd.notna(df_labeled["contract"])]
contracts_with_labels = df_labeled["contract"].astype(int).unique()
contracts_with_labels = np.append(contracts_with_labels, 1)  # np.append returns a new array

print(contracts_with_labels)


contracts= pd.read_pickle("../data/data_scraped_input_relevant.pkl")
contracts["contract"] = contracts["contract"].astype(int)
contracts_without_labels = contracts[~contracts["contract"].isin(contracts_with_labels)]["contract"].unique().tolist()
exp_contract = 26
exp = contracts[contracts["contract"] == exp_contract]  # example: a single contract
print(exp)

In [None]:



# class SectionTopicPredictor:
#     def __init__(self, model, catalogue_clean):
#         """
#         Parameters:
#         - model: the CosineMapper instance, which provides predict()
#         - catalogue_clean: DataFrame with a 'section_topic' column
#         """
#         self.model = model
#         self.catalogue_clean = catalogue_clean

#     def _preprocess_contract(self, text):
#         """
#         Extracts paragraphs and sections, then cleans each section.
#         """
#         fake_row = {"content": text, "contract": 1}
#         sections = extract_paragraphs_and_sections(fake_row)  # -> List[Dict]
#         for section in sections:
#             section["clean_section_content"] = clean_sections_and_paragraphs(section["section_content"])
#         return sections

#     def predict_contract(self, contract_text, return_topic_score=False):
#         """
#         contract_text: full text of a contract (string)

#         Returns: DataFrame with the fields of the extracted sections plus 'predicted_topic'
#         """
#         cont_exp = self._preprocess_contract(contract_text)
#         return self.predict_sections(cont_exp, return_topic_score=return_topic_score)

#     def predict_sections(self, cont_exp, return_topic_score=False):
#         """
#         cont_exp: list of dicts with at least 'section', 'section_content', 'clean_section_content'

#         Returns: DataFrame with all original fields plus 'predicted_topic'
#         """
#         records = []

#         for section in cont_exp:
#             cleaned = section['clean_section_content']
#             if return_topic_score:
#                 model_output = self.model.predict(cleaned, return_scores=True)

#                 top = max(model_output, key=lambda x: x[1])
#                 label = top[0]
#                 score = top[1]
#                 index = int(label) - 1 
#             else:
#                 label = self.model.predict(cleaned, return_scores=False)
#                 index = int(label) - 1
#                 score = None

#             topic = self.catalogue_clean["section_topic"].iloc[index]

#             record = {**section, "predicted_topic": topic}
#             if return_topic_score:
#                 record["score"] = score

#             records.append(record)

#         return pd.DataFrame(records)
    


In [None]:

top_scores = []
top_sections = []
top_indexes = []
print(catalogue_clean)
print(exp["content"])
predictor = SectionTopicPredictor(model, catalogue_clean)

for i in contracts_without_labels:

    exp = contracts[contracts["contract"] == i] 
    result_df = predictor.predict_contract(exp["content"].values[0], return_topic_score=True)

    # Show the best-scoring section (optional)
    sorted_df = result_df.sort_values("score", ascending=False)
    print(sorted_df.head(1))

    top_score = sorted_df.iloc[0]["score"]
    top_section = sorted_df.iloc[0]["section"]
    index_topscore = sorted_df.index[0]  # row index of the top-scoring section
    top_scores.append(top_score)
    top_sections.append(top_section)
    top_indexes.append(index_topscore)




In [None]:
print(contracts_without_labels)
print(top_scores)
print(top_sections)
print(top_indexes)

In [None]:
exp = contracts[contracts["contract"] == 16]
text = exp["content"].values[0]
predicted_df = predictor.predict_contract(text, return_topic_score=True)
# print(predicted_df)

# A good combination is, e.g., contract 26, section 73
random_index = 35  # random.randint(0, len(predicted_df) - 1)

topic = predicted_df.loc[random_index, "predicted_topic"]
content = predicted_df.loc[random_index, "section_content"]
score = predicted_df.loc[random_index, "score"]

print("🔹 Predicted Topic:\n", topic)
print("\n🔸 Section Content:\n", content)
# Print the score label in green
print("\033[92m🔥 Score:\033[0m", score)



In [None]:
# Works well: contract 26, index 68
random_index

In [None]:
print(predicted_df)
results_df = predicted_df.sort_values("score")
catalogue = pd.read_excel("../data/catalogue_clean_mit_aspects.xlsx")
print(catalogue)

print(results_df[["predicted_topic","section_content","score"]].head(5))

In [None]:

import time
from openai import OpenAI

client = OpenAI(api_key=OpenAiKey)

def check_core_aspects_with_llm(section_text, core_aspects, model="gpt-4o-mini", sleep_between_calls=1.5):
    aspects_list = "\n- " + "\n- ".join(core_aspects)
    prompt = f"""Du bist ein Vertragsexperte. Prüfe den folgenden Vertragstext auf die Einhaltung der folgenden Kernanforderungen (Core Aspects).

Gib als Ergebnis für jeden einzelnen Punkt einen Erfüllungsgrad von 0 bis 1 an (0 = nicht erfüllt, 1 = voll erfüllt, 0.5 = teilweise erfüllt). Gib zusätzlich eine durchschnittliche Erfüllungsquote in Prozent für alle Core Aspects an.

Vertragstext:
{section_text}

Core Aspects:{aspects_list}

Antwortformat (nur JSON):
{{
"core_aspect_scores": {{
    "Aspekt 1": 1,
    "Aspekt 2": 0.5
}},
"average_fulfillment_percent": 75.0
}}"""

    try:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.2
        )
        return response.choices[0].message.content
    except Exception as e:
        print("API error:", e)
        return None
    finally:
        time.sleep(sleep_between_calls)


In [None]:

print(predicted_df.columns)
print(catalogue.columns)

df_eval = predicted_df.merge(
    catalogue[["section_topic", "core_aspects"]],
    left_on="predicted_topic",
    right_on="section_topic",
    how="left")

# df_eval = df_eval[["contract", "paragraph", "section", "clean_section_content", "predicted_topic", "core_aspects"]]
df_eval = df_eval[["contract", "paragraph", "section", "section_content", "predicted_topic","score", "core_aspects"]]



print(df_eval.head() )

In [None]:
import random
import json
import re


# Evaluation function
def evaluate_llm(row):
    section_text = row["section_content"]  # section_text = row["clean_section_content"]
    aspects = [line.strip() for line in str(row["core_aspects"]).split("\n") if line.strip()]
    raw_response = check_core_aspects_with_llm(section_text, aspects)
    # Try to extract the bare JSON object from the response
    try:
        # The response may wrap the JSON in ```json ... ``` or another Markdown block
        match = re.search(r"{.*}", raw_response, re.DOTALL)
        if match:
            cleaned_json = match.group(0)
            print("✅ LLM response:", cleaned_json)
            return json.loads(cleaned_json)
        else:
            raise ValueError("No JSON block found.")
    except Exception:
        # Also reached when the API call failed and raw_response is None
        print("❌ Parsing error. Response was:", raw_response)
        return {"core_aspect_scores": {}, "average_fulfillment_percent": None}


# Run the LLM evaluation
# Optional: restrict the evaluation to a random sample of up to 20 rows to limit API calls
random_indices = random.sample(range(len(df_eval)), min(20, len(df_eval)))
df_eval_subset = df_eval  # .iloc[random_indices].copy()

df_eval_subset["llm_eval_result"] = df_eval_subset.apply(evaluate_llm, axis=1)
df_eval.loc[df_eval_subset.index, "llm_eval_result"] = df_eval_subset["llm_eval_result"]

df_eval["core_aspect_scores"] = df_eval["llm_eval_result"].apply(
    lambda x: x.get("core_aspect_scores", {}) if isinstance(x, dict) else {}
)
df_eval["average_fulfillment_percent"] = df_eval["llm_eval_result"].apply(
    lambda x: x.get("average_fulfillment_percent") if isinstance(x, dict) else None
)


In [None]:

# Show only rows where "average_fulfillment_percent" is not None
top_rows = df_eval[df_eval["average_fulfillment_percent"].notnull()].sort_values("average_fulfillment_percent", ascending=False).head()
display(top_rows)
if not top_rows.empty:
    first_row = top_rows.iloc[0]
    print(f'content: {first_row["section_content"]}')
    print(f'map to : {first_row["predicted_topic"]}')
    print(f'core aspects: {first_row["core_aspects"]}')
    print(f'average fulfillment percent: {first_row["average_fulfillment_percent"]}')
    print(f'core aspect scores: {first_row["core_aspect_scores"]}')
# Optionally save the evaluated results
df_eval.to_pickle("../data/llm_eval_result.pkl")
df_eval.to_excel("../data/llm_eval_result.xlsx", index=False)
# preview first few results
#display(top_rows[["section_content", "core_aspect_scores", "average_fulfillment_percent"]].head())