# DXF Parser Module – Structural Geometry Pipeline
**Author**: Thaddeus da Silva Correa

**Project**: Automated Extraction and Interpretation of Structural Geometry from CAD Drawings for BIM Integration  
**Module**: 1 of 4 – DXF Parser  
**Environment**: Google Colab  
**Last updated**: June 2025

---

This module reads DXF drawings and extracts geometric primitives into a structured intermediate format (`geometry.json`). It detects lines, arcs, polylines, and text, reconstructs closed profiles, and classifies them for further 3D interpretation.

**Inputs**: DXF files  
**Outputs**: geometry JSONs with raw edges, closed chains, and metadata  



## 1. Setup  
Import necessary libraries and define file system paths.



In [148]:
# Install required library for DXF parsing
!pip install ezdxf

# Core libraries
import ezdxf
import json
import os
import math
from pathlib import Path
from collections import defaultdict, Counter

# Geometry & math utilities
import numpy as np
from scipy.spatial import KDTree
import re
from shapely.geometry import Polygon, Point




## 2. Core Parsing Functions

In this section, we define the core functions used to parse and process the geometry from DXF files. These functions are organized into the following categories:

- **A. Geometry Utilities**
- **B. DXF Entity Parsers**
- **C. Chain Builder**
- **D. Cutout Detection**
- **E. Profile Classification & Feature Grouping**
- **F. DXF Entity Handler**
- **G. Batch Processing & Layer Summary**
- **H. Metadata Tagging**



### A. Geometry Utilities
Basic point math: distance, rounding, centroid, and polygon area.


In [149]:
TOLERANCE = 0.1  # Default spatial tolerance in mm


def round_point(pt, tol=TOLERANCE):
    """
    Round a 2D or 3D point to the nearest grid defined by the given tolerance.

    Args:
        pt (tuple or list): Input point (x, y) or (x, y, z)
        tol (float): Tolerance grid spacing

    Returns:
        tuple: Rounded point
    """
    return tuple([round(coord / tol) * tol for coord in pt])


def point_distance(p1, p2):
    """
    Calculate Euclidean distance between two 2D or 3D points.

    Args:
        p1, p2 (tuple or list): Points (x, y) or (x, y, z)

    Returns:
        float: Euclidean distance
    """
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2)))


def compute_centroid(points):
    """
    Compute the centroid (center of mass) of a set of 2D or 3D points.

    Args:
        points (list): List of points, each being (x, y) or (x, y, z)

    Returns:
        list: Centroid point as [x, y, z]
    """
    if not points:
        return [0, 0, 0]

    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    cz = sum((p[2] if len(p) > 2 else 0) for p in points) / n
    return [cx, cy, cz]


def polygon_area(points):
    """
    Compute the unsigned area of a polygon using the shoelace formula.

    Args:
        points (list): List of (x, y) points forming the polygon

    Returns:
        float: Absolute value of area
    """
    area = 0.0
    n = len(points)
    for i in range(n):
        x0, y0 = points[i][:2]
        x1, y1 = points[(i + 1) % n][:2]
        area += (x0 * y1) - (x1 * y0)
    return abs(area) / 2.0
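
The shoelace sum is signed before `abs()` is applied: vertices listed counter-clockwise give a positive value, clockwise vertices a negative one. The sketch below illustrates this with a hypothetical `signed_area` helper, which is not part of this module and is shown only to make the sign convention visible.

```python
def signed_area(points):
    """Shoelace formula without abs(): positive for counter-clockwise loops."""
    total = 0.0
    n = len(points)
    for i in range(n):
        x0, y0 = points[i][:2]
        x1, y1 = points[(i + 1) % n][:2]
        total += x0 * y1 - x1 * y0
    return total / 2.0


# A 4 x 3 rectangle listed counter-clockwise, then the same loop reversed
ccw_rect = [(0, 0), (4, 0), (4, 3), (0, 3)]
cw_rect = list(reversed(ccw_rect))

area_ccw = signed_area(ccw_rect)  # positive
area_cw = signed_area(cw_rect)    # negative, same magnitude
```

Because `polygon_area` discards the sign, winding direction has to be determined separately if a later stage needs it.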


### B. DXF Entity Parsers  
Functions to parse `LINE`, `ARC`, `LWPOLYLINE`, `POLYLINE`, and `SPLINE` DXF entities into segments or chains of lines.


In [150]:
def parse_line(entity):
    """
    Parse a DXF LINE entity into a 2-point edge.

    Args:
        entity (ezdxf.entities.Line): LINE entity from DXF.

    Returns:
        dict: Edge dictionary with start/end coordinates and layer.
    """
    return {
        "type": "line",
        "start": list(entity.dxf.start),
        "end": list(entity.dxf.end),
        "layer": entity.dxf.layer
    }


def parse_arc(entity, min_segments=8, max_segments=64, config=None):
    """
    Approximate a DXF ARC entity as a polyline (list of straight segments).

    Args:
        entity (ezdxf.entities.Arc): ARC entity.
        min_segments (int): Minimum number of segments for approximation.
        max_segments (int): Maximum number of segments for approximation.
        config (dict): Optional config dictionary for arc segmentation control.

    Returns:
        tuple:
            - list of line segments approximating the arc
            - arc metadata dictionary (center, radius, start/end angles)
    """
    center = entity.dxf.center
    radius = entity.dxf.radius
    start_angle_deg = entity.dxf.start_angle
    end_angle_deg = entity.dxf.end_angle

    start_angle = math.radians(start_angle_deg)
    end_angle = math.radians(end_angle_deg)

    sweep = (end_angle - start_angle) % (2 * math.pi)
    arc_length = radius * sweep

    seg_per_180 = config.get("arc_segments_per_180", 20) if config else 20
    segment_length = max(arc_length / seg_per_180, 0.5)
    num_segments = int(max(min_segments, min(max_segments, arc_length / segment_length)))

    points = []
    for i in range(num_segments + 1):
        angle = start_angle + sweep * i / num_segments
        pt = [
            center[0] + radius * math.cos(angle),
            center[1] + radius * math.sin(angle),
            center[2]
        ]
        points.append(pt)

    arc_meta = {
        "center": list(center),
        "radius": radius,
        "start_angle": start_angle_deg,
        "end_angle": end_angle_deg,
        "layer": entity.dxf.layer
    }

    segments = [
        {
            "type": "line",
            "start": points[i],
            "end": points[i + 1],
            "layer": entity.dxf.layer,
            "source": "arc",
            "arc_meta": arc_meta
        }
        for i in range(num_segments)
    ]

    return segments, arc_meta


def parse_lwpolyline(entity):
    """
    Parse a DXF LWPOLYLINE entity into a list of connected line segments.

    Args:
        entity (ezdxf.entities.LWPolyline): Lightweight polyline.

    Returns:
        list: List of edge dictionaries.
    """
    pts = [[p[0], p[1], 0] for p in entity.get_points()]
    edges = [{"type": "line", "start": pts[i], "end": pts[i + 1], "layer": entity.dxf.layer} for i in range(len(pts) - 1)]
    if entity.closed:
        edges.append({"type": "line", "start": pts[-1], "end": pts[0], "layer": entity.dxf.layer})
    return edges


def parse_polyline(entity, closure_tolerance=1e-2):
    """
    Parse a legacy POLYLINE entity into edge segments. Auto-closes if points are near each other.

    Args:
        entity (ezdxf.entities.Polyline): POLYLINE entity.
        closure_tolerance (float): Distance to consider implicit closure.

    Returns:
        list: List of edge dictionaries.
    """
    pts = [list(vertex.dxf.location) for vertex in entity.vertices]
    edges = [{"type": "line", "start": pts[i], "end": pts[i + 1], "layer": entity.dxf.layer} for i in range(len(pts) - 1)]

    first, last = pts[0], pts[-1]
    is_explicitly_closed = entity.is_closed
    is_implicitly_closed = point_distance(first, last) < closure_tolerance

    if is_explicitly_closed or is_implicitly_closed:
        edges.append({"type": "line", "start": last, "end": first, "layer": entity.dxf.layer})

    return edges


def parse_spline(entity, max_distance=0.1):
    """
    Approximate a SPLINE entity as straight-line segments using flattening.

    Args:
        entity (ezdxf.entities.Spline): SPLINE entity.
        max_distance (float): Maximum allowed deviation between the curve and
            its chord approximation, in drawing units.

    Returns:
        list: List of segment dictionaries approximating the spline.
    """
    try:
        # Spline.flattening() takes a distance tolerance, not an angle step
        points = list(entity.flattening(distance=max_distance))
    except Exception:
        # Malformed splines are skipped rather than aborting the whole parse
        return []

    return [
        {
            "type": "line",
            "start": list(points[i]),
            "end": list(points[i + 1]),
            "layer": entity.dxf.layer,
            "source": "spline"
        }
        for i in range(len(points) - 1)
    ]


def arc_to_edges(arc, min_segments=8, max_segments=64):
    """
    Convert arc dictionary (from metadata) to segmented line edges.

    Args:
        arc (dict): Arc metadata with center, radius, angles, and layer.
        min_segments (int): Minimum number of segments.
        max_segments (int): Maximum number of segments.

    Returns:
        list: List of line segments approximating the arc.
    """
    center = arc["center"]
    radius = arc["radius"]
    start_angle = math.radians(arc["start_angle"])
    end_angle = math.radians(arc["end_angle"])
    sweep = (end_angle - start_angle) % (2 * math.pi)
    arc_length = radius * sweep
    segment_length = max(arc_length / 20, 0.5)
    num_segments = int(max(min_segments, min(max_segments, arc_length / segment_length)))

    points = [
        [
            center[0] + radius * math.cos(start_angle + sweep * i / num_segments),
            center[1] + radius * math.sin(start_angle + sweep * i / num_segments),
            0
        ]
        for i in range(num_segments + 1)
    ]

    return [
        {
            "type": "line",
            "start": points[i],
            "end": points[i + 1],
            "layer": arc.get("layer", ""),
            "source": "hatch_arc"
        }
        for i in range(num_segments)
    ]
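
Both arc routines rely on the modulo in the sweep calculation to handle arcs that cross the 0° direction, for example from 350° to 10°. The following is a minimal sketch of that sampling step in isolation; `sample_arc` is a hypothetical name and the fixed segment count is an assumption made for brevity.

```python
import math


def sample_arc(center, radius, start_deg, end_deg, num_segments):
    """Return num_segments + 1 points along a counter-clockwise arc."""
    start = math.radians(start_deg)
    # The modulo keeps the sweep positive even when the arc crosses 0 degrees
    sweep = (math.radians(end_deg) - start) % (2 * math.pi)
    return [
        (
            center[0] + radius * math.cos(start + sweep * i / num_segments),
            center[1] + radius * math.sin(start + sweep * i / num_segments),
        )
        for i in range(num_segments + 1)
    ]


# Arc from 350 deg to 10 deg: a 20 deg sweep through the positive x-axis
pts = sample_arc(center=(0.0, 0.0), radius=10.0, start_deg=350, end_deg=10, num_segments=4)
mid = pts[2]  # halfway along the arc, close to (10, 0)
```

Without the modulo, the same input would produce a negative sweep of 340° and trace the complementary arc in the opposite direction.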


### C. Chain Builder  
Group edges into continuous chains and detect closed profiles such as extrusions or revolutions. Chains are created by connecting segments whose snapped endpoints coincide, and closed chains are then classified heuristically.


In [151]:
def build_chains(edges, tol=TOLERANCE):
    """
    Group edge segments into continuous chains. Detects closed loops (profiles) and classifies them as extrusion or revolution.

    Args:
        edges (list): List of raw or arc-derived edge dictionaries.
        tol (float): Tolerance for point snapping and matching.

    Returns:
        list: List of chain dictionaries with metadata (type, closed status, arc info).
    """
    snapped_edges = []
    point_to_edges = defaultdict(list)
    all_points = set()

    # Snap points and build lookup
    for e in edges:
        s = round_point(e["start"], tol)
        t = round_point(e["end"], tol)
        if s == t or math.dist(s, t) < 1e-6:
            continue

        edge_idx = len(snapped_edges)
        snapped_edges.append({"start": s, "end": t, "orig": e})
        point_to_edges[s].append(edge_idx)
        point_to_edges[t].append(edge_idx)
        all_points.add(s)
        all_points.add(t)

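    # Nothing to chain (e.g. a drawing that contains only circles or text)
    if not snapped_edges:
        return []

    point_list = list(all_points)
    point_idx_map = {pt: i for i, pt in enumerate(point_list)}
    kdtree = KDTree(point_list)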

    visited = set()
    chains = []

    # Try to grow chains from unvisited segments
    for i, edge in enumerate(snapped_edges):
        if i in visited:
            continue

        chain = [edge["start"], edge["end"]]
        visited.add(i)

        def extend_chain(from_start):
            """Helper to extend chain forward or backward via nearest point."""
            extended = True
            while extended:
                extended = False
                ref_point = chain[0] if from_start else chain[-1]
                dist, idx = kdtree.query(ref_point, distance_upper_bound=tol + 1e-6)
                if dist >= tol or idx >= len(point_list):
                    return
                nearest_pt = point_list[idx]

                for j in point_to_edges.get(nearest_pt, []):
                    if j in visited:
                        continue
                    e = snapped_edges[j]
                    if nearest_pt == e["start"]:
                        next_pt = e["end"]
                    elif nearest_pt == e["end"]:
                        next_pt = e["start"]
                    else:
                        continue

                    if from_start:
                        chain.insert(0, next_pt)
                    else:
                        chain.append(next_pt)
                    visited.add(j)
                    extended = True
                    break

        extend_chain(from_start=False)
        extend_chain(from_start=True)

        # Compute metadata
        is_closed = point_distance(chain[0], chain[-1]) < tol
        length = sum(point_distance(chain[k], chain[k + 1]) for k in range(len(chain) - 1))

        # Gather arc metadata
        chain_segment_indices = [
            idx for idx, seg in enumerate(snapped_edges)
            if seg["start"] in chain and seg["end"] in chain
        ]
        source_types = set(
            snapped_edges[i]["orig"].get("source", "manual")
            for i in chain_segment_indices
            if "orig" in snapped_edges[i]
        )
        arc_metas = [
            snapped_edges[i]["orig"].get("arc_meta")
            for i in chain_segment_indices
            if "arc_meta" in snapped_edges[i]["orig"]
        ]
        arc_summary = None
        if arc_metas:
            arc_summary = {
                "count": len(arc_metas),
                "avg_radius": round(sum(a["radius"] for a in arc_metas) / len(arc_metas), 3),
                "layers": list(set(a["layer"] for a in arc_metas))
            }

        layer = snapped_edges[i]["orig"].get("layer", "")
        chain_type = "open"

        # Heuristic classification (extrusion vs revolution)
        if is_closed:
            arc_count = len(arc_metas)
            arc_ratio = arc_count / max(1, len(chain))
            layer_lower = layer.lower()
            is_probably_revolution = (
                arc_count >= 1 and (
                    arc_ratio > 0.25 or
                    len(chain) <= 4 or
                    "schraffur" in layer_lower or
                    (arc_summary and arc_summary["avg_radius"] < 15)
                )
            )
            if not is_probably_revolution and arc_count == 0:
                if 3 <= len(chain) <= 6 and length < 30:
                    xs = [p[0] for p in chain]
                    ys = [p[1] for p in chain]
                    bbox_width = max(xs) - min(xs)
                    bbox_height = max(ys) - min(ys)
                    aspect_ratio = bbox_width / max(bbox_height, 1e-3)
                    if 0.8 <= aspect_ratio <= 1.2:
                        is_probably_revolution = True

            chain_type = "revolution" if is_probably_revolution else "extrusion"

        chain_data = {
            "points": chain,
            "is_closed": is_closed,
            "length": round(length, 2),
            "is_cutout": False,
            "repaired": is_closed,
            "type": chain_type,
            "role": "profile",
            "layer": layer,
            "source_types": list(source_types),
            "contains_arc": "arc" in source_types
        }
        if len(source_types) == 1:
            chain_data["source"] = list(source_types)[0]
        if arc_summary:
            chain_data["arc_meta_summary"] = arc_summary

        chains.append(chain_data)

    # Attempt to merge adjacent open chains (end of one meets start of another)
    merged = True
    while merged:
        merged = False
        for chain in chains:
            if chain["is_closed"]:
                continue
            for other_chain in chains:
                if other_chain is chain or other_chain["is_closed"]:
                    continue
                if point_distance(chain["points"][-1], other_chain["points"][0]) < tol:
                    # Drop the shared joint point so it is not duplicated
                    chain["points"].extend(other_chain["points"][1:])
                    chain["length"] = round(chain["length"] + other_chain["length"], 2)
                    chain["is_closed"] = point_distance(chain["points"][0], chain["points"][-1]) < tol
                    # Rebuild the list instead of mutating it mid-iteration
                    chains = [c for c in chains if c is not other_chain]
                    merged = True
                    break
            if merged:
                break

    return chains


def is_valid_edge(e, tol=1e-3):
    """
    Filter out degenerate or zero-length edges.

    Args:
        e (dict): Edge dictionary with start/end points.
        tol (float): Minimum distance allowed.

    Returns:
        bool: True if edge is valid.
    """
    return point_distance(e["start"], e["end"]) >= tol
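
The core idea of the chain builder, snapping endpoints to a grid and then walking from edge to edge through shared endpoints, can be sketched independently of the KD-tree and metadata handling. The version below is a simplified illustration under stated assumptions: `snap` and `walk_chain` are hypothetical names, points are 2D, grid cells are stored as integer indices, and the walk only extends forward from the first edge.

```python
from collections import defaultdict


def snap(pt, tol=0.1):
    """Map a 2D point to integer grid indices so nearby endpoints share a key."""
    return tuple(round(c / tol) for c in pt[:2])


def walk_chain(edges, tol=0.1):
    """Follow connected edges from the first edge until no unvisited neighbour remains."""
    adjacency = defaultdict(list)
    for idx, (a, b) in enumerate(edges):
        adjacency[snap(a, tol)].append(idx)
        adjacency[snap(b, tol)].append(idx)

    visited = {0}
    chain = [snap(edges[0][0], tol), snap(edges[0][1], tol)]
    extended = True
    while extended:
        extended = False
        for idx in adjacency[chain[-1]]:
            if idx in visited:
                continue
            a, b = (snap(p, tol) for p in edges[idx])
            # Append whichever endpoint is not the current chain end
            chain.append(b if a == chain[-1] else a)
            visited.add(idx)
            extended = True
            break
    return chain, chain[0] == chain[-1]


# Unit square whose corners are off by a few hundredths of a millimetre
square = [
    ((0.00, 0.00), (1.00, 0.01)),
    ((1.01, 0.00), (1.00, 1.00)),
    ((0.99, 1.02), (0.00, 1.00)),
    ((0.01, 0.99), (0.00, 0.01)),
]
chain, is_closed = walk_chain(square)
```

Here the four slightly misaligned edges snap onto the same four grid cells, so the walk returns to its starting cell and the loop is reported as closed. Endpoints that fall on opposite sides of a grid boundary would not snap together, which is a known limitation of grid rounding in general.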


### D. Cutout Detection  
Detect cutouts (holes) in the geometry by analyzing circular features and nesting relationships. This includes round holes (circles or revolutions) as well as enclosed polygonal cutouts.


In [152]:
def detect_cutouts(chains, circles=None, min_radius=1.0, max_radius=50.0):
    """
    Identify cutouts (holes) in geometry based on circular approximation or spatial nesting.

    Args:
        chains (list): List of chain dictionaries (each with points and metadata).
        circles (list, optional): List of circle-like shapes with radius.
        min_radius (float): Minimum radius to qualify as a cutout.
        max_radius (float): Maximum radius to qualify as a cutout.

    Returns:
        dict: Summary of detected cutouts: {'chains': int, 'circles': int}
    """

    def estimate_radius(points, centroid):
        """Estimate radius of a loop based on average distance to centroid."""
        return sum(math.dist(p[:2], centroid[:2]) for p in points) / len(points)

    def get_bbox(points):
        """Compute 2D bounding box from list of points."""
        xs, ys = zip(*[p[:2] for p in points])
        return min(xs), min(ys), max(xs), max(ys)

    def bbox_inside(inner_bbox, outer_bbox):
        """Check if inner bounding box is fully inside the outer one."""
        return (inner_bbox[0] >= outer_bbox[0] and
                inner_bbox[1] >= outer_bbox[1] and
                inner_bbox[2] <= outer_bbox[2] and
                inner_bbox[3] <= outer_bbox[3])

    def point_in_polygon(point, polygon):
        """Ray casting algorithm to determine if point is inside polygon."""
        x, y = point[:2]
        inside = False
        n = len(polygon)
        px, py = zip(*[p[:2] for p in polygon])
        j = n - 1
        for i in range(n):
            if ((py[i] > y) != (py[j] > y)) and \
               (x < (px[j] - px[i]) * (y - py[i]) / ((py[j] - py[i]) + 1e-9) + px[i]):
                inside = not inside
            j = i
        return inside

    detected = {"chains": 0, "circles": 0}

    # Pass 1: Detect circular chains based on shape and radius uniformity
    for chain in chains:
        if not chain.get("is_closed") or len(chain.get("points", [])) < 3:
            continue

        points = chain["points"]
        centroid = compute_centroid(points)
        estimated_radius = estimate_radius(points, centroid)
        max_dev = max(abs(math.dist(p[:2], centroid[:2]) - estimated_radius) for p in points)

        if min_radius <= estimated_radius <= max_radius and max_dev < 1.0:
            chain["is_cutout"] = True
            chain["role"] = "cutout"
            detected["chains"] += 1

    # Pass 2: Detect polygonal cutouts nested inside larger shapes
    closed_chains = [c for c in chains if c.get("is_closed")]
    for i, inner in enumerate(closed_chains):
        if inner.get("is_cutout"):
            continue

        inner_bbox = get_bbox(inner["points"])
        for j, outer in enumerate(closed_chains):
            if i == j or outer.get("is_cutout"):
                continue

            outer_bbox = get_bbox(outer["points"])
            if bbox_inside(inner_bbox, outer_bbox):
                centroid = compute_centroid(inner["points"])
                if point_in_polygon(centroid, outer["points"]):
                    inner["is_cutout"] = True
                    inner["role"] = "cutout"
                    detected["chains"] += 1
                    break
                elif point_in_polygon(inner["points"][0], outer["points"]):
                    inner["is_cutout"] = True
                    inner["role"] = "cutout"
                    detected["chains"] += 1
                    break

    # Pass 3: Detect circle primitives directly
    if circles:
        for circle in circles:
            radius = circle.get("radius", 0)
            if min_radius <= radius <= max_radius:
                circle["is_cutout"] = True
                circle["role"] = "cutout"
                detected["circles"] += 1

    return detected
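
Pass 1 treats a closed chain as circular when all of its vertices lie at nearly the same distance from the vertex centroid. The sketch below reproduces that measurement on three test shapes; `radial_deviation` is a hypothetical helper written for this illustration, and the shapes are made-up inputs rather than data from a real drawing.

```python
import math


def radial_deviation(points):
    """Return (mean radius, max deviation from it), measured from the vertex centroid."""
    cx = sum(p[0] for p in points) / len(points)
    cy = sum(p[1] for p in points) / len(points)
    radii = [math.hypot(p[0] - cx, p[1] - cy) for p in points]
    mean_r = sum(radii) / len(radii)
    return mean_r, max(abs(r - mean_r) for r in radii)


# 16-sided polygon inscribed in a circle of radius 5
polygon_16 = [
    (5 * math.cos(2 * math.pi * k / 16), 5 * math.sin(2 * math.pi * k / 16))
    for k in range(16)
]
# L-shaped profile: vertex distances to the centroid vary strongly
l_shape = [(0, 0), (20, 0), (20, 5), (5, 5), (5, 20), (0, 20)]
# Plain rectangle: all four corners are equidistant from the centroid
rectangle = [(0, 0), (20, 0), (20, 4), (0, 4)]

circle_r, circle_dev = radial_deviation(polygon_16)
_, l_dev = radial_deviation(l_shape)
_, rect_dev = radial_deviation(rectangle)
```

The tessellated circle shows almost no deviation and the L-shape a large one, as intended. The rectangle also shows zero deviation, because a vertex-only test cannot tell four equidistant corners from a circle, so small rectangular profiles may need an additional check such as a minimum vertex count.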


### E. Profile Classification & Feature Grouping  
Classify profile chains into extrusion or revolution types, and associate auxiliary geometry such as centerlines, tags, and dimensions for downstream processing.


In [153]:
def classify_profiles(chains, geometry=None):
    """
    Classify closed chains as 'extrusion' or 'revolution' based on shape.
    Also identifies centerlines and links semantic tags to geometry.

    Args:
        chains (list): List of chain dictionaries.
        geometry (dict, optional): DXF geometry with semantic_tags and circles for linking.
    """
    open_lines = [c for c in chains if not c["is_closed"] and len(c["points"]) == 2]

    for chain in chains:
        if not chain["is_closed"]:
            continue

        # Compute centroid of the profile
        pts = chain["points"]
        cx = sum(p[0] for p in pts) / len(pts)
        cy = sum(p[1] for p in pts) / len(pts)
        centroid = (cx, cy)
        chain["centroid"] = centroid

        # Determine shape type based on radial uniformity
        rad_tol = 0.3
        distances = [math.sqrt((p[0] - cx) ** 2 + (p[1] - cy) ** 2) for p in pts]
        if max(distances) - min(distances) < rad_tol:
            chain["type"] = "revolution"
        else:
            chain["type"] = "extrusion"

        # Tag likely centerlines by layer name or proximity to revolution center
        if any(word in chain["layer"].lower() for word in ["axis", "center", "achse", "achsen"]):
            chain["role"] = "centerline"
            continue

        if chain["type"] == "revolution":
            for line in open_lines:
                a, b = line["points"]
                x1, y1 = a[:2]
                x2, y2 = b[:2]
                x0, y0 = centroid
                num = abs((y2 - y1) * x0 - (x2 - x1) * y0 + x2 * y1 - y2 * x1)
                den = math.sqrt((y2 - y1)**2 + (x2 - x1)**2)
                dist_to_line = num / den if den > 1e-6 else float('inf')
                if dist_to_line < 0.3:
                    line["role"] = "centerline"
                    break

    # Optional: Link semantic tags to chains or circles by proximity
    if geometry and "semantic_tags" in geometry:
        chain_centroids = [
            (i, c["centroid"]) for i, c in enumerate(chains) if c.get("is_closed") and "centroid" in c
        ]
        circle_list = geometry.get("circles", [])

        for tag in geometry["semantic_tags"]:
            tx, ty = tag["position"][:2]
            best = None
            best_dist = float("inf")

            for idx, (cx, cy) in chain_centroids:
                dist = ((tx - cx)**2 + (ty - cy)**2) ** 0.5
                if dist < best_dist:
                    best = {"type": "chain", "index": idx, "distance": dist}
                    best_dist = dist

            for i, circ in enumerate(circle_list):
                cx, cy = circ["center"][:2]
                dist = ((tx - cx)**2 + (ty - cy)**2) ** 0.5
                if dist < best_dist:
                    best = {"type": "circle", "index": i, "distance": dist}
                    best_dist = dist

            if best and best["distance"] < 15:
                tag["linked_geometry"] = best

def group_features(geometry, proximity_thresh=20):
    """
    Group nearby features such as holes, semantic tags, and dimensions into part-level features.

    Args:
        geometry (dict): Parsed geometry containing holes, semantic_tags, and dimensions.
        proximity_thresh (float): Max distance to associate tags/dimensions with holes.

    Returns:
        list: Feature groups (hole with associated tags/dimensions).
    """
    features = []
    used_tags = set()
    used_dims = set()

    for i, hole in enumerate(geometry.get("holes", [])):
        hx, hy = hole["center"][:2]
        hole_feature = {
            "geometry_ref": {"type": hole["source"], "index": i},
            "linked_texts": [],
            "linked_dimensions": [],
            "role": hole["role"]
        }

        # Link nearby semantic tags
        for j, tag in enumerate(geometry.get("semantic_tags", [])):
            if j in used_tags:
                continue
            tx, ty = tag["position"][:2]
            dist = ((hx - tx) ** 2 + (hy - ty) ** 2) ** 0.5
            if dist < proximity_thresh:
                hole_feature["linked_texts"].append(tag)
                used_tags.add(j)

        # Link nearby dimensions
        for k, dim in enumerate(geometry.get("dimensions", [])):
            if k in used_dims:
                continue
            dx, dy = dim["insert"][:2]
            dist = ((hx - dx) ** 2 + (hy - dy) ** 2) ** 0.5
            if dist < proximity_thresh:
                hole_feature["linked_dimensions"].append(dim)
                used_dims.add(k)

        features.append(hole_feature)

    geometry["features"] = features
    return features
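
Centerline detection in `classify_profiles` uses the perpendicular distance from a profile centroid to the line through the two endpoints of an open chain. A minimal standalone sketch of that formula follows; `distance_to_line` is a hypothetical name and the coordinates are illustrative values.

```python
import math


def distance_to_line(point, a, b):
    """Perpendicular distance from point to the infinite line through a and b."""
    (x0, y0), (x1, y1), (x2, y2) = point[:2], a[:2], b[:2]
    num = abs((y2 - y1) * x0 - (x2 - x1) * y0 + x2 * y1 - y2 * x1)
    den = math.hypot(x2 - x1, y2 - y1)
    # Degenerate (zero-length) lines never count as a match
    return num / den if den > 1e-6 else float("inf")


centroid = (5.0, 5.2)
# Horizontal line y = 5 passes close to the centroid
d_near = distance_to_line(centroid, (0.0, 5.0), (10.0, 5.0))
# Horizontal line y = 0 is far away
d_far = distance_to_line(centroid, (0.0, 0.0), (10.0, 0.0))
```

With the 0.3 threshold used in the module, only the first line would be tagged as a centerline. Note that the formula measures distance to the infinite line, not to the finite segment, so a short collinear line far from the profile would also pass the test.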

### F. DXF Entity Handler  
Read a DXF file, parse its contents by entity type, and structure the output geometry dictionary with categorized features, chains, holes, and semantic tags.
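
Layer filtering in this step is driven by two optional keys in `layer_config`: `allowed_layers` and `ignored_layers`. The sketch below isolates that decision in a hypothetical `layer_passes` helper; the layer names are invented for illustration and do not come from a real drawing.

```python
def layer_passes(layer, layer_config=None):
    """Return True if an entity on `layer` should be parsed."""
    if not layer_config:
        return True
    if "allowed_layers" in layer_config and layer not in layer_config["allowed_layers"]:
        return False
    if "ignored_layers" in layer_config and layer in layer_config["ignored_layers"]:
        return False
    return True


# Hypothetical layer names, for illustration only
layer_config = {"allowed_layers": ["KONTUR", "ACHSE"], "ignored_layers": ["ACHSE"]}
results = {name: layer_passes(name, layer_config) for name in ["KONTUR", "ACHSE", "TEXT"]}
```

When a layer appears in both lists, the ignore list wins. Matching is exact and case-sensitive, so layer names in the configuration have to be spelled as they appear in the DXF file.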


In [154]:
def parse_dxf_entities(filepath, tolerance=0.1, layer_config=None, config=None):
    """
    Parses a DXF file into a structured geometry dictionary.

    Args:
        filepath (str): Path to the DXF file to parse.
        tolerance (float): Tolerance for edge merging and chain building.
        layer_config (dict): Optional dictionary to filter layers.
        config (dict): Optional configuration dict for arc resolution, hole thresholds, etc.

    Returns:
        dict: Contains geometry metadata, parsed parts, and analysis results.
    """
    geometry = defaultdict(list)
    warnings = []
    all_edges = []
    bbox_min = [float('inf'), float('inf')]
    bbox_max = [float('-inf'), float('-inf')]

    # Load DXF
    try:
        doc = ezdxf.readfile(filepath)
    except Exception as e:
        return {
            "filename": filepath,
            "error": str(e),
            "geometry": {},
            "warnings": [str(e)]
        }

    msp = doc.modelspace()
    units_code = doc.header.get('$INSUNITS', 0)
    units_map = {0: "unitless", 1: "inches", 2: "feet", 4: "mm", 5: "cm", 6: "m"}
    units = units_map.get(units_code, "unknown")

    def update_bounds(pt):
        """Track drawing bounding box."""
        nonlocal bbox_min, bbox_max
        bbox_min[0] = min(bbox_min[0], pt[0])
        bbox_min[1] = min(bbox_min[1], pt[1])
        bbox_max[0] = max(bbox_max[0], pt[0])
        bbox_max[1] = max(bbox_max[1], pt[1])

    # Loop through all modelspace entities
    for entity in list(msp):
        try:
            # Optional layer filtering
            layer = entity.dxf.layer
            if layer_config:
                if "allowed_layers" in layer_config and layer not in layer_config["allowed_layers"]:
                    continue
                if "ignored_layers" in layer_config and layer in layer_config["ignored_layers"]:
                    continue

            # Parse entity by type
            etype = entity.dxftype()
            if etype == 'LINE':
                edge = parse_line(entity)
                if is_valid_edge(edge):
                    all_edges.append(edge)
                update_bounds(edge["start"])
                update_bounds(edge["end"])
            elif etype == 'ARC':
                arc_edges, arc_meta = parse_arc(entity, min_segments=(config or {}).get("arc_segments_per_180", 8), max_segments=64, config=config)
                all_edges.extend(e for e in arc_edges if is_valid_edge(e))
                for e in arc_edges:
                    update_bounds(e["start"])
                    update_bounds(e["end"])
                geometry["arcs"].append(arc_meta)
            elif etype == 'CIRCLE':
                center = list(entity.dxf.center)
                radius = entity.dxf.radius
                geometry["circles"].append({
                    "center": center,
                    "radius": radius,
                    "layer": layer
                })
                update_bounds([center[0] - radius, center[1] - radius])
                update_bounds([center[0] + radius, center[1] + radius])
            elif etype == 'LWPOLYLINE':
                segments = parse_lwpolyline(entity)
                all_edges.extend(e for e in segments if is_valid_edge(e))
                for e in segments:
                    update_bounds(e["start"])
                    update_bounds(e["end"])
            elif etype == 'POLYLINE':
                segments = parse_polyline(entity)
                all_edges.extend(e for e in segments if is_valid_edge(e))
                for e in segments:
                    update_bounds(e["start"])
                    update_bounds(e["end"])
            elif etype == 'SPLINE':
                segments = parse_spline(entity)
                all_edges.extend(e for e in segments if is_valid_edge(e))
                for e in segments:
                    update_bounds(e["start"])
                    update_bounds(e["end"])
            elif etype == 'TEXT':
                try:
                    txt = entity.dxf.text.strip()
                    pos = list(entity.dxf.insert)
                    role = match_text_to_role(txt)

                    geometry["texts"].append({
                        "text": txt,
                        "insert": pos,
                        "rotation": entity.dxf.rotation,
                        "height": entity.dxf.height,
                        "layer": layer
                    })
                    if role:
                        geometry.setdefault("semantic_tags", []).append({
                            "text": txt,
                            "position": pos,
                            "layer": layer,
                            "role": role
                        })
                except Exception as e:
                    warnings.append(f"TEXT parse error: {str(e)}")
            elif etype == 'DIMENSION':
                try:
                    txt = entity.dxf.text.strip()
                    pos = list(entity.dxf.defpoint)
                    role = match_text_to_role(txt)
                    geometry["dimensions"].append({
                        "text": txt,
                        "insert": pos,
                        "layer": layer,
                        "role": role
                    })
                    if role:
                        geometry.setdefault("semantic_tags", []).append({
                            "text": txt,
                            "position": pos,
                            "layer": layer,
                            "role": role
                        })
                except Exception as e:
                    warnings.append(f"DIMENSION parse error: {str(e)}")
            elif etype == 'HATCH':
                try:
                    for path in entity.paths:
                        edges = []
                        if path.has_edges:
                            for edge in path.edges:
                                if edge.TYPE == 'LineEdge':
                                    e = {
                                        "type": "line",
                                        # Pad to 3D (z = 0) so hatch edges match the other parsers' point format
                                        "start": list(edge.start)[:2] + [0],
                                        "end": list(edge.end)[:2] + [0],
                                        "layer": layer,
                                        "source": "hatch"
                                    }
                                    edges.append(e)
                                    update_bounds(e["start"])
                                    update_bounds(e["end"])
                                elif edge.TYPE == 'ArcEdge':
                                    arc = {
                                        "center": list(edge.center),
                                        "radius": edge.radius,
                                        "start_angle": edge.start_angle,
                                        "end_angle": edge.end_angle,
                                        "layer": layer
                                    }
                                    arc_segs = arc_to_edges(arc)
                                    edges.extend(arc_segs)
                                    for e in arc_segs:
                                        update_bounds(e["start"])
                                        update_bounds(e["end"])
                                else:
                                    warnings.append(f"HATCH unsupported edge type: {edge.TYPE}")
                        elif path.path_type_flags & 2:
                            pts = [[v[0], v[1], 0] for v in path.vertices]
                            for i in range(len(pts) - 1):
                                e = {"type": "line", "start": pts[i], "end": pts[i + 1], "layer": layer}
                                edges.append(e)
                                update_bounds(e["start"])
                                update_bounds(e["end"])
                            if path.has_closed_path:
                                e = {"type": "line", "start": pts[-1], "end": pts[0], "layer": layer}
                                edges.append(e)
                                update_bounds(e["start"])
                                update_bounds(e["end"])
                        else:
                            warnings.append("HATCH path skipped: unsupported path structure")
                        all_edges.extend(e for e in edges if is_valid_edge(e))
                        geometry["hatch_boundaries"].append([e["start"] for e in edges])
                except Exception as e:
                    warnings.append(f"HATCH parse error: {str(e)}")
            elif etype == 'REGION':
                try:
                    exploded = list(entity.explode())
                    for sub_entity in exploded:
                        try:
                            sub_etype = sub_entity.dxftype()
                            if sub_etype == 'LINE':
                                e = parse_line(sub_entity)
                                if is_valid_edge(e):
                                    all_edges.append(e)
                                update_bounds(e["start"])
                                update_bounds(e["end"])
                            elif sub_etype == 'ARC':
                                # Guard against config=None, as the later threshold lookups do
                                arc_segments = (config or {}).get("arc_segments_per_180", 8)
                                arc_edges, arc_meta = parse_arc(sub_entity, min_segments=arc_segments, max_segments=64, config=config)
                                all_edges.extend(e for e in arc_edges if is_valid_edge(e))
                                for e in arc_edges:
                                    update_bounds(e["start"])
                                    update_bounds(e["end"])
                                geometry["arcs"].append(arc_meta)
                            elif sub_etype == 'ELLIPSE':
                                ellipse_meta = {
                                    "center": list(sub_entity.dxf.center),
                                    "major_axis": list(sub_entity.dxf.major_axis),
                                    "ratio": sub_entity.dxf.ratio,
                                    "start_param": sub_entity.dxf.start_param,
                                    "end_param": sub_entity.dxf.end_param,
                                    "layer": sub_entity.dxf.layer
                                }
                                geometry["ellipses"].append(ellipse_meta)
                                approx_pts = list(sub_entity.flattening(distance=0.5))
                                for i in range(len(approx_pts) - 1):
                                    e = {
                                        "type": "line",
                                        "start": list(approx_pts[i]),
                                        "end": list(approx_pts[i + 1]),
                                        "layer": sub_entity.dxf.layer,
                                        "source": "region_ellipse"
                                    }
                                    if is_valid_edge(e):
                                        all_edges.append(e)
                                    update_bounds(e["start"])
                                    update_bounds(e["end"])
                            else:
                                warnings.append(f"REGION exploded entity unsupported: {sub_etype}")
                        except Exception as e:
                            warnings.append(f"REGION sub-entity error: {str(e)}")
                except Exception as e:
                    warnings.append(f"REGION explode failed: {str(e)}")
            elif etype == 'INSERT':
                try:
                    exploded = entity.explode()
                    for sub_entity in exploded:
                        try:
                            sub_etype = sub_entity.dxftype()
                            sub_layer = sub_entity.dxf.layer
                            if layer_config:
                                if "allowed_layers" in layer_config and sub_layer not in layer_config["allowed_layers"]:
                                    continue
                                if "ignored_layers" in layer_config and sub_layer in layer_config["ignored_layers"]:
                                    continue
                            if sub_etype == 'LINE':
                                e = parse_line(sub_entity)
                                if is_valid_edge(e):
                                    all_edges.append(e)
                                update_bounds(e["start"])
                                update_bounds(e["end"])
                            elif sub_etype == 'ARC':
                                # Guard against config=None, as the later threshold lookups do
                                arc_segments = (config or {}).get("arc_segments_per_180", 8)
                                arc_edges, arc_meta = parse_arc(sub_entity, min_segments=arc_segments, max_segments=64, config=config)
                                all_edges.extend(e for e in arc_edges if is_valid_edge(e))
                                for e in arc_edges:
                                    update_bounds(e["start"])
                                    update_bounds(e["end"])
                                geometry["arcs"].append(arc_meta)
                            elif sub_etype == 'CIRCLE':
                                center = list(sub_entity.dxf.center)
                                radius = sub_entity.dxf.radius
                                geometry["circles"].append({
                                    "center": center,
                                    "radius": radius,
                                    "layer": sub_layer
                                })
                                update_bounds([center[0] - radius, center[1] - radius])
                                update_bounds([center[0] + radius, center[1] + radius])
                            elif sub_etype == 'LWPOLYLINE':
                                segments = parse_lwpolyline(sub_entity)
                                all_edges.extend(e for e in segments if is_valid_edge(e))
                                for e in segments:
                                    update_bounds(e["start"])
                                    update_bounds(e["end"])
                            elif sub_etype == 'POLYLINE':
                                segments = parse_polyline(sub_entity)
                                all_edges.extend(e for e in segments if is_valid_edge(e))
                                for e in segments:
                                    update_bounds(e["start"])
                                    update_bounds(e["end"])
                            elif sub_etype == 'SPLINE':
                                segments = parse_spline(sub_entity)
                                all_edges.extend(e for e in segments if is_valid_edge(e))
                                for e in segments:
                                    update_bounds(e["start"])
                                    update_bounds(e["end"])
                            elif sub_etype == 'ELLIPSE':
                                ellipse_meta = {
                                    "center": list(sub_entity.dxf.center),
                                    "major_axis": list(sub_entity.dxf.major_axis),
                                    "ratio": sub_entity.dxf.ratio,
                                    "start_param": sub_entity.dxf.start_param,
                                    "end_param": sub_entity.dxf.end_param,
                                    "layer": sub_layer
                                }
                                geometry.setdefault("ellipses", []).append(ellipse_meta)
                                approx_pts = list(sub_entity.flattening(distance=0.5))
                                for i in range(len(approx_pts) - 1):
                                    e = {
                                        "type": "line",
                                        "start": list(approx_pts[i]),
                                        "end": list(approx_pts[i + 1]),
                                        "layer": sub_layer,
                                        "source": "insert_ellipse"
                                    }
                                    if is_valid_edge(e):
                                        all_edges.append(e)
                                    update_bounds(e["start"])
                                    update_bounds(e["end"])
                            else:
                                warnings.append(f"INSERT exploded sub-entity unsupported: {sub_etype}")
                        except Exception as e:
                            warnings.append(f"INSERT sub-entity error: {str(e)}")
                except Exception as e:
                    warnings.append(f"INSERT explode failed: {str(e)}")
            else:
                warnings.append(f"Skipped unsupported entity: {etype}")
        except Exception as e:
            warnings.append(f"{entity.dxftype()} parse error: {str(e)}")
    # At this point every supported DXF entity type (LINE, ARC, CIRCLE, SPLINE, etc.)
    # has been folded into `all_edges`, `geometry`, and `warnings`.

    # Adaptive tolerance (optional)
    if config and config.get("adaptive_tolerance", False) and tolerance is None:
        bbox_width = bbox_max[0] - bbox_min[0]
        bbox_height = bbox_max[1] - bbox_min[1]
        bbox_diagonal = math.sqrt(bbox_width**2 + bbox_height**2)
        adaptive_tol = 0.001 * bbox_diagonal
        max_tolerance = config.get("max_tolerance", 1.0)
        tolerance = min(adaptive_tol, max_tolerance)
        min_tolerance = config.get("min_tolerance", 0.001)
        tolerance = max(tolerance, min_tolerance)
        warnings.append(f"Adaptive tolerance computed: {tolerance:.6f}")

    # Build chains and extract holes/profiles
    chains = build_chains(all_edges, tol=tolerance)
    cutout_counts = detect_cutouts(chains, geometry.get("circles", []))
    classify_profiles(chains, geometry=geometry)

    # Identify hole candidates (circular or revolution chains)
    geometry["hole_candidates"] = []
    for i, circ in enumerate(geometry.get("circles", [])):
        if circ["radius"] < 30:
            circ["hole"] = True
            geometry["hole_candidates"].append({"type": "circle", "index": i})
    for i, chain in enumerate(chains):
        if chain.get("is_closed") and chain.get("type") == "revolution":
            chain["hole"] = True
            geometry["hole_candidates"].append({"type": "chain", "index": i})

    # Classify actual holes
    geometry["holes"] = []
    hole_radius_thresh = config.get("hole_radius_threshold", 20.0) if config else 20.0
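    # Keywords are matched as substrings of the lowercased tag text, so short
    # entries such as "m" or "sw" match any text that contains those letters.
    bolt_keywords = {"anker", "bolt", "loch", "bohr", "m", "sw"}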

    for circle in geometry.get("circles", []):
        center, radius, layer = circle["center"], circle["radius"], circle["layer"]
        if radius < hole_radius_thresh:
            role = None
            nearby_tags = [
                tag for tag in geometry.get("semantic_tags", [])
                if ((tag["position"][0] - center[0])**2 + (tag["position"][1] - center[1])**2)**0.5 < 15
            ]
            for tag in nearby_tags:
                txt = tag["text"].lower()
                if any(k in txt for k in bolt_keywords):
                    role = "bolt_hole"
                    break
            if not role:
                role = "generic_hole"
            geometry["holes"].append({
                "center": center,
                "radius": radius,
                "layer": layer,
                "role": role,
                "type": "circular",
                "source": "circle"
            })

    for i, chain in enumerate(chains):
        if chain.get("hole") and chain.get("centroid") and chain.get("points"):
            center = chain["centroid"]
            points = chain["points"]
            approx_radius = sum(
                ((p[0] - center[0])**2 + (p[1] - center[1])**2)**0.5 for p in points
            ) / len(points)
            geometry["holes"].append({
                "center": center,
                "radius": approx_radius,
                "role": "generic_hole",
                "type": "circular",
                "source": "chain",
                "chain_index": i
            })

    # Final aggregation
    geometry["edges"] = all_edges
    geometry["edge_chains"] = chains
    group_features(geometry)

    # Check planarity
    z_vals = [p[2] for e in all_edges for p in (e["start"], e["end"]) if len(p) == 3]
    if z_vals:
        min_z, max_z = min(z_vals), max(z_vals)
        if abs(max_z - min_z) > 1e-3:
            warnings.append(f"Non-planar geometry detected: Z range = {min_z:.3f} to {max_z:.3f}")

    return {
        "filename": os.path.basename(filepath),
        "drawing_units": units,
        "tolerance_used": tolerance,
        "geometry": dict(geometry),
        "stats": {
            "edge_count": len(all_edges),
            "degenerate_edges": 0,
            "chain_count": len(chains),
            "open_chains": sum(not c["is_closed"] for c in chains),
            "cutouts": sum(c["is_cutout"] for c in chains),
            "revolutions": sum(c["type"] == "revolution" for c in chains),
            "extrusions": sum(c["type"] == "extrusion" for c in chains),
            "unit": units
        },
        "warnings": warnings
    }
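
The adaptive tolerance step in `parse_dxf_entities` scales the joining tolerance with the drawing's bounding-box diagonal and clamps it to a configured range. A minimal standalone sketch of that calculation follows. The function name `adaptive_tolerance` and its keyword arguments are illustrative and not part of the module; the 0.001 factor and the default limits are taken from the code above.

```python
import math


def adaptive_tolerance(bbox_min, bbox_max, factor=0.001, min_tol=0.001, max_tol=1.0):
    """Scale the tolerance with the bounding-box diagonal, clamped to [min_tol, max_tol]."""
    width = bbox_max[0] - bbox_min[0]
    height = bbox_max[1] - bbox_min[1]
    diagonal = math.hypot(width, height)
    return max(min_tol, min(factor * diagonal, max_tol))


# A 300 x 400 drawing has a 500-unit diagonal
print(adaptive_tolerance((0, 0), (300, 400)))
```

Very small drawings fall back to `min_tol` and very large ones are capped at `max_tol`, so the chain builder never receives a tolerance outside that range.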


### G. Batch Processing & Layer Summary
Process folders of DXFs and summarize layers.


In [155]:
def parse_folder(folder_path, output_path, tolerance=0.1):
    """
    Parse all DXF files in a folder and export their geometry as JSON.

    Args:
        folder_path (str): Path to input folder containing .dxf files.
        output_path (str): Path to folder where parsed .json outputs will be saved.
        tolerance (float): Tolerance value for edge joining (if not using adaptive).

    Behavior:
        - Loads optional `config.json` from the folder for parsing settings.
        - Applies layer filtering and adaptive tolerance if specified.
        - Runs `parse_dxf_entities` for each file and writes results to output.
        - Prints a brief summary per file.
    """
    folder = Path(folder_path)
    output = Path(output_path)
    output.mkdir(parents=True, exist_ok=True)

    # Load optional configuration
    config_path = folder / "config.json"
    config = {}
    if config_path.exists():
        try:
            with open(config_path) as f:
                config = json.load(f)
        except Exception as e:
            print(f"Failed to load config.json: {e}")
            config = {}

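    # Extract layer filtering config; unset keys are omitted so that downstream
    # membership tests (`layer in layer_config[...]`) never run against None
    layer_config = {
        key: config[key]
        for key in ("allowed_layers", "ignored_layers")
        if config.get(key) is not None
    }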
    use_adaptive = config.get("adaptive_tolerance", False)

    # Process each DXF file in folder
    for file in folder.glob("*.dxf"):
        tol = None if use_adaptive else tolerance
        result = parse_dxf_entities(
            str(file),
            tolerance=tol,
            layer_config=layer_config,
            config=config
        )

        geometry = result["geometry"]
        chains = geometry.get("edge_chains", [])
        circles = geometry.get("circles", [])
        geometry["chains"] = chains  # Alias for convenience

        # Recompute cutout stats
        cutout_counts = detect_cutouts(chains, circles)
        result["stats"]["cutouts"] = (
            f"{cutout_counts['chains']} chains + {cutout_counts['circles']} circles = "
            f"{cutout_counts['chains'] + cutout_counts['circles']}"
        )

        # Save JSON output
        outname = output / f"{file.stem}_geometry.json"
        with open(outname, "w") as f:
            json.dump(result, f, indent=2)

        # Print summary
        stats = result["stats"]
        print(f" {file.name}")
        # bool(...) keeps the count valid when a chain has no 'repaired' key
        print(f"    Chains: {stats['chain_count']},  Repaired: {sum(bool(c.get('repaired')) for c in chains)}")
        print(f"    Cutouts: {stats['cutouts']}")
        print(f"    Edges: {stats['edge_count']}")

        arcs = geometry.get('arcs', [])
        if arcs:
            max_radius = max(a["radius"] for a in arcs)
            print(f"    Arcs: {len(arcs)} (Max Radius: {max_radius:.2f})")

        if result["warnings"]:
            print(f"    {len(result['warnings'])} warnings")
        print()


def collect_layer_summary(folder_path):
    """
    Analyze all DXF files in a folder and summarize layer usage.

    Args:
        folder_path (str): Path to folder containing .dxf files.

    Returns:
        tuple:
            - dict with layer name -> total count across files
            - dict with filename -> set of layers used in that file
    """
    folder = Path(folder_path)
    layer_counts = Counter()
    file_layers = defaultdict(set)

    for file in folder.glob("*.dxf"):
        try:
            doc = ezdxf.readfile(file)
            msp = doc.modelspace()

            for entity in msp:
                layer = entity.dxf.layer
                layer_counts[layer] += 1
                file_layers[file.name].add(layer)

        except Exception as e:
            print(f" Could not read {file.name}: {e}")

    return dict(layer_counts), dict(file_layers)
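
The `allowed_layers` and `ignored_layers` settings that `parse_folder` passes on are applied as an include/exclude test on each entity's layer. The sketch below restates that test as a standalone predicate. `layer_passes` is a hypothetical helper name, the layer names are invented, and treating a missing or `None` list as "no restriction" is an assumption rather than documented module behavior.

```python
def layer_passes(layer, layer_config=None):
    """Return True if an entity on `layer` should be processed."""
    if not layer_config:
        return True  # no filtering configured
    allowed = layer_config.get("allowed_layers")
    ignored = layer_config.get("ignored_layers")
    if allowed is not None and layer not in allowed:
        return False
    if ignored is not None and layer in ignored:
        return False
    return True


# Hypothetical layer names, for illustration only
cfg = {"allowed_layers": ["OUTLINE", "HOLES"], "ignored_layers": ["HOLES"]}
for name in ("OUTLINE", "HOLES", "DIMS"):
    print(name, layer_passes(name, cfg))
```

In this sketch a layer listed in both sets is rejected, because the ignore test runs after the allow test.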


### H. Metadata Tagging
Map TEXT and DIMENSION annotations in the drawing to semantic roles (e.g. `bolt_height`, `thread_size`).


In [156]:
def match_text_to_role(text):
    """
    Classify a DXF TEXT annotation into a semantic role based on keywords or patterns.

    Args:
        text (str): The raw text string from the DXF entity.

    Returns:
        str or None: A role string (e.g., 'bolt_height', 'thread_size') or None if unmatched.
    """
    text_norm = text.strip().lower()

    # Direct keyword mapping
    keyword_map = {
        "ankerhöhe": "bolt_height",
        "rohlänge": "raw_length",
        "bolzenrohlänge": "raw_length",
        "bolzen-": "bolt_prefix",
        "gesamthöhe": "total_height",
        "sw": "wrench_size",
        "= prüfmaß": "inspection_flag",
        "= funktionsrelevantes merkmal": "functional_flag",
        "prüfung erfolgt mit prüflehre": "inspection_note",
        "prägestempel wechseln": "stamp_change_note",
        "achtung!": "warning_note",
    }

    for key, role in keyword_map.items():
        if text_norm.startswith(key):  # also covers an exact match
            return role

    # Pattern-based matching
    regex_map = {
        r"^m\d{1,2}$": "thread_size",             # e.g. "M6", "M12"
        r"^%%c\d+$": "symbol_code",               # e.g. "%%c12" (%%c is the DXF diameter symbol code)
        r"^hta\s*-\s*ce.*": "part_id",            # e.g. "HTA - CE1234"
        r"^\d+$": "numeric_note",                 # Just a number
        r"^\*+$": "asterisk_marker",              # A string of one or more asterisks
    }

    for pattern, role in regex_map.items():
        if re.fullmatch(pattern, text_norm):
            return role

    return None
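
To see how normalization and `re.fullmatch` interact, the snippet below reproduces three of the patterns from `regex_map` so that it runs on its own. The helper name `match_pattern_role` and the sample strings are made up for illustration and are not part of the module.

```python
import re

# Subset of the pattern table above, compiled once
PATTERNS = [
    (re.compile(r"m\d{1,2}"), "thread_size"),
    (re.compile(r"%%c\d+"), "symbol_code"),
    (re.compile(r"\d+"), "numeric_note"),
]


def match_pattern_role(text):
    """Return the role of the first pattern matching the whole normalized text."""
    text_norm = text.strip().lower()
    for pattern, role in PATTERNS:
        if pattern.fullmatch(text_norm):
            return role
    return None


for sample in ["M12", " %%C20 ", "42", "M12x1.5"]:
    print(repr(sample), "->", match_pattern_role(sample))
```

Because `fullmatch` requires the entire string to match, a combined label such as `M12x1.5` is left unclassified in this sketch.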


## 3. File Paths
Specify the source folder for DXF files and the destination for parsed geometry output.


In [157]:
# Define input and output paths
dxf_folder = Path()
output_folder = Path()

# Ensure output directory exists
output_folder.mkdir(parents=True, exist_ok=True)

## 4. Configuration Setup

Configure how the parser interprets geometry:
- Set which DXF layers to include or ignore.
- Enable adaptive tolerance to automatically adjust based on drawing size.
- Define arc segmentation granularity (segments per 180 degrees).


In [158]:
# Collect available layers in the folder
layer_counts, _ = collect_layer_summary(dxf_folder)

# Define configuration parameters
config = {
    "allowed_layers": sorted(layer_counts.keys()),  # include all layers by default
    "ignored_layers": [],                           # or selectively ignore some
    "adaptive_tolerance": True,                     # auto-adjust tolerance based on bounding box
    "max_tolerance": 1.0,                           # upper limit for adaptive tolerance
    "arc_segments_per_180": 20                      # resolution of arc approximation
}

# Save config to the same folder as the DXFs
config_path = dxf_folder / "config.json"
with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
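
When reading `config.json` back, it can help to overlay it on a set of defaults so that missing keys need no special handling downstream. A minimal sketch, assuming a hypothetical `load_config` helper and a `DEFAULTS` table whose values mirror the fallbacks used in the parsing code:

```python
import json
from pathlib import Path

# Fallback values mirroring the `config.get(...)` defaults in the parser
DEFAULTS = {
    "adaptive_tolerance": False,
    "min_tolerance": 0.001,
    "max_tolerance": 1.0,
    "arc_segments_per_180": 8,
    "hole_radius_threshold": 20.0,
}


def load_config(folder):
    """Read `config.json` from `folder` if present and overlay it on DEFAULTS."""
    merged = dict(DEFAULTS)
    path = Path(folder) / "config.json"
    if path.exists():
        with open(path, encoding="utf-8") as f:
            merged.update(json.load(f))
    return merged
```

Keys that are absent from the file keep their default, while keys present in the file (such as `arc_segments_per_180` above) take precedence.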


## 5. Execute Parser

Run the parser on all DXF files in the specified folder.
This will extract and process the geometry, then save structured JSON files to the output directory.


In [None]:
# Run batch parser on all DXFs in the input folder
parse_folder(dxf_folder, output_folder)
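
After the batch run, each output file can be inspected directly. The sketch below loads one result and returns its headline numbers. `summarize_result` is a hypothetical helper, and it relies only on the `filename`, `stats`, and `warnings` keys that `parse_dxf_entities` returns.

```python
import json
from pathlib import Path


def summarize_result(json_path):
    """Return a one-line summary of a parsed geometry file."""
    with open(json_path, encoding="utf-8") as f:
        result = json.load(f)
    stats = result.get("stats", {})
    return (
        f"{result.get('filename', Path(json_path).name)}: "
        f"{stats.get('edge_count', 0)} edges, "
        f"{stats.get('chain_count', 0)} chains, "
        f"{len(result.get('warnings', []))} warnings"
    )
```

Calling it on every path from `output_folder.glob("*_geometry.json")` would list each file written by `parse_folder`.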

---

## Final Notes

This notebook is part of a **four-step modular pipeline** for extracting and validating BIM-ready geometry from structural engineering drawings.

### Output Location
- Parsed geometry is saved to the defined `output_folder` as one `<name>_geometry.json` file per input DXF, where `<name>` is the DXF filename without its extension.

### How to Run
1. Set your `dxf_folder` and `output_folder` paths in **Section 3**.
2. Ensure all dependencies are installed (`ezdxf`, `numpy`, `scipy`, `shapely`).
3. Run all cells from top to bottom.
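
A quick way to confirm the dependencies before running the notebook is to probe for each package with the standard library. This is a sketch: `missing_modules` is a hypothetical helper, and the package list reflects the imports in Section 1.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


required = ["ezdxf", "numpy", "scipy", "shapely"]
missing = missing_modules(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```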

### Next Step
- Continue to the next notebook: `[Geometry Interpreter]`

### Documentation
For full setup instructions and pipeline details, see the [README.md](https://github.com/yourusername/thesis-geometry-pipeline) in the repository.

---
