# Floorplan Dimension Extractor

This notebook implements a Python pipeline to extract dimensions and appliance codes from a floorplan PDF.

**The process is as follows:**
1.  **Load PDF**: Read the specified page from the input PDF file.
2.  **Image Conversion**: Since the PDF is image-based, convert the page into an image format that can be processed by an OCR engine.
3.  **OCR Extraction**: Use Tesseract OCR to extract each piece of text and its bounding box from the image.
4.  **Pattern Matching**: Use regular expressions (regex) to find text that matches dimension formats (e.g., `12' 6"`, `34 1/2"`) and codes made of four digits followed by two uppercase letters (e.g., `2030SH`).
5.  **Data Conversion**: Convert the raw dimension strings into a standardized unit (float inches).
6.  **Output Generation**: Store the results in a structured JSON file and create a new visualized PDF with bounding boxes drawn on it.
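
As a rough illustration of the pattern-matching step, the sketch below sorts tokens into dimensions, codes, and everything else. The regexes and the `classify_token` helper are illustrative assumptions rather than the notebook's exact patterns; only the code shape (four digits plus two uppercase letters) mirrors the `code_pattern` defined later in the notebook.

```python
import re

# Illustrative token shapes (assumptions, not the notebook's exact regexes).
FEET_RE = re.compile(r"^\d+'(?:\s*\d+(?:\s+\d+/\d+)?\")?$")    # 12'   12' 6"   12' 6 1/2"
INCHES_RE = re.compile(r"^(?:\d+(?:\s+\d+/\d+)?|\d+/\d+)\"$")  # 34"   34 1/2"  1/2"
CODE_RE = re.compile(r"^\d{4}[A-Z]{2}$")                        # 2030SH


def classify_token(token):
    """Label a token as 'code', 'dimension', or 'other' (hypothetical helper)."""
    token = token.strip()
    if CODE_RE.match(token):
        return "code"
    if FEET_RE.match(token) or INCHES_RE.match(token):
        return "dimension"
    return "other"


samples = ["12' 6\"", '34 1/2"', "2030SH", "KITCHEN"]
labels = [classify_token(s) for s in samples]
print(list(zip(samples, labels)))
```

Anchoring the patterns with `^` and `$` keeps stray room labels from being picked up as partial matches.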

In [75]:
import fitz  # PyMuPDF
import pytesseract
import cv2
import numpy as np
import re
import json
import os
from PIL import Image

# On Windows, pytesseract may need the explicit path to the Tesseract executable
if os.name == 'nt':
    pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

In [76]:
# --- Configuration ---
INPUT_PDF_PATH = os.path.join('data', 'floorplan.pdf')
OUTPUT_DIR = 'output'
OUTPUT_JSON_PATH = os.path.join(OUTPUT_DIR, 'extracted_data.json')
OUTPUT_VISUAL_PATH = os.path.join(OUTPUT_DIR, 'visualized_floorplan.pdf')

# Create output directory if it doesn't exist
os.makedirs(OUTPUT_DIR, exist_ok=True)

## Step 1, 2 & 3: PDF to Image Conversion and OCR

First, we define a function to extract a page from the PDF and convert it into a high-resolution image (as a NumPy array) suitable for OCR. Then, we use `pytesseract` to perform OCR on this image. We use `image_to_data` to get detailed information, including the text, its coordinates (`left`, `top`, `width`, `height`), and the confidence score.
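
To make the shape of that output concrete, here is a minimal sketch that walks an `image_to_data`-style dictionary and keeps only confident, non-empty words. The `sample` dictionary is hand-written for illustration (real output also carries keys such as `level` and `line_num`), and `confident_words` with its `min_conf` threshold is a hypothetical helper. Confidence values are passed through `float()` on the assumption that they may arrive as numbers or numeric strings, depending on the pytesseract version.

```python
def confident_words(ocr_data, min_conf=40):
    """Return words above min_conf with [x0, y0, x1, y1] pixel boxes."""
    words = []
    for i, text in enumerate(ocr_data["text"]):
        text = text.strip()
        conf = float(ocr_data["conf"][i])  # -1 marks non-word layout entries
        if not text or conf <= min_conf:
            continue
        left, top = ocr_data["left"][i], ocr_data["top"][i]
        words.append({
            "text": text,
            "conf": conf,
            "bbox": [left, top,
                     left + ocr_data["width"][i],
                     top + ocr_data["height"][i]],
        })
    return words


# Hand-written sample mimicking the parallel lists in the OCR dictionary.
sample = {
    "text": ["", "12'", '6"', "smudge"],
    "conf": ["-1", "96", "91.5", "12"],
    "left": [0, 100, 150, 400],
    "top": [0, 200, 200, 50],
    "width": [0, 40, 30, 60],
    "height": [0, 20, 20, 18],
}
kept = confident_words(sample)
print(kept)
```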

In [77]:
def pdf_page_to_image(pdf_path, page_number=0, dpi=300):
    """
    Converts a specific page of a PDF to a NumPy array image.
    """
    doc = fitz.open(pdf_path)
    page = doc.load_page(page_number)
    
    # Render page to a pixmap (image)
    pix = page.get_pixmap(dpi=dpi)
    
    # Convert pixmap to a NumPy array (PyMuPDF delivers RGB or RGBA samples)
    img = np.frombuffer(pix.samples, dtype=np.uint8).reshape(pix.height, pix.width, pix.n)
    
    # OpenCV expects BGR channel order, so convert from RGB(A)
    if pix.n == 4:
        img = cv2.cvtColor(img, cv2.COLOR_RGBA2BGR)
    elif pix.n == 3:
        img = cv2.cvtColor(img, cv2.COLOR_RGB2BGR)
        
    doc.close()
    return img

def extract_text_with_bboxes(image):
    """
    Performs OCR on an image and returns text with bounding boxes.
    """
    # Use pytesseract to get structured data
    ocr_data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    return ocr_data

## Step 4 & 5: Parsing and Converting Data

This is the core logic. We define functions to:
1.  **Parse Fractions**: Convert string fractions like "1/2" to floats.
2.  **Parse Dimensions**: Use a regex to recognize the supported dimension formats (`X' Y"`, `X' Y/Z"`, `X"`, etc.) and convert them into inches.
3.  **Find Patterns**: Iterate through the OCR data, join a foot token with the inch token that follows it, and apply regex to identify and process dimensions and codes.
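
The conversion itself is plain arithmetic: feet × 12, plus whole inches, plus the fraction. The sketch below works through a few cases. It is a standalone illustration under stated assumptions, not the notebook's function: `DIM_RE` and `to_inches` are hypothetical names, the whole string must be a dimension, and `fractions.Fraction` is used to keep the arithmetic exact until the final float.

```python
import re
from fractions import Fraction

# Optional feet, optional whole inches, optional fraction, optional inch mark.
# The lookahead (?![\d/]) stops a fraction's numerator from being read as whole inches.
DIM_RE = re.compile(r"""^\s*(?:(\d+)')?\s*(?:(\d+)(?![\d/]))?\s*(?:(\d+/\d+))?\s*"?\s*$""")


def to_inches(raw):
    """Convert e.g. 12' 6 1/2" to float inches, or return None if unparseable."""
    m = DIM_RE.match(raw)
    if not m or not any(m.groups()):
        return None
    feet, whole, frac = m.groups()
    total = Fraction(0)
    if feet:
        total += int(feet) * 12
    if whole:
        total += int(whole)
    if frac:
        num, den = map(int, frac.split("/"))
        if den == 0:
            return None
        total += Fraction(num, den)
    return float(total)


for raw in ["12' 6 1/2\"", '34 1/2"', '1/2"', "5'", "KITCHEN"]:
    print(repr(raw), "->", to_inches(raw))
```

For `12' 6 1/2"` this gives 12 × 12 + 6 + 0.5 = 150.5 inches.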

In [78]:
def parse_fraction(fraction_str):
    """Converts a fraction string like '1/2' to a float."""
    if '/' in fraction_str:
        num, den = fraction_str.split('/')
        return float(num) / float(den)
    return float(fraction_str)

def convert_to_inches(raw_string):
    """
    Converts a single recognized dimension string (e.g., "12' 6 1/2\"") to float inches.
    This version is more robust.
    """
    total_inches = 0
    
    # Captures optional feet, whole inches, and a fractional inch part.
    # The lookahead keeps a fraction's numerator from being read as whole inches.
    # Example matches: 5', 6", 1/2", 6 1/2", 5' 6", 5' 1/2", 5' 6 1/2"
    pattern = re.compile(r"(?:(\d+)')?\s*(?:(\d+)(?![\d/]))?\s*(?:(\d+/\d+))?\"?")
    match = pattern.search(raw_string)

    # Every group is optional, so also reject a match that captured nothing
    if not match or not any(match.groups()):
        return 0.0

    feet, whole_inches, fraction = match.groups()

    if feet:
        total_inches += float(feet) * 12
    if whole_inches:
        total_inches += float(whole_inches)
    if fraction:
        total_inches += parse_fraction(fraction)
        
    return total_inches

def find_and_process_patterns(ocr_data):
    """
    Finds and processes BOTH dimension strings AND codes from the same OCR data.
    """
    dimensions = []
    codes = []
    
    # Regex for codes
    code_pattern = re.compile(r'^\d{4}[A-Z]{2}$')
    
    n_boxes = len(ocr_data['text'])
    for i in range(n_boxes):
        # Only process words with a decent confidence score.
        # conf may be an int, float, or numeric string depending on the pytesseract version.
        if float(ocr_data['conf'][i]) > 40:
            text = ocr_data['text'][i]
            
            # --- Check for Codes ---
            if code_pattern.match(text):
                codes.append(text)
                continue # Move to the next word once a code is found

            # --- Check for Dimensions ---
            # This logic is simplified to look for the foot marker, which is a strong indicator
            if "'" in text:
                raw_text = text
                # Simple logic to combine with the next word if it looks like inches
                if i + 1 < n_boxes and '"' in ocr_data['text'][i+1]:
                    raw_text += " " + ocr_data['text'][i+1]

                # Process single or combined dimensions (e.g., 12'0" X 11'6")
                for part in re.split(r'\s*[xX]\s*', raw_text):
                    try:
                        inch_val = convert_to_inches(part)
                        if inch_val > 0:
                            bbox = [
                                ocr_data['left'][i],
                                ocr_data['top'][i],
                                ocr_data['left'][i] + ocr_data['width'][i],
                                ocr_data['top'][i] + ocr_data['height'][i]
                            ]
                            dimensions.append({"raw": part, "inches": inch_val, "bbox": bbox})
                    except (ValueError, ZeroDivisionError):
                        # Ignore parts that fail conversion (e.g. a zero denominator)
                        continue

    return dimensions, sorted(set(codes)) # Return unique codes in a stable order

# Alternative for PDFs with native (selectable) text; main() below does not call it
def extract_native_codes(pdf_path, page_number=0):
    """
    Extracts native, selectable text from a PDF page, finds codes using regex,
    and returns both the codes and all the raw text found for debugging.
    """
    codes = []
    all_raw_text = []
    # Regex for codes like 2030SH or 3040SH (4 digits, 2 uppercase letters)
    code_pattern = re.compile(r'^\d{4}[A-Z]{2}$')
    
    doc = fitz.open(pdf_path)
    page = doc.load_page(page_number)
    
    # Extract text as a list of blocks
    text_blocks = page.get_text("blocks")
    
    for block in text_blocks:
        # The text content is the 5th item in the tuple
        text_content = block[4].strip()
        if text_content: # Only process non-empty blocks
            all_raw_text.append(text_content)
            # Check if any word in the block matches the code pattern
            for word in text_content.split():
                if code_pattern.match(word):
                    codes.append(word)
                
    doc.close()
    
    # Return unique codes and all the raw text
    return list(set(codes)), all_raw_text

## Step 6: Visualization (Bonus)

This bonus function takes the extracted dimensions and draws their bounding boxes and converted inch values onto a copy of the rendered page image, which makes it easy to verify the extraction visually. The annotated image is then saved as a PDF.
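
The label placement can be reasoned about without drawing anything. The sketch below computes where a label and its background rectangle would sit beneath a box, assuming the text size and baseline come from `cv2.getTextSize` and that the text origin passed to `cv2.putText` is the bottom-left corner of the text. `label_layout` and its `pad` parameter are hypothetical names used only for illustration.

```python
def label_layout(bbox, text_size, baseline, pad=5):
    """Return (text_origin, background_rect) for a label placed below bbox.

    bbox is [x0, y0, x1, y1] in pixels; text_size is (width, height).
    """
    x0, _, _, y1 = bbox
    text_w, text_h = text_size
    top = y1 + pad                      # top edge of the label area
    origin = (x0, top + text_h)         # bottom-left corner of the text
    background = ((x0, top), (x0 + text_w, top + text_h + baseline))
    return origin, background


origin, background = label_layout([100, 200, 140, 220], (60, 18), baseline=6)
print(origin, background)
```

Including the baseline in the background height leaves room for any part of the glyphs that extends below the text origin.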

In [None]:
def visualize_results(image, dimensions, output_path):
    """
    Draws bounding boxes and labels for dimensions on the image and saves it.
    Labels are placed BELOW the bounding box on a white background for clarity.
    """
    vis_image = image.copy()
    
    for dim in dimensions:
        bbox = dim['bbox']
        label = f"{dim['inches']}\""
        
        # Define text properties
        font = cv2.FONT_HERSHEY_SIMPLEX
        font_scale = 0.7
        font_thickness = 2
        text_color = (0, 0, 255) # Red for text
        box_color = (0, 0, 255)  # Red for bounding box
        bg_color = (255, 255, 255) # White for background

        # Get text size to determine background rectangle
        (text_width, text_height), baseline = cv2.getTextSize(label, font, font_scale, font_thickness)
        
        # --- Label placement ---
        # Position the label's top edge 5 pixels BELOW the bounding box's bottom edge
        text_x = bbox[0]
        text_y_baseline = bbox[3] + text_height + 5 # 5 is a small padding in pixels

        # Calculate coordinates for the background rectangle to sit behind the text
        # Top-left corner of the background
        bg_rect_start = (text_x, bbox[3] + 5) 
        # Bottom-right corner of the background
        bg_rect_end = (text_x + text_width, bbox[3] + 5 + text_height + baseline)
        
        # Draw the white background rectangle first
        cv2.rectangle(vis_image, bg_rect_start, bg_rect_end, bg_color, -1) # -1 fills the rectangle
        
        # Draw the red text on top of the white background
        cv2.putText(vis_image, label, (text_x, text_y_baseline), 
                    font, font_scale, text_color, font_thickness, cv2.LINE_AA)
        
        # Finally, draw the original bounding box rectangle
        cv2.rectangle(vis_image, (bbox[0], bbox[1]), (bbox[2], bbox[3]), box_color, 2)
    
    # Save the visualized image as a temporary file
    temp_image_path = os.path.join(OUTPUT_DIR, "temp_visual.png")
    cv2.imwrite(temp_image_path, vis_image)

    # Convert the saved image back to a PDF
    with Image.open(temp_image_path) as img:
        img.convert('RGB').save(output_path)
    
    os.remove(temp_image_path) # Clean up temp file
    print(f"Visualization saved to {output_path}")

## Main Pipeline Execution

This is the final step where we tie everything together. We run the full pipeline on the first page of the PDF, collect the results, and then save them to the specified JSON and visualized PDF files.
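
For orientation, the JSON written at the end has one record per processed page, each holding the page number, a list of dimension entries (`raw`, `inches`, `bbox`), and a list of codes. The sketch below builds such a structure by hand; the sample values and the `build_page_record` helper are hypothetical and only illustrate the layout.

```python
import json


def build_page_record(page_number, dimensions, codes):
    """Assemble one page entry in the same layout the pipeline writes."""
    return {"page": page_number, "dimensions": dimensions, "codes": sorted(codes)}


# Made-up sample values, not real extraction results.
sample_dimensions = [
    {"raw": "12' 6\"", "inches": 150.0, "bbox": [100, 200, 140, 220]},
]
results = {"pages": [build_page_record(1, sample_dimensions, ["2030SH"])]}
print(json.dumps(results, indent=4))
```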

In [80]:
def main():
    final_results = {"pages": []}
    page_num = 0

    print(f"Processing page {page_num + 1}...")
    
    # --- Single OCR Pass for ALL data ---
    page_image = pdf_page_to_image(INPUT_PDF_PATH, page_number=page_num)
    ocr_data = extract_text_with_bboxes(page_image)
    
    # Process the OCR data to get both dimensions and codes
    dimensions, codes = find_and_process_patterns(ocr_data)
    
    # --- Combine Results ---
    page_data = {
        "page": page_num + 1,
        "dimensions": dimensions,
        "codes": codes
    }
    final_results["pages"].append(page_data)

    # Bonus: Visualize the dimension results
    if dimensions:
         visualize_results(page_image, dimensions, OUTPUT_VISUAL_PATH)

    # Save the final JSON output
    with open(OUTPUT_JSON_PATH, 'w') as f:
        json.dump(final_results, f, indent=4)
        
    print("\nExtraction complete!")
    print(f"JSON output saved to {OUTPUT_JSON_PATH}")
    print(f"Found {len(dimensions)} dimensions and {len(codes)} codes on page {page_num + 1}.")

# Run the pipeline
if __name__ == '__main__':
    main()

Processing page 1...
Visualization saved to output\visualized_floorplan.pdf

Extraction complete!
JSON output saved to output\extracted_data.json
Found 30 dimensions and 1 codes on page 1.
