# **Introduction**

This notebook documents the current status of the project. It covers:
- The Problem Statement
- Dataset
- Preprocessing
- Workflows
- Results
- Next Steps

# **The Problem Statement**

This section outlines the background, motivation, and formal definition of the problem addressed in this project.

## **Background and Motivation**

**SKETS Studio**, established in **2005**, is an architectural firm specializing in **BIM Consultancy**, **Design Documentation** for Interior Design, Architecture, and Engineering projects, and **high-end 3D Visualization** services.  
Over the past two decades, the firm has successfully delivered **2,500+ large-scale projects** and a total of **over 75,000 projects** globally.

Despite this extensive experience, SKETS currently lacks a structured system for efficiently referencing or reusing prior design assets.  
When a new project is initiated, teams often need to **recreate drawings and design components from scratch**, resulting in **redundant effort** and **longer project timelines**.

This inefficiency highlights the need for an intelligent retrieval system capable of **searching through the firm’s existing project database** and identifying **similar designs** based on either a **new sketch** or a **textual description** provided by the user.  
Such a system would enable **design reuse**, **reduce manual rework**, and **accelerate project delivery** while maintaining design consistency across projects.

## **Formal Problem Statement**

**Problem 1:**  
Develop a **structured and searchable library** of architectural sketches, where each sketch is represented as an **embedding** (feature vector) stored in a database.  
The database should be **efficiently indexed** for fast and accurate retrieval and should maintain **links between sketches and their respective CAD project files**.

**Problem 2:**  
Design a **retrieval system** that accepts either a **sketch image** or a **text description** as input and **searches through the library** to return **visually or contextually similar sketches** along with their **associated CAD files**, enabling **efficient reuse and reference** in future projects.


# **Proposed Solutions**

To address the problems outlined in the previous section, two conceptual solutions are proposed. Each focuses on a distinct approach to enable efficient retrieval of architectural sketches and their corresponding CAD representations from the project library.

---

## **Solution 1: Tag-Based Retrieval System**

**Concept:**  
Leverage **tag-based indexing** to retrieve sketches similar to a user’s query. The idea is to associate each sketch with a comprehensive set of descriptive tags. When a user provides a textual prompt, the system extracts semantic information (tags) from the prompt and retrieves sketches with overlapping or similar tags. The corresponding CAD files can then be mapped from these retrieved sketches.

**Underlying Principle / Working Mechanism:**  
- Each sketch in the library will be **indexed by tags**, where tags represent architectural elements, materials, design styles, or spatial configurations.  
- For a **user query**, the system will employ a **text model** (for example, a text embedding model) to map the description to tags or key features.  
- The extracted tags will be compared against the tag index to identify **sketches with similar semantic features**, which can then be linked to their **CAD project files**.
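
The mechanism above can be sketched with a simple tag-overlap score. This is a minimal illustration, not the project's implementation: it assumes the query tags have already been extracted, uses Jaccard similarity as one possible overlap measure, and every sketch name, tag, and CAD path shown is hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical tag index: sketch name -> (tags, linked CAD file).
TAG_INDEX: Dict[str, Tuple[Set[str], str]] = {
    "sketch_a.png": ({"wardrobe", "sliding-door", "laminate"}, "cad/project_a.dwg"),
    "sketch_b.png": ({"wardrobe", "hinged-door", "veneer"}, "cad/project_b.dwg"),
    "sketch_c.png": ({"kitchen", "island", "laminate"}, "cad/project_c.dwg"),
}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Share of tags common to both sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def search_by_tags(query_tags: Set[str], top_k: int = 3) -> List[Tuple[str, str, float]]:
    """Rank sketches by tag overlap and return (sketch, CAD file, score)."""
    scored = [
        (name, cad, jaccard(query_tags, tags))
        for name, (tags, cad) in TAG_INDEX.items()
    ]
    # Highest overlap first; drop sketches that share no tag with the query.
    scored.sort(key=lambda item: item[2], reverse=True)
    return [item for item in scored if item[2] > 0][:top_k]


# Tags assumed to have been extracted from a prompt such as
# "a wardrobe with sliding doors".
results = search_by_tags({"wardrobe", "sliding-door"})
for name, cad, score in results:
    print(f"{name} -> {cad} (overlap {score:.2f})")
```

Exact set overlap treats near-synonyms as different tags, which is one reason the tagging vocabulary needs to be kept consistent.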

**Requirements:**  
- A **comprehensively tagged dataset** (each sketch linked to its CAD representation).  
- A **robust text embedding model** capable of extracting meaningful tags from vague or descriptive textual inputs.

**Challenges:**  
- **Scalability:** Tag management and retrieval complexity increase as the dataset grows.  
- **Subjectivity:** Ensuring consistency and objectivity in manual or automated tagging is challenging.  
- **Data Quality Dependence:** The system’s reliability depends heavily on the accuracy and completeness of the tagged dataset.

---

## **Solution 2: Image-Based Retrieval System (CBIR Approach)**

**Concept:**  
Transform the task into a **Content-Based Image Retrieval (CBIR)** problem by comparing visual similarities between sketches. The system accepts a sketch as input, extracts its **feature embeddings**, and searches for similar embeddings within a **vector database** of preprocessed sketches. The matched sketches and their corresponding CAD files are then retrieved.

**Underlying Principle / Working Mechanism:**  
- The library of sketches will be **converted into embeddings** using a pretrained **image embedding model** (e.g., CLIP, SketchFormer, or a sketch-specific CNN).  
- These embeddings will be **indexed in a vector database** (e.g., FAISS) optimized for similarity search.  
- Given a **query sketch**, the system will generate its embedding, perform **nearest-neighbor matching**, and return visually similar sketches along with their **linked CAD files**.
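
The retrieval step above can be sketched as a brute-force cosine-similarity search. The example is a pure-Python stand-in for a vector database such as FAISS, intended only to show the logic: the three-dimensional embeddings, sketch names, and CAD paths are toy values, and `nearest_sketches` is a hypothetical helper.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_sketches(
    query: Sequence[float],
    library: Dict[str, Sequence[float]],
    cad_links: Dict[str, str],
    top_k: int = 5,
) -> List[Tuple[str, str, float]]:
    """Return the top_k (sketch, CAD file, similarity) triples for a query."""
    scored = [
        (name, cad_links[name], cosine_similarity(query, embedding))
        for name, embedding in library.items()
    ]
    scored.sort(key=lambda item: item[2], reverse=True)  # most similar first
    return scored[:top_k]


# Toy embeddings; real image embeddings have hundreds of dimensions.
library = {
    "sketch_a.png": [1.0, 0.0, 0.0],
    "sketch_b.png": [0.6, 0.8, 0.0],
    "sketch_c.png": [0.0, 0.0, 1.0],
}
cad_links = {
    "sketch_a.png": "cad/project_a.dwg",
    "sketch_b.png": "cad/project_b.dwg",
    "sketch_c.png": "cad/project_c.dwg",
}

hits = nearest_sketches([0.8, 0.6, 0.0], library, cad_links, top_k=2)
for name, cad, score in hits:
    print(f"{name} -> {cad} ({score:.2%})")
```

A linear scan like this is adequate for a library of a few hundred sketches; an indexed vector store becomes worthwhile as the library grows.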

**Requirements:**  
- A **large and consistent dataset of sketches** (uniform resolution and preprocessing), each mapped to its CAD counterpart.  
- An **image embedding model** capable of capturing architectural sketch features effectively.

**Challenges:**  
- **Model Availability:** Few pretrained models exist for complex architectural sketches (most are trained on doodles or simple objects).  
- **Data Limitation:** Currently, only a small number of sketches (≈8) are available in a suitable format, which must be expanded significantly to improve retrieval accuracy.

---

> **Note:**  
> The ultimate goal is to **integrate both approaches** into a **hybrid retrieval system** capable of handling both **text-based** and **sketch-based** queries. This combined framework would enable multimodal search capabilities, thereby improving usability, scalability, and system accuracy.


# **Dataset**

The current dataset consists of **8 architectural sketches** provided by the **SKETS Studio** team.  
All sketches represent **wardrobe views** and are untagged at this stage.  
Each image has a resolution of **(5100 × 6600 × 3)** (height, width, channels).

Given the small dataset size, **data augmentation** and **preprocessing** were applied to improve model robustness and ensure consistent feature extraction before generating embeddings.

---

## **Data Augmentation**

**Data Augmentation** is a technique used to synthetically increase dataset size by creating modified versions of existing samples through transformations such as scaling, rotation, translation, or brightness adjustments.  
The goal is to make the model **robust to variations** that do not alter the semantic content of the image.

In this project, each original sketch was used to generate **10 augmented samples**, increasing the dataset size from **8 to 88 images** (8 originals + 80 augmentations).  

We used the [`albumentations`](https://albumentations.ai/) library for augmentation, which provides a fast and flexible API for image transformations.  
The following policy was applied:

```python
import albumentations as A
import cv2

augment = A.Compose(
    [
        # Mild geometric jitter: scale, shift, rotate, and shear
        A.Affine(
            scale=(0.98, 1.02),
            translate_percent=(0.01, 0.02),
            rotate=(-7, 7),
            shear=(-2, 2),
            fit_output=False,
            border_mode=cv2.BORDER_REFLECT,
            p=0.9
        ),
        # Apply at most one photometric change per sample
        A.OneOf(
            [
                A.RandomBrightnessContrast(brightness_limit=0.1, contrast_limit=0.1, p=0.8),
                A.GaussianBlur(blur_limit=(3, 5), p=0.5),
            ],
            p=0.8
        ),
        A.HorizontalFlip(p=0.5),  # left-right mirror
        A.Perspective(scale=(0.02, 0.05), p=0.3),  # slight scan skew
        A.ImageCompression(quality_range=(90, 100), p=0.2)  # light compression artifacts
    ]
)
```

**Explanation:**  
- **Affine Transformations:** Slight scaling, translation, rotation, and shear to mimic natural sketch variations.  
- **Brightness/Contrast or Blur:** At most one of the two is applied per sample, introducing minor lighting or sharpness differences to improve generalization.  
- **Horizontal Flip:** Adds left–right mirrored variants to handle orientation bias.  
- **Perspective Transformation:** Simulates minor skewing as may occur during scanning.  
- **Image Compression:** Emulates compression artifacts from file storage or transfer.

These transformations preserve the overall structure while ensuring variability in line intensity, orientation, and noise—leading to more resilient feature extraction during embedding generation.
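
One way to apply such a policy and arrive at 88 images is sketched below. It is an illustration rather than the notebook's actual loop: `expand_dataset` is a hypothetical helper, the `_aug{i}` naming is inferred from the file names in the result tables, and `augment` is assumed to follow the albumentations calling convention `augment(image=img)["image"]`.

```python
from pathlib import Path
from typing import Any, Callable, Dict


def expand_dataset(
    originals: Dict[str, Any],
    augment: Callable[..., Dict[str, Any]],
    n_aug: int = 10,
) -> Dict[str, Any]:
    """Return the originals plus n_aug augmented copies of each image."""
    dataset = dict(originals)  # keep the untouched originals
    for name, image in originals.items():
        path = Path(name)
        for i in range(1, n_aug + 1):
            # albumentations pipelines take and return keyword dictionaries;
            # each call draws fresh random parameters.
            augmented = augment(image=image)["image"]
            dataset[f"{path.stem}_aug{i}{path.suffix}"] = augmented
    return dataset
```

With 8 originals and `n_aug=10`, the returned dictionary holds 8 + 8 × 10 = 88 entries, with keys such as `pdf7.png` and `pdf7_aug7.png`.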

---

## **Preprocessing**

Scanned sketches often contain **noise**, **shadows**, or **uneven illumination** introduced during the scanning process.  
Architectural sketches also tend to be **sparse**, with meaningful content distributed across a large white background.  
To address these issues, a **preprocessing pipeline** was designed to enhance edges, suppress noise, and normalize brightness levels before embedding extraction.

The preprocessing function is as follows:

```python
import cv2
import numpy as np


def preprocess_sketch(img_path):
    """Return a 3-channel binary edge image of the sketch at img_path."""

    img = cv2.imread(img_path)
    if img is None:
        raise ValueError(f"Could not read {img_path}")

    # Step 1: Add small reflective padding to preserve boundary pixels
    img = cv2.copyMakeBorder(img, 5, 5, 5, 5, cv2.BORDER_REFLECT)

    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Step 2: Apply conditional smoothing to remove mild noise
    blur = cv2.bilateralFilter(gray, 9, 75, 75) if np.std(gray) > 20 else gray

    # Step 3: Normalize intensity and detect edges
    norm = cv2.normalize(blur, None, 0, 255, cv2.NORM_MINMAX)
    edges = cv2.Canny(norm, 30, 100)
    edges = cv2.dilate(edges, np.ones((2, 2), np.uint8), iterations=1)

    # Step 4: Adaptive thresholding and morphological refinement
    binary = cv2.adaptiveThreshold(
        edges, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2
    )

    kernel = np.ones((3, 3), np.uint8)
    refined = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)

    # Step 5: Convert to RGB for model compatibility
    processed_rgb = cv2.cvtColor(refined, cv2.COLOR_GRAY2RGB)
    return processed_rgb
```

**Explanation:**  
1. **Padding:** Adds a small reflective border to preserve sketch boundaries after transformations.  
2. **Grayscale Conversion:** Simplifies image processing while retaining edge information.  
3. **Bilateral Filtering:** Smooths regions while preserving edges, helping to reduce minor scan noise.  
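4. **Normalization & Edge Detection:** Stretches intensities to the 0 to 255 range, extracts structural edges via the **Canny** operator (thresholds 30 and 100), and thickens them with a 2 × 2 dilation.  
5. **Adaptive Thresholding:** Applies Gaussian adaptive thresholding (block size 11, offset 2) to the dilated edge map, producing a binary image that emphasizes linework.  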
6. **Morphological Closing:** Fills small gaps in lines, making contours continuous.  
7. **RGB Conversion:** Ensures compatibility with embedding extraction pipelines expecting 3-channel inputs.

The output of this pipeline is a **clean, high-contrast binary sketch** highlighting the structural components of the drawing.  
These preprocessed images form the input to the **embedding extraction** stage used later in the system workflow.


# **Image Embedding Extraction**

The next stage in the workflow involves **embedding extraction**, where we convert images into high-dimensional numerical representations that capture semantic and structural information.  
While several pretrained models exist for generic image embeddings, the challenge lies in identifying models that perform well on **sketch-based datasets**, which differ significantly from natural images in texture, color distribution, and edge density.

Datasets such as [Sketchy](https://github.com/CDOTAD/SketchyDatabase/blob/master/SketchyDataset_README.md), [QuickDraw](https://github.com/googlecreativelab/quickdraw-dataset), and [TU-Berlin](https://huggingface.co/datasets/sdiaeyu6n/tu-berlin) have been commonly used for sketch-based model training.  
However, these datasets primarily consist of **simplified, vectorized doodles**—collected from hand-drawn sketches on tablets—and do not represent **architectural sketches**, which are richer, denser, and more complex in structure.

To address this, we explored pretrained models that could still yield meaningful embeddings for architectural sketches. Our selection included models either:
1. **Trained on large-scale sketch datasets** (though not architectural), or  
2. **Trained on vast multimodal datasets**, enabling generalization to unseen domains like architectural line drawings.

Below we discuss the primary model used in our experimentation.

---

## **CLIP ViT-B/32 (Vision Transformer Base, Patch Size 32)**

### **Description**

The **CLIP (Contrastive Language–Image Pretraining)** model, developed by *OpenAI*, learns joint representations of text and images through a **contrastive learning objective**.  
It uses two encoders — a **Vision Transformer (ViT-B/32)** for images and a **Transformer-based text encoder** — trained to maximize the similarity between matching (image, text) pairs while minimizing it for non-matching pairs.

This approach allows CLIP to generalize to a wide range of visual tasks in a **zero-shot** manner without task-specific fine-tuning.  
We specifically used the `ViT-B/32` variant, which splits each image into non-overlapping patches of 32 × 32 pixels (a 7 × 7 grid for a 224 × 224 input) and processes them through a Transformer encoder to obtain the final embedding representation.

Further architectural details and pretrained weights can be found on [Hugging Face](https://huggingface.co/openai/clip-vit-base-patch32).
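
For reference, image embeddings can be extracted from this checkpoint roughly as follows. This is a sketch under stated assumptions, not the notebook's exact code: it assumes the `transformers`, `torch`, and `Pillow` packages are installed, `embed_images` is a hypothetical helper, and the L2 normalization at the end is a choice made here so that dot products equal cosine similarities.

```python
from typing import List

MODEL_ID = "openai/clip-vit-base-patch32"  # checkpoint linked above


def embed_images(paths: List[str], model_id: str = MODEL_ID) -> List[List[float]]:
    """Return one L2-normalized CLIP image embedding per image path."""
    # Heavy dependencies are imported lazily so the helper can be defined
    # even where they are not installed.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained(model_id).eval()
    processor = CLIPProcessor.from_pretrained(model_id)

    images = [Image.open(p).convert("RGB") for p in paths]
    # The processor resizes and center-crops each image to the model input size.
    inputs = processor(images=images, return_tensors="pt")

    with torch.no_grad():
        # Vision encoder output, then projection into the shared image-text space.
        pooled = model.vision_model(pixel_values=inputs["pixel_values"]).pooler_output
        embeds = model.visual_projection(pooled)

    # Unit-length vectors make the dot product equal to cosine similarity.
    embeds = embeds / embeds.norm(dim=-1, keepdim=True)
    return embeds.tolist()
```

Calling `embed_images(["pdf3.png", "pdf7.png"])` would return two vectors (512-dimensional for this checkpoint) that can be compared directly with a dot product.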

---

### **Challenges and Limitations**

Although CLIP was trained on a massive and diverse corpus of image–text pairs, its direct applicability to **architectural sketches** is limited for the following reasons:

- **Domain Mismatch:** CLIP’s training data primarily comprises natural and photographic images, not monochromatic architectural line drawings.  
- **Resolution Constraint:** The model expects inputs of size **(224 × 224 × 3)**, whereas our sketches have a much higher native resolution (**5100 × 6600 × 3**).  
  Resizing or cropping to meet the model’s requirements can result in **significant spatial information loss**: a 224 × 224 input retains only about 0.15% of the original pixels (a reduction of roughly 99.85%).  
- **Context Loss:** Architectural sketches rely on precise geometric and proportional cues that are often degraded during downsampling.
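
The pixel-count reduction behind the resolution constraint can be reproduced with a few lines of arithmetic. The helper name `pixel_loss` is hypothetical, and the figure counts discarded pixels only; it is a rough proxy for lost detail, not a measure of semantic information.

```python
from typing import Tuple


def pixel_loss(src_hw: Tuple[int, int], dst_hw: Tuple[int, int]) -> float:
    """Fraction of pixels discarded when downsampling from src to dst."""
    src_pixels = src_hw[0] * src_hw[1]
    dst_pixels = dst_hw[0] * dst_hw[1]
    return 1 - dst_pixels / src_pixels


# Native sketch resolution versus the CLIP ViT-B/32 input size.
loss = pixel_loss((5100, 6600), (224, 224))
print(f"{loss:.2%}")  # 99.85%
```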

Consequently, while CLIP serves as a **baseline embedding extractor**, future improvements could involve training a **domain-specific embedding model** once a sufficiently large corpus of architectural sketches becomes available.

---

### **Experimental Evaluation**

We evaluated CLIP on our dataset (original and augmented) under two preprocessing conditions — **with** and **without** sketch preprocessing — to study its sensitivity to noise, contrast enhancement, and edge refinement.

The following tables summarize the **top similar sketches retrieved** for the query `pdf3_SIM` under different dataset configurations.

---

#### **Original Dataset (Without Preprocessing)**

| Rank | Sketch Name | Similarity Score |
|------|--------------|------------------|
| 1 | pdf7.png | 92.14% |
| 2 | pdf3.png | 91.63% |
| 3 | pdf5.png | 90.81% |
| 4 | pdf8.png | 90.35% |
| 5 | pdf2.png | 90.33% |

---

#### **Original Dataset (With Preprocessing)**

| Rank | Sketch Name | Similarity Score |
|------|--------------|------------------|
| 1 | pdf7.png | 92.92% |
| 2 | pdf4.png | 92.80% |
| 3 | pdf5.png | 92.23% |
| 4 | pdf3.png | 91.16% |
| 5 | pdf6.png | 89.68% |

---

#### **Augmented Dataset (Without Preprocessing)**

| Rank | Sketch Name | Similarity Score |
|------|--------------|------------------|
| 1 | pdf7_aug7.png | 93.41% |
| 2 | pdf5_aug6.png | 93.34% |
| 3 | pdf5_aug3.png | 92.84% |
| 4 | pdf8_aug8.png | 92.69% |
| 5 | pdf5_aug1.png | 92.60% |
| 6 | pdf5_aug8.png | 92.37% |
| 7 | pdf5_aug5.png | 92.22% |
| 8 | pdf7.png | 92.14% |
| 9 | pdf8_aug10.png | 91.98% |
| 10 | pdf7_aug3.png | 91.71% |

---

#### **Augmented Dataset (With Preprocessing)**

| Rank | Sketch Name | Similarity Score |
|------|--------------|------------------|
| 1 | pdf4.png | 94.31% |
| 2 | pdf4_aug3.png | 94.16% |
| 3 | pdf4_aug7.png | 93.06% |
| 4 | pdf4_aug10.png | 92.75% |
| 5 | pdf4_aug2.png | 92.40% |
| 6 | pdf4_aug9.png | 91.56% |
| 7 | pdf3_aug5.png | 91.43% |
| 8 | pdf5_aug10.png | 91.40% |
| 9 | pdf3.png | 91.36% |
| 10 | pdf3_aug4.png | 91.27% |

---

### **Observations**

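- Preprocessing raised the top similarity score slightly (from 92.14% to 92.92% on the original set and from 93.41% to 94.31% on the augmented set). On the augmented set it also concentrated the top six results on a single source sketch (`pdf4`), suggesting that **noise reduction and edge enhancement** make the embeddings more consistent. It changed the ranking as well: `pdf3.png` moved from rank 2 to rank 4 on the original set.  
- Augmentation added close variants of each sketch to the gallery, leading to slightly **higher similarity scores** for the top matches and **denser clusters** of same-source sketches in the embedding space.  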
- Despite architectural differences between sketches, CLIP was able to identify broad visual similarities, validating its potential as a **baseline model** for sketch retrieval tasks.  
- However, to achieve production-level performance, a **fine-tuned or domain-specific model** trained on architectural sketches will likely be necessary.
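
One simple way to quantify how tightly the results cluster is to count how many top-ranked results share a source sketch. The sketch below applies this to the "Augmented Dataset (With Preprocessing)" table above; the helper names are hypothetical, and the `_aug` suffix convention is inferred from the file names in the tables.

```python
from collections import Counter
from typing import Iterable


def source_sketch(name: str) -> str:
    """Map 'pdf4_aug3.png' or 'pdf4.png' to its source sketch 'pdf4'."""
    stem = name.rsplit(".", 1)[0]
    return stem.split("_aug")[0]


def source_counts(names: Iterable[str]) -> Counter:
    """Count how many retrieved results come from each source sketch."""
    return Counter(source_sketch(n) for n in names)


# Top-10 results from the augmented, preprocessed run.
top10 = [
    "pdf4.png", "pdf4_aug3.png", "pdf4_aug7.png", "pdf4_aug10.png",
    "pdf4_aug2.png", "pdf4_aug9.png", "pdf3_aug5.png", "pdf5_aug10.png",
    "pdf3.png", "pdf3_aug4.png",
]
counts = source_counts(top10)
print(counts.most_common())  # [('pdf4', 6), ('pdf3', 3), ('pdf5', 1)]
```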

---

## **SketchFormer**

### **Description**

[SketchFormer](https://github.com/leosampaio/sketchformer) is a transformer-based architecture designed to model free-hand sketches represented as **vector sequences** (strokes), rather than raster images. It captures both structural and temporal dependencies within sketches, enabling it to perform multiple downstream tasks, including:

- **Sketch Classification**
- **Sketch-based Image Retrieval (SBIR)**
- **Reconstruction and Interpolation of Sketches**

For this study, we utilized the **pretrained TensorFlow model** provided in the official [repository](https://github.com/leosampaio/sketchformer) to extract embeddings from our dataset of architectural sketches.

---

### **Challenges and Limitations**

While SketchFormer was explicitly trained on sketch data, its training domain was **vectorized sketches**—datasets such as Sketchy and QuickDraw—where each sketch is a sequence of strokes captured from drawing devices.  
In contrast, our dataset consists of **rasterized architectural sketches**, lacking inherent stroke information.

To adapt our sketches for SketchFormer, we implemented a **raster-to-stroke extraction** pipeline to convert the binary sketches into approximate stroke sequences. However, the model imposes a **maximum sequence length of 200**, while our extracted sequences range from **6,620 to 74,717 points**. Truncating to 200 points therefore discards roughly **96.98% to 99.73%** of each sequence, a **substantial information loss**.

This mismatch between input expectations and available data significantly limits SketchFormer's representational capability in this experiment.
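
The share of stroke points lost to truncation follows directly from the sequence lengths quoted above. The helper name `truncation_loss` is hypothetical, and the calculation assumes that points beyond the model's limit are simply dropped.

```python
def truncation_loss(n_points: int, max_len: int = 200) -> float:
    """Fraction of points dropped when a sequence is cut to max_len."""
    kept = min(n_points, max_len)
    return 1 - kept / n_points


# Shortest and longest extracted stroke sequences in the dataset.
shortest, longest = 6_620, 74_717
losses = [truncation_loss(n) for n in (shortest, longest)]
for n, loss in zip((shortest, longest), losses):
    print(f"{n} points: {loss:.2%} dropped")
```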

---

### **Experimental Evaluation**

We evaluated SketchFormer’s performance on both **original** and **augmented** versions of our dataset, under preprocessing conditions designed to enhance edge clarity and reduce noise.  
The results shown below illustrate the **top retrieved sketches** for the query image `pdf3_SIM` based on **cosine similarity** of embeddings.

#### **Original Dataset**

| Rank | Sketch Name | Similarity Score |
|------|--------------|------------------|
| 1 | pdf7.png | 6.26% |
| 2 | pdf2.png | 2.80% |
| 3 | pdf4.png | 1.36% |
| 4 | pdf8.png | 1.32% |
| 5 | pdf3.png | 0.59% |

> *Observation:* The low similarity scores across the original dataset indicate limited representational fidelity, likely due to information loss during raster-to-vector conversion.

---

#### **Augmented Dataset**

| Rank | Sketch Name     | Similarity Score |
|------|------------------|------------------|
| 1 | pdf8_aug10.png | 18.62% |
| 2 | pdf8_aug4.png  | 16.46% |
| 3 | pdf8_aug1.png  | 16.27% |
| 4 | pdf8.png       | 16.27% |
| 5 | pdf8_aug8.png  | 16.05% |
| 6 | pdf8_aug3.png  | 15.37% |
| 7 | pdf8_aug9.png  | 13.21% |
| 8 | pdf8_aug2.png  | 12.87% |
| 9 | pdf7.png       | 12.77% |
| 10 | pdf1_aug1.png | 12.74% |

> *Observation:* With the augmented gallery, eight of the top ten results are variants of the same sketch (`pdf8`). This suggests that the embeddings of a sketch and its augmentations stay close to one another, so some intra-class consistency survives the significant structural information loss. Similarity to the query itself remains low (below 19%).

---

**Summary:**  
While SketchFormer demonstrates some clustering capability, its effectiveness is limited by the mismatch between **raster and vector representations**. A future direction involves developing a **vectorization model fine-tuned for architectural sketches**, or training SketchFormer on a **synthetic stroke-based architectural dataset** to bridge this representational gap.

---

## **CLIP ViT-G/14**

### **Description**

The **CLIP ViT-G/14** model represents one of the most powerful variants of the CLIP (Contrastive Language–Image Pretraining) architecture, employing a **Vision Transformer (ViT-G/14)** as its image encoder.  
It was trained on the **LAION-2B** dataset — one of the largest open-source collections of image–text pairs — enabling it to learn highly generalizable multimodal representations.

The model aligns visual and textual embeddings in a shared latent space through **contrastive learning**, where paired image–text examples are optimized to be close in embedding space.  
Pretrained weights and additional documentation are publicly available on [Hugging Face](https://huggingface.co/laion/CLIP-ViT-g-14-laion2B-s12B-b42K).

---

### **Challenges and Limitations**

Despite its large-scale pretraining and strong generalization capabilities, applying CLIP ViT-G/14 directly to **architectural sketches** presents notable limitations:

- **Domain Mismatch:**  
  The model was primarily trained on natural, colored, and textured images, whereas architectural sketches are **monochromatic, sparse, and geometry-driven**. This leads to a representational gap in learned features.

- **Resolution Constraint:**  
  The linked checkpoint expects inputs of **(224 × 224 × 3)** (1024 is the size of its output embedding, not of its input), while our sketches have a resolution of **(5100 × 6600 × 3)**.  
  Downscaling to the required input size introduces **significant structural and proportional information loss**, discarding about **99.85%** of the original pixels.

- **Contextual Degradation:**  
  Architectural sketches rely on **fine edge continuity** and **precise spatial proportions**, which are easily distorted during resizing and normalization.  
  Consequently, the embeddings may fail to capture global layout patterns critical for design similarity.

Therefore, while CLIP ViT-G/14 provides a robust **baseline model** for embedding extraction, domain-specific fine-tuning or custom training on architectural sketch datasets will likely be necessary for optimal retrieval performance.

---

### **Experimental Evaluation**

We evaluated CLIP ViT-G/14 on both **original** and **augmented** sketch datasets, using **preprocessed inputs** to minimize background noise and enhance line visibility.  
The tables below present the **most similar sketches** retrieved for the query sketch `pdf3_SIM` (top 5 for the original dataset, top 10 for the augmented dataset), based on **cosine similarity** of embeddings.

#### **Original Dataset**

| Rank | Sketch Name | Similarity Score |
|------|--------------|------------------|
| 1 | pdf2.png | 86.01% |
| 2 | pdf8.png | 85.17% |
| 3 | pdf4.png | 84.83% |
| 4 | pdf7.png | 84.48% |
| 5 | pdf3.png | 83.77% |

> *Observation:* The five retrieved sketches score within a narrow band (83.77% to 86.01%), which suggests that the model captures the compositional similarity shared by these wardrobe views even without domain adaptation.

---

#### **Augmented Dataset**

| Rank | Sketch Name     | Similarity Score |
|------|------------------|------------------|
| 1 | pdf2_aug7.png | 87.06% |
| 2 | pdf2_aug9.png | 87.02% |
| 3 | pdf2_aug4.png | 86.85% |
| 4 | pdf2.png | 86.01% |
| 5 | pdf2_aug10.png | 85.71% |
| 6 | pdf8.png | 85.17% |
| 7 | pdf8_aug1.png | 85.08% |
| 8 | pdf2_aug1.png | 85.06% |
| 9 | pdf8_aug7.png | 85.05% |
| 10 | pdf8_aug10.png | 84.96% |

> *Observation:* The augmented dataset further improved intra-class consistency, as multiple augmented variants of `pdf2` and `pdf8` dominate the top ranks. This suggests that CLIP ViT-G/14 embeddings are robust to moderate geometric transformations introduced during augmentation.

---

**Summary:**  
CLIP ViT-G/14 demonstrates strong representational power even without fine-tuning, producing consistent similarity patterns across augmented samples. However, due to domain and resolution mismatches, a **custom fine-tuned variant trained on architectural sketches** is expected to yield substantially improved embedding fidelity and retrieval precision.
