# Interior Architecture Dataset Construction

## 1. Project Overview
This project constructs a high-quality, labeled image dataset for fine-grained **interior architectural element recognition**. The dataset is intended for training computer vision models to recognize specific architectural details.

Existing datasets often label images with broad categories like "Living Room" or "Kitchen." However, for architectural analysis, we require a taxonomy that focuses on specific structural details, such as *Coffered Ceilings*, *Clerestory Windows*, or *Vestibules*.

To achieve this, I developed custom scrapers to collect, filter, and normalize data from public sources (**Unsplash** and **Wikimedia Commons**).

## 2. Taxonomy Decision

### Decision: Custom Taxonomy
The taxonomy is a custom selection of classes from `tag_taxonomy.json`, focusing on structural and architectural specifics.

### Selected Classes (22 Categories)
The dataset focuses on the following structural elements:

| Ceiling Elements | Windows & Light | Walls & Floors | Circulation | Transitional Spaces |
| :--- | :--- | :--- | :--- | :--- |
| Vaulted Ceiling | Clerestory Windows | Concrete Wall | Spiral Staircase | Foyer |
| Coffered Ceiling | Skylight | Brick Interior | Grand Staircase | Vestibule |
| Exposed Beams | Glass Curtain Wall | Marble Floor | Corridor | Atrium |
| Archway | Natural Light | | Hallway | Mezzanine |
| Colonnade | Recessed Lighting | | | |
| | Chandelier | | | |

## 3. Data Collection Methodology

I use two primary data sources to balance aesthetic quality and diversity.

### Source A: Unsplash (High-Quality Photography)
* **Tool**: `unsplash_test.py`
* **Method**: Official Unsplash API.
* **Strategy**:
    * Search: Queries `{Keyword} + Interior`.
    * Filtering: Removes images tagged "people", "fashion", or "exterior".
    * Output: High-resolution images with rich photographer metadata.

### Source B: Wikimedia Commons (Open Knowledge)
* **Tool**: `wikimedia_test.py`
* **Method**: MediaWiki API
* **Strategy**:
    * Filtering: Removes drawings, plans, maps, and ruins.
    * De-duplication: Checks title similarity to avoid downloading burst shots (e.g., "DSC_01.jpg", "DSC_02.jpg").
        * Computes the text similarity ratio between the incoming title and each already-downloaded title.
        * Applies a `SIMILARITY_THRESHOLD` of 0.8. If any similarity score exceeds this threshold, the image is discarded to preserve dataset diversity.
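The title check can be sketched with the standard library. This assumes the similarity ratio is computed with `difflib.SequenceMatcher`, which the description above does not confirm; the function name `is_near_duplicate` is hypothetical.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.8


def is_near_duplicate(title, seen_titles, threshold=SIMILARITY_THRESHOLD):
    """Return True if `title` is too similar to any already-downloaded title.

    Assumption: similarity is difflib's ratio, 2 * matches / total characters.
    """
    candidate = title.lower()
    return any(
        SequenceMatcher(None, candidate, seen.lower()).ratio() > threshold
        for seen in seen_titles
    )
```

Under this measure, "DSC_01.jpg" and "DSC_02.jpg" share 9 of 10 characters in order, giving a ratio of 0.9, so the second file would be discarded. Note that comparing each new title against every stored title scales linearly with the number of downloads.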

## 4. Key Features & Quality Control

To keep the dataset clean and usable for computer vision tasks, both scrapers apply strict filtering logic.

### 1. Content Blacklisting
I discard any image whose metadata contains an excluded term. This removes the following irrelevant content:
* **Non-Photographic**: `drawing`, `sketch`, `plan`, `map`, `diagram`.
* **Exterior/Nature**: `exterior`, `facade`, `garden`, `street`, `aerial`.
* **Privacy/Human**: `person`, `portrait`, `wedding`, `crowd`, `fashion`.
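A minimal sketch of such a blacklist check is shown below. The whole-word matching (with an optional plural "s") is an assumption made here so that "plan" does not reject "plant"; the scrapers may use plain substring matching instead. The name `is_blacklisted` is hypothetical.

```python
import re

BLACKLIST = [
    "drawing", "sketch", "plan", "map", "diagram",          # non-photographic
    "exterior", "facade", "garden", "street", "aerial",     # exterior/nature
    "person", "portrait", "wedding", "crowd", "fashion",    # privacy/human
]

# Assumption: match whole words only, allowing a trailing plural "s".
_BLACKLIST_PATTERN = re.compile(
    r"\b(?:" + "|".join(map(re.escape, BLACKLIST)) + r")s?\b",
    re.IGNORECASE,
)


def is_blacklisted(*metadata_fields):
    """Return True if any non-empty metadata field contains an excluded term."""
    text = " ".join(field for field in metadata_fields if field)
    return _BLACKLIST_PATTERN.search(text) is not None
```

For example, `is_blacklisted(title, description)` can be called before downloading, passing whichever metadata fields the source provides.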

### 2. Deduplication
Both scrapers maintain a record of downloaded IDs or titles to prevent duplicate entries across runs.
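One way to persist that record between sessions is to rebuild it from the CSV written by a previous run. The sketch below assumes the captions CSV doubles as the record and that the `filename` column identifies each image; neither detail is confirmed here, and `load_seen` is a hypothetical helper.

```python
import csv
from pathlib import Path


def load_seen(csv_path, column="filename"):
    """Return the set of identifiers already recorded in a previous run.

    Assumption: the captions CSV is the persistent record and `column`
    uniquely identifies each downloaded image.
    """
    path = Path(csv_path)
    if not path.exists():
        return set()  # first run: nothing downloaded yet
    with path.open(newline="", encoding="utf-8") as fh:
        return {row[column] for row in csv.DictReader(fh) if row.get(column)}
```

A scraper would call this once at startup and skip any candidate whose identifier is already in the returned set.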

### 3. Metadata Standardization
Regardless of the source, every record is saved as a CSV row with the following columns, for easy integration into training pipelines:

```csv
filename, keyword, full_caption, image_url, source
```
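Writing a record in this schema could look like the sketch below. It uses Python's `csv.DictWriter` and writes the header only when the file is new or empty; `append_record` is a hypothetical helper, not a function from the scrapers.

```python
import csv
from pathlib import Path

FIELDNAMES = ["filename", "keyword", "full_caption", "image_url", "source"]


def append_record(csv_path, record):
    """Append one metadata row, writing the header first if the file is new."""
    path = Path(csv_path)
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDNAMES)
        if write_header:
            writer.writeheader()
        writer.writerow(record)  # raises ValueError on keys outside FIELDNAMES
```

Using the `csv` module rather than manual string joins keeps captions that contain commas or quotes correctly escaped.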

## 5. Usage Instructions

### Running the Collection
1.  **Unsplash**:
    * Replace `UNSPLASH_ACCESS_KEY` with your API key.
    * Run all cells to start downloading images to `Dataset_Unsplash_Interior`.
2.  **Wikimedia**:
    * Run all cells to start downloading images to `Dataset_Wikimedia_Interior`.

### Output
The scripts will generate:
* Image folders containing the raw `.jpg` files.
* CSV files (`dataset_unsplash_captions.csv`, `dataset_wikimedia_captions.csv`) containing the labels and metadata.
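After a run, the class balance of the generated CSV files can be checked with a short script. This is a sketch that assumes the `keyword` column holds the class label; `count_by_keyword` is a hypothetical helper.

```python
import csv
from collections import Counter
from pathlib import Path


def count_by_keyword(csv_paths):
    """Count rows per keyword across the given caption CSV files."""
    counts = Counter()
    for csv_path in csv_paths:
        path = Path(csv_path)
        if not path.exists():
            continue  # skip a source that has not been collected yet
        with path.open(newline="", encoding="utf-8") as fh:
            # skipinitialspace tolerates headers written as "filename, keyword"
            for row in csv.DictReader(fh, skipinitialspace=True):
                counts[row["keyword"]] += 1
    return counts
```

For example, `count_by_keyword(["dataset_unsplash_captions.csv", "dataset_wikimedia_captions.csv"]).most_common()` lists the classes from most to least represented.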

## 6. Future Work
* **Houzz Integration**: Continue attempts to integrate Houzz.com as a data source, potentially using advanced headless browser automation (e.g., Playwright) to navigate dynamic JavaScript content and overcome anti-scraping measures.
* **ADE20K Dataset**: Investigate the [ADE20K Dataset](https://ade20k.csail.mit.edu/) as a potential source for semantic segmentation masks to augment the current classification dataset.