# Notebook: 01_generate_dataset.ipynb

This notebook generates a synthetic real estate dataset for training and validating machine learning models. Property records are sampled from simple parametric rules (price-per-m² ranges, amenity premiums, floor and elevator constraints) and assigned to a fixed list of Italian cities.

## System Architecture Summary
The notebook implements a seeded, rule-based synthetic data generation pipeline for real estate machine learning applications. Key architectural features include:

**Data Quality Assurance:**
- Deterministic reproducibility through seed management
- Multi-layer validation ensuring logical consistency
- Business rule enforcement through parameterized constraints

**Scalability and Extensibility:**
- Factory function dispatching on asset type (only `property` is implemented so far)
- Centralized configuration management
- Modular function design enabling easy enhancement

**Production Readiness:**
- Assertion-based validation of each record and of the DataFrame schema
- Standard data formats and schema compliance
- Statistical profiling for dataset characterization

The generated dataset serves as the foundation for downstream machine learning workflows, including exploratory data analysis, feature engineering, model training, and performance validation.

## 01. Setup & Imports

### Technical Overview
Imports essential Python libraries for data manipulation and numerical operations.

### Implementation Details
- `numpy`: Numerical computing and array operations
- `pandas`: Data structure manipulation and analysis
- `random`: Pseudorandom number generation
- `datetime`: Temporal data handling and timestamp generation

### Purpose
Establishes the computational environment and dependencies required for dataset generation operations.

### Output
Runtime environment configured with necessary modules for data generation pipeline.

In [21]:
import numpy as np
import pandas as pd
import random
from datetime import datetime, timedelta

## 02. Central Configuration

### Technical Overview
Initializes reproducible random state and defines domain-specific business parameters for synthetic data generation.

### Implementation Details
- Sets deterministic seeds (`random.seed(42)`, `np.random.seed(42)`) for reproducible results across executions  
- Defines `CONFIG` dictionary containing market-driven parameters:
  - Price per square meter boundaries (€1,200–€3,500)  
  - Feature-based property value adjustments  
  - Architectural constraints (elevator requirements)

### Purpose
Encapsulates synthetic value generation logic and ensures reproducibility.  
The configuration can be extended to other asset types beyond real estate.

### Output
Deterministic random state and parameterized value generation rules.

In [22]:
random.seed(42)
np.random.seed(42)

CONFIG = {
    'PRICE_PER_SQM_MIN': 1200,
    'PRICE_PER_SQM_MAX': 3500,
    'GARDEN_BONUS': 10000,
    'BALCONY_BONUS': 5000,
    'GARAGE_BONUS': 7000,
    'ELEVATOR_MIN_FLOORS': 4
}
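
As a quick check of the reproducibility claim, the sketch below reseeds Python's `random` module and confirms that the same draws come back. The helper name `draw_sample` is hypothetical and not part of the notebook; `np.random.seed` resets NumPy's legacy global generator in the same way.

```python
import random

def draw_sample(seed: int, n: int = 5) -> list:
    """Reseed the global generator and return n uniform draws (hypothetical helper)."""
    random.seed(seed)
    return [random.random() for _ in range(n)]

first = draw_sample(42)
second = draw_sample(42)
other = draw_sample(43)

print(first == second)  # same seed, same sequence
print(first == other)   # different seed, different sequence
```

Because the notebook draws from both `random` and `np.random`, both seeds have to be set, and the cells have to be re-run from the top in the same order to reproduce identical rows.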

## 03. Base Data

### Technical Overview
Defines dataset dimensions and categorical feature vocabularies.

### Implementation Details
- `N_ROWS = 150`: Dataset size specification
- `LOCATIONS`: Italian metropolitan areas for geographic distribution
- `ENERGY_CLASSES`: EU energy efficiency classifications (A-G scale)
- `ASSET_TYPE` and `TODAY`: asset type label and UTC reference time used for timestamp generation

### Purpose
Establishes dataset scope and categorical feature spaces for property generation.

### Output
Static vocabularies and dimensional constraints for data generation process.

In [23]:
N_ROWS = 150
ASSET_TYPE = "property"
TODAY = datetime.utcnow()

# Static vocabularies
LOCATIONS = ['Milan', 'Rome', 'Naples', 'Florence', 'Turin', 'Bologna', 'Palermo', 'Genoa']
ENERGY_CLASSES = ['A', 'B', 'C', 'D', 'E', 'F', 'G']

## 04. Support Functions

### Technical Overview
Implements domain-specific scoring algorithms and temporal data generation utilities.

### Implementation Details
**`simulate_condition_score()`:**
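- Starts from a base score of 0.85 and adjusts it by several factors
- Applies environmental penalties: humidity above 65% (-0.15) or above 55% (-0.05), temperature below 14 or above 24 (-0.07)
- Incorporates energy efficiency adjustments, from +0.03 for class A down to -0.10 for class G
- Adds Gaussian noise (standard deviation 0.02) and clips the result to [0, 1]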

**`random_recent_timestamp()`:**
- Generates ISO 8601 timestamps with a `Z` (UTC) suffix
- Draws a random day offset between 0 and `days_back`, plus a random hour and minute offset
- Supports configurable lookback period (default: 60 days)

### Purpose
Provides specialized functions for realistic property condition modeling and temporal data generation.

### Output
Utility functions returning condition scores (float, 0-1) and ISO timestamps (string).

In [24]:
def simulate_condition_score(humidity: float, temperature: float, energy_class: str) -> float:
    """
    Produce a synthetic condition score in [0,1] based on a few signals.
    """
    score = 0.85
    # Humidity penalty
    if humidity > 65:
        score -= 0.15
    elif humidity > 55:
        score -= 0.05
    # Temperature penalty (outside mild comfort band)
    if temperature < 14 or temperature > 24:
        score -= 0.07
    # Energy class adjustment
    class_adjust = {
        'A': +0.03, 'B': +0.02, 'C': 0.00,
        'D': -0.02, 'E': -0.04, 'F': -0.06, 'G': -0.10
    }
    score += class_adjust.get(energy_class, 0.0)
    # Add small noise
    score += np.random.normal(0, 0.02)
    return round(min(1.0, max(0.0, score)), 3)

def random_recent_timestamp(days_back: int = 60) -> str:
    """
    Generate an ISO timestamp (Z) within the last `days_back` days.
    """
    delta_days = random.randint(0, days_back)
    dt = TODAY - timedelta(days=delta_days,
                           hours=random.randint(0, 23),
                           minutes=random.randint(0, 59))
    return dt.isoformat(timespec='seconds') + "Z"
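
To make the scoring rules easier to follow, here is a minimal sketch of the same logic with the Gaussian noise removed, so the result is deterministic. The function name `expected_condition_score` is an assumption introduced for illustration; it is not defined in the notebook.

```python
def expected_condition_score(humidity: float, temperature: float, energy_class: str) -> float:
    """Noise-free version of the scoring rules (hypothetical helper for illustration)."""
    score = 0.85
    # Humidity penalty: only the strongest matching band applies
    if humidity > 65:
        score -= 0.15
    elif humidity > 55:
        score -= 0.05
    # Temperature penalty outside the 14-24 band
    if temperature < 14 or temperature > 24:
        score -= 0.07
    class_adjust = {'A': 0.03, 'B': 0.02, 'C': 0.00,
                    'D': -0.02, 'E': -0.04, 'F': -0.06, 'G': -0.10}
    score += class_adjust.get(energy_class, 0.0)
    return round(min(1.0, max(0.0, score)), 3)

best = expected_condition_score(45.0, 20.0, 'A')   # no penalties, best class
worst = expected_condition_score(70.0, 12.0, 'G')  # every penalty applied
print(best, worst)
```

Before noise, the score therefore lies between 0.53 and 0.88, so the clipping to [0, 1] only matters for extreme noise draws.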

## 05. Generate Data Rows

### Technical Overview
Core property generation algorithm implementing comprehensive real estate feature modeling.

### Implementation Details
**Physical characteristics generation:**
- Size: Integer uniform distribution (40-199 m²)
- Rooms: Discrete uniform (2-6); bathrooms: discrete uniform (1-3)
- Floors: `floor` drawn from 0-4 and `building_floors` from `floor + 1` to 9, so floor < building_floors
- Construction year: Historical range (1950-2022)

**Business logic implementation:**
- Elevator presence: Deterministic rule (building_floors ≥ 4)
- Amenity probabilities: Garden (30%), Balcony (60%), Garage (50%)

**Valuation algorithm:**
- Base price = size × random_uniform(€1,200-€3,500/m²)
- Energy efficiency multiplier: A/B classes (+5%), applied to the base price before amenity premiums
- Fixed amenity premiums: Garden (+€10k), Balcony (+€5k), Garage (+€7k)

### Purpose
Implements comprehensive property generation with realistic interdependencies and market-based pricing.

### Output
Dictionary containing 23 property attributes with realistic value distributions and logical constraints.

In [25]:
def generate_property(index: int) -> dict:
    size_m2 = np.random.randint(40, 200)
    rooms = np.random.randint(2, 7)
    bathrooms = np.random.randint(1, 4)
    year_built = np.random.randint(1950, 2023)

    floor = np.random.randint(0, 5)
    building_floors = np.random.randint(floor + 1, 10)  # ensure floor < building_floors

    has_elevator = int(building_floors >= CONFIG['ELEVATOR_MIN_FLOORS'])
    has_garden = int(random.random() < 0.30)
    has_balcony = int(random.random() < 0.60)
    garage = int(random.random() < 0.50)

    energy_class = random.choice(ENERGY_CLASSES)
    humidity = round(np.random.uniform(30, 70), 1)
    temperature = round(np.random.uniform(12, 25), 1)
    noise_level = int(np.random.randint(20, 80))
    air_quality_index = int(np.random.randint(30, 150))
    location = random.choice(LOCATIONS)

    current_year = datetime.utcnow().year
    age_years = current_year - year_built

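    # Synthetic valuation: computed in EUR here, converted to thousands below
    base_price_eur = size_m2 * np.random.uniform(CONFIG['PRICE_PER_SQM_MIN'], CONFIG['PRICE_PER_SQM_MAX'])
    if energy_class in ['A', 'B']:
        base_price_eur *= 1.05
    if has_garden:
        base_price_eur += CONFIG['GARDEN_BONUS']
    if has_balcony:
        base_price_eur += CONFIG['BALCONY_BONUS']
    if garage:
        base_price_eur += CONFIG['GARAGE_BONUS']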

    valuation_k = round(base_price_eur / 1000, 2)

    condition_score = simulate_condition_score(humidity, temperature, energy_class)
    risk_score = round(min(1.0, max(0.0, (1 - condition_score) + np.random.normal(0, 0.02))), 3)

    return {
        "asset_id": f"asset_{index:04}",
        "asset_type": ASSET_TYPE,
        "location": location,
        "size_m2": size_m2,
        "rooms": rooms,
        "bathrooms": bathrooms,
        "year_built": year_built,
        "age_years": age_years,
        "floor": floor,
        "building_floors": building_floors,
        "has_elevator": has_elevator,
        "has_garden": has_garden,
        "has_balcony": has_balcony,
        "garage": garage,
        "energy_class": energy_class,
        "humidity_level": humidity,
        "temperature_avg": temperature,
        "noise_level": noise_level,
        "air_quality_index": air_quality_index,
        "valuation_k": valuation_k,
        "condition_score": condition_score,
        "risk_score": risk_score,
        "last_verified_ts": random_recent_timestamp()
    }
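
The valuation steps can be replayed by hand for a fixed price per square meter. The sketch below does that with a hypothetical helper, `valuation_k_for`, which is not part of the notebook; it takes the price per m² as an argument instead of sampling it, and hard-codes the premium values from `CONFIG`.

```python
def valuation_k_for(size_m2, price_per_sqm, energy_class, has_garden, has_balcony, garage):
    """Replay the valuation steps for a fixed price per m² (hypothetical helper)."""
    price = size_m2 * price_per_sqm
    if energy_class in ('A', 'B'):
        price *= 1.05  # multiplier applies to the base price only
    # Fixed premiums in EUR, added after the multiplier
    price += 10_000 * has_garden + 5_000 * has_balcony + 7_000 * garage
    return round(price / 1000, 2)

# 100 m² at 2,000 EUR/m², class A, garden and garage, no balcony
example = valuation_k_for(100, 2000, 'A', has_garden=1, has_balcony=0, garage=1)
# Approximate lower and upper ends of the generated range
lowest = valuation_k_for(40, 1200, 'G', 0, 0, 0)
highest = valuation_k_for(199, 3500, 'A', 1, 1, 1)
print(example, lowest, highest)
```

For the first case the arithmetic is 100 × 2,000 = 200,000, times 1.05 = 210,000, plus 10,000 and 7,000 = 227,000 EUR, which is stored as `valuation_k = 227.0`.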

## 06. Factory multi-RWA

### Technical Overview
Implements extensible factory pattern for multiple asset type support.

### Implementation Details
- Single entry point `generate_asset(asset_type, index)`
- Type-based delegation to specialized generators
- Placeholder branch for future asset classes (art, vehicles, etc.); unsupported types raise `ValueError`

### Purpose
Provides scalable architecture supporting multiple Real World Asset (RWA) categories.

### Output
Factory function that returns a generated asset for supported types and raises `ValueError` otherwise.

In [26]:
def generate_asset(asset_type, index):
    """Asset factory - ready for multi-RWA"""
    if asset_type == "property":
        return generate_property(index)
    # Future support:
    # elif asset_type == "art": return generate_art(index)
    else:
        raise ValueError(f"Unsupported asset_type: {asset_type}")
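
One possible way to grow this factory without extending the `if`/`elif` chain is a registry keyed by asset type. The sketch below illustrates that alternative and is not the notebook's implementation: `GENERATORS`, `register`, `generate_property_stub`, and `generate_asset_from_registry` are all hypothetical names, and the stub returns only two fields.

```python
from typing import Callable, Dict

# Registry mapping asset_type -> generator function (hypothetical structure)
GENERATORS: Dict[str, Callable[[int], dict]] = {}

def register(asset_type: str):
    """Decorator that adds a generator to the registry under asset_type."""
    def wrap(func: Callable[[int], dict]) -> Callable[[int], dict]:
        GENERATORS[asset_type] = func
        return func
    return wrap

@register("property")
def generate_property_stub(index: int) -> dict:
    # Stand-in for the notebook's generate_property; returns only two fields
    return {"asset_id": f"asset_{index:04}", "asset_type": "property"}

def generate_asset_from_registry(asset_type: str, index: int) -> dict:
    """Look up the generator and fail loudly for unknown types."""
    try:
        generator = GENERATORS[asset_type]
    except KeyError:
        raise ValueError(f"Unsupported asset_type: {asset_type}") from None
    return generator(index)

print(generate_asset_from_registry("property", 7))
```

With this layout, adding a new asset class means writing one decorated function; the dispatch code itself stays unchanged.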

## 07. Data Validation

### Technical Overview
Implements comprehensive data integrity validation for generated properties.

### Implementation Details
**Validation rules:**
- Structural constraints: floor ≤ building_floors
- Score boundaries: condition_score, risk_score ∈ [0,1]
- Positive value constraints: valuation_k, size_m2 > 0

### Purpose
Ensures data quality and logical consistency through automated validation.

### Output
Validated property records with guaranteed constraint satisfaction.

In [27]:
# Single-record validation
def validate_property(prop_data):
    """Validate generated property data"""
    assert prop_data['floor'] <= prop_data['building_floors']
    assert 0 <= prop_data['condition_score'] <= 1
    assert 0 <= prop_data['risk_score'] <= 1
    assert prop_data['valuation_k'] > 0
    assert prop_data['size_m2'] > 0
    return prop_data
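
Note that `assert` statements are skipped when Python runs with the `-O` flag. If the checks need to survive that, the same rules can be raised as explicit exceptions. The following is a minimal sketch of such a variant; `check_property` is a hypothetical name, and the sample record reuses values from the first generated row.

```python
def check_property(prop_data: dict) -> dict:
    """Same rules as the assertions, raised as ValueError (hypothetical variant)."""
    problems = []
    if prop_data['floor'] > prop_data['building_floors']:
        problems.append("floor exceeds building_floors")
    for key in ('condition_score', 'risk_score'):
        if not 0 <= prop_data[key] <= 1:
            problems.append(f"{key} outside [0, 1]")
    for key in ('valuation_k', 'size_m2'):
        if prop_data[key] <= 0:
            problems.append(f"{key} must be positive")
    if problems:
        # Report every violated rule at once
        raise ValueError("; ".join(problems))
    return prop_data

good = {'floor': 2, 'building_floors': 7, 'condition_score': 0.852,
        'risk_score': 0.14, 'valuation_k': 348.41, 'size_m2': 142}
bad = dict(good, floor=9, risk_score=1.2)

check_property(good)
try:
    check_property(bad)
except ValueError as err:
    print("rejected:", err)
```

Collecting all problems before raising makes a failing record easier to debug than stopping at the first failed assertion.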

## 08. Generate DataFrame

### Technical Overview
Orchestrates full dataset generation with schema validation and data organization.

### Implementation Details
1. Batch generation: 150 property records, each validated on creation
2. Schema compliance verification: 23 required fields
3. Column reordering to match `REQUIRED_FIELDS`
4. Data types inferred by pandas from the generated dictionaries
5. Preview of the first rows with `df.head()`

### Purpose
Produces production-ready dataset with guaranteed schema compliance and data quality.

### Output
Structured pandas DataFrame (150×23) with validated schema and columns ordered as in `REQUIRED_FIELDS`.

In [28]:
data = [validate_property(generate_asset("property", i)) for i in range(N_ROWS)]
df = pd.DataFrame(data)

REQUIRED_FIELDS = [
    'asset_id', 'asset_type', 'location', 'size_m2', 'rooms', 
    'bathrooms', 'year_built', 'age_years', 'floor', 'building_floors',
    'has_elevator', 'has_garden', 'has_balcony', 'garage', 'energy_class',
    'humidity_level', 'temperature_avg', 'noise_level', 'air_quality_index',
    'valuation_k', 'condition_score', 'risk_score', 'last_verified_ts'
]

def validate_schema(df):
    """Ensure all required fields are present"""
    missing = set(REQUIRED_FIELDS) - set(df.columns)
    assert not missing, f"Missing required fields: {missing}"
    print(f"✅ Schema validation passed - all {len(REQUIRED_FIELDS)} fields present")

validate_schema(df)

preferred_order = REQUIRED_FIELDS
df = df[preferred_order]

df.head()

✅ Schema validation passed - all 23 fields present


| | asset_id | asset_type | location | size_m2 | rooms | bathrooms | year_built | age_years | floor | building_floors | ... | garage | energy_class | humidity_level | temperature_avg | noise_level | air_quality_index | valuation_k | condition_score | risk_score | last_verified_ts |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | asset_0000 | property | Naples | 142 | 5 | 1 | 1964 | 61 | 2 | 7 | ... | 1 | B | 53.9 | 17.8 | 42 | 104 | 348.41 | 0.852 | 0.14 | 2025-06-05T01:11:41Z |
| 1 | asset_0001 | property | Milan | 170 | 6 | 2 | 1979 | 46 | 1 | 9 | ... | 0 | A | 69.7 | 20.0 | 77 | 51 | 222.1 | 0.73 | 0.261 | 2025-07-16T22:40:41Z |
| 2 | asset_0002 | property | Palermo | 54 | 4 | 3 | 2013 | 12 | 0 | 3 | ... | 1 | F | 64.4 | 20.8 | 28 | 68 | 78.45 | 0.742 | 0.271 | 2025-07-07T14:17:41Z |
| 3 | asset_0003 | property | Palermo | 48 | 3 | 1 | 1951 | 74 | 3 | 7 | ... | 0 | B | 47.6 | 13.6 | 27 | 76 | 90.58 | 0.776 | 0.216 | 2025-06-30T20:45:41Z |
| 4 | asset_0004 | property | Rome | 171 | 3 | 2 | 1955 | 70 | 1 | 5 | ... | 1 | D | 37.4 | 24.6 | 45 | 73 | 591.7 | 0.764 | 0.254 | 2025-06-29T17:16:41Z |
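
The `last_verified_ts` values in the preview are plain strings. A downstream consumer could turn them back into timezone-aware datetimes as sketched below; `parse_verified_ts` is a hypothetical helper, and the format string assumes the second-resolution `Z` suffix that `random_recent_timestamp()` emits.

```python
from datetime import datetime, timezone

def parse_verified_ts(value: str) -> datetime:
    """Parse a 'YYYY-MM-DDTHH:MM:SSZ' string into a timezone-aware UTC datetime."""
    # The trailing 'Z' is matched literally, then the UTC zone is attached explicitly
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

ts = parse_verified_ts("2025-06-05T01:11:41Z")  # value from the first preview row
print(ts.isoformat())  # 2025-06-05T01:11:41+00:00
```

`strptime` is used here rather than `datetime.fromisoformat` because older Python versions (before 3.11) do not accept the `Z` suffix in `fromisoformat`.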


## 09. Export CSV

### Technical Overview
Implements data persistence and statistical profiling for dataset validation.

### Implementation Details
- Serialization to CSV format for cross-platform compatibility
- Descriptive statistics via `df.describe(include='all')`, transposed so each feature is a row
- Displayed summary limited to the first five features
- Printed row and column counts as a final shape check

### Purpose
Provides persistent data storage and statistical validation of generated dataset characteristics.

### Output
- Persistent CSV file: `property_dataset_v1.csv`
- Statistical summary confirming data quality and distribution properties

In [29]:
out_path = "../data/property_dataset_v1.csv"
df.to_csv(out_path, index=False)
print("Saved:", out_path, "rows:", len(df), "cols:", len(df.columns))

df.describe(include='all').T.head(5)

Saved: ../data/property_dataset_v1.csv rows: 150 cols: 23


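| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| asset_id | 150.0 | 150.0 | asset_0000 | 1.0 | | | | | | | |
| asset_type | 150.0 | 1.0 | property | 150.0 | | | | | | | |
| location | 150.0 | 8.0 | Florence | 30.0 | | | | | | | |
| size_m2 | 150.0 | | | | 121.746667 | 47.664964 | 40.0 | 78.25 | 122.0 | 165.25 | 199.0 |
| rooms | 150.0 | | | | 3.993333 | 1.440063 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 |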
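
A simple way to confirm an export is to read the file back and compare its shape and header. The sketch below does this with the standard-library `csv` module and an in-memory buffer, so it has no side effects. The two sample rows are a hand-picked subset of fields from the generated data, not the full schema.

```python
import csv
import io

rows = [
    {"asset_id": "asset_0000", "size_m2": 142, "valuation_k": 348.41},
    {"asset_id": "asset_0001", "size_m2": 170, "valuation_k": 222.1},
]

# Write to an in-memory buffer instead of ../data/ so the sketch has no side effects
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)

# Rewind and read the same content back
buffer.seek(0)
loaded = list(csv.DictReader(buffer))

print(len(loaded), list(loaded[0]))
print(type(loaded[0]["size_m2"]).__name__)  # csv returns every value as a string
```

The `csv` module returns all values as strings, whereas `pd.read_csv` infers numeric dtypes, so downstream notebooks that load the file with pandas should get numeric columns back without manual casting.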
