# Multimodal Property Valuation: Tabular Data + Satellite Imagery

This notebook builds a multimodal regression pipeline that combines structured
housing attributes with satellite image embeddings. Visual features are extracted
with a pretrained CNN and concatenated with the tabular features, and an XGBoost
regressor predicts the final price.

## 1. Setup and Imports

We import libraries required for image processing, feature extraction using CNNs,
and regression modeling using XGBoost.

In [1]:
import os
import numpy as np
import pandas as pd
from tqdm import tqdm

import torch
import torch.nn as nn

from PIL import Image
from torchvision import models, transforms
from torchvision.models import ResNet18_Weights

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from xgboost import XGBRegressor

## 2. Loading Preprocessed Training Data

We load the stratified training subset created during preprocessing.
Satellite images corresponding to these properties have already been downloaded
and stored locally.

In [2]:
train_df = pd.read_csv("../data/processed/train_sampled.csv")
IMG_DIR = "../data/images/train"

print(train_df.shape)
train_df.head()

(5000, 21)


|   | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9543000205 | 20150413T000000 | 139950 | 0 | 0.0 | 844 | 4269 | 1.0 | 0 | 0 | ... | 7 | 844 | 0 | 1913 | 0 | 98001 | 47.2781 | -122.25 | 1380 | 9600 |
| 1 | 3353400120 | 20140701T000000 | 174000 | 2 | 1.0 | 900 | 13531 | 1.0 | 0 | 0 | ... | 6 | 900 | 0 | 1979 | 0 | 98001 | 47.2616 | -122.251 | 1767 | 8308 |
| 2 | 2976800749 | 20141031T000000 | 150000 | 4 | 2.0 | 1460 | 7254 | 1.0 | 0 | 0 | ... | 6 | 1460 | 0 | 1959 | 0 | 98178 | 47.5056 | -122.254 | 1460 | 7236 |
| 3 | 7335400020 | 20140626T000000 | 219500 | 3 | 1.0 | 1090 | 6710 | 1.5 | 0 | 0 | ... | 5 | 1090 | 0 | 1912 | 0 | 98002 | 47.3066 | -122.217 | 1170 | 6708 |
| 4 | 7883600700 | 20150122T000000 | 157500 | 2 | 1.0 | 670 | 4500 | 1.0 | 0 | 0 | ... | 5 | 670 | 0 | 1905 | 0 | 98108 | 47.5271 | -122.326 | 1210 | 4500 |


Because the subset was stratified during preprocessing, it is intended to cover the
full price range, so both low- and high-priced properties take part in multimodal training.

## 3. CNN Feature Extractor

A pretrained ResNet-18 model is used as a fixed feature extractor.
The final classification layer is removed so that the network outputs
512-dimensional visual embeddings.

In [3]:
cnn = models.resnet18(weights=ResNet18_Weights.DEFAULT)
cnn.fc = nn.Identity()   # remove classifier head
cnn.eval()

ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
  

## 4. Image Preprocessing

Satellite images are resized and normalized using ImageNet statistics to match
the pretrained CNN’s expected input distribution.

In [4]:
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )
])
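
To make the normalization step concrete, here is a small NumPy sketch of the per-channel arithmetic that `transforms.Normalize` applies after `ToTensor` has scaled pixels to `[0, 1]`. The helper `normalize_chw` and the toy image are hypothetical and only illustrate the formula `(x - mean) / std`.

```python
import numpy as np

# ImageNet channel statistics, copied from the transform above (RGB order).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def normalize_chw(pixels_01: np.ndarray) -> np.ndarray:
    """Standardize a (3, H, W) array of [0, 1] pixel values channel by channel."""
    # Reshape to (3, 1, 1) so the statistics broadcast over height and width.
    mean = IMAGENET_MEAN[:, None, None]
    std = IMAGENET_STD[:, None, None]
    return (pixels_01 - mean) / std


# Hypothetical 2x2 image whose every pixel equals its channel mean.
mean_image = np.broadcast_to(IMAGENET_MEAN[:, None, None], (3, 2, 2))
normalized = normalize_chw(mean_image)
print(normalized.shape, float(np.abs(normalized).max()))
```

A pixel equal to the channel mean maps to zero, and a pure white pixel in the red channel maps to `(1 - 0.485) / 0.229`, roughly 2.25.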

## 5. Robust Image Embedding Extraction

A helper function is defined to safely extract image embeddings.
Properties with missing or corrupted images are skipped to ensure
data integrity in the multimodal dataset.

In [5]:
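def extract_embedding_safe(img_path):
    """Return the 512-d embedding for one image, or None if it cannot be read."""
    try:
        img = Image.open(img_path).convert("RGB")
        x = transform(img).unsqueeze(0)  # add batch dimension: (1, 3, 224, 224)

        with torch.no_grad():  # inference only, no gradients needed
            emb = cnn(x).squeeze().numpy()  # (1, 512) -> (512,)

        return emb
    except Exception:  # missing or unreadable image; unlike a bare except, Ctrl-C still works
        return None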

## 6. Extracting Image Embeddings

We iterate over all training samples, extract CNN embeddings for available images,
and keep track of valid property IDs to ensure proper alignment with tabular data.

In [6]:
embeddings = []
valid_ids = []

for _, row in tqdm(train_df.iterrows(), total=len(train_df)):
    img_path = f"{IMG_DIR}/{row['id']}.png"
    emb = extract_embedding_safe(img_path)

    if emb is not None:
        embeddings.append(emb)
        valid_ids.append(row["id"])

100%|███████████████████████████████████████| 5000/5000 [00:56<00:00, 88.07it/s]


In [7]:
emb_df = pd.DataFrame(
    embeddings,
    columns=[f"img_emb_{i}" for i in range(512)]
)

emb_df["id"] = valid_ids

print("Embeddings shape:", emb_df.shape)

Embeddings shape: (5000, 513)


The resulting embedding matrix contains one 512-dimensional visual representation
for each property with a valid satellite image.

## 7. Aligning Tabular and Image Data

To avoid misalignment, the tabular dataset is filtered to include only those
properties for which valid image embeddings were successfully extracted.

In [8]:
train_df_mm = train_df[train_df["id"].isin(valid_ids)].copy()

print("Filtered train_df:", train_df_mm.shape)

Filtered train_df: (5000, 21)


This step restricts the tabular data to properties that have an image embedding.
It filters by `id` but does not deduplicate: if an `id` occurs more than once in
`train_df`, all of its rows are kept.

## 8. Creating the Multimodal Dataset

Image embeddings are merged with tabular features using property IDs.
An inner join is used to guarantee consistency across modalities.

In [9]:
full_df = train_df_mm.merge(emb_df, on="id", how="inner")

print("Final multimodal dataset:", full_df.shape)

Final multimodal dataset: (5030, 533)


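The final multimodal dataset combines structured housing attributes
with high-dimensional visual context from satellite imagery. It has 5030 rows,
30 more than the 5000 in either input, which means some `id` values occur more
than once on both sides of the join and each such `id` contributes every pairing
of its rows.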
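
The merge output has more rows than either input (5030 versus 5000), which can only happen when some `id` values are repeated on both sides of the join. The sketch below reproduces the effect on hypothetical toy data and shows one possible remedy, assuming every row with the same `id` shares the same image: keep one embedding per `id` and let pandas check the join with `validate="many_to_one"`.

```python
import pandas as pd

# Hypothetical toy data: property 2 was sold twice, so its id appears twice.
tab = pd.DataFrame({"id": [1, 2, 2], "price": [100, 200, 250]})
emb = pd.DataFrame({"id": [1, 2, 2], "img_emb_0": [0.1, 0.7, 0.7]})

# Duplicates on both sides: id 2 yields 2 x 2 = 4 rows, so 5 rows in total.
inflated = tab.merge(emb, on="id", how="inner")

# Keep one embedding per id, then have pandas verify the join is many-to-one.
emb_unique = emb.drop_duplicates(subset="id")
aligned = tab.merge(emb_unique, on="id", how="inner", validate="many_to_one")

print(len(inflated), len(aligned))
```

With `validate="many_to_one"`, pandas raises `MergeError` if the right-hand frame still contains repeated keys, so the row inflation cannot pass unnoticed.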

## 9. Preparing Features and Target Variable

The target variable (`price`), the non-predictive identifier (`id`), and the raw
`date` string are removed from the feature matrix. The remaining columns serve as
inputs to the regressor.

In [10]:
X = full_df.drop(columns=["id", "price", "date"])
y = full_df["price"]

print(X.shape, y.shape)

(5030, 530) (5030,)


## 10. Train–Validation Split

The multimodal dataset is split into training and validation sets
to evaluate generalization performance.

In [11]:
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)
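
If the same property `id` appears in several rows, a purely random split can place copies of one property in both the training and validation sets. A possible alternative, sketched here on hypothetical toy data, is scikit-learn's `GroupShuffleSplit`, which keeps all rows of a group on the same side. In this notebook the groups would have to be taken from `full_df["id"]` before that column is dropped.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical toy frame: ids 2 and 4 each appear twice.
toy = pd.DataFrame({
    "id": [1, 2, 2, 3, 4, 4, 5, 6, 7, 8],
    "sqft_living": np.arange(10) * 100 + 800,
    "price": np.arange(10) * 10_000 + 150_000,
})

# test_size applies to groups (ids) here, not to individual rows.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, val_idx = next(splitter.split(toy, groups=toy["id"]))

train_ids = set(toy.iloc[train_idx]["id"])
val_ids = set(toy.iloc[val_idx]["id"])
print("shared ids:", train_ids & val_ids)
```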

## 11. Multimodal Regressor: XGBoost

XGBoost is used as the final regression model because it performs well on tabular
data and can take the 512 image embedding columns as ordinary numeric features.

In [12]:
xgb_mm = XGBRegressor(
    n_estimators=600,
    max_depth=6,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42
)

xgb_mm.fit(X_train, y_train)

| Parameter | Value |
|---|---|
| objective | 'reg:squarederror' |
| base_score | |
| booster | |
| callbacks | |
| colsample_bylevel | |
| colsample_bynode | |
| colsample_bytree | 0.8 |
| device | |
| early_stopping_rounds | |
| enable_categorical | False |


## 12. Model Evaluation

Model performance is evaluated on the validation set using RMSE and R² metrics.

In [13]:
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

preds = xgb_mm.predict(X_val)

rmse = np.sqrt(mean_squared_error(y_val, preds))
r2 = r2_score(y_val, preds)

print("Multimodal XGB RMSE:", rmse)
print("Multimodal XGB R²:", r2)

Multimodal XGB RMSE: 139433.42635107265
Multimodal XGB R²: 0.8592821359634399
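
For reference, both metrics can be written out in a few lines of NumPy. This is an illustrative sketch of the standard definitions, not a replacement for the scikit-learn calls above; the helper names and the demo values are hypothetical.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # variation around the mean
    return float(1.0 - ss_res / ss_tot)


# Hypothetical prices and predictions, for illustration only.
y_true_demo = [100_000, 200_000, 300_000]
y_pred_demo = [110_000, 190_000, 300_000]
print(rmse(y_true_demo, y_pred_demo), r2(y_true_demo, y_pred_demo))
```

RMSE is expressed in the units of `price`, while R² compares the model's squared error with that of always predicting the mean price.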


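The multimodal model reaches a validation R² of about 0.86 with an RMSE of about
139,000 (in the units of `price`). Because this notebook does not train a
tabular-only baseline, these figures alone do not establish how much the satellite
image embeddings add beyond the structured features.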

This model is selected as the final model for test set inference.
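
Whether the image embeddings add information beyond the structured features can be checked with a simple ablation: train the same regressor on tabular-only, image-only, and combined columns, then compare validation scores. The sketch below only builds the three column sets; the helper `ablation_feature_sets` is hypothetical and relies on the `img_emb_` prefix used when the embedding columns were created.

```python
import pandas as pd


def ablation_feature_sets(columns, prefix="img_emb_"):
    """Group column names into tabular-only, image-only, and combined sets."""
    image_cols = [c for c in columns if c.startswith(prefix)]
    tabular_cols = [c for c in columns if not c.startswith(prefix)]
    return {
        "tabular": tabular_cols,
        "image": image_cols,
        "multimodal": tabular_cols + image_cols,
    }


# Hypothetical miniature feature matrix with two tabular and two image columns.
X_demo = pd.DataFrame(
    [[3, 1090, 0.12, 0.98]],
    columns=["bedrooms", "sqft_living", "img_emb_0", "img_emb_1"],
)
feature_sets = ablation_feature_sets(X_demo.columns)
for name, cols in feature_sets.items():
    print(name, X_demo[cols].shape)
```

Fitting an `XGBRegressor` with the same hyperparameters on `X_train[feature_sets["tabular"]]` and scoring it on the matching validation columns would give the baseline needed for that comparison.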

## 13. Saving the Trained Multimodal Model

The trained XGBoost model is saved to disk for reuse during test-time inference.

In [14]:
import joblib
joblib.dump(xgb_mm, "../models/xgb_multimodal.pkl")

['../models/xgb_multimodal.pkl']
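
When the saved model is reloaded for inference, the test features generally need the same columns in the same order as during training. The following sketch shows one way to enforce that, assuming the training column list (for example `list(X.columns)`) has been stored next to the model; the helper `align_to_training_columns` and the demo values are hypothetical.

```python
import pandas as pd


def align_to_training_columns(X_new, training_columns):
    """Return X_new with exactly the training columns, in training order."""
    missing = [c for c in training_columns if c not in X_new.columns]
    if missing:
        raise ValueError(f"Missing feature columns: {missing[:5]}")
    # Extra columns (for example id or date) are dropped by the selection.
    return X_new[list(training_columns)]


# Hypothetical column list and test row, for illustration only.
training_columns = ["bedrooms", "sqft_living", "img_emb_0"]
X_new = pd.DataFrame({
    "img_emb_0": [0.4], "id": [123], "sqft_living": [900], "bedrooms": [2],
})
X_ready = align_to_training_columns(X_new, training_columns)
print(list(X_ready.columns))
```

Failing loudly on missing columns is a deliberate choice here: silently filling them would hide preprocessing differences between the training and test pipelines.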

## Summary

This notebook demonstrates a complete multimodal learning pipeline for property valuation.
Satellite images are encoded using a pretrained CNN, fused with structured housing data,
and used to train an XGBoost regressor. The resulting model reaches a validation R² of
about 0.86 and is selected for final test set predictions.