# Satellite-based Property Price Prediction

## Contents
1. Problem Statement
2. Dataset Description
3. Exploratory Data Analysis (EDA)
4. Feature Engineering & Preprocessing
5. Train–Test Split
6. Baseline Model (Tabular Only)
7. XGBoost Model (Tabular Only)
8. Multimodal Model (Tabular + Satellite Images)
9. Model Evaluation & Comparison
10. Final Predictions
11. Conclusion & Future Work

# SECTION 1 — Problem Statement

## Problem Statement
Accurately estimating property prices is critical for real estate valuation,
investment analysis, and urban planning. Traditional models rely mainly on
tabular property attributes such as location, size, and amenities.

In this project, we aim to improve property price prediction by combining:
- **Tabular property data** (numerical and categorical features)
- **Satellite imagery** capturing neighborhood-level visual characteristics

The objective is to compare the performance of:
1. A tabular-data-only model
2. A multimodal model that integrates tabular data with satellite images


# SECTION 2 — Dataset Description

## Dataset Description

The dataset consists of two main components:

### 1. Tabular Property Data
The tabular dataset contains structured information about properties, such as:
- Property identifiers
- Location-related attributes
- Structural characteristics
- Other numerical and categorical features relevant to price prediction

The target variable is:
- **Property Price**, which represents the market value of each property.

### 2. Satellite Images
For each property, a satellite image is available that captures the surrounding
environment (e.g., greenery, road density, building patterns). These images are
used to extract visual features that may influence property prices.

In [None]:
# ========================
# Imports & Configuration
# ========================
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

RANDOM_STATE = 42

DATA_PATH = "./"          # current folder
IMAGE_PATH = "./images/"  # images folder

In [None]:
# Load the training data (Excel file)
df = pd.read_excel(DATA_PATH + "train(1).xlsx")

# Basic dataset overview
print("Dataset shape:", df.shape)
df.head()

In [None]:
# Dataset information
df.info()

In [None]:
print("Images found:", os.listdir(IMAGE_PATH)[:5])
print("Data shape:", df.shape)

### Satellite Image Acquisition (Programmatic)

Satellite images were programmatically acquired using the Mapbox Static Images API.
Each image is centered on the property location using latitude and longitude
coordinates and captures the surrounding environment, including buildings,
roads, greenery, and water bodies.

To ensure reproducibility and avoid redundant API calls, the download logic
checks for existing images and skips re-downloading when images are already present.
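The request format and the skip-if-present check described above can be sketched as two small helpers. This is a minimal illustration, not the notebook's actual download code: the function names `build_static_image_url` and `needs_download` and the example coordinates are assumptions introduced here, and no network request is made.

```python
import os

def build_static_image_url(lat, lon, token, zoom=18, size=224,
                           style="mapbox/satellite-v9"):
    """Return a Mapbox Static Images URL centered on (lat, lon)."""
    # The URL takes longitude first, then latitude, then zoom.
    return (
        f"https://api.mapbox.com/styles/v1/{style}/static/"
        f"{lon},{lat},{zoom}/{size}x{size}"
        f"?access_token={token}"
    )

def needs_download(image_path):
    """True when the image file is missing or empty."""
    return not (os.path.exists(image_path) and os.path.getsize(image_path) > 0)

# Illustrative coordinates and placeholder token (hypothetical values).
url = build_static_image_url(47.5112, -122.257, token="YOUR_MAPBOX_TOKEN")
print(url)
print(needs_download("no_such_dir/no_such_image.png"))
```

Checking each file individually, rather than only whether the folder is empty, lets an interrupted download resume without repeating API calls for images that were already saved.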

In [None]:
# For each house, latitude and longitude are used to download a satellite image
# showing buildings, roads, green areas, and water.
# 1. Get a Mapbox Access Token
# Default public token: "YOUR_MAPBOX_TOKEN"

In [None]:
# 2. Create Folder to Save Images
import os

os.makedirs(IMAGE_PATH, exist_ok=True)

if len(os.listdir(IMAGE_PATH)) == 0:
    print("No satellite images found. Run the download cells below.")
else:
    print("Satellite images already exist. Skipping download.")

In [None]:
# 3. Decide Image Settings (Keep Fixed)
# Setting   Value          Purpose
# Zoom      18             Property + neighborhood
# Size      224x224        CNN standard input size
# Style     satellite-v9   Satellite imagery

In [None]:
# 4. Download ONE Test Satellite Image
import requests

lat = df.loc[0, "lat"]
lon = df.loc[0, "long"]

MAPBOX_TOKEN = "YOUR_MAPBOX_TOKEN"

url = (
    f"https://api.mapbox.com/styles/v1/mapbox/satellite-v9/static/"
    f"{lon},{lat},18/512x512"
    f"?access_token={MAPBOX_TOKEN}"
)

response = requests.get(url)

print("Status:", response.status_code)
print("Type:", response.headers.get("Content-Type"))

if response.headers.get("Content-Type", "").startswith("image"):
    with open("images/test_image.png", "wb") as f:
        f.write(response.content)
    print("✅ Image saved successfully")
else:
    print("❌ Error from Mapbox:")
    print(response.text)


In [None]:
# view test image
from PIL import Image
import matplotlib.pyplot as plt

img = Image.open("images/test_image.png")
plt.imshow(img)
plt.axis("off")
plt.show()

In [None]:
# 5.Bulk Image Download Code for 500 images

# Create a random sample
df_sample = df.sample(n=500, random_state=42).reset_index(drop=True)

In [None]:
# --- IMAGE DOWNLOAD (already completed) ---
# Images that already exist on disk are skipped to avoid re-downloading.

# Download satellite images for the random sample
import requests
from tqdm import tqdm

MAPBOX_TOKEN = "YOUR_MAPBOX_TOKEN"

n_available = 0
for i, row in tqdm(df_sample.iterrows(), total=len(df_sample)):
    # df_sample has a fresh 0..499 index, so i doubles as the image file name
    img_file = os.path.join(IMAGE_PATH, f"{i}.png")
    if os.path.exists(img_file):
        n_available += 1
        continue

    url = (
        f"https://api.mapbox.com/styles/v1/mapbox/satellite-v9/static/"
        f"{row['long']},{row['lat']},18/224x224"
        f"?access_token={MAPBOX_TOKEN}"
    )

    response = requests.get(url, timeout=30)

    # Save only responses that actually contain an image
    if response.headers.get("Content-Type", "").startswith("image"):
        with open(img_file, "wb") as f:
            f.write(response.content)
        n_available += 1

print(f"✅ {n_available} of {len(df_sample)} satellite images available.")


In [None]:
# check if all images are created or not
import os

print(len(os.listdir("images")))
print(os.listdir("images")[:10])

In [None]:
from PIL import Image
import matplotlib.pyplot as plt

img = Image.open("images/1.png")
plt.imshow(img)
plt.axis("off")
plt.show()

**SECTION 3 — Exploratory Data Analysis (EDA)**

## Exploratory Data Analysis (EDA)

This section explores the distribution of the target variable,
identifies missing values, and visualizes sample satellite images
to gain intuition about the data.

In [None]:
# Preview column names
df.columns

In [None]:
# Missing values per column
missing_values = df.isnull().sum()
missing_values[missing_values > 0]

In [None]:
# Price Distribution
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

In [None]:
plt.figure(figsize=(6, 4))
plt.hist(df["price"], bins=50)
plt.title("House Price Distribution")
plt.xlabel("Price")
plt.ylabel("Count")
plt.show()

**Observation:**

House prices are right-skewed, with many lower-priced homes and a long tail of high-value properties.

In [None]:
#Log-Transformed Price Distribution
plt.figure(figsize=(6, 4))
plt.hist(np.log1p(df["price"]), bins=50)
plt.title("Log-Transformed House Price Distribution")
plt.xlabel("log1p(price)")
plt.ylabel("Count")
plt.show()

**Observation:**

The log transformation produces a more symmetric distribution, which is beneficial for regression modeling.
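The effect of the transform can also be checked numerically by comparing skewness before and after `log1p`. The sketch below uses synthetic log-normal values as a stand-in for `df["price"]`; the distribution parameters are assumptions chosen only to produce a right-skewed sample.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic right-skewed "prices" (log-normal); not the project data.
prices = pd.Series(rng.lognormal(mean=13.0, sigma=0.5, size=5000))

skew_raw = prices.skew()            # clearly positive for a right-skewed sample
skew_log = np.log1p(prices).skew()  # close to zero once the tail is compressed
print(f"skewness raw: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```

In the notebook, the same two calls on `df["price"]` would quantify what the histograms show visually.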

In [None]:
# Geospatial Distribution of Prices
plt.figure(figsize=(7, 5))
plt.scatter(
    df["long"],
    df["lat"],
    c=df["price"],
    cmap="viridis",
    s=5
)
plt.colorbar(label="Price")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("House Prices by Location")
plt.show()

**Observation:**

Distinct high-price clusters appear in waterfront and dense urban regions, indicating strong spatial effects.

In [None]:
# Waterfront Effect on Price
df.groupby("waterfront")["price"].mean()

**Observation:**

Properties with waterfront access have substantially higher average prices, which points to the importance of environmental context.

In [None]:
# View Quality vs Price
df.groupby("view")["price"].mean()

**Observation:**

Better views are associated with higher average property prices, which suggests that visual context of the kind captured by satellite imagery may carry price signal.

In [None]:
# Living Area vs Price
plt.figure(figsize=(6, 4))
plt.scatter(df["sqft_living"], df["price"], alpha=0.3)
plt.xlabel("Sqft Living Area")
plt.ylabel("Price")
plt.title("Living Area vs Price")
plt.show()

**Observation:**

Larger living areas generally correspond to higher prices, with some dispersion due to location and amenities.

In [None]:
# Neighborhood Size vs Price 
plt.figure(figsize=(6, 4))
plt.scatter(df["sqft_living15"], df["price"], alpha=0.3)
plt.xlabel("Neighborhood Living Area (sqft_living15)")
plt.ylabel("Price")
plt.title("Neighborhood Size vs Price")
plt.show()

**Observation:**

Properties in neighborhoods with larger average living areas tend to have higher prices, indicating neighborhood-level effects.

In [None]:
# Correlation Analysis
plt.figure(figsize=(12, 8))
sns.heatmap(
    df[
        [
            "price",
            "sqft_living",
            "grade",
            "bathrooms",
            "sqft_living15",
            "view",
            "waterfront"
        ]
    ].corr(),
    annot=True,
    cmap="coolwarm"
)
plt.title("Feature Correlation with Price")
plt.show()

**Observation:**

Living area, grade, bathrooms, and neighborhood features show strong positive correlation with price.
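A ranked list is often easier to read than a heatmap when the question is which features correlate most with the target. The following sketch shows one way to produce it. The DataFrame `demo` is synthetic and only borrows column names from the dataset, so the printed values say nothing about the real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
# Synthetic columns that only mimic the notebook's column names.
sqft_living = rng.normal(2000, 500, n)
demo = pd.DataFrame({
    "sqft_living": sqft_living,
    "bathrooms": sqft_living / 1000 + rng.normal(0, 0.5, n),
    "noise": rng.normal(0, 1, n),
})
demo["price"] = 200 * demo["sqft_living"] + rng.normal(0, 50_000, n)

# Correlation of every feature with the target, strongest first.
corr_with_price = demo.corr()["price"].drop("price").sort_values(ascending=False)
print(corr_with_price)
```

Applied to the notebook's data, the same expression on the numeric columns of `df` would rank the features discussed in the observation above.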

In [None]:
# Target Statistics
df["price"].describe()

In [None]:
# Correlation Matrix (Numeric Features Only)
numeric_df = df.select_dtypes(include=[np.number])

plt.figure(figsize=(10, 8))
sns.heatmap(
    numeric_df.corr(),
    cmap="coolwarm",
    center=0
)
plt.title("Correlation Heatmap (Numeric Features)")
plt.show()

In [None]:
# Sample Satellite Images
from PIL import Image

sample_images = [f for f in os.listdir(IMAGE_PATH) if f.endswith(".png")][:4]

plt.figure(figsize=(10, 5))
for i, img_name in enumerate(sample_images):
    plt.subplot(1, 4, i + 1)
    img = Image.open(os.path.join(IMAGE_PATH, img_name))
    plt.imshow(img)
    plt.axis("off")
    plt.title(img_name)

plt.suptitle("Sample Satellite Images")
plt.show()

**SECTION 4 — Feature Engineering & Preprocessing**

In [None]:
# 1.Target Transformation

# Log-transform the target variable (justified by EDA)
df["log_price"] = np.log1p(df["price"])

In [None]:
# 2.Drop Unused Columns
# Drop unused or non-predictive columns
df = df.drop(columns=["id", "date"], errors="ignore")

In [None]:
# 3.Feature Grouping
# Numerical features
num_features = [
    "bedrooms", "bathrooms", "sqft_living", "sqft_lot",
    "sqft_above", "sqft_basement", "sqft_living15",
    "sqft_lot15", "condition", "grade", "view",
    "waterfront", "floors", "yr_built", "yr_renovated"
]

# Categorical feature
cat_features = ["zipcode"]

# Geographical features (used for satellite imagery alignment)
geo_features = ["lat", "long"]

## Explanation:

- Numerical features capture property size, structure, and quality.
- Zipcode is treated as a categorical variable.
- Latitude and longitude are kept separate for satellite image association.
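Treating zipcode as categorical means it is one-hot encoded later in the pipeline, with `handle_unknown="ignore"`. The sketch below illustrates what that setting does for a zipcode that was never seen during fitting. The zipcode values are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical zipcodes for illustration only.
train_zip = np.array([[98001], [98002], [98003], [98001]])
new_zip = np.array([[98002], [99999]])  # 99999 never appears in train_zip

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_zip)

# Known zipcodes get a single 1; an unseen zipcode becomes an all-zero row.
encoded = encoder.transform(new_zip).toarray()
print(encoder.categories_[0])
print(encoded)
```

The all-zero row means the model falls back on the remaining features for properties in unseen zipcodes instead of raising an error at prediction time.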

In [None]:
# 4.Define Model Inputs and Target
# Features used for tabular modeling
X_tabular = df[num_features + cat_features]

# Target variable
y = df["log_price"]

In [None]:
# 5.Preprocessing Pipelines
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

In [None]:
# Numerical Pipeline
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

# Categorical Pipeline
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

In [None]:
# 6.Combine Preprocessing Steps
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, num_features),
        ("cat", categorical_transformer, cat_features)
    ]
)

**SECTION 5 — Train–Test Split**

## Train–Test Split

The dataset is split into training and testing sets to evaluate model
performance on unseen data. The same split is used for all tabular models to
ensure a fair comparison; the multimodal model later uses its own split of the
500-property image sample.

In [None]:
# 1.Perform the Train–Test Split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_tabular,
    y,
    test_size=0.2,
    random_state=RANDOM_STATE
)

print("Training samples:", X_train.shape[0])
print("Testing samples:", X_test.shape[0])

In [None]:
# 2. Preserve Indices
# Keep the original row indices of each split so rows can be traced back to
# their properties (and satellite images) later
train_indices = X_train.index
test_indices = X_test.index

**SECTION 6 — Baseline Model (Tabular Only)**

## Baseline Model (Tabular Only)

A baseline model is trained using only tabular features to establish a
reference level of performance. This allows us to quantify the improvement
gained by more advanced models such as XGBoost and the multimodal approach.

In [None]:
# 1.Build the Baseline Pipeline
# We’ll use a simple linear model (Ridge Regression) as the baseline.
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
baseline_model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", Ridge(alpha=1.0))
])

In [None]:
# 2.Train the Baseline Model
baseline_model.fit(X_train, y_train)

In [None]:
# 3.Evaluate the Baseline Model
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

y_pred_baseline = baseline_model.predict(X_test)

rmse_baseline = np.sqrt(mean_squared_error(y_test, y_pred_baseline))
r2_baseline = r2_score(y_test, y_pred_baseline)

print("Baseline Model Performance:")
print("RMSE:", rmse_baseline)
print("R²:", r2_baseline)

**Observation:**

The baseline model provides a simple benchmark using linear relationships
between features and price. While it captures general trends, it is limited
in modeling non-linear interactions and complex feature relationships.

**SECTION 7 — XGBoost Model (Tabular Only)**

## XGBoost Model (Tabular Only)

XGBoost is a powerful tree-based ensemble model that can capture non-linear
relationships and feature interactions. In this section, we train an XGBoost
model using only tabular features and compare its performance with the
baseline model.

In [None]:
# 1.Import XGBoost
from xgboost import XGBRegressor

In [None]:
# 2.Define the XGBoost Pipeline
# We reuse the same preprocessing to ensure a fair comparison.
xgb_model = XGBRegressor(
    n_estimators=300,
    max_depth=6,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    objective="reg:squarederror",
    random_state=RANDOM_STATE
)

xgb_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", xgb_model)
])

In [None]:
# 3.Train the XGBoost Model
xgb_pipeline.fit(X_train, y_train)

In [None]:
# 4.Evaluate the XGBoost Model
# Predictions
y_pred_xgb = xgb_pipeline.predict(X_test)

# Metrics
rmse_xgb = np.sqrt(mean_squared_error(y_test, y_pred_xgb))
r2_xgb = r2_score(y_test, y_pred_xgb)

print("XGBoost Model Performance:")
print("RMSE:", rmse_xgb)
print("R²:", r2_xgb)

**Discussion:**

Interestingly, the baseline linear model slightly outperforms the XGBoost
model on this particular train–test split. This suggests that much of the
predictive signal in the tabular data can be captured through approximately
linear relationships, especially after log-transforming the target variable.

Given the strong performance of the baseline, the marginal difference between
models is small and may vary with different data splits or further
hyperparameter tuning.
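One way to check how much a small gap depends on the particular split is to score each model with k-fold cross-validation and compare the spread across folds. The sketch below does this for a Ridge model on synthetic data from `make_regression`, which stands in for `X_train` and `y_train`. In the notebook, `baseline_model` and `xgb_pipeline` could be passed to `cross_val_score` in the same way.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data; not the project data.
X_demo, y_demo = make_regression(
    n_samples=500, n_features=10, noise=10.0, random_state=42
)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    Ridge(alpha=1.0), X_demo, y_demo,
    scoring="neg_root_mean_squared_error", cv=cv,
)
rmse_per_fold = -scores  # the scorer returns negated RMSE

print("RMSE per fold:", np.round(rmse_per_fold, 3))
print("mean:", rmse_per_fold.mean(), "std:", rmse_per_fold.std())
```

If the fold-to-fold standard deviation is comparable to the difference between two models, that difference should not be read as a reliable ranking.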

### Hyperparameter Tuning (XGBoost)

To further evaluate the potential of XGBoost, we tune its hyperparameters with
a randomized search scored by 5-fold cross-validation on the training set. This
helps assess whether performance can be improved beyond the initial, hand-set
configuration used above.

In [None]:
# Define Parameter Grid
from sklearn.model_selection import RandomizedSearchCV

In [None]:
param_grid = {
    "model__n_estimators": [200, 300, 500],
    "model__max_depth": [3, 5, 7],
    "model__learning_rate": [0.01, 0.05, 0.1],
    "model__subsample": [0.7, 0.8, 1.0],
    "model__colsample_bytree": [0.7, 0.8, 1.0]
}

In [None]:
# Run Randomized Search
xgb_search = RandomizedSearchCV(
    estimator=xgb_pipeline,
    param_distributions=param_grid,
    n_iter=20,
    scoring="r2",
    cv=5,
    random_state=RANDOM_STATE,
    n_jobs=-1,
    verbose=1
)

xgb_search.fit(X_train, y_train)

In [None]:
# Best Model Evaluation
best_xgb = xgb_search.best_estimator_

y_pred_xgb_tuned = best_xgb.predict(X_test)

rmse_xgb_tuned = np.sqrt(mean_squared_error(y_test, y_pred_xgb_tuned))
r2_xgb_tuned = r2_score(y_test, y_pred_xgb_tuned)

print("Tuned XGBoost Performance:")
print("RMSE:", rmse_xgb_tuned)
print("R²:", r2_xgb_tuned)

In [None]:
# Best Parameters
xgb_search.best_params_

**Comparison Summary**

| Model | RMSE (log-price) | R² |
|---|---|---|
| Baseline (Ridge) | 0.1830 | 0.8786 |
| Default XGBoost | 0.1895 | 0.8698 |
| Tuned XGBoost | 0.1783 | 0.8848 |

**Final Tabular Model Selection:**

After hyperparameter tuning, XGBoost achieves the best performance among the
tabular models (RMSE 0.1783, R² 0.8848), ahead of both the Ridge baseline and
the initial XGBoost configuration. The margin over the baseline is small, but it
indicates that tuning helps the model capture non-linear feature interactions.
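Because the models are trained on `log1p(price)`, the RMSE values are errors on a log scale and can be read as approximate multiplicative errors on price. The sketch below converts the three reported RMSE values with `expm1`. It is a rough rule of thumb that assumes prices are large enough for `log1p(price)` to be close to `log(price)`.

```python
import numpy as np

# RMSE values on the log1p(price) scale, as reported in the comparison summary.
rmse_log = {
    "Baseline (Ridge)": 0.1830,
    "Default XGBoost": 0.1895,
    "Tuned XGBoost": 0.1783,
}

# exp(rmse) - 1 approximates the relative size of a one-RMSE error on price.
rel_errors = {name: float(np.expm1(rmse)) for name, rmse in rmse_log.items()}

for name, rel in rel_errors.items():
    print(f"{name}: roughly {rel:.1%} relative error")
```

Read this way, all three tabular models are off by roughly a fifth of the price for a typical one-RMSE error, which puts the small differences between them in perspective.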

**SECTION 8 — Multimodal Model (Tabular + Satellite Images)**

## Multimodal Model (Tabular + Satellite Images)

In this section, we build a multimodal model that combines tabular property
features with visual features extracted from satellite images. A convolutional
neural network (CNN) is used to encode satellite images, and the resulting image
embeddings are fused with tabular features to predict property prices.

In [None]:
# Reuse the fixed 500-property random sample created for the image download
# (df_sample), so that row i corresponds to image i.png
print("Multimodal sample shape:", df_sample.shape)

In [None]:
# 1. Prepare Multimodal Dataset
# Use the same 500-property sample that the satellite images were downloaded for
df_mm = df_sample.copy()

# Target (log-price)
y_mm = np.log1p(df_mm["price"])

# Tabular features
X_tabular_mm = df_mm[num_features + cat_features]

In [None]:
print(df_sample.index[:5])
print(os.listdir(IMAGE_PATH)[:5])

In [None]:
# 2. Train–Test Split (Multimodal): a separate split of the 500-property sample
from sklearn.model_selection import train_test_split

X_tab_train, X_tab_test, y_mm_train, y_mm_test = train_test_split(
    X_tabular_mm,
    y_mm,
    test_size=0.2,
    random_state=RANDOM_STATE
)

In [None]:
# Preserve indices for image loading:
train_idx = X_tab_train.index
test_idx = X_tab_test.index

In [None]:
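# 3. Preprocess Tabular Data
# Fit a fresh copy so the shared `preprocessor` used by the tabular pipelines is
# not refit; sparse_threshold=0 forces a dense array for Keras.
from sklearn.base import clone
mm_preprocessor = clone(preprocessor).set_params(sparse_threshold=0)
X_tab_train_proc = mm_preprocessor.fit_transform(X_tab_train)
X_tab_test_proc = mm_preprocessor.transform(X_tab_test)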


In [None]:
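# 4. Load Satellite Images
from tensorflow.keras.preprocessing.image import load_img, img_to_array

IMG_SIZE = (224, 224)

def load_images(indices):
    """Load one image per index, in order, so images stay aligned with tabular rows."""
    images = []
    for idx in indices:
        img_path = os.path.join(IMAGE_PATH, f"{idx}.png")
        # Fail loudly: silently skipping a missing image would shift every
        # later image onto the wrong tabular row and target.
        if not os.path.exists(img_path):
            raise FileNotFoundError(f"Missing satellite image: {img_path}")
        img = load_img(img_path, target_size=IMG_SIZE)
        # Pixels are scaled to [0, 1]. Note that ImageNet-pretrained ResNet50
        # was trained with tf.keras.applications.resnet50.preprocess_input.
        img = img_to_array(img) / 255.0
        images.append(img)
    return np.array(images)

X_img_train = load_images(train_idx)
X_img_test = load_images(test_idx)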

In [None]:
# 5.Build CNN for Image Feature Extraction
import tensorflow as tf
from tensorflow.keras import layers, models

In [None]:
def build_cnn():
    base_model = tf.keras.applications.ResNet50(
        weights="imagenet",
        include_top=False,
        input_shape=(224, 224, 3)
    )
    base_model.trainable = False

    model = models.Sequential([
        base_model,
        layers.GlobalAveragePooling2D(),
        layers.Dense(128, activation="relu")
    ])
    return model

cnn_model = build_cnn()

In [None]:
# 6.Build Multimodal Fusion Model
# Tabular branch
tabular_input = layers.Input(shape=(X_tab_train_proc.shape[1],))
tabular_dense = layers.Dense(64, activation="relu")(tabular_input)

# Image branch
image_input = layers.Input(shape=(224, 224, 3))
image_features = cnn_model(image_input)

# Fusion
combined = layers.concatenate([tabular_dense, image_features])
combined_dense = layers.Dense(64, activation="relu")(combined)
output = layers.Dense(1)(combined_dense)

multimodal_model = models.Model(
    inputs=[tabular_input, image_input],
    outputs=output
)

In [None]:
# 7.Compile & Train the Model
multimodal_model.compile(
    optimizer="adam",
    loss="mse"
)

In [None]:
history = multimodal_model.fit(
    [X_tab_train_proc, X_img_train],
    y_mm_train,
    validation_split=0.2,
    epochs=10,
    batch_size=16,
    verbose=1
)

In [None]:
# 8.Evaluate Multimodal Model
y_mm_pred = multimodal_model.predict([X_tab_test_proc, X_img_test]).flatten()

rmse_mm = np.sqrt(mean_squared_error(y_mm_test, y_mm_pred))
r2_mm = r2_score(y_mm_test, y_mm_pred)

print("Multimodal Model Performance:")
print("RMSE:", rmse_mm)
print("R²:", r2_mm)

**Multimodal Model Analysis:**

The multimodal model demonstrates lower predictive performance compared to the
tabular-only models. This outcome can be attributed to the limited size of the
image dataset and the already strong predictive power of tabular features such
as property size, quality, and amenities.

While satellite imagery provides valuable contextual information about the
surrounding environment, effectively leveraging visual features typically
requires a substantially larger image dataset or task-specific fine-tuning.
Despite the lower quantitative performance, the multimodal approach highlights
the potential of integrating visual context for real estate valuation.
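To separate the effect of the small sample from the effect of the images, a tabular-only model can be trained on the same 500-property subset and compared with the multimodal model. The sketch below illustrates the idea on synthetic data: the same model is fitted once on a full dataset and once on a 500-row subset. The helper `fit_and_score` is hypothetical, and in the notebook the equivalent step would be fitting `baseline_model` on `X_tab_train` and scoring it on `X_tab_test`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the tabular features; not the project data.
X_all, y_all = make_regression(
    n_samples=5000, n_features=15, noise=20.0, random_state=42
)

def fit_and_score(X, y):
    """Fit Ridge on an 80/20 split and return R² on the held-out part."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    model = Ridge(alpha=1.0).fit(X_tr, y_tr)
    return r2_score(y_te, model.predict(X_te))

r2_full = fit_and_score(X_all, y_all)
# Same model restricted to 500 rows, mirroring the size of the image sample.
r2_small = fit_and_score(X_all[:500], y_all[:500])
print(f"R² full data: {r2_full:.3f} | R² 500-row subset: {r2_small:.3f}")
```

If the tabular-only model on the 500-row subset still clearly beats the multimodal model, the gap is more likely caused by the image branch than by the sample size alone.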

### Visual Explainability using Grad-CAM

To inspect which image regions the CNN responds to, Grad-CAM visualizations
were generated for selected satellite images. Grad-CAM highlights the spatial
regions of an image that contribute most strongly to a chosen model output.
Here the heatmaps are computed on the ResNet50 backbone of the image branch,
which gives insight into the visual patterns the CNN focuses on.
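The arithmetic behind a Grad-CAM heatmap is compact enough to show on plain arrays. The sketch below assumes the feature maps and their gradients have already been obtained from a network. The toy arrays and the function name `gradcam_from_arrays` are assumptions for illustration, not part of the notebook's TensorFlow code.

```python
import numpy as np

def gradcam_from_arrays(activations, gradients):
    """Combine feature maps (H, W, C) and their gradients (H, W, C) into a heatmap (H, W)."""
    # One weight per channel: the spatial average of that channel's gradient.
    channel_weights = gradients.mean(axis=(0, 1))
    # Weighted sum over channels, then keep only positive evidence (ReLU).
    heatmap = np.maximum((activations * channel_weights).sum(axis=-1), 0.0)
    # Scale to [0, 1]; an all-zero map is returned unchanged.
    peak = heatmap.max()
    return heatmap / peak if peak > 0 else heatmap

rng = np.random.default_rng(0)
acts = rng.random((7, 7, 4))        # toy 7x7 feature maps with 4 channels
grads = rng.normal(size=(7, 7, 4))  # toy gradients of the score w.r.t. those maps
heatmap = gradcam_from_arrays(acts, grads)
print(heatmap.shape, heatmap.min(), heatmap.max())
```

The resulting low-resolution map is then resized to the input image size and overlaid on the image.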


In [None]:
# Imports and Setups
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import os
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from PIL import Image

IMG_SIZE = (224, 224)

In [None]:
# 2. Extract the ResNet50 backbone from the CNN used in the multimodal model
# (`cnn_model` from build_cnn(); Keras names the backbone "resnet50" by default)
resnet_model = cnn_model.get_layer("resnet50")

# Sanity check: should show ResNet50 layers
resnet_model.summary()

In [None]:
# 3. Grad-CAM Function
def make_gradcam_heatmap(img_array, resnet_model, last_conv_layer_name):
    # Model that returns both the chosen conv feature maps and the backbone output
    grad_model = Model(
        inputs=resnet_model.input,
        outputs=[
            resnet_model.get_layer(last_conv_layer_name).output,
            resnet_model.output
        ]
    )

    img_tensor = tf.convert_to_tensor(img_array, dtype=tf.float32)
    with tf.GradientTape() as tape:
        # Watch the input explicitly: if the backbone is frozen, the tape has
        # no trainable variables to watch and would record nothing
        tape.watch(img_tensor)
        conv_outputs, predictions = grad_model(img_tensor)
        # The backbone has no price head, so its mean activation is the score
        loss = tf.reduce_mean(predictions)

    # Channel weights: gradient of the score averaged over batch and spatial dims
    grads = tape.gradient(loss, conv_outputs)
    pooled_grads = tf.reduce_mean(grads, axis=(0, 1, 2))

    conv_outputs = conv_outputs[0]
    heatmap = tf.reduce_sum(conv_outputs * pooled_grads, axis=-1)

    # Keep positive contributions and scale to [0, 1] (epsilon avoids 0/0)
    heatmap = tf.maximum(heatmap, 0)
    heatmap /= tf.reduce_max(heatmap) + 1e-8
    return heatmap.numpy()

In [None]:
# 4. Image Loader
def load_image(img_path):
    img = load_img(img_path, target_size=IMG_SIZE)
    img = img_to_array(img) / 255.0
    return np.expand_dims(img, axis=0)

In [None]:
GRADCAM_DIR = "gradcam_outputs"
os.makedirs(GRADCAM_DIR, exist_ok=True)

image_ids = [10, 50, 100]  # 2–3 images is enough

for i in image_ids:
    img_path = f"images/{i}.png"
    img_array = load_image(img_path)

    heatmap = make_gradcam_heatmap(
        img_array,
        resnet_model,
        last_conv_layer_name="conv5_block3_out"
    )

    # Load original image (force 3-channel RGB so it matches the heatmap shape)
    img = Image.open(img_path).convert("RGB").resize(IMG_SIZE)
    img = np.array(img)

    # Resize heatmap
    heatmap_resized = Image.fromarray(np.uint8(255 * heatmap)).resize(IMG_SIZE)
    heatmap_resized = np.array(heatmap_resized)

    # Apply colormap
    heatmap_color = plt.cm.jet(heatmap_resized)[:, :, :3] * 255

    # Overlay heatmap
    overlay = (0.6 * img + 0.4 * heatmap_color).astype(np.uint8)

    # Show
    plt.figure(figsize=(4, 4))
    plt.imshow(overlay)
    plt.axis("off")
    plt.title(f"Grad-CAM (Image {i})")
    plt.show()

    # Save
    plt.imsave(f"{GRADCAM_DIR}/gradcam_{i}.png", overlay)

**Grad-CAM Interpretation:**

The Grad-CAM heatmaps indicate that the CNN focuses on regions containing dense
building structures, road networks, and surrounding green spaces. These visual
patterns capture neighborhood-level characteristics such as urban density and
environmental quality, which are known to influence property prices. Because
the heatmaps are derived from the CNN backbone's activations rather than from
the final price output, they are best read as a qualitative view of what the
image branch responds to, not as a direct attribution of predicted price.

**SECTION 9 — Model Evaluation & Comparison**

## Model Evaluation & Comparison

In this section, we compare the performance of all trained models using the same
evaluation metrics. This comparison highlights the strengths and limitations of
each modeling approach, including tabular-only and multimodal models.

In [None]:
import pandas as pd

results = pd.DataFrame({
    "Model": [
        "Baseline (Ridge)",
        "XGBoost (Default)",
        "XGBoost (Tuned)",
        "Multimodal (Tabular + Satellite Images)"
    ],
    "RMSE (log-price)": [
        0.183009208706201,
        0.18952830241383828,
        0.17832588139033603,
        0.36939097447658004
    ],
    "R²": [
        0.8786304675516019,
        0.8698296903829609,
        0.8847628388586452,
        0.47765999855354047
    ]
})

results

**Results Interpretation:**

The tuned XGBoost model achieves the best overall performance among all models,
demonstrating the importance of modeling non-linear relationships and applying
hyperparameter optimization for tabular data.

The baseline Ridge regression model performs competitively, indicating that a
large portion of the predictive signal can be captured through approximately
linear relationships after log-transforming the target variable.

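The multimodal model, which integrates satellite imagery with tabular features,
exhibits lower predictive performance (R² of about 0.48 versus 0.88 for tuned
XGBoost). It was trained on 400 of the 500 sampled properties and evaluated on
the remaining 100, whereas the tabular models use the split of the full
dataset, so the two sets of results are not measured on the same test set. The
gap is likely due mainly to the limited size of the satellite image dataset and
the strong predictive power of the tabular features. Despite lower quantitative
performance, the multimodal approach provides visual interpretability through
Grad-CAM and demonstrates the potential of incorporating spatial context into
property price prediction.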

**SECTION 10 — Final Predictions (CSV)**

## Final Predictions

Using the best-performing model (Tuned XGBoost), final property price
predictions are generated and exported as a CSV file in the required format.
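Since the model predicts `log1p(price)`, its outputs have to be mapped back to the price scale with the inverse transform, `expm1`, before export. The round trip can be sketched on a few illustrative values (the prices below are made up):

```python
import numpy as np

prices = np.array([250_000.0, 540_000.0, 1_200_000.0])  # illustrative values
log_prices = np.log1p(prices)     # scale used as the training target
recovered = np.expm1(log_prices)  # back to the original price scale

print(np.allclose(recovered, prices))
```

Using `np.exp` instead of `np.expm1` here would overstate every price by exactly 1, because `log1p` adds 1 before taking the logarithm.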

In [None]:
# Load test data (Excel)
test_df = pd.read_excel(DATA_PATH + "test2.xlsx")

print("Test data shape:", test_df.shape)
test_df.head()

In [None]:
# Store IDs for submission
test_ids = test_df["id"]

In [None]:
# Drop unused columns
test_df_model = test_df.drop(columns=["id", "date"], errors="ignore")

# Select tabular features
X_test_final = test_df_model[num_features + cat_features]

In [None]:
# Check duplicate IDs
dup_counts = test_df["id"].value_counts()
dup_ids = dup_counts[dup_counts > 1]

print("Number of duplicated IDs:", len(dup_ids))
dup_ids.head()

In [None]:
# Drop duplicate IDs, keep first occurrence
test_df_unique = test_df.drop_duplicates(subset="id", keep="first")

print("Original test size:", test_df.shape)
print("After removing duplicates:", test_df_unique.shape)

In [None]:
import os

OUTPUT_PATH = "outputs"
os.makedirs(OUTPUT_PATH, exist_ok=True)

In [None]:
# Preserve IDs
test_ids = test_df_unique["id"]

# Drop unused columns
test_df_model = test_df_unique.drop(columns=["id", "date"], errors="ignore")

# Select features
X_test_final = test_df_model[num_features + cat_features]

# Predict log-price
test_log_preds = best_xgb.predict(X_test_final)

# Convert to original scale
test_price_preds = np.expm1(test_log_preds)

# Create submission DataFrame
submission = pd.DataFrame({
    "id": test_ids,
    "predicted_price": test_price_preds
})

submission.head()

In [None]:
submission.to_csv(os.path.join(OUTPUT_PATH, "22323041_Final.csv"), index=False)

print("Prediction file saved successfully!")

**Note:** Duplicate property entries were identified in the test dataset. To
ensure a single prediction per property, duplicate IDs were removed by retaining
the first occurrence before generating final predictions.

In [None]:
os.listdir("outputs")

**SECTION 11 — Conclusion & Future Work**

## Conclusion & Future Work

### Conclusion

In this project, we developed and evaluated multiple models for satellite-based
property price prediction, ranging from simple linear baselines to advanced
tree-based and multimodal deep learning approaches.

The baseline Ridge regression model provided a strong reference point,
demonstrating that many relationships in the tabular data are approximately
linear after log-transforming the target variable. A tuned XGBoost model
achieved the best overall performance among tabular-only models, highlighting
the importance of non-linear modeling and hyperparameter optimization for
structured property data.

A multimodal model integrating satellite imagery with tabular features was also
explored. While the multimodal approach did not outperform tabular-only models
in terms of predictive accuracy, it provided valuable visual interpretability
through Grad-CAM and demonstrated the potential of incorporating spatial and
environmental context into real estate valuation.

---

### Future Work

Several directions could further improve the multimodal modeling approach:

- Expanding the satellite image dataset to include a larger and more diverse
  set of properties, which would allow the CNN to learn more robust visual
  representations.
- Fine-tuning the CNN backbone on real estate–specific imagery to better capture
  features relevant to property valuation.
- Incorporating additional spatial data sources, such as points of interest,
  zoning information, or street-level imagery.
- Applying more advanced multimodal fusion strategies and attention mechanisms
  to better balance the contributions of tabular and visual features.

Overall, this project demonstrates the feasibility of combining structured data
with satellite imagery for property price prediction while highlighting both
the strengths and current limitations of multimodal learning in real-world
settings.