# Predicting Real Estate Demand: A Two-Model Approach with Post-Processing

This document outlines a complete machine learning workflow for predicting the `new_house_transaction_amount` in a real estate demand prediction competition.

Rather than predicting the final amount directly, the workflow builds two separate models:
1.  A model to predict the **price per area**.
2.  A model to predict the **total area**.

The predictions from these two models are then multiplied together and refined through post-processing (unit scaling, a per-sector fallback, smoothing, and clipping) to handle near-zero predictions, month-to-month spikes, and extreme values.


## 1. Setup and Data Loading

First, we import the necessary libraries and set up our environment. The **LightGBM** import is guarded, and its availability is recorded in `LGB_AVAILABLE`; **RandomForestRegressor** is imported as the intended fallback. Note that the training cells in Section 3 call `lgb.train` directly, so as written they require LightGBM.


In [1]:
import os, sys, math, warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error

# Try to use LightGBM if present
LGB_AVAILABLE = False
try:
    import lightgbm as lgb
    LGB_AVAILABLE = True
except Exception:
    LGB_AVAILABLE = False

SEED = 42
np.random.seed(SEED)

# Define input/output directories
INPUT_DIR = "data"
OUT_DIR = "outputs"
print("INPUT_DIR:", INPUT_DIR, "OUT_DIR:", OUT_DIR, "LightGBM:", LGB_AVAILABLE)

# Load data
train = pd.read_csv(os.path.join(INPUT_DIR, "train/new_house_transactions.csv"))
test = pd.read_csv(os.path.join(INPUT_DIR, "test.csv"))

print("train shape:", train.shape, "test shape:", test.shape)

INPUT_DIR: data OUT_DIR: outputs LightGBM: True
train shape: (5433, 11) test shape: (1152, 2)
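
The setup cell records whether LightGBM is available, but the training cells in Section 3 do not branch on that flag. As a minimal sketch of how the `RandomForestRegressor` fallback could be wired in, the helper below trains whichever model is available. The name `fit_predict` and the forest settings are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

try:
    import lightgbm as lgb
    LGB_AVAILABLE = True
except Exception:
    LGB_AVAILABLE = False


def fit_predict(X_train, y_train, X_test, params=None, num_boost_round=800, seed=42):
    """Hypothetical helper: train LightGBM if importable, else a random forest."""
    if LGB_AVAILABLE:
        # Same parameter values as the notebook's price model
        params = params or {
            "objective": "regression",
            "metric": "mae",
            "learning_rate": 0.03,
            "num_leaves": 64,
            "seed": seed,
            "verbosity": -1,
        }
        dtrain = lgb.Dataset(X_train, label=y_train)
        model = lgb.train(params, dtrain, num_boost_round=num_boost_round)
        return model.predict(X_test)
    # Fallback path: these settings are illustrative, not tuned
    model = RandomForestRegressor(n_estimators=200, random_state=seed, n_jobs=-1)
    model.fit(X_train, y_train)
    return model.predict(X_test)
```

With a helper like this, the price and area cells could call `fit_predict(X_price, y_price, X_test)` and apply `np.expm1` to the result, regardless of which library is installed.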


## 2. Data Preparation and Feature Engineering

Before modeling, we clean the data and create a consistent set of features for both the training and test sets.

* The target columns (`area`, `price`, `amount`) are identified. If the `amount` column is missing, it's calculated as `area * price`.
* Missing numerical values are filled with 0.
* Categorical features like `sector` and `month` are converted into numerical codes so the model can use them.


In [2]:
# Robustly find column names
def find_col(df, candidates):
    for c in candidates:
        if c in df.columns:
            return c
    return None

area_col_train = find_col(train, ["area_new_house_transactions"])
price_col_train = find_col(train, ["price_new_house_transactions"])

# --- Derive sector/month for test from "id" when they are missing ---
if "sector" not in test.columns:
    # expand=False returns a Series rather than a one-column DataFrame
    test["sector"] = test["id"].str.extract(r"sector (\d+)", expand=False).astype(float)
if "month" not in test.columns:
    test["month"] = test["id"].str.extract(r"(\d{4} \w+)", expand=False)

# Basic cleaning and type conversion
train = train.copy()
test = test.copy()
train[area_col_train] = train[area_col_train].fillna(0).astype(float)
train[price_col_train] = train[price_col_train].fillna(0).astype(float)
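# Compute the amount only when the column is missing, so an existing column keeps its original units
if "amount_new_house_transactions" not in train.columns:
    train["amount_new_house_transactions"] = train[area_col_train] * train[price_col_train]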

# Define the feature set, excluding identifiers and targets
exclude = set(["month", "sector", "id", area_col_train, price_col_train, "amount_new_house_transactions"])
features = [c for c in train.columns if c not in exclude and train[c].dtype in [np.int64, np.float64]]

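# Add encoded categorical features.
# Derive the code from the sector number itself so the same sector gets the same
# code in train and test (separate pd.factorize calls do not guarantee this).
# Assumes every sector label contains its number, as the "sector (\d+)" regex above implies.
def sector_to_code(s):
    num = pd.to_numeric(s.astype(str).str.extract(r"(\d+)", expand=False), errors="coerce")
    return num.fillna(-1).astype(int)

train["sector_code"] = sector_to_code(train["sector"])
test["sector_code"] = sector_to_code(test["sector"])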

# Month encoding
months = sorted(train["month"].astype(str).unique().tolist())
mo2i = {m: i for i, m in enumerate(months)}
train["month_code"] = train["month"].astype(str).map(mo2i).fillna(-1).astype(int)
test["month_code"] = test["month"].astype(str).map(mo2i).fillna(-1).astype(int)

# Final feature list
features = ["month_code", "sector_code"] + features

print("✅ Using features:", features[:20], f"(total {len(features)})")


✅ Using features: ['month_code', 'sector_code', 'num_new_house_transactions', 'area_per_unit_new_house_transactions', 'total_price_per_unit_new_house_transactions', 'num_new_house_available_for_sale', 'area_new_house_available_for_sale', 'period_new_house_sell_through'] (total 8)
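
Categorical codes are only meaningful if the same label maps to the same code in train and test. The sketch below uses made-up sector labels to contrast independent `pd.factorize` calls with a single mapping fitted on the training data, which is the pattern the month encoding above follows. The frame names `train_demo` and `test_demo` are illustrative.

```python
import pandas as pd

# Toy frames: the sector labels are made up for illustration
train_demo = pd.DataFrame({"sector": ["sector 3", "sector 1", "sector 3"]})
test_demo = pd.DataFrame({"sector": ["sector 1", "sector 3", "sector 7"]})

# Separate factorize calls number labels by order of first appearance,
# so "sector 3" is code 0 in train_demo but code 1 in test_demo.
separate_train = pd.factorize(train_demo["sector"])[0]
separate_test = pd.factorize(test_demo["sector"])[0]

# Shared mapping: fit on train, apply to both; labels unseen in train become -1.
categories = sorted(train_demo["sector"].unique())
sector2i = {s: i for i, s in enumerate(categories)}
train_demo["sector_code"] = train_demo["sector"].map(sector2i).fillna(-1).astype(int)
test_demo["sector_code"] = test_demo["sector"].map(sector2i).fillna(-1).astype(int)

print(train_demo)
print(test_demo)
```

A tree model splits on these integer codes, so a code that refers to different sectors in train and test would silently route test rows to the wrong leaves.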


## 3. Modeling Strategy: Predicting Price and Area Separately

Instead of predicting the total `amount` directly, we build two separate models. This approach can be more robust because `price per area` and `area` might have different relationships with the input features.

**Log Transformation:** We apply a `log1p` transformation (`log(1+x)`) to both target variables. It is defined at zero and compresses the long right tail typical of price and area data, which often makes the regression easier to fit. Model outputs are converted back to the original scale with `expm1`.
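
As a small illustration of the transform and its inverse, the sketch below runs a round trip on made-up price and area values and then forms the amount as their product. The true targets stand in for model outputs here, so it shows only the mechanics, not model behavior.

```python
import numpy as np

# Illustrative values only, including a zero and a large outlier
price = np.array([0.0, 8000.0, 25000.0, 120000.0])
area = np.array([0.0, 1500.0, 4000.0, 90000.0])

# Forward transform used for the training targets
y_price = np.log1p(price)
y_area = np.log1p(area)

# log1p is defined at zero and shrinks the gap between large values
print(np.round(y_price, 3))

# Inverse transform applied to model outputs
price_back = np.expm1(y_price)
area_back = np.expm1(y_area)

# The final amount is the product of the two back-transformed predictions
amount = price_back * area_back
print(amount)
```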


### a) Training the Price Model

In [6]:
# --- Deduplicate columns (critical for LightGBM) ---
features = list(dict.fromkeys(features))  # removes duplicates
train = train.loc[:, ~train.columns.duplicated()]
test = test.loc[:, ~test.columns.duplicated()]
print(f"✅ Deduplicated features: {len(features)} remain.")

# --- Prepare aligned feature matrices ---
X_price = train[features].fillna(0)
y_price = np.log1p(train[price_col_train].clip(lower=0).astype(float).values)

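# Align test columns with train features; any feature absent from test is filled with 0.0.
# The test file loaded above has only 2 columns, so this likely applies to every
# feature except month_code and sector_code.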
X_test = test.copy()
for c in features:
    if c not in X_test.columns:
        X_test[c] = 0.0
X_test = X_test[features].fillna(0)

print(f"✅ Feature alignment done: {len(features)} features used.")

# --- LightGBM parameters ---
params = {
    "objective": "regression",
    "metric": "mae",
    "learning_rate": 0.03,
    "num_leaves": 64,
    "seed": 42,
    "verbosity": -1,
}

# --- Train model ---
print("🚀 Training LightGBM price model...")
dtrain = lgb.Dataset(X_price, label=y_price)
model_price = lgb.train(params, dtrain, num_boost_round=800)

# --- Predict ---
pred_price = np.expm1(model_price.predict(X_test))
pred_price = np.clip(pred_price, 0, None)
print(f"✅ Price predictions ready. Shape: {pred_price.shape}, mean={pred_price.mean():.2f}")

✅ Deduplicated features: 8 remain.
✅ Feature alignment done: 8 features used.
🚀 Training LightGBM price model...
✅ Price predictions ready. Shape: (1152,), mean=22286.24


### b) Training the Area Model

In [7]:
X_area = X_price  # Use the same features
y_area = np.log1p(train[area_col_train].clip(lower=0).astype(float).values)
dtrain2 = lgb.Dataset(X_area, label=y_area)
params2 = { **params, "seed": SEED + 1 } # Use a different seed for model diversity

print("Training LightGBM area model...")
model_area = lgb.train(params2, dtrain2, num_boost_round=800)
pred_area = np.expm1(model_area.predict(X_test))

Training LightGBM area model...
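
The notebook imports `KFold` and `mean_absolute_error` but fits both models on all training rows without a validation step. A minimal out-of-fold sketch is shown below, on synthetic data and with `RandomForestRegressor` so that it runs without LightGBM. The helper name `oof_mae` and all settings are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold


def oof_mae(X, y_log, n_splits=5, seed=42):
    """Hypothetical helper: out-of-fold MAE reported on the original (expm1) scale."""
    oof = np.zeros(len(y_log))
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, valid_idx in kf.split(X):
        model = RandomForestRegressor(n_estimators=100, random_state=seed)
        model.fit(X[train_idx], y_log[train_idx])
        # Each row is predicted by a model that never saw it
        oof[valid_idx] = model.predict(X[valid_idx])
    return mean_absolute_error(np.expm1(y_log), np.expm1(oof))


# Synthetic stand-in for the feature matrix and a non-negative target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = np.exp(1.0 + 0.5 * X_demo[:, 0])
score = oof_mae(X_demo, np.log1p(y_demo))
print(f"OOF MAE: {score:.3f}")
```

Shuffled folds ignore the time ordering of the months; for a forecasting task, a split that holds out the most recent months may give a more realistic estimate.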


## 4. Post-Processing: Refining the Predictions

The raw model predictions are refined in four steps:

1.  **Combine & Scale:** The final `amount` is the product of the two predictions (`price * area`). If the mean prediction exceeds 100,000, all predictions are divided by 10,000 so that they match the competition's expected units.
2.  **Sector Fallback:** Predictions below 1.0 are replaced with 80% of the median training amount for that sector, wherever a sector median is available.
3.  **Smoothing:** A 3-month centered rolling mean is applied to the predictions within each sector. This damps sharp month-to-month spikes or dips; with `min_periods=1`, the first and last month of each sector are averaged over the two values that exist.
4.  **Outlier Clipping & Flooring:** Finally, predictions are clipped to the range from 0.5 times the 1st percentile to 1.5 times the 99th percentile, and any value still below 1.0 is set to zero.
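
The smoothing and clipping steps can be tried in isolation. The sketch below applies them to a made-up series for a single sector, using the same window, percentiles, and floor as the list above.

```python
import pandas as pd

# Made-up predictions for one sector over five consecutive months
s = pd.Series([10.0, 40.0, 10.0, 10.0, 100.0])

# Centered 3-point window; min_periods=1 lets the two edge months
# average over the two values that exist instead of returning NaN
smoothed = s.rolling(window=3, min_periods=1, center=True).mean()
print(smoothed.round(2).tolist())

# Clip to a band around the 1st and 99th percentiles, then zero out values below 1.0
lo = smoothed.quantile(0.01) * 0.5
hi = smoothed.quantile(0.99) * 1.5
final = smoothed.clip(lower=lo, upper=hi)
final = final.where(final >= 1.0, 0.0)
print(final.round(2).tolist())
```

In this toy series the isolated spike of 40 is pulled down to about 20, while the last value is averaged only with its single neighbor.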


In [8]:
# --- Combine, Scale, and Post-Process ---
pred_amount = pred_price * pred_area
pred_amount = np.clip(pred_amount, 0, None)
mean_pred = np.nanmean(pred_amount)
if mean_pred > 1e5:
    print(f"Detected large-scale predictions (mean {mean_pred:.1f}), scaling down by 10,000.")
    pred_amount /= 10000.0

pred_df = pd.DataFrame({"id": test["id"], "sector": test["sector"], "pred_amount": pred_amount})

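# Sector fallback for near-zero predictions.
# Key the training medians by sector number so the lookup also works when train
# stores labels such as "sector 12" while pred_df["sector"] holds 12.0.
train_sector_num = pd.to_numeric(
    train["sector"].astype(str).str.extract(r"(\d+)", expand=False), errors="coerce"
).astype(float)
train_sector_median = train.groupby(train_sector_num)["amount_new_house_transactions"].median().to_dict()
mask_zero = pred_df["pred_amount"] < 1.0
pred_df["sector_median"] = pred_df["sector"].astype(float).map(train_sector_median)
replace_mask = mask_zero & pred_df["sector_median"].notna()
pred_df.loc[replace_mask, "pred_amount"] = pred_df.loc[replace_mask, "sector_median"] * 0.8
print(f"Sector fallback replaced {int(replace_mask.sum())} near-zero predictions.")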

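# Smoothing with rolling mean
pred_df["month"] = pred_df["id"].str.extract(r"(\d{4} \w+)", expand=False)
# Parse a sortable date: the raw strings would sort alphabetically (Apr, Aug, Dec, ...)
# rather than chronologically. Assumes English month names; the first three letters
# are parsed with %b, and values that do not parse become NaT.
pred_df["month_dt"] = pd.to_datetime(
    pred_df["month"].str.extract(r"(\d{4} [A-Za-z]{3})", expand=False),
    format="%Y %b", errors="coerce",
)
pred_df["sector_int"] = pred_df["sector"].astype(int)
pred_df = pred_df.sort_values(["sector_int", "month_dt", "month"]).reset_index(drop=True)
pred_df["smoothed"] = pred_df.groupby("sector_int")["pred_amount"].transform(lambda s: s.rolling(window=3, min_periods=1, center=True).mean())
pred_df["pred_amount"] = np.clip(pred_df["smoothed"], 0, None)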

# Outlier clipping and flooring
q1, q99 = pred_df["pred_amount"].quantile(0.01), pred_df["pred_amount"].quantile(0.99)
pred_df["pred_amount"] = pred_df["pred_amount"].clip(lower=q1*0.5, upper=q99*1.5)
pred_df["pred_amount"] = pred_df["pred_amount"].where(pred_df["pred_amount"] >= 1.0, 0.0)

Detected large-scale predictions (mean 733262.7), scaling down by 10,000.
Sector fallback replaced 0 near-zero predictions.
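
Rolling windows follow row order, so the sort that precedes smoothing needs to be chronological. The sketch below assumes month labels made of a year and an abbreviated English month name (an assumption about the `id` format, with made-up values). It shows that sorting the strings directly is alphabetical, while parsing them first gives calendar order.

```python
import pandas as pd

# Made-up month labels in a "YYYY Mon" style
months = pd.Series(["2024 Sep", "2024 Aug", "2024 Dec", "2024 Oct"])

# Sorting the raw strings orders the months alphabetically
alphabetical = months.sort_values().tolist()
print(alphabetical)

# Parsing to datetime first gives a chronological order
parsed = pd.to_datetime(months, format="%Y %b")
chronological = months.loc[parsed.sort_values().index].tolist()
print(chronological)
```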


## 5. Final Submission

The fully processed predictions are now saved as `submission_v6_price_area.csv` in `OUT_DIR`, in the format required by the competition.


In [9]:
submission = pred_df[["id","pred_amount"]].rename(columns={"pred_amount":"new_house_transaction_amount"})
os.makedirs(OUT_DIR, exist_ok=True)  # make sure the output directory exists before writing
out_path = os.path.join(OUT_DIR, "submission_v6_price_area.csv")
submission.to_csv(out_path, index=False)

print("Saved submission to:", out_path)
print("\nDone. Upload the generated CSV to Kaggle.")

Saved submission to: outputs/submission_v6_price_area.csv

Done. Upload the generated CSV to Kaggle.
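
Before uploading, a few structural checks on the file can catch common problems such as missing ids or NaN predictions. The helper `check_submission` below is a hypothetical sketch, demonstrated on made-up ids; in the notebook it would be called with `submission` and `test["id"]`.

```python
import pandas as pd


def check_submission(submission, test_ids, target="new_house_transaction_amount"):
    """Hypothetical helper: basic sanity checks before uploading."""
    assert list(submission.columns) == ["id", target], "unexpected columns"
    assert submission["id"].is_unique, "duplicate ids"
    assert set(submission["id"]) == set(test_ids), "ids do not match the test file"
    assert submission[target].notna().all(), "NaN predictions"
    assert (submission[target] >= 0).all(), "negative predictions"
    return True


# Toy example with made-up ids
demo = pd.DataFrame({"id": ["a", "b"], "new_house_transaction_amount": [12.5, 0.0]})
ok = check_submission(demo, test_ids=["b", "a"])
print("submission checks passed:", ok)
```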
