Handling uncertainty

In this notebook we test whether missingness in the numerical variables is informative.

The result determines whether we keep NaN for missing values or impute -1 and add a missing_flag column.

If missingness is informative, we can either:

keep NaN (for tree models and pipelines that handle NaN natively)

OR

use -1 in the missing cells and add a missing_flag column

(For numeric variables, -1 is a safe placeholder only when it is paired with a missing-value flag and cannot occur as a real value: the model can then learn missingness from the flag rather than treating -1 as an observation. This approach preserves the predictive signal of missingness while keeping the column numeric for any model type.)
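To illustrate why the flag matters for a linear model, here is a minimal sketch on hypothetical toy data (the variables x and y, the 30% missing rate, and the helper ols are all assumptions made for this example, not part of the dataset). The target depends on x with slope 2 where x is observed and takes an unrelated constant where x is missing.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical toy data: y = 2*x + 10 where x is observed,
# and a fixed, unrelated value (100) where x is missing.
x = rng.uniform(50, 200, n)
missing = rng.random(n) < 0.3
y = np.where(missing, 100.0, 2.0 * x + 10.0)

x_filled = np.where(missing, -1.0, x)  # -1 placeholder in missing cells
flag = missing.astype(float)           # 1 if missing, 0 otherwise


def ols(X, y):
    """Least-squares coefficients for the design matrix X (hypothetical helper)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Columns: [x_filled, flag, intercept] versus [x_filled, intercept]
coef_flag = ols(np.column_stack([x_filled, flag, np.ones(n)]), y)
coef_noflag = ols(np.column_stack([x_filled, np.ones(n)]), y)

print("slope with flag:   ", round(coef_flag[0], 3))
print("slope without flag:", round(coef_noflag[0], 3))
```

With the flag, its coefficient absorbs the placeholder rows and the slope on x stays at 2; without it, the -1 rows pull the slope away from 2.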




Variables to be tested:

surface_land_sqm, 
construction_year, 
primary_energy_consumption_sqm, 
garden_sqm, 
terrace_sqm, 
total_area_sqm, 
nbr_frontages


In [11]:
#Import the clean data file
import pandas as pd
import numpy as np
import seaborn as sns

file_path = "../data/processed/cleaned_properties.csv"
with open(file_path, 'r', encoding='utf-8') as f:
    first_line = f.readline()
    sep = ';' if ';' in first_line else ','
df = pd.read_csv(file_path, sep=sep, low_memory=False)

In [12]:
# TO BE ADDED TO THE CLEANING FILE: creating EPC categorization across regions, via creating groups "excellent", "good", "poor", "bad"

# Defining mapping rules for each region

epc_mapping = {
    "Flanders": {
        "A+": "excellent",
        "A": "excellent",
        "B": "good",
        "C": "poor",
        "D": "poor",
        "E": "bad",
        "F": "bad"
    },
    
    "Brussels-Capital": {
        "A": "excellent",
        "B": "good",
        "C": "good",
        "D": "poor",
        "E": "poor",
        "F": "bad",
        "G": "bad"
    },
    
    "Wallonia": {
        "A++": "excellent",
        "A+": "excellent",
        "A": "good",
        "B": "good",
        "C": "poor",
        "D": "poor",
        "E": "poor",
        "F": "bad",
        "G": "bad"
    }
}

# Function that uses the rule on region

def recode_epc(row):
    region = row['region']
    epc = row['epc']

    if pd.isna(region) or pd.isna(epc):
        return np.nan
    
    region_rules = epc_mapping.get(region)

    if region_rules is None:
        return np.nan
    
    return region_rules.get(epc, np.nan)

df['epc_group'] = df.apply(recode_epc, axis=1)
df = df.drop(columns=['epc'])
df = df.rename(columns={'epc_group': 'epc'})
df['epc'] = df['epc'].fillna("MISSING")  # Replacing NaN with "MISSING"
df['epc'].value_counts(dropna=False)



epc
MISSING    75508
Name: count, dtype: int64
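Every row ends up as MISSING in the output above. One possible cause (an assumption, since the raw values are not shown here) is that the region or epc labels in the file do not match the keys of epc_mapping. A minimal sketch of how unmatched (region, epc) pairs could be listed, using hypothetical toy data and a hypothetical toy_mapping:

```python
import pandas as pd

# Hypothetical toy data; "flanders" (lower case) and a missing region will not match
toy = pd.DataFrame({
    "region": ["Flanders", "Wallonia", "flanders", None],
    "epc": ["A", "A++", "B", "C"],
})

# Hypothetical, shortened mapping in the same shape as epc_mapping
toy_mapping = {
    "Flanders": {"A": "excellent", "B": "good"},
    "Wallonia": {"A++": "excellent"},
}

# All (region, epc) pairs that the mapping knows about
known_pairs = {(region, epc) for region, rules in toy_mapping.items() for epc in rules}

# Mark each row as matched or not, then list the distinct unmatched pairs
toy["matched"] = [pair in known_pairs for pair in zip(toy["region"], toy["epc"])]
unmatched = toy.loc[~toy["matched"], ["region", "epc"]].drop_duplicates()
print(unmatched)
```

Running the same check on the real df and epc_mapping before dropping the original epc column would show which labels fall through to NaN.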

In [13]:
# Capping and log transformations

df_before = df.copy() #Keeping a copy of data before capping
cap_vars = ['price', 'surface_land_sqm', 'total_area_sqm','garden_sqm', 'terrace_sqm', 'nbr_bedrooms', 'nbr_frontages']
lower_cap = 0.01
upper_cap = 0.99
for var in cap_vars:
    lower = df[var].quantile(lower_cap)
    upper = df[var].quantile(upper_cap)
    df[var] = df[var].clip(lower=lower, upper=upper)  # cap at the 1st/99th percentiles; NaN stays NaN

# Applying the log transformations (nbr_bedrooms and nbr_frontages were normalized by the capping alone, so they need no log transformation)
log_vars = ['price', 'surface_land_sqm', 'total_area_sqm','garden_sqm', 'terrace_sqm']
for var in log_vars:
    df[f'{var}_log'] = np.log1p(df[var])
df[[f'{v}_log' for v in log_vars]].skew()


price_log               0.633032
surface_land_sqm_log   -1.389089
total_area_sqm_log      0.359023
garden_sqm_log          1.949138
terrace_sqm_log         0.382170
dtype: float64
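The effect of the two steps on skewness can be sketched on hypothetical right-skewed toy data (a lognormal sample, which is an assumption for illustration and not the real price distribution):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Hypothetical right-skewed values standing in for a variable such as price
s = pd.Series(rng.lognormal(mean=5.0, sigma=1.0, size=2000))

# Cap at the 1st and 99th percentiles, then apply log1p
lower, upper = s.quantile(0.01), s.quantile(0.99)
capped = s.clip(lower=lower, upper=upper)
logged = np.log1p(capped)

skews = pd.Series({
    "raw": s.skew(),
    "capped": capped.skew(),
    "capped_log": logged.skew(),
})
print(skews.round(3))
```

Capping bounds the extreme tails, and log1p compresses the remaining right tail, so the skewness of the transformed series should be much closer to zero than that of the raw one.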

In [14]:
# Testing whether the price differs between properties with missing and non-missing observations in a given variable (Mann-Whitney)

# If missingness corresponds to higher/lower prices, it carries predictive information
# Big differences in mean/median -> meaningful
# Very small p-value (<0.05) -> distributions differ significantly
# Missing group is very cheap or very expensive -> missingness is highly informative

import pandas as pd
import numpy as np
from scipy.stats import mannwhitneyu

target = "price_log"
vars_to_test = ["surface_land_sqm_log","construction_year","primary_energy_consumption_sqm","garden_sqm_log","terrace_sqm_log","total_area_sqm_log","nbr_frontages"]

results = []

for var in vars_to_test:
    
    g_missing = df[df[var].isna()][target]
    g_present = df[df[var].notna()][target]
    
    # Summary stats
    mean_missing = g_missing.mean()
    mean_present = g_present.mean()
    median_missing = g_missing.median()
    median_present = g_present.median()
    
    # Statistical test (distribution-free)
    stat, p = mannwhitneyu(g_missing, g_present, alternative='two-sided')
    
    results.append({
        "variable": var,
        "pct_missing": df[var].isna().mean(),
        "mean_price_missing": mean_missing,
        "mean_price_present": mean_present,
        "median_price_missing": median_missing,
        "median_price_present": median_present,
        "p_value_difference": p
    })

pd.DataFrame(results)

# Interpretation of results and recommendations:

#!!! MAIN MESSAGE: Missingness is statistically meaningful (not random) and correlates with price for all 7 variables !!!
#Therefore, it should not be replaced with the median, as it carries predictive signal.

#What to do next depends on the model:
# - if using tree models (XGBoost, LightGBM, CatBoost), we can keep NaN as it is now;
# - if using general ML pipelines/neural networks/linear models, we need to impute -1 and add a binary "missing flag" column for each of these 7 vars;

#"surface_land_sqm_log": missing values are not random; properties with a missing value for this variable are much cheaper;
#   this likely reflects that the information is missing for apartments;

#"construction_year": properties with a missing construction year are significantly cheaper;
#   it could be that older or lower-quality buildings often don't list their build year;

#"primary_energy_consumption_sqm": missing values are again not random, but occur for properties that are slightly more expensive;
#   it could be that buyers of more expensive properties care less about this, so the information is often omitted;

#"garden_sqm_log": missing values occur for more expensive properties (e.g. apartments); a possible reason is that houses with gardens
#   are more often in suburban or rural areas and therefore have a lower price per sqm;

#"terrace_sqm_log": missing values occur for more expensive properties;

#"total_area_sqm_log": missing values occur for much cheaper properties;

#"nbr_frontages": missing values occur for slightly cheaper properties;



                         variable  pct_missing  mean_price_missing  mean_price_present  median_price_missing  median_price_present       p_value_difference
0            surface_land_sqm_log     0.480121           12.653887           12.825092             12.601659             12.815841                      0.0
1               construction_year     0.442178           12.655227           12.812384             12.644331             12.751303            7.342892e-255
2  primary_energy_consumption_sqm      0.35183           12.810901           12.705977             12.765691             12.644331  1.4807349999999998e-258
3                  garden_sqm_log     0.038923           12.827265           12.739476             12.790336             12.694656             4.987969e-25
4                 terrace_sqm_log     0.174021           12.804368           12.729941             12.751303             12.691584   1.1980970000000001e-23
5              total_area_sqm_log      0.10085           12.632412           12.755284             12.608202             12.706851             2.790767e-70
6                   nbr_frontages     0.348917           12.723834           12.753106             12.676079             12.721889             3.846369e-09
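With tens of thousands of rows, p-values become tiny even for small shifts, so an effect size can help rank the variables. Below is a sketch, on hypothetical toy data, of the common-language effect size U / (n1 * n2): the estimated probability that a randomly drawn price from the missing group exceeds one from the present group, with ties counted as half. It assumes SciPy 1.7 or later, where mannwhitneyu returns the U statistic of the first sample.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
# Hypothetical log-prices for the two groups (not the real data)
g_missing = rng.normal(12.65, 0.5, 500)
g_present = rng.normal(12.85, 0.5, 500)

u_stat, p_value = mannwhitneyu(g_missing, g_present, alternative="two-sided")

# Estimated P(missing-group price > present-group price); 0.5 means no difference
effect = u_stat / (len(g_missing) * len(g_present))
print(f"p = {p_value:.3g}, effect size = {effect:.3f}")
```

Values well below 0.5 indicate that the missing group tends to be cheaper, values well above 0.5 that it tends to be more expensive.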


In [15]:
# Testing whether there is correlation between missingness and a set of categorical variables

# Create missing_flag columns for each of the 7 variables
from scipy.stats import chi2_contingency

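# Note: this list uses "terrace_sqm" (not "terrace_sqm_log"), so the flag created below is named "terrace_sqm_missing"
vars_to_test = ["surface_land_sqm_log","construction_year","primary_energy_consumption_sqm","garden_sqm_log","terrace_sqm","total_area_sqm_log","nbr_frontages"]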

for var in vars_to_test:
    df[var + "_missing"] = df[var].isna().astype(int)

categoricals = ["property_type", "subproperty_type", "region", "epc"]

# Showing how missingness of the variables of interest is concentrated in other categorical variables
for var in vars_to_test:
    flag = var + "_missing"
    print(f"\n Fraction of missing {var} by category")
    for cat in categoricals:
        summary = df.groupby(cat)[flag].mean().sort_values(ascending=False)
        print(f"\n{cat}:\n{summary}\n")

# Cramer's V
def cramers_v(x, y):
    """Compute Cramer's V statistic for categorical-categorical association."""
    confusion_matrix = pd.crosstab(x, y)
    r, k = confusion_matrix.shape
    # V is undefined when either variable has a single category (division by zero)
    if min(r, k) < 2:
        return np.nan
    chi2 = chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2 / n
    return np.sqrt(phi2 / min(k - 1, r - 1))


print("Cramer's V: missingness vs categorical variables")
cramers_results = []

for var in vars_to_test:
    flag = var + "_missing"
    print(f"\n--- {flag} ---")
    for cat in categoricals:
        v = cramers_v(df[flag], df[cat])
        print(f"{cat}: Cramer's V = {v:.3f}")
        cramers_results.append({"variable": var, "categorical": cat, "cramers_v": v})


# Interpretation of results:

##!!! MAIN MESSAGE: Missingness is not random; it clusters by property type, subproperty type, region, and EPC, 
#and should be preserved explicitly in your preprocessing pipeline. !!!

#Very strong signal: surface_land_sqm_log (perfectly tied to property type)
#Moderate signal: primary_energy_consumption_sqm, nbr_frontages
#Weak signal: garden_sqm_log, total_area_sqm_log
#Some regional or epc effects for construction_year and frontages

#surface_land_sqm_log:

#Property type: 100% of surface_land_sqm_log is missing for APARTMENTS, 0% for HOUSES.
#Subproperty type: All apartment-like subtypes missing;
#Region: More missing in Brussels-Capital (75%), less in Wallonia (36%).
#epc: More missing in “good”/“excellent” properties
#Interpretation:
#Missingness is completely deterministic by property type: apartments don't have land plots.
#This is also reflected in Cramer’s V = 1.0 for both property_type and subproperty_type.

#construction_year:

#Missing values slightly more common in houses (47%) than apartments (41%).
#Subproperty type shows moderate variation (highest missing in CASTLE, KOT, FARMHOUSE).
#Wallonia has more missing (61%) vs Brussels (38%).
#Interpretation:
#Missingness is moderate and partially structured, especially by region and subproperty type.
#Cramer's V values (0.057–0.231) indicate weak-to-moderate associations: some categories influence missingness, but not deterministically.

#primary_energy_consumption_sqm:

#More missing for apartments (44%) than houses (27%).
#Highest missing in KOT, GROUND_FLOOR, PENTHOUSE.
#Slight regional differences: Flanders 39%, Brussels 26%.
#epc: missing more in “excellent” and “good” properties.
#Interpretation:
#Missingness is meaningful and partially structured, probably related to property type and quality (epc).
#Cramer’s V: 0.103–0.258 → moderate association with epc, subproperty type, and property_type.

#garden_sqm_log:

#Low overall missingness (3–6%).
#Slightly higher for houses and manor/chalet/bungalow types.
#Brussels has almost no missing values (0.5%).
#Interpretation:
#Missingness is not strongly structured; Cramer's V = 0.046–0.132 -> weak association.
#Probably negligible predictive signal.

#total_area_sqm_log:

#6–13% missing depending on property type; higher in other_property/apartment_block.
#Slight differences by region and epc.
#Cramer’s V = 0.05–0.138 -> weak association.
#Interpretation:
#Some structured missingness, but weaker than surface_land_sqm or primary_energy_consumption_sqm.

#nbr_frontages:

#50% missing in apartments, 20% in houses.
#Subproperty type: strong variation (30–50% missing).
#Region: higher in Flanders (42%), lower in Brussels (22%).
#EPC: missing more in “excellent” (46%) and “good” (35%).
#Cramer’s V = 0.171–0.328 → moderate association.
#Interpretation:
#Missingness is structured and meaningful, particularly by property/subproperty type and epc.



 Fraction of missing surface_land_sqm_log by category

property_type:
property_type
APARTMENT    1.0
HOUSE        0.0
Name: surface_land_sqm_log_missing, dtype: float64


subproperty_type:
subproperty_type
APARTMENT               1.0
DUPLEX                  1.0
SERVICE_FLAT            1.0
KOT                     1.0
LOFT                    1.0
FLAT_STUDIO             1.0
GROUND_FLOOR            1.0
PENTHOUSE               1.0
TRIPLEX                 1.0
COUNTRY_COTTAGE         0.0
BUNGALOW                0.0
APARTMENT_BLOCK         0.0
CHALET                  0.0
CASTLE                  0.0
EXCEPTIONAL_PROPERTY    0.0
MANOR_HOUSE             0.0
HOUSE                   0.0
FARMHOUSE               0.0
MANSION                 0.0
OTHER_PROPERTY          0.0
MIXED_USE_BUILDING      0.0
TOWN_HOUSE              0.0
VILLA                   0.0
Name: surface_land_sqm_log_missing, dtype: float64


region:
region
Brussels-Capital    0.753771
Flanders            0.496534
Wallonia            0.3

epc: Cramer's V = nan

--- construction_year_missing ---
property_type: Cramer's V = 0.057
subproperty_type: Cramer's V = 0.130
region: Cramer's V = 0.231
epc: Cramer's V = nan

--- primary_energy_consumption_sqm_missing ---
property_type: Cramer's V = 0.179
subproperty_type: Cramer's V = 0.204
region: Cramer's V = 0.103
epc: Cramer's V = nan

--- garden_sqm_log_missing ---
property_type: Cramer's V = 0.122
subproperty_type: Cramer's V = 0.132
region: Cramer's V = 0.061
epc: Cramer's V = nan

--- terrace_sqm_missing ---
property_type: Cramer's V = 0.164
subproperty_type: Cramer's V = 0.177
region: Cramer's V = 0.070
epc: Cramer's V = nan

--- total_area_sqm_log_missing ---
property_type: Cramer's V = 0.119
subproperty_type: Cramer's V = 0.138
region: Cramer's V = 0.050
epc: Cramer's V = nan

--- nbr_frontages_missing ---
property_type: Cramer's V = 0.317
subproperty_type: Cramer's V = 0.328
region: Cramer's V = 0.171
epc: Cramer's V = nan
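The nan values for epc above arise because epc contains a single category ("MISSING") in this run, so min(k-1, r-1) is zero. A sketch of a guarded variant on hypothetical toy data follows; cramers_v_safe is a hypothetical name. Note that chi2_contingency applies Yates' continuity correction to 2x2 tables by default; correction=False is passed here so that a perfect association gives exactly 1, which is a choice of this sketch rather than of the notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v_safe(x, y):
    """Cramer's V; returns NaN when either variable has fewer than two categories."""
    table = pd.crosstab(x, y)
    r, k = table.shape
    if min(r, k) < 2:
        return np.nan
    # correction=False: no Yates continuity correction on 2x2 tables
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    return np.sqrt(chi2 / (n * min(r - 1, k - 1)))


# Hypothetical toy data: the flag is fully determined by property type
flag = pd.Series([1, 1, 0, 0] * 25)
ptype = pd.Series(["APARTMENT", "APARTMENT", "HOUSE", "HOUSE"] * 25)
constant = pd.Series(["MISSING"] * 100)  # single-category column

v_perfect = cramers_v_safe(flag, ptype)
v_degenerate = cramers_v_safe(flag, constant)
print(v_perfect, v_degenerate)
```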



In [16]:
# Imputation: in the 7 numerical variables with missing values, replace NaN with -1 and create a missing_flag column

numeric_vars = ["surface_land_sqm_log","construction_year","primary_energy_consumption_sqm","garden_sqm_log","terrace_sqm_log","total_area_sqm_log","nbr_frontages"]

for var in numeric_vars:
    df[var] = pd.to_numeric(df[var], errors='coerce')
    
for var in numeric_vars:
    missing_flag = var + "_missing"
    df[missing_flag] = df[var].isna().astype(int)  # 1 if missing, 0 otherwise
    df[var] = df[var].fillna(-1)

# Save a new data file
output_file = "../data/processed/cleaned_properties_(-1).csv"
df.to_csv(output_file, index=False)
print(f"Processed data saved to {output_file}")

# Quick summary to verify
summary = pd.DataFrame({
    'min': df[numeric_vars].min(),
    'max': df[numeric_vars].max(),
    'missing_flag_sum': [df[var+'_missing'].sum() for var in numeric_vars]
})
print(summary)

# Checking column types
print(df.dtypes)

Processed data saved to ../data/processed/cleaned_properties_(-1).csv
                                  min           max  missing_flag_sum
surface_land_sqm_log             -1.0  9.473524e+00             36253
construction_year                -1.0  2.024000e+03             33388
primary_energy_consumption_sqm -140.0  2.023112e+07             26566
garden_sqm_log                   -1.0  7.496097e+00              2939
terrace_sqm_log                  -1.0  4.615121e+00             13140
total_area_sqm_log               -1.0  6.523562e+00              7615
nbr_frontages                    -1.0  4.000000e+00             26346
id                                          int64
price                                     float64
property_type                              object
subproperty_type                           object
region                                     object
province                                   object
locality                                   object
zip_code            
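The summary above shows a minimum of -140 for primary_energy_consumption_sqm, so negative values do occur in the data and a placeholder of -1 could in principle coincide with a real observation. A minimal sketch of a guard that checks for this before filling; fill_with_sentinel is a hypothetical helper and the series are toy data:

```python
import numpy as np
import pandas as pd


def fill_with_sentinel(series, sentinel=-1):
    """Return (filled series, missing flag); raise if the sentinel already occurs as a real value."""
    if (series == sentinel).any():
        raise ValueError(f"sentinel {sentinel} already present in '{series.name}'")
    flag = series.isna().astype(int)  # 1 if missing, 0 otherwise
    return series.fillna(sentinel), flag


# Toy series without a collision: -140 is a real value, -1 is not
ok = pd.Series([120.0, np.nan, -140.0], name="energy")
filled, flag = fill_with_sentinel(ok)

# Toy series where -1 already occurs as an observed value
clash = pd.Series([-1.0, np.nan, 300.0], name="energy")
try:
    fill_with_sentinel(clash)
    collided = False
except ValueError:
    collided = True

print(filled.tolist(), flag.tolist(), collided)
```

If a collision is found, a placeholder outside the observed range of that column could be used instead; the flag column still identifies the imputed rows either way.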

In [17]:
# Check for remaining NaNs
import pandas as pd

file_path = "../data/processed/cleaned_properties_(-1).csv"

with open(file_path, 'r', encoding='utf-8') as f:
    first_line = f.readline()
    sep = ';' if ';' in first_line else ','
df = pd.read_csv(file_path, sep=sep, low_memory=False)

def check_remaining_nans(df, columns):
    results = {}
    for col in columns:
        if col in df.columns:
            nan_count = df[col].isna().sum()
            results[col] = {
                "remaining_nans": nan_count,
                "remaining_nans_%": round(100 * nan_count / len(df), 2)
            }
    return pd.DataFrame(results).T.sort_values("remaining_nans_%", ascending=False)

numeric_vars_cleaned = ["surface_land_sqm_log","construction_year","primary_energy_consumption_sqm","garden_sqm_log","terrace_sqm_log","total_area_sqm_log","nbr_frontages"]
check_remaining_nans(df, numeric_vars_cleaned)

                                remaining_nans  remaining_nans_%
surface_land_sqm_log                       0.0               0.0
construction_year                          0.0               0.0
primary_energy_consumption_sqm             0.0               0.0
garden_sqm_log                             0.0               0.0
terrace_sqm_log                            0.0               0.0
total_area_sqm_log                         0.0               0.0
nbr_frontages                              0.0               0.0
