How many observations and features/columns do you have?
What is the proportion of missing values per column?
Which variables would you delete and why?
What variables are most subject to outliers?
How many qualitative and quantitative variables are there? What are appropriate visuals for quantitative vs. qualitative data? What are appropriate measures of correlation when dealing with qualitative and quantitative variables?
What is the correlation between the variables and the price? Why do you think some variables are more correlated than others?
How are the variables themselves correlated to each other? Can you find groups of variables that are correlated together?
How is the number of properties distributed according to their surface area?
Which five variables do you consider the most important and why?
What are the least/most expensive municipalities in Belgium/Wallonia/Flanders? (in terms of price per m², average price, and median price)
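The first two questions can be answered directly with pandas. The following is a minimal sketch on a toy frame; the `toy` DataFrame, its column names, and its values are invented for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (illustrative values only).
toy = pd.DataFrame({
    "price": [250000, 310000, np.nan, 199000],
    "terrace_sqm": [np.nan, 12.0, np.nan, np.nan],
    "region": ["Flanders", "Wallonia", "Flanders", None],
})

# Number of observations (rows) and features (columns).
n_rows, n_cols = toy.shape

# isna() gives a boolean frame; the column-wise mean is the share of missing values.
missing_share = toy.isna().mean().sort_values(ascending=False)

print(f"{n_rows} observations, {n_cols} columns")
print(missing_share)
```

Applying the same two expressions to `df` should give the corresponding figures for the full dataset.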

import numpy as np
import pandas as pd

file_path = '../../data/processed/cleaned_properties.csv'
df = pd.read_csv(file_path)

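# Keep only the numeric columns for the outlier analysis.
numeric = df.select_dtypes(include="number")

# Binary 0/1 flag columns are not meaningful candidates for outlier detection.
flag_cols = [
    col for col in numeric.columns
    if set(numeric[col].dropna().unique()) <= {0, 1}
]
# Also skip geographic columns, whose numeric values are codes or coordinates.
exclude = ["latitude", "longitude", "zip_code"] + flag_cols

numeric_clean = numeric.drop(columns=exclude, errors="ignore")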
# Count outliers per column with Tukey's rule: values more than
# 1.5 * IQR below the first quartile or above the third quartile.
iqr_counts = {}

for col in numeric_clean.columns:
    col_data = numeric_clean[col].dropna()

    q1 = col_data.quantile(0.25)
    q3 = col_data.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr

    iqr_counts[col] = ((col_data < lower) | (col_data > upper)).sum()

outlier_table = (
    pd.DataFrame.from_dict(iqr_counts, orient="index", columns=["outlier_count"])
      .sort_values("outlier_count", ascending=False)
)
outlier_table

from scipy.stats import median_abs_deviation

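def mad_outliers(series, threshold=4.5):
    """Count values whose MAD-based robust z-score exceeds `threshold`."""
    data = series.dropna()
    med = data.median()
    # scale='normal' rescales the MAD so that it is comparable to a
    # standard deviation for normally distributed data.
    mad = median_abs_deviation(data, scale='normal')

    # A zero MAD (e.g. when most values are identical) would cause a
    # division by zero, so no outliers are reported for that column.
    if mad == 0:
        return 0

    robust_z = np.abs((data - med) / mad)
    return (robust_z > threshold).sum()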

outlier_counts = {
    col: mad_outliers(numeric_clean[col], threshold=4.5)
    for col in numeric_clean.columns
}

(
    pd.DataFrame.from_dict(outlier_counts, orient="index", columns=["outlier_count"])
      .sort_values("outlier_count", ascending=False)
)


column,outlier_count
terrace_sqm,25501
price,3219
surface_land_sqm,2798
total_area_sqm,2254
cadastral_income,1157
primary_energy_consumption_sqm,496
nbr_bedrooms,427
construction_year,107
nbr_frontages,8
id,0
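The two rules used above can disagree, most visibly when the MAD is zero and the `mad == 0` guard reports no outliers at all. The sketch below compares both rules on two toy series; the helper names `iqr_count` and `robust_z_count` and the sample values are hypothetical, chosen only to illustrate the behaviour, and say nothing about the actual columns.

```python
import numpy as np
import pandas as pd
from scipy.stats import median_abs_deviation

def iqr_count(series, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    data = series.dropna()
    q1, q3 = data.quantile(0.25), data.quantile(0.75)
    iqr = q3 - q1
    return int(((data < q1 - k * iqr) | (data > q3 + k * iqr)).sum())

def robust_z_count(series, threshold=4.5):
    """Count values whose MAD-based robust z-score exceeds the threshold."""
    data = series.dropna()
    mad = median_abs_deviation(data, scale="normal")
    if mad == 0:  # same guard as mad_outliers: no score can be computed
        return 0
    return int((np.abs((data - data.median()) / mad) > threshold).sum())

# A series with ordinary spread and one extreme value.
spread = pd.Series([10, 12, 11, 13, 12, 11, 10, 14, 12, 200])
# A series dominated by a single repeated value, so its MAD is zero.
mostly_zero = pd.Series([0, 0, 0, 0, 0, 0, 0, 15, 20, 500])

for name, s in [("spread", spread), ("mostly_zero", mostly_zero)]:
    print(name, "IQR:", iqr_count(s), "MAD:", robust_z_count(s))
```

On `spread` both rules flag the value 200. On `mostly_zero` the IQR rule still flags 500, while the MAD rule returns 0 because of the guard, so a zero count from the MAD rule does not necessarily mean a column is free of extreme values.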
