# 4. Correlation & Relationship Analysis

This notebook investigates the relationships between variables, with a focus on how each one relates to the property price. We load the processed data and then compute correlations and group-difference tests.

In [1]:
import pandas as pd
import os

# Define the path to the processed data file
processed_data_path = os.path.join('..', '..', 'data', 'processed', 'immoweb_processed_data.csv')

# Load the data into a pandas DataFrame
try:
    df = pd.read_csv(processed_data_path)
    print('Processed data loaded successfully for correlation analysis.')
    print('DataFrame Info:')
    df.info()
    print('First 5 rows:')
    display(df.head())
except FileNotFoundError:
    print(f'Error: The file was not found at {processed_data_path}')

Processed data loaded successfully for correlation analysis.
DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75511 entries, 0 to 75510
Data columns (total 34 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              75511 non-null  int64  
 1   price                           75511 non-null  float64
 2   property_type                   75511 non-null  object 
 3   subproperty_type                75511 non-null  object 
 4   region                          75511 non-null  object 
 5   province                        75511 non-null  object 
 6   locality                        75511 non-null  object 
 7   zip_code                        75511 non-null  int64  
 8   latitude                        75511 non-null  float64
 9   longitude                       75511 non-null  float64
 10  construction_year               75511 non-null  float64
 11  total_area_sqm  

|   | id | price | property_type | subproperty_type | region | province | locality | zip_code | latitude | longitude | ... | state_building | primary_energy_consumption_sqm | epc | heating_type | fl_double_glazing | price_log | total_area_sqm_log | surface_land_sqm_log | terrace_sqm_log | garden_sqm_log |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 34221000 | 225000.0 | APARTMENT | APARTMENT | Flanders | Antwerp | Antwerp | 2050 | 51.217172 | 4.379982 | ... | MISSING | 231.0 | C | GAS | 1 | 12.32386 | 4.615121 | 5.894403 | 1.791759 | 0.0 |
| 1 | 2104000 | 449000.0 | HOUSE | HOUSE | Flanders | East Flanders | Gent | 9185 | 51.174944 | 3.845248 | ... | MISSING | 221.0 | C | MISSING | 1 | 13.01478 | NaN | 6.523562 | 0.0 | 0.0 |
| 2 | 34036000 | 335000.0 | APARTMENT | APARTMENT | Brussels-Capital | Brussels | Brussels | 1070 | 50.842043 | 4.334543 | ... | AS_NEW | 242.0 | MISSING | GAS | 0 | 12.721889 | 4.962845 | 5.894403 | 0.693147 | 0.0 |
| 3 | 58496000 | 501000.0 | HOUSE | HOUSE | Flanders | Antwerp | Turnhout | 2275 | 51.238312 | 4.817192 | ... | MISSING | 99.0 | A | MISSING | 0 | 13.124363 | 5.236442 | 6.226537 | 0.0 | 0.0 |
| 4 | 48727000 | 982700.0 | APARTMENT | DUPLEX | Wallonia | Walloon Brabant | Nivelles | 1410 | 50.900919 | 4.376713 | ... | AS_NEW | 19.0 | A+ | GAS | 0 | 13.79806 | 5.135798 | 5.894403 | 3.044522 | 4.962845 |


## T019: Calculate Correlation Between Variables and Price

We will now calculate the pairwise Pearson correlation coefficients for a selected set of numerical variables, including the log-transformed price (`price_log`). The `price_log` row of the resulting matrix shows which features have a linear relationship with the (log) property price.
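
As a reminder of what the coefficient measures, the sketch below computes Pearson's r directly from its definition (covariance of the centred values divided by the product of their spreads) and compares it with pandas' built-in `Series.corr`. The helper name `pearson_r` and the toy values are assumptions made for illustration; they are not part of the notebook or the dataset.

```python
import numpy as np
import pandas as pd


def pearson_r(x: pd.Series, y: pd.Series) -> float:
    """Pearson r from its definition, using only rows where both values are present."""
    pair = pd.concat([x, y], axis=1).dropna()
    a = pair.iloc[:, 0].to_numpy(dtype=float)
    b = pair.iloc[:, 1].to_numpy(dtype=float)
    a_centred = a - a.mean()
    b_centred = b - b.mean()
    numerator = (a_centred * b_centred).sum()
    denominator = np.sqrt((a_centred ** 2).sum() * (b_centred ** 2).sum())
    return float(numerator / denominator)


# Toy data (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({
    "area": [50.0, 80.0, 120.0, np.nan, 200.0],
    "price": [150_000.0, 210_000.0, 330_000.0, 400_000.0, 520_000.0],
})

manual = pearson_r(toy["area"], toy["price"])
builtin = toy["area"].corr(toy["price"])  # pandas defaults to Pearson and skips NaN pairs
print(f"manual={manual:.6f}, pandas={builtin:.6f}")
```

Both values should agree, because pandas also drops rows where either value is missing before computing the coefficient.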

In [2]:
if 'df' in locals():
    # Select only numerical columns for correlation calculation
    numerical_df = df.select_dtypes(include=['number'])

    # Log-transformed price and area features, plus a few raw numerical features
    num_cols = ['price_log', 'total_area_sqm_log', 'surface_land_sqm_log', 'terrace_sqm_log',
                'garden_sqm_log', 'nbr_bedrooms', 'construction_year', 'primary_energy_consumption_sqm']

    # Compute the pairwise Pearson correlation matrix; the 'price_log' row holds
    # each feature's correlation with the (log) price
    missing_cols = sorted(set(num_cols) - set(numerical_df.columns))
    if not missing_cols:
        price_correlations = numerical_df[num_cols].corr()
        print("Correlation with 'price' variable:")
        display(price_correlations)
    else:
        print(f"Columns not found in numerical data for correlation calculation: {missing_cols}")

Correlation with 'price' variable:


|   | price_log | total_area_sqm_log | surface_land_sqm_log | terrace_sqm_log | garden_sqm_log | nbr_bedrooms | construction_year | primary_energy_consumption_sqm |
|---|---|---|---|---|---|---|---|---|
| price_log | 1.0 | 0.58889 | 0.168516 | 0.119493 | 0.082893 | 0.365265 | 0.009931 | -0.000552 |
| total_area_sqm_log | 0.58889 | 1.0 | 0.061346 | -0.031025 | 0.239093 | 0.640563 | -0.206407 | 0.005063 |
| surface_land_sqm_log | 0.168516 | 0.061346 | 1.0 | 0.069978 | 0.072317 | 0.026467 | 0.009504 | 0.001698 |
| terrace_sqm_log | 0.119493 | -0.031025 | 0.069978 | 1.0 | 0.229368 | -0.029636 | 0.125864 | -0.005261 |
| garden_sqm_log | 0.082893 | 0.239093 | 0.072317 | 0.229368 | 1.0 | 0.158254 | -0.122724 | -0.001293 |
| nbr_bedrooms | 0.365265 | 0.640563 | 0.026467 | -0.029636 | 0.158254 | 1.0 | -0.147005 | 0.003463 |
| construction_year | 0.009931 | -0.206407 | 0.009504 | 0.125864 | -0.122724 | -0.147005 | 1.0 | -0.000744 |
| primary_energy_consumption_sqm | -0.000552 | 0.005063 | 0.001698 | -0.005261 | -0.001293 | 0.003463 | -0.000744 | 1.0 |
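
A full matrix can be hard to scan when the question is only how each feature relates to the target. One possible approach, sketched below, is to take the target's column from the matrix and sort it by absolute value. The helper `rank_by_target_correlation`, the column names, and the synthetic data are all hypothetical and only illustrate the idea; applied to the notebook's data, the target would be `price_log`.

```python
import numpy as np
import pandas as pd


def rank_by_target_correlation(frame: pd.DataFrame, target: str) -> pd.Series:
    """Return each numeric feature's Pearson correlation with `target`,
    ordered from strongest to weakest absolute correlation."""
    corr = frame.select_dtypes(include="number").corr()
    with_target = corr[target].drop(labels=target)  # drop the trivial self-correlation
    order = with_target.abs().sort_values(ascending=False).index
    return with_target.loc[order]


# Synthetic demo data (assumption: not related to the real dataset)
rng = np.random.default_rng(0)
n = 500
size = rng.normal(size=n)
demo = pd.DataFrame({
    "target": 2.0 * size + 0.5 * rng.normal(size=n),
    "strong_feature": size,
    "unrelated_feature": rng.normal(size=n),
})

ranked = rank_by_target_correlation(demo, "target")
print(ranked)
```

Sorting by absolute value keeps strongly negative correlations near the top, which a plain descending sort would push to the bottom.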


## T021: Determine Appropriate Measures for Correlations Between Qualitative and Quantitative Variables

When analyzing the relationship between a qualitative (categorical) and a quantitative (numerical) variable, Pearson correlation is not suitable, because the categories have no numeric scale. Instead, we can use a statistical test such as one-way ANOVA (Analysis of Variance) to determine whether the mean of the quantitative variable differs significantly across the groups defined by the qualitative variable.
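
To make the test concrete, the following sketch computes the one-way ANOVA F-statistic by hand (between-group mean square divided by within-group mean square) and checks it against `scipy.stats.f_oneway`. The three groups contain made-up numbers chosen for illustration only.

```python
import numpy as np
from scipy import stats

# Hypothetical samples for three made-up categories
groups = [
    np.array([200.0, 220.0, 210.0, 230.0]),
    np.array([300.0, 310.0, 290.0, 320.0]),
    np.array([205.0, 215.0, 225.0, 235.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)            # number of groups
n_total = all_values.size  # total number of observations

# Variation of the group means around the grand mean
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Variation of the observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
f_scipy, p_value = stats.f_oneway(*groups)
print(f"manual F = {f_manual:.4f}, scipy F = {f_scipy:.4f}, p = {p_value:.3g}")
```

A large F means the group means are far apart relative to the spread inside each group.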

For this analysis, we test every qualitative (object-typed) column in the dataset, namely `property_type`, `subproperty_type`, `region`, `province`, `locality`, `equipped_kitchen`, `state_building`, `epc`, and `heating_type`, against the quantitative `price` variable. For each one, a one-way ANOVA tests whether the mean `price` differs significantly across its categories.
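
With tens of thousands of rows, even small differences between group means can produce a very small p-value, so it can help to report an effect size alongside the test. The sketch below computes eta squared (the share of the target's variance explained by the grouping), which is a common companion to one-way ANOVA. This is an optional addition rather than something the notebook itself computes; the helper name `eta_squared` and the toy values are assumptions.

```python
import pandas as pd


def eta_squared(frame: pd.DataFrame, group_col: str, target_col: str) -> float:
    """Eta squared = SS_between / SS_total for a one-way grouping.
    Returns NaN when the target has no variance."""
    data = frame[[group_col, target_col]].dropna()
    grand_mean = data[target_col].mean()
    grouped = data.groupby(group_col)[target_col]
    ss_between = (grouped.count() * (grouped.mean() - grand_mean) ** 2).sum()
    ss_total = ((data[target_col] - grand_mean) ** 2).sum()
    if ss_total == 0:
        return float("nan")
    return float(ss_between / ss_total)


# Toy data (hypothetical categories and values)
toy = pd.DataFrame({
    "category": ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
    "value": [200.0, 220.0, 210.0, 230.0,
              300.0, 310.0, 290.0, 320.0,
              205.0, 215.0, 225.0, 235.0],
})
effect = eta_squared(toy, "category", "value")

# Groups with identical distributions explain none of the variance
no_effect = eta_squared(
    pd.DataFrame({"category": ["A"] * 3 + ["B"] * 3,
                  "value": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]}),
    "category", "value",
)
print(f"eta squared = {effect:.3f}, no-effect case = {no_effect:.3f}")
```

Values close to 0 indicate that the categories explain little of the variation in the target, and values close to 1 indicate that they explain most of it.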

In [3]:
from scipy import stats
import pandas as pd


def perform_anova_test(df: pd.DataFrame, group_col: str, target_col: str = 'price'):
    print(f"- **{group_col}** vs. **{target_col}**:")

    # 1. Initial Check and Data Preparation
    if target_col not in df.columns or group_col not in df.columns:
        print(f"  > Skipping: '{target_col}' or '{group_col}' not found.")
        return

    # Filter out NaNs for the required columns and select the working subset
    temp_df = df[[group_col, target_col]].dropna()

    # 2. Check for sufficient unique groups
    n_unique = temp_df[group_col].nunique()
    if n_unique <= 1:
        print(f"  > Only **{n_unique}** unique category found. ANOVA not applicable.")
        return

    # 3. Create groups for ANOVA
    # Collects the 'price' values for each category into a list of arrays
    try:
        groups = [
            temp_df[target_col].loc[temp_df[group_col] == category].values
            for category in temp_df[group_col].unique()
        ]
    except Exception as e:
        print(f"  > Error creating groups: {e}")
        return

    # 4. Check for empty groups (ANOVA requires at least one observation per group)
    if not all(len(group) > 0 for group in groups):
        print("  > Not enough non-NaN data in groups for ANOVA.")
        return

    # 5. Perform ANOVA
    # The *groups unpacks the list of arrays into separate arguments for f_oneway
    f_statistic, p_value = stats.f_oneway(*groups)

    # 6. Display Results
    print(f"  > F-statistic = **{f_statistic:.2f}**, p-value = **{p_value:.3f}**")

    alpha = 0.05
    if p_value < alpha:
        print(f"  ✅ **Result:** There is a statistically significant difference in {target_col} across the categories (p < {alpha}).")
    else:
        print(f"  ❌ **Result:** No statistically significant difference in {target_col} across the categories (p ≥ {alpha}).")


def run_qualitative_anova_analysis(df: pd.DataFrame, target_col: str = 'price'):
    # Guard against a missing DataFrame being passed in
    if df is None:
        print("Error: no DataFrame was provided.")
        return

    # 1. Get relevant columns
    qualitative_vars = df.select_dtypes(include=['object']).columns.tolist()

    if not qualitative_vars:
        print("No qualitative (object type) variables found for analysis.")
        return

    print(f"Qualitative Variables Found: {', '.join(qualitative_vars)}")

    # 2. Iterate and test
    for q_var in qualitative_vars:
        perform_anova_test(df, group_col=q_var, target_col=target_col)


if 'df' in locals():
    run_qualitative_anova_analysis(df=df)

Qualitative Variables Found: property_type, subproperty_type, region, province, locality, equipped_kitchen, state_building, epc, heating_type
- **property_type** vs. **price**:
  > F-statistic = **1395.50**, p-value = **0.000**
  ✅ **Result:** There is a statistically significant difference in price across the categories (p < 0.05).
- **subproperty_type** vs. **price**:
  > F-statistic = **619.21**, p-value = **0.000**
  ✅ **Result:** There is a statistically significant difference in price across the categories (p < 0.05).
- **region** vs. **price**:
  > F-statistic = **1147.38**, p-value = **0.000**
  ✅ **Result:** There is a statistically significant difference in price across the categories (p < 0.05).
- **province** vs. **price**:
  > F-statistic = **486.38**, p-value = **0.000**
  ✅ **Result:** There is a statistically significant difference in price across the categories (p < 0.05).
- **locality** vs. **price**:
  > F-statistic = **242.72**, p-value = **0.000**
  ✅ **Result:** T