# Business question:


RESEARCH OBJECTIVES: Real Estate Market Optimization using Machine Learning

<b>Business question:</b> How can housing attributes help predict and improve property pricing?

Which factors influence property pricing, and how can machine learning models identify these factors and predict house prices?


🔘 1. How can machine learning models be utilized to analyze, predict, and explain factors affecting property valuation and demand?  <br>
   🔹 Application of regression models (Linear Regression, Random Forest, Neural Networks) for property price prediction.  <br>
   🔹 Identification of the most influential factors (e.g., location, size, energy efficiency, market trends).  <br>

🔘 2. Can classification models predict the likelihood of a property selling within a specific time frame?  <br>
   🔹 Binary classification (sold fast vs. slow) using Logistic Regression, Decision Trees, and XGBoost.  <br>
   🔹 Impact of property features, pricing, and marketing on sales speed.  <br>

🔘 3. How can clustering techniques segment the real estate market into meaningful groups for targeted investment and pricing strategies?  <br>
   🔹 Unsupervised learning (K-Means, DBSCAN) to group properties based on price, location, and features.  <br>
   🔹 Identifying high-value investment clusters vs. slow-selling regions.  <br>

🔵 Bonus: Can natural language processing (NLP) extract insights from property descriptions to enhance price and sales predictions?  <br>
   🔹 BERT embeddings & Topic Modeling (BERTopic) for analyzing property descriptions.  <br>
   🔹 Identifying descriptive keywords that correlate with high sales and price appreciation.  <br>

### <span style="color:rgb(185, 241, 238); font-size: 48px; font-family: 'Times New Roman'; font-weight: bold; text-decoration: underline;">1. Importing Libraries</span>




In [16]:
# Importing basic libraries

# ==============================
# Data Manipulation & Analysis
# ==============================
import numpy as np  # Numerical operations and array handling
import pandas as pd  # Data manipulation and analysis
import polars as pl  # Alternative high-performance dataframe library

# ==============================
# Statistical Analysis
# ==============================
import scipy.stats as stats  # Statistical functions and hypothesis testing
from scipy.stats import (
    f_oneway,  # ANOVA test
    kruskal,  # Kruskal-Wallis H-test
    levene,  # Levene test for equal variances
    shapiro,  # Shapiro-Wilk test for normality
    randint  # Discrete uniform distribution over integers
)
from statsmodels.stats.multicomp import pairwise_tukeyhsd  # Post-hoc Tukey's test

# ==============================
# Data Visualization
# ==============================
import matplotlib.pyplot as plt  # Core plotting library
import seaborn as sns  # Statistical data visualization
import graphviz  # Visualizing decision trees
from wordcloud import WordCloud  # Generate word clouds

# Matplotlib specific imports
import matplotlib.colors as mcolors  # Color manipulation for plots
from matplotlib.colors import ListedColormap  # Custom colormaps
from mpl_toolkits.mplot3d import Axes3D  # 3D plotting

# Advanced visualization tools
from plotnine import *  # Grammar of graphics for Python (ggplot-like)





### <span style="color:rgb(185, 241, 238); font-size: 48px; font-family: 'Times New Roman'; font-weight: bold; text-decoration: underline;">2. Reading Files</span>


In [17]:
# Defining file paths
funda_housing_dataset_path = r"C:\Users\josel\OneDrive\Documents\block 3\Resit AI\housing_data.csv"

#read csv file
funda_housing_df = pd.read_csv(funda_housing_dataset_path)

### <span style="color:rgb(185, 241, 238); font-size: 48px; font-family: 'Times New Roman'; font-weight: bold; text-decoration: underline;">3. Exploratory Data Analysis </span>

## <span style="color: #8DB3B1; font-size: 36px; font-family: 'Times New Roman'; font-weight: bold;">3.1 Introduction</span>


In [18]:
# View the first 10 rows of the Funda housing dataset
funda_housing_df.head(10)

Unnamed: 0,globalId,publicatieDatum,postcode,koopPrijs,volledigeOmschrijving,soortWoning,categorieObject,bouwjaar,indTuin,perceelOppervlakte,kantoor_naam_MD5hash,aantalKamers,aantalBadkamers,energielabelKlasse,globalId.1,oppervlakte,datum_ondertekening
0,4388064,2018-07-31,1774PG,139000.0,"Ruimte, vrijheid, en altijd het water en de we...",<{woonboot}> <{vrijstaande woning}>,<{Woonhuis}>,1971-1980,1,,09F114F5C5EC061F6230349892132149,3,,,4388064,62,2018-11-12
1,4388200,2018-09-24,7481LK,209000.0,Verrassend ruime tussenwoning nabij het centru...,<{eengezinswoning}> <{tussenwoning}>,<{Woonhuis}>,1980,1,148.0,6A91BF7DB06A8DF2C9A89064F28571E7,5,1.0,B,4388200,136,2018-08-30
2,4399344,2018-08-02,1068MS,267500.0,- ENGLISH TRANSLATION - \n\nOn the 21st of Sep...,<{tussenverdieping}> (<{appartement}>),<{Appartement}>,2001-2010,0,,E983FEDC63D87BF61AE952D181C8FD17,3,,,4399344,70,2018-11-23
3,4400638,2018-08-04,5628EN,349000.0,Wonen in een zeer royaal bemeten geschakelde 2...,<{eengezinswoning}> <{geschakelde 2-onder-1-ka...,<{Woonhuis}>,1973,1,244.0,02BC26608B8B1A0888D3612AC7A5DB5C,5,,,4400638,144,2018-12-14
4,4401765,2018-08-05,7731TV,495000.0,Landgoed Junne is een eeuwenoud landgoed en li...,<{woonboerderij}> <{vrijstaande woning}>,<{Woonhuis}>,1900,0,4500.0,F56B2705CE24B8D78A68481ED1B276CB,8,1.0,,4401765,323,2018-12-06
5,4401831,2018-08-06,5971CR,162500.0,"In een rustige wijk, op korte afstand van het ...",<{eengezinswoning}> <{hoekwoning}>,<{Woonhuis}>,1970,1,104.0,DA6EDCA2E6F7AADE8D9817099455ABC4,4,1.0,,4401831,68,2019-04-06
6,4402098,2018-08-06,9571BM,217500.0,In landelijke woonomgeving en aan de doorgaand...,<{eengezinswoning}> <{vrijstaande woning}>,<{Woonhuis}>,1987,1,1028.0,FB71E2057357FAC18F2CDB18C3F15FC2,5,,C,4402098,184,2019-03-15
7,4406997,2018-08-12,1031KA,655000.0,Dit betreft bouwnummer 24 van het project Aan ...,<{portiekflat}>,<{Appartement}>,2019,0,,F8471E80DFB18392B3D1AA2BFD2C1CE4,3,,,4406997,105,2019-02-13
8,4407331,2018-08-12,9076BK,180000.0,"Levensloop bestendige, jaren 30 woning, met sl...",<{eengezinswoning}> <{2-onder-1-kapwoning}>,<{Woonhuis}>,1933,0,371.0,B17343CFB6032D8E7AADDD7503A416ED,4,,,4407331,93,2019-02-28
9,4417043,2018-09-08,4465AL,495000.0,Robuust woonhuis in klassiek stijl met heerlij...,<{eengezinswoning}> <{vrijstaande woning}>,<{Woonhuis}>,1998,1,880.0,3BB61B3A4BCF2EED8CDEE3D53AF510EA,5,2.0,B,4417043,156,2018-11-02


### <span style="color: #7AD6C7; font-size: 36px; font-weight: bold;">Explanation of the Funda Housing Dataset</span>

The **Funda Housing Dataset** provides detailed information about real estate listings in the Netherlands. 
Each row represents a property listed with attributes such as location, pricing, features, and energy efficiency.

➤ <span style='color: #7AD6C7;'>globalId:</span>  Unique identifier for the property listing  <br>  
➤ <span style='color: #7AD6C7;'>publicatieDatum:</span>  Date on which the listing was published  <br>  
➤ <span style='color: #7AD6C7;'>postcode:</span>  Postal code of the property’s location  <br>  
➤ <span style='color: #7AD6C7;'>koopPrijs:</span>  Purchase price of the property in euros  <br>  
➤ <span style='color: #7AD6C7;'>volledigeOmschrijving:</span>  Full description of the property  <br>  
➤ <span style='color: #7AD6C7;'>soortWoning:</span>  Type of house (e.g., houseboat, apartment, detached)  <br>  
➤ <span style='color: #7AD6C7;'>categorieObject:</span>  Category of the object (e.g., residential)  <br>  
➤ <span style='color: #7AD6C7;'>bouwjaar:</span>  Year the property was built  <br>  
➤ <span style='color: #7AD6C7;'>indTuin:</span>  Indicates if the property has a garden  <br>  
➤ <span style='color: #7AD6C7;'>perceelOppervlakte:</span>  Total land area of the property (m²)  <br>  
➤ <span style='color: #7AD6C7;'>kantoor_naam_MD5hash:</span>  Hashed identifier of the real estate agency  <br>  
➤ <span style='color: #7AD6C7;'>aantalKamers:</span>  Number of rooms in the property  <br>  
➤ <span style='color: #7AD6C7;'>aantalBadkamers:</span>  Number of bathrooms in the property  <br>  
➤ <span style='color: #7AD6C7;'>energielabelKlasse:</span>  Energy efficiency rating (A, B, C, etc.)  <br>  
➤ <span style='color: #7AD6C7;'>globalId.1:</span>  Second identifier column; it matches globalId in every previewed row, so it appears to be a duplicate  <br>  
➤ <span style='color: #7AD6C7;'>oppervlakte:</span>  Total living area of the property (m²)  <br>  
➤ <span style='color: #7AD6C7;'>datum_ondertekening:</span>  Date of contract signing  <br>  
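
Research question 2 needs a sales-speed target, and the two date columns described above offer one way to derive it. The sketch below assumes that sales speed is the number of days between `publicatieDatum` and `datum_ondertekening`; that definition and the column name `days_to_sign` are assumptions, not something the notebook defines. The three rows are copied from the preview above.

```python
import pandas as pd

# Three rows copied from the dataset preview (date columns only)
sample = pd.DataFrame({
    "publicatieDatum": ["2018-07-31", "2018-09-24", "2018-08-02"],
    "datum_ondertekening": ["2018-11-12", "2018-08-30", "2018-11-23"],
})

# Parse the ISO-formatted strings into datetimes
published = pd.to_datetime(sample["publicatieDatum"])
signed = pd.to_datetime(sample["datum_ondertekening"])

# Hypothetical target column: days between publication and contract signing
sample["days_to_sign"] = (signed - published).dt.days

print(sample["days_to_sign"].tolist())  # [104, -25, 113]
```

The second row yields a negative value because its signing date precedes its publication date, so a target built this way would need a rule for such listings before it is used for classification.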










In [19]:
# sales speed
# omschrijving → vector embeddings 

## <span style="color: #8DB3B1; font-size: 36px; font-family: 'Times New Roman'; font-weight: bold;">3.2 Cleaning the Funda housing dataset
</span>

In chapter 3.2, duplicates will be removed and the dataset will be checked for NaN values.

### 3.2.1 Checking for duplicated rows 

In [20]:
# Show the number of duplicated rows, if any
row_duplicates = funda_housing_df.duplicated().sum()
print('The total amount of duplicated rows in the funda housing dataset is: ', row_duplicates)

The total amount of duplicated rows in the funda housing dataset is:  0
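
`DataFrame.duplicated()` with no arguments only flags rows that are identical in every column, so a listing that appears twice with a small difference would not be counted. A possible complementary check, assuming `globalId` is meant to be unique per listing (an assumption; the toy values below are hypothetical), looks at the identifier alone and also tests whether `globalId.1` simply repeats `globalId`:

```python
import pandas as pd

# Hypothetical toy frame: listing 2 appears twice with a different price
toy = pd.DataFrame({
    "globalId": [1, 2, 2, 3],
    "koopPrijs": [100000.0, 200000.0, 210000.0, 300000.0],
    "globalId.1": [1, 2, 2, 3],
})

# Rows identical across all columns (what the cell above counts)
full_row_duplicates = toy.duplicated().sum()

# Rows whose identifier already occurred earlier in the frame
id_duplicates = toy.duplicated(subset=["globalId"]).sum()

# True when the second identifier column carries no extra information
ids_match = (toy["globalId"] == toy["globalId.1"]).all()

print(full_row_duplicates, id_duplicates, ids_match)
```

In this toy frame the full-row check finds nothing while the identifier check finds one repeated listing.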


### 3.2.2 Checking for NaN values

In [21]:
# Show the number of rows and columns
print("Total rows and columns in funda_housing_df:", funda_housing_df.shape)

# Checking null values, data types, and NaN percentages for funda_housing_df
summary_funda_df = pd.DataFrame({
    "Column Name": funda_housing_df.columns,
    "Null Count": funda_housing_df.isnull().sum().values,
    "Data Type": funda_housing_df.dtypes.values,
    "NaN Percentage": (funda_housing_df.isnull().sum().values / len(funda_housing_df)) * 100
})

# Display the summary DataFrame
summary_funda_df

Total rows and columns in funda_housing_df: (211617, 17)


Unnamed: 0,Column Name,Null Count,Data Type,NaN Percentage
0,globalId,0,int64,0.0
1,publicatieDatum,0,object,0.0
2,postcode,0,object,0.0
3,koopPrijs,741,float64,0.350161
4,volledigeOmschrijving,0,object,0.0
5,soortWoning,0,object,0.0
6,categorieObject,0,object,0.0
7,bouwjaar,0,object,0.0
8,indTuin,0,int64,0.0
9,perceelOppervlakte,67241,float64,31.774857


### <span style="color: #7AD6C7; font-size: 32px; font-family: 'Times New Roman'; font-weight: bold;">Columns with NaN Values and analysis</span>

#### <span style="color: #7AD6C7; font-size: 30px; font-family: 'Times New Roman'; font-weight: bold;"> Columns with Missing Values:</span>

#### <span style="color:#7AD6C7; font-size: 28px; font-family: 'Times New Roman'; font-weight: bold;"> A) Columns with a low percentage of missing values:</span>

➤ **<span style="color: #7AD6C7;">koopPrijs</span>** → **0.35% missing**  
   🔹 **Action:** **Impute using median**  

**<span style="color: #7AD6C7;font-size: 26px; font-family: 'Times New Roman'; font-weight: bold;">Reason for imputing these values</span>**

The share of missing values in the house price column (koopPrijs) is **very low (0.35%)**, so the missing values will be **imputed using the median**. House prices are typically right-skewed, and the median is less sensitive than the mean to a few very expensive properties, so the imputed values do not distort the distribution. Imputing instead of dropping rows prevents unnecessary data loss and keeps the dataset complete. In general, filling gaps with the median (for numeric columns) or the mode (for categorical columns) helps maintain data consistency without changing the overall pattern.
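
A minimal sketch of median imputation on a toy price column follows. The first three prices come from the preview; the large last value is a hypothetical outlier added to show why the median is preferred over the mean here.

```python
import numpy as np
import pandas as pd

# Toy koopPrijs column: one missing value and one hypothetical outlier
prices = pd.Series(
    [139000.0, 209000.0, np.nan, 349000.0, 4950000.0], name="koopPrijs"
)

# Series.median() and Series.mean() skip NaN by default
median_price = prices.median()  # 279000.0
mean_price = prices.mean()      # 1411750.0, pulled up by the outlier

# Fill the gap with the median
imputed = prices.fillna(median_price)
print(imputed.tolist())
```

On the notebook's data the same idea would apply to `funda_housing_df["koopPrijs"]`.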

#### <span style="color:#7AD6C7; font-size: 28px; font-family: 'Times New Roman'; font-weight: bold;"> B) Columns with a high percentage of missing values:</span>

➤ **<span style="color: #7AD6C7;">aantalBadkamers</span>** → **28.90% missing**  
   🔹 **Action:** **Impute based on KNN imputation method**  

➤ **<span style="color: #7AD6C7;">perceelOppervlakte</span>** → **31.77% missing**  
   🔹 **Action:** **Impute based on KNN imputation method** 

➤ **<span style="color: #7AD6C7;">energielabelKlasse</span>** → **58.00% missing**  
   🔹 **Action:** **Impute based on KNN imputation method** 

**<span style="color: #7AD6C7;font-size: 26px; font-family: 'Times New Roman'; font-weight: bold;">Reason for using KNN imputation on missing values</span>**

Missing values in aantalBadkamers (**28.90%**) and perceelOppervlakte (**31.77%**) will be filled with KNN imputation. With roughly a third of the values missing, dropping these columns may **remove too much data**, and both features are important for property pricing. Unlike a simple method such as median imputation, KNN estimates each missing value from the most similar properties, considering factors like oppervlakte, koopPrijs, soortWoning, region, and bouwjaar, which should give more accurate imputations.
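
The idea can be sketched with scikit-learn's `KNNImputer`. The toy values, the choice of `n_neighbors=2`, and the scaling step are assumptions for illustration, not settings taken from this notebook. Scaling is included because KNN relies on distances, so a column with large values such as oppervlakte would otherwise dominate.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: bathrooms are missing for two small properties
toy = pd.DataFrame({
    "oppervlakte": [62, 136, 70, 144, 323, 68],
    "aantalKamers": [3, 5, 3, 5, 8, 4],
    "aantalBadkamers": [np.nan, 1.0, np.nan, 1.0, 2.0, 1.0],
})

# StandardScaler ignores NaN when fitting and keeps it when transforming
scaler = StandardScaler()
scaled = scaler.fit_transform(toy)

# Each missing value becomes the mean of its 2 nearest neighbours
imputer = KNNImputer(n_neighbors=2)
imputed_scaled = imputer.fit_transform(scaled)

# Return to the original units and round counts to whole bathrooms
imputed = pd.DataFrame(scaler.inverse_transform(imputed_scaled), columns=toy.columns)
imputed["aantalBadkamers"] = imputed["aantalBadkamers"].round().astype(int)
print(imputed)
```

Both missing entries are filled from the small properties closest in size and room count, not from the 323 m² outlier.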

Similarly, energielabelKlasse has 58% missing values, which is too many for a simple method such as mode imputation to fill reliably. Instead of dropping the column, we will use KNN imputation to predict missing values based on bouwjaar (construction year) and soortWoning (house type), as older homes tend to have lower energy ratings while newer ones tend to be more efficient.
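
Because energielabelKlasse is categorical and bouwjaar is stored as text (single years as well as ranges such as `1971-1980`), both need a numeric form before a distance-based imputer can use them. The sketch below is one possible approach under stated assumptions: a simplified A to G label scale, the mean of the years found in bouwjaar, `n_neighbors=3`, and hypothetical toy labels. soortWoning is left out to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Assumed simplified scale, ordered from least to most efficient
label_order = ["G", "F", "E", "D", "C", "B", "A"]
label_to_rank = {label: rank for rank, label in enumerate(label_order)}

# Hypothetical toy rows; bouwjaar formats mirror the dataset preview
toy = pd.DataFrame({
    "bouwjaar": ["1933", "1970", "1971-1980", "1987", "1998", "2001-2010", "2019"],
    "energielabelKlasse": ["F", None, "D", "C", "B", None, "A"],
})

# Use the mean of all four-digit years found (range midpoint for ranges)
years = toy["bouwjaar"].str.findall(r"\d{4}").apply(
    lambda found: np.mean([int(year) for year in found])
)

# Ordinal encoding; missing labels become NaN
ranks = toy["energielabelKlasse"].map(label_to_rank)

features = pd.DataFrame({"year": years, "rank": ranks})
imputed = KNNImputer(n_neighbors=3).fit_transform(features)

# Round the averaged rank and map it back to a label
toy["energielabel_imputed"] = [label_order[int(round(r))] for r in imputed[:, 1]]
print(toy)
```

Averaging ordinal ranks treats the distance between adjacent labels as equal, which is itself a modelling assumption.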

---

### <span style="color: #7AD6C7; font-size: 30px; font-family: 'Times New Roman'; font-weight: bold;">Final Actions:</span>

| **Column** | **NaN Percentage** | **How NaN values will be handled** |
|-----------|-----------------|----------------|
| **koopPrijs** | **0.35%** | **Impute (median)** |
| **perceelOppervlakte** | **31.77%** | **KNN imputation method** |
| **aantalBadkamers** | **28.90%** | **KNN imputation method** |
| **energielabelKlasse** | **58.00%** | **KNN imputation method** |



