# ImmoEliza Project - Part 3: Regression 

# Step 3: Data Formatting

In [1]:
# Import cleaned & encoded data

import pandas as pd

df_selected = pd.read_csv("./data/2_engineered_data.csv")

## Hybrid Approach of Normalization and Standardization

* Latitude and longitude are normalized so that they stay within a bounded, interpretable range.
* Price_per_sqm is standardized to mitigate the impact of extreme values.

**Warning!**
* For KNN models, the target is not normalized or standardized.

***Normalization (Min-Max Scaling)***

Normalization rescales the values of a feature into a fixed range (usually [0, 1] or [-1, 1]). It works best for features that have no inherent scale but are important in terms of their relative positions. Use normalization for features that:

* Are continuous with a specific range.
* Are not heavily skewed (i.e., the data should be roughly evenly distributed).
* Should be comparable in magnitude with the other features.
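
As a minimal illustration of the rescaling described above, the sketch below applies min-max scaling by hand and with scikit-learn's `MinMaxScaler`. The latitude values are made up for the example and do not come from the ImmoEliza data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical latitude values (illustrative only, not from the dataset)
latitudes = np.array([[49.5], [50.0], [50.5], [51.5]])

# Manual min-max scaling: (x - min) / (max - min)
manual = (latitudes - latitudes.min()) / (latitudes.max() - latitudes.min())

# Same operation with scikit-learn (default feature_range is (0, 1))
scaled = MinMaxScaler().fit_transform(latitudes)

print(manual.ravel())
print(scaled.ravel())
```

Both arrays should agree, with the smallest value mapped to 0 and the largest to 1.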


***Standardization (Z-score Scaling)***

Standardization transforms the data to have a mean of 0 and a standard deviation of 1. It’s useful when:

* The feature roughly follows a Gaussian (normal) distribution.
* The feature contains outliers. Standardization does not compress the remaining values into a narrow band the way min-max scaling does, although the mean and standard deviation are still influenced by extreme values.
* The feature has an arbitrary scale, and you want to focus on its relative variance.
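
The z-score transformation can be sketched in the same way. The living areas below are hypothetical values chosen for illustration; the manual version uses the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical living areas in square metres (illustrative only)
living_area = np.array([[50.0], [80.0], [120.0], [150.0]])

# Manual z-score: (x - mean) / std, with the population std (NumPy default ddof=0)
manual = (living_area - living_area.mean()) / living_area.std()

# Same operation with scikit-learn
scaled = StandardScaler().fit_transform(living_area)

print(scaled.ravel())
print(scaled.mean(), scaled.std())
```

After the transformation the column should have a mean of approximately 0 and a standard deviation of approximately 1.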

**Final Decision:**

* Normalize:

  * Latitude
  * Longitude

* Standardize:
  * Living Area
  * Subtype of Property (after encoding)
  * Building Condition (after encoding)
  * Facade Number (if treated as numerical or ordinal)
  * Equipped kitchen
  * Garden
  * Swimming Pool (same)
  
* Keep as is:
  * Binary data 
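
One possible way to express such a per-column decision in code is scikit-learn's `ColumnTransformer`. The sketch below is an assumption about how the plan could be wired up, not the code used in this notebook; the toy frame and its column names are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy frame with made-up values; the column names are assumptions
toy = pd.DataFrame({
    "latitude": [50.1, 50.8, 51.2, 49.9],
    "living_area": [60.0, 95.0, 140.0, 210.0],
    "garden": [0, 1, 1, 0],  # binary flag, meant to stay untouched
})

preprocessor = ColumnTransformer(
    transformers=[
        ("normalize", MinMaxScaler(), ["latitude"]),
        ("standardize", StandardScaler(), ["living_area"]),
    ],
    remainder="passthrough",  # columns not listed above pass through unchanged
)

# Output column order: latitude, living_area, then the passthrough columns
toy_scaled = preprocessor.fit_transform(toy)
print(toy_scaled)
```

In a real pipeline the transformer would be fitted on the training split only and then applied to the test split with `transform`.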

In [2]:
df_selected.columns

Index(['price', 'commune_encoded', 'living_area', 'building_condition',
       'terrace', 'equipped_kitchen', 'subtype_of_property', 'garden'],
      dtype='object')

In [3]:
df_selected.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24878 entries, 0 to 24877
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   price                24878 non-null  float64
 1   commune_encoded      24878 non-null  int64  
 2   living_area          24878 non-null  int64  
 3   building_condition   24878 non-null  int64  
 4   terrace              24878 non-null  int64  
 5   equipped_kitchen     24878 non-null  int64  
 6   subtype_of_property  24878 non-null  int64  
 7   garden               24878 non-null  int64  
dtypes: float64(1), int64(7)
memory usage: 1.5 MB


In [4]:
# Cast every feature column (all columns except 'price') to float.
# Note: astype(float) raises an error on non-convertible values; it does not coerce them to NaN.
columns_to_convert = [col for col in df_selected.columns if col != 'price']
df_selected[columns_to_convert] = df_selected[columns_to_convert].astype(float)

In [5]:
df_selected.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24878 entries, 0 to 24877
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   price                24878 non-null  float64
 1   commune_encoded      24878 non-null  float64
 2   living_area          24878 non-null  float64
 3   building_condition   24878 non-null  float64
 4   terrace              24878 non-null  float64
 5   equipped_kitchen     24878 non-null  float64
 6   subtype_of_property  24878 non-null  float64
 7   garden               24878 non-null  float64
dtypes: float64(8)
memory usage: 1.5 MB


### Split into training and test sets before scaling to avoid leakage!

In [6]:
from sklearn.model_selection import train_test_split

# Define features and target
X = df_selected.drop('price', axis=1)  # All columns except 'price'
y = df_selected['price']  # Target column

# Split the data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
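
To see why the split comes before scaling, the sketch below compares a scaler fitted on the training rows only with one fitted on all rows. It uses a synthetic feature, not the ImmoEliza data; the variable names are chosen for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic single feature (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=100.0, scale=20.0, size=(200, 1))

X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=42)

# Leak-free: the scaling statistics come from the training rows only
scaler_ok = StandardScaler().fit(X_tr)

# Leaky: the scaling statistics also include the test rows
scaler_leaky = StandardScaler().fit(X_demo)

print("train-only mean:", scaler_ok.mean_[0])
print("all-rows mean:  ", scaler_leaky.mean_[0])
```

The two means differ because the second scaler has already seen the test rows, so information from the test set would flow into the features used for training.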

In [7]:
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Initialize scalers
scaler_minmax = MinMaxScaler()
scaler_standard = StandardScaler()

# Normalize (latitude, longitude): disabled, these columns are not in the current feature set
#df_selected.loc[:, ['latitude', 'longitude']] = scaler_minmax.fit_transform(df_selected.loc[:, ['latitude', 'longitude']])

# Fit the standard scaler on the training set only.
# Note: columns_to_convert holds every feature column, so all features are standardized here.
X_train_scaled = X_train.copy()
X_train_scaled.loc[:, columns_to_convert] = scaler_standard.fit_transform(X_train[columns_to_convert])

# Apply the same fitted scaler to the test set (transform only, no refit)
X_test_scaled = X_test.copy()
X_test_scaled.loc[:, columns_to_convert] = scaler_standard.transform(X_test[columns_to_convert])

# Step 1: Apply log transformation to living area
#df_selected.loc[:, 'living_area'] = np.log1p(df_selected['living_area'])  # log(1 + x) to avoid log(0)

# Step 2: Apply standardization to living area
#scaler = StandardScaler()
#df_selected.loc[:, 'living_area'] = scaler.fit_transform(df_selected[['living_area']])

# Apply log transformation to 'price'
#df_selected.loc[:, 'price'] = np.log1p(df_selected['price'])
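
The commented-out lines above hint at a log transform of `living_area` and `price`. If that route were taken, predictions made on the log scale would need to be mapped back with `np.expm1`. A minimal sketch with made-up prices, offered as an illustration rather than as part of this notebook's pipeline:

```python
import numpy as np

# Hypothetical prices in euros (illustrative only)
prices = np.array([150000.0, 320000.0, 1250000.0])

# log1p computes log(1 + x) and compresses the long right tail
log_prices = np.log1p(prices)

# expm1 is the inverse of log1p, used to get back to euros
restored = np.expm1(log_prices)

print(log_prices)
print(restored)
```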

In [8]:
# Export the scaled feature sets and the targets
X_train_scaled.to_csv('./data/3_Scaled_Features_Train.csv', index=False)
X_test_scaled.to_csv('./data/3_Scaled_Features_Test.csv', index=False)
y_train.to_csv('./data/3_Target_Train.csv', index=False)
y_test.to_csv('./data/3_Target_Test.csv', index=False)
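
When these files are read back in a later notebook, `pd.read_csv` returns the target as a one-column DataFrame rather than a Series. The sketch below shows one way to restore a Series; it writes to a temporary directory with a hypothetical file name and uses made-up values instead of the real target.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for y_train (illustrative values only)
y_demo = pd.Series([250000.0, 399000.0, 175000.0], name="price")

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "target_demo.csv"  # hypothetical file name
    y_demo.to_csv(path, index=False)

    # read_csv yields a one-column DataFrame; squeeze turns it back into a Series
    y_loaded = pd.read_csv(path).squeeze("columns")

print(type(y_loaded).__name__, y_loaded.name)
```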