# 02 - Data Preprocessing

This notebook prepares the dataset for model training. We clean the data, handle outliers, engineer new features, encode categorical variables, and scale the features before splitting into train and validation sets.

## 1. Libraries and Data Loading

In [13]:
import pandas as pd
import numpy as np
import joblib
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split

np.random.seed(42)

In [14]:
train_df = pd.read_csv('../data/raw/data.csv')
test_df = pd.read_csv('../data/raw/test.csv')

print(f"Train shape: {train_df.shape}")
print(f"Test shape: {test_df.shape}")

Train shape: (100000, 19)
Test shape: (100000, 18)


In [15]:
X_train = train_df.drop(['price', 'index'], axis=1)
y_train = train_df['price']
X_test = test_df.drop(['index'], axis=1)
test_indices = test_df['index'].copy()

print(f"Features shape: {X_train.shape}")
print(f"Target shape: {y_train.shape}")

Features shape: (100000, 17)
Target shape: (100000,)


We drop `other_area` because our EDA revealed a strong correlation with `total_area` (r = 0.89). Keeping both would introduce multicollinearity while adding little independent information.

In [16]:
# Drop other_area (r=0.89 with total_area, identified in EDA)
X_train = X_train.drop(['other_area'], axis=1)
X_test = X_test.drop(['other_area'], axis=1)

print(f"Features after drop: {X_train.shape[1]}")

Features after drop: 16
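
The correlation itself was computed in the EDA notebook and is not reproduced here. As a minimal sketch of how such a check might look, the snippet below computes Pearson's r between two columns of a small synthetic frame. The data and the 0.85 cut-off are illustrative assumptions, not values taken from our dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (assumption: illustrative only)
rng = np.random.default_rng(0)
total_area = rng.uniform(20, 120, size=1000)
other_area = 0.3 * total_area + rng.normal(0, 3, size=1000)
demo = pd.DataFrame({"total_area": total_area, "other_area": other_area})

# Pearson correlation between the two columns
r = demo["total_area"].corr(demo["other_area"])

# Hypothetical cut-off for flagging a redundant feature
THRESHOLD = 0.85
to_drop = ["other_area"] if abs(r) > THRESHOLD else []
print(f"r = {r:.2f}, columns to drop: {to_drop}")
```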


## 2. Missing Value Handling

We use median imputation for numerical features and mode imputation for categorical features. Although our dataset has no missing values, we include this step as a safeguard.

In [17]:
numerical_cols = ['kitchen_area', 'bath_area', 'extra_area', 
                  'extra_area_count', 'year', 'ceil_height', 'floor_max', 
                  'floor', 'total_area', 'bath_count', 'rooms_count']

categorical_cols = ['gas', 'hot_water', 'central_heating', 
                    'extra_area_type_name', 'district_name']

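# Fill values are learned from the training set and applied to both sets,
# so missing values that occur only in the test set are handled too.
# Assigning the result back avoids chained in-place fillna, which newer
# pandas versions warn about and which may not modify the frame at all.
for col in numerical_cols:
    median_value = X_train[col].median()
    X_train[col] = X_train[col].fillna(median_value)
    X_test[col] = X_test[col].fillna(median_value)

for col in categorical_cols:
    mode_value = X_train[col].mode()[0]
    X_train[col] = X_train[col].fillna(mode_value)
    X_test[col] = X_test[col].fillna(mode_value)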

print(f"Missing values in train: {X_train.isnull().sum().sum()}")
print(f"Missing values in test:  {X_test.isnull().sum().sum()}")

Missing values in train: 0
Missing values in test:  0
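
Because our data has no missing values, the safeguard above is never exercised. The following sketch shows the same idea on toy frames where only the test set has gaps: the fill values come from the training set and are applied to both sets. The column values here are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frames (assumption: illustrative values, not our dataset)
train = pd.DataFrame({"year": [1990, 2000, 2010], "gas": ["Yes", "No", "Yes"]})
test = pd.DataFrame({"year": [np.nan, 2005], "gas": [None, "No"]})

# Learn the fill values from the training set only
fill_values = {
    "year": train["year"].median(),  # numerical column: median
    "gas": train["gas"].mode()[0],   # categorical column: most frequent value
}

# fillna with a dict fills each column with its own value
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)

print(test_filled)
```

Passing a dict to `fillna` keeps all fill values in one place, which would also make them easy to save alongside the scaler and encoders if needed.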


## 3. Outlier Handling

We apply IQR-based capping to continuous features. Values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are clipped to those bounds. We fit the bounds on the training set and apply them to both train and test to avoid data leakage.

In [18]:
continuous_cols = ['kitchen_area', 'bath_area', 'extra_area', 'ceil_height', 'total_area']

outlier_bounds = {}

for col in continuous_cols:
    Q1 = X_train[col].quantile(0.25)
    Q3 = X_train[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    outlier_bounds[col] = (lower, upper)

    X_train[col] = X_train[col].clip(lower=lower, upper=upper)
    X_test[col] = X_test[col].clip(lower=lower, upper=upper)

print("IQR outlier capping applied to:")
for col, (lo, hi) in outlier_bounds.items():
    print(f"  {col}: [{lo:.2f}, {hi:.2f}]")

IQR outlier capping applied to:
  kitchen_area: [-4.00, 36.00]
  bath_area: [-8.50, 51.50]
  extra_area: [-10.00, 30.00]
  ceil_height: [0.61, 5.65]
  total_area: [12.29, 121.26]
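
To make the capping rule concrete, here is a small worked sketch on a toy column. The helper name `fit_iqr_bounds` and the sample values are hypothetical; the arithmetic is the same Q1 - 1.5 * IQR and Q3 + 1.5 * IQR rule used in the cell above.

```python
import pandas as pd

def fit_iqr_bounds(s: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) capping bounds computed from a series."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy columns (assumption: illustrative values only)
train_col = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0, 100.0])
test_col = pd.Series([2.0, 15.0, 200.0])

# Bounds come from the training column only
lower, upper = fit_iqr_bounds(train_col)

# The same bounds are then applied to both columns
train_capped = train_col.clip(lower=lower, upper=upper)
test_capped = test_col.clip(lower=lower, upper=upper)

print(f"bounds: [{lower}, {upper}]")
print(test_capped.tolist())
```

Here Q1 = 12.5 and Q3 = 17.5, so the bounds are 5.0 and 25.0. In the output of our own cell, the lower bounds for `kitchen_area`, `bath_area`, and `extra_area` are negative, so, assuming areas are never negative, only the upper cap has any effect on those columns.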


## 4. Feature Engineering

We create four derived features based on domain knowledge of the property market: floor position ratio, ground and top floor indicators, and usable living area.

In [19]:
def engineer_features(df):
    df['floor_ratio'] = df['floor'] / df['floor_max']
    df['is_ground_floor'] = (df['floor'] == 1).astype(int)
    df['is_top_floor'] = (df['floor'] == df['floor_max']).astype(int)
    df['living_area'] = (df['total_area'] - df['kitchen_area'] - df['bath_area']).clip(lower=0)
    return df

X_train = engineer_features(X_train)
X_test = engineer_features(X_test)

print(f"Features after engineering: {X_train.shape[1]}")
print(f"New features: floor_ratio, is_ground_floor, is_top_floor, living_area")

Features after engineering: 20
New features: floor_ratio, is_ground_floor, is_top_floor, living_area
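
The sketch below runs the same four transformations on a three-row toy frame so the results can be checked by hand. It is a hypothetical variant, not the function used above: it works on a copy instead of modifying its input, and it treats `floor_max == 0` as missing so the ratio becomes NaN instead of infinity. Whether such rows exist in our data is not something this notebook checks.

```python
import numpy as np
import pandas as pd

def engineer_features_safe(df: pd.DataFrame) -> pd.DataFrame:
    """Copy-based variant of engineer_features (hypothetical helper)."""
    out = df.copy()  # leave the caller's frame untouched
    # A zero floor_max is treated as missing, giving NaN instead of inf
    floor_max = out["floor_max"].replace(0, np.nan)
    out["floor_ratio"] = out["floor"] / floor_max
    out["is_ground_floor"] = (out["floor"] == 1).astype(int)
    out["is_top_floor"] = (out["floor"] == out["floor_max"]).astype(int)
    # Negative results (inconsistent area records) are floored at zero
    out["living_area"] = (
        out["total_area"] - out["kitchen_area"] - out["bath_area"]
    ).clip(lower=0)
    return out

# Toy rows (assumption: illustrative values only)
demo = pd.DataFrame({
    "floor": [1, 5, 9],
    "floor_max": [9, 5, 0],
    "total_area": [50.0, 30.0, 20.0],
    "kitchen_area": [10.0, 8.0, 15.0],
    "bath_area": [5.0, 4.0, 10.0],
})
result = engineer_features_safe(demo)
print(result[["floor_ratio", "is_ground_floor", "is_top_floor", "living_area"]])
```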


## 5. Categorical Encoding

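We label-encode `district_name` and `extra_area_type_name`, fitting each encoder on the training set and assigning -1 to any test category that was not seen in training. The binary Yes/No columns are mapped to 1/0.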

In [20]:
label_encoders = {}

for col in ['district_name', 'extra_area_type_name']:
    le = LabelEncoder()
    X_train[col] = le.fit_transform(X_train[col].astype(str))
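    # Look up each test category in the fitted classes; unseen ones become -1
    code_map = {cls: code for code, cls in enumerate(le.classes_)}
    X_test[col] = X_test[col].astype(str).map(code_map).fillna(-1).astype(int)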
    label_encoders[col] = le

print(f"district_name: {X_train['district_name'].nunique()} classes")
print(f"extra_area_type_name: {X_train['extra_area_type_name'].nunique()} classes")

district_name: 7 classes
extra_area_type_name: 2 classes
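
The handling of unseen categories is easiest to see on a toy example. In this sketch the district names are made up; `"West"` appears only in the test column and therefore receives the fallback code -1.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy columns (assumption: illustrative category names)
train_col = pd.Series(["North", "South", "East", "South"])
test_col = pd.Series(["East", "West"])  # "West" never appears in training

le = LabelEncoder().fit(train_col)

# classes_ is sorted, so each class maps to its position in that array
class_to_code = {cls: code for code, cls in enumerate(le.classes_)}

# Categories missing from the lookup become NaN, then -1
test_encoded = test_col.map(class_to_code).fillna(-1).astype(int)

print(list(le.classes_))
print(test_encoded.tolist())
```

Note that integer codes imply an ordering that district names do not have. Tree-based models usually tolerate this, whereas linear models may read the codes as magnitudes.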


In [21]:
# Map binary Yes/No columns to 1/0
binary_map = {'Yes': 1, 'No': 0}
for col in ['gas', 'hot_water', 'central_heating']:
    X_train[col] = X_train[col].map(binary_map).fillna(X_train[col])
    X_test[col] = X_test[col].map(binary_map).fillna(X_test[col])

## 6. Feature Scaling

We apply StandardScaler to standardise all features to zero mean and unit variance. The scaler is fitted on the training set only, so no test-set statistics leak into the transformation.

In [22]:
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)

print(f"Scaled features - Mean: {X_train_scaled.mean().mean():.4f}")
print(f"Scaled features - Std:  {X_train_scaled.std().mean():.4f}")

Scaled features - Mean: 0.0000
Scaled features - Std:  1.0000
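
As a quick sanity check of what the scaler does, the sketch below compares `StandardScaler` with the manual formula z = (x - mean) / std on toy arrays. The arrays are illustrative; the point is that the mean and standard deviation come from the training data and that scikit-learn uses the population standard deviation (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (assumption: illustrative values only)
X_tr = np.array([[1.0], [2.0], [3.0], [4.0]])
X_te = np.array([[10.0]])

scaler_demo = StandardScaler().fit(X_tr)

# Manual standardisation with the training mean and population std
manual = (X_te - X_tr.mean(axis=0)) / X_tr.std(axis=0)

matches = np.allclose(scaler_demo.transform(X_te), manual)
print(f"Scaler matches manual formula: {matches}")
```

pandas' `.std()` defaults to the sample standard deviation (`ddof=1`), so the value printed by our cell is only approximately 1; with 100,000 rows the difference disappears at four decimal places.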


## 7. Train-Validation Split

We split the training data 80/20. The validation set is used to evaluate and compare models without touching the Kaggle test set.

In [23]:
X_train_final, X_val, y_train_final, y_val = train_test_split(
    X_train_scaled, y_train, test_size=0.2, random_state=42
)

print(f"Training set:   {X_train_final.shape}")
print(f"Validation set: {X_val.shape}")
print(f"Test set:       {X_test_scaled.shape}")

Training set:   (80000, 20)
Validation set: (20000, 20)
Test set:       (100000, 20)
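
One caveat about ordering: in this notebook the outlier bounds (section 3) and the scaler (section 6) were fitted on all 100,000 training rows before the split, so the validation rows contributed to those statistics. At this sample size the effect is probably small, but a stricter alternative is to split first and fit on the training fold only. The sketch below shows that order on synthetic data; the array shapes and distribution are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and target (assumption: illustrative only)
rng = np.random.default_rng(42)
X = rng.normal(loc=50, scale=10, size=(1000, 3))
y = rng.normal(size=1000)

# 1. Split first
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Fit the scaler on the training fold only
fold_scaler = StandardScaler().fit(X_tr)

# 3. Apply the fitted scaler to both folds
X_tr_s = fold_scaler.transform(X_tr)
X_va_s = fold_scaler.transform(X_va)

print(X_tr_s.shape, X_va_s.shape)
```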


## 8. Save Processed Data

We save the processed arrays and preprocessing objects so the modelling notebooks can load them directly.

In [24]:
np.save('../data/processed/X_train.npy', X_train_final.values)
np.save('../data/processed/X_val.npy', X_val.values)
np.save('../data/processed/y_train.npy', y_train_final.values)
np.save('../data/processed/y_val.npy', y_val.values)
np.save('../data/processed/X_test.npy', X_test_scaled.values)
np.save('../data/processed/test_indices.npy', test_indices.values)

with open('../data/processed/feature_names.txt', 'w') as f:
    f.write('\n'.join(X_train_scaled.columns))

joblib.dump(scaler, '../models/scaler.pkl')
joblib.dump(label_encoders, '../models/label_encoders.pkl')

print("All data and preprocessing objects saved.")

All data and preprocessing objects saved.
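
For reference, a downstream notebook could reload these artefacts roughly as follows. To keep the sketch self-contained it writes to a temporary directory and uses a tiny two-feature array; the real paths are the `../data/processed/` and `../models/` locations used above.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny stand-in for the processed data (assumption: illustrative values)
X = np.array([[1.0, 10.0], [3.0, 30.0]])
demo_scaler = StandardScaler().fit(X)

with tempfile.TemporaryDirectory() as tmp_dir:
    out = Path(tmp_dir)

    # Save, mirroring the cell above
    np.save(out / "X_train.npy", demo_scaler.transform(X))
    (out / "feature_names.txt").write_text("floor\ntotal_area")
    joblib.dump(demo_scaler, out / "scaler.pkl")

    # Load, as a modelling notebook might
    X_loaded = np.load(out / "X_train.npy")
    feature_names = (out / "feature_names.txt").read_text().split("\n")
    scaler_loaded = joblib.load(out / "scaler.pkl")

# The saved scaler can map scaled values back to the original units
X_original = scaler_loaded.inverse_transform(X_loaded)
print(feature_names, X_original.tolist())
```

Saving the feature names separately matters because `.npy` files store only the values, not the column labels.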


## Summary

Our preprocessing pipeline consisted of the following steps:

1. Loaded train (100,000 rows) and test (100,000 rows) datasets
2. Dropped `other_area` due to high collinearity with `total_area` (r = 0.89)
3. Verified no missing values; included median/mode imputation as a safeguard
4. Applied IQR-based outlier capping to continuous features (`kitchen_area`, `bath_area`, `extra_area`, `ceil_height`, `total_area`)
5. Engineered 4 features: `floor_ratio`, `is_ground_floor`, `is_top_floor`, `living_area`
6. Label-encoded `district_name` (7 classes) and `extra_area_type_name` (2 classes), and mapped the binary Yes/No columns (`gas`, `hot_water`, `central_heating`) to 1/0
7. Applied StandardScaler to standardise all features
8. Split into 80% training and 20% validation sets

**Final feature count: 20** (17 original features - 1 dropped + 4 engineered). The encoding steps replace values in place and do not add columns.

The data is now ready for model training in the next notebook.