# 02 - Data Preprocessing

**Purpose**: Clean data, engineer features, and prepare datasets for machine learning models.

**Input**: Raw train and test CSV files

**Output**: Preprocessed data, saved scaler and encoders

## 1. Import Libraries and Load Data

In [14]:
import pandas as pd
import numpy as np
import joblib
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split

# Set random seed for reproducibility
np.random.seed(42)

In [15]:
# Load training data (with price) and test data (without price)
train_df = pd.read_csv('../data/raw/data.csv')
test_df = pd.read_csv('../data/raw/test.csv')

print(f"Train shape: {train_df.shape}")
print(f"Test shape: {test_df.shape}")

Train shape: (100000, 19)
Test shape: (100000, 18)


In [16]:
# Separate features and target
# Keep 'index' column for final submission but exclude from features
X_train = train_df.drop(['price', 'index'], axis=1)
y_train = train_df['price']
X_test = test_df.drop(['index'], axis=1)
test_indices = test_df['index'].copy()  # Save for submission

print(f"Features shape: {X_train.shape}")
print(f"Target shape: {y_train.shape}")

Features shape: (100000, 17)
Target shape: (100000,)


## 1.1 Drop Redundant Features

In [17]:
# Drop 'other_area' due to high multicollinearity with 'total_area' (r=0.89)
# EDA showed that other_area is highly redundant with total_area
# Removing it reduces multicollinearity and improves model interpretability

print("Before dropping other_area:")
print(f"  Train features: {X_train.shape[1]}")
print(f"  Test features: {X_test.shape[1]}")

X_train = X_train.drop(['other_area'], axis=1)
X_test = X_test.drop(['other_area'], axis=1)

print("\nAfter dropping other_area:")
print(f"  Train features: {X_train.shape[1]}")
print(f"  Test features: {X_test.shape[1]}")
print("\nMulticollinearity reduced!")

Before dropping other_area:
  Train features: 17
  Test features: 17

After dropping other_area:
  Train features: 16
  Test features: 16

Multicollinearity reduced!


## 2. Handle Missing Values

In [18]:
# If missing values exist, apply imputation
# All fill values are computed from the training data only
# Median imputation for numerical features preserves central tendency
# Mode imputation for categorical features uses most frequent value

# 'other_area' is not listed because it was dropped due to multicollinearity
numerical_cols = ['kitchen_area', 'bath_area', 'extra_area', 
                  'extra_area_count', 'year', 'ceil_height', 'floor_max', 
                  'floor', 'total_area', 'bath_count', 'rooms_count']

categorical_cols = ['gas', 'hot_water', 'central_heating', 
                    'extra_area_type_name', 'district_name']

# Numerical imputation with the training median
# Assign the result back instead of calling fillna(inplace=True) on a selected
# column, which is deprecated in recent pandas and may not update the DataFrame
for col in numerical_cols:
    if X_train[col].isnull().any() or X_test[col].isnull().any():
        median_value = X_train[col].median()
        X_train[col] = X_train[col].fillna(median_value)
        X_test[col] = X_test[col].fillna(median_value)

# Categorical imputation with the training mode
for col in categorical_cols:
    if X_train[col].isnull().any() or X_test[col].isnull().any():
        mode_value = X_train[col].mode()[0]
        X_train[col] = X_train[col].fillna(mode_value)
        X_test[col] = X_test[col].fillna(mode_value)

print("Missing value imputation completed")

Missing value imputation completed


In [19]:
# Check for missing values
print("Missing values in training data:")
print(X_train.isnull().sum()[X_train.isnull().sum() > 0])

print("\nMissing values in test data:")
print(X_test.isnull().sum()[X_test.isnull().sum() > 0])

Missing values in training data:
Series([], dtype: int64)

Missing values in test data:
Series([], dtype: int64)
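
The same train-only imputation can also be expressed with scikit-learn's `SimpleImputer`, which stores the fitted statistics so they can be saved alongside the scaler. The sketch below is a minimal illustration on toy frames; `toy_train`, `toy_test`, and their values are assumptions made for the example, not the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames standing in for X_train / X_test (illustrative values only)
toy_train = pd.DataFrame({
    "total_area": [40.0, np.nan, 60.0, 80.0],
    "district_name": pd.Series(["A", "B", np.nan, "B"], dtype=object),
})
toy_test = pd.DataFrame({
    "total_area": [np.nan, 55.0],
    "district_name": pd.Series([np.nan, "A"], dtype=object),
})

num_cols = ["total_area"]
cat_cols = ["district_name"]

# Fill values are learned from the training frame only, then reused on test
num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")

toy_train[num_cols] = num_imputer.fit_transform(toy_train[num_cols])
toy_test[num_cols] = num_imputer.transform(toy_test[num_cols])
toy_train[cat_cols] = cat_imputer.fit_transform(toy_train[cat_cols])
toy_test[cat_cols] = cat_imputer.transform(toy_test[cat_cols])

print(toy_train)
print(toy_test)
```

Because the imputers are ordinary fitted estimators, they could be persisted with `joblib.dump` in the same way as the scaler and label encoders.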


## 3. Outlier Detection and Handling
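
No outlier-handling cell appears under this heading, although the summary names IQR capping as a step. As a rough sketch of what that step might look like, assuming the conventional 1.5 × IQR fences and bounds computed on the training data only, the following caps a single toy column. The helper `iqr_bounds` and the toy values are hypothetical and are not taken from the original pipeline.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy columns standing in for a numeric feature such as total_area
toy_train = pd.DataFrame({"total_area": [30.0, 40.0, 50.0, 60.0, 500.0]})
toy_test = pd.DataFrame({"total_area": [45.0, 900.0]})

# Fences come from the training column and are reused for the test column
lower, upper = iqr_bounds(toy_train["total_area"])
toy_train["total_area"] = toy_train["total_area"].clip(lower, upper)
toy_test["total_area"] = toy_test["total_area"].clip(lower, upper)

print(f"Fences: [{lower}, {upper}]")
print(toy_train["total_area"].tolist())
print(toy_test["total_area"].tolist())
```

Capping keeps every row, so the dataset shapes reported elsewhere in this notebook would be unaffected by such a step.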

In [21]:
# Create derived features based on domain knowledge
# These features capture important real estate relationships

def engineer_features(df):
    """Create new features from existing ones"""
    # Floor ratio: relative position in building (0-1)
    # Ground and top floors often have different valuations
    df['floor_ratio'] = df['floor'] / df['floor_max']
    
    # Binary indicators for special floor positions
    df['is_ground_floor'] = (df['floor'] == 1).astype(int)
    df['is_top_floor'] = (df['floor'] == df['floor_max']).astype(int)
    
    # Note: living_area feature removed
    # It would be total_area - kitchen_area - bath_area
    # But since other_area was dropped, this creates circular logic
    # We already have total_area which contains all information needed
    
    return df

# Apply feature engineering to both train and test
X_train = engineer_features(X_train)
X_test = engineer_features(X_test)

print("Feature engineering completed")
print(f"New shape: {X_train.shape}")
print("Created 3 new features: floor_ratio, is_ground_floor, is_top_floor")

Feature engineering completed
New shape: (100000, 19)
Created 3 new features: floor_ratio, is_ground_floor, is_top_floor


## 4. Feature Engineering

In [22]:
# Create derived features based on domain knowledge
# These features capture important real estate relationships

def engineer_features(df):
    """Create new features from existing ones"""
    # Floor ratio: relative position in building (0-1)
    # Ground and top floors often have different valuations
    df['floor_ratio'] = df['floor'] / df['floor_max']
    
    # Binary indicators for special floor positions
    df['is_ground_floor'] = (df['floor'] == 1).astype(int)
    df['is_top_floor'] = (df['floor'] == df['floor_max']).astype(int)
    
    # Living area: usable space excluding kitchen and bathroom
    # More granular than total_area alone
    df['living_area'] = df['total_area'] - df['kitchen_area'] - df['bath_area']
    df['living_area'] = df['living_area'].clip(lower=0)  # Ensure non-negative
    
    return df

# Apply feature engineering to both train and test
X_train = engineer_features(X_train)
X_test = engineer_features(X_test)

print("Feature engineering completed")
print(f"New shape: {X_train.shape}")

Feature engineering completed
New shape: (100000, 20)
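
`engineer_features` modifies the DataFrame it receives and divides by `floor_max` without checking for zeros. A possible variant, sketched below, works on a copy and turns a zero `floor_max` into `NaN` instead of `inf`. The name `engineer_features_safe` and the toy rows are assumptions for illustration; whether the real data contains zero `floor_max` values is not shown in this notebook.

```python
import numpy as np
import pandas as pd


def engineer_features_safe(df):
    """Return a copy of df with the four derived columns added."""
    out = df.copy()
    # Treat a zero floor_max as missing so the ratio is NaN rather than inf
    floor_max = out["floor_max"].replace(0, np.nan)
    out["floor_ratio"] = out["floor"] / floor_max
    out["is_ground_floor"] = (out["floor"] == 1).astype(int)
    out["is_top_floor"] = (out["floor"] == out["floor_max"]).astype(int)
    # Usable space excluding kitchen and bathroom, never negative
    out["living_area"] = (
        out["total_area"] - out["kitchen_area"] - out["bath_area"]
    ).clip(lower=0)
    return out


toy = pd.DataFrame({
    "floor": [1, 5, 9],
    "floor_max": [9, 0, 9],
    "total_area": [50.0, 40.0, 12.0],
    "kitchen_area": [10, 8, 10],
    "bath_area": [5, 4, 5],
})
features = engineer_features_safe(toy)
print(features)
```

Any `NaN` produced this way would need to be imputed before scaling, since `StandardScaler` ignores `NaN` when fitting but passes it through in the transformed output.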


## 5. Encode Categorical Variables

In [23]:
# Label encoding for categorical variables
# Converts categorical strings to numerical values (0, 1, 2, ...)
# Suitable for tree-based models and reduces dimensionality vs one-hot

label_encoders = {}

# Encode district_name and extra_area_type_name
for col in ['district_name', 'extra_area_type_name']:
    le = LabelEncoder()
    
    # Fit on training data
    X_train[col] = le.fit_transform(X_train[col].astype(str))
    
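    # Transform test data, handling unseen categories
    # A lookup built from the fitted classes maps unseen values to -1
    class_to_code = {cls: code for code, cls in enumerate(le.classes_)}
    X_test[col] = X_test[col].astype(str).map(class_to_code).fillna(-1).astype(int)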
    
    # Save encoder for later use
    label_encoders[col] = le

print("Label encoding completed")
print(f"district_name encoded to: {X_train['district_name'].nunique()} classes")
print(f"extra_area_type_name encoded to: {X_train['extra_area_type_name'].nunique()} classes")

Label encoding completed
district_name encoded to: 7 classes
extra_area_type_name encoded to: 2 classes
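
scikit-learn's `OrdinalEncoder` offers the same "unseen category becomes -1" behaviour through its `handle_unknown` and `unknown_value` parameters, and `LabelEncoder` is documented as intended for target labels rather than features. A minimal sketch on toy data follows; the district names used here are made up for the example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frames; "West" appears only in the test data
toy_train = pd.DataFrame({"district_name": ["North", "South", "East", "South"]})
toy_test = pd.DataFrame({"district_name": ["East", "West"]})

# Categories not seen during fit are encoded as -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(toy_train[["district_name"]]).astype(int)
test_codes = encoder.transform(toy_test[["district_name"]]).astype(int)

print(encoder.categories_[0])
print(train_codes.ravel())
print(test_codes.ravel())
```

Integer codes impose an arbitrary ordering on the districts. Tree-based models are largely insensitive to this, but for a linear model a one-hot encoding such as `OneHotEncoder(handle_unknown="ignore")` may be worth comparing.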


## 6. Feature Scaling

In [24]:
# StandardScaler normalizes features to zero mean and unit variance
# This prevents features with large magnitudes from dominating the model
# Essential for linear models, neural networks, and distance-based algorithms

# Convert binary categorical columns (Yes/No) to numeric 1/0 where present
binary_map = {'Yes': 1, 'No': 0}
for col in ['gas', 'hot_water', 'central_heating']:
    if col in X_train.columns:
        # Map known Yes/No values to 1/0 and keep original where mapping fails
        X_train[col] = X_train[col].map(binary_map).fillna(X_train[col])
        X_test[col] = X_test[col].map(binary_map).fillna(X_test[col])

# If any remaining object (string) columns exist, label-encode them as a fallback
obj_cols = X_train.select_dtypes(include=['object']).columns.tolist()
if len(obj_cols) > 0:
    for col in obj_cols:
        if col in ['district_name', 'extra_area_type_name']:
            # already encoded earlier; skip
            continue
        le_tmp = LabelEncoder()
        X_train[col] = le_tmp.fit_transform(X_train[col].astype(str))
        X_test[col] = X_test[col].astype(str).apply(lambda x: le_tmp.transform([x])[0] if x in le_tmp.classes_ else -1)

print('Pre-scaling dtypes:')
print(X_train.dtypes)
if 'gas' in X_train.columns:
    print('Unique gas values (train sample):', pd.unique(X_train['gas'])[:20])

scaler = StandardScaler()

# Fit scaler on training data and transform both train and test
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert back to DataFrame for easier handling
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)

print("Feature scaling completed")
print(f"\nScaled features - Mean: {X_train_scaled.mean().mean():.4f}")
print(f"Scaled features - Std: {X_train_scaled.std().mean():.4f}")

Pre-scaling dtypes:
kitchen_area              int64
bath_area                 int64
gas                       int64
hot_water                 int64
central_heating           int64
extra_area                int64
extra_area_count          int64
year                      int64
ceil_height             float64
floor_max                 int64
floor                     int64
total_area              float64
bath_count                int64
extra_area_type_name      int64
district_name             int64
rooms_count               int64
floor_ratio             float64
is_ground_floor           int64
is_top_floor              int64
living_area             float64
dtype: object
Unique gas values (train sample): [0 1]
Feature scaling completed

Scaled features - Mean: 0.0000
Scaled features - Std: 1.0000


## 7. Train-Validation Split

In [25]:
# Split training data into train and validation sets
# 80/20 split provides sufficient training data while retaining validation samples
# Validation set is used for model selection and hyperparameter tuning

X_train_final, X_val, y_train_final, y_val = train_test_split(
    X_train_scaled, y_train, test_size=0.2, random_state=42
)

print(f"Training set: {X_train_final.shape}")
print(f"Validation set: {X_val.shape}")
print(f"Test set: {X_test_scaled.shape}")

Training set: (80000, 20)
Validation set: (20000, 20)
Test set: (100000, 20)
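
Because the scaler above is fit on all 100,000 training rows before the split, the validation rows contribute to its mean and variance. One common alternative is to split first and fit the scaler on the training portion only. The sketch below illustrates that ordering on random toy data; the column names and generated values are placeholders, not the real features.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random toy data standing in for the encoded feature matrix and target
rng = np.random.default_rng(42)
toy_X = pd.DataFrame(
    rng.normal(loc=50, scale=10, size=(100, 3)),
    columns=["total_area", "floor", "year"],
)
toy_y = pd.Series(rng.normal(size=100), name="price")

# 1) Split first, so validation rows never influence the scaler
X_tr, X_va, y_tr, y_va = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42
)

# 2) Fit on the training portion, then apply the same transform to validation
scaler = StandardScaler()
X_tr_scaled = pd.DataFrame(
    scaler.fit_transform(X_tr), columns=X_tr.columns, index=X_tr.index
)
X_va_scaled = pd.DataFrame(
    scaler.transform(X_va), columns=X_va.columns, index=X_va.index
)

print(X_tr_scaled.shape, X_va_scaled.shape)
```

With this ordering the validation means are close to, but not exactly, zero, which is the expected behaviour for data the scaler has not seen.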


## Summary

The preprocessing pipeline consists of the following steps:

1. **Data Loading**: Loaded train and test datasets
2. **Drop Redundant Features**: Removed 'other_area' due to high multicollinearity (r=0.89 with total_area)
3. **Missing Values**: Checked and handled (median/mode imputation)
4. **Outlier Capping**: Applied IQR method to cap extreme values
5. **Feature Engineering**: Created 4 new features (floor_ratio, is_ground_floor, is_top_floor, living_area)
6. **Categorical Encoding**: Label encoding for district_name and extra_area_type_name; Yes/No columns (gas, hot_water, central_heating) mapped to 1/0
7. **Feature Scaling**: StandardScaler normalization for all features
8. **Train-Val Split**: 80/20 split for model validation
9. **Data Persistence**: Saved processed data and preprocessing objects

**Final Feature Count**: 20 features (17 original - 1 dropped + 4 engineered = 20 total)

**Key Decisions**:
- Dropped 'other_area' to reduce multicollinearity with 'total_area'
- Kept 'living_area' (total_area - kitchen_area - bath_area, clipped at 0), even though it is nearly a linear combination of columns that remain in the feature set
- Ridge regression (next notebook) is used to handle the remaining multicollinearity

The data is now ready for model training in the next notebook.

In [26]:
# Save processed datasets for model training
np.save('../data/processed/X_train.npy', X_train_final.values)
np.save('../data/processed/X_val.npy', X_val.values)
np.save('../data/processed/y_train.npy', y_train_final.values)
np.save('../data/processed/y_val.npy', y_val.values)
np.save('../data/processed/X_test.npy', X_test_scaled.values)
np.save('../data/processed/test_indices.npy', test_indices.values)

# Save feature names for reference
with open('../data/processed/feature_names.txt', 'w') as f:
    f.write('\n'.join(X_train_scaled.columns))

# Save preprocessing objects for deployment
# These are needed to preprocess new data at inference time
joblib.dump(scaler, '../models/scaler.pkl')
joblib.dump(label_encoders, '../models/label_encoders.pkl')

print("All data saved successfully!")
print("\nSaved files:")
print("- ../data/processed/X_train.npy")
print("- ../data/processed/X_val.npy")
print("- ../data/processed/y_train.npy")
print("- ../data/processed/y_val.npy")
print("- ../data/processed/X_test.npy")
print("- ../models/scaler.pkl")
print("- ../models/label_encoders.pkl")

All data saved successfully!

Saved files:
- ../data/processed/X_train.npy
- ../data/processed/X_val.npy
- ../data/processed/y_train.npy
- ../data/processed/y_val.npy
- ../data/processed/X_test.npy
- ../models/scaler.pkl
- ../models/label_encoders.pkl
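
At inference time the saved objects have to be loaded back and applied in the same order as above. The round trip below is a minimal sketch that writes to a temporary directory instead of the project paths; the toy array and the file names inside the temporary directory are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

with tempfile.TemporaryDirectory() as tmp_dir:
    # Fit a scaler on toy data and persist it together with the scaled array
    toy_X = np.array([[30.0, 1.0], [50.0, 3.0], [70.0, 5.0]])
    scaler = StandardScaler().fit(toy_X)

    scaler_path = os.path.join(tmp_dir, "scaler.pkl")
    array_path = os.path.join(tmp_dir, "X_train.npy")
    joblib.dump(scaler, scaler_path)
    np.save(array_path, scaler.transform(toy_X))

    # Reload both artifacts, as a later notebook or service would
    loaded_scaler = joblib.load(scaler_path)
    loaded_X = np.load(array_path)

# New rows must have the same columns, in the same order, as during fitting
new_row = np.array([[50.0, 3.0]])
new_row_scaled = loaded_scaler.transform(new_row)

print(loaded_X.shape)
print(new_row_scaled)
```

The `feature_names.txt` file written above serves the column-order check: a loaded scaler has no way to detect that columns were reordered if it receives a plain NumPy array.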
