# Predicting House Prices: Pushing the Limit of Linear Regression
In this project, we aim to predict house prices using a linear regression model. Through careful feature selection, feature engineering, and regularization, we attempt to maximize the performance of the linear model, approaching the accuracy of non-linear models such as XGBoost.


In [2]:
import pandas as pd
import numpy as np

# Load dataset
data = pd.read_csv('house_dataset_processed.csv')  


In [None]:
y = data['price']
X = data.drop(columns=['price'])

# Log-transform the target
y_log = np.log(y)

# drop NaNs
data_cleaned = pd.concat([X, y_log], axis=1).dropna()

X = data_cleaned[X.columns]
y_log = data_cleaned[y_log.name] 
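Because the models are fit on `log(price)` and scored on the price scale, a log-space error of δ corresponds to an absolute percentage error of |exp(δ) − 1|. The following is a minimal sketch of that relationship; the prices and errors are made-up illustrative values, not rows from the dataset.

```python
import numpy as np

# Hypothetical prices and log-space prediction errors (not from the dataset)
actual_price = np.array([250_000.0, 400_000.0, 1_200_000.0])
log_error = np.array([0.10, -0.20, 0.05])

# Predictions live in log space and are mapped back with exp
pred_log = np.log(actual_price) + log_error
pred_price = np.exp(pred_log)

# Absolute percentage error measured on the price scale
ape = np.abs(pred_price - actual_price) / actual_price

# The same quantity computed directly from the log-space error
ape_from_log = np.abs(np.exp(log_error) - 1.0)

print(ape.round(4))
print(np.allclose(ape, ape_from_log))
```

For small errors the two scales nearly coincide, so a log-space error of 0.05 is roughly a 5% price error.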

In [22]:
# Basic Linear Regression (Baseline)

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_percentage_error

kf = KFold(n_splits=5, shuffle=True, random_state=42)

model = LinearRegression()
mape_list = []

for train_idx, test_idx in kf.split(X):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y_log.iloc[train_idx], y_log.iloc[test_idx]
    
    model.fit(X_train, y_train)
    y_pred_log = model.predict(X_test)
    y_pred = np.exp(y_pred_log)
    
    mape_list.append(mean_absolute_percentage_error(np.exp(y_test), y_pred))

print(f"Baseline Linear Regression MAPE: {np.mean(mape_list):.4f}")

Baseline Linear Regression MAPE: 0.2665
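The fold loop above is repeated for every model in this notebook. One possible refactoring, shown here only as a sketch, wraps it in a helper so each estimator is scored the same way. The name `cv_mape` and the synthetic data are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import KFold


def cv_mape(model, X, y_log, cv):
    """Mean MAPE on the price scale across the folds of `cv`."""
    scores = []
    for train_idx, test_idx in cv.split(X):
        fold_model = clone(model)  # fresh, unfitted copy for each fold
        fold_model.fit(X.iloc[train_idx], y_log.iloc[train_idx])
        # Back-transform predictions and targets from log space to prices
        y_pred = np.exp(fold_model.predict(X.iloc[test_idx]))
        y_true = np.exp(y_log.iloc[test_idx])
        scores.append(mean_absolute_percentage_error(y_true, y_pred))
    return float(np.mean(scores))


# Synthetic stand-in data (assumption: not the house dataset)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
weights = np.array([0.3, -0.2, 0.1])
noise = rng.normal(scale=0.05, size=200)
y_log_demo = pd.Series(12.0 + X_demo.to_numpy() @ weights + noise)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)
demo_mape = cv_mape(LinearRegression(), X_demo, y_log_demo, kf_demo)
print(f"Demo MAPE: {demo_mape:.4f}")
```

With the notebook's variables, the equivalent call would be `cv_mape(LinearRegression(), X, y_log, kf)`.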


In [23]:
# Ridge and Lasso on the original features

from sklearn.linear_model import Ridge, Lasso

ridge_model = Ridge(alpha=1.0)
lasso_model = Lasso(alpha=0.01)

models = {'Ridge Regression': ridge_model, 'Lasso Regression': lasso_model}

for model_name, model in models.items():
    mape_list = []
    for train_idx, test_idx in kf.split(X):
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_test = y_log.iloc[train_idx], y_log.iloc[test_idx]

        model.fit(X_train, y_train)
        y_pred_log = model.predict(X_test)
        y_pred = np.exp(y_pred_log)

        mape_list.append(mean_absolute_percentage_error(np.exp(y_test), y_pred))

    print(f"{model_name} MAPE: {np.mean(mape_list):.4f}")

Ridge Regression MAPE: 0.2666




Lasso Regression MAPE: 0.2749




# Ridge(alpha=1.0) applies L2 regularization with moderate strength,
# shrinking the coefficients towards zero but not setting them exactly to zero.
# This helps in preventing overfitting while retaining all features.

# Lasso(alpha=0.01) applies L1 regularization with a small penalty,
# encouraging sparsity by setting some coefficients exactly to zero,
# effectively performing feature selection along with regression.
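The difference between the two penalties can be seen on a small synthetic problem. This is a sketch under stated assumptions: the data are random, only the first two features drive the target, and the Lasso penalty is `alpha=0.1` (larger than the notebook's 0.01) so that the sparsity effect is easy to observe.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (assumption): 10 features, only the first two matter
rng = np.random.default_rng(0)
n_samples, n_features = 500, 10
X_demo = rng.normal(size=(n_samples, n_features))
y_demo = (
    3.0 * X_demo[:, 0]
    - 2.0 * X_demo[:, 1]
    + rng.normal(scale=0.1, size=n_samples)
)

ridge = Ridge(alpha=1.0).fit(X_demo, y_demo)
lasso = Lasso(alpha=0.1).fit(X_demo, y_demo)

# Count coefficients that are exactly zero under each penalty
n_zero_ridge = int(np.sum(ridge.coef_ == 0.0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0.0))

print(f"Ridge: {n_zero_ridge} of {n_features} coefficients are exactly zero")
print(f"Lasso: {n_zero_lasso} of {n_features} coefficients are exactly zero")
```

Ridge keeps every feature with a small non-zero weight, while Lasso drops the irrelevant ones entirely.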


As shown above, the MAPE of plain linear regression, Ridge regression, and Lasso regression is roughly 0.27 (0.2665, 0.2666, and 0.2749, respectively).
Regularization alone therefore brings no meaningful improvement under 5-fold cross-validation.
Next, we use a Random Forest to select features based on its feature importances.

In [12]:
# feature selection by random forest
from sklearn.ensemble import RandomForestRegressor

# Train random forest
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X, y_log)

# Get feature importances
feature_importance_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': rf.feature_importances_
}).sort_values(by='Importance', ascending=False)

# Select important features
important_features = feature_importance_df.query('Importance > 0.02')['Feature'].tolist()
X_selected = X[important_features].copy()  # explicit copy so columns can be added later without a SettingWithCopyWarning


In [14]:
X_selected.columns

Index(['total_space', 'sqft_living', 'sqft_city', 'lot_zip', 'sqft_zip',
       'lot_city', 'sqft_lot', 'sqft_above', 'yr_built'],
      dtype='object')

Based on the selected features, let us build a multiple linear regression model and have a look at the performance.

In [24]:
model = LinearRegression()
mape_list = []

for train_idx, test_idx in kf.split(X_selected):
    X_train, X_test = X_selected.iloc[train_idx], X_selected.iloc[test_idx]
    y_train, y_test = y_log.iloc[train_idx], y_log.iloc[test_idx]

    model.fit(X_train, y_train)
    y_pred_log = model.predict(X_test)
    y_pred = np.exp(y_pred_log)

    mape_list.append(mean_absolute_percentage_error(np.exp(y_test), y_pred))

print(f"Linear Regression with Selected Features MAPE: {np.mean(mape_list):.4f}")

Linear Regression with Selected Features MAPE: 0.2897


The MAPE is even larger (0.2897), so we next apply feature engineering with cross and polynomial features to capture
potential non-linear relationships in the regression model.

In [25]:
# feature engineering 
# Cross Features (Interaction Terms)
X_selected['sqft_living_total_space'] = X_selected['sqft_living'] * X_selected['total_space']
X_selected['sqft_city_lot_zip'] = X_selected['sqft_city'] * X_selected['lot_zip']
X_selected['sqft_zip_lot_city'] = X_selected['sqft_zip'] * X_selected['lot_city']
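An alternative way to add interaction columns is `DataFrame.assign`, which returns a new frame instead of writing into an existing one. The sketch below uses two of the notebook's column names with hypothetical values.

```python
import pandas as pd

# Hypothetical values; only the column names come from the notebook
demo = pd.DataFrame({
    "sqft_living": [1000.0, 2000.0],
    "total_space": [1500.0, 2600.0],
})

# assign returns a new DataFrame and leaves `demo` untouched
demo_fe = demo.assign(
    sqft_living_total_space=demo["sqft_living"] * demo["total_space"],
)

print(demo_fe.columns.tolist())
```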




In [26]:
# Standardization
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_selected)


In [27]:
# Polynomial Features
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_scaled)


In [29]:
# model building for ridge regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_percentage_error

kf = KFold(n_splits=5, shuffle=True, random_state=42)

mape_list = []

for train_idx, test_idx in kf.split(X_poly):
    X_train, X_test = X_poly[train_idx], X_poly[test_idx]
    y_train, y_test = y_log.iloc[train_idx], y_log.iloc[test_idx]
    
    model = Ridge(alpha=1.0)
    model.fit(X_train, y_train)
    
    y_pred_log = model.predict(X_test)
    y_pred = np.exp(y_pred_log)  # convert back to price scale
    
    mape_list.append(mean_absolute_percentage_error(np.exp(y_test), y_pred))

print(f"Ridge Regression after Feature Engineering: MAPE = {np.mean(mape_list):.4f}")


Ridge Regression after Feature Engineering: MAPE = 0.2289
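In the cells above, `StandardScaler` is fit on all rows before the folds are split, so each test fold contributes to the scaling statistics. One way to keep preprocessing inside each fold is a scikit-learn pipeline. The sketch below runs on synthetic data (an assumption), so its score says nothing about the house dataset, and whether this variant changes the reported MAPE is not tested here.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data with one interaction effect (assumption)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.5, 2.0, size=(300, 3))
y_log_demo = (
    12.0
    + 0.4 * X_demo[:, 0]
    + 0.3 * X_demo[:, 0] * X_demo[:, 1]
    + rng.normal(scale=0.05, size=300)
)

# Scaling, polynomial expansion, and Ridge are refit together on each fold
pipe = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2, include_bias=False),
    Ridge(alpha=1.0),
)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)
fold_mape = []
for train_idx, test_idx in kf_demo.split(X_demo):
    pipe.fit(X_demo[train_idx], y_log_demo[train_idx])
    y_pred = np.exp(pipe.predict(X_demo[test_idx]))
    y_true = np.exp(y_log_demo[test_idx])
    fold_mape.append(mean_absolute_percentage_error(y_true, y_pred))

print(f"Pipeline Ridge demo MAPE: {np.mean(fold_mape):.4f}")
```

The same consideration applies to the Random Forest feature selection, which was also fit on the full dataset.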


In [None]:
# Benchmark model (XGBoost using original features)
import xgboost as xgb

model = xgb.XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
mape_list = []

for train_idx, test_idx in kf.split(X):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y_log.iloc[train_idx], y_log.iloc[test_idx]
    
    model.fit(X_train, y_train)
    y_pred_log = model.predict(X_test)
    y_pred = np.exp(y_pred_log)
    
    mape_list.append(mean_absolute_percentage_error(np.exp(y_test), y_pred))

print(f"XGBoost Regression on Original Features MAPE: {np.mean(mape_list):.4f}")




XGBoost Regression on Original Features MAPE: 0.1808


The XGBoost benchmark model was trained on all available features in the dataset except the target variable (`price`). No feature selection, standardization, or polynomial expansion was applied, so the model has access to the full information content of the original feature space.

# Final Model Comparison

The table below summarizes the performance of different models evaluated through 5-fold cross-validation:

| Model | Feature Set | MAPE (Actual Price) |
|:---|:---|:---|
| Baseline Linear Regression | All original features | ~0.2665 |
| Ridge Regression (Original Features) | All original features | ~0.2666 |
| Lasso Regression (Original Features) | All original features | ~0.2749 |
| Linear Regression (Selected Features) | Random Forest Selected Features | ~0.2897 |
| Advanced Ridge Regression (After Feature Engineering) | Selected Features + Cross Terms + Polynomial Features | ~0.2289 |
| XGBoost Regression (Original Features) | All original features | ~0.1808 |

Overall, combining feature selection with feature engineering (cross terms and polynomial features) lowered the Ridge Regression MAPE to 0.2289, compared with about 0.27 for the baseline linear models. Feature selection alone did not help: linear regression on the selected features scored 0.2897. XGBoost still achieved the best performance (0.1808), but the gap to the best linear model shrank from about 0.086 to about 0.048, which shows how far linear regression can be pushed through careful preprocessing and feature engineering.
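The size of the improvement can be worked out directly from the MAPE values in the table. This short calculation uses only the reported scores; the variable names are chosen here for illustration.

```python
# MAPE values reported in the comparison table
baseline_mape = 0.2665
engineered_ridge_mape = 0.2289
xgboost_mape = 0.1808

# Relative improvement of the engineered Ridge model over the baseline
relative_improvement = (baseline_mape - engineered_ridge_mape) / baseline_mape

# Gap to XGBoost before and after feature engineering
gap_before = baseline_mape - xgboost_mape
gap_after = engineered_ridge_mape - xgboost_mape
gap_reduction = (gap_before - gap_after) / gap_before

print(f"Relative MAPE improvement over baseline: {relative_improvement:.1%}")
print(f"Gap to XGBoost: {gap_before:.4f} -> {gap_after:.4f}")
print(f"Share of the gap closed: {gap_reduction:.1%}")
```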
