
# House Price Prediction: Feature Engineering, Selection & Modeling


---

## Overview
This notebook covers feature engineering, encoding, skewness correction, feature selection, model training, cross-validation, evaluation, and model selection for the House Price Prediction project.

---

## Steps:
1. Load Cleaned Data
2. Feature Engineering
3. Encoding Categorical Variables
4. Skewness Correction
5. Feature Selection
6. Model Training & Evaluation
7. Save Processed Data & Model
8. Summary & Next Steps

---



## 1. Load Cleaned Data 
Let's load the cleaned training data from the previous notebook.

In [50]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

df = pd.read_csv('../data/cleaned_train.csv')
df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,Grvl,Reg,Lvl,AllPub,...,0,Gd,MnPrv,Shed,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,Grvl,Reg,Lvl,AllPub,...,0,Gd,MnPrv,Shed,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,Grvl,IR1,Lvl,AllPub,...,0,Gd,MnPrv,Shed,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,Grvl,IR1,Lvl,AllPub,...,0,Gd,MnPrv,Shed,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,Grvl,IR1,Lvl,AllPub,...,0,Gd,MnPrv,Shed,0,12,2008,WD,Normal,250000


## 2. Feature Engineering 
Let's create new features such as total square footage and house age.

In [51]:
# Total square footage
df['TotalSF'] = df['TotalBsmtSF'] + df['1stFlrSF'] + df['2ndFlrSF']

# House age at time of sale
df['HouseAge'] = df['YrSold'] - df['YearBuilt']

# Years since remodel
df['RemodAge'] = df['YrSold'] - df['YearRemodAdd']

df[['TotalSF', 'HouseAge', 'RemodAge']].head()

Unnamed: 0,TotalSF,HouseAge,RemodAge
0,2566,5,5
1,2524,31,31
2,2706,7,6
3,2473,91,36
4,3343,8,8
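
Both age features are plain differences of years, so a row whose sale year precedes its recorded build or remodel year would produce a negative age. Whether the cleaned data contains such rows is not checked here. The sketch below uses made-up toy values (only the column names come from the notebook) to show one possible way to count those rows and floor the ages at zero.

```python
import pandas as pd

# Toy rows with the notebook's column names; the values are invented.
toy = pd.DataFrame({
    'YrSold': [2008, 2007, 2007],
    'YearBuilt': [2003, 1976, 2007],
    'YearRemodAdd': [2003, 1976, 2008],  # last row: remodel recorded after the sale year
})

toy['HouseAge'] = toy['YrSold'] - toy['YearBuilt']
toy['RemodAge'] = toy['YrSold'] - toy['YearRemodAdd']

# Count rows where either age is negative, then floor both ages at zero.
age_cols = ['HouseAge', 'RemodAge']
n_negative = int((toy[age_cols] < 0).any(axis=1).sum())
toy[age_cols] = toy[age_cols].clip(lower=0)

print(f'rows with a negative age: {n_negative}')
print(toy[age_cols])
```

Clipping is only one option; dropping or inspecting the affected rows may be more appropriate depending on why the years disagree.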


## 3. Encoding Categorical Variables 
Encode categorical variables using one-hot encoding.

In [52]:
# Identify categorical columns
cat_cols = df.select_dtypes(include='object').columns

# One-hot encode categorical variables
df_encoded = pd.get_dummies(df, columns=cat_cols, drop_first=True)
df_encoded.head()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,SaleType_ConLI,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
0,1,60,65.0,8450,7,5,2003,2003,196.0,706,...,False,False,False,False,True,False,False,False,True,False
1,2,20,80.0,9600,6,8,1976,1976,0.0,978,...,False,False,False,False,True,False,False,False,True,False
2,3,60,68.0,11250,7,5,2001,2002,162.0,486,...,False,False,False,False,True,False,False,False,True,False
3,4,70,60.0,9550,7,5,1915,1970,0.0,216,...,False,False,False,False,True,False,False,False,False,False
4,5,60,84.0,14260,8,5,2000,2000,350.0,655,...,False,False,False,False,True,False,False,False,True,False


## 4. Skewness Correction 
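Identify skewed numerical features (absolute skewness above 0.75) and cap extreme values. Note that this cell only clips values above 1e6 in non-negative, non-binary features; it does not transform the distributions, so the displayed rows are unchanged.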

In [53]:
# Format float display
pd.options.display.float_format = '{:.2f}'.format

# Find skewed numeric features
skewed_feats = df_encoded.select_dtypes(include=[np.number]).apply(lambda x: x.skew()).sort_values(ascending=False)
skewness = skewed_feats[abs(skewed_feats) > 0.75]
skewed_cols = skewness.index.tolist()

# Cap skewed, non-negative, non-binary features at 1e6 (SalePrice is left untouched)
for col in skewed_cols:
    if col != 'SalePrice' and df_encoded[col].nunique() > 2 and (df_encoded[col] >= 0).all():
        if df_encoded[col].max() > 1e6:
            df_encoded[col] = df_encoded[col].clip(upper=1e6)

df_encoded.head()


Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,SaleType_ConLI,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
0,1,60,65.0,8450,7,5,2003,2003,196.0,706,...,False,False,False,False,True,False,False,False,True,False
1,2,20,80.0,9600,6,8,1976,1976,0.0,978,...,False,False,False,False,True,False,False,False,True,False
2,3,60,68.0,11250,7,5,2001,2002,162.0,486,...,False,False,False,False,True,False,False,False,True,False
3,4,70,60.0,9550,7,5,1915,1970,0.0,216,...,False,False,False,False,True,False,False,False,False,False
4,5,60,84.0,14260,8,5,2000,2000,350.0,655,...,False,False,False,False,True,False,False,False,True,False
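
The cell above caps extreme values but otherwise leaves the distributions as they are. A common way to reduce right skew in non-negative features is a `log1p` transform. The sketch below is not part of the notebook's pipeline: it uses a synthetic, randomly generated column (the name `LotArea` is borrowed only for illustration) to show the effect on the skewness statistic.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed feature; the values are random, not from the dataset.
rng = np.random.default_rng(0)
demo = pd.DataFrame({'LotArea': rng.lognormal(mean=9.0, sigma=1.0, size=1000)})

skew_before = demo['LotArea'].skew()

# log1p(x) = log(1 + x) is defined at zero, so it suits non-negative features.
demo['LotArea'] = np.log1p(demo['LotArea'])
skew_after = demo['LotArea'].skew()

print(f'skew before: {skew_before:.2f}, after: {skew_after:.2f}')
```

If such a transform were applied to the training data, the same transform would have to be applied to any data passed to the model later.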


## 5. Feature Selection
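Rank features by absolute correlation with the target, `SalePrice`. The ranking is informational: the target itself appears at the top with a correlation of 1.00, and the models in the next section are trained on all columns rather than on this top-20 list.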

In [54]:
# Correlation with target
corr = df_encoded.corr()
top_features = corr['SalePrice'].abs().sort_values(ascending=False).head(20)
print(top_features)

SalePrice          1.00
OverallQual        0.80
TotalSF            0.75
GrLivArea          0.67
GarageCars         0.65
GarageArea         0.63
ExterQual_TA       0.61
TotalBsmtSF        0.58
1stFlrSF           0.57
YearBuilt          0.56
HouseAge           0.56
FullBath           0.55
GarageFinish_Unf   0.55
KitchenQual_TA     0.54
RemodAge           0.54
YearRemodAdd       0.54
ExterQual_Gd       0.54
BsmtQual_TA        0.53
Foundation_PConc   0.52
TotRmsAbvGrd       0.49
Name: SalePrice, dtype: float64
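
The ranking above is printed but not used to subset the data. If a reduced feature set were wanted, one possible approach is to drop the target and the row identifier from the ranking and keep the top `k` columns. The sketch below assumes a hypothetical cutoff `k` and runs on synthetic data with made-up `signal` and `noise` columns, not on `df_encoded`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_encoded: 'signal' drives the target, 'noise' does not.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    'Id': np.arange(1, n + 1),
    'signal': rng.normal(size=n),
    'noise': rng.normal(size=n),
})
demo['SalePrice'] = 180_000 + 50_000 * demo['signal'] + rng.normal(scale=5_000, size=n)

# Drop the target itself and the row identifier before ranking.
corr_with_target = (
    demo.corr()['SalePrice']
    .drop(['SalePrice', 'Id'])
    .abs()
    .sort_values(ascending=False)
)

k = 1  # hypothetical cutoff; the notebook itself trains on all columns
selected = corr_with_target.head(k).index.tolist()
X_selected = demo[selected]
print(selected)
```

Correlation only captures linear, one-feature-at-a-time relationships, so a cutoff chosen this way is a rough filter rather than a definitive selection.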


## 6. Model Training & Evaluation 
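Train several regression models and compare them by 5-fold cross-validated RMSE (lower is better). Note that `X` keeps every column except `SalePrice`, including `Id`.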

In [55]:
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor

# Split into X and y
X = df_encoded.drop('SalePrice', axis=1)
y = df_encoded['SalePrice']

models = {
    'LinearRegression': LinearRegression(),
    'Ridge': Ridge(),
    'Lasso': Lasso(),
    'SVR': SVR(),
    'RandomForest': RandomForestRegressor(random_state=42)
}

results = {}
for name, model in models.items():
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        score = cross_val_score(model, X, y, cv=5, scoring='neg_root_mean_squared_error').mean()
    results[name] = -score
    print(f'{name}: RMSE = {-score:.4f}')

LinearRegression: RMSE = 35646.4331
Ridge: RMSE = 27322.5789
Lasso: RMSE = 34761.9456
SVR: RMSE = 68840.4517
RandomForest: RMSE = 23967.0460
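
SVR has the highest RMSE in this comparison. SVR is sensitive to feature scale, and the features here are unscaled, which may be one contributor; the scale of the target and the default `C` also matter. The sketch below shows one way to add scaling without leaking information across folds, by wrapping the scaler and the model in a pipeline. It runs on synthetic data from `make_regression`, so its RMSE says nothing about the house-price data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression data standing in for X and y.
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
X_demo[:, 0] *= 10_000  # put one feature on a much larger scale than the rest

# The scaler is refit on each training fold inside cross_val_score,
# so the held-out fold never influences the scaling statistics.
scaled_svr = make_pipeline(StandardScaler(), SVR())

scores = cross_val_score(scaled_svr, X_demo, y_demo, cv=5,
                         scoring='neg_root_mean_squared_error')
rmse = -scores.mean()
print(f'Scaled SVR: RMSE = {rmse:.4f}')
```

The same pipeline object could be placed in the `models` dictionary in place of the bare `SVR()`; whether it would improve the score on this dataset is untested here.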


In [56]:
# Select best model
best_model_name = min(results, key=results.get)
print(f'Best Model: {best_model_name}')
best_model = models[best_model_name]
best_model.fit(X, y)

Best Model: RandomForest


In [None]:
# best_model was already fitted on the full data in the previous cell
from sklearn.base import clone
from sklearn.metrics import mean_squared_error, r2_score

# Evaluate on training data (optimistic: the model has seen these rows)
train_preds = best_model.predict(X)
train_rmse = np.sqrt(mean_squared_error(y, train_preds))
train_r2 = r2_score(y, train_preds)
print(f"Train RMSE: {train_rmse:.4f}")
print(f"Train R²: {train_r2:.4f}")

# Train/validation split for a hold-out performance estimate.
# Fit a clone so that best_model stays trained on the full data when it is saved.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
val_model = clone(best_model)
val_model.fit(X_train, y_train)
val_preds = val_model.predict(X_val)
val_rmse = np.sqrt(mean_squared_error(y_val, val_preds))
val_r2 = r2_score(y_val, val_preds)
print(f"Validation RMSE: {val_rmse:.4f}")
print(f"Validation R²: {val_r2:.4f}")

Train RMSE: 8964.0310
Train R²: 0.9823
Validation RMSE: 27774.4197
Validation R²: 0.8219


## 7. Save Processed Data & Model
Save the processed data, the list of training columns, and the best model for use in the next notebook.

In [58]:
import joblib
import os

# Save processed data columns for test set alignment
os.makedirs('../models', exist_ok=True)
joblib.dump(list(X.columns), '../models/model_columns.pkl')

# Save the best model
joblib.dump(best_model, '../models/model_house_price_prediction.pkl')
print('Model and columns saved!')

# Save processed training data for reference
df_encoded.to_csv('../data/processed_train.csv', index=False)

Model and columns saved!
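
The saved column list exists so that new data can be brought into the same column layout the model was trained on. One way to do that is `DataFrame.reindex`. The sketch below is an assumption about how the next notebook might use the list, not a copy of it: `model_columns` is a short stand-in for the contents of `model_columns.pkl`, and the rows are invented.

```python
import pandas as pd

# Stand-in for the list stored in model_columns.pkl.
model_columns = ['LotArea', 'MSZoning_RL', 'MSZoning_RM']

# Toy new data: 'RL' never occurs, and 'FV' was not a training column.
new_raw = pd.DataFrame({'LotArea': [8450, 9600], 'MSZoning': ['RM', 'FV']})

# Encode without drop_first so that reindex alone decides which dummies survive.
new_encoded = pd.get_dummies(new_raw, columns=['MSZoning'])

# reindex adds training columns that are missing here (filled with 0)
# and drops columns the model never saw.
aligned = new_encoded.reindex(columns=model_columns, fill_value=0)
print(aligned)
```

Any feature engineering done before encoding in this notebook (for example `TotalSF`, `HouseAge`, `RemodAge`) would need to be repeated on the new data before this alignment step.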


## 8. Summary & Next Steps 
- Created new features and encoded categorical variables  
- Identified skewed numerical features and clipped extreme values  
- Ranked features by correlation with `SalePrice`  
- Trained and evaluated multiple regression models  
- Selected and saved the best model and processed data  

**Next:** Final pipeline and export in the next notebook.