# House Prices Prediction

## Week 7 — Data Understanding & Feature Engineering
### Part 1: Exploratory Data Analysis & Baseline
### Part 2: Feature Engineering & Updated Baseline


In [19]:
import os
import pandas as pd

DATA_DIR = "/kaggle/input/house-prices-advanced-regression-techniques" 

train = pd.read_csv(os.path.join(DATA_DIR, "train.csv"))
test  = pd.read_csv(os.path.join(DATA_DIR, "test.csv"))

print("train:", train.shape, "test:", test.shape)
print("target exists:", "SalePrice" in train.columns, "SalePrice in test:", "SalePrice" in test.columns)
print("head cols:", train.columns[:10].tolist())
train.head()


train: (1460, 81) test: (1459, 80)
target exists: True SalePrice in test: False
head cols: ['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities']


Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [20]:
import numpy as np
import os

OUT_EDA = "outputs/eda"
os.makedirs(OUT_EDA, exist_ok=True)

target = "SalePrice"
y = train[target].copy()
X = train.drop(columns=[target])

# 1) missing rate
missing = (train.isna().mean()
           .sort_values(ascending=False)
           .reset_index())
missing.columns = ["feature", "missing_rate"]
missing.to_csv(os.path.join(OUT_EDA, "missing_rate.csv"), index=False)

# 2) feature type summary
num_cols = X.select_dtypes(include=[np.number]).columns.tolist()
cat_cols = [c for c in X.columns if c not in num_cols]
pd.DataFrame({
    "numeric_features": [len(num_cols)],
    "categorical_features": [len(cat_cols)],
    "total_features": [X.shape[1]]
}).to_csv(os.path.join(OUT_EDA, "feature_type_summary.csv"), index=False)

# 3) target summary
pd.DataFrame({
    "SalePrice_mean":[y.mean()],
    "SalePrice_std":[y.std()],
    "SalePrice_skew":[y.skew()],
    "SalePrice_kurt":[y.kurt()]
}).to_csv(os.path.join(OUT_EDA, "target_summary.csv"), index=False)

# 4) top numeric correlations
corr = train[num_cols + [target]].corr()[target].sort_values(ascending=False)
corr.reset_index().rename(columns={"index":"feature", target:"corr_with_target"}) \
    .to_csv(os.path.join(OUT_EDA, "top_numeric_correlations.csv"), index=False)

print("Saved EDA files to:", OUT_EDA)
print(os.listdir(OUT_EDA)[:20])


Saved EDA files to: outputs/eda
['feature_type_summary.csv', 'missing_rate.csv', 'top_numeric_correlations.csv', 'target_summary.csv']


In [21]:
import os
import json
import numpy as np
import pandas as pd

from sklearn.model_selection import KFold, cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, make_scorer

# output directory
OUT_MODELS = "outputs/models"
os.makedirs(OUT_MODELS, exist_ok=True)

# target / features
target = "SalePrice"
y = np.log1p(train[target])        # log1p target, so RMSE on y matches the log-scale evaluation metric
X = train.drop(columns=[target])

# feature split
num_cols = X.select_dtypes(include=[np.number]).columns.tolist()
cat_cols = [c for c in X.columns if c not in num_cols]

# preprocess
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, num_cols),
        ("cat", categorical_transformer, cat_cols),
    ],
    remainder="drop"
)

# model
model = Ridge(alpha=10.0, random_state=42)

pipe = Pipeline(steps=[
    ("preprocess", preprocess),
    ("model", model),
])

# RMSE on log-space
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

rmse_scorer = make_scorer(rmse, greater_is_better=False)

# CV
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring=rmse_scorer)

result = {
    "model": "Ridge(alpha=10.0)",
    "target_transform": "log1p(SalePrice)",
    "cv": "KFold(n_splits=5, shuffle=True, random_state=42)",
    "metric": "RMSE on log-space (negative scorer)",
    "rmse_mean": float((-scores).mean()),
    "rmse_std": float((-scores).std()),
    "rmse_each_fold": [float(s) for s in (-scores)]
}

with open(os.path.join(OUT_MODELS, "baseline_ridge_cv.json"), "w") as f:
    json.dump(result, f, indent=2)

print("Baseline CV RMSE (log-space):", result["rmse_mean"], "+/-", result["rmse_std"])
print("Saved:", os.path.join(OUT_MODELS, "baseline_ridge_cv.json"))


Baseline CV RMSE (log-space): 0.19657044672518134 +/- 0.045085585064443365
Saved: outputs/models/baseline_ridge_cv.json


### Interpretation: Data Characteristics and Baseline Design


Exploratory analysis showed that the target variable (SalePrice) is right-skewed,
which motivated a log-scale transformation to align with the evaluation metric and
reduce heteroscedasticity. The dataset contains a large number of categorical features
and non-random missing values, indicating that careful preprocessing and regularization
would be necessary for building stable linear models.

Based on these characteristics, a regularized linear regression model was selected
as a baseline to establish a reproducible performance reference before introducing
additional feature transformations.
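
To make the effect of the log transform concrete, here is a minimal sketch on synthetic
data. The lognormal "prices" and the noisy predictions are made up for illustration and
are not the competition data. It shows that `log1p` removes most of the right skew and
that RMSE computed on `log1p` values is the log-scale error used throughout this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (lognormal); NOT the competition data.
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=2000))

skew_raw = prices.skew()
skew_log = np.log1p(prices).skew()
print(f"skew raw: {skew_raw:.2f}, skew after log1p: {skew_log:.2f}")

# Hypothetical predictions: true prices perturbed by about 10% multiplicative noise.
y_true = prices.to_numpy()
y_pred = y_true * rng.lognormal(mean=0.0, sigma=0.1, size=y_true.size)

# RMSE on log1p-transformed values, i.e. the error measured on the log scale.
rmse_log = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))
print(f"RMSE in log space: {rmse_log:.3f}")

# expm1 is the inverse of log1p and maps log-space predictions back to prices.
roundtrip = np.expm1(np.log1p(y_true))
```

Any model trained on `log1p(SalePrice)` needs the `np.expm1` step before its predictions
are reported as prices.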



In [22]:
# Feature Engineering (Week 7 - Part 2)

X_fe = train.drop(columns=["SalePrice"]).copy()

# 1) Total square footage
X_fe["TotalSF"] = (
    X_fe["TotalBsmtSF"].fillna(0)
    + X_fe["1stFlrSF"]
    + X_fe["2ndFlrSF"]
)

# 2) House age
X_fe["HouseAge"] = X_fe["YrSold"] - X_fe["YearBuilt"]

# 3) Is remodeled
X_fe["IsRemodeled"] = (X_fe["YearRemodAdd"] != X_fe["YearBuilt"]).astype(int)

# 4) Total bathrooms
X_fe["TotalBath"] = (
    X_fe["FullBath"]
    + 0.5 * X_fe["HalfBath"]
    + X_fe["BsmtFullBath"]
    + 0.5 * X_fe["BsmtHalfBath"]
)

print(X_fe[["TotalSF", "HouseAge", "IsRemodeled", "TotalBath"]].head())


   TotalSF  HouseAge  IsRemodeled  TotalBath
0     2566         5            0        3.5
1     2524        31            0        2.5
2     2706         7            1        3.5
3     2473        91            1        2.0
4     3343         8            0        3.5
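
The same four transformations would eventually have to be applied to `test` as well.
One way to keep train and test consistent is to wrap them in a function, sketched below.
The function name `add_engineered_features` is hypothetical, and filling the basement
bathroom counts with 0 is an assumption that goes beyond the cell above, which only
fills `TotalBsmtSF`. The demo rows are made-up values, not rows from the dataset.

```python
import numpy as np
import pandas as pd

def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with TotalSF, HouseAge, IsRemodeled and TotalBath added."""
    out = df.copy()
    # Assumption: a missing basement value means "no basement", so treat it as 0.
    bsmt_sf = out["TotalBsmtSF"].fillna(0)
    bsmt_full = out["BsmtFullBath"].fillna(0)
    bsmt_half = out["BsmtHalfBath"].fillna(0)

    out["TotalSF"] = bsmt_sf + out["1stFlrSF"] + out["2ndFlrSF"]
    out["HouseAge"] = out["YrSold"] - out["YearBuilt"]
    out["IsRemodeled"] = (out["YearRemodAdd"] != out["YearBuilt"]).astype(int)
    out["TotalBath"] = (
        out["FullBath"] + 0.5 * out["HalfBath"] + bsmt_full + 0.5 * bsmt_half
    )
    return out

# Made-up rows; the second one has missing basement fields.
demo = pd.DataFrame({
    "TotalBsmtSF": [800.0, np.nan], "1stFlrSF": [900, 1000], "2ndFlrSF": [700, 0],
    "YrSold": [2008, 2010], "YearBuilt": [2000, 1960], "YearRemodAdd": [2000, 1995],
    "FullBath": [2, 1], "HalfBath": [1, 0],
    "BsmtFullBath": [1.0, np.nan], "BsmtHalfBath": [0.0, np.nan],
})
fe_demo = add_engineered_features(demo)
print(fe_demo[["TotalSF", "HouseAge", "IsRemodeled", "TotalBath"]])
```

With such a helper, `X_fe` and an engineered test frame would come from the same code
path, which avoids the two sets drifting apart.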


In [23]:
from sklearn.model_selection import KFold, cross_val_score

y = np.log1p(train["SalePrice"])

num_cols = X_fe.select_dtypes(include=[np.number]).columns.tolist()
cat_cols = [c for c in X_fe.columns if c not in num_cols]

preprocess_fe = ColumnTransformer(
    transformers=[
        ("num", Pipeline([
            ("imputer", SimpleImputer(strategy="median")),
        ]), num_cols),
        ("cat", Pipeline([
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), cat_cols),
    ]
)

pipe_fe = Pipeline(steps=[
    ("preprocess", preprocess_fe),
    ("model", Ridge(alpha=10.0, random_state=42)),
])

cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores_fe = cross_val_score(pipe_fe, X_fe, y, cv=cv, scoring=rmse_scorer)

print("FE Ridge CV RMSE (log):",
      (-scores_fe).mean(), "+/-", (-scores_fe).std())


FE Ridge CV RMSE (log): 0.19477724320511808 +/- 0.04547290816459814


### Interpretation: Feature Engineering Results

The engineered features lowered mean cross-validated RMSE only slightly, from 0.1966
to 0.1948, while the spread across folds stayed essentially unchanged (about 0.045).
Because the gain is much smaller than the fold-to-fold standard deviation, it is best
read as tentative evidence that the added features carry useful signal, with no sign
of increased overfitting.

The magnitude of improvement is modest, which is expected for linear models applied
to high-dimensional tabular data. More complex interactions and nonlinear effects
are intentionally deferred to later modeling stages.
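
Because the baseline and the feature-engineered runs use the same `KFold` settings, their
scores can also be compared fold by fold instead of only through the means. The sketch
below illustrates that paired comparison on synthetic data. The arrays `X_full` and
`X_reduced` are stand-ins, not the housing features, and the built-in scorer string is
used here in place of the notebook's custom `rmse_scorer`.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (NOT the housing data): the target depends on all six columns.
rng = np.random.default_rng(0)
X_full = rng.normal(size=(300, 6))
coef = np.array([3.0, 2.0, 1.0, 1.0, 2.0, 3.0])
y_syn = X_full @ coef + rng.normal(scale=1.0, size=300)
X_reduced = X_full[:, :4]  # stands in for the feature set without the extra columns

# A fixed seed gives both models identical train/validation splits.
cv_demo = KFold(n_splits=5, shuffle=True, random_state=42)
scoring = "neg_root_mean_squared_error"
rmse_reduced = -cross_val_score(Ridge(alpha=10.0), X_reduced, y_syn, cv=cv_demo, scoring=scoring)
rmse_full = -cross_val_score(Ridge(alpha=10.0), X_full, y_syn, cv=cv_demo, scoring=scoring)

# Positive values mean the larger feature set had lower error in that fold.
fold_gain = rmse_reduced - rmse_full
folds_improved = int((fold_gain > 0).sum())
print("per-fold gain:", np.round(fold_gain, 3))
print("folds improved:", folds_improved, "of", fold_gain.size)
```

Applied to `scores` from the baseline cell and `scores_fe` from the cell above, the same
subtraction would show whether the small mean gain holds in every fold or comes from
one or two of them.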


In [24]:
import os, json

OUT_MODELS = "outputs/models"
os.makedirs(OUT_MODELS, exist_ok=True)

fe_result = {
    "model": "Ridge(alpha=10.0)",
    "features": "baseline + engineered features (TotalSF, HouseAge, IsRemodeled, TotalBath)",
    "target_transform": "log1p(SalePrice)",
    "cv": "KFold(n_splits=5, shuffle=True, random_state=42)",
    "metric": "RMSE on log-space",
    "rmse_mean": float((-scores_fe).mean()),
    "rmse_std": float((-scores_fe).std()),
    "rmse_each_fold": [float(s) for s in (-scores_fe)]
}

with open(os.path.join(OUT_MODELS, "fe_ridge_cv.json"), "w") as f:
    json.dump(fe_result, f, indent=2)

print("Saved:", os.path.join(OUT_MODELS, "fe_ridge_cv.json"))


Saved: outputs/models/fe_ridge_cv.json


### Design Considerations

Alternative feature strategies such as neighborhood-level aggregations,
interaction terms, and ordinal encodings were considered but intentionally
deferred to isolate the effect of simple, interpretable feature transformations
before introducing additional complexity.
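
For reference, an ordinal encoding of the kind deferred here could look like the sketch
below. The column names `ExterQual` and `FireplaceQu` and the level codes Po/Fa/TA/Gd/Ex
are assumed from the dataset's data description rather than shown in this notebook, and
mapping missing values to 0 is an additional assumption. The helper `encode_quality` is
hypothetical.

```python
import pandas as pd

# Assumed quality scale, ordered from worst to best.
QUALITY_ORDER = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

def encode_quality(series: pd.Series) -> pd.Series:
    """Map quality labels to integers; missing or unknown labels become 0."""
    return series.map(QUALITY_ORDER).fillna(0).astype(int)

# Made-up example rows.
quality_demo = pd.DataFrame({
    "ExterQual": ["Gd", "TA", "Ex"],
    "FireplaceQu": ["TA", None, "Gd"],
})
encoded = quality_demo.apply(encode_quality)
print(encoded)
```

Compared with one-hot encoding, this keeps each quality column as a single numeric
feature and preserves the ordering of the levels, at the cost of assuming equal spacing
between them.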


## Week 8 — Regularization & Model Selection

In [25]:
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# target
y = np.log1p(train["SalePrice"])

# feature split (FE fixed)
num_cols = X_fe.select_dtypes(include=[np.number]).columns.tolist()
cat_cols = [c for c in X_fe.columns if c not in num_cols]

preprocess = ColumnTransformer(
    transformers=[
        ("num", Pipeline([("imputer", SimpleImputer(strategy="median"))]), num_cols),
        ("cat", Pipeline([
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), cat_cols),
    ]
)

cv = KFold(n_splits=5, shuffle=True, random_state=42)


In [26]:
import pandas as pd

alphas = [0.1, 1.0, 10.0, 50.0, 100.0]

results = []

for model_name, model_cls, kwargs in [
    ("Ridge", Ridge, {}),
    ("Lasso", Lasso, {"max_iter": 5000}),
    ("ElasticNet", ElasticNet, {"l1_ratio": 0.5, "max_iter": 5000}),
]:
    for a in alphas:
        pipe = Pipeline(steps=[
            ("preprocess", preprocess),
            ("model", model_cls(alpha=a, random_state=42, **kwargs)),
        ])
        scores = cross_val_score(pipe, X_fe, y, cv=cv, scoring=rmse_scorer)
        results.append({
            "model": model_name,
            "alpha": a,
            "rmse_mean": float((-scores).mean()),
            "rmse_std": float((-scores).std()),
        })

df_results = pd.DataFrame(results).sort_values(["rmse_mean", "rmse_std"])
df_results


Unnamed: 0,model,alpha,rmse_mean,rmse_std
10,ElasticNet,0.1,0.179957,0.049057
5,Lasso,0.1,0.189868,0.047422
11,ElasticNet,1.0,0.191479,0.050963
3,Ridge,50.0,0.194772,0.045467
4,Ridge,100.0,0.194774,0.045474
0,Ridge,0.1,0.194775,0.045473
2,Ridge,10.0,0.194777,0.045473
1,Ridge,1.0,0.194778,0.045473
6,Lasso,1.0,0.198569,0.052737
12,ElasticNet,10.0,0.23365,0.050588


In [27]:
import os, json

OUT = "outputs/models"
os.makedirs(OUT, exist_ok=True)

# 1) comparison table
df_results.to_csv(os.path.join(OUT, "week_8_regularization_cv_table.csv"), index=False)

# 2) extract the best rows (lowest mean RMSE; df_results is already sorted)
best_overall = df_results.iloc[0].to_dict()
best_ridge = df_results[df_results["model"]=="Ridge"].iloc[0].to_dict()
best_lasso = df_results[df_results["model"]=="Lasso"].iloc[0].to_dict()
best_enet  = df_results[df_results["model"]=="ElasticNet"].iloc[0].to_dict()

week8 = {
    "week": "Week 8",
    "task": "Regularization & model selection",
    "metric": "RMSE on log1p(SalePrice)",
    "cv": "5-fold CV, shuffle=True, random_state=42",
    "alphas_tested": sorted([float(a) for a in df_results["alpha"].unique().tolist()]),
    "best_overall": best_overall,
    "best_by_model": {
        "Ridge": best_ridge,
        "Lasso": best_lasso,
        "ElasticNet": best_enet
    },
    
    "selected_model": "Ridge",
    "selection_reason": "Selected for stability/robustness across alpha values in high-dimensional one-hot encoded feature space."
}

with open(os.path.join(OUT, "week_8_regularization_selection.json"), "w") as f:
    json.dump(week8, f, indent=2)

print("Saved:",
      "week_8_regularization_cv_table.csv",
      "week_8_regularization_selection.json")
print("Models dir:", sorted(os.listdir(OUT)))


Saved: week_8_regularization_cv_table.csv week_8_regularization_selection.json
Models dir: ['baseline_ridge_cv.json', 'fe_ridge_cv.json', 'week_8_regularization_cv_table.csv', 'week_8_regularization_selection.json', 'week_9_gbdt_feature_importance_top20.csv', 'week_9_model_family_comparison.csv', 'week_c_gbdt_feature_importance_top20.csv', 'week_c_model_family_comparison.csv']


### Interpretation: Regularization and Model Selection

Regularized linear models were compared under cross-validation to evaluate both
predictive performance and stability. Lasso and ElasticNet achieved lower mean error
at their best setting (ElasticNet 0.180 and Lasso 0.190 at alpha=0.1, against about
0.195 for Ridge), but their performance was more sensitive to hyperparameter choice:
ElasticNet degraded to 0.234 at alpha=10.

Ridge regression was nearly flat across the tested range, with mean RMSE between
0.19477 and 0.19478 for alpha from 0.1 to 100, indicating greater robustness for this
high-dimensional, one-hot encoded feature space. Based on this tradeoff, Ridge was
selected as the final linear model even though it did not have the lowest mean error.
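
The sensitivity argument can be quantified as the gap between each model's best and
worst mean RMSE over the tested alphas. The sketch below does this for the ten rows
displayed in the results table above. Rows that the display truncated are not included,
so the spreads for Lasso and ElasticNet are lower bounds.

```python
import pandas as pd

# The ten rows shown in the Week 8 results table (model, alpha, rmse_mean).
rows = [
    ("ElasticNet", 0.1, 0.179957), ("Lasso", 0.1, 0.189868),
    ("ElasticNet", 1.0, 0.191479), ("Ridge", 50.0, 0.194772),
    ("Ridge", 100.0, 0.194774), ("Ridge", 0.1, 0.194775),
    ("Ridge", 10.0, 0.194777), ("Ridge", 1.0, 0.194778),
    ("Lasso", 1.0, 0.198569), ("ElasticNet", 10.0, 0.233650),
]
shown = pd.DataFrame(rows, columns=["model", "alpha", "rmse_mean"])

# Best and worst mean RMSE per model, and the spread between them.
sensitivity = shown.groupby("model")["rmse_mean"].agg(best="min", worst="max")
sensitivity["spread"] = sensitivity["worst"] - sensitivity["best"]
print(sensitivity.sort_values("spread"))
```

Running the same aggregation on the full `df_results` would cover all fifteen
model and alpha combinations.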


## Week 9 — Tree-based Models & Model Evaluation

In [28]:
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, make_scorer

# target
y = np.log1p(train["SalePrice"])

def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

rmse_scorer = make_scorer(rmse, greater_is_better=False)

# feature split (Week 7 FE fixed)
num_cols = X_fe.select_dtypes(include=[np.number]).columns.tolist()
cat_cols = [c for c in X_fe.columns if c not in num_cols]

preprocess = ColumnTransformer(
    transformers=[
        ("num", Pipeline([("imputer", SimpleImputer(strategy="median"))]), num_cols),
        ("cat", Pipeline([
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), cat_cols),
    ],
    remainder="drop"
)

cv = KFold(n_splits=5, shuffle=True, random_state=42)


In [29]:
models = [
    ("RandomForest", RandomForestRegressor(
        n_estimators=300,
        max_depth=None,
        random_state=42,
        n_jobs=-1
    )),
    ("GradientBoosting", GradientBoostingRegressor(
        n_estimators=300,
        learning_rate=0.05,
        max_depth=3,
        random_state=42
    ))
]

rows = []

for name, model in models:
    pipe = Pipeline([
        ("preprocess", preprocess),
        ("model", model),
    ])
    scores = cross_val_score(pipe, X_fe, y, cv=cv, scoring=rmse_scorer)
    rows.append({
        "model": name,
        "rmse_mean": float((-scores).mean()),
        "rmse_std": float((-scores).std()),
    })

df_tree = pd.DataFrame(rows).sort_values("rmse_mean")
df_tree


Unnamed: 0,model,rmse_mean,rmse_std
1,GradientBoosting,0.131918,0.018892
0,RandomForest,0.142959,0.018119


In [30]:
import os, json
import pandas as pd

OUT = "outputs/models"
os.makedirs(OUT, exist_ok=True)

# Read Week 7 baseline from the file
with open(os.path.join(OUT, "baseline_ridge_cv.json"), "r") as f:
    base = json.load(f)

compare = pd.DataFrame([
    {"model_family":"Linear", "model":"Ridge (baseline)", "rmse_mean":base["rmse_mean"], "rmse_std":base["rmse_std"]},
])

tree_block = df_tree.copy()
tree_block["model_family"] = "Tree"

compare = pd.concat(
    [compare, tree_block[["model_family","model","rmse_mean","rmse_std"]]],
    ignore_index=True
)

compare.to_csv(os.path.join(OUT, "week_9_model_family_comparison.csv"), index=False)

print("Saved: week_9_model_family_comparison.csv")
compare


Saved: week_9_model_family_comparison.csv


Unnamed: 0,model_family,model,rmse_mean,rmse_std
0,Linear,Ridge (baseline),0.19657,0.045086
1,Tree,GradientBoosting,0.131918,0.018892
2,Tree,RandomForest,0.142959,0.018119


### Interpretation: Linear vs Tree-based Models

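Tree-based models clearly outperformed the linear baseline: Gradient Boosting reached
a mean cross-validated RMSE of 0.132 and Random Forest 0.143, against 0.197 for the
Week 7 Ridge baseline, with the fold-to-fold standard deviation falling from about
0.045 to about 0.019. The baseline row uses the raw features while the tree models use
the engineered feature set, but the feature-engineered Ridge (0.195) accounts for very
little of the gap. The result is consistent with tree ensembles capturing nonlinear
relationships and feature interactions present in the housing data, a reduction in
bias relative to linear assumptions.

This comparison highlights the tradeoff between interpretability and predictive power,
and motivates the use of tree-based models for final external evaluation.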
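
Since the target is `log1p(SalePrice)`, a log-space RMSE of r corresponds roughly to a
typical multiplicative error factor of exp(r) on the price itself. The sketch below
applies that rule of thumb to the three values from the comparison table. It is an
approximate reading aid rather than a metric computed on the original price scale, and
it treats `log1p` as `log`, which is a negligible difference at house-price magnitudes.

```python
import numpy as np

# Mean CV RMSE in log space, copied from the comparison table above.
cv_rmse = {
    "Ridge (baseline)": 0.19657,
    "GradientBoosting": 0.131918,
    "RandomForest": 0.142959,
}

# expm1(r) = exp(r) - 1: the approximate relative error implied by log-space RMSE r.
relative_error = {name: float(np.expm1(r)) for name, r in cv_rmse.items()}

for name, rel in relative_error.items():
    print(f"{name}: roughly {100 * rel:.1f}% typical relative error")
```

Read this way, moving from the Ridge baseline to Gradient Boosting takes the typical
relative error from a little over a fifth of the price to about a seventh.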


In [31]:
import numpy as np
import pandas as pd
import os

from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingRegressor

OUT = "outputs/models"
os.makedirs(OUT, exist_ok=True)

# 1) GBDT pipeline
gbdt = GradientBoostingRegressor(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=3,
    random_state=42
)

gbdt_pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", gbdt),
])

# 2) Fit to full train
gbdt_pipe.fit(X_fe, y)

# 3) feature names, including the one-hot expanded categorical columns
pre = gbdt_pipe.named_steps["preprocess"]

num_features = num_cols
cat_ohe = pre.named_transformers_["cat"].named_steps["onehot"]
cat_features = cat_ohe.get_feature_names_out(cat_cols).tolist()

feature_names = list(num_features) + list(cat_features)

# 4) extract importances and sort them in descending order
importances = gbdt_pipe.named_steps["model"].feature_importances_
fi = pd.DataFrame({
    "feature": feature_names,
    "importance": importances
}).sort_values("importance", ascending=False)

# 5) save the top 20
top20 = fi.head(20).reset_index(drop=True)
top20.to_csv(os.path.join(OUT, "week_9_gbdt_feature_importance_top20.csv"), index=False)

top20


Unnamed: 0,feature,importance
0,TotalSF,0.357268
1,OverallQual,0.354751
2,TotalBath,0.05427
3,YearRemodAdd,0.023059
4,GarageCars,0.017405
5,OverallCond,0.016085
6,LotArea,0.01387
7,GarageFinish_Unf,0.013323
8,GarageArea,0.01165
9,Fireplaces,0.011506


### Interpretation: Feature Importance and Model Behavior

Feature importance analysis from the Gradient Boosting model highlights that overall
house size and quality are the dominant drivers of price prediction: TotalSF (0.357)
and OverallQual (0.355) together account for roughly 71% of total importance. The
engineered features TotalSF and TotalBath (0.054) rank first and third, which suggests
that aggregating raw area and bathroom counts provides meaningful signal beyond
individual floor-level measurements.

Quality and condition attributes such as OverallQual and OverallCond, together with
the remodeling year (YearRemodAdd), also contribute, which may reflect non-linear
effects that are not easily captured by linear models. The presence of both raw
features (e.g., GarageCars, LotArea) and engineered features among the top-ranked
variables suggests that the tree-based model effectively combines original and derived
representations.

Overall, this analysis supports the earlier performance comparison, demonstrating
that tree-based methods better capture complex interactions and nonlinearities in
housing data. These findings justify the use of Gradient Boosting for final external
evaluation, while the linear baseline remains valuable for interpretability and
model validation.
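
One-hot encoding splits each categorical column into several features, so a column such
as GarageFinish appears in the ranking only through individual levels like
GarageFinish_Unf. The sketch below shows one way to sum importances back to the source
column. The helper `aggregate_importance` is hypothetical, and the `GarageFinish_RFn`
value in the demo is a made-up number used only to show the grouping; the other three
values are taken from the table above.

```python
import pandas as pd

def aggregate_importance(fi: pd.DataFrame, cat_cols: list) -> pd.Series:
    """Sum one-hot feature importances back to their source column."""
    def source_column(name: str) -> str:
        # One-hot names look like "<column>_<level>"; take the longest matching column.
        matches = [c for c in cat_cols if name.startswith(c + "_")]
        return max(matches, key=len) if matches else name

    grouped = (
        fi.assign(source=fi["feature"].map(source_column))
          .groupby("source")["importance"]
          .sum()
    )
    return grouped.sort_values(ascending=False)

fi_demo = pd.DataFrame({
    "feature": ["TotalSF", "OverallQual", "GarageFinish_Unf", "GarageFinish_RFn"],
    "importance": [0.357268, 0.354751, 0.013323, 0.005],  # last value is hypothetical
})
by_column = aggregate_importance(fi_demo, cat_cols=["GarageFinish"])
print(by_column)
```

In the notebook, calling the helper with the full `fi` frame and the `cat_cols` list
would give a per-column view that is easier to compare against the numeric features.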
