## Project Summary
**Notes:**
- See ANALYSIS_NOTES.md for deeper, economics-oriented intuition and interpretation.

**Motivation:**
- Use an accessible raw dataset to practice applied prediction in Python, with proper cross-validation (CV) and data cleaning.

**Goal:**
- Predict house prices using structured housing data and compare linear, regularized, and ML tree-based models.

**Methodology:**
- Log-transformed target: Log(SalePrice)
- Data cleaning and encoding
- Model evaluation with 10-fold cross-validation RMSE (on the log scale)
- Models: OLS, Ridge, Lasso, Elastic Net, Random Forest, Gradient Boosting, XGBoost
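
Why score on Log(SalePrice): an error on the log scale is roughly a relative error, so a 10% miss counts the same for a cheap house as for an expensive one. The sketch below uses made-up prices and a hypothetical `rmse` helper (neither comes from the dataset) to contrast the two scales.

```python
import numpy as np

def rmse(a, b):
    """Root mean squared error between two arrays."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Hypothetical true prices and predictions that are each 10% too high
y_true = np.array([100_000.0, 250_000.0, 500_000.0])
y_pred = y_true * 1.10

rmse_dollars = rmse(y_true, y_pred)               # dominated by the expensive house
rmse_log = rmse(np.log(y_true), np.log(y_pred))   # identical for every house: log(1.10)

print(rmse_dollars, rmse_log)
```

On the log scale every house contributes the same error, log(1.10), regardless of its price level.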

**Key Findings:**
- XGBoost achieves the lowest and most stable CV RMSE (~0.118)

In [1]:
# Uncomment the next line to install the required packages
# %pip install numpy pandas matplotlib seaborn scikit-learn xgboost

In [2]:
import numpy as np
import pandas as pd

train_file_path = "../input/house-prices-advanced-regression-techniques/train.csv"
train = pd.read_csv(train_file_path)
test_file_path = "../input/house-prices-advanced-regression-techniques/test.csv"
test = pd.read_csv(test_file_path)


## Data Load and Inspect

In [3]:
series = train.isna().sum()
missing_count = series.to_frame(name="n_missing")

# percentage missing
n_rows = train.shape[0]

missing_count["pct_missing"] = missing_count["n_missing"]/ n_rows * 100

missing_count = missing_count[missing_count["n_missing"] > 0]

# sort descending
missing_count = missing_count.sort_values(
    by="pct_missing",
    ascending=False
)

missing_count = missing_count.reset_index().rename(
    columns={"index": "variable"}
)
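
The missing-value table above can also be wrapped in a small reusable function, so the same summary can be produced for both the train and test frames. This is a sketch on a toy frame; the `missing_summary` name and the toy values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    n_missing = df.isna().sum()
    out = pd.DataFrame({
        "variable": n_missing.index,
        "n_missing": n_missing.values,
        "pct_missing": n_missing.values / len(df) * 100,
    })
    out = out[out["n_missing"] > 0]
    return out.sort_values("pct_missing", ascending=False).reset_index(drop=True)

# Toy frame (hypothetical values) standing in for train/test
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    "LotArea": [8450, 9600, 11250, 9550],
})
summary = missing_summary(toy)
print(summary)
```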

**Handling missing values: "None" category vs. 0 vs. truly missing**

In [4]:
none_cols = [
    "PoolQC", "MiscFeature", "Alley", "Fence", "FireplaceQu",
    "GarageType", "GarageFinish", "GarageQual", "GarageCond",
    "BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinType2", "MasVnrType"
]

train[none_cols] = train[none_cols].fillna("None")
test[none_cols] = test[none_cols].fillna("None")

zero_cols = ["MasVnrArea", "GarageYrBlt"]

train[zero_cols] = train[zero_cols].fillna(0)
test[zero_cols] = test[zero_cols].fillna(0)

train["LotFrontage"] = train.groupby("Neighborhood")["LotFrontage"]\
                            .transform(lambda x: x.fillna(x.median()))

test["LotFrontage"] = test.groupby("Neighborhood")["LotFrontage"]\
                          .transform(lambda x: x.fillna(x.median()))

train["Electrical"] = train["Electrical"].fillna(
    train["Electrical"].mode()[0]
)
test["Electrical"] = test["Electrical"].fillna(
    train["Electrical"].mode()[0]
)
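
The cell above fills `LotFrontage` in the test set with medians computed on the test set itself. One alternative, shown here only as a sketch and not what the notebook does, is to learn the neighborhood medians on train and reuse them on test, with a global-median fallback for neighborhoods that have no observed value. The helper names and toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def fit_group_medians(df, group_col, value_col):
    """Learn per-group medians plus a global fallback from one frame."""
    return df.groupby(group_col)[value_col].median(), df[value_col].median()

def apply_group_medians(df, group_col, value_col, medians, fallback):
    """Fill NaNs in value_col with learned medians; unseen groups get the fallback."""
    fill = df[group_col].map(medians).fillna(fallback)
    return df[value_col].fillna(fill)

toy_train = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B"],
    "LotFrontage": [60.0, 80.0, 50.0, np.nan],
})
toy_test = pd.DataFrame({
    "Neighborhood": ["A", "B", "C"],   # "C" never appears in train
    "LotFrontage": [np.nan, np.nan, np.nan],
})

medians, fallback = fit_group_medians(toy_train, "Neighborhood", "LotFrontage")
toy_test["LotFrontage"] = apply_group_medians(
    toy_test, "Neighborhood", "LotFrontage", medians, fallback
)
print(toy_test)
```

This mirrors how `Electrical` is already handled above, where the test set is filled with the train mode.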

In [5]:
test_missing = test.isna().sum()
test_missing = test_missing[test_missing > 0].sort_values(ascending=False)

cat_cols = [
    "MSZoning", "Utilities", "Functional",
    "Exterior1st", "Exterior2nd",
    "KitchenQual", "SaleType"
]

for col in cat_cols:
    test[col] = test[col].fillna(train[col].mode()[0])

num_zero_cols = [
    "BsmtFullBath", "BsmtHalfBath",
    "BsmtFinSF1", "BsmtFinSF2", "BsmtUnfSF",
    "TotalBsmtSF",
    "GarageCars", "GarageArea"
]

for col in num_zero_cols:
    test[col] = test[col].fillna(0)


In [6]:
y = train["SalePrice"].copy()

# Train/test feature split
X_train = train.drop("SalePrice", axis=1)
X_test  = test.copy()
# Concatenate before dummy encoding so train and test get the same columns
X_all = pd.concat([X_train, X_test], axis=0, ignore_index=True)
# Dummy encoding
X_all_encoded = pd.get_dummies(X_all)

# Split back into train/test encoded
X_train_enc = X_all_encoded.iloc[:len(X_train), :]
X_test_enc  = X_all_encoded.iloc[len(X_train):, :]

X_train_enc = X_train_enc.fillna(0)
X_test_enc  = X_test_enc.fillna(0)
y_log = np.log(y)
X = X_train_enc.copy()
print(X_train_enc.shape)
X_test_enc.shape

(1460, 303)


(1459, 303)
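
Both frames end up with 303 columns because they were encoded together. The toy sketch below (hypothetical values) shows what goes wrong when each frame is encoded separately, and a second way to align columns with `reindex`.

```python
import pandas as pd

toy_train = pd.DataFrame({"RoofMatl": ["CompShg", "ClyTile"], "LotArea": [8450, 9600]})
toy_test = pd.DataFrame({"RoofMatl": ["CompShg", "WdShngl"], "LotArea": [11250, 9550]})

# Encoding separately: each frame only gets columns for the categories it contains
sep_train = pd.get_dummies(toy_train)
sep_test = pd.get_dummies(toy_test)
mismatched = sorted(set(sep_train.columns) ^ set(sep_test.columns))
print("Columns present in only one frame:", mismatched)

# Option A (used in the notebook): concatenate, encode once, split back
both = pd.get_dummies(pd.concat([toy_train, toy_test], ignore_index=True))
enc_train, enc_test = both.iloc[:len(toy_train)], both.iloc[len(toy_train):]

# Option B: encode train only, then force test onto the train columns
alt_test = sep_test.reindex(columns=sep_train.columns, fill_value=0)
```

Option B drops categories that appear only in the test set, which a model fit on train could not use anyway.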

## Phase 1: Linear models

## Simple Linear Regression with K-fold CV

In [7]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.model_selection import train_test_split


y_log = np.log(y)
X = X_train_enc.copy()

# 10-fold CV splitter, reused by every model below
kf = KFold(n_splits=10, shuffle=True, random_state=42)

model = LinearRegression()

scores = cross_val_score(
    model,
    X,
    y_log,
    cv=kf,
    scoring="neg_root_mean_squared_error"
)

rmse_scores = -scores
print("OLS 10-Fold RMSE (log scale):")
print("Fold RMSE:", rmse_scores)
print("Mean RMSE:", rmse_scores.mean())
print("Std  RMSE:", rmse_scores.std())

OLS 10-Fold RMSE (log scale):
Fold RMSE: [0.11818827 0.15100814 0.13021239 0.12735436 0.19174033 0.25304027
 0.18153506 0.11448518 0.12137219 0.09236675]
Mean RMSE: 0.1481302916018943
Std  RMSE: 0.045430030119244444
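
The `neg_root_mean_squared_error` scorer returns the negated RMSE, which is why the cell above flips the sign. As a sketch on synthetic data (the `*_toy` names are illustrative, not the housing features), the loop below reproduces by hand what `cross_val_score` computes for each fold.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (hypothetical)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.3, size=200)

kf_toy = KFold(n_splits=10, shuffle=True, random_state=42)

# Manual loop: fit on nine folds, score on the held-out fold
manual_rmse = []
for train_idx, val_idx in kf_toy.split(X_toy):
    model = LinearRegression().fit(X_toy[train_idx], y_toy[train_idx])
    pred = model.predict(X_toy[val_idx])
    manual_rmse.append(np.sqrt(mean_squared_error(y_toy[val_idx], pred)))
manual_rmse = np.array(manual_rmse)

# cross_val_score returns negative RMSE because scikit-learn maximizes scores
auto_rmse = -cross_val_score(
    LinearRegression(), X_toy, y_toy, cv=kf_toy,
    scoring="neg_root_mean_squared_error",
)
print(manual_rmse.mean(), auto_rmse.mean())
```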


**Interpretation: OLS**
- With ~300 mostly one-hot features, OLS is unstable: fold RMSE ranges from 0.092 to 0.253.
- Regularization trades a little bias for lower variance, which should stabilize CV error.

### Next: Ridge regression
Ridge + GridSearchCV: full K-fold CV with alpha tuning.

In [8]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

ridge_pipe = Pipeline([
    ("scaler", StandardScaler(with_mean=False)),  
    ("ridge", Ridge())
])


param_grid = {
    "ridge__alpha": np.logspace(-3, 3, 50)  
}

grid = GridSearchCV(
    ridge_pipe,
    param_grid=param_grid,
    scoring="neg_root_mean_squared_error",
    cv=kf,
    n_jobs=-1
)

grid.fit(X, y_log)

best_alpha = grid.best_params_["ridge__alpha"]
best_rmse  = -grid.best_score_

print("Best alpha:", best_alpha)
print("Best 10-fold RMSE (log scale):", best_rmse)

Best alpha: 429.1934260128778
Best 10-fold RMSE (log scale): 0.13888309472698798


## OLS vs Ridge: coefficient shrinkage inspection
- OLS: unbiased but unstable under multicollinearity
- Ridge: biased but lower variance
- Comparing the two sets of coefficients illustrates the bias–variance trade-off
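
The bullets above can be illustrated with the closed-form ridge solution on two nearly identical features. This is a NumPy sketch under simplifying assumptions (synthetic data, no intercept, no scaling), not the scikit-learn pipeline used in this notebook; `ridge_closed_form` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)    # nearly a copy of x1 (multicollinearity)
X_toy = np.column_stack([x1, x2])
y_toy = x1 + x2 + rng.normal(scale=0.5, size=n)

def ridge_closed_form(X, y, alpha):
    """Solve (X'X + alpha*I) beta = X'y; alpha=0 gives OLS (no intercept)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

beta_ols_toy = ridge_closed_form(X_toy, y_toy, alpha=0.0)
beta_ridge_toy = ridge_closed_form(X_toy, y_toy, alpha=10.0)
print("OLS:  ", beta_ols_toy)
print("Ridge:", beta_ridge_toy)
```

With collinear columns, OLS can typically assign large offsetting coefficients to the two features, while the ridge penalty pulls the coefficient vector toward a smaller norm.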

In [9]:
X = X_train_enc.copy()
y_log = np.log(train["SalePrice"])


ols_pipe = Pipeline([
    ("scaler", StandardScaler(with_mean=False)),
    ("ols", LinearRegression())
])

ols_pipe.fit(X, y_log)


beta_ols = ols_pipe.named_steps["ols"].coef_
features = X.columns


best_model = grid.best_estimator_

beta_ridge = best_model.named_steps["ridge"].coef_
best_alpha = best_model.named_steps["ridge"].alpha

print("Best alpha used:", best_alpha)

coef_df = pd.DataFrame({
    "feature": features,
    "beta_ols": beta_ols,
    "beta_ridge": beta_ridge
})

coef_df["abs_ols"] = coef_df["beta_ols"].abs()
coef_df["abs_ridge"] = coef_df["beta_ridge"].abs()

# shrink ratio: how much magnitude remains after ridge
# add tiny number to avoid divide-by-zero explosions
eps = 1e-12
coef_df["shrink_ratio"] = coef_df["abs_ridge"] / (coef_df["abs_ols"] + eps)


big_in_ols = coef_df[coef_df["abs_ols"] > 0.05].copy()


big_in_ols.sort_values("shrink_ratio", ascending=True).head(25)

Best alpha used: 429.1934260128778


| | feature | beta_ols | beta_ridge | abs_ols | abs_ridge | shrink_ratio |
|---|---|---|---|---|---|---|
| 25 | GarageYrBlt | -0.132255 | 0.002351 | 0.132255 | 0.002351 | 0.017776 |
| 33 | PoolArea | 0.067427 | 0.004827 | 0.067427 | 0.004827 | 0.071592 |
| 6 | YearBuilt | 0.052409 | 0.012404 | 0.052409 | 0.012404 | 0.236678 |
| 16 | GrLivArea | 0.066072 | 0.037909 | 0.066072 | 0.037909 | 0.573754 |
| 126 | RoofMatl_ClyTile | -0.065999 | -0.038929 | 0.065999 | 0.038929 | 0.589841 |
| 4 | OverallQual | 0.056505 | 0.040005 | 0.056505 | 0.040005 | 0.707995 |


## LASSO: Variable selection

In [10]:
from sklearn.linear_model import Lasso

lasso_pipe = Pipeline([
    ("scaler", StandardScaler(with_mean=False)),
    ("lasso", Lasso(max_iter=10000))
])


## α tuning with K-fold CV
param_grid = {
    "lasso__alpha": np.logspace(-4, 1, 50)  
}

lasso_grid = GridSearchCV(
    lasso_pipe,
    param_grid=param_grid,
    scoring="neg_root_mean_squared_error",
    cv=kf,
    n_jobs=-1
)

lasso_grid.fit(X, y_log)

best_alpha_lasso = lasso_grid.best_params_["lasso__alpha"]
best_rmse_lasso  = -lasso_grid.best_score_

print("Best Lasso alpha:", best_alpha_lasso)
print("Best Lasso 10-fold RMSE (log):", best_rmse_lasso)

Best Lasso alpha: 0.005428675439323859
Best Lasso 10-fold RMSE (log): 0.1373247185196936


**Lasso Inspection**

In [11]:
best_lasso = lasso_grid.best_estimator_
beta_lasso = best_lasso.named_steps["lasso"].coef_

n_total = len(beta_lasso)
n_zero  = np.sum(beta_lasso == 0)
n_nonzero = n_total - n_zero


In [12]:
coef_df["beta_lasso"] = beta_lasso
coef_df["abs_lasso"]  = coef_df["beta_lasso"].abs()
ridge_kept_lasso_dropped = coef_df[
    (coef_df["abs_ridge"] > 1e-4) &
    (coef_df["abs_lasso"] == 0)
].sort_values("abs_ridge", ascending=False)

## Coefficients that Ridge kept (|beta| > 1e-4) but Lasso set exactly to zero
ridge_kept_lasso_dropped.head(10)

| | feature | beta_ols | beta_ridge | abs_ols | abs_ridge | shrink_ratio | beta_lasso | abs_lasso |
|---|---|---|---|---|---|---|---|---|
| 13 | 1stFlrSF | 0.04024 | 0.030395 | 0.04024 | 0.030395 | 0.755339 | 0.0 | 0.0 |
| 23 | TotRmsAbvGrd | 0.007472 | 0.021813 | 0.007472 | 0.021813 | 2.919254 | 0.0 | 0.0 |
| 14 | 2ndFlrSF | 0.043516 | 0.018512 | 0.043516 | 0.018512 | 0.425397 | 0.0 | 0.0 |
| 21 | BedroomAbvGr | 0.005266 | 0.009688 | 0.005266 | 0.009688 | 1.839802 | 0.0 | 0.0 |
| 133 | RoofMatl_WdShngl | 0.005693 | 0.008593 | 0.005693 | 0.008593 | 1.509528 | 0.0 | 0.0 |
| 40 | MSZoning_RL | 0.00779 | 0.008534 | 0.00779 | 0.008534 | 1.095474 | 0.0 | 0.0 |
| 188 | BsmtQual_TA | -0.003545 | -0.008485 | 0.003545 | 0.008485 | 2.393382 | -0.0 | 0.0 |
| 8 | MasVnrArea | 0.001484 | 0.006596 | 0.001484 | 0.006596 | 4.444737 | 0.0 | 0.0 |
| 114 | HouseStyle_1Story | -0.004772 | -0.006547 | 0.004772 | 0.006547 | 1.372025 | -0.0 | 0.0 |
| 178 | Foundation_BrkTil | -0.007089 | -0.006451 | 0.007089 | 0.006451 | 0.909897 | -0.0 | 0.0 |
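
Why Lasso produces exact zeros while Ridge only shrinks: in the textbook special case of an orthonormal design (an assumption made purely for illustration; the housing features are far from orthonormal), the lasso solution for the objective ½‖y − Xb‖² + λ‖b‖₁ is a soft-thresholded copy of the OLS coefficients, whereas ridge with penalty (λ/2)‖b‖² rescales every coefficient by 1/(1 + λ). The helper names and coefficient values below are hypothetical.

```python
import numpy as np

def soft_threshold(beta, lam):
    """Lasso under an orthonormal design: move toward 0 by lam, clip at 0."""
    beta = np.asarray(beta, dtype=float)
    return np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)

def ridge_shrink(beta, lam):
    """Ridge under an orthonormal design: scale every coefficient by 1/(1+lam)."""
    return np.asarray(beta, dtype=float) / (1.0 + lam)

beta_ols_toy = np.array([3.0, -0.5, 1.0, 0.2])   # hypothetical OLS coefficients
lasso_toy = soft_threshold(beta_ols_toy, lam=1.0)
ridge_toy = ridge_shrink(beta_ols_toy, lam=1.0)

print("lasso:", lasso_toy)
print("ridge:", ridge_toy)
print("zeroed by lasso:", int(np.sum(lasso_toy == 0)))
```

Any coefficient whose magnitude is at most λ is set to zero by the soft-threshold, while the ridge version stays nonzero for every nonzero input.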


### Elastic Net = Ridge + Lasso

In [13]:
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

enet_pipe = Pipeline([
    ("scaler", StandardScaler(with_mean=False)),
    ("enet", ElasticNet(max_iter=100000, tol=1e-3, warm_start=True))
])

param_grid_enet = {
    "enet__alpha": np.logspace(-3, 2, 30),      # 0.001 ... 100  (more stable than 1e-4..10)
    "enet__l1_ratio": [0.1, 0.3, 0.5, 0.7, 0.9]
}

enet_grid = GridSearchCV(
    enet_pipe,
    param_grid=param_grid_enet,
    scoring="neg_root_mean_squared_error",
    cv=kf,
    n_jobs=-1
)

enet_grid.fit(X, y_log)

best_alpha_enet = enet_grid.best_params_["enet__alpha"]
best_l1_ratio   = enet_grid.best_params_["enet__l1_ratio"]
best_rmse_enet  = -enet_grid.best_score_

print("Best ElasticNet alpha:", best_alpha_enet)
print("Best ElasticNet l1_ratio:", best_l1_ratio)
print("Best ElasticNet 10-fold RMSE (log):", best_rmse_enet)

Best ElasticNet alpha: 0.0529831690628371
Best ElasticNet l1_ratio: 0.1
Best ElasticNet 10-fold RMSE (log): 0.13614026744090554


In [14]:
ols_pipe = Pipeline([
    ("scaler", StandardScaler(with_mean=False)),
    ("ols", LinearRegression())
])

lin_models = [
    ("OLS", ols_pipe, "no regularization"),
    ("Ridge", grid.best_estimator_, f"alpha={grid.best_params_['ridge__alpha']:.3g}"),
    ("Lasso", lasso_grid.best_estimator_, f"alpha={lasso_grid.best_params_['lasso__alpha']:.3g}"),
    ("Elastic Net", enet_grid.best_estimator_,
     f"alpha={enet_grid.best_params_['enet__alpha']:.3g}, l1={enet_grid.best_params_['enet__l1_ratio']}")
]


## Phase 2: Nonlinear Models

In [15]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error


rf = RandomForestRegressor(random_state=42, n_jobs=-1)

param_grid = {
    "n_estimators": [300, 350],
    "max_features": ["sqrt", 0.3],
    "min_samples_leaf": [1, 2]
}

rf_grid = GridSearchCV(
    rf,
    param_grid=param_grid,
    scoring="neg_root_mean_squared_error",
    cv=kf,
    n_jobs=-1
)

rf_grid.fit(X, y_log)

print("Best params:", rf_grid.best_params_)
print("Best CV mean RMSE:", -rf_grid.best_score_)
best_rf = rf_grid.best_estimator_
# Per-fold RMSE of the tuned forest, stored as an ad-hoc attribute on `rf` for the results table
rf.fold = -cross_val_score(best_rf, X, y_log, cv=kf,
                           scoring="neg_root_mean_squared_error", n_jobs=-1)


Best params: {'max_features': 0.3, 'min_samples_leaf': 1, 'n_estimators': 350}
Best CV mean RMSE: 0.13535157016539978


- Best parameters: n_estimators=350, max_features=0.3, min_samples_leaf=1
- CV RMSE ≈ 0.1354 (std ≈ 0.021)


In [16]:
rows = []
for name, mdl, note in lin_models:
    scores = cross_val_score(
        mdl, X, y_log,
        cv=kf,
        scoring="neg_root_mean_squared_error",
        n_jobs=-1
    )
    rmse = -scores
    rows.append({
        "Model": name,
        "CV_RMSE_mean": rmse.mean(),
        "CV_RMSE_std": rmse.std(),
        "Notes": note
    })


rows.append({
    "Model": "Random Forest",
    "CV_RMSE_mean": rf.fold.mean(),
    "CV_RMSE_std": rf.fold.std(),
    # Matches the best parameters reported by rf_grid above
    "Notes": "n_estimators=350, max_features=0.3, min_samples_leaf=1"
})

results = pd.DataFrame(rows).sort_values("CV_RMSE_mean")
results



| | Model | CV_RMSE_mean | CV_RMSE_std | Notes |
|---|---|---|---|---|
| 4 | Random Forest | 0.135352 | 0.020926 | n_estimators=350, max_features=0.3, min_sample... |
| 3 | Elastic Net | 0.13614 | 0.045422 | alpha=0.053, l1=0.1 |
| 2 | Lasso | 0.137325 | 0.04892 | alpha=0.00543 |
| 1 | Ridge | 0.138883 | 0.036051 | alpha=429 |
| 0 | OLS | 0.146358 | 0.043875 | no regularization |


### Boosting
- Boosting builds many small trees sequentially; each new tree is fit to the residuals (the mistakes) of the current ensemble.
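
The residual-fitting idea can be written out by hand in a few lines. This is a simplified sketch for squared-error loss on synthetic data, using scikit-learn's `DecisionTreeRegressor` as the small tree; it is not the exact algorithm inside `GradientBoostingRegressor`, and the variable names are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (hypothetical data)
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(300, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
n_trees = 100

# Start from a constant prediction, then add small trees fit to the residuals
pred = np.full_like(y_toy, y_toy.mean())
train_mse = [np.mean((y_toy - pred) ** 2)]
trees = []
for _ in range(n_trees):
    residuals = y_toy - pred                      # what the current model gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_toy, residuals)
    pred += learning_rate * tree.predict(X_toy)   # take a small step toward the fix
    trees.append(tree)
    train_mse.append(np.mean((y_toy - pred) ** 2))

print("MSE before boosting:", train_mse[0])
print("MSE after boosting: ", train_mse[-1])
```

The learning rate scales down each tree's contribution, which is why a smaller rate usually needs more trees.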

In [17]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

gbr = GradientBoostingRegressor(random_state=42)

param_grid = {
    "learning_rate": [0.1],
    "n_estimators": [1200],
    "max_depth": [2]
}

gbr_grid = GridSearchCV(
    gbr,
    param_grid=param_grid,
    scoring="neg_root_mean_squared_error",
    cv=kf,
    n_jobs=-1
)

gbr_grid.fit(X, y_log)

print("Best params:", gbr_grid.best_params_)
print("Best CV RMSE:", -gbr_grid.best_score_)

Best params: {'learning_rate': 0.1, 'max_depth': 2, 'n_estimators': 1200}
Best CV RMSE: 0.12196243627146534


In [18]:
best_gbr = gbr_grid.best_estimator_
gbr_scores = cross_val_score(
    best_gbr, X, y_log,
    cv=kf,
    scoring="neg_root_mean_squared_error",
    n_jobs=-1
)
gbr_rmse = -gbr_scores  # convert to positive RMSE



In [19]:
rows.append({
    "Model": "Gradient Boosting",
    "CV_RMSE_mean": gbr_rmse.mean(),
    "CV_RMSE_std": gbr_rmse.std(),
    "Notes": (
        f"n_estimators={gbr_grid.best_params_['n_estimators']}, "
        f"lr={gbr_grid.best_params_['learning_rate']}, "
        f"max_depth={gbr_grid.best_params_['max_depth']}"
    )
})


results = pd.DataFrame(rows).sort_values("CV_RMSE_mean")
results


| | Model | CV_RMSE_mean | CV_RMSE_std | Notes |
|---|---|---|---|---|
| 5 | Gradient Boosting | 0.121962 | 0.01989 | n_estimators=1200, lr=0.1, max_depth=2 |
| 4 | Random Forest | 0.135352 | 0.020926 | n_estimators=350, max_features=0.3, min_sample... |
| 3 | Elastic Net | 0.13614 | 0.045422 | alpha=0.053, l1=0.1 |
| 2 | Lasso | 0.137325 | 0.04892 | alpha=0.00543 |
| 1 | Ridge | 0.138883 | 0.036051 | alpha=429 |
| 0 | OLS | 0.146358 | 0.043875 | no regularization |


### XGBoost

#### Hyperparameter tuning
- I used a stepwise, coarse-to-fine search to tune XGBoost with a manageable number of fits.
- Step 1 tuned the learning dynamics (learning_rate, n_estimators); step 2 tuned sampling for generalization (subsample, colsample_bytree); step 3 adjusted regularization and split constraints (reg_lambda, min_child_weight, plus optionally gamma and reg_alpha).
- After selecting the best combination, I refit the final model and report its 10-fold CV RMSE.
- Full stepwise grid-search logs are kept in the analysis notes for reproducibility.
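
The stepwise procedure can be sketched as a loop over small grids, freezing the winners of each stage before moving on. To keep the example light it swaps in scikit-learn's `GradientBoostingRegressor` for XGBoost and uses synthetic data, so the parameter names (`max_features`, `min_samples_leaf`) are stand-ins for `colsample_bytree` and `min_child_weight`, and the grids are illustrative rather than the ones recorded in the analysis notes.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in data (hypothetical); GradientBoostingRegressor replaces XGBoost here
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(150, 6))
y_toy = X_toy[:, 0] * 2 + np.sin(X_toy[:, 1]) + rng.normal(scale=0.2, size=150)
kf_toy = KFold(n_splits=3, shuffle=True, random_state=42)

# Each stage tunes one group of parameters; winners are frozen for the next stage
stages = [
    {"learning_rate": [0.05, 0.1], "n_estimators": [50, 100]},   # learning dynamics
    {"subsample": [0.7, 1.0], "max_features": [0.7, 1.0]},       # row/column sampling
    {"min_samples_leaf": [1, 5]},                                 # split constraints
]

best = {"max_depth": 2, "random_state": 42}
for grid_stage in stages:
    search = GridSearchCV(
        GradientBoostingRegressor(**best),
        param_grid=grid_stage,
        scoring="neg_root_mean_squared_error",
        cv=kf_toy,
    )
    search.fit(X_toy, y_toy)
    best.update(search.best_params_)
    print(list(grid_stage), "->", search.best_params_, round(-search.best_score_, 4))
```

With these grids the stepwise search evaluates 4 + 4 + 2 = 10 candidates instead of the 32 in the full Cartesian grid, at the cost of possibly missing interactions between stages.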


In [20]:
from xgboost import XGBRegressor

final_params = dict(
    objective="reg:squarederror",
    random_state=42,
    n_jobs=-1,
    tree_method="hist",

    max_depth=3,
    learning_rate=0.05,
    n_estimators=1000,

    subsample=0.85,
    colsample_bytree=0.70,

    min_child_weight=1,
    reg_lambda=1,
    gamma=0,
    reg_alpha=0,
)

final_model = XGBRegressor(**final_params)
final_model.fit(X, y_log)

scores = cross_val_score(final_model, X, y_log, cv=kf,
                         scoring="neg_root_mean_squared_error", n_jobs=-1)
xgb_rmse = -scores
print("Final 10-fold mean:", xgb_rmse.mean(), "std:", xgb_rmse.std())


Final 10-fold mean: 0.11754785535786746 std: 0.0192754743575291


In [21]:
rows.append({
    "Model": "XGBoost",
    "CV_RMSE_mean": xgb_rmse.mean(),
    "CV_RMSE_std": xgb_rmse.std(),
    "Notes": "n_estimators=1000, max_depth=3,learning_rate=0.05"
})

results = pd.DataFrame(rows).sort_values("CV_RMSE_mean")
results


| | Model | CV_RMSE_mean | CV_RMSE_std | Notes |
|---|---|---|---|---|
| 6 | XGBoost | 0.117548 | 0.019275 | n_estimators=1000, max_depth=3,learning_rate=0.05 |
| 5 | Gradient Boosting | 0.121962 | 0.01989 | n_estimators=1200, lr=0.1, max_depth=2 |
| 4 | Random Forest | 0.135352 | 0.020926 | n_estimators=350, max_features=0.3, min_sample... |
| 3 | Elastic Net | 0.13614 | 0.045422 | alpha=0.053, l1=0.1 |
| 2 | Lasso | 0.137325 | 0.04892 | alpha=0.00543 |
| 1 | Ridge | 0.138883 | 0.036051 | alpha=429 |
| 0 | OLS | 0.146358 | 0.043875 | no regularization |


In [24]:
xgb_rmse_mean = xgb_rmse.mean()
xgb_rmse_sd = xgb_rmse.std()
xgb_summary = pd.DataFrame({
    "Metric": [
        "Model",
        "CV RMSE (mean)",
        "CV RMSE (std)",
        "max_depth",
        "learning_rate",
        "n_estimators",
        "subsample",
        "colsample_bytree",
        "min_child_weight",
        "reg_lambda",
        "gamma",
        "reg_alpha",
        "tree_method",
    ],
    "Value": [
        "XGBoost",
        f"{xgb_rmse_mean:.6f}",
        f"{xgb_rmse_sd:.6f}",
        final_params["max_depth"],
        final_params["learning_rate"],
        final_params["n_estimators"],
        final_params["subsample"],
        final_params["colsample_bytree"],
        final_params["min_child_weight"],
        final_params["reg_lambda"],
        final_params["gamma"],
        final_params["reg_alpha"],
        final_params["tree_method"],
    ]
})

xgb_summary

| | Metric | Value |
|---|---|---|
| 0 | Model | XGBoost |
| 1 | CV RMSE (mean) | 0.117548 |
| 2 | CV RMSE (std) | 0.019275 |
| 3 | max_depth | 3 |
| 4 | learning_rate | 0.05 |
| 5 | n_estimators | 1000 |
| 6 | subsample | 0.85 |
| 7 | colsample_bytree | 0.7 |
| 8 | min_child_weight | 1 |
| 9 | reg_lambda | 1 |


In [None]:
params = final_params.copy() 


xgb_final = XGBRegressor(**params)
xgb_final.fit(X, y_log)

y_test_log = xgb_final.predict(X_test_enc)
y_test = np.exp(y_test_log)

submission = pd.DataFrame({
    "Id": test["Id"],
    "SalePrice": y_test
})
submission.to_csv("submission_xgb.csv", index=False)
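
Before uploading, a few cheap checks on the submission frame can catch shape or back-transform mistakes. The sketch below runs them on a toy frame; `check_submission` and the toy predictions are hypothetical and not part of the original workflow.

```python
import numpy as np
import pandas as pd

def check_submission(sub: pd.DataFrame, n_expected: int) -> None:
    """Raise AssertionError if the submission frame looks malformed."""
    assert list(sub.columns) == ["Id", "SalePrice"], "unexpected columns"
    assert len(sub) == n_expected, "wrong number of rows"
    assert sub["Id"].is_unique, "duplicate Ids"
    assert sub["SalePrice"].notna().all(), "missing predictions"
    assert (sub["SalePrice"] > 0).all(), "non-positive prices"

# Toy example: hypothetical log-scale predictions mapped back to dollars with exp
toy_log_pred = np.array([11.8, 12.1, 12.4])
toy_sub = pd.DataFrame({"Id": [1, 2, 3], "SalePrice": np.exp(toy_log_pred)})
check_submission(toy_sub, n_expected=3)
print(toy_sub)
```

For the real frame, the call would be along the lines of `check_submission(submission, n_expected=len(test))`.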
