# House Prices - Baseline

This time we keep things basic and do nothing fancy.
The results will serve as a baseline to compare against more advanced feature engineering techniques later.
We also want to train multiple algorithms to see which models fit this house prices dataset best.

## Preparation

In [1]:
from os.path import join
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

# Load dataset
path_dir = join("..", "..", "data")
df = pd.read_csv(join(path_dir, "preprocessed", "preprocessed_train.csv"))
df_test = pd.read_csv(join(path_dir, "preprocessed", "preprocessed_test.csv"))

df.head()

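| Unnamed: 0 | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60 | 3 | 4.189655 | 9.04204 | 1 | 1 | 3 | 3 | 0 | 4 | ... | 0.0 | 3 | 4 | 1 | 0.0 | 2 | 2008 | 8 | 4 | 208500 |
| 1 | 20 | 3 | 4.394449 | 9.169623 | 1 | 1 | 3 | 3 | 0 | 2 | ... | 0.0 | 3 | 4 | 1 | 0.0 | 5 | 2007 | 8 | 4 | 181500 |
| 2 | 60 | 3 | 4.234107 | 9.328212 | 1 | 1 | 0 | 3 | 0 | 4 | ... | 0.0 | 3 | 4 | 1 | 0.0 | 9 | 2008 | 8 | 4 | 223500 |
| 3 | 70 | 3 | 4.110874 | 9.164401 | 1 | 1 | 0 | 3 | 0 | 0 | ... | 0.0 | 3 | 4 | 1 | 0.0 | 2 | 2006 | 8 | 0 | 140000 |
| 4 | 60 | 3 | 4.442651 | 9.565284 | 1 | 1 | 0 | 3 | 0 | 2 | ... | 0.0 | 3 | 4 | 1 | 0.0 | 12 | 2008 | 8 | 4 | 250000 |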


## Modelling

We tried several algorithms:
- Linear Regression
- Ridge Regression
- Lasso
- ElasticNet
- LassoLars
- BayesianRidge
- TweedieRegressor
- SGDRegressor (imported, but commented out in the run below)
- Decision Tree
- Random Forest
- SVM
- KNN
- XGBoost
- LightGBM
- CatBoost

After that, we will choose the 2 models that fit the dataset best and continue improving their accuracy.

In [2]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, LassoLars, BayesianRidge, TweedieRegressor, SGDRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np

# Prepare data
X = df.drop('SalePrice', axis=1)
y = df['SalePrice']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# List of models
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(),
    'Lasso': Lasso(),
    'ElasticNet': ElasticNet(),
    'LassoLars': LassoLars(),
    'BayesianRidge': BayesianRidge(),
    'TweedieRegressor': TweedieRegressor(),
    # 'SGDRegressor': SGDRegressor(),
    'Decision Tree': DecisionTreeRegressor(),
    'Random Forest': RandomForestRegressor(),
    'SVM': SVR(),
    'KNN': KNeighborsRegressor(),
    'XGBoost': XGBRegressor(),
    'LightGBM': LGBMRegressor(),
    'CatBoost': CatBoostRegressor(verbose=0)
}

# Train and evaluate models
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f'R^2 score for {name}: {model.score(X_val, y_val):.4f}')
    y_pred = model.predict(X_val)
    rmse = np.sqrt(mean_squared_error(y_val, y_pred))
    print(f'{name}: RMSE = {rmse:.2f}\n')

R^2 score for Linear Regression: 0.8292
Linear Regression: RMSE = 36193.17

R^2 score for Ridge Regression: 0.8358
Ridge Regression: RMSE = 35492.29

R^2 score for Lasso: 0.8295
Lasso: RMSE = 36168.53

R^2 score for ElasticNet: 0.8455
ElasticNet: RMSE = 34424.68

R^2 score for LassoLars: 0.8293
LassoLars: RMSE = 36182.82

R^2 score for BayesianRidge: 0.8495
BayesianRidge: RMSE = 33973.40

R^2 score for TweedieRegressor: 0.7782
TweedieRegressor: RMSE = 41243.52

R^2 score for Decision Tree: 0.7479
Decision Tree: RMSE = 43974.81

R^2 score for Random Forest: 0.8958
Random Forest: RMSE = 28273.27

R^2 score for SVM: -0.0244
SVM: RMSE = 88642.97

R^2 score for KNN: 0.8029
KNN: RMSE = 38886.62

R^2 score for XGBoost: 0.9144
XGBoost: RMSE = 25629.45

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001076 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] T

Based on the outputs, we conclude:
- XGBoost has the lowest RMSE and the highest R^2 score, which makes it the best-fitting model on this validation split.
- CatBoost is the second most effective model.
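
The ranking can also be read off programmatically instead of by eye. The following is a minimal sketch that sorts the RMSE values printed above; LightGBM and CatBoost are left out because their output is cut off in this export, and the variable names `rmse_by_model` and `ranked` are only illustrative.

```python
# RMSE values copied from the single-split output above.
# LightGBM and CatBoost are omitted because their printed output is cut off.
rmse_by_model = {
    "Linear Regression": 36193.17,
    "Ridge Regression": 35492.29,
    "Lasso": 36168.53,
    "ElasticNet": 34424.68,
    "LassoLars": 36182.82,
    "BayesianRidge": 33973.40,
    "TweedieRegressor": 41243.52,
    "Decision Tree": 43974.81,
    "Random Forest": 28273.27,
    "SVM": 88642.97,
    "KNN": 38886.62,
    "XGBoost": 25629.45,
}

# Sort ascending: a lower RMSE means a better fit on the validation split.
ranked = sorted(rmse_by_model.items(), key=lambda item: item[1])

for rank, (name, rmse) in enumerate(ranked, start=1):
    print(f"{rank:2d}. {name:<18} RMSE = {rmse:,.2f}")
```

Collecting the scores in a dictionary inside the training loop, rather than only printing them, would make the same ranking available for all models without copying numbers by hand.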

However, a local score alone is not enough; we need to compare Kaggle scores to confirm this. We will take the 4 best models (XGBoost, CatBoost, Random Forest, BayesianRidge), let them predict house prices, and submit the files to Kaggle to see their actual accuracy.

In [None]:
import os
import sys
import joblib
from sklearn.model_selection import KFold

# === Add the project root to the path so log_experiment can be imported ===
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), "..", "..")))
from log.experiment_logger import log_experiment

log_path = join(path_dir, '..', "log", 'baseLine', "experiment_log.csv")
author = "Thien"
models = {
    'XGBoost': XGBRegressor(), 
    'CatBoost': CatBoostRegressor(verbose=0), 
    'Random Forest': RandomForestRegressor(),
    'BayesianRidge': BayesianRidge()
}

for name, model in models.items():
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    rmses = []
    r2s = []
    
    fold_index = 1
    for train_index, val_index in kf.split(X):
        X_train, X_val = X.iloc[train_index], X.iloc[val_index]
        y_train, y_val = y.iloc[train_index], y.iloc[val_index]
        model.fit(X_train, y_train)
        y_pred = model.predict(X_val)
        
        rmse = np.sqrt(mean_squared_error(y_val, y_pred))
        r2 = model.score(X_val, y_val)
        
        rmses.append(rmse)
        r2s.append(r2)
        
        print(f"\n==== Fold {fold_index} results for {name} ====")
        print(f"Fold {fold_index} - R2: {r2:.4f} | RMSE: {rmse:.4f}")
        fold_index += 1

    mean_r2 = np.mean(r2s)
    mean_rmse = np.mean(rmses)
    print("\n==== Mean metrics ====")
    print(f"R2 Score: {mean_r2:.4f}")
    print(f"RMSE: {mean_rmse:.4f}")

    # === Log the results to CSV ===
    log_experiment(
        output_path=log_path,
        model_name=name,
        feature_name="preprocessed",
        params= model.get_params(),
        kfold=5,
        rmse=mean_rmse,
        r2=mean_r2,
        author=author
    )

    # === Retrain on the full training data ===
    final_model = model
    final_model.fit(X, y)

    # === Dump the model to .pkl ===
    model_dir = join(path_dir, '..', "log", "baseLine", "Model Pickles", name)
    os.makedirs(model_dir, exist_ok=True)
    model_path = join(model_dir, name + "_baseLine.pkl")
    joblib.dump(final_model, model_path)
    print(f"✅ Model saved to {model_path}")

    # === Create the submission file ===
    # The Id column is read from the raw test set
    df_original = pd.read_csv(join(path_dir, "raw", "test.csv"))
    ids = df_original["Id"]
    X_test = df_test.copy()
    if 'SalePrice' in X_test.columns:
        X_test = X_test.drop(columns=['SalePrice'])

    y_test_pred = final_model.predict(X_test)

    submission = pd.DataFrame({
        'Id': ids,  # make sure the test set has this column
        'SalePrice': y_test_pred
    })

    sub_dir = join(path_dir, "submissions", "baseLine", name)
    os.makedirs(sub_dir, exist_ok=True)
    submission_path = join(sub_dir, f"submission_{name}_baseLine.csv")
    submission.to_csv(submission_path, index=False)
    print(f"📤 Submission file saved to {submission_path}")


==== Fold 1 results for XGBoost ====
Fold 1 - R2: 0.9144 | RMSE: 25629.4506

==== Fold 2 results for XGBoost ====
Fold 2 - R2: 0.8512 | RMSE: 31812.1758

==== Fold 3 results for XGBoost ====
Fold 3 - R2: 0.6481 | RMSE: 44089.4277

==== Fold 4 results for XGBoost ====
Fold 4 - R2: 0.8636 | RMSE: 29264.7982

==== Fold 5 results for XGBoost ====
Fold 5 - R2: 0.8910 | RMSE: 23865.2993

==== Mean metrics ====
R2 Score: 0.8337
RMSE: 30932.2303
Logged experiment to ..\..\data\log\experiment_log.csv
✅ Model saved to ..\..\data\..\log\baseLine\Model Pickles\XGBoost\XGBoost_baseLine.pkl
📤 Submission file saved to ..\..\data\submissions\baseLine\XGBoost\submission_XGBoost_baseLine.csv

==== Fold 1 results for CatBoost ====
Fold 1 - R2: 0.9065 | RMSE: 26780.4832

==== Fold 2 results for CatBoost ====
Fold 2 - R2: 0.9131 | RMSE: 24309.2844

==== Fold 3 results for CatBoost ====
Fold 3 - R2: 0.7186 | RMSE: 39426.5207

==== Fold 4 results for CatBoost ====
Fold 4 - R2: 0.8929 | RMSE: 25927.4707
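
The XGBoost fold scores above vary a lot (fold 3 is far worse than the others), so the mean alone hides how unstable the estimate is. As a minimal sketch, using only the XGBoost per-fold RMSE values printed above and Python's standard `statistics` module, the spread can be reported next to the mean; the variable names are illustrative.

```python
from statistics import mean, stdev

# XGBoost per-fold RMSE values copied from the output above.
xgb_fold_rmse = [25629.4506, 31812.1758, 44089.4277, 29264.7982, 23865.2993]

fold_mean = mean(xgb_fold_rmse)
fold_std = stdev(xgb_fold_rmse)  # sample standard deviation across the 5 folds

print(f"XGBoost RMSE: {fold_mean:.4f} +/- {fold_std:.4f}")
```

The mean matches the `RMSE: 30932.2303` line in the output. A large spread relative to the gap between two models suggests their ranking could change with a different split.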

## Result

<img src="images/output_kaggle_baseline.png" width="800">

This is what we received after submitting the files to Kaggle. CatBoost, not XGBoost, turned out to be the best model we have.
Further experiments are needed to improve the models' accuracy.
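
One possible contributor to the gap between local and Kaggle rankings is the metric. Assuming this is the Kaggle "House Prices: Advanced Regression Techniques" competition, submissions are scored by RMSE on the logarithm of the price, while the cells above report RMSE in raw price units. The sketch below uses made-up toy numbers (not taken from this notebook) and hypothetical helper names `rmse` and `log_rmse` to show that the two metrics can rank the same predictions differently.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error in the original price units."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def log_rmse(y_true, y_pred):
    """RMSE between log(actual) and log(predicted); all prices must be positive."""
    return math.sqrt(
        sum((math.log(t) - math.log(p)) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )

# Toy values for illustration only.
y_true = [100_000, 200_000, 500_000]
pred_a = [100_000, 200_000, 440_000]  # misses the expensive house by 60,000
pred_b = [65_000, 200_000, 500_000]   # misses the cheap house by 35,000

print(f"A: RMSE = {rmse(y_true, pred_a):.0f}, log-RMSE = {log_rmse(y_true, pred_a):.4f}")
print(f"B: RMSE = {rmse(y_true, pred_b):.0f}, log-RMSE = {log_rmse(y_true, pred_b):.4f}")
```

Here B has the lower RMSE in price units, but A has the lower log-RMSE, because the log metric penalizes relative errors rather than absolute ones. Evaluating local folds with the same metric as the leaderboard would make the two comparisons more consistent.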