# Model Training – Baseline Model

## Objective
The objective of this notebook is to train a baseline regression model to predict house prices and establish a reference performance for comparison with advanced models.


In [28]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


In [29]:
file_path = '/content/drive/MyDrive/House Price cleaned.csv'
data = pd.read_csv(file_path)


In [79]:
data.columns

Index(['No of bedrooms', 'No of bathrooms', 'living area', 'No of floors',
       'waterfront present', 'No of views', 'house condition', 'house grade',
       'Built Year', 'Renovation Year', 'Lattitude', 'Longitude',
       'living_area_renov', 'Price', 'Total floor area', 'Total lot area'],
      dtype='object')

In [30]:
X = data.drop("Price", axis=1)
y = data["Price"]


In [31]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


In [46]:
lr_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LinearRegression())
])

In [47]:
lr_pipeline.fit(X_train, y_train)

In [48]:
y_pred = lr_pipeline.predict(X_test)

In [75]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

lr_mae = mean_absolute_error(y_test, y_pred)
lr_rmse = np.sqrt(mean_squared_error(y_test, y_pred))
lr_r2 = r2_score(y_test, y_pred)

print("Linear Regression Performance:")
print("MAE:",lr_mae)
print("RMSE",lr_rmse)
print("R2 Score:",lr_r2)


Linear Regression Performance:
MAE: 125700.54809257189
RMSE 210215.7806284751
R2 Score: 0.6969813652852082


### Error Analysis

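- The RMSE of the Linear Regression model is approximately 210,216, and the MAE is approximately 125,701.
- The MAE means the model's predictions deviate from actual house prices by about 125,701 on average. The RMSE is higher because it squares errors before averaging, so a few large misses weigh more heavily.
- This error level is acceptable for a baseline and gives the advanced models a reference to improve on.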
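
To make the difference between the two error metrics concrete, here is a minimal sketch on made-up numbers (the prices below are hypothetical and not taken from the dataset). It illustrates how a single large miss raises RMSE much more than MAE, which is one plausible reason for the gap between the two values reported above.

```python
import math

# Hypothetical actual and predicted prices, for illustration only.
actual = [300_000, 450_000, 520_000, 610_000, 1_200_000]
predicted = [310_000, 440_000, 500_000, 630_000, 800_000]

errors = [p - a for p, a in zip(predicted, actual)]

# MAE: the average absolute error, in the same unit as the target.
mae = sum(abs(e) for e in errors) / len(errors)

# RMSE: square root of the mean squared error; large errors dominate.
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

print(f"MAE:  {mae:,.0f}")
print(f"RMSE: {rmse:,.0f}")
```

Four of the five errors are at most 20,000, yet the one 400,000 miss pulls RMSE to roughly twice the MAE.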


## Advanced Model – Random Forest Regressor
Goal: beat the Linear Regression baseline

## Why Random Forest?

- Handles non-linearity
- Robust to outliers
- Very strong for tabular data
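
The non-linearity point can be illustrated with a small sketch on synthetic data (an assumption for illustration; the variable names and the quadratic target below are hypothetical and unrelated to the house price dataset). A linear model cannot fit a symmetric quadratic relationship, while a random forest can approximate it.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic, purely illustrative data: y depends on x quadratically.
rng = np.random.default_rng(42)
X_demo = rng.uniform(-3, 3, size=(500, 1))
y_demo = X_demo[:, 0] ** 2 + rng.normal(scale=0.3, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

linear = LinearRegression().fit(X_tr, y_tr)
forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)

linear_r2 = r2_score(y_te, linear.predict(X_te))
forest_r2 = r2_score(y_te, forest.predict(X_te))

print("Linear Regression R2:", round(linear_r2, 3))
print("Random Forest R2:", round(forest_r2, 3))
```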

In [50]:
from sklearn.ensemble import RandomForestRegressor

In [51]:
rf_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", RandomForestRegressor(
        n_estimators=200,
        random_state=42,
        n_jobs=-1
    ))
])

rf_pipeline.fit(X_train, y_train)

### Pipeline Usage

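- A scikit-learn `Pipeline` was used to combine preprocessing (scaling) and model training into a single estimator.
- Because the scaler is fitted only on the data passed to `fit`, its statistics come from the training split (or from the training folds during cross-validation). This keeps test data from leaking into preprocessing and makes the workflow reproducible.
- The same pipeline structure was used across models for consistency. Scaling has little to no effect on tree-based models, so it is kept there only to keep the structure uniform.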
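
As a minimal sketch of the leakage point, the example below fits the same two-step pipeline on synthetic stand-in data (hypothetical, not the house price CSV) and checks which rows the scaler's statistics were computed from. The names `X_demo`, `y_demo`, and `demo_pipeline` are introduced here for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical).
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])
demo_pipeline.fit(X_tr, y_tr)

# The fitted scaler stores the per-feature means it learned during fit.
scaler_mean = demo_pipeline.named_steps["scaler"].mean_

# Compare against the training-only mean and the full-data mean.
matches_train = np.allclose(scaler_mean, X_tr.mean(axis=0))
matches_full = np.allclose(scaler_mean, X_demo.mean(axis=0))
print("Matches training mean:", matches_train)
print("Matches full-data mean:", matches_full)
```

The scaler's means agree with the training rows and not with the full dataset, so the test rows never influence the preprocessing step.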

In [52]:
rf_pred = rf_pipeline.predict(X_test)

In [73]:
rf_mae = mean_absolute_error(y_test, rf_pred)
rf_rmse = np.sqrt(mean_squared_error(y_test, rf_pred))
rf_r2 = r2_score(y_test, rf_pred)

print("Random Forest Regressor Performance:")
print("MAE:", rf_mae)
print("RMSE:", rf_rmse)
print("R2 Score:", rf_r2)

Random Forest Regressor Performance:
MAE: 68472.08876743779
RMSE: 127033.31805129831
R2 Score: 0.8893442487389069


### Model Comparison Summary

- Linear Regression achieved an R² score of approximately 0.70 and served as the baseline model.
- Random Forest Regressor improved on this with an R² score of approximately 0.89, while RMSE fell from about 210,216 to about 127,033.
- The improvement suggests non-linear relationships in the data, which Random Forest captures more effectively than a linear model.
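
The size of the improvement can be expressed as a relative error reduction. The short sketch below uses the test-set metrics copied from the outputs above; the dictionary names are introduced here for illustration.

```python
# Test-set metrics copied from the printed outputs above.
lr_metrics = {"MAE": 125700.54809257189, "RMSE": 210215.7806284751}
rf_metrics = {"MAE": 68472.08876743779, "RMSE": 127033.31805129831}

# Relative reduction of each error metric compared with the baseline.
reductions = {
    name: (lr_metrics[name] - rf_metrics[name]) / lr_metrics[name]
    for name in ("MAE", "RMSE")
}

for name, value in reductions.items():
    print(f"{name} reduction vs. baseline: {value:.1%}")
```

Random Forest lowers MAE by roughly 45% and RMSE by roughly 40% relative to Linear Regression.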


In [54]:
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(
    rf_pipeline,
    X,
    y,
    cv=5,
    scoring="r2",
    n_jobs=-1
)

cv_scores, cv_scores.mean()


(array([0.8607842 , 0.8855576 , 0.89639552, 0.87226544, 0.87193505]),
 np.float64(0.8773875626613318))

- Cross-validation on the full dataset gave R² scores between 0.861 and 0.896 across the five folds (mean 0.877), which indicates that model performance is stable across folds.
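
One way to quantify "stable" is to look at the spread of the fold scores. The sketch below summarizes the five R² values copied from the output above, using only the standard library.

```python
import statistics

# Fold scores copied from the cross_val_score output above.
fold_scores = [0.8607842, 0.8855576, 0.89639552, 0.87226544, 0.87193505]

mean_r2 = statistics.mean(fold_scores)
std_r2 = statistics.stdev(fold_scores)  # sample standard deviation
spread = max(fold_scores) - min(fold_scores)

print(f"Mean R2: {mean_r2:.4f}")
print(f"Std R2:  {std_r2:.4f}")
print(f"Range:   {spread:.4f}")
```

A standard deviation of about 0.014 around a mean of about 0.877 supports the reading that no single fold drives the result.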


In [55]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    "model__n_estimators": [200, 400],
    "model__max_depth": [None, 10, 20],
    "model__min_samples_split": [2, 5],
    "model__min_samples_leaf": [1, 2],
    "model__max_features": ["sqrt", "log2"]
}

grid = GridSearchCV(
    rf_pipeline,
    param_grid,
    cv=5,
    scoring="r2",
    n_jobs=-1
)

grid.fit(X_train, y_train)

print("Best Hyperparameters:", grid.best_params_)
print("Best Cross-Validation Score:", grid.best_score_)




Best Hyperparameters: {'model__max_depth': 20, 'model__max_features': 'sqrt', 'model__min_samples_leaf': 1, 'model__min_samples_split': 2, 'model__n_estimators': 400}
Best Cross-Validation Score: 0.850530612088274


In [56]:
from sklearn.ensemble import GradientBoostingRegressor


In [57]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

gbr_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),  # not required for tree-based models; kept for a consistent pipeline
    ("model", GradientBoostingRegressor(
        n_estimators=200,
        learning_rate=0.05,
        max_depth=3,
        random_state=42
    ))
])


In [58]:
gbr_pipeline.fit(X_train, y_train)


In [59]:
y_pred_gbr = gbr_pipeline.predict(X_test)


In [61]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

mae_gbr = mean_absolute_error(y_test, y_pred_gbr)
rmse_gbr = np.sqrt(mean_squared_error(y_test, y_pred_gbr))
r2_gbr = r2_score(y_test, y_pred_gbr)

print("Gradient Boosting Regressor Performance:")
print("MAE:", mae_gbr)
print("RMSE:", rmse_gbr)
print("R2 Score:", r2_gbr)

Gradient Boosting Regressor Performance:
MAE: 77989.72167950949
RMSE: 133705.15192122493
R2 Score: 0.877415660599259


In [62]:
from xgboost import XGBRegressor


In [63]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

xgb_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),  # not required for tree-based models; kept for a consistent pipeline
    ("model", XGBRegressor(
        n_estimators=300,
        learning_rate=0.05,
        max_depth=6,
        subsample=0.8,
        colsample_bytree=0.8,
        objective="reg:squarederror",
        random_state=42,
        n_jobs=-1
    ))
])


In [64]:
xgb_pipeline.fit(X_train, y_train)


In [65]:
y_pred_xgb = xgb_pipeline.predict(X_test)


In [67]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

mae_xgb = mean_absolute_error(y_test, y_pred_xgb)
rmse_xgb = np.sqrt(mean_squared_error(y_test, y_pred_xgb))
r2_xgb = r2_score(y_test, y_pred_xgb)

print("XGBoost Regressor Performance:")
print("MAE:", mae_xgb)
print("RMSE:", rmse_xgb)
print("R2 Score:", r2_xgb)

XGBoost Regressor Performance:
MAE: 66519.8671875
RMSE: 119197.04479558207
R2 Score: 0.9025751948356628


In [76]:
comparison = pd.DataFrame({
    "Model": [
        "Linear Regression",
        "Random Forest",
        "Gradient Boosting",
        "XGBoost"
    ],
    "RMSE": [
        lr_rmse,
        rf_rmse,
        rmse_gbr,
        rmse_xgb
    ],
    "R2 Score": [
        lr_r2,
        rf_r2,
        r2_gbr,
        r2_xgb
    ]
})

comparison


| | Model | RMSE | R2 Score |
|---|---|---|---|
| 0 | Linear Regression | 210215.780628 | 0.696981 |
| 1 | Random Forest | 127033.318051 | 0.889344 |
| 2 | Gradient Boosting | 133705.151921 | 0.877416 |
| 3 | XGBoost | 119197.044796 | 0.902575 |


### XGBoost Model Evaluation

- XGBoost Regressor was evaluated as the most advanced model.
- The model combines gradient boosting with regularization to reduce overfitting.
- Performance was compared with Linear Regression, Random Forest, and Gradient Boosting.
- The final model was selected based on R² score and RMSE.
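
The selection rule (highest R², lowest RMSE) can be sketched in a few lines. The values below are copied from the comparison table; the `results` dictionary is introduced here for illustration.

```python
# Test-set results copied from the comparison table.
results = {
    "Linear Regression": {"RMSE": 210215.780628, "R2": 0.696981},
    "Random Forest": {"RMSE": 127033.318051, "R2": 0.889344},
    "Gradient Boosting": {"RMSE": 133705.151921, "R2": 0.877416},
    "XGBoost": {"RMSE": 119197.044796, "R2": 0.902575},
}

# Lower RMSE is better; higher R2 is better.
best_by_rmse = min(results, key=lambda name: results[name]["RMSE"])
best_by_r2 = max(results, key=lambda name: results[name]["R2"])

print("Best by RMSE:", best_by_rmse)
print("Best by R2:", best_by_r2)
```

Both criteria point to the same model here, so there is no trade-off to resolve between them.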


## Final Model Selection

- Multiple models were evaluated, including Linear Regression, Random Forest, Gradient Boosting, and XGBoost.
- XGBoost Regressor achieved the best performance, with the highest R² score (0.903) and the lowest RMSE (about 119,197).
- XGBoost was selected as the final model because it captures complex non-linear relationships and applies regularization.
- The complete preprocessing and modeling workflow was saved using a pipeline for reproducibility.


In [80]:
import pandas as pd

new_data = pd.DataFrame({
    "No of bedrooms": [3, 4, 2],
    "No of bathrooms": [2.0, 3.5, 1.5],
    "living area": [1800, 2800, 1200],
    "No of floors": [1, 2, 1],
    "waterfront present": [0, 1, 0],
    "No of views": [0, 2, 0],
    "house condition": [3, 4, 3],
    "house grade": [7, 9, 6],
    "Built Year": [2005, 2015, 1998],
    "Renovation Year": [0, 2018, 0],
    "Lattitude": [47.5112, 47.6205, 47.4021],
    "Longitude": [-122.257, -122.349, -122.189],
    "living_area_renov": [1800, 3000, 1200],
    "Total floor area": [2000, 3500, 1400],
    "Total lot area": [4000, 6000, 3000]
})

In [82]:
predicted_prices = xgb_pipeline.predict(new_data)
predicted_prices


array([267002.3 , 666800.5 , 254184.64], dtype=float32)

In [84]:
result = new_data.copy()
result["Predicted Price"] = predicted_prices
result


| | No of bedrooms | No of bathrooms | living area | No of floors | waterfront present | No of views | house condition | house grade | Built Year | Renovation Year | Lattitude | Longitude | living_area_renov | Total floor area | Total lot area | Predicted Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 2.0 | 1800 | 1 | 0 | 0 | 3 | 7 | 2005 | 0 | 47.5112 | -122.257 | 1800 | 2000 | 4000 | 267002.3125 |
| 1 | 4 | 3.5 | 2800 | 2 | 1 | 2 | 4 | 9 | 2015 | 2018 | 47.6205 | -122.349 | 3000 | 3500 | 6000 | 666800.5 |
| 2 | 2 | 1.5 | 1200 | 1 | 0 | 0 | 3 | 6 | 1998 | 0 | 47.4021 | -122.189 | 1200 | 1400 | 3000 | 254184.640625 |


In [85]:
sample = X_test.iloc[[0]]
actual_price = y_test.iloc[0]

In [88]:
predicted_price = xgb_pipeline.predict(sample)[0]


In [89]:
actual_price, predicted_price


(np.int64(545000), np.float32(617645.44))

In [90]:
for i in range(5):
    actual = y_test.iloc[i]
    pred = xgb_pipeline.predict(X_test.iloc[[i]])[0]
    print(f"Actual: {actual}, Predicted: {int(pred)}, Error: {int(pred-actual)}")


Actual: 545000, Predicted: 617645, Error: 72645
Actual: 550000, Predicted: 424939, Error: -125060
Actual: 399950, Predicted: 379375, Error: -20574
Actual: 940000, Predicted: 744222, Error: -195777
Actual: 359782, Predicted: 522297, Error: 162515
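
To put these five errors in proportion to the prices, the sketch below computes the percentage error for each pair, using the actual and predicted values copied from the printed output above. It covers only these five rows, so it is not an estimate for the whole test set.

```python
# (actual, predicted) pairs copied from the printed output above.
pairs = [
    (545000, 617645),
    (550000, 424939),
    (399950, 379375),
    (940000, 744222),
    (359782, 522297),
]

# Signed percentage error: positive means the model over-predicted.
pct_errors = [(pred - actual) / actual for actual, pred in pairs]

for (actual, pred), pct in zip(pairs, pct_errors):
    print(f"Actual: {actual}, Predicted: {pred}, Error: {pct:+.1%}")

# Mean absolute percentage error over these five samples.
mape = sum(abs(p) for p in pct_errors) / len(pct_errors)
print(f"Mean absolute percentage error (5 samples): {mape:.1%}")
```

The errors range from about 5% to about 45% of the actual price, which shows that individual predictions can be far less accurate than the aggregate R² suggests.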


In [93]:
import os
import pickle

# Create the target directory first; open() does not create missing folders.
os.makedirs("./models", exist_ok=True)

with open("./models/final_xgboost_model.pkl", "wb") as f:
    pickle.dump(xgb_pipeline, f)


In [92]:
import os
print(os.getcwd())


/content


In [98]:
import pickle

with open("/content/drive/MyDrive/final_xgboost_model.pkl", "wb") as file:
    pickle.dump(xgb_pipeline, file)


In [100]:
import pickle

with open("/content/drive/MyDrive/final_xgboost_model.pkl", "rb") as f:
    loaded_model = pickle.load(f)

loaded_model.predict(X_test.iloc[:3])

array([617645.44, 424939.22, 379375.75], dtype=float32)

In [102]:
sample = X_test.iloc[[2]]
actual_price = y_test.iloc[2]
predicted_price = loaded_model.predict(sample)[0]

actual_price, predicted_price

(np.int64(399950), np.float32(379375.75))
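
The save-and-reload check can also be automated instead of compared by eye. The following is a minimal sketch under stated assumptions: it uses a small stand-in pipeline on synthetic data rather than the XGBoost pipeline, writes to a temporary directory, and the file name `stand_in_pipeline.pkl` is hypothetical. Creating the target directory before writing avoids a `FileNotFoundError` when the folder does not exist yet.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small stand-in pipeline on synthetic data (not the XGBoost model).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
y_demo = X_demo.sum(axis=1)
stand_in = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
]).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_dir = Path(tmp_dir) / "models"
    model_dir.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    model_path = model_dir / "stand_in_pipeline.pkl"

    with open(model_path, "wb") as f:
        pickle.dump(stand_in, f)
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

# The restored pipeline should reproduce the original predictions.
round_trip_ok = np.allclose(stand_in.predict(X_demo), restored.predict(X_demo))
print("Round trip OK:", round_trip_ok)
```

A pickled pipeline should be loaded with the same library versions that created it, since pickle files are not guaranteed to be compatible across scikit-learn releases.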

## Final Conclusion

- Multiple regression models were trained and evaluated.
- XGBoost Regressor achieved the best performance with an R² score of ~0.90.
- The final pipeline was saved and tested successfully.
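- On the held-out test set the model reached an RMSE of about 119,197, which suggests it generalizes reasonably well to unseen data, although individual predictions can still miss by more than 100,000.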
