# MODELING AND EVALUATION
## Introduction
This notebook focuses on training, evaluating, and selecting the best predictive model for house price estimation. The final feature-engineered dataset is used to train and test the models. Multiple models are evaluated to assess bias–variance behavior and generalization performance.

### 1. Import Libraries and Load Dataset

In [None]:
import joblib
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

In [None]:
df = pd.read_csv("final_chennai_dataset.csv")

df.head()

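|   | loc_adyar | loc_alwarpet | loc_ambattur | loc_anna nagar | loc_avadi | loc_ayanambakkam | loc_chromepet | loc_egmore | loc_guduvancheri | loc_iyyappanthangal | ... | loc_ullagaram | loc_vadapalani | loc_vandalur | loc_velachery | loc_velappanchavadi | area_sqft | resale | no_of_bedrooms | amenity_score | log_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1310 | 0 | 3 | 0 | 15.520259 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1126 | 0 | 2 | 6 | 15.492607 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1307 | 0 | 3 | 7 | 15.920254 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 3600 | 0 | 3 | 4 | 16.968247 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1700 | 0 | 3 | 4 | 16.128046 |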


### 2. Define Features and Target Variable

In [None]:
X = df.drop(columns="log_price")
y = df["log_price"]

print("X shape :",X.shape)
print("y shape :",y.shape)

X shape : (4303, 65)
y shape : (4303,)


### 3. Train-Test Split and Scaling

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit the scaler on the training split only, then reuse it for the test split
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
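
Fitting the scaler on the training split only keeps test-set statistics out of the model. The sketch below illustrates that idea in plain Python, assuming `StandardScaler`'s default behaviour of centring on the training mean and dividing by the population standard deviation. The `area_sqft` values come from the `df.head()` preview; the held-out value `1500` and the helper `standardize` are hypothetical.

```python
from statistics import fmean, pstdev

# Toy "training" column: area_sqft values from the df.head() preview.
train_area = [1310, 1126, 1307, 3600, 1700]
# Hypothetical held-out value, used only for illustration.
test_area = [1500]

# Statistics are estimated from the training split only.
mu = fmean(train_area)
sigma = pstdev(train_area)  # population standard deviation (ddof=0)

def standardize(values, mean, std):
    """Apply z = (x - mean) / std using already-fitted statistics."""
    return [(v - mean) / std for v in values]

train_scaled = standardize(train_area, mu, sigma)
test_scaled = standardize(test_area, mu, sigma)  # reuses the training mu and sigma

print("train mean after scaling:", round(fmean(train_scaled), 6))
print("train std after scaling:", round(pstdev(train_scaled), 6))
print("scaled test value:", round(test_scaled[0], 3))
```

The test value is transformed with the training mean and standard deviation, which mirrors calling `transform` (not `fit_transform`) on `X_test`.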

### 4. Model Training and Evaluation

##### a) Baseline Model

In [None]:
lr = LinearRegression()
lr.fit(X_train_scaled,y_train)

y_train_pred = lr.predict(X_train_scaled)
train_mse = mean_squared_error(y_train, y_train_pred)
train_r2 = r2_score(y_train, y_train_pred)

y_test_pred = lr.predict(X_test_scaled)
test_mse = mean_squared_error(y_test,y_test_pred)
test_r2 = r2_score(y_test,y_test_pred)

print("Linear Regression")
print("Train MSE:", train_mse, "Train R2:", train_r2)
print("Test  MSE:", test_mse,  "Test  R2:", test_r2)

Linear Regression
Train MSE: 0.3242025245436797 Train R2: 0.3151236082104091
Test  MSE: 0.30711122833313975 Test  R2: 0.2772744038957088


- Train and test R² are both low (0.32 and 0.28) and close to each other
- The model underfits: a linear combination of the features explains less than a third of the variance in log price
- This establishes a simple baseline for comparison with the models that follow
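
Each model in this notebook is scored with the same fit, predict, MSE and R² steps. One way to avoid repeating them is a small helper, sketched below. The function name `evaluate_model` and the synthetic data are assumptions made for illustration; they are not part of the original notebook or the Chennai dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

def evaluate_model(model, X_tr, y_tr, X_te, y_te):
    """Fit `model` and return train/test MSE and R2 in a dict (hypothetical helper)."""
    model.fit(X_tr, y_tr)
    scores = {}
    for split, X_part, y_part in (("train", X_tr, y_tr), ("test", X_te, y_te)):
        pred = model.predict(X_part)
        scores[f"{split}_mse"] = mean_squared_error(y_part, pred)
        scores[f"{split}_r2"] = r2_score(y_part, pred)
    return scores

# Synthetic stand-in data (not the Chennai dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

scores = evaluate_model(LinearRegression(), X_tr, y_tr, X_te, y_te)
for key, value in scores.items():
    print(f"{key}: {value:.4f}")
```

Returning a dict rather than printing inside the helper makes it easy to collect the scores of several models into one comparison table.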

##### b) Advanced Models

In [None]:
rr = Ridge()
rr.fit(X_train_scaled,y_train)

y_train_pred = rr.predict(X_train_scaled)
train_mse = mean_squared_error(y_train, y_train_pred)
train_r2 = r2_score(y_train, y_train_pred)

y_test_pred = rr.predict(X_test_scaled)
test_mse = mean_squared_error(y_test,y_test_pred)
test_r2 = r2_score(y_test,y_test_pred)

print("Ridge Regression")
print("Train MSE:", train_mse, "Train R2:", train_r2)
print("Test  MSE:", test_mse,  "Test  R2:", test_r2)

Ridge Regression
Train MSE: 0.3242025332899194 Train R2: 0.3151235897340168
Test  MSE: 0.30710462032568603 Test  R2: 0.2772899545356249


- Ridge Regression performs almost identically to Linear Regression (test R² of about 0.277 for both)
- With the default `alpha=1.0`, the L2 penalty barely changes the fit, so regularization adds no predictive capacity here
- This suggests the limitation is the linear model form, not overfitting
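
A likely reason the two models coincide is that a penalty of `alpha=1.0` is tiny next to the data term when there are roughly 3,000 standardized training rows. The sketch below uses the closed-form ridge solution on synthetic data to show how little the weights move at `alpha=1.0` and how much they shrink at much larger values. The data, the coefficients and the helper `ridge_weights` are assumptions for illustration only.

```python
import numpy as np

def ridge_weights(X, y, alpha):
    """Closed-form ridge solution for centred data: (X'X + alpha*I)^-1 X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
n_rows = 3000  # roughly the size of a 70% split of 4303 rows
X = rng.normal(size=(n_rows, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize the columns
y = X @ np.array([0.5, -0.3, 0.0, 0.2, 0.1]) + rng.normal(scale=0.5, size=n_rows)
y = y - y.mean()  # centre the target so no intercept is needed

w_ols = ridge_weights(X, y, alpha=0.0)  # alpha=0 reduces to ordinary least squares

changes = {}
for alpha in (1.0, 100.0, 10_000.0):
    w = ridge_weights(X, y, alpha)
    # Relative distance between the ridge and OLS weight vectors.
    changes[alpha] = np.linalg.norm(w - w_ols) / np.linalg.norm(w_ols)
    print(f"alpha={alpha:>8}: relative change vs. OLS = {changes[alpha]:.4f}")
```

If regularization were expected to matter, `alpha` would need to be searched over several orders of magnitude rather than left at its default.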

In [None]:
tree = DecisionTreeRegressor()
tree.fit(X_train,y_train)

y_train_pred = tree.predict(X_train)
train_mse = mean_squared_error(y_train,y_train_pred)
train_r2 = r2_score(y_train,y_train_pred)

y_test_pred = tree.predict(X_test)
test_mse = mean_squared_error(y_test,y_test_pred)
test_r2 = r2_score(y_test,y_test_pred)

print("Decision Tree")
print("Train MSE:",train_mse, "Train R2:",train_r2)
print("Test MSE:", test_mse, "Test R2:",test_r2)

Decision Tree
Train MSE: 0.06806275417553902 Train R2: 0.8562177343911315
Test MSE: 0.4782107176488603 Test R2: -0.12537443795876624


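- Training R² (0.86) is far above test R² (-0.13), which indicates strong memorization of the training data
- A negative test R² means the tree predicts worse than a constant mean predictor: severe overfitting
- The unconstrained tree does not generalize to unseen data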
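
A negative R² can look surprising, so here is a minimal sketch of the definition, R² = 1 − SS_res / SS_tot. Predicting the mean of the targets gives exactly 0, and any predictor with a larger squared error than that goes below 0. The targets are the `log_price` values from the `df.head()` preview; the "erratic" predictions are hypothetical.

```python
from statistics import fmean

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# log_price values from the df.head() preview, used as toy targets.
y_true = [15.520259, 15.492607, 15.920254, 16.968247, 16.128046]

mean_baseline = [fmean(y_true)] * len(y_true)  # always predict the mean
erratic_pred = [16.9, 15.0, 15.1, 15.6, 17.2]  # hypothetical poor predictions

r2_mean = r2(y_true, mean_baseline)
r2_erratic = r2(y_true, erratic_pred)
print("R2 of mean predictor:", round(r2_mean, 6))
print("R2 of erratic predictor:", round(r2_erratic, 3))
```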

In [None]:
svr = SVR()
svr.fit(X_train_scaled,y_train)

y_train_pred = svr.predict(X_train_scaled)
train_mse = mean_squared_error(y_train,y_train_pred)
train_r2 = r2_score(y_train,y_train_pred)

y_test_pred = svr.predict(X_test_scaled)
test_mse = mean_squared_error(y_test,y_test_pred)
test_r2 = r2_score(y_test,y_test_pred)

print("Support Vector Machine")
print("Train MSE:",train_mse, "Train R2:",train_r2)
print("Test MSE:", test_mse, "Test R2:",test_r2)

Support Vector Machine
Train MSE: 0.3018346730259798 Train R2: 0.3623755951006705
Test MSE: 0.2992489292043142 Test R2: 0.295776771443361


- Train and test R² are both low (0.36 and 0.30), though slightly above the linear models
- The model underfits with its default hyperparameters
- Misconfiguration is one possible cause; the grid search in the next section tests whether other settings help
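
To see which defaults `SVR()` actually uses, the estimator's parameters can be inspected directly. The values noted in the follow-up sentence are what recent scikit-learn releases report; treat them as version-dependent.

```python
from sklearn.svm import SVR

# Inspect the settings used when SVR() is created without arguments.
defaults = SVR().get_params()
for name in ("kernel", "C", "epsilon", "gamma"):
    print(f"{name}: {defaults[name]!r}")
```

In recent releases this reports an RBF kernel with `C=1.0`, `epsilon=0.1` and `gamma='scale'`, which is a useful reference point when reading the tuned parameters in the next section.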

### 5. Hyperparameter Tuning

In [None]:
# Scaling sits inside the pipeline so each CV fold is scaled using only its own training part
pipeline = Pipeline([("scaler", StandardScaler()), ("svr", SVR())])

param_grid = {
    "svr__kernel": ["rbf"],
    "svr__C": [0.1, 1, 10, 100],
    "svr__gamma": ["scale", 0.01, 0.1, 1],
    "svr__epsilon": [0.01, 0.1, 0.2],
}


grid = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
    n_jobs=-1
)

grid.fit(X_train, y_train)

print("Best Parameters:", grid.best_params_)
print("Best CV MSE:", -grid.best_score_)

Best Parameters: {'svr__C': 1, 'svr__epsilon': 0.2, 'svr__gamma': 'scale', 'svr__kernel': 'rbf'}
Best CV MSE: 0.3327106099494575
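
The cost of the search above can be estimated before running it: every combination of the listed values is one candidate, and each candidate is fitted once per fold. The sketch below counts them with the standard library only; `cv_folds` is a name introduced here to mirror `cv=5`.

```python
from itertools import product

param_grid = {
    "svr__kernel": ["rbf"],
    "svr__C": [0.1, 1, 10, 100],
    "svr__gamma": ["scale", 0.01, 0.1, 1],
    "svr__epsilon": [0.01, 0.1, 0.2],
}
cv_folds = 5  # mirrors cv=5 in the GridSearchCV call

# Every combination of the listed values is one candidate setting.
candidates = [dict(zip(param_grid, combo)) for combo in product(*param_grid.values())]
total_fits = len(candidates) * cv_folds

print("candidates:", len(candidates))          # 1 * 4 * 4 * 3 = 48
print("cross-validation fits:", total_fits)    # 48 * 5 = 240
```

With `GridSearchCV`'s default `refit=True`, one more fit on the full training set follows, which is the model exposed as `best_estimator_`.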


In [None]:
best_svr = grid.best_estimator_

y_train_pred = best_svr.predict(X_train)
train_mse = mean_squared_error(y_train, y_train_pred)
train_r2 = r2_score(y_train, y_train_pred)

y_test_pred = best_svr.predict(X_test)
test_mse = mean_squared_error(y_test, y_test_pred)
test_r2 = r2_score(y_test, y_test_pred)

print("Tuned SVR Performance")
print("Train MSE:", train_mse, "Train R2:", train_r2)
print("Test  MSE:", test_mse,  "Test  R2:", test_r2)

Tuned SVR Performance
Train MSE: 0.29898532654704024 Train R2: 0.3683948268700752
Test  MSE: 0.29598346536406944 Test  R2: 0.30346139539314976


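- Tuned SVR achieves the highest test R² of all models (0.303), though only slightly above the untuned SVR (0.296) and the linear baselines (0.277)
- The selected `C=1` and `gamma="scale"` match SVR's defaults; only `epsilon` moved (to 0.2), which explains the small gain from tuning
- Train and test R² are close (0.37 vs. 0.30), so the model is not overfitting, but it still explains only about 30% of the variance in log price
- The model captures some non-linear structure that the linear baselines miss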
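
Because the target is `log_price`, the MSE values are in squared log units and are hard to read directly. One hedged way to interpret them, assuming the target is the natural log of price (the notebook does not state this), is to exponentiate the RMSE to get a typical multiplicative error. The dictionary names below are introduced for illustration; the MSE values are the test scores reported above.

```python
import math

# Test MSE values reported above (squared log-price units).
test_mse = {
    "Linear Regression": 0.30711122833313975,
    "Tuned SVR": 0.29598346536406944,
}

factors = {}
for name, mse in test_mse.items():
    rmse_log = math.sqrt(mse)             # typical error in log-price units
    factors[name] = math.exp(rmse_log)    # multiplicative error, assuming a natural-log target
    print(f"{name}: RMSE(log) = {rmse_log:.3f}, error factor ~ x{factors[name]:.2f}")
```

Under that assumption, both models are typically off by a factor of roughly 1.7 in price terms, which puts the small difference between them in perspective.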

### 6. Model Comparison Summary
- Linear and Ridge Regression underfit the data (test R² about 0.28), showing limited ability to capture price patterns
- The Decision Tree overfits strongly, with a negative test R² (-0.13)
- Tuned Support Vector Regression gives the best balance between bias and variance and the best score on unseen data (test R² 0.30)

### 7. Final Model Selection
The tuned Support Vector Regression model is selected as the final model because it has the highest test R² (0.30) and the lowest test MSE of the models evaluated.

### 8. Save the Model

In [None]:
joblib.dump(best_svr, "reis_svr_model.pkl")

['reis_svr_model.pkl']
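
Since the saved object is the whole pipeline (scaler plus SVR), it can be reloaded later and given raw, unscaled features. The sketch below shows a save and load round trip on synthetic data. The file name `demo_svr_model.pkl`, the single area-like feature and the back-transform with `exp` (which assumes a natural-log target) are all assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the training data (not the Chennai dataset).
rng = np.random.default_rng(0)
X_demo = rng.uniform(500, 4000, size=(100, 1))  # one area-like feature
y_demo = 14.0 + 0.0005 * X_demo[:, 0] + rng.normal(scale=0.05, size=100)  # log-price-like target

model = Pipeline([("scaler", StandardScaler()), ("svr", SVR())]).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_svr_model.pkl")  # hypothetical file name
    joblib.dump(model, path)
    restored = joblib.load(path)

# The pipeline scales internally, so raw (unscaled) features are passed in.
new_rows = np.array([[1200.0], [2500.0]])
log_pred = restored.predict(new_rows)
price_pred = np.exp(log_pred)  # back-transform, assuming a natural-log target

print("log-price predictions:", log_pred)
print("price predictions:", price_pred)
```

The real model expects the same feature columns, in the same order, as the `X` used for training, and joblib files are generally safest to load with the same scikit-learn version that wrote them.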

## Summary
- The final feature-engineered dataset was used to train and evaluate multiple regression models
- Linear and Ridge Regression were established as baseline models and showed underfitting behavior
- A Decision Tree Regressor was evaluated and found to overfit with poor generalization
- Support Vector Regression generalized better than the other models
- Hyperparameter tuning was applied only to SVR and gave a small improvement (test R² from 0.296 to 0.303)
- The tuned SVR achieved the best balance between bias and variance and was selected as the final model
- The selected model is saved and will be used as a core component of the REIS system for price estimation