In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler



In [3]:
df = pd.read_csv("real_estate_dataset.csv")

print(df.shape)
print(df.head())


(500, 12)
   ID  Square_Feet  Num_Bedrooms  Num_Bathrooms  Num_Floors  Year_Built  \
0   1   143.635030             1              3           3        1967   
1   2   287.678577             1              2           1        1949   
2   3   232.998485             1              3           2        1923   
3   4   199.664621             5              2           2        1918   
4   5    89.004660             4              3           3        1999   

   Has_Garden  Has_Pool  Garage_Size  Location_Score  Distance_to_Center  \
0           1         1           48        8.297631            5.935734   
1           0         1           37        6.061466           10.827392   
2           1         0           14        2.911442            6.904599   
3           0         0           17        2.070949            8.284019   
4           1         0           34        1.523278           14.648277   

           Price  
0  602134.816747  
1  591425.135386  
2  464478.696880  
3  583

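The dataset loaded correctly.
It contains 500 houses and 12 columns: an ID, ten predictors (square feet, bedrooms, bathrooms, floors, year built, garden, pool, garage size, location score, and distance to center), and the target, Price.
The first rows look clean and structured, with no obvious errors.
Missing values have not been checked explicitly at this stage.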
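
The absence of missing values can be checked directly rather than read off the first few rows. A minimal sketch, using a small stand-in frame whose values are made up for illustration (in the notebook the same calls would be made on `df`):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df; one value is deliberately missing
demo = pd.DataFrame({
    "Square_Feet": [140.0, 290.0, np.nan, 200.0],
    "Num_Bedrooms": [1, 1, 1, 5],
    "Price": [600000.0, 590000.0, 460000.0, 580000.0],
})

# Count missing values per column and fully duplicated rows
missing_per_column = demo.isna().sum()
n_duplicate_rows = int(demo.duplicated().sum())

print(missing_per_column)
print("Duplicate rows:", n_duplicate_rows)
```

A column with a nonzero count would need imputation or row removal before modelling.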

In [4]:
# Drop the ID column: it is a row identifier, not a predictive feature
if "ID" in df.columns:
    df = df.drop(columns=["ID"])

# Separate target
y = df["Price"]
X = df.drop(columns=["Price"])


In [5]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model_raw = RandomForestRegressor(
    n_estimators=300,
    random_state=42
)

model_raw.fit(X_train, y_train)

pred_raw = model_raw.predict(X_test)

rmse_raw = np.sqrt(mean_squared_error(y_test, pred_raw))
print("Model A RMSE:", rmse_raw)


Model A RMSE: 52453.472047586736


In this step, I built my first baseline model (Model A) using the raw features and a Random Forest. After removing the ID column and splitting the data 80/20 into training and test sets, the model achieved a test RMSE of about 52,453. RMSE is expressed in the same units as Price and weights large errors more heavily than small ones, so a typical prediction misses the actual price by roughly 52,000. This serves as the starting point for my project. The model works, but the error is still fairly high for prices in the hundreds of thousands, so the next goal is to improve performance with better features or stronger models.
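
To make the reading of RMSE concrete, here is a small worked sketch comparing it with the mean absolute error (MAE). The three error values are hypothetical, chosen only to show that one large miss pulls RMSE above MAE:

```python
import math

# Hypothetical prediction errors (predicted minus actual), in price units
errors = [10_000, -10_000, 30_000]

# MAE: plain average of the absolute errors
mae = sum(abs(e) for e in errors) / len(errors)

# RMSE: square root of the average squared error
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

print(f"MAE:  {mae:.0f}")
print(f"RMSE: {rmse:.0f}")
```

RMSE is never smaller than MAE on the same errors, so an RMSE of about 52,000 is an upper bound on the average absolute miss.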

In [6]:
df_eng = df.copy()

# House age relative to a fixed reference year (2026); a linear transform of Year_Built
df_eng["House_Age"] = 2026 - df_eng["Year_Built"]

# Total Living Area: an exact copy of Square_Feet, so it adds no new information
df_eng["Total_Living_Area"] = df_eng["Square_Feet"]

# Luxury Indicator (high size + good location)
df_eng["Luxury"] = np.where(
    (df_eng["Square_Feet"] > df_eng["Square_Feet"].median()) &
    (df_eng["Location_Score"] > df_eng["Location_Score"].median()),
    1, 0
)

# Interaction and ratio features
df_eng["Size_Location"] = df_eng["Square_Feet"] * df_eng["Location_Score"]
df_eng["Bedrooms_per_Bath"] = df_eng["Num_Bedrooms"] / (df_eng["Num_Bathrooms"] + 1)
# Despite its name, this is size discounted by distance to center; it does not use Price
df_eng["Price_Per_Sqft_Proxy"] = df_eng["Square_Feet"] / (df_eng["Distance_to_Center"] + 1)

# Log-transform the target (often helps when prices are right-skewed)
y_log = np.log1p(df_eng["Price"])

X_eng = df_eng.drop(columns=["Price"])


In [7]:
X_train_e, X_test_e, y_train_e, y_test_e = train_test_split(
    X_eng, y_log, test_size=0.2, random_state=42
)

model_eng = RandomForestRegressor(
    n_estimators=300,
    random_state=42
)

model_eng.fit(X_train_e, y_train_e)

pred_log = model_eng.predict(X_test_e)

# Convert back from log
pred_eng = np.expm1(pred_log)
y_test_original = np.expm1(y_test_e)

rmse_eng = np.sqrt(mean_squared_error(y_test_original, pred_eng))
print("Model B RMSE:", rmse_eng)


Model B RMSE: 52804.78198572192


In this step, I created new features (House Age, a Luxury indicator, and several interaction features) and applied a log transformation to the target variable. After training the model with these engineered features, the RMSE on the original price scale was about 52,804, slightly worse than the baseline RMSE of 52,453. The feature engineering therefore did not improve the model. This is plausible for a Random Forest: House_Age is a linear transform of Year_Built and Total_Living_Area is a copy of Square_Feet, and tree splits are unaffected by monotonic rescaling of a feature, so these columns carry little new information.
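
One way to keep the log transform and its inverse tied together is scikit-learn's `TransformedTargetRegressor`, which fits on the transformed target and inverts every prediction automatically. The sketch below is not the notebook's code; it runs on synthetic stand-in data (an assumption) so that it is self-contained:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): price grows with two made-up features
rng = np.random.default_rng(42)
X_demo = rng.uniform(50, 300, size=(200, 2))
y_demo = 1000 * X_demo[:, 0] + 500 * X_demo[:, 1] + rng.normal(0, 5000, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fits on log1p(y) internally and applies expm1 to every prediction
model_log = TransformedTargetRegressor(
    regressor=RandomForestRegressor(n_estimators=50, random_state=42),
    func=np.log1p,
    inverse_func=np.expm1,
)
model_log.fit(X_tr, y_tr)

pred = model_log.predict(X_te)  # already on the original price scale
rmse_demo = np.sqrt(mean_squared_error(y_te, pred))
print("RMSE on original scale:", rmse_demo)
```

With this wrapper the RMSE is always computed on the price scale, so it stays directly comparable with a model trained on the raw target.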

In [8]:
improvement = ((rmse_raw - rmse_eng) / rmse_raw) * 100

print("Improvement (%):", improvement)


Improvement (%): -0.6697553554061609


In [9]:
importances = model_eng.feature_importances_
features = X_eng.columns

feat_imp = pd.DataFrame({
    "Feature": features,
    "Importance": importances
}).sort_values(by="Importance", ascending=False)

print(feat_imp.head(10))


                 Feature  Importance
1           Num_Bedrooms    0.296728
0            Square_Feet    0.190805
11     Total_Living_Area    0.165870
10             House_Age    0.096776
4             Year_Built    0.096599
13         Size_Location    0.029825
15  Price_Per_Sqft_Proxy    0.028569
7            Garage_Size    0.022753
8         Location_Score    0.013696
9     Distance_to_Center    0.013533


The improvement is about -0.67%, which means Model B performed slightly worse than Model A. So, the feature engineering did not improve the model.

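From feature importance, the top-ranked feature is Num_Bedrooms, followed by Square_Feet and Total_Living_Area. Because Total_Living_Area is a copy of Square_Feet, their importances (0.19 and 0.17) are one signal split across two columns; combined (about 0.36) they exceed Num_Bedrooms (0.30). The same applies to House_Age and Year_Built (about 0.10 each). This shows that house size, number of bedrooms, and construction year are the main factors affecting price.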
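
Redundant columns like these can be found before training by looking for feature pairs whose absolute correlation is numerically 1. A minimal sketch on a stand-in frame (the rows are rounded values from the first rows of the dataset; the 0.999 cutoff is an arbitrary assumption):

```python
import pandas as pd

# Stand-in for X_eng with the same kind of redundancy
demo = pd.DataFrame({
    "Square_Feet": [143.6, 287.7, 233.0, 199.7, 89.0],
    "Year_Built": [1967, 1949, 1923, 1918, 1999],
    "Num_Bedrooms": [1, 1, 1, 5, 4],
})
demo["Total_Living_Area"] = demo["Square_Feet"]   # exact copy
demo["House_Age"] = 2026 - demo["Year_Built"]     # linear transform

# Flag every pair of columns whose absolute correlation is (numerically) 1
corr = demo.corr().abs()
cols = list(corr.columns)
redundant_pairs = [
    (a, b)
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if corr.loc[a, b] > 0.999
]
print(redundant_pairs)
```

Dropping one column from each flagged pair leaves the model's information unchanged and makes the importance table easier to read.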

In [10]:
print(df.corr()["Price"].sort_values(ascending=False))


Price                 1.000000
Num_Bedrooms          0.563973
Square_Feet           0.558604
Year_Built            0.418293
Num_Floors            0.177435
Num_Bathrooms         0.156689
Has_Pool              0.136579
Has_Garden            0.109196
Location_Score        0.071326
Garage_Size           0.032100
Distance_to_Center    0.000730
Name: Price, dtype: float64


In [11]:
y_log = np.log1p(y)


In [12]:
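# Size relative to construction year (despite the name, this uses Year_Built, not age)
df["Size_Age"] = df["Square_Feet"] / df["Year_Built"]
# Size scaled by the number of bathrooms
df["Luxury_Index"] = df["Square_Feet"] * df["Num_Bathrooms"]
# Location score discounted by distance to the center
df["Center_Premium"] = df["Location_Score"] / (df["Distance_to_Center"] + 1)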


In [13]:
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=4,
    random_state=42
)


I checked the correlations and found that Num_Bedrooms and Square_Feet have the strongest linear relationship with Price (about 0.56 each). Other features such as Distance_to_Center are almost uncorrelated with Price.

I then applied a log transformation to Price and created new interaction features to try to improve the model.

In [14]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model_raw, X, y,
                         scoring='neg_root_mean_squared_error',
                         cv=5)

print("CV RMSE:", -scores.mean())


CV RMSE: 47541.50216905962


In [15]:
df["Size_Age"]
df["Luxury_Index"]
df["Center_Premium"]


Unnamed: 0,Center_Premium
0,1.196360
1,0.512494
2,0.368323
3,0.223066
4,0.097345
...,...
495,0.654633
496,1.926153
497,1.188133
498,0.295647


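Under 5-fold cross-validation the baseline Random Forest reaches an RMSE of about 47,500, in the same range as the single-split result, so the model is stable, but feature engineering has not significantly improved performance yet.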
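
Stability is easier to judge from the per-fold scores than from the mean alone. The sketch below reports both the mean and the spread across folds; it uses synthetic stand-in data and a shuffled `KFold`, both of which are assumptions rather than the notebook's setup:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(150, 3))
y_demo = 100 * X_demo[:, 0] + 50 * X_demo[:, 1] + rng.normal(0, 5, size=150)

# Shuffled folds with a fixed seed so the split is reproducible
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    RandomForestRegressor(n_estimators=50, random_state=42),
    X_demo, y_demo,
    scoring="neg_root_mean_squared_error",
    cv=cv,
)

# The scorer returns negative RMSE, so flip the sign
fold_rmse = -scores
print("Per-fold RMSE:", np.round(fold_rmse, 2))
print("Mean:", fold_rmse.mean(), "Std:", fold_rmse.std())
```

A small standard deviation relative to the mean supports calling the model stable; a large one suggests the score depends heavily on which rows land in the test fold.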

In [17]:
X_new = df.drop(columns=["Price"])

scores = cross_val_score(model, X_new, y,
                         scoring='neg_root_mean_squared_error',
                         cv=5)

print("CV RMSE:", -scores.mean())



CV RMSE: 34255.97828977023


In [18]:
corr = df.corr()["Price"].sort_values(ascending=False)
print(corr)


Price                 1.000000
Num_Bedrooms          0.563973
Square_Feet           0.558604
Size_Age              0.537980
Luxury_Index          0.493219
Year_Built            0.418293
Num_Floors            0.177435
Num_Bathrooms         0.156689
Has_Pool              0.136579
Has_Garden            0.109196
Location_Score        0.071326
Garage_Size           0.032100
Distance_to_Center    0.000730
Center_Premium       -0.010498
Name: Price, dtype: float64


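The cross-validation RMSE improved to about 34,256, which is much better than the earlier 47,542. Two things changed at once, however: the model (Random Forest to Gradient Boosting) and the feature set (three added features), so this comparison alone cannot show which change produced the gain.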

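Correlation shows that Num_Bedrooms, Square_Feet, Size_Age, and Luxury_Index have the strongest linear relationships with Price, while Distance_to_Center and Center_Premium are almost uncorrelated with it. Size_Age and Luxury_Index are both built from Square_Feet, so their correlations largely reflect that column.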
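
To tell the effect of the model apart from the effect of the features, each model can be scored on each feature set with the same folds. This is a sketch on synthetic stand-in data (an assumption), with a single made-up interaction feature standing in for the engineered columns:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption)
rng = np.random.default_rng(1)
size = rng.uniform(50, 300, 200)
rooms = rng.integers(1, 6, 200).astype(float)
y_demo = 1500 * size + 40000 * rooms + rng.normal(0, 20000, 200)

X_raw = np.column_stack([size, rooms])
X_plus = np.column_stack([size, rooms, size * rooms])  # one added interaction

models = {
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=42),
    "GradientBoosting": GradientBoostingRegressor(n_estimators=100, random_state=42),
}
feature_sets = {"raw": X_raw, "raw+interaction": X_plus}
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Score every (model, feature set) combination on identical folds
results = {}
for model_name, est in models.items():
    for feat_name, X_fs in feature_sets.items():
        scores = cross_val_score(
            est, X_fs, y_demo,
            scoring="neg_root_mean_squared_error", cv=cv,
        )
        results[(model_name, feat_name)] = -scores.mean()

for key, value in results.items():
    print(key, round(value))
```

Reading the table row by row shows the gain from features for a fixed model; reading it column by column shows the gain from the model for fixed features.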

In [19]:
y_log = np.log1p(y)

scores = cross_val_score(model, X_new, y_log,
                         scoring='neg_root_mean_squared_error',
                         cv=5)

rmse_log = -scores.mean()
print("CV RMSE (log scale):", rmse_log)


CV RMSE (log scale): 0.06739562283442355


In [20]:
model = GradientBoostingRegressor(
    n_estimators=1000,
    learning_rate=0.03,
    max_depth=3,
    random_state=42
)


In [21]:
corr = df.corr(numeric_only=True)["Price"].sort_values(ascending=False)
print(corr)


Price                 1.000000
Num_Bedrooms          0.563973
Square_Feet           0.558604
Size_Age              0.537980
Luxury_Index          0.493219
Year_Built            0.418293
Num_Floors            0.177435
Num_Bathrooms         0.156689
Has_Pool              0.136579
Has_Garden            0.109196
Location_Score        0.071326
Garage_Size           0.032100
Distance_to_Center    0.000730
Center_Premium       -0.010498
Name: Price, dtype: float64


In [22]:
low_corr = corr[abs(corr) < 0.05].index
df = df.drop(columns=low_corr)


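After the log transformation and after removing the features whose absolute correlation with Price is below 0.05 (Garage_Size, Distance_to_Center, and Center_Premium), the feature set is smaller and cleaner. The effect on accuracy has not been measured at this point. The main drivers of price are house size and number of bedrooms.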
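
The correlation filter can be wrapped in a small helper that excludes the target itself and takes the frame to compute on as an argument, so it can be run on the training rows only. The function name, the stand-in data, and the choice to use training rows are assumptions for illustration:

```python
import pandas as pd

def low_corr_features(train_df, target, threshold=0.05):
    """Return features whose absolute Pearson correlation with target is below threshold."""
    corr = train_df.corr(numeric_only=True)[target].drop(target)
    return list(corr[corr.abs() < threshold].index)

# Hypothetical training rows: one informative column, one unrelated column
train_demo = pd.DataFrame({
    "Square_Feet": [100, 150, 200, 250, 300, 350],
    "Unrelated": [1, -1, 0, 0, -1, 1],
    "Price": [200, 300, 400, 500, 600, 700],
})

to_drop = low_corr_features(train_demo, "Price")
print(to_drop)
```

Pearson correlation only measures linear association, so a feature flagged here can still be useful to a tree-based model through non-linear effects or interactions.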

In [23]:
y_log = np.log1p(df["Price"])
X = df.drop(columns=["Price"])


In [25]:
model.fit(X_train, y_train)



In [27]:
pred_log = model.predict(X_test)
pred = np.expm1(pred_log)


  pred = np.expm1(pred_log)


In [28]:
print("Max pred_log:", pred_log.max())
print("Min pred_log:", pred_log.min())


Max pred_log: 829274.1633297675
Min pred_log: 300178.8115752554


In [29]:
pred = model.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, pred))
print("RMSE:", rmse)


RMSE: 32593.402837592083


The overflow warning occurred because the model was trained on the original Price, not the log-transformed Price, yet expm1() was applied to its predictions. The printed values confirm this: the predictions range from about 300,000 to 829,000, which are prices rather than logs. Predicting directly without reversing the log gives a final RMSE of about 32,593, a strong improvement over the earlier baseline (~52k).
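
The overflow is easy to reproduce in isolation. The sketch below applies `expm1` to two values on the price scale (rounded from the minimum and maximum printed above) and then shows the round trip that the inverse transform is meant for:

```python
import numpy as np

# Values on the original price scale, rounded from the printed min and max
price_scale_pred = np.array([300178.81, 829274.16])

# exp() of a number this large exceeds float64, so the result is inf
with np.errstate(over="ignore"):
    wrong = np.expm1(price_scale_pred)
print(wrong)

# Correct use: expm1 undoes log1p
log_target = np.log1p(price_scale_pred)
recovered = np.expm1(log_target)
print(np.allclose(recovered, price_scale_pred))
```

A quick check of the prediction range, as done in In [28], is a cheap way to confirm which scale a model's output is on before inverting anything.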

In [30]:
importances = model.feature_importances_
features = X_train.columns

feat_imp = sorted(zip(features, importances),
                  key=lambda x: x[1],
                  reverse=True)

print(feat_imp[:5])


[('Square_Feet', np.float64(0.34890162955146053)), ('Num_Bedrooms', np.float64(0.3452922823884201)), ('Year_Built', np.float64(0.19711299817312034)), ('Num_Bathrooms', np.float64(0.02963857660133221)), ('Has_Pool', np.float64(0.021852930617866098))]


In [31]:
rmse_baseline = 52785.01401641827   # hard-coded Random Forest baseline (Model A printed 52453.47 above)
rmse_new = 32593.402837592083      # Gradient Boosting test RMSE from In [29]

improvement = ((rmse_baseline - rmse_new) / rmse_baseline) * 100

print("Baseline RMSE:", rmse_baseline)
print("New RMSE:", rmse_new)
print("Improvement (%):", improvement)


Baseline RMSE: 52785.01401641827
New RMSE: 32593.402837592083
Improvement (%): 38.25254488432224


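Model A was built using raw features with a Random Forest model. The baseline RMSE used in this comparison is 52,785; the Model A run earlier in the notebook printed 52,453, so the two figures do not match exactly. Either way, the model's typical prediction error was a little over 52,000. This served as the baseline performance.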

The final model was trained with Gradient Boosting on the original-scale Price, and its test RMSE dropped to 32,593. The engineered features (House Age, Total Living Area, Luxury Indicator, and the interaction features) were tested earlier with a Random Forest as Model B and did not help there: that RMSE was 52,804.

This shows an improvement of about 38% over the baseline.

From feature importance analysis, Square_Feet had the highest predictive power, followed by Num_Bedrooms and Year_Built. This indicates that house size and number of bedrooms are the strongest factors influencing house prices.

In conclusion, switching to a stronger model significantly improved prediction accuracy, while the engineered features added little on their own, and house size remains the most important factor in determining price.
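
The improvement figures quoted in this notebook all come from the same formula, sketched here as a small helper (the function name is an assumption). The inputs are the RMSE values printed in the cells above:

```python
def pct_improvement(baseline_rmse, new_rmse):
    """Relative RMSE reduction in percent; positive means the new model is better."""
    return (baseline_rmse - new_rmse) / baseline_rmse * 100

# Gradient Boosting versus the hard-coded baseline used in In [31]
vs_hardcoded = pct_improvement(52785.01401641827, 32593.402837592083)

# Gradient Boosting versus the Model A RMSE printed in In [5]
vs_model_a = pct_improvement(52453.472047586736, 32593.402837592083)

# Model B (engineered features, Random Forest) versus Model A
model_b_vs_a = pct_improvement(52453.472047586736, 52804.78198572192)

print(round(vs_hardcoded, 2), round(vs_model_a, 2), round(model_b_vs_a, 2))
```

Whichever baseline figure is used, the Gradient Boosting result is roughly 38% better, while the Random Forest with engineered features is slightly worse than the baseline.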