<a href="https://colab.research.google.com/github/ShauryaDamathia/House_Price_Sales/blob/main/House_Price_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Overview**

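This notebook is an entry for the Kaggle House Prices: Advanced Regression Techniques competition. It trains an XGBoost regressor on log-transformed sale prices after dropping sparse columns, imputing missing values, one-hot encoding categorical features, and standardizing the inputs. In 5-fold cross-validation on the training data, the model reaches a log-scale RMSE of about 0.125, which approximates the competition's Root Mean Squared Log Error (RMSLE) metric.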
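
To make the RMSLE metric concrete, here is a minimal sketch of how it relates to ordinary RMSE on a log-transformed target. The `rmsle` helper and the prices are hypothetical, chosen only for illustration, and `log1p` is assumed as the log transform because that is what the notebook applies to `SalePrice`.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared log error (hypothetical helper), using log1p."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Made-up sale prices, for illustration only
actual = np.array([200000.0, 150000.0, 320000.0])
predicted = np.array([210000.0, 140000.0, 300000.0])

# A model trained on log1p(SalePrice) predicts in log space
log_actual = np.log1p(actual)
log_pred = np.log1p(predicted)

# Plain RMSE measured in log space ...
rmse_log_space = float(np.sqrt(np.mean((log_pred - log_actual) ** 2)))

# ... matches RMSLE measured on prices recovered with expm1
rmsle_price_space = rmsle(np.expm1(log_actual), np.expm1(log_pred))

print(rmse_log_space, rmsle_price_space)
```

This is why a model fit on `np.log1p(SalePrice)` can be scored with a standard RMSE scorer, and why predictions need `np.expm1` before submission.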

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import xgboost as xgb

train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

train_labels = train_df['SalePrice']
train_ids = train_df['Id']
test_ids = test_df['Id']

train_df.drop(['SalePrice'], axis=1, inplace=True)

full_data = pd.concat([train_df, test_df], axis=0, ignore_index=True)

too_many_nulls = ['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu']
full_data.drop(columns=too_many_nulls, inplace=True)

for col in full_data.columns:
    if full_data[col].dtype == "object":
        full_data[col] = full_data[col].fillna("None")
    else:
        full_data[col] = full_data[col].fillna(full_data[col].median())

full_data = pd.get_dummies(full_data)

X = full_data.iloc[:len(train_labels), :]
X_test = full_data.iloc[len(train_labels):, :]
y = np.log1p(train_labels)  # log-transform so RMSE on y approximates the competition's RMSLE

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_test_scaled = scaler.transform(X_test)

model = xgb.XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=3, random_state=42)
model.fit(X_scaled, y)

y_pred_log = model.predict(X_test_scaled)
y_pred = np.expm1(y_pred_log)  # reverse log1p

submission = pd.DataFrame({'Id': test_ids, 'SalePrice': y_pred})
submission.to_csv("house_price_submission.csv", index=False)
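
The cell above concatenates the train and test frames before calling `pd.get_dummies` and then slices them apart again. A likely reason is that encoding the two frames separately can produce different columns when a category appears in only one of them. The toy frames below are hypothetical and only sketch that effect.

```python
import pandas as pd

# Hypothetical toy frames: 'Grvl' appears only in the training rows
train = pd.DataFrame({"LotArea": [8450, 9600], "Street": ["Pave", "Grvl"]})
test = pd.DataFrame({"LotArea": [11622], "Street": ["Pave"]})

# Encoding separately: the test frame has no column for 'Grvl'
separate_train = pd.get_dummies(train)
separate_test = pd.get_dummies(test)

# Encoding after concatenation: both splits share one column set
combined = pd.get_dummies(pd.concat([train, test], axis=0, ignore_index=True))
X = combined.iloc[:len(train), :]
X_test = combined.iloc[len(train):, :]

print(list(separate_train.columns))
print(list(separate_test.columns))
print(list(X.columns) == list(X_test.columns))
```

The trade-off is that any statistic computed on the combined frame, such as the median used for imputation, also sees the test rows.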

## **Cross-Validated Log RMSE:**

In [6]:
from sklearn.model_selection import cross_val_score

# Cross-validation to evaluate log-RMSE
log_rmse_scores = -cross_val_score(
    model,
    X_scaled,
    y,
    scoring="neg_root_mean_squared_error",  # RMSE on log1p(SalePrice), since y is already log-transformed
    cv=5
)

print("Average Log RMSE:", log_rmse_scores.mean())
print("Standard Deviation:", log_rmse_scores.std())

Average Log RMSE: 0.12462593222403176
Standard Deviation: 0.010659776643383577
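
In the cells above, the imputation medians and the `StandardScaler` are fit on all rows before `cross_val_score` splits them, so each validation fold has some influence on its own preprocessing. One way to avoid this is to put the preprocessing inside a scikit-learn `Pipeline`, which is then refit on the training portion of every fold. The sketch below is not the notebook's method: it uses synthetic data from `make_regression`, and `GradientBoostingRegressor` stands in for `xgb.XGBRegressor` so that it needs only scikit-learn. `XGBRegressor` follows the same estimator interface and should fit in the same slot.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (assumption: not the Kaggle dataset)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, noise=10.0, random_state=42
)

# Knock out about 5% of the values so the imputer has work to do
rng = np.random.default_rng(42)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

# Imputation is refit on the training part of each fold
pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    GradientBoostingRegressor(
        n_estimators=100, learning_rate=0.05, max_depth=3, random_state=42
    ),
)

scores = -cross_val_score(
    pipeline, X_demo, y_demo, scoring="neg_root_mean_squared_error", cv=5
)

print("Mean RMSE:", scores.mean())
print("Standard deviation:", scores.std())
```

Whether this changes the score noticeably on the House Prices data is not something the notebook measures. The sketch only shows how to keep validation rows out of the preprocessing fit.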
