<H1>Housing Price Predictor</H1>

<H2>1) Problem Statement</H2>

The goal of this project is to build a machine learning model that can predict house prices based on various features such as LotArea, number of bedrooms, Neighborhood, year built, HouseStyle, OverallCond and more. 



<H2>2) Data Collection</H2>

Dataset Source - https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview

The data consists of 81 features and 1460 records.

<H3>Importing Packages</H3>

In [None]:
import pandas as pd            # For data manipulation
import numpy as np  
import matplotlib.pyplot as plt
from feature_engine.encoding import RareLabelEncoder
from feature_engine.encoding import OrdinalEncoder
from scipy.stats import zscore
from scipy.stats import skew
import seaborn as sns          # For statistical visualizations
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error  # regression metric; accuracy_score applies to classification only
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from scipy.stats import randint
from xgboost import XGBRegressor
from catboost import CatBoostRegressor  # needed by the Optuna objective in section 4.1
import optuna
from optuna.integration import OptunaSearchCV
import warnings
warnings.filterwarnings("ignore")


# pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows',None)


<H3>Import the CSV Data as Pandas DataFrame</H3>

In [None]:
train = pd.read_csv('./data/train.csv')
test = pd.read_csv('./data/test.csv')
sample = pd.read_csv('./data/sample_submission.csv')

In [None]:
print(train.shape)
train.head()

In [None]:
train.info()

<H3>3.1 Check Missing values</H3>

In [None]:
train.isna().sum()[train.isna().sum() > 0].sort_values()

<H3>Check Duplicates</H3>

In [None]:
train.duplicated().sum()

<H3>3.2 Handling Missing Values</H3>

Dropping high-NaN columns: Columns with a high percentage of missing values (e.g., more than 50%) are dropped from the dataset, as they provide little to no useful information.

Imputation: For columns with fewer missing values, we use mean imputation for numerical features and mode imputation for categorical features.
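
The two steps above can be sketched on a toy example. This is a minimal illustration, not the notebook's implementation: the frames `toy_train` and `toy_test` and their values are invented, and the point is only that the drop list and the fill values are learned from the training data and then reused unchanged on the test data.

```python
import numpy as np
import pandas as pd

# Toy frames; the values are invented for illustration.
toy_train = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0, 70.0],
    "GarageType": ["Attchd", "Attchd", None, "Detchd"],
    "PoolQC": [None, None, None, "Gd"],  # 75% missing
})
toy_test = pd.DataFrame({
    "LotFrontage": [np.nan, 65.0],
    "GarageType": [None, "Detchd"],
    "PoolQC": [None, None],
})

# 1) Drop columns whose missing share in the training data exceeds 50%.
missing_share = toy_train.isna().mean()
cols_to_drop = missing_share[missing_share > 0.5].index.tolist()
toy_train = toy_train.drop(columns=cols_to_drop)
toy_test = toy_test.drop(columns=cols_to_drop)

# 2) Learn fill values from the training data only.
fill_values = {}
for col in toy_train.columns:
    if pd.api.types.is_numeric_dtype(toy_train[col]):
        fill_values[col] = toy_train[col].mean()     # mean for numerical columns
    else:
        fill_values[col] = toy_train[col].mode()[0]  # mode for categorical columns

# 3) Apply the same values to both sets.
toy_train = toy_train.fillna(fill_values)
toy_test = toy_test.fillna(fill_values)
```

Reusing the training statistics keeps information from the test set out of the preprocessing.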

In [None]:
def drop_high_nan_columns(df, threshold=0.5):
    """Drops columns whose fraction of missing values exceeds threshold (0.5 = 50%)."""
    missing_percent = df.isnull().mean()
    cols_to_drop = missing_percent[missing_percent > threshold].index

    print("🔍 Dropping columns with > {:.0%} missing values:\n".format(threshold))
    for col in cols_to_drop:
        print(f"❌ Dropped '{col}' ({missing_percent[col]*100:.2f}% missing)")

    df = df.drop(columns=cols_to_drop)
    return df

def fill_missing_values(df):
    """
    Fills missing values for training data and stores the fill values
    so they can be reused for test data.
    Calculates mean/mode for all columns (even those without NaNs).
    
    Returns:
        df_filled: DataFrame with missing values filled
        fill_values: dict with {column_name: fill_value}
    """
    categorical_cols = [col for col in df.columns if df[col].dtype == 'O']
    numerical_cols = [col for col in df.columns if df[col].dtype != 'O' and col != 'SalePrice']

    fill_values = {}
    df_filled = df.copy()

    print("\n🔧 Filling Missing Values (Training Data):")

    for col in numerical_cols:
        mean_val = df[col].mean()
        # Assign the result back; chained inplace fillna is deprecated in recent pandas
        df_filled[col] = df_filled[col].fillna(mean_val)
        fill_values[col] = mean_val
        print(f"🔢 Filled NaNs in numerical '{col}' with mean: {mean_val:.2f}")

    for col in categorical_cols:
        mode = df[col].mode()
        if mode.empty:
            continue  # column is entirely NaN, so there is no mode to fill with
        mode_val = mode[0]
        df_filled[col] = df_filled[col].fillna(mode_val)
        fill_values[col] = mode_val
        print(f"🔤 Filled NaNs in categorical '{col}' with mode: '{mode_val}'")

    return df_filled, fill_values


def apply_fill_values(df, fill_values):
    """
    Applies stored fill values to another dataset (e.g., test set).
    """
    df_filled = df.copy()
    print("\n🔧 Applying Stored Fill Values:")
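    for col, val in fill_values.items():
        if col not in df_filled.columns:
            continue  # column exists only in the data the fill values were learned from
        # Assign the result back; chained inplace fillna is deprecated in recent pandas
        df_filled[col] = df_filled[col].fillna(val)
        print(f"✅ Filled NaNs in '{col}' with stored value: {val}")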
    return df_filled


In [None]:
train = drop_high_nan_columns(train)

train ,fill_values = fill_missing_values(train)


In [None]:
test = test.drop(columns=['Alley','MasVnrType','PoolQC','Fence','MiscFeature'])
test = apply_fill_values(test, fill_values)

<H3>3.4 Feature Engineering</H3>

Age Features: We calculate the age of the house and its remodeling by subtracting the year built/remodel from the year sold, providing more meaningful temporal information.

Total Bathrooms & Square Footage: We combine various bathroom and square footage features into single, more representative features like TotalBaths and TotalSF, reducing dimensionality and providing a clearer picture of the house's size.

Log Transformation: Key numerical features like LotArea and SalePrice are log-transformed to reduce skewness and stabilize variance, which often improves the performance of linear models.
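
A small sketch of the log transformation described above, using invented prices. It shows that `np.log1p` pulls in a long right tail and that `np.expm1` is its exact inverse, which matters when predictions are converted back to the original price scale.

```python
import numpy as np
import pandas as pd

# Invented, right-skewed prices: each value doubles the previous one.
prices = pd.Series([50_000, 100_000, 200_000, 400_000, 800_000], dtype=float)

log_prices = np.log1p(prices)    # log(1 + x), also defined for x == 0
restored = np.expm1(log_prices)  # inverse of log1p

skew_before = prices.skew()
skew_after = log_prices.skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

Because the toy values grow geometrically, they are almost evenly spaced after the transform, so the skewness drops close to zero.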

In [None]:
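def preprocess(df):
    """Adds age, bathroom and area features and log-transforms skewed columns (modifies df in place)."""
    # Clamp remodel/build years that fall after the sale year to the sale year
    df['YearRemodAdd'] = [ys if yr > ys else yr for yr, ys in zip(df['YearRemodAdd'], df['YrSold'])]
    df['YearBuilt'] = [ys if yb > ys else yb for yb, ys in zip(df['YearBuilt'], df['YrSold'])]
    # Age features: years elapsed between each event and the sale
    for feature in ['YearBuilt', 'YearRemodAdd', 'GarageYrBlt']:
        df[feature] = df['YrSold'] - df[feature]

    # Total number of bathrooms (half baths count as 0.5)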
    df["TotalBaths"] = (df["FullBath"] + 0.5 * df["HalfBath"] +
                        df["BsmtFullBath"] + 0.5 * df["BsmtHalfBath"])
    df.drop(columns=["FullBath", "HalfBath", "BsmtFullBath", "BsmtHalfBath"], inplace=True, errors='ignore')

    # Total square footage
    df["TotalSF"] = df["TotalBsmtSF"] + df["GrLivArea"]
    df.drop(columns=["TotalBsmtSF", "GrLivArea"], inplace=True, errors='ignore')
    
    # Log-transform selected numeric features
    num_features = ['LotFrontage', 'LotArea', '1stFlrSF', 'TotalSF']
    for feature in num_features:
        df[feature] = np.log1p(df[feature])

    return df


<H3>3.5 Outlier Detection and Removal</H3>


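We flag outliers with the Interquartile Range (IQR) method: a value is flagged when it lies more than threshold × IQR below the first quartile or above the third quartile. The check is restricted to the ten features most correlated with the target variable, SalePrice, so that the most influential outliers are addressed. The flagged rows are then inspected, and a small number of extreme ones are dropped from the training set.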
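
As a worked example of the IQR rule, the sketch below applies it to a single invented column. The values and the threshold of 1.5 are assumptions for illustration only.

```python
import pandas as pd

# Invented living-area values with one extreme entry.
area = pd.Series([1200, 1350, 1500, 1600, 1750, 1900, 5600], name="GrLivArea")

q1, q3 = area.quantile(0.25), area.quantile(0.75)
iqr = q3 - q1
threshold = 1.5  # a common default; larger values flag fewer points
lower, upper = q1 - threshold * iqr, q3 + threshold * iqr

# Index labels of the rows that fall outside the fences.
outlier_index = area[(area < lower) | (area > upper)].index.tolist()
```

Here Q1 = 1425 and Q3 = 1825, so the fences are 825 and 2425 and only the last row (5600) is flagged.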

In [None]:
# Step 1: Detect outliers function
def detect_outliers(df, numerical_cols, threshold=1.5):
    outlier_dict = {}
    for col in numerical_cols:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - threshold * IQR
        upper_bound = Q3 + threshold * IQR
        outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)].index
        outlier_dict[col] = list(outliers)
    return dict(sorted(outlier_dict.items(), key=lambda item: len(item[1]), reverse=True))

# Step 2: Use the 10 most correlated features with SalePrice
correlations = train.corr(numeric_only=True)['SalePrice'].abs()
top10_corr_features = correlations.drop('SalePrice').sort_values(ascending=False).head(10).index.tolist()

# Step 3: Detect outliers
outliers = detect_outliers(train, top10_corr_features,threshold=2)

# Step 4: Plot histograms in a 2x5 layout
fig, axes = plt.subplots(2, 5, figsize=(24, 5))
axes = axes.flatten()

for i, feature in enumerate(top10_corr_features):
    sns.histplot(train[feature], kde=True, ax=axes[i], bins=30)
    axes[i].set_title(f'{feature}')
    axes[i].set_xlabel('')
    axes[i].set_ylabel('')

plt.tight_layout()
plt.suptitle('Top 10 Correlated Features with SalePrice (Histogram)', fontsize=16, y=1.05)
plt.show()
print(outliers)

In [None]:
# print(train[train.GrLivArea > 4000]['GrLivArea'].sort_values())
# print(train[train.TotRmsAbvGrd > 14]['TotRmsAbvGrd'].sort_values())
# print(train[train.TotalBsmtSF>4000]['TotalBsmtSF'].sort_values())
# print(train[train['1stFlrSF']>3000]['1stFlrSF'].sort_values())
# print(train[train['GarageArea']>1300]['GarageArea'].sort_values())
# print(train[train['SalePrice']>500000]['SalePrice'].sort_values())

In [None]:
rows_indexes = [1182, 1298, 1169, 224]
train = train.drop(index=rows_indexes).reset_index(drop=True)

In [None]:
train = preprocess(train)
test = preprocess(test)

train['SalePrice'] = np.log1p(train['SalePrice'])

In [None]:
final_df = pd.concat([train,test], axis = 0)

In [None]:
final_df['YrSold'] = final_df['YrSold'].astype('category')

<H3>3.6 Categorical Feature Encoding</H3>


Rare Label Encoding: This technique groups infrequent categories into a single "Rare" category, which helps to prevent overfitting to categories with very few instances.

Ordinal Encoding: This method replaces each category with an integer, with categories ordered by their mean value of the target variable, SalePrice. This ordered mapping lets the model treat the encoded feature as a ranking from lower-priced to higher-priced categories.
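
To make the two encoding steps concrete, here is a pandas-only sketch of the idea. It is a hand-rolled stand-in, not the feature_engine implementation, and the data, the tolerance `tol` and the column `Neighborhood_enc` are invented for illustration.

```python
import pandas as pd

# Invented data: neighbourhood labels and log-prices.
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "B", "C", "C", "C", "D"],
    "SalePrice":    [12.0, 12.2, 12.1, 11.5, 11.6, 11.4, 12.8, 12.9, 12.7, 13.5],
})

# Rare-label step: categories below the frequency tolerance become "Rare".
tol = 0.15
freq = toy["Neighborhood"].value_counts(normalize=True)
rare_labels = freq[freq < tol].index
toy["Neighborhood"] = toy["Neighborhood"].where(
    ~toy["Neighborhood"].isin(rare_labels), "Rare"
)

# Ordered ordinal step: rank categories by mean target, lowest first.
ordered = toy.groupby("Neighborhood")["SalePrice"].mean().sort_values().index
mapping = {label: rank for rank, label in enumerate(ordered)}
toy["Neighborhood_enc"] = toy["Neighborhood"].map(mapping)
```

Category D appears in only 10% of rows, so it is relabelled "Rare". The resulting mapping is B → 0, A → 1, C → 2, Rare → 3. The mapping should be learned on the training rows only, since it uses the target.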

In [None]:
# Rare label encoding
categorical_cols = [col for col in final_df.columns if final_df[col].dtype == 'O']

rare_encoder = RareLabelEncoder(tol=0.01, n_categories=1, replace_with='Rare', variables=categorical_cols)
final_df = rare_encoder.fit_transform(final_df)

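n_train = len(train)  # rows that came from the training set (1456 after outlier removal)
train = final_df.iloc[:n_train, :].copy()
test = final_df.iloc[n_train:, :].copy()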

# Ordinal encoding 
ordinal_encoder = OrdinalEncoder(encoding_method='ordered', variables=categorical_cols)
train = ordinal_encoder.fit_transform(train, train['SalePrice'])
test = ordinal_encoder.transform(test)
final_df.dtypes

In [None]:
# Step 1: Compute full correlation matrix
corr_matrix = train.corr(numeric_only=True)

# Step 2: Create boolean mask for |correlation| > 0.8 (excluding the diagonal)
mask = (abs(corr_matrix) > 0.8) & (corr_matrix != 1.0)

# Step 3: Get column names where any correlation exceeds 0.8
high_corr_features = mask.any(axis=0)
selected_features = corr_matrix.columns[high_corr_features]


# Step 4: Plot heatmap
plt.figure(figsize=(18, 14))
sns.heatmap(train[selected_features].corr(), annot=True, cmap='coolwarm',cbar=False)
plt.title("Correlation > 0.8 Between Any Feature Pairs")
plt.show()

<H3>3.8 Feature Selection and Multicollinearity</H3>

Multicollinearity occurs when features are highly correlated with each other, which can lead to unstable model predictions. We calculate the correlation matrix and visualize it using a heatmap to identify and remove features with a high correlation coefficient (e.g., greater than 0.8). This step ensures that our model uses a diverse set of independent variables, improving its interpretability and stability.
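
One way to list the highly correlated pairs programmatically is sketched below. The data is synthetic and the rule of dropping the second feature of each pair is an assumption for illustration. In practice, which feature to keep is a judgement call.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
garage_area = rng.normal(500, 100, n)
toy = pd.DataFrame({
    "GarageArea": garage_area,
    "GarageCars": garage_area / 250 + rng.normal(0, 0.05, n),  # near-duplicate of GarageArea
    "LotArea": rng.normal(10_000, 2_000, n),                   # unrelated feature
})

corr = toy.corr().abs()
# Keep only the upper triangle so each pair appears once and the diagonal is excluded.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high_pairs = pairs[pairs > 0.8]

# Drop the second feature of every highly correlated pair.
to_drop = sorted({second for _, second in high_pairs.index})
```

On this toy data only the (GarageArea, GarageCars) pair exceeds 0.8, so `to_drop` contains just `GarageCars`.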



In [None]:
# Drop one feature from each highly correlated pair identified in the heatmap
train = train.drop(columns=['GarageCars','TotRmsAbvGrd','Exterior2nd'], errors='ignore')
test = test.drop(columns=['GarageCars','TotRmsAbvGrd','Exterior2nd'], errors='ignore')

In [None]:
test.drop(columns=['SalePrice'],inplace=True)
test.shape

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

<H2>4) Model Training and Evaluation</H2>

<H3>4.1 Hyperparameter Tuning with Optuna</H3>

To tune the models we use Optuna, a hyperparameter optimization framework. Rather than testing every combination, Optuna chooses each new trial based on the results of earlier ones (its default sampler is the Tree-structured Parzen Estimator). We define an objective function that returns the 10-fold cross-validated Root Mean Squared Error (RMSE) for a trial's hyperparameters, and Optuna minimizes this value over 50 trials per model. This automated search saves significant time and effort compared to manual tuning.
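
The shape of such an objective can be sketched without Optuna. In the illustration below, plain random search stands in for Optuna's sampler and closed-form ridge regression stands in for the boosted models. The data is synthetic, and `cv_rmse` and `alpha` are hypothetical names. Only the pattern carries over: score one hyperparameter setting by cross-validated RMSE, repeat, keep the minimum.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic regression data (invented): 3 features, linear signal plus noise.
X = rng.normal(size=(120, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=120)

def cv_rmse(alpha, X, y, n_splits=5):
    """Mean RMSE of closed-form ridge regression over K contiguous folds."""
    folds = np.array_split(np.arange(len(y)), n_splits)
    scores = []
    for val_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[val_idx] = False
        X_tr, y_tr = X[train_mask], y[train_mask]
        # Ridge solution (X'X + alpha*I)^-1 X'y, fitted on the training folds only.
        w = np.linalg.solve(X_tr.T @ X_tr + alpha * np.eye(X.shape[1]), X_tr.T @ y_tr)
        residuals = y[val_idx] - X[val_idx] @ w
        scores.append(np.sqrt(np.mean(residuals ** 2)))
    return float(np.mean(scores))

# Random search in place of Optuna: draw alpha log-uniformly, keep the best trial.
trials = [(alpha, cv_rmse(alpha, X, y)) for alpha in 10 ** rng.uniform(-3, 3, size=30)]
best_alpha, best_rmse = min(trials, key=lambda t: t[1])
```

Note that the features passed to the scorer exclude the target. Including the target column among the features would make every trial look nearly perfect.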

In [None]:
# Define objective function for Optuna for catboost
def objective_cat(trial):
    # Define hyperparameters to optimize
    catboost_params = {
        'iterations': trial.suggest_int('iterations', 1000, 8000),
        'learning_rate': trial.suggest_float('learning_rate', 0.005, 0.08, log=True),
        'depth': trial.suggest_int('depth', 3, 7),
        'eval_metric': 'RMSE',
    }

    # Initialize model with suggested parameters
    catboost_model = CatBoostRegressor(**catboost_params, verbose=0)

    # Exclude the target (and the Id) from the features to avoid target leakage
    X = train.drop(columns=['Id', 'SalePrice'])
    y = train['SalePrice']

    # 10-fold cross-validated RMSE on the log-transformed target.
    # cross_val_score fits a fresh clone per fold, so no separate fit is needed.
    kf = KFold(n_splits=10)
    catboost_rmse = np.sqrt(-cross_val_score(catboost_model, X, y, scoring='neg_mean_squared_error', cv=kf))

    # Return average RMSE
    return np.mean(catboost_rmse)

In [None]:
# Define objective function for Optuna for xgboost
def objective_xgb(trial):
    # Define hyperparameters to optimize
    xgboost_params = {
        'n_estimators': trial.suggest_int('n_estimators', 1000, 8000),
        'learning_rate': trial.suggest_float('learning_rate', 0.005, 0.1, log=True),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.2, 0.6),
        'subsample': trial.suggest_float('subsample', 0.4, 0.8),
        'min_child_weight': trial.suggest_int('min_child_weight', 2, 5),
    }

    # Initialize model with suggested parameters
    xgb_model = XGBRegressor(**xgboost_params, verbosity=0)

    # Exclude the target (and the Id) from the features to avoid target leakage
    X = train.drop(columns=['Id', 'SalePrice'])
    y = train['SalePrice']

    # 10-fold cross-validated RMSE on the log-transformed target.
    # cross_val_score fits a fresh clone per fold, so no separate fit is needed.
    kf = KFold(n_splits=10)
    xgb_rmse = np.sqrt(-cross_val_score(xgb_model, X, y, scoring='neg_mean_squared_error', cv=kf))

    # Return average RMSE
    return np.mean(xgb_rmse)

In [None]:
# Optimize hyperparameters catboost
study_cat = optuna.create_study(direction='minimize')
study_cat.optimize(objective_cat, n_trials=50)

In [None]:
# Optimize hyperparameters xgboost
study_xgb = optuna.create_study(direction='minimize')
study_xgb.optimize(objective_xgb, n_trials=50)

<H3>4.2 Model Training and Prediction</H3>

Once the optimal hyperparameters are found, we train the XGBoost and CatBoost models on an 80% training split and hold out the remaining 20% as a validation set. Each model is wrapped in a Pipeline so that the same MinMaxScaler, fitted on the training split, is applied before both training and prediction. The validation predictions are used to evaluate each model, and the fitted pipelines then predict on the test set.
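
What "applied consistently" means for the scaler can be shown with a hand-rolled NumPy stand-in for min-max scaling. This is a sketch of the idea only. The arrays and the helper `min_max_scale` are invented, and the notebook itself relies on scikit-learn's MinMaxScaler inside the Pipeline.

```python
import numpy as np

# Invented feature matrices: two columns on very different scales.
X_train = np.array([[1000.0, 1.0], [2000.0, 3.0], [3000.0, 5.0]])
X_val = np.array([[1500.0, 2.0], [3500.0, 4.0]])

# "Fit": learn the per-column min and max from the training split only.
col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)

def min_max_scale(X):
    """Map each column to [0, 1] relative to the training range."""
    return (X - col_min) / (col_max - col_min)

X_train_scaled = min_max_scale(X_train)
X_val_scaled = min_max_scale(X_val)  # values outside the training range fall outside [0, 1]
```

The validation row with 3500 scales to 1.25 because it exceeds the training maximum. That is expected when the scaler is fitted on the training split alone, which is what fitting a Pipeline on the training data does.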



In [None]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from xgboost import XGBRegressor
from catboost import CatBoostRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# ========================
# Step 1: Prepare Data
# ========================
X = train.drop(columns=['Id', 'SalePrice'])
y = train['SalePrice']
X_test_final = test.drop(columns=['Id'])

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# ========================
# Step 2: Use Best Parameters from Optuna
# ========================
best_params_xgb = study_xgb.best_params
best_params_cat = study_cat.best_params

# Ensure random_state is fixed for reproducibility
best_params_xgb['random_state'] = 42
best_params_cat['random_state'] = 42

# ========================
# Step 3: XGBoost Pipeline
# ========================
xgb_pipeline = Pipeline([
    ('scaler', MinMaxScaler()),
    ('model', XGBRegressor(**best_params_xgb))
])

xgb_pipeline.fit(X_train, y_train)

# ========================
# Step 4: CatBoost Pipeline
# ========================
cat_pipeline = Pipeline([
    ('scaler', MinMaxScaler()),  # optional for CatBoost, but keeps consistency
    ('model', CatBoostRegressor(**best_params_cat, verbose=0))
])

cat_pipeline.fit(X_train, y_train)

# ========================
# Step 5: Predict & Evaluate (Example for XGB)
# ========================
y_val_pred_xgb = xgb_pipeline.predict(X_val)
rmse_xgb = np.sqrt(mean_squared_error(y_val, y_val_pred_xgb))
r2_xgb = r2_score(y_val, y_val_pred_xgb)

print("XGBoost RMSE:", rmse_xgb)
print("XGBoost R²:", r2_xgb)


y_val_pred_cat = cat_pipeline.predict(X_val)
rmse_cat = np.sqrt(mean_squared_error(y_val, y_val_pred_cat))
r2_cat = r2_score(y_val, y_val_pred_cat)

print("CatBoost RMSE:", rmse_cat)
print("CatBoost R²:", r2_cat)

# ========================
# Step 6: Predict on Test
# ========================
y_test_pred_xgb = xgb_pipeline.predict(X_test_final)
y_test_pred_cat = cat_pipeline.predict(X_test_final)


<H3>4.3 Performance Metrics</H3>

The models' performance is measured using the Root Mean Squared Error (RMSE) and the R-squared (R²) score. Because SalePrice was log-transformed, both are computed on the log scale.

RMSE measures the typical magnitude of the prediction errors, with larger errors weighted more heavily. A lower RMSE indicates a more accurate model.

R² represents the proportion of the variance in the dependent variable that is predictable from the independent variables. An R² of 1 indicates a perfect fit, while a value close to 0 means the model explains little of the variability.
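
Both metrics can be computed by hand, which makes their definitions explicit. The targets and predictions below are invented values on the log-price scale.

```python
import numpy as np

# Invented log-price targets and predictions.
y_true = np.array([12.0, 11.5, 12.5, 13.0])
y_pred = np.array([12.1, 11.4, 12.3, 13.2])

errors = y_true - y_pred
rmse = np.sqrt(np.mean(errors ** 2))

ss_res = np.sum(errors ** 2)                    # variation left unexplained by the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2 = 1 - ss_res / ss_tot
```

Here the squared errors sum to 0.10 and the total variation is 1.25, so RMSE = √0.025 ≈ 0.158 and R² = 1 − 0.10 / 1.25 = 0.92.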



In [None]:
# Invert the log1p transform applied to SalePrice (expm1 is the exact inverse of log1p)
y_test_pred_xgb = np.expm1(y_test_pred_xgb)
y_test_pred_cat = np.expm1(y_test_pred_cat)

ypred_xgb = pd.DataFrame(y_test_pred_xgb)
ypred_cat = pd.DataFrame(y_test_pred_cat)
result_xgb = pd.concat([sample['Id'],ypred_xgb],axis=1)
result_cat = pd.concat([sample['Id'],ypred_cat],axis=1)
result_xgb.columns = ['Id', 'SalePrice']
result_cat.columns = ['Id', 'SalePrice']
result_xgb.to_csv('Downloads/results_xgb.csv',index=False)
result_cat.to_csv('Downloads/results_cat.csv',index=False)