Getting Started

	1.	Loading the Data
	•	pd.read_csv was used to load the training and test datasets.
	•	Always keep a backup of the original data (main_df) in case things go wrong.
	2.	Handling Missing Values
	•	Why? Many columns had missing values (NaN), which can break preprocessing steps and many models. We handled them as follows:
	•	Numerical Columns (LotFrontage, MasVnrArea): Replaced missing values with the column mean, which leaves the column's average unchanged.
	•	Categorical Columns: Replaced missing values with the mode (most frequent value), a simple default that does not introduce a new category.
	•	Dropped columns with too many missing values (like Alley and PoolQC) and those not helpful for predictions (like Id), then dropped any rows that still contained missing values.
	3.	Encoding Categorical Variables
	•	Why? Machine learning models, including XGBoost, work with numbers, not text. So, categorical columns were converted to numerical ones using one-hot encoding via pd.get_dummies.
	4.	Combining Train and Test Data
	•	Why? One-hot encoding must produce the same columns for train and test data. Concatenating both before encoding means a category that appears in only one of the two sets still gets a column in both.
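
To illustrate why the two sets are concatenated before encoding, here is a minimal sketch on made-up data (the toy frames and the values 'Grvl' and 'Dirt' are assumptions for illustration, not taken from the competition files). Encoding each frame separately gives different dummy columns whenever a category occurs in only one of them; encoding the combined frame does not.

```python
import pandas as pd

# Toy frames: 'Grvl' appears only in train, 'Dirt' only in test (illustrative values)
train = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"]})
test = pd.DataFrame({"Street": ["Pave", "Dirt"]})

# Encoding each frame on its own: the resulting column sets differ
sep_train = pd.get_dummies(train)
sep_test = pd.get_dummies(test)
print(list(sep_train.columns))
print(list(sep_test.columns))

# Encoding after concatenation: both halves share one column set
combined = pd.get_dummies(pd.concat([train, test], axis=0).reset_index(drop=True))
enc_train = combined.iloc[:len(train)]
enc_test = combined.iloc[len(train):]
print(list(enc_train.columns) == list(enc_test.columns))
```

A model trained on `enc_train` can then score `enc_test` directly, because the feature columns line up one to one.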

Building the Model

	1.	Train-Validation Split
	•	The training data was separated into features (X_train) and the target variable (y_train, SalePrice).
	•	For validation, 20% of the training data (X_val and y_val) was set aside to check how the model performs on unseen data.
	2.	Using XGBoost
	•	Why XGBoost?
	•	It’s fast and works well with tabular data.
	•	It handles missing values automatically (a big plus here).
	•	It’s ensemble-based, meaning it combines predictions from many small decision trees for better accuracy.
	•	Built-in features like regularization help avoid overfitting.
	•	The XGBRegressor was used because we’re solving a regression problem (predicting house prices).
	3.	Random State = 42
	•	Why? It makes the train-validation split and the model's internal randomness the same on every run, so results are reproducible.
	•	42 is a commonly used seed value (popularized as a reference to “The Hitchhiker’s Guide to the Galaxy”), but any integer works.
	4.	Model Training
	•	Calling classifier.fit() trained the model on X_train and y_train. (The variable is named classifier in the code, although it holds an XGBRegressor.)
	5.	Validation
	•	Predicted house prices (y_pred) for the validation set.
	•	Evaluated the performance using RMSE (Root Mean Squared Error), a common metric for regression problems.
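
The effect of fixing the seed can be checked with a small sketch. The arrays below are toy data invented for the example; only `train_test_split`, `test_size=0.2` and `random_state=42` come from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 rows, 2 features (illustrative only)
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two independent calls with the same seed
_, X_val_a, _, y_val_a = train_test_split(X, y, test_size=0.2, random_state=42)
_, X_val_b, _, y_val_b = train_test_split(X, y, test_size=0.2, random_state=42)

# The validation rows are identical across the two calls
same_seed_matches = np.array_equal(y_val_a, y_val_b) and np.array_equal(X_val_a, X_val_b)
print(same_seed_matches)
print(len(y_val_a))  # 20% of 10 rows
```

Without a fixed `random_state`, each call would be free to pick different validation rows, which makes scores harder to compare between runs.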

Final Steps

	1.	Predictions
	•	Used the trained model to predict prices for the test dataset (df_test).
	2.	Submission File
	•	Created a .csv file for submission with Id and SalePrice, following Kaggle’s format.

In [1]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import xgboost
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [2]:
# Load datasets
df = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')
test_df = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/test.csv')

# Backup the original data
main_df = df.copy()

In [3]:
# Handling Missing Values
def fill_missing(df, col, method="mean"):
    if method == "mean":
        df[col] = df[col].fillna(df[col].mean())
    elif method == "mode":
        df[col] = df[col].fillna(df[col].mode()[0])
    return df
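
As a quick check of what the two branches of fill_missing do, the sketch below applies the same pandas expressions to a tiny frame. The values are made up; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column (illustrative values)
toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0],
    "GarageType": ["Attchd", None, "Attchd"],
})

# "mean" branch: the gap becomes the average of the observed values
toy["LotFrontage"] = toy["LotFrontage"].fillna(toy["LotFrontage"].mean())

# "mode" branch: the gap becomes the most frequent observed value
toy["GarageType"] = toy["GarageType"].fillna(toy["GarageType"].mode()[0])

print(toy)
```

Note that `mode()` returns a Series because several values can tie for most frequent; `[0]` takes the first of them.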

In [4]:
# Numerical columns
num_cols = ['LotFrontage', 'MasVnrArea']
for col in num_cols:
    df = fill_missing(df, col, method="mean")

# Categorical columns
cat_cols = ['BsmtCond', 'BsmtQual', 'FireplaceQu', 'GarageType', 'GarageFinish',
            'GarageQual', 'GarageCond', 'BsmtExposure', 'BsmtFinType2', 'MasVnrType']
for col in cat_cols:
    df = fill_missing(df, col, method="mode")

In [5]:
# Dropping unnecessary columns
columns_to_drop = ['Alley', 'GarageYrBlt', 'PoolQC', 'Fence', 'MiscFeature', 'Id']
df.drop(columns=columns_to_drop, inplace=True)  # columns= already implies axis=1

# Drop rows with any remaining missing values
df.dropna(inplace=True)

In [6]:
# Handling Test Data
for column in test_df.columns:
    if test_df[column].isnull().sum() > 0:  # Check if the column has missing values
        if test_df[column].dtype in ['int64', 'float64']:  # Numerical columns
            test_df[column] = test_df[column].fillna(test_df[column].mean())
        else:  # Categorical columns
            test_df[column] = test_df[column].fillna(test_df[column].mode()[0])
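
The cell above fills gaps in the test set with statistics computed from the test set itself. A common alternative, shown here only as a sketch and not as what this notebook does, is to reuse the training set's mean and mode so that both sets are imputed with the same values. The toy frames are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy train/test frames (illustrative values)
train = pd.DataFrame({"LotFrontage": [60.0, 80.0, 100.0], "MSZoning": ["RL", "RL", "RM"]})
test = pd.DataFrame({"LotFrontage": [np.nan, 50.0], "MSZoning": [None, "RM"]})

for column in test.columns:
    if test[column].isnull().any():
        if pd.api.types.is_numeric_dtype(test[column]):
            # Numerical column: use the mean observed in the training data
            fill_value = train[column].mean()
        else:
            # Categorical column: use the most frequent training value
            fill_value = train[column].mode()[0]
        test[column] = test[column].fillna(fill_value)

print(test)
```

`pd.api.types.is_numeric_dtype` is used instead of comparing against `'int64'` and `'float64'` so that other numeric dtypes are also treated as numerical.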

In [7]:
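# Combining train and test data
# Drop the same columns from the test set that were dropped from the training set,
# so both frames share one schema (test_df itself keeps Id for the submission file)
final_df = pd.concat([df, test_df.drop(columns=columns_to_drop)], axis=0).reset_index(drop=True)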

In [8]:
# Categorical Encoding
def category_onehot_multicols(multicolumns):
    """One-hot encode the given columns of the global final_df and return the new frame."""
    # Encode each categorical column on its own, dropping the first level of each
    # to avoid perfectly collinear dummy columns
    dummies = [pd.get_dummies(final_df[field], drop_first=True) for field in multicolumns]
    # Keep the non-encoded columns and append the dummy columns on the right
    remaining = final_df.drop(columns=multicolumns)
    return pd.concat([remaining] + dummies, axis=1)

In [9]:
# Identifying categorical columns for encoding
columns = [col for col in final_df.columns if final_df[col].dtype == 'object']
final_df = category_onehot_multicols(columns)

In [10]:
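# Remove duplicate columns
# Caveat: get_dummies was called without a prefix, so levels that share a name across
# different features (e.g. 'TA' or 'Gd') collide here and only the first one is kept
final_df = final_df.loc[:, ~final_df.columns.duplicated()]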
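
Because the encoding function calls pd.get_dummies on one column at a time without a prefix, dummy columns from different features can end up with the same name, and removing duplicates then discards all but the first. The sketch below reproduces this on a toy frame (values are made up) and shows one possible way around it, passing the frame with `columns=` so that pandas prefixes each dummy with its source column. This is an alternative for comparison, not the notebook's method.

```python
import pandas as pd

# Toy frame: two quality features that share the levels 'Gd' and 'TA'
toy = pd.DataFrame({
    "ExterQual": ["TA", "Gd", "TA"],
    "BsmtQual": ["Gd", "TA", "Gd"],
})

# Per-column encoding without a prefix: both features yield a column named 'TA'
no_prefix = pd.concat(
    [pd.get_dummies(toy[col], drop_first=True) for col in toy.columns], axis=1
)
deduped = no_prefix.loc[:, ~no_prefix.columns.duplicated()]
print(list(no_prefix.columns), "->", list(deduped.columns))

# Encoding with automatic prefixes keeps the two features apart
prefixed = pd.get_dummies(toy, columns=["ExterQual", "BsmtQual"], drop_first=True)
print(list(prefixed.columns))
```

With prefixed names there are no duplicates to remove, so the deduplication step would no longer drop any information.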

In [11]:
# Splitting back into train and test
df_train = final_df.iloc[:len(df), :]
df_test = final_df.iloc[len(df):, :]

In [12]:
# Drop SalePrice from test data
# Reassign instead of using inplace=True: df_test is a slice of final_df, and an
# in-place drop on a slice triggers pandas' SettingWithCopyWarning
if 'SalePrice' in df_test.columns:
    df_test = df_test.drop(columns=['SalePrice'])


In [13]:
# Splitting features and target
X_train = df_train.drop(['SalePrice'], axis=1)
y_train = df_train['SalePrice']

# Train-Test Split for Validation
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

In [14]:
# XGBoost Model
classifier = xgboost.XGBRegressor(random_state=42)
classifier.fit(X_train, y_train)

# Validate the model
y_pred = classifier.predict(X_val)
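# Take the square root explicitly; the squared=False argument is deprecated
# and has been removed in recent scikit-learn releases
rmse = np.sqrt(mean_squared_error(y_val, y_pred))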
print(f"Validation RMSE: {rmse}")

Validation RMSE: 25544.550188057874
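
RMSE is the square root of the mean squared difference between actual and predicted prices, so it is expressed in the same unit as SalePrice. The sketch below computes it by hand on three made-up prices and compares the result with scikit-learn; the numbers are illustrative and unrelated to the score above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy actual and predicted prices (illustrative values)
y_true = np.array([200000.0, 150000.0, 300000.0])
y_hat = np.array([210000.0, 140000.0, 320000.0])

# By hand: errors are 10000, 10000 and 20000
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# Same quantity via scikit-learn's MSE plus an explicit square root
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

print(f"{rmse_manual:.2f}")  # about 14142.14
print(np.isclose(rmse_manual, rmse_sklearn))
```

Because the errors are squared before averaging, the single 20000 miss contributes more to the result than the two 10000 misses combined.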


In [15]:
# Final Predictions
final_predictions = classifier.predict(df_test)

In [16]:
# Creating submission file
submission = pd.DataFrame({'Id': test_df['Id'], 'SalePrice': final_predictions})
submission.to_csv('submission.csv', index=False)
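
Before uploading, it can be worth reading the file back and confirming it has the expected shape. The following is a minimal sketch that uses an in-memory buffer and made-up rows in place of the real submission.csv; the specific checks are suggestions, not requirements stated by the notebook.

```python
import io
import pandas as pd

# Toy submission standing in for the real one (illustrative values)
submission = pd.DataFrame({
    "Id": [1461, 1462, 1463],
    "SalePrice": [120000.0, 155000.5, 180250.0],
})

# Write to an in-memory buffer instead of disk, then read it back
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
buffer.seek(0)
check = pd.read_csv(buffer)

# Basic checks: expected header, no missing predictions, unique Ids
header_ok = list(check.columns) == ["Id", "SalePrice"]
no_missing = not check["SalePrice"].isna().any()
ids_unique = check["Id"].is_unique
print(header_ok, no_missing, ids_unique)
```

Passing `index=False` matters here: without it, pandas would write the row index as an extra unnamed first column.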