# **Predicting House Prices**

# Introduction

This is a Kaggle Machine Learning competition designed for Kaggle Learn users. The goal is to predict house prices based on a variety of features provided in the training and test datasets. We'll use multiple regression models to make predictions and evaluate which model performs best.

Since the test dataset does not include the `SalePrice` column (as expected in most real-world scenarios), we will simulate this setup by splitting the original training data into **training and validation subsets**. This allows us to train models on one portion and evaluate their performance on unseen validation data before making final predictions on the actual test set.

The dataset includes nearly 80 columns, so identifying the most relevant features that impact the sale price is essential. We'll start by exploring the data, understanding the meaning of each column, and handling missing values. Any columns with excessive missing data will be dropped to maintain data quality.

Feature engineering will be a key part of this project. It helps us avoid issues like overfitting or underfitting. Once we build and evaluate our models using RMSE (Root Mean Squared Error), we’ll refine our features and finalize the best model for prediction.
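
As a quick illustration of the metric, here is a minimal sketch of how RMSE can be computed by hand with NumPy. The `rmse` helper and the toy prices are made up for this example and are not part of the competition data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the mean squared difference."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy prices in dollars (made-up numbers, for illustration only)
actual = [200000, 150000, 320000]
predicted = [210000, 140000, 300000]

rmse_toy = rmse(actual, predicted)
print(f"Toy RMSE: {rmse_toy:.2f}")
```

Because the errors are squared before averaging, one large miss (20,000 here) pulls RMSE up more than several small ones would.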

Let’s get started by diving into the data!

# Importing Libraries and Dataset
In this project, we’ll be using multiple regression models such as Lasso, Ridge, Decision Tree, Random Forest, and XGBoost to predict house prices. To support data preprocessing, model training, and evaluation, we’ll primarily rely on libraries from Scikit-learn `sklearn`, along with a few other essential Python libraries.

In [None]:
import pandas as pd #data manipulation
import numpy as np #numerical analysis
import matplotlib.pyplot as plt #basic data visualization
import seaborn as sns #advanced statistical visualization
%matplotlib inline

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

from sklearn.linear_model import Lasso, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

Great! Now that we have the required libraries, it's time to load the datasets. Since the data is available in the Kaggle input directory, we can read it directly from there.

In [None]:
#import the data set and read them into pandas data frame
file_path_train = '/kaggle/input/home-data-for-ml-course/train.csv'
file_path_test = '/kaggle/input/home-data-for-ml-course/test.csv'

train_df = pd.read_csv(file_path_train)
test_df = pd.read_csv(file_path_test)

train_df.head()

Great! With the train and test datasets loaded, it's time to analyze the data. Let's look at each column and understand what it actually means; this will help us think logically when performing feature engineering.

Note - A data description file in text format is also provided. Make sure to read it thoroughly to understand the data. This is very important, because if we don't know what we're doing, there's no point in doing it! lol!!!

# Data Analysis

## House Prices Dataset - Feature Description

| S.No | Column         | Description                                                  |
|------|----------------|--------------------------------------------------------------|
| 1    | SalePrice      | The property's sale price in dollars. Target variable.       |
| 2    | MSSubClass     | The building class                                           |
| 3    | MSZoning       | The general zoning classification                            |
| 4    | LotFrontage    | Linear feet of street connected to property                  |
| 5    | LotArea        | Lot size in square feet                                      |
| 6    | Street         | Type of road access                                          |
| 7    | Alley          | Type of alley access                                         |
| 8    | LotShape       | General shape of property                                    |
| 9    | LandContour    | Flatness of the property                                     |
| 10   | Utilities      | Type of utilities available                                  |
| 11   | LotConfig      | Lot configuration                                            |
| 12   | LandSlope      | Slope of property                                            |
| 13   | Neighborhood   | Physical locations within Ames city limits                   |
| 14   | Condition1     | Proximity to main road or railroad                           |
| 15   | Condition2     | Proximity to main road or railroad (if a second is present)  |
| 16   | BldgType       | Type of dwelling                                             |
| 17   | HouseStyle     | Style of dwelling                                            |
| 18   | OverallQual    | Overall material and finish quality                          |
| 19   | OverallCond    | Overall condition rating                                     |
| 20   | YearBuilt      | Original construction date                                   |
| 21   | YearRemodAdd   | Remodel date                                                 |
| 22   | RoofStyle      | Type of roof                                                 |
| 23   | RoofMatl       | Roof material                                                |
| 24   | Exterior1st    | Exterior covering on house                                   |
| 25   | Exterior2nd    | Exterior covering on house (if more than one material)       |
| 26   | MasVnrType     | Masonry veneer type                                          |
| 27   | MasVnrArea     | Masonry veneer area in square feet                           |
| 28   | ExterQual      | Exterior material quality                                    |
| 29   | ExterCond      | Present condition of the material on the exterior            |
| 30   | Foundation     | Type of foundation                                           |
| 31   | BsmtQual       | Height of the basement                                       |
| 32   | BsmtCond       | General condition of the basement                            |
| 33   | BsmtExposure   | Walkout or garden level basement walls                       |
| 34   | BsmtFinType1   | Quality of basement finished area                            |
| 35   | BsmtFinSF1     | Type 1 finished square feet                                  |
| 36   | BsmtFinType2   | Quality of second finished area (if present)                 |
| 37   | BsmtFinSF2     | Type 2 finished square feet                                  |
| 38   | BsmtUnfSF      | Unfinished square feet of basement area                      |
| 39   | TotalBsmtSF    | Total square feet of basement area                           |
| 40   | Heating        | Type of heating                                              |
| 41   | HeatingQC      | Heating quality and condition                                |
| 42   | CentralAir     | Central air conditioning                                     |
| 43   | Electrical     | Electrical system                                            |
| 44   | 1stFlrSF       | First Floor square feet                                      |
| 45   | 2ndFlrSF       | Second floor square feet                                     |
| 46   | LowQualFinSF   | Low quality finished square feet (all floors)                |
| 47   | GrLivArea      | Above grade (ground) living area square feet                 |
| 48   | BsmtFullBath   | Basement full bathrooms                                      |
| 49   | BsmtHalfBath   | Basement half bathrooms                                      |
| 50   | FullBath       | Full bathrooms above grade                                   |
| 51   | HalfBath       | Half baths above grade                                       |
| 52   | BedroomAbvGr   | Number of bedrooms above basement level                      |
| 53   | KitchenAbvGr   | Number of kitchens                                           |
| 54   | KitchenQual    | Kitchen quality                                              |
| 55   | TotRmsAbvGrd   | Total rooms above grade (excluding bathrooms)                |
| 56   | Functional     | Home functionality rating                                    |
| 57   | Fireplaces     | Number of fireplaces                                         |
| 58   | FireplaceQu    | Fireplace quality                                            |
| 59   | GarageType     | Garage location                                              |
| 60   | GarageYrBlt    | Year garage was built                                        |
| 61   | GarageFinish   | Interior finish of the garage                                |
| 62   | GarageCars     | Size of garage in car capacity                               |
| 63   | GarageArea     | Size of garage in square feet                                |
| 64   | GarageQual     | Garage quality                                               |
| 65   | GarageCond     | Garage condition                                             |
| 66   | PavedDrive     | Paved driveway                                               |
| 67   | WoodDeckSF     | Wood deck area in square feet                                |
| 68   | OpenPorchSF    | Open porch area in square feet                               |
| 69   | EnclosedPorch  | Enclosed porch area in square feet                           |
| 70   | 3SsnPorch      | Three season porch area in square feet                       |
| 71   | ScreenPorch    | Screen porch area in square feet                             |
| 72   | PoolArea       | Pool area in square feet                                     |
| 73   | PoolQC         | Pool quality                                                 |
| 74   | Fence          | Fence quality                                                |
| 75   | MiscFeature    | Miscellaneous feature not covered in other categories        |
| 76   | MiscVal        | Dollar value of miscellaneous feature                        |
| 77   | MoSold         | Month Sold                                                   |
| 78   | YrSold         | Year Sold                                                    |
| 79   | SaleType       | Type of sale                                                 |
| 80   | SaleCondition  | Condition of sale                                            |
| 81   | Id             | Index                                                        |

Note - Read the `data_description.txt` file for a full understanding of the features.

Wow, it’s amazing how much there is to learn! Until now, I had no idea that so many features are considered when determining the value of a house. This makes the project even more exciting as we dig deeper into the data.
I’m especially interested in analyzing how different features impact the sale price—understanding this could be incredibly valuable when I’m working with a real estate agent in the future. With this knowledge, I’ll be in a much better position to evaluate homes and negotiate a fair price for my dream house.

As we can see in our train data frame, some values are missing (NaN). Let's look at each column that has missing values and analyze it. Before that, let's quickly check how many data points our dataset has.

In [None]:
train_df.shape

We have 1460 rows and 81 columns

In [None]:
missing_values = train_df.isnull().sum()
missing_values = missing_values[missing_values > 0]
missing_values = missing_values.sort_values(ascending = False)
print(missing_values)

Ohh! Okay, in total we have 19 columns with missing values. Among them, the columns `PoolQC`, `MiscFeature`, `Alley`, `Fence`, `MasVnrType`, `FireplaceQu`, and `LotFrontage` have many missing values. Now let's analyze them one by one.

1. `PoolQC` - This column represents the quality of the pool, but the results show that 1453 of the 1460 houses don't have a pool. Filling these missing values with the most frequent value from the remaining 7 houses would misrepresent almost every house and could badly distort the sale price predictions. So let's drop the `PoolQC` column.
2. `MiscFeature` - This column represents any other features the home has, such as an elevator. The results show 1406 missing values among 1460 houses, so let's drop this column as well.
3. `Alley` - This column represents the passage to the back of the home. The results show that 1369 of the 1460 houses don't have an alley, so let's drop this column as well.
4. `Fence` - 1179/1460 missing values
5. `MasVnrType` - 872/1460 missing values
6. `FireplaceQu` - 690/1460 missing values
7. `LotFrontage` - fewer missing values than the columns above, but it is dropped as well in the code below
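
The same decision can also be expressed as a rule instead of a hand-picked list. Below is a minimal sketch, assuming a hypothetical cut-off (`threshold = 0.4`) and a tiny made-up frame standing in for `train_df`; the right cut-off for the real data is a judgment call.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for train_df (values are illustrative only)
toy_df = pd.DataFrame({
    'PoolQC': [np.nan, np.nan, np.nan, 'Gd'],
    'LotArea': [8450, 9600, 11250, 9550],
    'GarageType': ['Attchd', np.nan, 'Detchd', 'Attchd'],
})

# Fraction of missing values per column
missing_ratio = toy_df.isnull().mean()

# Hypothetical cut-off: drop columns with more than 40% missing values
threshold = 0.4
cols_to_drop = missing_ratio[missing_ratio > threshold].index.tolist()

reduced_df = toy_df.drop(columns=cols_to_drop)
print(cols_to_drop)
```

Columns that stay below the cut-off (like `GarageType` here) are kept and can be imputed later in the preprocessing pipeline.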

Now that we've identified the columns to drop, it's time to split the dataset into training and validation subsets for model evaluation.  

But before we do that, let's take a closer look at the distribution of the **SalePrice** column — our target variable. In housing data, the sale price is often **right-skewed**, meaning there are a few extremely high-priced properties that can distort the overall distribution.

To ensure better model performance, especially for linear models, it's important to **normalize the target variable**. We’ll examine the distribution and apply a log transformation if needed to reduce skewness and bring the data closer to a normal distribution.

In [None]:
from scipy.stats import skew

plt.figure(figsize = (10,4))
sns.histplot(train_df['SalePrice'], kde = True, bins = 30)
plt.title('Distribution of Sale Price')
plt.xlabel('Sale Price')
plt.ylabel('Frequency')
plt.show()

#calculating skewness
salesprice_skew = skew(train_df['SalePrice'])
print(f"Sale Price Skewness: {salesprice_skew: .2f}")

Yoo! So our sale price data is right-skewed, with most houses priced between 100,000 and 200,000 dollars.

House prices are not normally distributed; they are right-skewed, meaning:
1. Most homes are moderately priced
2. A few are extremely expensive (outliers)
3. The distribution is not symmetric

Machine learning models (especially linear ones like Ridge or Lasso) often work better when the target variable is closer to a normal distribution. So let's apply the log transformation `np.log1p(SalePrice)`, which reduces the effect of outliers, helps models learn better relationships, and can improve RMSE and overall model performance.

Interpretation:
- Skewness = 0 → Symmetric distribution (a normal distribution has skewness 0)
- Skewness > 0 → Right-skewed (long tail on the right)
- Skewness < 0 → Left-skewed (long tail on the left)
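
To see the effect of the transformation in isolation, here is a small sketch on synthetic data. The log-normal "prices" are generated for illustration and are not the real `SalePrice` column; the sketch also shows that `np.expm1` undoes `np.log1p`, which is how log-scale predictions are converted back to dollars.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)

# Synthetic right-skewed "prices" (log-normal), not the real SalePrice column
prices = rng.lognormal(mean=12.0, sigma=0.4, size=2000)

# log1p computes log(1 + x); expm1 is its exact inverse
log_prices = np.log1p(prices)
recovered = np.expm1(log_prices)

skew_before = skew(prices)
skew_after = skew(log_prices)

print(f"Skewness before: {skew_before:.2f}")
print(f"Skewness after:  {skew_after:.2f}")
```

On data like this, the skewness after the transformation should be close to zero, while the raw values stay clearly right-skewed.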

In [None]:
#define X and y values to split
X = train_df.drop(['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'MasVnrType', 'FireplaceQu', 'LotFrontage', 'SalePrice'], axis = 1)
y = np.log1p(train_df['SalePrice'])

Now, let's split the train data into train and validation subsets, with a validation split of 30% and a random state of 42.

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.3, random_state = 42)

Wonderful! We have our train and validation datasets. Before proceeding, let's look at the categorical columns. These are the tricky ones in machine learning: unlike numeric variables, they cannot be passed to a model directly, because machine learning models require numeric inputs. So let's encode them. There are three main encoding styles: One-Hot Encoding, Label Encoding, and Ordinal Encoding. Let's understand them one by one.

1. `OneHotEncoding`: Creates a new binary (0/1) column for each unique category in the original column. This works best for linear models (such as Ridge and Lasso regression).
2. `Label Encoding`: Assigns an arbitrary integer (0, 1, 2, ...) to each unique category in the column. This works well for tree-based regression models (Decision Tree, Random Forest, XGBoost).
3. `Ordinal Encoding`: Assigns integers according to a ranking, such as Excellent > Good > Poor > Worst. This is best for categories with a real order.

So, for this project we will use One-Hot Encoding for the Ridge and Lasso regression models and label-style integer encoding for the tree models (implemented later with scikit-learn's `OrdinalEncoder`, because `LabelEncoder` is designed for target labels rather than feature columns).
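
The difference between these encodings is easiest to see on a toy column. The sketch below uses a made-up four-row `KitchenQual` column; the explicit ranking `TA < Gd < Ex` passed to `OrdinalEncoder` is an assumption chosen for this example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Made-up sample of a quality column (illustrative values only)
toy = pd.DataFrame({'KitchenQual': ['Gd', 'TA', 'Ex', 'TA']})

# One-hot: one binary column per category (categories are sorted alphabetically)
onehot = OneHotEncoder(handle_unknown='ignore')
onehot_matrix = onehot.fit_transform(toy).toarray()

# Label-style integers: OrdinalEncoder with default (alphabetical) category order
label_like = OrdinalEncoder()
label_codes = label_like.fit_transform(toy).ravel()

# True ordinal encoding: the ranking is supplied explicitly (assumed: TA < Gd < Ex)
ordinal = OrdinalEncoder(categories=[['TA', 'Gd', 'Ex']])
ordinal_codes = ordinal.fit_transform(toy).ravel()

print(onehot.categories_[0])   # column order of the one-hot matrix
print(onehot_matrix)
print(label_codes)             # integers follow alphabetical order
print(ordinal_codes)           # integers follow the supplied ranking
```

With the default order, `Ex` gets the lowest code simply because it sorts first alphabetically, which is why an explicit ranking matters whenever the order carries meaning.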

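Now what about the numeric variables? Numeric variables need their missing values imputed and, for the linear models, standardized, as the preprocessing pipelines below do.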

# Model Development

In [None]:
#split the numeric and categorical features
numeric_features = X.select_dtypes(include = ['number']).columns
categorical_features = X.select_dtypes(include = ['object', 'category']).columns

## Data preprocessing pipeline for linear models (Ridge and Lasso)

In [None]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numeric_features_pipeline = Pipeline(steps = [
    ('Imputer', SimpleImputer(strategy = 'median')),
    ('Scaler', StandardScaler())
])

categorical_features_pipeline = Pipeline(steps = [
    ('Imputer', SimpleImputer(strategy = 'most_frequent')),
    ('OneHot', OneHotEncoder(handle_unknown = 'ignore'))
])

In [None]:
#let's combine both the transformed feature pipelines into one using column transformer
preprocessor = ColumnTransformer(transformers = [
    ('numeric', numeric_features_pipeline, numeric_features),
    ('categorical', categorical_features_pipeline, categorical_features)
])

## Create a full modeling pipeline

Now we’ll combine the preprocessor (for data cleaning and transformation) with a regression model (Ridge). This way, everything from preprocessing to prediction happens in one go!

## Ridge Regression

In [None]:
model_pipeline = Pipeline(steps = [
    ('preprocessor', preprocessor),
    ('model', Ridge())
])

### Model tuning and validation
Tuning and validation help us get better performance and avoid overfitting.

Now instead of just `.fit()`, let's use:
- Step 1: `cross_val_score()` – Quick performance estimate
- Step 2: `GridSearchCV` – For best model & hyperparameter tuning

Let’s go with Step 1 first for quick validation, and we’ll do Step 2 right after.


In [None]:
from sklearn.model_selection import cross_val_score, KFold

#the scorer returns negated RMSE because sklearn scorers follow a higher-is-better convention
scores = cross_val_score(model_pipeline, X_train, y_train, cv = 5, 
                         scoring = 'neg_root_mean_squared_error')

#print avg RMSE across folds
print('Avg RMSE:', -np.mean(scores))
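
The cell above imports `KFold` but passes `cv = 5` directly. As a hedged sketch of what an explicit splitter could look like, the example below uses a shuffled, seeded `KFold` on synthetic data (the `X_toy` and `y_toy` arrays are generated here and are not the housing data), and shows that the scorer returns negative values that must be flipped to read as RMSE.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic regression data (not the housing dataset)
X_toy = rng.normal(size=(200, 5))
true_coefs = np.array([1.5, -2.0, 0.0, 0.5, 3.0])
y_toy = X_toy @ true_coefs + rng.normal(scale=0.1, size=200)

# Shuffled folds with a fixed seed give reproducible splits
cv = KFold(n_splits=5, shuffle=True, random_state=42)

neg_scores = cross_val_score(Ridge(alpha=1.0), X_toy, y_toy, cv=cv,
                             scoring='neg_root_mean_squared_error')

# Scores are negated RMSE values, so flip the sign before reporting
rmse_per_fold = -neg_scores
print('RMSE per fold:', rmse_per_fold)
print('Avg RMSE:', rmse_per_fold.mean())
```

Passing an integer to `cv` for a regressor uses unshuffled folds, so an explicit splitter is mainly useful when the row order of the data might carry a pattern.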

### Hyperparameter tuning with GridSearchCV

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'model__alpha': [0.01, 0.1, 1, 10, 50, 100]
}

grid_search = GridSearchCV(model_pipeline, param_grid,
                          cv = 5,
                          scoring = 'neg_root_mean_squared_error',
                          n_jobs = -1)
grid_search.fit(X_train, y_train)

#print best param and score
print('Best Parameter:', grid_search.best_params_)
print('Best CV RMSE:', -grid_search.best_score_)

Yoo! We did it, but is this the best model? We can answer that once we check the others. Since we went through the steps one by one above, let's move quickly through the next linear model, Lasso.

Note that we have to start again from defining the model pipeline.

## Lasso Regression

In [None]:
import warnings
warnings.filterwarnings('ignore')

model_pipeline_lasso = Pipeline(steps = [
    ('preprocessor', preprocessor),
    ('model', Lasso(max_iter = 1000))
])

#model tuning and validation
scores = cross_val_score(model_pipeline_lasso, X_train, y_train, cv = 6,
                        scoring = 'neg_root_mean_squared_error')

print('Avg RMSE lasso: ', -np.mean(scores))

In [None]:
#now let's add hyperparameters and perform grid search
param_grid = {
    'model__alpha': [0.01, 0.1, 1, 10, 50, 100]
}

grid_search_lasso = GridSearchCV(model_pipeline_lasso, param_grid,
                                cv = 5,
                                scoring = 'neg_root_mean_squared_error',
                                n_jobs = -1)

grid_search_lasso.fit(X_train, y_train)

print('Best Parameter: ', grid_search_lasso.best_params_)
print('Best CV RMSE: ', -grid_search_lasso.best_score_)

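Hurray!!! In this run, the grid search selected `alpha = 100` as the best value for Lasso. It is worth checking this against the printed CV RMSE, because a large `alpha` can shrink every Lasso coefficient to zero.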
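
To get a feel for how strongly `alpha` affects Lasso when the target is on a log scale, here is a small sketch on synthetic data. The features, coefficients, and the log-price-like target are all made up for illustration; only the `alpha` values mirror the kind of grid used above.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic standardized features and a small-scale target (similar in size to log prices)
X_toy = StandardScaler().fit_transform(rng.normal(size=(300, 4)))
y_toy = 12.0 + 0.3 * X_toy[:, 0] - 0.2 * X_toy[:, 1] + rng.normal(scale=0.05, size=300)

# Count how many coefficients survive (are non-zero) for each alpha
nonzero_counts = {}
for alpha in [0.01, 0.1, 1, 10, 100]:
    model = Lasso(alpha=alpha, max_iter=10000).fit(X_toy, y_toy)
    nonzero_counts[alpha] = int(np.sum(model.coef_ != 0))

print(nonzero_counts)
```

When every coefficient is zero, the model predicts the mean of the target for every house, so it can help to include smaller values such as 0.001 or 0.0001 in the grid for a log-transformed target.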

It's time to dive into the other models. Let's try the tree models and see how our data responds to them as well.

## Decision Tree Regressor
Remember, we have to create a new data preprocessing pipeline here. As planned, let's encode the categorical variables as integers (using `OrdinalEncoder`, which plays the role of label encoding for feature columns). For tree models, we don't need to standardize the numeric values.

In [None]:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeRegressor

#pipeline for tree models for data processing
numeric_features_pipeline_tree = Pipeline(steps = [
    ('Imputer', SimpleImputer(strategy = 'median')),
])

categorical_features_pipeline_tree = Pipeline(steps=[
    ('Imputer', SimpleImputer(strategy = 'most_frequent')),
    ('Ordinal', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1))
])

#combine both 
preprocessor_tree = ColumnTransformer(transformers = [
    ('numeric', numeric_features_pipeline_tree, numeric_features),
    ('categorical', categorical_features_pipeline_tree, categorical_features)
])

model_pipeline_DT = Pipeline(steps = [
    ('preprocessor', preprocessor_tree),
    ('model', DecisionTreeRegressor())
])

In [None]:
#Evaluate baseline decision tree with cross-validation
scores_DT = cross_val_score(model_pipeline_DT, X_train, y_train,
                           cv = 5,
                           scoring = 'neg_root_mean_squared_error')

rmse_scores_DT = -scores_DT
print('RMSE score: ', rmse_scores_DT)
print('Avg RMSE: ', rmse_scores_DT.mean())

Let's tune a few key parameters:
- max_depth: Maximum depth of the tree
- min_samples_split: Minimum samples required to split an internal node
- min_samples_leaf: Minimum samples required to be at a leaf node

In [None]:
param_grid_DT = {
    'model__max_depth': [5, 10, 20, None],
    'model__min_samples_split': [3, 7, 15],
    'model__min_samples_leaf': [2, 4, 6]
}

grid_search_DT = GridSearchCV(model_pipeline_DT, param_grid_DT,
                             cv = 5,
                             scoring = 'neg_root_mean_squared_error', n_jobs = -1)

grid_search_DT.fit(X_train, y_train)

#best_score_ is the negated RMSE, so flip the sign to report RMSE
print('Best Parameters', grid_search_DT.best_params_)
print('Best CV RMSE', -grid_search_DT.best_score_)

## Random Forest Regressor

In [None]:
model_pipeline_RF = Pipeline(steps = [
    ('preprocessor', preprocessor_tree),
    ('model', RandomForestRegressor())
])

param_grid_RF = {
    'model__n_estimators': [100, 200],  # Number of trees in the forest
    'model__max_depth': [None, 10, 20],  # Max depth of each tree
    'model__min_samples_split': [2, 5],  # Min samples to split an internal node
    'model__min_samples_leaf': [1, 2],   # Min samples at a leaf node
    'model__max_features': [1.0, 'sqrt']  # Features considered at each split (1.0 = all; 'auto' was removed in scikit-learn 1.3)
}

grid_search_RF = GridSearchCV(model_pipeline_RF, param_grid_RF,
                             cv = 5,
                             scoring = 'neg_root_mean_squared_error', n_jobs = -1)

grid_search_RF.fit(X_train, y_train)

#best_score_ is the negated RMSE, so flip the sign to report RMSE
print('Best Parameters', grid_search_RF.best_params_)
print('Best CV RMSE', -grid_search_RF.best_score_)

Perfect, we got a good score! Now, finally, let's do the same for XGBoost.

## XGBoost Regressor

In [None]:
model_pipeline_XGB = Pipeline(steps = [
    ('preprocessor', preprocessor_tree),
    ('model', XGBRegressor(objective='reg:squarederror', random_state=42))
])

param_grid_XGB = {
    'model__n_estimators': [100, 200],
    'model__max_depth': [3, 5, 10],
    'model__learning_rate': [0.01, 0.1, 0.3],
    'model__subsample': [0.8, 1.0],
    'model__colsample_bytree': [0.8, 1.0]
}

grid_search_XGB = GridSearchCV(model_pipeline_XGB, param_grid_XGB,
                             cv = 5,
                             scoring = 'neg_root_mean_squared_error', n_jobs = -1)

grid_search_XGB.fit(X_train, y_train)

#best_score_ is the negated RMSE, so flip the sign to report RMSE
print('Best Parameters', grid_search_XGB.best_params_)
print('Best CV RMSE', -grid_search_XGB.best_score_)
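
With several grid searches fitted, one way to compare them side by side is to collect each search's best CV RMSE into a single table. The sketch below does this on synthetic data with two small searches; the `searches` dictionary and the `comparison` frame are hypothetical names for this example, and the same loop could be pointed at the fitted `grid_search_*` objects from the cells above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(7)

# Synthetic linear data (not the housing dataset)
X_toy = rng.normal(size=(150, 4))
y_toy = X_toy @ np.array([1.0, -0.5, 0.25, 0.0]) + rng.normal(scale=0.1, size=150)

# Hypothetical collection of searches to compare
searches = {
    'Ridge': GridSearchCV(Ridge(), {'alpha': [0.1, 1, 10]}, cv=5,
                          scoring='neg_root_mean_squared_error'),
    'DecisionTree': GridSearchCV(DecisionTreeRegressor(random_state=42),
                                 {'max_depth': [3, 5]}, cv=5,
                                 scoring='neg_root_mean_squared_error'),
}

rows = []
for name, search in searches.items():
    search.fit(X_toy, y_toy)
    rows.append({
        'model': name,
        'cv_rmse': -search.best_score_,   # flip the sign to get RMSE
        'best_params': search.best_params_,
    })

# Lowest CV RMSE first
comparison = pd.DataFrame(rows).sort_values('cv_rmse').reset_index(drop=True)
print(comparison)
```

On this linear toy data Ridge should come out ahead; on the real housing data the ranking may well differ, which is exactly what such a table is meant to reveal.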

# Model Evaluation and Refinement

In [None]:
#Use best parameters from GridSearch
best_rf_model = RandomForestRegressor(
    n_estimators=100,
    max_depth=20,
    max_features='sqrt',
    min_samples_split=2,
    min_samples_leaf=1,
    random_state=42
)

# Create pipeline again with best RF model
model_pipeline_RF_final = Pipeline(steps=[
    ('preprocessor', preprocessor_tree),
    ('model', best_rf_model)
])

# Fit to train subset
model_pipeline_RF_final.fit(X_train, y_train)

# Predict on validation subset
val_preds = model_pipeline_RF_final.predict(X_val)

# Evaluate with RMSE (take the square root of MSE; the `squared` argument was removed in newer scikit-learn versions)
from sklearn.metrics import mean_squared_error

rmse_val = np.sqrt(mean_squared_error(y_val, val_preds))
print("Validation RMSE: ", rmse_val)

In [None]:
#fit the model on original train set
model_pipeline_RF_final.fit(X,y)

In [None]:
#now it's time to bring our test set
test_df.head()

In [None]:
#good!! now let's use the model to predict on this test set
test_df_pred_log = model_pipeline_RF_final.predict(test_df)

In [None]:
print(np.isnan(test_df_pred_log).sum(), "NaNs in predictions")
print(np.isinf(test_df_pred_log).sum(), "Infs in predictions")
print("Max value:", np.max(test_df_pred_log))
print("Min value:", np.min(test_df_pred_log))


In [None]:
test_df_preds = np.expm1(test_df_pred_log)

In [None]:
# Create submission DataFrame
submission = pd.DataFrame({
    'Id': test_df['Id'], 
    'SalePrice': test_df_preds
})

# Save to CSV (no index)
submission.to_csv('submission.csv', index=False)

# Conclusion

The model performed well in the initial prediction. For improved results, we can focus on advanced feature engineering or fine-tuning the hyperparameters. However, it's important to ensure that the model does not overfit — training the model repeatedly without proper validation may cause it to memorize patterns instead of generalizing well.
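
One simple way to watch for overfitting is to compare the error on the training subset with the error on the validation subset. The sketch below illustrates the idea on synthetic data with two decision trees; the data, the `rmse` helper, and the depth settings are assumptions made for this example only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)

# Synthetic noisy data (not the housing dataset)
X_toy = rng.uniform(-3, 3, size=(400, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.3, size=400)

X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

def rmse(model, X, y):
    """RMSE of a fitted model on the given data."""
    return float(np.sqrt(mean_squared_error(y, model.predict(X))))

# An unrestricted tree can memorize the training rows; a shallow tree cannot
deep_tree = DecisionTreeRegressor(random_state=42).fit(X_tr, y_tr)
shallow_tree = DecisionTreeRegressor(max_depth=3, random_state=42).fit(X_tr, y_tr)

# A large gap between validation and training error is a sign of overfitting
gap_deep = rmse(deep_tree, X_va, y_va) - rmse(deep_tree, X_tr, y_tr)
gap_shallow = rmse(shallow_tree, X_va, y_va) - rmse(shallow_tree, X_tr, y_tr)

print(f"Deep tree gap:    {gap_deep:.3f}")
print(f"Shallow tree gap: {gap_shallow:.3f}")
```

The same comparison can be made for the final pipeline by computing RMSE on both `X_train` and `X_val` before refitting on the full training data.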

Thanks for reading this notebook!  
If you need any help, feel free to contact me at: **dabidhussain2502@gmail.com**
