# 🏠 House Price Prediction System using Machine Learning

## Project Overview
This project predicts house prices using the Kaggle dataset **train.csv**.

- The dataset contains 81 columns (an `Id` column, 79 house features, and the target)
- Target column is **SalePrice**
- We apply data preprocessing, feature encoding, model training, and evaluation.



## Step 1: Import Required Libraries

We import libraries for:

- Data handling (Pandas, NumPy)
- Visualization (Matplotlib, Seaborn)
- Machine Learning model training (Scikit-learn)



In [8]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

from sklearn.ensemble import RandomForestRegressor


## Step 2: Load Dataset

We load the Kaggle dataset `train.csv`.
It contains house features and the target price column `SalePrice`.


In [10]:
df = pd.read_csv("train.csv")

print("Dataset Loaded Successfully!")
df.head()


Dataset Loaded Successfully!


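|   | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |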


## Step 3: Dataset Exploration

We check:

- Total rows and columns
- Feature datatypes
- Missing values in dataset


In [11]:
print("Dataset Shape:", df.shape)
df.info()



Dataset Shape: (1460, 81)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-nu

## Step 4: Missing Value Detection

The dataset contains missing values in several columns.
We identify the top missing-value columns before cleaning.



In [12]:
missing = df.isnull().sum().sort_values(ascending=False)

print("Top Missing Values Columns:")
print(missing.head(10))



Top Missing Values Columns:
PoolQC         1453
MiscFeature    1406
Alley          1369
Fence          1179
MasVnrType      872
FireplaceQu     690
LotFrontage     259
GarageYrBlt      81
GarageCond       81
GarageType       81
dtype: int64
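
Raw counts are easier to judge as a share of all rows: in the output above, `PoolQC` is missing in 1453 of 1460 rows, which is about 99.5%. The sketch below shows one way to compute that share per column. It runs on a small made-up frame (`toy_df` is a hypothetical stand-in, not the real `train.csv`), so the printed numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the housing data
toy_df = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "LotArea": [8450, 9600, 11250, 9550],
})

# isnull().mean() gives the fraction of missing entries per column
missing_pct = toy_df.isnull().mean().mul(100).sort_values(ascending=False)

# Keep only the columns that actually have missing values
missing_pct = missing_pct[missing_pct > 0]

print("Missing values (% of rows):")
print(missing_pct)
```

On the real data, the same expression applied to `df` would rank the columns in the same order as the counts above.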


## Step 5: Data Cleaning

- Missing values in numerical columns are filled with the column median
- Missing values in categorical columns are filled with the column mode (the most frequent value)

This leaves no missing values, so the data is ready for encoding and training.



In [13]:
# Fill numerical missing values with median
num_cols = df.select_dtypes(include=["int64", "float64"]).columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Fill categorical missing values with most frequent value (mode)
cat_cols = df.select_dtypes(include=["object"]).columns
df[cat_cols] = df[cat_cols].fillna(df[cat_cols].mode().iloc[0])

print("Missing values handled successfully!")



Missing values handled successfully!
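
To make the effect of the two fill rules concrete, here is a minimal sketch on a made-up four-row frame. The frame `toy_df` and its values are assumptions for illustration; only the fill logic mirrors the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one numeric and one categorical column
toy_df = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0, 70.0],
    "MSZoning": ["RL", "RL", None, "RM"],
})

num_cols = toy_df.select_dtypes(include="number").columns
cat_cols = toy_df.select_dtypes(exclude="number").columns

# Numeric gaps get the column median (70.0 for the values 60, 70, 80)
toy_df[num_cols] = toy_df[num_cols].fillna(toy_df[num_cols].median())

# Categorical gaps get the most frequent value ("RL" here)
toy_df[cat_cols] = toy_df[cat_cols].fillna(toy_df[cat_cols].mode().iloc[0])

print(toy_df)
print("Remaining missing values:", int(toy_df.isnull().sum().sum()))
```

Filling with the mode is a simple default. For columns such as `PoolQC`, where almost every value is missing, it assigns the most common observed category to nearly all rows, so it may be worth checking whether a separate "missing" category suits those columns better.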


## Step 6: Feature Selection

- `X` contains every column except `SalePrice` (80 columns, including `Id`)
- `y` contains the target column `SalePrice`



In [14]:
X = df.drop(["SalePrice"], axis=1)
y = df["SalePrice"]

print("Input Features Shape:", X.shape)
print("Target Shape:", y.shape)



Input Features Shape: (1460, 80)
Target Shape: (1460,)


## Step 7: Categorical Encoding

The dataset has many categorical columns like:

- MSZoning
- Street
- Neighborhood

We convert them into numeric form using One-Hot Encoding.


In [15]:
X = pd.get_dummies(X, drop_first=True)

print("Categorical Encoding Completed!")
print("New Shape:", X.shape)


Categorical Encoding Completed!
New Shape: (1460, 245)
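
The jump from 80 to 245 columns comes from the encoding itself: each categorical column with k distinct values is replaced by k - 1 indicator columns when `drop_first=True`. A minimal sketch on a made-up frame (`toy_X` is hypothetical) shows the mechanics:

```python
import pandas as pd

# Hypothetical toy frame: one numeric column and one categorical column
toy_X = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Grvl", "Pave"],
})

# "Street" has two levels (Grvl, Pave); drop_first=True drops the first
# level in sorted order (Grvl) and keeps a single indicator for Pave
encoded = pd.get_dummies(toy_X, drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

Numeric columns such as `LotArea` pass through unchanged; only the categorical columns are expanded.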


## Step 8: Train-Test Split

We divide the dataset:

- 80% Training Data
- 20% Testing Data

This helps evaluate model performance on unseen data.


In [16]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Training Set:", X_train.shape)
print("Testing Set:", X_test.shape)


Training Set: (1168, 245)
Testing Set: (292, 245)
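
With `test_size=0.2`, 20% of the 1460 rows (292) go to the test set and the remaining 1168 to training, and fixing `random_state` makes the shuffle repeatable. The sketch below illustrates both points on a made-up 10-row array (`X_toy` and `y_toy` are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# First split: 20% of 10 rows goes to the test set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Second split with the same seed selects the same rows
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

print("Train shape:", X_tr.shape, "Test shape:", X_te.shape)
print("Same test rows on rerun:", np.array_equal(X_te, X_te2))
```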


## Step 9: Model Training

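We use a **Random Forest Regressor**, which often performs better than plain Linear Regression
on tabular data like this Kaggle dataset because it can model non-linear relationships.

It builds many decision trees (200 here) and averages their predictions, which usually reduces overfitting compared with a single tree.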


In [17]:
model = RandomForestRegressor(
    n_estimators=200,
    random_state=42
)

model.fit(X_train, y_train)

print("Random Forest Model Trained Successfully!")


Random Forest Model Trained Successfully!
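
This notebook does not train a linear baseline, so the comparison with Linear Regression is not measured here. One way to check it is to fit both models on the same split and compare their test scores. The sketch below does this on synthetic data with a deliberately non-linear target (everything in it, including `X_syn` and `y_syn`, is an assumption for illustration), so it demonstrates the method rather than a result for the housing data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends on the square of the first feature
rng = np.random.default_rng(0)
X_syn = rng.uniform(-3, 3, size=(400, 2))
y_syn = X_syn[:, 0] ** 2 + rng.normal(scale=0.1, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

candidates = {
    "LinearRegression": LinearRegression(),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=42),
}

# Fit each model on the same training data and score it on the same test data
scores = {}
for name, estimator in candidates.items():
    estimator.fit(X_tr, y_tr)
    scores[name] = r2_score(y_te, estimator.predict(X_te))
    print(f"{name}: R2 = {scores[name]:.3f}")
```

Replacing the synthetic arrays with `X_train`, `X_test`, `y_train`, and `y_test` from Step 8 would give the comparison for this dataset.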


## Step 10: Prediction

The trained model predicts house prices on the test dataset.


In [18]:
y_pred = model.predict(X_test)

print("Prediction Completed!")


Prediction Completed!


## Step 11: Model Evaluation

We evaluate the model using:

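- **R² Score** → proportion of the variance in `SalePrice` explained by the model (1.0 is a perfect fit)
- **RMSE** → typical size of the prediction error, in the same units as `SalePrice`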


In [19]:
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print("Model Performance:")
print("R2 Score:", r2)
print("RMSE:", rmse)


Model Performance:
R2 Score: 0.8934549038866828
RMSE: 28587.33361062636
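
Both metrics follow directly from the residuals: R² is 1 minus the ratio of the residual sum of squares to the total sum of squares, and RMSE is the square root of the mean squared residual. The sketch below computes them by hand on four made-up prices (`y_true` and `y_hat` are hypothetical values, not taken from the dataset).

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true and predicted prices; every prediction is off by 10,000
y_true = np.array([200000.0, 150000.0, 300000.0, 250000.0])
y_hat = np.array([210000.0, 140000.0, 290000.0, 260000.0])

residuals = y_true - y_hat

# R2 = 1 - SS_res / SS_tot
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# RMSE = sqrt(mean of squared residuals)
rmse_manual = np.sqrt(np.mean(residuals ** 2))

print("Manual R2:", r2_manual)
print("Manual RMSE:", rmse_manual)
```

These hand-computed values agree with `r2_score` and `np.sqrt(mean_squared_error(...))` as used in the cell above.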


## Step 12: Feature Importance

Random Forest provides feature importance scores.
This helps identify which house features most affect price.


In [20]:
importances = pd.Series(model.feature_importances_, index=X.columns)
top10 = importances.sort_values(ascending=False).head(10)

print("Top 10 Important Features:")
print(top10)


Top 10 Important Features:
OverallQual    0.556789
GrLivArea      0.122382
TotalBsmtSF    0.033554
2ndFlrSF       0.031090
1stFlrSF       0.028028
BsmtFinSF1     0.027985
LotArea        0.017620
GarageArea     0.015539
GarageCars     0.014821
YearBuilt      0.012276
dtype: float64
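
The scores in `feature_importances_` are impurity-based and normalized to sum to 1, so each value can be read as a share of the total. In the output above, `OverallQual` and `GrLivArea` together account for about 0.68. A minimal sketch on synthetic data (the column names `signal` and `noise` are made up for illustration) shows both properties:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends only on the "signal" column
rng = np.random.default_rng(0)
X_syn = pd.DataFrame({
    "signal": rng.normal(size=300),
    "noise": rng.normal(size=300),
})
y_syn = 5.0 * X_syn["signal"] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_syn, y_syn)

# Pair each importance score with its column name
importances = pd.Series(forest.feature_importances_, index=X_syn.columns)

print(importances.sort_values(ascending=False))
print("Sum of importances:", importances.sum())
```

Impurity-based importances can favor features with many distinct values, so they are best treated as a rough ranking rather than an exact measure of effect.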


## Step 13: Save Trained Model

The trained regression model is saved as a Pickle file.
It can be reused later for deployment.


In [22]:
import pickle

# Use a context manager so the file handle is closed after writing
with open("house_price_model.pkl", "wb") as f:
    pickle.dump(model, f)

print("Model Saved Successfully as house_price_model.pkl")


Model Saved Successfully as house_price_model.pkl
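
Reusing the model later means loading the file and calling `predict` on data with the same columns, in the same order, as the encoded `X` used for training. The sketch below shows the save and load round trip on a small stand-in model. The toy data, the model `toy_model`, and the file name `toy_model.pkl` are assumptions for illustration, and the file is written to a temporary directory.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical toy data standing in for the encoded housing features
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy[:, 0] * 2.0 + rng.normal(scale=0.1, size=50)

toy_model = RandomForestRegressor(n_estimators=10, random_state=42).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "toy_model.pkl"

    # Save the fitted model
    with open(model_path, "wb") as f:
        pickle.dump(toy_model, f)

    # Load it back, as a deployment script would
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)

# The reloaded model should reproduce the original predictions exactly
same_predictions = np.array_equal(
    toy_model.predict(X_toy), loaded_model.predict(X_toy)
)
print("Predictions match after reload:", same_predictions)
```

Pickle files can execute code when loaded, so only load model files from a source you trust.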


## Conclusion

In this project, we successfully developed a **House Price Prediction System** using Machine Learning techniques. The Kaggle housing dataset (`train.csv`) was used, which contains multiple numerical and categorical features influencing house prices.

The dataset was preprocessed by handling missing values, encoding categorical variables using One-Hot Encoding, and splitting the data into training and testing sets. A **Random Forest Regressor** model was trained to predict the target variable `SalePrice`.

The model achieved an **R² Score** of about 0.89 and an **RMSE** of about 28,587 on the test set, which indicates that it explains most of the variation in house prices. Additionally, feature importance analysis helped identify the most impactful factors affecting house pricing, led by `OverallQual` and `GrLivArea`.

This project shows how machine learning can be applied in the real estate domain to assist buyers, sellers, and investors in making better pricing decisions. In the future, the system can be further improved by using advanced algorithms like XGBoost and deploying it as a complete web application for real-time predictions.
