## Problem Statement

**The goal of this project is to analyze a house price dataset and build a machine learning model to predict house prices based on various features such as area, quality, location, and other characteristics. This project includes data cleaning, exploratory data analysis, feature engineering, model building, and evaluation. Based on the analysis, customer recommendations will also be provided.**



## Import Libraries

In [None]:
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings 
warnings.filterwarnings("ignore")
%matplotlib inline

## Load dataset

In [None]:
data = pd.read_csv("House Price.csv")

In [None]:
# Show every column when displaying a DataFrame
pd.set_option("display.max_columns", None)
data.head()

## Understand the Data (Basic Checks)

In [None]:
data.shape

In [None]:
# Insights
  # There are 1460 rows and 81 columns

In [None]:
data.head()

In [None]:
data.tail()

In [None]:
# Insights
 # Target column : SalePrice.
 # Rest of the columns are input.

In [None]:
data.info()

In [None]:
# Insights
 # There are 1460 entries in total.
 # Some of the columns have missing values.
 # The data contains several dtypes: float64, int64, and object.

In [None]:
data.describe().T

In [None]:
# Insights
  # The large gap between the 75th percentile and the max shows that some columns clearly have outliers.
  # Some columns are right-skewed and some are left-skewed.

In [None]:
data.describe(include= "object").T

In [None]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
data.isnull().sum()[data.isnull().sum()>0]

In [None]:
data.duplicated().sum()

In [None]:
missing_percent = data.isnull().mean() * 100
missing_percent.sort_values(ascending=False)

In [None]:
# Insights
 # The columns listed above have missing values.
 # Columns with more than 50% missing values will be dropped.
 # No duplicate rows are present.

## Handle the Missing values

In [None]:
col_to_drop = missing_percent[missing_percent>50].index
data.drop(columns=col_to_drop,inplace = True)
data.head()

In [None]:
# Insights
 # Columns with more than 50% missing values are dropped because they carry too little information and can hurt model performance.

In [None]:
# Columns that still have null values (to be filled with the median or the mode)
data.isnull().sum()[data.isnull().sum()>0]

In [None]:
# FireplaceQu
data["FireplaceQu"] = data["FireplaceQu"].fillna("No Fireplace")
# Insights
  # The null values are filled with "No Fireplace" instead of a summary statistic.
  # Here a null value carries information: the house doesn't have a fireplace.

In [None]:
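## Fill the null values in these columns with the median
med_cols=["LotFrontage","MasVnrArea","GarageYrBlt"]

for col in med_cols:
    # Assign back instead of chained inplace=True, which can fail to update `data` under pandas Copy-on-Write
    data[col] = data[col].fillna(data[col].median())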

In [None]:
## Fill the null values in these columns with the mode
mode_cols=["BsmtQual","BsmtCond","BsmtExposure","BsmtFinType1","BsmtFinType2","Electrical","GarageType","GarageFinish",
     "GarageQual","GarageCond"]

for col in mode_cols:
    data[col] = data[col].fillna(data[col].mode()[0])

In [None]:
# Final Checks
data.isnull().sum()[data.isnull().sum()>0]

In [None]:
# Insights
 # Numerical columns are filled with the median because they contain outliers.
 # Categorical columns are filled with the most frequent value (mode).
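
The imputation rules above can be sketched on a small made-up frame. Only the column names come from the dataset; the values are invented for illustration, and plain assignment is used in place of `inplace=True`.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0, 300.0],   # 300 plays the role of an outlier
    "Electrical": ["SBrkr", "SBrkr", None, "FuseA"],
    "FireplaceQu": ["Gd", None, "TA", None],
})

# Numerical column: the median is less affected by the outlier than the mean.
toy["LotFrontage"] = toy["LotFrontage"].fillna(toy["LotFrontage"].median())

# Categorical column: fill with the most frequent value.
toy["Electrical"] = toy["Electrical"].fillna(toy["Electrical"].mode()[0])

# Missing value that carries meaning: use an explicit label.
toy["FireplaceQu"] = toy["FireplaceQu"].fillna("No Fireplace")

print(toy)
```

In this toy column the mean of the observed frontages would be pulled up by the single large value, while the median stays at a typical one.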

## Exploratory Data Analysis

In [None]:
from ydata_profiling import ProfileReport 
profile = ProfileReport(data,title="EDA",explorative=False)
profile

In [None]:
data.head(2)

In [None]:
data.MSSubClass.unique()

In [None]:
categorical = ["MSZoning","Street","LotShape","LandContour","Utilities","LotConfig","LandSlope","Neighborhood","Condition1",
              "Condition2","BldgType","HouseStyle","RoofStyle","RoofMatl","Exterior1st","Exterior2nd","ExterQual","ExterCond",
              "Foundation","BsmtQual","BsmtCond","BsmtExposure","BsmtFinType1","BsmtFinType2","Heating",
              "HeatingQC","CentralAir","Electrical","KitchenQual","Functional","FireplaceQu","GarageType","GarageFinish","GarageQual",
              "GarageCond","PavedDrive","SaleType","SaleCondition"]

In [None]:
discrete=["MSSubClass","OverallQual","OverallCond","BsmtFullBath","BsmtHalfBath","FullBath","HalfBath","BedroomAbvGr",
          "KitchenAbvGr","TotRmsAbvGrd","Fireplaces","GarageCars","MoSold","YrSold"]

In [None]:
continuous=["LotFrontage","LotArea","MasVnrArea","BsmtFinSF1","BsmtFinSF2","BsmtUnfSF","TotalBsmtSF","1stFlrSF",
           "2ndFlrSF","LowQualFinSF","GrLivArea","GarageYrBlt","GarageArea","WoodDeckSF","OpenPorchSF","EnclosedPorch","3SsnPorch",
            "ScreenPorch","PoolArea","MiscVal","YearBuilt","YearRemodAdd",]

## Univariate Analysis

In [None]:
#Select the important columns for EDA
cat_important={}
for col in categorical:
    variation = data.groupby(col)["SalePrice"].mean().std()
    cat_important[col]=variation
cat_important=pd.Series(cat_important).sort_values(ascending=False)
top_cat_col=cat_important.head(10)
top_cat_col

In [None]:
## Categorical
plt.figure(figsize=(30,30),facecolor="white")
plotnumber=1
for col in top_cat_col.index:
    if plotnumber <= 10:  # the 5x2 grid holds at most 10 plots
        ax=plt.subplot(5,2,plotnumber)
        sns.countplot(x=data[col])
        plt.title(col, fontsize=20)
        plt.xlabel(col, fontsize=20)
        plt.ylabel("Count", fontsize=20)
        plt.xticks(rotation=90,fontsize=20)
        plt.yticks(fontsize=15)
    plotnumber+=1
plt.tight_layout()
plt.show()

In [None]:
# Insights
  # The exterior quality of the majority of houses is Average/Typical.
  # The kitchen quality of most houses is in the Average to Good range.
  # The basement quality of most houses is Average or Good.
  # The second nearby condition (Condition2) is Normal for most houses.
  # The majority of house roofs are made of composite shingles.
  # Most houses have no fireplace; some have Average or Excellent fireplaces.
  # Neighborhood plays an important role in SalePrice.
  # The basement condition of most houses is Typical; Poor and Good conditions are rare.
  # Garage quality is mostly Average.
  # Exterior materials are unevenly distributed.

In [None]:
#Select the important columns for EDA
disc_important={}
for col in discrete:
    variation = data.groupby(col)["SalePrice"].mean().std()
    disc_important[col]=variation
disc_important=pd.Series(disc_important).sort_values(ascending=False)
top_disc_col=disc_important.head(10)
top_disc_col

In [None]:
## Discrete
plt.figure(figsize=(30,30),facecolor="white")
plotnumber=1
for col in top_disc_col.index:
    if plotnumber <= 10:  # the 5x2 grid holds at most 10 plots
        ax=plt.subplot(5,2,plotnumber)
        sns.countplot(x=data[col],)
        plt.title(col, fontsize=20)
        plt.xlabel(col, fontsize=20)
        plt.ylabel("Count", fontsize=20)
        plt.xticks(fontsize=20)
        plt.yticks(fontsize=20)
    plotnumber+=1
plt.tight_layout()
plt.show()

In [None]:
# Insights
  # The overall quality of most houses is moderate.
  # The majority of houses have two full bathrooms, and some have one.
  # Most houses have 5 to 8 rooms above ground.
  # The majority of houses have garage capacity for 2 cars.
  # Most houses have no fireplace, and some have one.
  # The overall condition of most houses is Average.
  # Most houses belong to the one-story and two-story classes.
  # Most houses have a single kitchen.
  # Most of the bathrooms are full bathrooms.
  # Houses have 3 bedrooms above ground on average.

In [None]:
#Select the important columns for EDA
con_important = data[continuous].corrwith(data["SalePrice"])
con_important=con_important.abs().sort_values(ascending=False)
top_con = con_important.head(10)
top_con

In [None]:
## Continuous
plt.figure(figsize=(30,30),facecolor="white")
plotnumber=1
for col in top_con.index:
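    if plotnumber <= 10:  # the 5x2 grid holds at most 10 plots
        ax=plt.subplot(5,2,plotnumber)
        # distplot is deprecated in seaborn; histplot with a KDE overlay replaces it
        sns.histplot(x=data[col], kde=True)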
        plt.title(col, fontsize=20)
        plt.xlabel(col, fontsize=20)
        plt.ylabel("Count", fontsize=20)
        plt.xticks(fontsize=20)
        plt.yticks(fontsize=20)
    plotnumber+=1
plt.tight_layout()
plt.show()

In [None]:
# Insights
  # The above-ground living area is mostly between 1000 and 2000 sq.ft.
  # The garage area averages around 500 sq.ft.
  # The most common basement area is around 1000 sq.ft, and some houses have no basement.
  # The first-floor area is moderate for most houses.
  # Houses were built from around 1850 onward; construction increases year over year, with high growth around the year 2000.
  # Most renovations took place after 1990-2000.
  # Most of the houses are decorated on the outside.
  # Garages were built from around 1880 onward, and their number still increases gradually.
  # Some houses have no basement and some have an unfinished basement.
  # Most houses have about 60 feet of lot frontage, and some have more.

In [None]:
## Outlier Detection
for col in continuous:
    plt.figure(figsize=(5,3))
    sns.boxplot(x=data[col])
    plt.title(col)
    plt.show()

In [None]:
# Insights
  # All the columns have outliers, but the outliers are valid data points, so we can't blindly remove or replace them.
  # Instead, we reduce their influence with a transformation (log transform or power transform).
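
The boxplots above mark points that fall more than 1.5 times the interquartile range beyond the quartiles. As a rough sketch, the same rule can be applied numerically to count flagged values per column. The helper name `iqr_outlier_count` and the sample values are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

def iqr_outlier_count(series: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR], the usual boxplot whisker rule."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return int(mask.sum())

# Invented values: nine ordinary areas and one very large one.
areas = pd.Series([900, 1000, 1100, 1200, 1300, 1400, 1500, 1600, 1700, 5000])
print(iqr_outlier_count(areas))
```

On the real data this could be called as `data[continuous].apply(iqr_outlier_count)` to get one count per column, which is easier to compare than a long series of separate plots.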

## Bivariate Analysis

In [None]:
# Categorical
plt.figure(figsize=(50,50),facecolor="white")
plotnumber=1
for col in top_cat_col.index:
    if plotnumber <= 10:  # the 5x2 grid holds at most 10 plots
        ax=plt.subplot(5,2,plotnumber)
        sns.boxplot(x=data[col],y=data["SalePrice"])
        plt.title(col, fontsize=20)
        plt.xlabel(col, fontsize=20)
        plt.ylabel("SalePrice", fontsize=20)
        plt.xticks(rotation=90,fontsize=20)
        plt.yticks(fontsize=15)
    plotnumber+=1
plt.tight_layout()
plt.show()

In [None]:
# Insights
  # Columns like ExterQual, KitchenQual, BsmtQual, FireplaceQu, BsmtCond, and GarageQual show a consistent
  # relationship with SalePrice.
  # SalePrice increases significantly from Fair to Excellent.
  # The Condition2 column shows a strong relationship with SalePrice: houses near a
  # positive off-site feature (PosN) have a higher SalePrice.
  # Houses with a WdShngl roof material tend to have a higher SalePrice,
  # while houses with CompShg roofs have a moderate SalePrice.
  # Neighborhood has a strong influence on SalePrice; prices vary by location.

In [None]:
# Discrete
plt.figure(figsize=(30,30),facecolor="white")
plotnumber=1
for col in top_disc_col.index:
    if plotnumber <= 10:  # the 5x2 grid holds at most 10 plots
        ax=plt.subplot(5,2,plotnumber)
        sns.boxplot(x=data[col],y=data["SalePrice"])
        plt.title(col, fontsize=20)
        plt.xlabel(col, fontsize=20)
        plt.ylabel("SalePrice", fontsize=20)
        plt.xticks(fontsize=20)
        plt.yticks(fontsize=20)
    plotnumber+=1
plt.tight_layout()
plt.show()

In [None]:
# Insights
  # For OverallQual, FullBath, Fireplaces, and OverallCond, SalePrice increases with higher values.
  # SalePrice increases with TotRmsAbvGrd, and houses with 10 rooms above ground have a higher SalePrice.
  # Houses with garage capacity for three cars tend to have a higher SalePrice.
  # Houses in the 1-story and 2-story classes have higher prices.
  # Houses with 1 or 2 kitchens above ground have a higher SalePrice.
  # The number of bedrooms plays an important role in SalePrice.

In [None]:
## Continuous
plt.figure(figsize=(30,30),facecolor="white")
plotnumber=1
for col in top_con.index:
    if plotnumber <= 10:  # the 5x2 grid holds at most 10 plots
        ax=plt.subplot(5,2,plotnumber)
        sns.scatterplot(x=data[col],y=data["SalePrice"])
        plt.title(col, fontsize=20)
        plt.xlabel(col, fontsize=20)
        plt.ylabel("SalePrice", fontsize=20)
        plt.xticks(fontsize=20)
        plt.yticks(fontsize=20)
    plotnumber+=1
plt.tight_layout()
plt.show()

In [None]:
## Insights
  # GrLivArea, GarageArea, TotalBsmtSF, 1stFlrSF, BsmtFinSF1, and LotFrontage
  # show a strong relationship with SalePrice.
  # The other columns have a weak to moderate relationship with SalePrice.

## Multivariate Analysis

In [None]:
Continuous_hm=["LotFrontage","LotArea","MasVnrArea","BsmtFinSF1","BsmtFinSF2","BsmtUnfSF","TotalBsmtSF","1stFlrSF",
           "2ndFlrSF","LowQualFinSF","GrLivArea","GarageYrBlt","GarageArea","WoodDeckSF","OpenPorchSF","EnclosedPorch","3SsnPorch",
            "ScreenPorch","PoolArea","MiscVal","YearBuilt","YearRemodAdd","SalePrice"]

In [None]:
plt.figure(figsize=(30,30))
sns.heatmap(data[Continuous_hm].corr(),annot=True,annot_kws={"size":14})

In [None]:
discrete_hm=["MSSubClass","OverallQual","OverallCond","BsmtFullBath","BsmtHalfBath","FullBath","HalfBath","BedroomAbvGr",
          "KitchenAbvGr","TotRmsAbvGrd","Fireplaces","GarageCars","MoSold","YrSold"]

In [None]:
plt.figure(figsize=(30,30))
sns.heatmap(data[discrete_hm].corr(),annot=True,annot_kws={"size":14})

In [None]:
# Insights
  # The heatmaps show no multicollinearity strong enough to require dropping any of the input columns.
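
A 30x30 annotated heatmap is hard to scan by eye, so a numeric check can back up the reading above. The sketch below lists feature pairs whose absolute correlation exceeds a cutoff. The helper name `high_corr_pairs`, the 0.8 cutoff, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Synthetic data: b is almost a copy of a, c is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})
print(high_corr_pairs(demo))
```

Applied to the notebook's data, `high_corr_pairs(data[Continuous_hm].drop(columns="SalePrice"))` would return an empty Series if no pair of inputs crosses the chosen cutoff.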

## Data Preprocessing

In [None]:
data.drop(columns="Id",inplace=True)

In [None]:
# Create a new feature: house age relative to a fixed reference year (2026)
data["House_age"] = 2026 - data["YearBuilt"]
data.drop(columns="YearBuilt",inplace=True)

## Handle Outliers

In [None]:
continuous_pre=["LotFrontage","LotArea","MasVnrArea","BsmtFinSF1","BsmtFinSF2","BsmtUnfSF","TotalBsmtSF","1stFlrSF",
           "2ndFlrSF","LowQualFinSF","GrLivArea","GarageYrBlt","GarageArea","WoodDeckSF","OpenPorchSF","EnclosedPorch","3SsnPorch",
            "ScreenPorch","PoolArea","MiscVal","YearRemodAdd","House_age"]

In [None]:
for col in continuous_pre:
    print(col,data[col].skew())

In [None]:
for col in continuous_pre:
    if data[col].skew()>1:
        data[col] = np.log1p(data[col])

In [None]:
# Insights
  # Most of the columns have a skewness greater than 1.
  # Columns with a skewness greater than 1 are log-transformed with np.log1p.
  # Columns with lower skewness are kept as they are.
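
To illustrate why `np.log1p` is applied to the strongly right-skewed columns, the following sketch measures skewness before and after the transform. It uses synthetic log-normal values as a stand-in for an area-like column, not the dataset itself.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed values standing in for an area-like column.
values = pd.Series(rng.lognormal(mean=8.0, sigma=1.0, size=1000))

before = values.skew()
after = np.log1p(values).skew()
print(f"skew before: {before:.2f}, after log1p: {after:.2f}")
```

`log1p` computes `log(1 + x)`, so it stays defined for the zeros that appear in columns such as `PoolArea` or `MiscVal`.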

## Conversion of categorical variable into numerical

## OneHotEncoding

In [None]:
one_hot=["MSZoning","Street","LandContour","LotConfig","LandSlope","Neighborhood","Condition1","Condition2","BldgType","HouseStyle",
        "RoofStyle","RoofMatl","Exterior1st","Exterior2nd","Foundation","BsmtFinType1","BsmtFinType2","Heating","Electrical",
        "GarageType","SaleType","SaleCondition"]

In [None]:
data=pd.get_dummies(data,columns=one_hot,dtype=int,drop_first=True)

In [None]:
data.head()

In [None]:
# Insights
  # One-hot encoding converts the nominal categorical columns into numerical columns.
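
A small example of what `pd.get_dummies` with `drop_first=True` produces, using an invented three-row frame (the column names follow the dataset, the values are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],
    "LotArea": [8450, 9600, 11250],
})

# drop_first=True removes the first category (alphabetically "Grvl"),
# which becomes the baseline encoded as all zeros.
encoded = pd.get_dummies(toy, columns=["Street"], dtype=int, drop_first=True)
print(encoded)
```

Dropping one level per feature avoids perfectly redundant columns, which matters for the linear model fitted later.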

## Manual Mapping

In [None]:
ordinal_data={
    "LotShape":{"IR3":0,"IR2":1,"IR1":2,"Reg":3},
    "ExterQual":{"Fa":1,"TA":2,"Gd":3,"Ex":4},
    "ExterCond":{"Po":1,"Fa":2,"TA":3,"Gd":4,"Ex":5},
    "BsmtQual":{"Fa":1,"TA":2,"Gd":3,"Ex":4},
    "BsmtCond":{"Po":1,"Fa":2,"TA":3,"Gd":4},
    "BsmtExposure":{"No":1,"Mn":2,"Av":3,"Gd":4},
    "HeatingQC":{"Po":1,"Fa":2,"TA":3,"Gd":4,"Ex":5},
    "CentralAir":{"N":0,"Y":1},
    "KitchenQual":{"Fa":1,"TA":2,"Gd":3,"Ex":4},
    "Functional":{"Sev":1, "Maj2":2, "Maj1":3,"Mod":4, "Min2":5, "Min1":6, "Typ":7},
    "FireplaceQu":{"No Fireplace":1,"Po":2,"Fa":3,"TA":4,"Gd":5,"Ex":6},
    "GarageFinish":{"Unf":1,"RFn":2,"Fin":3},
    "GarageQual":{"Po":1,"Fa":2,"TA":3,"Gd":4,"Ex":5},
    "GarageCond":{"Po":1,"Fa":2,"TA":3,"Gd":4,"Ex":5},
    "PavedDrive":{"N":0,"P":1,"Y":2}
}

In [None]:
for col,mapping in ordinal_data.items():
    data[col]=data[col].map(mapping)

In [None]:
data.Utilities.value_counts()
data.drop(columns="Utilities",inplace=True)

In [None]:
# Insights
  # Manual mapping is used for the ordinal columns.
  # The Utilities column is dropped because almost all rows are AllPub and only one row is NoSeWa.
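
One thing to watch with `Series.map` and a dictionary is that any label missing from the mapping silently becomes `NaN`. The sketch below shows a quick check for that case; the sample values are invented, and `"Typ"` is deliberately left out of the mapping.

```python
import pandas as pd

quality_order = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
toy = pd.Series(["TA", "Gd", "Ex", "Typ"])  # "Typ" is deliberately not in the mapping

mapped = toy.map(quality_order)
# Labels that produced NaN were not covered by the dictionary.
unmapped = toy[mapped.isna()].unique().tolist()
print(mapped.tolist())
print("labels without a code:", unmapped)
```

Running a check like this per column after the mapping loop would confirm that every category present in the data received a code.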

In [None]:
# Transform the target Variable
data.SalePrice.skew()

In [None]:
data["SalePrice"]=np.log1p(data["SalePrice"])

In [None]:
# Insights
  # The target variable is log-transformed because it is skewed.
  # On the log scale the target is closer to symmetric, which suits a linear model better; predictions are converted back with np.expm1.
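
The transform is reversible: `np.expm1` undoes `np.log1p`, which is why predictions can be reported in the original price units. A minimal round trip with invented prices:

```python
import numpy as np

prices = np.array([100_000.0, 200_000.0, 755_000.0])  # invented sale prices
log_prices = np.log1p(prices)      # scale the model is trained on
recovered = np.expm1(log_prices)   # back to the price scale for reporting

print(log_prices.round(3))
print(recovered.round(0))
```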

## Split the data

In [None]:
x=data.drop(columns="SalePrice")
y=data.SalePrice

In [None]:
## Train_test_split
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,random_state=30)

## Scale (fit the scaler on the train data only)

In [None]:
from sklearn.preprocessing import StandardScaler
scale=StandardScaler()
x_train_scaled = x_train.copy()
x_test_scaled = x_test.copy()
x_train_scaled[continuous_pre] = scale.fit_transform(x_train[continuous_pre])
x_test_scaled[continuous_pre] = scale.transform(x_test[continuous_pre])
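
A minimal sketch of the fit-on-train-only rule, using synthetic features rather than the house price columns. The scaler learns its mean and standard deviation from the training rows, and the test rows are transformed with those same statistics.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 2))  # synthetic features
X_tr, X_te = train_test_split(X, random_state=30)

scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)  # statistics learned from the training rows only
X_te_s = scaler.transform(X_te)      # test rows reuse the training mean and std

print(X_tr_s.mean(axis=0).round(6))  # zero up to floating-point error
print(X_te_s.mean(axis=0).round(3))  # close to zero, but generally not exactly zero
```

Fitting the scaler on the full data would let test-set statistics leak into training, which tends to make the test score look better than it should.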

## Model Building

## Linear Regression 

In [None]:
from sklearn.linear_model import LinearRegression
model=LinearRegression()
model.fit(x_train_scaled,y_train)
y_pred=model.predict(x_test_scaled)

In [None]:
y_test_original = np.expm1(y_test)
y_pred_original = np.expm1(y_pred)

In [None]:
from sklearn.metrics import r2_score,mean_squared_error,mean_absolute_error
print("R2_score",r2_score(y_test_original, y_pred_original)*100)
print("mean_squared_error",mean_squared_error(y_test_original, y_pred_original))
print("mean_absolute_error",mean_absolute_error(y_test_original, y_pred_original))

In [None]:
n = x_test.shape[0] 
p = x_test.shape[1]  

r2 = r2_score(y_test_original, y_pred_original)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print("Adjusted R2:", adj_r2)
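
The adjusted R² used above penalizes R² for the number of features `p` relative to the number of samples `n`. The sketch below wraps the same formula in a hypothetical helper, `adjusted_r2`, and evaluates it on illustrative numbers that are not results from this notebook.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n - p - 1 <= 0:
        raise ValueError("need more samples than features + 1")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Illustrative numbers only: same R^2 and sample count, different feature counts.
print(adjusted_r2(0.91, n=365, p=200))
print(adjusted_r2(0.91, n=365, p=10))
```

With many one-hot columns and a modest test set, the penalty grows quickly, so the adjusted value can sit well below the plain R².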

## Cross Validation for Linear Regression

In [None]:
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(model, x_train_scaled, y_train, cv=5, scoring='r2')
print("Cross Validation R2 Mean:", cv_scores.mean())

In [None]:
# Insights
 # Linear Regression achieved strong predictive performance, with 91% test R² and 84% cross-validation R².
 # Note that the test R² is computed on the original price scale, while the cross-validation R² is computed on the log-transformed target.
 # The results indicate strong linear relationships between the features and house prices.
 # The model demonstrates good generalization ability and stable performance, making it a reliable choice for price prediction.
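
One possible refinement of the cross-validation step, shown here as a sketch on synthetic data: wrapping the scaler and the model in a scikit-learn pipeline so that the scaler is refit inside every fold. This is a swapped-in variant, not what the cell above does, and for simplicity it scales every column rather than only the continuous ones.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))  # synthetic features
y = X @ np.array([3.0, -2.0, 1.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=300)

# The scaler is refit on the training part of each fold,
# so no fold sees statistics computed from its own validation rows.
pipe = make_pipeline(StandardScaler(), LinearRegression())
scores = cross_val_score(pipe, X, y, cv=5, scoring="r2")

print("fold R2:", scores.round(3))
print("mean R2:", scores.mean().round(3), "std:", scores.std().round(3))
```

Reporting the per-fold spread alongside the mean gives a more direct view of how stable the model is across splits.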

## Random Forest Regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor
rf_model = RandomForestRegressor(random_state=30)  # fixed seed so the scores are reproducible
rf_model.fit(x_train,y_train)
y_predict = rf_model.predict(x_test)

In [None]:
y_test_original1 = np.expm1(y_test)
y_pred_original1 = np.expm1(y_predict)

In [None]:
from sklearn.metrics import r2_score,mean_squared_error,mean_absolute_error
print("R2_score",r2_score(y_test_original1, y_pred_original1)*100)
print("mean_squared_error",mean_squared_error(y_test_original1, y_pred_original1))
print("mean_absolute_error",mean_absolute_error(y_test_original1, y_pred_original1))

In [None]:
r2 = r2_score(y_test_original1, y_pred_original1)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print("Adjusted R2:", adj_r2)

## Hyperparameter Tuning

In [None]:
from sklearn.model_selection import RandomizedSearchCV

params={
    "n_estimators":[100,200,300,400],
    "max_depth":[None,10,20,30,40],
    "min_samples_leaf":[1,2,4],
    "min_samples_split":[2,5,10],
    "max_features":["sqrt","log2"]
}

rf = RandomForestRegressor(random_state=30)
random_search = RandomizedSearchCV(
    estimator=rf,
    param_distributions=params,
    n_iter=20,    
    cv=5,
    scoring="r2",
    verbose=1,
    n_jobs=-1,
    random_state=42
)

random_search.fit(x_train, y_train)

print("Best Parameters:", random_search.best_params_)
print("Best R2:", random_search.best_score_)

In [None]:
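# Refit with the tuned hyperparameters; random_state fixes the seed for reproducibility
rf1_model = RandomForestRegressor(n_estimators=400, min_samples_split=2, min_samples_leaf=2, max_features="sqrt", max_depth=None, random_state=30)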
rf1_model.fit(x_train,y_train)
y_predict = rf1_model.predict(x_test)

In [None]:
y_test_original2 = np.expm1(y_test)
y_pred_original2 = np.expm1(y_predict)

In [None]:
from sklearn.metrics import r2_score,mean_squared_error,mean_absolute_error
print("R2_score",r2_score(y_test_original2, y_pred_original2)*100)
print("mean_squared_error",mean_squared_error(y_test_original2, y_pred_original2))
print("mean_absolute_error",mean_absolute_error(y_test_original2, y_pred_original2))

In [None]:
# Insights
 # Random Forest achieved an R² of approximately 83%, which was lower than Linear Regression. 
 # Even after hyperparameter tuning, the model did not outperform the linear model. 
 # This suggests that the dataset primarily contains linear relationships and does not require complex nonlinear modeling techniques.

## Model Selection

1. Higher Predictive Accuracy
Linear Regression achieved the highest Test R² (91%), outperforming Random Forest, indicating stronger predictive capability.

2. Better Generalization
The 5-fold cross-validation score of 84% confirms that Linear Regression performs consistently across different data splits.

3. Limited Nonlinear Complexity
Random Forest did not improve performance even after tuning, suggesting that the dataset primarily contains linear relationships.

4. Interpretability and Simplicity
Linear Regression provides clear coefficient interpretation and achieves high accuracy without unnecessary model complexity.

## Business Impact
The developed model can help real estate agencies, property sellers, and buyers estimate house prices based on property characteristics.
Accurate price prediction supports:

- Better investment decisions
- Fair property valuation
- Reduced pricing errors
- Faster sales cycle

By explaining 91% of the price variance, the model provides a reliable decision-support tool for housing market analysis.

## Conclusion
This project aimed to predict house prices using various property-related features. After performing data cleaning, exploratory data analysis, feature engineering, and log transformation of the target variable, multiple machine learning models were evaluated.

Among the models tested, Linear Regression achieved the best performance with a Test R² of 91% and a Cross-Validation R² of 84%, indicating strong predictive accuracy and stable generalization. Although Random Forest was implemented and tuned, it did not outperform the linear model, suggesting that the dataset primarily contains strong linear relationships.

The results demonstrate that a well-preprocessed dataset combined with a simple and interpretable model can achieve high predictive performance. Therefore, Linear Regression was selected as the final model for house price prediction.

Overall, the project highlights the importance of proper data preprocessing, model comparison, and validation techniques in building reliable machine learning solutions.