## Problem Statement
**Develop a data science model to predict house prices based on features such as location, size, number of rooms, and amenities.
The project aims to analyze key factors influencing property values and build an accurate predictive model.
The outcome will support data-driven decision-making in the real estate market.**

## Import Libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

## Load Data

In [None]:
data=pd.read_csv('house price.csv')

In [None]:
data

In [None]:
# display all columns
pd.set_option('display.max_columns',None)
data.head()

In [None]:
data.tail()

## Understand the Data (Basic Checks)

In [None]:
data.shape

In [None]:
data.head()

In [None]:
# insights
# Target column:SalePrice
# all other columns are input

In [None]:
data.info()

In [None]:
# insights
# there are 1460 entries in total
# some of the columns have missing values
# the data contains different data types: float64, int64 and object

In [None]:
data.describe().T

In [None]:
# insights
# the gap between the 75th percentile and the max shows that some columns clearly have outliers
# some of the columns are right-skewed and some are left-skewed

In [None]:
data.describe(include='object').T

In [None]:
pd.set_option('display.max_rows',None)
pd.set_option('display.max_columns',None)
data.isnull().sum()[data.isnull().sum()>0]

In [None]:
data.duplicated().sum()

In [None]:
missing_percent=data.isnull().mean() *100
missing_percent.sort_values(ascending=False)

In [None]:
# insights
# the above columns have missing values
# no duplicate values are present

## Handle the Missing Values

In [None]:
col_of_drop=missing_percent[missing_percent>50].index
data.drop(columns=col_of_drop,inplace=True)
data.head()

In [None]:
# insights
# columns with more than 50% missing values are dropped
# because they carry too few observed values and would hurt model performance
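The 50% rule above can be sketched on a small frame. This is a minimal illustration only: the toy values and the names `toy`, `threshold` and `cols_to_drop` are assumptions, not part of the notebook's data.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): 'PoolQC' is 75% missing, 'LotFrontage' 25%.
toy = pd.DataFrame({
    'PoolQC': [np.nan, np.nan, np.nan, 'Gd'],
    'LotFrontage': [65.0, np.nan, 80.0, 70.0],
    'SalePrice': [208500, 181500, 223500, 140000],
})

# Share of missing values per column, in percent.
toy_missing_percent = toy.isnull().mean() * 100

# Drop every column above the threshold; drop() returns a new frame here.
threshold = 50
cols_to_drop = toy_missing_percent[toy_missing_percent > threshold].index
toy_reduced = toy.drop(columns=cols_to_drop)

print(toy_missing_percent)
print(list(toy_reduced.columns))
```

Keeping the threshold in a named variable makes it easy to check how sensitive the later results are to that choice.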

In [None]:
# the remaining missing values will be filled using the median or the mode
data.isnull().sum()[data.isnull().sum()>0]

In [None]:
# FireplaceQu
# assign the result back instead of using inplace=True on a column selection
data['FireplaceQu']=data['FireplaceQu'].fillna('No Fireplace')
# insights
# we fill the null values with 'No Fireplace' instead of a mean, median or mode value
# because a null here carries information: the house does not have a fireplace

In [None]:
# fill the null values in these numerical columns with the median

In [None]:
median_cols=['LotFrontage','MasVnrArea','GarageYrBlt']
for col in median_cols:
    data[col]=data[col].fillna(data[col].median())

In [None]:
# fill the null values in these categorical columns with the mode

In [None]:
mode_cols=['BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2','Electrical','GarageType','GarageFinish','GarageQual',
     'GarageCond']
for col in mode_cols:
    # mode() returns a Series, so take its first entry
    data[col]=data[col].fillna(data[col].mode()[0])

In [None]:
# final checks
data.isnull().sum()[data.isnull().sum()>0]

In [None]:
# insights
# for the numerical columns we fill the null values with the median because they contain outliers
# for the categorical columns we fill the null values with the most frequent value
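As a small illustration of why the median is preferred when outliers are present, and of the mode fill for categorical columns, here is a sketch on made-up values. The toy frame is an assumption; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'LotFrontage': [60.0, 70.0, np.nan, 80.0, 300.0],   # 300 acts as an outlier
    'Electrical': ['SBrkr', 'SBrkr', 'FuseA', np.nan, 'SBrkr'],
})

# The outlier pulls the mean up to 127.5, while the median stays at 75.0.
print('mean  :', toy['LotFrontage'].mean())
print('median:', toy['LotFrontage'].median())

# Assign the filled column back rather than relying on inplace=True.
toy['LotFrontage'] = toy['LotFrontage'].fillna(toy['LotFrontage'].median())

# mode() returns a Series because ties are possible, so take the first entry.
toy['Electrical'] = toy['Electrical'].fillna(toy['Electrical'].mode()[0])

print(toy)
```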

## Exploratory Data Analysis

In [None]:
from ydata_profiling import ProfileReport
profile=ProfileReport(data,title='EDA',explorative=False)
profile

In [None]:
data.head(2)

In [None]:
categorical=['MSZoning','Street','LotShape','LandContour','Utilities','LotConfig','LandSlope','Neighborhood',
            'Condition1','Condition2','BldgType','HouseStyle','RoofStyle','RoofMatl','Exterior1st','Exterior2nd',
            'ExterQual','ExterCond','Foundation','BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2','Heating',
            'HeatingQC','CentralAir','Electrical','KitchenQual','Functional','FireplaceQu','GarageType','GarageFinish',
            'GarageQual','GarageCond','PavedDrive','SaleType','SaleCondition']

In [None]:
discrete=['MSSubClass','OverallQual','OverallCond','BsmtFullBath','BsmtHalfBath','FullBath','HalfBath',
         'BedroomAbvGr','KitchenAbvGr','TotRmsAbvGrd','Fireplaces','GarageCars','MoSold','YrSold']

In [None]:
continuous=['LotFrontage','LotArea','MasVnrArea','BsmtFinSF1','BsmtFinSF2','BsmtUnfSF','TotalBsmtSF',
           '1stFlrSF','2ndFlrSF','LowQualFinSF','GrLivArea','GarageYrBlt','GarageArea','WoodDeckSF','OpenPorchSF',
           'EnclosedPorch','3SsnPorch','ScreenPorch','PoolArea','MiscVal','YearBuilt','YearRemodAdd']

## Univariate Analysis

In [None]:
# rank categorical features by how much the mean SalePrice varies across their categories
cat_important={}
for col in categorical:
    variation=data.groupby(col)['SalePrice'].mean().std()
    cat_important[col]=variation
cat_important=pd.Series(cat_important).sort_values(ascending=False)
top_cat_col=cat_important.head(10)
top_cat_col
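The ranking score used above is the standard deviation of the per-category mean of `SalePrice`. The sketch below reproduces it on a toy frame with made-up prices; the helper name `group_mean_spread` is hypothetical. One caveat worth keeping in mind is that the score ignores how many rows each category has, so a rare category with an extreme mean can inflate it.

```python
import pandas as pd

toy = pd.DataFrame({
    'ExterQual': ['TA', 'TA', 'Gd', 'Gd', 'Ex', 'Ex'],
    'Street':    ['Pave', 'Pave', 'Pave', 'Grvl', 'Pave', 'Pave'],
    'SalePrice': [100000, 120000, 200000, 220000, 300000, 320000],
})

def group_mean_spread(frame, col, target='SalePrice'):
    """Standard deviation of the per-category mean of `target`."""
    return frame.groupby(col)[target].mean().std()

# Score each candidate column, then sort so the most price-separating column comes first.
scores = pd.Series({col: group_mean_spread(toy, col) for col in ['ExterQual', 'Street']})
scores = scores.sort_values(ascending=False)
print(scores)
```

In this toy frame the `ExterQual` group means are 110000, 210000 and 310000, so its score is much larger than the score for `Street`.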

In [None]:
# categorical 
plt.figure(figsize=(50,50),facecolor='white')
plotnumber=1
for column in top_cat_col.index:
    if plotnumber<=10:
        ax=plt.subplot(5,2,plotnumber)
        sns.countplot(x=data[column])
        plt.xticks(rotation=90)
        plt.xlabel(column,fontsize=20)
        plt.title(column,fontsize=20)
        plt.ylabel('count',fontsize=20)
    plotnumber+=1
plt.tight_layout()
plt.show()

In [None]:
# insights
# The exterior quality of the majority of houses is Average/Typical
# The kitchen quality of the houses is in the Average to Good range
# The basement quality of the houses is mostly Average or Good
# The second nearby condition (Condition2) is Normal for most houses
# The majority of the house roofs are made of composite shingles
# Most of the houses have no fireplace, and some have Average or Excellent fireplaces
# Neighborhood plays an important role in SalePrice
# Most basements are in Typical condition; Poor and Good conditions are rare
# Garage quality is mostly Average
# Exterior materials are distributed unevenly

In [None]:
# rank discrete features by how much the mean SalePrice varies across their values
dis_important={}
for col in discrete:
    variation=data.groupby(col)['SalePrice'].mean().std()
    dis_important[col]=variation
dis_important=pd.Series(dis_important).sort_values(ascending=False)
top_dis_col=dis_important.head(10)
top_dis_col

In [None]:
# discrete
plt.figure(figsize=(50,50),facecolor='white')
plotnumber=1
for column in top_dis_col.index:
    if plotnumber<=10:
        ax=plt.subplot(5,2,plotnumber)
        sns.countplot(x=data[column])
        plt.xticks(rotation=90)
        plt.xlabel(column,fontsize=20)
        plt.title(column,fontsize=20)
        plt.ylabel('count',fontsize=20)
    plotnumber+=1
plt.tight_layout()
plt.show()

In [None]:
# insights
# The overall quality of the houses is moderate.
# The majority of houses have two bathrooms, and some have one.
# Houses typically have 5 to 8 rooms above ground.
# The majority of houses can park 2 cars.
# The majority of houses have no fireplace, and some have one.
# The overall condition of the houses is average.
# Most houses are one-story or two-story.
# Most of the houses have a single kitchen.
# Most of the bathrooms are full baths.
# Houses have 3 bedrooms above ground on average.

In [None]:
# rank continuous features by their absolute correlation with SalePrice
con_important=data[continuous].corrwith(data['SalePrice'])
con_important=con_important.abs().sort_values(ascending=False)
top_con=con_important.head(10)
top_con

In [None]:
# continuous
plt.figure(figsize=(50,50),facecolor='white')
plotnumber=1
for column in top_con.index:
    if plotnumber<=10:
        ax=plt.subplot(5,2,plotnumber)
        # distplot is deprecated in seaborn; histplot with kde=True is its replacement
        sns.histplot(x=data[column],kde=True)
        plt.xticks(rotation=90)
        plt.xlabel(column,fontsize=20)
        plt.ylabel('count',fontsize=20)
    plotnumber+=1
plt.tight_layout()
plt.show()

In [None]:
# insights
# The above-ground living area is mostly around 1000 sq.ft to 2000 sq.ft.
# The garage area averages about 500 sq.ft.
# The most common basement area is around 1000 sq.ft.
# The first-floor area is moderate.
# Houses were built from about 1850 onward, increasing gradually year by year, with strong growth around 2000.
# Most houses were renovated between 1990 and 2000.
# Most of the houses are decorated from the outside.
# Garages were built from about 1880 onward, and the number still increases gradually.
# Some houses have no basement and some have an unfinished basement.
# Most of the houses have about 60 feet of LotFrontage, and some have a higher frontage.

In [None]:
for col in continuous:
    plt.figure(figsize=(5,3))
    sns.boxplot(x=data[col])
    plt.title(col)
    
    plt.show()

In [None]:
# insights
# All the columns have outliers, but the outliers carry valid information, so we can't blindly remove or replace them.
# Instead, we transform the columns to reduce the influence of the outliers.
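The boxplots flag points beyond 1.5 times the interquartile range from the quartiles. A minimal sketch of that rule, useful for counting flagged rows per column before deciding how to treat them, is shown below. The helper name `iqr_outlier_mask` and the toy values are assumptions for illustration.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy living areas (hypothetical): only the last value sits far from the rest.
toy_area = pd.Series([900, 1000, 1100, 1200, 1300, 1400, 5000], name='GrLivArea')
mask = iqr_outlier_mask(toy_area)

print('flagged rows:', int(mask.sum()))
print(toy_area[mask])
```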

## Bivariate Analysis

In [None]:
# categorical 
plt.figure(figsize=(50,50),facecolor='white')
plotnumber=1
for column in top_cat_col.index:
    if plotnumber<=10:
        ax=plt.subplot(5,2,plotnumber)
        sns.boxplot(x=data[column],y=data['SalePrice'])
        plt.xticks(rotation=90)
        plt.xlabel(column,fontsize=20)
        plt.title(column,fontsize=20)
        plt.ylabel('SalePrice',fontsize=20)
    plotnumber+=1
plt.tight_layout()
plt.show()

In [None]:
# insights
# Columns like ExterQual, KitchenQual, BsmtQual, FireplaceQu, BsmtCond and GarageQual show a consistent
# relationship with SalePrice.
# SalePrice increases significantly from Fair to Excellent.
# The Condition2 column shows a strong relationship with SalePrice: houses near a
# PosN (positive feature) have a higher sale price.
# Houses with WdShngl roof material tend to have a higher SalePrice,
# while houses with composite shingle roofs have a moderate SalePrice.
# Neighborhood has a strong influence on SalePrice; prices vary by location.

In [None]:
# discrete
plt.figure(figsize=(50,50),facecolor='white')
plotnumber=1
for column in top_dis_col.index:
    if plotnumber<=10:
        ax=plt.subplot(5,2,plotnumber)
        sns.boxplot(x=data[column],y=data['SalePrice'])
        plt.xticks(rotation=90)
        plt.xlabel(column,fontsize=20)
        plt.title(column,fontsize=20)
        plt.ylabel('SalePrice',fontsize=20)
    plotnumber+=1
plt.tight_layout()
plt.show()

In [None]:
# insights
# For OverallQual, FullBath, Fireplaces and OverallCond, SalePrice increases with a higher rating.
# SalePrice increases with TotRmsAbvGrd, and houses with 10 rooms above ground have a high SalePrice.
# Houses that can park three cars have a higher SalePrice.
# One-story and two-story houses have higher prices.
# Houses with 1 or 2 kitchens above ground have a higher SalePrice.
# The number of bedrooms plays an important role in SalePrice.

In [None]:
# continuous
plt.figure(figsize=(50,50),facecolor='white')
plotnumber=1
for column in top_con.index:
    if plotnumber<=10:
        ax=plt.subplot(5,2,plotnumber)
        sns.scatterplot(x=data[column],y=data['SalePrice'])
        plt.xticks(rotation=90)
        plt.xlabel(column,fontsize=20)
        plt.ylabel('SalePrice',fontsize=20)
    plotnumber+=1
plt.tight_layout()
plt.show()

In [None]:
# insights
# GrLivArea, GarageArea, TotalBsmtSF, 1stFlrSF, BsmtFinSF1 and LotFrontage
# show a strong relationship with SalePrice.
# The other columns have a weak to moderate relationship with SalePrice.

## Multivariate Analysis

In [None]:
heatmap=['LotFrontage','LotArea','MasVnrArea','BsmtFinSF1','BsmtFinSF2','BsmtUnfSF','TotalBsmtSF',
           '1stFlrSF','2ndFlrSF','LowQualFinSF','GrLivArea','GarageYrBlt','GarageArea','WoodDeckSF','OpenPorchSF',
           'EnclosedPorch','3SsnPorch','ScreenPorch','PoolArea','MiscVal','YearBuilt','YearRemodAdd']

In [None]:
plt.figure(figsize=(20,20))
sns.heatmap(data[heatmap].corr(),annot=True,annot_kws={'size':7})
plt.show()

In [None]:
heatmaps1=['MSSubClass','OverallQual','OverallCond','BsmtFullBath','BsmtHalfBath','FullBath','HalfBath',
         'BedroomAbvGr','KitchenAbvGr','TotRmsAbvGrd','Fireplaces','GarageCars','MoSold','YrSold']

In [None]:
plt.figure(figsize=(20,20))
sns.heatmap(data[heatmaps1].corr(),annot=True,annot_kws={'size':7})
plt.show()

In [None]:
# insights
# there is no multicollinearity between the input columns, so we do not drop any of these columns

## Data Preprocessing

In [None]:
data.head()

In [None]:
data.drop(columns=['Id'],inplace=True)

In [None]:
# Create new feature
data['House_age']=2026-data['YearBuilt']
data.drop(columns='YearBuilt',inplace=True)

## Handle Outliers

In [None]:
continuous_pre=['LotFrontage','LotArea','MasVnrArea','BsmtFinSF1','BsmtFinSF2','BsmtUnfSF','TotalBsmtSF',
           '1stFlrSF','2ndFlrSF','LowQualFinSF','GrLivArea','GarageYrBlt','GarageArea','WoodDeckSF','OpenPorchSF',
           'EnclosedPorch','3SsnPorch','ScreenPorch','PoolArea','MiscVal','House_age','YearRemodAdd']

In [None]:
for col in continuous_pre:
    print(col,data[col].skew())

In [None]:
for col in continuous_pre:
    if data[col].skew()>1:
        data[col]=np.log1p(data[col])

In [None]:
# Insights
# Most of the columns have a skewness greater than 1
# Columns with a skewness greater than 1 are log-transformed with np.log1p
# Columns with lower skewness are kept as they are
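To see the effect of the transform, the sketch below compares skewness before and after `np.log1p` on a small right-skewed series. The values are made up. `log1p` computes log(1 + x), which is why columns containing zeros (such as porch or pool areas) can be transformed without producing `-inf`.

```python
import numpy as np
import pandas as pd

# Toy right-skewed column (hypothetical values) that also contains a zero.
toy_lot = pd.Series([0, 1, 2, 3, 5, 8, 13, 21, 200], name='LotArea')

skew_before = toy_lot.skew()

# Same rule as the cell above: transform only when skewness exceeds 1.
if skew_before > 1:
    toy_transformed = np.log1p(toy_lot)
else:
    toy_transformed = toy_lot

skew_after = toy_transformed.skew()

print('skew before:', round(skew_before, 2))
print('skew after :', round(skew_after, 2))
```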

## Conversion of Categorical into Numerical

## One-Hot Encoding

In [None]:
one_hot=['MSZoning','Street','LandContour','LotConfig','LandSlope','Neighborhood','Condition1','Condition2','BldgType','HouseStyle',
         'RoofStyle','RoofMatl','Exterior1st','Exterior2nd','Foundation','BsmtFinType1','BsmtFinType2','Electrical','Heating',
        'GarageType','SaleType','SaleCondition']

In [None]:
data=pd.get_dummies(data,columns=one_hot,dtype=int,drop_first=True)

In [None]:
data.head()

In [None]:
# insights
# For the nominal columns, one-hot encoding converts the categorical data into numerical data
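A small sketch of what `pd.get_dummies` with `drop_first=True` produces, and of one way to align a new row to the same columns at prediction time. The toy rows are assumptions, and the `reindex` step is a suggested pattern rather than something the notebook does.

```python
import pandas as pd

toy = pd.DataFrame({'Street': ['Pave', 'Grvl', 'Pave'],
                    'LotArea': [8450, 9600, 11250]})

# drop_first=True removes the first category ('Grvl'), which becomes the baseline.
encoded = pd.get_dummies(toy, columns=['Street'], dtype=int, drop_first=True)
print(encoded)

# A new row may not contain every category, so align it to the training columns.
new_row = pd.DataFrame({'Street': ['Grvl'], 'LotArea': [7000]})
new_encoded = pd.get_dummies(new_row, columns=['Street'], dtype=int)
new_encoded = new_encoded.reindex(columns=encoded.columns, fill_value=0)
print(new_encoded)
```

Dropping the first level avoids a redundant column, since the baseline category is implied when all remaining dummies are 0.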

## Manual Mapping

In [None]:
data.BsmtCond.unique()

In [None]:
ordinal={
    'LotShape':{'IR3':0,'IR2':1,'IR1':2,'Reg':3},
    'ExterQual':{'Fa':0,'TA':1,'Gd':2,'Ex':3},
    'ExterCond':{'Po':0,'Fa':1,'TA':2,'Gd':3,'Ex':4},
    'BsmtQual':{'Fa':0,'TA':1,'Gd':2,'Ex':3},
    'BsmtCond':{'Po':0,'Fa':1,'TA':2,'Gd':3},
    'BsmtExposure':{'No':0,'Mn':1,'Av':2,'Gd':3},
    'HeatingQC':{'Po':0,'Fa':1,'TA':2,'Gd':3,'Ex':4},
    'CentralAir':{'Y':1, 'N':0},
    'KitchenQual':{'Fa':0,'TA':1,'Gd':2,'Ex':3},
    'Functional':{'Sev':0,'Maj2':1,'Maj1':2,'Mod':3,'Min2':4,'Min1':5,'Typ':6},
    'FireplaceQu':{'No Fireplace':0,'Po':1,'Fa':2,'TA':3,'Gd':4,'Ex':5},
    'GarageFinish':{'Unf':0,'RFn':1,'Fin':2},
    'GarageQual':{'Po':0,'Fa':1,'TA':2,'Gd':3,'Ex':4},
    'GarageCond':{'Po':0,'Fa':1,'TA':2,'Gd':3,'Ex':4},
    'PavedDrive':{'N':0,'P':1,'Y':2}
}

In [None]:
for col,mapping in ordinal.items():
    data[col]=data[col].map(mapping)

In [None]:
data.Utilities.value_counts()

In [None]:
# Utilities is almost constant, so it is dropped
data.drop(columns='Utilities',inplace=True)

In [None]:
# insights
# Use manual mapping for the ordinal data.
# Drop the Utilities column because almost all rows are AllPub and only one row is NoSeWa.
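One thing to watch with manual mapping: `Series.map` with a dict returns `NaN` for any label that is missing from the dict, without raising an error. A quick check like the sketch below can catch that early. The toy series, including the `'Po'` label, is an assumption for illustration.

```python
import pandas as pd

quality_map = {'Fa': 0, 'TA': 1, 'Gd': 2, 'Ex': 3}

# 'Po' is deliberately absent from the mapping to show what happens.
toy_quality = pd.Series(['TA', 'Gd', 'Po', 'Ex'], name='KitchenQual')

mapped = toy_quality.map(quality_map)

# Labels that were not covered by the mapping end up as NaN.
unmapped_labels = sorted(set(toy_quality[mapped.isnull()]))
print('unmapped labels:', unmapped_labels)
```

Running the same check on each mapped column before training shows whether any category was silently turned into a missing value.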

In [None]:
data.head()

In [None]:
# Transform the target variable
data.SalePrice.skew()

In [None]:
data['SalePrice']=np.log1p(data['SalePrice'])

In [None]:
# insights
# Transform the target variable because it is skewed.
# The log transform makes the target distribution easier for the model to fit.
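Because the model is trained on `log1p(SalePrice)`, predictions have to be mapped back with `np.expm1` before they are read as prices. The sketch below checks the round trip on a few illustrative prices (assumed values) and shows how an error in log space translates into a relative error in price.

```python
import numpy as np

prices = np.array([34900.0, 163000.0, 755000.0])  # illustrative prices

log_prices = np.log1p(prices)      # what the model is trained on
recovered = np.expm1(log_prices)   # exact inverse of log1p

# An additive error of 0.1 in log space is roughly a 10.5% error in price.
ratio = np.expm1(log_prices + 0.1) / prices

print(recovered)
print(ratio)
```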

## Split the Data

In [None]:
x=data.drop(columns='SalePrice')
y=data.SalePrice

In [None]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,random_state=42)

## Scale

In [None]:
from sklearn.preprocessing import StandardScaler
scale=StandardScaler()
x_train_scaled=x_train.copy()
x_test_scaled=x_test.copy()
x_train_scaled[continuous_pre]=scale.fit_transform(x_train[continuous_pre])
x_test_scaled[continuous_pre]=scale.transform(x_test[continuous_pre])

## Model Building

In [None]:
from sklearn.linear_model import LinearRegression
lr=LinearRegression()
lr.fit(x_train_scaled,y_train)
y_pred=lr.predict(x_test_scaled)

In [None]:
y_test_original=np.expm1(y_test)
y_pred_original=np.expm1(y_pred)

In [None]:
from sklearn.metrics import r2_score,mean_squared_error,mean_absolute_error
print('r2_score :',r2_score(y_test_original,y_pred_original))
print('mean_squared_error :',mean_squared_error(y_test_original,y_pred_original))
print('mean_absolute_error :',mean_absolute_error(y_test_original,y_pred_original))

In [None]:
n=x_test.shape[0]
p=x_test.shape[1]
r2=r2_score(y_test_original,y_pred_original)
adj_r2=1-(1-r2)*(n-1)/(n-p-1)
print('Adjusted_R2_Score :',adj_r2)
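The adjusted R² formula used above can be wrapped in a small helper so that every model is scored the same way. This is a sketch: the function name `adjusted_r2` and the example numbers are assumptions, not results from this notebook.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    denom = n_samples - n_features - 1
    if denom <= 0:
        raise ValueError('need n_samples > n_features + 1')
    return 1 - (1 - r2) * (n_samples - 1) / denom

# Hypothetical numbers: R^2 of 0.90 on 101 rows with 10 features.
example = adjusted_r2(0.90, n_samples=101, n_features=10)
print(round(example, 4))
```

The penalty grows as the number of features approaches the number of rows, which matters here because one-hot encoding adds many columns while the test split is comparatively small.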

In [None]:
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(lr, x_train_scaled, y_train, cv=5, scoring='r2')
print("Cross Validation R2 Mean:", cv_scores.mean())

In [None]:
# Insights
# Linear Regression achieved strong predictive performance with 91% test R² and 83% cross-validation R².
# The results indicate strong linear relationships between features and house prices.
# The model generalizes well and performs stably, making it a reliable choice for price prediction.
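The cross-validation above runs on `x_train_scaled`, whose scaler was fitted on the whole training set, so each validation fold has already influenced the scaling. For ordinary least squares this should not change the scores, since the fit is invariant to feature rescaling, but the pattern matters for regularised models. A hedged sketch of the leakage-free alternative with a scikit-learn `Pipeline`, on synthetic data (all values and the names `X_toy`, `y_toy`, `pipe` are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    'GrLivArea': rng.normal(1500, 400, size=200),
    'CentralAir': rng.integers(0, 2, size=200),
})
y_toy = (0.0005 * X_toy['GrLivArea'] + 0.1 * X_toy['CentralAir']
         + rng.normal(0, 0.05, size=200))

# Scale only the continuous column; pass the binary column through unchanged.
pipe = Pipeline([
    ('scale', ColumnTransformer(
        [('num', StandardScaler(), ['GrLivArea'])],
        remainder='passthrough')),
    ('model', LinearRegression()),
])

# The scaler is refitted inside every fold, on that fold's training part only.
toy_cv_scores = cross_val_score(pipe, X_toy, y_toy, cv=5, scoring='r2')
print('CV R2 mean:', toy_cv_scores.mean())
```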

## Random Forest Regressor

In [None]:
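from sklearn.ensemble import RandomForestRegressor
rf_model=RandomForestRegressor()
rf_model.fit(x_train,y_train)
y_pred=rf_model.predict(x_test)

In [None]:
# back-transform only after the Random Forest predictions exist;
# doing it before fitting would reuse the Linear Regression predictions
y_test_original1=np.expm1(y_test)
y_pred_original1=np.expm1(y_pred)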

In [None]:
from sklearn.metrics import r2_score,mean_squared_error,mean_absolute_error
print('r2_score :',r2_score(y_test_original1,y_pred_original1))
print('mean_squared_error :',mean_squared_error(y_test_original1,y_pred_original1))
print('mean_absolute_error :',mean_absolute_error(y_test_original1,y_pred_original1))

In [None]:
n=x_test.shape[0]
p=x_test.shape[1]
# compute R2 on the original price scale, as was done for Linear Regression
r2=r2_score(np.expm1(y_test),np.expm1(y_pred))
adj_r2=1-(1-r2)*(n-1)/(n-p-1)
print('Adjusted_R2_Score :',adj_r2)

## Hyperparameter Tuning

In [None]:
from sklearn.model_selection import RandomizedSearchCV
params={
    'n_estimators':[100,200,300,400],
    'max_depth':[None,10,20,30,40],
    'min_samples_leaf':[1,2,4],
    'min_samples_split':[2,5,10],
    'max_features':['sqrt','log2']
}
rf = RandomForestRegressor(random_state=30)
random_search = RandomizedSearchCV(
    estimator=rf,
    param_distributions=params,
    n_iter=20,    
    cv=5,
    scoring="r2",
    verbose=1,
    n_jobs=-1,
    random_state=42
)

random_search.fit(x_train, y_train)

print("Best Parameters:", random_search.best_params_)
print("Best R2:", random_search.best_score_)
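A fitted `RandomizedSearchCV` keeps the best model refitted on the full training data (the default `refit=True`) in `best_estimator_`, so the winning parameters do not have to be retyped by hand. A minimal sketch on synthetic data follows; the data, the reduced parameter grid and the names `toy_search` and `best_rf` are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data standing in for x_train / y_train.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 4))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=120)

toy_search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=30),
    param_distributions={'n_estimators': [10, 20],
                         'min_samples_leaf': [1, 2, 4]},
    n_iter=3,
    cv=3,
    scoring='r2',
    random_state=42,
)
toy_search.fit(X_toy, y_toy)

# best_estimator_ is already refitted on all of X_toy with the best parameters.
best_rf = toy_search.best_estimator_
toy_pred = best_rf.predict(X_toy[:5])

print(toy_search.best_params_)
print(toy_pred)
```

Reusing `best_estimator_` also keeps the `random_state` of the searched estimator, so the final model matches the one that produced the reported cross-validation score.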

In [None]:
from sklearn.ensemble import RandomForestRegressor
rf1_model=RandomForestRegressor(n_estimators=400,min_samples_split=2,min_samples_leaf=2,max_features='sqrt',max_depth=None,
                                random_state=30)  # same seed as the searched estimator, for reproducible results
rf1_model.fit(x_train,y_train)
y_pred1=rf1_model.predict(x_test)

In [None]:
y_test_original2=np.expm1(y_test)
y_pred_original2=np.expm1(y_pred1)

In [None]:
from sklearn.metrics import r2_score,mean_squared_error,mean_absolute_error
print('r2_score :',r2_score(y_test_original2,y_pred_original2))
print('mean_squared_error :',mean_squared_error(y_test_original2,y_pred_original2))
print('mean_absolute_error :',mean_absolute_error(y_test_original2,y_pred_original2))

In [None]:
# Insights
# Random Forest achieved an R² of approximately 84%, which was lower than Linear Regression.
# Even after hyperparameter tuning, the model did not outperform the linear model.
# This suggests that the dataset primarily contains linear relationships and does not require complex nonlinear modeling techniques.

## Model Selection
1. Higher Predictive Accuracy
   Linear Regression achieved the highest Test R² (91%), outperforming Random Forest, indicating stronger predictive capability.

2. Better Generalization
   The 5-fold cross-validation score of 83% indicates that Linear Regression performs consistently across different data splits.

3. Limited Nonlinear Complexity
   Random Forest did not improve performance even after tuning, suggesting that the dataset primarily contains linear relationships.

4. Interpretability and Simplicity
   Linear Regression provides clear coefficient interpretation and achieves high accuracy without unnecessary model complexity.

## Business Impact
The developed model can help real estate agencies, property sellers, and buyers estimate house prices based on property characteristics.
Accurate price prediction supports:

- Better investment decisions
- Fair property valuation
- Reduced pricing errors
- Faster sales cycle

By explaining 91% of the price variance, the model provides a reliable decision-support tool for housing market analysis.

## Conclusion
This project aimed to predict house prices using various property-related features. After performing data cleaning, exploratory data analysis, feature engineering, and log transformation of the target variable, multiple machine learning models were evaluated.

Among the models tested, Linear Regression achieved the best performance with a Test R² of 91% and a Cross-Validation R² of 83%, indicating strong predictive accuracy and stable generalization. Although Random Forest was implemented and tuned, it did not outperform the linear model, suggesting that the dataset primarily contains strong linear relationships.

The results demonstrate that a well-preprocessed dataset combined with a simple and interpretable model can achieve high predictive performance. Therefore, Linear Regression was selected as the final model for house price prediction.

Overall, the project highlights the importance of proper data preprocessing, model comparison, and validation techniques in building reliable machine learning solutions.