# Real Estate Price Prediction Using Machine Learning

## Executive Summary
Accurately estimating house prices is a critical aspect of the real estate industry. Understanding what drives property value helps inform pricing strategies, investment decisions, and negotiations. This project leverages machine learning techniques to predict residential house prices. The goal is to build a predictive model that captures market patterns and outputs realistic, data-driven price estimates.

### Problem Statement

Accurately estimating the price of residential properties is a key challenge in the real estate market. Buyers want fair deals, sellers want competitive pricing, and investors seek undervalued opportunities. This project aims to build a machine learning model that predicts house prices based on key property features such as size, location, condition, and amenities.

Using the "House Price Prediction" dataset from Kaggle, the goal is to develop and evaluate regression models that can learn from historical housing data and provide reliable price estimates for new, unseen properties.

### Objectives
1. To support real estate agents and property developers in pricing homes more accurately, reducing the risk of overpricing or undervaluing properties.

2. To help home buyers and sellers make informed decisions by providing data-driven estimates of property values based on key housing features and market factors.

3. To uncover and analyze the key drivers of property value such as location, square footage, number of bedrooms/bathrooms, and neighborhood conditions.

4. To reduce manual valuation time and subjectivity by offering an automated prediction system that complements or enhances traditional property appraisal methods.

5. To identify pricing trends and anomalies within a local housing market, assisting stakeholders in spotting investment opportunities or areas of concern.

6. To simulate the impact of property improvements (e.g., renovations or additional rooms) on house value, guiding property owners on which upgrades yield the highest return.

7. To build a predictive tool that can be used by real estate platforms to enhance customer experience by offering instant price estimates on property listings.



### This project will involve:
1. Data Loading
2. Data Cleaning and Preprocessing
3. Exploratory Data Analysis (EDA)
4. Feature Engineering
5. Model Training and Evaluation
6. Conclusions & Recommendations

In [1]:
#importing the necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns  
import matplotlib.pyplot as plt  
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    train_test_split, RandomizedSearchCV, cross_validate, cross_val_score,
)
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    precision_recall_curve, roc_auc_score, confusion_matrix,
    ConfusionMatrixDisplay, make_scorer, roc_curve, auc,
)
from sklearn.ensemble import RandomForestClassifier
#from imblearn.over_sampling import SMOTE
import joblib


### 1. Loading and inspecting the data

In [2]:
#Load the dataset in Python using pandas and inspect the first few rows

df=pd.read_csv("Data/HousingData.csv")
df.head(10)

Unnamed: 0,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,sqft_above,sqft_basement,yr_built,yr_renovated,street,city,statezip,country
0,2014-05-02 00:00:00,313000.0,3.0,1.5,1340,7912,1.5,0,0,3,1340,0,1955,2005,18810 Densmore Ave N,Shoreline,WA 98133,USA
1,2014-05-02 00:00:00,2384000.0,5.0,2.5,3650,9050,2.0,0,4,5,3370,280,1921,0,709 W Blaine St,Seattle,WA 98119,USA
2,2014-05-02 00:00:00,342000.0,3.0,2.0,1930,11947,1.0,0,0,4,1930,0,1966,0,26206-26214 143rd Ave SE,Kent,WA 98042,USA
3,2014-05-02 00:00:00,420000.0,3.0,2.25,2000,8030,1.0,0,0,4,1000,1000,1963,0,857 170th Pl NE,Bellevue,WA 98008,USA
4,2014-05-02 00:00:00,550000.0,4.0,2.5,1940,10500,1.0,0,0,4,1140,800,1976,1992,9105 170th Ave NE,Redmond,WA 98052,USA
5,2014-05-02 00:00:00,490000.0,2.0,1.0,880,6380,1.0,0,0,3,880,0,1938,1994,522 NE 88th St,Seattle,WA 98115,USA
6,2014-05-02 00:00:00,335000.0,2.0,2.0,1350,2560,1.0,0,0,3,1350,0,1976,0,2616 174th Ave NE,Redmond,WA 98052,USA
7,2014-05-02 00:00:00,482000.0,4.0,2.5,2710,35868,2.0,0,0,3,2710,0,1989,0,23762 SE 253rd Pl,Maple Valley,WA 98038,USA
8,2014-05-02 00:00:00,452500.0,3.0,2.5,2430,88426,1.0,0,0,4,1570,860,1985,0,46611-46625 SE 129th St,North Bend,WA 98045,USA
9,2014-05-02 00:00:00,640000.0,4.0,2.0,1520,6200,1.5,0,0,3,1520,0,1945,2010,6811 55th Ave NE,Seattle,WA 98115,USA


This dataset contains detailed information on residential properties, including their physical attributes, location and sale prices. Key attributes in the dataset include:
- `price`: The target variable, the sale price of the house.

- `bedrooms` and `bathrooms`: Number of bedrooms and bathrooms in the property.

- `sqft_living` and `sqft_lot`: Square footage of the interior living space and of the overall lot.

- `floors`: Number of floors in the house.

- `waterfront`, `view`, and `condition`: Indicators of whether the property is on the waterfront, the quality of its view, and its condition rating.

- `yr_built` and `yr_renovated`: Construction year and the year of the last major renovation, if any (the sample rows show `0` where no renovation year is given).

- `street`, `city`, `statezip`, and `country`: Location information (street address, city, state and ZIP code, and country).

In [3]:
# Checking the shape of the dataset (rows, columns)

df.shape

(4600, 18)

In [4]:
# Checking dataset structure and column details

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4600 entries, 0 to 4599
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   date           4600 non-null   object 
 1   price          4600 non-null   float64
 2   bedrooms       4600 non-null   float64
 3   bathrooms      4600 non-null   float64
 4   sqft_living    4600 non-null   int64  
 5   sqft_lot       4600 non-null   int64  
 6   floors         4600 non-null   float64
 7   waterfront     4600 non-null   int64  
 8   view           4600 non-null   int64  
 9   condition      4600 non-null   int64  
 10  sqft_above     4600 non-null   int64  
 11  sqft_basement  4600 non-null   int64  
 12  yr_built       4600 non-null   int64  
 13  yr_renovated   4600 non-null   int64  
 14  street         4600 non-null   object 
 15  city           4600 non-null   object 
 16  statezip       4600 non-null   object 
 17  country        4600 non-null   object 
dtypes: float

The output shows the count of non-null entries, data types, and memory usage, which helps in spotting missing values and identifying opportunities for type conversion.

The `date` column is stored as `object` (plain strings), so we convert it to `datetime` with `pd.to_datetime()` and then extract the year and month as separate features.

In [7]:
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month

print(df[['date', 'year', 'month']].dtypes)


date     datetime64[ns]
year              int32
month             int32
dtype: object
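As a side note, the same conversion can be done while reading the file. The sketch below is a minimal illustration, not the notebook's own method: it assumes a hypothetical two-row CSV held in memory (`csv_text`) with the same `date` layout as the data above, and uses the `parse_dates` argument of `pd.read_csv`.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the same layout as the `date` column above
csv_text = (
    "date,price\n"
    "2014-05-02 00:00:00,313000.0\n"
    "2014-06-15 00:00:00,420000.0\n"
)

# parse_dates converts the column to datetime while reading,
# so no separate pd.to_datetime() call is needed afterwards
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Year and month are then available through the .dt accessor
sample["year"] = sample["date"].dt.year
sample["month"] = sample["date"].dt.month

print(sample.dtypes)
```

Either approach should give a `datetime64` column; parsing at load time simply keeps the type handling in one place.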


Next, we convert the location columns to the `category` data type to reduce memory usage and prepare them for encoding.

In [8]:
categorical_cols = ['city', 'statezip', 'country']
for col in categorical_cols:
    df[col] = df[col].astype('category')

categorical_cols


['city', 'statezip', 'country']
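To see why the `category` type can save memory, the sketch below compares the footprint of a repetitive string column before and after conversion. The series `cities` is a hypothetical stand-in for a column such as `city`, not the real data, and the exact byte counts will vary by pandas version.

```python
import pandas as pd

# Hypothetical column: four distinct city names repeated 1,000 times each
cities = pd.Series(["Seattle", "Renton", "Bellevue", "Redmond"] * 1000)

# deep=True counts the bytes used by the string values themselves
as_strings = cities.memory_usage(deep=True)

# A categorical stores each distinct label once plus small integer codes
as_category = cities.astype("category").memory_usage(deep=True)

print(f"strings:  {as_strings} bytes")
print(f"category: {as_category} bytes")
```

The saving depends on how often values repeat; a column in which nearly every value is unique gains little from this conversion.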

In [10]:
print(df.dtypes)

date             datetime64[ns]
price                   float64
bedrooms                float64
bathrooms               float64
sqft_living               int64
sqft_lot                  int64
floors                  float64
waterfront                int64
view                      int64
condition                 int64
sqft_above                int64
sqft_basement             int64
yr_built                  int64
yr_renovated              int64
street                   object
city                   category
statezip               category
country                category
year                      int32
month                     int32
dtype: object


In [11]:
# Checking for missing values
print(df.isnull().sum())


date             0
price            0
bedrooms         0
bathrooms        0
sqft_living      0
sqft_lot         0
floors           0
waterfront       0
view             0
condition        0
sqft_above       0
sqft_basement    0
yr_built         0
yr_renovated     0
street           0
city             0
statezip         0
country          0
year             0
month            0
dtype: int64


In [20]:
# Converting bedrooms and price to integers (astype(int) truncates any fractional part)

cols_to_int = ['bedrooms', 'price']

for col in cols_to_int:
    df[col] = df[col].astype(int)


In [25]:
# Handle outliers using IQR
numeric_cols = ['price', 'sqft_living', 'sqft_lot', 'bedrooms', 'bathrooms']

for col in numeric_cols:
    Q1 = df[col].quantile(0.25)            # 25th percentile
    Q3 = df[col].quantile(0.75)            # 75th percentile
    IQR = Q3 - Q1                          # Interquartile Range
    lower_bound = Q1 - 1.5 * IQR           # Lower outlier threshold
    upper_bound = Q3 + 1.5 * IQR           # Upper outlier threshold
    
    # This replaces outliers with the nearest acceptable value
    df[col] = df[col].clip(lower=lower_bound, upper=upper_bound)


In [26]:
numeric_cols

['price', 'sqft_living', 'sqft_lot', 'bedrooms', 'bathrooms']
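The capping step above can be wrapped in a small helper and checked on a toy series. This is a minimal sketch: the function name `cap_outliers_iqr` and the parameter `k` are illustrative choices, not part of the notebook.

```python
import pandas as pd


def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those fences."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Toy data: Q1 = 2, Q3 = 4, IQR = 2, so the fences are -1 and 7
toy = pd.Series([1, 2, 3, 4, 100])
capped = cap_outliers_iqr(toy)

# The extreme value 100 is pulled down to the upper fence, 7
print(capped.tolist())
```

Note that the fences are not always whole numbers. The summary statistics later in this notebook show `bedrooms` ranging from 1.5 to 5.5, which are fence values rather than observed bedroom counts.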

### Dropping Irrelevant Columns

The `date` column has already been used to extract useful features such as `year` and `month`, making it redundant in its original form. The `street` column is high-cardinality and unlikely to provide predictive value to the model. Therefore, we drop both columns to clean the dataset and reduce noise.


In [31]:
cols_to_drop = ['date', 'street']
df.drop(columns=[col for col in cols_to_drop if col in df.columns], inplace=True)


In [33]:
#Confirming date and street were dropped
print(df.columns)

Index(['price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
       'waterfront', 'view', 'condition', 'sqft_above', 'sqft_basement',
       'yr_built', 'yr_renovated', 'city', 'statezip', 'country', 'year',
       'month'],
      dtype='object')


### Encoding Categorical Variables

Most machine learning models require numerical input, so categorical variables must be converted into a numerical format. In this case, we use **one-hot encoding** to transform `city`, `statezip`, and `country` into binary indicator variables. We pass `drop_first=True`, which drops the first level of each variable to avoid the dummy variable trap and helps prevent multicollinearity in models like linear or logistic regression.


In [35]:
categorical_cols = ['city', 'statezip', 'country']

# Create a new DataFrame with one-hot encoded categorical variables
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Convert boolean columns to integer (0/1)
df_encoded = df_encoded.astype({col: 'int' for col in df_encoded.columns if df_encoded[col].dtype == 'bool'})

# Preview the encoded DataFrame
print(df_encoded.head())


        price  bedrooms  bathrooms  sqft_living  sqft_lot  floors  waterfront  \
0   313000.00       3.0       1.50         1340      7912     1.5           0   
1  1153093.75       5.0       2.50         3650      9050     2.0           0   
2   342000.00       3.0       2.00         1930     11947     1.0           0   
3   420000.00       3.0       2.25         2000      8030     1.0           0   
4   550000.00       4.0       2.50         1940     10500     1.0           0   

   view  condition  sqft_above  ...  statezip_WA 98155  statezip_WA 98166  \
0     0          3        1340  ...                  0                  0   
1     4          5        3370  ...                  0                  0   
2     0          4        1930  ...                  0                  0   
3     0          4        1000  ...                  0                  0   
4     0          4        1140  ...                  0                  0   

   statezip_WA 98168  statezip_WA 98177  statezip_
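The effect of `drop_first=True` is easier to see on a handful of rows. The sketch below uses a toy frame built from four of the preview rows shown earlier; the `dtype=int` argument is an alternative to the separate boolean-to-integer conversion used above, not the notebook's own approach.

```python
import pandas as pd

# Toy frame based on four preview rows (price, city, country)
toy = pd.DataFrame({
    "price": [313000, 342000, 420000, 490000],
    "city": ["Shoreline", "Kent", "Bellevue", "Seattle"],
    "country": ["USA", "USA", "USA", "USA"],
})

# drop_first=True drops the first level (alphabetically) of each column;
# dtype=int yields 0/1 integers directly instead of booleans
encoded = pd.get_dummies(
    toy, columns=["city", "country"], drop_first=True, dtype=int
)

print(encoded)
```

Here `Bellevue` becomes the reference level, so the Bellevue row has zeros in every `city_` column. A column with a single level, which `country` appears to be in the preview rows, contributes no indicator columns at all once its only level is dropped.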

In [36]:
df_encoded.describe().T  # Transposed for easier viewing


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
price,4600.0,516993.292391,261471.572219,0.000,322875.00,460943.00,654962.50,1153093.750
bedrooms,4600.0,3.392609,0.856964,1.500,3.00,3.00,4.00,5.500
bathrooms,4600.0,2.139158,0.720548,0.625,1.75,2.25,2.50,3.625
sqft_living,4600.0,2114.626739,867.106902,370.000,1460.00,1980.00,2620.00,4360.000
sqft_lot,4600.0,8934.793261,5388.102126,638.000,5000.75,7683.00,11001.25,20002.000
...,...,...,...,...,...,...,...,...
statezip_WA 98188,4600.0,0.005000,0.070541,0.000,0.00,0.00,0.00,1.000
statezip_WA 98198,4600.0,0.012174,0.109674,0.000,0.00,0.00,0.00,1.000
statezip_WA 98199,4600.0,0.014783,0.120695,0.000,0.00,0.00,0.00,1.000
statezip_WA 98288,4600.0,0.000652,0.025532,0.000,0.00,0.00,0.00,1.000


In [37]:
df_encoded[df_encoded['price'] == 0]


Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,sqft_above,...,statezip_WA 98155,statezip_WA 98166,statezip_WA 98168,statezip_WA 98177,statezip_WA 98178,statezip_WA 98188,statezip_WA 98198,statezip_WA 98199,statezip_WA 98288,statezip_WA 98354
4354,0.0,3.0,1.75,1490,10125,1.0,0,0,4,1490,...,0,0,0,0,0,0,0,0,0,0
4356,0.0,4.0,2.75,2600,5390,1.0,0,0,4,1300,...,0,0,0,0,0,0,0,1,0,0
4357,0.0,5.5,2.75,3200,9200,1.0,0,2,4,1600,...,0,0,0,0,0,0,0,0,0,0
4358,0.0,5.0,3.5,3480,20002,2.0,0,0,4,2490,...,0,0,0,0,0,0,0,0,0,0
4361,0.0,5.0,1.5,1500,7112,1.0,0,0,5,760,...,0,1,0,0,0,0,0,0,0,0
4362,0.0,4.0,3.625,3680,18804,2.0,0,0,3,3680,...,0,0,0,0,0,0,0,0,0,0
4374,0.0,2.0,2.5,2200,20002,1.0,0,3,3,2200,...,0,0,0,0,0,0,0,0,0,0
4376,0.0,4.0,2.25,2170,10500,1.0,0,2,4,1270,...,0,1,0,0,0,0,0,0,0,0
4382,0.0,5.0,3.625,4360,6324,2.0,0,0,3,3210,...,0,0,0,0,0,0,0,0,0,0
4383,0.0,5.0,3.625,4360,9000,2.0,0,0,3,4430,...,0,0,0,0,0,0,0,0,0,0


In [38]:
(df_encoded['price'] == 0).sum()


49

#### Dropping Invalid Price Entries

Some rows in the dataset have a house price of 0, which is not a realistic sale price. These entries likely represent missing or incorrect data. Since they account for 49 of the 4,600 rows (just over 1% of the dataset), we remove them so they do not distort model training.


In [39]:
# Removing rows with a house price of 0
df_encoded = df_encoded[df_encoded['price'] != 0]

print(df_encoded.shape)


(4551, 134)
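Before dropping rows, it can help to report how much data is affected. With the counts above, 49 of 4,600 rows is about 1.07%. The sketch below shows the same check on a hypothetical five-row frame (`toy`); in the notebook the frame would be `df_encoded`.

```python
import pandas as pd

# Hypothetical toy frame standing in for df_encoded
toy = pd.DataFrame({"price": [0.0, 313000.0, 0.0, 420000.0, 550000.0]})

# Boolean mask of rows with a zero price
zero_mask = toy["price"] == 0

# The mean of a boolean mask is the fraction of True values
share = zero_mask.mean()
print(f"{zero_mask.sum()} of {len(toy)} rows ({share:.1%}) have price == 0")

# Keep only rows with a non-zero price and renumber the index
cleaned = toy.loc[~zero_mask].reset_index(drop=True)
print(cleaned.shape)
```

Resetting the index is optional; it only matters if later steps rely on a contiguous row index.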


In [40]:
df_encoded.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
price,4551.0,522559.689079,257282.400079,7800.000,326264.00,465000.00,657500.0,1153093.750
bedrooms,4551.0,3.386399,0.852697,1.500,3.00,3.00,4.0,5.500
bathrooms,4551.0,2.134970,0.717214,0.625,1.75,2.25,2.5,3.625
sqft_living,4551.0,2108.644913,862.904002,370.000,1460.00,1970.00,2610.0,4360.000
sqft_lot,4551.0,8913.638761,5385.045286,638.000,5000.00,7680.00,10978.0,20002.000
...,...,...,...,...,...,...,...,...
statezip_WA 98188,4551.0,0.004834,0.069367,0.000,0.00,0.00,0.0,1.000
statezip_WA 98198,4551.0,0.012305,0.110255,0.000,0.00,0.00,0.0,1.000
statezip_WA 98199,4551.0,0.014722,0.120451,0.000,0.00,0.00,0.0,1.000
statezip_WA 98288,4551.0,0.000659,0.025669,0.000,0.00,0.00,0.0,1.000


### Frequency Summary of One-Hot Encoded Binary Columns

To better understand the distribution of categories after one-hot encoding, we calculate the proportion of 1s (i.e., presence of each category) for all binary columns. This helps identify rare or dominant categories that may impact model performance.


In [42]:
binary_cols = [col for col in df_encoded.columns 
               if df_encoded[col].nunique() == 2 and 
               set(df_encoded[col].unique()).issubset({0, 1})]


In [43]:
print("\nSummary Statistics for Binary Columns (Proportion of 1s):") 
binary_summary = df_encoded[binary_cols].mean().sort_values(ascending=False)
print(binary_summary)



Summary Statistics for Binary Columns (Proportion of 1s):
city_Seattle                0.343002
city_Renton                 0.063942
city_Bellevue               0.061745
city_Redmond                0.051637
city_Kirkland               0.041090
                              ...   
statezip_WA 98050           0.000439
city_Inglewood-Finn Hill    0.000220
city_Snoqualmie Pass        0.000220
city_Beaux Arts Village     0.000220
statezip_WA 98068           0.000220
Length: 120, dtype: float64


### Binary Feature Distribution

After one-hot encoding categorical variables, we calculated the proportion of 1s for each binary column. This reveals how frequently each category appears in the dataset.

- The most common city is Seattle, representing over 34% of listings.
- Several cities and zip codes occur in less than 0.1% of the data, which may be considered rare and could potentially be dropped or grouped in future steps.
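One possible way to group rare categories, sketched under stated assumptions: the helper name `group_rare`, the `min_share` threshold, and the `"Other"` label are all hypothetical choices, and the grouping would need to happen on the original string column before one-hot encoding.

```python
import pandas as pd


def group_rare(series: pd.Series, min_share: float = 0.001,
               other: str = "Other") -> pd.Series:
    """Replace categories whose share of rows is below min_share with `other`.

    Expects a string column; a categorical column should be converted
    with .astype(str) first, since `other` is not an existing category.
    """
    shares = series.value_counts(normalize=True)
    rare = shares[shares < min_share].index
    # Keep values that are not rare; substitute the label for the rest
    return series.where(~series.isin(rare), other)


# Toy data: shares are 0.6, 0.3 and 0.1
cities = pd.Series(["Seattle"] * 6 + ["Renton"] * 3 + ["Snoqualmie Pass"])

# With a 20% threshold only the 10% category is folded into "Other"
grouped = group_rare(cities, min_share=0.2)
print(grouped.value_counts())
```

Grouping reduces the number of indicator columns after encoding, at the cost of losing the distinction between the merged categories. Whether that trade-off helps here would need to be checked against model performance.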


In [47]:
# Save the cleaned and encoded data
df_encoded.to_csv('Data/cleaned_housing_data.csv', index=False)

# Confirmation prints
print("Encoded and cleaned data saved to 'cleaned_housing_data.csv'")
print(f"Data shape: {df_encoded.shape}")
print(df_encoded.head())



Encoded and cleaned data saved to 'cleaned_housing_data.csv'
Data shape: (4551, 134)
        price  bedrooms  bathrooms  sqft_living  sqft_lot  floors  waterfront  \
0   313000.00       3.0       1.50         1340      7912     1.5           0   
1  1153093.75       5.0       2.50         3650      9050     2.0           0   
2   342000.00       3.0       2.00         1930     11947     1.0           0   
3   420000.00       3.0       2.25         2000      8030     1.0           0   
4   550000.00       4.0       2.50         1940     10500     1.0           0   

   view  condition  sqft_above  ...  statezip_WA 98155  statezip_WA 98166  \
0     0          3        1340  ...                  0                  0   
1     4          5        3370  ...                  0                  0   
2     0          4        1930  ...                  0                  0   
3     0          4        1000  ...                  0                  0   
4     0          4        1140  ...        