<a href="https://colab.research.google.com/github/Parthshh19/Real-Estate-Analytics---Predicting-House-Prices/blob/main/Real_Estate_Analytics_Predicting_House_Prices.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Real estate analytics for predicting house prices

## Table of Contents

1. [Executive Summary](#cell_Summary)

2. [Data Preprocessing](#cell_Preprocessing)

3. [Predictive Modeling](#cell_model)

4. [Experiments Report](#cell_report)



<a id = "cell_Summary"></a>
## 1. Executive Summary

**Business Problem**

The real estate market is highly competitive, and accurate house price prediction is crucial for both sellers and buyers. The ability to predict house prices based on features like location, size, and other property characteristics can provide valuable insights for real estate agencies, investors, and individuals looking to make informed decisions. A robust predictive model would allow stakeholders to assess property values more accurately, minimize risk, and capitalize on opportunities in the housing market.

In this project, we aim to build AI models that can predict house prices based on historical sales data and house features. By developing and comparing various machine learning models, including Linear Regression and Multi-Layer Perceptrons (MLPs), we will identify the best model to deliver accurate predictions suitable for real-world applications.

**Dataset**

The dataset contains 20,000 house sale records with 20 columns (one target and 19 input features), plus an `id` column that is used as the index:
- **Target Variable**: `price`
- **Input Features**: `bedrooms`, `bathrooms`, `sqft_living`, `sqft_lot`, `floors`, `waterfront`, `view`, `condition`, `grade`, `sqft_above`, `sqft_basement`, `yr_built`, `yr_renovated`, `zipcode`, `lat`, `long`, `sqft_living15`, `sqft_lot15`, and `date`.

Each record describes house characteristics that may impact the house price, making it an ideal dataset for regression analysis.

**Methods**

Several models were developed and compared to predict house prices:

- **Linear Regression**: A simple, interpretable model that serves as a baseline.
- **Multi-Layer Perceptron (MLP) models**: Shallow and deep architectures with different optimizers (Adam, RMSprop, SGD), activation functions (ReLU, Swish, LeakyReLU), dropout layers, and regularization techniques.

**Experimentation**

A total of 12 deep learning models and a linear regression model were trained. A 70/30 train-validation split was used for all experiments.

Performance Metrics:

- **Mean Absolute Error (MAE)**: The average absolute difference between predicted and actual prices, in the same unit as the target variable.
- **Mean Squared Error (MSE) and Root Mean Squared Error (RMSE)**: MSE is the average squared difference between predicted and actual values; RMSE is its square root, which expresses the error in the same unit as the target variable.
- **Correlation Coefficient**: Measures the linear relationship between predicted and actual prices. A value close to 1.0 indicates a near-perfect positive relationship, meaning the predictions closely track actual house prices.
- **Training Loss**: Shows how well the model fits the data it was trained on.
- **Validation Loss**: Shows how well the model generalizes to unseen data.
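
The metrics above can be sketched with NumPy as follows. This is a minimal illustration rather than the notebook's own evaluation code: the helper name `regression_metrics` and the sample prices are hypothetical.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and the Pearson correlation of two price vectors."""
    # Flatten so that (n, 1) column arrays and plain lists are both accepted
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    errors = y_pred - y_true
    mae = np.mean(np.abs(errors))
    mse = np.mean(errors ** 2)
    rmse = np.sqrt(mse)
    corr = np.corrcoef(y_true, y_pred)[0, 1]
    return {"mae": mae, "mse": mse, "rmse": rmse, "corr": corr}

# Hypothetical prices, for illustration only
actual = [200000.0, 350000.0, 500000.0, 750000.0]
predicted = [210000.0, 330000.0, 520000.0, 700000.0]
scores = regression_metrics(actual, predicted)
print(scores)
```

MAE and RMSE are both in the same unit as `price`, but RMSE penalizes large misses more heavily, so comparing the two gives a rough sense of how much a few large errors dominate.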

**Analysis**

Best Performing Model:

Model 11 (RMSprop with momentum) achieved the lowest validation MAE and the lowest training error of all models, along with a very high correlation coefficient. It uses a moderately sized architecture with the RMSprop optimizer, which suggests that the smoother convergence provided by momentum contributed to better predictions.

**Suitability for Real-World Deployment**

The chosen model can be deployed in real-world applications due to its relatively low error and ability to generalize well. However, additional tuning, testing, and model monitoring would be required to ensure performance stability in production. This model could be integrated into real estate platforms to provide instant house price predictions, helping buyers, sellers, and real estate agents make data-driven decisions.

<a id = "cell_Preprocessing"></a>
## 2. Data Preprocessing

Load some Python libraries.

In [None]:
from __future__ import print_function
import os
import math
import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
from sklearn.model_selection import cross_val_score, KFold, train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
# SimpleImputer was moved to sklearn.impute in version 0.20
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_absolute_error, mean_squared_error
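import tensorflow as tf
from sklearn.linear_model import LinearRegression
# Import every Keras component from tensorflow.keras rather than mixing in standalone keras.
# Sequential and metrics are needed by the model-building functions defined later.
from tensorflow.keras import metrics
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import SGD, Nadam, RMSprop, Adam
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization, LeakyReLU
from tensorflow.keras.regularizers import l2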

Some options to control Pandas display

In [None]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

Upload the provided dataset `house_price.csv` to Google Colab and run the code below.

In [None]:
house_price_org = pd.read_csv("house_price.csv")
house_price_org.set_index('id', inplace=True)
house_price_org.head(10)
# len() counts rows; DataFrame.size would count rows x columns (20000 x 20 = 400000)
print('Number of records read: ', len(house_price_org))

Number of records read:  20000


In [None]:
house_price_org.shape

(20000, 20)

Find the column types and the number of missing values in each column

In [None]:
# Finding column types
house_price_org.dtypes

column,dtype
date,object
price,float64
bedrooms,int64
bathrooms,float64
sqft_living,int64
sqft_lot,int64
floors,float64
waterfront,int64
view,int64
condition,int64


In [None]:
house_price_org.info()

<class 'pandas.core.frame.DataFrame'>
Index: 20000 entries, 7129300520 to 3566800485
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   date           20000 non-null  object 
 1   price          20000 non-null  float64
 2   bedrooms       20000 non-null  int64  
 3   bathrooms      20000 non-null  float64
 4   sqft_living    20000 non-null  int64  
 5   sqft_lot       20000 non-null  int64  
 6   floors         20000 non-null  float64
 7   waterfront     20000 non-null  int64  
 8   view           20000 non-null  int64  
 9   condition      20000 non-null  int64  
 10  grade          20000 non-null  int64  
 11  sqft_above     20000 non-null  int64  
 12  sqft_basement  20000 non-null  int64  
 13  yr_built       20000 non-null  int64  
 14  yr_renovated   20000 non-null  int64  
 15  zipcode        20000 non-null  int64  
 16  lat            20000 non-null  float64
 17  long           20000 non-null  float64
 1

In [None]:
# Identification of missing values
missing = house_price_org.isnull().sum()
missing = missing[missing > 0]
missing.sort_values(ascending=False)

Series([], dtype: int64)


In [None]:
house_price_org.drop(['date'], axis=1, inplace=True)
house_price_org.describe(include='all')

statistic,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
count,20000.0,20000.0,20000.0,20000.0,20000.0,20000.0,20000.0,20000.0,20000.0,20000.0,20000.0,20000.0,20000.0,20000.0,20000.0,20000.0,20000.0,20000.0,20000.0
mean,535567.9,3.36445,2.072013,2057.907,15606.37,1.44495,0.00795,0.2418,3.44175,7.60575,1757.4727,300.4343,1967.9565,90.8075,98078.16405,47.56039,-122.21516,1974.28685,13115.9366
std,366184.5,0.93374,0.762412,905.62543,41770.24,0.516776,0.08881,0.777922,0.665454,1.172598,811.60698,447.61877,28.317996,415.937997,54.045673,0.13932,0.139578,675.242028,26942.695517
min,75000.0,0.0,0.0,290.0,520.0,1.0,0.0,0.0,1.0,1.0,290.0,0.0,1900.0,0.0,98001.0,47.1559,-122.519,399.0,651.0
25%,317000.0,3.0,1.5,1420.0,5350.0,1.0,0.0,0.0,3.0,7.0,1180.0,0.0,1950.0,0.0,98033.0,47.46755,-122.327,1490.0,5347.75
50%,449950.0,3.0,2.0,1900.0,7819.0,1.0,0.0,0.0,3.0,7.0,1540.0,0.0,1969.0,0.0,98065.0,47.57295,-122.232,1830.0,7778.5
75%,640000.0,4.0,2.5,2510.0,11000.0,2.0,0.0,0.0,4.0,8.0,2150.0,590.0,1991.0,0.0,98118.0,47.679,-122.127,2337.0,10240.0
max,7700000.0,33.0,8.0,13540.0,1651359.0,3.5,1.0,4.0,5.0,13.0,9410.0,4820.0,2015.0,2015.0,98199.0,47.7776,-121.315,6210.0,871200.0


In [None]:
label_col = 'price'
house_price_org.head(10)

id,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
7129300520,221900.0,3,1.0,1180,5650,1.0,0,0,3,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
6414100192,538000.0,3,2.25,2570,7242,2.0,0,0,3,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
5631500400,180000.0,2,1.0,770,10000,1.0,0,0,3,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
2487200875,604000.0,4,3.0,1960,5000,1.0,0,0,5,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
1954400510,510000.0,3,2.0,1680,8080,1.0,0,0,3,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503
7237550310,1230000.0,4,4.5,5420,101930,1.0,0,0,3,11,3890,1530,2001,0,98053,47.6561,-122.005,4760,101930
1321400060,257500.0,3,2.25,1715,6819,2.0,0,0,3,7,1715,0,1995,0,98003,47.3097,-122.327,2238,6819
2008000270,291850.0,3,1.5,1060,9711,1.0,0,0,3,7,1060,0,1963,0,98198,47.4095,-122.315,1650,9711
2414600126,229500.0,3,1.0,1780,7470,1.0,0,0,3,7,1050,730,1960,0,98146,47.5123,-122.337,1780,8113
3793500160,323000.0,3,2.5,1890,6560,2.0,0,0,3,7,1890,0,2003,0,98038,47.3684,-122.031,2390,7570


<a id = "cell_model"></a>
## 3. Predictive Modeling

Split the data into training and validation sets (70/30).

In [None]:
train_size, valid_size, test_size = (0.7, 0.3, 0.0)
house_train, house_valid = train_test_split(house_price_org,
                                      test_size=valid_size,
                                      random_state=2020)

Extract data for training and validation into x and y vectors.

In [None]:
house_y_train = house_train[[label_col]]
house_x_train = house_train.drop(label_col, axis=1)
house_y_valid = house_valid[[label_col]]
house_x_valid = house_valid.drop(label_col, axis=1)

print('Size of training set: ', len(house_x_train))
print('Size of validation set: ', len(house_y_valid))

Size of training set:  14000
Size of validation set:  6000


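Before the data is fed to a deep learning model, the input features are scaled to the `[0, 1]` range.

Fit the scaler on the training set only, then use it to scale both the training and validation data. Validation values can fall outside `[0, 1]` when a feature takes a value beyond the range seen in training.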

In [None]:
scaler = MinMaxScaler(feature_range=(0, 1), copy=True).fit(house_x_train)
house_x_train = pd.DataFrame(scaler.transform(house_x_train),
                            columns = house_x_train.columns, index = house_x_train.index)
house_x_valid = pd.DataFrame(scaler.transform(house_x_valid),
                            columns = house_x_valid.columns, index = house_x_valid.index)

print('X train min =', round(house_x_train.min().min(),4), '; max =', round(house_x_train.max().max(), 4))
print('X valid min =', round(house_x_valid.min().min(),4), '; max =', round(house_x_valid.max().max(), 4))

X train min = 0.0 ; max = 1.0
X valid min = 0.0 ; max = 2.0487


In [None]:
house_x_valid.head(10)

id,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
9310300175,0.090909,0.125,0.090566,0.009286,0.0,0.0,0.0,0.5,0.5,0.131579,0.0,0.4,0.0,0.666667,0.94129,0.142857,0.266908,0.029598
1823069213,0.090909,0.25,0.095094,0.008796,0.0,0.0,0.0,0.75,0.416667,0.138158,0.0,0.504348,0.0,0.292929,0.533055,0.348837,0.191189,0.095933
1900600040,0.151515,0.1875,0.091321,0.003993,0.0,0.0,0.0,1.0,0.416667,0.051535,0.179177,0.173913,0.0,0.833333,0.503941,0.140365,0.137842,0.015205
3275310220,0.090909,0.25,0.080755,0.005554,0.0,0.0,0.0,0.75,0.5,0.117325,0.0,0.721739,0.0,0.010101,0.163262,0.173588,0.170539,0.02126
1220069035,0.121212,0.3125,0.164528,0.233206,0.4,0.0,0.75,0.5,0.5,0.239035,0.0,0.791304,0.0,0.106061,0.134631,0.436877,0.220444,0.372633
7300400060,0.121212,0.3125,0.182642,0.003247,0.4,0.0,0.0,0.5,0.666667,0.265351,0.0,0.852174,0.0,0.459596,0.28229,0.288206,0.364997,0.012588
9169100175,0.121212,0.25,0.166038,0.002532,0.0,0.0,0.0,0.75,0.5,0.153509,0.193705,0.452174,0.0,0.681818,0.594338,0.105482,0.316813,0.010235
925059288,0.090909,0.3125,0.159245,0.004377,0.4,0.0,0.0,0.5,0.666667,0.23136,0.0,0.878261,0.0,0.161616,0.832395,0.287375,0.289279,0.018742
6430500086,0.090909,0.125,0.049057,0.002229,0.0,0.0,0.0,0.5,0.5,0.071272,0.0,0.478261,0.0,0.515152,0.855557,0.140365,0.168818,0.00807
7369600080,0.121212,0.28125,0.166038,0.003909,0.0,0.0,0.0,0.75,0.583333,0.131579,0.242131,0.46087,0.0,1.0,0.79733,0.091362,0.237653,0.011675


Convert the pandas DataFrames to NumPy arrays.

In [None]:
house_x_train = np.array(house_x_train)
house_y_train = np.array(house_y_train)
house_x_valid = np.array(house_x_valid)
house_y_valid = np.array(house_y_valid)

print('Training shape:', house_x_train.shape)
print('Training samples: ', house_x_train.shape[0])
print('Validation samples: ', house_x_valid.shape[0])

Training shape: (14000, 18)
Training samples:  14000
Validation samples:  6000


Linear Regression Model

In [None]:
model = LinearRegression()
model.fit(house_x_train, house_y_train)
house_y_pred = model.predict(house_x_valid)

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(house_y_valid, house_y_pred)

# Calculate Root Mean Squared Error (RMSE)
rmse = np.sqrt(mse)

# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(house_y_valid, house_y_pred)

print("MSE:", mse)
print("RMSE:", rmse)
print("MAE:", mae)

MSE: 38388412821.56182
RMSE: 195929.61190581127
MAE: 126139.91029325705


Keras models used in the experiments.

Model 1 - Two-Layer Network with Adam Optimization

The first model is very simple: one hidden layer of 100 ReLU units followed by a linear output layer, trained with the `Adam` optimizer.

In [None]:
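def model_1(x_size, y_size):
    """Build a one-hidden-layer MLP compiled with Adam and an MSE loss."""
    t_model = Sequential()
    # Hidden layer: 100 ReLU units over x_size input features
    t_model.add(Dense(100, activation="relu", input_shape=(x_size,)))
    # Output layer: linear activation, suitable for regression
    t_model.add(Dense(y_size))
    t_model.compile(
        loss='mean_squared_error',
        optimizer=Adam(learning_rate=0.001),
        metrics=[metrics.mae]  # track mean absolute error during training
    )
    return t_model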
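
As a rough sanity check of this architecture, the sketch below counts its trainable parameters and runs a NumPy forward pass. It assumes 18 input features (the training shape printed earlier) and a single output. The helper names `dense_param_count` and `forward` are hypothetical, and the weights are random rather than trained, so the predictions carry no meaning.

```python
import numpy as np

def dense_param_count(n_inputs, n_units):
    """Parameters of a fully connected layer: one weight per input-unit pair plus one bias per unit."""
    return n_inputs * n_units + n_units

x_size, hidden, y_size = 18, 100, 1
total_params = dense_param_count(x_size, hidden) + dense_param_count(hidden, y_size)
print('Trainable parameters:', total_params)

# Random, untrained weights used only to illustrate the shapes involved
rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(x_size, hidden)), np.zeros(hidden)
w2, b2 = rng.normal(size=(hidden, y_size)), np.zeros(y_size)

def forward(x):
    """Hidden ReLU layer followed by a linear output layer."""
    h = np.maximum(x @ w1 + b1, 0.0)
    return h @ w2 + b2

batch = rng.random(size=(5, x_size))  # five hypothetical scaled houses
preds = forward(batch)
print('Prediction shape:', preds.shape)
```

The output layer has no activation because the target is an unbounded price, so the network's final step is a plain weighted sum of the hidden units.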