# House Price Prediction

**Problem**: We want to build a regression model able to predict a "fair" offer price for real estate properties in Italy.

We structured the project as a Data Science showcase, in which we explain the various steps we took in a series of Jupyter notebooks, containing both the code and the thought process.

In this first notebook we retrieve the data, cluster it geographically to engineer the most important feature in real estate (Location, Location, Location!), and then perform a basic "manual" computation to get a baseline benchmark against which to evaluate our models.

Let's get started.

<img src="../images/re1.jpg" alt="Real Estate Image" width="300px" height="300px"/>

## Data Collection

Data collection is simple here: we already have data on real estate ads in Italy, because we developed EasyMap (https://www.easymap-software.com/), a software product for real estate investors.

We want to use Amazon SageMaker for this project, and to use its Data Wrangler we need to export our data to Amazon S3.

https://aws.amazon.com/sagemaker/

https://aws.amazon.com/s3/

The data lives in PostgreSQL, and we can export it to S3 with the aws_s3 extension.

https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/postgresql-s3-export.html#aws_s3.export_query_to_s3

To do that, we need an IAM role with permission to access both the S3 bucket and the RDS instance.

Then we install the extension in PostgreSQL:

In [2]:
# CREATE EXTENSION aws_s3 CASCADE;

Then in the psql console:

In [3]:
# Create uri to access bucket
# SELECT aws_commons.create_s3_uri('ds-houseprices', 'export.csv','eu-south-1') AS s3_uri_1 \gset

# Set encoding to utf-8
# SET client_encoding TO 'UTF8';

# Data Export
# SELECT * FROM aws_s3.query_export_to_s3('SELECT * FROM portals_ad', :'s3_uri_1', options :='format csv, header True');

## Clustering

To build a baseline benchmark we want to use a simple calculation based on the average price per square meter of the properties in a specific location.

The problem is that we don't have a "location" feature, so we need to build it ourselves.

This feature will probably also help our model, since location is likely to be its most important feature (or one of the most important).

We have latitude and longitude coordinates for each real estate ad and we want to cluster them geographically to be able to use that information as a feature.

To do this we start with a simple DataFrame containing the latitude and longitude of each ad, along with its price and size.

After we calculate a cluster label, we will join this new feature with the other features (we will see this in the next notebook).

In [4]:
import pandas as pd

uri = "s3://ds-houseprices/lat_long_prices.csv"
df = pd.read_csv(uri, float_precision='round_trip')[["latitude","longitude","data_price","data_size"]]
df = df.drop_duplicates()
df['latitude'] = pd.to_numeric(df['latitude'], errors="coerce")
df['longitude'] = pd.to_numeric(df['longitude'], errors="coerce")
df['data_price'] = pd.to_numeric(df['data_price'], errors="coerce")
df['data_size'] = pd.to_numeric(df['data_size'], errors="coerce")
df = df.dropna()
df

Unnamed: 0,latitude,longitude,data_price,data_size
0,43.959400,10.16770,1200000.0,95.0
1,46.461200,12.41970,150000.0,120.0
2,44.494700,11.33750,870000.0,170.0
3,45.909300,9.18060,20000.0,59.0
4,42.690700,13.91350,120000.0,250.0
...,...,...,...,...
997400,42.136800,12.80750,25000.0,141.0
997401,43.555500,10.33610,147000.0,90.0
997402,43.769700,11.25600,249000.0,40.0
997403,45.071362,7.63812,95000.0,44.0


To create our clusters we use the KMeans algorithm from scikit-learn:

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

Instead of running it once with a large number of clusters, we run it iteratively.

We first run the algorithm on the full data, which gives us a few big clusters. We then split each of these clusters into smaller ones, and so on.

We do this because running the algorithm a single time with many clusters gave us "bad" results, with overlapping clusters.

We also want small clusters because real estate prices vary a lot even between neighborhoods of the same city.
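
The control flow of this drill-down can be seen in isolation in the following pure-Python sketch. It is only an illustration under stated assumptions: `drill_down` and `median_split` are hypothetical names, and the toy `median_split` (which halves a cluster along its wider axis) merely stands in for KMeans so that the example runs without scikit-learn.

```python
def drill_down(points, split, max_size, max_depth=10):
    """Return {label: [point indices]} after repeatedly splitting big clusters.

    `split(pts)` is any function returning one sub-cluster index per point;
    in the notebook that role is played by KMeans.fit_predict.
    """
    clusters = {}
    for i, sub in enumerate(split(points)):
        clusters.setdefault(str(sub), []).append(i)

    for _ in range(1, max_depth):
        too_big = [lab for lab, idx in clusters.items() if len(idx) > max_size]
        if not too_big:
            break
        for lab in too_big:
            idx = clusters.pop(lab)
            subs = split([points[i] for i in idx])
            for i, sub in zip(idx, subs):
                # child label = parent label + "_" + sub-cluster index
                clusters.setdefault(f"{lab}_{sub}", []).append(i)
    return clusters


def median_split(pts):
    """Toy stand-in for KMeans (assumption): halve along the wider axis."""
    axis = max((0, 1), key=lambda a: max(p[a] for p in pts) - min(p[a] for p in pts))
    cut = sorted(p[axis] for p in pts)[len(pts) // 2]
    return [int(p[axis] >= cut) for p in pts]


# six made-up (latitude, longitude) points, at most 2 points per final cluster
points = [(45.0, 9.0), (45.1, 9.1), (45.2, 9.2),
          (41.9, 12.5), (41.8, 12.4), (40.8, 14.2)]
clusters = drill_down(points, median_split, max_size=2)
print(sorted(clusters))  # ['0_0', '0_1', '1_0', '1_1']
```

Each label grows by one `_<index>` per level, which is the same naming scheme the scikit-learn implementation below produces.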

In [5]:
from sklearn.cluster import KMeans

def cluster_with_kmeans(df, n_cluster=10, level=0):
    X = df.copy()

    cluster_label = f"cluster_{level}"
    kmeans = KMeans(n_clusters=n_cluster, init='k-means++')
    
    # kmeans fit_predict returns a cluster index for each row
    X[cluster_label] = kmeans.fit_predict(X[["latitude", "longitude"]])

    # we build a cluster sequentially by drilling down into clusters created in the previous step
    # we identify a single cluster by appending the various index cluster labels for each level
    if level == 0:
        X['cluster'] = X[cluster_label].astype('str')
    else:
        X['cluster'] = X['cluster'].astype('str') + '_' + X[cluster_label].astype('str')

    # for each cluster we keep the centroid latitude and longitude to be able to plot them
    # (cluster_centers_ has one [latitude, longitude] row per cluster index)
    labels = X[cluster_label].to_numpy()
    X['centroid_latitude'] = kmeans.cluster_centers_[labels, 0]
    X['centroid_longitude'] = kmeans.cluster_centers_[labels, 1]
    X = X.drop(cluster_label, axis=1)
    return X


def iterative_clustering(df, split_by, max_size):
    
    # First step creates the first (very big) clusters
    df = cluster_with_kmeans(df, n_cluster=split_by)
    
    # limit the number of drill down steps to 10
    for level in range(1, 10):
        
        count_df = df.groupby(by='cluster').count()
        
        # we split each cluster that has more ads than max_size
        cluster_ids = list(count_df[count_df['latitude'] > max_size].index)
        
        # if there are no clusters to drill down, we can return
        if len(cluster_ids) == 0:
            return df
        
        for cluster_id in sorted(cluster_ids):
            
            X = df[df['cluster'] == cluster_id]
            X = cluster_with_kmeans(X, n_cluster=split_by, level=level)

            # merge the refined labels and centroids back by index; rows outside
            # this cluster get NaN in the *_y columns and keep their *_x value
            merge_columns = ['cluster', 'centroid_latitude', 'centroid_longitude']
            df = pd.merge(df, X[merge_columns], how='left', left_index=True, right_index=True)

            for c in merge_columns:
                df[c] = df[f'{c}_y'].fillna(df[f'{c}_x'])
                df = df.drop([f'{c}_y', f'{c}_x'], axis=1)
                    
    return df

Let's run an example on just the first 10k data points so we can visualize the results:

In [6]:
sample_df = df[:10000]

clusterized_df = iterative_clustering(sample_df, 20, 100)
clusterized_df

Unnamed: 0,latitude,longitude,data_price,data_size,cluster,centroid_latitude,centroid_longitude
0,43.9594,10.1677,1200000.0,95.0,15_14,43.963066,10.204137
1,46.4612,12.4197,150000.0,120.0,19_12,46.526557,12.487086
2,44.4947,11.3375,870000.0,170.0,3_0,44.470231,11.378519
3,45.9093,9.1806,20000.0,59.0,18_7,45.943864,9.144963
4,42.6907,13.9135,120000.0,250.0,1_12,42.708937,13.945174
...,...,...,...,...,...,...,...
9998,43.8334,10.6510,145000.0,106.0,12_17,43.871733,10.731006
9999,45.6463,13.7809,138000.0,125.0,19_4,45.647079,13.768636
10000,43.9697,10.8285,40000.0,55.0,12_4,43.915715,10.949690
10001,43.4734,11.1467,180000.0,200.0,12_15,43.531263,11.071050
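
Each `cluster` label records the path taken through the drill-down: one KMeans index per level, joined by underscores, so `15_14` is sub-cluster 14 of first-level cluster 15. Two small helpers, with hypothetical names that are not part of the notebook, sketch how the depth and the parent cluster could be recovered from a label if they are needed later:

```python
def cluster_depth(label):
    """Number of drill-down levels encoded in a label such as '0_13_17'."""
    return len(label.split("_"))


def parent_cluster(label):
    """Label of the enclosing cluster, or None for a first-level cluster."""
    head, sep, _ = label.rpartition("_")
    return head if sep else None


print(cluster_depth("15_14"), parent_cluster("15_14"))      # 2 15
print(cluster_depth("0_13_17"), parent_cluster("0_13_17"))  # 3 0_13
```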


In [7]:
!pip install folium
import folium
import numpy as np

def gen_color():
    
    # generate random color
    color = np.random.randint(16, 256, size=3)
    color = [str(hex(i))[2:] for i in color]
    color = '#' + ''.join(color).upper()
    return color

def plot_map(X, cluster_label='cluster'):
    cols = {}

    # generate a random color for each cluster
    for lab in X[cluster_label].unique():
        cols[str(lab)] = gen_color()

    # we use folium library to visualize the map
    m = folium.Map(location=[X.latitude.mean(), X.longitude.mean()], zoom_start=7)

    # for each data point
    for _, row in X.iterrows():
        v = str(row[cluster_label])
        cluster_colour = cols[v]

        # we create a circle on the map
        folium.Circle(
            location=[row.latitude, row.longitude],
            radius=10,
            opacity=0.8,
            fill_opacity=0.8,
            color=cluster_colour,
            fill=True,
            fill_color=cluster_colour
        ).add_to(m)
    
    return m



In [8]:
plot_map(clusterized_df)

## Baseline Benchmark

The most basic "manual" prediction is to compute the average price per square meter (`price_mq`, from the Italian "metro quadro") for each cluster and multiply it by the size of each ad.

In [9]:
naive_benchmark_df = clusterized_df.copy()[["data_price","data_size","cluster"]]
naive_benchmark_df["price_mq"] = naive_benchmark_df["data_price"] / naive_benchmark_df["data_size"]
naive_benchmark_df

Unnamed: 0,data_price,data_size,cluster,price_mq
0,1200000.0,95.0,15_14,12631.578947
1,150000.0,120.0,19_12,1250.000000
2,870000.0,170.0,3_0,5117.647059
3,20000.0,59.0,18_7,338.983051
4,120000.0,250.0,1_12,480.000000
...,...,...,...,...
9998,145000.0,106.0,12_17,1367.924528
9999,138000.0,125.0,19_4,1104.000000
10000,40000.0,55.0,12_4,727.272727
10001,180000.0,200.0,12_15,900.000000


In [10]:
cluster_mean = naive_benchmark_df.groupby("cluster")["price_mq"].mean()
cluster_mean.name = "cluster_mean"
naive_benchmark_df = pd.merge(naive_benchmark_df, cluster_mean, how="inner", left_on="cluster", right_index=True)
naive_benchmark_df

Unnamed: 0,data_price,data_size,cluster,price_mq,cluster_mean
0,1200000.0,95.0,15_14,12631.578947,4927.004644
830,380000.0,132.0,15_14,2878.787879,4927.004644
2420,1250000.0,200.0,15_14,6250.000000,4927.004644
2470,740000.0,70.0,15_14,10571.428571,4927.004644
2578,870000.0,220.0,15_14,3954.545455,4927.004644
...,...,...,...,...,...
8916,180000.0,215.0,0_13_17,837.209302,837.209302
9022,420000.0,400.0,14_8_0,1050.000000,1050.000000
9074,150000.0,120.0,16_16,1250.000000,2562.500000
9144,310000.0,80.0,16_16,3875.000000,2562.500000


In [11]:
naive_benchmark_df["prediction"] = naive_benchmark_df["cluster_mean"] * naive_benchmark_df["data_size"]
naive_benchmark_df

Unnamed: 0,data_price,data_size,cluster,price_mq,cluster_mean,prediction
0,1200000.0,95.0,15_14,12631.578947,4927.004644,4.680654e+05
830,380000.0,132.0,15_14,2878.787879,4927.004644,6.503646e+05
2420,1250000.0,200.0,15_14,6250.000000,4927.004644,9.854009e+05
2470,740000.0,70.0,15_14,10571.428571,4927.004644,3.448903e+05
2578,870000.0,220.0,15_14,3954.545455,4927.004644,1.083941e+06
...,...,...,...,...,...,...
8916,180000.0,215.0,0_13_17,837.209302,837.209302,1.800000e+05
9022,420000.0,400.0,14_8_0,1050.000000,1050.000000,4.200000e+05
9074,150000.0,120.0,16_16,1250.000000,2562.500000,3.075000e+05
9144,310000.0,80.0,16_16,3875.000000,2562.500000,2.050000e+05
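
As a sanity check, the prediction can be reproduced by hand for the two ads shown for cluster `16_16` (rows 9074 and 9144), whose average matches the `cluster_mean` of 2562.5 in the table. The sketch uses only the standard library, and its variable names are illustrative.

```python
from statistics import mean

# (price, size in square meters) of the two ads shown for cluster 16_16
ads = [(150000.0, 120.0), (310000.0, 80.0)]

# price per square meter of each ad
price_mq = [price / size for price, size in ads]

# the cluster's average price per square meter
cluster_mean = mean(price_mq)

# naive prediction: cluster average times the size of each ad
predictions = [cluster_mean * size for _, size in ads]

print(price_mq)      # [1250.0, 3875.0]
print(cluster_mean)  # 2562.5
print(predictions)   # [307500.0, 205000.0]
```

Note that each ad contributes to its own cluster mean. Clusters `0_13_17` and `14_8_0`, whose `cluster_mean` equals the ad's own `price_mq`, appear to contain a single ad, so their prediction is simply the asking price.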


In [12]:
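from sklearn.metrics import mean_squared_error
# squared=False returns the root mean squared error (RMSE); newer scikit-learn
# versions replace this flag with sklearn.metrics.root_mean_squared_error
mean_squared_error(naive_benchmark_df["data_price"], naive_benchmark_df["prediction"], squared=False)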

167499.97788347947
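
With `squared=False`, `mean_squared_error` returns the root mean squared error (RMSE), so the figure above is expressed in the same unit as the prices. For reference, here is a minimal standard-library sketch of the same quantity; the function name `rmse` is ours, not scikit-learn's.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# the two ads of cluster 16_16 from the table above: actual vs predicted price
print(rmse([150000.0, 310000.0], [307500.0, 205000.0]))
```

Because errors are squared before averaging, a few badly mispriced ads weigh heavily on the RMSE.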

Let's try running this on more data:

In [13]:
data_df = pd.read_csv("s3://ds-houseprices/ETL/ETL_Numeric_Categorical/Train/ETL-Categorical-2023-04-24T19-35-29/part-00000-7b62d3d8-da9b-429c-87f4-9970ecce75cb-c000.csv")
#df = data_df[["data_price","data_elevator","data_size","data_status","floor_num", "cluster_mean"]]

df = data_df.copy()[["data_price","data_size","cluster_mean"]]
df["prediction"] = df["cluster_mean"] * df["data_size"]
df

Unnamed: 0,data_price,data_size,cluster_mean,prediction
0,150000.0,70.0,2033.090131,142316.309193
1,140000.0,120.0,2033.090131,243970.815760
2,150000.0,70.0,2033.090131,142316.309193
3,150000.0,70.0,2033.090131,142316.309193
4,180000.0,110.0,1784.238502,196266.235236
...,...,...,...,...
947285,210000.0,56.0,7865.866457,440488.521608
947286,435000.0,102.0,4500.859593,459087.678500
947287,860420.0,76.0,10380.049162,788883.736294
947288,851185.0,76.0,10380.049162,788883.736294


In [14]:
mean_squared_error(df["data_price"], df["prediction"], squared=False)

183111.87612370093

Let's see if we can do better by adding a bit more information to our manual prediction.

The most obvious way is to incorporate information about:
- house status (refurbished, to be renovated, new, ...)
- floor and elevator

In [15]:
df = data_df.copy()[["data_price","data_elevator","data_size","data_status","floor_num", "cluster_mean"]]
df

Unnamed: 0,data_price,data_elevator,data_size,data_status,floor_num,cluster_mean
0,150000.0,unknown,70.0,,1.0,2033.090131
1,140000.0,unknown,120.0,Buono Abitabile,,2033.090131
2,150000.0,unknown,70.0,,1.0,2033.090131
3,150000.0,unknown,70.0,,1.0,2033.090131
4,180000.0,unknown,110.0,,0.0,1784.238502
...,...,...,...,...,...,...
947285,210000.0,si,56.0,Ottimo Ristrutturato,0.5,7865.866457
947286,435000.0,unknown,102.0,Buono Abitabile,1.0,4500.859593
947287,860420.0,unknown,76.0,,,10380.049162
947288,851185.0,unknown,76.0,,,10380.049162


In [16]:
df["data_elevator"] = df["data_elevator"] == "si"
df["floor_num"] = df["floor_num"].fillna(0)

def mod_floor(row):
    floor = row["floor_num"]
    elevator = row["data_elevator"]
    
    # small penalty for the ground floor
    if floor >= 0 and floor < 1:
        return 0.9
    
    # big penalty for a basement floor
    elif floor < 0:
        return 0.5
    
    # big penalty for a high floor without an elevator
    elif not elevator and floor >= 3:
        return max(1 - floor / 10, 0.5)
    
    else:
        return 1

def mod_status(row):
    status = row["data_status"]
    
    # good status
    if status in [ 'OttimoRistrutturato', 'NuovoIncostruzione']:
        return 1.2
    
    # bad status
    elif status == "Daristrutturare":
        return 0.8
    
    else:
        return 1

# calculate modifiers
df["floor_mod"] = df.apply(mod_floor, axis=1)
df = df.drop(["data_elevator","floor_num"], axis=1)
# a missing status is treated as "Buono Abitabile"; spaces are removed to match mod_status
df["data_status"] = df["data_status"].fillna("Buono Abitabile")
df["data_status"] = df["data_status"].str.replace(" ","")
df["status_mod"] = df.apply(mod_status, axis=1)

df["price_mq"] = round(df["data_price"] / df["data_size"],2)


# df = df.replace([np.inf, -np.inf], np.nan)
# df = df.dropna()

df["prediction"] = df["data_size"] * df["floor_mod"] * df["status_mod"] * df["cluster_mean"]

df

Unnamed: 0,data_price,data_size,data_status,cluster_mean,floor_mod,status_mod,price_mq,prediction
0,150000.0,70.0,BuonoAbitabile,2033.090131,1.0,1.0,2142.86,142316.309193
1,140000.0,120.0,BuonoAbitabile,2033.090131,0.9,1.0,1166.67,219573.734184
2,150000.0,70.0,BuonoAbitabile,2033.090131,1.0,1.0,2142.86,142316.309193
3,150000.0,70.0,BuonoAbitabile,2033.090131,1.0,1.0,2142.86,142316.309193
4,180000.0,110.0,BuonoAbitabile,1784.238502,0.9,1.0,1636.36,176639.611712
...,...,...,...,...,...,...,...,...
947285,210000.0,56.0,OttimoRistrutturato,7865.866457,0.9,1.2,3750.00,475727.603337
947286,435000.0,102.0,BuonoAbitabile,4500.859593,1.0,1.0,4264.71,459087.678500
947287,860420.0,76.0,BuonoAbitabile,10380.049162,0.9,1.0,11321.32,709995.362665
947288,851185.0,76.0,BuonoAbitabile,10380.049162,0.9,1.0,11199.80,709995.362665
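
To make the rule concrete, row 947285 above can be recomputed by hand. The sketch restates the two modifiers as plain functions of scalar values; the names `floor_modifier` and `status_modifier` are illustrative and not part of the notebook.

```python
def floor_modifier(floor, has_elevator):
    """Price multiplier for the floor, mirroring mod_floor above."""
    if 0 <= floor < 1:
        return 0.9  # ground floor: small penalty
    if floor < 0:
        return 0.5  # basement: big penalty
    if not has_elevator and floor >= 3:
        # high floor without elevator: multiplier 1 - floor/10, never below 0.5
        return max(1 - floor / 10, 0.5)
    return 1.0


def status_modifier(status):
    """Price multiplier for the condition, mirroring mod_status above."""
    if status in ("OttimoRistrutturato", "NuovoIncostruzione"):
        return 1.2
    if status == "Daristrutturare":
        return 0.8
    return 1.0


# row 947285: size 56, floor 0.5, elevator "si", status "Ottimo Ristrutturato"
size, cluster_mean = 56.0, 7865.866457
prediction = (size
              * floor_modifier(0.5, True)
              * status_modifier("OttimoRistrutturato")
              * cluster_mean)
print(round(prediction, 2))
```

The result should agree with the 475727.60 shown in the table up to rounding, since the table displays `cluster_mean` with six decimals. The ground-floor penalty and the refurbished bonus combine to a factor of 0.9 × 1.2 = 1.08 on the plain baseline.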


In [17]:
mean_squared_error(df["data_price"], df["prediction"], squared=False)

169391.48230788572

The performance is a little better: the RMSE drops from about 183,112 to about 169,391. This at least gives us a benchmark against which to compare the ML models we'll build down the road.

Now let's continue on with the next step, where we are going to explore and clean the data.

[Go to Data Cleaning](data_cleaning.ipynb)