# Sydney Real Estate

The SurfIntoYourHome LLC from Sydney wants to offer a new service to its clients. Like its competitors overseas, the company wants to give customers a price estimate for a building, so that they can evaluate whether the asking price is fair. To make this prediction possible, the company has provided a dataset that was scraped from the web and is available under the CC0: Public Domain License from https://www.kaggle.com/datasets/mihirhalai/sydney-house-prices?select=SydneyHousePrices.csv.


In [20]:
import pandas as pd
import numpy as np
import seaborn as sns

In [3]:
data = pd.read_csv("/home/jovyan/DeepLearningExperiment/data/SydneyHousePrices.csv")

In [12]:
data.shape

(199504, 9)

In [6]:
data.head()

| | Date | Id | suburb | postalCode | sellPrice | bed | bath | car | propType |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-06-19 | 1 | Avalon Beach | 2107 | 1210000 | 4.0 | 2 | 2.0 | house |
| 1 | 2019-06-13 | 2 | Avalon Beach | 2107 | 2250000 | 4.0 | 3 | 4.0 | house |
| 2 | 2019-06-07 | 3 | Whale Beach | 2107 | 2920000 | 3.0 | 3 | 2.0 | house |
| 3 | 2019-05-28 | 4 | Avalon Beach | 2107 | 1530000 | 3.0 | 1 | 2.0 | house |
| 4 | 2019-05-22 | 5 | Whale Beach | 2107 | 8000000 | 5.0 | 4 | 4.0 | house |


The dataset contains 199504 rows and has 9 columns.

In [15]:
data["sellPrice"].max()

2147483647

In [13]:
data.describe()

| | Id | postalCode | sellPrice | bed | bath | car |
|---|---|---|---|---|---|---|
| count | 199504.0 | 199504.0 | 199504.0 | 199350.0 | 199504.0 | 181353.0 |
| mean | 99752.5 | 2196.379155 | 1269776.0 | 3.516479 | 1.890669 | 1.936224 |
| std | 57591.98839 | 193.053467 | 6948239.0 | 1.066555 | 0.926001 | 1.060237 |
| min | 1.0 | 2000.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 25% | 49876.75 | 2082.0 | 720000.0 | 3.0 | 1.0 | 1.0 |
| 50% | 99752.5 | 2144.0 | 985000.0 | 3.0 | 2.0 | 2.0 |
| 75% | 149628.25 | 2211.0 | 1475000.0 | 4.0 | 2.0 | 2.0 |
| max | 199504.0 | 4878.0 | 2147484000.0 | 99.0 | 99.0 | 41.0 |


The dataset contains properties with sale prices ranging from 1 AUD to more than 2 billion AUD. These are clearly errors, and such extreme values will be removed during data cleaning. The minimum values for bedrooms, bathrooms and car parking spots are all 1, which seems plausible. However, the maximum of 99 for both bedrooms and bathrooms seems implausible, as does the maximum of 41 car parking spots. Even if those values are correct, the company focuses on singles and families rather than real estate developers; therefore, such listings can be excluded from the prediction.
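The maximum price of 2147483647 is worth a closer look: it coincides with the largest value a signed 32-bit integer can hold. This suggests, but does not prove, that the entry is a placeholder or a capped value rather than a real sale price. A minimal check, with the reported maximum hard-coded from the output above:

```python
import numpy as np

# Largest value a signed 32-bit integer can hold
int32_max = np.iinfo(np.int32).max

# Maximum sellPrice reported by data["sellPrice"].max() in this notebook
reported_max = 2147483647

# True if the reported maximum is exactly the int32 limit
print(reported_max == int32_max)
```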

For this purpose, the dataset is filtered to keep only listings with a price of at most 10 million AUD. For bedrooms, bathrooms and car parking spots, an upper limit of 10 each is applied.

In [16]:
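# Keep listings priced at or below 10 million AUD
data_filtered = data[data["sellPrice"] <= 10_000_000]
# Keep listings with at most 10 bedrooms, bathrooms and car parking spots
data_filtered = data_filtered[data_filtered["bed"] <= 10]
data_filtered = data_filtered[data_filtered["bath"] <= 10]
data_filtered = data_filtered[data_filtered["car"] <= 10]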
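
These comparisons have a side effect: in pandas, comparing a missing value (NaN) with a number yields False, so rows with a missing `bed` or `car` value are removed as well. This is consistent with the filtered row count of 180856, which is below the 181353 non-missing `car` values in the raw data. The sketch below illustrates the behaviour on a made-up toy frame (the values are not from the dataset) and shows one way to keep missing values, should that be preferred:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; values are made up for illustration
toy = pd.DataFrame({"car": [1.0, np.nan, 12.0, 2.0]})

# NaN <= 10 evaluates to False, so the row with the missing value is dropped
kept = toy[toy["car"] <= 10]

# To keep rows with missing values, combine the mask with isna()
kept_with_missing = toy[(toy["car"] <= 10) | toy["car"].isna()]

print(len(kept), len(kept_with_missing))
```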

In [17]:
data_filtered.describe()

| | Id | postalCode | sellPrice | bed | bath | car |
|---|---|---|---|---|---|---|
| count | 180856.0 | 180856.0 | 180856.0 | 180856.0 | 180856.0 | 180856.0 |
| mean | 99088.106886 | 2204.848382 | 1238545.0 | 3.576182 | 1.933649 | 1.923475 |
| std | 57869.443012 | 194.464316 | 854833.1 | 0.96302 | 0.864519 | 0.9878 |
| min | 1.0 | 2000.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 25% | 48444.75 | 2092.0 | 720000.0 | 3.0 | 1.0 | 1.0 |
| 50% | 99005.5 | 2148.0 | 981000.0 | 3.0 | 2.0 | 2.0 |
| 75% | 149570.25 | 2213.0 | 1480000.0 | 4.0 | 2.0 | 2.0 |
| max | 199504.0 | 4878.0 | 10000000.0 | 10.0 | 10.0 | 10.0 |


The resulting dataset appears more plausible. The average property now has a price of roughly 1.2 million AUD, about 3.6 bedrooms, almost 2 bathrooms and about 2 car parking spots. Note that the minimum price is still 1 AUD, because only an upper price limit was applied.
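
To also remove the implausibly low prices, a lower bound could be added to the price filter. The following is only a sketch on made-up toy prices: the threshold `MIN_PRICE` of 100,000 AUD is a hypothetical value chosen for illustration and is not taken from the source, so a sensible cut-off would still need to be agreed with the company.

```python
import pandas as pd

# Toy stand-in for data_filtered; prices are made up for illustration
toy = pd.DataFrame({"sellPrice": [1, 720_000, 985_000, 10_000_000, 25_000_000]})

MIN_PRICE = 100_000     # hypothetical lower bound, not from the source
MAX_PRICE = 10_000_000  # upper bound used in this notebook

# Series.between is inclusive on both ends by default
plausible = toy[toy["sellPrice"].between(MIN_PRICE, MAX_PRICE)]

print(plausible["sellPrice"].tolist())
```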

In [21]:
np.unique(data_filtered["suburb"]).size

671

There are 671 distinct suburbs in the filtered dataset.

In [34]:
df1 = data_filtered[data_filtered['suburb'].map(data_filtered['suburb'].value_counts()) > 200]

In [35]:
np.unique(df1["suburb"]).size

303
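
The filter above keeps only listings from suburbs that occur more than 200 times in the filtered data, which leaves 303 suburbs. The idiom works by mapping each row's suburb to its total count and then comparing that per-row count with the threshold. A small sketch on made-up suburb names, using a threshold of 2 instead of 200:

```python
import pandas as pd

# Toy stand-in for data_filtered; suburb names are made up for illustration
toy = pd.DataFrame({"suburb": ["A", "A", "A", "B", "C", "C"]})

# Number of listings per suburb
counts = toy["suburb"].value_counts()

# map() looks up each row's suburb in counts, giving a per-row count
per_row = toy["suburb"].map(counts)

# Keep rows whose suburb occurs more than 2 times (the notebook uses 200)
frequent = toy[per_row > 2]

print(frequent["suburb"].unique().tolist())
```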

In [11]:
np.unique(data["postalCode"]).size

235

The unfiltered dataset contains 235 distinct postal codes.

