> # Data preprocessing

In this section we preprocess the landing data scraped from the domain.com.au website by handling missing values and checking for outliers. The preprocessed data is saved to ../data/raw/preprocessed properties.csv.

> ### Import libraries and functions

In [23]:
import pandas as pd

> ### Handling missing values

In [24]:
# Get the csv file from scraping domain.com.au
property_df = pd.read_csv('../data/landing/properties.csv')

In [20]:
# Show the number of missing values in each column
property_df.isnull().sum()

price (AUD per week)    0
bedrooms                9
bathrooms               0
parkings                0
property type           0
address                 0
suburb                  0
postcode                0
additional features     0
property url            0
dtype: int64

We can see that the number of bedrooms is the only column that contains missing values. We will inspect these rows.

In [21]:
# Inspect rows with any NaN values
property_df[property_df.isna().any(axis=1)]

| | price (AUD per week) | bedrooms | bathrooms | parkings | property type | address | suburb | postcode | additional features | property url |
|---|---|---|---|---|---|---|---|---|---|---|
| 142 | 450.0 | NaN | 1 | 0 | Studio | 1403/325 Collins Street | MELBOURNE | 3000 | ['Furnished'] | https://www.domain.com.au/1403-325-collins-str... |
| 226 | 230.0 | NaN | 1 | 0 | Studio | 109/32 St Edmonds Road | PRAHRAN | 3181 | [] | https://www.domain.com.au/109-32-st-edmonds-ro... |
| 462 | 615.0 | NaN | 1 | 0 | Apartment / Unit / Flat | L204/8 Caulfield Boulevard | CAULFIELD NORTH | 3161 | ['Intercom', 'In ground pool', 'Balcony', 'Out... | https://www.domain.com.au/l204-8-caulfield-bou... |
| 506 | 75.0 | NaN | 1 | 1 | Car space | Car Park/228 La Trobe St | MELBOURNE | 3000 | [] | https://www.domain.com.au/car-park-228-la-trob... |
| 625 | 435.0 | NaN | 1 | 0 | Studio | 7/340 Beaconsfield Parade | ST KILDA WEST | 3182 | ['Split Cooling', 'Split Heating', 'Kitchen', ... | https://www.domain.com.au/7-340-beaconsfield-p... |
| 701 | 535.0 | NaN | 1 | 0 | Apartment / Unit / Flat | 202/12 Caulfield Blvd | CAULFIELD NORTH | 3161 | ['In ground pool', 'In ground spa', 'Split sys... | https://www.domain.com.au/202-12-caulfield-blv... |
| 757 | 250.0 | NaN | 1 | 0 | Studio | 24/677 Park Street | BRUNSWICK | 3056 | [] | https://www.domain.com.au/24-677-park-street-b... |
| 802 | 350.0 | NaN | 1 | 1 | Studio | 2/631 Punt Road | SOUTH YARRA | 3141 | [] | https://www.domain.com.au/2-631-punt-road-sout... |
| 905 | 380.0 | NaN | 1 | 0 | Studio | 10/1 Lawson Grove | SOUTH YARRA | 3141 | [] | https://www.domain.com.au/10-1-lawson-grove-so... |


Of the 9 rows with a missing bedroom count, 6 are studios, 2 are apartments, and 1 is a car space, which is not a valid property type for this project.
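
The breakdown by property type can also be computed directly rather than read off the table. The snippet below is a minimal sketch on a small hypothetical frame (`sample_df` and its rows are made up for illustration and are not the scraped data); only the column names follow the notebook.

```python
import pandas as pd

# Hypothetical rows for illustration only (not the real dataset)
sample_df = pd.DataFrame({
    "property type": ["Studio", "Studio", "Apartment / Unit / Flat", "Car space", "House"],
    "bedrooms": [None, None, None, None, 3],
})

# Keep only rows where bedrooms is missing, then count them per property type
missing_by_type = (
    sample_df.loc[sample_df["bedrooms"].isna(), "property type"]
    .value_counts()
)
print(missing_by_type)
```

Applied to `property_df`, the same expression would list how many missing bedroom values each property type contributes.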

Next we look at which property types appear in our dataset.

In [17]:
print("Unique property types:", property_df['property type'].unique())

Unique property types: ['House' 'Apartment / Unit / Flat' 'Townhouse' 'Studio' 'Villa'
 'Car space' 'Terrace']


We will discard rows with type 'Car space' because car spaces are outside the scope of this project.

In [25]:
# Discard rows with type 'Car space'; .copy() avoids a possible SettingWithCopyWarning when columns are assigned later
property_df = property_df[property_df['property type'] != 'Car space'].copy()

Now we fill in the missing bedroom values. We assume that a studio has 1 bedroom; for any other property type we assume the number of bedrooms equals the number of bathrooms.

In [26]:
def fill_bedrooms(row):
    if pd.isnull(row['bedrooms']):
        if row['property type'] == 'Studio':    # assume number of bedrooms for studio is 1
            return 1
        else:
            return row['bathrooms']     # for other properties assume bedrooms = bathrooms
    return row['bedrooms']

property_df['bedrooms'] = property_df.apply(fill_bedrooms, axis=1)
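
The same rule can be written without a row-wise `apply`. The following is a sketch of a vectorised alternative under the same assumptions (studio = 1 bedroom, otherwise bedrooms = bathrooms); `toy_df` and `fallback` are hypothetical names, and the rows are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; column names follow the notebook
toy_df = pd.DataFrame({
    "property type": ["Studio", "Apartment / Unit / Flat", "House"],
    "bedrooms": [np.nan, np.nan, 3.0],
    "bathrooms": [1, 2, 2],
})

# Fallback value per row: the bathroom count, except 1 for studios
fallback = toy_df["bathrooms"].where(toy_df["property type"] != "Studio", 1)

# fillna only touches rows where bedrooms is missing
toy_df["bedrooms"] = toy_df["bedrooms"].fillna(fallback)
print(toy_df)
```

On a dataset of this size the difference is negligible; the vectorised form mainly helps if the scrape grows much larger.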

In [24]:
# Check the number of missing values after filling in
property_df.isnull().sum()

price (AUD per week)    0
bedrooms                0
bathrooms               0
parkings                0
property type           0
address                 0
suburb                  0
postcode                0
additional features     0
property url            0
dtype: int64

We confirm that there are no missing entries left.

> ### Descriptive statistics

We will look at the descriptive statistics of the number of bedrooms, bathrooms and parking spaces, and of the rental price per bedroom, to check that they fall within reasonable ranges. Price per bedroom is used instead of total price because it is easier to compare across properties of different sizes.

In [27]:
# Compute price per bedroom
property_df['price per bedroom'] = property_df['price (AUD per week)'] / property_df['bedrooms']

In [17]:
property_df[['price per bedroom', 'bedrooms', 'bathrooms', 'parkings']].describe()

| | price per bedroom | bedrooms | bathrooms | parkings |
|---|---|---|---|---|
| count | 988.0 | 988.0 | 988.0 | 988.0 |
| mean | 361.057018 | 2.152834 | 1.461538 | 1.069838 |
| std | 125.107332 | 0.954869 | 0.639417 | 0.75605 |
| min | 112.5 | 1.0 | 1.0 | 0.0 |
| 25% | 275.0 | 1.0 | 1.0 | 1.0 |
| 50% | 337.5 | 2.0 | 1.0 | 1.0 |
| 75% | 430.0 | 3.0 | 2.0 | 1.0 |
| max | 1100.0 | 6.0 | 4.0 | 6.0 |


Overall, the ranges for the number of bedrooms, bathrooms and parking spaces are reasonable. Some properties have a very high price per bedroom (up to 1100 AUD per week); however, such prices could still be plausible in more expensive suburbs. We therefore keep these properties and will later classify them as 'Very High' in the classification task.
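
To put a rough number on "very high", one conventional yardstick is Tukey's 1.5 × IQR rule. This is not the method used in this notebook, only a sketch that reuses the quartiles and maximum reported in the descriptive statistics.

```python
# Quartiles and maximum of price per bedroom, copied from the describe() output
q1, q3, max_price = 275.0, 430.0, 1100.0

iqr = q3 - q1                   # interquartile range
upper_fence = q3 + 1.5 * iqr    # Tukey's conventional upper fence

print(f"IQR = {iqr}, upper fence = {upper_fence}")
print("max exceeds fence:", max_price > upper_fence)
```

Under this rule, anything above 662.5 AUD per bedroom per week would be flagged, which includes the maximum of 1100. Being flagged does not make a listing invalid, which is consistent with keeping these rows.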

In [28]:
# Save the final df
property_df.to_csv('../data/raw/preprocessed properties.csv', index=False)
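
A quick round-trip check can confirm that the saved file reloads cleanly. The sketch below uses a temporary directory and a hypothetical two-row frame (`demo_df`) so that it does not touch the project's data folder.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the preprocessed frame
demo_df = pd.DataFrame({
    "price (AUD per week)": [450.0, 615.0],
    "bedrooms": [1.0, 1.0],
    "property type": ["Studio", "Apartment / Unit / Flat"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "preprocessed properties.csv")
    demo_df.to_csv(out_path, index=False)   # same call pattern as the notebook
    reloaded = pd.read_csv(out_path)

# With index=False, no extra index column appears on reload
print(list(reloaded.columns))
print(int(reloaded.isnull().sum().sum()))
```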