**PREPROCESSING**
--------------

This notebook cleans the scraped rental data and engineers features from it.

In [153]:
import pandas as pd
import sys
sys.path.append('..')
from scripts import preprocess
import importlib
importlib.reload(preprocess)
from scripts.preprocess import (
    extract_suburb_postcode,
    categorise_property,
    calculate_annual_increase,
    extract_number,
    extract_first_number,
)

Starting with the historical rental dataset built from old listings.

In [154]:
df1 = pd.read_csv('../data/landing/rental_history_scrape1.csv', encoding='ISO-8859-1')
df2 = pd.read_csv('../data/landing/rental_history_scrape2.csv', encoding='ISO-8859-1')
combined_rental_history_df = pd.concat([df1, df2], ignore_index=True)

In [129]:
combined_rental_history_df['Property Type'].unique()

array(['Category : House', nan, 'Category : Townhouse',
       'Category : Unit/apmt', 'Category : Commercial',
       'Category : Rental_residential', 'Category : Available Now',
       'Category : Villa', 'Category : Medical/consulting',
       'Category : Apartment', 'Category : Duplexsemi-detached',
       'Category : Duplex, Semi', 'Category : Semi',
       'Category : Home Unit', 'Category : Rural', 'Category : Unit',
       'Category : Residential Lease', 'Category : Duplex',
       'Category : Town Home', 'Category : Flat',
       'Category : Semi-detatched', 'Category : Other',
       'Category : Penthouse', 'Category : Villa, Unit, Land',
       'Category : Home', 'Category : Studio', 'Category : Available',
       'Category : Available Date', 'Category : Offices',
       'Category : Semi-detached', 'Category : Semi/duplex',
       'Category : Duplexsemidetached', 'Category : Semi Detached',
       'Category : Duplex/semi Detach', 'Category : Villa For',
       'Category : In

Many of these property types are not residential, so we remove them.

In [130]:


# Define non-residential keywords
non_residential_keywords = ['Commercial', 'Medical/consulting', 'Offices', 
                            'Industrial', 'Tourism', 'Retail', 'Healthcare',
                            'Car Space', 'Land', 'Acreage/semi-rural', 'Vacantland']

# Get the initial number of rows
initial_row_count = len(combined_rental_history_df)

# Filter out non-residential entries
combined_rental_history_df = combined_rental_history_df[~combined_rental_history_df['Property Type'].str.contains('|'.join(non_residential_keywords), na=False)]
# Get the final number of rows
final_row_count = len(combined_rental_history_df)

# Calculate the number of rows dropped
rows_dropped = initial_row_count - final_row_count

# Display the number of rows dropped
print(f'Number of rows dropped: {rows_dropped}')


Number of rows dropped: 161
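
A small illustration of how the keyword filter above behaves, using a toy Series rather than the scraped data. It assumes nothing beyond pandas itself. Note that the match is by substring, so a keyword such as `Land` also removes mixed labels like `Category : Villa, Unit, Land`, while `na=False` keeps rows whose type is missing.

```python
import pandas as pd

# Toy labels in the same "Category : X" style as the scraped column
types = pd.Series([
    'Category : House',
    'Category : Commercial',
    'Category : Villa, Unit, Land',
    None,
])

keywords = ['Commercial', 'Land']

# str.contains treats the joined string as a regex alternation;
# na=False makes missing values count as "no match"
mask = types.str.contains('|'.join(keywords), na=False)

kept = types[~mask]
print(mask.tolist())
print(kept.tolist())
```

Rows with a missing type survive this step and are handled later when the property type is imputed.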


In [131]:
combined_rental_history_df['Property Type'].unique()

array(['Category : House', nan, 'Category : Townhouse',
       'Category : Unit/apmt', 'Category : Rental_residential',
       'Category : Available Now', 'Category : Villa',
       'Category : Apartment', 'Category : Duplexsemi-detached',
       'Category : Duplex, Semi', 'Category : Semi',
       'Category : Home Unit', 'Category : Rural', 'Category : Unit',
       'Category : Residential Lease', 'Category : Duplex',
       'Category : Town Home', 'Category : Flat',
       'Category : Semi-detatched', 'Category : Other',
       'Category : Penthouse', 'Category : Home', 'Category : Studio',
       'Category : Available', 'Category : Available Date',
       'Category : Semi-detached', 'Category : Semi/duplex',
       'Category : Duplexsemidetached', 'Category : Semi Detached',
       'Category : Duplex/semi Detach', 'Category : Villa For',
       'Category : Semi-detached/duplex', 'Category : Uni',
       'Category : Terrace', 'Category : Serviced Apartment',
       'Category : Reside

There are still too many categories for model training. We reduce them to only 3 categories based on address structure and impute NaN property types based on the address.

In [132]:
combined_rental_history_df['Property Type'] = combined_rental_history_df.apply(categorise_property, axis=1)
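
The real `categorise_property` lives in `scripts/preprocess.py` and is not shown in this notebook. The sketch below is only a guess at the kind of rule it could apply: the function name, the keyword list and the "slash in the address means a unit" fallback are all assumptions. The three output labels are the ones that appear in the tables further down.

```python
# Hypothetical keyword list; not taken from scripts/preprocess.py
APARTMENT_HINTS = ('unit', 'apartment', 'apmt', 'flat', 'studio', 'penthouse')

def categorise_property_sketch(row):
    """Map a row to one of three labels (illustrative rule only)."""
    raw = row.get('Property Type')
    # NaN is a float, so anything that is not a string is treated as missing
    label = raw.lower() if isinstance(raw, str) else ''

    # Check 'town' before 'house' because 'townhouse' contains 'house'
    if 'town' in label:
        return 'Townhouse'
    if any(hint in label for hint in APARTMENT_HINTS):
        return 'Apartment / Unit / Flat'
    if 'house' in label:
        return 'House'

    # Missing or uninformative label: fall back to the address structure,
    # assuming "9/92 ..." style addresses denote a unit within a building
    address = str(row.get('Address', ''))
    return 'Apartment / Unit / Flat' if '/' in address else 'House'

print(categorise_property_sketch(
    {'Property Type': float('nan'), 'Address': '9/92 VICTORIA CR, MONT ALBERT'}
))
```

Because the function only uses `row.get`, it accepts either a plain dict or a DataFrame row passed by `apply(..., axis=1)`.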

Adding a Suburb and Postcode feature based on the address

In [133]:
combined_rental_history_df[['Suburb', 'Postcode']] = combined_rental_history_df['Address'].apply(lambda x: pd.Series(extract_suburb_postcode(x)))
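
`extract_suburb_postcode` is also defined outside this notebook. As a rough sketch of one way such a helper could work, the version below takes the text after the last comma as the suburb, strips a trailing `VIC 3108` style suffix, and looks the name up in a small table. The function name and the lookup table are hypothetical; the table holds only a few suburb and postcode pairs that appear in the addresses in this notebook. Matching on the whole suburb segment, rather than searching the full address for a suburb name, avoids confusing a street name such as "Blackburn Road" or a longer suburb such as "Doncaster East" with a shorter entry.

```python
import re

# Illustrative lookup only; the real suburb list is not shown in this notebook
SUBURB_POSTCODES = {
    'Ashburton': 3147,
    'Mont Albert': 3127,
    'Doncaster': 3108,
    'Doncaster East': 3109,
}

def extract_suburb_postcode_sketch(address):
    """Return (suburb, postcode), or ('Unknown', None) if not in the table."""
    if not isinstance(address, str) or ',' not in address:
        return 'Unknown', None

    # Everything after the last comma is the suburb segment
    tail = address.rsplit(',', 1)[1]
    # Drop a trailing state and postcode such as " VIC 3109" if present
    tail = re.sub(r'\s+VIC\s+\d{4}\s*$', '', tail, flags=re.IGNORECASE)
    suburb = tail.strip().title()

    if suburb in SUBURB_POSTCODES:
        return suburb, SUBURB_POSTCODES[suburb]
    return 'Unknown', None

print(extract_suburb_postcode_sketch('19 POULTER STREET, ASHBURTON'))
print(extract_suburb_postcode_sketch('2 Tidcombe Cres, Doncaster East VIC 3109'))
```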

**Cleaning up NaN values**

If `Cars` is NaN, replace it with 0, assuming that the property has no car spaces.
If `Beds` or `Baths` is NaN, remove the entry, as we assume that a residential property cannot exist without those features.

In [134]:
initial_row_count = len(combined_rental_history_df)

combined_rental_history_df['Cars'] = combined_rental_history_df['Cars'].fillna('Car : 0')

combined_rental_history_df = combined_rental_history_df.dropna(subset=['Beds', 'Baths'])

final_row_count = len(combined_rental_history_df)

rows_removed = initial_row_count - final_row_count

print(f'Number of rows removed: {rows_removed}')


Number of rows removed: 574


Extracting just the number from the strings in the `Beds`, `Baths` and `Cars` features

In [135]:
combined_rental_history_df['Beds'] = combined_rental_history_df['Beds'].apply(extract_number)
combined_rental_history_df['Baths'] = combined_rental_history_df['Baths'].apply(extract_number)
combined_rental_history_df['Cars'] = combined_rental_history_df['Cars'].apply(extract_number)
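
For reference, a minimal sketch of what a helper like `extract_number` might do, assuming the raw values look like the `'Car : 0'` filler used above (the `'Bed : 3'` input is an assumed analogue). The function name is hypothetical and the real implementation in `scripts/preprocess.py` may differ.

```python
import re

def extract_number_sketch(value):
    """Return the first run of digits in value as an int, or None if absent."""
    # str() lets the function accept NaN, ints or strings without raising
    match = re.search(r'\d+', str(value))
    return int(match.group()) if match else None

print(extract_number_sketch('Car : 0'))
print(extract_number_sketch('Bed : 3'))
```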

Working out the annual % increase in rent for each property

In [136]:
combined_rental_history_df[['Oldest Price', 'Oldest Date', 'Newest Price', 'Newest Date', 'Months Difference', 'Annual Increase in Rent %']] = combined_rental_history_df.apply(calculate_annual_increase, axis=1)
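
The figures in the output table further down are consistent with a simple (non-compounded) annualisation: the total percentage change divided by the number of months, times 12. The sketch below reproduces that arithmetic for a single property. The function name and signature are assumptions; the real `calculate_annual_increase` takes a whole row and parses `Historical Prices`, whose format is not shown here.

```python
from datetime import date

def annual_increase_pct(oldest_price, oldest_date, newest_price, newest_date):
    """Return (months_between, simple annualised % increase or None)."""
    months = ((newest_date.year - oldest_date.year) * 12
              + (newest_date.month - oldest_date.month))
    # With a single recorded price (0 months) the rate is undefined
    if months <= 0 or oldest_price <= 0:
        return months, None
    total_change = (newest_price - oldest_price) / oldest_price
    return months, total_change / months * 12 * 100

# First row of the table: 550 -> 590 between 2009-04-10 and 2024-09-10
months, pct = annual_increase_pct(550.0, date(2009, 4, 10), 590.0, date(2024, 9, 10))
print(months, round(pct, 6))  # 185 0.471744
```

One consequence of this formula is visible in the table: a rise from 220 to 240 over a single month annualises to about 109%, so short gaps between listings can produce extreme values.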

Properties with only 1 recorded price have no computable annual increase, so we impute the median annual increase for their suburb and property type.

In [137]:
median_increase_per_group = combined_rental_history_df.groupby(['Suburb', 'Property Type'])['Annual Increase in Rent %'].transform('median')
# Assign the result back instead of calling fillna(..., inplace=True) on a column selection,
# which pandas warns about and which will stop working in pandas 3.0
combined_rental_history_df['Annual Increase in Rent %'] = combined_rental_history_df['Annual Increase in Rent %'].fillna(median_increase_per_group)
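
A toy example of the group-median imputation, using made-up suburbs and values rather than the scraped data. It shows that `transform('median')` returns a Series aligned with the original rows, so it can be passed straight to `fillna`, and that a group with no observed values stays NaN.

```python
import numpy as np
import pandas as pd

col = 'Annual Increase in Rent %'
toy = pd.DataFrame({
    'Suburb': ['A', 'A', 'A', 'B'],
    'Property Type': ['House', 'House', 'House', 'House'],
    col: [1.0, 3.0, np.nan, np.nan],
})

# One median per (Suburb, Property Type) group, broadcast back to every row
group_median = toy.groupby(['Suburb', 'Property Type'])[col].transform('median')

# Assign back rather than using inplace=True on a column selection
toy[col] = toy[col].fillna(group_median)
print(toy[col].tolist())
```

Suburb `B` has no observed increase at all, so its value remains NaN after this step. Whether such rows need a further fallback depends on the real data and is not addressed in this notebook.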


Removing the `Last Advertised Price` and `Historical Prices` features, as we only need the annual increase

In [138]:
combined_rental_history_df.drop(columns=['Last Advertised Price', 'Historical Prices'], inplace=True)

In [139]:
combined_rental_history_df

Unnamed: 0,Address,Beds,Baths,Cars,Property Type,Suburb,Postcode,Oldest Price,Oldest Date,Newest Price,Newest Date,Months Difference,Annual Increase in Rent %
0,"19 POULTER STREET, ASHBURTON",3,1,2,House,Ashburton,3147,550.0,2009-04-10,590.0,2024-09-10,185.0,0.471744
1,"50 KARNAK ROAD, ASHBURTON",4,3,3,House,Ashburton,3147,1190.0,2017-10-10,1350.0,2024-08-10,82.0,1.967616
2,"50 KARNAK RD, ASHBURTON",4,3,0,House,Ashburton,3147,1290.0,2021-01-10,1350.0,2024-08-10,43.0,1.297999
3,"55 VICTORY BOULEVARD, ASHBURTON",2,2,2,House,Ashburton,3147,495.0,2008-01-10,1050.0,2024-08-10,199.0,6.761078
4,"11A SOLWAY STREET, ASHBURTON",4,3,4,Apartment / Unit / Flat,Ashburton,3147,1100.0,2020-10-10,1395.0,2024-08-10,46.0,6.996047
...,...,...,...,...,...,...,...,...,...,...,...,...,...
50993,"9/92 VICTORIA CR, MONT ALBERT",2,1,1,Apartment / Unit / Flat,Mont Albert,3127,250.0,2006-12-10,250.0,2006-12-10,0.0,1.300823
50994,"1/7 GILBERT STREET, MONT ALBERT",2,1,2,Apartment / Unit / Flat,Mont Albert,3127,250.0,2006-12-10,250.0,2006-12-10,0.0,1.300823
50995,"3/764 WHITEHORSE ROAD, MONT ALBERT",2,1,1,Apartment / Unit / Flat,Mont Albert,3127,220.0,2006-11-10,240.0,2006-12-10,1.0,109.090909
50996,"3/458 BELMORE ROAD, MONT ALBERT",3,2,2,Apartment / Unit / Flat,Mont Albert,3127,330.0,2006-11-10,330.0,2006-11-10,0.0,1.300823


In [140]:
combined_rental_history_df.to_csv('../data/raw/rental_history_data.csv')

**Now, cleaning up the scrape of current listings from Domain**

In [141]:
current_rental_df = pd.read_csv('../data/landing/rental_scrape.csv')

In [142]:
current_rental_df[['Suburb', 'Postcode']] = current_rental_df['Address'].apply(lambda x: pd.Series(extract_suburb_postcode(x)))

Checking for suburbs not in the list of inner east suburbs we wish to predict and removing those instances

In [143]:
unknown_suburbs_count = current_rental_df[current_rental_df['Suburb'] == 'Unknown'].shape[0]

print(f"Number of entries labeled as 'Unknown' in the 'Suburb' column: {unknown_suburbs_count}")


Number of entries labeled as 'Unknown' in the 'Suburb' column: 695


In [144]:
# .copy() makes the filtered frame independent, so later column assignments do not warn
current_rental_df = current_rental_df[current_rental_df['Suburb'] != 'Unknown'].copy()

Renaming columns to match the historical dataset and extracting just the number for `Beds`, `Baths` and `Cars`. NaN `Cars` values are filled with 0.

In [145]:
current_rental_df.rename(columns={'Bedrooms':'Beds', 'Bathrooms':'Baths', 'Parking':'Cars', 'PropertyType':'Property Type'}, inplace=True)

current_rental_df['Cars'] = current_rental_df['Cars'].fillna('Car : 0')

current_rental_df['Beds'] = current_rental_df['Beds'].apply(extract_number)
current_rental_df['Baths'] = current_rental_df['Baths'].apply(extract_number)
current_rental_df['Cars'] = current_rental_df['Cars'].apply(extract_number)

Dropping rows with NaN `Beds` or `Baths`. None are present in this dataset.

In [146]:
initial_row_count = len(current_rental_df)

current_rental_df = current_rental_df.dropna(subset=['Beds', 'Baths'])

final_row_count = len(current_rental_df)

rows_removed = initial_row_count - final_row_count

print(f'Number of rows removed: {rows_removed}')


Number of rows removed: 0


Extracting rental price from string

In [147]:
current_rental_df['Cost'] = current_rental_df['Cost'].apply(extract_first_number)
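
A hedged sketch of a helper in the spirit of `extract_first_number`. The input strings below (`'$650 per week'` and similar) are assumed examples of what a listing's `Cost` text might look like; the actual scraped format and the real implementation are not shown in this notebook.

```python
import re

def extract_first_number_sketch(text):
    """Return the first number in text as a float, or None if there is none."""
    # Digits with optional thousands separators and an optional decimal part
    match = re.search(r'\d[\d,]*(?:\.\d+)?', str(text))
    if not match:
        return None
    return float(match.group().replace(',', ''))

print(extract_first_number_sketch('$650 per week'))
print(extract_first_number_sketch('$1,350 pw'))
print(extract_first_number_sketch('Contact agent'))
```

Taking only the first number means a range such as `'$600 - $650'` would resolve to its lower bound under this sketch.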

Dropping unused columns

In [150]:
# errors='ignore' keeps the cell safe to re-run once the columns are already gone
current_rental_df = current_rental_df.drop(columns=['URL', 'Name', 'Description'], errors='ignore')

In [151]:
current_rental_df

Unnamed: 0,Cost,Beds,Baths,Cars,Address,Property Type,Suburb,Postcode
1,650.0,3,1,1.0,"43 Highview Drive, Doncaster VIC 3108",House,Doncaster,3108
3,770.0,3,2,2.0,"2 Tidcombe Cres, Doncaster East VIC 3109",House,Doncaster,3108
4,470.0,1,1,1.0,"706/632-640 Doncaster Road, Doncaster VIC 3108",Apartment / Unit / Flat,Doncaster,3108
5,995.0,3,2,2.0,"101/1571 Malvern Road, Glen Iris VIC 3146",Apartment / Unit / Flat,Glen Iris,3146
9,775.0,3,2,3.0,"165 Wiltshire Drive, Kew VIC 3101",Townhouse,Kew,3101
...,...,...,...,...,...,...,...,...
1653,590.0,2,1,1.0,"309D/15 Foundation Boulevard, Burwood East VIC...",Apartment / Unit / Flat,Burwood,3125
1656,680.0,3,2,1.0,"2/111 Balwyn Road, Balwyn VIC 3103",Apartment / Unit / Flat,Balwyn,3103
1657,850.0,3,2,2.0,"3/32 Yerrin Street, Balwyn VIC 3103",Townhouse,Balwyn,3103
1658,550.0,3,1,1.0,"289 Blackburn Road, Mount Waverley VIC 3149",House,Blackburn,3130


In [152]:
current_rental_df.to_csv('../data/raw/current_rental_data.csv')