**PREPROCESSING**
--------------

This notebook cleans the scraped rental data and engineers features from it.

In [194]:
import pandas as pd
import sys
sys.path.append('..')
from scripts import preprocess
import importlib
importlib.reload(preprocess)
from scripts.preprocess import extract_suburb_postcode, categorise_property, calculate_annual_increase, extract_number, extract_first_number

**Cleaning the scrape of current listings from Domain**

In [195]:
current_rental_df = pd.read_csv('../data/landing/rental_scrape.csv')

In [196]:
current_rental_df[['Suburb', 'Postcode']] = current_rental_df['Address'].apply(lambda x: pd.Series(extract_suburb_postcode(x)))
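
The helper `extract_suburb_postcode` lives in `scripts/preprocess.py` and is not shown in this notebook. Purely as an illustration, the sketch below shows one way a suburb and postcode could be pulled from the tail of an address such as `6/121 Mcdonald Street, Mordialloc VIC 3195`. The name `sketch_extract_suburb_postcode`, the regular expression and the `'Unknown'` fallback for unparseable input are all assumptions; the project's real helper evidently does more, since it also labels suburbs outside the target list as `'Unknown'`.

```python
import re

# Hypothetical stand-in for scripts.preprocess.extract_suburb_postcode.
# Assumes addresses end in "<Suburb> VIC <4-digit postcode>", optionally
# preceded by a street part and a comma.
_ADDRESS_TAIL = re.compile(r"(?:^|,\s*)([A-Za-z' \-]+?)\s+VIC\s+(\d{4})\s*$")


def sketch_extract_suburb_postcode(address):
    """Return (suburb, postcode) or ('Unknown', None) if nothing matches."""
    if not isinstance(address, str):
        return ("Unknown", None)
    match = _ADDRESS_TAIL.search(address.strip())
    if match is None:
        return ("Unknown", None)
    # The postcode is kept as a string so leading zeros would survive.
    return (match.group(1).strip(), match.group(2))


print(sketch_extract_suburb_postcode("6/121 Mcdonald Street, Mordialloc VIC 3195"))
# -> ('Mordialloc', '3195')
```

A helper of this shape returns a two-element tuple, which is what lets the cell above expand the result into the `Suburb` and `Postcode` columns.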

Counting the listings whose suburb is not in the list of inner east suburbs we wish to predict (these appear as `Unknown` in the `Suburb` column), then removing them.

In [197]:
unknown_suburbs_count = current_rental_df[current_rental_df['Suburb'] == 'Unknown'].shape[0]

print(f"Number of entries labeled as 'Unknown' in the 'Suburb' column: {unknown_suburbs_count}")


Number of entries labeled as 'Unknown' in the 'Suburb' column: 967


In [198]:
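# .copy() gives an independent frame, so the later column assignments
# do not trigger pandas' SettingWithCopyWarning
current_rental_df = current_rental_df[current_rental_df['Suburb'] != 'Unknown'].copy()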

Renaming columns to match the historical dataset and extracting just the number for `Beds`, `Baths` and `Cars`. NaN `Cars` values are filled with 0 first.

In [206]:
current_rental_df.rename(columns={'Bedrooms':'Beds', 'Bathrooms':'Baths', 'Parking':'Cars', 'PropertyType':'Property Type'}, inplace=True)

current_rental_df['Cars'] = current_rental_df['Cars'].fillna('0')

current_rental_df['Beds'] = current_rental_df['Beds'].apply(extract_number)
current_rental_df['Baths'] = current_rental_df['Baths'].apply(extract_number)
current_rental_df['Cars'] = current_rental_df['Cars'].apply(extract_number)
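
`extract_number` is imported from `scripts/preprocess.py`, so its body is not visible here. The sketch below is one plausible shape for such a helper, assuming the scraped values are strings like `2 Beds` and that the first run of digits is the value wanted. The function name `sketch_extract_number`, the sample strings and the choice to return `None` when no digits are found are assumptions, not the project's actual code.

```python
import re


def sketch_extract_number(value):
    """Return the first integer found in value, or None if there is none."""
    if value is None:
        return None
    # str() lets the helper accept ints and floats as well as text;
    # a float NaN becomes "nan", which contains no digits.
    match = re.search(r"\d+", str(value))
    return int(match.group()) if match else None


print(sketch_extract_number("2 Beds"))  # -> 2
print(sketch_extract_number("0"))       # -> 0
```

Under these assumptions, filling missing `Cars` values with the string `'0'` before applying the helper means they parse to `0` rather than to a missing value.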

Dropping any rows with NaN in `Beds` or `Baths`. There are none, so no rows are removed.

In [207]:
initial_row_count = len(current_rental_df)

current_rental_df = current_rental_df.dropna(subset=['Beds', 'Baths'])

final_row_count = len(current_rental_df)

rows_removed = initial_row_count - final_row_count

print(f'Number of rows removed: {rows_removed}')


Number of rows removed: 0


Extracting the rental price from the `Cost` string

In [208]:
current_rental_df['Cost'] = current_rental_df['Cost'].apply(extract_first_number)
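
`extract_first_number` also comes from `scripts/preprocess.py` and is not shown. As a hedged illustration, the sketch below takes the first numeric token from a price string and returns it as a float, which is consistent with the `550.0`-style values in the output further down. The name `sketch_extract_first_number`, the sample strings, the handling of thousands separators and the `None` fallback are assumptions.

```python
import re

# Matches a number that may contain thousands separators and a decimal
# part, e.g. "550", "1,200" or "1,200.50".
_PRICE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def sketch_extract_first_number(text):
    """Return the first number in text as a float, or None if absent."""
    if not isinstance(text, str):
        return None
    match = _PRICE.search(text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


print(sketch_extract_first_number("$550 per week"))  # -> 550.0
print(sketch_extract_first_number("$550 - $600"))    # -> 550.0
```

Taking only the first number means a range such as `$550 - $600` resolves to its lower bound. That is a design choice of this sketch and may differ from what the real helper does.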

Dropping unused columns

In [202]:
current_rental_df.drop(columns=['URL', 'Name', 'Description'], inplace=True)

In [209]:
current_rental_df

| Unnamed: 0 | Cost | Beds | Baths | Cars | Address | Property Type | Suburb | Postcode |
|---|---|---|---|---|---|---|---|---|
| 0 | 550.0 | 2 | 2 | 1 | 2/21-23 Westgate Street, Pascoe Vale South VIC... | Apartment / Unit / Flat | Pascoe Vale | 3044 |
| 1 | 560.0 | 2 | 1 | 1 | 6/121 Mcdonald Street, Mordialloc VIC 3195 | Apartment / Unit / Flat | Mordialloc | 3195 |
| 2 | 550.0 | 2 | 1 | 1 | 5/3 Carnarvon Street, Doncaster VIC 3108 | Apartment / Unit / Flat | Doncaster | 3108 |
| 3 | 340.0 | 1 | 1 | 1 | 4/10 Cole Street, Noble Park VIC 3174 | Apartment / Unit / Flat | Noble Park | 3174 |
| 5 | 460.0 | 3 | 1 | 0 | 8 Perth Avenue, Albion VIC 3020 | House | Albion | 3020 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 12093 | 950.0 | 4 | 3 | 2 | 2/1106 Burke Road, Balwyn North VIC 3104 | Townhouse | Balwyn | 3103 |
| 12094 | 75.0 | 0 | 1 | 1 | Car Park/228 La Trobe St, Melbourne VIC 3000 | Carspace | Melbourne | 3004 |
| 12095 | 690.0 | 3 | 2 | 2 | 4/420 Middleborough Road, Blackburn VIC 3130 | Townhouse | Blackburn | 3130 |
| 12097 | 700.0 | 3 | 1 | 1 | Balwyn VIC 3103 | House | Balwyn | 3103 |


Saving the updated data

In [210]:
current_rental_df.to_csv('../data/curated/current_rental_data.csv')