# About Dataset

## Updates
- **06/08/2018**: Melbourne housing has cooled off. Challenges:
  1. When did it exactly happen?
  2. Could you see it slowing down? Consider variables such as overall price, amount sold vs. unsold, changes in rentals vs. housing, changes in CouncilArea or Region, and distance from Melbourne CBD.
  3. Could you have predicted it?
  4. Should I hold off longer in buying a two-bedroom apartment in Northcote?

- **22/05/2018**: Continued with a smaller subset of the data (fewer columns) due to time-consuming web scraping and potential issues. Data will continue to be posted.

- **28/11/2017**: Clearance levels starting to decrease. Can you find a pattern or make sense of it?

## Content & Acknowledgements
This data was scraped from publicly available results posted weekly on Domain.com.au and cleaned as far as possible for data analysis. The dataset includes:
- **Address**
- **Type** of Real Estate
- **Suburb**
- **Method** of Selling
- **Rooms**
- **Price** (in AUD)
- **Real Estate Agent**
- **Date** of Sale
- **Distance** from CBD

### Additional Data
- Property size
- Land size
- Council area

## Key Details
- **Suburb**: Suburb
- **Address**: Address
- **Rooms**: Number of rooms
- **Price**: Price in AUD
- **Method**:
  - S: property sold
  - SP: property sold prior
  - PI: property passed in
  - PN: sold prior not disclosed
  - SN: sold not disclosed
  - NB: no bid
  - VB: vendor bid
  - W: withdrawn prior to auction
  - SA: sold after auction
  - SS: sold after auction price not disclosed
  - N/A: price or highest bid not available

- **Type**:
  - br: bedroom(s)
  - h: house, cottage, villa, semi, terrace
  - u: unit, duplex
  - t: townhouse
  - dev site: development site
  - o res: other residential

- **SellerG**: Real Estate Agent
- **Date**: Date sold
- **Distance**: Distance from CBD in kilometers
- **Regionname**: General Region (West, North West, North, North East, etc.)
- **Propertycount**: Number of properties in the suburb
- **Bedroom2**: Scraped number of bedrooms (from a different source)
- **Bathroom**: Number of bathrooms
- **Car**: Number of car spots
- **Landsize**: Land size in square meters
- **BuildingArea**: Building size in square meters
- **YearBuilt**: Year the house was built
- **CouncilArea**: Governing council for the area
- **Lattitude**: Latitude (the column name is misspelled in the dataset)
- **Longtitude**: Longitude (the column name is misspelled in the dataset)

For more details, visit the [Melbourne Housing Market dataset on Kaggle](https://www.kaggle.com/datasets/anthonypino/melbourne-housing-market/data).
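The code lists above can be turned into lookup tables when exploring the data. The following is a minimal sketch, assuming a toy frame with hypothetical rows in the dataset's layout; `METHOD_LABELS`, `TYPE_LABELS`, and `toy` are illustrative names, not part of the dataset. Dates appear day-first in the sample rows (for example `23/09/2017`), so the format is passed explicitly.

```python
import pandas as pd

# Lookup tables copied from the Key Details list (illustrative names).
METHOD_LABELS = {
    "S": "property sold",
    "SP": "property sold prior",
    "PI": "property passed in",
    "PN": "sold prior not disclosed",
    "SN": "sold not disclosed",
    "NB": "no bid",
    "VB": "vendor bid",
    "W": "withdrawn prior to auction",
    "SA": "sold after auction",
    "SS": "sold after auction price not disclosed",
}
TYPE_LABELS = {
    "h": "house, cottage, villa, semi, terrace",
    "u": "unit, duplex",
    "t": "townhouse",
}

# Hypothetical toy rows standing in for the real CSV.
toy = pd.DataFrame({
    "Type": ["h", "t", "u"],
    "Method": ["S", "SP", "PI"],
    "Date": ["10/12/2016", "4/06/2016", "23/09/2017"],
})

decoded = toy.assign(
    TypeLabel=toy["Type"].map(TYPE_LABELS),
    MethodLabel=toy["Method"].map(METHOD_LABELS),
    # Dates are day-first, so state the format rather than rely on inference.
    Date=pd.to_datetime(toy["Date"], format="%d/%m/%Y"),
)
print(decoded)
```

Codes that are absent from a lookup table map to `NaN`, which makes unexpected values easy to spot.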


In [15]:
```python
import pandas as pd
from sklearn.model_selection import train_test_split
```

In [19]:
```python
df = pd.read_csv('DataSets/Melbourne_housing_FULL.csv')
df.sample(5)
```

| | Suburb | Address | Rooms | Type | Price | Method | SellerG | Date | Distance | Postcode | ... | Bathroom | Car | Landsize | BuildingArea | YearBuilt | CouncilArea | Lattitude | Longtitude | Regionname | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7661 | Pascoe Vale | 2 Hazel Gr | 2 | h | 915000.0 | S | New | 10/12/2016 | 9.9 | 3044.0 | ... | 1.0 | 3.0 | 629.0 | NaN | 1975.0 | Moreland City Council | -37.7312 | 144.9399 | Northern Metropolitan | 7485.0 |
| 1763 | Brighton | 3/18 Waterloo St | 3 | t | NaN | SP | Buxton | 4/06/2016 | 11.2 | 3186.0 | ... | NaN | NaN | NaN | NaN | NaN | Bayside City Council | NaN | NaN | Southern Metropolitan | 10579.0 |
| 22543 | Kew East | 25 Minogue St | 6 | h | 1875000.0 | PI | Marshall | 23/09/2017 | 7.3 | 3102.0 | ... | 2.0 | 2.0 | 697.0 | 170.0 | 1962.0 | Boroondara City Council | -37.79032 | 145.05408 | Southern Metropolitan | 2671.0 |
| 1084 | Balwyn North | 53 Trentwood Av | 5 | h | 2190000.0 | S | Marshall | 12/11/2016 | 9.2 | 3104.0 | ... | NaN | NaN | NaN | NaN | NaN | Boroondara City Council | NaN | NaN | Southern Metropolitan | 7809.0 |
| 7992 | Prahran | 22/55 Union St | 2 | u | 523000.0 | S | Biggin | 10/12/2016 | 4.5 | 3181.0 | ... | NaN | NaN | NaN | NaN | NaN | Stonnington City Council | NaN | NaN | Southern Metropolitan | 7717.0 |


In [20]:
```python
# Separate the target (Price) from the predictors
y = df.Price
melb_predictors = df.drop(['Price'], axis=1)
# Keep only numeric predictors by excluding object (string) columns
X = melb_predictors.select_dtypes(exclude=['object'])
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)
X_train.head()
```

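| | Rooms | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3372 | 3 | 7.8 | 3058.0 | 4.0 | 1.0 | 1.0 | 417.0 | 135.0 | 1920.0 | -37.751 | 144.9764 | 11204.0 |
| 27417 | 3 | 6.4 | 3012.0 | 2.0 | 1.0 | 1.0 | NaN | NaN | NaN | -37.7982 | 144.8745 | 5058.0 |
| 21317 | 3 | 25.2 | 3173.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 8459.0 |
| 5194 | 1 | 4.6 | 3122.0 | 1.0 | 1.0 | 1.0 | 0.0 | 63.0 | 1995.0 | -37.8299 | 145.0422 | 11308.0 |
| 16910 | 3 | 7.5 | 3040.0 | 3.0 | 3.0 | 1.0 | 846.0 | 187.0 | 1940.0 | -37.75228 | 144.88429 | 588.0 |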


In [21]:
```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
```

In [24]:
```python
def score_dataset(X_train, X_valid, y_train, y_valid):
    """Fit a small random forest and return its mean absolute error on the validation set."""
    model = RandomForestRegressor(n_estimators=10, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)
```

In [35]:
```python
# Get names of columns with missing values
cols_with_missing = [col for col in X_train.columns if X_train[col].isnull().any()]

reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)
print("MAE from Approach 1 (Drop columns with missing values):")
print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))
```

MAE from Approach 1 (Drop columns with missing values):


ValueError: Input y contains NaN.
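
This error is consistent with the data: at least one sampled row has no `Price`, and `train_test_split` carries such missing targets into `y_train` and `y_valid`. scikit-learn estimators do not accept `NaN` in the target. One possible fix, sketched here on a hypothetical toy frame rather than the real CSV, is to drop the rows without a price before separating `X` and `y`. The values and the name `toy` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for Melbourne_housing_FULL.csv.
toy = pd.DataFrame({
    "Rooms": [2, 3, 6, 5, 2, 3, 4, 1, 3, 2],
    "Distance": [9.9, 11.2, 7.3, 9.2, 4.5, 7.8, 6.4, 4.6, 7.5, 25.2],
    "Price": [915000.0, np.nan, 1875000.0, 2190000.0, 523000.0,
              np.nan, 1200000.0, 450000.0, 980000.0, 610000.0],
})

# Drop rows whose target is missing before building X and y.
toy = toy.dropna(subset=["Price"])
y = toy["Price"]
X = toy.drop(columns=["Price"])

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.8, test_size=0.2, random_state=0
)

model = RandomForestRegressor(n_estimators=10, random_state=0)
model.fit(X_train, y_train)  # y_train contains no NaN, so fit does not raise
mae = mean_absolute_error(y_valid, model.predict(X_valid))
print(mae)
```

Applying the same `dropna(subset=['Price'])` step to `df` before the split would change the row counts reported in the later cells.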

In [34]:
```python
from sklearn.impute import SimpleImputer

my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))

# Imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns

print("MAE from Approach 2 (Imputation):")
print(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))
```

MAE from Approach 2 (Imputation):


ValueError: Input y contains NaN.
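
`SimpleImputer` returns a NumPy array, which is why the column names have to be reattached after imputation. A slightly tighter variant, sketched below on hypothetical toy frames, passes `columns=` and `index=` to the `DataFrame` constructor so that both the names and the original row labels survive. The default strategy is the column mean, written out explicitly here.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frames; the row labels mimic the shuffled index after a split.
X_train = pd.DataFrame(
    {"Rooms": [3, 3, 1], "Landsize": [417.0, np.nan, 0.0]},
    index=[3372, 27417, 5194],
)
X_valid = pd.DataFrame({"Rooms": [3], "Landsize": [np.nan]}, index=[21317])

imputer = SimpleImputer(strategy="mean")

# Fit on the training data only, then reuse the learned means for validation.
imputed_X_train = pd.DataFrame(
    imputer.fit_transform(X_train), columns=X_train.columns, index=X_train.index
)
imputed_X_valid = pd.DataFrame(
    imputer.transform(X_valid), columns=X_valid.columns, index=X_valid.index
)
print(imputed_X_valid)
```

The missing validation value is filled with the training mean of `Landsize`, not with a statistic computed from the validation rows.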

In [37]:
```python
# Work on copies so the original data is left unchanged
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()

# For each affected feature, add a boolean column recording which values were missing
for col in cols_with_missing:
    X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull()
    X_valid_plus[col + '_was_missing'] = X_valid_plus[col].isnull()

my_imputer = SimpleImputer()
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))

# Imputation removed column names; put them back
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns

print("MAE from Approach 3 (An Extension to Imputation):")
print(score_dataset(imputed_X_train_plus, imputed_X_valid_plus, y_train, y_valid))

# Approach 3 keeps the information about which values were missing.
# No MAE was produced in this run, so the three approaches cannot be ranked from these results.
```

MAE from Approach 3 (An Extension to Imputation):


ValueError: Input y contains NaN.
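
As an alternative to building the `_was_missing` columns by hand, `SimpleImputer` accepts `add_indicator=True`, which appends indicator columns for the features that had missing values during `fit`. This is a sketch on a hypothetical toy frame, not the notebook's own method; the `_was_missing` suffix is reused here only to mirror the naming above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with one incomplete feature.
X_train = pd.DataFrame({
    "Rooms": [3.0, 3.0, 1.0],
    "Landsize": [417.0, np.nan, 0.0],
})

imputer = SimpleImputer(strategy="mean", add_indicator=True)
values = imputer.fit_transform(X_train)

# indicator_.features_ lists the positions of the columns that had missing values in fit.
flagged = [X_train.columns[i] + "_was_missing" for i in imputer.indicator_.features_]

X_train_plus = pd.DataFrame(
    values, columns=list(X_train.columns) + flagged, index=X_train.index
)
print(X_train_plus)
```

The indicator columns come back as floats (0.0 or 1.0) because the imputer returns a single numeric array.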

In [39]:
```python
print(X_train.shape)

# Number of missing values in each column that has at least one
missing_val_count_by_column = (X_train.isnull().sum())
print(missing_val_count_by_column[missing_val_count_by_column > 0])
```

```text
(27885, 12)
Distance             1
Postcode             1
Bedroom2          6586
Bathroom          6593
Car               6997
Landsize          9459
BuildingArea     16909
YearBuilt        15444
Lattitude         6399
Longtitude        6399
Propertycount        2
dtype: int64
```
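
Raw counts are easier to judge as a share of the 27,885 training rows. The sketch below re-enters the counts printed above into a Series (the names `missing_counts` and `missing_share` are illustrative) and converts them to fractions. On the live frame, `X_train.isnull().mean()` should give the same shares directly.

```python
import pandas as pd

# Counts copied from the output above; 27885 is the number of training rows.
n_rows = 27885
missing_counts = pd.Series({
    "Distance": 1, "Postcode": 1, "Bedroom2": 6586, "Bathroom": 6593,
    "Car": 6997, "Landsize": 9459, "BuildingArea": 16909, "YearBuilt": 15444,
    "Lattitude": 6399, "Longtitude": 6399, "Propertycount": 2,
})

# Fraction of rows missing per column, largest first.
missing_share = (missing_counts / n_rows).sort_values(ascending=False)
print(missing_share.round(3))

# Columns where more than half of the rows are missing.
mostly_missing = missing_share[missing_share > 0.5].index.tolist()
print(mostly_missing)
```

`BuildingArea` and `YearBuilt` are each missing in more than half of the rows, so mean imputation for those two columns replaces most of their values with a single number.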
