$\newcommand{\xv}{\mathbf{x}}
 \newcommand{\wv}{\mathbf{w}}
 \newcommand{\yv}{\mathbf{y}}
 \newcommand{\zv}{\mathbf{z}}
 \newcommand{\uv}{\mathbf{u}}
 \newcommand{\vv}{\mathbf{v}}
 \newcommand{\Chi}{\mathcal{X}}
 \newcommand{\R}{\rm I\!R}
 \newcommand{\sign}{\text{sign}}
 \newcommand{\Tm}{\mathbf{T}}
 \newcommand{\Xm}{\mathbf{X}}
 \newcommand{\Zm}{\mathbf{Z}}
 \newcommand{\I}{\mathbf{I}}
 \newcommand{\Um}{\mathbf{U}}
 \newcommand{\Vm}{\mathbf{V}} 
 \newcommand{\muv}{\boldsymbol\mu}
 \newcommand{\Sigmav}{\boldsymbol\Sigma}
 \newcommand{\Lambdav}{\boldsymbol\Lambda}
$

# House Price Predictor


### ITCS 5156 Project

<br/>

NAME: Jose Salas-Ayala


Step 1: Data Cleaning
The data used here is the same dataset used in the reference paper: T. D. Phan, "Housing Price Prediction Using Machine Learning Algorithms: The Case of Melbourne City, Australia," 2018 International Conference on Machine Learning and Data Engineering (iCMLDE), Sydney, NSW, Australia, 2018, pp. 35-42, doi: 10.1109/iCMLDE.2018.00017.


In [34]:
# Import the necessary libraries
import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib 
import matplotlib.pyplot as plt
import sklearn



The model selected is a simple linear regression, because it works well as a baseline against which the accuracy of other models can be compared. It may not capture the true (possibly nonlinear) relationship needed for the best predictions, but it tends to give reasonable, stable predictions.
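
As a rough illustration of what "baseline" means here, the sketch below fits a linear regression on synthetic data and scores it on a held-out split. The data, coefficients, and variable names are assumptions made up for this example and are not taken from the Melbourne dataset or the reference paper.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): the target depends linearly
# on two features, plus Gaussian noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(0, 0.5, size=500)

# Hold out 20% of the rows to estimate out-of-sample accuracy
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the baseline and score it; other models would be compared to this R^2
baseline = LinearRegression().fit(X_train, y_train)
baseline_r2 = r2_score(y_test, baseline.predict(X_test))
print(f"Baseline R^2 on held-out data: {baseline_r2:.3f}")
```

Any later model would be evaluated on the same split, so that an improvement over `baseline_r2` can be attributed to the model rather than to the data.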

In [35]:
# Import the data

df = pd.read_csv('Melbourne_housing_FULL.csv')

#check the first 5 entries
df.head(5)
#df.shape

| | Suburb | Address | Rooms | Type | Price | Method | SellerG | Date | Distance | Postcode | ... | Bathroom | Car | Landsize | BuildingArea | YearBuilt | CouncilArea | Lattitude | Longtitude | Regionname | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abbotsford | 68 Studley St | 2 | h | NaN | SS | Jellis | 3/09/2016 | 2.5 | 3067.0 | ... | 1.0 | 1.0 | 126.0 | NaN | NaN | Yarra City Council | -37.8014 | 144.9958 | Northern Metropolitan | 4019.0 |
| 1 | Abbotsford | 85 Turner St | 2 | h | 1480000.0 | S | Biggin | 3/12/2016 | 2.5 | 3067.0 | ... | 1.0 | 1.0 | 202.0 | NaN | NaN | Yarra City Council | -37.7996 | 144.9984 | Northern Metropolitan | 4019.0 |
| 2 | Abbotsford | 25 Bloomburg St | 2 | h | 1035000.0 | S | Biggin | 4/02/2016 | 2.5 | 3067.0 | ... | 1.0 | 0.0 | 156.0 | 79.0 | 1900.0 | Yarra City Council | -37.8079 | 144.9934 | Northern Metropolitan | 4019.0 |
| 3 | Abbotsford | 18/659 Victoria St | 3 | u | NaN | VB | Rounds | 4/02/2016 | 2.5 | 3067.0 | ... | 2.0 | 1.0 | 0.0 | NaN | NaN | Yarra City Council | -37.8114 | 145.0116 | Northern Metropolitan | 4019.0 |
| 4 | Abbotsford | 5 Charles St | 3 | h | 1465000.0 | SP | Biggin | 4/03/2017 | 2.5 | 3067.0 | ... | 2.0 | 0.0 | 134.0 | 150.0 | 1900.0 | Yarra City Council | -37.8093 | 144.9944 | Northern Metropolitan | 4019.0 |


Data Cleaning: Following the paper, columns with more than 55% missing data are removed, and rows with a missing "Price" value are removed.
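
The threshold rule described above can be sketched on a small toy frame. The column values below are invented for illustration, and only the 55% cutoff and the "Price" target come from the text; this is one possible way to express the rule, not necessarily how the paper implemented it.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical), not the Melbourne dataset
toy = pd.DataFrame({
    "Price": [500000.0, np.nan, 750000.0, 620000.0],
    "Rooms": [2, 3, 3, 4],
    "BuildingArea": [np.nan, np.nan, np.nan, 120.0],  # 75% missing
    "Landsize": [200.0, np.nan, 310.0, 150.0],        # 25% missing
})

# Fraction of missing values in each column
missing_fraction = toy.isna().mean()

# Drop columns whose missing fraction exceeds the 55% threshold
columns_to_drop = missing_fraction[missing_fraction > 0.55].index
cleaned = toy.drop(columns=columns_to_drop)

# Drop rows where the target ("Price") is missing
cleaned = cleaned.dropna(subset=["Price"])

print(list(cleaned.columns))
print(len(cleaned))
```

Computing the missing fraction before dropping any rows matters: the fractions change once rows are removed, so the order of the two steps can affect which columns survive.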

In [53]:
missing_percentage = df.isna().mean()
#columns_to_drop = missing_percentage[missing_percentage > 0.55].index
df_housing = df.drop(columns=['Suburb', 'Address', 'Method', 'SellerG', 'Date', 'Postcode', 'Bedroom2', 'CouncilArea', 'Regionname', 'BuildingArea', 'YearBuilt'])
df_housing = df_housing.dropna(subset=['Price'])
# The paper used the Google Maps API to fill in missing coordinates;
# I didn't want to buy a key, so those rows are dropped instead.
df_housing = df_housing.dropna(subset=['Lattitude'])
# Keep only rows with at least 6 non-missing values
df_housing = df_housing.dropna(thresh=6)
# Matching the data in the reference paper leaves us with 11 variables

# Imputation

# Median land size per house type, computed on the full dataset
grouped_df = df.groupby(['Type'])['Landsize'].median().reset_index()

# Impute a missing Landsize with the median Landsize of the row's house type
def impute_landsize(row):
    if pd.isnull(row['Landsize']):
        median_val = grouped_df.loc[grouped_df['Type'] == row['Type'], 'Landsize'].values
        if len(median_val) > 0:
            return median_val[0]
    # Value already present, or no median available for this type
    return row['Landsize']

# Apply to the cleaned frame (not the raw df) so only the remaining rows are processed
df_housing['Landsize'] = df_housing.apply(impute_landsize, axis=1)

# Replace NaN with 0 for bathrooms and car spots
df_housing['Bathroom'] = df_housing['Bathroom'].fillna(0)
df_housing['Car'] = df_housing['Car'].fillna(0)

df_housing.head(5)
df_housing.shape
df_housing.describe()

| | Rooms | Price | Distance | Bathroom | Car | Landsize | Lattitude | Longtitude | Propertycount |
|---|---|---|---|---|---|---|---|---|---|
| count | 20993.0 | 20993.0 | 20993.0 | 20993.0 | 20993.0 | 20993.0 | 20993.0 | 20993.0 | 20993.0 |
| mean | 3.059163 | 1089746.0 | 11.35902 | 1.575001 | 1.666889 | 577.115896 | -37.806963 | 144.996711 | 7516.751489 |
| std | 0.949881 | 653028.3 | 6.891418 | 0.715417 | 1.020688 | 3505.89579 | 0.091619 | 0.12068 | 4411.397778 |
| min | 1.0 | 85000.0 | 0.0 | 0.0 | 0.0 | 0.0 | -38.19043 | 144.42379 | 83.0 |
| 25% | 2.0 | 657000.0 | 6.4 | 1.0 | 1.0 | 218.0 | -37.8609 | 144.9253 | 4380.0 |
| 50% | 3.0 | 910000.0 | 10.4 | 1.0 | 2.0 | 520.0 | -37.80046 | 145.0032 | 6567.0 |
| 75% | 4.0 | 1335000.0 | 14.2 | 2.0 | 2.0 | 657.0 | -37.74897 | 145.06877 | 10331.0 |
| max | 16.0 | 11200000.0 | 48.1 | 9.0 | 18.0 | 433014.0 | -37.3978 | 145.52635 | 21650.0 |
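
For reference, the group-median imputation of "Landsize" by "Type" can also be written without a row-wise function. The sketch below uses a toy frame with invented values. It is an alternative formulation, not the code that produced the summary above.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): two types with observed values, one ("t") with none
toy = pd.DataFrame({
    "Type": ["h", "h", "h", "u", "u", "t"],
    "Landsize": [100.0, 300.0, np.nan, 50.0, np.nan, np.nan],
})

# Median Landsize per Type (missing values are skipped when computing the median)
type_median = toy.groupby("Type")["Landsize"].median()

# Fill each missing Landsize with the median of its Type; a type with no
# observed values (here "t") has a NaN median, so its rows stay NaN
toy["Landsize"] = toy["Landsize"].fillna(toy["Type"].map(type_median))

print(toy)
```

Mapping the per-type medians onto the "Type" column and passing the result to `fillna` avoids calling a Python function once per row, which can be noticeably faster on a frame with tens of thousands of rows.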


After data cleaning, it's time to perform feature selection.