# House price prediction

The dataset for this project consists of property data from Melbourne.  
The features are a mix of continuous and categorical variables.  
The task is to predict which price class a property in the city belongs to.

### Imports

In [2]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder

### Reading the data

In [3]:
raw_df = pd.read_csv('data/train.csv', index_col=0)

### Data exploration and visualisation

Let's look at the shape of the data.

In [4]:
print(f'Shape of the dataset: {raw_df.shape}.')

Shape of the dataset: (11543, 15).


Let's have a look at the dataset.

In [5]:
raw_df.head()

| | Rooms | Type | Method | Distance | Postcode | Bedrooms | Bathroom | Car | Landsize | YearBuilt | Lattitude | Longtitude | Regionname | Propertycount | Price class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | h | S | 6.4 | 3011.0 | 3.0 | 1.0 | 2.0 | 411.0 | NaN | -37.7969 | 144.9049 | Western Metropolitan | 7570.0 | 1 |
| 1 | 4 | h | S | 14.6 | 3189.0 | 4.0 | 1.0 | 2.0 | 638.0 | 1972.0 | -37.9378 | 145.057 | Southern Metropolitan | 2555.0 | 1 |
| 2 | 5 | h | PI | 12.4 | 3107.0 | 5.0 | 4.0 | 2.0 | 968.0 | 1970.0 | -37.77083 | 145.11516 | Eastern Metropolitan | 5420.0 | 1 |
| 3 | 3 | h | SP | 5.2 | 3056.0 | 3.0 | 1.0 | 2.0 | 264.0 | NaN | -37.7611 | 144.9644 | Northern Metropolitan | 11918.0 | 0 |
| 4 | 3 | h | S | 8.8 | 3072.0 | 3.0 | 1.0 | 2.0 | 610.0 | NaN | -37.751 | 145.0197 | Northern Metropolitan | 14577.0 | 0 |


We see that three features contain strings: Type, Method and Regionname.  
These need to be encoded as numbers before they can be used in a model.

Let's check what types of data we have.

In [6]:
raw_df.dtypes

Rooms              int64
Type              object
Method            object
Distance         float64
Postcode         float64
Bedrooms         float64
Bathroom         float64
Car              float64
Landsize         float64
YearBuilt        float64
Lattitude        float64
Longtitude       float64
Regionname        object
Propertycount    float64
Price class        int64
dtype: object

Most of the features are floats.  
Rooms and the target feature, Price class, are integers.  
Type, Method and Regionname are stored as `object`, which is how pandas reports the string columns here.
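To separate the columns that need encoding from the numeric ones without listing them by hand, `select_dtypes` can be used. The sketch below runs on a small made-up frame (`toy_df` and its values are illustrative assumptions, not rows taken from the dataset).

```python
import pandas as pd

# Toy frame standing in for raw_df (illustrative values only)
toy_df = pd.DataFrame({
    "Rooms": [4, 3],
    "Type": ["h", "u"],
    "Distance": [6.4, 5.2],
    "Regionname": ["Western Metropolitan", "Northern Metropolitan"],
})

# Numeric columns (ints and floats) can be used as they are
numeric_cols = toy_df.select_dtypes(include="number").columns.tolist()

# Everything else (here: the string columns) needs to be encoded
string_cols = toy_df.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("To encode:", string_cols)
```

Applied to `raw_df`, the second list should contain Type, Method and Regionname, matching the dtypes shown above.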

Let's check for missing data.

In [7]:
raw_df.isna().sum()

Rooms               0
Type                0
Method              0
Distance            0
Postcode            0
Bedrooms            0
Bathroom           46
Car                53
Landsize           33
YearBuilt        4572
Lattitude           0
Longtitude          0
Regionname          0
Propertycount      40
Price class         0
dtype: int64

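We see that we have missing data in the following features: Bathroom, Car, Landsize, YearBuilt and Propertycount.  
YearBuilt is by far the worst affected, with 4572 of the 11543 rows (about 40%) missing; the other four features are each missing in fewer than 60 rows.  
These gaps must be dealt with before modelling.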
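One common way to handle gaps like these is to fill each numeric column with its median. The notebook does not say which strategy it uses, so the snippet below is only a sketch under that assumption, run on a made-up frame (`toy_df` is hypothetical).

```python
import pandas as pd

# Toy frame with gaps, standing in for the affected columns of raw_df
toy_df = pd.DataFrame({
    "Bathroom": [1.0, None, 2.0, 1.0],
    "Car": [2.0, 1.0, None, 1.0],
    "YearBuilt": [1972.0, None, None, 1970.0],
})

# Share of missing values per column, useful for judging how much is imputed
missing_share = toy_df.isna().mean()
print(missing_share)

# Median imputation (an assumption, not necessarily the author's choice):
# each missing value is replaced by the median of its own column
medians = toy_df.median()
imputed_df = toy_df.fillna(medians)
print(imputed_df)
```

For a column with as many gaps as YearBuilt, dropping the feature or adding a "was missing" indicator column are alternatives worth comparing against plain imputation.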

Let's see how many different categories we have in the target feature.

In [8]:
raw_df['Price class'].nunique()

3

The target feature has 3 different categories.  
Let's look at what those 3 labels are.

In [9]:
raw_df['Price class'].unique()

array([1, 0, 2], dtype=int64)

The three labels in the target are the following integers: 0, 1, and 2.
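Beyond knowing which labels exist, it is usually worth checking how often each one occurs, since a strongly imbalanced target affects how a classifier should be evaluated. A minimal sketch with `value_counts`, using a made-up label series (`labels` is hypothetical and does not reflect the real class distribution):

```python
import pandas as pd

# Toy target column standing in for raw_df['Price class']
labels = pd.Series([1, 1, 1, 0, 0, 2], name="Price class")

# Absolute number of rows per class, ordered by label
counts = labels.value_counts().sort_index()

# Same information as a fraction of all rows
shares = labels.value_counts(normalize=True).sort_index()

print(counts)
print(shares)
```

On the real data the same two calls would be made on `raw_df['Price class']`.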

### Data cleaning

First we will encode the features that are strings.

In [14]:
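# We will use OneHotEncoder from scikit-learn to encode the string features.
def encoding(x):
    """One-hot encode column x of raw_df and return the result as a DataFrame."""
    ohe = OneHotEncoder()
    # fit_transform returns a sparse matrix with one column per category
    ohe_results = ohe.fit_transform(raw_df[[x]])
    # Note: ohe.categories_ is a list of arrays, so the column labels end up as tuples
    return pd.DataFrame(ohe_results.toarray(), columns=ohe.categories_)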

# Type, Method and Regionname features need to be encoded
typ = encoding('Type')
method = encoding('Method')
region = encoding('Regionname')

# Add the encoded features.
# join aligns rows on the index, so this relies on raw_df having the default 0..n-1 index
encoded_df = raw_df.join(typ)
encoded_df = encoded_df.join(method)
encoded_df = encoded_df.join(region)

# Remove the original features containing strings
del encoded_df['Type']
del encoded_df['Method']
del encoded_df['Regionname']

Let's move Price class, the target feature, to the end of the dataframe for convenience.

In [16]:
price = encoded_df.pop('Price class')
encoded_df['Price class'] = price
encoded_df.head()

| | Rooms | Distance | Postcode | Bedrooms | Bathroom | Car | Landsize | YearBuilt | Lattitude | Longtitude | ... | (VB,) | (Eastern Metropolitan,) | (Eastern Victoria,) | (Northern Metropolitan,) | (Northern Victoria,) | (South-Eastern Metropolitan,) | (Southern Metropolitan,) | (Western Metropolitan,) | (Western Victoria,) | Price class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 6.4 | 3011.0 | 3.0 | 1.0 | 2.0 | 411.0 | NaN | -37.7969 | 144.9049 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1 |
| 1 | 4 | 14.6 | 3189.0 | 4.0 | 1.0 | 2.0 | 638.0 | 1972.0 | -37.9378 | 145.057 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1 |
| 2 | 5 | 12.4 | 3107.0 | 5.0 | 4.0 | 2.0 | 968.0 | 1970.0 | -37.77083 | 145.11516 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 3 | 3 | 5.2 | 3056.0 | 3.0 | 1.0 | 2.0 | 264.0 | NaN | -37.7611 | 144.9644 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
| 4 | 3 | 8.8 | 3072.0 | 3.0 | 1.0 | 2.0 | 610.0 | NaN | -37.751 | 145.0197 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
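The encoded columns above carry tuple-like names such as `(VB,)`, most likely because `ohe.categories_` is a list holding one array and pandas turns that into a one-level MultiIndex. If flat, readable names are preferred, one alternative is pandas' own `get_dummies`, swapped in here for `OneHotEncoder`. The sketch below uses a made-up frame (`toy_df` is hypothetical) with a deliberately non-default index to show that no index alignment is needed.

```python
import pandas as pd

# Toy frame standing in for raw_df; the non-default index is deliberate
toy_df = pd.DataFrame(
    {
        "Rooms": [4, 3, 5],
        "Type": ["h", "u", "h"],
        "Method": ["S", "PI", "S"],
        "Price class": [1, 0, 2],
    },
    index=[10, 20, 30],
)

# get_dummies replaces the listed columns with indicator columns named
# "<column>_<category>" and keeps the original index, so no join is required
encoded = pd.get_dummies(toy_df, columns=["Type", "Method"], dtype=float)

# Move the target to the last position, as in the cell above
encoded["Price class"] = encoded.pop("Price class")

print(encoded.columns.tolist())
print(encoded)
```

The prefix in each column name also makes it clear which original feature an indicator came from, which helps when two features share a category label.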
