In [1]:
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# to restrict the float value to 3 decimal places
pd.set_option('display.float_format', lambda x: '%.3f' % x)

In [2]:
df = pd.read_csv("Melbourne_Housing.csv")

**View the first and last 5 rows of the dataset**

In [3]:
df.head()

Unnamed: 0,Suburb,Rooms,Type,SellerG,Date,Distance,Postcode,Bedroom,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Regionname,Propertycount,Price
0,Airport West,3,t,Nelson,03-09-2016,13.5,3042.0,3.0,2.0,1.0,303.0,225.0,2016.0,Western Metropolitan,3464,840000
1,Albert Park,2,h,hockingstuart,03-09-2016,3.3,3206.0,2.0,1.0,0.0,120.0,82.0,1900.0,Southern Metropolitan,3280,1275000
2,Albert Park,2,h,Thomson,03-09-2016,3.3,3206.0,2.0,1.0,0.0,159.0,inf,,Southern Metropolitan,3280,1455000
3,Alphington,4,h,Brace,03-09-2016,6.4,3078.0,3.0,2.0,4.0,853.0,263.0,1930.0,Northern Metropolitan,2211,2000000
4,Alphington,3,h,Jellis,03-09-2016,6.4,3078.0,3.0,2.0,2.0,208.0,inf,2013.0,Northern Metropolitan,2211,1110000


In [4]:
df.tail()

Unnamed: 0,Suburb,Rooms,Type,SellerG,Date,Distance,Postcode,Bedroom,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Regionname,Propertycount,Price
27109,Noble Park,3,h,C21,30-09-2017,22.7,3174.0,3.0,1.0,6.0,569.0,130.0,1959.0,South-Eastern Metropolitan,11806,627500
27110,Reservoir,3,u,RW,30-09-2017,12.0,3073.0,3.0,1.0,1.0,,105.0,1990.0,Northern Metropolitan,21650,475000
27111,Roxburgh Park,4,h,Raine,30-09-2017,20.6,3064.0,4.0,2.0,2.0,,225.0,1995.0,Northern Metropolitan,5833,591000
27112,Springvale South,3,h,Harcourts,30-09-2017,22.2,3172.0,3.0,2.0,1.0,544.0,,,South-Eastern Metropolitan,4054,780500
27113,Westmeadows,4,h,Barry,30-09-2017,16.5,3049.0,4.0,2.0,6.0,813.0,140.0,1960.0,Northern Metropolitan,2474,791000


**Understand the shape of the dataset**

In [5]:
# checking shape of the data
print("There are", df.shape[0], "rows and", df.shape[1], "columns")

There are 27114 rows and 16 columns


**Check the data types of the columns for the dataset**

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27114 entries, 0 to 27113
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Suburb         27114 non-null  object 
 1   Rooms          27114 non-null  int64  
 2   Type           27114 non-null  object 
 3   SellerG        27114 non-null  object 
 4   Date           27114 non-null  object 
 5   Distance       27113 non-null  float64
 6   Postcode       27113 non-null  float64
 7   Bedroom        20678 non-null  float64
 8   Bathroom       20672 non-null  float64
 9   Car            20297 non-null  float64
 10  Landsize       17873 non-null  float64
 11  BuildingArea   10543 non-null  object 
 12  YearBuilt      11985 non-null  float64
 13  Regionname     27114 non-null  object 
 14  Propertycount  27114 non-null  int64  
 15  Price          27114 non-null  int64  
dtypes: float64(7), int64(3), object(6)
memory usage: 3.3+ MB


* The dataset has 10 numeric columns and 6 columns stored as objects.

* The Date column is stored as an object and should be converted to a datetime type.

* The BuildingArea column is also stored as an object, although it holds area measurements and should be numeric.

* Several columns have missing values, which may need attention before analysis:

        Distance → 1 missing value
        Postcode → 1 missing value
        Bedroom → 6,436 missing values
        Bathroom → 6,442 missing values
        Car → 6,817 missing values
        Landsize → 9,241 missing values
        BuildingArea → 16,571 missing values
        YearBuilt → 15,129 missing values

The other columns (Suburb, Rooms, Type, SellerG, Date, Regionname, Propertycount, Price) have no missing values.

In [7]:
# Changing the date column to a datetime datatype.
df['Date'] = pd.to_datetime(df['Date'], format="%d-%m-%Y")
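
The explicit `format` matters because the dates are written day first (for example `03-09-2016`). A minimal sketch on two hypothetical strings in the same layout, showing how a month-first reading of the same text gives a different date:

```python
import pandas as pd

# Hypothetical sample in the same DD-MM-YYYY layout as the Date column
dates = pd.Series(["03-09-2016", "30-09-2017"])

# Explicit format: the first field is always read as the day
parsed = pd.to_datetime(dates, format="%d-%m-%Y")
print(parsed.dt.month.tolist())  # [9, 9]

# Reading the same string month-first silently yields 9 March instead of 3 September
wrong = pd.to_datetime("03-09-2016", format="%m-%d-%Y")
print(wrong.month)  # 3
```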

In [15]:
# let's see why BuildingArea column has object data type
df['BuildingArea'].unique()

array(['225', '82', 'inf', '263', '242', '251', '117', 'missing', '76',
       '399', '118', '103', '180', '123', '218', '129', '167', '154',
       '275', '121', nan, '125', '255', '75', '156', '240', '268', '108',
       '69', '140', '214', '253', '189', '215', '96', '104', '100', '313',
       '144', '93', '110', '70', '122', '51', '147', '113', '83', '56',
       '137', '85', '64', '175', '3558', '170', '265', '353', '138', '19',
       '116', '87', '74', '320', '300', '210', '120', '86', '97', '200',
       '106', '14', '161', '128', '185', '146', '133', '115', '143',
       '150', '195', '236', '276', '188', '179', '249', '141', '34', '73',
       '107', '84', '81', '207', '50', '264', '312', '235', '221', '183',
       '132', '160', '186', '78', '105', '145', '62', '220', '315', '181',
       '61', '112', '420', '226', '266', '410', '449', '356', '477',
       '250', '95', '190', '284', '247', '213', '209', '119', '111',
       '130', '348', '166', '44', '176', '98', '159', '79'

In [16]:
df['BuildingArea'].nunique()

655

In [17]:
df['BuildingArea'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 27114 entries, 0 to 27113
Series name: BuildingArea
Non-Null Count  Dtype 
--------------  ----- 
10543 non-null  object
dtypes: object(1)
memory usage: 212.0+ KB


* With 655 unique values, inspecting each one by eye to find the non-numeric entries is impractical. Instead, we’ll count the non-numeric entries in this column and review what they are.
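
One way to review what the non-numeric entries are, sketched on a hypothetical stand-in series (`col`) rather than the real column: keep the strings that are present in the original but turn into NaN under `pd.to_numeric`, and separately keep the strings that parse but are not finite numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df['BuildingArea']
col = pd.Series(["225", "82", "inf", "missing", np.nan, "105"], dtype=object)

converted = pd.to_numeric(col, errors="coerce")

# Strings that failed to parse: present in the original, NaN after conversion
unparseable = col[col.notna() & converted.isna()]
print(unparseable.value_counts())

# Strings that parsed but are not finite numbers (for example 'inf')
non_finite = col[converted.notna() & ~np.isfinite(converted)]
print(non_finite.value_counts())
```

On the real data, replacing `col` with `df['BuildingArea']` would list every offending string with its frequency.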

In [9]:
df['BuildingArea'].apply(type).value_counts()

BuildingArea
<class 'float'>    16571
<class 'str'>      10543
Name: count, dtype: int64

Here’s what the above means, step by step:

    <class 'float'> 16571
        There are 16,571 entries in the BuildingArea column that are of float type.
        These are the NaN values in this column; NaN is a special float that represents missing data.

    <class 'str'> 10543
        There are 10,543 entries that are strings.
        These entries may include numbers stored as text (e.g., "105") or invalid entries such as "missing" or "inf".
        Numeric operations cannot be performed on them until they are converted to proper numeric types.
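
A small sketch of why missing entries show up as `float` in the type count, using a hypothetical three-element series: `np.nan` is itself a Python float, so an object column that mixes strings and NaN reports two types.

```python
import numpy as np
import pandas as pd

# Hypothetical object column mixing text and a missing value
s = pd.Series(["105", np.nan, "missing"], dtype=object)

print(type(np.nan))                  # <class 'float'>
print(s.apply(type).value_counts())  # two str entries, one float entry
print(s.isna().sum())                # 1
```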

In [None]:
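# Convert to numeric; strings that cannot be parsed become NaN.
# Note: the string 'inf' parses as floating-point infinity, so it counts as numeric here.
# (downcast is omitted: it has no effect once the result contains NaN)
numeric_converted = pd.to_numeric(df['BuildingArea'], errors="coerce")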

# Count numeric vs non-numeric
num_count = numeric_converted.notna().sum()
non_num_count = numeric_converted.isna().sum()

print("Non-Numeric values:", non_num_count)
print("Numeric values:", num_count)

Non-Numeric values: 16580
Numeric values: 10534


Comparing the results of the two operations above, we can make the following deductions:

* The column contains 10,543 string (<class 'str'>) entries, but only 10,534 values survive the conversion. The 9 strings that do not survive hold non-numeric text (e.g., "missing"); pd.to_numeric cannot parse them and assigns NaN instead.
* As a result, the NaN count after pd.to_numeric (16,580) exceeds the float count from .apply(type) (<class 'float'>: 16,571) by the same 9 entries.
* The remaining 10,534 strings parse successfully. This includes "inf", which pd.to_numeric reads as floating-point infinity rather than rejecting, so a successful parse does not guarantee a usable measurement.

Hence:
* Using pd.to_numeric(errors='coerce') gives a more accurate picture of which entries can be parsed as numbers than the type check does, but infinite values still have to be handled separately.
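
The comparison can be checked with simple arithmetic on the counts printed in the outputs above. The variable names below are illustrative only:

```python
# Counts reported in the outputs above
str_entries = 10543    # <class 'str'> from .apply(type).value_counts()
float_entries = 16571  # <class 'float'>, i.e. NaN entries
parsed_ok = 10534      # notna() after pd.to_numeric
parsed_nan = 16580     # isna() after pd.to_numeric

# Strings that pd.to_numeric could not convert
print(str_entries - parsed_ok)     # 9
print(parsed_nan - float_entries)  # 9

# Both views account for every row in the dataset
print(str_entries + float_entries, parsed_ok + parsed_nan)  # 27114 27114
```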

In [23]:
# Replace the placeholder strings 'missing' and 'inf' with NaN
df['BuildingArea'] = df['BuildingArea'].replace(['missing', 'inf'], np.nan)

# Change the dtype to float
df['BuildingArea'] = df['BuildingArea'].astype(float)
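
An alternative, shown here only as a sketch on a hypothetical series (`raw`) and not the approach used in this notebook, is to coerce everything in one step and then turn infinities into NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the raw BuildingArea strings
raw = pd.Series(["225", "82", "inf", "missing", np.nan], dtype=object)

# Unparseable text becomes NaN; parsed infinities are then replaced with NaN too
cleaned = pd.to_numeric(raw, errors="coerce").replace([np.inf, -np.inf], np.nan)

print(cleaned.dtype)          # float64
print(cleaned.notna().sum())  # 2
```

The trade-off is that coercion silently discards any unexpected string, whereas the explicit replace followed by `astype(float)` raises an error if an unknown placeholder is still present.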

In [None]:
# Re-count numeric vs. missing values after the cleanup
numeric_converted = pd.to_numeric(df['BuildingArea'], errors="coerce")

num_count = numeric_converted.notna().sum()
non_num_count = numeric_converted.isna().sum()

print("Non-Numeric values:", non_num_count)
print("Numeric values:", num_count)

Non-Numeric values: 16585
Numeric values: 10529


In [24]:
# let's check the data type of columns again
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27114 entries, 0 to 27113
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Suburb         27114 non-null  object        
 1   Rooms          27114 non-null  int64         
 2   Type           27114 non-null  object        
 3   SellerG        27114 non-null  object        
 4   Date           27114 non-null  datetime64[ns]
 5   Distance       27113 non-null  float64       
 6   Postcode       27113 non-null  float64       
 7   Bedroom        20678 non-null  float64       
 8   Bathroom       20672 non-null  float64       
 9   Car            20297 non-null  float64       
 10  Landsize       17873 non-null  float64       
 11  BuildingArea   10529 non-null  float64       
 12  YearBuilt      11985 non-null  float64       
 13  Regionname     27114 non-null  object        
 14  Propertycount  27114 non-null  int64         
 15  Price          27114 non-null  int64         

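* The number of non-null float elements in BuildingArea now matches the 10,529 numeric values from our pd.to_numeric conversion, so the data is consistent. This is 5 fewer than the earlier 10,534 because the "inf" entries, which previously parsed as infinity, are now NaN.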

* We see that the data types of Date and BuildingArea columns have been fixed.
* There are 11 numerical columns, 4 object type columns, and 1 date time column in the data.
* Some columns have fewer non-null entries than the 27,114 rows in the dataset, which indicates the presence of missing values.

**Checking for missing values in the data**

In [26]:
df.isnull().sum()

Suburb               0
Rooms                0
Type                 0
SellerG              0
Date                 0
Distance             1
Postcode             1
Bedroom           6436
Bathroom          6442
Car               6817
Landsize          9241
BuildingArea     16585
YearBuilt        15129
Regionname           0
Propertycount        0
Price                0
dtype: int64
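
Raw counts are easier to compare as a share of all rows. A minimal sketch on a hypothetical miniature frame (`demo`), using `isnull().mean()` to get the fraction missing per column:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with a few gaps
demo = pd.DataFrame({
    "Rooms": [3, 2, 4, 3],
    "Car": [1.0, np.nan, 2.0, 1.0],
    "BuildingArea": [225.0, np.nan, np.nan, 140.0],
})

# Missing count and percentage per column, most affected columns first
summary = pd.DataFrame({
    "missing": demo.isnull().sum(),
    "percent": demo.isnull().mean() * 100,
}).sort_values("missing", ascending=False)
print(summary)
```

Applied to `df`, the same two expressions would show at a glance which columns are missing in most rows.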