## 1. Loading Modules

The first step is to load the modules that will help with the data analysis.

In [1]:
# importing libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from datetime import datetime # datetime processing

# listing the files available in the input directory
import os
for dirname, _, filenames in os.walk('../input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


../input/.DS_Store
../input/melb_data.csv


## 2. Loading the Dataset

In [2]:
melb_house = pd.read_csv("../input/melb_data.csv")

# checking the columns
melb_house.head()

| | Suburb | Address | Rooms | Type | Price | Method | SellerG | Date | Distance | Postcode | ... | Bathroom | Car | Landsize | BuildingArea | YearBuilt | CouncilArea | Lattitude | Longtitude | Regionname | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abbotsford | 85 Turner St | 2 | h | 1480000.0 | S | Biggin | 3/12/2016 | 2.5 | 3067.0 | ... | 1.0 | 1.0 | 202.0 | NaN | NaN | Yarra | -37.7996 | 144.9984 | Northern Metropolitan | 4019.0 |
| 1 | Abbotsford | 25 Bloomburg St | 2 | h | 1035000.0 | S | Biggin | 4/02/2016 | 2.5 | 3067.0 | ... | 1.0 | 0.0 | 156.0 | 79.0 | 1900.0 | Yarra | -37.8079 | 144.9934 | Northern Metropolitan | 4019.0 |
| 2 | Abbotsford | 5 Charles St | 3 | h | 1465000.0 | SP | Biggin | 4/03/2017 | 2.5 | 3067.0 | ... | 2.0 | 0.0 | 134.0 | 150.0 | 1900.0 | Yarra | -37.8093 | 144.9944 | Northern Metropolitan | 4019.0 |
| 3 | Abbotsford | 40 Federation La | 3 | h | 850000.0 | PI | Biggin | 4/03/2017 | 2.5 | 3067.0 | ... | 2.0 | 1.0 | 94.0 | NaN | NaN | Yarra | -37.7969 | 144.9969 | Northern Metropolitan | 4019.0 |
| 4 | Abbotsford | 55a Park St | 4 | h | 1600000.0 | VB | Nelson | 4/06/2016 | 2.5 | 3067.0 | ... | 1.0 | 2.0 | 120.0 | 142.0 | 2014.0 | Yarra | -37.8072 | 144.9941 | Northern Metropolitan | 4019.0 |


## 3. Sneak Peek at the Dataset

To get a better understanding of the data, we will do some preliminary analysis.

Checking the shape and info of the dataset is a basic step that makes the picture clearer. The shape gives the dimensions of the dataset, and info gives the data type of each column. We will also look at the number of unique values of each attribute, which will help in further analysis.

In [3]:
melb_house.shape

(13580, 21)

In [4]:
melb_house.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13580 entries, 0 to 13579
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Suburb         13580 non-null  object 
 1   Address        13580 non-null  object 
 2   Rooms          13580 non-null  int64  
 3   Type           13580 non-null  object 
 4   Price          13580 non-null  float64
 5   Method         13580 non-null  object 
 6   SellerG        13580 non-null  object 
 7   Date           13580 non-null  object 
 8   Distance       13580 non-null  float64
 9   Postcode       13580 non-null  float64
 10  Bedroom2       13580 non-null  float64
 11  Bathroom       13580 non-null  float64
 12  Car            13518 non-null  float64
 13  Landsize       13580 non-null  float64
 14  BuildingArea   7130 non-null   float64
 15  YearBuilt      8205 non-null   float64
 16  CouncilArea    12211 non-null  object 
 17  Lattitude      13580 non-null  float64
 18  Longti

In [5]:
# checking for unique entries
melb_house.nunique().to_frame("No. of unique values")

| | No. of unique values |
|---|---|
| Suburb | 314 |
| Address | 13378 |
| Rooms | 9 |
| Type | 3 |
| Price | 2204 |
| Method | 5 |
| SellerG | 268 |
| Date | 58 |
| Distance | 202 |
| Postcode | 198 |


* The shape of the dataframe shows that there are 13580 observations and 21 features.
* The dataset contains several data types: object, int and float.
* Features such as Date and YearBuilt need conversion. They represent dates but are stored as object and float respectively, rather than in datetime format.
* There are some missing values in the dataset, which should be replaced before further analysis.
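The dtype issue noted for YearBuilt can be illustrated on a toy series. This is a minimal sketch, not the notebook's own cleaning step: it assumes that only the year matters and that pandas' nullable `Int64` dtype is an acceptable target. The sample values are taken from the `head()` output above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the YearBuilt column (values from the head() output)
year_built = pd.Series([np.nan, 1900.0, 1900.0, np.nan, 2014.0], name="YearBuilt")

# Nullable integer dtype: years become whole numbers, gaps become <NA>
year_int = year_built.astype("Int64")

print(year_int.dtype)         # Int64
print(year_int.isna().sum())  # 2
```

Keeping the gaps as `<NA>` leaves the column numeric, so year arithmetic still works on the known values.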

## 4. Data Cleaning

Before further analysis, we will make a copy of the data so that the original stays unchanged during the cleaning process.

In [6]:
# Working dataset
dataset = melb_house.copy()

### 4.1 Checking missing values

In [7]:
# Features with missing values
miss = dataset.isnull().sum().sort_values(ascending=False).head(5)
miss_per = (miss / len(dataset)) * 100

# Number and percentage of missing values
pd.DataFrame({'No. of missing values': miss, '% of missing data': miss_per})

| | No. of missing values | % of missing data |
|---|---|---|
| BuildingArea | 6450 | 47.496318 |
| YearBuilt | 5375 | 39.580265 |
| CouncilArea | 1369 | 10.081001 |
| Car | 62 | 0.456554 |
| Suburb | 0 | 0.0 |
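The same summary can be wrapped in a small helper that reports only the columns that actually have gaps, instead of a fixed top five. The function name `missing_summary` and the toy frame are hypothetical, introduced here for illustration only.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, for columns that have any."""
    counts = df.isnull().sum()
    counts = counts[counts > 0].sort_values(ascending=False)
    return pd.DataFrame({
        "No. of missing values": counts,
        "% of missing data": (counts / len(df) * 100).round(2),
    })

# Toy frame (made-up values) with gaps in two of its three columns
toy = pd.DataFrame({
    "Car": [1.0, np.nan, 2.0, 0.0],
    "BuildingArea": [np.nan, np.nan, 150.0, np.nan],
    "Suburb": ["A", "B", "C", "D"],
})

summary = missing_summary(toy)
print(summary)
```

Filtering on `counts > 0` keeps fully populated columns such as Suburb out of the table.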


### 4.2 Handling missing values

##### 4.2.1 Car

In [8]:
dataset['Car'].value_counts()

2.0     5591
1.0     5509
0.0     1026
3.0      748
4.0      506
5.0       63
6.0       54
8.0        9
7.0        8
10.0       3
9.0        1
Name: Car, dtype: int64

In [9]:
# Filling null values (assigning back avoids chained in-place modification)
dataset['Car'] = dataset['Car'].fillna(0)

# confirmation after filling the null values
print("Null values before replacement :", melb_house['Car'].isnull().sum())
print("Null values after replacement :", dataset['Car'].isnull().sum())

Null values before replacement : 62
Null values after replacement : 0


Some houses have zero car spaces, so we can replace the null values with 0.
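Filling with 0 treats "not recorded" as "no car space", which is a modelling choice. The sketch below, on made-up values, compares it with a median fill and keeps a flag for the rows that were originally missing. The column names `Car_missing`, `Car_zero` and `Car_median` are hypothetical, and the median fill is an alternative shown for comparison, not the notebook's method.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Car": [2.0, 1.0, np.nan, 0.0, 2.0, np.nan]})

# Record which rows were originally missing before filling them
toy["Car_missing"] = toy["Car"].isnull()

# Choice made in this notebook: a missing count becomes zero car spaces
toy["Car_zero"] = toy["Car"].fillna(0)

# Alternative (an assumption, not the notebook's choice): fill with the median
toy["Car_median"] = toy["Car"].fillna(toy["Car"].median())

print(toy)
```

With only 62 missing rows out of 13580, either choice is likely to have a small effect, but the flag makes the filled rows easy to find later.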

##### 4.2.2 Council Area

In [10]:
dataset['CouncilArea'].value_counts()

Moreland             1163
Boroondara           1160
Moonee Valley         997
Darebin               934
Glen Eira             848
Stonnington           719
Maribyrnong           692
Yarra                 647
Port Phillip          628
Banyule               594
Bayside               489
Melbourne             470
Hobsons Bay           434
Brimbank              424
Monash                333
Manningham            311
Whitehorse            304
Kingston              207
Whittlesea            167
Hume                  164
Wyndham                86
Knox                   80
Maroondah              80
Melton                 66
Frankston              53
Greater Dandenong      52
Casey                  38
Nillumbik              36
Yarra Ranges           18
Cardinia                8
Macedon Ranges          7
Unavailable             1
Moorabool               1
Name: CouncilArea, dtype: int64

In [11]:
# Filling the null values (assigning back avoids chained in-place modification)
dataset['CouncilArea'] = dataset['CouncilArea'].fillna('Unavailable')

# confirmation after filling the null values
print("Null values before replacement :", melb_house['CouncilArea'].isnull().sum())
print("Null values after replacement :", dataset['CouncilArea'].isnull().sum())

Null values before replacement : 1369
Null values after replacement : 0


The CouncilArea column already contains a category "Unavailable", so I filled the null values with "Unavailable".
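Because "Unavailable" is an existing label, the filled rows merge into that category instead of creating a new one. A minimal sketch on a toy series (made-up rows) shows the effect on `value_counts()`. In the real data the count would be expected to grow from 1 to 1 + 1369 = 1370.

```python
import numpy as np
import pandas as pd

council = pd.Series(
    ["Yarra", "Unavailable", np.nan, "Moreland", np.nan],
    name="CouncilArea",
)

# value_counts() ignores NaN by default
before = council.value_counts()
after = council.fillna("Unavailable").value_counts()

print(before["Unavailable"])  # 1
print(after["Unavailable"])   # 3
```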

##### 4.2.3 Year Built

In [12]:
# Filling the null values (note: the column becomes object dtype after this)
dataset['YearBuilt'] = dataset['YearBuilt'].fillna("Unknown")

# confirmation after filling the null values
print("Null values before replacement :", melb_house['YearBuilt'].isnull().sum())
print("Null values after replacement :", dataset['YearBuilt'].isnull().sum())

Null values before replacement : 5375
Null values after replacement : 0


A null value in 'YearBuilt' means we don't have any information about the year, so we cannot fill it with a numerical value. Therefore I am substituting it with 'Unknown'.
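One side effect worth checking: mixing the string 'Unknown' into a float column turns it into an object column, so a later `select_dtypes(exclude="object")` no longer treats it as numerical. The sketch below uses a toy frame. The `Rooms` values come from the `head()` output, and the frame itself is illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "YearBuilt": [1900.0, np.nan, 2014.0],
    "Rooms": [2, 3, 4],
})

# Same substitution as above, applied to the toy frame
filled = toy.assign(YearBuilt=toy["YearBuilt"].fillna("Unknown"))

print(filled["YearBuilt"].dtype)                                # object
print(filled.select_dtypes(exclude="object").columns.tolist())  # ['Rooms']
```

If YearBuilt is needed as a number later, leaving the gaps as NaN is one alternative to consider.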

##### 4.2.4 Building Area

In [13]:
# Filling the null values (assigning back avoids chained in-place modification)
dataset['BuildingArea'] = dataset['BuildingArea'].fillna(0)

# confirmation after filling the null values
print("Null values before replacement :", melb_house['BuildingArea'].isnull().sum())
print("Null values after replacement :", dataset['BuildingArea'].isnull().sum())

Null values before replacement : 6450
Null values after replacement : 0


Some properties have no building, so we replace the missing values with 0.
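With about 47% of BuildingArea missing, the zero fill has a visible effect on summary statistics. A quick sketch on a toy series (the non-null values are taken from the `head()` output) shows how the mean shifts once the zeros are counted.

```python
import numpy as np
import pandas as pd

area = pd.Series([79.0, np.nan, 150.0, np.nan, 142.0], name="BuildingArea")

# NaN is skipped by default, so only the three known areas are averaged
mean_known = area.mean()

# After filling, the two zeros are counted as real observations
mean_filled = area.fillna(0).mean()

print(round(mean_known, 2))   # 123.67
print(round(mean_filled, 2))  # 74.2
```

Any later analysis of BuildingArea should keep in mind that a 0 may mean "unknown" rather than "no building".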

## 5. Exploratory Data Analysis

### 5.1 Univariate Analysis

### Target Variable - Price

In [14]:
# log transformation of price
dataset['Price_trans'] = np.log(dataset['Price'])
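The notebook does not state why the price is log-transformed. A common reason, assumed here, is to compress the long right tail of house prices. The sketch below uses the five prices from the `head()` output and shows that the transformation is reversible with `np.exp`.

```python
import numpy as np
import pandas as pd

# Prices taken from the head() output above
price = pd.Series(
    [1480000.0, 1035000.0, 1465000.0, 850000.0, 1600000.0], name="Price"
)

price_log = np.log(price)

# np.exp undoes the transformation, so results can be reported in dollars
restored = np.exp(price_log)
print(np.allclose(restored, price))  # True

# On the log scale, the spread depends only on the max/min price ratio
print(price_log.max() - price_log.min())
```

`np.log` needs strictly positive inputs. Price has no missing values according to `info()`, so no extra guard is added here.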

### Numerical Data 

In [15]:
# Grouping the numerical data (Price is dropped; Price_trans is kept)
num = dataset.select_dtypes(exclude="object")
num = num.drop(columns=['Price'])

### Categorical Data

##### Sale Type and Method 

In [17]:
# converting date into datetime format
# (assuming the dates are day/month/year, e.g. 3/12/2016 is 3 December 2016)
dataset['Date'] = pd.to_datetime(dataset['Date'], dayfirst=True)
dataset['year'] = dataset['Date'].dt.strftime('%Y')
dataset['month'] = dataset['Date'].dt.strftime('%b')
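Dates such as 3/12/2016 are ambiguous: they can be read as 3 December or as 12 March. The sketch below parses three dates from the `head()` output both ways. It assumes the dataset uses day/month/year, which is the usual Australian convention but is not stated in the notebook.

```python
import pandas as pd

# Date strings taken from the head() output above
dates = pd.Series(["3/12/2016", "4/02/2016", "4/03/2017"])

# An explicit format removes any guessing about the field order
day_first = pd.to_datetime(dates, format="%d/%m/%Y")
month_first = pd.to_datetime(dates, format="%m/%d/%Y")

print(day_first.dt.month.tolist())    # [12, 2, 3]
print(month_first.dt.month.tolist())  # [3, 4, 4]
```

The two readings give different months for every row, so the `month` feature depends directly on which order is used.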

###### We will check our interpretation.

In [18]:
# Dataset of Metropolitan area
rm = dataset[dataset['Regionname'].str.contains('Metropolitan', regex=False)]

# Dataset of Victoria area
rv = dataset[dataset['Regionname'].str.contains('Victoria', regex=False)]
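When splitting on a substring of the region name, it is worth checking that every row lands in exactly one subset. A minimal sketch on a toy frame follows. Apart from 'Northern Metropolitan' and the first price, which appear in the `head()` output, the region names and prices are illustrative assumptions, and `rm_toy` and `rv_toy` are hypothetical names.

```python
import pandas as pd

toy = pd.DataFrame({
    "Regionname": ["Northern Metropolitan", "Eastern Victoria",
                   "Southern Metropolitan", "Western Victoria"],
    "Price": [1480000.0, 650000.0, 1200000.0, 400000.0],
})

# Boolean masks for the two groups (plain substring match, no regex)
is_metro = toy["Regionname"].str.contains("Metropolitan", regex=False)
is_vic = toy["Regionname"].str.contains("Victoria", regex=False)

rm_toy = toy[is_metro]
rv_toy = toy[is_vic]

# Exclusive-or is True only when a row belongs to exactly one group
print((is_metro ^ is_vic).all())  # True
print(len(rm_toy), len(rv_toy))   # 2 2
```

If the check prints `False` on the real data, some region name matches both substrings or neither, and the split needs another rule.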