In [None]:
import os

def scale_input_data(scale_factor):
  file_bases = ['./input/melb_data']
  for file_base in file_bases:
    import pandas as pd
    import shutil
    if scale_factor == 1.0:
      shutil.copyfile(file_base + '.csv', file_base + '.scaled.csv')
      continue
    df_to_scale = pd.read_csv(file_base + '.csv')
    new_num_rows = int(scale_factor * len(df_to_scale))
    if scale_factor <= 1.0:
      df_to_scale = df_to_scale.iloc[:new_num_rows]
    else:
      while len(df_to_scale) < new_num_rows:
        df_to_scale = pd.concat([df_to_scale, df_to_scale[:min(new_num_rows - len(df_to_scale), len(df_to_scale))]])
    df_to_scale.to_csv(file_base + '.scaled.csv', index=False)

if 'INPUT_SCALE_FACTOR' in os.environ:
  scale_input_data(float(os.environ['INPUT_SCALE_FACTOR']))

###### Hello, welcome to my notebook

###### In this notebook, I will do an exploratory analysis of the Melbourne Housing data. The main aim is to see how different features are related to the price of a house.

###### Your feedback is appreciated. It will help me to improve my notebook.

###### Let's begin!

## 1. Loading Modules

The first step is to load the modules that will help with the data analysis.

In [1]:
# importing libraries
import numpy as np # linear algebra
# import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
exec(os.environ['IREWR_IMPORTS'])
from datetime import datetime # datetime processing

# Data visualisation
# ALEX: remove plotting
# import matplotlib.pyplot as plt
# import seaborn as sns

# setting path of the dataset
# ALEX: remove path printing
# import os
# for dirname, _, filenames in os.walk('/kaggle/input'):
#     for filename in filenames:
#         print(os.path.join(dirname, filename))


## 2. Loading the Dataset

In [2]:
melb_house = pd.read_csv("./input/melb_data.scaled.csv")

# checking the columns
melb_house.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,...,1.0,1.0,202.0,,,Yarra,-37.7996,144.9984,Northern Metropolitan,4019.0
1,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0
2,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan,4019.0
3,Abbotsford,40 Federation La,3,h,850000.0,PI,Biggin,4/03/2017,2.5,3067.0,...,2.0,1.0,94.0,,,Yarra,-37.7969,144.9969,Northern Metropolitan,4019.0
4,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,4/06/2016,2.5,3067.0,...,1.0,2.0,120.0,142.0,2014.0,Yarra,-37.8072,144.9941,Northern Metropolitan,4019.0


## 3. Sneak Peek of the Dataset

To get a better understanding of the data, we will do some preliminary analysis.

Checking the shape and info of the dataset is a basic step that makes the picture clearer. The shape tells us the dimensions of the dataset, and the info gives the data type of each column. We will also look at the number of unique values of each attribute, which will help in further analysis.

In [3]:
melb_house.shape

(13580, 21)

In [4]:
melb_house.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13580 entries, 0 to 13579
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Suburb         13580 non-null  object 
 1   Address        13580 non-null  object 
 2   Rooms          13580 non-null  int64  
 3   Type           13580 non-null  object 
 4   Price          13580 non-null  float64
 5   Method         13580 non-null  object 
 6   SellerG        13580 non-null  object 
 7   Date           13580 non-null  object 
 8   Distance       13580 non-null  float64
 9   Postcode       13580 non-null  float64
 10  Bedroom2       13580 non-null  float64
 11  Bathroom       13580 non-null  float64
 12  Car            13518 non-null  float64
 13  Landsize       13580 non-null  float64
 14  BuildingArea   7130 non-null   float64
 15  YearBuilt      8205 non-null   float64
 16  CouncilArea    12211 non-null  object 
 17  Lattitude      13580 non-null  float64
 18  Longti

In [5]:
# checking for unique entries

unique_val = []
for i in melb_house.columns:
    u = melb_house[i].nunique()
    unique_val.append(u)
    
pd.DataFrame({"No. of unique values": unique_val}, index=melb_house.columns)

Unnamed: 0,No. of unique values
Suburb,314
Address,13378
Rooms,9
Type,3
Price,2204
Method,5
SellerG,268
Date,58
Distance,202
Postcode,198


* The shape of the dataframe shows that there are 13580 observations and 21 features.
* The dataset has different types of data: object, int and float.
* The dataset requires conversion of features like Date and YearBuilt. These attributes should be in datetime format, but they are stored as object and float respectively.
* There are some missing values in the dataset, which should be replaced before further analysis.
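The three checks above (data type, missing values, unique values) can also be collected in one table. The following is a minimal sketch, not part of the original analysis: `summarize` is a hypothetical helper and `sample` is a tiny made-up frame standing in for `melb_house`.

```python
import pandas as pd

# Tiny stand-in frame (made-up rows) so the sketch runs on its own;
# in the notebook you would pass melb_house instead.
sample = pd.DataFrame({
    "Suburb": ["Abbotsford", "Abbotsford", "Richmond"],
    "Rooms": [2, 3, 3],
    "Price": [1480000.0, 1035000.0, 850000.0],
    "YearBuilt": [None, 1900.0, 1900.0],
})

def summarize(df):
    """Return dtype, missing count and unique count for every column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isnull().sum(),
        "unique": df.nunique(),  # nunique ignores missing values by default
    })

summary = summarize(sample)
print(summary)
```

Calling `summarize(melb_house)` would give one row per column, which makes it easy to spot columns that are both sparse and stored in an unexpected type.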

## 4. Data Cleaning

Before further analysis we will make a copy of our data, so that the original data is not changed during the cleaning process.

In [6]:
# Working dataset
dataset = melb_house.copy()

### 4.1 Checking missing values

In [7]:
# plot of missing value
# ALEX: remove plotting
# plt.figure(figsize=(9,5))
# sns.heatmap(dataset.isnull(),yticklabels=False, cbar=False, cmap="Paired");
# plt.title("Heatmap of Missing Values");
_ = dataset.isnull()

We can see there are some columns with missing values.

* The 'Car' column has only a few missing values.
* The features 'BuildingArea' and 'YearBuilt' have the most missing values.
* In column 'CouncilArea' the missing values are clustered in the last observations.
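Since the heatmap is not drawn here, the claim about where the missing values sit can be checked numerically. This is a small sketch under assumptions: `missing_positions` is a hypothetical helper and `sample` holds made-up rows, not the real data.

```python
import pandas as pd

# Made-up rows; in the notebook you would pass `dataset` instead.
sample = pd.DataFrame({
    "CouncilArea": ["Yarra", "Yarra", "Moreland", None, None],
    "Car": [1.0, None, 2.0, 1.0, 0.0],
})

def missing_positions(df, column):
    """Return (first position, last position, count) of nulls in `column`,
    or None when the column has no nulls."""
    mask = df[column].isnull().to_numpy()
    positions = mask.nonzero()[0]
    if len(positions) == 0:
        return None
    return int(positions[0]), int(positions[-1]), int(mask.sum())

print(missing_positions(sample, "CouncilArea"))
print(missing_positions(sample, "Car"))
```

If the first and last positions are close to the end of the frame and their distance is close to the count, the nulls form one block at the end.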


In [8]:
# Features with missing values
miss = dataset.isnull().sum().sort_values(ascending = False).head(5)
miss_per = (miss/len(dataset))*100

# Percentage of missing values
pd.DataFrame({'No. missing values': miss, '% of missing data': miss_per.values})

Unnamed: 0,No. missing values,% of missing data
BuildingArea,6450,47.496318
YearBuilt,5375,39.580265
CouncilArea,1369,10.081001
Car,62,0.456554
Suburb,0,0.0


### 4.2 Handling missing values

##### 4.2.1 Car

In [9]:
dataset['Car'].value_counts()

2.0     5591
1.0     5509
0.0     1026
3.0      748
4.0      506
5.0       63
6.0       54
8.0        9
7.0        8
10.0       3
9.0        1
Name: Car, dtype: int64

In [10]:
# Filling null values (assign back instead of calling inplace on a column selection)
dataset['Car'] = dataset['Car'].fillna(0)

# confirmation after filling the null values
print("Null values before replacement :", melb_house['Car'].isnull().sum())
print("Null values after replacement :", dataset['Car'].isnull().sum())

Null values before replacement : 62
Null values after replacement : 0


There are some houses with zero car spaces, so we can replace the null values with 0.

##### 4.2.2 Council Area

In [11]:
dataset['CouncilArea'].value_counts()

Moreland             1163
Boroondara           1160
Moonee Valley         997
Darebin               934
Glen Eira             848
Stonnington           719
Maribyrnong           692
Yarra                 647
Port Phillip          628
Banyule               594
Bayside               489
Melbourne             470
Hobsons Bay           434
Brimbank              424
Monash                333
Manningham            311
Whitehorse            304
Kingston              207
Whittlesea            167
Hume                  164
Wyndham                86
Knox                   80
Maroondah              80
Melton                 66
Frankston              53
Greater Dandenong      52
Casey                  38
Nillumbik              36
Yarra Ranges           18
Cardinia                8
Macedon Ranges          7
Unavailable             1
Moorabool               1
Name: CouncilArea, dtype: int64

In [12]:
# Filling null values (assign back instead of calling inplace on a column selection)
dataset['CouncilArea'] = dataset['CouncilArea'].fillna('Unavailable')

# confirmation after filling the null values
print("Null values before replacement :", melb_house['CouncilArea'].isnull().sum())
print("Null values after replacement :", dataset['CouncilArea'].isnull().sum())

Null values before replacement : 1369
Null values after replacement : 0


We can see that the 'CouncilArea' column already has a category "Unavailable", so I filled the null values with "Unavailable".

##### 4.2.3 Year Built

In [13]:
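# Filling null values (cast to object first, since the column will mix years and the 'Unknown' label)
dataset['YearBuilt'] = dataset['YearBuilt'].astype(object).fillna("Unknown")

# confirmation after filling the null values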
print("Null values before replacement :", melb_house['YearBuilt'].isnull().sum())
print("Null values after replacement :", dataset['YearBuilt'].isnull().sum())

Null values before replacement : 5375
Null values after replacement : 0


A null value in 'YearBuilt' means we don't have any information about the year, so we cannot fill it with a numerical value. Therefore I am substituting it with 'Unknown'.
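One side effect of the 'Unknown' label is that the column becomes an object column, so it no longer counts as numerical data. As a possible alternative (an assumption on my side, not what this notebook does), the years can be kept numeric with pandas' nullable integer type plus a flag column. The column names `YearBuilt_known` and `YearBuilt_int` are made up for this sketch.

```python
import pandas as pd

# Made-up values; in the notebook this would be dataset['YearBuilt'] before filling.
sample = pd.DataFrame({"YearBuilt": [1900.0, None, 2014.0]})

# Flag rows where the year is known, then store the year as a nullable integer.
sample["YearBuilt_known"] = sample["YearBuilt"].notnull()
sample["YearBuilt_int"] = sample["YearBuilt"].astype("Int64")

print(sample.dtypes)
print(sample["YearBuilt_int"].min())
```

With this layout the missing years stay missing (`<NA>`), and numeric operations such as `min` or `median` simply skip them.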

##### 4.2.4 Building Area

In [14]:
# ALEX: remove plotting
# plt.figure(figsize = (5, 5))
# sns.distplot(dataset['BuildingArea']);
_ = dataset['BuildingArea']

In [15]:
# Filling null values (assign back instead of calling inplace on a column selection)
dataset['BuildingArea'] = dataset['BuildingArea'].fillna(0)

# confirmation after filling the null values
print("Null values before replacement :", melb_house['BuildingArea'].isnull().sum())
print("Null values after replacement :", dataset['BuildingArea'].isnull().sum())

Null values before replacement : 6450
Null values after replacement : 0


There are some properties with no building, so we replace the missing values with 0.

## 5. Exploratory Data Analysis

### 5.1 Univariate Analysis

### Target Variable - Price

In [16]:
# log transformation of price
dataset['Price_trans'] = np.log(dataset['Price'])

# plot of price
# ALEX: remove plotting
# plt.figure(figsize=(10, 5))
# plt.subplots_adjust(wspace=0.5)
# plt.suptitle("Distribution of Price", fontsize=14)

# plt.subplot(1,2,1)
# p1 = sns.kdeplot(dataset['Price']);
# p1.title.set_text("Before Transfromation")

# plt.subplot(1,2,2)
# p2 = sns.kdeplot(dataset['Price_trans']);
# p2.title.set_text("After log Transformation")
_ = dataset['Price']
_ = dataset['Price_trans']

We can see that the distribution of price is positively skewed, so a log transformation of the target variable brings it closer to a normal distribution. The two graphs above show this. We will use the transformed values for further analysis.
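Without the plots, the effect of the log transformation can be checked with the sample skewness. The sketch below uses synthetic log-normal "prices" (an assumption for illustration, not the Melbourne data); in the notebook the same two `skew()` calls could be run on `dataset['Price']` and `dataset['Price_trans']`.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (log-normal draws), not the real data.
rng = np.random.default_rng(0)
price = pd.Series(np.exp(rng.normal(loc=13.8, scale=0.5, size=5000)), name="Price")
price_trans = np.log(price)

# Skewness near 0 suggests a roughly symmetric distribution;
# a clearly positive value means a long right tail.
skew_before = price.skew()
skew_after = price_trans.skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```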

### Numerical Data 

In [17]:
# Grouping the numerical data
num =  dataset.select_dtypes(exclude="object")
num = num.drop(['Price'], axis=1)

# Distributions of numerical data
# ALEX: remove plotting
# plt.figure(figsize=(15, 13))
# plt.subplots_adjust(hspace=0.4, wspace=0.3)

# j=1

for i in list(num.columns):
# ALEX: remove plotting
#     plt.subplot(4,3,j)
#     sns.distplot(dataset[i])
#     j+=1
    _ = dataset[i]
    
# ALEX: remove plotting
# plt.suptitle("Distribution of Numerical Data", fontsize=15);


Observations from the distributions:
* 'Rooms', 'Bedroom2', 'Bathroom' and 'Car' have discrete distributions.
* 'Distance', 'Postcode', 'Landsize', 'BuildingArea', 'Lattitude', 'Longtitude' and 'Propertycount' have continuous distributions.
* Among the continuous features, some have a distribution close to normal ('Lattitude', 'Longtitude' and 'Propertycount') and others have a skewed distribution ('Distance', 'Postcode', 'Landsize', 'BuildingArea').

### Categorical Data

##### Sale Type and Method 

In [18]:

# ALEX: remove plotting
# plt.figure(figsize=(15, 5))
# plt.subplots_adjust(wspace=0.5)

# plt.subplot(1,2,1)
# ax1 = sns.countplot(x=dataset['Type'], palette='Accent');
# ax1.title.set_text("Plot of House Type")

# plt.subplot(1,2,2)
# ax2= sns.countplot(x=dataset['Method'], palette='Accent');
# ax2.title.set_text("Plot of Method of Selling")
_ = dataset['Type']
_ = dataset['Method']
    

Observations from above plot:
* The plot for house type shows that most properties are type h (house, cottage, villa, semi, terrace) and very few are type t (townhouse).
* Most of the properties are sold without any auction.
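The count plots can be replaced by relative frequencies. A minimal sketch on made-up rows; in the notebook the same calls could be applied to `dataset['Type']` and `dataset['Method']`.

```python
import pandas as pd

# Made-up rows that only reuse category codes seen in the data preview.
sample = pd.DataFrame({
    "Type": ["h", "h", "h", "u", "t", "h"],
    "Method": ["S", "S", "SP", "PI", "S", "VB"],
})

# normalize=True returns shares instead of raw counts, sorted from most to least common.
type_share = sample["Type"].value_counts(normalize=True)
method_share = sample["Method"].value_counts(normalize=True)
print(type_share)
print(method_share)
```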

##### Regions of Melbourne

In [19]:
# ALEX: remove plotting
# plt.figure(figsize=(10, 5))
# sns.countplot(y = dataset['Regionname']);
# plt.ylabel("Region Name", fontsize=12);
# plt.xlabel("Count", fontsize=12);
_ = dataset['Regionname']

* From the above plot we observe that Melbourne can be broadly divided into two areas: **Victoria** and **Metropolitan**.
* Most of the houses are in the Metropolitan area.
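The two broad areas can be counted directly from the last word of each region name. This is a sketch under the assumption that every region name ends in either "Metropolitan" or "Victoria"; the names below are illustrative samples.

```python
import pandas as pd

# Illustrative region names; in the notebook this would be dataset['Regionname'].
regions = pd.Series([
    "Northern Metropolitan", "Southern Metropolitan",
    "Eastern Victoria", "Northern Metropolitan",
], name="Regionname")

# Keep only the last word of each name, then count the two groups.
region_group = regions.str.split().str[-1]
counts = region_group.value_counts()
print(counts)
```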

##### Seller

In [20]:
# checking for top 10 seller
# ALEX: remove plotting
# dataset['SellerG'].value_counts().head(10).plot(kind='bar', color='brown');
_ = dataset['SellerG'].value_counts().head(10)

# plot for top seller
# ALEX: remove plotting
# plt.title("Top 10 Estate Agents", fontsize=14);

The above plot shows the top ten sellers in Melbourne. Nelson sold the highest number of properties.

##### Date

In [21]:
# converting date into datetime format
# the dates look day-first (e.g. 3/12/2016, 4/02/2016), so parse them that way
dataset['Date'] = pd.to_datetime(dataset['Date'], dayfirst=True)
year = dataset['Date'].map(lambda x: datetime.strftime(x, '%Y'))
dataset['year'] = year
month = dataset['Date'].map(lambda x: datetime.strftime(x, '%b'))
dataset['month'] = month

# plot of each month
# ALEX: remove plotting
# plt.figure(figsize = (12, 4))
# sns.countplot(x=dataset['month'], hue=dataset['year'], palette='Set1');
_ = dataset['month']
_ = dataset['year']


The plot shows the number of sales for each month per year. We have data for two years, 2016 and 2017.
* The number of sales increased in 2017 in every month except Oct and Nov.
* There are more sales from March to August.
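The monthly counts per year behind these observations can be tabulated with a cross-tabulation. A minimal sketch on five dates taken from the data preview, assuming the dates are written day-first:

```python
import pandas as pd

# Dates copied from the first rows of the dataset; assumed to be day/month/year.
dates = pd.Series(["3/12/2016", "4/02/2016", "4/03/2017", "4/03/2017", "4/06/2016"])
parsed = pd.to_datetime(dates, dayfirst=True)

# Rows are calendar months, columns are years, cells are the number of sales.
sales = pd.crosstab(parsed.dt.month, parsed.dt.year,
                    rownames=["month"], colnames=["year"])
print(sales)
```

Using the month number instead of the abbreviated name keeps the rows in calendar order.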

### 5.2 Bivariate Analysis

###### Now, we will see how each attribute affects our target feature

##### Rooms vs Price 

In [22]:
# ALEX: remove plotting
# plt.figure(figsize=(10, 5))
# sns.boxplot(x = 'Rooms', y = "Price_trans", data=dataset);
# plt.ylabel("Price");

###### A house with more rooms should have a higher price, and the same is observed in the above plot. We can see that as the number of rooms increases, the price of the house also increases.
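The boxplot comparison can be approximated with a grouped summary. The sketch below is an illustration on made-up rows; `price_by` is a hypothetical helper, and it could be reused for other columns such as 'Type', 'Method' or 'Car'.

```python
import numpy as np
import pandas as pd

# Made-up rows; in the notebook you would pass `dataset` instead.
sample = pd.DataFrame({
    "Rooms": [2, 2, 3, 3, 4],
    "Price": [1035000.0, 1480000.0, 850000.0, 1465000.0, 1600000.0],
})
sample["Price_trans"] = np.log(sample["Price"])

def price_by(df, column):
    """Median and count of the log price for each value of `column`."""
    return df.groupby(column)["Price_trans"].agg(["median", "count"])

by_rooms = price_by(sample, "Rooms")
print(by_rooms)
```

The count column matters: a median based on a handful of houses (for example very large room counts) should be read with caution.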

##### Sale Type and Method vs Price

In [23]:
# ALEX: remove plotting
# plt.figure(figsize=(15, 5))
# plt.subplots_adjust(wspace=0.3)

# plt.subplot(1,2,1)
# sns.boxplot(x = 'Type', y = "Price_trans", data=dataset, palette='Set3');
# plt.ylabel("Price");

# plt.subplot(1,2,2)
# sns.boxplot(x = 'Method', y = "Price_trans", data=dataset, palette='Set3');
# plt.ylabel("Price");

###### * Cottages, villas and townhouses are priced higher than units and duplex houses.
###### * The method of sale does not have any significant effect on the sale price, as we can see from the boxplot.

##### Car Vs Price

In [24]:
# ALEX: remove plotting
# plt.figure(figsize=(15, 5))
# sns.boxenplot(x = 'Car', y = "Price_trans", data=dataset);
# plt.ylabel("Price");

###### * Houses with two or more car spaces have a slightly higher price.
###### * Houses with one car space have a lower price than houses with zero car spaces, which is unusual.
###### * Overall, the number of car spaces does not have much effect on the sale price, as the variation in the boxplot is very small.


##### Landsize Vs Price

In [25]:
# ALEX: remove plotting
# sns.lmplot(x="Landsize", y='Price_trans', data=dataset);
# plt.ylabel("Price");
# plt.xticks(rotation=15);

Price has a linear relation with land size: the larger the land area, the higher the price.
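The strength of this relation can be quantified with a correlation coefficient. The sketch below uses made-up values (one very large plot is included on purpose) and reports both Pearson's coefficient and Spearman's rank correlation, which is less sensitive to extreme land sizes.

```python
import pandas as pd

# Made-up values; in the notebook these would be dataset['Landsize'] and dataset['Price_trans'].
sample = pd.DataFrame({
    "Landsize": [94.0, 120.0, 134.0, 156.0, 202.0, 5000.0],
    "Price_trans": [13.65, 14.29, 14.20, 13.85, 14.21, 14.30],
})

# Pearson measures linear association on the raw values.
pearson = sample["Landsize"].corr(sample["Price_trans"])
# Spearman's rank correlation, computed as Pearson on the ranks.
spearman = sample["Landsize"].rank().corr(sample["Price_trans"].rank())
print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

A large gap between the two values is a hint that a few outliers drive the fitted line.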

##### Distance Vs Price

In [26]:
# ALEX: remove plotting
# sns.scatterplot(x='Distance', y='Price_trans', data=dataset);
# plt.ylabel("Price");

Properties close to commercial and business areas should have a higher price. From the above graph we can observe the same: there is a negative correlation between the price and the distance of the property from the commercial areas. Most of the houses are within 20 distance units of the commercial center.

We have seen that Melbourne is broadly divided into two areas, Metropolitan and Victoria. So the properties in the Metropolitan area should be closer to the business center and thus have a higher price than properties in the Victoria area.

###### We will check our interpretation.

In [27]:
# Dataset of Metropolitan area
rm = dataset[dataset['Regionname'].map(lambda x: 'Metropolitan' in x)]

# Dataset of Victoria area
rv = dataset[dataset['Regionname'].map(lambda x: 'Victoria' in x)]

# plots of both region
# ALEX: remove plotting
# plt.figure(figsize=(15, 5))

# plt.subplot(1,2,1)
# ar1 = sns.scatterplot(x='Distance', y='Price_trans', data=rm, color='orange');
# plt.ylabel("Price");
# ar1.title.set_text("Metropolitan");

# plt.subplot(1,2,2)
# ar2 = sns.scatterplot(x='Distance', y='Price_trans', data=rv, color="green");
# plt.ylabel("Price");
# ar2.title.set_text('Victoria');

The above plot confirms our interpretation. Most of the houses are in the Metropolitan area, within 35 distance units of the commercial area, and have a high price, whereas properties in the Victoria area have a lower price and a greater distance.
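The same comparison can be summarised as medians per area. A minimal sketch on made-up rows, assuming every region name contains either "Metropolitan" or "Victoria":

```python
import pandas as pd

# Made-up rows; in the notebook you would use `dataset` instead.
sample = pd.DataFrame({
    "Regionname": ["Northern Metropolitan", "Southern Metropolitan",
                   "Eastern Victoria", "Northern Victoria"],
    "Distance": [2.5, 7.0, 35.0, 40.0],
    "Price": [1480000.0, 1200000.0, 600000.0, 550000.0],
})

# Label each row by area, then compare the median distance and price.
is_metro = sample["Regionname"].str.contains("Metropolitan")
group = is_metro.map({True: "Metropolitan", False: "Victoria"})
summary = sample.groupby(group)[["Distance", "Price"]].median()
print(summary)
```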

## 6. Conclusion

We observe that: 
###### * The target feature (price) is positively related to the number of rooms in the house and negatively related to the distance from commercial and business centers.

###### * Melbourne is divided into Metropolitan and Victoria areas. Prices in the Metropolitan area are higher.

###### * The method by which a property is sold does not have much effect on the price of the house.

###### * The type of house is related to the price. Units and duplex houses are cheaper than villas or townhouses.

More analysis needs to be done for further confirmation.

### **Thank you for reading my notebook. If you like it, please upvote :)**