Description

Analysis of ‘Housing Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/peterkmutua/housing-dataset on 21 November 2021.

--- Dataset description provided by original source is as follows ---

ACTIVITIES

Follow the process below to develop a model that can be used by real estate companies and real estate agents to predict the price of a house.

Business Understanding
- Conduct a literature review to understand the factors that determine house prices globally and locally.
- Based on the dataset provided, formulate a business question to be answered through the analysis.

Data Understanding
- The data in the dataset provided was collected through web scraping. Conduct further reading to understand the process of web scraping, how it is conducted (methods and tools), and any ethical challenges related to it.

Data Preparation
- Conduct a detailed exploratory analysis of the dataset.
- Prepare the dataset for modeling.
- Identify the technique relevant for answering the business question stated above.
- Ensure that the dataset meets all the assumptions of the technique identified.
- Conduct preliminary feature selection by identifying the set of features that are likely to provide a model with good performance.

Modeling
- Split the dataset into two parts: a training set and a validation set. With justifications, decide on the ratio of the training set to the validation set.
- Generate the required model.

Evaluation
- Interpret the model in terms of its goodness of fit in predicting the price of houses.
- Assume that the model is not good enough, and then conduct further feature engineering or use any other model tuning strategies at your disposal to generate two additional instances of the model.
- Settle on the best model instance and then re-interpret it.

Implementation
- Think of how the model can be implemented and used by real estate firms and agents.
- Identify possible challenges of applying the model.
- Recommend how the model can be improved in future.

--- Original source retains full ownership of the source dataset ---
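
The Modeling and Evaluation activities above ask for a training/validation split and a goodness-of-fit measure. The following is a minimal sketch of both, not the method used in this notebook. The helper names `split_train_validation` and `r_squared`, the 80/20 ratio, the seed, and the one-feature straight-line model are all assumptions made for illustration. The nine `Rooms`/`Price` pairs are the rows visible in the notebook's DataFrame preview.

```python
import numpy as np
import pandas as pd

def split_train_validation(df, train_fraction=0.8, seed=0):
    """Randomly split df into a training part and a validation part (hypothetical helper)."""
    train = df.sample(frac=train_fraction, random_state=seed)
    validation = df.drop(train.index)
    return train, validation

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Rooms and Price for the nine rows shown in the DataFrame preview
sample = pd.DataFrame({
    "Rooms": [2, 2, 3, 3, 4, 4, 3, 3, 4],
    "Price": [1480000, 1035000, 1465000, 850000, 1600000,
              1245000, 1031000, 1170000, 2500000],
})

train, validation = split_train_validation(sample)

# A one-feature least-squares line stands in for "the required model" (assumption)
slope, intercept = np.polyfit(train["Rooms"], train["Price"], deg=1)
predicted = slope * validation["Rooms"] + intercept

print(len(train), len(validation))
print("validation R^2:", r_squared(validation["Price"], predicted))
```

With only nine rows the score itself is not meaningful. The point is the shape of the workflow: fit on the training part only, then score on rows the model has not seen.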

In [78]:
import pandas as pd
housing = pd.read_csv('melbourne_housing.csv')
housing

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,85 Turner St,2,h,1480000,S,Biggin,3/12/2016,2.5,3067,...,1,1.0,202,,,Yarra,-37.79960,144.99840,Northern Metropolitan,4019
1,Abbotsford,25 Bloomburg St,2,h,1035000,S,Biggin,4/2/2016,2.5,3067,...,1,0.0,156,79.0,1900.0,Yarra,-37.80790,144.99340,Northern Metropolitan,4019
2,Abbotsford,5 Charles St,3,h,1465000,SP,Biggin,4/3/2017,2.5,3067,...,2,0.0,134,150.0,1900.0,Yarra,-37.80930,144.99440,Northern Metropolitan,4019
3,Abbotsford,40 Federation La,3,h,850000,PI,Biggin,4/3/2017,2.5,3067,...,2,1.0,94,,,Yarra,-37.79690,144.99690,Northern Metropolitan,4019
4,Abbotsford,55a Park St,4,h,1600000,VB,Nelson,4/6/2016,2.5,3067,...,1,2.0,120,142.0,2014.0,Yarra,-37.80720,144.99410,Northern Metropolitan,4019
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13575,Wheelers Hill,12 Strada Cr,4,h,1245000,S,Barry,26/08/2017,16.7,3150,...,2,2.0,652,,1981.0,,-37.90562,145.16761,South-Eastern Metropolitan,7392
13576,Williamstown,77 Merrett Dr,3,h,1031000,SP,Williams,26/08/2017,6.8,3016,...,2,2.0,333,133.0,1995.0,,-37.85927,144.87904,Western Metropolitan,6380
13577,Williamstown,83 Power St,3,h,1170000,S,Raine,26/08/2017,6.8,3016,...,2,4.0,436,,1997.0,,-37.85274,144.88738,Western Metropolitan,6380
13578,Williamstown,96 Verdon St,4,h,2500000,PI,Sweeney,26/08/2017,6.8,3016,...,1,5.0,866,157.0,1920.0,,-37.85908,144.89299,Western Metropolitan,6380


In [79]:
housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13580 entries, 0 to 13579
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Suburb         13580 non-null  object 
 1   Address        13580 non-null  object 
 2   Rooms          13580 non-null  int64  
 3   Type           13580 non-null  object 
 4   Price          13580 non-null  int64  
 5   Method         13580 non-null  object 
 6   SellerG        13580 non-null  object 
 7   Date           13580 non-null  object 
 8   Distance       13580 non-null  float64
 9   Postcode       13580 non-null  int64  
 10  Bedroom2       13580 non-null  int64  
 11  Bathroom       13580 non-null  int64  
 12  Car            13518 non-null  float64
 13  Landsize       13580 non-null  int64  
 14  BuildingArea   7130 non-null   float64
 15  YearBuilt      8205 non-null   float64
 16  CouncilArea    12211 non-null  object 
 17  Lattitude      13580 non-null  float64
 18  Longti
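
The non-null counts above imply missing values in four columns out of 13580 rows: Car (13580 - 13518 = 62), BuildingArea (13580 - 7130 = 6450), YearBuilt (13580 - 8205 = 5375), and CouncilArea (13580 - 12211 = 1369). A small sketch of how such a summary could be computed directly is shown below. The helper name `missing_summary` is hypothetical, and the toy frame only reproduces the first five preview rows of three columns.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return the count and fraction of missing values for columns that have any (hypothetical helper)."""
    counts = df.isna().sum()
    summary = pd.DataFrame({"missing": counts, "fraction": counts / len(df)})
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# First five preview rows of three columns, used as a stand-in for the full frame
preview = pd.DataFrame({
    "Car": [1.0, 0.0, 0.0, 1.0, 2.0],
    "BuildingArea": [np.nan, 79.0, 150.0, np.nan, 142.0],
    "YearBuilt": [np.nan, 1900.0, 1900.0, np.nan, 2014.0],
})

summary = missing_summary(preview)
print(summary)
```

On the real data the same call would be `missing_summary(housing)`.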

In [80]:
housing.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

In [81]:
#printing categorical columns

def print_categorical_columns(dataframe):
    """
    Print the names of the categorical (object or category dtype) columns in a pandas DataFrame.
    """
    # Identify categorical columns
    categorical_columns = dataframe.select_dtypes(include=['category', 'object']).columns.tolist()

    # Print categorical columns
    if len(categorical_columns) > 0:
        print('Categorical columns:')
        for column in categorical_columns:
            print(column)
    else:
        print('No categorical columns found.')

print_categorical_columns(housing)


Categorical columns:
Suburb
Address
Type
Method
SellerG
Date
CouncilArea
Regionname
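
Date appears in this list only because it is stored as text. The preview values (for example 26/08/2017) suggest a day/month/year layout, so one option is to parse the column into real datetimes before modeling. The sketch below assumes that every value follows that layout, which the notebook does not verify. The names `sale_year` and `sale_month` are illustrative.

```python
import pandas as pd

# Date strings taken from the DataFrame preview
dates = pd.Series(["3/12/2016", "4/2/2016", "4/3/2017", "26/08/2017"])

# Assumption: all dates are day/month/year
parsed = pd.to_datetime(dates, format="%d/%m/%Y")

# Numeric features that a model can use directly
sale_year = parsed.dt.year
sale_month = parsed.dt.month

print(parsed.dt.strftime("%Y-%m-%d").tolist())
```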


In [83]:
#housing['Suburb'] = housing['Suburb'].astype('category')
#housing.insert(1, 'Suburb_code', housing['Suburb'].cat.codes)
#housing.insert(1, 'Suburb_code', pd.Categorical(housing['Suburb']).codes)

#housing

In [84]:
columns = ['Suburb', 'Address', 'Type', 'Method', 'SellerG', 'Date', 'CouncilArea', 'Regionname']

def convert_to_codes(df, columns):
    """Add a '<col>_code' integer column next to each listed column (modifies df in place)."""
    for col in columns:
        # Convert the column to categorical data type
        df[col] = df[col].astype('category')

        # Insert the integer category codes right after the original column.
        # Missing values receive the code -1.
        df.insert(df.columns.get_loc(col) + 1, col + '_code', df[col].cat.codes)

convert_to_codes(housing, columns)
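
Category codes are assigned in sorted order of the labels, and pandas uses -1 for missing values. The toy example below illustrates both points and shows one way to keep a code-to-label lookup. The label "Other" and the name `code_to_label` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value; "Other" is a placeholder label
toy = pd.DataFrame({"CouncilArea": ["Yarra", "Other", np.nan, "Yarra"]})

as_category = toy["CouncilArea"].astype("category")
toy["CouncilArea_code"] = as_category.cat.codes

# Lookup from integer code back to the original label (missing values are not included)
code_to_label = dict(enumerate(as_category.cat.categories))

print(toy)
print(code_to_label)
```

Because -1 is an ordinary integer, a model would treat it as just another value, so it may be worth handling missing entries explicitly before fitting.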

In [86]:
housing

Unnamed: 0,Suburb,Suburb_code,Address,Address_code,Rooms,Type,Type_code,Price,Method,Method_code,...,Landsize,BuildingArea,YearBuilt,CouncilArea,CouncilArea_code,Lattitude,Longtitude,Regionname,Regionname_code,Propertycount
0,Abbotsford,0,85 Turner St,12794,2,h,0,1480000,S,1,...,202,,,Yarra,31,-37.79960,144.99840,Northern Metropolitan,2,4019
1,Abbotsford,0,25 Bloomburg St,5943,2,h,0,1035000,S,1,...,156,79.0,1900.0,Yarra,31,-37.80790,144.99340,Northern Metropolitan,2,4019
2,Abbotsford,0,5 Charles St,9814,3,h,0,1465000,SP,3,...,134,150.0,1900.0,Yarra,31,-37.80930,144.99440,Northern Metropolitan,2,4019
3,Abbotsford,0,40 Federation La,9004,3,h,0,850000,PI,0,...,94,,,Yarra,31,-37.79690,144.99690,Northern Metropolitan,2,4019
4,Abbotsford,0,55a Park St,10589,4,h,0,1600000,VB,4,...,120,142.0,2014.0,Yarra,31,-37.80720,144.99410,Northern Metropolitan,2,4019
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13575,Wheelers Hill,302,12 Strada Cr,1991,4,h,0,1245000,S,1,...,652,,1981.0,,-1,-37.90562,145.16761,South-Eastern Metropolitan,4,7392
13576,Williamstown,305,77 Merrett Dr,12234,3,h,0,1031000,SP,3,...,333,133.0,1995.0,,-1,-37.85927,144.87904,Western Metropolitan,6,6380
13577,Williamstown,305,83 Power St,12745,3,h,0,1170000,S,1,...,436,,1997.0,,-1,-37.85274,144.88738,Western Metropolitan,6,6380
13578,Williamstown,305,96 Verdon St,13311,4,h,0,2500000,PI,0,...,866,157.0,1920.0,,-1,-37.85908,144.89299,Western Metropolitan,6,6380


In [87]:
housing['Type'].unique()

['h', 'u', 't']
Categories (3, object): ['h', 'u', 't']
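
Integer codes impose an arbitrary order on the three Type values. For a nominal column with this few categories, one-hot encoding is a common alternative. The sketch below is a suggestion, not a step taken in this notebook; the `Type_` prefix and the toy frame are assumptions.

```python
import pandas as pd

# Toy frame using the three Type values reported above
toy = pd.DataFrame({"Type": ["h", "u", "t", "h"]})

# One indicator column per category; dtype=int gives 0/1 instead of booleans
dummies = pd.get_dummies(toy["Type"], prefix="Type", dtype=int)
encoded = pd.concat([toy, dummies], axis=1)

print(encoded)
```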