# PREDICTING REAL ESTATE PRICES OF RESIDENTIAL PROPERTIES

The selling price of a residential property depends on a number of factors, including the property's age, the availability of local amenities, and its location.

This dataset consists of real estate sales transactions; the goal is to predict a property's price per unit from its features. The price per unit in this data is based on a unit measurement of 3.3 square meters.

> **Citation**: Data Source
>
> *Yeh, I. C., & Hsu, T. K. (2018). Building real estate valuation models with comparative approach through case-based reasoning. Applied Soft Computing, 65, 260-271.*
>
> It was obtained from the UCI dataset repository (Dua, D. and Graff, C. (2019). [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science).

#### Some EDA

## 1) DATASET DESCRIPTION

The data consists of the following variables:

* **transaction_date** - the transaction date (for example, 2013.250=2013 March, 2013.500=2013 June, etc.)
* **house_age** - the house age (in years)
* **transit_distance** - the distance to the nearest light rail station (in meters)
* **local_convenience_stores** - the number of convenience stores within walking distance
* **latitude** - the geographic coordinate, latitude
* **longitude** - the geographic coordinate, longitude
* **price_per_unit** - the house price per unit area (3.3 square meters)
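
The decimal-year encoding of `transaction_date` can be unpacked into a year and a month. The sketch below is a minimal illustration, assuming the fractional part is the month divided by 12, as the two examples above suggest (2013.250 is March, 2013.500 is June). The function name `decode_transaction_date` is hypothetical, and treating a zero fraction as December of the previous year is an inference from that convention, not something the data source states.

```python
def decode_transaction_date(value):
    """Split a decimal-year code such as 2013.250 into (year, month).

    Assumes fraction = month / 12, inferred from the examples
    2013.250 -> March and 2013.500 -> June.
    """
    value = float(value)  # accepts floats or strings like "2012.917"
    year = int(value)
    month = round((value - year) * 12)
    if month == 0:
        # Assumption: a zero fraction means December of the previous year
        year, month = year - 1, 12
    return year, month


# Codes taken from the first rows of the dataset
for code in (2012.917, 2013.583, 2013.5, 2012.833):
    print(code, decode_transaction_date(code))
```

Applied to the DataFrame, something like `data['transaction_date'].map(decode_transaction_date)` would yield one `(year, month)` tuple per row, which could then be split into two categorical columns.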


## 2) LOAD THE DATASET

In [2]:
import pandas as pd
import numpy as np
import altair as alt

In [25]:

# load the training dataset
data = pd.read_csv('real_estate.csv', parse_dates = ['transaction_date'])
data.head()

| | transaction_date | house_age | transit_distance | local_convenience_stores | latitude | longitude | price_per_unit |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2012.917 | 32.0 | 84.87882 | 10 | 24.98298 | 121.54024 | 37.9 |
| 1 | 2012.917 | 19.5 | 306.5947 | 9 | 24.98034 | 121.53951 | 42.2 |
| 2 | 2013.583 | 13.3 | 561.9845 | 5 | 24.98746 | 121.54391 | 47.3 |
| 3 | 2013.5 | 13.3 | 561.9845 | 5 | 24.98746 | 121.54391 | 54.8 |
| 4 | 2012.833 | 5.0 | 390.5684 | 5 | 24.97937 | 121.54245 | 43.1 |


## 3) EXPLORE THE DATASET

In [26]:
data.isnull().sum()

transaction_date            0
house_age                   0
transit_distance            0
local_convenience_stores    0
latitude                    0
longitude                   0
price_per_unit              0
dtype: int64

In [27]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 414 entries, 0 to 413
Data columns (total 7 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   transaction_date          414 non-null    object 
 1   house_age                 414 non-null    float64
 2   transit_distance          414 non-null    float64
 3   local_convenience_stores  414 non-null    int64  
 4   latitude                  414 non-null    float64
 5   longitude                 414 non-null    float64
 6   price_per_unit            414 non-null    float64
dtypes: float64(5), int64(1), object(1)
memory usage: 22.8+ KB


In [70]:
#Analysing the distribution of numeric features and Response column
numeric_features = ['house_age', 'transit_distance']
data[numeric_features + ['price_per_unit']].describe()

| | house_age | transit_distance | price_per_unit |
| --- | --- | --- | --- |
| count | 414.0 | 414.0 | 414.0 |
| mean | 17.71256 | 1083.885689 | 37.980193 |
| std | 11.392485 | 1262.109595 | 13.606488 |
| min | 0.0 | 23.38284 | 7.6 |
| 25% | 9.025 | 289.3248 | 27.7 |
| 50% | 16.1 | 492.2313 | 38.45 |
| 75% | 28.15 | 1454.279 | 46.6 |
| max | 43.8 | 6488.021 | 117.5 |


## SELF NOTE: WORK ON CONVERTING LAT AND LONG TO GEOHASH BINS

In [71]:


# Set a bigger default font size for plots
def bigger_font():
    return {
        'config': {
            'view': {'continuousWidth': 400, 'continuousHeight': 300},
            'legend': {'symbolSize': 30, 'titleFontSize': 14, 'labelFontSize': 14}, 
            'axis': {'titleFontSize': 15, 'labelFontSize': 12},
            'encoding': {'x': {'scale': {'zero': False}}}}}
alt.themes.register('bigger_font', bigger_font)
alt.themes.enable('bigger_font')



ThemeRegistry.enable('bigger_font')

## 4) INITIAL THOUGHTS

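* transaction_date - The date is encoded as a year followed by a fractional month code (for example, 2013.250). It should not be treated as a continuous numeric column; treating it as categorical makes more sense.
* house_age - The mean house age is 17.7 years, and the median is 16.1 years.
* transit_distance - The transit distance is highly variable and right-skewed: its standard deviation (about 1262 m) exceeds its mean (about 1084 m), and the maximum (about 6488 m) is far above the median (about 492 m).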



In [72]:
response = data['price_per_unit']
alt.Chart(data).transform_density(
    'price_per_unit',
    as_=['price_per_unit','density'],
).mark_area(opacity = 0.6, color = 'red').encode(x='price_per_unit:Q', y = 'density:Q')

In [51]:
bar = alt.Chart(data).mark_bar().encode(
    alt.X("price_per_unit:Q", bin=alt.Bin(step = 2)),
    y='count()',
)

rule = alt.Chart(data).mark_rule(color='red').encode(
    x = 'mean(price_per_unit)',
    size = alt.value(4))

bar + rule

The response variable has a roughly symmetric, bell-shaped distribution, with the mean (37.98) and median (38.45) nearly equal. The histogram shows a few outliers, around 80 and near 120 (the maximum is 117.5). We can guess that the majority of house prices per unit area (3.3 square meters) fall in the range of roughly 10 to 65, with a few outliers. The outliers probably belong to properties in very expensive neighborhoods.
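
One common way to check the outlier observation is Tukey's 1.5 × IQR rule, the same rule a box plot uses for its whiskers. The sketch below applies it to the quartiles reported by `describe()` above (25% = 27.7, 75% = 46.6). The helper name `iqr_fences` is hypothetical, and the 1.5 multiplier is a convention rather than a choice made in this notebook.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) outlier fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles of price_per_unit taken from the describe() output
lower, upper = iqr_fences(27.7, 46.6)
print(f"lower fence: {lower:.2f}, upper fence: {upper:.2f}")

# The observed minimum (7.6) and maximum (117.5) compared with the fences
print("min is an outlier:", 7.6 < lower)
print("max is an outlier:", 117.5 > upper)
```

Under this rule the upper fence sits at about 74.95, so the values near 80 and 120 would be flagged, while nothing at the low end would be. On the full data, the quartiles could be taken from `data['price_per_unit'].quantile([0.25, 0.75])` instead of being typed in.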

## Let's analyze the distributions of the numeric features

In [None]:
# Plot a histogram for each numeric feature
import matplotlib.pyplot as plt  # not imported in the cells above

for col in numeric_features:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    feature = data[col]  # use the real estate DataFrame loaded earlier
    feature.hist(bins=100, ax=ax)
    # Dashed lines mark the mean (magenta) and the median (cyan)
    ax.axvline(feature.mean(), color='magenta', linestyle='dashed', linewidth=2)
    ax.axvline(feature.median(), color='cyan', linestyle='dashed', linewidth=2)
    ax.set_title(col)
plt.show()

In [55]:
numeric_features

['house_age', 'transit_distance', 'local_convenience_stores']

In [63]:
bar = alt.Chart(data).mark_bar().encode(
    alt.X("house_age:Q", bin=alt.Bin(step = 0.8)),
    y='count()',
)

rule = alt.Chart(data).mark_rule(color='red').encode(
    x = 'mean(house_age)',
    size = alt.value(4))

bar + rule
    

In [68]:

bar = alt.Chart(data).mark_bar().encode(
    alt.X("transit_distance:Q", bin=alt.Bin(maxbins=40)),  # bin the distances so the bars form a histogram
    y='count()',
)

rule = alt.Chart(data).mark_rule(color='red').encode(
    x = 'mean(transit_distance)',
    size = alt.value(4))

bar + rule

In [75]:
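## Correlation of features
# Pairwise correlations of the numeric columns, reshaped to long form for a heatmap
corr = (data.select_dtypes('number').corr().rename_axis('feature_1').reset_index()
        .melt(id_vars='feature_1', var_name='feature_2', value_name='correlation'))
alt.Chart(corr).mark_rect().encode(
    x='feature_1:N', y='feature_2:N', color='correlation:Q')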

In [None]:
Training the 