### Data Set Information:

The historical market data set for real estate valuation was collected from Sindian Dist., New Taipei City, Taiwan. "Real estate valuation" is a regression problem. The data set was randomly split into a training set (2/3 of the samples) and a testing set (1/3 of the samples).
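
A random 2/3 to 1/3 split like the one described above can be sketched with pandas. This is only an illustration: the small frame `df`, its column names, and the seed `random_state=0` are assumptions made for the example, not the procedure used to build the original data set.

```python
import pandas as pd

# Hypothetical stand-in for the real data: 9 rows with made-up values.
df = pd.DataFrame({"x": range(9), "y": [v * 2.0 for v in range(9)]})

# Draw 2/3 of the rows at random for training. The seed is arbitrary and
# only makes the split repeatable.
train_df = df.sample(frac=2 / 3, random_state=0)

# Every row that was not drawn for training goes to the testing set.
test_df = df.drop(train_df.index)

print(len(train_df), len(test_df))
```

Dropping the sampled index from the full frame guarantees that the two sets do not overlap.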


#### Attribute Information:

The inputs are as follows:
* X1=the transaction date (for example, 2013.250=2013 March, 2013.500=2013 June, etc.) 
* X2=the house age (unit: year) 
* X3=the distance to the nearest MRT station (unit: meter) 
* X4=the number of convenience stores in the living circle on foot (integer) 
* X5=the geographic coordinate, latitude. (unit: degree) 
* X6=the geographic coordinate, longitude. (unit: degree) 
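
The fractional-year encoding of X1 can be decoded into a year and a month. The sketch below assumes the fraction counts twelfths of a year, which matches the two examples given for X1 (.250 is March, .500 is June). The function name `decode_transaction_date` is hypothetical, and the treatment of a .000 fraction as December of the previous year is an assumption that the data description does not confirm.

```python
def decode_transaction_date(x1):
    """Split a fractional-year value such as 2013.250 into (year, month).

    Assumption: the fraction counts twelfths of a year, so .250 maps to
    month 3 and .500 maps to month 6. A fraction of .000 is treated as
    December of the previous year; adjust this if the data says otherwise.
    """
    year = int(x1)
    month = round((x1 - year) * 12)
    if month == 0:
        return year - 1, 12
    return year, month


for value in (2013.250, 2013.500, 2012.916667):
    print(value, decode_transaction_date(value))
```

Decoding the value by hand avoids passing the raw float to a datetime parser, which would interpret it as an offset from the epoch rather than as a calendar year.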

#### The output is as follows
* Y=the house price per unit area (10000 New Taiwan Dollar/Ping, where Ping is a local unit, 1 Ping = 3.3 square meters) 
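
To read Y in more familiar units, the definition above can be turned into a small conversion. This is a sketch using only the two constants stated in the attribute description; the function name `price_per_square_meter` is hypothetical.

```python
NTD_PER_UNIT = 10_000  # Y is expressed in units of 10000 New Taiwan Dollars
SQ_M_PER_PING = 3.3    # 1 Ping = 3.3 square meters, as stated above


def price_per_square_meter(y):
    """Convert Y (10000 NTD per Ping) to NTD per square meter."""
    return y * NTD_PER_UNIT / SQ_M_PER_PING


# Y value of the first record in the data set.
print(round(price_per_square_meter(37.9), 2))
```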


#### Relevant Papers:

* Yeh, I. C., & Hsu, T. K. (2018). Building real estate valuation models with comparative approach through case-based reasoning. Applied Soft Computing, 65, 260-271.





### Loading modules.

In [3]:
import numpy as np               # scientific computing library.
import pandas as pd              # data structures for tabular data.
import matplotlib.pyplot as plt  # plotting library.
%matplotlib inline

#### Let's start by reading our data for exploration.

In [4]:
real_estate_df = pd.read_excel("Real estate valuation data set.xlsx",sheet_name="sh1")
real_estate_df.head()

Unnamed: 0,No,X1 transaction date,X2 house age,X3 distance to the nearest MRT station,X4 number of convenience stores,X5 latitude,X6 longitude,Y house price of unit area
0,1,2012.916667,32.0,84.87882,10,24.98298,121.54024,37.9
1,2,2012.916667,19.5,306.5947,9,24.98034,121.53951,42.2
2,3,2013.583333,13.3,561.9845,5,24.98746,121.54391,47.3
3,4,2013.5,13.3,561.9845,5,24.98746,121.54391,54.8
4,5,2012.833333,5.0,390.5684,5,24.97937,121.54245,43.1


### Changing the column names in the DataFrame.

In [8]:
# new column names, in the same order as the existing columns
new_names = ['no','date','h_age','nearst_MRT',
             'n_conv_stores','latitude','longtiude','price']
# assign the new column names
real_estate_df.columns = new_names
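
Assigning a list to `.columns` relies on the list being in exactly the same order as the existing columns. As an alternative sketch, `DataFrame.rename` with an explicit mapping ties each old name to its new name. The one-row frame `df` below is a hypothetical stand-in that uses three of the original column names.

```python
import pandas as pd

# Hypothetical one-row frame using three of the original column names.
df = pd.DataFrame({
    "No": [1],
    "X2 house age": [32.0],
    "Y house price of unit area": [37.9],
})

# Map each old name to its new name, so column order does not matter.
name_map = {
    "No": "no",
    "X2 house age": "h_age",
    "Y house price of unit area": "price",
}
df = df.rename(columns=name_map)

print(list(df.columns))
```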

In [9]:
# check the new column names
real_estate_df.head()

Unnamed: 0,no,date,h_age,nearst_MRT,n_conv_stores,latitude,longtiude,price
0,1,2012.916667,32.0,84.87882,10,24.98298,121.54024,37.9
1,2,2012.916667,19.5,306.5947,9,24.98034,121.53951,42.2
2,3,2013.583333,13.3,561.9845,5,24.98746,121.54391,47.3
3,4,2013.5,13.3,561.9845,5,24.98746,121.54391,54.8
4,5,2012.833333,5.0,390.5684,5,24.97937,121.54245,43.1


In [10]:
# display the number of records or observations.
print(len(real_estate_df))

414


In [11]:
# display the last 10 rows in the dataset.
real_estate_df.tail(10)

Unnamed: 0,no,date,h_age,nearst_MRT,n_conv_stores,latitude,longtiude,price
404,405,2013.333333,16.4,289.3248,5,24.98203,121.54348,41.2
405,406,2012.666667,23.0,130.9945,6,24.95663,121.53765,37.2
406,407,2013.166667,1.9,372.1386,7,24.97293,121.54026,40.5
407,408,2013.0,5.2,2408.993,0,24.95505,121.55964,22.3
408,409,2013.416667,18.5,2175.744,3,24.9633,121.51243,28.1
409,410,2013.0,13.7,4082.015,0,24.94155,121.50381,15.4
410,411,2012.666667,5.6,90.45606,9,24.97433,121.5431,50.0
411,412,2013.25,18.8,390.9696,7,24.97923,121.53986,40.6
412,413,2013.0,8.1,104.8101,5,24.96674,121.54067,52.5
413,414,2013.5,6.5,90.45606,9,24.97433,121.5431,63.9


Unnamed: 0,no,date,h_age,nearst_MRT,n_conv_stores,latitude,longtiude,price
0,1,1970-01-01 00:00:00.000002012,32.0,84.87882,10,24.98298,121.54024,37.9
1,2,1970-01-01 00:00:00.000002012,19.5,306.5947,9,24.98034,121.53951,42.2
2,3,1970-01-01 00:00:00.000002013,13.3,561.9845,5,24.98746,121.54391,47.3
3,4,1970-01-01 00:00:00.000002013,13.3,561.9845,5,24.98746,121.54391,54.8
4,5,1970-01-01 00:00:00.000002012,5.0,390.5684,5,24.97937,121.54245,43.1


All features in our dataset are numerical.

Index(['No', 'X1 transaction date', 'X2 house age',
       'X3 distance to the nearest MRT station',
       'X4 number of convenience stores', 'X5 latitude', 'X6 longitude',
       'Y house price of unit area'],
      dtype='object')
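
The statement that every feature is numerical can be checked programmatically. The sketch below uses `pandas.api.types.is_numeric_dtype` on a small hypothetical frame `df` whose column names follow the renamed columns used earlier; with the real data, the same loop would run over `real_estate_df`.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical two-row frame with a few of the renamed columns.
df = pd.DataFrame({
    "h_age": [32.0, 19.5],
    "n_conv_stores": [10, 9],
    "price": [37.9, 42.2],
})

# Collect every column whose dtype is not numeric.
non_numeric = [col for col in df.columns if not is_numeric_dtype(df[col])]

# An empty list means all columns are numerical.
print(non_numeric)
```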
