# Real Estate Price Prediction

The goal is to predict the house price per unit area from the six parameters listed below:
transaction date, house age, distance to the nearest MRT station, number of convenience stores in the living circle, geographic latitude, and geographic longitude.

- Attribute Information:

- The inputs are as follows:
- X1=the transaction date (for example, 2013.250=2013 March, 2013.500=2013 June, etc.)
- X2=the house age (unit: year)
- X3=the distance to the nearest MRT station (unit: meter)
- X4=the number of convenience stores in the living circle on foot (integer)
- X5=the geographic coordinate, latitude. (unit: degree)
- X6=the geographic coordinate, longitude. (unit: degree)

The output is as follows:
Y= house price of unit area (10000 New Taiwan Dollar/Ping, where Ping is a local unit, 1 Ping = 3.3 meter squared)
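To make the unit of Y concrete, the short sketch below converts a value of Y into New Taiwan Dollars per Ping and per square meter. The helper name `price_per_sqm` and the constants are illustrative assumptions; the 3.3 m² per Ping figure is the approximation quoted above.

```python
PING_IN_SQM = 3.3       # 1 Ping ~ 3.3 m^2, as stated in the attribute description
NTD_PER_UNIT = 10_000   # Y is expressed in units of 10,000 NTD per Ping


def price_per_sqm(y):
    """Convert Y (10,000 NTD per Ping) to NTD per square meter (hypothetical helper)."""
    return y * NTD_PER_UNIT / PING_IN_SQM


y_example = 37.9  # an example value of Y
ntd_per_ping = y_example * NTD_PER_UNIT
ntd_per_sqm = price_per_sqm(y_example)
print(f"{ntd_per_ping:.0f} NTD per Ping, about {ntd_per_sqm:.0f} NTD per square meter")
```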

# Importing required libraries for the project

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

- Loading the real estate dataset from GitHub

In [2]:
Realdf = pd.read_csv('https://raw.githubusercontent.com/Manju410/MLPractice/main/Real_Estate_Price_Prediction/RealEstatCleanUp.csv')

In [3]:
Realdf.head()

Unnamed: 0,X1 transaction date,X2 house age,X3 distance to the nearest MRT station,X4 number of convenience stores,X5 latitude,X6 longitude,Y house price of unit area
0,734808,32.0,84.87882,10,24.98298,121.54024,37.9
1,734808,19.5,306.5947,9,24.98034,121.53951,42.2
2,735050,13.3,561.9845,5,24.98746,121.54391,47.3
3,735020,13.3,561.9845,5,24.98746,121.54391,54.8
4,734777,5.0,390.5684,5,24.97937,121.54245,43.1


- Number of rows and columns in the dataset

In [4]:
Realdf.shape

(414, 7)

- There are 414 rows and 7 columns in the above dataset

- Information about the dataset, such as data types and non-null counts

In [5]:
Realdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 414 entries, 0 to 413
Data columns (total 7 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   X1 transaction date                     414 non-null    int64  
 1   X2 house age                            414 non-null    float64
 2   X3 distance to the nearest MRT station  414 non-null    float64
 3   X4 number of convenience stores         414 non-null    int64  
 4   X5 latitude                             414 non-null    float64
 5   X6 longitude                            414 non-null    float64
 6   Y house price of unit area              414 non-null    float64
dtypes: float64(5), int64(2)
memory usage: 22.8 KB


# Summary of above output
- The dataset contains 7 columns.
- Two columns have an integer datatype and five columns have a float datatype.
- The dataset does not have any null or empty values.
- The dataset has 414 entries in total.

- Select the continuous features (X2, X3, X5, X6) and the target, then extract the train and test datasets

In [6]:
X = Realdf.iloc[:, [1,2,4,5]]
y = Realdf.iloc[:, -1:]

In [7]:
X.head()

Unnamed: 0,X2 house age,X3 distance to the nearest MRT station,X5 latitude,X6 longitude
0,32.0,84.87882,24.98298,121.54024
1,19.5,306.5947,24.98034,121.53951
2,13.3,561.9845,24.98746,121.54391
3,13.3,561.9845,24.98746,121.54391
4,5.0,390.5684,24.97937,121.54245


In [8]:
y.head()

Unnamed: 0,Y house price of unit area
0,37.9
1,42.2
2,47.3
3,54.8
4,43.1


In [9]:
X.shape

(414, 4)

In [10]:
# First 350 rows for training
X_train = X.iloc[:350, :]
y_train = y.iloc[:350, :]

# Remaining 64 rows for testing
X_test = X.iloc[350:, :]
y_test = y.iloc[350:, :]
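The positional split above is only representative if the rows are in no particular order. As an alternative, here is a minimal sketch of a shuffled split with scikit-learn's `train_test_split`. The data is a synthetic stand-in with made-up values and hypothetical column names, and `random_state=42` is an arbitrary choice; this is not the split used in the rest of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as Realdf (values are made up).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.random((414, 4)), columns=["age", "mrt_dist", "lat", "lon"])
y_demo = pd.DataFrame({"price": rng.random(414)})

# Shuffle before splitting; test_size=64 mirrors the 350/64 split used above.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=64, shuffle=True, random_state=42
)
print(X_tr.shape, X_te.shape)
```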

In [11]:
from sklearn.preprocessing import MinMaxScaler

In [12]:
X_train.head()

Unnamed: 0,X2 house age,X3 distance to the nearest MRT station,X5 latitude,X6 longitude
0,32.0,84.87882,24.98298,121.54024
1,19.5,306.5947,24.98034,121.53951
2,13.3,561.9845,24.98746,121.54391
3,13.3,561.9845,24.98746,121.54391
4,5.0,390.5684,24.97937,121.54245


In [13]:
X_test.head()

Unnamed: 0,X2 house age,X3 distance to the nearest MRT station,X5 latitude,X6 longitude
350,13.2,492.2313,24.96515,121.53737
351,4.0,2180.245,24.96324,121.51241
352,18.4,2674.961,24.96143,121.50827
353,4.1,2147.376,24.96299,121.51284
354,12.2,1360.139,24.95204,121.54842


In [14]:
Continous_col = X_train.columns
Continous_col

Index(['X2 house age', 'X3 distance to the nearest MRT station', 'X5 latitude',
       'X6 longitude'],
      dtype='object')

In [15]:
mmscaler = MinMaxScaler()

In [16]:
scaler = mmscaler.fit(X_train[Continous_col])

In [17]:
scaler.data_min_

array([  0.     ,  23.38284,  24.93293, 121.47353])

In [18]:
scaler.data_max_

array([  43.8    , 6488.021  ,   25.01459,  121.56627])

In [19]:
X_train.describe()

Unnamed: 0,X2 house age,X3 distance to the nearest MRT station,X5 latitude,X6 longitude
count,350.0,350.0,350.0,350.0
mean,17.955429,1091.021845,24.969063,121.533313
std,11.355236,1289.770616,0.012498,0.0155
min,0.0,23.38284,24.93293,121.47353
25%,9.9,289.3248,24.96305,121.528085
50%,16.2,492.2313,24.9711,121.538535
75%,29.125,1414.837,24.97744,121.543438
max,43.8,6488.021,25.01459,121.56627


In [20]:
scaler.data_range_

array([4.38000000e+01, 6.46463816e+03, 8.16600000e-02, 9.27400000e-02])

In [21]:
scaled_vals = scaler.transform(X_train[Continous_col])
scaled_vals[:5]

array([[0.73059361, 0.00951267, 0.61290718, 0.71932284],
       [0.44520548, 0.04380939, 0.58057801, 0.71145137],
       [0.30365297, 0.08331505, 0.6677688 , 0.75889584],
       [0.30365297, 0.08331505, 0.6677688 , 0.75889584],
       [0.11415525, 0.05679909, 0.56869949, 0.7431529 ]])
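The scaled values above follow the min-max formula `(x - min) / (max - min)`, applied column by column. As a sanity check, this sketch reproduces the first scaled row with plain NumPy, using the numbers copied from the outputs above. The variable names are illustrative.

```python
import numpy as np

# Values copied from the outputs above.
data_min = np.array([0.0, 23.38284, 24.93293, 121.47353])    # scaler.data_min_
data_max = np.array([43.8, 6488.021, 25.01459, 121.56627])   # scaler.data_max_
first_row = np.array([32.0, 84.87882, 24.98298, 121.54024])  # X_train row 0

# Min-max scaling: (x - min) / (max - min), applied column by column.
scaled_row = (first_row - data_min) / (data_max - data_min)
print(scaled_row)
```

Because the minimum and maximum come from the training rows only, test rows transformed with the same scaler can fall slightly outside the [0, 1] range.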

In [22]:
scaled_vals1 = scaler.transform(X_test[Continous_col])
scaled_vals1[:5]

array([[0.30136986, 0.07252509, 0.39456282, 0.68837611],
       [0.0913242 , 0.33364004, 0.37117316, 0.41923658],
       [0.42009132, 0.41016652, 0.34900808, 0.37459564],
       [0.09360731, 0.32855561, 0.36811168, 0.42387319],
       [0.27853881, 0.20677973, 0.2340191 , 0.80752642]])

In [23]:
X_test_scaled = pd.DataFrame(scaled_vals1,columns=Continous_col)
X_test_scaled.head()

Unnamed: 0,X2 house age,X3 distance to the nearest MRT station,X5 latitude,X6 longitude
0,0.30137,0.072525,0.394563,0.688376
1,0.091324,0.33364,0.371173,0.419237
2,0.420091,0.410167,0.349008,0.374596
3,0.093607,0.328556,0.368112,0.423873
4,0.278539,0.20678,0.234019,0.807526


In [25]:
X_train_scaled = pd.DataFrame(scaled_vals,columns=Continous_col)
X_train_scaled.head()

Unnamed: 0,X2 house age,X3 distance to the nearest MRT station,X5 latitude,X6 longitude
0,0.730594,0.009513,0.612907,0.719323
1,0.445205,0.043809,0.580578,0.711451
2,0.303653,0.083315,0.667769,0.758896
3,0.303653,0.083315,0.667769,0.758896
4,0.114155,0.056799,0.568699,0.743153


In [27]:
from sklearn.linear_model import LinearRegression

In [28]:
mdl = LinearRegression()
mdl.fit(X_train_scaled, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [29]:
c = mdl.intercept_
c

array([39.26323678])

In [30]:
m = mdl.coef_
m

array([[-11.89571708, -35.7918289 ,  24.61185586,  -2.33405276]])

In [31]:
coef_df = pd.DataFrame({'col': X_train_scaled.columns,'coeff':m.flatten()})
coef_df.sort_values('coeff',key=lambda x:abs(x),ascending=False)


Unnamed: 0,col,coeff
1,X3 distance to the nearest MRT station,-35.791829
2,X5 latitude,24.611856
0,X2 house age,-11.895717
3,X6 longitude,-2.334053


## Interpreting the Model

From the above table we notice the following:
- As the X3 distance to the nearest MRT station increases, the predicted price of the house decreases.
- As the X2 house age increases, the predicted price of the house decreases.
- X5 latitude and X6 longitude depend on the city, so we should not consider these columns when interpreting the model.

All features are min-max scaled, so the coefficient magnitudes are roughly comparable. By that measure, X3 distance to the nearest MRT station has the largest impact on the model. Only the continuous variables are used to build this model.
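The coefficients above are expressed per unit of the scaled features. To read them in original units (per year, per meter), one option is to divide each coefficient by the corresponding `scaler.data_range_` entry. The sketch below does this with the values copied from the outputs above; the variable names are illustrative.

```python
import numpy as np

features = ["X2 house age", "X3 distance to the nearest MRT station",
            "X5 latitude", "X6 longitude"]
coef_scaled = np.array([-11.89571708, -35.7918289, 24.61185586, -2.33405276])  # mdl.coef_
data_range = np.array([43.8, 6464.63816, 0.08166, 0.09274])                    # scaler.data_range_

# A scaled feature moves by 1/range per original unit, so divide by the range.
coef_per_unit = coef_scaled / data_range
for name, value in zip(features, coef_per_unit):
    print(f"{name}: {value:.5f} per original unit")
```

Read this way, the model lowers the predicted Y by roughly 0.27 per additional year of age and by roughly 0.0055 per additional meter from the MRT station, in units of 10000 New Taiwan Dollar/Ping.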

In [32]:
y_learnt = mdl.predict(X_train_scaled)
y_learnt[:5]

array([[43.63767163],
       [45.02771731],
       [47.33279539],
       [47.33279539],
       [48.13452662]])
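Each prediction is the intercept plus the dot product of the coefficients with the scaled feature row. As a check, this sketch recomputes the first fitted value by hand from the rounded numbers printed above, so it matches the model output only up to rounding.

```python
import numpy as np

intercept = 39.26323678                                                  # mdl.intercept_
coef = np.array([-11.89571708, -35.7918289, 24.61185586, -2.33405276])  # mdl.coef_
x0_scaled = np.array([0.73059361, 0.00951267, 0.61290718, 0.71932284])  # scaled row 0

# Linear regression prediction: intercept + sum(coef_i * x_i)
y0 = intercept + coef @ x0_scaled
print(round(float(y0), 3))
```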

In [33]:
learndf = pd.DataFrame({'Actual':y_train.iloc[:,0],'learn':y_learnt.flatten()})
learndf.head()

Unnamed: 0,Actual,learn
0,37.9,43.637672
1,42.2,45.027717
2,47.3,47.332795
3,54.8,47.332795
4,43.1,48.134527


In [34]:
y_pred = mdl.predict(X_test_scaled)
y_pred[:5]

array([[41.18663771],
       [34.39202251],
       [27.30074983],
       [34.46067371],
       [32.42362819]])

In [35]:
preddf = pd.DataFrame({'Actual':y_test.iloc[:,0],'pred':y_pred.flatten()})
preddf.head()

Unnamed: 0,Actual,pred
350,42.3,41.186638
351,28.6,34.392023
352,25.7,27.30075
353,31.3,34.460674
354,30.1,32.423628


In [36]:
from sklearn.metrics import mean_squared_error as mse

In [37]:
test_mse = mse(y_test,y_pred)
test_mse

74.58839023316486

In [38]:
rmse = np.sqrt(test_mse)
rmse

8.636457041702046

In [39]:
from sklearn.metrics import r2_score

In [40]:
r2score = r2_score(y_test,y_pred)
r2score

0.53262644960403
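For reference, the metrics above can be written out from their definitions. The sketch below uses only the first five test rows shown in `preddf`, so its numbers are an illustration of the formulas and will not match the full test-set scores.

```python
import numpy as np

# First five test rows from preddf above (not the full test set).
actual = np.array([42.3, 28.6, 25.7, 31.3, 30.1])
pred = np.array([41.186638, 34.392023, 27.30075, 34.460674, 32.423628])

residuals = actual - pred
mse_manual = np.mean(residuals ** 2)            # mean squared error
rmse_manual = np.sqrt(mse_manual)               # root MSE, same units as Y
ss_res = np.sum(residuals ** 2)                 # residual sum of squares
ss_tot = np.sum((actual - actual.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot                 # coefficient of determination
print(mse_manual, rmse_manual, r2_manual)
```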

In [41]:
n= X.shape[0]
k= X.shape[1]
n,k

(414, 4)

In [42]:
Adj_R2 = 1-(((1-r2score)*(n-1))/(n-k-1))
Adj_R2

0.5280555591356098
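Note that `n` above is the size of the full dataset (414), while `r2score` was computed on the 64 test rows. Assuming the adjustment should use the sample that the R² was computed on, the sketch below compares both choices. `adjusted_r2` is a hypothetical helper, not part of scikit-learn.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


r2_test = 0.53262644960403  # r2score from the cell above
k = 4                       # number of features

adj_all_rows = adjusted_r2(r2_test, 414, k)  # n as used above (whole dataset)
adj_test_rows = adjusted_r2(r2_test, 64, k)  # n = 414 - 350 = 64 test rows
print(adj_all_rows, adj_test_rows)
```

With the smaller sample size, the penalty for the four features is larger and the adjusted value drops to roughly 0.50.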

In [57]:
new_modle = LinearRegression()

In [58]:
from sklearn.model_selection import KFold

In [59]:
X_new = pd.concat([X_train_scaled,X_test_scaled], ignore_index=True)  # reset the index so rows 0..413 line up with y
X_new.head()

Unnamed: 0,X2 house age,X3 distance to the nearest MRT station,X5 latitude,X6 longitude
0,0.730594,0.009513,0.612907,0.719323
1,0.445205,0.043809,0.580578,0.711451
2,0.303653,0.083315,0.667769,0.758896
3,0.303653,0.083315,0.667769,0.758896
4,0.114155,0.056799,0.568699,0.743153


In [60]:
y.head()

Unnamed: 0,Y house price of unit area
0,37.9
1,42.2
2,47.3
3,54.8
4,43.1


In [61]:
from sklearn.model_selection import cross_val_score

In [68]:
scores = cross_val_score(new_modle,X_new,y,
                         cv=KFold(n_splits=4,shuffle=True,random_state=1234),
                         scoring='r2')

In [69]:
scores

array([0.42200232, 0.53264259, 0.63085013, 0.52358709])

In [70]:
scores.mean()

0.5272705325457174
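The cross-validation above runs on features scaled with a scaler fitted once on the first 350 rows. A common alternative is to put the scaler and the model in one scikit-learn pipeline, so that the scaler is refitted on the training part of every fold. The sketch below shows the pattern on synthetic stand-in data (made-up values, illustrative names); it is not a rerun of the notebook's results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the four unscaled features and the target (made-up values).
rng = np.random.default_rng(0)
X_demo = rng.random((414, 4)) * np.array([40, 6000, 0.08, 0.09])
y_demo = 40 - 0.3 * X_demo[:, 0] - 0.005 * X_demo[:, 1] + rng.normal(0, 5, 414)

# The pipeline refits MinMaxScaler on the training part of every fold.
pipe = make_pipeline(MinMaxScaler(), LinearRegression())
cv = KFold(n_splits=4, shuffle=True, random_state=1234)
demo_scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="r2")
print(demo_scores, demo_scores.mean())
```

For plain linear regression with an intercept, rescaling the features does not change the predictions, so the scores are unaffected either way. The pipeline pattern matters once a scale-sensitive model, such as a regularized regression, is substituted.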