## 0. Setting Up The Data

In [96]:
pip install ucimlrepo

Note: you may need to restart the kernel to use updated packages.


In [97]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
real_estate_valuation = fetch_ucirepo(id=477) 
  
# data (as pandas dataframes) 
X = real_estate_valuation.data.features 
y = real_estate_valuation.data.targets 
  
# metadata 
print(real_estate_valuation.metadata) 
  
# variable information 
print(real_estate_valuation.variables) 


{'uci_id': 477, 'name': 'Real Estate Valuation', 'repository_url': 'https://archive.ics.uci.edu/dataset/477/real+estate+valuation+data+set', 'data_url': 'https://archive.ics.uci.edu/static/public/477/data.csv', 'abstract': 'The real estate valuation is a regression problem. The market historical data set of real estate valuation are collected from Sindian Dist., New Taipei City, Taiwan. ', 'area': 'Business', 'tasks': ['Regression'], 'characteristics': ['Multivariate'], 'num_instances': 414, 'num_features': 6, 'feature_types': ['Integer', 'Real'], 'demographics': [], 'target_col': ['Y house price of unit area'], 'index_col': ['No'], 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 2018, 'last_updated': 'Mon Feb 26 2024', 'dataset_doi': '10.24432/C5J30W', 'creators': ['I-Cheng Yeh'], 'intro_paper': {'ID': 373, 'type': 'NATIVE', 'title': 'Building real estate valuation models with comparative approach through case-based reasoning', 'authors': 'I. Yeh

## 1. Business Understanding

This model seeks to predict house prices per unit area using linear regression.

## 2. Data Understanding

In [98]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 414 entries, 0 to 413
Data columns (total 6 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   X1 transaction date                     414 non-null    float64
 1   X2 house age                            414 non-null    float64
 2   X3 distance to the nearest MRT station  414 non-null    float64
 3   X4 number of convenience stores         414 non-null    int64  
 4   X5 latitude                             414 non-null    float64
 5   X6 longitude                            414 non-null    float64
dtypes: float64(5), int64(1)
memory usage: 19.5 KB


In [99]:
X.head(5)

| | X1 transaction date | X2 house age | X3 distance to the nearest MRT station | X4 number of convenience stores | X5 latitude | X6 longitude |
|---|---|---|---|---|---|---|
| 0 | 2012.917 | 32.0 | 84.87882 | 10 | 24.98298 | 121.54024 |
| 1 | 2012.917 | 19.5 | 306.5947 | 9 | 24.98034 | 121.53951 |
| 2 | 2013.583 | 13.3 | 561.9845 | 5 | 24.98746 | 121.54391 |
| 3 | 2013.5 | 13.3 | 561.9845 | 5 | 24.98746 | 121.54391 |
| 4 | 2012.833 | 5.0 | 390.5684 | 5 | 24.97937 | 121.54245 |


In [100]:
X.describe()

| | X1 transaction date | X2 house age | X3 distance to the nearest MRT station | X4 number of convenience stores | X5 latitude | X6 longitude |
|---|---|---|---|---|---|---|
| count | 414.0 | 414.0 | 414.0 | 414.0 | 414.0 | 414.0 |
| mean | 2013.148971 | 17.71256 | 1083.885689 | 4.094203 | 24.96903 | 121.533361 |
| std | 0.281967 | 11.392485 | 1262.109595 | 2.945562 | 0.01241 | 0.015347 |
| min | 2012.667 | 0.0 | 23.38284 | 0.0 | 24.93207 | 121.47353 |
| 25% | 2012.917 | 9.025 | 289.3248 | 1.0 | 24.963 | 121.528085 |
| 50% | 2013.167 | 16.1 | 492.2313 | 4.0 | 24.9711 | 121.53863 |
| 75% | 2013.417 | 28.15 | 1454.279 | 6.0 | 24.977455 | 121.543305 |
| max | 2013.583 | 43.8 | 6488.021 | 10.0 | 25.01459 | 121.56627 |


## 3. Data Preparation

Geographical coordinates are dropped here on the assumption that they have no simple linear relationship with the other features or with the final price.  
In real life the physical location of an apartment can certainly affect its price, but given the way linear regression makes its predictions, raw coordinates are more likely to confuse the model than to provide meaningful input.  
In addition, if the data were standardised, any relation between the two coordinates, as well as their geographical placement, would be lost. 
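
The assumption that a feature carries no linear signal can be checked before dropping it. The following is a minimal sketch on synthetic stand-in data: the column names, the values, and the price formula are all made up for illustration and are not taken from the dataset. It shows how a feature can influence the price strongly and still have a Pearson correlation near zero when the dependence is not linear. On the real data, the same `corrwith` call could be applied to `X` and the single target column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 414

# Synthetic stand-in features (hypothetical, not the real dataset).
demo_X = pd.DataFrame({
    "distance": rng.uniform(20, 6500, n),
    "latitude": rng.normal(24.97, 0.012, n),
})

# Price falls linearly with distance and peaks at a "centre" latitude,
# so its dependence on latitude is real but not linear.
demo_y = pd.Series(
    50
    - 0.005 * demo_X["distance"]
    - 20000 * (demo_X["latitude"] - 24.97) ** 2
    + rng.normal(0, 2, n),
    name="price",
)

# Pearson correlation of every feature with the target.
linear_corr = demo_X.corrwith(demo_y)
print(linear_corr.round(2))
```

A correlation close to zero therefore only rules out a linear relationship. It does not show that location is irrelevant to the price.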

In [101]:
X = X.drop(columns=['X5 latitude', 'X6 longitude'])
X.head(5)

| | X1 transaction date | X2 house age | X3 distance to the nearest MRT station | X4 number of convenience stores |
|---|---|---|---|---|
| 0 | 2012.917 | 32.0 | 84.87882 | 10 |
| 1 | 2012.917 | 19.5 | 306.5947 | 9 |
| 2 | 2013.583 | 13.3 | 561.9845 | 5 |
| 3 | 2013.5 | 13.3 | 561.9845 | 5 |
| 4 | 2012.833 | 5.0 | 390.5684 | 5 |


## 4. Modeling

### Linear Regression

The first model is fitted on the non-standardised dataset.

In [102]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y, 
    test_size=0.3, 
    random_state=42
)

The data is split, reserving 70% for training and 30% for testing.
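
As a sanity check on the split sizes, the sketch below applies the same `train_test_split` arguments to placeholder arrays with 414 rows (`X_demo` and `y_demo` are stand-ins, not the real data). scikit-learn rounds the test share up, so 414 rows give 125 test rows and 289 training rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows and features as X.
X_demo = np.arange(414 * 4).reshape(414, 4)
y_demo = np.arange(414)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    test_size=0.3,
    random_state=42,
)

print(X_tr.shape, X_te.shape)  # (289, 4) (125, 4)
```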

In [103]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

In [104]:
b0 = model.intercept_ 
b1 = model.coef_ 
print(b0)
print(b1)

[-13036.16651723]
[[ 6.49702854e+00 -2.28729288e-01 -5.75023239e-03  1.23176920e+00]]


The intercept looks unreasonable at first, but it is the predicted price when every feature is zero, including a transaction date of 0. The dates in the data lie around 2013, so the large negative intercept mainly offsets the transaction date term.  
In addition, the negative coefficients are considerably smaller in magnitude than the positive ones. The raw coefficients are expressed per unit of each feature, however, so their magnitudes cannot be compared directly across features measured on different scales.  
As for the coefficients themselves:  
Transaction date has the largest coefficient, while house age and distance to the nearest station both have negative coefficients.  
The negative sign for distance makes sense, as transport is one of the most important aspects of city life: a longer distance to the nearest station lowering the price of the apartment is in line with this logic.  
However, the transaction date carrying the highest weight on the apartment price seems suspicious.  
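
The size of the intercept can be traced with the numbers printed above. The following arithmetic check plugs the feature means from `X.describe()` into the fitted equation. Those means are rounded and were computed on the full dataset rather than the training split, so the result is approximate.

```python
# Values copied from the model output above.
intercept = -13036.16651723
coefs = [6.49702854, -0.228729288, -0.00575023239, 1.23176920]

# Feature means from X.describe(): date, house age, MRT distance, stores.
means = [2013.148971, 17.71256, 1083.885689, 4.094203]

# Contribution of the transaction date alone at its mean value.
date_term = coefs[0] * means[0]

# Predicted price for a property with average values in every feature.
price_at_means = intercept + sum(c * m for c, m in zip(coefs, means))

print(f"date coefficient x mean date: {date_term:.2f}")
print(f"predicted price at the feature means: {price_at_means:.2f}")
```

The transaction date term alone contributes about 13079, which the intercept cancels almost exactly. The prediction at the feature means comes out at about 38, a plausible price per unit area.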

In [105]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
X.head(5)

| | X1 transaction date | X2 house age | X3 distance to the nearest MRT station | X4 number of convenience stores |
|---|---|---|---|---|
| 0 | -0.823683 | 1.255628 | -0.792495 | 2.007407 |
| 1 | -0.823683 | 0.157086 | -0.616612 | 1.667503 |
| 2 | 1.541151 | -0.387791 | -0.414015 | 0.307885 |
| 3 | 1.246435 | -0.387791 | -0.414015 | 0.307885 |
| 4 | -1.121951 | -1.117223 | -0.549997 | 0.307885 |


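The data is split again 70-30, this time using the standardised dataset. Because the same `random_state` is used, the rows fall into the same training and test sets as before. Note that the scaler was fitted on the full dataset before the split, so the test rows contribute to the scaling statistics.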

In [106]:
X_train_st, X_test_st, y_train_st, y_test_st = train_test_split(
    X, 
    y, 
    test_size=0.3, 
    random_state=42
)

In [107]:
model_st = LinearRegression()
model_st.fit(X_train_st, y_train_st)

In [108]:
b0 = model_st.intercept_ 
b1 = model_st.coef_ 
print(b0)
print(b1)

[38.07894921]
[[ 1.82973537 -2.60264588 -7.24865317  3.62386771]]


These results appear a lot more reasonable.  
The intercept is at a plausible value and the coefficients are all on the same scale, so they can be compared directly.  
Distance to the nearest station becomes the most heavily weighted feature.  
Transaction date drops below the number of convenience stores, which also seems a lot more reasonable.  
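
The standardised coefficients can be read as the raw coefficients expressed per standard deviation of each feature instead of per original unit. The sketch below checks this with the values printed earlier. It assumes the standard deviations reported by `X.describe()` (sample standard deviations, `ddof=1`) and converts them to the population standard deviation that `StandardScaler` divides by.

```python
import numpy as np

# Coefficients printed by the two models above.
coef_raw = np.array([6.49702854, -0.228729288, -0.00575023239, 1.23176920])
coef_st = np.array([1.82973537, -2.60264588, -7.24865317, 3.62386771])

# Sample standard deviations from X.describe() (ddof=1).
std_sample = np.array([0.281967, 11.392485, 1262.109595, 2.945562])

# StandardScaler uses the population standard deviation (ddof=0).
n = 414
std_population = std_sample * np.sqrt((n - 1) / n)

# Raw coefficient (price per unit) times units per standard deviation.
rescaled = coef_raw * std_population

print(np.round(rescaled, 4))
print(np.allclose(rescaled, coef_st, atol=1e-4))
```

The rescaled values agree with the standardised coefficients to about four decimal places, which suggests both models describe the same fitted relationship in different units.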

### Logistic Regression

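Logistic regression is a classifier, so the continuous target is first converted to a binary label: 1 if the price is at or above the mean training price, 0 otherwise.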

In [109]:
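# mean house price, computed on the training targets only
# (squeeze() turns the single-column DataFrame into a 1-D Series)
mean_price = y_train_st.squeeze().mean()

# create a 1-D binary target: 1 if price >= training mean, else 0
y_train_bin = (y_train_st.squeeze() >= mean_price).astype(int)
y_test_bin = (y_test_st.squeeze() >= mean_price).astype(int)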

Training the model

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# create a logistic regression model
log_reg = LogisticRegression()

# fit on the training data
log_reg.fit(X_train_st, y_train_bin)

# make predictions on the test set
y_pred_log = log_reg.predict(X_test_st)
print(y_pred_log)

# class probabilities per test row: [P(below mean), P(at or above mean)]
print(log_reg.predict_proba(X_test_st))

# share of test rows classified correctly
accuracy = accuracy_score(y_test_bin, y_pred_log)
print(f'Accuracy: {accuracy:.2f}')

[1 1 1 1 0 1 1 1 0 1 0 0 0 0 0 0 1 1 0 1 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 0 1
 0 1 1 0 1 1 0 1 1 1 0 1 1 1 1 1 1 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 1 1
 0 1 0 1 1 0 1 1 0 1 0 0 1 1 0 1 0 1 0 1 1 1 0 1 0 1 0 0 1 0 1 1 0 0 1 0 0
 1 1 1 0 1 1 1 0 0 1 0 1 1 0]
[[7.53054888e-02 9.24694511e-01]
 [2.59498200e-01 7.40501800e-01]
 [2.18754845e-01 7.81245155e-01]
 [3.13911217e-01 6.86088783e-01]
 [9.61039713e-01 3.89602866e-02]
 [2.09505407e-01 7.90494593e-01]
 [2.20461617e-01 7.79538383e-01]
 [1.86787490e-01 8.13212510e-01]
 [9.00709937e-01 9.92900632e-02]
 [3.30215237e-02 9.66978476e-01]
 [8.47916599e-01 1.52083401e-01]
 [8.62382910e-01 1.37617090e-01]
 [8.80947262e-01 1.19052738e-01]
 [9.07177657e-01 9.28223426e-02]
 [7.06906019e-01 2.93093981e-01]
 [8.53388667e-01 1.46611333e-01]
 [2.19210244e-01 7.80789756e-01]
 [6.53387767e-02 9.34661223e-01]
 [9.44764545e-01 5.52354548e-02]
 [1.60107070e-01 8.39892930e-01]
 [9.99994259e-01 5.74139459e-06]
 [8.60720049e-01 1.39279951e-01]
 [1.45596052e-01 8.

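
The two arrays in the output above appear to be the predicted classes followed by the per-class probabilities. As a small illustration of how they relate, the sketch below copies the first five probability rows and picks the more probable class for each. For a binary problem this is the same as thresholding the probability of class 1 at 0.5.

```python
import numpy as np

# First five probability rows copied from the output above:
# column 0 is P(class 0), column 1 is P(class 1).
proba = np.array([
    [7.53054888e-02, 9.24694511e-01],
    [2.59498200e-01, 7.40501800e-01],
    [2.18754845e-01, 7.81245155e-01],
    [3.13911217e-01, 6.86088783e-01],
    [9.61039713e-01, 3.89602866e-02],
])

# Index of the larger probability in each row is the predicted class.
labels = proba.argmax(axis=1)
print(labels)  # [1 1 1 1 0]
```

These labels match the first five entries of the prediction array in the output.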


## 5. Evaluation

Evaluating the non-standardised model

In [114]:
from sklearn.metrics import mean_absolute_error

preds = model.predict(X_test)

print("Mean absolute error: %.2f" % mean_absolute_error(y_test, preds))

Mean absolute error: 6.36


Evaluating the standardised model

In [115]:
preds = model_st.predict(X_test_st)

print("Mean absolute error: %.2f" % mean_absolute_error(y_test_st, preds))

Mean absolute error: 6.36
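
Both models report the same error. This is expected: ordinary least squares with an intercept gives the same predictions whether or not the features are standardised, because standardisation only shifts and rescales each column. The sketch below illustrates this on synthetic stand-in data (the arrays and the price formula are made up for illustration) and also computes the mean absolute error by hand as the mean of the absolute residuals.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features on very different scales (not the real dataset).
X_demo = np.column_stack([
    rng.uniform(2012.6, 2013.6, 200),
    rng.uniform(20, 6500, 200),
])
y_demo = (
    40
    + 5 * (X_demo[:, 0] - 2013)
    - 0.005 * X_demo[:, 1]
    + rng.normal(0, 3, 200)
)

# Simple 70-30 split by position.
X_tr, X_te = X_demo[:140], X_demo[140:]
y_tr, y_te = y_demo[:140], y_demo[140:]

raw_model = LinearRegression().fit(X_tr, y_tr)

scaler = StandardScaler().fit(X_demo)  # fitted on all rows, as in the notebook
st_model = LinearRegression().fit(scaler.transform(X_tr), y_tr)

pred_raw = raw_model.predict(X_te)
pred_st = st_model.predict(scaler.transform(X_te))

# MAE by hand: the mean of the absolute residuals.
mae_raw = np.mean(np.abs(y_te - pred_raw))
mae_st = mean_absolute_error(y_te, pred_st)

print(np.isclose(mae_raw, mae_st))
```

Standardisation therefore changes how the coefficients read, not how well the linear model predicts.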


## 6. Deployment