## 0. Setting Up The Data

In [1]:
pip install ucimlrepo



In [2]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
real_estate_valuation = fetch_ucirepo(id=477) 
  
# data (as pandas dataframes) 
X = real_estate_valuation.data.features 
y = real_estate_valuation.data.targets 
  
# metadata 
print(real_estate_valuation.metadata) 
  
# variable information 
print(real_estate_valuation.variables) 


{'uci_id': 477, 'name': 'Real Estate Valuation', 'repository_url': 'https://archive.ics.uci.edu/dataset/477/real+estate+valuation+data+set', 'data_url': 'https://archive.ics.uci.edu/static/public/477/data.csv', 'abstract': 'The real estate valuation is a regression problem. The market historical data set of real estate valuation are collected from Sindian Dist., New Taipei City, Taiwan. ', 'area': 'Business', 'tasks': ['Regression'], 'characteristics': ['Multivariate'], 'num_instances': 414, 'num_features': 6, 'feature_types': ['Integer', 'Real'], 'demographics': [], 'target_col': ['Y house price of unit area'], 'index_col': ['No'], 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 2018, 'last_updated': 'Mon Feb 26 2024', 'dataset_doi': '10.24432/C5J30W', 'creators': ['I-Cheng Yeh'], 'intro_paper': {'ID': 373, 'type': 'NATIVE', 'title': 'Building real estate valuation models with comparative approach through case-based reasoning', 'authors': 'I. Yeh

## 1. Business Understanding

This notebook seeks to predict house prices per unit area using linear regression.

## 2. Data Understanding

In [3]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 414 entries, 0 to 413
Data columns (total 6 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   X1 transaction date                     414 non-null    float64
 1   X2 house age                            414 non-null    float64
 2   X3 distance to the nearest MRT station  414 non-null    float64
 3   X4 number of convenience stores         414 non-null    int64  
 4   X5 latitude                             414 non-null    float64
 5   X6 longitude                            414 non-null    float64
dtypes: float64(5), int64(1)
memory usage: 19.5 KB


In [4]:
X.head(5)

| | X1 transaction date | X2 house age | X3 distance to the nearest MRT station | X4 number of convenience stores | X5 latitude | X6 longitude |
|---|---|---|---|---|---|---|
| 0 | 2012.917 | 32.0 | 84.87882 | 10 | 24.98298 | 121.54024 |
| 1 | 2012.917 | 19.5 | 306.5947 | 9 | 24.98034 | 121.53951 |
| 2 | 2013.583 | 13.3 | 561.9845 | 5 | 24.98746 | 121.54391 |
| 3 | 2013.5 | 13.3 | 561.9845 | 5 | 24.98746 | 121.54391 |
| 4 | 2012.833 | 5.0 | 390.5684 | 5 | 24.97937 | 121.54245 |


In [5]:
X.describe()

| | X1 transaction date | X2 house age | X3 distance to the nearest MRT station | X4 number of convenience stores | X5 latitude | X6 longitude |
|---|---|---|---|---|---|---|
| count | 414.0 | 414.0 | 414.0 | 414.0 | 414.0 | 414.0 |
| mean | 2013.148971 | 17.71256 | 1083.885689 | 4.094203 | 24.96903 | 121.533361 |
| std | 0.281967 | 11.392485 | 1262.109595 | 2.945562 | 0.01241 | 0.015347 |
| min | 2012.667 | 0.0 | 23.38284 | 0.0 | 24.93207 | 121.47353 |
| 25% | 2012.917 | 9.025 | 289.3248 | 1.0 | 24.963 | 121.528085 |
| 50% | 2013.167 | 16.1 | 492.2313 | 4.0 | 24.9711 | 121.53863 |
| 75% | 2013.417 | 28.15 | 1454.279 | 6.0 | 24.977455 | 121.543305 |
| max | 2013.583 | 43.8 | 6488.021 | 10.0 | 25.01459 | 121.56627 |


## 3. Data Preparation

Geographical coordinates are unlikely to have a simple linear relationship with the other features or with the final price.  
In real life the physical location of an apartment can certainly affect its price, but given the way linear regression makes its predictions, the raw coordinates are more likely to confuse the model than to provide meaningful input.  
In addition, if the data were standardised, latitude and longitude would each be rescaled independently, so the relation between the two coordinates, as well as their geographical placement, would be lost.

In [6]:
X = X.drop(columns=['X5 latitude', 'X6 longitude'])
X.head(5)

| | X1 transaction date | X2 house age | X3 distance to the nearest MRT station | X4 number of convenience stores |
|---|---|---|---|---|
| 0 | 2012.917 | 32.0 | 84.87882 | 10 |
| 1 | 2012.917 | 19.5 | 306.5947 | 9 |
| 2 | 2013.583 | 13.3 | 561.9845 | 5 |
| 3 | 2013.5 | 13.3 | 561.9845 | 5 |
| 4 | 2012.833 | 5.0 | 390.5684 | 5 |


## 4. Modeling

### Linear Regression

#### The first model is fitted on the non-standardized dataset

In [7]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y, 
    test_size=0.3, 
    random_state=42
)

The data is split, reserving 70% for training and 30% for testing.
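
As a rough check of what this split produces, the sizes can be worked out by hand. The sketch below assumes that scikit-learn rounds a fractional `test_size` up to the next whole sample, which agrees with the 125 test rows reported in the evaluation section; the variable names are only illustrative.

```python
import math

n_samples = 414   # rows in the dataset, from X.info()
test_size = 0.3   # fraction passed to train_test_split

# Assumption: the test count is rounded up and the remainder goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

So the split is slightly different from an exact 70/30 division, because 30% of 414 is not a whole number.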

In [8]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

LinearRegression()


In [9]:
b0 = model.intercept_ 
b1 = model.coef_ 
print(b0)
print(b1)

[-13036.16651723]
[[ 6.49702854e+00 -2.28729288e-01 -5.75023239e-03  1.23176920e+00]]


The intercept looks unreasonable at first, but it can be explained by the differing scales of the features: it is the predicted price when every feature is 0, including a transaction date of 0, whereas the observed dates all lie around 2013.  
In addition, the negative coefficients are considerably smaller in magnitude than the positive ones.  
As for the coefficients themselves:  
Transaction date has the largest coefficient, while house age and distance to the nearest station are the two features that lower the price.  
The sign for distance makes sense: transportation is one of the most important aspects of city life, so a longer distance to the nearest station lowering the price of the apartment is in line with this logic. Its coefficient is tiny only because it applies to every single unit of a distance that runs into the thousands.  
However, transaction date carrying the highest weight on the apartment price seems suspicious.
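
To see how the intercept and the transaction date term offset each other, the prediction for the first row shown by `X.head(5)` can be reproduced by hand from the printed coefficients. This is only an illustrative sketch: the numbers are copied from the outputs above, and the variable names are made up for the example.

```python
# Intercept and coefficients copied from the printed output of the fitted model
intercept = -13036.16651723
coefs = [6.49702854, -0.228729288, -0.00575023239, 1.23176920]

# First row of X.head(5): transaction date, house age, MRT distance, stores
row = [2012.917, 32.0, 84.87882, 10]

# Contribution of each feature to the prediction
terms = [c * x for c, x in zip(coefs, row)]
prediction = intercept + sum(terms)

print("date term:", round(terms[0], 1))
print("date term + intercept:", round(terms[0] + intercept, 1))
print("prediction:", round(prediction, 1))
```

The date term is on the order of 13,000 and the intercept is close to its mirror image, so only their small difference ends up in the predicted price.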

#### The second model is fitted on the standardized dataset
The same split as before is used: 70% for training and 30% for testing.


In [10]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# standardize the training set and fit the scaler
X_train_st = pd.DataFrame(
    scaler.fit_transform(X_train),
    columns=X_train.columns,
    index=X_train.index
)

# standardize the test set with same scaler
X_test_st = pd.DataFrame(
    scaler.transform(X_test),
    columns=X_test.columns,
    index=X_test.index
)

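**Standardization** is applied to both the training and test sets using the same scaler, which is fitted on the training set only so that no information from the test set leaks into the model. This puts the features on the same scale, which allows for a more meaningful comparison of coefficient magnitudes. For ordinary least squares it does not change the predictions, but it can help models that are sensitive to feature scale, such as regularised ones.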
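
What the scaler does can be sketched with the standard library alone. The example assumes the usual definition: subtract the training mean and divide by the training population standard deviation. The short `train` and `test` lists are illustrative values, not the notebook's actual split.

```python
from statistics import fmean, pstdev

# Illustrative MRT distances: rows from X.head(5) as "train",
# the minimum and maximum from X.describe() as "test" (not the real split)
train = [84.87882, 306.5947, 561.9845, 390.5684]
test = [23.38284, 6488.021]

# "Fit": learn the mean and standard deviation from the training values only
mu = fmean(train)
sigma = pstdev(train)  # population standard deviation (divides by n)

# "Transform": apply the same training statistics to both sets
train_st = [(x - mu) / sigma for x in train]
test_st = [(x - mu) / sigma for x in test]

print([round(v, 3) for v in train_st])
print([round(v, 3) for v in test_st])
```

The scaled training values have mean 0 and standard deviation 1, while the test values generally do not, because they are scaled with the training statistics.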

In [11]:
model_st = LinearRegression()
model_st.fit(X_train_st, y_train)

LinearRegression()


In [12]:
b0 = model_st.intercept_ 
b1 = model_st.coef_ 
print(b0)
print(b1)

[38.44186851]
[[ 1.83454082 -2.6036594  -7.16655007  3.67464933]]


These results appear a lot more reasonable.  
The intercept is now at a sensible value (with standardized features it equals the mean price of the training set), and the coefficients are all on the same scale.  
Distance to the nearest station becomes the most heavily weighted feature.  
Transaction date drops to the lowest weight, below both number of convenience stores and house age, which also seems a lot more reasonable.  
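
The two sets of coefficients are linked: for ordinary least squares, a standardized coefficient equals the raw coefficient multiplied by the standard deviation the scaler used for that feature. As a hedged check, the sketch below divides one printed coefficient by the other and compares the result with the standard deviations from `X.describe()`. Those describe the full dataset and use a slightly different formula, so only approximate agreement is expected.

```python
# Coefficients copied from the two printed outputs, in column order
features = ["transaction date", "house age", "MRT distance", "convenience stores"]
raw_coefs = [6.49702854, -0.228729288, -0.00575023239, 1.23176920]
std_coefs = [1.83454082, -2.6036594, -7.16655007, 3.67464933]

# Standard deviations of the full dataset, from X.describe()
describe_std = [0.281967, 11.392485, 1262.109595, 2.945562]

# The ratio std_coef / raw_coef recovers the standard deviation the scaler used
implied_std = [s / r for s, r in zip(std_coefs, raw_coefs)]

for name, implied, described in zip(features, implied_std, describe_std):
    print(f"{name}: implied {implied:.3f}, describe() {described:.3f}")
```

The implied values land close to the reported ones, which shows that the reordering of the coefficients comes from the change of scale rather than from a different model.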

### Logistic Regression

Converting the targets to binary: 1 if the price is at or above the mean price of the training set, otherwise 0.

In [13]:
# reshape target variable to 1D array
y_train = y_train.values.ravel()
y_test = y_test.values.ravel()

# mean house price of the training set
mean_price = y_train.mean()

# create binary target variable
y_train_bin = (y_train >= mean_price).astype(int)
y_test_bin = (y_test >= mean_price).astype(int)

Training the model

In [14]:
from sklearn.linear_model import LogisticRegression

# create a logistic regression model
log_reg = LogisticRegression()

# fit on the training data
log_reg.fit(X_train_st, y_train_bin)

# make predictions on the test set
y_pred_log = log_reg.predict(X_test_st)

## 5. Evaluation

Evaluating non-standardised model

In [15]:
from sklearn.metrics import mean_absolute_error

preds = model.predict(X_test)

print("Mean absolute error: %.2f" % mean_absolute_error(y_test, preds))

Mean absolute error: 6.36


Evaluating standardised model

In [16]:
preds = model_st.predict(X_test_st)

print("Mean absolute error: %.2f" % mean_absolute_error(y_test, preds))

Mean absolute error: 6.36
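
The identical errors are expected: ordinary least squares gives the same predictions whether or not the features are standardized, because the rescaling is absorbed by the coefficients and the intercept. The sketch below illustrates this with a hand-written one-feature fit on toy numbers (the `y` values are invented for the example, not taken from the dataset).

```python
from statistics import fmean, pstdev

def fit_line(x, y):
    """Least-squares slope and intercept for a single feature."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

x = [84.9, 306.6, 562.0, 390.6, 2175.0]   # toy distances
y = [47.0, 42.2, 47.3, 43.1, 32.1]        # toy prices (invented)

# Fit on the raw feature
slope, intercept = fit_line(x, y)
preds_raw = [intercept + slope * v for v in x]

# Fit on the standardized feature
mu, sigma = fmean(x), pstdev(x)
x_st = [(v - mu) / sigma for v in x]
slope_st, intercept_st = fit_line(x_st, y)
preds_st = [intercept_st + slope_st * v for v in x_st]

# Largest difference between the two sets of predictions
print(max(abs(a - b) for a, b in zip(preds_raw, preds_st)))
# The standardized intercept coincides with the mean of y
print(intercept_st, fmean(y))
```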


Evaluating Logistic Regression model

In [17]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

print("Accuracy:", accuracy_score(y_test_bin, y_pred_log))
print(confusion_matrix(y_test_bin, y_pred_log))
print(classification_report(y_test_bin, y_pred_log))


Accuracy: 0.8
[[53 19]
 [ 6 47]]
              precision    recall  f1-score   support

           0       0.90      0.74      0.81        72
           1       0.71      0.89      0.79        53

    accuracy                           0.80       125
   macro avg       0.81      0.81      0.80       125
weighted avg       0.82      0.80      0.80       125
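
The figures in the report can be recomputed from the confusion matrix by hand. The sketch below assumes scikit-learn's layout, with rows as true classes and columns as predicted classes.

```python
# Confusion matrix copied from the output above: rows = true, columns = predicted
tn, fp = 53, 19
fn, tp = 6, 47

total = tn + fp + fn + tp
accuracy = (tp + tn) / total

# Class 1 (price at or above the training mean)
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)

# Class 0 (price below the training mean)
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

print(f"accuracy: {accuracy:.2f}")
print(f"class 0: precision {precision_0:.2f}, recall {recall_0:.2f}")
print(f"class 1: precision {precision_1:.2f}, recall {recall_1:.2f}")
```

The 19 false positives are the main source of error: the model labels below-average apartments as above average more often than the other way round.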



## 6. Deployment

### How these models could be used in practice
The trained model could be used to estimate house prices in New Taipei City, Taiwan, when a new listing becomes available.
The same set of features would be required to make a prediction (transaction date, house age, distance to MRT station, number of nearby convenience stores), and the model would output:
* **For linear regression**: the estimated price of the apartment
* **For logistic regression**: a class prediction (above or below average price) and potentially a probability of the apartment being above average price.
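
How the logistic model turns features into a probability can be sketched as follows. The coefficients are the ones listed for the logistic regression in the coefficient table further down. The intercept is never printed in the notebook, so the value `0.0` used here is purely a placeholder assumption, and the inputs are hypothetical, already-standardized feature values.

```python
import math

# Logistic regression coefficients from the notebook's coefficient table
coefs = {
    "transaction date": 0.188293,
    "house age": -0.521377,
    "MRT distance": -2.425425,
    "convenience stores": 0.637857,
}
intercept = 0.0  # ASSUMPTION: placeholder, the fitted intercept is not shown

def prob_above_average(features_st):
    """Probability of class 1 for standardized feature values."""
    z = intercept + sum(coefs[name] * value for name, value in features_st.items())
    return 1.0 / (1.0 + math.exp(-z))

# Two hypothetical listings that differ only in standardized MRT distance
near = {"transaction date": 0.0, "house age": 0.0,
        "MRT distance": -0.5, "convenience stores": 0.0}
far = {"transaction date": 0.0, "house age": 0.0,
       "MRT distance": 1.0, "convenience stores": 0.0}

print(round(prob_above_average(near), 3), round(prob_above_average(far), 3))
```

With the real model, `log_reg.predict_proba(X_new_st)[:, 1]` would return this probability directly, where `X_new_st` is a hypothetical name for new listings transformed with the fitted scaler.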

A realistic use case could be:
* A real estate website showing estimated prices for new listings based on the model's predictions.
* A simple tool for real estate agents to quickly assess the value of a property based on its features.


### What we learned from the models
We examined the coefficients of the standardized linear regression model. Because the input variables were standardized, the magnitudes of the coefficients can be compared directly. A coefficient with a higher absolute value has a stronger impact on the predicted price.

Features in order of their impact on the price prediction:
* **Distance to the nearest MRT station** - This feature has the strongest impact on the price prediction, with a negative coefficient indicating that as the distance to the nearest MRT station increases, the predicted price of the apartment decreases.
* **Number of convenience stores** - This feature has a strong positive coefficient, meaning that apartments located in areas with more nearby convenience stores tend to have higher prices.
* **House age** - This feature has a moderate negative coefficient, suggesting that older apartments tend to have lower prices compared to newer ones.
* **Transaction date** - This feature has a moderate positive coefficient, indicating that more recent transactions are associated with higher prices, which could reflect market trends over time.

The following tables show the features ranked by the absolute value of their coefficients for both standardized and non-standardized linear regression models, as well as the logistic regression model.

In [18]:
# List of models with their feature names and model names
models = [
    (model_st, X_train_st.columns, "Linear Regression (Standardized)"),
    (model, X_train.columns, "Linear Regression (Non-Standardized)"),
    (log_reg, X_train_st.columns, "Logistic Regression")
]

# Display the coefficients for each model in a table sorted by absolute coefficient value
for m, cols, name in models:
    coef = pd.Series(m.coef_.ravel(), index=cols)
    df = pd.DataFrame({
        "coefficient": coef,
        "abs_coefficient": coef.abs()
    }).sort_values("abs_coefficient", ascending=False) # sort by absolute coefficient value to show the most influential features regardless of direction

    print(f"\n{name}")
    display(df)



Linear Regression (Standardized)


| | coefficient | abs_coefficient |
|---|---|---|
| X3 distance to the nearest MRT station | -7.16655 | 7.16655 |
| X4 number of convenience stores | 3.674649 | 3.674649 |
| X2 house age | -2.603659 | 2.603659 |
| X1 transaction date | 1.834541 | 1.834541 |



Linear Regression (Non-Standardized)


| | coefficient | abs_coefficient |
|---|---|---|
| X1 transaction date | 6.497029 | 6.497029 |
| X4 number of convenience stores | 1.231769 | 1.231769 |
| X2 house age | -0.228729 | 0.228729 |
| X3 distance to the nearest MRT station | -0.00575 | 0.00575 |



Logistic Regression


| | coefficient | abs_coefficient |
|---|---|---|
| X3 distance to the nearest MRT station | -2.425425 | 2.425425 |
| X4 number of convenience stores | 0.637857 | 0.637857 |
| X2 house age | -0.521377 | 0.521377 |
| X1 transaction date | 0.188293 | 0.188293 |


Even though linear regression does not require standardization for prediction accuracy, standardization becomes necessary when coefficient magnitudes are interpreted as a measure of feature importance.
Without standardization, the coefficient sizes are influenced by the measurement units of the variables and cannot be directly compared.

For example, without standardization the transaction date is measured on a different scale (e.g. 2012.917) from the number of convenience stores (e.g. 5), and it varies by only about one unit across the whole dataset. This gives transaction date a much larger coefficient than number of convenience stores, even though the latter has the stronger relationship with the target variable according to the standardized model. Judging feature importance by raw coefficient size alone would therefore be misleading.
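
One way to see this without standardizing is to multiply each raw coefficient by the range its feature actually covers. The sketch below uses the printed non-standardized coefficients together with the minimum and maximum values from `X.describe()`. It is an illustration of scale effects under those copied numbers, not a formal importance measure.

```python
# name: (raw coefficient, min, max), copied from the outputs above
features = {
    "transaction date": (6.497029, 2012.667, 2013.583),
    "house age": (-0.228729, 0.0, 43.8),
    "MRT distance": (-0.00575023239, 23.38284, 6488.021),
    "convenience stores": (1.231769, 0.0, 10.0),
}

# Change in predicted price when a feature moves from its minimum to its maximum
effects = {name: coef * (hi - lo) for name, (coef, lo, hi) in features.items()}

for name, effect in sorted(effects.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name}: {effect:+.2f}")
```

Despite having the smallest raw coefficient, distance produces the largest swing across its observed range and transaction date the smallest, which is the same ordering the standardized coefficients give.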