# Task 3.1P

### Overview

As the UCI page states, "The real estate valuation is a regression problem. The market historical data set of real estate valuation are collected from Sindian Dist., New Taipei City, Taiwan." [1]

In [1]:
import numpy as np
import pandas as pd

from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import make_scorer, mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score, train_test_split

import warnings
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_excel("Real estate valuation data set.xlsx")
df

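|     | No  | X1 transaction date | X2 house age | X3 distance to the nearest MRT station | X4 number of convenience stores | X5 latitude | X6 longitude | Y house price of unit area |
|:----|:----|:--------------------|:-------------|:---------------------------------------|:--------------------------------|:------------|:-------------|:---------------------------|
| 0   | 1   | 2012.916667         | 32.0         | 84.87882                               | 10                              | 24.98298    | 121.54024    | 37.9                       |
| 1   | 2   | 2012.916667         | 19.5         | 306.59470                              | 9                               | 24.98034    | 121.53951    | 42.2                       |
| 2   | 3   | 2013.583333         | 13.3         | 561.98450                              | 5                               | 24.98746    | 121.54391    | 47.3                       |
| 3   | 4   | 2013.500000         | 13.3         | 561.98450                              | 5                               | 24.98746    | 121.54391    | 54.8                       |
| 4   | 5   | 2012.833333         | 5.0          | 390.56840                              | 5                               | 24.97937    | 121.54245    | 43.1                       |
| ... | ... | ...                 | ...          | ...                                    | ...                             | ...         | ...          | ...                        |
| 409 | 410 | 2013.000000         | 13.7         | 4082.01500                             | 0                               | 24.94155    | 121.50381    | 15.4                       |
| 410 | 411 | 2012.666667         | 5.6          | 90.45606                               | 9                               | 24.97433    | 121.54310    | 50.0                       |
| 411 | 412 | 2013.250000         | 18.8         | 390.96960                              | 7                               | 24.97923    | 121.53986    | 40.6                       |
| 412 | 413 | 2013.000000         | 8.1          | 104.81010                              | 5                               | 24.96674    | 121.54067    | 52.5                       |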


The dataset loads successfully. It has 414 rows and no missing values, as the df.info() output below confirms.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 414 entries, 0 to 413
Data columns (total 8 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   No                                      414 non-null    int64  
 1   X1 transaction date                     414 non-null    float64
 2   X2 house age                            414 non-null    float64
 3   X3 distance to the nearest MRT station  414 non-null    float64
 4   X4 number of convenience stores         414 non-null    int64  
 5   X5 latitude                             414 non-null    float64
 6   X6 longitude                            414 non-null    float64
 7   Y house price of unit area              414 non-null    float64
dtypes: float64(6), int64(2)
memory usage: 26.0 KB


The column Y house price of unit area is the target variable. All remaining columns are already numeric (float64 or int64), so no encoding is needed before modeling. The No column is only a row identifier and is excluded from the features.

We will use X1 transaction date, X2 house age, X3 distance to the nearest MRT station, X4 number of convenience stores, X5 latitude, and X6 longitude as features and Y house price of unit area as the target variable.

We will now split the data into train and test sets, train a linear regression model and evaluate its performance using the metrics Root Mean Squared Error (RMSE), Mean Absolute Error (MAE) and R-squared.

In [4]:
# Define the features (X) and the target variable (y)
X = df.drop(['Y house price of unit area', 'No'], axis=1)
y = df['Y house price of unit area']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [5]:
# Create and train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

In [6]:
# Evaluate the model
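# RMSE is the square root of MSE; taking the root explicitly avoids the
# `squared` argument, which is deprecated in recent scikit-learn releases
rmse = np.sqrt(mean_squared_error(y_test, y_pred))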
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Print the evaluation metrics
print(f'Root Mean Squared Error (RMSE): {rmse:.2f}')
print(f'Mean Absolute Error (MAE): {mae:.2f}')
print(f'R-squared (R2): {r2:.2f}')

Root Mean Squared Error (RMSE): 7.31
Mean Absolute Error (MAE): 5.31
R-squared (R2): 0.68


The linear regression model has an RMSE of 7.31, an MAE of 5.31 and an R-squared of 0.68 on the test set.
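To make the three metrics concrete, the following minimal sketch computes them directly from their definitions with NumPy. The helper name `regression_metrics` and the toy values are assumptions made for illustration; they are not taken from the dataset or from the notebook.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (rmse, mae, r2) computed directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    rmse = np.sqrt(np.mean(residuals ** 2))          # root of the mean squared residual
    mae = np.mean(np.abs(residuals))                 # mean absolute residual
    ss_res = np.sum(residuals ** 2)                  # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, r2

# Toy values, made up for illustration (not taken from the dataset)
toy_true = [40.0, 50.0, 30.0, 60.0]
toy_pred = [42.0, 47.0, 33.0, 58.0]

toy_rmse, toy_mae, toy_r2 = regression_metrics(toy_true, toy_pred)
print(f"RMSE={toy_rmse:.2f}, MAE={toy_mae:.2f}, R2={toy_r2:.2f}")
```

RMSE penalizes large residuals more heavily than MAE, so a gap between the two usually points to a few large errors.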

### Cross-validation

Cross-validation is a technique used to evaluate how well a machine learning model will perform on new, unseen data. It's like a test run for the model to make sure it's reliable before putting it to work in real-world applications.

Just like we wouldn't want to drive a car that hasn't been calibrated, we wouldn't want to use a machine learning model that hasn't been validated. Cross-validation helps ensure that the model is accurate and can be trusted to make predictions on data it hasn't encountered before. This is a crucial step in the development of any machine learning model, as it helps prevent errors and ensures that the model can be used with confidence.[3]

Now, we will evaluate the model's performance using leave-one-out cross-validation and 5-fold cross-validation, and then compare the results with the previous holdout method.

#### Leave-One-Out Cross-Validation (LOOCV)

The **Leave-One-Out Cross-Validation**, or **LOOCV**, procedure is used to estimate the performance of machine learning algorithms when they are used to make predictions on data not used to train the model.

It is computationally expensive, because the model is refit once for every sample, but it gives a reliable, nearly unbiased estimate of model performance. It is simple to use and needs no configuration, yet it should be avoided when the dataset is very large or the model is expensive to train.[2]

In [7]:
# Leave-One-Out Cross-Validation (LOOCV)
loocv = LeaveOneOut()
rmse_loocv, mae_loocv, r2_loocv = [], [], []

for train_index, test_index in loocv.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

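    # With a single test sample, the per-fold RMSE equals the absolute error
    rmse_loocv.append(np.sqrt(mean_squared_error(y_test, y_pred)))
    mae_loocv.append(mean_absolute_error(y_test, y_pred))
    # R-squared is undefined for a single sample, so this appends nan
    r2_loocv.append(r2_score(y_test, y_pred))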

# Print results
print("\nLeave-One-Out Cross-Validation:")
print(f"RMSE: Mean={sum(rmse_loocv)/len(rmse_loocv):.2f}, Std Dev={pd.Series(rmse_loocv).std():.2f}")
print(f"MAE: Mean={sum(mae_loocv)/len(mae_loocv):.2f}, Std Dev={pd.Series(mae_loocv).std():.2f}")
print(f"R2: Mean={sum(r2_loocv)/len(r2_loocv):.2f}, Std Dev={pd.Series(r2_loocv).std():.2f}")

print("\nHoldout Validation:")
print(f"RMSE: {rmse:.2f}")
print(f"MAE: {mae:.2f}")
print(f"R2: {r2:.2f}")


Leave-One-Out Cross-Validation:
RMSE: Mean=6.24, Std Dev=6.41
MAE: Mean=6.24, Std Dev=6.41
R2: Mean=nan, Std Dev=nan

Holdout Validation:
RMSE: 7.31
MAE: 5.31
R2: 0.68


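LOOCV provides the most thorough assessment but is computationally expensive, since it fits the model 414 times. Because each fold holds out a single sample, the per-fold RMSE reduces to the absolute error, which is why the RMSE and MAE rows above are identical (6.24). The R-squared values are all 'nan' for the same reason: R-squared compares the residuals with the variance of the test targets, and that variance is not defined for a single sample. The 'nan' is therefore an artifact of scoring each fold separately, not a sign of perfect predictions or overfitting. To obtain meaningful RMSE and R-squared values from LOOCV, the held-out predictions should be pooled and scored once.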
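One way to get well-defined LOOCV metrics is to collect the single held-out prediction from every fold and score the pooled predictions once. The sketch below does this with scikit-learn's `cross_val_predict`. It runs on synthetic data from `make_regression` as a stand-in (an assumption, so the numbers it prints say nothing about the real estate dataset), and the `_demo` and `loo_` names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Synthetic stand-in data (an assumption; substitute the real X and y)
X_demo, y_demo = make_regression(n_samples=60, n_features=4, noise=10.0, random_state=0)

# One out-of-sample prediction per row, each from a model trained on the other rows
loo_pred = cross_val_predict(LinearRegression(), X_demo, y_demo, cv=LeaveOneOut())

# Pool all held-out predictions, then compute each metric once
loo_rmse = np.sqrt(mean_squared_error(y_demo, loo_pred))
loo_mae = mean_absolute_error(y_demo, loo_pred)
loo_r2 = r2_score(y_demo, loo_pred)

print(f"LOOCV RMSE={loo_rmse:.2f}, MAE={loo_mae:.2f}, R2={loo_r2:.2f}")
```

Scored this way, RMSE and MAE are no longer forced to coincide and R-squared is computed over all samples rather than one at a time.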

#### K-fold cross-validation

In K-fold cross-validation, we split our data into K distinct groups, or "folds", to assess the model's performance on new data. With K = 5, for example, we have 5-fold cross-validation.

Here's how it works:
1. **Shuffle:** We randomly shuffle the dataset so that any inherent ordering does not influence the results.

2. **Divide:** We split the shuffled data into K folds of (approximately) equal size.

3. **Iterate:** We repeat the following process K times:
    - Select one fold as the testing set.
    - Use the remaining (K-1) folds as the training set.
    - Train the model on the training set.
    - Evaluate the model's performance on the testing set.
    - Record the evaluation scores (here RMSE, MAE and R-squared).

4. **Average the scores:** Finally, we average each score over the K iterations. The mean estimates how well the model is likely to perform on unseen data, and the standard deviation shows how much that estimate varies between folds.

This comprehensive approach allows us to utilize all our data for both training and testing, providing a more robust evaluation of our model's performance.[3]
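The procedure above can also be expressed compactly with scikit-learn's `cross_validate`, which handles the split, fit and score loop internally. This is a minimal sketch on synthetic data (an assumption, so the printed numbers are not results for this dataset); the `_demo` names are illustrative. scikit-learn's error scorers are negated so that higher is always better, which is why the sign is flipped before printing.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in data (an assumption; substitute the real X and y)
X_demo, y_demo = make_regression(n_samples=100, n_features=4, noise=10.0, random_state=0)

# Steps 1 and 2: shuffle, then divide into 5 folds
kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

# Step 3: fit and score once per fold, recording three metrics each time
scoring = {
    "rmse": "neg_root_mean_squared_error",
    "mae": "neg_mean_absolute_error",
    "r2": "r2",
}
cv_scores = cross_validate(LinearRegression(), X_demo, y_demo, cv=kf_demo, scoring=scoring)

# Step 4: average each metric over the folds
for name in scoring:
    fold_values = cv_scores[f"test_{name}"]
    if name != "r2":
        fold_values = -fold_values  # undo the sign flip applied by the "neg_" scorers
    print(f"{name}: mean={fold_values.mean():.2f}, std={fold_values.std(ddof=1):.2f}")
```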

In [8]:
# 5-Fold Cross-Validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
rmse_kfold, mae_kfold, r2_kfold = [], [], []

for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    rmse_kfold.append(np.sqrt(mean_squared_error(y_test, y_pred)))
    mae_kfold.append(mean_absolute_error(y_test, y_pred))
    r2_kfold.append(r2_score(y_test, y_pred))

print("\n5-Fold Cross-Validation:")
print(f"RMSE: Mean={sum(rmse_kfold)/len(rmse_kfold):.2f}, Std Dev={pd.Series(rmse_kfold).std():.2f}")
print(f"MAE: Mean={sum(mae_kfold)/len(mae_kfold):.2f}, Std Dev={pd.Series(mae_kfold).std():.2f}")
print(f"R2: Mean={sum(r2_kfold)/len(r2_kfold):.2f}, Std Dev={pd.Series(r2_kfold).std():.2f}")


5-Fold Cross-Validation:
RMSE: Mean=8.80, Std Dev=1.47
MAE: Mean=6.22, Std Dev=0.58
R2: Mean=0.57, Std Dev=0.10


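5-fold CV offers a good balance between computational efficiency and reliability, making it a suitable choice for this dataset: it needs 5 model fits instead of 414, and each test fold of roughly 83 samples is large enough for R-squared to be well defined.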

#### Holdout Validation

In [9]:
# Holdout Validation
print("\nHoldout Validation:")
print(f"RMSE: {rmse:.2f}")
print(f"MAE: {mae:.2f}")
print(f"R2: {r2:.2f}")


Holdout Validation:
RMSE: 7.31
MAE: 5.31
R2: 0.68


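In this specific case, the holdout method is likely to be less reliable because the dataset is small. With 414 samples, the estimate rests on a single test set of 83 rows, so it depends heavily on which rows land in that split. This is consistent with the holdout R-squared (0.68) being noticeably higher than the 5-fold mean (0.57).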
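The sensitivity of a single holdout split can be checked by repeating the split with different seeds and looking at the spread of the scores. The sketch below does this on synthetic data with the same number of rows and features (an assumption; the printed values are not results for the real estate data), and the seed range of 20 is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the dataset (414 rows, 6 features)
X_demo, y_demo = make_regression(n_samples=414, n_features=6, noise=25.0, random_state=0)

holdout_r2 = []
for seed in range(20):
    # Same 80/20 proportion as the notebook, but a different random split each time
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed
    )
    fitted = LinearRegression().fit(X_tr, y_tr)
    holdout_r2.append(r2_score(y_te, fitted.predict(X_te)))

holdout_r2 = np.array(holdout_r2)
print(f"Test rows per split: {len(X_te)}")
print(
    f"R2 across 20 splits: min={holdout_r2.min():.3f}, "
    f"max={holdout_r2.max():.3f}, std={holdout_r2.std(ddof=1):.3f}"
)
```

A wide gap between the minimum and maximum suggests that a single holdout score should be read with caution.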

Next, we will apply L1 and L2 regularization on the linear regression model developed using the same training set and compare the performance.

In [10]:
# Define the alpha values to test
alpha_values = [0.01, 0.1, 1, 10]

# Create dictionaries to store results
results_lasso = {'Alpha': [], 'RMSE': [], 'MAE': [], 'R2': []}
results_ridge = {'Alpha': [], 'RMSE': [], 'MAE': [], 'R2': []}

# Iterate over alpha values and regularization types.
# Note: if the cells above were run in order, X_train, X_test, y_train and y_test
# now hold the last 5-fold split (the CV loops reassign them), not the holdout split.
for alpha in alpha_values:
    # Lasso Regression
    lasso_model = Lasso(alpha=alpha)
    lasso_model.fit(X_train, y_train)
    y_pred_lasso = lasso_model.predict(X_test)

    results_lasso['Alpha'].append(alpha)
    results_lasso['RMSE'].append(np.sqrt(mean_squared_error(y_test, y_pred_lasso)))
    results_lasso['MAE'].append(mean_absolute_error(y_test, y_pred_lasso))
    results_lasso['R2'].append(r2_score(y_test, y_pred_lasso))

    # Ridge Regression
    ridge_model = Ridge(alpha=alpha)
    ridge_model.fit(X_train, y_train)
    y_pred_ridge = ridge_model.predict(X_test)

    results_ridge['Alpha'].append(alpha)
    results_ridge['RMSE'].append(np.sqrt(mean_squared_error(y_test, y_pred_ridge)))
    results_ridge['MAE'].append(mean_absolute_error(y_test, y_pred_ridge))
    results_ridge['R2'].append(r2_score(y_test, y_pred_ridge))

# Create DataFrames from the results dictionaries
df_results_lasso = pd.DataFrame(results_lasso)
df_results_ridge = pd.DataFrame(results_ridge)

# Print the results
print("\nLasso Regression Results:")
print(df_results_lasso.to_markdown(index=False, numalign="left", stralign="left"))

print("\nRidge Regression Results:")
print(df_results_ridge.to_markdown(index=False, numalign="left", stralign="left"))


Lasso Regression Results:
| Alpha   | RMSE    | MAE     | R2       |
|:--------|:--------|:--------|:---------|
| 0.01    | 11.3971 | 6.90525 | 0.477051 |
| 0.1     | 11.6514 | 7.0311  | 0.453455 |
| 1       | 11.7909 | 7.13714 | 0.440288 |
| 10      | 12.2256 | 7.63549 | 0.398262 |

Ridge Regression Results:
| Alpha   | RMSE    | MAE     | R2       |
|:--------|:--------|:--------|:---------|
| 0.01    | 11.3068 | 6.87873 | 0.485307 |
| 0.1     | 11.5034 | 6.96665 | 0.467256 |
| 1       | 11.6163 | 7.03074 | 0.45674  |
| 10      | 11.6552 | 7.02724 | 0.4531   |


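Neither L1 (Lasso) nor L2 (Ridge) regularization improves on the baseline linear regression model in these results: every regularized model has an RMSE above 11 and an R-squared below 0.49, against 7.31 and 0.68 for the baseline. The figures are not directly comparable, however. Assuming the cells were run in order, the cross-validation loops reassigned X_train, X_test, y_train and y_test, so the regularized models were fitted and scored on the last 5-fold split rather than on the original holdout split. Within the tables, RMSE and R-squared worsen steadily as alpha grows, the best setting for both methods is alpha=0.01, and Ridge is slightly ahead of Lasso on RMSE and MAE at every alpha.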
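For a like-for-like comparison, the baseline and the regularized models need to be scored on the same split. The sketch below keeps the holdout split under dedicated variable names so that later loops cannot overwrite it. It also standardizes the features before applying the penalty, which is an addition not present in the original notebook; it is included because L1 and L2 penalties act on coefficient size and are therefore sensitive to feature scale. The data is synthetic (an assumption) and all `_hold_` and `_demo` names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; substitute the real X and y)
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=15.0, random_state=0)

# Dedicated names for the holdout split so that no later loop can overwrite it
X_hold_train, X_hold_test, y_hold_train, y_hold_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Baseline plus Lasso and Ridge at the same alpha values as the notebook
candidates = {"OLS": LinearRegression()}
for alpha in [0.01, 0.1, 1, 10]:
    candidates[f"Lasso(alpha={alpha})"] = Lasso(alpha=alpha)
    candidates[f"Ridge(alpha={alpha})"] = Ridge(alpha=alpha)

holdout_rmse = {}
for label, estimator in candidates.items():
    # The scaler is fitted on the training rows only, inside the pipeline
    pipeline = make_pipeline(StandardScaler(), estimator)
    pipeline.fit(X_hold_train, y_hold_train)
    predictions = pipeline.predict(X_hold_test)
    holdout_rmse[label] = np.sqrt(mean_squared_error(y_hold_test, predictions))

for label, value in holdout_rmse.items():
    print(f"{label}: RMSE={value:.2f}")
```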

Based on the analysis above, we have the following findings:

1. **Baseline Linear Regression:** 
The initial linear regression model without regularization showed decent performance with RMSE = 7.31, MAE = 5.31, and R-squared = 0.68 on the test set.


2. **Cross-Validation:**
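    - Leave-One-Out Cross-Validation (LOOCV): Gave a mean absolute error of 6.24. The per-fold RMSE is identical to the MAE, and the per-fold R-squared is nan, because each test set contains a single sample. Neither indicates perfect predictions or overfitting.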
    
    - 5-Fold Cross-Validation: Provided a more reliable estimate of model performance with RMSE = 8.80, MAE = 6.22, and R-squared = 0.57. It's a good balance between computational efficiency and reliability for this dataset.


3. **Regularization:**
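    - Neither L1 (Lasso) nor L2 (Ridge) regularization improved on the baseline in the reported results: all regularized models had an RMSE above 11 and an R-squared below 0.49. The comparison is not like-for-like, though, because the regularized models appear to have been evaluated on the last 5-fold split rather than on the original holdout split.
    - Lasso (L1): Performed best at alpha = 0.01 (RMSE = 11.40, MAE = 6.91, R-squared = 0.48).
    - Ridge (L2): Performed best at alpha = 0.01 (RMSE = 11.31, MAE = 6.88, R-squared = 0.49) and was slightly ahead of Lasso at every alpha tested.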


4. **Overall:**
5-fold cross-validation is deemed the most reliable evaluation method for this dataset: it averages over five test sets instead of relying on one, and, unlike LOOCV scored one sample at a time, it yields a well-defined R-squared. The reported results do not show that regularization enhances the model's predictive capability; performance declined as alpha increased.

Between Lasso and Ridge, the results favor Ridge with alpha = 0.01 on all three metrics (RMSE, MAE and R-squared), although the margin over Lasso at the same alpha is small. Since alpha = 0.01 is the smallest value tested and is close to no regularization, a comparison against the baseline on the same train/test split is needed before preferring either regularized model.
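Rather than picking alpha from test-set scores, it can be selected by cross-validation on the training rows alone, leaving the test set for a final check. A minimal sketch using scikit-learn's `RidgeCV` and `LassoCV` with the same alpha grid follows. The data is synthetic (an assumption, so the selected alphas say nothing about the real estate data), and the `_fit`, `_eval` and `_demo` names are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; substitute the real X and y)
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=15.0, random_state=0)
X_fit, X_eval, y_fit, y_eval = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Same candidate values as the notebook's alpha_values
alpha_grid = [0.01, 0.1, 1, 10]

# Each estimator runs 5-fold CV over the grid using the training rows only
ridge_cv = RidgeCV(alphas=alpha_grid, cv=5).fit(X_fit, y_fit)
lasso_cv = LassoCV(alphas=alpha_grid, cv=5).fit(X_fit, y_fit)

# The held-out rows are touched once, after alpha has been chosen
print(f"Ridge: alpha={ridge_cv.alpha_}, test R2={ridge_cv.score(X_eval, y_eval):.3f}")
print(f"Lasso: alpha={lasso_cv.alpha_}, test R2={lasso_cv.score(X_eval, y_eval):.3f}")
```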

### References

1. https://archive.ics.uci.edu/dataset/477/real+estate+valuation+data+set
2. https://machinelearningmastery.com/loocv-for-evaluating-machine-learning-algorithms/
3. https://www.shiksha.com/online-courses/articles/k-fold-cross-validation/