This notebook is freely available for redistribution under the [GPL-3.0 license](https://choosealicense.com/licenses/gpl-3.0/).

Author: 蘇嘉冠

# [Exercise] Regression II

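**Tip: Things start to get more complex here, so it is enough to first understand what each function in this exercise does and how to use it. Come back to the implementation details later if you have the time and interest.**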

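## Demo 1: House Price Prediction (Non-Linearity Version)

We have the Boston suburban housing dataset ([source](https://archive.ics.uci.edu/ml/machine-learning-databases/housing/?C=N;O=D)) and want to predict the median value of owner-occupied homes in a town (`MEDV`) from a single feature of that town, with non-linearity added.

We read the dataset's CSV file into a pandas DataFrame, `df`. The columns have the following meanings:
- `CRIM`: per-capita crime rate of the town
- `ZN`: proportion of residential land zoned for lots over 25,000 square feet
- `INDUS`: proportion of non-retail business acres in the town
- `NOX`: nitric oxides concentration (parts per 10 million)
- `RM`: average number of rooms per dwelling
- `AGE`: proportion of owner-occupied units built before 1940
- `DIS`: weighted distance to five Boston employment centers
- `RAD`: index of accessibility to radial highways
- `TAX`: full-value property-tax rate per 10,000 US dollars
- `PTRATIO`: pupil-teacher ratio of the town
- `LSTAT`: percentage of the population classed as lower status (roughly, lower income)
- `MEDV`: median value of owner-occupied homes (unit: 1,000 US dollars)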




In [None]:
!pip install numpy pandas matplotlib scikit-learn

### Loading the Data

In [None]:
import pandas as pd

df = pd.read_csv(
    "https://raw.githubusercontent.com/AINTUT/code_2022/main/datasets/"
    "house_pricing.csv",
)

print(df)

### Data Preprocessing

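We define 3 functions for data preprocessing:
- `create_scaler()`: fits a standardization-based feature scaler on a batch of data and returns a `StandardScaler` object, which stores that batch's mean and standard deviation
- `apply_nonlinear()`: adds non-linearity to the data, to an extent set by `degree`; for example, `degree` = 3 produces the 1st, 2nd, and 3rd powers of each feature
- `preprocess_data()`: applies feature scaling first, then non-linearity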

In [3]:
import numpy as np
from sklearn.preprocessing import StandardScaler

def create_scaler(data):
    # Only fit here; the transformed output is not needed at this point.
    scaler = StandardScaler()
    scaler.fit(data)

    return scaler

def apply_nonlinear(data, degree):
    # For every input column, append its powers 1..degree as new columns.
    data_stacks = []
    for data_index in range(data.shape[1]):
        for degree_index in range(1, degree + 1):
            degree_data = data[:, data_index] ** degree_index
            data_stacks.append(degree_data)
    applied_data = np.stack(data_stacks, axis=1)

    return applied_data

def preprocess_data(data, scaler, degree):
    # Scale with the already-fitted scaler, then expand to polynomial terms.
    applied_data = scaler.transform(data)
    applied_data = apply_nonlinear(applied_data, degree)

    return applied_data
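
To make the helpers above concrete, here is a small sketch on made-up numbers (the arrays `train` and `test` are illustrative assumptions, not part of the dataset). It fits a `StandardScaler` on the training batch only, reuses the stored mean and standard deviation on new data, and then expands the scaled column into its powers in the same way `apply_nonlinear()` does for `degree` = 3.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature batches, for illustration only.
train = np.array([[4.0], [6.0], [8.0]])
test = np.array([[10.0]])

# Fit on the training batch; the scaler stores its mean and standard deviation.
scaler = StandardScaler().fit(train)
print(scaler.mean_, scaler.scale_)

# New data is scaled with the training statistics, not its own.
test_std = scaler.transform(test)

# Powers 1..3 of the scaled column, stacked as separate columns.
degree = 3
expanded = np.stack(
    [test_std[:, 0] ** d for d in range(1, degree + 1)],
    axis=1,
)
print(expanded.shape)
```

Reusing the training statistics for the test batch is the reason `run()` creates the scaler from `x_train` and passes it to `preprocess_data()` for both splits.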

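### Training and Evaluation

First we define 3 training-related functions: `predict()`, `calculate_loss()`, and `fit()`. They are essentially the same as in the previous exercise (Regression I); the main differences are:
1. The data can now have more than 1 input feature
2. `fit()` takes an additional `regular_lambda` parameter, the λ value of the regularization term in the loss function (`regular_lambda` = 0 means no regularization)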
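
Written out, the objective that the update rule in `fit()` descends can be sketched as follows. The notation is chosen here for illustration: $n$ samples, $d$ preprocessed features $x_{ij}$, bias $w_0$, and $\lambda$ standing for `regular_lambda`. The bias is left out of the penalty because the update for `weights[0]` carries no λ term.

$$
\hat{y}_i = w_0 + \sum_{j=1}^{d} w_j x_{ij}, \qquad
L(w) = \frac{1}{2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{d} w_j^2
$$

$$
\frac{\partial L}{\partial w_0} = -\sum_{i=1}^{n} (y_i - \hat{y}_i), \qquad
\frac{\partial L}{\partial w_j} = -\sum_{i=1}^{n} (y_i - \hat{y}_i)\, x_{ij} + 2 \lambda w_j
$$

Each epoch moves every weight by `learning_rate` times the negative of its partial derivative.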

In [4]:
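def predict(x, weights):
    # weights[0] is the bias; weights[1:] are the per-feature coefficients.
    return np.dot(x, weights[1:]) + weights[0]

def calculate_loss(y_gt, y_pred, weights, regular_lambda):
    # Half the sum of squared errors plus an L2 penalty. The bias weights[0]
    # is excluded from the penalty, matching the gradient step in fit().
    loss = \
        ((y_gt - y_pred) ** 2).sum() / 2.0 \
        + regular_lambda * (weights[1:] ** 2).sum()

    return loss

def fit(x_train, y_train, epoches, learning_rate, regular_lambda):
    weights = np.zeros(x_train.shape[1] + 1)
    losses = []

    for _ in range(epoches):
        y_pred = predict(x_train, weights)

        # Record the loss for the weights that produced y_pred, before they
        # are updated.
        losses.append(calculate_loss(y_train, y_pred, weights, regular_lambda))

        # Batch gradient descent step on the loss above.
        diff = y_train - y_pred
        weights[0] = weights[0] - learning_rate * -diff.sum()
        weights[1:] = \
            weights[1:] \
            - learning_rate \
            * (-x_train.T.dot(diff) + 2 * regular_lambda * weights[1:])

    return weights, losses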
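
One way to gain confidence in a hand-written gradient is a finite-difference check. The sketch below is an illustration on synthetic data, not part of the original exercise. The helper names `loss_fn` and `analytic_grad` are hypothetical; they use a loss whose gradient matches the update expressions in `fit()` (half the SSE plus λ times the squared non-bias weights) and compare that gradient with a numerical estimate.

```python
import numpy as np

def loss_fn(x, y, weights, lam):
    # Half the sum of squared errors plus an L2 penalty on non-bias weights.
    y_pred = x @ weights[1:] + weights[0]
    return ((y - y_pred) ** 2).sum() / 2.0 + lam * (weights[1:] ** 2).sum()

def analytic_grad(x, y, weights, lam):
    # Same expressions as the update in fit(), without the learning rate.
    diff = y - (x @ weights[1:] + weights[0])
    grad = np.empty_like(weights)
    grad[0] = -diff.sum()
    grad[1:] = -x.T @ diff + 2 * lam * weights[1:]
    return grad

rng = np.random.default_rng(0)
x = rng.normal(size=(20, 3))   # synthetic features
y = rng.normal(size=20)        # synthetic targets
weights = rng.normal(size=4)   # bias + 3 coefficients
lam = 0.5
eps = 1e-6

# Central finite differences, one weight at a time.
numeric_grad = np.zeros_like(weights)
for k in range(weights.size):
    step = np.zeros_like(weights)
    step[k] = eps
    numeric_grad[k] = (
        loss_fn(x, y, weights + step, lam) - loss_fn(x, y, weights - step, lam)
    ) / (2 * eps)

max_gap = np.abs(numeric_grad - analytic_grad(x, y, weights, lam)).max()
print(max_gap)
```

A gap close to zero suggests the analytic expressions and the loss agree; a large gap usually points to a sign or factor-of-2 mistake.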

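Next we define 2 evaluation-related functions. `reg_plot()` is the same as in the previous exercise (Regression I), and `predict_raw()` is similar to `predict()`. The difference is that `predict()` predicts from data that has already been preprocessed, whereas `predict_raw()` predicts from data that has not.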

In [5]:
import matplotlib.pyplot as plt

def predict_raw(x, scaler, degree, weights):
    preprocessed_x = preprocess_data(x, scaler, degree)
    y_pred = predict(preprocessed_x, weights)

    return y_pred

def reg_plot(x_gt, y_gt, x_fit, y_fit):
    plt.scatter(x_gt[:, 0], y_gt, c="steelblue", edgecolor="white")
    plt.plot(x_fit[:, 0], y_fit, c="black")

    plt.xlabel("RM")
    plt.ylabel("MEDV")

    plt.show()

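Next we use 1 function to run the whole training and evaluation flow:
1. Call `create_scaler()` and `preprocess_data()` to preprocess the data
2. Call `fit()` to train and obtain the trained weights
3. Plot the loss of every epoch
4. Plot the fit against the training data
5. Compute the MSE on the training data
6. Plot the fit against the testing data
7. Compute the MSE on the testing data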

In [17]:
from sklearn.metrics import mean_squared_error

def run(
    x_train,
    x_test,
    y_train,
    y_test,
    degree,
    epoches,
    learning_rate,
    regular_lambda=0,
    show_fit=True,
):
    scaler = create_scaler(x_train)

    x_train_std = preprocess_data(x_train, scaler, degree)
    x_test_std = preprocess_data(x_test, scaler, degree)

    weights, losses = fit(
        x_train_std,
        y_train,
        epoches,
        learning_rate,
        regular_lambda,
    )

    plt.plot(range(1, epoches + 1), losses)
    plt.ylabel("SSE")
    plt.xlabel("Epoch")
    plt.show()

    print("Losses (degree = {}, regular_lambda = {})".format(
        degree,
        regular_lambda,
    ))

    if show_fit and (x_train.shape[1] == 1):
        x_fit = np.arange(x_train.min(), x_train.max(), 0.1)[:, np.newaxis]
        y_fit = predict_raw(x_fit, scaler, degree, weights)
        reg_plot(x_train, y_train, x_fit, y_fit)

    y_pred = predict(x_train_std, weights)
    mse_train = mean_squared_error(y_train, y_pred)
    print("MSE of Training Data (degree = {}, regular_lambda = {}): {}".format(
        degree,
        regular_lambda,
        mse_train,
    ))

    if show_fit and (x_train.shape[1] == 1):
        x_fit = np.arange(x_test.min(), x_test.max(), 0.1)[:, np.newaxis]
        y_fit = predict_raw(x_fit, scaler, degree, weights)
        reg_plot(x_test, y_test, x_fit, y_fit)

    y_pred = predict(x_test_std, weights)
    mse_test = mean_squared_error(y_test, y_pred)
    print("MSE of Testing Data (degree = {}, regular_lambda = {}): {}".format(
        degree,
        regular_lambda,
        mse_test,
    ))

    return mse_train, mse_test

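Now we actually run the training and evaluation flow several times and record the MSE of each run.

The main difference between runs is the degree of non-linearity, from 1 to 5 (1 means no non-linearity; 5 means each input feature is expanded into its 1st through 5th powers). Each tuple in `hyper_parameters` is (`degree`, `epoches`, `learning_rate`).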

In [None]:
from sklearn.model_selection import train_test_split

hyper_parameters = [
    (1, 20, 0.001),
    (2, 60, 0.001),
    (3, 180, 0.0001),
    (4, 1800, 0.00001),
    (5, 15000, 0.000001),
]

x_data = df[["RM"]].to_numpy()
y_data = df["MEDV"].to_numpy()

x_train, x_test, y_train, y_test = train_test_split(
    x_data,
    y_data,
    test_size=0.2,
    random_state=0,
)

mse_train_list = []
mse_test_list = []

for hyper_parameter in hyper_parameters:
    degree, epoches, learning_rate = hyper_parameter
    mse_train, mse_test = \
        run(x_train, x_test, y_train, y_test, degree, epoches, learning_rate)

    mse_train_list.append(mse_train)
    mse_test_list.append(mse_test)

Finally, we plot the MSE for each degree of non-linearity.

In [None]:
def mse_degrees_plot(degrees, mse_train_list, mse_test_list):
    plt.plot(degrees, mse_train_list, c="green", label="Training Data")
    plt.plot(degrees, mse_test_list, c="red", label="Testing Data")

    plt.xticks(degrees)

    plt.xlabel("Degree")
    plt.ylabel("MSE")

    plt.legend(loc="upper right")

    plt.show()

degrees = [p[0] for p in hyper_parameters]
mse_degrees_plot(degrees, mse_train_list, mse_test_list)

print("Degree\tTraining MSE\t\tTesting MSE")
for degree, mse_train, mse_test in zip(degrees, mse_train_list, mse_test_list):
    print("{}\t{}\t{}".format(degree, mse_train, mse_test))

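## Demo 2: House Price Prediction (Multi-Variable Version)

We now consider more than one input feature and try to find which input features work best.

First, we pick the 5 features most related to the prediction target, the house price (`MEDV`), and draw a scatter plot of each of these 5 features against the house price.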

In [None]:
features = ["RM", "LSTAT", "PTRATIO", "INDUS", "TAX"]

for feature in features:
    df.plot.scatter(x=feature, y="MEDV")
    plt.show()
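
The notebook does not say how these 5 features were chosen. One common way to rank candidates, shown here purely as an assumption, is the absolute Pearson correlation with the target. The sketch uses a synthetic DataFrame `toy` with hypothetical columns `a`, `b`, `c`, and `target`; the same calls could be pointed at `df` and `MEDV`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; all column names here are hypothetical.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": b,
    "c": c,  # unrelated to the target by construction
    "target": 3.0 * a - 2.0 * b + rng.normal(scale=0.5, size=200),
})

# Absolute Pearson correlation of each column with the target, strongest first.
ranking = (
    toy.corr()["target"]
    .drop("target")
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```

Correlation only captures linear association, so the scatter plots remain useful for spotting curved relationships that a degree-3 expansion might pick up.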

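Next we run the training and evaluation flow several times and record the MSE of each run. The main difference between runs is the number of input features, from 1 to 5.

Note: Demo 1 showed that degree-3 non-linearity works best, so we reuse that setting here.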

In [None]:
degree = 3
epoches = 180
learning_rate = 0.0001

mse_train_list = []
mse_test_list = []

for num_features in range(1, len(features) + 1):
    selected_features = features[0:num_features]
    x_data = df[selected_features].to_numpy()
    y_data = df["MEDV"].to_numpy()

    x_train, x_test, y_train, y_test = train_test_split(
        x_data,
        y_data,
        test_size=0.2,
        random_state=0,
    )

    print("=== Selected features: {} ===".format(selected_features))
    mse_train, mse_test = run(
        x_train,
        x_test,
        y_train,
        y_test,
        degree,
        epoches,
        learning_rate,
        show_fit=False,
    )
    print("")

    mse_train_list.append(mse_train)
    mse_test_list.append(mse_test)

Finally, we print the MSE for each number of input features.

In [None]:
print("# of Features\tTraining MSE\t\tTesting MSE")
for index in range(len(features)):
    num_features = index + 1
    selected_features = features[0:num_features]
    mse_train = mse_train_list[index]
    mse_test = mse_test_list[index]

    print("{}\t\t{}\t{}".format(num_features, mse_train, mse_test))

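## Demo 3: House Price Prediction (Regularization)

Neither Demo 1 nor Demo 2 used regularization (λ was 0). Now let's see how adding regularization affects the results.

We run the training and evaluation flow several times and record the MSE of each run. The main difference between runs is the value of λ, from 0 to 128.

Note: Demos 1 and 2 showed that degree-3 non-linearity with `RM` + `LSTAT` as input features works best, so we reuse that setting here.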

In [None]:
regular_lambdas = [0, 1, 2, 4, 8, 16, 32, 64, 128]

degree = 3
epoches = 180
learning_rate = 0.0001

features = ["RM", "LSTAT"]

mse_train_list = []
mse_test_list = []

for regular_lambda in regular_lambdas:
    x_data = df[features].to_numpy()
    y_data = df["MEDV"].to_numpy()

    x_train, x_test, y_train, y_test = train_test_split(
        x_data,
        y_data,
        test_size=0.2,
        random_state=0,
    )

    mse_train, mse_test = run(
        x_train,
        x_test,
        y_train,
        y_test,
        degree,
        epoches,
        learning_rate,
        regular_lambda=regular_lambda,
        show_fit=False,
    )

    mse_train_list.append(mse_train)
    mse_test_list.append(mse_test)

Finally, we plot and print the MSE for each value of λ.

In [None]:
def mse_regular_plot(regular_lambdas, mse_train_list, mse_test_list):
    plt.plot(regular_lambdas, mse_train_list, c="green", label="Training Data")
    plt.plot(regular_lambdas, mse_test_list, c="red", label="Testing Data")

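    # Set the scale before the ticks, since changing the scale resets them.
    # "symlog" keeps lambda = 0 on the axis, which a plain log scale cannot.
    plt.xscale("symlog", base=2, linthresh=1)
    plt.xticks(regular_lambdas)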

    plt.xlabel("Lambda for Regularization")
    plt.ylabel("MSE")

    plt.legend(loc="upper right")

    plt.show()

mse_regular_plot(regular_lambdas, mse_train_list, mse_test_list)

print("Regular Lambda\tTraining MSE\t\tTesting MSE")
for regular_lambda, mse_train, mse_test in \
    zip(regular_lambdas, mse_train_list, mse_test_list):
    print("{}\t\t{}\t{}".format(regular_lambda, mse_train, mse_test))
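
To see the mechanism behind these numbers in isolation, the sketch below solves the regularized problem in closed form on synthetic data and prints the weights for each λ. It is an illustration under assumptions: the data is made up, and the notebook's `fit()` runs a fixed number of gradient descent epochs, so it may not reach this exact minimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 3))  # synthetic features
y = x @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

# Center both so the bias can be dropped from this illustration.
x = x - x.mean(axis=0)
y = y - y.mean()

norms = []
for lam in [0, 1, 2, 4, 8, 16, 32, 64, 128]:
    # Minimizer of SSE / 2 + lam * ||w||^2: solve (X^T X + 2 lam I) w = X^T y.
    w = np.linalg.solve(x.T @ x + 2 * lam * np.eye(x.shape[1]), x.T @ y)
    norms.append(np.linalg.norm(w))
    print(lam, w.round(3))
```

The weight norm shrinks as λ grows, which trades some training fit for smaller coefficients.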

## Exercise

Try using `scikit-learn`'s [Ridge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html) to run one training with `RM` plus `LSTAT` as input features and degree 3
- Ridge is used much like LinearRegression; see the `改用 Scikit-Learn 做 Training` (training with Scikit-Learn instead) part of [the previous exercise (Regression I)](https://colab.research.google.com/drive/15fTirCx1ejkr_K-cghE5rRBkRNqSGa-C?usp=sharing)

A reference answer follows; try it yourself before looking at it

In [None]:
from sklearn.linear_model import Ridge

features = ["RM", "LSTAT"]
degree = 3

x_data = df[features].to_numpy()
y_data = df["MEDV"].to_numpy()

x_train, x_test, y_train, y_test = train_test_split(
    x_data,
    y_data,
    test_size=0.2,
    random_state=0,
)

x_train = apply_nonlinear(x_train, degree)
x_test = apply_nonlinear(x_test, degree)

lr = Ridge()
lr.fit(x_train, y_train)

y_pred = lr.predict(x_train)
mse = mean_squared_error(y_train, y_pred)

print("MSE of Training Data: {}".format(mse))

y_pred = lr.predict(x_test)
mse = mean_squared_error(y_test, y_pred)

print("MSE of Testing Data: {}".format(mse))
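
The reference answer uses `Ridge()` with its default `alpha=1.0`. `Ridge` minimizes the sum of squared errors plus `alpha` times the squared coefficient norm, so `alpha` should correspond to 2λ in a loss of the form SSE / 2 + λ‖w‖², which is the form used earlier in this notebook. The sketch below checks that correspondence on synthetic data against a closed-form solution; the data and the chosen `lam` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 2))  # synthetic features
y = x @ np.array([2.0, -1.0]) + 3.0 + rng.normal(scale=0.1, size=50)

lam = 4.0

# scikit-learn: alpha = 2 * lam under the loss SSE / 2 + lam * ||w||^2.
model = Ridge(alpha=2 * lam).fit(x, y)

# Closed form on centered data (the intercept is not penalized).
xc = x - x.mean(axis=0)
yc = y - y.mean()
w = np.linalg.solve(xc.T @ xc + 2 * lam * np.eye(2), xc.T @ yc)

print(model.coef_, w)
```

Passing a different `alpha` to `Ridge` is therefore the scikit-learn counterpart of changing `regular_lambda` in Demo 3.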