# Multiple Linear Regression (MLR)
### SCAICT (中電會) March themed lecture (2024/3/23)

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/ChiuDeYuan/SCAICT_lecture/blob/main/0323/housing_price_MLR.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/ChiuDeYuan/SCAICT_lecture/blob/main/0323/housing_price_MLR.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View on GitHub</a>
  </td>
</table>

## Useful links

* Housing Prices Dataset : https://www.kaggle.com/datasets/yasserh/housing-prices-dataset
* Linear Models (Scikit-learn) : https://scikit-learn.org/stable/modules/linear_model.html

## Imports
Build the models with sklearn and statsmodels.

In [None]:
# sklearn
from sklearn import linear_model
from sklearn.metrics import r2_score
from sklearn.utils import shuffle
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler

In [None]:
# statsmodels
import statsmodels.api as sm

In [None]:
# Plotting
import matplotlib.pyplot as plt

# Numerical computation
import numpy as np

# Reading and handling data
import pandas as pd

# Plotting (statistical charts)
import seaborn as sns

## Load data

In [None]:
dataset_path = 'https://raw.githubusercontent.com/ChiuDeYuan/SCAICT_lecture/main/datasets/Housing.csv'
dataset = pd.read_csv(dataset_path)

In [None]:
# Show the first few rows
dataset.head()

In [None]:
# Check the shape of the dataset
dataset.shape

## Convert yes/no to 1/0
Only numbers can take part in the arithmetic.

In [None]:
# Features to convert
mapped_var = ['mainroad', 'guestroom', 'basement', 'hotwaterheating', 'airconditioning', 'prefarea']

# Define the conversion
def map_func(x):
    return x.map({'yes': 1, 'no': 0})

In [None]:
dataset[mapped_var] = dataset[mapped_var].apply(map_func)

## Drop the feature furnishingstatus
Keeping it would require one-hot encoding, which is a bit of a hassle.

In [None]:
dataset = dataset.drop('furnishingstatus', axis=1)
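
For reference, one-hot encoding a multi-category column does not take much code. The sketch below uses `pd.get_dummies` on a small made-up frame; the rows and category values are assumptions for illustration and are not read from the dataset.

```python
import pandas as pd

# Toy frame standing in for the real data; values are assumed for illustration.
toy = pd.DataFrame({
    "area": [7420, 8960, 9960, 7500],
    "furnishingstatus": ["furnished", "semi-furnished", "unfurnished", "furnished"],
})

# drop_first=True drops one category so the dummy columns are not
# perfectly collinear with the constant term of a linear model.
encoded = pd.get_dummies(toy, columns=["furnishingstatus"], drop_first=True, dtype=int)
print(encoded)
```

Each remaining category becomes its own 0/1 column, which could then be fed to the model like the other binary features.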

In [None]:
dataset.head()

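## Scale the data
Use the `StandardScaler` provided by sklearn.<br>
`StandardScaler` subtracts the mean from each column and then divides by the standard deviation.<br><br>
Side note:<br>
Strictly speaking, the scaler should be fitted on the training set only and then applied to the test set,<br>
because the two sets are supposed to be independent.<br>
I was too lazy to split them first, so everything is scaled together here ╮(╯▽╰)╭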
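
The side note above can be sketched roughly as follows. This is a minimal illustration on synthetic data (the array, the 70/30 split and the variable names are assumptions, not part of this notebook): the scaler learns its statistics from the training rows and reuses them on the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for a numeric feature matrix (100 rows, 2 columns).
data = rng.normal(loc=[50.0, 3.0], scale=[10.0, 1.0], size=(100, 2))
train, test = data[:70], data[70:]

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # mean and std come from the training rows only
test_scaled = scaler.transform(test)        # the test rows reuse those training statistics

# StandardScaler computes (x - mean) / std with the population std (ddof=0).
manual = (train - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(train_scaled, manual))
```

With this order, no information from the test rows leaks into the statistics used for training.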

In [None]:
# Define the scaler
scaler = StandardScaler()

# Columns to scale (named scaled_vars so the built-in vars() is not shadowed)
scaled_vars = ['area', 'bedrooms', 'bathrooms', 'stories', 'parking', 'price']

In [None]:
dataset[scaled_vars] = scaler.fit_transform(dataset[scaled_vars])

In [None]:
dataset.head()

In [None]:
# Summary statistics
dataset.describe()

## Feature selection
There are quite a few features,<br>
so the less important ones are removed<br>
to keep them from disturbing the model.

### Heatmap
Plot the correlation coefficient between every pair of features.

In [None]:
plt.figure(figsize = (8, 5))
sns.heatmap(dataset.corr(), annot = True, cmap="PuBuGn")
plt.show()
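
The same correlation matrix can also be read numerically. The sketch below is an assumption-laden illustration on a made-up frame (the `noise` column and the generated values are invented): it ranks the features by the absolute value of their correlation with `price`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
area = rng.normal(size=200)
toy = pd.DataFrame({
    "area": area,
    "noise": rng.normal(size=200),
    "price": 2.0 * area + 0.1 * rng.normal(size=200),
})

# Correlation of every feature with the target.
corr = toy.corr()["price"].drop("price")

# Order by strength (absolute value) but keep the sign in the result.
order = corr.abs().sort_values(ascending=False).index
ranked = corr.loc[order]
print(ranked)
```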

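### Feature selection with RFE
RFE (recursive feature elimination) first trains a linear model,<br>
then removes the feature whose regression coefficient has the smallest magnitude and retrains.<br>
Repeating this leaves the n most important features.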
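
The elimination loop described above can be sketched by hand and compared with sklearn's `RFE`. This is only an illustration on synthetic data: `manual_rfe` is a hypothetical helper written for this example, and the ground-truth coefficients are made up.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
# Made-up ground truth: columns 0 and 2 matter, columns 1 and 3 barely do.
y = 3.0 * X[:, 0] + 0.01 * X[:, 1] - 2.0 * X[:, 2] + 0.1 * rng.normal(size=300)

def manual_rfe(X, y, n_keep):
    """Repeatedly drop the feature whose coefficient has the smallest magnitude."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        model = LinearRegression().fit(X[:, remaining], y)
        weakest = int(np.argmin(np.abs(model.coef_)))
        del remaining[weakest]
    return remaining

kept = manual_rfe(X, y, n_keep=2)
rfe = RFE(LinearRegression(), n_features_to_select=2).fit(X, y)
print(kept, list(np.flatnonzero(rfe.support_)))
```

Comparing coefficient magnitudes is only meaningful when the features are on comparable scales, which is one reason to scale the data first.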

In [None]:
# Separate the labels from the features
dataset_y = dataset.pop('price')
dataset_x = dataset

In [None]:
# Define the linear model
reg = linear_model.LinearRegression(fit_intercept = True)

In [None]:
# Run RFE
# Keep the 5 most important features
rfe = RFE(reg, n_features_to_select=5)
rfe = rfe.fit(dataset_x, dataset_y)

In [None]:
# Feature name / whether it was selected / importance ranking (1 = selected)
list(zip(dataset_x.columns, rfe.support_, rfe.ranking_))

In [None]:
col = dataset_x.columns[rfe.support_]
col

In [None]:
# Keep only the selected features
dataset_x = dataset_x[col]

In [None]:
dataset_x.head()

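## Add the constant term
A linear model only has $\theta_0$ as its constant term because $x_0=1$,<br>
so a 1 has to be added to every sample as $x_0$.<br>
Unlike sklearn's `LinearRegression`, `sm.OLS` does not add this intercept column automatically.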

In [None]:
dataset_x = sm.add_constant(dataset_x)
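
To see why the column of ones matters, here is a minimal NumPy sketch on a made-up, noise-free target (all numbers are assumptions for illustration). With $x_0=1$ prepended, the model $\hat{y}=\theta_0 x_0+\theta_1 x_1+\theta_2 x_2$ can recover the intercept as $\theta_0$. For a DataFrame, `sm.add_constant` does the equivalent by adding a column named `const`.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 4.0 + 1.5 * X[:, 0] - 0.5 * X[:, 1]  # noise-free toy target with intercept 4.0

# Prepend x_0 = 1 to every row, so theta_0 multiplies a constant.
X_with_const = np.column_stack([np.ones(len(X)), X])

# Least-squares fit; theta[0] plays the role of the intercept.
theta, *_ = np.linalg.lstsq(X_with_const, y, rcond=None)
print(theta)
```

Without the extra column, the fitted plane would be forced through the origin.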

## Split the dataset
Split the dataset into a training set and a test set.

In [None]:
# Shuffle first
dataset_x, dataset_y = shuffle(dataset_x, dataset_y, random_state=0)

In [None]:
# Use the last 30 rows as the test set
dataset_x_train = dataset_x[:-30]
dataset_x_test = dataset_x[-30:]

dataset_y_train = dataset_y[:-30]
dataset_y_test = dataset_y[-30:]
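
As an alternative to shuffling and slicing by hand, sklearn's `train_test_split` can do both in one call. The sketch below uses toy arrays (assumed for illustration); it is not guaranteed to produce the same rows as the manual split above, only a split of the same sizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)  # toy features: 100 rows
y = np.arange(100)                  # toy labels, one per row

# An integer test_size is the absolute number of test rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=30, shuffle=True, random_state=0
)
print(X_train.shape, X_test.shape)
```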

In [None]:
print(f"{dataset_x_train.shape}\n{dataset_x_test.shape}")

## Train the model

In [None]:
reg = sm.OLS(dataset_y_train,dataset_x_train).fit()

## Predict & evaluate

In [None]:
print(reg.summary())

In [None]:
prediction = reg.predict(dataset_x_test)

The higher the R² score, the better the model fits the data (1 is a perfect fit).

In [None]:
r2_score(dataset_y_test, prediction)
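
For intuition, the R² score is $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$. A minimal sketch computing it by hand follows; the four values are made up for illustration and do not come from this dataset.

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean
r2 = 1.0 - ss_res / ss_tot
print(r2)
```

A model that always predicts the mean of `y_true` scores 0, and on a test set the score can even be negative if the predictions are worse than that baseline.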