# Linear Regression

## This notebook is for learning linear regression, covering Ordinary Linear Regression, Ridge Regression, Lasso Regression, and ElasticNet Regression.

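* Machine learning focuses on a model's predictive power: we spend most of our effort tuning hyperparameters and trying various complex models (complex models tend to predict well, but it is harder to explain why the model came out the way it did).
* Statistical learning focuses on whether the model has enough explanatory power: we lean toward running hypothesis tests to confirm that the model explains the data well (predictive power is not necessarily the top priority).
* Here we cover regression the machine learning way. If you want to take the statistical learning approach, use the ```statsmodels``` package: https://www.statsmodels.org/stable/index.html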

---

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Reference:
* https://github.com/rasbt/python-machine-learning-book-2nd-edition/blob/master/code/ch10/ch10.ipynb

# Load the housing data and do a little EDA (Exploratory Data Analysis)

Housing data source: https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

Now let's load the data into a Pandas DataFrame:

In [None]:
# Load the data into a Pandas DataFrame

data_url = "https://github.com/rasbt/python-machine-learning-book-2nd-edition/raw/master/code/ch10/housing.data.txt"
# sep=r"\s+" splits on any whitespace (delim_whitespace is deprecated in recent pandas)
df = pd.read_csv(data_url, sep=r"\s+", header=None)
df.columns = ['CRIM', 'ZN', 'INDUS', 'CHAS','NOX', 'RM', 'AGE', 'DIS', 'RAD','TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']

df.head(5) # show the first five rows

Plot the correlation between every pair of columns:

In [None]:
plt.figure(figsize=(12,12),dpi=200)
sns.heatmap( df.corr(),cmap="Blues",
             vmin=-1,
             vmax=1,
             square=True,
             annot=True)
plt.show()
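
The heatmap above shows what `df.corr()` returns, which by default is the Pearson correlation coefficient. As a rough illustration of what that number measures, here is a minimal NumPy sketch on made-up arrays (the `pearson` helper and the toy values are hypothetical, not part of the housing data). Note the last case: a clear but nonlinear relationship can still have a correlation of zero.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation: covariance divided by the product of the spreads."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    return (a_centered @ b_centered) / np.sqrt(
        (a_centered @ a_centered) * (b_centered @ b_centered)
    )

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(pearson(x, 2 * x + 1))     # close to  1: perfectly increasing line
print(pearson(x, -x))            # close to -1: perfectly decreasing line
print(pearson(x, (x - 3) ** 2))  # close to  0: symmetric parabola, no linear trend
```

This is one reason to look at the pair plots as well as the heatmap: a low correlation only rules out a linear relationship.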

In [None]:
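# pairplot creates its own figure, so set the size with `height`
# rather than calling plt.figure (which would only add an empty figure)
sns.pairplot(data=df, vars=["MEDV", "LSTAT", "RM"], height=4)
plt.show()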

# Split the data into training and test sets, then standardize

In [None]:
from sklearn.model_selection import train_test_split

x = df.iloc[:, :-1].values
y = df['MEDV'].values

train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.3)

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()
scaler = scaler.fit(train_x)

train_x = scaler.transform(train_x)
test_x = scaler.transform(test_x)
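
As a rough illustration of what the scaling step does, the sketch below reproduces standardization with plain NumPy on a small made-up array (`toy_train` and `toy_test` are hypothetical stand-ins, not the housing data). The point to notice is that the mean and standard deviation are estimated on the training split only and then reused on the test split.

```python
import numpy as np

# Hypothetical toy data standing in for train_x / test_x
toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
toy_test = np.array([[2.0, 40.0]])

# Statistics come from the training split only
mean = toy_train.mean(axis=0)
std = toy_train.std(axis=0)  # population std (ddof=0), which StandardScaler also uses

toy_train_scaled = (toy_train - mean) / std
toy_test_scaled = (toy_test - mean) / std  # reuse the training statistics

print(toy_train_scaled.mean(axis=0))  # approximately 0 per column
print(toy_train_scaled.std(axis=0))   # approximately 1 per column
print(toy_test_scaled)                # test values are not guaranteed to be centered
```

Fitting the scaler on the test data as well would leak information about the test set into the preprocessing.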

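# Linear regression

## 1. Ordinary linear regression

First, try LSTAT alone (the last column of x) and see how well it explains y, that is, how strongly it is linearly related to y.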

In [None]:
regressor = LinearRegression()
model = regressor.fit(train_x[:,[-1,]],train_y)

### Evaluating the regression with $R^2$:

In [None]:
from sklearn.metrics import r2_score

In [None]:
pred_y = model.predict(test_x[:,[-1,]])
r2_score(test_y,pred_y)

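$R^2$ generally lies between 0 and 1. On held-out data it can be negative when the model predicts worse than always guessing the mean of y.

$R^2$ can be read as: $R^2\times 100\%$ of the variance in y is explained by x.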
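
To make the definition concrete, here is a minimal sketch that computes $R^2 = 1 - SS_{res}/SS_{tot}$ directly with NumPy. The `r2` helper and the toy numbers are made up for illustration.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
print(r2(y_true, y_true))                # → 1.0  (perfect predictions)
print(r2(y_true, [2.5, 2.5, 2.5, 2.5]))  # → 0.0  (always predicting the mean)
print(r2(y_true, [4.0, 3.0, 2.0, 1.0]))  # → -3.0 (worse than the mean)
```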

In [None]:
# Exercise: Outliers generally make a model fit worse. Try removing the outliers, retrain the model, and see whether $R^2$ improves.
#
#
#

In [None]:
# Exercise: run the regression on all features and compute R^2
#
# regressor = LinearRegression()
# model = regressor.fit(...)

In [None]:
# Exercise: run the regression on all features (excluding any that may be collinear) and compute R^2
#
# regressor = LinearRegression()
# model = regressor.fit(...)

In [None]:
# Exercise: run the regression on RM and compute R^2
#
# regressor = LinearRegression()
# model = regressor.fit(...)

In [None]:
# Exercise: run the regression on B and compute R^2
#
# regressor = LinearRegression()
# model = regressor.fit(...)

## 2. Polynomial regression

$y \sim \alpha~x_{LSTAT} + \beta~x_{LSTAT}^2+\gamma$
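
The model above is still linear in its coefficients, so ordinary least squares can fit it once $x$ and $x^2$ are placed in separate columns. Here is a minimal NumPy sketch on synthetic data. The true coefficients 2.0, 0.5, and 1.0 are made up for the illustration and have nothing to do with the housing data.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=200)
# Synthetic target with alpha=2.0, beta=0.5, gamma=1.0 plus a little noise
y = 2.0 * x + 0.5 * x**2 + 1.0 + rng.normal(scale=0.1, size=x.size)

# Design matrix with columns [x, x^2, 1]; the column of ones carries gamma
design = np.column_stack([x, x**2, np.ones_like(x)])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
alpha, beta, gamma = coef
print(alpha, beta, gamma)  # expected to land near 2.0, 0.5, 1.0
```

The code cells that follow do the same thing with `LinearRegression`, which handles the intercept itself, so only the $x$ and $x^2$ columns need to be stacked.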

In [None]:
train_lstat_and_square = np.hstack([train_x[:,[-1,]],train_x[:,[-1,]]**2])
test_lstat_and_square = np.hstack([test_x[:,[-1,]],test_x[:,[-1,]]**2])

In [None]:
regressor = LinearRegression()
model = regressor.fit(train_lstat_and_square,train_y)

In [None]:
pred_y = model.predict(test_lstat_and_square)
r2_score(test_y,pred_y)

In [None]:
plt.scatter(test_x[:,-1],test_y,label="test_y")
plt.scatter(test_x[:,-1],pred_y,label="pred_y")
plt.legend()

plt.xlabel("$x_{LSTAT}$")
plt.ylabel("$y$")
plt.show()

In [None]:
tmp = pd.DataFrame(np.vstack([train_x[:,-1]**1,train_x[:,-1]**2]).T)

In [None]:
tmp.corr()

## 3. Lasso Regression

First, fit a polynomial regression with many terms:

In [None]:
# Powers 1 through 9 of LSTAT: nine terms in total
train_lstat_and_square = np.hstack([train_x[:,[-1,]]**j for j in range(1,10)])
test_lstat_and_square = np.hstack([test_x[:,[-1,]]**j for j in range(1,10)])

In [None]:
regressor = LinearRegression()
model = regressor.fit(train_lstat_and_square,train_y)

In [None]:
pred_y = model.predict(test_lstat_and_square)
r2_score(test_y,pred_y)

Look at the weights and the bias (intercept):

In [None]:
model.coef_

In [None]:
model.intercept_

In [None]:
from sklearn.linear_model import Lasso

In [None]:
regressor = Lasso(max_iter=100000)
regressor

In [None]:
model = regressor.fit(train_lstat_and_square,train_y)
pred_y = model.predict(test_lstat_and_square)
r2_score(test_y,pred_y)

Look at the weights and the bias (intercept):

In [None]:
[*regressor.coef_]

In [None]:
model.intercept_
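
Lasso often drives some coefficients to exactly zero, unlike ordinary least squares. To see where the zeros come from, here is a minimal coordinate-descent sketch in NumPy. It assumes the objective that scikit-learn documents for `Lasso`, $\frac{1}{2n}\|y - Xw\|_2^2 + \alpha\|w\|_1$, and centered data so that no intercept is needed. The helpers `soft_threshold` and `lasso_cd` are hypothetical names, the data is synthetic, and this is not the library's actual implementation.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink value toward zero; return exactly 0 inside [-threshold, threshold]."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_cd(X, y, alpha, n_iter=200):
    """Cyclic coordinate descent for (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    col_sq = (X ** 2).sum(axis=0) / n_samples
    for _ in range(n_iter):
        for j in range(n_features):
            # Residual with feature j's current contribution removed
            partial_residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ partial_residual / n_samples
            # The L1 penalty turns the least-squares update into a soft threshold
            w[j] = soft_threshold(rho, alpha) / col_sq[j]
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X -= X.mean(axis=0)
# Only the first two features matter in this synthetic target
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
y -= y.mean()

w = lasso_cd(X, y, alpha=0.5)
print(w)  # the three irrelevant features are expected to come out as exactly 0
```

Whenever a feature's correlation with the current residual is smaller in magnitude than `alpha`, the soft threshold sets its weight to zero, which is how Lasso acts as a feature selector.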

## 4. Ridge Regression
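
As background for the exercise in this section, ridge regression has a closed-form solution. The sketch below assumes the objective that scikit-learn documents for `Ridge`, $\|y - Xw\|_2^2 + \alpha\|w\|_2^2$, and centered data so that no intercept is needed. `ridge_closed_form` is a hypothetical helper and the data is synthetic.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y for centered X and y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X -= X.mean(axis=0)
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
y -= y.mean()

for alpha in (0.0, 10.0, 1000.0):
    w = ridge_closed_form(X, y, alpha)
    print(alpha, w, np.linalg.norm(w))  # the norm of w shrinks as alpha grows
```

With `alpha=0` this reduces to ordinary least squares. Larger values shrink all weights toward zero but, unlike Lasso, generally do not make any of them exactly zero.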

In [None]:
from sklearn.linear_model import Ridge
# Exercise: Ridge Regression
# ...
# ...

## 5. Elastic Net Regression

In [None]:
from sklearn.linear_model import ElasticNet

In [None]:
# Exercise: look up the scikit-learn API: http://scikit-learn.org/stable/index.html,
#           and tell me what Elastic Net is.

In [None]:
# Exercise: Elastic Net Regression
# ...
# ...