# F0753 Sample Code - Chapter 11 Multiple Linear Regression Analysis: scikit-learn

## 11-0 Using scikit-learn and importing a test dataset

### *Extracting the independent and dependent variables*

In [None]:
from sklearn import datasets

# Load the dataset once and reuse the returned object
boston = datasets.load_boston()
data = boston.data
target = boston.target

print(data)
print(target)

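Note: scikit-learn removed the built-in Boston housing dataset in version 1.2 (after deprecating it in 1.0). If you installed a recent version of Anaconda, running the code above may produce a warning or an error. In that case, run the alternative code below instead: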

In [None]:
import pandas as pd

# Use a separate name so the DataFrame does not shadow sklearn's `datasets` module
boston_df = pd.read_csv('https://github.com/selva86/datasets/raw/master/BostonHousing.csv')
data = boston_df.drop(['medv'], axis=1).to_numpy()
target = boston_df['medv'].to_numpy()

print(data)
print(target)

### *Splitting the data: training set and test set*

In [None]:
# Reuses the modules and data/target from the previous subsection

from sklearn.model_selection import train_test_split

data_train, data_test, target_train, target_test = train_test_split(data, target, test_size=0.2)

In [None]:
print(data_train.shape)
print(data_test.shape)
print(target_train.shape)
print(target_test.shape)
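
Because `train_test_split` shuffles the rows randomly, the split above (and every score computed from it) changes on each run. A minimal sketch of a reproducible split follows. It assumes synthetic stand-in data (`X_demo`, `y_demo`) and an arbitrary seed of 42, neither of which comes from the original notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration): 100 samples, 3 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.normal(size=100)

# Fixing random_state makes the shuffle, and therefore the split, repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)
X_train2, X_test2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)         # 80 training rows, 20 test rows
print(np.array_equal(X_train, X_train2))   # same seed, same split
```

Passing the same `random_state` to the split in this notebook should make the reported scores comparable between runs.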

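## 11-1 Training and evaluating a multiple linear regression model

### *Building the model from the training set*

In [None]:
# Reuses the modules and data_train, data_test, target_train from the previous subsection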

from sklearn.linear_model import LinearRegression

regr_model = LinearRegression()
regr_model.fit(data_train, target_train)

In [None]:
predictions = regr_model.predict(data_test)

print(predictions.round(1))
print(target_test)

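## 11-2 Evaluating model performance

### *Evaluating model performance 1: coefficient of determination*

In [None]:
# Reuses the modules and regr_model, data_train, data_test, target_train, target_test from the previous subsection

# The built-in round() works whether score() returns a Python float or a NumPy scalar
print(round(regr_model.score(data_train, target_train), 3))
print(round(regr_model.score(data_test, target_test), 3))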
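
For a regressor, `score` reports the coefficient of determination R², which is 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below recomputes it by hand and compares it with `score`. It uses synthetic data (`X_demo`, `y_demo`) as an assumption, not the housing data from this chapter:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): a linear signal plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + 4.0 + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X_demo, y_demo)
y_hat = model.predict(X_demo)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_demo - y_hat) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(round(float(r2_manual), 3))
print(round(float(model.score(X_demo, y_demo)), 3))
```

A value close to 1 means the model explains most of the variance in the target. A test-set score far below the training-set score suggests overfitting.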

### *Evaluating model performance 2: residual plot*

In [None]:
# Reuses predictions/target_test from the previous section

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(predictions.size)
y = x * 0

plt.scatter(x, predictions - target_test)
plt.plot(x, y, color='orange')
plt.show()

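### *Evaluating model performance 3: mean absolute error*

In [None]:
# Reuses predictions/target_test from the previous section

from sklearn.metrics import mean_absolute_error

# The built-in round() works whether the metric is a Python float or a NumPy scalar
print(round(mean_absolute_error(target_test, predictions), 3))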
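
The mean absolute error is the average of the absolute differences between the true values and the predictions. A small worked sketch follows, using hypothetical numbers (`y_true`, `y_pred`) picked only to keep the arithmetic easy to check:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical target values and predictions, chosen so the arithmetic is easy to follow
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

# Absolute errors are 0.5, 0.0, 1.5 and 1.0, so their mean is 3.0 / 4 = 0.75
mae_manual = np.mean(np.abs(y_true - y_pred))

print(mae_manual)
print(mean_absolute_error(y_true, y_pred))
```

Unlike R², the mean absolute error is expressed in the same units as the target, which makes it easier to judge whether a typical miss is acceptable.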

### *Getting the model's coefficients*

In [None]:
# Reuses regr_model from the previous section

print(regr_model.coef_.round(2))
print(regr_model.intercept_.round(2))
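
To see how `coef_` and `intercept_` relate to the predictions, the sketch below fits a model on synthetic, noise-free data (an assumption made so the recovered values are easy to recognise) and rebuilds the predictions by hand:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data (an assumption): y = 2*x1 - 3*x2 + 5
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 2 * X_demo[:, 0] - 3 * X_demo[:, 1] + 5

model = LinearRegression().fit(X_demo, y_demo)

# A prediction is the dot product of the features with coef_, plus intercept_
manual = X_demo @ model.coef_ + model.intercept_

print(model.coef_.round(2), round(float(model.intercept_), 2))
print(np.allclose(manual, model.predict(X_demo)))
```

Each entry of `coef_` is the change in the predicted target for a one-unit change in the corresponding feature, with the other features held fixed.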

## 11-3 Regression analysis on real-world data: shared bikes and weather

CSV download URL: https://archive.ics.uci.edu/ml/machine-learning-databases/00560/SeoulBikeData.csv

### *Importing the dataset into a pandas DataFrame*

In [None]:
import pandas as pd

# Replace <username> with your own Windows user name
df = pd.read_csv(r'C:\Users\<username>\Downloads\SeoulBikeData.csv', encoding='gbk', index_col=['Date'])
df

You can also load the dataset directly from the URL, as follows:

In [None]:
import pandas as pd

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/00560/SeoulBikeData.csv', encoding='gbk', index_col=['Date'])
df

### *Data cleaning*

In [None]:
# Reuses the modules and df from the previous subsection

data = df.copy()
data = data[data['Functioning Day'] == 'Yes']
data.pop('Functioning Day')

In [None]:
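# With the 'gbk' encoding used when loading, the degree sign in these two headers appears as '癈';
# rename both columns to plain ASCII names
data = data.rename(columns={'Temperature(癈)': 'Temperature(*C)', 'Dew point temperature(癈)': 'Dew point(*C)'})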
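
The cleaning steps above combine boolean-mask filtering with `DataFrame.pop`. A minimal sketch on a toy frame follows. Its values are hypothetical and only the column names mirror the dataset:

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the dataset
toy = pd.DataFrame({
    'Rented Bike Count': [254, 0, 173],
    'Temperature(*C)': [-5.2, -5.5, -6.0],
    'Functioning Day': ['Yes', 'No', 'Yes'],
})

# Keep only the rows whose 'Functioning Day' value is 'Yes'
cleaned = toy[toy['Functioning Day'] == 'Yes'].copy()

# pop() removes the column in place and returns it; every remaining value is 'Yes'
removed = cleaned.pop('Functioning Day')

print(cleaned.shape)
print(list(cleaned.columns))
```

Dropping the column after filtering is reasonable because it is constant at that point and carries no information for the regression.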

### *Encoding text data as numbers*

In [None]:
# Reuses the modules and data from the previous subsection

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

data['Seasons'] = le.fit_transform(data['Seasons'])
data['Holiday'] = le.fit_transform(data['Holiday'])

data[['Seasons', 'Holiday']]
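
To check which integer each label received, you can inspect the encoder's `classes_` attribute. The sketch below uses a hypothetical list of season labels (`seasons`) and a separate encoder (`le_demo`) rather than the notebook's own data:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of season labels, for illustration only
seasons = ['Winter', 'Spring', 'Summer', 'Autumn', 'Winter']

le_demo = LabelEncoder()
codes = le_demo.fit_transform(seasons)

# classes_ holds the sorted unique labels; each label's index is its integer code
mapping = {str(label): code for code, label in enumerate(le_demo.classes_)}

print(codes.tolist())
print(mapping)
# inverse_transform recovers the original labels from the codes
print(le_demo.inverse_transform(codes).tolist())
```

Two caveats apply. Refitting a single encoder on a second column overwrites `classes_`, so a separate encoder per column is needed if you want to decode the values later. Integer codes also impose an order (here alphabetical) that a linear model treats as numeric. One-hot encoding, for example with `pd.get_dummies`, is a common alternative for nominal features.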

### *Extracting the target values*

In [None]:
# Reuses the modules and data from the previous subsection

target = data.pop('Rented Bike Count')

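### *Training the regression model*

In [None]:
# Reuses data and target from the previous subsections

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

data_train, data_test, target_train, target_test = train_test_split(data.values, target.values, test_size=0.2)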

regr = LinearRegression()
regr.fit(data_train, target_train)
predictions = regr.predict(data_test)

### *Evaluating the predictions*

In [None]:
# Reuses the modules and regr, data_train, data_test, target_train, target_test from the previous subsection

print(round(regr.score(data_train, target_train), 3))
print(round(regr.score(data_test, target_test), 3))