# F0753 Sample Code - Chapter 12: Using Machine Learning for Classification Prediction and Data Simplification

## 12-0 Data Classification

## 12-1 KNN (K-Nearest Neighbors) Prediction Model

### *Import and split the dataset*

In [1]:
from sklearn import datasets
from sklearn.model_selection import train_test_split

data, target = datasets.load_wine(return_X_y=True)

data_train, data_test, target_train, target_test = train_test_split(
    data, target, test_size=0.2, random_state=0)

### *Build a KNN model to predict classes*

In [2]:
# Reuses the packages and data_train, data_test, target_train, target_test from the previous subsection

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(data_train, target_train)
predictions = knn.predict(data_test)

print(predictions)
print(target_test)

[0 1 1 0 1 1 0 2 1 1 0 1 0 2 0 1 0 0 1 0 1 0 2 1 1 1 1 1 2 2 0 0 1 0 0 0]
[0 2 1 0 1 1 0 2 1 1 2 2 0 1 2 1 0 0 1 0 1 0 0 1 1 1 1 1 1 2 0 0 1 0 0 0]


### *Evaluate the predictions*

In [3]:
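# Reuses the packages and knn, data_train, data_test, target_train, target_test from the previous subsection

# Built-in round() works whether score() returns a NumPy float or a plain Python float
print(round(knn.score(data_train, target_train), 3))
print(round(knn.score(data_test, target_test), 3))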

0.789
0.806


In [4]:
# Reuses predictions from the previous cell

from sklearn.metrics import classification_report

print(classification_report(target_test, predictions))

              precision    recall  f1-score   support

           0       0.87      0.93      0.90        14
           1       0.88      0.88      0.88        16
           2       0.40      0.33      0.36         6

    accuracy                           0.81        36
   macro avg       0.71      0.71      0.71        36
weighted avg       0.79      0.81      0.80        36
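
The precision and recall columns in the report above can be traced back to a confusion matrix. The following is a minimal sketch, not part of the book's code: it retrains the same KNN model so that it runs on its own, and the names `cm`, `correct`, `precision`, `recall`, and `accuracy` are chosen here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

data, target = datasets.load_wine(return_X_y=True)
data_train, data_test, target_train, target_test = train_test_split(
    data, target, test_size=0.2, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(data_train, target_train)
predictions = knn.predict(data_test)

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(target_test, predictions)
print(cm)

correct = np.diag(cm)                 # correctly classified samples per class
precision = correct / cm.sum(axis=0)  # correct / all samples predicted as that class
recall = correct / cm.sum(axis=1)     # correct / all samples truly in that class
accuracy = correct.sum() / cm.sum()   # correct / all test samples

print(precision.round(2))
print(recall.round(2))
print(round(float(accuracy), 2))
```

Each row total of `cm` matches the `support` column of the report, which is a quick way to check that the matrix is being read in the right orientation.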



### *(bonus) How to find the best K value?*

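The book says you have to experiment to find the best K value yourself, but scikit-learn actually offers a semi-automatic solution. The program below uses GridSearchCV, which scores each candidate with cross-validation on the training set, to search for the n_neighbors value with the best predictive performance. (Of course, the result still varies with how the dataset is split.)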

In [5]:
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import GridSearchCV

data, target = datasets.load_wine(return_X_y=True)
data_train, data_test, target_train, target_test = train_test_split(
    data, target, test_size=0.2, random_state=0)

knn = KNeighborsClassifier()

parameters = {'n_neighbors': np.arange(10) + 1}  # range of n_neighbors values to test (1 to 10)
clf = GridSearchCV(knn, parameters)
clf.fit(data_train, target_train)

print(clf.best_params_)  # print the best n_neighbors value within the tested range

{'n_neighbors': 1}
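
`best_params_` reports only the winner. To see how close the other candidates came, the search results can be inspected directly. The sketch below is an illustration rather than the book's code; it repeats the search with `cv=5` written out explicitly (an assumption that matches the default in recent scikit-learn releases) and prints the mean cross-validated accuracy for every K.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

data, target = datasets.load_wine(return_X_y=True)
data_train, data_test, target_train, target_test = train_test_split(
    data, target, test_size=0.2, random_state=0)

parameters = {'n_neighbors': np.arange(1, 11)}  # candidate K values 1 to 10
clf = GridSearchCV(KNeighborsClassifier(), parameters, cv=5)
clf.fit(data_train, target_train)

# Mean cross-validated accuracy for each candidate K
for k, score in zip(clf.cv_results_['param_n_neighbors'],
                    clf.cv_results_['mean_test_score']):
    print(k, round(float(score), 3))

# Best K and its mean cross-validated accuracy
print(clf.best_params_, round(float(clf.best_score_), 3))

# With the default refit=True, clf has been refit on all of data_train using the best K
print(round(float(clf.score(data_test, target_test)), 3))
```

If several K values score almost the same, the choice of `n_neighbors` matters little for this split, and a different `random_state` may well pick a different winner.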


## 12-2 Logistic Regression Model

### *Train a logistic regression model and predict the data*

In [6]:
from sklearn import datasets
from sklearn.model_selection import train_test_split

data, target = datasets.load_wine(return_X_y=True)
data_train, data_test, target_train, target_test = train_test_split(
    data, target, test_size=0.2, random_state=0)

In [7]:
from sklearn.linear_model import LogisticRegression

log_model = LogisticRegression()
log_model.fit(data_train, target_train)
predictions = log_model.predict(data_test)

print(predictions)
print(target_test)

[0 2 1 0 1 1 0 2 1 1 2 2 0 1 2 1 0 0 2 0 0 0 1 1 1 1 1 1 1 2 0 0 1 0 0 0]
[0 2 1 0 1 1 0 2 1 1 2 2 0 1 2 1 0 0 1 0 1 0 0 1 1 1 1 1 1 2 0 0 1 0 0 0]


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


### *Evaluate the prediction performance*

In [8]:
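# Reuses log_model, data_train, data_test, target_train, target_test, predictions from the previous subsection

# Built-in round() works whether score() returns a NumPy float or a plain Python float
print(round(log_model.score(data_train, target_train), 3))
print(round(log_model.score(data_test, target_test), 3))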

from sklearn.metrics import classification_report

print(classification_report(target_test, predictions))

0.986
0.917
              precision    recall  f1-score   support

           0       0.93      0.93      0.93        14
           1       0.93      0.88      0.90        16
           2       0.86      1.00      0.92         6

    accuracy                           0.92        36
   macro avg       0.91      0.93      0.92        36
weighted avg       0.92      0.92      0.92        36



## 12-3 Improving the Logistic Regression Model

### *Increase the iteration count and standardize the data*

In [9]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data, target = datasets.load_wine(return_X_y=True)

sc = StandardScaler()
data_std = sc.fit_transform(data)

data_train, data_test, target_train, target_test = train_test_split(
    data_std, target, test_size=0.2, random_state=0)

from sklearn.linear_model import LogisticRegression

log_model = LogisticRegression(max_iter=10000, verbose=True)
log_model.fit(data_train, target_train)
predictions = log_model.predict(data_test)

print(round(log_model.score(data_train, target_train), 3))
print(round(log_model.score(data_test, target_test), 3))

1.0
1.0


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s finished
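
The cell above standardizes the whole dataset before splitting it, so the test rows contribute to the scaler's mean and standard deviation. One common alternative, shown here as a sketch and not as the book's method, is to split first and wrap the scaler and the model in a pipeline with `make_pipeline`, so that the scaler is fitted on the training rows only. The variable name `model` is chosen here for illustration.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data, target = datasets.load_wine(return_X_y=True)

# Split the raw data first, so the test rows play no part in fitting the scaler
data_train, data_test, target_train, target_test = train_test_split(
    data, target, test_size=0.2, random_state=0)

# fit() learns the scaling from data_train and then trains the classifier;
# score() and predict() reuse that same training mean and standard deviation
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))
model.fit(data_train, target_train)

print(round(float(model.score(data_train, target_train)), 3))
print(round(float(model.score(data_test, target_test)), 3))
```

A pipeline also behaves like a single estimator, so it can be passed to tools such as GridSearchCV without standardizing the data by hand.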


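## 12-4 Principal Component Analysis (PCA): Reducing the Number of Variables to Analyze

### *Build a prediction model with a linear support vector machine (SVM)*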

In [10]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data, target = datasets.load_digits(return_X_y=True)

sc = StandardScaler()
data_std = sc.fit_transform(data)

data_train, data_test, target_train, target_test = train_test_split(
    data_std, target, test_size=0.2, random_state=42)

from sklearn.svm import LinearSVC

svc = LinearSVC(max_iter=10000)
svc.fit(data_train, target_train)
predictions = svc.predict(data_test)

print(round(svc.score(data_train, target_train), 3))
print(round(svc.score(data_test, target_test), 3))

0.995
0.956


### *Inspect how much variance each component explains*

In [11]:
# Reuses data from the previous subsection

from sklearn.decomposition import PCA

pca = PCA()
pca.fit(data)
print(pca.explained_variance_ratio_.round(3))

[0.149 0.136 0.118 0.084 0.058 0.049 0.043 0.037 0.034 0.031 0.024 0.023
 0.018 0.018 0.015 0.014 0.013 0.012 0.01  0.009 0.009 0.008 0.008 0.007
 0.007 0.006 0.006 0.005 0.005 0.004 0.004 0.004 0.003 0.003 0.003 0.003
 0.003 0.002 0.002 0.002 0.002 0.002 0.002 0.001 0.001 0.001 0.001 0.001
 0.001 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.
 0.    0.    0.    0.   ]


### *Select features with PCA*

In [12]:
# Reuses the packages and data from the previous subsection

pca = PCA(n_components=0.85)
pca.fit(data)
print(pca.explained_variance_ratio_.round(3))

[0.149 0.136 0.118 0.084 0.058 0.049 0.043 0.037 0.034 0.031 0.024 0.023
 0.018 0.018 0.015 0.014 0.013]


In [13]:
pca = PCA(n_components=10)
pca.fit(data)
print(pca.explained_variance_ratio_.round(3))

[0.149 0.136 0.118 0.084 0.058 0.049 0.043 0.037 0.034 0.031]
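
The two cells above pass `n_components` either as a fraction (0.85) or as a count (10). The sketch below, an illustration and not the book's code, shows how the two relate: it accumulates `explained_variance_ratio_` with `np.cumsum` and finds the smallest number of components whose running total exceeds 0.85. The names `cumulative`, `n_needed`, and `pca_85` are chosen here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

data, target = datasets.load_digits(return_X_y=True)

pca = PCA()
pca.fit(data)

# Running total of the variance explained by the first 1, 2, 3, ... components
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose running total exceeds 0.85
n_needed = int(np.searchsorted(cumulative, 0.85, side='right')) + 1
print(n_needed, cumulative[n_needed - 1].round(3))

# A PCA fitted with a fractional n_components reports its choice in n_components_
pca_85 = PCA(n_components=0.85).fit(data)
print(pca_85.n_components_)
```

The same running total shows how much variance a fixed count keeps: `cumulative[9]` is the share retained by `PCA(n_components=10)`.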


### *Train the model on the simplified data*

In [14]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

data, target = datasets.load_digits(return_X_y=True)

pca = PCA(n_components=0.85)
data_pca = pca.fit_transform(data)

sc = StandardScaler()
data_std = sc.fit_transform(data_pca)

data_train, data_test, target_train, target_test = train_test_split(
    data_std, target, test_size=0.2, random_state=42)

svc = LinearSVC(max_iter=10000)
svc.fit(data_train, target_train)
predictions = svc.predict(data_test)

print(round(svc.score(data_train, target_train), 3))
print(round(svc.score(data_test, target_test), 3))

0.963
0.956


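## (bonus) Saving and Reusing a Prediction Model

After training a model, you can use the code below to save it to disk for later reuse, so you do not have to retrain it every time the program runs.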

In [16]:
import pickle

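filename = r'data\svc_model'  # file path and name (Windows-style separator; the data folder must already exist)

with open(filename, 'wb') as f:  # write the svc model object to a binary file, then close it
    pickle.dump(svc, f)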

In [17]:
import pickle

filename = r'data\svc_model'  # file path and name

with open(filename, 'rb') as f:  # read the binary file and restore it as the model object saved_model
    saved_model = pickle.load(f)

print(saved_model)  # inspect the model
print(saved_model.predict(data_test[0:5]))  # predict (using the first 5 rows of data_test)

LinearSVC(max_iter=10000)
[6 9 3 7 2]
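
To confirm that a restored model behaves like the original, its predictions can be compared with those of the model that was saved. The sketch below is an illustration under a few assumptions: it trains a LinearSVC on the standardized digits data so that it runs on its own, and it writes to a temporary folder through `pathlib` (a hypothetical location and file name, `svc_model.pkl`) so that the path works on any operating system and no file is left behind.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

data, target = datasets.load_digits(return_X_y=True)
data_std = StandardScaler().fit_transform(data)
data_train, data_test, target_train, target_test = train_test_split(
    data_std, target, test_size=0.2, random_state=42)

svc = LinearSVC(max_iter=10000)
svc.fit(data_train, target_train)

# Hypothetical location: a temporary folder that is deleted when the block ends
with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / 'svc_model.pkl'
    with open(path, 'wb') as f:
        pickle.dump(svc, f)           # save the trained model
    with open(path, 'rb') as f:
        saved_model = pickle.load(f)  # restore it as a new object

# The restored model should give exactly the same predictions as the original
same = np.array_equal(svc.predict(data_test), saved_model.predict(data_test))
print(same)
```

Pickle files can execute arbitrary code when loaded, so only load model files that come from a source you trust, and load them with the same scikit-learn version that saved them where possible.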
