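## [Key Points]
Use the linear regression models in Sklearn to train on a variety of datasets. Make sure you understand the **data type** of what is fed into the model for training, and the meaning of each model parameter.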

## Assignment
Try other datasets from sklearn datasets (wine, boston, ...) to train your own linear regression model.

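### HINT: Check the type of the label to determine whether the dataset's target is classification or regression, then train with the appropriate model!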

In [4]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score
import warnings
warnings.filterwarnings('ignore')

In [5]:
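# get datasets from sklearn
wine = datasets.load_wine()
# load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this line needs scikit-learn < 1.2
boston = datasets.load_boston()
breast_cancer = datasets.load_breast_cancer()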
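On scikit-learn 1.2 or later, `load_boston` is not available. As a rough sketch, the bundled `load_diabetes` dataset can stand in for regression practice; it is not one of the datasets used in this notebook, and the split settings below (`test_size=0.3`, `random_state=123`) are assumptions chosen to mirror the later cells.

```python
from sklearn import datasets, linear_model
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Assumption: load_diabetes is used only as a stand-in regression dataset;
# it ships with scikit-learn and needs no download.
diabetes = datasets.load_diabetes()
print(diabetes['data'].shape)   # (n_samples, n_features)
print(diabetes['target'][:5])   # numeric disease-progression scores

train_X, test_X, train_y, test_y = train_test_split(
    diabetes['data'], diabetes['target'], test_size=0.3, random_state=123
)

# Ordinary least squares on the training split, scored on the held-out split.
model = linear_model.LinearRegression().fit(train_X, train_y)
diabetes_r2 = r2_score(test_y, model.predict(test_X))
print(f'R_square: {diabetes_r2}')
```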

In [10]:
# quick look at the first 20 target values of each dataset
from IPython.display import display
display(wine['target'][0:20])
display(boston['target'][0:20])
display(breast_cancer['target'][0:20])

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

array([24. , 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15. ,
       18.9, 21.7, 20.4, 18.2, 19.9, 23.1, 17.5, 20.2, 18.2])

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

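## Choosing the model
The model depends on each dataset's target type:
* boston has a continuous target, so it uses linear regression
* wine and breast_cancer have discrete class labels, so they use logistic regression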
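The choice above can also be checked programmatically. The sketch below uses `sklearn.utils.multiclass.type_of_target` to classify each target array; `suggest_model` is a hypothetical helper written for this illustration, and the Boston entry is just the first few prices printed earlier, typed in by hand.

```python
import numpy as np
from sklearn import datasets
from sklearn.utils.multiclass import type_of_target

targets = {
    'wine': datasets.load_wine()['target'],
    'breast_cancer': datasets.load_breast_cancer()['target'],
    # Assumption: the first Boston prices shown earlier stand in for the full target.
    'boston (first values)': np.array([24.0, 21.6, 34.7, 33.4, 36.2]),
}

def suggest_model(y):
    """Hypothetical helper: map a target type to a linear model family."""
    kind = type_of_target(y)
    return 'LinearRegression' if kind.startswith('continuous') else 'LogisticRegression'

for name, y in targets.items():
    print(name, type_of_target(y), suggest_model(y))
```

`type_of_target` is a heuristic: a float array that happens to hold only whole numbers is reported as class labels, so it is worth confirming the result against the dataset description.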

### Boston data

In [28]:
test_size = 0.3
rnd_seed = 123

In [29]:
import pandas as pd
boston_data = pd.DataFrame(boston['data'], columns = boston['feature_names'])
boston_data.head()

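| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |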


In [30]:
train_X, test_X, train_y, test_y = train_test_split(boston['data'], boston['target'], test_size = test_size, random_state = rnd_seed)

model_reg = linear_model.LinearRegression()

model_reg.fit(train_X, train_y)

pred_y = model_reg.predict(test_X)

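# r2_score is not symmetric: the ground truth goes first, r2_score(y_true, y_pred).
# The R_square value recorded below came from the swapped call r2_score(pred_y, test_y).
print(f'R_square: {r2_score(test_y, pred_y)}')
print(f'MSE: {mean_squared_error(test_y, pred_y)}')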

R_square: 0.5103418739211152
MSE: 28.40585481050826
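Argument order matters for `r2_score`, whose signature is `r2_score(y_true, y_pred)`: the residual error is normalised by the variance of the first argument. Mean squared error, by contrast, is symmetric. The small sketch below illustrates this with made-up numbers, not with the Boston data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up values, for illustration only.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

r2_correct = r2_score(y_true, y_pred)   # normalised by the variance of y_true
r2_swapped = r2_score(y_pred, y_true)   # normalised by the variance of y_pred
mse_correct = mean_squared_error(y_true, y_pred)
mse_swapped = mean_squared_error(y_pred, y_true)

print(f'R2  (y_true, y_pred): {r2_correct:.4f}')
print(f'R2  (y_pred, y_true): {r2_swapped:.4f}')
print(f'MSE (y_true, y_pred): {mse_correct:.4f}')
print(f'MSE (y_pred, y_true): {mse_swapped:.4f}')
```

The two R² values differ while the two MSE values match, so a swapped call silently changes R² only.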


### Wine Data

In [80]:
solver = 'liblinear'

In [81]:
train_X, test_X, train_y, test_y = train_test_split(wine['data'], wine['target'], test_size = test_size, random_state = rnd_seed)

model_logistic = linear_model.LogisticRegression()

model_logistic.fit(train_X, train_y)

pred_y = model_logistic.predict(test_X)

# accuracy_score returns the fraction of correct predictions (accuracy, not AUC)
print(f'Accuracy: {accuracy_score(test_y, pred_y)}')

Accuracy: 0.9444444444444444


In [82]:
train_X, test_X, train_y, test_y = train_test_split(wine['data'], wine['target'], test_size = test_size, random_state = rnd_seed)

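from sklearn.preprocessing import MinMaxScaler
MMScaler = MinMaxScaler()
# Fit the scaler on the training split only, then apply the same scaling to the test split.
# (The score recorded below was produced with the scaler refit on the test split.)
train_X = MMScaler.fit_transform(train_X)
test_X = MMScaler.transform(test_X)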

model_logistic = linear_model.LogisticRegression(solver = solver)

model_logistic.fit(train_X, train_y)

pred_y = model_logistic.predict(test_X)

# accuracy_score returns the fraction of correct predictions (accuracy, not AUC)
print(f'Accuracy: {accuracy_score(test_y, pred_y)}')

Accuracy: 0.9074074074074074
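One way to guarantee that the scaler only ever sees the training split is to wrap it together with the classifier in a pipeline. The following is a minimal sketch, assuming the same split settings as above and the default `LogisticRegression` solver rather than `liblinear`; the score it prints is not expected to match the one recorded in this notebook.

```python
from sklearn import datasets, linear_model
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

wine = datasets.load_wine()
train_X, test_X, train_y, test_y = train_test_split(
    wine['data'], wine['target'], test_size=0.3, random_state=123
)

# fit() learns the per-feature min/max from train_X only;
# predict() reuses those statistics to scale test_X.
clf = make_pipeline(MinMaxScaler(), linear_model.LogisticRegression())
clf.fit(train_X, train_y)

wine_accuracy = accuracy_score(test_y, clf.predict(test_X))
print(f'Accuracy: {wine_accuracy}')
```

Refitting the scaler on the test split would rescale each test feature to its own range, so the model would see inputs on a different scale from the one it was trained on.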


### Breast Cancer Data

In [83]:
train_X, test_X, train_y, test_y = train_test_split(breast_cancer['data'], breast_cancer['target'], test_size = test_size, random_state = rnd_seed)

model_logistic = linear_model.LogisticRegression(solver = solver)

model_logistic.fit(train_X, train_y)

pred_y = model_logistic.predict(test_X)

# accuracy_score returns the fraction of correct predictions (accuracy, not AUC)
print(f'Accuracy: {accuracy_score(test_y, pred_y)}')

Accuracy: 0.9824561403508771


In [84]:
train_X, test_X, train_y, test_y = train_test_split(breast_cancer['data'], breast_cancer['target'], test_size = test_size, random_state = rnd_seed)

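from sklearn.preprocessing import MinMaxScaler
MMScaler = MinMaxScaler()
# Fit the scaler on the training split only, then apply the same scaling to the test split.
# (The score recorded below was produced with the scaler refit on the test split.)
train_X = MMScaler.fit_transform(train_X)
test_X = MMScaler.transform(test_X)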

model_logistic = linear_model.LogisticRegression(solver = solver)

model_logistic.fit(train_X, train_y)

pred_y = model_logistic.predict(test_X)

# accuracy_score returns the fraction of correct predictions (accuracy, not AUC)
print(f'Accuracy: {accuracy_score(test_y, pred_y)}')

Accuracy: 0.9707602339181286
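`accuracy_score` measures the fraction of correct hard predictions, which is a different quantity from the area under the ROC curve. For a binary target such as breast_cancer, ROC AUC can be computed from predicted probabilities with `roc_auc_score`. The sketch below assumes the same split settings and solver as the cells above and makes no claim about matching the recorded scores.

```python
from sklearn import datasets, linear_model
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

breast_cancer = datasets.load_breast_cancer()
train_X, test_X, train_y, test_y = train_test_split(
    breast_cancer['data'], breast_cancer['target'], test_size=0.3, random_state=123
)

model = linear_model.LogisticRegression(solver='liblinear')
model.fit(train_X, train_y)

# Accuracy is computed from hard class predictions.
acc = accuracy_score(test_y, model.predict(test_X))

# ROC AUC is computed from scores; column 1 holds the probability of class 1.
proba = model.predict_proba(test_X)[:, 1]
auc = roc_auc_score(test_y, proba)

print(f'Accuracy: {acc}')
print(f'ROC AUC: {auc}')
```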
