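# [Example Highlights]
Understand the machine learning modeling workflow: the steps involved, the types of data, and how results are evaluated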

In [1]:
from sklearn import metrics, datasets

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.model_selection import train_test_split

# Four Steps to Build a Model
Building a machine learning model in Scikit-learn is straightforward. The workflow is roughly the following four steps:<br />
<br />
1. Load the data and check its shape (how many samples (rows), how many features (columns), and what type is the label?)<br />
Ways to load data:<br />
Read a .csv file with pandas: pd.read_csv<br />
Read a .txt file with numpy: np.loadtxt<br />
Use a dataset built into Scikit-learn: sklearn.datasets.load_xxx<br />
Check the size of the data: data.shape (data should be an np.array or a DataFrame)<br />
<br />
2. Split the data into training (train) and test (test) sets<br />
train_test_split(data)<br />
<br />
3. Build the model and fit it to the data to start training<br />
clf = DecisionTreeClassifier()<br />
clf.fit(x_train, y_train)<br />
<br />
4. Feed the test data (features) into the trained model to get predictions, then evaluate them against the test labels (y_test)<br />
y_pred = clf.predict(x_test)<br />
accuracy_score(y_test, y_pred)<br />
f1_score(y_test, y_pred)<br />

In [2]:
iris = datasets.load_iris()

#split train and test data for model validation
x_train, x_test, y_train, y_test = train_test_split(iris.data,iris.target, test_size = 0.25 , random_state = 2019)

#build the classifier model
clf = DecisionTreeClassifier()

#train the classifier
clf.fit(x_train,y_train)

#predict the result
pred = clf.predict(x_test)

#measure the score (metric functions take y_true first, then y_pred)
acc = metrics.accuracy_score(y_test, pred)
print(f' Accuracy : {acc} ')

 Accuracy : 0.9736842105263158 
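
Step 4 lists `f1_score` next to `accuracy_score`, but the cell above only reports accuracy. The following is a minimal sketch of both metrics on the same split. Two things here are assumptions rather than part of the original notebook: `random_state=0` on the tree (added so repeated runs agree) and `average="macro"` for the F1 score. Some averaging choice is required because iris has three classes and the default `average="binary"` only applies to two-class labels.

```python
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=2019
)

# random_state on the tree is an added assumption for reproducibility
clf = DecisionTreeClassifier(random_state=0)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)

# Metric functions take the true labels first, then the predictions
acc = metrics.accuracy_score(y_test, y_pred)

# Multiclass labels need an explicit averaging strategy; "macro" is one option
macro_f1 = metrics.f1_score(y_test, y_pred, average="macro")

print(f"Accuracy: {acc:.4f}")
print(f"Macro F1: {macro_f1:.4f}")

# Rows are true classes, columns are predicted classes
print(metrics.confusion_matrix(y_test, y_pred))
```

Macro averaging weights every class equally. `average="weighted"` would instead weight each class by its number of test samples.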


In [3]:
#Check out the features in dataset
print(iris.feature_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [4]:
#Check out the importance of features
print(f' Feature importance : {clf.feature_importances_} ')

 Feature importance : [0.01347125 0.         0.05241164 0.93411711] 
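
The importance array is easier to read when each value sits next to its feature name. Below is a small sketch that pairs the two and sorts them. It refits a tree with `random_state=0` (an assumption added here), so the exact numbers may differ from the output above.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=2019
)

# random_state on the tree is an added assumption for reproducibility
clf = DecisionTreeClassifier(random_state=0).fit(x_train, y_train)

# Pair each feature name with its importance, largest first
ranked = sorted(
    zip(iris.feature_names, clf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, importance in ranked:
    print(f"{name:<20} {importance:.4f}")
```

The importances of a fitted tree are normalized to sum to 1, so each value can be read as that feature's share of the total impurity reduction.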


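# [Homework Highlights]
By now you should have a clear picture of what the data in a dataset looks like, including the features and the labels. Remember that in any future project the data must be cleaned into this same format before it can be fed to a model for training. Today's homework is a first step into decision trees, a very important model. Make sure you understand what each hyperparameter of the model means, and try adjusting them to see how they affect the final predictions <br />
<br /> 
# Homework
1. Try adjusting the parameters of DecisionTreeClassifier(...) and observe whether the results change.<br />
2. Switch to other datasets (boston, wine) and compare against the results of a regression model<br />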

In [15]:
clf.set_params(criterion = "entropy", max_depth = 2, min_samples_split = 2, min_samples_leaf = 1)
clf.fit(x_train,y_train)
pred = clf.predict(x_test)
#measure the score after adjusting the parameters
acc = metrics.accuracy_score(y_test, pred)
print(f' Accuracy : {acc} ')

 Accuracy : 0.9736842105263158 


**The iris dataset is quite simple: even with max_depth reduced to 2, the accuracy is unchanged at 0.9737.**
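
A single train/test split can hide the effect of a hyperparameter. One way to probe it further is to sweep `max_depth` and score each setting with cross-validation. This is a sketch only: the list of depths, `cv=5`, and `random_state=0` are arbitrary choices made for illustration and are not from the original notebook.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()

scores_by_depth = {}
for depth in (1, 2, 3, 5, None):  # None lets the tree grow until leaves are pure
    clf = DecisionTreeClassifier(
        criterion="entropy", max_depth=depth, random_state=0
    )
    # 5-fold cross-validated accuracy on the full dataset
    scores = cross_val_score(clf, iris.data, iris.target, cv=5)
    scores_by_depth[depth] = scores.mean()
    print(f"max_depth={depth}: mean accuracy {scores.mean():.4f}")
```

A tree with `max_depth=1` has only two leaves, so it cannot separate all three iris classes. That makes it a useful lower bound when reading the other scores.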

In [8]:
#load_boston was removed in scikit-learn 1.2, so this cell needs an older release
boston = datasets.load_boston()

#check the data label type
print(boston.target[:10])

#check the shape of data
print(boston.data.shape)

[24.  21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9]
(506, 13)
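
The homework also names the wine dataset. Here is a minimal sketch of the same two checks (shape and label type) applied to it. The integer-dtype test at the end is a rough heuristic introduced here for illustration, not a general rule: continuous targets like the Boston prices above are floats, while class labels are usually stored as integers.

```python
import numpy as np
from sklearn import datasets

wine = datasets.load_wine()

# (n_samples, n_features)
print(wine.data.shape)

# Label dtype and the distinct label values
print(wine.target.dtype)
print(np.unique(wine.target))

# Heuristic (an assumption of this sketch): integer labels suggest classification
is_classification = np.issubdtype(wine.target.dtype, np.integer)
print(f"Looks like classification: {is_classification}")
```

Because the wine labels are class indices, `DecisionTreeClassifier` is the matching model there, while the continuous Boston target calls for `DecisionTreeRegressor`.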


In [27]:
train_x, test_x, train_y, test_y = train_test_split(boston.data, boston.target, test_size = 0.2, random_state = 2019)

regr = DecisionTreeRegressor(max_depth = 5, min_samples_split = 2 , min_samples_leaf = 1)
regr.fit(train_x,train_y)
pred = regr.predict(test_x)

print(f' MSE : {metrics.mean_squared_error(test_y, pred)} ')

 MSE : 19.899946301091774 


In [20]:
from sklearn.linear_model import LinearRegression

LR = LinearRegression()
LR.fit(train_x,train_y)
pred = LR.predict(test_x)

print(f' MSE: {metrics.mean_squared_error(test_y, pred)} ')

 MSE: 26.202748180423757 
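
The tree-versus-linear comparison above can be reproduced on recent scikit-learn releases by swapping in a different built-in regression dataset. The sketch below uses `load_diabetes` in place of the Boston data, which is a substitution made here and not part of the original notebook, so its MSE values are not comparable with the numbers above. The `random_state=0` on the tree is also an added assumption.

```python
from sklearn import datasets, metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Stand-in regression dataset (assumption: replaces the Boston data)
diabetes = datasets.load_diabetes()
train_x, test_x, train_y, test_y = train_test_split(
    diabetes.data, diabetes.target, test_size=0.2, random_state=2019
)

models = {
    "DecisionTreeRegressor": DecisionTreeRegressor(max_depth=5, random_state=0),
    "LinearRegression": LinearRegression(),
}

results = {}
for name, model in models.items():
    model.fit(train_x, train_y)
    pred = model.predict(test_x)
    # y_true first, then y_pred
    results[name] = metrics.mean_squared_error(test_y, pred)
    print(f"{name}: MSE = {results[name]:.2f}")
```

Which model wins depends on the dataset and on the split, so it is worth repeating the comparison with a few different `random_state` values before drawing a conclusion.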
