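### Saving a trained model
Once a model is trained, we save it so that the next run can load it and predict directly, without retraining.

Two modules are commonly used to save a model: pickle and joblib. The examples here use pickle.

A pickled model can take up several hundred MB, so it is a good idea to compress it with gzip when saving.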

![image](./img/gzip.jpg)

The gzip-compressed file uses the `.pgz` extension; `w` opens the file for writing, and `f` is the file object.

### Loading a saved model

![image](./img/readModel.jpg)

`r` opens the file for reading.
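The save and load steps described above can be sketched as one round trip. This is a minimal sketch, not the exact code in the screenshots: the `model` dictionary is a hypothetical stand-in for a trained model (any picklable object behaves the same way), and the file name `demo-model.pgz` is an assumption chosen for illustration.

```python
import gzip
import os
import pickle
import tempfile

# Hypothetical stand-in for a trained model: any picklable object works.
model = {"name": "demo-model", "weights": [0.1, 0.2, 0.3]}

# Use a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo-model.pgz")

    # "wb": write binary. gzip compresses the pickled bytes as they are written.
    with gzip.open(path, "wb") as f:
        pickle.dump(model, f)

    # "rb": read binary. gzip decompresses before pickle rebuilds the object.
    with gzip.open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == model)
```

The restored object compares equal to the original, which is the property that lets a saved model be reused for prediction later.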

### Example: an XGBoost classifier

In [66]:
import pandas as pd             # data processing library
import numpy as np              # n-dimensional arrays and matrix computation
import matplotlib.pyplot as plt # plotting
import seaborn as sns           # plotting
import io                       # input/output handling
import requests                 # HTTP requests, used to download the training data

In [43]:
url = "https://gist.githubusercontent.com/netj/8836201/raw/6f9306ad21398ea43cba4f7d537619d0e07d5ae3/iris.csv"
s = requests.get(url).content
df_train = pd.read_csv(io.StringIO(s.decode("utf-8")))
#df_train = df_train.drop(labels=["sepal.length"],axis=1)    # drop sepal.length; axis=1 means columns, axis=0 means rows
df_train

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
0,5.1,3.5,1.4,0.2,Setosa
1,4.9,3.0,1.4,0.2,Setosa
2,4.7,3.2,1.3,0.2,Setosa
3,4.6,3.1,1.5,0.2,Setosa
4,5.0,3.6,1.4,0.2,Setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Virginica
146,6.3,2.5,5.0,1.9,Virginica
147,6.5,3.0,5.2,2.0,Virginica
148,6.2,3.4,5.4,2.3,Virginica


In [44]:
label_map = {"Setosa":0,"Versicolor":1,"Virginica":2}
# Store the integer-encoded labels in a new column, df_train["Class"].
df_train["Class"] = df_train["variety"].map(label_map)

label_map
df_train

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety,Class
0,5.1,3.5,1.4,0.2,Setosa,0
1,4.9,3.0,1.4,0.2,Setosa,0
2,4.7,3.2,1.3,0.2,Setosa,0
3,4.6,3.1,1.5,0.2,Setosa,0
4,5.0,3.6,1.4,0.2,Setosa,0
...,...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Virginica,2
146,6.3,2.5,5.0,1.9,Virginica,2
147,6.5,3.0,5.2,2.0,Virginica,2
148,6.2,3.4,5.4,2.3,Virginica,2


In [45]:
X = df_train.drop(labels=["variety","Class"],axis=1).values # drop Class and variety (string labels are not used for training)
# Check for missing data
print("Missing data check (NaN count):",len(np.where(np.isnan(X))[0]))

Missing data check (NaN count): 0
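The missing-data check above counts the entries for which `np.isnan` is true. As a small sketch of the same idea on data that does contain gaps, the hypothetical matrix `X_demo` below (not part of the iris data) has two missing values; summing the boolean mask gives the total, and summing along `axis=0` gives a per-column breakdown.

```python
import numpy as np

# Hypothetical feature matrix with two missing values (not the iris data)
X_demo = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, np.nan, 1.4, 0.2],
    [np.nan, 3.2, 1.3, 0.2],
])

nan_mask = np.isnan(X_demo)            # True wherever a value is missing
nan_count = int(nan_mask.sum())        # total number of missing values
nan_per_column = nan_mask.sum(axis=0)  # missing values in each feature column

print(nan_count)
print(nan_per_column)
```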


In [46]:
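from sklearn.model_selection import train_test_split
# Features: the four measurement columns. Target: the integer-encoded class.
X = df_train.drop(labels=["Class","variety"],axis=1)
y = df_train["Class"]
# Hold out 30% of the rows for testing; a fixed random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)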

In [47]:
print("Training data shape : ",X_train.shape)
print("Testing data shape : ", X_test.shape)

Training data shape :  (105, 4)
Testing data shape :  (45, 4)


In [48]:
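import xgboost as xgb

# Build the model. Note: XGBRFClassifier is XGBoost's random-forest classifier, not the gradient-boosted XGBClassifier.
xgboostModel = xgb.XGBRFClassifier(learning_rate=0.3)

# Train the model on the training data
xgboostModel.fit(X_train,y_train)

# Predict classes for the training data
predicted = xgboostModel.predict(X_train)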

In [49]:
# Accuracy: the fraction of samples classified correctly
print("Training set: ",xgboostModel.score(X_train,y_train))

print("Test set: ",xgboostModel.score(X_test,y_test))

Training set:  0.9619047619047619
Test set:  1.0
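For scikit-learn style classifiers, `score` reports mean accuracy, that is, the share of predictions that match the true labels. The sketch below computes that quantity by hand on made-up labels; `y_true` and `y_pred` are hypothetical arrays, not outputs of the model above. By the same arithmetic, a training score of about 0.9619 on 105 rows would correspond to 101 correct predictions.

```python
import numpy as np

# Hypothetical true labels and predictions (not produced by the model above)
y_true = np.array([0, 1, 2, 2, 1, 0, 1, 2])
y_pred = np.array([0, 1, 2, 1, 1, 0, 1, 2])  # one mistake, at index 3

# Element-wise comparison gives booleans; their mean is the accuracy.
accuracy = float(np.mean(y_true == y_pred))

print(accuracy)  # 7 of 8 predictions are correct
```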


### Saving the XGBoost (classification) model
Compare the sizes of the files produced by the two formats, `.pickle` and `.pgz`.
1. Save the model with pickle

In [68]:
import pickle
with open("./xgboost-iris.pickle","wb") as f:   # "wb": write binary
    pickle.dump(xgboostModel,f)                 # serialize the model into the file

# Alternative: XGBoost's own model format
#xgboostModel.save_model("model.json")


2. Save the model with pickle and compress it with gzip

In [69]:
import pickle 
import gzip
with gzip.GzipFile('./xgboost-iris.pgz','w') as f:
    pickle.dump(xgboostModel,f)
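One way to compare the two formats is to read the file sizes with `os.path.getsize`. The sketch below is self-contained and therefore uses a hypothetical stand-in object (`stand_in`, a long repetitive list) written to a temporary directory rather than the trained model. Because the stand-in is highly repetitive, it compresses far better than a real model would, so treat the ratio as illustrative only.

```python
import gzip
import os
import pickle
import tempfile

# Hypothetical stand-in for a model: a long, repetitive list of floats
stand_in = [0.5] * 100_000

with tempfile.TemporaryDirectory() as tmp_dir:
    raw_path = os.path.join(tmp_dir, "demo.pickle")
    gz_path = os.path.join(tmp_dir, "demo.pgz")

    # Plain pickle
    with open(raw_path, "wb") as f:
        pickle.dump(stand_in, f)

    # Pickle written through gzip compression
    with gzip.open(gz_path, "wb") as f:
        pickle.dump(stand_in, f)

    # Read the sizes while the temporary files still exist
    raw_size = os.path.getsize(raw_path)
    gz_size = os.path.getsize(gz_path)

print(f"pickle: {raw_size} bytes, gzip: {gz_size} bytes")
```

To compare the files saved in this section, the same `os.path.getsize` call can be pointed at `./xgboost-iris.pickle` and `./xgboost-iris.pgz`.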

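### Loading the XGBoost (classification) model
Try loading the model from both formats and predicting a sample. The input to `predict` must be a two-dimensional array with one row per sample; the examples here pass a NumPy array.

1. Load the pickle-format model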

In [65]:
data = np.array([
    [5.5, 2.4, 3.7, 1. ]
])

data.shape

(1, 4)
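The shape `(1, 4)` means one sample with four features. If a sample starts out as a flat, one-dimensional array, `reshape(1, -1)` turns it into that two-dimensional form. A minimal sketch, where the names `sample` and `batch` are chosen only for illustration:

```python
import numpy as np

sample = np.array([5.5, 2.4, 3.7, 1.0])  # shape (4,): one-dimensional
batch = sample.reshape(1, -1)            # shape (1, 4): one row, four features

print(sample.shape, batch.shape)
```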

In [70]:
# Load the model
with open('./xgboost-iris.pickle','rb') as f:
    xgboostModel = pickle.load(f)
    pred = xgboostModel.predict(np.array([[5.5, 2.4, 3.7, 1. ]]))
    print(pred)

# model_xgb_2 = xgb.Booster()
# model_xgb_2.load_model("./model.json")
# pred = model_xgb_2.predict(xgb.DMatrix(data))

[1]


2. Load the gzip-format model

In [71]:
import pickle
import gzip

# Load the model
with gzip.open('./xgboost-iris.pgz',"r") as f:
    xgboostModel = pickle.load(f)
    pred = xgboostModel.predict(np.array([[5.5,2.4,3.7,1.],[1.2,5,3,1]]))
    print(pred[:])

[1 1]
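The model returns integer class codes. To read them as variety names, the label mapping used during training can be inverted. The sketch below rebuilds that mapping so it runs on its own; `inverse_label_map` and `names` are hypothetical helper names, and `pred` is hard-coded to the output shown above rather than taken from a live model.

```python
# Same encoding as used for training: variety name -> integer code
label_map = {"Setosa": 0, "Versicolor": 1, "Virginica": 2}

# Invert it: integer code -> variety name
inverse_label_map = {code: name for name, code in label_map.items()}

# Hard-coded to match the prediction output shown above
pred = [1, 1]

# int() also handles NumPy integer codes returned by a real model
names = [inverse_label_map[int(code)] for code in pred]

print(names)
```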
