# The goal of this notebook is to learn how to use the powerful XGBoost algorithm

---

# Reference links

[XGBoost parameter settings](https://github.com/dmlc/xgboost/blob/master/doc/parameter.md)

[Introduction to XGBoost](http://xgboost.readthedocs.io/en/latest/python/python_intro.html)

[XGBoost GPU support](https://xgboost.readthedocs.io/en/latest/gpu/index.html)

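# Index

[1. Preparing the data](#1.-Preparing-the-data)

[2. Training the model](#2.-Training-the-model)

[3. Inspecting the training progress](#3.-Inspecting-the-training-progress)

[4. Inspecting feature importance](#4.-Inspecting-feature-importance)

[5. Evaluating the regression with $R^2$](#5.-Evaluating-the-regression-with-$R^2$)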

---

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [None]:
import xgboost as xgb
xgb.__version__

### 1. Preparing the data

In [None]:
df=pd.read_csv('../datasets/blFriday/train.csv') # load the data

In [None]:
df.head(3)

In [None]:
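print('Before filling missing values:')
print(df.isnull().sum())  # count missing values in each column

df=df.fillna(0)           # fill missing values with 0

print('\nAfter filling missing values:')
print(df.isnull().sum())  # count missing values in each column again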

In [None]:
# The last column (Purchase) is the target; the first 11 columns are the features used to predict it
x=df.iloc[:,0:11]
target=df['Purchase']

In [None]:
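# Use pd.factorize() to encode every column as integer codes
# (this is label encoding, not one-hot / dummy encoding)
dataEncoded=pd.DataFrame()
encInfo={}
for col in x.columns:
    factorized=pd.factorize(x[col])

    dataEncoded[col]=factorized[0]  # integer code for each row
    encInfo[col]=factorized[1]      # unique original values, indexed by code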
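
To make the encoding step concrete, here is a minimal sketch of what `pd.factorize()` returns and how the stored lookup table can turn codes back into the original values. The toy `ages` series is an illustrative stand-in, not data from the notebook's CSV.

```python
import pandas as pd

# Toy categorical column (illustrative values only)
ages = pd.Series(['26-35', '0-17', '26-35', '55+', '0-17'])

# codes: one integer per row, numbered in order of first appearance
# uniques: lookup table such that uniques[code] is the original value
codes, uniques = pd.factorize(ages)
print(codes)
print(uniques)

# Decode the integer codes back to the original labels
decoded = uniques[codes]
print(list(decoded))
```

The codes are assigned in order of first appearance, so their numeric order carries no meaning of its own. In the notebook, `encInfo[col]` plays the role of `uniques` for each column.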

In [None]:
# Use scikit-learn's train_test_split to split the data into 70% training and 30% test
trainX,testX,trainY,testY=train_test_split(dataEncoded,target,
                                           test_size=0.3)
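
The split is random, so each run selects different rows unless a seed is supplied. The sketch below, on toy arrays rather than the notebook's data, shows how the `random_state` argument of `train_test_split` makes the 70/30 split repeatable. The seed value 42 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 toy rows, 2 features
y = np.arange(10)                 # toy target

# The same random_state gives the same shuffle, so both calls return identical splits
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(X, y, test_size=0.3, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.3, random_state=42)

print(X_tr1.shape, X_te1.shape)       # 7 training rows, 3 test rows
print(np.array_equal(y_te1, y_te2))   # True: the split is reproducible
```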

In [None]:
# Check the shapes of the splits
print('size of the train data (x):\t',trainX.shape)
print('size of the train data (y):\t',trainY.shape)
print('size of the test data (x):\t',testX.shape)
print('size of the test data (y):\t',testY.shape)

In [None]:
# Store the data in the DMatrix format that xgboost expects
data_train = xgb.DMatrix( trainX, label=trainY)
data_test  = xgb.DMatrix( testX, label=testY)

[Back to index](#Index)

### 2. Training the model

In [None]:
%%time

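# Model parameters that tell the algorithm how to train
param = {}
# 'reg:linear' means regression with squared-error loss (the model is still a tree ensemble);
# newer XGBoost releases name this objective 'reg:squarederror'
param['objective'] = 'reg:linear'
param['tree_method'] = 'hist'  # histogram-based tree construction
param['silent']=1              # suppress training messages; newer releases use 'verbosity' instead
param['max_depth']=10
eval_list  = [(data_train,'train'),(data_test,'test')]
num_round = 50
eval_history={}

# Train the model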
model = xgb.train( param, data_train, num_round,eval_list,
                  evals_result=eval_history,verbose_eval=False)

In [None]:
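# If a GPU is available, the following code can be run instead to speed up training.
# Note: these GPU options match the XGBoost version this notebook was written for;
# newer releases have changed them, so check the GPU documentation linked above.

# %%time

# # Model parameters that tell the algorithm how to train
# param = {}
# param['objective'] = 'reg:linear'
# param['n_gpus']=1
# param['gpu_id']=0
# param['tree_method'] = 'gpu_hist'
# param['silent']=1
# param['max_depth']=6
# eval_list  = [(data_train,'train'),(data_test,'test')]
# num_round = 50
# eval_history={}

# # Train the model
# model = xgb.train( param, data_train, num_round,eval_list,
#                   evals_result=eval_history,verbose_eval=False)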

In [None]:
rmse_train=eval_history['train']['rmse']
rmse_test=eval_history['test']['rmse']

[Back to index](#Index)

### 3. Inspecting the training progress

In [None]:
plt.plot(rmse_train,ms=10,marker='.',label='train_eval')
plt.plot(rmse_test,ms=10,marker='v',label='test_eval')
plt.legend()
plt.show()

In [None]:
# Check the final RMSE on the test data
model.eval(data_test)
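
The `rmse` values stored in `eval_history` are root-mean-square errors: the square root of the mean squared difference between predictions and targets. A minimal NumPy sketch of that definition follows. The helper name `rmse` and the toy numbers are illustrative, not part of the notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

# Errors are 1, 0 and -2, so the mean squared error is 5/3
print(rmse(y_true, y_pred))
```

With the notebook's variables, `rmse(testY, model.predict(data_test))` should be close to the value that `model.eval(data_test)` reports.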

[Back to index](#Index)

### 4. Inspecting feature importance

In [None]:
from xgboost import plot_importance
plot_importance(model)
plt.show()

[Back to index](#Index)

### 5. Evaluating the regression with $R^2$

In [None]:
from sklearn.metrics import r2_score
testY_pred=model.predict(data_test)
r2_score(testY, testY_pred)

In [None]:
trainY_pred = model.predict(data_train)
r2_score(trainY, trainY_pred)

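The model reaches $R^2 = 0.78$ on the training data and $R^2 = 0.73$ on the test data. Because the train/test split above is not seeded, the exact values can differ between runs.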
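
For reference, $R^2 = 1 - SS_{res}/SS_{tot}$, where $SS_{res}$ is the sum of squared prediction errors and $SS_{tot}$ is the sum of squared deviations of the targets from their mean. The sketch below computes it by hand on toy numbers. The helper name `r_squared` is illustrative, and `r2_score` from scikit-learn is what the notebook actually uses.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return float(1.0 - ss_res / ss_tot)

y_true = [1.0, 2.0, 3.0, 4.0]

print(r_squared(y_true, y_true))                # perfect predictions give 1.0
print(r_squared(y_true, [2.5, 2.5, 2.5, 2.5]))  # always predicting the mean gives 0.0
```

A value of 0.73 on the test data therefore means the model explains roughly 73% of the variance in `Purchase` that a constant mean prediction leaves unexplained.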

[Back to index](#Index)

---

#### Exercise 1: Increase the tree depth and see how the model's RMSE and $R^2$ change

In [None]:
# Practice here
# ..

#### Exercise 2: Adjust the strength of the L1/L2 regularization terms and see whether the model's RMSE and $R^2$ change

In [None]:
# Practice here
# ..

#### Exercise 3: Remove the outliers in the Purchase column, then build the model again

In [None]:
# Practice here
# ..
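
As one possible starting point for Exercise 3, the sketch below drops rows whose value lies outside the interquartile-range (IQR) fences. The notebook does not prescribe an outlier rule, so the IQR rule, the helper name `drop_iqr_outliers`, the factor `k=1.5`, and the toy data are all assumptions made for illustration.

```python
import pandas as pd

def drop_iqr_outliers(frame, column, k=1.5):
    """Keep only rows whose value in `column` lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    keep = frame[column].between(q1 - k * iqr, q3 + k * iqr)
    return frame[keep]

# Toy data: five ordinary purchases and one extreme value
toy = pd.DataFrame({'Purchase': [100, 110, 120, 130, 140, 5000]})
cleaned = drop_iqr_outliers(toy, 'Purchase')

print(len(toy), len(cleaned))
print(cleaned['Purchase'].tolist())
```

Applied to the notebook's data, this would be `df = drop_iqr_outliers(df, 'Purchase')` before the features and target are separated, followed by rerunning the encoding, split, and training cells.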