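## Sample Code & Assignment
Complete the scikit-learn-practice competition (click the link to open the competition page), which is meant to help everyone get familiar with Scikit-learn.
- 1,000 training samples with 40 features; a binary classification problem for practicing feature scaling, modeling, and parameter tuning
- At most 10 submissions per day
- Reach an accuracy of 0.7 or higher on the private / public leaderboard
- Feel free to study other people's Kernels to learn their coding style and reasoning, then complete your own Kaggle competition entry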
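
The steps named in the list (feature scaling, modeling) can be sketched as a single scikit-learn pipeline. This is a minimal sketch under stated assumptions, not the notebook's method: the data is a synthetic stand-in generated with `make_classification`, and `SVC` is chosen only because it is sensitive to feature scale. The `GradientBoostingClassifier` used later in this notebook is tree-based and largely insensitive to scaling.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the competition data (assumption):
# 1,000 rows, 40 features, binary labels.
X, y = make_classification(
    n_samples=1000, n_features=40, n_informative=10, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The pipeline fits the scaler on the training split only and reuses
# those statistics when transforming the test split.
scaled_svc = make_pipeline(StandardScaler(), SVC())
scaled_svc.fit(X_train, y_train)

pipeline_accuracy = scaled_svc.score(X_test, y_test)
print("Accuracy:", pipeline_accuracy)
```

Wrapping the scaler and the model together means the same object can later be passed to a parameter search without leaking test-fold statistics into the scaling step.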

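To submit the assignment, take a screenshot of your submission on the Kaggle competition page, upload it to GitHub, and then submit the GitHub link on the official website. (A sample screenshot of the Kaggle competition page follows.)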

In [1]:
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

In [2]:
x_data = pd.read_csv('data/data-science-london-scikit-learn/train.csv', header=None)
y_data = pd.read_csv('data/data-science-london-scikit-learn/trainLabels.csv', header=None)
x_valid = pd.read_csv('data/data-science-london-scikit-learn/test.csv', header=None)

In [3]:
x_data.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.299403 | -1.226624 | 1.498425 | -1.17615 | 5.289853 | 0.208297 | 2.404498 | 1.594506 | -0.051608 | 0.663234 | ... | -0.850465 | -0.62299 | -1.833057 | 0.293024 | 3.552681 | 0.717611 | 3.305972 | -2.715559 | -2.682409 | 0.10105 |
| 1 | -1.174176 | 0.332157 | 0.949919 | -1.285328 | 2.199061 | -0.151268 | -0.427039 | 2.619246 | -0.765884 | -0.09378 | ... | -0.81975 | 0.012037 | 2.038836 | 0.468579 | -0.517657 | 0.422326 | 0.803699 | 1.213219 | 1.382932 | -1.817761 |
| 2 | 1.192222 | -0.414371 | 0.067054 | -2.233568 | 3.658881 | 0.089007 | 0.203439 | -4.219054 | -1.184919 | -1.24031 | ... | -0.604501 | 0.750054 | -3.360521 | 0.856988 | -2.751451 | -1.582735 | 1.672246 | 0.656438 | -0.932473 | 2.987436 |
| 3 | 1.57327 | -0.580318 | -0.866332 | -0.603812 | 3.125716 | 0.870321 | -0.161992 | 4.499666 | 1.038741 | -1.092716 | ... | 1.022959 | 1.275598 | -3.48011 | -1.065252 | 2.153133 | 1.563539 | 2.767117 | 0.215748 | 0.619645 | 1.883397 |
| 4 | -0.613071 | -0.644204 | 1.112558 | -0.032397 | 3.490142 | -0.011935 | 1.443521 | -4.290282 | -1.761308 | 0.807652 | ... | 0.513906 | -1.803473 | 0.518579 | -0.205029 | -4.744566 | -1.520015 | 1.830651 | 0.870772 | -1.894609 | 0.408332 |


In [4]:
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.2)
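
The split above is reshuffled on every run, so the accuracies reported in this notebook will vary slightly between runs. A minimal sketch of a reproducible, class-balanced split follows. The toy data and the names `x_demo`, `y_demo`, `x_tr`, `x_te`, `y_tr`, and `y_te` are assumptions for illustration, and the seed value 42 is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in data (assumption): 100 rows, 4 features, balanced binary labels.
rng = np.random.default_rng(0)
x_demo = pd.DataFrame(rng.normal(size=(100, 4)))
y_demo = pd.Series([0, 1] * 50)

# random_state fixes the shuffle; stratify keeps the class ratio
# the same in the training and test splits.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(x_tr.shape, x_te.shape)
print(y_te.value_counts().to_dict())
```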

### Basic parameters

In [5]:
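gbc = GradientBoostingClassifier()

# y_train is a one-column DataFrame; pass the column itself (a 1-D Series)
# so that scikit-learn does not warn about a column-vector y.
gbc.fit(x_train, y_train[0])

y_pred = gbc.predict(x_test)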
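
Because the label file is read with `header=None`, `y_data` (and therefore `y_train`) is a one-column DataFrame of shape `(n, 1)`, whereas scikit-learn classifiers expect a 1-D target and emit a `DataConversionWarning` when given a column vector. The sketch below uses toy labels (an assumption, not the competition data) to show two ways of obtaining a 1-D target.

```python
import pandas as pd

# Toy one-column label frame, mimicking pd.read_csv(..., header=None)
# applied to a labels file.
y_frame = pd.DataFrame([1, 0, 1, 1, 0])
print(y_frame.shape)   # 2-D: one column

# Option 1: select the single column, which header=None labels as 0.
y_series = y_frame[0]
print(y_series.shape)  # 1-D Series

# Option 2: flatten the underlying NumPy array.
y_array = y_frame.values.ravel()
print(y_array.shape)   # 1-D ndarray
```

Either form can be passed as `y` to `fit`; the later grid-search cell in this notebook uses the first form, `y_train[0]`.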


In [6]:
acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)

Accuracy:  0.86


### Hyper-parameter search

In [7]:
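# Define the hyper-parameter combinations to try
param_grid = {
    'n_estimators': [50, 75, 100, 125, 150, 175, 200],
    'learning_rate': [0.015, 0.03, 0.045, 0.06, 0.075, 0.09, 0.105]
}

# Build the search object from the model and the parameter grid
# (n_jobs=-1 runs the fits in parallel on all CPU cores)
grid_search = GridSearchCV(gbc, param_grid, scoring="accuracy", n_jobs=-1, verbose=1)

# Start searching for the best parameters.
# Note: the label column y_train[0] (1-D) is passed here because passing the
# y_train DataFrame itself raised "Too many indices in the array".
# Reference: https://stackoverflow.com/questions/42928855/gridsearchcv-error-too-many-indices-in-the-array
grid_result = grid_search.fit(x_train, y_train[0])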

Fitting 3 folds for each of 49 candidates, totalling 147 fits


[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    2.3s
[Parallel(n_jobs=-1)]: Done 147 out of 147 | elapsed:    9.6s finished


In [8]:
# Print the best score and the best parameters
print("Best Accuracy: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

Best Accuracy: 0.871250 using {'learning_rate': 0.105, 'n_estimators': 125}
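
The best `learning_rate` reported above, 0.105, is the largest value in the grid, so it can be worth looking at the whole score table rather than only the winner. The sketch below shows one way to do that with `cv_results_`, and also shows that `best_estimator_` is available without retraining by hand. The synthetic data, the small grid `small_grid`, and `cv=3` are assumptions chosen so the example runs quickly; they are not the notebook's settings.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic data set and grid (assumptions) to keep the run short.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
small_grid = {"n_estimators": [25, 50], "learning_rate": [0.05, 0.1]}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    small_grid,
    scoring="accuracy",
    cv=3,
)
search.fit(X, y)

# One row per parameter combination, best mean CV accuracy first.
results = pd.DataFrame(search.cv_results_)
ranking = results[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(ranking.sort_values("rank_test_score"))

# With the default refit=True, best_estimator_ has already been
# retrained on all of X, y using the best parameters.
best_model = search.best_estimator_
print(search.best_params_)
```

If the top-ranked rows all sit at one edge of the grid, extending the grid in that direction is a reasonable next experiment.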


In [9]:
gbc_hyper = GradientBoostingClassifier(
    n_estimators=grid_result.best_params_['n_estimators'],
    learning_rate=grid_result.best_params_['learning_rate']
)

# Pass the label column (1-D) to avoid the column-vector warning
gbc_hyper.fit(x_train, y_train[0])

y_pred = gbc_hyper.predict(x_test)


In [10]:
acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)

Accuracy:  0.86
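
Both the default and the tuned model score 0.86 on the 200-row hold-out split, so a single split cannot tell them apart. A hedged sketch of a comparison that averages over several folds with `cross_val_score` follows. The synthetic data and `cv=5` are assumptions; the tuned values are the ones reported by the grid search above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption), kept small so the sketch runs quickly.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

candidates = {
    "default": GradientBoostingClassifier(random_state=0),
    # Parameter values taken from the grid search result reported above.
    "tuned": GradientBoostingClassifier(
        n_estimators=125, learning_rate=0.105, random_state=0
    ),
}

# Mean and standard deviation of accuracy over 5 folds for each model.
cv_summary = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

When the difference between the two means is smaller than the fold-to-fold spread, the tuning gain is probably within noise.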


### Prediction
#### Basic

In [11]:
y_pred = pd.DataFrame(gbc.predict(x_valid))

y_pred.index += 1
y_pred.reset_index(inplace=True)
y_pred.columns = ['Id', 'Solution']
y_pred.to_csv('submission.csv', index=False)

#### Hyper-parameter search

In [12]:
# hyper-parameter search
y_pred_hyper = pd.DataFrame(gbc_hyper.predict(x_valid))

y_pred_hyper.index += 1
y_pred_hyper.reset_index(inplace=True)
y_pred_hyper.columns = ['Id', 'Solution']
y_pred_hyper.to_csv('submission_hyper.csv', index=False)
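
The two prediction cells repeat the same steps: wrap the predictions in a DataFrame, shift the index to start at 1, and name the columns `Id` and `Solution`. A small helper can express this once. `make_submission` is a hypothetical name introduced here for illustration, and the toy predictions stand in for `gbc.predict(x_valid)`.

```python
import numpy as np
import pandas as pd


def make_submission(predictions):
    """Return a DataFrame with a 1-based Id column and a Solution column."""
    predictions = np.asarray(predictions).ravel()
    return pd.DataFrame({
        "Id": np.arange(1, len(predictions) + 1),
        "Solution": predictions,
    })


# Toy predictions (assumption) in place of a real model's output.
demo_submission = make_submission([1, 0, 0, 1])
print(demo_submission)

# Writing the file works as in the cells above, for example:
# demo_submission.to_csv('submission.csv', index=False)
```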