# Q: What is the best model (and parameters) for activity recognition?
* Conduct a k-fold cross-validated grid search to tune the parameters of SVM, RF, KNN, NB, and Neural Net classifiers.
* Validation should report each model's accuracy, plus precision and recall for each class (confusion matrix).
* Do the mediocre models get the same examples right as the best model, or different ones? Will an ensemble approach work well?
* How reliable are the model confidences? Are highly confident predictions typically correct?
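
The last two questions can be checked with a few lines of NumPy. The sketch below is only an illustration on synthetic data (the `make_classification` dataset and the two models are stand-ins, not the activity data or the tuned classifiers): it counts the test examples a weaker model gets right where the stronger one fails, and compares accuracy on high- versus low-confidence predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the activity features (assumption, not the real data)
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

strong = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
weak = GaussianNB().fit(X_train, y_train)

strong_correct = strong.predict(X_test) == y_test
weak_correct = weak.predict(X_test) == y_test

# Examples the weaker model gets right that the stronger one misses:
# a large count hints that an ensemble could add something.
complementary = int(np.sum(weak_correct & ~strong_correct))
both_wrong = int(np.sum(~weak_correct & ~strong_correct))
print("weak right / strong wrong:", complementary, "| both wrong:", both_wrong)

# Confidence check: accuracy among high- vs low-confidence predictions,
# split at the median of the top predicted probability.
confidence = strong.predict_proba(X_test).max(axis=1)
high = confidence >= np.median(confidence)
acc_high = strong_correct[high].mean()
acc_low = strong_correct[~high].mean()
print("accuracy (high confidence): %.3f, (low confidence): %.3f" % (acc_high, acc_low))
```

If the high-confidence accuracy is not clearly above the low-confidence accuracy, the probabilities are probably not informative enough to use as a filter.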

# Optimize Random Forest Classifier

* Consider using precision, F1 score, and balanced accuracy as the optimization metric
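
A minimal sketch of how these metrics could be plugged into a grid search, assuming macro averaging is the right choice for the multi-class labels. The synthetic imbalanced dataset and the tiny grid are placeholders; `refit` picks the metric that selects the final model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, f1_score, make_scorer, precision_score
from sklearn.model_selection import GridSearchCV

# Imbalanced three-class toy data (assumption, stands in for the activity labels)
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, weights=[0.6, 0.3, 0.1], random_state=0)

# One scorer per metric; macro averaging weights every class equally
scoring = {
    "precision_macro": make_scorer(precision_score, average="macro", zero_division=0),
    "f1_macro": make_scorer(f1_score, average="macro"),
    "balanced_accuracy": make_scorer(balanced_accuracy_score),
}

search = GridSearchCV(RandomForestClassifier(n_estimators=25, random_state=0),
                      param_grid={"min_samples_leaf": [1, 5]},
                      scoring=scoring, refit="balanced_accuracy", cv=3)
search.fit(X, y)

print(search.best_params_)
print(search.cv_results_["mean_test_f1_macro"])
```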

### [Advice from a Kaggle forum](https://www.kaggle.com/general/4092)
1. tune the number of candidate features for splitting (max_features), starting with min_samples_leaf=1
2. tune the depth of the trees (via min_samples_leaf)
3. tune the number of trees

### sklearn's [tips for parameter tuning](http://scikit-learn.org/stable/modules/grid_search.html#tips-for-parameter-search)

### Analytics Vidhya [tuning random forest](https://www.analyticsvidhya.com/blog/2015/06/tuning-random-forest-model/)

In [1]:
import pandas as pd
import numpy as np
import time
import importlib.machinery
es = importlib.machinery.SourceFileLoader('extrasense','/home/sac086/extrasensory/extrasense/extrasense.py').load_module()
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import GroupShuffleSplit, GroupKFold, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline

### Load Data

In [2]:
features_df = es.get_impersonal_data(leave_users_out=[], data_type="activity", labeled_only=False)

# remove rows that have no label
no_label_mask = features_df.label.isnull()
features_df = features_df[~no_label_mask]

timestamps = features_df.pop('timestamp')
label_source = features_df.pop("label_source")
labels = features_df.pop("label")
user_ids = features_df.pop("user_id")

In [3]:
resulting_scores = {}

## Score to beat
* In previous trials, the best parameters were n_estimators=100, max_features="log2", min_samples_leaf=1

In [9]:
# set up the accuracy scorer for cross validation
scorer = make_scorer(accuracy_score)

### ZeroR

In [None]:
steps = []
steps.append(('standardize', StandardScaler()))
steps.append(('ZeroR', es.ZeroR()))
clf = Pipeline(steps)

In [12]:
score = cross_val_score(clf, features_df, labels, groups=user_ids, cv=GroupKFold(n_splits=3), n_jobs=20, scoring=scorer)
resulting_scores["ZeroR"]=score
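
The source of `es.ZeroR` is not shown here. Assuming it simply predicts the most frequent training class, scikit-learn's `DummyClassifier(strategy="most_frequent")` gives a comparable baseline. The sketch below uses made-up labels and six pretend users.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.RandomState(0)
X = rng.normal(size=(120, 4))                      # features are ignored by the baseline
y = np.array(["sitting"] * 70 + ["walking"] * 30 + ["running"] * 20)
groups = np.arange(120) % 6                        # six pretend users (assumption)

# Majority-class baseline, evaluated with the same user-wise folds as the real models
zero_r = DummyClassifier(strategy="most_frequent")
scores = cross_val_score(zero_r, X, y, groups=groups,
                         cv=GroupKFold(n_splits=3), scoring="accuracy")
print(scores)
```

Each fold's score is just the share of the majority class in that fold's test users, which is the number every real model has to beat.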

### Random Forest with simple parameters

In [30]:
RF = RandomForestClassifier(n_estimators=500, max_features="log2", min_samples_leaf=1)
steps = [('standardize', StandardScaler()),
         ('RF_basic', RF)]
clf = Pipeline(steps)
score = cross_val_score(clf, features_df, labels, groups=user_ids, cv=GroupKFold(n_splits=3), n_jobs=20, scoring=scorer)
resulting_scores["RandomForest_basic"] = score
print(score)

In [31]:
print(score)

[ 0.60857622  0.62225529  0.54801124]
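
The scores above differ noticeably between folds because `GroupKFold` holds out whole users: every test fold contains only people the model has never seen. A small sketch with six hypothetical user ids shows that train and test users never overlap.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Six hypothetical users with ten samples each
user_ids = np.repeat(["u1", "u2", "u3", "u4", "u5", "u6"], 10)
X = np.zeros((len(user_ids), 3))
y = np.zeros(len(user_ids))

held_out = []
for fold, (train_idx, test_idx) in enumerate(GroupKFold(n_splits=3).split(X, y, user_ids)):
    train_users = set(user_ids[train_idx])
    test_users = set(user_ids[test_idx])
    # No user contributes samples to both sides of the split
    assert train_users.isdisjoint(test_users)
    held_out.append(sorted(test_users))
    print("fold", fold, "tests on", sorted(test_users))
```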


### Tuning Random Forest

#### Tuning the max features

In [14]:
param_grid = {"clf__max_features":["auto", "sqrt", "log2", 0.2, 0.4, 0.6, 0.8]}
RF = RandomForestClassifier(n_estimators=500, min_samples_leaf=1) # max_features is set by the grid search

In [None]:
steps = [('standardize', StandardScaler()),
         ('clf', RF)]
clf = Pipeline(steps)
cv = list(GroupKFold(n_splits=3).split(features_df, labels, user_ids))
gsv = GridSearchCV(clf, param_grid=param_grid, cv=cv, n_jobs=12, verbose=2)

gsv_results = gsv.fit(features_df, labels)
print("best score : %s" % gsv_results.best_score_)
print("best params : %s" % gsv_results.best_params_)

In [15]:
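cv = list(GroupKFold(n_splits=3).split(features_df, labels, user_ids))
# this search uses the bare estimator (no Pipeline), so drop the "clf__" prefix from the grid keys
rf_param_grid = {name.replace("clf__", ""): values for name, values in param_grid.items()}
gsv = GridSearchCV(RandomForestClassifier(n_estimators=500, min_samples_leaf=1), param_grid=rf_param_grid, cv=cv, n_jobs=20)
results = gsv.fit(features_df, labels)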

In [23]:
results.cv_results_['mean_test_score']

array([ 0.58944549,  0.58944549,  0.59473121,  0.58944549,  0.58180984,
        0.5772955 ,  0.57567463])

In [17]:
results.best_params_

{'max_features': 'log2'}

In [18]:
resulting_scores

{'RandomForest_basic': array([ 0.60941991,  0.60791178,  0.55080214]),
 'ZeroR': array([ 0.45628011,  0.44938444,  0.42823686])}

In [24]:
resulting_scores['RandomForest_max_features_log2'] = results.best_score_

#### Tuning min_samples_leaf

In [37]:
param_grid = {"clf__min_samples_leaf":[10,20,30,40,50,60,70,80,90,100]}
RF = RandomForestClassifier(n_estimators=500, max_features="log2") # min_samples_leaf is set by the grid search

In [39]:
steps = [('standardize', StandardScaler()),
         ('clf', RF)]
clf = Pipeline(steps)
cv = list(GroupKFold(n_splits=3).split(features_df, labels, user_ids))
gsv = GridSearchCV(clf, param_grid=param_grid, cv=cv, n_jobs=12, verbose=2)

gsv_results = gsv.fit(features_df, labels)
print("best score : %s" % gsv_results.best_score_)
print("best params : %s" % gsv_results.best_params_)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV] clf__min_samples_leaf=10 ........................................
[CV] clf__min_samples_leaf=10 ........................................
[CV] clf__min_samples_leaf=10 ........................................
[CV] clf__min_samples_leaf=20 ........................................
[CV] clf__min_samples_leaf=20 ........................................
[CV] clf__min_samples_leaf=20 ........................................
[CV] clf__min_samples_leaf=30 ........................................
[CV] clf__min_samples_leaf=30 ........................................
[CV] clf__min_samples_leaf=30 ........................................
[CV] clf__min_samples_leaf=40 ........................................
[CV] clf__min_samples_leaf=40 ........................................
[CV] clf__min_samples_leaf=40 ........................................
[CV] ......................... clf__min_samples_leaf=40, total= 8.3min
[CV] clf__min_sa

[Parallel(n_jobs=12)]: Done  23 out of  30 | elapsed: 18.0min remaining:  5.5min


[CV] ......................... clf__min_samples_leaf=80, total= 7.8min
[CV] ......................... clf__min_samples_leaf=90, total= 6.8min
[CV] ......................... clf__min_samples_leaf=90, total= 6.7min
[CV] ........................ clf__min_samples_leaf=100, total= 6.7min
[CV] ......................... clf__min_samples_leaf=90, total= 6.7min
[CV] ........................ clf__min_samples_leaf=100, total= 6.7min
[CV] ........................ clf__min_samples_leaf=100, total= 6.8min


[Parallel(n_jobs=12)]: Done  30 out of  30 | elapsed: 24.3min finished


best score : 0.614865106772
best params : {'clf__min_samples_leaf': 80}


In [40]:
gsv_results.cv_results_

{'mean_fit_time': array([ 567.8111376 ,  531.58625722,  509.29925283,  492.87803324,
         484.49457073,  471.49230687,  462.63780745,  456.62535318,
         397.51769185,  394.60614038]),
 'mean_score_time': array([ 10.41880687,   9.54804317,   9.11394922,   8.84795165,
          8.60592993,   8.41252112,   8.40913916,   7.74767915,
          7.3434844 ,   7.28181521]),
 'mean_test_score': array([ 0.60277045,  0.60750286,  0.61036378,  0.61069902,  0.61233616,
         0.6128374 ,  0.61203998,  0.61486511,  0.61435086,  0.61401887]),
 'mean_train_score': array([ 0.88126387,  0.8404917 ,  0.8209558 ,  0.80793895,  0.79850808,
         0.79047762,  0.78448423,  0.77959766,  0.7753806 ,  0.77170752]),
 'param_clf__min_samples_leaf': masked_array(data = [10 20 30 40 50 60 70 80 90 100],
              mask = [False False False False False False False False False False],
        fill_value = ?),
 'params': [{'clf__min_samples_leaf': 10},
  {'clf__min_samples_leaf': 20},
  {'clf__min_sam
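
The `cv_results_` dictionary is easier to read as a table. In the output above, the train score falls from 0.88 to 0.77 while the test score stays near 0.61, so larger leaves mostly reduce overfitting rather than improve generalization. The sketch below shows one way to tabulate that gap on toy data; recent scikit-learn releases only report train scores when `return_train_score=True` is passed.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy data and a tiny grid (assumptions, chosen to run quickly)
X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=0)
search = GridSearchCV(RandomForestClassifier(n_estimators=25, random_state=0),
                      param_grid={"min_samples_leaf": [1, 10, 40]},
                      cv=3, return_train_score=True)
search.fit(X, y)

# Keep only the columns needed to compare train and test performance
table = pd.DataFrame(search.cv_results_)[
    ["param_min_samples_leaf", "mean_train_score", "mean_test_score"]]
table["gap"] = table["mean_train_score"] - table["mean_test_score"]
print(table.sort_values("mean_test_score", ascending=False))
```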

In [41]:
RF = RandomForestClassifier(n_estimators=500, max_features="log2", min_samples_leaf=80)
steps = [('standardize', StandardScaler()),
         ('RF_basic', RF)]
clf = Pipeline(steps)
score = cross_val_score(clf, features_df, labels, groups=user_ids, cv=GroupKFold(n_splits=3), n_jobs=15, scoring=scorer, verbose=2)
resulting_scores["RandomForest_optimal"] = score

In [44]:
np.mean(score)

0.61487113249084513

In [42]:
from sklearn.metrics import classification_report
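
A possible way to use `classification_report` here: collect out-of-fold predictions with `cross_val_predict`, then print per-class precision and recall alongside the confusion matrix. The data, groups, and model below are placeholders, not the tuned pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GroupKFold, cross_val_predict

X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
groups = np.arange(len(y)) % 6   # six pretend users (assumption)

model = RandomForestClassifier(n_estimators=25, random_state=0)
# Every sample is predicted by a model that never saw its user during training
predictions = cross_val_predict(model, X, y, groups=groups, cv=GroupKFold(n_splits=3))

print(classification_report(y, predictions))
matrix = confusion_matrix(y, predictions)   # rows: true class, columns: predicted class
print(matrix)
```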

# Use Basic Neural Network to compare

# SKLearn MLP Classifier

In [51]:
from sklearn.neural_network import MLPClassifier

In [61]:
mlp = MLPClassifier(solver='adam', alpha=1e-5, hidden_layer_sizes=(20,10,5,2,5,10,20))

In [62]:
steps = [('standardize', StandardScaler()),
         ('clf', mlp)]
clf = Pipeline(steps)
score = cross_val_score(clf, features_df, labels, groups=user_ids, cv=GroupKFold(n_splits=3), n_jobs=15, scoring=scorer, verbose=2)
resulting_scores["MLP_basic"] = score  # separate key so the Random Forest result is not overwritten

[CV]  ................................................................
[CV]  ................................................................
[CV]  ................................................................
[CV] ................................................. , total= 1.1min
[CV] ................................................. , total= 1.2min
[CV] ................................................. , total= 1.3min


[Parallel(n_jobs=15)]: Done   3 out of   3 | elapsed:  1.4min finished


In [63]:
score

array([ 0.63620221,  0.54164965,  0.5636832 ])

In [64]:
np.mean(score)

0.58051168802577691

In [72]:
param_grid = {'clf__solver': ['sgd', 'lbfgs', 'adam'],
              'clf__learning_rate':['constant', 'invscaling'],
              'clf__nesterovs_momentum':[True,False],
              'clf__momentum':[0,0.9]}

In [73]:
mlp = MLPClassifier(learning_rate_init=0.2, alpha=1e-5, hidden_layer_sizes=(20,10,5,2,5,10,20))

In [74]:
steps = [('standardize', StandardScaler()),
         ('clf', mlp)]
clf = Pipeline(steps)

In [75]:
clf = Pipeline(steps)
cv = list(GroupKFold(n_splits=3).split(features_df, labels, user_ids))
gsv = GridSearchCV(clf, param_grid=param_grid, cv=cv, n_jobs=12, verbose=2)

gsv_results = gsv.fit(features_df, labels)
print("best score : %s" % gsv_results.best_score_)
print("best params : %s" % gsv_results.best_params_)

Fitting 3 folds for each of 24 candidates, totalling 72 fits
[CV] clf__nesterovs_momentum=True, clf__solver=sgd, clf__momentum=0, clf__learning_rate=constant 
[CV] clf__nesterovs_momentum=True, clf__solver=sgd, clf__momentum=0, clf__learning_rate=constant 
[CV] clf__nesterovs_momentum=True, clf__solver=sgd, clf__momentum=0, clf__learning_rate=constant 
[CV] clf__nesterovs_momentum=True, clf__solver=lbfgs, clf__momentum=0, clf__learning_rate=constant 
[CV] clf__nesterovs_momentum=True, clf__solver=lbfgs, clf__momentum=0, clf__learning_rate=constant 
[CV] clf__nesterovs_momentum=True, clf__solver=lbfgs, clf__momentum=0, clf__learning_rate=constant 
[CV] clf__nesterovs_momentum=True, clf__solver=adam, clf__momentum=0, clf__learning_rate=constant 
[CV] clf__nesterovs_momentum=True, clf__solver=adam, clf__momentum=0, clf__learning_rate=constant 
[CV] clf__nesterovs_momentum=True, clf__solver=adam, clf__momentum=0, clf__learning_rate=constant 
[CV] clf__nesterovs_momentum=False, clf__solver=

[Parallel(n_jobs=12)]: Done  17 tasks      | elapsed:  1.7min


[CV]  clf__nesterovs_momentum=True, clf__solver=lbfgs, clf__momentum=0, clf__learning_rate=constant, total= 1.8min
[CV] clf__nesterovs_momentum=False, clf__solver=sgd, clf__momentum=0.9, clf__learning_rate=constant 
[CV]  clf__nesterovs_momentum=True, clf__solver=lbfgs, clf__momentum=0, clf__learning_rate=constant, total= 1.8min
[CV]  clf__nesterovs_momentum=True, clf__solver=lbfgs, clf__momentum=0, clf__learning_rate=constant, total= 1.8min
[CV] clf__nesterovs_momentum=False, clf__solver=lbfgs, clf__momentum=0.9, clf__learning_rate=constant 
[CV] clf__nesterovs_momentum=False, clf__solver=lbfgs, clf__momentum=0.9, clf__learning_rate=constant 
[CV]  clf__nesterovs_momentum=True, clf__solver=adam, clf__momentum=0.9, clf__learning_rate=constant, total=  14.3s
[CV] clf__nesterovs_momentum=False, clf__solver=lbfgs, clf__momentum=0.9, clf__learning_rate=constant 
[CV]  clf__nesterovs_momentum=False, clf__solver=sgd, clf__momentum=0.9, clf__learning_rate=constant, total=  14.3s
[CV] clf__nes

[CV] clf__nesterovs_momentum=False, clf__solver=lbfgs, clf__momentum=0.9, clf__learning_rate=invscaling 
[CV]  clf__nesterovs_momentum=True, clf__solver=adam, clf__momentum=0.9, clf__learning_rate=invscaling, total=  14.6s
[CV] clf__nesterovs_momentum=False, clf__solver=lbfgs, clf__momentum=0.9, clf__learning_rate=invscaling 
[CV]  clf__nesterovs_momentum=True, clf__solver=adam, clf__momentum=0.9, clf__learning_rate=invscaling, total=  10.2s
[CV] clf__nesterovs_momentum=False, clf__solver=lbfgs, clf__momentum=0.9, clf__learning_rate=invscaling 
[CV]  clf__nesterovs_momentum=False, clf__solver=sgd, clf__momentum=0.9, clf__learning_rate=invscaling, total=   9.2s
[CV] clf__nesterovs_momentum=False, clf__solver=adam, clf__momentum=0.9, clf__learning_rate=invscaling 
[CV]  clf__nesterovs_momentum=True, clf__solver=sgd, clf__momentum=0.9, clf__learning_rate=invscaling, total=  40.9s
[CV] clf__nesterovs_momentum=False, clf__solver=adam, clf__momentum=0.9, clf__learning_rate=invscaling 
[CV]  

[Parallel(n_jobs=12)]: Done  72 out of  72 | elapsed:  5.6min finished


best score : 0.597071373473
best params : {'clf__nesterovs_momentum': False, 'clf__solver': 'sgd', 'clf__momentum': 0, 'clf__learning_rate': 'invscaling'}
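
In `MLPClassifier`, `learning_rate`, `momentum`, and `nesterovs_momentum` are only used by the `sgd` solver, so most of the 24 candidates above repeat identical `lbfgs` and `adam` fits. A list of grids avoids the redundant work; the sketch below only counts candidates and does not train anything.

```python
from sklearn.model_selection import ParameterGrid

# The grid used above: every combination is tried for every solver
full_grid = {"clf__solver": ["sgd", "lbfgs", "adam"],
             "clf__learning_rate": ["constant", "invscaling"],
             "clf__nesterovs_momentum": [True, False],
             "clf__momentum": [0, 0.9]}

# SGD-only options are varied for sgd alone; the other solvers get one fit each
split_grid = [
    {"clf__solver": ["sgd"],
     "clf__learning_rate": ["constant", "invscaling"],
     "clf__nesterovs_momentum": [True, False],
     "clf__momentum": [0, 0.9]},
    {"clf__solver": ["lbfgs", "adam"]},
]

n_full = len(ParameterGrid(full_grid))
n_split = len(ParameterGrid(split_grid))
print(n_full, n_split)  # 24 10
```

`GridSearchCV` accepts the list form directly as `param_grid`.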


# Tuning an SVM classifier

In [4]:
from sklearn.svm import SVC

In [5]:
param_grid = {'clf__C':[0.001, 0.01, 0.1, 1, 10], 
              'clf__gamma': [0.001, 0.01, 0.1, 1]}

In [6]:
steps = [('standardize', StandardScaler()),
         ('clf', SVC(kernel="rbf", cache_size=3000))]
clf = Pipeline(steps)

In [11]:
score = cross_val_score(clf, features_df, labels, groups=user_ids, cv=GroupKFold(n_splits=3), n_jobs=12, scoring=scorer, verbose=2)


[CV]  ................................................................
[CV]  ................................................................
[CV]  ................................................................


Process ForkPoolWorker-22:
Process ForkPoolWorker-23:
Process ForkPoolWorker-21:
Process ForkPoolWorker-28:
Process ForkPoolWorker-27:
Process ForkPoolWorker-20:
Process ForkPoolWorker-26:
Process ForkPoolWorker-25:
Process ForkPoolWorker-24:
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "/usr/lib64/python3.4/multiprocessing/process.py", line 254, in _bootstrap
    self.run()
Traceback (most recent call last):
  File "/usr/lib64/python3.4/multiprocessing/process.py", line 254, in _bootstrap
    self.run()
  File "/usr/lib64/python3.4/multiprocessing/process.py", line 254, in _bootstrap
    self.run()
  File "/usr/lib64/python3.4/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib64/python3.4/mu

KeyboardInterrupt: 

In [7]:
clf = Pipeline(steps)
cv = list(GroupKFold(n_splits=3).split(features_df, labels, user_ids))
gsv = GridSearchCV(clf, param_grid=param_grid, cv=cv, n_jobs=4, verbose=2)

gsv_results = gsv.fit(features_df, labels)
print("best score : %s" % gsv_results.best_score_)
print("best params : %s" % gsv_results.best_params_)

Fitting 3 folds for each of 20 candidates, totalling 60 fits
[CV] clf__C=0.001, clf__gamma=0.001 ..................................
[CV] clf__C=0.001, clf__gamma=0.001 ..................................
[CV] clf__C=0.001, clf__gamma=0.001 ..................................
[CV] clf__C=0.001, clf__gamma=0.01 ...................................


KeyboardInterrupt: 

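These SVMs did not finish training with the RBF kernel even after many hours. Kernel SVM training time grows at least quadratically with the number of samples, so the dataset is likely too large for `SVC` to fit in a reasonable time, and possibly too large to hold in even 60GB of memory.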
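
One alternative that was not tried in this notebook is to approximate the RBF kernel with a fixed number of components and train a linear SVM on top, which scales roughly linearly with the number of samples. The sketch below uses scikit-learn's `Nystroem` transformer; the `gamma`, `n_components`, and `C` values are arbitrary placeholders, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import Nystroem
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=400, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

approx_svm = Pipeline([
    ("standardize", StandardScaler()),
    # Map the features into an approximate RBF feature space
    ("rbf_map", Nystroem(kernel="rbf", gamma=0.1, n_components=100, random_state=0)),
    # A linear SVM on the mapped features stands in for the kernel SVM
    ("clf", LinearSVC(C=1.0, max_iter=5000)),
])

scores = cross_val_score(approx_svm, X, y, cv=3)
print(scores.mean())
```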

# Bagging SVMs

In [19]:
from sklearn.ensemble import BaggingClassifier

In [36]:
steps = [('standardize', StandardScaler()),
         ('clf', BaggingClassifier(SVC(kernel="rbf", class_weight="balanced", probability=True), max_samples=0.1, max_features=0.25, n_jobs=4, n_estimators=100, verbose=2))]
clf = Pipeline(steps)

In [39]:
# take the first of two user-wise folds as a single train/test split
cv = GroupKFold(n_splits=2).split(features_df, labels, user_ids)
iteration = next(cv)
X_train = features_df.iloc[iteration[0]]
y_train = labels.iloc[iteration[0]]

X_test = features_df.iloc[iteration[1]]
y_test = labels.iloc[iteration[1]]

In [40]:
clf.fit(X_train, y_train)

Building estimator 1 of 25 for this parallel run (total 100)...
Building estimator 1 of 25 for this parallel run (total 100)...
Building estimator 1 of 25 for this parallel run (total 100)...
Building estimator 1 of 25 for this parallel run (total 100)...
Building estimator 2 of 25 for this parallel run (total 100)...
Building estimator 2 of 25 for this parallel run (total 100)...
Building estimator 2 of 25 for this parallel run (total 100)...
Building estimator 2 of 25 for this parallel run (total 100)...
Building estimator 3 of 25 for this parallel run (total 100)...
Building estimator 3 of 25 for this parallel run (total 100)...
Building estimator 3 of 25 for this parallel run (total 100)...
Building estimator 3 of 25 for this parallel run (total 100)...
Building estimator 4 of 25 for this parallel run (total 100)...
Building estimator 4 of 25 for this parallel run (total 100)...
Building estimator 4 of 25 for this parallel run (total 100)...
Building estimator 4 of 25 for this para

[Parallel(n_jobs=4)]: Done   4 out of   4 | elapsed: 42.5min finished


Pipeline(memory=None,
     steps=[('standardize', StandardScaler(copy=True, with_mean=True, with_std=True)), ('clf', BaggingClassifier(base_estimator=SVC(C=1.0, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=True, random_sta...stimators=100, n_jobs=4, oob_score=False,
         random_state=None, verbose=2, warm_start=False))])

In [42]:
predictions = clf.predict(X_test)

[Parallel(n_jobs=4)]: Done   4 out of   4 | elapsed: 38.8min finished


In [43]:
accuracy_score(y_test, predictions)

0.59946404819654175

In [25]:
score = cross_val_score(clf, features_df, labels, groups=user_ids, cv=GroupKFold(n_splits=3), n_jobs=12, scoring=scorer, verbose=2)


[CV]  ................................................................
[CV]  ................................................................
[CV]  ................................................................


Process ForkPoolWorker-86:
Process ForkPoolWorker-80:
Process ForkPoolWorker-83:
Process ForkPoolWorker-85:
Process ForkPoolWorker-84:
Process ForkPoolWorker-87:
Process ForkPoolWorker-82:
Process ForkPoolWorker-88:
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Process ForkPoolWorker-81:
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "/usr/lib64/python3.4/multiprocessing/process.py", line 254, in _bootstrap
    self.run()
  File "/usr/lib64/python3.4/multiprocessing/process.py", line 254, in _bootstrap
    self.run()
  File "/usr/lib64/python3.4/multiprocessing/process.py", line 254, in _bootstrap
    self.run()
  File "/usr/lib64/python3.4/multiprocessing/process.py", line 254, in _bootstrap
    self.run()
  File "/usr/lib64/python3.4/multiprocessing/process.p

KeyboardInterrupt: 

# Tuning a KNN

In [12]:
from sklearn.neighbors import KNeighborsClassifier

In [16]:
param_grid = {'clf__n_neighbors' : [2,5,10,15]}

In [17]:
steps = [('standardize', StandardScaler()),
         ('clf', KNeighborsClassifier())]
clf = Pipeline(steps)

In [18]:
clf = Pipeline(steps)
cv = list(GroupKFold(n_splits=3).split(features_df, labels, user_ids))
gsv = GridSearchCV(clf, param_grid=param_grid, cv=cv, n_jobs=12, verbose=2)

gsv_results = gsv.fit(features_df, labels)
print("best score : %s" % gsv_results.best_score_)
print("best params : %s" % gsv_results.best_params_)

Fitting 3 folds for each of 4 candidates, totalling 12 fits
[CV] clf__n_neighbors=2 ..............................................
[CV] clf__n_neighbors=2 ..............................................
[CV] clf__n_neighbors=2 ..............................................
[CV] clf__n_neighbors=5 ..............................................
[CV] clf__n_neighbors=5 ..............................................
[CV] clf__n_neighbors=5 ..............................................
[CV] clf__n_neighbors=10 .............................................
[CV] clf__n_neighbors=10 .............................................
[CV] clf__n_neighbors=10 .............................................
[CV] clf__n_neighbors=15 .............................................
[CV] clf__n_neighbors=15 .............................................
[CV] clf__n_neighbors=15 .............................................
[CV] ............................... clf__n_neighbors=2, total= 4.1min
[CV] ............

[Parallel(n_jobs=12)]: Done   3 out of  12 | elapsed: 13.9min remaining: 41.6min


[CV] ............................... clf__n_neighbors=2, total= 6.3min
[CV] .............................. clf__n_neighbors=10, total= 6.0min
[CV] .............................. clf__n_neighbors=15, total= 6.6min
[CV] ............................... clf__n_neighbors=5, total= 7.6min
[CV] ............................... clf__n_neighbors=5, total= 7.6min
[CV] .............................. clf__n_neighbors=10, total= 8.6min
[CV] .............................. clf__n_neighbors=10, total= 8.4min


[Parallel(n_jobs=12)]: Done  10 out of  12 | elapsed: 20.8min remaining:  4.2min


[CV] .............................. clf__n_neighbors=15, total= 9.3min
[CV] .............................. clf__n_neighbors=15, total= 9.5min


[Parallel(n_jobs=12)]: Done  12 out of  12 | elapsed: 22.4min finished


best score : 0.582167860618
best params : {'clf__n_neighbors': 15}


# Try [XGBoost](https://machinelearningmastery.com/develop-first-xgboost-model-python-scikit-learn/)

# Boosting algos
* [Machine Learning Mastery's page on gradient boosting, lots of other resources too!](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
* [Kaggle master's blog on gradient boosting](http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/)


In [44]:
from xgboost import XGBClassifier

In [45]:
steps = [('standardize', StandardScaler()),
         ('clf', XGBClassifier())]
clf = Pipeline(steps)

In [51]:
# take the first of two user-wise folds as a single train/test split
cv = GroupKFold(n_splits=2).split(features_df, labels, user_ids)
iteration = next(cv)
X_train = features_df.iloc[iteration[0]]
y_train = labels.iloc[iteration[0]]

X_test = features_df.iloc[iteration[1]]
y_test = labels.iloc[iteration[1]]

In [52]:
clf.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('standardize', StandardScaler(copy=True, with_mean=True, with_std=True)), ('clf', XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='multi:softprob', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1))])

In [53]:
predictions = clf.predict(X_test)

In [54]:
accuracy_score(y_test, predictions)

0.65276582427040142

## ^^^ Best score yet and it's not even optimized!

In [55]:
score = cross_val_score(clf, features_df, labels, groups=user_ids, cv=GroupKFold(n_splits=3), n_jobs=12, scoring=scorer, verbose=2)
score

[CV]  ................................................................
[CV]  ................................................................
[CV]  ................................................................
[CV] ................................................. , total= 3.4min
[CV] ................................................. , total= 3.6min
[CV] ................................................. , total= 3.9min


[Parallel(n_jobs=12)]: Done   3 out of   3 | elapsed:  3.9min finished


array([ 0.65479285,  0.62705914,  0.63133807])

In [57]:
np.mean(score)

0.63773002246473798

### ^^^^ Next step: tune this thoroughly!
* [Analytics Vidhya's blog about tuning xgboost](https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/)
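
A rough sketch of what that tuning could look like. To stay runnable without xgboost, it uses scikit-learn's `GradientBoostingClassifier` as a stand-in; `max_depth`, `learning_rate`, and `n_estimators` also appear in the `XGBClassifier` repr above, so the same grid keys should carry over when `XGBClassifier()` is swapped in. The grid values and the data are placeholders, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# Stand-in booster (assumption); replace with xgboost.XGBClassifier() where available
booster = GradientBoostingClassifier(random_state=0)
clf = Pipeline([("standardize", StandardScaler()), ("clf", booster)])

param_grid = {"clf__max_depth": [2, 4],
              "clf__learning_rate": [0.05, 0.2],
              "clf__n_estimators": [20, 40]}

search = GridSearchCV(clf, param_grid=param_grid, cv=3)
search.fit(X, y)
print("best score : %s" % search.best_score_)
print("best params : %s" % search.best_params_)
```

For the real data, `cv` would be the `GroupKFold` split list used in the other searches rather than plain 3-fold.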

# Attempting one vs. rest with regression models
* [Drawing from Sklearn's page](http://scikit-learn.org/stable/modules/multiclass.html#one-vs-the-rest)
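
A minimal sketch of the one-vs-rest idea, assuming "regression models" means logistic regression: `OneVsRestClassifier` fits one binary model per class and predicts the class whose model is most confident. The data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

ovr = Pipeline([
    ("standardize", StandardScaler()),
    # One binary logistic regression per class
    ("clf", OneVsRestClassifier(LogisticRegression(max_iter=1000))),
])
ovr.fit(X, y)

n_binary_models = len(ovr.named_steps["clf"].estimators_)
print(n_binary_models)  # 3, one per class
```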

# Maybe also consider [Deep Forest](https://github.com/kingfengji/gcForest)