The xGBoost random forest model.


In [33]:
import numpy as np
import pandas as pd
from sklearn import metrics as mr
from sklearn.model_selection import cross_validate, GridSearchCV,RandomizedSearchCV, train_test_split,StratifiedKFold
import xgboost as gb

import utility as util

from gensim.models import Word2Vec
from nltk import word_tokenize

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
reviews = pd.read_csv("../Datasets/processedAnimeReviews.csv",index_col = 'id')
reviews.head(10)

Unnamed: 0_level_0,workName,overallRating,review,sentiment
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
8121,Cowboy_Bebop,10,cowboy bebop episodic series episodic mean one...,1
63480,Utawarerumono,8,utawarerumono manages one harem anime anyone p...,1
8452,Hajime_no_Ippo,10,first let say fan boxing fact pretty much hate...,1
66544,Gensoumaden_Saiyuuki,9,saiyuki one anime grab first episode let go ev...,1
55936,Ranma_½,7,comedy romance based manga rumiko takahashi ra...,0
22039,Kino_no_Tabi__The_Beautiful_World,9,say anime traveler journeying different countr...,1
68626,Kareshi_Kanojo_no_Jijou,8,kare kano romance anime could become incredibl...,1
18797,Hunter_x_Hunter,10,overall best anime actually seen anything else...,1
43899,Golden_Boy,10,overall honestly really care others opinion an...,1
18796,Hunter_x_Hunter,10,think hear anime people killing poor cute anim...,1


Some other bs

In [3]:
w2v_model = Word2Vec.load('../Models/w2vmodel.bin')
# Get mean feature vector of all words in a sentence
def meanFeatureVec(sentence, word_vectors):
    word_vecs = [word_vectors[word] for word in word_tokenize(sentence)]
    mean_vec = np.asarray(word_vecs).mean(axis=0)
    return mean_vec

# Takes a dataframe of the reviews and returns a new dataframe of the word embeddings per review
def reviewToVectors(sentences, word_vectors):
    sent_vecs = [meanFeatureVec(sentence, word_vectors) for sentence in sentences]
    df = pd.DataFrame(sent_vecs, index=sentences.index)
    return df
# Convert reviews to word embeddings
X_vectors = reviewToVectors(reviews['review'], w2v_model.wv)
# Split into train and test
y = reviews['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X_vectors, y, test_size=0.1, random_state=2)

#Baseline
   All possible parameters for xgboost.XGBClassifier are:
   ----------
   max_depth : [0,infinity)
   
       Maximum tree depth for base learners.

   learning_rate : float
   
       Boosting learning rate (xgb's "eta")

   n_estimators : int
   
       Number of trees to fit.


   objective : string or callable
       Specify the learning task and the corresponding learning objective or
       a custom objective function to be used. There are 

   gamma : float
       Minimum loss reduction required to make a further partition on a leaf node of the tree.

   min_child_weight : int
       Minimum sum of instance weight(hessian) needed in a child.

   max_delta_step : int
       Maximum delta step we allow each tree's weight estimation to be.

   subsample : float
       Subsample ratio of the training instance.



   reg_alpha : float (xgb's alpha)
       L1 regularization term on weights

   reg_lambda : float (xgb's lambda)
       L2 regularization term on weights

   scale_pos_weight : float
       Balancing of positive and negative weights.

   base_score:
      The initial prediction score of all instances, global bias.

Not all of these paramaters affect accuracy of the model. The relevant once are:

Also it is important to note that we only have one X column so :

   colsample_bytree : float
       Subsample ratio of columns when constructing each tree.

   colsample_bylevel : float
       Subsample ratio of columns for each level.

   colsample_bynode : float
       Subsample ratio of columns for each split.
       
must all be set to 1 which is there default value.

In [4]:
metrics = ['accuracy', 'recall', 'precision', 'f1', 'roc_auc']
#RFModel = gb.XGBRFClassifier()
#util.cross_validate_scores(RFModel, X_train, y_train, cv=5, metrics=metrics)


First we will test out different objective methods for the XGBClassifier. There are 12 different objective functions
we can define. We will break down this report into 12 seperate testing grounds and tweak each parameter for them.
The 12 are:

reg:squarederror: regression with squared loss. This is the default value

reg:logistic: logistic regression

binary:logistic: logistic regression for binary classification

binary:hinge: hinge loss for binary classification. 

count:poisson –poisson regression for count data

survival:cox: Cox regression for right censored survival time data (negative values are considered right censored). Note that predictions are returned on the hazard ratio scale (i.e., as HR = exp(marginal_prediction) in the proportional hazard function h(t) = h0(t) * HR).

multi:softmax: set XGBoost to do multiclass classification using the softmax objective, you also need to set num_class(number of classes)

rank:pairwise: Use LambdaMART to perform pairwise ranking where the pairwise loss is minimized

rank:ndcg: Use LambdaMART to perform list-wise ranking where Normalized Discounted Cumulative Gain (NDCG) is maximized

rank:map: Use LambdaMART to perform list-wise ranking where Mean Average Precision (MAP) is maximized

reg:gamma: gamma regression with log-link. 

reg:tweedie: Tweedie regression with log-link. 



First we will test the baseline for each objective. Then we will break down each objective and do furthur testing.


In [18]:
SquaredError = gb.XGBClassifier(objective = "reg:squarederror",n_estimators =  10) 
Logistic = gb.XGBClassifier(objective = "reg:logistic",n_estimators =  10) 
BinaryLogistic = gb.XGBClassifier(objective = "binary:logistic",n_estimators =  10) 
BinaryHinge = gb.XGBClassifier(objective = "binary:hinge",n_estimators =  10) 
CountPoisson = gb.XGBClassifier(objective = "count:poisson",n_estimators =  10) 
SurvivalCox = gb.XGBClassifier(objective = "survival:cox",n_estimators =  10) 
MultiSoftMax = gb.XGBClassifier(objective = "multi:softmax", num_class = 2,n_estimators =  10) 
RankPairWise = gb.XGBClassifier(objective = "rank:pairwise",n_estimators =  10) 
RankNDCG = gb.XGBClassifier(objective = "rank:ndcg",n_estimators =  10) 
RankMap = gb.XGBClassifier(objective = "rank:map",n_estimators =  10) 
RegGamma = gb.XGBClassifier(objective = "reg:gamma",n_estimators =  10) 
RegTweedie = gb.XGBClassifier(objective = "reg:tweedie",n_estimators =  10) 

print("Regression Squared Error")
util.cross_validate_scores(SquaredError, X_train, y_train, cv=5, metrics=metrics)

print("Logistical regression")
util.cross_validate_scores(Logistic, X_train, y_train, cv=5, metrics=metrics)

print("Logistical regression for binary classification")
util.cross_validate_scores(BinaryLogistic, X_train, y_train, cv=5, metrics=metrics)

print("Binary classification using hinge loss optimization")
util.cross_validate_scores(BinaryHinge, X_train, y_train, cv=5, metrics=metrics)

print("Poisson regression for count data")
util.cross_validate_scores(CountPoisson, X_train, y_train, cv=5, metrics=metrics)

print("Survival Cox Regression")
util.cross_validate_scores(SurvivalCox, X_train, y_train, cv=5, metrics=metrics)

print("Classification by optimizing softmax objective")
util.cross_validate_scores(MultiSoftMax, X_train, y_train, cv=5, metrics=metrics)

print("Rank Pairwise")
util.cross_validate_scores(RankPairWise, X_train, y_train, cv=5, metrics=metrics)

print("Rank ndcg")
util.cross_validate_scores(RankNDCG, X_train, y_train, cv=5, metrics=metrics)

print("Rank map")
util.cross_validate_scores(RankMap, X_train, y_train, cv=5, metrics=metrics)

print("Regression gamma")
util.cross_validate_scores(RegGamma, X_train, y_train, cv=5, metrics=metrics)

print("Regression tweedie")
util.cross_validate_scores(RegTweedie, X_train, y_train, cv=5, metrics=metrics)


Regression Squared Error


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:   29.4s remaining:   44.1s


Training scores
accuracy: 0.7088 (0.0008)
recall: 0.8322 (0.0018)
precision: 0.7084 (0.0008)
f1: 0.7653 (0.0008)
roc_auc: 0.7776 (0.0018)

Validation Scores
accuracy: 0.7061 (0.0037)
recall: 0.8301 (0.0052)
precision: 0.7063 (0.0025)
f1: 0.7632 (0.0033)
roc_auc: 0.7730 (0.0032)


[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   29.7s finished


Unnamed: 0,metric,train_score,val_scores
0,accuracy,0.708802,0.706104
1,recall,0.832193,0.830133
2,precision,0.708385,0.706263
3,f1,0.765313,0.7632
4,roc_auc,0.777602,0.773018


Logistical regression


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:   25.1s remaining:   37.7s


Training scores
accuracy: 0.7091 (0.0006)
recall: 0.8324 (0.0015)
precision: 0.7086 (0.0003)
f1: 0.7655 (0.0007)
roc_auc: 0.7782 (0.0003)

Validation Scores
accuracy: 0.7051 (0.0034)
recall: 0.8297 (0.0038)
precision: 0.7054 (0.0026)
f1: 0.7625 (0.0029)
roc_auc: 0.7733 (0.0020)


[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   25.6s finished


Unnamed: 0,metric,train_score,val_scores
0,accuracy,0.7091,0.705117
1,recall,0.832405,0.829657
2,precision,0.708624,0.705394
3,f1,0.765543,0.762494
4,roc_auc,0.778242,0.773296


Logistical regression for binary classification


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:   25.6s remaining:   38.4s


Training scores
accuracy: 0.7091 (0.0006)
recall: 0.8324 (0.0015)
precision: 0.7086 (0.0003)
f1: 0.7655 (0.0007)
roc_auc: 0.7782 (0.0003)

Validation Scores
accuracy: 0.7051 (0.0034)
recall: 0.8297 (0.0038)
precision: 0.7054 (0.0026)
f1: 0.7625 (0.0029)
roc_auc: 0.7733 (0.0020)


[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   26.3s finished


Unnamed: 0,metric,train_score,val_scores
0,accuracy,0.7091,0.705117
1,recall,0.832405,0.829657
2,precision,0.708624,0.705394
3,f1,0.765543,0.762494
4,roc_auc,0.778242,0.773296


Binary classification using hinge loss optimization


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:   29.6s remaining:   44.4s


Training scores
accuracy: 0.5705 (0.0000)
recall: 1.0000 (0.0000)
precision: 0.5705 (0.0000)
f1: 0.7266 (0.0000)
roc_auc: 0.5000 (0.0000)

Validation Scores
accuracy: 0.5705 (0.0000)
recall: 1.0000 (0.0000)
precision: 0.5705 (0.0000)
f1: 0.7266 (0.0000)
roc_auc: 0.5000 (0.0000)


[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   30.1s finished


Unnamed: 0,metric,train_score,val_scores
0,accuracy,0.57054,0.57054
1,recall,1.0,1.0
2,precision,0.57054,0.57054
3,f1,0.726552,0.726552
4,roc_auc,0.5,0.5


Poisson regression for count data


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:   26.0s remaining:   39.0s


Training scores
accuracy: 0.7086 (0.0011)
recall: 0.8205 (0.0027)
precision: 0.7124 (0.0006)
f1: 0.7626 (0.0012)
roc_auc: 0.7772 (0.0011)

Validation Scores
accuracy: 0.7046 (0.0025)
recall: 0.8172 (0.0029)
precision: 0.7092 (0.0022)
f1: 0.7594 (0.0020)
roc_auc: 0.7725 (0.0022)


[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   28.1s finished


Unnamed: 0,metric,train_score,val_scores
0,accuracy,0.708598,0.704567
1,recall,0.820462,0.81721
2,precision,0.71241,0.709245
3,f1,0.762626,0.759406
4,roc_auc,0.777152,0.772529


Survival Cox Regression


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:   25.8s remaining:   38.7s


Training scores
accuracy: 0.5705 (0.0000)
recall: 1.0000 (0.0000)
precision: 0.5705 (0.0000)
f1: 0.7266 (0.0000)
roc_auc: 0.4835 (0.0197)

Validation Scores
accuracy: 0.5705 (0.0000)
recall: 1.0000 (0.0000)
precision: 0.5705 (0.0000)
f1: 0.7266 (0.0000)
roc_auc: 0.4841 (0.0185)


[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   26.4s finished


Unnamed: 0,metric,train_score,val_scores
0,accuracy,0.57054,0.57054
1,recall,1.0,1.0
2,precision,0.57054,0.57054
3,f1,0.726552,0.726552
4,roc_auc,0.483459,0.484136


Classification by optimizing softmax objective


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:   44.1s remaining:  1.1min


Training scores
accuracy: 0.7091 (0.0006)
recall: 0.8324 (0.0015)
precision: 0.7086 (0.0003)
f1: 0.7655 (0.0007)
roc_auc: 0.6888 (0.0005)

Validation Scores
accuracy: 0.7051 (0.0034)
recall: 0.8297 (0.0038)
precision: 0.7054 (0.0026)
f1: 0.7625 (0.0029)
roc_auc: 0.6847 (0.0035)


[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   45.4s finished


Unnamed: 0,metric,train_score,val_scores
0,accuracy,0.709096,0.705117
1,recall,0.832395,0.829657
2,precision,0.708623,0.705394
3,f1,0.765538,0.762494
4,roc_auc,0.688844,0.684661


Rank Pairwise


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:   27.3s remaining:   41.0s


Training scores
accuracy: 0.6877 (0.0027)
recall: 0.6795 (0.0110)
precision: 0.7497 (0.0028)
f1: 0.7128 (0.0050)
roc_auc: 0.7603 (0.0021)

Validation Scores
accuracy: 0.6842 (0.0028)
recall: 0.6769 (0.0105)
precision: 0.7461 (0.0033)
f1: 0.7098 (0.0048)
roc_auc: 0.7566 (0.0023)


[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   28.0s finished


Unnamed: 0,metric,train_score,val_scores
0,accuracy,0.687655,0.684207
1,recall,0.679487,0.676948
2,precision,0.74968,0.746086
3,f1,0.712793,0.709774
4,roc_auc,0.760333,0.756579


Rank ndcg


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:   16.4s remaining:   24.7s


Training scores
accuracy: 0.4295 (0.0000)
recall: 0.0000 (0.0000)
precision: 0.0000 (0.0000)
f1: 0.0000 (0.0000)
roc_auc: 0.5000 (0.0000)

Validation Scores
accuracy: 0.4295 (0.0000)
recall: 0.0000 (0.0000)
precision: 0.0000 (0.0000)
f1: 0.0000 (0.0000)
roc_auc: 0.5000 (0.0000)


[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   17.0s finished


Unnamed: 0,metric,train_score,val_scores
0,accuracy,0.42946,0.42946
1,recall,0.0,0.0
2,precision,0.0,0.0
3,f1,0.0,0.0
4,roc_auc,0.5,0.5


Rank map


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:   14.0s remaining:   21.0s


Training scores
accuracy: 0.4295 (0.0000)
recall: 0.0000 (0.0000)
precision: 0.0000 (0.0000)
f1: 0.0000 (0.0000)
roc_auc: 0.5000 (0.0000)

Validation Scores
accuracy: 0.4295 (0.0000)
recall: 0.0000 (0.0000)
precision: 0.0000 (0.0000)
f1: 0.0000 (0.0000)
roc_auc: 0.5000 (0.0000)


[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   15.7s finished


Unnamed: 0,metric,train_score,val_scores
0,accuracy,0.42946,0.42946
1,recall,0.0,0.0
2,precision,0.0,0.0
3,f1,0.0,0.0
4,roc_auc,0.5,0.5


Regression gamma


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:   27.4s remaining:   41.2s


Training scores
accuracy: 0.7076 (0.0009)
recall: 0.7721 (0.0016)
precision: 0.7307 (0.0009)
f1: 0.7508 (0.0008)
roc_auc: 0.7732 (0.0018)

Validation Scores
accuracy: 0.7028 (0.0028)
recall: 0.7678 (0.0039)
precision: 0.7268 (0.0018)
f1: 0.7467 (0.0028)
roc_auc: 0.7685 (0.0027)


[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   28.6s finished


Unnamed: 0,metric,train_score,val_scores
0,accuracy,0.707631,0.702849
1,recall,0.772067,0.767781
2,precision,0.730726,0.726795
3,f1,0.750827,0.746724
4,roc_auc,0.773224,0.768475


Regression tweedie


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:   34.4s remaining:   51.7s


Training scores
accuracy: 0.7096 (0.0018)
recall: 0.8008 (0.0036)
precision: 0.7211 (0.0020)
f1: 0.7589 (0.0016)
roc_auc: 0.7763 (0.0015)

Validation Scores
accuracy: 0.7063 (0.0033)
recall: 0.7986 (0.0039)
precision: 0.7182 (0.0030)
f1: 0.7563 (0.0027)
roc_auc: 0.7725 (0.0015)


[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   34.8s finished


Unnamed: 0,metric,train_score,val_scores
0,accuracy,0.709642,0.706317
1,recall,0.800805,0.798611
2,precision,0.721112,0.718205
3,f1,0.758865,0.756271
4,roc_auc,0.776294,0.772498


In [19]:
AbsoluteBaseline = gb.XGBClassifier() 

# Mean Squared Error Objective 1



In [20]:
parametersMSE = {'objective':['reg:squarederror'],
              'learning_rate': [x * 0.1 for x in range(0, 10)], #so called `eta` value
              'max_depth': [i for i in range(1,12,2)],
              'min_child_weight': [i for i in range(0,20,5)],
              'n_estimators': [10],
              'gamma' : [i for i in range(0,20,5)],
              'max_delta_step':[0],
              'reg_alpha':[0, 0.001, 0.005, 0.01, 0.05],
              'reg_lambda':[0, 0.1, 0.3, 0.6, 9],
              'scale_pos_weight':[0, 0.1, 0.3, 0.6, 9]
             }
clfMSE = RandomizedSearchCV(AbsoluteBaseline, parametersMSE, n_jobs=-1, 
                   cv=StratifiedKFold(), 
                   scoring='roc_auc',
                   verbose=2, refit=True)

clfMSE.fit( X_train,y_train)

BestMSEParam = clfMSE.cv_results_['params'][clfMSE.best_index_]
BestMSE = clfMSE.best_estimator_



Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:  1.8min finished


RandomizedSearchCV(cv=StratifiedKFold(n_splits=3, random_state=None, shuffle=False),
                   error_score='raise-deprecating',
                   estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                           colsample_bylevel=1,
                                           colsample_bynode=1,
                                           colsample_bytree=1, gamma=0,
                                           learning_rate=0.1, max_delta_step=0,
                                           max_depth=3, min_child_weight=1,
                                           missing=None, n_estimators=100,
                                           n_jobs=1, nthread=N...
                                                          0.8, 0.9],
                                        'max_delta_step': [0],
                                        'max_depth': [1, 3, 5, 7, 9, 11],
                                        'min_child_weight': [0, 5, 10, 15],
        

# Logistical regression for binary classification Objective 3
This is the exact same for objective

In [21]:
parametersLRB = {'objective':['binary:logistic'],
              'learning_rate': [x * 0.1 for x in range(0, 10)], #so called `eta` value
              'max_depth': [i for i in range(1,12,2)],
              'min_child_weight': [i for i in range(0,20,5)],
              'n_estimators': [10],
              'gamma' : [i for i in range(0,20,5)],
              'max_delta_step':[0],
              'reg_alpha':[0, 0.001, 0.005, 0.01, 0.05],
              'reg_lambda':[0, 0.1, 0.3, 0.6, 9],
              'scale_pos_weight':[0, 0.1, 0.3, 0.6, 9]
             }
clfLRB = RandomizedSearchCV(AbsoluteBaseline, parametersLRB, n_jobs=-1, 
                   cv=StratifiedKFold(), 
                   scoring='roc_auc',
                   verbose=2, refit=True)

clfLRB.fit( X_train,y_train)

BestLogisticRegressionParam = clfLRB.cv_results_['params'][clfLRB.best_index_]
BestLogisticRegression = clfLRB.best_estimator_



Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:  3.3min finished


RandomizedSearchCV(cv=StratifiedKFold(n_splits=3, random_state=None, shuffle=False),
                   error_score='raise-deprecating',
                   estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                           colsample_bylevel=1,
                                           colsample_bynode=1,
                                           colsample_bytree=1, gamma=0,
                                           learning_rate=0.1, max_delta_step=0,
                                           max_depth=3, min_child_weight=1,
                                           missing=None, n_estimators=100,
                                           n_jobs=1, nthread=N...
                                                          0.8, 0.9],
                                        'max_delta_step': [0],
                                        'max_depth': [1, 3, 5, 7, 9, 11],
                                        'min_child_weight': [0, 5, 10, 15],
        

# Hinge loss for binary classificative Objective 4

$\ell(y) = \max(0, 1-t \cdot y)$

In [22]:
parametersHL = {'objective':['binary:hinge'],
              'learning_rate': [x * 0.1 for x in range(0, 10)], #so called `eta` value
              'max_depth': [i for i in range(1,12,2)],
              'min_child_weight': [i for i in range(0,20,5)],
              'n_estimators': [10],
              'gamma' : [i for i in range(0,20,5)],
              'max_delta_step':[0],
              'reg_alpha':[0, 0.001, 0.005, 0.01, 0.05],
              'reg_lambda':[0, 0.1, 0.3, 0.6, 9],
              'scale_pos_weight':[0, 0.1, 0.3, 0.6, 9]
             }
clfHL = RandomizedSearchCV(AbsoluteBaseline, parametersHL, n_jobs=-1, 
                   cv=StratifiedKFold(), 
                   scoring='roc_auc',
                   verbose=2, refit=True)

clfHL.fit( X_train,y_train)

BestHingeLossParam = clfHL.cv_results_['params'][clfHL.best_index_]
BestHingeLoss = clfHL.best_estimator_



Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:  3.3min finished


RandomizedSearchCV(cv=StratifiedKFold(n_splits=3, random_state=None, shuffle=False),
                   error_score='raise-deprecating',
                   estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                           colsample_bylevel=1,
                                           colsample_bynode=1,
                                           colsample_bytree=1, gamma=0,
                                           learning_rate=0.1, max_delta_step=0,
                                           max_depth=3, min_child_weight=1,
                                           missing=None, n_estimators=100,
                                           n_jobs=1, nthread=N...
                                                          0.8, 0.9],
                                        'max_delta_step': [0],
                                        'max_depth': [1, 3, 5, 7, 9, 11],
                                        'min_child_weight': [0, 5, 10, 15],
        

# Poisson regression for count data

In [23]:
parametersPRC = {'objective':['count:poisson'],
              'learning_rate': [x * 0.1 for x in range(0, 10)], #so called `eta` value
              'max_depth': [i for i in range(1,12,2)],
              'min_child_weight': [i for i in range(0,20,5)],
              'n_estimators': [10],
              'gamma' : [i for i in range(0,20,5)],
              'max_delta_step':[0],
              'reg_alpha':[0, 0.001, 0.005, 0.01, 0.05],
              'reg_lambda':[0, 0.1, 0.3, 0.6, 9],
              'scale_pos_weight':[0, 0.1, 0.3, 0.6, 9]
             }
clfPRC = RandomizedSearchCV(AbsoluteBaseline, parametersPRC, n_jobs=-1, 
                   cv=StratifiedKFold(), 
                   scoring='roc_auc',
                   verbose=2, refit=True)

clfPRC.fit( X_train,y_train)

BestPossionCountDataParam = clfPRC.cv_results_['params'][clfPRC.best_index_]
BestPossionCountData = clfPRC.best_estimator_



Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:  2.9min finished


RandomizedSearchCV(cv=StratifiedKFold(n_splits=3, random_state=None, shuffle=False),
                   error_score='raise-deprecating',
                   estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                           colsample_bylevel=1,
                                           colsample_bynode=1,
                                           colsample_bytree=1, gamma=0,
                                           learning_rate=0.1, max_delta_step=0,
                                           max_depth=3, min_child_weight=1,
                                           missing=None, n_estimators=100,
                                           n_jobs=1, nthread=N...
                                                          0.8, 0.9],
                                        'max_delta_step': [0],
                                        'max_depth': [1, 3, 5, 7, 9, 11],
                                        'min_child_weight': [0, 5, 10, 15],
        

# Survival Cox

In [24]:
parametersSC = {'objective':['survival:cox'],
              'learning_rate': [x * 0.1 for x in range(0, 10)], #so called `eta` value
              'max_depth': [i for i in range(1,12,2)],
              'min_child_weight': [i for i in range(0,20,5)],
              'n_estimators': [10],
              'gamma' : [i for i in range(0,20,5)],
              'max_delta_step':[0],
              'reg_alpha':[0, 0.001, 0.005, 0.01, 0.05],
              'reg_lambda':[0, 0.1, 0.3, 0.6, 9],
              'scale_pos_weight':[0, 0.1, 0.3, 0.6, 9]
             }
clfSC = RandomizedSearchCV(AbsoluteBaseline, parametersPRC, n_jobs=-1, 
                   cv=StratifiedKFold(), 
                   scoring='roc_auc',
                   verbose=2, refit=True)

clfSC.fit( X_train,y_train)

BestSurvivalCoxParam = clfSC.cv_results_['params'][clfSC.best_index_]
BestSurvivalCox = clfSC.best_estimator_



Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:  3.2min finished


RandomizedSearchCV(cv=StratifiedKFold(n_splits=3, random_state=None, shuffle=False),
                   error_score='raise-deprecating',
                   estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                           colsample_bylevel=1,
                                           colsample_bynode=1,
                                           colsample_bytree=1, gamma=0,
                                           learning_rate=0.1, max_delta_step=0,
                                           max_depth=3, min_child_weight=1,
                                           missing=None, n_estimators=100,
                                           n_jobs=1, nthread=N...
                                                          0.8, 0.9],
                                        'max_delta_step': [0],
                                        'max_depth': [1, 3, 5, 7, 9, 11],
                                        'min_child_weight': [0, 5, 10, 15],
        

# multi:softmax

In [26]:
parametersMSM = {'objective':['multi:softmax'],
              'learning_rate': [x * 0.1 for x in range(0, 10)], #so called `eta` value
              'max_depth': [i for i in range(1,12,2)],
              'min_child_weight': [i for i in range(0,20,5)],
              'n_estimators': [10],
              'gamma' : [i for i in range(0,20,5)],
              'max_delta_step':[0],
              'reg_alpha':[0, 0.001, 0.005, 0.01, 0.05],
              'reg_lambda':[0, 0.1, 0.3, 0.6, 9],
              'scale_pos_weight':[0, 0.1, 0.3, 0.6, 9],
              'num_class':[2]}
clfMSM = RandomizedSearchCV(AbsoluteBaseline, parametersMSM, n_jobs=-1, 
                   cv=StratifiedKFold(), 
                   scoring='roc_auc',
                   verbose=2, refit=True)

clfMSM.fit( X_train,y_train)

BestMSMParam = clfMSM.cv_results_['params'][clfMSM.best_index_]
BestMSM = clfMSM.best_estimator_

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:  5.6min finished


RandomizedSearchCV(cv=StratifiedKFold(n_splits=3, random_state=None, shuffle=False),
                   error_score='raise-deprecating',
                   estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                           colsample_bylevel=1,
                                           colsample_bynode=1,
                                           colsample_bytree=1, gamma=0,
                                           learning_rate=0.1, max_delta_step=0,
                                           max_depth=3, min_child_weight=1,
                                           missing=None, n_estimators=100,
                                           n_jobs=1, nthread=N...
                                        'max_delta_step': [0],
                                        'max_depth': [1, 3, 5, 7, 9, 11],
                                        'min_child_weight': [0, 5, 10, 15],
                                        'n_estimators': [10], 'num_class': [2

# rank pairwise

In [27]:


parametersRPW = {'objective':['rank:pairwise'],
              'learning_rate': [x * 0.1 for x in range(0, 10)], #so called `eta` value
              'max_depth': [i for i in range(1,12,2)],
              'min_child_weight': [i for i in range(0,20,5)],
              'n_estimators': [10],
              'gamma' : [i for i in range(0,20,5)],
              'max_delta_step':[0],
              'reg_alpha':[0, 0.001, 0.005, 0.01, 0.05],
              'reg_lambda':[0, 0.1, 0.3, 0.6, 9],
              'scale_pos_weight':[0, 0.1, 0.3, 0.6, 9]
             }
clfRPW = RandomizedSearchCV(AbsoluteBaseline, parametersRPW, n_jobs=-1, 
                   cv=StratifiedKFold(), 
                   scoring='roc_auc',
                   verbose=2, refit=True)

clfRPW.fit( X_train,y_train)

BestPairwiseParam = clfRPW.cv_results_['params'][clfRPW.best_index_]
BestPairwise = clfRPW.best_estimator_



Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:  3.0min finished


RandomizedSearchCV(cv=StratifiedKFold(n_splits=3, random_state=None, shuffle=False),
                   error_score='raise-deprecating',
                   estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                           colsample_bylevel=1,
                                           colsample_bynode=1,
                                           colsample_bytree=1, gamma=0,
                                           learning_rate=0.1, max_delta_step=0,
                                           max_depth=3, min_child_weight=1,
                                           missing=None, n_estimators=100,
                                           n_jobs=1, nthread=N...
                                                          0.8, 0.9],
                                        'max_delta_step': [0],
                                        'max_depth': [1, 3, 5, 7, 9, 11],
                                        'min_child_weight': [0, 5, 10, 15],
        

# rank ndcg

In [None]:

parametersNDCG = {'objective':['rank:ndcg'],
              'learning_rate': [x * 0.1 for x in range(0, 10)], #so called `eta` value
              'max_depth': [i for i in range(1,12,2)],
              'min_child_weight': [i for i in range(0,20,5)],
              'n_estimators': [10],
              'gamma' : [i for i in range(0,20,5)],
              'max_delta_step':[0],
              'reg_alpha':[0, 0.001, 0.005, 0.01, 0.05],
              'reg_lambda':[0, 0.1, 0.3, 0.6, 9],
              'scale_pos_weight':[0, 0.1, 0.3, 0.6, 9]
             }
#clfNDCG = RandomizedSearchCV(AbsoluteBaseline, parametersNDCG, n_jobs=-1, 
                 #  cv=StratifiedKFold(), 
                  # scoring='roc_auc',
                   #verbose=2, refit=True)

#clfNDCG.fit( X_train,y_train)

#BestNDCGParam = clfNDCG.cv_results_['params'][clfNDCG.best_index_]
#BestNDCG = clfNDCG.best_estimator_

# rank map

In [None]:
parametersMAP = {'objective':['rank:map'],
              'learning_rate': [x * 0.1 for x in range(0, 10)], #so called `eta` value
              'max_depth': [i for i in range(1,12,2)],
              'min_child_weight': [i for i in range(0,20,5)],
              'n_estimators': [10],
              'gamma' : [i for i in range(0,20,5)],
              'max_delta_step':[0],
              'reg_alpha':[0, 0.001, 0.005, 0.01, 0.05],
              'reg_lambda':[0, 0.1, 0.3, 0.6, 9],
              'scale_pos_weight':[0, 0.1, 0.3, 0.6, 9]
             }
#clfMAP = RandomizedSearchCV(AbsoluteBaseline, parametersMAP, n_jobs=-1, 
                   #cv=StratifiedKFold(), 
                   #scoring='roc_auc',
                   #verbose=2, refit=True)

#clfMAP.fit( X_train,y_train)

#BestMapParam = clfMAP.cv_results_['params'][clfMAP.best_index_]
#BestMap = clfMap.best_estimator_

# reg gamma

In [28]:
parametersGam = {'objective':['reg:gamma'],
              'learning_rate': [x * 0.1 for x in range(0, 10)], #so called `eta` value
              'max_depth': [i for i in range(1,12,2)],
              'min_child_weight': [i for i in range(0,20,5)],
              'n_estimators': [10],
              'gamma' : [i for i in range(0,20,5)],
              'max_delta_step':[0],
              'reg_alpha':[0, 0.001, 0.005, 0.01, 0.05],
              'reg_lambda':[0, 0.1, 0.3, 0.6, 9],
              'scale_pos_weight':[0, 0.1, 0.3, 0.6, 9]
             }
clfGam = RandomizedSearchCV(AbsoluteBaseline, parametersGam, n_jobs=-1, 
                   cv=StratifiedKFold(), 
                   scoring='roc_auc',
                   verbose=2, refit=True)

clfGam.fit( X_train,y_train)

BestGammaParam = clfGam.cv_results_['params'][clfGam.best_index_]
BestGamma = clfGam.best_estimator_



Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

# reg tweedie

In [29]:
parametersTwd = {'objective':['reg:tweedie'],
              'learning_rate': [x * 0.1 for x in range(0, 10)], #so called `eta` value
              'max_depth': [i for i in range(1,12,2)],
              'min_child_weight': [i for i in range(0,20,5)],
              'n_estimators': [10],
              'gamma' : [i for i in range(0,20,5)],
              'max_delta_step':[0],
              'reg_alpha':[0, 0.001, 0.005, 0.01, 0.05],
              'reg_lambda':[0, 0.1, 0.3, 0.6, 9],
              'scale_pos_weight':[0, 0.1, 0.3, 0.6, 9]
             }
clfTwd = RandomizedSearchCV(AbsoluteBaseline, parametersTwd, n_jobs=-1, 
                   cv=StratifiedKFold(), 
                   scoring='roc_auc',
                   verbose=2, refit=True)

clfTwd.fit( X_train,y_train)

BestTweedieParam = clfTwd.cv_results_['params'][clfTwd.best_index_]


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:  2.1min finished


RandomizedSearchCV(cv=StratifiedKFold(n_splits=3, random_state=None, shuffle=False),
                   error_score='raise-deprecating',
                   estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                           colsample_bylevel=1,
                                           colsample_bynode=1,
                                           colsample_bytree=1, gamma=0,
                                           learning_rate=0.1, max_delta_step=0,
                                           max_depth=3, min_child_weight=1,
                                           missing=None, n_estimators=100,
                                           n_jobs=1, nthread=N...
                                                          0.7000000000000001,
                                                          0.8, 0.9],
                                        'max_delta_step': [0],
                                        'max_depth': [1, 3, 5, 7, 9, 11],
      

NameError: name 'clfTwb' is not defined

# Comparing models

In [35]:
BestTweedie = clfTwd.best_estimator_
def getMetrics(model,y_true,X_test,name):
    predictions = model.predict(X_test)
    acc = mr.accuracy_score(y_true,predictions)
    prec =  mr.precision_score(y_true,predictions)
    rec = mr.recall_score(y_true,predictions)
    roc_auc = mr.roc_auc_score(y_true,predictions)
    f1 = mr.f1_score(y_true,predictions)
    print()
    print()
    print(name,acc,prec,rec,roc_auc,f1)


getMetrics(BestMSE,y_test,X_test,"Regression squared error")

getMetrics(BestLogisticRegression,y_test,X_test,"Logistic regression")

getMetrics(BestHingeLoss,y_test,X_test,"Binary classification using hinge loss optimization")

getMetrics(BestPossionCountData,y_test,X_test,"Best Poisson regression for count data")

getMetrics(BestSurvivalCox,y_test,X_test,"Best Survival Cox Regression")


getMetrics(BestPairwise,y_test,X_test,"Best Rank Pairwise")

#getMetrics(BestNDCG,y_test,X_test, "Best Rank ndcg")

#getMetrics(BestMap ,y_test,X_test,"Best Rank map")

#getMetrics(BestGamma ,y_test,X_test,"Best Regression gamma")

getMetrics(BestTweedie ,y_test,X_test, "Best Regression tweedie")





Regression squared error 0.6097344478141874 0.8867041198501873 0.3663915398503998 0.6517126823276284 0.5185252783354627


Logistic regression 0.71033360455655 0.8076306508496313 0.6497291720402373 0.7207882633835185 0.720125786163522


Binary classification using hinge loss optimization 0.7271987573045343 0.7409338705854468 0.8062935259221047 0.7135543952247123 0.7722332015810275


Best Poisson regression for count data 0.7257193579406761 0.7576413652572593 0.7672169202992004 0.7185607585017252 0.76239907727797


Best Survival Cox Regression 0.7243878985132036 0.7416026871401151 0.797265927263348 0.7118159644989767 0.7684275947793662


Best Rank Pairwise 0.7194319106442785 0.7771092766195606 0.7162754707247873 0.7199764170623069 0.7454533252801825


Best Regression tweedie 0.7244618684813966 0.7537473233404711 0.7717307196285788 0.716307684185495 0.7626330210922068


In [36]:
getMetrics(BestMSM,y_test,X_test,"Best Classification by optimizing softmax objective")




Best Classification by optimizing softmax objective 0.7296397662549005 0.7464822609741432 0.8004900696414754 0.7174176280557767 0.7725434065592133
