# Project 2 Supplement
## Optimizing Classification Parameters with Imbalanced Data

Whenever you're doing machine learning with a classification model, you need to be aware of imbalanced data. Imbalanced data is data where the majority of the response (y) values in your data fall into one class. In the data you were given for the project, we'd artificially balanced the data - giving you roughly equal good and bad loans. In real life, there would be far more good loans than bad loans, far more real credit card transactions than fraudulent ones, far more negative cancer tests than positive ones, etc. Classification problems are often looking for that proverbial "needle in the haystack." 

The problem is that if we're using accuracy as our metric with imbalanced data, the classifier can always guess the majority class, and we'll have pretty decent accuracy. For example, the data that we're using in this example is survey data from the 2016 National Survey on Drug Use and Health. The purpose of the classifier is to predict opioid users, which account for just 6% of the survey respondents. If the classifier predicted "not an opioid user" for every respondent, it would be correct 94% of the time. That's a pretty good accuracy score! Of course, it doesn't tell us anything about what we actually want to know.

Let's see what can be done about this imbalanced data with parameter optimization.




## Predicting Opioid Abuse from Perception of Risk

The data for this project uses 2016 National Survey on Drug Use and Health to attempt to predict opioid abuse risk based on responses from a small number of survey questions regarding the perceived risk of alcohol, tobacco, and substance use. The intent was to create a screening tool for participants in Division of Extension education programs that could flag individuals that might be more at risk, so additional targeted interventions could be provided. 

Extensive data cleaning was performed in R, resulting in a dataset with 40241 adults with no history of opioid abuse and 2381 adults with a history of opioid abuse. 

Let's read in the data and one-hot-encode the category variables for sklearn.

We'll also make a much smaller data set for demonstration purposes. Otherwise, this code runs extremely slowly. If you wanted more accurate results, the entire dataset should be used.

## Loading the data

In [3]:
# imports
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder

#read in the data
X = pd.read_csv('./data/opioid_data.csv')
#grab the y column (1 = opioid user, 0 = not a user)
y = np.array(X['isUser'])
#drop the y column 
X = X.drop(columns = ['isUser'])

#one hot encode the categories
onehot_encoder = OneHotEncoder(sparse=False, categories='auto')
X = onehot_encoder.fit_transform(X)

# split into test and training data
from sklearn.model_selection import train_test_split
#for testing, split twice to get a much smaller dataset - just 5000
#comment out this line to run with the entire data set
x_train_toss, X, y_train_toss, y = train_test_split(X, y, test_size = 5000, random_state = 0)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

#Just to confirm how many records we're dealing with....
print('Final Training Size', len(X_train))
print('Final Testing Size', len(X_test))

Final Training Size 4000
Final Testing Size 1000


For every 1 opioid user in our dataset, we have approximately 17 non opioid users. Given that our sample is so imbalanced, we'll need to use some mechanism to try to even the scales. Luckily, sklearn has ways of handling that. For instance, in LogisticRegression, we can pass the class_weight parameter to obtain a "balanced" problem. 

## An example classifier
Let's do a simple logistic regression. We'll compare our accuracy score for a model that does not account for our imbalanced data with one that does account for it.

Note that all we need to do to make it balanced is to use the class_weight parameter with the value of balanced. We found the needed parameter by consulting the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html">documentation for sklearn LogisticRegression</a>.

The documentation states that "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data. In other words, it more strongly weights the minority class, so that the classifier does a better job of finding those needles.



In [20]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# we do need to go higher than the default iterations for the solver to get convergence
# and the explicity declaration of the solver avoids a warning message, otherwise
# the parameters are defaults.

#without balancing
logreg_model_imbalanced = LogisticRegression(solver='lbfgs',max_iter=1000)
#fit
logreg_model_imbalanced.fit(X_train, y_train)
# Use score method to get accuracy of model
score_imbalanced = logreg_model_imbalanced.score(X_test, y_test) # this is accuracy
print('Score (Accuracy) - Imbalanced:', score_imbalanced)

#with balancing
logreg_model = LogisticRegression(solver='lbfgs',max_iter=1000, class_weight='balanced')
#fit
logreg_model.fit(X_train, y_train)
# Use score method to get accuracy of  the balanced model
score = logreg_model.score(X_test, y_test) # this is accuracy

print('Score (Accuracy) - Balanced:', score)

Score (Accuracy) - Imbalanced: 0.946
Score (Accuracy) - Balanced: 0.701


Our imbalanced score sure looks good, doesn't it? Hm... Let's look at another metric.

### Accuracy vs. Area Under the Curve
Accuracy is how many of the predicted values matched the actual values. Area Under the Curve is a different measure for scoring classifiers. An AUC of .5 would indicate random guessing, or the inability of your classifier to separate the two groups, whereas an AUC of 1 would indicate a perfect classifier. 

We'll also track AUC for our classifiers.

In [21]:
#get auc
y_pred = logreg_model_imbalanced.predict(X_test)
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred, pos_label=1)
auc = metrics.auc(fpr, tpr)
print('Area Under the Curve (imbalanced):', auc)

#get auc
y_pred = logreg_model.predict(X_test)
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred, pos_label=1)
auc = metrics.auc(fpr, tpr)
print('Area Under the Curve (balanced):', auc)

Area Under the Curve (imbalanced): 0.5
Area Under the Curve (balanced): 0.6760825307336935


Even though our accuracy was really high for the model that didn't take the imbalanced nature of the data into account, when we look at area under the curve, we can see that the model actually did no better than random guessing. 

### Confusion Matrix and Statistics
A confusion matrix is a quick way to look at how well your classifier did, and from it we can derive some more statistics. Specifically, we'll be looking at sensitivity (true positive rate), specificity (true negative rate), and precision (positive predictive value).

Sklearn provides a quick and easy way to get the statistics via the classification_report function.

In [22]:
# obtaining the confusion matrix and making it look nice
from sklearn.metrics import confusion_matrix
import pandas as pd

#get predictions from the imbalanced model
y_pred = logreg_model_imbalanced.predict(X_test)

# must put true before predictions in confusion matrix function
cmtx = pd.DataFrame(
    confusion_matrix(y_test, y_pred, labels=[1,0]), 
    index=['true:user', 'true:not user'], 
    columns=['pred:user','pred:not user']
)
print('Imbalanced Confusion Matrix:')
display(cmtx)

#we can also get the classification report directly from sklearn.
from sklearn.metrics import classification_report
cr = classification_report(y_test, y_pred, output_dict=True)
print('Imbalanced Statistics:')
display(cr)


#get predictions from the balanced model
y_pred = logreg_model.predict(X_test)

# must put true before predictions in confusion matrix function
cmtx = pd.DataFrame(
    confusion_matrix(y_test, y_pred, labels=[1,0]), 
    index=['true:user', 'true:not user'], 
    columns=['pred:user','pred:not user']
)
print('Balanced Confusion Matrix:')
display(cmtx)

#we can also get the classification report directly from sklearn.
from sklearn.metrics import classification_report
cr = classification_report(y_test, y_pred, output_dict=True)
print('Balanced Statistics:')
display(cr)


Imbalanced Confusion Matrix:


Unnamed: 0,pred:user,pred:not user
true:user,0,54
true:not user,0,946


Imbalanced Statistics:


{'0': {'precision': 0.946,
  'recall': 1.0,
  'f1-score': 0.9722507708119219,
  'support': 946},
 '1': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 54},
 'accuracy': 0.946,
 'macro avg': {'precision': 0.473,
  'recall': 0.5,
  'f1-score': 0.48612538540596095,
  'support': 1000},
 'weighted avg': {'precision': 0.8949159999999999,
  'recall': 0.946,
  'f1-score': 0.9197492291880782,
  'support': 1000}}

Balanced Confusion Matrix:


Unnamed: 0,pred:user,pred:not user
true:user,35,19
true:not user,280,666


Balanced Statistics:


{'0': {'precision': 0.9722627737226277,
  'recall': 0.7040169133192389,
  'f1-score': 0.8166768853464133,
  'support': 946},
 '1': {'precision': 0.1111111111111111,
  'recall': 0.6481481481481481,
  'f1-score': 0.18970189701897017,
  'support': 54},
 'accuracy': 0.701,
 'macro avg': {'precision': 0.5416869424168694,
  'recall': 0.6760825307336935,
  'f1-score': 0.5031893911826917,
  'support': 1000},
 'weighted avg': {'precision': 0.9257605839416058,
  'recall': 0.701,
  'f1-score': 0.7828202359767313,
  'support': 1000}}

When we look at our confusion matrix and statistics, we can see why our area under the curve was so bad for the imbalanced model. It just predicted everyone was not an opioid user. This is the behavior we expected. But, you can see that the model that used class weights to balance the data did a much better job. It overpredicted the number of users, but it did also correctly predict most of the users in the test set.

### Tracking our Results

We're going to run 10 different models and track our statistics for each model. We'll set up a dataframe and a function to update the dataframe after each test.

In [8]:
#set up a dataframe for storing results for the classification models
import pandas as pd

blanks = [None for i in range(0, 10)]
c_df = pd.DataFrame({'Model Fits': blanks, 
                     'Accuracy': blanks, 
                     'AUC': blanks,
                      'Sensitivity': blanks, 
                      'Precision': blanks,
                       'Specificity': blanks}, 
                     index=[
                            'LogReg - Baseline',
                            'RFC - Baseline',
                            'GNB - Baseline',
                            'Ridge - Baseline',
                            'XGB - Baseline',
                            'XGB - Grid Search', 'XGB - Random Search', 'XGB - Bayesian','XGB - TPOT',
                            'TPOT-General'])


#create a function for updating the grid
def updateResults(df, approach, fits, score, auc, sens, prec, spec):
    df.loc[approach, 'Model Fits'] = fits
    df.loc[approach, 'Accuracy'] = score
    df.loc[approach, 'AUC'] = auc
    df.loc[approach, 'Sensitivity'] = sens
    df.loc[approach, 'Precision'] = prec
    df.loc[approach, 'Specificity'] = spec
    return(df)

c_df    

Unnamed: 0,Model Fits,Accuracy,AUC,Sensitivity,Precision,Specificity
LogReg - Baseline,,,,,,
RFC - Baseline,,,,,,
GNB - Baseline,,,,,,
Ridge - Baseline,,,,,,
XGB - Baseline,,,,,,
XGB - Grid Search,,,,,,
XGB - Random Search,,,,,,
XGB - Bayesian,,,,,,
XGB - TPOT,,,,,,
TPOT-General,,,,,,


Since we'll be running the same code for each model, let's wrap it in a function

In [9]:
#wrapping it all up in a function
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
import xgboost as xgb
from sklearn.metrics import classification_report


def my_classifier_results(model): 
    #get predictions
    y_pred = model.predict(X_test)
    #get the classification report
    cr = classification_report(y_test, y_pred, output_dict=True)
    accuracy = cr['accuracy'] #total number of correct predictions (positive or negative)
    sensitivity = cr['1']['recall'] #true positive rate - accurately predicting a user when they are - 1 is best
    precision = cr['1']['precision'] #positive predictive value - 1 is best
    specificity = cr['0']['recall'] #true negative rate - accurately predicting not a user when they aren't
    #get the area under curve
    fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred, pos_label=1)
    auc = metrics.auc(fpr, tpr)
    print('Model accuracy score from test data: {:0.4f}'.format(accuracy))
    print('Model AUC from test data: {:0.4f}'.format(auc))
    print('Sensitivity (true positive rate) on test data: {:0.2f}'.format(sensitivity))
    print('Precision (positive predictive value) on test data: {:0.2f}'.format(precision))
    print('Specificity (true negative rate) on test data: {:0.2f}'.format(specificity))
    cmtx = pd.DataFrame(
    confusion_matrix(y_test, y_pred, labels=[1,0]), 
    index=['true:user', 'true:not user'], 
    columns=['pred:user','pred:not user']
    )
    display(cmtx)
    
    return(accuracy, auc, sensitivity, precision, specificity)

### Logistic Regession Baseline
This is the same bit of code we already saw as an example above, but now we're using our function to do the work for us, and updating our tracking dataframe.

In [10]:
#run the logistic regressgion baseline
logreg_model = LogisticRegression(solver='lbfgs',max_iter=1000, class_weight='balanced')    

#fit the model
logreg_model.fit(X_train, y_train)

#calculate score
score, auc, sens, prec, spec = my_classifier_results(logreg_model)
c_df = updateResults(c_df, 'LogReg - Baseline', 1, score, auc, sens, prec, spec)

Model accuracy score from test data: 0.7010
Model AUC from test data: 0.6761
Sensitivity (true positive rate) on test data: 0.65
Precision (positive predictive value) on test data: 0.11
Specificity (true negative rate) on test data: 0.70


Unnamed: 0,pred:user,pred:not user
true:user,35,19
true:not user,280,666


### Random Forest Classifier Baseline
For random forest, we're again using the class_weight parameter, but since random forests work a bit differently than logistic regression, we're using a different parameter - balanced_subsample. This means that each of the trees within the forest will have their weights balanced within that tree.

You can read more about it in the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html">documentation</a>.

In [11]:
#the randomforestclassifer uses a balanced_subsample
rfc_model = RandomForestClassifier(n_estimators=100, class_weight='balanced_subsample')    

#fit the model
rfc_model.fit(X_train, y_train)

#calculate score

score, auc, sens, prec, spec = my_classifier_results(rfc_model)
c_df = updateResults(c_df, 'RFC - Baseline', 1, score, auc, sens, prec, spec)

Model accuracy score from test data: 0.9430
Model AUC from test data: 0.4984
Sensitivity (true positive rate) on test data: 0.00
Precision (positive predictive value) on test data: 0.00
Specificity (true negative rate) on test data: 1.00


Unnamed: 0,pred:user,pred:not user
true:user,0,54
true:not user,3,943


### Gaussian Naive Bayes Classifier - Baseline
The <a href="https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html">Gaussian Naive Bayes classifier</a> doesn't have a simple balanced parameter for sample_weights. Instead, we have to create a vector with a weight for each of our rows of data. We know that for every 1 positive respondent, we have 17 negative respondents. So we need to weight the positives with a 17 and the negatives with a one. We'll use a list comprehension to generate our vector of weights.

In [12]:
gnb_model = GaussianNB()    

#the GNB model requires an array of weights - use a list comprehension and cast to numpy array
sample_weights = np.array([17 if i == 1 else 1 for i in y_train])

#fit the model
gnb_model.fit(X_train, y_train, sample_weight=sample_weights)

#calculate score
score, auc, sens, prec, spec = my_classifier_results(gnb_model)
c_df = updateResults(c_df, 'GNB - Baseline', 1, score, auc, sens, prec, spec)

Model accuracy score from test data: 0.0790
Model AUC from test data: 0.5045
Sensitivity (true positive rate) on test data: 0.98
Precision (positive predictive value) on test data: 0.05
Specificity (true negative rate) on test data: 0.03


Unnamed: 0,pred:user,pred:not user
true:user,53,1
true:not user,920,26


### Ridge Baseline
The <a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeClassifier.html">ridge classifier</a> uses ridge regression. Remember that ridge regression uses a tuning parameter (alpha) to shrink coefficients towards zero. We could alter the alpha parameter, but we'll leave it at the default of 1. It also supports using "balanced" with class_weight to handle imbalanced classes. 

In [13]:
from sklearn.linear_model import RidgeClassifier   

ridge_model = RidgeClassifier(class_weight='balanced')


#fit the model
ridge_model.fit(X_train, y_train)

#calculate score
score, auc, sens, prec, spec = my_classifier_results(ridge_model)
c_df = updateResults(c_df, 'Ridge - Baseline', 1, score, auc, sens, prec, spec)

Model accuracy score from test data: 0.6950
Model AUC from test data: 0.6642
Sensitivity (true positive rate) on test data: 0.63
Precision (positive predictive value) on test data: 0.11
Specificity (true negative rate) on test data: 0.70


Unnamed: 0,pred:user,pred:not user
true:user,34,20
true:not user,285,661


### XGBoost - Baseline
Finally, <a href="https://xgboost.readthedocs.io/en/latest/python/python_api.html">XGBoost</a> works like the Guassian Naive Bayes classifier - it requires a vector of weights. We've already created it, so here we'll just use it again.

In [14]:
xgb_model = xgb.XGBClassifier(objective="binary:logistic")

#fit the model - passing in the sample_weights
xgb_model.fit(X_train, y_train, sample_weight=sample_weights)

#calculate score
score, auc, sens, prec, spec = my_classifier_results(xgb_model)
c_df = updateResults(c_df, 'XGB - Baseline', 1, score, auc, sens, prec, spec)



Model accuracy score from test data: 0.7350
Model AUC from test data: 0.6591
Sensitivity (true positive rate) on test data: 0.57
Precision (positive predictive value) on test data: 0.11
Specificity (true negative rate) on test data: 0.74


Unnamed: 0,pred:user,pred:not user
true:user,31,23
true:not user,242,704


### XGBoost with Grid Search

We've finally gotten all of our baseline classifiers done. Let's optimize some parameters!

We'll start by optimizing XGBoost with Grid Search.

These searches tend to take quite a while. This is a large data set. I'm using <a href="https://scikit-learn.org/stable/glossary.html#term-n-jobs">n_jobs</a> in my grid search. This allows the computer to spin up parallel processes. This will only work if your computer has multiple cores. But if it does, it can certainly make this run faster. If your computer does not have multiple cores available, just make n_jobs = 1.

While you're running the code below, let's take a minute to discuss the parameters. We're using the default booster for XGBoost, which is the Tree Booster. Grid search will search every combination of parameters we give it, so the more parameters we include, the more models will be fit (and the longer it will take to run the code). You can review all the possible parameters and what they mean in the <a href="https://xgboost.readthedocs.io/en/latest/parameter.html?highlight=learning_rate#parameters-for-tree-booster">documentation</a>. That's where you also find the possible range of each parameter.

Your job is to pick enough numbers within each range to give you a good search space, without picking so many as to make the code impossibly long to run. I've found that 2 or 3 of each parameter is enough to find a decently fitting model. After you have run this and found your best model, if any of the parameters are at the minimum or maximum of what you searched, you might want to try a lower or higher number in the range and search again.



In [15]:
# run GridSearchCV with our model to find better hyperparameters
from sklearn.model_selection import GridSearchCV


#########################
#grid search
#########################
params = {
    "learning_rate": [0.01, 0.1], #pick floats from 0 to 1
    "max_depth": [2, 4, 6], #pick integers in range [0,inf] (but you'd usually want at least 1)
    "n_estimators": [10, 100], #number of trees. Default is 100
    "subsample": [0.8, 1], #pick floats from 0 to 1
    "min_child_weight": [1, 3], #pick numbers 0 to inf
    "reg_lambda": [1, 3], #pick numbers from 0 to inf
    "reg_alpha:": [1, 3] #pick numbers from 0 to inf
}

# setup the grid search
grid_search = GridSearchCV(xgb_model,
                           param_grid=params,
                           cv=5,
                           verbose=1,
                           n_jobs=3,
                           return_train_score=True)


#fit the model
grid_search.fit(X_train, y_train, sample_weight=sample_weights)

#calculate score
score, auc, sens, prec, spec = my_classifier_results(grid_search)
c_df = updateResults(c_df, 'XGB - Grid Search', 960, score, auc, sens, prec, spec)


 
 

Fitting 5 folds for each of 192 candidates, totalling 960 fits


[Parallel(n_jobs=3)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done  44 tasks      | elapsed:    5.0s
[Parallel(n_jobs=3)]: Done 194 tasks      | elapsed:   33.1s
[Parallel(n_jobs=3)]: Done 444 tasks      | elapsed:  1.9min
[Parallel(n_jobs=3)]: Done 794 tasks      | elapsed:  3.4min
[Parallel(n_jobs=3)]: Done 960 out of 960 | elapsed:  4.5min finished


Model accuracy score from test data: 0.8590
Model AUC from test data: 0.5239
Sensitivity (true positive rate) on test data: 0.15
Precision (positive predictive value) on test data: 0.08
Specificity (true negative rate) on test data: 0.90


Unnamed: 0,pred:user,pred:not user
true:user,8,46
true:not user,95,851


### XGBoost with Random Search
Once again, I'm using n_jobs with random search. We're doing 5-fold cross validation. We have the same parameters here, but since we're randomly combining different parameters, we can use more options. For some of our parameters, instead of picking a few numbers in a list, we're telling the the algorithm to randomly choose from a list of integers or from a uniform distribution.

In [16]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, randint

params = {
    "learning_rate": [0.001, 0.01, 0.1, 0.5, 1.],
    "max_depth": randint(1, 10),
    "n_estimators": randint(10, 100),
    "subsample": uniform(0.05, 0.95),  # so uniform on [.05,.05+.95] = [.05,1.]
    "min_child_weight": randint(1, 20),
    "reg_alpha": uniform(0, 5),
    "reg_lambda": uniform(0, 5)
}

random_search = RandomizedSearchCV(
    xgb_model,
    param_distributions=params,
    random_state=8675309,
    n_iter=25,
    cv=5,
    verbose=1,
    n_jobs=2,
    return_train_score=True)

#fit the model
random_search.fit(X_train, y_train, sample_weight=sample_weights)

#calculate score
score, auc, sens, prec, spec = my_classifier_results(random_search)
c_df = updateResults(c_df, 'XGB - Random Search', 125, score, auc, sens, prec, spec)



Fitting 5 folds for each of 25 candidates, totalling 125 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:   19.2s
[Parallel(n_jobs=2)]: Done 125 out of 125 | elapsed:   44.8s finished


Model accuracy score from test data: 0.8740
Model AUC from test data: 0.5667
Sensitivity (true positive rate) on test data: 0.22
Precision (positive predictive value) on test data: 0.12
Specificity (true negative rate) on test data: 0.91


Unnamed: 0,pred:user,pred:not user
true:user,12,42
true:not user,84,862


### XGBoost with Bayesian Optimization

Bayesian optimization requires a different set up for our parameters. Note that the parameters themselves are the same. But, we have to make an hp_bounds object. Float numbers are of type "continuous" while integers are of type "discrete." For each parameter, we give a domain - which is the high and low from which values will be chosen.

In [17]:
np.random.seed(8675309)  # seed courtesy of Tommy Tutone
from GPyOpt.methods import BayesianOptimization
from sklearn.model_selection import cross_val_score, KFold

hp_bounds = [{
    'name': 'learning_rate',
    'type': 'continuous',
    'domain': (0.001, 1.0)
}, {
    'name': 'max_depth',
    'type': 'discrete',
    'domain': (1, 10)
}, {
    'name': 'n_estimators',
    'type': 'discrete',
    'domain': (10, 100)
}, {
    'name': 'subsample',
    'type': 'continuous',
    'domain': (0.05, 1.0)
}, {
    'name': 'min_child_weight',
    'type': 'discrete',
    'domain': (1, 20)
}, {
    'name': 'reg_alpha',
    'type': 'continuous',
    'domain': (0, 5)
}, {
    'name': 'reg_lambda',
    'type': 'continuous',
    'domain': (0, 5)
}]


# Optimization objective
def cv_score(hyp_parameters):
    hyp_parameters = hyp_parameters[0]
    xgb_model = xgb.XGBClassifier(objective="binary:logistic",
                                 learning_rate=hyp_parameters[0],
                                 max_depth=int(hyp_parameters[1]),
                                 n_estimators=int(hyp_parameters[2]),
                                 subsample=hyp_parameters[3],
                                 min_child_weight=int(hyp_parameters[4]),
                                 reg_alpha=hyp_parameters[5],
                                 reg_lambda=hyp_parameters[6])
    scores = cross_val_score(xgb_model,
                             X=X_train,
                             y=y_train,
                             cv=KFold(n_splits=5))
    return np.array(scores.mean())  # return average of 5-fold scores


optimizer = BayesianOptimization(f=cv_score,
                                 domain=hp_bounds,
                                 model_type='GP',
                                 acquisition_type='EI',
                                 acquisition_jitter=0.05,
                                 exact_feval=True,
                                 maximize=True,
                                 verbosity=True)

optimizer.run_optimization(max_iter=20,verbosity=True)

best_hyp_set = {}
for i in range(len(hp_bounds)):
    if hp_bounds[i]['type'] == 'continuous':
        best_hyp_set[hp_bounds[i]['name']] = optimizer.x_opt[i]
    else:
        best_hyp_set[hp_bounds[i]['name']] = int(optimizer.x_opt[i])
        
bayopt_search = xgb.XGBClassifier(objective="binary:logistic",**best_hyp_set)        

#fit the model
bayopt_search.fit(X_train, y_train, sample_weight=sample_weights)

#calculate score
score, auc, sens, prec, spec = my_classifier_results(bayopt_search)
c_df = updateResults(c_df, 'XGB - Bayesian', 125, score, auc, sens, prec, spec)


num acquisition: 1, time elapsed: 0.50s
num acquisition: 2, time elapsed: 1.19s
num acquisition: 3, time elapsed: 4.45s
num acquisition: 4, time elapsed: 7.85s
num acquisition: 5, time elapsed: 15.69s
num acquisition: 6, time elapsed: 16.27s
num acquisition: 7, time elapsed: 16.94s
num acquisition: 8, time elapsed: 17.59s
num acquisition: 9, time elapsed: 18.30s
Model accuracy score from test data: 0.6560
Model AUC from test data: 0.6610
Sensitivity (true positive rate) on test data: 0.67
Precision (positive predictive value) on test data: 0.10
Specificity (true negative rate) on test data: 0.66


Unnamed: 0,pred:user,pred:not user
true:user,36,18
true:not user,326,620


### XGBoost with TPOT Classifier

With a single model type, TPOT looks a whole lot like our regular xgboost setup with random search. The syntax is slightly different, but the concepts are the same. We've added one extra parameter here to account for our imbalanced classes - the 'scale_pos_weight' parameter. Again, we're telling the classifier that we have about 17 negative cases for each positive case.

In [18]:
from tpot import TPOTClassifier

tpot_config = {
    'xgboost.XGBClassifier': {
        'n_estimators': [100],
        'max_depth': range(1, 11),
        'learning_rate': [1e-3, 1e-2, 1e-1, 0.5, 1.],
        'subsample': np.arange(0.05, 1.01, 0.05),
        'min_child_weight': range(1, 21),
        'reg_alpha': range(1, 6),
        'reg_lambda': range(1, 6),
        'nthread': [2],
        'objective': ['binary:logistic'],
        'scale_pos_weight': [17] #trying to force tpot and xgboost to handle the imbalanced classes....
    }
}

tpot = TPOTClassifier(generations=10,
                     population_size=40,
                     verbosity=2,
                     config_dict=tpot_config,
                     cv=3,
                     scoring='balanced_accuracy',
                     random_state=8675309)


#fit the model
tpot.fit(X_train, y_train, sample_weight=sample_weights)



#calculate score
score, auc, sens, prec, spec = my_classifier_results(tpot)
c_df = updateResults(c_df, 'XGB - TPOT', 400, score, auc, sens, prec, spec)
tpot.export('tpot_XGBclassifier-opioid.py')

HBox(children=(IntProgress(value=0, description='Optimization Progress', max=440, style=ProgressStyle(descript…

Generation 1 - Current best internal CV score: 0.6722895925843752
Generation 2 - Current best internal CV score: 0.6729643337671076
Generation 3 - Current best internal CV score: 0.6729643337671076
Generation 4 - Current best internal CV score: 0.6729643337671076
Generation 5 - Current best internal CV score: 0.6739639442156862
Generation 6 - Current best internal CV score: 0.6769774396561972
Generation 7 - Current best internal CV score: 0.6771318365736375
Generation 8 - Current best internal CV score: 0.6771318365736375
Generation 9 - Current best internal CV score: 0.6791073854986531
Generation 10 - Current best internal CV score: 0.6791073854986531

Best pipeline: XGBClassifier(XGBClassifier(input_matrix, learning_rate=0.001, max_depth=9, min_child_weight=8, n_estimators=100, nthread=2, objective=binary:logistic, reg_alpha=3, reg_lambda=3, scale_pos_weight=17, subsample=0.6000000000000001), learning_rate=0.001, max_depth=7, min_child_weight=12, n_estimators=100, nthread=2, objectiv

Unnamed: 0,pred:user,pred:not user
true:user,16,38
true:not user,180,766


### AutoML with TPOT Classifier

With imbalanced classes, even using the balanced_accuracy scoring method, TPOT chooses algorithms that perform horribly if you give it just the default config. A custom config using just methods that can handle imbalanced classes, and setting the parameter that they use for imbalanced classes, is necessary to get a reasonable result.

How did we figure this out? Well, like many packages in Python, the code that makes up the package is available on <a href="https://github.com/EpistasisLab/tpot">github</a>. We're using the default classifier configuration. We can see all the models that the default classifier will attempt to run in the <a href="https://github.com/EpistasisLab/tpot/blob/master/tpot/config/classifier.py">classifier config file</a>. Not all the models have mechanisms for dealing with imbalanced data. So, the config dictionary below includes just the models that have a mechanism for balancing classes. For each of them, we've added the parameter required to balance the data.

In [19]:
from tpot import TPOTClassifier

classifier_config_dict = {



    'sklearn.tree.DecisionTreeClassifier': {
        'criterion': ["gini", "entropy"],
        'max_depth': range(1, 11),
        'min_samples_split': range(2, 21),
        'min_samples_leaf': range(1, 21),
        'class_weight': ['balanced']
    },

    'sklearn.ensemble.ExtraTreesClassifier': {
        'n_estimators': [100],
        'criterion': ["gini", "entropy"],
        'max_features': np.arange(0.05, 1.01, 0.05),
        'min_samples_split': range(2, 21),
        'min_samples_leaf': range(1, 21),
        'bootstrap': [True, False],
        'class_weight': ['balanced']
    },

    'sklearn.ensemble.RandomForestClassifier': {
        'n_estimators': [100],
        'criterion': ["gini", "entropy"],
        'max_features': np.arange(0.05, 1.01, 0.05),
        'min_samples_split': range(2, 21),
        'min_samples_leaf':  range(1, 21),
        'bootstrap': [True, False],
        'class_weight': ['balanced']
    },

    #'sklearn.ensemble.GradientBoostingClassifier' - has no paramter for imbalanced data

    #'sklearn.neighbors.KNeighborsClassifier' - has no parameter for imbalanced data

    'sklearn.svm.LinearSVC': {
        'penalty': ["l1", "l2"],
        'loss': ["hinge", "squared_hinge"],
        'dual': [True, False],
        'tol': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
        'C': [1e-4, 1e-3, 1e-2, 1e-1, 0.5, 1., 5., 10., 15., 20., 25.],
        'class_weight': ['balanced']
    },

    'sklearn.linear_model.LogisticRegression': {
        'penalty': ["l1", "l2"],
        'C': [1e-4, 1e-3, 1e-2, 1e-1, 0.5, 1., 5., 10., 15., 20., 25.],
        'dual': [True, False],
        'class_weight': ['balanced']
    },

    'xgboost.XGBClassifier': {
        'n_estimators': [100],
        'max_depth': range(1, 11),
        'learning_rate': [1e-3, 1e-2, 1e-1, 0.5, 1.],
        'subsample': np.arange(0.05, 1.01, 0.05),
        'min_child_weight': range(1, 21),
        'nthread': [1],
        'scale_pos_weight': [17]
    }
}



tpot_auto = TPOTClassifier(generations=10,
                     population_size=40,
                     verbosity=2,
                     cv=3,
                    config_dict=classifier_config_dict,
                     scoring='balanced_accuracy',
                     random_state=8675309)

#fit the model
tpot_auto.fit(X_train, y_train, sample_weight=sample_weights)

#calculate score
score, auc, sens, prec,spec = my_classifier_results(tpot_auto)
c_df = updateResults(c_df, 'TPOT-General', 1600, score, auc, sens, prec, spec)
tpot.export('tpot_optimal_pipeline-opioid.py')

HBox(children=(IntProgress(value=0, description='Optimization Progress', max=440, style=ProgressStyle(descript…

Generation 1 - Current best internal CV score: 0.6716822468374519
Generation 2 - Current best internal CV score: 0.6716822468374519
Generation 3 - Current best internal CV score: 0.6740268775392209
Generation 4 - Current best internal CV score: 0.6740268775392209
Generation 5 - Current best internal CV score: 0.6792025126199815
Generation 6 - Current best internal CV score: 0.6792025126199815
Generation 7 - Current best internal CV score: 0.67945712073083
Generation 8 - Current best internal CV score: 0.6815589591157618
Generation 9 - Current best internal CV score: 0.6815589591157618
Generation 10 - Current best internal CV score: 0.6855140125014586

Best pipeline: RandomForestClassifier(LinearSVC(CombineDFs(input_matrix, CombineDFs(LinearSVC(input_matrix, C=5.0, class_weight=balanced, dual=True, loss=hinge, penalty=l2, tol=0.01), input_matrix)), C=0.01, class_weight=balanced, dual=False, loss=squared_hinge, penalty=l2, tol=0.0001), bootstrap=True, class_weight=balanced, criterion=ent

Unnamed: 0,pred:user,pred:not user
true:user,10,44
true:not user,110,836


## Conclusion

When working with imbalanced data, you definitely do not want to use accuracy as your primary measure. When sorting by accuracy, we can see that RFC - Baseline is the "best" choice. But the sensitivity and precision are both terrible.


In [23]:
c_df.sort_values(by=['Accuracy'], ascending=False)

Unnamed: 0,Model Fits,Accuracy,AUC,Sensitivity,Precision,Specificity
RFC - Baseline,1,0.943,0.498414,0.0,0.0,0.996829
XGB - Random Search,125,0.874,0.566714,0.222222,0.125,0.911205
XGB - Grid Search,960,0.859,0.523863,0.148148,0.0776699,0.899577
TPOT-General,1600,0.846,0.534453,0.185185,0.0833333,0.883721
XGB - TPOT,400,0.782,0.553011,0.296296,0.0816327,0.809725
XGB - Baseline,1,0.735,0.65913,0.574074,0.113553,0.744186
LogReg - Baseline,1,0.701,0.676083,0.648148,0.111111,0.704017
Ridge - Baseline,1,0.695,0.664181,0.62963,0.106583,0.698732
XGB - Bayesian,125,0.656,0.661029,0.666667,0.0994475,0.655391
GNB - Baseline,1,0.079,0.504483,0.981481,0.0544707,0.0274841


While RFC - baseline did have a parameter that should have accounted for the imbalanced data, it did not handle the imbalanced data well at all.

Let's sort the dataframe by AUC and see where we're at.

In [24]:
c_df.sort_values(by=['AUC'], ascending=False)

Unnamed: 0,Model Fits,Accuracy,AUC,Sensitivity,Precision,Specificity
LogReg - Baseline,1,0.701,0.676083,0.648148,0.111111,0.704017
Ridge - Baseline,1,0.695,0.664181,0.62963,0.106583,0.698732
XGB - Bayesian,125,0.656,0.661029,0.666667,0.0994475,0.655391
XGB - Baseline,1,0.735,0.65913,0.574074,0.113553,0.744186
XGB - Random Search,125,0.874,0.566714,0.222222,0.125,0.911205
XGB - TPOT,400,0.782,0.553011,0.296296,0.0816327,0.809725
TPOT-General,1600,0.846,0.534453,0.185185,0.0833333,0.883721
XGB - Grid Search,960,0.859,0.523863,0.148148,0.0776699,0.899577
GNB - Baseline,1,0.079,0.504483,0.981481,0.0544707,0.0274841
RFC - Baseline,1,0.943,0.498414,0.0,0.0,0.996829


Sorting the dataframe this way, our formerly "best" model based on accuracy is now the bottom of the pile (not surprisingly). 

With the small dataset, the logistic regression baseline comes out on top. With the full dataset, the TPOT auto-tuning algorithm comes out on top.

None of these models have a great AUC. This model is probably not particularly useful. What it probably needs is a different selection of predictors. But we'll leave that for another day.