# Naive Bayes

* Naive Bayes is a probabilistic classifier: it estimates the probability that a sample belongs to each class, given the sample's features, and predicts the most probable class.
    * Naive: because it assumes that the features are mutually independent given the class (they do not affect one another).
    * Bayes: because it follows Bayes' theorem.
    
    
## Bayes' Theorem

$$P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}$$

* Intuition: P(A|B) is the probability of A occurring once we restrict attention to the cases where B occurs.
    * All else equal, a larger prior P(A) gives a larger P(A|B). A larger P(B), on the other hand, means B is common regardless of A, so observing B is weaker evidence and P(A|B) is smaller.
* A good way to get familiar with naive Bayes is by working through some probability examples.

### Warm Up Examples

#### Example 1
 
> A box has 3 white balls and 7 black balls. Three balls are randomly selected one by one without replacement. 
>    1. If the first two balls are black, what is the probability the third ball will be black? 
>    2. If the first two balls are the same color, what is the probability the third ball will be black? 

![mc_ex1](../../../assets/naive_bayes/markov_chain_ex1.png)

* A great way to visualize a probabilistic process is by using _Markov chains_. A Markov chain is a model of a random process that unfolds over time. The Markov property states that whatever happens next in the process depends only on its current state. It has no "memory" of how it got there.
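
The two questions in Example 1 can also be checked by direct computation. The sketch below is a minimal illustration using exact fractions; the helper name `p_sequence` is hypothetical and is not taken from the solution in the figure.

```python
from fractions import Fraction

WHITE, BLACK = 3, 7  # box contents from the problem statement


def p_sequence(colors):
    """Probability of drawing the given color sequence without replacement."""
    remaining = {"W": WHITE, "B": BLACK}
    prob = Fraction(1)
    for color in colors:
        total = remaining["W"] + remaining["B"]
        prob *= Fraction(remaining[color], total)
        remaining[color] -= 1
    return prob


# 1. P(third black | first two black) = P(BBB) / P(BB)
answer_1 = p_sequence("BBB") / p_sequence("BB")

# 2. P(third black | first two same color)
#    = [P(BBB) + P(WWB)] / [P(BB) + P(WW)]
answer_2 = (p_sequence("BBB") + p_sequence("WWB")) / (p_sequence("BB") + p_sequence("WW"))

print(answer_1, answer_2)  # 5/8 21/32
```

Both answers are conditional probabilities: the joint probability of the full sequence divided by the probability of the condition.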

#### Example 2

> Suppose that 1% of the population have a disease called stataphobia. There is a diagnostic test with 0.95 probability of being positive when a person has stataphobia and 0.8 probability of being negative when a person does not have stataphobia. If a person tests positive for stataphobia, what is the probability they have stataphobia? 

![mc_ex2](../../../assets/naive_bayes/markov_chain_ex2.png)
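
Example 2 is a direct application of Bayes' theorem. The sketch below takes the prevalence to be 1%, which is an assumption about the problem statement; the variable names are illustrative.

```python
# Assumption: the prevalence of the disease is 1%.
p_disease = 0.01
p_pos_given_disease = 0.95   # test is positive when the person has the disease
p_neg_given_healthy = 0.80   # test is negative when the person does not

# Law of total probability: P(+) = P(+|D) P(D) + P(+|not D) P(not D)
p_pos_given_healthy = 1 - p_neg_given_healthy
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' theorem: P(D|+) = P(+|D) P(D) / P(+)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive) = {p_disease_given_pos:.4f}")
```

Under this assumption a positive test still leaves the probability of disease below 5%, because false positives from the much larger healthy group outnumber the true positives.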


### Naive Bayes Examples

#### Example 1

Given the following training data, use a naive Bayes classifier to predict y for the new sample {x1 = S, x2 = C, x3 = H, x4 = S}.


| x1 | x2 | x3 | x4 | y | 
|----|----|----|----|---| 
| S  | H  | H  | W  | N | 
| S  | H  | H  | S  | N | 
| O  | H  | H  | W  | Y | 
| R  | M  | H  | W  | Y | 
| R  | C  | N  | W  | Y | 
| R  | C  | N  | S  | N | 
| O  | C  | N  | S  | Y | 
| S  | M  | H  | W  | N | 
| S  | C  | N  | W  | Y | 
| R  | M  | N  | W  | Y | 
| S  | M  | N  | S  | Y | 
| O  | M  | H  | S  | Y | 
| O  | H  | N  | W  | Y | 
| R  | M  | H  | S  | N | 


![nb_sol1](../../../assets/naive_bayes/naive_bayes_sol1.png)

1. Our goal is to find the class that is most probable given the training examples. The first equation summarizes this: find the argument (y = c*, which can be either Y or N) that maximizes the probability of the class given the sample {x1 = S, x2 = C, x3 = H, x4 = S}. 
2. Using Bayes' rule, the conditioning is reversed: we maximize the likelihood of the sample given the class, times the prior of the class. The denominator P(x) is the same for every class, so it can be dropped.
3. The final line shows why the process is naive. Each feature is treated independently: for every column we simply count the fraction of training rows of a given class that contain the desired value. If there are fewer occurrences, the probability naturally decreases. 

We can write this in general as:

$$P(y_k \mid x) \propto P(x \mid y_k) \, P(y_k) = P(x_1 \mid y_k) \, P(x_2 \mid y_k) \cdots P(x_n \mid y_k) \, P(y_k)$$
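
The general formula can be applied to the training table of Example 1 by plain counting. The following is a minimal sketch, not the solution shown in the figure: it uses no smoothing, encodes each table row as a five-character string, and the function name `naive_bayes_scores` is hypothetical.

```python
from collections import Counter
from fractions import Fraction

# Training table from Example 1: characters are x1, x2, x3, x4, y
ROWS = [
    "SHHWN", "SHHSN", "OHHWY", "RMHWY", "RCNWY", "RCNSN", "OCNSY",
    "SMHWN", "SCNWY", "RMNWY", "SMNSY", "OMHSY", "OHNWY", "RMHSN",
]


def naive_bayes_scores(rows, sample):
    """Unnormalized posterior P(x1|y) ... P(xn|y) P(y) for each class y."""
    class_counts = Counter(row[-1] for row in rows)
    scores = {}
    for label, n_label in class_counts.items():
        score = Fraction(n_label, len(rows))  # prior P(y)
        for column, value in enumerate(sample):
            matches = sum(1 for row in rows
                          if row[-1] == label and row[column] == value)
            score *= Fraction(matches, n_label)  # P(x_column = value | y)
        scores[label] = score
    return scores


scores = naive_bayes_scores(ROWS, "SCHS")
prediction = max(scores, key=scores.get)
print({label: float(score) for label, score in scores.items()}, prediction)
```

For the sample {x1 = S, x2 = C, x3 = H, x4 = S} the score for N (18/875, about 0.0206) exceeds the score for Y (1/189, about 0.0053), so the predicted class is N.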


#### Example 2

Given the following training data...

| ID | Terms in email              | Is spam | 
|----|-----------------------------|---------| 
| 1  | Click win prize             | 1       | 
| 2  | Click meeting setup meeting | 0       | 
| 3  | Prize free pizza            | 1       | 
| 4  | Click prize free            | 1       | 

Predict the following test sample:

| ID | Terms in email              | Is spam | 
|----|-----------------------------|---------| 
| 5  | Free setup meeting free     | ?       | 



$P(S \mid x) \propto P(S) \cdot P(\text{free} \mid S) \cdot P(\text{setup} \mid S) \cdot P(\text{meeting} \mid S) = \frac{3}{4} \cdot \frac{2+1}{9+7} \cdot \frac{0+1}{9+7} \cdot \frac{0+1}{9+7} \approx 0.00055$

$P(NS \mid x) \propto P(NS) \cdot P(\text{free} \mid NS) \cdot P(\text{setup} \mid NS) \cdot P(\text{meeting} \mid NS) = \frac{1}{4} \cdot \frac{0+1}{4+7} \cdot \frac{1+1}{4+7} \cdot \frac{2+1}{4+7} \approx 0.0011$

Conclusion: more likely to be not spam.

**Explanation**
* 3/4 of the documents are spam, therefore the remaining 1/4 are not.
* Every conditional probability is the ratio of a term's frequency to the total number of terms in the respective class. For example, there are 9 terms in the spam class and two of them are the word 'free'. On the other hand, there are 4 terms in the not spam class, and none of them are the word 'free'. Each distinct term of the test email is counted once here, even though 'free' appears twice.
    * This is where the +7 comes in. In order to avoid zeroing out the other term probabilities (which could potentially be very high!), we apply _Laplace smoothing_. With Laplace smoothing, all terms begin their count at 1 rather than zero, hence the +1 that we see in each numerator. To compensate, we add the number of unique terms to the denominator (there are 7 of these terms).
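
The smoothed counts can be reproduced in a few lines. This is a minimal stdlib sketch under the same convention as the worked example (each distinct term of the test email contributes once); the function name `score` and the lowercasing of the terms are choices made for illustration.

```python
from collections import Counter
from fractions import Fraction

TRAIN = [
    ("click win prize", 1),
    ("click meeting setup meeting", 0),
    ("prize free pizza", 1),
    ("click prize free", 1),
]
TEST = "free setup meeting free"

vocabulary = {term for text, _ in TRAIN for term in text.split()}
term_counts = {0: Counter(), 1: Counter()}
doc_counts = Counter()
for text, label in TRAIN:
    term_counts[label].update(text.split())
    doc_counts[label] += 1


def score(label, text, smoothing=1):
    """Prior times the smoothed likelihood of each distinct term in text."""
    total_terms = sum(term_counts[label].values())
    result = Fraction(doc_counts[label], len(TRAIN))
    for term in sorted(set(text.split())):  # each distinct term counted once
        result *= Fraction(term_counts[label][term] + smoothing,
                           total_terms + smoothing * len(vocabulary))
    return result


spam_score, ham_score = score(1, TEST), score(0, TEST)
print(float(spam_score), float(ham_score))
print("spam" if spam_score > ham_score else "not spam")
```

Setting `smoothing=0` makes the spam score collapse to zero because 'setup' and 'meeting' never occur in a spam email, which is exactly the problem Laplace smoothing avoids.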


## Implementing Naive Bayes

We will be building a spam classifier using data downloaded from: http://www.aueb.gr/users/ion/data/enron-spam/preprocessed/enron1.tar.gz

* In this dataset, there are approximately 3,672 ham (legitimate) emails and 1,500 spam emails, so there are roughly 2.4 ham examples per spam example.

### Data Exploration
#### Sample Ham

In [84]:
import os

DATA_PATH = os.path.join('data', 'enron1')


file_path = os.path.join(DATA_PATH, 'ham', '0007.1999-12-14.farmer.ham.txt')
with open(file_path, 'r') as infile:
    ham_sample = infile.read()
print(ham_sample)

Subject: mcmullen gas for 11 / 99
jackie ,
since the inlet to 3 river plant is shut in on 10 / 19 / 99 ( the last day of
flow ) :
at what meter is the mcmullen gas being diverted to ?
at what meter is hpl buying the residue gas ? ( this is the gas from teco ,
vastar , vintage , tejones , and swift )
i still see active deals at meter 3405 in path manager for teco , vastar ,
vintage , tejones , and swift
i also see gas scheduled in pops at meter 3404 and 3405 .
please advice . we need to resolve this as soon as possible so settlement
can send out payments .
thanks


#### Sample Spam

In [85]:
file_path = os.path.join(DATA_PATH, 'spam', '0058.2003-12-21.GP.spam.txt')
with open(file_path, 'r') as infile:
    spam_sample = infile.read()
print(spam_sample)

Subject: stacey automated system generating 8 k per week parallelogram
people are
getting rich using this system ! now it ' s your
turn !
we ' ve
cracked the code and will show you . . . .
this is the
only system that does everything for you , so you can make
money
. . . . . . . .
because your
success is . . . completely automated !
let me show
you how !
click
here
to opt out click here % random _ text



### Implementation

In [86]:
# loading in the data for spam emails
emails, labels = [], []
SPAM_DIR = os.path.join(DATA_PATH, 'spam')
spam_files = [os.path.join(SPAM_DIR, spam) for spam in os.listdir(SPAM_DIR) if spam.endswith('.txt')]
for spam_file in spam_files:
    with open(spam_file, 'r', encoding = "ISO-8859-1") as infile:
        emails.append(infile.read())
        labels.append(1)

# loading in the data for ham emails
HAM_DIR = os.path.join(DATA_PATH, 'ham')
ham_files = [os.path.join(HAM_DIR, ham) for ham in os.listdir(HAM_DIR) if ham.endswith('.txt')]
for ham_file in ham_files:
    with open(ham_file, 'r', encoding = "ISO-8859-1") as infile:
        emails.append(infile.read())
        labels.append(0)


In [87]:
# preprocess the data by cleaning it. this will include:
# - number and punctuation removal
# - human name removal (optional)
# - stop words removal (done later by the vectorizer)
# - lemmatization

from nltk.tokenize import word_tokenize
from nltk.corpus import names
from nltk.stem import WordNetLemmatizer


NAMES = set(names.words())

def my_filter(doc):
    lemmatizer = WordNetLemmatizer()
    f1 = [word.lower() for word in word_tokenize(doc)]
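    # note: the names corpus is capitalized (e.g. 'John') while the tokens here are
    # lowercased, so this check leaves lowercase names such as 'john' in place
    f2 = [word for word in f1 if word.isalpha() and word not in NAMES]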
    return ' '.join(lemmatizer.lemmatize(word) for word in f2)


emails = [my_filter(email) for email in emails]
emails[0]

'subject dobmeos with hgh my energy level ha gone up stukm introducing doctor formulated hgh human growth hormone also called hgh is referred to in medical science a the master hormone it is very plentiful when we are young but near the age of twenty one our body begin to produce le of it by the time we are forty nearly everyone is deficient in hgh and at eighty our production ha normally diminished at least advantage of hgh increased muscle strength loss in body fat increased bone density lower blood pressure quickens wound healing reduces cellulite improved vision wrinkle disappearance increased skin thickness texture increased energy level improved sleep and emotional stability improved memory and mental alertness increased sexual potency resistance to common illness strengthened heart muscle controlled cholesterol controlled mood swing new hair growth and color restore read more at this website unsubscribe'

In [88]:
# we will be using the term-frequency as our features
from sklearn.feature_extraction.text import CountVectorizer


cv = CountVectorizer(stop_words="english", max_features=500)
term_docs = cv.fit_transform(emails)

In [89]:
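# these are our features (the 500 most frequent terms) in alphabetical order
# note: get_feature_names() was removed in scikit-learn 1.2; newer versions provide get_feature_names_out()
print(cv.get_feature_names()[0:10])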

['able', 'access', 'account', 'accounting', 'act', 'action', 'activity', 'actual', 'actuals', 'add']


In [90]:
# and we can access a row in our feature matrix by indexing by a document ID
# .toarray() converts the row from sparse to dense
print(term_docs[1].toarray())

[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
  1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

In [91]:
# notice how the length of each row matches the number of features
assert len(cv.get_feature_names()) == len(term_docs[1].toarray()[0])

In [92]:
# in order to map each feature to its corresponding column, we can access the following
cv.vocabulary_

{'subject': 418,
 'energy': 125,
 'ha': 178,
 'called': 47,
 'young': 497,
 'le': 231,
 'time': 446,
 'production': 345,
 'loss': 250,
 'swing': 425,
 'new': 285,
 'color': 69,
 'read': 357,
 'website': 481,
 'prescription': 337,
 'low': 252,
 'cost': 86,
 'online': 301,
 'order': 306,
 'direct': 111,
 'click': 66,
 'thanks': 439,
 'list': 241,
 'people': 319,
 'change': 58,
 'able': 0,
 'partner': 310,
 'team': 431,
 'investment': 212,
 'account': 2,
 'agreement': 17,
 'project': 348,
 'based': 34,
 'contact': 77,
 'set': 394,
 'business': 44,
 'request': 368,
 'act': 4,
 'opportunity': 304,
 'come': 71,
 'send': 389,
 'million': 272,
 'state': 411,
 'dollar': 114,
 'area': 29,
 'cover': 90,
 'regard': 362,
 'better': 37,
 'special': 405,
 'following': 152,
 'tax': 427,
 'share': 395,
 'provide': 349,
 'process': 342,
 'money': 275,
 'long': 247,
 'form': 154,
 'need': 282,
 'security': 387,
 'john': 221,
 'number': 295,
 'america': 23,
 'stock': 415,
 'company': 73,
 'release': 364,
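
To make the three views above concrete (feature names, count rows, and the term-to-column mapping), here is a stdlib-only sketch on a toy corpus. It illustrates what the vectorizer output represents, not how `CountVectorizer` is implemented, which also handles tokenization, stop words, and `max_features`.

```python
from collections import Counter

docs = ["click win prize", "click meeting setup meeting", "prize free pizza"]

# sorted unique terms play the role of the feature names
feature_names = sorted({term for doc in docs for term in doc.split()})
# term -> column index, analogous to the vocabulary_ mapping
vocabulary = {term: column for column, term in enumerate(feature_names)}

# one row per document, one column per term, each cell a term count
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[term] for term in feature_names])

print(feature_names)
print(vocabulary["meeting"])  # column that holds the counts for 'meeting'
print(matrix[1])              # count row for the second document
```

The second document contains 'meeting' twice, so its row holds a 2 in the column that `vocabulary["meeting"]` points to.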


In [93]:
# group the document indices by their label
from collections import defaultdict


def build_label_index(labels):
    label_index = defaultdict(list)
    for i, label in enumerate(labels):
        label_index[label].append(i)
    return label_index

label_index = build_label_index(labels)

In [94]:
# build the prior function

def get_prior(label_index):
    """
    objective: compute the prior probabilities, i.e. the probability
    of each class occurring
    :param label_index: dict(int:list) - class to index (location) mapping
    :return: dict(int:float) - class to prior probability 
    """
    total_count = sum(len(indices) for _, indices in label_index.items())
    return {label:len(indices)/total_count for label, indices in label_index.items()}

prior = get_prior(label_index)
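
To see what these two helpers produce, the same logic can be applied to the four labels of the small spam example from earlier. This is an illustrative sketch on toy data, written inline rather than calling the functions above so that it runs on its own.

```python
from collections import defaultdict

labels = [1, 0, 1, 1]  # toy labels: 1 = spam, 0 = ham

# label -> positions of the documents carrying that label
label_index = defaultdict(list)
for i, label in enumerate(labels):
    label_index[label].append(i)

# prior of a class = share of documents that carry its label
total = sum(len(indices) for indices in label_index.values())
prior = {label: len(indices) / total for label, indices in label_index.items()}

print(dict(label_index))  # {1: [0, 2, 3], 0: [1]}
print(prior)              # {1: 0.75, 0: 0.25}
```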

In [95]:
# build the likelihood function

import numpy as np


def get_likelihood(term_doc_matrix, label_index, smoothing=0):
    """
    objective: compute the likelihood, which is the probability of 
    a particular feature occurring conditionally on a class
    
    | class | f1       | f2       | . | . | . | fn       | 
    |-------|----------|----------|---|---|---|----------| 
    | c1    | P(f1|c1) | P(f2|c1) |   |   |   | P(fn|c1) | 
    | c2    | P(f1|c2) | P(f2|c2) |   |   |   | P(fn|c2) | 
    | .     |          |          |   |   |   |          | 
    | .     |          |          |   |   |   |          | 
    | .     |          |          |   |   |   |          | 
    | cn    | P(f1|cn) | P(f2|cn) |   |   |   | P(fn|cn) | 

    :param term_doc_matrix: sparse np matrix - as determined by some vectorizer
    :param label_index: dict(int:list) - class to index (location) mapping
    :param smoothing: int - value added to every feature count (Laplace smoothing)
    :return: dict(int:np.array(float)) - class to array of likelihoods P(feature|class)
    """
    likelihood = {}
    for label, indices in label_index.items():
        # index the rows of the term matrix that belong to a particular class. this returns a new submatrix
        # then sum each column (axis = 0), which gives the occurrences per feature
        # finally, add smoothing to each column to ensure non-zero multiplication (if applicable)
        likelihood[label] = np.asarray(term_doc_matrix[indices, :].sum(axis=0) + smoothing)[0]
        
        # compute the total count for the denominator, and perform element-wise division to compute
        # a single row of likelihoods
        total_count = likelihood[label].sum()
        likelihood[label] = likelihood[label] / float(total_count)
    return likelihood

likelihood = get_likelihood(term_docs, label_index, smoothing=1)

In [96]:
# confirm that there are 500 likelihoods (number of features specified)
assert(len(likelihood[0]) == 500)

# sample first 5 likelihoods for ham and spam class respectively
print(likelihood[0][0:5])
print(likelihood[1][0:5])

# which correspond to the following terms
print(cv.get_feature_names()[0:5])

[1.06413167e-03 9.38618703e-04 8.62219506e-04 8.29476993e-04
 9.82275386e-05]
[0.00105635 0.00137524 0.00442469 0.00051821 0.00408586]
['able', 'access', 'account', 'accounting', 'act']


In [97]:
# build posterior function
# - the posterior = likelihood * prior (by '=' here I mean proportionality)
# - this might cause an underflow issue because the likelihood is a product of hundreds of values <= 1
#   - to combat this we sum the log likelihoods, and later convert back to obtain the original probability
#     - posterior = prior * likelihood
#     - posterior = exp(log(prior * likelihood))
#     - posterior = exp(log(prior) + log(likelihood))
# 

def get_posterior(term_doc_matrix, prior, likelihood):
    """
    objective: computes the posterior based on prior and likelihood
    :param term_doc_matrix: sparse matrix - vectorized term frequencies by doc
    :param prior: dict(int, float) - mapping from class to prior probability
    :param likelihood: dict(int, np.array(float)) - mapping from class to 
    conditional likelihood probabilitites
    :return: [dict(int, float)] - posterior probabilities for each class per document
    """
    n_docs, posteriors = term_doc_matrix.shape[0], []
    labels = prior.keys()
    for i in range(n_docs):
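        # look into what features the current document has
        # and only consider non-zero counts
        # note: non_zero_counts is not used below, so each term present in the
        # document contributes once, regardless of how often it occurs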
        cur_doc = term_doc_matrix.getrow(i)
        non_zero_indices, non_zero_counts = cur_doc.indices, cur_doc.data
        posterior = {label:np.exp(np.log(prior[label]) 
                         + np.sum(np.log(likelihood[label][non_zero_indices]))) 
                     for label in labels}
        
        # now normalize the probabilities so that they sum to 1
        sum_posterior = sum(posterior.values())
        posteriors.append({label:_posterior/sum_posterior for label, _posterior in posterior.items()}) 
    return posteriors


# test with a spam and a ham email respectively
sample_test = [emails[0], emails[-1]]
print('\n\n'.join(sample_test), end='\n\n')

term_docs_test = cv.transform(sample_test)
posterior = get_posterior(term_docs_test, prior, likelihood)
print(posterior)

subject dobmeos with hgh my energy level ha gone up stukm introducing doctor formulated hgh human growth hormone also called hgh is referred to in medical science a the master hormone it is very plentiful when we are young but near the age of twenty one our body begin to produce le of it by the time we are forty nearly everyone is deficient in hgh and at eighty our production ha normally diminished at least advantage of hgh increased muscle strength loss in body fat increased bone density lower blood pressure quickens wound healing reduces cellulite improved vision wrinkle disappearance increased skin thickness texture increased energy level improved sleep and emotional stability improved memory and mental alertness increased sexual potency resistance to common illness strengthened heart muscle controlled cholesterol controlled mood swing new hair growth and color restore read more at this website unsubscribe

subject re tenaska iv i tried calling you this am but your phone rolled to s
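
For long documents, even `exp` of the summed logs can underflow to zero for every class, which makes the final normalization divide by zero. The sketch below shows one common remedy: normalize in log space by subtracting the largest log score first. It is a variant and not the `get_posterior` function above: it also weights each term by its count, as the multinomial model does. The function names, the two-term model, and the toy document are hypothetical.

```python
import math


def log_posterior(term_counts, log_prior, log_likelihood):
    """Unnormalized log posterior: log P(c) + sum_i count_i * log P(term_i | c)."""
    return log_prior + sum(count * log_likelihood[term]
                           for term, count in term_counts.items())


def normalize(log_scores):
    """Convert unnormalized log posteriors to probabilities without underflow."""
    max_log = max(log_scores.values())
    # subtracting the max makes the largest exponent exp(0) = 1
    shifted = {label: math.exp(value - max_log) for label, value in log_scores.items()}
    total = sum(shifted.values())
    return {label: value / total for label, value in shifted.items()}


# hypothetical two-term model and a long document
log_likelihood = {
    0: {"gas": math.log(0.02), "money": math.log(0.001)},
    1: {"gas": math.log(0.001), "money": math.log(0.02)},
}
log_prior = {0: math.log(0.7), 1: math.log(0.3)}
doc = {"gas": 150, "money": 150}

log_scores = {c: log_posterior(doc, log_prior[c], log_likelihood[c]) for c in (0, 1)}
print(math.exp(log_scores[0]), math.exp(log_scores[1]))  # both underflow to 0.0
probs = normalize(log_scores)
print(probs)
```

In this symmetric toy case the likelihood terms cancel between the two classes, so the normalized posterior equals the prior (about 0.7 and 0.3), even though both raw products underflow.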

### Testing Classifier Performance

In [98]:
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(emails, labels, test_size=0.33, random_state=42)
term_docs_train = cv.fit_transform(X_train)
label_index = build_label_index(y_train)

# the main observation here is that we compute our prior and likelihood using the training data
prior = get_prior(label_index)
likelihood = get_likelihood(term_docs_train, label_index, smoothing=1)

# and apply them to data the model has not seen before to measure its performance
term_docs_test = cv.transform(X_test)
posterior = get_posterior(term_docs_test, prior, likelihood)

def get_accuracy(posterior, true_labels, t=.5):
    """
    objective: classify samples as either ham or spam on the
    basis of a threshold against the posterior
    :param posterior: [dict(int, float)] - posterior probabilities for each class per document
    :param true_labels: [int] - 1 for spam, 0 for ham
    :param t: float - decision threshold on the spam probability
    :return: float - ratio of correct:total classifications 
    """
    # here there are two particular ways we can be correct
    # 1) the email is actually spam, and its spam probability is >= t
    # 2) the email is actually ham, and its spam probability is < t
    return np.mean([1 if (true==1 and pred[1] >= t) or (true==0 and pred[1] < t) else 0 
                    for pred, true in zip(posterior, true_labels)])


accuracy = get_accuracy(posterior, y_test, t=.5)
print(f'The accuracy on {len(y_test)} testing samples with threshold={.5} is {accuracy:.2f}.')

The accuracy on 1707 testing samples with threshold=0.5 is 0.92.




## Usage via scikit-learn API

In [101]:
from sklearn.naive_bayes import MultinomialNB
    

# alpha is also known as the smoothing parameter
# fit_prior is a boolean indicating whether class priors are learned from the training set
# (otherwise a uniform prior is used)
clf = MultinomialNB(alpha=1.0, fit_prior=True)

# train the classifier on X and y (data and output)
clf.fit(term_docs_train, y_train)

# now compute the posterior on the testing set with the prior and likelihood from the training set
# note that the default threshold is .5
prediction_prob = clf.predict_proba(term_docs_test)
prediction_prob[0:10]

array([[1.00000000e+00, 1.80494500e-10],
       [1.00000000e+00, 6.93842036e-75],
       [6.43054246e-01, 3.56945754e-01],
       [1.00000000e+00, 1.26282643e-12],
       [1.00000000e+00, 3.69207533e-12],
       [1.53290848e-04, 9.99846709e-01],
       [0.00000000e+00, 1.00000000e+00],
       [1.00000000e+00, 4.21663711e-19],
       [1.00000000e+00, 1.75639432e-13],
       [3.10923660e-01, 6.89076340e-01]])

In [102]:
# we can directly obtain the classifications with the predict method
# if the probability of class 1 is > .5 then class 1 is assigned, otherwise class 0 is assigned
prediction = clf.predict(term_docs_test)
prediction[0:10]

array([0, 0, 0, 0, 0, 1, 1, 0, 0, 1])

In [103]:
# and finally, we can measure accuracy of the classifier using the score method
accuracy = clf.score(term_docs_test, y_test)
print(f'The accuracy on {len(y_test)} testing samples using MultinomialNB with threshold={.5} is {accuracy:.2f}.')

The accuracy on 1707 testing samples using MultinomialNB with threshold=0.5 is 0.92.


### Evaluation

In [105]:
from sklearn.metrics import confusion_matrix

# provide the true labels, the predicted labels, and the label order;
# rows are true classes and columns are predicted classes, both ordered as in labels
confusion_matrix(y_test, prediction, labels=[0, 1])

array([[1098,   93],
       [  44,  472]], dtype=int64)

In [108]:
from sklearn.metrics import precision_score, recall_score, f1_score


# precision, recall, and f1-score
ps = precision_score(y_test, prediction, pos_label=1)
rs = recall_score(y_test, prediction, pos_label=1)
f1s = f1_score(y_test, prediction, pos_label=1)

print(f'Precision: {ps:.2f}, Recall: {rs:.2f}, F1-score: {f1s:.2f}')

Precision: 0.84, Recall: 0.91, F1-score: 0.87
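
The reported scores can be re-derived by hand from the confusion matrix printed above, treating spam (label 1) as the positive class. A short check, using only the four counts from that matrix:

```python
# confusion matrix from above: rows = true class, columns = predicted class
tn, fp = 1098, 93   # true ham:  predicted ham, predicted spam
fn, tp = 44, 472    # true spam: predicted ham, predicted spam

precision = tp / (tp + fp)   # of the emails flagged as spam, how many are spam
recall = tp / (tp + fn)      # of the spam emails, how many were flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, "
      f"F1-score: {f1:.2f}, Accuracy: {accuracy:.2f}")
```

The 93 ham emails flagged as spam are what pull precision below recall here.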


In [110]:
from sklearn.metrics import classification_report


# all three metrics in one report, broken down per label
print(classification_report(y_test, prediction))

             precision    recall  f1-score   support

          0       0.96      0.92      0.94      1191
          1       0.84      0.91      0.87       516

avg / total       0.92      0.92      0.92      1707



In [112]:
from sklearn.metrics import roc_auc_score


# area under the roc curve
pos_prob = prediction_prob[:, 1]
roc_auc_score(y_test, pos_prob)

0.9588711199630302
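
One way to read this number: the area under the ROC curve equals the probability that a randomly chosen spam email receives a higher spam score than a randomly chosen ham email. The sketch below computes it by brute force over all pairs. It is an illustration of that interpretation on hypothetical toy scores, not the algorithm scikit-learn uses, and the function name `roc_auc` is made up.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# hypothetical labels and predicted spam probabilities
auc_toy = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc_toy)  # 0.75
```

Three of the four (spam, ham) pairs are ordered correctly, hence 0.75. The pairwise loop is quadratic, so it is only meant for small examples.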

### Parameter Tuning for Optimal Performance


In [126]:
from sklearn.model_selection import StratifiedKFold
import itertools
import pandas as pd


# to evaluate each configuration, we are going to use k-fold cross validation
# with the area under the roc curve (auc) as the performance metric
k_fold = StratifiedKFold(n_splits=10)
emails_np, labels_np = np.array(emails), np.array(labels)

# before training and testing our classifier we subjectively chose the following options:
# - smoothing (offset counting parameter)
# - number of features (selected 500)
# - whether or not to use the prior as part of the multiplier for NB
max_features_option = [2000, 4000, 8000]
smoothing_factor_option = [0.5, 1.0, 1.5, 2.0]
fit_prior_option = [True, False]
all_options = [max_features_option, smoothing_factor_option, fit_prior_option]
auc_k_fold_record = []


# iterate through all the combinations of model features and record the results
for features_opt, smoothing_opt, prior_opt in list(itertools.product(*all_options)):
    # k-fold cross validation as our performance metric
    auc_temp = []
    for train_indices, test_indices in k_fold.split(emails_np, labels_np):
        # split
        X_train, X_test = emails_np[train_indices], emails_np[test_indices]
        y_train, y_test = labels_np[train_indices], labels_np[test_indices]
        
        # vectorize
        cv = CountVectorizer(stop_words="english", max_features=features_opt)
        term_docs_train = cv.fit_transform(X_train)
        term_docs_test = cv.transform(X_test)
        
        # train
        clf = MultinomialNB(alpha=smoothing_opt, fit_prior=prior_opt)
        clf.fit(term_docs_train, y_train)
        
        # test
        prediction_prob = clf.predict_proba(term_docs_test)
        pos_prob = prediction_prob[:, 1]
        auc = roc_auc_score(y_test, pos_prob)
        auc_temp.append(auc)
        
    auc_k_fold_record.append({'features_opt': features_opt, 'smoothing_opt': smoothing_opt, 
                              'prior_opt': prior_opt, 'k_fold_auc': np.mean(auc_temp)})


In [129]:
# results of the combination run
auc_records = pd.DataFrame(auc_k_fold_record)
max_auc_row = auc_records.loc[auc_records['k_fold_auc'].idxmax()]
max_auc_row

features_opt         8000
k_fold_auc       0.985681
prior_opt            True
smoothing_opt         0.5
Name: 16, dtype: object
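
The `Name: 16` in the output is the row label of the best configuration, which is its position in the order in which `itertools.product` generates the combinations. A quick stdlib check, reusing the three option lists from the search above:

```python
import itertools

max_features_option = [2000, 4000, 8000]
smoothing_factor_option = [0.5, 1.0, 1.5, 2.0]
fit_prior_option = [True, False]

# the last option list varies fastest, the first one slowest
grid = list(itertools.product(max_features_option,
                              smoothing_factor_option,
                              fit_prior_option))
print(len(grid))   # 24 combinations, each evaluated with 10 folds
print(grid[16])    # (8000, 0.5, True)
```

Position 16 is the first combination that uses 8000 features, which matches the winning row.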