John Nipp, Self-study, 6-15-2022

The first thing we'll do is handle the basic import statements and setup. Make sure to upload bank-additional-full.csv into the sample_data directory.

Project Link here:
https://docs.google.com/document/d/17d5zMHkYPojlAQB9qA7PZGNiofkSV_ekPjQmkTQ8KmE/edit#

In [None]:
import numpy as np
import pandas as pd
import sklearn

# Make it clear where the data is coming from
ze_dir = 'sample_data/bank-additional-full.csv'

# Importing the data in a way that I'm familiar with
raw_data = pd.read_csv(ze_dir, delimiter = ';')
raw_data.head()

# Printing summaries to give a first impression of the data
print(raw_data[['age', 'campaign', 'previous', 'pdays', 'emp.var.rate', 
                'cons.price.idx', 'cons.conf.idx', 'euribor3m', 
                'nr.employed']].describe())

               age      campaign      previous         pdays  emp.var.rate  \
count  41188.00000  41188.000000  41188.000000  41188.000000  41188.000000   
mean      40.02406      2.567593      0.172963    962.475454      0.081886   
std       10.42125      2.770014      0.494901    186.910907      1.570960   
min       17.00000      1.000000      0.000000      0.000000     -3.400000   
25%       32.00000      1.000000      0.000000    999.000000     -1.800000   
50%       38.00000      2.000000      0.000000    999.000000      1.100000   
75%       47.00000      3.000000      0.000000    999.000000      1.400000   
max       98.00000     56.000000      7.000000    999.000000      1.400000   

       cons.price.idx  cons.conf.idx     euribor3m   nr.employed  
count    41188.000000   41188.000000  41188.000000  41188.000000  
mean        93.575664     -40.502600      3.621291   5167.035911  
std          0.578840       4.628198      1.734447     72.251528  
min         92.201000     -50

Now we need to separate the data into a training set and a test set. It shouldn't be very hard; we'll use a well-known function from sklearn, train_test_split, to do the split.

I also went ahead and checked whether any numeric columns have negative minimums that we have to readjust for our Naive Bayes classifier. I will recenter those columns so their minimum is 0 by subtracting the minimum value from the entire column.

The project doesn't recommend separating the y values until after we split the data, but common practice is to separate them before; this way we don't have to write similar code twice.

We also need to get rid of duration, because the description says it won't help with creating a "realistic prediction model".

In [None]:
from sklearn.model_selection import train_test_split

# Dropping duration
raw_data = raw_data.drop('duration', axis = 1)

# Making the minimum of these two columns to be 0.
raw_data['cons.conf.idx'] -= min(raw_data['cons.conf.idx'])
raw_data['emp.var.rate'] -= min(raw_data['emp.var.rate'])

# Saving a copy for later in the project
raw_data_copy = raw_data.copy()

# Pulling out the y (dependent) variable
y_val = raw_data['y']

# Dropping the dependent variable from the rest of the data
raw_data = raw_data.drop('y', axis = 1)

# Separating the data between train data and test data.
# Note: (2021-10-25) is evaluated as integer subtraction, so the seed is 1986.
train_data, test_data, y_train, y_test = train_test_split( raw_data ,
                              y_val, test_size = .1, random_state=(2021-10-25))
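
The notes above talk about stratifying, but train_test_split only stratifies when its stratify argument is set; the call above makes a plain random split. A minimal sketch of a stratified split, assuming a small made-up frame (toy_X and toy_y are hypothetical stand-ins, not the bank data):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 90% 'no', 10% 'yes'
toy_X = pd.DataFrame({"feature": range(200)})
toy_y = pd.Series(["no"] * 180 + ["yes"] * 20)

# stratify=toy_y keeps the yes/no ratio the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.1, random_state=1986, stratify=toy_y
)

train_yes_share = (y_tr == "yes").mean()
test_yes_share = (y_te == "yes").mean()
print(train_yes_share, test_yes_share)
```

With an imbalanced target like y, this keeps the share of yes rows the same in the training and test sets.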

We are going to be using algorithms that have problems with collinear features, so we will pass drop_first=True to get_dummies.

We are going to end up with a one-hot-encoded data frame, where there are a lot of columns, each holding a 1 or a 0. We can do this directly, since our data is largely clean already.

In [None]:
# Applying the encoding. Later, I will encode the data before splitting
# the samples.
no_dummies_train = pd.get_dummies(train_data, drop_first=True)
no_dummies_test = pd.get_dummies(test_data, drop_first=True)

# Verifying my output is correct
no_dummies_train.head()

Unnamed: 0,age,campaign,pdays,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,job_blue-collar,...,month_may,month_nov,month_oct,month_sep,day_of_week_mon,day_of_week_thu,day_of_week_tue,day_of_week_wed,poutcome_nonexistent,poutcome_success
22975,56,5,999,0,4.8,93.444,14.7,4.965,5228.1,0,...,0,0,0,0,1,0,0,0,1,0
14746,26,1,999,0,4.8,93.918,8.1,4.957,5228.1,0,...,0,0,0,0,0,0,0,1,1,0
12505,38,1,999,0,4.8,93.918,8.1,4.96,5228.1,0,...,0,0,0,0,1,0,0,0,1,0
6801,33,2,999,0,4.5,93.994,14.4,4.857,5191.0,0,...,1,0,0,0,0,0,0,1,1,0
18389,29,3,999,0,4.8,93.918,8.1,4.968,5228.1,0,...,0,0,0,0,0,1,0,0,1,0
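
One thing to watch when get_dummies is called separately on the training and test frames: if a category shows up in only one of them, the two encoded frames end up with different columns. A small sketch of one way to line them up, assuming made-up frames (toy_train and toy_test are hypothetical):

```python
import pandas as pd

# Hypothetical toy splits: 'student' appears only in the training rows
toy_train = pd.DataFrame({"age": [30, 41, 25],
                          "job": ["admin", "student", "technician"]})
toy_test = pd.DataFrame({"age": [52, 37],
                         "job": ["admin", "technician"]})

enc_train = pd.get_dummies(toy_train, drop_first=True)
enc_test = pd.get_dummies(toy_test, drop_first=True)

# Reindex the test frame onto the training columns; missing dummies become 0
enc_test_aligned = enc_test.reindex(columns=enc_train.columns, fill_value=0)

print(list(enc_train.columns))
print(list(enc_test_aligned.columns))
```

Encoding the full frame before splitting, as is done later in this notebook, avoids the problem as well.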


Now we have to train our model, and score it!

Earlier, I shifted the numeric columns manually after checking what their minimum values were. I thought using MinMaxScaler would help anyway... then I found that it might nix the experiment that the professor had come up with.

In [None]:
# Importing the classifier
from sklearn.naive_bayes import CategoricalNB

# from sklearn.preprocessing import MinMaxScaler #fixed import

# scaler = MinMaxScaler()
# no_dummies_train = pd.DataFrame(scaler.fit_transform(no_dummies_train))
# no_dummies_test = pd.DataFrame(scaler.transform(no_dummies_test))

# Creating the model instance
model = CategoricalNB()

# Fitting the data
model.fit(no_dummies_train, y_train)

# Creating two scores, one for train set, other for test set.
score1 = model.score(no_dummies_train, y_train)
score2 = model.score(no_dummies_test, y_test)

Let's see those scores!

In [None]:
print("Categorical Naive Bayes after setting cons.conf.idx and emp.var.rate",
      "\nto 0 or above, and using get_dummies to hotOneEncode categorical",
      "variables:", "\n")
print("Score for train set:", score1, "\nScore for test set" , score2)

Categorical Naive Bayes after setting cons.conf.idx and emp.var.rate 
to 0 or above, and using get_dummies to hotOneEncode categorical variables: 

Score for train set: 0.8583182713318406 
Score for test set 0.85360524399126


It turns out using the minmaxScaler scored more than .02 higher, but for now, let's move on with what the professor suggested. 

My classifier is 85.36% accurate on the test set and 85.83% accurate on the training set.

The next thing we'll do is check how many unique values we have in age. We can use the pandas .value_counts() method for this.

In [None]:
print(raw_data['age'].value_counts())

31    1947
32    1846
33    1833
36    1780
35    1759
      ... 
89       2
91       2
94       1
87       1
95       1
Name: age, Length: 78, dtype: int64


There are 78 unique values in the age column, but as you can see, some values are far more common than others. The zero-frequency problem may skew the results, since some ages have very few rows. Combining ages into larger groups may help.

We are going to split them into bins of 10 years. I'm going to do it manually using a for loop, then see if there is a way to do it with a built-in function in pandas.

In [None]:
# age_decade = pd.DataFrame([], columns = ['age_decade'])
# for a in raw_data['age']:
#   if a < 11:
#     age_decade.append(pd.DataFrame([10], columns = ['age_decade']))
#   if a < 21:
#     age_decade.append(pd.DataFrame([20], columns = ['age_decade']))
#   if a < 31:
#     age_decade.append(pd.DataFrame([30], columns = ['age_decade']))
#   if a < 41:
#     age_decade.append(pd.DataFrame([40], columns = ['age_decade']))
#   if a < 51:
#     age_decade.append(pd.DataFrame([50], columns = ['age_decade']))
#   if a < 61:
#     age_decade.append(pd.DataFrame([60], columns = ['age_decade']))
#   if a < 71:
#     age_decade.append(pd.DataFrame([70], columns = ['age_decade']))
#   if a < 81:
#     age_decade.append(pd.DataFrame([80], columns = ['age_decade']))
#   if a < 91:
#     age_decade.append(pd.DataFrame([90], columns = ['age_decade']))
#   if a < 101:
#     age_decade.append(pd.DataFrame([100], columns = ['age_decade']))
#   if a >= 101:
#     age_decade.append(pd.DataFrame([101], columns = ['age_decade']))

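Given the combination of chained if statements and repeated pandas.DataFrame calls, this approach is horrifically inefficient. It also wouldn't work as written: DataFrame.append returns a new frame rather than modifying age_decade in place, and using if instead of elif means a single age matches several branches. I was in it to feel like a true Software Engineer, but we all know that, in reality, this isn't an efficient way. We are going to try using cut from the pandas package instead.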

I tested and tried like a good Data Scientist 😎.

In [None]:
# Group by decade using right-inclusive bins: label 10 covers ages 1-10,
# label 20 covers ages 11-20, etc.
raw_data['age_decade'] = pd.cut(raw_data['age'], [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
                                labels = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
print(raw_data['age_decade'])

# Double checking the value_counts changed as intended
print(raw_data['age_decade'].value_counts())

0        60
1        60
2        40
3        40
4        60
         ..
41183    80
41184    50
41185    60
41186    50
41187    80
Name: age_decade, Length: 41188, dtype: category
Categories (10, int64): [10 < 20 < 30 < 40 ... 70 < 80 < 90 < 100]
40     16385
50     10240
30      7243
60      6270
70       488
80       303
20       140
90       109
100       10
10         0
Name: age_decade, dtype: int64
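
By default pd.cut uses right-inclusive bins, which is why a 30-year-old lands under the label 30 together with ages 21-29 in the output above. A short sketch comparing that with left-inclusive bins (the ages series and the choice of labels here are made up for illustration):

```python
import pandas as pd

# Hypothetical sample ages, including values that sit on a bin edge
ages = pd.Series([17, 20, 29, 30, 98])
edges = list(range(0, 101, 10))  # 0, 10, ..., 100

# Default (right=True): bins are (0, 10], (10, 20], ... labelled by upper edge
right_closed = pd.cut(ages, edges, labels=edges[1:])

# right=False: bins are [0, 10), [10, 20), ... labelled by lower edge
left_closed = pd.cut(ages, edges, right=False, labels=edges[:-1])

print(right_closed.astype(int).tolist())
print(left_closed.astype(int).tolist())
```

Either convention works for the model; the point is only that edge ages such as 20 or 30 move between groups depending on which one is chosen.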


So far so good. Now let's try feeding the binned values to the model. I just copy-pasted from above.

In [None]:

# Made a new copy, since this data doesn't take up too much disk/RAM.
raw_data2 = raw_data.copy()

# Drop the age variable.
raw_data2 = raw_data2.drop('age', axis = 1)

# Split the data AGAIN, with new names to protect from confusion
train_data2, test_data2, y_train2, y_test2 = train_test_split( raw_data2 , y_val,
                                  test_size = .1, random_state=(2021-10-25))

# Doing that get_dummies song and dance, again!
no_dummies_train2 = pd.get_dummies(train_data2, drop_first=True)
no_dummies_test2 = pd.get_dummies(test_data2, drop_first=True)

# Creating a new NaiveBayes model
model2 = CategoricalNB()
model2.fit(no_dummies_train2, y_train2)

# Score 'em again!
score3 = model2.score(no_dummies_train2, y_train2)
score4 = model2.score(no_dummies_test2, y_test2)

print("Categorical Naive Bayes after setting cons.conf.idx and emp.var.rate",
      "\nto 0 or above, and using get_dummies to hotOneEncode categorical",
      "variables,\nand have changed integer age values to bins based on decade:",
      "\n")
print("Score on training data:", score3, "\nScore on testing data:" , score4)

Categorical Naive Bayes after setting cons.conf.idx and emp.var.rate 
to 0 or above, and using get_dummies to hotOneEncode categorical variables,
and have changed integer age values to bins based on decade: 

Score on training data: 0.858534085084572 
Score on testing data: 0.85360524399126


It looks like the training set scored a little bit better, and the test set scored exactly the same. I guess the change to the data meant next to nothing!

Next, we are going to try a KNeighborsClassifier. Oh boy, I hope it's better!

In [None]:
# Import KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier

# Create a KNeighborsClassifier instance
model3 = KNeighborsClassifier()

# Fit the model, using our first DataSet (age feature unaltered)
model3.fit(no_dummies_train, y_train)

# Score em' again
score5 = model3.score(no_dummies_train, y_train)
score6 = model3.score(no_dummies_test, y_test)

Let's check the score!

In [None]:
print("K Neighbors Classifier after setting cons.conf.idx and emp.var.rate",
      "\nto 0 or above, and using get_dummies to hotOneEncode categoricals.\n",)

print("Score 1:", score5, "\nScore 2:" , score6)

K Neighbors Classifier after setting cons.conf.idx and emp.var.rate 
to 0 or above, and using get_dummies to hotOneEncode categoricals.

Score 1: 0.9136744989074429 
Score 2: 0.8846807477543093


Yeah. The KNN classifier seems to be doing better without much touching. Let's see what the probability of having a certain y value is, and judge this against the score.

In [None]:
print(y_test.value_counts()['no']/ (y_test.value_counts()['yes']+ y_test.value_counts()['no']))

0.8851663025006069
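
The same baseline can be produced with sklearn's DummyClassifier, which always predicts the most frequent class. A minimal sketch on made-up labels (toy_X and toy_y are hypothetical, chosen to have roughly the same imbalance as above):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels with roughly the same imbalance as the bank data
toy_y = np.array(["no"] * 885 + ["yes"] * 115)
toy_X = np.zeros((len(toy_y), 1))  # features are ignored by this strategy

# Always predicts the majority class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(toy_X, toy_y)
baseline_accuracy = baseline.score(toy_X, toy_y)
print(baseline_accuracy)
```

Any real classifier should beat this number before its accuracy means much.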


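Houston, we have a problem. The KNN test score (88.47%) is actually slightly below the 88.52% we would get by classifying everything as no, so there might be bias toward classifying everything as no. Let's try making a confusion matrix! The ROC AUC score should also help show how well our classifier separates the two classes.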

In [None]:
# Importing the confusion_matrix metric, and the roc_auc_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score

# Creating prediction values by using the KNeighborsClassifier with
# original data.
model3_ytest = model3.predict(no_dummies_test)

# My first confusion matrix!
ze_matrix = confusion_matrix(y_test, model3_ytest)
tn, fp, fn, tp = ze_matrix.ravel()
print("True negative: ", tn, "\nFalse positive:", fp, "\nFalse negative",
      fn, "\nTrue positive:", tp)
print(ze_matrix)

# Tried calculating AUC ROC manually, then realized it requires calculus!!
# true_positive_rate = tp / (tp + fn)
# false_positive_rate = fp / (fp + tn)

# roc_auc_score uses binary np.array's for input, so need to create
# np.arrays to substitute. Will first create equal-size arrays full of
# false values.
the_answer3 = np.zeros_like(model3_ytest, dtype = np.dtype(bool))
the_original = np.zeros_like(y_test, dtype = np.dtype(bool))

# Creating analogous numpy arrays to the y-output arrays. They use 0's and
# 1's rather than yes's and no's, or True's and False's.
for i in range(len(model3_ytest)):
  if model3_ytest[i] == 'yes':
    the_answer3[i] = True
  else:
    the_answer3[i] = False
  if y_test.iloc[i] == 'yes':
    the_original[i] = True
  else:
    the_original[i] = False

score7 = roc_auc_score(the_original, the_answer3)
print("The ROC AUC score of the K Neighbors Classifier is:", score7)

True negative:  3517 
False positive: 129 
False negative 346 
True positive: 127
[[3517  129]
 [ 346  127]]
The ROC AUC score of the K Neighbors Classifier is: 0.6165588516013958
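
When roc_auc_score is given hard yes/no predictions, as it is here, the ROC curve has a single interior point and the area reduces to the average of the true positive rate and the true negative rate. A sketch that checks this against the counts printed above (the label arrays are rebuilt from those counts, and the comparison with == 'yes' is a vectorized alternative to the loop):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Counts copied from the KNN confusion matrix above
tn, fp, fn, tp = 3517, 129, 346, 127

tpr = tp / (tp + fn)  # true positive rate (recall)
fpr = fp / (fp + tn)  # false positive rate
auc_from_counts = (tpr + (1 - fpr)) / 2

# Rebuild label arrays with the same confusion matrix
y_true = np.array(["no"] * (tn + fp) + ["yes"] * (fn + tp))
y_pred = np.array(["no"] * tn + ["yes"] * fp + ["no"] * fn + ["yes"] * tp)

# Comparing against 'yes' gives the boolean arrays without a loop
auc_sklearn = roc_auc_score(y_true == "yes", y_pred == "yes")
print(auc_from_counts, auc_sklearn)
```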


Now we are going to check the ROC AUC score for our other two classifiers.

In [None]:
# For the Naive Bayes

# Create the prediction array
model_ytest = model.predict(no_dummies_test)

# Create another confusion matrix
ze_matrix2 = confusion_matrix(y_test, model_ytest)

# Print them nicely!
tn2, fp2, fn2, tp2 = ze_matrix2.ravel()
print("True negative: ", tn2, "\nFalse positive:", fp2, "\nFalse negative",
      fn2, "\nTrue positive:", tp2)
print(ze_matrix2)

# Creating the binary array needed to find the roc_auc_score
the_answer1 = np.zeros_like(model_ytest, dtype = np.dtype(bool))

# Creating analogous numpy arrays to the y-output array.
for i in range(len(model_ytest)):
  if model_ytest[i] == 'yes':
    the_answer1[i] = True
  else:
    the_answer1[i] = False

# Finding the roc_auc_score of the first classifier of this project
score8 = roc_auc_score(the_original, the_answer1)
print("The ROC AUC score of the first Naive Bayes model is:", score8)

True negative:  3262 
False positive: 384 
False negative 219 
True positive: 254
[[3262  384]
 [ 219  254]]
The ROC AUC score of the first Naive Bayes model is: 0.7158384931095387
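
ROC AUC is usually computed from a score or probability rather than from the final yes/no label, since the curve is traced by sweeping a threshold over that score. A hedged sketch of the difference, using synthetic data from make_classification rather than the bank dataset (all names below are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced stand-in data (not the bank dataset)
X, y = make_classification(n_samples=2000, n_features=8,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

knn = KNeighborsClassifier().fit(X_tr, y_tr)

# Column of predict_proba that corresponds to the positive class (label 1)
pos_col = list(knn.classes_).index(1)
auc_from_scores = roc_auc_score(y_te, knn.predict_proba(X_te)[:, pos_col])

# Same metric fed with hard 0/1 predictions, as in the cells above
auc_from_labels = roc_auc_score(y_te, knn.predict(X_te))
print(auc_from_scores, auc_from_labels)
```

The two numbers generally differ, so scores computed from hard labels should only be compared with other scores computed the same way.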


Now we need to do the same thing, but this time for our Naive Bayes classifier with age bins.

In [None]:
# For the Naive_Bayes with age_bins. 
# Quick note: we are using the SECOND Dataset

# Create the predictions array
model2_ytest = model2.predict(no_dummies_test2)

# Create another confusion_matrix
ze_matrix3 = confusion_matrix(y_test2, model2_ytest)

# Print the confusion matrix nicely
tn3, fp3, fn3, tp3 = ze_matrix3.ravel()
print("True negative: ", tn3, "\nFalse positive:", fp3, "\nFalse negative",
      fn3, "\nTrue positive:", tp3)
print(ze_matrix3)

# Creating the binary array needed to find the roc_auc_score
the_answer2 = np.zeros_like(model2_ytest, dtype = np.dtype(bool))
the_original2 = np.zeros_like(y_test2, dtype = np.dtype(bool))

# Creating analogous numpy arrays to the y-output arrays.
for i in range(len(model2_ytest)):
  if model2_ytest[i] == 'yes':
    the_answer2[i] = True
  else:
    the_answer2[i] = False
  if y_test2.iloc[i] == 'yes':
    the_original2[i] = True
  else:
    the_original2[i] = False

# Finding the roc_auc_score of the second classifier of this project
score9 = roc_auc_score(the_original2, the_answer2)
print("The ROC AUC score of the second Naive Bayes model is:",score9, 
      "\nNote: This data's been altered to use decade bins rather than age.")

True negative:  3261 
False positive: 385 
False negative 218 
True positive: 255
[[3261  385]
 [ 218  255]]
The ROC AUC score of the second Naive Bayes model is: 0.7167584389739284 
Note: This data's been altered to use decade bins rather than age.


It's hard for me to know what a good ROC AUC score is, but 0.5 corresponds to random guessing and 1.0 to a perfect classifier. Both Naive Bayes classifiers performed at about the same level, and I would assume a score around 0.72 isn't great for a classifier.

But the point that the KNeighbors classifier is biased toward saying negative all the time holds up: its ROC AUC score (0.62) is about 10 points beneath the other two.
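
Precision and recall for the yes class can be read straight off the confusion matrices printed above. A small sketch using those counts (the helper precision_recall is a hypothetical name introduced here):

```python
# Confusion-matrix counts copied from the matrices above: (tn, fp, fn, tp)
counts = {
    "KNN": (3517, 129, 346, 127),
    "Naive Bayes": (3262, 384, 219, 254),
    "Naive Bayes, age bins": (3261, 385, 218, 255),
}

def precision_recall(tn, fp, fn, tp):
    """Precision and recall for the positive ('yes') class."""
    return tp / (tp + fp), tp / (tp + fn)

summary = {name: precision_recall(*c) for name, c in counts.items()}
for name, (precision, recall) in summary.items():
    print(f"{name}: precision={precision:.3f}, recall={recall:.3f}")
```

On these counts the KNN model has the lowest recall of the three, which is the "says no too often" behavior described above.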

We are going to try to deal with this by oversampling to see if we can improve on that. Only around 11% of the rows are yes's; let's move this closer to 50% and then see if our models are any more accurate.

In [None]:
raw_data_yes = raw_data_copy[raw_data_copy['y'] == 'yes']
raw_data_no = raw_data_copy[raw_data_copy['y'] == 'no']
print(raw_data_yes.shape, raw_data_no.shape)

(4640, 20) (36548, 20)


I'm going to do random sampling with replacement for the yes values. I will sample as many rows as there are no's.

In [None]:
new_raw_data_yes = raw_data_yes.sample(n = 36548, replace = True, random_state = (2021-10-25))
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
raw_data_overSample = pd.concat([new_raw_data_yes, raw_data_no])
print(raw_data_overSample.shape)

(73096, 20)


Now we have an equal number of yes's and no's in our data. Next, we need to re-clean the data, create the models, score them, generate the confusion matrices, and take the ROC AUC score.
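
A caution on the order of operations: because the yes rows were duplicated before the train/test split, copies of the same row can end up in both sets, which tends to flatter the test score. A minimal sketch of the alternative, splitting first and oversampling only the training part (toy and the other names below are hypothetical, not the bank data):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: 180 'no' rows and 20 'yes' rows
toy = pd.DataFrame({"x": range(200), "y": ["no"] * 180 + ["yes"] * 20})

# 1) Split first, so the test rows are never duplicated into training
toy_train, toy_test = train_test_split(
    toy, test_size=0.1, random_state=1986, stratify=toy["y"]
)

# 2) Oversample the minority class inside the training split only
train_yes = toy_train[toy_train["y"] == "yes"]
train_no = toy_train[toy_train["y"] == "no"]
train_yes_over = train_yes.sample(n=len(train_no), replace=True,
                                  random_state=1986)
toy_train_balanced = pd.concat([train_yes_over, train_no])

print(toy_train_balanced["y"].value_counts())
```

The test set keeps its original class balance, so scores on it reflect how the model would do on unseen, imbalanced data.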

In [None]:
y_val = raw_data_overSample['y']
raw_data = raw_data_overSample.drop('y', axis = 1)
raw_data = pd.get_dummies(raw_data, drop_first=True)
train_data, test_data, y_train, y_test = train_test_split( raw_data ,
                              y_val, test_size = .1, random_state=(2021-10-25))

# no_dummies_train = pd.get_dummies(train_data, drop_first=True)
# no_dummies_test = pd.get_dummies(test_data, drop_first=True)

model3 = KNeighborsClassifier()
model3.fit(train_data, y_train)
score10 = model3.score(train_data, y_train)
score11 = model3.score(test_data, y_test)


print("Score 1:", score10, "\nScore 2:" , score11)



raw_data2 = raw_data_overSample.copy()
raw_data2['age_decade'] = pd.cut(raw_data2['age'], [0, 10, 20, 30, 40, 50, 60, 
                                                    70, 80, 90, 100], labels = 
                                [10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
raw_data2 = raw_data2.drop('age', axis = 1)
raw_data2 = pd.get_dummies( raw_data2, drop_first = True)
train_data2, test_data2, y_train2, y_test2 = train_test_split( raw_data2 , y_val,
                                  test_size = .1, random_state=(2021-10-25))

model2 = CategoricalNB()
model2.fit(train_data2, y_train2)
score12 = model2.score(train_data2, y_train2)
score13= model2.score(test_data2, y_test2)

print("Score 1:", score12, "\nScore 2:" , score13)


Score 1: 0.9151947222813365 
Score 2: 0.878796169630643
Score 1: 0.9872769282218101 
Score 2: 0.9886456908344733


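Not sure what the information is worth yet. The KNN scores are about where they were before, but the Naive Bayes scores are significantly higher. One caution: raw_data2 is copied from raw_data_overSample, which still contains the y column, so get_dummies turns the label into a y_yes feature for the Naive Bayes model. That leak probably explains most of the jump.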
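
The Naive Bayes frame in the cell above is copied from raw_data_overSample, which still carries the y column, so it is worth checking what get_dummies does with a label column that is left in. A small sketch on a made-up frame (toy and its columns are hypothetical):

```python
import pandas as pd

# Hypothetical toy frame that still contains the label column 'y'
toy = pd.DataFrame({
    "age": [30, 41, 25, 52],
    "marital": ["single", "married", "single", "married"],
    "y": ["no", "yes", "no", "yes"],
})

# Encoding with the label left in creates a 'y_yes' feature column
leaky_features = pd.get_dummies(toy, drop_first=True)

# Dropping the label first keeps it out of the feature matrix
clean_features = pd.get_dummies(toy.drop(columns="y"), drop_first=True)

print(list(leaky_features.columns))
print(list(clean_features.columns))
```

A quick look at the columns of the encoded frame before fitting is enough to catch this.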

In [None]:
model_ytest = model2.predict(test_data2)
# For the Naive Bayes
ze_matrix2 = confusion_matrix(y_test2, model_ytest)
tn2, fp2, fn2, tp2 = ze_matrix2.ravel()
print("True negative: ", tn2, "\nFalse positive:", fp2, "\nFalse negative",
      fn2, "\nTrue positive:", tp2)
print(ze_matrix2)


the_answer1 = np.zeros_like(model_ytest, dtype = np.dtype(bool))
the_original2 = np.zeros_like(y_test2, dtype = np.dtype(bool))

# Creating analogous numpy arrays to the y-output arrays. They use 0's and
# 1's rather than yes's and no's.
for i in range(len(model_ytest)):
  if model_ytest[i] == 'yes':
    the_answer1[i] = True
  else:
    the_answer1[i] = False
  if y_test2.iloc[i] == 'yes':
    the_original2[i] = True
  else:
    the_original2[i] = False

score14 = roc_auc_score(the_original2, the_answer1)
print(score14)

True negative:  3574 
False positive: 83 
False negative 0 
True positive: 3653
[[3574   83]
 [   0 3653]]
0.9886519004648618


That's a lot better

In [None]:
model2_ytest = model3.predict(test_data)
# For the KNN
ze_matrix3 = confusion_matrix(y_test, model2_ytest)
tn3, fp3, fn3, tp3 = ze_matrix3.ravel()
print("True negative: ", tn3, "\nFalse positive:", fp3, "\nFalse negative",
      fn3, "\nTrue positive:", tp3)
print(ze_matrix3)


the_answer2 = np.zeros_like(model2_ytest, dtype = np.dtype(bool))
the_original1 = np.zeros_like(y_test, dtype = np.dtype(bool))

# Creating analogous numpy arrays to the y-output arrays. They use 0's and
# 1's rather than yes's and no's.
for i in range(len(model2_ytest)):
  if model2_ytest[i] == 'yes':
    the_answer2[i] = True
  else:
    the_answer2[i] = False
  if y_test.iloc[i] == 'yes':
    the_original1[i] = True
  else:
    the_original1[i] = False

score9 = roc_auc_score(the_original1, the_answer2)
print(score9)

True negative:  2820 
False positive: 837 
False negative 49 
True positive: 3604
[[2820  837]
 [  49 3604]]
0.8788551196977683


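Based on the ROC AUC score, the Naive Bayes looks significantly more accurate, and balancing the data appears to have made a huge difference. Two caveats: the oversampling happened before the train/test split, so copies of the same yes rows can land in both sets, and the Naive Bayes features still included the y column. Both make these scores optimistic.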

I've been asked to try making a Gaussian Bayes Classifier using input variables described as “social and economic context attributes” in bank-additional-names.txt.

In [None]:
# Importing Naive Bayes Gaussian
from sklearn.naive_bayes import GaussianNB

# Creating new data set with columns/features specified above intact
raw_data = raw_data_copy[['emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 
                          'euribor3m', 'nr.employed', 'y']]

# Pulling out the y (dependent) variable
y_val = raw_data['y']

# Dropping the dependent variable from the rest of the data
raw_data = raw_data.drop('y', axis = 1)

# Separating the data between train data and test data
train_data, test_data, y_train, y_test = train_test_split( raw_data ,
                              y_val, test_size = .1, random_state=(2021-10-25))

# Model instantiation
model4 = GaussianNB()

# Fitting the model
model4.fit(train_data, y_train)

# Scoring the model based on training data and on test data.
score15 = model4.score(train_data, y_train)
score16 = model4.score(test_data, y_test)

# Displaying the scores
print("Gaussian Naive Bayes classifier using only the variables specified as",
      "\nsocial and economic context attributes.\n",)

print("Score based on training data:", score15, 
      "\nScore based on test data:" , score16)

Gaussian Naive Bayes classifier using only the variables specified as 
social and economic context attributes.

Score based on training data: 0.7199816558310178 
Score based on test data: 0.7193493566399611


So far so good, let's check the AUC and confusion matrix.

In [None]:
# Creating prediction values by using the GaussianNB classifier with
# data specified as social and economic context attributes.
model4_ytest = model4.predict(test_data)

# Another confusion matrix!
ze_matrix = confusion_matrix(y_test, model4_ytest)
tn, fp, fn, tp = ze_matrix.ravel()
print("True negative: ", tn, "\nFalse positive:", fp, "\nFalse negative",
      fn, "\nTrue positive:", tp)
print(ze_matrix)

# To create equal-size arrays full of false values.
the_answer4 = np.zeros_like(model4_ytest, dtype = np.dtype(bool))
the_original = np.zeros_like(y_test, dtype = np.dtype(bool))

# Creating analogous numpy arrays to the y-output arrays. They use 0's and
# 1's rather than yes's and no's, or True's and False's.
for i in range(len(model4_ytest)):
  if model4_ytest[i] == 'yes':
    the_answer4[i] = True
  else:
    the_answer4[i] = False
  if y_test.iloc[i] == 'yes':
    the_original[i] = True
  else:
    the_original[i] = False

score17 = roc_auc_score(the_original, the_answer4)
print("The ROC AUC score of the Gaussian Naive Bayes is:", score17)

True negative:  2634 
False positive: 1012 
False negative 144 
True positive: 329
[[2634 1012]
 [ 144  329]]
The ROC AUC score of the Gaussian Naive Bayes is: 0.7089978997517046


The Gaussian Naive Bayes is comparable to the Categorical Naive Bayes on the original dataset in terms of ROC AUC: its score is 0.709, a little less than 0.01 below the Categorical Naive Bayes models (about 0.716). Its accuracy, though, is much lower (about 72% versus 85%).

Now we are going to try oversampling again for this model and see how much better our Gaussian Naive Bayes classifier gets.

In [None]:
# Using our oversampled data_set from above, and selecting the categories that
# can be described as “social and economic context attributes” in 
# bank-additional-names.txt
raw_data_overSample = raw_data_overSample[['emp.var.rate', 'cons.price.idx', 
                                          'cons.conf.idx', 
                                          'euribor3m', 'nr.employed', 'y']]

# Drop the y-value and save it as a separate series
y_val = raw_data_overSample['y']
raw_data = raw_data_overSample.drop('y', axis = 1)

# Split the data as normal.
train_data, test_data, y_train, y_test = train_test_split( raw_data ,
                              y_val, test_size = .1, random_state=(2021-10-25))

# Model instantiation
model5 = GaussianNB()

# Fitting the model
model5.fit(train_data, y_train)

# Scoring the model based on training data and on test data.
score18 = model5.score(train_data, y_train)
score19 = model5.score(test_data, y_test)

# Displaying the scores
print("Gaussian Naive Bayes classifier using only the variables specified as",
      "\nsocial and economic context attributes, and using oversampling.\n",)

print("Score based on training data:", score18, 
      "\nScore based on test data:" , score19)

Gaussian Naive Bayes classifier using only the variables specified as 
social and economic context attributes, and using oversampling.

Score based on training data: 0.7174930836348159 
Score based on test data: 0.7108071135430917


Ok, well, I guess that our data isn't very good.

In [None]:
# Creating prediction values by using the GaussianNB classifier with
# data specified as social and economic context attributes.
model5_ytest = model5.predict(test_data)

# Another confusion matrix!
ze_matrix = confusion_matrix(y_test, model5_ytest)
tn, fp, fn, tp = ze_matrix.ravel()
print("True negative: ", tn, "\nFalse positive:", fp, "\nFalse negative",
      fn, "\nTrue positive:", tp)
print(ze_matrix)

# To create equal-size arrays full of false values.
the_answer5 = np.zeros_like(model5_ytest, dtype = np.dtype(bool))
the_original = np.zeros_like(y_test, dtype = np.dtype(bool))

# Creating analogous numpy arrays to the y-output arrays. They use 0's and
# 1's rather than yes's and no's, or True's and False's.
for i in range(len(model5_ytest)):
  if model5_ytest[i] == 'yes':
    the_answer5[i] = True
  else:
    the_answer5[i] = False
  if y_test.iloc[i] == 'yes':
    the_original[i] = True
  else:
    the_original[i] = False

score20 = roc_auc_score(the_original, the_answer5)
print("The ROC AUC score of the Gaussian Naive Bayes is:", score20)

True negative:  2619 
False positive: 1038 
False negative 1076 
True positive: 2577
[[2619 1038]
 [1076 2577]]
The ROC AUC score of the Gaussian Naive Bayes is: 0.7108041824322306


Now we can see that the AUC score improved only negligibly (0.709 to 0.711). The lesson of the story is that, when the data is good, picking the model that fits the data can make a big difference in how well your model works.

The results did not significantly improve with oversampling when using the Gaussian Naive Bayes.