# **A7 - Blackbox methods (ANN and SVM)**

Mitch Messier, October 26, 2023

# Table of Contents

1. **Task 1:** Set up, Data import, and Preparation
2. **Task 2:** ANN Models
3. **Task 3:** Add new hyperparameters, learning rate, activation and solver.
4. **Task 4:** Hyperparameter Optimization Using Gridsearch
5. **Task 5:** Support Vector Machine Classifier model
6. **Task 6:** Reflection

# Load Libraries

In [126]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)

from sklearn import metrics
from sklearn.metrics import (
    classification_report, confusion_matrix, recall_score, precision_score,
    f1_score, accuracy_score, make_scorer, precision_recall_fscore_support,
)

from sklearn.model_selection import train_test_split, cross_validate

from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

import warnings
warnings.filterwarnings('ignore')

# **Task 1:** Set up, Data import, and Preparation

In [127]:
data = pd.read_csv("https://raw.githubusercontent.com/matthewpecsok/4482_fall_2022/main/data/CD_additional_modified.csv")

In [128]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4117 entries, 0 to 4116
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             4117 non-null   int64  
 1   job             4117 non-null   object 
 2   marital         4117 non-null   object 
 3   education       4117 non-null   object 
 4   default         4117 non-null   object 
 5   housing         4117 non-null   object 
 6   loan            4117 non-null   object 
 7   contact         4117 non-null   object 
 8   month           4117 non-null   object 
 9   day_of_week     4117 non-null   object 
 10  duration        4117 non-null   int64  
 11  campaign        4117 non-null   int64  
 12  pdays           4117 non-null   int64  
 13  previous        4117 non-null   int64  
 14  poutcome        4117 non-null   object 
 15  emp_var_rate    4117 non-null   float64
 16  cons_price_idx  4117 non-null   float64
 17  cons_conf_idx   4117 non-null   f

In [129]:
data.describe()

Unnamed: 0,age,duration,campaign,pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed
count,4117.0,4117.0,4117.0,4117.0,4117.0,4117.0,4117.0,4117.0,4117.0,4117.0
mean,40.115375,256.850376,2.537042,960.403449,0.190187,0.085742,93.580131,-40.500947,3.621904,5166.496502
std,10.314847,254.749615,2.568668,191.967524,0.541765,1.562799,0.579061,4.593445,1.733448,73.670942
min,18.0,0.0,1.0,0.0,0.0,-3.4,92.201,-50.8,0.635,4963.6
25%,32.0,103.0,1.0,999.0,0.0,-1.8,93.075,-42.7,1.334,5099.1
50%,38.0,181.0,2.0,999.0,0.0,1.1,93.749,-41.8,4.857,5191.0
75%,47.0,317.0,3.0,999.0,0.0,1.4,93.994,-36.4,4.961,5228.1
max,88.0,3643.0,35.0,999.0,6.0,1.4,94.767,-26.9,5.045,5228.1


In [130]:
data.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,y
0,30,blue-collar,married,basic.9y,no,yes,no,cellular,may,fri,487,2,999,0,nonexistent,-1.8,92.893,-46.2,1.313,5099.1,no
1,39,services,single,high.school,no,no,no,telephone,may,fri,346,4,999,0,nonexistent,1.1,93.994,-36.4,4.855,5191.0,no
2,25,services,married,high.school,no,yes,no,telephone,jun,wed,227,1,999,0,nonexistent,1.4,94.465,-41.8,4.962,5228.1,no
3,38,services,married,basic.9y,no,unknown,unknown,telephone,jun,fri,17,3,999,0,nonexistent,1.4,94.465,-41.8,4.959,5228.1,no
4,47,admin.,married,university.degree,no,yes,no,cellular,nov,mon,58,1,999,0,nonexistent,-0.1,93.2,-42.0,4.191,5195.8,no


In [131]:
y_target = data.pop('y')

In [132]:
y_target = y_target.eq('yes').mul(1)

In [133]:
y_target.head(7)

0    0
1    0
2    0
3    0
4    0
5    0
6    0
Name: y, dtype: int64

In [134]:
data['duration'] = pd.cut(data['duration'], bins=[0, 100, 300, 600, float('inf')], labels=['Short', 'Medium', 'Long', 'Very Long'])

data['campaign'] = pd.cut(data['campaign'], bins=[0, 1, 3, 5, float('inf')], labels=['Low', 'Medium', 'High', 'Very High'])

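# In this dataset pdays = 999 marks clients with no previous contact, so those rows fall in the last bin; pdays = 0 means contact on the same day.
data['pdays'] = pd.cut(data['pdays'], bins=[-1, 0, 7, 30, float('inf')], labels=['Contacted Same Day', 'Contacted Within a Week', 'Contacted Within a Month', 'Not Previously Contacted (999) or Over a Month'])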

data['previous'] = pd.cut(data['previous'], bins=[0, 5, 10, 15, float('inf')], labels=['0-5 Previous Contacts', '6-10 Previous Contacts', '11-15 Previous Contacts', 'More Than 15 Previous Contacts'])
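One thing worth checking in the binning cell above: `pd.cut` builds right-closed intervals by default, so the lowest bin edge is excluded. With `bins=[0, ...]`, a value of exactly 0 falls outside every bin and becomes NaN, which `pd.get_dummies` then encodes as all zeros. The summary statistics show 0 as the minimum of `duration` and as the median of `previous`, so this probably affects many rows. The sketch below uses a small made-up series (not the assignment data) to show the default behavior next to one possible adjustment, `include_lowest=True`; the shortened labels are illustrative only.

```python
import pandas as pd

# Hypothetical toy values, chosen only to illustrate the bin edges.
previous = pd.Series([0, 0, 1, 6])
bins = [0, 5, 10, 15, float('inf')]
labels = ['0-5', '6-10', '11-15', '15+']

# Default: intervals are (0, 5], (5, 10], ... so 0 is not in any bin.
default_cut = pd.cut(previous, bins=bins, labels=labels)

# include_lowest=True makes the first interval include its left edge.
inclusive_cut = pd.cut(previous, bins=bins, labels=labels, include_lowest=True)

print("unbinned with default:", default_cut.isna().sum())
print("unbinned with include_lowest:", inclusive_cut.isna().sum())
```

Whether zeros should join the first bin or get a category of their own is a modeling decision; the point is only that they should not silently drop out.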


In [135]:
data_encoded = pd.get_dummies(data)

# **Task 2:** ANN Models

Model 1: 1 hidden layer with 7 neurons

In [136]:
model_1 = MLPClassifier(random_state=2021,hidden_layer_sizes=(7,)).fit(data_encoded, y_target)
print("hidden layers sizes",model_1.hidden_layer_sizes)
print("n_layers_",model_1.n_layers_)

hidden layers sizes (7,)
n_layers_ 3


In [137]:
model_1_cv_results = pd.DataFrame(cross_validate(model_1,
               data_encoded,
               y_target,
               cv = 3,
               return_train_score=True,
               scoring=['accuracy','recall','precision','f1']))

model_1_cv_results.mean()

fit_time           0.401233
score_time         0.013179
test_accuracy      0.899197
train_accuracy     0.900414
test_recall        0.130537
train_recall       0.146530
test_precision     0.779693
train_precision    0.813204
test_f1            0.205293
train_f1           0.220265
dtype: float64

Model 2: 2 hidden layers each with 30 neurons

In [138]:
model_2 = MLPClassifier(random_state=2021,hidden_layer_sizes=(30,30)).fit(data_encoded, y_target)
print("hidden layers sizes",model_2.hidden_layer_sizes)
print("n_layers_",model_2.n_layers_)

hidden layers sizes (30, 30)
n_layers_ 4


In [139]:
model_2_cv_results = pd.DataFrame(cross_validate(model_2,
               data_encoded,
               y_target,
               cv = 3,
               return_train_score=True,
               scoring=['accuracy','recall','precision','f1']))

model_2_cv_results.mean()

fit_time           0.570011
score_time         0.013355
test_accuracy      0.890454
train_accuracy     0.892761
test_recall        0.153333
train_recall       0.157254
test_precision     0.333333
train_precision    0.378205
test_f1            0.165935
train_f1           0.171717
dtype: float64

Model 3: 3 hidden layers, the first with 25 neurons and the second and third with 10 neurons (note: the cell below was run with 24 neurons in the first layer)

In [140]:
model_3 = MLPClassifier(random_state=2021,hidden_layer_sizes=(24,10,10)).fit(data_encoded, y_target)
print("hidden layers sizes",model_3.hidden_layer_sizes)
print("n_layers_",model_3.n_layers_)

hidden layers sizes (24, 10, 10)
n_layers_ 5


In [141]:
model_3_cv_results = pd.DataFrame(cross_validate(model_3,
               data_encoded,
               y_target,
               cv = 3,
               return_train_score=True,
               scoring=['accuracy','recall','precision','f1']))

model_3_cv_results.mean()

fit_time           1.464003
score_time         0.008066
test_accuracy      0.894583
train_accuracy     0.895192
test_recall        0.077351
train_recall       0.102189
test_precision     0.539007
train_precision    0.425482
test_f1            0.123284
train_f1           0.147431
dtype: float64

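Model 4: 2 hidden layers each with 20 neurons (note: the cell below was run with `hidden_layer_sizes=(24,10,10)`, the same as Model 3, so its output repeats Model 3's results)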

In [142]:
model_4 = MLPClassifier(random_state=2021,hidden_layer_sizes=(24,10,10)).fit(data_encoded, y_target)
print("hidden layers sizes",model_4.hidden_layer_sizes)
print("n_layers_",model_4.n_layers_)

hidden layers sizes (24, 10, 10)
n_layers_ 5


In [143]:
model_4_cv_results = pd.DataFrame(cross_validate(model_4, data_encoded, y_target, cv = 3, return_train_score=True,
                                                 scoring=['accuracy','recall','precision','f1']))
model_4_cv_results.mean()

fit_time           1.953095
score_time         0.010734
test_accuracy      0.894583
train_accuracy     0.895192
test_recall        0.077351
train_recall       0.102189
test_precision     0.539007
train_precision    0.425482
test_f1            0.123284
train_f1           0.147431
dtype: float64

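Model 5: 2 hidden layers each with 10 neurons (note: the cell below was run with `hidden_layer_sizes=(24,10,10)`, the same as Model 3, so its output repeats Model 3's results)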

In [144]:
model_5 = MLPClassifier(random_state=2021,hidden_layer_sizes=(24,10,10)).fit(data_encoded, y_target)
print("hidden layers sizes",model_5.hidden_layer_sizes)
print("n_layers_",model_5.n_layers_)

hidden layers sizes (24, 10, 10)
n_layers_ 5


In [145]:
model_5_cv_results = pd.DataFrame(cross_validate(model_5, data_encoded, y_target, cv = 3, return_train_score=True,
                                                 scoring=['accuracy','recall','precision','f1']))
model_5_cv_results.mean()

fit_time           1.425398
score_time         0.010341
test_accuracy      0.894583
train_accuracy     0.895192
test_recall        0.077351
train_recall       0.102189
test_precision     0.539007
train_precision    0.425482
test_f1            0.123284
train_f1           0.147431
dtype: float64

Model 1: 1 hidden layer with 7 neurons
Model 1 had the highest mean test accuracy of the models (0.899) but low recall (0.131), so it misses most positive cases. Precision was high (0.780): when it predicts a positive, it is usually right. The resulting F1 of 0.205 is low in absolute terms but is still the best of the models tested.

Model 2: 2 hidden layers, each with 30 neurons
Model 2 had slightly lower accuracy (0.890) than Model 1. Recall was marginally higher (0.153), but precision dropped sharply to 0.333, giving a lower F1 of 0.166.

Model 3: 3 hidden layers, run with 24 neurons in the first and 10 in the second and third
Model 3 reached an accuracy of 0.895, between Models 1 and 2. Precision (0.539) was better than Model 2's, but recall fell to 0.077 and F1 to 0.123, the lowest of the three architectures.

Model 4: 2 hidden layers, each with 20 neurons
The Model 4 cell was fit with the same `hidden_layer_sizes=(24,10,10)` and `random_state` as Model 3, so its scores are identical to Model 3's and say nothing about a (20, 20) network.

Model 5: 2 hidden layers, each with 10 neurons
The Model 5 cell also reused `hidden_layer_sizes=(24,10,10)`, so its scores again duplicate Model 3's rather than describing a (10, 10) network.

Determining the Best Model:
Only three distinct architectures were actually evaluated. Among them, Model 1 has the highest test accuracy (0.899), precision (0.780), and F1 (0.205), and it is also the fastest to fit. Model 2 has the highest recall (0.153). Train and test scores are close for every model, so overfitting is not the main problem; the problem is that recall stays below 0.16 throughout. On these results Model 1 is the best option, and Models 4 and 5 need to be rerun with their intended layer sizes before they can be compared.
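To make the comparison easier to check, the mean test scores can be placed side by side and the best model picked per metric. The sketch below copies the values from the `cross_validate` outputs above; the dictionary layout and the names `cv_means` and `best_by_metric` are illustrative choices, not part of the original notebook. Models 4 and 5 are left out because their cells reran Model 3's configuration.

```python
# Mean 3-fold test scores copied from the cross_validate outputs above.
cv_means = {
    "model_1 (7,)": {
        "accuracy": 0.899197, "recall": 0.130537,
        "precision": 0.779693, "f1": 0.205293,
    },
    "model_2 (30, 30)": {
        "accuracy": 0.890454, "recall": 0.153333,
        "precision": 0.333333, "f1": 0.165935,
    },
    "model_3 (24, 10, 10)": {
        "accuracy": 0.894583, "recall": 0.077351,
        "precision": 0.539007, "f1": 0.123284,
    },
}

metrics_to_compare = ["accuracy", "recall", "precision", "f1"]

# For each metric, find the model with the highest mean test score.
best_by_metric = {
    metric: max(cv_means, key=lambda name: cv_means[name][metric])
    for metric in metrics_to_compare
}

for metric, name in best_by_metric.items():
    print(f"{metric:<10} best: {name} ({cv_means[name][metric]:.3f})")
```

With a positive class this rare, F1 or recall is usually a more informative selection criterion than accuracy.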

# **Task 3:** Add new hyperparameters, learning rate, activation and solver.

In [146]:
model1 = MLPClassifier(random_state=2021, hidden_layer_sizes=(24,10,10),learning_rate= "constant",activation= "relu",solver= "sgd")
model2 = MLPClassifier(random_state=2021, hidden_layer_sizes=(24,10,10),learning_rate= "adaptive",activation= "logistic",solver= "adam")

In [147]:
model1_cv_results = pd.DataFrame(cross_validate(model1, data_encoded, y_target, cv = 3, return_train_score=True,
                                                 scoring=['accuracy','recall','precision','f1']))
model1_cv_results.mean()

fit_time           0.162197
score_time         0.008192
test_accuracy      0.890454
train_accuracy     0.890454
test_recall        0.000000
train_recall       0.000000
test_precision     0.000000
train_precision    0.000000
test_f1            0.000000
train_f1           0.000000
dtype: float64

In [148]:
model2_cv_results = pd.DataFrame(cross_validate(model2, data_encoded, y_target, cv = 3, return_train_score=True,
                                                 scoring=['accuracy','recall','precision','f1']))
model2_cv_results.mean()

fit_time           0.909398
score_time         0.016330
test_accuracy      0.890454
train_accuracy     0.890454
test_recall        0.000000
train_recall       0.000000
test_precision     0.000000
train_precision    0.000000
test_f1            0.000000
train_f1           0.000000
dtype: float64

**Reflection**: `model1` (`relu` activation, `sgd` solver, constant learning rate) did not perform well. Its mean test accuracy was 0.890454, but recall, precision, and F1 were all 0, which means it never predicted a positive case in any fold.

`model2` (`logistic` activation, `adam` solver) produced exactly the same scores: an accuracy of 0.890454 and 0 for recall, precision, and F1. Because `learning_rate` is only used by the `sgd` solver, the `adaptive` setting has no effect with `adam`, so the effective change from the defaults is the `logistic` activation.

In summary, neither configuration improved on the default settings used in Task 2, where the same (24,10,10) architecture reached a test F1 of 0.123. Both models collapsed to predicting the majority class, so their accuracy only reflects the share of negative cases in the data (about 89%). Further tuning, such as scaling the numeric inputs or exploring other architectures, may lead to better results.
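An accuracy of 0.890454 together with a recall of 0 is what a classifier produces when it predicts the negative class for every row. The sketch below illustrates this with made-up labels in roughly the same 89/11 proportion; `binary_metrics` is a small helper written for this example, not a scikit-learn function, and it returns 0 where a ratio is undefined.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Undefined ratios (no predicted or no actual positives) are reported as 0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: 89 negatives and 11 positives.
y_true = [0] * 89 + [1] * 11
# A "model" that always predicts the majority (negative) class.
y_pred = [0] * 100

baseline = binary_metrics(y_true, y_pred)
print(baseline)
```

Any model worth keeping should beat this baseline on recall and F1, not just match it on accuracy.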

# **Task 4:** Hyperparameter Optimization Using Gridsearch

In [149]:
from sklearn.model_selection import GridSearchCV
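parameters = {'hidden_layer_sizes':[(),
                                    (25,),
                                    (50,),
                                    (75,),
                                    (100,),
                                    (25,25),
                                    (50,50),
                                    (75,75),
                                    (100,100),
                                    ],
              'learning_rate':['constant', 'adaptive'],
              'activation':['identity', 'tanh', 'relu'],
              # NOTE: 'relu' is an activation, not a solver (valid solvers are
              # 'lbfgs', 'sgd', and 'adam'). Fits with solver='relu' fail and are
              # scored NaN, which produces the NaN rows in the results below.
              'solver':['relu', 'lbfgs']
              }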
mlp = MLPClassifier(random_state=2021)
clf = GridSearchCV(mlp, parameters,scoring='accuracy',return_train_score=True,cv=3)
clf.fit(data_encoded, y_target)

In [150]:
grid_search_df = pd.DataFrame(clf.cv_results_)
grid_search_df.sort_values('rank_test_score')[['rank_test_score','param_hidden_layer_sizes','mean_test_score','mean_train_score']]

Unnamed: 0,rank_test_score,param_hidden_layer_sizes,mean_test_score,mean_train_score
5,1,"(25,)",0.902114,0.903085
7,1,"(25,)",0.902114,0.903085
93,3,"(25, 25)",0.899198,0.898590
95,3,"(25, 25)",0.899198,0.898590
9,5,"(50,)",0.896285,0.897013
...,...,...,...,...
20,55,"(25, 25)",,
76,55,"(25,)",,
34,55,"(100, 100)",,
66,55,"(75, 75)",,
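The NaN rows ranked 55 in the table above can be traced back to the parameter grid: `'relu'` is listed under `solver`, but `MLPClassifier` only accepts `'lbfgs'`, `'sgd'`, and `'adam'` there. The sketch below enumerates the grid in plain Python and counts the affected combinations. It assumes the combinations are enumerated with the parameter names sorted alphabetically and the last name varying fastest, which is consistent with the row indices shown in the table.

```python
from itertools import product

# Solvers accepted by MLPClassifier.
VALID_SOLVERS = {'lbfgs', 'sgd', 'adam'}

# The grid exactly as defined in the cell above.
parameters = {
    'hidden_layer_sizes': [(), (25,), (50,), (75,), (100,),
                           (25, 25), (50, 50), (75, 75), (100, 100)],
    'learning_rate': ['constant', 'adaptive'],
    'activation': ['identity', 'tanh', 'relu'],
    'solver': ['relu', 'lbfgs'],
}

# Enumerate every combination, with keys in alphabetical order.
keys = sorted(parameters)
combos = [dict(zip(keys, values))
          for values in product(*(parameters[k] for k in keys))]

# Combinations whose solver value is not accepted cannot be fit.
failing = [c for c in combos if c['solver'] not in VALID_SOLVERS]

print("total combinations:", len(combos))
print("combinations with an invalid solver:", len(failing))
print("row 20:", combos[20])
```

Half of the grid never trains, so the 54 valid fits take ranks 1 to 54 and every failed fit shares rank 55. Every score in the table therefore comes from `lbfgs`. Replacing `'relu'` with `'adam'` or `'sgd'` would be one way to search two real solvers, though which one was intended is a guess.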


# **Task 5:** Support Vector Machine Classifier model

In [151]:
parameters = {'C':[0.1, 1, 10, 100],
              'kernel':['linear','rbf','poly','sigmoid']
              }
svc = SVC(random_state=42)
clf = GridSearchCV(svc, parameters,scoring='f1')
clf.fit(data_encoded, y_target)

clf.score(data_encoded, y_target)

0.4604519774011299

In [152]:
clf.best_estimator_

In [153]:
grid_search_df = pd.DataFrame(clf.cv_results_)
print(grid_search_df.shape)
grid_search_df.sort_values('mean_test_score',ascending=False).head()

(16, 15)


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_C,param_kernel,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
8,0.239417,0.038648,0.013235,0.000412,10.0,linear,"{'C': 10, 'kernel': 'linear'}",0.469388,0.336,0.486772,0.447205,0.435294,0.434932,0.05256,1
4,0.194615,0.027747,0.014097,0.00058,1.0,linear,"{'C': 1, 'kernel': 'linear'}",0.459893,0.377049,0.430769,0.409449,0.453901,0.426212,0.030397,2
12,0.228871,0.060179,0.013934,0.000757,100.0,linear,"{'C': 100, 'kernel': 'linear'}",0.446809,0.316667,0.412214,0.45283,0.467836,0.419271,0.054444,3
0,0.195248,0.019131,0.017499,0.001423,0.1,linear,"{'C': 0.1, 'kernel': 'linear'}",0.434783,0.290598,0.393701,0.310345,0.442748,0.374435,0.062954,4
1,0.113063,0.003701,0.04134,0.001012,0.1,rbf,"{'C': 0.1, 'kernel': 'rbf'}",0.0,0.0,0.0,0.0,0.0,0.0,0.0,5


# **Task 6:** Reflection

Model 1 used a single hidden layer with seven neurons. It achieved a mean test accuracy of 89.92% and high precision (0.780), but recall was only 0.131 and F1 was 0.205, so it missed most positive cases.

Model 2 used two hidden layers, each with 30 neurons. Its accuracy (89.05%) was slightly lower than Model 1's. Recall was marginally higher (0.153), but precision fell to 0.333, giving an F1 of 0.166.

Model 3 used three hidden layers, run with 24, 10, and 10 neurons. Accuracy was 89.46% and precision was 0.539, but recall dropped to 0.077 and F1 to 0.123.

Model 4 was meant to have two hidden layers with 20 neurons, but its cell reused Model 3's configuration, so its scores are a rerun of Model 3.

Model 5 was meant to have two hidden layers with 10 neurons, but its cell also reused Model 3's configuration, so its identical scores are not evidence of consistent performance.

The most successful network in Task 2 was Model 1, which had the highest accuracy, precision, and F1. The two Task 3 configurations collapsed to predicting the majority class (F1 of 0). The grid search in Task 4 reached a slightly higher test accuracy of 0.902 with a single hidden layer of 25 neurons. In Task 5, the best SVM (linear kernel, C = 10) reached a cross-validated F1 of 0.435, roughly double the best ANN F1, which makes it the strongest model here at identifying positive cases.

To potentially improve on the success of these models, further exploration of hyperparameters, such as the number of hidden layers, the number of neurons in each layer, and learning rates, could be beneficial. Additionally, feature engineering, data preprocessing, and regularization techniques can play a crucial role in enhancing model performance. Moreover, trying more advanced neural network architectures or ensemble methods might lead to further improvements. It is essential to consider the trade-offs between computational complexity and model performance when making such decisions.
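As one concrete form of the preprocessing mentioned above, the numeric inputs could be standardized before they reach the network. The sketch below is a minimal illustration under stated assumptions, not a result from this assignment: it uses a synthetic dataset from `make_classification` as a stand-in for `data_encoded` and `y_target`, and the names `pipeline`, `X`, and `y` are introduced here for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the assignment data (roughly 89% negatives).
X, y = make_classification(n_samples=400, n_features=10,
                           weights=[0.89, 0.11], random_state=2021)

# Putting the scaler in a pipeline means it is re-fit on each training fold,
# so no statistics from the held-out fold leak into training.
pipeline = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(7,), max_iter=500, random_state=2021),
)

scores = cross_validate(pipeline, X, y, cv=3,
                        scoring=['accuracy', 'recall', 'precision', 'f1'])

# Average each test metric across the three folds.
mean_scores = {name: values.mean()
               for name, values in scores.items() if name.startswith('test_')}
print(mean_scores)
```

The same pipeline object can be passed to `GridSearchCV`, with parameter names prefixed by the step name (for example `mlpclassifier__hidden_layer_sizes`).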

# Convert to HTML

In [154]:
# Mount Google Drive so that files saved there (including this notebook)
# can be accessed from Colab.
# This must be repeated each time the session expires.
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [157]:
# Copy the ipynb to the local working directory
# Replace 'A7_Messier_Mitch.ipynb' with your actual file name if it's different
!cp "/content/drive/MyDrive/A7_Messier_Mitch.ipynb" ./

# Create an HTML file from the ipynb
!jupyter nbconvert --to html "A7_Messier_Mitch.ipynb"


[NbConvertApp] Converting notebook A7_Messier_Mitch.ipynb to html
[NbConvertApp] Writing 704175 bytes to A7_Messier_Mitch.html
