In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
import time
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# Practical Application III: Comparing Classifiers

**Overview**: In this practical application, my goal is to compare the performance of classifiers, namely K Nearest Neighbor, Logistic Regression, Decision Trees, and Support Vector Machines.  I will utilize a dataset related to marketing bank products over the telephone.  



## Getting Started

My dataset comes from the UCI Machine Learning repository [link](https://archive.ics.uci.edu/ml/datasets/bank+marketing).  The data is from a Portuguese banking institution and is a collection of the results of multiple marketing campaigns.  I will make use of the article accompanying the dataset [here](CRISP-DM-BANK.pdf) for more information on the data and features.



This data is the result of data collection over 17 marketing campaigns. I will start by reading the data into a dataframe and looking at some descriptive output to get a sense of its contents.


In [3]:
df = pd.read_csv('data/bank-additional-full.csv', sep = ';')

In [4]:
df.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             41188 non-null  object 
 2   marital         41188 non-null  object 
 3   education       41188 non-null  object 
 4   default         41188 non-null  object 
 5   housing         41188 non-null  object 
 6   loan            41188 non-null  object 
 7   contact         41188 non-null  object 
 8   month           41188 non-null  object 
 9   day_of_week     41188 non-null  object 
 10  duration        41188 non-null  int64  
 11  campaign        41188 non-null  int64  
 12  pdays           41188 non-null  int64  
 13  previous        41188 non-null  int64  
 14  poutcome        41188 non-null  object 
 15  emp.var.rate    41188 non-null  float64
 16  cons.price.idx  41188 non-null  float64
 17  cons.conf.idx   41188 non-null 

The data looks to be already clean, but the default column in the first rows shows what seems to be a binary yes/no feature with an additional 'unknown' value, which may make the data less complete than it seems. The info output only counts explicitly null cells, so these 'unknown' values may indicate a large and unseen amount of missing data.
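One way to size this hidden gap is to count the placeholder string directly. The following is a minimal sketch on a made-up toy frame (the `toy` name and its values are assumptions for illustration, not rows from the bank data); the same expression could be applied to `df`.

```python
import pandas as pd

# Toy stand-in for the bank data; the values are made up for illustration.
toy = pd.DataFrame({
    'job': ['housemaid', 'services', 'unknown', 'admin.'],
    'default': ['no', 'unknown', 'no', 'unknown'],
    'loan': ['no', 'no', 'yes', 'no'],
})

# Count the placeholder string 'unknown' in every column.
unknown_counts = (toy == 'unknown').sum()

# Express the counts as a share of all rows.
unknown_share = unknown_counts / len(toy)

print(unknown_counts)
print(unknown_share)
```

Columns with a large share of 'unknown' values are the ones where dropping rows costs the most data.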

### Understanding the Features


The data provides a dictionary for better understanding of the features. For the purposes of this project, not every column will be used. I will only be using the first 7 features for my classifiers and the 'y' column as my target feature.

```
Input variables:
# bank client data:
1 - age (numeric)
2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5 - default: has credit in default? (categorical: 'no','yes','unknown')
6 - housing: has housing loan? (categorical: 'no','yes','unknown')
7 - loan: has personal loan? (categorical: 'no','yes','unknown')
# related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: 'cellular','telephone')
9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
# other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
# social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):
21 - y - has the client subscribed a term deposit? (binary: 'yes','no')
```



Looking at the data definitions, these 'unknown' values should be considered missing. I will remove the rows that contain an 'unknown' value in any of the columns I use. If a significant share of the rows is affected, I will instead consider a threshold of missing values per row before removing it.

The analysis does not require all the features in the dataset, so I will only drop rows that have 'unknown' values in the features I will be using.

### Remove the Missing Values

In [39]:
# Create new_df containing only the rows without 'unknown' values.
# Note that this is only done for the first 7 columns, as those are the ones that I will use.
# 'age' is numeric, so it cannot contain 'unknown' and needs no check.
new_df = df[(df.job != 'unknown') & (df.marital != 'unknown') & (df.education != 'unknown') & (df.default != 'unknown') & (df.housing != 'unknown') & (df.loan != 'unknown')]

In [40]:
# Verify the drop was successful
new_df['loan'].value_counts()

no     25720
yes     4768
Name: loan, dtype: int64

In [41]:
# Verify that the target feature is not missing values
df['y'].value_counts()

no     36548
yes     4640
Name: y, dtype: int64

In [7]:
# This is the dataset after dropping the rows with 'unknown' values in the columns relevant to my analysis.
new_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30488 entries, 0 to 41187
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             30488 non-null  int64  
 1   job             30488 non-null  object 
 2   marital         30488 non-null  object 
 3   education       30488 non-null  object 
 4   default         30488 non-null  object 
 5   housing         30488 non-null  object 
 6   loan            30488 non-null  object 
 7   contact         30488 non-null  object 
 8   month           30488 non-null  object 
 9   day_of_week     30488 non-null  object 
 10  duration        30488 non-null  int64  
 11  campaign        30488 non-null  int64  
 12  pdays           30488 non-null  int64  
 13  previous        30488 non-null  int64  
 14  poutcome        30488 non-null  object 
 15  emp.var.rate    30488 non-null  float64
 16  cons.price.idx  30488 non-null  float64
 17  cons.conf.idx   30488 non-null 

After dropping all rows with 'unknown' values, 30,488 of the original 41,188 rows remain (about 74%). Roughly a quarter of the rows are lost, but enough data remains to progress with the analysis.
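The row filter used above can also be written against a list of columns, which avoids repeating one condition per column. This is a minimal sketch on a made-up toy frame; `toy`, `check_cols`, and `clean` are hypothetical names, and in the notebook the list would hold the six categorical columns being checked.

```python
import pandas as pd

# Toy stand-in for the bank data; the values are made up for illustration.
toy = pd.DataFrame({
    'age': [56, 57, 37, 40],
    'job': ['housemaid', 'services', 'unknown', 'admin.'],
    'default': ['no', 'unknown', 'no', 'no'],
    'loan': ['no', 'no', 'yes', 'no'],
})

# Columns in which 'unknown' should be treated as missing.
check_cols = ['job', 'default', 'loan']

# True for rows that contain 'unknown' in at least one checked column.
has_unknown = toy[check_cols].eq('unknown').any(axis=1)

# Keep the remaining rows; .copy() makes the result an independent frame.
clean = toy.loc[~has_unknown].copy()

rows_dropped = len(toy) - len(clean)
print(f"dropped {rows_dropped} of {len(toy)} rows ({rows_dropped / len(toy):.0%})")
```

Printing the dropped share alongside the filter makes the cost of the cleaning step visible at the point where it happens.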

### Understanding the Task

Based on the data description, the goal of this project is to create a model to predict the 'y' column, which indicates whether the contacted client has subscribed to a term deposit. This is the goal of the marketing campaigns run by the bank, so this analysis should help to improve the campaigns and drive a higher subscription rate.

## Engineering Features

To start, I need to separate the bank client features (columns 1-7) and transform them into numeric values for my models.

The features used in the modeling are: age, job, marital, education, default, housing, and loan. The target column is 'y'. After removing the 'unknown' values, there are 4 columns that hold binary yes/no values (default, housing, loan, and y) and 3 columns (job, marital, and education) that hold multi-valued categorical data and need to be one-hot encoded.

In [42]:
# Create a new df with only the features needed.
# Selecting and resetting the index in one step avoids modifying a slice of new_df in place.
prepared_df = new_df[['age', 'job', 'marital', 'education', 'default', 'housing', 'loan', 'y']].reset_index(drop = True)
prepared_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30488 entries, 0 to 30487
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        30488 non-null  int64 
 1   job        30488 non-null  object
 2   marital    30488 non-null  object
 3   education  30488 non-null  object
 4   default    30488 non-null  object
 5   housing    30488 non-null  object
 6   loan       30488 non-null  object
 7   y          30488 non-null  object
dtypes: int64(1), object(7)
memory usage: 1.9+ MB


In [10]:
# Encoding the job column.
job_df = pd.get_dummies(prepared_df['job'])
job_df.reset_index(drop = True, inplace = True)
job_df.head()

Unnamed: 0,admin.,blue-collar,entrepreneur,housemaid,management,retired,self-employed,services,student,technician,unemployed
0,0,0,0,1,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,1,0,0,0
2,1,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,1,0,0,0
4,1,0,0,0,0,0,0,0,0,0,0


In [11]:
# Encoding the marital column.
marital_df = pd.get_dummies(prepared_df['marital'])
marital_df.reset_index(drop = True, inplace = True)
marital_df.head()

Unnamed: 0,divorced,married,single
0,0,1,0
1,0,1,0
2,0,1,0
3,0,1,0
4,0,1,0


In [12]:
# Encoding the education column.
education_df = pd.get_dummies(prepared_df['education'])
education_df.reset_index(drop = True, inplace = True)
education_df.head()

Unnamed: 0,basic.4y,basic.6y,basic.9y,high.school,illiterate,professional.course,university.degree
0,1,0,0,0,0,0,0
1,0,0,0,1,0,0,0
2,0,1,0,0,0,0,0
3,0,0,0,1,0,0,0
4,0,0,0,0,0,1,0


In [13]:
# Combining the encoded dataframes together with the remaining features.
target_df = pd.concat([prepared_df['age'], job_df, marital_df, education_df, prepared_df['default'], prepared_df['housing'], prepared_df['loan'], prepared_df['y']], axis = 1)
target_df.reset_index(drop = True, inplace = True)
target_df.head()

Unnamed: 0,age,admin.,blue-collar,entrepreneur,housemaid,management,retired,self-employed,services,student,...,basic.6y,basic.9y,high.school,illiterate,professional.course,university.degree,default,housing,loan,y
0,56,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,no,no,no,no
1,37,0,0,0,0,0,0,0,1,0,...,0,0,1,0,0,0,no,yes,no,no
2,40,1,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,no,no,no,no
3,56,0,0,0,0,0,0,0,1,0,...,0,0,1,0,0,0,no,no,yes,no
4,59,1,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,no,no,no,no


In [14]:
# Binarizing the remaining non-numeric features.
target_df = target_df.replace({'default': {'yes':1, 'no':0}, 'housing': {'yes':1, 'no':0}, 'loan': {'yes':1, 'no':0}, 'y': {'yes':1, 'no':0}})

In [15]:
# Verifying that the data has been properly encoded.
target_df.head()

Unnamed: 0,age,admin.,blue-collar,entrepreneur,housemaid,management,retired,self-employed,services,student,...,basic.6y,basic.9y,high.school,illiterate,professional.course,university.degree,default,housing,loan,y
0,56,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,37,0,0,0,0,0,0,0,1,0,...,0,0,1,0,0,0,0,1,0,0
2,40,1,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
3,56,0,0,0,0,0,0,0,1,0,...,0,0,1,0,0,0,0,0,1,0
4,59,1,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0


In [43]:
target_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30488 entries, 0 to 30487
Data columns (total 26 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   age                  30488 non-null  int64
 1   admin.               30488 non-null  uint8
 2   blue-collar          30488 non-null  uint8
 3   entrepreneur         30488 non-null  uint8
 4   housemaid            30488 non-null  uint8
 5   management           30488 non-null  uint8
 6   retired              30488 non-null  uint8
 7   self-employed        30488 non-null  uint8
 8   services             30488 non-null  uint8
 9   student              30488 non-null  uint8
 10  technician           30488 non-null  uint8
 11  unemployed           30488 non-null  uint8
 12  divorced             30488 non-null  uint8
 13  married              30488 non-null  uint8
 14  single               30488 non-null  uint8
 15  basic.4y             30488 non-null  uint8
 16  basic.6y             3

Looking at the target_df dataframe, all features have been successfully encoded, and the data is now ready for analysis.
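For reference, the encoding steps above can be condensed. The sketch below is one possible shortcut, shown on a made-up toy frame (`toy` and `encoded` are hypothetical names); unlike the cells above, it prefixes each dummy column with its source column name, which is the default behavior of `pd.get_dummies` when `columns` is passed.

```python
import pandas as pd

# Toy stand-in for prepared_df; the values are made up for illustration.
toy = pd.DataFrame({
    'age': [56, 37, 40],
    'job': ['housemaid', 'services', 'admin.'],
    'marital': ['married', 'married', 'single'],
    'default': ['no', 'no', 'yes'],
    'y': ['no', 'yes', 'no'],
})

# One-hot encode the multi-valued categorical columns in a single call.
# Each new column is named '<source column>_<category>'.
encoded = pd.get_dummies(toy, columns=['job', 'marital'], dtype=int)

# Map the yes/no columns to 1/0.
for col in ['default', 'y']:
    encoded[col] = encoded[col].map({'yes': 1, 'no': 0})

print(encoded.columns.tolist())
```

The prefixed names keep it clear which original feature each indicator column came from, which helps when inspecting a fitted model later.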

### Train/Test Split

With properly encoded data, I split my data into train and test sets.

In [44]:
# Move the predictive features and the target feature into separate dataframes.
X = target_df.drop(['y'], axis = 1)
y = target_df['y']
X.head()

Unnamed: 0,age,admin.,blue-collar,entrepreneur,housemaid,management,retired,self-employed,services,student,...,basic.4y,basic.6y,basic.9y,high.school,illiterate,professional.course,university.degree,default,housing,loan
0,56,0,0,0,1,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
1,37,0,0,0,0,0,0,0,1,0,...,0,0,0,1,0,0,0,0,1,0
2,40,1,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
3,56,0,0,0,0,0,0,0,1,0,...,0,0,0,1,0,0,0,0,0,1
4,59,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0


In [17]:
# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(X, y)
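The split above uses the defaults, so it changes from run to run. As a hedged variation (not what produced the scores reported in this notebook), the sketch below fixes the shuffle with `random_state` and keeps the class ratio equal in both parts with `stratify`. The toy arrays and the 0.25 test size are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 140 negatives and 20 positives (made up for illustration).
X_toy = np.arange(160).reshape(-1, 1)
y_toy = np.array([0] * 140 + [1] * 20)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

# Share of positives in the train and test parts.
print(y_tr.mean(), y_te.mean())
```

With an imbalanced target like 'y', stratifying prevents the test set from ending up with noticeably more or fewer positives than the training set.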

### A Baseline Model

Before I begin building my models, I want a baseline. For this, I use the sklearn 'dummy' classifier and a logistic regression.

In [45]:
dummy_clf = DummyClassifier(random_state = 42).fit(X_train, y_train)
baseline_score = dummy_clf.score(X_test, y_test)

print(baseline_score)

0.8752295985305694


In [46]:
simple_model = LogisticRegression(max_iter = 10000, random_state = 42).fit(X_train, y_train)
simple_score = simple_model.score(X_test, y_test)
print(simple_score)

0.8752295985305694


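With this, I have a baseline against which to compare my models' performance. Note that the logistic regression scores exactly the same as the dummy classifier, which suggests that it predicts the majority class ('no') for every row in the test set.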
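To make the meaning of the baseline number concrete: a classifier that always predicts the most frequent class reaches an accuracy equal to that class's share of the labels. The sketch below shows this on a made-up target and assumes the `most_frequent` strategy is set explicitly; the toy arrays are not the bank data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Toy imbalanced target: 7 negatives for every positive (made up for illustration).
y_toy = np.array([0] * 70 + [1] * 10)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by the dummy model

# Always predict the most frequent class seen during fit.
dummy = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)
pred = dummy.predict(X_toy)

majority_share = max(np.mean(y_toy == 0), np.mean(y_toy == 1))
print(accuracy_score(y_toy, pred), majority_share)

# Recall on the positive class exposes what accuracy hides.
print(recall_score(y_toy, pred))
```

Because accuracy rewards always answering 'no' on data like this, a metric such as recall on the 'yes' class is a useful companion when comparing the models.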

### Model Comparisons

Now, I aim to compare the performance of the Logistic Regression model to the KNN, Decision Tree, and SVM models.  Using the default settings for each of the models, I fit and score each.  I also compare the fit time of each model.

In [47]:
knn_start = time.time()
knn_model = KNeighborsClassifier().fit(X_train, y_train)
knn_end = time.time()
knn_time = knn_end - knn_start
knn_train_score = knn_model.score(X_train, y_train)
knn_test_score = knn_model.score(X_test, y_test)

In [48]:
tree_start = time.time()
tree_model = DecisionTreeClassifier().fit(X_train, y_train)
tree_end = time.time()
tree_time = tree_end - tree_start
tree_train_score = tree_model.score(X_train, y_train)
tree_test_score = tree_model.score(X_test, y_test)

In [49]:
SVC_start = time.time()
SVC_model = SVC().fit(X_train, y_train)
SVC_end = time.time()
SVC_time = SVC_end - SVC_start
SVC_train_score = SVC_model.score(X_train, y_train)
SVC_test_score = SVC_model.score(X_test, y_test)

In [50]:
comparison_df = pd.DataFrame([
    ['KNN', knn_time, knn_train_score, knn_test_score], 
    ['Decision Tree', tree_time, tree_train_score, tree_test_score], 
    ['SVM', SVC_time, SVC_train_score, SVC_test_score]
], columns = ['Model', 'Train Time', 'Train Accuracy', 'Test Accuracy'])
comparison_df.set_index('Model', drop = True).head()

Model,Train Time,Train Accuracy,Test Accuracy
KNN,0.020632,0.878072,0.867489
Decision Tree,0.091916,0.90077,0.857124
SVM,11.972924,0.872824,0.87523


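Among the default models, SVM took by far the longest to fit (about 12 seconds, versus well under a second for the others), but it achieved a marginally better test accuracy (0.8752, versus 0.8675 for KNN and 0.8571 for the Decision Tree). That value is the same as the dummy baseline score. The Decision Tree shows the largest gap between train accuracy (0.9008) and test accuracy, which suggests some overfitting.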
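The fit, time, and score pattern repeated in the cells above can be folded into a loop. This is a minimal sketch on synthetic data from `make_classification` (an assumption, used so the example runs on its own); in the notebook the loop would reuse `X_train`, `X_test`, `y_train`, and `y_test`, and could include `SVC()` as a third entry.

```python
import time

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data so the sketch is self-contained.
X_syn, y_syn = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=0)

models = {
    'KNN': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
}

rows = []
for name, model in models.items():
    start = time.perf_counter()  # high-resolution timer for short intervals
    model.fit(X_tr, y_tr)
    fit_seconds = time.perf_counter() - start
    rows.append({
        'Model': name,
        'Train Time': fit_seconds,
        'Train Accuracy': model.score(X_tr, y_tr),
        'Test Accuracy': model.score(X_te, y_te),
    })

results = pd.DataFrame(rows).set_index('Model')
print(results)
```

Keeping the models in a dictionary means a new classifier only needs one extra line, and every model is timed and scored in exactly the same way.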

## Improving the Model

Now that I have some basic models on the board, I want to see if I can improve their performance. To do this, I tune the hyperparameters of each model with a grid search. Note that this is not an exhaustive search, but it is possible to find better values through this tuning.

In [33]:
# Performing a grid search to find optimal number of neighbors.
knn_params = {'n_neighbors': [1, 2, 3, 4, 5]}
knn_grid_model = KNeighborsClassifier()
knn_grid = GridSearchCV(knn_grid_model, param_grid = knn_params).fit(X_train, y_train)

In [34]:
#Storing the score and best parameter from the search
knn_grid_score = knn_grid.best_score_
knn_grid.best_params_

{'n_neighbors': 4}

In [35]:
# Performing a grid search to find the optimal combination of min impurity decrease, max depth, and min samples split
tree_params = {'min_impurity_decrease': [0.01, 0.02, 0.03, 0.05], 'max_depth': [2, 3, 4, 5, 10], 'min_samples_split': [0.1, 0.2, 0.05]}
tree_grid_model = DecisionTreeClassifier()
tree_grid = GridSearchCV(tree_grid_model, param_grid = tree_params).fit(X_train, y_train)

In [51]:
#Storing the score and best parameter from the search
tree_grid_score = tree_grid.best_score_
tree_grid.best_params_

{'max_depth': 2, 'min_impurity_decrease': 0.01, 'min_samples_split': 0.1}

In [37]:
# Performing a grid search to find the best kernel for the SVM
svm_params = {'kernel': ['rbf', 'poly', 'linear', 'sigmoid']}
svm_grid_model = SVC()
svm_grid = GridSearchCV(svm_grid_model, param_grid = svm_params).fit(X_train, y_train)

In [38]:
#Storing the score and best parameter from the search
svm_grid_score = svm_grid.best_score_
svm_grid.best_params_

{'kernel': 'rbf'}

Now that I have my optimized parameters, I will add the best cross-validation score from each search to the dataframe with the base models and compare the performance.

In [63]:
grid_search_scores = [knn_grid_score, tree_grid_score, svm_grid_score]
comparison_df['Search Scores'] = grid_search_scores
comparison_df.set_index('Model', drop = True)

Model,Train Time,Train Accuracy,Test Accuracy,Search Scores
KNN,0.020632,0.878072,0.867489,0.865302
Decision Tree,0.091916,0.90077,0.857124,0.872824
SVM,11.972924,0.872824,0.87523,0.872824


Finally, looking at the tuned scores, there is a minor decrease for the KNN and SVM models and a slight improvement for the Decision Tree model. Keep in mind that the search scores are mean cross-validation accuracies computed on the training data, so they are only roughly comparable to the test accuracies. Overall, these changes are very minor and don't indicate that the tuned hyperparameters were significantly better. At this point, I would recommend any of these models as a way to predict whether a client will sign up for the term deposit; however, I would favor a simpler model like KNN or Logistic Regression, as they are simple and fast models that performed about as well.
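To put a tuned model on the same footing as the default models, the refit best estimator can be scored on the held-out test set. The following is a minimal sketch on synthetic data (an assumption, so that it runs on its own). It also wraps KNN in a pipeline with `StandardScaler`, an addition that is not part of the searches above, because distance-based models are sensitive to feature scale.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data so the sketch is self-contained.
X_syn, y_syn = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=0)

# Scaling inside the pipeline is an addition not present in the original searches.
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('knn', KNeighborsClassifier()),
])

# The 'knn__' prefix routes the parameter to the KNN step of the pipeline.
grid = GridSearchCV(pipe, param_grid={'knn__n_neighbors': [1, 3, 5, 7]}, cv=5)
grid.fit(X_tr, y_tr)

cv_score = grid.best_score_          # mean cross-validated accuracy on the training data
test_score = grid.score(X_te, y_te)  # accuracy of the refit best model on held-out data

print(grid.best_params_, cv_score, test_score)
```

Reporting `test_score` next to the default models' test accuracy gives a like-for-like comparison, while `best_score_` remains the right number for choosing between parameter settings.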

In future work, I would bring in more of the data about the marketing campaigns to hopefully show which elements of the campaigns were most closely correlated with a user signing up. This and the classifier model would provide very actionable data that the bank could use to inform its next marketing campaign.