# Assignment 17.1

### The goal of this assignment is to compare the performance of four classifiers (k-nearest neighbors, logistic regression, decision trees, and support vector machines).

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

import time


## Step 1: Business Understanding

Problem Statement: Identify the potential customers who have a high propensity to subscribe to a term deposit, so that sales staff and telemarketers can target the right group of customers to contact.

Measure of Effectiveness: The accuracy score is used to compare the performance of the classification models.

Two further measures of performance were also included:

   (a) the fitting time of the model and

   (b) the percentage of customers to call (predicted == “yes”), which should not exceed the 18% of customers contacted in the previous campaign.

The latter metric keeps the number of customers to call manageable; otherwise the bank would have to employ too many telemarketers, which would be costly.
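The workload constraint in (b) can be written as a small check. The sketch below is illustrative only: the function names `call_workload` and `within_call_budget` and the toy predictions are assumptions introduced here, not part of the original analysis.

```python
def call_workload(predictions, positive_label="yes"):
    """Return the fraction of customers the model would flag for a call."""
    predictions = list(predictions)
    if not predictions:
        raise ValueError("predictions must not be empty")
    flagged = sum(1 for p in predictions if p == positive_label)
    return flagged / len(predictions)


def within_call_budget(predictions, budget=0.18, positive_label="yes"):
    """True when the flagged share does not exceed the budget (18% by default)."""
    return call_workload(predictions, positive_label) <= budget


# Toy predictions, made up for illustration only: 2 of 10 customers flagged.
toy_predictions = ["no", "yes", "no", "no", "no", "no", "no", "no", "no", "yes"]
toy_workload = call_workload(toy_predictions)
toy_ok = within_call_budget(toy_predictions)
print(toy_workload, toy_ok)
```

With a fitted classifier, the same check would be applied to `model.predict(X_test)`.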

In [None]:
# Load the data. The file is semicolon-separated, so we pass sep=';' to read_csv.
# The bank marketing data set is available at https://archive.ics.uci.edu/dataset/222/bank+marketing.

data = pd.read_csv('data/bank-full.csv', sep=';') 

In [None]:
data.head(5)


## Step 2: Understanding the Data 

It was expected that client data, business data, the outcome of the previous campaign (if the client was contacted by sales personnel) and whether the client subscribed to the term deposit would all be important. Hence, the bank marketing data set was acquired and used; it can be found at https://archive.ics.uci.edu/dataset/222/bank+marketing.

##### Bank Client Data

1 - age (numeric)

2 - job : type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student", "blue-collar","self-employed","retired","technician","services")

3 - marital : marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed)

4 - education (categorical: "unknown","secondary","primary","tertiary")

5 - default: has credit in default? (binary: "yes","no")

6 - balance: average yearly balance, in euros (numeric)

7 - housing: has housing loan? (binary: "yes","no")

8 - loan: has personal loan? (binary: "yes","no")

##### Related to the last contact of the current campaign

9 - contact: contact communication type (categorical: "unknown","telephone","cellular")

10 - day: last contact day of the month (numeric)

11 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")

12 - duration: last contact duration, in seconds (numeric)

##### Other attributes

13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)

15 - previous: number of contacts performed before this campaign and for this client (numeric)

16 - poutcome: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")

##### Output variable (desired target):

17 - y: has the client subscribed to a term deposit? (binary: "yes","no")



### Understanding the Data

First, I split the data into two groups: customers who were contacted in the previous campaign and customers who were not. About 18% of the customers were contacted in the previous campaign.

##### Customers contacted in the previous campaign


In [None]:
# Customers who were contacted in the previous campaign
data_campaign = data[data['pdays'] != -1]
success_campaign = len(data_campaign[data_campaign['y']=='yes'])
success_campaign_prop = success_campaign/len(data_campaign)
print('Number of customers contacted in the previous campaign = ' + str(len(data_campaign)))
print('Percentage of customers contacted in the previous campaign = ' + str(100*len(data_campaign)/len(data)) + '%')

##### Customers not contacted in the previous campaign

In [None]:
# Customers not contacted in the previous campaign
data_no_campaign = data[data['pdays'] == -1]
success_no_campaign = len(data_no_campaign[data_no_campaign['y']=='yes'])
success_no_campaign_prop = success_no_campaign/len(data_no_campaign)
print('Number of customers not contacted in the previous campaign = ' + str(len(data_no_campaign)))
print('Percentage of customers not contacted in the previous campaign = ' + str(100*len(data_no_campaign)/len(data)) + '%')

##### Customer success rate based on whether they are contacted or not

In [None]:
plt.bar(['contacted and subscribed', 'not contacted and subscribed'], [success_campaign_prop,success_no_campaign_prop], color='Green')
plt.title("Proportion of Success")
plt.xticks(rotation=40)

##### Count of customers in each group

In [None]:
plt.bar(['contacted and subscribed', 'not contacted and subscribed'], [success_campaign,success_no_campaign], color='Green')
plt.bar(['contacted and not subscribed', 'not contacted and not subscribed'], [len(data_campaign)-success_campaign,len(data_no_campaign)-success_no_campaign], color='red')
plt.title("Count of Customers")
plt.xticks(rotation=40)


## Step 3: Data Preparation 

Fields with “yes” and “no” values (“default”, “housing” and “loan”) were replaced with 1 and 0 respectively. The other categorical fields, such as “marital” and “job”, were converted to integer codes using LabelEncoder.
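To make the encoding step concrete, here is a minimal sketch of what LabelEncoder does to a categorical column. The toy values are made up for illustration. Keeping one encoder per column is an assumption of this sketch rather than what the notebook does: a single encoder that is refitted for each column only remembers the classes of the last column it saw.

```python
from sklearn.preprocessing import LabelEncoder

# Toy columns, made up for illustration only.
toy = {
    "marital": ["married", "single", "divorced", "single"],
    "job": ["admin.", "student", "admin.", "retired"],
}

# One encoder per column, so each mapping can be inspected or inverted later.
encoders = {}
encoded = {}
for column, values in toy.items():
    encoders[column] = LabelEncoder()
    encoded[column] = encoders[column].fit_transform(values)

# Classes are sorted alphabetically and numbered from 0.
marital_mapping = {
    str(label): code for code, label in enumerate(encoders["marital"].classes_)
}
print(marital_mapping)
print(encoded["marital"].tolist())
```

Note that the integer codes carry an arbitrary alphabetical order, which distance-based models such as KNN and SVM will treat as meaningful.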

In [None]:
# default, housing, loan -> no = 0; yes = 1

yes_no_map = {'no': 0, 'yes': 1}
for col in ['default', 'housing', 'loan']:
    data[col] = data[col].map(yes_no_map)


In [None]:
# Transform job, marital, education, contact and poutcome to numerical values

le = LabelEncoder()
data['marital_num'] = le.fit_transform(data['marital'])
data['job_num'] = le.fit_transform(data['job'])
data['education_num'] = le.fit_transform(data['education'])
data['contact_num'] = le.fit_transform(data['contact'])
data['poutcome_num'] = le.fit_transform(data['poutcome'])


In [None]:
# Remove month and day, as they are represented in pdays
# Remove job, marital, education, contact and poutcome, as their numerical versions are included
# Remove y, as it is the target

X = data.drop(columns=['marital', 'job', 'education', 'contact', 'poutcome', 'y', 'day', 'month'])

In [None]:
y=data['y']

Split the data into training and development (test) sets in a 70:30 ratio.

In [None]:
X_train, X_test, y_train, y_test=train_test_split(X, y, test_size=0.3)
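Because the split above is random, the scores reported later will vary from run to run. A minimal sketch of a reproducible, stratified 70:30 split is shown below on synthetic data; the `random_state` value and the use of `stratify` are assumptions of this sketch and were not part of the original run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced toy data (12% "yes"), made up for illustration only.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array(["yes"] * 12 + ["no"] * 88)

# random_state fixes the shuffle; stratify keeps the yes/no ratio in both sets.
X_tr, X_dev, y_tr, y_dev = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

dev_yes_share = np.mean(y_dev == "yes")
print(len(y_tr), len(y_dev), dev_yes_share)
```

Stratifying matters here because only a minority of customers subscribe, so an unlucky random split could leave the development set with noticeably more or fewer “yes” cases than the training set.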


## Step 4: Modelling 

Four models were used: Logistic Regression, Decision Tree, k Nearest Neighbors and Support Vector Machine.

##### Logistic Regression

In [None]:
start_time = time.time()
logreg =LogisticRegression(max_iter=10_000).fit(X_train, y_train)
end_time = time.time()
timer_logreg = end_time - start_time

##### Decision Tree

In [None]:
start_time = time.time()
DecTree =DecisionTreeClassifier().fit(X_train, y_train)
end_time = time.time()
timer_DecTree = end_time - start_time

##### k Nearest Neighbors

In [None]:
start_time = time.time()
knn =KNeighborsClassifier().fit(X_train, y_train)
end_time = time.time()
timer_knn = end_time - start_time

##### Support Vector Machine

In [None]:
start_time = time.time()
svc =SVC().fit(X_train, y_train)
end_time = time.time()
timer_svc = end_time - start_time
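The four cells above repeat the same start/fit/stop pattern. As a sketch, that pattern could be wrapped in a helper. The name `fit_timed` is hypothetical, the data is synthetic, and `time.perf_counter` is swapped in for `time.time` because it is intended for measuring short durations.

```python
import time

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def fit_timed(model, X, y):
    """Fit the model and return it together with the elapsed fitting time in seconds."""
    start = time.perf_counter()
    model.fit(X, y)
    return model, time.perf_counter() - start


# Synthetic toy data, made up for illustration only.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

candidates = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
}

fitted = {}
fit_times = {}
for name, model in candidates.items():
    fitted[name], fit_times[name] = fit_timed(model, X_toy, y_toy)

print(fit_times)
```

A single timing is noisy; averaging several runs would give a more reliable comparison.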


## Step 5: Evaluation 

First, I look at the performance matrix for the 4 models, followed by their confusion matrices.

##### Performance matrix for the 4 models

From the matrix below, Logistic Regression recorded the best accuracy score on the development set, followed by SVM, KNN and Decision Tree. KNN took the shortest time to fit, followed by Decision Tree, Logistic Regression and SVM. Since all the fitting times were tolerable and the accuracy score is the measure of effectiveness, Logistic Regression appears to be the best model. In addition, its workload (the proportion of customers predicted “yes”) was 3.5%, well below the 18% contacted in the previous campaign.

In [None]:
res_dict = {'model': ['Logistic Regression', 'Decision Tree', 'KNN', 'SVM'], #logistic regression, decision trees, KNN, and SVMs
           'train score': [logreg.score(X_train, y_train), DecTree.score(X_train, y_train), 
                           knn.score(X_train, y_train), svc.score(X_train, y_train)],
           'test score': [logreg.score(X_test, y_test), DecTree.score(X_test, y_test), 
                           knn.score(X_test, y_test), svc.score(X_test, y_test)],
           'fitting time': [timer_logreg, timer_DecTree, timer_knn, timer_svc],
           # The mean of the boolean mask is the share of 'yes' predictions; unlike
           # value_counts()['yes'], it returns 0.0 when a model predicts no 'yes' at all.
           'proportion yes predictions': [np.mean(logreg.predict(X_test) == 'yes'),
                                          np.mean(DecTree.predict(X_test) == 'yes'),
                                          np.mean(knn.predict(X_test) == 'yes'),
                                          np.mean(svc.predict(X_test) == 'yes')
                                         ]
           }
results_df = pd.DataFrame(res_dict).set_index('model')
results_df

Confusion matrices were added to enrich the analysis. 
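A confusion matrix also shows how the errors are distributed, which a single accuracy score can hide when most customers are “no”. The sketch below uses made-up labels to show how the four cells can be read and how precision and recall for the “yes” class follow from them; these two metrics are an addition for illustration and are not reported in the original analysis.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only.
y_true = ["no", "no", "no", "no", "no", "no", "yes", "yes", "yes", "no"]
y_pred = ["no", "no", "no", "no", "no", "yes", "yes", "no", "no", "no"]

# With labels=["no", "yes"], rows are true classes and columns are predictions.
cm = confusion_matrix(y_true, y_pred, labels=["no", "yes"])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)  # share of customers called who actually subscribe
recall = tp / (tp + fn)     # share of subscribers the model manages to find

print(cm)
print(accuracy, precision, recall)
```

In this toy case the accuracy is 0.7, yet only one of the three subscribers is found, which is the kind of gap the confusion matrices make visible.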

##### Confusion Matrix - Logistic Regression 



In [None]:
ConfusionMatrixDisplay.from_estimator(logreg, X_test, y_test)
plt.title('Logistic Regression')

##### Confusion Matrix - Decision Tree

In [None]:
ConfusionMatrixDisplay.from_estimator(DecTree, X_test, y_test)
plt.title('Decision Tree')

##### Confusion Matrix - k Nearest Neighbors

In [None]:
ConfusionMatrixDisplay.from_estimator(knn, X_test, y_test)
plt.title('k Nearest Neighbors')

##### Confusion Matrix - Support Vector Machine

In [None]:
ConfusionMatrixDisplay.from_estimator(svc, X_test, y_test)
plt.title('SVC')


### Sensitivity Analysis - Analysing without data from customers who were not contacted and did not subscribe to the term deposit

It is hard to tell whether the customers who were not contacted and did not subscribe to the term deposit were genuinely not interested or were simply unaware of the campaign. I therefore decided to rerun the models without this group of customers.
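The filter described above keeps a row if the customer was contacted in the previous campaign (`pdays != -1`) or subscribed (`y == 'yes'`). A minimal sketch on a toy DataFrame, with values made up for illustration, expresses this as a single boolean mask:

```python
import pandas as pd

# Toy rows, made up for illustration only; pdays == -1 means "not previously contacted".
toy = pd.DataFrame(
    {
        "pdays": [-1, -1, 5, 12, -1],
        "y": ["no", "yes", "no", "yes", "no"],
    }
)

# Drop only the rows that were both not contacted and not subscribed.
keep = (toy["pdays"] != -1) | (toy["y"] == "yes")
toy_filtered = toy[keep]

print(toy_filtered)
```

One consequence worth keeping in mind: after this filter, every remaining customer with `pdays == -1` has `y == 'yes'`, so `pdays` alone identifies those rows as subscribers.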

##### Create new_data to remove the group of customers who were not contacted and did not subscribe to the term deposit 

In [None]:
# Remove customers who were not contacted in the previous campaign and did not subscribe
data_no_campaign = data[data['pdays'] == -1]
data_no_campaign_success = data_no_campaign[data_no_campaign['y'] == 'yes']
data_campaign = data[data['pdays'] != -1]
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
new_data = pd.concat([data_campaign, data_no_campaign_success])

In [None]:
X1 = new_data.drop(columns = ['marital', 'job', 'education','contact', 'poutcome','y', 'day', 'month'])
y1 = new_data['y']

In [None]:
X1_train, X1_test, y1_train, y1_test=train_test_split(X1, y1, test_size=0.3)

##### Logistic Regression on new_data

In [None]:
start_time = time.time()
logreg1 =LogisticRegression(max_iter=10_000).fit(X1_train, y1_train)
end_time = time.time()
timer_logreg1 = end_time - start_time

##### Decision Tree on new_data

In [None]:
start_time = time.time()
DecTree1 =DecisionTreeClassifier().fit(X1_train, y1_train)
end_time = time.time()
timer_DecTree1 = end_time - start_time

##### k Nearest Neighbors on new_data

In [None]:
start_time = time.time()
knn1 =KNeighborsClassifier().fit(X1_train, y1_train)
end_time = time.time()
timer_knn1 = end_time - start_time

##### Support Vector Machine on new_data

In [None]:
start_time = time.time()
svc1 =SVC().fit(X1_train, y1_train)
end_time = time.time()
timer_svc1 = end_time - start_time

##### Performance matrix on new_data

In [None]:
res_dict1 = {'model': ['Logistic Regression', 'Decision Tree', 'KNN', 'SVM'], #logistic regression, decision trees, KNN, and SVMs
           'train score': [logreg1.score(X1_train, y1_train), DecTree1.score(X1_train, y1_train), 
                           knn1.score(X1_train, y1_train), svc1.score(X1_train, y1_train)],
           'test score': [logreg1.score(X1_test, y1_test), DecTree1.score(X1_test, y1_test), 
                           knn1.score(X1_test, y1_test), svc1.score(X1_test, y1_test)],
           'fitting time': [timer_logreg1, timer_DecTree1, timer_knn1, timer_svc1],
           # The mean of the boolean mask is the share of 'yes' predictions; unlike
           # value_counts()['yes'], it returns 0.0 when a model predicts no 'yes' at all.
           'proportion yes predictions': [np.mean(logreg1.predict(X1_test) == 'yes'),
                                          np.mean(DecTree1.predict(X1_test) == 'yes'),
                                          np.mean(knn1.predict(X1_test) == 'yes'),
                                          np.mean(svc1.predict(X1_test) == 'yes')
                                         ]
           }
results_df1 = pd.DataFrame(res_dict1).set_index('model')
results_df1

##### Confusion Matrix on new_data

In [None]:
# Use the models and development set built from new_data (suffix 1), not the originals
ConfusionMatrixDisplay.from_estimator(logreg1, X1_test, y1_test)
plt.title('Logistic Regression (new_data)')
ConfusionMatrixDisplay.from_estimator(DecTree1, X1_test, y1_test)
plt.title('Decision Tree (new_data)')
ConfusionMatrixDisplay.from_estimator(knn1, X1_test, y1_test)
plt.title('k Nearest Neighbors (new_data)')
ConfusionMatrixDisplay.from_estimator(svc1, X1_test, y1_test)
plt.title('SVC (new_data)')


## Step 6: Deployment 

The sensitivity study did not improve the accuracy of the models, and the workload became significantly higher. It is therefore better to deploy the original Logistic Regression model, for two reasons: its test score was slightly better, and its workload was significantly lower.

Next steps and recommendations:

a. Deploy the Logistic Regression model for the next campaign, and collect new data to refine the model further.

b. Even though this is a classification problem, a time series model could be considered once sufficient data has been collected over several campaigns.

c. External data, such as economic indicators, could be collected to enrich the analysis, as the state of the economy can affect people’s propensity to take up a term deposit.