# Practical Application III: Comparing Classifiers

**Overview**: In this practical application, your goal is to compare the performance of the classifiers we encountered in this section, namely K Nearest Neighbor, Logistic Regression, Decision Trees, and Support Vector Machines.  We will utilize a dataset related to marketing bank products over the telephone.  



### Learning Outcomes
#### 1. Apply various classification methods to a business problem
#### 2. Compare results of k-nearest neighbors, logistic regression, decision trees, and support vector machines 

### Deliverables
#### 1. Understand data
#### 2. Model data
#### 3. Build Jupyter Notebook
##### a. Demonstrate understanding of business problem
##### b. Provide correct and concise interpretation of descriptive and inferential statistics
##### c. Show findings (include actionable insights, next steps, and recommendations)
#### 4. Submit website URL to public-facing GitHub repository


### New Modeling Deliverables
#### 1. Use four classifier models (kNN, Decision Trees, Logistic Regression, SVM)
#### 2. Clearly identify evaluation metrics
#### 3. Appropriately interpret evaluation metrics
#### 4. Display clear rationale for use of evaluation metrics
#### 5. Appropriately compare the four models 

### CRISP-DM Framework: Standard Process for Data Projects/Mining
#### 1) Business Understanding: Background, Objectives, Success Criteria, Inventory of Resources/Requirements/Assumptions/Constraints/Risks/Contingencies/Terminology/Costs/Benefits, Data Mining Goals/Success Criteria

#### 2) Data Understanding: Data Collection/Exploration/Quality Report

#### 3) Data Preparation: Data Description/Inclusion/Exclusion/Attributes/Records, Merged Data, Reformatted Data

#### 4) Modeling: Select Technique/Assumptions, Generate Test Designs, Build Model/Parameter Settings/Model Description, Assess Model, Revise Parameter Settings

#### 5) Evaluation: Evaluate Results/Assessment of Results w.r.t Business Success Criteria/Approved Models, Review Process, Determine Next Steps, List of Possible Action Decisions

#### 6) Deployment: Plan Deployment, Plan Monitoring and Maintenance Plan, Produce Final Report/Final Presentation, Review Project/Experience Documentation

### Getting Started: Use Dataset Related to 

Our dataset comes from the UCI Machine Learning repository [link](https://archive.ics.uci.edu/ml/datasets/bank+marketing).  The data is from a Portuguese banking institution and is a collection of the results of multiple marketing campaigns.  We will make use of the article accompanying the dataset [here](CRISP-DM-BANK.pdf) for more information on the data and features.



### Problem 1: Understanding the Data

To gain a better understanding of the data, please read the information provided in the UCI link above, and examine the **Materials and Methods** section of the paper.  How many marketing campaigns does this data represent?

### Answer 1: Understanding the Data

Examining the **Materials and Methods** section of the paper, it appears the dataset is related to 17 campaigns that occurred between May 2008 and November 2010.

### Problem 2: Read in the Data

Use pandas to read in the dataset `bank-additional-full.csv` and assign to a meaningful variable name.

### Answer 2: Read in the Data

In [None]:
# First importing all libraries (repeated imports consolidated)
%matplotlib inline
import math
import pandas as pd
import numpy as np
import seaborn as sns
import seaborn.objects as so
import matplotlib.pyplot as plt
from matplotlib import colors as mcolors
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go
import scipy.stats as stats
import scipy.optimize
from scipy.optimize import minimize
# random.shuffle (in-place list shuffle) is the shuffle used below; it shadowed sklearn.utils.shuffle
from random import shuffle, seed
from sklearn import linear_model, set_config, tree
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
# enable_halving_search_cv must be imported before the Halving* search classes
from sklearn.experimental import enable_halving_search_cv
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import KNNImputer
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score, mean_absolute_error,
                             mean_squared_error, precision_recall_curve, precision_score,
                             r2_score, recall_score, roc_auc_score, roc_curve,
                             ConfusionMatrixDisplay, RocCurveDisplay)
from sklearn.model_selection import (GridSearchCV, HalvingGridSearchCV, HalvingRandomSearchCV,
                                     RandomizedSearchCV, train_test_split)
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import (LabelEncoder, OneHotEncoder, OrdinalEncoder,
                                   PolynomialFeatures, StandardScaler)
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree


In [None]:
# Next establishing settings for viewing plots
sns.set_theme(style="darkgrid")

In [None]:
# Setting diagram 
set_config(display="diagram")

In [None]:
# Now using pandas to read in the dataset bank-additional-full.csv
# Assigning to a meaningful variable name

Bank_Full_DF = pd.read_csv('bank-additional-full.csv', sep = ';')

In [None]:
# Checking pandas read
Bank_Full_DF.head()

### Problem 3: Understanding the Features


Examine the data description below, and determine if any of the features are missing values or need to be coerced to a different data type.


```
Input variables:
# bank client data:
1 - age (numeric)
2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5 - default: has credit in default? (categorical: 'no','yes','unknown')
6 - housing: has housing loan? (categorical: 'no','yes','unknown')
7 - loan: has personal loan? (categorical: 'no','yes','unknown')
# related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: 'cellular','telephone')
9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
# other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
# social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):
21 - y - has the client subscribed a term deposit? (binary: 'yes','no')
```



### Answer 3: Examining the Data

#### Initial Data Exploration

In [None]:
# Exploring dataset by columns
Bank_Full_DF.columns


In [None]:
# Exploring dataset by shape
Bank_Full_DF.shape

In [None]:
# Exploring dataset by info
Bank_Full_DF.info()

In [None]:
# Exploring data set by describe
Bank_Full_DF.describe()

In [None]:
# Determining sum of null values by column
# Determined no null values
Bank_Full_DF.isna().sum()

#### Examining each column

##### Bank Client Columns

In [None]:
Bank_Full_DF['age'].value_counts()

In [None]:
Bank_Full_DF['job'].value_counts()

In [None]:
Bank_Full_DF['marital'].value_counts()

In [None]:
Bank_Full_DF['education'].value_counts()

In [None]:
Bank_Full_DF['default'].value_counts()

In [None]:
Bank_Full_DF['housing'].value_counts()

In [None]:
Bank_Full_DF['loan'].value_counts()

In [None]:
# Dropping 'unknown' values for categorical variables 'loan', 'housing', 'education', 'marital', 'job'
# Lack of contribution to meaningful results

In [None]:
Drop1 = Bank_Full_DF[Bank_Full_DF['loan'].str.contains('unknown')==False]

In [None]:
Drop2 = Drop1[Drop1['housing'].str.contains('unknown')==False]

In [None]:
Drop3 = Drop2[Drop2['education'].str.contains('unknown')==False]

In [None]:
Drop4 = Drop3[Drop3['marital'].str.contains('unknown')==False]

In [None]:
Drop5 = Drop4[Drop4['job'].str.contains('unknown')==False]
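
The chained filters above could also be written as a single pass over the affected columns. The following is a minimal sketch on a toy frame, not the notebook's data: the names `toy` and `cols_to_check` are hypothetical, and it assumes missing values are recorded as the exact string 'unknown', as the data description states.

```python
import pandas as pd

# Toy stand-in for the bank data; values are illustrative only
toy = pd.DataFrame({
    "job": ["admin.", "unknown", "services", "technician"],
    "marital": ["married", "single", "unknown", "single"],
    "loan": ["no", "yes", "no", "no"],
})

cols_to_check = ["job", "marital", "loan"]

# Count 'unknown' entries per column before dropping anything
unknown_counts = (toy[cols_to_check] == "unknown").sum()

# Keep only rows where none of the checked columns equals 'unknown'
cleaned = toy[(toy[cols_to_check] != "unknown").all(axis=1)]

print(unknown_counts)
print(cleaned)
```

Counting first shows how many rows each column costs, which helps judge whether dropping is preferable to keeping 'unknown' as its own category.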

In [None]:
# Checking variables
Drop5['loan'].value_counts()

In [None]:
Drop5['housing'].value_counts()

In [None]:
Drop5['education'].value_counts()

In [None]:
Drop5['marital'].value_counts()

In [None]:
# Preview only: the result is not assigned, so Drop5 itself is unchanged
Drop5['marital'].replace({"divorced": "divorced_widowed"})

In [None]:
Drop5['job'].value_counts()

In [None]:
# Checking shape of new data frame
Drop5.shape

##### Columns Containing Last Contacted Information

In [None]:
Drop5['contact'].value_counts()

In [None]:
Drop5['month'].value_counts()

In [None]:
Drop5['day_of_week'].value_counts()

In [None]:
Drop5['duration'].value_counts()

In [None]:
# According to "Attribute Information",'duration' is last contact duration, in seconds 
# This input should only be included for benchmark purposes 
# Should be discarded if the intention is to have a realistic predictive model
# Dropping duration column
Drop6 = Drop5.drop(["duration"], axis=1)

In [None]:
# Checking new data frame
Drop6.head()

In [None]:
# Checking new data frame
Drop6.shape

##### Columns Containing Other Attributes

In [None]:
Drop6['campaign'].value_counts()

In [None]:
Drop6['pdays'].value_counts()

###### Creating new column for pdays
###### Coercing the pdays column into a new column (Yes/No) for later dummy encoding. According to "Attribute Information", pdays is the number of days that passed after the client was last contacted in a previous campaign ("999" means the client was not previously contacted; all other numerical values mean the client was previously contacted).

In [None]:
Drop6['pdays'].sort_values().head(1500)

In [None]:
# 999 means the client was not previously contacted; every other value means prior contact
Drop6['Contact_In_Prior_Campaign'] = np.where(Drop6['pdays'] == 999, "no", "yes")

In [None]:
Drop6['Contact_In_Prior_Campaign'].value_counts()

In [None]:
Drop7 = Drop6.drop(['pdays'], axis=1)

##### Columns Containing Other Attributes (cont.)

In [None]:
Drop7['previous'].value_counts()

In [None]:
Drop7['poutcome'].value_counts()

##### Columns Containing Social and Economic Context Attributes

In [None]:
Drop7['emp.var.rate'].value_counts()

In [None]:
Drop7['cons.price.idx'].value_counts()

In [None]:
Drop7['cons.conf.idx'].value_counts()

In [None]:
Drop7['euribor3m'].value_counts()

#### Assessing For Duplicate Rows Overall

In [None]:
Duplicated = Drop7.duplicated()

In [None]:
duplicates_sorted = Duplicated.sort_values()  # renamed to avoid shadowing the built-in sorted()
duplicates_sorted.head(100)

In [None]:
duplicates_sorted.tail(100)

In [None]:
# Dropping discovered duplicates
Drop8 = Drop7.drop_duplicates()
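
Because `duplicated()` returns a boolean mask, summing it gives the number of repeated rows directly, without sorting and inspecting the head and tail. A small sketch on a hypothetical toy frame:

```python
import pandas as pd

# Toy frame: rows 1 and 3 repeat row 0
toy = pd.DataFrame({
    "age": [30, 30, 41, 30],
    "job": ["admin.", "admin.", "services", "admin."],
})

# duplicated() marks every repeat of an earlier row as True
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each distinct row
deduplicated = toy.drop_duplicates()

print(n_duplicates, deduplicated.shape)
```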

In [None]:
# Checking shape of new data set 
Drop8.shape

#### Renaming Columns for Visualizations

In [None]:
DFRenamed = Drop8.rename({'age': 'Age', 'job': 'Job_Type', 'marital': 'Marital_Status',
                         'education': 'Education', 'housing': 'Housing_Loan', 
                          'default': 'Credit_In_Default', 'loan': 'Personal_Loan', 
                          'contact':'Communication_Type', 'month': 'Last_Contact_Month', 
                          'day_of_week': 'Last_Contact_Day', 'campaign': 'Total_Times_Contacted', 
                          'pdays': 'Days_Since_Last_Contact', 'previous': 'Previous_Contacts_Performed',
                          'poutcome': 'Previous_Outcome', 'emp.var.rate': 'Employment_Variation_Quarterly', 
                          'cons.price.idx': 'Consumer_Price_Monthly', 'cons.conf.idx': 'Consumer_Confidence_Monthly',
                          'euribor3m': 'Euribor_3Month_Daily','nr.employed': 'Employees_Quarterly', 'y': 'Subscribed'}, axis = 1)
                          

In [None]:
DFRenamed.head()

#### Heat Map

In [None]:
plt.subplots(figsize=(12,7))
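# numeric_only=True restricts the correlation to numeric columns (needed in pandas >= 2.0 when text columns are present)
corr = DFRenamed.corr(numeric_only=True)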
sns.heatmap(corr, annot=True).set(title = 'Numeric Variables Correlation')

##### Correlations above the .90 threshold are Euribor_3Month_Daily with Employment_Variation_Quarterly, Euribor_3Month_Daily with Employees_Quarterly, and Employees_Quarterly with Employment_Variation_Quarterly. Because these three features are so strongly correlated with one another, some of them can be dropped to reduce the dimensionality of the data set.

##### Correlations with an absolute value between .50 and .90 are Employment_Variation_Quarterly with Consumer_Price_Monthly (positive), Euribor_3Month_Daily with Consumer_Price_Monthly, Consumer_Price_Monthly with Consumer_Confidence_Monthly, Days_Since_Last_Contact with Previous_Contacts_Performed (negative), Consumer_Price_Monthly with Employees_Quarterly, and Previous_Contacts_Performed with Employees_Quarterly (negative). These are variables to keep in mind during data exploration.

###### Dropping highly correlated features (> 0.90) to reduce dimensionality, and retaining Euribor_3Month_Daily Feature

In [None]:
Drop9 = DFRenamed.drop(["Employment_Variation_Quarterly"], axis=1)
Drop10 = Drop9.drop(["Employees_Quarterly"], axis=1)
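
Pairs above a correlation threshold can also be listed programmatically rather than read off the heat map. This is a sketch on synthetic data, with hypothetical column names `a`, `b`, and `c`; it masks the lower triangle so that each pair appears once.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "a": base,
    "b": base * 2.0 + rng.normal(scale=0.01, size=200),  # nearly collinear with a
    "c": rng.normal(size=200),                            # independent noise
})

corr = toy.corr().abs()

# Keep only the strict upper triangle so each pair is listed once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (feature, feature) -> |r| and keep pairs above the threshold
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs > 0.90]

print(high_pairs)
```

Which member of a flagged pair to drop remains a judgment call; the code only identifies candidates.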
Drop10.shape

#### Examining Outcome Variable

In [None]:
# Binary outcome
# Has the client subscribed a term deposit?

Drop10['Subscribed'].value_counts()

In [None]:
# Replace outcome variable with 0, 1
Subscribed_Outcome = pd.DataFrame(Drop10["Subscribed"].replace({"no": 0, "yes": 1}))

In [None]:
Subscribed_Outcome.head()

In [None]:
Subscribed_Outcome.tail()

### Problem 4: Understanding the Task

After examining the description and data, your goal now is to clearly state the *Business Objective* of the task.  State the objective below.


### Answer 4: Understanding the Task - Will the Client Subscribe to a Bank Deposit?

Business Goal:  To increase campaign efficiency, find a model that can explain whether a client subscribes to the term deposit. Identify the main characteristics that affect success, to help manage human effort, phone calls, time, and other available resources. In addition, support the selection of a high-quality, affordable set of potential buying customers.

### Problem 5: Engineering Features

Now that you understand your business objective, we will build a basic model to get started.  Before we can do this, we must work to encode the data.  Using just the bank information features (columns 1 - 7), prepare the features and target column for modeling with appropriate encoding and transformations.

### Answer 5: Encoding Data for Columns 1-7

In [None]:
# Creating dummy dataframes for categorical variables
dummiesjob = pd.get_dummies(Drop10["Job_Type"])
dummiesmarital = pd.get_dummies(Drop10["Marital_Status"])
dummieseducation = pd.get_dummies(Drop10["Education"])
dummiesdefault = pd.get_dummies(Drop10["Credit_In_Default"])
dummieshousing = pd.get_dummies(Drop10["Housing_Loan"])
dummiesloan = pd.get_dummies(Drop10["Personal_Loan"])

#### Encoding Data for Remaining Categorical Variables

In [None]:
dummiescommunication = pd.get_dummies(Drop10["Communication_Type"])
dummiesmonth = pd.get_dummies(Drop10["Last_Contact_Month"])
dummiesdayofweek = pd.get_dummies(Drop10["Last_Contact_Day"])
dummiesprioroutcome = pd.get_dummies(Drop10["Previous_Outcome"])
dummiescontactedprior = pd.get_dummies(Drop10["Contact_In_Prior_Campaign"])
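
Several of the encoded columns share the labels 'no', 'yes', and 'unknown', so encoding each column separately without a prefix produces repeated column names once the frames are concatenated. The sketch below, on a hypothetical two-column toy frame, shows one way to keep names unique by letting `pd.get_dummies` prefix each dummy with its source column.

```python
import pandas as pd

toy = pd.DataFrame({
    "Housing_Loan": ["yes", "no", "unknown"],
    "Personal_Loan": ["no", "no", "yes"],
})

# Passing the frame with `columns=` prefixes each dummy with its source column name
encoded = pd.get_dummies(toy, columns=["Housing_Loan", "Personal_Loan"], dtype=int)

print(encoded.columns.tolist())
```

Unique names make coefficient tables and regularization-path legends easier to read, since each coefficient can be traced back to one feature.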

#### Establishing Baseline Variables Data Frame

In [None]:
# Joining Categorical Variables in Data Frame
CtgrcalVarbls = pd.concat([dummiesjob, dummiesmarital, dummieseducation, 
                           dummiesdefault, dummieshousing, dummiesloan, dummiescommunication,
                           dummiesmonth, dummiesdayofweek, dummiesprioroutcome,
                           dummiescontactedprior], axis =1)

In [None]:
#Joining Numerical Variables in Data Frame
NmrclVrbls = Drop10[['Age', 'Total_Times_Contacted', 'Previous_Contacts_Performed', 'Consumer_Price_Monthly',
                     'Consumer_Confidence_Monthly','Euribor_3Month_Daily']]

In [None]:
#Joining Numerical and Categorical Variables in Data Frame
PreBaseline = pd.concat([CtgrcalVarbls,NmrclVrbls], axis =1)

In [None]:
#Joining Outcome Variable to Data Frame
Baseline = pd.concat([CtgrcalVarbls,NmrclVrbls, Subscribed_Outcome], axis = 1)

##### All Features Baseline Data Frame

In [None]:
Baseline.head()

In [None]:
Baseline.shape

In [None]:
Baseline.info()

In [None]:
Baseline.isna().sum()

#### Shuffling Data Set

In [None]:
#Prepare data for sklearn
#Create indices for data frame and shuffle
ShuffleM1 = list(range(0, len(Baseline)))
seed(42)
shuffle(ShuffleM1)
ShuffleM1[:5]

### Problem 6: Train/Test Split
With your data prepared, split it into a train and test set.

### Answer 6: Train/Test Split

In [None]:
# Naming feature and outcome variables
X = Baseline.drop(['Subscribed'], axis=1)
y = Baseline['Subscribed']

In [None]:
# Performing Train/Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
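
With an imbalanced target, a plain random split can leave the train and test sets with slightly different class proportions. As an optional variation, `train_test_split` accepts a `stratify` argument that preserves the class ratio in both parts. The arrays below are synthetic and the 12/88 split is illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 12 positives and 88 negatives
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 12 + [0] * 88)

# stratify=y_toy keeps the positive rate the same in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```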

### Problem 7: A Baseline Model

Before we build our first model, we want to establish a baseline.  What is the baseline performance that our classifier should aim to beat?

### Answer 7: Establishing Baseline Performance

In [None]:
# Baseline With Logistic Regression for 56 features
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_proba = logreg.predict_proba(X_test)[:, 1]
thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

In [None]:
# Obtaining Intercept
logreg.intercept_

In [None]:
# Obtaining Coefficients
logreg.coef_

In [None]:
logreg_beta0 = logreg.intercept_
logreg_beta1 = logreg.coef_
logreg_thresh = -logreg_beta0/logreg_beta1

logreg_beta0, logreg_beta1, logreg_thresh

In [None]:
#Training Accuracy
score = logreg.score(X_train, y_train)
print(score)

#### Baseline Performance That Classifier Should Aim to Beat

In [None]:
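# Test Accuracy at each threshold
accuracy_by_threshold = {}
for threshold in thresholds:
    y_pred = (y_proba > threshold).astype(int)
    accuracy_by_threshold[threshold] = accuracy_score(y_test, y_pred)
# Accuracy holds the value for the last threshold (0.9); the dict keeps the others for comparison
Accuracy = accuracy_by_threshold[thresholds[-1]]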

In [None]:
Accuracy

#### Test accuracy score to beat is 88.1% (the accuracy at the last threshold in the loop, 0.9)
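
A common reference point for a baseline is the accuracy obtained by always predicting the majority class. The sketch below uses scikit-learn's `DummyClassifier` on synthetic labels with an illustrative 88/12 split; the numbers are not results from this notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic labels: 88 negatives, 12 positives
y_toy = np.array([0] * 88 + [1] * 12)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by the dummy model

# most_frequent always predicts the majority class seen during fit
majority = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_acc = accuracy_score(y_toy, majority.predict(X_toy))

print(baseline_acc)
```

On a strongly imbalanced target this majority-class accuracy is already high, which is worth keeping in mind when reading accuracy scores for real models.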

### Problem 8: A Simple Model

Use Logistic Regression to build a basic model on your data.  

### Answer 8: Building Logistic Regression Model

#### Evaluating priority features via L1 regularization

In [None]:
# Naming feature and outcome variables
X = Baseline.drop(['Subscribed'], axis=1)
y = Baseline['Subscribed']

In [None]:
# Normalizing data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [None]:
Cs = np.logspace(-5, .5)

In [None]:
coef_list = []
for C in Cs:
    lgr = LogisticRegression(penalty = 'l1', solver = 'liblinear', C = C, random_state=42, max_iter = 1000).fit(X_scaled, y)
    coef_list.append(list(lgr.coef_[0]))

In [None]:
coef_list[0]

In [None]:
coef_df = pd.DataFrame(coef_list, columns = X.columns)
coef_df.index = Cs

In [None]:
coef_df.head()

In [None]:
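# Ranking features by absolute coefficient at the largest C (weakest regularization)
coef_df.iloc[-1].abs().sort_values(ascending=False)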

In [None]:
plt.figure(figsize = (12, 5))
plt.semilogx(coef_df)
plt.gca().invert_xaxis()
plt.grid()
plt.legend(list(coef_df.columns));
plt.title('Increasing Regularization on Baseline Model')
plt.xlabel("Increasing 1/C")
plt.savefig('coefl1.png')
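
The effect shown in the plot can also be summarized numerically by counting how many coefficients remain nonzero at each value of C. This is a sketch on synthetic data from `make_classification`, not the bank data; the grid of C values is an arbitrary assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic problem: 10 features, 3 of them informative
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=42
)
X_toy = StandardScaler().fit_transform(X_toy)

# Smaller C means stronger L1 regularization and fewer surviving coefficients
nonzero_counts = {}
for C in [1e-4, 1e-2, 1.0, 100.0]:
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C, random_state=42)
    model.fit(X_toy, y_toy)
    nonzero_counts[C] = int(np.count_nonzero(model.coef_))

print(nonzero_counts)
```

Features whose coefficients survive the longest as C shrinks are the ones the L1 penalty treats as most useful.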

#### After evaluation, top features are Contact_In_Prior_Campaign and Euribor_3Month_Daily

##### Visualizing Model Prior to Creating Data Frame for Simple Model

In [None]:
Optimized_Model = Drop10[['Contact_In_Prior_Campaign', 'Euribor_3Month_Daily', 'Subscribed']]
Optimized_Model

In [None]:
Contacted_Prior_Campaign = pd.DataFrame(Optimized_Model["Contact_In_Prior_Campaign"].replace({"no": 0, "yes": 1}))
Subscribed_Outcome = pd.DataFrame(Optimized_Model["Subscribed"].replace({"no": 0, "yes": 1}))
Euribor_Rate = pd.DataFrame(Optimized_Model['Euribor_3Month_Daily'])

In [None]:
OptimizedDF = pd.concat([Euribor_Rate,Contacted_Prior_Campaign, Subscribed_Outcome], axis = 1)

In [None]:
OptimizedDF

#### Histogram

In [None]:
sns.countplot(data=Drop10, x = 'Subscribed')
plt.title('Count of target observations')

##### Imbalanced Target classes

In [None]:
y_test.value_counts(normalize = True)

In [None]:
y_train.value_counts(normalize = True)

##### Based on the above results the baseline for "yes" subscribed is 11.8%.

#### Scatter Plot

In [None]:
sns.scatterplot(data = OptimizedDF, x='Euribor_3Month_Daily', 
                y='Contact_In_Prior_Campaign',
                hue='Subscribed')
plt.title('Model Scatterplot: Euribor Rate by Contact In Prior Campaign')

##### On this scatter plot for the Simple Model, it appears that if the client was contacted in a prior campaign and the Euribor rate was below 2%, the client was more likely to subscribe.  This simple model will be compared with Logistic Regression, k-Nearest Neighbors, Decision Tree, and Support Vector Machine algorithms. Accuracy, the number of correct predictions divided by the total number of predictions, will be used as the evaluation metric to compare models.
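
To make the accuracy definition concrete, the sketch below computes it by hand from a confusion matrix and checks it against `accuracy_score`. The label arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels: 8 negatives and 2 positives
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Accuracy = correct predictions / all predictions
manual_accuracy = (tp + tn) / (tp + tn + fp + fn)

print(manual_accuracy, accuracy_score(y_true, y_pred))
```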

##### Data Frame for Simple Model

In [None]:
Simple_Model_Variables = Drop10[['Contact_In_Prior_Campaign','Euribor_3Month_Daily', 'Subscribed']]
Simple_Model_Variables

In [None]:
#New data frame with numeric dummy variables
CtgrcalVarbls2 = dummiescontactedprior
NmrclVrbls2 = Drop10[['Euribor_3Month_Daily']]

In [None]:
#Joining Outcome Variable to Data Frame
SimpleModel = pd.concat([CtgrcalVarbls2, NmrclVrbls2, Subscribed_Outcome], axis = 1)

In [None]:
SimpleModel.head()

In [None]:
BalanceData = SimpleModel.query('Subscribed == 1')

In [None]:
BalanceData

In [None]:
# Naming feature and outcome variables
X = BalanceData.drop(['Subscribed'], axis=1)
y = BalanceData['Subscribed']

#### Handling the problem of imbalanced data when comparing accuracy results. Accuracy was chosen as the comparison metric because, in a binary classification problem, it may best capture the number of correct predictions against the total number of predictions.

In [None]:
yes_subscribed = len(SimpleModel[SimpleModel['Subscribed'] == 1])
no_subscribed_indices = SimpleModel[SimpleModel.Subscribed == 0].index

In [None]:
random_indices = np.random.choice(no_subscribed_indices, yes_subscribed, replace = False)

In [None]:
yes_subscribed_indices = SimpleModel[SimpleModel.Subscribed == 1].index

In [None]:
under_sample_indices = np.concatenate([yes_subscribed_indices,random_indices])

In [None]:
under_sample = SimpleModel.loc[under_sample_indices]

In [None]:
sns.countplot(data=under_sample, x = 'Subscribed')
plt.title('Count of target observations')
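
The undersampling steps above can be condensed with pandas' grouped sampling, and a fixed seed makes the draw repeatable between runs. The toy frame and the choice of `random_state=42` are assumptions for illustration.

```python
import pandas as pd

# Toy frame: 2 subscribers and 6 non-subscribers
toy = pd.DataFrame({
    "Euribor_3Month_Daily": [4.8, 4.9, 1.3, 4.1, 0.9, 4.9, 1.2, 4.8],
    "Subscribed": [0, 0, 1, 0, 1, 0, 0, 0],
})

# Size of the smaller class
n_minority = int(toy["Subscribed"].value_counts().min())

# Draw the same number of rows from each class; random_state fixes the draw
balanced = toy.groupby("Subscribed").sample(n=n_minority, random_state=42)

print(balanced["Subscribed"].value_counts())
```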

#### Sampling from balanced data set. Please note that results below will vary based on new sample. Please refer to the tables below for results (e.g., first recorded outcomes on training and test accuracies, coefficients, etc.).

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Naming feature and outcome variables for balanced data set
X_under = under_sample.loc[:,under_sample.columns != 'Subscribed']
y_under = under_sample.loc[:,under_sample.columns == 'Subscribed']
X_under_train, X_under_test, y_under_train, y_under_test = train_test_split(X_under,y_under,test_size = 0.25, random_state = 0)

X = X_under 
y = y_under 

### Problem 9: Score the Model

What is the accuracy of your model?

### Answer 9: Scoring the Model

#### Performing Logistic Regression on Model

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
logreg = LogisticRegression(multi_class='multinomial')
logreg.fit(X_train, y_train)
y_proba = logreg.predict_proba(X_test)[:, 1]
thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

In [None]:
# Obtaining Intercept
logreg.intercept_

In [None]:
# Obtaining Coefficients
logreg.coef_

In [None]:
logreg_beta0 = logreg.intercept_
logreg_beta1 = logreg.coef_
logreg_thresh = -logreg_beta0/logreg_beta1
logreg_beta0, logreg_beta1, logreg_thresh

In [None]:
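accuracy_by_threshold = {}
for threshold in thresholds:
    y_pred = (y_proba > threshold).astype(int)
    accuracy_by_threshold[threshold] = accuracy_score(y_test, y_pred)
Accuracy = accuracy_by_threshold[thresholds[-1]]  # value for the last threshold (0.9)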
  

In [None]:
d = {'Intercept': ['0.99', '', ''],
     'Coefficients (Prior Contact, No Prior Contact, Euribor)': ['-0.45', '0.45', ' -0.20'],
     'Thresholds (Prior Contact, No Prior Contact, Euribor)': ['2.18', '-2.18', '4.85']}

In [None]:
PreSimpleModelTable = pd.DataFrame(data=d)

In [None]:
SimpleModelTable = PreSimpleModelTable.set_index('Intercept')

In [None]:
SimpleModelTable

#### Simple Model Coefficients Table
##### In this model, the features with the most explanatory value for the outcome variable were the 'No Prior Contact' and 'Prior Contact' variables. One interpretation is that, to increase subscriptions, it might be best to contact targeted groups multiple times across campaigns.

In [None]:
# Obtaining training accuracy for model
score = logreg.score(X_train, y_train)
print(score)

In [None]:
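# Obtaining test accuracy at each threshold
accuracy_by_threshold = {}
for threshold in thresholds:
    y_pred = (y_proba > threshold).astype(int)
    accuracy_by_threshold[threshold] = accuracy_score(y_test, y_pred)
# Accuracy holds the value for the last threshold (0.9)
Accuracy = accuracy_by_threshold[thresholds[-1]]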

In [None]:
Accuracy

### Problem 10: Model Comparisons

Now, we aim to compare the performance of the Logistic Regression model to our KNN algorithm, Decision Tree, and SVM models.  Using the default settings for each of the models, fit and score each.  Also, be sure to compare the fit time of each of the models.  Present your findings in a `DataFrame` similar to that below:

| Model | Train Time | Train Accuracy | Test Accuracy |
| ----- | ---------- | -------------  | -----------   |
|       |            |                |               |

### Answer 10: Setting Up Model for Comparisons

### k-Nearest Neighbors

In [None]:
print(X.head())
print('==============')
print(y.head())

In [None]:
#train test split with 25% of the data assigned as the test set
#Set random_state = 42 to assure correct grading

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
print(X_train.shape)
print(X_test.shape)


In [None]:
KNNBasic = Pipeline([('scale', StandardScaler()), ('knn', KNeighborsClassifier(n_neighbors = 5))])
KNNBasic.fit(X_train, y_train)
KNNBasic_acc_train = KNNBasic.score(X_train, y_train)
KNNBasic_acc_test = KNNBasic.score(X_test, y_test)


In [None]:
KNNBasic_acc_train

In [None]:
KNNBasic_acc_test

#### Time to train was .1 seconds
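
Training time can be measured in code rather than read off the notebook's cell timer. The following is a minimal sketch using `time.perf_counter` around the `fit` call; the data is synthetic and the resulting time will differ from the figure reported above.

```python
import time

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the training set
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(500, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

knn_pipe = Pipeline([("scale", StandardScaler()),
                     ("knn", KNeighborsClassifier(n_neighbors=5))])

# Time only the fit call, not data creation or scoring
start = time.perf_counter()
knn_pipe.fit(X_toy, y_toy)
fit_seconds = time.perf_counter() - start

print(f"Fit time: {fit_seconds:.4f} s")
```

The same pattern works for the decision tree and SVM cells, which keeps the timings comparable across models.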

### Decision Tree


In [None]:
#train test split with 25% of the data assigned as the test set
#Set random_state = 42 to assure correct grading

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
print(X_train.shape)
print(X_test.shape)

In [None]:
dtree = DecisionTreeClassifier(max_depth = None, random_state = 42).fit(X_train, y_train)
print(dtree)

In [None]:
depth_1 = dtree.get_depth()
print(depth_1)

In [None]:
train_acc = dtree.score(X_train, y_train)
test_acc = dtree.score(X_test, y_test)
print(f'Training Accuracy: {train_acc: .3f}')
print(f'Test Accuracy: {test_acc: .3f}')

#### Time to train was 0.1 seconds

### Support Vector Machine


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
print(X_train.shape)
print(X_test.shape)

In [None]:
# Instantiating and fitting SVC estimator (linear kernel; the scikit-learn default is 'rbf')
svc_1 = SVC(kernel = 'linear').fit(X_train, y_train)
support_vectors = svc_1.support_vectors_
support_vectors

In [None]:
# Training Accuracy
y_pred = svc_1.predict(X_train)

In [None]:
accuracy = accuracy_score(y_train, y_pred)
precision = precision_score(y_train, y_pred)
recall = recall_score(y_train, y_pred)

In [None]:
accuracy

In [None]:
# Test Accuracy
y_pred = svc_1.predict(X_test)

In [None]:
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

In [None]:
accuracy

#### Time to train was 7 seconds
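Support vector machines are sensitive to feature scale, and the kNN model above was already fit inside a scaling pipeline. A minimal sketch of the same pattern for the SVM follows, with the fit time measured by `time.perf_counter` rather than read off by hand. The synthetic data and the names `svc_pipe` and `fit_seconds` are assumptions; whether scaling would change the 7-second fit on the bank data was not tested here.

```python
import time

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption), not the bank marketing data
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Scale features before the linear-kernel SVC, mirroring the kNN pipeline
svc_pipe = Pipeline([('scale', StandardScaler()), ('svc', SVC(kernel='linear'))])

# Time only the fit step
start = time.perf_counter()
svc_pipe.fit(X_tr, y_tr)
fit_seconds = time.perf_counter() - start

svc_train_acc = svc_pipe.score(X_tr, y_tr)
svc_test_acc = svc_pipe.score(X_te, y_te)
print(f'Fit time: {fit_seconds:.3f} s')
print(f'Training Accuracy: {svc_train_acc:.3f}')
print(f'Test Accuracy: {svc_test_acc:.3f}')
```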

In [None]:
Model_Type = ['Logistic Regression', 'kNN', 'Decision Tree', 'SVM']
Training_Accuracy = [0.716, 0.733, 0.746, 0.715]
Test_Accuracy = [0.587, 0.722, 0.724, 0.587]
Train_Time_Sec = [.1, .1, .1, 7]

In [None]:
df1 = pd.DataFrame(data=zip(Model_Type,Training_Accuracy,Test_Accuracy, Train_Time_Sec),columns=['Model','Training_Accuracy', 'Test_Accuracy', 'Train(Sec)'])

In [None]:
df1

In [None]:
ModelComparisonTable = df1.set_index('Model')

#### Model Comparisons

In [None]:
ModelComparisonTable

#### The best-performing model was the decision tree, with a test accuracy of 72.4%, closely followed by the k-Nearest Neighbors model at 72.2%. Both scores were lower than that of the earlier baseline model, which used an optimized set of attributes but was fit on imbalanced classes. The Logistic Regression model in the table above (58.7% test accuracy) serves as a better baseline because classes were balanced in that analysis. Compared with this baseline, the decision tree's 72.4% test accuracy was an improvement. The Support Vector Machine was the slowest model to train (7 seconds, versus about 0.1 seconds for each of the others).
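The comparison table above was assembled from hand-copied numbers. As a sketch of an alternative, the loop below fits four default-setting models, times each fit with `time.perf_counter`, and builds the table directly from the results. The synthetic data and the names `default_models` and `comparison_demo` are assumptions, so the printed values will not match the figures reported for the bank data.

```python
import time

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption), not the bank marketing data
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Default settings for each estimator (random_state only fixes the tree's tie-breaking)
default_models = {
    'Logistic Regression': LogisticRegression(),
    'kNN': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'SVM': SVC(),
}

rows = []
for name, model in default_models.items():
    start = time.perf_counter()
    model.fit(X_tr, y_tr)
    elapsed = time.perf_counter() - start  # fit time only, scoring excluded
    rows.append({
        'Model': name,
        'Train Time (s)': elapsed,
        'Train Accuracy': model.score(X_tr, y_tr),
        'Test Accuracy': model.score(X_te, y_te),
    })

comparison_demo = pd.DataFrame(rows).set_index('Model')
print(comparison_demo)
```

Building the table from the fitted objects avoids transcription slips, such as one model's score being copied into another model's row.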

### Problem 11: Improving the Model

Now that we have some basic models on the board, we want to try to improve these.  Below, we list a few things to explore in this pursuit.

- More feature engineering and exploration.  For example, should we keep the gender feature?  Why or why not?
- Hyperparameter tuning and grid search.  All of our models have additional hyperparameters to tune and explore.  For example the number of neighbors in KNN or the maximum depth of a Decision Tree.  
- Adjust your performance metric

### Answer 11: Improving the Model
#### Optimal feature variables were found early in the analysis via logistic regression. These features were the following:  

##### 1) The dichotomous variable describing whether a client was contacted in a prior campaign 
##### 2) The continuous variable describing the Euribor interest rate

###### Logistic regression test accuracy with these features was 58.7%.

###### Using these feature variables, the accuracies of the Logistic Regression, k-Nearest Neighbors, Decision Tree, and Support Vector Machine models were compared. The decision tree performed best, with a test accuracy of 72.4%, so its parameters will be optimized.

### Answer 11: Improving the Decision Tree Model

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state = 42)

In [None]:
# Optimizing parameters for the decision tree model with various cross-validated search methods
params = {'max_depth': [2, 5, 10],
          'min_samples_split': [0.1, 0.2, 0.05],
          'criterion': ['gini'],  # listing 'gini' three times only repeated identical fits
          'min_samples_leaf': [1, 10, 20]
         }

In [None]:
dtree = DecisionTreeClassifier()

In [None]:
#GridSearchCV
grid = GridSearchCV(DecisionTreeClassifier(random_state = 42), param_grid=params).fit(X_train, y_train)
grid_train_acc = grid.score(X_train, y_train)
grid_test_acc = grid.score(X_test, y_test)
best_params = grid.best_params_
print(f'Training Accuracy: {grid_train_acc: .3f}')
print(f'Test Accuracy: {grid_test_acc: .3f}')
print(f'Best parameters of tree: {best_params}')

##### Time 5 seconds

In [None]:
#RandomizedSearchCV
# random_state on the search makes the sampled candidates reproducible
random_search = RandomizedSearchCV(DecisionTreeClassifier(random_state = 42), params, random_state = 42).fit(X_train, y_train)
random_train_acc = random_search.score(X_train, y_train)
random_test_acc = random_search.score(X_test, y_test)
best_params = random_search.best_params_
print(f'Training Accuracy: {random_train_acc: .3f}')
print(f'Test Accuracy: {random_test_acc: .3f}')
print(f'Best parameters of tree: {best_params}')

##### Time 0.1 seconds

In [None]:
#HalvingGridSearchCV
halving = HalvingGridSearchCV(DecisionTreeClassifier(random_state = 42), param_grid=params).fit(X_train, y_train)
halving_train_acc = halving.score(X_train, y_train)
halving_test_acc = halving.score(X_test, y_test)
best_params = halving.best_params_
print(f'Training Accuracy: {halving_train_acc: .3f}')
print(f'Test Accuracy: {halving_test_acc: .3f}')
print(f'Best parameters of tree: {best_params}')

##### Time 5 seconds

In [None]:
# HalvingRandomSearchCV
halvingrand = HalvingRandomSearchCV(DecisionTreeClassifier(random_state = 42), params).fit(X_train, y_train)
halvingrand_train_acc = halvingrand.score(X_train, y_train)
halvingrand_test_acc = halvingrand.score(X_test, y_test)
best_params = halvingrand.best_params_
print(f'Training Accuracy: {halvingrand_train_acc: .3f}')
print(f'Test Accuracy: {halvingrand_test_acc: .3f}')
print(f'Best parameters of tree: {best_params}')

##### Time 5 seconds
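The four searches above share the same estimator and parameter grid, so they can also be run and timed in one loop. The sketch below does this on synthetic data. The names `demo_params`, `searches`, and `search_demo` are assumptions, and the halving searches require the experimental import shown in the first lines.

```python
import time

import pandas as pd
# The halving searches are experimental and must be enabled before import
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.datasets import make_classification
from sklearn.model_selection import (GridSearchCV, HalvingGridSearchCV,
                                     HalvingRandomSearchCV, RandomizedSearchCV,
                                     train_test_split)
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption), not the bank marketing data
X_demo, y_demo = make_classification(n_samples=1000, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

demo_params = {
    'max_depth': [2, 5, 10],
    'min_samples_split': [0.05, 0.1, 0.2],
    'min_samples_leaf': [1, 10, 20],
}
base_tree = DecisionTreeClassifier(random_state=42)

# random_state on the randomized and halving searches keeps the runs repeatable
searches = {
    'Grid': GridSearchCV(base_tree, param_grid=demo_params),
    'Randomized': RandomizedSearchCV(base_tree, param_distributions=demo_params, random_state=42),
    'HalvingGrid': HalvingGridSearchCV(base_tree, param_grid=demo_params, random_state=42),
    'HalvingRandom': HalvingRandomSearchCV(base_tree, param_distributions=demo_params, random_state=42),
}

rows = []
for name, search in searches.items():
    start = time.perf_counter()
    search.fit(X_tr, y_tr)
    elapsed = time.perf_counter() - start
    rows.append({
        'CVModel': name,
        'Seconds': elapsed,
        'Training_Accuracy': search.score(X_tr, y_tr),
        'Test_Accuracy': search.score(X_te, y_te),
        **search.best_params_,  # one column per tuned parameter
    })

search_demo = pd.DataFrame(rows).set_index('CVModel')
print(search_demo)
```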

In [None]:
SearchCVModel = ['Grid', 'Randomized', 'HalvingGrid', 'HalvingRandom']
Performance_Sec = [5, .1, 5, 5]
Training_Accuracy = [0.733, 0.739, 0.716, 0.740]
Test_Accuracy = [0.738, 0.727, 0.700, 0.735]
Max_Depth = [10, 10, 2, 5]
Min_Samples_Leaf = [1, 10, 20, 1]
Min_Samples_Split = [0.1, 0.05, 0.1, 0.05]
Criterion = ["Gini", "Gini", "Gini", "Gini"]

In [None]:
df = pd.DataFrame(data=zip(SearchCVModel,Performance_Sec,Training_Accuracy,Test_Accuracy, Max_Depth, Min_Samples_Leaf, Min_Samples_Split),columns=['CVModel','Performance_Seconds','Training_Accuracy','Test_Accuracy', 'Depth(Max)', 'Leaf(Min)', 'Split(Min)'])

In [None]:
df

In [None]:
DecisionTreeTable = df.set_index('CVModel')

In [None]:
DecisionTreeTable

In [None]:
# Another Visualization of Decision Tree Results
kwargs = dict(linestyle='dashed', color=['red', 'green'], linewidth=1.2)
line_plot = df.plot(y=['Training_Accuracy', 'Test_Accuracy'], figsize=(10, 6), **kwargs)
line_plot.set_title('Accuracy Line Plot (Test & Train)')
line_plot.grid()
plt.xlim([0, 3])
# Removed a stray trailing comma that turned the next statement into a tuple
line_plot.set_xlabel('0.0: GridSearchCV, 1.0: RandomizedSearchCV, 2.0: HalvingGridSearchCV, 3.0: HalvingRandomSearchCV')
line_plot.set_ylabel('Accuracy')

### GridSearchCV performed best, with a test accuracy of 73.8%. Training accuracy (73.3%) was slightly lower than test accuracy, which suggests the model was not overfit.
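A train/test comparison on a single split can be affected by how the split happened to fall. One way to check the gap more robustly is to average training and validation scores over several folds with `cross_validate`. The sketch below assumes synthetic data and reuses the reported tree parameters only for illustration; the names `cv_out`, `mean_train`, and `mean_valid` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption), not the bank marketing data
X_demo, y_demo = make_classification(n_samples=800, n_features=8, random_state=42)

demo_tree = DecisionTreeClassifier(max_depth=10, min_samples_leaf=1,
                                   min_samples_split=0.10, random_state=42)

# return_train_score=True adds the per-fold training accuracy to the output
cv_out = cross_validate(demo_tree, X_demo, y_demo, cv=5, return_train_score=True)

mean_train = cv_out['train_score'].mean()
mean_valid = cv_out['test_score'].mean()
print(f'Mean training accuracy:   {mean_train:.3f}')
print(f'Mean validation accuracy: {mean_valid:.3f}')
print(f'Gap (train - validation): {mean_train - mean_valid:.3f}')
```

A small gap across folds is stronger evidence against overfitting than one split alone.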

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
plt.subplots(figsize=(18,19))
clf = DecisionTreeClassifier(max_depth = 10, min_samples_leaf=1, min_samples_split = 0.10,
                             random_state = 42)
clf.fit(X_train, y_train)
tree.plot_tree(clf)

### Above is the best-performing model: the decision tree with the following parameters (determined by GridSearchCV):

#### max_depth = 10, min_samples_leaf=1, min_samples_split = 0.10
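A plotted tree of depth 10 can be hard to read. As a complement, `sklearn.tree.export_text` prints the split rules as indented text. The sketch below uses a shallow tree on synthetic two-feature data so the output stays short; the feature names `prior_contact` and `euribor_3m` are hypothetical labels, not the notebook's actual column names.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic two-feature data (assumption); n_redundant=0 is required when n_features=2
X_demo, y_demo = make_classification(n_samples=300, n_features=2, n_informative=2,
                                     n_redundant=0, random_state=42)

# Hypothetical feature labels for readability
demo_feature_names = ['prior_contact', 'euribor_3m']

# A shallow tree keeps the printed rules short
small_tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_demo, y_demo)

rules = export_text(small_tree, feature_names=demo_feature_names)
print(rules)
```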

# Summary Description

## 1) Business Understanding: 

### The analyzed data set is from a Portuguese retail bank and was initially gathered to observe the effects of the financial crisis. Here, the data set is used for classification in order to understand which features are relevant to long-term bank deposit subscriptions.

## 2) Data Understanding:

### Examining the **Materials and Methods** section of the paper, it appears the dataset is related to 17 campaigns that occurred between May 2008 and November 2010. It also appears that 20 attributes are included, in the categories of bank client data, social and economic attributes, and additional campaign attributes.

### The main question is whether or not a client will subscribe to a long-term bank deposit. The business goal is to increase campaign efficiency by finding a model that can explain whether the client subscribes to the deposit. Identifying the main characteristics that affect success can help with the management of human effort, phone calls, time, and other available resources. In addition, this can aid in selecting a high-quality and affordable set of potential buying customers.

## 3) Data Preparation

### Data exploration and analysis included 20 initial attributes. While carefully examining the data features, the following actions occurred:

#### - 'Unknown' values for categorical variables 'loan', 'housing', 'education', 'marital', 'job' were removed for lack of contribution to meaningful results

#### - Created a new dummy variable column (Yes/No) from the 'pdays' column. According to the "Attribute Information", 'pdays' is the number of days that passed after the client was last contacted in a previous campaign ("999" means the client was not previously contacted; any other numerical value means the client was previously contacted)

#### - Euribor_3Month_Daily feature was retained while highly correlated features (> 0.90) were removed to reduce dimensionality
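The three preparation steps listed above can be sketched in pandas as follows. The tiny data frame, its values, and the column names `euribor3m`, `emp.var.rate`, and `prior_contact` are assumptions made for illustration; the notebook's own column names and full data differ.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame standing in for the bank marketing data
raw = pd.DataFrame({
    'job': ['admin.', 'unknown', 'technician', 'services', 'admin.'],
    'marital': ['married', 'single', 'unknown', 'single', 'married'],
    'pdays': [999, 3, 999, 6, 999],
    'euribor3m': [4.857, 1.313, 4.961, 0.879, 4.191],
    'emp.var.rate': [1.1, -1.8, 1.4, -3.4, 1.1],
})

# Step 1: drop rows with 'unknown' in any of the listed categorical columns
cat_cols = ['job', 'marital']
clean = raw[~raw[cat_cols].eq('unknown').any(axis=1)].copy()

# Step 2: turn 'pdays' into a Yes/No flag (999 means no prior contact)
clean['prior_contact'] = np.where(clean['pdays'] == 999, 'No', 'Yes')

# Step 3: keep the Euribor column and drop numeric columns correlated with it above 0.90
numeric_cols = ['euribor3m', 'emp.var.rate']
keep = 'euribor3m'
corr = clean[numeric_cols].corr().abs()
to_drop = [c for c in numeric_cols if c != keep and corr.loc[keep, c] > 0.90]

prepared = clean.drop(columns=to_drop + ['pdays'])
print(prepared)
```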


## 4) Modeling:

### All remaining features were dummy coded if not continuous, and evaluated through Logistic Regression. Priority features were determined through L1 regularization.

### The top two features were whether the client was contacted in a prior campaign and the numerical inter-bank interest rate (Euribor).
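L1 regularization shrinks the coefficients of weak features to exactly zero, so the features that keep non-zero coefficients can be read as the priority set. The sketch below illustrates the idea on synthetic data. The value `C=0.05`, the `liblinear` solver, and the names `l1_pipe` and `kept` are assumptions, not the settings used in the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 10 features, only 2 of which carry signal
X_demo, y_demo = make_classification(n_samples=500, n_features=10, n_informative=2,
                                     n_redundant=0, random_state=42)

# Scaling first puts all coefficients on a comparable footing;
# a small C means a strong L1 penalty
l1_pipe = Pipeline([
    ('scale', StandardScaler()),
    ('logreg', LogisticRegression(penalty='l1', solver='liblinear', C=0.05)),
])
l1_pipe.fit(X_demo, y_demo)

coefs = l1_pipe.named_steps['logreg'].coef_.ravel()

# Rank features by absolute coefficient and keep those the penalty did not zero out
ranked = np.argsort(-np.abs(coefs))
kept = [int(i) for i in ranked if coefs[i] != 0]
print('Coefficients:', np.round(coefs, 3))
print('Feature indices with non-zero coefficients, strongest first:', kept)
```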

#### Classes were balanced, and accuracy was chosen as the metric for model comparison, because with balanced classes accuracy can reasonably capture prediction quality in a binary classification problem.
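The link between class balance and accuracy can be shown with a small worked example. A model that always predicts the majority class scores 90% accuracy on a 90/10 class split but only 50% on a 50/50 split. The arrays below are made-up labels used purely for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Made-up labels: 90 negatives and 10 positives
y_imbalanced = np.array([0] * 90 + [1] * 10)
# Made-up labels: 50 negatives and 50 positives
y_balanced = np.array([0] * 50 + [1] * 50)

# A "model" that always predicts the negative class
always_negative = np.zeros(100, dtype=int)

acc_imbalanced = accuracy_score(y_imbalanced, always_negative)           # 0.9
bal_acc_imbalanced = balanced_accuracy_score(y_imbalanced, always_negative)  # 0.5
acc_balanced = accuracy_score(y_balanced, always_negative)               # 0.5

print(f'Accuracy on 90/10 classes:          {acc_imbalanced:.2f}')
print(f'Balanced accuracy on 90/10 classes: {bal_acc_imbalanced:.2f}')
print(f'Accuracy on 50/50 classes:          {acc_balanced:.2f}')
```

With balanced classes, plain accuracy and balanced accuracy agree for this predictor, which is why accuracy is a reasonable metric once the classes have been balanced.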

## 5) Evaluation

### In this model, the features with the most explanatory value for the outcome variable were the 'No Prior Contact' and 'Prior Contact' variables. One interpretation is that, to increase subscriptions, it might be best to contact targeted groups multiple times across campaigns.

#### Summary of Key Features and Coefficients

In [None]:
SimpleModelTable

### The decision tree with parameters found through GridSearchCV performed best, with a test accuracy of 73.8%. Training accuracy (73.3%) was slightly lower than test accuracy, which suggests the model was not overfit.

#### Summary of Model Comparison

In [None]:
ModelComparisonTable

#### Summary of Decision Tree Parameter Search

In [None]:
DecisionTreeTable

## 6) Deployment

### From this particular analysis, it appears that if the client was contacted in a prior campaign, and the current Euribor rate was below 2%, the client was more likely to subscribe to a long-term bank deposit. 

### Interestingly, it appears that the inter-bank interest rate (Euribor), rather than characteristics of clients, contributes more to the decision to subscribe to a long-term bank deposit.

### Going forward, it seems telemarketing calls should be made when the Euribor is low. Target or priority individuals should also be called across campaigns to increase the likelihood of subscription. The next step is to determine which priority clients to call across campaigns. In addition, methods need to be devised to schedule calls according to the fluctuating Euribor.
