<a href="https://colab.research.google.com/github/OBULAMRANA/comparing_classifiers/blob/main/prompt_III.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Practical Application III: Comparing Classifiers

**Overview**: In this practical application, your goal is to compare the performance of the classifiers we encountered in this section, namely K Nearest Neighbor, Logistic Regression, Decision Trees, and Support Vector Machines.  We will utilize a dataset related to marketing bank products over the telephone.  



### Getting Started

Our dataset comes from the UCI Machine Learning repository [link](https://archive.ics.uci.edu/ml/datasets/bank+marketing).  The data is from a Portugese banking institution and is a collection of the results of multiple marketing campaigns.  We will make use of the article accompanying the dataset [here](CRISP-DM-BANK.pdf) for more information on the data and features.



### Problem 1: Understanding the Data

To gain a better understanding of the data, please read the information provided in the UCI link above, and examine the **Materials and Methods** section of the paper.  How many marketing campaigns does this data represent?

### Problem 2: Read in the Data

Use pandas to read in the dataset `bank-additional-full.csv` and assign to a meaningful variable name.

In [3]:
import pandas as pd

In [4]:
df = pd.read_csv('/bin/data/bank-additional.csv', sep = ';')

In [5]:
df.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,30,blue-collar,married,basic.9y,no,yes,no,cellular,may,fri,...,2,999,0,nonexistent,-1.8,92.893,-46.2,1.313,5099.1,no
1,39,services,single,high.school,no,no,no,telephone,may,fri,...,4,999,0,nonexistent,1.1,93.994,-36.4,4.855,5191.0,no
2,25,services,married,high.school,no,yes,no,telephone,jun,wed,...,1,999,0,nonexistent,1.4,94.465,-41.8,4.962,5228.1,no
3,38,services,married,basic.9y,no,unknown,unknown,telephone,jun,fri,...,3,999,0,nonexistent,1.4,94.465,-41.8,4.959,5228.1,no
4,47,admin.,married,university.degree,no,yes,no,cellular,nov,mon,...,1,999,0,nonexistent,-0.1,93.2,-42.0,4.191,5195.8,no


### Problem 3: Understanding the Features


Examine the data description below, and determine if any of the features are missing values or need to be coerced to a different data type.


```
Input variables:
# bank client data:
1 - age (numeric)
2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5 - default: has credit in default? (categorical: 'no','yes','unknown')
6 - housing: has housing loan? (categorical: 'no','yes','unknown')
7 - loan: has personal loan? (categorical: 'no','yes','unknown')
# related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: 'cellular','telephone')
9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
# other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
# social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):
21 - y - has the client subscribed a term deposit? (binary: 'yes','no')
```



In [6]:
# clean up the data contact, month,day_of_week, duration, emp.var.rate, cons.price.idx, cons.price.idx, cons.conf.idx, euribor3m,nr.employed
bank_df = df.drop(columns=['duration', 'contact', 'month', 'day_of_week', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed'])
bank_df.head()
#Rana cleanup the rows that have  education_unknown	default_unknown		housing_unknown	housing_yes	loan_unknown from bank_df
#bank_df = df.drop(row=['education_unknown', 'default_unknown', 'housing_unknown','loan_unknown'])
#bank_df - bank_df.drop(['education_unknown', 'default_unknown', 'housing_unknown','loan_unknown'])
#bank_df.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,campaign,pdays,previous,poutcome,y
0,30,blue-collar,married,basic.9y,no,yes,no,2,999,0,nonexistent,no
1,39,services,single,high.school,no,no,no,4,999,0,nonexistent,no
2,25,services,married,high.school,no,yes,no,1,999,0,nonexistent,no
3,38,services,married,basic.9y,no,unknown,unknown,3,999,0,nonexistent,no
4,47,admin.,married,university.degree,no,yes,no,1,999,0,nonexistent,no


### Problem 4: Understanding the Task

After examining the description and data, your goal now is to clearly state the *Business Objective* of the task.  State the objective below.

In [7]:
bank_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4119 entries, 0 to 4118
Data columns (total 12 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        4119 non-null   int64 
 1   job        4119 non-null   object
 2   marital    4119 non-null   object
 3   education  4119 non-null   object
 4   default    4119 non-null   object
 5   housing    4119 non-null   object
 6   loan       4119 non-null   object
 7   campaign   4119 non-null   int64 
 8   pdays      4119 non-null   int64 
 9   previous   4119 non-null   int64 
 10  poutcome   4119 non-null   object
 11  y          4119 non-null   object
dtypes: int64(4), object(8)
memory usage: 386.3+ KB


Based on the cleaned dataset, we need numerical data. The current dataset contains mixed, numerical and catagorical.
Catogorical needs to be transfarmed into numerical by using transformers.

### Problem 5: Engineering Features

Now that you understand your business objective, we will build a basic model to get started.  Before we can do this, we must work to encode the data.  Using just the bank information features, prepare the features and target column for modeling with appropriate encoding and transformations.

In [8]:
# #  identify numerical and categorical columns and transform the catagorical columns
n_cols = bank_df.select_dtypes(include=['int64', 'float64']).columns
c_cols = bank_df.select_dtypes(include=['object']).columns

print("Bank Numerical columns:", n_cols)
print("Bank Categorical columns:", c_cols)

# transform the categorical columns using one-hot encoding
bank_df_trans_to_r_c = pd.get_dummies(bank_df, columns=c_cols, drop_first=True)
bank_df_trans_to_r_c = bank_df_trans_to_r_c.drop(columns=['education_unknown', 'default_unknown', 'housing_unknown','loan_unknown'])
display(bank_df_trans_to_r_c.head())


Bank Numerical columns: Index(['age', 'campaign', 'pdays', 'previous'], dtype='object')
Bank Categorical columns: Index(['job', 'marital', 'education', 'default', 'housing', 'loan', 'poutcome',
       'y'],
      dtype='object')


Unnamed: 0,age,campaign,pdays,previous,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,...,education_high.school,education_illiterate,education_professional.course,education_university.degree,default_yes,housing_yes,loan_yes,poutcome_nonexistent,poutcome_success,y_yes
0,30,2,999,0,True,False,False,False,False,False,...,False,False,False,False,False,True,False,True,False,False
1,39,4,999,0,False,False,False,False,False,False,...,True,False,False,False,False,False,False,True,False,False
2,25,1,999,0,False,False,False,False,False,False,...,True,False,False,False,False,True,False,True,False,False
3,38,3,999,0,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
4,47,1,999,0,False,False,False,False,False,False,...,False,False,False,True,False,True,False,True,False,False


In [9]:
bank_df_trans_to_r_c.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4119 entries, 0 to 4118
Data columns (total 30 columns):
 #   Column                         Non-Null Count  Dtype
---  ------                         --------------  -----
 0   age                            4119 non-null   int64
 1   campaign                       4119 non-null   int64
 2   pdays                          4119 non-null   int64
 3   previous                       4119 non-null   int64
 4   job_blue-collar                4119 non-null   bool 
 5   job_entrepreneur               4119 non-null   bool 
 6   job_housemaid                  4119 non-null   bool 
 7   job_management                 4119 non-null   bool 
 8   job_retired                    4119 non-null   bool 
 9   job_self-employed              4119 non-null   bool 
 10  job_services                   4119 non-null   bool 
 11  job_student                    4119 non-null   bool 
 12  job_technician                 4119 non-null   bool 
 13  job_unemployed    

### Problem 6: Train/Test Split

With your data prepared, split it into a train and test set.

In [11]:
from sklearn.model_selection import train_test_split

# Separate features (X) and target (y)
X = bank_df_trans_to_r_c.drop('y_yes', axis=1)
y = bank_df_trans_to_r_c['y_yes']

# Split the data into training and testing sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

X_train shape: (3295, 29)
X_test shape: (824, 29)
y_train shape: (3295,)
y_test shape: (824,)


### Problem 7: A Baseline Model

Before we build our first model, we want to establish a baseline.  What is the baseline performance that our classifier should aim to beat?

In [12]:
# Calculate the frequency of the target variable
bank_df_trans_to_r_c_target_counts = bank_df_trans_to_r_c['y_yes'].value_counts()
print("Target variable counts for bank_df_trans_to_r_c :\n", bank_df_trans_to_r_c_target_counts)

# Determine the most frequent class and calculate the baseline accuracy
baseline_accuracy = bank_df_trans_to_r_c_target_counts.max() / bank_df_trans_to_r_c_target_counts.sum()
print("\nBaseline accuracy (by predicting the most frequent class):", baseline_accuracy)

Target variable counts for bank_df_trans_to_r_c :
 y_yes
False    3668
True      451
Name: count, dtype: int64

Baseline accuracy (by predicting the most frequent class): 0.890507404709881


### Problem 8: A Simple Model

Use Logistic Regression to build a basic model on your data.  

In [13]:
from sklearn.linear_model import LogisticRegression

# Initializing the Logistic Regression model
Logistic_Logistic_model = LogisticRegression(max_iter=500) # Increased max_iter for convergence

# Training the model
Logistic_Logistic_model.fit(X_train, y_train)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


### Problem 9: Score the Model

What is the accuracy of your model?

In [14]:
# Score the Logistic Regression model on the test data
log_reg_accuracy = Logistic_Logistic_model.score(X_test, y_test)

print(f"Logistic Regression Model Accuracy: {log_reg_accuracy:.4f}")

Logistic Regression Model Accuracy: 0.8993


### Problem 10: Model Comparisons

Now, we aim to compare the performance of the Logistic Regression model to our KNN algorithm, Decision Tree, and SVM models.  Using the default settings for each of the models, fit and score each.  Also, be sure to compare the fit time of each of the models.  Present your findings in a `DataFrame` similar to that below:

| Model | Train Time | Train Accuracy | Test Accuracy |
| ----- | ---------- | -------------  | -----------   |
|     |    |.     |.     |

In [15]:
import time
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
import pandas as pd

# Initialize models
knn_model = KNeighborsClassifier()
decision_tree_model = DecisionTreeClassifier(random_state=42)
svm_model = SVC(random_state=42)

# Create a list of models to compare
models = {
    'Logistic Regression': Logistic_Logistic_model,
    'K Nearest Neighbors': knn_model,
    'Decision Tree': decision_tree_model,
    'SVM': svm_model
}

results = []

for model_name, model in models.items():
    start_time = time.time()
    model.fit(X_train, y_train)
    end_time = time.time()
    train_time = end_time - start_time

    y_train_pred = model.predict(X_train)
    train_accuracy = accuracy_score(y_train, y_train_pred)

    y_test_pred = model.predict(X_test)
    test_accuracy = accuracy_score(y_test, y_test_pred)

    results.append({
        'Model': model_name,
        'Train Time': train_time,
        'Train Accuracy': train_accuracy,
        'Test Accuracy': test_accuracy
    })

results_df = pd.DataFrame(results)
display(results_df)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Unnamed: 0,Model,Train Time,Train Accuracy,Test Accuracy
0,Logistic Regression,0.465326,0.901062,0.899272
1,K Nearest Neighbors,0.003912,0.906222,0.878641
2,Decision Tree,0.01997,0.985129,0.830097
3,SVM,0.153188,0.900455,0.899272


### Problem 11: Improving the Model

Now that we have some basic models on the board, we want to try to improve these.  Below, we list a few things to explore in this pursuit.

- More feature engineering and exploration.  For example, should we keep the gender feature?  Why or why not?
- Hyperparameter tuning and grid search.  All of our models have additional hyperparameters to tune and explore.  For example the number of neighbors in KNN or the maximum depth of a Decision Tree.  
- Adjust your performance metric

In [21]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
tuned_results = []
# Define a range of k values to test
k_range = list(range(1, 31))
param_grid = dict(n_neighbors=k_range)

# Initialize KNN model
knn_model = KNeighborsClassifier()

# Initialize GridSearchCV with cross-validation
grid = GridSearchCV(knn_model, param_grid, cv=10, scoring='accuracy')
start_time = time.time()

# Fit the grid search to the training data
grid.fit(X_train, y_train)

# Print the best k and best score
print("Best number of neighbors (k):", grid.best_params_['n_neighbors'])
print("Best cross-validation accuracy:", grid.best_score_)

# Train a KNN model with the best k
best_knn_model = KNeighborsClassifier(n_neighbors=grid.best_params_['n_neighbors'])
best_knn_model.fit(X_train, y_train)

# Evaluate the best KNN model on the test set
test_accuracy_best_knn = best_knn_model.score(X_test, y_test)
print("Test accuracy with best k:", test_accuracy_best_knn)
end_time = time.time()
train_time = end_time - start_time
tuned_results.append({
    'Model': 'K Nearest Neighbors (Tuned)',
    'Train Time': train_time, # Need to capture train time for tuned models
    'Train Accuracy': best_knn_model.score(X_train, y_train),
    'Test Accuracy': test_accuracy_best_knn
})

Best number of neighbors (k): 27
Best cross-validation accuracy: 0.9004485585336649
Test accuracy with best k: 0.9004854368932039


In [22]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Define the parameter grid for Decision Tree
param_grid_dt = {
    'max_depth': [None, 5, 10, 15, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize Decision Tree model
dt = DecisionTreeClassifier(random_state=42)

# Initialize GridSearchCV with cross-validation
grid_dt = GridSearchCV(dt, param_grid_dt, cv=10, scoring='accuracy')
start_time = time.time()
# Fit the grid search to the training data
grid_dt.fit(X_train, y_train)

# Print the best parameters and best score
print("Best parameters for Decision Tree:", grid_dt.best_params_)
print("Best cross-validation accuracy for Decision Tree:", grid_dt.best_score_)

# Train a Decision Tree model with the best parameters
best_dt_model = DecisionTreeClassifier(**grid_dt.best_params_, random_state=42)
best_dt_model.fit(X_train, y_train)

# Evaluate the best Decision Tree model on the test set
test_accuracy_best_dt = best_dt_model.score(X_test, y_test)
print("Test accuracy with best Decision Tree model:", test_accuracy_best_dt)

end_time = time.time()
train_time = end_time - start_time
tuned_results.append({
    'Model': 'Decision Tree (Tuned)',
    'Train Time': time.time() - start_time, # Need to capture train time for tuned models
    'Train Accuracy': best_dt_model.score(X_train, y_train),
    'Test Accuracy': test_accuracy_best_dt
})

Best parameters for Decision Tree: {'max_depth': 5, 'min_samples_leaf': 4, 'min_samples_split': 2}
Best cross-validation accuracy for Decision Tree: 0.8955936262319242
Test accuracy with best Decision Tree model: 0.8932038834951457


In [23]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Define the parameter grid for Logistic Regression
param_grid_lr = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2']
}

# Initialize Logistic Regression model
# Use a solver that supports both l1 and l2 penalties, like 'liblinear'
lr = LogisticRegression(solver='liblinear', max_iter=1000, random_state=42)

# Initialize GridSearchCV with cross-validation
grid_lr = GridSearchCV(lr, param_grid_lr, cv=10, scoring='accuracy')

start_time = time.time()

# Fit the grid search to the training data
grid_lr.fit(X_train, y_train)

# Print the best parameters and best score
print("Best parameters for Logistic Regression:", grid_lr.best_params_)
print("Best cross-validation accuracy for Logistic Regression:", grid_lr.best_score_)

# Train a Logistic Regression model with the best parameters
best_lr_model = LogisticRegression(**grid_lr.best_params_, solver='liblinear', max_iter=1000, random_state=42)
best_lr_model.fit(X_train, y_train)

# Evaluate the best Logistic Regression model on the test set
test_accuracy_best_lr = best_lr_model.score(X_test, y_test)
print("Test accuracy with best Logistic Regression model:", test_accuracy_best_lr)

end_time = time.time()
train_time = end_time - start_time
tuned_results.append({
    'Model': 'Logistic Regression (Tuned)',
    'Train Time': time.time() - start_time, # Need to capture train time for tuned models
    'Train Accuracy': best_lr_model.score(X_train, y_train),
    'Test Accuracy': test_accuracy_best_lr
})


Best parameters for Logistic Regression: {'C': 0.01, 'penalty': 'l2'}
Best cross-validation accuracy for Logistic Regression: 0.9004485585336651
Test accuracy with best Logistic Regression model: 0.9004854368932039


In [24]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
import time

# Define the parameter grid for SVM
# Starting with a smaller grid due to computational cost
param_grid_svm = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto']
}

# Initialize SVM model
svm = SVC(random_state=42)

# Initialize GridSearchCV with cross-validation
grid_svm = GridSearchCV(svm, param_grid_svm, cv=3, scoring='accuracy') # Using fewer folds for faster execution

# Fit the grid search to the training data
print("Starting SVM hyperparameter tuning...")
start_time = time.time()
grid_svm.fit(X_train, y_train)
##end_time = time.time()
print(f"SVM hyperparameter tuning finished in {end_time - start_time:.2f} seconds.")


# Print the best parameters and best score
print("Best parameters for SVM:", grid_svm.best_params_)
print("Best cross-validation accuracy for SVM:", grid_svm.best_score_)

# Train an SVM model with the best parameters
best_svm_model = SVC(**grid_svm.best_params_, random_state=42)
best_svm_model.fit(X_train, y_train)

# Evaluate the best SVM model on the test set
test_accuracy_best_svm = best_svm_model.score(X_test, y_test)
print("Test accuracy with best SVM model:", test_accuracy_best_svm)
end_time = time.time()
train_time = end_time - start_time
tuned_results.append({
    'Model': 'SVM (Tuned)',
    'Train Time': time.time() - start_time, # Need to capture train time for tuned models
    'Train Accuracy': best_svm_model.score(X_train, y_train),
    'Test Accuracy': test_accuracy_best_svm
})

Starting SVM hyperparameter tuning...
SVM hyperparameter tuning finished in -4.33 seconds.
Best parameters for SVM: {'C': 0.1, 'gamma': 'scale', 'kernel': 'linear'}
Best cross-validation accuracy for SVM: 0.9004537436196619
Test accuracy with best SVM model: 0.8992718446601942


In [25]:
### Display the results


tuned_results_df = pd.DataFrame(tuned_results)

# Combine the results DataFrames
combined_results_df = pd.concat([results_df, tuned_results_df], ignore_index=True)

# Display the combined results
display(combined_results_df)

Unnamed: 0,Model,Train Time,Train Accuracy,Test Accuracy
0,Logistic Regression,0.465326,0.901062,0.899272
1,K Nearest Neighbors,0.003912,0.906222,0.878641
2,Decision Tree,0.01997,0.985129,0.830097
3,SVM,0.153188,0.900455,0.899272
4,K Nearest Neighbors (Tuned),11.71389,0.900152,0.900485
5,Decision Tree (Tuned),6.411558,0.905918,0.893204
6,Logistic Regression (Tuned),4.122691,0.900455,0.900485
7,SVM (Tuned),740.795573,0.900455,0.899272


##### Questions