### Assignment 1: Ensemble Learning in Action

**Objective:**

To build and evaluate ensemble models and compare them with standard machine learning models. You should demonstrate an understanding of each model's pros and cons and evaluate, from a business context, which model is most appropriate.

##### Part 1: Data Preprocessing

Data Set: https://www.kaggle.com/datasets/prakharrathi25/banking-dataset-marketing-targets

**Detailed Column Descriptions**

***bank client data:***

1 - age (numeric)

2 - job : type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student","blue-collar","self-employed","retired","technician","services")

3 - marital : marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed)

4 - education (categorical: "unknown","secondary","primary","tertiary")

5 - default: has credit in default? (binary: "yes","no")

6 - balance: average yearly balance, in euros (numeric)

7 - housing: has housing loan? (binary: "yes","no")

8 - loan: has personal loan? (binary: "yes","no")

***related to the last contact of the current campaign:***

9 - contact: contact communication type (categorical: "unknown","telephone","cellular")

10 - day: last contact day of the month (numeric)

11 - month: last contact month of year (categorical: "jan", "feb", "mar", …, "nov", "dec")

12 - duration: last contact duration, in seconds (numeric)

***other attributes:***

13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)

15 - previous: number of contacts performed before this campaign and for this client (numeric)

16 - poutcome: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")


***Output Variable (desired target):***

17 - y - has the client subscribed to a term deposit? (binary: "yes","no")


Import relevant libraries and functions. 
Since I will be comparing ensemble models to my own ensemble and also to baseline models, I will import the following models: Random Forest, AdaBoost, BaggingClassifier, and VotingClassifier. I will compare these to the performance of Logistic Regression and KNN models. I will also use the Decision Tree model to build my own ensemble.

In [1]:
# import relevant libraries
import numpy as np
import pandas as pd
import seaborn as sns
import time #to measure how long the models take
from sklearn import datasets
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, BaggingClassifier, VotingClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, precision_recall_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

In [6]:
# Read training and testing data
df_train = pd.read_csv("https://raw.githubusercontent.com/GoldenSnow-Xue/Schulich_Data_Science2/main/banking_train.csv", sep=';')
df_test = pd.read_csv("https://raw.githubusercontent.com/GoldenSnow-Xue/Schulich_Data_Science2/main/banking_test.csv", sep=';')

In [7]:
df_train

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45206,51,technician,married,tertiary,no,825,no,no,cellular,17,nov,977,3,-1,0,unknown,yes
45207,71,retired,divorced,primary,no,1729,no,no,cellular,17,nov,456,2,-1,0,unknown,yes
45208,72,retired,married,secondary,no,5715,no,no,cellular,17,nov,1127,5,184,3,success,yes
45209,57,blue-collar,married,secondary,no,668,no,no,telephone,17,nov,508,4,-1,0,unknown,no


In [8]:
df_test

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,30,unemployed,married,primary,no,1787,no,no,cellular,19,oct,79,1,-1,0,unknown,no
1,33,services,married,secondary,no,4789,yes,yes,cellular,11,may,220,1,339,4,failure,no
2,35,management,single,tertiary,no,1350,yes,no,cellular,16,apr,185,1,330,1,failure,no
3,30,management,married,tertiary,no,1476,yes,yes,unknown,3,jun,199,4,-1,0,unknown,no
4,59,blue-collar,married,secondary,no,0,yes,no,unknown,5,may,226,1,-1,0,unknown,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4516,33,services,married,secondary,no,-333,yes,no,cellular,30,jul,329,5,-1,0,unknown,no
4517,57,self-employed,married,tertiary,yes,-3313,yes,yes,unknown,9,may,153,1,-1,0,unknown,no
4518,57,technician,married,secondary,no,295,no,no,cellular,19,aug,151,11,-1,0,unknown,no
4519,28,blue-collar,married,secondary,no,1137,no,no,cellular,6,feb,129,4,211,3,other,no


Clean and Explore the data

In [9]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        45211 non-null  int64 
 1   job        45211 non-null  object
 2   marital    45211 non-null  object
 3   education  45211 non-null  object
 4   default    45211 non-null  object
 5   balance    45211 non-null  int64 
 6   housing    45211 non-null  object
 7   loan       45211 non-null  object
 8   contact    45211 non-null  object
 9   day        45211 non-null  int64 
 10  month      45211 non-null  object
 11  duration   45211 non-null  int64 
 12  campaign   45211 non-null  int64 
 13  pdays      45211 non-null  int64 
 14  previous   45211 non-null  int64 
 15  poutcome   45211 non-null  object
 16  y          45211 non-null  object
dtypes: int64(7), object(10)
memory usage: 5.9+ MB


Prepare the Training and Testing Dataset

In [11]:
# Training Dataset

## Removing the Target Feature 'y' from the dataset
train_x = df_train.iloc[:, :-1]
## Getting the Target Feature to compare with the predictions made
train_target = df_train.y


# Testing Dataset

## Removing the Target Feature 'y' from the dataset
test_x = df_test.iloc[:, :-1]
## Getting the Target Feature to compare with the predictions made
test_target = df_test.y
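Before relying on test-set scores, it may be worth checking whether any test rows also appear in the training data, since the dataset's description appears to present the test file as a random sample of the training file. The following is a minimal sketch of such a check on small made-up frames; `toy_train`, `toy_test`, and the helper `count_overlap` are hypothetical names, not part of the original notebook.

```python
import pandas as pd


def count_overlap(train: pd.DataFrame, test: pd.DataFrame) -> int:
    """Return how many rows of `test` have an identical row in `train`."""
    # Drop duplicate training rows so each test row matches at most once
    train_unique = train.drop_duplicates()
    merged = test.merge(
        train_unique, how="left", on=list(test.columns), indicator=True
    )
    return int((merged["_merge"] == "both").sum())


# Made-up stand-ins for df_train and df_test
toy_train = pd.DataFrame({
    "age": [58, 44, 33, 47],
    "job": ["management", "technician", "entrepreneur", "blue-collar"],
    "y": ["no", "no", "no", "yes"],
})
toy_test = pd.DataFrame({
    "age": [44, 30],
    "job": ["technician", "unemployed"],
    "y": ["no", "no"],
})

overlap = count_overlap(toy_train, toy_test)
print(f"{overlap} of {len(toy_test)} test rows also appear in the training data")
```

If the same check on `df_train` and `df_test` reports a count close to `len(df_test)`, then test-set scores near 1.0 from models that can memorize the training data would mostly reflect that overlap rather than generalization.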

Identify numerical and categorical columns

In [16]:
train_x.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 16 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        45211 non-null  int64 
 1   job        45211 non-null  object
 2   marital    45211 non-null  object
 3   education  45211 non-null  object
 4   default    45211 non-null  object
 5   balance    45211 non-null  int64 
 6   housing    45211 non-null  object
 7   loan       45211 non-null  object
 8   contact    45211 non-null  object
 9   day        45211 non-null  int64 
 10  month      45211 non-null  object
 11  duration   45211 non-null  int64 
 12  campaign   45211 non-null  int64 
 13  pdays      45211 non-null  int64 
 14  previous   45211 non-null  int64 
 15  poutcome   45211 non-null  object
dtypes: int64(7), object(9)
memory usage: 5.5+ MB


Build a pipeline to process the data

In [17]:
# Standardize using the StandardScaler and OneHotEncode the categoricals

## Identify numerical and categorical columns
num_cols = train_x.select_dtypes(include=['int64']).columns
cat_cols = train_x.select_dtypes(include=['object']).columns

In [18]:
# Create a column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), num_cols),
        ('cat', OneHotEncoder(), cat_cols)])

Explain pre-processing approach and justify the transformations you have made:

1. Removing the Target Feature 'y' from the Dataset
2. Identifying Numerical and Categorical Columns
3. Column Transformation using ColumnTransformer

Standardizing numerical features using 'StandardScaler' is important because several of the models used here are sensitive to the scale of features: KNN relies on distances between points, and the Logistic Regression solver converges more reliably on scaled inputs. Tree-based models are largely unaffected by scaling.

One-hot encoding categorical features using 'OneHotEncoder' is necessary because scikit-learn models require numerical input, and one-hot encoding avoids imposing an artificial order on categories such as job or month.

##### Part 2: Build Baseline Models

Create baseline models using Logistic Regression and K-NN. Tune the models.

In [27]:
knn = KNeighborsClassifier(n_neighbors=10)
log_reg = LogisticRegression()

##### Part 3: Ensemble Modelling

1. Decision Tree

2. Random Forest

3. AdaBoost

4. Bagging Classifier

5. Voting Classifier

In [19]:
# Build a Decision Tree Model
dt = DecisionTreeClassifier(max_depth=20)

# Build a Random Forest Model
rf = RandomForestClassifier()

# Build an AdaBoost Model
ada = AdaBoostClassifier()

# Build a Bagging Classifier (scikit-learn's default base estimator is a decision tree)
bag = BaggingClassifier()

# Build a Voting Classifier using a mix of different classification models
voting = VotingClassifier(estimators=[('lr', log_reg), ('knn', knn), ('dt', dt)])
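`VotingClassifier` uses hard (majority) voting by default, while soft voting averages the predicted class probabilities. The sketch below illustrates the difference with plain NumPy; the probabilities are made-up numbers for three hypothetical models, not outputs from this notebook.

```python
import numpy as np

# Hypothetical predicted probabilities of the 'yes' class:
# one row per client, one column per model (made-up values)
proba_yes = np.array([
    [0.90, 0.40, 0.45],
    [0.20, 0.30, 0.10],
    [0.55, 0.60, 0.20],
    [0.05, 0.45, 0.48],
])

# Hard voting: each model votes 'yes' when its probability is at least 0.5,
# and the majority of the three votes decides
votes = (proba_yes >= 0.5).astype(int)
hard = np.where(votes.sum(axis=1) >= 2, "yes", "no")

# Soft voting: average the probabilities first, then apply the 0.5 cut-off
soft = np.where(proba_yes.mean(axis=1) >= 0.5, "yes", "no")

for i, (h, s) in enumerate(zip(hard, soft)):
    print(f"client {i}: hard={h}, soft={s}")
```

The two schemes can disagree: one very confident model can tip a soft vote, while two mildly confident models can win a hard vote.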

In [20]:
classifiers = {
    'K-Nearest Neighbors': knn,
    'Logistic Regression': log_reg,
    'Decision Tree': dt,
    'Random Forest': rf,
    'AdaBoost': ada,
    'Bagging': bag,
    'Voting': voting
}

In [21]:
# Create dictionary to store the results of each model
results = {}

##### Part 4: Performance Comparison

In [26]:
# Loop through list of models to compare performance
for name, clf in classifiers.items():
    start_time = time.time()

    # Create pipeline
    pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('classifier',clf)])
    
    # Fit the model
    pipeline.fit(train_x, train_target)

    # Make predictions
    y_pred = pipeline.predict(test_x)

    # Compute metrics
    # The class labels are the strings 'no' and 'yes' rather than 0 and 1,
    # so the positive label has to be passed explicitly
    pos_label = 'yes'

    precision = precision_score(test_target, y_pred, pos_label=pos_label)
    recall = recall_score(test_target, y_pred, pos_label=pos_label)
    f1 = f1_score(test_target, y_pred, pos_label=pos_label)
    accuracy = accuracy_score(test_target, y_pred)

    end_time = time.time()
    elapsed_time = end_time - start_time

    # Store results
    results[name] = {
        'Precision': precision,
        'Recall': recall,
        'F1-Score': f1,
        'Accuracy': accuracy,
        'Time (s)': elapsed_time
    }

# Convert results to DataFrame for easier viewing
results_df = pd.DataFrame(results).T
print(results_df)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


                     Precision    Recall  F1-Score  Accuracy   Time (s)
K-Nearest Neighbors   0.733624  0.322457  0.448000  0.908427   1.237059
Logistic Regression   0.645522  0.332054  0.438530  0.902013   1.537601
Decision Tree         0.972632  0.886756  0.927711  0.984074   0.975438
Random Forest         1.000000  1.000000  1.000000  1.000000  13.361279
AdaBoost              0.601329  0.347409  0.440389  0.898253   8.382354
Bagging               0.995992  0.953935  0.974510  0.994249   8.209315
Voting                0.853383  0.435701  0.576874  0.926344   2.914777
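The `STOP: TOTAL NO. of ITERATIONS REACHED LIMIT` message in the output above is the solver's convergence warning for Logistic Regression: the default `max_iter=100` was reached before the optimizer converged. One common remedy is to raise `max_iter`. The sketch below shows this on synthetic data; the value 1000 and the synthetic data set are assumptions, not settings from the original notebook.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the banking data
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

model = Pipeline(steps=[
    ("scale", StandardScaler()),
    # max_iter=1000 is an illustrative choice, not a tuned value
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Record any warnings raised during fitting so they can be inspected
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    model.fit(X, y)

convergence_warnings = [
    w for w in caught if issubclass(w.category, ConvergenceWarning)
]
n_iter = int(model.named_steps["classifier"].n_iter_[0])
print(f"iterations used: {n_iter}, convergence warnings: {len(convergence_warnings)}")
```

Raising the limit addresses the cause of the warning, whereas filtering warnings only hides it.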


Tune each model and see if performance improves

In [None]:
import warnings

warnings.filterwarnings('ignore', category=UserWarning)

In [28]:
# Import additional libraries
from sklearn.model_selection import GridSearchCV

In [30]:
# Hyperparameter grids for tuning
knn_params = {'classifier__n_neighbors': [3, 5, 7, 20, 30, 50, 100]}
log_reg_params = {'classifier__C': [0.1, 1, 10]}
dt_params = {'classifier__max_depth': [10, 20, 30, 40, 50]}
rf_params = {'classifier__n_estimators': [50, 100, 150], 'classifier__max_depth': [None, 10, 20, 30, 50]}
ada_params = {'classifier__n_estimators': [25, 50, 75]}
bag_params = {'classifier__n_estimators': [5, 10, 20]}
# Experiment with both hard and soft voting
voting_params = {'classifier__voting': ['hard', 'soft']}

params_dict = {
    'K-Nearest Neighbors': knn_params,
    'Logistic Regression': log_reg_params,
    'Decision Tree': dt_params,
    'Random Forest': rf_params,
    'AdaBoost': ada_params,
    'Bagging': bag_params,
    'Voting': voting_params
}

# Initialize results dictionary for tuned models
tuned_results = {}

# Loop through classifiers for tuning
for name, clf in classifiers.items():
    start_time = time.time()

    # Create pipeline
    pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('classifier', clf)])
    
    # Create GridSearchCV object
    grid = GridSearchCV(pipeline, params_dict[name], cv=5)

    # Fit the model
    grid.fit(train_x, train_target)

    # Get the best estimator and predict
    best_model = grid.best_estimator_
    y_pred = best_model.predict(test_x)

    # Compute metrics
    # The class labels are the strings 'no' and 'yes' rather than 0 and 1,
    # so the positive label has to be passed explicitly
    pos_label = 'yes'
    
    precision = precision_score(test_target, y_pred, pos_label=pos_label)
    recall = recall_score(test_target, y_pred, pos_label=pos_label)
    f1 = f1_score(test_target, y_pred, pos_label=pos_label)

    end_time = time.time()
    elapsed_time = end_time - start_time

    # Store results
    tuned_results[name] = {
        'Best Params': grid.best_params_,
        'Precision': precision,
        'Recall': recall,
        'F1-Score': f1,
        'Time (s)': elapsed_time
    }

# Convert results to DataFrame for easier viewing
tuned_results_df = pd.DataFrame(tuned_results).T
print(tuned_results_df)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

                                                           Best Params  \
K-Nearest Neighbors                   {'classifier__n_neighbors': 100}   
Logistic Regression                             {'classifier__C': 0.1}   
Decision Tree                            {'classifier__max_depth': 10}   
Random Forest        {'classifier__max_depth': 10, 'classifier__n_e...   
AdaBoost                              {'classifier__n_estimators': 25}   
Bagging                               {'classifier__n_estimators': 10}   
Voting                                  {'classifier__voting': 'hard'}   

                    Precision    Recall  F1-Score    Time (s)  
K-Nearest Neighbors  0.708609  0.205374  0.318452   68.730806  
Logistic Regression  0.641509  0.326296   0.43257   17.414587  
Decision Tree        0.737662  0.545106  0.626932   25.970428  
Random Forest        0.862319  0.228407  0.361153  707.435314  
AdaBoost             0.589655  0.328215  0.421702   95.230152  
Bagging              0.
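When `scoring` is not given, `GridSearchCV` picks the best parameters by the estimator's default score, which is accuracy for classifiers. If 'yes' cases are relatively rare, accuracy can favour settings that seldom predict 'yes', which may be one reason the tuned KNN has lower recall than the untuned one. The sketch below selects by F1 on the 'yes' class instead; the synthetic data, the reduced grid, and the choice of F1 as the selection metric are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data with string labels, mimicking 'no'/'yes'
X, y_num = make_classification(
    n_samples=600, n_features=8, weights=[0.85, 0.15], random_state=0
)
y = np.where(y_num == 1, "yes", "no")

pipeline = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("classifier", KNeighborsClassifier()),
])

# Score candidates by F1 on the 'yes' class rather than by accuracy
f1_yes = make_scorer(f1_score, pos_label="yes", zero_division=0)

grid = GridSearchCV(
    pipeline,
    {"classifier__n_neighbors": [3, 5, 7, 20]},
    scoring=f1_yes,
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

With this setup, `grid.best_score_` is the mean cross-validated F1 for the 'yes' class, so the selected parameters reflect the metric the comparison tables emphasize.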

Compare the performance of all models (including the baseline models). Consider both the time required for the models to run, and the performance of the models on the data set.

##### Part 5: Interpretation and Justification

**Interpretation**

- K-Nearest Neighbors (KNN): The KNN model has relatively high precision (0.709), meaning that about 71% of the clients it flags as subscribers actually subscribed to a term deposit. However, it has the lowest recall (0.205), so it misses most actual subscribers (false negatives). Its F1-Score is also the lowest (0.318), and its run time (about 69 s) is well above that of Logistic Regression and the Decision Tree, making it less efficient.

- Logistic Regression: The Logistic Regression model has relatively poor precision (0.642) and recall (0.326), indicating that it has difficulty classifying subscribers accurately. However, it is the fastest model (about 17 s).

- Decision Tree: The Decision Tree model is the second-fastest model (about 26 s) and performs decently in terms of precision (0.738), recall (0.545), and F1-Score (0.627). It strikes a balance between efficiency and effectiveness.

- Random Forest: The Random Forest model is the slowest model (about 707 s) and has worse metrics than the Decision Tree. Its precision is relatively high (0.862), but its recall (0.228) and F1-Score (0.361) are low, which makes the long run time hard to justify.

- AdaBoost: The AdaBoost model has metrics similar to those of Logistic Regression (F1-Score 0.422 versus 0.433). It is faster than Random Forest but slower than Logistic Regression, the Decision Tree, and KNN.

- Bagging: The Bagging model has the highest precision, recall, and F1-Score, although it takes longer to run than Logistic Regression and the Decision Tree.

- Voting: The Voting model is highly efficient and achieves decent metrics across the board, making it an attractive choice for overall performance.
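The F1-Score referred to above is the harmonic mean of precision and recall, so a model with high precision but low recall still gets a low F1. As a quick check, the sketch below recomputes F1 from precision and recall values copied from the tuned-results table; the helper name `f1_from` is hypothetical.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the tuned-results table
tuned = {
    "K-Nearest Neighbors": (0.708609, 0.205374),
    "Decision Tree": (0.737662, 0.545106),
    "Random Forest": (0.862319, 0.228407),
}

for name, (p, r) in tuned.items():
    print(f"{name}: F1 = {f1_from(p, r):.3f}")
```

Random Forest has the highest precision of the three, yet its F1 stays close to KNN's because the harmonic mean is pulled toward the smaller of the two values.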

**1. Why did the ensemble models perform the way they did?**
- Beyond the hyperparameters, use your understanding of how the models work to explain why you think the models performed the way they did on the given data set. Was the result what you were expecting? Why or why not?

      The strong performance of ensemble models such as Bagging and Voting can be attributed to their ability to combine several base models: averaging or voting over many models reduces variance, which limits overfitting and improves generalization. The Decision Tree, while not an ensemble model, performed reasonably well because it adapts to complex decision boundaries. The underperformance of the Random Forest was unexpected; it may stem from the tuned hyperparameters (the grid search selected a `max_depth` of 10) or from dataset-specific characteristics. The KNN model's low recall shows that it missed most of the positive cases.

      In summary, Bagging and Voting, which combine multiple models to reduce variance, outperformed the single-model baselines, Logistic Regression and KNN. Random Forest and AdaBoost, although also ensembles, did not show the same advantage. The performance of each model depends not only on the algorithm itself but also on factors like hyperparameter tuning and dataset characteristics.
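The variance-reduction argument can be explored by cross-validating a single decision tree against a bagged ensemble of the same trees. The sketch below does this on synthetic data; the data set, `n_estimators=25`, and the variable names are assumptions for illustration, and the scores it prints say nothing about the banking data itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise (flip_y) so a single deep tree can overfit
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)

single_tree = DecisionTreeClassifier(random_state=0)
# Bagging fits each tree on a bootstrap sample and combines their predictions
bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(random_state=0), n_estimators=25, random_state=0
)

# 5-fold cross-validated accuracy for each model
tree_scores = cross_val_score(single_tree, X, y, cv=5)
bag_scores = cross_val_score(bagged_trees, X, y, cv=5)

print(f"single tree : mean={tree_scores.mean():.3f}, std={tree_scores.std():.3f}")
print(f"bagged trees: mean={bag_scores.mean():.3f}, std={bag_scores.std():.3f}")
```

Comparing the mean and spread of the fold scores gives a rough picture of how much stability the ensemble adds over one tree.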

**2. If you had to pick one model to implement in business process, which would it be and why?**
- Discuss the business implications.
- Consider not only performance metrics but also computational cost and interpretability.

      Considering both performance and computational cost, Bagging stands out as a strong candidate for implementation in a business process. It offers high precision, recall, and F1-score at a moderate computational cost.

      Implementing Bagging could lead to better targeting of potential clients who are likely to subscribe to a term deposit, increasing the success rate of marketing campaigns.

      The computational cost of Bagging, which was lower than that of Random Forest in the untuned comparison (about 8 s versus 13 s), makes it a practical choice for real-time or large-scale applications.
       
      However, it's essential to keep in mind that the final model choice should align with the specific business objectives and constraints. Factors like interpretability and ease of deployment should also be considered in the decision-making process.




**3. What decision criteria did you use to arrive at this conclusion?**
- Precision-Recall trade-off? Computational cost? Others?

      Decision Criteria for Conclusion:

      - Bagging achieves a high F1-score, indicating a balanced precision-recall trade-off. This is crucial in identifying subscribers to a term deposit while minimizing false positives.

      - Bagging is computationally efficient, making it suitable for practical implementation in business processes, especially when real-time or high-throughput predictions are required.

      - Bagging, being an ensemble of decision trees, offers only moderate interpretability: individual trees can be inspected, but the combined vote is harder to explain than a single Decision Tree, much as with Random Forest.

      - In conclusion, Bagging stands out as a robust and efficient choice for implementing a predictive model in a business process, given its balance between precision and recall, computational efficiency, and moderate interpretability.
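The precision-recall trade-off mentioned in the criteria can also be examined directly by varying the decision threshold instead of using the default 0.5 cut-off. The sketch below uses `precision_recall_curve` on synthetic data; the data, the Logistic Regression model, and the recall target of 0.6 are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the banking data
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.88, 0.12], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]

# precision and recall have one more entry than thresholds;
# the final point has no threshold, so it is excluded below
precision, recall, thresholds = precision_recall_curve(y_test, proba)

# Hypothetical business rule: reach at least 60% of subscribers,
# then pick the threshold with the best precision among those options
target_recall = 0.6
candidate_precision = np.where(recall[:-1] >= target_recall, precision[:-1], -1.0)
best_idx = int(np.argmax(candidate_precision))
chosen_threshold = thresholds[best_idx]

print(
    f"threshold={chosen_threshold:.3f}, "
    f"precision={precision[best_idx]:.3f}, recall={recall[best_idx]:.3f}"
)
```

A rule of this kind lets the business state its tolerance for missed subscribers explicitly, and the threshold follows from it.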