__Assignment 2__  
__Name: Bryan Lim Li Cheng__  
__Student ID: 2301214__

# CSD3185/CSD3186: Assignment 2

## Topics Covered:  
- Data Preprocessing
- Logistic Regression Classifier
- K-Nearest Neighbors (KNN) Classifier
- Hyperparameter Tuning


## Python and Dependency Versions  
To ensure reproducibility and compatibility, please check that you have the following Python and dependency versions installed:

- Python >= 3.12
- pandas == 2.3.3
- numpy == 2.4.1
- scikit-learn == 1.8.0

## Dataset
In this assignment, we'll use a bank marketing dataset, which focuses on direct marketing campaigns conducted via phone calls by a Portuguese banking institution. The objective is to predict whether a client will subscribe to a term deposit (coded as 1 for yes or 0 for no).

**IMPORTANT**: Download and use the dataset provided on Moodle for this assignment. Do NOT use any other version of the dataset available online, as it may differ in structure and content.

#### **Attribute Overview**   
|**Input Feature**     |**Description**                         |
|:-----------------------|:-------------------------------------------------------------------------------------------------------------------|
| `age`                | Age                                                                                                                     |
| `job`                | Type of job                                                                                                             |
| `marital`            | Marital status                                                                                                          |
| `education`          | Education level                                                                                                         |
| `default`            | Has credit in default?                                                                                                  |
| `housing`            | Has housing loan?                                                                                                       |
| `loan`               | Has personal loan?                                                                                                      |
| `contact`            | Contact communication type                                                                                              |
| `month`              | Last contact month of year                                                                                              |
| `day_of_week`        | Last contact day of the week                                                                                            |
| `duration`           | Last contact duration in seconds                                                                                        |
| `campaign`           | Number of contacts performed during this campaign for this client                                                       |
| `pdays`              | Number of days since last contact from a previous campaign (999 = not previously contacted)                             |
| `previous`           | Number of contacts performed before this campaign for this client                                                       |
| `poutcome`           | Outcome of the previous marketing campaign                                                                              |
| `emp.var.rate`       | Employment variation rate                                                                                               | 
| `cons.price.idx`     | Consumer price index                                                                                                    |
| `cons.conf.idx`      | Consumer confidence index                                                                                               |
| `euribor3m`          | Euribor 3-month rate                                                                                                    |
| `nr.employed`        | Number of employees                                                                                                     |
| **Target variable**  | **Description**                                                                                                         |
| `y`                  | Has the client subscribed to a term deposit?                                                                            |


## Deliverables

Your submission for this assignment should be only __ONE__ file - this particular completed notebook file. 

Also, *RENAME* your file like this: __\<coursecode\>\_<assignment#>\_<your_full_name>.ipynb__  
E.g. CS3185_A2_John_Doe.ipynb  

To complete this assignment, follow the instructions in the Tasks section below.

## IMPORTANT! READ THIS BEFORE STARTING...
- DO NOT delete existing cells, but you can add more cells anywhere in the notebook as necessary.
- DO NOT modify or comment out the content of the existing cells unless otherwise stated (e.g., for code implementation). However, DO NOT change the variable names that are already defined in the existing cells.
- Strictly follow the file naming convention for the notebook file as spelled out above.

Please adhere strictly to the instructions as stated above as failure to do so might result in deduction of marks by the autograder.

Your assignment begins after the line below!! Complete all the tasks as specified.

---

## 1. Data Loading and Preprocessing

Load some basic libraries upfront. You may add any other libraries you deem necessary below or later on where appropriate.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV

__Task 1.1__

Load the banking dataset from the provided CSV file into a pandas DataFrame.

In [2]:
# read the dataset as a pandas DataFrame into `data`
# place banking.csv in the same folder as this notebook; load it by filename only (no paths).
data = pd.read_csv("banking.csv")

__Task 1.2__

Perform the basic data exploration steps listed below to understand the structure and content of the dataset (use pandas DataFrame methods):

- Understand the dataset structure (number of rows and columns).
- Calculate and display the number of missing values in each column. 

In [3]:
data_shape = data.shape
data_shape

(41188, 21)

In [4]:
missing_values = data.isnull().sum()
missing_values

age               0
job               0
marital           0
education         0
default           0
housing           0
loan              0
contact           0
month             0
day_of_week       0
duration          0
campaign          0
pdays             0
previous          0
poutcome          0
emp_var_rate      0
cons_price_idx    0
cons_conf_idx     0
euribor3m         0
nr_employed       0
y                 0
dtype: int64

__Task 1.3__

The dataset contains categorical features. Display all unique values for each categorical feature in the dataset to understand the possible categories.

In [5]:
# Print the unique values of each categorical (object-dtype) column

categorical_cols = data.select_dtypes(include=["object"]).columns
for col in categorical_cols:
    print(f"\n{col}: {data[col].unique()}")


job: ['blue-collar' 'technician' 'management' 'services' 'retired' 'admin.'
 'housemaid' 'unemployed' 'entrepreneur' 'self-employed' 'unknown'
 'student']

marital: ['married' 'single' 'divorced' 'unknown']

education: ['basic.4y' 'unknown' 'university.degree' 'high.school' 'basic.9y'
 'professional.course' 'basic.6y' 'illiterate']

default: ['unknown' 'no' 'yes']

housing: ['yes' 'no' 'unknown']

loan: ['no' 'yes' 'unknown']

contact: ['cellular' 'telephone']

month: ['aug' 'nov' 'jun' 'apr' 'jul' 'may' 'oct' 'mar' 'sep' 'dec']

day_of_week: ['thu' 'fri' 'tue' 'mon' 'wed']

poutcome: ['nonexistent' 'success' 'failure']


__Task 1.4__

The education column contains various levels of education. We can simplify this by grouping similar education levels together.

Group "basic.4y", "basic.6y", and "basic.9y" into a single category called "basic" in the `data` DataFrame.

In [6]:
data['education'] = data['education'].replace(['basic.4y','basic.6y','basic.9y'],'basic')

__Task 1.5__

Analyze the target variable y to understand its distribution and derive some basic insights. This analysis helps assess class imbalance and potential relationships between the target and other variables.

- Display the value counts of the target variable y using the `value_counts()` method
- Calculate the number and percentage of rows where y is 0 (no subscription) and 1 (subscription).
- Display the mean of numeric columns grouped by the target variable y using the `groupby()` method.

In [7]:
# Display value counts for the target variable
y_value_counts = data["y"].value_counts()
print("Value counts for the target variable 'y':")
print(y_value_counts)

Value counts for the target variable 'y':
y
0    36548
1     4640
Name: count, dtype: int64


In [8]:
# Calculate subscription statistics
count_no_sub = (data["y"] == 0).sum()
count_sub = (data["y"] == 1).sum()

pct_of_no_sub = count_no_sub / len(data)
pct_of_sub = count_sub / len(data)

print(f"Percentage of no subscription: {pct_of_no_sub * 100:.2f}%")
print(f"Percentage of subscription: {pct_of_sub * 100:.2f}%")

Percentage of no subscription: 88.73%
Percentage of subscription: 11.27%


In [9]:
# Display the mean of numeric columns grouped by the target variable
grouped_means = data.groupby("y").mean(numeric_only = True)

print("\nMean of numeric columns grouped by 'y':")
print(grouped_means)


Mean of numeric columns grouped by 'y':
         age    duration  campaign       pdays  previous  emp_var_rate  \
y                                                                        
0  39.911185  220.844807  2.633085  984.113878  0.132374      0.248875   
1  40.913147  553.191164  2.051724  792.035560  0.492672     -1.233448   

   cons_price_idx  cons_conf_idx  euribor3m  nr_employed  
y                                                         
0       93.603757     -40.593097   3.811491  5176.166600  
1       93.354386     -39.789784   2.123135  5095.115991  


__Task 1.6__

Based on the outputs observed in the previous steps, write a brief summary of your findings regarding the target variable distribution and any notable patterns in the data.

# Write your summary in this markdown cell here.  
The target is heavily imbalanced: only 4,640 of 41,188 clients (11.27%) subscribed, which reflects how hard it is to convert prospects through phone calls. Subscribers had much longer calls on average (553 seconds versus 221 seconds for non-subscribers) yet were contacted slightly fewer times during the campaign (2.05 versus 2.63 contacts), which suggests that conversation quality matters more than contact frequency. The economic indicators also differ between the groups: subscribers are associated with a lower employment variation rate (-1.23 versus 0.25), a lower Euribor 3-month rate (2.12 versus 3.81), and a slightly higher consumer confidence index (-39.79 versus -40.59). Contact history matters too: subscribers had more contacts before this campaign (0.49 versus 0.13 on average) and a lower mean `pdays` (792 versus 984), meaning more of them had been reached in a previous campaign, since 999 marks clients who were never contacted before. The categorical features cover a range of occupations, education levels, and marital statuses. These figures are group averages, so they indicate association rather than causation.

__Task 1.7__

Categorical features often need to be encoded into numerical format for machine learning algorithms. Use one-hot encoding to convert all categorical features in the dataset into numerical format. 

Use the `pd.get_dummies()` function from pandas to achieve this. This approach is widely used to convert categorical variables into a format suitable for machine learning models. To learn more about pd.get_dummies() and its usage, refer to the [pandas documentation](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html).

Create a function named `encode_categorical_features` that takes a pandas DataFrame as input and returns a new DataFrame with all categorical features one-hot encoded.

For the new column names generated by one-hot encoding, use the format `<column name>_<category value>`. For example, if the original column is `job` and one of its categories is `admin.`, the new column should be named `job_admin.`.
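As a quick illustration of the naming rule above, here is a minimal sketch on a toy DataFrame. The values in `toy` are made up for the example and are not taken from the banking dataset.

```python
import pandas as pd

# Toy frame (hypothetical values): one numeric and one categorical column.
toy = pd.DataFrame({"age": [30, 41, 25], "job": ["admin.", "student", "admin."]})

# get_dummies leaves numeric columns untouched and expands each object
# column into one indicator column per category, named <column>_<category>.
toy_encoded = pd.get_dummies(toy, drop_first=False)

print(list(toy_encoded.columns))  # ['age', 'job_admin.', 'job_student']
```

Keeping `drop_first=False` retains one column for every category, which matches the required `<column name>_<category value>` format for all values.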

In [10]:
# Complete the function to one-hot encode categorical features
def encode_categorical_features(df):
    """One-hot encode all categorical features in the DataFrame.
    
    Args:
        df (pd.DataFrame): Input DataFrame with categorical features.
        
    Returns:
        pd.DataFrame: New DataFrame with one-hot encoded categorical features.
    """
    encoded_df = pd.get_dummies(df, drop_first=False)
    return encoded_df

In [11]:
# Apply the encoding function to the dataset
data_encoded = encode_categorical_features(data)

__Task 1.8__

The dataset has class imbalance, as seen in the target variable distribution. Class imbalance can affect the performance of machine learning models. There are several techniques to address this, but for this assignment we will focus on one specific technique: __over-sampling__.

Over-sampling is the process of randomly duplicating observations from the minority class to achieve a balanced dataset. The most common approach to over-sampling is to resample with replacement.

Perform over-sampling on the training data only. This is important to avoid data leakage and ensure that the model is evaluated on unseen data.

To help with this task, create a function named `split_data` that splits the dataset into training and testing sets. Use `train_test_split` inside it and make sure both sets keep the same class distribution as the full dataset.

Subsequently, implement a function named `oversample_minority_class` that takes the training set and the name of the target column as input. The function should return a new DataFrame with balanced classes in the target variable. Use the following steps in your implementation:

- Separate the majority and minority classes in the training data.
- Upsample the minority class by randomly duplicating its samples.
- Combine the upsampled minority class with the majority class to create a balanced dataset.

Note: Use the `resample` function from `sklearn.utils` to perform the upsampling. NO other libraries for handling imbalanced datasets should be used.
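To see why the order of operations matters, the sketch below compares over-sampling before the split with over-sampling after it. It uses a toy DataFrame, and the `row_id` column and the `upsample` helper are hypothetical names introduced only for this illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy imbalanced data (hypothetical): 90 majority rows, 10 minority rows.
toy = pd.DataFrame({"row_id": range(100), "y": [0] * 90 + [1] * 10})


def upsample(df, seed):
    """Duplicate minority rows (with replacement) until both classes are equal."""
    majority = df[df["y"] == 0]
    minority = df[df["y"] == 1]
    minority_up = resample(
        minority, replace=True, n_samples=len(majority), random_state=seed
    )
    return pd.concat([majority, minority_up])


# Wrong order: over-sample first, then split.
# Copies of the same original row can land on both sides of the split.
leaky_train, leaky_test = train_test_split(
    upsample(toy, seed=0), test_size=0.2, random_state=0
)
leaked = int(leaky_test["row_id"].isin(leaky_train["row_id"]).sum())

# Right order: split first, then over-sample the training part only.
train, test = train_test_split(
    toy, test_size=0.2, random_state=0, stratify=toy["y"]
)
train_balanced = upsample(train, seed=0)
clean = int(test["row_id"].isin(train_balanced["row_id"]).sum())

print(f"Test rows also present in train (over-sample, then split): {leaked}")
print(f"Test rows also present in train (split, then over-sample): {clean}")
```

In the second ordering no test row can appear in the training set, so the test metrics are computed on data the model has not seen.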

In [12]:
# Complete the function
def split_data(df, target_column, test_size, random_state):
    """Split the DataFrame into training and testing sets.
    
    Args:
        df (pd.DataFrame): Input DataFrame.
        target_column (str): Name of the target column.
        test_size (float): Proportion of the dataset to include in the test split.
        random_state (int): Random seed for reproducibility.
    
    Returns:
        pd.DataFrame, pd.DataFrame: train_data, test_data (including target)
    """
    train_data, test_data = train_test_split(df, test_size=test_size, random_state = random_state, stratify=df[target_column])
    return train_data, test_data

In [13]:
# Split the encoded data into training and testing sets 80-20 split
train_data, test_data = split_data(data_encoded, 'y', test_size = 0.2, random_state = 80)

In [14]:
# Check the shapes of the resulting datasets
train_data.shape, test_data.shape

((32950, 62), (8238, 62))
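A stratified split can be checked by comparing the positive rate in each part. The following is a small sketch on toy data with a 10% positive rate; the column names and sizes are assumptions for the example, not the banking data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy target (hypothetical) with a 10% positive rate.
toy = pd.DataFrame({"x": range(1000), "y": [1] * 100 + [0] * 900})

toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=80, stratify=toy["y"]
)

# With stratify, both parts keep the positive rate of the full data.
train_rate = toy_train["y"].mean()
test_rate = toy_test["y"].mean()
print(f"train positive rate: {train_rate:.3f}, test positive rate: {test_rate:.3f}")
```

The same check applied to `train_data['y']` and `test_data['y']` should show rates close to the overall subscription rate.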

In [15]:
# Complete the function
def oversample_minority_class(df, target_col, random_state):
    """Over-sample the minority class in the DataFrame.
    
    Args:
        df (pd.DataFrame): Input DataFrame with imbalanced classes.
        target_col (str): Name of the target column.
        random_state (int): Random state for reproducibility.
    Returns:
        pd.DataFrame: New DataFrame with balanced classes.
    """
    # Split rows by class (0 is the majority class, 1 the minority class)
    majority_class = df[df[target_col] == 0]
    minority_class = df[df[target_col] == 1]

    # Upsample the minority class with replacement to match the majority size
    minority_upsampled = resample(
        minority_class,
        replace=True,
        n_samples=len(majority_class),
        random_state=random_state,
    )

    # Combine the majority class with the upsampled minority class
    balanced_df = pd.concat([majority_class, minority_upsampled])

    # Shuffle the rows and reset the index
    balanced_df = balanced_df.sample(frac=1, random_state=random_state).reset_index(drop=True)
    
    return balanced_df

In [16]:
# Over-sample the minority class in the training data
train_data_balanced = oversample_minority_class(train_data, 'y', random_state = 80)

In [17]:
# check train_data_balanced class distribution
balanced_class_distribution = train_data_balanced['y'].value_counts()
balanced_class_distribution

y
1    29238
0    29238
Name: count, dtype: int64

__Task 1.9__

Scale the features in the training and testing sets using `MinMaxScaler` from sklearn. 

Create a function named `scale_features` that takes the training and testing DataFrames as input and returns the scaled versions of both DataFrames. Ensure that the scaler is fitted only on the training data to prevent data leakage.
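The effect of fitting the scaler on the training data only can be seen in a minimal sketch with a single toy feature. The numbers below are made up for the example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature data (hypothetical values).
toy_train = np.array([[10.0], [20.0], [30.0]])
toy_test = np.array([[5.0], [25.0], [40.0]])

scaler = MinMaxScaler()
toy_train_scaled = scaler.fit_transform(toy_train)  # learns min=10, max=30
toy_test_scaled = scaler.transform(toy_test)        # reuses the training min/max

print(toy_train_scaled.ravel())
print(toy_test_scaled.ravel())
```

The training values span exactly [0, 1], while test values that lie outside the training range (5 and 40 here) map outside [0, 1]. That is expected: the test set must not influence the scaling parameters.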

In [18]:
# Complete the function
def scale_features(train_df, test_df):
    """Scale features in training and testing DataFrames using MinMaxScaler.
    
    Args:
        train_df (pd.DataFrame): Training DataFrame.
        test_df (pd.DataFrame): Testing DataFrame.
    Returns:
        pd.DataFrame, pd.DataFrame: Scaled training and testing DataFrames.
    """
    # Initialise the scaler
    scaler = MinMaxScaler()

    # Fit the scaler on the training data only, then transform it
    x_train_scale = scaler.fit_transform(train_df)

    # Transform the test data with the min/max learned from the training data
    x_test_scale = scaler.transform(test_df)

    # Convert back to DataFrames, keeping the original column names and row index
    x_train_scale = pd.DataFrame(x_train_scale, columns=train_df.columns, index=train_df.index)
    x_test_scale = pd.DataFrame(x_test_scale, columns=test_df.columns, index=test_df.index)

    return x_train_scale, x_test_scale

In [19]:
# Separate features and target variable in training and testing sets
X_train = train_data_balanced.drop('y', axis = 1)
y_train = train_data_balanced['y']

X_test = test_data.drop('y', axis = 1)
y_test = test_data['y']

In [20]:
# Scale features to the [0, 1] range with MinMaxScaler
X_train_scaled, X_test_scaled = scale_features(X_train, X_test)

---

## 2. Model Training and Evaluation

__Task 2.1__

Build a baseline logistic regression model and a K-Nearest Neighbors (KNN) model using the scaled training data.

In [21]:
# Instantiate logistic regression model with all default params
base_logreg = LogisticRegression()

# Fit the model on the training data
base_logreg.fit(X_train_scaled, y_train)

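| **`LogisticRegression` parameter** | **Value** |
|:-----------------------|:-----------|
| `penalty`              | `'l2'`     |
| `dual`                 | `False`    |
| `tol`                  | `0.0001`   |
| `C`                    | `1.0`      |
| `fit_intercept`        | `True`     |
| `intercept_scaling`    | `1`        |
| `class_weight`         | `None`     |
| `random_state`         | `None`     |
| `solver`               | `'lbfgs'`  |
| `max_iter`             | `100`      |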


In [22]:
# Instantiate KNN model with all default params
base_knn = KNeighborsClassifier()

# Fit the model on the training data
base_knn.fit(X_train_scaled, y_train)

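| **`KNeighborsClassifier` parameter** | **Value** |
|:-----------------------|:--------------|
| `n_neighbors`          | `5`           |
| `weights`              | `'uniform'`   |
| `algorithm`            | `'auto'`      |
| `leaf_size`            | `30`          |
| `p`                    | `2`           |
| `metric`               | `'minkowski'` |
| `metric_params`        | `None`        |
| `n_jobs`               | `None`        |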


__Task 2.2__

Evaluate both models on the scaled testing data using accuracy, precision, recall, and F1-score as metrics. Use the appropriate functions from `sklearn.metrics` to compute these metrics.
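To make the four metrics concrete, the sketch below computes them by hand from confusion-matrix counts and compares the results with the `sklearn.metrics` functions. The labels are a toy example, not model output from this notebook.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Toy labels (hypothetical): 1 = subscribed, 0 = not subscribed.
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 1, 1, 1, 0, 0, 0, 0]

# For binary 0/1 labels, ravel() returns the counts as tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are real
recall = tp / (tp + fn)                      # share of real positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"manual : {accuracy:.4f} {precision:.4f} {recall:.4f} {f1:.4f}")
print(
    "sklearn: "
    f"{accuracy_score(y_true_toy, y_pred_toy):.4f} "
    f"{precision_score(y_true_toy, y_pred_toy):.4f} "
    f"{recall_score(y_true_toy, y_pred_toy):.4f} "
    f"{f1_score(y_true_toy, y_pred_toy):.4f}"
)
```

Here the classifier finds two of the three positives (recall 0.67) but only two of its five positive predictions are correct (precision 0.40), a pattern that is common when positives are rare.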

In [23]:
# Make preds based on test data
y_pred_logreg = base_logreg.predict(X_test_scaled)
y_pred_knn = base_knn.predict(X_test_scaled)

# Evaluate Logistic Regression
print("Logistic Regression Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_logreg):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_logreg):.4f}")
print(f"Recall: {recall_score(y_test, y_pred_logreg):.4f}")
print(f"F1-Score: {f1_score(y_test, y_pred_logreg):.4f}")

Logistic Regression Performance:
Accuracy: 0.8570
Precision: 0.4334
Recall: 0.8772
F1-Score: 0.5802


In [24]:
# Evaluate KNN
print("KNN Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_knn):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_knn):.4f}")
print(f"Recall: {recall_score(y_test, y_pred_knn):.4f}")
print(f"F1-Score: {f1_score(y_test, y_pred_knn):.4f}")

KNN Performance:
Accuracy: 0.8287
Precision: 0.3484
Recall: 0.5981
F1-Score: 0.4403


__Task 2.3__

Discuss your observations based on the evaluation results using Markdown cells below. (less than 100 words)

Logistic Regression outperforms KNN on all four metrics. Accuracy is close (0.857 versus 0.829), but Logistic Regression has higher precision (0.43 versus 0.35) and much higher recall (0.88 versus 0.60), so it finds more actual subscribers and a larger share of its positive predictions are correct. Its F1-score (0.58 versus 0.44) confirms the better balance. Precision is low for both models, meaning most predicted subscribers are false positives. This is likely because the models were trained on a balanced set but tested on data where only about 11% of clients subscribe. Logistic Regression is the more suitable baseline here.

---

## 3. Tuning Your Models

Use `GridSearchCV` from sklearn to perform hyperparameter tuning for both the logistic regression and KNN models. Define a grid of hyperparameters to search over for each model.

The goal is to improve the performance of both models through hyperparameter tuning. Hence, try to obtain a better performing model based on accuracy.

However, note that hyperparameter tuning can be computationally expensive. To manage this, limit the number of hyperparameter combinations by selecting only a few key hyperparameters and a small range of values for each. You can experiment with any number of hyperparameter values when you tune your models locally on your machine. For submission, however, keep the grid search reasonable (include the best hyperparameters found during your experiments) so that it completes in a timely manner.
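One way to gauge the cost of a grid before running it is to count the candidate combinations. The sketch below does this with `ParameterGrid`, using the logistic regression grid from Task 3.1 as the example; the variable names are chosen for illustration only.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the logistic regression grid used in Task 3.1.
example_grid = {
    "C": [0.01, 0.1, 1, 10],
    "penalty": ["l1", "l2"],
    "solver": ["liblinear"],
    "max_iter": [100, 200],
}
cv_folds = 5

# One candidate per combination of values: 4 * 2 * 1 * 2 = 16.
n_candidates = len(ParameterGrid(example_grid))

# Each candidate is fitted once per cross-validation fold.
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)  # 16 80
```

Adding one more value to any list multiplies the number of fits, so small grids keep the run time manageable. With `refit=True`, `GridSearchCV` also performs one final fit of the best candidate on the full training data.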

After completing the grid search, save the best models for both logistic regression and KNN based on cross-validation performance to the variables `best_logreg_model` and `best_knn_model` respectively.
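As a rough sketch of how the results of a grid search can be inspected, the example below runs a tiny search on synthetic data generated with `make_classification`. The data, the grid values, and the variable names are assumptions made for the illustration and are unrelated to the banking dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (hypothetical stand-in for real data).
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

toy_search = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5], "weights": ["uniform", "distance"]},
    cv=3,
    scoring="accuracy",
)
toy_search.fit(X_toy, y_toy)

# Best hyperparameter combination and its mean cross-validated accuracy.
print(toy_search.best_params_)
print(round(toy_search.best_score_, 4))

# With refit=True (the default), best_estimator_ is already refitted
# on all the data passed to fit() and can be used to predict directly.
toy_best_model = toy_search.best_estimator_
```

Note that `best_score_` is a cross-validation score on the training data, so it is not directly comparable to accuracy on a held-out test set.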


__Task 3.1__

Perform hyperparameter tuning for the logistic regression model using `GridSearchCV`.

In [25]:
# Define hyperparam grids for log regression
logreg_param_grid = {
    # Regularization strength
    'C': [0.01, 0.1, 1, 10],  
    # Regularization type
    'penalty': ['l1', 'l2'],  
    # Solver that supports both L1 and L2
    'solver': ['liblinear'],  
    # Maximum iterations
    'max_iter': [100, 200]   
}

# Perform Grid Search for Logistic Regression
logreg_grid_search = GridSearchCV(
    estimator=LogisticRegression(),
    param_grid=logreg_param_grid,
    # 5-fold cross-validation
    cv=5,  
    scoring='accuracy',
    # Use all processors
    n_jobs=-1,  
    verbose=1
)

logreg_grid_search.fit(X_train_scaled, y_train)

Fitting 5 folds for each of 16 candidates, totalling 80 fits


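| **`GridSearchCV` parameter** | **Value** |
|:-----------------------|:-----------|
| `estimator`            | `LogisticRegression()` |
| `param_grid`           | `{'C': [0.01, 0.1, ...], 'max_iter': [100, 200], 'penalty': ['l1', 'l2'], 'solver': ['liblinear']}` |
| `scoring`              | `'accuracy'` |
| `n_jobs`               | `-1`       |
| `refit`                | `True`     |
| `cv`                   | `5`        |
| `verbose`              | `1`        |
| `pre_dispatch`         | `'2*n_jobs'` |
| `error_score`          | `nan`      |
| `return_train_score`   | `False`    |

| **Best `LogisticRegression` parameter** | **Value** |
|:-----------------------|:-----------|
| `penalty`              | `'l1'`     |
| `dual`                 | `False`    |
| `tol`                  | `0.0001`   |
| `C`                    | `1`        |
| `fit_intercept`        | `True`     |
| `intercept_scaling`    | `1`        |
| `class_weight`         | `None`     |
| `random_state`         | `None`     |
| `solver`               | `'liblinear'` |
| `max_iter`             | `100`      |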


In [26]:
# Save your best model from the grid search here
best_logreg_model = logreg_grid_search.best_estimator_

In [27]:
# Evaluate the best model on the test set (DO NOT modify this cell)
from sklearn.metrics import accuracy_score

y_pred_best_logreg_test = best_logreg_model.predict(X_test_scaled)
best_logreg_test_accuracy = accuracy_score(y_test, y_pred_best_logreg_test)

__Task 3.2__

Perform hyperparameter tuning for the KNN model using `GridSearchCV`.

In [28]:
# Define hyperparameter grids for KNN
knn_param_grid = {
    # Number of neighbors
    'n_neighbors': [3, 5, 7, 9],    
    # Weight function
    'weights': ['uniform', 'distance'], 
    # Distance metric
    'metric': ['euclidean', 'manhattan'] 
}

# Perform Grid Search for KNN
knn_grid_search = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid=knn_param_grid,
    # 5-fold cross-validation
    cv=5,  
    scoring='accuracy',
    # Use all processors
    n_jobs=-1,  
    verbose=1
)

knn_grid_search.fit(X_train_scaled, y_train)

Fitting 5 folds for each of 16 candidates, totalling 80 fits


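| **`GridSearchCV` parameter** | **Value** |
|:-----------------------|:-----------|
| `estimator`            | `KNeighborsClassifier()` |
| `param_grid`           | `{'metric': ['euclidean', 'manhattan'], 'n_neighbors': [3, 5, ...], 'weights': ['uniform', 'distance']}` |
| `scoring`              | `'accuracy'` |
| `n_jobs`               | `-1`       |
| `refit`                | `True`     |
| `cv`                   | `5`        |
| `verbose`              | `1`        |
| `pre_dispatch`         | `'2*n_jobs'` |
| `error_score`          | `nan`      |
| `return_train_score`   | `False`    |

| **Best `KNeighborsClassifier` parameter** | **Value** |
|:-----------------------|:--------------|
| `n_neighbors`          | `3`           |
| `weights`              | `'distance'`  |
| `algorithm`            | `'auto'`      |
| `leaf_size`            | `30`          |
| `p`                    | `2`           |
| `metric`               | `'manhattan'` |
| `metric_params`        | `None`        |
| `n_jobs`               | `None`        |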


In [29]:
# Save your best model from the grid search here
best_knn_model = knn_grid_search.best_estimator_

In [30]:
# Evaluate the best model on the test set (DO NOT modify this cell)
from sklearn.metrics import accuracy_score

y_pred_best_knn_test = best_knn_model.predict(X_test_scaled)
best_knn_test_accuracy = accuracy_score(y_test, y_pred_best_knn_test)

<div style="display: flex; align-items: center; gap: 8px; margin: 12px 0; color: #868181;">
  <hr style="flex: 1; border: none; border-top: 1px solid #ccc;">
  <span style="font-size: 1.2em;">END ASSIGNMENT</span>
  <hr style="flex: 1; border: none; border-top: 1px solid #ccc;">
</div>
