# Introduction - Bank Marketing Dataset

The data relates to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable `y`).

The dataset is commonly used for fairness tasks, often with marital status as the sensitive attribute. This notebook instead uses a derived age-based attribute, `middle_aged`.

### Imports

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from fairlearn.metrics import demographic_parity_difference, equalized_odds_difference
import fairness_functions as fp




### Load dataset

In [2]:

# fetch dataset
from ucimlrepo import fetch_ucirepo

bank_marketing = fetch_ucirepo(id=222)

# data (as pandas dataframes); copy the features so that adding a column
# below does not touch the fetched frame
X = bank_marketing.data.features.copy()
y = bank_marketing.data.targets
  
# metadata 
print(bank_marketing.metadata) 
  
# variable information 
print(bank_marketing.variables) 

# define middle age (25 <= age <= 60) as the sensitive attribute
X['middle_aged'] = X['age'].between(25, 60).astype(int)

sensitive_col = 'middle_aged'

# middle_aged is 0 or 1 by construction, so this safeguard drops no rows here
X = X.dropna(subset=[sensitive_col])



{'uci_id': 222, 'name': 'Bank Marketing', 'repository_url': 'https://archive.ics.uci.edu/dataset/222/bank+marketing', 'data_url': 'https://archive.ics.uci.edu/static/public/222/data.csv', 'abstract': 'The data is related with direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict if the client will subscribe a term deposit (variable y).', 'area': 'Business', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 45211, 'num_features': 16, 'feature_types': ['Categorical', 'Integer'], 'demographics': ['Age', 'Occupation', 'Marital Status', 'Education Level'], 'target_col': ['y'], 'index_col': None, 'has_missing_values': 'yes', 'missing_values_symbol': 'NaN', 'year_of_dataset_creation': 2014, 'last_updated': 'Fri Aug 18 2023', 'dataset_doi': '10.24432/C5K306', 'creators': ['S. Moro', 'P. Rita', 'P. Cortez'], 'intro_paper': {'ID': 277, 'type': 'NATIVE', 'title': 'A data-driven approach to predict the s

### Data Preprocessing
In this section, we prepare the dataset for model training. This includes:
- Handling missing values.
- Defining the **middle-aged** attribute (`25 <= age <= 60`) as the sensitive feature.
- Encoding categorical variables.
- Splitting the dataset into training and testing sets.
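
As a minimal sketch of the second step, the flag can be derived with pandas' `Series.between`, which includes both endpoints by default. The toy `ages` values are illustrative only and are not taken from the dataset.

```python
import pandas as pd

# Toy ages (illustrative only, not dataset values)
ages = pd.Series([18, 25, 40, 60, 61], name="age")

# 1 if 25 <= age <= 60, else 0; between() includes both endpoints by default
middle_aged = ages.between(25, 60).astype(int)
print(middle_aged.tolist())  # [0, 1, 1, 1, 0]
```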


### Impute NaN Values

Numeric NaN values are imputed with the column mean, and NaNs in categorical columns are imputed with the column mode.

In [3]:
# Define categorical columns based on dataset description
categorical_cols = ['job', 'marital', 'education', 'default', 'housing', 
                    'loan', 'contact', 'month', 'poutcome']

# Define numeric columns by excluding categorical and target columns
numeric_cols = [col for col in X.columns if col not in categorical_cols + ['y']]

print("Numeric columns:", numeric_cols)
print("Categorical columns:", categorical_cols)

# Convert numeric columns to numeric dtype (forcing non-numeric values to NaN)
X_numeric = X[numeric_cols].apply(lambda col: pd.to_numeric(col, errors='coerce'))

# Fill missing values in numeric columns with the mean of each column.
X_numeric = X_numeric.fillna(X_numeric.mean())

# For categorical columns, filter out any with high cardinality (e.g., >20 unique values)
max_unique_threshold = 20
filtered_categorical_cols = [col for col in categorical_cols if X[col].nunique() <= max_unique_threshold]
print("Filtered Categorical columns (<=20 unique values):", filtered_categorical_cols)

# Process categorical columns: fill missing values with the mode.
X_categorical = X[filtered_categorical_cols].copy()
for col in filtered_categorical_cols:
    X_categorical[col] = X_categorical[col].fillna(X_categorical[col].mode()[0])




Numeric columns: ['age', 'balance', 'day_of_week', 'duration', 'campaign', 'pdays', 'previous', 'middle_aged']
Categorical columns: ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome']
Filtered Categorical columns (<=20 unique values): ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome']


### Handling Missing Values & Encoding
We process missing values by:
- Imputing numeric columns with their mean.
- Filling missing values in categorical columns with the most frequent value.

Categorical variables are encoded using **one-hot encoding**, ensuring they can be used in machine learning models.
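
The cells in this notebook compute the mean and mode on the full dataset before splitting. A possible variant, sketched below under the assumption that a train/test split already exists, estimates the fill values on the training split only and aligns the one-hot columns of the test split to the training layout. The toy frames and the `impute` helper are hypothetical, and `drop_first` is omitted to keep the example short.

```python
import pandas as pd

# Toy train/test frames (illustrative values only)
train = pd.DataFrame({"balance": [100.0, None, 300.0],
                      "job": ["admin", "admin", None]})
test = pd.DataFrame({"balance": [None, 50.0],
                     "job": ["student", None]})

# Fill values are estimated on the training split only
balance_mean = train["balance"].mean()
job_mode = train["job"].mode()[0]

def impute(df):
    """Hypothetical helper: apply the training-split fill values to df."""
    return df.assign(balance=df["balance"].fillna(balance_mean),
                     job=df["job"].fillna(job_mode))

train_enc = pd.get_dummies(impute(train), columns=["job"])

# Align the test columns to the training layout: categories unseen in
# training are dropped, categories missing from test are filled with 0
test_enc = pd.get_dummies(impute(test), columns=["job"]).reindex(
    columns=train_enc.columns, fill_value=0)

print(test_enc)
```

Whether the difference matters in practice depends on how many values are actually missing, but this layout keeps test-set information out of the preprocessing step.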


### One-hot encode categorical features

In [4]:

# One-hot encode the filtered categorical columns using pandas' get_dummies, dropping the first category.
X_categorical_encoded = pd.get_dummies(X_categorical, drop_first=True)

# Combine numeric and one-hot encoded categorical columns.
X_processed = pd.concat([X_numeric, X_categorical_encoded], axis=1)

# Fill any remaining NaN values with 0.
X_processed = X_processed.fillna(0)

# Preserve the sensitive attribute for fairness evaluation.
sens = X[sensitive_col]

print("Shape of processed features:", X_processed.shape)


Shape of processed features: (45211, 39)


### Split data to train & test sets

In [5]:

y = y.squeeze()  # Converts a single-column DataFrame to a Series if needed

# Convert categorical target labels to binary
y = y.map({'yes': 1, 'no': 0})


# Split data and also split the sensitive attribute for evaluation
X_train, X_test, y_train, y_test, sens_train, sens_test = train_test_split(
    X_processed, y, sens, test_size=0.3, random_state=42
)


print("X train shape: ",X_train.shape)
print("X test shape: ",X_test.shape)



X train shape:  (31647, 39)
X test shape:  (13564, 39)


### Train and evaluate baseline model

### Baseline Model - Logistic Regression
We begin with a **baseline logistic regression model** trained without any fairness adjustments.
This model will serve as a benchmark for evaluating fairness-aware techniques.

The model is evaluated using:
- **Accuracy**: Measures overall prediction correctness.
- **F1 Score**: Balances precision and recall.
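- **Demographic Parity Difference**: The gap between the highest and lowest rate of positive predictions across groups; 0 means all groups receive positive predictions at the same rate.
- **Equalized Odds Difference**: The larger of the between-group gaps in true positive rate and false positive rate; 0 means both rates are equal across groups.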
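
To make the two fairness metrics concrete, the sketch below recomputes them by hand on toy arrays. It follows the definitions fairlearn uses with default arguments: the demographic parity difference is the gap between the largest and smallest group selection rate, and the equalized odds difference is the larger of the between-group gaps in true positive rate and false positive rate. The helper names and the toy data are hypothetical.

```python
import numpy as np

def rate(mask, y_pred):
    """Share of positive predictions among the rows selected by mask."""
    return y_pred[mask].mean()

def dp_difference(y_pred, sens):
    # Selection rate per group, then the max-min gap
    rates = [rate(sens == g, y_pred) for g in np.unique(sens)]
    return max(rates) - min(rates)

def eo_difference(y_true, y_pred, sens):
    groups = np.unique(sens)
    # True positive rate: positive predictions among actual positives
    tpr = [rate((sens == g) & (y_true == 1), y_pred) for g in groups]
    # False positive rate: positive predictions among actual negatives
    fpr = [rate((sens == g) & (y_true == 0), y_pred) for g in groups]
    return max(max(tpr) - min(tpr), max(fpr) - min(fpr))

# Toy labels, predictions and group membership (illustrative only)
y_true = np.array([1, 1, 0, 0, 1, 1, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0])
sens = np.array([1, 1, 1, 1, 0, 0, 0, 0])

print(dp_difference(y_pred, sens))          # 0.5
print(eo_difference(y_true, y_pred, sens))  # 0.5
```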


In [6]:
# Train the logistic regression model
lr = LogisticRegression(random_state=42, max_iter=10000)
lr.fit(X_train, y_train)

# Predict on the test set with the baseline model
y_pred_baseline = lr.predict(X_test)

# Evaluate baseline performance metrics
baseline_accuracy = accuracy_score(y_test, y_pred_baseline)
f1_score_baseline = f1_score(y_test, y_pred_baseline)

# Evaluate fairness metrics for the baseline model
baseline_dp_diff = demographic_parity_difference(y_test, y_pred_baseline, sensitive_features=sens_test)
baseline_eo_diff = equalized_odds_difference(y_test, y_pred_baseline, sensitive_features=sens_test)

print("=== Baseline Model Metrics ===")
print("Accuracy:", baseline_accuracy)
print("F1 score:",f1_score_baseline) 
print("Demographic Parity Difference:", baseline_dp_diff)
print("Equalized Odds Difference:", baseline_eo_diff)


=== Baseline Model Metrics ===
Accuracy: 0.8987024476555588
F1 score: 0.4396411092985318
Demographic Parity Difference: 0.24352861751267824
Equalized Odds Difference: 0.22574017708909794


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
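
The warning above suggests scaling the data. A minimal sketch of that suggestion, on synthetic features rather than the bank data, wraps `StandardScaler` and `LogisticRegression` in a scikit-learn pipeline so that the scaler is fitted only on the data passed to `fit`. The variable names and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features on very different scales (illustrative only)
X_demo = np.column_stack([
    rng.normal(0, 1, 500),
    rng.normal(1500, 3000, 500),
    rng.normal(250, 250, 500),
])
# Noisy linear rule so that both classes are present
signal = X_demo[:, 0] + X_demo[:, 1] / 3000 + rng.normal(0, 0.5, 500)
y_demo = (signal > 0.5).astype(int)

# Standardize, then fit; the scaler statistics come from the fit data only
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_demo, y_demo)

# Number of solver iterations actually used
print(clf.named_steps["logisticregression"].n_iter_)
```

On standardized inputs the solver typically needs far fewer iterations than the limit, which is why scaling is the first remedy the warning lists.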


### Baseline Model Evaluation
The baseline model reaches about 0.90 accuracy, but its F1 score is only about 0.44, and it exhibits fairness disparities:
- The **Demographic Parity Difference** is about 0.24, meaning one group is considerably more likely to receive positive predictions.
- The **Equalized Odds Difference** is about 0.23, indicating unequal true or false positive rates across groups.

We will now explore fairness interventions to address these disparities.
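
Accuracy alone can overstate model quality when the positive class is rare, which is one reading of the gap between the accuracy and F1 values printed above. The toy labels below (illustrative, not the dataset) show that a classifier that never predicts the positive class still scores high accuracy while its F1 score is zero. The helper functions are hypothetical re-implementations for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1(y_true, y_pred):
    """F1 for the positive class; defined as 0.0 when there are no positives at all."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# One positive among ten examples (illustrative imbalance)
y_true = [1] + [0] * 9
always_no = [0] * 10

print(accuracy(y_true, always_no))  # 0.9
print(f1(y_true, always_no))        # 0.0
```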


### Naive Fairness Approach - Removing Sensitive Attributes
A simple strategy to mitigate bias is **removing the sensitive feature (`middle_aged`)** from the dataset.
However, bias can still be embedded in other correlated variables.
This approach will be compared to more sophisticated fairness-aware techniques.
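
The caveat about correlated variables can be probed by checking how strongly the remaining features track the sensitive attribute. In this notebook `age` stays among the numeric features, and `middle_aged` is a deterministic function of it. The sketch below, on a toy frame with illustrative values, uses `DataFrame.corrwith` and also shows a limit of that check: a band indicator can have near-zero linear correlation with `age` while remaining exactly recoverable from it.

```python
import pandas as pd

# Toy frame (illustrative values only)
df = pd.DataFrame({"age": [10, 30, 40, 50, 70],
                   "balance": [5, 1, 4, 2, 3]})
df["middle_aged"] = df["age"].between(25, 60).astype(int)

# Features that remain after dropping the sensitive column
features = df.drop(columns="middle_aged")

# Pearson correlation of each remaining feature with the sensitive attribute
proxy_corr = features.corrwith(df["middle_aged"])
print(proxy_corr)

# The flag is still exactly recoverable from age despite the weak correlation
recovered = features["age"].between(25, 60).astype(int)
print((recovered == df["middle_aged"]).all())  # True
```

A linear correlation screen is therefore only a first check; a non-monotonic relationship like an age band needs a different test, for example trying to predict the sensitive attribute from the remaining features.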


In [7]:
# Reuse X_processed from the cells above and
# drop the sensitive column from the entire processed dataset
X_processed_no_sensitive = X_processed.drop(columns=sensitive_col)

# Split the data
X_train, X_test, y_train, y_test, sens_train, sens_test = train_test_split(
    X_processed_no_sensitive, y, sens, test_size=0.3, random_state=42
)

# Train the logistic regression model
lr = LogisticRegression(random_state=42,max_iter=10000)
lr.fit(X_train, y_train)

# Predict on the test set
y_pred_naive = lr.predict(X_test)

# Evaluate performance metrics for the naive model
naive_accuracy = accuracy_score(y_test, y_pred_naive)
f1_score_naive = f1_score(y_test, y_pred_naive)

# Evaluate fairness metrics for the naive model
naive_dp_diff = demographic_parity_difference(y_test, y_pred_naive, sensitive_features=sens_test)
naive_eo_diff = equalized_odds_difference(y_test, y_pred_naive, sensitive_features=sens_test)

print("=== Naive Model Metrics ===")
print("Accuracy:", naive_accuracy)
print("F1 score:",f1_score_naive) 
print("Demographic Parity Difference:", naive_dp_diff)
print("Equalized Odds Difference:", naive_eo_diff)


=== Naive Model Metrics ===
Accuracy: 0.8986287230905338
F1 score: 0.4339234252778921
Demographic Parity Difference: 0.14622118013247004
Equalized Odds Difference: 0.08658261408160078


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


### Naive Model Evaluation
Removing the sensitive attribute **reduces** fairness disparities but does not eliminate them.
- The Demographic Parity Difference drops from about 0.24 to 0.15 and the Equalized Odds Difference from about 0.23 to 0.09, while accuracy and F1 are nearly unchanged.
- A more robust approach is required to better balance accuracy and fairness.


### Fairness-Aware Learning
To achieve **better fairness without significantly sacrificing accuracy**, we experiment with multiple fairness-aware techniques:
1. **Pre-processing**: Adjusting data distributions to minimize bias before training.
2. **In-processing**: Training with fairness constraints.
3. **Post-processing**: Adjusting final predictions to correct disparities.

Each method is evaluated based on accuracy and fairness trade-offs.
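
As one illustration of a fairness-aware technique, the sketch below computes reweighing weights in the style of Kamiran and Calders, w(s, y) = P(s) P(y) / P(s, y), which make the label independent of the group in the weighted sample. The implementation of `fp.in_reweighting` is not shown in this notebook, so this is an assumption about the general idea and not a description of that function; `reweighing_weights` and the toy data are hypothetical.

```python
import pandas as pd

def reweighing_weights(y, sens):
    """Hypothetical helper: per-row weight w(s, y) = P(s) * P(y) / P(s, y)."""
    df = pd.DataFrame({"y": y, "s": sens})
    p_s = df["s"].map(df["s"].value_counts(normalize=True))
    p_y = df["y"].map(df["y"].value_counts(normalize=True))
    # Joint frequency of each (group, label) cell, broadcast back to the rows
    p_sy = df.groupby(["s", "y"])["y"].transform("count") / len(df)
    return (p_s * p_y / p_sy).to_numpy()

# Toy labels and group membership (illustrative only):
# group 1 has 3 of 4 positives, group 0 has 1 of 4
y = [1, 1, 1, 0, 1, 0, 0, 0]
sens = [1, 1, 1, 1, 0, 0, 0, 0]
w = reweighing_weights(y, sens)

# Weighted positive rate per group is equal after reweighing
frame = pd.DataFrame({"y": y, "s": sens, "w": w})
weighted_rate = frame.groupby("s").apply(
    lambda g: (g["w"] * g["y"]).sum() / g["w"].sum())
print(weighted_rate)
```

Weights of this kind can be passed to an estimator through the `sample_weight` argument of `LogisticRegression.fit`.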

In [8]:
# Define candidate methods for each stage.
pre_methods = {
    "None": fp.pre_none,
    "Correlation_Remover": fp.pre_correlation_remover,
    "Sensitive_Resampling": fp.pre_sensitive_resampling  # new candidate
}

in_methods = {
    "Baseline": fp.in_baseline,
    "Reweighting": fp.in_reweighting,
    "Exponential_Gradient_Demogrphic_Parity": fp.in_expgrad_dp,
    "Exponential_Gradient_Equalized_Odds": fp.in_expgrad_eo
}

post_methods = {
    "None": fp.post_none,
    "Threshold_Demogrphic_Parity": fp.post_threshold_dp,
    "Threshold_Equalized_Odds": fp.post_threshold_eo
}

# Run experiments:
results = fp.run_experiments(pre_methods, in_methods, post_methods,
                             X_train, y_train, sens_train,
                             X_test, y_test, sens_test)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

### Select only Pareto-optimal methods

In [9]:

objectives = {"f1_score": True, "accuracy": True, "Demographic_parity": False, "Equalized_odds": False}

frontier = fp.pareto_frontier(results, objectives)

print("Pareto Frontier configurations:")
for config, metrics in frontier.items():
    print(f"{config}: {metrics}")

Pareto Frontier configurations:
Pre-processing: None. In-training: Baseline. Post-processing:None: {'accuracy': 0.897670303745208, 'f1_score': 0.4269199009083402, 'Demographic_parity': 0.16762055812518167, 'Equalized_odds': 0.1144438295517432}
Pre-processing: None. In-training: Baseline. Post-processing:Threshold_Demogrphic_Parity: {'accuracy': 0.8959746387496313, 'f1_score': 0.41766405282707386, 'Demographic_parity': 0.0010695716228525873, 'Equalized_odds': 0.24243220807969007}
Pre-processing: None. In-training: Baseline. Post-processing:Threshold_Equalized_Odds: {'accuracy': 0.8825567679150693, 'f1_score': 0.025688073394495414, 'Demographic_parity': 0.0006220585498833545, 'Equalized_odds': 0.004053680132816822}
Pre-processing: None. In-training: Reweighting. Post-processing:None: {'accuracy': 0.8958271896195813, 'f1_score': 0.43138832997987925, 'Demographic_parity': 0.16797503153865043, 'Equalized_odds': 0.1063018815716657}
Pre-processing: None. In-training: Reweighting. Post-process

### Results & Discussion
After testing multiple fairness-aware models, we analyze the **Pareto-optimal** solutions that best balance fairness and accuracy.
Key observations:
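- Some methods reduce one disparity at a cost elsewhere: with the baseline learner, `Threshold_Demogrphic_Parity` post-processing brings the **Demographic Parity Difference** to about 0.001 but raises the Equalized Odds Difference to about 0.24, and `Threshold_Equalized_Odds` drives both differences close to zero while the F1 score collapses to about 0.03.
- Other methods balance fairness while maintaining **reasonable predictive performance**.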
- The optimal strategy depends on the acceptable trade-off between accuracy and fairness.
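
Pareto-optimal here means that no other configuration is at least as good on every objective and strictly better on at least one. The sketch below is a minimal, hypothetical version of such a filter. It assumes the same convention as the `objectives` dictionary above (`True` for metrics to maximize, `False` for metrics to minimize), but it is not the implementation of `fp.pareto_frontier`, which this notebook does not show.

```python
def dominates(a, b, objectives):
    """True if metrics a are no worse than b everywhere and better somewhere."""
    no_worse = all((a[k] >= b[k]) if higher else (a[k] <= b[k])
                   for k, higher in objectives.items())
    better = any((a[k] > b[k]) if higher else (a[k] < b[k])
                 for k, higher in objectives.items())
    return no_worse and better

def pareto_front(results, objectives):
    """Keep the configurations that no other configuration dominates."""
    return {
        name: metrics
        for name, metrics in results.items()
        if not any(dominates(other, metrics, objectives)
                   for other_name, other in results.items()
                   if other_name != name)
    }

# Toy results (illustrative numbers only)
toy = {
    "A": {"accuracy": 0.90, "dp": 0.20},
    "B": {"accuracy": 0.88, "dp": 0.05},
    "C": {"accuracy": 0.85, "dp": 0.10},  # dominated by B on both metrics
}
front = pareto_front(toy, {"accuracy": True, "dp": False})
print(sorted(front))  # ['A', 'B']
```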


### Apply thresholds on bias and portion of retained accuracy

### Set thresholds on F1, accuracy, demographic parity and equalized odds

In [10]:
f1_threshold = 0.38
demographic_parity_threshold = 0.1
equalized_odds_threshold = 0.1
accuracy_threshold = 0.70

In [11]:
# Filter results based on thresholds.
filtered = fp.filter_results(frontier, f1_threshold=f1_threshold,accuracy_threshold=accuracy_threshold,
                            dp_threshold=demographic_parity_threshold, eo_threshold=equalized_odds_threshold)

print("\nFiltered Results (satisfying thresholds):")
for config, metrics in filtered.items():
    print(config, metrics)


Filtered Results (satisfying thresholds):
Pre-processing: Sensitive_Resampling. In-training: Exponential_Gradient_Equalized_Odds. Post-processing:None {'accuracy': 0.8645679740489531, 'f1_score': 0.386644407345576, 'Demographic_parity': 0.09386368844068095, 'Equalized_odds': 0.029042621696691293}
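
The filtering step can be sketched as a dictionary comprehension. This is a hypothetical stand-in for `fp.filter_results`, whose implementation is not shown here; it assumes that F1 and accuracy act as lower bounds and the two fairness differences as upper bounds. The toy entries use illustrative numbers.

```python
def filter_by_thresholds(results, f1_min, acc_min, dp_max, eo_max):
    """Hypothetical helper: keep configurations that satisfy every threshold."""
    return {
        name: m
        for name, m in results.items()
        if m["f1_score"] >= f1_min
        and m["accuracy"] >= acc_min
        and m["Demographic_parity"] <= dp_max
        and m["Equalized_odds"] <= eo_max
    }

# Toy results (illustrative numbers only)
toy_results = {
    "fair_enough": {"accuracy": 0.86, "f1_score": 0.39,
                    "Demographic_parity": 0.09, "Equalized_odds": 0.03},
    "too_biased": {"accuracy": 0.90, "f1_score": 0.43,
                   "Demographic_parity": 0.17, "Equalized_odds": 0.11},
    "low_f1": {"accuracy": 0.88, "f1_score": 0.03,
               "Demographic_parity": 0.00, "Equalized_odds": 0.00},
}
kept = filter_by_thresholds(toy_results, f1_min=0.38, acc_min=0.70,
                            dp_max=0.1, eo_max=0.1)
print(list(kept))  # ['fair_enough']
```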


### Conclusion
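- The **baseline model** reached about 0.90 accuracy but showed substantial fairness disparities (both differences above 0.22).
- **Removing the sensitive attribute** reduced both fairness differences noticeably but did not fully resolve bias.
- **Fairness-aware methods** offered further trade-offs: the only Pareto-optimal configuration meeting all thresholds (Sensitive_Resampling with Exponential_Gradient_Equalized_Odds) reached a Demographic Parity Difference of about 0.09 and an Equalized Odds Difference of about 0.03 at about 0.86 accuracy.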
- The **optimal model** depends on the business context and how much fairness should be prioritized over accuracy.

This analysis underscores the importance of fairness evaluation in machine learning and presents various strategies to mitigate bias.
