# Introduction - German Credit Dataset

The German Credit Dataset classifies people described by a set of attributes as good or bad credit risks. 

It is commonly used for fairness tasks. The protected attribute is usually `personal_status_sex`, which combines gender and marital status.

### Imports

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from fairlearn.metrics import demographic_parity_difference, equalized_odds_difference
import fairness_functions as fp




### Load dataset

In [2]:
from ucimlrepo import fetch_ucirepo
  
# Fetch dataset
statlog_german_credit_data = fetch_ucirepo(id=144)
  
# Data (as pandas dataframes)
X = statlog_german_credit_data.data.features
y = statlog_german_credit_data.data.targets
  

# Manually define column names based on the dataset's variable information.
# Here is an example list of column names:
feature_col_names = [
    "checking_account_status",    # Attribute1: Status of existing checking account
    "duration",                   # Attribute2: Duration (months)
    "credit_history",             # Attribute3: Credit history
    "purpose",                    # Attribute4: Purpose
    "credit_amount",              # Attribute5: Credit amount
    "savings_account_bonds",      # Attribute6: Savings account/bonds
    "present_employment_since",   # Attribute7: Present employment since (Other)
    "installment_rate",           # Attribute8: Installment rate in percentage of disposable income
    "personal_status_sex",        # Attribute9: Marital Status / Personal status and sex
    "other_debtors",              # Attribute10: Other debtors / guarantors
    "present_residence_since",    # Attribute11: Present residence since
    "property",                   # Attribute12: Property
    "age",                        # Attribute13: Age (years)
    "other_installment_plans",    # Attribute14: Other installment plans
    "housing",                    # Attribute15: Housing (Other)
    "number_of_existing_credits", # Attribute16: Number of existing credits at this bank
    "occupation",                 # Attribute17: Occupation (Job)
    "number_of_people_liable",    # Attribute18: Number of people being liable to provide maintenance for
    "telephone",                  # Attribute19: Telephone (Binary)
    "foreign_worker"              # Attribute20: foreign worker (Binary)
]

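# Assign the descriptive column names to the features DataFrame.
X.columns = feature_col_names

sensitive_col = 'personal_status_sex'

# Drop rows with a missing sensitive attribute and keep the target aligned with the remaining rows.
X = X.dropna(subset=[sensitive_col])
y = y.loc[X.index]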



### Data Preprocessing
To prepare the dataset for model training, we perform the following steps:
- Handle missing values in categorical and numeric columns.
- Define `personal_status_sex` as the sensitive attribute for fairness evaluation.
- Encode categorical variables and filter out high-cardinality categorical features.
- Split the dataset into training and test sets.


#### Adjust target column to be binary

In [3]:
if isinstance(y, pd.DataFrame):
    y = y.squeeze()

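# Recode the target: 1 (good credit) -> 0, 2 (bad credit) -> 1, so the positive class is "bad risk".
y = y.map({1: 0, 2: 1})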
print("Unique target values after mapping:", y.unique())

print(y.value_counts())



Unique target values after mapping: [0 1]
class
0    700
1    300
Name: count, dtype: int64


### Handling Missing Values & Encoding
- **Numeric columns**: Missing values are imputed with the **mean**.
- **Categorical columns**: Missing values are replaced with the **most frequent value (mode)**.
- **One-hot encoding** is applied to categorical features to make them compatible with machine learning models.
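
The three steps above can be illustrated without any library. The following is a minimal sketch on made-up toy columns; the helper names `impute_mean`, `impute_mode` and `one_hot` are hypothetical and only mimic what the pandas calls in the next cells do.

```python
from collections import Counter
from statistics import mean

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def impute_mode(values):
    """Replace None entries with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]

def one_hot(values, drop_first=True):
    """Build one 0/1 indicator column per category, optionally dropping the first (sorted) category."""
    categories = sorted(set(values))
    if drop_first:
        categories = categories[1:]
    return {f"is_{c}": [int(v == c) for v in values] for c in categories}

# Toy columns, made up for illustration (not taken from the dataset).
duration = [12, None, 24, 36]
housing = ["own", "rent", None, "own"]

duration_filled = impute_mean(duration)    # None is replaced by the mean of 12, 24, 36
housing_filled = impute_mode(housing)      # None is replaced by "own", the most frequent value
housing_encoded = one_hot(housing_filled)  # "own" is the dropped reference category

print(duration_filled)
print(housing_filled)
print(housing_encoded)
```

Dropping the first category avoids a redundant indicator column: with two housing values, a single `is_rent` column already carries all the information.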


### Impute NaN Values

Numeric NaN values are imputed with the column mean, and NaNs in categorical columns with the column mode.

In [4]:
# Define which columns are categorical based on domain knowledge for German Credit data.
categorical_cols = [
    "checking_account_status",  # e.g., categorical status of checking account
    "credit_history",           # credit history categories
    "purpose",                  # purpose of credit
    "savings_account_bonds",    # savings account/bonds categories
    "present_employment_since", # employment status (categorical)
    "personal_status_sex",      # combined personal status and sex
    "other_debtors",            # categorical: other debtors/guarantors
    "property",                 # property information (categorical)
    "other_installment_plans",  # other installment plans
    "housing",                  # housing situation (categorical)
    "occupation",               # occupation categories
    "telephone",                # binary, but treated as categorical
    "foreign_worker"            # binary, but treated as categorical
]

# All remaining columns are considered numeric.
numeric_cols = [col for col in X.columns if col not in categorical_cols]

print("Numeric columns:", numeric_cols)
print("Categorical columns:", categorical_cols)

# Convert numeric columns to numeric dtype (forcing non-numeric values to NaN)
X_numeric = X[numeric_cols].apply(lambda col: pd.to_numeric(col, errors='coerce'))

# Fill missing values in numeric columns with the mean of each column.
X_numeric = X_numeric.fillna(X_numeric.mean())

# Filter the categorical columns: drop any that have high cardinality (threshold = 20 unique values)
max_unique_threshold = 20
filtered_categorical_cols = [col for col in categorical_cols if X[col].nunique() <= max_unique_threshold]
print("Filtered Categorical columns (<=20 unique values):", filtered_categorical_cols)

# Process the categorical columns: fill missing values with the mode.
X_categorical = X[filtered_categorical_cols].copy()
for col in filtered_categorical_cols:
    X_categorical[col] = X_categorical[col].fillna(X_categorical[col].mode()[0])


Numeric columns: ['duration', 'credit_amount', 'installment_rate', 'present_residence_since', 'age', 'number_of_existing_credits', 'number_of_people_liable']
Categorical columns: ['checking_account_status', 'credit_history', 'purpose', 'savings_account_bonds', 'present_employment_since', 'personal_status_sex', 'other_debtors', 'property', 'other_installment_plans', 'housing', 'occupation', 'telephone', 'foreign_worker']
Filtered Categorical columns (<=20 unique values): ['checking_account_status', 'credit_history', 'purpose', 'savings_account_bonds', 'present_employment_since', 'personal_status_sex', 'other_debtors', 'property', 'other_installment_plans', 'housing', 'occupation', 'telephone', 'foreign_worker']


### One-hot encode categorical features

In [5]:

# One-hot encode the filtered categorical columns using pandas' get_dummies, dropping the first category.
X_categorical_encoded = pd.get_dummies(X_categorical, drop_first=True)

# Combine numeric and one-hot encoded categorical columns.
X_processed = pd.concat([X_numeric, X_categorical_encoded], axis=1)

# Fill any remaining NaN values with 0.
X_processed = X_processed.fillna(0)

# Preserve the sensitive attribute for fairness evaluation.
sens = X[sensitive_col]

print("Shape of processed features:", X_processed.shape)


Shape of processed features: (1000, 48)


### Split data to train & test sets

In [6]:
# Split data and also split the sensitive attribute for evaluation
X_train, X_test, y_train, y_test, sens_train, sens_test = train_test_split(
    X_processed, y, sens, test_size=0.3, random_state=42
)


print("X train shape: ",X_train.shape)
print("X test shape: ",X_test.shape)

X train shape:  (700, 48)
X test shape:  (300, 48)


### Baseline Model - Logistic Regression
We begin with a **baseline logistic regression model** trained **without** any fairness constraints.
This model serves as a benchmark for comparing fairness-aware techniques.

The model evaluation includes:
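- **Accuracy**: Share of correct predictions.
- **F1 Score**: Harmonic mean of precision and recall for the positive class.
- **Demographic Parity Difference**: Largest gap in positive-prediction (selection) rate between groups; 0 means all groups are selected at the same rate.
- **Equalized Odds Difference**: The larger of the between-group gaps in true positive rate and false positive rate; 0 means equal error rates across groups.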
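
To make the two fairness metrics concrete, here is a small dependency-free sketch that computes them by hand on toy labels. The function names are hypothetical, and the definitions (max minus min of per-group rates, and the larger of the TPR and FPR gaps) are written to follow what fairlearn documents as its default behaviour; treat it as an illustration rather than a replacement for the library.

```python
def rate(flags):
    """Fraction of 1s in a list of 0/1 values; 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0

def group_rates(y_true, y_pred, groups):
    """Per-group selection rate, true positive rate and false positive rate."""
    stats = {}
    for g in sorted(set(groups)):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        stats[g] = {
            "selection": rate([y_pred[i] for i in idx]),
            "tpr": rate([y_pred[i] for i in idx if y_true[i] == 1]),
            "fpr": rate([y_pred[i] for i in idx if y_true[i] == 0]),
        }
    return stats

def spread(stats, key):
    """Gap between the highest and lowest value of one rate across groups."""
    values = [s[key] for s in stats.values()]
    return max(values) - min(values)

def demographic_parity_diff(y_true, y_pred, groups):
    return spread(group_rates(y_true, y_pred, groups), "selection")

def equalized_odds_diff(y_true, y_pred, groups):
    stats = group_rates(y_true, y_pred, groups)
    return max(spread(stats, "tpr"), spread(stats, "fpr"))

# Toy data, made up for illustration.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

dp = demographic_parity_diff(y_true, y_pred, groups)  # selection rates: A = 0.75, B = 0.25
eo = equalized_odds_diff(y_true, y_pred, groups)      # TPR gap = 0.5, FPR gap = 0.5
print(dp, eo)
```

Because `personal_status_sex` has more than two categories, taking the max-minus-min gap across all groups is what turns the per-group rates into a single number.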

In [7]:
# Train the logistic regression model
lr = LogisticRegression(random_state=42, max_iter=10000)
lr.fit(X_train, y_train)

# Predict on the test set with the baseline model
y_pred_baseline = lr.predict(X_test)

# Evaluate baseline performance metrics
baseline_accuracy = accuracy_score(y_test, y_pred_baseline)
f1_score_baseline = f1_score(y_test, y_pred_baseline)

# Evaluate fairness metrics for the baseline model
baseline_dp_diff = demographic_parity_difference(y_test, y_pred_baseline, sensitive_features=sens_test)
baseline_eo_diff = equalized_odds_difference(y_test, y_pred_baseline, sensitive_features=sens_test)

print("=== Baseline Model Metrics ===")
print("Accuracy:", baseline_accuracy)
print("F1 score:",f1_score_baseline) 
print("Demographic Parity Difference:", baseline_dp_diff)
print("Equalized Odds Difference:", baseline_eo_diff)


=== Baseline Model Metrics ===
Accuracy: 0.77
F1 score: 0.5548387096774193
Demographic Parity Difference: 0.15476190476190474
Equalized Odds Difference: 0.22539682539682537


### Baseline Model Evaluation
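The baseline reaches 77% accuracy, compared with about 70% for always predicting the majority class, but its F1 score is only about 0.55 and the fairness metrics show clear disparities:
- **Demographic Parity Difference** is about 0.15, meaning the rate of positive predictions (here, predicted "bad risk") differs by roughly 15 percentage points between the most and least affected groups.
- **Equalized Odds Difference** is about 0.23, meaning the true positive or false positive rate differs by that much across groups.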

We will now explore fairness interventions to address these biases.


### Naive Fairness Approach - Removing Sensitive Attributes
A simple approach to fairness is **removing the sensitive attribute (`personal_status_sex`)** from the dataset.
However, this method is often ineffective because bias can still persist in other correlated variables.
We will compare this approach to more advanced fairness-aware techniques.
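
The proxy effect can be demonstrated with synthetic data. In the sketch below, everything is made up: the groups, the `proxy` feature (which agrees with group membership 90% of the time), and the decision rule that only looks at the proxy. It is meant to show the mechanism, not to model this dataset.

```python
import random

random.seed(0)

n = 2000
groups = [random.choice(["A", "B"]) for _ in range(n)]

# Hypothetical proxy feature: 1 for group A and 0 for group B, flipped 10% of the time.
proxy = [(1 if g == "A" else 0) ^ (random.random() < 0.1) for g in groups]

# A "blind" decision rule that never sees the group, only the proxy.
predictions = list(proxy)

def selection_rate(preds, groups, g):
    """Share of positive predictions among members of group g."""
    members = [p for p, gi in zip(preds, groups) if gi == g]
    return sum(members) / len(members)

gap = abs(selection_rate(predictions, groups, "A") - selection_rate(predictions, groups, "B"))
print(f"Selection-rate gap without the sensitive column: {gap:.2f}")
```

Even though the sensitive column is never used, the selection rates of the two groups differ widely, because the proxy carries most of the same information.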

In [8]:
# Start from X_processed built above.
# Drop the one-hot columns derived from the sensitive attribute.
sensitive_encoded_cols = [col for col in X_processed.columns if col.startswith(sensitive_col + '_')]
X_processed_no_sensitive = X_processed.drop(columns=sensitive_encoded_cols)

# Split the data
X_train, X_test, y_train, y_test, sens_train, sens_test = train_test_split(
    X_processed_no_sensitive, y, sens, test_size=0.3, random_state=42
)

# Train the logistic regression model
lr = LogisticRegression(random_state=42,max_iter=10000)
lr.fit(X_train, y_train)

# Predict on the test set
y_pred_naive = lr.predict(X_test)

# Evaluate performance metrics for the naive model
naive_accuracy = accuracy_score(y_test, y_pred_naive)
f1_score_naive = f1_score(y_test, y_pred_naive)

# Evaluate fairness metrics for the naive model
naive_dp_diff = demographic_parity_difference(y_test, y_pred_naive, sensitive_features=sens_test)
naive_eo_diff = equalized_odds_difference(y_test, y_pred_naive, sensitive_features=sens_test)

print("=== Naive Model Metrics ===")
print("Accuracy:", naive_accuracy)
print("F1 score:",f1_score_naive) 
print("Demographic Parity Difference:", naive_dp_diff)
print("Equalized Odds Difference:", naive_eo_diff)


=== Naive Model Metrics ===
Accuracy: 0.7633333333333333
F1 score: 0.5477707006369427
Demographic Parity Difference: 0.11525974025974026
Equalized Odds Difference: 0.2698412698412699


### Naive Model Evaluation
Removing the sensitive attribute has a **mixed effect on the fairness metrics**, and disparities remain:
- **Demographic Parity Difference** decreases (from about 0.155 to 0.115) but is not eliminated.
- **Equalized Odds Difference** increases (from about 0.225 to 0.270), so error rates differ more across groups than before.
- Accuracy and F1 score are essentially unchanged (about 0.763 and 0.548).

This suggests that a more robust fairness-aware approach is required.


### Fairness-Aware Learning
We experiment with various fairness-aware techniques to **reduce bias while maintaining predictive performance**:
1. **Pre-processing**: Adjusting the dataset before training (e.g., removing correlations).
2. **In-processing**: Training with fairness constraints.
3. **Post-processing**: Adjusting model predictions to correct disparities.

Each method is evaluated based on the trade-off between **accuracy and fairness**.
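
As one concrete example of how sample weights can be used to pursue fairness, the sketch below computes reweighing weights in the style of Kamiran and Calders: each sample is weighted by P(group) * P(label) / P(group, label). This is an assumption about what a reweighting step could look like; it is not necessarily what `fp.in_reweighting` does, and the function names are hypothetical.

```python
from collections import Counter

def reweighing_weights(groups, labels):
    """Weight each sample by P(group) * P(label) / P(group, label)."""
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    joint_counts = Counter(zip(groups, labels))
    return [
        (group_counts[g] * label_counts[y]) / (n * joint_counts[(g, y)])
        for g, y in zip(groups, labels)
    ]

def weighted_positive_rate(groups, labels, weights, g):
    """Weighted share of positive labels within group g."""
    total = sum(w for gi, w in zip(groups, weights) if gi == g)
    positive = sum(w for gi, y, w in zip(groups, labels, weights) if gi == g and y == 1)
    return positive / total

# Toy data, made up for illustration: group A has mostly positive labels, group B mostly negative.
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

weights = reweighing_weights(groups, labels)
rate_a = weighted_positive_rate(groups, labels, weights, "A")
rate_b = weighted_positive_rate(groups, labels, weights, "B")
print(weights)
print(rate_a, rate_b)
```

After weighting, both groups have the same weighted positive rate, so group and label look statistically independent to the learner. Weights like these can be passed to scikit-learn estimators through the `sample_weight` argument of `fit`.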

In [9]:
# Define candidate methods for each stage.
pre_methods = {
    "None": fp.pre_none,
    "Correlation_Remover": fp.pre_correlation_remover,
    "Sensitive_Resampling": fp.pre_sensitive_resampling  # new candidate
}

in_methods = {
    "Baseline": fp.in_baseline,
    "Reweighting": fp.in_reweighting,
    "Exponential_Gradient_Demogrphic_Parity": fp.in_expgrad_dp,
    "Exponential_Gradient_Equalized_Odds": fp.in_expgrad_eo
}

post_methods = {
    "None": fp.post_none,
    "Threshold_Demogrphic_Parity": fp.post_threshold_dp,
    "Threshold_Equalized_Odds": fp.post_threshold_eo
}

# Run experiments:
results = fp.run_experiments(pre_methods, in_methods, post_methods,
                             X_train, y_train, sens_train,
                             X_test, y_test, sens_test)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

### Select only Pareto-optimal methods

In [10]:

objectives = {"f1_score": True,"accuracy":True, "Demographic_parity": False, "Equalized_odds": False}

frontier = fp.pareto_frontier(results, objectives)

print("Pareto Frontier configurations:")
for config, metrics in frontier.items():
    print(f"{config}: {metrics}")

Pareto Frontier configurations:
Pre-processing: None. In-training: Baseline. Post-processing:Threshold_Demogrphic_Parity: {'accuracy': 0.79, 'f1_score': 0.6227544910179641, 'Demographic_parity': 0.019573473561203647, 'Equalized_odds': 0.2206896551724138}
Pre-processing: None. In-training: Reweighting. Post-processing:Threshold_Equalized_Odds: {'accuracy': 0.6966666666666667, 'f1_score': 0.0, 'Demographic_parity': 0.0, 'Equalized_odds': 0.0}
Pre-processing: None. In-training: Exponential_Gradient_Demogrphic_Parity. Post-processing:None: {'accuracy': 0.74, 'f1_score': 0.4868421052631579, 'Demographic_parity': 0.1266233766233766, 'Equalized_odds': 0.17029862792574657}
Pre-processing: None. In-training: Exponential_Gradient_Demogrphic_Parity. Post-processing:Threshold_Equalized_Odds: {'accuracy': 0.6966666666666667, 'f1_score': 0.0, 'Demographic_parity': 0.0, 'Equalized_odds': 0.0}
Pre-processing: None. In-training: Exponential_Gradient_Equalized_Odds. Post-processing:Threshold_Equalized_O
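
The selection step can be sketched as a dominance check: a configuration is kept unless some other configuration is at least as good on every objective and strictly better on at least one. The code below is a hedged illustration with hypothetical names and made-up metric values. It assumes that `True` in `objectives` means "maximize" and `False` means "minimize", and it is not the implementation of `fp.pareto_frontier`.

```python
def dominates(a, b, objectives):
    """True if `a` is at least as good as `b` on every objective and strictly better on one."""
    at_least_as_good = True
    strictly_better = False
    for name, maximize in objectives.items():
        # Flip the sign of minimized objectives so that "larger is better" holds everywhere.
        x, y = (a[name], b[name]) if maximize else (-a[name], -b[name])
        if x < y:
            at_least_as_good = False
        elif x > y:
            strictly_better = True
    return at_least_as_good and strictly_better

def pareto_front(results, objectives):
    """Keep every configuration that no other configuration dominates."""
    return {
        name: metrics
        for name, metrics in results.items()
        if not any(
            dominates(other, metrics, objectives)
            for other_name, other in results.items()
            if other_name != name
        )
    }

objectives = {"f1_score": True, "accuracy": True, "Demographic_parity": False, "Equalized_odds": False}

# Hypothetical metric values, chosen for illustration (not results from this notebook).
toy_results = {
    "accurate": {"accuracy": 0.79, "f1_score": 0.62, "Demographic_parity": 0.02, "Equalized_odds": 0.22},
    "constant": {"accuracy": 0.70, "f1_score": 0.00, "Demographic_parity": 0.00, "Equalized_odds": 0.00},
    "dominated": {"accuracy": 0.74, "f1_score": 0.49, "Demographic_parity": 0.13, "Equalized_odds": 0.25},
}

front = pareto_front(toy_results, objectives)
print(sorted(front))
```

Note that a constant classifier survives the check: it is perfectly "fair" on both fairness metrics, so nothing can dominate it even though its F1 score is zero. This is why a Pareto frontier usually needs a further filtering step.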

### Results & Discussion
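After testing multiple fairness-aware models, we analyze the **Pareto-optimal** solutions, i.e. the configurations that no other configuration beats on every objective at once.

Key findings:
- Some configurations push **Demographic Parity Difference** close to zero without losing accuracy: the baseline model with demographic-parity thresholding reaches accuracy 0.79 and F1 0.62 with a difference of about 0.02, although its Equalized Odds Difference stays at about 0.22.
- Configurations with an F1 score of 0.0 and both fairness metrics at 0.0 are degenerate. These values are consistent with a model that never predicts the positive class, which is trivially "fair" but useless as a classifier.
- Other methods **partially mitigate bias** at a moderate cost: exponentiated gradient with a demographic parity constraint gives accuracy 0.74 and F1 0.49, with fairness differences of about 0.13 and 0.17.
- The best approach depends on the acceptable trade-off between fairness and model performance.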


### Apply thresholds on bias and portion of retained accuracy

### Set thresholds on F1 score, accuracy, demographic parity and equalized odds

In [11]:
f1_threshold = 0.4
accuracy_threshold = 0.7
demographic_parity_threshold = 0.2
equalized_odds_threshold = 0.2

In [12]:
# Filter results based on thresholds.
filtered = fp.filter_results(frontier,
                             f1_threshold=f1_threshold,
                             accuracy_threshold=accuracy_threshold,
                             dp_threshold=demographic_parity_threshold,
                             eo_threshold=equalized_odds_threshold)

print("\nFiltered Results (satisfying thresholds):")
for config, metrics in filtered.items():
    print(config, metrics)


Filtered Results (satisfying thresholds):
Pre-processing: None. In-training: Exponential_Gradient_Demogrphic_Parity. Post-processing:None {'accuracy': 0.74, 'f1_score': 0.4868421052631579, 'Demographic_parity': 0.1266233766233766, 'Equalized_odds': 0.17029862792574657}
Pre-processing: Correlation_Remover. In-training: Exponential_Gradient_Equalized_Odds. Post-processing:Threshold_Demogrphic_Parity {'accuracy': 0.7366666666666667, 'f1_score': 0.4148148148148148, 'Demographic_parity': 0.01798661461238149, 'Equalized_odds': 0.1333333333333333}
Pre-processing: Sensitive_Resampling. In-training: Baseline. Post-processing:Threshold_Equalized_Odds {'accuracy': 0.7133333333333334, 'f1_score': 0.5943396226415094, 'Demographic_parity': 0.19719544259421556, 'Equalized_odds': 0.13333333333333341}
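
The filtering logic can be sketched in a few lines. The function `filter_by_thresholds` below is hypothetical and assumes inclusive comparisons (performance at or above the minimum, fairness differences at or below the maximum); `fp.filter_results` may differ in detail. The three candidates are entries from the Pareto frontier output earlier, with values rounded to four decimals and shortened labels chosen for this sketch.

```python
def filter_by_thresholds(results, min_f1, min_accuracy, max_dp, max_eo):
    """Keep configurations that meet both performance minimums and both fairness maximums."""
    return {
        name: m
        for name, m in results.items()
        if m["f1_score"] >= min_f1
        and m["accuracy"] >= min_accuracy
        and m["Demographic_parity"] <= max_dp
        and m["Equalized_odds"] <= max_eo
    }

# Rounded copies of three frontier entries; the keys are shortened labels, not the original names.
candidates = {
    "baseline + threshold_dp": {
        "accuracy": 0.79, "f1_score": 0.6228, "Demographic_parity": 0.0196, "Equalized_odds": 0.2207,
    },
    "reweighting + threshold_eo": {
        "accuracy": 0.6967, "f1_score": 0.0, "Demographic_parity": 0.0, "Equalized_odds": 0.0,
    },
    "expgrad_dp": {
        "accuracy": 0.74, "f1_score": 0.4868, "Demographic_parity": 0.1266, "Equalized_odds": 0.1703,
    },
}

kept = filter_by_thresholds(candidates, min_f1=0.4, min_accuracy=0.7, max_dp=0.2, max_eo=0.2)
print(sorted(kept))
```

Under these assumptions, the most accurate configuration is rejected because its Equalized Odds Difference is just above 0.2, and the degenerate configuration is rejected on F1 score and accuracy, which matches their absence from the filtered output.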


### Conclusion
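- The **baseline model** reached 77% accuracy but exhibited fairness disparities (differences of about 0.15 in demographic parity and 0.23 in equalized odds).
- **Removing the sensitive attribute** lowered the Demographic Parity Difference but raised the Equalized Odds Difference, so it was insufficient to address bias.
- **Fairness-aware techniques** demonstrated better trade-offs between fairness and performance, with three configurations meeting all four thresholds.
- The **optimal solution** depends on the **desired balance between accuracy and fairness**.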

This analysis highlights the importance of **evaluating fairness in credit risk prediction models** and demonstrates multiple strategies to mitigate bias.
