# Assignment 01

#### Deadline: No deadline
#### Deliverables: You will be quizzed on this assignment in the first 15 minutes of the next tutorial.

---

## Data

We are using the [Statlog (German Credit Data)](http://archive.ics.uci.edu) dataset. It classifies people, each described by a set of 20 features, as good or bad credit risks.

---

## 1. Data Exploration and Preparation 

**Q1.1** Load and inspect the data. Provide:
- Dataset dimensions (number of samples, features)
- Class distribution (percentage good vs bad credit risk)
- Summary statistics for numerical features
- Identification of categorical features and their unique values


In [3]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd

In [4]:
# complete dataset
loan_data = pd.read_csv('../datasets/credit/credit-g_csv.csv')

# train_data 
X_train = pd.read_csv('../datasets/credit/encoded_credit_X_train.csv')
y_train = pd.read_csv('../datasets/credit/credit_y_train.csv')

# test data
X_test = pd.read_csv('../datasets/credit/encoded_credit_X_test.csv')
y_test = pd.read_csv('../datasets/credit/credit_y_test.csv')

In [5]:
# create a dataframe to save y values in 
y_results = pd.DataFrame()
y_results['y_test'] = y_test

**Q1.2** Analyze class imbalance:
- Calculate the imbalance ratio
- Discuss why this matters for model evaluation
- Propose at least 3 appropriate performance metrics and justify your choices

**Q1.3** Feature analysis:
- Identify the 5 most correlated features with the target

In [6]:
int_coded_cats = ['installment_commitment', 'residence_since', 'existing_credits', 'num_dependents']
num_cols = list(set(loan_data._get_numeric_data().columns) - set(int_coded_cats))
# The set difference already keeps the integer-coded columns, so they are not appended a second time
cat_cols = list(set(loan_data.columns) - set(num_cols))

loan_data["target_bin"] = loan_data["class"].map({"bad": 0, "good": 1})


In [7]:

# Write your code here for questions Q1.1, Q1.2 and Q1.3
# Q1.1
# shape is (rows, columns); the 22 columns include 'class' and the derived 'target_bin'
print("Dataset shape (rows, columns):", loan_data.shape)
print("Class distribution:", loan_data['class'].value_counts())
print("Summary statistics for numerical features")
print(loan_data[num_cols].describe())
print("Value counts for categorical features")
for col in cat_cols:
    print(loan_data[col].value_counts(), '\n')

Dataset shape (rows, columns): (1000, 22)
Class distribution: class
good    700
bad     300
Name: count, dtype: int64
Summary statistics for numerical features
       credit_amount     duration          age
count    1000.000000  1000.000000  1000.000000
mean     3271.258000    20.903000    35.546000
std      2822.736876    12.058814    11.375469
min       250.000000     4.000000    19.000000
25%      1365.500000    12.000000    27.000000
50%      2319.500000    18.000000    33.000000
75%      3972.250000    24.000000    42.000000
max     18424.000000    72.000000    75.000000
Value counts for categorical features
checking_status
no checking    394
<0             274
0<=X<200       269
>=200           63
Name: count, dtype: int64 

purpose
radio/tv               280
new car                234
furniture/equipment    181
used car               103
business                97
education               50
repairs                 22
domestic appliance      12
other                   12
retraining               9
Name: count, dtype
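
For Q1.2, the imbalance ratio follows directly from the class counts printed above (700 good, 300 bad). The sketch below hard-codes those two counts so that it runs without the CSV file; the variable names are illustrative. It also computes the accuracy of a trivial majority-class predictor, which shows why accuracy alone is a weak metric on this dataset.

```python
# Class counts copied from the Q1.1 output (hard-coded so the snippet runs standalone).
class_counts = {"good": 700, "bad": 300}
total = sum(class_counts.values())

# Percentage of each class
class_share = {label: 100 * n / total for label, n in class_counts.items()}

# Imbalance ratio: majority count divided by minority count
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())

# Accuracy of a trivial model that always predicts the majority class
majority_baseline_accuracy = max(class_counts.values()) / total

print(class_share)
print(f"Imbalance ratio (majority:minority) = {imbalance_ratio:.2f}:1")
print(f"Majority-class baseline accuracy   = {majority_baseline_accuracy:.2f}")
```

A model has to beat the 0.70 baseline to add anything, and even then accuracy hides how the errors split between classes, which is the motivation for per-class recall, precision and F1.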

In [8]:
import pandas as pd
from scipy.stats import pointbiserialr

y = y_train.squeeze()  # ensure Series

pb_corr = {}

for col in X_train.columns:
    x = pd.to_numeric(X_train[col], errors="coerce")

    valid = ~(x.isna() | y.isna())

    if x[valid].nunique() <= 1:
        continue

    corr, _ = pointbiserialr(x[valid], y[valid])
    pb_corr[col] = corr
top_5_corr = (
    pd.Series(pb_corr)
    .abs()
    .sort_values(ascending=False)
    .head(5)
)

top_5_corr


checking_status_3    0.308352
checking_status_1    0.257842
duration             0.209096
credit_history_1     0.174778
credit_history_4     0.166700
dtype: float64

---

## 2. Baseline Model: Logistic Regression

**Q2.1** Fit a logistic regression model with default parameters (set random_state for reproducibility).

**Q2.2** Evaluate performance using your chosen metrics from Q1.2.

**Q2.3** Interpret the model:
- Which features have the strongest positive/negative coefficients?
- What do these coefficients tell you about credit risk?
- How interpretable is this model for a loan officer?

In [9]:
from sklearn.preprocessing import StandardScaler
from interpret.glassbox import LogisticRegression
from sklearn.pipeline import Pipeline
# train_data 
X_train = pd.read_csv('../datasets/credit/encoded_credit_X_train.csv')
y_train = pd.read_csv('../datasets/credit/credit_y_train.csv')

# test data
X_test = pd.read_csv('../datasets/credit/encoded_credit_X_test.csv')
y_test = pd.read_csv('../datasets/credit/credit_y_test.csv')
logreg = LogisticRegression(random_state=42)

logreg.fit(X_train, y_train)
from sklearn.metrics import (
    f1_score,
    recall_score,
    precision_recall_curve,
    auc,
    confusion_matrix,
    classification_report
)

# predictions
y_pred = logreg.predict(X_test)
y_proba = logreg.predict_proba(X_test)[:, 1]

f1 = f1_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

precision, recall_curve, _ = precision_recall_curve(y_test, y_proba)
pr_auc = auc(recall_curve, precision)

print(f"F1 Score: {f1:.3f}")
print(f"Recall: {recall:.3f}")
print(f"PR-AUC: {pr_auc:.3f}")
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)



F1 Score: 0.867
Recall: 0.929
PR-AUC: 0.932
Confusion Matrix:
 [[ 30  30]
 [ 10 130]]
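
The per-class metrics can be recomputed by hand from the confusion matrix above, which makes the weak spot of the model explicit. The sketch below copies the four cell values from the output and assumes the usual scikit-learn layout (rows are true labels, columns are predictions, label order 0 = bad, 1 = good); variable names are illustrative.

```python
# Confusion matrix copied from the output above.
# Rows: true class, columns: predicted class, label order [0 (bad), 1 (good)].
cm = [[30, 30],
      [10, 130]]
(tn, fp), (fn, tp) = cm  # "positive" here means class 1 (good)

accuracy = (tp + tn) / (tp + tn + fp + fn)

# Metrics with "good" as the class of interest
precision_good = tp / (tp + fp)
recall_good = tp / (tp + fn)
f1_good = 2 * precision_good * recall_good / (precision_good + recall_good)

# Treating "bad" as the class of interest swaps the roles of the cells
precision_bad = tn / (tn + fn)
recall_bad = tn / (tn + fp)

for name, value in [
    ("Accuracy", accuracy),
    ("Precision (good)", precision_good),
    ("Recall (good)", recall_good),
    ("F1 (good)", f1_good),
    ("Precision (bad)", precision_bad),
    ("Recall (bad)", recall_bad),
]:
    print(f"{name:<17} {value:.3f}")
```

The F1 and recall printed by the cell refer to the good class only; recall for the bad class is 30 / 60, so half of the bad borrowers in the test set are predicted as good.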


In [10]:
from interpret import show

# Global explanation of the trained logistic regression
global_explanation = logreg.explain_global()
show(global_explanation)


In [11]:
import pandas as pd
import numpy as np

scores = global_explanation.data()["scores"]

coefs = pd.Series(
    np.abs(scores),
    index=global_explanation.data()["names"]
)
top5_logreg = coefs.sort_values(ascending=False).head(5)
top5_logreg




checking_status_3        1.105633
checking_status_1        0.614995
credit_history_1         0.589399
purpose_4                0.574484
other_payment_plans_1    0.562971
dtype: float64
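
Because the cell above applies `np.abs` to the scores, the ranking shows strength but not direction, while Q2.3 asks about both. Assuming the scores returned by `explain_global()` are the raw logistic-regression coefficients (an assumption about the `interpret` wrapper that is not verified here), each coefficient is a change in log-odds, so exponentiating it gives an odds multiplier. The sketch below uses the magnitudes printed above; the direction has to be read from the original `scores` array before `np.abs` is applied.

```python
import math

# Absolute coefficient magnitudes copied from the top-5 output above.
abs_coefs = {
    "checking_status_3": 1.105633,
    "checking_status_1": 0.614995,
    "credit_history_1": 0.589399,
    "purpose_4": 0.574484,
    "other_payment_plans_1": 0.562971,
}

# exp(|coef|) is the factor by which the odds change when a binary (0/1) feature
# switches from 0 to 1 with all other features held fixed. The coefficient's sign
# (lost by np.abs) decides whether the odds are multiplied or divided by this factor.
odds_factors = {name: math.exp(c) for name, c in abs_coefs.items()}

for name, factor in odds_factors.items():
    print(f"{name:<22} odds factor = {factor:.2f}")
```

Under that assumption, `checking_status_3` changes the odds by a factor of roughly 3, which is the kind of statement a loan officer can act on more easily than a raw coefficient.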

---
## 3. Explainable Boosting Machine (EBM)

**Q3.1** Fit an EBM model on the training data. 

**Q3.2** Test the model on the test data and evaluate model performance based on the performance metrics you chose in Question 1.2.

**Q3.3** Global interpretability:
- What are the top 5 most important features?
- How do these compare to logistic regression's top features?
- Examine 2-3 feature shape functions in detail
- Are there any surprising non-linearities?
- What pairwise interactions does EBM detect?

**Q3.4** Local interpretability:
- Select one correctly classified positive sample
- Select one misclassified negative sample
- Explain the prediction process for both using local explanations
- What features drove each prediction?


In [12]:
# Write your code here for questions Q3.1, Q3.2, Q3.3 and Q3.4
from interpret.glassbox import ExplainableBoostingClassifier

ebm = ExplainableBoostingClassifier(
    random_state=42,
    n_jobs=1
)

ebm.fit(X_train, y_train)
from sklearn.metrics import f1_score, recall_score, precision_recall_curve, auc

y_pred = ebm.predict(X_test)
y_proba = ebm.predict_proba(X_test)[:, 1]

f1 = f1_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

precision, recall_curve, _ = precision_recall_curve(y_test, y_proba)
pr_auc = auc(recall_curve, precision)

print(f"EBM F1 Score: {f1:.3f}")
print(f"EBM Recall: {recall:.3f}")
print(f"EBM PR-AUC: {pr_auc:.3f}")
from interpret import show

global_exp = ebm.explain_global()
show(global_exp)
results = X_test.copy()
results["y_true"] = y_test.values
results["y_pred"] = y_pred
results["y_proba"] = y_proba
pos_sample = results[
    (results["y_true"] == 1) &
    (results["y_pred"] == 1)
].iloc[[0]]   # keep as DataFrame
neg_sample = results[
    (results["y_true"] == 0) &
    (results["y_pred"] == 1)
].iloc[[0]]   # keep as DataFrame
local_exp_pos = ebm.explain_local(
    pos_sample.drop(columns=["y_true", "y_pred", "y_proba"]),
    pos_sample["y_true"]
)

local_exp_neg = ebm.explain_local(
    neg_sample.drop(columns=["y_true", "y_pred", "y_proba"]),
    neg_sample["y_true"]
)
from interpret import show

show(local_exp_pos)
show(local_exp_neg)



EBM F1 Score: 0.854
EBM Recall: 0.879
EBM PR-AUC: 0.935


In [13]:
import pandas as pd
import numpy as np

data = global_exp.data()

feature_names = data["names"]
importance_scores = data["scores"]
ebm_importance = pd.Series(
    importance_scores,
    index=feature_names
)
top5_ebm = ebm_importance.sort_values(ascending=False).head(5)
top5_ebm



checking_status_3    0.350248
checking_status_1    0.182816
credit_history_1     0.174325
duration             0.149916
savings_status_4     0.121537
dtype: float64

## 4. Rule-Based Model: RUG 

**Q4.1** Fit a RUG classifier.

**Q4.2** Evaluate performance.

**Q4.3** Extract and analyze the rules:
- How many rules were generated?
- What is the average rule length?
- Show 3-5 example rules with their coverage and accuracy
- How do rules differ from tree paths?

**Q4.4** Apply the rules:
- For the same two samples from Q3.4, show which rules apply
- Explain how the final prediction is made

In [14]:
from ruleopt import RUGClassifier
from ruleopt.rule_cost import Length, Gini
import numpy as np
from sklearn.metrics import accuracy_score
# solver = ORToolsSolver()
rule_cost = Length()
# Initialize the RUGClassifier with specific parameters
rug = RUGClassifier(
    random_state=100,
    max_rmp_calls=8,
    rule_cost=rule_cost,
    max_depth=3,
    threshold=0.05)
rug.fit(X_train, y_train.squeeze())
y_results['rug_pred'] = rug.predict(np.array(X_test))
# Confusion matrix
cm = pd.crosstab(y_results['y_test'], y_results['rug_pred'])
print ("Confusion matrix : \n", cm)

print('\nAccuracy  = %.4f' % accuracy_score(y_results['y_test'], y_results['rug_pred']))
print('F1 score  = %.4f' % f1_score(y_results['y_test'], y_results['rug_pred']))



Confusion matrix : 
 rug_pred   0    1
y_test           
0         33   27
1         28  112

Accuracy  = 0.7250
F1 score  = 0.8029


In [15]:
from sklearn.metrics import f1_score, recall_score, roc_auc_score

# ensure y_test is a Series
y_true = y_test.squeeze()

# predictions
y_pred = rug.predict(X_test)
y_prob = rug.predict_proba(X_test)[:, 1]

# metrics
f1 = f1_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
# named roc_auc so that sklearn.metrics.auc (imported in an earlier cell) is not shadowed
roc_auc = roc_auc_score(y_true, y_prob)

print(f"F1-score : {f1:.4f}")
print(f"Recall   : {recall:.4f}")
print(f"ROC AUC  : {roc_auc:.4f}")


F1-score : 0.8029
Recall   : 0.8000
ROC AUC  : 0.7348


In [16]:
from ruleopt.explainer import Explainer

exp = Explainer(rug)


In [17]:
rules = exp.retrieve_rule_details(list(X_train.columns))

RULE 0:
0.50      < employment_4 <= inf       and not null
31.50     < duration  <= inf       and not null
-inf      < checking_status_3 <= 0.50      or null
Class: 1
Scaled rule weight: 1.0000

RULE 1:
-inf      < credit_amount <= 1576.50   and not null
0.50      < property_magnitude_1 <= inf       and not null
-inf      < other_parties_2 <= 0.50      and not null
Class: 0
Scaled rule weight: 0.9794

RULE 2:
-inf      < duration  <= 37.50     or null
-inf      < purpose_4 <= 0.50      or null
0.50      < other_parties_1 <= inf       and not null
Class: 1
Scaled rule weight: 0.8971

RULE 3:
-inf      < age       <= 61.00     or null
1360.50   < credit_amount <= inf       or null
0.50      < checking_status_2 <= inf       and not null
Class: 1
Scaled rule weight: 0.8560

RULE 4:
-inf      < age       <= 45.00     or null
-inf      < credit_amount <= 8234.00   or null
0.50      < purpose_9 <= inf       and not null
Class: 1
Scaled rule weight: 0.8354

RULE 5:
-inf      < credit_amount <=

In [18]:
rule_coverage_metrics = exp.evaluate_rule_coverage_metrics(X_test, info=True)

Number of instances not covered by any rule: 0
Average number of rules per sample: 5.27
Average length of rules per sample: 2.75


In [19]:
i = 9
print(f'True and predicted values for sample at index {i}:')
print(y_results.loc[i,:], '\n')
print(f'Sample {i} features:')
X_test.loc[[i]]

True and predicted values for sample at index 9:
y_test      1
rug_pred    1
Name: 9, dtype: int64 

Sample 9 features:


| | duration | credit_amount | installment_commitment | residence_since | age | existing_credits | num_dependents | checking_status_1 | checking_status_2 | checking_status_3 | ... | purpose_4 | purpose_5 | purpose_6 | purpose_7 | purpose_8 | purpose_9 | savings_status_1 | savings_status_2 | savings_status_3 | savings_status_4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9 | 7 | 730 | 4 | 2 | 46 | 2 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |


In [20]:
exp.find_applicable_rules_for_samples(X_test.iloc[[i]], feature_names=list(X_train.columns), info=True)

Rules for instance 0
RULE 5:
-inf      < credit_amount <= 10725.50  or null
0.50      < other_payment_plans_1 <= inf       or null
0.50      < checking_status_3 <= inf       and not null
Class: 1
Scaled rule weight: 0.8148

RULE 17:
-inf      < credit_amount <= 2723.50   or null
-inf      < duration  <= 7.00      and not null
-inf      < purpose_9 <= 0.50      or null
Class: 1
Scaled rule weight: 0.4527

RULE 19:
0.50      < housing_2 <= inf       and not null
-inf      < employment_1 <= 0.50      or null
-inf      < checking_status_2 <= 0.50      or null
Class: 0
Scaled rule weight: 0.3951

RULE 24:
-inf      < credit_amount <= 1286.00   and not null
-inf      < savings_status_2 <= 0.50      and not null
-inf      < other_parties_1 <= 0.50      or null
Class: 0
Scaled rule weight: 0.3416

RULE 32:
699.00    < credit_amount <= 1236.50   and not null
-inf      < other_payment_plans_2 <= 0.50      or null
Class: 1
Scaled rule weight: 0.2305

RULE 36:
429.50    < credit_amount <= 8079.00 

[[5, 17, 19, 24, 32, 36, 40]]
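
For Q4.4, the final prediction can be illustrated as a weighted vote over the applicable rules. This sketch assumes that RUG sums the weights of the covering rules per class and predicts the class with the larger total; that is how the method is usually described, but it is not confirmed by the output shown here. Rules 36 and 40 also apply to this sample, but their class and weight are cut off in the output, so they are left out and the totals are only partial.

```python
from collections import defaultdict

# (rule id, predicted class, scaled weight) copied from the visible part of the
# output above. Rules 36 and 40 are omitted because their details are truncated.
applicable_rules = [
    (5, 1, 0.8148),
    (17, 1, 0.4527),
    (19, 0, 0.3951),
    (24, 0, 0.3416),
    (32, 1, 0.2305),
]

# Sum the rule weights separately for each class
class_totals = defaultdict(float)
for rule_id, label, weight in applicable_rules:
    class_totals[label] += weight

# The class with the largest total weight wins the vote
predicted_class = max(class_totals, key=class_totals.get)

print({label: round(total, 4) for label, total in class_totals.items()})
print("Predicted class from the visible rules:", predicted_class)
```

With the five visible rules, class 1 collects about twice the weight of class 0, which agrees with `rug_pred = 1` reported for sample 9.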

In [21]:
print(dir(rug))

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__sklearn_clone__', '__sklearn_tags__', '__str__', '__subclasshook__', '__weakref__', '_cleanup', '_doc_link_module', '_doc_link_template', '_doc_link_url_param_generator', '_fill_rules', '_fit_decision_tree', '_get_class_infos', '_get_class_level_metadata_request_values', '_get_doc_link', '_get_matrix', '_get_metadata_request', '_get_param_names', '_get_params_html', '_get_rule', '_get_rule_cost', '_get_sample_weight', '_html_repr', '_is_fitted', '_predict_base', '_preprocess', '_repr_html_', '_repr_html_inner', '_repr_mimebundle_', '_rng', '_temp_rules', '_validate_parameters', '_validate_params', '_validate_rug_parameters', 'ccp_alpha', 'class_weight', 'cla

In [22]:
rules = exp.retrieve_rule_details(list(feature_names))


---

## 5. Comparative Analysis

**Q5.1** Create a comparison table with ALL models showing:
- Accuracy
- Precision (both classes)
- Recall (both classes)
- F1 score
- Any other metrics you chose

**Q5.2** Use the following metrics to rank the models on a scale from 1 (bad) to 5 (good). Justify your opinions:
- Predictive performance
- Global interpretability 
- Local interpretability 
- Ease of deployment 

**Q5.3** Feature importance comparison:
- Create a table showing top 5 features from each model
- Discuss: Do models agree on what's important?
- What does this tell you about the data?

---

## 6. Model Selection and Ethics

**Q6.1** Which model would you recommend for deployment in:
- (a) A highly regulated bank (needs full explainability)
- (b) A fintech startup (prioritizes performance)

#### Q6.1(a) Highly regulated bank (needs full explainability)

**Recommended model:** RUG

**Justification:** Although RUG has lower predictive performance (Accuracy = 0.725, ROC AUC = 0.735) compared to Logistic Regression and EBM, it offers maximum transparency and auditability. Each prediction is made through explicit IF-THEN rules that can be directly inspected, validated, and communicated to regulators and customers. This level of explainability is critical in highly regulated financial environments where compliance, fairness, and accountability outweigh marginal gains in predictive performance.

#### Q6.1(b) Fintech startup (prioritizes performance)

**Recommended model:** EBM

**Justification:** EBM provides the best balance between predictive performance and interpretability. It achieves the highest ROC AUC (0.853) and a strong F1 score for the "Good" class (0.854), while still allowing global and local explanations through feature shape functions. For a fintech startup where performance and adaptability are prioritized, EBM captures non-linear relationships and interactions more effectively than Logistic Regression, without sacrificing transparency entirely.

In [23]:
from sklearn.metrics import (
    accuracy_score, precision_recall_fscore_support, roc_auc_score
)
import pandas as pd

def evaluate_model(name, model, X_test, y_test):
    y_true = y_test.squeeze()
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]

    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=[0, 1]
    )

    return {
        "Model": name,
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision (Bad)": precision[0],
        "Recall (Bad)": recall[0],
        "Precision (Good)": precision[1],
        "Recall (Good)": recall[1],
        "F1 (Good)": f1[1],
        "ROC AUC": roc_auc_score(y_true, y_prob)
    }

results = [
    evaluate_model("Logistic Regression", logreg, X_test, y_test),
    evaluate_model("EBM", ebm, X_test, y_test),
    evaluate_model("RUG", rug, X_test, y_test),
]

comparison_df = pd.DataFrame(results)
comparison_df


| | Model | Accuracy | Precision (Bad) | Recall (Bad) | Precision (Good) | Recall (Good) | F1 (Good) | ROC AUC |
|---|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.8 | 0.75 | 0.5 | 0.8125 | 0.928571 | 0.866667 | 0.842024 |
| 1 | EBM | 0.79 | 0.673077 | 0.583333 | 0.831081 | 0.878571 | 0.854167 | 0.853333 |
| 2 | RUG | 0.725 | 0.540984 | 0.55 | 0.805755 | 0.8 | 0.802867 | 0.734762 |


In [24]:
import pandas as pd

feature_comparison = pd.DataFrame({
    "Logistic Regression": top5_logreg.index,
    "EBM": top5_ebm.index,
})


rules


{0: {'label': 1,
  'weight': 1.0,
  'rule': {'employment_4': {'lb': 0.5, 'ub': inf, 'na': False},
   'duration': {'lb': 31.5, 'ub': inf, 'na': False},
   'checking_status_3': {'lb': -inf, 'ub': 0.5, 'na': True}},
  'sdist': [6]},
 1: {'label': 0,
  'weight': 0.9794238683127573,
  'rule': {'credit_amount': {'lb': -inf, 'ub': 1576.5, 'na': False},
   'property_magnitude_1': {'lb': 0.5, 'ub': inf, 'na': False},
   'other_parties_2': {'lb': -inf, 'ub': 0.5, 'na': False}},
  'sdist': [2, 4]},
 2: {'label': 1,
  'weight': 0.897119341563786,
  'rule': {'duration': {'lb': -inf, 'ub': 37.5, 'na': True},
   'purpose_4': {'lb': -inf, 'ub': 0.5, 'na': True},
   'other_parties_1': {'lb': 0.5, 'ub': inf, 'na': False}},
  'sdist': [1, 30]},
 3: {'label': 1,
  'weight': 0.8559670781893001,
  'rule': {'age': {'lb': -inf, 'ub': 61.0, 'na': True},
   'credit_amount': {'lb': 1360.5, 'ub': inf, 'na': True},
   'checking_status_2': {'lb': 0.5, 'ub': inf, 'na': False}},
  'sdist': [2, 25]},
 4: {'label': 1,
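
The dictionary shown above is enough to answer parts of Q4.3 and Q4.4 programmatically. The sketch below copies only the first four rules (the rest are truncated in the output, and the `sdist` entries are dropped for brevity) and reads each condition as `lb < x <= ub`, with `na` meaning that a missing value is allowed, matching the "or null" / "and not null" wording of the printed rules. The helper `rule_applies` is hypothetical and not part of `ruleopt`.

```python
import math

inf = math.inf

# First four rules copied from the `rules` output above ('sdist' omitted).
rules_excerpt = {
    0: {"label": 1, "weight": 1.0,
        "rule": {"employment_4": {"lb": 0.5, "ub": inf, "na": False},
                 "duration": {"lb": 31.5, "ub": inf, "na": False},
                 "checking_status_3": {"lb": -inf, "ub": 0.5, "na": True}}},
    1: {"label": 0, "weight": 0.9794238683127573,
        "rule": {"credit_amount": {"lb": -inf, "ub": 1576.5, "na": False},
                 "property_magnitude_1": {"lb": 0.5, "ub": inf, "na": False},
                 "other_parties_2": {"lb": -inf, "ub": 0.5, "na": False}}},
    2: {"label": 1, "weight": 0.897119341563786,
        "rule": {"duration": {"lb": -inf, "ub": 37.5, "na": True},
                 "purpose_4": {"lb": -inf, "ub": 0.5, "na": True},
                 "other_parties_1": {"lb": 0.5, "ub": inf, "na": False}}},
    3: {"label": 1, "weight": 0.8559670781893001,
        "rule": {"age": {"lb": -inf, "ub": 61.0, "na": True},
                 "credit_amount": {"lb": 1360.5, "ub": inf, "na": True},
                 "checking_status_2": {"lb": 0.5, "ub": inf, "na": False}}},
}

# Rule count and average rule length (number of conditions per rule) for the excerpt
n_rules = len(rules_excerpt)
avg_length = sum(len(r["rule"]) for r in rules_excerpt.values()) / n_rules


def rule_applies(rule, sample):
    """True if every condition lb < x <= ub holds; a missing value passes only if 'na' is True."""
    for feature, cond in rule["rule"].items():
        x = sample.get(feature)
        if x is None:
            if not cond["na"]:
                return False
        elif not (cond["lb"] < x <= cond["ub"]):
            return False
    return True


# Values of test sample 9 taken from the feature table shown earlier (only what rule 3 needs).
sample_9 = {"age": 46, "credit_amount": 730, "checking_status_2": 0}

print(f"Rules in excerpt: {n_rules}, average length: {avg_length:.2f}")
print("Rule 3 applies to sample 9:", rule_applies(rules_excerpt[3], sample_9))
```

Rule 3 does not fire for sample 9 because `credit_amount` is below 1360.5 and `checking_status_2` is 0, which is consistent with rule 3 being absent from the list of applicable rules reported for that sample. Running the same two summary lines over the full `rules` dictionary would give the total rule count and average rule length asked for in Q4.3.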


In [25]:
feature_comparison

| | Logistic Regression | EBM |
|---|---|---|
| 0 | checking_status_3 | checking_status_3 |
| 1 | checking_status_1 | checking_status_1 |
| 2 | credit_history_1 | credit_history_1 |
| 3 | purpose_4 | duration |
| 4 | other_payment_plans_1 | savings_status_4 |
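
To make the "do models agree?" part of Q5.3 concrete, the overlap between the top-5 lists can be quantified. The sketch below hard-codes the three rankings already reported in this notebook (point-biserial correlation, logistic regression, EBM) and computes the features shared by all of them plus a pairwise Jaccard similarity; RUG is left out because no top-5 list was produced for it. All names are illustrative.

```python
from itertools import combinations

# Top-5 lists copied from the outputs earlier in the notebook.
top5 = {
    "Point-biserial correlation": [
        "checking_status_3", "checking_status_1", "duration",
        "credit_history_1", "credit_history_4",
    ],
    "Logistic Regression": [
        "checking_status_3", "checking_status_1", "credit_history_1",
        "purpose_4", "other_payment_plans_1",
    ],
    "EBM": [
        "checking_status_3", "checking_status_1", "credit_history_1",
        "duration", "savings_status_4",
    ],
}

# Features that every ranking places in its top 5
shared_by_all = set.intersection(*(set(features) for features in top5.values()))

# Pairwise Jaccard similarity: |A ∩ B| / |A ∪ B|
jaccard = {}
for a, b in combinations(top5, 2):
    set_a, set_b = set(top5[a]), set(top5[b])
    jaccard[(a, b)] = len(set_a & set_b) / len(set_a | set_b)

print("Shared by all rankings:", sorted(shared_by_all))
for (a, b), score in jaccard.items():
    print(f"{a} vs {b}: Jaccard = {score:.2f}")
```

All three rankings agree on `checking_status_3`, `checking_status_1` and `credit_history_1`, so the checking-account and credit-history indicators carry most of the signal regardless of model class.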



**Logistic Regression**
- Best accuracy and F1
- Extremely high recall for good borrowers
- Lowest recall for bad borrowers (0.50)

**EBM**
- Best ROC AUC → strongest ranking ability
- More balanced performance across classes
- Slightly lower accuracy than logistic regression

**RUG**
- Lowest predictive performance
- Still reasonable recall for good borrowers (0.80)
- Sacrifices performance for interpretability

**1. Logistic Regression**

Overall picture: Strong, stable, conservative model

- Accuracy (0.800): Highest among the three, but driven mainly by the majority (good) class.
- Recall (Good = 0.93): Very high → excellent at approving good borrowers.
- Recall (Bad = 0.50): Misses half of the bad cases → risk of letting some risky applicants through.
- ROC AUC (0.84): Strong class separation.

What this tells us

- The model is biased toward predicting "Good", which is common in imbalanced credit datasets.
- Linear decision boundary → stable and predictable behavior.
- Very reliable baseline but not aggressive at catching bad loans.

Personality: Safe, conservative, regulator-friendly.

**2. Explainable Boosting Machine (EBM)**

Overall picture: Best trade-off between performance and sensitivity

- ROC AUC (0.853): Highest overall → best at ranking borrowers by risk.
- Recall (Bad = 0.58): Best at catching risky borrowers.
- Precision (Good = 0.83) and Recall (Good = 0.88): Balanced performance.
- Slightly lower accuracy than LR (0.79 vs 0.80), but not meaningfully worse.

What this tells us

- Captures non-linear patterns missed by Logistic Regression.
- Makes fewer systematic mistakes on bad borrowers.
- Better risk differentiation across the full score range.

Personality: Smart, flexible, performance-oriented but still explainable.

**3. RUG**

Overall picture: Highly interpretable but weaker predictive power

- Accuracy (0.725) and ROC AUC (0.735): Clearly lower.
- Recall (Bad = 0.55): Reasonable at identifying bad borrowers.
- F1 (Good = 0.80): Acceptable but not competitive with LR/EBM.

What this tells us

- Sacrifices predictive performance for rule simplicity.
- Rules generalize less well to unseen data.
- More variance and less nuanced decision boundaries.

Personality: Transparent, human-readable, but blunt.