In [1]:
import os

import dspy
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

from helper import get_openai_api_key

# Fetch the key once and export it as an environment variable for the LM client.
openai_api_key = get_openai_api_key()
os.environ["OPENAI_API_KEY"] = openai_api_key


# Loan Approval Prediction with Reasoning Using DSPy

## Use Case

In the banking and lending sector, loan approval decisions are critical and must be both **accurate** and **explainable**. The goal of this project is to predict whether a loan should be **approved** or **rejected** based on applicant features such as:

- Age
- Gender
- Income
- Education
- Credit score
- Prior loan history

Additionally, each decision must provide **human-readable reasoning** to ensure transparency and fairness.

---

## Problem

Traditional ML models (Random Forests, XGBoost, etc.) can predict loan approvals with good accuracy but have several limitations:

- **Lack of reasoning:** They do not produce a human-readable explanation of why a loan was approved or rejected.  
- **Bias risk:** Models can unintentionally favor certain groups (gender, income, home ownership).  
- **Complex feature handling:** Encoding categorical variables and maintaining fairness is non-trivial.

---

## Solution Overview

This solution uses **DSPy (Declarative Self-improving Python)** to build a loan approval system that is:

- **Predictive:** Outputs `0 = rejected`, `1 = approved`.  
- **Explainable:** Generates reasoning text for every decision.  
- **Fair and auditable:** Can check predictions for bias across sensitive features.

---

## Steps Implemented

### 1. Data Processing
- Loaded the dataset `loan_data.csv`.  
- Converted categorical features to strings for LLM understanding.

### 2. DSPy Model Setup
- Built a `ChainOfThought` module that takes applicant features as input and predicts loan approval along with reasoning.

### 3. Single Prediction
- Input a single applicant’s features.  
- Get `loan_status` and reasoning text explaining the decision.

### 4. Batch Prediction
- Run predictions on the entire test dataset.  
- Compare predictions with actual loan outcomes.

### 5. Fairness Audit
- Analyze predictions by sensitive features like **gender** and **home ownership**.  
- Identify potential biases in approvals.
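
The audit step can be illustrated with a small, self-contained sketch. It computes the approval rate per group and the largest gap between groups (a simple demographic-parity check). The helper names `approval_rate_by_group` and `parity_gap`, the `predicted` key, and the records themselves are assumptions made for illustration; they are not taken from the notebook.

```python
from collections import defaultdict


def approval_rate_by_group(records, group_key, pred_key="predicted"):
    """Return {group value: share of records predicted as approved (1)}."""
    totals = defaultdict(int)
    approved = defaultdict(int)
    for rec in records:
        group = rec[group_key]
        totals[group] += 1
        approved[group] += int(rec[pred_key])
    return {group: approved[group] / totals[group] for group in totals}


def parity_gap(rates):
    """Largest difference in approval rate between any two groups."""
    return max(rates.values()) - min(rates.values())


# Made-up predictions, for illustration only.
records = [
    {"person_gender": "female", "person_home_ownership": "RENT", "predicted": 1},
    {"person_gender": "female", "person_home_ownership": "OWN", "predicted": 0},
    {"person_gender": "male", "person_home_ownership": "RENT", "predicted": 1},
    {"person_gender": "male", "person_home_ownership": "MORTGAGE", "predicted": 1},
]

gender_rates = approval_rate_by_group(records, "person_gender")
print(gender_rates)              # {'female': 0.5, 'male': 1.0}
print(parity_gap(gender_rates))  # 0.5
```

With the predictions stored in a DataFrame column, `df.groupby("person_gender")["predicted"].mean()` yields the same per-group rates. A large gap is a signal to investigate further, not proof of discrimination on its own.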

---

## How DSPy Helps

DSPy provides a declarative framework for **reasoning-based predictions**:

- **Chain-of-Thought Reasoning**  
  Explains decisions step-by-step, making the system transparent for auditors and stakeholders.

- **LLM Integration**  
  Integrates with OpenAI GPT models to handle categorical and continuous features without manual encoding.

- **Self-Improving**  
  Supports few-shot demonstrations and prompt optimization through DSPy optimizers, which can improve prediction and reasoning quality when given labeled examples and a metric.

- **Bias Mitigation & Auditing**  
  Returns structured predictions that can be grouped by sensitive features, so approval rates can be audited for potentially discriminatory patterns.

- **Reusable Pipeline**  
  Single-row and batch predictions run through the same module, so the same code path serves interactive checks and bulk scoring.
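
Prompt optimization in DSPy depends on a metric that scores each prediction against its label. Below is a minimal sketch of such a metric, assuming DSPy's usual `metric(example, prediction, trace=None)` convention. The function name `loan_status_metric` and the `SimpleNamespace` stand-ins are illustrative assumptions, not part of the notebook.

```python
from types import SimpleNamespace


def loan_status_metric(example, prediction, trace=None):
    """Return True when the predicted loan_status matches the labeled one.

    The LM may return the label as an int or as text such as "1",
    so the prediction is normalized before comparison.
    """
    try:
        predicted = int(str(prediction.loan_status).strip())
    except (AttributeError, ValueError):
        # Missing or unparseable output counts as a wrong prediction.
        return False
    return predicted == int(example.loan_status)


# Stand-in objects for illustration; with DSPy these would be a labeled
# dspy.Example and the prediction returned by the module.
label = SimpleNamespace(loan_status=1)
good = SimpleNamespace(loan_status="1 ")
bad = SimpleNamespace(loan_status="approved")

print(loan_status_metric(label, good))  # True
print(loan_status_metric(label, bad))   # False
```

A function of this shape could be passed as `metric=` to an optimizer such as `dspy.BootstrapFewShot`, together with a training set of labeled examples.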

---

## Key Features

- Predict loan approval with **0/1 output**.  
- Generate **reasoning** for each decision.  
- **Batch prediction** for large datasets.  
- **Fairness audit** across sensitive features.  
- Minimal preprocessing, with no manual encoding of categorical features.  
- Easily extendable with more features or additional fairness checks.


In [3]:
df = pd.read_csv("loan_data.csv")
df.head()

,person_age,person_gender,person_education,person_income,person_emp_exp,person_home_ownership,loan_amnt,loan_intent,loan_int_rate,loan_percent_income,cb_person_cred_hist_length,credit_score,previous_loan_defaults_on_file,loan_status
0,22.0,female,Master,71948.0,0,RENT,35000.0,PERSONAL,16.02,0.49,3.0,561,No,1
1,21.0,female,High School,12282.0,0,OWN,1000.0,EDUCATION,11.14,0.08,2.0,504,Yes,0
2,25.0,female,High School,12438.0,3,MORTGAGE,5500.0,MEDICAL,12.87,0.44,3.0,635,No,1
3,23.0,female,Bachelor,79753.0,0,RENT,35000.0,MEDICAL,15.23,0.44,2.0,675,No,1
4,24.0,male,Master,66135.0,1,RENT,35000.0,MEDICAL,14.27,0.53,4.0,586,No,1


In [4]:
df.columns

Index(['person_age', 'person_gender', 'person_education', 'person_income',
       'person_emp_exp', 'person_home_ownership', 'loan_amnt', 'loan_intent',
       'loan_int_rate', 'loan_percent_income', 'cb_person_cred_hist_length',
       'credit_score', 'previous_loan_defaults_on_file', 'loan_status'],
      dtype='object')

In [5]:
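categorical_cols = ["person_gender", "person_education", "person_home_ownership",
                    "loan_intent", "previous_loan_defaults_on_file"]

# Integer-encode the categorical columns, as a conventional ML model would need.
# The DSPy cell below reloads the CSV and keeps these columns as strings instead.
encoders = {col: LabelEncoder().fit(df[col]) for col in categorical_cols}
for col, le in encoders.items():
    df[col] = le.transform(df[col])

# Hold out 20% of the rows for testing (fixed seed for reproducibility)
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
print(f"Train shape: {train_df.shape}, Test shape: {test_df.shape}")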



Train shape: (36000, 14), Test shape: (9000, 14)


In [7]:

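# ---------- Step 1: Load dataset ----------
# Reload the raw CSV so the categorical columns hold their original string
# labels rather than the integer codes produced by the LabelEncoder cell above.
df = pd.read_csv("loan_data.csv")

# Cast categorical columns to str so they reach the LLM as readable labels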
categorical_cols = [
    "person_gender", "person_education", "person_home_ownership",
    "loan_intent", "previous_loan_defaults_on_file"
]
for col in categorical_cols:
    df[col] = df[col].astype(str)

# ---------- Step 2: Split dataset ----------
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# ---------- Step 3: Define a DSPy ChainOfThought module ----------
class LoanApprovalReasoner(dspy.Module):
    def __init__(self):
        super().__init__()
        # ChainOfThought adds its own `reasoning` output field ahead of the
        # declared outputs, so the signature only needs to declare loan_status.
        self.reason_model = dspy.ChainOfThought(
            "person_age: float, person_gender: str, person_education: str, "
            "person_income: float, person_emp_exp: int, person_home_ownership: str, "
            "loan_amnt: float, loan_intent: str, loan_int_rate: float, loan_percent_income: float, "
            "cb_person_cred_hist_length: float, credit_score: int, previous_loan_defaults_on_file: str "
            "-> loan_status: int"
        )

    def forward(self, **features):
        # The returned prediction exposes both `loan_status` and `reasoning`.
        return self.reason_model(**features)

# ---------- Step 4: Initialize DSPy ----------
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))
loan_reasoner = LoanApprovalReasoner()

# ---------- Step 5: Test prediction + reasoning ----------
sample_input = {
    "person_age": 29,
    "person_gender": "female",
    "person_education": "Bachelor",
    "person_income": 70000,
    "person_emp_exp": 6,
    "person_home_ownership": "OWN",
    "loan_amnt": 12000,
    "loan_intent": "EDUCATION",
    "loan_int_rate": 9.8,
    "loan_percent_income": 0.17,
    "cb_person_cred_hist_length": 5,
    "credit_score": 750,
    "previous_loan_defaults_on_file": "No"
}

output = loan_reasoner(**sample_input)

print("Predicted loan_status:", output["loan_status"])
print("Reasoning:\n", output["reasoning"])

Predicted loan_status: 1
Reasoning:
 The applicant is 29 years old, has a Bachelor's degree, and a stable income of $70,000 with 6 years of employment experience. They own their home, which indicates financial stability. The loan amount of $12,000 for education purposes is reasonable given their income, as it only represents 17% of their income. Additionally, they have a good credit score of 750 and no previous loan defaults, which further supports their ability to repay the loan. Therefore, the loan should be approved.
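
The single-applicant call above extends naturally to the batch step: loop over the test rows, predict each one, and compare with the recorded `loan_status`. The sketch below shows that comparison under stated assumptions. `evaluate_batch` and `stub_predict` are hypothetical names, and the rule-based stub stands in for the LLM so the example runs without an API key. The rows are toy data: the first three reuse values from `df.head()`, and the last one is invented.

```python
def evaluate_batch(rows, predict, label_key="loan_status"):
    """Run `predict` on every row and compare with the true label.

    rows    : list of dicts, e.g. test_df.to_dict("records")
    predict : callable taking a feature dict and returning 0 or 1
    """
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for row in rows:
        # Hide the label from the predictor.
        features = {k: v for k, v in row.items() if k != label_key}
        predicted = int(predict(features))
        actual = int(row[label_key])
        if predicted == 1:
            counts["tp" if actual == 1 else "fp"] += 1
        else:
            counts["tn" if actual == 0 else "fn"] += 1
    total = sum(counts.values())
    accuracy = (counts["tp"] + counts["tn"]) / total if total else 0.0
    return {"accuracy": accuracy, **counts}


# Hypothetical rule-based stand-in for the LLM call. In the notebook this
# could instead wrap the DSPy module, for example:
#   lambda f: int(loan_reasoner(**f).loan_status)
def stub_predict(features):
    return 1 if features["previous_loan_defaults_on_file"] == "No" else 0


rows = [
    {"credit_score": 561, "previous_loan_defaults_on_file": "No", "loan_status": 1},
    {"credit_score": 504, "previous_loan_defaults_on_file": "Yes", "loan_status": 0},
    {"credit_score": 635, "previous_loan_defaults_on_file": "No", "loan_status": 1},
    {"credit_score": 700, "previous_loan_defaults_on_file": "No", "loan_status": 0},
]

report = evaluate_batch(rows, stub_predict)
print(report)  # {'accuracy': 0.75, 'tp': 2, 'tn': 1, 'fp': 1, 'fn': 0}
```

Because every row triggers one LLM call, it is usually sensible to start with a small sample of `test_df` before scoring all 9,000 test rows.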
