# FinRisk: Credit Risk & Fraud Detection - Notebook 6
## Phase 3: Fraud Detection & Real-time Scoring

**Objective:**
1.  Load the pre-trained champion credit risk model (Logistic Regression).
2.  Build an unsupervised anomaly detection model for fraud using `IsolationForest`.
3.  Simulate a real-time API endpoint that receives a transaction and returns a combined risk decision (credit risk + fraud risk).
4.  Outline an A/B testing framework and a simple alerting mechanism.

In [7]:
# ==============================================================================
# 1. Import Libraries
# ==============================================================================
import pandas as pd
import numpy as np
import os
import joblib # For saving and loading models

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import IsolationForest

# Settings
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully.")

Libraries imported successfully.


## 2. Load Data and Pre-trained Credit Model

First, we load the necessary datasets. Then, to simulate a production environment, we will "load" our best-performing credit risk model from Phase 2. For this notebook, we'll quickly retrain the `Logistic Regression (Balanced)` model and save it to a file to simulate this process.

In [8]:
# --- Load Datasets ---
PROCESSED_DATA_PATH = '../data/processed/'
MODELS_PATH = '../models/' # Let's create a new directory for models

if not os.path.exists(MODELS_PATH):
    os.makedirs(MODELS_PATH)

try:
    credit_features_df = pd.read_csv(os.path.join(PROCESSED_DATA_PATH, 'credit_risk_features.csv'), parse_dates=['application_date'])
    fraud_features_df = pd.read_csv(os.path.join(PROCESSED_DATA_PATH, 'fraud_detection_features.csv'))
    print("Processed data loaded successfully.")
except FileNotFoundError as e:
    print(f"Error loading data: {e}. Please run previous notebooks first.")
    assert False, "Data not found."

# --- Simulate Loading a Pre-trained Credit Model ---
print("\n--- Training and saving the champion credit risk model ---")

# Define features (X) and target (y) from the credit dataset
TARGET = 'default_flag'
features_to_exclude = [
    'application_id', 'customer_id', 'application_date', 'last_activity_date',
    'default_flag', 'application_status', 'city'
]
X = credit_features_df.drop(columns=features_to_exclude)
y = credit_features_df[TARGET]

# Identify feature types
categorical_features = X.select_dtypes(include=['object', 'category']).columns
numerical_features = X.select_dtypes(include=np.number).columns

# Define the preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ]
)

# Create and train the pipeline
credit_risk_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(random_state=42, class_weight='balanced', solver='liblinear'))
])
credit_risk_model.fit(X, y) # Train on the full dataset for "production"

# Save the model
model_filename = os.path.join(MODELS_PATH, 'champion_credit_model.joblib')
joblib.dump(credit_risk_model, model_filename)

print(f"Credit risk model trained and saved to {model_filename}")

Processed data loaded successfully.

--- Training and saving the champion credit risk model ---
Credit risk model trained and saved to ../models/champion_credit_model.joblib


## 3. Build Fraud Detection Model (Anomaly Detection)

Fraud is a rare event, making it a perfect use case for anomaly detection. We will use `IsolationForest`, which is efficient at identifying unusual data points. We train this on the customer-level aggregated features.

In [9]:
print("--- Building the fraud detection model ---")

# Features for the fraud model are the aggregated transaction stats
X_fraud = fraud_features_df.drop(columns=['customer_id'])

# Isolation Forest works by "isolating" outliers.
# The `contamination` parameter is an estimate of the percentage of anomalies in the data.
# Based on the project brief, fraud rate is ~0.1%.
fraud_detection_model = IsolationForest(
    n_estimators=100,
    contamination=0.001, # Set to expected fraud rate
    random_state=42
)

fraud_detection_model.fit(X_fraud)

# Save the fraud model
fraud_model_filename = os.path.join(MODELS_PATH, 'fraud_detection_model.joblib')
joblib.dump(fraud_detection_model, fraud_model_filename)
print(f"Fraud detection model trained and saved to {fraud_model_filename}")

--- Building the fraud detection model ---
Fraud detection model trained and saved to ../models/fraud_detection_model.joblib


## 4. Simulate Real-Time Scoring API

This is the core of our production system. We'll create a function that simulates an API call. It will take new application/transaction data, use our models to score it, and return a decision based on business rules.

In [10]:
# Load the models back as if we were in a separate API application
loaded_credit_model = joblib.load(os.path.join(MODELS_PATH, 'champion_credit_model.joblib'))
loaded_fraud_model = joblib.load(os.path.join(MODELS_PATH, 'fraud_detection_model.joblib'))

print("Models loaded successfully.")

# Prepare data for quick lookups (simulating a database/cache)
credit_data_lookup = credit_features_df.set_index('customer_id')
fraud_data_lookup = fraud_features_df.set_index('customer_id')


def real_time_scoring_api(customer_id: str):
    """
    Simulates a real-time scoring API endpoint.
    - Fetches customer data.
    - Gets a credit risk score.
    - Gets a fraud score.
    - Applies business rules to make a decision.
    """
    print(f"\n--- Scoring request for Customer: {customer_id} ---")
    try:
        # 1. Fetch data for the customer
        customer_credit_features = credit_data_lookup.loc[[customer_id]].drop(columns=features_to_exclude)
        customer_fraud_features = fraud_data_lookup.loc[[customer_id]].drop(columns=['customer_id'])
    except KeyError:
        return {"customer_id": customer_id, "decision": "Error", "reason": "Customer not found"}

    # 2. Score for Credit Risk (probability of default)
    credit_prob_default = loaded_credit_model.predict_proba(customer_credit_features)[:, 1][0]
    
    # 3. Score for Fraud Risk (anomaly score)
    # The score is -1 for anomalies (fraud) and 1 for inliers (not fraud).
    fraud_score = loaded_fraud_model.predict(customer_fraud_features)[0]
    
    # 4. Apply Business Rules
    decision = "Approved"
    reason = "Customer meets credit and fraud criteria."
    
    if fraud_score == -1:
        decision = "Manual Review"
        reason = "High fraud risk detected. Transaction flagged for investigation."
    elif credit_prob_default > 0.6: # Example threshold
        decision = "Declined"
        reason = f"Credit risk too high. Probability of default: {credit_prob_default:.2%}"
        
    return {
        "customer_id": customer_id,
        "credit_default_probability": f"{credit_prob_default:.2%}",
        "is_potential_fraud": bool(fraud_score == -1),
        "decision": decision,
        "reason": reason
    }

# --- Example API Calls ---
# Example of a good customer
good_customer_id = 'CUST_007999' 
print(real_time_scoring_api(good_customer_id))

# Example of another customer
other_customer_id = 'CUST_008172'
print(real_time_scoring_api(other_customer_id))

Models loaded successfully.

--- Scoring request for Customer: CUST_007999 ---
{'customer_id': 'CUST_007999', 'decision': 'Error', 'reason': 'Customer not found'}

--- Scoring request for Customer: CUST_008172 ---
{'customer_id': 'CUST_008172', 'decision': 'Error', 'reason': 'Customer not found'}


## 5. A/B Testing Framework and Alerting

In a production environment, you never replace a model blindly. You test it.

**A/B Testing Framework:**
* **Champion vs. Challenger:** Our current `Logistic Regression` is the "Champion". A new model (e.g., an optimized XGBoost) would be the "Challenger".
* **Traffic Splitting:** We would configure our API to route a small percentage of traffic (e.g., 5%) to the Challenger model. The remaining 95% still goes to the Champion.
* **Performance Monitoring:** We would then monitor the key metrics (default rates, fraud capture rates) for both groups of customers over a period of time.
* **Promotion:** If the Challenger significantly and consistently outperforms the Champion, it gets promoted to become the new Champion.

**Alerting System:**
The `real_time_scoring_api` already includes a decision for "Manual Review". In a real system, this would trigger an alert.

In [11]:
def generate_alert(api_response: dict):
    """Generates an alert if the decision requires manual review."""
    if api_response.get("decision") == "Manual Review":
        customer_id = api_response.get("customer_id")
        reason = api_response.get("reason")
        print("\n" + "="*20 + " ALERT " + "="*20)
        print(f"ALERT: Manual investigation required for Customer {customer_id}.")
        print(f"Reason: {reason}")
        print("="*47)

# Example of an alert being triggered
fraudulent_customer_id = 'CUST_004019' # This is a made up example for simulation
# Let's create a fake high-risk profile for this customer to trigger the alert
high_risk_profile = fraud_data_lookup.mean().to_frame().T
high_risk_profile.index = [fraudulent_customer_id]
fraud_data_lookup = pd.concat([fraud_data_lookup, high_risk_profile])

credit_data_lookup = credit_features_df.set_index('customer_id') # reset index

# Now, we manually "flag" this customer in the model's eyes
# In a real scenario, their transaction data would naturally lead to this
# Forcing the model to predict -1 for this made-up customer
loaded_fraud_model.fit(high_risk_profile) 

# Run the API and generate the alert
response = real_time_scoring_api(fraudulent_customer_id)
print(response)
generate_alert(response)


--- Scoring request for Customer: CUST_004019 ---
{'customer_id': 'CUST_004019', 'decision': 'Error', 'reason': 'Customer not found'}
