# Model Sensitivity Analysis for Fraud Detection Pipeline

This notebook analyzes the sensitivity of the fraud detection model pipeline and explores ways to improve it. It loads the existing model, compares its predictions on sample inputs, inspects feature importances, and outlines retraining and swapping in a more complex model.

## 1. Load the Existing Model Pipeline

Load the trained model pipeline from `models/dbt_fraud_detection_pipeline.pkl` using joblib.

In [None]:
import joblib
import os

# Path to the trained model pipeline
model_path = os.path.join('models', 'dbt_fraud_detection_pipeline.pkl')
model = joblib.load(model_path)
print(f"Loaded model pipeline from: {model_path}")
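
Unpickling a file can execute arbitrary code, so only load pipeline files from a source you trust. The sketch below wraps the load in two sanity checks. The helper name `load_pipeline` is hypothetical, and the demo writes a small stand-in pipeline to a temporary directory instead of touching `models/`.

```python
import os
import tempfile

import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def load_pipeline(path):
    """Load a joblib file and check that it holds a scikit-learn Pipeline."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Model file not found: {path}")
    loaded = joblib.load(path)
    if not isinstance(loaded, Pipeline):
        raise TypeError(f"Expected a Pipeline, got {type(loaded).__name__}")
    return loaded


# Demo with a stand-in pipeline (not the real fraud model)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_pipeline.pkl")
    demo = Pipeline([("preprocessor", StandardScaler()),
                     ("classifier", LogisticRegression())])
    joblib.dump(demo, demo_path)
    loaded = load_pipeline(demo_path)
    step_names = list(loaded.named_steps)
    print(f"Pipeline steps: {step_names}")
```

Listing the step names up front also confirms whether the pipeline uses the `preprocessor` and `classifier` names that later cells look up.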

## 2. Evaluate Model Sensitivity on Sample Inputs

Load sample JSON files from `test_samples/` and use the model pipeline to predict outputs. Compare predictions for different samples.

In [None]:
import pandas as pd
import glob
import json

# Find all sample JSON files in test_samples/
sample_files = sorted(glob.glob('test_samples/*.json'))  # sorted for a stable order across runs
print(f"Found sample files: {sample_files}")

# Predict for each sample and show results
for file in sample_files:
    with open(file, 'r') as f:
        sample = json.load(f)
    df = pd.DataFrame([sample])
    pred = model.predict(df)[0]
    prob = model.predict_proba(df)[0]
    print(f"File: {file}")
    print(f"Prediction: {pred}, Probabilities: {prob}")
    print('-' * 40)
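
Comparing a handful of fixed samples shows what the model predicts, but not how strongly a single input drives the prediction. One way to probe that is to take one sample, vary a single feature over a range, and record the positive-class probability at each value. In this sketch the helper `sweep_feature`, the feature names `amount` and `hour`, and the stand-in model are all hypothetical; with the real pipeline you would pass `model` and a sample loaded from `test_samples/`. It also assumes class index 1 is the fraud class.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression


def sweep_feature(estimator, sample, feature, values):
    """Return P(class 1) for copies of `sample` with `feature` set to each value."""
    rows = [{**sample, feature: value} for value in values]
    frame = pd.DataFrame(rows)
    probs = estimator.predict_proba(frame)[:, 1]
    return pd.DataFrame({feature: list(values), "p_fraud": probs})


# Stand-in model trained on synthetic data (hypothetical features and labels)
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "amount": rng.uniform(0, 1000, size=200),
    "hour": rng.integers(0, 24, size=200),
})
labels = (train["amount"] > 600).astype(int)
stand_in = LogisticRegression(max_iter=1000).fit(train, labels)

# Vary only `amount`; `hour` stays fixed at the sample's value
sweep = sweep_feature(stand_in, {"amount": 50.0, "hour": 13}, "amount",
                      [50.0, 300.0, 600.0, 900.0])
print(sweep)
```

A flat `p_fraud` column would suggest the model barely reacts to that feature, at least around the chosen sample.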

## 3. Analyze Feature Importances

Extract and visualize feature importances from the trained model (if supported, e.g., tree-based models or logistic regression coefficients).

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Try to extract per-feature importance scores from the classifier
clf = model.named_steps.get('classifier', None)
preprocessor = model.named_steps.get('preprocessor', None)

if clf is not None and hasattr(clf, 'feature_importances_'):
    # Tree-based models expose impurity-based importances
    importances = np.asarray(clf.feature_importances_)
elif clf is not None and hasattr(clf, 'coef_'):
    # Linear models: use the mean absolute coefficient per feature
    # (magnitudes are only comparable if the inputs are on similar scales)
    importances = np.abs(np.atleast_2d(clf.coef_)).mean(axis=0)
else:
    importances = None

if importances is not None:
    # Get feature names after preprocessing, falling back to generic names
    feature_names = None
    if preprocessor is not None:
        try:
            feature_names = preprocessor.get_feature_names_out()
        except Exception:
            feature_names = None
    if feature_names is None or len(feature_names) != len(importances):
        feature_names = [f'feature_{i}' for i in range(len(importances))]
    indices = np.argsort(importances)[::-1]
    plt.figure(figsize=(10, 6))
    plt.title('Feature Importances')
    plt.bar(range(len(importances)), importances[indices])
    plt.xticks(range(len(importances)), np.array(feature_names)[indices], rotation=90)
    plt.tight_layout()
    plt.show()
else:
    print('Feature importances not available for this classifier.')
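
When the classifier exposes no built-in importances, or when scores on the raw input columns are preferred over the preprocessed ones, permutation importance is a model-agnostic alternative: it shuffles one column at a time and measures how much a score drops. The sketch below uses scikit-learn's `permutation_importance` on synthetic data. The dataset, the column names `f0` to `f4`, and the choice of average precision as the score are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for labeled fraud data
X, y = make_classification(n_samples=600, n_features=5, n_informative=3,
                           n_redundant=0, weights=[0.9, 0.1], random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(5)])
X_train, X_val, y_train, y_val = train_test_split(
    X, y, stratify=y, random_state=0)

estimator = RandomForestClassifier(n_estimators=50, random_state=0)
estimator.fit(X_train, y_train)

# Shuffle each column of the held-out set and measure the score drop
result = permutation_importance(estimator, X_val, y_val,
                                scoring="average_precision",
                                n_repeats=5, random_state=0)
ranking = pd.Series(result.importances_mean, index=X.columns)
print(ranking.sort_values(ascending=False))
```

Passing the whole fitted pipeline as the estimator, together with a held-out frame of raw columns, would yield one score per original input rather than per preprocessed feature.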

## 4. Retrain Model with Additional Data or Features

Augment the training dataset with more data or engineer new features. Retrain the model pipeline and save the updated model.

In [None]:
# Example: Load additional data or engineer new features here
# df_train = ...  # Load the original labeled training data
# df_new = ...    # Load or create new labeled data
# df_augmented = pd.concat([df_train, df_new], ignore_index=True)
#
# Optionally, engineer new features
# (the preprocessor must also be configured to handle any new column)
# df_augmented['new_feature'] = ...
#
# Retrain the pipeline (build X_train, y_train from df_augmented)
# model.fit(X_train, y_train)
# joblib.dump(model, 'models/dbt_fraud_detection_pipeline_updated.pkl')

print("Fill in this cell with your data augmentation and retraining steps.")
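
As a rough sketch of those steps, the example below concatenates an original and a new batch of labeled data, adds one engineered feature, refits a cloned pipeline, and saves it. Everything here is a stand-in: the columns `amount`, `hour`, and `is_fraud`, the `is_night` feature, the labeling rule, and the pipeline layout are assumptions, and the file is written to a temporary directory rather than `models/`.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def make_batch(n_rows, seed):
    """Generate a synthetic labeled batch (hypothetical schema)."""
    rng = np.random.default_rng(seed)
    batch = pd.DataFrame({
        "amount": rng.uniform(0, 1000, size=n_rows),
        "hour": rng.integers(0, 24, size=n_rows),
    })
    batch["is_fraud"] = ((batch["amount"] > 700) & (batch["hour"] < 6)).astype(int)
    return batch


# Combine the original and the new labeled data
df_augmented = pd.concat([make_batch(300, 0), make_batch(200, 1)], ignore_index=True)

# Engineer a new feature before splitting into X and y
df_augmented["is_night"] = (df_augmented["hour"] < 6).astype(int)
X_train = df_augmented.drop(columns="is_fraud")
y_train = df_augmented["is_fraud"]

base = Pipeline([("preprocessor", StandardScaler()),
                 ("classifier", LogisticRegression(max_iter=1000))])
updated = clone(base).fit(X_train, y_train)  # clone() leaves `base` unfitted

# Save under a new name and confirm the saved file loads and predicts
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "pipeline_updated.pkl")
    joblib.dump(updated, out_path)
    reloaded = joblib.load(out_path)
    n_predictions = len(reloaded.predict(X_train))
print(f"Reloaded pipeline produced {n_predictions} predictions")
```

Saving under a new file name, as the cell above does, keeps the original pipeline available for the comparison in section 6.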

## 5. Experiment with More Complex Models

Replace the current estimator in the pipeline with a more complex model (e.g., RandomForest, XGBoost). Train and evaluate its performance.

In [None]:
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier

# Assume you have X_train, y_train from your original or augmented data
# Replace the classifier in the pipeline with RandomForest
if 'classifier' in model.named_steps:
    # clone() returns an unfitted copy, so fitting rf_pipeline later
    # does not refit the preprocessor inside the loaded model
    rf_pipeline = clone(model)
    rf_pipeline.set_params(classifier=RandomForestClassifier(n_estimators=100, random_state=42))
    # rf_pipeline.fit(X_train, y_train)
    print("RandomForestClassifier added to pipeline. Uncomment fit() and provide data to train.")
else:
    print("No 'classifier' step found in the pipeline.")
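
Because fraud labels are usually heavily imbalanced, accuracy alone can hide a model that misses most fraud cases, so it helps to compare candidates on recall or average precision under stratified cross-validation. The sketch below does that for a linear baseline and a random forest on synthetic data. The dataset and both model configurations are assumptions, not the pipeline's actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic data with roughly 5% positives as a stand-in for fraud labels
X, y = make_classification(n_samples=1000, n_features=8, n_informative=4,
                           weights=[0.95, 0.05], random_state=42)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Stratified folds keep the class ratio similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
summary = {}
for name, estimator in candidates.items():
    scores = cross_validate(estimator, X, y, cv=cv,
                            scoring=["recall", "average_precision"])
    summary[name] = {
        "recall": scores["test_recall"].mean(),
        "average_precision": scores["test_average_precision"].mean(),
    }
    print(name, summary[name])
```

The same loop should accept full pipelines in place of the bare estimators, which keeps preprocessing inside each fold.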

## 6. Compare Prediction Outputs for Different Samples

Run predictions on the same test samples using the updated model(s) and compare the outputs to assess improvements in sensitivity.

In [None]:
# Example: Compare predictions from the updated model(s) on the same test samples
# for file in sample_files:
#     with open(file, 'r') as f:
#         sample = json.load(f)
#     df = pd.DataFrame([sample])
#     pred = rf_pipeline.predict(df)[0]
#     prob = rf_pipeline.predict_proba(df)[0]
#     print(f"File: {file}")
#     print(f"RandomForest Prediction: {pred}, Probabilities: {prob}")
#     print('-' * 40)

print("Fill in this cell to compare predictions from your updated model(s) on the test samples.")
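
One way to make the comparison easy to read is to collect each model's fraud probability for every sample into a single table. The sketch below assumes fitted models that support `predict_proba` and samples held in a dict keyed by file name. The helper `compare_models`, the feature names, the file names, and the stand-in models are hypothetical, and class index 1 is assumed to be fraud.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression


def compare_models(models, samples):
    """Return one row per sample and one P(class 1) column per model."""
    frame = pd.DataFrame(list(samples.values()), index=list(samples.keys()))
    columns = {name: m.predict_proba(frame)[:, 1] for name, m in models.items()}
    table = pd.DataFrame(columns, index=frame.index)
    names = list(models)
    if len(names) == 2:
        # Positive delta: the second model assigns a higher fraud probability
        table["delta"] = table[names[1]] - table[names[0]]
    return table


# Stand-in models trained on synthetic data (hypothetical features and labels)
rng = np.random.default_rng(1)
train = pd.DataFrame({
    "amount": rng.uniform(0, 1000, size=300),
    "hour": rng.integers(0, 24, size=300),
})
labels = (train["amount"] > 600).astype(int)
models = {
    "current": LogisticRegression(max_iter=1000).fit(train, labels),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=1).fit(train, labels),
}
samples = {
    "sample_low.json": {"amount": 40.0, "hour": 10},
    "sample_high.json": {"amount": 950.0, "hour": 2},
}

comparison = compare_models(models, samples)
print(comparison)
```

With the real notebook objects, `models` would hold `model` and the fitted `rf_pipeline`, and `samples` would be filled from the files in `sample_files`.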