# Fraud Detection Model - Isolation Forest
## A Machine Learning Approach to Detecting Transaction Anomalies

This notebook demonstrates the implementation and evaluation of an Isolation Forest model for detecting fraudulent transactions in financial data. The model uses unsupervised learning to identify anomalies based on transaction patterns.

## 1. Data Loading and Preparation

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import IsolationForest
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve, auc
from sklearn.model_selection import train_test_split

# Set visualization style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

print("Libraries imported successfully!")

In [None]:
# Generate synthetic transaction data
from src.fraud_detection import generate_synthetic_data

df = generate_synthetic_data(n=5000, fraud_count=150)

print(f"Dataset shape: {df.shape}")
print("\nFirst few records:")
print(df.head())
print("\nDataset Info:")
df.info()  # info() prints its summary directly and returns None, so it is not wrapped in print()
print("\nFraud Distribution:")
print(df['is_fraud'].value_counts())
print(f"Fraud Rate: {df['is_fraud'].mean():.2%}")

## 2. Feature Engineering and Preprocessing

## 3. Model Training - Isolation Forest

**Why Isolation Forest?**
- Effective for anomaly detection in high-dimensional spaces
- Requires no fraud labels for training; the `is_fraud` labels are used here only for evaluation
- Computationally efficient and scalable
- Works well for fraud detection with imbalanced datasets
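
The intuition behind these points is that an unusual value can be separated from the rest of the data with very few random splits. The following is a minimal, standard-library-only sketch of that idea in one dimension. It is not scikit-learn's implementation (which builds trees over subsamples and several features); the helper names `isolation_depth` and `mean_isolation_depth` and the toy amounts are illustrative assumptions.

```python
import random


def isolation_depth(points, target, rng, max_depth=50):
    """Count the random splits needed to isolate `target` from 1-D `points`."""
    depth = 0
    current = list(points)
    while len(current) > 1 and depth < max_depth:
        lo, hi = min(current), max(current)
        if lo == hi:
            break  # identical values cannot be separated further
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that still contains the target
        if target < split:
            current = [p for p in current if p < split]
        else:
            current = [p for p in current if p >= split]
        depth += 1
    return depth


def mean_isolation_depth(points, target, n_trees=200, seed=0):
    """Average the isolation depth over many independent random trials."""
    rng = random.Random(seed)
    depths = [isolation_depth(points, target, rng) for _ in range(n_trees)]
    return sum(depths) / n_trees


# Toy data (assumed): 255 ordinary amounts near 100 plus one extreme amount
data_rng = random.Random(42)
amounts = [data_rng.gauss(100.0, 15.0) for _ in range(255)]
outlier = 5000.0
amounts.append(outlier)
typical = amounts[0]

outlier_depth = mean_isolation_depth(amounts, outlier)
typical_depth = mean_isolation_depth(amounts, typical)
print(f"Mean depth to isolate the outlier: {outlier_depth:.2f}")
print(f"Mean depth to isolate a typical amount: {typical_depth:.2f}")
```

The extreme amount is usually cut off by the very first split, while a typical amount needs many more. Isolation Forest turns this difference in path length into an anomaly score.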

## 4. Model Evaluation

## 5. Visualizations and Analysis

## 6. ROC Curve and Additional Metrics

In [None]:
# ROC Curve Analysis
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

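# ROC Curve
# decision_function is lower for anomalies, while roc_curve expects higher
# values for the positive (fraud) class, so the scores are negated.
fpr, tpr, thresholds = roc_curve(y_true, -anomaly_scores)
roc_auc = auc(fpr, tpr)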

axes[0].plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC Curve (AUC = {roc_auc:.4f})')
axes[0].plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random Classifier')
axes[0].set_xlim([0.0, 1.0])
axes[0].set_ylim([0.0, 1.05])
axes[0].set_xlabel('False Positive Rate')
axes[0].set_ylabel('True Positive Rate')
axes[0].set_title('ROC Curve', fontsize=12, fontweight='bold')
axes[0].legend(loc="lower right")
axes[0].grid(True, alpha=0.3)

# Anomaly Score Distribution
axes[1].hist(anomaly_scores[y_true == 0], bins=50, alpha=0.6, label='Normal', color='green')
axes[1].hist(anomaly_scores[y_true == 1], bins=50, alpha=0.6, label='Fraud', color='red')
axes[1].set_xlabel('Anomaly Score (Decision Function)')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Anomaly Score Distribution', fontsize=12, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"ROC-AUC Score: {roc_auc:.4f}")
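
The negated `anomaly_scores` passed to `roc_curve` above matter because AUC equals the probability that a randomly chosen fraud case receives a higher score than a randomly chosen normal case. The sketch below computes that probability directly on a handful of made-up scores (the values and the helper name `pairwise_auc` are assumptions for illustration) and shows that flipping the sign turns an AUC of `a` into `1 - a`.

```python
def pairwise_auc(labels, scores):
    """AUC as P(positive score > negative score), counting ties as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical decision_function-style scores: lower means more anomalous
toy_labels = [0, 0, 0, 0, 1, 1]
toy_scores = [0.12, 0.08, 0.10, -0.01, -0.15, 0.02]

auc_raw = pairwise_auc(toy_labels, toy_scores)
auc_negated = pairwise_auc(toy_labels, [-s for s in toy_scores])
print(auc_raw)      # 0.125
print(auc_negated)  # 0.875
```

Without the negation, a model that ranks fraud well would appear to perform worse than random.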

## 7. Key Findings and Conclusions

### Model Performance Summary
- **Algorithm**: Isolation Forest (Unsupervised Anomaly Detection)
- **Strengths**: 
  - Effective at detecting outliers in high-dimensional spaces
  - Works well with imbalanced datasets
  - Produces a single anomaly score per transaction (`decision_function`) that can be thresholded
- **Use Case**: Real-time fraud detection in transaction systems

### Key Insights
1. **Data Characteristics**: The synthetic data shows clear fraud patterns with high-value transactions occurring during late-night hours (0-3 AM)
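2. **Model Effectiveness**: The Isolation Forest recovers the injected fraud signal; the confusion matrix, precision, recall, and ROC-AUC from Section 4 quantify how well
3. **Threshold Selection**: The contamination parameter (3%) matches the labeled fraud rate of the synthetic dataset (150 fraud cases in 5,000 transactions); on real data this rate is usually unknown and has to be estimated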
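
The arithmetic behind the third insight can be checked in a few lines. This sketch assumes that the 150 fraud rows requested from `generate_synthetic_data` are part of the 5,000 rows rather than added on top, and the true-positive count used at the end is a made-up number for illustration, not a result from this notebook.

```python
n_transactions = 5000   # from generate_synthetic_data(n=5000, ...)
fraud_count = 150       # from generate_synthetic_data(..., fraud_count=150)
contamination = 0.03    # value passed to IsolationForest

# Assumption: the fraud rows are included in n_transactions
fraud_rate = fraud_count / n_transactions
expected_flags = round(contamination * n_transactions)
print(f"Fraud rate: {fraud_rate:.2%}, expected flags: {expected_flags}")


def precision_recall(true_positives, flagged, actual_fraud):
    """Precision and recall from raw counts."""
    return true_positives / flagged, true_positives / actual_fraud


# Hypothetical outcome: 120 of the flagged transactions are real fraud
precision, recall = precision_recall(
    true_positives=120, flagged=expected_flags, actual_fraud=fraud_count
)
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")
```

When the number of flagged transactions equals the number of actual fraud cases, precision and recall share the same denominator and are therefore equal. A different contamination value trades one against the other.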

### Next Steps for Production
- Validate on real-world transaction data
- Implement threshold tuning based on business requirements (sensitivity vs. specificity)
- Consider ensemble methods combining multiple algorithms
- Deploy with monitoring for model drift
- Integrate with real-time transaction processing pipeline
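
Threshold tuning, mentioned in the list above, can be sketched as a sweep over candidate cutoffs on the anomaly score. The example below is a minimal illustration under assumed data: the function name `pick_threshold`, the toy labels and scores, and the recall targets are all hypothetical. It follows the `decision_function` convention that lower scores are more anomalous and assumes at least one fraud label is present.

```python
def pick_threshold(labels, scores, min_recall):
    """Return (threshold, precision, recall) for the lowest cutoff meeting min_recall.

    A transaction is flagged when its score is <= threshold. Returns None
    if no cutoff reaches the requested recall.
    """
    total_fraud = sum(labels)
    for threshold in sorted(set(scores)):
        flagged = [y for y, s in zip(labels, scores) if s <= threshold]
        true_positives = sum(flagged)
        recall = true_positives / total_fraud
        precision = true_positives / len(flagged)
        if recall >= min_recall:
            # Thresholds are visited in ascending order, so the first hit
            # is the cutoff that raises the fewest alerts.
            return threshold, precision, recall
    return None


# Hypothetical labels (1 = fraud) and decision_function-style scores
labels = [1, 1, 0, 1, 0, 0, 0, 0]
scores = [-0.20, -0.10, -0.05, 0.01, 0.03, 0.08, 0.10, 0.12]

catch_all = pick_threshold(labels, scores, min_recall=1.0)
catch_most = pick_threshold(labels, scores, min_recall=0.6)
print(catch_all)   # (0.01, 0.75, 1.0)
print(catch_most)
```

Requiring every fraud case to be caught pushes the cutoff up and lets a false positive through; relaxing the recall target keeps precision at 1.0 on this toy data. Which point to choose depends on the relative cost of missed fraud and false alerts.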

In [None]:
# Prepare features for modeling
df_model = df.copy()

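# Encode categorical variable (transaction_type)
# Note: LabelEncoder assigns arbitrary integer codes, which imposes an artificial
# ordering on the categories; one-hot encoding is an alternative for features.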
le = LabelEncoder()
df_model['transaction_type_encoded'] = le.fit_transform(df_model['transaction_type'])

# Display encoding mapping
print("Transaction Type Encoding:")
for i, type_name in enumerate(le.classes_):
    print(f"  {i}: {type_name}")

# Select features and target
features = ['amount', 'time_of_day', 'transaction_type_encoded']
X = df_model[features].copy()
y_true = df_model['is_fraud'].copy()

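# Standardize features. Isolation Forest splits on one feature at a time between
# its min and max, so per-feature scaling does not change which points are isolated;
# it is kept here so the same matrix can be reused with scale-sensitive models.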
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X = pd.DataFrame(X_scaled, columns=features)

print(f"\nFeature Matrix Shape: {X.shape}")
print(f"Features: {list(X.columns)}")
print(f"\nFeature Statistics:")
print(X.describe())

In [None]:
# Train Isolation Forest model
print("Training Isolation Forest Model...")
print("-" * 50)

model = IsolationForest(
    n_estimators=200,        # Number of isolation trees
    contamination=0.03,       # Expected proportion of outliers (3%)
    random_state=42,          # For reproducibility
    n_jobs=-1                 # Use all CPU cores
)

model.fit(X)

# Get anomaly scores and predictions
anomaly_scores = model.decision_function(X)
predictions = model.predict(X)

# Convert predictions: normal=1 → 0, anomaly=-1 → 1
y_pred = np.where(predictions == -1, 1, 0)

print("Model Training Complete!")
print(f"Model Parameters: {model.get_params()}")
print(f"\nPrediction Distribution:")
print(f"  Normal Transactions: {(y_pred == 0).sum()}")
print(f"  Flagged as Fraud: {(y_pred == 1).sum()}")
print(f"  Flagged Rate: {(y_pred == 1).mean():.2%}")
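
The link between `contamination=0.03` and the flagged rate printed above can be illustrated without scikit-learn. With a float `contamination`, the library shifts its raw scores so that roughly that fraction of the training samples ends up with a negative `decision_function` value, which `predict` reports as `-1`. The sketch below mimics that with a simple rank-based cutoff; the function name `flag_by_contamination` and the toy scores are assumptions, and scikit-learn's own percentile computation may differ in how it interpolates. It expects at least two scores and a contamination below 1.

```python
def flag_by_contamination(raw_scores, contamination):
    """Flag roughly the lowest `contamination` fraction of scores as anomalies."""
    n_flag = max(1, round(contamination * len(raw_scores)))
    ordered = sorted(raw_scores)
    # Place the cutoff midway between the last flagged and first unflagged score
    cutoff = (ordered[n_flag - 1] + ordered[n_flag]) / 2
    shifted = [s - cutoff for s in raw_scores]       # plays the role of decision_function
    flags = [1 if s < 0 else 0 for s in shifted]     # plays the role of y_pred
    return shifted, flags


# Toy raw scores (assumed): 100 evenly spaced values, lowest = most anomalous
toy_raw_scores = [i / 100 for i in range(100)]
toy_shifted, toy_flags = flag_by_contamination(toy_raw_scores, contamination=0.03)
print(f"Flagged: {sum(toy_flags)} of {len(toy_flags)}")  # Flagged: 3 of 100
```

This is why the flagged rate stays close to the contamination value regardless of how well the model separates fraud from normal transactions: the parameter fixes how many points are flagged, not which ones.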

In [None]:
# Comprehensive Model Evaluation
print("=" * 60)
print("MODEL EVALUATION RESULTS")
print("=" * 60)

# Confusion Matrix
cm = confusion_matrix(y_true, y_pred)
print("\nConfusion Matrix:")
print(cm)
tn, fp, fn, tp = cm.ravel()
print(f"  True Negatives: {tn}")
print(f"  False Positives: {fp}")
print(f"  False Negatives: {fn}")
print(f"  True Positives: {tp}")

# Calculate additional metrics
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0
# Named `f1` to avoid confusion with sklearn.metrics.f1_score
f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

print("\nKey Metrics:")
print(f"  Accuracy: {accuracy:.4f}")
print(f"  Precision: {precision:.4f}")
print(f"  Recall (Sensitivity): {recall:.4f}")
print(f"  F1-Score: {f1:.4f}")

# ROC-AUC Score
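try:
    # Negate the scores: lower decision_function values mean more anomalous
    roc_auc = roc_auc_score(y_true, -anomaly_scores)
    print(f"  ROC-AUC Score: {roc_auc:.4f}")
except ValueError as exc:  # raised when y_true contains only one class
    print(f"  ROC-AUC Score: Could not compute ({exc})")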

print("\nClassification Report:")
print(classification_report(y_true, y_pred, target_names=['Normal', 'Fraud']))
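
Accuracy in the report above needs to be read with the class imbalance in mind. As a worked example, assuming the dataset holds 150 fraud cases among 5,000 transactions, a baseline that never flags anything already scores 97% accuracy while catching no fraud at all:

```python
n_total = 5000
n_fraud = 150   # assumed: 3% fraud rate, matching the data generation call

# Confusion-matrix counts for a baseline that labels every transaction as normal
tn, fp, fn, tp = n_total - n_fraud, 0, n_fraud, 0

baseline_accuracy = (tp + tn) / (tp + tn + fp + fn)
baseline_recall = tp / (tp + fn)
print(f"Baseline accuracy: {baseline_accuracy:.2%}")  # Baseline accuracy: 97.00%
print(f"Baseline recall: {baseline_recall:.2%}")      # Baseline recall: 0.00%
```

Precision, recall, and F1 for the fraud class are therefore the more informative figures when comparing models on this data.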

In [None]:
# Comprehensive Visualizations
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# 1. Confusion Matrix Heatmap (each cell shows its label and its count)
cm_names = [['True Neg', 'False Pos'], ['False Neg', 'True Pos']]
cm_labels = np.array([[f"{cm_names[i][j]}\n{cm[i, j]}" for j in range(2)] for i in range(2)])
sns.heatmap(cm, annot=cm_labels, fmt='', cmap='Blues', cbar=False, ax=axes[0, 0],
            xticklabels=['Normal', 'Fraud'], yticklabels=['Normal', 'Fraud'])
axes[0, 0].set_title('Confusion Matrix', fontsize=12, fontweight='bold')
axes[0, 0].set_ylabel('True Label')
axes[0, 0].set_xlabel('Predicted Label')

# 2. Transaction Amount Distribution (Normal vs Fraud)
axes[0, 1].hist(df[df['is_fraud'] == 0]['amount'], bins=40, alpha=0.6, label='Normal', color='green')
axes[0, 1].hist(df[df['is_fraud'] == 1]['amount'], bins=40, alpha=0.6, label='Fraud', color='red')
axes[0, 1].set_title('Transaction Amount Distribution', fontsize=12, fontweight='bold')
axes[0, 1].set_xlabel('Amount ($)')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].legend()
axes[0, 1].set_xlim(0, 200000)

# 3. Average Amount by Class
avg_amount = df.groupby('is_fraud')['amount'].mean()
colors = ['green', 'red']
axes[1, 0].bar(['Normal', 'Fraud'], avg_amount.values, color=colors, alpha=0.7)
axes[1, 0].set_title('Average Transaction Amount by Class', fontsize=12, fontweight='bold')
axes[1, 0].set_ylabel('Average Amount ($)')
for i, v in enumerate(avg_amount.values):
    axes[1, 0].text(i, v + 1000, f'${v:,.0f}', ha='center', fontweight='bold')

# 4. Fraud by Time of Day
fraud_by_hour = df.groupby('time_of_day')['is_fraud'].agg(['sum', 'count'])
fraud_by_hour['rate'] = (fraud_by_hour['sum'] / fraud_by_hour['count'] * 100).fillna(0)
axes[1, 1].plot(fraud_by_hour.index, fraud_by_hour['rate'], marker='o', linewidth=2, markersize=8, color='red')
axes[1, 1].set_title('Fraud Rate by Hour of Day', fontsize=12, fontweight='bold')
axes[1, 1].set_xlabel('Hour of Day')
axes[1, 1].set_ylabel('Fraud Rate (%)')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Visualizations generated successfully!")