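- The model has low recall (28.25%): it catches only 139 of the 492 actual frauds.
- When we set contamination to the fraction of frauds present in the actual dataset, the model flags that same fraction of predictions as fraud (anomalies). The number of false positives then equals the number of false negatives (353 each), which is why precision and recall for the fraud class are both 0.28.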
- The goal here is to explore how changing the contamination parameter affects the trade-off between identifying more actual frauds (recall) and reducing false alarms (precision).
- In a real-world scenario, where we don't know the exact fraud rate, we would experiment with this parameter.
- The trade-off is between precision and recall.
    - With higher contamination values, more samples are classified as fraud, which increases recall. At the same time, precision goes down, i.e. there are more false positives.
    - Conversely, with lower contamination values, fewer samples are classified as fraud, which tends to increase precision, but more actual fraud cases are missed.
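
In scikit-learn's `IsolationForest`, a numeric `contamination` does not change the trees or the anomaly scores; it only sets the decision threshold at the corresponding percentile of the training scores. Under that assumption, the trade-off described above can be explored by fitting once and re-thresholding the scores, which should flag the same samples as refitting with the same `random_state`. The sketch below uses synthetic data (`X_demo`, `y_demo` are illustrative stand-ins, not the credit card dataset), so its numbers say nothing about the real results.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score

# Synthetic stand-in data (assumption): 2,000 "legitimate" points near the
# origin plus 20 shifted "fraud" points. Not the notebook's dataset.
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(2000, 4)),
    rng.normal(4.0, 1.0, size=(20, 4)),
])
y_demo = np.concatenate([np.zeros(2000, dtype=int), np.ones(20, dtype=int)])

# Fit once. The scores do not depend on the contamination setting.
model = IsolationForest(n_estimators=100, contamination="auto", random_state=42)
model.fit(X_demo)
scores = model.score_samples(X_demo)  # lower score = more anomalous

# Re-threshold the same scores for several contamination rates.
sweep = {}
for rate in [0.005, 0.01, 0.02]:
    threshold = np.percentile(scores, 100 * rate)
    flagged = (scores < threshold).astype(int)  # 1 = predicted fraud
    sweep[rate] = {
        "n_flagged": int(flagged.sum()),
        "precision": precision_score(y_demo, flagged, zero_division=0),
        "recall": recall_score(y_demo, flagged),
    }

for rate, m in sweep.items():
    print(f"contamination={rate:.3f}: flagged={m['n_flagged']}, "
          f"precision={m['precision']:.2f}, recall={m['recall']:.2f}")
```

Because each higher rate flags a superset of the samples flagged at a lower rate, recall can only stay the same or rise as the rate grows, while precision depends on how many of the newly flagged samples are actual frauds.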

In [None]:
import pandas as pd

df = pd.read_csv(r".\data\creditcard.csv")
X = pd.read_csv(r".\preprocessed_data.csv")
y_true = pd.read_csv(r".\data\y_true.csv")

In [5]:
from sklearn.ensemble import IsolationForest
from sklearn.metrics import classification_report, confusion_matrix

# Define contamination values to test
# These are illustrative; choose values that make sense for your analysis.
# Remember the original fraud_percentage was approx 0.00172
contamination_values_to_test = [
    float(y_true.value_counts(normalize=True)[1]), # Baseline (actual fraud rate)
    0.001,  # Slightly lower
    0.002,  # Slightly higher
    0.005   # Even higher, to see impact
]

results = {}
for cont_rate in contamination_values_to_test:
    print(f"\nTesting Isolation Forest with contamination = {cont_rate:.4f}")

    # Re-instantiate and retrain model for each contamination value
    iso_forest_exp = IsolationForest(
        n_estimators=100,
        contamination=cont_rate,
        random_state=42,
        n_jobs=-1
    )
    iso_forest_exp.fit(X)

    # Get predictions
    df_temp = df.copy()
    df_temp['anomaly_prediction'] = iso_forest_exp.predict(X)
    df_temp['is_anomaly'] = df_temp['anomaly_prediction'].apply(lambda x: 1 if x == -1 else 0)

    # Calculate metrics
    report = classification_report(y_true, df_temp['is_anomaly'], target_names=['Legitimate', 'Fraud'], output_dict=True)

    # Store relevant metrics for comparison
    results[f"Contamination_{cont_rate:.4f}"] = {
        'Precision_Fraud': report['Fraud']['precision'],
        'Recall_Fraud': report['Fraud']['recall'],
        'F1_Fraud': report['Fraud']['f1-score'],
        'Num_Predicted_Anomalies': df_temp['is_anomaly'].sum(),
    }

    print(f"Classification Report for contamination={cont_rate:.4f}:")
    print(classification_report(y_true, df_temp['is_anomaly'], target_names=['Legitimate', 'Fraud']))

    # Optional: Display confusion matrix for each as well, or just for the best one later
    conf_matrix_exp = confusion_matrix(y_true, df_temp['is_anomaly'])
    print("Confusion Matrix:")
    print(conf_matrix_exp)

print("\n--- Contamination Experimentation Summary ---")
for cont_param, metrics in results.items():
    print(f"{cont_param}:")
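    for metric_name, value in metrics.items():
        # The anomaly count is an integer; print it without decimal places
        if metric_name == 'Num_Predicted_Anomalies':
            print(f"  {metric_name}: {int(value)}")
        else:
            print(f"  {metric_name}: {value:.4f}")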

print("\n--- Phase 3: Experimentation Complete ---")


Testing Isolation Forest with contamination = 0.0017
Classification Report for contamination=0.0017:
              precision    recall  f1-score   support

  Legitimate       1.00      1.00      1.00    284315
       Fraud       0.28      0.28      0.28       492

    accuracy                           1.00    284807
   macro avg       0.64      0.64      0.64    284807
weighted avg       1.00      1.00      1.00    284807

Confusion Matrix:
[[283962    353]
 [   353    139]]

Testing Isolation Forest with contamination = 0.0010
Classification Report for contamination=0.0010:
              precision    recall  f1-score   support

  Legitimate       1.00      1.00      1.00    284315
       Fraud       0.37      0.21      0.27       492

    accuracy                           1.00    284807
   macro avg       0.68      0.61      0.63    284807
weighted avg       1.00      1.00      1.00    284807

Confusion Matrix:
[[284135    180]
 [   387    105]]

Testing Isolation Forest with conta

- For a bank, missing fraud (low recall, many false negatives) is extremely costly. It might therefore prioritize higher recall, even if that means more false positives (lower precision) that require manual review. The contamination=0.0050 scenario, despite its low precision, might be preferable if the cost of missing fraud is exceptionally high.
- Conversely, if manual review resources are severely limited, a higher-precision setting (like contamination=0.0010) might be chosen to reduce the burden of false alarms, accepting that some fraud will be missed.
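
One way to make this choice concrete is to attach a cost to each kind of error and compare the totals. The sketch below uses the confusion-matrix counts printed above for the two runs whose output is shown. The unit costs and the helper `total_cost` are hypothetical and serve only to illustrate the comparison; they are not estimates of real fraud or review costs.

```python
# Confusion-matrix counts copied from the printed output above.
runs = {
    0.0017: {"tp": 139, "fp": 353, "fn": 353},
    0.0010: {"tp": 105, "fp": 180, "fn": 387},
}


def total_cost(counts, cost_missed_fraud, cost_manual_review):
    """Total cost = missed frauds * their cost + false alarms * review cost."""
    return counts["fn"] * cost_missed_fraud + counts["fp"] * cost_manual_review


def cheapest_rate(cost_missed_fraud, cost_manual_review):
    """Return the contamination rate with the lowest total cost."""
    costs = {
        rate: total_cost(c, cost_missed_fraud, cost_manual_review)
        for rate, c in runs.items()
    }
    return min(costs, key=costs.get), costs


# Hypothetical scenario A: a missed fraud costs 100x a manual review.
best_a, costs_a = cheapest_rate(cost_missed_fraud=500.0, cost_manual_review=5.0)

# Hypothetical scenario B: a missed fraud costs only 2x a manual review.
best_b, costs_b = cheapest_rate(cost_missed_fraud=10.0, cost_manual_review=5.0)

print(f"Scenario A: best contamination = {best_a}, costs = {costs_a}")
print(f"Scenario B: best contamination = {best_b}, costs = {costs_b}")
```

Moving from contamination=0.0010 to 0.0017 catches 34 more frauds at the price of 173 more false alarms, so in this sketch the higher setting pays off only when a missed fraud costs more than about five times a manual review (173 / 34 ≈ 5.1).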