# Fraud Detection Model Evaluation with CatBoost

This notebook evaluates a previously trained CatBoost model using test data. It includes generating predictions, calculating evaluation metrics, and visualizing the confusion matrix for performance analysis.


## 1. Importing Necessary Libraries
Libraries used in this notebook:
- **pandas**: For data manipulation.
- **CatBoostClassifier**: For loading the trained CatBoost model.
- **scikit-learn metrics**: For evaluation metrics such as accuracy, precision, recall, F1-score, ROC AUC score, and confusion matrix.
- **matplotlib**: For plotting visualizations (e.g., confusion matrix).


## 2. Loading the Trained CatBoost Model
The trained CatBoost model is loaded from the JSON file (`catboost_model.json`) for evaluation on the test dataset.


In [None]:
# Import necessary libraries
import pandas as pd
from catboost import CatBoostClassifier
from testing_preprocessing import dataset  # Assumes the test preprocessing script has this name; `dataset` is reassigned when the test data is loaded
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Load the saved CatBoost model from JSON file
model = CatBoostClassifier()
model.load_model('catboost_model.json', format='json')
print("Loaded CatBoost model from 'catboost_model.json'.")

## 3. Loading and Verifying Test Data
The preprocessed test dataset is loaded using the `test_to_df` function, and the first few rows are displayed to confirm the dataset structure and correctness.


In [None]:
# Load the preprocessed test data
from data_reprocessing import test_to_df
dataset = 'new_test.csv'
test_data = test_to_df(dataset)

# Display the first few rows of the dataset to verify the preprocessing
print(test_data.head())

## 4. Preparing Test Data
- **Features (`X_test`)**: All columns except the target (`is_attributed`).
- **Target (`y_test`)**: True labels indicating fraudulent (0) or non-fraudulent (1) clicks.

Predictions:
- **`y_test_pred_prob`**: Predicted probabilities for class 1 (non-fraudulent clicks), taken from the second column of `predict_proba`.
- **`y_test_pred_class`**: Binary predictions from a 0.5 threshold on that probability: 1 (non-fraudulent) above 0.5, otherwise 0 (fraudulent).
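
As a small illustration of the label convention, the sketch below applies the same 0.5 threshold to a few made-up probabilities (not model output). The values and variable names are hypothetical; the point is only that the thresholded quantity is the class 1 probability, so the fraud probability is its complement.

```python
# Hypothetical P(class 1) values, standing in for model.predict_proba(X)[:, 1].
p_non_fraud = [0.92, 0.08, 0.50, 0.51]

# Same rule as the notebook: label 1 only when the probability is strictly above 0.5.
pred_class = [int(p > 0.5) for p in p_non_fraud]

# With 0 = fraudulent, the fraud probability is the complement of P(class 1).
p_fraud = [round(1.0 - p, 2) for p in p_non_fraud]

print(pred_class)  # [1, 0, 0, 1]
print(p_fraud)     # [0.08, 0.92, 0.5, 0.49]
```

Because the comparison is strict, a probability of exactly 0.5 is labeled 0 (fraudulent).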


In [3]:
# Separate features and labels from the test data
X_test = test_data.drop(columns=['is_attributed'])  # Features
y_test = test_data['is_attributed']  # True labels

# Make predictions on the test set using the loaded model
y_test_pred_prob = model.predict_proba(X_test)[:, 1]

# Convert probabilities to binary class predictions (0 = fraudulent, 1 = non-fraudulent)
y_test_pred_class = (y_test_pred_prob > 0.5).astype(int)


## 5. Calculating Performance Metrics
The model's performance is evaluated using the following metrics:
- **Accuracy**: The fraction of correct predictions.
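- **Precision**: Of the clicks predicted as fraudulent, the fraction that are actually fraudulent.
- **Recall**: Of the clicks that are actually fraudulent, the fraction the model identifies.
- **F1-Score**: The harmonic mean of precision and recall.
- **ROC AUC Score**: The probability that the model ranks a randomly chosen class 1 click above a randomly chosen class 0 click. It is computed from the predicted probabilities and does not depend on the 0.5 threshold.

Precision, recall, and F1-score are computed with fraudulent clicks (label 0) as the positive class. Evaluation results are displayed for analysis.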
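
To make the definitions concrete, here is a minimal pure-Python sketch that works the same quantities out by hand on a tiny made-up example, treating label 0 (fraudulent) as the positive class. The labels are hypothetical and the code is illustrative only; the notebook itself relies on the scikit-learn functions.

```python
# Hypothetical labels, using the notebook's convention: 0 = fraudulent, 1 = non-fraudulent.
y_true = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 0, 1, 1, 1, 1, 1, 1]

FRAUD = 0  # treated as the positive class, as with pos_label=0

pairs = list(zip(y_true, y_pred))
tp = sum(t == FRAUD and p == FRAUD for t, p in pairs)  # fraud correctly flagged
fp = sum(t != FRAUD and p == FRAUD for t, p in pairs)  # non-fraud flagged as fraud
fn = sum(t == FRAUD and p != FRAUD for t, p in pairs)  # fraud missed

accuracy = sum(t == p for t, p in pairs) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
# accuracy=0.8000 precision=0.6667 recall=0.6667 f1=0.6667
```

In this toy case accuracy is higher than the fraud-class precision and recall, which is one reason to report the class-specific metrics alongside it.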


In [None]:
# Calculate performance metrics on the test set
accuracy = accuracy_score(y_test, y_test_pred_class)
precision = precision_score(y_test, y_test_pred_class, pos_label=0)  # 0 = fraudulent
recall = recall_score(y_test, y_test_pred_class, pos_label=0)  # 0 = fraudulent
f1 = f1_score(y_test, y_test_pred_class, pos_label=0)  # 0 = fraudulent
auc_roc = roc_auc_score(y_test, y_test_pred_prob)

# Display evaluation results
print(f"Test Set Accuracy: {accuracy:.4f}")
print(f"Test Set Precision (fraud detection): {precision:.4f}")
print(f"Test Set Recall (fraud detection): {recall:.4f}")
print(f"Test Set F1-Score (fraud detection): {f1:.4f}")
print(f"Test Set AUC-ROC: {auc_roc:.4f}")
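
Note that `roc_auc_score` above receives the class 1 probabilities, while precision, recall, and F1 use `pos_label=0`. The sketch below, a pure-Python pairwise-ranking version of AUC on made-up data, suggests why this is not a conflict: scoring fraud as the positive class with the complementary probability gives the same value. The helper `pairwise_auc` is hypothetical and is not how scikit-learn computes the score.

```python
def pairwise_auc(labels, scores, positive):
    """Fraction of (positive, negative) pairs where the positive scores higher; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == positive]
    neg = [s for y, s in zip(labels, scores) if y != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels (0 = fraudulent) and P(class 1) scores.
y_true = [0, 0, 1, 1, 1]
p_class1 = [0.2, 0.6, 0.4, 0.7, 0.9]

# Class 1 as positive, scored by P(class 1).
auc_class1 = pairwise_auc(y_true, p_class1, positive=1)

# Fraud (0) as positive, scored by the complement 1 - P(class 1).
auc_fraud = pairwise_auc(y_true, [1 - p for p in p_class1], positive=0)

print(round(auc_class1, 4), round(auc_fraud, 4))  # 0.8333 0.8333
```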


## 6. Visualizing the Confusion Matrix
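- The confusion matrix counts, with fraudulent clicks (label 0) as the positive class:
  - True Positives (TP): fraudulent clicks predicted as fraudulent
  - False Positives (FP): non-fraudulent clicks predicted as fraudulent
  - True Negatives (TN): non-fraudulent clicks predicted as non-fraudulent
  - False Negatives (FN): fraudulent clicks predicted as non-fraudulent
- The matrix is plotted as a heatmap with true labels on the rows and predicted labels on the columns, labeled fraud (0) and non-fraud (1), to provide insights into the model's classification performance.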
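
The layout can be sketched in plain Python on made-up labels. It follows the convention `confusion_matrix` uses (rows are true labels, columns are predicted labels, in sorted label order) and reads the cells with fraud (0) as the positive class. The data is hypothetical.

```python
# Hypothetical labels, using the notebook's convention: 0 = fraudulent, 1 = non-fraudulent.
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 0, 1, 1, 1, 1]

# Rows are true labels, columns are predicted labels, both in sorted order [0, 1].
cm = [[0, 0], [0, 0]]
for t, p in zip(y_true, y_pred):
    cm[t][p] += 1

# Reading the cells with fraud (0) as the positive class.
tp, fn = cm[0]  # true fraud: flagged, missed
fp, tn = cm[1]  # true non-fraud: flagged, cleared

print(cm)              # [[2, 1], [1, 4]]
print(tp, fn, fp, tn)  # 2 1 1 4
```

This differs from the usual binary reading in the scikit-learn documentation, which treats label 1 as positive and so places true negatives in the top-left cell.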


In [None]:
# Calculate the confusion matrix
cm = confusion_matrix(y_test, y_test_pred_class)

# Plot the confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Fraud (0)', 'Non-Fraud (1)'])
disp.plot(cmap='Blues')
plt.title('Confusion Matrix on Test Set')
plt.show()

## Conclusion
This notebook evaluates a trained CatBoost model for fraud detection on the test set using accuracy, precision, recall, F1-score, and ROC AUC, and visualizes its errors with a confusion matrix. Together, these results show how well the model separates fraudulent from non-fraudulent clicks.
