<a href="https://colab.research.google.com/github/Nuha4/Algorithms/blob/master/pilot_study_phishing_storis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

print("Libraries imported successfully!")

Libraries imported successfully!


**Load and Inspect Data**

In [None]:
# Name of the uploaded CSV file
file_name = 'final_evaluation_data.csv'

# Load the data into a pandas DataFrame
df = pd.read_csv(file_name)

# Display the first 5 rows to make sure it looks right
print("First 5 rows of data:")
print(df.head())

**Calculate and Print Performance Metrics**


In [None]:
# Define "true" labels (Gold Standard) and the "predicted" labels (from the LLM)
y_true = df['Human_Label_RQ2']
y_pred = df['LLM_Label_RQ2']

print("--- PERFORMANCE METRICS: LLM vs. HUMAN GOLD STANDARD ---")

# 1. Calculate and print the overall accuracy
accuracy = accuracy_score(y_true, y_pred)
print(f"\nOverall Accuracy: {accuracy:.2%}\n")
print("This is the percentage of labels the LLM got exactly right.")
print("-" * 50)


# 2. Generate and print the detailed classification report
print("\nDetailed Classification Report (Precision, Recall, F1-Score):\n")
# The 'zero_division=0' parameter prevents warnings if a label was never predicted by the LLM
report = classification_report(y_true, y_pred, zero_division=0)
print(report)
print("-" * 50)

**Visualize the Errors with a Confusion Matrix**

In [None]:
print("\n--- VISUALIZING THE ERRORS: CONFUSION MATRIX ---")

# Get a sorted list of all unique labels that appear in either column
unique_labels = sorted(set(y_true) | set(y_pred))

# Create the confusion matrix
cm = confusion_matrix(y_true, y_pred, labels=unique_labels)

# Plot the confusion matrix using seaborn for a nice visual
plt.figure(figsize=(14, 12))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=unique_labels, yticklabels=unique_labels)

plt.title('Confusion Matrix: LLM Predictions vs. Human Gold Standard', fontsize=16)
plt.ylabel('True Label (Gold Standard)', fontsize=12)
plt.xlabel('Predicted Label (LLM Output)', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout() # Adjust plot to ensure everything fits without overlapping
plt.show()
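
With many personas, individual cells in the heatmap can be hard to read. The sketch below lists the most frequent errors as a small table instead. The helper name `top_confusions` and the toy labels are illustrative assumptions, not part of the original notebook; in the notebook itself you would pass `y_true` and `y_pred` from the cells above.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix


def top_confusions(y_true, y_pred, n=5):
    """Return the n most frequent (true, predicted) error pairs as a DataFrame."""
    labels = sorted(set(y_true) | set(y_pred))
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    rows = [
        {"true_label": labels[i], "predicted_label": labels[j], "count": int(cm[i, j])}
        for i in range(len(labels))
        for j in range(len(labels))
        if i != j and cm[i, j] > 0  # off-diagonal cells are the errors
    ]
    errors = pd.DataFrame(rows, columns=["true_label", "predicted_label", "count"])
    return errors.sort_values("count", ascending=False).head(n).reset_index(drop=True)


# Toy data for illustration only (hypothetical labels, not real results)
toy_true = ["Romance Scammer"] * 4 + ["Fake Courier"] * 3
toy_pred = (["Family Impersonator"] * 3 + ["Romance Scammer"]
            + ["Fake Courier"] * 2 + ["Romance Scammer"])

confusions = top_confusions(toy_true, toy_pred)
print(confusions)
```

Each row reads as "stories with this true label were given this predicted label this many times", which is the same information as one off-diagonal cell of the matrix.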

**Step 4: How to Interpret the Results**


1.  Overall Accuracy: This is the headline number. It can be reported in a sentence such as "The LLM achieved an overall accuracy of 85% on the pilot dataset" (85% is a placeholder here; use the value printed by the metrics cell).
2.  Classification Report: This is where you find the per-persona details.
    *   F1-Score: Look at the F1-score for each persona. It is the harmonic mean of precision and recall. As a rough rule of thumb, a high F1-score (e.g., > 0.90) means the LLM is excellent at identifying that specific persona, while a low score (e.g., < 0.70) means it struggles with that one.
    *   Precision vs. Recall: If precision is high but recall is low for "Fake Courier," it means that when the LLM predicts "Fake Courier," it is usually right, but it misses many of the actual "Fake Courier" stories (labeling them as something else).
    *   Support: This tells you how many examples of each persona were in the human-labeled data.
3.  Confusion Matrix: This is the diagnostic tool.
    *   The Diagonal: The numbers running from top-left to bottom-right are the correct predictions. You want these numbers to be high.
    *   Off-Diagonal Numbers: Any nonzero number off the diagonal counts errors. For example, if there is a "3" where the "True Label" row is Romance Scammer and the "Predicted Label" column is Family Impersonator, it means the LLM made that specific mistake 3 times. This shows exactly which personas the LLM is confusing with each other.
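
To make the precision-versus-recall distinction concrete, here is a minimal sketch on made-up labels. The data is hypothetical toy data chosen to produce high precision and low recall for "Fake Courier"; it does not come from the pilot study.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical toy labels: 4 real "Fake Courier" stories and 2 "Romance Scammer" stories
toy_true = ["Fake Courier", "Fake Courier", "Fake Courier", "Fake Courier",
            "Romance Scammer", "Romance Scammer"]
# The LLM predicts "Fake Courier" only once (correctly) and misses the other three
toy_pred = ["Fake Courier", "Romance Scammer", "Romance Scammer", "Romance Scammer",
            "Romance Scammer", "Romance Scammer"]

# Restrict the scores to a single persona with labels=[...] and average=None
precision = precision_score(toy_true, toy_pred, labels=["Fake Courier"], average=None)[0]
recall = recall_score(toy_true, toy_pred, labels=["Fake Courier"], average=None)[0]

print(f"Precision: {precision:.2f}")  # 1 correct out of 1 "Fake Courier" prediction
print(f"Recall:    {recall:.2f}")     # 1 found out of 4 actual "Fake Courier" stories
```

In this toy case every "Fake Courier" prediction is correct, yet three of the four real "Fake Courier" stories are labeled as something else, which is the pattern described in the Precision vs. Recall bullet.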