# Phishing Dataset Exploration

This notebook explores the characteristics of our phishing email dataset, including:
- Dataset size and composition
- Label distribution
- Basic text statistics
- Preprocessing analysis

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

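# Set style for better visualizations.
# The 'seaborn' matplotlib style was renamed 'seaborn-v0_8' in matplotlib 3.6,
# so apply the seaborn theme directly instead of plt.style.use('seaborn').
sns.set_theme(palette="husl")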

In [None]:
# Load the dataset
# Adjust the path as needed
data_path = Path("../data/raw/phishing_dataset.csv")  # Update this path
df = pd.read_csv(data_path)

# Display first few rows
print("First few rows of the dataset:")
display(df.head())

# Basic dataset info
print("\nDataset Info:")
df.info()  # info() prints its report and returns None, so display() is not needed

## Dataset Size and Label Distribution

Let's analyze the size of our dataset and the distribution of phishing vs benign labels.

In [None]:
# Dataset size
print(f"Total number of samples: {len(df)}")
print(f"Number of features: {df.shape[1]}")

# Label distribution
label_counts = df['label'].value_counts()
print("\nLabel distribution:")
display(label_counts)

# Visualize label distribution
plt.figure(figsize=(10, 6))
sns.barplot(x=label_counts.index, y=label_counts.values)
plt.title('Distribution of Labels (Phishing vs Benign)')
plt.xlabel('Label')
plt.ylabel('Count')
plt.show()

# Calculate percentages
label_percentages = (label_counts / len(df) * 100).round(2)
print("\nLabel percentages:")
display(label_percentages)

## Text Analysis

Let's analyze the characteristics of the email text, including length statistics and basic preprocessing metrics.

In [None]:
# Add text length statistics
df['text_length'] = df['text'].str.len()
df['subject_length'] = df['subject'].fillna('').str.len()

# Display basic statistics
print("Text length statistics:")
display(df[['text_length', 'subject_length']].describe())

# Visualize text length distributions by label
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
sns.boxplot(x='label', y='text_length', data=df)
plt.title('Email Text Length by Label')
plt.ylabel('Length (characters)')

plt.subplot(1, 2, 2)
sns.boxplot(x='label', y='subject_length', data=df)
plt.title('Subject Length by Label')
plt.ylabel('Length (characters)')

plt.tight_layout()
plt.show()

# Check for missing values
print("\nMissing values in each column:")
display(df.isnull().sum())

## Preprocessing Analysis

Let's analyze how many emails contain HTML, URLs, and other characteristics that need preprocessing.

In [None]:
# Regex patterns mirroring those in prep_phish_jsonl (defined inline here rather than imported)
import re
SCRIPT_RE = re.compile(r"(?is)<script.*?>.*?</script>")
STYLE_RE = re.compile(r"(?is)<style.*?>.*?</style>")
TAG_RE = re.compile(r"(?s)<[^>]+>")
URL_RE = re.compile(r"https?://[^\s)>\]]+", re.I)

# Analysis functions
def has_html(text):
    return bool(TAG_RE.search(str(text)))

def has_scripts(text):
    return bool(SCRIPT_RE.search(str(text)))

def get_urls(text):
    return URL_RE.findall(str(text))

# Add preprocessing flags
df['has_html'] = df['text'].apply(has_html)
df['has_scripts'] = df['text'].apply(has_scripts)
df['url_count'] = df['text'].apply(lambda x: len(get_urls(x)))

# Display preprocessing statistics
print("Preprocessing statistics:")
print(f"Emails containing HTML: {df['has_html'].sum()} ({(df['has_html'].mean()*100):.1f}%)")
print(f"Emails containing scripts: {df['has_scripts'].sum()} ({(df['has_scripts'].mean()*100):.1f}%)")
print(f"Emails containing URLs: {(df['url_count'] > 0).sum()} ({((df['url_count'] > 0).mean()*100):.1f}%)")

# Visualize URL distribution
plt.figure(figsize=(10, 5))
sns.histplot(data=df, x='url_count', hue='label', multiple="stack", bins=20)
plt.title('Distribution of URL Count by Label')
plt.xlabel('Number of URLs in Email')
plt.ylabel('Count')
plt.show()

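# Show correlation with phishing
# Assumes the phishing class is encoded as the string 'phish'; adjust if your labels differ.
# Boolean columns are cast to int so corr() works across pandas versions.
is_phish = (df['label'] == 'phish').astype(int)
print("\nCorrelation with phishing labels:")
correlations = {
    'Has HTML': df['has_html'].astype(int).corr(is_phish),
    'Has Scripts': df['has_scripts'].astype(int).corr(is_phish),
    'URL Count': df['url_count'].corr(is_phish)
}
display(pd.Series(correlations))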
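
The patterns above only flag emails that need cleaning. As a rough sketch of the cleaning step itself, the same regexes could be applied in sequence to strip markup and mask URLs. The helper name `strip_markup` and the `<URL>` placeholder are hypothetical and not taken from `prep_phish_jsonl`; the actual preprocessing script may behave differently.

```python
import html
import re

# Same patterns as in the cell above
SCRIPT_RE = re.compile(r"(?is)<script.*?>.*?</script>")
STYLE_RE = re.compile(r"(?is)<style.*?>.*?</style>")
TAG_RE = re.compile(r"(?s)<[^>]+>")
URL_RE = re.compile(r"https?://[^\s)>\]]+", re.I)

def strip_markup(text, url_token="<URL>"):
    """Hypothetical cleaner: drop scripts/styles, remove tags, mask URLs."""
    text = str(text)
    text = SCRIPT_RE.sub(" ", text)   # remove script blocks together with their contents
    text = STYLE_RE.sub(" ", text)    # then style blocks
    text = TAG_RE.sub(" ", text)      # then any remaining tags (including href attributes)
    text = html.unescape(text)        # decode entities such as &amp;
    text = URL_RE.sub(url_token, text)  # mask URLs left in the visible text
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

sample = (
    '<p>Verify your account at '
    '<a href="http://example.test/login">http://example.test/login</a></p>'
    '<script>alert(1)</script>'
)
print(strip_markup(sample))
```

Script and style blocks are removed before the generic tag pattern runs; otherwise their contents would survive as plain text once the surrounding tags were stripped.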

# Phishing Email Dataset Analysis
This notebook provides a comprehensive analysis of our phishing detection dataset, including:
1. Data distribution and statistics
2. Text characteristics analysis
3. Common phishing tactics visualization
4. Model confidence patterns
5. Entity and domain analysis

## Setup and Data Loading

In [None]:
# Import required libraries
import json
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
from typing import List, Dict
import re
from tqdm import tqdm

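# Set plot style.
# The 'seaborn' matplotlib style was renamed 'seaborn-v0_8' in matplotlib 3.6,
# so apply the seaborn theme directly instead of plt.style.use('seaborn').
sns.set_theme(palette="husl")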

# Configure Jupyter display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 50)
pd.set_option('display.width', 1000)

## Data Loading and Initial Processing
We'll load our dataset from the JSONL files and convert them to pandas DataFrames for analysis. Our data includes:
- Training set
- Test set
- Evaluation set
- Adversarial examples
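
The loader in the next cell expects one JSON object per line, with the email under `input` and the annotations nested under `output`. The record below is a made-up illustration of that shape, inferred from the field names the loader reads; all values are hypothetical and not taken from the dataset.

```python
import json

# Hypothetical record; only the field names come from what the loader reads
line = json.dumps({
    "input": "EMAIL: Your mailbox is full. Click here to upgrade.",
    "output": {
        "label": "phish",
        "confidence": 0.9,
        "tactics": ["urgency"],
        "evidence": ["Click here to upgrade"],
    },
})

example = json.loads(line)

# Strip the 'EMAIL: ' prefix the same way the loader does
text = example["input"]
if text.startswith("EMAIL: "):
    text = text[len("EMAIL: "):]

print(text)
print(example["output"]["label"], example["output"].get("confidence"))
```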

In [None]:
def load_jsonl(file_path: str) -> pd.DataFrame:
    """Load JSONL file into DataFrame with proper structure"""
    data = []
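    prefix = 'EMAIL: '
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines instead of failing in json.loads
            example = json.loads(line)

            # Extract text and remove the 'EMAIL: ' prefix if present
            text = example['input']
            if text.startswith(prefix):
                text = text[len(prefix):]
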
            # Extract other fields
            record = {
                'text': text,
                'label': example['output']['label'],
                'confidence': example['output'].get('confidence', None),
                'tactics': example['output'].get('tactics', []),
                'evidence': example['output'].get('evidence', [])
            }
            data.append(record)
            
    return pd.DataFrame(data)

# Load datasets
base_path = '../out_jsonl'
train_df = load_jsonl(os.path.join(base_path, 'train.jsonl'))
test_df = load_jsonl(os.path.join(base_path, 'test.jsonl'))
eval_df = load_jsonl(os.path.join(base_path, 'eval.jsonl'))

print("Dataset sizes:")
print(f"Training set: {len(train_df):,} examples")
print(f"Test set: {len(test_df):,} examples")
print(f"Evaluation set: {len(eval_df):,} examples")

## Data Distribution Analysis
Let's analyze the distribution of phishing vs benign emails and their characteristics:

In [None]:
# Analyze label distribution
def plot_label_distribution(df: pd.DataFrame, title: str, ax=None):
    """Draw a label count plot on the given axes (defaults to the current axes)."""
    if ax is None:
        ax = plt.gca()
    sns.countplot(data=df, x='label', ax=ax)
    ax.set_title(f'Label Distribution - {title}')
    ax.set_xlabel('Email Type')
    ax.set_ylabel('Count')

    # Add percentage labels
    total = len(df)
    for p in ax.patches:
        percentage = f'{100 * p.get_height() / total:.1f}%'
        ax.annotate(percentage, (p.get_x() + p.get_width() / 2., p.get_height()),
                    ha='center', va='bottom')

# Plot distributions side by side on one figure.
# (Creating a new figure inside the function would break the subplot layout.)
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
plot_label_distribution(train_df, 'Training Set', ax=axes[0])
plot_label_distribution(test_df, 'Test Set', ax=axes[1])
plot_label_distribution(eval_df, 'Evaluation Set', ax=axes[2])
plt.tight_layout()

# Print exact numbers
print("\nExact Distribution:")
for name, df in [('Training', train_df), ('Test', test_df), ('Eval', eval_df)]:
    dist = df['label'].value_counts()
    total = len(df)
    print(f"\n{name} Set:")
    for label, count in dist.items():
        print(f"{label}: {count:,} ({100 * count/total:.1f}%)")
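
For a side-by-side comparison of the splits, the per-split loop above can be condensed into a single table of label proportions. This is a minimal sketch using small stand-in DataFrames; the `splits` dictionary and the toy labels are assumptions for illustration, and in the notebook the values would be `train_df`, `test_df`, and `eval_df`.

```python
import pandas as pd

# Toy stand-ins for train_df / test_df / eval_df (hypothetical labels)
splits = {
    "Training": pd.DataFrame({"label": ["phish", "benign", "benign", "benign"]}),
    "Test": pd.DataFrame({"label": ["phish", "benign"]}),
    "Eval": pd.DataFrame({"label": ["phish", "phish", "benign", "benign"]}),
}

# One column per split, one row per label, values in percent
proportions = pd.DataFrame({
    name: split["label"].value_counts(normalize=True) * 100
    for name, split in splits.items()
}).fillna(0).round(1)

print(proportions)
```

A large gap between columns would suggest that the splits were not stratified by label.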

## Text Characteristics Analysis
Let's analyze various characteristics of the emails such as:
1. Text length distribution
2. Word count distribution
3. Common words and phrases
4. Special character usage

In [None]:
# Add text characteristics
for df in [train_df, test_df, eval_df]:
    df['text_length'] = df['text'].str.len()
    df['word_count'] = df['text'].str.split().str.len()
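    # Guard against empty texts, where np.mean([]) would return NaN with a RuntimeWarning
    df['avg_word_length'] = df['text'].apply(
        lambda x: np.mean([len(w) for w in x.split()]) if x.split() else 0.0
    )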
    df['special_char_count'] = df['text'].apply(lambda x: len(re.findall(r'[^a-zA-Z0-9\s]', x)))

# Plot text length distribution by label
plt.figure(figsize=(15, 5))

# Text length
plt.subplot(131)
sns.boxplot(data=train_df, x='label', y='text_length')
plt.title('Text Length Distribution by Label')
plt.xticks(rotation=0)

# Word count
plt.subplot(132)
sns.boxplot(data=train_df, x='label', y='word_count')
plt.title('Word Count Distribution by Label')
plt.xticks(rotation=0)

# Special characters
plt.subplot(133)
sns.boxplot(data=train_df, x='label', y='special_char_count')
plt.title('Special Character Count by Label')
plt.xticks(rotation=0)

plt.tight_layout()

# Print summary statistics
print("\nText Characteristics Summary (Training Set):")
print("\nBy Label:")
for label in train_df['label'].unique():
    subset = train_df[train_df['label'] == label]
    print(f"\n{label.upper()}:")
    print(f"Text Length: mean={subset['text_length'].mean():.1f}, median={subset['text_length'].median():.1f}")
    print(f"Word Count: mean={subset['word_count'].mean():.1f}, median={subset['word_count'].median():.1f}")
    print(f"Special Chars: mean={subset['special_char_count'].mean():.1f}, median={subset['special_char_count'].median():.1f}")

The box plots above compare three text characteristics between phishing and benign emails in the training set:

1. **Text Length Distribution**: Total character count per email, showing whether one class tends to have longer or shorter messages.

2. **Word Count Distribution**: Number of whitespace-separated tokens per email, showing how verbose each class is.

3. **Special Character Usage**: Count of characters that are not ASCII letters, digits, or whitespace, which might indicate formatting patterns or writing styles specific to one class.

Clear differences between the classes suggest candidate features for our classification model, and they can also expose potential biases in the dataset. The same statistics inform preprocessing choices such as length normalization or special character handling.
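
The per-label summary printed above can also be produced as one table with `groupby`, which makes the columns easier to compare. A minimal sketch on a toy DataFrame (the rows are invented for illustration; in the notebook this would run on `train_df`):

```python
import re
import pandas as pd

# Toy data standing in for train_df (hypothetical examples)
toy = pd.DataFrame({
    "text": ["Act now!!! Verify your account", "Lunch at noon?", "See attached report."],
    "label": ["phish", "benign", "benign"],
})

# Same characteristics as computed in the notebook
toy["text_length"] = toy["text"].str.len()
toy["word_count"] = toy["text"].str.split().str.len()
toy["special_char_count"] = toy["text"].apply(
    lambda x: len(re.findall(r"[^a-zA-Z0-9\s]", x))
)

# Mean and median of each characteristic, one row per label
summary = (
    toy.groupby("label")[["text_length", "word_count", "special_char_count"]]
    .agg(["mean", "median"])
    .round(1)
)
print(summary)
```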

In [None]:
from collections import Counter
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

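# Download required NLTK data
# (newer NLTK releases load the tokenizer from 'punkt_tab' instead of 'punkt')
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)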

def analyze_vocabulary(texts, top_n=20):
    """Analyze vocabulary patterns in a list of texts."""
    # Tokenize and lowercase all words
    all_words = [word.lower() for text in texts for word in word_tokenize(text)]
    
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    content_words = [word for word in all_words if word.isalnum() and word not in stop_words]
    
    # Get word frequencies
    word_freq = Counter(content_words)
    
    return word_freq.most_common(top_n)

# Analyze vocabulary by label
plt.figure(figsize=(15, 10))

for idx, label in enumerate(train_df['label'].unique()):
    subset = train_df[train_df['label'] == label]
    word_freq = analyze_vocabulary(subset['text'])
    
    plt.subplot(2, 2, idx + 1)
    words, freqs = zip(*word_freq)
    plt.bar(range(len(words)), freqs)
    plt.xticks(range(len(words)), words, rotation=45, ha='right')
    plt.title(f'Top Words in {label.upper()} Category')

plt.tight_layout()

# Print some vocabulary statistics
print("\nVocabulary Statistics by Label:")
for label in train_df['label'].unique():
    subset = train_df[train_df['label'] == label]
    word_freq = analyze_vocabulary(subset['text'])
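    # Count distinct alphanumeric tokens only, so punctuation is not reported as "words"
    unique_words = len({word.lower() for text in subset['text'] for word in word_tokenize(text) if word.isalnum()})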
    
    print(f"\n{label.upper()}:")
    print(f"Unique words: {unique_words}")
    print("Top 10 most frequent words:")
    for word, freq in word_freq[:10]:
        print(f"  {word}: {freq}")

## Vocabulary Analysis

This section examines the vocabulary patterns and word usage across different categories in our dataset. The analysis includes:

1. **Word Frequency Distribution**: Bar plots showing the most frequent content words (excluding stopwords) for each category. This helps identify:
   - Key terms associated with each label
   - Common vocabulary patterns
   - Potential discriminative features for classification

2. **Vocabulary Statistics**:
   - Unique word count per category
   - Top frequent words and their occurrence counts
   - Insights into the lexical diversity of each category

The analysis excludes common English stopwords to focus on meaningful content words. Understanding these vocabulary patterns is crucial for:
- Feature engineering in our classification model
- Identifying category-specific terminology
- Understanding the linguistic characteristics of each label
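
Raw frequency lists often share many top words across categories. One way to surface the discriminative terms mentioned above is to rank words by a smoothed log-ratio of their relative frequencies in the two classes. The sketch below is an illustration, not part of the original analysis: the function name `log_ratio_terms`, the add-one smoothing, and the simple regex tokenizer (used instead of NLTK to keep the example self-contained, and without stopword removal) are all assumptions.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase alphanumeric tokens (simplified stand-in for word_tokenize)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def log_ratio_terms(texts_a, texts_b, top_n=3):
    """Rank words by the log-ratio of smoothed relative frequency in A versus B.

    Positive scores mean the word is more typical of A.
    """
    counts_a = Counter(w for t in texts_a for w in tokenize(t))
    counts_b = Counter(w for t in texts_b for w in tokenize(t))
    vocab = set(counts_a) | set(counts_b)
    # Add-one smoothing so words unseen in one class do not cause division by zero
    total_a = sum(counts_a.values()) + len(vocab)
    total_b = sum(counts_b.values()) + len(vocab)
    scores = {
        w: math.log((counts_a[w] + 1) / total_a) - math.log((counts_b[w] + 1) / total_b)
        for w in vocab
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# Hypothetical toy texts; in the notebook these would be the per-label subsets
phish = ["verify your account now", "account suspended verify now"]
benign = ["meeting notes attached", "see you at the meeting"]
for word, score in log_ratio_terms(phish, benign):
    print(f"{word}: {score:.2f}")
```

Swapping the two arguments ranks the words most typical of the benign class instead.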