# Phishing Email Analysis
## Assignment: Data Import, Exploration, Visualization, and Text Analysis

**Dataset Source:** [Phishing Email Dataset on Kaggle](https://www.kaggle.com/datasets/naserabdullahalam/phishing-email-dataset/data)

**File Used:** CEAS_08.csv

**Citation:** Alam, Naser Abdullah. (2024). Phishing Email Dataset. Kaggle. https://www.kaggle.com/datasets/naserabdullahalam/phishing-email-dataset

### Learning Objectives:
1. Import and explore CSV data using Pandas
2. Create meaningful visualizations using Seaborn
3. Generate word clouds to visualize common words in phishing vs legitimate emails

## 1. Import Required Libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
import warnings

# Set visualization style
sns.set_style('whitegrid')
warnings.filterwarnings('ignore')

# Display settings for better output
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)

## 2. Load the Dataset

In [None]:
# Load the CEAS_08.csv dataset with error handling
# Using on_bad_lines='skip' to handle malformed rows
df = pd.read_csv('archive/CEAS_08.csv', 
                 on_bad_lines='skip',
                 encoding='utf-8',
                 engine='python')
print("Dataset loaded successfully!")
print(f"Shape: {df.shape[0]:,} rows and {df.shape[1]} columns")
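
`on_bad_lines='skip'` drops malformed rows silently, so the printed shape does not say how many rows were lost. One way to count them, assuming pandas 1.4 or later with the python engine, is to pass a callable instead. The sketch below runs on a small in-memory CSV rather than CEAS_08.csv; `sample_csv`, `skipped_rows`, and `record_bad_line` are illustrative names, not part of the assignment.

```python
import io

import pandas as pd

# Tiny stand-in for the real file: the second data row has too many fields.
sample_csv = io.StringIO(
    "sender,subject,label\n"
    "a@example.com,Hello,0\n"
    "b@example.com,Win,now,extra,1\n"
    "c@example.com,Invoice,1\n"
)

skipped_rows = []  # collects every row the parser rejects


def record_bad_line(bad_line):
    """Remember the malformed row, then return None so pandas skips it."""
    skipped_rows.append(bad_line)
    return None


sample_df = pd.read_csv(sample_csv, on_bad_lines=record_bad_line, engine="python")
print(f"Rows kept: {len(sample_df)}, rows skipped: {len(skipped_rows)}")
```

The same callable could be passed in the `pd.read_csv` call for the real file to report how much data the skip policy removes.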

## 3. Data Exploration

In [None]:
# Display first 5 rows
df.head()

In [None]:
# Get dataset information
df.info()

In [None]:
# Display column names
print("Column names:")
print(df.columns.tolist())

In [None]:
# Check data types
df.dtypes

In [None]:
# Check for missing values
print("Missing values per column:")
print(df.isnull().sum())
print(f"\nTotal missing values: {df.isnull().sum().sum()}")
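
Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch on a made-up frame (the `sample` data and the `missing_summary` name are assumptions for illustration) that puts counts and percentages side by side.

```python
import pandas as pd

# Made-up frame with a few missing subjects and bodies.
sample = pd.DataFrame({
    "subject": ["Hello", None, "Invoice", "Win now"],
    "body": ["text", "text", None, None],
    "label": [0, 1, 1, 0],
})

# isnull().mean() gives the fraction of missing entries per column.
missing_summary = pd.DataFrame({
    "missing_count": sample.isnull().sum(),
    "missing_pct": sample.isnull().mean() * 100,
}).sort_values("missing_pct", ascending=False)

print(missing_summary)
```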

In [None]:
# Check the distribution of email labels
print("Email Label Distribution:")
print(df['label'].value_counts())
print("\nPercentage Distribution:")
print(df['label'].value_counts(normalize=True) * 100)

In [None]:
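# Statistical summary for text length
# Fill missing text with '' first so a missing value counts as length 0
# (converting NaN to a string would give 'nan', which has length 3)
df['body_length'] = df['body'].fillna('').astype(str).str.len()
df['subject_length'] = df['subject'].fillna('').astype(str).str.len()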

print("Body Length Statistics:")
print(df.groupby('label')['body_length'].describe())
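
A note on missing values when measuring text length: the string form of NaN is `'nan'`, so a missing body measured through string conversion can be reported as 3 characters long. The sketch below, on a made-up `bodies` Series, shows the effect and one way to avoid it.

```python
import numpy as np
import pandas as pd

# Made-up bodies: one real message, one missing value, one empty string.
bodies = pd.Series(["Click here", np.nan, ""])

# The string form of a missing value is the three-character text 'nan'.
nan_as_text = str(np.nan)
print(repr(nan_as_text), len(nan_as_text))

# Replacing missing values with '' before measuring gives length 0 instead.
safe_lengths = bodies.fillna("").str.len()
print(safe_lengths.tolist())
```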

## 4. Data Visualizations with Seaborn

In [None]:
# Visualization 1: Distribution of Email Types (Phishing vs Legitimate)
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='label', palette='viridis')
plt.title('Distribution of Email Types (Phishing vs Legitimate)', fontsize=16, fontweight='bold')
plt.xlabel('Email Type', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=0)

# Add count labels on bars
ax = plt.gca()
for container in ax.containers:
    ax.bar_label(container, fmt='%d')

plt.tight_layout()
plt.show()
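
The count plot labels its bars `0` and `1`, which a reader has to decode. A readable label column can be derived first; the sketch below assumes 1 marks phishing and 0 marks legitimate, matching the filters this notebook uses when it builds the word clouds. `LABEL_NAMES` and `label_name` are illustrative names.

```python
import pandas as pd

# Assumption: 1 = phishing, 0 = legitimate, as used elsewhere in this notebook.
LABEL_NAMES = {0: "Legitimate", 1: "Phishing"}

# Made-up labels standing in for df['label'].
sample = pd.DataFrame({"label": [1, 0, 1, 1]})
sample["label_name"] = sample["label"].map(LABEL_NAMES)

counts = sample["label_name"].value_counts()
print(counts)
```

Passing `x='label_name'` to `sns.countplot` would then put the class names on the axis.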

In [None]:
# Visualization 2: Email Body Length Distribution
plt.figure(figsize=(12, 6))
sns.histplot(data=df, x='body_length', hue='label', bins=50, kde=True, palette='Set2')
plt.title('Distribution of Email Body Length by Type', fontsize=16, fontweight='bold')
plt.xlabel('Body Length (characters)', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.xlim(0, 5000)  # Limit x-axis for better visualization
plt.tight_layout()
plt.show()
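
The 5000-character cutoff is hard-coded. An alternative is to derive the limit from the data, for example from a high percentile, and to report how much of the data stays visible. This is a sketch on made-up lengths; the 95th percentile is an arbitrary choice, not a value taken from the dataset.

```python
import pandas as pd

# Made-up body lengths with one extreme outlier.
body_lengths = pd.Series([120, 300, 450, 800, 1500, 2200, 3100, 4800, 9000, 250000])

# Use the 95th percentile as the axis limit instead of a fixed number.
upper_limit = body_lengths.quantile(0.95)

# Fraction of emails that fall inside the plotted range.
share_shown = (body_lengths <= upper_limit).mean()

print(f"Axis limit: {upper_limit:,.0f} characters")
print(f"Share of emails shown: {share_shown:.0%}")
```

The resulting value could replace the constant, as in `plt.xlim(0, upper_limit)`.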

In [None]:
# Visualization 3: Subject Length Comparison
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x='label', y='subject_length', palette='coolwarm')
plt.title('Email Subject Length Comparison', fontsize=16, fontweight='bold')
plt.xlabel('Email Type', fontsize=12)
plt.ylabel('Subject Length (characters)', fontsize=12)
plt.tight_layout()
plt.show()

In [None]:
# Visualization 4: Violin plot for body length distribution
plt.figure(figsize=(10, 6))
sns.violinplot(data=df, x='label', y='body_length', palette='muted')
plt.title('Email Body Length Distribution (Violin Plot)', fontsize=16, fontweight='bold')
plt.xlabel('Email Type', fontsize=12)
plt.ylabel('Body Length (characters)', fontsize=12)
plt.ylim(0, 5000)
plt.tight_layout()
plt.show()

## 5. Word Cloud Generation

In [None]:
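# Prepare text data for word clouds
# Separate phishing (label 1) and legitimate (label 0) emails
# Drop missing bodies so the literal string 'nan' does not appear in the clouds
spam_emails = df.loc[df['label'] == 1, 'body'].dropna().astype(str)
legitimate_emails = df.loc[df['label'] == 0, 'body'].dropna().astype(str)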

# Combine all text for each category
spam_text = ' '.join(spam_emails)
legitimate_text = ' '.join(legitimate_emails)

print(f"Phishing emails text length: {len(spam_text):,} characters")
print(f"Legitimate emails text length: {len(legitimate_text):,} characters")

In [None]:
# Word Cloud 1: Phishing Emails
plt.figure(figsize=(15, 8))

wordcloud_spam = WordCloud(
    width=1600,
    height=800,
    background_color='white',
    colormap='Reds',
    max_words=100,
    relative_scaling=0.5,
    min_font_size=10
).generate(spam_text)

plt.imshow(wordcloud_spam, interpolation='bilinear')
plt.axis('off')
plt.title('Most Common Words in Phishing Emails', fontsize=20, fontweight='bold', pad=20)
plt.tight_layout(pad=0)
plt.show()

In [None]:
# Word Cloud 2: Legitimate Emails
plt.figure(figsize=(15, 8))

wordcloud_legitimate = WordCloud(
    width=1600,
    height=800,
    background_color='white',
    colormap='Blues',
    max_words=100,
    relative_scaling=0.5,
    min_font_size=10
).generate(legitimate_text)

plt.imshow(wordcloud_legitimate, interpolation='bilinear')
plt.axis('off')
plt.title('Most Common Words in Legitimate Emails', fontsize=20, fontweight='bold', pad=20)
plt.tight_layout(pad=0)
plt.show()

In [None]:
# Side-by-side comparison of word clouds
fig, axes = plt.subplots(1, 2, figsize=(20, 8))

# Phishing emails word cloud
axes[0].imshow(wordcloud_spam, interpolation='bilinear')
axes[0].axis('off')
axes[0].set_title('Phishing Emails', fontsize=16, fontweight='bold')

# Legitimate emails word cloud
axes[1].imshow(wordcloud_legitimate, interpolation='bilinear')
axes[1].axis('off')
axes[1].set_title('Legitimate Emails', fontsize=16, fontweight='bold')

plt.suptitle('Word Cloud Comparison: Phishing vs Legitimate Emails', fontsize=20, fontweight='bold', y=0.98)
plt.tight_layout()
plt.show()
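
Word clouds give a visual impression but no numbers. A plain frequency count can back up the comparison. The following is a minimal sketch using only the standard library; `top_words`, the regular expression, the minimum word length, and the two sample strings are all illustrative assumptions.

```python
import re
from collections import Counter


def top_words(text, n=3, min_length=4):
    """Return the n most frequent lowercase words of at least min_length letters."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if len(w) >= min_length).most_common(n)


# Made-up stand-ins for spam_text and legitimate_text.
phishing_sample = "verify your account now verify your password account verify"
legitimate_sample = "meeting agenda attached please review the agenda before the meeting"

print("Phishing:", top_words(phishing_sample))
print("Legitimate:", top_words(legitimate_sample))
```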

## 6. Key Insights and Summary

In [None]:
# Generate summary statistics
print("=" * 60)
print("PHISHING EMAIL ANALYSIS SUMMARY")
print("=" * 60)
print(f"\nTotal emails analyzed: {len(df):,}")
print(f"Phishing emails: {len(df[df['label'] == 1]):,} ({len(df[df['label'] == 1])/len(df)*100:.2f}%)")
print(f"Legitimate emails: {len(df[df['label'] == 0]):,} ({len(df[df['label'] == 0])/len(df)*100:.2f}%)")

print("\n" + "=" * 60)
print("AVERAGE EMAIL CHARACTERISTICS")
print("=" * 60)

print("\nPhishing Emails:")
print(f"  - Average body length: {df[df['label'] == 1]['body_length'].mean():.0f} characters")
print(f"  - Average subject length: {df[df['label'] == 1]['subject_length'].mean():.0f} characters")

print("\nLegitimate Emails:")
print(f"  - Average body length: {df[df['label'] == 0]['body_length'].mean():.0f} characters")
print(f"  - Average subject length: {df[df['label'] == 0]['subject_length'].mean():.0f} characters")

print("\n" + "=" * 60)

## Conclusions

This analysis successfully demonstrated:

1. **Data Import & Exploration**: Loaded and explored the CEAS_08.csv phishing email dataset using Pandas
2. **Visualizations**: Created multiple Seaborn visualizations showing:
   - Distribution of phishing vs legitimate emails
   - Email body and subject length patterns
   - Comparative analysis between email types
3. **Word Clouds**: Generated word clouds highlighting the most common words in:
   - Phishing emails (showing typical phishing keywords)
   - Legitimate emails (showing normal communication patterns)

The length distributions and word clouds allow a direct comparison of language patterns and structural characteristics between phishing and legitimate emails. Differences in body length, subject length, and frequent terms are candidate features for a detection system.