### Goal
The goal of this notebook is to build a binary text classifier that separates spam from non-spam (ham) emails, using a synthetically generated dataset.

### Dataset
Instead of using a pre-existing dataset, we generate a synthetic dataset of ham and spam emails from message templates. Each email is saved as an individual text file in the `ham` or `spam` directory. Generating the data ourselves gives us control over its size and class balance (here, 100 emails per class).

### Generate Dataset
Before starting, we generate the spam and ham datasets.

In [7]:
import os
import random

# Define the folder structure
base_path = "../data"
ham_path = os.path.join(base_path, "ham")
spam_path = os.path.join(base_path, "spam")

# Create directories if they don't exist
os.makedirs(ham_path, exist_ok=True)
os.makedirs(spam_path, exist_ok=True)

# Define templates for spam and ham messages
spam_templates = [
    "Congratulations! You've won ${} in cash rewards. Claim now!",
    "Limited offer: Get {}% discount on all items. Expires soon.",
    "Alert: Your account {} has been compromised. Secure it immediately.",
    "Win a free {} by signing up today. Don't miss out!",
    "Earn ${}/week working remotely. Apply now!",
    "Exclusive deal: Get {}% cashback on your next purchase.",
    "You are the lucky winner of a new {}. Redeem your prize!",
    "Urgent: Your subscription will expire in {} days. Renew now!",
    "Flash sale: Up to {}% off on top brands. Shop now!",
    "Act fast: {} bonus points added to your account. Redeem today!",
]

ham_templates = [
    "Meeting rescheduled to {} PM tomorrow. Let me know if you can join.",
    "Reminder: Submit your report by {} PM today.",
    "Don't forget about the team lunch at {} PM on Friday.",
    "Can we discuss the project updates at {} PM next Monday?",
    "Please review the attached document and share your feedback by {} PM.",
    "Here is the link to the repository. Please update it before {} PM.",
    "Looking forward to our call at {} PM. Let me know if it works for you.",
    "Can we finalize the budget discussion at {} PM on Thursday?",
    "Reminder: Complete the assigned tasks before {} PM.",
    "Join us for the workshop at {} PM this weekend. Let me know if you're interested.",
]

# Fill the templates in round-robin order with a random integer.
# Uniqueness is not enforced, so the same message can appear more than once.
def generate_messages(templates, num_messages, value_range):
    messages = []
    for i in range(num_messages):
        template = templates[i % len(templates)]
        filled_message = template.format(random.randint(*value_range))
        messages.append(filled_message)
    return messages

# Generate 100 messages each for spam and ham (duplicates are possible)
spam_emails = generate_messages(spam_templates, 100, (100, 5000))
ham_emails = generate_messages(ham_templates, 100, (1, 12))

# Save ham emails as individual files
for i, email in enumerate(ham_emails, 1):
    with open(os.path.join(ham_path, f"ham{i}.txt"), "w", encoding="utf-8") as file:
        file.write(email)

# Save spam emails as individual files
for i, email in enumerate(spam_emails, 1):
    with open(os.path.join(spam_path, f"spam{i}.txt"), "w", encoding="utf-8") as file:
        file.write(email)

print("Data generation completed. Ham and spam emails saved in respective folders.")


Data generation completed. Ham and spam emails saved in respective folders.
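
The generator above cycles through the templates and fills each one with a random integer, so repeated messages are possible, especially for ham, where only 12 values are available per template. The following sketch is not part of the original notebook. It shows one way to make the generation reproducible and to count duplicates. The `seed` parameter, the `count_duplicates` helper, and the shortened template list are assumptions made for illustration.

```python
import random
from collections import Counter


def generate_messages(templates, num_messages, value_range, seed=None):
    """Fill templates round-robin with random integers from value_range.

    `seed` is a hypothetical addition: a fixed seed makes the output repeatable.
    """
    rng = random.Random(seed)  # local generator, leaves the global state untouched
    low, high = value_range
    return [
        templates[i % len(templates)].format(rng.randint(low, high))
        for i in range(num_messages)
    ]


def count_duplicates(messages):
    """Return how many messages repeat an earlier message."""
    counts = Counter(messages)
    return sum(n - 1 for n in counts.values())


# Shortened, illustrative template list (two of the ham templates above)
demo_templates = [
    "Reminder: Submit your report by {} PM today.",
    "Reminder: Complete the assigned tasks before {} PM.",
]

demo = generate_messages(demo_templates, 100, (1, 12), seed=42)
# Only 2 templates x 12 values = 24 distinct messages exist, so at least
# 76 of the 100 generated messages must be duplicates.
print(len(set(demo)), "distinct;", count_duplicates(demo), "duplicates")
```

Duplicates matter later on: if the same message lands in both the training and the test split, evaluation scores can look better than they should.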


## 2. Import Libraries and Load Data

In [11]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score,
    confusion_matrix, classification_report, roc_curve, auc,
)

# Import models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import re
import nltk

nltk.data.path.append('../nltk_data')
nltk.download('punkt', download_dir='../nltk_data')
nltk.download('wordnet', download_dir='../nltk_data')
nltk.download('stopwords', download_dir='../nltk_data')

### Load Data

import os

def load_spamassassin_data(ham_dir, spam_dir):
    """Read one email per file from each directory; label ham 0 and spam 1.

    The function name is kept as-is, but the files read here are the
    synthetic emails generated above, not the SpamAssassin corpus.
    """
    data = []
    labels = []

    # Ham first (label 0), then spam (label 1). os.listdir returns files in
    # arbitrary order, so rows within a class are not sorted.
    for directory, label in ((ham_dir, 0), (spam_dir, 1)):
        for filename in os.listdir(directory):
            # The files were written as UTF-8, so read them back the same way
            with open(os.path.join(directory, filename), 'r', encoding='utf-8') as file:
                data.append(file.read())
            labels.append(label)

    return pd.DataFrame({'text': data, 'label': labels})

# Load the dataset
df = load_spamassassin_data('../data/ham', '../data/spam')

# Display the first few rows
df.head()

[nltk_data] Downloading package punkt to ../nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to ../nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to ../nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


|     | text | label |
|----:|------|------:|
| 0 | Join us for the workshop at 2 PM this weekend.... | 0 |
| 1 | Reminder: Complete the assigned tasks before 5... | 0 |
| 2 | Here is the link to the repository. Please upd... | 0 |
| 3 | Reminder: Submit your report by 10 PM today. | 0 |
| 4 | Don't forget about the team lunch at 1 PM on F... | 0 |
| ... | ... | ... |
| 96 | Don't forget about the team lunch at 7 PM on F... | 0 |
| 97 | Can we discuss the project updates at 7 PM nex... | 0 |
| 98 | Join us for the workshop at 2 PM this weekend.... | 0 |
| 99 | Here is the link to the repository. Please upd... | 0 |
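
The loader appends all ham rows before all spam rows, so the frame is ordered by class. A split for training and evaluation should therefore shuffle the rows and keep both classes represented on each side. Here is a minimal sketch of a stratified split with `train_test_split` (imported above). The stand-in lists, the 80/20 ratio, and `random_state=42` are assumptions for illustration, not choices taken from the notebook.

```python
from sklearn.model_selection import train_test_split

# Stand-in data mirroring the loader's layout: 100 ham rows, then 100 spam rows
texts = [f"ham message {i}" for i in range(100)] + [f"spam message {i}" for i in range(100)]
labels = [0] * 100 + [1] * 100

# stratify=labels keeps the ham/spam ratio identical in both splits;
# shuffling is on by default, so the class-ordered input is not a problem.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

print(len(X_train), len(X_test))   # 160 40
print(sum(y_train), sum(y_test))   # 80 20
```

With the real data, the same call would take `df['text']` and `df['label']` in place of the stand-in lists.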
