<a href="https://colab.research.google.com/github/Ayush54555/GNCIPL-Internship-Major-Project-Generate-Email-Text-for-Spam-Classification/blob/main/Data_Augmentation_%26_Model_Preparation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Augmentation & Model Preparation


In this phase, our focus is on **enhancing the dataset with synthetic emails** and preparing it for classification using various machine learning models. By generating high-quality synthetic spam and ham emails, we aim to address class imbalance and improve the robustness of our spam detection system.

---

## Objectives

1. **Synthetic Data Generation**
   - Fine-tune a generative AI model (e.g., GPT-2 or T5) on the real email dataset.
   - Generate new synthetic emails for both spam and ham categories.
   - Preprocess and clean synthetic emails to match the format of real emails.
   - Combine real and synthetic emails to create the final **augmented dataset**.

2. **Exploratory Data Analysis & Feature Engineering**
   - Perform a quick overview and a detailed EDA to understand data characteristics.
   - Apply **TF-IDF vectorization** to convert text data into numerical features suitable for machine learning.

3. **Model Selection (Initial Research)**
   - Based on EDA and dataset characteristics, shortlist potential supervised learning algorithms:
     - Naive Bayes, Logistic Regression, Random Forest, XGBoost
     - Simple neural networks (MLP) for deep learning approaches
   - Consider unsupervised learning and anomaly detection for detecting rare or new spam patterns.

---

By the end of this phase, we will have a **cleaned, augmented, and vectorized dataset** ready for training multiple models, along with a clear plan for model selection and evaluation.


# **1. Load dataset**

In [1]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

import pandas as pd

file_path = "/content/drive/MyDrive/CleanedDataset.csv"
df = pd.read_csv(file_path, low_memory=False)




Mounted at /content/drive


# **2. Inspect dataset**

## Shape Of The Dataset

In [2]:

print(df.shape)



(83448, 2)


The dataset has 83448 rows and 2 columns.

## First 5 rows

In [3]:
print(df.head())

   label                                       cleaned_text
0      1  ounc feather bowl hummingbird opec moment alab...
1      1  wulvob medirc onlin qnb ikud viagra escapenumb...
2      0  comput connect cnn com wednesday escapenumb es...
3      1  univers degre obtain prosper futur money earn ...
4      0  thank answer guy know check rsync manual escap...


## Info

In [4]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 83448 entries, 0 to 83447
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   label         83448 non-null  int64 
 1   cleaned_text  83393 non-null  object
dtypes: int64(1), object(1)
memory usage: 1.3+ MB
None


# **3. Separate spam and ham**


In [5]:
spam_df = df[df['label'] == 1]   #  1 = spam
ham_df = df[df['label'] == 0]    #  0 = ham

print("Spam samples:", len(spam_df))
print("Ham samples:", len(ham_df))

Spam samples: 43910
Ham samples: 39538
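
The counts above can be turned into class shares with a little arithmetic. The sketch below hard-codes the two counts printed by the cell; the variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the output above
spam_count = 43910
ham_count = 39538
total = spam_count + ham_count

spam_share = spam_count / total
ham_share = ham_count / total
imbalance_ratio = spam_count / ham_count  # majority class / minority class

print(f"Total emails:      {total}")
print(f"Spam share:        {spam_share:.1%}")
print(f"Ham share:         {ham_share:.1%}")
print(f"Spam-to-ham ratio: {imbalance_ratio:.2f}")
```

The shares come out to roughly 53% spam and 47% ham, which is worth keeping in mind when deciding how many synthetic emails to generate per class.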


##  Save as text files for training

In [6]:

spam_file = "/content/drive/MyDrive/spam.txt"
ham_file = "/content/drive/MyDrive/ham.txt"

spam_df['cleaned_text'].to_csv(spam_file, index=False, header=False)
ham_df['cleaned_text'].to_csv(ham_file, index=False, header=False)

print(" Files saved:")
print(f"Spam -> {spam_file}, {len(spam_df)} samples")
print(f"Ham  -> {ham_file}, {len(ham_df)} samples")

 Files saved:
Spam -> /content/drive/MyDrive/spam.txt, 43910 samples
Ham  -> /content/drive/MyDrive/ham.txt, 39538 samples


# **3. GPT-2 fine-tuning**

## Install dependencies

In [7]:
!pip install transformers datasets




## Prepare Subset for Quick Training

In [8]:
import pandas as pd

# Load spam + ham (already saved earlier)
spam_file = "/content/drive/MyDrive/spam.txt"
ham_file = "/content/drive/MyDrive/ham.txt"

# Read into pandas
spam = pd.read_csv(spam_file, names=["text"])
ham = pd.read_csv(ham_file, names=["text"])

# Sample smaller subset for training (2K spam + 2K ham)
# .copy() keeps the prefixing below from modifying `spam`/`ham` when a class has 2000 rows or fewer
spam_sample = (spam.sample(n=2000, random_state=42) if len(spam) > 2000 else spam).copy()
ham_sample = (ham.sample(n=2000, random_state=42) if len(ham) > 2000 else ham).copy()

# Prefix each email with its label token <SPAM>/<HAM>
spam_sample["text"] = "<SPAM> " + spam_sample["text"].astype(str)
ham_sample["text"] = "<HAM> " + ham_sample["text"].astype(str)

subset = pd.concat([spam_sample, ham_sample])
subset.to_csv("subset_emails.txt", index=False, header=False)

print(" Training subset ready with", len(subset), "emails")


 Training subset ready with 4000 emails


## Load tokenizer and model

In [9]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments, DataCollatorForLanguageModeling
from datasets import load_dataset

# Load GPT-2 tokenizer + model
model_name = "gpt2-medium"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = GPT2LMHeadModel.from_pretrained(model_name)






The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



## Load dataset for LM

In [10]:

dataset = load_dataset("text", data_files={"train": "subset_emails.txt"})

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=128)

tokenized = dataset.map(tokenize_function, batched=True, remove_columns=["text"])
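
To make the effect of `truncation=True`, `padding="max_length"` and `max_length` concrete, here is a toy sketch that works on plain lists of integers. `pad_or_truncate` is a hypothetical helper written for illustration, not the Hugging Face implementation, and the token ids are made up. It pads on the right, which is the side the GPT-2 tokenizer uses by default.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]        # truncation: drop tokens past max_length
    mask = [1] * len(ids)               # 1 marks a real token
    n_pad = max_length - len(ids)       # padding: fill the remainder on the right
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

short_ids, short_mask = pad_or_truncate([11, 22, 33], max_length=5, pad_id=0)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5, pad_id=0)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

With `max_length=128`, as in the cell above, longer emails lose everything after the 128th token, while shorter ones are filled with pad tokens that the attention mask marks as 0.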


## Data collator

In [11]:

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
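
With `mlm=False` the collator prepares batches for causal language modeling: the labels are a copy of the input ids, with pad positions replaced by `-100` so that the loss skips them. The sketch below imitates that step on a plain list. `causal_lm_labels` is a hypothetical helper for illustration only; 50256 is GPT-2's end-of-text id, which this notebook also uses as the pad token.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss ignores

def causal_lm_labels(input_ids, pad_id):
    """Copy the inputs as labels, masking pad positions so they add no loss."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in input_ids]

PAD_ID = 50256  # GPT-2 end-of-text token, reused as pad token in this notebook
example_ids = [27, 4303, 2390, 29, PAD_ID, PAD_ID]  # made-up ids plus padding
print(causal_lm_labels(example_ids, PAD_ID))
```

Because the pad token is the same as the end-of-text token here, any genuine end-of-text tokens in the input would be masked as well. The shift by one position (predicting token *t+1* from tokens up to *t*) happens inside the model, not in the collator.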


## Training

In [None]:


training_args = TrainingArguments(
    output_dir="./gen_model",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=4,
    save_steps=500,
    logging_steps=100,
    fp16=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    data_collator=data_collator,
)

trainer.train()



`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


Step,Training Loss
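
The training arguments above imply a fixed number of optimizer steps. The arithmetic below is a sketch under two assumptions that the notebook does not state explicitly: training runs on a single device, and gradient accumulation is left at its default of 1.

```python
import math

num_examples = 4000      # size of the training subset
batch_size = 4           # per_device_train_batch_size
num_epochs = 1           # num_train_epochs
save_steps = 500
logging_steps = 100

# Assumes one device and no gradient accumulation
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
step_checkpoints = total_steps // save_steps
log_events = total_steps // logging_steps

print(f"Optimizer steps:        {total_steps}")
print(f"Step-based checkpoints: {step_checkpoints}")
print(f"Loss log entries:       {log_events}")
```

Under these assumptions the run takes 1,000 steps, writes checkpoints at steps 500 and 1,000, and logs the training loss ten times.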


# **4. Generate synthetic spam/ham**

In [None]:
# Generate synthetic spam & ham emails after fine-tuning

def generate_emails(prompt, n=5, max_length=80):
    inputs = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    outputs = model.generate(
        inputs,
        max_length=max_length,
        num_return_sequences=n,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

# Generate synthetic samples
synthetic_spam = generate_emails("<SPAM>", n=50)
synthetic_ham  = generate_emails("<HAM>", n=50)

print("Example synthetic spam:\n", synthetic_spam[:3])
print("\nExample synthetic ham:\n", synthetic_ham[:3])

# Build augmented dataset
import pandas as pd

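# Rebuild the real dataset and label it, so real rows carry the same
# 'spam'/'ham' labels as the synthetic rows instead of NaN
real_df = pd.concat(
    [spam.assign(label="spam"), ham.assign(label="ham")],
    ignore_index=True,
)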

# Synthetic dataframe
synthetic_df = pd.DataFrame({
    "text": synthetic_spam + synthetic_ham,
    "label": ["spam"]*len(synthetic_spam) + ["ham"]*len(synthetic_ham)
})

# Final augmented dataset
augmented_df = pd.concat([real_df, synthetic_df], ignore_index=True)
augmented_df.to_csv("augmented_emails.csv", index=False)

print(" Augmented dataset saved with", len(augmented_df), "samples")
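
The `temperature=0.7` and `top_p=0.9` arguments control how the next token is sampled. The toy sketch below shows the idea on a four-word vocabulary: temperature rescales the scores before the softmax, and top-p (nucleus) sampling keeps only the most likely tokens whose cumulative probability reaches the threshold. The function names and scores are hypothetical, and this is a simplified illustration rather than the library's implementation.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; temperature < 1 sharpens the distribution."""
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability reaches top_p."""
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}  # renormalize to sum to 1

def sample_next_token(logits, temperature=0.7, top_p=0.9, rng=random):
    """Draw one token from the temperature-scaled, top-p-filtered distribution."""
    probs = top_p_filter(softmax_with_temperature(logits, temperature), top_p)
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token scores after the "<SPAM>" prompt
toy_logits = {"free": 2.0, "offer": 1.5, "click": 1.0, "meeting": -1.0}
nucleus = top_p_filter(softmax_with_temperature(toy_logits, 0.7), 0.9)
print(nucleus)
print(sample_next_token(toy_logits, rng=random.Random(0)))
```

In this toy case the unlikely word "meeting" falls outside the nucleus and can never be sampled, while the three likelier words remain in play. Lowering the temperature concentrates probability on the top-scoring word.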


# **5. Preprocessing**

In [None]:
import re
import pandas as pd

# -------------------------------
# Function to clean text
# -------------------------------
def clean_text(text):
    if not isinstance(text, str):
        return ""
    text = re.sub(r"<SPAM>|<HAM>", "", text)       # remove special tokens
    text = re.sub(r"http\S+|www\S+", "", text)     # remove URLs
    text = re.sub(r"\S+@\S+", "", text)            # remove email addresses
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)   # keep letters + numbers
    text = re.sub(r"\s+", " ", text).strip()      # normalize spaces
    return text.lower()

# -------------------------------
# Load augmented dataset
# -------------------------------
try:
    augmented_df = pd.read_csv("augmented_emails.csv")
except FileNotFoundError as err:
    # Raise instead of calling exit(), which can stop the notebook kernel
    raise FileNotFoundError("'augmented_emails.csv' not found. Please run synthetic generation first.") from err

# -------------------------------
# Apply cleaning to text
# -------------------------------
# clean_text returns "" for non-string values, so missing text is not turned into the literal string "nan"
augmented_df['cleaned_text'] = augmented_df['text'].apply(clean_text)

# Keep only cleaned text and label
final_cleaned_augmented_df = augmented_df[['cleaned_text', 'label']]

# -------------------------------
# Save final cleaned augmented dataset
# -------------------------------
final_cleaned_augmented_df.to_csv("augmented_emails_cleaned.csv", index=False)
print(" Final cleaned augmented dataset saved as 'augmented_emails_cleaned.csv' with", len(final_cleaned_augmented_df), "samples")
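
To see what each cleaning step does, the snippet below repeats the `clean_text` function from this cell so that it runs on its own, then applies it to a made-up email. The sample string is hypothetical.

```python
import re

def clean_text(text):
    if not isinstance(text, str):
        return ""
    text = re.sub(r"<SPAM>|<HAM>", "", text)       # remove label tokens
    text = re.sub(r"http\S+|www\S+", "", text)     # remove URLs
    text = re.sub(r"\S+@\S+", "", text)            # remove email addresses
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)    # keep letters + numbers
    text = re.sub(r"\s+", " ", text).strip()       # normalize spaces
    return text.lower()

sample = "<SPAM> Visit http://spam.example.com NOW! Mail me@x.com for 50% off"
cleaned = clean_text(sample)
print(cleaned)
```

The label token, the URL, the email address and the punctuation are all removed, and the remaining words are lowercased. Non-string input such as `None` or `NaN` yields an empty string.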


## Load cleaned augmented dataset

In [None]:

import pandas as pd

# Load the final cleaned augmented dataset
try:
    df = pd.read_csv("augmented_emails_cleaned.csv")
except FileNotFoundError as err:
    # Raise instead of calling exit(), which can stop the notebook kernel
    raise FileNotFoundError("'augmented_emails_cleaned.csv' not found. Please run the previous steps first.") from err



from google.colab import files

# Download the cleaned augmented dataset
files.download("augmented_emails_cleaned.csv")


# Quick overview

In [None]:

print(" Dataset loaded successfully")
print("Shape:", df.shape)
print("\nClass distribution:")
print(df['label'].value_counts())


# Display first 5 rows

In [None]:

df.head()

# Label distribution

In [None]:

print("\nLabel distribution:")
print(augmented_df['label'].value_counts())

# Text length analysis

In [None]:

# Fill NaN before casting to str; casting first would turn NaN into the string 'nan'
augmented_df['text_length'] = augmented_df['cleaned_text'].fillna('').astype(str).apply(len)
print("\nText length stats:")
print(augmented_df['text_length'].describe())


# Most common words (quick bag-of-words check)

In [None]:

from collections import Counter

# Fill NaNs before casting to string, then join all emails and split into words
all_words = " ".join(augmented_df['cleaned_text'].fillna('').astype(str)).split()
word_freq = Counter(all_words)
print("\nTop 20 most common words:")
print(word_freq.most_common(20))

# **6. Exploratory Data Analysis**


# Class distribution

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(6,4))
sns.countplot(data=df, x='label', palette='pastel')
plt.title("Spam vs Ham Distribution")
plt.xlabel("Label")
plt.ylabel("Count")
plt.show()


# Email Length Analysis

In [None]:
df['num_words'] = df['cleaned_text'].astype(str).apply(lambda x: len(x.split()))
df['num_chars'] = df['cleaned_text'].astype(str).apply(len)

plt.figure(figsize=(12,5))
sns.histplot(data=df, x='num_words', hue='label', bins=50, kde=True, palette='Set2')
plt.title("Distribution of Email Length (Words) by Label")
plt.show()

plt.figure(figsize=(12,5))
sns.histplot(data=df, x='num_chars', hue='label', bins=50, kde=True, palette='Set1')
plt.title("Distribution of Email Length (Characters) by Label")
plt.show()

# Text length distribution

In [None]:

plt.hist(augmented_df['text_length'], bins=50, color='skyblue', edgecolor='black')
plt.title("Email Text Length Distribution")
plt.xlabel("Length (characters)")
plt.ylabel("Frequency")
plt.show()


# word cloud for spam vs ham

In [None]:

from wordcloud import WordCloud

# Separate spam and ham
spam_text = " ".join(augmented_df[augmented_df['label']=="spam"]['cleaned_text'])
ham_text  = " ".join(augmented_df[augmented_df['label']=="ham"]['cleaned_text'])

# Generate word clouds
spam_wc = WordCloud(width=800, height=400, background_color="white", max_words=200).generate(spam_text)
ham_wc  = WordCloud(width=800, height=400, background_color="white", max_words=200).generate(ham_text)

# Plot side by side
fig, axes = plt.subplots(1, 2, figsize=(15, 7))
axes[0].imshow(spam_wc, interpolation="bilinear")
axes[0].set_title("Spam Word Cloud")
axes[0].axis("off")

axes[1].imshow(ham_wc, interpolation="bilinear")
axes[1].set_title("Ham Word Cloud")
axes[1].axis("off")

plt.show()


# Correlation / Co-occurrence Heatmap

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit a TF-IDF vectorizer for the heatmap (same settings as in the TF-IDF vectorization section)
tfidf = TfidfVectorizer(
    max_features=5000,   # limit vocabulary size (tune if needed)
    ngram_range=(1,2),   # unigrams + bigrams
    stop_words='english' # remove common stopwords
)

# Requires 'df' with a 'cleaned_text' column from the loading step above.
# Fill NaN before casting to str; casting first would turn NaN into the string 'nan'.
X = df['cleaned_text'].fillna('missingtext').astype(str)
X = X.apply(lambda x: x if x.strip() != '' else 'missingtext')
X_tfidf = tfidf.fit_transform(X)

feature_names = tfidf.get_feature_names_out()

# Take the 30 features with the highest mean TF-IDF score, working on the
# sparse matrix so the full dense matrix is never built
mean_scores = np.asarray(X_tfidf.mean(axis=0)).ravel()
top_idx = np.argsort(mean_scores)[-30:]
top_features = feature_names[top_idx]
df_top = pd.DataFrame(X_tfidf[:, top_idx].toarray(), columns=top_features)

plt.figure(figsize=(12,10))
sns.heatmap(df_top.corr(), annot=True, cmap='coolwarm')
plt.title("Correlation Between Top TF-IDF Features")
plt.show()

# TF-IDF vectorization

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer


## 1. Load cleaned augmented dataset

In [None]:



df = pd.read_csv("augmented_emails_cleaned.csv")

# Drop rows with missing labels
df.dropna(subset=['label'], inplace=True)


## 2. Prepare features (X) and labels (y)

In [None]:



# Fill NaN before casting to str; casting first would turn NaN into the string 'nan'
X = df['cleaned_text'].fillna('missingtext').astype(str)
y = df['label']

# Replace empty or whitespace-only strings in text
X = X.apply(lambda x: x if x.strip() != '' else 'missingtext')


## 3. TF-IDF Vectorization

In [None]:
tfidf = TfidfVectorizer(
    max_features=5000,   # limit vocabulary size (tune if needed)
    ngram_range=(1,2),   # unigrams + bigrams
    stop_words='english' # remove common stopwords
)

X_tfidf = tfidf.fit_transform(X)
print(" TF-IDF matrix shape:", X_tfidf.shape)
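
For intuition about the numbers inside the TF-IDF matrix, here is a small pure-Python sketch of the weighting. It follows what I understand to be scikit-learn's default formula (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization per document). It leaves out stop-word removal, bigrams and the `max_features` cap, and the two example documents are made up.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, plus the IDF table."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors, idf

toy_docs = ["free money now", "meeting notes now"]
vectors, idf = tfidf_vectors(toy_docs)
for doc, vec in zip(toy_docs, vectors):
    print(doc, "->", {t: round(w, 3) for t, w in vec.items()})
```

The word "now" appears in both documents, so it gets the lowest IDF and a smaller weight than "free" or "money", which are specific to one document.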

## 4. Train-test split

In [None]:



X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, y, test_size=0.2, random_state=42, stratify=y
)

print("Training set size:", X_train.shape, " Test set size:", X_test.shape)
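
The `stratify=y` argument keeps the class proportions the same in the training and test sets. The sketch below illustrates the idea with the standard library only. `stratified_split` is a hypothetical helper, not scikit-learn's algorithm (which handles rounding and shuffling differently), and the 60/40 label list is made up.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Split row indices so each class contributes test_size of its rows to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

toy_labels = ["spam"] * 60 + ["ham"] * 40
train_idx, test_idx = stratified_split(toy_labels)

test_spam = sum(toy_labels[i] == "spam" for i in test_idx)
print("Train size:", len(train_idx), "Test size:", len(test_idx))
print("Spam share in test set:", test_spam / len(test_idx))
```

The test set keeps the 60/40 spam-to-ham mix of the full list, which is the property that makes evaluation metrics comparable between the two splits.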


# TF-IDF Feature Insights

In [None]:
import numpy as np

# Average TF-IDF scores per class, computed on the sparse matrix to avoid a large dense copy
labels = df['label'].values
feature_names = tfidf.get_feature_names_out()

spam_tfidf = np.asarray(X_tfidf[labels == 'spam'].mean(axis=0)).ravel()
ham_tfidf = np.asarray(X_tfidf[labels == 'ham'].mean(axis=0)).ravel()

top_spam_idx = np.argsort(spam_tfidf)[-20:]
top_ham_idx = np.argsort(ham_tfidf)[-20:]

plt.figure(figsize=(12,6))
sns.barplot(x=spam_tfidf[top_spam_idx], y=np.array(feature_names)[top_spam_idx], palette='Reds_r')
plt.title("Top 20 TF-IDF Features in Spam Emails")
plt.show()

plt.figure(figsize=(12,6))
sns.barplot(x=ham_tfidf[top_ham_idx], y=np.array(feature_names)[top_ham_idx], palette='Blues_r')
plt.title("Top 20 TF-IDF Features in Ham Emails")
plt.show()


## t-SNE and PCA Visualization

In [None]:

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

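subset_size = 5000
if X_tfidf.shape[0] > subset_size:
    # Sample rows at random: the dataset is stored class by class, so taking
    # the first rows could yield a single class
    rng = np.random.default_rng(42)
    sample_idx = np.sort(rng.choice(X_tfidf.shape[0], size=subset_size, replace=False))
    X_sample = X_tfidf[sample_idx].toarray()
    y_sample = labels[sample_idx]
else:
    X_sample = X_tfidf.toarray()
    y_sample = labels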





##  PCA for 2D visualization

In [None]:
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_sample)

plt.figure(figsize=(10,6))
sns.scatterplot(x=X_pca[:,0], y=X_pca[:,1], hue=y_sample, palette={'spam':'red','ham':'blue'}, alpha=0.6)
plt.title("PCA Projection of Emails (2D)")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.show()


##  t-SNE for 2D visualization

In [None]:

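# Iteration count is left at its default (1000); the argument is named n_iter in
# older scikit-learn releases and max_iter in newer ones
tsne = TSNE(n_components=2, random_state=42, perplexity=40)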
X_tsne = tsne.fit_transform(X_sample)

plt.figure(figsize=(10,6))
sns.scatterplot(x=X_tsne[:,0], y=X_tsne[:,1], hue=y_sample, palette={'spam':'red','ham':'blue'}, alpha=0.6)
plt.title("t-SNE Projection of Emails (2D)")
plt.xlabel("t-SNE Component 1")
plt.ylabel("t-SNE Component 2")
plt.show()


# **Statistical tests**

## Statistical tests to compare real vs. synthetic distributions

In [None]:
from scipy.stats import chi2_contingency
import pandas as pd
from collections import Counter # Import Counter

# Load spam + ham (saved earlier to Google Drive)
spam_file = "/content/drive/MyDrive/spam.txt"
ham_file = "/content/drive/MyDrive/ham.txt"

# Read into pandas with a 'text' column; keep_default_na=False keeps strings such as "null" or "NA" as text
spam = pd.read_csv(spam_file, names=["text"], keep_default_na=False)
ham = pd.read_csv(ham_file, names=["text"], keep_default_na=False)

real_df = pd.concat([spam, ham], ignore_index=True)
# Rename the 'text' column to 'cleaned_text' to match the rest of the notebook
real_df.rename(columns={'text': 'cleaned_text'}, inplace=True)
try:
    synthetic_df = pd.DataFrame({
        "cleaned_text": synthetic_spam + synthetic_ham,
        "label": ["spam"]*len(synthetic_spam) + ["ham"]*len(synthetic_ham)
    })
except NameError:
    print("synthetic_spam or synthetic_ham not found in memory; falling back to augmented_emails.csv.")
    try:
        augmented_df_full = pd.read_csv("augmented_emails.csv")
    except FileNotFoundError as err:
        # Raise instead of calling exit(), which can stop the notebook kernel
        raise FileNotFoundError("augmented_emails.csv not found. Cannot load synthetic data.") from err
    # The synthetic rows were appended after the real rows when the file was
    # written, so they are the rows past len(real_df)
    synthetic_df = augmented_df_full.iloc[len(real_df):].copy()
    if 'text' in synthetic_df.columns and 'cleaned_text' not in synthetic_df.columns:
        synthetic_df.rename(columns={'text': 'cleaned_text'}, inplace=True)
    if 'cleaned_text' not in synthetic_df.columns:
        raise KeyError("Could not find 'text' or 'cleaned_text' in augmented_emails.csv for synthetic data.")

# Fill NaN before casting to str; casting first would turn NaN into the string 'nan'
real_text = real_df['cleaned_text'].fillna('').astype(str)
synthetic_text = synthetic_df['cleaned_text'].fillna('').astype(str)

# Build word frequency tables
def count_words(texts):
    # Skip empty strings before joining
    return Counter(" ".join(text for text in texts if text).split())

real_counts = count_words(real_text)
synthetic_counts = count_words(synthetic_text)

# Compare the union of each corpus's 50 most frequent words
top_real = [w for w, _ in real_counts.most_common(50)]
top_synthetic = [w for w, _ in synthetic_counts.most_common(50)]
words = sorted(set(top_real) | set(top_synthetic))

# Look counts up in the full counters: a word outside one corpus's top 50
# is not necessarily absent from that corpus
freq_df = pd.DataFrame({
    "word": words,
    "real": [real_counts[w] for w in words],
    "synthetic": [synthetic_counts[w] for w in words]
})
contingency_table = freq_df[['real', 'synthetic']].values



# Perform Chi-square test

In [None]:

# chi2_contingency expects observed frequencies in a 2D array
if contingency_table.shape[0] > 1 and contingency_table.shape[1] == 2: # Need at least 2 rows for Chi-squared
    chi2, p, dof, expected = chi2_contingency(contingency_table)
    print("Chi-Square Test Statistic:", chi2)
    print("p-value:", p)

    # Interpretation
    alpha = 0.05
    print("\nInterpretation:")
    if p < alpha:
        print(f"The p-value ({p:.4f}) is less than the significance level ({alpha}). We reject the null hypothesis.")
        print("There is a statistically significant difference between the word frequency distributions of the real and synthetic datasets.")
    else:
        print(f"The p-value ({p:.4f}) is greater than the significance level ({alpha}). We fail to reject the null hypothesis.")
        print("There is no statistically significant difference between the word frequency distributions of the real and synthetic datasets.")

else:
    print("\nSkipping Chi-square test: Not enough data (less than 2 words) to perform the test.")

print("\nWord Frequencies:")
display(freq_df.sort_values(by='real', ascending=False).head())
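
The statistic reported by the test can be reproduced by hand: each expected count is `row total × column total / grand total`, and the statistic sums `(observed − expected)² / expected` over all cells. The sketch below does this for a made-up 3×2 table of word counts (rows are words, columns are real and synthetic). It computes only the statistic and the degrees of freedom, not the p-value, and `chi_square_statistic` is a hypothetical helper.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof

# Hypothetical counts: [real, synthetic] for three words
toy_table = [
    [30, 10],
    [20, 40],
    [50, 50],
]
toy_chi2, toy_dof = chi_square_statistic(toy_table)
print(f"chi2 = {toy_chi2:.4f}, dof = {toy_dof}")
```

The third word is used equally often in both columns and contributes nothing to the statistic; the difference comes entirely from the first two words. Because the test compares proportions, a large gap in corpus size between real and synthetic text does not by itself produce a significant result.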

# Model Selection (Initial Research)

Based on the EDA, dataset characteristics, and augmented TF-IDF features, we shortlist the following models for initial research and experimentation:

## 1️⃣ Supervised Learning (Labeled Data: Spam vs Ham)
- **Naive Bayes (MultinomialNB / BernoulliNB)**: Fast, strong baseline for text classification, works well with TF-IDF.  
- **Logistic Regression**: Linear model suitable for high-dimensional TF-IDF vectors; interpretable and probabilistic outputs.  
- **Linear SVM**: Effective for high-dimensional sparse data; robust classifier for text.  
- **Simple Feedforward Neural Network (MLP)**: Optional deep learning model to explore non-linear relationships in TF-IDF features.  
- **Transformer (DistilBERT / BERT)**: Optional state-of-the-art model capturing context and semantics; requires GPU.

## 2️⃣ Unsupervised Learning (No Labels / Clustering / Anomaly Detection)
- **K-Means Clustering**: Group emails into “spam-like” vs “ham-like” clusters; simple and fast.  
- **DBSCAN**: Detect clusters and outliers (rare spam) in feature space; handles irregular patterns.  
- **Isolation Forest**: Detect rare or anomalous spam emails; works well for imbalanced data.  
- **One-Class SVM**: Learn “normal” ham emails and detect spam as anomalies; suitable for high-dimensional TF-IDF features.

## 3️⃣ Deep Learning / Neural Networks
- **MLP (Feedforward NN)**: Flexible for TF-IDF features; can model non-linear relationships.  
- **CNN for Text**: Good for local n-gram patterns if using word embeddings.  
- **RNN / LSTM / GRU**: Capture sequential patterns in emails; useful for context-aware modeling.  
- **Transformer (BERT / DistilBERT / RoBERTa)**: State-of-the-art NLP model; handles semantic and contextual information.  
- **Hybrid Models (CNN + LSTM / Attention)**: Powerful but complex; optional for advanced experiments.

## Notes
- **Start with supervised models** (Naive Bayes, Logistic Regression, Linear SVM) as quick baselines.  
- **Unsupervised methods** can be used to explore **anomaly detection** or identify clusters of spam for additional insights.  
- **Deep learning models** are optional and recommended if GPU is available or you need higher accuracy.  
- Final model selection will depend on **training results, evaluation metrics, and computational efficiency**.
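
As an illustration of the simplest baseline named above, here is a minimal multinomial Naive Bayes classifier with Laplace smoothing, written with the standard library only. It works on raw word counts rather than TF-IDF features, the four training emails are made up, and `train_nb` / `predict_nb` are hypothetical names. In the actual experiments a library implementation would be the natural choice.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Count words per class and estimate class priors."""
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    class_counts = Counter(labels)
    priors = {c: n / len(labels) for c, n in class_counts.items()}
    vocab = {w for counts in word_counts.values() for w in counts}
    return {"priors": priors, "word_counts": word_counts, "vocab": vocab}

def predict_nb(model, doc, alpha=1.0):
    """Return the class with the highest log-posterior; alpha is the Laplace smoothing term."""
    vocab_size = len(model["vocab"])
    scores = {}
    for label, prior in model["priors"].items():
        counts = model["word_counts"][label]
        total = sum(counts.values())
        score = math.log(prior)
        for word in doc.split():
            if word in model["vocab"]:  # ignore words never seen in training
                score += math.log((counts[word] + alpha) / (total + alpha * vocab_size))
        scores[label] = score
    return max(scores, key=scores.get)

train_docs = [
    "free money win",
    "win prize free",
    "meeting schedule today",
    "project meeting notes",
]
train_labels = ["spam", "spam", "ham", "ham"]

nb_model = train_nb(train_docs, train_labels)
print(predict_nb(nb_model, "free prize"))
print(predict_nb(nb_model, "meeting today"))
```

The smoothing term keeps a word that was never seen in one class from driving that class's probability to zero, which matters for short emails with rare vocabulary.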


# Conclusion: Data Augmentation & Model Preparation



**1. Data Inspection and Cleaning**  
- Loaded and inspected the original email dataset.  
- Split emails into **spam and ham** categories.  
- Applied **text cleaning** to remove special characters, URLs, email addresses, and normalize text.

**2. Synthetic Data Generation**  
- Fine-tuned **GPT-2 (`gpt2-medium`)** for one epoch on a 4,000-email subset (2,000 spam, 2,000 ham) to learn spam and ham styles.  
- Generated **synthetic emails** for both classes (50 spam and 50 ham in this run) to augment the real data.  
- Preprocessed synthetic data to match the real email format.

**3. Dataset Augmentation**  
- Combined real and synthetic emails to create the **final augmented dataset**.  
- Saved the cleaned dataset as `augmented_emails_cleaned.csv` for future use.  
- Ensured it is **ready for machine learning workflows**.

**4. Exploratory Data Analysis (EDA)**  
- Quick overview: dataset size, missing values, class distribution.  
- Advanced visualizations:  
  - Spam vs ham distribution, email length histograms  
  - Word clouds for frequent words  
  - Top TF-IDF features per class  
  - Correlation heatmap of TF-IDF features  
  - PCA and t-SNE projections to visualize class separation

**5. Feature Engineering**  
- Applied **TF-IDF vectorization** to transform text into numerical features.  
- Limited vocabulary to **5000 features** (unigrams + bigrams) and removed common stopwords.

**6. Statistical Testing**  
- Chi-Square test showed **statistically significant differences** between real and synthetic word distributions.  
- Indicates synthetic data differs in word frequency but still adds value for augmentation.

**7. Model Selection (Initial Research)**  
- **Supervised models**: Naive Bayes, Logistic Regression, Linear SVM (recommended baselines).  
- **Optional deep learning**: MLP, CNN, RNN/LSTM, Transformers.  
- **Unsupervised / anomaly detection**: K-Means, DBSCAN, Isolation Forest, One-Class SVM.  
- Models balance **speed, interpretability, and accuracy potential**.

**8. Next Steps**  
- Initialize selected models and perform **training and evaluation**.  
- Compare performance using **Accuracy, Precision, Recall, and F1-Score**.  
- Explore optional **deep learning or unsupervised methods** for improvements.  
- Dataset `augmented_emails_cleaned.csv` is ready for **reuse**.
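
The four metrics listed in the next steps all derive from the confusion-matrix counts. The sketch below computes them from two label lists, treating spam as the positive class. The labels and predictions are made up, and `classification_metrics` is a hypothetical helper.

```python
def classification_metrics(y_true, y_pred, positive="spam"):
    """Accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0   # of predicted spam, how much is spam
    recall = tp / (tp + fn) if tp + fn else 0.0      # of actual spam, how much was caught
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "spam", "ham", "ham", "ham", "spam", "spam", "ham"]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

For spam filtering, precision reflects how often legitimate mail is wrongly flagged, while recall reflects how much spam gets through, so the two are worth reporting separately rather than relying on accuracy alone.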

**Overall Summary:**  
- Created a **robust, balanced, and clean dataset**.  
- Conducted **advanced EDA** to understand patterns and features.  
- Provided a **structured model selection roadmap**.  
- All artifacts are **reusable and reproducible**, enabling efficient spam detection experiments.
