#  **Augmented Data Preprocessing & Model Preparation**

## **Objective**
The goal of this notebook is to preprocess, clean, and augment the dataset to prepare it for model training.  
We aim to create a robust and balanced dataset that enhances model performance and generalization.  
This notebook also includes initial research and selection of suitable supervised learning algorithms for the classification task.

---

## **Introduction**
This notebook prepares the email dataset for a spam/ham classification model.  
We inspect and clean the synthetic (augmented) data, merge it with the original cleaned dataset, and explore the result before researching candidate models.  
The aim is data that is clean, balanced, and suitable for training models that generalize well to unseen emails.

---




# *1. Preprocessing of synthetic data*

# 1. Import libraries

In [None]:
# Install the plotting dependencies before importing them
!pip install plotly ipywidgets wordcloud

import re
import string
from collections import Counter

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import seaborn as sns
from ipywidgets import interact, widgets
from wordcloud import WordCloud


# 2. Load Data

In [None]:
from google.colab import drive
drive.mount('/content/drive')


In [None]:
import pandas as pd
augmented_path = "/content/drive/MyDrive/augmented_email.csv"
df_augmented = pd.read_csv(augmented_path)

print("Augmented data shape:", df_augmented.shape)

## First 5 rows

In [None]:

df_augmented.head()

## Basic Info

In [None]:
print(df_augmented.info())

## Dataset Description

In [None]:
print(df_augmented.describe(include='all'))

## Statistical summary

In [None]:
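import numpy as np

# Missing values are only handled further down, so cast to str here to keep len() and split() from failing on NaN
text = df_augmented['cleaned_text'].fillna('').astype(str)

df_augmented['text_length'] = text.apply(len)
df_augmented['word_count'] = text.apply(lambda x: len(x.split()))
# Return 0 for empty texts instead of the NaN (and warning) that np.mean([]) produces
df_augmented['avg_word_length'] = text.apply(lambda x: np.mean([len(w) for w in x.split()]) if x.split() else 0)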

df_augmented[['text_length', 'word_count', 'avg_word_length']].describe()

## Check missing values per column

In [None]:

print("\nMissing values in each column:")
print(df_augmented.isnull().sum())


## Fill missing 'cleaned_text' with 'text' column

In [None]:

df_augmented['cleaned_text'] = df_augmented['cleaned_text'].fillna(df_augmented['text'])

## Drop rows where 'cleaned_text' or 'label' is still missing

In [None]:

df_augmented = df_augmented.dropna(subset=['cleaned_text', 'label'])

##  Remove duplicates

In [None]:

df_augmented = df_augmented.drop_duplicates(subset=['cleaned_text']).reset_index(drop=True)

##  Preprocessing function

In [None]:


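def preprocess_text(text):
    """Lowercase the text and strip URLs, email addresses, digits, punctuation, and extra whitespace."""
    text = str(text).lower()  # lowercase
    text = re.sub(r"http\S+|www\S+|https\S+", "", text)  # remove URLs
    text = re.sub(r'\S+@\S+', '', text)  # remove email addresses
    text = re.sub(r'\d+', '', text)  # remove numbers
    text = text.translate(str.maketrans('', '', string.punctuation))  # remove punctuation
    text = re.sub(r'\s+', ' ', text).strip()  # collapse extra whitespace
    return text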

# Apply preprocessing
df_augmented['cleaned_text'] = df_augmented['cleaned_text'].apply(preprocess_text)

# Check results
print("Shape after handling missing, duplicates, and preprocessing:", df_augmented.shape)
print(df_augmented.head())
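
To make the effect of the cleaning steps concrete, here is a small self-contained sketch that repeats the function above and runs it on one made-up email string (the sample text is an assumption, not taken from the dataset). Note that URLs and email addresses are removed before punctuation; reversing that order would strip the `://` and `@` characters the patterns rely on.

```python
import re
import string

def preprocess_text(text):
    # Same steps as the notebook's function, repeated so this cell runs on its own
    text = str(text).lower()
    text = re.sub(r"http\S+|www\S+|https\S+", "", text)  # remove URLs
    text = re.sub(r'\S+@\S+', '', text)  # remove email addresses
    text = re.sub(r'\d+', '', text)  # remove numbers
    text = text.translate(str.maketrans('', '', string.punctuation))  # remove punctuation
    text = re.sub(r'\s+', ' ', text).strip()  # collapse extra whitespace
    return text

# Hypothetical sample, for illustration only
sample = "WIN $1000 now!!! Visit http://example.com or mail promo@example.com"
cleaned = preprocess_text(sample)
print(cleaned)  # win now visit or mail
```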


## Tokenize and count

In [None]:
from collections import Counter


def get_top_n_words(texts, n=20):
    all_words = " ".join(texts).split()
    return Counter(all_words).most_common(n)

print("Top 20 words in Spam:")
print(get_top_n_words(df_augmented[df_augmented['label']=='spam']['cleaned_text']))

print("\nTop 20 words in Ham:")
print(get_top_n_words(df_augmented[df_augmented['label']=='ham']['cleaned_text']))

# **3.Exploratory data analysis**

## Label Distribution

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(6,4))
ax = sns.countplot(x='label', data=df_augmented, palette='coolwarm', hue='label', legend=False)
plt.title("Label Distribution (Synthetic Data)", fontsize=14)  # emoji removed: default matplotlib fonts lack the glyph
plt.xlabel("Label (0 = Ham, 1 = Spam)")
plt.ylabel("Count")
for p in ax.patches:
    ax.annotate(f'{p.get_height()}', (p.get_x() + 0.35, p.get_height()+100))
plt.show()

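## Text Length Distribution

In [None]:
# Recompute the length features, since 'cleaned_text' changed during preprocessing
df_augmented['text_length'] = df_augmented['cleaned_text'].apply(len)
df_augmented['word_count'] = df_augmented['cleaned_text'].apply(lambda x: len(x.split()))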

plt.figure(figsize=(8,5))
plt.hist(df_augmented[df_augmented['label']=='ham']['text_length'], bins=40, alpha=0.7, label='Ham')
plt.hist(df_augmented[df_augmented['label']=='spam']['text_length'], bins=40, alpha=0.7, label='Spam')
plt.legend()
plt.title("Text Length Distribution (Synthetic Data)")
plt.xlabel("Length of Email")
plt.ylabel("Frequency")
plt.show()

## KDE (Density) Plots for Text Features

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10,6))
sns.kdeplot(data=df_augmented, x='text_length', hue='label', fill=True, palette='coolwarm')
plt.title("Text Length Density by Label", fontsize=14)  # emoji removed: default matplotlib fonts lack the glyph
plt.xlabel("Text Length")
plt.ylabel("Density")
plt.show()

plt.figure(figsize=(10,6))
sns.kdeplot(data=df_augmented, x='word_count', hue='label', fill=True, palette='mako')
plt.title("Word Count Density by Label", fontsize=14)  # emoji removed: default matplotlib fonts lack the glyph
plt.xlabel("Word Count")
plt.ylabel("Density")
plt.show()

## Correlation Heatmap

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Convert 'label' column to numeric if it's not already
df_augmented['label'] = df_augmented['label'].replace({'ham': 0, 'spam': 1, '1': 1, '0': 0}).astype(int)

plt.figure(figsize=(6,4))
sns.heatmap(df_augmented[['text_length', 'word_count', 'avg_word_length', 'label']].corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Feature Correlation Heatmap")
plt.show()

## Word Cloud Comparison

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

spam_words = " ".join(df_augmented[df_augmented['label']==1]['cleaned_text'])
ham_words = " ".join(df_augmented[df_augmented['label']==0]['cleaned_text'])

fig, axes = plt.subplots(1, 2, figsize=(14,6))
axes[0].imshow(WordCloud(width=800, height=400, colormap='Reds').generate(spam_words))
axes[0].set_title("Spam Word Cloud", fontsize=14)
axes[0].axis("off")

axes[1].imshow(WordCloud(width=800, height=400, colormap='Blues').generate(ham_words))
axes[1].set_title("Ham Word Cloud", fontsize=14)
axes[1].axis("off")
plt.show()

## Top 20 Most Frequent Words (Bar Charts)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

def top_words(df, label, n=20):
    words = " ".join(df[df['label']==label]['cleaned_text']).split()
    return pd.DataFrame(Counter(words).most_common(n), columns=['word', 'count'])

top_spam = top_words(df_augmented, 1)
top_ham = top_words(df_augmented, 0)

fig, axes = plt.subplots(1, 2, figsize=(16,6))
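# Pass hue explicitly: recent seaborn versions warn when palette is used without hue
sns.barplot(y='word', x='count', data=top_spam, palette='Reds_r', hue='word', legend=False, ax=axes[0])
axes[0].set_title("Top 20 Words in Spam", fontsize=14)
sns.barplot(y='word', x='count', data=top_ham, palette='Blues_r', hue='word', legend=False, ax=axes[1])
axes[1].set_title("Top 20 Words in Ham", fontsize=14)
plt.tight_layout()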
plt.show()

# ***2. Augmenting cleaned and synthetic data***

##  Load the original cleaned dataset

In [None]:

cleaned_path = "/content/drive/MyDrive/CleanedDataset.csv"
df_cleaned = pd.read_csv(cleaned_path)


##  Check original dataset info

In [None]:

print("Original CleanedDataset shape:", df_cleaned.shape)
print(df_cleaned.head())



##  Merge datasets

In [None]:

df_merged = pd.concat([df_cleaned, df_augmented], ignore_index=True)



##  Remove duplicates across the merged dataset

In [None]:

df_merged = df_merged.drop_duplicates(subset=['cleaned_text']).reset_index(drop=True)



##  Shuffle the merged dataset

In [None]:

df_merged = df_merged.sample(frac=1, random_state=42).reset_index(drop=True)



In [None]:
#  Save merged dataset
merged_path = "/content/drive/MyDrive/merged_dataset.csv"
df_merged.to_csv(merged_path, index=False)



In [None]:
#  Verify
print("Merged dataset shape:", df_merged.shape)
print(df_merged.head())

#  Load the merged dataset

In [None]:

merged_path = "/content/drive/MyDrive/merged_dataset.csv"
df_merged = pd.read_csv(merged_path)

##  Preprocessing function

In [None]:

def preprocess_text(text):
    text = str(text).lower()  # lowercase
    text = re.sub(r"http\S+|www\S+|https\S+", "", text)  # remove URLs
    text = re.sub(r'\S+@\S+', '', text)  # remove email addresses
    text = re.sub(r'\d+', '', text)  # remove numbers
    text = text.translate(str.maketrans('', '', string.punctuation))  # remove punctuation
    text = re.sub(r'\s+', ' ', text).strip()  # remove extra spaces
    return text

#  Apply preprocessing to the merged dataset

In [None]:

df_merged['cleaned_text'] = df_merged['cleaned_text'].astype(str).apply(preprocess_text)

#  Remove any empty or duplicate rows after preprocessing

In [None]:

df_merged = df_merged[df_merged['cleaned_text'] != '']  # remove empty
df_merged = df_merged.drop_duplicates(subset=['cleaned_text']).reset_index(drop=True)

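# Drop unnecessary columns

In [None]:


# .copy() makes df_final an independent frame, so the column assignments below do not raise SettingWithCopyWarning
df_final = df_merged[['label', 'cleaned_text']].copy()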


# Map labels

In [None]:


df_final['label'] = df_final['label'].replace({
    'ham': 0,
    'spam': 1,
    0: 0,
    1: 1
}).astype(int)


##  Normalize labels

In [None]:

df_final['label'] = df_final['label'].replace({'ham': 0, 'spam': 1})
df_final['label'] = df_final['label'].astype(int)

##  Remove tokenization artifacts

In [None]:

def clean_artifacts(text):
    text = re.sub(r'\b(escapenumb|escapelong|escap|esc)\b', '', str(text))
    text = re.sub(r'\s+', ' ', text).strip()
    return text

df_final['cleaned_text'] = df_final['cleaned_text'].apply(clean_artifacts)

## Drop duplicates




In [None]:

df_final = df_final.drop_duplicates(subset=['cleaned_text']).reset_index(drop=True)

## Label Distribution

In [None]:


label_counts = df_final['label'].value_counts()
print("\nLabel Distribution:\n", label_counts)


In [None]:

def get_top_words(texts, n=20):
    words = " ".join(texts).split()
    return Counter(words).most_common(n)

ham_top = get_top_words(df_final[df_final['label'] == 0]['cleaned_text'])
spam_top = get_top_words(df_final[df_final['label'] == 1]['cleaned_text'])

print("\nTop 20 Words in Ham:\n", ham_top)
print("\nTop 20 Words in Spam:\n", spam_top)

##  Text length analysis

In [None]:

df_final['text_len'] = df_final['cleaned_text'].apply(len)
ham_lens = df_final[df_final['label'] == 0]['text_len']
spam_lens = df_final[df_final['label'] == 1]['text_len']

print("Avg Lengths -> Ham:", ham_lens.mean(), " Spam:", spam_lens.mean())

# **Exploratory data analysis**

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
from collections import Counter
import numpy as np


In [None]:
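# Load the final dataset (the step that exports df_final to this path is not shown in this notebook)
df = pd.read_csv("/content/drive/MyDrive/final_dataset_clean.csv")

# Add text length feature (cast to str in case a text was read back as NaN)
df['text_length'] = df['cleaned_text'].astype(str).apply(len)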

# Preview
df.head()

## Label Distribution

In [None]:
fig = px.pie(df, names='label', title='Spam vs Ham Distribution',
             color='label', color_discrete_map={0:'lightblue', 1:'salmon'})
fig.show()


## Text Length Distribution

In [None]:
fig = px.histogram(df, x='text_length', color='label', nbins=100,
                   barmode='overlay', opacity=0.7,
                   labels={'text_length':'Email Length', 'label':'Label'},
                   title='Text Length Distribution by Label')
fig.show()


## Top Words Comparison (Bar Plot)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

def get_top_n_words(texts, n=20):
    all_words = " ".join(texts).split()
    return pd.DataFrame(Counter(all_words).most_common(n), columns=['word', 'count'])

top_spam_words = get_top_n_words(df[df['label']==1]['cleaned_text'].astype(str), 20)
top_ham_words = get_top_n_words(df[df['label']==0]['cleaned_text'].astype(str), 20)

fig, axes = plt.subplots(1, 2, figsize=(16, 8))

sns.barplot(x='count', y='word', data=top_spam_words, ax=axes[0], palette='Reds_r', hue='word', legend=False)
axes[0].set_title('Top 20 Words in Spam Emails')
axes[0].set_xlabel('Count')
axes[0].set_ylabel('Word')

sns.barplot(x='count', y='word', data=top_ham_words, ax=axes[1], palette='Blues_r', hue='word', legend=False)
axes[1].set_title('Top 20 Words in Ham Emails')
axes[1].set_xlabel('Count')
axes[1].set_ylabel('Word')

plt.tight_layout()
plt.show()

## Add More Features

In [None]:
# Work on a string copy of the text so rows read back as NaN do not break split()
text = df['cleaned_text'].astype(str)

# Add word count feature
df['word_count'] = text.apply(lambda x: len(x.split()))

# Add average word length (0 for empty texts)
df['avg_word_len'] = text.apply(lambda x: np.mean([len(w) for w in x.split()]) if x.split() else 0)

df.head()


## Scatterplot: Text Length vs Word Count

In [None]:
fig = px.scatter(df, x='word_count', y='text_length', color='label',
                 hover_data=['cleaned_text'], opacity=0.7,
                 labels={'word_count':'Word Count', 'text_length':'Text Length', 'label':'Label'},
                 title='Text Length vs Word Count by Label')
fig.show()


## Scatterplot: Text Length vs Avg Word Length

In [None]:
fig = px.scatter(df, x='avg_word_len', y='text_length', color='label',
                 hover_data=['cleaned_text'], opacity=0.7,
                 labels={'avg_word_len':'Average Word Length', 'text_length':'Text Length', 'label':'Label'},
                 title='Text Length vs Average Word Length by Label')
fig.show()


## Correlation Heatmap

In [None]:
# Select numeric features
num_df = df[['text_length','word_count','avg_word_len','label']]

fig = px.imshow(num_df.corr(), text_auto=True, color_continuous_scale='RdBu_r',
                title='Feature Correlation Heatmap')
fig.show()


# **Statistical testing**

## Chi-Square Test for Label Balance

In [None]:
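from scipy.stats import chisquare

# Observed label counts (df is the final dataset loaded in the EDA section)
label_counts = df['label'].value_counts()
print("Label Counts:\n", label_counts)

# Goodness-of-fit test against an even 50/50 split.
# chi2_contingency on a single-row table has zero degrees of freedom,
# so it always returns chi2 = 0 and p = 1 regardless of the counts.
chi2, p = chisquare([label_counts[0], label_counts[1]])
print(f"Chi-Square Test: chi2 = {chi2:.4f}, p-value = {p:.4f}")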


The dataset has a reasonable balance between spam and ham emails, so we can proceed with modeling without worrying about severe label imbalance affecting the results.
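
As a rough illustration of what a label-balance test measures, the sketch below computes a chi-square goodness-of-fit statistic against an even split using only the standard library. The counts are hypothetical, and the helper name `chi_square_balance` is introduced here for illustration. For one degree of freedom the p-value equals `erfc(sqrt(chi2 / 2))`.

```python
import math

def chi_square_balance(count_ham, count_spam):
    """Chi-square goodness-of-fit test of two label counts against a 50/50 split."""
    observed = [count_ham, count_spam]
    expected = sum(observed) / 2  # each class is expected to hold half the rows
    chi2 = sum((o - expected) ** 2 / expected for o in observed)
    # Survival function of the chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts, not taken from the dataset
chi2_demo, p_demo = chi_square_balance(5200, 4800)
minority_share = 4800 / (5200 + 4800)
print(f"chi2 = {chi2_demo:.2f}, p = {p_demo:.6f}, minority share = {minority_share:.2%}")
```

With large datasets even a mild imbalance such as this 52/48 split produces a small p-value, so it helps to read the test alongside the minority-class share rather than on its own.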

## KS test

In [None]:
from scipy.stats import ks_2samp

# Add text length column (df is the final dataset loaded in the EDA section)
df['text_length'] = df['cleaned_text'].astype(str).apply(len)

# Separate by label
spam_lengths = df[df['label'] == 1]['text_length']
ham_lengths = df[df['label'] == 0]['text_length']


ks_stat, ks_p = ks_2samp(spam_lengths, ham_lengths)
print(f"KS Test: statistic = {ks_stat:.4f}, p-value = {ks_p:.4f}")

Insight: a small p-value indicates that spam and ham emails follow different text-length distributions.

This matters for feature engineering: text length can be a useful predictive feature for the spam classifier.
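
For intuition about the statistic itself, here is a minimal standard-library sketch of the two-sample KS statistic: the largest gap between the two empirical cumulative distribution functions. The helper name `ks_statistic` and the length values are made up for illustration, and the sketch does not compute a p-value.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    largest_gap = 0.0
    # The gap can only change at observed values, so checking those is enough
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # share of sample_a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # share of sample_b that is <= x
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap

# Hypothetical text lengths, not taken from the dataset
demo_spam_lengths = [120, 150, 180, 210, 260]
demo_ham_lengths = [40, 60, 90, 130, 160]
demo_ks = ks_statistic(demo_spam_lengths, demo_ham_lengths)
print(f"KS statistic = {demo_ks:.2f}")
```

A value near 0 means the two length distributions overlap almost completely, while a value near 1 means they are almost fully separated.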

## Word Frequency Comparison

In [None]:
# Split words (df is the final dataset loaded in the EDA section)
spam_words = ' '.join(df[df['label']==1]['cleaned_text'].astype(str)).split()
ham_words = ' '.join(df[df['label']==0]['cleaned_text'].astype(str)).split()

# Count top 20 words
spam_counter = Counter(spam_words).most_common(20)
ham_counter = Counter(ham_words).most_common(20)

print("Top 20 Spam Words:", spam_counter)
print("Top 20 Ham Words:", ham_counter)

#  **Model Selection (Initial Research)**
Based on the Exploratory Data Analysis (EDA) and the overall characteristics of the dataset, several supervised learning algorithms are considered for the classification task.  
The shortlisted models for initial experimentation include:

- **Naive Bayes** – A probabilistic model suitable for text-based and categorical data.  
- **Logistic Regression** – A simple yet effective linear model for binary or multi-class classification.  
- **Random Forest** – An ensemble of decision trees that reduces variance and improves stability.  
- **XGBoost** – A gradient boosting algorithm known for strong performance on tabular data.  
- **Simple Neural Network** – A baseline deep learning model capable of capturing non-linear relationships.

These models will be trained and compared to determine which algorithm performs best for our dataset.
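
To show how the simplest shortlisted model treats text, the sketch below implements a tiny multinomial Naive Bayes classifier with Laplace smoothing in plain Python. It is a toy illustration under stated assumptions, not the implementation this project will use: the function names and the four training sentences are made up, and a real experiment would more likely rely on a library implementation with proper vectorization.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(texts, labels, alpha=1.0):
    """Count words per class and store log priors for a multinomial Naive Bayes model."""
    word_counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        word_counts[label].update(text.split())
    vocab = {word for counts in word_counts.values() for word in counts}
    class_docs = Counter(labels)
    return {
        "alpha": alpha,  # Laplace smoothing strength
        "vocab": vocab,
        "word_counts": word_counts,
        "totals": {label: sum(word_counts[label].values()) for label in class_docs},
        "log_prior": {label: math.log(n / len(labels)) for label, n in class_docs.items()},
    }

def predict(model, text):
    """Return the class with the highest log posterior for the given text."""
    vocab_size = len(model["vocab"])
    scores = {}
    for label, log_prior in model["log_prior"].items():
        score = log_prior
        for word in text.split():
            if word not in model["vocab"]:
                continue  # ignore words never seen during training
            count = model["word_counts"][label][word]
            denominator = model["totals"][label] + model["alpha"] * vocab_size
            score += math.log((count + model["alpha"]) / denominator)
        scores[label] = score
    return max(scores, key=scores.get)

# Toy training data, for illustration only (1 = spam, 0 = ham)
toy_texts = ["win free prize now", "free cash offer", "meeting at noon", "see you at lunch"]
toy_labels = [1, 1, 0, 0]
toy_model = train_naive_bayes(toy_texts, toy_labels)

print(predict(toy_model, "free prize offer"))  # 1
print(predict(toy_model, "lunch meeting"))     # 0
```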

---

# **Conclusion**
In this notebook, we have completed essential steps of data cleaning, preprocessing, and augmentation to prepare the dataset for model training.  
We also performed an initial model selection study to shortlist supervised learning algorithms suitable for our classification problem.  
The next step involves implementing these models, evaluating their performance using metrics like accuracy, precision, recall, and F1-score, and selecting the most robust one for deployment.
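
The evaluation metrics named above can be sketched from confusion-matrix counts as follows. The helper name `classification_metrics` and the label vectors are hypothetical and serve only to show how the four numbers relate to each other.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for a binary task, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # share of predicted spam that is spam
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # share of actual spam that was caught
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels and predictions (1 = spam, 0 = ham)
demo_true = [1, 1, 1, 0, 0, 0, 0, 1]
demo_pred = [1, 0, 1, 0, 1, 0, 0, 1]
demo_metrics = classification_metrics(demo_true, demo_pred)
print(demo_metrics)
```

For spam filtering, precision and recall are worth reading separately, since a false positive (a legitimate email flagged as spam) and a false negative (spam reaching the inbox) carry different costs.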

---