### **Project:** GENERATE EMAIL TEXT FOR SPAM CLASSIFICATION USING GENERATIVE AI
---------------------------------------------------------------------------

### **Preprocessing stage**

### **Data Set:** Spam Email Classification Dataset
### **Raw Dataset:** https://drive.google.com/file/d/1n04lsuCHvYsczEDSzixwar67ZTz4Dd-H/view?usp=sharing

Importing the Libraries

In [14]:
import numpy as np
import pandas as pd
import re
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

Loading the Data Set

In [15]:
df = pd.read_csv("combined_data (1).csv")

In [16]:
# Display the first 5 rows
display(df.head())

| | label | text |
|---|---|---|
| 0 | 1 | ounce feather bowl hummingbird opec moment ala... |
| 1 | 1 | wulvob get your medircations online qnb ikud v... |
| 2 | 0 | computer connection from cnn com wednesday es... |
| 3 | 1 | university degree obtain a prosperous future m... |
| 4 | 0 | thanks for all your answers guys i know i shou... |


In [17]:
# Display the last 5 rows
display(df.tail())

| | label | text |
|---|---|---|
| 83443 | 0 | hi given a date how do i get the last date of ... |
| 83444 | 1 | now you can order software on cd or download i... |
| 83445 | 1 | dear valued member canadianpharmacy provides a... |
| 83446 | 0 | subscribe change profile contact us long term ... |
| 83447 | 1 | get the most out of life ! viagra has helped m... |


In [18]:
# Display 5 random rows
display(df.sample(5))

| | label | text |
|---|---|---|
| 18147 | 0 | the ksvm object is probably what you need to u... |
| 36837 | 1 | rhythm the like it meat it bread the detail no... |
| 52547 | 1 | delphi bertrand faint jet industrial mario aut... |
| 68725 | 1 | dear ho me owner are you tired of paying high ... |
| 23955 | 1 | seize the opportunity escapenumber anatrim esc... |


In [19]:
# Dataset Info
# Checking information about the dataset using info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 83448 entries, 0 to 83447
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   label   83448 non-null  int64 
 1   text    83448 non-null  object
dtypes: int64(1), object(1)
memory usage: 1.3+ MB


Shape of Data set

In [20]:
# Checking number of rows and columns of the dataset using shape
print("Number of rows:", df.shape[0])
print("Number of columns:", df.shape[1])

Number of rows: 83448
Number of columns: 2


In [23]:
print("\n=== VALUE COUNTS FOR EACH COLUMN ===")
for column in df.columns:
    print(f"\n--- {column} ---")
    print(f"Data type: {df[column].dtype}")
    print(f"Unique values: {df[column].nunique()}")
    print(f"Missing values: {df[column].isna().sum()}")
    print("Value counts:")
    print(df[column].value_counts().head(10))


=== VALUE COUNTS FOR EACH COLUMN ===

--- label ---
Data type: int64
Unique values: 2
Missing values: 0
Value counts:
label
1    43910
0    39538
Name: count, dtype: int64

--- text ---
Data type: object
Unique values: 83446
Missing values: 0
Value counts:
text
unsubscribe    2
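
The label counts above are close to balanced. A minimal sketch of that arithmetic, using only the two counts printed in the output (the variable names are illustrative and not part of the notebook):

```python
# Label counts copied from the value_counts() output above
counts = {1: 43910, 0: 39538}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the most frequent label
majority_label = max(counts, key=counts.get)
majority_baseline = counts[majority_label] / total

print(total)                        # 83448
for label, share in shares.items():
    print(label, f"{share:.1%}")    # 1 52.6%, then 0 47.4%
```

A classifier that always predicts label 1 would reach about 52.6% accuracy on this data, which gives a floor for judging later results.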

Duplicated values

In [24]:
# Dataset Duplicate Value Count
dup = df.duplicated().sum()
print(f'Number of duplicated rows: {dup}')

Number of duplicated rows: 0
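
`df.duplicated()` compares whole rows, so a result of 0 does not rule out repeated email texts: the `text` column has 83446 unique values across 83448 rows. Since no full row repeats, any repeated text must appear with different labels. A small sketch, on a hypothetical toy frame rather than the real dataset, of how such rows could be found:

```python
import pandas as pd

# Toy frame (hypothetical data): the same text appears twice with different labels
toy = pd.DataFrame({
    "label": [1, 0, 1, 0],
    "text": ["unsubscribe", "unsubscribe", "cheap meds", "meeting at noon"],
})

# Full-row comparison: label and text must both match
full_row_dups = int(toy.duplicated().sum())

# Text-only comparison: keep=False marks every row that shares its text
text_dups = toy[toy.duplicated(subset="text", keep=False)]

# Texts that carry more than one distinct label
labels_per_text = toy.groupby("text")["label"].nunique()
conflicting = labels_per_text[labels_per_text > 1].index.tolist()

print(full_row_dups)   # 0
print(text_dups)
print(conflicting)     # ['unsubscribe']
```

Running the same three checks on `df` would show which texts are affected, so they can be reviewed before training.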


Missing values

In [25]:
# Missing Values/Null Values Count
df.isnull().sum()

label,0
text,0


In [26]:
df.columns

Index(['label', 'text'], dtype='object')

Describe Data set

In [27]:
# Dataset Describe (all columns included)
df.describe(include= 'all').round(2)

| | label | text |
|---|---|---|
| count | 83448.0 | 83448 |
| unique | | 83446 |
| top | | unsubscribe |
| freq | | 2 |
| mean | 0.53 | |
| std | 0.5 | |
| min | 0.0 | |
| 25% | 0.0 | |
| 50% | 1.0 | |
| 75% | 1.0 | |


Understanding variables

In [28]:
# Check Unique Values for each variable using a for loop.
for i in df.columns.tolist():
    print("No. of unique values in", i, "is", df[i].nunique())

No. of unique values in label is 2
No. of unique values in text is 83446


Text Preprocessing

In [None]:
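ps = PorterStemmer()

def simple_clean(text):
    """Basic text cleaning + stemming."""
    if pd.isna(text):
        return ""

    # Replace every non-letter character with a space, then lowercase and tokenize
    text = re.sub(r'[^a-zA-Z\s]', ' ', str(text))
    words = text.lower().split()

    # Drop words shorter than 3 characters and stem the rest
    words = [ps.stem(word) for word in words if len(word) >= 3]
    return ' '.join(words)

df['cleaned_text'] = df['text'].apply(simple_clean)

# Reuse scikit-learn's built-in English stopword list
vectorizer = CountVectorizer(stop_words='english')
stopwords = set(vectorizer.get_stop_words())

def remove_stopwords(text):
    # Runs on already-stemmed words, so a stopword whose stem differs
    # from its original spelling is not matched and stays in the text
    words = text.split()
    return ' '.join([w for w in words if w not in stopwords])

df['cleaned_text'] = df['cleaned_text'].apply(remove_stopwords)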

In [None]:
# head is a method and must be called; a bare df.head only shows the bound-method repr
df.head()

| | label | text | cleaned_text |
|---|---|---|---|
| 0 | 1 | ounce feather bowl hummingbird opec moment ala... | ounc feather bowl hummingbird opec moment alab... |
| 1 | 1 | wulvob get your medircations online qnb ikud v... | wulvob medirc onlin qnb ikud viagra esca |

In this stage, I performed text preprocessing to clean and standardize the email data before training the spam classification model. The following steps were carried out:

### **Text Cleaning and Stemming**:

Replaced every non-alphabetic character (digits, punctuation, and symbols) with a space.

Converted all text to lowercase for consistency.

Tokenized the text into individual words by splitting on whitespace.

Removed words shorter than 3 characters, which rarely carry signal for classification.

Applied Porter stemming with PorterStemmer() to reduce each remaining word to its stem (e.g., “running” → “run”).
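
The steps above, except stemming, can be traced on a single string with the standard library alone. This is a sketch: `clean_without_stemming` and the sample sentence are hypothetical, and stemming is left out so the example does not depend on NLTK.

```python
import re


def clean_without_stemming(text):
    """Mirror of simple_clean minus the stemming step (hypothetical helper)."""
    text = re.sub(r"[^a-zA-Z\s]", " ", str(text))   # non-letters become spaces
    words = text.lower().split()                     # lowercase, split on whitespace
    return [w for w in words if len(w) >= 3]         # keep words of 3+ characters


sample = "Get 50% OFF now!!! Go to example.com"
print(clean_without_stemming(sample))
# ['get', 'off', 'now', 'example', 'com']
```

The digits and punctuation disappear, "example.com" splits into two tokens because the dot becomes a space, and "go" and "to" are dropped by the length filter.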

### **Stopword Removal**:

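Used CountVectorizer(stop_words='english') to obtain scikit-learn's built-in English stopword list.

Removed these stopwords (like “the”, “is”, “and”) since they carry little information for distinguishing spam from non-spam. This step runs on the stemmed text, so a stopword whose stem differs from its original spelling is not matched by the list.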
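
Because stopword removal runs on text that has already been stemmed, the order of the two steps matters. The sketch below uses a hypothetical lookup-table stemmer and a three-word stopword set (both are stand-ins, not NLTK or scikit-learn objects) to compare the two orders. The toy stems imitate Porter-style outputs such as “this” → “thi”.

```python
def stem_then_filter(words, stopwords, stem):
    """Order used in the notebook: stem every word, then drop stopwords."""
    stemmed = [stem(w) for w in words]
    return [w for w in stemmed if w not in stopwords]


def filter_then_stem(words, stopwords, stem):
    """Alternative order: drop stopwords first, then stem what is left."""
    return [stem(w) for w in words if w not in stopwords]


# Hypothetical stand-ins for the real stopword list and for ps.stem
toy_stopwords = {"this", "very", "the"}
toy_stems = {"this": "thi", "very": "veri", "offers": "offer"}


def toy_stem(word):
    return toy_stems.get(word, word)


words = ["this", "very", "cheap", "offers"]
print(stem_then_filter(words, toy_stopwords, toy_stem))  # ['thi', 'veri', 'cheap', 'offer']
print(filter_then_stem(words, toy_stopwords, toy_stem))  # ['cheap', 'offer']
```

Passing `ps.stem` and the scikit-learn stopword set to the same two functions would allow the comparison on real emails.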

### **Result**:

Created a new column cleaned_text that contains the preprocessed version of each email, ready for feature extraction and model training.

In [None]:
df = df.drop(columns=['text'])

In [None]:
# Save to CSV file
df.to_csv('cleaned_dataset.csv', index=False)
print("✅ Dataset saved as 'cleaned_dataset.csv'")

✅ Dataset saved as 'cleaned_dataset.csv'
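
One thing to check when reloading this file: if cleaning leaves any email with an empty `cleaned_text` (an assumption, not verified for this dataset), `pd.read_csv` reads the empty field back as NaN by default. A sketch on a hypothetical two-row frame, written to an in-memory buffer instead of disk:

```python
import io

import pandas as pd

# Toy frame (hypothetical data): the second row cleaned down to an empty string
toy = pd.DataFrame({"label": [1, 0], "cleaned_text": ["cheap med", ""]})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

buffer.seek(0)
default_read = pd.read_csv(buffer)                      # empty field becomes NaN

buffer.seek(0)
safe_read = pd.read_csv(buffer, keep_default_na=False)  # empty field stays ""

print(default_read["cleaned_text"].isna().sum())  # 1
print(safe_read["cleaned_text"].tolist())         # ['cheap med', '']
```

Calling `.fillna("")` on the column after a default read is an equivalent fix. Either way the text column stays all-string before vectorization.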
