# Spam Email Classification Report
Overview
This project aims to develop a spam email classification system using various natural language processing (NLP) and machine learning techniques. The code provided outlines the entire pipeline from data preprocessing to model training and evaluation. The dataset used in this project is labeled as "Spam_Email_Data.csv", which consists of emails classified into spam and non-spam categories.

In [1]:
import pandas as pd
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import nltk
import numpy as np
from gensim.models import Word2Vec, Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.metrics import classification_report

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')



[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
data = pd.read_csv('Spam_Email_Data.csv')


# Data Preprocessing
Data preprocessing is a crucial step in any machine learning task. In this project, the following preprocessing steps were applied to the email text data:

- Cleaning Text: The raw email texts were cleaned to remove headers, HTML tags, email addresses, URLs, non-word characters, and extra whitespaces.
- Text Normalization:
Conversion to lowercase to ensure uniformity.
Removal of non-alphabet characters.
Tokenization of text into individual words using NLTK's word_tokenize.
Removal of English stopwords using NLTKâ€™s predefined list.
Lemmatization of words to reduce them to their base or root form.
- Final Text Representation: After lemmatization, the tokens were joined back into a single string per email, which serves as the final text input for vectorization and machine learning modeling.

In [4]:
import re

def clean_email_text(text):
    if not isinstance(text, str):
        return ""
    
    text= re.sub(r'^(.*?\n\n)', '', text, flags=re.S)      #remove email headers
    # remove html tags
    text= re.sub(r'<[^>]+>', '', text)

    text= re.sub(r'\S*@\S*\s?', '', text)  # remove emails
    text= re.sub(r'http\S+', '', text)     # remove urls

    #remove all non alphanum
    text= re.sub(r'[^\w\s.,;!?-]', '', text)

    #cleaning white spaces
    text= re.sub(r'\s+', ' ', text).strip()
    return text

data['cleaned_text']= data['text'].apply(clean_email_text)


print(data[ 'cleaned_text'].head())


0    From Mon Jul 29 112802 2002 Return-Path Delive...
1    From Mon Jun 24 175421 2002 Return-Path Delive...
2    From Mon Jul 29 113957 2002 Return-Path Delive...
3    From Mon Jun 24 174923 2002 Return-Path Delive...
4    From Mon Aug 19 110247 2002 Return-Path Delive...
Name: cleaned_text, dtype: object


In [5]:
from nltk.stem import WordNetLemmatizer

# lowercase
data['cleaned_text']= data['cleaned_text'].apply(lambda x: x.lower())

#tokenize the text
data['tokenized_text'] = data['cleaned_text'].apply(word_tokenize)

#remove stopwords
stop_words = set(stopwords.words('english'))
data['filtered_text'] = data['tokenized_text'].apply(lambda x: [word for word in x if word not in stop_words])

#apply lemmatization
lemmatizer = WordNetLemmatizer()
data['lemmatized_text'] = data['tokenized_text'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])

# Check the output
print(data[['text', 'cleaned_text', 'filtered_text', 'lemmatized_text']].head())


                                                text  \
0  From ilug-admin@linux.ie Mon Jul 29 11:28:02 2...   
1  From gort44@excite.com Mon Jun 24 17:54:21 200...   
2  From fork-admin@xent.com Mon Jul 29 11:39:57 2...   
3  From dcm123@btamail.net.cn Mon Jun 24 17:49:23...   
4  From ilug-admin@linux.ie Mon Aug 19 11:02:47 2...   

                                        cleaned_text  \
0  from mon jul 29 112802 2002 return-path delive...   
1  from mon jun 24 175421 2002 return-path delive...   
2  from mon jul 29 113957 2002 return-path delive...   
3  from mon jun 24 174923 2002 return-path delive...   
4  from mon aug 19 110247 2002 return-path delive...   

                                       filtered_text  \
0  [mon, jul, 29, 112802, 2002, return-path, deli...   
1  [mon, jun, 24, 175421, 2002, return-path, deli...   
2  [mon, jul, 29, 113957, 2002, return-path, deli...   
3  [mon, jun, 24, 174923, 2002, return-path, deli...   
4  [mon, aug, 19, 110247, 2002, return-path, d

In [6]:
data['final_text'] = data['lemmatized_text'].apply(lambda x: ' '.join(x))

print(data['final_text'])

0       from mon jul 29 112802 2002 return-path delive...
1       from mon jun 24 175421 2002 return-path delive...
2       from mon jul 29 113957 2002 return-path delive...
3       from mon jun 24 174923 2002 return-path delive...
4       from mon aug 19 110247 2002 return-path delive...
                              ...                        
5791    from mon jul 22 181245 2002 return-path delive...
5792    from mon oct 7 203702 2002 return-path deliver...
5793    received from hq.pro-ns.net localhost 127.0.0....
5794    from thu sep 12 184430 2002 return-path delive...
5795    from mon sep 30 134410 2002 return-path delive...
Name: final_text, Length: 5796, dtype: object


# Word Embeddings:



- Word2Vec: Custom transformer using Gensim's Word2Vec model to convert words into vectors.


In [7]:
tokens = data['final_text'].tolist()
word2vec_model = Word2Vec(sentences=tokens, vector_size=100, window=5, min_count=2, workers=4)
word2vec_model.train(tokens, total_examples=len(tokens), epochs=10)

(34793990, 155067240)

In [9]:
def get_average_word2vec(tokens,model,vector_size):
    vec= np.zeros((vector_size,), dtype='float32')
    n_words= 0
    for word in tokens:
        if word in model.wv:
            n_words += 1
            vec= np.add(vec,model.wv[word])
    if n_words > 0:
        vec= np.divide(vec,n_words)
    return vec

vector_size = 100
data['word2vec_features'] = data['final_text'].apply(lambda x: get_average_word2vec(x, word2vec_model, vector_size))

In [11]:
X= np.array(data['word2vec_features'].tolist())
y= data['target'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [8]:
!pip install imbalanced-learn

Collecting imbalanced-learn
  Downloading imbalanced_learn-0.12.2-py3-none-any.whl.metadata (8.2 kB)
Downloading imbalanced_learn-0.12.2-py3-none-any.whl (257 kB)
   ---------------------------------------- 0.0/258.0 kB ? eta -:--:--
   - -------------------------------------- 10.2/258.0 kB ? eta -:--:--
   ---- ---------------------------------- 30.7/258.0 kB 435.7 kB/s eta 0:00:01
   ------------- ------------------------- 92.2/258.0 kB 871.5 kB/s eta 0:00:01
   ------------------------------- -------- 204.8/258.0 kB 1.4 MB/s eta 0:00:01
   ---------------------------------------- 258.0/258.0 kB 1.4 MB/s eta 0:00:00
Installing collected packages: imbalanced-learn
Successfully installed imbalanced-learn-0.12.2


In [16]:
from imblearn.under_sampling import RandomUnderSampler
# undersampler
undersampler = RandomUnderSampler(random_state=42)

#undersampling to the training data
X_train_resampled, y_train_resampled = undersampler.fit_resample(X_train, y_train)

print(f"Resampled: {pd.Series(y_train_resampled).value_counts()}")


Resampled: 0    1515
1    1515
Name: count, dtype: int64


In [17]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


# Model Building and Evaluation
machine learning model built using vectorizer (Word2Vec)  Support Vector Machine (SVM). The evaluation of models was performed using a hold-out test set (20% of the data).



In [19]:
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

svm_model = SVC(kernel='linear', probability=True)
svm_model.fit(X_train, y_train)

y_pred = svm_model.predict(X_test)
y_pred_proba = svm_model.predict_proba(X_test)[:, 1]
print(classification_report(y_test, y_pred))
print('Confusion Matrix:\n', confusion_matrix(y_test, y_pred))
print('ROC-AUC Score:', roc_auc_score(y_test, y_pred_proba))

              precision    recall  f1-score   support

           0       0.95      0.97      0.96       779
           1       0.93      0.90      0.91       381

    accuracy                           0.94      1160
   macro avg       0.94      0.93      0.94      1160
weighted avg       0.94      0.94      0.94      1160

Confusion Matrix:
 [[753  26]
 [ 39 342]]
ROC-AUC Score: 0.9777441972513384
