## Spam Mail Detector (NLP Classification)


### Spam Mail Detector using Enron Email Dataset

#### Dataset (Kaggle)

https://www.kaggle.com/datasets/marcelwiechmann/enron-spam-data

Contains 33,716 real emails, each labelled as spam or ham.

#### Project Objective
The goal of this project is to build a Machine Learning classifier that can
automatically distinguish between **spam** and **non-spam (ham)** emails using
Natural Language Processing (NLP) techniques.

#### Why This Project?
Spam emails are a major security and productivity issue. By applying text
preprocessing, feature extraction, and classification algorithms, we can
automate spam detection efficiently.

#### Techniques Used
- Text preprocessing (tokenization, stopword removal)
- TF-IDF feature extraction
- Multinomial Naive Bayes classifier (Logistic Regression is imported but not trained in this notebook)
- Performance evaluation using Accuracy, Precision, Recall, and F1-score


### 1. Import Python Libraries

In [1]:
import pandas as pd
import numpy as np
import string
import nltk

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, f1_score


nltk.download('punkt')
nltk.download('stopwords')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Pc\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Pc\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### 2. Load Dataset

In [2]:
data = pd.read_csv("enron_spam_data.csv")
data.head()


|   | Unnamed: 0 | Subject | Message | Spam/Ham | Date |
|---|---|---|---|---|---|
| 0 | 0 | christmas tree farm pictures | NaN | ham | 1999-12-10 |
| 1 | 1 | vastar resources , inc . | gary , production from the high island larger ... | ham | 1999-12-13 |
| 2 | 2 | calpine daily gas nomination | - calpine daily gas nomination 1 . doc | ham | 1999-12-14 |
| 3 | 3 | re : issue | fyi - see note below - already done .\nstella\... | ham | 1999-12-14 |
| 4 | 4 | meter 7268 nov allocation | fyi .\n- - - - - - - - - - - - - - - - - - - -... | ham | 1999-12-14 |


In [4]:
data.shape

(33716, 5)

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33716 entries, 0 to 33715
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  33716 non-null  int64 
 1   Subject     33716 non-null  object
 2   Message     33664 non-null  object
 3   Spam/Ham    33716 non-null  object
 4   Date        33716 non-null  object
dtypes: int64(1), object(4)
memory usage: 1.3+ MB


In [6]:
data.drop("Unnamed: 0",axis=1,inplace=True)

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33716 entries, 0 to 33715
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Subject   33716 non-null  object
 1   Message   33664 non-null  object
 2   Spam/Ham  33716 non-null  object
 3   Date      33716 non-null  object
dtypes: object(4)
memory usage: 1.0+ MB


### 3. Handling Null Values and Duplicates

In [8]:
data.isnull().sum()

Subject      0
Message     52
Spam/Ham     0
Date         0
dtype: int64

#### Drop the 52 emails that have no message body.

In [9]:
data.dropna(subset=['Message'], inplace=True)


In [10]:
data.isnull().sum()

Subject     0
Message     0
Spam/Ham    0
Date        0
dtype: int64

In [11]:
data.duplicated().sum()


np.int64(15436)

#### Duplicate emails can bias the model by repeating the same patterns and inflating performance metrics.

In [12]:
data.drop_duplicates(inplace=True)


In [13]:
data.shape

(18228, 4)
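
The count above treats two rows as duplicates only when all four remaining columns match. As a minimal sketch on a made-up three-row frame (the rows are illustrative, not taken from the dataset), `duplicated` can also be restricted to the `Message` column to catch the same body sent under a different subject or date:

```python
import pandas as pd

# Toy frame (hypothetical rows) mirroring the notebook's column names.
toy = pd.DataFrame({
    "Subject": ["offer", "offer", "re: offer"],
    "Message": ["win a prize now", "win a prize now", "win a prize now"],
    "Spam/Ham": ["spam", "spam", "spam"],
    "Date": ["2001-01-01", "2001-01-01", "2001-01-02"],
})

# Full-row duplicates: only the second row repeats an earlier row exactly.
full_row_dupes = int(toy.duplicated().sum())

# Body-only duplicates: the second and third rows repeat the first body.
body_dupes = int(toy.duplicated(subset=["Message"]).sum())

print(full_row_dupes, body_dupes)
```

Whether body-only deduplication is appropriate for this dataset is a judgment call; the notebook itself uses the full-row check.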

### 4. Data Preprocessing (Text Cleaning)

In [14]:
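# NLTK's English stopword list, stored as a set for fast membership checks
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    """Lowercase, tokenize, and drop stopwords and single punctuation tokens."""
    text = text.lower()
    tokens = word_tokenize(text)
    # string.punctuation is a str, so `in` is a substring test here;
    # it removes tokens such as "," or "." but keeps longer ones such as "..."
    tokens = [t for t in tokens if t not in stop_words and t not in string.punctuation]
    return " ".join(tokens)

data['clean_text'] = data['Message'].apply(preprocess_text)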


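- Lowercasing reduces variability ("Free" and "free" become the same token)

- Stopword removal drops very common words (such as "the" or "is") that carry little signal

- Tokenization splits text into individual words and punctuation marks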
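
To make the three steps concrete, here is a dependency-free approximation. It swaps `word_tokenize` for a regular expression and NLTK's stopword list for a tiny hand-picked set (both are assumptions made so the snippet runs without downloads), so its output will differ from `preprocess_text` on real emails:

```python
import re

# Tiny illustrative stopword set; the notebook uses NLTK's full English list.
TOY_STOP_WORDS = {"the", "is", "a", "to", "for", "your", "has", "been"}

def simple_preprocess(text):
    """Lowercase, split into word tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return " ".join(t for t in tokens if t not in TOY_STOP_WORDS)

cleaned = simple_preprocess("The meeting is moved to Friday, see the attached doc.")
print(cleaned)  # meeting moved friday see attached doc
```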

### 5. Feature Extraction (TF-IDF)

In [15]:
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(data['clean_text'])
y = data['Spam/Ham'].map({'ham':0, 'spam':1})


#### TF-IDF converts text into numeric feature vectors that ML models can use; `max_features=5000` keeps only the 5,000 most frequent terms.
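
As a rough illustration of what those numbers are, the sketch below computes TF-IDF by hand for three toy documents. It uses the smoothed formula that `TfidfVectorizer` applies by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The documents and the helper name `tfidf_row` are made up for the example.

```python
import math
from collections import Counter

docs = ["free prize free", "meeting schedule", "free meeting"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term.
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_row(tokens):
    """Return {term: weight} for one document, L2-normalized."""
    counts = Counter(tokens)
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

row = tfidf_row(tokenized[0])
print(row)
```

In the first document "free" outweighs "prize" because it occurs twice, even though "prize" is the rarer term across the corpus.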

### 6. Train/Test Split

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
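
Note that the vectorizer in step 5 was fitted on the full dataset before this split, so the vocabulary and IDF weights were computed with the test emails included. One way to avoid that, sketched here on a toy corpus (the texts and labels are invented), is to split the raw text first and wrap the vectorizer and classifier in a `Pipeline`; `stratify` keeps the spam/ham ratio similar in both parts. This is an alternative arrangement, not what the notebook ran, and its effect on the reported scores has not been measured.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus (invented); 0 = ham, 1 = spam.
texts = [
    "meeting moved to friday", "please review the attached report",
    "gas nomination for december", "lunch tomorrow at noon",
    "win a free prize now", "cheap pills online order now",
    "claim your free bonus today", "free money click here now",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Split the raw text first so test emails never influence the vocabulary or IDF.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),
    ("nb", MultinomialNB()),
])
pipeline.fit(train_texts, train_labels)  # fits TF-IDF on the training text only
test_predictions = pipeline.predict(test_texts)
print(len(test_predictions), sorted(test_labels))
```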


### 7. Train Naive Bayes Model

In [17]:
model = MultinomialNB()
model.fit(X_train, y_train)

predictions = model.predict(X_test)
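
Logistic Regression is imported in step 1 but never trained. A minimal sketch of how it could be fitted through the same interface follows; the toy texts are invented, and on the real data you would pass `X_train` and `y_train` instead. No comparison with Naive Bayes is claimed here, since that run is not part of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data (invented); 0 = ham, 1 = spam.
train_texts = [
    "meeting moved to friday", "please review the attached report",
    "win a free prize now", "claim your free bonus today",
]
train_labels = [0, 0, 1, 1]

toy_vectorizer = TfidfVectorizer()
X_toy = toy_vectorizer.fit_transform(train_texts)

# max_iter is raised above the default as a precaution against convergence warnings.
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_toy, train_labels)

# Columns follow lr_model.classes_, so column 1 is the spam probability.
lr_probs = lr_model.predict_proba(toy_vectorizer.transform(["free prize inside"]))
print(lr_probs.shape)
```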


### 8. Evaluation

In [18]:
print("Accuracy:", accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))


Accuracy: 0.9786066922654965
              precision    recall  f1-score   support

           0       1.00      0.98      0.99      3184
           1       0.86      1.00      0.92       462

    accuracy                           0.98      3646
   macro avg       0.93      0.99      0.95      3646
weighted avg       0.98      0.98      0.98      3646
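
The per-class figures follow directly from confusion-matrix counts. The report does not print those counts, so the values below are an assumption: one assignment (462 true positives, 78 false positives, 0 false negatives, 3106 true negatives) that reproduces the accuracy and the rounded spam-class scores. Assignments with one or two missed spam emails (for example 461 / 77 / 1 / 3107) match the rounded output equally well.

```python
# Assumed counts for the spam class (not printed by the notebook).
tp, fp, fn, tn = 462, 78, 0, 3106

precision = tp / (tp + fp)   # share of flagged emails that are really spam
recall = tp / (tp + fn)      # share of spam emails that were flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.4f}")
# precision=0.86 recall=1.00 f1=0.92 accuracy=0.9786
```

Calling `confusion_matrix(y_test, predictions)` from `sklearn.metrics` on the real predictions would give the exact counts.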



### 9. Test on new data

In [40]:
# New email samples with stronger spam cues
new_emails = [
    "URGENT: Your bank account has been compromised. Verify here immediately."
]

# Preprocess new emails
clean_new_emails = [preprocess_text(email) for email in new_emails]

# Convert text to TF-IDF
X_new = vectorizer.transform(clean_new_emails)

# Prediction probabilities
probs = model.predict_proba(X_new)

# Lower threshold for testing aggressive spam detection
threshold = 0.25
custom_predictions = (probs[:, 1] > threshold).astype(int)

# Display results
for email, prob, pred in zip(new_emails, probs, custom_predictions):
    label = "SPAM" if pred == 1 else "HAM"
    print(f"Email: {email}")
    print(f"HAM probability : {prob[0]:.4f}")
    print(f"SPAM probability: {prob[1]:.4f}")
    print(f"Final Prediction: {label}")
    print("-" * 70)


Email: URGENT: Your bank account has been compromised. Verify here immediately.
HAM probability : 0.9840
SPAM probability: 0.0160
Final Prediction: HAM
----------------------------------------------------------------------
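
The sample is labelled HAM even with the threshold lowered to 0.25, because its spam probability (0.0160) is far below that value. The sketch below, using that probability plus two invented ones, shows how the decision changes with the threshold; `apply_threshold` is a hypothetical helper, not part of the notebook.

```python
def apply_threshold(spam_probs, threshold):
    """Label each probability as SPAM when it exceeds the threshold."""
    return ["SPAM" if p > threshold else "HAM" for p in spam_probs]

# 0.0160 is the probability printed above; 0.30 and 0.70 are invented.
spam_probs = [0.0160, 0.30, 0.70]

for threshold in (0.5, 0.25, 0.01):
    print(threshold, apply_threshold(spam_probs, threshold))
```

Only a threshold below 0.016 flags this message, and a cut-off that low would likely flag many legitimate emails too, so the miss points at the features or training data rather than at the threshold.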


### Conclusion

- A spam email detection system was successfully built using Natural Language Processing (NLP) and Machine Learning techniques on the Enron email dataset.

- Text preprocessing (lowercasing, tokenization, stopword removal) and duplicate handling were applied to clean the data; their individual effect on model performance was not measured in this notebook.

- TF-IDF vectorization effectively transformed raw email text into meaningful numerical features for classification.

- The trained Multinomial Naive Bayes model achieved 97.86% accuracy on the held-out test set. Because that test set is about 87% ham (3,184 of 3,646 emails), accuracy should be read alongside the per-class scores.

- Recall for spam was 1.00 after rounding, so nearly all spam in the test set was caught. This does not carry over automatically to new mail: the phishing-style sample in section 9 received a spam probability of only 0.0160 and was classified as ham.

- Precision for spam classification was 0.86, meaning roughly 14% of the emails flagged as spam were legitimate. With 78 misclassified test emails in total, almost all errors are ham marked as spam.

- The F1-score of 0.92 for spam class shows a strong balance between precision and recall.

- The weighted average F1-score of 0.98 confirms that the model performs consistently well across both classes.

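- Removing duplicates keeps identical emails from landing in both the training and test sets, which would otherwise inflate accuracy. The TF-IDF vectorizer was still fitted on the full dataset before the split, so a small amount of test-set information reaches the features.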
