# Spam Classification Using NLP

Dataset → Text Cleaning → Feature Extraction → Model Training → Prediction

### Problem Statement
Spam messages cause security risks, waste user time, and reduce communication efficiency.  
This project aims to classify messages as **Spam** or **Ham (Not Spam)** using **Natural Language Processing (NLP)** and **Machine Learning** techniques.

---

### Project Overview
The spam classification model follows a standard NLP pipeline:
1. Text Preprocessing  
2. Feature Extraction  
3. Model Training  
4. Model Evaluation  

---

### Results
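- On the held-out test set (1,146 messages), the Multinomial Naive Bayes model reaches about 98.3% accuracy  
- For the spam class, precision is 0.98 and recall is 0.96  
- Preprocessing (lowercasing, stopword removal, stemming) normalizes the text before TF-IDF feature extraction  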

---


# Text Preprocessing

In [28]:
import re
import nltk

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('stopwords')
stem = PorterStemmer()

[nltk_data] Downloading package stopwords to C:\Users\Purvi
[nltk_data]     jain\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# Importing Dataset

In [29]:
import pandas as pd

df = pd.read_csv('emails.csv') 

In [30]:
df.head()

Unnamed: 0,Text,Spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1


# Text Cleaning 

In [31]:
stop_words = set(stopwords.words('english'))

In [32]:
# clean_text normalizes raw text for NLP modeling:
# it replaces special characters with spaces, lowercases the text,
# removes stopwords, stems each remaining word, and returns the joined sentence.

def clean_text(text):
    text = re.sub('[^a-zA-Z0-9]', ' ', str(text))
    text = text.lower()
    text = text.split()
    text = [
        stem.stem(word)
        for word in text
        if word not in stop_words
    ]
    return ' '.join(text)

In [33]:
df['clean_text'] = df['Text'].apply(clean_text)
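
To see what each cleaning step does, the function can be traced on a single message. The sketch below is a simplified stand-in, not the notebook's exact function: `clean_text_demo` and `demo_stop_words` are hypothetical names, and the tiny stopword set is an assumption chosen so the example runs without downloading the NLTK corpus. The input is the visible part of the fifth row shown by `df.head()`.

```python
import re
from nltk.stem import PorterStemmer

# Hypothetical miniature stopword list (assumption); the notebook itself
# uses the full list from stopwords.words('english').
demo_stop_words = {"do", "not", "have", "the", "your"}
stemmer = PorterStemmer()

def clean_text_demo(text):
    # 1. Replace everything except letters and digits with a space
    text = re.sub(r"[^a-zA-Z0-9]", " ", str(text))
    # 2. Lowercase and split on whitespace
    tokens = text.lower().split()
    # 3. Drop stopwords, then stem the remaining tokens
    return " ".join(stemmer.stem(t) for t in tokens if t not in demo_stop_words)

cleaned = clean_text_demo("Subject: do not have money , get software cds")
print(cleaned)  # subject money get softwar cd
```

The result matches the start of the `clean_text` value the notebook shows for that row.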

In [34]:
df.shape

(5728, 3)

In [9]:
df

Unnamed: 0,Text,Spam,clean_text
0,Subject: naturally irresistible your corporate...,1,subject natur irresist corpor ident lt realli ...
1,Subject: the stock trading gunslinger fanny i...,1,subject stock trade gunsling fanni merril muzo...
2,Subject: unbelievable new homes made easy im ...,1,subject unbeliev new home made easi im want sh...
3,Subject: 4 color printing special request add...,1,subject 4 color print special request addit in...
4,"Subject: do not have money , get software cds ...",1,subject money get softwar cd softwar compat gr...
...,...,...,...
5723,Subject: re : research and development charges...,0,subject research develop charg gpg forward shi...
5724,"Subject: re : receipts from visit jim , than...",0,subject receipt visit jim thank invit visit ls...
5725,Subject: re : enron case study update wow ! a...,0,subject enron case studi updat wow day super t...
5726,"Subject: re : interest david , please , call...",0,subject interest david pleas call shirley cren...


#### Spam column:
- 1 = Spam
- 0 = Not Spam

# Feature Extraction via TfidfVectorizer

In [35]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1,2))
x = vectorizer.fit_transform(df['clean_text']) # input feature
y = df['Spam'] # output label

# Training

You should not train and test a model on the same data, because then the model might just memorize the answers instead of actually learning patterns. To avoid this, the dataset is split into two parts.

#### Training data (80%):
This is the data the model learns from. It studies the relationship between the text features (`x`) and the labels (`y`).

#### Testing data (20%):
This data is kept aside and never shown to the model during training. After training, we use this data to check how accurately the model predicts spam or ham.

#### How it is done:

- `train_test_split` randomly divides the data into training and testing sets.
- `test_size=0.2` means 20% of the data is used for testing and 80% for training.
- `random_state=42` fixes the random seed, so the same split happens every time you run the code and results stay consistent.

#### What it means overall:
This step helps us measure the real performance of the model and ensures it can correctly classify spam messages in real-world situations, not just on known data.
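
Before applying the split to the real data, the effect of `test_size` and `random_state` can be checked on a toy example. This is a small sketch with made-up values; `toy_x` and `toy_y` are illustrative names only.

```python
from sklearn.model_selection import train_test_split

# Ten made-up samples with made-up labels
toy_x = list(range(10))
toy_y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

# Two calls with the same seed
split_a = train_test_split(toy_x, toy_y, test_size=0.2, random_state=42)
split_b = train_test_split(toy_x, toy_y, test_size=0.2, random_state=42)

x_tr, x_te, y_tr, y_te = split_a
print(len(x_tr), len(x_te))  # 8 2
print(split_a == split_b)    # True: same seed, same split
```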


In [36]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)


# Model = Naive Bayes (MultinomialNB)

In [37]:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(x_train, y_train)
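
Note that the cells above fit the TF-IDF vectorizer on all messages before splitting, so the vocabulary and IDF weights also reflect the test messages. One possible variant, sketched here on made-up data rather than taken from the notebook, wraps the vectorizer and classifier in a scikit-learn `Pipeline` so that both are fitted on training text only. The names `train_texts`, `train_labels`, and `spam_pipeline` are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training messages: 1 = spam, 0 = ham
train_texts = [
    "win free prize now",
    "claim free cash prize",
    "meeting moved to monday",
    "lunch at noon tomorrow",
]
train_labels = [1, 1, 0, 0]

spam_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),  # fitted on training text only
    ("nb", MultinomialNB()),
])
spam_pipeline.fit(train_texts, train_labels)

# Unseen messages are vectorized with the training vocabulary, then classified
toy_predictions = spam_pipeline.predict(["free prize inside", "see you at the meeting"])
print(toy_predictions)
```

With this layout the raw text is split first and the whole pipeline is fitted on the training portion.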


## Model Evaluation and Performance Metrics

1. **accuracy_score(y_test, y_pred)** calculates the overall accuracy of the model, meaning the fraction of test messages that were classified correctly as spam or ham.
2. **classification_report(y_test, y_pred)** provides a detailed summary of the model’s performance, including precision, recall, F1-score, and support for each class. This helps understand not only how accurate the model is, but also how well it identifies spam and non-spam messages individually.

- Precision → “Of the messages I labeled spam, how many really were spam?”
- Recall → “Of all the real spam, how much did I catch?”
- F1 → “How well do precision and recall balance?” (their harmonic mean)
- Support → “How many true examples of this class are in the test set?”
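
The four report columns can be reproduced by hand from counts of true positives, false positives, and false negatives. The sketch below uses plain Python and made-up labels; `spam_metrics`, `toy_true`, and `toy_pred` are hypothetical names, not part of scikit-learn.

```python
def spam_metrics(y_true, y_pred, positive=1):
    """Return (precision, recall, f1, support) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    support = tp + fn  # number of true positive-class examples
    return precision, recall, f1, support

# Made-up labels: 5 spam and 5 ham messages
toy_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]

precision, recall, f1, support = spam_metrics(toy_true, toy_pred)
print(precision, recall, round(f1, 3), support)  # 0.75 0.6 0.667 5
```

Here 3 of the 4 messages flagged as spam are really spam (precision 0.75), and 3 of the 5 real spam messages are caught (recall 0.6).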

In [38]:
from sklearn.metrics import accuracy_score, classification_report

y_pred = model.predict(x_test)

print("Accuracy:", accuracy_score(y_test, y_pred)) # Calculates the overall accuracy of the model
print(classification_report(y_test, y_pred)) # Gives a detailed performance summary


Accuracy: 0.9834205933682374
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       856
           1       0.98      0.96      0.97       290

    accuracy                           0.98      1146
   macro avg       0.98      0.98      0.98      1146
weighted avg       0.98      0.98      0.98      1146



In [39]:
y.value_counts()


Spam
0    4360
1    1368
Name: count, dtype: int64
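
Because ham outnumbers spam roughly three to one, accuracy is easier to interpret next to a trivial baseline that labels every message as ham. The short calculation below only reuses the counts printed above (the class totals and the support column of the classification report).

```python
# Class counts copied from the value_counts() output above
ham_total, spam_total = 4360, 1368
# Support values copied from the classification report (test split)
ham_test, spam_test = 856, 290

baseline_full = ham_total / (ham_total + spam_total)
baseline_test = ham_test / (ham_test + spam_test)

print(f"Always-ham accuracy, full dataset: {baseline_full:.3f}")  # 0.761
print(f"Always-ham accuracy, test split:   {baseline_test:.3f}")  # 0.747
```

The reported test accuracy of about 0.983 is well above this baseline of about 0.747, so the model is doing more than predicting the majority class.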

# Testing

In [42]:
msg = ["Hi, can we reschedule our meeting to tomorrow afternoon? Please let me know your availability."]
msg_clean = [clean_text(msg[0])]
msg_vector = vectorizer.transform(msg_clean)

print(model.predict(msg_vector))

[0]


In [43]:
msg = ["Congratulations! You won a free iPhone. Call now to claim your prize"]
msg_clean = [clean_text(msg[0])]
msg_vector = vectorizer.transform(msg_clean)

print(model.predict(msg_vector))


[1]


# Saving the Trained Model and Vectorizer

This code saves the trained spam classification model and the TF-IDF vectorizer using Python's `pickle` module, which serializes objects into a byte stream that can be written to disk and loaded later. The model file stores the learned parameters required to classify messages, while the vectorizer file stores the vocabulary and IDF weights used during training. Both files are needed at prediction time, because new messages must be transformed with the same vectorizer the model was trained on.

The files are opened in binary write mode (`wb`) because pickle produces bytes rather than text. Saving these files allows the model to be reused later for prediction and deployment without retraining.


In [45]:
import pickle

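# Context managers make sure each file is flushed and closed after writing
with open("spam_model.pkl", "wb") as model_file:
    pickle.dump(model, model_file)
with open("tfidf_vectorizer.pkl", "wb") as vectorizer_file:
    pickle.dump(vectorizer, vectorizer_file)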

print("pickle model saved")

pickle model saved
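
Loading the files back is the mirror image of saving them: open in binary read mode (`rb`) and call `pickle.load`. The sketch below is a self-contained round trip under stated assumptions: it trains a toy vectorizer and model on made-up messages and writes to a temporary directory, so it does not touch the notebook's real `.pkl` files. All `toy_*` and `loaded_*` names are illustrative.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for the notebook's fitted vectorizer and model
toy_texts = ["win free prize now", "claim free cash prize",
             "meeting moved to monday", "lunch at noon tomorrow"]
toy_labels = [1, 1, 0, 0]
toy_vectorizer = TfidfVectorizer().fit(toy_texts)
toy_model = MultinomialNB().fit(toy_vectorizer.transform(toy_texts), toy_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "spam_model.pkl")
    vectorizer_path = os.path.join(tmp_dir, "tfidf_vectorizer.pkl")

    # Save both objects in binary write mode
    with open(model_path, "wb") as f:
        pickle.dump(toy_model, f)
    with open(vectorizer_path, "wb") as f:
        pickle.dump(toy_vectorizer, f)

    # Load them back in binary read mode
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)
    with open(vectorizer_path, "rb") as f:
        loaded_vectorizer = pickle.load(f)

# The restored pair should predict exactly like the original pair
new_message = ["claim your free prize"]
restored = loaded_model.predict(loaded_vectorizer.transform(new_message))
original = toy_model.predict(toy_vectorizer.transform(new_message))
print(restored, original)
```

Pickle files can execute arbitrary code when loaded, so only load files from a source you trust, and use the same scikit-learn version for saving and loading where possible.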
