# **SMS Spam Detection Project**

# 1. **Project Overview**

The SMS Spam Detection project aims to build a machine learning model capable of classifying SMS messages as either spam or ham (legitimate). The dataset used for this project is the SMS Spam Collection, which consists of 5,574 SMS messages labeled accordingly. The goal is to develop a robust classifier that can help filter out unwanted spam messages automatically.

# 2. **Dataset Description**

**Source & Context**

The dataset has been compiled from multiple sources, including:

**Grumbletext Website** – A UK-based forum where users report SMS spam.

**NUS SMS Corpus (NSC)** – A collection of 3,375 ham messages from Singaporean university students.

**Caroline Tag’s PhD Thesis** – A collection of 450 ham messages.

**SMS Spam Corpus v.0.1 Big** – Contains 1,002 ham and 322 spam messages.

# Dataset Structure

The dataset consists of two main columns:

**v1 (Label)** – Indicates whether the message is ham (legitimate) or spam (unwanted message).

**v2 (Message Text)** – The raw text content of the SMS.

**Import Necessary Libraries**

In [4]:
import numpy as np
import pandas as pd


import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from sklearn.metrics import accuracy_score, precision_score, classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

**Load the Dataset**

In [6]:
df = pd.read_csv("spam.csv", encoding='latin1')
# Size of the dataset (rows, columns)
df.shape

(5572, 5)

**Show Top 5 Rows**

In [8]:
df.head()

| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | | | |
| 1 | ham | Ok lar... Joking wif u oni... | | | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | ham | U dun say so early hor... U c already then say... | | | |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | | | |


**Rename Columns for Readability and Keep Only the Useful Columns**

In [10]:
# Rename the columns
df.rename(columns={'v1': 'label', 'v2': 'text'}, inplace=True)
# Keep only the useful columns (copy so later in-place edits do not act on a view)
df = df[["text", "label"]].copy()
# Size of the dataset
df.shape

(5572, 2)

# **Data Analysis**

**Check Missing Values**

In [13]:
df.isnull().sum()

text     0
label    0
dtype: int64

**Check Duplicated Values**

In [15]:
print("Duplicated values -->> ", df.duplicated().sum())
df.drop_duplicates(inplace=True)
print("\nDuplicates dropped successfully\n")
print("Duplicated values after dropping -->> ", df.duplicated().sum())


Duplicated values -->>  403

Duplicates dropped successfully

Duplicated values after dropping -->>  0


# **Label**

* **Imbalanced dataset**: ham messages far outnumber spam messages.

In [17]:
df["label"].value_counts()

label
ham     4516
spam     653
Name: count, dtype: int64
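
Because the classes are imbalanced, accuracy alone can look better than it is. As a rough illustration (not part of the original notebook), the counts above imply the following baseline for a trivial classifier that labels every message as ham. The variable names are chosen here for illustration only.

```python
# Label counts copied from the value_counts() output above.
label_counts = {"ham": 4516, "spam": 653}

total = sum(label_counts.values())
spam_share = label_counts["spam"] / total

# Accuracy of a trivial classifier that always predicts the majority class.
majority_label = max(label_counts, key=label_counts.get)
baseline_accuracy = label_counts[majority_label] / total

print(f"Total messages:    {total}")
print(f"Spam share:        {spam_share:.4f}")
print(f"Majority baseline: {baseline_accuracy:.4f} (always predict '{majority_label}')")
```

Any model should be judged against this baseline, which is why precision on the spam class is tracked alongside accuracy later on.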

# **Data Preprocessing**

To ensure optimal performance of the classification model, the following preprocessing steps were applied:

1. **Text Normalization:** Conversion to lowercase and removal of non-alphabetic characters (digits, punctuation, and other symbols).

2. **Tokenization:** Splitting messages into individual words.

3. **Stopword Removal:** Removing common English words that carry little meaning.

4. **Stemming:** Reducing words to their root forms with the Porter stemmer to improve generalization.

In [19]:
# nltk.download('stopwords')
# nltk.download('punkt')

# Initialize the Porter stemmer
stemming = PorterStemmer()

# Get English stopwords (as a set for fast membership checks)
stop_words = set(stopwords.words('english'))

# Define the text preprocessing function
def preprocessing_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'[^a-z\s]', '', text)  # Remove non-alphabetic characters (keeps whitespace)
    tokens = word_tokenize(text)  # Tokenize text into words
    tokens = [stemming.stem(token) for token in tokens if token not in stop_words]  # Remove stopwords, then stem
    text = " ".join(tokens)  # Join tokens back to string
    text = re.sub(r'\s+', ' ', text)  # Replace multiple spaces with single space
    return text.strip()  # Remove leading/trailing spaces

# Apply preprocessing to the 'text' column of the dataframe
df["clean_text"] = df["text"].apply(preprocessing_text)


In [20]:
print("===================================")
print("       Before Text Preprocessing      ")
print("===================================")

print(df["text"].iloc[0])

print("===================================")
print("       After Text Preprocessing      ")
print("===================================")

print(df["clean_text"].iloc[0])

       Before Text Preprocessing      
Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
       After Text Preprocessing      
go jurong point crazi avail bugi n great world la e buffet cine got amor wat
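
One consequence of the normalization step is that digits and symbols disappear entirely, including phone numbers and prize amounts that often appear in spam. The sketch below isolates only the regex part of `preprocessing_text` (no tokenization, stopword removal, or stemming) so the effect is easy to see. The helper name `normalize` and the sample message are hypothetical.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, drop everything except a-z and whitespace, and collapse spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical message with digits, punctuation, and a currency symbol.
sample = "WIN £1000 cash!! Call 0800-123-456 now."
print(normalize(sample))
```

Whether dropping these tokens helps or hurts is an empirical question; replacing numbers with a placeholder token is one alternative that could be compared.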


# **Feature Engineering**

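**Separate the Independent (Features) and Dependent (Target) Variables**

In [23]:
# Features are the raw `text` column; `clean_text` from the preprocessing step is not used here.
# The vectorizers below apply their own lowercasing and tokenization.
X = df["text"]
y = df["label"]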

**Split the Dataset into Train and Test Data**

In [25]:
# Split the dataset into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the shape of train and test sets
print(f"x_train shape: {x_train.shape}")
print(f"x_test shape: {x_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")


x_train shape: (4135,)
x_test shape: (1034,)
y_train shape: (4135,)
y_test shape: (1034,)


**Apply Label Encoding**

In [27]:
# Encode the labels as integers (classes are sorted alphabetically: ham -> 0, spam -> 1)
encode = LabelEncoder()
y_train = encode.fit_transform(y_train)
y_test = encode.transform(y_test)

# **Vectorization: Converting text data into numerical features using:**

1. **Bag of Words (BoW)**

2. **TF-IDF (Term Frequency-Inverse Document Frequency)**
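
To make the difference between the two representations concrete, here is a small pure-Python sketch on a hypothetical three-message corpus. It assumes whitespace tokenization (simpler than the vectorizers' own tokenizer) and uses the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization, which matches the default settings of scikit-learn's `TfidfVectorizer`. The function names are illustrative only.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already lowercased and whitespace-tokenized.
docs = [
    "free prize call now",
    "call me now",
    "free free prize",
]
tokenized = [doc.split() for doc in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})

def binary_bow(tokens):
    """1 if the term occurs in the message, else 0 (as with binary=True)."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocab]

# Document frequency: number of messages containing each term.
n_docs = len(tokenized)
doc_freq = Counter(tok for doc in tokenized for tok in set(doc))

def tfidf(tokens):
    """Raw term count times smoothed IDF, then L2-normalized."""
    counts = Counter(tokens)
    weights = [
        counts[term] * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term in vocab
    ]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm else weights

print("vocab:", vocab)
for doc, tokens in zip(docs, tokenized):
    print(doc, "->", binary_bow(tokens), [round(w, 3) for w in tfidf(tokens)])
```

In the binary representation a repeated word counts once, while TF-IDF gives the repeated "free" in the third message twice the weight of "prize" before normalization.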

# **Bag Of Words**

In [30]:
bow = CountVectorizer(max_features=5000, binary=True)
x_train_bow = bow.fit_transform(x_train)
x_test_bow = bow.transform(x_test)
x_train_bow.shape

(4135, 5000)

# **TF-IDF**

In [32]:
tfidf = TfidfVectorizer(max_features=5000)
x_train_tfidf = tfidf.fit_transform(x_train)
x_test_tfidf = tfidf.transform(x_test)
x_train_tfidf.shape

(4135, 5000)

# **Model Training**

## **Selected Models**

Several machine learning models were evaluated for classification:

1. Logistic Regression

2. Multinomial Naïve Bayes

3. Random Forest Classifier

4. XGBoost Classifier

In [34]:
# Define your models
models = {
    "logistic_regression": LogisticRegression(),
    "MultinomialNB": MultinomialNB(),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=50,random_state=2),
    "XGBClassifier": XGBClassifier(),
}

# Initialize lists to store results
model_names = []
accuracy_scores_bow = []
precision_scores_bow = []


accuracy_scores_tfidf = []
precision_scores_tfidf = []

# **Train Using BOW**

In [36]:
# Loop through models and evaluate them
for name, model in models.items():
    print(f"\nModel -- >> {name}\n")

    # Fit the model
    model.fit(x_train_bow, y_train)

    # Predict on train and test data
    train_pred_bow = model.predict(x_train_bow)
    test_pred_bow = model.predict(x_test_bow)

    # Calculate accuracy scores
    train_acc_score_bow = accuracy_score(y_train, train_pred_bow)
    test_acc_score_bow = accuracy_score(y_test, test_pred_bow)

    # Calculate precision scores
    train_precision_score_bow = precision_score(y_train, train_pred_bow)
    test_precision_score_bow = precision_score(y_test, test_pred_bow)

    # Generate classification reports
    train_class_rep_bow = classification_report(y_train, train_pred_bow)
    test_class_rep_bow = classification_report(y_test, test_pred_bow)

    # Generate confusion matrices
    train_conf_matrix_bow = confusion_matrix(y_train, train_pred_bow)
    test_conf_matrix_bow = confusion_matrix(y_test, test_pred_bow)

    # Print the results
    print(f"Train Accuracy == {train_acc_score_bow}")
    print(f"Test Accuracy \n== {test_acc_score_bow}")

    print("\nTrain Classification Report \n", train_class_rep_bow)
    print("\nTest Classification Report \n", test_class_rep_bow)

    print("Train Confusion Matrix == \n", train_conf_matrix_bow)
    print("Test Confusion Matrix == \n", test_conf_matrix_bow)

    print("=="*20)
    print("=="*20)

    # Append results to the lists
    model_names.append(name)
    accuracy_scores_bow.append(test_acc_score_bow)
    precision_scores_bow.append(test_precision_score_bow)



Model -- >> logistic_regression

Train Accuracy == 0.9961305925030229
Test Accuracy 
== 0.9825918762088974

Train Classification Report 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      3627
           1       1.00      0.97      0.98       508

    accuracy                           1.00      4135
   macro avg       1.00      0.98      0.99      4135
weighted avg       1.00      1.00      1.00      4135


Test Classification Report 
               precision    recall  f1-score   support

           0       0.98      1.00      0.99       889
           1       0.99      0.88      0.93       145

    accuracy                           0.98      1034
   macro avg       0.99      0.94      0.96      1034
weighted avg       0.98      0.98      0.98      1034

Train Confusion Matrix == 
 [[3627    0]
 [  16  492]]
Test Confusion Matrix == 
 [[888   1]
 [ 17 128]]

Model -- >> MultinomialNB

Train Accuracy == 0.9915356711003628
Test Ac

# **Train Using TF-IDF**

In [38]:
# Loop through models and evaluate them
for name, model in models.items():
    print(f"\nModel -- >> {name}\n")

    # Fit the model
    model.fit(x_train_tfidf, y_train)

    # Predict on train and test data
    train_pred_tfidf = model.predict(x_train_tfidf)
    test_pred_tfidf = model.predict(x_test_tfidf)

    # Calculate accuracy scores
    train_acc_score_tfidf = accuracy_score(y_train, train_pred_tfidf)
    test_acc_score_tfidf  = accuracy_score(y_test, test_pred_tfidf)

    # Calculate precision scores
    train_precision_score_tfidf = precision_score(y_train, train_pred_tfidf)
    test_precision_score_tfidf = precision_score(y_test, test_pred_tfidf)

    # Generate classification reports
    train_class_rep_tfidf = classification_report(y_train, train_pred_tfidf)
    test_class_rep_tfidf = classification_report(y_test, test_pred_tfidf)

    # Generate confusion matrices
    train_conf_matrix_tfidf = confusion_matrix(y_train, train_pred_tfidf)
    test_conf_matrix_tfidf = confusion_matrix(y_test, test_pred_tfidf)

    # Print the results
    print(f"Train Accuracy == {train_acc_score_tfidf}")
    print(f"Test Accuracy \n== {test_acc_score_tfidf}")

    print("\nTrain Classification Report \n", train_class_rep_tfidf)
    print("\nTest Classification Report \n", test_class_rep_tfidf)

    print("Train Confusion Matrix == \n", train_conf_matrix_tfidf)
    print("Test Confusion Matrix == \n", test_conf_matrix_tfidf)

    print("=="*20)
    print("=="*20)

    # Append results to the lists
    accuracy_scores_tfidf.append(test_acc_score_tfidf)
    precision_scores_tfidf.append(test_precision_score_tfidf)



Model -- >> logistic_regression

Train Accuracy == 0.9729141475211608
Test Accuracy 
== 0.9700193423597679

Train Classification Report 
               precision    recall  f1-score   support

           0       0.97      1.00      0.98      3627
           1       0.99      0.78      0.88       508

    accuracy                           0.97      4135
   macro avg       0.98      0.89      0.93      4135
weighted avg       0.97      0.97      0.97      4135


Test Classification Report 
               precision    recall  f1-score   support

           0       0.97      1.00      0.98       889
           1       0.97      0.81      0.88       145

    accuracy                           0.97      1034
   macro avg       0.97      0.90      0.93      1034
weighted avg       0.97      0.97      0.97      1034

Train Confusion Matrix == 
 [[3625    2]
 [ 110  398]]
Test Confusion Matrix == 
 [[886   3]
 [ 28 117]]

Model -- >> MultinomialNB

Train Accuracy == 0.973881499395405
Test Acc

**Store Results in a DataFrame**

In [40]:
pd.DataFrame({
    "Models": model_names,
    "Accuracy_score_bow": accuracy_scores_bow,
    "Precision_score_bow": precision_scores_bow,
    "Accuracy_score_tfidf": accuracy_scores_tfidf,
    "Precision_score_tfidf": precision_scores_tfidf
})

| | Models | Accuracy_score_bow | Precision_score_bow | Accuracy_score_tfidf | Precision_score_tfidf |
|---|---|---|---|---|---|
| 0 | logistic_regression | 0.982592 | 0.992248 | 0.970019 | 0.975 |
| 1 | MultinomialNB | 0.988395 | 1.0 | 0.968085 | 1.0 |
| 2 | RandomForestClassifier | 0.978723 | 0.992 | 0.974855 | 0.991736 |
| 3 | XGBClassifier | 0.982592 | 0.977444 | 0.976789 | 0.984 |


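**Accuracy is already high, so hyperparameter tuning is skipped**

* Best model in the table -- >> **MultinomialNB** with **BoW**
  * Highest test **accuracy (0.988395)** with a **precision of 1.0**
* The cell below retrains **Logistic Regression** with **BoW** (test accuracy 0.982592), and that is the model saved in the Model Saving section.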
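
The comparison can also be done programmatically instead of by eye. The sketch below copies the test scores from the results table and ranks the eight model and vectorizer combinations. The tuple layout and variable names are assumptions made for this illustration.

```python
# Test-set scores copied from the results table above.
# Each entry: (model, vectorizer, accuracy, precision)
results = [
    ("logistic_regression", "bow", 0.982592, 0.992248),
    ("logistic_regression", "tfidf", 0.970019, 0.975),
    ("MultinomialNB", "bow", 0.988395, 1.0),
    ("MultinomialNB", "tfidf", 0.968085, 1.0),
    ("RandomForestClassifier", "bow", 0.978723, 0.992),
    ("RandomForestClassifier", "tfidf", 0.974855, 0.991736),
    ("XGBClassifier", "bow", 0.982592, 0.977444),
    ("XGBClassifier", "tfidf", 0.976789, 0.984),
]

# Highest test accuracy.
best_by_accuracy = max(results, key=lambda r: r[2])

# Highest spam precision, breaking ties with accuracy.
best_by_precision = max(results, key=lambda r: (r[3], r[2]))

print("Best by accuracy: ", best_by_accuracy)
print("Best by precision:", best_by_precision)
```

Ranking by precision first reflects the usual preference in spam filtering for avoiding false positives, that is, legitimate messages flagged as spam.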

In [42]:
# Initialize model
log_reg = LogisticRegression()

# Fit the model
log_reg.fit(x_train_bow, y_train)

# Predict on train and test data
train_pred = log_reg.predict(x_train_bow)
test_pred = log_reg.predict(x_test_bow)

# Calculate accuracy scores
train_acc_score = accuracy_score(y_train, train_pred)
test_acc_score = accuracy_score(y_test, test_pred)

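# Calculate precision scores (weighted average over both classes;
# the comparison loops above use the default binary precision for the spam class)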
train_precision_score = precision_score(y_train, train_pred, average='weighted')
test_precision_score = precision_score(y_test, test_pred, average='weighted')

# Generate classification reports
train_class_rep = classification_report(y_train, train_pred)
test_class_rep = classification_report(y_test, test_pred)

# Generate confusion matrices
train_conf_matrix = confusion_matrix(y_train, train_pred)
test_conf_matrix = confusion_matrix(y_test, test_pred)

# Print the results
print(f"Train Accuracy == {train_acc_score}")
print(f"Test Accuracy == {test_acc_score}")

# Print precision scores
print(f"Train Precision Score == {train_precision_score}")
print(f"Test Precision Score == {test_precision_score}")

# Generate and print classification reports
print("\nTrain Classification Report \n", train_class_rep)
print("\nTest Classification Report \n", test_class_rep)

# Print confusion matrices
print("Train Confusion Matrix == \n", train_conf_matrix)
print("Test Confusion Matrix == \n", test_conf_matrix)



Train Accuracy == 0.9961305925030229
Test Accuracy == 0.9825918762088974
Train Precision Score == 0.9961475868812694
Test Precision Score == 0.9827625933060309

Train Classification Report 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      3627
           1       1.00      0.97      0.98       508

    accuracy                           1.00      4135
   macro avg       1.00      0.98      0.99      4135
weighted avg       1.00      1.00      1.00      4135


Test Classification Report 
               precision    recall  f1-score   support

           0       0.98      1.00      0.99       889
           1       0.99      0.88      0.93       145

    accuracy                           0.98      1034
   macro avg       0.99      0.94      0.96      1034
weighted avg       0.98      0.98      0.98      1034

Train Confusion Matrix == 
 [[3627    0]
 [  16  492]]
Test Confusion Matrix == 
 [[888   1]
 [ 17 128]]
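
The reported test metrics can be re-derived by hand from the test confusion matrix, which also shows why the weighted precision printed here differs from the binary precision in the results table. This is a worked check, not part of the original notebook; it assumes scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
# Test confusion matrix from the output above: rows are true labels,
# columns are predicted labels, in the order [ham (0), spam (1)].
tn, fp = 888, 1
fn, tp = 17, 128

total = tn + fp + fn + tp
accuracy = (tp + tn) / total

# Per-class precision: correct predictions of a class / all predictions of that class.
precision_spam = tp / (tp + fp)
precision_ham = tn / (tn + fn)

# Recall for spam: caught spam / all actual spam.
recall_spam = tp / (tp + fn)

# Weighted precision: per-class precision weighted by the true class support.
support_ham, support_spam = tn + fp, fn + tp
weighted_precision = (
    precision_ham * support_ham + precision_spam * support_spam
) / total

print(f"Accuracy:           {accuracy:.6f}")
print(f"Precision (spam):   {precision_spam:.6f}")
print(f"Recall (spam):      {recall_spam:.6f}")
print(f"Weighted precision: {weighted_precision:.6f}")
```

The spam recall of about 0.88 means 17 of the 145 spam messages in the test set slip through, which accuracy and precision alone do not reveal.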


# Model Saving

In [45]:
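import pickle

# Save the trained Logistic Regression model
with open("model.pkl", 'wb') as file:
    pickle.dump(log_reg, file)

# Save the fitted label encoder (maps ham/spam to 0/1)
with open("encode.pkl", 'wb') as file1:
    pickle.dump(encode, file1)

# Save the fitted Bag-of-Words vectorizer
with open("BOW.pkl", 'wb') as file2:
    pickle.dump(bow, file2)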
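
A minimal sketch of how the saved artifacts might be reloaded to classify a new message. The helper names `load_artifacts` and `predict_label` are hypothetical. The sketch assumes the three pickle files written above are in the working directory and that scikit-learn is installed when they are unpickled. The message is passed in raw, since the model was trained on the raw `text` column.

```python
import pickle

def load_artifacts(model_path="model.pkl",
                   encoder_path="encode.pkl",
                   vectorizer_path="BOW.pkl"):
    """Load the pickled model, label encoder, and vectorizer, in that order."""
    artifacts = []
    for path in (model_path, encoder_path, vectorizer_path):
        with open(path, "rb") as fh:
            artifacts.append(pickle.load(fh))
    return tuple(artifacts)

def predict_label(message, model, encoder, vectorizer):
    """Vectorize one raw message, classify it, and map the class id back to its label."""
    features = vectorizer.transform([message])
    class_id = model.predict(features)
    return encoder.inverse_transform(class_id)[0]

# Usage, assuming the three files exist:
# model, encoder, vectorizer = load_artifacts()
# print(predict_label("Free entry in 2 a wkly comp", model, encoder, vectorizer))
```

Keeping the vectorizer and encoder next to the model matters because the model only understands the exact feature columns and class ids it was trained with.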