
In [2]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
!unzip smsspamcollection.zip

--2025-10-04 09:02:35--  https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: ‘smsspamcollection.zip’

smsspamcollection.z     [ <=>                ] 198.65K  --.-KB/s    in 0.08s   

2025-10-04 09:02:36 (2.44 MB/s) - ‘smsspamcollection.zip’ saved [203415]

Archive:  smsspamcollection.zip
  inflating: SMSSpamCollection       
  inflating: readme                  


In [6]:
import pandas as pd

df = pd.read_csv("SMSSpamCollection", sep="\t", header=None, names=["label", "message"])
print(df.head())

  label                                            message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


In [10]:
# check shape (rows, columns)
print("Dataset shape:", df.shape)

Dataset shape: (5572, 2)


In [11]:
# check how many spam vs ham
print("\nClass distribution:")
print(df['label'].value_counts())


Class distribution:
label
ham     4825
spam     747
Name: count, dtype: int64
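With 4825 ham and 747 spam messages the classes are clearly imbalanced, which affects how accuracy should be read later on. The short arithmetic sketch below uses counts copied by hand from the output above (the variable names are illustrative) to compute the spam share and the accuracy of a trivial classifier that labels everything as ham.

```python
# Class counts copied from the value_counts() output above
ham_count = 4825
spam_count = 747
total = ham_count + spam_count

# Fraction of messages that are spam
spam_share = spam_count / total

# Accuracy of a trivial classifier that predicts "ham" for every message
majority_baseline = ham_count / total

print(f"Total messages    : {total}")
print(f"Spam share        : {spam_share:.4f}")
print(f"Majority baseline : {majority_baseline:.4f}")
```

Any model worth keeping should beat this baseline, and recall on the spam class is a more telling number than raw accuracy.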


In [12]:
# check some random messages
print("\nRandom samples:")
print(df.sample(10))


Random samples:
     label                                            message
3375   ham                            Also andros ice etc etc
67    spam  Urgent UR awarded a complimentary trip to Euro...
2022   ham  I don't have anybody's number, I still haven't...
5362   ham  I'm in inside office..still filling forms.don ...
4185   ham  I just really need shit before tomorrow and I ...
1006   ham              Give me a sec to think think about it
4668   ham                         I send the print  outs da.
288    ham  hi baby im cruisin with my girl friend what r ...
2952   ham                   Hey now am free you can call me.
3474   ham                    You getting back any time soon?


In [18]:
df['label_num'] = df['label'].map({'ham': 0, 'spam': 1})
print(df[['label', 'label_num']].head(10))

  label  label_num
0   ham          0
1   ham          0
2  spam          1
3   ham          0
4   ham          0
5  spam          1
6   ham          0
7   ham          0
8  spam          1
9  spam          1
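`Series.map` with a dictionary returns NaN for any value that is not a key in the dictionary, so an unexpected label variant would silently become a missing target. A minimal sanity check, sketched on a hypothetical toy series rather than the real data:

```python
import pandas as pd

# Hypothetical labels; "Spam" (capitalised) is deliberately absent from the mapping
toy_labels = pd.Series(["ham", "spam", "ham", "Spam"])
toy_mapped = toy_labels.map({"ham": 0, "spam": 1})

# Values not found in the dict become NaN, so count them before training
n_unmapped = int(toy_mapped.isna().sum())
print("Unmapped labels:", n_unmapped)
```

Running the same `isna().sum()` check on `df['label_num']` would confirm that every row received a numeric label.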


In [19]:
import re

def clean_text(text):
    # lowercase
    text = text.lower()
    # remove URLs
    text = re.sub(r'http\S+|www\S+', ' ', text)
    # remove numbers
    text = re.sub(r'\d+', ' ', text)
    # remove every character that is not a-z or whitespace (punctuation, symbols, non-ASCII letters)
    text = re.sub(r'[^a-z\s]', ' ', text)
    # remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# apply cleaning
df['clean_message'] = df['message'].apply(clean_text)

# check samples before vs after
print(df[['message', 'clean_message']].head(10))

                                             message  \
0  Go until jurong point, crazy.. Available only ...   
1                      Ok lar... Joking wif u oni...   
2  Free entry in 2 a wkly comp to win FA Cup fina...   
3  U dun say so early hor... U c already then say...   
4  Nah I don't think he goes to usf, he lives aro...   
5  FreeMsg Hey there darling it's been 3 week's n...   
6  Even my brother is not like to speak with me. ...   
7  As per your request 'Melle Melle (Oru Minnamin...   
8  WINNER!! As a valued network customer you have...   
9  Had your mobile 11 months or more? U R entitle...   

                                       clean_message  
0  go until jurong point crazy available only in ...  
1                            ok lar joking wif u oni  
2  free entry in a wkly comp to win fa cup final ...  
3        u dun say so early hor u c already then say  
4  nah i don t think he goes to usf he lives arou...  
5  freemsg hey there darling it s been week s now... 
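The cleaning steps have two side effects worth noticing: apostrophes are replaced by spaces (row 4 above shows "don't" becoming "don t"), and all digits are dropped, including phone numbers and prize amounts. The sketch below repeats the same function on two hypothetical sample strings so the effect of each step can be traced.

```python
import re

def clean_text(text):
    # Same steps as the cell above
    text = text.lower()
    text = re.sub(r'http\S+|www\S+', ' ', text)   # URLs
    text = re.sub(r'\d+', ' ', text)              # digits
    text = re.sub(r'[^a-z\s]', ' ', text)         # anything not a-z or whitespace
    text = re.sub(r'\s+', ' ', text).strip()      # collapse whitespace
    return text

# Hypothetical sample strings, not taken from the dataset
sample_spam = "WINNER!! Call 0800 now: http://example.com"
sample_ham = "I don't think so"

print(clean_text(sample_spam))
print(clean_text(sample_ham))
```

Whether discarding digits helps or hurts is an empirical question, since numbers can themselves be a spam signal.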

In [22]:
from sklearn.model_selection import train_test_split

# Features (X) = clean messages, Labels (y) = ham/spam numeric
X = df['clean_message']
y = df['label_num']

# Split: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Training set size:", X_train.shape[0])
print("Testing set size:", X_test.shape[0])

Training set size: 4457
Testing set size: 1115
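The `stratify=y` argument asks `train_test_split` to keep the class proportions roughly equal in both subsets. The sketch below illustrates this on hypothetical toy labels (80 ham, 20 spam) rather than on the real data.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 80 ham (0) and 20 spam (1)
toy_y = [0] * 80 + [1] * 20
toy_X = [f"message {i}" for i in range(len(toy_y))]

_, _, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# With stratification both subsets keep the 20% spam share
print("Train spam share:", sum(y_tr) / len(y_tr))
print("Test spam share :", sum(y_te) / len(y_te))
```

The same comparison can be run on the real split with `y_train.mean()` and `y_test.mean()`, since the labels are 0/1.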


In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF
vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1,2), min_df=2)

# Fit on training data and transform both train/test
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

print("Training features shape:", X_train_tfidf.shape)
print("Testing features shape:", X_test_tfidf.shape)

Training features shape: (4457, 6692)
Testing features shape: (1115, 6692)
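Both matrices have the same number of columns because the vocabulary is learned only from the training messages; `transform` then maps test messages onto that fixed vocabulary and ignores unseen terms. A toy sketch with a hypothetical corpus, using `min_df=2` as above but leaving out stop words and bigrams for simplicity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus
train_docs = ["free prize call now", "call me later", "free entry prize"]
test_docs = ["free lottery call"]

toy_vec = TfidfVectorizer(min_df=2)        # keep terms that appear in at least 2 training docs
toy_train = toy_vec.fit_transform(train_docs)
toy_test = toy_vec.transform(test_docs)    # reuses the training vocabulary

toy_vocab = sorted(toy_vec.vocabulary_)
print("Vocabulary :", toy_vocab)
print("Train shape:", toy_train.shape)
print("Test shape :", toy_test.shape)
```

Here "lottery" never appears in the training documents, so it contributes nothing to the test row.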


In [24]:
from sklearn.naive_bayes import MultinomialNB

# Initialize Naive Bayes
nb_model = MultinomialNB()

# Train
nb_model.fit(X_train_tfidf, y_train)

# Predict on test set
y_pred_nb = nb_model.predict(X_test_tfidf)

In [25]:
from sklearn.linear_model import LogisticRegression

# Initialize Logistic Regression
lr_model = LogisticRegression(max_iter=1000, solver='liblinear')

# Train
lr_model.fit(X_train_tfidf, y_train)

# Predict on test set
y_pred_lr = lr_model.predict(X_test_tfidf)

In [27]:
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    confusion_matrix,
    classification_report,
)

def evaluate_model(y_true, y_pred, model_name):
    print(f"\n=== {model_name} ===")
    print("Accuracy :", accuracy_score(y_true, y_pred))
    print("Precision:", precision_score(y_true, y_pred))
    print("Recall   :", recall_score(y_true, y_pred))
    print("F1-score :", f1_score(y_true, y_pred))
    print("\nClassification Report:\n", classification_report(y_true, y_pred, target_names=["ham", "spam"]))
    print("Confusion Matrix:\n", confusion_matrix(y_true, y_pred))

# Evaluate Naive Bayes
evaluate_model(y_test, y_pred_nb, "Multinomial Naive Bayes")
# Evaluate Logistic Regression
evaluate_model(y_test, y_pred_lr, "Logistic Regression")


=== Multinomial Naive Bayes ===
Accuracy : 0.9668161434977578
Precision: 1.0
Recall   : 0.7516778523489933
F1-score : 0.8582375478927203

Classification Report:
               precision    recall  f1-score   support

         ham       0.96      1.00      0.98       966
        spam       1.00      0.75      0.86       149

    accuracy                           0.97      1115
   macro avg       0.98      0.88      0.92      1115
weighted avg       0.97      0.97      0.96      1115

Confusion Matrix:
 [[966   0]
 [ 37 112]]

=== Logistic Regression ===
Accuracy : 0.9659192825112107
Precision: 1.0
Recall   : 0.7449664429530202
F1-score : 0.8538461538461538

Classification Report:
               precision    recall  f1-score   support

         ham       0.96      1.00      0.98       966
        spam       1.00      0.74      0.85       149

    accuracy                           0.97      1115
   macro avg       0.98      0.87      0.92      1115
weighted avg       0.97      0.97    

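From the results above:

Naive Bayes: Accuracy = 0.967, Recall = 0.752, F1 = 0.858

Logistic Regression: Accuracy = 0.966, Recall = 0.745, F1 = 0.854

👉 Both models reach a spam precision of 1.0 (no ham message was flagged as spam), and they differ by a single test message: Naive Bayes catches 112 of the 149 spam messages, Logistic Regression 111. Naive Bayes is marginally ahead on this split, so we’ll go with MultinomialNB as the final model, keeping in mind that a gap this small may not hold on a different split.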
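The Naive Bayes metrics can be re-derived by hand from its confusion matrix printed above, `[[966, 0], [37, 112]]`, where scikit-learn places true classes in rows and predicted classes in columns. A small arithmetic sketch (the variable names are illustrative):

```python
# Counts copied from the Naive Bayes confusion matrix in the output above
tn, fp = 966, 0     # true ham:  predicted ham, predicted spam
fn, tp = 37, 112    # true spam: predicted ham, predicted spam

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)       # share of predicted spam that really is spam
recall = tp / (tp + fn)          # share of real spam that was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy : {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall   : {recall:.4f}")
print(f"F1-score : {f1:.4f}")
```

The 37 false negatives are spam messages that reached the inbox, which is why recall is much lower than accuracy.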

In [28]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
import joblib

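# Vectorizer + model
# Note: this cell fits CountVectorizer (raw term counts) on the uncleaned
# 'message' column, not the TF-IDF features built from 'clean_message' that
# were evaluated above, so the reported metrics do not directly describe it.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['message'])   # fit on the full dataset (train + test)
y = df['label']   # string labels: 'ham' / 'spam'

# Train final model on all 5572 messages
final_model = MultinomialNB()
final_model.fit(X, y)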

In [29]:
# Save model & vectorizer
joblib.dump(final_model, "spam_classifier_model.pkl")
joblib.dump(vectorizer, "vectorizer.pkl")

['vectorizer.pkl']
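Because the vectorizer and the classifier always have to be loaded and applied together, one option is to bundle them in a scikit-learn `Pipeline` and save a single file. The sketch below is an alternative to the two-file approach above, not what this notebook does; the toy corpus and the file name `spam_pipeline.pkl` are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for df['message'] and df['label']
toy_texts = [
    "win a free prize now",
    "free entry win cash",
    "see you at lunch",
    "call me when you are home",
]
toy_targets = ["spam", "spam", "ham", "ham"]

# Vectorizer and classifier travel together as one object
toy_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
toy_pipeline.fit(toy_texts, toy_targets)

# Save and reload a single artifact (written to a temporary directory here)
pipeline_path = os.path.join(tempfile.mkdtemp(), "spam_pipeline.pkl")
joblib.dump(toy_pipeline, pipeline_path)
restored = joblib.load(pipeline_path)

# The restored pipeline accepts raw text directly
toy_prediction = restored.predict(["win a free prize"])[0]
print(toy_prediction)
```

With a single artifact there is no risk of pairing a model with a vectorizer that was fitted on different data.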

In [33]:
# Load model & vectorizer
model = joblib.load("spam_classifier_model.pkl")
vectorizer = joblib.load("vectorizer.pkl")

# Example prediction
msg = ["""Hello OLAWALE SAMUEL OLAITAN,

The wait is over! 🚀 Applications for the 2025 Batch B College Relief Fund (CRF) Scholarship are officially OPEN. This is your opportunity to secure up to ₦300,000 to cover tuition, accommodation, and essential living expenses while you focus fully on your studies.

✨ Why Apply?

Over 5,000+ Nigerian students have already benefited from CRF Main and Mini Scholarships.
Open to students in all accredited Universities, Polytechnics, Colleges of Education, and Colleges of Health Sciences & Technology.
It’s a grant, not a loan – no repayment required.
👉 What to Do Now:

Login or Register on the CRF Portal.
Enroll for the 2025 Batch B Scholarship from your dashboard.
Complete and submit your application today with your ndolawale@gmail.com email."""]
X_new = vectorizer.transform(msg)
prediction = model.predict(X_new)

print("Spam" if prediction[0] == "spam" else "Ham")

Ham
