# Lab 9 Report: SMS Spam Detection using Naive Bayes

In this lab, we implemented an SMS spam classifier using the Naive Bayes algorithm.
We used the SMS Spam Collection dataset and preprocessed the text by lowercasing, removing punctuation, tokenizing on whitespace, removing English stopwords, and applying TF-IDF vectorization.
We trained a Multinomial Naive Bayes model on an 80/20 train/test split and evaluated it using accuracy, precision, recall, and F1-score.
The model reached about 97% accuracy with perfect spam precision (1.00), but it caught only 78% of spam messages (spam F1 of 0.88).
Next, we experimented with Bernoulli Naive Bayes, which scored slightly higher on this split: about 98% accuracy, with spam precision 0.96, recall 0.87, and F1 0.91.
On this test set, BernoulliNB therefore gave the better balance between precision and recall, while MultinomialNB produced no false positives.
Because ham makes up 966 of the 1115 test messages, accuracy alone hides missed spam, so spam recall and F1 are the more informative metrics here.
Overall, Naive Bayes proved to be a fast and effective approach for spam detection.
This lab helped us understand Bayesian classification and compare model variants.
The experience reinforced the importance of preprocessing and model selection for NLP tasks.

# Task 1: Naive Bayes for SMS Spam Detection

In [2]:
%pip install pandas scikit-learn nltk


Note: you may need to restart the kernel to use updated packages.


# 2. Import Libraries

In [3]:
import pandas as pd
import string
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))


[nltk_data] Downloading package stopwords to C:\Users\SMART
[nltk_data]     TECH\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


# 3. Load the Dataset


In [10]:
df = pd.read_csv(r"C:\Users\SMART TECH\Desktop\AppliedNLPMaterial-master\SMSSpamCollection", sep='\t', names=['label', 'message'])
print(df.head())


  label                                            message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...
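
Before modelling, it can help to check how the two labels are distributed, since a skewed split makes accuracy alone hard to interpret. The sketch below reads a tab-separated sample in the same two-column layout and prints the class shares. The three sample messages and the names `sample`, `toy_df`, and `class_share` are hypothetical, not taken from the dataset.

```python
import io

import pandas as pd

# Hypothetical three-line sample in the same tab-separated layout as the dataset file
sample = "ham\tsee you at lunch\nspam\tWIN a free prize now\nham\tok, call me later\n"

toy_df = pd.read_csv(io.StringIO(sample), sep="\t", names=["label", "message"])

# Share of each class; with few spam messages, accuracy can look high
# even when many spam messages are missed
class_share = toy_df["label"].value_counts(normalize=True)
print(class_share)
```

On the real data, the same check would be `df['label'].value_counts(normalize=True)`.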


# 4. Text Preprocessing Function

In [13]:
def preprocess_text(text):
    # Lowercase
    text = text.lower()
    # Remove punctuation
    text = ''.join([char for char in text if char not in string.punctuation])
    # Tokenization
    tokens = text.split()
    # Remove stopwords
    filtered_tokens = [word for word in tokens if word not in stop_words]
    return " ".join(filtered_tokens)


In [21]:
df['cleaned_message'] = df['message'].apply(preprocess_text)
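
To see what the cleaning steps do to a single message, here is a minimal, self-contained sketch of the same pipeline (lowercase, strip punctuation, split on whitespace, drop stopwords). It assumes a tiny stand-in stopword set, `demo_stop_words`, instead of the NLTK list, and `preprocess_demo` is a hypothetical name used only for this illustration.

```python
import string

# Assumption: a tiny stand-in stopword set; the lab uses NLTK's English list instead
demo_stop_words = {"a", "in", "to", "i", "don't"}


def preprocess_demo(text, stop_words=demo_stop_words):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(word for word in text.split() if word not in stop_words)


cleaned = preprocess_demo("Free entry in a comp to WIN!!")
contraction = preprocess_demo("I don't know")
print(cleaned)
print(contraction)
```

One side effect worth noting: because punctuation is removed before the stopword check, a contraction such as "don't" becomes "dont" and no longer matches the stopword entry "don't", so it stays in the output.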


# 5. TF-IDF Vectorization

In [23]:
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(df['cleaned_message'])  # Features
y = df['label']  # Labels: 'ham' or 'spam'
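
As an illustration of what `fit_transform` returns, the sketch below vectorizes a toy corpus and inspects the learned vocabulary and the matrix shape (one row per message, one column per distinct token). The three messages and the `toy_` names are hypothetical. With the default settings, each row is scaled to unit L2 length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical, already-cleaned messages
toy_messages = ["free prize win", "call later", "win free entry"]

toy_tfidf = TfidfVectorizer()
toy_X = toy_tfidf.fit_transform(toy_messages)  # sparse matrix: messages x vocabulary

vocab = sorted(toy_tfidf.vocabulary_)
# L2 norm of each row (default norm='l2' scales every non-empty row to length 1)
row_norms = np.asarray(np.sqrt(toy_X.multiply(toy_X).sum(axis=1))).ravel()

print(vocab)
print(toy_X.shape)
print(row_norms)
```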


# Task 2: Train a Naive Bayes Classifier

In [25]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report


In [27]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [29]:
model = MultinomialNB()
model.fit(X_train, y_train)
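
Note that the TF-IDF vectorizer in this notebook was fit on all messages before the split, so the IDF weights and vocabulary also reflect the test messages. One possible way to avoid this, sketched below under the assumption of a toy corpus, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline so that TF-IDF is fit on the training part only. The corpus and the names `texts`, `labels`, and `spam_pipeline` are hypothetical. This is an alternative setup, not the one that produced the results reported in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for df['cleaned_message'] and df['label']
texts = [
    "win free prize now",
    "free entry win cash",
    "claim free cash prize",
    "urgent win cash now",
    "see you at lunch",
    "call me later tonight",
    "are you home yet",
    "lunch later today maybe",
]
labels = ["spam"] * 4 + ["ham"] * 4

# Split the raw text first; stratify keeps the ham/spam ratio in both parts
text_train, text_test, lab_train, lab_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# fit() learns the vocabulary and IDF weights from the training messages only
spam_pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
spam_pipeline.fit(text_train, lab_train)

toy_pred = spam_pipeline.predict(text_test)
print(list(zip(text_test, toy_pred)))
```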


In [31]:
y_pred = model.predict(X_test)


In [33]:
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, pos_label='spam'))
print("Recall:", recall_score(y_test, y_pred, pos_label='spam'))
print("F1 Score:", f1_score(y_test, y_pred, pos_label='spam'))
print("\nClassification Report:\n", classification_report(y_test, y_pred))


Accuracy: 0.9704035874439462
Precision: 1.0
Recall: 0.7785234899328859
F1 Score: 0.8754716981132076

Classification Report:
               precision    recall  f1-score   support

         ham       0.97      1.00      0.98       966
        spam       1.00      0.78      0.88       149

    accuracy                           0.97      1115
   macro avg       0.98      0.89      0.93      1115
weighted avg       0.97      0.97      0.97      1115
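
The spam metrics above can be reproduced by hand from the confusion-matrix counts they imply. With 149 spam and 966 ham test messages, a spam precision of 1.0 means no ham was flagged as spam, and a recall of 0.7785 corresponds to 116 of 149 spam messages caught. The sketch below recomputes the four scores from those counts; the variable names are chosen for this illustration.

```python
# Counts implied by the report above (spam is the positive class)
tp = 116  # spam predicted as spam
fn = 33   # spam predicted as ham (149 - 116)
fp = 0    # ham predicted as spam (precision is 1.0)
tn = 966  # ham predicted as ham

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} accuracy={accuracy:.4f}")
```

The 33 missed spam messages barely move accuracy because ham dominates the test set, which is why recall and F1 for the spam class are worth reading alongside it.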



# Task 3: Train and Compare BernoulliNB

In [36]:
from sklearn.naive_bayes import BernoulliNB

bernoulli_model = BernoulliNB()
bernoulli_model.fit(X_train, y_train)


In [42]:
y_pred_bernoulli = bernoulli_model.predict(X_test)


In [44]:

# classification_report is already imported in the Task 2 cell

print("BernoulliNB Classification Report:\n")
print(classification_report(y_test, y_pred_bernoulli, target_names=["ham", "spam"]))


BernoulliNB Classification Report:

              precision    recall  f1-score   support

         ham       0.98      0.99      0.99       966
        spam       0.96      0.87      0.91       149

    accuracy                           0.98      1115
   macro avg       0.97      0.93      0.95      1115
weighted avg       0.98      0.98      0.98      1115
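
When reading these numbers, keep in mind that `BernoulliNB` binarizes its input by default (`binarize=0.0`), so every non-zero TF-IDF weight is treated as "word present" and the weight itself is discarded. The sketch below illustrates this on a hypothetical toy corpus: a model trained on TF-IDF features makes the same predictions as one trained on binary presence/absence counts.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy corpus
toy_texts = [
    "win free prize now",
    "free entry win cash",
    "claim free cash prize",
    "see you at lunch",
    "call me later tonight",
    "are you home yet",
]
toy_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

X_tfidf = TfidfVectorizer().fit_transform(toy_texts)
X_binary = CountVectorizer(binary=True).fit_transform(toy_texts)

# Default binarize=0.0 maps every value > 0 to 1 before fitting and predicting
pred_tfidf = BernoulliNB().fit(X_tfidf, toy_labels).predict(X_tfidf)
pred_binary = BernoulliNB().fit(X_binary, toy_labels).predict(X_binary)

same_predictions = bool(np.array_equal(pred_tfidf, pred_binary))
print(same_predictions)
```

In other words, the BernoulliNB results here reflect word presence rather than TF-IDF weighting.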



In [50]:
print(" MultinomialNB vs BernoulliNB Performance:\n")

print("MultinomialNB:")
print(classification_report(y_test, y_pred, target_names=["ham", "spam"]))

print("BernoulliNB:")
print(classification_report(y_test, y_pred_bernoulli, target_names=["ham", "spam"]))


 MultinomialNB vs BernoulliNB Performance:

MultinomialNB:
              precision    recall  f1-score   support

         ham       0.97      1.00      0.98       966
        spam       1.00      0.78      0.88       149

    accuracy                           0.97      1115
   macro avg       0.98      0.89      0.93      1115
weighted avg       0.97      0.97      0.97      1115

BernoulliNB:
              precision    recall  f1-score   support

         ham       0.98      0.99      0.99       966
        spam       0.96      0.87      0.91       149

    accuracy                           0.98      1115
   macro avg       0.97      0.93      0.95      1115
weighted avg       0.98      0.98      0.98      1115
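
For a quicker side-by-side reading, the reported scores can be collected into one table. The sketch below copies the rounded values from the two reports above into a DataFrame and finds the better model per metric; the names `comparison` and `best_by_metric` are chosen for this illustration.

```python
import pandas as pd

# Rounded values copied from the two classification reports above
comparison = pd.DataFrame(
    {
        "accuracy": [0.97, 0.98],
        "spam_precision": [1.00, 0.96],
        "spam_recall": [0.78, 0.87],
        "spam_f1": [0.88, 0.91],
    },
    index=["MultinomialNB", "BernoulliNB"],
)

# Model with the highest score in each column
best_by_metric = comparison.idxmax()

print(comparison)
print(best_by_metric)
```

On this split, MultinomialNB leads only on spam precision, while BernoulliNB leads on accuracy, spam recall, and spam F1.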

