# Spam Classification using BOW and TF-IDF


### Goal

The main goal is to develop a robust email spam classification model on the Enron spam folders. The text is cleaned with NLP preprocessing techniques, and Multinomial Naive Bayes is used as the classifier with two feature extraction methods: CountVectorizer (bag of words) and TF-IDF Vectorizer. Both vectorization methods are evaluated, and the best-performing model will be deployed to AWS for real-world use.


### Data Set Extraction


In [None]:
import os
import re
import pickle

import nltk
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from google.colab import drive

drive.mount('/content/drive')
# Download the NLTK resources used for tokenization and stopword removal
nltk.download('punkt')
nltk.download('stopwords')
nltk.download("punkt_tab")

In [4]:
def load_dataset(folder_path):
    texts = []
    labels = []
    for label in ['spam', 'ham']:
        folder = os.path.join(folder_path, label)

        if not os.path.exists(folder):
            continue
        for filename in os.listdir(folder):
            if filename.endswith('.txt'):
                file_path = os.path.join(folder, filename)
                try:
                    with open(file_path, 'r', encoding='utf-8') as file:
                        content = file.read()
                except UnicodeDecodeError:
                    with open(file_path, 'r', encoding='ISO-8859-1') as file:
                        content = file.read()
                texts.append(content)
                labels.append(label)

    return texts, labels
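
The encoding fallback inside `load_dataset` can be exercised on its own. The sketch below is illustrative only: the helper name `read_text_with_fallback` and the temporary file are assumptions and not part of the notebook. It writes bytes that are valid ISO-8859-1 but not valid UTF-8 and checks that the second attempt recovers the text.

```python
import os
import tempfile


def read_text_with_fallback(path):
    """Read a text file as UTF-8, falling back to ISO-8859-1 on decode errors."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return f.read()
    except UnicodeDecodeError:
        with open(path, "r", encoding="ISO-8859-1") as f:
            return f.read()


# The byte 0xE9 is "é" in ISO-8859-1 but is not valid UTF-8 in this position.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.txt")
    with open(sample_path, "wb") as f:
        f.write(b"caf\xe9 offer")
    recovered = read_text_with_fallback(sample_path)

print(recovered)
```

ISO-8859-1 maps every byte to a character, so the fallback never raises, although it may decode files written in other encodings incorrectly.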

In [5]:
corpus = pd.DataFrame()
for i in range(1, 5):
    folder_path = f'/content/drive/MyDrive/Spam_Classification/enron{i}'
    texts, labels = load_dataset(folder_path)
    df = pd.DataFrame({"Mail": texts, "label": labels})
    corpus = pd.concat([corpus, df], ignore_index=True)
    # NOTE: this second concat repeats the append above, so each folder's
    # emails are added to `corpus` twice. Drop it to load each email once.
    corpus = pd.concat([corpus, df], ignore_index=True)

In [6]:
print(f"Total texts loaded: {len(corpus)}")

Total texts loaded: 44488


In [7]:
corpus['Mail'] = corpus['Mail'].str.replace('Subject:', '', regex=False).str.strip().str.lower()

Text cleaning is done in three steps:

1. Remove the literal `Subject:` marker and strip leading and trailing whitespace.
2. Lowercase the text so that capitalized and lowercase variants of a word count as the same token.
3. Remove special characters, keeping only letters, digits, and whitespace.

In [9]:
corpus['Mail'] = corpus['Mail'].str.replace(r'[^a-zA-Z0-9\s]', '', regex=True)
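
The cleaning steps in the cells above can be checked on a single string without pandas. This is a minimal sketch; the helper name `clean_mail` and the sample message are made up for illustration.

```python
import re


def clean_mail(text):
    """Apply the notebook's cleaning steps to one string."""
    # Drop the literal "Subject:" marker, trim whitespace, and lowercase.
    text = text.replace("Subject:", "").strip().lower()
    # Keep only letters, digits, and whitespace.
    return re.sub(r"[^a-zA-Z0-9\s]", "", text)


print(clean_mail("Subject: Hello, World!! 100% FREE"))
# hello world 100 free
```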

In [10]:
corpus.head()

| | Mail | label |
|---|---|---|
| 0 | high quality medication low rates start liv... | spam |
| 1 | who are you \nyour needed soffttwares at rock ... | spam |
| 2 | vicodin for sale no prior prescription needed... | spam |
| 3 | d link airplus g 802 11 g wireless router 8... | spam |
| 4 | reply new sexy anime\nwould you believe it \n... | spam |


### Data Preprocessing Techniques



Candidate techniques for reducing words to a base form:

1. PorterStemmer
2. SnowballStemmer
3. Lemmatization

Because PorterStemmer is faster than lemmatization and the dataset is large, this project uses PorterStemmer.
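
As a quick illustration of what stemming does, the sketch below runs NLTK's `PorterStemmer` on three words taken from the first email in the corpus. It assumes `nltk` is installed; the stemmer itself needs no downloaded resources. Note that stems such as "qualiti" are not dictionary words, which is the usual trade-off against lemmatization.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Words from the first spam email shown earlier in the notebook.
words = ["quality", "medication", "rates"]
stems = [stemmer.stem(w) for w in words]

print(stems)
```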


In [None]:

stemmer = PorterStemmer()
# Build the stopword set once; calling stopwords.words() for every word is very slow
stop_words = set(stopwords.words("english"))

def stem_and_filter(text):
    """Tokenize a mail, drop English stopwords, and stem the remaining words."""
    processed_sentences = []
    for sentence in sent_tokenize(str(text)):
        words = [stemmer.stem(w) for w in word_tokenize(sentence) if w not in stop_words]
        processed_sentences.append(" ".join(words))
    return " ".join(processed_sentences)

corpus["Mail"] = corpus["Mail"].apply(stem_and_filter)



In [13]:
print(corpus.head())

                                                Mail label
0  high qualiti medic low rate start live normal ...  spam
1  need soffttwar rock bottom prri ce bought prev...  spam
2  vicodin sale prior prescript need buy vicodin ...  spam
3  link airplu g 802 11 g wireless router 87 00 8...  spam
4  repli new sexi anim would believ fserh pron tn...  spam


In [None]:
# Converting the Label column using a LabelEncoder
le = LabelEncoder()
corpus['label'] = le.fit_transform(corpus['label'])

In [15]:
print(corpus.head(3))

                                                Mail  label
0  high qualiti medic low rate start live normal ...      1
1  need soffttwar rock bottom prri ce bought prev...      1
2  vicodin sale prior prescript need buy vicodin ...      1


### Splitting the Dataset into Train set and Test set

In [16]:
# Split the dataset into a training set (80%) and a test set (20%)
X_train, X_test, y_train, y_test = train_test_split(corpus['Mail'], corpus['label'], test_size=0.2, random_state=42)
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.value_counts()}")
print(f"y_test shape: {y_test.value_counts()}")

X_train shape: (35590,)
X_test shape: (8898,)
y_train shape: label
1    17976
0    17614
Name: count, dtype: int64
y_test shape: label
0    4472
1    4426
Name: count, dtype: int64


### Converting Text into Vectors Using CountVectorizer and TF-IDF Vectorizer Techniques


Two vectorization techniques are compared:

1. CountVectorizer
2. TF-IDF Vectorizer

Both feed a Multinomial Naive Bayes classifier, which is well suited to text classification, and spam detection in particular, because it:

1. Works directly on word-count style features.
2. Handles high-dimensional feature spaces.
3. Is fast and scalable.
4. Provides a good baseline model.
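
The difference between the two vectorizers can be sketched by hand on a toy corpus. The code below is an illustration under stated assumptions, not scikit-learn's implementation: it tokenizes on whitespace only, and it uses the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which scikit-learn documents as the `TfidfVectorizer` defaults. The two documents are made up.

```python
import math
from collections import Counter

docs = ["free prize now", "meeting now"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

# Bag of words: one raw count per vocabulary term.
counts = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
n_docs = len(docs)
df = {term: sum(term in doc for doc in tokenized) for term in vocab}
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}


def l2_normalize(row):
    """Scale a vector to unit Euclidean length (zero vectors are left as-is)."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm else row


tfidf = [l2_normalize([c * idf[t] for c, t in zip(row, vocab)]) for row in counts]

print(vocab)                               # ['free', 'meeting', 'now', 'prize']
print(counts)                              # [[1, 0, 1, 1], [0, 1, 1, 0]]
print([round(v, 3) for v in tfidf[1]])     # [0.0, 0.815, 0.58, 0.0]
```

In the second document, "meeting" and "now" have the same raw count, but TF-IDF weights "meeting" higher because "now" appears in every document.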

In [17]:
# Create the model pipeline with CountVectorizer
model = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
])
# Create the model pipeline with TfidfVectorizer
model_tfidf = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', MultinomialNB())
])
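
What the `MultinomialNB` step does with word counts can be sketched in plain Python. This is a simplified illustration rather than scikit-learn's code: the function names `train_nb` and `predict_nb` and the four training messages are hypothetical, and the smoothing value `alpha=1.0` (Laplace smoothing) matches the documented scikit-learn default.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class word log likelihoods."""
    vocab = {w for doc in docs for w in doc.split()}
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_like


def predict_nb(text, log_prior, log_like):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c in log_prior:
        # Words never seen during training are skipped.
        scores[c] = log_prior[c] + sum(
            log_like[c][w] for w in text.split() if w in log_like[c]
        )
    return max(scores, key=scores.get)


train_docs = ["free prize now", "win free cash", "meeting at noon", "lunch meeting today"]
train_labels = ["spam", "spam", "ham", "ham"]
log_prior, log_like = train_nb(train_docs, train_labels)

print(predict_nb("free cash prize", log_prior, log_like))  # spam
print(predict_nb("meeting today", log_prior, log_like))    # ham
```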


### Fitting the model

In [19]:
# Fit both pipelines on the training set only
model.fit(X_train, y_train)
model_tfidf.fit(X_train, y_train)

In [23]:
model.score(X_test,y_test)

0.9898853674983142

In [25]:
model_tfidf.score(X_test,y_test)

0.9904472915261857

### Testing and Evaluation of model

A classification report summarizes a classification model's performance for each class and overall. It lists precision, recall, F1-score, and support, which together show how well the model separates the categories.

In [26]:
# Classification report for the CountVectorizer model
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      0.98      0.99      4472
           1       0.98      1.00      0.99      4426

    accuracy                           0.99      8898
   macro avg       0.99      0.99      0.99      8898
weighted avg       0.99      0.99      0.99      8898



The metrics in the report are defined as follows, where TP, FP, and FN are the counts of true positives, false positives, and false negatives for a class:

* Precision = TP / (TP + FP)
* Recall = TP / (TP + FN)
* F1-score = 2 * (Precision * Recall) / (Precision + Recall)
* Support: the number of actual instances of each class.
* Accuracy: the proportion of correctly classified instances out of all instances.
* Macro average: the unweighted mean of a metric (precision, recall, or F1-score) across classes.
* Weighted average: the mean of a metric across classes, weighted by each class's support.
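
The formulas above can be checked on a small example. The sketch below uses made-up labels, and the helper name `binary_metrics` is illustrative; it computes precision, recall, and F1-score for the positive (spam) class directly from the counts.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1-score for one class from raw labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels: 1 = spam, 0 = ham (3 true positives, 1 false positive, 1 false negative).
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_toy = [1, 1, 0, 0, 0, 1, 0, 1]

precision, recall, f1 = binary_metrics(y_true_toy, y_pred_toy)
print(precision, recall, f1)  # 0.75 0.75 0.75
```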


In [27]:
# Classification report with tfidfvector
y_pred_tfidf = model_tfidf.predict(X_test)
print(classification_report(y_test, y_pred_tfidf))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99      4472
           1       0.99      0.99      0.99      4426

    accuracy                           0.99      8898
   macro avg       0.99      0.99      0.99      8898
weighted avg       0.99      0.99      0.99      8898



In [None]:
# Compare the accuracy of the two models
accuracy = accuracy_score(y_test, y_pred)
accuracy_tfidf = accuracy_score(y_test, y_pred_tfidf)
print(f"Accuracy with CountVectorizer: {accuracy}")
print(f"Accuracy with TfidfVectorizer: {accuracy_tfidf}")


Accuracy with CountVectorizer: 0.9898853674983142
Accuracy with TfidfVectorizer: 0.9904472915261857
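
To put the two accuracy figures in perspective, they can be converted back into misclassification counts using the test-set size of 8898 printed earlier. This is a small arithmetic sketch; the variable names are illustrative.

```python
n_test = 8898  # size of the test set reported by the split cell
acc_count = 0.9898853674983142   # CountVectorizer accuracy from the output above
acc_tfidf = 0.9904472915261857   # TfidfVectorizer accuracy from the output above

# Accuracy = correct / total, so errors = total * (1 - accuracy).
errors_count = round(n_test * (1 - acc_count))
errors_tfidf = round(n_test * (1 - acc_tfidf))

print(errors_count, errors_tfidf, errors_count - errors_tfidf)  # 90 85 5
```

The gap between the two models therefore amounts to five test messages.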


In [None]:
# Save both models and the label encoder as pickle files
with open('/content/drive/MyDrive/Spam_Classification/spam_classification_model.pkl', 'wb') as f:
    pickle.dump(model, f)
with open("/content/drive/MyDrive/Spam_Classification/spam_classification_model_tfidf.pkl", "wb") as f:
    pickle.dump(model_tfidf, f)
with open("/content/drive/MyDrive/Spam_Classification/label_encoder.pkl", "wb") as f:
    pickle.dump(le, f)


### Predicting with Sample Input

In [32]:


import re

# Load the saved CountVectorizer model and the label encoder
with open('/content/drive/MyDrive/Spam_Classification/spam_classification_model.pkl', 'rb') as f:
    model = pickle.load(f)
with open("/content/drive/MyDrive/Spam_Classification/label_encoder.pkl", "rb") as f:
    le = pickle.load(f)

# Build the stopword set once instead of once per word
stop_words = set(stopwords.words("english"))

def preprocess_text(text):
    """Lowercase, strip special characters, drop stopwords, and stem the words."""
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text.lower())
    processed_sentences = []
    for sentence in sent_tokenize(text):
        words = [stemmer.stem(w) for w in word_tokenize(sentence) if w not in stop_words]
        processed_sentences.append(" ".join(words))
    return " ".join(processed_sentences)

The message is classified as: spam




In [46]:
def predict_spam(text):
    processed_text = preprocess_text(text)
    prediction = model.predict([processed_text])
    return le.inverse_transform(prediction)[0]

sample_text = "You have won a free vacation!"
result = predict_spam(sample_text)
print(f"The message is classified as:{sample_text=} {result}")
sample_text = "Hi, how are you doing today?"
result = predict_spam(sample_text)
print(f"The message is classified as: {sample_text=} {result}")

The message is classified as:sample_text='You have won a free vacation!' spam
The message is classified as: sample_text='Hi, how are you doing today?' ham


In [38]:
# Load the saved TF-IDF model; preprocess_text and le are reused from the earlier cell
with open('/content/drive/MyDrive/Spam_Classification/spam_classification_model_tfidf.pkl', 'rb') as file:
    model_tfidf = pickle.load(file)

def predict_tfidf_spam(text):
    processed_text = preprocess_text(text)
    prediction = model_tfidf.predict([processed_text])
    return le.inverse_transform(prediction)[0]

In [53]:
# Sample Input for testing the model
sample_text = "you bank account credited with xxxx56 with 1 million"
result = predict_tfidf_spam(sample_text)
print(f"The message is classified as:{sample_text=} {result}")
sample_text = "Hi, how are you doing today?"
result = predict_tfidf_spam(sample_text)
print(f"The message is classified as:{sample_text=} {result}")

The message is classified as:sample_text='you bank account credited with xxxx56 with 1 million' spam
The message is classified as:sample_text='Hi, how are you doing today?' spam


### Conclusion


This project shows that NLP preprocessing combined with Multinomial Naive Bayes classification achieves high accuracy in spam detection. On the test set, the TF-IDF model reached 99.04% accuracy, marginally ahead of the CountVectorizer model's 98.99%. On the sample inputs, the TF-IDF model labeled the greeting "Hi, how are you doing today?" as spam, while the CountVectorizer model labeled it ham.