**Spam Detection with Naive Bayes Classifier: A Text Classification Journey**

__Objective__:
Detecting spam messages is important for keeping an inbox clean and reducing exposure to security threats. This project uses a Multinomial Naive Bayes classifier to distinguish spam from non-spam (ham) messages. The classifier is trained on a dataset of labeled messages, and text preprocessing is applied before training to improve prediction accuracy.

__Key Steps__:

__Data Preprocessing__: The dataset, consisting of messages labeled as spam or ham, undergoes preprocessing. This involves converting text to lowercase, removing punctuation, and filtering out stopwords, so that the model sees fewer uninformative tokens.
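The three preprocessing operations can be sketched on a single message. This is a minimal illustration, assuming a tiny hand-written stopword set (`DEMO_STOPWORDS`, hypothetical) in place of NLTK's full English list, so that it runs without any download.

```python
import re
import string

# Hypothetical, deliberately tiny stopword list for illustration only;
# the project itself uses NLTK's full English list.
DEMO_STOPWORDS = {"a", "the", "to", "and", "is", "you", "for", "from"}

def preprocess(text, stop_words=DEMO_STOPWORDS):
    # 1. Lowercase so "WIN" and "win" count as the same token
    text = text.lower()
    # 2. Delete every ASCII punctuation character
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)
    # 3. Split on whitespace and drop stopwords
    return " ".join(w for w in text.split() if w not in stop_words)

print(preprocess("WIN a FREE prize! Text 'CLAIM' to 87575, now."))
# win free prize text claim 87575 now
```

Because punctuation is deleted rather than replaced with a space, tokens joined by punctuation are merged: `preprocess("150p/day")` returns `150pday`.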

__Dataset Splitting__: The preprocessed dataset is divided into training (80%) and testing (20%) sets using scikit-learn's train_test_split function, with a fixed random_state so that the split is reproducible. This ensures that the model is trained on a portion of the data and evaluated on unseen samples to assess its generalization performance.
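Spam is usually the minority class, so a purely random split can leave the test set with a different spam share than the training set. The sketch below shows one way to keep the class proportions equal in both sets, written in plain Python. It is an optional variation, not what this project does: the project calls train_test_split without stratification, and that function offers a `stratify` argument for the same purpose. The function name `stratified_split` and the toy data are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(texts, labels, test_size=0.2, seed=42):
    """Split so that each label keeps roughly the same share in both sets."""
    # Group sample indices by label
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    def pick(idxs, seq):
        return [seq[i] for i in idxs]

    return (pick(train_idx, texts), pick(test_idx, texts),
            pick(train_idx, labels), pick(test_idx, labels))

# Toy data invented for illustration: 8 ham messages and 2 spam messages
labels = ["ham"] * 8 + ["spam"] * 2
texts = [f"message {i}" for i in range(10)]

X_train, X_test, y_train, y_test = stratified_split(texts, labels, test_size=0.5)
print(sorted(y_test))
# ['ham', 'ham', 'ham', 'ham', 'spam']
```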

__Feature Extraction__: Text data is converted into a numerical format using the CountVectorizer from scikit-learn. This step transforms text into a bag-of-words representation, capturing the frequency of words in each document.
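To make the bag-of-words idea concrete, here is a plain-Python sketch of what the count matrix looks like for two toy documents. It illustrates the representation only and is not scikit-learn's implementation; for example, CountVectorizer's default tokenizer also lowercases text and ignores single-character tokens. The helper `to_counts` and the sample documents are hypothetical.

```python
from collections import Counter

docs = ["win cash now", "call me now now"]

# Vocabulary: every distinct token, sorted, mapped to a column index
tokens = sorted({tok for doc in docs for tok in doc.split()})
vocab = {tok: i for i, tok in enumerate(tokens)}
print(vocab)
# {'call': 0, 'cash': 1, 'me': 2, 'now': 3, 'win': 4}

def to_counts(doc, vocab):
    # One row of the matrix: how often each vocabulary word occurs in doc
    counts = Counter(doc.split())
    return [counts.get(tok, 0) for tok in vocab]

for doc in docs:
    print(to_counts(doc, vocab))
# [0, 1, 0, 1, 1]
# [1, 0, 1, 2, 0]
```

Words that were not seen when the vocabulary was built have no column, so they are silently dropped at prediction time: `to_counts("win bitcoin now", vocab)` gives `[0, 0, 0, 1, 1]`.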

__Model Training__: A Multinomial Naive Bayes classifier is initialized and trained using the vectorized training data. Naive Bayes classifiers are well-suited for text classification tasks due to their simplicity and efficiency.
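The decision rule behind a Multinomial Naive Bayes classifier is short enough to write out by hand. The sketch below is a simplified plain-Python version, assuming whitespace tokenization and add-one (Laplace) smoothing with `alpha=1.0`; it is meant to show the arithmetic, not to reproduce scikit-learn's MultinomialNB internals. The function names and the four toy messages are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    class_docs = Counter(labels)          # number of documents per class
    word_counts = defaultdict(Counter)    # per-class word frequencies
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())

    model = {}
    for c in class_docs:
        total = sum(word_counts[c].values())
        log_prior = math.log(class_docs[c] / len(docs))
        # Smoothed log P(word | class); alpha keeps unseen words above zero
        log_lik = {
            w: math.log((word_counts[c][w] + alpha) / (total + alpha * len(vocab)))
            for w in vocab
        }
        model[c] = (log_prior, log_lik)
    return model

def predict_nb(model, doc):
    scores = {}
    for c, (log_prior, log_lik) in model.items():
        # Sum log-probabilities; words outside the vocabulary are skipped
        scores[c] = log_prior + sum(log_lik[w] for w in doc.split() if w in log_lik)
    return max(scores, key=scores.get)

# Toy training data invented for illustration
docs = ["win cash now", "free cash prize", "see you at lunch", "call me at home"]
labels = ["spam", "spam", "ham", "ham"]

model = train_nb(docs, labels)
print(predict_nb(model, "win free cash"))   # spam
print(predict_nb(model, "lunch at home"))   # ham
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.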

__Model Evaluation__: The trained classifier is evaluated on the testing data to measure its performance. Accuracy and a classification report (per-class precision, recall, and F1-score) are computed to assess how well the classifier separates spam from ham messages.
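Accuracy alone can look good on imbalanced data, which is why the per-class figures matter. As a rough illustration of how precision, recall, and F1 are derived for the spam class, the sketch below computes them by hand. The function `binary_metrics` and the ten labels are hypothetical toy values, not results from this project.

```python
def binary_metrics(y_true, y_pred, positive="spam"):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # spam caught
    fp = sum(t != positive and p == positive for t, p in pairs)  # ham flagged as spam
    fn = sum(t == positive and p != positive for t, p in pairs)  # spam missed

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Toy labels invented for illustration: 4 spam and 6 ham messages
y_true = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham", "ham", "spam"]

print(binary_metrics(y_true, y_pred))
# {'precision': 0.75, 'recall': 0.75, 'f1': 0.75, 'accuracy': 0.8}
```

Here one spam message is missed and one ham message is wrongly flagged, so precision and recall for spam are both 3/4 even though overall accuracy is 8/10.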

In [21]:
# Import necessary libraries
import pandas as pd
import re
import string
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
import nltk
from joblib import dump

# Download NLTK stopwords
nltk.download('stopwords')

# Load the dataset
data = pd.read_csv('spam.csv', encoding='latin1')

# Build the stopword set and punctuation pattern once, not on every call
STOP_WORDS = set(stopwords.words('english'))
PUNCT_RE = re.compile('[%s]' % re.escape(string.punctuation))

# Define function for text preprocessing
def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove punctuation
    text = PUNCT_RE.sub('', text)
    # Remove stopwords
    words = text.split()
    filtered_words = [word for word in words if word not in STOP_WORDS]
    return ' '.join(filtered_words)

# Apply preprocessing to the message text column ('v2'); labels are in 'v1'
data['v2'] = data['v2'].apply(preprocess_text)

# Split the preprocessed dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data['v2'], data['v1'], test_size=0.2, random_state=42)

# Initialize the CountVectorizer to convert text into bag-of-words representation
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Save the vocabulary
vocab = vectorizer.vocabulary_
dump(vocab, 'vocab.pkl')

# Initialize and train the Multinomial Naive Bayes classifier
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train_vec, y_train)

# Make predictions on the test set
y_pred = nb_classifier.predict(X_test_vec)

# Evaluate the performance of the model
accuracy = accuracy_score(y_test, y_pred)

# Print predictions and evaluation metrics
print("Predictions:", y_pred)
print("Accuracy:", accuracy)
print(classification_report(y_test, y_pred))


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\aCER\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Predictions: ['ham' 'ham' 'spam' ... 'ham' 'ham' 'spam']
Accuracy: 0.9802690582959641
              precision    recall  f1-score   support

         ham       0.98      1.00      0.99       965
        spam       0.97      0.88      0.92       150

    accuracy                           0.98      1115
   macro avg       0.98      0.94      0.96      1115
weighted avg       0.98      0.98      0.98      1115



In [15]:
test = "SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6day"

In [18]:
import joblib

# Load the vocabulary saved during training
vocab = joblib.load('vocab.pkl')

# Initialize a new CountVectorizer with the loaded vocabulary
vectorizer = CountVectorizer(vocabulary=vocab)

# Save the trained classifier, then reload it to check the round trip
joblib.dump(nb_classifier, 'nb_classifier_model.pkl')
loaded_nb_classifier = joblib.load('nb_classifier_model.pkl')

# Preprocess and vectorize the test message the same way as the training data
preprocessed_text = preprocess_text(test)
text_vec = vectorizer.transform([preprocessed_text])

# Make predictions using the loaded model
y_pred_loaded = loaded_nb_classifier.predict(text_vec)

print("y pred", y_pred_loaded)

y pred ['spam']
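Saving the vocabulary and the classifier as two separate files works, but the two must always be loaded together and kept in sync. One common alternative is to wrap the vectorizer and the classifier in a single scikit-learn Pipeline and persist that one object. The sketch below is an assumption-laden illustration rather than part of the original notebook: the toy training messages and the file name `spam_pipeline_demo.pkl` are invented, and it writes to a temporary directory.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data invented for illustration
train_texts = ["win cash now", "free cash prize", "see you at lunch", "call me at home"]
train_labels = ["spam", "spam", "ham", "ham"]

# One object that vectorizes and classifies in sequence
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
pipeline.fit(train_texts, train_labels)

# Persist and restore the whole pipeline as a single file
path = os.path.join(tempfile.mkdtemp(), "spam_pipeline_demo.pkl")  # hypothetical file name
joblib.dump(pipeline, path)
restored = joblib.load(path)

# The restored pipeline accepts text directly
print(restored.predict(["win free cash"]))
```

With this layout, any custom cleaning such as preprocess_text would still need to be applied to new messages before calling `predict`, exactly as it was applied to the training text.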
