### Naive Bayes Spam Classifier 

##### Naive Bayes is a family of probabilistic classifiers based on Bayes' theorem. Despite their simplicity, these classifiers work well in practice and are widely used for text classification problems such as spam detection and sentiment analysis.

##### The "naive" part of Naive Bayes comes from the assumption that the features (input variables) are conditionally independent of each other given the class label. This means that, within a class, the presence of one feature tells us nothing about the presence of another. The assumption rarely holds in reality, but it reduces the joint likelihood to a product of per-feature probabilities, which simplifies the computation.
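
Written out, the independence assumption turns Bayes' theorem into a product of per-feature terms. The notation below is an assumption made for illustration: $c$ is a class label and $x_1, \dots, x_n$ are the features of one document.

$$
P(c \mid x_1, \dots, x_n)
= \frac{P(c)\, P(x_1, \dots, x_n \mid c)}{P(x_1, \dots, x_n)}
\propto P(c) \prod_{i=1}^{n} P(x_i \mid c)
$$

The denominator is the same for every class, so the classifier only needs to compare the right-hand side across classes and pick the largest.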

##### In text classification, the features are typically word counts or word occurrences in a document. The model estimates the probability that a given document belongs to each class (e.g., "spam" or "ham", where "ham" means legitimate mail) from the words it contains and assigns the most probable class.
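
To make the word-based calculation concrete, here is a minimal sketch of a multinomial Naive Bayes classifier in plain Python. The toy corpus, the helper names `log_score` and `classify`, and the use of add-one smoothing are assumptions for illustration; this is not the scikit-learn implementation used later in the notebook.

```python
import math
from collections import Counter

# Hypothetical toy corpus; the messages and labels are made up for illustration
training = [
    ("spam", "free prize click now"),
    ("spam", "free offer click here"),
    ("ham", "meeting notes for tomorrow"),
    ("ham", "golf game tomorrow"),
]

# Count documents per class and word occurrences per class
class_docs = Counter(label for label, _ in training)
word_counts = {label: Counter() for label in class_docs}
for label, text in training:
    word_counts[label].update(text.split())

# Every distinct word seen in training, across both classes
vocabulary = {word for counts in word_counts.values() for word in counts}


def log_score(text, label):
    """Return log P(label) + sum of log P(word | label), with add-one smoothing."""
    total_words = sum(word_counts[label].values())
    score = math.log(class_docs[label] / len(training))
    for word in text.split():
        if word not in vocabulary:
            continue  # Skip words never seen in training
        likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
        score += math.log(likelihood)
    return score


def classify(text):
    """Pick the class with the highest log score."""
    return max(class_docs, key=lambda label: log_score(text, label))


print(classify("free prize tomorrow"))    # spam
print(classify("golf meeting tomorrow"))  # ham
```

Working in log space avoids multiplying many small probabilities together, and the smoothing keeps a word that never appeared in one class from forcing that class's probability to zero.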

In [6]:
import os
import io
import re
import pandas as pd
from pandas import DataFrame
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Ensure you have downloaded the necessary NLTK data
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\R-m-a\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [7]:
from sklearn.feature_extraction.text import CountVectorizer


In [3]:


# Build the stemmer and stop-word set once instead of on every call
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))

# Text preprocessing function
def preprocess(text):
    text = text.lower()
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    # Drop stop words, then stem the remaining words
    return ' '.join(ps.stem(word) for word in text.split() if word not in stop_words)

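# Function to read files: yields (file path, message body) for every file under `path`
def readFiles(path):
    for root, dirnames, filenames in os.walk(path):
        for filename in filenames:
            file_path = os.path.join(root, filename)  # Avoid shadowing the `path` argument
            inBody = False
            lines = []
            # The context manager closes the file even if reading fails
            with io.open(file_path, 'r', encoding='latin1') as f:
                for line in f:
                    if inBody:
                        lines.append(line)
                    elif line == '\n':
                        # The first blank line separates the headers from the body
                        inBody = True
            message = '\n'.join(lines)
            yield file_path, message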

# Function to create DataFrame from directory
def dataFrameFromDirectory(path, classification):
    rows = []
    index = []
    for filename, message in readFiles(path):
        rows.append({'message': message, 'class': classification})
        index.append(filename)

    return DataFrame(rows, index=index)


#### Using TfidfVectorizer

In [4]:

# Create DataFrame from the spam and ham directories
data = pd.concat([
    dataFrameFromDirectory("D:/MLCourse/emails/spam", "spam"),
    dataFrameFromDirectory("D:/MLCourse/emails/ham", "ham"),
])

# Apply text preprocessing
data['message'] = data['message'].apply(preprocess)

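# Initialize the TF-IDF Vectorizer
vectorizer = TfidfVectorizer(stop_words='english')

# Fit and transform the message texts (the values in `counts` are TF-IDF weights, not raw counts)
# Note: fitting on every message before the split lets test-set vocabulary and IDF statistics leak into training
counts = vectorizer.fit_transform(data['message'].values)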

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(counts, data['class'].values, test_size=0.2, random_state=42)

# Train the Naive Bayes classifier
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

# Predict on the test set
y_pred = classifier.predict(X_test)

# Evaluate the model
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

         ham       0.89      1.00      0.94       482
        spam       1.00      0.47      0.64       118

    accuracy                           0.90       600
   macro avg       0.94      0.74      0.79       600
weighted avg       0.91      0.90      0.88       600
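
The spam row is the one to watch here: precision is 1.00 but recall is only 0.47, so this model misses roughly half of the spam. As a sketch of how the other figures follow from those two, the snippet below recomputes the spam F1 score and the overall accuracy. The helper name `f1_score_from` is hypothetical, and the inputs are the rounded values from the report, so the results can only be expected to agree to two decimals.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded spam-row values copied from the report above
spam_precision, spam_recall = 1.00, 0.47
spam_f1 = f1_score_from(spam_precision, spam_recall)
print(round(spam_f1, 2))  # 0.64

# Accuracy equals the support-weighted average of the per-class recalls
ham_recall, ham_support, spam_support = 1.00, 482, 118
accuracy = (ham_recall * ham_support + spam_recall * spam_support) / (ham_support + spam_support)
print(round(accuracy, 2))  # 0.9
```

Because ham makes up 482 of the 600 test messages, accuracy stays near 0.90 even though most of the errors fall on the spam class.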



In [5]:

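# Example predictions
examples = ['Free class now!!!', "Hi Bob, how about a game of golf tomorrow?"]
# Apply the same preprocessing used on the training messages before vectorizing
example_counts = vectorizer.transform([preprocess(example) for example in examples])
predictions = classifier.predict(example_counts)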

# Display example predictions
for review, prediction in zip(examples, predictions):
    print(f"Review: {review}")
    print(f"Predicted Class: {prediction}")

Review: Free class now!!!
Predicted Class: ham
Review: Hi Bob, how about a game of golf tomorrow?
Predicted Class: ham


In [13]:
from sklearn.metrics import classification_report, accuracy_score


#### Using CountVectorizer


In [15]:

# Load spam and ham emails into a single DataFrame
# Forward slashes avoid the invalid escape sequences (\M, \e, \s, \h) in the Windows paths
data = pd.concat([
    dataFrameFromDirectory("D:/MLCourse/emails/spam", "spam"),
    dataFrameFromDirectory("D:/MLCourse/emails/ham", "ham"),
])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data['message'], data['class'], test_size=0.2, random_state=42)

# Ensure X_train and X_test are lists of strings
X_train = X_train.tolist()
X_test = X_test.tolist()

# Initialize the CountVectorizer and transform the messages into counts
vectorizer = CountVectorizer(stop_words='english')
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)

# Initialize the Multinomial Naive Bayes classifier and train it
classifier = MultinomialNB()
classifier.fit(X_train_counts, y_train)

# Predict the classes for the test set
y_pred = classifier.predict(X_test_counts)

# Evaluate the model
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(classification_report(y_test, y_pred))

# Example new reviews
new_reviews = [
    "Free class now!!!",
    "Hi Bob, how about a game of golf tomorrow?"
]

# Transform the new reviews into the same CountVectorizer features
new_reviews_transformed = vectorizer.transform(new_reviews)

# Predict the class for the new reviews
predicted_classes = classifier.predict(new_reviews_transformed)

# Display predictions
for review, predicted_class in zip(new_reviews, predicted_classes):
    print(f"Review: {review}")
    print(f"Predicted Class: {predicted_class}")


Accuracy: 0.9616666666666667
              precision    recall  f1-score   support

         ham       0.96      1.00      0.98       482
        spam       0.98      0.82      0.89       118

    accuracy                           0.96       600
   macro avg       0.97      0.91      0.94       600
weighted avg       0.96      0.96      0.96       600

Review: Free class now!!!
Predicted Class: spam
Review: Hi Bob, how about a game of golf tomorrow?
Predicted Class: ham
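
The CountVectorizer cell splits the data before fitting the vectorizer, so the vocabulary is learned from the training messages only. One way to make that ordering hard to get wrong is to bundle the vectorizer and the classifier in a scikit-learn pipeline. The sketch below assumes a tiny made-up corpus in place of the email directories, so its output says nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training messages standing in for the email corpus
train_messages = [
    "win a free prize now",
    "free cash offer click now",
    "claim your free prize",
    "lunch meeting at noon tomorrow",
    "project report attached for review",
    "golf game tomorrow with bob",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# The pipeline fits the vectorizer and the classifier on the training data only
model = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
model.fit(train_messages, train_labels)

# Raw strings go straight in; the pipeline applies the fitted vectorizer first
new_messages = ["free prize offer", "meeting about the golf game"]
predictions = model.predict(new_messages)

for message, prediction in zip(new_messages, predictions):
    print(f"Message: {message}")
    print(f"Predicted Class: {prediction}")
```

Swapping `CountVectorizer` for `TfidfVectorizer` in the same pipeline would give a like-for-like comparison of the two feature representations under an identical train/test split.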
