<b>Naive Bayes classifiers</b> are a family of probabilistic classifiers that assume the features are conditionally independent given the target class.

In other words, a naive Bayes model assumes the information about the class provided by each variable is unrelated to the information from the others, with no information shared between the predictors. The highly unrealistic nature of this assumption, called the naive independence assumption, is what gives the classifier its name.
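
Under this assumption, the class posterior factorizes into a prior times one term per feature: P(class | words) is proportional to P(class) multiplied by the product of P(word | class) over the words. The sketch below illustrates that factorization in plain Python. The four-message corpus, the helper names (`log_posterior`, `predict`), and the add-one smoothing are illustrative assumptions, not part of the dataset or model used later on.

```python
import math
from collections import Counter

# Hypothetical toy training data (made up for illustration).
train = [
    ("spam", "win free prize now"),
    ("spam", "free entry claim prize"),
    ("ham", "are you coming home now"),
    ("ham", "ok see you at home"),
]

# Count documents per class and word occurrences per class.
class_counts = Counter(label for label, _ in train)
word_counts = {label: Counter() for label in class_counts}
for label, text in train:
    word_counts[label].update(text.split())

vocab = {word for counts in word_counts.values() for word in counts}


def log_posterior(text, label, alpha=1.0):
    """Unnormalized log P(label | text) under the naive independence assumption."""
    total = sum(word_counts[label].values())
    # Start from the log prior, then add one independent term per word.
    score = math.log(class_counts[label] / len(train))
    for word in text.split():
        if word not in vocab:
            continue  # words never seen in training are ignored
        smoothed = (word_counts[label][word] + alpha) / (total + alpha * len(vocab))
        score += math.log(smoothed)
    return score


def predict(text):
    """Return the class with the highest unnormalized log posterior."""
    return max(class_counts, key=lambda label: log_posterior(text, label))


print(predict("free prize"))       # spam
print(predict("see you at home"))  # ham
```

Summing log probabilities instead of multiplying raw probabilities avoids numerical underflow on long messages; each word contributes its term without reference to any other word, which is exactly the independence assumption.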

These classifiers are some of the simplest Bayesian network models. Naive Bayes classifiers generally perform worse than more advanced models such as logistic regression, especially at quantifying uncertainty (naive Bayes models often produce wildly overconfident probabilities). However, they are highly scalable: the number of parameters grows only linearly with the number of features or predictors in a learning problem.

One of the most popular use cases of naive Bayes is spam detection, which is implemented below.

In [1]:
import pandas as pd

df = pd.read_csv('datasets/spam.csv', encoding='latin-1')[['v1', 'v2']]
df.columns = ['label', 'text']

print(df.head())

  label                                               text
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


In [2]:
# Encode the string labels as integers: ham -> 0, spam -> 1
df['label'] = df['label'].map({'ham': 0, 'spam': 1})

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

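# Hold out 20% of the messages for evaluation
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['label'], test_size=0.2, random_state=42)

# Fit the TF-IDF vocabulary on the training text only, then reuse it for the test text
vectorizer = TfidfVectorizer(stop_words='english')
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Train the multinomial naive Bayes model and predict labels for the held-out messages
model = MultinomialNB()
model.fit(X_train_vec, y_train)
y_pred = model.predict(X_test_vec)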

print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 0.9668
[[965   0]
 [ 37 113]]
              precision    recall  f1-score   support

           0       0.96      1.00      0.98       965
           1       1.00      0.75      0.86       150

    accuracy                           0.97      1115
   macro avg       0.98      0.88      0.92      1115
weighted avg       0.97      0.97      0.96      1115
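
The figures in the report can be recomputed by hand from the confusion matrix, which makes the trade-off visible: no ham message was flagged as spam, but 37 of the 150 spam messages were missed. A short sketch using only the four counts printed above (the variable names are illustrative):

```python
# Confusion matrix printed above: rows are true labels, columns are predictions.
tn, fp = 965, 0    # true ham:  predicted ham, predicted spam
fn, tp = 37, 113   # true spam: predicted ham, predicted spam

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted spam that really is spam
recall = tp / (tp + fn)      # share of actual spam that was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy:.4f}")   # 0.9668
print(f"precision: {precision:.2f}")  # 1.00
print(f"recall:    {recall:.2f}")     # 0.75
print(f"f1:        {f1:.2f}")         # 0.86
```

The high accuracy is driven largely by the ham class, which makes up 965 of the 1115 test messages; the spam recall of 0.75 is the more informative number for a spam filter.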



In [5]:
import numpy as np

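# Vocabulary terms and the per-class log probabilities log P(word | class)
feature_names = vectorizer.get_feature_names_out()
log_probs = model.feature_log_prob_

# Indices of the 10 most probable words in each class (row 1 = spam, row 0 = ham)
spam_top = np.argsort(log_probs[1])[::-1][:10]
ham_top = np.argsort(log_probs[0])[::-1][:10]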

print("Top spam words:")
print([feature_names[i] for i in spam_top])

print("\nTop ham words:")
print([feature_names[i] for i in ham_top])

Top spam words:
['free', 'txt', 'mobile', 'claim', 'stop', 'text', 'prize', 'ur', 'reply', 'www']

Top ham words:
['ok', 'll', 'come', 'lt', 'gt', 'just', 'good', 'home', 'got', 'time']
