# Naive Bayes

Naive Bayes is a family of probabilistic machine learning algorithms for classification tasks, such as predicting the class label of a document or an image. It is based on Bayes' theorem, which relates the probability of a hypothesis given some evidence to the probability of that evidence given the hypothesis.

The basic idea behind Naive Bayes is to calculate the probability of each class label given the input features, and then choose the label with the highest probability as the predicted label. To do this, the algorithm makes the assumption that the input features are conditionally independent given the class label, which is why it is called "naive". This assumption simplifies the calculation of the probabilities, making Naive Bayes a fast and scalable algorithm that can handle large datasets.

Here's a simplified example to illustrate how Naive Bayes works:

Suppose we have a dataset of emails that are labeled as either spam or not spam (ham), and we want to classify a new email as either spam or ham. We can represent each email as a bag of words, where each word in the email is a feature. We can then calculate the probability of the email being spam or ham, given the words in the email.

Using Bayes' theorem, we can write:

$P(spam|words) = \frac{P(words|spam) \cdot P(spam)}{P(words)}$

$P(ham|words) = \frac{P(words|ham) \cdot P(ham)}{P(words)}$

where $P(spam|words)$ is the probability of the email being spam given the words in the email, $P(words|spam)$ is the probability of seeing the words in a spam email, $P(spam)$ is the prior probability of an email being spam, and $P(words)$ is the probability of seeing the words in any email (spam or ham). Similarly, $P(ham|words)$ is the probability of the email being ham given the words in the email.

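To calculate these probabilities, we use the training data to estimate the prior probabilities of spam and ham, as well as the probability of seeing each word in a spam email and in a ham email. Under the naive independence assumption, the likelihood of the whole email factors into a product of per-word probabilities, $P(words|spam) = P(w_1|spam) \cdot P(w_2|spam) \cdots P(w_n|spam)$, and likewise for ham. We then plug these values into the equations above and choose the label with the higher probability as the predicted label. Because $P(words)$ is the same in both equations, it does not affect which label wins, so only the numerators need to be compared.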
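
To make the estimation step concrete, here is a minimal from-scratch sketch of the spam/ham example. The four training emails, the function names `fit` and `predict`, and the choice of add-one (Laplace) smoothing are assumptions made for illustration; they are not part of the scikit-learn code later in this notebook. The sketch sums log-probabilities instead of multiplying raw probabilities, which is a common way to avoid numerical underflow.

```python
import math
from collections import Counter

# Hypothetical toy training set, invented for illustration.
train = [
    ("win money now", "spam"),
    ("free money offer", "spam"),
    ("meeting schedule today", "ham"),
    ("lunch meeting tomorrow", "ham"),
]

def fit(examples):
    """Count word occurrences per class and emails per class."""
    word_counts = {}        # label -> Counter of word occurrences
    doc_counts = Counter()  # label -> number of training emails
    vocab = set()
    for text, label in examples:
        words = text.split()
        word_counts.setdefault(label, Counter()).update(words)
        doc_counts[label] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab

def predict(text, word_counts, doc_counts, vocab):
    """Score each label by log P(label) + sum of log P(word | label)."""
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        # Log prior: fraction of training emails with this label.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocab:
                continue  # skip words never seen in training
            # Add-one smoothing keeps an unseen word/class pair
            # from forcing the whole product to zero.
            p = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(p)
        scores[label] = score
    # P(words) is omitted: it is identical for every label.
    return max(scores, key=scores.get), scores

model = fit(train)
label, scores = predict("free money today", *model)
print(label)  # prints: spam
```

The denominator $P(words)$ never appears in the code, since dropping it does not change which label has the highest score.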

This is the basic idea behind Naive Bayes. The algorithm can be extended to handle multiple classes, as well as continuous and categorical features, but the core principle of calculating conditional probabilities based on Bayes' theorem remains the same.
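
As a rough illustration of the continuous case, the sketch below assumes that each feature follows a Gaussian (normal) distribution within each class, which is one common choice and not the only one. The two-feature toy data, the class names `a` and `b`, and the helper names `fit_gaussian`, `log_gaussian`, and `predict_gaussian` are all hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical toy data: two continuous features per sample, invented for illustration.
samples = [
    ((180.0, 80.0), "a"),
    ((175.0, 75.0), "a"),
    ((185.0, 90.0), "a"),
    ((160.0, 55.0), "b"),
    ((165.0, 60.0), "b"),
    ((155.0, 50.0), "b"),
]

def fit_gaussian(data):
    """Estimate a prior plus a per-feature mean and variance for each class."""
    grouped = defaultdict(list)
    for features, label in data:
        grouped[label].append(features)
    params = {}
    for label, rows in grouped.items():
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            stats.append((mean, var + 1e-9))  # small constant avoids a zero variance
        params[label] = (len(rows) / len(data), stats)
    return params

def log_gaussian(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def predict_gaussian(features, params):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    scores = {}
    for label, (prior, stats) in params.items():
        score = math.log(prior)
        for x, (mean, var) in zip(features, stats):
            score += log_gaussian(x, mean, var)
        scores[label] = score
    return max(scores, key=scores.get)

params = fit_gaussian(samples)
print(predict_gaussian((178.0, 78.0), params))  # prints: a
```

Only the likelihood term changes compared with the word-count case: the prior, the independence assumption, and the final comparison across classes stay the same.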

In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Load the dataset
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

# Vectorize the text data
vectorizer = CountVectorizer(stop_words='english')
X_train = vectorizer.fit_transform(newsgroups_train.data)
X_test = vectorizer.transform(newsgroups_test.data)

# Train the Naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_train, newsgroups_train.target)

# Test the classifier
y_pred = clf.predict(X_test)
accuracy = accuracy_score(newsgroups_test.target, y_pred)

print('Accuracy:', accuracy)


Accuracy: 0.6343600637280935


In the following updated code, we replace the CountVectorizer with a TfidfVectorizer, which weights each word by TF-IDF (its term frequency scaled by its inverse document frequency) instead of using simple word counts. We also use a Pipeline to combine the vectorization and classification steps, and we lower the smoothing parameter alpha of the Naive Bayes classifier from its default of 1.0 to 0.1.

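In the run shown below, these modifications raise the test accuracy from about 0.63 to about 0.70. However, keep in mind that the dataset itself is quite challenging: there are 20 classes to distinguish, and we strip the headers, footers, and quoted replies, which likely removes some easy clues. The accuracy may therefore still be lower than on other datasets.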
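
To give a feel for what TF-IDF weighting does, here is a simplified sketch in plain Python. The three-document corpus and the function name `tfidf` are hypothetical. The formula used, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, is our understanding of scikit-learn's default behavior; treat it as an assumption rather than a reference implementation.

```python
import math
from collections import Counter

# Hypothetical three-document corpus, invented for illustration.
docs = [
    "cheap pills online",
    "cheap flights online",
    "meeting notes online",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents each word appears in.
doc_freq = Counter(word for words in tokenized for word in set(words))

def tfidf(words):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    counts = Counter(words)
    # Smoothed inverse document frequency: rarer words get a larger weight.
    weights = {
        word: count * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
        for word, count in counts.items()
    }
    # Normalize so that long and short documents are comparable.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {word: w / norm for word, w in weights.items()}

weights = tfidf(tokenized[0])
for word, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{word}: {weight:.3f}")
```

All three words occur once in the first document, so raw counts would treat them equally. With TF-IDF, "pills" (found in one document) outweighs "cheap" (two documents), which in turn outweighs "online" (all three).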

In [2]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# Load the dataset
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

# Define the pipeline
clf = Pipeline([
    ('vectorizer', TfidfVectorizer(stop_words='english')),
    ('classifier', MultinomialNB(alpha=0.1))
])

# Train the classifier
clf.fit(newsgroups_train.data, newsgroups_train.target)

# Test the classifier
y_pred = clf.predict(newsgroups_test.data)
accuracy = accuracy_score(newsgroups_test.target, y_pred)

print('Accuracy:', accuracy)


Accuracy: 0.6988847583643123
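
The `alpha` value passed to MultinomialNB above controls additive smoothing of the per-class word probabilities. The sketch below shows the general shape of that estimate for a few values of alpha. The function name `smoothed_probability` and all the counts are hypothetical, chosen only to show the trend; this is not a reproduction of scikit-learn's internals.

```python
def smoothed_probability(count, total, vocab_size, alpha):
    """Additive-smoothing estimate of P(word | class)."""
    return (count + alpha) / (total + alpha * vocab_size)

# Hypothetical class with 1,000 word occurrences and a 5,000-word vocabulary.
total, vocab_size = 1000, 5000

for alpha in (1.0, 0.1, 0.01):
    unseen = smoothed_probability(0, total, vocab_size, alpha)
    frequent = smoothed_probability(50, total, vocab_size, alpha)
    print(f"alpha={alpha}: unseen={unseen:.6f}, seen 50 times={frequent:.6f}")
```

With a large vocabulary, alpha = 1.0 hands much of the probability mass to words the class has never produced. A smaller alpha keeps more of the mass on words that were actually observed, which may be one reason the lower value helps here.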
